Post

The Big Data Paradox in COVID Surveys

Meng (2018) summarizes the Big Data Paradox as "The more the data, the surer we fool ourselves." It may be counterintuitive at first; even a trained statistician is likely to get caught in the idea that "more data is better" at some point. But this idea can be quickly squashed with simple examples. For instance, if we are interested in measuring the average height of the population, surveying 1,000 men is going to give us an estimate that is too high, and surveying 1,000,000 men would not fix it.
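The height example can be checked with a small simulation. This is only a sketch under made-up assumptions: the population is half men and half women, and the means and standard deviations (in cm) are illustrative numbers, not survey figures. It compares a large men-only sample against a much smaller simple random sample.

```python
import random
import statistics


def simulate_height_bias(seed=0, pop_size=100_000):
    """Compare a large men-only sample with a small random sample."""
    rng = random.Random(seed)
    # Hypothetical population: the height parameters below are illustrative only.
    men = [rng.gauss(175.0, 7.0) for _ in range(pop_size // 2)]
    women = [rng.gauss(162.0, 6.5) for _ in range(pop_size // 2)]
    population = men + women
    true_mean = statistics.fmean(population)

    # Large sample drawn only from men (selection bias).
    big_biased = rng.sample(men, 10_000)
    # Small simple random sample from the whole population.
    small_random = rng.sample(population, 100)

    return {
        "true_mean": true_mean,
        "biased_error": statistics.fmean(big_biased) - true_mean,
        "random_error": statistics.fmean(small_random) - true_mean,
    }


result = simulate_height_bias()
print(result)
```

Under these assumptions the men-only sample is 100 times larger than the random sample, yet its error stays near the gap between the men's mean and the population mean, while the small random sample's error is only sampling noise.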

Post

Statistical Machine Translation: R Package

I wrote an R package for conducting Statistical Machine Translation (SMT) as part of my first-year comps. Find it here. It is based largely on Koehn’s 2009 SMT book and implements the so-called “IBM” models, as well as phrase-based translation. While these methods have been largely supplanted by neural network-based methods, they are still interesting models, and the IBM models can be used to derive word alignments between a sentence and its translation.
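To give a flavour of how the IBM models yield word alignments, here is a minimal sketch of IBM Model 1 trained by EM. It is written in Python for illustration and is not the code from the R package; the function names, the three-sentence toy corpus, and the omission of the NULL word are all simplifying assumptions.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(f | e) by EM from (foreign_tokens, english_tokens) pairs."""
    foreign_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(foreign_vocab)
    # Start from a uniform translation table.
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in pairs:
            for f in fs:
                # E-step: share one count for f among the English words.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise the expected counts per English word.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


def align(fs, es, t):
    """Return, for each foreign word, the index of its best English word."""
    return [max(range(len(es)), key=lambda j: t[(f, es[j])]) for f in fs]


# Toy corpus (illustrative only).
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
table = train_ibm_model1(corpus)
print(align(["das", "Haus"], ["the", "house"], table))
```

Even on three sentence pairs, the co-occurrence pattern is enough for EM to prefer linking "das" with "the" and "Haus" with "house", which is the sense in which these models can be used to derive alignments.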

Post

R Tutorial: Multi-State Models

I wrote this tutorial on estimating multi-state models in R as part of the class STAT 935 (Survival Analysis) at University of Waterloo. There are other tutorials out there, but this one (in my biased opinion, and to the best of my knowledge) is the only one that goes one by one through each type of multi-state model, the theory, how to structure the data, and how to estimate the models using the coxph function in R.

Post

How Influential Are Music Critics?

I gave this presentation on December 6, 2021 as part of the course SURV727 “Fundamentals of Computing and Data Display.” I used R for data collection, and then cleaned and analyzed the data in Stata. My data science experience grew quickly and greatly during this course, and I had a lot of fun combining various sources of data involving APIs from Spotify, Last.FM, Wikipedia, and Google, and using web scraping techniques to obtain review scores from Wikipedia.

Post

Forests Ontario Blog: Heritage Tree Travellers Discover Ontario's Majestic Living History

Due to COVID, my wife and I have started enjoying nature a lot more. It turns out Ontario is a beautiful place with lots to see, including some really cool old trees! Forests Ontario has been cataloguing such trees in Ontario, and so we’ve been using their map to find them. Forests Ontario thought it was cool, and so asked me to write an article about our tree exploits! It was a really fun and nerdy experience and you can find the article here.

Post

ksmirnovk: Stata Program for Performing a k-sample Kolmogorov-Smirnov Test

Here is some Stata code I wrote back in 2017. A colleague asked how to perform a Kolmogorov-Smirnov test in Stata when there are more than two groups. I was surprised to find that such a test is not implemented in Stata, nor widely implemented in general. I thought perhaps that this was because such a test did not exist. On the contrary, a k-sample analogue to the Kolmogorov-Smirnov test was developed back in 1959 by Jack Kiefer, a mathematical statistician at Cornell.
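For readers curious what a k-sample statistic of this kind looks like, here is a rough Python sketch; it is not the ksmirnovk Stata code. It assumes the statistic takes the form sup over x of the sum over groups of n_j (F_j(x) - F(x))^2, where F_j is the empirical CDF of group j and F is the pooled empirical CDF, which is my reading of Kiefer's proposal. The p-value below comes from a permutation test, swapped in here as a simple stand-in for Kiefer's limiting distribution. All function names are hypothetical.

```python
import bisect
import random


def ksample_ks_statistic(groups):
    """sup_x of sum_j n_j * (F_j(x) - F_pooled(x))**2 over the data points."""
    sorted_groups = [sorted(g) for g in groups]
    pooled = sorted(x for g in sorted_groups for x in g)
    n_total = len(pooled)
    best = 0.0
    # The ECDFs only change at observed values, so checking those suffices.
    for x in pooled:
        pooled_cdf = bisect.bisect_right(pooled, x) / n_total
        value = sum(
            len(g) * (bisect.bisect_right(g, x) / len(g) - pooled_cdf) ** 2
            for g in sorted_groups
        )
        best = max(best, value)
    return best


def permutation_pvalue(groups, n_perm=200, seed=0):
    """Permutation p-value (an assumption, not Kiefer's asymptotic result)."""
    rng = random.Random(seed)
    observed = ksample_ks_statistic(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if ksample_ks_statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Illustrative data: two groups from the same distribution, one shifted.
demo_rng = random.Random(1)
group_a = [demo_rng.gauss(0.0, 1.0) for _ in range(20)]
group_b = [demo_rng.gauss(0.0, 1.0) for _ in range(20)]
group_c = [demo_rng.gauss(3.0, 1.0) for _ in range(20)]
p_value = permutation_pvalue([group_a, group_b, group_c])
print(p_value)
```

The statistic is zero when every group has the same empirical CDF and grows as any group's CDF pulls away from the pooled one, which mirrors the two-sample Kolmogorov-Smirnov idea of comparing distributions at their point of greatest disagreement.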

Tags

Tag: Big-Data

The Big Data Paradox in COVID Surveys

Tag: Coursework

Statistical Machine Translation: R Package

R Tutorial: Multi-State Models

How Influential Are Music Critics?

Tag: Data-Science

How Influential Are Music Critics?

Tag: Forests-Ontario

Forests Ontario Blog: Heritage Tree Travellers Discover Ontario's Majestic Living History

Tag: Literature-Review

The Big Data Paradox in COVID Surveys

Tag: Mathematical-Statistics

ksmirnovk: Stata Program for Performing a k-sample Kolmogorov-Smirnov Test

Tag: Natural-Language-Processing

Statistical Machine Translation: R Package

Tag: R

Statistical Machine Translation: R Package

R Tutorial: Multi-State Models

How Influential Are Music Critics?

Tag: Research

How Influential Are Music Critics?

Tag: Stata

ksmirnovk: Stata Program for Performing a k-sample Kolmogorov-Smirnov Test

How Influential Are Music Critics?

Tag: Statistics

Statistical Machine Translation: R Package

R Tutorial: Multi-State Models

The Big Data Paradox in COVID Surveys

ksmirnovk: Stata Program for Performing a k-sample Kolmogorov-Smirnov Test

Tag: Survey-Methodology

The Big Data Paradox in COVID Surveys

Tag: Trees

Forests Ontario Blog: Heritage Tree Travellers Discover Ontario's Majestic Living History

Tag: Tutorial

R Tutorial: Multi-State Models