Tag: Big-Data
Post
The Big Data Paradox in COVID Surveys
Meng (2018) summarizes the Big Data Paradox as "The more the data, the surer we fool ourselves."
It may be counterintuitive at first; even a trained statistician is likely
to get caught in the idea that "more data is better" at some point.
But this idea can be quickly squashed with simple examples. For instance,
if we are interested in measuring the average height of the population,
surveying 1,000 men is going to give us an estimate that is too high.
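The height example can be checked with a small simulation. This is only a sketch under made-up assumptions: the population size, the mean heights (176 cm for men, 162 cm for women), and the 7 cm standard deviation are illustrative values chosen for this example, not figures from Meng (2018) or from any survey. It compares a large sample drawn only from men with a much smaller simple random sample drawn from everyone.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of heights in cm (illustrative parameters only).
men = [random.gauss(176, 7) for _ in range(50_000)]
women = [random.gauss(162, 7) for _ in range(50_000)]
population = men + women
true_mean = statistics.fmean(population)

# Large but biased sample: 1,000 people, all of them men.
biased_sample = random.sample(men, 1_000)

# Small but unbiased sample: 100 people drawn from the whole population.
random_sample = random.sample(population, 100)

biased_mean = statistics.fmean(biased_sample)
random_mean = statistics.fmean(random_sample)

biased_error = abs(biased_mean - true_mean)
random_error = abs(random_mean - true_mean)

print(f"true mean:            {true_mean:.1f} cm")
print(f"1,000 men only:       {biased_mean:.1f} cm (error {biased_error:.1f})")
print(f"100 random people:    {random_mean:.1f} cm (error {random_error:.1f})")
```

Under these assumptions, the men-only sample lands several centimetres above the true mean however large it gets, while the random sample of 100 is typically off by about a centimetre. Adding more men only makes the wrong answer look more precise.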
Tag: Coursework
Post
Statistical Machine Translation: R Package
I wrote an R package for conducting Statistical Machine Translation (SMT) as part of my first-year comps. Find it here. It is based largely on Koehn’s 2009 SMT book and implements the so-called “IBM” models, as well as phrase-based translation. While these methods have been largely supplanted by neural network-based methods, they are still interesting models, and the IBM models can be used to derive word alignments between a sentence and its translation.
Post
R Tutorial: Multi-State Models
I wrote this tutorial on estimating multi-state models in R as part of the class STAT 935 (Survival Analysis) at the University of Waterloo. There are other tutorials out there, but this one (in my biased opinion, and to the best of my knowledge) is the only one that goes one by one through each type of multi-state model, the theory, how to structure the data, and how to estimate the models using the coxph function in R.
Post
How Influential Are Music Critics?
I gave this presentation on December 6, 2021 as part of the course SURV727 “Fundamentals of Computing and Data Display.” I used R for data collection, and then cleaned and analyzed the data in Stata.
My data science experience grew quickly and greatly during this course, and I had a lot of fun combining various sources of data involving APIs from Spotify, Last.FM, Wikipedia, and Google, and using web scraping techniques to obtain review scores from Wikipedia.
Tag: Data-Science
Post
How Influential Are Music Critics?
I gave this presentation on December 6, 2021 as part of the course SURV727 “Fundamentals of Computing and Data Display.” I used R for data collection, and then cleaned and analyzed the data in Stata.
My data science experience grew quickly and greatly during this course, and I had a lot of fun combining various sources of data involving APIs from Spotify, Last.FM, Wikipedia, and Google, and using web scraping techniques to obtain review scores from Wikipedia.
Tag: Forests-Ontario
Post
Forests Ontario Blog: Heritage Tree Travellers Discover Ontario's Majestic Living History
Due to COVID, my wife and I have started enjoying nature a lot more. It turns out Ontario is a beautiful place with lots to see, including some really cool old trees! Forests Ontario has been cataloguing such trees in Ontario, and so we’ve been using their map to find them. Forests Ontario thought it was cool, and so asked me to write an article about our tree exploits! It was a really fun and nerdy experience and you can find the article here.
Tag: Literature-Review
Post
The Big Data Paradox in COVID Surveys
Meng (2018) summarizes the Big Data Paradox as "The more the data, the surer we fool ourselves."
It may be counterintuitive at first; even a trained statistician is likely
to get caught in the idea that "more data is better" at some point.
But this idea can be quickly squashed with simple examples. For instance,
if we are interested in measuring the average height of the population,
surveying 1,000 men is going to give us an estimate that is too high.
Tag: Mathematical-Statistics
Post
ksmirnovk: Stata Program for Performing a k-sample Kolmogorov-Smirnov Test
Here is some Stata code I wrote back in 2017. A colleague asked how to perform a Kolmogorov-Smirnov test in Stata when there are more than two groups. I was surprised to find that such a test is not implemented in Stata, nor widely implemented in general. I thought perhaps that this was because such a test did not exist.
On the contrary, a k-sample analogue to the Kolmogorov-Smirnov test was developed back in 1959 by Jack Kiefer, a mathematical statistician at Cornell.
Tag: Natural-Language-Processing
Post
Statistical Machine Translation: R Package
I wrote an R package for conducting Statistical Machine Translation (SMT) as part of my first-year comps. Find it here. It is based largely on Koehn’s 2009 SMT book and implements the so-called “IBM” models, as well as phrase-based translation. While these methods have been largely supplanted by neural network-based methods, they are still interesting models, and the IBM models can be used to derive word alignments between a sentence and its translation.
Tag: R
Post
Statistical Machine Translation: R Package
I wrote an R package for conducting Statistical Machine Translation (SMT) as part of my first-year comps. Find it here. It is based largely on Koehn’s 2009 SMT book and implements the so-called “IBM” models, as well as phrase-based translation. While these methods have been largely supplanted by neural network-based methods, they are still interesting models, and the IBM models can be used to derive word alignments between a sentence and its translation.
Post
R Tutorial: Multi-State Models
I wrote this tutorial on estimating multi-state models in R as part of the class STAT 935 (Survival Analysis) at the University of Waterloo. There are other tutorials out there, but this one (in my biased opinion, and to the best of my knowledge) is the only one that goes one by one through each type of multi-state model, the theory, how to structure the data, and how to estimate the models using the coxph function in R.
Post
How Influential Are Music Critics?
I gave this presentation on December 6, 2021 as part of the course SURV727 “Fundamentals of Computing and Data Display.” I used R for data collection, and then cleaned and analyzed the data in Stata.
My data science experience grew quickly and greatly during this course, and I had a lot of fun combining various sources of data involving APIs from Spotify, Last.FM, Wikipedia, and Google, and using web scraping techniques to obtain review scores from Wikipedia.
Tag: Research
Post
How Influential Are Music Critics?
I gave this presentation on December 6, 2021 as part of the course SURV727 “Fundamentals of Computing and Data Display.” I used R for data collection, and then cleaned and analyzed the data in Stata.
My data science experience grew quickly and greatly during this course, and I had a lot of fun combining various sources of data involving APIs from Spotify, Last.FM, Wikipedia, and Google, and using web scraping techniques to obtain review scores from Wikipedia.
Tag: Stata
Post
ksmirnovk: Stata Program for Performing a k-sample Kolmogorov-Smirnov Test
Here is some Stata code I wrote back in 2017. A colleague asked how to perform a Kolmogorov-Smirnov test in Stata when there are more than two groups. I was surprised to find that such a test is not implemented in Stata, nor widely implemented in general. I thought perhaps that this was because such a test did not exist.
On the contrary, a k-sample analogue to the Kolmogorov-Smirnov test was developed back in 1959 by Jack Kiefer, a mathematical statistician at Cornell.
Post
How Influential Are Music Critics?
I gave this presentation on December 6, 2021 as part of the course SURV727 “Fundamentals of Computing and Data Display.” I used R for data collection, and then cleaned and analyzed the data in Stata.
My data science experience grew quickly and greatly during this course, and I had a lot of fun combining various sources of data involving APIs from Spotify, Last.FM, Wikipedia, and Google, and using web scraping techniques to obtain review scores from Wikipedia.
Tag: Statistics
Post
Statistical Machine Translation: R Package
I wrote an R package for conducting Statistical Machine Translation (SMT) as part of my first-year comps. Find it here. It is based largely on Koehn’s 2009 SMT book and implements the so-called “IBM” models, as well as phrase-based translation. While these methods have been largely supplanted by neural network-based methods, they are still interesting models, and the IBM models can be used to derive word alignments between a sentence and its translation.
Post
R Tutorial: Multi-State Models
I wrote this tutorial on estimating multi-state models in R as part of the class STAT 935 (Survival Analysis) at the University of Waterloo. There are other tutorials out there, but this one (in my biased opinion, and to the best of my knowledge) is the only one that goes one by one through each type of multi-state model, the theory, how to structure the data, and how to estimate the models using the coxph function in R.
Post
The Big Data Paradox in COVID Surveys
Meng (2018) summarizes the Big Data Paradox as "The more the data, the surer we fool ourselves."
It may be counterintuitive at first; even a trained statistician is likely
to get caught in the idea that "more data is better" at some point.
But this idea can be quickly squashed with simple examples. For instance,
if we are interested in measuring the average height of the population,
surveying 1,000 men is going to give us an estimate that is too high.
Post
ksmirnovk: Stata Program for Performing a k-sample Kolmogorov-Smirnov Test
Here is some Stata code I wrote back in 2017. A colleague asked how to perform a Kolmogorov-Smirnov test in Stata when there are more than two groups. I was surprised to find that such a test is not implemented in Stata, nor widely implemented in general. I thought perhaps that this was because such a test did not exist.
On the contrary, a k-sample analogue to the Kolmogorov-Smirnov test was developed back in 1959 by Jack Kiefer, a mathematical statistician at Cornell.
Tag: Survey-Methodology
Post
The Big Data Paradox in COVID Surveys
Meng (2018) summarizes the Big Data Paradox as "The more the data, the surer we fool ourselves."
It may be counterintuitive at first; even a trained statistician is likely
to get caught in the idea that "more data is better" at some point.
But this idea can be quickly squashed with simple examples. For instance,
if we are interested in measuring the average height of the population,
surveying 1,000 men is going to give us an estimate that is too high.
Tag: Trees
Post
Forests Ontario Blog: Heritage Tree Travellers Discover Ontario's Majestic Living History
Due to COVID, my wife and I have started enjoying nature a lot more. It turns out Ontario is a beautiful place with lots to see, including some really cool old trees! Forests Ontario has been cataloguing such trees in Ontario, and so we’ve been using their map to find them. Forests Ontario thought it was cool, and so asked me to write an article about our tree exploits! It was a really fun and nerdy experience and you can find the article here.
Tag: Tutorial
Post
R Tutorial: Multi-State Models
I wrote this tutorial on estimating multi-state models in R as part of the class STAT 935 (Survival Analysis) at the University of Waterloo. There are other tutorials out there, but this one (in my biased opinion, and to the best of my knowledge) is the only one that goes one by one through each type of multi-state model, the theory, how to structure the data, and how to estimate the models using the coxph function in R.