
Monday, April 25, 2016

Central Limit Theorem: The Cristiano Ronaldo of Statistics

Whenever you watch a cooking contest, say the U.S. MasterChef, you’ll see that Chef Gordon Ramsay, one of the three judges on the show, usually tastes a contestant’s dish with only one small spoonful. Then he gives his verdict: either the food is good or the food is trash. Chef Ramsay never makes a secret of his disdain for certain food; he will tell you to your face that your food sucks if he thinks it does. Have you ever wondered how he can be so sure what the whole dish tastes like from just a single spoonful? Well, the answer is simply that, most of the time, a single spoonful tells you everything you need to know about the whole dish. In the world of statistics, that spoonful is a representative sample, and the whole dish is the population. You see, Chef Ramsay does not need to finish the whole dish to know whether it is delicious or not. Sure, if he wanted to be extremely accurate, he could eat the whole dish, but his opinion would hardly differ from the one he formed after a single spoonful. You may wonder, “What does a taste test by Chef Ramsay have to do with statistics?” Well, it is a great and simple analogy for inferential statistics, and especially for today’s concept, the Central Limit Theorem.


What fascinates me is that we can make strong statements and inferences about a whole population from just a small sample drawn from it. Such inference is possible thanks to an elegant concept called the Central Limit Theorem (CLT). Economist Charles Wheelan called it the LeBron James of statistics, and his writing is my inspiration for this article. For those who do not follow basketball but follow football, the CLT is like the Cristiano Ronaldo of statistics - powerful and elegant.

Before we unravel the gist of the Central Limit Theorem, it’s probably better to start with a simple example inspired by Charles. Let’s say that a famous school of engineering has a field trip to the beach. The engineering students are randomly assigned to 20 buses, and the trip takes 5 hours. After 5 hours, 19 buses have arrived at the destination, but 1 bus has gone missing. You and the rescuers search the forest and find a bus with several young foreigners who don’t speak your language. Statistics to the rescue!!! You find that the average math score of these people is 65 (assume that everyone is carrying a math report card, or that you ask everyone to solve a difficult integral. I know, I know, it’s ridiculous, but it keeps things simple). You, as the smartest statistician among the rescuers, sigh and tell everyone that this is not the bus of engineering students. There is no way in hell that engineering students, who learn all those complex derivatives and integrals, would score that low on math (on average). Later, with the latest Google Translate technology, we learn that this is a bus of students who major in Khimal (a made-up language, so you’ll find no results on Google). This explains why their average math score is not so high: they specialize in language, not complex calculation.

Well guys, that’s it. That’s the Ronaldo of Statistics. That’s the Central Limit Theorem. Simply put, the core idea of the CLT is that a large enough sample drawn at random from a population will resemble the population as a whole. A bus of engineering students will be similar to the whole body of engineering students. A spoonful of the dish is very similar to the whole dish. Each sample drawn from the population will differ slightly from the others, but there is only a very low probability that a sample is extremely different from the population. In the same way, the average math score of the engineering students on each of the 20 buses will differ slightly from the true average math score, but the probability that the students on any one bus have an average math score totally different from the true average of all engineering students is very, very low. Yes, there may be some engineering students who would score 65 on math, but it’s highly unlikely that a whole busload of them would average 65, as we know that engineering students are very competent in math or they wouldn’t have been admitted to engineering school in the first place. Therefore, we can reject the idea that the bus with an average math score of 65 is the engineering student group.
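If you want to see this intuition in action, here is a minimal simulation sketch in Python. Please note that everything numeric in it is my own assumption for illustration: a made-up population of 10,000 engineering students whose math scores are centered around 85 with a spread of 8, and 40 students per bus. The helper name simulate_bus_means is hypothetical too. The point is only to show how tightly the 20 bus averages cluster around the true average.

```python
import random
import statistics


def simulate_bus_means(population, n_buses=20, bus_size=40, seed=0):
    """Randomly assign students to buses and return each bus's average score.

    n_buses=20 comes from the story; bus_size=40 is an assumed value.
    """
    rng = random.Random(seed)
    # Draw the riders without replacement, then split them into buses.
    riders = rng.sample(population, n_buses * bus_size)
    return [
        statistics.mean(riders[i * bus_size:(i + 1) * bus_size])
        for i in range(n_buses)
    ]


# Hypothetical population: scores around 85 (spread 8), clipped to 0..100.
pop_rng = random.Random(42)
population = [min(100.0, max(0.0, pop_rng.gauss(85, 8))) for _ in range(10_000)]

true_mean = statistics.mean(population)
bus_means = simulate_bus_means(population)

print(f"True average of all students: {true_mean:.1f}")
print(f"Lowest bus average:           {min(bus_means):.1f}")
print(f"Highest bus average:          {max(bus_means):.1f}")
print(f"Any bus averaging 65 or less? {any(m <= 65 for m in bus_means)}")
```

Under these assumed numbers, individual students can certainly score in the 60s, yet every bus average should land within a few points of the true average. That is why a bus averaging 65 looks so suspicious.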

Yes, we made it. This is the intuition behind the Central Limit Theorem. What’s left is just some calculations and formulas relating the sample mean, the sample standard deviation, and the normal distribution, which we won’t touch today. I think the intuition will help you understand those formulas very easily, and I hope we can go over them in the next post. Until then, please appreciate the beauty of the Ronaldo of Statistics.