
The central limit theorem (CLT) establishes that, in many situations, when independent random variables are added, their properly normalised sum tends toward a normal distribution even if the original variables themselves are not normally distributed. The question is how this can be proven, as it looks very intimidating at times. Also, why is 30 considered the minimum sample size in some forms of statistical analysis? Is there any rationale for this?


Dear Ransingh, 


There is a very good animation that you can see at http://onlinestatbook.com/stat_sim/sampling_dist/


You can change the distribution of the independent random variables and watch how their averages (or sums) become normally distributed. As the sample size approaches 30, the curve becomes quite close to normal. Beyond 30 it becomes even more beautifully normal, but a normality test already passes comfortably at 30. The improvement in the fit to the perfect bell curve between sample sizes of 5 and 6 is much bigger than the improvement between 30 and 31. Because the progress towards normality grows very slowly with each additional observation after 30, the number 30 is considered a reasonably good minimum size.
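You can also reproduce the spirit of that animation in a few lines of code. The sketch below is illustrative only (it assumes an exponential population, which is strongly right-skewed, and a fixed random seed): it draws many samples of size 5 and of size 30 and shows that the skewness of the sample means shrinks towards zero, i.e. towards the symmetric bell curve, as the sample size grows.

```python
import random
import statistics

random.seed(42)

def sampling_dist_of_mean(n, reps=5000):
    """Draw `reps` samples of size n from an exponential(1) population
    and return the list of their sample means."""
    return [statistics.fmean(random.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

def skewness(xs):
    """Sample skewness: the third standardized moment (0 for a symmetric
    distribution such as the normal)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

means_5 = sampling_dist_of_mean(5)
means_30 = sampling_dist_of_mean(30)

# The exponential population itself has skewness 2, but the distribution
# of sample means gets closer to symmetric as n increases.
print(f"skewness of means, n=5:  {skewness(means_5):.2f}")
print(f"skewness of means, n=30: {skewness(means_30):.2f}")
```

Running it shows the n=30 sampling distribution is noticeably more symmetric than the n=5 one, even though it comes from the same skewed population.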


Of course, the number 30 is just a rule of thumb and one could always take more to be safer. 


Dear Ransingh,


A very good question and the link shared by VK will help you visualize how CLT works.


I want to highlight a common misconception about the Central Limit Theorem. It is probably one of the most misunderstood concepts in Lean Six Sigma.


Most people assume that if they have a large sample size (read: greater than 30), then the data set follows a normal distribution. This is far from the truth. Irrespective of the sample size, a sample will always follow the distribution of the original data set. So if the original data set is not normal, then the sample (be it of size 1, 2, 10, 30, 100, or however big) will also be not normal.


Then where does CLT apply? 

CLT applies to the distribution of the sample means or sample sums, i.e. if I pick multiple samples from the not-normal data set, calculate either the sum or the mean of each sample, and plot these on a histogram, then the histogram will follow a normal distribution.


For example, consider the roll of a single die. The possible values are 1, 2, 3, 4, 5, and 6, each with the same probability. A common misconception is that if I roll the die many times (say 6000 times), I will get a normal distribution. This is not true. The roll of a die follows a uniform distribution, and hence if you roll it 6000 times, each of 1 through 6 is likely to occur about 1000 times.


However, what happens if 2 dice are rolled and the sum of each roll is noted? The possible values are 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12, but here the probabilities are not all the same.

Prob. of getting 2 = 1/36 (only 1 combination will give 2)

Prob. of getting 3 = 2/36 (2 combinations will give us 3)

Prob. of getting 4 = 3/36 (3 combinations will give us 4) and so on......

7 has the maximum probability (6/36) of occurrence while 2 and 12 have the least (1/36).

Now, if I roll the 2 dice 6000 times and plot the sum of each roll on a histogram, the plot will start resembling a normal distribution because of the variation in the probabilities of each sum.

Here if you notice closely,

1. The original distribution is Not Normal

2. Taking 2 data points from the original data set will give me a sample (equivalent to rolling of 2 dice). Then for each sample, the sum is being calculated and plotted

3. CLT is being applied on the sum and not on the individual data points
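The three points above can be demonstrated with a short simulation (illustrative only; it uses Python's random module with a fixed seed). 6000 single-die rolls stay roughly uniform, about 1000 per face, while the 6000 two-dice sums pile up around 7 and thin out towards 2 and 12:

```python
import random
from collections import Counter

random.seed(0)
ROLLS = 6000

# One die: no amount of rolling makes a single die "go normal" --
# the counts stay roughly uniform at about 1000 per face.
one_die = Counter(random.randint(1, 6) for _ in range(ROLLS))

# Two dice: the *sums* follow the bell shape the CLT predicts,
# peaking at 7 and tapering off towards 2 and 12.
two_dice = Counter(random.randint(1, 6) + random.randint(1, 6)
                   for _ in range(ROLLS))

print("single die counts:", dict(sorted(one_die.items())))
print("two-dice sum counts:", dict(sorted(two_dice.items())))
```

Note that the "sample" here is of size 2 (two dice), yet the distribution of the sums is already visibly bell-shaped; the CLT acts on the sums, not on the individual rolls.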

The same is evident in the animation link shared by VK.


So let's be aware of the misuse of this theorem and apply it correctly.


P.S. There are multiple online sources where you can also find the mathematical proof of the Central Limit Theorem.


The Central Limit Theorem states that no matter what the distribution of the original population is, the means of samples of a fixed size will follow an approximately normal distribution, provided the sample size is reasonably large and a large number of sample means is taken. The distribution of these means becomes progressively closer to normal as the sample size and the number of samples grow.

