Question: How do you handle non-normal data?

May 5, 2009

If we are interested in comparing the means of two populations and the output data is continuous (say Turn Around Time), one could potentially use the 2-sample t test if the data sets for the two populations were normal. How would you analyze this data if one or both of the data were non-normal?

May 8, 2009

Hi SJ sir,

Before attempting to answer your question, I would like some clarification. My previous response was badly worded because I had not read the question properly; apologies for that, it has now been deleted/edited here. When you describe the data in a population as non-normal (in the second part of the question), do you mean that it is non-continuous, i.e., discrete data (also known as qualitative/attribute data)? Please clarify.

Thanks & rgds.

May 12, 2009

Hi again SJ sir,

I wasn't able to interpret your question properly earlier and confused the nature of the data (normal vs. non-normal) with the data type (continuous vs. discrete), since you described the data as continuous (data type) in the first part of your question and as non-normal (data nature) in the second part. I get your point now and shall get back once I have reached a reasonable conclusion.

Thanks & rgds.

May 15, 2009

Just to refresh everyone's concepts, the 2-sample t test (which is available even in the Excel Analysis ToolPak) can be used to compare the means of two sets of data. This test assumes that both sets of data are normally distributed.

If you have attended Green Belt from Benchmark Six Sigma, we did an exercise comparing turn around times (TAT) of teams having different experience (in months). The null hypothesis was that the average TATs are the same (the amount of experience, in the range of 3 months to 15 months, makes no significant impact) and the alternate hypothesis was that the averages are different (the amount of experience in the same range creates a significant difference in TAT).

If the data is not normally distributed (i.e., it does not follow the bell-shaped distribution), it is not appropriate to use the 2-sample t test.
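For reference, the statistic behind the 2-sample t test can be sketched in a few lines of Python. This is only an illustration: the thread does not say which variant of the test was used, so the sketch assumes the unequal-variance (Welch) form, and the TAT values and the function name `welch_t` are made up for the example. Turning the statistic into a p-value needs the t distribution, which a statistics package would supply.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 in the denominator), each divided by its sample size.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


# Hypothetical turn around times in hours (illustrative values only).
tat_low_experience = [41.0, 38.5, 44.2, 40.1, 39.7, 43.3, 42.0, 37.9]
tat_high_experience = [36.2, 35.0, 38.1, 34.4, 37.3, 36.8, 33.9, 35.6]

t_stat, df = welch_t(tat_low_experience, tat_high_experience)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

A large absolute t value relative to the t distribution with `df` degrees of freedom would count as evidence against the null hypothesis of equal means, provided the normality assumption is reasonable.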

SJ's question here is about the approach that we should follow if data is not normally distributed.

May 21, 2009

The data should be converted to normal using a Box-Cox transformation, after which we can proceed as usual; alternatively, the data can be collected once again.

Would surely love to hear on this more from SJ & VK.

Regards,

Anup

May 23, 2009

Hi All,

The usual course is to use the Box-Cox transformation. It estimates a power parameter, lambda, and transforms the data using that power. I think Minitab has it. Otherwise we can use non-parametric tests; the non-parametric counterpart of the 2-sample t test is the Mann-Whitney test. Or we can use stratification to isolate a region of the data that is approximately normal and work with that. In some cases people try fitting other distributions such as log-normal, Weibull, etc.

It is best if we can take larger samples (more than 100 points); then we don't need to worry about it. If we have Yes or No data, we can use Log distribution by taking the logarithms of the data. Subgroup averaging is also done, for which a minimum of 4 data points is required per subgroup. It all depends on the situation we have and the type of histogram we see. We first need to plot the histogram and then decide.

Further thoughts from one & all......

Thanks & rgds.

May 29, 2009

Author

When one or more populations are non-normal, the first thing to do is to check the cause of the non-normality. Sometimes normal data appears non-normal because of errors in typing in the data points, significant measurement system error, etc. Let's say we rule out these obvious issues and the data is still not normal; what can we do?

In this case, we should not be using the 2-sample t test as it relies on the normality of the data points. If there are minor departures from normality, it may still be okay to use a 2-sample t test as these tests are relatively robust.

One option is to use a non-parametric test such as the Mann-Whitney test to do this analysis. This approach will work, but it has the limitation of loss of information: instead of working with the raw data, we will be working with ranks. Secondly, these tests can be less sensitive than the t test, so they will more often fail to reject the null hypothesis. They will favour the alternate hypothesis (there is a difference between the populations) only when there are marked differences between the populations.
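To show what "working with ranks" means in practice, here is a minimal sketch of the Mann-Whitney U test in plain Python. It assumes a two-sided test with the large-sample normal approximation and applies neither a tie correction nor a continuity correction, so a statistics package will give somewhat different p-values for small or heavily tied samples. The function names and the TAT values are hypothetical.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p-value from the normal approximation)."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # no tie correction
    z = (u_a - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_a, p_value


# Hypothetical right-skewed turn around times in hours (illustrative only).
team_a = [12.1, 13.4, 11.8, 15.2, 14.0, 29.5, 12.9, 41.3, 13.7, 16.8]
team_b = [10.2, 11.1, 9.8, 12.5, 10.9, 18.7, 11.6, 10.4, 25.1, 11.9]

u_stat, p_value = mann_whitney_u(team_a, team_b)
print(f"U = {u_stat}, p = {p_value:.3f}")
```

Note that the two extreme values in `team_a` contribute only their rank positions, not their magnitudes, which is the loss of information described above.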

A second option is to transform the data. You could try the Box-Cox transformation, fitting a different distribution (log-normal, etc.), the Johnson transformation, etc. The problem with transformations is that we are no longer working with the original data but with transformed data. For example, if we take the square root of one data set to make it normal and the other data set is already normal, then we will be comparing the mean of one population with the mean of the square root of the other population, which is usually hard to interpret.
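As a rough illustration of the Box-Cox idea, the sketch below picks lambda by a simple grid search over the profile log-likelihood and then applies the same lambda to both samples so that they are compared on a common scale. The grid from -2 to 2 in steps of 0.1, the choice to estimate lambda from one sample only, the function names, and the data are all assumptions made for the example; packages such as Minitab use their own estimation procedures. The data must be strictly positive.

```python
import math
import statistics


def box_cox(values, lam):
    """Box-Cox transform: log(x) when lambda is 0, else (x**lambda - 1) / lambda."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def box_cox_log_likelihood(values, lam):
    """Profile log-likelihood of lambda, up to an additive constant."""
    n = len(values)
    transformed = box_cox(values, lam)
    variance = statistics.pvariance(transformed)  # maximum-likelihood variance
    return -n / 2 * math.log(variance) + (lam - 1) * sum(math.log(v) for v in values)


def best_lambda(values, grid=None):
    """Return the grid value of lambda with the highest profile log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]  # -2.0, -1.9, ..., 2.0
    return max(grid, key=lambda lam: box_cox_log_likelihood(values, lam))


# Hypothetical right-skewed turn around times in hours (illustrative only).
team_a = [12.1, 13.4, 11.8, 15.2, 14.0, 29.5, 12.9, 41.3, 13.7, 16.8]
team_b = [10.2, 11.1, 9.8, 12.5, 10.9, 18.7, 11.6, 10.4, 25.1, 11.9]

lam = best_lambda(team_a)
mean_a = statistics.mean(box_cox(team_a, lam))
mean_b = statistics.mean(box_cox(team_b, lam))
print(f"lambda = {lam}, transformed means: {mean_a:.3f} vs {mean_b:.3f}")
```

The printed means are on the transformed scale, not in hours, which is exactly the interpretation difficulty raised above.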

It is not possible to make a data set normal by collecting more data points. Of course, if you collect more points, the averages of the samples will be approximately normal (by the central limit theorem), but the individual data points themselves will not.
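A small simulation can make this point concrete. The sketch below assumes an exponential distribution as a stand-in for skewed TAT data, uses skewness as a rough measure of departure from normality, and takes averages of 30 points at a time; all of these choices, including the seed, are arbitrary assumptions for illustration.

```python
import random
import statistics


def skewness(values):
    """Population skewness: about 0 for symmetric data such as normal data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    return sum((v - mean) ** 3 for v in values) / (n * sd ** 3)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Individual data points from a skewed (exponential) distribution.
small = [rng.expovariate(1.0) for _ in range(200)]
large = [rng.expovariate(1.0) for _ in range(20000)]

# Averages of 30 points each, drawn from the same skewed distribution.
means_of_30 = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(30)])
    for _ in range(5000)
]

print(f"skewness of 200 individual points:   {skewness(small):.2f}")
print(f"skewness of 20000 individual points: {skewness(large):.2f}")
print(f"skewness of 5000 averages of 30:     {skewness(means_of_30):.2f}")
```

Collecting 100 times more individual points leaves the data just as skewed, while the averages are much closer to symmetric.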

SJ

May 29, 2009

Thanks SJ sir for enlightening us on the matter, which quite a few statisticians have avoided whenever I've tried putting a more or less similar question to them.

Rgds.
