The Grubbs test is a statistical method for detecting a single outlier in a normally distributed data set. It checks whether the maximum or the minimum value in the data is an outlier.
Definition - Hypotheses of the Grubbs test:
H0 - There are no outliers in the given data set
Ha - There is exactly one outlier in the given data set
Test statistic for the Grubbs test -
$G = \max_i |Y_i - \bar{Y}| / s$
Here $\bar{Y}$ is the sample mean and $s$ is the sample standard deviation, so the Grubbs test statistic is the largest absolute deviation from the sample mean, in units of the sample standard deviation. This is the two-sided version of the test. The Grubbs test can also be defined as one of the following one-sided tests:
1. Test whether the minimum value is an outlier, with $G = (\bar{Y} - Y_{\min}) / s$
2. Test whether the maximum value is an outlier, with $G = (Y_{\max} - \bar{Y}) / s$
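As a minimal sketch of the statistic described above, the following Python computes the two-sided and one-sided Grubbs statistics using only the standard library. The function name grubbs_statistic and its kind parameter are illustrative choices for this sketch, not part of any library, and the sample standard deviation (n - 1 denominator) is assumed.

```python
import statistics


def grubbs_statistic(values, kind="two-sided"):
    """Return the Grubbs statistic G for the given sample.

    kind is a hypothetical parameter for this sketch:
      "two-sided" -> largest absolute deviation from the mean, divided by s
      "min"       -> (mean - minimum) / s
      "max"       -> (maximum - mean) / s
    """
    if len(values) < 3:
        raise ValueError("need at least 3 values")
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    if s == 0:
        raise ValueError("standard deviation is zero")
    if kind == "two-sided":
        return max(abs(v - mean) for v in values) / s
    if kind == "min":
        return (mean - min(values)) / s
    if kind == "max":
        return (max(values) - mean) / s
    raise ValueError(f"unknown kind: {kind!r}")
```

The statistic alone does not decide anything; it still has to be compared with a critical value for the chosen significance level and sample size.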
Grubbs Test Example:
Data given - 199.31, 199.53, 200.19, 200.82, 201.92, 201.95, 202.18, 245.57
First, a normal probability plot of the data was generated.
This plot indicates that the normality assumption is reasonable except for the maximum value. We therefore apply the Grubbs test to determine whether the maximum value, 245.57, is an outlier.
Test results:
H0: there are no outliers in the data
Ha: the maximum value is an outlier
Test statistic: G = 2.4687
Significance level: α = 0.05
Critical value for an upper one-tailed test: 2.032
Critical region: Reject H0 if G > 2.032
Since G = 2.4687 is greater than the critical value 2.032, we reject H0 and conclude that the maximum value is in fact an outlier at the 0.05 significance level.
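The numbers above can be checked with a short Python sketch. It recomputes G for the maximum value from the data given and compares it with the critical value 2.032, which is copied from the test results above rather than computed here. The variable names are illustrative.

```python
import statistics

data = [199.31, 199.53, 200.19, 200.82, 201.92, 201.95, 202.18, 245.57]

mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)

# One-sided Grubbs statistic for the maximum value
g_max = (max(data) - mean) / s

# Critical value quoted in the test results (alpha = 0.05, upper one-tailed);
# it is taken from the text, not derived in this sketch.
critical_value = 2.032

reject_h0 = g_max > critical_value
print(f"G = {g_max:.4f}, reject H0: {reject_h0}")
```

The computed G is about 2.469, which agrees with the reported test statistic up to rounding, so H0 is rejected.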
Boxplots give a compact graphical summary of a data set. Among other things, the median, the interquartile range, and the outliers can be read from a boxplot. The data must be on a metric scale, such as a person's age, electricity consumption, or temperature.
How to interpret the boxplot?
The box indicates the range in which the middle 50% of all values lie. The lower end of the box is the 1st quartile (Q1), and the upper end is the 3rd quartile (Q3). 25% of the data lie below Q1, and 25% of the data lie above Q3.
In the boxplot, the solid line represents the median whereas the dashed line represents the mean.
The T-shaped whiskers extend from the box to the most extreme data points that are still within 1.5 times the interquartile range (IQR) of the box. The upper whisker ends at the largest value that does not exceed the 3rd quartile plus 1.5 times the IQR, and the lower whisker ends at the smallest value that is not below the 1st quartile minus 1.5 times the IQR. If there are no outliers in the data, the whiskers therefore end at the maximum and the minimum value. Points that lie further than 1.5 times the interquartile range from the box are considered outliers and are drawn as individual points.
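The whisker rule can be sketched in Python as follows. This is a minimal illustration, not a plotting routine: the function name whisker_ends is hypothetical, the whiskers are assumed to end at the most extreme data points inside the 1.5 times IQR limits, and the quartiles come from statistics.quantiles with method="inclusive", which is only one of several common quartile conventions, so other tools may place the box edges slightly differently.

```python
import statistics


def whisker_ends(values, k=1.5):
    """Return (lower_whisker, upper_whisker, outliers) for a box plot.

    k is the multiple of the interquartile range used for the limits
    (1.5 in the description above).
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr

    # Values inside the limits; the whiskers end at the extremes of these.
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    # Values outside the limits are reported as outliers.
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return min(inside), max(inside), outliers
```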
Box Plot Example: Data - 199.31, 199.53, 200.19, 200.82, 201.92, 201.95, 202.18, 245.57
In the box plot of this data, it is graphically visible that the value 245.57 lies more than 1.5 times the interquartile range above the box, hence it is an outlier.
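The same check can be done numerically. The sketch below applies the 1.5 times IQR rule to the example data. It assumes the inclusive quartile convention of Python's statistics.quantiles, so the exact limits may differ slightly from those of the tool used to draw the plot.

```python
import statistics

data = [199.31, 199.53, 200.19, 200.82, 201.92, 201.95, 202.18, 245.57]

# Quartiles (the inclusive method is an assumption; conventions vary by tool)
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Limits beyond which a point counts as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_limit or x > upper_limit]
print(f"Q1 = {q1:.4f}, Q3 = {q3:.4f}, IQR = {iqr:.4f}")
print(f"limits = ({lower_limit:.4f}, {upper_limit:.4f}), outliers = {outliers}")
```

With this convention only 245.57 falls outside the limits, which agrees with what the box plot shows.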
Conclusion - I would prefer a box plot for finding outliers in a normally distributed data set, since it is less complex and easy to understand because of its graphical representation. Thanks.