An outlier is an anomaly: an extreme observation.
It is any observation that falls outside the overall pattern of the population distribution.
A common rule of thumb treats any data point that lies more than 1.5 * IQR below the first quartile or above the third quartile as an outlier.
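The 1.5 * IQR rule can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name iqr_outliers and the sample values are made up for this example, and quartile conventions differ between tools, so the fences may shift slightly elsewhere.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # method="inclusive" is one of several quartile conventions (an assumption
    # of this sketch); other conventions give slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical sample: eight ordinary readings and one extreme value.
data = [12, 13, 14, 15, 15, 16, 17, 18, 95]
print(iqr_outliers(data))  # [95]
```

Here Q1 = 14 and Q3 = 17, so the fences are 9.5 and 21.5 and only 95 falls outside them.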
Outliers are often assumed to be mistakes in data collection, and they can skew statistical relationships.
However, an outlier can arise for any of the following reasons:
Data entry/typing errors
Measurement errors
Experimental errors
Intentional/dummy data
Data processing errors (due to formula)
Sampling errors
Natural variation (usually not an error; it may be a novelty in the data)
We can find outliers by:
First and foremost, using common sense
Inspecting the data visually (a graphical summary such as a boxplot or scatterplot helps to find outliers)
Using statistical tests (there are many tests for outliers; a few are listed below)
Grubbs' test for outliers (also called the extreme studentized deviate test)
Dixon's Q test for outliers
Cochran's C test
Mandel's h and k statistics
Peirce's criterion
Chauvenet’s criterion
Mahalanobis distance and leverage
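As an illustration of one of the listed tests, Chauvenet's criterion can be sketched as follows. It assumes the data are roughly normal and flags a point when the expected number of observations at least that far from the mean, in a sample of size n, is below 0.5. The function name chauvenet_outliers and the sample values are hypothetical.

```python
import math
import statistics

def chauvenet_outliers(values):
    """Flag values that Chauvenet's criterion would reject (normal model assumed)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return []  # all values identical, nothing to flag
    flagged = []
    for v in values:
        z = abs(v - mean) / sd
        # Two-sided probability of a deviation at least this large
        # under a normal distribution.
        p = math.erfc(z / math.sqrt(2))
        # Reject when fewer than half an observation this extreme is expected.
        if n * p < 0.5:
            flagged.append(v)
    return flagged

# Hypothetical measurements with one suspicious reading.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 13.5]
print(chauvenet_outliers(data))
```

As with any such test, the result only says that a point is unusual under the assumed model; it does not say what to do with it.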
Methods of detection include:
Z-Score / Extreme Value Analysis
Probabilistic and Statistical Modeling
Linear Regression Models
Proximity Based Models
Information Theory Models
High Dimensional Outlier Detection Methods
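The simplest of these methods, the z-score, can be sketched as below. The threshold of 3 is a common convention rather than a fixed rule, and the function name zscore_outliers and the sample data are assumptions made for this illustration.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        return []  # no spread, so no point stands out
    return [v for v in values if abs(v - mean) / sd > threshold]

# Hypothetical sample: 20 values near 50 and one extreme value.
data = [48, 49, 50, 51, 52] * 4 + [120]
print(zscore_outliers(data))  # [120]
```

A caveat on small samples: with the sample standard deviation, no z-score can exceed (n - 1) / sqrt(n), so in a sample of 10 observations a threshold of 3 is never reached, however extreme a value is.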
In SAS, PROC UNIVARIATE and PROC SGPLOT can be used to find outliers.
Statistical tests can be used to detect outliers. However, they should not be used to decide what to do with them (ignore or remove).
Good domain knowledge is essential when analyzing outliers.
Below are example data sets with and without outliers:
Data set with outliers
Data set without outliers
An outlier can be either univariate or multivariate.
Univariate outlier: a data point with an extreme value on one variable
Multivariate outlier: a data point with an unusual combination of values on at least two variables
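A multivariate outlier need not be extreme on any single variable. The sketch below computes the Mahalanobis distance for two variables in plain Python to illustrate this. The function name mahalanobis_2d and the sample points are hypothetical, and no cut-off is applied because the choice of threshold depends on the data and the assumptions made.

```python
import math
import statistics

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample centroid."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(points)
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form d^2 = [dx dy] S^-1 [dx dy]^T for a 2x2 covariance S.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(max(d2, 0.0)))
    return distances

# Hypothetical data: y rises with x, except for the last point (7, 2.0),
# which is within the observed range of both x and y but off the trend.
points = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 4.2), (5, 4.8),
          (6, 6.1), (7, 6.9), (8, 8.2), (7, 2.0)]
for point, d in zip(points, mahalanobis_2d(points)):
    print(point, round(d, 2))
```

The last point has the largest distance even though neither of its coordinates is the most extreme value of its variable, which is what makes it a multivariate rather than a univariate outlier.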
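Other forms of outliers include:
Point outliers: a single data point that is far from the rest
Contextual outliers: values that are unusual only in a particular context; they can be noise in the data
Collective outliers: a subset of data points that is unusual as a group; they can be novelties in the data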
We can ignore an outlier when it is a bad outlier, and:
We know that it is wrong data (common sense)
We have a big data set (ignoring the outlier makes little difference in this situation)
We can go back and validate the data set for accuracy
The outlier does not change the result but does affect the assumptions
The outlier influences both the result and the assumptions. In this case it is better to run the analysis with and without the outlier (as we are not sure whether it comes from a mistake or a misclassification of the data) and then compare both results to see whether the difference is minor or major.
The outlier comes from an unintended population
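Running an analysis with and without a suspected outlier can be sketched with a simple least-squares line. The data, the helper name ols_fit, and the choice of a straight-line fit are all assumptions made for this illustration; the same comparison applies to whatever analysis is actually being run.

```python
def ols_fit(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: the last observation is the suspected outlier.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.0, 9.9, 30.0]

slope_with, intercept_with = ols_fit(xs, ys)
slope_without, intercept_without = ols_fit(xs[:-1], ys[:-1])

print("with outlier:    slope = %.2f" % slope_with)
print("without outlier: slope = %.2f" % slope_without)
```

On this sample the slope more than doubles when the last point is included, so the difference is major and the point deserves investigation before any decision to drop it.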
We should not ignore an outlier when it is a good outlier, and:
Results and outcomes are critical
We have too many outliers (usually a sign that they are not unusual)
Before ignoring an outlier, we should run through this checklist (for cautious and safe removal):
Is the outlier due to a data entry or typing error?
Is the identified outlier value scientifically impossible?
Is the assumption of a Gaussian distribution for the data set uncertain?
Does the outlier value seem scientifically interesting?
Do we have substantial information about the outlier that means we need to retain it?
Are there any special circumstances / situations / cases for the data points?
Are there any potential measurement errors?
With multiple outliers, can masking be a problem? (With masking, an "outlier" is not detected.)
If the answer to all of the above questions is no, then:
Either (Situation A) the so-called outlier could have come from the same Gaussian population; we simply happened to collect an observation from the top or bottom tail of the population.
Or (Situation B) the identified outlier could come from a different distribution, collected by mistake or through a bad sampling technique.
For Situation A, removing the outlier would be a mistake.
For Situation B, we can remove the outlier cautiously.
Removing outliers can be dangerous. It may improve the distribution and the fit, but most of the time some important information is lost.
Points to remember if we decide to remove or treat outliers:
Trim the data set
Winsorize (replace outliers with the nearest good data)
Transform the data or discretize it
Top, bottom and zero coding
Replace outliers with the mean or median (extreme outliers influence the mean but not the median; refer to the example below), or use random imputation
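Trimming and Winsorization, and the effect of an extreme value on the mean versus the median, can be illustrated as follows. This sketch uses the 1.5 * IQR fences to decide what counts as an outlier and replaces each outlier with the nearest value inside the fences, following the description above; Winsorization is also often done at fixed percentiles. The function names and the data are hypothetical.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Lower and upper fences of the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def trim(values, k=1.5):
    """Drop values outside the fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if low <= v <= high]

def winsorize(values, k=1.5):
    """Replace values outside the fences with the nearest value inside them."""
    kept = trim(values, k)
    low_good, high_good = min(kept), max(kept)
    return [min(max(v, low_good), high_good) for v in values]

# Hypothetical sample with one extreme value.
data = [12, 13, 14, 15, 15, 16, 17, 18, 95]

print(trim(data))       # [12, 13, 14, 15, 15, 16, 17, 18]
print(winsorize(data))  # [12, 13, 14, 15, 15, 16, 17, 18, 18]

# The extreme value pulls the mean up but leaves the median unchanged.
print(statistics.mean(data), statistics.median(data))
print(statistics.mean(winsorize(data)), statistics.median(winsorize(data)))
```

In this sample the mean drops from about 23.9 to about 15.3 after Winsorization, while the median stays at 15 in both cases.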
When we run experiments and observe many outliers in the data, we should repeat the data collection instead of simply removing them. When the outliers are significant, consider using robust statistical techniques.
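One example of a robust technique is the modified z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The sketch below uses the commonly cited constant 0.6745 and cut-off 3.5; the function names and the data are assumptions made for this illustration.

```python
import statistics

def modified_zscores(values):
    """Modified z-scores based on the median and the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]

def robust_outliers(values, threshold=3.5):
    """Return values whose absolute modified z-score exceeds the threshold."""
    scores = modified_zscores(values)
    return [v for v, s in zip(values, scores) if abs(s) > threshold]

# Hypothetical measurements with two extreme readings.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 13.5, 14.0]
print(robust_outliers(data))  # [13.5, 14.0]
```

On this sample the ordinary z-scores of 13.5 and 14.0 both stay below 3, because the two extreme values inflate the mean and the standard deviation themselves. The median and MAD are barely affected by them, so both values are flagged.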
Outliers are not always bad data points. However, when the data set is small, an outlier can greatly influence the statistics (skewed data, inflated or deflated means, a distorted range, and Type I and Type II errors).
So it is better to investigate thoroughly and to have background domain knowledge while performing this analysis.
The analysis differs from case to case, and based on it we should decide cautiously whether to remove, keep, or change the outlier.