Jump to content

Unusual Observation/Outlier

 

Unusual Observation/Outlier - is a data point or an observation that is located far from the rest of the data points. Outliers may be an outcome of variability in measurement or due to an experimental error. All outliers need to be investigated and corrected (usually removed from the data set) as presence of outliers can lead to incorrect statistical analysis.

 

An application-oriented question on the topic along with responses can be seen below. The best answer was provided by Natwar Lal and Mohamed Asif.

 

Applause for the respondents- Natwar Lal, Mohamed Asif, Sunil Kant, Ratnakar Bonda and Prabhakar Acharya. 

 

Also review the answer provided by Mr Venugopal R, Benchmark Six Sigma's in-house expert.

Question

Q. 197  A statistically unusual observation or an outlier is of special interest many a times. However, an outlier is ignored or removed from data set sometimes. What are the factors that help one decide if an outlier is to be considered and investigated or if it can be ignored safely? 

 

Note for website visitors - Two questions are asked every week on this platform. One on Tuesday and the other on Friday.

Share this post


Link to post
Share on other sites

7 answers to this question

Recommended Posts

  • 1

Outlier – is a data point that is significantly different from the rest of the data point. Another definition of an outlier is a data point that significantly varies from the distribution of the data set. It is a common phenomenon to get outliers in any data one collects. There can be many reasons for outliers in a data set. Some of the more common reasons are as below

1. Experimental error

2. Measurement system error (either the gauge gave an incorrect reading or it was noted down incorrectly by the operator)

3. Data collected from 2 or more distributions

4. Special Cause occurrences or anomalies

5. Attempts of fraud

 

Outlier management may be a thesis topic in itself. However, I will try to keep it simple and practical here :) 

 

How to identify the outliers

1. Box Plot: It is the most common method to identify the outliers. An outlier will be marked in * in a box plot. These star marks are observations which are smaller or greater than 1.5 times IQR (Inter Quartile Range)

2. Control Charts: Any point outside the control limits is considered as an Outlier. Applying Nelson rules, one could also identify the unusual observations though these may not be significantly different from the rest of the data points

3. Modelling (like Regression Analysis, Probability distribution fitting etc.): whenever a model or a probability is fitted using historical data, it lets us know if there are any data points that do not fit the model. Usually such data points are outliers in the dataset which do not follow the fitted model

There are other statistical methods also to identify the outliers in any data set. The ones listed above are the basic ones.

 

What to do if you encounter an outlier?

First do not panic :) And I write this because I have seen people panicking about outliers in their data sets. Getting outliers in a data set is not an outlier (it is very common)

Second, take a structured approach to resolve the Outliers

1. Find out the reason for the outlier. Best and the easiest method is to do a 5 Why analysis

2. Determine if the root cause is a part of the nature of the business or process (say a seasonal or cyclical effect etc). One will need common sense and business knowledge for this determination. If one does not have either :D  then it is advisable to work with an SME while doing this step. E.g. – the daily transactions during a month end show an unusual spike. However this spike is in the nature of banking business. Compare it with spike of transactions that happened due to demonetisation (Demonetisation is not the usual nature of business)

3. If the root cause is part of the nature of business, then the Outlier CANNOT be removed from the data set. Removing such an outlier would result in undesired data modification. If the outlier cannot be removed from the data set, then there are two possibilities

a. Outlier is beneficial for the process – in which case the root cause for the outlier should be replicated

b. Outlier is bad for the process – in which case the root cause for the outlier should be eliminated from the process (eliminating root cause is different from removing the data point. Elimination would ensure non recurrence of a bad outlier)

4. Alternatively, if the root cause if not part of the nature of business, then it may be EXCLUDED. It can be done in following two ways

a. Trimming i.e. deleting the outlier from the analysis

b. Winsorising i.e. replacing the outlier with the border case values (these border case values are not outliers)

 

If there is outlier or any data point that is excluded or winsorised, it should be clearly called out in the reporting

 

Share this post


Link to post
Share on other sites
  • 1

Outlier is Anomaly, an extreme observation. 

It is any observation that is outside the pattern of the overall population distribution.

 

Simply any data point that is more than 1.5 * IQR, either below the First Quartile or Above the Third Quartile.

 

Quote

Outlier: "An observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism" - (Hawkins, 1980)

 

Many a times, the indication of outlier is considered as mistake in data collection and it can skew the statistical relationship.

However, we could get an outlier because of the following reasons:

  • Data entry/Type errors
  • Measurement errors
  • Experimental errors
  • Intentional/dummy data
  • Data processing errors (due to formula)
  • Sampling errors
  • Natural (not usually an error, it could be novelties in data)

 

We can find outlier by,

  • Foremost, when we use common sense
  • Visually find the outlier (Graphical Summary out help to find outliers, or boxplot / scatterplot)
  • Using statistical tests (There are many tests to find out outlier, listed below are few)
    • Grubbs test for outliers (also called extreme studentized deviate)
    • Dixon Q test for outliers
    • Cochran’s C test
    • Mandel’s h and k statistics
    • Pierce’s criterion
    • Chauvenet’s criterion
    • Mahalanobis distance and leverage
  • Methods of detection includes:
    • Z-Score / Extreme Value Analysis
    • Probabilistic and Statistical Modeling
    • Linear Regression Models
    • Proximity Based Models
    • Information Theory Models
    • High Dimensional Outlier Detection Methods
  • In SAS, PROC Univariate, PROC SGPLOT can be used to find outlier.

 

Statistical Tests can be used to detect an Outlier. However, it should not be used to determine what to do with them! (Ignore / Remove).

One should have a good Domain Knowledge when Analyzing Outliers.  

 

Below is the example data set with Outlier and Without Outliers:

 

Dataset.jpg.65fe02c941821773de5e191274e2f479.jpg

Data set with Outlier

 

WO.jpg.07f66142189ba99485de663ed5ef4a7e.jpg

Data set without Outlier

 

We could have either have Univariate or Multivariate outlier.

Univariate outlier: Data point with outlier on one variable
Multivariate outlier: Combination of outliers on at least two variables

 

Other forms of Outlier includes:

Point outliers: Single outlier
Contextual outliers: Can be noise in the data
Collective outliers: Can be subset of uniqueness in the data (novelties)

 

We can ignore outliers when, it is Bad Outlier, and

  • We know that it is wrong data (Common sense)
  • We have big data set (ignoring outlier doesn't matter at this situation)
  • We can go back and validate the dataset for accuracy
  • When the Outlier does not change the result, however influence change in assumption
  • When Outlier influences both result and assumption, it is better to run analysis with and without outlier (as we are not sure whether it is because of mistake or misclassification of the data). Post analysis investigating both results to find the significance is minor or major.    
  • When outlier is a data from an unintended population

 

We should not ignore outliers when, it is Good Outlier, and

  • Results and outcomes are critical
  • We have too many outliers (Usually when it is not unusual)  

 

Before Ignoring we will have to run through this checklist (for cautious and safe removal)

  • Is Outlier because of data entry typo error?
  • Identified Outlier value scientifically impossible?
  • Assumption of Gaussian distribution on the data set is uncertain?
  • Is the Outlier value seems to be scientifically interesting?
  • Do we have substantial information about Outlier that we need to retain it?
  • Are there any special circumstances / situations / cases for the data points?
  • Are there any potential measurement errors?  
  • Under multi Outlier situation, can Masking be a problem? (In Masking - "outlier” is not detected)

 

If the Answer to above questions is No, then

Either, (Situation A) the so called, outlier, could have resulted from the same Gaussian population, it is just that we would have collected the observation from either the top/bottom tail of the population data.

Or, (Situation B) the identified outlier, could be from different distribution. However, we would have collected the data due to mistake or bad sampling technique.

For Situation A, removing outlier would be mistake

For Situation B, We can remove the outlier cautiously

 

Removal of Outlier can be dangerous. However it may improve the distribution and fit, but most of the time some important information is lost.

 

So Points to remember, if we remove outlier:

Trim the data set
Do Winsorization (Replace outliers with nearest good data)

Transform the data, Discretization

Top, Bottom and Zero Coding
Replace outlier with mean / median (Extreme Outliers will influence Mean, but not he Median; Ref to below example), random Imputation

Dataset34.jpg.5c943140ee6294884c0a4e9e5994ce87.jpg

While we run Experiments and observe many Outliers in the data, we should repeat the data collection instead of simply removing them and when the Outliers are significant, then consider using Robust Statistical Technique.

 

Outliers are not always bad data points, however, when the data set is small, then outlier can greatly influence the data statistics (We could have Skewed data, inflated or deflated means, distorted range and type I and type II errors).

So it is better to do through investigation and also have background domain knowledge while performing this analysis.

 

Case to case the analysis differs and based on that we should take cautious decision whether we have to Remove, Keep or change the Outlier. 

 

Share this post


Link to post
Share on other sites
  • 1

Benchmark Six Sigma Expert View by Venugopal R

While the respondents can approach this question from various angles,  let me ride upon the principle of control chart to address the discussion on outlier. The control chart is a wonderful tool that helps us in monitoring process stability. To build the limits of the control chart we usually take data points from a process that is considered statistically stable – and the plotted data will determine its own control limits. While the control limits are formed, we would expect all the data points to be contained within the limits – any point that fall outside indicates a high likelihood that it does not belong to the population and is considered as a outlier.

 

Many of you will be aware that there is a practice known as ‘control chart homogenization’, where we remove the outliers and re-calculate the control limits. In the case of charts such as X-bar R charts, where range charts are available, the outliers need to be removed from range chart first and then check whether the data gets homogenized.

 

From the above practice, we can see that for making the data  “statistically pure”, we have to ignore the outliers and recalculate. (Here too, we cannot keep removing the outliers beyond a certain point!). The objective here is to establish the mean value and control limits that are representative to the population and may be used for future process monitoring.

 

However, if we are viewing the control chart for improving the process, the first step is to examine the stability of the process. More the outliers, more the instability. Then, the outlying data points need to be analysed to understand their cause, even if there are outliers on the ‘favorable side’ of a control chart. The ‘abnormal’ conditions that caused an outlier could also be a measurement error, that needs to be identified and corrected.

 

For example, if we are studying the pattern of the quantum of fuel being purchased per day for cars from a fuel outlet and we start collecting data for a 3 month period. In between, if there was an announcement for fuel price increase, we will can certainly expect and high outlier point on the day prior to the hike and subsequently a dip, or low outlier immediately after the hike was announced. Such outliers with known reasons may be ignored for our study.

 

To summarize..

  1. In general, outliers are important information that help us identify a problem, (or sometimes a very favourable condition) and need to be examined in detail.
  2. Where our interest is to understand the statistical distribution of a set of data (either to set a baseline, or to validate an improvement), we may not want to get biased due to few stray outliers – hence we may exclude them from our calculation with discretion, though we may still analyse them.
  3. While monitoring a process, If we know the exact reason of particular outliers and such reasons being very extra-ordinary and usually not in our control, we may exclude them from to prevent mis-representation of a normal situation.

Share this post


Link to post
Share on other sites
  • 0

Removing at outlier data depend on intent of analysis. if we are doing analysis of larger data group we should remove such outlier from analysis , otherwise data output may be skewed towards outlier or no conclusion.

But we should investigate outlier to fix loss in process. Generally rootcause for outlier is very specific and can be fixed easily. Outlier cause big loss in single event vs data trends.

Share this post


Link to post
Share on other sites
  • 0

Outliers to be considered as critical for analysis . If outlier removed without knowing it may repeat and it may incurr cost or production loss etc... 

 

In control charts we can sense the abnormality if it is crossing the UCL or LCL. Before that one who understands Stewart rule properly he can sense the outliers before itself as a recurrence prevention. Proactive approch 

Share this post


Link to post
Share on other sites
  • 0

The factors are 

The occurrence of the outliers in the data

The level of impact on the output - higher the occurrence and higher impact needs to be considered to investigate. Low occurrence and low impact needs the outliers to be ignored

 

Share this post


Link to post
Share on other sites
  • 0

If you are looking for a short answer, read Natwar Lal's response. 

 

If you want to go deeper and need to consider more details, read the response by Mohd Asif,

 

Both Natwar Lal and Mohd Asif are winners for this question.

 

Do read Benchmark Expert Venugopal's view which is apt for professionals who are equipped with Lean Six Sigma qualification. 

Share this post


Link to post
Share on other sites
Guest
This topic is now closed to further replies.
Sign in to follow this  

  • Who's Online (See full list)

    There are no registered users currently online

  • Forum Statistics

    • Total Topics
      2,713
    • Total Posts
      13,177
  • Member Statistics

    • Total Members
      54,287
    • Most Online
      865

    Newest Member
    Ina
    Joined
×
×
  • Create New...