Benchmark Six Sigma Forum

Message added by Mayank Gupta,

An outlier is a data point or observation that lies far from the rest of the data points. Outliers may result from variability in measurement or from experimental error. All outliers need to be investigated and treated (usually by removing them from the data set), because the presence of outliers can lead to incorrect statistical analysis.

 

Outlier Management is the science of investigating and applying a suitable treatment to the outliers in the data. 


An application-oriented question on the topic along with responses can be seen below. The best answer was provided by Johanan Collins on 13th Jan 2022.

 

Applause for all the respondents - Manish Manjhi, Johanan Collins, Sanchita, Rathish Parameshwaran, Afzal Wadood.

Featured Replies

Q 436. Outliers are unusual observations in a data set, and whenever we work with real-world data, we will find outliers. What are the different approaches to deal with outliers? The answer with the greatest number of unique approaches and examples will be the winner.

 

Note for website visitors - Two questions are asked every week on this platform. One on Tuesday and the other on Friday.

Solved by Johanan Collins

One of the Partners in my firm always used to quote a saying of W. Edwards Deming, "In God we trust. All others must bring data.", whenever he wanted to explain the importance of data for an exercise. In my limited experience, I have seen the power of data in quite a few industries. And as another saying goes, "with great power comes great responsibility": data, too, has to be used very responsibly.

 

One such responsibility is to identify the outliers. As per the definition, "an outlier is a data point or an observation that is located far from the rest of the data points and may be an outcome of variability in measurement or of an experimental error."

 

But whenever I hear the term outlier, I visualize my mother using a supa (a winnowing tray) to clean rice. All the foreign particles are outliers that have to be separated out to make delicious rice. Otherwise the taste suffers, and sometimes small stones come with the rice, and we all know how much it hurts to bite down on one.

 

Similarly, in the case of data, if outliers are not removed they may lead to wrong or skewed analysis and ultimately to failure in achieving the desired results.

 

Origin

Before going into the different approaches to deal with outliers, let me first list the possible sources of outliers:

  • Data entry errors: These are human errors that occur during data collection and data entry. For example, if on one day you accidentally record production as 100 units when average daily production is 10 (with an available capacity of 12), that week's average production becomes about 23 instead of 10.
  • Instrument errors or measurement errors: These occur when we use a faulty instrument or measurement system. For example, in one of my exercises the client asked me to find the reason for high variation in truck freight. After studying the data we saw that, because truck type and truck utilization were not being captured, freight was not being compared across similar scenarios. Once those two parameters were defined, it became clear that the variation was not high and that the team was doing a good job of keeping it under control. A similar example on the manufacturing side is the use of faulty or uncalibrated devices to capture control parameters.
  • Sampling errors: The best analogy for this type of error is comparing apples with oranges: we collect and mix data of different types or characteristics and then analyse it as if the characteristics were the same. For example, in a work content estimation exercise in a manufacturing setup, the work content of blue-collar and white-collar staff has to be analysed separately, since for blue-collar staff one should see a high percentage of active work and for white-collar staff a high percentage of supervisory work.
  • Data processing errors: Outliers can be generated when we extract data from multiple sources and some unknown manipulation has taken place, or when a gap in our analysis model or formula produces outliers for a scenario the model does not consider. For example, if you calculate the cycle time between two activities without logic for the case where the start time is 23:00 Hrs on 1 Jan 2022 and the end time is 5:00 Hrs on 2 Jan 2022, the calculation will generate outliers.
  • Natural novelties in data: These are data points that are not generated by errors but occur naturally and are unusual in nature. For example, in any cost optimization exercise (manpower, operational, logistics, etc.) that uses 2021 data, we remove the data points of the lockdown period (Mar 2021 to Jul 2021) to exclude the unusual situation caused by Covid.
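The cycle-time case in the data processing bullet can be sketched in a few lines. This is only an illustration under assumptions: the function names are hypothetical, and the "buggy" variant is one plausible way the missing logic could show up (subtracting clock times while ignoring the date).

```python
from datetime import datetime, timedelta


def naive_cycle_time_hours(start: datetime, end: datetime) -> float:
    """Buggy variant (hypothetical): compares clock times only, ignoring the date."""
    return (end.hour + end.minute / 60) - (start.hour + start.minute / 60)


def cycle_time_hours(start: datetime, end: datetime) -> float:
    """Elapsed hours between two full timestamps (date and time together)."""
    if end < start:
        raise ValueError("end precedes start; check the source records")
    return (end - start) / timedelta(hours=1)


start = datetime(2022, 1, 1, 23, 0)  # 23:00 Hrs, 1 Jan 2022
end = datetime(2022, 1, 2, 5, 0)     # 5:00 Hrs, 2 Jan 2022

naive_result = naive_cycle_time_hours(start, end)
correct_result = cycle_time_hours(start, end)
print(naive_result)    # -18.0, an artificial outlier created by the formula
print(correct_result)  # 6.0
```

Subtracting full timestamps avoids the problem altogether, so no special handling for midnight is needed.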

 

Identification

The above points highlight where outliers can come from. Now let us understand how to identify them effectively using statistical tools:

  1. Plot the data in a box plot and identify the data points outside the minimum and maximum whiskers. (To know more, check out: Box Plot)
  2. Plot the data in a scatter plot and identify the data points that move away from the pattern. (To know more, check out: Scatter Plot)
  3. Use the Z-score, which expresses each data point as its distance from the mean in standard deviations. You can also distribute the data into frequency ranges, plot a histogram, and look at both extremes of the x-axis, where the occurrence of data is very low. Data points that lie beyond +/- 3 standard deviations from the mean are treated as outliers. (To know more, check out: Z - Score)
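As a rough sketch of the box-plot rule and the Z-score rule, the snippet below uses only Python's standard library. The production figures are made up for illustration, the 1.5 x IQR whisker rule is the usual box-plot convention, and the quartiles come from `statistics.quantiles` with its default method; other tools may compute quartiles slightly differently.

```python
import statistics

# Hypothetical daily production figures; 100 is a suspected entry error.
production = [10, 9, 11, 10, 12, 10, 9, 11, 10, 10,
              100, 11, 9, 10, 12, 10, 11, 9, 10, 10]


def z_score_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd > threshold]


def iqr_outliers(values, k=1.5):
    """Return values beyond the box-plot whiskers: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


z_flagged = z_score_outliers(production)
iqr_flagged = iqr_outliers(production)
print(z_flagged, iqr_flagged)  # [100] [100]
```

One caveat with the Z-score rule: in a sample of n points, no Z-score computed with the sample standard deviation can exceed (n - 1) / sqrt(n), so with only 7 points nothing can ever cross the +/- 3 line. The box-plot rule does not have that limitation.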

 

Resolution

Now that we have clarity on the origin of outliers and how to detect them, let us talk about the cure. To deal with outliers I generally use one or a combination of the activities explained below:

  1. Deleting the values: I delete an outlier if I am confident that it was wrongly entered, was wrongly calculated by the model because of missing information, or arose from a situation that is never going to happen again. The examples above illustrate this: we deleted the lockdown-period data to calculate the actual cost, and a production entry that is higher than the known capacity is clearly a wrong entry.
  2. Changing the values: I change the values in cases where I know the reason for the outliers. Consider the lockdown example again: when I am checking year-on-year cost variation, I populate the lockdown-period data points with the average of the remaining months or the average of the previous year.
  3. Using different analysis methods: One can use statistical measures and tests whose output is not affected by the presence of outliers. In the production data example, taking the median instead of the average gives ~10, so the result is not affected by the outlier.
  4. Valuing the outliers: Outliers that occur naturally and have a valid reason to exist should be analyzed further to understand their root cause. This type of outlier may be hiding precious information that can improve your process and performance. Such points should be classified as special causes and analyzed separately. For example, if employee-wise productivity reports show that one employee out of 100 delivers more than 90% performance across the month while the rest stay at ~80%, that employee's work practices should be analyzed, and anything tangible that can be implemented across the organization should be captured.
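The production example can be checked quickly. The snippet assumes a seven-day week in which six days record 10 units and one day is wrongly recorded as 100.

```python
import statistics

# Seven daily production entries; the 100 is the wrong entry from the example.
week = [10, 10, 10, 100, 10, 10, 10]

mean_units = statistics.mean(week)
median_units = statistics.median(week)
print(round(mean_units), median_units)  # 23 10
```

The mean is dragged to about 23 by a single entry, while the median stays at 10.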

 

Apart from the above points, I also believe one should work towards reducing the generation of unnatural outliers instead of spending time identifying them during analysis. Tools such as robotic process automation and digitalized systems for gathering data can reduce the possibility of generating unnatural outliers.

 

To conclude, we have seen how outliers can impact the data, how they are generated, how they can be tracked or measured, how to resolve them, and how to prevent them from being generated. Identifying outliers and taking appropriate action is an important task that everyone should follow to extract the real power of the available data.

  • Solution

Outliers are part of the real world and need to be investigated before the data is analyzed and interpreted. This is even more the case with small sample sizes, where outliers have a greater impact on the results. Some models, such as Principal Component Analysis, Hierarchical Models, K-Means, and Linear and Logistic Regression, are very sensitive to outliers. In some operations, detecting unusual transactions is itself the aim, as in fraud detection or stock forecasting, and these unusual transactions generally appear as outliers. Hence understanding outliers is critical: they are likely to bias the entire interpretation, or they may be exactly what we are looking for.

 

Reason for Outliers

 

  • Error

 

The error may be due to data entry, recording, gage measurement, operator measurement, calibration, sampling, or data processing.

 

  • Part of Normal Process

 

Outliers may be present in the data due to Bulk orders, Resellers or Extra Loyal Customers, etc.


How to Detect?

 

  • Data Visualization

 

Outliers can be detected through Data Visualization such as Box plots, Scatter Plots, Histograms, Run Charts, Lag Plots, Line Charts.

 

  • Statistical Methods

 

Outliers can be detected through Statistical Methods such as the Standard Deviation Method, Tukey's Method, etc.


What is the strategy to deal with outliers?

 

  • Keep the outlier and carry out the test with the outliers.
  • Segment the data and carry out a deeper analysis.
  • Imputing outliers and treating them separately. 
  • Set up a filter to do the test without the outliers. Since significant effects are hidden by outliers, it may be appropriate to set up a filter to examine the results without the outliers.
  • Delete the outlier - The outliers may be deleted if there was an error in the data or the reason for the outlier is not likely to happen again.
  • Delete the outlier after post-test analysis.
  • Change the value of the outlier. This may be done by replacing it with a more appropriate value such as the mean or the median.
  • Consider the underlying distribution. An Anderson-Darling or Shapiro-Wilk test may be done to check the normality of the data.
  • Carry out a Non-Parametric Test in case the underlying distribution is not Normal.
  • Transform the Data. Data can be transformed using the Box-Cox Transformation, Johnson Transformation, log transformations, scaling, cube root normalization, etc. 


Methods and Tests that can be done for data having Outliers

 

  • Winsorizing or Winsorization

 

It is named after Charles P. Winsor, who was an engineer and biostatistician. In this process the effect of the outliers is reduced by limiting the extreme values: every value beyond a specified percentile of the sample is set to the value at that percentile. Estimates computed from Winsorized data are generally more robust to outliers. Example: a 95% Winsorization sets the bottom 2.5% of the data to the 2.5th percentile value and the top 2.5% of the data to the 97.5th percentile value.
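A minimal sketch of Winsorization follows. It uses a simple rank-based rule: the k smallest values are raised to the next smallest and the k largest are lowered to the next largest. The function name, the sample, and the 10% tail (chosen so the effect is visible on ten points) are assumptions for illustration; library implementations may interpolate percentiles differently.

```python
def winsorize(values, tail=0.025):
    """Clamp the lowest and highest `tail` fraction of values to the nearest kept value."""
    n = len(values)
    k = int(tail * n)  # number of points to clamp in each tail
    if k == 0:
        return list(values)
    ordered = sorted(values)
    low, high = ordered[k], ordered[n - k - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical sample with one low and one high outlier.
sample = [1, 12, 13, 14, 14, 15, 15, 16, 18, 90]
winsorized = winsorize(sample, tail=0.10)
print(winsorized)  # [12, 12, 13, 14, 14, 15, 15, 16, 18, 18]
```

Note that the sample size is unchanged: the extreme values are pulled in, not dropped.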

 

 

  • Trimming/ Truncation

 

This is a method of censoring data. All data above/below a certain percentile is removed. Example: a 95% truncation eliminates the bottom 2.5% of the data (below the 2.5th percentile) and the top 2.5% of the data (above the 97.5th percentile). The TRIMMEAN function in Excel may be used to compute the mean of trimmed data.


Winsorized mean and truncated mean are not the same.
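The difference can be seen on a small made-up sample. Both helper functions below are hypothetical and use the same rank-based tail count; the 10% tail is chosen only so that one point per side is affected.

```python
import statistics


def trimmed_mean(values, tail):
    """Mean after dropping the lowest and highest `tail` fraction of points."""
    ordered = sorted(values)
    k = int(tail * len(ordered))
    return statistics.mean(ordered[k:len(ordered) - k])


def winsorized_mean(values, tail):
    """Mean after clamping the lowest and highest `tail` fraction of points."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(tail * n)
    low, high = ordered[k], ordered[n - k - 1]
    return statistics.mean(min(max(v, low), high) for v in ordered)


sample = [1, 12, 13, 14, 14, 15, 15, 16, 18, 90]
plain = statistics.mean(sample)
trimmed = trimmed_mean(sample, 0.10)
winsorized = winsorized_mean(sample, 0.10)
print(plain, trimmed, winsorized)  # 20.8 14.625 14.7
```

Trimming averages the 8 remaining points, while Winsorizing still averages 10 points, two of which have been pulled in. In Excel, the second argument of TRIMMEAN is the total fraction removed from both ends, so a 10% tail on each side corresponds to TRIMMEAN(range, 0.2).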

Non-Parametric Tests such as the 1-Sample Sign Test, 1-Sample Wilcoxon Test, Mann-Whitney, Kruskal-Wallis, Mood's Median, Friedman, and Runs tests can be done when the underlying distribution is not normal.
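To show why such tests are robust, here is a hedged sketch of a two-sided 1-Sample Sign Test written with the standard library only. The function name and the handling-time values are illustrative assumptions; a statistics package would normally be used in practice.

```python
from math import comb


def sign_test_p_value(values, hypothesized_median):
    """Two-sided 1-sample sign test; values equal to the hypothesized median are dropped."""
    above = sum(1 for v in values if v > hypothesized_median)
    below = sum(1 for v in values if v < hypothesized_median)
    n = above + below
    if n == 0:
        return 1.0
    k = min(above, below)
    # Probability of k or fewer signs on one side under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative handling times in minutes, with two extreme values (2 and 80).
handling_times = [14, 2, 14.5, 11.5, 12.5, 80, 12, 16, 14.5, 13, 14.6]
p_value = sign_test_p_value(handling_times, hypothesized_median=15)
print(round(p_value, 4))  # 0.0654
```

Only the sign of each difference is used, so the value 80 counts exactly as much as the value 16. That is why the outliers cannot dominate the result.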

 

  • Transformation - Transform Data and carry out Parametric tests. 

 

  • Univariate Methods

 

Box Plot - The box plot is the easiest method for identifying outliers. It uses the median and the Q1 and Q3 to determine the outliers.


Tukey Method - This method identifies extreme outliers as values lying more than 3 times the Inter Quartile Range (IQR) below the first quartile or above the third quartile, and mild outliers as values lying between 1.5 and 3 times the IQR beyond those quartiles.
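The mild/extreme split can be sketched as follows. The readings are made up, and the quartiles come from `statistics.quantiles` with its default method, which is an assumption: other quartile conventions can move the fences slightly.

```python
import statistics


def tukey_classify(values):
    """Split values into mild and extreme outliers using Tukey's inner and outer fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    mild, extreme = [], []
    for v in values:
        if v < outer_low or v > outer_high:
            extreme.append(v)
        elif v < inner_low or v > inner_high:
            mild.append(v)
    return mild, extreme


# Hypothetical readings: Q1 = 11, Q3 = 14, so IQR = 3.
readings = [3, 10, 11, 11, 12, 12, 12, 13, 13, 14, 14, 19, 40]
mild, extreme = tukey_classify(readings)
print(mild, extreme)  # [3, 19] [40]
```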

 

  • Multivariate Methods 


At times the univariate method may not detect the outliers. Multivariate methods such as multiple linear regression may be used. 

 

  • Minkowski Error. 


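This method can be used to minimize the impact of the outliers on the model. It is a loss index that is less sensitive to outliers than the mean squared error: in the mean squared error each error is squared, so the contribution of an outlier grows quadratically with its size, whereas the Minkowski error raises each error to a power smaller than 2 (1.5 is a common choice).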
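A small numeric sketch of the idea follows. It assumes the Minkowski error is the mean of the absolute errors raised to a power p, with p = 1.5 used here as an illustrative choice; the function names and residuals are hypothetical.

```python
def mean_power_error(errors, p):
    """Mean of |error|**p; p=2 gives the mean squared error."""
    return sum(abs(e) ** p for e in errors) / len(errors)


def largest_error_share(errors, p):
    """Fraction of the total loss contributed by the single largest error."""
    contributions = [abs(e) ** p for e in errors]
    return max(contributions) / sum(contributions)


# Hypothetical model residuals; the last one comes from an outlier.
residuals = [1.0, -1.0, 1.0, -1.0, 10.0]

mse = mean_power_error(residuals, p=2)
minkowski = mean_power_error(residuals, p=1.5)
share_mse = largest_error_share(residuals, p=2)
share_minkowski = largest_error_share(residuals, p=1.5)
print(round(share_mse, 2), round(share_minkowski, 2))  # 0.96 0.89
```

With p = 2 the single outlier accounts for about 96% of the loss; with p = 1.5 its share drops to about 89%, so it pulls less on the fitted model.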


References

https://en.wikipedia.org/wiki/Winsorizing

https://www.sigmamagic.com/blogs/how-to-handle-outliers/

https://cxl.com/blog/outliers/

https://aichapters.com/how-do-you-handle-outliers-in-data/


https://www.aquare.la/en/what-are-outliers-and-how-to-treat-them-in-data-analytics/

This response is based on typical scenarios experienced in the off shoring/BPS processes.

 

Outliers are those data points which are different from the rest of the data points and can be distinguished using graphical analysis. Visually they stand out when we view a scatter plot or box plot. In a control chart, the data points beyond the UCL and LCL are the outliers. They distort the overall central tendency if their values are significantly high or low.
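As a rough sketch of the control chart view, the snippet below computes limits for an individuals chart, estimating sigma from the average moving range divided by 1.128 (the standard d2 constant for ranges of two consecutive points). The AHT values are hypothetical and the function name is an assumption.

```python
import statistics


def individuals_chart_limits(values):
    """Return (LCL, center line, UCL) for an individuals control chart."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    sigma_hat = statistics.mean(moving_ranges) / 1.128  # d2 constant for n = 2
    center = statistics.mean(values)
    return center - 3 * sigma_hat, center, center + 3 * sigma_hat


# Hypothetical handling times in minutes; 72 is the suspect point.
aht_minutes = [15, 14, 16, 15, 13, 17, 14, 16, 15, 72, 14, 15]
lcl, center, ucl = individuals_chart_limits(aht_minutes)
out_of_control = [v for v in aht_minutes if v < lcl or v > ucl]
print(out_of_control)  # [72]
```

The computed LCL is negative here because the outlier inflates the moving range; for a time metric it would normally be floored at zero.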

 

Consider the AHT (average handling time) metric of a process, in which 2-3 transactions or calls may have a significantly high or low AHT. Let's say the AHT is 15 mins, while a few odd data points are 72 mins or 1 min.

These impact the overall calculation of central tendency due to extreme values.

In the data below, Section B has extreme values, hence its Mean is influenced. So is the Std deviation.

Compared with the Section A data set, whose values lie closer together, the values in B range from 2 to 80.

 

Summary statistics (AHT in minutes)

Statistic    A        B
Mean         15.09    18.60
Median       15.00    14.00
Mode         15.00    14.50
Std Dev      2.39     20.70

Data points (AHT in minutes)

A        B
15.00    14.00
14.00    2.00
15.00    14.50
15.00    11.50
12.00    12.50
17.00    80.00
12.00    12.00
16.00    16.00
13.00    14.50
17.00    13.00
20.00    14.60
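The summary statistics for A and B can be reproduced with Python's standard library. The variable and function names are illustrative, and the standard deviation is the sample standard deviation, which matches the figures reported above.

```python
import statistics

section_a = [15, 14, 15, 15, 12, 17, 12, 16, 13, 17, 20]
section_b = [14, 2, 14.5, 11.5, 12.5, 80, 12, 16, 14.5, 13, 14.6]


def summarize(values):
    """Mean, median, mode and sample standard deviation, rounded to 2 decimals."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": round(statistics.stdev(values), 2),
    }


summary_a = summarize(section_a)
summary_b = summarize(section_b)
print(summary_a)
print(summary_b)
```

The two extreme values in B (2 and 80) move the mean by about 3.5 minutes and multiply the standard deviation almost ninefold, while the median and mode barely move.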

 

 

These can be because of:

  • Erroneous data entry: a manual input in the tracking of AHT data.

  • Erroneous measurement system: the system or tool that tracks AHT had a glitch and incorrectly tracked a few data points, or a processor had to stop the timer manually and did not stop it at the right time, causing the timer to run longer.

  • Genuine scenarios: a truly lengthy call or case that took much longer.

      • A complicated case

      • A long call in which a complaint has been resolved

      • A long call in which a person with a disability is being assisted, where it makes sense to remain customer centric and try to provide a better customer experience.

 

Options to deal with Outliers:

  • Investigate and eliminate: if we know that the data points found as outliers are an inaccurate reflection of the process, they can be eliminated. For example, if there is a known issue in AHT data tracking that spikes the AHT to >50 mins, we can safely eliminate those points.

  • Investigate and retain: if the investigation shows that the outlier data point is valid, we can acknowledge and retain it. We may have to treat it differently, but it need not be excluded from the study.

  • Investigate and modify: if the investigation shows a known reason for a spike in AHT, we can modify the data point by capping it. For example, if we know there was a manual error in capturing AHT (maybe the timer ran too long because the associate didn't click the stop button), we can cap the tracked AHT at a known value based on the nature of the case handled.

  • Use a different method: for assessing central tendency we can use the median instead of the mean, since the median is not heavily impacted by outliers, or we can use equivalent nonparametric tests.

An outlier is an observation that lies at an unusual or abnormal distance from the other values in a random sample of a population. The degree of these outliers can be mild or extreme, and it is up to the model or the analyst to define what counts as abnormal.

These outliers may contain valuable information or could be a meaningless deviation resulting from measuring or recording errors.

The outliers can be detected using the Box Plot, Z-Score, or the Inter Quartile Range (IQR) techniques. Once the outliers are detected we can use the below method to handle them.

Removing or trimming the outliers – Remove the abnormal data from the data set; it is not a good practice, though.

Flooring and capping based on quantile – cap values at a certain percentile (e.g. the 90th percentile) and floor values that fall below a lower percentile (e.g. the 10th percentile).

Imputation of Mean/Median – replace the outlier with the Median value rather than the Mean, since the Mean is itself influenced by the outliers.

Outliers are data points that look different and lie far from the rest of the data. Outliers can influence estimates such as the mean and variance, and they reduce the power of statistical tests. It is important to handle outliers carefully before working on any estimates. Outliers can be detected through tools like the boxplot, scatterplot, Z score, etc.

Below are some of the approaches to handle the outliers:

 

  1. Identify the outliers in the data using some of the tools mentioned above.
  2. Check whether the outliers are the result of measurement or data entry errors. We can discuss this with the data provider and correct the entry if possible.
  3. If we know that it is a data entry error but do not know the actual value, we can replace the value using various imputation techniques (for example, replacing it with the mean or median).
  4. We can also use filters to cap the outlier values. For example, any value above the 95th percentile can be replaced with the 95th percentile value. A similar approach can be taken for low outliers.
  5. If we think that an outlier comes from mixing in another population, we can remove it so that our sample becomes more representative. In this case, we should document the removed data values along with the reasons for their removal.
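The imputation step can be sketched as below. Outliers are flagged with the 1.5 x IQR rule and replaced with the median of the remaining values, and a log of the replaced entries is kept so the change is documented. The function name, the sample values, and the choice of flagging rule are assumptions for illustration.

```python
import statistics


def impute_outliers_with_median(values, k=1.5):
    """Replace values outside Q1 - k*IQR .. Q3 + k*IQR with the median of the inliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    fill = statistics.median(inliers)
    replaced = [(i, v) for i, v in enumerate(values) if not low <= v <= high]
    cleaned = [v if low <= v <= high else fill for v in values]
    return cleaned, replaced


# Hypothetical measurements with two suspect entries (72 and 1).
measurements = [12, 14, 15, 13, 72, 14, 16, 15, 1, 14]
cleaned, replaced = impute_outliers_with_median(measurements)
print(cleaned)   # [12, 14, 15, 13, 14.0, 14, 16, 15, 14.0, 14]
print(replaced)  # [(4, 72), (8, 1)]
```

Returning the list of replaced positions and original values keeps an audit trail of what was changed and where.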

 

Finally, deciding on the best way to handle outliers requires a detailed evaluation of the problem under study, the data distribution, the research methodology, etc.

It was easy to identify the winner - Johanan Collins. 

 

Response from Manish Manjhi is a must read.
