Benchmark Six Sigma Forum

Showing content with the highest reputation on 01/14/2022 in all areas

  1. Outliers are part of the real world and need to be investigated before the data is analyzed and interpreted. This matters even more with small sample sizes, where outliers have a greater impact on the results. Some models, such as Principal Component Analysis, Hierarchical Models, K-Means, and Linear and Logistic Regression, are very sensitive to outliers. In some applications, such as fraud detection or stock forecasting, detecting unusual transactions is itself the aim, and those transactions generally show up as outliers. Understanding outliers is therefore critical: they can bias the entire interpretation, or they may be exactly what we are looking for.

     Reason for Outliers

     - Error: The error may be due to data entry, recording, gage measurement, operator measurement, calibration, sampling, or data processing.
     - Part of Normal Process: Outliers may be present in the data due to bulk orders, resellers, extra loyal customers, etc.

     How to Detect?

     - Data Visualization: Outliers can be detected through visualizations such as box plots, scatter plots, histograms, run charts, lag plots, and line charts.
     - Statistical Methods: Outliers can be detected through statistical methods such as the Standard Deviation Method, Tukey's Method, etc.

     What is the strategy to deal with outliers?

     - Keep the outlier and carry out the test with the outliers.
     - Segment the data and carry out a deeper analysis.
     - Impute the outliers and treat them separately.
     - Set up a filter to do the test without the outliers. Since significant effects can be hidden by outliers, it may be appropriate to examine the results without them.
     - Delete the outlier. The outliers may be deleted if there was an error in the data or if the reason for the outlier is not likely to happen again.
     - Delete the outlier after post-test analysis.
     - Change the value of the outlier. This may be done by replacing it with a more appropriate value such as the mean or the median.
     - Consider the underlying distribution. An Anderson-Darling or Shapiro-Wilk test may be done to check the normality of the data. Carry out a non-parametric test if the underlying distribution is not normal.
     - Transform the data. Data can be transformed using the Box-Cox transformation, Johnson transformation, log transformations, scaling, cube root normalization, etc.

     Methods and Tests that can be done for data having Outliers

     Winsorizing or Winsorization: It is named after Charles P. Winsor, who was an engineer and biostatistician. In this process the effect of the outliers is reduced by limiting the extreme values: every value beyond a chosen percentile of the sample is set to that percentile value. Estimates computed from Winsorized data are generally more robust to outliers. Example: a 95% Winsorization would set the bottom 2.5% of the data to the 2.5th percentile value and the top 2.5% of the data to the 97.5th percentile value.

     Trimming/Truncation: This is a method of censoring data. All data above or below a certain percentile is removed. Example: a 95% truncation would eliminate the bottom 2.5% of the data (below the 2.5th percentile) and the top 2.5% of the data (above the 97.5th percentile). The TRIMMEAN function in Excel may be used for trimming the data. The Winsorized mean and the truncated mean are not the same.

     Non-Parametric Tests: Tests such as the 1-Sample Sign, 1-Sample Wilcoxon, Mann-Whitney, Kruskal-Wallis, Mood's Median, Friedman, and Runs tests can be done when the underlying distribution is not normal.

     Transformation: Transform the data and carry out parametric tests.

     Univariate Methods

     - Box Plot: The box plot is the easiest method for identifying outliers. It uses the median, Q1, and Q3 to determine the outliers.
     - Tukey Method: This method flags extreme outliers as points lying more than 3 times the Inter Quartile Range (IQR) below the first quartile or above the third quartile, and mild outliers as points lying between 1.5 and 3 times the IQR beyond those quartiles.

     Multivariate Methods: At times the univariate methods may not detect the outliers. Multivariate methods such as multiple linear regression may be used.

     Minkowski Error: This method can be used to minimize the impact of the outliers on the model. It is a loss index that is less sensitive to outliers than the mean squared error, because the mean squared error squares each residual, so the contribution of an outlier grows quadratically with its size.

     References
     https://en.wikipedia.org/wiki/Winsorizing
     https://www.sigmamagic.com/blogs/how-to-handle-outliers/
     https://cxl.com/blog/outliers/
     https://aichapters.com/how-do-you-handle-outliers-in-data/
     https://www.aquare.la/en/what-are-outliers-and-how-to-treat-them-in-data-analytics/
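The Winsorizing, trimming, and Tukey rules described in the post above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample data and the helper names (`percentile`, `winsorize`, `trim`, `tukey_outliers`) are made up for the example, percentiles use simple linear interpolation, and quartiles come from the standard library's `statistics.quantiles` with the "inclusive" method. Other tools (Excel, Minitab) may compute percentiles slightly differently.

```python
from statistics import mean, quantiles


def percentile(values, p):
    """Linear-interpolation percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def winsorize(values, lower_pct, upper_pct):
    """Clamp values outside the two percentiles to the percentile values."""
    lo, hi = percentile(values, lower_pct), percentile(values, upper_pct)
    return [min(max(v, lo), hi) for v in values]


def trim(values, lower_pct, upper_pct):
    """Drop values outside the two percentiles instead of clamping them."""
    lo, hi = percentile(values, lower_pct), percentile(values, upper_pct)
    return [v for v in values if lo <= v <= hi]


def tukey_outliers(values):
    """Split outliers into mild (beyond 1.5 * IQR) and extreme (beyond 3 * IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        # Distance of the point beyond the nearer quartile (negative if inside).
        distance = max(q1 - v, v - q3)
        if distance > 3 * iqr:
            extreme.append(v)
        elif distance > 1.5 * iqr:
            mild.append(v)
    return mild, extreme


# Hypothetical sample: mostly around 10, with one mild and one extreme value.
data = [9, 10, 10, 11, 10, 9, 11, 10, 12, 10, 14, 100]

winsorized = winsorize(data, 2.5, 97.5)  # "95% Winsorization"
trimmed = trim(data, 2.5, 97.5)          # "95% truncation"
mild, extreme = tukey_outliers(data)

print("plain mean:     ", mean(data))
print("Winsorized mean:", mean(winsorized))
print("trimmed mean:   ", mean(trimmed))
print("mild outliers:", mild, "extreme outliers:", extreme)
```

On this sample the Winsorized and trimmed means come out different, which matches the post's remark that the two are not the same. With only twelve points the 97.5th percentile still sits close to the extreme value, so Winsorizing softens the outlier far less than trimming does here.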
  2. One of the Partners in my firm always used to quote a saying of W. Edwards Deming, "In God we trust. All others must bring data.", whenever he wanted to explain the importance of data for an exercise. In my limited experience, I have seen the power of data in quite a few industries. As another saying goes, "with great power comes great responsibility", and data too has to be used very responsibly. One such responsibility is to identify the outliers. As per the definition, "An outlier is a data point or an observation that is located far from the rest of the data points and may be an outcome of variability in measurement or of an experimental error."

     Whenever I hear the term outlier, I visualize my mother using a supa (a winnowing tray) to clean rice. All the foreign particles are outliers, which have to be separated out to make delicious rice. Otherwise the taste suffers, or small stones come along with the rice, and we all know how much it hurts to chew on one. Similarly, if outliers are not removed from data, they may lead to a wrong or skewed analysis and ultimately to failure in achieving the desired results.

     Origin

     Before going into the different approaches to deal with outliers, let me first define how outliers can be generated:

     - Data entry errors: These are human errors that occur during data collection and data entry. For example, if on one day you accidentally record production as 100 units instead of the usual 10, against an available capacity of 12, that week's average production comes out at about 23 instead of 10.
     - Instrument errors or measurement errors: These occur when we use a faulty instrument or measurement system. For example, in one of my exercises the client asked me to find the reason for high variation in truck freight. After studying the data we saw that, because truck type and truck utilization were not being captured, freight was not being compared across similar scenarios. Once those two parameters were defined, it became clear that the variation was not high and that the team was doing a good job of keeping it under control. A similar example can be seen on the manufacturing side, where faulty or uncalibrated devices are used to capture the control parameters.
     - Sampling errors: The best analogy for this type of error is comparing apples with oranges: we collect and mix data of the wrong or different types or characteristics and then analyse it assuming the characteristics are the same. For example, in a work content estimation exercise in a manufacturing setup, the work content of blue-collar and white-collar staff has to be analysed separately, since for blue-collar staff one should see a high percentage of active work and for white-collar staff a high percentage of supervisory work.
     - Data processing errors: Outliers can be generated when we extract data from multiple sources and some unknown manipulation creeps in, or when a gap in our data analysis model or formula produces outliers for a scenario the model does not consider. For example, if you calculate the cycle time between two activities without logic for the case where the start time is 23:00 hrs on 1 Jan 2022 and the end time is 5:00 hrs on 2 Jan 2022, the result will be an outlier.
     - Natural novelties in data: These are data points that are not generated by errors but arise naturally and are unusual in nature. For example, for any cost optimization exercise (manpower, operational, logistics, etc.) based on 2021 data, we remove the data points of the lockdown period (Mar 2021 to Jul 2021) to exclude the unusual situation caused by Covid.

     Identification

     The points above highlight where outliers can come from. Now let us understand how to identify them effectively using statistical tools:

     - Plot the data in a box plot and identify the data points outside the minimum and maximum whiskers. (To know more, check out: Box Plot)
     - Plot the data in a scatter plot and identify the data points that move away from the pattern. (To know more, check out: Scatter Plot)
     - Use the Z-score: distribute the data into frequency ranges, create a histogram from it, and identify the ranges at both extremes of the x-axis where the occurrence of data is very low, that is, data points lying beyond +/- 3 standard deviations of the mean. (To know more, check out: Z - Score)

     Resolution

     Now that we have clarity on the origin of outliers and the ways of measuring them, let us talk about the cure. To deal with outliers I generally take one or a combination of the actions explained below:

     - Deleting the values: I delete the outlier if I am confident that it was wrongly entered, was wrongly calculated by the model due to missing information, or occurred because of a case that is never going to happen again. Both earlier examples fit here: the lockdown-period data removed before calculating the actual cost, and the production entry of 100 units, which is far above the known capacity.
     - Changing the values: I change the values in cases where I know the reason for the outliers. Consider the lockdown example again: when I am checking year-on-year cost variation, I take the average of the remaining months' data, or the average of the previous year, and use it to populate the data points of the lockdown period.
     - Using different analysis methods: One can use statistical measures and tests whose output is not affected by the presence of outliers. In the production data example, if one had taken the median instead of the average, the value would have been about 10 and would not have been affected by the outlier.
     - Valuing the outliers: Outliers that arise naturally and have a valid reason to exist should be analyzed further to understand their root cause. These outliers may be hiding precious information for improving your process and performance. They have to be classified as special causes and analyzed separately to extract that information, if any. For example, if employee-wise product reports show that one employee out of 100 is delivering more than 90% performance across the month while the rest are at about 80%, the work practice of that employee has to be analyzed, and anything tangible that can be implemented across the organization should be captured.

     Apart from the above points, I also believe one should work towards reducing the generation of unnatural outliers instead of spending time identifying them during analysis. Tools such as robotic process automation and digitalization of data-gathering systems can reduce the possibility of generating unnatural outliers.

     To conclude, we have seen how outliers can impact the data, how they can be generated, how they can be tracked or measured, how to resolve them, and how to keep them from being generated. Identifying outliers and taking appropriate action is an important activity that everyone should follow to extract the real power of the available data.
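Two of the examples in the post above can be checked with a short Python sketch. It rests on assumptions not stated in the post: the week is taken to have seven production days, six at 10 units plus the one mis-entered 100, and the names `week`, `cycle_time_hours`, and `naive_hours` are made up for the illustration.

```python
from datetime import datetime
from statistics import mean, median

# Assumed week: six normal days of 10 units and one day mistakenly
# entered as 100 units (the data entry error described in the post).
week = [10, 10, 10, 100, 10, 10, 10]

print(round(mean(week)))  # 23: the single bad entry pulls the average up
print(median(week))       # 10: the median is not affected by the bad entry


def cycle_time_hours(start, end):
    """Elapsed hours between two full timestamps (date and time together)."""
    return (end - start).total_seconds() / 3600


start = datetime(2022, 1, 1, 23, 0)  # 23:00 hrs on 1 Jan 2022
end = datetime(2022, 1, 2, 5, 0)     # 5:00 hrs on 2 Jan 2022

# A formula that looks only at the time of day misses the date change
# and produces a negative value that would show up as an outlier.
naive_hours = end.hour - start.hour
print(naive_hours)                   # -18

# Subtracting full timestamps handles the midnight crossing.
print(cycle_time_hours(start, end))  # 6.0
```

The first half shows why the median is suggested as a more robust summary for the production example. The second half shows one way a data processing gap, here ignoring the date, can manufacture an outlier that is not present in the underlying process.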
This leaderboard is set to Kolkata/GMT+05:30
