A partner at my firm used to quote a saying attributed to W. Edwards Deming, “In God we trust. All others must bring data.”, whenever he wanted to explain how important data was to an exercise. Even in my limited experience, I have seen the power of data in quite a few industries. And as another saying goes, “with great power comes great responsibility”: data, too, has to be used responsibly.
One such responsibility is to identify outliers. By definition, an outlier is a data point or observation that lies far from the rest of the data points, and it may be the outcome of variability in measurement or of an experimental error.
But whenever I hear the term outlier, I picture my mother using a supa (a winnowing basket) to clean rice. All the foreign particles are outliers that have to be separated out to make delicious rice. Otherwise the taste suffers, or sometimes a small stone comes along with the rice, and we all know how much it hurts to bite down on one.
Similarly, if outliers in data are not removed, they may lead to wrong or skewed analysis and, ultimately, to failure in achieving the desired results.
Origin
Before going into the different approaches to dealing with outliers, let me first describe how outliers can arise:
Data entry errors: These are human errors that occur during data collection and data entry. For example, suppose that on one day you accidentally record production as 100 units when average daily production is 10 and available capacity is 12. That week's average production will then come out as about 23 instead of 10.
Instrument errors or measurement errors: These occur when we use a faulty instrument or measurement system. For example, in one of my exercises the client asked me to find the reason for high variation in truck freight. After studying the data, we saw that because they were not capturing truck type and truck utilization, they were not comparing freight for similar scenarios. Once those two parameters were defined, it became clear that the variation was not high and that the team was doing a good job of keeping it under control. A similar example can be seen on the manufacturing side, where faulty or uncalibrated devices are used to capture control parameters.
Sampling errors: The best analogy for this type of error is comparing apples with oranges. We collect and mix data of different types or characteristics and then analyse it as if the characteristics were the same. For example, in a work content estimation exercise in a manufacturing setup, one has to analyse the work content of blue-collar and white-collar employees separately, since blue-collar roles should show a high percentage of active work and white-collar roles a high percentage of supervisory work.
Data processing errors: Outliers can be generated when we extract data from multiple sources and some unknown manipulation creeps in, or when a gap in our analysis model or formula produces outliers for a scenario the model does not cover. For example, if you calculate the cycle time between two activities without logic for the case where the start time is 23:00 hrs on 1 Jan 2022 and the end time is 05:00 hrs on 2 Jan 2022, you will generate outliers.
Natural novelties in data: These are data points that are not caused by errors but occur naturally and are unusual. For example, when a cost optimization exercise (manpower, operational, logistics, etc.) has to use 2021 data, we remove the data points of the lockdown period (Mar 2021 to Jul 2021) to exclude the unusual situation caused by Covid.
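To make the data processing example concrete, here is a minimal Python sketch of the cycle time case. The function names are hypothetical and chosen only for illustration. It assumes the faulty model subtracts clock times without their dates, which is one plausible way the gap could appear, and compares that with subtracting full timestamps.

```python
from datetime import datetime

def cycle_time_naive(start_hhmm: str, end_hhmm: str) -> float:
    """Hours between two clock times, ignoring the date (the faulty logic)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt)
    return delta.total_seconds() / 3600

def cycle_time(start: datetime, end: datetime) -> float:
    """Hours between two full timestamps, so crossing midnight is handled."""
    return (end - start).total_seconds() / 3600

# Start 23:00 on 1 Jan 2022, end 05:00 on 2 Jan 2022 (the case from the text).
naive = cycle_time_naive("23:00", "05:00")
correct = cycle_time(datetime(2022, 1, 1, 23, 0), datetime(2022, 1, 2, 5, 0))

print(naive)    # -18.0, an impossible negative cycle time
print(correct)  # 6.0
```

The negative value is exactly the kind of outlier that comes from the model and not from the process, so the fix belongs in the formula and not in the data.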
Identification
The above points highlight where outliers can come from. Now let us understand how to identify them effectively using statistical tools:
Plot the data in a box plot and identify the data points that fall outside the lower and upper whiskers. (To know more, check out: Box Plot)
Plot the data in a scatter plot and identify the data points that stray from the pattern. (To know more, check out: Scatter Plot)
Use the Z-score, which measures how many standard deviations a data point lies from the mean. Distribute the data into frequency ranges, create a histogram, and look at the ranges at both extremes of the x-axis where occurrences are very low. Data points lying beyond +/- 3 standard deviations from the mean are the usual outlier candidates. (To know more, check out: Z - Score)
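The box plot and Z-score checks above can also be done numerically. The sketch below is one possible way to do it with Python's standard library, not the only one. The production figures are made up for illustration, the 1.5 x IQR whisker rule is the common box plot convention (the text does not specify one), and the function names are hypothetical.

```python
from statistics import mean, quantiles, stdev

# Illustrative daily production figures (made up), with one bad entry of 100.
production = [9, 10, 11, 10, 12, 10, 9, 11, 10, 10, 11,
              9, 10, 12, 10, 11, 10, 9, 10, 11, 100]

def iqr_outliers(values, k=1.5):
    """Points outside the box plot whiskers: more than k * IQR beyond Q1 or Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Points more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

print(iqr_outliers(production))     # [100]
print(zscore_outliers(production))  # [100]
```

I used 21 values here instead of a single week because the Z-score rule needs a reasonable sample size: with only seven points, one extreme value inflates the standard deviation so much that no point can reach a Z-score of 3.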
Resolution
Now that we have clarity on where outliers come from and how to detect them, let us talk about the cure. To deal with outliers, I generally use one or a combination of the activities explained below:
Deleting the values: I delete an outlier if I am confident that it was wrongly entered, was wrongly calculated by the model due to missing information, or arose from a situation that is never going to happen again. Both cases appear in the examples above: we deleted the lockdown-period data to calculate the actual cost, and the production entry of 100 units was clearly wrong because it exceeded the known capacity of 12.
Changing the values: I change the values in cases where I know the reason for the outliers. Consider the lockdown example above: when I am checking year-on-year cost variation, instead of removing those data points I populate the lockdown period with the average of the remaining months or the average of the previous year.
Using different analysis methods: One can use statistical measures that are robust to outliers, so that their presence does not distort the final output. For example, in the production data example, taking the median instead of the average would have given ~10, unaffected by the outlier.
Valuing the outliers: Outliers that occur naturally and have a valid reason to exist should be analyzed further to understand their root cause. Such outliers may be hiding precious information for improving your process and performance. They should be classified as special causes and analyzed separately to extract that information, if any. For example, if employee-wise product reports show that one employee out of 100 sustains more than 90% performance across the month while the rest stay at ~80%, that employee's work practices should be analyzed, and anything tangible that can be implemented across the organization should be captured.
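The first three options can be tried on the production example from earlier. This is a small sketch under one assumption of mine: the week consists of six days at 10 units and one wrongly entered day at 100, which is what reproduces the average of about 23 quoted above. The capacity of 12 comes from the example itself, and the variable names are illustrative.

```python
from statistics import mean, median

# Assumed week: six normal days at 10 units and one mistyped entry of 100.
week = [10, 10, 10, 10, 10, 10, 100]
CAPACITY = 12  # available capacity from the example

# Distorted average when the outlier is left in.
raw_mean = mean(week)

# Using a different analysis method: the median ignores the extreme value.
robust_center = median(week)

# Deleting the values: drop entries that exceed capacity and so cannot be real.
valid = [x for x in week if x <= CAPACITY]
cleaned_mean = mean(valid)

# Changing the values: replace impossible entries with the average of the rest.
imputed = [x if x <= CAPACITY else cleaned_mean for x in week]

print(round(raw_mean, 2))  # 22.86
print(robust_center)       # 10
print(cleaned_mean)        # 10
print(imputed)             # [10, 10, 10, 10, 10, 10, 10]
```

Deleting shortens the series while changing keeps its length, which matters when the data feeds a period-by-period comparison such as the year-on-year check described above.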
Apart from the above points, I also believe one should work towards reducing the generation of unnatural outliers in the first place, instead of spending time identifying them during analysis. Tools such as robotic process automation and digitalized data-gathering systems can help reduce the possibility of generating unnatural outliers.
That concludes our look at outliers: how they can impact the data, how they are generated, how they can be detected, how to resolve them, and how to prevent them from being generated. Identifying outliers and taking appropriate action is an important task that everyone should follow to extract the real power of the available data.