Finding outliers and dealing with them - What is the true purpose of identifying outliers and how should they be dealt with? - General Discussions

November 16, 2015

We had a recent discussion on this topic. All forum members are invited to continue this discussion here.

Sahjahan H

Outliers are nothing but the special causes in the given data. We need to identify the special causes and remove them so that the data becomes normal. Analysis of normal data will always be more meaningful than analysis of data with outliers. Before removing any outliers, we need to make sure that a proper root cause analysis (RCA) is in place and that actions have been taken against the causes. Sometimes removing one outlier creates more outliers, so the decision to remove them has to be made cautiously.

Beverly Daniels

I respectfully disagree. As I posted in the earlier thread regarding control charts: Outlier detection was intended for static data sets (enumerative studies) not for data streams (analytical studies). In SPC a ‘special cause’ is a REAL value – or set of values – that occurs because a condition creates results that are substantially different than that produced by a stable common set of factors that vary in a predictable and ‘controlled’ manner. These results – even when a single value occurs beyond the control limits – are NOT outliers. They are a signal that the process has changed.

Regardless of the data set or the type of study, an outlier is a value that truly doesn’t belong in the data set because it is WRONG. Outliers are to be ‘removed’ or censored from the data only when they are validated as:
* Impossible results
* Misprints or typos
* Results from an invalid measurement event
* Mis-read measurements

Beverly Daniels

Outliers that are valid results are real and will occur to us and our Customers and so should not be censored or ignored in any type of study. Outlier detection is a result of comparing results to a theoretical distributional model (that doesn’t actually exist in real life) and the threshold for detecting the so called outlier is too often set at 95% of the distribution. In theory any distributional model will then have 5% of its data that is supposed to lie beyond the 95% threshold. When you have large data sets, you will detect ‘outliers’ that are SUPPOSED to be there per the distributional model.

When you have smaller data sets it’s still quite possible to get an ‘outlier’ that is supposed to be there in your data set. Outlier detection using statistical distribution models is a waste of time, if for no other reason than that when you detect an outlier you aren’t supposed to do anything with those values unless they meet the criteria listed above for an invalid result.
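To put rough numbers on the point about thresholds, here is a minimal Python sketch. It assumes a perfectly normal model and independent observations, which is exactly the kind of idealization the post warns about, and the function names are illustrative only. It computes the share of values a normal model places beyond +/- k standard deviations, how many such 'outliers' the model itself predicts in a data set of size n, and the chance that a small data set contains at least one.

```python
import math


def tail_fraction(k: float) -> float:
    """Two-sided fraction of a normal distribution beyond +/- k standard deviations."""
    return math.erfc(k / math.sqrt(2))


def expected_flags(n: int, k: float) -> float:
    """Number of points the normal model itself expects beyond +/- k sigma."""
    return n * tail_fraction(k)


def prob_at_least_one(n: int, k: float) -> float:
    """Chance that n independent normal draws include at least one flagged point."""
    return 1 - (1 - tail_fraction(k)) ** n


# k = 1.96 corresponds to the 95% threshold, k = 3 to three-sigma limits.
for n in (20, 1_000, 100_000):
    print(
        f"n={n}: expected flags at 95% = {expected_flags(n, 1.96):.1f}, "
        f"at 3 sigma = {expected_flags(n, 3.0):.1f}, "
        f"P(at least one at 95%) = {prob_at_least_one(n, 1.96):.3f}"
    )
```

Under these assumptions a 95% threshold flags roughly one value in twenty even when nothing is wrong with the data, which is why a flag alone says nothing about whether a value is invalid.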

Beverly Daniels

We do not remove outliers to achieve a Normal distribution (this is impossible anyway) as we don’t need Normal distributions to analyze the data properly and effectively. Think about the statement “a better analysis”. What does that mean? A clean ideal statistical test? Or the truth about the process we are analyzing? As Shewhart said (and someone here recently quoted) “Probability models do not generate our data, real world processes do.” Our analysis must match the data, we must not twist the data to meet some ideal analysis technique…

Sahjahan H

Maybe I can give an example from the IT industry to help the debate: server availability. A server being down on one particular day does not mean the server is performing poorly overall. This is an outlier: we need to identify the root cause of the downtime, exclude the outlier, and use the remaining data for the server availability SLA calculation. If it is a recurring issue, then we need to include it in the analysis.

Sahjahan H

Beverly, you wrote that "Outlier detection using statistical distribution models is a waste of time," since we are not going to do anything with those values unless the criteria you listed are met. So what method would you use to detect outliers? Or would you not use the term "outlier" at all?

Erik Laufer

This is a great question. Perhaps, to bring us back to the original post, I would respond by asking: what question are we asking of the data? What is the objective of the analysis/statistics that we are employing? Am I using the data for descriptive statistics of what has been? Am I taking the data and projecting the results onto others that were not included in the sample? Am I taking the data and looking to predict what might occur in the future? These are critical to understand. Beverly's distinction is important (experimental/ad hoc analysis vs observational analysis). So, outliers (interesting data) could be assessed via Tukey fences for retrospective/passive data analysis (quasi-EDA), or potentially via Tolerance Intervals. Outliers (interesting data) could be assessed via control charts. Outliers (interesting data) could be assessed via Confidence Intervals/Prediction Intervals. It all comes back to the question being posed of the data and how that informs decisions.
Regards, Erik
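
As one concrete illustration of the retrospective option mentioned above, here is a minimal Python sketch of Tukey fences. The data values are made up, the multiplier 1.5 is the conventional choice, and the quartile method ("inclusive") is an assumption, since different packages compute quartiles slightly differently. The function only flags values as interesting for follow-up; it does not remove anything.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_for_review(values, k=1.5):
    """Values outside the fences, kept in their original order."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical static data set (not from the thread).
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 100]

fences = tukey_fences(data)
flagged = flag_for_review(data)
print("fences:", fences)
print("flagged for investigation:", flagged)
```

Whether a flagged value is then censored is a separate decision that depends on validating it as an invalid result, as discussed earlier in the thread.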

Ravi Sharma

Before thinking about making a change to the process, it's imperative to look at the outliers, also known as special causes, and eliminate them.
Every process is affected by two kinds of causes, i.e. special and common. Common-cause variation is the inherent variability, and reducing it requires a lot of effort, in fact a change in the process.
As I mentioned, it's worth looking at the special causes and removing them.

Beverly Daniels

One of the issues we are experiencing is how we define our statistical words. This isn’t just semantical nit-picking. Our words matter. Outliers are not Special Causes and Special Causes are not Outliers. I know that some people use these words interchangeably but doing so creates confusion, misunderstandings and inhibits learning.

These definitions are not my opinion. They come from notable statisticians who selected the names, developed the definitions and the mathematical method by which we determine the existence of that type of result. Shewhart developed the name of special cause for results that came from a system of causes that was not the common system performance of a stable process STREAM. Other statisticians developed mathematical tests for the detection of outliers in STATIC data sets. They defined outliers as values that were a substantial distance from the bulk of the data set. Statistical software uses these different definitions.

Beverly Daniels

The IT example is a good one regarding the misuse of the term outlier as well as the dangers of censoring outliers that are not invalid results. First, the downtime was the result of an assignable or special cause. The IT group determined it through science, logic and reason – once they determined the cause they knew – or believed – that it was a different causal mechanism than those for the ‘typical’ downtime events. In all likelihood, they didn’t determine that this was a special cause in strict SPC terms through a violation of control limits. (And that’s OK, Shewhart’s intent encompasses both statistically detected special causes and physically detected special causes). The IT Team took the right action to determine the cause and take action to correct and prevent it in the future. However, censoring it from their downtime reporting is WRONG. This is a misuse of ‘outliers’. The event actually happened, it actually affected downtime. It should be in the performance metric.
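
A small Python sketch of what censoring does to the reported metric. All numbers are hypothetical (a 30-day month, two minutes of routine downtime per day, and one day with an 8-hour outage), and the function name is illustrative only.

```python
MINUTES_PER_DAY = 24 * 60


def availability(daily_downtime_minutes):
    """Availability = 1 - total downtime / total time, over the days supplied."""
    total_time = len(daily_downtime_minutes) * MINUTES_PER_DAY
    return 1 - sum(daily_downtime_minutes) / total_time


# Hypothetical month: 29 ordinary days plus one day with a 480-minute outage.
downtime = [2] * 29 + [480]

reported_all = availability(downtime)
# Censoring the outage day, as proposed in the IT example.
reported_censored = availability([d for d in downtime if d != 480])

print(f"all days included : {reported_all:.4%}")
print(f"outage day removed: {reported_censored:.4%}")
```

With these made-up figures the censored number looks better than what users actually experienced during the month, which is the objection raised above.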

Rune Søndergaard-Hannen

Hi Beverly, first of all, where do you get the 95% from? UCL and LCL are +/- 3 x Stdev, which gives 99.7%.

I think we also need to discuss which calculation model is used for the Stdev in the control limits. Is it the AIAG standard or the "normal" statistical calculation? There can be a relatively large difference between those two models, and with it a potential risk of getting a different number of outliers.
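
A minimal Python sketch of how much the choice of sigma estimate can matter on an individuals chart. It assumes that the "AIAG" method refers to the within-subgroup estimate (average range divided by d2, which for moving ranges of two points is d2 = 1.128) and that the "normal" calculation is the overall sample standard deviation. The data are made up and contain a deliberate shift halfway through.

```python
from statistics import mean, stdev

D2_FOR_N2 = 1.128  # bias-correction constant d2 for ranges of two points


def sigma_from_moving_range(x):
    """Short-term sigma estimate: average moving range divided by d2."""
    moving_ranges = [abs(b - a) for a, b in zip(x, x[1:])]
    return mean(moving_ranges) / D2_FOR_N2


def three_sigma_limits(x, sigma):
    """Lower and upper limits at mean +/- 3 * sigma."""
    center = mean(x)
    return center - 3 * sigma, center + 3 * sigma


def points_outside(x, sigma):
    """Values beyond the three-sigma limits for the given sigma estimate."""
    lower, upper = three_sigma_limits(x, sigma)
    return [v for v in x if v < lower or v > upper]


# Hypothetical data stream with a shift after the sixth point.
x = [10, 11, 10, 12, 11, 10, 20, 21, 20, 22, 21, 20]

sigma_mr = sigma_from_moving_range(x)
sigma_overall = stdev(x)
signals_mr = points_outside(x, sigma_mr)
signals_overall = points_outside(x, sigma_overall)

print(f"moving-range sigma: {sigma_mr:.3f}, signals: {signals_mr}")
print(f"overall sigma     : {sigma_overall:.3f}, signals: {signals_overall}")
```

In this example the overall standard deviation is inflated by the shift itself, so its limits are wide enough to hide the change, while the moving-range estimate produces several signals.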

In my opinion, outliers need to be explained either by a non-normal incident or by a real RCA.

What are the opinions in this forum on this?

Christopher Ayres

Use the correct distributions when posting results. Also make sure the outliers are real outliers within your analysis and are not caused by bad data collection. If you're doing, say, a Pareto for a real thing, then you must count them with a 0-100 mindset, because 1 defect is still a defect. However, keep in mind that for mass data, Six Sigma still allows for 3.4 defects per million opportunities.

Christopher Ayres

Outliers also depend on what results you're looking to achieve. In educational econometrics they will just lower your confidence level, if they are even used, depending on how far out they are. If you're measuring auto defects, then an outlier is still a defect and must be accounted for. Like I said before, though, 2 outliers in a study of 10 thousand will still give you a confidence level of around 99%.

Ravi Sharma

Outliers are measured observations that don't seem to fit the grouping of the rest of the observations. They are either too far to the right or to the left of the rest of the data for someone to conclude that they come from the same set of circumstances that created all the other points. When we see an outlier on a dot plot or histogram, we immediately know that something is different about the conditions that created those points, whether it's the process setup, the execution, or the way we measured the process. Investigate all outliers and find out what caused their values to be so different.

Rune Søndergaard-Hannen

I am talking about the confidence interval where 99.7% of all data points should be if the process is in control. But yes, it's possible to say that 95% of the points are inside the control limits and that because of that we have 5% outliers. That will just signal that the process isn't in control...

If that's the case, then a huge job will be in front of you to get the process in control. Investigate potential patterns in the causes, etc., to smooth out the process.
