
Outlier


Solved by mohanpb0

 

An outlier is a data point or observation that lies far from the rest of the data. Outliers may result from genuine variability in measurement or from experimental error. All outliers need to be investigated and corrected (usually by removing them from the data set), as their presence can lead to incorrect statistical analysis.

 

 

An application-oriented question on the topic, along with the responses, can be seen below. The best answer was provided by Mohan PB on 8th November 2017.

 

 

Question

Q40. What is the need to identify an outlier in a data-set? What are the methods and approaches that are useful for identifying outliers? 

 

Note for website visitors - Two questions are asked every week on this platform. One on Tuesday and the other on Friday.


18 answers to this question


  • 0
  • Solution

At a routine military inspection at an Army barracks, the Colonel inspecting the unit asked a new recruit, “What is the first thing you do, when cleaning your rifle?” to which the recruit answered, “Make sure that the rifle is mine”.

 

Behind the humour of the above quip is an important lesson to keep in mind before embarking on a project, be it big, small or minuscule: one needs to make sure that one is on the right job. Identifying outliers in a data set is akin to what the army recruit rightly said.

 

If processes, be they in a factory, a laboratory or an office, all ran as intended and planned, there would be no need for problem-solving measures, as there would be no problems in the first place. It is only because this does not happen all the time that data needs to be collected and problems need to be solved.

 

Problems are not exclusive only to the process being studied. They can also happen in the monitoring, measurement and data collection processes themselves. Due to problems in any of these, some transactions or parts can get impacted. It could be possible that there was a problem with the measurement device or gauge or software. Due to one or more of these reasons, the value of the metric for the one or few transactions or parts may go out of a normal expected range or be zero or the maximum value on the measuring device. Also, there could be errors in transcription or reporting.

 

Before embarking on an analysis of the data collected, or as a first step of the analysis, it makes sense to check whether any of the above has occurred. The mere occurrence of a low or high value of a metric need not make it an outlier; the situation and the other data need to be considered. For example, when collecting data on the weights of normal adult males, a recorded weight of 8 kilograms is obviously an outlier, probably caused by a digit missing on one side of the “8”. But when collecting data on the weights of newborn babies, a weight of around two kilograms may not be an outlier.

 

If the analyst does not identify, investigate and remove outliers from the data, any measures computed from that data can be incorrect, as most measures are sensitive to every point in the data set. Further, advanced analytical tools used on data with outliers can return incorrect results, mislead the investigator and send them on wild-goose chases. Worse still, the investigator can incorrectly conclude that there are no problems with the process and discontinue the investigation.

 

There are popular formulae for identifying potential outliers in terms of the quartiles and the Inter-Quartile Range (IQR) that can help in a first-level screening. For example, points that lie below the first quartile or above the third quartile by more than 1.5 times the IQR are considered mild outliers, while points that lie below the first quartile or above the third quartile by more than 3 times the IQR are considered extreme outliers.
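
The screening rule above can be sketched in a few lines of Python. This is only a minimal illustration and not part of the original answer: the function names and the sample data are made up, and the quartiles come from the standard library's `statistics.quantiles` with its default method, so other quartile conventions may give slightly different fences.

```python
import statistics


def tukey_fences(values, k):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify_outliers(values):
    """Split values into mild and extreme potential outliers."""
    inner_lo, inner_hi = tukey_fences(values, 1.5)
    outer_lo, outer_hi = tukey_fences(values, 3.0)
    # Beyond the outer fences: extreme
    extreme = [v for v in values if v < outer_lo or v > outer_hi]
    # Beyond the inner fences but within the outer fences: mild
    mild = [v for v in values
            if (v < inner_lo or v > inner_hi) and outer_lo <= v <= outer_hi]
    return mild, extreme


# Hypothetical cycle times in seconds (illustrative data only)
sample = [30, 30, 31, 31, 32, 32, 33, 33, 34, 34, 40, 50]
mild, extreme = classify_outliers(sample)
print("mild:", mild, "extreme:", extreme)
```

In this made-up sample, 40 falls between the inner and outer fences and is reported as mild, while 50 lies beyond the outer fence and is reported as extreme. Such points are only candidates; they still need to be investigated before being excluded.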

 

But all points identified by these formulae cannot be blindly called outliers and left out. The reasons for these outliers need to be investigated before a decision to ignore them or include them can be taken. Potential outliers can occur due to the following reasons.

 

  • Improvement opportunity: this is a valid data point arising from a hitherto unknown cause; it should be retained and investigated further.
  • Genuine data error: investigate the data point and correct the data.
  • Malicious intention: subject to reach and time available, interact with those who have misreported, motivate them to report the correct data and use it; if this is not feasible, the point can be identified as an outlier.
  • Lack of standardisation: after confirming the reason, the point can be identified as an outlier.
  • Uncontrolled sampling error: as the sample itself is incorrect, the point can be identified as an outlier.

  • 1

Outliers are the uncharacteristic aspect of a process; they mask its original character and mislead anyone studying it. The first step of process improvement is to act upon these outliers and analyse them to separate common causes from special causes of variation, so that it becomes clear what needs attention and how. Once it is confirmed that the special causes of variation can be handled, an analysis excluding the outliers will help us to understand the true capability of the process.

Statistical process control and monitoring using control charts is a good method for identifying outliers and assessing the stability of the process. Based on the pattern of the data spread, a control chart helps to assess process stability proactively. A stable process is one that operates consistently within its control limits.

Even before a detailed analysis of outliers using control charts, box plot techniques help with an initial screening of the data for outliers. A box plot gives good visibility into the data spread and supports a meaningful analysis of process performance.

  • 1

An outlier is any value that differs from the normal data set. It needs to be identified because it shows changed behaviour of the process at that particular value, and we can expect that some cause produced the change. For example, a machine produces 20,000 parts daily with 10 to 20 rejections, well under the rejection target of 0.3% of the total number of parts. We tracked the process for 30 consecutive days and found that on one day 100 parts were rejected. That day shows a different scenario from the other days: something happened that increased rejections on that particular day. We can do a why-why analysis and find the root cause. So it is necessary to identify outliers in the process and check whether or not they actually affect it. How they are treated depends on the methods we adopt and on the source of the outlier.

 

Sources of an outlier:

There are several sources of outliers in a process:

1. Measurement error. 

2. Experimental error

3. Human error

4. Sampling error

5. Error due to intention

We can ignore an occasional measurement or experimental error, but if the outlier arose by chance from the process itself, we should not ignore it. We should consider it and put a countermeasure in place.

 

Methods to identify outlier in a data set or in a process:

 

There are several methods to identify outlier in any data set or in any process:

1. Box plot

2. Scatter plot

3. Histogram

4. Control charts

  • 0

An outlier is an indication that process limits have been violated (the process is out of control). A simple control chart is one method of identifying outliers. A box plot can also identify outliers. Once outliers are identified, a root cause analysis can be conducted to understand the process behavior, which in turn helps in taking action.

  • 0

Need to identify an outlier: an outlier is an undesired outcome. If an outlier occurs, it is studied to determine whether it arose from common cause variation or special cause variation before deciding the further course of action. The end objective remains: minimize outliers, or eliminate them if possible.

 

Methods and approaches that are useful for identifying outliers: one common method is the box plot. Outliers are also identified via control charts while monitoring process performance.

 

Common statistical software packages such as Minitab and JMP are well capable of identifying outliers.

  • 0

One of the biggest challenges in data analysis is dealing with outliers. Detecting outliers and understanding them can lead to interesting findings. An outlier is a data point that lies exceptionally far from the mainstream data, or an observation that appears to deviate markedly from the other observations in the sample. Deciding whether a data point is an outlier is ultimately subjective, as there is no single mathematical definition of what constitutes an outlier.

Many approaches are available for detecting outliers:

  • Extreme Value Analysis: This is the most basic form of outlier detection and is only good for 1-dimensional data. In this type of analysis, it is assumed that values which are too large or too small are outliers. Z-test and Student’s t-test are examples of these statistical methods... They can be used as final steps for interpreting the outputs of other outlier detection methods.
  • Probabilistic and Statistical Models: These models assume specific distributions for the data. They use expectation-maximization (EM) methods to estimate the parameters of the model and calculate the probability of membership of each data point. The points with a low probability of membership are marked as outliers.
  • Linear Models: These methods use the distance of each data point from the plane that fits the sub-space to find outliers. PCA (Principal Component Analysis) is an example of a linear model for anomaly detection.
  • Proximity-based Models: Outliers are modelled as points which are isolated from the rest of the observations. Cluster analysis, density-based analysis and nearest neighbourhood are the main approaches of this kind. LOF (Local Outlier Factor) is one of the methods used.
  • Information Theoretic Models: These methods are based on the fact that outliers increase the minimum code length needed to describe a data set.
  • High-Dimensional Outlier Detection: Specific methods are used to handle high-dimensional sparse data. Here the nearest neighbourhood, the density of each cluster and finally the outlier score of each data point are calculated.

Identifying potential outliers is important for the following reasons:

1.       An outlier may indicate bad data. The data may have been coded incorrectly, or an experiment may not have been run correctly. If the data is proved to be wrong, it should be deleted from the analysis.

2.      In some cases the outlier may not be bad data. It may be due to random variation or may indicate something scientifically interesting.

The issues with regards to outliers are as follows:

·        Outlier Labelling

·        Outlier Accommodation

·        Outlier Identification

It is recommended that a check such as a normal probability plot be used. Box plots and histograms can also be used for checking the normality assumption and for identifying potential outliers.

Grubbs' test, the Tietjen-Moore test or the generalized ESD (Extreme Studentized Deviate) test can be used where the data is normally distributed. If the data follows a lognormal distribution, it can be transformed to a normal distribution before applying the above tests.

 

 

 

  • 0

Q. What is the need to identify outliers in a data-set?

Outliers are extreme values that lie at an abnormal distance from the other values in a random sample from a population. In simple terms, an outlier is an extreme data point which distinctly stands out, or drastically deviates, from the given norms or the rest of the data.

For example, consider the following scores: 26, 27, 2, 33, 90, 34, 29, 35

Both 2 and 90 lie outside most of the other values in the data set. 2 is much smaller and 90 is much larger than the other values. Hence both 2 and 90 are outliers here.

Need of Outliers identification:

Identifying outliers is as important as computing measures of central tendency and variability, and is vital for data analysis.

  1. Outliers due to data entry errors (human errors), measurement errors (instrument errors) or sampling errors (extracting or mixing data from wrong or various sources) distort the picture of the data we obtain using descriptive statistics and data visualization. When our goal is to understand the data, it is often worthwhile to disregard outliers.

  2. Outliers due to variability in measurements or experimental errors (data extraction or execution errors) play havoc with many machine learning algorithms and statistical models. When our goal is to predict, our models are often improved by ignoring outliers.

  3. Outliers due to novelty (natural, not an error) can be exactly what we want to learn about, especially for tasks like anomaly detection. If such outliers are omitted from the data set, the conclusions drawn from the study may change significantly. Because of this, knowing how to calculate and assess outliers is important for ensuring a proper understanding of statistical data.

Q. What are the methods and approaches that are useful for identifying outliers?

There is some guidance that helps one start questioning which points in the data should be treated as outliers. However, none of these methods will deliver the objective truth about which of a dataset’s observations are outliers, simply because there is no objective way of knowing whether something is truly an outlier or an honest-to-goodness data point that your model should account for. It is a subjective decision, depending on the goals of the analysis.

 

Approaches for detecting Outliers.

Outlier analysis classifies outlier detection models into the following groups:

1. Extreme Value Analysis: This is the most basic form of outlier detection and only good for 1-dimension data. In these types of analysis, it is assumed that values which are too large or too small are outliers. Z-test and Student’s t-test are examples of these statistical methods. These are good heuristics for initial analysis of data but they don’t have much value in multivariate settings. They can be used as final steps for interpreting outputs of other outlier detection methods.

 

2.  Probabilistic and Statistical Models: These models assume specific distributions for data. Then using the expectation-maximization(EM) methods they estimate the parameters of the model. Finally, they calculate probability of membership of each data point to calculated distribution. The points with low probability of membership are marked as outliers.

 

3.  Linear Models: These methods model the data in lower-dimensional sub-spaces using linear correlations. The distance of each data point from the plane that fits the sub-space is then calculated, and this distance is used to find outliers. PCA (Principal Component Analysis) is an example of a linear model for anomaly detection.

 

4. Proximity-based Models: The idea with these methods is to model outliers as points which are isolated from rest of observations. Cluster analysis, density based analysis and nearest neighborhood are main approaches of this kind.

 

5. Information Theoretic Models: The idea of these methods is the fact that outliers increase the minimum code length to describe a data set.

 

6. High-Dimensional Outlier Detection: Specific methods to handle high dimensional sparse data
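
As a rough illustration of the proximity-based idea in item 4, the sketch below scores each point by the distance to its k-th nearest neighbour, so that isolated points receive large scores. This is one simple choice among many proximity-based methods, not the method of any particular library; the function name, the value of `k` and the sample points are all assumptions made for the example.

```python
import math


def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical 2-D observations: a tight cluster plus one isolated point
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = knn_distance_scores(points, k=2)
most_isolated = points[scores.index(max(scores))]
print(most_isolated)
```

The point with the largest score is only a candidate outlier; how large a score must be before a point is treated as an outlier remains a judgment call that depends on the goals of the analysis.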

  • 0
Dear Vishwadeep,
 
An outlier is an observation / value in a data set that is not representative of the data set or lies at a substantial distance from the other observations / values in it.
 
The simplest real-life analogy for an outlier is a vehicle moving much slower or much faster than the prescribed speed limit on a highway, or a vehicle moving in the wrong direction after flouting a "No Entry" sign.
 
It is important and necessary to identify outliers in a data set, as they can have an adverse effect on statistical analysis and lead to misleading results. Outliers can also provide useful information about the process, so it is important to identify and understand them properly before discarding them from the data set for a statistical analysis.
 
There are many methods of finding outliers in a data set, e.g. box plots, scatter plots, an Excel sheet (Conditional formatting ---> Statistical --> Outlier), Minitab Statistical Software, etc.
 
Best regards
Aniruddha
  • 0

When dealing with a data set, the term “outlier” refers to data points that lie relatively far away from the majority of the data points. Many statistical tools and methods work by differentiating the data points that are considered part of the mainstream from those that are suspected of not belonging to it.

 

The science of data management has defined many statistical distributions to which the behavior of data is associated, depending on the type of process or activity from where the data is generated. These distributions have their characteristics and properties, based on which we can decide the probability of a data point belonging to the population represented by the distribution. For any data-point that is picked up as the outcome of a process, if the probability of that data-point falling within the distribution is lower than a set threshold, then such data-points are suspected as “Outliers”.

 

Now the question comes “If such improbable data-points (Outliers) do occur, is there something abnormal going on?” “Could it be a measurement error?” “Could it be a mix up of data?” It calls for an analysis.

 

Outlier influence on Central tendency

 

Consider a simple set of 10 data points representing the cycle time in seconds for a process.

30, 30, 32, 33, 30, 31, 60, 32, 30, 32

The average of the above data is 34.

One data point, 60, is abnormally high compared to the rest and is apparent as an outlier. If we take the average of this data set ignoring this point, it comes to 31.11.

Now, if we ask which one, 34 or 31.11, represents the realistic cycle time of the data set, the obvious answer would be 31.11. Thus, even one outlier could seriously impair our interpretation.

 

On the other hand, if we take the median, without ignoring any value, it comes to 31.5, which is reasonably close to the mean calculated excluding the outlier. So the median is a better choice to represent central tendency when we suspect outliers in a data set.
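
The arithmetic above can be reproduced with Python's standard `statistics` module. The variable names are chosen here for illustration only; the data is the ten cycle times listed above.

```python
import statistics

cycle_times = [30, 30, 32, 33, 30, 31, 60, 32, 30, 32]  # seconds, from the text
without_outlier = [t for t in cycle_times if t != 60]

mean_all = statistics.mean(cycle_times)           # mean including the value 60
mean_trimmed = statistics.mean(without_outlier)   # mean with 60 left out
median_all = statistics.median(cycle_times)       # median of all ten values

print(mean_all, round(mean_trimmed, 2), median_all)
```

The median moves very little when 60 is included because it depends only on the middle of the sorted data, whereas the mean is pulled towards the extreme value.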

 

Need for identifying an Outlier

 

  1. Outliers indicate that assignable or special causes may be influencing the process from which the data is generated. They help focus our attention on investigating such special causes and addressing them.
  2. If we are generating the data to decide the control limits for a control chart, then it is important to identify the outliers, exclude them and recalculate the limits. This process is called ‘homogenizing’ the data. However, if the amount of exclusion exceeds a certain threshold, the data will have to be discarded and taken afresh.
  3. If we are using data as input for machine learning, it is important that outliers do not confuse the data and the decision boundaries for the machine learning process.

Methods for identifying outliers:

1.     Box and Whisker plots

 

[Image: box-and-whisker plot]

 

The distance between quartile 1 (Q1) and quartile 3 (Q3) is known as the IQR (Inter-Quartile Range). A data point that falls below the 1st quartile or above the 3rd quartile by more than 1.5 times the IQR is called an outlier.

 

2.     Normal distribution principles

When dealing with variable data that is expected to follow a normal distribution, constructing a frequency distribution and plotting the normal curve will help in identifying outliers using the principles of normal distribution. (Only 0.27% data is expected to fall beyond the 3 sigma limits). The Z score method to determine outliers is based on this principle.

 

3.     Statistical Control Charts:

While using a data set to decide the control limits for a control chart, the data points that fall outside the control limits are eliminated as outliers and the control limits are re-calculated.
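
A minimal sketch of this exclude-and-recalculate step is shown below, using the ten cycle times from earlier in this answer. The answer does not say which type of control chart is meant; the sketch assumes an individuals (X) chart with limits at the mean plus or minus 2.66 times the average moving range, and the function names are hypothetical.

```python
import statistics


def individuals_limits(values):
    """Return (lcl, ucl) for an individuals (X) chart.

    Assumes limits of mean +/- 2.66 * average moving range, where the
    moving range is taken over pairs of consecutive points.
    """
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    centre = statistics.mean(values)
    spread = 2.66 * statistics.mean(moving_ranges)
    return centre - spread, centre + spread


def homogenize(values):
    """Drop points outside the limits once, then recalculate the limits."""
    lcl, ucl = individuals_limits(values)
    kept = [v for v in values if lcl <= v <= ucl]
    removed = [v for v in values if v < lcl or v > ucl]
    return kept, removed, individuals_limits(kept)


cycle_times = [30, 30, 32, 33, 30, 31, 60, 32, 30, 32]  # seconds
kept, removed, (lcl, ucl) = homogenize(cycle_times)
print(removed, round(lcl, 2), round(ucl, 2))
```

With this data the first pass removes only the value 60, and the recalculated limits are much tighter than the original ones. The sketch does not check how much data has been excluded, so the rule about discarding the data set when exclusions exceed a threshold would have to be applied separately.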

 

To conclude, identifying outliers in a data set helps in understanding and studying special causes and addressing them as appropriate. When we use data for setting up a baseline or standard, or for machine learning purposes, it is important to identify and isolate the adverse influence of outliers. It is possible to associate data sets with applicable statistical distributions and identify outliers with a high degree of objectivity.

  • 0

In a data set, an outlier is a piece of data which falls far outside the typically expected variation. It has a marked deviation from other values in the data set.

Outliers are an integral and critical part of the data set and they need to be identified and investigated carefully.

  • They can be ignored or removed if and only if it is known, or can be concluded with certainty, that they were the result of an error in the experiment, documentation and so on.
  • Wrongly removing outliers may result in the variance of the data set being underestimated.

 

Need to identify an outlier

1.       If Outliers are present in a data set, then they can skew or bias the analysis performed on a data set.

Example: In a data set of 5 temp readings - 51.5, 51, 52, 51.7, and 80 deg C, then the 80 deg C is an Outlier and skews the average temperature in the data set.

 

2.       Outliers can provide key information.

Example: If a delivery timing data set has a few outliers, those outliers may reflect the performance of a particular delivery person.

 

3.       They may reveal a hidden pattern in the data set which in turn, might expose some product characteristic.

Example: A device might be malfunctioning and giving bad data every once in a while. Analysing the pattern of the outliers would make it possible to identify the faulty device.

 

Methods to identify an outlier

1.       Box Plot

2.       Scatter Plot

3.       Histogram

  • 0

An outlier is defined as a data point that is different from the majority of the data. It is crucial to identify outliers since they can have a direct impact on the decisions taken. It is necessary not only to identify such outliers but also to negate their impact and take decisions based on the majority of the data. Many tools and techniques are used for identifying outliers. Some of these are discussed below:

 

Usage of graphical method

Histogram - any outlier will be easily detectable visually.

Box plots - a box plot is built on quartile ranges, so it is easy to identify an outlier using one.

 

Use of data

- Using the inter-quartile range is one of the most widely used methods of identifying outliers. Any data point below (Q1-1.5*IQR) or above (Q3+1.5*IQR) is considered an outlier.

- Calculating the Z score - the Z score measures how many standard deviations a data point lies below or above the mean of the data.
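
The Z score calculation can be sketched as follows. This is an illustrative example rather than a prescribed procedure: the function names and the cut-off values are assumptions, the sample standard deviation is used, and the data is the set of ten cycle times quoted in an earlier answer in this thread.

```python
import statistics


def z_scores(values):
    """Return the Z score of each value, using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute Z score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


cycle_times = [30, 30, 32, 33, 30, 31, 60, 32, 30, 32]  # seconds
print(z_score_outliers(cycle_times, threshold=3.0))
print(z_score_outliers(cycle_times, threshold=2.5))
```

On these ten values the reading of 60 has a Z score of roughly 2.8, so a cut-off of 3 does not flag it while a cut-off of 2.5 does. In small samples a single extreme value inflates the standard deviation, which is one reason the quartile-based rule is often preferred for a first screening.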

  • 0

An outlier is an observation that lies at an abnormal distance from other values in a random sample from a population. Put simply, it is data that lies far outside the mass of the other values in the set and is not in the normal range of the data set.

 

In the following set of numbers:  6 ,90, 94, 99, 106, 109, 211

The numbers 6 and 211 are outliers, as both clearly fall outside the variation expected from the other numbers in the set. On a scatter plot outliers are easy to spot, and in a data set in general they are easy to spot once the data is sorted in ascending or descending order. In any data set it is tempting to view outliers simply as an “irritant”, as they normally cause complications when one attempts to create a process model or visualise the data.

Having said that, when we look at data from real patterns, such as performance measures like service time in a QSR (quick service restaurant), outliers can reveal useful insights, sometimes ones never imagined before. Take the example of a QSR situated in a movie multiplex. We see many outliers when the order-taking times are mapped during the interval of a children's movie. The reason is simple: parents tend to take a little more time to decide what they and the children will order, and more time is lost when the children change their order a few times while it is being taken.

Noted below are some other situations that show how outliers help to improve processes:

-          Helps to check for errors. When proper fact finding is done, mistakes in the records are often detected, with evidence that the information was logged or processed incorrectly. Thus a checkpoint for detecting measurement errors is created.

-          Helps to estimate the occurrence of “special” situations in a live process. Outliers in real patterns, such as customer CSat scores on a standardised product, are pointers to opportunities in the product when what the outliers mean is scrutinised in context.

-          Helps to gather additional information. Outliers are sometimes associated with an unusual situation in the business environment, and there is a need to reconsider the operational targets and act upon the various cascading impacts. For example, a study of home delivery timing during the monsoon compels us to look outside the data set: heavy rain obviously affects delivery timing, and hence an additional step of calling customers or updating the online systems with the expected delivery time (ETA) becomes mandatory.

So, as the above examples indicate, if we can find the reasons for the outliers and relate them to the betterment of the process, or at least make an informed guess, the chances of missing noteworthy information or business intelligence are reduced. We tend to miss these opportunities by simply focussing on the average and improving from there.

Thus even the outliers that seem like nuisances and stragglers which can throw off the statistics are, in reality, a source of scope for improvement in a process.

The following are the popular or common methods to find an outlier-

-          A box and whiskers chart (boxplot) often shows outliers. However, a boxplot is not always available, and in some boxplots the whiskers reach out to include the outliers and hence may not show them.

-          A common and effective way to find outliers is to use the interquartile range (IQR). It contains the middle bulk of the data, and hence the outliers can be found easily once the IQR is known.

-          The Tukey method for finding outliers uses the interquartile range to filter out very large or very small numbers. It uses the concept of “fences”: the formulae give two values, or fences, that cordon off the outliers from the values that are part of the bulk of the data.

There are other advanced methods such as Grubbs’ test, Dixon’s Q test, the modified Thompson tau test and Peirce’s criterion.

  • 0

Outliers are individual readings that differ very greatly in value from most other values in a set. They are important because, depending on just how different they are, they might disproportionately bias the results of a statistical analysis of the data set as a whole.

Identification of potential outliers is important for the following reasons.

An outlier may indicate bad data. ... Outliers may be due to random variation or may indicate something scientifically interesting.

For example, consider this set of ten weight readings: 15, 15, 16, 16.5, 15, 15.5, 25, 16, 15, 16 (kg). A simple average of the entire set of readings is 16.5 kg. That’s a straightforward mathematical perspective taken without questioning the validity of any of the readings. But considering them more closely, wouldn’t it seem odd that when all the readings hover around the 15 or 16 kg mark, there’s one reading of 25 kg? It’s a sudden spike. Do things like that usually happen? If we exclude that particular reading from our calculation, the average changes to 15.56 kg, which is almost a whole kg lower.

A number of additional questions arise:

Is it possible that the reading of 25kg is the result of a measurement error? A faulty weight machine perhaps? Or is it possible that the reading was real and correct, but that there were extraordinary factors that caused it to be so different that time?
Should we include the reading of 25 kg while computing the average or should we exclude it?
What if we had taken 100 readings instead of just 10? Would we have had more readings of 25, or between 15 and 25? Is there a chance there could have been any readings above 25?
What if there was no reading of 25, but there were two other significantly different readings of 20 each?
How different does a value have to be in order for it to be considered to be so different that it could distort the result of a statistical analysis?
If we consider the 25 kg reading as an outlier, would other analysts also do the same with this set of data?
If we were to use such data over time in a machine learning system would it delay or retard its achievement of maximum effectiveness?
Is it possible to have a single generic method by which we can decide whether a value should be considered to be an outlier or not? 

The decision about whether or not a data value should be treated as an outlier is at least to some degree a subjective one. The decision may initially be based on some set of objective identification rules using a standard mathematical technique, but must then be reconsidered further in a subjective manner that questions the data (and also the entire data set) within the context of its business meaning. The same subjectivity would need to be applied in considering the results of any statistical analysis run on the overall data set, with an awareness of whether or not the outliers were included in the analysis. 

It is because of all the questions that arise that it is important to be able to identify outliers and evaluate them fully before deciding how to treat them. There are a number of mathematical methods for identifying outliers, starting with John Tukey’s IQR or box plot method and the simple z-score method, and going on to others that may be more robust in the face of factors that might stress the analysis. It is important to have both a statistical feel and a qualitative business feel for the data so that the most appropriate method is chosen to identify and treat the outliers.

  • 0

Outlier:

 

An outlier is a value or an observation that lies abnormal to the usual set of data or lies far away from the normal set of data or values.

It is necessary to identify outliers in order to draw an acceptable conclusion from the data. An outlier can also be called an exception. Exceptions are, as the name says, abnormal, so any conclusion drawn from data containing outliers may misrepresent the facts.

 

Identifying the outliers is therefore essential to interpreting and analysing the data correctly.

 

For example:-

Consider 300 Indians: 299 always attend meetings on time and are very punctual, whereas one individual is always 10 minutes late to every meeting.

We cannot conclude that Indians are unpunctual just because one person is always late. The one who is always late could be an outlier.

 

The blood pressure of 40 healthy people is checked, and all the values fall between 80 and 110 except for those that fall between 200 and 240. The latter values could be outliers.

 

Outliers can be identified using a scatter plot or a box plot.

 

If there is more than one outlier, I feel Pareto analysis and a histogram could also be helpful.

 

If outliers are included in the analysis, they could lead to entirely wrong results.

  • 0

Outlier: 

An outlier is an extreme value (or observation) in the dataset that normally lies outside the upper and lower limits and is distant from the other observations. An outlier might be due to several factors:
- Could be due to variability in measurement (Measurement error)
- Could be due to experimental error (Repeatability and Reproducibility)
- Could be due to a special cause/condition

 

What is the need to identify an outlier in a dataset?

As an outlier lies outside the control limits or far away from the rest of the observations, it is important to identify it so that its cause can be investigated and addressed, and the process brought back within the control limits.


What are the methods and approaches that are useful for identifying outliers?

Some of the best methods that can be useful to identify outliers are :

 

- Box Plot :  
A box plot displays five points that represent the centring, spread and distribution of a set of continuous data. The plot consists of a box, whiskers and outliers, and shows the maximum value, the minimum value, the median, the 75th percentile and the 25th percentile. The outliers are plotted as dots or asterisks.

 

- Control Charts:
A control chart tracks process statistics over time and detects the presence of special causes, which show up as points outside the control limits, that is, as outliers.
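To make the control chart idea concrete, here is a minimal sketch of an individuals (I) chart, one common type of control chart. The readings and the function name `individuals_chart_limits` are hypothetical; the limits use the usual mean ± 2.66 × average moving range.

```python
from statistics import mean

def individuals_chart_limits(values):
    """Return (lcl, centre, ucl) for an individuals (I) control chart."""
    centre = mean(values)
    # Moving range: absolute difference between consecutive readings.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = mean(moving_ranges)
    # 2.66 = 3 / 1.128, where 1.128 is the d2 constant for subgroups of size 2.
    return centre - 2.66 * mr_bar, centre, centre + 2.66 * mr_bar

# Hypothetical process readings in time order, invented for illustration.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 13.0, 10.0, 9.9, 10.1]
lcl, centre, ucl = individuals_chart_limits(readings)

# Points outside the control limits are signals of a possible special cause.
signals = [(i, x) for i, x in enumerate(readings) if x < lcl or x > ucl]
print(signals)
```

In this sketch only the reading of 13.0 (at position 6) falls outside the limits. A real control chart would normally apply further run rules as well, which are not shown here.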

 

Example of Outliers and how to deal depending on circumstances

Eg. 1: Every day, a person goes to the office and needs to be there between 9 am and 9.15 am IST. One day a procession delays him and he reaches the office at 9.50 am IST. For the rest of the month, he is on time. Reaching the office at 9.50 am is an outlier, but as it happened only once, it can be excluded or ignored as a one-off.


Conclusion: 

Outliers need to be addressed properly in every process. They could be due to the aforementioned factors or could occur by chance in any distribution. Depending on the context, outliers may be addressed or excluded.

  • 0

Q40. What is the need to identify an outlier in a data-set? What are the methods and approaches that are useful for identifying outliers? 

 

Outliers are observed data points that are abnormal and lie away from the other observed data points. They are usually values that fall outside the trend of the data collected. In a sense, whether a value is normal or abnormal is up to the analyst involved in the project. Outliers are neither inherently bad nor inherently good; they can provide valuable insights into the variation in the process.

 

Causes of outlier:

Outliers can have multiple causes: human error, process error, measurement error due to malfunctioning measurement apparatus, natural process variation, or contamination of the sample with elements from outside the population being examined. They can also result from flaws in the assumptions made by the analyst.

 

Type of outliers:

1.       When we describe the process using a bell-shaped distribution, the distribution shows the shape of the data collected, including its features, its symmetry and any departures from the assumptions. The assumptions are clearly stated, and the deviations from them that are monitored are called outliers.

2.       Unusual observations found when analysing the mass of data collected are usually referred to as outliers in statistics.

 

Graphs used to identify the outliers:

Scatter plots and box plots are used to identify outliers, along with analytical tools such as Dixon’s Q test, Chauvenet’s criterion, Peirce’s criterion and Rosner’s test.

https://en.wikipedia.org/wiki/Outlier

 

Outlier detection criteria:

Interquartile range:

The interquartile range (IQR) is a tool used to determine outliers. It is based on two quartiles, the first and the third.

IQR = 3rd quartile – 1st quartile

This gives the spread of the middle half of the data set around the median.

 

Determining outliers:

IQR * 1.5 gives the distance from the quartiles that determines the outliers in the data set.

1st quartile – (1.5 * IQR) gives the lower limit: values less than this are considered outliers.

3rd quartile + (1.5 * IQR) gives the upper limit: values greater than this are considered outliers.

 

Strong outlier – If, instead of 1.5, we multiply the IQR by 3.0 and add it to or subtract it from the respective quartiles, the points beyond the resulting limits are called strong outliers. These are also called extreme outliers: points beyond the outer fences.

 

Weak outlier – Any outlier other than a strong outlier is called a weak outlier. These are also called mild outliers: abnormal points that lie beyond the inner fences (1.5 * IQR) but within the outer fences (3.0 * IQR).

 

Let us take an example:

The data set collected is as below.

1, 2, 2, 3, 3, 4,  5, 5, 10

From the above data, it is clear that the number 10 is abnormal and far away from the other observed data points. Hence the outlier is 10. But this has been decided subjectively.

To confirm the outlier objectively, the IQR method described above is used.

The first quartile of the data set is 2 and the third quartile is 5.

IQR = 3rd quartile – 1st quartile = 5 – 2 = 3.

IQR * 1.5 = 3 * 1.5 = 4.5

Outlier limits:

1st quartile – 4.5 = 2 – 4.5 = -2.5, which means anything less than -2.5 is considered an outlier.

3rd quartile + 4.5 = 5 + 4.5 = 9.5, which means anything greater than 9.5 is considered an outlier. Since 10 is greater than 9.5, it is confirmed as an outlier.
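The arithmetic in this example can be checked with a short sketch. The function names `tukey_fences` and `classify_outliers` are hypothetical, and the quartiles are computed with Python's `statistics.quantiles` using the "inclusive" method, which is an assumption: other quartile conventions can give slightly different fences on other data sets.

```python
from statistics import quantiles

def tukey_fences(values, k):
    """Return (lower, upper) limits at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def classify_outliers(values):
    """Label each outlier as 'mild' (weak) or 'extreme' (strong)."""
    inner_lo, inner_hi = tukey_fences(values, 1.5)
    outer_lo, outer_hi = tukey_fences(values, 3.0)
    labels = []
    for v in values:
        if v < outer_lo or v > outer_hi:
            labels.append((v, "extreme"))
        elif v < inner_lo or v > inner_hi:
            labels.append((v, "mild"))
    return labels

data = [1, 2, 2, 3, 3, 4, 5, 5, 10]
inner = tukey_fences(data, 1.5)   # limits at 1.5 * IQR
outer = tukey_fences(data, 3.0)   # limits at 3.0 * IQR
labels = classify_outliers(data)
print(inner, outer, labels)
```

Under these assumptions the inner limits come out as -2.5 and 9.5, matching the hand calculation, and the outer limits as -7 and 14, so the value 10 is a mild (weak) outlier rather than an extreme (strong) one.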

 

Reasons for outliers:

Here begins your objective analysis to support the subjective judgement about the data. Outliers are often regarded as bad data points, but each one has to be carefully investigated. Without a thorough investigation, there is no valid basis for either removing or keeping an outlier, because it may provide valuable information about the process.

 

Consider the following before removing or retaining an outlier:

Ø  If the outlier is caused by human error, measurement error, calculation error or process error, then accuracy is the problem area. In such cases, omitting the outlier is the right choice. If it is not caused by any of the errors mentioned above and the process itself has such deviations, then the outlier will provide valuable insight into why it occurred, when it occurred and what the main reasons behind it are. This will help us to initiate a project for process improvement.

Ø  If the outlier skews the process average, then it has to be removed. If it does not, then it may be included to get an accurate picture of the process.
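As a small illustration of how an outlier can skew the process average, the sketch below compares the mean and median of the example data set 1, 2, 2, 3, 3, 4, 5, 5, 10 with and without the value 10. The variable names are assumptions made for the example.

```python
from statistics import mean, median

data = [1, 2, 2, 3, 3, 4, 5, 5, 10]
without_outlier = [v for v in data if v != 10]

summary = {
    "mean_with": mean(data),                # pulled upwards by the value 10
    "mean_without": mean(without_outlier),
    "median_with": median(data),            # barely affected by the outlier
    "median_without": median(without_outlier),
}
print(summary)
```

Here the mean drops from about 3.89 to 3.125 when the value 10 is removed, while the median stays at 3. This suggests the median is the more robust measure of the centre when an outlier cannot simply be discarded.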

 

Labelling, accommodation and identification of outliers: issues with regard to outliers

1.       Labelling – Flagging potential outliers for further investigation, for example to check whether they are erroneous data or indicate an abnormal distribution of the data set.

2.       Accommodation of outliers – If the potential outliers are not erroneous, we can modify the statistical tools to account for or accommodate these outliers in the experiment conducted.

3.       Outlier identification – Applying a formal test to identify the outlier.

 

Formal tests differ according to the following characteristics:

1.       What distribution model is assumed?

2.       Is the test designed for a single outlier or for multiple outliers?

3.       If multiple outliers, does the number of outliers have to be specified exactly, or is only an upper bound specified?

 

Statistical tools

1.       Grubbs’ test – For a single outlier

2.       Tietjen-Moore test – Multiple outliers, with the number of outliers specified exactly

3.       Generalized extreme studentized deviate (ESD) test – Multiple outliers, with only an upper bound specified
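As a hedged sketch of the first test in this list, the code below computes only the Grubbs' test statistic G, the largest absolute deviation from the sample mean divided by the sample standard deviation. The function name `grubbs_statistic` is hypothetical, the data is the example set used earlier in this answer, and the comparison with a critical value (which depends on the sample size and the chosen significance level) is deliberately left out. The test also assumes the data is approximately normally distributed.

```python
from statistics import mean, stdev

def grubbs_statistic(values):
    """Return (suspect value, G) for a two-sided Grubbs' test."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    # The suspect is the value farthest from the sample mean.
    suspect = max(values, key=lambda v: abs(v - m))
    return suspect, abs(suspect - m) / s

data = [1, 2, 2, 3, 3, 4, 5, 5, 10]
suspect, g = grubbs_statistic(data)
print(suspect, round(g, 4))
```

The suspect value is declared an outlier only if G exceeds the critical value taken from a Grubbs' table or computed from the t-distribution for the chosen significance level.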

 

Conclusion:

Outliers are often bad data points, but at times they can provide valuable information about the process behaviour. Hence whether to remove or retain outliers for analysis depends on the situation and the distribution of the data. They have to be carefully investigated for the analysis to be meaningful.

 

Thanks

Kavitha

  • 0

OUTLIER

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. Outliers are extreme values that fall a long way outside the other observations. For example, in a normal distribution, outliers may be the values in the tails of the distribution, and an outlier may be identified as the largest value in the data set. Outliers should be investigated carefully. Often they contain valuable information about the process under investigation or about the data gathering and recording process. Before considering the possible elimination of these points, one should try to understand why they appeared and whether similar values are likely to continue to appear.

 

METHODS AND APPROACHES FOR IDENTIFYING OUTLIERS

We always need to be on the lookout for outliers. Sometimes they are caused by errors, and at other times they indicate the presence of a previously unknown phenomenon.

The following are the methods to detect outliers:

1. Extreme value analysis

2. Probabilistic and statistical models

3. Linear models

4. Proximity-based models

5. Information-theoretic models

6. High-dimensional outlier detection

             

  • 0

In statistics, an outlier is a data point that is significantly different from the other data points in a sample. It can occur due to wrong data collection or measurement error, or due to some sudden and temporary variation in the process. Keeping outliers in the analysis may lead to wrong results, hence it is necessary to identify them and remove them from the analysis.

If the data set is represented graphically, an outlier would be far away from the other values. Box plots and probability plots are good tools for screening the data for outliers. Dixon's test may be used to identify a single outlier. Rosner's test helps to identify multiple outliers in a data set.

The IQR method can easily be used to identify outliers. This method sets upper and lower limits beyond which any data point is termed an outlier. In any data set, we first have to calculate the quartiles (the lower quartile is the data point below which 25 percent of the observations fall).
