Benchmark Six Sigma Forum

Message added by Mayank Gupta,

Influential Observations (also called Unusual Observations) are data points that have a disproportionate impact on a regression or ANOVA model. It is important to identify such points because they can produce misleading results (e.g. an unusual observation can cause a significant coefficient to seem insignificant).

Influential observations can be either, or both, of the following:
Leverage points, which are extreme in the x-direction
Outliers (large residuals), which are extreme in the y-direction relative to the fitted regression line

 

An application-oriented question on the topic along with responses can be seen below. The best answer was provided by Rahul Garg on 30th May 2021.

 

Applause for all the respondents - Gowtham Prabu, Rahul Garg, Sandhya Venu, Ajit Pathania, Guru Saran, Eka Pillai, Archana Handa.

Featured Replies

Q 369. Regression analysis identifies the best fit line. However, leverage points and large residual points can influence the fit of the line. Together, they are known as influential or unusual observations. Explain both leverage points and large residual points with examples. Why is it important to analyze these points?

 

 

Note for website visitors - Two questions are asked every week on this platform. One on Tuesday and the other on Friday.

Solved by RahulGarg

If many points have large residuals relative to the regression line, the data do not fit the prediction equation well, and we should not rely on the result, since the equation may be missing significant factors.

  • Solution

An influential point is a point that has a large impact on the regression analysis, and an outlier is a point with a large residual. Interestingly, these are not the same thing. A point can be an outlier without being influential. A point can be influential even if it is not an outlier. A point can also be both, or neither.

 

An outlier is a data point whose response Y does not follow the trend of the rest of the data. A data point is considered an outlier only if it is extreme with respect to the other Y values, not the X values. A data point with an extreme X value is said to have high leverage. A data point is influential if it strongly affects any part of the regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results. Note that outliers and high leverage data points have the potential to be influential, but we need to investigate further to determine whether or not they actually are.

 

Let's look at some scenarios that should help clarify the distinction between these two types of extreme values, i.e. outliers and leverage points.

 

Let's take the Dependent Variable (Y) as Marks Obtained in Exam and the Independent Variable (X) as Hours Studied.

As we know, the equation of a line is represented as below:

Y = mX + C

Y - Dependent Variable i.e. Marks Obtained, m - Slope of the line or rate of change, X - Independent Variable i.e. Hours Studied, and C - Intercept on the Y axis

H0 - X (Hours Studied) has no effect on Y (Marks Obtained), i.e. m = 0.
H1 - X (Hours Studied) has an effect on Y (Marks Obtained), i.e. m ≠ 0.
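
As a rough sketch of how the slope m and intercept C of the line above are obtained, the least-squares formulas can be written in a few lines of plain Python. The function name `fit_line` and the hours/marks figures are made up for illustration; they are not the data behind the plots in this answer.

```python
def fit_line(x, y):
    """Least-squares fit of Y = m*X + C; returns (m, c)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sxx: spread of X around its mean; Sxy: how X and Y move together
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    c = y_bar - m * x_bar
    return m, c


# Hypothetical data: marks lie exactly on Y = 5*X + 7
hours = [1, 2, 3, 4, 5]
marks = [12, 17, 22, 27, 32]

m, c = fit_line(hours, marks)
print(f"slope m = {m:.2f}, intercept C = {c:.2f}")
# prints: slope m = 5.00, intercept C = 7.00
```

Testing H0 then amounts to asking whether the fitted m is far enough from zero relative to its standard error.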

Scenario 1 : 
As per the definitions above, do you think there are any outliers, leverage points or influential observations in the plot below?

image.png.4a9549acc4cb818eea86b58e92d955a0.png

Definitely not. All of the data points follow the general trend of the rest of the data, so there are no outliers (in the Y direction). None of the data points is extreme with respect to X either, so there are no high leverage points. To conclude, none of the data points appears to be influential with respect to the location of the best fitting line. So, the more I study, the higher my scores.

 

Scenario 2 : 
Do you see any outliers, high leverage points or influential observations in the plot below?

image.png.649d06d8af3a3ec4549614f31d878bf9.png

 

Definitely yes. Because the red data point does not follow the trend of the rest of the data, it would be considered an outlier. Here, although I studied for fewer hours, my scores were quite high. Such a scenario must be analysed to find out what exactly happened in that instance: perhaps better concentration, an easy exam, or nothing outside the part of the syllabus I had studied. However, this data point does not have an extreme X value, so it does not have high leverage.

Is the red data point influential? An easy way to determine whether a data point is influential is to draw the best fitting line twice: first with the red data point included and then with it excluded. The following graph illustrates the two best fitting lines:

 

image.png.6119ae3ddf160a9ca3950532b4927b3e.png


Great, it's hard to even tell the difference between the two estimated regression equations! The solid line represents the estimated regression equation with the red data point included, while the dotted line represents the estimated regression equation with the red data point excluded. The slopes of the two lines are also very similar, i.e. 5.04 and 5.12 respectively.
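
The "fit the line twice" check can be sketched in plain Python as below. The helper names `fit_line` and `compare_fits` and the data are hypothetical (the plotted data set is not given in the post); the unusual point is placed at the mean of X to mimic an outlier with no leverage.

```python
def fit_line(x, y):
    """Least-squares fit of Y = m*X + C; returns (m, c)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    return m, y_bar - m * x_bar


def compare_fits(x, y, index):
    """Fit with all points, then again with the point at `index` removed."""
    with_point = fit_line(x, y)
    x_rest = x[:index] + x[index + 1:]
    y_rest = y[:index] + y[index + 1:]
    without_point = fit_line(x_rest, y_rest)
    return with_point, without_point


# Hypothetical data: nine points on Y = 5*X + 7, plus one unusually
# high score (60 marks) at X = 5, the middle of the X range.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 5]
marks = [12, 17, 22, 27, 32, 37, 42, 47, 52, 60]

(m_in, c_in), (m_out, c_out) = compare_fits(hours, marks, index=9)
print(f"with point:    m = {m_in:.3f}, C = {c_in:.3f}")
print(f"without point: m = {m_out:.3f}, C = {c_out:.3f}")
```

In this idealized data the slope is the same in both fits and only the intercept moves (from 7 to 9.8), which is the pattern described for Scenario 2: an outlier in Y near the centre of X barely tilts the line.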

Do the two samples yield different results when testing H0 : m = 0? Well, we get the following output when the red data point is included in this data set :

image.png.c10bc42bfd216b26519bbe5f0af3917b.png

and the following output when red data point is excluded from the data set :

image.png.288e14c80854eaddbed8812bd045ccc6.png

There certainly are some minor side effects of including the red data point, but none is serious.

As we can see, the R^2 value has decreased slightly, but the relationship between Y and X still looks strong.

The standard error (SE) of m, which is used in calculating the confidence interval for m, is larger when the red data point is included, which increases the width of the confidence interval. The standard error depends on the mean squared error (MSE), which summarizes the squared differences between the observed and predicted responses. The standard error increases because the red data point is an outlier in the Y direction, not because the data point is influential.
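
To make the link between MSE and the standard error of the slope concrete, here is a minimal sketch using the standard simple-regression formula SE(m) = sqrt(MSE / Sxx), with MSE = SSE / (n - 2). The function name and the data are assumptions for illustration only, not the figures from the output screenshots.

```python
import math


def slope_standard_error(x, y):
    """Standard error of the slope in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    c = y_bar - m * x_bar
    # SSE: squared gaps between observed and predicted responses
    sse = sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    return math.sqrt(mse / sxx)


# Hypothetical data scattered closely around Y = 5*X + 7
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9]
marks = [12, 18, 21, 28, 31, 38, 41, 48, 51]

se_without = slope_standard_error(hours, marks)
# Add one Y-outlier (60 marks) in the middle of the X range
se_with = slope_standard_error(hours + [5], marks + [60])

print(f"SE without outlier: {se_without:.3f}")
print(f"SE with outlier:    {se_with:.3f}")
```

The outlier leaves Sxx unchanged here, so the whole increase in the standard error comes from the larger MSE.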

 

In each case, the p value for testing H0: m = 0 is less than 0.001, so there is sufficient evidence at the 0.05 level to conclude that, in the population, X is related to Y.

Therefore, the predicted responses, estimated slope coefficients and hypothesis test results are not materially affected by the inclusion of the red data point, and the data point is not deemed influential. In a nutshell, the red data point is not influential and does not have high leverage, but it is definitely an outlier.

 

Scenario 3 :
Now, let's look at the scenario below for outliers and leverage points.

image.png.736b4100ff0696cb6512a809d10c7506.png

Here, the red data point follows the general trend of the rest of the data. Therefore, it is not deemed an outlier. However, this point has an extreme X value, so it has high leverage. Here, I studied for substantially more hours than on other days, and my scores were also high in that proportion, following the same trend. Such a scenario must again be analysed to understand why I suddenly studied for so many more hours than normal; the reasons may be an important exam, less prior preparation in the subject, a difficult exam pattern, etc.

Now, is the red data point influential? It certainly appears to be far away from the rest of the data (in the X direction), but is that sufficient to make the data point influential?

 

The following plot depicts the two best fitting lines; one obtained when the red data point is included and another when the red data point is excluded:

image.png.d11e974522ca1a20f1f6e293053c2dc2.png

Again, it is difficult to tell the two regression lines apart. The solid line represents the estimated regression equation with the red data point included, while the dotted line represents the estimated regression equation with the red data point excluded. The slopes of the two lines are also very similar, i.e. 4.927 and 5.117 respectively.

 

Do the two samples yield different results when testing H0: m = 0? Well, we obtain the following output when the red data point is included in the data set :

image.png

 

and the following output when red data point is excluded from the data set:

image.png.8624f950f24223ecddae48a2a6824cbe.png

So we see here that there are hardly any side effects from including the red data point:

The R^2 value has hardly changed, increasing slightly from 97.3% without the red data point to 97.7% with it. In both cases, the relationship between Y and X looks strong.

The standard error of m is also about the same in each case, i.e. 0.172 when the red data point is included and 0.200 when it is excluded. Therefore, the width of the confidence intervals is largely unaffected by the presence of the red data point. This is because the data point is not an outlier that heavily impacts MSE. In each case, the p value for testing H0: m = 0 is less than 0.001. In either case, we can conclude that there is sufficient evidence at the 0.05 level that, in the population, X is related to Y.

Therefore, the predicted responses, estimated slope coefficients, and hypothesis test results are not materially affected by the inclusion of the red data point, and the data point does not appear to be influential. To summarize, the red data point is neither influential nor an outlier; however, it has high leverage.

 

Scenario 4 :

Let's look at the last scenario below for any outliers and leverage points.

image.png.83b76ea08137ac8b7d8dcbd8db65a48d.png

Bingo, the red data point is most certainly an outlier and also has high leverage! The red data point does not follow the trend of the rest of the data, and it also has an extreme X value. Here, although I studied for a substantially higher number of hours, my scores did not increase in that proportion; rather, they came down. As a general observation, such a point will definitely influence my regression analysis. Therefore, in this case the red data point is certainly influential. The following plot shows the two best fitting lines, one obtained with the red data point included and one with it excluded:

image.png.aef80e9a519a076ecdbd8877e9f3d53b.png

The two lines are (not surprisingly) substantially different. The solid line represents the estimated regression equation with the red data point included, while the dashed line represents the estimated regression equation with the red data point excluded. The presence of the red data point significantly reduces the slope of the regression line, dropping it from 5.117 to 3.320.

 

Do the two samples yield different results when testing H0: m = 0? Well, we obtain the following output when the red data point is present :

 

image.png.3539786995ed98395786223ed63f31bc.png
and the following output when the red data point is not present :

image.png.adf25116f970cafaccfefbd010db10ef.png

Here the R^2 value has decreased substantially, from 97.32% to 55.19%. Therefore, if we include the red data point, we conclude that the relationship between Y and X is only moderately strong, whereas if we exclude the red data point, we conclude that the relationship between Y and X is very strong.
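
For readers who want to see how one point can pull R^2 down this way, below is a small sketch using R^2 = 1 - SSE/SST. The function name and the data are invented for illustration (a point with an extreme X and a low Y), so the numbers will not match the 97.32% and 55.19% quoted above.

```python
def r_squared(x, y):
    """R^2 of a simple least-squares line: 1 - SSE / SST."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    c = y_bar - m * x_bar
    sse = sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


# Hypothetical data scattered closely around Y = 5*X + 7
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9]
marks = [12, 18, 21, 28, 31, 38, 41, 48, 51]

r2_without = r_squared(hours, marks)
# One point far out in X (25 hours) that scores well below the trend
r2_with = r_squared(hours + [25], marks + [40])

print(f"R^2 without the point: {r2_without:.1%}")
print(f"R^2 with the point:    {r2_with:.1%}")
```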

 

The standard error of m is also almost 3.5 times larger when the red data point is included, increasing from 0.200 to 0.686. This increase would have a substantial effect on the width of our confidence intervals too. Again, the increase is because the red data point is an outlier in the Y direction.

 

In each case, the p value for testing H0: m = 0 is less than 0.001. In both cases, we can conclude that there is sufficient evidence at the 0.05 level that, in the population, X is related to Y, as most of the data points support it. Note, however, that the t-statistic decreased dramatically from 25.55 to 4.84 upon inclusion of the red data point. (The t-statistic is the estimated slope divided by its standard error, so it drops when the slope shrinks and the standard error grows, as happens here when the red data point is included.)
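
As a quick arithmetic check, the t-statistic for the slope is the estimated slope divided by its standard error. Plugging in the rounded slopes and standard errors quoted in this scenario approximately reproduces the reported t values; the small gaps come from rounding of the inputs. The function name is just for illustration.

```python
def t_statistic(slope, standard_error):
    """t-statistic for testing H0: m = 0."""
    return slope / standard_error


# Rounded figures quoted in Scenario 4
t_excluded = t_statistic(5.117, 0.200)  # red data point excluded
t_included = t_statistic(3.320, 0.686)  # red data point included

print(f"t without the red point: {t_excluded:.1f}")  # reported as 25.55
print(f"t with the red point:    {t_included:.1f}")  # reported as 4.84
```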

 

Here, the predicted responses and estimated slope coefficients are clearly affected by the presence of the red data point. While the data point does not change the conclusion of the hypothesis test, the t-statistic did change considerably (the greater the magnitude of t, the greater the evidence against the null hypothesis). In this case, the red data point has high leverage, is an outlier, and turns out to be influential too.

 

Summary

In the above scenarios, through the use of simple plots, we have highlighted the distinction between outliers and high leverage data points. There were outliers in scenarios 2 and 4. There were high leverage data points in scenarios 3 and 4. However, only in scenario 4 did the data point, which was both an outlier and a high leverage point, turn out to be influential. That is, not every outlier or high leverage data point strongly influences the regression analysis.

Therefore, it is our duty as analysts to determine whether the regression analysis is unduly influenced by one or more data points. If these are incorrect observations or values, we can delete or ignore them. However, if they are genuine observations, we must study them in detail, as they are a kind of special cause and indicate that something special happened in that particular instance. E.g. although I studied for more hours, my scores were not high but rather dipped (Scenario 4); maybe I did not study the relevant subject or topics, maybe my concentration was not good on that particular day, or maybe the exam questions did not come from the part of the syllabus I had studied. So it is always advisable to study these points in detail, so that interesting or unusual causes can surface and decisions can be taken accordingly on how to treat them in further analysis.

 

Edited by RahulGarg
Formatting error and missed a point of question.

Influential Observations:

 

These are observations that we should not ignore when drawing conclusions from a calculation or formula, because deleting or excluding them may noticeably change the result.

Influential observations are often also outliers, so there is every chance that we overlook their influence.

In regression analysis, failing to consider influential observations can have a huge impact on the parameter estimates.

 

Let us consider the scatter plot of Y vs X in the example below, where we intend to estimate a regression model.

The first equation that we estimate is Y = 25.69 + 0.99X; however, when we exclude the outlier and re-estimate, the equation changes to Y = 22.33 + 2.35X. It is clear that this data point has a huge influence on the equation. An influential observation has a disproportionate effect on the least squares model; it is a data point that has a high impact on the slope coefficient and the Y intercept.

Reference :  20/05/2015 by Cincinnati State Statistics
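
To get a feel for how different the two equations quoted above are, one can compare their predictions at a few X values. The X values below are arbitrary picks for illustration, not points from the referenced data set.

```python
def predict(intercept, slope, x):
    """Predicted Y for a line Y = intercept + slope * X."""
    return intercept + slope * x


with_point = (25.69, 0.99)     # equation estimated with the outlier
without_point = (22.33, 2.35)  # equation estimated without it

for x in (5, 10, 20):  # arbitrary X values
    y_with = predict(*with_point, x)
    y_without = predict(*without_point, x)
    print(f"X = {x:>2}: with = {y_with:.2f}, without = {y_without:.2f}, "
          f"gap = {y_without - y_with:.2f}")

gap_at_10 = predict(*without_point, 10) - predict(*with_point, 10)
```

Because the slopes differ (0.99 vs 2.35), the gap between the two predictions keeps widening as X grows; at X = 10 it is already 10.24.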

 

 image.png.28208d5a9639803e7072511967266c13.png         

image.png.dd40fe71b6c0d7e46841085a6b61c2f4.png

    

 

Not all outliers are influential by default, so it is always advisable to check and confirm whether an outlier is an influential observation before considering removing it from the calculations. When an outlier noticeably changes the slope of the regression line, it is called an influential observation.

Linear regression is a linear approach to modelling the relationship between a dependent output variable (y) and one or more explanatory independent variables (x). When there is a single input variable (x), the method is called simple linear regression. When there are multiple input variables, it is referred to as multiple linear regression. Linear regression consists of finding the best-fitting straight line through the points. The best-fitting line is called a regression line.

 

 

A leverage point is a data point whose independent variable (x) value is unusual. Leverage is a measure of how far a point's x value deviates from the mean of x. If the dependent variable (y) value of a leverage point follows the predicted regression line, the point looks fine and has little impact on the coefficients. If a high leverage point does not follow the trend, it can have a large effect on the estimates of the regression coefficients.
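
Leverage can be put into numbers. For simple linear regression, the leverage of point i is h_i = 1/n + (x_i - x̄)² / Sxx, which depends only on the x values. The sketch below uses made-up x values, and the cutoff of 3p/n (p = 2 coefficients) is one common rule of thumb, assumed here rather than taken from the post.

```python
def leverages(x):
    """Leverage h_i of each point in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Hypothetical x values: nine typical points and one far to the right
x_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 25]

h = leverages(x_values)
p = 2                          # intercept and slope
cutoff = 3 * p / len(x_values)  # rule-of-thumb threshold (assumption)
flagged = [i for i, hi in enumerate(h) if hi > cutoff]

for xi, hi in zip(x_values, h):
    print(f"x = {xi:>2}: leverage = {hi:.3f}")
print("high leverage at index:", flagged)
```

Note that y never enters the calculation: leverage only says a point could pull the line strongly, not that it actually does.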

 

 

In the attached scatterplot of y vs x, the red data point follows the general trend of the rest of the data. Therefore, it is not considered an outlier here. However, this point has an extreme x value, so it has high leverage. It still fits the best fit line.

 

 

 

image.png.8446bc04134f9eb9cf054175ba3d5124.png

 

A residual is the difference between the actual, observed value and the predicted value (based on the regression equation). Points with large residuals in a linear regression are outliers. A residual is a deviation from the fitted line, not from the sample mean. An outlier is an observation whose dependent variable (y) value is unusual given its value on the independent predictor variables (x).

 

In the attached scatterplot, point (2,8) is 4 units above the best fit line. This is a large residual point. Point (6,7) also stands apart from the other points, but it is closer to the best fit line, so its residual is smaller.
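
The residual calculation itself is a one-liner: observed value minus predicted value. The fitted line is only shown in the figure and not stated in the post, so the line y = 2 + x below is purely an assumption, chosen so that point (2,8) sits 4 units above it as described.

```python
def residual(x, y_observed, intercept, slope):
    """Observed y minus the y predicted by the fitted line."""
    return y_observed - (intercept + slope * x)


# Assumed line for illustration only: y = 2 + 1*x
intercept, slope = 2.0, 1.0

r_large = residual(2, 8, intercept, slope)  # point (2,8)
r_small = residual(6, 7, intercept, slope)  # point (6,7)

print(f"residual at (2,8): {r_large}")
print(f"residual at (6,7): {r_small}")
```

Under this assumed line, (2,8) has a residual of 4 and (6,7) a residual of -1; a positive sign means the point is above the line and a negative sign means it is below.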

 

image.png.7ff666f358ea27a8e75c484116a1d418.png

Let us quickly review a few definitions that are important for understanding this topic.

Regression Analysis: Estimates the relation between continuous variables: one dependent variable (Y variable) and one or more independent variables (X variables). The relation can be shown as an equation by drawing the best fit line through the data.

Residual: A residual is calculated as the vertical distance from the data point to the best fit line. The farther the data point is from the best fit line in the Y direction, the bigger the residual.

Outlier: Outliers are points that fall far away from the rest of the data points.

Influential Points: Removing these points from the data will alter the estimates and the regression equation to a great degree.

Leverage Points: A leverage point is an observation that has an unusual predictor (X) value.

 

To explain in detail, I took hypothetical sample data of height (independent variable) and corresponding weight (dependent variable) of people, and plotted the best fit line using Minitab. I intentionally added a data point that is far away from the other data to demonstrate the outlier effect.

 

Scenario #1: The outlier is far away from the other data points. It is almost at the middle of the X data range and far away in the Y direction, so this point has a large residual. The regression equation with the outlier included is Y_Weight = 61.93 + 0.1441 X_Height. R-Square (R-Sq) is 11.7%.

S0.thumb.JPG.fb672f3d7007612f3cb39c8d98af06d6.JPG

To understand the effect of this outlier on the regression equation, let us remove the outlier data point and plot the new trend line and equation.

 

Scenario #2

s2.thumb.JPG.f2f12db2f3fa46e0f9ddccb5f7123c88.JPG

 

The new equation is Y_Weight = 60.46 + 0.1480 X_Height.

Surprisingly, there is no big difference in the constant value or the slope. We can say that the outlier is not an influential point, as it has a negligible effect on the estimates and the regression equation. R-Sq improves because the fit improves after removing the outlier.

 

Scenario #3: Now let us take the same example with an extreme point added in the X direction. Again, we consider this an outlier, as it is far away from the rest of the data points.

s3.thumb.JPG.bb8787f9a0145c381983baabb3295434.JPG

The new equation is Y_Weight = 41.16 + 0.2633 X_Height.

There is a remarkable change in the constant and slope values compared to Scenario #2. The constant changed from 60.46 to 41.16. The slope increased from 0.148 to 0.263. Here the outlier is an influential point, as there is a remarkable change in the regression equation. R-Sq improved further, as the new data point added another level in the X direction, thereby improving the discrimination.

Points with extreme values of X have high leverage. In the first example, the outlier was somewhere in the middle of the X data range, so it had less leverage. In Scenario #3, the outlier is far away from the X data range and also at some distance from the trend line. A leverage point is not influential if it lies near the regression line.

It is very common to find this kind of data point in our daily analysis, and I cannot emphasize enough how important it is to identify these unusual observations and take appropriate action. There are numeric measures available for leverage and influence. For example, Cook's Distance measures how much the parameter estimates change if a point is removed. I will not get into these details, as that would need a much bigger explanation.
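
For anyone curious, Cook's Distance for a simple linear regression can be sketched in plain Python with the usual formula D_i = (e_i² / (p · MSE)) · h_i / (1 - h_i)², where e_i is the residual, h_i the leverage and p = 2 the number of coefficients. The data below is made up (it is not the Minitab height/weight data), and treating D_i > 1 as "worth a look" is a common rule of thumb, assumed here.

```python
def cooks_distances(x, y):
    """Cook's distance for each point of a simple linear regression."""
    n, p = len(x), 2  # p = number of coefficients (intercept and slope)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    c = y_bar - m * x_bar
    residuals = [yi - (m * xi + c) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        (e * e / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, lev)
    ]


# Hypothetical data: nine points near Y = 5*X + 7 and one point with an
# extreme X (25) whose Y (40) is far below the trend.
x_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 25]
y_data = [12, 18, 21, 28, 31, 38, 41, 48, 51, 40]

d = cooks_distances(x_data, y_data)
most_influential = max(range(len(d)), key=d.__getitem__)

for i, di in enumerate(d):
    print(f"point {i}: Cook's D = {di:.2f}")
print("largest Cook's D at index", most_influential)
```

The measure combines both ingredients discussed above: a large residual alone or a high leverage alone gives a modest value, while the two together give a very large one.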

 

In my experience and learning the following quick points can help.

1.      First of all, be aware that these points exist. A simple scatter plot of X and Y data can show any outliers present.

2.      If there is a lot of unusual data, say more than 5%, and it is influencing the equation, then it is better to go back to the process and check it. Control charts are the best tools to find unusual points in a parameter live and analyse them at the time they occur. If the regression analysis is based on secondary data, then it is better to avoid those results if the influencers are not explainable.

3.      One or few unusual points should not impact your overall results. Careful consideration is important.

4.      Many times, if a data point is extremely far away from the data range, it is an entry mistake or a data capture issue. Correct the entries by matching them with the source data. If the source data is not available, remove the data point rather than guessing.

5.      Remove unusual data and see if there is any influence on results. If the influence is negligible, then leave the original data set as is.

6.      If the unusual data exist as a random phenomenon, then try to find and collect more data near that X value. Many times, this is not feasible.

7.      Usually, Y outliers are less severe than X outliers. Check whether these data points are just random phenomena and not big influencers.

8.      Show your regression results with and without unusual data. Let the collective wisdom take the final call.

9.      Show your results confined to a particular range of X data. Any omission of influencer data should be mentioned with reasons in your reports.

While doing regression analysis to arrive at the best fit line, we may observe some data points that look unusual with respect to their y-value or x-value.

Instead of being called x- or y-unusual observations, they are categorized as leverage points or outliers, and they are considered potential influential points, as they can affect the outcome of the regression model.

These unusual observations can occur in the x-value or the y-value.

It is interesting to note that an x-influential point makes the scope of the regression analysis too wide, so the model is considered less accurate. An x-outlier is rare, and when it occurs it may heavily impact the regression results.

An observation is normally considered an outlier if the absolute value of its residual is high.

For instance, the data point in row #6 has a very high residual compared to the other data points of the data set. Generally, a point with a high absolute value for any of these diagnostic statistics is considered an outlier. The absolute values of the other diagnostic statistics, viz. adjusted residuals, standardized residuals (SRES) and deleted residuals (TRES), are also seen to be much higher for the data point in row #6.

246716456_forumquestiononregressionpic1.png.2b461e26e923403bfd7c3740983db44b.png

To determine statistically whether a point is influential, we use the DFITS and Cook's Distance measures.

If the absolute value of DFITS exceeds 1 for small to medium data sets, the point is considered influential to the fit of the regression line.
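
A minimal sketch of the DFITS calculation (also written DFFITS) for a simple linear regression is given below. It uses the standard formula DFITS_i = t_i · sqrt(h_i / (1 - h_i)), where t_i is the deleted (externally studentized) residual and h_i the leverage. The data is hypothetical, not the data set shown in the screenshot, and the function name is my own.

```python
import math


def dfits(x, y):
    """DFITS for each point of a simple linear regression."""
    n, p = len(x), 2  # p = number of coefficients (intercept and slope)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    c = y_bar - m * x_bar
    residuals = [yi - (m * xi + c) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in residuals)
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    values = []
    for e, h in zip(residuals, lev):
        # MSE of the fit obtained with this point left out
        mse_deleted = (sse - e * e / (1 - h)) / (n - p - 1)
        t_deleted = e / math.sqrt(mse_deleted * (1 - h))
        values.append(t_deleted * math.sqrt(h / (1 - h)))
    return values


# Hypothetical data: nine points near Y = 5*X + 7 and one point with an
# extreme X (25) whose Y (40) is far below the trend.
x_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 25]
y_data = [12, 18, 21, 28, 31, 38, 41, 48, 51, 40]

values = dfits(x_data, y_data)
flagged = [i for i, v in enumerate(values) if abs(v) > 1]

for i, v in enumerate(values):
    print(f"point {i}: DFITS = {v:.2f}")
print("influential by the |DFITS| > 1 rule:", flagged)
```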

Significance of a leverage point, shown by the following example

Let us look at the relationship between muscle mass and power. In the following study, most of the individuals weigh around 200 pounds and one person weighs 400 pounds. This extreme observation will dominate the fitted relationship more than any of the other individuals.

1000297457_forumquestiononregressionpic2.png.df0dfb9c8fda7b0d0d0dd1a7f96f4692.png

Leverage Point in a Regression Analysis( pic at right )

 

889937282_forumquestiononregressionpic3.png.084d66959461634caff932978cff6fc5.png

Influential Point shown in a Regression Analysis

 

 

We can notice that leverage points normally make the range of the fitted regression relationship very broad, and therefore the conclusions of the study could be misleading.

As a general note, a model fitted over a wider range is considered less accurate than one fitted over a narrower range, so narrower models are preferred for more accurate results.

 

General note:

Many times, the data sets that we consider as x and y have a linear, incremental relationship; when the process is continuous and the data are in chronological order, the correlation appears stronger. However, the impact of influential points would still be critical.

Take YouTube as a scenario: a particular video has views, likes, comments and subscriptions, which are chronological in nature and clearly influence one another. From a common man's angle, if a video is viewed more, the likes can increase; if the likes increase, the views may increase, as a person may watch it again without registering another like; the more someone watches it, the more likely that person is to comment; and if the video is liked extremely well, the person may subscribe to the channel.

But we can still find outliers and influential points: if videos with views in the thousands are compared with videos with millions of views, we can see the impact.

 

What is Leverage Point?

Leverage points are observations made at extreme or outlying values of the independent (x) variables, such that the lack of neighbouring observations means the fitted regression model will pass close to that particular data point.

 

Suppose there is a regression line fitted to a given dataset.

Assume there is an extra data point, an outlier that is far away from the main cluster of the data but lies somewhere along that regression line when extended.

If the regression line is refitted with this point included, the coefficients won't change. Likewise, removing the extra point would have no influence on the coefficients.

Therefore, the outlier or leverage point would have zero influence if it were perfectly consistent with the rest of the data and the model.

 

Example:

It's easy to demonstrate that a high leverage point may not be influential in the case of a simple linear model.  Refer to the model below:

image.png.f83bf3e851595fc8cc49bc6808be5f60.png

The blue-line is a regression line on the basis of the original dataset, the red-line ignores the point at the top right of the plot.

This point matches the definition given for a high leverage point as it is far away from the rest of the data. Therefore, the regression line has to pass close to that point. But since its position largely fits the pattern observed in the original dataset, the other model would predict it very well and it is therefore not particularly influential.

Compare this description to the scatterplot given below:

 

image.png.0425298c4e3f3dfe0fc6664f34ff2a67.png

Here, the point on the right of the plot is still a high leverage point, but this time it does not really fit the pattern observed in the original dataset. The blue line passes very close to it but the red line does not. Including or excluding this one additional data point completely changes the parameter estimates; hence, it has a lot of influence on the final regression line.

 

image.png.f3e4d2b5a091ad54c95a3447ace1b399.png

In this final example, the observation on the bottom right has a (relatively) large influence on the fit of the model, but it is still far away from the regression line.

 

 

 

This was a tricky question and definitely a topic that everyone enjoys :)

 

There are some excellent answers (read answers by Guru Saran and Eka Pillai), however the one answer that has dissected the question neatly and looked into all aspects of influential observations is given by Rahul. Hence his answer has been selected as the winner.
