
# Multicollinearity in Regression Analysis


Multicollinearity is the phenomenon where the independent variables or the predictors (in a Multiple Linear Regression) are correlated with each other.

Regression Analysis is a statistical tool that defines the relationship between two or more variables. It uses data on relevant variables to develop a prediction equation, or model. It generates an equation to describe the statistical relationship between one or more predictors and the response variable and to predict new observations.

An application-oriented question on the topic along with responses can be seen below. The best answer was provided by Mohamed Asif on 8th Aug 2020

Applause for all the respondents - Aritra Das Gupta, Mohamed Asif, Sourabh Nandi, Kanak Roy Chowdhury

Also review the answer provided by Mr Venugopal R, Benchmark Six Sigma's in-house expert.

## Question

Q 286. Why is multicollinearity a problem? How is it detected? What is the right course of action when it is found? Explain with an example.

Note for website visitors - Two questions are asked every week on this platform. One on Tuesday and the other on Friday.

## Recommended Posts


Multicollinearity is a statistical phenomenon. It occurs when several independent variables are highly, though not perfectly, correlated with one another. In this situation the regression results become unreliable.

Consider a Venn-diagram example in which Consumer Price Index and Inflation Index are used to predict the Borrow Rate. There is considerable overlap between Consumer Price Index and Borrow Rate, and substantial overlap between Inflation Index and Borrow Rate. However, because there is also significant overlap between Consumer Price Index and Inflation Index themselves, prediction is only possible from each predictor's unique, non-overlapping contribution:
The unique non-overlapping contribution of Consumer Price Index is area c,
the unique non-overlapping contribution of Inflation Index is area b, and
the shared area a will be lost to standard error.

Why is multicollinearity considered a problem?

• We would not be able to distinguish the individual effects of the independent variables on the dependent variable

Further, correlated independent variables make it hard to draw inferences about individual regression coefficients and their effects on the dependent variable.

As a result, it becomes difficult to reject the null hypothesis in cases where it actually should be rejected.

• Multicollinearity might not affect the accuracy of the model by a lot, but we lose reliability in determining the effects of individual features, and that is a problem for interpretability.

How do we detect multicollinearity?
A scatter plot or a correlation matrix can reveal multicollinearity in the bivariate relationships between variables.

It can also be detected using the Variance Inflation Factor, popularly referred to as VIF.
The VIF score of an independent variable represents how well that variable is explained by the other independent variables.

The closer that R² value is to 1, the higher the VIF and the stronger the multicollinearity of that variable with the others.

VIF = 1 implies no correlation between the independent variable and the other variables
VIF > 5 indicates high multicollinearity
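The VIF calculation described above can be sketched in plain Python. This is an illustration only: the data is made up, and numpy is assumed to be available.

```python
import numpy as np

# Hypothetical data: x1 and x2 are strongly correlated, x3 is independent.
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # near-duplicate of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X):
    """VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing
    predictor j on all the other predictors (with an intercept)."""
    scores = []
    for j in range(X.shape[1]):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        scores.append(1.0 / (1.0 - r2))
    return scores

print(vif(X))  # x1 and x2 get a large VIF, x3 stays near 1
```

In practice this check is done by statistical software (Minitab, statsmodels, etc.); the sketch only shows where the numbers come from.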

Diagnosis and Fix:

• Dropping one of the correlated features can bring down multicollinearity significantly

Drop variables in order of priority, starting with the one that has the highest VIF value

• Combining correlated variables into one and dropping the others

Points to remember before fixing:
Removing multicollinearity is a good option when more importance is given to individual features than to the group of features that jointly impact the response variable

Efficient corrective action to remove multicollinearity requires selectivity and selectivity in turn requires specifics about the nature of the problem.
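The drop-the-highest-VIF fix described above can be sketched as an iterative procedure. The data, the variable names, and the threshold of 5 below are illustrative assumptions, not a prescription.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X."""
    scores = []
    for j in range(X.shape[1]):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - ((y - A @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        scores.append(1.0 / (1.0 - r2))
    return scores

def drop_collinear(X, names, threshold=5.0):
    """Repeatedly drop the predictor with the highest VIF until all
    remaining VIFs fall below the threshold."""
    names = list(names)
    while X.shape[1] > 1:
        scores = vif(X)
        worst = int(np.argmax(scores))
        if scores[worst] < threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Hypothetical data: x1 ≈ x2 (highly correlated), x3 independent.
rng = np.random.default_rng(1)
x1 = rng.normal(size=150)
X = np.column_stack([x1, x1 + rng.normal(scale=0.1, size=150),
                     rng.normal(size=150)])
X_clean, kept = drop_collinear(X, ["x1", "x2", "x3"])
print(kept)  # one of the correlated pair is gone
```

Dropping one variable at a time and recomputing matters, because removing one collinear predictor can sharply lower the VIFs of the others.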


Benchmark Six Sigma Expert View by Venugopal R

What is Multicollinearity?

When we want to study the relationship of one dependent variable with several independent variables (predictors), Multiple Regression is used. The ideal situation is one where the independent variables are chosen such that there is no correlation between any two of them. We say multicollinearity is present when a high correlation exists between any of the chosen independent variables.

Y = β0 + β1X1 + β2X2 + β3X3 + …

In the above equation, we expect a relationship between Y and X1, Y and X2, and so on. However, if we also have a relationship between X1 and X2, or X1 and X3, or X2 and X3, then there is multicollinearity.

Example

Let me quote an example that I worked on as part of a project with a Bank. The parameter that needed to be improved was the ‘Occupancy cost’ incurred by the branches of the bank. Many contributing variables were identified and I will just mention a few that are easy to understand in general, to illustrate the topic of multicollinearity. The data on Occupancy cost for each branch was collected and some of the contributing variables identified were Floor Space, Power consumption, Maintenance expense, Location index, Rent and so on.

We were interested to know the regression relationship between Occupancy cost and these contributing variables and a Multiple Regression model was applied. Out of these variables, it was observed that the ‘Maintenance expense’ is highly correlated to the ‘Floor Space’, and both these variables have been considered as independent ‘X’ factors. Hence there is a problem of ‘multicollinearity’ on the multiple regression model.

How does Multicollinearity affect the regression exercise?

While multicollinearity does not affect the overall multiple regression model, the individual effects of each factor involved in the multicollinearity will be affected.

In the multiple regression equation, β1 denotes the marginal effect on Y of a one-unit increase in X1, holding the other variables constant. Now, if X1 and X2 are highly correlated, it is not practically possible to vary one and hold the other constant! So how does this affect the output of our multiple regression?

The regression output table, like the one we get from Minitab, gives the coefficient, standard error and ‘p’ value for each predictor variable. When we have multicollinearity, the standard errors for the involved variables get blown up and so do the ‘p’ values. This will result in lowering the significance of the individual effects for those variables.

Illustrating from the above example, assume that we find there is significant relationship between Occupancy cost and Floor space, when checked individually. The relationship between Occupancy cost and Maintenance expenses was also found to be significant, upon individual study. However, while performing the multiple regression, both the factors, Floor space and Maintenance expense became insignificant. The correlation between these two predictor variables was found to be very high, above 0.95.

Some of the ways by which the Multicollinearity can be sensed and detected

1. We find a regression coefficient for a variable not significant, though theoretically we expect it to be significant
2. The regression coefficients for a variable change significantly when we add or remove another variable
3. We find a negative correlation coefficient though the response is expected to increase with X, and vice versa
4. The X variables when checked pairwise exhibit high correlation
5. The Variance Inflation Factor (VIF) provided in the Minitab output is an indicator of multicollinearity. Any VIF value greater than 1 indicates some correlation. However, the correlation may be problematic if the VIF is above 5, and a VIF of 10 or above is a clear indication that the regression coefficients are severely impacted by multicollinearity

Actions required, once multicollinearity is detected

1. Check the degree of multicollinearity using VIF. If it is very low, no action would be required.
2. If the model is used only for prediction, and the correlated variables are not of particular interest, no action is required. In our example, if the floor space and maintenance expenses are not the variables that we are interested to work on, we may proceed with the model despite their multicollinearity.
3. Remove highly correlated predictors from the model. If there are two highly correlated predictors, remove one of them, since the other is redundant. The removal decision may be made keeping the highest R² value in mind.
4. Consider replacing the correlated variables with a single variable that gives their combined outcome, if possible. For example, if two predictors, viz. Rent and Floor space, have high correlation, use Rent per sq. ft as a predictor instead.
5. Use Partial Least Squares or Principal Component Analysis, regression methods that reduce the predictors to a smaller set of uncorrelated variables.
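As a rough sketch of point 5, principal components can be obtained from the eigenvectors of the predictors' covariance matrix. The data below is hypothetical; real analyses would typically use Minitab, scikit-learn, or similar rather than this hand-rolled version.

```python
import numpy as np

# Hypothetical predictors: x1 and x2 are highly correlated.
rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)
X = np.column_stack([x1, x2])

# Centre the predictors, then rotate onto the eigenvectors of the
# covariance matrix; the resulting principal-component scores are
# uncorrelated by construction.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
scores = Xc @ eigvecs

# Off-diagonal correlation between components is (numerically) zero.
print(np.corrcoef(scores.T)[0, 1])
```

Regressing Y on the components (instead of the raw, correlated Xs) removes the multicollinearity, at the cost of coefficients that are harder to interpret in the original units.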

What is Multicollinearity?
Multicollinearity is a well-known statistical phenomenon in which two or more predictor variables in a regression model are so strongly related that any one of them can be linearly predicted from the others with a substantial degree of accuracy. It arises frequently in observational studies and less often in designed experiments. An easy way to understand multicollinearity is as follows:

No Multicollinearity:
Picture a Venn diagram in which the X circles do not overlap with each other and each has a mild correlation with Y.

Moderate and Extreme Multicollinearity:
With moderate multicollinearity, there is mild overlap among the predictors. We can still measure each predictor's unique effect on Y, i.e., the non-overlapping sections. The shared portions are not included in the Type III regression coefficients, so the coefficients don't give the full picture of each predictor's effect on Y.

With extreme multicollinearity, the overlap between X1 and X2 becomes so intense that the model develops estimation problems. The model tries to determine each predictor's unique effect on Y, but there isn't enough individual information about X1 and X2 to calculate it. When multicollinearity becomes perfect, the two predictors are confounded: we cannot separate their variance from one another.

Types of Multicollinearity
There are four varieties of multicollinearity:

1. Perfect Multicollinearity – exists when one independent variable in the equation is an exact linear function of one or more of the others.
2. High Multicollinearity – refers to a strong, though not perfect, linear relationship between two or more independent variables.
3. Structural Multicollinearity – caused by the researcher's own model specification, for example by creating new independent variables from ones already in the equation.
4. Data-based Multicollinearity – originates from inadequately designed experiments or purely observational data.

Causes of Multicollinearity
Common causes include the choice of the independent variables themselves; changes in the variables' parameters, such that a small change in a variable has a notable impact on the result; and data collection, i.e., how the sample is drawn from the selected population.

Examples of Multicollinearity

Example #1
Let’s assume that XYZ Ltd, a quality-analysis firm, is hired by a pharma company to provide research services and statistical analysis on diseases in India. XYZ Ltd has selected age, height, weight, health, and profession as the prima facie parameters.

• There is a multicollinearity situation in the above example, since the independent variables selected for the study are directly correlated with one another as well as with the results. Hence it would be advisable for the researcher to adjust the variables before starting the project, since the chosen variables will directly impact the outcomes.

Example #2
Let’s assume that XYZ Ltd has been appointed by Tata Motors to learn in which business category the sales volume of Tata Motors is high.

• In the above example, the independent variables on which the research is to be completed are finalized first. They might be monthly income, age, brand, and socio-economic class. Only data that fits all of these criteria is selected, to figure out how many people could purchase this car (the Tata Nano) without even looking at other vehicles.

Example #3
Let’s assume that XYZ Ltd is appointed to submit a report on how many people under fifty are prone to heart attacks. For this data collection, the parameters are age, sex, and medical records.

• In the above example, multicollinearity has occurred. The independent variable “age” must be restricted to under fifty when inviting applications from the general public, so that persons who are more than fifty years of age are automatically excluded.

Advantages
Below are some of the benefits attributed to multicollinearity:

• A linear relationship between the independent variables in the equation.
• Useful in statistical models and research papers prepared by research-based corporations.
• A direct impact on the desired result.

Disadvantages
Below are some of the disadvantages of multicollinearity:

• In some circumstances, the problem can only be resolved by collecting more data on the variables.
• Incorrect use of model variables, i.e., the researcher may overlook using them when needed.
• Inserting two effectively identical variables in the equation, such as the same weight expressed in two different units.
• Adding a variable to the equation that is a combination of two others.
• The calculations are not effortless; the statistical method requires software or a scientific calculator to execute.

Summing-up
Multicollinearity checks are among the most frequently applied statistical tools in regression analysis and in the statistical analysis of extensive databases. All major corporations have a separate analytics department to perform statistical regression analysis about products or people, to present a strategic outlook of the market to management and help them plan their long-term strategies. A graphical presentation of the report gives the reader a clear idea of the direct relationships, accuracy, and performance.

• If the analyst's goal is to understand the individual independent variables within the equation, then multicollinearity is a significant problem.
• The researcher has to make the desired changes to the variables at stage 0 itself; otherwise they can have a tremendous impact on the results.
• Multicollinearity can also be detected by examining the correlation matrix.
• Remedial measures play a vital role in resolving the problems of multicollinearity.

Multicollinearity is a problem that can be faced by anyone who is running a regression analysis. Regression analysis gives us a mathematical equation between the dependent and independent variables, which helps us to predict the dependent variable.

In regression analysis we assess the coefficients to understand the relative influence of each independent variable on the dependent variable.

One of the important steps in regression is identifying the various factors which have an impact on our Y (dependent variable). It is therefore very critical that we pick independent factors which are not related to each other.

An example might be an organisation that wants to improve its NPS score. Some of the independent factors identified might be:

1. First Call Resolution
2. Average Speed of Answer
3. Communication skills of the agents
4. Knowledge of the agents
5. Price of product
6. Repeat Calls

In the above example, while the other factors are fine, FCR and Repeat Calls are one and the same: FCR means First Call Resolution, and Repeat Calls is the number of customers who did not get first call resolution. In other words, they are two sides of the same coin.

An important thing to remember is that while multicollinearity will not affect the overall regression equation, it will make it difficult to assess whether an individual independent factor has an impact on the dependent factor.

Multicollinearity can be identified in 2 ways:

1. Coefficient of Multiple Determination – This is often represented by R² and measures the variation in a variable that is explained by the other variables.
If each variable k in turn is regressed on the remaining variables, we can calculate the coefficient of multiple determination (R²k) for every one of the K variables.
If R²k is 0, variable k isn't correlated with the other variables. If R²k is greater than 0.75, there is high multicollinearity.

2. Variance Inflation Factor – The variance inflation factor is computed for each of the explanatory variables:

VIFk = 1 / (1 - R²k)

Here VIFk is the variance inflation factor for variable k, and R²k is the coefficient of multiple determination for variable k.

If VIFk > 4 there is multicollinearity, and if VIFk > 10 there is a very high degree of multicollinearity.
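A quick numeric check of this formula (the R² values are hypothetical):

```python
# VIF_k = 1 / (1 - R²_k): the VIF blows up as R²_k approaches 1.
for r2 in (0.0, 0.75, 0.90, 0.95):
    print(r2, "->", round(1 / (1 - r2), 1))
```

So an R²k of 0.75 already corresponds to a VIF of 4 (the threshold above), and R²k = 0.90 gives a VIF of 10, the very-high-multicollinearity mark.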

How To Remove Multicollinearity:

1. Redesign the study to avoid multicollinearity
2. Increase the sample size
3. Remove one or more of the highly correlated explanatory variables
4. Define a new variable equal to a linear combination of the highly-correlated variables.


Multicollinearity is a phenomenon that happens during multiple regression analysis. In multiple regression analysis, the dependent variable (Y) depends on a number of independent variables (X1, X2, X3, etc.).

Every independent variable has its own separate influence on Y, and the regression equation reflects their individual effects in a combined manner in the form Y = a + bX1 + cX2 + dX3.

These independent variables are not dependent on each other. Therefore, X1, X2, X3 etc will influence only Y & not themselves.

This equation implies that, when X1 changes by 1 unit, Y changes by “b” units while X2 and X3 remain constant.

Similarly, when X2 changes by 1 unit, Y changes by “c” & X1 & X3 will remain constant & so on.

On the other hand, if the variables are selected in such a manner that they are related to each other, then a change in one independent variable will change another independent variable instead of leaving it constant. This creates problems in interpreting the behaviour of the individual independent variables, and the phenomenon is known as multicollinearity.

If multicollinearity occurs, the following problems may arise:

i) Despite a high R², one or more independent variables may be non-significant (p-value generally > 0.05), suggesting the variable has no relation with Y

ii) The sign of an independent variable's coefficient may differ from real-life understanding

iii) Standard error will be large

Detection:

i) Multicollinearity can be suspected from the above 3 signs

ii) From the VIF (variance inflation factor) value, which equals 1/(1 - R²), where R² is obtained by regressing that independent variable on the others. If VIF > 5: high multicollinearity. If 1 < VIF < 3: moderate multicollinearity, which can be ignored

iii) From the r values between each pair of independent variables: if the r values are very high (close to 1), multicollinearity may be present
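Point iii, scanning the pairwise correlation matrix, might look like this in Python (the data is hypothetical and the 0.9 cut-off is illustrative):

```python
import numpy as np

# Hypothetical predictors: 'spend' closely tracks 'income', 'age' is unrelated.
rng = np.random.default_rng(3)
income = rng.normal(50, 10, size=120)
spend = 0.8 * income + rng.normal(scale=2.0, size=120)
age = rng.normal(40, 8, size=120)

X = np.column_stack([income, spend, age])
r = np.corrcoef(X.T)                  # 3x3 pairwise correlation matrix

# Flag any pair of predictors whose |r| exceeds the cut-off.
pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(r[i, j]) > 0.9]
print(pairs)   # only the income/spend pair is flagged
```

A limitation worth noting: pairwise r values can miss multicollinearity that involves three or more variables jointly, which is why the VIF check above is the safer diagnostic.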

Course of action when detected:

i) Conduct a regression analysis considering each independent variable in turn as Y and the rest as Xs, and calculate the VIF as mentioned above for each regression

ii) Identify the independent variable (the one in the position of Y) with a high VIF (the Minitab analysis directly shows the high VIF)

iii) Remove the independent variable with the high VIF and conduct the regression once again

iv) In Minitab, stepwise regression analysis can be chosen with α = 0.05 to remove the variable directly

v) If F is significant, then the overall model is significant, and the equation can be used if the effects of the individual coefficients need not be analysed. Even so, it is better to remove the correlated variables and conduct the analysis again

Example:

The compressive strength of a finished good can be controlled by 3 independent variables, e.g. moisture in the material, grade of the raw material, and the pressure at which the product was produced. The values are shown below:

| Strength | Moisture Content | Grade | Pressure |
|----------|------------------|-------|----------|
| 85 | 2.20 | 50 | 80 |
| 89 | 2.10 | 50 | 80 |
| 89 | 2.10 | 54 | 85 |
| 89 | 2.00 | 54 | 85 |
| 75 | 2.49 | 40 | 64 |
| 70 | 2.60 | 40 | 64 |
| 92 | 1.89 | 55 | 88 |
| 70 | 2.55 | 40 | 65 |
| 95 | 1.95 | 55 | 89 |
| 84 | 2.15 | 50 | 79 |
| 84 | 2.20 | 50 | 80 |
| 80 | 2.20 | 48 | 77 |
| 75 | 2.40 | 45 | 72 |
| 92 | 1.95 | 55 | 89 |
| 97 | 1.89 | 56 | 90 |
| 95 | 1.90 | 56 | 90 |

Regression analysis with effect of multicollinearity:

Regression Analysis: Strength versus Moisture Content, Grade, Pressure

Analysis of Variance

Source              DF   Adj SS   Adj MS     F-Value   P-Value

Regression           3   1118.42   372.808    92.21    0.000 F is significant, overall model significant

Moisture Content   1    16.82    16.821     4.16     0.064

Grade              1     0.71    0.708      0.18     0.683

Pressure           1     2.53    2.530      0.63     0.444

Model Summary

S    R-sq   R-sq(adj)  R-sq(pred)

2.01068   95.84%     94.80%      92.06% high R² with high R-sq (adj) & strong predictability

Coefficients

Term               Coef   SE Coef   T-Value   P-Value     VIF

Constant           113.7     50.5     2.25    0.044

Moisture Content   -24.5     12.0    -2.04    0.064   30.49

Grade              -0.47     1.12    -0.42    0.683   157.37 variable (s) non-significant (>0.05), high VIF

Pressure           0.598    0.755     0.79    0.444   181.53 variable (s) non-significant (>0.05), high VIF

Regression Equation

Strength = 113.7 - 24.5 Moisture Content - 0.47 Grade + 0.598 Pressure   As per the data, if a higher grade is used, higher strength is obtained, whereas the sign in the equation is reversed (-0.47 × Grade), i.e., for a 1-unit increase in grade, predicted strength reduces by 0.47.

Regression analysis without the effect of multicollinearity:

Regression Analysis: Strength versus Moisture Content, Grade, Pressure

Stepwise Selection of Terms

α to enter = 0.05, α to remove = 0.05

Analysis of Variance

Source              DF   Adj SS   Adj MS    F-Value   P-Value

Regression           1   1113.68   1113.68   292.74    0.000   F significant, overall model is significant

Moisture Content   1   1113.68   1113.68   292.74    0.000

Model Summary

S    R-sq     R-sq(adj)  R-sq(pred)

1.95047   95.44%     95.11%      94.18%         The standard error is less than the previous one (2.01)

high R² with high R-sq (adj) & strong predictability

Coefficients

Term                Coef  SE Coef  T-Value  P-Value   VIF

Constant          163.11     4.59    35.55    0.000

Moisture Content  -36.12     2.11   -17.11    0.000  1.00 variable significant (<0.05), low VIF

Regression Equation

Strength = 163.11 - 36.12 Moisture Content
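As a cross-check, the VIFs and the stepwise model above can be reproduced outside Minitab. The sketch below (numpy assumed) recomputes them from the same 16 rows.

```python
import numpy as np

# Strength, Moisture Content, Grade, Pressure (data from the table above)
data = np.array([
    [85, 2.20, 50, 80], [89, 2.10, 50, 80], [89, 2.10, 54, 85],
    [89, 2.00, 54, 85], [75, 2.49, 40, 64], [70, 2.60, 40, 64],
    [92, 1.89, 55, 88], [70, 2.55, 40, 65], [95, 1.95, 55, 89],
    [84, 2.15, 50, 79], [84, 2.20, 50, 80], [80, 2.20, 48, 77],
    [75, 2.40, 45, 72], [92, 1.95, 55, 89], [97, 1.89, 56, 90],
    [95, 1.90, 56, 90],
])
y, X = data[:, 0], data[:, 1:]

def vif(X):
    """VIF_j = 1/(1 - R²_j) from regressing column j on the others."""
    out = []
    for j in range(X.shape[1]):
        t = X[:, j]
        A = np.column_stack([np.ones(len(t)), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(A, t, rcond=None)
        r2 = 1 - ((t - A @ b) ** 2).sum() / ((t - t.mean()) ** 2).sum()
        out.append(1 / (1 - r2))
    return out

print([round(v, 1) for v in vif(X)])   # all three far above 5

# Simple regression on Moisture Content alone (the stepwise result):
A = np.column_stack([np.ones(len(y)), X[:, 0]])
intercept, slope = np.linalg.lstsq(A, y, rcond=None)[0]
print(round(intercept, 2), round(slope, 2))
```

The recomputed coefficients land close to the Minitab output above (intercept ≈ 163, slope ≈ -36), and the VIFs confirm that all three predictors are entangled with one another.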


This question had four parts -

• Why is Multicollinearity a problem? - Sourabh, Asif and Kanak have answered this
• How is it detected? - Sourabh, Aritra, Asif and Kanak have answered this
• What is the right course of action when it is found - Aritra, Asif and Kanak have answered this
• Example - Sourabh, Aritra, Asif and Kanak have provided this

Mohd Asif is the winner today, having responded to all parts of the question correctly.

One must go through Sourabh's answer to see types and examples, Aritra's answer to see a good example and action options, Kanak's answer for courses of action and a very good worked out example.

Do go through the response by Benchmark Expert Venugopal. His response covers everything very well.
