
Frequently Asked Interview Questions for MBBs_OLS Regression (Basics)



Dear All,

These are a few frequently asked Master Black Belt interview questions on Ordinary Least Square Regression Basics. Let's get answers to these questions:

  • What does VIF signify? Are high values of VIF desirable?
  • What does Durbin Watson statistic help us infer?
  • What is the difference between R-sq, Adjusted R-Sq, and Predicted R-Sq?
  • What does PRESS assess?
  • What do we mean by unusual observations?
  • What do Leverage values, Cook's Distance and Mahalanobis distance help us identify?
  • What are the assumptions in Ordinary Least Square Regression?
Everybody is invited to post answers to these questions.

Regards,

shantanu kumar


Well, let me make an attempt.

The assumptions of OLS regression are:

1. The model is linear

2. The data are random

3. The expected value of the errors is 0

4. The residuals are independently distributed and have the same variance.

Coming to the 4th question: R-sq gives a measure of how well the regression line fits the actual data. It lies between 0 and 1, where 1 indicates the best model, i.e. one able to predict the trend. Adjusted R-sq is a modification of R-sq: unlike R-sq, adjusted R-sq increases only if the new term improves the model more than would be expected by chance. It will always be less than or equal to R-sq.

That's all from me... over to you now!

Regards, Shalini


Just did some searching and found out about predicted R-sq, which goes like this:

Predicted R-squared is used in regression analysis to indicate how well the model predicts responses for new observations, whereas R-squared indicates how well the model fits your data. Predicted R-squared can prevent overfitting the model and can be more useful than adjusted R-squared for comparing models because it is calculated using observations not included in model estimation. Overfitting refers to models that appear to explain the relationship between the predictor and response variables for the data set used for model calculation but fail to provide valid predictions for new observations.

Predicted R-squared ranges between 0 and 100% and is calculated from the PRESS statistic. Larger values of predicted R-squared suggest models of greater predictive ability.

For example, you work for a financial consulting firm and are developing a model to predict future market conditions. The model you settle on looks promising because it has an R-squared of 87%. However, when you calculate the predicted R-squared you see that it drops to 52%. This may indicate an overfitted model and suggests that your model will not predict new observations nearly as well as it fits your existing data.

So, with the above explanation, can I infer that it is always wiser to check the predicted R-sq value than the adjusted R-sq?


Hi everyone

I would like to answer "What does VIF signify? Are high values of VIF desirable?" only for now.

VIF stands for variance inflation factor.

It is optional but advisable to check the VIF option while doing MLR. VIF indicates multicollinearity among the factors/predictors in the model. The higher the VIF value, the higher the correlation among the Xs, which is not a desirable state of affairs.

Therefore a higher VIF is not desirable.

Normally, Xs with a VIF value higher than 4 need to be checked, and one of the correlated pair should be removed from the model. The decision as to which one to remove can be based on process knowledge and experience.

What is the difference between R-sq, Adjusted R-Sq, and Predicted R-Sq?

The difference between R-sq and adjusted R-sq is nicely explained by Shalini. I would like to throw some more light on it.

If we add dummy factors (Xs) to the regression model, R-sq may keep increasing, but adjusted R-sq will not increase unless the new factor genuinely improves the model. This is the reason why we refer to the adjusted R-sq value and not R-sq.

Regards

sanjay bishnoi


Hi

The assumptions are

The model is linear in its coefficients, the data are random, the average of the residuals is zero, there is no correlation within the residuals, the residuals have constant variance, the residuals pass a normality test, and there is no correlation among the input variables (no multicollinearity).

VIF signifies the presence of multicollinearity, that is, correlation among the independent variables or Xs. We do not allow VIF > 5, we operate with caution if it is between 1 and 5, and there is no concern if it is close to 1.

The Durbin-Watson statistic indicates whether the residuals are autocorrelated.

R-Sq, Adjusted R-Sq, and Predicted R-Sq have already been answered above.

A smaller value of the PRESS statistic indicates that the model has stronger predictive ability.

An unusual observation is one that does not fit the model well. It usually guides us to a root cause, as it is not well explained by the linear model.

Cook's distance helps us identify the effect of deleting a given observation.

Hope I pass the Quiz

Srinivas

MBB


Dear All,

Apologies for the late response.

Shalini, Sanjay and Srinivas have provided answers to most questions. It is slightly tough to explain these concepts in a post; however, I will still try. To know more about these concepts, you can attend the MBB workshop.

Regards,

shantanu kumar

Q1: What are the assumptions in Ordinary Least Square Regression?

Answer: The assumptions in Ordinary Least Square Regression are:

  • Model is linear in parameters
  • The data are a random sample of the population
  • The errors are statistically independent from one another
  • The expected value of the errors is always zero
  • The independent variables are not strongly collinear
  • The residuals have constant variance
  • The errors are normally distributed

And of course the independent variables are measured precisely
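Below is a minimal sketch, assuming Python with numpy, scipy and statsmodels, of how some of these assumptions (zero-mean errors, normality of errors, constant variance of residuals) can be checked on a fitted model. The data and coefficients are made up purely for illustration.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated data purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # three predictors
y = 2 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(size=100)

X_const = sm.add_constant(X)                        # model is linear in parameters
model = sm.OLS(y, X_const).fit()
resid = model.resid

# Expected value of the errors is zero
print("Mean of residuals (should be ~0):", resid.mean())

# Errors are normally distributed
stat, p_norm = shapiro(resid)
print("Shapiro-Wilk p-value (normality):", p_norm)

# Residuals have constant variance (homoscedasticity)
lm_stat, p_bp, f_stat, p_f = het_breuschpagan(resid, X_const)
print("Breusch-Pagan p-value (constant variance):", p_bp)
```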

Q2: What does VIF signify? Are high values of VIF desirable?

Answer: The easiest way to understand what VIF signifies is to understand how it is calculated. Let's say that we have four independent variables X1, X2, X3, X4 and one dependent variable Y. To calculate the VIF of X1, we first regress X1 on the remaining independent variables. We get an R-square value and then use the following formula: VIF(X1) = 1/(1 - R-square of X1)

  • VIF = 1 would mean 0% of the variation in X1 is explained by the remaining independent variables. There is no multicollinearity - IT IS GOOD
  • VIF = 5 or more would mean 80% or more of the variation in X1 is explained by the remaining independent variables - IT IS NOT GOOD - high multicollinearity

Therefore the VIF of X1 will be high when the R-square of X1 regressed on X2, X3, and X4 is high, i.e. when the remaining variables are strongly correlated with X1.

Multicollinearity refers to an independent variable (predictor) that is correlated with other independent variables (predictors). We follow the same procedure for the remaining independent variables (predictors).

High values of VIF are undesirable as they indicate multicollinearity. If the independent variables are strongly correlated with each other, the regression coefficients will be imprecise.
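A minimal sketch of this calculation, assuming Python with numpy and statsmodels; the predictors below are simulated, with x2 deliberately made correlated with x1, so its VIF comes out high.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=200)    # deliberately correlated with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Manual VIF for X1: regress X1 on the remaining Xs, then VIF = 1 / (1 - R-square)
others = sm.add_constant(X[:, 1:])
r2_x1 = sm.OLS(X[:, 0], others).fit().rsquared
print("VIF(X1), manual:", 1.0 / (1.0 - r2_x1))

# Same calculation via statsmodels for every predictor
X_const = sm.add_constant(X)
for i, name in enumerate(["X1", "X2", "X3"], start=1):
    print(name, "VIF:", variance_inflation_factor(X_const, i))
```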

Q3. What does Durbin Watson statistic help us infer?

Answer: As mentioned earlier, one of the assumptions of OLS regression is that the errors (residuals) are independent of each other. If the errors (residuals) are correlated, predictors may appear to be significant when they are not.

The Durbin-Watson test checks for the presence of autocorrelation in the errors (residuals) by determining whether or not the correlation between two adjacent error terms is zero.

We compare the Durbin-Watson statistic to the lower and upper bounds in the Durbin-Watson table.

  • If Durbin Watson Statistic < lower bound, we infer positive auto-correlation in residuals
  • If Durbin Watson Statistic > upper bound, we infer no positive auto-correlation in residuals
  • If lower bound<Durbin Watson Statistic<upper bound, we consider the test to be inconclusive
  • If 4-Durbin Watson Statistic<lower bound, we infer negative auto-correlation in residuals
  • If 4-Durbin Watson Statistic>upper bound, we infer no negative auto-correlation in residuals
  • If lower bound<(4-Durbin Watson Statistic)<upper bound, we consider the test to be inconclusive

Note: the Durbin-Watson statistic should only be used when the data are in a meaningful (e.g. time) order.
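A minimal sketch, assuming Python with numpy and statsmodels, of computing the Durbin-Watson statistic on the residuals of a fitted model; the lower and upper bounds still have to be read from a Durbin-Watson table for the chosen sample size and number of predictors. The errors below are simulated with deliberate positive autocorrelation.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = np.arange(100, dtype=float)

# Illustrative autocorrelated errors: each error depends on the previous one
e = np.zeros(100)
for t in range(1, 100):
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 5 + 0.3 * x + e

model = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(model.resid)
print("Durbin-Watson statistic:", dw)   # values well below 2 suggest positive autocorrelation
```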

Q4. What is the difference between R-sq, Adjusted R-Sq, and Predicted R-Sq?

Answer: R square is the coefficient of determination. It indicates how much variation in the response is explained by the model.

Rsquare= 1-(SSError/SSTotal)

R-Square Adjusted: In our effort to increase the R-square value, we may add unnecessary predictors to our model. Adjusted R-square accounts for the number of predictors in the model, so we should use it for comparing models with different numbers of predictors.

R square Adjusted= 1-(MS Error/MS Total)

Predicted R-Square: It is calculated from PRESS (Predicted Residual Sum of Squares). Larger values of predicted R-square suggest models of greater predictive ability.

Predicted R-square = 1 - (PRESS/SS Total)

Predicted R-squared can prevent over-fitting the model and can be more useful than adjusted R-squared for comparing models because it is calculated using observations not included in model estimation. Over-fitting refers to models that appear to explain the relationship between the predictor and response variables for the data set used for model calculation but fail to provide valid predictions for new observations.
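A minimal sketch, assuming Python with numpy and statsmodels, showing that the R-square and adjusted R-square formulas above reproduce the values statsmodels reports; the data are simulated purely for illustration. (Predicted R-square is demonstrated with the PRESS sketch under Q5.)

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = 1 + X @ np.array([2.0, -1.0]) + rng.normal(size=50)

X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()

n, p = X_const.shape                                       # p includes the constant
ss_total = np.sum((y - y.mean()) ** 2)
ss_error = np.sum(model.resid ** 2)

r2 = 1 - ss_error / ss_total                               # 1 - (SS Error / SS Total)
r2_adj = 1 - (ss_error / (n - p)) / (ss_total / (n - 1))   # 1 - (MS Error / MS Total)

print(r2, model.rsquared)        # should match
print(r2_adj, model.rsquared_adj)
```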

Q5. What does PRESS assess?

Answer: PRESS stands for the Predicted Residual Sum of Squares procedure, as described by Weisberg (1985). Models are repeatedly estimated using data sets of n - 1 observations, each time omitting a different observation from calibration and using the estimated model to generate a predicted value for the deleted observation.

Lower values of PRESS are desired; a lower PRESS means a higher predicted R-square.
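A minimal sketch of the PRESS procedure and predicted R-square, assuming Python with numpy and statsmodels; the explicit leave-one-out loop mirrors the description above, and the hat-matrix shortcut gives the same value without refitting. The data are simulated for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(40, 2)))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=40)

n = len(y)
press = 0.0
for i in range(n):
    keep = np.arange(n) != i                      # omit observation i from calibration
    fit = sm.OLS(y[keep], X[keep]).fit()
    pred_i = float(X[i] @ fit.params)             # predict the deleted observation
    press += (y[i] - pred_i) ** 2

full_fit = sm.OLS(y, X).fit()
h = full_fit.get_influence().hat_matrix_diag      # leverage values
press_shortcut = np.sum((full_fit.resid / (1 - h)) ** 2)

ss_total = np.sum((y - y.mean()) ** 2)
print("PRESS (leave-one-out):", press)
print("PRESS (shortcut):", press_shortcut)        # should match the loop
print("Predicted R-sq:", 1 - press / ss_total)
```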

Q6. What do we mean by unusual observations?

Answer: An unusual observation is a single observation that can have a large effect on the results. Data can be unusual in the following ways:

  1. Outlier: an observation with a large residual (i.e. far from the regression line).
  2. Leverage point: explained below.

Q7. What do Leverage values, Cook's Distance and Mahalanobis distance help us identify?

Answer: Leverage values identify cases whose X values give them more potential to influence the results than the others. Observations with large leverage values may exert considerable influence on the fitted values, and thus on the regression model.

Leverage values fall between 0 and 1. As a rule of thumb, it is advised that we investigate observations with leverage values greater than 3c/n, where c is the number of model coefficients (including the constant) and n is the number of observations.

Cook's Distance: A measure of how much the residuals of all the cases would change if a particular case were excluded from the calculation of the regression coefficients. A large Cook's distance indicates that excluding a case from computation of the regression statistics changes the coefficients substantially.

  • measures the influence of a particular observation on the entire model
  • lowest possible value is 0
  • higher value means the observation is more influential

Mahalanobis Distance: A measure of how much a case's values on the independent variables differ from the average of all cases. A large Mahalanobis distance identifies a case as having extreme values on one or more of the independent variables.
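A minimal sketch, assuming Python with numpy and statsmodels, of computing all three diagnostics for a simulated data set in which one case has been made deliberately extreme.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2))
X[0] = [6.0, -6.0]                                 # one deliberately extreme case
y = 1 + X @ np.array([1.0, 0.5]) + rng.normal(size=60)

X_const = sm.add_constant(X)
influence = sm.OLS(y, X_const).fit().get_influence()

leverage = influence.hat_matrix_diag               # leverage values, between 0 and 1
cooks_d, _ = influence.cooks_distance              # larger => more influential

# Mahalanobis distance of each row of X from the mean of all rows
centered = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
mahal = np.sqrt(np.einsum("ij,jk,ik->i", centered, inv_cov, centered))

n, c = X_const.shape                               # c = number of coefficients incl. constant
print("Leverage cut-off 3c/n:", 3 * c / n)
print("Case 0 leverage, Cook's D, Mahalanobis:", leverage[0], cooks_d[0], mahal[0])
```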

Key

  • SSTotal -Sum of Squares Total
  • SSE - Sum of Squares Error
  • MSTotal - Mean Square Total
  • MSE - Mean Square Error
  • PRESS - Predicted Residual Sum of Squares
  • OLS - Ordinary Least Square Regression


Hi All,

Sorry, I forgot to mention the references. Here they are:

  • Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill
  • Linear Least Squares Computations by R. W. Farebrother
  • Applied Regression Analysis, Linear Models, and Related Methods by John Fox
  • Numerical Methods for Least Squares Problems by Åke Björck
  • Weisberg, S. 1985. Applied Linear Regression, 2nd ed. New York: John Wiley and Sons.

Regards,

shantanu kumar

