Prasoon Bhargav - Posted October 3, 2009

Dear All, these are a few frequently asked Master Black Belt interview questions on Ordinary Least Squares Regression basics. Let's get answers to these questions:

- What does VIF signify? Are high values of VIF desirable?
- What does the Durbin-Watson statistic help us infer?
- What is the difference between R-sq, Adjusted R-sq, and Predicted R-sq?
- What does PRESS assess?
- What do we mean by unusual observations?
- What do leverage values, Cook's Distance and Mahalanobis distance help us identify?
- What are the assumptions in Ordinary Least Squares Regression?

Everybody is invited to post answers to these questions.

Regards, shantanu kumar

shalu1512 - Posted October 9, 2009

Hi Shantanu, seeing these questions I feel I have still not earned my Black Belt. All of the above (except the 4th, to a certain extent) look unfamiliar and out of syllabus. Request you to open the suspense now by answering these questions, please.

Rgds, Shalini

Suresh Jayaram - Posted October 11, 2009

Hi Shalini, don't worry; these are advanced topics that are covered in MBB training. Most of the questions here refer to Multiple Linear Regression, which is not covered in BB training; only Simple Linear Regression is. You should be able to answer questions 1 and 4 based on BB training.

Best Regards,
SJ

shalu1512 - Posted October 11, 2009

Well, let me make an attempt. The assumptions of OLS regression are:

1. The model is linear.
2. The data are random.
3. The expected value of the error is 0.
4. The residuals are independently distributed with the same variance.

Coming to the 4th question: R-sq gives a measure of how well the regression line fits the real data. It lies between 0 and 1, where 1 indicates the best model, able to predict the trend. Adjusted R-sq is a modification of R-sq: unlike R-sq, adjusted R-sq increases only if a new term improves the model more than would be expected by chance. It will always be less than or equal to R-sq. That's all from me... over to you now!

Regds, Shalini
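The R-sq and Adjusted R-sq comparison above can be sketched in a few lines of Python. The data below are made up purely for illustration, not taken from any project in this thread:

```python
# Minimal sketch of R-sq and adjusted R-sq for simple linear regression,
# using made-up illustration data.

def fit_simple_ols(x, y):
    """Least-squares slope and intercept for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

def r_squared(x, y, b0, b1):
    """R-sq = 1 - SS Error / SS Total."""
    my = sum(y) / len(y)
    ss_total = sum((yi - my) ** 2 for yi in y)
    ss_error = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_error / ss_total

def adjusted_r_squared(r2, n, p):
    """Penalise R-sq for the number of predictors p (p = 1 here)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_simple_ols(x, y)
r2 = r_squared(x, y, b0, b1)
adj = adjusted_r_squared(r2, len(x), 1)
print(round(r2, 3), round(adj, 3))  # 0.6 0.467
```

As Shalini says, adjusted R-sq is always at or below R-sq; the gap widens as predictors are added without real explanatory gain.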

shalu1512 - Posted October 11, 2009

Just did some searching and found out about Predicted R-sq, which goes like this: Predicted R-squared is used in regression analysis to indicate how well the model predicts responses for new observations, whereas R-squared indicates how well the model fits your data. Predicted R-squared can prevent overfitting the model and can be more useful than adjusted R-squared for comparing models because it is calculated using observations not included in the model estimation. Overfitting refers to models that appear to explain the relationship between the predictor and response variables for the data set used for model calculation but fail to provide valid predictions for new observations. Predicted R-squared ranges between 0 and 100% and is calculated from the PRESS statistic. Larger values of predicted R-squared suggest models of greater predictive ability.

For example, you work for a financial consulting firm and are developing a model to predict future market conditions. The model you settle on looks promising because it has an R-squared of 87%. However, when you calculate the predicted R-squared, you see that it drops to 52%. This may indicate an overfitted model and suggests that your model will not predict new observations nearly as well as it fits your existing data.

So, with the above explanation, can I infer that it is always wiser to check the predicted R-sq value rather than the adjusted R-sq?
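The description of Predicted R-sq above can be made concrete with a small leave-one-out sketch in Python. The data are made up for illustration; PRESS is computed here by literally refitting the model with each observation dropped:

```python
# Sketch of predicted R-sq via the PRESS statistic on made-up data:
# drop each observation, refit, and score the prediction of the left-out point.

def fit_simple_ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b1 * mx, b1

def press_statistic(x, y):
    """Sum of squared leave-one-out prediction errors."""
    press = 0.0
    for i in range(len(x)):
        xr, yr = x[:i] + x[i+1:], y[:i] + y[i+1:]
        b0, b1 = fit_simple_ols(xr, yr)        # refit without point i
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    return press

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
my = sum(y) / len(y)
ss_total = sum((yi - my) ** 2 for yi in y)
press = press_statistic(x, y)
pred_r2 = 1 - press / ss_total
# PRESS is always at least as large as SS Error, so predicted R-sq never
# exceeds R-sq; for a poorly predicting model it can even be negative.
print(round(press, 3), round(pred_r2, 3))  # 7.282 -0.214
```

On this tiny made-up set the ordinary R-sq is 0.6 but the predicted R-sq is negative, which is exactly the overfitting warning described in the post.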

Prasoon Bhargav - Posted October 11, 2009

Hi Shalini, good responses on Adjusted R-sq and Predicted R-sq. I will post answers to all the questions next week.

Regards, shantanu kumar

shalu1512 - Posted November 10, 2009

Dear Shantanu, a response to the above is awaited.

Regds, Shalini

sanjay.shakti - Posted December 1, 2009

Hi everyone, I would like to answer "What does VIF signify? Are high values of VIF desirable?" only for now. VIF stands for Variance Inflation Factor. It is optional but advisable to check the VIF option while doing MLR. VIF indicates the multicollinearity among the factors/predictors in the model. The higher the VIF value, the higher the correlation among the Xs, which is not a desirable state of affairs. Therefore, a high VIF is not desirable. Normally, Xs with a VIF value higher than 4 need to be checked, and one of the two correlated predictors should be removed from the model. The decision as to which one to remove can be based on process knowledge and experience.

What is the difference between R-sq, Adjusted R-sq, and Predicted R-sq? The difference between R-sq and Adjusted R-sq is nicely explained by Shalini. I would like to throw some more light on it: if we add dummy factors (Xs) to the regression model, then R-sq may increase but Adjusted R-sq will not. This is the reason why we refer to the Adjusted R-sq value and not R-sq.

Regards,
sanjay bishnoi
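Sanjay's point that VIF measures correlation among the Xs can be illustrated with a minimal Python sketch. With only two predictors, the auxiliary regression of X1 on the other X reduces to simple linear regression, so its R-square is just the squared correlation between the two Xs. The data below are made up for illustration:

```python
# Sketch of VIF for a two-predictor model (made-up data). For two Xs,
# R-sq of the auxiliary regression = squared correlation(X1, X2),
# so VIF = 1 / (1 - r^2).

def correlation(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return sab / (saa * sbb) ** 0.5

def vif_two_predictors(x1, x2):
    r2 = correlation(x1, x2) ** 2   # R-sq of X1 regressed on X2
    return 1 / (1 - r2)

x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 11]        # almost a multiple of x1: near-collinear
x3 = [5, 3, 8, 1, 6]         # unrelated to x1

print(round(vif_two_predictors(x1, x2), 1))  # 122.0 -> severe multicollinearity
print(round(vif_two_predictors(x1, x3), 1))  # 1.0   -> no multicollinearity
```

With more than two predictors the same idea applies, but the auxiliary regression becomes a multiple regression of each X on all the others.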

shalu1512 - Posted December 13, 2009

Dear Sanjay, thanks for bringing further clarity on R-sq and Adjusted R-sq. Kindly let us know where we find the option for checking the VIF value in MINITAB.

CT Srinivas - Posted January 28, 2010

Hi, the assumptions are: the model is linear in its coefficients; randomness in all predictors; the average of the residuals should be zero; no correlation within the residuals; the residuals have constant variance; the residuals pass a normality test; and no correlation within the input variables (multicollinearity).

VIF signifies the presence of multicollinearity, that is, correlation within the independent variables or Xs. We don't allow VIF > 5; we operate with caution if it is below 5; we ignore it if it is close to 1.

The Durbin-Watson statistic indicates whether the residuals are autocorrelated.

R-sq, Adjusted R-sq and Predicted R-sq have been answered above.

A smaller value of the PRESS statistic indicates the model has stronger predictive ability.

An unusual observation is one that does not fit the model; it usually guides us to a root cause.

Cook's distance helps us identify the effect of deleting a given observation.

Hope I pass the quiz.
Srinivas, MBB
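The Durbin-Watson statistic mentioned above has a simple formula: the sum of squared successive differences of the residuals divided by the sum of squared residuals. A minimal sketch on made-up residual series:

```python
# Sketch of the Durbin-Watson statistic on made-up residual series.
# Values near 2 suggest no autocorrelation, near 0 positive autocorrelation,
# near 4 negative autocorrelation.

def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

trending = [1, 1, 1, -1, -1, -1]      # long runs: positive autocorrelation
alternating = [1, -1, 1, -1, 1, -1]   # sign flips: negative autocorrelation

print(round(durbin_watson(trending), 2))     # 0.67 (well below 2)
print(round(durbin_watson(alternating), 2))  # 3.33 (well above 2)
```

In practice the statistic is compared with tabulated lower and upper bounds rather than with these rough landmarks.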

CT Srinivas - Posted January 28, 2010

In the Regression dialog, under Options, check "Display VIF".

Prasoon Bhargav - Posted March 10, 2010

Dear All, apologies for the late response. Shalini, Sanjay and Srinivas have provided answers to most questions. It is slightly tough to explain these concepts in a post; however, I am still trying. To know more about these concepts, you can attend the MBB workshop.

Regards, shantanu kumar

Q1: What are the assumptions in Ordinary Least Squares Regression?

Answer: The assumptions in Ordinary Least Squares Regression are:

- The model is linear in parameters
- The data are a random sample of the population
- The errors are statistically independent of one another
- The expected value of the errors is always zero
- The independent variables are not strongly collinear
- The residuals have constant variance
- The errors are normally distributed
- And of course, the independent variables are measured precisely

Q2: What does VIF signify? Are high values of VIF desirable?

Answer: The easiest way to understand what VIF signifies is to understand how it is calculated. Let's say that we have four independent variables X1, X2, X3, X4 and one dependent variable Y. To calculate the VIF of X1, we first regress X1 on the remaining independent variables. We get an R-square value and then use the following formula:

VIF(X1) = 1/(1 - R-square of X1)

- VIF = 1 would mean 0% of the variation in X1 is explained by the remaining independent variables. There is no multicollinearity - this is good.
- VIF = 5 or more would mean 80% or more of the variation in X1 is explained by the remaining independent variables - this is not good; it indicates high multicollinearity.

Therefore the VIF of X1 will be high when the R-square of X1 regressed on X2, X3, and X4 is high, i.e. when the remaining variables are strongly correlated with X1. Multicollinearity refers to an independent variable (predictor) that is correlated with other independent variables (predictors). We follow the same procedure for the remaining independent variables (predictors).
High values of VIF are undesirable because they indicate multicollinearity. If independent variables are strongly correlated with each other, the regression coefficients will be imprecise.

Q3: What does the Durbin-Watson statistic help us infer?

Answer: As mentioned earlier, one of the assumptions of OLS regression is that the errors (residuals) are independent of each other. If the errors (residuals) are correlated, predictors may appear to be significant when they aren't. The Durbin-Watson test checks for the presence of autocorrelation in the errors (residuals) by determining whether or not the correlation between two adjacent error terms is zero. We compare the Durbin-Watson statistic to the lower and upper bounds in the Durbin-Watson table:

- If the Durbin-Watson statistic < lower bound, we infer positive autocorrelation in the residuals
- If the Durbin-Watson statistic > upper bound, we infer no positive autocorrelation in the residuals
- If lower bound < Durbin-Watson statistic < upper bound, we consider the test inconclusive
- If 4 - Durbin-Watson statistic < lower bound, we infer negative autocorrelation in the residuals
- If 4 - Durbin-Watson statistic > upper bound, we infer no negative autocorrelation in the residuals
- If lower bound < (4 - Durbin-Watson statistic) < upper bound, we consider the test inconclusive

Note: the Durbin-Watson statistic should be used when the data are in a meaningful (e.g. time) order.

Q4: What is the difference between R-sq, Adjusted R-sq, and Predicted R-sq?

Answer: R-square is the coefficient of determination. It indicates how much variation in the response is explained by the model:

R-square = 1 - (SS Error/SS Total)

R-square Adjusted: In our effort to increase the R-square value, we may add unnecessary predictors to our model. R-square adjusted accounts for the number of predictors in the model. We should use it for comparing models with different numbers of predictors:

R-square Adjusted = 1 - (MS Error/MS Total)

Predicted R-square: It is calculated from PRESS (Predicted Residual Sum of Squares).
Larger values of Predicted R-square suggest models of greater predictive ability:

Predicted R-square = 1 - (PRESS/SS Total)

Predicted R-square can prevent over-fitting the model and can be more useful than adjusted R-square for comparing models because it is calculated using observations not included in the model estimation. Over-fitting refers to models that appear to explain the relationship between the predictor and response variables for the data set used for model calculation but fail to provide valid predictions for new observations.

Q5: What does PRESS assess?

Answer: PRESS is the Predicted Residual Sum of Squares procedure, as described by Weisberg (1985). Models are repeatedly estimated using data sets of n - 1 observations, each time omitting a different observation from calibration and using the estimated model to generate a predicted value of the predictand for the deleted observation. Lower values of PRESS are desired; a lower value of PRESS means a higher Predicted R-square.

Q6: What do we mean by unusual observations?

Answer: An unusual observation is a single observation that can have a large effect on the results. Data can be unusual in the following ways:

- Outlier: an observation with a large residual (i.e. far from the regression line)
- Leverage: explained below

Q7: What do leverage values, Cook's Distance and Mahalanobis distance help us identify?

Answer: Leverage values identify cases whose X values give them more potential to influence the results than the others. Observations with large leverage values may exert considerable influence on the fitted values, and thus on the regression model. Leverage values fall between 0 and 1. As a rule of thumb, it is advised that we investigate observations with leverage values greater than 3c/n, where c is the number of model coefficients (including the constant) and n is the number of observations.
Cook's Distance: A measure of how much the residuals of all the cases would change if a particular case were excluded from the calculation of the regression coefficients. A large Cook's Distance indicates that excluding a case from the computation of the regression statistics changes the coefficients substantially.

- It measures the influence of a particular observation on the entire model
- The lowest possible value is 0
- A higher value means the observation is more influential

Mahalanobis Distance: A measure of how much a case's values on the independent variables differ from the average of all cases. A large Mahalanobis distance identifies a case as having extreme values on one or more of the independent variables.

Key:
- SS Total: Sum of Squares Total
- SSE: Sum of Squares Error
- MS Total: Mean Square Total
- MSE: Mean Square Error
- PRESS: Predicted Residual Sum of Squares
- OLS: Ordinary Least Squares Regression
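The leverage and Cook's distance definitions above can be sketched for the simple-regression case, where both have closed forms. The data are made up for illustration, and the Cook's distance formula used here, D_i = e_i^2/(p*MSE) * h_ii/(1 - h_ii)^2 with p the number of coefficients, is the standard textbook form rather than anything spelled out in the post:

```python
# Sketch of leverage and Cook's distance for simple linear regression
# on made-up data. Leverage depends only on how far x_i is from the mean;
# Cook's distance combines leverage with the size of the residual.

def leverage(x):
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return [1 / n + (xi - mx) ** 2 / sxx for xi in x]

def cooks_distance(x, y):
    n, p = len(x), 2            # p = number of coefficients (intercept + slope)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    h = leverage(x)
    return [e ** 2 / (p * mse) * hi / (1 - hi) ** 2
            for e, hi in zip(resid, h)]

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print([round(hi, 2) for hi in leverage(x)])        # [0.6, 0.3, 0.2, 0.3, 0.6]
print([round(d, 2) for d in cooks_distance(x, y)])  # [1.5, 0.14, 0.2, 0.14, 0.09]
```

Note how the endpoints get the highest leverage, and the first point, which combines high leverage with a sizeable residual, gets by far the largest Cook's distance.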

Prasoon Bhargav - Posted March 14, 2010

Hi All, sorry, I forgot to mention the references. Here they are:

- Gelman, A. and Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models.
- Farebrother, R. W. Linear Least Squares Computations.
- Fox, J. Applied Regression Analysis, Linear Models, and Related Methods.
- Björck, Å. Numerical Methods for Least Squares Problems.
- Weisberg, S. 1985. Applied Linear Regression, 2nd ed. New York: John Wiley and Sons.

Regards, shantanu kumar
