Multicollinearity in Multiple Linear Regression using Ordinary Least Squares


The collinearity statistics help the analyst detect when the independent variables are intercorrelated to a degree that may adversely affect the regression output. Interrelatedness of the independent variables creates what is termed an ill-conditioned X'X matrix. The process of inverting this matrix to calculate the regression coefficient estimates then becomes unstable, increasing the likelihood of unreasonable estimates. Multicollinearity measures of interest include:

Bivariate Correlations measure the degree of linear relationship between two variables. If two variables included as independent variables in a multiple regression analysis are highly correlated (positively or negatively), they are far from independent of one another, and the Ordinary Least Squares estimates of their regression coefficients become unstable. However, bivariate correlations alone may fail to detect linear relations that involve more than two variables.
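To illustrate the last point, here is a minimal sketch on simulated data (the variable names, seed, and noise level are assumptions chosen for illustration). Five unrelated predictors are generated, and a sixth is built as their sum plus a little noise. No pairwise correlation looks alarming, yet the sixth variable is almost perfectly explained by the other five together.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, illustrative only
n = 500

# Five unrelated predictors, plus a sixth that is nearly their exact sum.
base = rng.normal(size=(n, 5))
x6 = base.sum(axis=1) + rng.normal(scale=0.05, size=n)
X = np.column_stack([base, x6])

# Largest absolute pairwise (bivariate) correlation among the six predictors.
corr = np.corrcoef(X, rowvar=False)
off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
max_abs_corr = np.abs(off_diagonal).max()

# R-square from regressing x6 on the other five predictors (with intercept).
A = np.column_stack([np.ones(n), base])
coef, *_ = np.linalg.lstsq(A, x6, rcond=None)
resid = x6 - A @ coef
r2 = 1.0 - resid.var() / x6.var()

print(f"largest |pairwise correlation|: {max_abs_corr:.3f}")
print(f"R-square of x6 on the others:   {r2:.4f}")
```

The pairwise correlations stay moderate because each base variable accounts for only about a fifth of the variance of the sum, which is why measures that use all predictors at once are needed.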

Tolerance (a measure calculated for each variable) is 1 - R-square for the regression of that variable on all the other independent variables, leaving out the dependent variable. It represents the proportion of that variable's variability that is not explained by the other independent variables in the regression model. When tolerance is close to 0, that variable is highly collinear with the other independent variables, and the estimated regression coefficients will be unstable.
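The definition translates directly into one auxiliary regression per predictor. The following is a rough sketch, not a reference implementation. The function name `tolerance` and the simulated data are assumptions made for illustration.

```python
import numpy as np

def tolerance(X):
    """Return 1 - R-square for each column of X regressed on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    tol = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        centered = y - y.mean()
        # Unexplained sum of squares over total sum of squares = 1 - R-square.
        tol[j] = (resid @ resid) / (centered @ centered)
    return tol

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
tol = tolerance(np.column_stack([x1, x2, x3]))
print(tol)  # near 0 for x1 and x2, near 1 for x3
```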

Variance Inflation Factor, VIF, (a measure calculated for each variable) is simply the reciprocal of tolerance. It measures the degree to which the interrelatedness of the variable with the other predictor variables inflates the variance of the estimated regression coefficient for that variable. Hence the square root of the VIF is the factor by which the collinearity has increased the standard error for that variable. A high VIF value therefore indicates high multicollinearity of that variable with the other independent variables and instability of the regression coefficient estimation process. There are no formal statistical tests for multicollinearity based on the tolerance or VIF measures. VIF=1 is ideal, and many authors use VIF=10 as a suggested upper limit for indicating a definite multicollinearity problem for an individual variable (VIF=10 inflates the standard error by a factor of 3.16). Some would consider VIF=4 (doubling the standard error) as the minimum for indicating a possible multicollinearity problem.
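As a rough sketch of the computation, the VIFs can be read off the diagonal of the inverse of the predictors' correlation matrix, which gives the same values as taking the reciprocal of each tolerance. The function name `vif` and the simulated data below are assumptions for illustration only.

```python
import numpy as np

def vif(X):
    """VIF per column: the diagonal of the inverse predictor correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
tolerances = 1.0 / vifs        # tolerance is the reciprocal of VIF
se_inflation = np.sqrt(vifs)   # factor by which each standard error is inflated

for name, v, s in zip(["x1", "x2", "x3"], vifs, se_inflation):
    flag = "definite problem" if v > 10 else "possible problem" if v > 4 else "ok"
    print(f"{name}: VIF={v:.2f}, SE inflated x{s:.2f} ({flag})")

# The rule-of-thumb cutoffs in standard-error terms.
print(np.sqrt(10.0), np.sqrt(4.0))
```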

Condition Index values are calculated from the eigenvalues of a rescaled crossproduct \(X'X\) matrix. Hence these measures are not for individual variables (like the tolerance and VIF measures) but for individual dimensions/components/factors, and each eigenvalue measures the amount of variability that its dimension accounts for in the rescaled crossproduct \(X'X\) matrix. The rescaled values are obtained by dividing each original value by the square root of the sum of squared original values for that column in the original matrix, including the column for the intercept. This yields an \(X'X\) matrix with ones on the main diagonal. Eigenvalues close to 0 indicate dimensions that explain little variability. A wide spread in eigenvalues indicates an ill-conditioned crossproduct matrix, meaning there is a problem with multicollinearity. A condition index is calculated for each dimension/component/factor by taking the square root of the ratio of the largest eigenvalue to the eigenvalue for that dimension. A common rule of thumb is that a condition index over 15 indicates a possible multicollinearity problem and a condition index over 30 suggests a serious multicollinearity problem. Since each dimension is a linear combination of the original variables, the analyst using OLS regression cannot simply exclude the problematic dimension. Hence a guide is needed to determine which variables are associated with the problematic dimension.
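The rescaling and eigenvalue steps can be sketched as follows. This is an illustrative sketch under stated assumptions. The function name `condition_indices` and the simulated data are hypothetical, and the eigenvalues are obtained as squared singular values of the rescaled design matrix, which equal the eigenvalues of the rescaled \(X'X\) but are numerically more stable to compute.

```python
import numpy as np

def condition_indices(X):
    """Eigenvalues of the rescaled X'X (intercept included) and condition indices."""
    X = np.asarray(X, dtype=float)
    A = np.column_stack([np.ones(X.shape[0]), X])   # add the intercept column
    # Divide each column by the square root of its sum of squares.
    A_scaled = A / np.sqrt((A ** 2).sum(axis=0))
    # Singular values come back largest first; their squares are the eigenvalues.
    s = np.linalg.svd(A_scaled, compute_uv=False)
    eigvals = s ** 2
    cond_index = np.sqrt(eigvals[0] / eigvals)
    return eigvals, cond_index

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)

eigvals, cond_index = condition_indices(np.column_stack([x1, x2, x3]))
for k, (ev, ci) in enumerate(zip(eigvals, cond_index), start=1):
    note = "serious" if ci > 30 else "possible" if ci > 15 else ""
    print(f"dimension {k}: eigenvalue={ev:.5f}, condition index={ci:.2f} {note}")
```

Because the rescaled matrix has ones on its diagonal, the eigenvalues sum to the number of columns (here four, counting the intercept), so one eigenvalue near 0 must be offset by others above 1.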

Regression Coefficient Variance Decomposition Proportions provide a breakdown or decomposition of the variance associated with each regression coefficient, including the intercept. This breakdown is by individual dimension/component/factor and is reported as the proportion of the total variance for that coefficient that is associated with the respective dimension. The VIF measures are based on the fact that interrelatedness of a variable with other predictor variables inflates the variance of the estimated regression coefficient for that variable. Since the only cure available to the analyst using standard OLS is the selection of independent variables included in the model, some measure is needed to pinpoint the variables that contribute to the instability of the estimation process associated with inverting the crossproduct \(X'X\) matrix. Since interrelatedness must involve more than one variable, one looks to the dimensions with a high condition index to see whether the proportion of variance is high for two or more variables. Hence Belsley, Kuh and Welsch propose that degrading collinearity exists when one observes at least one dimension with both

a high condition index (value greater than 30 is generally accepted as a guide for being high) and
high variance decomposition proportions for two or more estimated regression coefficient variances (value greater than 0.5 is a generally accepted guide for being high).
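The two-part rule above can be sketched with a singular value decomposition of the rescaled design matrix, following the decomposition usually attributed to Belsley, Kuh and Welsch. This is a hedged illustration. The function name `variance_decomposition`, the simulated data, and the layout of the returned table are assumptions rather than the output of any particular package.

```python
import numpy as np

def variance_decomposition(X):
    """Condition indices and variance decomposition proportions.

    Returns (cond_index, props), where props[j, k] is the share of the
    variance of coefficient k (column 0 is the intercept) tied to dimension j.
    """
    X = np.asarray(X, dtype=float)
    A = np.column_stack([np.ones(X.shape[0]), X])
    A_scaled = A / np.sqrt((A ** 2).sum(axis=0))    # unit-length columns
    _, s, vt = np.linalg.svd(A_scaled, full_matrices=False)
    cond_index = s[0] / s
    # phi[k, j] = v_kj^2 / d_j^2: variance component of coefficient k on dimension j.
    phi = (vt.T ** 2) / s ** 2
    props = phi / phi.sum(axis=1, keepdims=True)    # each coefficient sums to 1
    return cond_index, props.T

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(4)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)

cond_index, props = variance_decomposition(np.column_stack([x1, x2, x3]))

# Flag dimensions with a condition index over 30 AND two or more proportions over 0.5.
flagged = [j for j in range(len(cond_index))
           if cond_index[j] > 30 and (props[j] > 0.5).sum() >= 2]

names = ["intercept", "x1", "x2", "x3"]
for j in flagged:
    involved = [names[k] for k in range(len(names)) if props[j, k] > 0.5]
    print(f"dimension {j + 1}: condition index {cond_index[j]:.1f}, involves {involved}")
```

In this simulated example the flagged dimension points at x1 and x2, the pair constructed to be nearly identical, which is the kind of pinpointing the proportions are meant to provide.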

No clear prescription exists for the best way to eliminate a multicollinearity problem. One can perform a factor analysis using a principal component solution on the set of original variables and then have SPSS calculate factor scores for the respective dimensions. A linear regression can then be performed on the non-problematic dimensions (those with low condition index values). However, interpreting the regression coefficients becomes very difficult, since each dimension/component/factor is a linear combination of all of the variables. Generally the analyst looks to eliminate variables that are problematic. The analyst has multiple objectives in building a good model to describe the linear relationship between a dependent variable and a set of independent variables, and eliminating collinearity problems is just one of those objectives. Hence the analyst may want to try multiple models to see which one seems best, given these multiple objectives. SPSS provides automated selection methods such as Forward, Backward and Stepwise that can be helpful, in addition to trying specific models that make sense to the analyst.
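The principal component route described above can be sketched outside SPSS with NumPy. This is a simplified stand-in, not the SPSS factor procedure. The function name `pcr_fit`, the simulated data, and the choice to compute the condition index on centered, standardized predictors are all assumptions made for illustration.

```python
import numpy as np

def pcr_fit(X, y, max_condition_index=30.0):
    """Regress y on the principal components whose condition index is low."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardized predictors
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    keep = (s[0] / s) < max_condition_index         # drop ill-conditioned dimensions
    scores = Z @ vt[keep].T                         # component scores
    A = np.column_stack([np.ones(len(y)), scores])
    gamma, *_ = np.linalg.lstsq(A, y, rcond=None)   # intercept + component slopes
    beta_std = vt[keep].T @ gamma[1:]               # back on the standardized variables
    return keep, gamma, beta_std

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
y = 1.0 + x1 + x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

keep, gamma, beta_std = pcr_fit(np.column_stack([x1, x2, x3]), y)
print("components kept:", int(keep.sum()), "of", keep.size)
print("coefficients on standardized x1, x2, x3:", np.round(beta_std, 3))
```

Dropping the problematic dimension forces x1 and x2 to share nearly the same coefficient. That stabilizes the fit, but it also shows the interpretation difficulty noted above, since the data can no longer separate their individual effects.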

Information sources:

Class text: Multivariate Data Analysis (6th edition) by Hair, Black, Babin, Anderson & Tatham.
Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (1980) by Belsley, Kuh & Welsch.