16 Formal Tests on Assumptions
16.1 Multicollinearity
Multicollinearity occurs when independent variables in a regression model are highly correlated, making coefficient estimates unreliable.
- Variance Inflation Factor (VIF)
- If VIF > 10, multicollinearity may be a problem.
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## educ exper tenure female educ:female exper:female tenure:female
## 1.918370 3.283682 2.121188 30.508011 25.140337 5.105177 2.236909
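The VIF of a regressor is 1 / (1 - R²), where R² comes from regressing that regressor on all the other regressors. The following is a minimal base-R sketch of that calculation on simulated data; the variables `x1`, `x2`, `x3` and the helper `vif_manual` are hypothetical and are not part of the wage model above.

```r
# Simulated regressors: x2 is built to be highly correlated with x1 (assumption)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the other regressors
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_manual(X)
print(vifs)
```

For a model with main effects only, these values should agree with what `vif()` reports. In the output above, the large values for `female` and `educ:female` likely reflect that the interaction terms are built from the main effects, which is what the printed warning points to.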
How to fix?
- Remove or combine highly correlated variables
- Do Principal Component Analysis or Factor Analysis
16.2 Heteroscedasticity
The variance of residuals is not constant across observations.
- Breusch-Pagan Test
\(H_0\) : Homoscedasticity
\(H_a\) : Heteroscedasticity
- White Test
- Detects both heteroscedasticity and model misspecification
##
## studentized Breusch-Pagan test
##
## data: full
## BP = 46.768, df = 7, p-value = 0.00000006195
The p-value is far below 0.05, so \(H_0\) is rejected: heteroscedasticity is present.
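The idea behind the studentized Breusch-Pagan statistic can be sketched in base R: regress the squared residuals on the regressors and use n times the R² of that auxiliary regression as a chi-square statistic. The data below are simulated, and all object names (`fit`, `aux`, `bp_stat`, and so on) are illustrative assumptions, not objects from this chapter.

```r
# Simulated data whose error spread grows with x (assumption)
set.seed(42)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the regressors
aux     <- lm(resid(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared        # LM statistic = n * R^2
bp_df   <- length(coef(aux)) - 1             # number of regressors in aux
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(BP = bp_stat, df = bp_df, p.value = bp_p))

# White-type variant: also include squares (and, with several regressors, cross-products)
aux_w      <- lm(resid(fit)^2 ~ x + I(x^2))
white_stat <- n * summary(aux_w)$r.squared
white_p    <- pchisq(white_stat, df = 2, lower.tail = FALSE)
```

With the default settings, `lmtest::bptest(fit)` should report the same statistic as `bp_stat`.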
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 158.2722, Df = 1, p = < 0.000000000000000222
The p-value is again far below 0.05: heteroscedasticity is present.
How to fix?
Use robust standard errors.
library(sandwich)
library(lmtest)
coeftest(full, vcov = vcovHC(model, type = "HC1")) # Huber-White robust SEs
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.531140 0.871723 -4.0508 0.00005887 ***
## educ 0.677180 0.073396 9.2264 < 0.00000000000000022 ***
## female 2.075290 1.460808 1.4206 0.15602
## educ:female -0.213688 0.123818 -1.7258 0.08498 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
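To show what the HC1 correction does, here is a hedged base-R sketch that builds the sandwich estimator by hand on simulated data. The objects `fit`, `bread`, `meat` and `vc_hc1` are illustrative names, and the data-generating process is an assumption.

```r
# Simulated heteroscedastic data (assumption)
set.seed(7)
n <- 300
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- resid(fit)
k <- ncol(X)

bread  <- solve(crossprod(X))        # (X'X)^{-1}
meat   <- crossprod(X * e)           # X' diag(e^2) X
vc_hc1 <- bread %*% meat %*% bread * n / (n - k)   # HC1 small-sample scaling

robust_se  <- sqrt(diag(vc_hc1))
classic_se <- sqrt(diag(vcov(fit)))
print(cbind(estimate = coef(fit), classic_se, robust_se))
```

With the sandwich package loaded, `vcovHC(fit, type = "HC1")` should return the same matrix. Note that the same fitted model is passed both as the object being tested and as the source of the covariance matrix.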
- Transform the dependent variable (log transformation)
- Weighted Least Squares
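Weighted least squares can be sketched with the `weights` argument of `lm()`. The example assumes, purely for illustration, that the error variance is proportional to `x^2`, so each observation is weighted by `1 / x^2`; in practice the variance function has to be justified or estimated.

```r
# Simulated data with Var(error) proportional to x^2 (assumption)
set.seed(3)
n <- 400
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)

ols <- lm(y ~ x)
wls <- lm(y ~ x, weights = 1 / x^2)   # weights = 1 / assumed error variance

# Compare slope estimates and their standard errors
slopes <- rbind(OLS = coef(summary(ols))["x", 1:2],
                WLS = coef(summary(wls))["x", 1:2])
print(slopes)
```

Both estimators target the same slope; the weights change how much each observation contributes, down-weighting the noisier ones.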
16.3 Specification Errors
The model omits relevant variables, includes irrelevant ones or has the wrong functional form.
- Ramsey RESET Test
Tests for omitted variables or incorrect functional form.
\(H_0\) : The model is correctly specified
\(H_a\) : The model may be misspecified
library(lmtest)
# Tests for quadratic and cubic functional-form errors (powers 2 and 3 of the fitted values);
# this also helps flag omitted variables
resettest(full, power = 2:3)
##
## RESET test
##
## data: full
## RESET = 10.012, df1 = 2, df2 = 516, p-value = 0.00005421
The null hypothesis is rejected (p < 0.001), so the model is likely misspecified.
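The mechanics of the RESET test can be sketched in base R: add powers of the fitted values to the model and test them jointly with an F-test. The data are simulated with a quadratic relationship, and the object names (`restricted`, `augmented`, `reset`) are hypothetical.

```r
# Simulated data where the true relationship is quadratic (assumption)
set.seed(11)
n <- 300
x <- runif(n, 0, 5)
y <- 1 + x + 0.8 * x^2 + rnorm(n)

restricted <- lm(y ~ x)                          # misspecified linear model
yhat       <- fitted(restricted)
augmented  <- lm(y ~ x + I(yhat^2) + I(yhat^3))  # add powers 2 and 3 of fitted values

# Joint F-test of the two added terms
reset   <- anova(restricted, augmented)
print(reset)
reset_F <- reset[2, "F"]
reset_p <- reset[2, "Pr(>F)"]
```

This should correspond to `resettest(restricted, power = 2:3)` from lmtest with its default of using fitted values.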
How to fix?
- Include relevant missing variables
- Try polynomial or interaction terms if relationships are nonlinear
- Check for categorical variable misclassification
16.4 Alternative Functional Forms
16.4.1 Log-Log Model
The slope coefficient \(\beta_1\) measures the elasticity of Y with respect to X, that is, the percentage change in Y for a given percentage change in X.
The simplest way to decide whether the log-log model fits the data is to plot \(\ln Y\) against \(\ln X\) and see whether the points lie approximately on a straight line.
library(readxl)
library(dplyr) # provides %>% and mutate()
CEOSAL1 <- read_excel("CEOSAL1.xls")
ceovars <- c("salary", "sales")
CEOSAL2 <- CEOSAL1[ceovars]
# Create the log of a variable using the log function
CEOSAL2 <- CEOSAL2 %>% mutate(lsales = log(sales), lsalary = log(salary))
log2 <- lm(lsalary ~ lsales, data = CEOSAL2)
summary(log2)
##
## Call:
## lm(formula = lsalary ~ lsales, data = CEOSAL2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01038 -0.28140 -0.02723 0.21222 2.81128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.82200 0.28834 16.723 < 0.0000000000000002 ***
## lsales 0.25667 0.03452 7.436 0.0000000000027 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5044 on 207 degrees of freedom
## Multiple R-squared: 0.2108, Adjusted R-squared: 0.207
## F-statistic: 55.3 on 1 and 207 DF, p-value: 0.000000000002703
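The elasticity reading of a log-log slope can be checked on simulated data with a known elasticity. This is a sketch only; the data-generating process and the names `fit`, `elasticity` and `pct_change_y` are assumptions for illustration.

```r
# Simulated data with a true elasticity of 0.3 (assumption)
set.seed(5)
n <- 250
x <- exp(rnorm(n, mean = 8, sd = 1))              # positive, right-skewed regressor
y <- exp(1 + 0.3 * log(x) + rnorm(n, sd = 0.4))

fit        <- lm(log(y) ~ log(x))
elasticity <- unname(coef(fit)["log(x)"])

# Predicted percentage change in y when x rises by 10%
pct_change_y <- 100 * (1.10^elasticity - 1)
print(c(elasticity = elasticity, pct_change_y = pct_change_y))
```

Read the same way, the estimate of 0.25667 in the output above says that a 1% increase in sales is associated with roughly a 0.26% increase in salary.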
16.4.2 Log-Lin Model
Useful in finding out the rate of growth of certain economic variables.
The slope coefficient measures the constant proportional or relative change in Y for a given absolute change in the value of the regressor.
$\beta_1 = \frac{\text{relative change in regressand}}{\text{absolute change in regressor}}$
Multiplying the relative change in Y by 100 gives the percentage change, or growth rate, in Y for an absolute change in X.
##
## Call:
## lm(formula = lsalary ~ sales, data = CEOSAL2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.44220 -0.29159 -0.02837 0.28323 2.72487
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.846650393 0.045003033 152.138 < 0.0000000000000002 ***
## sales 0.000014982 0.000003553 4.217 0.000037 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5448 on 207 degrees of freedom
## Multiple R-squared: 0.07912, Adjusted R-squared: 0.07467
## F-statistic: 17.79 on 1 and 207 DF, p-value: 0.00003696
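As a worked example of this interpretation, the arithmetic below takes the `sales` slope printed in the output above and converts it to percentage changes. The step size of 1000 units is an arbitrary choice for illustration, and "unit" means whatever unit `sales` is recorded in.

```r
b_sales <- 0.000014982   # slope on sales, copied from the log-lin output above

# Approximate percentage change in salary per one-unit increase in sales
approx_pct <- 100 * b_sales

# For a larger step, compare the approximation with the exact change
step            <- 1000   # illustrative step size (assumption)
approx_pct_step <- 100 * b_sales * step
exact_pct_step  <- 100 * (exp(b_sales * step) - 1)
print(c(approx_pct = approx_pct,
        approx_pct_step = approx_pct_step,
        exact_pct_step = exact_pct_step))
```

A 1000-unit increase in sales is therefore associated with a salary that is about 1.5% higher. The approximation `100 * b` is accurate for small changes; for large changes the exact form `100 * (exp(b * step) - 1)` is safer.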
16.4.3 Lin-Log Model
- The absolute change in Y is equal to the slope times the relative change in X: \(\Delta Y = \beta_1 (\Delta X / X)\). If the relative change in X is multiplied by 100, this relation gives the absolute change in Y for a percentage change in X: a 1% increase in X changes Y by about \(\beta_1 / 100\) units.
##
## Call:
## lm(formula = salary ~ lsales, data = CEOSAL2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1072.1 -447.6 -222.8 41.7 13702.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -898.93 771.50 -1.165 0.24529
## lsales 262.90 92.36 2.847 0.00486 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1349 on 207 degrees of freedom
## Multiple R-squared: 0.03767, Adjusted R-squared: 0.03302
## F-statistic: 8.103 on 1 and 207 DF, p-value: 0.004863
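The same kind of arithmetic applies to the lin-log output above. The sketch below takes the printed `lsales` slope and computes the change in salary, in the units of `salary`, for a 1% and a 10% increase in sales; the choice of 1% and 10% is illustrative.

```r
b_lsales <- 262.90   # slope on lsales, copied from the lin-log output above

# Approximation: a 1% increase in sales changes salary by about b / 100 units
approx_change_1pct <- b_lsales / 100

# Exact change implied by the fitted line: b * log(1 + relative change)
exact_change_1pct  <- b_lsales * log(1.01)
exact_change_10pct <- b_lsales * log(1.10)
print(c(approx_1pct = approx_change_1pct,
        exact_1pct  = exact_change_1pct,
        exact_10pct = exact_change_10pct))
```

The `b / 100` rule is close for a 1% change and overstates the effect somewhat for larger changes, because `log(1 + r)` is smaller than `r` for any positive `r`.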