16 Formal Tests on Assumptions

16.1 Multicollinearity

Multicollinearity occurs when independent variables in a regression model are highly correlated with one another, which inflates the standard errors of the coefficient estimates and makes them unstable.

  1. Variance Inflation Factor (VIF)
    • If VIF > 10, multicollinearity may be a problem.
if(!("car" %in% installed.packages()[,"Package"])) install.packages("car")
library(car)
vif(full)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##          educ         exper        tenure        female   educ:female  exper:female tenure:female 
##      1.918370      3.283682      2.121188     30.508011     25.140337      5.105177      2.236909
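
The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below reproduces this calculation by hand on simulated data; the variables `x1`, `x2`, `x3` and the helper `manual_vif()` are hypothetical and are not part of the wage model above. Note also that an interaction term is correlated with its components by construction, which is one reason `female` and `educ:female` show large values in the output above.

```r
# Simulated predictors (illustrative only): x2 is built to be nearly collinear with x1
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # highly correlated with x1
x3 <- rnorm(n)                  # unrelated predictor
X  <- data.frame(x1, x2, x3)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2),
# where R_j^2 is from regressing predictor j on the remaining predictors
manual_vif <- function(X) {
  sapply(names(X), function(v) {
    aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- manual_vif(X)
print(round(vifs, 2))
```

In this simulated setup, `x1` and `x2` should show large VIFs while `x3` stays close to 1.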

How to fix?

  1. Remove or combine highly correlated variables
  2. Do Principal Component Analysis or Factor Analysis
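
As a rough illustration of the second remedy, the sketch below replaces two highly correlated predictors with their first principal component using base R's `prcomp()`. The data are simulated and all variable names are hypothetical; the trade-off is that the component is harder to interpret than the original variables.

```r
# Simulated data (illustrative only): x1 and x2 are nearly collinear
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)

# Principal components of the standardized predictors
pca <- prcomp(cbind(x1, x2), center = TRUE, scale. = TRUE)
pc1 <- pca$x[, 1]                              # scores on the first component

# Share of total variance captured by each component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_share, 3))

# Regress y on the first component instead of x1 and x2 separately
fit_pc <- lm(y ~ pc1)
print(coef(summary(fit_pc)))
```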

16.2 Heteroscedasticity

Heteroscedasticity means that the variance of the error term is not constant across observations.

  1. Breusch-Pagan Test
    • \(H_0\) : Homoscedasticity

    • \(H_a\) : Heteroscedasticity

  2. White Test
    • Detects both heteroscedasticity and model misspecification
#Breusch-Pagan Test
library(lmtest)
bptest(full)
## 
##  studentized Breusch-Pagan test
## 
## data:  full
## BP = 46.768, df = 7, p-value = 0.00000006195

The p-value is far below 0.05, so we reject \(H_0\): heteroscedasticity is present.

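# Score test for non-constant variance (ncvTest from car).
# Note: this is a Breusch-Pagan-type test against the fitted values, not White's test.
library(car)
ncvTest(full)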
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 158.2722, Df = 1, p = < 0.000000000000000222

The p-value is again far below 0.05, so we reject the null of constant variance: heteroscedasticity is present.
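
White's test itself can be sketched with an auxiliary regression: regress the squared residuals on the regressors, their squares and their cross-products, then compare n·R² with a chi-square distribution. The example below is a minimal base-R version on simulated data; the variables and the data-generating process are assumptions made for illustration, not the wage model used above.

```r
# Simulated data (illustrative only): the error standard deviation grows with x1
set.seed(3)
n  <- 500
x1 <- runif(n, 1, 5)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = x1)
fit <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on levels, squares and the cross-product
u2  <- resid(fit)^2
aux <- lm(u2 ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2))

white_stat <- n * summary(aux)$r.squared     # LM statistic = n * R^2
white_df   <- length(coef(aux)) - 1          # number of auxiliary regressors
white_p    <- pchisq(white_stat, df = white_df, lower.tail = FALSE)
print(c(statistic = white_stat, df = white_df, p.value = white_p))
```

A small p-value leads to rejecting homoscedasticity. The same statistic can typically be obtained from `lmtest::bptest()` by passing these terms as the variance formula.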

How to fix?

  1. Use robust standard errors.

if(!("sandwich" %in% installed.packages()[,"Package"])) install.packages("sandwich")
library(sandwich)
library(lmtest)

# Estimates and robust covariance must come from the same fitted model
coeftest(full, vcov = vcovHC(full, type = "HC1"))  # Huber-White (HC1) robust SEs
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value              Pr(>|t|)    
## (Intercept) -3.531140   0.871723 -4.0508            0.00005887 ***
## educ         0.677180   0.073396  9.2264 < 0.00000000000000022 ***
## female       2.075290   1.460808  1.4206               0.15602    
## educ:female -0.213688   0.123818 -1.7258               0.08498 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  2. Transform the dependent variable (log transformation)
  3. Weighted Least Squares
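
Weighted least squares can be run with the `weights` argument of `lm()`, using weights proportional to the inverse of the error variance. The sketch below assumes, purely for illustration, that the error variance is known to be proportional to x²; in practice the variance function has to be estimated or justified. The data are simulated and the names are hypothetical.

```r
# Simulated data (illustrative only): Var(error) is proportional to x^2
set.seed(4)
n <- 500
x <- runif(n, 1, 5)
y <- 1 + 2 * x + rnorm(n, sd = x)

ols <- lm(y ~ x)
wls <- lm(y ~ x, weights = 1 / x^2)   # weights = inverse of the assumed error variance

# Compare the slope estimates and their conventional standard errors
se_ols <- coef(summary(ols))["x", "Std. Error"]
se_wls <- coef(summary(wls))["x", "Std. Error"]
print(rbind(OLS = c(slope = unname(coef(ols)["x"]), se = se_ols),
            WLS = c(slope = unname(coef(wls)["x"]), se = se_wls)))
```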

16.3 Specification Errors

The model omits relevant variables, includes irrelevant ones or has the wrong functional form.

  1. Ramsey RESET Test
    • Tests for omitted variables or incorrect functional form.

    • \(H_0\) : The model is correctly specified

    • \(H_a\) : The model may be misspecified

library(lmtest)
# power = 2:3 adds the squared and cubed fitted values to the model, so the test
# checks for quadratic and cubic functional-form errors as well as omitted variables
resettest(full, power = 2:3)
## 
##  RESET test
## 
## data:  full
## RESET = 10.012, df1 = 2, df2 = 516, p-value = 0.00005421

The p-value is far below 0.05, so we reject \(H_0\): the model appears to be misspecified.
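
To see what the RESET test does, it can be rebuilt by hand: add powers of the fitted values to the original regression and test their joint significance with an F-test. The following is a minimal base-R sketch on simulated data where the true relationship is quadratic; the data and object names are hypothetical.

```r
# Simulated data (illustrative only): the true relationship is quadratic in x
set.seed(5)
n <- 300
x <- runif(n, 0, 4)
y <- 1 + 0.5 * x^2 + rnorm(n)

restricted <- lm(y ~ x)                           # misspecified linear model
yhat       <- fitted(restricted)
augmented  <- lm(y ~ x + I(yhat^2) + I(yhat^3))   # add squared and cubed fitted values

# Joint F-test of the two added terms (df1 = 2)
reset_tab <- anova(restricted, augmented)
reset_F   <- reset_tab$F[2]
reset_p   <- reset_tab$`Pr(>F)`[2]
print(reset_tab)
```

Here the linear model is wrong by construction, so the added terms should be jointly significant.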

How to fix?

  1. Include relevant missing variables
  2. Try polynomial or interaction terms if relationships are nonlinear
  3. Check for categorical variable misclassification

16.4 Alternative Functional Forms

16.4.1 Log-Log Model

  • The slope coefficient \(\beta_1\) measures the elasticity of Y with respect to X.

  • That is, it gives the percentage change in Y for a 1% change in X.

  • The simplest way to decide whether the log-log model fits the data is to draw a scatterplot of ln Y against ln X and see whether the points lie approximately on a straight line.

library(readxl)
library(dplyr)  # provides %>% and mutate(), used below
CEOSAL1 <- read_excel("CEOSAL1.xls")
ceovars<-c("salary","sales")
CEOSAL2<-CEOSAL1[ceovars]

#Create the log of a variable using the log function
CEOSAL2<-CEOSAL2%>%mutate(lsales=log(sales),lsalary=log(salary))
log2<-lm(lsalary~lsales, data=CEOSAL2)
summary(log2)
## 
## Call:
## lm(formula = lsalary ~ lsales, data = CEOSAL2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.01038 -0.28140 -0.02723  0.21222  2.81128 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  4.82200    0.28834  16.723 < 0.0000000000000002 ***
## lsales       0.25667    0.03452   7.436      0.0000000000027 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5044 on 207 degrees of freedom
## Multiple R-squared:  0.2108, Adjusted R-squared:  0.207 
## F-statistic:  55.3 on 1 and 207 DF,  p-value: 0.000000000002703
#You can also apply log() directly inside the regression formula
#log2<-lm(log(salary)~log(sales),data=CEOSAL2)
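
As a worked reading of the slope, the snippet below plugs the `lsales` coefficient from the output above into the elasticity interpretation. The 10% increase in sales is a hypothetical value chosen for illustration; the exact figure uses the fact that the model implies salary is proportional to sales raised to the power \(\beta_1\).

```r
b_loglog  <- 0.25667     # lsales coefficient copied from the summary above
pct_sales <- 10          # hypothetical 10% increase in sales

# Elasticity approximation: % change in salary = beta_1 * % change in sales
approx_pct <- b_loglog * pct_sales

# Exact change implied by the log-log model
exact_pct <- 100 * ((1 + pct_sales / 100)^b_loglog - 1)

print(c(approx = approx_pct, exact = exact_pct))
```

The two figures are close for small changes and drift apart as the percentage change grows.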

16.4.2 Log-Lin Model

  • Useful in finding out the rate of growth of certain economic variables.

  • The slope coefficient measures the constant proportional or relative change in Y for a given absolute change in the value of the regressor.

    \(\beta_1 = \frac{\text{relative change in regressand}}{\text{absolute change in regressor}}\)

  • Multiplying the relative change in Y by 100 gives the percentage change, or growth rate, in Y for a one-unit change in X.

loglin<-lm(lsalary~sales,data = CEOSAL2)
summary(loglin)
## 
## Call:
## lm(formula = lsalary ~ sales, data = CEOSAL2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.44220 -0.29159 -0.02837  0.28323  2.72487 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept) 6.846650393 0.045003033 152.138 < 0.0000000000000002 ***
## sales       0.000014982 0.000003553   4.217             0.000037 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5448 on 207 degrees of freedom
## Multiple R-squared:  0.07912,    Adjusted R-squared:  0.07467 
## F-statistic: 17.79 on 1 and 207 DF,  p-value: 0.00003696
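
To make the interpretation concrete, the snippet below converts the `sales` coefficient from the output above into a percentage change in salary. The change of 1000 units of sales is a hypothetical value, measured in whatever units the data set uses.

```r
b_loglin <- 0.000014982   # sales coefficient copied from the summary above
d_sales  <- 1000          # hypothetical absolute change in sales (units of the data)

# Approximation: % change in salary = 100 * beta_1 * change in sales
approx_pct <- 100 * b_loglin * d_sales

# Exact change implied by the log-lin model
exact_pct <- 100 * (exp(b_loglin * d_sales) - 1)

print(c(approx = approx_pct, exact = exact_pct))
```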

16.4.3 Lin-Log Model

  • The absolute change in Y is equal to the slope times the relative change in X: \(\Delta Y = \beta_1 (\Delta X / X)\). Dividing the slope by 100 therefore gives the approximate absolute change in Y for a 1% change in X.
linlog<-lm(salary~lsales, data=CEOSAL2)
summary(linlog)
## 
## Call:
## lm(formula = salary ~ lsales, data = CEOSAL2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1072.1  -447.6  -222.8    41.7 13702.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -898.93     771.50  -1.165  0.24529   
## lsales        262.90      92.36   2.847  0.00486 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1349 on 207 degrees of freedom
## Multiple R-squared:  0.03767,    Adjusted R-squared:  0.03302 
## F-statistic: 8.103 on 1 and 207 DF,  p-value: 0.004863
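
A short worked reading of the lin-log slope, using the `lsales` coefficient from the output above. The 1% increase in sales is a hypothetical value, and the result is expressed in the units in which salary is recorded in the data.

```r
b_linlog  <- 262.90   # lsales coefficient copied from the summary above
pct_sales <- 1        # hypothetical 1% increase in sales

# Approximation: change in salary = (beta_1 / 100) * % change in sales
approx_change <- b_linlog * pct_sales / 100

# Exact change implied by the lin-log model: beta_1 * change in log(sales)
exact_change <- b_linlog * log(1 + pct_sales / 100)

print(c(approx = approx_change, exact = exact_change))
```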