19 Part 2: Instrumental Variables Interpretation (Simulation)

Now consider a situation where education may be endogenous.

Possible reasons:

  • ability affects both education and wages

  • family background influences both

If education is correlated with the error term, OLS estimates of the return to education are biased and inconsistent: the coefficient on education picks up part of the effect of the omitted factors.

We simulate an instrument: distance to college.

Idea:

  • living closer to college increases education (relevance)

  • distance should affect wages only through education, not directly (the exclusion restriction)
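In symbols, the two conditions can be written as follows. This is a sketch in notation introduced here for illustration: the wage equation is taken to be wage = β₀ + β₁·educ + u, and z stands for the instrument (distance to college).

```latex
\begin{aligned}
\text{Relevance:} \quad & \operatorname{Cov}(z, \text{educ}) \neq 0 \\
\text{Exclusion restriction:} \quad & \operatorname{Cov}(z, u) = 0 \\
\text{Implied IV estimand:} \quad & \beta_1 = \frac{\operatorname{Cov}(z, \text{wage})}{\operatorname{Cov}(z, \text{educ})}
\end{aligned}
```

The last line follows from taking the covariance of both sides of the wage equation with z: Cov(z, wage) = β₁·Cov(z, educ) + Cov(z, u), and the final term is zero under the exclusion restriction.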

19.1 Simulated Data

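library(dplyr)  # provides tibble() and %>%; skip if already loaded

set.seed(321)

n <- 1000

iv_data <- tibble(
  distance_college = runif(n, 0, 20),  # instrument
  ability = rnorm(n)                   # confounder, unobserved in practice
) %>%
  mutate(
    # First stage: each unit of distance lowers education by 0.2
    educ = 12 - 0.2 * distance_college + ability + rnorm(n),
    # Outcome: the true return to education is 0.9
    wage = 5 + 0.9 * educ + ability + rnorm(n)
  )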

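Here:

  • ability affects both education and wages, but it is left out of the regressions below, as it would be with real data

  • the true return to education is 0.9 by construction

  • therefore OLS of wage on educ will be biased upward, because ability raises both variables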
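The size of the bias can be worked out from the simulation design. The derivation below is a sketch that assumes the population moments implied by the code above: distance uniform on [0, 20], and ability and both noise terms standard normal and mutually independent. The symbols u (the error term once ability is omitted) and ε_w (the wage noise) are introduced here for illustration.

```latex
\begin{aligned}
u &= \text{ability} + \varepsilon_w \\
\operatorname{Var}(\text{distance}) &= \frac{20^2}{12} = \frac{100}{3} \\
\operatorname{Var}(\text{educ}) &= 0.2^2 \cdot \frac{100}{3} + 1 + 1 = \frac{10}{3} \\
\operatorname{Cov}(\text{educ}, u) &= \operatorname{Var}(\text{ability}) = 1 \\
\operatorname{plim} \hat{\beta}_1^{OLS} &= 0.9 + \frac{\operatorname{Cov}(\text{educ}, u)}{\operatorname{Var}(\text{educ})} = 0.9 + 0.3 = 1.2
\end{aligned}
```

In large samples the OLS coefficient on educ should therefore be close to 1.2 rather than 0.9.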

19.2 OLS Regression

ols_model <- lm(wage ~ educ, data = iv_data)
summary(ols_model)
## 
## Call:
## lm(formula = wage ~ educ, data = iv_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2791 -0.9928 -0.0048  0.9130  4.5021 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  2.10619    0.24171   8.714 <0.0000000000000002 ***
## educ         1.19107    0.02386  49.927 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.353 on 998 degrees of freedom
## Multiple R-squared:  0.7141, Adjusted R-squared:  0.7138 
## F-statistic:  2493 on 1 and 998 DF,  p-value: < 0.00000000000000022

Interpretation:

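“OLS estimates the association between education and wages, but this estimate may be biased because unobserved ability affects both variables.”

In this simulation the bias is visible: the OLS coefficient on educ is 1.19, well above the true value of 0.9 used to generate the data.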
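Because the data are simulated, ability is actually observed here, which is never the case in practice. The sketch below compares the regression that omits ability with a hypothetical "oracle" regression that controls for it. To run on its own, it rebuilds the data in base R, drawing the random numbers in the same order as in 19.1, so it should reproduce the same sample. The names `sim`, `short_fit`, and `oracle_fit` are illustrative.

```r
# Rebuild the simulated data with base R, drawing in the same order as 19.1
set.seed(321)
n <- 1000
distance_college <- runif(n, 0, 20)
ability <- rnorm(n)
educ <- 12 - 0.2 * distance_college + ability + rnorm(n)
wage <- 5 + 0.9 * educ + ability + rnorm(n)
sim <- data.frame(distance_college, ability, educ, wage)

# Short regression: ability is omitted, as it would be with real data
short_fit <- lm(wage ~ educ, data = sim)

# Oracle regression: controls for ability (only possible in a simulation)
oracle_fit <- lm(wage ~ educ + ability, data = sim)

# Compare the two coefficients on educ with the true value of 0.9
round(c(short  = unname(coef(short_fit)["educ"]),
        oracle = unname(coef(oracle_fit)["educ"]),
        truth  = 0.9), 3)
```

If the oracle coefficient sits near 0.9 while the short coefficient does not, the gap between them is the omitted-variable bias that IV is meant to remove.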

19.3 First Stage Regression

first_stage <- lm(educ ~ distance_college, data = iv_data)
summary(first_stage)
## 
## Call:
## lm(formula = educ ~ distance_college, data = iv_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1973 -1.0360  0.0289  1.0139  4.9790 
## 
## Coefficients:
##                   Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)      11.982741   0.088877   134.8 <0.0000000000000002 ***
## distance_college -0.196666   0.007563   -26.0 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.386 on 998 degrees of freedom
## Multiple R-squared:  0.4039, Adjusted R-squared:  0.4033 
## F-statistic: 676.2 on 1 and 998 DF,  p-value: < 0.00000000000000022

Interpretation:

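The coefficient on distance_college tells us whether the instrument predicts education. Here it is -0.197 (t = -26.0), close to the value of -0.2 used in the simulation.

If significant:

“Distance to college is strongly related to educational attainment, which supports the relevance condition for a valid instrument.”

Relevance is the only instrument condition that can be checked directly in the data; the exclusion restriction has to be argued.

Check the first‑stage F‑statistic. Here F = 676.2.

Rule of thumb:

  • F > 10 → instrument likely strong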
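The F-statistic can also be pulled out of the fitted model rather than read off the printout. The following is a minimal sketch that rebuilds the simulated data in base R so it runs on its own; `sim`, `f_stat`, and `t_stat` are illustrative names.

```r
# Rebuild the simulated data with base R, drawing in the same order as 19.1
set.seed(321)
n <- 1000
distance_college <- runif(n, 0, 20)
ability <- rnorm(n)
educ <- 12 - 0.2 * distance_college + ability + rnorm(n)
wage <- 5 + 0.9 * educ + ability + rnorm(n)
sim <- data.frame(distance_college, ability, educ, wage)

first_stage <- lm(educ ~ distance_college, data = sim)

# summary()$fstatistic holds the overall F statistic and its degrees of freedom
f_stat <- unname(summary(first_stage)$fstatistic["value"])

# With one instrument and no other regressors, F equals the squared t statistic
t_stat <- coef(summary(first_stage))["distance_college", "t value"]

c(F = f_stat, t_squared = t_stat^2)

# Rule-of-thumb check
f_stat > 10
```

When the first stage also includes exogenous control variables, the relevant statistic is the F-test on the excluded instruments only, not the overall F of the regression.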

19.4 IV Regression (Two Stage Least Squares)

library(AER)
iv_model <- ivreg(wage ~ educ | distance_college, data = iv_data)
summary(iv_model)
## 
## Call:
## ivreg(formula = wage ~ educ | distance_college, data = iv_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -4.013526 -1.057973  0.002348  1.137755  4.163679 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  5.42866    0.41194   13.18 <0.0000000000000002 ***
## educ         0.85789    0.04104   20.90 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.479 on 998 degrees of freedom
## Multiple R-Squared: 0.6582,  Adjusted R-squared: 0.6579 
## Wald test: 436.9 on 1 and 998 DF,  p-value: < 0.00000000000000022

19.5 Interpretation of IV Coefficient

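The IV estimate measures the causal effect of education on wages using only variation in education generated by the instrument. This interpretation holds only if the instrument is both relevant and excluded from the wage equation.

Interpretation example:

“Using distance to college as an instrument for education, a one‑year increase in education increases wages by an estimated 0.86 units on average.”

The estimate of 0.858 is about one standard error (0.041) below the true value of 0.9, so IV recovers the return to education up to sampling error.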

Key idea:

  • OLS uses all variation in education

  • IV uses only exogenous variation generated by the instrument

Thus IV attempts to isolate a causal effect.
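To make "only the variation generated by the instrument" concrete, the two stages can be run by hand. This is a sketch for intuition, not a replacement for `ivreg`: the point estimate matches, but the standard errors printed by the second-stage `lm` are not valid because they ignore that `educ_hat` is itself estimated. The data are rebuilt in base R so the block runs on its own; `sim`, `educ_hat`, `beta_2sls`, and `beta_ratio` are illustrative names.

```r
# Rebuild the simulated data with base R, drawing in the same order as 19.1
set.seed(321)
n <- 1000
distance_college <- runif(n, 0, 20)
ability <- rnorm(n)
educ <- 12 - 0.2 * distance_college + ability + rnorm(n)
wage <- 5 + 0.9 * educ + ability + rnorm(n)
sim <- data.frame(distance_college, ability, educ, wage)

# Stage 1: keep only the part of education predicted by the instrument
sim$educ_hat <- fitted(lm(educ ~ distance_college, data = sim))

# Stage 2: regress wages on the predicted education
beta_2sls <- unname(coef(lm(wage ~ educ_hat, data = sim))["educ_hat"])

# Equivalent ratio form: reduced-form effect divided by first-stage effect
beta_ratio <- cov(sim$wage, sim$distance_college) /
  cov(sim$educ, sim$distance_college)

c(two_stage = beta_2sls, ratio = beta_ratio)
```

With a single instrument the two numbers coincide, which shows that IV rescales the effect of distance on wages by the effect of distance on education.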

19.6 Comparing OLS and IV

Researchers often compare the two estimates.

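If they differ substantially and the instrument is valid, that suggests endogeneity bias in the OLS model. Here the OLS estimate (1.19) is well above the IV estimate (0.86), which is the pattern expected when omitted ability raises both education and wages.

Example interpretation:

“The IV estimate of the return to education (0.86) is smaller than the OLS estimate (1.19), suggesting that unobserved ability biases the OLS coefficient upward.”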
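The comparison can be made formal with a regression-based endogeneity test in the Durbin-Wu-Hausman style. The sketch below assumes homoskedastic errors and a valid instrument: it adds the first-stage residuals to the wage regression, and a significant coefficient on them indicates that education is endogenous. The data are rebuilt in base R so the block runs on its own; `sim`, `v_hat`, and `aug_fit` are illustrative names. The `summary()` method for `ivreg` objects in AER also reports a Wu-Hausman test when called with `diagnostics = TRUE`.

```r
# Rebuild the simulated data with base R, drawing in the same order as 19.1
set.seed(321)
n <- 1000
distance_college <- runif(n, 0, 20)
ability <- rnorm(n)
educ <- 12 - 0.2 * distance_college + ability + rnorm(n)
wage <- 5 + 0.9 * educ + ability + rnorm(n)
sim <- data.frame(distance_college, ability, educ, wage)

# First-stage residuals: the part of education not explained by the instrument
sim$v_hat <- resid(lm(educ ~ distance_college, data = sim))

# Augmented regression: a significant coefficient on v_hat signals endogeneity
aug_fit <- lm(wage ~ educ + v_hat, data = sim)
coef(summary(aug_fit))[c("educ", "v_hat"), ]
```

The coefficient on educ in the augmented regression equals the IV point estimate, so this regression shows the IV estimate and the endogeneity test side by side.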

19.7 Interpreting IV Results Carefully

Always discuss:

  1. Instrument relevance

  2. Instrument validity (exclusion restriction)

  3. Precision of the IV estimate

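Weak instruments lead to large standard errors and, in finite samples, to IV estimates that are biased toward the OLS estimate, so results from a weak first stage are unreliable.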
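To see what a weak instrument does, the simulation can be rerun with a much smaller first-stage slope. This is a hypothetical variation, not part of the analysis above: the function `simulate_iv()` and the weak slope of -0.005 are assumptions made for illustration, and the IV standard error is the conventional homoskedastic one computed by hand.

```r
# Simulate the data for a given first-stage slope and summarise the IV fit
simulate_iv <- function(first_stage_slope, n = 1000, seed = 321) {
  set.seed(seed)
  z <- runif(n, 0, 20)
  ability <- rnorm(n)
  educ <- 12 + first_stage_slope * z + ability + rnorm(n)
  wage <- 5 + 0.9 * educ + ability + rnorm(n)

  first_stage <- lm(educ ~ z)
  educ_hat <- fitted(first_stage)

  # IV point estimate (ratio of covariances) and matching intercept
  beta <- cov(wage, z) / cov(educ, z)
  alpha <- mean(wage) - beta * mean(educ)

  # Conventional 2SLS standard error: residuals use actual educ, not educ_hat
  sigma2 <- sum((wage - alpha - beta * educ)^2) / (n - 2)
  se <- sqrt(sigma2 / sum((educ_hat - mean(educ_hat))^2))

  c(first_stage_F = unname(summary(first_stage)$fstatistic["value"]),
    iv_estimate = beta,
    iv_se = se)
}

# Same seed in both runs, so only the strength of the instrument differs
res <- rbind(strong = simulate_iv(-0.2), weak = simulate_iv(-0.005))
res
```

Comparing the two rows shows how the first-stage F, the IV estimate, and its standard error change when the instrument barely moves education.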

19.8 Summary of Interpretation Strategy

When interpreting regression results:

  1. Identify the reference group.

  2. Interpret coefficients relative to that group.

  3. For interactions, compute combined effects.

  4. For IV models, discuss instrument validity and first‑stage strength.

  5. Distinguish statistical significance from economic importance.

19.9 Final reminders for interpretation

When writing up your regression results, avoid statements such as:

  • “X has no effect” when the coefficient is not significant

  • “The relationship is strong” without explaining whether you mean statistical significance, economic size, or model fit

  • “p = 0.08 means insignificant” without noting that the coefficient is significant at the 10% level, though not at the 5% level

Better wording is:

  • “The estimated effect is negative, but it is not statistically distinguishable from zero at the 5% level.”

  • “The effect is statistically significant, but its magnitude is economically small.”

  • “The coefficient is significant at the 10% level but not at the 5% level, so the evidence is suggestive rather than strong.”