19 Part 2: Instrumental Variables Interpretation (Simulation)
Now consider a situation where education may be endogenous.
Possible reasons:
- ability affects both education and wages
- family background influences both
If education is correlated with the error term, OLS estimates of the return to education are biased.
We simulate an instrument: distance to college.
Idea:
- living closer to college increases education (relevance)
- but distance should not affect wages except through education (exclusion restriction)
19.1 Simulated Data
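library(tibble)  # tibble()
library(dplyr)   # %>% and mutate()

set.seed(321)
n <- 1000

iv_data <- tibble(
  distance_college = runif(n, 0, 20),  # instrument: distance to college
  ability = rnorm(n)                   # confounder, treated as unobserved
) %>%
  mutate(
    # education falls with distance and rises with ability
    educ = 12 - 0.2 * distance_college + ability + rnorm(n),
    # the true return to education is 0.9; ability also raises wages directly
    wage = 5 + 0.9 * educ + ability + rnorm(n)
  )

Here:
- ability affects both education and wages
- therefore OLS of wage on educ alone will be biased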
19.2 OLS Regression
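The code that produced the output below is not shown. Judging from the Call line, it was presumably something like the following sketch. The object name ols_model is an assumption, and the fallback block only rebuilds the Section 19.1 data (with base R instead of tibble) when iv_data is not already in the workspace.

```r
# Fallback: rebuild the Section 19.1 data with base R if iv_data is missing
if (!exists("iv_data")) {
  set.seed(321)
  n <- 1000
  distance_college <- runif(n, 0, 20)
  ability <- rnorm(n)
  educ <- 12 - 0.2 * distance_college + ability + rnorm(n)
  wage <- 5 + 0.9 * educ + ability + rnorm(n)
  iv_data <- data.frame(distance_college, ability, educ, wage)
}

# OLS of wages on education; ability is deliberately left out of the model
ols_model <- lm(wage ~ educ, data = iv_data)  # ols_model is a hypothetical name
summary(ols_model)
```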
##
## Call:
## lm(formula = wage ~ educ, data = iv_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2791 -0.9928 -0.0048 0.9130 4.5021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.10619 0.24171 8.714 <0.0000000000000002 ***
## educ 1.19107 0.02386 49.927 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.353 on 998 degrees of freedom
## Multiple R-squared: 0.7141, Adjusted R-squared: 0.7138
## F-statistic: 2493 on 1 and 998 DF, p-value: < 0.00000000000000022
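Interpretation:
“OLS estimates the association between education and wages, but this estimate may be biased because unobserved ability affects both variables.”
In this simulation the bias is visible: the OLS coefficient on educ is 1.19, while the true coefficient in the data-generating process is 0.9.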
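The size of the bias can be worked out from the data-generating process in Section 19.1. The following is a sketch under the simulation's own assumptions (independent draws, unit-variance ability and noise terms, distance uniform on [0, 20]). The symbols β and u are introduced here for illustration: the wage equation is written as wage = 5 + β·educ + u, with β = 0.9 and u equal to ability plus the wage noise term.

$$
\operatorname{plim}\,\hat\beta_{OLS} = \beta + \frac{\operatorname{Cov}(\text{educ}, u)}{\operatorname{Var}(\text{educ})}
$$

$$
\operatorname{Cov}(\text{educ}, u) = \operatorname{Var}(\text{ability}) = 1,
\qquad
\operatorname{Var}(\text{educ}) = 0.2^2 \cdot \frac{20^2}{12} + 1 + 1 \approx 3.33
$$

$$
\operatorname{plim}\,\hat\beta_{OLS} \approx 0.9 + \frac{1}{3.33} = 1.2
$$

This is close to the OLS estimate of 1.19 in the output above, so the gap between 1.19 and 0.9 is what omitting ability would be expected to produce.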
19.3 First Stage Regression
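The call behind the output below is not shown. Based on the Call line, a plausible reconstruction is the following sketch. The object name first_stage is an assumption, and the fallback block only rebuilds the Section 19.1 data (with base R) when iv_data is not already available.

```r
# Fallback: rebuild the Section 19.1 data with base R if iv_data is missing
if (!exists("iv_data")) {
  set.seed(321)
  n <- 1000
  distance_college <- runif(n, 0, 20)
  ability <- rnorm(n)
  educ <- 12 - 0.2 * distance_college + ability + rnorm(n)
  wage <- 5 + 0.9 * educ + ability + rnorm(n)
  iv_data <- data.frame(distance_college, ability, educ, wage)
}

# First stage: regress the endogenous regressor on the instrument
first_stage <- lm(educ ~ distance_college, data = iv_data)  # hypothetical name
summary(first_stage)

# First-stage F-statistic, extracted for the rule-of-thumb check
first_stage_f <- summary(first_stage)$fstatistic[["value"]]
first_stage_f
```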
##
## Call:
## lm(formula = educ ~ distance_college, data = iv_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1973 -1.0360 0.0289 1.0139 4.9790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.982741 0.088877 134.8 <0.0000000000000002 ***
## distance_college -0.196666 0.007563 -26.0 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.386 on 998 degrees of freedom
## Multiple R-squared: 0.4039, Adjusted R-squared: 0.4033
## F-statistic: 676.2 on 1 and 998 DF, p-value: < 0.00000000000000022
Interpretation:
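The coefficient on distance_college tells us whether the instrument predicts education. Here each additional unit of distance lowers education by about 0.20 years (t = -26.0).
If significant:
“Distance to college is strongly related to educational attainment, satisfying the relevance condition for a valid instrument.”
Check the first-stage F-statistic.
Rule of thumb:
- F > 10 → instrument likely strong
Here F = 676.2, far above the threshold. With a single instrument, F is the square of the first-stage t-statistic (26.0² ≈ 676).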
19.4 IV Regression (Two Stage Least Squares)
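The call behind the output below is not shown. The Call line shows ivreg() with the formula wage ~ educ | distance_college. The sketch below assumes the function comes from the AER package, which is a guess based on the layout of the printed summary (the ivreg package provides a function of the same name). The object name iv_model is also an assumption.

```r
library(AER)  # assumption: ivreg() taken from AER

# Fallback: rebuild the Section 19.1 data with base R if iv_data is missing
if (!exists("iv_data")) {
  set.seed(321)
  n <- 1000
  distance_college <- runif(n, 0, 20)
  ability <- rnorm(n)
  educ <- 12 - 0.2 * distance_college + ability + rnorm(n)
  wage <- 5 + 0.9 * educ + ability + rnorm(n)
  iv_data <- data.frame(distance_college, ability, educ, wage)
}

# 2SLS: regressors go before the bar, instruments after it
iv_model <- ivreg(wage ~ educ | distance_college, data = iv_data)  # hypothetical name
summary(iv_model)
```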
##
## Call:
## ivreg(formula = wage ~ educ | distance_college, data = iv_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.013526 -1.057973 0.002348 1.137755 4.163679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.42866 0.41194 13.18 <0.0000000000000002 ***
## educ 0.85789 0.04104 20.90 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.479 on 998 degrees of freedom
## Multiple R-Squared: 0.6582, Adjusted R-squared: 0.6579
## Wald test: 436.9 on 1 and 998 DF, p-value: < 0.00000000000000022
19.5 Interpretation of IV Coefficient
If the instrument is valid, the IV estimate measures the causal effect of education on wages using only the variation in education generated by the instrument.
Interpretation example:
“Using distance to college as an instrument for education, a one-year increase in education increases wages by β₁ units on average.”
In the output above, the estimate of β₁ is 0.858, close to the true value of 0.9 used in the simulation.
Key idea:
- OLS uses all variation in education
- IV uses only exogenous variation generated by the instrument
Thus IV attempts to isolate a causal effect.
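To see what "using only variation generated by the instrument" means in practice, the single-instrument IV estimate can be computed by hand. The sketch below is illustrative rather than the method used for the output above. It computes the estimate as a ratio of covariances and again as two explicit regression stages. All object names are hypothetical, and the fallback block only rebuilds the Section 19.1 data when iv_data is missing.

```r
# Fallback: rebuild the Section 19.1 data with base R if iv_data is missing
if (!exists("iv_data")) {
  set.seed(321)
  n <- 1000
  distance_college <- runif(n, 0, 20)
  ability <- rnorm(n)
  educ <- 12 - 0.2 * distance_college + ability + rnorm(n)
  wage <- 5 + 0.9 * educ + ability + rnorm(n)
  iv_data <- data.frame(distance_college, ability, educ, wage)
}

# (1) Covariance ratio: effect of the instrument on wages divided by
#     its effect on education
beta_ratio <- with(iv_data,
                   cov(wage, distance_college) / cov(educ, distance_college))

# (2) Two explicit stages: keep only the part of educ predicted by distance,
#     then regress wages on that predicted part
educ_hat <- fitted(lm(educ ~ distance_college, data = iv_data))
second_stage <- lm(iv_data$wage ~ educ_hat)
beta_2sls <- coef(second_stage)[["educ_hat"]]

# The two point estimates coincide when there is one instrument
c(covariance_ratio = beta_ratio, two_stage = beta_2sls)
```

The standard errors printed by summary(second_stage) are not valid for inference, because they ignore that educ_hat is itself estimated. A dedicated routine such as ivreg() corrects for this.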
19.6 Comparing OLS and IV
Researchers often compare the two estimates.
If they differ substantially and the instrument is valid, that suggests endogeneity bias in the OLS model. Here the OLS coefficient (1.19) is well above the IV coefficient (0.86).
Example interpretation:
“The IV estimate of the return to education is smaller than the OLS estimate, suggesting that unobserved ability biases the OLS coefficient upward.”
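The comparison can also be made formal. One common option, not used in the text above, is the regression-based (control function) form of the Durbin-Wu-Hausman test: add the first-stage residuals to the wage equation and test whether their coefficient is zero. The sketch below assumes the instrument is valid. All object names are hypothetical, and the fallback block only rebuilds the Section 19.1 data when iv_data is missing.

```r
# Fallback: rebuild the Section 19.1 data with base R if iv_data is missing
if (!exists("iv_data")) {
  set.seed(321)
  n <- 1000
  distance_college <- runif(n, 0, 20)
  ability <- rnorm(n)
  educ <- 12 - 0.2 * distance_college + ability + rnorm(n)
  wage <- 5 + 0.9 * educ + ability + rnorm(n)
  iv_data <- data.frame(distance_college, ability, educ, wage)
}

# Step 1: first stage; the residuals hold the part of educ not explained by distance
first_stage <- lm(educ ~ distance_college, data = iv_data)
cf_data <- data.frame(iv_data, v_hat = resid(first_stage))

# Step 2: wage equation augmented with the first-stage residuals
cf_model <- lm(wage ~ educ + v_hat, data = cf_data)
cf_table <- coef(summary(cf_model))

# A significant coefficient on v_hat is evidence that educ is endogenous;
# the coefficient on educ reproduces the IV point estimate
cf_table[c("educ", "v_hat"), ]
```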
19.7 Interpreting IV Results Carefully
Always discuss:
- Instrument relevance
- Instrument validity (exclusion restriction)
- Precision of the IV estimate
Weak instruments lead to large standard errors and unreliable estimates; in finite samples they also pull the IV estimate toward the biased OLS estimate.
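The link between instrument strength and precision can be made concrete for the single-instrument case. As a sketch, assume homoskedastic errors (the default standard errors in the outputs above). The notation is introduced here for illustration: SST is the total sum of squares of educ, R² with subscript is the first-stage R-squared, and σ̂ is each model's residual standard error.

$$
\operatorname{se}(\hat\beta_{IV}) = \frac{\hat\sigma_{IV}}{\sqrt{SST_{\text{educ}} \cdot R^2_{\text{educ},z}}},
\qquad
\operatorname{se}(\hat\beta_{OLS}) = \frac{\hat\sigma_{OLS}}{\sqrt{SST_{\text{educ}}}}
$$

$$
\operatorname{se}(\hat\beta_{IV})
= \operatorname{se}(\hat\beta_{OLS}) \cdot \frac{\hat\sigma_{IV}}{\hat\sigma_{OLS}} \cdot \frac{1}{\sqrt{R^2_{\text{educ},z}}}
\approx 0.02386 \cdot \frac{1.479}{1.353} \cdot \frac{1}{\sqrt{0.4039}}
\approx 0.041
$$

This matches the standard error of 0.04104 reported in Section 19.4. As the first-stage R² shrinks toward zero, the IV standard error grows without bound.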
19.8 Summary of Interpretation Strategy
When interpreting regression results:
- Identify the reference group.
- Interpret coefficients relative to that group.
- For interactions, compute combined effects.
- For IV models, discuss instrument validity and first stage strength.
- Distinguish statistical significance from economic importance.
19.9 Final reminders for interpretation
When writing up your regression results, avoid statements such as:
- “X has no effect” when the coefficient is not significant
- “The relationship is strong” without explaining whether you mean statistical significance, economic size, or model fit
- “p = 0.08 means insignificant” without noting that it is marginal at the 10% level
Better wording is:
- “The estimated effect is negative, but it is not statistically distinguishable from zero at the 5% level.”
- “The effect is statistically significant, but its magnitude is economically small.”
- “The coefficient is marginally significant at the 10% level, so the evidence is suggestive rather than strong.”