11 Descriptive Statistics Continuation

For this section, we use built-in datasets in R.

rm(list=ls())
gc()
##           used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells 4040357 215.8    6548841 349.8   6548841  349.8
## Vcells 8900232  68.0   99463268 758.9 202265969 1543.2
11.0.0.0.1 Built-in Datasets

Packages often ship with built-in datasets that you can use for practice.

11.0.0.0.2 Install Packages

Some of these packages may already be installed, but the code chunk below checks for all of them just in case.

ch7p <- c(
  "tidyverse",
  "psych",
  "plm",
  "AER",
  "lmtest",
  "sandwich",
  "forecast",
  "wooldridge"
)

ch7mp <- ch7p[
  !(ch7p %in% installed.packages()[,"Package"])
]

if(length(ch7mp)) {
  install.packages(ch7mp)
}

We list the packages we need, find the ones that are not yet installed, and install only those. Installation may print many warning messages, so always double-check in the Packages list that every package was actually installed. In the next step, we load the packages.
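As an alternative to the check described above, here is a minimal sketch that tests each package with requireNamespace() instead of scanning installed.packages(). The names pkgs, is_available, and missing_pkgs are illustrative and not part of the chapter's code; the two package names are placeholders that ship with R, so the sketch runs anywhere.

```r
# Placeholder package names; in this chapter you would pass ch7p instead
pkgs <- c("stats", "utils")

# TRUE for each package that can be loaded, FALSE otherwise
is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Install only what is missing
missing_pkgs <- pkgs[!is_available]
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Both approaches end with the same install.packages() call; this one only changes how the missing packages are detected.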

11.0.0.0.3 Load Packages
lapply(ch7p, library, character.only = TRUE)

The output lists the packages attached after each call to library().

11.1 Time Series Descriptive Statistics

As a continuation, let us take a deeper look at descriptive statistics for time series data. Before you model a time series, you need to understand its structure, trend, and stationarity.

11.1.0.1 Key Things to Look For:

  • Trend: Is the series increasing/decreasing over time?

  • Seasonality: Are there repeating patterns?

  • Volatility or Variance Changes: Look for periods of high/low variability.

  • Autocorrelation: Are observations correlated with past values?

These patterns tell us whether simple OLS is valid or whether we first need to transform the series (e.g., differencing, logs).
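To make the two transformations concrete, here is a minimal sketch using the built-in AirPassengers series that the example below introduces. The object names (log_y, dlog_y, sdlog_y) are illustrative only.

```r
y <- AirPassengers

# Logs compress large values, which helps when variability grows with the level
log_y <- log(y)

# First difference of the logs removes the trend (approximate monthly growth rate)
dlog_y <- diff(log_y)

# Seasonal difference compares each month with the same month one year earlier
sdlog_y <- diff(log_y, lag = 12)

# Each differencing step drops as many observations as the lag
c(raw = length(y), diff1 = length(dlog_y), diff12 = length(sdlog_y))
```

Plotting dlog_y or sdlog_y next to the raw series is a quick way to judge whether the trend or the seasonal pattern has been removed.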

11.1.0.1.1 Example:

We use the built-in AirPassengers dataset, which contains monthly totals of international airline passengers (in thousands) from 1949 to 1960.

data("AirPassengers")
ch7.ts_data<-AirPassengers
ch7.ts <- data.frame(
  Month = time(ch7.ts_data),
  Air_Passengers = as.numeric(ch7.ts_data)
)

The dataset is loaded with the data() command. Because tidyverse functions expect data frames, we convert the ts object (listed under Values in the Environment pane) into a data frame.

We extract the time index with time() because we will plot the series against it to detect trend and seasonality. The monthly passenger counts are stored as a numeric column.

11.1.0.1.2 Inspect data
str(ch7.ts)
## 'data.frame':    144 obs. of  2 variables:
##  $ Month         : Time-Series  from 1949 to 1961: 1949 1949 1949 1949 1949 ...
##  $ Air_Passengers: num  112 118 132 129 121 135 148 148 136 119 ...

11.1.1 1. Descriptive Statistics using psych

11.1.1.0.1 Use describe function
psych::describe(ch7.ts$Air_Passengers)
##    vars   n  mean     sd median trimmed    mad min max range
## X1    1 144 280.3 119.97  265.5  271.45 133.43 104 622   518
##    skew kurtosis se
## X1 0.57    -0.43 10

We need to check whether the data is skewed, highly volatile, or has unusual values. Aside from the measures of central tendency, we look at skewness.

A common rule of thumb: skewness between -0.5 and 0.5 means the distribution is roughly symmetric, an absolute value between 0.5 and 1 means it is moderately skewed, and an absolute value above 1 means it is highly skewed. Here the skewness is 0.57, so the series is moderately right-skewed.
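The rule of thumb can be checked by hand. The sketch below computes the simple moment-based skewness and applies the cutoffs; the variable names are illustrative, and the value may differ slightly in the later decimals from what psych::describe reports, because packages apply different small-sample adjustments.

```r
x <- as.numeric(AirPassengers)

# Second and third central moments
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)

# Moment-based skewness
g1 <- m3 / m2^1.5

# Apply the rule-of-thumb cutoffs
label <- if (abs(g1) < 0.5) {
  "roughly symmetric"
} else if (abs(g1) <= 1) {
  "moderately skewed"
} else {
  "highly skewed"
}

round(g1, 2)
label
```

A positive value means the right tail is longer: a few months have much higher passenger counts than the typical month.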

11.1.2 2. Plot Time Series

ggplot(ch7.ts,
       aes(x=Month, y=Air_Passengers))+
  geom_line(color="darkgreen")+
  geom_smooth(method="loess", color="tomato") + #smooth trend line
  labs(title = "AirPassengers with Smooth Trend Line", subtitle = "From 1949 to 1960")
## Don't know how to automatically pick scale for object of
## type <ts>. Defaulting to continuous.
## `geom_smooth()` using formula = 'y ~ x'

Let us improve the graph:

ch7.ts<-ch7.ts %>% 
  mutate(
    year = floor(Month),
    month = round((Month - year)*12+1),
    date = make_date(year=year, month = month, day = 1)
  )

We use floor() to round Month down to the year, then convert the decimal part to a month by multiplying by 12 and adding 1, because January counts as month 1. Previously we used as.Date() to convert character strings; here the year, month, and day are numeric, so we use make_date() instead.
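The same year and month columns can also be derived without decimal arithmetic. This is a sketch of an alternative, not the chapter's method: it uses base R's cycle() for the month and integer division for the year, and the names ts_obj, idx, year_alt, month_alt, and date_alt are illustrative.

```r
ts_obj <- AirPassengers

n    <- length(ts_obj)
s    <- start(ts_obj)       # start year and start period, here 1949 and 1
freq <- frequency(ts_obj)   # 12 observations per year

# Position within the year (1 = January, ..., 12 = December)
month_alt <- as.integer(cycle(ts_obj))

# Months elapsed since January of the start year, then whole years elapsed
idx      <- (s[2] - 1) + seq_len(n) - 1
year_alt <- as.integer(s[1] + idx %/% freq)

# Build first-of-month dates from the two integer columns
date_alt <- as.Date(sprintf("%d-%02d-01", year_alt, month_alt))

range(date_alt)
```

Because everything is integer-based, the result does not depend on how the decimal time index is rounded.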

11.1.2.0.0.1 Improved Plot
ggplot(ch7.ts,
       aes(x=date, y=Air_Passengers))+
  geom_line(color="darkgreen")+
  geom_smooth(method="loess", color="tomato", se=FALSE) + #smooth trend line
  scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
  labs(title = "AirPassengers with Trend Line", subtitle = "From 1949 to 1960",
       x = "Year",
       y = "Passengers")+
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

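# Base R equivalent of the plot above
x <- as.numeric(ch7.ts$Month)   # decimal year
y <- ch7.ts$Air_Passengers      # monthly passenger counts
plot(x, y, type = "l", col = "darkgreen", xlab = "Year", ylab = "Passengers")
fit <- loess(y ~ x)             # local regression for the trend
ord <- order(x)
lines(x[ord], fitted(fit)[ord], col = "tomato", lwd = 2)

  • geom_line shows the raw monthly counts of Air Passengers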

  • geom_smooth(method = "loess") fits a smooth curve to highlight trend, ignoring short-term fluctuations

    • LOESS is a local regression, so it can capture curves, trends, and nonlinearity. Here the trend in Air Passengers accelerates and then slows down, which a single straight line cannot follow, so LOESS is a better choice than linear regression for showing the trend.
  • scale_x_date labels every year on the x-axis instead of only some years. You can also show months by adding %b to date_labels, but the axis looks cluttered, so it is left out here.

11.1.3 3. Autocorrelation

The autocorrelation function (ACF) and partial autocorrelation function (PACF) detect whether observations are correlated over time. Such correlation violates the OLS independence assumption.

acf(ch7.ts$Air_Passengers, main="ACF: Monthly AirPassengers")

pacf(ch7.ts$Air_Passengers, main="PACF: Monthly AirPassengers")

How to Interpret:

| Plot | Significant? | Implication for OLS/modeling |
| --- | --- | --- |
| ACF | Spike outside blue dashed lines | Residuals may be autocorrelated, which violates the OLS independence assumption |
| PACF | Spike outside blue dashed lines | Direct lagged effect, so we may need to lag regressors in OLS |
| Multiple PACF lags | Multiple spikes beyond bounds | OLS residuals are correlated, so we have to consider other models such as ARIMA, dynamic regression, or IV |

In summary, spikes in the PACF of the residuals mean that OLS standard errors may be invalid, so we need to lag variables or use IV.
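The dashed bounds can also be checked numerically. The sketch below assumes a simple linear-trend OLS regression, chosen here only for illustration, and counts which residual lags fall outside the default 95% bounds of plus or minus qnorm(0.975)/sqrt(n). All object names are illustrative.

```r
y     <- as.numeric(AirPassengers)
t_idx <- seq_along(y)

# Illustrative OLS model: passengers on a linear time trend
fit <- lm(y ~ t_idx)
res <- residuals(fit)

# Default 95% bounds drawn as dashed lines by acf() and pacf()
n     <- length(res)
bound <- qnorm(0.975) / sqrt(n)

# Compute the correlations without plotting
res_acf  <- acf(res, lag.max = 24, plot = FALSE)
res_pacf <- pacf(res, lag.max = 24, plot = FALSE)

acf_vals  <- as.numeric(res_acf$acf)[-1]   # drop lag 0, which is always 1
pacf_vals <- as.numeric(res_pacf$acf)      # pacf() starts at lag 1

# Lags (in months) whose spikes fall outside the bounds
which(abs(acf_vals) > bound)
which(abs(pacf_vals) > bound)
```

If either call returns any lags, the residuals of this trend regression are autocorrelated, and the usual OLS standard errors should not be trusted as they are.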

11.2 Panel Data: Within vs Between Variation

Why do we distinguish within from between variation? Note the following:

  • Within:

    • Variation of an individual over time

    • Example: NCR GDP changes from year to year

    • Useful when estimating fixed-effects models because we want to isolate how changes within an individual/entity affect the dependent variable

  • Between:

    • Variation across individuals, comparing their averages over time

    • Example: Average GDP differs across NCR, CAR, BARMM

    • Useful when estimating random-effects models or when comparing differences across individuals/entities

    Note: If you ignore the distinction between within and between variation, pooled OLS estimates can be biased when unobserved entity characteristics are correlated with the regressors.
    Fixed-effects OLS controls for unobserved, time-invariant heterogeneity, so it uses within variation only.

    What is time-invariant heterogeneity? A constant, hidden characteristic of an entity in panel data that does not change over time.
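The split between within and between variation can be written out as a sum-of-squares decomposition. The sketch below uses a tiny panel with made-up GDP figures (hypothetical numbers, not real data) so that the arithmetic is easy to follow; the object names are illustrative.

```r
# Hypothetical GDP figures, invented purely for illustration
panel <- data.frame(
  region = rep(c("NCR", "CAR", "BARMM"), each = 3),
  year   = rep(2021:2023, times = 3),
  gdp    = c(100, 104, 109, 20, 21, 23, 15, 15, 16)
)

# Each region's average over time, repeated on every row of that region
region_mean <- ave(panel$gdp, panel$region)
grand_mean  <- mean(panel$gdp)

# Within: deviation of each observation from its own region's average
within_dev  <- panel$gdp - region_mean
# Between: deviation of each region's average from the overall average
between_dev <- region_mean - grand_mean

total_ss   <- sum((panel$gdp - grand_mean)^2)
within_ss  <- sum(within_dev^2)
between_ss <- sum(between_dev^2)

c(total = total_ss, within = within_ss, between = between_ss)
```

The total sum of squares equals the within part plus the between part. In this made-up panel, almost all of the variation is between regions, so a fixed-effects model would have very little variation left to work with.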

11.2.0.0.1 Example:

We use the Grunfeld dataset, which is built into the plm package for panel data modeling.

data("Grunfeld", package="plm")
11.2.0.0.2 Inspect data
str(Grunfeld)
## 'data.frame':    200 obs. of  5 variables:
##  $ firm   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ year   : int  1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 ...
##  $ inv    : num  318 392 411 258 331 ...
##  $ value  : num  3078 4662 5387 2792 4313 ...
##  $ capital: num  2.8 52.6 156.9 209.2 203.4 ...

11.2.0.1 1. Plot Panel Data

11.2.0.1.1 a. Within variation: investment over time per firm
ggplot(Grunfeld,
       aes(x=year, y=inv, group=firm, color=factor(firm)))+
  geom_line()+
  labs(title="Within variation: investment over time by firm", x="Year", y="Investment", color = "Firm")+
  theme_classic()

  • A few firms show large changes in investment over time, but most show little variation. Within variation is therefore substantial for those few firms and small for the rest.
11.2.0.1.2 b. Between variation
Grunfeld %>%
  group_by(firm) %>%
  summarise(mean_invest = mean(inv)) %>%
  ggplot(aes(x=firm, y=mean_invest)) +
  geom_col(fill="purple") +
  labs(title="Between Variation: Average Investment by Firm", x="Firm", y="Avg Investment")+
  theme_bw()

  • Firms 1 and 2 have much higher average investment than the others. Unlike the previous plot, between variation compares averages across firms and ignores time.

Important: Choosing the right model (FE vs RE) depends on whether your research question is about changes within entities or differences across entities.
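As a rough illustration of how the choice plays out in code, the sketch below fits the same regression with three estimators from plm. The specification inv ~ value + capital is an assumption made here for illustration and is not taken from this chapter; the object names pooled, fe, and re are also illustrative.

```r
library(plm)
data("Grunfeld", package = "plm")

# Pooled OLS: ignores the panel structure
pooled <- plm(inv ~ value + capital, data = Grunfeld,
              index = c("firm", "year"), model = "pooling")

# Fixed effects: uses within-firm variation only
fe <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "within")

# Random effects: combines within and between variation
re <- plm(inv ~ value + capital, data = Grunfeld,
          index = c("firm", "year"), model = "random")

# Slope estimates side by side
rbind(pooled = coef(pooled)[c("value", "capital")],
      fe     = coef(fe),
      re     = coef(re)[c("value", "capital")])

# Hausman test: a small p-value casts doubt on the random-effects
# assumption that firm effects are uncorrelated with the regressors
phtest(fe, re)
```

The Hausman test is one common check to run alongside the research question when deciding between FE and RE; it does not replace thinking about which kind of variation you want to explain.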