7 Data Management - Time Series and Panel Data

7.1 Time Series Data

Time series analysis requires correct date formats, proper chronological ordering, etc.

For this, we will use Chapter3 Practice found in the Modules.

7.1.1 Preliminaries

Always remember the first steps: Set Working Directory and Clean the Global Environment.

rm(list=ls())
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 3165374 169.1    5173005 276.3  5173005 276.3
## Vcells 6228023  47.6   12255594  93.6 10624508  81.1

To make sure that you are using the correct directory and that you have all the files you need, use list.files() function.

list.files()

Now, we load the following packages: dplyr, lubridate, and zoo. Make sure you have all 3 installed; if not, install them.

library(dplyr)
library(lubridate)
library(zoo)

7.1.2 Create a Simple Date Dataset

date_date<-data.frame(
  ID = 1:5,
  dob = c("15-05-1990", "20-08-1985", "01-12-2000", "10-03-1995", "25-07-2010"),
  stringsAsFactors = FALSE
)

print(date_date)
##   ID        dob
## 1  1 15-05-1990
## 2  2 20-08-1985
## 3  3 01-12-2000
## 4  4 10-03-1995
## 5  5 25-07-2010
str(date_date)
## 'data.frame':    5 obs. of  2 variables:
##  $ ID : int  1 2 3 4 5
##  $ dob: chr  "15-05-1990" "20-08-1985" "01-12-2000" "10-03-1995" ...

I won’t repeat what the code has except for new ones like ID which creates a numeric sequence for the ID column. date_of_birth = c("15-05-1990",...) are dates stored as character strings. stringsAsFactors = FALSE because R sometimes turns text into factors (categories) so by making it FALSE, it ensures dates remain character strings not categories

As seen, the dob is in character format:

7.1.3 Converting Character Dates to Date Format

date_date$dob<- as.Date(
  date_date$dob,
  format = "%d-%m-%Y"
)
class(date_date$dob)
## [1] "Date"

So, we converted character strings into Date objects. With this, we can calculate age, lags, etc.

7.1.4 Calculate Age:

Say you want to calculate the age:

Before, we added columns through mutate; this time, we add columns through creating a new object.

date_date$age <- as.numeric(floor((Sys.Date()-date_date$dob)/365.25))

as.Date() converts strings to dates while floor() simply rounds down.

7.1.4.1 Custom Reference Date:

ref_date<-as.Date("2020-12-20") #uses YYYY-MM-DD as ISO standard that R recognizes immediately
date_date$age2<- as.numeric(
  floor((ref_date-date_date$dob)/365.25)
)

7.1.5 Using Time Series Data

We load Daily Bitcoin Data

ch3_p1<-read.csv("Ch3Practice.csv")
head(ch3_p1)
##           ds       y
## 1 2015-06-13 232.402
## 2 2015-06-14 233.543
## 3 2015-06-15 236.823
## 4 2015-06-16 250.895
## 5 2015-06-17 249.284
## 6 2015-06-18 249.007

We need to understand our data;

str(ch3_p1)
## 'data.frame':    1825 obs. of  2 variables:
##  $ ds: chr  "2015-06-13" "2015-06-14" "2015-06-15" "2015-06-16" ...
##  $ y : num  232 234 237 251 249 ...

7.1.6 Convert to Date Format

As you can see, our ds is what our date column is, however, it is in the character format. We need to convert it to the Date class. We also need to do this to a copy of the raw data for further modifications.

ch3_p1.1<-ch3_p1
head(ch3_p1.1$ds)
## [1] "2015-06-13" "2015-06-14" "2015-06-15" "2015-06-16"
## [5] "2015-06-17" "2015-06-18"
class(ch3_p1.1$ds)
## [1] "character"
ch3_p1.1$ds <- as.Date(ch3_p1.1$ds, format = "%Y-%m-%d")

class(ch3_p1.1$ds)
## [1] "Date"

7.1.7 Aggregate Data

We now have daily data. Say we want to create weekly data, we use the cut function to group dates by week, month, quarter and year.

Since we know for sure that the date column is in date format, no need to check, however, it is always useful to check the class of date.

7.1.7.1 Aggregate by Week

Add new columns for each aggregation.

ch3_p1.1$week<-cut(ch3_p1.1$ds, breaks = "week")

The cut is used to divide the date into intervals while the breaks specifies that the date be divided into weekly intervals.

Check creation of the week column

head(ch3_p1.1)
##           ds       y       week
## 1 2015-06-13 232.402 2015-06-08
## 2 2015-06-14 233.543 2015-06-08
## 3 2015-06-15 236.823 2015-06-15
## 4 2015-06-16 250.895 2015-06-15
## 5 2015-06-17 249.284 2015-06-15
## 6 2015-06-18 249.007 2015-06-15

Since we have the y column which is actually Bitcoin Price, we need to aggregate that weekly using the aggregate function

week_y<-aggregate(y ~ week, #refers to column for bitcoin
                  data=ch3_p1.1, 
                  FUN = mean #specifies that the mean function should be applied to the numeric column within each week
                  )

You will notice that this creates a separate data frame. We will merge week_y with ch3_p1.1

ch3_p1.1<-merge(ch3_p1.1, week_y, by = "week", #ensures the merge aligns based on the week
                suffixes = c("","_weekly")) #adds _weekly to the column name to distinguish them from the original columns

We will slightly do the same thing when aggregating by month, quarter and year. I will do the initial steps, but please do the succeeding steps on your own.

7.1.7.2 Aggregate by Month

# Add a month column
ch3_p1.1$month <- format(ch3_p1.1$ds, "%Y-%m")
# Calculate monthly means
month_y <- aggregate(y ~ month, data = ch3_p1.1, FUN = mean)

Do the next steps as well as inspection on your own.

7.1.7.3 Aggregate by Quarter

This is different since we will use the paste0 and the format functions. The format function extracts the year from the date and extracts the quarter from the date. The paste0 combines the year and quarter without a space between them so that it results in which quarter of which year.

ch3_p1.1$quarter <- paste0(format(ch3_p1.1$ds, "%Y"), " ", quarters(ch3_p1.1$ds))

7.1.7.4 Aggregate by Year

ch3_p1.1$year<-format(ch3_p1.1$ds, "%Y")

Can you aggregate the Bitcoin values quarterly and yearly on your own with inspection?

7.2 Panel Data

We will use a package in R containing different datasets.

if(!("wooldridge" %in% installed.packages()[,"Package"]))
  install.packages("wooldridge")
library(wooldridge)
data("wagepan")

Unlike importing CSV, Excel or RData, since the data is found in a package, we call for the data through loading the package then data("dataset_name"). The dataset will appear in the environment.

str(wagepan)
## 'data.frame':    4360 obs. of  44 variables:
##  $ nr      : int  13 13 13 13 13 13 13 13 17 17 ...
##  $ year    : int  1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
##  $ agric   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ black   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bus     : int  1 0 1 1 0 1 1 1 0 0 ...
##  $ construc: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ent     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ exper   : int  1 2 3 4 5 6 7 8 4 5 ...
##  $ fin     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hisp    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poorhlth: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours   : int  2672 2320 2940 2960 3071 2864 2994 2640 2484 2804 ...
##  $ manuf   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ married : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ min     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nrthcen : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nrtheast: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ occ1    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ occ2    : int  0 0 0 0 0 1 1 1 1 1 ...
##  $ occ3    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ occ4    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ occ5    : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ occ6    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ occ7    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ occ8    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ occ9    : int  1 1 1 1 0 0 0 0 0 0 ...
##  $ per     : int  0 1 0 0 1 0 0 0 0 0 ...
##  $ pro     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pub     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ rur     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ south   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ educ    : int  14 14 14 14 14 14 14 14 13 13 ...
##  $ tra     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ trad    : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ union   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ lwage   : num  1.2 1.85 1.34 1.43 1.57 ...
##  $ d81     : int  0 1 0 0 0 0 0 0 0 1 ...
##  $ d82     : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ d83     : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ d84     : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ d85     : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ d86     : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ d87     : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ expersq : int  1 4 9 16 25 36 49 64 16 25 ...
##  - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"

In practice, it is best to leave the raw dataset untouched. Create a copy of the dataset and that is where you do modifications.

ch3_p2<-wagepan
str(ch3_p2)
## 'data.frame':    4360 obs. of  44 variables:
##  $ nr      : int  13 13 13 13 13 13 13 13 17 17 ...
##  $ year    : int  1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
##  $ agric   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ black   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bus     : int  1 0 1 1 0 1 1 1 0 0 ...
##  $ construc: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ent     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ exper   : int  1 2 3 4 5 6 7 8 4 5 ...
##  $ fin     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hisp    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poorhlth: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours   : int  2672 2320 2940 2960 3071 2864 2994 2640 2484 2804 ...
##  $ manuf   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ married : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ min     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nrthcen : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nrtheast: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ occ1    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ occ2    : int  0 0 0 0 0 1 1 1 1 1 ...
##  $ occ3    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ occ4    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ occ5    : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ occ6    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ occ7    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ occ8    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ occ9    : int  1 1 1 1 0 0 0 0 0 0 ...
##  $ per     : int  0 1 0 0 1 0 0 0 0 0 ...
##  $ pro     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ pub     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ rur     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ south   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ educ    : int  14 14 14 14 14 14 14 14 13 13 ...
##  $ tra     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ trad    : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ union   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ lwage   : num  1.2 1.85 1.34 1.43 1.57 ...
##  $ d81     : int  0 1 0 0 0 0 0 0 0 1 ...
##  $ d82     : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ d83     : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ d84     : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ d85     : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ d86     : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ d87     : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ expersq : int  1 4 9 16 25 36 49 64 16 25 ...
##  - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"

7.2.1 Sorting Panel Data

We need to sort panel data so that they are ordered by individual (id), then time. This will help you when you do lags and differences.

ch3_p2<-ch3_p2 %>% 
  arrange(nr, year)
head(ch3_p2)
##   nr year agric black bus construc ent exper fin hisp
## 1 13 1980     0     0   1        0   0     1   0    0
## 2 13 1981     0     0   0        0   0     2   0    0
## 3 13 1982     0     0   1        0   0     3   0    0
## 4 13 1983     0     0   1        0   0     4   0    0
## 5 13 1984     0     0   0        0   0     5   0    0
## 6 13 1985     0     0   1        0   0     6   0    0
##   poorhlth hours manuf married min nrthcen nrtheast occ1
## 1        0  2672     0       0   0       0        1    0
## 2        0  2320     0       0   0       0        1    0
## 3        0  2940     0       0   0       0        1    0
## 4        0  2960     0       0   0       0        1    0
## 5        0  3071     0       0   0       0        1    0
## 6        0  2864     0       0   0       0        1    0
##   occ2 occ3 occ4 occ5 occ6 occ7 occ8 occ9 per pro pub rur
## 1    0    0    0    0    0    0    0    1   0   0   0   0
## 2    0    0    0    0    0    0    0    1   1   0   0   0
## 3    0    0    0    0    0    0    0    1   0   0   0   0
## 4    0    0    0    0    0    0    0    1   0   0   0   0
## 5    0    0    0    1    0    0    0    0   1   0   0   0
## 6    1    0    0    0    0    0    0    0   0   0   0   0
##   south educ tra trad union    lwage d81 d82 d83 d84 d85 d86
## 1     0   14   0    0     0 1.197540   0   0   0   0   0   0
## 2     0   14   0    0     1 1.853060   1   0   0   0   0   0
## 3     0   14   0    0     0 1.344462   0   1   0   0   0   0
## 4     0   14   0    0     0 1.433213   0   0   1   0   0   0
## 5     0   14   0    0     0 1.568125   0   0   0   1   0   0
## 6     0   14   0    0     0 1.699891   0   0   0   0   1   0
##   d87 expersq
## 1   0       1
## 2   0       4
## 3   0       9
## 4   0      16
## 5   0      25
## 6   0      36

7.2.2 Checking Panel Balance

When using panel data, checking for a balanced panel is necessary to know if every individual is observed every time period. An unbalanced panel means some individuals are missing years. This matters because many econometric assumptions rely on balance; such as selection bias in the sense that when you do not detect imbalance, you might miss events like firms exiting the market or respondents dropping out.

table(table(ch3_p2$nr))
## 
##   8 
## 545

From the result, we know that each of the 545 individuals were observed 8 times. For unbalanced panel, you will see an output with multiple numbers as these indicate the missing periods and individuals.

The table(table(id)) tells you whether each individual has the same amount of time in the panel.

7.2.3 Creating Lagged Variables

We know that economic processes do not happen simultaneously, that is why it is important to use lags. Lags enforce causality, not correlation.

In panel data, lags are within-individual;

So, the question you are trying to answer here is:

“How did this individual’s past outcome affect this person’s current outcome?”

When using the Fixed Effects which rely on within-unit variation over time, lag adjustments are needed because without them, fixed effects would not work.

ch3_p2<-ch3_p2 %>% 
  group_by(nr) %>% 
  mutate(lwage_lag = lag(lwage))
head(ch3_p2)
## # A tibble: 6 × 45
## # Groups:   nr [1]
##      nr  year agric black   bus construc   ent exper   fin
##   <int> <int> <int> <int> <int>    <int> <int> <int> <int>
## 1    13  1980     0     0     1        0     0     1     0
## 2    13  1981     0     0     0        0     0     2     0
## 3    13  1982     0     0     1        0     0     3     0
## 4    13  1983     0     0     1        0     0     4     0
## 5    13  1984     0     0     0        0     0     5     0
## 6    13  1985     0     0     1        0     0     6     0
## # ℹ 36 more variables: hisp <int>, poorhlth <int>,
## #   hours <int>, manuf <int>, married <int>, min <int>,
## #   nrthcen <int>, nrtheast <int>, occ1 <int>, occ2 <int>,
## #   occ3 <int>, occ4 <int>, occ5 <int>, occ6 <int>,
## #   occ7 <int>, occ8 <int>, occ9 <int>, per <int>,
## #   pro <int>, pub <int>, rur <int>, south <int>,
## #   educ <int>, tra <int>, trad <int>, union <int>, …

It is important to do group_by because each individual has their own time line and the lags stay within individuals. Therefore, what this does is treating each individual as separate time series.

7.2.4 Missing Data in Panels

colSums(is.na(ch3_p2))
##        nr      year     agric     black       bus  construc 
##         0         0         0         0         0         0 
##       ent     exper       fin      hisp  poorhlth     hours 
##         0         0         0         0         0         0 
##     manuf   married       min   nrthcen  nrtheast      occ1 
##         0         0         0         0         0         0 
##      occ2      occ3      occ4      occ5      occ6      occ7 
##         0         0         0         0         0         0 
##      occ8      occ9       per       pro       pub       rur 
##         0         0         0         0         0         0 
##     south      educ       tra      trad     union     lwage 
##         0         0         0         0         0         0 
##       d81       d82       d83       d84       d85       d86 
##         0         0         0         0         0         0 
##       d87   expersq lwage_lag 
##         0         0       545
ch3_p2.1<-ch3_p2 %>% 
  filter(!is.na(lwage_lag))

7.2.5 Creating Growth Rates

ch3_p2 <- ch3_p2 %>%
  group_by(nr) %>%
  mutate(
    wg = lwage - lag(lwage)
  )
head(ch3_p2$wg, 10) #due to number of cols, chose the one we just created and specified 10 observations
##  [1]          NA  0.65551984 -0.50859833  0.08875167
##  [5]  0.13491178  0.13176584 -2.42015356  2.38945049
##  [9]          NA -0.15756428

What we did was create within-individual wage growth over time.

Remember our discussion on lag?

It will answer: “How much did this individual’s wage change from the previous period?”

Furthermore, we are doing dynamic analysis here. We also note that the first observation per person has no lag.

7.3 Closing

  1. Quiz questions will be uploaded in Animospace.
  2. You have 15 minutes to answer.
  3. Clean the environment and free the memory AFTER the quiz.