7 Data Management - Time Series and Panel Data
7.1 Time Series Data
Time series analysis requires dates stored in the correct format and observations kept in proper chronological order.
For this, we will use the Chapter3 Practice file found in the Modules.
7.1.1 Preliminaries
Always remember the first steps: Set Working Directory and Clean the Global Environment.
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3165374 169.1 5173005 276.3 5173005 276.3
## Vcells 6228023 47.6 12255594 93.6 10624508 81.1
To make sure that you are using the correct directory and that you have all the files you need, use the list.files() function.
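These first steps can be sketched as follows; the path shown is a placeholder that you should replace with your own folder:

```r
# Set the working directory (replace with your own path)
setwd("C:/Users/yourname/Documents/Econ")

# Clean the Global Environment and free memory
rm(list = ls())
gc()

# Confirm the directory contents
list.files()
```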
Now, we load the following packages: dplyr, lubridate, and zoo. Make sure you have all 3 installed; if not, install them.
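A sketch of the package setup, installing any of the three that are missing before loading:

```r
# Install any missing packages first, then load them
packages <- c("dplyr", "lubridate", "zoo")
missing <- packages[!packages %in% installed.packages()[, "Package"]]
if (length(missing) > 0) install.packages(missing)

library(dplyr)
library(lubridate)
library(zoo)
```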
7.1.2 Create a Simple Date Dataset
date_date <- data.frame(
  ID = 1:5,
  dob = c("15-05-1990", "20-08-1985", "01-12-2000", "10-03-1995", "25-07-2010"),
  stringsAsFactors = FALSE
)
print(date_date)
## ID dob
## 1 1 15-05-1990
## 2 2 20-08-1985
## 3 3 01-12-2000
## 4 4 10-03-1995
## 5 5 25-07-2010
str(date_date)
## 'data.frame': 5 obs. of 2 variables:
## $ ID : int 1 2 3 4 5
## $ dob: chr "15-05-1990" "20-08-1985" "01-12-2000" "10-03-1995" ...
I won’t repeat explanations for code we have already covered, only the new pieces. ID = 1:5 creates a numeric sequence for the ID column. dob = c("15-05-1990", ...) stores the dates as character strings. We set stringsAsFactors = FALSE because R sometimes turns text into factors (categories); setting it to FALSE ensures the dates remain character strings, not categories.
As seen, the dob is in character format:
7.1.3 Converting Character Dates to Date Format
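The conversion that produces the output below is a one-liner; since the dob strings are in day-month-year order, we must tell as.Date() the format explicitly:

```r
# Convert the character dob column to Date; the strings are day-month-year
date_date$dob <- as.Date(date_date$dob, format = "%d-%m-%Y")
class(date_date$dob)
```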
## [1] "Date"
So, we converted character strings into Date objects. With this, we can calculate age, lags, etc.
7.1.4 Calculate Age
Say you want to calculate the age:
Previously, we added columns with mutate(); this time, we add a column by creating a new object.
as.Date() converts strings to dates, while floor() rounds down to the nearest whole number.
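A minimal sketch of the age calculation; the object name age_data and the use of Sys.Date() (today) as the reference date are assumptions:

```r
# Create a new object instead of using mutate()
age_data <- date_date

# Days from birth to today, divided by the average year length (365.25),
# then rounded down with floor() to whole years
age_data$age <- floor(
  as.numeric(difftime(Sys.Date(),
                      as.Date(age_data$dob, format = "%d-%m-%Y"),
                      units = "days")) / 365.25
)

print(age_data)
```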
7.1.5 Using Time Series Data
We load the daily Bitcoin data:
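A sketch of the import, assuming the data comes as a CSV; the filename bitcoin_daily.csv and the raw object name ch3_p1 are assumptions (the working copy ch3_p1.1 appears later in this section):

```r
# Read the daily Bitcoin data (filename is an assumption)
ch3_p1 <- read.csv("bitcoin_daily.csv", stringsAsFactors = FALSE)
head(ch3_p1)
```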
## ds y
## 1 2015-06-13 232.402
## 2 2015-06-14 233.543
## 3 2015-06-15 236.823
## 4 2015-06-16 250.895
## 5 2015-06-17 249.284
## 6 2015-06-18 249.007
We need to understand our data:
## 'data.frame': 1825 obs. of 2 variables:
## $ ds: chr "2015-06-13" "2015-06-14" "2015-06-15" "2015-06-16" ...
## $ y : num 232 234 237 251 249 ...
7.1.6 Convert to Date Format
As you can see, ds is our date column; however, it is in character format. We need to convert it to the Date class. We also work on a copy of the raw data so that further modifications do not alter the original.
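A sketch of the copy-and-convert step; ch3_p1.1 is the working-copy name used in the aggregation code of this section, while ch3_p1 as the raw object name is an assumption:

```r
# Work on a copy so the raw data stays untouched
ch3_p1.1 <- ch3_p1

# ds is already in "YYYY-MM-DD" form, the default format for as.Date()
ch3_p1.1$ds <- as.Date(ch3_p1.1$ds)

head(ch3_p1.1$ds)
class(ch3_p1$ds)    # the raw data is still character
class(ch3_p1.1$ds)  # the working copy is now Date
```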
## [1] "2015-06-13" "2015-06-14" "2015-06-15" "2015-06-16"
## [5] "2015-06-17" "2015-06-18"
## [1] "character"
## [1] "Date"
7.1.7 Aggregate Data
We now have daily data. Say we want weekly data: we use the cut() function, which can group dates by week, month, quarter, or year.
We already know the date column is in Date format, so strictly there is no need to check; however, it is always good practice to verify the class of the date column.
7.1.7.1 Aggregate by Week
Add new columns for each aggregation.
cut() divides the dates into intervals, while breaks = "week" specifies that the dates be grouped into weekly intervals.
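Putting that into code:

```r
# Assign each date to the week that contains it (weeks start on Monday)
ch3_p1.1$week <- cut(ch3_p1.1$ds, breaks = "week")
```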
Check the creation of the week column:
## ds y week
## 1 2015-06-13 232.402 2015-06-08
## 2 2015-06-14 233.543 2015-06-08
## 3 2015-06-15 236.823 2015-06-15
## 4 2015-06-16 250.895 2015-06-15
## 5 2015-06-17 249.284 2015-06-15
## 6 2015-06-18 249.007 2015-06-15
Since the y column is actually the Bitcoin price, we aggregate it weekly using the aggregate() function.
week_y <- aggregate(y ~ week, #y refers to the column for the Bitcoin price
                    data = ch3_p1.1,
                    FUN = mean #apply the mean to the numeric column within each week
                    )
You will notice that this creates a separate data frame, so we merge week_y back into ch3_p1.1:
ch3_p1.1 <- merge(ch3_p1.1, week_y, by = "week", #aligns the merge on the week column
                  suffixes = c("", "_weekly")) #appends _weekly to distinguish the new columns from the originals
We will do much the same thing when aggregating by month, quarter, and year. I will do the initial steps, but please do the succeeding steps on your own.
7.1.7.3 Aggregate by Quarter
This is different because we use the paste0() and format() functions. The format() function extracts the year from the date, and the quarter is obtained from the date as well; paste0() then combines the year and quarter without a space between them, so the result indicates which quarter of which year each observation belongs to.
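One way to implement this, using base R's quarters() to obtain the quarter (the exact helper used here is an assumption):

```r
# Combine the year ("%Y") and the quarter (e.g. "Q2") with no space: "2015Q2"
ch3_p1.1$quarter <- paste0(format(ch3_p1.1$ds, "%Y"), quarters(ch3_p1.1$ds))

# Aggregate y by quarter, exactly as we did by week
quarter_y <- aggregate(y ~ quarter, data = ch3_p1.1, FUN = mean)
```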
7.2 Panel Data
We will use an R package that contains various datasets.
Unlike importing a CSV, Excel, or RData file, since the data lives inside a package, we access it by loading the package and then calling data("dataset_name"). The dataset will then appear in the Environment.
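The variable names in the output below (nr, year, exper, lwage, and so on) match the wagepan dataset, so the load step was presumably along these lines, assuming the wooldridge package:

```r
# install.packages("wooldridge")  # once, if not yet installed
library(wooldridge)
data("wagepan")  # the dataset now appears in the Environment
str(wagepan)
```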
## 'data.frame': 4360 obs. of 44 variables:
## $ nr : int 13 13 13 13 13 13 13 13 17 17 ...
## $ year : int 1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
## $ agric : int 0 0 0 0 0 0 0 0 0 0 ...
## $ black : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bus : int 1 0 1 1 0 1 1 1 0 0 ...
## $ construc: int 0 0 0 0 0 0 0 0 0 0 ...
## $ ent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ exper : int 1 2 3 4 5 6 7 8 4 5 ...
## $ fin : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hisp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poorhlth: int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours : int 2672 2320 2940 2960 3071 2864 2994 2640 2484 2804 ...
## $ manuf : int 0 0 0 0 0 0 0 0 0 0 ...
## $ married : int 0 0 0 0 0 0 0 0 0 0 ...
## $ min : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nrthcen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nrtheast: int 1 1 1 1 1 1 1 1 1 1 ...
## $ occ1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ occ2 : int 0 0 0 0 0 1 1 1 1 1 ...
## $ occ3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ occ4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ occ5 : int 0 0 0 0 1 0 0 0 0 0 ...
## $ occ6 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ occ7 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ occ8 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ occ9 : int 1 1 1 1 0 0 0 0 0 0 ...
## $ per : int 0 1 0 0 1 0 0 0 0 0 ...
## $ pro : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pub : int 0 0 0 0 0 0 0 0 0 0 ...
## $ rur : int 0 0 0 0 0 0 0 0 0 0 ...
## $ south : int 0 0 0 0 0 0 0 0 0 0 ...
## $ educ : int 14 14 14 14 14 14 14 14 13 13 ...
## $ tra : int 0 0 0 0 0 0 0 0 0 0 ...
## $ trad : int 0 0 0 0 0 0 0 0 1 1 ...
## $ union : int 0 1 0 0 0 0 0 0 0 0 ...
## $ lwage : num 1.2 1.85 1.34 1.43 1.57 ...
## $ d81 : int 0 1 0 0 0 0 0 0 0 1 ...
## $ d82 : int 0 0 1 0 0 0 0 0 0 0 ...
## $ d83 : int 0 0 0 1 0 0 0 0 0 0 ...
## $ d84 : int 0 0 0 0 1 0 0 0 0 0 ...
## $ d85 : int 0 0 0 0 0 1 0 0 0 0 ...
## $ d86 : int 0 0 0 0 0 0 1 0 0 0 ...
## $ d87 : int 0 0 0 0 0 0 0 1 0 0 ...
## $ expersq : int 1 4 9 16 25 36 49 64 16 25 ...
## - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
In practice, it is best to leave the raw dataset untouched. Create a copy of the dataset and make your modifications there.
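Following that advice, and assuming the dataset is wagepan (the copy name ch3_p2 is the one used in the code later in this chapter):

```r
# All modifications happen on the copy; wagepan itself stays untouched
ch3_p2 <- wagepan
str(ch3_p2)
```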
## 'data.frame': 4360 obs. of 44 variables:
## $ nr : int 13 13 13 13 13 13 13 13 17 17 ...
## $ year : int 1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
## $ agric : int 0 0 0 0 0 0 0 0 0 0 ...
## $ black : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bus : int 1 0 1 1 0 1 1 1 0 0 ...
## $ construc: int 0 0 0 0 0 0 0 0 0 0 ...
## $ ent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ exper : int 1 2 3 4 5 6 7 8 4 5 ...
## $ fin : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hisp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poorhlth: int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours : int 2672 2320 2940 2960 3071 2864 2994 2640 2484 2804 ...
## $ manuf : int 0 0 0 0 0 0 0 0 0 0 ...
## $ married : int 0 0 0 0 0 0 0 0 0 0 ...
## $ min : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nrthcen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nrtheast: int 1 1 1 1 1 1 1 1 1 1 ...
## $ occ1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ occ2 : int 0 0 0 0 0 1 1 1 1 1 ...
## $ occ3 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ occ4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ occ5 : int 0 0 0 0 1 0 0 0 0 0 ...
## $ occ6 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ occ7 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ occ8 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ occ9 : int 1 1 1 1 0 0 0 0 0 0 ...
## $ per : int 0 1 0 0 1 0 0 0 0 0 ...
## $ pro : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pub : int 0 0 0 0 0 0 0 0 0 0 ...
## $ rur : int 0 0 0 0 0 0 0 0 0 0 ...
## $ south : int 0 0 0 0 0 0 0 0 0 0 ...
## $ educ : int 14 14 14 14 14 14 14 14 13 13 ...
## $ tra : int 0 0 0 0 0 0 0 0 0 0 ...
## $ trad : int 0 0 0 0 0 0 0 0 1 1 ...
## $ union : int 0 1 0 0 0 0 0 0 0 0 ...
## $ lwage : num 1.2 1.85 1.34 1.43 1.57 ...
## $ d81 : int 0 1 0 0 0 0 0 0 0 1 ...
## $ d82 : int 0 0 1 0 0 0 0 0 0 0 ...
## $ d83 : int 0 0 0 1 0 0 0 0 0 0 ...
## $ d84 : int 0 0 0 0 1 0 0 0 0 0 ...
## $ d85 : int 0 0 0 0 0 1 0 0 0 0 ...
## $ d86 : int 0 0 0 0 0 0 1 0 0 0 ...
## $ d87 : int 0 0 0 0 0 0 0 1 0 0 ...
## $ expersq : int 1 4 9 16 25 36 49 64 16 25 ...
## - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
7.2.1 Sorting Panel Data
We need to sort panel data so that observations are ordered by individual (nr), then by time (year). This will help later when we compute lags and differences.
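A sketch of the sort using dplyr's arrange(); base R's order() would work just as well:

```r
library(dplyr)

ch3_p2 <- ch3_p2 %>%
  arrange(nr, year)  # sort by individual first, then by year within individual

head(ch3_p2)
```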
## nr year agric black bus construc ent exper fin hisp
## 1 13 1980 0 0 1 0 0 1 0 0
## 2 13 1981 0 0 0 0 0 2 0 0
## 3 13 1982 0 0 1 0 0 3 0 0
## 4 13 1983 0 0 1 0 0 4 0 0
## 5 13 1984 0 0 0 0 0 5 0 0
## 6 13 1985 0 0 1 0 0 6 0 0
## poorhlth hours manuf married min nrthcen nrtheast occ1
## 1 0 2672 0 0 0 0 1 0
## 2 0 2320 0 0 0 0 1 0
## 3 0 2940 0 0 0 0 1 0
## 4 0 2960 0 0 0 0 1 0
## 5 0 3071 0 0 0 0 1 0
## 6 0 2864 0 0 0 0 1 0
## occ2 occ3 occ4 occ5 occ6 occ7 occ8 occ9 per pro pub rur
## 1 0 0 0 0 0 0 0 1 0 0 0 0
## 2 0 0 0 0 0 0 0 1 1 0 0 0
## 3 0 0 0 0 0 0 0 1 0 0 0 0
## 4 0 0 0 0 0 0 0 1 0 0 0 0
## 5 0 0 0 1 0 0 0 0 1 0 0 0
## 6 1 0 0 0 0 0 0 0 0 0 0 0
## south educ tra trad union lwage d81 d82 d83 d84 d85 d86
## 1 0 14 0 0 0 1.197540 0 0 0 0 0 0
## 2 0 14 0 0 1 1.853060 1 0 0 0 0 0
## 3 0 14 0 0 0 1.344462 0 1 0 0 0 0
## 4 0 14 0 0 0 1.433213 0 0 1 0 0 0
## 5 0 14 0 0 0 1.568125 0 0 0 1 0 0
## 6 0 14 0 0 0 1.699891 0 0 0 0 1 0
## d87 expersq
## 1 0 1
## 2 0 4
## 3 0 9
## 4 0 16
## 5 0 25
## 6 0 36
7.2.2 Checking Panel Balance
When using panel data, checking whether the panel is balanced is necessary to know if every individual is observed in every time period. An unbalanced panel means some individuals are missing years. This matters because many econometric methods assume balance, and undetected imbalance can hide selection problems: you might miss events such as firms exiting the market or respondents dropping out of the survey.
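The check itself is a nested table() call:

```r
# Inner table(): number of observations per individual
# Outer table(): how many individuals have each observation count
table(table(ch3_p2$nr))
```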
##
## 8
## 545
From the result, we know that each of the 545 individuals was observed 8 times. For an unbalanced panel, you would see an output with multiple counts, indicating that some individuals appear in fewer periods.
The nested call table(table(nr)) first counts how many periods each individual appears in, then tabulates those counts, telling you whether every individual spends the same amount of time in the panel.
7.2.3 Creating Lagged Variables
We know that economic processes do not happen simultaneously, which is why lags are important. Lags impose temporal ordering, helping to distinguish causal effects from mere contemporaneous correlation.
In panel data, lags are within-individual;
So, the question you are trying to answer here is:
“How did this individual’s past outcome affect this person’s current outcome?”
When using fixed effects, which rely on within-unit variation over time, lags must be constructed within each unit; without this adjustment, a lag could mix observations from different individuals.
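A sketch of the lag construction; the new column name lwage_lag matches the one that shows up in the missing-value counts of the next section:

```r
library(dplyr)

ch3_p2 <- ch3_p2 %>%
  group_by(nr) %>%                  # keep lags within each individual
  mutate(lwage_lag = lag(lwage))    # previous period's log wage

head(ch3_p2)
```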
## # A tibble: 6 × 45
## # Groups: nr [1]
## nr year agric black bus construc ent exper fin
## <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 13 1980 0 0 1 0 0 1 0
## 2 13 1981 0 0 0 0 0 2 0
## 3 13 1982 0 0 1 0 0 3 0
## 4 13 1983 0 0 1 0 0 4 0
## 5 13 1984 0 0 0 0 0 5 0
## 6 13 1985 0 0 1 0 0 6 0
## # ℹ 36 more variables: hisp <int>, poorhlth <int>,
## # hours <int>, manuf <int>, married <int>, min <int>,
## # nrthcen <int>, nrtheast <int>, occ1 <int>, occ2 <int>,
## # occ3 <int>, occ4 <int>, occ5 <int>, occ6 <int>,
## # occ7 <int>, occ8 <int>, occ9 <int>, per <int>,
## # pro <int>, pub <int>, rur <int>, south <int>,
## # educ <int>, tra <int>, trad <int>, union <int>, …
It is important to use group_by() because each individual has their own timeline, and the lags must stay within individuals. In effect, this treats each individual as a separate time series.
7.2.4 Missing Data in Panels
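Counting the missing values column by column:

```r
# NA count per column; only lwage_lag should have missing values,
# one per individual (the first year has no previous period)
colSums(is.na(ch3_p2))
```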
## nr year agric black bus construc
## 0 0 0 0 0 0
## ent exper fin hisp poorhlth hours
## 0 0 0 0 0 0
## manuf married min nrthcen nrtheast occ1
## 0 0 0 0 0 0
## occ2 occ3 occ4 occ5 occ6 occ7
## 0 0 0 0 0 0
## occ8 occ9 per pro pub rur
## 0 0 0 0 0 0
## south educ tra trad union lwage
## 0 0 0 0 0 0
## d81 d82 d83 d84 d85 d86
## 0 0 0 0 0 0
## d87 expersq lwage_lag
## 0 0 545
7.2.5 Creating Growth Rates
ch3_p2 <- ch3_p2 %>%
group_by(nr) %>%
mutate(
wg = lwage - lag(lwage)
)
head(ch3_p2$wg, 10) #due to the number of columns, we show only the one we just created, for 10 observations
## [1] NA 0.65551984 -0.50859833 0.08875167
## [5] 0.13491178 0.13176584 -2.42015356 2.38945049
## [9] NA -0.15756428
What we did was create within-individual wage growth over time.
Remember our discussion on lag? The wg column answers:
“How much did this individual’s wage change from the previous period?”
This is the beginning of dynamic analysis. Note also that the first observation per person has no lag, which is why its growth rate is NA.