6 Data Management - Cross-Sectional Data
6.1 Where to Get Data?
Before we proceed to Data Management, let us first find where we can get data for the Data Story Archive. Note that the data you collect should still ensure that you are following the Code of Ethics and analyze Ethical Considerations.
Please view the necessary documents from the Office of the Vice Chancellor for Research and Innovation (https://www.dlsu.edu.ph/research/research-manual/)
A list of links you can search and get data from:
Note: I will not include the best links as they are pretty straightforward and these are governmental databases like the ones from World Bank, IMF, UN, Philippine Statistics Authority, and Bangko Sentral ng Pilipinas. The list here is a general list but use with proper discretion.
| Name | Link | Notes |
|---|---|---|
| Awesome Public Datasets | https://github.com/awesomedata/awesome-public-datasets | This repository is filled with public datasets, mostly from International contexts. |
| Google Dataset Search | https://datasetsearch.research.google.com/ | You can download publicly available datasets from searching through Google. Though, sometimes the datasets come from ‘Statista.com’. You can check the sources from the search. |
6.2 Preliminaries
6.2.1 Packages
We will mostly use the tidyverse package, in particular, the dplyr package and the tidyr package; double-check in your Packages list whether you have these two packages; if not, you can simply install them.
6.2.2 Clean Everything
Do this step every time you use other data or when we do the other chapters.
# Remove all objects in the global environment
rm(list = ls())
# Perform garbage collection to free up memory
gc()## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3112258 166.3 5173005 276.3 5173005 276.3
## Vcells 6108317 46.7 12255594 93.6 9035397 69.0
6.2.3 Importing the Datasets
Before we can manage the data, we must first import it into R. There are two ways to do this:
- Writing code (preferred for replicability)
- Clicking in RStudio
We start with the most common file types.
6.2.3.1 Importing a CSV file
The file must be in the working directory. If it is not, then, simply putting the file name inside the quotation marks will not work. Meaning, you have to input the entire path where the file is.
Another, if you notice, the CSV file name is simple and easy to share. The object, in this case, cp_csv is also in small letters and does not have spaces but rather an underscore replacing the space. Always do this when naming objects.
a. Small letters
b. No spaces
c. Place underscore instead.
To check the information and what the dataset looks like:
## Country ID HH1_Num_People HH2a_Sex HH2b_Age
## 1 Bulgaria BG1535216 3 Female 56
## 2 Netherlands NL5130211 1 Female 20
## 3 Netherlands NL5063519 2 Male 63
## 4 Slovenia SI1042916 3 Male 63
## 5 Bulgaria BG1396625 2 Male 89
## 6 Slovakia SK1184115 2 Female 44
## HH2d_EmploymentSituation
## 1 Unemployed less than 12 months
## 2 In education (at school, university, etc.) / student
## 3 Retired
## 4 At work as employee or employer/self-employed
## 5 Retired
## 6 At work as employee or employer/self-employed
## Q1_PaidJob Q2_Empoyment
## 1 Yes <NA>
## 2 Yes <NA>
## 3 Yes <NA>
## 4 <NA> Employed
## 5 Yes <NA>
## 6 <NA> Self-employed without employees
## Q3_Contract
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 On an unlimited permanent contract
## 5 <NA>
## 6 On an unlimited permanent contract
## Q4_Occupation Q7_HoursWeekWork
## 1 <NA> NA
## 2 <NA> NA
## 3 <NA> NA
## 4 Technician or junior professional 35
## 5 <NA> NA
## 6 Technician or junior professional 43
## Q7a_AdditionalJob Q7b_HoursWeekWeekAddJob Q7c_Work
## 1 <NA> NA No
## 2 <NA> NA No
## 3 <NA> NA No
## 4 No NA <NA>
## 5 <NA> NA No
## 6 No NA <NA>
## Q8_HoursWeekWorkPref Q9_HoursWeekWorkPartner
## 1 40 40
## 2 10 NA
## 3 0 NA
## 4 30 NA
## 5 0 NA
## 6 43 NA
## Q10_HoursWeekWorkPartnerPref Q17_Rooms
## 1 4 3
## 2 NA 1
## 3 0 6
## 4 30 3
## 5 0 1
## 6 NA 2
## Q18_Tenancy Q19a_ShortageSpace
## 1 Own without mortgage 0
## 2 Other 0
## 3 Own with mortgage 0
## 4 Tenant, paying rent to private landlord 0
## 5 Own without mortgage 0
## 6 Own without mortgage 1
## Q19b_Rot Q19c_Leaks Q19d_NoFlusingToilet Q19e_NoBathShower
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## Q19f_NoOutside Q20_LeaveAccomodation_NoAff
## 1 0 Very unlikely
## 2 0 Quite unlikely
## 3 0 Very unlikely
## 4 0 Quite unlikely
## 5 0 Very unlikely
## 6 0 Very likely
## Q24_Trust Q25a_TensionClass
## 1 6 Some tension
## 2 6 Some tension
## 3 8 Some tension
## 4 6 Some tension
## 5 1 - you can't be too careful Some tension
## 6 5 Some tension
## Q25b_TensionWork Q25c_TensionSex Q25d_TensionsAge
## 1 No tension No tension 3
## 2 Some tension No tension 3
## 3 Some tension No tension 3
## 4 Some tension No tension 3
## 5 Don”t know Some tension 2
## 6 Some tension Some tension 2
## Q25e_TensionRace Q25f_TensionReligion
## 1 Some tension No tension
## 2 Some tension Some tension
## 3 A lot of tension A lot of tension
## 4 Some tension Some tension
## 5 Don”t know Don”t know
## 6 Some tension Some tension
## Q25g_TensionSexOrient Q37a_HoursWeekChildren
## 1 No tension 2
## 2 A lot of tension NA
## 3 Some tension NA
## 4 No tension 5
## 5 Don”t know NA
## 6 Some tension NA
## Q37b_HoursWeekHousework Q37c_HoursWeekElderly
## 1 14 NA
## 2 7 NA
## 3 14 3
## 4 NA NA
## 5 NA NA
## 6 20 NA
## Q48_Education Q49_Area
## 1 1250 A medium to large town
## 2 2956 A city or city suburb
## 3 2963 A medium to large town
## 4 3420 A medium to large town
## 5 1253 A city or city suburb
## 6 3530 A medium to large town
## Q50a_NeighbourhoodNoise Q50b_NeighbourhoodAir
## 1 No problems No problems
## 2 No problems No problems
## 3 No problems No problems
## 4 No problems No problems
## 5 No problems No problems
## 6 No problems No problems
## Q50c_NeighbourhoodWater Q50d_NeighbourhoodCrime
## 1 Major problems No problems
## 2 No problems No problems
## 3 No problems No problems
## 4 No problems No problems
## 5 No problems Moderate problems
## 6 No problems No problems
## Q50e_NeighbourhoodLitter Q50f_NeighbourhoodTraffic
## 1 No problems No problems
## 2 No problems No problems
## 3 No problems No problems
## 4 No problems Moderate problems
## 5 Moderate problems Don't know
## 6 No problems Moderate problems
## Q51a_AccServicesPost Q51b_AccServicesBank
## 1 Service not used Easily
## 2 Easily With some difficulty
## 3 Easily Easily
## 4 Easily Easily
## 5 Service not used Service not used
## 6 Very easily Very easily
## Q53c_QualityPublicTransport Income_PPP
## 1 6 NA
## 2 7 273.4731
## 3 7 3190.5196
## 4 6 4171.6329
## 5 3 478.3407
## 6 5 759.6961
## 'data.frame': 4036 obs. of 49 variables:
## $ Country : chr "Bulgaria" "Netherlands" "Netherlands" "Slovenia" ...
## $ ID : chr "BG1535216" "NL5130211" "NL5063519" "SI1042916" ...
## $ HH1_Num_People : int 3 1 2 3 2 2 2 1 4 2 ...
## $ HH2a_Sex : chr "Female" "Female" "Male" "Male" ...
## $ HH2b_Age : int 56 20 63 63 89 44 52 43 47 81 ...
## $ HH2d_EmploymentSituation : chr "Unemployed less than 12 months" "In education (at school, university, etc.) / student" "Retired" "At work as employee or employer/self-employed" ...
## $ Q1_PaidJob : chr "Yes" "Yes" "Yes" NA ...
## $ Q2_Empoyment : chr NA NA NA "Employed" ...
## $ Q3_Contract : chr NA NA NA "On an unlimited permanent contract" ...
## $ Q4_Occupation : chr NA NA NA "Technician or junior professional" ...
## $ Q7_HoursWeekWork : int NA NA NA 35 NA 43 NA NA 17 NA ...
## $ Q7a_AdditionalJob : chr NA NA NA "No" ...
## $ Q7b_HoursWeekWeekAddJob : int NA NA NA NA NA NA NA NA NA NA ...
## $ Q7c_Work : chr "No" "No" "No" NA ...
## $ Q8_HoursWeekWorkPref : int 40 10 0 30 0 43 40 40 35 0 ...
## $ Q9_HoursWeekWorkPartner : int 40 NA NA NA NA NA 41 NA 35 NA ...
## $ Q10_HoursWeekWorkPartnerPref: int 4 NA 0 30 0 NA 30 NA 35 0 ...
## $ Q17_Rooms : int 3 1 6 3 1 2 NA 1 5 3 ...
## $ Q18_Tenancy : chr "Own without mortgage" "Other" "Own with mortgage" "Tenant, paying rent to private landlord" ...
## $ Q19a_ShortageSpace : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Q19b_Rot : int 0 0 0 0 0 0 0 1 1 0 ...
## $ Q19c_Leaks : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Q19d_NoFlusingToilet : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Q19e_NoBathShower : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Q19f_NoOutside : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Q20_LeaveAccomodation_NoAff : chr "Very unlikely" "Quite unlikely" "Very unlikely" "Quite unlikely" ...
## $ Q24_Trust : chr "6" "6" "8" "6" ...
## $ Q25a_TensionClass : chr "Some tension" "Some tension" "Some tension" "Some tension" ...
## $ Q25b_TensionWork : chr "No tension" "Some tension" "Some tension" "Some tension" ...
## $ Q25c_TensionSex : chr "No tension" "No tension" "No tension" "No tension" ...
## $ Q25d_TensionsAge : int 3 3 3 3 2 2 98 3 3 1 ...
## $ Q25e_TensionRace : chr "Some tension" "Some tension" "A lot of tension" "Some tension" ...
## $ Q25f_TensionReligion : chr "No tension" "Some tension" "A lot of tension" "Some tension" ...
## $ Q25g_TensionSexOrient : chr "No tension" "A lot of tension" "Some tension" "No tension" ...
## $ Q37a_HoursWeekChildren : int 2 NA NA 5 NA NA 4 NA NA NA ...
## $ Q37b_HoursWeekHousework : int 14 7 14 NA NA 20 4 8 4 6 ...
## $ Q37c_HoursWeekElderly : int NA NA 3 NA NA NA NA NA NA NA ...
## $ Q48_Education : chr "1250" "2956" "2963" "3420" ...
## $ Q49_Area : chr "A medium to large town" "A city or city suburb" "A medium to large town" "A medium to large town" ...
## $ Q50a_NeighbourhoodNoise : chr "No problems" "No problems" "No problems" "No problems" ...
## $ Q50b_NeighbourhoodAir : chr "No problems" "No problems" "No problems" "No problems" ...
## $ Q50c_NeighbourhoodWater : chr "Major problems" "No problems" "No problems" "No problems" ...
## $ Q50d_NeighbourhoodCrime : chr "No problems" "No problems" "No problems" "No problems" ...
## $ Q50e_NeighbourhoodLitter : chr "No problems" "No problems" "No problems" "No problems" ...
## $ Q50f_NeighbourhoodTraffic : chr "No problems" "No problems" "No problems" "Moderate problems" ...
## $ Q51a_AccServicesPost : chr "Service not used" "Easily" "Easily" "Easily" ...
## $ Q51b_AccServicesBank : chr "Easily" "With some difficulty" "Easily" "Easily" ...
## $ Q53c_QualityPublicTransport : chr "6" "7" "7" "6" ...
## $ Income_PPP : num NA 273 3191 4172 478 ...
The head code shows the first 6 rows and all the columns (variables) in the dataset. Meanwhile the str code shows the information of the dataset, such as what format type each column/variable is, how many observations, etc.
B. Using RStudio
- Go to Environment
- Click Import Dataset
- Choose From Text (base) or From Test (readr)
- Click Browse and select your
.csvfile - Click Import
RStudio will load the dataset and generate R code in Console. Copy the R code to your script for reproducibility.
6.2.3.2 Importing an Excel File
First, install and load the package:
If you only want a particular sheet to be imported,
## # A tibble: 6 × 49
## Country ID HH1_Num_People HH2a_Sex HH2b_Age
## <chr> <chr> <dbl> <chr> <dbl>
## 1 Bulgaria BG1535216 3 Female 56
## 2 Netherlands NL5130211 1 Female 20
## 3 Netherlands NL5063519 2 Male 63
## 4 Slovenia SI1042916 3 Male 63
## 5 Bulgaria BG1396625 2 Male 89
## 6 Slovakia SK1184115 2 Female 44
## # ℹ 44 more variables: HH2d_EmploymentSituation <chr>,
## # Q1_PaidJob <chr>, Q2_Empoyment <chr>, Q3_Contract <chr>,
## # Q4_Occupation <chr>, Q7_HoursWeekWork <dbl>,
## # Q7a_AdditionalJob <chr>, Q7b_HoursWeekWeekAddJob <dbl>,
## # Q7c_Work <chr>, Q8_HoursWeekWorkPref <dbl>,
## # Q9_HoursWeekWorkPartner <dbl>,
## # Q10_HoursWeekWorkPartnerPref <dbl>, Q17_Rooms <dbl>, …
B. Using RStudio
- Go to Environment
- Click Import Dataset
- Choose From Excel
- Click Browse and select your
.xlsxfile - Choose the sheet
- Click Import
Check that you also have the variables as the header.
6.2.3.3 Importing RData
These files are native to R and can contain multiple objects.
## # A tibble: 6 × 49
## Country ID HH1_Num_People HH2a_Sex HH2b_Age
## <fct> <chr> <dbl> <fct> <dbl>
## 1 Bulgaria BG1535216 3 Female 56
## 2 Netherlands NL5130211 1 Female 20
## 3 Netherlands NL5063519 2 Male 63
## 4 Slovenia SI1042916 3 Male 63
## 5 Bulgaria BG1396625 2 Male 89
## 6 Slovakia SK1184115 2 Female 44
## # ℹ 44 more variables: HH2d_EmploymentSituation <fct>,
## # Q1_PaidJob <fct>, Q2_Empoyment <fct>, Q3_Contract <fct>,
## # Q4_Occupation <fct>, Q7_HoursWeekWork <dbl>,
## # Q7a_AdditionalJob <fct>, Q7b_HoursWeekWeekAddJob <dbl>,
## # Q7c_Work <fct>, Q8_HoursWeekWorkPref <dbl>,
## # Q9_HoursWeekWorkPartner <dbl>,
## # Q10_HoursWeekWorkPartnerPref <dbl>, Q17_Rooms <dbl>, …
B. Using RStudio
- Environment panel
- Click Import Dataset
- Choose From RData
- Select the
.RDatafile - Click Import
6.3 Data Cleaning
As you can see, there are 49 columns. Let’s simplify and work with fewer variables relevant for analysis. We can do this using the select() function in dplyr. We will save them into a new data frame, ch2_p1.1.
ch2_p1.1<-select(cp_csv,
Country, ID, HH2a_Sex, HH2b_Age, Q1_PaidJob, Q7_HoursWeekWork, Q17_Rooms, Q49_Area, Income_PPP)select(df, var1, var2,...) keeps only the listed columns and removes the rest.
6.3.1 Renaming the Variables
We will edit the names to much easier conventions. First, let us say that we just want to change them to lowercase names.
Inspect:
## country id hh2a_sex hh2b_age q1_paidjob
## 1 Bulgaria BG1535216 Female 56 Yes
## 2 Netherlands NL5130211 Female 20 Yes
## 3 Netherlands NL5063519 Male 63 Yes
## 4 Slovenia SI1042916 Male 63 <NA>
## 5 Bulgaria BG1396625 Male 89 Yes
## 6 Slovakia SK1184115 Female 44 <NA>
## q7_hoursweekwork q17_rooms q49_area
## 1 NA 3 A medium to large town
## 2 NA 1 A city or city suburb
## 3 NA 6 A medium to large town
## 4 35 3 A medium to large town
## 5 NA 1 A city or city suburb
## 6 43 2 A medium to large town
## income_ppp
## 1 NA
## 2 273.4731
## 3 3190.5196
## 4 4171.6329
## 5 478.3407
## 6 759.6961
Inspect again on your own.
Let us rename specific columns:
ch2_p1.1<-ch2_p1.1 %>%
rename(
sex = hh2a_sex,
age = hh2b_age,
paid_job = q1_paidjob,
hours_work = q7_hoursweekwork,
rooms = q17_rooms,
area = q49_area,
income = income_ppp
)Inspect on your own.
names() simply gets or sets variable names.
tolower is to simply change to small letters
rename(new_name = old_name) changes a column’s name without touching data
%>% is the pipe operator: it passes the dataset from one function to the next.
6.3.2 Sorting Variables
Let’s say, we want to arrange income. We will create a different data for this. We use arrange or desc in dplyr package
#Sort dataset by income, ascending (default)
ch2_p1sort<-arrange(ch2_p1.1,
income)
#Sort dataset by income, descending
ch2_p1sort_desc<-arrange(ch2_p1.1, desc(income))Inspect on your own.
6.3.2.1 Making our own data frame
c() stands for combine. We’re creating a vector of values.
as.factor() converts the vector into a factor, that is a categorical variable not numbers.
data.frame(...) creates a dataset in R.
## ch2_p2
## 1 $10,000
## 2 $20,500
## 3 $15,250
## 4 $30,000
## 5 $50,750
Let’s sort this:
## ch2_p2
## 1 $10,000
## 2 $20,500
## 3 $15,250
## 4 $30,000
## 5 $50,750
It did not work. The problem is, ch2_p2 is not numeric. We can check:
## [1] "factor"
We need to make it into a numeric value but we have a , and $. We need to remove them. We use the str_replace function in the stringr package.
ch2_p2$ch2_p2<-str_replace(
ch2_p2$ch2_p2, #column we want to edit
pattern = ',', #what to find
replacement = '' #what to replace it with
)## ch2_p2
## 1 $10000
## 2 $20500
## 3 $15250
## 4 $30000
## 5 $50750
Now, let us remove the dollar sign; usually, simply doing the same thing we did with the comma works, but, there are some symbols that are used as “special character”. To “force” R to replace the presence of ‘$’, we add two backslashes before the dollar sign.
Inspect on your own.
Now, sort ch2_p2 on your own.
We can see that it was arranged, however, take a look at the way ch2_p2 was encoded; it is not numeric. So, we need to change this.
## [1] "character"
Change to numeric through as.numeric()
Inspect on your own.
6.3.3 Pipe Operator
%>% allows functions to be chained; it can be read as “then” - it tells R to do whatever comes after it to the stuff that comes before it.
Inspect on your own.
6.3.4 Adding columns
We will be using the pipe operator and the mutate to add a new column to ch2_p1.1
ch2_p1.1 <- ch2_p1.1 %>%
mutate( #adds a new column to the dataset
room_group = case_when( #checks each row's value and assigned a category
rooms == 1 ~ "one_room",
rooms == 2 ~ "two_rooms",
rooms == 3 ~ "three_rooms",
rooms == 4 ~ "four_rooms",
rooms == 5 ~ "five_rooms",
rooms == 6 ~ "six_rooms",
rooms == 7 ~ "seven_rooms",
rooms == 8 ~ "eight_rooms",
rooms == 9 ~ "nine_rooms",
rooms == 10 ~ "ten_rooms",
rooms == 11 ~ "eleven_rooms",
rooms == 12 ~ "twelve_rooms",
TRUE ~ "other" # anything outside 1–12 or missing
)
)6.3.5 Transforming Values
Now, you can see that paid_job is a character that is “yes/no”. We need to change that to numeric value. This is particularly useful when we use dummy variables later on. We will not use case_when as it is not necessary; rather, we will use ifelse:
6.3.6 Categorizing into groups
We use the case_when():
ch2_p1.1 <- ch2_p1.1 %>%
mutate(
age_group = case_when(
age >= 18 & age <= 29 ~ "young", # 18–29 years
age >= 30 & age <= 59 ~ "adult", # 30–59 years
age >= 60 ~ "older_adult", # 60+ years
TRUE ~ "other" # NA
)
)mutate() adding new column
The conditions are important.
6.3.7 Summarizing
Let us get the average of income by age group, which we’ll call ave_income, by using the group_by() and summarise() functions in dplyr
ch2_p1.1ave<-ch2_p1.1 %>%
group_by(age_group) %>% #group by age group, THEN
summarise(ave_income=mean(income)) #calculate the mean of income for each age group
head(ch2_p1.1ave)## # A tibble: 3 × 2
## age_group ave_income
## <chr> <dbl>
## 1 adult NA
## 2 older_adult NA
## 3 young NA
How did this happen? It is because there are NAs in income. Let’s check for NAs in our dataset
## country id sex age paid_job
## 0 0 0 0 1768
## hours_work rooms area income room_group
## 2304 25 0 978 0
## age_group
## 0
There are 978 NAs in income, that is why the average income is NA. For the sake of, let’s remove the NA.
ch2_p1.1ave <- ch2_p1.1 %>%
group_by(age_group) %>%
summarise(ave_income = mean(income, na.rm = TRUE))
head(ch2_p1.1ave)## # A tibble: 3 × 2
## age_group ave_income
## <chr> <dbl>
## 1 adult 2145.
## 2 older_adult 1566.
## 3 young 1809.
The difference is the inclusion of na.rm = TRUE which means we remove the missing values when calculating the mean.
So, even though we know income is numeric, the presence of NA causes the output when calculating mean, sd, etc. become NA too.
6.3.8 Merging datasets
We have two main datasets, ch2_p1.1 and ch2_p1.1ave. By doing this, we could compare side-by-side each observation compared to the average per age group.
We will join the datasets by age_group variable, since that is consistent across both datasets. We name the new file as ch2_p1merged:
## age_group country id sex age paid_job
## 1 adult Bulgaria BG1535216 Female 56 1
## 2 adult Hungary HU1042815 Male 39 0
## 3 adult Poland PL1229721 Female 47 NA
## 4 adult Germany DE1045717 Female 56 1
## 5 adult Finland FI1024310 Male 58 NA
## 6 adult Slovakia SK1184115 Female 44 NA
## hours_work rooms area income
## 1 NA 3 A medium to large town NA
## 2 NA 2 A village/small town 987.0490
## 3 40 4 A village/small town 1722.0737
## 4 NA 3 A city or city suburb NA
## 5 32 5 A medium to large town 3958.9271
## 6 43 2 A medium to large town 759.6961
## room_group ave_income
## 1 three_rooms 2145.059
## 2 two_rooms 2145.059
## 3 four_rooms 2145.059
## 4 three_rooms 2145.059
## 5 five_rooms 2145.059
## 6 two_rooms 2145.059
6.3.9 Splitting datasets
Say I want to save different datasets based on the age_group column.
filter() keeps rows that meet the condition.
You can save it as a .csv file:
Can you do the others? (young or older_adult)
6.3.10 Reshaping Datasets
Let’s say you want to reshape income and hours_work into a long format so every row is a variable measurement. We need both tidyr and dplyr packages.
6.3.10.1 Long Format
library(tidyr)
library(dplyr)
ch2_long<-ch2_p1.1 %>%
pivot_longer(
cols = c(hours_work, income), # columns to stack
names_to = "variable", #new column for variable names
values_to = "value" #new column for their values
)
head(ch2_long)## # A tibble: 6 × 11
## country id sex age paid_job rooms area room_group
## <chr> <chr> <chr> <int> <dbl> <int> <chr> <chr>
## 1 Bulgaria BG15… Fema… 56 1 3 A me… three_roo…
## 2 Bulgaria BG15… Fema… 56 1 3 A me… three_roo…
## 3 Netherla… NL51… Fema… 20 1 1 A ci… one_room
## 4 Netherla… NL51… Fema… 20 1 1 A ci… one_room
## 5 Netherla… NL50… Male 63 1 6 A me… six_rooms
## 6 Netherla… NL50… Male 63 1 6 A me… six_rooms
## # ℹ 3 more variables: age_group <chr>, variable <chr>,
## # value <dbl>
pivot_longer() turns columns into rows
6.3.10.2 Wide Format
ch2_wide<-ch2_long %>%
pivot_wider(
names_from = variable, #column to spread to multiple columns
values_from = value #value to fill columns
)
head(ch2_wide)## # A tibble: 6 × 11
## country id sex age paid_job rooms area room_group
## <chr> <chr> <chr> <int> <dbl> <int> <chr> <chr>
## 1 Bulgaria BG15… Fema… 56 1 3 A me… three_roo…
## 2 Netherla… NL51… Fema… 20 1 1 A ci… one_room
## 3 Netherla… NL50… Male 63 1 6 A me… six_rooms
## 4 Slovenia SI10… Male 63 NA 3 A me… three_roo…
## 5 Bulgaria BG13… Male 89 1 1 A ci… one_room
## 6 Slovakia SK11… Fema… 44 NA 2 A me… two_rooms
## # ℹ 3 more variables: age_group <chr>, hours_work <dbl>,
## # income <dbl>
Say you want to analyze repeated categories or measures. We can pivot the room_group and age_group to long:
ch2_long2<-ch2_p1.1 %>%
pivot_longer(
cols = c(room_group, age_group),
names_to = "category",
values_to = "group"
)
head(ch2_long2)## # A tibble: 6 × 11
## country id sex age paid_job hours_work rooms area
## <chr> <chr> <chr> <int> <dbl> <int> <int> <chr>
## 1 Bulgaria BG15… Fema… 56 1 NA 3 A me…
## 2 Bulgaria BG15… Fema… 56 1 NA 3 A me…
## 3 Netherla… NL51… Fema… 20 1 NA 1 A ci…
## 4 Netherla… NL51… Fema… 20 1 NA 1 A ci…
## 5 Netherla… NL50… Male 63 1 NA 6 A me…
## 6 Netherla… NL50… Male 63 1 NA 6 A me…
## # ℹ 3 more variables: income <dbl>, category <chr>,
## # group <chr>
6.3.11 Missing Values
We already did it before but just to remind:
## country id sex age paid_job
## 0 0 0 0 1768
## hours_work rooms area income room_group
## 2304 25 0 978 0
## age_group
## 0
When there are missing values, it reduces sample size, can bias estimates if the missingness is not random and may introduce endogeneity if the missing data is correlated with the outcome or other variables.
6.3.11.1 Types of Missing Data
- MCAR - Missing Completely at Random
- Missingness is unrelated to anything in the data
- MAR - Missing at Random
Missingness depends on observed variables but not on the missing value itself.
Example: younger respondents are less likely to report income.
- MNAR - Missing Not at Random
Missingness depends on the value itself
Example: high income people refuse to report income
If you have MCAR, safe to remove this. If MAR or MNAR, removing these can cause bias. If a variable with missing values affects the outcome and you remove it or impute incorrectly, you can create endogeneity.
6.3.11.2 Handling Missing Values
6.3.11.2.1 a. Remove missing observations
!is.na() means you are keeping rows WITHOUT missing income
Again, double-check because removing the missing values may bias your results
6.3.11.2.2 b. Replace missing values (imputation of mean/median)
#replace missing income with the mean
ch2_p1.3<-ch2_p1.1 %>%
mutate(income=ifelse(is.na(income), mean(income, na.rm=TRUE), income))ifelse(condition, value_if_true, value_if_false) in this case, you are telling R to search income’s NA values, if there are NA values, calculate the mean income without the NAs. then, change the NAs with the calculated mean income, otherwise, keep the income as is.
There are advanced methods but will not be discussed.
You have to be careful and always check for missing data.