6 Data Management - Cross-Sectional Data
6.1 Where to Get Data?
Before we proceed to Data Management, let us first find where we can get data for the Data Story Archive. Note that the data you collect should still ensure that you are following the Code of Ethics and analyze Ethical Considerations.
Please view the necessary documents from the Office of the Vice Chancellor for Research and Innovation (https://www.dlsu.edu.ph/research/research-manual/)
A list of links you can search and get data from:
Note: I will not include the best links as they are pretty straightforward and these are governmental databases like the ones from World Bank, IMF, UN, Philippine Statistics Authority, and Bangko Sentral ng Pilipinas. The list here is a general list but use with proper discretion.
| Name | Link | Notes |
|---|---|---|
| Awesome Public Datasets | https://github.com/awesomedata/awesome-public-datasets | This repository is filled with public datasets, mostly from International contexts. |
| Google Dataset Search | https://datasetsearch.research.google.com/ | You can download publicly available datasets from searching through Google. Though, sometimes the datasets come from ‘Statista.com’. You can check the sources from the search. |
6.2 Preliminaries
6.2.1 Packages
We will mostly use the tidyverse package, in particular, the dplyr package and the tidyr package; double-check in your Packages list whether you have these two packages; if not, you can simply install them.
6.2.2 Clean Everything
Do this step every time you use other data or when we do the other chapters.
# Remove all objects in the global environment
rm(list = ls())
# Perform garbage collection to free up memory
gc()## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5758881 307.6 10206931 545.2 10206931 545.2
## Vcells 12146455 92.7 81773890 623.9 249542360 1903.9
6.2.3 Importing the Datasets
Before we can manage the data, we must first import it into R. There are two ways to do this:
- Writing code (preferred for replicability)
- Clicking in RStudio
We start with the most common file types.
6.2.3.1 Importing a CSV file
The file must be in the working directory. If it is not, then, simply putting the file name inside the quotation marks will not work. Meaning, you have to input the entire path where the file is.
Another, if you notice, the CSV file name is simple and easy to share. The object, in this case, cp_csv is also in small letters and does not have spaces but rather an underscore replacing the space. Always do this when naming objects.
a. Small letters
b. No spaces
c. Place underscore instead.
To check the information and what the dataset looks like:
## Country ID HH1_Num_People HH2a_Sex HH2b_Age HH2d_EmploymentSituation Q1_PaidJob
## 1 Bulgaria BG1535216 3 Female 56 Unemployed less than 12 months Yes
## 2 Netherlands NL5130211 1 Female 20 In education (at school, university, etc.) / student Yes
## 3 Netherlands NL5063519 2 Male 63 Retired Yes
## 4 Slovenia SI1042916 3 Male 63 At work as employee or employer/self-employed <NA>
## 5 Bulgaria BG1396625 2 Male 89 Retired Yes
## 6 Slovakia SK1184115 2 Female 44 At work as employee or employer/self-employed <NA>
## Q2_Empoyment Q3_Contract Q4_Occupation Q7_HoursWeekWork Q7a_AdditionalJob
## 1 <NA> <NA> <NA> NA <NA>
## 2 <NA> <NA> <NA> NA <NA>
## 3 <NA> <NA> <NA> NA <NA>
## 4 Employed On an unlimited permanent contract Technician or junior professional 35 No
## 5 <NA> <NA> <NA> NA <NA>
## 6 Self-employed without employees On an unlimited permanent contract Technician or junior professional 43 No
## Q7b_HoursWeekWeekAddJob Q7c_Work Q8_HoursWeekWorkPref Q9_HoursWeekWorkPartner Q10_HoursWeekWorkPartnerPref Q17_Rooms
## 1 NA No 40 40 4 3
## 2 NA No 10 NA NA 1
## 3 NA No 0 NA 0 6
## 4 NA <NA> 30 NA 30 3
## 5 NA No 0 NA 0 1
## 6 NA <NA> 43 NA NA 2
## Q18_Tenancy Q19a_ShortageSpace Q19b_Rot Q19c_Leaks Q19d_NoFlusingToilet Q19e_NoBathShower Q19f_NoOutside
## 1 Own without mortgage 0 0 0 0 0 0
## 2 Other 0 0 0 0 0 0
## 3 Own with mortgage 0 0 0 0 0 0
## 4 Tenant, paying rent to private landlord 0 0 0 0 0 0
## 5 Own without mortgage 0 0 0 0 0 0
## 6 Own without mortgage 1 0 0 0 0 0
## Q20_LeaveAccomodation_NoAff Q24_Trust Q25a_TensionClass Q25b_TensionWork Q25c_TensionSex Q25d_TensionsAge
## 1 Very unlikely 6 Some tension No tension No tension 3
## 2 Quite unlikely 6 Some tension Some tension No tension 3
## 3 Very unlikely 8 Some tension Some tension No tension 3
## 4 Quite unlikely 6 Some tension Some tension No tension 3
## 5 Very unlikely 1 - you can't be too careful Some tension Don”t know Some tension 2
## 6 Very likely 5 Some tension Some tension Some tension 2
## Q25e_TensionRace Q25f_TensionReligion Q25g_TensionSexOrient Q37a_HoursWeekChildren Q37b_HoursWeekHousework Q37c_HoursWeekElderly
## 1 Some tension No tension No tension 2 14 NA
## 2 Some tension Some tension A lot of tension NA 7 NA
## 3 A lot of tension A lot of tension Some tension NA 14 3
## 4 Some tension Some tension No tension 5 NA NA
## 5 Don”t know Don”t know Don”t know NA NA NA
## 6 Some tension Some tension Some tension NA 20 NA
## Q48_Education Q49_Area Q50a_NeighbourhoodNoise Q50b_NeighbourhoodAir Q50c_NeighbourhoodWater Q50d_NeighbourhoodCrime
## 1 1250 A medium to large town No problems No problems Major problems No problems
## 2 2956 A city or city suburb No problems No problems No problems No problems
## 3 2963 A medium to large town No problems No problems No problems No problems
## 4 3420 A medium to large town No problems No problems No problems No problems
## 5 1253 A city or city suburb No problems No problems No problems Moderate problems
## 6 3530 A medium to large town No problems No problems No problems No problems
## Q50e_NeighbourhoodLitter Q50f_NeighbourhoodTraffic Q51a_AccServicesPost Q51b_AccServicesBank Q53c_QualityPublicTransport Income_PPP
## 1 No problems No problems Service not used Easily 6 NA
## 2 No problems No problems Easily With some difficulty 7 273.4731
## 3 No problems No problems Easily Easily 7 3190.5196
## 4 No problems Moderate problems Easily Easily 6 4171.6329
## 5 Moderate problems Don't know Service not used Service not used 3 478.3407
## 6 No problems Moderate problems Very easily Very easily 5 759.6961
## 'data.frame': 4036 obs. of 49 variables:
## $ Country : chr "Bulgaria" "Netherlands" "Netherlands" "Slovenia" ...
## $ ID : chr "BG1535216" "NL5130211" "NL5063519" "SI1042916" ...
## $ HH1_Num_People : int 3 1 2 3 2 2 2 1 4 2 ...
## $ HH2a_Sex : chr "Female" "Female" "Male" "Male" ...
## $ HH2b_Age : int 56 20 63 63 89 44 52 43 47 81 ...
## $ HH2d_EmploymentSituation : chr "Unemployed less than 12 months" "In education (at school, university, etc.) / student" "Retired" "At work as employee or employer/self-employed" ...
## $ Q1_PaidJob : chr "Yes" "Yes" "Yes" NA ...
## $ Q2_Empoyment : chr NA NA NA "Employed" ...
## $ Q3_Contract : chr NA NA NA "On an unlimited permanent contract" ...
## $ Q4_Occupation : chr NA NA NA "Technician or junior professional" ...
## $ Q7_HoursWeekWork : int NA NA NA 35 NA 43 NA NA 17 NA ...
## $ Q7a_AdditionalJob : chr NA NA NA "No" ...
## $ Q7b_HoursWeekWeekAddJob : int NA NA NA NA NA NA NA NA NA NA ...
## $ Q7c_Work : chr "No" "No" "No" NA ...
## $ Q8_HoursWeekWorkPref : int 40 10 0 30 0 43 40 40 35 0 ...
## $ Q9_HoursWeekWorkPartner : int 40 NA NA NA NA NA 41 NA 35 NA ...
## $ Q10_HoursWeekWorkPartnerPref: int 4 NA 0 30 0 NA 30 NA 35 0 ...
## $ Q17_Rooms : int 3 1 6 3 1 2 NA 1 5 3 ...
## $ Q18_Tenancy : chr "Own without mortgage" "Other" "Own with mortgage" "Tenant, paying rent to private landlord" ...
## $ Q19a_ShortageSpace : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Q19b_Rot : int 0 0 0 0 0 0 0 1 1 0 ...
## $ Q19c_Leaks : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Q19d_NoFlusingToilet : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Q19e_NoBathShower : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Q19f_NoOutside : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Q20_LeaveAccomodation_NoAff : chr "Very unlikely" "Quite unlikely" "Very unlikely" "Quite unlikely" ...
## $ Q24_Trust : chr "6" "6" "8" "6" ...
## $ Q25a_TensionClass : chr "Some tension" "Some tension" "Some tension" "Some tension" ...
## $ Q25b_TensionWork : chr "No tension" "Some tension" "Some tension" "Some tension" ...
## $ Q25c_TensionSex : chr "No tension" "No tension" "No tension" "No tension" ...
## $ Q25d_TensionsAge : int 3 3 3 3 2 2 98 3 3 1 ...
## $ Q25e_TensionRace : chr "Some tension" "Some tension" "A lot of tension" "Some tension" ...
## $ Q25f_TensionReligion : chr "No tension" "Some tension" "A lot of tension" "Some tension" ...
## $ Q25g_TensionSexOrient : chr "No tension" "A lot of tension" "Some tension" "No tension" ...
## $ Q37a_HoursWeekChildren : int 2 NA NA 5 NA NA 4 NA NA NA ...
## $ Q37b_HoursWeekHousework : int 14 7 14 NA NA 20 4 8 4 6 ...
## $ Q37c_HoursWeekElderly : int NA NA 3 NA NA NA NA NA NA NA ...
## $ Q48_Education : chr "1250" "2956" "2963" "3420" ...
## $ Q49_Area : chr "A medium to large town" "A city or city suburb" "A medium to large town" "A medium to large town" ...
## $ Q50a_NeighbourhoodNoise : chr "No problems" "No problems" "No problems" "No problems" ...
## $ Q50b_NeighbourhoodAir : chr "No problems" "No problems" "No problems" "No problems" ...
## $ Q50c_NeighbourhoodWater : chr "Major problems" "No problems" "No problems" "No problems" ...
## $ Q50d_NeighbourhoodCrime : chr "No problems" "No problems" "No problems" "No problems" ...
## $ Q50e_NeighbourhoodLitter : chr "No problems" "No problems" "No problems" "No problems" ...
## $ Q50f_NeighbourhoodTraffic : chr "No problems" "No problems" "No problems" "Moderate problems" ...
## $ Q51a_AccServicesPost : chr "Service not used" "Easily" "Easily" "Easily" ...
## $ Q51b_AccServicesBank : chr "Easily" "With some difficulty" "Easily" "Easily" ...
## $ Q53c_QualityPublicTransport : chr "6" "7" "7" "6" ...
## $ Income_PPP : num NA 273 3191 4172 478 ...
The head code shows the first 6 rows and all the columns (variables) in the dataset. Meanwhile the str code shows the information of the dataset, such as what format type each column/variable is, how many observations, etc.
B. Using RStudio
- Go to Environment
- Click Import Dataset
- Choose From Text (base) or From Test (readr)
- Click Browse and select your
.csvfile - Click Import
RStudio will load the dataset and generate R code in Console. Copy the R code to your script for reproducibility.
6.2.3.2 Importing an Excel File
First, install and load the package:
If you only want a particular sheet to be imported,
## # A tibble: 6 × 49
## Country ID HH1_Num_People HH2a_Sex HH2b_Age HH2d_EmploymentSitua…¹ Q1_PaidJob Q2_Empoyment Q3_Contract Q4_Occupation Q7_HoursWeekWork
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Bulgaria BG15… 3 Female 56 Unemployed less than … Yes <NA> <NA> <NA> NA
## 2 Netherlands NL51… 1 Female 20 In education (at scho… Yes <NA> <NA> <NA> NA
## 3 Netherlands NL50… 2 Male 63 Retired Yes <NA> <NA> <NA> NA
## 4 Slovenia SI10… 3 Male 63 At work as employee o… <NA> Employed On an unli… Technician o… 35
## 5 Bulgaria BG13… 2 Male 89 Retired Yes <NA> <NA> <NA> NA
## 6 Slovakia SK11… 2 Female 44 At work as employee o… <NA> Self-employ… On an unli… Technician o… 43
## # ℹ abbreviated name: ¹HH2d_EmploymentSituation
## # ℹ 38 more variables: Q7a_AdditionalJob <chr>, Q7b_HoursWeekWeekAddJob <dbl>, Q7c_Work <chr>, Q8_HoursWeekWorkPref <dbl>,
## # Q9_HoursWeekWorkPartner <dbl>, Q10_HoursWeekWorkPartnerPref <dbl>, Q17_Rooms <dbl>, Q18_Tenancy <chr>, Q19a_ShortageSpace <dbl>,
## # Q19b_Rot <dbl>, Q19c_Leaks <dbl>, Q19d_NoFlusingToilet <dbl>, Q19e_NoBathShower <dbl>, Q19f_NoOutside <dbl>,
## # Q20_LeaveAccomodation_NoAff <chr>, Q24_Trust <chr>, Q25a_TensionClass <chr>, Q25b_TensionWork <chr>, Q25c_TensionSex <chr>,
## # Q25d_TensionsAge <chr>, Q25e_TensionRace <chr>, Q25f_TensionReligion <chr>, Q25g_TensionSexOrient <chr>, Q37a_HoursWeekChildren <dbl>,
## # Q37b_HoursWeekHousework <dbl>, Q37c_HoursWeekElderly <dbl>, Q48_Education <chr>, Q49_Area <chr>, Q50a_NeighbourhoodNoise <chr>, …
B. Using RStudio
- Go to Environment
- Click Import Dataset
- Choose From Excel
- Click Browse and select your
.xlsxfile - Choose the sheet
- Click Import
Check that you also have the variables as the header.
6.2.3.3 Importing RData
These files are native to R and can contain multiple objects.
## # A tibble: 6 × 49
## Country ID HH1_Num_People HH2a_Sex HH2b_Age HH2d_EmploymentSitua…¹ Q1_PaidJob Q2_Empoyment Q3_Contract Q4_Occupation Q7_HoursWeekWork
## <fct> <chr> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 Bulgaria BG15… 3 Female 56 Unemployed less than … Yes <NA> <NA> <NA> NA
## 2 Netherlands NL51… 1 Female 20 In education (at scho… Yes <NA> <NA> <NA> NA
## 3 Netherlands NL50… 2 Male 63 Retired Yes <NA> <NA> <NA> NA
## 4 Slovenia SI10… 3 Male 63 At work as employee o… <NA> Employed On an unli… Technician o… 35
## 5 Bulgaria BG13… 2 Male 89 Retired Yes <NA> <NA> <NA> NA
## 6 Slovakia SK11… 2 Female 44 At work as employee o… <NA> Self-employ… On an unli… Technician o… 43
## # ℹ abbreviated name: ¹HH2d_EmploymentSituation
## # ℹ 38 more variables: Q7a_AdditionalJob <fct>, Q7b_HoursWeekWeekAddJob <dbl>, Q7c_Work <fct>, Q8_HoursWeekWorkPref <dbl>,
## # Q9_HoursWeekWorkPartner <dbl>, Q10_HoursWeekWorkPartnerPref <dbl>, Q17_Rooms <dbl>, Q18_Tenancy <fct>, Q19a_ShortageSpace <dbl>,
## # Q19b_Rot <dbl>, Q19c_Leaks <dbl>, Q19d_NoFlusingToilet <dbl>, Q19e_NoBathShower <dbl>, Q19f_NoOutside <dbl>,
## # Q20_LeaveAccomodation_NoAff <fct>, Q24_Trust <fct>, Q25a_TensionClass <fct>, Q25b_TensionWork <fct>, Q25c_TensionSex <fct>,
## # Q25d_TensionsAge <hvn_lbl_>, Q25e_TensionRace <fct>, Q25f_TensionReligion <fct>, Q25g_TensionSexOrient <fct>,
## # Q37a_HoursWeekChildren <dbl>, Q37b_HoursWeekHousework <dbl>, Q37c_HoursWeekElderly <dbl>, Q48_Education <fct>, Q49_Area <fct>, …
B. Using RStudio
- Environment panel
- Click Import Dataset
- Choose From RData
- Select the
.RDatafile - Click Import
6.3 Data Cleaning
As you can see, there are 49 columns. Let’s simplify and work with fewer variables relevant for analysis. We can do this using the select() function in dplyr. We will save them into a new data frame, ch2_p1.1.
ch2_p1.1<-select(cp_csv,
Country, ID, HH2a_Sex, HH2b_Age, Q1_PaidJob, Q7_HoursWeekWork, Q17_Rooms, Q49_Area, Income_PPP)select(df, var1, var2,...) keeps only the listed columns and removes the rest.
6.3.1 Renaming the Variables
We will edit the names to much easier conventions. First, let us say that we just want to change them to lowercase names.
Inspect:
## country id hh2a_sex hh2b_age q1_paidjob q7_hoursweekwork q17_rooms q49_area income_ppp
## 1 Bulgaria BG1535216 Female 56 Yes NA 3 A medium to large town NA
## 2 Netherlands NL5130211 Female 20 Yes NA 1 A city or city suburb 273.4731
## 3 Netherlands NL5063519 Male 63 Yes NA 6 A medium to large town 3190.5196
## 4 Slovenia SI1042916 Male 63 <NA> 35 3 A medium to large town 4171.6329
## 5 Bulgaria BG1396625 Male 89 Yes NA 1 A city or city suburb 478.3407
## 6 Slovakia SK1184115 Female 44 <NA> 43 2 A medium to large town 759.6961
Inspect again on your own.
Let us rename specific columns:
ch2_p1.1<-ch2_p1.1 %>%
rename(
sex = hh2a_sex,
age = hh2b_age,
paid_job = q1_paidjob,
hours_work = q7_hoursweekwork,
rooms = q17_rooms,
area = q49_area,
income = income_ppp
)Inspect on your own.
names() simply gets or sets variable names.
tolower is to simply change to small letters
rename(new_name = old_name) changes a column’s name without touching data
%>% is the pipe operator: it passes the dataset from one function to the next.
6.3.2 Sorting Variables
Let’s say, we want to arrange income. We will create a different data for this. We use arrange or desc in dplyr package
#Sort dataset by income, ascending (default)
ch2_p1sort<-arrange(ch2_p1.1,
income)
#Sort dataset by income, descending
ch2_p1sort_desc<-arrange(ch2_p1.1, desc(income))Inspect on your own.
6.3.2.1 Making our own data frame
c() stands for combine. We’re creating a vector of values.
as.factor() converts the vector into a factor, that is a categorical variable not numbers.
data.frame(...) creates a dataset in R.
## ch2_p2
## 1 $10,000
## 2 $20,500
## 3 $15,250
## 4 $30,000
## 5 $50,750
Let’s sort this:
## ch2_p2
## 1 $10,000
## 2 $20,500
## 3 $15,250
## 4 $30,000
## 5 $50,750
It did not work. The problem is, ch2_p2 is not numeric. We can check:
## [1] "factor"
We need to make it into a numeric value but we have a , and $. We need to remove them. We use the str_replace function in the stringr package.
ch2_p2$ch2_p2<-str_replace(
ch2_p2$ch2_p2, #column we want to edit
pattern = ',', #what to find
replacement = '' #what to replace it with
)## ch2_p2
## 1 $10000
## 2 $20500
## 3 $15250
## 4 $30000
## 5 $50750
Now, let us remove the dollar sign; usually, simply doing the same thing we did with the comma works, but, there are some symbols that are used as “special character”. To “force” R to replace the presence of ‘$’, we add two backslashes before the dollar sign.
Inspect on your own.
Now, sort ch2_p2 on your own.
We can see that it was arranged, however, take a look at the way ch2_p2 was encoded; it is not numeric. So, we need to change this.
## [1] "character"
Change to numeric through as.numeric()
Inspect on your own.
6.3.3 Pipe Operator
%>% allows functions to be chained; it can be read as “then” - it tells R to do whatever comes after it to the stuff that comes before it.
Inspect on your own.
6.3.4 Adding columns
We will be using the pipe operator and the mutate to add a new column to ch2_p1.1
ch2_p1.1 <- ch2_p1.1 %>%
mutate( #adds a new column to the dataset
room_group = case_when( #checks each row's value and assigned a category
rooms == 1 ~ "one_room",
rooms == 2 ~ "two_rooms",
rooms == 3 ~ "three_rooms",
rooms == 4 ~ "four_rooms",
rooms == 5 ~ "five_rooms",
rooms == 6 ~ "six_rooms",
rooms == 7 ~ "seven_rooms",
rooms == 8 ~ "eight_rooms",
rooms == 9 ~ "nine_rooms",
rooms == 10 ~ "ten_rooms",
rooms == 11 ~ "eleven_rooms",
rooms == 12 ~ "twelve_rooms",
TRUE ~ "other" # anything outside 1–12 or missing
)
)6.3.5 Transforming Values
Now, you can see that paid_job is a character that is “yes/no”. We need to change that to numeric value. This is particularly useful when we use dummy variables later on. We will not use case_when as it is not necessary; rather, we will use ifelse:
6.3.6 Categorizing into groups
We use the case_when():
ch2_p1.1 <- ch2_p1.1 %>%
mutate(
age_group = case_when(
age >= 18 & age <= 29 ~ "young", # 18–29 years
age >= 30 & age <= 59 ~ "adult", # 30–59 years
age >= 60 ~ "older_adult", # 60+ years
TRUE ~ "other" # NA
)
)mutate() adding new column
The conditions are important.
6.3.7 Summarizing
Let us get the average of income by age group, which we’ll call ave_income, by using the group_by() and summarise() functions in dplyr
ch2_p1.1ave<-ch2_p1.1 %>%
group_by(age_group) %>% #group by age group, THEN
summarise(ave_income=mean(income)) #calculate the mean of income for each age group
head(ch2_p1.1ave)## # A tibble: 3 × 2
## age_group ave_income
## <chr> <dbl>
## 1 adult NA
## 2 older_adult NA
## 3 young NA
How did this happen? It is because there are NAs in income. Let’s check for NAs in our dataset
## country id sex age paid_job hours_work rooms area income room_group age_group
## 0 0 0 0 1768 2304 25 0 978 0 0
There are 978 NAs in income, that is why the average income is NA. For the sake of, let’s remove the NA.
ch2_p1.1ave <- ch2_p1.1 %>%
group_by(age_group) %>%
summarise(ave_income = mean(income, na.rm = TRUE))
head(ch2_p1.1ave)## # A tibble: 3 × 2
## age_group ave_income
## <chr> <dbl>
## 1 adult 2145.
## 2 older_adult 1566.
## 3 young 1809.
The difference is the inclusion of na.rm = TRUE which means we remove the missing values when calculating the mean.
So, even though we know income is numeric, the presence of NA causes the output when calculating mean, sd, etc. become NA too.
6.3.8 Merging datasets
We have two main datasets, ch2_p1.1 and ch2_p1.1ave. By doing this, we could compare side-by-side each observation compared to the average per age group.
We will join the datasets by age_group variable, since that is consistent across both datasets. We name the new file as ch2_p1merged:
## age_group country id sex age paid_job hours_work rooms area income room_group ave_income
## 1 adult Bulgaria BG1535216 Female 56 1 NA 3 A medium to large town NA three_rooms 2145.059
## 2 adult Hungary HU1042815 Male 39 0 NA 2 A village/small town 987.0490 two_rooms 2145.059
## 3 adult Poland PL1229721 Female 47 NA 40 4 A village/small town 1722.0737 four_rooms 2145.059
## 4 adult Germany DE1045717 Female 56 1 NA 3 A city or city suburb NA three_rooms 2145.059
## 5 adult Finland FI1024310 Male 58 NA 32 5 A medium to large town 3958.9271 five_rooms 2145.059
## 6 adult Slovakia SK1184115 Female 44 NA 43 2 A medium to large town 759.6961 two_rooms 2145.059
6.3.9 Splitting datasets
Say I want to save different datasets based on the age_group column.
filter() keeps rows that meet the condition.
You can save it as a .csv file:
Can you do the others? (young or older_adult)
6.3.10 Reshaping Datasets
Let’s say you want to reshape income and hours_work into a long format so every row is a variable measurement. We need both tidyr and dplyr packages.
6.3.10.1 Long Format
library(tidyr)
library(dplyr)
ch2_long<-ch2_p1.1 %>%
pivot_longer(
cols = c(hours_work, income), # columns to stack
names_to = "variable", #new column for variable names
values_to = "value" #new column for their values
)
head(ch2_long)## # A tibble: 6 × 11
## country id sex age paid_job rooms area room_group age_group variable value
## <chr> <chr> <chr> <int> <dbl> <int> <chr> <chr> <chr> <chr> <dbl>
## 1 Bulgaria BG1535216 Female 56 1 3 A medium to large town three_rooms adult hours_work NA
## 2 Bulgaria BG1535216 Female 56 1 3 A medium to large town three_rooms adult income NA
## 3 Netherlands NL5130211 Female 20 1 1 A city or city suburb one_room young hours_work NA
## 4 Netherlands NL5130211 Female 20 1 1 A city or city suburb one_room young income 273.
## 5 Netherlands NL5063519 Male 63 1 6 A medium to large town six_rooms older_adult hours_work NA
## 6 Netherlands NL5063519 Male 63 1 6 A medium to large town six_rooms older_adult income 3191.
pivot_longer() turns columns into rows
6.3.10.2 Wide Format
ch2_wide<-ch2_long %>%
pivot_wider(
names_from = variable, #column to spread to multiple columns
values_from = value #value to fill columns
)
head(ch2_wide)## # A tibble: 6 × 11
## country id sex age paid_job rooms area room_group age_group hours_work income
## <chr> <chr> <chr> <int> <dbl> <int> <chr> <chr> <chr> <dbl> <dbl>
## 1 Bulgaria BG1535216 Female 56 1 3 A medium to large town three_rooms adult NA NA
## 2 Netherlands NL5130211 Female 20 1 1 A city or city suburb one_room young NA 273.
## 3 Netherlands NL5063519 Male 63 1 6 A medium to large town six_rooms older_adult NA 3191.
## 4 Slovenia SI1042916 Male 63 NA 3 A medium to large town three_rooms older_adult 35 4172.
## 5 Bulgaria BG1396625 Male 89 1 1 A city or city suburb one_room older_adult NA 478.
## 6 Slovakia SK1184115 Female 44 NA 2 A medium to large town two_rooms adult 43 760.
Say you want to analyze repeated categories or measures. We can pivot the room_group and age_group to long:
ch2_long2<-ch2_p1.1 %>%
pivot_longer(
cols = c(room_group, age_group),
names_to = "category",
values_to = "group"
)
head(ch2_long2)## # A tibble: 6 × 11
## country id sex age paid_job hours_work rooms area income category group
## <chr> <chr> <chr> <int> <dbl> <int> <int> <chr> <dbl> <chr> <chr>
## 1 Bulgaria BG1535216 Female 56 1 NA 3 A medium to large town NA room_group three_rooms
## 2 Bulgaria BG1535216 Female 56 1 NA 3 A medium to large town NA age_group adult
## 3 Netherlands NL5130211 Female 20 1 NA 1 A city or city suburb 273. room_group one_room
## 4 Netherlands NL5130211 Female 20 1 NA 1 A city or city suburb 273. age_group young
## 5 Netherlands NL5063519 Male 63 1 NA 6 A medium to large town 3191. room_group six_rooms
## 6 Netherlands NL5063519 Male 63 1 NA 6 A medium to large town 3191. age_group older_adult
6.3.11 Missing Values
We already did it before but just to remind:
## country id sex age paid_job hours_work rooms area income room_group age_group
## 0 0 0 0 1768 2304 25 0 978 0 0
When there are missing values, it reduces sample size, can bias estimates if the missingness is not random and may introduce endogeneity if the missing data is correlated with the outcome or other variables.
6.3.11.1 Types of Missing Data
- MCAR - Missing Completely at Random
- Missingness is unrelated to anything in the data
- MAR - Missing at Random
Missingness depends on observed variables but not on the missing value itself.
Example: younger respondents are less likely to report income.
- MNAR - Missing Not at Random
Missingness depends on the value itself
Example: high income people refuse to report income
If you have MCAR, safe to remove this. If MAR or MNAR, removing these can cause bias. If a variable with missing values affects the outcome and you remove it or impute incorrectly, you can create endogeneity.
6.3.11.2 Handling Missing Values
6.3.11.2.1 a. Remove missing observations
!is.na() means you are keeping rows WITHOUT missing income
Again, double-check because removing the missing values may bias your results
6.3.11.2.2 b. Replace missing values (imputation of mean/median)
#replace missing income with the mean
ch2_p1.3<-ch2_p1.1 %>%
mutate(income=ifelse(is.na(income), mean(income, na.rm=TRUE), income))ifelse(condition, value_if_true, value_if_false) in this case, you are telling R to search income’s NA values, if there are NA values, calculate the mean income without the NAs. then, change the NAs with the calculated mean income, otherwise, keep the income as is.
There are advanced methods but will not be discussed.
You have to be careful and always check for missing data.