6 Data Management - Cross-Sectional Data

6.1 Where to Get Data?

Before we proceed to Data Management, let us first find where we can get data for the Data Story Archive. Note that the data you collect should still ensure that you are following the Code of Ethics and analyze Ethical Considerations.

Please view the necessary documents from the Office of the Vice Chancellor for Research and Innovation (https://www.dlsu.edu.ph/research/research-manual/)

A list of links you can search and get data from:

Note: I will not include the best links as they are pretty straightforward and these are governmental databases like the ones from World Bank, IMF, UN, Philippine Statistics Authority, and Bangko Sentral ng Pilipinas. The list here is a general list but use with proper discretion.

Name Link Notes
Awesome Public Datasets https://github.com/awesomedata/awesome-public-datasets This repository is filled with public datasets, mostly from International contexts.
Google Dataset Search https://datasetsearch.research.google.com/ You can download publicly available datasets from searching through Google. Though, sometimes the datasets come from ‘Statista.com’. You can check the sources from the search.

6.2 Preliminaries

6.2.1 Packages

We will mostly use the tidyverse package, in particular, the dplyr package and the tidyr package; double-check in your Packages list whether you have these two packages; if not, you can simply install them.

6.2.2 Clean Everything

Do this step every time you use other data or when we do the other chapters.

# Remove all objects in the global environment
rm(list = ls())

# Perform garbage collection to free up memory
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  5758881 307.6   10206931 545.2  10206931  545.2
## Vcells 12146455  92.7   81773890 623.9 249542360 1903.9

6.2.3 Importing the Datasets

Before we can manage the data, we must first import it into R. There are two ways to do this:

  1. Writing code (preferred for replicability)
  2. Clicking in RStudio

We start with the most common file types.

6.2.3.1 Importing a CSV file

cp_csv<-read.csv("CP_1.csv")

The file must be in the working directory. If it is not, then, simply putting the file name inside the quotation marks will not work. Meaning, you have to input the entire path where the file is.

Another, if you notice, the CSV file name is simple and easy to share. The object, in this case, cp_csv is also in small letters and does not have spaces but rather an underscore replacing the space. Always do this when naming objects.

a. Small letters

b. No spaces

c. Place underscore instead.

To check the information and what the dataset looks like:

head(cp_csv)
##       Country        ID HH1_Num_People HH2a_Sex HH2b_Age                             HH2d_EmploymentSituation Q1_PaidJob
## 1    Bulgaria BG1535216              3   Female       56                       Unemployed less than 12 months        Yes
## 2 Netherlands NL5130211              1   Female       20 In education (at school, university, etc.) / student        Yes
## 3 Netherlands NL5063519              2     Male       63                                              Retired        Yes
## 4    Slovenia SI1042916              3     Male       63        At work as employee or employer/self-employed       <NA>
## 5    Bulgaria BG1396625              2     Male       89                                              Retired        Yes
## 6    Slovakia SK1184115              2   Female       44        At work as employee or employer/self-employed       <NA>
##                      Q2_Empoyment                        Q3_Contract                     Q4_Occupation Q7_HoursWeekWork Q7a_AdditionalJob
## 1                            <NA>                               <NA>                              <NA>               NA              <NA>
## 2                            <NA>                               <NA>                              <NA>               NA              <NA>
## 3                            <NA>                               <NA>                              <NA>               NA              <NA>
## 4                        Employed On an unlimited permanent contract Technician or junior professional               35                No
## 5                            <NA>                               <NA>                              <NA>               NA              <NA>
## 6 Self-employed without employees On an unlimited permanent contract Technician or junior professional               43                No
##   Q7b_HoursWeekWeekAddJob Q7c_Work Q8_HoursWeekWorkPref Q9_HoursWeekWorkPartner Q10_HoursWeekWorkPartnerPref Q17_Rooms
## 1                      NA       No                   40                      40                            4         3
## 2                      NA       No                   10                      NA                           NA         1
## 3                      NA       No                    0                      NA                            0         6
## 4                      NA     <NA>                   30                      NA                           30         3
## 5                      NA       No                    0                      NA                            0         1
## 6                      NA     <NA>                   43                      NA                           NA         2
##                               Q18_Tenancy Q19a_ShortageSpace Q19b_Rot Q19c_Leaks Q19d_NoFlusingToilet Q19e_NoBathShower Q19f_NoOutside
## 1                    Own without mortgage                  0        0          0                    0                 0              0
## 2                                   Other                  0        0          0                    0                 0              0
## 3                       Own with mortgage                  0        0          0                    0                 0              0
## 4 Tenant, paying rent to private landlord                  0        0          0                    0                 0              0
## 5                    Own without mortgage                  0        0          0                    0                 0              0
## 6                    Own without mortgage                  1        0          0                    0                 0              0
##   Q20_LeaveAccomodation_NoAff                    Q24_Trust Q25a_TensionClass Q25b_TensionWork Q25c_TensionSex Q25d_TensionsAge
## 1               Very unlikely                            6      Some tension       No tension      No tension                3
## 2              Quite unlikely                            6      Some tension     Some tension      No tension                3
## 3               Very unlikely                            8      Some tension     Some tension      No tension                3
## 4              Quite unlikely                            6      Some tension     Some tension      No tension                3
## 5               Very unlikely 1 - you can't be too careful      Some tension       Don”t know    Some tension                2
## 6                 Very likely                            5      Some tension     Some tension    Some tension                2
##   Q25e_TensionRace Q25f_TensionReligion Q25g_TensionSexOrient Q37a_HoursWeekChildren Q37b_HoursWeekHousework Q37c_HoursWeekElderly
## 1     Some tension           No tension            No tension                      2                      14                    NA
## 2     Some tension         Some tension      A lot of tension                     NA                       7                    NA
## 3 A lot of tension     A lot of tension          Some tension                     NA                      14                     3
## 4     Some tension         Some tension            No tension                      5                      NA                    NA
## 5       Don”t know           Don”t know            Don”t know                     NA                      NA                    NA
## 6     Some tension         Some tension          Some tension                     NA                      20                    NA
##   Q48_Education               Q49_Area Q50a_NeighbourhoodNoise Q50b_NeighbourhoodAir Q50c_NeighbourhoodWater Q50d_NeighbourhoodCrime
## 1          1250 A medium to large town             No problems           No problems          Major problems             No problems
## 2          2956  A city or city suburb             No problems           No problems             No problems             No problems
## 3          2963 A medium to large town             No problems           No problems             No problems             No problems
## 4          3420 A medium to large town             No problems           No problems             No problems             No problems
## 5          1253  A city or city suburb             No problems           No problems             No problems       Moderate problems
## 6          3530 A medium to large town             No problems           No problems             No problems             No problems
##   Q50e_NeighbourhoodLitter Q50f_NeighbourhoodTraffic Q51a_AccServicesPost Q51b_AccServicesBank Q53c_QualityPublicTransport Income_PPP
## 1              No problems               No problems     Service not used               Easily                           6         NA
## 2              No problems               No problems               Easily With some difficulty                           7   273.4731
## 3              No problems               No problems               Easily               Easily                           7  3190.5196
## 4              No problems         Moderate problems               Easily               Easily                           6  4171.6329
## 5        Moderate problems                Don't know     Service not used     Service not used                           3   478.3407
## 6              No problems         Moderate problems          Very easily          Very easily                           5   759.6961
str(cp_csv)
## 'data.frame':    4036 obs. of  49 variables:
##  $ Country                     : chr  "Bulgaria" "Netherlands" "Netherlands" "Slovenia" ...
##  $ ID                          : chr  "BG1535216" "NL5130211" "NL5063519" "SI1042916" ...
##  $ HH1_Num_People              : int  3 1 2 3 2 2 2 1 4 2 ...
##  $ HH2a_Sex                    : chr  "Female" "Female" "Male" "Male" ...
##  $ HH2b_Age                    : int  56 20 63 63 89 44 52 43 47 81 ...
##  $ HH2d_EmploymentSituation    : chr  "Unemployed less than 12 months" "In education (at school, university, etc.) / student" "Retired" "At work as employee or employer/self-employed" ...
##  $ Q1_PaidJob                  : chr  "Yes" "Yes" "Yes" NA ...
##  $ Q2_Empoyment                : chr  NA NA NA "Employed" ...
##  $ Q3_Contract                 : chr  NA NA NA "On an unlimited permanent contract" ...
##  $ Q4_Occupation               : chr  NA NA NA "Technician or junior professional" ...
##  $ Q7_HoursWeekWork            : int  NA NA NA 35 NA 43 NA NA 17 NA ...
##  $ Q7a_AdditionalJob           : chr  NA NA NA "No" ...
##  $ Q7b_HoursWeekWeekAddJob     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Q7c_Work                    : chr  "No" "No" "No" NA ...
##  $ Q8_HoursWeekWorkPref        : int  40 10 0 30 0 43 40 40 35 0 ...
##  $ Q9_HoursWeekWorkPartner     : int  40 NA NA NA NA NA 41 NA 35 NA ...
##  $ Q10_HoursWeekWorkPartnerPref: int  4 NA 0 30 0 NA 30 NA 35 0 ...
##  $ Q17_Rooms                   : int  3 1 6 3 1 2 NA 1 5 3 ...
##  $ Q18_Tenancy                 : chr  "Own without mortgage" "Other" "Own with mortgage" "Tenant, paying rent to private landlord" ...
##  $ Q19a_ShortageSpace          : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Q19b_Rot                    : int  0 0 0 0 0 0 0 1 1 0 ...
##  $ Q19c_Leaks                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Q19d_NoFlusingToilet        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Q19e_NoBathShower           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Q19f_NoOutside              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Q20_LeaveAccomodation_NoAff : chr  "Very unlikely" "Quite unlikely" "Very unlikely" "Quite unlikely" ...
##  $ Q24_Trust                   : chr  "6" "6" "8" "6" ...
##  $ Q25a_TensionClass           : chr  "Some tension" "Some tension" "Some tension" "Some tension" ...
##  $ Q25b_TensionWork            : chr  "No tension" "Some tension" "Some tension" "Some tension" ...
##  $ Q25c_TensionSex             : chr  "No tension" "No tension" "No tension" "No tension" ...
##  $ Q25d_TensionsAge            : int  3 3 3 3 2 2 98 3 3 1 ...
##  $ Q25e_TensionRace            : chr  "Some tension" "Some tension" "A lot of tension" "Some tension" ...
##  $ Q25f_TensionReligion        : chr  "No tension" "Some tension" "A lot of tension" "Some tension" ...
##  $ Q25g_TensionSexOrient       : chr  "No tension" "A lot of tension" "Some tension" "No tension" ...
##  $ Q37a_HoursWeekChildren      : int  2 NA NA 5 NA NA 4 NA NA NA ...
##  $ Q37b_HoursWeekHousework     : int  14 7 14 NA NA 20 4 8 4 6 ...
##  $ Q37c_HoursWeekElderly       : int  NA NA 3 NA NA NA NA NA NA NA ...
##  $ Q48_Education               : chr  "1250" "2956" "2963" "3420" ...
##  $ Q49_Area                    : chr  "A medium to large town" "A city or city suburb" "A medium to large town" "A medium to large town" ...
##  $ Q50a_NeighbourhoodNoise     : chr  "No problems" "No problems" "No problems" "No problems" ...
##  $ Q50b_NeighbourhoodAir       : chr  "No problems" "No problems" "No problems" "No problems" ...
##  $ Q50c_NeighbourhoodWater     : chr  "Major problems" "No problems" "No problems" "No problems" ...
##  $ Q50d_NeighbourhoodCrime     : chr  "No problems" "No problems" "No problems" "No problems" ...
##  $ Q50e_NeighbourhoodLitter    : chr  "No problems" "No problems" "No problems" "No problems" ...
##  $ Q50f_NeighbourhoodTraffic   : chr  "No problems" "No problems" "No problems" "Moderate problems" ...
##  $ Q51a_AccServicesPost        : chr  "Service not used" "Easily" "Easily" "Easily" ...
##  $ Q51b_AccServicesBank        : chr  "Easily" "With some difficulty" "Easily" "Easily" ...
##  $ Q53c_QualityPublicTransport : chr  "6" "7" "7" "6" ...
##  $ Income_PPP                  : num  NA 273 3191 4172 478 ...

The head code shows the first 6 rows and all the columns (variables) in the dataset. Meanwhile the str code shows the information of the dataset, such as what format type each column/variable is, how many observations, etc.

B. Using RStudio

  1. Go to Environment
  2. Click Import Dataset
  3. Choose From Text (base) or From Test (readr)
  4. Click Browse and select your .csv file
  5. Click Import

RStudio will load the dataset and generate R code in Console. Copy the R code to your script for reproducibility.

6.2.3.2 Importing an Excel File

First, install and load the package:

install.packages("readxl")
library(readxl)
cp_xlsx<-read_excel("CP_1.xlsx")

If you only want a particular sheet to be imported,

cp_xlsx<-read_excel("CP_1.xlsx", sheet= 1)
head(cp_xlsx)
## # A tibble: 6 × 49
##   Country     ID    HH1_Num_People HH2a_Sex HH2b_Age HH2d_EmploymentSitua…¹ Q1_PaidJob Q2_Empoyment Q3_Contract Q4_Occupation Q7_HoursWeekWork
##   <chr>       <chr>          <dbl> <chr>       <dbl> <chr>                  <chr>      <chr>        <chr>       <chr>                    <dbl>
## 1 Bulgaria    BG15…              3 Female         56 Unemployed less than … Yes        <NA>         <NA>        <NA>                        NA
## 2 Netherlands NL51…              1 Female         20 In education (at scho… Yes        <NA>         <NA>        <NA>                        NA
## 3 Netherlands NL50…              2 Male           63 Retired                Yes        <NA>         <NA>        <NA>                        NA
## 4 Slovenia    SI10…              3 Male           63 At work as employee o… <NA>       Employed     On an unli… Technician o…               35
## 5 Bulgaria    BG13…              2 Male           89 Retired                Yes        <NA>         <NA>        <NA>                        NA
## 6 Slovakia    SK11…              2 Female         44 At work as employee o… <NA>       Self-employ… On an unli… Technician o…               43
## # ℹ abbreviated name: ¹​HH2d_EmploymentSituation
## # ℹ 38 more variables: Q7a_AdditionalJob <chr>, Q7b_HoursWeekWeekAddJob <dbl>, Q7c_Work <chr>, Q8_HoursWeekWorkPref <dbl>,
## #   Q9_HoursWeekWorkPartner <dbl>, Q10_HoursWeekWorkPartnerPref <dbl>, Q17_Rooms <dbl>, Q18_Tenancy <chr>, Q19a_ShortageSpace <dbl>,
## #   Q19b_Rot <dbl>, Q19c_Leaks <dbl>, Q19d_NoFlusingToilet <dbl>, Q19e_NoBathShower <dbl>, Q19f_NoOutside <dbl>,
## #   Q20_LeaveAccomodation_NoAff <chr>, Q24_Trust <chr>, Q25a_TensionClass <chr>, Q25b_TensionWork <chr>, Q25c_TensionSex <chr>,
## #   Q25d_TensionsAge <chr>, Q25e_TensionRace <chr>, Q25f_TensionReligion <chr>, Q25g_TensionSexOrient <chr>, Q37a_HoursWeekChildren <dbl>,
## #   Q37b_HoursWeekHousework <dbl>, Q37c_HoursWeekElderly <dbl>, Q48_Education <chr>, Q49_Area <chr>, Q50a_NeighbourhoodNoise <chr>, …

B. Using RStudio

  1. Go to Environment
  2. Click Import Dataset
  3. Choose From Excel
  4. Click Browse and select your .xlsx file
  5. Choose the sheet
  6. Click Import

Check that you also have the variables as the header.

6.2.3.3 Importing RData

These files are native to R and can contain multiple objects.

load("Dataset_CP.RData")
head(Dataset_CP)
## # A tibble: 6 × 49
##   Country     ID    HH1_Num_People HH2a_Sex HH2b_Age HH2d_EmploymentSitua…¹ Q1_PaidJob Q2_Empoyment Q3_Contract Q4_Occupation Q7_HoursWeekWork
##   <fct>       <chr>          <dbl> <fct>       <dbl> <fct>                  <fct>      <fct>        <fct>       <fct>                    <dbl>
## 1 Bulgaria    BG15…              3 Female         56 Unemployed less than … Yes        <NA>         <NA>        <NA>                        NA
## 2 Netherlands NL51…              1 Female         20 In education (at scho… Yes        <NA>         <NA>        <NA>                        NA
## 3 Netherlands NL50…              2 Male           63 Retired                Yes        <NA>         <NA>        <NA>                        NA
## 4 Slovenia    SI10…              3 Male           63 At work as employee o… <NA>       Employed     On an unli… Technician o…               35
## 5 Bulgaria    BG13…              2 Male           89 Retired                Yes        <NA>         <NA>        <NA>                        NA
## 6 Slovakia    SK11…              2 Female         44 At work as employee o… <NA>       Self-employ… On an unli… Technician o…               43
## # ℹ abbreviated name: ¹​HH2d_EmploymentSituation
## # ℹ 38 more variables: Q7a_AdditionalJob <fct>, Q7b_HoursWeekWeekAddJob <dbl>, Q7c_Work <fct>, Q8_HoursWeekWorkPref <dbl>,
## #   Q9_HoursWeekWorkPartner <dbl>, Q10_HoursWeekWorkPartnerPref <dbl>, Q17_Rooms <dbl>, Q18_Tenancy <fct>, Q19a_ShortageSpace <dbl>,
## #   Q19b_Rot <dbl>, Q19c_Leaks <dbl>, Q19d_NoFlusingToilet <dbl>, Q19e_NoBathShower <dbl>, Q19f_NoOutside <dbl>,
## #   Q20_LeaveAccomodation_NoAff <fct>, Q24_Trust <fct>, Q25a_TensionClass <fct>, Q25b_TensionWork <fct>, Q25c_TensionSex <fct>,
## #   Q25d_TensionsAge <hvn_lbl_>, Q25e_TensionRace <fct>, Q25f_TensionReligion <fct>, Q25g_TensionSexOrient <fct>,
## #   Q37a_HoursWeekChildren <dbl>, Q37b_HoursWeekHousework <dbl>, Q37c_HoursWeekElderly <dbl>, Q48_Education <fct>, Q49_Area <fct>, …

B. Using RStudio

  1. Environment panel
  2. Click Import Dataset
  3. Choose From RData
  4. Select the .RData file
  5. Click Import

6.3 Data Cleaning

As you can see, there are 49 columns. Let’s simplify and work with fewer variables relevant for analysis. We can do this using the select() function in dplyr. We will save them into a new data frame, ch2_p1.1.

library(dplyr)
ch2_p1.1<-select(cp_csv,
       Country, ID, HH2a_Sex, HH2b_Age, Q1_PaidJob, Q7_HoursWeekWork, Q17_Rooms, Q49_Area, Income_PPP)

select(df, var1, var2,...) keeps only the listed columns and removes the rest.

6.3.1 Renaming the Variables

We will edit the names to much easier conventions. First, let us say that we just want to change them to lowercase names.

names(ch2_p1.1)<-tolower(names(ch2_p1.1))

Inspect:

head(ch2_p1.1)
##       country        id hh2a_sex hh2b_age q1_paidjob q7_hoursweekwork q17_rooms               q49_area income_ppp
## 1    Bulgaria BG1535216   Female       56        Yes               NA         3 A medium to large town         NA
## 2 Netherlands NL5130211   Female       20        Yes               NA         1  A city or city suburb   273.4731
## 3 Netherlands NL5063519     Male       63        Yes               NA         6 A medium to large town  3190.5196
## 4    Slovenia SI1042916     Male       63       <NA>               35         3 A medium to large town  4171.6329
## 5    Bulgaria BG1396625     Male       89        Yes               NA         1  A city or city suburb   478.3407
## 6    Slovakia SK1184115   Female       44       <NA>               43         2 A medium to large town   759.6961

Inspect again on your own.

Let us rename specific columns:

ch2_p1.1<-ch2_p1.1 %>% 
  rename(
    sex = hh2a_sex,
    age = hh2b_age,
    paid_job = q1_paidjob,
    hours_work = q7_hoursweekwork,
    rooms = q17_rooms,
    area = q49_area,
    income = income_ppp
  )

Inspect on your own.

names() simply gets or sets variable names.

tolower is to simply change to small letters

rename(new_name = old_name) changes a column’s name without touching data

%>% is the pipe operator: it passes the dataset from one function to the next.

6.3.2 Sorting Variables

Let’s say, we want to arrange income. We will create a different data for this. We use arrange or desc in dplyr package

#Sort dataset by income, ascending (default)
ch2_p1sort<-arrange(ch2_p1.1,
                    income)

#Sort dataset by income, descending
ch2_p1sort_desc<-arrange(ch2_p1.1, desc(income))

Inspect on your own.

6.3.2.1 Making our own data frame

ch2_p2<-data.frame(
  ch2_p2 = as.factor(c("$10,000", "$20,500", "$15,250", "$30,000", "$50,750"))
)

c() stands for combine. We’re creating a vector of values.

as.factor() converts the vector into a factor, that is a categorical variable not numbers.

data.frame(...) creates a dataset in R.

head(ch2_p2)
##    ch2_p2
## 1 $10,000
## 2 $20,500
## 3 $15,250
## 4 $30,000
## 5 $50,750

Let’s sort this:

ch2_p2sort<-arrange(ch2_p2)
head(ch2_p2)
##    ch2_p2
## 1 $10,000
## 2 $20,500
## 3 $15,250
## 4 $30,000
## 5 $50,750

It did not work. The problem is, ch2_p2 is not numeric. We can check:

class(ch2_p2$ch2_p2)
## [1] "factor"

We need to make it into a numeric value but we have a , and $. We need to remove them. We use the str_replace function in the stringr package.

library(stringr)
ch2_p2$ch2_p2<-str_replace(
  ch2_p2$ch2_p2, #column we want to edit
  pattern = ',', #what to find
  replacement = '' #what to replace it with
)
head(ch2_p2)
##   ch2_p2
## 1 $10000
## 2 $20500
## 3 $15250
## 4 $30000
## 5 $50750

Now, let us remove the dollar sign; usually, simply doing the same thing we did with the comma works, but, there are some symbols that are used as “special character”. To “force” R to replace the presence of ‘$’, we add two backslashes before the dollar sign.

ch2_p2$ch2_p2<-str_replace(
  ch2_p2$ch2_p2, 
  pattern = '\\$',
  replacement = ''
)

Inspect on your own.

Now, sort ch2_p2 on your own.

We can see that it was arranged, however, take a look at the way ch2_p2 was encoded; it is not numeric. So, we need to change this.

class(ch2_p2$ch2_p2)
## [1] "character"

Change to numeric through as.numeric()

ch2_p2$ch2_p2<-as.numeric(ch2_p2$ch2_p2)

Inspect on your own.

6.3.3 Pipe Operator

%>% allows functions to be chained; it can be read as “then” - it tells R to do whatever comes after it to the stuff that comes before it.

ch2_p1.1.2 <- ch2_p1.1 %>% 
  filter(age > 60) %>% 
  arrange(desc(income))

Inspect on your own.

6.3.4 Adding columns

We will be using the pipe operator and the mutate to add a new column to ch2_p1.1

ch2_p1.1 <- ch2_p1.1 %>%
  mutate( #adds a new column to the dataset
    room_group = case_when( #checks each row's value and assigned a category
      rooms == 1  ~ "one_room",
      rooms == 2  ~ "two_rooms",
      rooms == 3  ~ "three_rooms",
      rooms == 4  ~ "four_rooms",
      rooms == 5  ~ "five_rooms",
      rooms == 6  ~ "six_rooms",
      rooms == 7  ~ "seven_rooms",
      rooms == 8  ~ "eight_rooms",
      rooms == 9  ~ "nine_rooms",
      rooms == 10 ~ "ten_rooms",
      rooms == 11 ~ "eleven_rooms",
      rooms == 12 ~ "twelve_rooms",
      TRUE        ~ "other"        # anything outside 1–12 or missing
    )
  )

6.3.5 Transforming Values

Now, you can see that paid_job is a character that is “yes/no”. We need to change that to numeric value. This is particularly useful when we use dummy variables later on. We will not use case_when as it is not necessary; rather, we will use ifelse:

ch2_p1.1<-ch2_p1.1 %>% 
  mutate(
    paid_job = ifelse(paid_job == "Yes", 1, 0)
  )

6.3.6 Categorizing into groups

We use the case_when():

ch2_p1.1 <- ch2_p1.1 %>%
  mutate(
    age_group = case_when(
      age >= 18 & age <= 29  ~ "young",        # 18–29 years
      age >= 30 & age <= 59  ~ "adult",        # 30–59 years
      age >= 60              ~ "older_adult",  # 60+ years
      TRUE                   ~ "other"         # NA
    )
  )

mutate() adding new column

The conditions are important.

6.3.7 Summarizing

Let us get the average of income by age group, which we’ll call ave_income, by using the group_by() and summarise() functions in dplyr

ch2_p1.1ave<-ch2_p1.1 %>% 
  group_by(age_group) %>% #group by age group, THEN
  summarise(ave_income=mean(income)) #calculate the mean of income for each age group
head(ch2_p1.1ave)
## # A tibble: 3 × 2
##   age_group   ave_income
##   <chr>            <dbl>
## 1 adult               NA
## 2 older_adult         NA
## 3 young               NA

How did this happen? It is because there are NAs in income. Let’s check for NAs in our dataset

colSums(is.na(ch2_p1.1))
##    country         id        sex        age   paid_job hours_work      rooms       area     income room_group  age_group 
##          0          0          0          0       1768       2304         25          0        978          0          0

There are 978 NAs in income, that is why the average income is NA. For the sake of, let’s remove the NA.

ch2_p1.1ave <- ch2_p1.1 %>% 
  group_by(age_group) %>% 
  summarise(ave_income = mean(income, na.rm = TRUE))
head(ch2_p1.1ave)
## # A tibble: 3 × 2
##   age_group   ave_income
##   <chr>            <dbl>
## 1 adult            2145.
## 2 older_adult      1566.
## 3 young            1809.

The difference is the inclusion of na.rm = TRUE which means we remove the missing values when calculating the mean.

So, even though we know income is numeric, the presence of NA causes the output when calculating mean, sd, etc. become NA too.

6.3.8 Merging datasets

We have two main datasets, ch2_p1.1 and ch2_p1.1ave. By doing this, we could compare side-by-side each observation compared to the average per age group.

We will join the datasets by age_group variable, since that is consistent across both datasets. We name the new file as ch2_p1merged:

ch2_p1merged<-merge(x=ch2_p1.1, y=ch2_p1.1ave, by="age_group")
head(ch2_p1merged)
##   age_group  country        id    sex age paid_job hours_work rooms                   area    income  room_group ave_income
## 1     adult Bulgaria BG1535216 Female  56        1         NA     3 A medium to large town        NA three_rooms   2145.059
## 2     adult  Hungary HU1042815   Male  39        0         NA     2   A village/small town  987.0490   two_rooms   2145.059
## 3     adult   Poland PL1229721 Female  47       NA         40     4   A village/small town 1722.0737  four_rooms   2145.059
## 4     adult  Germany DE1045717 Female  56        1         NA     3  A city or city suburb        NA three_rooms   2145.059
## 5     adult  Finland FI1024310   Male  58       NA         32     5 A medium to large town 3958.9271  five_rooms   2145.059
## 6     adult Slovakia SK1184115 Female  44       NA         43     2 A medium to large town  759.6961   two_rooms   2145.059

6.3.9 Splitting datasets

Say I want to save different datasets based on the age_group column.

adult_data<-ch2_p1.1 %>% filter(age_group=="adult")

filter() keeps rows that meet the condition.

You can save it as a .csv file:

write.csv(adult_data, "adult_data.csv", row.names = FALSE)

Can you do the others? (young or older_adult)

6.3.10 Reshaping Datasets

Let’s say you want to reshape income and hours_work into a long format so every row is a variable measurement. We need both tidyr and dplyr packages.

6.3.10.1 Long Format

library(tidyr)
library(dplyr)

ch2_long<-ch2_p1.1 %>% 
  pivot_longer(
    cols = c(hours_work, income), # columns to stack
    names_to = "variable", #new column for variable names
    values_to = "value" #new column for their values
  )
head(ch2_long)
## # A tibble: 6 × 11
##   country     id        sex      age paid_job rooms area                   room_group  age_group   variable   value
##   <chr>       <chr>     <chr>  <int>    <dbl> <int> <chr>                  <chr>       <chr>       <chr>      <dbl>
## 1 Bulgaria    BG1535216 Female    56        1     3 A medium to large town three_rooms adult       hours_work   NA 
## 2 Bulgaria    BG1535216 Female    56        1     3 A medium to large town three_rooms adult       income       NA 
## 3 Netherlands NL5130211 Female    20        1     1 A city or city suburb  one_room    young       hours_work   NA 
## 4 Netherlands NL5130211 Female    20        1     1 A city or city suburb  one_room    young       income      273.
## 5 Netherlands NL5063519 Male      63        1     6 A medium to large town six_rooms   older_adult hours_work   NA 
## 6 Netherlands NL5063519 Male      63        1     6 A medium to large town six_rooms   older_adult income     3191.

pivot_longer() turns columns into rows

6.3.10.2 Wide Format

ch2_wide<-ch2_long %>% 
  pivot_wider(
    names_from = variable, #column to spread to multiple columns
    values_from = value #value to fill columns
  )
head(ch2_wide)
## # A tibble: 6 × 11
##   country     id        sex      age paid_job rooms area                   room_group  age_group   hours_work income
##   <chr>       <chr>     <chr>  <int>    <dbl> <int> <chr>                  <chr>       <chr>            <dbl>  <dbl>
## 1 Bulgaria    BG1535216 Female    56        1     3 A medium to large town three_rooms adult               NA    NA 
## 2 Netherlands NL5130211 Female    20        1     1 A city or city suburb  one_room    young               NA   273.
## 3 Netherlands NL5063519 Male      63        1     6 A medium to large town six_rooms   older_adult         NA  3191.
## 4 Slovenia    SI1042916 Male      63       NA     3 A medium to large town three_rooms older_adult         35  4172.
## 5 Bulgaria    BG1396625 Male      89        1     1 A city or city suburb  one_room    older_adult         NA   478.
## 6 Slovakia    SK1184115 Female    44       NA     2 A medium to large town two_rooms   adult               43   760.

Say you want to analyze repeated categories or measures. We can pivot the room_group and age_group to long:

ch2_long2<-ch2_p1.1 %>% 
  pivot_longer(
    cols = c(room_group, age_group),
    names_to = "category",
    values_to = "group"
  )
head(ch2_long2)
## # A tibble: 6 × 11
##   country     id        sex      age paid_job hours_work rooms area                   income category   group      
##   <chr>       <chr>     <chr>  <int>    <dbl>      <int> <int> <chr>                   <dbl> <chr>      <chr>      
## 1 Bulgaria    BG1535216 Female    56        1         NA     3 A medium to large town    NA  room_group three_rooms
## 2 Bulgaria    BG1535216 Female    56        1         NA     3 A medium to large town    NA  age_group  adult      
## 3 Netherlands NL5130211 Female    20        1         NA     1 A city or city suburb    273. room_group one_room   
## 4 Netherlands NL5130211 Female    20        1         NA     1 A city or city suburb    273. age_group  young      
## 5 Netherlands NL5063519 Male      63        1         NA     6 A medium to large town  3191. room_group six_rooms  
## 6 Netherlands NL5063519 Male      63        1         NA     6 A medium to large town  3191. age_group  older_adult

6.3.11 Missing Values

We already did it before but just to remind:

#Count missing values per column
colSums(is.na(ch2_p1.1))
##    country         id        sex        age   paid_job hours_work      rooms       area     income room_group  age_group 
##          0          0          0          0       1768       2304         25          0        978          0          0

When there are missing values, it reduces sample size, can bias estimates if the missingness is not random and may introduce endogeneity if the missing data is correlated with the outcome or other variables.

6.3.11.1 Types of Missing Data

  1. MCAR - Missing Completely at Random
    • Missingness is unrelated to anything in the data
  2. MAR - Missing at Random
    • Missingness depends on observed variables but not on the missing value itself.

    • Example: younger respondents are less likely to report income.

  3. MNAR - Missing Not at Random
    • Missingness depends on the value itself

    • Example: high income people refuse to report income

If you have MCAR, safe to remove this. If MAR or MNAR, removing these can cause bias. If a variable with missing values affects the outcome and you remove it or impute incorrectly, you can create endogeneity.

6.3.11.2 Handling Missing Values

6.3.11.2.1 a. Remove missing observations
#Keep only rows without missing income
ch2_p1.2<-ch2_p1.1 %>% 
  filter(!is.na(income))

!is.na() means you are keeping rows WITHOUT missing income

Again, double-check because removing the missing values may bias your results

6.3.11.2.2 b. Replace missing values (imputation of mean/median)
#replace missing income with the mean
ch2_p1.3<-ch2_p1.1 %>% 
  mutate(income=ifelse(is.na(income), mean(income, na.rm=TRUE), income))

ifelse(condition, value_if_true, value_if_false) in this case, you are telling R to search income’s NA values, if there are NA values, calculate the mean income without the NAs. then, change the NAs with the calculated mean income, otherwise, keep the income as is.

There are advanced methods but will not be discussed.

You have to be careful and always check for missing data.

6.4 Closing

  1. Quiz questions will be uploaded in Animospace.

  2. You have 15 minutes to answer.

  3. Clean the environment and free the memory.