6 Data Management - Cross-Sectional Data

6.1 Where to Get Data?

Before we proceed to Data Management, let us first find where we can get data for the Data Story Archive. Note that the data you collect should still ensure that you are following the Code of Ethics and analyze Ethical Considerations.

Please view the necessary documents from the Office of the Vice Chancellor for Research and Innovation (https://www.dlsu.edu.ph/research/research-manual/)

A list of links you can search and get data from:

Note: I will not include the best links as they are pretty straightforward and these are governmental databases like the ones from World Bank, IMF, UN, Philippine Statistics Authority, and Bangko Sentral ng Pilipinas. The list here is a general list but use with proper discretion.

Name Link Notes
Awesome Public Datasets https://github.com/awesomedata/awesome-public-datasets This repository is filled with public datasets, mostly from International contexts.
Google Dataset Search https://datasetsearch.research.google.com/ You can download publicly available datasets from searching through Google. Though, sometimes the datasets come from ‘Statista.com’. You can check the sources from the search.

6.2 Preliminaries

6.2.1 Packages

We will mostly use the tidyverse package, in particular, the dplyr package and the tidyr package; double-check in your Packages list whether you have these two packages; if not, you can simply install them.

6.2.2 Clean Everything

Do this step every time you use other data or when we do the other chapters.

# Remove all objects in the global environment
rm(list = ls())

# Perform garbage collection to free up memory
gc()
##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 3112258 166.3    5173005 276.3  5173005 276.3
## Vcells 6108317  46.7   12255594  93.6  9035397  69.0

6.2.3 Importing the Datasets

Before we can manage the data, we must first import it into R. There are two ways to do this:

  1. Writing code (preferred for replicability)
  2. Clicking in RStudio

We start with the most common file types.

6.2.3.1 Importing a CSV file

cp_csv<-read.csv("CP_1.csv")

The file must be in the working directory. If it is not, then, simply putting the file name inside the quotation marks will not work. Meaning, you have to input the entire path where the file is.

Another, if you notice, the CSV file name is simple and easy to share. The object, in this case, cp_csv is also in small letters and does not have spaces but rather an underscore replacing the space. Always do this when naming objects.

a. Small letters

b. No spaces

c. Place underscore instead.

To check the information and what the dataset looks like:

head(cp_csv)
##       Country        ID HH1_Num_People HH2a_Sex HH2b_Age
## 1    Bulgaria BG1535216              3   Female       56
## 2 Netherlands NL5130211              1   Female       20
## 3 Netherlands NL5063519              2     Male       63
## 4    Slovenia SI1042916              3     Male       63
## 5    Bulgaria BG1396625              2     Male       89
## 6    Slovakia SK1184115              2   Female       44
##                               HH2d_EmploymentSituation
## 1                       Unemployed less than 12 months
## 2 In education (at school, university, etc.) / student
## 3                                              Retired
## 4        At work as employee or employer/self-employed
## 5                                              Retired
## 6        At work as employee or employer/self-employed
##   Q1_PaidJob                    Q2_Empoyment
## 1        Yes                            <NA>
## 2        Yes                            <NA>
## 3        Yes                            <NA>
## 4       <NA>                        Employed
## 5        Yes                            <NA>
## 6       <NA> Self-employed without employees
##                          Q3_Contract
## 1                               <NA>
## 2                               <NA>
## 3                               <NA>
## 4 On an unlimited permanent contract
## 5                               <NA>
## 6 On an unlimited permanent contract
##                       Q4_Occupation Q7_HoursWeekWork
## 1                              <NA>               NA
## 2                              <NA>               NA
## 3                              <NA>               NA
## 4 Technician or junior professional               35
## 5                              <NA>               NA
## 6 Technician or junior professional               43
##   Q7a_AdditionalJob Q7b_HoursWeekWeekAddJob Q7c_Work
## 1              <NA>                      NA       No
## 2              <NA>                      NA       No
## 3              <NA>                      NA       No
## 4                No                      NA     <NA>
## 5              <NA>                      NA       No
## 6                No                      NA     <NA>
##   Q8_HoursWeekWorkPref Q9_HoursWeekWorkPartner
## 1                   40                      40
## 2                   10                      NA
## 3                    0                      NA
## 4                   30                      NA
## 5                    0                      NA
## 6                   43                      NA
##   Q10_HoursWeekWorkPartnerPref Q17_Rooms
## 1                            4         3
## 2                           NA         1
## 3                            0         6
## 4                           30         3
## 5                            0         1
## 6                           NA         2
##                               Q18_Tenancy Q19a_ShortageSpace
## 1                    Own without mortgage                  0
## 2                                   Other                  0
## 3                       Own with mortgage                  0
## 4 Tenant, paying rent to private landlord                  0
## 5                    Own without mortgage                  0
## 6                    Own without mortgage                  1
##   Q19b_Rot Q19c_Leaks Q19d_NoFlusingToilet Q19e_NoBathShower
## 1        0          0                    0                 0
## 2        0          0                    0                 0
## 3        0          0                    0                 0
## 4        0          0                    0                 0
## 5        0          0                    0                 0
## 6        0          0                    0                 0
##   Q19f_NoOutside Q20_LeaveAccomodation_NoAff
## 1              0               Very unlikely
## 2              0              Quite unlikely
## 3              0               Very unlikely
## 4              0              Quite unlikely
## 5              0               Very unlikely
## 6              0                 Very likely
##                      Q24_Trust Q25a_TensionClass
## 1                            6      Some tension
## 2                            6      Some tension
## 3                            8      Some tension
## 4                            6      Some tension
## 5 1 - you can't be too careful      Some tension
## 6                            5      Some tension
##   Q25b_TensionWork Q25c_TensionSex Q25d_TensionsAge
## 1       No tension      No tension                3
## 2     Some tension      No tension                3
## 3     Some tension      No tension                3
## 4     Some tension      No tension                3
## 5       Don”t know    Some tension                2
## 6     Some tension    Some tension                2
##   Q25e_TensionRace Q25f_TensionReligion
## 1     Some tension           No tension
## 2     Some tension         Some tension
## 3 A lot of tension     A lot of tension
## 4     Some tension         Some tension
## 5       Don”t know           Don”t know
## 6     Some tension         Some tension
##   Q25g_TensionSexOrient Q37a_HoursWeekChildren
## 1            No tension                      2
## 2      A lot of tension                     NA
## 3          Some tension                     NA
## 4            No tension                      5
## 5            Don”t know                     NA
## 6          Some tension                     NA
##   Q37b_HoursWeekHousework Q37c_HoursWeekElderly
## 1                      14                    NA
## 2                       7                    NA
## 3                      14                     3
## 4                      NA                    NA
## 5                      NA                    NA
## 6                      20                    NA
##   Q48_Education               Q49_Area
## 1          1250 A medium to large town
## 2          2956  A city or city suburb
## 3          2963 A medium to large town
## 4          3420 A medium to large town
## 5          1253  A city or city suburb
## 6          3530 A medium to large town
##   Q50a_NeighbourhoodNoise Q50b_NeighbourhoodAir
## 1             No problems           No problems
## 2             No problems           No problems
## 3             No problems           No problems
## 4             No problems           No problems
## 5             No problems           No problems
## 6             No problems           No problems
##   Q50c_NeighbourhoodWater Q50d_NeighbourhoodCrime
## 1          Major problems             No problems
## 2             No problems             No problems
## 3             No problems             No problems
## 4             No problems             No problems
## 5             No problems       Moderate problems
## 6             No problems             No problems
##   Q50e_NeighbourhoodLitter Q50f_NeighbourhoodTraffic
## 1              No problems               No problems
## 2              No problems               No problems
## 3              No problems               No problems
## 4              No problems         Moderate problems
## 5        Moderate problems                Don't know
## 6              No problems         Moderate problems
##   Q51a_AccServicesPost Q51b_AccServicesBank
## 1     Service not used               Easily
## 2               Easily With some difficulty
## 3               Easily               Easily
## 4               Easily               Easily
## 5     Service not used     Service not used
## 6          Very easily          Very easily
##   Q53c_QualityPublicTransport Income_PPP
## 1                           6         NA
## 2                           7   273.4731
## 3                           7  3190.5196
## 4                           6  4171.6329
## 5                           3   478.3407
## 6                           5   759.6961
str(cp_csv)
## 'data.frame':    4036 obs. of  49 variables:
##  $ Country                     : chr  "Bulgaria" "Netherlands" "Netherlands" "Slovenia" ...
##  $ ID                          : chr  "BG1535216" "NL5130211" "NL5063519" "SI1042916" ...
##  $ HH1_Num_People              : int  3 1 2 3 2 2 2 1 4 2 ...
##  $ HH2a_Sex                    : chr  "Female" "Female" "Male" "Male" ...
##  $ HH2b_Age                    : int  56 20 63 63 89 44 52 43 47 81 ...
##  $ HH2d_EmploymentSituation    : chr  "Unemployed less than 12 months" "In education (at school, university, etc.) / student" "Retired" "At work as employee or employer/self-employed" ...
##  $ Q1_PaidJob                  : chr  "Yes" "Yes" "Yes" NA ...
##  $ Q2_Empoyment                : chr  NA NA NA "Employed" ...
##  $ Q3_Contract                 : chr  NA NA NA "On an unlimited permanent contract" ...
##  $ Q4_Occupation               : chr  NA NA NA "Technician or junior professional" ...
##  $ Q7_HoursWeekWork            : int  NA NA NA 35 NA 43 NA NA 17 NA ...
##  $ Q7a_AdditionalJob           : chr  NA NA NA "No" ...
##  $ Q7b_HoursWeekWeekAddJob     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Q7c_Work                    : chr  "No" "No" "No" NA ...
##  $ Q8_HoursWeekWorkPref        : int  40 10 0 30 0 43 40 40 35 0 ...
##  $ Q9_HoursWeekWorkPartner     : int  40 NA NA NA NA NA 41 NA 35 NA ...
##  $ Q10_HoursWeekWorkPartnerPref: int  4 NA 0 30 0 NA 30 NA 35 0 ...
##  $ Q17_Rooms                   : int  3 1 6 3 1 2 NA 1 5 3 ...
##  $ Q18_Tenancy                 : chr  "Own without mortgage" "Other" "Own with mortgage" "Tenant, paying rent to private landlord" ...
##  $ Q19a_ShortageSpace          : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Q19b_Rot                    : int  0 0 0 0 0 0 0 1 1 0 ...
##  $ Q19c_Leaks                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Q19d_NoFlusingToilet        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Q19e_NoBathShower           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Q19f_NoOutside              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Q20_LeaveAccomodation_NoAff : chr  "Very unlikely" "Quite unlikely" "Very unlikely" "Quite unlikely" ...
##  $ Q24_Trust                   : chr  "6" "6" "8" "6" ...
##  $ Q25a_TensionClass           : chr  "Some tension" "Some tension" "Some tension" "Some tension" ...
##  $ Q25b_TensionWork            : chr  "No tension" "Some tension" "Some tension" "Some tension" ...
##  $ Q25c_TensionSex             : chr  "No tension" "No tension" "No tension" "No tension" ...
##  $ Q25d_TensionsAge            : int  3 3 3 3 2 2 98 3 3 1 ...
##  $ Q25e_TensionRace            : chr  "Some tension" "Some tension" "A lot of tension" "Some tension" ...
##  $ Q25f_TensionReligion        : chr  "No tension" "Some tension" "A lot of tension" "Some tension" ...
##  $ Q25g_TensionSexOrient       : chr  "No tension" "A lot of tension" "Some tension" "No tension" ...
##  $ Q37a_HoursWeekChildren      : int  2 NA NA 5 NA NA 4 NA NA NA ...
##  $ Q37b_HoursWeekHousework     : int  14 7 14 NA NA 20 4 8 4 6 ...
##  $ Q37c_HoursWeekElderly       : int  NA NA 3 NA NA NA NA NA NA NA ...
##  $ Q48_Education               : chr  "1250" "2956" "2963" "3420" ...
##  $ Q49_Area                    : chr  "A medium to large town" "A city or city suburb" "A medium to large town" "A medium to large town" ...
##  $ Q50a_NeighbourhoodNoise     : chr  "No problems" "No problems" "No problems" "No problems" ...
##  $ Q50b_NeighbourhoodAir       : chr  "No problems" "No problems" "No problems" "No problems" ...
##  $ Q50c_NeighbourhoodWater     : chr  "Major problems" "No problems" "No problems" "No problems" ...
##  $ Q50d_NeighbourhoodCrime     : chr  "No problems" "No problems" "No problems" "No problems" ...
##  $ Q50e_NeighbourhoodLitter    : chr  "No problems" "No problems" "No problems" "No problems" ...
##  $ Q50f_NeighbourhoodTraffic   : chr  "No problems" "No problems" "No problems" "Moderate problems" ...
##  $ Q51a_AccServicesPost        : chr  "Service not used" "Easily" "Easily" "Easily" ...
##  $ Q51b_AccServicesBank        : chr  "Easily" "With some difficulty" "Easily" "Easily" ...
##  $ Q53c_QualityPublicTransport : chr  "6" "7" "7" "6" ...
##  $ Income_PPP                  : num  NA 273 3191 4172 478 ...

The head code shows the first 6 rows and all the columns (variables) in the dataset. Meanwhile the str code shows the information of the dataset, such as what format type each column/variable is, how many observations, etc.

B. Using RStudio

  1. Go to Environment
  2. Click Import Dataset
  3. Choose From Text (base) or From Test (readr)
  4. Click Browse and select your .csv file
  5. Click Import

RStudio will load the dataset and generate R code in Console. Copy the R code to your script for reproducibility.

6.2.3.2 Importing an Excel File

First, install and load the package:

install.packages("readxl")
library(readxl)
cp_xlsx<-read_excel("CP_1.xlsx")

If you only want a particular sheet to be imported,

cp_xlsx<-read_excel("CP_1.xlsx", sheet= 1)
head(cp_xlsx)
## # A tibble: 6 × 49
##   Country     ID        HH1_Num_People HH2a_Sex HH2b_Age
##   <chr>       <chr>              <dbl> <chr>       <dbl>
## 1 Bulgaria    BG1535216              3 Female         56
## 2 Netherlands NL5130211              1 Female         20
## 3 Netherlands NL5063519              2 Male           63
## 4 Slovenia    SI1042916              3 Male           63
## 5 Bulgaria    BG1396625              2 Male           89
## 6 Slovakia    SK1184115              2 Female         44
## # ℹ 44 more variables: HH2d_EmploymentSituation <chr>,
## #   Q1_PaidJob <chr>, Q2_Empoyment <chr>, Q3_Contract <chr>,
## #   Q4_Occupation <chr>, Q7_HoursWeekWork <dbl>,
## #   Q7a_AdditionalJob <chr>, Q7b_HoursWeekWeekAddJob <dbl>,
## #   Q7c_Work <chr>, Q8_HoursWeekWorkPref <dbl>,
## #   Q9_HoursWeekWorkPartner <dbl>,
## #   Q10_HoursWeekWorkPartnerPref <dbl>, Q17_Rooms <dbl>, …

B. Using RStudio

  1. Go to Environment
  2. Click Import Dataset
  3. Choose From Excel
  4. Click Browse and select your .xlsx file
  5. Choose the sheet
  6. Click Import

Check that you also have the variables as the header.

6.2.3.3 Importing RData

These files are native to R and can contain multiple objects.

load("Dataset_CP.RData")
head(Dataset_CP)
## # A tibble: 6 × 49
##   Country     ID        HH1_Num_People HH2a_Sex HH2b_Age
##   <fct>       <chr>              <dbl> <fct>       <dbl>
## 1 Bulgaria    BG1535216              3 Female         56
## 2 Netherlands NL5130211              1 Female         20
## 3 Netherlands NL5063519              2 Male           63
## 4 Slovenia    SI1042916              3 Male           63
## 5 Bulgaria    BG1396625              2 Male           89
## 6 Slovakia    SK1184115              2 Female         44
## # ℹ 44 more variables: HH2d_EmploymentSituation <fct>,
## #   Q1_PaidJob <fct>, Q2_Empoyment <fct>, Q3_Contract <fct>,
## #   Q4_Occupation <fct>, Q7_HoursWeekWork <dbl>,
## #   Q7a_AdditionalJob <fct>, Q7b_HoursWeekWeekAddJob <dbl>,
## #   Q7c_Work <fct>, Q8_HoursWeekWorkPref <dbl>,
## #   Q9_HoursWeekWorkPartner <dbl>,
## #   Q10_HoursWeekWorkPartnerPref <dbl>, Q17_Rooms <dbl>, …

B. Using RStudio

  1. Environment panel
  2. Click Import Dataset
  3. Choose From RData
  4. Select the .RData file
  5. Click Import

6.3 Data Cleaning

As you can see, there are 49 columns. Let’s simplify and work with fewer variables relevant for analysis. We can do this using the select() function in dplyr. We will save them into a new data frame, ch2_p1.1.

library(dplyr)
ch2_p1.1<-select(cp_csv,
       Country, ID, HH2a_Sex, HH2b_Age, Q1_PaidJob, Q7_HoursWeekWork, Q17_Rooms, Q49_Area, Income_PPP)

select(df, var1, var2,...) keeps only the listed columns and removes the rest.

6.3.1 Renaming the Variables

We will edit the names to much easier conventions. First, let us say that we just want to change them to lowercase names.

names(ch2_p1.1)<-tolower(names(ch2_p1.1))

Inspect:

head(ch2_p1.1)
##       country        id hh2a_sex hh2b_age q1_paidjob
## 1    Bulgaria BG1535216   Female       56        Yes
## 2 Netherlands NL5130211   Female       20        Yes
## 3 Netherlands NL5063519     Male       63        Yes
## 4    Slovenia SI1042916     Male       63       <NA>
## 5    Bulgaria BG1396625     Male       89        Yes
## 6    Slovakia SK1184115   Female       44       <NA>
##   q7_hoursweekwork q17_rooms               q49_area
## 1               NA         3 A medium to large town
## 2               NA         1  A city or city suburb
## 3               NA         6 A medium to large town
## 4               35         3 A medium to large town
## 5               NA         1  A city or city suburb
## 6               43         2 A medium to large town
##   income_ppp
## 1         NA
## 2   273.4731
## 3  3190.5196
## 4  4171.6329
## 5   478.3407
## 6   759.6961

Inspect again on your own.

Let us rename specific columns:

ch2_p1.1<-ch2_p1.1 %>% 
  rename(
    sex = hh2a_sex,
    age = hh2b_age,
    paid_job = q1_paidjob,
    hours_work = q7_hoursweekwork,
    rooms = q17_rooms,
    area = q49_area,
    income = income_ppp
  )

Inspect on your own.

names() simply gets or sets variable names.

tolower is to simply change to small letters

rename(new_name = old_name) changes a column’s name without touching data

%>% is the pipe operator: it passes the dataset from one function to the next.

6.3.2 Sorting Variables

Let’s say, we want to arrange income. We will create a different data for this. We use arrange or desc in dplyr package

#Sort dataset by income, ascending (default)
ch2_p1sort<-arrange(ch2_p1.1,
                    income)

#Sort dataset by income, descending
ch2_p1sort_desc<-arrange(ch2_p1.1, desc(income))

Inspect on your own.

6.3.2.1 Making our own data frame

ch2_p2<-data.frame(
  ch2_p2 = as.factor(c("$10,000", "$20,500", "$15,250", "$30,000", "$50,750"))
)

c() stands for combine. We’re creating a vector of values.

as.factor() converts the vector into a factor, that is a categorical variable not numbers.

data.frame(...) creates a dataset in R.

head(ch2_p2)
##    ch2_p2
## 1 $10,000
## 2 $20,500
## 3 $15,250
## 4 $30,000
## 5 $50,750

Let’s sort this:

ch2_p2sort<-arrange(ch2_p2)
head(ch2_p2)
##    ch2_p2
## 1 $10,000
## 2 $20,500
## 3 $15,250
## 4 $30,000
## 5 $50,750

It did not work. The problem is, ch2_p2 is not numeric. We can check:

class(ch2_p2$ch2_p2)
## [1] "factor"

We need to make it into a numeric value but we have a , and $. We need to remove them. We use the str_replace function in the stringr package.

library(stringr)
ch2_p2$ch2_p2<-str_replace(
  ch2_p2$ch2_p2, #column we want to edit
  pattern = ',', #what to find
  replacement = '' #what to replace it with
)
head(ch2_p2)
##   ch2_p2
## 1 $10000
## 2 $20500
## 3 $15250
## 4 $30000
## 5 $50750

Now, let us remove the dollar sign; usually, simply doing the same thing we did with the comma works, but, there are some symbols that are used as “special character”. To “force” R to replace the presence of ‘$’, we add two backslashes before the dollar sign.

ch2_p2$ch2_p2<-str_replace(
  ch2_p2$ch2_p2, 
  pattern = '\\$',
  replacement = ''
)

Inspect on your own.

Now, sort ch2_p2 on your own.

We can see that it was arranged, however, take a look at the way ch2_p2 was encoded; it is not numeric. So, we need to change this.

class(ch2_p2$ch2_p2)
## [1] "character"

Change to numeric through as.numeric()

ch2_p2$ch2_p2<-as.numeric(ch2_p2$ch2_p2)

Inspect on your own.

6.3.3 Pipe Operator

%>% allows functions to be chained; it can be read as “then” - it tells R to do whatever comes after it to the stuff that comes before it.

ch2_p1.1.2 <- ch2_p1.1 %>% 
  filter(age > 60) %>% 
  arrange(desc(income))

Inspect on your own.

6.3.4 Adding columns

We will be using the pipe operator and the mutate to add a new column to ch2_p1.1

ch2_p1.1 <- ch2_p1.1 %>%
  mutate( #adds a new column to the dataset
    room_group = case_when( #checks each row's value and assigned a category
      rooms == 1  ~ "one_room",
      rooms == 2  ~ "two_rooms",
      rooms == 3  ~ "three_rooms",
      rooms == 4  ~ "four_rooms",
      rooms == 5  ~ "five_rooms",
      rooms == 6  ~ "six_rooms",
      rooms == 7  ~ "seven_rooms",
      rooms == 8  ~ "eight_rooms",
      rooms == 9  ~ "nine_rooms",
      rooms == 10 ~ "ten_rooms",
      rooms == 11 ~ "eleven_rooms",
      rooms == 12 ~ "twelve_rooms",
      TRUE        ~ "other"        # anything outside 1–12 or missing
    )
  )

6.3.5 Transforming Values

Now, you can see that paid_job is a character that is “yes/no”. We need to change that to numeric value. This is particularly useful when we use dummy variables later on. We will not use case_when as it is not necessary; rather, we will use ifelse:

ch2_p1.1<-ch2_p1.1 %>% 
  mutate(
    paid_job = ifelse(paid_job == "Yes", 1, 0)
  )

6.3.6 Categorizing into groups

We use the case_when():

ch2_p1.1 <- ch2_p1.1 %>%
  mutate(
    age_group = case_when(
      age >= 18 & age <= 29  ~ "young",        # 18–29 years
      age >= 30 & age <= 59  ~ "adult",        # 30–59 years
      age >= 60              ~ "older_adult",  # 60+ years
      TRUE                   ~ "other"         # NA
    )
  )

mutate() adding new column

The conditions are important.

6.3.7 Summarizing

Let us get the average of income by age group, which we’ll call ave_income, by using the group_by() and summarise() functions in dplyr

ch2_p1.1ave<-ch2_p1.1 %>% 
  group_by(age_group) %>% #group by age group, THEN
  summarise(ave_income=mean(income)) #calculate the mean of income for each age group
head(ch2_p1.1ave)
## # A tibble: 3 × 2
##   age_group   ave_income
##   <chr>            <dbl>
## 1 adult               NA
## 2 older_adult         NA
## 3 young               NA

How did this happen? It is because there are NAs in income. Let’s check for NAs in our dataset

colSums(is.na(ch2_p1.1))
##    country         id        sex        age   paid_job 
##          0          0          0          0       1768 
## hours_work      rooms       area     income room_group 
##       2304         25          0        978          0 
##  age_group 
##          0

There are 978 NAs in income, that is why the average income is NA. For the sake of, let’s remove the NA.

ch2_p1.1ave <- ch2_p1.1 %>% 
  group_by(age_group) %>% 
  summarise(ave_income = mean(income, na.rm = TRUE))
head(ch2_p1.1ave)
## # A tibble: 3 × 2
##   age_group   ave_income
##   <chr>            <dbl>
## 1 adult            2145.
## 2 older_adult      1566.
## 3 young            1809.

The difference is the inclusion of na.rm = TRUE which means we remove the missing values when calculating the mean.

So, even though we know income is numeric, the presence of NA causes the output when calculating mean, sd, etc. become NA too.

6.3.8 Merging datasets

We have two main datasets, ch2_p1.1 and ch2_p1.1ave. By doing this, we could compare side-by-side each observation compared to the average per age group.

We will join the datasets by age_group variable, since that is consistent across both datasets. We name the new file as ch2_p1merged:

ch2_p1merged<-merge(x=ch2_p1.1, y=ch2_p1.1ave, by="age_group")
head(ch2_p1merged)
##   age_group  country        id    sex age paid_job
## 1     adult Bulgaria BG1535216 Female  56        1
## 2     adult  Hungary HU1042815   Male  39        0
## 3     adult   Poland PL1229721 Female  47       NA
## 4     adult  Germany DE1045717 Female  56        1
## 5     adult  Finland FI1024310   Male  58       NA
## 6     adult Slovakia SK1184115 Female  44       NA
##   hours_work rooms                   area    income
## 1         NA     3 A medium to large town        NA
## 2         NA     2   A village/small town  987.0490
## 3         40     4   A village/small town 1722.0737
## 4         NA     3  A city or city suburb        NA
## 5         32     5 A medium to large town 3958.9271
## 6         43     2 A medium to large town  759.6961
##    room_group ave_income
## 1 three_rooms   2145.059
## 2   two_rooms   2145.059
## 3  four_rooms   2145.059
## 4 three_rooms   2145.059
## 5  five_rooms   2145.059
## 6   two_rooms   2145.059

6.3.9 Splitting datasets

Say I want to save different datasets based on the age_group column.

adult_data<-ch2_p1.1 %>% filter(age_group=="adult")

filter() keeps rows that meet the condition.

You can save it as a .csv file:

write.csv(adult_data, "adult_data.csv", row.names = FALSE)

Can you do the others? (young or older_adult)

6.3.10 Reshaping Datasets

Let’s say you want to reshape income and hours_work into a long format so every row is a variable measurement. We need both tidyr and dplyr packages.

6.3.10.1 Long Format

library(tidyr)
library(dplyr)

ch2_long<-ch2_p1.1 %>% 
  pivot_longer(
    cols = c(hours_work, income), # columns to stack
    names_to = "variable", #new column for variable names
    values_to = "value" #new column for their values
  )
head(ch2_long)
## # A tibble: 6 × 11
##   country   id    sex     age paid_job rooms area  room_group
##   <chr>     <chr> <chr> <int>    <dbl> <int> <chr> <chr>     
## 1 Bulgaria  BG15… Fema…    56        1     3 A me… three_roo…
## 2 Bulgaria  BG15… Fema…    56        1     3 A me… three_roo…
## 3 Netherla… NL51… Fema…    20        1     1 A ci… one_room  
## 4 Netherla… NL51… Fema…    20        1     1 A ci… one_room  
## 5 Netherla… NL50… Male     63        1     6 A me… six_rooms 
## 6 Netherla… NL50… Male     63        1     6 A me… six_rooms 
## # ℹ 3 more variables: age_group <chr>, variable <chr>,
## #   value <dbl>

pivot_longer() turns columns into rows

6.3.10.2 Wide Format

ch2_wide<-ch2_long %>% 
  pivot_wider(
    names_from = variable, #column to spread to multiple columns
    values_from = value #value to fill columns
  )
head(ch2_wide)
## # A tibble: 6 × 11
##   country   id    sex     age paid_job rooms area  room_group
##   <chr>     <chr> <chr> <int>    <dbl> <int> <chr> <chr>     
## 1 Bulgaria  BG15… Fema…    56        1     3 A me… three_roo…
## 2 Netherla… NL51… Fema…    20        1     1 A ci… one_room  
## 3 Netherla… NL50… Male     63        1     6 A me… six_rooms 
## 4 Slovenia  SI10… Male     63       NA     3 A me… three_roo…
## 5 Bulgaria  BG13… Male     89        1     1 A ci… one_room  
## 6 Slovakia  SK11… Fema…    44       NA     2 A me… two_rooms 
## # ℹ 3 more variables: age_group <chr>, hours_work <dbl>,
## #   income <dbl>

Say you want to analyze repeated categories or measures. We can pivot the room_group and age_group to long:

ch2_long2<-ch2_p1.1 %>% 
  pivot_longer(
    cols = c(room_group, age_group),
    names_to = "category",
    values_to = "group"
  )
head(ch2_long2)
## # A tibble: 6 × 11
##   country   id    sex     age paid_job hours_work rooms area 
##   <chr>     <chr> <chr> <int>    <dbl>      <int> <int> <chr>
## 1 Bulgaria  BG15… Fema…    56        1         NA     3 A me…
## 2 Bulgaria  BG15… Fema…    56        1         NA     3 A me…
## 3 Netherla… NL51… Fema…    20        1         NA     1 A ci…
## 4 Netherla… NL51… Fema…    20        1         NA     1 A ci…
## 5 Netherla… NL50… Male     63        1         NA     6 A me…
## 6 Netherla… NL50… Male     63        1         NA     6 A me…
## # ℹ 3 more variables: income <dbl>, category <chr>,
## #   group <chr>

6.3.11 Missing Values

We already did it before but just to remind:

#Count missing values per column
colSums(is.na(ch2_p1.1))
##    country         id        sex        age   paid_job 
##          0          0          0          0       1768 
## hours_work      rooms       area     income room_group 
##       2304         25          0        978          0 
##  age_group 
##          0

When there are missing values, it reduces sample size, can bias estimates if the missingness is not random and may introduce endogeneity if the missing data is correlated with the outcome or other variables.

6.3.11.1 Types of Missing Data

  1. MCAR - Missing Completely at Random
    • Missingness is unrelated to anything in the data
  2. MAR - Missing at Random
    • Missingness depends on observed variables but not on the missing value itself.

    • Example: younger respondents are less likely to report income.

  3. MNAR - Missing Not at Random
    • Missingness depends on the value itself

    • Example: high income people refuse to report income

If you have MCAR, safe to remove this. If MAR or MNAR, removing these can cause bias. If a variable with missing values affects the outcome and you remove it or impute incorrectly, you can create endogeneity.

6.3.11.2 Handling Missing Values

6.3.11.2.1 a. Remove missing observations
#Keep only rows without missing income
ch2_p1.2<-ch2_p1.1 %>% 
  filter(!is.na(income))

!is.na() means you are keeping rows WITHOUT missing income

Again, double-check because removing the missing values may bias your results

6.3.11.2.2 b. Replace missing values (imputation of mean/median)
#replace missing income with the mean
ch2_p1.3<-ch2_p1.1 %>% 
  mutate(income=ifelse(is.na(income), mean(income, na.rm=TRUE), income))

ifelse(condition, value_if_true, value_if_false) in this case, you are telling R to search income’s NA values, if there are NA values, calculate the mean income without the NAs. then, change the NAs with the calculated mean income, otherwise, keep the income as is.

There are advanced methods but will not be discussed.

You have to be careful and always check for missing data.

6.4 Closing

  1. Quiz questions will be uploaded in Animospace.

  2. You have 15 minutes to answer.

  3. Clean the environment and free the memory.