10 Descriptive Statistics

For this section, we will use Ch6Practice.RData. Since the file is in RData format, we only need to load it into R.

10.1 Preliminaries

10.1.1 Install Packages

We install (if necessary) and load the psych, gtsummary, stargazer and kableExtra packages, and we load the tidyverse package.

if(!("psych" %in% installed.packages()[,"Package"])) install.packages("psych")
if(!("gtsummary" %in% installed.packages()[,"Package"])) install.packages("gtsummary")
if(!("stargazer" %in% installed.packages()[,"Package"])) install.packages("stargazer")
if(!("kableExtra" %in% installed.packages()[,"Package"])) install.packages("kableExtra")

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:scales':
## 
##     alpha, rescale
## The following object is masked from 'package:car':
## 
##     logit
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(tidyverse)
library(gtsummary)
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

10.1.2 Load the dataset

rm(list=ls())
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  3733295 199.4    6548841 349.8   6548841  349.8
## Vcells 34340189 262.0  124329084 948.6 202265969 1543.2
load("Ch6Practice.RData")

We subset the data:

# Create a subset of the dataset with only female respondents 
Dataset_CP_f <- Dataset_CP %>% filter(HH2a_Sex == "Female")  
# Create a subset of the dataset with only male respondents
Dataset_CP_m <- Dataset_CP %>% filter(HH2a_Sex == "Male")
str(Dataset_CP_m)
## tibble [1,723 × 49] (S3: tbl_df/tbl/data.frame)
##  $ Country                     : Factor w/ 34 levels "Austria","Belgium",..: 20 25 3 8 12 16 7 21 6 6 ...
##  $ ID                          : chr [1:1723] "NL5063519" "SI1042916" "BG1396625" "EE1261010" ...
##   ..- attr(*, "label")= chr "Unique respondent ID"
##   ..- attr(*, "format.spss")= chr "A60"
##   ..- attr(*, "display_width")= int 9
##  $ HH1_Num_People              : num [1:1723] 2 3 2 2 2 2 1 2 2 1 ...
##  $ HH2a_Sex                    : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
##  $ HH2b_Age                    : num [1:1723] 63 63 89 52 81 54 59 76 70 53 ...
##  $ HH2d_EmploymentSituation    : Factor w/ 10 levels "At work as employee or employer/self-employed",..: 7 1 7 7 7 6 1 7 7 1 ...
##  $ Q1_PaidJob                  : Factor w/ 4 levels "Yes","No","Don't Know",..: 1 NA 1 1 1 1 NA 1 1 NA ...
##  $ Q2_Empoyment                : Factor w/ 6 levels "Self-employed without employees",..: NA 3 NA NA NA NA 3 NA NA 3 ...
##  $ Q3_Contract                 : Factor w/ 9 levels "On an unlimited permanent contract",..: NA 1 NA NA NA NA 1 NA NA 1 ...
##  $ Q4_Occupation               : Factor w/ 13 levels "Armed forces",..: NA 4 NA NA NA NA 6 NA NA 6 ...
##  $ Q7_HoursWeekWork            : num [1:1723] NA 35 NA NA NA NA 37 NA NA 25 ...
##  $ Q7a_AdditionalJob           : Factor w/ 4 levels "Yes","No","Don't know",..: NA 2 NA NA NA NA 2 NA NA 2 ...
##  $ Q7b_HoursWeekWeekAddJob     : num [1:1723] NA NA NA NA NA NA NA NA NA NA ...
##  $ Q7c_Work                    : Factor w/ 4 levels "Yes","No","Don't Know",..: 2 NA 2 1 2 2 NA 2 2 NA ...
##  $ Q8_HoursWeekWorkPref        : num [1:1723] 0 30 0 40 0 20 30 NA 40 38 ...
##  $ Q9_HoursWeekWorkPartner     : num [1:1723] NA NA NA 41 NA 30 NA NA NA NA ...
##  $ Q10_HoursWeekWorkPartnerPref: num [1:1723] 0 30 0 30 0 40 NA NA 35 NA ...
##  $ Q17_Rooms                   : num [1:1723] 6 3 1 NA 3 4 2 3 3 1 ...
##  $ Q18_Tenancy                 : Factor w/ 8 levels "Own without mortgage",..: 2 3 1 1 5 1 1 1 4 4 ...
##  $ Q19a_ShortageSpace          : num [1:1723] 0 0 0 0 0 0 1 0 0 1 ...
##  $ Q19b_Rot                    : num [1:1723] 0 0 0 0 0 0 0 1 0 0 ...
##  $ Q19c_Leaks                  : num [1:1723] 0 0 0 0 0 1 0 1 0 0 ...
##  $ Q19d_NoFlusingToilet        : num [1:1723] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Q19e_NoBathShower           : num [1:1723] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Q19f_NoOutside              : num [1:1723] 0 0 0 0 0 1 0 0 0 0 ...
##  $ Q20_LeaveAccomodation_NoAff : Factor w/ 6 levels "Very likely",..: 4 3 4 4 4 4 4 4 4 4 ...
##  $ Q24_Trust                   : Factor w/ 12 levels "1 - you can't be too careful",..: 8 6 1 9 5 10 6 2 3 4 ...
##  $ Q25a_TensionClass           : Factor w/ 5 levels "A lot of tension",..: 2 2 2 2 1 1 2 1 3 4 ...
##  $ Q25b_TensionWork            : Factor w/ 5 levels "A lot of tension",..: 2 2 4 2 1 1 2 1 3 3 ...
##  $ Q25c_TensionSex             : Factor w/ 5 levels "A lot of tension",..: 3 3 2 3 1 1 3 2 2 2 ...
##  $ Q25d_TensionsAge            : hvn_lbl_ [1:1723]  3,  3,  2, 98,  1,  1,  2,  2,  ...
##    ..@ label      : chr "Q25d Old people and young people / How much tension is there in this country?"
##    ..@ na_values  : num [1:2] 98 99
##    ..@ format.spss: chr "F8.0"
##    ..@ labels     : Named num [1:5] 1 2 3 98 99
##    .. ..- attr(*, "names")= chr [1:5] "A lot of  tension" "Some tension" "No tension" "(Don’t know)" ...
##  $ Q25e_TensionRace            : Factor w/ 5 levels "A lot of tension",..: 1 2 4 2 1 1 2 2 2 1 ...
##  $ Q25f_TensionReligion        : Factor w/ 5 levels "A lot of tension",..: 1 2 4 3 2 1 2 2 2 1 ...
##  $ Q25g_TensionSexOrient       : Factor w/ 5 levels "A lot of tension",..: 2 3 4 1 1 1 2 3 3 4 ...
##  $ Q37a_HoursWeekChildren      : num [1:1723] NA 5 NA 4 NA NA NA NA NA NA ...
##  $ Q37b_HoursWeekHousework     : num [1:1723] 14 NA NA 4 6 NA 7 12 3 5 ...
##  $ Q37c_HoursWeekElderly       : num [1:1723] 3 NA NA NA NA NA NA NA NA NA ...
##  $ Q48_Education               : Factor w/ 472 levels "1010","1011",..: 306 361 44 111 174 226 91 312 78 76 ...
##  $ Q49_Area                    : Factor w/ 6 levels "The open countryside",..: 3 3 4 2 2 2 4 2 2 4 ...
##  $ Q50a_NeighbourhoodNoise     : Factor w/ 5 levels "Major problems",..: 3 3 3 3 3 3 3 3 3 2 ...
##  $ Q50b_NeighbourhoodAir       : Factor w/ 5 levels "Major problems",..: 3 3 3 3 3 3 3 3 2 4 ...
##  $ Q50c_NeighbourhoodWater     : Factor w/ 5 levels "Major problems",..: 3 3 3 2 3 1 3 3 3 4 ...
##  $ Q50d_NeighbourhoodCrime     : Factor w/ 5 levels "Major problems",..: 3 3 2 3 3 3 2 3 3 1 ...
##  $ Q50e_NeighbourhoodLitter    : Factor w/ 5 levels "Major problems",..: 3 3 2 2 3 3 3 3 3 1 ...
##  $ Q50f_NeighbourhoodTraffic   : Factor w/ 5 levels "Major problems",..: 3 2 4 3 3 3 3 3 3 4 ...
##  $ Q51a_AccServicesPost        : Factor w/ 7 levels "With great difficulty",..: 3 3 5 3 3 1 2 2 3 4 ...
##  $ Q51b_AccServicesBank        : Factor w/ 7 levels "With great difficulty",..: 3 3 5 3 3 1 4 5 3 4 ...
##  $ Q53c_QualityPublicTransport : Factor w/ 12 levels "1 - very poor quality",..: 7 6 3 8 5 2 4 11 8 7 ...
##  $ Income_PPP                  : num [1:1723] 3191 4172 478 426 887 ...

Repeat the descriptive statistics below for the female respondents (Dataset_CP_f) and compare the results with those for the male respondents.

10.2 Frequency Table

Run a frequency table for the number of hours per week that respondents spend caring for children and/or grandchildren (variable Q37a_HoursWeekChildren). For each sex separately, report the mean and standard deviation.

10.2.1 Male

Calculate the descriptive statistics for the variable Q37a_HoursWeekChildren, using the function describe of the package psych:

psych::describe(Dataset_CP_m$Q37a_HoursWeekChildren)
##    vars   n  mean    sd median trimmed   mad min max range
## X1    1 575 16.86 15.53     10   14.36 10.38   1 100    99
##    skew kurtosis   se
## X1 1.83     4.65 0.65

Notice that we apply the describe function directly to the variable Q37a_HoursWeekChildren, which we access with the dollar sign operator, $, as follows: name_dataset$name_variable. Some functions let you pass the dataset with the pipe operator and then name the variable, while others, as in this case, need the variable itself. Notice also how we call describe: we use the :: operator, which tells R to look for a function within a specific package, as follows:

name_package::name_function

You could also call the function using only its name; R will then search for that function across all loaded packages. However, if several packages contain functions with the same name, specifying the package with the :: operator ensures that the correct function is used.

There are 575 non-missing observations for Q37a_HoursWeekChildren. psych::describe ignores the missing observations.
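As a minimal sketch of both points, the toy vector below (made-up values, not taken from the survey) shows a package-qualified call and how missing values change the result. It only uses the stats package that ships with R; the names hours and n_valid are chosen for illustration.

```r
# Toy data (hypothetical values): weekly hours with two missing answers
hours <- c(10, 5, NA, 24, 40, NA, 8)

# package::function calls a function from one specific package
stats::median(hours)                 # NA, because missing values are present
stats::median(hours, na.rm = TRUE)   # median of the five valid values

# Number of valid (non-missing) observations, comparable to n in describe()
n_valid <- sum(!is.na(hours))
n_valid
```

Unlike psych::describe, most base R functions return NA unless you explicitly ask them to drop the missing values.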

Report the Frequency Table for the variable Q37a_HoursWeekChildren, using the function tbl_summary of the package gtsummary:

Dataset_CP_m %>%
  select(Q37a_HoursWeekChildren) %>%
  tbl_summary(
    statistic = list(all_categorical() ~ "{n} ({p}%)"),  # Show count & percentage
    missing = "no"  # Exclude missing values (set to "ifany" to include)
  ) %>%
  modify_header(label = "**Hours per Week with Children**")  # Customize header
Hours per Week with Children    N = 1,723¹
Q37a_HoursWeekChildren          10 (5, 24)
¹ Median (Q1, Q3)

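The header of the table shows N = 1,723. This is the total number of male respondents in Dataset_CP_m (all rows), not the number of non-missing observations for this variable. Because we set missing = "no", the 1,148 missing values are not listed, and the median and quartiles are computed from the 575 valid observations only.

In your report, you need to explain why the Ns from psych::describe (n = 575, the valid observations) and gtsummary (N = 1,723, all male respondents) differ.

The median is 10 hours per week. The values in parentheses, (5, 24), are the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile); together they mark the interquartile range. This means that the middle 50% of the responses lie between 5 and 24 hours per week.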

When interpreting:

“The median number of hours per week that male respondents spend with children is 10, with half of the respondents reporting values between 5 and 24 hours. These figures are based on the 575 valid answers among the 1,723 male respondents.”
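The quartiles and the two different Ns can be reproduced by hand. The sketch below uses a small made-up vector (not the survey data) and base R only; n_total and n_valid are illustrative names.

```r
# Toy data (hypothetical values), including two missing answers
hours <- c(2, 5, 5, 10, 10, 12, 24, 30, NA, NA)

n_total <- length(hours)         # all rows, like the N in the table header
n_valid <- sum(!is.na(hours))    # rows with an answer, like n in describe()

# Quartiles: Q1 (25%), median (50%) and Q3 (75%)
q <- quantile(hours, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)
q

# Interquartile range = Q3 - Q1
IQR(hours, na.rm = TRUE)
```

Here n_total counts every respondent, while the quartiles are computed from the valid answers only, which mirrors the difference between the two tables above.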

Moreover, in R there are other ways to access specific elements within a dataset. Datasets are organized in rows (which represent our observations) and columns (which represent our variables). We can select a specific cell of a dataset as follows: name_dataset[row, column], where row is the number(s) of row(s) we are interested in and column is either the number or the name of the column we are interested in. For instance, if we want to select the value of the variable Q37a_HoursWeekChildren for the third observation, we can use the following command

Dataset_CP[3, "Q37a_HoursWeekChildren"]

If we do not specify a value for the rows (or columns), we select all the rows (or columns) of the specified columns (or rows). For instance, the following command selects all the observations for the variable Q37a_HoursWeekChildren

Dataset_CP[ , "Q37a_HoursWeekChildren"]
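To try the bracket notation without loading the survey data, here is a small sketch with a made-up data frame (the object toy and its values are hypothetical):

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  id    = c("a", "b", "c", "d"),
  hours = c(10, NA, 24, 5)
)

toy[3, "hours"]    # value in row 3 of the column "hours"
toy[ , "hours"]    # all rows of the column "hours"
toy$hours          # the same column with the dollar sign operator
toy[1:2, ]         # rows 1 and 2, all columns
```

Note that Dataset_CP is stored as a tibble. With a tibble, the bracket commands return a small tibble rather than a bare vector, whereas $ always returns the vector itself.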

10.3 Histogram

Plot the histogram for the variable Q37a_HoursWeekChildren:

ggplot(data = Dataset_CP_m, aes(x = Q37a_HoursWeekChildren)) +      
  geom_histogram(bins = 50, 
                 color = "darkred", 
                 fill = "pink") +
  theme_minimal()+   
  labs(title = "Histogram: Hours per week spent taking care of children", subtitle = "Only male respondents")+ 
  xlab("Hours per week spent taking care of children (Question 37a)") +   
  ylab("Count") +      
  theme(plot.title = element_text(color = "darkred", 
                                  size = 14, 
                                  face = "italic"), 
        plot.subtitle = element_text(color = "darkred", size = 12, face = "italic"))

  • Depending on the data, different numbers of bins are more or less appropriate. In this case, we indicate that we want our data to be split into 50 intervals. You could choose a different number, or leave the bins option out of the code and let ggplot2 pick it (it then uses 30 bins by default and prints a message, not an error, suggesting that you choose a better value). Each type of graph has its own set of options; use the help function to learn more about them.

  • Next, we call a pre-defined layout of the graph called theme_minimal. There are many pre-defined layouts and you are free to choose whichever you like the most, or even create your own layout (which is partially what we do with the last layer).

  • In the fourth layer we use labs to define the title and the subtitle of the graph; in the next two layers, xlab and ylab name the axes. There are more options that you can define within labs; use help(labs) to find out what these are.

  • Finally, we edit the appearance of the title and subtitle using theme with the elements plot.title = element_text() and plot.subtitle = element_text(), where we specify their color, size and face. This changes the format of the title and subtitle, but it is not always necessary. If you want to find out more options that you can specify within theme, you can use help("theme") or search online.
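Instead of fixing the number of bins, you can fix their width. The sketch below uses made-up values (not the survey data) and the binwidth and boundary arguments of geom_histogram; the object names toy and p are illustrative.

```r
library(ggplot2)

# Toy data (hypothetical values)
toy <- data.frame(
  hours = c(1, 2, 4, 5, 5, 8, 10, 10, 12, 15, 20, 24, 30, 40, 60)
)

# binwidth = 5 makes every bar cover 5 hours; boundary = 0 starts the bins at 0
p <- ggplot(toy, aes(x = hours)) +
  geom_histogram(binwidth = 5, boundary = 0,
                 color = "darkred", fill = "pink") +
  theme_minimal() +
  labs(title = "Toy histogram", x = "Hours per week", y = "Count")

p
```

A fixed width is often easier to explain than a fixed number of bins, because each bar then has a direct meaning (here, a 5-hour interval).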

10.4 Summary Table

To summarize, we can build a table showing the mean and the standard deviation for the variable Q37a_HoursWeekChildren for the two groups, separately. To do so, we use two features:

First, in the menu bar of your Quarto document (on top), we click on “Table” and then select “Insert Table”. A new window will appear and ask us the number of rows and columns our table should include. Once the table has been created, we can fill in its cells with whatever we wish to present in the table.

Second, we use inline code, which allows us to embed the output of R code directly in the text (you can find more information in the Quarto documentation on inline code).

A chunk of code, as we have seen so far, is separated from the text of our Quarto document, so it does not allow us to combine words and R output in one continuous sentence. So, if we want to say:

The number of male respondents in our dataset is equal to

Dataset_CP_m %>% summarize(n = n())
## # A tibble: 1 × 1
##       n
##   <int>
## 1  1723

We can clearly see that this is not ideal when using a chunk of code.

To better integrate this into our text we can use inline code. To do so, in the menu bar of our Quarto document (on top), we click on “</>”, then we type r, add a space, and write the line of code whose output we would like to include in the text. In our previous example, the inline code would be “r Dataset_CP_m %>% summarize(n = n())”, which renders as follows:

The number of male respondents in our dataset is equal to 1723

This could be useful, for instance, if we wish to present some properties of the data without us manually copy-pasting the numbers from the console to the text. At the same time, we can use this to populate our table, as we indeed do below.
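In the source of the Quarto document, inline code is written between backticks and starts with r. The lines below are a sketch of what you might type; they use nrow() and round(), which are not part of the example above, because each returns a single number that reads well inside a sentence.

```markdown
The number of male respondents in our dataset is equal to `r nrow(Dataset_CP_m)`.

On average, they spend `r round(mean(Dataset_CP_m$Q37a_HoursWeekChildren, na.rm = TRUE), 2)` hours per week caring for children.
```

When the document is rendered, each backtick expression is replaced by its result.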

Note:

In a Quarto document, you can run the code in a chunk with the green play button. This allows you to execute that specific command and see its output. Unfortunately, inline code does not have the same option. Inline code is executed only when the Quarto document is rendered. This means that you cannot see the result of inline code before you render your Quarto document, nor can you verify whether the code is correct (and if it is not, the rendering will fail).

To overcome this issue, you can copy-paste your inline code into the Console and run it from there. This will allow you to verify whether the code is correct and what its output is.

We have created here a table summarizing some of the descriptive statistics for the male respondents in our dataset. Take a look at this table:

Statistic        Male          Code
N_total          1723          Dataset_CP_m %>% summarize(n = n())
N_missing        1148          Dataset_CP_m %>% summarize(n_na = sum(is.na(Q37a_HoursWeekChildren)))
N_valid          575           Dataset_CP_m %>% summarize(n_valid = sum(!is.na(Q37a_HoursWeekChildren)))
Mean             16.8591304    Dataset_CP_m %>% summarize(mean = mean(Q37a_HoursWeekChildren, na.rm = TRUE))
Median           10            Dataset_CP_m %>% summarize(median = median(Q37a_HoursWeekChildren, na.rm = TRUE))
Std. Deviation   15.5294252    Dataset_CP_m %>% summarize(sd = sd(Q37a_HoursWeekChildren, na.rm = TRUE))

Notice how we use the pipe operator on the dataset to compute the descriptive statistics with the summarize function and

  • the n function to count the total number of observations;

  • the is.na function, wrapped in sum, to count the observations with missing values;

  • the ! operator (logical negation) combined with is.na, again wrapped in sum, to count the observations that are not missing, i.e., the observations with valid values;

  • the mean function to calculate the mean, with the na.rm option set to TRUE so that R drops the missing values before computing the mean (otherwise the result would be NA);

  • the median function to calculate the median, with the na.rm option set to TRUE for the same reason;

  • the sd function to calculate the standard deviation, again with the na.rm option set to TRUE.
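The same functions can be combined in a single summarize call, and group_by computes them for each sex at once. The sketch below assumes a dataset that still contains both groups; it uses a small made-up tibble (toy, with hypothetical values) rather than the survey data.

```r
library(dplyr)

# Toy data (hypothetical values, not the survey data)
toy <- tibble(
  sex   = c("Male", "Male", "Male", "Female", "Female", "Female"),
  hours = c(10, NA, 20, 30, 40, NA)
)

res <- toy %>%
  group_by(sex) %>%
  summarize(
    n_total   = n(),                        # all observations
    n_missing = sum(is.na(hours)),          # missing values
    n_valid   = sum(!is.na(hours)),         # valid values
    mean      = mean(hours, na.rm = TRUE),
    median    = median(hours, na.rm = TRUE),
    sd        = sd(hours, na.rm = TRUE)
  )

res
```

Each row of res describes one group, so the male and female statistics can be read side by side.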

10.4.1 Summary Table using Stargazer

We can create a descriptive statistics table for the respondents that includes the mean, median, SD, minimum and maximum.

We do this by combining the psych package with stargazer. We first convert the result to a plain data frame, since stargazer can’t handle tibbles. The out = argument saves the table to a file, so that we can open it later and copy-paste it.

desc <- psych::describe(Dataset_CP_m %>%
                         select(Q37a_HoursWeekChildren))

# We don't need all statistics, so we keep only five of them.
# as.data.frame() returns a plain data frame; its row names hold the
# variable name, which stargazer prints as the first column.
desc_table <- desc %>%
  select(mean, median, sd, min, max) %>%
  as.data.frame()


stargazer(desc_table, 
          type = "text", 
          summary = FALSE,        # Don't summarize again
          title = "Descriptive Statistics: Selected Variable (Male Respondents)",
          out="Desc_Stats_HWC.txt")
## 
## Descriptive Statistics: Selected Variable (Male Respondents)
## ===================================================
##                         mean  median   sd   min max
## ---------------------------------------------------
## Q37a_HoursWeekChildren 16.859   10   15.529  1  100
## ---------------------------------------------------

With type = "text", the table is printed and saved as plain text. You can also produce an HTML table by changing type = "text" to type = "html" (and giving the file named in out a .html extension).

10.4.2 Summary Table with Grouping and Stargazer

Say you want to see Q37a_HoursWeekChildren by paid job status (Q1_PaidJob). We use the describeBy function, which is similar to describe() but computes the statistics separately for each level of a grouping factor.

desc_grp<-psych::describeBy(Dataset_CP_m$Q37a_HoursWeekChildren, 
                  group = Dataset_CP_m$Q1_PaidJob, 
                  mat = TRUE)  # mat=TRUE gives a nice combined data frame which we use for stargazer
df_grp <- as.data.frame(desc_grp)

stargazer(df_grp, type = "text",
          summary = FALSE, 
          title = "Hours with Children by Paid Job Status",
          out="HWCbyPJS.txt")
## 
## Hours with Children by Paid Job Status
## ====================================================================================================
##     item   group1   vars  n   mean    sd   median trimmed  mad   min max  range skew  kurtosis  se  
## ----------------------------------------------------------------------------------------------------
## X11  1      Yes      1   189 16.566 17.453   10   13.405  10.378  1  100   99   1.738  3.175   1.269
## X12  2       No      1   14  10.714 10.845 6.500   9.500  6.672   1   35   34   1.063  -0.263  2.898
## X13  3   Don't Know  1    0                                      Inf -Inf -Inf                      
## X14  4    Refusal    1    1    40            40     40      0    40   40    0                       
## ----------------------------------------------------------------------------------------------------
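The Inf and -Inf entries in the “Don't Know” row appear because that group contains no valid observations. The base R sketch below (with made-up values; all object names are illustrative) shows where these values come from and one possible way to drop empty groups before describing the data.

```r
# A group without valid observations behaves like an empty vector
empty <- numeric(0)

# min() and max() of nothing return Inf and -Inf (R also issues a warning)
lo <- suppressWarnings(min(empty))
hi <- suppressWarnings(max(empty))
c(min = lo, max = hi, range = hi - lo)

# One way to avoid such rows: keep only observed cases and drop unused levels
hours <- c(10, 5, 40, NA)
job   <- factor(c("Yes", "No", "Refusal", "Don't Know"),
                levels = c("Yes", "No", "Don't Know", "Refusal"))
keep  <- !is.na(hours)
table(droplevels(job[keep]))
```

Rows like these are not errors in the data; they simply signal that a category has nothing to summarize.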

10.5 Control Variables

10.5.1 What is a Control Variable?

A control variable is a variable that is included in an analysis to account for alternative explanations of a relationship between an explanatory variable and an outcome variable. Control variables are chosen based on theory and prior knowledge.

Suppose we are interested in the relationship between:

  • Sex

  • Hours spent caring for children per week

If other characteristics differ systematically between men and women, and those characteristics also affect caregiving time, then the differences we observe may not be attributable to sex alone. Control variables help us isolate the relationship of interest. There are several potential control variables, but for now we use HH1_Num_People, because care tasks may be shared in larger households.

10.6 Confounding Variables

10.6.1 What is a Confounding Variable?

A confounder is a variable that is associated with both:

  1. The explanatory variable and
  2. The outcome variable

When a confounder is omitted, the estimated relationship between the explanatory variable and the outcome may be biased or misleading.

In our example,

  • Explanatory Variable: Sex

  • Outcome variable: Hours caring for children

  • Confounder: Hours worked per week (Q7_HoursWeekWork)

We consider Q7_HoursWeekWork a confounder because it meets both criteria for confounding:

  1. Men and women differ, on average, in hours worked per week.
  2. Individuals who work more hours typically have less time available for childcare.

As a result, observed differences in caregiving time between men and women may partly reflect differences in labor supply rather than caregiving preferences or norms.

Descriptive statistics allow us to detect potential confounding.
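Both criteria can be checked with simple descriptive tools. The sketch below uses made-up values (not the survey data) and base R only; the column names work_hours and care_hours are hypothetical stand-ins for Q7_HoursWeekWork and Q37a_HoursWeekChildren.

```r
# Toy data (hypothetical values, not the survey data)
toy <- data.frame(
  sex        = c("Male", "Male", "Male", "Female", "Female", "Female"),
  work_hours = c(45, 50, 40, 30, 35, 40),
  care_hours = c(10, 5, 15, 30, 25, 20)
)

# Criterion 1: does the potential confounder differ by sex?
work_by_sex <- tapply(toy$work_hours, toy$sex, mean)
work_by_sex

# Criterion 2: is the potential confounder related to the outcome?
r <- cor(toy$work_hours, toy$care_hours)
r
```

A clear difference in group means together with a non-zero correlation suggests, but does not prove, that the variable may confound the comparison.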

10.7 Difference between Control Variables and Confounding Variables

Control variable: HH1_Num_People

  • What it does / why we care: Extra factor that only affects the outcome. We include it to be fair.

  • Interpretation example: We check caregiving hours while making sure household size isn’t confusing the result.

Confounding variable: Q7_HoursWeekWork

  • What it does / why we care: A factor that affects both the explanatory variable and the outcome.

  • Interpretation example: Men and women might work different hours, and the more hours someone works, the less time they have for caregiving.
10.8 Descriptive Statistics of Control and Confounding Variables

In this section, we compute the descriptive statistics of our outcome, control and confounding variables, separately for male and female respondents:

a. Create our data frame

#Start with males
df_m <- Dataset_CP_m %>%
  select(Q37a_HoursWeekChildren, HH1_Num_People, Q7_HoursWeekWork) %>%
  na.omit() %>%
  as.data.frame()

# Females
df_f <- Dataset_CP_f %>%
  select(Q37a_HoursWeekChildren, HH1_Num_People, Q7_HoursWeekWork) %>%
  na.omit() %>%
  as.data.frame()
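Note that na.omit() removes every row that has a missing value in at least one of the selected variables. The sketch below illustrates this with a made-up data frame (hypothetical values and column names):

```r
# Toy data (hypothetical values): each row is one respondent
toy <- data.frame(
  care_hours = c(10, NA, 20, 30),
  hh_size    = c(2, 3, 4, NA),
  work_hours = c(40, 35, NA, 45)
)

nrow(toy)                   # 4 respondents in total
colSums(!is.na(toy))        # valid answers per variable
complete <- na.omit(toy)    # drops every row with at least one NA
nrow(complete)              # only 1 respondent answered all three questions
```

This is most likely why the male mean of Q37a_HoursWeekChildren in the combined table of this section (17.299) differs from the 16.859 reported earlier: the combined table only includes respondents with valid answers on all three variables.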

b. Use psych::describe

dsc_male <- psych::describe(df_m)
dsc_female <- psych::describe(df_f)

c. Combine both descriptive statistics results from psych::describe

both_dsc <- data.frame(
  Variable = rownames(dsc_male)[1:3],
  Mean_Male = dsc_male$mean[1:3],
  Mean_Female = dsc_female$mean[1:3],
  Median_Male = dsc_male$median[1:3],
  Median_Female = dsc_female$median[1:3],
  SD_Male = dsc_male$sd[1:3],
  SD_Female = dsc_female$sd[1:3],
  Min_Male = dsc_male$min[1:3],
  Min_Female = dsc_female$min[1:3],
  Max_Male = dsc_male$max[1:3],
  Max_Female = dsc_female$max[1:3]
)

We create a data frame whose first column, Variable = rownames(dsc_male)[1:3], holds the names of the three variables. We take the names only once, from the male results, because they are identical for the female results. dsc_male$mean[1:3] extracts the mean of each of the three variables, and the same pattern is used for the median, SD, minimum and maximum. For the females, we repeat exactly what we did for the males, using dsc_female.

d. Use stargazer for pretty tables

stargazer(
  both_dsc,
  type = "text",
  summary = FALSE,
  title = "Descriptive Statistics: Outcome, Control, and Confounding by Sex",
  out="DS_OCCbySex.txt"
)
## 
## Descriptive Statistics: Outcome, Control, and Confounding by Sex
## ==================================================================================================================================
##          Variable        Mean_Male Mean_Female Median_Male Median_Female SD_Male SD_Female Min_Male Min_Female Max_Male Max_Female
## ----------------------------------------------------------------------------------------------------------------------------------
## 1 Q37a_HoursWeekChildren  17.299     31.748        14           25       14.636   31.349      1         1        100       168    
## 2     HH1_Num_People       3.622      3.265         4            3        1.307    1.147      1         1         9         7     
## 3    Q7_HoursWeekWork     44.414     35.646        40           40       11.186   11.087      8         5         90        80    
## ----------------------------------------------------------------------------------------------------------------------------------

e. Optional: use kableExtra for even prettier tables

both_dsc %>% 
  kable("html", escape = FALSE, align = "c") %>%
  add_header_above(c(" " = 1, "Mean" = 2, "Median" = 2, "SD" = 2, "Min" = 2, "Max" = 2)) %>%
  kable_styling(full_width = FALSE, position = "center", font_size = 12, bootstrap_options = c("striped", "hover")) %>%
  scroll_box(width = "100%", height = "300px")
                        Mean                    Median                      SD                    Min                   Max
Variable                Mean_Male  Mean_Female  Median_Male  Median_Female  SD_Male    SD_Female  Min_Male  Min_Female  Max_Male  Max_Female
Q37a_HoursWeekChildren  17.298630  31.747826    14           25             14.635916  31.34858   1         1           100       168
HH1_Num_People          3.621918   3.265217     4            3              1.307023   1.14679    1         1           9         7
Q7_HoursWeekWork        44.413699  35.645652    40           40             11.186301  11.08683   8         5           90        80
# Saving the kableExtra table
library(kableExtra)
library(htmltools)  # provides save_html(); install the package first if needed

tbl <- both_dsc %>% 
  kable("html", escape = FALSE, align = "c") %>%
  add_header_above(c(" " = 1, "Mean" = 2, "Median" = 2, "SD" = 2, "Min" = 2, "Max" = 2)) %>%
  kable_styling(full_width = FALSE, position = "center")


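save_html(tbl, "DSOCCbySex.html")
# Alternatively, kableExtra's own save_kable(tbl, "DSOCCbySex.html") also writes the table to an HTML file.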

In kable("html", escape = FALSE, align = "c"), "html" requests an HTML table, and escape = FALSE leaves special characters in the cells unescaped, so that any HTML they contain is displayed correctly. align = "c" centers the columns; you can change it to "r" or "l". add_header_above() creates the multi-column header shown in the table: " " = 1 leaves the header above the first column empty, while each of the other headers spans 2 columns, one for Male and one for Female. With full_width = FALSE and position = "center", the table is not stretched across the page but is centered. scroll_box() lets us scroll when the table has many columns. You can copy-paste the table after rendering your Quarto document.

10.8.0.1 Results:

  1. Q37a_HoursWeekChildren

    • Females spend more time on caregiving than males;

    • Women, on average, dedicate around twice as many hours to childcare as men in this sample.

  2. HH1_Num_People (Control Variable)

    • Household size can affect caregiving time but doesn’t vary by sex in a way that confuses results.

    • Males have slightly larger households, but the difference is small.

    • When comparing caregiving hours between men and women, we account for household size so that differences aren’t simply because one sex lives in larger households.

  3. Q7_HoursWeekWork (Confounding Variable)

    • Work hours affect caregiving time and vary by sex.

    • Men work more hours per week on average (44.4 vs. 35.6), but the medians are equal (40 hours) and the standard deviations (about 11 hours) show considerable overlap, meaning many men and women work similar hours.

    • Because work hours affect caregiving and differ by sex, they could confound the relationship between sex and childcare. Including work hours in the analysis helps isolate the relationship between sex and caregiving.
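The size of these differences can be worked out directly from the means in the stargazer table of this section. The short calculation below simply copies those values; the object names are illustrative.

```r
# Means copied from the stargazer table in this section
mean_care_m <- 17.299
mean_care_f <- 31.748
mean_work_m <- 44.414
mean_work_f <- 35.646

care_ratio <- mean_care_f / mean_care_m   # female-to-male ratio of childcare hours
work_diff  <- mean_work_m - mean_work_f   # male minus female weekly work hours

round(care_ratio, 2)
round(work_diff, 1)
```

A ratio close to 2 supports the statement that women report roughly twice as many childcare hours as men in this sample.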

10.9 Descriptive Statistics of Categorical/Character Variables

Some variables, such as HH2d_EmploymentSituation, are categorical (stored as factors or characters) rather than numeric. To find out more about such a variable, we use gtsummary. You could transform these variables to numeric, but a frequency table is also a good way to see how many respondents fall into each category.

Dataset_CP_m %>%
  select(HH2d_EmploymentSituation) %>%
  tbl_summary()
Characteristic N = 1,723¹
HH2d_EmploymentSituation
    At work as employee or employer/self-employed 870 (50%)
    Employed, on child-care leave or other leave 6 (0.3%)
    At work as relative assisting on family farm or business 13 (0.8%)
    Unemployed less than 12 months 66 (3.8%)
    Unemployed 12 months or more 90 (5.2%)
    Unable to work due to long-term illness or disability 46 (2.7%)
    Retired 520 (30%)
    Full time homemaker/ responsible for ordinary shopping and looking after the home 5 (0.3%)
    In education (at school, university, etc.) / student 90 (5.2%)
    Other 17 (1.0%)
¹ n (%)
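For a quick check in the Console, the same kind of frequency table can be produced with base R. The sketch below uses a small made-up factor (status, with hypothetical values) instead of the survey variable.

```r
# Toy factor (hypothetical values), including one missing answer
status <- factor(c("Employed", "Retired", "Employed", "Student",
                   "Employed", "Retired", NA))

counts <- table(status, useNA = "ifany")    # frequencies, including missing
counts

pct <- round(100 * prop.table(table(status)), 1)  # percentages of valid answers
pct
```

table() counts the respondents in each category and prop.table() turns those counts into shares, which is what the n (%) column reports.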

10.10 Importance of Descriptive Statistics:

Descriptive statistics serve as tools to summarize and organize data into meaningful insights, making it easier to understand and communicate findings. The following are the key aspects of descriptive statistics:

10.10.0.1 Definition

Descriptive statistics involve summarizing, organizing, and simplifying data in a way that is understandable and interpretable. This can include numerical summaries, tables, and graphical representations to describe the distribution, central tendency, and variability of data.

10.10.1 Key Components of Descriptive Statistics

  1. Measures of Central Tendency:

    • Mean: The average value, useful for symmetric distributions without outliers.

    • Median: The middle value in the data, robust to skewness and outliers.

    • Mode: The most frequent value, helpful for categorical or multimodal distributions.

  2. Measures of Variability:

    • Range: Difference between the largest and smallest values. Shows how wide the data can go.
    • Standard Deviation (SD): Shows how close most values are to the average. A small SD → most numbers are similar; a large SD → numbers vary a lot.
    • Interquartile Range (IQR): Shows the spread of the middle 50% of the data. Focuses on the “typical” range, ignoring extreme values.
    • Standard Error (SE): Shows how precise the mean estimate is; smaller SE → mean is more reliable.
  3. Shape of the Data

    • Skewness: Measures asymmetry. Positive –> long right tail; Negative –> long left tail

    • Kurtosis: Measures heaviness of tails. High –> heavier tails, more extreme values; low –> lighter tails, fewer extreme values

  4. Data Distribution:

    • Histograms and frequency distributions help identify the shape (e.g., symmetric, skewed, bimodal) and any potential outliers in the dataset.
  5. Data Representation:

    • Graphical methods like bar charts, line graphs, pie charts, and boxplots make patterns and trends in the data more visible and easier to interpret.
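Most of the measures listed above are available in base R. The sketch below computes them for a small made-up vector (hypothetical values). The skewness line uses a simple moment-based formula, which is an assumption here and may differ slightly from the small-sample formula that psych::describe applies.

```r
# Toy data (hypothetical values) with one large value
x <- c(1, 2, 2, 3, 4, 5, 6, 9, 40)

mean(x)                                # mean
median(x)                              # median
mode_x <- names(which.max(table(x)))   # mode: the most frequent value
mode_x
diff(range(x))                         # range = max - min
sd(x)                                  # standard deviation
IQR(x)                                 # interquartile range
sd(x) / sqrt(length(x))                # standard error of the mean

# Moment-based skewness; positive values indicate a long right tail
skew <- mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5
skew
```

The single large value pulls the mean above the median and produces a positive skewness.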

10.10.2 When to Use Mean or Median

  • The mean is preferred when the data is symmetrically distributed without extreme values.

  • The median is better for skewed distributions or datasets with significant outliers, as it is not affected by extreme values.

  • Example: In income data, where a few individuals earn significantly more than the majority, the median provides a more realistic “typical” income.
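The income example can be made concrete with a few made-up numbers (hypothetical values, not taken from Income_PPP):

```r
# Toy monthly incomes (hypothetical values); the last person is an extreme earner
income <- c(1800, 2000, 2100, 2300, 2500, 2600, 50000)

mean(income)      # pulled upward by the single extreme value
median(income)    # stays at a typical income

# Without the extreme value, the two measures are close again
mean(income[-7])
median(income[-7])
```

One extreme earner is enough to move the mean far away from what most people earn, while the median barely changes.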

10.10.3 How Descriptive Statistics Help in Regression

  1. Choosing Variables to Include:

    By looking at descriptive stats, we get a first indication of which variables are related to our outcome and which might mix up relationships.

    • Control variables are factors that only affect the outcome.

      Example: Household size might affect caregiving hours but is not our main focus.

    • Confounding variables affect both the explanatory variable and the outcome.

      Example: Hours worked per week can differ by sex and affect caregiving, so we need to account for it.

  2. Preparing for Regression

    • Descriptive stats let us check for outliers, skewed data, or missing values before modeling.

    • They help us understand the scale of variables and whether transformations might be needed.

    • They show if control and confounding variables need special attention in our regression models.
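The checks in this list can be run with a few base R commands before fitting any model. The sketch below uses a made-up data frame (hypothetical values and column names); boxplot.stats flags values outside 1.5 times the interquartile range beyond the hinges, which is one common rule of thumb rather than a definitive outlier test.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  care_hours = c(5, 10, 12, 8, 168, NA),
  work_hours = c(40, 38, NA, 45, 20, 40)
)

n_missing <- colSums(is.na(toy))    # missing values per variable
n_missing

summary(toy$care_hours)             # scale and extremes at a glance

flagged <- boxplot.stats(toy$care_hours)$out   # values flagged by the boxplot rule
flagged
```

A flagged value is not necessarily wrong, but it deserves a second look before it enters a regression.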

Takeaway: Descriptive statistics are the first step in regression analysis.

10.10.4 Limitations

  • Descriptive statistics only describe the data at hand; they do not infer or predict trends in the population.

  • Without considering measures of variability (e.g., SD or IQR), central tendency measures (mean, median) can be misleading.

  • They cannot fully reveal causal relationships; confounding can still bias conclusions if not properly addressed.

10.11 Closing

  • Do NOT clean your Environment

    • You will use the datasets for your quiz.