3  Getting an Overview

After importing datasets, we want to get an overview. Every statistical analysis starts with a description of the variables. In this session, we will see how to use tables to get an overview of the information in a dataset. We will also work with the PASS Campus File in this session. Let’s start by loading the data:

fdz_install("haven") # if not already installed 
library(haven) # data import for Stata datasets
library(tidyverse) # tidyverse
pend <- read_dta("./orig/PENDDAT_cf_W13.dta")

3.1 Frequency Counts

We have various commands available to create a frequency count:

  • table()
  • count() from {dplyr}

The simplest command for counting frequencies is the table() command. For example, with the variable statakt representing the education status of respondents:

table(pend$statakt)

 -10   -9   -5    1    2    3 
3765 3289  280 9470 6139 5481 

Here, we see the absolute frequencies displayed. The first row lists the different values, and the second row shows the frequencies.

However, the labels are ignored in the output of table(). A look into the PASS data report or using attributes() reveals the value labels:

attributes(pend$statakt)$labels
    Item not surveyed in questionnaire version 
                                           -10 
                     Item not surveyed in wave 
                                            -9 
          Cannot be generated (missing values) 
                                            -5 
In occupation with earnings >400 EUR per month 
                                             1 
                        Unemployed, registered 
                                             2 
                        Pupil/student (school) 
                                             3 
                       Apprenticeship/Studying 
                                             4 
                  Military or civilian service 
                                             5 
                  Carrying out domestic duties 
                                             6 
           Maternity protection/parental leave 
                                             7 
                    Pensioner/early retirement 
                                             8 
                                         Other 
                                             9 
         Unemployed, not registered (since W4) 
                                            10 
                Ill/unfit to work/unemployable 
                                            11 
                   Self-employed/family worker 
                                            12 

9470 respondents are employed, 5481 respondents are inactive, etc. (More on labels and working with value labels in R later.)

With count() from {dplyr}, we get the labels displayed directly. Again, we use the pipe %>%:

pend %>% count(statakt)
# A tibble: 6 × 2
  statakt                                                  n
  <dbl+lbl>                                            <int>
1 -10 [Item not surveyed in questionnaire version]      3765
2  -9 [Item not surveyed in wave]                       3289
3  -5 [Cannot be generated (missing values)]             280
4   1 [In occupation with earnings >400 EUR per month]  9470
5   2 [Unemployed, registered]                          6139
6   3 [Pupil/student (school)]                          5481

We can also store tables under a freely chosen name and call them up later:

t1 <- table(pend$statakt)
t2 <- pend %>% count(statakt)

We see here that the table with table() creates a new object form, a table. With count(), however, a data.frame is created.

class(t1)
[1] "table"
class(t2)
[1] "tbl_df"     "tbl"        "data.frame"

3.2 Missing Values in R: NA

Negative values are a bit annoying.

To mark the values like -5 as missing data in R, we need to set them to NA in pend. To do this, we call pend$statakt and filter with [] only the values for statakt equal to -1. In the previous chapter, we learned how to call specific values this way:

pend$statakt[pend$statakt == -5] # only call statakt = -5
<labelled<double>[280]>: Current main occupation, gen. (since W2)
  [1] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
 [26] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
 [51] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
 [76] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[101] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[126] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[151] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[176] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[201] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[226] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[251] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[276] -5 -5 -5 -5 -5

Labels:
 value                                          label
   -10     Item not surveyed in questionnaire version
    -9                      Item not surveyed in wave
    -5           Cannot be generated (missing values)
     1 In occupation with earnings >400 EUR per month
     2                         Unemployed, registered
     3                         Pupil/student (school)
     4                        Apprenticeship/Studying
     5                   Military or civilian service
     6                   Carrying out domestic duties
     7            Maternity protection/parental leave
     8                     Pensioner/early retirement
     9                                          Other
    10          Unemployed, not registered (since W4)
    11                 Ill/unfit to work/unemployable
    12                    Self-employed/family worker

(Here, we get the labels again, which is somewhat suboptimal for clarity.)

If we then assign a new value with <-, the called values will be overwritten - here, we overwrite all values for statakt == -1 with NA:

pend$statakt[pend$statakt == -5]  <- NA

However, we have not yet overwritten all the negative values; -10 and -9 are still missing. Of course, it would be possible this way, but it’s a bit cumbersome:

pend$statakt[pend$statakt == -9 ]  <- NA
pend$statakt[pend$statakt == -10]  <- NA

For the PASS data, it’s shorter to use < 0, because all missing codes are less than 0:1

pend$statakt[pend$statakt < 0 ]  <- NA

In count(), NA is also counted:

pend %>% count(statakt)
# A tibble: 4 × 2
  statakt                                                 n
  <dbl+lbl>                                           <int>
1  1 [In occupation with earnings >400 EUR per month]  9470
2  2 [Unemployed, registered]                          6139
3  3 [Pupil/student (school)]                          5481
4 NA                                                   7334

If we want to avoid this, we use filter() again - with is.na(), we can identify NA. By prefixing with !, we can request that all non-NA values be retained with TRUE:

pend %>% filter(!is.na(statakt)) %>% count(statakt)
# A tibble: 3 × 2
  statakt                                                n
  <dbl+lbl>                                          <int>
1 1 [In occupation with earnings >400 EUR per month]  9470
2 2 [Unemployed, registered]                          6139
3 3 [Pupil/student (school)]                          5481

3.3 Other Table Values

With the help of additional functions, we can customize the frequency table to match the Stata with tab statakt:

   Current main occupation, gen. (since |
                                    W2) |      Freq.     Percent        Cum.
----------------------------------------+-----------------------------------
In occupation with earnings >400 EUR pe |      9,470       44.90       44.90
                 Unemployed, registered |      6,139       29.11       74.01
                 Pupil/student (school) |      5,481       25.99      100.00
----------------------------------------+-----------------------------------
                                  Total |     21,090      100.00
tab1 <- pend %>% filter(!is.na(statakt)) %>% count(statakt)
  • prop.table(): relative values/percentages
tab1$pct <- prop.table(tab1$n) 
tab1
# A tibble: 3 × 3
  statakt                                                n   pct
  <dbl+lbl>                                          <int> <dbl>
1 1 [In occupation with earnings >400 EUR per month]  9470 0.449
2 2 [Unemployed, registered]                          6139 0.291
3 3 [Pupil/student (school)]                          5481 0.260

29.109% of respondents are unemployed.

  • prop.table() with cumsum(): cumulative relative frequencies
tab1$cum <- prop.table(tab1$n) %>% cumsum()
tab1
# A tibble: 3 × 4
  statakt                                                n   pct   cum
  <dbl+lbl>                                          <int> <dbl> <dbl>
1 1 [In occupation with earnings >400 EUR per month]  9470 0.449 0.449
2 2 [Unemployed, registered]                          6139 0.291 0.740
3 3 [Pupil/student (school)]                          5481 0.260 1    

74.011% of respondents are employed or unemployed (and not inactive).

3.3.1 Exercise

3.4 Contingency Tables

Contingency tables allow us to explore how frequently combinations of different variables occur together. Let’s look at two ways to create contingency tables in R.

Using count() from the {dplyr} package, we create a contingency table by inserting two variables. For instance, if we want to see the frequencies of employment status (statakt) by gender (zpsex), we can use the following command:

pend %>% count(zpsex, statakt)
# A tibble: 8 × 3
  zpsex      statakt                                                 n
  <dbl+lbl>  <dbl+lbl>                                           <int>
1 1 [Male]    1 [In occupation with earnings >400 EUR per month]  4685
2 1 [Male]    2 [Unemployed, registered]                          3240
3 1 [Male]    3 [Pupil/student (school)]                          2047
4 1 [Male]   NA                                                   3555
5 2 [Female]  1 [In occupation with earnings >400 EUR per month]  4785
6 2 [Female]  2 [Unemployed, registered]                          2899
7 2 [Female]  3 [Pupil/student (school)]                          3434
8 2 [Female] NA                                                   3779
tab2 <- pend %>% count(zpsex, statakt)
tab2$pct <- prop.table(tab2$n)
tab2
# A tibble: 8 × 4
  zpsex      statakt                                                 n    pct
  <dbl+lbl>  <dbl+lbl>                                           <int>  <dbl>
1 1 [Male]    1 [In occupation with earnings >400 EUR per month]  4685 0.165 
2 1 [Male]    2 [Unemployed, registered]                          3240 0.114 
3 1 [Male]    3 [Pupil/student (school)]                          2047 0.0720
4 1 [Male]   NA                                                   3555 0.125 
5 2 [Female]  1 [In occupation with earnings >400 EUR per month]  4785 0.168 
6 2 [Female]  2 [Unemployed, registered]                          2899 0.102 
7 2 [Female]  3 [Pupil/student (school)]                          3434 0.121 
8 2 [Female] NA                                                   3779 0.133 
tab2 %>% mutate(pct_zpsex= prop.table(n), .by = zpsex)
# A tibble: 8 × 5
  zpsex      statakt                                          n    pct pct_zpsex
  <dbl+lbl>  <dbl+lbl>                                    <int>  <dbl>     <dbl>
1 1 [Male]    1 [In occupation with earnings >400 EUR pe…  4685 0.165      0.346
2 1 [Male]    2 [Unemployed, registered]                   3240 0.114      0.240
3 1 [Male]    3 [Pupil/student (school)]                   2047 0.0720     0.151
4 1 [Male]   NA                                            3555 0.125      0.263
5 2 [Female]  1 [In occupation with earnings >400 EUR pe…  4785 0.168      0.321
6 2 [Female]  2 [Unemployed, registered]                   2899 0.102      0.195
7 2 [Female]  3 [Pupil/student (school)]                   3434 0.121      0.231
8 2 [Female] NA                                            3779 0.133      0.254

3.4.1 Exercise

3.5 Summary Statistics

For numerical variables, such as income (netges), we often compute summary statistics like the mean, median, or quantiles. To get a quick overview, use summary():

summary(pend$netges)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    -5.0     -3.0     -3.0    567.9    990.0 111419.0 
pend$netges[pend$netges < 0] <- NA # Handling Missing Data
summary(pend$netges)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0     880    1320    1562    1890  111419   18056 

3.5.1 Calculating Specific Statistics

To calculate specific statistics, we can use:

  • Minimum: min()
  • Maximum: max()
  • Mean: mean()
  • Median: median()
  • Quantiles: quantile()
  • Variance: var()
  • Standard Deviation: sd()

For instance, the mean of income. Setting na.rm = TRUE forces R to ignore missing values:

mean(pend$netges)
[1] NA
mean(pend$netges, na.rm = TRUE)
[1] 1562.3

3.5.2 Custom Summary with summarise()

You can use summarise() from {dplyr} to create custom summary tables:

pend %>%
  summarise(
    Minimum = min(netges, na.rm = TRUE),
    Median = median(netges, na.rm = TRUE),
    Mean = mean(netges, na.rm = TRUE),
    Maximum = max(netges, na.rm = TRUE)
  )
# A tibble: 1 × 4
  Minimum   Median  Mean Maximum  
  <dbl+lbl>  <dbl> <dbl> <dbl+lbl>
1 0           1320 1562. 111419   

3.5.3 Comparing Across Groups

To compare statistics across groups, use .by in summarise():

pend %>%
  summarise(
    Minimum = min(netges, na.rm = TRUE),
    Median = median(netges, na.rm = TRUE),
    Mean = mean(netges, na.rm = TRUE),
    Maximum = max(netges, na.rm = TRUE),
    .by = welle
  ) %>% arrange(welle)
# A tibble: 13 × 5
   welle                   Minimum   Median  Mean Maximum  
   <dbl+lbl>               <dbl+lbl>  <dbl> <dbl> <dbl+lbl>
 1  1 [Wave 1 (2006/2007)] 1          1200  1525. 111419   
 2  2 [Wave 2 (2007/2008)] 0          1320  1529.   7200   
 3  3 [Wave 3 (2008/2009)] 0          1298. 1498.  12000   
 4  4 [Wave 4 (2010)]      0          1210  1447.  10800   
 5  5 [Wave 5 (2011)]      0          1250  1494.  33363   
 6  6 [Wave 6 (2012)]      0          1215  1459.  15950   
 7  7 [Wave 7 (2013)]      0          1250  1539.  87835   
 8  8 [Wave 8 (2014)]      0          1255  1456.   9000   
 9  9 [Wave 9 (2015)]      0          1280  1613. 110451   
10 10 [Wave 10 (2016)]     0          1375  1541.   6300   
11 11 [Wave 11 (2017)]     0          1500  1748.  44440   
12 12 [Wave 12 (2018)]     0          1500  1667.   7150   
13 13 [Wave 13 (2019)]     0          1550  1816.  88453   

Given that the resulting ‘table’ is a data.frame, we can also filter for specific waves if needed:

pend %>% 
  summarise(
    Minimum = min(netges, na.rm = TRUE),
    Median = median(netges, na.rm = TRUE),
    Mean = mean(netges, na.rm = TRUE),
    Maximum = max(netges, na.rm = TRUE),
    .by = welle
  ) %>% 
  filter(welle %in% c(1, 10)) 
# A tibble: 2 × 5
  welle                   Minimum   Median  Mean Maximum  
  <dbl+lbl>               <dbl+lbl>  <dbl> <dbl> <dbl+lbl>
1  1 [Wave 1 (2006/2007)] 1           1200 1525. 111419   
2 10 [Wave 10 (2016)]     0           1375 1541.   6300   

These methods allow for thorough analysis of both categorical and numerical data in R.

3.5.4 Exercise

3.6 Exercises

Use the PASS CampusFile PENDDAT_cf_W13.dta for all exercises:

library(haven)
pend <- read_dta("./orig/PENDDAT_cf_W13.dta")

As a reminder: you can find an overview of the data import commands here

3.6.1 Exercise 1

We are interested in the variable famstand, which contains the marital status of the respondents:

Marital status, gen. Marital status, gen.
-8 Implausible value
-4 Question mistakenly not asked
-3 Not applicable (filter)
-2 No answer
1 Single
2 Married/civil partnership, living together
3 Married/civil partnership, not living together
4 Divorced
5 Widowed
  • Display a table with absolute frequencies of famstand using both table() and count() (Remember to load {tidyverse} for count()).
  • Overwrite missing codes with NA.
  • Did the replacement of missing values with NA work? Create the table again.
  • Display the relative frequencies (proportions). Use prop.table()

Back to top

3.6.2 Exercise 2

  • Create a contingency table for famstand and zpsex using count().
  • What percentage of the respondents are divorced women? Use prop.table()

Back to top

3.6.3 Exercise 3

Describe the age of respondents (palter) using summary and create your own overview using summarise() to compare respondent age by marital status.

  • First, overwrite missing values with NA:
pend$palter[pend$palter<0] <- NA
pend$famstand[pend$famstand<0] <- NA
  • Create an overview using summary().
  • Create an overview with the minimum, median, mean, variance, and maximum age values using summarise().
  • Extend this overview to display the summary statistics for the different famstand categories.

Back to top

3.7 Notes

3.7.1 Rounding with round()

Explanation: You can round values to a certain number of digits using round(x , 3). The second number in the parentheses (after the comma) specifies how many decimal places you want:

round(21.12121123,digits = 3)
[1] 21.121
round(21.12121123,digits = 5)
[1] 21.12121
round(21.12121123,digits = 0)
[1] 21

We can round the relative frequencies to make the table above more readable:

xtabs(~zpsex+statakt, data = pend) %>% 
  prop.table(.,margin = 1) %>% 
  round(.,3)
     statakt
zpsex     1     2     3
    1 0.470 0.325 0.205
    2 0.430 0.261 0.309

  1. For non-systematic values, we can use the %in% operator that we already learned about in connection with filter(): pend$var1[pend$var1 %in% c(-9,2,124) ] <- NA (this is just an example).↩︎