fdz_install("haven") # if not already installed
3 Getting an Overview
After importing datasets, we want to get an overview. Every statistical analysis starts with a description of the variables. In this session, we will see how to use tables to get an overview of the information in a dataset. We will also work with the PASS Campus File in this session. Let’s start by loading the data:
library(haven) # data import for Stata datasets
library(tidyverse) # tidyverse
<- read_dta("./orig/PENDDAT_cf_W13.dta") pend
3.1 Frequency Counts
We have various commands available to create a frequency count:
table()
count()
from{dplyr}
The simplest command for counting frequencies is the table()
command. For example, with the variable statakt
representing the education status of respondents:
table(pend$statakt)
-10 -9 -5 1 2 3
3765 3289 280 9470 6139 5481
Here, we see the absolute frequencies displayed. The first row lists the different values, and the second row shows the frequencies.
However, the labels are ignored in the output of table()
. A look into the PASS data report or using attributes()
reveals the value labels:
attributes(pend$statakt)$labels
Item not surveyed in questionnaire version
-10
Item not surveyed in wave
-9
Cannot be generated (missing values)
-5
In occupation with earnings >400 EUR per month
1
Unemployed, registered
2
Pupil/student (school)
3
Apprenticeship/Studying
4
Military or civilian service
5
Carrying out domestic duties
6
Maternity protection/parental leave
7
Pensioner/early retirement
8
Other
9
Unemployed, not registered (since W4)
10
Ill/unfit to work/unemployable
11
Self-employed/family worker
12
9470 respondents are employed, 5481 respondents are inactive, etc. (More on labels and working with value labels in R later.)
With count()
from {dplyr}
, we get the labels displayed directly. Again, we use the pipe %>%
:
%>% count(statakt) pend
# A tibble: 6 × 2
statakt n
<dbl+lbl> <int>
1 -10 [Item not surveyed in questionnaire version] 3765
2 -9 [Item not surveyed in wave] 3289
3 -5 [Cannot be generated (missing values)] 280
4 1 [In occupation with earnings >400 EUR per month] 9470
5 2 [Unemployed, registered] 6139
6 3 [Pupil/student (school)] 5481
We can also store tables under a freely chosen name and call them up later:
<- table(pend$statakt)
t1 <- pend %>% count(statakt) t2
We see here that the table with table()
creates a new object form, a table. With count()
, however, a data.frame
is created.
class(t1)
[1] "table"
class(t2)
[1] "tbl_df" "tbl" "data.frame"
3.2 Missing Values in R: NA
Negative values are a bit annoying.
To mark the values like -5
as missing data in R, we need to set them to NA
in pend
. To do this, we call pend$statakt
and filter with []
only the values for statakt
equal to -1
. In the previous chapter, we learned how to call specific values this way:
$statakt[pend$statakt == -5] # only call statakt = -5 pend
<labelled<double>[280]>: Current main occupation, gen. (since W2)
[1] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[26] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[51] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[76] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[101] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[126] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[151] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[176] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[201] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[226] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[251] -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 -5
[276] -5 -5 -5 -5 -5
Labels:
value label
-10 Item not surveyed in questionnaire version
-9 Item not surveyed in wave
-5 Cannot be generated (missing values)
1 In occupation with earnings >400 EUR per month
2 Unemployed, registered
3 Pupil/student (school)
4 Apprenticeship/Studying
5 Military or civilian service
6 Carrying out domestic duties
7 Maternity protection/parental leave
8 Pensioner/early retirement
9 Other
10 Unemployed, not registered (since W4)
11 Ill/unfit to work/unemployable
12 Self-employed/family worker
(Here, we get the labels again, which is somewhat suboptimal for clarity.)
If we then assign a new value with <-
, the called values will be overwritten - here, we overwrite all values for statakt == -1
with NA
:
$statakt[pend$statakt == -5] <- NA pend
However, we have not yet overwritten all the negative values; -10
and -9
are still missing. Of course, it would be possible this way, but it’s a bit cumbersome:
$statakt[pend$statakt == -9 ] <- NA
pend$statakt[pend$statakt == -10] <- NA pend
For the PASS data, it’s shorter to use < 0
, because all missing codes are less than 0:1
$statakt[pend$statakt < 0 ] <- NA pend
In count()
, NA
is also counted:
%>% count(statakt) pend
# A tibble: 4 × 2
statakt n
<dbl+lbl> <int>
1 1 [In occupation with earnings >400 EUR per month] 9470
2 2 [Unemployed, registered] 6139
3 3 [Pupil/student (school)] 5481
4 NA 7334
If we want to avoid this, we use filter()
again - with is.na()
, we can identify NA
. By prefixing with !
, we can request that all non-NA
values be retained with TRUE
:
%>% filter(!is.na(statakt)) %>% count(statakt) pend
# A tibble: 3 × 2
statakt n
<dbl+lbl> <int>
1 1 [In occupation with earnings >400 EUR per month] 9470
2 2 [Unemployed, registered] 6139
3 3 [Pupil/student (school)] 5481
3.3 Other Table Values
With the help of additional functions, we can customize the frequency table to match the Stata with tab statakt
:
Current main occupation, gen. (since |
W2) | Freq. Percent Cum.
----------------------------------------+-----------------------------------
In occupation with earnings >400 EUR pe | 9,470 44.90 44.90
Unemployed, registered | 6,139 29.11 74.01
Pupil/student (school) | 5,481 25.99 100.00
----------------------------------------+-----------------------------------
Total | 21,090 100.00
<- pend %>% filter(!is.na(statakt)) %>% count(statakt) tab1
prop.table()
: relative values/percentages
$pct <- prop.table(tab1$n)
tab1 tab1
# A tibble: 3 × 3
statakt n pct
<dbl+lbl> <int> <dbl>
1 1 [In occupation with earnings >400 EUR per month] 9470 0.449
2 2 [Unemployed, registered] 6139 0.291
3 3 [Pupil/student (school)] 5481 0.260
29.109% of respondents are unemployed.
prop.table()
withcumsum()
: cumulative relative frequencies
$cum <- prop.table(tab1$n) %>% cumsum()
tab1 tab1
# A tibble: 3 × 4
statakt n pct cum
<dbl+lbl> <int> <dbl> <dbl>
1 1 [In occupation with earnings >400 EUR per month] 9470 0.449 0.449
2 2 [Unemployed, registered] 6139 0.291 0.740
3 3 [Pupil/student (school)] 5481 0.260 1
74.011% of respondents are employed or unemployed (and not inactive).
3.3.1 Exercise
3.4 Contingency Tables
Contingency tables allow us to explore how frequently combinations of different variables occur together. Let’s look at two ways to create contingency tables in R.
Using count()
from the {dplyr}
package, we create a contingency table by inserting two variables. For instance, if we want to see the frequencies of employment status (statakt
) by gender (zpsex
), we can use the following command:
%>% count(zpsex, statakt) pend
# A tibble: 8 × 3
zpsex statakt n
<dbl+lbl> <dbl+lbl> <int>
1 1 [Male] 1 [In occupation with earnings >400 EUR per month] 4685
2 1 [Male] 2 [Unemployed, registered] 3240
3 1 [Male] 3 [Pupil/student (school)] 2047
4 1 [Male] NA 3555
5 2 [Female] 1 [In occupation with earnings >400 EUR per month] 4785
6 2 [Female] 2 [Unemployed, registered] 2899
7 2 [Female] 3 [Pupil/student (school)] 3434
8 2 [Female] NA 3779
<- pend %>% count(zpsex, statakt)
tab2 $pct <- prop.table(tab2$n)
tab2 tab2
# A tibble: 8 × 4
zpsex statakt n pct
<dbl+lbl> <dbl+lbl> <int> <dbl>
1 1 [Male] 1 [In occupation with earnings >400 EUR per month] 4685 0.165
2 1 [Male] 2 [Unemployed, registered] 3240 0.114
3 1 [Male] 3 [Pupil/student (school)] 2047 0.0720
4 1 [Male] NA 3555 0.125
5 2 [Female] 1 [In occupation with earnings >400 EUR per month] 4785 0.168
6 2 [Female] 2 [Unemployed, registered] 2899 0.102
7 2 [Female] 3 [Pupil/student (school)] 3434 0.121
8 2 [Female] NA 3779 0.133
%>% mutate(pct_zpsex= prop.table(n), .by = zpsex) tab2
# A tibble: 8 × 5
zpsex statakt n pct pct_zpsex
<dbl+lbl> <dbl+lbl> <int> <dbl> <dbl>
1 1 [Male] 1 [In occupation with earnings >400 EUR pe… 4685 0.165 0.346
2 1 [Male] 2 [Unemployed, registered] 3240 0.114 0.240
3 1 [Male] 3 [Pupil/student (school)] 2047 0.0720 0.151
4 1 [Male] NA 3555 0.125 0.263
5 2 [Female] 1 [In occupation with earnings >400 EUR pe… 4785 0.168 0.321
6 2 [Female] 2 [Unemployed, registered] 2899 0.102 0.195
7 2 [Female] 3 [Pupil/student (school)] 3434 0.121 0.231
8 2 [Female] NA 3779 0.133 0.254
3.4.1 Exercise
3.5 Summary Statistics
For numerical variables, such as income (netges
), we often compute summary statistics like the mean, median, or quantiles. To get a quick overview, use summary()
:
summary(pend$netges)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-5.0 -3.0 -3.0 567.9 990.0 111419.0
$netges[pend$netges < 0] <- NA # Handling Missing Data
pendsummary(pend$netges)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 880 1320 1562 1890 111419 18056
3.5.1 Calculating Specific Statistics
To calculate specific statistics, we can use:
- Minimum:
min()
- Maximum:
max()
- Mean:
mean()
- Median:
median()
- Quantiles:
quantile()
- Variance:
var()
- Standard Deviation:
sd()
For instance, the mean of income. Setting na.rm = TRUE
forces R to ignore missing values:
mean(pend$netges)
[1] NA
mean(pend$netges, na.rm = TRUE)
[1] 1562.3
3.5.2 Custom Summary with summarise()
You can use summarise()
from {dplyr}
to create custom summary tables:
%>%
pend summarise(
Minimum = min(netges, na.rm = TRUE),
Median = median(netges, na.rm = TRUE),
Mean = mean(netges, na.rm = TRUE),
Maximum = max(netges, na.rm = TRUE)
)
# A tibble: 1 × 4
Minimum Median Mean Maximum
<dbl+lbl> <dbl> <dbl> <dbl+lbl>
1 0 1320 1562. 111419
3.5.3 Comparing Across Groups
To compare statistics across groups, use .by
in summarise()
:
%>%
pend summarise(
Minimum = min(netges, na.rm = TRUE),
Median = median(netges, na.rm = TRUE),
Mean = mean(netges, na.rm = TRUE),
Maximum = max(netges, na.rm = TRUE),
.by = welle
%>% arrange(welle) )
# A tibble: 13 × 5
welle Minimum Median Mean Maximum
<dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl+lbl>
1 1 [Wave 1 (2006/2007)] 1 1200 1525. 111419
2 2 [Wave 2 (2007/2008)] 0 1320 1529. 7200
3 3 [Wave 3 (2008/2009)] 0 1298. 1498. 12000
4 4 [Wave 4 (2010)] 0 1210 1447. 10800
5 5 [Wave 5 (2011)] 0 1250 1494. 33363
6 6 [Wave 6 (2012)] 0 1215 1459. 15950
7 7 [Wave 7 (2013)] 0 1250 1539. 87835
8 8 [Wave 8 (2014)] 0 1255 1456. 9000
9 9 [Wave 9 (2015)] 0 1280 1613. 110451
10 10 [Wave 10 (2016)] 0 1375 1541. 6300
11 11 [Wave 11 (2017)] 0 1500 1748. 44440
12 12 [Wave 12 (2018)] 0 1500 1667. 7150
13 13 [Wave 13 (2019)] 0 1550 1816. 88453
Given that the resulting ‘table’ is a data.frame, we can also filter for specific waves if needed:
%>%
pend summarise(
Minimum = min(netges, na.rm = TRUE),
Median = median(netges, na.rm = TRUE),
Mean = mean(netges, na.rm = TRUE),
Maximum = max(netges, na.rm = TRUE),
.by = welle
%>%
) filter(welle %in% c(1, 10))
# A tibble: 2 × 5
welle Minimum Median Mean Maximum
<dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl+lbl>
1 1 [Wave 1 (2006/2007)] 1 1200 1525. 111419
2 10 [Wave 10 (2016)] 0 1375 1541. 6300
These methods allow for thorough analysis of both categorical and numerical data in R.
3.5.4 Exercise
3.6 Exercises
Use the PASS CampusFile PENDDAT_cf_W13.dta
for all exercises:
library(haven)
<- read_dta("./orig/PENDDAT_cf_W13.dta") pend
As a reminder: you can find an overview of the data import commands here
3.6.1 Exercise 1
We are interested in the variable famstand
, which contains the marital status of the respondents:
Marital status, gen. | Marital status, gen. |
---|---|
-8 | Implausible value |
-4 | Question mistakenly not asked |
-3 | Not applicable (filter) |
-2 | No answer |
1 | Single |
2 | Married/civil partnership, living together |
3 | Married/civil partnership, not living together |
4 | Divorced |
5 | Widowed |
- Display a table with absolute frequencies of
famstand
using bothtable()
andcount()
(Remember to load{tidyverse}
forcount()
). - Overwrite missing codes with
NA
. - Did the replacement of missing values with NA work? Create the table again.
- Display the relative frequencies (proportions). Use
prop.table()
3.6.2 Exercise 2
- Create a contingency table for
famstand
andzpsex
usingcount()
. - What percentage of the respondents are divorced women? Use
prop.table()
3.6.3 Exercise 3
Describe the age of respondents (palter
) using summary
and create your own overview using summarise()
to compare respondent age by marital status.
- First, overwrite missing values with
NA
:
$palter[pend$palter<0] <- NA
pend$famstand[pend$famstand<0] <- NA pend
- Create an overview using
summary()
. - Create an overview with the minimum, median, mean, variance, and maximum age values using
summarise()
. - Extend this overview to display the summary statistics for the different
famstand
categories.
3.7 Notes
3.7.1 Rounding with round()
Explanation: You can round values to a certain number of digits using round(x , 3)
. The second number in the parentheses (after the comma) specifies how many decimal places you want:
round(21.12121123,digits = 3)
[1] 21.121
round(21.12121123,digits = 5)
[1] 21.12121
round(21.12121123,digits = 0)
[1] 21
We can round the relative frequencies to make the table above more readable:
xtabs(~zpsex+statakt, data = pend) %>%
prop.table(.,margin = 1) %>%
round(.,3)
statakt
zpsex 1 2 3
1 0.470 0.325 0.205
2 0.430 0.261 0.309