library(tidyverse)
pend_kap5 <- haven::read_dta("./orig/PENDDAT_cf_W13.dta",
                             col_select = c("pnr", "welle", "zpsex", "PSM0100", "azges1", "palter")) %>%
  filter(welle == 8, palter > 0, azges1 > 0)
10 Labels & factors
10.1 Labels from Other Programs in R
In many software packages like Stata or SPSS, labels are retained through various operations and then displayed automatically. This is not the case in R. Instead, in R, we can assign labels using the factor variable type. This approach might seem unusual for those who have worked extensively with Stata or SPSS, but it is quite useful in practice once you get accustomed to the workflow.
Generally, you can use value labels from other software packages in R. For example, when we create a count summary with count(), the labels from the .dta file are displayed:
# Counting occurrences and showing labels
pend_kap5 %>%
  count(PSM0100)
# A tibble: 3 × 2
PSM0100 n
<dbl+lbl> <int>
1 -5 [Does not use the internet] 28
2 1 [Yes] 318
3 2 [No] 337
These labels are stored as attributes() of the variable:
attributes(pend_kap5$PSM0100)
$label
[1] "Usage of social networks"
$format.stata
[1] "%39.0f"
$labels
Item not surveyed in wave Does not use the internet Details refused
-9 -5 -2
Don't know Yes No
-1 1 2
$class
[1] "haven_labelled" "vctrs_vctr" "double"
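To see where these attributes come from, the following minimal sketch builds a small labelled vector by hand with haven::labelled(). The values and labels are made up for illustration and only mimic the structure of PSM0100; they are not taken from the dataset.

```r
library(haven)

# Toy vector in the style of PSM0100 (values and labels are illustrative)
x <- labelled(
  c(2, 1, -5, 2),
  labels = c("Does not use the internet" = -5, "Yes" = 1, "No" = 2),
  label  = "Usage of social networks"
)

attr(x, "label")   # the variable label
attr(x, "labels")  # named vector of value labels
class(x)           # includes "haven_labelled"
```

The names of the labels vector are the label texts and its values are the codes, which is the same layout shown in the attributes() output above.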
enframe() from the {tibble} package (part of the {tidyverse}) returns a data frame with an overview of all value labels stored in an attribute:
attributes(pend_kap5$PSM0100)$labels %>% enframe(value = "variable_value",name = "label")
# A tibble: 6 × 2
label variable_value
<chr> <dbl>
1 Item not surveyed in wave -9
2 Does not use the internet -5
3 Details refused -2
4 Don't know -1
5 Yes 1
6 No 2
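Because the result is an ordinary data frame, it can be filtered like any other. A small sketch, assuming a hypothetical label vector lab in the same shape as the labels attribute and assuming that negative codes mark missing values:

```r
library(tibble)

# Hypothetical label vector shaped like attributes(x)$labels
lab <- c("Don't know" = -1, "Yes" = 1, "No" = 2)

lab_df <- enframe(lab, name = "label", value = "variable_value")

# Keep only the substantive (positive) codes
lab_df[lab_df$variable_value > 0, ]
```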
However, managing attributes() is tedious and sometimes causes problems when working with labelled variables.
R's native way to work with labels is the factor variable. As mentioned in chapter 2, factor variables are strings with a predefined universe and ordering.
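As a brief illustration of "predefined universe and ordering", here is a base R sketch with a made-up character vector. The levels argument fixes both which values are allowed and the order in which they appear.

```r
# Toy character vector (made-up answers)
answers <- c("No", "Yes", "No", "Yes", "No")

# levels defines the allowed values and their order
f <- factor(answers, levels = c("Yes", "No"))

levels(f)      # the predefined universe, in the chosen order
as.integer(f)  # position of each value within levels
table(f)       # counts follow the level order, not the alphabet
```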
How can we use the labels stored in attributes() as a factor to save typing?
{haven} includes the function as_factor()¹, which allows us to directly create a factor variable from labels:
pend_kap5$PSM0100_fct <- as_factor(pend_kap5$PSM0100) # create factor variable from attributes and values

# view:
pend_kap5 %>% select(contains("PSM0100")) %>% head()
# A tibble: 6 × 2
PSM0100 PSM0100_fct
<dbl+lbl> <fct>
1 2 [No] No
2 1 [Yes] Yes
3 2 [No] No
4 -5 [Does not use the internet] Does not use the internet
5 -5 [Does not use the internet] Does not use the internet
6 -5 [Does not use the internet] Does not use the internet
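as_factor() for labelled vectors also has a levels argument that controls whether the factor levels are built from the labels, the underlying codes, or both. A minimal sketch on a toy vector (made-up data, not the PASS variable):

```r
library(haven)

# Toy labelled vector (illustrative values only)
x <- labelled(
  c(1, 2, -5),
  labels = c("Does not use the internet" = -5, "Yes" = 1, "No" = 2)
)

as_factor(x)                     # levels taken from the value labels
as_factor(x, levels = "values")  # levels taken from the underlying codes
as_factor(x, levels = "both")    # codes and labels combined
```

The default is usually what you want for tables and plots; levels = "both" can help when checking that codes and labels line up as expected.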
10.2 Creating or editing factor manually
Alternatively, we can also label with factor() ourselves, using the levels and labels options. The labels are assigned in order to the numbers from levels. Additionally, all values not listed in levels automatically become NA:
pend_kap5$PSM0100_fct2 <- factor(pend_kap5$PSM0100,
                                 levels = c(1, 2),
                                 labels = c("Yes!", "No :-("))
# view:
pend_kap5 %>% select(contains("PSM0100")) %>% head()
# A tibble: 6 × 3
PSM0100 PSM0100_fct PSM0100_fct2
<dbl+lbl> <fct> <fct>
1 2 [No] No No :-(
2 1 [Yes] Yes Yes!
3 2 [No] No No :-(
4 -5 [Does not use the internet] Does not use the internet <NA>
5 -5 [Does not use the internet] Does not use the internet <NA>
6 -5 [Does not use the internet] Does not use the internet <NA>
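The same behaviour can be checked on a small stand-alone vector. This is a base R sketch with made-up codes in the style of PSM0100; table() with useNA makes the values that fell outside levels visible.

```r
# Made-up codes in the style of PSM0100
codes <- c(2, 1, 2, -5, -5)

f <- factor(codes, levels = c(1, 2), labels = c("Yes", "No"))

f                          # -5 is not listed in levels, so it becomes NA
table(f, useNA = "ifany")  # show the NA count explicitly
```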
Or we can use the functions from {forcats} to recode a factor. {forcats} is part of the {tidyverse}. With fct_recode(), we can change the levels:
levels(pend_kap5$PSM0100_fct)
[1] "Item not surveyed in wave" "Does not use the internet"
[3] "Details refused" "Don't know"
[5] "Yes" "No"
pend_kap5$PSM0100_fct3 <- fct_recode(pend_kap5$PSM0100_fct,
  `Uses social networks` = "Yes" # use `` around words with spaces
)
levels(pend_kap5$PSM0100_fct3)
[1] "Item not surveyed in wave" "Does not use the internet"
[3] "Details refused" "Don't know"
[5] "Uses social networks" "No"
pend_kap5 %>% select(contains("PSM0100")) %>% head()
# A tibble: 6 × 4
PSM0100 PSM0100_fct PSM0100_fct2 PSM0100_fct3
<dbl+lbl> <fct> <fct> <fct>
1 2 [No] No No :-( No
2 1 [Yes] Yes Yes! Uses social…
3 2 [No] No No :-( No
4 -5 [Does not use the internet] Does not use the int… <NA> Does not us…
5 -5 [Does not use the internet] Does not use the int… <NA> Does not us…
6 -5 [Does not use the internet] Does not use the int… <NA> Does not us…
More fct_...() functions from {forcats} can be found in the {forcats} cheatsheet.
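Two further {forcats} functions that often come up alongside relabelling are sketched below on a made-up factor: fct_recode() can merge several old levels into one new level, and fct_relevel() changes the order of the levels. The level name "Missing" is an arbitrary choice for this example.

```r
library(forcats)

# Toy factor (made-up answers)
f <- factor(c("Yes", "No", "Don't know", "Details refused", "Yes"))

# Mapping several old levels to the same new name merges them
f2 <- fct_recode(f,
  "Missing" = "Don't know",
  "Missing" = "Details refused"
)
levels(f2)

# fct_relevel() moves the named levels to the front
f3 <- fct_relevel(f2, "Yes", "No")
levels(f3)
```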
10.3 Exercise
pend_ue5 <- haven::read_dta("./orig/PENDDAT_cf_W13.dta",
                            col_select = c("pnr", "welle", "PD0400")) %>%
  filter(PD0400 > 0)
Edit the value labels of PD0400 (Religiousness, self-rating):
| value | label |
|---|---|
| 1 | Not at all religious |
| 2 | Rather not religious |
| 3 | Rather religious |
| 4 | Very religious |
- First, use head() and a count with count() to get an overview.
- How can you use the labels from the attributes() with as_factor() to create a variable PD0400_fct?
- Create a factor() variable F411_01_fct2 with value labels: 1 = Not at all, 2 = Rather not, 3 = Rather yes, 4 = Very much
Bonus exercise: Use the labeled variable for a bar chart.
10.4 Appendix
10.4.1 Remove labels with zap_... from {haven}
The label attributes() often cause problems in further processing. With haven::zap_labels(), we can remove value labels from a dataset, and with haven::zap_label(), we can remove variable labels.
pend_kap5
# A tibble: 683 × 6
pnr welle zpsex PSM0100 azges1 palter
<dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+> <dbl+>
1 1000002601 8 [Wave 8 (2014)] 2 [Female] 2 [No] 22 34
2 1000010402 8 [Wave 8 (2014)] 2 [Female] 1 [Yes] 40 30
3 1000019102 8 [Wave 8 (2014)] 1 [Male] 2 [No] 40 34
4 1000031403 8 [Wave 8 (2014)] 1 [Male] -5 [Does not use the i… 44 52
5 1000032801 8 [Wave 8 (2014)] 2 [Female] -5 [Does not use the i… 44 58
6 1000032802 8 [Wave 8 (2014)] 1 [Male] -5 [Does not use the i… 43 62
7 1000038201 8 [Wave 8 (2014)] 1 [Male] 1 [Yes] 43 61
8 1000040003 8 [Wave 8 (2014)] 1 [Male] 2 [No] 36 40
9 1000051801 8 [Wave 8 (2014)] 2 [Female] 2 [No] 31 44
10 1000053101 8 [Wave 8 (2014)] 1 [Male] 1 [Yes] 27 47
# ℹ 673 more rows
pend_kap5 %>%
  haven::zap_labels() # remove value labels
# A tibble: 683 × 6
pnr welle zpsex PSM0100 azges1 palter
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1000002601 8 2 2 22 34
2 1000010402 8 2 1 40 30
3 1000019102 8 1 2 40 34
4 1000031403 8 1 -5 44 52
5 1000032801 8 2 -5 44 58
6 1000032802 8 1 -5 43 62
7 1000038201 8 1 1 43 61
8 1000040003 8 1 2 36 40
9 1000051801 8 2 2 31 44
10 1000053101 8 1 1 27 47
# ℹ 673 more rows
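Both functions also work on a single vector, which makes the difference between them easy to inspect. A minimal sketch on a toy variable (values and labels are made up):

```r
library(haven)

# Toy labelled variable with value labels and a variable label
x <- labelled(
  c(1, 2, 1),
  labels = c("Male" = 1, "Female" = 2),
  label  = "Sex"
)

x_no_values <- zap_labels(x)  # removes the value labels
x_no_varlab <- zap_label(x)   # removes only the variable label

attr(x_no_values, "labels")   # NULL: value labels are gone
attr(x_no_varlab, "label")    # NULL: variable label is gone
attr(x_no_varlab, "labels")   # value labels are still there
```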
10.4.2 Creating labels in R and exporting to Stata
If we want to label a dataset for Stata, for example, {labelled} comes in handy again:
fdz_install("labelled")
library(labelled)
pend_kap5$zpsex_num2 <- as.numeric(pend_kap5$zpsex)
attributes(pend_kap5$zpsex_num2)
NULL
val_labels(pend_kap5$zpsex_num2) <- c("Männer Testtesttest"=1,"Frauen"=2)
attributes(pend_kap5$zpsex_num2)
$labels
Männer Testtesttest Frauen
1 2
$class
[1] "haven_labelled" "vctrs_vctr" "double"
pend_kap5 %>% count(zpsex_num2)
# A tibble: 2 × 2
zpsex_num2 n
<dbl+lbl> <int>
1 1 [Männer Testtesttest] 324
2 2 [Frauen] 359
pend_kap5 %>%
  select(zpsex_num2) %>%
  haven::write_dta(., path = "./data/pend_kap5.dta")
…in Stata:
use "./data/pend_kap5.dta"
tab zpsex_num2
zpsex_num2 | Freq. Percent Cum.
--------------------+-----------------------------------
Männer Testtesttest | 324 47.44 47.44
Frauen | 359 52.56 100.00
--------------------+-----------------------------------
Total | 683 100.00
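If Stata is not at hand, one way to check the export from within R is to read the file back in. The sketch below uses a toy data frame and a temporary file, both of which are assumptions for illustration rather than the dataset and path used above.

```r
library(haven)
library(tibble)

# Toy data with one labelled column (made-up values)
df <- tibble(
  sex = labelled(c(1, 2, 2), labels = c("Male" = 1, "Female" = 2))
)

path <- tempfile(fileext = ".dta")  # temporary file for the round trip
write_dta(df, path)

back <- read_dta(path)
attr(back$sex, "labels")  # value labels read back from the .dta file
```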
¹ Not to be confused with as.factor() from base R: the _ makes a difference!