10  Labels & factors

library(tidyverse)

pend_kap5 <- haven::read_dta("./orig/PENDDAT_cf_W13.dta",
                             col_select = c("pnr","welle", "zpsex", "PSM0100","azges1","palter")) %>% 
  filter(welle == 8, palter > 0,azges1 > 0)

10.1 Labels from Other Programs in R

In many software packages such as Stata or SPSS, labels are retained through most operations and displayed automatically. This is not the case in R. Instead, R handles labels through the factor variable type. This approach may seem unusual to those who have worked extensively with Stata or SPSS, but it is quite useful in practice once you get accustomed to the workflow.

Generally, value labels from other software packages can be used in R. For example, when we create a frequency table with count(), the labels from the .dta file are displayed:

# Counting occurrences and showing labels
pend_kap5 %>% 
  count(PSM0100)
# A tibble: 3 × 2
  PSM0100                            n
  <dbl+lbl>                      <int>
1 -5 [Does not use the internet]    28
2  1 [Yes]                         318
3  2 [No]                          337

These labels are stored as attributes() of the variable:

attributes(pend_kap5$PSM0100)
$label
[1] "Usage of social networks"

$format.stata
[1] "%39.0f"

$labels
Item not surveyed in wave Does not use the internet           Details refused 
                       -9                        -5                        -2 
               Don't know                       Yes                        No 
                       -1                         1                         2 

$class
[1] "haven_labelled" "vctrs_vctr"     "double"        

enframe() from the {tibble} package (part of the {tidyverse}) turns the value labels stored in this attribute into a data.frame that gives an overview of all of them:

attributes(pend_kap5$PSM0100)$labels %>% enframe(value = "variable_value",name = "label")
# A tibble: 6 × 2
  label                     variable_value
  <chr>                              <dbl>
1 Item not surveyed in wave             -9
2 Does not use the internet             -5
3 Details refused                       -2
4 Don't know                            -1
5 Yes                                    1
6 No                                     2
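
If the .dta file is not at hand, the same structure can be reproduced with a small toy vector. The following is a minimal sketch, assuming a made-up vector `x` built with haven::labelled(); its values and labels only mimic PSM0100 and are not taken from the data.

```r
library(haven)   # labelled()
library(tibble)  # enframe()

# Hypothetical toy vector; values and labels only mimic PSM0100
x <- labelled(
  c(2, 1, 2, -5),
  labels = c("Does not use the internet" = -5, "Yes" = 1, "No" = 2),
  label  = "Usage of social networks"
)

# Value labels and variable label are stored as plain attributes.
# exact = TRUE avoids partial matching between "label" and "labels".
attr(x, "labels", exact = TRUE)
attr(x, "label", exact = TRUE)

# Overview of all value labels as a tibble
label_overview <- enframe(
  attr(x, "labels", exact = TRUE),
  name  = "label",
  value = "variable_value"
)
label_overview
```

Note that attr() matches attribute names partially by default, so asking for "label" on a vector that only has "labels" would return the value labels; exact = TRUE guards against that.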

However, managing attributes() is tedious and sometimes causes problems when working with the labelled variables.

R’s native way of working with labels is the factor variable. As mentioned in chapter 2, factor variables are strings with a predefined universe and ordering.
How can we turn the labels stored in attributes() into a factor to save typing?

{haven} includes the function as_factor() (see footnote 1), which allows us to create a factor variable directly from the labels:

pend_kap5$PSM0100_fct <- as_factor(pend_kap5$PSM0100) # create factor variable from attributes and values

# view:
pend_kap5 %>% select(contains("PSM0100")) %>% head()
# A tibble: 6 × 2
  PSM0100                        PSM0100_fct              
  <dbl+lbl>                      <fct>                    
1  2 [No]                        No                       
2  1 [Yes]                       Yes                      
3  2 [No]                        No                       
4 -5 [Does not use the internet] Does not use the internet
5 -5 [Does not use the internet] Does not use the internet
6 -5 [Does not use the internet] Does not use the internet
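
as_factor() also has a levels argument that controls what ends up as factor levels. The following is a minimal sketch on a made-up toy vector (not the PENDDAT data), assuming the documented options "default", "both" and "values":

```r
library(haven)

# Hypothetical toy vector; values and labels only mimic PSM0100
x <- labelled(
  c(2, 1, 2, -5),
  labels = c("Does not use the internet" = -5, "Yes" = 1, "No" = 2)
)

f_default <- as_factor(x)                     # labels as levels
f_both    <- as_factor(x, levels = "both")    # value and label combined
f_values  <- as_factor(x, levels = "values")  # numeric codes as levels

levels(f_default)
levels(f_both)
levels(f_values)
```

The "both" variant can be handy while checking a recode, because the original code stays visible next to its label.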

10.2 Creating or editing factors manually

Alternatively, we can assign labels ourselves with factor(), using the levels and labels arguments. The labels are assigned, in order, to the values given in levels. All values not listed in levels automatically become NA:

pend_kap5$PSM0100_fct2 <- factor(pend_kap5$PSM0100,
                               levels = c(1,2),
                               labels = c("Yes!","No :-("))

# view:
pend_kap5 %>% select(contains("PSM0100")) %>% head()
# A tibble: 6 × 3
  PSM0100                        PSM0100_fct               PSM0100_fct2
  <dbl+lbl>                      <fct>                     <fct>       
1  2 [No]                        No                        No :-(      
2  1 [Yes]                       Yes                       Yes!        
3  2 [No]                        No                        No :-(      
4 -5 [Does not use the internet] Does not use the internet <NA>        
5 -5 [Does not use the internet] Does not use the internet <NA>        
6 -5 [Does not use the internet] Does not use the internet <NA>        
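
The same behaviour can be seen on a plain numeric vector. This is a minimal base R sketch with made-up codes (1 = yes, 2 = no, -5 = a filter value), not the PENDDAT data:

```r
# Hypothetical codes: 1 = yes, 2 = no, -5 = filter value
codes <- c(2, 1, 2, -5, 1)

# Labels are matched to levels by position: 1 -> "Yes", 2 -> "No"
answer <- factor(codes, levels = c(1, 2), labels = c("Yes", "No"))

answer
table(answer, useNA = "ifany")  # the -5 is counted as NA
```

Because -5 is not listed in levels, it is dropped to NA without a warning, so a quick table() with useNA = "ifany" is a useful check after labelling.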

Or we can use the functions from {forcats} to recode a factor. {forcats} is part of the {tidyverse}. With fct_recode(), we can change the levels:

levels(pend_kap5$PSM0100_fct)
[1] "Item not surveyed in wave" "Does not use the internet"
[3] "Details refused"           "Don't know"               
[5] "Yes"                       "No"                       
pend_kap5$PSM0100_fct3 <- fct_recode(pend_kap5$PSM0100_fct,
  `Uses social networks` = "Yes" # use `` around level names with spaces
)
levels(pend_kap5$PSM0100_fct3)
[1] "Item not surveyed in wave" "Does not use the internet"
[3] "Details refused"           "Don't know"               
[5] "Uses social networks"      "No"                       
pend_kap5 %>% select(contains("PSM0100")) %>% head()
# A tibble: 6 × 4
  PSM0100                        PSM0100_fct           PSM0100_fct2 PSM0100_fct3
  <dbl+lbl>                      <fct>                 <fct>        <fct>       
1  2 [No]                        No                    No :-(       No          
2  1 [Yes]                       Yes                   Yes!         Uses social…
3  2 [No]                        No                    No :-(       No          
4 -5 [Does not use the internet] Does not use the int… <NA>         Does not us…
5 -5 [Does not use the internet] Does not use the int… <NA>         Does not us…
6 -5 [Does not use the internet] Does not use the int… <NA>         Does not us…

More fct_*() functions from {forcats} can be found in the {forcats} cheatsheet.
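
A few of these functions are often used together after as_factor(). The following is a minimal sketch on a made-up factor (not the PENDDAT data) that drops unused levels, renames one level and changes the level order:

```r
library(forcats)

# Hypothetical factor with an unused level, as as_factor() can produce
f <- factor(c("No", "Yes", "No"), levels = c("Don't know", "Yes", "No"))

f <- fct_drop(f)                                    # drop unused levels
f <- fct_recode(f, "Uses social networks" = "Yes")  # rename one level
f <- fct_relevel(f, "No")                           # move "No" to the front

levels(f)
```

The order of the levels matters later on, for example for the order of bars in a plot or for the reference category in a regression.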

10.3 Exercise

pend_ue5 <- haven::read_dta("./orig/PENDDAT_cf_W13.dta",
                               col_select = c("pnr","welle","PD0400")) %>% 
  filter(PD0400>0)

Edit the value labels of PD0400 (Religiousness, self-rating):

value   label
1       Not at all religious
2       Rather not religious
3       Rather religious
4       Very religious

  • First, use head() and a count with count() to get an overview.
  • How can you use the labels from the attributes() with as_factor() to create a variable PD0400_fct?
  • Create a factor() variable PD0400_fct2 with the value labels 1 = Not at all, 2 = Rather not, 3 = Rather yes, 4 = Very much.

Bonus exercise: Use the labeled variable for a bar chart.

10.4 Appendix

10.4.1 Remove labels with zap_... from {haven}

The label attributes() often cause problems in further processing. With haven::zap_labels(), we can remove value labels from a dataset, and with haven::zap_label(), we can remove variable labels.

pend_kap5
# A tibble: 683 × 6
          pnr welle             zpsex      PSM0100                 azges1 palter
        <dbl> <dbl+lbl>         <dbl+lbl>  <dbl+lbl>               <dbl+> <dbl+>
 1 1000002601 8 [Wave 8 (2014)] 2 [Female]  2 [No]                 22     34    
 2 1000010402 8 [Wave 8 (2014)] 2 [Female]  1 [Yes]                40     30    
 3 1000019102 8 [Wave 8 (2014)] 1 [Male]    2 [No]                 40     34    
 4 1000031403 8 [Wave 8 (2014)] 1 [Male]   -5 [Does not use the i… 44     52    
 5 1000032801 8 [Wave 8 (2014)] 2 [Female] -5 [Does not use the i… 44     58    
 6 1000032802 8 [Wave 8 (2014)] 1 [Male]   -5 [Does not use the i… 43     62    
 7 1000038201 8 [Wave 8 (2014)] 1 [Male]    1 [Yes]                43     61    
 8 1000040003 8 [Wave 8 (2014)] 1 [Male]    2 [No]                 36     40    
 9 1000051801 8 [Wave 8 (2014)] 2 [Female]  2 [No]                 31     44    
10 1000053101 8 [Wave 8 (2014)] 1 [Male]    1 [Yes]                27     47    
# ℹ 673 more rows
pend_kap5 %>% 
  haven::zap_labels() # remove value labels
# A tibble: 683 × 6
          pnr welle zpsex PSM0100 azges1 palter
        <dbl> <dbl> <dbl>   <dbl>  <dbl>  <dbl>
 1 1000002601     8     2       2     22     34
 2 1000010402     8     2       1     40     30
 3 1000019102     8     1       2     40     34
 4 1000031403     8     1      -5     44     52
 5 1000032801     8     2      -5     44     58
 6 1000032802     8     1      -5     43     62
 7 1000038201     8     1       1     43     61
 8 1000040003     8     1       2     36     40
 9 1000051801     8     2       2     31     44
10 1000053101     8     1       1     27     47
# ℹ 673 more rows
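
The difference between the two zap functions is easiest to see on a single vector. This is a minimal sketch, assuming a made-up labelled vector `x` (not taken from the data):

```r
library(haven)

# Hypothetical toy vector with value labels and a variable label
x <- labelled(c(1, 2, 1), labels = c(Male = 1, Female = 2), label = "Sex")

no_value_labels   <- zap_labels(x)  # drops the value labels and the labelled class
no_variable_label <- zap_label(x)   # drops only the variable label

attributes(no_value_labels)
attributes(no_variable_label)
```

Both functions also accept a whole data.frame, as in the pend_kap5 example above, and are then applied to every column.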

10.4.2 Creating labels in R and exporting to Stata

If we want to label a dataset for Stata, for example, {labelled} comes in handy again:

fdz_install("labelled")
library(labelled)
pend_kap5$zpsex_num2 <- as.numeric(pend_kap5$zpsex)
attributes(pend_kap5$zpsex_num2)
NULL
val_labels(pend_kap5$zpsex_num2) <- c("Männer Testtesttest"=1,"Frauen"=2)
attributes(pend_kap5$zpsex_num2)
$labels
Männer Testtesttest              Frauen 
                  1                   2 

$class
[1] "haven_labelled" "vctrs_vctr"     "double"        
pend_kap5 %>% count(zpsex_num2)
# A tibble: 2 × 2
  zpsex_num2                  n
  <dbl+lbl>               <int>
1 1 [Männer Testtesttest]   324
2 2 [Frauen]                359
pend_kap5 %>% 
  select(zpsex_num2) %>% 
  haven::write_dta(.,path = "./data/pend_kap5.dta")

…in Stata:

use "./data/pend_kap5.dta" 
tab zpsex_num2 
         zpsex_num2 |      Freq.     Percent        Cum.
--------------------+-----------------------------------
Männer Testtesttest |        324       47.44       47.44
             Frauen |        359       52.56      100.00
--------------------+-----------------------------------
              Total |        683      100.00
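
To check an export without opening Stata, one option is to read the file back into R. The following is a minimal sketch with made-up toy data written to a temporary file; the variable name and the labels are illustrative assumptions:

```r
library(haven)     # write_dta(), read_dta()
library(labelled)  # val_labels(), var_label()
library(tibble)    # tibble()

# Hypothetical toy data; variable name and labels are illustrative
toy <- tibble(sex = c(1, 2, 2))
val_labels(toy$sex) <- c(Male = 1, Female = 2)
var_label(toy$sex)  <- "Sex of respondent"

# Write to a temporary .dta file and read it back
path <- tempfile(fileext = ".dta")
write_dta(toy, path)
toy_back <- read_dta(path)

val_labels(toy_back$sex)  # value labels after the round trip
var_label(toy_back$sex)   # variable label after the round trip
```

If both labels come back as expected, Stata should display them in the same way as in the tab output above.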

More on labels can be found in the {labelled} documentation.


  1. Not to be confused with as.factor() from base R – the _ makes a difference!