9 Data Wrangling II: Merging & reshaping
9.1 Joining/merging data sets
A mutating join allows you to combine variables from two data.frames. It first matches observations by their keys, then copies across variables from one table to the other.
R for Data Science: Mutating joins
A quick illustration:1
There are also right_join()
and anti_join()
. For a more in-depth introduction, the chapter on Relational Data in R for Data Science is highly recommended.
A very helpful option in the ..._join()
functions is the ability to join different variables. For example, here we have some cases from ids_df
, for which the (fictional) unemployment figures from alo_bula
should be used. However, in ids_df
, the variable Bula
contains the state information, while in alo_bula
, it is the variable bundesland
:
Code
<- data.frame(pnr = sample(1:9,4),
ids_df Bula = c(2,1,14,15))
set.seed(90459)
<- data.frame(bundesland = seq(1:8),
alo_bula Werte = sample(letters,size = 8) # mit sample() kann eine zufällige Auswahl getroffen werden
)
ids_df
#> pnr Bula
#> 1 8 2
#> 2 4 1
#> 3 2 14
#> 4 1 15
alo_bula
#> bundesland Werte
#> 1 1 g
#> 2 2 m
#> 3 3 n
#> 4 4 z
#> 5 5 w
#> 6 6 r
#> 7 7 t
#> 8 8 h
%>% left_join(alo_bula,by = c("Bula"="bundesland")) ids_df
#> pnr Bula Werte
#> 1 8 2 m
#> 2 4 1 g
#> 3 2 14 <NA>
#> 4 1 15 <NA>
A quick check for the matching cases can be done using:
table(ids_df$Bula %in% alo_bula$bundesland)
#>
#> FALSE TRUE
#> 2 2
anti_join()
allows for checking which key variables are not present in the other data.frame:
%>% anti_join(alo_bula,by = c("Bula"="bundesland")) ids_df
#> pnr Bula
#> 1 2 14
#> 2 1 15
%>% anti_join(ids_df,by = c("bundesland"="Bula")) alo_bula
#> bundesland Werte
#> 1 3 n
#> 2 4 z
#> 3 5 w
#> 4 6 r
#> 5 7 t
#> 6 8 h
9.1.1 Exercise
9.2 Reshaping Data: pivot_longer()
& pivot_wider()
9.2.1 Wide to Long
Reshaping data from wide to long format is useful when you want to store multiple observations per row. For example:
<- data.frame(bula = c("NRW", "NDS"), alo2018 = c(2, 2), alo2017 = c(1, 1))
bsp_df bsp_df
#> bula alo2018 alo2017
#> 1 NRW 2 1
#> 2 NDS 2 1
We can use pivot_longer()
to convert this wide format to long:
%>% pivot_longer(cols = c(alo2018, alo2017), names_to = "year", values_to = "alo") bsp_df
#> # A tibble: 4 × 3
#> bula year alo
#> <chr> <chr> <dbl>
#> 1 NRW alo2018 2
#> 2 NRW alo2017 1
#> 3 NDS alo2018 2
#> 4 NDS alo2017 1
To remove a prefix from the column names:
%>% pivot_longer(cols = c(alo2018, alo2017), names_to = "year", values_to = "alo", names_prefix = "alo") bsp_df
#> # A tibble: 4 × 3
#> bula year alo
#> <chr> <chr> <dbl>
#> 1 NRW 2018 2
#> 2 NRW 2017 1
#> 3 NDS 2018 2
#> 4 NDS 2017 1
9.2.2 Long to Wide
To convert from long format back to wide:
<- data.frame(land = c("NRW", "NDS", "NRW", "NDS"), alo = c(2.1, 1.8, 2.4, 2.2), alter = c("age_1825", "age_1825", "age_2630", "age_2630"))
bsp_df2 bsp_df2
#> land alo alter
#> 1 NRW 2.1 age_1825
#> 2 NDS 1.8 age_1825
#> 3 NRW 2.4 age_2630
#> 4 NDS 2.2 age_2630
%>% pivot_wider(names_from = alter, values_from = alo) bsp_df2
#> # A tibble: 2 × 3
#> land age_1825 age_2630
#> <chr> <dbl> <dbl>
#> 1 NRW 2.1 2.4
#> 2 NDS 1.8 2.2
9.2.3 Exercise
9.3 Exercises
9.3.1 Exercise 1: Joining
Join the selected observations from PENDDAT_cf_W13.dta
with the household data to include the region
where the respondents live, using hnr
and welle
as keys.
<- haven::read_dta("./orig/PENDDAT_cf_W13.dta", col_select = c("pnr", "welle")) %>% slice(1:10)
pend_ue11
<- haven::read_dta("./orig/HHENDDAT_cf_W13.dta", col_select = c("hnr", "welle", "region"))
hh_dat
%>% left_join(hh_dat, by = c("welle")) pend_ue11
#> # A tibble: 18,466 × 4
#> pnr welle hnr region
#> <dbl> <dbl+lbl> <dbl> <dbl+lbl>
#> 1 1000001901 1 [Wave 1 (2006/2007)] 10000019 4 [West]
#> 2 1000001901 1 [Wave 1 (2006/2007)] 10000020 4 [West]
#> 3 1000001901 1 [Wave 1 (2006/2007)] 10000023 4 [West]
#> 4 1000001901 1 [Wave 1 (2006/2007)] 10000026 4 [West]
#> 5 1000001901 1 [Wave 1 (2006/2007)] 10000031 4 [West]
#> 6 1000001901 1 [Wave 1 (2006/2007)] 10000032 4 [West]
#> 7 1000001901 1 [Wave 1 (2006/2007)] 10000035 4 [West]
#> 8 1000001901 1 [Wave 1 (2006/2007)] 10000040 4 [West]
#> 9 1000001901 1 [Wave 1 (2006/2007)] 10000043 4 [West]
#> 10 1000001901 1 [Wave 1 (2006/2007)] 10000055 3 [Süd]
#> # ℹ 18,456 more rows
9.3.2 Exercise 2: Reshaping
Bring the following data into long format:
<- haven::read_dta("./orig/PENDDAT_cf_W13.dta", col_select = c("pnr", "welle", "famstand")) %>%
pend_ue11b slice(200:210) %>%
filter(welle %in% 2:3)
%>% pivot_wider(names_from = welle, values_from = famstand) pend_ue11b
#> # A tibble: 3 × 3
#> pnr `2` `3`
#> <dbl> <dbl+lbl> <dbl+lbl>
#> 1 1000014501 -4 [Question mistakenly not asked] 3 [Married/civil p…
#> 2 1000014701 2 [Married/civil partnership, living together] 2 [Married/civil p…
#> 3 1000014702 2 [Married/civil partnership, living together] 2 [Married/civil p…
Using names_prefix = "wave"
:
%>% pivot_wider(names_from = welle, values_from = famstand, names_prefix = "wave") pend_ue11b
#> # A tibble: 3 × 3
#> pnr wave2 wave3
#> <dbl> <dbl+lbl> <dbl+lbl>
#> 1 1000014501 -4 [Question mistakenly not asked] 3 [Married/civil p…
#> 2 1000014701 2 [Married/civil partnership, living together] 2 [Married/civil p…
#> 3 1000014702 2 [Married/civil partnership, living together] 2 [Married/civil p…
Using tidyexplain↩︎