6  Regression models

Let’s consider a (fictional) dataset with only eight cases. We can create it using the data.frame command. Besides an id, it contains the two variables of interest, var1 and var2, as well as educ, gend, and x, which we will use later:

dat1 <- data.frame(id   = 1:8,
                   var1 = c(2,1,2,5,7, 8, 9,5),
                   var2 = c(2,2,1,9,7, 4,25,3),
                   educ = c(3,1,2,2,1, 3, 2,-1),
                   gend = c(2,1,1,2,1,2,1,2),
                   x    = c(2,1,2,4,1,NA,NA,NA) )
dat1
  id var1 var2 educ gend  x
1  1    2    2    3    2  2
2  2    1    2    1    1  1
3  3    2    1    2    1  2
4  4    5    9    2    2  4
5  5    7    7    1    1  1
6  6    8    4    3    2 NA
7  7    9   25    2    1 NA
8  8    5    3   -1    2 NA

6.1 Regression Models with lm()

Regression models in R can be created with lm(). Here, we specify the variable for the y-axis (the dependent variable) and, after a ~, the variable for the x-axis (the independent variable). We will discuss the interpretation of the results in the coming weeks.

lm(var2 ~ var1, data = dat1)

Call:
lm(formula = var2 ~ var1, data = dat1)

Coefficients:
(Intercept)         var1  
     -2.340        1.839  

The value under var1 indicates how much the line moves up or down per one-unit “step to the right”: the line rises by 1.8389662 for each additional unit of var1. We can store the results as m1:

m1 <- lm(var2 ~ var1, data = dat1)  

With summary(), we get a regression table:

summary(m1)

Call:
lm(formula = var2 ~ var1, data = dat1)

Residuals:
   Min     1Q Median     3Q    Max 
-8.372 -3.613  0.162  2.234 10.789 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -2.3400     4.3454  -0.538   0.6096  
var1          1.8390     0.7727   2.380   0.0548 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.127 on 6 degrees of freedom
Multiple R-squared:  0.4856,    Adjusted R-squared:  0.3999 
F-statistic: 5.664 on 1 and 6 DF,  p-value: 0.05477

m1 contains all the information about the model, and $coefficients is particularly helpful:

m1$coefficients
(Intercept)        var1 
  -2.339960    1.838966 
summary(m1)$coefficients
             Estimate Std. Error    t value   Pr(>|t|)
(Intercept) -2.339960  4.3453801 -0.5384938 0.60961706
var1         1.838966  0.7727028  2.3799139 0.05477457

We can inspect the individual components of the model with View(m1). For example, fitted.values contains the predicted value for each case.
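
If you prefer the console over the viewer, names() lists everything stored in the lm object (plain base R, nothing beyond what lm() already returns):

names(m1)
head(m1$fitted.values)  # predicted values for the first cases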

6.1.1 Exercise

6.2 Calculating Models for Specific Cases

If we want to recalculate the model, we have two options:

6.2.1 Create a New Data Frame

We can keep multiple data.frame objects in memory in R. Thus, we can easily create a new data.frame containing only observations with var2 < 20 and use this for our lm() command:

dat1_u20 <- dat1 %>% filter(var2 < 20)
m2a <- lm(var2 ~ var1, data = dat1_u20)
summary(m2a)

Call:
lm(formula = var2 ~ var1, data = dat1_u20)

Residuals:
      1       2       3       4       5       6       7 
-0.4737  0.1941 -1.4737  4.5230  1.1875 -2.4803 -1.4770 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.1382     1.9217   0.592    0.579
var1          0.6678     0.3877   1.722    0.146

Residual standard error: 2.555 on 5 degrees of freedom
Multiple R-squared:  0.3724,    Adjusted R-squared:  0.2469 
F-statistic: 2.967 on 1 and 5 DF,  p-value: 0.1456

6.2.2 Filter Directly in lm()

We can also incorporate the filter() command directly into the data= argument of lm():

m2b <- lm(var2 ~ var1, data = dat1 %>% filter(var2 < 20))
summary(m2b)

Call:
lm(formula = var2 ~ var1, data = dat1 %>% filter(var2 < 20))

Residuals:
      1       2       3       4       5       6       7 
-0.4737  0.1941 -1.4737  4.5230  1.1875 -2.4803 -1.4770 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.1382     1.9217   0.592    0.579
var1          0.6678     0.3877   1.722    0.146

Residual standard error: 2.555 on 5 degrees of freedom
Multiple R-squared:  0.3724,    Adjusted R-squared:  0.2469 
F-statistic: 2.967 on 1 and 5 DF,  p-value: 0.1456

6.3 Regression tables

If we want to compare these different models, a table is a good option.

There are numerous alternatives for creating regression tables, and my personal favorite is modelsummary() from the eponymous package {modelsummary}. It handles (almost) all types of models and offers a wide range of features, including Word output (more on this later) and coefficient plots (which we will also cover). Additionally, the documentation is excellent.

fdz_install("modelsummary")
library(modelsummary)
modelsummary(list(m1,m2a,m2b))
              (1)        (2)        (3)
(Intercept)   -2.340     1.138      1.138
              (4.345)    (1.922)    (1.922)
var1          1.839      0.668      0.668
              (0.773)    (0.388)    (0.388)
Num.Obs.      8          7          7
R2            0.486      0.372      0.372
R2 Adj.       0.400      0.247      0.247
AIC           55.4       36.6       36.6
BIC           55.6       36.5       36.5
Log.Lik.      -24.702    -15.321    -15.321
F             5.664      2.967      2.967
RMSE          5.31       2.16       2.16

We will delve a bit more into the customization options for {modelsummary} later, but here are two key options for now:

  • By using stars = T, we can display the significance levels with the common star codes (*: p < .05, etc.)
  • By using gof_omit = "IC|RM|Log", we can hide the goodness of fit statistics that have IC, RM, or Log in their names (such as AIC, BIC, RMSE, and LogLikelihood)
  • By using "Name" = in list(), we can specify names:
modelsummary(list("m1"=m1,"m2a"=m2a,"m2b"=m2b),stars = T,gof_omit = "IC|RM|Log")
              m1         m2a        m2b
(Intercept)   -2.340     1.138      1.138
              (4.345)    (1.922)    (1.922)
var1          1.839+     0.668      0.668
              (0.773)    (0.388)    (0.388)
Num.Obs.      8          7          7
R2            0.486      0.372      0.372
R2 Adj.       0.400      0.247      0.247
F             5.664      2.967      2.967
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

The output = option allows us to export the modelsummary table as a .docx file:

modelsummary(list("m1"=m1,"m2a"=m2a,"m2b"=m2b),stars = T,gof_omit = "IC|RM|Log",output = "./results/Regression_table.docx")

6.3.1 Exercise

6.4 Categorical Independent Variables

Of course, we can also include categorical independent variables in our model. However, we need to define the relevant variables as factors to inform R that the numeric values should not be interpreted numerically.

For instance, in our small example, educ represents education levels where 1 stands for basic education, 2 for intermediate, and 3 for high education.

dat1
  id var1 var2 educ gend  x
1  1    2    2    3    2  2
2  2    1    2    1    1  1
3  3    2    1    2    1  2
4  4    5    9    2    2  4
5  5    7    7    1    1  1
6  6    8    4    3    2 NA
7  7    9   25    2    1 NA
8  8    5    3   -1    2 NA
m3 <- lm(var2 ~ factor(educ), dat1)
summary(m3)

Call:
lm(formula = var2 ~ factor(educ), data = dat1)

Residuals:
         1          2          3          4          5          6          7 
-1.000e+00 -2.500e+00 -1.067e+01 -2.667e+00  2.500e+00  1.000e+00  1.333e+01 
         8 
-1.277e-15 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.000e+00  8.848e+00   0.339    0.752
factor(educ)1  1.500e+00  1.084e+01   0.138    0.897
factor(educ)2  8.667e+00  1.022e+01   0.848    0.444
factor(educ)3 -1.088e-15  1.084e+01   0.000    1.000

Residual standard error: 8.848 on 4 degrees of freedom
Multiple R-squared:  0.2848,    Adjusted R-squared:  -0.2516 
F-statistic: 0.531 on 3 and 4 DF,  p-value: 0.685

It’s even better if we label educ beforehand:

dat1$ed_fct <- factor(dat1$educ, levels = 1:3,
                        labels = c("basic", "medium", "high"))
dat1
  id var1 var2 educ gend  x ed_fct
1  1    2    2    3    2  2   high
2  2    1    2    1    1  1  basic
3  3    2    1    2    1  2 medium
4  4    5    9    2    2  4 medium
5  5    7    7    1    1  1  basic
6  6    8    4    3    2 NA   high
7  7    9   25    2    1 NA medium
8  8    5    3   -1    2 NA   <NA>

Then use the factor in the regression command:

m3 <- lm(var2 ~ ed_fct, dat1)
summary(m3)

Call:
lm(formula = var2 ~ ed_fct, data = dat1)

Residuals:
      1       2       3       4       5       6       7 
 -1.000  -2.500 -10.667  -2.667   2.500   1.000  13.333 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)     4.500      6.257   0.719    0.512
ed_fctmedium    7.167      8.077   0.887    0.425
ed_fcthigh     -1.500      8.848  -0.170    0.874

Residual standard error: 8.848 on 4 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.2594,    Adjusted R-squared:  -0.1109 
F-statistic: 0.7005 on 2 and 4 DF,  p-value: 0.5485
  • Compared to educ = basic, the predicted value for var2 when educ = medium is 7.17 higher.

  • Compared to educ = basic, the predicted value for var2 when educ = high is 1.5 lower.
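
These coefficients are simply differences in group means: the predicted value for educ = medium is the intercept plus the medium coefficient, 4.5 + 7.17 = 11.67, which is the mean of var2 among the medium cases (ids 3, 4, and 7).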

We can also change the reference category:

dat1$ed_fct <- relevel(dat1$ed_fct, ref = "medium")
m3b <- lm(var2 ~ ed_fct, dat1)
summary(m3b)

Call:
lm(formula = var2 ~ ed_fct, data = dat1)

Residuals:
      1       2       3       4       5       6       7 
 -1.000  -2.500 -10.667  -2.667   2.500   1.000  13.333 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   11.667      5.109   2.284   0.0844 .
ed_fctbasic   -7.167      8.077  -0.887   0.4251  
ed_fcthigh    -8.667      8.077  -1.073   0.3437  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.848 on 4 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.2594,    Adjusted R-squared:  -0.1109 
F-statistic: 0.7005 on 2 and 4 DF,  p-value: 0.5485
  • Compared to educ = medium, the predicted value for var2 when educ = basic is 7.17 lower.

  • Compared to educ = medium, the predicted value for var2 when educ = high is 8.67 lower.

6.4.1 Exercise

6.5 Multiple Independent Variables

To include multiple independent variables in our regression models, we specify them using +:

m4 <- lm(var2 ~ ed_fct + var1, dat1)
summary(m4)

Call:
lm(formula = var2 ~ ed_fct + var1, data = dat1)

Residuals:
     1      2      3      4      5      6      7 
 4.258  2.758 -4.824 -2.082 -2.758 -4.258  6.907 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.3187     5.8227   0.398    0.717
ed_fctbasic  -4.8297     6.0381  -0.800    0.482
ed_fcthigh   -8.0824     5.9411  -1.360    0.267
var1          1.7527     0.8347   2.100    0.127

Residual standard error: 6.501 on 3 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.7002,    Adjusted R-squared:  0.4003 
F-statistic: 2.335 on 3 and 3 DF,  p-value: 0.2521

6.6 Coefficient Plots

In addition to regression tables, {modelsummary} provides the modelplot() function, which makes it easy to create coefficient plots from one or more models:

modelplot(m4)

For model comparison, simply provide the models in a named list, and you can further customize the plot with the usual {ggplot2} commands:

modelplot(list("Model 1" = m1,
               "Model 4" = m4))

With coef_map, you can assign labels to the coefficients (note that (Intercept) does not get a name and is therefore omitted):

modelplot(list("Model 1" = m1,
               "Model 4" = m4),
          coef_map = c("var1" = "Name for var1",
                       "ed_fcthigh" = "Higher Education",
                       "ed_fctbasic" = "Basic Education"
                          ))

You can also further customize the plot with the usual {ggplot2} commands:

modelplot(list("Model 1" = m1,
               "Model 4" = m4),
          coef_map = c("var1" = "Name for var1",
                       "ed_fcthigh" = "Higher Education",
                       "ed_fctbasic" = "Basic\nEducation")) + # \n inserts a line break
  geom_vline(aes(xintercept = 0), linetype = "dashed", color = "grey40") +  # Add a 0 line
  scale_color_manual(values = c("orange", "navy")) +
  theme_minimal(base_size = 15, base_family = "serif") 

With {broom}, you can also create a data.frame from the regression results and create the ggplot entirely yourself - see appendix.

margins: {marginaleffects}

Alternative visual summaries of regression results rely on marginal effects, contrasts, elasticities, or predictions. For all of these metrics, see the {marginaleffects} package. There is also a dedicated chapter on replicating Stata’s margins command.
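
For instance, average marginal effects (slopes) for m4 could be obtained along these lines (a minimal sketch using avg_slopes() and the course’s fdz_install() helper; consult the package documentation for details):

fdz_install("marginaleffects")
library(marginaleffects)
avg_slopes(m4)   # average effect of each predictor on the predicted var2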

6.6.1 Exercise

6.7 Exercises

Use the following subset of the PASS CampusFile:

pend_ue08 <- haven::read_dta("./orig/PENDDAT_cf_W13.dta") %>% 
  filter(welle == 13, netges > 0, azges1 > 0, schul2 > 1, palter > 0)

6.7.1 Regression model

  • Create an object mod1 with a linear regression model (lm) where netges (monthly net income in EUR) is the dependent variable and azges1 (working hours) is the independent variable! (see here)
  • Examine the results of mod1 - what can you infer about the relationship between netges and azges1?

6.7.2 Categorical Independent Variables

  • Create a regression model with the income of respondents (netges) as the dependent variable and the education level of the respondents (schul2) as the independent variable:
value   label
2       Finished school without degree
3       School incorporating physically or mentally disabled children (Sonderschulabschluss)
4       Lower secondary school (Hauptschulabschluss)
5       Intermediate secondary school (Realschulabschluss, Mittlere Reife)
6       Upper secondary school (Fachoberschule, Fachhochschulreife)
7       General/subject-specific upper secondary school (Hochschulreife)
  • Ensure that schul2 is defined as a factor. Assign the labels “No degree”, “Special education School”, “Secondary School”, “Intermediate secondary school”, “Upper secondary”, “Abitur” to levels 2-7 and save the factor as a variable schul2_fct in your data.frame - see the code help below:
pend_ue08$schul2_fct <-  
  factor(pend_ue08$schul2, 
         levels = 2:7, 
         labels = c("No degree", "Special education School", "Secondary School",
                    "Intermediate secondary school", 
                    "Upper secondary", "Abitur"))
  • Create the regression model using this new factor variable for schul2_fct as the independent variable.
  • Change the reference category to Intermediate secondary school (schul2 = 5) and estimate the model again.

6.7.3 Multiple Independent Variables & Coefficient Plot

  • Adjust the lm() model mod1 (with all cases from pend_ue08) to include the education level (schul2) as an additional independent variable.
  • Also create a graphical comparison of the two models with and without the education level.

6.8 Appendix

6.8.1 Visualizing Regression Lines and Data

With geom_smooth(method = "lm"), we can also represent regression lines in {ggplot2}:

We can visualize our model with var1 and var2 as follows:

library(ggplot2)
ggplot(dat1, aes(x = var1, y = var2)) + 
  geom_point(size = 2) + 
  geom_smooth(method = "lm")  

Here, it appears we have an outlier. In our small dataset, it is easy to find. In larger datasets, geom_text() can help us:

ggplot(dat1, aes(x = var1, y = var2)) + 
  geom_point(size = 2) + 
  geom_smooth(method = "lm")  +
  geom_text(data = . %>% filter(var2 > 20), aes(y = var2 + 3, label = id), color = "sienna1")

We can restrict a geom_ to a subset of the data by overriding data = (so that it no longer inherits the data from the main ggplot() call) and applying a filter(). In addition, we shift the label slightly above the point with var2 + 3.


6.8.2 Predicted Values

The predicted values from lm() can be found under $fitted.values:

m1$fitted.values
        1         2         3         4         5         6         7         8 
 1.337972 -0.500994  1.337972  6.854871 10.532803 12.371769 14.210736  6.854871 

These predicted values are simply the sum of the value under Intercept and the value under var1 multiplied by the respective value for var1.

m1

Call:
lm(formula = var2 ~ var1, data = dat1)

Coefficients:
(Intercept)         var1  
     -2.340        1.839  

For the first row of dat1 (var1 = 2), the predicted value from m1 is: -2.340 + 1.839 * 2 = 1.338
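
We can reproduce this calculation for all rows directly from the stored coefficients (a quick check in base R, no extra packages):

coef(m1)["(Intercept)"] + coef(m1)["var1"] * dat1$var1   # matches m1$fitted.values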

The values under fitted.values follow the order in the dataset, so we can simply add them as a new column in dat1:

dat1$lm_predictions <- m1$fitted.values
dat1
  id var1 var2 educ gend  x ed_fct lm_predictions
1  1    2    2    3    2  2   high       1.337972
2  2    1    2    1    1  1  basic      -0.500994
3  3    2    1    2    1  2 medium       1.337972
4  4    5    9    2    2  4 medium       6.854871
5  5    7    7    1    1  1  basic      10.532803
6  6    8    4    3    2 NA   high      12.371769
7  7    9   25    2    1 NA medium      14.210736
8  8    5    3   -1    2 NA   <NA>       6.854871

The plot shows how predictions based on m1 look: They correspond to the values on the blue line (the so-called regression line) at the respective points for var1.

Code
ggplot(dat1, aes(x = var1, y = var2)) +
  geom_point(size = 3) +      
  geom_smooth(method = "lm", color = "darkblue", se = FALSE, size = .65) +
  geom_point(aes(x = var1, y = lm_predictions), color = "dodgerblue3", size = 3) +
  expand_limits(x = c(0,8), y = c(0,8))

6.8.3 Residuals

The light blue points (i.e., the predictions from m1) are close to the actual points. However, even the light blue points do not perfectly overlap with the actual values. These deviations between the predicted and actual values are called residuals:

\[\text{residual} = \text{observed value} - \text{predicted value}\]
\[\varepsilon_{i} = y_{i} - \hat{y}_{i}\]

We can calculate these manually as the difference between the actual and predicted value or simply call them using m1$residuals:

m1$residuals
         1          2          3          4          5          6          7 
 0.6620278  2.5009940 -0.3379722  2.1451292 -3.5328032 -8.3717694 10.7892644 
         8 
-3.8548708 
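
As a quick manual check, the residuals are simply the observed values minus the fitted values (base R only):

dat1$var2 - m1$fitted.values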

We can also store the residuals for lm in dat1:

dat1$lm_residuals <- m1$residuals
dat1
  id var1 var2 educ gend  x ed_fct lm_predictions lm_residuals
1  1    2    2    3    2  2   high       1.337972    0.6620278
2  2    1    2    1    1  1  basic      -0.500994    2.5009940
3  3    2    1    2    1  2 medium       1.337972   -0.3379722
4  4    5    9    2    2  4 medium       6.854871    2.1451292
5  5    7    7    1    1  1  basic      10.532803   -3.5328032
6  6    8    4    3    2 NA   high      12.371769   -8.3717694
7  7    9   25    2    1 NA medium      14.210736   10.7892644
8  8    5    3   -1    2 NA   <NA>       6.854871   -3.8548708

Here are the residuals for lm shown in light blue:

Code
ggplot(dat1, aes(x = var1, y = var2)) + 
  geom_smooth(method = "lm", color = "darkblue", se = FALSE, size = .65) +
  geom_segment(aes(x = var1, xend = var1, y = var2, yend = lm_predictions), color = "dodgerblue3", size = .65, linetype = 1) +
  geom_point(size = 3) +
  geom_point(aes(x = var1, y = lm_predictions), color = "dodgerblue3", size = 3) +
  expand_limits(x = c(0,8), y = c(0,8))

6.8.4 Checking Assumptions

Model Dashboard

install.packages("performance")
library(performance)

model_test <- check_model(m4)
plot(model_test)

6.8.5 Test for Normal Distribution of Residuals

Graphical check: Q-Q plot

library(ggfortify)
autoplot(m1, which = 2)

You can check the normality assumption with the Shapiro-Wilk test & shapiro.test(). This tests the null hypothesis \(H_0\): “The residuals are normally distributed” against the alternative hypothesis \(H_A\): “The residuals significantly deviate from normal distribution.”

shapiro.test(m1$residuals) 

    Shapiro-Wilk normality test

data:  m1$residuals
W = 0.95346, p-value = 0.746

6.8.6 Test for Homoscedasticity

Homoscedasticity is present when the residuals have roughly constant spread across the entire range of predicted values (m1$fitted.values). There is both a graphical method for checking this and a formal test. For the graphical check, the predicted values and the residuals are plotted as a scatterplot. The autoplot() function can be helpful here:

autoplot(m1, which = 1)

The associated formal test is the Breusch-Pagan test. It evaluates the null hypothesis (\(H_0\)) of “homoscedasticity” against the alternative hypothesis (\(H_A\)) of “heteroscedasticity”, so a small p-value is evidence against the homoscedasticity assumption. In R, you can use bptest() from the {lmtest} package for this:

install.packages("lmtest")
library(lmtest)
bptest(m3)

    studentized Breusch-Pagan test

data:  m3
BP = 3.6069, df = 2, p-value = 0.1647

6.8.7 Test for Multicollinearity

install.packages("car")
# library(car)
pendx <- haven::read_dta("./orig/PENDDAT_cf_W13.dta",n_max = 300)  %>% filter(netges >0, palter >0 )
mox <- lm(netges ~ palter + azges1, data=pendx)
car::vif(mox)
  palter   azges1 
1.000133 1.000133 
  • A common threshold for the Variance Inflation Factor (VIF) is 10: values above 10 indicate a serious multicollinearity problem, and a stricter threshold of about 5 is often recommended as the point at which to take action. In this example, all independent variables are well within acceptable limits under both thresholds.
  • If multicollinearity is present, there are several ways to address it. We can exclude one or more independent variables from the model; this is ultimately a substantive question and cannot be resolved with a standard recipe. Alternatively, we can combine the collinear independent variables into an index variable, for example the average of the variables in question (see the sketch below).
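
A minimal sketch of such an average index, using a hypothetical data.frame dat_collinear with two collinear variables x1 and x2 (these names are purely illustrative and not part of the PASS data):

library(dplyr)
dat_collinear <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(1.1, 2.2, 2.9, 4.2))  # hypothetical example data
dat_collinear <- dat_collinear %>% 
  mutate(index_x = rowMeans(across(c(x1, x2)), na.rm = TRUE))  # average of the collinear variables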

6.8.8 Comparing Regression Models

With the {performance} package, we can also perform a comprehensive model comparison:

install.packages("performance")
library(performance)
compare_performance(m1, m4, metrics = c("R2", "R2_adj"))
When comparing models, please note that probably not all models were fit
  from same data.
# Comparison of Model Performance Indices

Name | Model |    R2 | R2 (adj.)
--------------------------------
m1   |    lm | 0.486 |     0.400
m4   |    lm | 0.700 |     0.400

6.8.9 Individual Coefficient Plots with {broom}

modelplot() offers a quick way to create coefficient plots, but I often use {broom}. Using broom::tidy(..., conf.int = TRUE), we get a data.frame with the results of the regression model, which we can then process further in {ggplot2}—if the standard solution from modelplot() doesn’t meet our needs or preferences:

library(broom) ## already loaded as part of the tidyverse
tidy(m3, conf.int = TRUE)
# A tibble: 3 × 7
  term         estimate std.error statistic p.value conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)      4.5       6.26     0.719   0.512    -12.9      21.9
2 ed_fctmedium     7.17      8.08     0.887   0.425    -15.3      29.6
3 ed_fcthigh      -1.5       8.85    -0.170   0.874    -26.1      23.1
tidy(m3, conf.int = TRUE) %>% 
  mutate(term = str_replace(term, "ed_fct", "Education: ")) %>% 
  ggplot(aes(y = term, x = estimate)) +
  geom_vline(aes(xintercept = 0), linetype = "dashed", color = "navy") +
  geom_errorbarh(aes(xmin = conf.low, xmax  = conf.high), height = .1) + 
  geom_point(color = "orange", shape = 18, size = 7) +
  theme_minimal(base_size = 16)