Interpreting regression models through marginal effects

Application exercise
Modified

May 1, 2024

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.5      ✔ rsample      1.2.1 
✔ dials        1.2.1      ✔ tune         1.2.1 
✔ infer        1.0.7      ✔ workflows    1.1.4 
✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
✔ recipes      1.0.10     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
library(marginaleffects)
library(scales)
library(colorspace)

theme_set(theme_minimal())

Studying public corruption

Political scientists are frequently concerned with issues related to corruption and human rights. In this application exercise, we’ll use data from the World Bank and the Varieties of Democracy project to study the relationship between respect for human rights, public sector corruption, and legal campaign finance disclosure requirements.

Import data

# import corruption data
corruption <- read_rds(file = "data/corruption.rds")
glimpse(corruption)
Rows: 169
Columns: 17
$ country_name             <chr> "Mexico", "Suriname", "Sweden", "Switzerland"…
$ country_text_id          <chr> "MEX", "SUR", "SWE", "CHE", "GHA", "ZAF", "JP…
$ year                     <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202…
$ region                   <fct> Latin America and the Caribbean, Latin Americ…
$ disclose_donations_ord   <dbl> 3, 1, 3, 0, 2, 2, 3, 2, 2, 2, 2, 0, 3, 3, 4, …
$ public_sector_corruption <dbl> 49.1, 24.4, 1.5, 1.2, 66.6, 58.0, 3.8, 37.7, …
$ polyarchy                <dbl> 65.2, 75.4, 90.1, 90.1, 72.2, 71.4, 83.5, 42.…
$ civil_liberties          <dbl> 70.2, 86.7, 96.5, 95.3, 91.7, 83.0, 91.3, 49.…
$ disclose_donations       <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,…
$ iso2c                    <chr> "MX", "SR", "SE", "CH", "GH", "ZA", "JP", "MM…
$ population               <dbl> 125998302, 607065, 10353442, 8638167, 3218040…
$ gdp_percapita            <dbl> 9273.811, 7275.365, 51952.673, 84637.013, 195…
$ capital                  <chr> "Mexico City", "Paramaribo", "Stockholm", "Be…
$ longitude                <chr> "-99.1276", "-55.1679", "18.0645", "7.44821",…
$ latitude                 <chr> "19.427", "5.8232", "59.3327", "46.948", "5.5…
$ income                   <chr> "Upper middle income", "Upper middle income",…
$ log_gdp_percapita        <dbl> 9.134950, 8.892249, 10.858088, 11.346127, 7.5…

The dataset covers 169 countries in 2020. The main variables we will utilize include:

  • public_sector_corruption - Index of public sector corruption, ranging from 0 to 100, based on the measure

    To what extent do public sector employees grant favors in exchange for bribes, kickbacks, or other material inducements, and how often do they steal, embezzle, or misappropriate public funds or other state resources for personal or family use?

    0 indicates no corruption and 100 the most corruption.

  • disclosure_donations - Presence of campaign finance disclosure laws, coded as FALSE for no (none or minimally enforced disclosure requirements) and TRUE for yes (disclosure requirements in place and largely enforced).

  • polyarchy - continuous variable from 0 to 100 with higher values representing greater achievement of democratic ideals.

  • civil_liberties - Index of civil liberties, ranging from 0 to 100, with higher values representing better respect for human rights and civil liberties.

  • log_gdp_percapita - natural logarithm of GDP per capita in constant 2015 USD.

  • region - region of the world where the country is located. One of

    • Eastern Europe and Central Asia (including Mongolia)
    • Latin America and the Caribbean
    • The Middle East and North Africa (including Israel and Turkey, excluding Cyprus)
    • Sub-Saharan Africa
    • Western Europe and North America (including Cyprus, Australia and New Zealand)
    • Asia and Pacific (excluding Australia and New Zealand)

Effect of civil liberties on public sector corruption

First let’s examine the effect of civil liberties on public sector corruption.

Demo: Estimate a simple linear regression model of the effect of civil liberties on public sector corruption.

# simple model
ggplot(data = corruption, mapping = aes(x = civil_liberties, y = public_sector_corruption)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Civil liberties index",
    y = "Public sector corruption index"
  )
`geom_smooth()` using formula = 'y ~ x'

# estimate formally
model_simple <- linear_reg() |>
  fit(public_sector_corruption ~ civil_liberties,
    data = corruption
  ) |>
  extract_fit_engine()
tidy(model_simple)
# A tibble: 2 × 5
  term            estimate std.error statistic  p.value
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)      102.       5.36        19.1 7.90e-44
2 civil_liberties   -0.813    0.0734     -11.1 1.04e-21

Your turn: Interpret the results. What do we learn from this model in terms of the strength and directionality of the relationship?

Add response here.

Add a polynomial term

Demo: Estimate a polynomial linear regression model of the effect of civil liberties on public sector corruption.

# visualize the model
ggplot(data = corruption, mapping = aes(x = civil_liberties, y = public_sector_corruption)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2)) +
  labs(
    x = "Civil liberties index",
    y = "Public sector corruption index"
  )

# estimate the model
model_sq <- linear_reg() |>
  fit(public_sector_corruption ~ civil_liberties + I(civil_liberties^2),
    data = corruption
  ) |>
  extract_fit_engine()
tidy(model_sq)
# A tibble: 3 × 5
  term                 estimate std.error statistic     p.value
  <chr>                   <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)           54.0     10.1          5.37 0.000000263
2 civil_liberties        1.24     0.378        3.29 0.00124    
3 I(civil_liberties^2)  -0.0175   0.00316     -5.53 0.000000125

Your turn: Interpret the results. What do we learn from this model in terms of the strength and directionality of the relationship?

Add response here.

What is the marginal effect of civil liberties on public sector corruption?

Demo: What is the marginal effect of civil liberties on public sector corruption for countries with civil liberties scores of 25, 55, and 80?

slopes(
  model_sq,
  newdata = datagrid(
    civil_liberties = c(25, 55, 80)
  )
)

            Term civil_liberties Estimate Std. Error      z Pr(>|z|)    S
 civil_liberties              25    0.368     0.2242   1.64    0.101  3.3
 civil_liberties              55   -0.680     0.0718  -9.48   <0.001 68.4
 civil_liberties              80   -1.554     0.1503 -10.34   <0.001 80.9
   2.5 % 97.5 %
 -0.0713  0.807
 -0.8210 -0.540
 -1.8484 -1.259

Columns: rowid, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, civil_liberties, predicted_lo, predicted_hi, predicted, public_sector_corruption 
Type:  response 

Demo: What about all possible values of civil liberties?

plot_slopes(model_sq,
  variables = "civil_liberties",
  condition = "civil_liberties"
) +
  labs(
    x = "Civil liberties",
    y = "Marginal effect of civil liberties on public sector corruption",
  )

Your turn:

  • How does the slope change in relation to the civil liberties score? Add response here.
  • What does this suggest about the relationship between civil liberties and public sector corruption? Add response here.
  • Are these results statistically significant? Add response here.

Summarizing the marginal effects

Average marginal effects

Demo: Calculate the average marginal effect of civil liberties on public sector corruption.

# all slopes
slopes(model_sq)

            Term Estimate Std. Error       z Pr(>|z|)     S  2.5 % 97.5 %
 civil_liberties   -1.211     0.0989 -12.245   <0.001 112.1 -1.405 -1.018
 civil_liberties   -1.788     0.1890  -9.458   <0.001  68.1 -2.159 -1.418
 civil_liberties   -2.131     0.2479  -8.593   <0.001  56.7 -2.616 -1.645
 civil_liberties   -2.089     0.2406  -8.680   <0.001  57.8 -2.560 -1.617
 civil_liberties   -1.963     0.2189  -8.968   <0.001  61.5 -2.392 -1.534
--- 159 rows omitted. See ?avg_slopes and ?print.marginaleffects --- 
 civil_liberties   -1.806     0.1920  -9.404   <0.001 67.4 -2.182 -1.429
 civil_liberties   -1.676     0.1703  -9.841   <0.001 73.5 -2.010 -1.342
 civil_liberties   -1.739     0.1809  -9.616   <0.001 70.3 -2.094 -1.385
 civil_liberties   -0.114     0.1434  -0.796    0.426  1.2 -0.395  0.167
 civil_liberties   -1.641     0.1646  -9.973   <0.001 75.4 -1.964 -1.319
Columns: rowid, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted, public_sector_corruption, civil_liberties 
Type:  response 
# summarized slopes
slopes(model_sq) |>
  group_by(term) |>
  summarize(avg_slope = mean(estimate))
# A tibble: 1 × 2
  term            avg_slope
  <chr>               <dbl>
1 civil_liberties     -1.17
# simpler method with additional statistics
avg_slopes(model_sq)

            Term Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
 civil_liberties    -1.17     0.0933 -12.5   <0.001 117.0 -1.35 -0.984

Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 

Your turn: What does the AME tell us? Add response here.

Marginal effects at the mean

# what is the mean?
mean(corruption$civil_liberties)
[1] 68.93432
slopes(model_sq, newdata = "mean")

            Term Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
 civil_liberties    -1.17     0.0933 -12.5   <0.001 117.0 -1.35 -0.984

Columns: rowid, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted, civil_liberties, public_sector_corruption 
Type:  response 

Your turn: What does the MEM tell us? How does it differ from the AME? Add response here.

Interpreting marginal effects for logistic regression models

Demo: Estimate a logistic regression model of the effect of public sector corruption on the presence of campaign finance disclosure laws.

# visualize the model
ggplot(
  data = corruption,
  mapping = aes(x = public_sector_corruption, y = as.numeric(disclose_donations))
) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = binomial(link = "logit"))) +
  labs(
    x = "Public sector corruption",
    y = "Predicted probability of presence of\ncampaign finance disclosure laws"
  )
`geom_smooth()` using formula = 'y ~ x'

# estimate the model
model_logit <- logistic_reg() |>
  fit(factor(disclose_donations) ~ public_sector_corruption,
    data = corruption
  ) |>
  extract_fit_engine()
tidy(model_logit)
# A tibble: 2 × 5
  term                     estimate std.error statistic  p.value
  <chr>                       <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                1.87     0.375        4.99 6.10e- 7
2 public_sector_corruption  -0.0644   0.00935     -6.88 5.81e-12

Your turn: Interpret the results. What do we learn from this model in terms of the strength and directionality of the relationship? Add response here.

Potential marginal effects

Your turn: Estimate the average marginal effect and marginal effect at the mean for the logistic regression model. How do the results differ?

# AME
model_logit |>
  avg_slopes()

                     Term Estimate Std. Error     z Pr(>|z|)     S    2.5 %
 public_sector_corruption -0.00816   0.000253 -32.3   <0.001 757.4 -0.00866
   97.5 %
 -0.00767

Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 
# ME at the mean
model_logit |>
  avg_slopes(newdata = "mean")

                     Term Estimate Std. Error     z Pr(>|z|)    S   2.5 %
 public_sector_corruption   -0.012    0.00163 -7.35   <0.001 42.2 -0.0152
   97.5 %
 -0.00879

Columns: rowid, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
Type:  response 
# how do they differ?
bind_rows(
  AME = model_logit |>
    avg_slopes(),
  MEM = model_logit |>
    avg_slopes(newdata = "mean"),
  .id = "type"
) |>
  mutate(type = fct_rev(type)) |>
  ggplot(mapping = aes(x = estimate, y = type, color = type)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_pointrange(mapping = aes(xmin = conf.low, xmax = conf.high)) +
  scale_x_continuous(labels = label_percent()) +
  scale_color_discrete_qualitative(guide = "none") +
  labs(
    x = "Marginal effect (percentage points)",
    y = NULL
  )

Add response here.

Even more complex model

Demo: Estimate a more complex (and potentially realistic) logistic regression model of multiple factors on the presence of campaign finance disclosure laws.

model_logit_fancy <- logistic_reg() |>
  fit(
    factor(disclose_donations) ~ public_sector_corruption + I(public_sector_corruption^2) +
      polyarchy + log_gdp_percapita + public_sector_corruption * region,
    data = corruption
  ) |>
  extract_fit_engine()

# tidied results
tidy(model_logit_fancy)
# A tibble: 15 × 5
   term                                     estimate std.error statistic p.value
   <chr>                                       <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)                              -4.56e-1  3.70        -0.123  0.902 
 2 public_sector_corruption                 -4.22e-2  0.0538      -0.783  0.433 
 3 I(public_sector_corruption^2)            -5.83e-5  0.000548    -0.106  0.915 
 4 polyarchy                                 3.43e-2  0.0167       2.06   0.0398
 5 log_gdp_percapita                         6.09e-2  0.334        0.182  0.855 
 6 regionLatin America and the Caribbean    -2.68e+0  1.50        -1.78   0.0748
 7 regionMiddle East and North Africa        9.24e-1  2.41         0.383  0.702 
 8 regionSub-Saharan Africa                 -1.69e+0  1.53        -1.10   0.269 
 9 regionWestern Europe and North America   -2.38e-1  1.83        -0.130  0.897 
10 regionAsia and Pacific                    6.14e-1  1.91         0.321  0.748 
11 public_sector_corruption:regionLatin Am…  2.98e-2  0.0329       0.906  0.365 
12 public_sector_corruption:regionMiddle E… -4.92e-2  0.0636      -0.774  0.439 
13 public_sector_corruption:regionSub-Saha…  1.30e-2  0.0296       0.438  0.662 
14 public_sector_corruption:regionWestern … -8.30e-2  0.0990      -0.838  0.402 
15 public_sector_corruption:regionAsia and… -4.45e-2  0.0498      -0.894  0.372 

Your turn: When interpreting the results as they relate to public sector corruption, does it matter whether your use the AME or MEM?

# focus on slopes for public sector corruption
bind_rows(
  AME = model_logit_fancy |>
    avg_slopes(),
  MEM = model_logit_fancy |>
    avg_slopes(newdata = "mean"),
  .id = "type"
) |>
  filter(term == "public_sector_corruption") |>
  mutate(type = fct_rev(type)) |>
  ggplot(mapping = aes(x = estimate, y = type, color = type)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_pointrange(mapping = aes(xmin = conf.low, xmax = conf.high)) +
  scale_x_continuous(labels = label_percent()) +
  scale_color_discrete_qualitative(guide = "none") +
  labs(
    x = "Marginal effect (percentage points)",
    y = NULL
  )

Add response here.

Group average marginal effects

Your turn: Calculate the average marginal effect of public sector corruption on the presence of campaign finance disclosure laws by region. How does the AME differ by region?

model_logit_fancy |>
  avg_slopes(
    variables = "public_sector_corruption",
    newdata = datagrid(region = levels(corruption$region)),
    by = "region"
  ) |>
  mutate(region = fct_reorder(.f = region, .x = estimate)) |>
  ggplot(mapping = aes(x = estimate, y = region, color = region)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_pointrange(mapping = aes(xmin = conf.low, xmax = conf.high)) +
  scale_x_continuous(labels = label_percent()) +
  scale_color_discrete_qualitative(guide = "none") +
  labs(
    x = "Marginal effect (percentage points)",
    y = NULL
  )

Add response here.

Marginal effects at user-specified or representative values

Demo: Let’s examine countries in three regions (Western Europe, Latin America, and the Middle East) with public sector corruption of 20 and 80. Leave all other values at their typical average.

# create a grid of values
regions_to_use <- c(
  "Western Europe and North America",
  "Latin America and the Caribbean",
  "Middle East and North Africa"
)

# lots of instantaneous marginal effects
model_logit_fancy |>
  slopes(
    variables = "public_sector_corruption",
    newdata = datagrid(
      public_sector_corruption = c(20, 80),
      region = regions_to_use
    )
  )

                     Term public_sector_corruption
 public_sector_corruption                       20
 public_sector_corruption                       20
 public_sector_corruption                       20
 public_sector_corruption                       80
 public_sector_corruption                       80
 public_sector_corruption                       80
                           region  Estimate Std. Error      z Pr(>|z|)   S
 Western Europe and North America -2.62e-02   0.015428 -1.698   0.0896 3.5
 Latin America and the Caribbean  -2.78e-03   0.006462 -0.430   0.6670 0.6
 Middle East and North Africa     -1.90e-02   0.010674 -1.776   0.0757 3.7
 Western Europe and North America -2.11e-05   0.000126 -0.167   0.8671 0.2
 Latin America and the Caribbean  -1.99e-03   0.003390 -0.587   0.5575 0.8
 Middle East and North Africa     -7.41e-04   0.001840 -0.403   0.6871 0.5
     2.5 %   97.5 %
 -0.056429 0.004048
 -0.015446 0.009885
 -0.039881 0.001959
 -0.000269 0.000226
 -0.008633 0.004656
 -0.004347 0.002865

Columns: rowid, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, public_sector_corruption, region, predicted_lo, predicted_hi, predicted, polyarchy, log_gdp_percapita, disclose_donations 
Type:  response 

Let’s make this more useful. Let’s still focus on the effect of public sector corruption conditional on region, but now visualize across a range of possible values on the \(x\)-axis while holding the other values constant.

# without CIs
plot_predictions(model_logit_fancy, condition = c("public_sector_corruption", "region"), vcov = FALSE) +
  scale_y_continuous(labels = label_percent()) +
  scale_color_discrete_qualitative() +
  labs(
    x = "Public sector corruption",
    y = "Predicted probability of having\na campaign finance disclosure law",
    color = NULL
  ) +
  theme(
    legend.position = "bottom"
  )

# with CIs
plot_predictions(model_logit_fancy, condition = c("public_sector_corruption", "region")) +
  scale_y_continuous(labels = label_percent()) +
  scale_color_discrete_qualitative(aesthetic = c("fill", "color"), guide = "none") +
  facet_wrap(facets = vars(region)) +
  labs(
    x = "Public sector corruption",
    y = "Predicted probability of having\na campaign finance disclosure law",
    color = NULL
  )

Your turn: How do the predicted probabilities vary across region? What do the confidence intervals tell us about this story?

Add response here.

Acknowledgments