On the final day of the workshop, we’ll be slowing things down. In Days 1 and 2, we explored a variety of complex visualizations and walked through the grammar of graphics in some detail (Wickham, Navarro, and Pedersen 2023; Wilkinson 2005). On Day 3, we’ll get a bit more pragmatic and address a topic that should interest every population researcher out there: how can we visualize the results of the statistical models we fit?

As in Days 1 and 2, you can engage with course materials via two channels:

By following the day3.R script file. This file is available in the companion repository for PopAging DataViz.^{1}

By working your way through the .html page you are currently on.

A Reminder About Packages (Click to Expand)

If renv::restore() is not working for you, you can install Day 3 packages by executing the following code:
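As a fallback, something like the following should work. (The package list below is an assumption inferred from the libraries used on this page; consult day3.R or the repository’s renv.lock for the authoritative set.)

```r
# Assumed package list - inferred from the functions used throughout Day 3;
# the official list may differ:
install.packages(c("tidyverse", "broom", "gtsummary", "modelsummary",
                   "estimatr", "ggeffects", "marginaleffects",
                   "skimr", "ggthemes", "gapminder"))
```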

After fitting our model, let’s use functions from gtsummary, modelsummary or broom to look at our results.
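For instance (a minimal sketch; it assumes mod_gapminder_1 — the model used throughout this section — has already been fit):

```r
# Each call summarizes the same fitted model in a different format:
tidy(mod_gapminder_1, conf.int = TRUE)   # broom: tidy tibble of estimates
tbl_regression(mod_gapminder_1)          # gtsummary: publication-ready table
modelsummary(mod_gapminder_1)            # modelsummary: (multi-)model table
```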

Caveat Emptor

The models produced here are for illustrative purposes only. Put plainly, they offer little inferential or predictive power and are undoubtedly misspecified.

These functions (and libraries) are wonderful ways to review model results or begin the process of building a regression table. However, visual summaries are particularly useful for demystifying complex (and at times inscrutable) statistical models.

With that in mind, here’s a simple way we can use broom::tidy() and ggplot2 to visualize the results of our model while highlighting our parameters of interest.

mod_gapminder_1 %>% 
  tidy(conf.int = TRUE) %>% 
  # Zeroing in on variables of interest:
  filter(str_detect(term, "log")) %>% 
  ggplot() +
  geom_pointrange(mapping = aes(x = estimate,
                                y = term,
                                xmin = conf.low,
                                xmax = conf.high))

Quick Exercise (Click to Expand)

Spend the next 10 minutes manipulating this basic coefficient plot using some of the skills you’ve developed during PopAging DataViz.

Coefficient Plots via modelplot()

Producing coefficient plots manually gives us fine control — but may take a fair bit of time as well (depending on the complexity of our models). Thankfully, the modelplot() function from modelsummary can streamline the process:

modelplot(mod_gapminder_1,
          # Omitting specific variables (e.g., controls,
          # fixed effects etc.):
          coef_omit = "Interc|contin",
          # 99% confidence intervals:
          conf_level = 0.99) +
  geom_vline(xintercept = 0, linetype = "dotted") +
  # Shading in "pointranges" based on significance level:
  aes(colour = ifelse(p.value < 0.01, "Significant", "Not significant")) +
  scale_colour_manual(values = c("grey", "red"))

To showcase the utility of modelplot(), let’s fit a model that’s a bit more complex. To do so, we’ll use the wage_panelr dataset as well as the lm_robust() function from estimatr. For some context, here’s a summary of wage_panelr:

skim(wage_panelr)

Data summary

Name: wage_panelr
Number of rows: 4165
Number of columns: 14
Column type frequency: factor (10), numeric (4)
Group variables: None

Variable type: factor

skim_variable   n_missing  complete_rate  ordered  n_unique  top_counts
id              0          1              FALSE    595       1: 7, 2: 7, 3: 7, 4: 7
wave            0          1              FALSE    7         1: 595, 2: 595, 3: 595, 4: 595
blue_collar     0          1              FALSE    2         1: 2129, 0: 2036
manufacturing   0          1              FALSE    2         0: 2518, 1: 1647
south           0          1              FALSE    2         0: 2956, 1: 1209
stndrd_metro    0          1              FALSE    2         1: 2723, 0: 1442
married         0          1              FALSE    2         1: 3392, 0: 773
black           0          1              FALSE    2         0: 3864, 1: 301
female          0          1              FALSE    2         0: 3696, 1: 469
union           0          1              FALSE    2         0: 2649, 1: 1516

Variable type: numeric

skim_variable    n_missing  complete_rate  mean   sd     p0    p25   p50    p75    p100   hist
lwage            0          1              6.68   0.46   4.61  6.4   6.68   6.95   8.54   ▁▂▇▃▁
years_worked     0          1              19.85  10.97  1.00  11.0  18.00  29.00  51.00  ▇▇▆▅▁
weeks_worked     0          1              46.81  5.13   5.00  46.0  48.00  50.00  52.00  ▁▁▁▁▇
years_education  0          1              12.85  2.79   4.00  12.0  12.00  16.00  17.00  ▁▂▂▇▅

Now, let’s regress the log of individual wages on a series of covariates while clustering standard errors at the respondent level^{2}:

mod_wages_1 <- lm_robust(lwage ~ years_worked + female + black + blue_collar +
                           south + union + manufacturing + wave,
                         data = wage_panelr,
                         clusters = id,
                         # Produces robust clustered SEs (via the
                         # Delta method - as in Stata):
                         se_type = "stata")

Having stored our results in mod_wages_1, let’s summarize our model graphically:

# Labels for coefficients:
wages_labels <- c("manufacturing1" = "Manufacturing",
                  "union1" = "Union Member",
                  "south1" = "Southerner",
                  "black1" = "Black",
                  "female1" = "Female",
                  "years_worked" = "Years in Workforce",
                  "blue_collar1" = "Blue Collar Worker")

mod_wages_1 %>% 
  modelplot(conf_level = 0.99,
            # String search:
            coef_omit = "Interc|wave",
            colour = "red",
            # Renaming coefficients:
            coef_map = wages_labels) +
  theme_bw(base_family = "Inconsolata") +
  # Adding reference point for "non-significant" finding:
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Coefficient Estimates",
       title = "Predicting the Log of Wages",
       subtitle = "With 99% Confidence Intervals")

As we fine-tune our model, we may want to test whether the sex differences uncovered in mod_wages_1 persist once additional background variables are adjusted for. With this in mind, let’s include an additional regressor on the right-hand side: years_education (i.e., years of schooling).
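The refit might look like this (a sketch; mod_wages_2 is an assumed name, and the specification simply mirrors mod_wages_1 with the new regressor appended):

```r
# Assumed refit of the wage model with years_education added:
mod_wages_2 <- lm_robust(lwage ~ years_worked + female + black + blue_collar +
                           south + union + manufacturing + wave +
                           years_education,
                         data = wage_panelr,
                         clusters = id,
                         se_type = "stata")
```

From there, both models can be passed to modelplot() as a list — e.g., modelplot(list(mod_wages_1, mod_wages_2)) — to compare estimates side by side.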

As these plots lay bare, introducing an additional covariate (years_education) does little to change our story.

Adjusted Predictions, Marginal Effects

Adjusted Predictions

There is a rather large body of scholarship detailing the importance of predictions as tools for clarifying model results — especially when our outcomes are discrete or interactions are brought into our statistical ecosystems (see Arel-Bundock 2023; Long and Mustillo 2021; Mize 2019). That is, however, a story for another day. For now, let’s use two wonderful libraries—ggeffects and marginaleffects—to visualize model predictions.

To do so, let’s fit a slightly more complicated version of mod_gapminder_1. Specifically, let’s allow continent to moderate the relationship between per capita GDP (logged) and life expectancy. We can specify this model as follows:
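One plausible specification (a sketch; the estimator and object names are assumptions based on how mod_gapminder_1 and mod_gapminder_2 are used on this page):

```r
# Interaction between logged per capita GDP and continent; log_gdpPercap is
# assumed to be a pre-computed log of gapminder's gdpPercap variable:
mod_gapminder_2 <- lm(lifeExp ~ log_gdpPercap * continent,
                      data = gapminder)
```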

Here’s how we can visualize predictions using ggeffects:

ggeffect(mod_gapminder_2,
         # Predicted life expectancy at logged GDP per capita of 6 to 11
         # for countries in Europe and Africa:
         terms = c("log_gdpPercap [6:11]", "continent [Europe, Africa]")) %>% 
  plot() +
  labs(title = "Predicted Values of Life Expectancy",
       x = "Log of Per Capita GDP",
       y = "Life Expectancy (Years)",
       fill = "",
       colour = "") +
  theme_ggeffects(base_family = "IBM Plex Sans") +
  scale_color_economist() +
  scale_fill_economist() +
  theme(legend.position = "bottom") +
  # Overriding default alpha of confidence intervals:
  guides(fill = guide_legend(override.aes = list(alpha = 0.35)))

The Slightly Longer Route

We can turn the results of ggeffect() into a tibble that we feed into ggplot2:

mod_gapminder_2 %>% 
  ggeffect(terms = c("log_gdpPercap [6:11]",
                     "continent [Europe, Africa]")) %>% 
  as_tibble() %>% 
  ggplot(mapping = aes(x = x,
                       y = predicted,
                       ymin = conf.low,
                       ymax = conf.high,
                       # group corresponds to discrete term:
                       colour = group,
                       fill = group)) +
  geom_line() +
  # For confidence intervals:
  geom_ribbon(alpha = 0.5)

Here’s how we can use marginaleffects to visualize model predictions:

plot_predictions(mod_gapminder_2,
                 condition = list("log_gdpPercap" = 6:11,
                                  "continent" = c("Africa", "Europe"))) +
  theme_ggeffects(base_family = "IBM Plex Sans") +
  labs(fill = "",
       colour = "",
       x = "Log of GDP per Capita",
       y = "Life Expectancy")

The Slightly Longer Route

We can turn the results of avg_predictions() into a tibble that we feed into ggplot2:
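A sketch of how that might look (the grid of values passed to `variables` is an assumption, chosen to mirror the ggeffect() example above):

```r
# Average predictions over an assumed grid of regressor values, coerced
# to a tibble and plotted manually:
mod_gapminder_2 %>% 
  avg_predictions(variables = list(log_gdpPercap = 6:11,
                                   continent = c("Africa", "Europe"))) %>% 
  as_tibble() %>% 
  ggplot(mapping = aes(x = log_gdpPercap,
                       y = estimate,
                       ymin = conf.low,
                       ymax = conf.high,
                       colour = continent,
                       fill = continent)) +
  geom_line() +
  # For confidence intervals:
  geom_ribbon(alpha = 0.5)
```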

As Arel-Bundock (2023) explains, “a marginal effect measures the association between a change in a regressor \(x\), and a change in the response \(y\). Put differently, the marginal effect is the slope of the prediction function, measured at a specific value of the regressor \(x\).” Across the social and behavioural sciences, this is often the quantity of substantive interest.

Below, we use a few lines of code to estimate and visualize the average marginal effects^{3} associated with a key regressor (log_gdpPercap) across levels of the continent variable.
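A sketch of what those lines might look like, using marginaleffects (the exact call in the original is not shown):

```r
# Visualize the average slope of log_gdpPercap within each continent:
plot_slopes(mod_gapminder_2,
            variables = "log_gdpPercap",
            by = "continent")
```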

How can we transform these results into a ggplot2 object?
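One hedged answer: avg_slopes() returns a data-frame-like object that we can coerce to a tibble and plot ourselves (a sketch, assuming mod_gapminder_2 as specified above):

```r
# Average marginal effects of log_gdpPercap by continent, plotted manually:
mod_gapminder_2 %>% 
  avg_slopes(variables = "log_gdpPercap", by = "continent") %>% 
  as_tibble() %>% 
  ggplot() +
  geom_pointrange(mapping = aes(x = estimate,
                                y = continent,
                                xmin = conf.low,
                                xmax = conf.high))
```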

Final Exercise

Time permitting, explore the gss_2010 data frame and fit a few interesting models. Then, use some of the tools covered today to visually summarize your findings.

To make matters easier, here’s every labelled variable in the gss_2010 data frame:

References

Arel-Bundock, Vincent. 2023. “marginaleffects: Predictions, Comparisons, Slopes, Marginal Means, and Hypothesis Tests. The Marginal Effects Zoo (0.15.0).” September 10, 2023. https://marginaleffects.com/.

Long, J. Scott, and Sarah A. Mustillo. 2021. “Using Predictions and Marginal Effects to Compare Groups in Regression Models for Binary Outcomes.” Sociological Methods & Research 50 (3): 1284–1320. https://doi.org/10.1177/0049124118799374.

Mize, Trenton. 2019. “Best Practices for Estimating, Interpreting, and Presenting Nonlinear Interaction Effects.” Sociological Science 6: 81–117. https://doi.org/10.15195/v6.a4.

Wickham, Hadley, Danielle Navarro, and Thomas Lin Pedersen. 2023. “ggplot2: Elegant Graphics for Data Analysis.” https://ggplot2-book.org/.

Wilkinson, Leland. 2005. The Grammar of Graphics. 2nd ed. New York: Springer.

Of course, you can also copy the source code into your console or editor by clicking the Code button on the top-right of this page and clicking View Source.↩︎

A better approach might involve fitting latent growth curves or mixed models.↩︎