On the final day of the workshop, we’ll be slowing things down. In Days 1 and 2, we explored a variety of complex visualizations and walked through the grammar of graphics in some detail (Wickham, Navarro, and Pedersen 2023; Wilkinson 2005). In Day 3, we’ll get a bit more pragmatic and address a topic that should interest every population researcher out there: how can we visualize the results of the statistical models we fit?
As in Days 1 and 2, you can engage with course materials via two channels:
By following the day3.R script file. This file is available in the companion repository for PopAging DataViz.1
By working your way through the .html page you are currently on.
A Reminder About Packages (Click to Expand)
If renv::restore() is not working for you, you can install Day 3 packages by executing the following code:
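For instance, here is a sketch of a manual installation call. The package list is an assumption inferred from the libraries used on this page; adjust it to match the workshop’s official Day 3 requirements:

```r
# Hypothetical Day 3 package list, inferred from the libraries
# used in this session's code:
install.packages(c("tidyverse", "gtsummary", "modelsummary", "broom",
                   "estimatr", "skimr", "ggeffects", "marginaleffects"))
```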
After fitting our model, let’s use functions from gtsummary, modelsummary, or broom to look at our results.
Caveat Emptor
The models produced here are simply for illustrative purposes. Put plainly, they offer little inferential or predictive power, and they are undoubtedly misspecified.
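As a quick sketch of what these three entry points look like (assuming a fitted model object named mod_gapminder_1, as used below):

```r
library(gtsummary)
library(modelsummary)
library(broom)

# A formatted regression table (gtsummary):
tbl_regression(mod_gapminder_1)

# A publication-ready summary table (modelsummary):
modelsummary(mod_gapminder_1)

# A tidy data frame of coefficients (broom):
tidy(mod_gapminder_1, conf.int = TRUE)
```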
These functions (and libraries) are wonderful ways to review model results or begin the process of building a regression table. However, visual summaries are particularly useful for demystifying complex (and at times inscrutable) statistical models.
With that in mind, here’s a simple way we can use broom::tidy() and ggplot2 to visualize the results of our model while highlighting our parameters of interest.
mod_gapminder_1 %>%
  tidy(conf.int = TRUE) %>%
  # Zeroing in on variables of interest:
  filter(str_detect(term, "log")) %>%
  ggplot() +
  geom_pointrange(mapping = aes(x = estimate,
                                y = term,
                                xmin = conf.low,
                                xmax = conf.high))
Quick Exercise (Click to Expand)
Spend the next 10 minutes manipulating this basic coefficient plot using some of the skills you’ve developed during PopAging DataViz.
Coefficient Plots via modelplot()
Producing coefficient plots manually gives us fine control — but may take a fair bit of time as well (depending on the complexity of our models). Thankfully, the modelplot() function from modelsummary can streamline the process:
modelplot(mod_gapminder_1,
          # Omitting specific variables (e.g., controls,
          # fixed effects etc.):
          coef_omit = "Interc|contin",
          # 99% confidence intervals:
          conf_level = 0.99) +
  geom_vline(xintercept = 0, linetype = "dotted") +
  # Shading in "pointranges" based on significance level:
  aes(colour = ifelse(p.value < 0.01, "Significant", "Not significant")) +
  scale_colour_manual(values = c("grey", "red"))
To showcase the utility of modelplot(), let’s fit a model that’s a bit more complex. To do so, we’ll use the wage_panelr dataset as well as the lm_robust() function from estimatr. For some context, here’s a summary of wage_panelr:
skim(wage_panelr)
Data summary

  Name                     wage_panelr
  Number of rows           4165
  Number of columns        14
  Column type frequency:
    factor                 10
    numeric                4
  Group variables          None

Variable type: factor

  skim_variable   n_missing  complete_rate  ordered  n_unique  top_counts
  id              0          1              FALSE    595       1: 7, 2: 7, 3: 7, 4: 7
  wave            0          1              FALSE    7         1: 595, 2: 595, 3: 595, 4: 595
  blue_collar     0          1              FALSE    2         1: 2129, 0: 2036
  manufacturing   0          1              FALSE    2         0: 2518, 1: 1647
  south           0          1              FALSE    2         0: 2956, 1: 1209
  stndrd_metro    0          1              FALSE    2         1: 2723, 0: 1442
  married         0          1              FALSE    2         1: 3392, 0: 773
  black           0          1              FALSE    2         0: 3864, 1: 301
  female          0          1              FALSE    2         0: 3696, 1: 469
  union           0          1              FALSE    2         0: 2649, 1: 1516

Variable type: numeric

  skim_variable    n_missing  complete_rate   mean     sd    p0    p25    p50    p75   p100  hist
  lwage            0          1               6.68   0.46  4.61   6.4   6.68   6.95   8.54  ▁▂▇▃▁
  years_worked     0          1              19.85  10.97  1.00  11.0  18.00  29.00  51.00  ▇▇▆▅▁
  weeks_worked     0          1              46.81   5.13  5.00  46.0  48.00  50.00  52.00  ▁▁▁▁▇
  years_education  0          1              12.85   2.79  4.00  12.0  12.00  16.00  17.00  ▁▂▂▇▅
Now, let’s regress the log of individual wages on a series of covariates while clustering standard errors at the respondent level2:
mod_wages_1 <- lm_robust(lwage ~ years_worked + female + black + blue_collar +
                           south + union + manufacturing + wave,
                         data = wage_panelr,
                         clusters = id,
                         # Produces robust clustered SEs (via the
                         # delta method - as in Stata):
                         se_type = "stata")
Having stored our results in mod_wages_1, let’s summarize our model graphically:
# Labels for coefficients:
wages_labels <- c("manufacturing1" = "Manufacturing",
                  "union1" = "Union Member",
                  "south1" = "Southerner",
                  "black1" = "Black",
                  "female1" = "Female",
                  "years_worked" = "Years in Workforce",
                  "blue_collar1" = "Blue Collar Worker")

mod_wages_1 %>%
  modelplot(conf_level = 0.99,
            # String search:
            coef_omit = "Interc|wave",
            colour = "red",
            # Renaming coefficients:
            coef_map = wages_labels) +
  theme_bw(base_family = "Inconsolata") +
  # Adding reference point for "non-significant" finding:
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Coefficient Estimates",
       title = "Predicting the Log of Wages",
       subtitle = "With 99% Confidence Intervals")
As we fine-tune our model, we may be interested in testing whether the sex differences we uncovered in mod_wages_1 would persist if additional background variables were adjusted for. With this in mind, let’s include an additional regressor on the right-hand side: years_education (i.e., years of schooling).
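One way to do this, as a sketch, is to refit the model with the extra term (mod_wages_2 is a hypothetical name for the new object):

```r
# Same specification as mod_wages_1, plus years of schooling:
mod_wages_2 <- lm_robust(lwage ~ years_worked + female + black + blue_collar +
                           south + union + manufacturing + wave +
                           years_education,
                         data = wage_panelr,
                         clusters = id,
                         se_type = "stata")
```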
As these plots lay bare, introducing an additional covariate (years_education) does little to change our story.
Adjusted Predictions, Marginal Effects
Adjusted Predictions
There is a rather large body of scholarship detailing the importance of predictions as tools for clarifying model results — especially when our outcomes are discrete or interactions are brought into our statistical ecosystems (see Arel-Bundock 2023; Long and Mustillo 2021; Mize 2019). That is, however, a story for another day. For now, let’s use two wonderful libraries—ggeffects and marginaleffects—to visualize model predictions.
To do so, let’s fit a slightly more complicated version of mod_gapminder_1. Specifically, let’s allow continent to moderate the relationship between per capita GDP (logged) and life expectancy. We can specify this model as follows:
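As a sketch, the interaction model might look like this (the name of the underlying data object is an assumption; it should be a data frame that already contains the log_gdpPercap variable used in mod_gapminder_1):

```r
# Allowing continent to moderate the relationship between
# logged per capita GDP and life expectancy:
mod_gapminder_2 <- lm(lifeExp ~ log_gdpPercap * continent,
                      data = gapminder_data)
```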
Here’s how we can visualize predictions using ggeffects:
ggeffect(mod_gapminder_2,
         # Predicted life expectancy at logged GDP per capita of 6 to 11
         # for countries in Europe and Africa:
         terms = c("log_gdpPercap [6:11]", "continent [Europe, Africa]")) %>%
  plot() +
  labs(title = "Predicted Values of Life Expectancy",
       x = "Log of Per Capita GDP",
       y = "Life Expectancy (Years)",
       fill = "",
       colour = "") +
  theme_ggeffects(base_family = "IBM Plex Sans") +
  scale_color_economist() +
  scale_fill_economist() +
  theme(legend.position = "bottom") +
  # Overriding default alpha of confidence intervals:
  guides(fill = guide_legend(override.aes = list(alpha = 0.35)))
The Slightly Longer Route
We can turn the results of ggeffect() into a tibble that we feed into ggplot2:
mod_gapminder_2 %>%
  ggeffect(terms = c("log_gdpPercap [6:11]",
                     "continent [Europe, Africa]")) %>%
  as_tibble() %>%
  ggplot(mapping = aes(x = x,
                       y = predicted,
                       ymin = conf.low,
                       ymax = conf.high,
                       # group corresponds to discrete term:
                       colour = group,
                       fill = group)) +
  geom_line() +
  # For confidence intervals:
  geom_ribbon(alpha = 0.5)
Here’s how we can use marginaleffects to visualize model predictions:
plot_predictions(mod_gapminder_2,
                 condition = list("log_gdpPercap" = 6:11,
                                  "continent" = c("Africa", "Europe"))) +
  theme_ggeffects(base_family = "IBM Plex Sans") +
  labs(fill = "",
       colour = "",
       x = "Log of GDP per Capita",
       y = "Life Expectancy")
The Slightly Longer Route
We can turn the results of avg_predictions() into a tibble that we feed into ggplot2:
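For instance, here is a sketch using avg_predictions() from marginaleffects, grouping by continent (the plotting choices are illustrative):

```r
mod_gapminder_2 %>%
  # Average predicted life expectancy for each continent:
  avg_predictions(by = "continent") %>%
  as_tibble() %>%
  ggplot(mapping = aes(x = continent,
                       y = estimate,
                       ymin = conf.low,
                       ymax = conf.high)) +
  geom_pointrange()
```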
As Arel-Bundock (2023) explains, “a marginal effect measures the association between a change in a regressor \(x\), and a change in the response \(y\). Put differently, the marginal effect is the slope of the prediction function, measured at a specific value of the regressor \(x\).” Across the social and behavioural sciences, this is often the quantity of substantive interest.
Below, we use a few lines of code to estimate and visualize the average marginal effects3 associated with a key regressor (log_gdpPercap) across levels of the continent variable.
How can we transform these results into a ggplot2 object?
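One possibility, as a sketch: estimate the average slopes with avg_slopes() from marginaleffects, coerce the result to a tibble, and pipe it into ggplot() (the aesthetic choices are illustrative):

```r
mod_gapminder_2 %>%
  # Average marginal effect of logged GDP per capita,
  # computed separately by continent:
  avg_slopes(variables = "log_gdpPercap", by = "continent") %>%
  as_tibble() %>%
  ggplot(mapping = aes(x = estimate,
                       y = continent,
                       xmin = conf.low,
                       xmax = conf.high)) +
  geom_pointrange() +
  geom_vline(xintercept = 0, linetype = "dotted")
```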
Final Exercise
Time permitting, explore the gss_2010 data frame and fit a few interesting models. Then, use some of the tools covered today to visually summarize your findings.
To make matters easier, here’s every labelled variable in the gss_2010 data frame:
References
Arel-Bundock, Vincent. 2023. “marginaleffects: Predictions, Comparisons, Slopes, Marginal Means, and Hypothesis Tests. The Marginal Effects Zoo (0.15.0).” September 10, 2023. https://marginaleffects.com/.
Long, J. Scott, and Sarah A. Mustillo. 2021. “Using Predictions and Marginal Effects to Compare Groups in Regression Models for Binary Outcomes.” Sociological Methods & Research 50 (3): 1284–1320. https://doi.org/10.1177/0049124118799374.
Mize, Trenton. 2019. “Best Practices for Estimating, Interpreting, and Presenting Nonlinear Interaction Effects.” Sociological Science 6: 81–117. https://doi.org/10.15195/v6.a4.
Wickham, Hadley, Danielle Navarro, and Thomas Lin Pedersen. 2023. “ggplot2: Elegant Graphics for Data Analysis.” https://ggplot2-book.org/.
Wilkinson, Leland. 2005. The Grammar of Graphics. 2nd ed. New York: Springer.
Of course, you can also copy the source code into your console or editor by clicking the Code button on the top-right of this page and clicking View Source.↩︎
A better approach might involve fitting latent growth curves or mixed models.↩︎