Showcasing the gtsummary package for sample descriptive statistics, weighted population estimates, and model summary output.
Earlier this spring, IPUMS PMA released a harmonized version of the PMA COVID-19 survey, which comes from a telephone interview with reproductive aged women in four countries who are all participants in an ongoing panel study related to family planning and reproductive health. We’re excited to feature this urgent resource all summer long here on the Data Analysis Hub; you’ll find find future posts related to PMA COVID-19 data and our last post - an introduction to the available data - if you follow along here.
As always, one of our main goals on this blog is to introduce tools that make it easy for anyone to explore new data and develop new ideas for research projects. Today, we’ll be diving into a topic that has probably frustrated everyone who has ever presented or published statistical findings at one time or another: making publication-ready tables.
If you talk to students or colleagues who use R, you might be surprised to learn that many of us don’t actually use statistical software to make the tables you see when you read an academic article. In reality, plenty of us just use R to make a model, and then we copy and paste the output into a table we make by hand with Microsoft Word! This can save lots of time, and it’s a perfectly reasonable solution if you know that you can make exactly what you want with the tools that Word provides.
You might consider making tables with R if you’ve ever found yourself:
In this post, we’ll show you how to get up and running with flexible, easy-to-make tables using gtsummary, an R package that builds on the same tidyverse conventions we’ve featured elsewhere on this blog. There are a lot of different packages available to help make tables with R, but - as we’ll see - we love gtsummary
because we think it allows users to maximize choice of style and output formats, all while minimizing the amount of code necessary to implement those choices.
In our last post, we explained that you’ll find all of the PMA COVID-19 survey data if you select the new COVID-19 Unit of Analysis in the IPUMS PMA data extract system.
In this post, we’ll work with a data extract containing the following variables. You’ll be able to follow along with our coding examples if you create and download an extract containing all four samples (Female Respondents only) and these variables:
You’ll also need to install these R packages if you don’t haven’t done so before (current versions are recommended):
When you’ve finished downloading your data extract and installing all of these packages, load the packages and use read_ipums_micro()
to load the data extract into R (make sure to change the file paths to match your own extract):
As a reminder: women who were interviewed for the PMA COVID-19 survey are participants in a ongoing panel study focused on core PMA topics in reproductive health. The baseline survey for this panel study was conducted just a few months prior to the COVID-19 survey (between November 2019 and February 2020), but data from the baseline survey are not included in the COVID-19 dataset you’ll download here. We will show how to locate and merge data from the baseline survey in an upcoming post in this series. The COVID-19 survey data are structured like all of the other cross-sectional survey datasets available from IPUMS PMA: each woman’s responses are stored in a single row.
covid
# A tibble: 12,184 × 12
SAMPLE COUNTRY YEAR ROUND EAID CONSENTFQ CVQWEIGHT AGE
<int+lbl> <int+lb> <int> <dbl> <dbl> <int+lbl> <dbl> <int>
1 85411 [Burki… 1 [Burk… 2020 1 8.54e8 1 [Yes] 2.13 35
2 85411 [Burki… 1 [Burk… 2020 1 8.54e8 1 [Yes] 0.309 29
3 85411 [Burki… 1 [Burk… 2020 1 8.54e8 1 [Yes] 0.624 25
4 85411 [Burki… 1 [Burk… 2020 1 8.54e8 1 [Yes] 0.265 38
5 85411 [Burki… 1 [Burk… 2020 1 8.54e8 1 [Yes] 0.133 30
6 85411 [Burki… 1 [Burk… 2020 1 8.54e8 1 [Yes] 0.944 16
7 85411 [Burki… 1 [Burk… 2020 1 8.54e8 1 [Yes] 0.361 29
8 85411 [Burki… 1 [Burk… 2020 1 8.54e8 1 [Yes] 0.234 29
9 85411 [Burki… 1 [Burk… 2020 1 8.54e8 1 [Yes] 2.99 41
10 85411 [Burki… 1 [Burk… 2020 1 8.54e8 1 [Yes] 0.194 27
# … with 12,174 more rows, and 4 more variables: MARSTAT <int+lbl>,
# EDUCATTGEN <int+lbl>, URBAN <int+lbl>, HLTHCAREDIFFFEAR <int+lbl>
The great thing about gtsummary
is that you can make a high quality table with just one line of code, but you can also customize any element of your table and easily apply custom styling (you can choose between several journal-specific themes or create your own). And, unlike many of other table-making packages for R, gtsummary
supports printing directly to HTML, PDF, Word, and Rich Text Format.
Because gtsummary
is designed with tidyverse
users in-mind, you can pipe functions like dplyr::select
directly into the function tbl_summary, which will then identify the object class for each variable and calculate default summary statistics accordingly. To demonstrate, we’ll select
a few demographic variables, then break them down by COUNTRY
in a basic call to tbl_summary()
:
covid %>%
select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>%
tbl_summary(by = COUNTRY)
Characteristic | 1, N = 3,5281 | 2, N = 1,3241 | 7, N = 5,9861 | 9, N = 1,3461 |
---|---|---|---|---|
Age in female respondent questionnaire | 29 (22, 36) | 28 (22, 36) | 30 (24, 38) | 31 (24, 39) |
Marital status, female questionnaire | ||||
10 | 882 (25%) | 605 (46%) | 1,206 (20%) | 348 (26%) |
21 | 2,187 (62%) | 450 (34%) | 3,717 (62%) | 855 (64%) |
22 | 254 (7.2%) | 170 (13%) | 367 (6.1%) | 26 (1.9%) |
31 | 126 (3.6%) | 82 (6.2%) | 493 (8.2%) | 77 (5.7%) |
32 | 78 (2.2%) | 16 (1.2%) | 201 (3.4%) | 39 (2.9%) |
98 | 1 (<0.1%) | 1 (<0.1%) | 2 (<0.1%) | 1 (<0.1%) |
Highest level of school attended, general (4 categories) | ||||
1 | 1,213 (34%) | 4 (0.3%) | 119 (2.0%) | 86 (6.4%) |
2 | 702 (20%) | 40 (3.0%) | 2,598 (43%) | 134 (10.0%) |
3 | 1,374 (39%) | 884 (67%) | 2,217 (37%) | 637 (47%) |
4 | 239 (6.8%) | 395 (30%) | 1,052 (18%) | 486 (36%) |
98 | 0 (0%) | 1 (<0.1%) | 0 (0%) | 3 (0.2%) |
Urban/rural status | 2,636 (75%) | 0 (NA%) | 2,289 (38%) | 1,182 (88%) |
Unknown | 0 | 1,324 | 0 | 0 |
1
Median (IQR); n (%)
|
ℹ Column(s) AGE, MARSTAT, EDUCATTGEN, URBAN, and COUNTRY are class
'haven_labelled'. This is an intermediate datastructure not meant for analysis.
Convert columns with `haven::as_factor()`,`labelled::to_factor()`,
`labelled::unlabelled()`, and `unclass()`. 'haven_labelled' value labels are
ignored when columns are not converted. Failure to convert may have unintended
consequences or result in error.
• https://haven.tidyverse.org/articles/semantics.html
• https://larmarange.github.io/labelled/articles/intro_labelled.html#unlabelled
This is a very helpful starting point, but it’s certainly not a finished product yet. The biggest issue is related to the alert message shown above: as we’ve discussed in previous posts, the categorical variables you’ll find in IPUMS data extracts are usually imported as haven_labelled
objects, rather than the more common factor
class of objects. In practice, this means that every response option from the questionnaire has both a value and a label:
# A tibble: 6 × 2
MARSTAT n
<int+lbl> <int>
1 10 [Never married] 3041
2 21 [Currently married] 7209
3 22 [Currently living with partner] 817
4 31 [Divorced or separated] 778
5 32 [Widow or widower] 334
6 98 [No response or missing] 5
The variable MARSTAT
is a haven_labelled
object where the value of each response is an integer (10, 21, 22, 31, 32, or 98), and the label describing each value is shown in square brackets to the right.
When gtsummary
warns you that
This is an intermediate datastructure not meant for analysis
…it’s referring to the fact that labels are only an attribute of the variable. Attributes are metadata meant to assist the analyst running R in real-time, but they aren’t typically used by R in graphics or computational analysis.
In our table, gtsummary
displays the numeric value for each response, rather than the much more readable labels. The easiest way to change this behavior is to coerce all of our categorical variables to the factor
object class.
We recommend dividing this process into three steps:
No response or missing
shown for MARSTAT
. We’ll want to exclude these from our table, so we’ll convert them to the generic missing value NA
with ipumsr::lbl_na_if()
.AGE
in our dataset is labelled because the values 90 through 99 can represent either a respondent’s age in years (if not labelled) or a non-response code (if labelled). After converting non-response codes to NA
in step 1, we’ll want to change the class of variables like AGE
with ipumsr::zap_labels()
.ipumsr::as_factor()
, and remove any unused response options with fct_drop()
.covid <- covid %>%
mutate(
across(everything(), ~lbl_na_if(
.x,
~.lbl %in% c(
"Logical edit - missing",
"Not interviewed (female questionnaire)",
"Not interviewed (household questionnaire)",
"Don't know",
"No response or missing",
"NIU (not in universe)"
)
)),
across(AGE, zap_labels),
across(where(is.labelled), ~as_factor(.x) %>% fct_drop)
)
Coercing categorical variables as factors will greatly improve the readability of our table. Because we’ve introduced NA
values, we’ll add the argument missing = "no"
to exclude them from our table:
covid %>%
select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>%
tbl_summary(by = COUNTRY, missing = "no")
Characteristic | Burkina Faso, N = 3,5281 | Congo, Democratic Republic, N = 1,3241 | Kenya, N = 5,9861 | Nigeria, N = 1,3461 |
---|---|---|---|---|
Age in female respondent questionnaire | 29 (22, 36) | 28 (22, 36) | 30 (24, 38) | 31 (24, 39) |
Marital status, female questionnaire | ||||
Never married | 882 (25%) | 605 (46%) | 1,206 (20%) | 348 (26%) |
Currently married | 2,187 (62%) | 450 (34%) | 3,717 (62%) | 855 (64%) |
Currently living with partner | 254 (7.2%) | 170 (13%) | 367 (6.1%) | 26 (1.9%) |
Divorced or separated | 126 (3.6%) | 82 (6.2%) | 493 (8.2%) | 77 (5.7%) |
Widow or widower | 78 (2.2%) | 16 (1.2%) | 201 (3.4%) | 39 (2.9%) |
Highest level of school attended, general (4 categories) | ||||
Never attended | 1,213 (34%) | 4 (0.3%) | 119 (2.0%) | 86 (6.4%) |
Primary/Middle school | 702 (20%) | 40 (3.0%) | 2,598 (43%) | 134 (10.0%) |
Secondary/post-primary | 1,374 (39%) | 884 (67%) | 2,217 (37%) | 637 (47%) |
Tertiary/post-secondary | 239 (6.8%) | 395 (30%) | 1,052 (18%) | 486 (36%) |
Urban/rural status | ||||
Rural | 892 (25%) | 0 (NA%) | 3,697 (62%) | 164 (12%) |
Urban | 2,636 (75%) | 0 (NA%) | 2,289 (38%) | 1,182 (88%) |
1
Median (IQR); n (%)
|
You may have noticed that gtsummary
ignored the value labels for each of our haven_labelled
variables before we converted them into factor
variables, but it did find and use variable labels. For example, you see the variable label Age in female respondent questionnaire
instead of the variable name AGE
.
While this is sometimes helpful behavior, we feel that the word “Age” would have been fine on its own. Likewise, we’d like to clean up the labels for MARSTAT
, EDUCATTGEN
, and URBAN
to make them as concise as possible.
You can override the ipumsr
value labels in your table without changing the underlying data. Just add them as a list of formulas via the label
argument: the variable name goes on the left of ~
, and a character string containing the desired label goes on the right.
There are a few supporting functions that allow you to stylize these labels. We’ll use italicize_labels()
to print our new labels in italics.
covid %>%
select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>%
tbl_summary(
by = COUNTRY,
missing = "no",
label = list(
AGE ~ "Age",
MARSTAT ~ "Marital status",
EDUCATTGEN ~ "Education",
URBAN ~ "Urban vs Rural"
)
) %>%
italicize_labels()
Characteristic | Burkina Faso, N = 3,5281 | Congo, Democratic Republic, N = 1,3241 | Kenya, N = 5,9861 | Nigeria, N = 1,3461 |
---|---|---|---|---|
Age | 29 (22, 36) | 28 (22, 36) | 30 (24, 38) | 31 (24, 39) |
Marital status | ||||
Never married | 882 (25%) | 605 (46%) | 1,206 (20%) | 348 (26%) |
Currently married | 2,187 (62%) | 450 (34%) | 3,717 (62%) | 855 (64%) |
Currently living with partner | 254 (7.2%) | 170 (13%) | 367 (6.1%) | 26 (1.9%) |
Divorced or separated | 126 (3.6%) | 82 (6.2%) | 493 (8.2%) | 77 (5.7%) |
Widow or widower | 78 (2.2%) | 16 (1.2%) | 201 (3.4%) | 39 (2.9%) |
Education | ||||
Never attended | 1,213 (34%) | 4 (0.3%) | 119 (2.0%) | 86 (6.4%) |
Primary/Middle school | 702 (20%) | 40 (3.0%) | 2,598 (43%) | 134 (10.0%) |
Secondary/post-primary | 1,374 (39%) | 884 (67%) | 2,217 (37%) | 637 (47%) |
Tertiary/post-secondary | 239 (6.8%) | 395 (30%) | 1,052 (18%) | 486 (36%) |
Urban vs Rural | ||||
Rural | 892 (25%) | 0 (NA%) | 3,697 (62%) | 164 (12%) |
Urban | 2,636 (75%) | 0 (NA%) | 2,289 (38%) | 1,182 (88%) |
1
Median (IQR); n (%)
|
What about the statistics that tbl_summary()
calculates? By default, tbl_summary()
reports the median (and IQR) for integer variables like AGE
, and it reports the frequency (and percentage) of each level for all of the factor
variables we’ve created.
You can change the statistics calculated for one or more variables by name, or you can change them for all variables of a similar type (e.g. all_categorical
). You’ll need to choose or define a custom function, then provide it to the stat
argument as a character string between curly brackets like this:
"{mean}"
Here, we’ll demonstrate how to calculate the mean (and standard deviation) for AGE
, and the percentage for all responses to all factors / categorical variables.
You may also decide to feature information about which statistics were calculated more prominently alongside the variable labels, rather than in the footer. Simply pipe your table to the function add_stat_label()
as shown below:
covid %>%
select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>%
tbl_summary(
by = COUNTRY,
missing = "no",
label = list(
AGE ~ "Age",
MARSTAT ~ "Marital status",
EDUCATTGEN ~ "Education",
URBAN ~ "Urban vs Rural"
),
stat = list(
AGE ~ "{mean} ({sd})",
all_categorical() ~"{p}"
)
)%>%
italicize_labels() %>%
add_stat_label()
Characteristic | Burkina Faso, N = 3,528 | Congo, Democratic Republic, N = 1,324 | Kenya, N = 5,986 | Nigeria, N = 1,346 |
---|---|---|---|---|
Age, Mean (SD) | 30 (9) | 29 (9) | 31 (9) | 31 (9) |
Marital status, % | ||||
Never married | 25 | 46 | 20 | 26 |
Currently married | 62 | 34 | 62 | 64 |
Currently living with partner | 7.2 | 13 | 6.1 | 1.9 |
Divorced or separated | 3.6 | 6.2 | 8.2 | 5.7 |
Widow or widower | 2.2 | 1.2 | 3.4 | 2.9 |
Education, % | ||||
Never attended | 34 | 0.3 | 2.0 | 6.4 |
Primary/Middle school | 20 | 3.0 | 43 | 10.0 |
Secondary/post-primary | 39 | 67 | 37 | 47 |
Tertiary/post-secondary | 6.8 | 30 | 18 | 36 |
Urban vs Rural, % | ||||
Rural | 25 | NA | 62 | 12 |
Urban | 75 | NA | 38 | 88 |
Table headers and footnotes can be customized with the functions modify_header()
and modify_footnote()
. We’ll remove the frequencies from our header, use the abbreviated label DR Congo
, and remove the label Characteristic
from the first column. Lastly, we’ll add a title to our table with modify_spanning_header()
.
covid %>%
select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>%
tbl_summary(
by = COUNTRY,
missing = "no",
stat = list(
AGE ~ "{mean} ({sd})",
all_categorical() ~"{p}"
),
label = list(
AGE ~ "Age",
MARSTAT ~ "Marital status",
EDUCATTGEN ~ "Education",
URBAN ~ "Urban vs Rural"
)
) %>%
add_stat_label() %>%
italicize_labels() %>%
modify_header(update = list(
label ~ " ",
stat_1 ~ "**Burkina Faso**",
stat_2 ~ "**DR Congo <br> (Kinshasa)**",
stat_3 ~ "**Kenya**",
stat_4 ~ "**Nigeria <br> (Lagos & Kano)**"
)) %>%
modify_spanning_header(
everything() ~ "## Sample Demographics"
)
Sample Demographics |
||||
---|---|---|---|---|
Burkina Faso | DR Congo (Kinshasa) |
Kenya | Nigeria (Lagos & Kano) |
|
Age, Mean (SD) | 30 (9) | 29 (9) | 31 (9) | 31 (9) |
Marital status, % | ||||
Never married | 25 | 46 | 20 | 26 |
Currently married | 62 | 34 | 62 | 64 |
Currently living with partner | 7.2 | 13 | 6.1 | 1.9 |
Divorced or separated | 3.6 | 6.2 | 8.2 | 5.7 |
Widow or widower | 2.2 | 1.2 | 3.4 | 2.9 |
Education, % | ||||
Never attended | 34 | 0.3 | 2.0 | 6.4 |
Primary/Middle school | 20 | 3.0 | 43 | 10.0 |
Secondary/post-primary | 39 | 67 | 37 | 47 |
Tertiary/post-secondary | 6.8 | 30 | 18 | 36 |
Urban vs Rural, % | ||||
Rural | 25 | NA | 62 | 12 |
Urban | 75 | NA | 38 | 88 |
We mentioned in our last post that the PMA COVID-19 survey comes with a new weighting variable, CVQWEIGHT, which is analogous to the variable FQWEIGHT found in other Household and Female samples. CVQWEIGHT
can be used to estimate all of the statistics shown in our table for the broader population represented by each sample (note that two of the samples are not nationally representative):
The srvyr package includes several functions that make it easy to incorporate survey weights into a tidy workflow. Simply provide information about the survey design to srvyr::as_survey_design()
, and then pipe this information to a survey analysis function.
For example, a tidy workflow calculating the mean AGE
of women in each of the four samples might look like this:
# A tibble: 4 × 2
COUNTRY `mean(AGE)`
<fct> <dbl>
1 Burkina Faso 29.7
2 Congo, Democratic Republic 29.5
3 Kenya 31.1
4 Nigeria 31.4
You can use CVQWEIGHT
to estimate the mean AGE
of reproductive age women in each of the four target populations like this:
covid %>%
as_survey_design(weight = CVQWEIGHT) %>%
group_by(COUNTRY) %>%
summarise(survey_mean(AGE))
# A tibble: 4 × 3
COUNTRY coef `_se`
<fct> <dbl> <dbl>
1 Burkina Faso 27.9 0.370
2 Congo, Democratic Republic 28.9 0.327
3 Kenya 28.9 0.186
4 Nigeria 30.4 0.329
Happily, gtsummary
can read survey information from the function as_survey_design()
. This saves us the trouble of calculating weighted statistics for each of the variables in our table; instead, we can create weighted statistics for our table with just one line of code if we swap tbl_summary
for its companion function, tbl_svysummary()
:
covid %>%
as_survey_design(weight = CVQWEIGHT) %>% # CVQWEIGHT goes here
select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>%
tbl_svysummary( # use tbl_svysummary() in place of tbl_summary()
by = COUNTRY,
missing = "no",
stat = list(
AGE ~ "{mean} ({sd})",
all_categorical() ~"{p}"
),
label = list(
AGE ~ "Age",
MARSTAT ~ "Marital status",
EDUCATTGEN ~ "Education",
URBAN ~ "Urban vs Rural"
)
) %>%
italicize_labels() %>%
add_stat_label() %>%
modify_header(update = list(
label ~ " ",
stat_1 ~ "**Burkina Faso**",
stat_2 ~ "**DR Congo <br> (Kinshasa)**",
stat_3 ~ "**Kenya**",
stat_4 ~ "**Nigeria <br> (Lagos & Kano)**"
)) %>%
modify_spanning_header(
everything() ~ "## Weighted Population Estimates"
)
Weighted Population Estimates |
||||
---|---|---|---|---|
Burkina Faso | DR Congo (Kinshasa) |
Kenya | Nigeria (Lagos & Kano) |
|
Age, Mean (SD) | 28 (9) | 29 (9) | 29 (9) | 30 (9) |
Marital status, % | ||||
Never married | 22 | 47 | 30 | 27 |
Currently married | 69 | 31 | 54 | 63 |
Currently living with partner | 6.1 | 13 | 5.8 | 2.4 |
Divorced or separated | 1.9 | 6.3 | 7.3 | 5.1 |
Widow or widower | 1.4 | 1.9 | 2.8 | 2.3 |
Education, % | ||||
Never attended | 55 | 0.4 | 1.9 | 9.8 |
Primary/Middle school | 19 | 5.7 | 46 | 13 |
Secondary/post-primary | 24 | 72 | 39 | 46 |
Tertiary/post-secondary | 1.8 | 22 | 13 | 31 |
Urban vs Rural, % | ||||
Rural | 79 | NA | 73 | 18 |
Urban | 21 | NA | 27 | 82 |
We’ve now made a weighted descriptive statistics table, and we’ve only changed two lines of code. As we’ll see, creating a summary table from a model that uses sample design information can be just as easy.
The gtsummary
package also contains a function designed to summarise and format output from regression models. For example, let’s build a simple logistic regression model for HLTHCAREDIFFFEAR
, which indicates whether a woman experienced difficulty accessing healthcare because she was afraid of becoming infected with COVID-19. We’ll try modeling this outcome using the demographic variables that are available for all four samples: AGE
, MARSTAT
, and EDUCATTGEN
(URBAN
was not available for the DRC sample).
First, we’ll recode HLTHCAREDIFFFEAR
into a binary indicator that’s suitable for use in a logistic regression model. We can collapse responses “No” and “None of the above”, since the latter indicates that the woman experienced no difficulties accessing healthcare at all.
# A tibble: 4 × 2
HLTHCAREDIFFFEAR n
<fct> <int>
1 No 1101
2 Yes 3881
3 None of the above 7114
4 <NA> 88
covid <- covid %>%
mutate(HLTHCAREDIFFFEAR = fct_collapse(HLTHCAREDIFFFEAR, No = c(
"No",
"None of the above"
)))
covid %>% count(HLTHCAREDIFFFEAR)
# A tibble: 3 × 2
HLTHCAREDIFFFEAR n
<fct> <int>
1 No 8215
2 Yes 3881
3 <NA> 88
Because IPUMS PMA samples are collected with geographic clusters - represented by EAID
- we generally recommend specifying both a sample weight and a cluster identification with as_survey_design()
. We’ll need to use a function that can build a general linear model using that survey design information, so we’ll use survey::svyglm()
, rather than the base R function glm
that might be more familiar. The output of a call to svyglm()
is not formatted as a publication-ready table. Let’s see what happens when we build a model for our Burkina Faso sample:
covid %>%
filter(COUNTRY == "Burkina Faso") %>%
as_survey_design(weight = CVQWEIGHT, id = EAID) %>%
survey::svyglm(
formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN,
family = "quasibinomial"
)
1 - level Cluster Sampling design (with replacement)
With (164) clusters.
Called via srvyr
Sampling variables:
- ids: EAID
- weights: CVQWEIGHT
Call: svyglm(formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN,
design = ., family = "quasibinomial")
Coefficients:
(Intercept)
-0.7495
AGE
-0.0128
MARSTATCurrently married
0.5784
MARSTATCurrently living with partner
-0.3228
MARSTATDivorced or separated
0.3920
MARSTATWidow or widower
0.1639
EDUCATTGENPrimary/Middle school
-0.0782
EDUCATTGENSecondary/post-primary
-0.3467
EDUCATTGENTertiary/post-secondary
0.1427
Degrees of Freedom: 3526 Total (i.e. Null); 155 Residual
(1 observation deleted due to missingness)
Null Deviance: 4381
Residual Deviance: 4299 AIC: NA
The function gtsummary::tbl_regression()
will tidy this output into a much more reader-friendly format. We’ll assign labels to each of our variables using the same label
argument we saw before, and we’ll also choose to exponentiate our regression coefficients so that the results will reflect odds ratios:
covid %>%
filter(COUNTRY == "Burkina Faso") %>%
as_survey_design(weight = CVQWEIGHT, id = EAID) %>%
survey::svyglm(
formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN,
family = "quasibinomial"
) %>%
tbl_regression(
exponentiate = TRUE,
label = list(
AGE ~ "Age",
MARSTAT ~ "Marital status",
EDUCATTGEN ~ "Education"
)
)
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
Age | 0.99 | 0.96, 1.01 | 0.3 |
Marital status | |||
Never married | — | — | |
Currently married | 1.78 | 1.02, 3.13 | 0.043 |
Currently living with partner | 0.72 | 0.25, 2.06 | 0.5 |
Divorced or separated | 1.48 | 0.69, 3.18 | 0.3 |
Widow or widower | 1.18 | 0.41, 3.42 | 0.8 |
Education | |||
Never attended | — | — | |
Primary/Middle school | 0.92 | 0.61, 1.41 | 0.7 |
Secondary/post-primary | 0.71 | 0.44, 1.12 | 0.14 |
Tertiary/post-secondary | 1.15 | 0.66, 2.02 | 0.6 |
1
OR = Odds Ratio, CI = Confidence Interval
|
We can also add conventional “stars” representing the significance of each coefficient with the add_significance_stars()
function. Here, we could choose to display cluster-robust standard error estimates, but we’ll display 95% confidence intervals instead. We’ll also customize the header and add a title, using the same modify
functions shown above.
covid %>%
filter(COUNTRY == "Burkina Faso") %>%
as_survey_design(weight = CVQWEIGHT, id = EAID) %>%
survey::svyglm(
formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN,
family = "quasibinomial"
) %>%
tbl_regression(
exp = TRUE,
label = list(
AGE ~ "Age",
MARSTAT ~ "Marital status",
EDUCATTGEN ~ "Education"
)
) %>%
modify_footnote(everything() ~ NA, abbreviation = TRUE) %>%
add_significance_stars(hide_se = TRUE, hide_ci = FALSE) %>%
modify_header(update = list(
label ~ " " ,
estimate ~ '**Burkina Faso**',
ci ~ "95% CI"
)) %>%
modify_spanning_header(
everything() ~ "## Odds Ratios obtained from Logistic Regression"
)
Odds Ratios obtained from Logistic Regression |
||
---|---|---|
Burkina Faso1 | 95% CI | |
Age | 0.99 | 0.96, 1.01 |
Marital status | ||
Never married | — | — |
Currently married | 1.78* | 1.02, 3.13 |
Currently living with partner | 0.72 | 0.25, 2.06 |
Divorced or separated | 1.48 | 0.69, 3.18 |
Widow or widower | 1.18 | 0.41, 3.42 |
Education | ||
Never attended | — | — |
Primary/Middle school | 0.92 | 0.61, 1.41 |
Secondary/post-primary | 0.71 | 0.44, 1.12 |
Tertiary/post-secondary | 1.15 | 0.66, 2.02 |
1
*p<0.05; **p<0.01; ***p<0.001
|
Researchers often report results from several models if, for example, they want to compare results including a range of different controls. You might also decide to model data from each of the COVID-19 samples separately in order to highlight important differences between their target populations. How would you merge results from four models into a single table?
If you repeat the same code shown above for each of the four samples, you’ll obtain four separate tables. We recommend using purrr::map()
to store these tables in a list, which you can then pass to gtsummary::tbl_merge()
. Here, we map
over each of the factor levels in COUNTRY
, using tbl_regression()
to build a regression table for each. We pipe a list of four tables to tbl_merge()
, and then add a header and title.
levels(covid$COUNTRY) %>%
map(~{
covid %>%
as_survey_design(weight = CVQWEIGHT) %>%
filter(COUNTRY == .x) %>%
survey::svyglm(
formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN,
family = "quasibinomial"
) %>%
tbl_regression(
exp = TRUE,
label = list(
AGE ~ "Age",
MARSTAT ~ "Marital status",
EDUCATTGEN ~ "Education"
)
) %>%
italicize_labels() %>%
modify_footnote(everything() ~ NA, abbreviation = TRUE) %>%
add_significance_stars(hide_se = T, hide_ci = F) %>%
modify_header(update = list(ci ~ "95% CI"))
}) %>%
tbl_merge() %>%
modify_header(update = list(
label ~ " " ,
estimate_1 ~ '**Burkina Faso**',
estimate_2 ~ '**DR Congo <br> (Kinshasa)**',
estimate_3 ~ '**Kenya**',
estimate_4 ~ '**Nigeria <br> (Lagos & Kano)**'
)) %>%
modify_spanning_header(
everything() ~ "## Odds Ratios obtained from Logistic Regression"
)
Odds Ratios obtained from Logistic Regression |
||||||||
---|---|---|---|---|---|---|---|---|
Burkina Faso1 | 95% CI | DR Congo (Kinshasa)1 |
95% CI | Kenya1 | 95% CI | Nigeria (Lagos & Kano)1 |
95% CI | |
Age | 0.99 | 0.97, 1.01 | 0.97* | 0.95, 1.00 | 0.98*** | 0.97, 0.99 | 0.99 | 0.97, 1.01 |
Marital status | ||||||||
Never married | — | — | — | — | — | — | — | — |
Currently married | 1.78 | 0.92, 3.45 | 1.16 | 0.73, 1.83 | 1.21 | 0.97, 1.52 | 1.64 | 0.95, 2.83 |
Currently living with partner | 0.72 | 0.32, 1.62 | 0.55 | 0.27, 1.11 | 0.81 | 0.59, 1.11 | 2.81 | 0.94, 8.40 |
Divorced or separated | 1.48 | 0.50, 4.35 | 1.94 | 0.95, 3.98 | 1.26 | 0.92, 1.73 | 1.10 | 0.49, 2.49 |
Widow or widower | 1.18 | 0.38, 3.62 | 0.70 | 0.08, 6.11 | 1.09 | 0.72, 1.65 | 0.39 | 0.10, 1.56 |
Education | ||||||||
Never attended | — | — | — | — | — | — | — | — |
Primary/Middle school | 0.92 | 0.58, 1.47 | 0.39 | 0.02, 6.06 | 1.50 | 0.95, 2.38 | 0.85 | 0.33, 2.17 |
Secondary/post-primary | 0.71 | 0.43, 1.16 | 0.99 | 0.10, 9.67 | 1.31 | 0.82, 2.09 | 0.80 | 0.37, 1.73 |
Tertiary/post-secondary | 1.15 | 0.71, 1.88 | 1.29 | 0.13, 12.7 | 1.47 | 0.91, 2.37 | 0.78 | 0.35, 1.74 |
1
*p<0.05; **p<0.01; ***p<0.001
|
So now you know that you can make a great-looking table in R that renders nicely in HTML (you’re reading this post on a web page, after all). But what if you want to export your table to a Word document, PDF, or some other format?
For Word documents, try piping your table to the function gtsummary::as_flex_table()
like this, and then copy / paste the result directly into Word. (You may need to install the package flextable
first).
covid %>%
select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>%
tbl_summary(by = COUNTRY) %>%
as_flex_table()
Several printers work well for PDF output, including the very popular kable_extra
. If you’re making a PDF with RMarkdown, for example, you could pipe your table into gtsummary::as_kable_extra()
. (You may need to install the package kable_extra
first).
covid %>%
select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>%
tbl_summary(by = COUNTRY) %>%
as_kable_extra()
You’ll find more information about output options here.
We hope you found these steps helpful in using R to create tables for descriptive statistics, survey weights, and model summary output. As always, feel free to reach out to us with any questions. In our next post, we will show how to make likert-style stacked bar charts with R using the PMA COVID-19 survey data. Check back here in two weeks for the next post in this series!
If you see mistakes or want to suggest changes, please create an issue on the source repository.