Making Tables with PMA COVID-19 Data

COVID-19 gtsummary srvyr purrr Descriptive Analysis Data Manipulation Weights

Showcasing the gtsummary package for sample descriptive statistics, weighted population estimates, and model summary output.

Shelby Rutzick https://www.linkedin.com/in/shelby-rutzick/ (IPUMS PMA Graduate Research Assistant) , Matt Gunther (IPUMS PMA Senior Data Analyst)
07-01-2021

Earlier this spring, IPUMS PMA released a harmonized version of the PMA COVID-19 survey, which comes from a telephone interview with reproductive aged women in four countries who are all participants in an ongoing panel study related to family planning and reproductive health. We’re excited to feature this urgent resource all summer long here on the Data Analysis Hub; you’ll find find future posts related to PMA COVID-19 data and our last post - an introduction to the available data - if you follow along here.

As always, one of our main goals on this blog is to introduce tools that make it easy for anyone to explore new data and develop new ideas for research projects. Today, we’ll be diving into a topic that has probably frustrated everyone who has ever presented or published statistical findings at one time or another: making publication-ready tables.

If you talk to students or colleagues who use R, you might be surprised to learn that many of us don’t actually use statistical software to make the tables you see when you read an academic article. In reality, plenty of us just use R to make a model, and then we copy and paste the output into a table we make by hand with Microsoft Word! This can save lots of time, and it’s a perfectly reasonable solution if you know that you can make exactly what you want with the tools that Word provides.

You might consider making tables with R if you’ve ever found yourself:

In this post, we’ll show you how to get up and running with flexible, easy-to-make tables using gtsummary, an R package that builds on the same tidyverse conventions we’ve featured elsewhere on this blog. There are a lot of different packages available to help make tables with R, but - as we’ll see - we love gtsummary because we think it allows users to maximize choice of style and output formats, all while minimizing the amount of code necessary to implement those choices.

Setup

In our last post, we explained that you’ll find all of the PMA COVID-19 survey data if you select the new COVID-19 Unit of Analysis in the IPUMS PMA data extract system.

In this post, we’ll work with a data extract containing the following variables. You’ll be able to follow along with our coding examples if you create and download an extract containing all four samples (Female Respondents only) and these variables:

You’ll also need to install these R packages if you don’t haven’t done so before (current versions are recommended):

When you’ve finished downloading your data extract and installing all of these packages, load the packages and use read_ipums_micro() to load the data extract into R (make sure to change the file paths to match your own extract):

library(ipumsr)
library(tidyverse)
library(survey)
library(srvyr)
library(gtsummary)

covid <- read_ipums_micro(
  ddi = "data/pma_00032.xml",
  dat = "data/pma_00032.dat.gz"
)

As a reminder: women who were interviewed for the PMA COVID-19 survey are participants in a ongoing panel study focused on core PMA topics in reproductive health. The baseline survey for this panel study was conducted just a few months prior to the COVID-19 survey (between November 2019 and February 2020), but data from the baseline survey are not included in the COVID-19 dataset you’ll download here. We will show how to locate and merge data from the baseline survey in an upcoming post in this series. The COVID-19 survey data are structured like all of the other cross-sectional survey datasets available from IPUMS PMA: each woman’s responses are stored in a single row.

covid
# A tibble: 12,184 × 12
          SAMPLE  COUNTRY  YEAR ROUND   EAID CONSENTFQ CVQWEIGHT   AGE
       <int+lbl> <int+lb> <int> <dbl>  <dbl> <int+lbl>     <dbl> <int>
 1 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     2.13     35
 2 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.309    29
 3 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.624    25
 4 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.265    38
 5 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.133    30
 6 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.944    16
 7 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.361    29
 8 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.234    29
 9 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     2.99     41
10 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.194    27
# … with 12,174 more rows, and 4 more variables: MARSTAT <int+lbl>,
#   EDUCATTGEN <int+lbl>, URBAN <int+lbl>, HLTHCAREDIFFFEAR <int+lbl>

Descriptive Statistics Table

The great thing about gtsummary is that you can make a high quality table with just one line of code, but you can also customize any element of your table and easily apply custom styling (you can choose between several journal-specific themes or create your own). And, unlike many of other table-making packages for R, gtsummary supports printing directly to HTML, PDF, Word, and Rich Text Format.

Because gtsummary is designed with tidyverse users in-mind, you can pipe functions like dplyr::select directly into the function tbl_summary, which will then identify the object class for each variable and calculate default summary statistics accordingly. To demonstrate, we’ll select a few demographic variables, then break them down by COUNTRY in a basic call to tbl_summary():

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(by = COUNTRY)
Characteristic 1, N = 3,5281 2, N = 1,3241 7, N = 5,9861 9, N = 1,3461
Age in female respondent questionnaire 29 (22, 36) 28 (22, 36) 30 (24, 38) 31 (24, 39)
Marital status, female questionnaire
10 882 (25%) 605 (46%) 1,206 (20%) 348 (26%)
21 2,187 (62%) 450 (34%) 3,717 (62%) 855 (64%)
22 254 (7.2%) 170 (13%) 367 (6.1%) 26 (1.9%)
31 126 (3.6%) 82 (6.2%) 493 (8.2%) 77 (5.7%)
32 78 (2.2%) 16 (1.2%) 201 (3.4%) 39 (2.9%)
98 1 (<0.1%) 1 (<0.1%) 2 (<0.1%) 1 (<0.1%)
Highest level of school attended, general (4 categories)
1 1,213 (34%) 4 (0.3%) 119 (2.0%) 86 (6.4%)
2 702 (20%) 40 (3.0%) 2,598 (43%) 134 (10.0%)
3 1,374 (39%) 884 (67%) 2,217 (37%) 637 (47%)
4 239 (6.8%) 395 (30%) 1,052 (18%) 486 (36%)
98 0 (0%) 1 (<0.1%) 0 (0%) 3 (0.2%)
Urban/rural status 2,636 (75%) 0 (NA%) 2,289 (38%) 1,182 (88%)
Unknown 0 1,324 0 0

1 Median (IQR); n (%)

ℹ Column(s) AGE, MARSTAT, EDUCATTGEN, URBAN, and COUNTRY are class 
'haven_labelled'. This is an intermediate datastructure not meant for analysis. 
Convert columns with `haven::as_factor()`,`labelled::to_factor()`, 
`labelled::unlabelled()`, and `unclass()`. 'haven_labelled' value labels are 
ignored when columns are not converted. Failure to convert may have unintended 
consequences or result in error.

• https://haven.tidyverse.org/articles/semantics.html
• https://larmarange.github.io/labelled/articles/intro_labelled.html#unlabelled

Display value labels

This is a very helpful starting point, but it’s certainly not a finished product yet. The biggest issue is related to the alert message shown above: as we’ve discussed in previous posts, the categorical variables you’ll find in IPUMS data extracts are usually imported as haven_labelled objects, rather than the more common factor class of objects. In practice, this means that every response option from the questionnaire has both a value and a label:

covid %>% count(MARSTAT)
# A tibble: 6 × 2
                             MARSTAT     n
                           <int+lbl> <int>
1 10 [Never married]                  3041
2 21 [Currently married]              7209
3 22 [Currently living with partner]   817
4 31 [Divorced or separated]           778
5 32 [Widow or widower]                334
6 98 [No response or missing]            5

The variable MARSTAT is a haven_labelled object where the value of each response is an integer (10, 21, 22, 31, 32, or 98), and the label describing each value is shown in square brackets to the right.

When gtsummary warns you that

This is an intermediate datastructure not meant for analysis

…it’s referring to the fact that labels are only an attribute of the variable. Attributes are metadata meant to assist the analyst running R in real-time, but they aren’t typically used by R in graphics or computational analysis.

In our table, gtsummary displays the numeric value for each response, rather than the much more readable labels. The easiest way to change this behavior is to coerce all of our categorical variables to the factor object class.

We recommend dividing this process into three steps:

  1. Identify any labels in your dataset that represent non-response codes, like No response or missing shown for MARSTAT. We’ll want to exclude these from our table, so we’ll convert them to the generic missing value NA with ipumsr::lbl_na_if().
  2. Identify any labelled variables that are not categorical. For example, the variable AGE in our dataset is labelled because the values 90 through 99 can represent either a respondent’s age in years (if not labelled) or a non-response code (if labelled). After converting non-response codes to NA in step 1, we’ll want to change the class of variables like AGE with ipumsr::zap_labels().
  3. Coerce all remaining labelled variables as factors with ipumsr::as_factor(), and remove any unused response options with fct_drop().
covid <- covid %>% 
  mutate(
    across(everything(), ~lbl_na_if(
      .x,
      ~.lbl %in% c(
        "Logical edit - missing",
        "Not interviewed (female questionnaire)",
        "Not interviewed (household questionnaire)",
        "Don't know",
        "No response or missing",
        "NIU (not in universe)"
      )
    )),
    across(AGE, zap_labels),
    across(where(is.labelled), ~as_factor(.x) %>% fct_drop)
  )

Coercing categorical variables as factors will greatly improve the readability of our table. Because we’ve introduced NA values, we’ll add the argument missing = "no" to exclude them from our table:

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(by = COUNTRY, missing = "no")
Characteristic Burkina Faso, N = 3,5281 Congo, Democratic Republic, N = 1,3241 Kenya, N = 5,9861 Nigeria, N = 1,3461
Age in female respondent questionnaire 29 (22, 36) 28 (22, 36) 30 (24, 38) 31 (24, 39)
Marital status, female questionnaire
Never married 882 (25%) 605 (46%) 1,206 (20%) 348 (26%)
Currently married 2,187 (62%) 450 (34%) 3,717 (62%) 855 (64%)
Currently living with partner 254 (7.2%) 170 (13%) 367 (6.1%) 26 (1.9%)
Divorced or separated 126 (3.6%) 82 (6.2%) 493 (8.2%) 77 (5.7%)
Widow or widower 78 (2.2%) 16 (1.2%) 201 (3.4%) 39 (2.9%)
Highest level of school attended, general (4 categories)
Never attended 1,213 (34%) 4 (0.3%) 119 (2.0%) 86 (6.4%)
Primary/Middle school 702 (20%) 40 (3.0%) 2,598 (43%) 134 (10.0%)
Secondary/post-primary 1,374 (39%) 884 (67%) 2,217 (37%) 637 (47%)
Tertiary/post-secondary 239 (6.8%) 395 (30%) 1,052 (18%) 486 (36%)
Urban/rural status
Rural 892 (25%) 0 (NA%) 3,697 (62%) 164 (12%)
Urban 2,636 (75%) 0 (NA%) 2,289 (38%) 1,182 (88%)

1 Median (IQR); n (%)

Customize variable labels

You may have noticed that gtsummary ignored the value labels for each of our haven_labelled variables before we converted them into factor variables, but it did find and use variable labels. For example, you see the variable label Age in female respondent questionnaire instead of the variable name AGE.

While this is sometimes helpful behavior, we feel that the word “Age” would have been fine on its own. Likewise, we’d like to clean up the labels for MARSTAT, EDUCATTGEN, and URBAN to make them as concise as possible.

You can override the ipumsr value labels in your table without changing the underlying data. Just add them as a list of formulas via the label argument: the variable name goes on the left of ~, and a character string containing the desired label goes on the right.

There are a few supporting functions that allow you to stylize these labels. We’ll use italicize_labels() to print our new labels in italics.

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(
    by = COUNTRY,
    missing = "no",
    label = list(
      AGE ~ "Age",
      MARSTAT ~ "Marital status",
      EDUCATTGEN ~ "Education",
      URBAN ~ "Urban vs Rural"
    )
  ) %>% 
  italicize_labels() 
Characteristic Burkina Faso, N = 3,5281 Congo, Democratic Republic, N = 1,3241 Kenya, N = 5,9861 Nigeria, N = 1,3461
Age 29 (22, 36) 28 (22, 36) 30 (24, 38) 31 (24, 39)
Marital status
Never married 882 (25%) 605 (46%) 1,206 (20%) 348 (26%)
Currently married 2,187 (62%) 450 (34%) 3,717 (62%) 855 (64%)
Currently living with partner 254 (7.2%) 170 (13%) 367 (6.1%) 26 (1.9%)
Divorced or separated 126 (3.6%) 82 (6.2%) 493 (8.2%) 77 (5.7%)
Widow or widower 78 (2.2%) 16 (1.2%) 201 (3.4%) 39 (2.9%)
Education
Never attended 1,213 (34%) 4 (0.3%) 119 (2.0%) 86 (6.4%)
Primary/Middle school 702 (20%) 40 (3.0%) 2,598 (43%) 134 (10.0%)
Secondary/post-primary 1,374 (39%) 884 (67%) 2,217 (37%) 637 (47%)
Tertiary/post-secondary 239 (6.8%) 395 (30%) 1,052 (18%) 486 (36%)
Urban vs Rural
Rural 892 (25%) 0 (NA%) 3,697 (62%) 164 (12%)
Urban 2,636 (75%) 0 (NA%) 2,289 (38%) 1,182 (88%)

1 Median (IQR); n (%)

Change default statistics

What about the statistics that tbl_summary() calculates? By default, tbl_summary() reports the median (and IQR) for integer variables like AGE, and it reports the frequency (and percentage) of each level for all of the factor variables we’ve created.

You can change the statistics calculated for one or more variables by name, or you can change them for all variables of a similar type (e.g. all_categorical). You’ll need to choose or define a custom function, then provide it to the stat argument as a character string between curly brackets like this:

"{mean}"

Here, we’ll demonstrate how to calculate the mean (and standard deviation) for AGE, and the percentage for all responses to all factors / categorical variables.

You may also decide to feature information about which statistics were calculated more prominently alongside the variable labels, rather than in the footer. Simply pipe your table to the function add_stat_label() as shown below:

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(
    by = COUNTRY,
    missing = "no",
    label = list(
      AGE ~ "Age",
      MARSTAT ~ "Marital status",
      EDUCATTGEN ~ "Education",
      URBAN ~ "Urban vs Rural"
    ),
    stat = list(
      AGE ~ "{mean} ({sd})",
      all_categorical() ~"{p}"
    )
  )%>% 
  italicize_labels() %>% 
  add_stat_label()
Characteristic Burkina Faso, N = 3,528 Congo, Democratic Republic, N = 1,324 Kenya, N = 5,986 Nigeria, N = 1,346
Age, Mean (SD) 30 (9) 29 (9) 31 (9) 31 (9)
Marital status, %
Never married 25 46 20 26
Currently married 62 34 62 64
Currently living with partner 7.2 13 6.1 1.9
Divorced or separated 3.6 6.2 8.2 5.7
Widow or widower 2.2 1.2 3.4 2.9
Education, %
Never attended 34 0.3 2.0 6.4
Primary/Middle school 20 3.0 43 10.0
Secondary/post-primary 39 67 37 47
Tertiary/post-secondary 6.8 30 18 36
Urban vs Rural, %
Rural 25 NA 62 12
Urban 75 NA 38 88

Customize headers & footnotes

Table headers and footnotes can be customized with the functions modify_header() and modify_footnote(). We’ll remove the frequencies from our header, use the abbreviated label DR Congo, and remove the label Characteristic from the first column. Lastly, we’ll add a title to our table with modify_spanning_header().

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(
    by = COUNTRY,
    missing = "no",
    stat = list(
      AGE ~ "{mean} ({sd})",
      all_categorical() ~"{p}"
    ),
    label = list(
      AGE ~ "Age",
      MARSTAT ~ "Marital status",
      EDUCATTGEN ~ "Education",
      URBAN ~ "Urban vs Rural"
    )
  ) %>% 
  add_stat_label() %>% 
  italicize_labels() %>% 
  modify_header(update = list(
      label ~ " ",
      stat_1 ~ "**Burkina Faso**",
      stat_2 ~ "**DR Congo <br> (Kinshasa)**",
      stat_3 ~ "**Kenya**",
      stat_4 ~ "**Nigeria <br> (Lagos & Kano)**"
  )) %>% 
  modify_spanning_header(
    everything() ~ "## Sample Demographics"
  ) 

Sample Demographics

Burkina Faso DR Congo
(Kinshasa)
Kenya Nigeria
(Lagos & Kano)
Age, Mean (SD) 30 (9) 29 (9) 31 (9) 31 (9)
Marital status, %
Never married 25 46 20 26
Currently married 62 34 62 64
Currently living with partner 7.2 13 6.1 1.9
Divorced or separated 3.6 6.2 8.2 5.7
Widow or widower 2.2 1.2 3.4 2.9
Education, %
Never attended 34 0.3 2.0 6.4
Primary/Middle school 20 3.0 43 10.0
Secondary/post-primary 39 67 37 47
Tertiary/post-secondary 6.8 30 18 36
Urban vs Rural, %
Rural 25 NA 62 12
Urban 75 NA 38 88

Sampling Weights

We mentioned in our last post that the PMA COVID-19 survey comes with a new weighting variable, CVQWEIGHT, which is analogous to the variable FQWEIGHT found in other Household and Female samples. CVQWEIGHT can be used to estimate all of the statistics shown in our table for the broader population represented by each sample (note that two of the samples are not nationally representative):

The srvyr package includes several functions that make it easy to incorporate survey weights into a tidy workflow. Simply provide information about the survey design to srvyr::as_survey_design(), and then pipe this information to a survey analysis function.

For example, a tidy workflow calculating the mean AGE of women in each of the four samples might look like this:

covid %>% 
  group_by(COUNTRY) %>% 
  summarise(mean(AGE))
# A tibble: 4 × 2
  COUNTRY                    `mean(AGE)`
  <fct>                            <dbl>
1 Burkina Faso                      29.7
2 Congo, Democratic Republic        29.5
3 Kenya                             31.1
4 Nigeria                           31.4

You can use CVQWEIGHT to estimate the mean AGE of reproductive age women in each of the four target populations like this:

covid %>% 
  as_survey_design(weight = CVQWEIGHT) %>%
  group_by(COUNTRY) %>% 
  summarise(survey_mean(AGE))
# A tibble: 4 × 3
  COUNTRY                     coef `_se`
  <fct>                      <dbl> <dbl>
1 Burkina Faso                27.9 0.370
2 Congo, Democratic Republic  28.9 0.327
3 Kenya                       28.9 0.186
4 Nigeria                     30.4 0.329

Happily, gtsummary can read survey information from the function as_survey_design(). This saves us the trouble of calculating weighted statistics for each of the variables in our table; instead, we can create weighted statistics for our table with just one line of code if we swap tbl_summary for its companion function, tbl_svysummary():

covid %>% 
  as_survey_design(weight = CVQWEIGHT) %>% # CVQWEIGHT goes here
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_svysummary( # use tbl_svysummary() in place of tbl_summary()
    by = COUNTRY,
    missing = "no",
    stat = list(
      AGE ~ "{mean} ({sd})",
      all_categorical() ~"{p}"
    ),
    label = list(
      AGE ~ "Age",
      MARSTAT ~ "Marital status",
      EDUCATTGEN ~ "Education",
      URBAN ~ "Urban vs Rural"
    )
  ) %>% 
  italicize_labels() %>% 
  add_stat_label() %>% 
  modify_header(update = list(
      label ~ " ",
      stat_1 ~ "**Burkina Faso**",
      stat_2 ~ "**DR Congo <br> (Kinshasa)**",
      stat_3 ~ "**Kenya**",
      stat_4 ~ "**Nigeria <br> (Lagos & Kano)**"
  )) %>% 
  modify_spanning_header(
    everything() ~ "## Weighted Population Estimates"
  )

Weighted Population Estimates

Burkina Faso DR Congo
(Kinshasa)
Kenya Nigeria
(Lagos & Kano)
Age, Mean (SD) 28 (9) 29 (9) 29 (9) 30 (9)
Marital status, %
Never married 22 47 30 27
Currently married 69 31 54 63
Currently living with partner 6.1 13 5.8 2.4
Divorced or separated 1.9 6.3 7.3 5.1
Widow or widower 1.4 1.9 2.8 2.3
Education, %
Never attended 55 0.4 1.9 9.8
Primary/Middle school 19 5.7 46 13
Secondary/post-primary 24 72 39 46
Tertiary/post-secondary 1.8 22 13 31
Urban vs Rural, %
Rural 79 NA 73 18
Urban 21 NA 27 82

We’ve now made a weighted descriptive statistics table, and we’ve only changed two lines of code. As we’ll see, creating a summary table from a model that uses sample design information can be just as easy.

Model Summary Table

The gtsummary package also contains a function designed to summarise and format output from regression models. For example, let’s build a simple logistic regression model for HLTHCAREDIFFFEAR, which indicates whether a woman experienced difficulty accessing healthcare because she was afraid of becoming infected with COVID-19. We’ll try modeling this outcome using the demographic variables that are available for all four samples: AGE, MARSTAT, and EDUCATTGEN (URBAN was not available for the DRC sample).

First, we’ll recode HLTHCAREDIFFFEAR into a binary indicator that’s suitable for use in a logistic regression model. We can collapse responses “No” and “None of the above”, since the latter indicates that the woman experienced no difficulties accessing healthcare at all.

covid %>% count(HLTHCAREDIFFFEAR)
# A tibble: 4 × 2
  HLTHCAREDIFFFEAR      n
  <fct>             <int>
1 No                 1101
2 Yes                3881
3 None of the above  7114
4 <NA>                 88
covid <- covid %>% 
  mutate(HLTHCAREDIFFFEAR = fct_collapse(HLTHCAREDIFFFEAR, No = c(
    "No", 
    "None of the above"
  ))) 

covid %>% count(HLTHCAREDIFFFEAR)
# A tibble: 3 × 2
  HLTHCAREDIFFFEAR     n
  <fct>            <int>
1 No                8215
2 Yes               3881
3 <NA>                88

Results from one model

Because IPUMS PMA samples are collected with geographic clusters - represented by EAID - we generally recommend specifying both a sample weight and a cluster identification with as_survey_design(). We’ll need to use a function that can build a general linear model using that survey design information, so we’ll use survey::svyglm(), rather than the base R function glm that might be more familiar. The output of a call to svyglm() is not formatted as a publication-ready table. Let’s see what happens when we build a model for our Burkina Faso sample:

covid %>% 
  filter(COUNTRY == "Burkina Faso") %>% 
  as_survey_design(weight = CVQWEIGHT, id = EAID) %>% 
  survey::svyglm(
    formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN, 
    family = "quasibinomial"
  ) 
1 - level Cluster Sampling design (with replacement)
With (164) clusters.
Called via srvyr
Sampling variables:
 - ids: EAID
 - weights: CVQWEIGHT

Call:  svyglm(formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN, 
    design = ., family = "quasibinomial")

Coefficients:
                         (Intercept)  
                             -0.7495  
                                 AGE  
                             -0.0128  
            MARSTATCurrently married  
                              0.5784  
MARSTATCurrently living with partner  
                             -0.3228  
        MARSTATDivorced or separated  
                              0.3920  
             MARSTATWidow or widower  
                              0.1639  
     EDUCATTGENPrimary/Middle school  
                             -0.0782  
    EDUCATTGENSecondary/post-primary  
                             -0.3467  
   EDUCATTGENTertiary/post-secondary  
                              0.1427  

Degrees of Freedom: 3526 Total (i.e. Null);  155 Residual
  (1 observation deleted due to missingness)
Null Deviance:      4381 
Residual Deviance: 4299     AIC: NA

The function gtsummary::tbl_regression() will tidy this output into a much more reader-friendly format. We’ll assign labels to each of our variables using the same label argument we saw before, and we’ll also choose to exponentiate our regression coefficients so that the results will reflect odds ratios:

covid %>% 
  filter(COUNTRY == "Burkina Faso") %>% 
  as_survey_design(weight = CVQWEIGHT, id = EAID) %>% 
  survey::svyglm(
    formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN, 
    family = "quasibinomial"
  ) %>% 
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      AGE ~ "Age",
      MARSTAT ~ "Marital status",
      EDUCATTGEN ~ "Education"   
    )
  ) 
Characteristic OR1 95% CI1 p-value
Age 0.99 0.96, 1.01 0.3
Marital status
Never married
Currently married 1.78 1.02, 3.13 0.043
Currently living with partner 0.72 0.25, 2.06 0.5
Divorced or separated 1.48 0.69, 3.18 0.3
Widow or widower 1.18 0.41, 3.42 0.8
Education
Never attended
Primary/Middle school 0.92 0.61, 1.41 0.7
Secondary/post-primary 0.71 0.44, 1.12 0.14
Tertiary/post-secondary 1.15 0.66, 2.02 0.6

1 OR = Odds Ratio, CI = Confidence Interval

We can also add conventional “stars” representing the significance of each coefficient with the add_significance_stars() function. Here, we could choose to display cluster-robust standard error estimates, but we’ll display 95% confidence intervals instead. We’ll also customize the header and add a title, using the same modify functions shown above.

covid %>% 
  filter(COUNTRY == "Burkina Faso") %>% 
  as_survey_design(weight = CVQWEIGHT, id = EAID) %>% 
  survey::svyglm(
    formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN, 
    family = "quasibinomial"
  ) %>% 
  tbl_regression(
    exp = TRUE,
    label = list(
      AGE ~ "Age",
      MARSTAT ~ "Marital status",
      EDUCATTGEN ~ "Education"   
    )
  ) %>%
  modify_footnote(everything() ~ NA, abbreviation = TRUE) %>%
  add_significance_stars(hide_se = TRUE, hide_ci = FALSE) %>%
  modify_header(update = list(
    label ~ " " ,
    estimate ~ '**Burkina Faso**',
    ci ~ "95% CI"
  )) %>% 
  modify_spanning_header(
    everything() ~ "## Odds Ratios obtained from Logistic Regression"
  ) 

Odds Ratios obtained from Logistic Regression

Burkina Faso1 95% CI
Age 0.99 0.96, 1.01
Marital status
Never married
Currently married 1.78* 1.02, 3.13
Currently living with partner 0.72 0.25, 2.06
Divorced or separated 1.48 0.69, 3.18
Widow or widower 1.18 0.41, 3.42
Education
Never attended
Primary/Middle school 0.92 0.61, 1.41
Secondary/post-primary 0.71 0.44, 1.12
Tertiary/post-secondary 1.15 0.66, 2.02

1 *p<0.05; **p<0.01; ***p<0.001

Results from several models

Researchers often report results from several models if, for example, they want to compare results including a range of different controls. You might also decide to model data from each of the COVID-19 samples separately in order to highlight important differences between their target populations. How would you merge results from four models into a single table?

If you repeat the same code shown above for each of the four samples, you’ll obtain four separate tables. We recommend using purrr::map() to store these tables in a list, which you can then pass to gtsummary::tbl_merge(). Here, we map over each of the factor levels in COUNTRY, using tbl_regression() to build a regression table for each. We pipe a list of four tables to tbl_merge(), and then add a header and title.

levels(covid$COUNTRY) %>% 
  map(~{
    covid %>% 
      as_survey_design(weight = CVQWEIGHT) %>% 
      filter(COUNTRY == .x) %>% 
      survey::svyglm(
        formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN, 
        family = "quasibinomial"
      ) %>% 
      tbl_regression(
        exp = TRUE,
        label = list(
          AGE ~ "Age",
          MARSTAT ~ "Marital status",
          EDUCATTGEN ~ "Education"   
        )
      ) %>% 
      italicize_labels() %>% 
      modify_footnote(everything() ~ NA, abbreviation = TRUE) %>%
      add_significance_stars(hide_se = T, hide_ci = F) %>% 
      modify_header(update = list(ci ~ "95% CI"))
  }) %>% 
  tbl_merge() %>% 
  modify_header(update = list(
    label ~ " " ,
    estimate_1 ~ '**Burkina Faso**',
    estimate_2 ~ '**DR Congo <br> (Kinshasa)**',
    estimate_3 ~ '**Kenya**',
    estimate_4 ~ '**Nigeria <br> (Lagos & Kano)**'
  )) %>% 
  modify_spanning_header(
    everything() ~ "## Odds Ratios obtained from Logistic Regression"
  ) 

Odds Ratios obtained from Logistic Regression

Burkina Faso1 95% CI DR Congo
(Kinshasa)
1
95% CI Kenya1 95% CI Nigeria
(Lagos & Kano)
1
95% CI
Age 0.99 0.97, 1.01 0.97* 0.95, 1.00 0.98*** 0.97, 0.99 0.99 0.97, 1.01
Marital status
Never married
Currently married 1.78 0.92, 3.45 1.16 0.73, 1.83 1.21 0.97, 1.52 1.64 0.95, 2.83
Currently living with partner 0.72 0.32, 1.62 0.55 0.27, 1.11 0.81 0.59, 1.11 2.81 0.94, 8.40
Divorced or separated 1.48 0.50, 4.35 1.94 0.95, 3.98 1.26 0.92, 1.73 1.10 0.49, 2.49
Widow or widower 1.18 0.38, 3.62 0.70 0.08, 6.11 1.09 0.72, 1.65 0.39 0.10, 1.56
Education
Never attended
Primary/Middle school 0.92 0.58, 1.47 0.39 0.02, 6.06 1.50 0.95, 2.38 0.85 0.33, 2.17
Secondary/post-primary 0.71 0.43, 1.16 0.99 0.10, 9.67 1.31 0.82, 2.09 0.80 0.37, 1.73
Tertiary/post-secondary 1.15 0.71, 1.88 1.29 0.13, 12.7 1.47 0.91, 2.37 0.78 0.35, 1.74

1 *p<0.05; **p<0.01; ***p<0.001

Output Options

So now you know that you can make a great-looking table in R that renders nicely in HTML (you’re reading this post on a web page, after all). But what if you want to export your table to a Word document, PDF, or some other format?

For Word documents, try piping your table to the function gtsummary::as_flex_table() like this, and then copy / paste the result directly into Word. (You may need to install the package flextable first).

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(by = COUNTRY) %>% 
  as_flex_table()

Several printers work well for PDF output, including the very popular kable_extra. If you’re making a PDF with RMarkdown, for example, you could pipe your table into gtsummary::as_kable_extra(). (You may need to install the package kable_extra first).

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(by = COUNTRY) %>% 
  as_kable_extra()

You’ll find more information about output options here.

Next Steps

We hope you found these steps helpful in using R to create tables for descriptive statistics, survey weights, and model summary output. As always, feel free to reach out to us with any questions. In our next post, we will show how to make likert-style stacked bar charts with R using the PMA COVID-19 survey data. Check back here in two weeks for the next post in this series!

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.