Earlier this spring, IPUMS PMA released a harmonized version of the PMA COVID-19 survey, which comes from a telephone interview with reproductive aged women in four countries who are all participants in an ongoing panel study related to family planning and reproductive health. We’re excited to feature this urgent resource all summer long here on the Data Analysis Hub; you’ll find find future posts related to PMA COVID-19 data and our last post - an introduction to the available data - if you follow along here.

As always, one of our main goals on this blog is to introduce tools that make it easy for anyone to explore new data and develop new ideas for research projects. Today, we’ll be diving into a topic that has probably frustrated everyone who has ever presented or published statistical findings at one time or another: making publication-ready tables.

If you talk to students or colleagues who use R, you might be surprised to learn that many of us don’t actually use statistical software to make the tables you see when you read an academic article. In reality, plenty of us just use R to make a model, and then we copy and paste the output into a table we make by hand with Microsoft Word! This can save lots of time, and it’s a perfectly reasonable solution if you know that you can make exactly what you want with the tools that Word provides.

You might consider making tables with R if you’ve ever found yourself:

manually inserting information to several rows or columns that could be calculated by R (like significance symbols *, **, and ***)
endlessly tweaking row height, column width, or fonts
correcting copy / paste errors or typos made while transferring information between R and Word
revisiting work you’ve already done to format a table for one publication, only to change it all for a different publication, a presentation, or a new file format (PDF, HTML, etc)

In this post, we’ll show you how to get up and running with flexible, easy-to-make tables using gtsummary, an R package that builds on the same tidyverse conventions we’ve featured elsewhere on this blog. There are a lot of different packages available to help make tables with R, but - as we’ll see - we love gtsummary because we think it allows users to maximize choice of style and output formats, all while minimizing the amount of code necessary to implement those choices.

If gtsummary doesn’t fit your needs, we also recommend gt (which does much of the heavy-lifting for gtsummary), kableExtra, flextable, and huxtable for different contexts.

Setup

In our last post, we explained that you’ll find all of the PMA COVID-19 survey data if you select the new COVID-19 Unit of Analysis in the IPUMS PMA data extract system.

In this post, we’ll work with a data extract containing the following variables. You’ll be able to follow along with our coding examples if you create and download an extract containing all four samples (Female Respondents only) and these variables:

You’ll also need to install these R packages if you don’t haven’t done so before (current versions are recommended):

ipumsr (0.4.5)
tidyverse (1.3.1)
survey (4.0)
srvyr (1.0.1)
gtsummary (1.4.1)

When you’ve finished downloading your data extract and installing all of these packages, load the packages and use read_ipums_micro() to load the data extract into R (make sure to change the file paths to match your own extract):

library(ipumsr)
library(tidyverse)
library(survey)
library(srvyr)
library(gtsummary)

covid <- read_ipums_micro(
  ddi = "data/pma_00032.xml",
  dat = "data/pma_00032.dat.gz"
)

As a reminder: women who were interviewed for the PMA COVID-19 survey are participants in a ongoing panel study focused on core PMA topics in reproductive health. The baseline survey for this panel study was conducted just a few months prior to the COVID-19 survey (between November 2019 and February 2020), but data from the baseline survey are not included in the COVID-19 dataset you’ll download here. We will show how to locate and merge data from the baseline survey in an upcoming post in this series. The COVID-19 survey data are structured like all of the other cross-sectional survey datasets available from IPUMS PMA: each woman’s responses are stored in a single row.

covid

# A tibble: 12,184 × 12
          SAMPLE  COUNTRY  YEAR ROUND   EAID CONSENTFQ CVQWEIGHT   AGE
       <int+lbl> <int+lb> <int> <dbl>  <dbl> <int+lbl>     <dbl> <int>
 1 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     2.13     35
 2 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.309    29
 3 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.624    25
 4 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.265    38
 5 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.133    30
 6 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.944    16
 7 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.361    29
 8 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.234    29
 9 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     2.99     41
10 85411 [Burki… 1 [Burk…  2020     1 8.54e8   1 [Yes]     0.194    27
# … with 12,174 more rows, and 4 more variables: MARSTAT <int+lbl>,
#   EDUCATTGEN <int+lbl>, URBAN <int+lbl>, HLTHCAREDIFFFEAR <int+lbl>

Descriptive Statistics Table

The great thing about gtsummary is that you can make a high quality table with just one line of code, but you can also customize any element of your table and easily apply custom styling (you can choose between several journal-specific themes or create your own). And, unlike many of other table-making packages for R, gtsummary supports printing directly to HTML, PDF, Word, and Rich Text Format.

Because gtsummary is designed with tidyverse users in-mind, you can pipe functions like dplyr::select directly into the function tbl_summary, which will then identify the object class for each variable and calculate default summary statistics accordingly. To demonstrate, we’ll select a few demographic variables, then break them down by COUNTRY in a basic call to tbl_summary():

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(by = COUNTRY)

Characteristic	1, N = 3,528¹	2, N = 1,324¹	7, N = 5,986¹	9, N = 1,346¹
Age in female respondent questionnaire	29 (22, 36)	28 (22, 36)	30 (24, 38)	31 (24, 39)
Marital status, female questionnaire
10	882 (25%)	605 (46%)	1,206 (20%)	348 (26%)
21	2,187 (62%)	450 (34%)	3,717 (62%)	855 (64%)
22	254 (7.2%)	170 (13%)	367 (6.1%)	26 (1.9%)
31	126 (3.6%)	82 (6.2%)	493 (8.2%)	77 (5.7%)
32	78 (2.2%)	16 (1.2%)	201 (3.4%)	39 (2.9%)
98	1 (<0.1%)	1 (<0.1%)	2 (<0.1%)	1 (<0.1%)
Highest level of school attended, general (4 categories)
1	1,213 (34%)	4 (0.3%)	119 (2.0%)	86 (6.4%)
2	702 (20%)	40 (3.0%)	2,598 (43%)	134 (10.0%)
3	1,374 (39%)	884 (67%)	2,217 (37%)	637 (47%)
4	239 (6.8%)	395 (30%)	1,052 (18%)	486 (36%)
98	0 (0%)	1 (<0.1%)	0 (0%)	3 (0.2%)
Urban/rural status	2,636 (75%)	0 (NA%)	2,289 (38%)	1,182 (88%)
Unknown	0	1,324	0	0
¹ Median (IQR); n (%)

ℹ Column(s) AGE, MARSTAT, EDUCATTGEN, URBAN, and COUNTRY are class 
'haven_labelled'. This is an intermediate datastructure not meant for analysis. 
Convert columns with `haven::as_factor()`,`labelled::to_factor()`, 
`labelled::unlabelled()`, and `unclass()`. 'haven_labelled' value labels are 
ignored when columns are not converted. Failure to convert may have unintended 
consequences or result in error.

• https://haven.tidyverse.org/articles/semantics.html
• https://larmarange.github.io/labelled/articles/intro_labelled.html#unlabelled

Display value labels

This is a very helpful starting point, but it’s certainly not a finished product yet. The biggest issue is related to the alert message shown above: as we’ve discussed in previous posts, the categorical variables you’ll find in IPUMS data extracts are usually imported as haven_labelled objects, rather than the more common factor class of objects. In practice, this means that every response option from the questionnaire has both a value and a label:

covid %>% count(MARSTAT)

# A tibble: 6 × 2
                             MARSTAT     n
                           <int+lbl> <int>
1 10 [Never married]                  3041
2 21 [Currently married]              7209
3 22 [Currently living with partner]   817
4 31 [Divorced or separated]           778
5 32 [Widow or widower]                334
6 98 [No response or missing]            5

You can access value labels with the ipumsr function ipums_val_labels.

The variable MARSTAT is a haven_labelled object where the value of each response is an integer (10, 21, 22, 31, 32, or 98), and the label describing each value is shown in square brackets to the right.

When gtsummary warns you that

This is an intermediate datastructure not meant for analysis

…it’s referring to the fact that labels are only an attribute of the variable. Attributes are metadata meant to assist the analyst running R in real-time, but they aren’t typically used by R in graphics or computational analysis.

In our table, gtsummary displays the numeric value for each response, rather than the much more readable labels. The easiest way to change this behavior is to coerce all of our categorical variables to the factor object class.

We recommend dividing this process into three steps:

Identify any labels in your dataset that represent non-response codes, like No response or missing shown for MARSTAT. We’ll want to exclude these from our table, so we’ll convert them to the generic missing value NA with ipumsr::lbl_na_if().
Identify any labelled variables that are not categorical. For example, the variable AGE in our dataset is labelled because the values 90 through 99 can represent either a respondent’s age in years (if not labelled) or a non-response code (if labelled). After converting non-response codes to NA in step 1, we’ll want to change the class of variables like AGE with ipumsr::zap_labels().
Coerce all remaining labelled variables as factors with ipumsr::as_factor(), and remove any unused response options with fct_drop().

covid <- covid %>% 
  mutate(
    across(everything(), ~lbl_na_if(
      .x,
      ~.lbl %in% c(
        "Logical edit - missing",
        "Not interviewed (female questionnaire)",
        "Not interviewed (household questionnaire)",
        "Don't know",
        "No response or missing",
        "NIU (not in universe)"
      )
    )),
    across(AGE, zap_labels),
    across(where(is.labelled), ~as_factor(.x) %>% fct_drop)
  )

Common mistake: make sure to use the function as_factor() from ipumsr, and not the base R function as.factor(). The difference is that the former will use labels for each factor level, while the latter will use the original numeric values.

Coercing categorical variables as factors will greatly improve the readability of our table. Because we’ve introduced NA values, we’ll add the argument missing = "no" to exclude them from our table:

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(by = COUNTRY, missing = "no")

Characteristic	Burkina Faso, N = 3,528¹	Congo, Democratic Republic, N = 1,324¹	Kenya, N = 5,986¹	Nigeria, N = 1,346¹
Age in female respondent questionnaire	29 (22, 36)	28 (22, 36)	30 (24, 38)	31 (24, 39)
Marital status, female questionnaire
Never married	882 (25%)	605 (46%)	1,206 (20%)	348 (26%)
Currently married	2,187 (62%)	450 (34%)	3,717 (62%)	855 (64%)
Currently living with partner	254 (7.2%)	170 (13%)	367 (6.1%)	26 (1.9%)
Divorced or separated	126 (3.6%)	82 (6.2%)	493 (8.2%)	77 (5.7%)
Widow or widower	78 (2.2%)	16 (1.2%)	201 (3.4%)	39 (2.9%)
Highest level of school attended, general (4 categories)
Never attended	1,213 (34%)	4 (0.3%)	119 (2.0%)	86 (6.4%)
Primary/Middle school	702 (20%)	40 (3.0%)	2,598 (43%)	134 (10.0%)
Secondary/post-primary	1,374 (39%)	884 (67%)	2,217 (37%)	637 (47%)
Tertiary/post-secondary	239 (6.8%)	395 (30%)	1,052 (18%)	486 (36%)
Urban/rural status
Rural	892 (25%)	0 (NA%)	3,697 (62%)	164 (12%)
Urban	2,636 (75%)	0 (NA%)	2,289 (38%)	1,182 (88%)
¹ Median (IQR); n (%)

Customize variable labels

You may have noticed that gtsummary ignored the value labels for each of our haven_labelled variables before we converted them into factor variables, but it did find and use variable labels. For example, you see the variable label Age in female respondent questionnaire instead of the variable name AGE.

You can access variable labels with the ipumsr function ipums_var_label.

While this is sometimes helpful behavior, we feel that the word “Age” would have been fine on its own. Likewise, we’d like to clean up the labels for MARSTAT, EDUCATTGEN, and URBAN to make them as concise as possible.

You can override the ipumsr value labels in your table without changing the underlying data. Just add them as a list of formulas via the label argument: the variable name goes on the left of ~, and a character string containing the desired label goes on the right.

There are a few supporting functions that allow you to stylize these labels. We’ll use italicize_labels() to print our new labels in italics.

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(
    by = COUNTRY,
    missing = "no",
    label = list(
      AGE ~ "Age",
      MARSTAT ~ "Marital status",
      EDUCATTGEN ~ "Education",
      URBAN ~ "Urban vs Rural"
    )
  ) %>% 
  italicize_labels()

Characteristic	Burkina Faso, N = 3,528¹	Congo, Democratic Republic, N = 1,324¹	Kenya, N = 5,986¹	Nigeria, N = 1,346¹
Age	29 (22, 36)	28 (22, 36)	30 (24, 38)	31 (24, 39)
Marital status
Never married	882 (25%)	605 (46%)	1,206 (20%)	348 (26%)
Currently married	2,187 (62%)	450 (34%)	3,717 (62%)	855 (64%)
Currently living with partner	254 (7.2%)	170 (13%)	367 (6.1%)	26 (1.9%)
Divorced or separated	126 (3.6%)	82 (6.2%)	493 (8.2%)	77 (5.7%)
Widow or widower	78 (2.2%)	16 (1.2%)	201 (3.4%)	39 (2.9%)
Education
Never attended	1,213 (34%)	4 (0.3%)	119 (2.0%)	86 (6.4%)
Primary/Middle school	702 (20%)	40 (3.0%)	2,598 (43%)	134 (10.0%)
Secondary/post-primary	1,374 (39%)	884 (67%)	2,217 (37%)	637 (47%)
Tertiary/post-secondary	239 (6.8%)	395 (30%)	1,052 (18%)	486 (36%)
Urban vs Rural
Rural	892 (25%)	0 (NA%)	3,697 (62%)	164 (12%)
Urban	2,636 (75%)	0 (NA%)	2,289 (38%)	1,182 (88%)
¹ Median (IQR); n (%)

Change default statistics

What about the statistics that tbl_summary() calculates? By default, tbl_summary() reports the median (and IQR) for integer variables like AGE, and it reports the frequency (and percentage) of each level for all of the factor variables we’ve created.

You can change the statistics calculated for one or more variables by name, or you can change them for all variables of a similar type (e.g. all_categorical). You’ll need to choose or define a custom function, then provide it to the stat argument as a character string between curly brackets like this:

"{mean}"

Here, we’ll demonstrate how to calculate the mean (and standard deviation) for AGE, and the percentage for all responses to all factors / categorical variables.

You may also decide to feature information about which statistics were calculated more prominently alongside the variable labels, rather than in the footer. Simply pipe your table to the function add_stat_label() as shown below:

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(
    by = COUNTRY,
    missing = "no",
    label = list(
      AGE ~ "Age",
      MARSTAT ~ "Marital status",
      EDUCATTGEN ~ "Education",
      URBAN ~ "Urban vs Rural"
    ),
    stat = list(
      AGE ~ "{mean} ({sd})",
      all_categorical() ~"{p}"
    )
  )%>% 
  italicize_labels() %>% 
  add_stat_label()

Characteristic	Burkina Faso, N = 3,528	Congo, Democratic Republic, N = 1,324	Kenya, N = 5,986	Nigeria, N = 1,346
Age, Mean (SD)	30 (9)	29 (9)	31 (9)	31 (9)
Marital status, %
Never married	25	46	20	26
Currently married	62	34	62	64
Currently living with partner	7.2	13	6.1	1.9
Divorced or separated	3.6	6.2	8.2	5.7
Widow or widower	2.2	1.2	3.4	2.9
Education, %
Never attended	34	0.3	2.0	6.4
Primary/Middle school	20	3.0	43	10.0
Secondary/post-primary	39	67	37	47
Tertiary/post-secondary	6.8	30	18	36
Urban vs Rural, %
Rural	25	NA	62	12
Urban	75	NA	38	88

Customize headers & footnotes

Table headers and footnotes can be customized with the functions modify_header() and modify_footnote(). We’ll remove the frequencies from our header, use the abbreviated label DR Congo, and remove the label Characteristic from the first column. Lastly, we’ll add a title to our table with modify_spanning_header().

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(
    by = COUNTRY,
    missing = "no",
    stat = list(
      AGE ~ "{mean} ({sd})",
      all_categorical() ~"{p}"
    ),
    label = list(
      AGE ~ "Age",
      MARSTAT ~ "Marital status",
      EDUCATTGEN ~ "Education",
      URBAN ~ "Urban vs Rural"
    )
  ) %>% 
  add_stat_label() %>% 
  italicize_labels() %>% 
  modify_header(update = list(
      label ~ " ",
      stat_1 ~ "**Burkina Faso**",
      stat_2 ~ "**DR Congo <br> (Kinshasa)**",
      stat_3 ~ "**Kenya**",
      stat_4 ~ "**Nigeria <br> (Lagos & Kano)**"
  )) %>% 
  modify_spanning_header(
    everything() ~ "## Sample Demographics"
  )

Sample Demographics
	Burkina Faso	DR Congo (Kinshasa)	Kenya	Nigeria (Lagos & Kano)
Age, Mean (SD)	30 (9)	29 (9)	31 (9)	31 (9)
Marital status, %
Never married	25	46	20	26
Currently married	62	34	62	64
Currently living with partner	7.2	13	6.1	1.9
Divorced or separated	3.6	6.2	8.2	5.7
Widow or widower	2.2	1.2	3.4	2.9
Education, %
Never attended	34	0.3	2.0	6.4
Primary/Middle school	20	3.0	43	10.0
Secondary/post-primary	39	67	37	47
Tertiary/post-secondary	6.8	30	18	36
Urban vs Rural, %
Rural	25	NA	62	12
Urban	75	NA	38	88

Text provided to modify_header(), modify_footnote(), and modify_spanning_header() can be stylized with either markdown syntax (shown) or HTML.

Headers can span individuals columns (as shown in modify_header()), groups of columns, or everything() (as shown in modify_spanning_header()).

Tip: if you simply want to move the sample size from the header into its own row (rather than delete it, as we’ve done here), you could create a new factor variable N where every person gets the value 1 in the step after our select() function.

Sampling Weights

We mentioned in our last post that the PMA COVID-19 survey comes with a new weighting variable, CVQWEIGHT, which is analogous to the variable FQWEIGHT found in other Household and Female samples. CVQWEIGHT can be used to estimate all of the statistics shown in our table for the broader population represented by each sample (note that two of the samples are not nationally representative):

Burkina Faso: nationally representative
Kenya: nationally representative
DRC: Kinshasa only
Nigeria: Lagos and Kano only

The srvyr package includes several functions that make it easy to incorporate survey weights into a tidy workflow. Simply provide information about the survey design to srvyr::as_survey_design(), and then pipe this information to a survey analysis function.

For example, a tidy workflow calculating the mean AGE of women in each of the four samples might look like this:

covid %>% 
  group_by(COUNTRY) %>% 
  summarise(mean(AGE))

# A tibble: 4 × 2
  COUNTRY                    `mean(AGE)`
  <fct>                            <dbl>
1 Burkina Faso                      29.7
2 Congo, Democratic Republic        29.5
3 Kenya                             31.1
4 Nigeria                           31.4

You can use CVQWEIGHT to estimate the mean AGE of reproductive age women in each of the four target populations like this:

covid %>% 
  as_survey_design(weight = CVQWEIGHT) %>%
  group_by(COUNTRY) %>% 
  summarise(survey_mean(AGE))

# A tibble: 4 × 3
  COUNTRY                     coef `_se`
  <fct>                      <dbl> <dbl>
1 Burkina Faso                27.9 0.370
2 Congo, Democratic Republic  28.9 0.327
3 Kenya                       28.9 0.186
4 Nigeria                     30.4 0.329

Happily, gtsummary can read survey information from the function as_survey_design(). This saves us the trouble of calculating weighted statistics for each of the variables in our table; instead, we can create weighted statistics for our table with just one line of code if we swap tbl_summary for its companion function, tbl_svysummary():

covid %>% 
  as_survey_design(weight = CVQWEIGHT) %>% # CVQWEIGHT goes here
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_svysummary( # use tbl_svysummary() in place of tbl_summary()
    by = COUNTRY,
    missing = "no",
    stat = list(
      AGE ~ "{mean} ({sd})",
      all_categorical() ~"{p}"
    ),
    label = list(
      AGE ~ "Age",
      MARSTAT ~ "Marital status",
      EDUCATTGEN ~ "Education",
      URBAN ~ "Urban vs Rural"
    )
  ) %>% 
  italicize_labels() %>% 
  add_stat_label() %>% 
  modify_header(update = list(
      label ~ " ",
      stat_1 ~ "**Burkina Faso**",
      stat_2 ~ "**DR Congo <br> (Kinshasa)**",
      stat_3 ~ "**Kenya**",
      stat_4 ~ "**Nigeria <br> (Lagos & Kano)**"
  )) %>% 
  modify_spanning_header(
    everything() ~ "## Weighted Population Estimates"
  )

Weighted Population Estimates
	Burkina Faso	DR Congo (Kinshasa)	Kenya	Nigeria (Lagos & Kano)
Age, Mean (SD)	28 (9)	29 (9)	29 (9)	30 (9)
Marital status, %
Never married	22	47	30	27
Currently married	69	31	54	63
Currently living with partner	6.1	13	5.8	2.4
Divorced or separated	1.9	6.3	7.3	5.1
Widow or widower	1.4	1.9	2.8	2.3
Education, %
Never attended	55	0.4	1.9	9.8
Primary/Middle school	19	5.7	46	13
Secondary/post-primary	24	72	39	46
Tertiary/post-secondary	1.8	22	13	31
Urban vs Rural, %
Rural	79	NA	73	18
Urban	21	NA	27	82

We’ve now made a weighted descriptive statistics table, and we’ve only changed two lines of code. As we’ll see, creating a summary table from a model that uses sample design information can be just as easy.

Model Summary Table

The gtsummary package also contains a function designed to summarise and format output from regression models. For example, let’s build a simple logistic regression model for HLTHCAREDIFFFEAR, which indicates whether a woman experienced difficulty accessing healthcare because she was afraid of becoming infected with COVID-19. We’ll try modeling this outcome using the demographic variables that are available for all four samples: AGE, MARSTAT, and EDUCATTGEN (URBAN was not available for the DRC sample).

First, we’ll recode HLTHCAREDIFFFEAR into a binary indicator that’s suitable for use in a logistic regression model. We can collapse responses “No” and “None of the above”, since the latter indicates that the woman experienced no difficulties accessing healthcare at all.

covid %>% count(HLTHCAREDIFFFEAR)

# A tibble: 4 × 2
  HLTHCAREDIFFFEAR      n
  <fct>             <int>
1 No                 1101
2 Yes                3881
3 None of the above  7114
4 <NA>                 88

covid <- covid %>% 
  mutate(HLTHCAREDIFFFEAR = fct_collapse(HLTHCAREDIFFFEAR, No = c(
    "No", 
    "None of the above"
  ))) 

covid %>% count(HLTHCAREDIFFFEAR)

# A tibble: 3 × 2
  HLTHCAREDIFFFEAR     n
  <fct>            <int>
1 No                8215
2 Yes               3881
3 <NA>                88

Results from one model

Because IPUMS PMA samples are collected with geographic clusters - represented by EAID - we generally recommend specifying both a sample weight and a cluster identification with as_survey_design(). We’ll need to use a function that can build a general linear model using that survey design information, so we’ll use survey::svyglm(), rather than the base R function glm that might be more familiar. The output of a call to svyglm() is not formatted as a publication-ready table. Let’s see what happens when we build a model for our Burkina Faso sample:

covid %>% 
  filter(COUNTRY == "Burkina Faso") %>% 
  as_survey_design(weight = CVQWEIGHT, id = EAID) %>% 
  survey::svyglm(
    formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN, 
    family = "quasibinomial"
  )

1 - level Cluster Sampling design (with replacement)
With (164) clusters.
Called via srvyr
Sampling variables:
 - ids: EAID
 - weights: CVQWEIGHT

Call:  svyglm(formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN, 
    design = ., family = "quasibinomial")

Coefficients:
                         (Intercept)  
                             -0.7495  
                                 AGE  
                             -0.0128  
            MARSTATCurrently married  
                              0.5784  
MARSTATCurrently living with partner  
                             -0.3228  
        MARSTATDivorced or separated  
                              0.3920  
             MARSTATWidow or widower  
                              0.1639  
     EDUCATTGENPrimary/Middle school  
                             -0.0782  
    EDUCATTGENSecondary/post-primary  
                             -0.3467  
   EDUCATTGENTertiary/post-secondary  
                              0.1427  

Degrees of Freedom: 3526 Total (i.e. Null);  155 Residual
  (1 observation deleted due to missingness)
Null Deviance:      4381 
Residual Deviance: 4299     AIC: NA

The function gtsummary::tbl_regression() will tidy this output into a much more reader-friendly format. We’ll assign labels to each of our variables using the same label argument we saw before, and we’ll also choose to exponentiate our regression coefficients so that the results will reflect odds ratios:

covid %>% 
  filter(COUNTRY == "Burkina Faso") %>% 
  as_survey_design(weight = CVQWEIGHT, id = EAID) %>% 
  survey::svyglm(
    formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN, 
    family = "quasibinomial"
  ) %>% 
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      AGE ~ "Age",
      MARSTAT ~ "Marital status",
      EDUCATTGEN ~ "Education"   
    )
  )

Characteristic	OR¹	95% CI¹	p-value
Age	0.99	0.96, 1.01	0.3
Marital status
Never married	—	—
Currently married	1.78	1.02, 3.13	0.043
Currently living with partner	0.72	0.25, 2.06	0.5
Divorced or separated	1.48	0.69, 3.18	0.3
Widow or widower	1.18	0.41, 3.42	0.8
Education
Never attended	—	—
Primary/Middle school	0.92	0.61, 1.41	0.7
Secondary/post-primary	0.71	0.44, 1.12	0.14
Tertiary/post-secondary	1.15	0.66, 2.02	0.6
¹ OR = Odds Ratio, CI = Confidence Interval

We can also add conventional “stars” representing the significance of each coefficient with the add_significance_stars() function. Here, we could choose to display cluster-robust standard error estimates, but we’ll display 95% confidence intervals instead. We’ll also customize the header and add a title, using the same modify functions shown above.

covid %>% 
  filter(COUNTRY == "Burkina Faso") %>% 
  as_survey_design(weight = CVQWEIGHT, id = EAID) %>% 
  survey::svyglm(
    formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN, 
    family = "quasibinomial"
  ) %>% 
  tbl_regression(
    exp = TRUE,
    label = list(
      AGE ~ "Age",
      MARSTAT ~ "Marital status",
      EDUCATTGEN ~ "Education"   
    )
  ) %>%
  modify_footnote(everything() ~ NA, abbreviation = TRUE) %>%
  add_significance_stars(hide_se = TRUE, hide_ci = FALSE) %>%
  modify_header(update = list(
    label ~ " " ,
    estimate ~ '**Burkina Faso**',
    ci ~ "95% CI"
  )) %>% 
  modify_spanning_header(
    everything() ~ "## Odds Ratios obtained from Logistic Regression"
  )

Odds Ratios obtained from Logistic Regression
	Burkina Faso¹	95% CI
Age	0.99	0.96, 1.01
Marital status
Never married	—	—
Currently married	1.78*	1.02, 3.13
Currently living with partner	0.72	0.25, 2.06
Divorced or separated	1.48	0.69, 3.18
Widow or widower	1.18	0.41, 3.42
Education
Never attended	—	—
Primary/Middle school	0.92	0.61, 1.41
Secondary/post-primary	0.71	0.44, 1.12
Tertiary/post-secondary	1.15	0.66, 2.02
¹ p<0.05; p<0.01; **p<0.001

Results from several models

Researchers often report results from several models if, for example, they want to compare results including a range of different controls. You might also decide to model data from each of the COVID-19 samples separately in order to highlight important differences between their target populations. How would you merge results from four models into a single table?

If you repeat the same code shown above for each of the four samples, you’ll obtain four separate tables. We recommend using purrr::map() to store these tables in a list, which you can then pass to gtsummary::tbl_merge(). Here, we map over each of the factor levels in COUNTRY, using tbl_regression() to build a regression table for each. We pipe a list of four tables to tbl_merge(), and then add a header and title.

levels(covid$COUNTRY) %>% 
  map(~{
    covid %>% 
      as_survey_design(weight = CVQWEIGHT) %>% 
      filter(COUNTRY == .x) %>% 
      survey::svyglm(
        formula = HLTHCAREDIFFFEAR ~ AGE + MARSTAT + EDUCATTGEN, 
        family = "quasibinomial"
      ) %>% 
      tbl_regression(
        exp = TRUE,
        label = list(
          AGE ~ "Age",
          MARSTAT ~ "Marital status",
          EDUCATTGEN ~ "Education"   
        )
      ) %>% 
      italicize_labels() %>% 
      modify_footnote(everything() ~ NA, abbreviation = TRUE) %>%
      add_significance_stars(hide_se = T, hide_ci = F) %>% 
      modify_header(update = list(ci ~ "95% CI"))
  }) %>% 
  tbl_merge() %>% 
  modify_header(update = list(
    label ~ " " ,
    estimate_1 ~ '**Burkina Faso**',
    estimate_2 ~ '**DR Congo <br> (Kinshasa)**',
    estimate_3 ~ '**Kenya**',
    estimate_4 ~ '**Nigeria <br> (Lagos & Kano)**'
  )) %>% 
  modify_spanning_header(
    everything() ~ "## Odds Ratios obtained from Logistic Regression"
  )

purrr is included with library(tidyverse).

Odds Ratios obtained from Logistic Regression
	Burkina Faso¹	95% CI	DR Congo (Kinshasa)¹	95% CI	Kenya¹	95% CI	Nigeria (Lagos & Kano)¹	95% CI
Age	0.99	0.97, 1.01	0.97*	0.95, 1.00	0.98***	0.97, 0.99	0.99	0.97, 1.01
Marital status
Never married	—	—	—	—	—	—	—	—
Currently married	1.78	0.92, 3.45	1.16	0.73, 1.83	1.21	0.97, 1.52	1.64	0.95, 2.83
Currently living with partner	0.72	0.32, 1.62	0.55	0.27, 1.11	0.81	0.59, 1.11	2.81	0.94, 8.40
Divorced or separated	1.48	0.50, 4.35	1.94	0.95, 3.98	1.26	0.92, 1.73	1.10	0.49, 2.49
Widow or widower	1.18	0.38, 3.62	0.70	0.08, 6.11	1.09	0.72, 1.65	0.39	0.10, 1.56
Education
Never attended	—	—	—	—	—	—	—	—
Primary/Middle school	0.92	0.58, 1.47	0.39	0.02, 6.06	1.50	0.95, 2.38	0.85	0.33, 2.17
Secondary/post-primary	0.71	0.43, 1.16	0.99	0.10, 9.67	1.31	0.82, 2.09	0.80	0.37, 1.73
Tertiary/post-secondary	1.15	0.71, 1.88	1.29	0.13, 12.7	1.47	0.91, 2.37	0.78	0.35, 1.74
¹ p<0.05; p<0.01; **p<0.001

Output Options

So now you know that you can make a great-looking table in R that renders nicely in HTML (you’re reading this post on a web page, after all). But what if you want to export your table to a Word document, PDF, or some other format?

For Word documents, try piping your table to the function gtsummary::as_flex_table() like this, and then copy / paste the result directly into Word. (You may need to install the package flextable first).

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(by = COUNTRY) %>% 
  as_flex_table()

Several printers work well for PDF output, including the very popular kable_extra. If you’re making a PDF with RMarkdown, for example, you could pipe your table into gtsummary::as_kable_extra(). (You may need to install the package kable_extra first).

covid %>% 
  select(AGE, MARSTAT, EDUCATTGEN, URBAN, COUNTRY) %>% 
  tbl_summary(by = COUNTRY) %>% 
  as_kable_extra()

You’ll find more information about output options here.

Next Steps

We hope you found these steps helpful in using R to create tables for descriptive statistics, survey weights, and model summary output. As always, feel free to reach out to us with any questions. In our next post, we will show how to make likert-style stacked bar charts with R using the PMA COVID-19 survey data. Check back here in two weeks for the next post in this series!

Making Tables with PMA COVID-19 Data