Survival Analysis with Survey Weights

Data Analysis Survival Analysis Weights Cluster Sampling Contraceptive Calendar survey survival srvyr

The survey package for R makes it easy to incorpate weights and other elements of survey design into survival analysis.

Matt Gunther (IPUMS PMA Senior Data Analyst)
2023-01-25

Several months ago, we introduced the PMA Contraceptive Calendar as a monthly recollection of contraceptive use and pregnancy status reported by women in each phase of the ongoing PMA panel study. These data are an especially useful tool for researchers interested in survival analysis, which can be used to estimate the expected amount of time to a particular family planning event (measured in months).

In that introduction, we demonstrated how to estimate time to adoption of a family planning method for women who initially were not using one. Because we focused most of our attention on data preparation and visualization techniques with the survival package for R, we set aside some issues related to PMA survey weights and its clustered sample design. In short: survival is probably the best known way to conduct this kind of survival analysis in R, but it does not include the tools we need to incorporate information about the PMA survey design.1

In our case, survival functions included in the survey package offer an easy way to use PMA weights and identifiers for each cluster and sample stratum.

Data

Let’s pick up with the contraceptive calendar dataset we created in our earlier post, except that we’ll now include all of the variables necessary to describe PMA survey design:

Additionally, we’ve included a two variables describing 1) each woman’s family planning intentions at Phase 1 and 2) her monthly family planning status reported on the Phase 2 Contraceptive Calendar.

Since we’re working with a dataset created in a previous session, we’ll load it with read_rds from the tidyverse. We’ll also load survey (along with its tidy-helper srvyr). The survival package is listed as a dependency for survey, so it is attched automatically.

library(tidyverse)
library(survey)
library(srvyr)
dat <- read_rds("data/dat.rds")
# A tibble: 208,098 × 10
# Groups:   ID [17,725]
      ID POP          PANELWEIGHT      EAID STRATA CALSTART CALSTOP CALCMC FPPLANYR FPSTATUS
   <int> <chr>              <dbl>     <dbl>  <int>    <dbl>   <dbl>  <dbl> <lgl>    <chr>   
 1     1 Burkina Faso        2.50 854111005  85402     1442    1453   1453 TRUE     3       
 2     1 Burkina Faso        2.50 854111005  85402     1442    1453   1452 TRUE     3       
 3     1 Burkina Faso        2.50 854111005  85402     1442    1453   1451 TRUE     3       
 4     1 Burkina Faso        2.50 854111005  85402     1442    1453   1450 TRUE     3       
 5     1 Burkina Faso        2.50 854111005  85402     1442    1453   1449 TRUE     3       
 6     1 Burkina Faso        2.50 854111005  85402     1442    1453   1448 TRUE     3       
 7     1 Burkina Faso        2.50 854111005  85402     1442    1453   1447 TRUE     0       
 8     1 Burkina Faso        2.50 854111005  85402     1442    1453   1446 TRUE     0       
 9     1 Burkina Faso        2.50 854111005  85402     1442    1453   1445 TRUE     0       
10     1 Burkina Faso        2.50 854111005  85402     1442    1453   1444 TRUE     0       
11     1 Burkina Faso        2.50 854111005  85402     1442    1453   1443 TRUE     0       
12     1 Burkina Faso        2.50 854111005  85402     1442    1453   1442 TRUE     0       
13     2 Burkina Faso        2.61 854111004  85402     1441    1452   1452 FALSE    5       
14     2 Burkina Faso        2.61 854111004  85402     1441    1452   1451 FALSE    5       
15     2 Burkina Faso        2.61 854111004  85402     1441    1452   1450 FALSE    5       
# … with 208,083 more rows

Setup

The numeric codes shown above in FPSTATUS refer to specific family planning methods used in a particular month, with the following exceptions:

We’ll build a binary indicator for USE of any method, as indicated by any value except for these codes. (For the moment, our data contain placeholder NA values for women with incomplete Phase 2 calendars).

dat <- dat %>% 
  rowwise() %>% 
  mutate(USE = !any(FPSTATUS == c("0", "P", "B", "T")))

Our analysis will compare time to adoption for Phase 1 non-users like the woman at the top of our dataset (ID == 1). FPPLANYR shows that she reported plans to adopt a method within 12 months after the Phase 1 interview, and the Phase 2 calendar data in FPSTATUS shows that she did so just 6 months later. We should hypothesize that women with plans probably adopt a method sooner, on average, compared with women who had no plans.

dat 
# A tibble: 208,098 × 11
# Rowwise:  ID
      ID POP          PANELWEIGHT      EAID STRATA CALSTART CALSTOP CALCMC FPPLANYR FPSTATUS USE  
   <int> <chr>              <dbl>     <dbl>  <int>    <dbl>   <dbl>  <dbl> <lgl>    <chr>    <lgl>
 1     1 Burkina Faso        2.50 854111005  85402     1442    1453   1453 TRUE     3        TRUE 
 2     1 Burkina Faso        2.50 854111005  85402     1442    1453   1452 TRUE     3        TRUE 
 3     1 Burkina Faso        2.50 854111005  85402     1442    1453   1451 TRUE     3        TRUE 
 4     1 Burkina Faso        2.50 854111005  85402     1442    1453   1450 TRUE     3        TRUE 
 5     1 Burkina Faso        2.50 854111005  85402     1442    1453   1449 TRUE     3        TRUE 
 6     1 Burkina Faso        2.50 854111005  85402     1442    1453   1448 TRUE     3        TRUE 
 7     1 Burkina Faso        2.50 854111005  85402     1442    1453   1447 TRUE     0        FALSE
 8     1 Burkina Faso        2.50 854111005  85402     1442    1453   1446 TRUE     0        FALSE
 9     1 Burkina Faso        2.50 854111005  85402     1442    1453   1445 TRUE     0        FALSE
10     1 Burkina Faso        2.50 854111005  85402     1442    1453   1444 TRUE     0        FALSE
11     1 Burkina Faso        2.50 854111005  85402     1442    1453   1443 TRUE     0        FALSE
12     1 Burkina Faso        2.50 854111005  85402     1442    1453   1442 TRUE     0        FALSE
13     2 Burkina Faso        2.61 854111004  85402     1441    1452   1452 FALSE    5        TRUE 
14     2 Burkina Faso        2.61 854111004  85402     1441    1452   1451 FALSE    5        TRUE 
15     2 Burkina Faso        2.61 854111004  85402     1441    1452   1450 FALSE    5        TRUE 
# … with 208,083 more rows

Given our research interest, you might consider subsetting the dataset to exclude women who were already using a method at the time of the Phase 1 interview. In fact, we’ll want to keep track of these women so that we can calculate the correct degrees of freedom from each original sample size. For now, we’ll mark each of those cases as “User”, dividing the remaining cases into “Plan” and “No Plan” in a new variable called INTENT.

dat <- dat %>% 
  group_by(ID) %>% 
  mutate(
    INTENT = case_when(
      any(USE & CALCMC == CALSTART) ~ "User",
      any(!USE & FPPLANYR & CALCMC == CALSTART) ~ "Plan",
      any(!USE & !FPPLANYR & CALCMC == CALSTART) ~ "No Plan",
    )
  )

Lastly, whether who intend to use the survival package or its counterpart functions in the survey package, you’ll want to filter down to use one row per person: this should be the first month of USE or the last reported month in CALCMC, whichever comes first.

With only one row per person remaining we’ll 1) create MONTH to count the number of months after the Phase 1 interview, and 2) remove any extra variables that won’t be necessary in our analysis.

dat <- dat %>% 
  group_by(ID) %>% 
  mutate(
    USEMO = case_when(USE ~ CALCMC),
    KEEP = ifelse(any(USE), min(USEMO, na.rm = TRUE), max(CALCMC))
  ) %>% 
  filter(KEEP == CALCMC) %>% 
  mutate(.before = USE, MONTH = CALCMC - CALSTART) %>% 
  select(-c(starts_with("CAL"), starts_with("FP"), USEMO, KEEP)) %>% 
  ungroup()

dat 
# A tibble: 17,725 × 8
      ID POP          PANELWEIGHT      EAID STRATA MONTH USE   INTENT 
   <int> <chr>              <dbl>     <dbl>  <int> <dbl> <lgl> <chr>  
 1     1 Burkina Faso       2.50  854111005  85402     6 TRUE  Plan   
 2     2 Burkina Faso       2.61  854111004  85402     3 TRUE  No Plan
 3     3 Burkina Faso       3.21  854161007  85402    12 FALSE No Plan
 4     4 Burkina Faso       0.538 854141008  85401    11 FALSE No Plan
 5     5 Burkina Faso       0.358 854191022  85402     0 FALSE Plan   
 6     6 Burkina Faso       3.20  854141002  85402    12 FALSE No Plan
 7     7 Burkina Faso       0.404 854131034  85401    11 FALSE No Plan
 8     8 Burkina Faso       0.314 854191036  85402    11 FALSE No Plan
 9     9 Burkina Faso       0.337 854121003  85401     0 TRUE  User   
10    10 Burkina Faso       2.55  854231003  85402     0 FALSE No Plan
11    11 Burkina Faso       0.221 854111006  85401     9 TRUE  No Plan
12    12 Burkina Faso       0.478 854191025  85402     0 TRUE  User   
13    13 Burkina Faso       0.412 854191026  85402     0 TRUE  User   
14    14 Burkina Faso       0.404 854191013  85402     0 TRUE  User   
15    15 Burkina Faso       0.395 854131021  85401     0 TRUE  User   
# … with 17,710 more rows

Unweighted Estimates

Let’s review the procedure highlighted in our previous post, where we used no survey design information at all. To estimate a survival model for one population, we’ll filter dat to include only women from one sample; we’ll use Kenya as an example.

The function Surv indicates whether the observation in the last MONTH for each woman was adopted USE of a method, or that her ultimate adoption is right-censored (if at happened at all). Surv is part of a model formula provided to the function survfit; in this case, we’re modeling USE with each woman’s original INTENT at Phase 1.

ke_survival <- dat %>% 
  filter(POP == "Kenya") %>% 
  survfit(Surv(MONTH, USE) ~ INTENT, data = .)

ke_survival 
Call: survfit(formula = Surv(MONTH, USE) ~ INTENT, data = .)

                  n events median 0.95LCL 0.95UCL
INTENT=No Plan 2997    419     NA      NA      NA
INTENT=Plan     676    328     10       9      11
INTENT=User    3266   3266      0      NA      NA

The output returned as ke_survival is an object in the survfit class defined by the survival package. We’ll get more information from this object if we pass it to tidy from the broom package.

broom::tidy(ke_survival) %>% print(n = Inf)
# A tibble: 29 × 9
    time n.risk n.event n.censor estimate std.error conf.high conf.low strata        
   <dbl>  <dbl>   <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl> <chr>         
 1     0   2997       0      269    1       0           1        1     INTENT=No Plan
 2     1   2728      53        0    0.981   0.00269     0.986    0.975 INTENT=No Plan
 3     2   2675      41        0    0.966   0.00362     0.972    0.959 INTENT=No Plan
 4     3   2634      24        0    0.957   0.00407     0.964    0.949 INTENT=No Plan
 5     4   2610      40        0    0.942   0.00475     0.951    0.933 INTENT=No Plan
 6     5   2570      26        0    0.933   0.00515     0.942    0.923 INTENT=No Plan
 7     6   2544      36        0    0.919   0.00567     0.930    0.909 INTENT=No Plan
 8     7   2508      34        0    0.907   0.00613     0.918    0.896 INTENT=No Plan
 9     8   2474      36        0    0.894   0.00660     0.905    0.882 INTENT=No Plan
10     9   2438      37        0    0.880   0.00707     0.892    0.868 INTENT=No Plan
11    10   2401      40        0    0.865   0.00755     0.878    0.853 INTENT=No Plan
12    11   2361      28      782    0.855   0.00788     0.869    0.842 INTENT=No Plan
13    12   1551      24     1336    0.842   0.00850     0.856    0.828 INTENT=No Plan
14    13    191       0      191    0.842   0.00850     0.856    0.828 INTENT=No Plan
15     0    676       0       90    1       0           1        1     INTENT=Plan   
16     1    586      28        0    0.952   0.00925     0.970    0.935 INTENT=Plan   
17     2    558      33        0    0.896   0.0141      0.921    0.872 INTENT=Plan   
18     3    525      45        0    0.819   0.0194      0.851    0.789 INTENT=Plan   
19     4    480      34        0    0.761   0.0231      0.796    0.727 INTENT=Plan   
20     5    446      35        0    0.701   0.0270      0.739    0.665 INTENT=Plan   
21     6    411      28        0    0.654   0.0301      0.693    0.616 INTENT=Plan   
22     7    383      35        0    0.594   0.0342      0.635    0.555 INTENT=Plan   
23     8    348      20        0    0.560   0.0366      0.601    0.521 INTENT=Plan   
24     9    328      23        0    0.520   0.0397      0.563    0.482 INTENT=Plan   
25    10    305      24        0    0.480   0.0430      0.522    0.441 INTENT=Plan   
26    11    281      15       66    0.454   0.0453      0.496    0.415 INTENT=Plan   
27    12    200       7      159    0.438   0.0473      0.481    0.399 INTENT=Plan   
28    13     34       1       33    0.425   0.0559      0.474    0.381 INTENT=Plan   
29     0   3266    3266        0    0     Inf          NA       NA     INTENT=User   

Notice that tidy creates an object in the tibble class with several helpful columns:

In the last row, strata shows the results for 3,266 women who were already using a method at month 0 (when Phase 1 interviews were conducted). The survival probability for these women is 0 because they “survived” no months of non-use.

Let’s create a quick step-wise Time to Event plot for the two groups with data after month 0.

Show ggplot theme
# Define `theme_pma` for maps
library(showtext)
sysfonts::font_add(
  family = "cabrito", 
  regular = "../../fonts/cabritosansnormregular-webfont.ttf"
)
showtext::showtext_auto()

theme_pma <- function(){
  legned_title =  "Planned to Adopt Any Method within 12 Months" %>% 
    str_wrap(16)
  list(
    theme_minimal() %+replace% 
      theme(
        text = element_text(family = "cabrito", size = 13),
        plot.title = element_text(
          size = 18, 
          color = pma_blue, 
          hjust = 0, 
          margin = margin(b = 10)
        ),
        panel.spacing = unit(1, "lines"),
        legend.position = "right"
      ),
    scale_y_continuous(labels = scales::percent, limits = c(0, 1)),
    scale_x_continuous(limits = c(0,15)), 
    scale_fill_manual(
      aesthetics = c("color", "fill"),
      values = c(
        "Plan" = pma_pink, 
        "No Plan" = pma_blue
      ),
      labels = c(
        "Plan" = "Yes", 
        "No Plan" = "No"
      )
    ), 
    labs(
      x = "Months After Baseline", 
      y = "Probability of Continued Non-use",
      fill = legned_title,
      color = legned_title
    )
  )
}
ke_survival %>% 
  broom::tidy() %>% 
  mutate(estimate = 1 - estimate, strata = strata %>% str_remove("INTENT=")) %>% 
  filter(strata != "User") %>% 
  ggplot(aes(x = time, y = estimate, color = strata)) +
  geom_step() + 
  theme_pma() + 
  labs(title = "Predicted time to FP Adoption for Phase 1 Non-users in Kenya")

No surprises here: women who planned to adopt a method at Phase 1 are expected to adopt one sooner than women with no such plans. Notice, however, that we didn’t bother to plot confidence intervals for each line: that’s because we’ll first want to incorporate information about the PMA sampling procedure with help from the survey package.

Survey Design

We can use a similar workflow with survey, but it’s important to remember that the output of survfit shown above was an object of a particular class called survfit - because survival is such a popular package, the developers of broom wrote specific methods for tidy-ing objects from this class.

survey uses the survival function Surv, but has its own modeling function svykm. Its output is not familiar to broom, so we’ll have to create our own tidy output from scratch.

Before we add survey design information, let’s ensure that we’re able to reproduce the above results with svykm. This time, we’ll pass dummy survey design information to as_survey_design via a test-variable w: we’ll use this to weight every case equally with the value 1.

ke_survey <- dat %>%
  filter(POP == "Kenya") %>%
  mutate(w = 1) %>%
  as_survey_design(weight = w) %>%
  svykm(Surv(MONTH, USE) ~ INTENT, design = .)

ke_survey
Weighted survival curves:
svykm(formula = Surv(MONTH, USE) ~ INTENT, design = .)
No Plan : Q1 = Inf  median = Inf  Q3 = Inf 
Plan : Q1 = 5  median = 10  Q3 = Inf 
User : Q1 = 0  median = 0  Q3 = 0 

Notice how this output looks different from ke_survival? If we try, broom will be unable to tidy it.

broom::tidy(ke_survey)
Error: No tidy method for objects of class svykmlist

You can think about ke_survey as a kind of special list, with one element per group in INTENT. Each of those elements also has a list-like structure with vectors representing time (months), surv (survival probability), and standard error (if specified). For example, you can use list-syntax to check that the monthly estimates in ke_survey match those in ke_survival for women who planned to adopt a method:

ke_survey$Plan$surv %>% round(3)
 [1] 1.000 1.000 0.952 0.896 0.819 0.761 0.701 0.654 0.594 0.560 0.520 0.480 0.454 0.438 0.425

We’ll want to compile all of these lists in a tibble, which we’ll demonstrate with the new recommended workflow for map with purrr version 1.0.0. (We now use list_rbind in place of map_dfr, which has been superseded).

ke_survey %>% 
  imap(~c(.x) %>% as_tibble() %>% mutate(strata = .y)) %>% 
  list_rbind() 
# A tibble: 32 × 3
    time  surv strata 
   <dbl> <dbl> <chr>  
 1     0 1     No Plan
 2     0 1     No Plan
 3     1 0.981 No Plan
 4     2 0.966 No Plan
 5     3 0.957 No Plan
 6     4 0.942 No Plan
 7     5 0.933 No Plan
 8     6 0.919 No Plan
 9     7 0.907 No Plan
10     8 0.894 No Plan
11     9 0.880 No Plan
12    10 0.865 No Plan
13    11 0.855 No Plan
14    12 0.842 No Plan
15    13 0.842 No Plan
# … with 17 more rows

We’ll ultimately want cluster-robust standard error estimates for each month, so this output isn’t very useful on its own. Instead, we’ll specify se = TRUE in svykm. Just make sure to drop women who were originally using a method at Phase 1: the estimated standard error for these cases is infinite in month 0, and R will likely stall if you don’t exclude them from svykm! We’ll do this after our call to as_survey_design in order to preserve the correct sample size.

ke_survey <- dat %>% 
  filter(POP == "Kenya") %>% 
  mutate(w = 1) %>% 
  as_survey_design(weight = w) %>% 
  filter(INTENT != "User") %>% # remove baseline users 
  svykm(Surv(MONTH, USE) ~ INTENT, design = ., se = TRUE) %>% # add se = TRUE
  imap(~c(.x) %>% as_tibble() %>% mutate(strata = .y)) %>% 
  list_rbind() 

ke_survey
# A tibble: 747 × 4
    time  surv      varlog strata 
   <dbl> <dbl>       <dbl> <chr>  
 1     1 1.00  0.000000134 No Plan
 2     1 0.999 0.000000269 No Plan
 3     1 0.999 0.000000403 No Plan
 4     1 0.999 0.000000537 No Plan
 5     1 0.998 0.000000671 No Plan
 6     1 0.998 0.000000805 No Plan
 7     1 0.997 0.000000938 No Plan
 8     1 0.997 0.00000107  No Plan
 9     1 0.997 0.00000121  No Plan
10     1 0.996 0.00000134  No Plan
11     1 0.996 0.00000147  No Plan
12     1 0.996 0.00000161  No Plan
13     1 0.995 0.00000174  No Plan
14     1 0.995 0.00000187  No Plan
15     1 0.995 0.00000200  No Plan
# … with 732 more rows

The new column varlog represents the log-scale variance, which is the square of our desired standard error. The confidence intervals shown above can be reproduced if we use p = 0.05 in qnorm and exponentiate the results.

ke_survey <- ke_survey %>% 
  group_by(strata, time) %>% 
  summarise(
    n.event = n(),
    estimate = min(surv),
    std.error = sqrt(varlog) %>% min(),
    conf.high = exp(log(estimate) - std.error * qnorm(0.05/2)),
    conf.low = exp(log(estimate) + std.error * qnorm(0.05/2))
  )

ke_survey
# A tibble: 25 × 7
# Groups:   strata [2]
   strata   time n.event estimate std.error conf.high conf.low
   <chr>   <dbl>   <int>    <dbl>     <dbl>     <dbl>    <dbl>
 1 No Plan     1      53    0.981  0.000367     0.981    0.980
 2 No Plan     2      41    0.966  0.00267      0.971    0.961
 3 No Plan     3      24    0.957  0.00357      0.964    0.950
 4 No Plan     4      40    0.943  0.00403      0.950    0.935
 5 No Plan     5      26    0.933  0.00469      0.942    0.925
 6 No Plan     6      36    0.920  0.00509      0.929    0.911
 7 No Plan     7      34    0.908  0.00560      0.918    0.898
 8 No Plan     8      36    0.894  0.00606      0.905    0.884
 9 No Plan     9      37    0.881  0.00652      0.892    0.870
10 No Plan    10      40    0.866  0.00698      0.878    0.855
11 No Plan    11      28    0.856  0.00745      0.869    0.844
12 No Plan    12      24    0.843  0.00779      0.856    0.830
13 Plan        1      28    0.953  0.00171      0.957    0.950
14 Plan        2      33    0.899  0.00899      0.915    0.883
15 Plan        3      45    0.825  0.0135       0.847    0.803
# … with 10 more rows

This is close to the output from ke_survival, except that you’ll notice we have no rows for months were all cases are censored (e.g. month 0). This leaves us with no column analogous to n.risk. We’ll create it from dat and then left_join the results from ke_survey. If you’d like to have a column for censored cases in n.censor, it can be made by subtracting n.event from the count of women in each month (n). Lastly, we’ll carry the same std.error and confidence interval downward for all new rows where only censored cases are listed.

ke_survey <- dat %>% 
  filter(POP == "Kenya") %>% 
  group_by(INTENT) %>% 
  count(MONTH) %>% 
  mutate(n.risk = if_else(MONTH == 0, sum(n), lag(sum(n) - n))) %>% 
  rename(strata = INTENT, time = MONTH) %>% 
  left_join(ke_survey) %>% 
  mutate(
    n.event = n.event %>% replace_na(0), 
    n.censor = n - n.event,
    across(
      c(estimate, std.error, starts_with("conf")),
      ~case_when(
        time == 0 ~ as.double(cur_column() != "std.error"),
        !is.na(.x) ~ .x,
        is.na(.x) ~ lag(.x)
      )
    )
  )

ke_survey
# A tibble: 29 × 10
# Groups:   strata [3]
   strata   time     n n.risk n.event estimate std.error conf.high conf.low n.censor
   <chr>   <dbl> <int>  <int>   <int>    <dbl>     <dbl>     <dbl>    <dbl>    <int>
 1 No Plan     0   269   2997       0    1      0            1        1          269
 2 No Plan     1    53   2728      53    0.981  0.000367     0.981    0.980        0
 3 No Plan     2    41   2944      41    0.966  0.00267      0.971    0.961        0
 4 No Plan     3    24   2956      24    0.957  0.00357      0.964    0.950        0
 5 No Plan     4    40   2973      40    0.943  0.00403      0.950    0.935        0
 6 No Plan     5    26   2957      26    0.933  0.00469      0.942    0.925        0
 7 No Plan     6    36   2971      36    0.920  0.00509      0.929    0.911        0
 8 No Plan     7    34   2961      34    0.908  0.00560      0.918    0.898        0
 9 No Plan     8    36   2963      36    0.894  0.00606      0.905    0.884        0
10 No Plan     9    37   2961      37    0.881  0.00652      0.892    0.870        0
11 No Plan    10    40   2960      40    0.866  0.00698      0.878    0.855        0
12 No Plan    11   810   2957      28    0.856  0.00745      0.869    0.844      782
13 No Plan    12  1360   2187      24    0.843  0.00779      0.856    0.830     1336
14 No Plan    13   191   1637       0    0.843  0.00779      0.856    0.830      191
15 Plan        0    90    676       0    1      0            1        1           90
# … with 14 more rows

Now that we know how to obtain identical results from survival and survey, we’ll rebuild our table with the actual survey weights, cluster IDs, and stratum IDs in as_survey_design.

# Fit the survival model with information from `as_survey_design`
 ke_survey <- dat %>% 
  filter(POP == "Kenya") %>% 
  as_survey_design(weight = PANELWEIGHT, id = EAID, strata = STRATA) %>% 
  filter(INTENT != "User") %>% 
  svykm(Surv(MONTH, USE) ~ INTENT, design = ., se = TRUE) 

# Create tidy output from the `svykm` object
ke_survey <- ke_survey %>% 
  imap(~c(.x) %>% as_tibble() %>% mutate(strata = .y)) %>% 
  list_rbind() %>% 
  group_by(strata, time) %>% 
  summarise(
    n.event = n(),
    estimate = min(surv),
    std.error = sqrt(varlog) %>% min(),
    conf.high = exp(log(estimate) - std.error * qnorm(0.05/2)),
    conf.low = exp(log(estimate) + std.error * qnorm(0.05/2))
  )

# Complete tidy output with information about censored cases in each month
ke_survey <- dat %>% 
  filter(POP == "Kenya") %>% 
  group_by(INTENT) %>% 
  count(MONTH) %>% 
  mutate(n.risk = if_else(MONTH == 0, sum(n), lag(sum(n) - n))) %>% 
  rename(strata = INTENT, time = MONTH) %>% 
  left_join(ke_survey) %>% 
  mutate(
    n.event = n.event %>% replace_na(0), 
    n.censor = n - n.event,
    across(
      c(estimate, std.error, starts_with("conf")),
      ~case_when(
        time == 0 ~ as.double(cur_column() != "std.error"),
        !is.na(.x) ~ .x,
        is.na(.x) ~ lag(.x)
      )
    )
  )

ke_survey
# A tibble: 29 × 10
# Groups:   strata [3]
   strata   time     n n.risk n.event estimate std.error conf.high conf.low n.censor
   <chr>   <dbl> <int>  <int>   <int>    <dbl>     <dbl>     <dbl>    <dbl>    <int>
 1 No Plan     0   269   2997       0    1      0            1        1          269
 2 No Plan     1    53   2728      53    0.981  0.000201     0.982    0.981        0
 3 No Plan     2    41   2944      41    0.965  0.00279      0.971    0.960        0
 4 No Plan     3    24   2956      24    0.958  0.00418      0.966    0.950        0
 5 No Plan     4    40   2973      40    0.942  0.00487      0.951    0.933        0
 6 No Plan     5    26   2957      26    0.934  0.00556      0.944    0.924        0
 7 No Plan     6    36   2971      36    0.920  0.00583      0.931    0.910        0
 8 No Plan     7    34   2961      34    0.907  0.00689      0.920    0.895        0
 9 No Plan     8    36   2963      36    0.898  0.00767      0.911    0.884        0
10 No Plan     9    37   2961      37    0.883  0.00808      0.897    0.869        0
11 No Plan    10    40   2960      40    0.869  0.00886      0.884    0.854        0
12 No Plan    11   810   2957      28    0.860  0.00982      0.876    0.843      782
13 No Plan    12  1360   2187      24    0.845  0.0103       0.862    0.828     1336
14 No Plan    13   191   1637       0    0.845  0.0103       0.862    0.828      191
15 Plan        0    90    676       0    1      0            1        1           90
# … with 14 more rows

Finally, we’ll plot these new weighted estimates, this time also adding their associated cluster-robust confidence intervals with geom_rect.

ke_survey %>% 
  filter(strata != "User") %>% 
  group_by(strata) %>% 
  mutate(
    across(c(estimate, starts_with("conf")), ~1 - .x),
    xmax = if_else(time == max(time), time, time + 1)
  ) %>% 
  ggplot(aes(x = time, y = estimate, fill = strata)) +
  geom_step(linewidth = 0.25) + 
  geom_rect(
    aes(xmin = time, xmax = xmax, ymin = conf.low, ymax = conf.high),
    alpha = 0.2, color = 0
  ) + 
  theme_pma() + 
  labs(
    title = "Predicted time to FP Adoption for Phase 1 Non-users in Kenya",
    subtitle = "Weighted estimates (95% CI)"
  )

Data with Multiple Samples

So far, we’ve only generated survival curves for one of the samples in our original data extract. Now we’ll show how to iterate through each of the samples, and then compbine their results in a faceted plot.

The tricky thing here is that we need to be careful to avoid combining sample sizes in the formula survey uses to calculate degrees of freedom. Remember our point above about not subsetting the Kenya sample to exclude Phase 1 users prior to defining its survey design? By contrast, you should subset your data extract before defining survey design if your extract contains multiple samples.

Fortunately, this procedure is made very simple with two functions from dplyr: we’ll use group_by followed by group_modify.

In group_by, we’ll subset dat by the variable POP so that everything in group_modify applies only to the members of each group.

Then, in group_modify, we’ll signal an anonymous function with ~{}, inside of which .x references only the data within each group. Hence .x replaces all of the code where we’d written dat %>% filter(POP == "Kenya") above. We also replace the name ke_survey with a more general name output.

dat_surv <- dat %>% 
  group_by(POP) %>% 
  group_modify(~{
    # Fit the survival model with information from `as_survey_design`
    output <- .x %>% 
      as_survey_design(weight = PANELWEIGHT, id = EAID, strata = STRATA) %>% 
      filter(INTENT != "User") %>% 
      svykm(Surv(MONTH, USE) ~ INTENT, design = ., se = TRUE) 
    
    # Create tidy output from the `svykm` object
    output <- output %>% 
      imap(~c(.x) %>% as_tibble() %>% mutate(strata = .y)) %>% 
      list_rbind() %>% 
      group_by(strata, time) %>% 
      summarise(
        n.event = n(),
        estimate = min(surv),
        std.error = sqrt(varlog) %>% min(),
        conf.high = exp(log(estimate) - std.error * qnorm(0.05/2)),
        conf.low = exp(log(estimate) + std.error * qnorm(0.05/2))
      )
    
    # Complete tidy output with information about censored cases in each month
    output <- .x %>%
      group_by(INTENT) %>% 
      count(MONTH) %>% 
      mutate(n.risk = if_else(MONTH == 0, sum(n), lag(sum(n) - n))) %>% 
      rename(strata = INTENT, time = MONTH) %>% 
      left_join(output) %>% 
      mutate(
        n.event = n.event %>% replace_na(0), 
        n.censor = n - n.event,
        across(
          c(estimate, std.error, starts_with("conf")),
          ~case_when(
            time == 0 ~ as.double(cur_column() != "std.error"),
            !is.na(.x) ~ .x,
            is.na(.x) ~ lag(.x)
          )
        )
      )
  }) %>% 
  ungroup()

dat_surv
# A tibble: 175 × 11
   POP          strata   time     n n.risk n.event estimate std.error conf.high conf.low n.censor
   <chr>        <chr>   <dbl> <int>  <int>   <int>    <dbl>     <dbl>     <dbl>    <dbl>    <int>
 1 Burkina Faso No Plan     0   348   2686       0    1      0            1        1          348
 2 Burkina Faso No Plan     1    30   2338      30    0.990  0.000334     0.990    0.989        0
 3 Burkina Faso No Plan     2    20   2656      20    0.984  0.00304      0.990    0.979        0
 4 Burkina Faso No Plan     3    16   2666      16    0.977  0.00424      0.985    0.969        0
 5 Burkina Faso No Plan     4    22   2670      22    0.967  0.00500      0.976    0.957        0
 6 Burkina Faso No Plan     5    16   2664      16    0.958  0.00617      0.970    0.947        0
 7 Burkina Faso No Plan     6    18   2670      18    0.952  0.00666      0.964    0.940        0
 8 Burkina Faso No Plan     7    22   2668      22    0.943  0.00751      0.957    0.929        0
 9 Burkina Faso No Plan     8    17   2664      17    0.937  0.00783      0.952    0.923        0
10 Burkina Faso No Plan     9    32   2669      32    0.927  0.00820      0.942    0.912        0
11 Burkina Faso No Plan    10   152   2654      39    0.911  0.00936      0.928    0.895      113
12 Burkina Faso No Plan    11  1100   2534      29    0.902  0.0103       0.920    0.884     1071
13 Burkina Faso No Plan    12   783   1586       7    0.894  0.0116       0.915    0.874      776
14 Burkina Faso No Plan    13    84   1903       3    0.879  0.0131       0.902    0.857       81
15 Burkina Faso No Plan    14    26   2602       0    0.879  0.0131       0.902    0.857       26
# … with 160 more rows

The resulting tibble dat_surv includes the POP variable we used to define each group. That’s helpful, because we’ll now want to use add facet_wrap(~POP) to our previous ggplot2 code, thereby creating one panel from each sample.

dat_surv %>% 
  filter(strata != "User") %>% 
  group_by(POP, strata) %>% 
  mutate(
    across(c(estimate, starts_with("conf")), ~1 - .x),
    xmax = if_else(time == max(time), time, time + 1)
  ) %>% 
  ggplot(aes(x = time, y = estimate, fill = strata)) +
  geom_step(linewidth = 0.25) + 
  geom_rect(
    aes(xmin = time, xmax = xmax, ymin = conf.low, ymax = conf.high),
    alpha = 0.2, color = 0
  ) + 
  theme_pma() + 
  labs(
    title = "Predicted time to FP Adoption for Phase 1 Non-users",
    subtitle = "Weighted estimates (95% CI)"
  ) + 
  facet_wrap(~POP) # one panel per sample 


  1. Certain functions like survival::survfit function do accept a weight argument, but these are intended for frequency weights representing duplicate observations. Our discussion concerns survey weights representing each woman’s sampling probability. You can use the weight arguments in the survival package to obtain weighted point-estimates, but the standard error estimation will be incorrect. See discussion here.↩︎

  2. PANELWEIGHT should be used in analyses of all questions on the female questionnaire for women who were interviewed in both Phase 1 and Phase 2 survey rounds. See our weighting guide for details.↩︎

  3. We discuss reasons for missing data in our previous Contraceptive Calendar post. The data we feature here have been modified to include women who maintained the same family planning status throughout the duration of the calendar period.↩︎

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.