Merging Service Delivery Point Data to Household & Female Records

Individuals in Context Data Manipulation Service Delivery Points pivot_longer join

Create aggregate measures for women living the areas served by SDPs

Matt Gunther (IPUMS PMA Senior Data Analyst)
01-29-2021

Welcome to the third post in a series all about using PMA Service Delivery Point (SDP) data to better understand Individuals in Context. In our last post, we discussed a few of the variable groups related to contraceptive availability, and we showed how to use functions like dplyr::across to recode and summarize these variable groups in preparation for merging with Household and Female data.

Before we dive in, let’s quickly revisit the geographic sampling units - or enumeration areas - we’ll be using to link SDPs with their counterparts in the Household and Female data.

Reviewing SDP Sample Design

Remember: the SDP sample design selects facilities meant to reflect the health service environment experienced by individuals included in Household and Female samples. If you were designing survey with this goal in mind, how would you select facilities?

Well, you might target a sample of facilities located within the same geographic sampling units PMA used to define Household and Female samples from the same country in the same year. Presumably, the health services available to a woman living in enumeration area X would be captured pretty well if we surveyed a list of facilities also located in enumeration area X.

But what happens if a lot of women living in enumeration area X travel to enumeration area Y to receive family planning services? In that case, you’d want to know as much as possible about the service catchment areas for facilities in that country. Then, you could select facilities based on whether they provide services to enumeration area X, rather than relying simply to those that are located there.

In fact, PMA partners with government health agencies to obtain information about the service catchment area for all of the public-sector health facilities in each participating country. As a result, public SDPs are sampled if one of the enumeration areas used in a corresponding Household and Female sample appears in their service catchment area.

Because service catchment data are only available for public facilities, PMA uses a different method to select private-sector facilities. A private facility will be selected for a SDP sample only if it is located inside the boundaries of an enumeration area included in a corresponding Household and Female sample.

Setup: Create and Load a Data Extract

Let’s take a look at an example SDP dataset to see how all of this information gets reported. We’ll use the same data we highlighted in our last post, which includes facilities sampled from Burkina Faso in 2017 and 2018. First, load the following packages into R:

Again in this post, we’ll be working with all of the available contraceptive services and stock variables ending with the suffixes PROV, OBS, OUT3MO, and OUTDAY. We’ll also add the variable group EASERVED, which - as we’ll see - stores information about the service catchment area for facilities where that information was available. Finally, we’ll add a few more variables that we’ll explore a bit later: AUTHORITY, FACILITYTYPE, and FACILITYTYPEGEN.

We’ll first load the data using ipumsr::read_ipums_micro:

sdp <- ipumsr::read_ipums_micro(
  ddi = "data/pma_00008.xml",
  data = "data/pma_00008.dat.gz") 

Then, following the steps outlined in our last post, we’ll apply a couple of recoding functions from ipumsr.

sdp <- sdp %>% 
  select(-EASERVED) %>% # error from extract system
  mutate(
    across(ends_with("OBS"), ~lbl_relabel(
      .x,
      lbl(1, "in stock") ~ .val %in% 1:2,
      lbl(0, "out of stock") ~ .val == 3
    )),
    across(everything(), ~lbl_na_if(
      .x,
      ~.lbl %in% c(
        "Not interviewed (SDP questionnaire)",
        "Don't know",
        "No response or missing",
        "NIU (not in universe)"
      )
    ))
  )

EAID and EASERVED

For the moment, let’s just take a look at the basic structure of our data, selecting only the variables FACILITYID, SAMPLE, AUTHORITY, CONSENTSQ, and EAID. For this preview, we’ll also arrange the data in ascending order of FACILITYID and SAMPLE:

sdp %>% 
  select(FACILITYID, SAMPLE, AUTHORITY, CONSENTSQ, EAID) %>% 
  arrange(FACILITYID, SAMPLE)
# A tibble: 234 x 5
   FACILITYID                      SAMPLE    AUTHORITY CONSENTSQ  EAID
    <int+lbl>                   <int+lbl>    <int+lbl> <int+lbl> <dbl>
 1       7006 85405 [Burkina Faso 2017 R… 1 [Governme…   1 [Yes]  7390
 2       7006 85408 [Burkina Faso 2018 R… 1 [Governme…   1 [Yes]  7390
 3       7027 85405 [Burkina Faso 2017 R… 1 [Governme…   1 [Yes]  7332
 4       7027 85408 [Burkina Faso 2018 R… 1 [Governme…   1 [Yes]  7332
 5       7029 85405 [Burkina Faso 2017 R… 4 [Private]    1 [Yes]  7111
 6       7029 85408 [Burkina Faso 2018 R… 4 [Private]    0 [No]   7111
 7       7036 85408 [Burkina Faso 2018 R… 1 [Governme…   1 [Yes]  7412
 8       7046 85408 [Burkina Faso 2018 R… 4 [Private]    1 [Yes]  7798
 9       7048 85405 [Burkina Faso 2017 R… 1 [Governme…   1 [Yes]  7009
10       7051 85408 [Burkina Faso 2018 R… 1 [Governme…   1 [Yes]  7798
# … with 224 more rows

Each row in our data represents one facility from one sample. Notice that some - but not all - facilities appear once in sample 85405 (from 2017), and again in sample 85408 (from 2018).

The variable AUTHORITY shows the managing authority for each facility. Following the discussion above, we’ll expect to find information about the service catchment area for each facility where the managing authority is 1 - Government.

Also notice CONSENTSQ, which indicates whether a respondent at each facility consented to be interviewed. When you first obtain a data extract, you should expect most variables to be marked Not interviewed (SDP questionnaire) for facilities where CONSENTSQ shows 0 - No. However, we’ve already taken the extra step of marking all non-response values NA: we should now expect to see NA substituted for Not interviewed (SDP questionnaire).

Lastly, take particular note of the variable EAID: in SDP data, EAID shows the identification code associated with the enumeration area where a facility is located.

We’ll find information about the service catchment area for each facility in a different set of variables, each starting with with prefix EASERVED:

sdp %>% 
  select(starts_with("EASERVED")) 
# A tibble: 234 x 18
   EASERVED1 EASERVED2 EASERVED3 EASERVED4 EASERVED5 EASERVED6
   <int+lbl> <int+lbl> <int+lbl> <int+lbl> <int+lbl> <int+lbl>
 1      7380      7323      7491      7605      7142      7279
 2      7879      7516      7111      7554      7934      7791
 3      7483        NA        NA        NA        NA        NA
 4      7185        NA        NA        NA        NA        NA
 5      7725      7859      7472      7175        NA        NA
 6      7082        NA        NA        NA        NA        NA
 7      7650        NA        NA        NA        NA        NA
 8      7955        NA        NA        NA        NA        NA
 9      7323        NA        NA        NA        NA        NA
10      7774        NA        NA        NA        NA        NA
# … with 224 more rows, and 12 more variables: EASERVED7 <int+lbl>,
#   EASERVED8 <int+lbl>, EASERVED9 <int+lbl>, EASERVED10 <int+lbl>,
#   EASERVED11 <int+lbl>, …

You’ll notice that our extract contains 18 EASERVED variables. Why 18? If you created your own data extract, you’ll remember that you only selected one variable called EASERVED: once you’ve selected samples, the IPUMS extract system automatically determines the correct number of EASERVED variables for your dataset based on the facility with the largest service catchment list.

As we’ve discussed, PMA only receives service catchment information about public-sector facilities. In their case, each EASERVED variable contains an ID code for one of the enumeration areas in its service catchment list, or else it’s NA. We’ll look at these public-sector facilities first:

sdp %>% count(AUTHORITY)
# A tibble: 3 x 2
        AUTHORITY     n
*       <int+lbl> <int>
1 1 [Government]    202
2 3 [Faith-based]     3
3 4 [Private]        29

The vast majority of SDPs in our sample are public-sector facilities. They comprise 202 of the 234 facilities in our sample.

sdp %>% 
  filter(AUTHORITY == 1) %>% 
  select(starts_with("EASERVED")) 
# A tibble: 202 x 18
   EASERVED1 EASERVED2 EASERVED3 EASERVED4 EASERVED5 EASERVED6
   <int+lbl> <int+lbl> <int+lbl> <int+lbl> <int+lbl> <int+lbl>
 1      7380      7323      7491      7605      7142      7279
 2      7879      7516      7111      7554      7934      7791
 3      7483        NA        NA        NA        NA        NA
 4      7185        NA        NA        NA        NA        NA
 5      7725      7859      7472      7175        NA        NA
 6      7082        NA        NA        NA        NA        NA
 7      7650        NA        NA        NA        NA        NA
 8      7955        NA        NA        NA        NA        NA
 9      7323        NA        NA        NA        NA        NA
10      7774        NA        NA        NA        NA        NA
# … with 192 more rows, and 12 more variables: EASERVED7 <int+lbl>,
#   EASERVED8 <int+lbl>, EASERVED9 <int+lbl>, EASERVED10 <int+lbl>,
#   EASERVED11 <int+lbl>, …

Using two of the dplyr functions discussed in our last post - summarize and across - we’ll get a better sense of the catchment areas for our public-sector SDPs. Let’s see how many missing values exist for each of these EASERVED variables:

sdp %>% 
  filter(AUTHORITY == 1) %>% 
  summarise(across(starts_with("EASERVED"), ~sum(is.na(.x))))
# A tibble: 1 x 18
  EASERVED1 EASERVED2 EASERVED3 EASERVED4 EASERVED5 EASERVED6
      <int>     <int>     <int>     <int>     <int>     <int>
1         0       156       173       181       190       192
# … with 12 more variables: EASERVED7 <int>, EASERVED8 <int>,
#   EASERVED9 <int>, EASERVED10 <int>, EASERVED11 <int>, …

We see that every public facility serves at least one enumeration area (there are no missing values for EASERVED1). However, there are 156 missing values for EASERVED2, which tells us that 156 public facilities only serve one enumeration area. Likewise: 173 facilities serve 2 enumeration areas or fewer, 181 serve 3 or fewer, and so forth.

What about the 32 non-public facilities?

sdp %>% 
  filter(AUTHORITY != 1) %>% 
  summarise(across(starts_with("EASERVED"), ~sum(is.na(.x))))
# A tibble: 1 x 18
  EASERVED1 EASERVED2 EASERVED3 EASERVED4 EASERVED5 EASERVED6
      <int>     <int>     <int>     <int>     <int>     <int>
1         4        32        32        32        32        32
# … with 12 more variables: EASERVED7 <int>, EASERVED8 <int>,
#   EASERVED9 <int>, EASERVED10 <int>, EASERVED11 <int>, …

PMA receives no information about the service catchment areas for these facilities, so - as you might expect - there are 32 missing values for EASERVED2 onward. Note, however, that there are only 4 missing values for EASERVED1: for non-public facilities, EASERVED1 usually contains that same enumeration area code shown in EAID (this is the enumeration area where the facility is, itself, located).

The exception to this rule comes from facilities where CONSENTSQ shows that no respondent provided consent to be interviewed. If we’d like, we can copy EAID to EASERVED1 for these facilities using dplyr::case_when:

sdp <- sdp %>% 
  mutate(EASERVED1 = case_when(
    is.na(EASERVED1) ~ EAID,
    T ~ as.double(EASERVED1)
  ))

Now, every SDP has at least one enumeration area included in the EASERVED group. This will be important in our next step, where we’ll see how to summarize the SDP data by groups of facilities serving the same enumeration area.

Pivot Longer: EASERVED in Rows

Now that we’re familiar with EASERVED variables, let’s take a look at the kinds of summary statistics we might want to construct from variables related to contraceptive service availability. For example, consider EMRGPROV, which indicates whether a facility provides emergency contraceptives to clients.

Remember that, right now, each row of our SDP dataset represents responses from one facility per sample. We’ll ultimately want to count the number of facilities providing emergency contraceptives to clients in each enumeration area, so we should use the tidyr function pivot_longer to reshape the data in a way that repeats each facility’s response to EMRGPROV once for every enumeration area that it serves.

Take, for example, the first 5 facilities in our dataset: for now, let’s just look at the first two EASERVED variables, along with each facility’s FACILITYID, EAID, and EMRGPROV response:

sdp %>% 
  slice(1:5) %>% 
   select(FACILITYID, EASERVED1, EASERVED2, EMRGPROV)
# A tibble: 5 x 4
  FACILITYID EASERVED1 EASERVED2  EMRGPROV
   <int+lbl>     <dbl> <int+lbl> <int+lbl>
1       7250      7380      7323   0 [No] 
2       7399      7879      7516   0 [No] 
3       7506      7483        NA   0 [No] 
4       7982      7185        NA   0 [No] 
5       7065      7725      7859   1 [Yes]

Among these 5 facilities, only facility 7065 provides emergency contraceptives. This facility happens to provide services to 2 enumeration areas: 7725 and 7859. When we use pivot_longer, we’ll reshape the data to emphasize a different conclusion: our example shows two enumeration areas where individuals can access emergency contraceptives. We convey this information by placing each enumeration area from EASERVED1 or EASERVED2 in its own row:

sdp %>% 
  slice(1:5) %>% 
  select(FACILITYID, EASERVED1, EASERVED2, EMRGPROV) %>% 
  pivot_longer(
    cols = starts_with("EASERVED"),
    values_to = "EASERVED",
    names_to = NULL
  )
# A tibble: 10 x 3
   FACILITYID  EMRGPROV  EASERVED
    <int+lbl> <int+lbl> <dbl+lbl>
 1       7250   0 [No]       7380
 2       7250   0 [No]       7323
 3       7399   0 [No]       7879
 4       7399   0 [No]       7516
 5       7506   0 [No]       7483
 6       7506   0 [No]         NA
 7       7982   0 [No]       7185
 8       7982   0 [No]         NA
 9       7065   1 [Yes]      7725
10       7065   1 [Yes]      7859

Now, we find that each of the values previously stored in EASERVED1 and EASERVED2 appear in a new column, EASERVED. Each facility occupies two rows: one for each of the enumeration areas that it serves.

What about the rows where EASERVED contains NA? These rows are meaningless: we’re repeating each facility’s response to EMRGPROV twice to represent two enumeration areas, but facilities 7506 and 7982 only serve one enumeration area apiece. We should include the argument values_drop_na = T to drop these rows when we use pivot_longer():

sdp %>% 
  slice(1:5) %>% 
  select(FACILITYID, EASERVED1, EASERVED2, EMRGPROV) %>% 
  pivot_longer(
    cols = starts_with("EASERVED"),
    values_to = "EASERVED",
    names_to = NULL,
    values_drop_na = T
  )
# A tibble: 8 x 3
  FACILITYID  EMRGPROV  EASERVED
   <int+lbl> <int+lbl> <dbl+lbl>
1       7250   0 [No]       7380
2       7250   0 [No]       7323
3       7399   0 [No]       7879
4       7399   0 [No]       7516
5       7506   0 [No]       7483
6       7982   0 [No]       7185
7       7065   1 [Yes]      7725
8       7065   1 [Yes]      7859

Now that we know how to pivot_longer, let’s apply the function to our full dataset:

sdp <- sdp %>%
  pivot_longer(
    cols = starts_with("EASERVED"),
    values_to = "EASERVED",
    values_drop_na = T,
    names_to = NULL
  ) %>%
  distinct() # in case any facility listed the same EASERVED twice

Dropping each row where EASERVED is missing, we’re left with 372 rows where information about each SDP gets repeated once for every enumeration area that it serves. (Remember: our original dataset contained only 234 rows because SDPs occupied just one row apiece).

sdp %>% select(FACILITYID, EASERVED, everything())
# A tibble: 372 x 58
   FACILITYID EASERVED      SAMPLE COUNTRY  YEAR ROUND  EAID CONSENTSQ
    <int+lbl> <dbl+lb>   <int+lbl> <int+l> <int> <dbl> <dbl> <int+lbl>
 1       7250     7380 85405 [Bur… 1 [Bur…  2017     5  7142   1 [Yes]
 2       7250     7323 85405 [Bur… 1 [Bur…  2017     5  7142   1 [Yes]
 3       7250     7491 85405 [Bur… 1 [Bur…  2017     5  7142   1 [Yes]
 4       7250     7605 85405 [Bur… 1 [Bur…  2017     5  7142   1 [Yes]
 5       7250     7142 85405 [Bur… 1 [Bur…  2017     5  7142   1 [Yes]
 6       7250     7279 85405 [Bur… 1 [Bur…  2017     5  7142   1 [Yes]
 7       7250     7370 85405 [Bur… 1 [Bur…  2017     5  7142   1 [Yes]
 8       7250     7725 85405 [Bur… 1 [Bur…  2017     5  7142   1 [Yes]
 9       7250     7811 85405 [Bur… 1 [Bur…  2017     5  7142   1 [Yes]
10       7250     7859 85405 [Bur… 1 [Bur…  2017     5  7142   1 [Yes]
# … with 362 more rows, and 50 more variables: STRATA <int+lbl>,
#   FACILITYTYPE <int+lbl>, FACILITYTYPEGEN <int+lbl>,
#   AUTHORITY <int+lbl>, CONPROV <int+lbl>, …

Summarise by EASERVED and SAMPLE

Now that we’ve reshaped our data, we’ll be able to create some simple summary statistics about each of the enumeration areas served by the facilities in our sample. First, let’s group_by(EASERVED, SAMPLE) and count() the number of facilities providing services to each enumeration area in each of our samples:

sdp %>% 
  group_by(EASERVED, SAMPLE) %>% 
  summarize(
    .groups = "keep", 
    N_SDP = n()
  )
# A tibble: 149 x 3
# Groups:   EASERVED, SAMPLE [149]
    EASERVED                            SAMPLE N_SDP
   <dbl+lbl>                         <int+lbl> <int>
 1      7003 85405 [Burkina Faso 2017 Round 5]     3
 2      7003 85408 [Burkina Faso 2018 Round 6]     2
 3      7006 85405 [Burkina Faso 2017 Round 5]     2
 4      7006 85408 [Burkina Faso 2018 Round 6]     1
 5      7009 85405 [Burkina Faso 2017 Round 5]     3
 6      7009 85408 [Burkina Faso 2018 Round 6]     2
 7      7016 85405 [Burkina Faso 2017 Round 5]     3
 8      7016 85408 [Burkina Faso 2018 Round 6]     2
 9      7026 85405 [Burkina Faso 2017 Round 5]     3
10      7026 85408 [Burkina Faso 2018 Round 6]     3
# … with 139 more rows

Continuing with the variable EMRGPROV, we can now also count the number of sampled facilities providing emergency contraception to each EASERVED:

sdp %>% 
  group_by(EASERVED, SAMPLE) %>% 
  summarize(
    .groups = "keep",
    N_EMRGPROV = sum(EMRGPROV)
  )
# A tibble: 149 x 3
# Groups:   EASERVED, SAMPLE [149]
    EASERVED                            SAMPLE N_EMRGPROV
   <dbl+lbl>                         <int+lbl>      <int>
 1      7003 85405 [Burkina Faso 2017 Round 5]          0
 2      7003 85408 [Burkina Faso 2018 Round 6]          0
 3      7006 85405 [Burkina Faso 2017 Round 5]          0
 4      7006 85408 [Burkina Faso 2018 Round 6]          0
 5      7009 85405 [Burkina Faso 2017 Round 5]          2
 6      7009 85408 [Burkina Faso 2018 Round 6]          1
 7      7016 85405 [Burkina Faso 2017 Round 5]          0
 8      7016 85408 [Burkina Faso 2018 Round 6]          0
 9      7026 85405 [Burkina Faso 2017 Round 5]          0
10      7026 85408 [Burkina Faso 2018 Round 6]          0
# … with 139 more rows

What if we want to include a count of the facilities providing each of the different contraceptive methods in our data? Building on a technique showcased in our last post, we could use dplyr::across to iterate over all variables ending with the suffix PROV:

sdp %>% 
  group_by(EASERVED, SAMPLE) %>% 
  summarize(
    .groups = "keep",
    across(ends_with("PROV"), ~sum(.x), .names = "N_{.col}")
  ) 
# A tibble: 149 x 15
# Groups:   EASERVED, SAMPLE [149]
   EASERVED      SAMPLE N_CONPROV N_CYCBPROV N_DEPOPROV N_DIAPROV
   <dbl+lb>   <int+lbl>     <int>      <int>      <int>     <int>
 1     7003 85405 [Bur…         3          3          3         0
 2     7003 85408 [Bur…         2          2          2         0
 3     7006 85405 [Bur…         2          2          2         0
 4     7006 85408 [Bur…         1          1          1         0
 5     7009 85405 [Bur…         3          1          3         0
 6     7009 85408 [Bur…         2          2          2         0
 7     7016 85405 [Bur…         3          3          3         0
 8     7016 85408 [Bur…         2          1          2         0
 9     7026 85405 [Bur…         3          3          3         0
10     7026 85408 [Bur…         3          3          3         0
# … with 139 more rows, and 9 more variables: N_EMRGPROV <int>,
#   N_FCPROV <int>, N_FSTPROV <int>, N_FJPROV <int>, N_IMPPROV <int>,
#   …

We’ll reduce this information even further, creating a variable NUM_METHODS_PROV indicating the number of methods provided by at least one sampled facility:

sdp %>% 
  group_by(EASERVED, SAMPLE) %>% 
  summarize(
    .groups = "keep",
    across(ends_with("PROV"), ~sum(.x), .names = "N_{.col}")
  ) %>% 
  transmute(
    EASERVED,
    SAMPLE,
    NUM_METHODS_PROV = sum(c_across(ends_with("PROV")) > 0, na.rm = T)
  )
# A tibble: 149 x 3
# Groups:   EASERVED, SAMPLE [149]
    EASERVED                            SAMPLE NUM_METHODS_PROV
   <dbl+lbl>                         <int+lbl>            <int>
 1      7003 85405 [Burkina Faso 2017 Round 5]               10
 2      7003 85408 [Burkina Faso 2018 Round 6]                8
 3      7006 85405 [Burkina Faso 2017 Round 5]               10
 4      7006 85408 [Burkina Faso 2018 Round 6]                8
 5      7009 85405 [Burkina Faso 2017 Round 5]                9
 6      7009 85408 [Burkina Faso 2018 Round 6]               10
 7      7016 85405 [Burkina Faso 2017 Round 5]                9
 8      7016 85408 [Burkina Faso 2018 Round 6]                8
 9      7026 85405 [Burkina Faso 2017 Round 5]               10
10      7026 85408 [Burkina Faso 2018 Round 6]               10
# … with 139 more rows

In our last post, we introduced 4 variable groups related to the availability of different contraceptive methods. We’ll now create a summary variable for each one, and then show how to attach our new variables to a Household and Female dataset:

sdp <- sdp %>% 
  group_by(EASERVED, SAMPLE) %>% 
  summarize(
    .groups = "keep",
    N_SDP = n(),
    across(ends_with("PROV"), ~sum(.x, na.rm = T), .names = "N_{.col}"),
    across(ends_with("OBS"), ~sum(.x, na.rm = T), .names = "N_{.col}"),
    across(ends_with("OUT3MO"), ~sum(.x, na.rm = T), .names = "N_{.col}"),
    across(ends_with("OUTDAY"), ~mean(.x, na.rm = T), .names = "N_{.col}"),
  ) %>% 
  transmute(
    EASERVED,
    SAMPLE,
    N_SDP,
    NUM_METHODS_PROV = sum(c_across(ends_with("PROV")) > 0, na.rm = T),
    NUM_METHODS_INSTOCK = sum(c_across(ends_with("OBS")) > 0, na.rm = T),
    NUM_METHODS_OUT3MO = sum(c_across(ends_with("OUT3MO")) > 0, na.rm = T),
    MEAN_OUTDAY = mean(c_across(ends_with("OUTDAY")), na.rm = T)
  ) %>% 
  ungroup()

Merging to Household and Female Data

Consider the following female respondent dataset collected from Burkina Faso in 2017 and 2018. It contains a variable FPCURRUSE indicating whether the woman is currently using a method of family planning:

hhf <- read_ipums_micro(
  ddi = "data/pma_00011.xml",
  data = "data/pma_00011.dat.gz"
) %>% 
  select(PERSONID, EAID, URBAN, SAMPLE, FPCURRUSE) %>% 
  mutate(
    across(everything(), ~lbl_na_if(
      .x,
      ~.lbl %in% c(
        "No response or missing",
        "NIU (not in universe)"
      )
    ))
  )
hhf
# A tibble: 6,944 x 5
   PERSONID           EAID    URBAN                   SAMPLE FPCURRUSE
   <chr>             <dbl> <int+lb>                <int+lbl> <int+lbl>
 1 0762000000029022…  7620 1 [Urba… 85405 [Burkina Faso 201…  NA      
 2 0735800000017142…  7358 0 [Rura… 85405 [Burkina Faso 201…   0 [No] 
 3 0710400000020992…  7104 0 [Rura… 85405 [Burkina Faso 201…   0 [No] 
 4 0704800000014092…  7048 1 [Urba… 85405 [Burkina Faso 201…   0 [No] 
 5 0715600000020782…  7156 0 [Rura… 85405 [Burkina Faso 201…   0 [No] 
 6 0727900000021452…  7279 1 [Urba… 85405 [Burkina Faso 201…   0 [No] 
 7 0743100000024642…  7431 0 [Rura… 85405 [Burkina Faso 201…   1 [Yes]
 8 0721200000025792…  7212 0 [Rura… 85405 [Burkina Faso 201…   0 [No] 
 9 0704200000014542…  7042 1 [Urba… 85405 [Burkina Faso 201…   0 [No] 
10 0797200000013032…  7972 1 [Urba… 85405 [Burkina Faso 201…   0 [No] 
# … with 6,934 more rows

You’ll notice that each row represents one female respondent with a unique PERSONID (non-respondents and other household members have been removed beforehand). We’ve also got EAID, which represents the enumeration area where each respondent resides; the variable URBAN indicates whether the enumeration area is primarily “urban” or “rural”.

The variable SAMPLE contains the same values seen in our SDP data:

When we merge, we’ll want to match each woman to both a SAMPLE and an EASERVED from the SDP data. We’ll rename EASERVED to match the variable EAID in the HHF data:

bf_merged <- sdp %>% 
  rename(EAID = EASERVED) %>% 
  right_join(hhf, by = c("EAID", "SAMPLE"))

Now, each woman’s record contains all of the variables we created above summarizing the SDPs that serve her enumeration area. For example, for all sampled women living in EAID == 7003 in 2017, the value in NUM_METHODS_OUT3MO shows the number of family planning methods that were out of stock with any SDP serving the woman’s enumeration area within three months prior to the survey:

bf_merged %>% 
  filter(EAID == 7003, SAMPLE == 85405) %>% 
  select(PERSONID, EAID, SAMPLE, NUM_METHODS_OUT3MO)
# A tibble: 55 x 4
   PERSONID             EAID                  SAMPLE NUM_METHODS_OUT3…
   <chr>            <dbl+lb>               <int+lbl>             <int>
 1 070030000001973…     7003 85405 [Burkina Faso 20…                 0
 2 070030000001973…     7003 85405 [Burkina Faso 20…                 0
 3 070030000002640…     7003 85405 [Burkina Faso 20…                 0
 4 070030000001075…     7003 85405 [Burkina Faso 20…                 0
 5 070030000001609…     7003 85405 [Burkina Faso 20…                 0
 6 070030000000835…     7003 85405 [Burkina Faso 20…                 0
 7 070030000001273…     7003 85405 [Burkina Faso 20…                 0
 8 070030000000527…     7003 85405 [Burkina Faso 20…                 0
 9 070030000002561…     7003 85405 [Burkina Faso 20…                 0
10 070030000002391…     7003 85405 [Burkina Faso 20…                 0
# … with 45 more rows

You’ll notice that 55 women were surveyed in EAID 7003 in 2017, and each one has the same value (0) for NUM_METHODS_OUT3MO.

We’ll dig deeper into the types of research questions that our new combined dataset can answer in our upcoming Data Analysis post. For now, take a look at the apparent relationship between FPCURRUSE and NUM_METHODS_OUT3MO for all of the women with non-missing responses for both variables:

bf_merged %>% 
  filter(!is.na(FPCURRUSE) & !is.na(NUM_METHODS_OUT3MO)) %>% 
  group_by(NUM_METHODS_OUT3MO > 0) %>% 
  count(FPCURRUSE) %>% 
  mutate(pct = n/sum(n))
# A tibble: 4 x 4
# Groups:   NUM_METHODS_OUT3MO > 0 [2]
  `NUM_METHODS_OUT3MO > 0` FPCURRUSE     n   pct
  <lgl>                    <int+lbl> <int> <dbl>
1 FALSE                      0 [No]   2721 0.648
2 FALSE                      1 [Yes]  1475 0.352
3 TRUE                       0 [No]   1124 0.700
4 TRUE                       1 [Yes]   482 0.300

Notably, among those respondents living in an enumeration area that experienced zero stockouts within the 3 months prior to the SDP survey, 35% indicated that they were actively using a family planning method. Compare that to the set of respondents living in an area where at least one method was out of stock during the same time period: only 30% were using a family planning method.

While a 5% difference may or may not prove to be statistically significant under further analysis, it’s not entirely surprising that the reliable availability of contraceptive methods from service providers might influence the contraceptive prevalence rate among women in a given area.

As always, let us know what kinds of questions about fertility and family planning you’re answering with data merged from service providers!

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.