Use dplyr::across to summarize variables with a similar naming pattern.
In our last post, we introduced PMA Service Delivery Point (SDP) data as an important resource for understanding the health services environment experienced by individuals sampled in PMA Household and Female data. For our second post in this Individuals in Context series, we’ll now take deeper dive into one of the important topics in SDP data: the range and availability of contraceptive methods provided at each facility.
A common feature of the variables in this topic - and in many other topics - is that you’ll find several binary indicators constructed from the same multiple resposne item on the SDP questionnaire. We’ll see that IPUMS PMA uses a common naming convention to help users group these variables together for use in functions like dplyr::across.
Every SDP respondent receives a question associated with the variable FPOFFERED, which indicates whether to facility usually offers family planning services or products:
Do you usually offer family planning services / products?
[] Yes
[] No
[] No response
If yes, they’ll then receive a multiple response-type question asking about the contraceptive methods provided to clients. The range of options provided on the questionnaire may vary across samples, but most look something like this:
Which of the following methods are provided to clients at this facility?
[] Female sterilization
[] Male sterilization
[] Implant
[] IUD
[] Injectables - Depo Provera
[] Injectables - Sayana Press
[] Pill
[] Emergency Contraception
[] Male Condom
[] Female Condom
[] Diaphragm
[] Foam/Jelly
[] Std. Days / Cycle beads
[] None of the above
[] No response
This is a multiple response question: each method in the list could be answered individually (Yes
or No
), or the respondent could reply None of the above
or provide No response
. The IPUMS PMA extract system generates one variable for each of the methods in the list:
The questionnaire continues for each one of the methods provided at a given facility. Next, it checks for the current availability of each of the provided methods:
You mentioned that you typically provide the [METHOD] at this facility,
can you show it to me? If no, probe: Is the [METHOD] out of stock today?
[] In-stock and observed
[] In-stock but not observed
[] Out of stock
[] No Response
The variables associated with each response end with the same suffix OBS
:
If a facility did have a particular method in-stock, it received a question asking whether supplies were unavailable any time in the previous three months:
Has the [METHOD] been out of stock at any time in the last 3 months?
[] Yes
[] No
[] Don't know
[] No response
This question becomes a series of variables ending with the suffix OUT3MO
:
On the other hand, if a facility that normally provides a given method did not have supplies in-stock during the interview, it received a different question about the duration of the current stockout:
How many days has the [METHOD] been out of stock?
Number of days____
The resulting variables - each ending with the suffix OUTDAY
- take an integer value representing the stockout duration in days (except where the value is a non-response code, see below):
As you can see, we’re left with quite a few variables from just these 4 questions! That’s very useful if you’re interested in the availability of one method, in particular, but what if you want to get a picture of the full range of methods provided at a particular facility?
Fortunately, the repeated use of variable suffixes (PROV
, OBS
, OUT3MO
, and OUTDAY
) make these variables highly suitable for column-wise processing with dplyr::across.
Let’s start with an example data extract containing all of the variables listed above, collected from just two samples:
Once you’ve downloaded an extract, open RStudio and load the packages tidyverse and ipumsr:
Next, use the file paths for your data extract to load it into R:
sdp <- ipumsr::read_ipums_micro(
ddi = "data/pma_00008.xml",
data = "data/pma_00008.dat.gz"
)
Using dplyr::ends_with
, we’ll select only FACILITYID
, SAMPLE
, EAID
, and the variables using one of the four suffixes PROV
, OBS
, OUT3MO
, or OUTDAY
.
That leaves us with 234 rows - each a facility from one of our two samples - and 49 variables:
sdp
# A tibble: 234 x 49
FACILITYID SAMPLE EAID CONPROV CYCBPROV DEPOPROV DIAPROV
<int+lbl> <int+lbl> <dbl> <int+lb> <int+lb> <int+lb> <int+l>
1 7250 85405 [Burkina… 7142 1 [Yes] 1 [Yes] 1 [Yes] 0 [No]
2 7399 85405 [Burkina… 7879 1 [Yes] 1 [Yes] 1 [Yes] 0 [No]
3 7506 85405 [Burkina… 7483 1 [Yes] 1 [Yes] 1 [Yes] 0 [No]
4 7982 85405 [Burkina… 7185 1 [Yes] 1 [Yes] 1 [Yes] 0 [No]
5 7065 85405 [Burkina… 7859 1 [Yes] 1 [Yes] 1 [Yes] 0 [No]
6 7729 85405 [Burkina… 7082 1 [Yes] 0 [No] 1 [Yes] 0 [No]
7 7490 85405 [Burkina… 7650 1 [Yes] 1 [Yes] 1 [Yes] 0 [No]
8 7311 85405 [Burkina… 7955 1 [Yes] 1 [Yes] 1 [Yes] 0 [No]
9 7524 85405 [Burkina… 7323 1 [Yes] 1 [Yes] 1 [Yes] 0 [No]
10 7932 85405 [Burkina… 7774 1 [Yes] 1 [Yes] 1 [Yes] 0 [No]
# … with 224 more rows, and 42 more variables: EMRGPROV <int+lbl>,
# FCPROV <int+lbl>, FSTPROV <int+lbl>, FJPROV <int+lbl>,
# IMPPROV <int+lbl>, IUDPROV <int+lbl>, MSTPROV <int+lbl>,
# PILLPROV <int+lbl>, SAYPROV <int+lbl>, CONOBS <int+lbl>,
# CYCBOBS <int+lbl>, DEPOOBS <int+lbl>, DIAOBS <int+lbl>,
# EMRGOBS <int+lbl>, FCOBS <int+lbl>, FJOBS <int+lbl>,
# IMPOBS <int+lbl>, IUDOBS <int+lbl>, PILLOBS <int+lbl>,
# SAYOBS <int+lbl>, CONOUT3MO <int+lbl>, CYCBOUT3MO <int+lbl>,
# DEPOOUT3MO <int+lbl>, DIAOUT3MO <int+lbl>, EMRGOUT3MO <int+lbl>,
# FCOUT3MO <int+lbl>, FJOUT3MO <int+lbl>, IMPOUT3MO <int+lbl>,
# IUDOUT3MO <int+lbl>, PILLOUT3MO <int+lbl>, SAYOUT3MO <int+lbl>,
# CONOUTDAY <int+lbl>, CYCBOUTDAY <int+lbl>, DEPOOUTDAY <int+lbl>,
# DIAOUTDAY <int+lbl>, EMRGOUTDAY <int+lbl>, FCOUTDAY <int+lbl>,
# FJOUTDAY <int+lbl>, IMPOUTDAY <int+lbl>, IUDOUTDAY <int+lbl>,
# PILLOUTDAY <int+lbl>, SAYOUTDAY <int+lbl>
A key feature to remember about IPUMS PMA extracts is that variables often have value labels, which are text labels assigned to the different values taken by a variable. When we load the extract into R with an ipumsr
function, these variables are imported as labelled
objects rather than the more common factor
class of objects.
As a result, IPUMS data users need to take some unusual steps when recoding a variable or handling NA
values. Happily, the ipumsr
package provide a few functions (starting with the prefix lbl_
) that make this process very easy.
Let’s take a look at the variable CONOBS
:
sdp %>% count(CONOBS)
# A tibble: 5 x 2
CONOBS n
<int+lbl> <int>
1 1 [In-stock and observed] 204
2 2 [In-stock but not observed] 3
3 3 [Out of stock] 5
4 94 [Not interviewed (SDP questionnaire)] 4
5 99 [NIU (not in universe)] 18
Notice that we have two values representing SDPs with male condoms “in-stock”: SDPs where the interviewer personally observed the condoms get 1
, while those where condoms where reported in-stock - but not actually observed by the interviewer - get 2
.
Depending on your research question, the interviewer’s personal observation of each method may or may not be important. You might decide that you’d prefer to recode this variable into a simple binary measure that could be easily plugged into a regression model as a dummy variable later on. To do that, you could use the ipumsr
function lbl_relabel:
sdp %>%
mutate(CONOBS = lbl_relabel(
CONOBS,
lbl(1, "In-stock") ~ .val %in% 1:2,
lbl(0, "Out of stock") ~ .val == 3
)) %>%
count(CONOBS)
# A tibble: 4 x 2
CONOBS n
<dbl+lbl> <int>
1 0 [Out of stock] 5
2 1 [In-stock] 207
3 94 [Not interviewed (SDP questionnaire)] 4
4 99 [NIU (not in universe)] 18
That collapses the values 1
and 2
together, and it moves the value 3
(“Out of stock”) to 0
. However, we’ve still got a the values 94
and 99
, which are each a different type of non-response. The easiest strategy here would be to recode any value larger than 90
as NA
, and we could do that with another ipumsr
function, lbl_na_if:
sdp %>%
mutate(
CONOBS = lbl_relabel(
CONOBS,
lbl(1, "in-stock") ~ .val %in% 1:2,
lbl(0, "out of stock") ~ .val == 3
),
CONOBS = lbl_na_if(
CONOBS,
~.val > 90
)
) %>%
count(CONOBS)
# A tibble: 3 x 2
CONOBS n
<dbl+lbl> <int>
1 0 [out of stock] 5
2 1 [in-stock] 207
3 NA 22
This works great for our example variable, CONOBS
. Unfortunately, though, we can’t always rely on the rule ~.val > 90
to handle missing responses. For variables like CONOUTDAY
, a value above 90 could be a valid response: what if a facility experienced a stockout lasting 94 days? For this reason, the non-response values for CONOUTDAY
are padded with additional digits:
sdp %>% count(CONOUTDAY)
# A tibble: 7 x 2
CONOUTDAY n
<int+lbl> <int>
1 1 1
2 3 1
3 10 1
4 15 1
5 60 1
6 9994 [Not interviewed (SDP questionnaire)] 4
7 9999 [NIU (not in universe)] 225
We could write a different lbl_na_if
function for our OUTDAY
variables, but ipumsr
provides a much nicer workaround: we can specify non-response labels rather than values, as long as we make sure to use all of the different non-response labels appearing throughout our dataset:
sdp %>%
mutate(
CONOBS = lbl_relabel(
CONOBS,
lbl(1, "in-stock") ~ .val %in% 1:2,
lbl(0, "out of stock") ~ .val == 3
),
CONOBS = lbl_na_if(
CONOBS,
~.lbl %in% c(
"Not interviewed (SDP questionnaire)",
"Don't know",
"No response or missing",
"NIU (not in universe)"
)
)
) %>%
count(CONOBS)
# A tibble: 3 x 2
CONOBS n
<dbl+lbl> <int>
1 0 [out of stock] 5
2 1 [in-stock] 207
3 NA 22
Now, we’ll be able to recode all of our variables with the same pair of functions! To do that, we’ll first need to take a look at the column-wise workflow made available by dplyr::across
.
While there are several ways to apply a function across a set of variables in R, the simplest method comes from a new addition to the dplyr
package that’s loaded when you run library(tidyverse)
. The function dplyr::across takes two arguments: a function, and a selection of columns where you want that function to be applied.
Remember that we want collapse the values 1 - In-stock and observed
and 2 - In-stock but not observed
for all of the variables ending with OBS
, not just CONOBS
. Using across
and a selection of variables ending with OBS
, we’ll apply the same lbl_relabel
function we used on CONOBS
above:
sdp %>%
mutate(
across(ends_with("OBS"), ~lbl_relabel(
.x,
lbl(1, "in-stock") ~ .val %in% 1:2,
lbl(0, "out of stock") ~ .val == 3
))
) %>%
count(CONOBS)
# A tibble: 4 x 2
CONOBS n
<dbl+lbl> <int>
1 0 [out of stock] 5
2 1 [in-stock] 207
3 94 [Not interviewed (SDP questionnaire)] 4
4 99 [NIU (not in universe)] 18
Here, we stick lbl_relabel
inside a lambda function with syntax from purrr: the ~
designates a tidy lambda function, which in turn uses .x
as a kind of pronoun referencing each of the variables returned by ends_with("OBS")
. We’re showing that CONOBS
still gets recoded as before, but so do all of the other variables in its group!
We’ll use across
again with lbl_na_if
, but this time we want to produce NA
values for all of the variables in our dataset. In place of ends_with("OBS")
, we’ll use the selection function everything()
. This will take care of all the recoding we want to do, so we’ll also reassign our data with sdp <- sdp
:
sdp <- sdp %>%
mutate(
across(ends_with("OBS"), ~lbl_relabel(
.x,
lbl(1, "in-stock") ~ .val %in% 1:2,
lbl(0, "out of stock") ~ .val == 3
)),
across(everything(), ~lbl_na_if(
.x,
~.lbl %in% c(
"Not interviewed (SDP questionnaire)",
"Don't know",
"No response or missing",
"NIU (not in universe)"
)
))
)
Let’s pick a few variables to check out work:
sdp %>% count(CONOBS)
# A tibble: 3 x 2
CONOBS n
<dbl+lbl> <int>
1 0 [out of stock] 5
2 1 [in-stock] 207
3 NA 22
sdp %>% count(IMPOBS)
# A tibble: 3 x 2
IMPOBS n
<dbl+lbl> <int>
1 0 [out of stock] 3
2 1 [in-stock] 204
3 NA 27
sdp %>% count(CONOUTDAY)
# A tibble: 6 x 2
CONOUTDAY n
<int+lbl> <int>
1 1 1
2 3 1
3 10 1
4 15 1
5 60 1
6 NA 229
sdp %>% count(IMPOUTDAY)
# A tibble: 4 x 2
IMPOUTDAY n
<int+lbl> <int>
1 1 1
2 14 1
3 90 1
4 NA 231
Everything looks great! Now that we’ve finished reformatting the data, remember that our ultimate goal is to get some sense of the scope of methods available at a particular facility.
We’d like to use something like across
again here, but this time we’ll only want to apply our function to a selection of columns within the same row (because each row of our dataset represents one facility). To do this, we’ll divide the dataset rowwise, and then use the related function c_across to apply a calculation across columns within each row.
For instance, suppose we want to create NUM_METHODS_PROV
to show the total number of methods provided at each facility. Let’s look at the PROV
variables for the first few facilities:
sdp %>% select(ends_with("PROV"))
# A tibble: 234 x 13
CONPROV CYCBPROV DEPOPROV DIAPROV EMRGPROV FCPROV FSTPROV FJPROV
<int+lb> <int+lbl> <int+lb> <int+l> <int+lb> <int+l> <int+l> <int+>
1 1 [Yes] 1 [Yes] 1 [Yes] 0 [No] 0 [No] 1 [Yes] 1 [Yes] 0 [No]
2 1 [Yes] 1 [Yes] 1 [Yes] 0 [No] 0 [No] 1 [Yes] 1 [Yes] 0 [No]
3 1 [Yes] 1 [Yes] 1 [Yes] 0 [No] 0 [No] 1 [Yes] 0 [No] 0 [No]
4 1 [Yes] 1 [Yes] 1 [Yes] 0 [No] 0 [No] 1 [Yes] 0 [No] 0 [No]
5 1 [Yes] 1 [Yes] 1 [Yes] 0 [No] 1 [Yes] 1 [Yes] 1 [Yes] 0 [No]
6 1 [Yes] 0 [No] 1 [Yes] 0 [No] 0 [No] 1 [Yes] 1 [Yes] 0 [No]
7 1 [Yes] 1 [Yes] 1 [Yes] 0 [No] 0 [No] 1 [Yes] 0 [No] 0 [No]
8 1 [Yes] 1 [Yes] 1 [Yes] 0 [No] 0 [No] 1 [Yes] 1 [Yes] 0 [No]
9 1 [Yes] 1 [Yes] 1 [Yes] 0 [No] 0 [No] 1 [Yes] 0 [No] 0 [No]
10 1 [Yes] 1 [Yes] 1 [Yes] 0 [No] 0 [No] 1 [Yes] 0 [No] 0 [No]
# … with 224 more rows, and 5 more variables: IMPPROV <int+lbl>,
# IUDPROV <int+lbl>, MSTPROV <int+lbl>, PILLPROV <int+lbl>,
# SAYPROV <int+lbl>
To calculate NUM_METHODS_PROV
, we can just find the sum of values across all of the PROV
variables (thanks to our recoding work, the only possible values here are 1
for “yes”, or 0
for “no”). Notice that c_across
takes only one argument: a selection function like ends_with("PROV")
. That’s because c_across
works like the familiar concatenate function c()
used to provide a vector of values to a function like sum(c(1,2,3))
.
First, use rowwise()
to signal that we’ll only calculate the sum across variables in the same row. Then, use c_across()
to find the sum()
of PROV
variables in each row:
# A tibble: 234 x 1
# Rowwise:
NUM_METHODS_PROV
<int>
1 10
2 10
3 8
4 8
5 10
6 9
7 8
8 10
9 8
10 8
# … with 224 more rows
We can now create a summary variable for each of the four variable groups. Let’s create:
NUM_METHODS_PROV
- number of methods providedNUM_METHODS_INSTOCK
- number of methods in-stockNUM_METHODS_OUT3MO
- number of methods out of stock in the last 3 monthsMEAN_OUTDAY
- the mean length of a stockout for all out of stock methodssdp %>%
rowwise() %>%
transmute(
NUM_METHODS_PROV = sum(c_across(ends_with("PROV")), na.rm = T),
NUM_METHODS_INSTOCK = sum(c_across(ends_with("OBS")), na.rm = T),
NUM_METHODS_OUT3MO = sum(c_across(ends_with("OUT3MO")), na.rm = T),
MEAN_OUTDAY = mean(c_across(ends_with("OUTDAY")), na.rm = T)
)
# A tibble: 234 x 4
# Rowwise:
NUM_METHODS_PROV NUM_METHODS_INSTOCK NUM_METHODS_OUT3MO MEAN_OUTDAY
<int> <dbl> <int> <dbl>
1 10 8 0 NaN
2 10 7 0 8
3 8 8 0 NaN
4 8 8 0 NaN
5 10 9 0 NaN
6 9 7 0 NaN
7 8 7 0 365
8 10 8 1 NaN
9 8 8 1 NaN
10 8 7 0 30
# … with 224 more rows
In our last post, we mentioned that the best use case for SDP data is to aggregate information collected from facilities working in the same geographic sampling units - or enumeration areas - used to select individuals for PMA Household and Female samples. In our next post, we’ll take a close look at the variable group EASERVED, which lists all of the enumeration area codes where a facility is known to provide health services. We’ll then introduce a strategy using tidyr::pivot_longer to summarize the full scope of services available to women living in a particular enumeration area.
For now, let’s simply consider all of the sampled facilities located in a particular enumeration area. That is, rather than calculating the number of methods provided by one facility NUM_METHODS_PROV
, let’s create one variable for each method indicating whether the method was provided by at least one facility in a given enumeration area EAID
in a given SAMPLE
.
For instance, look at the number of facilities providing IUDs in enumeration area 7111
for the Burkina Faso sample collected in 2017:
sdp %>%
filter(EAID == 7111, SAMPLE == 85405) %>%
select(EAID, SAMPLE, FACILITYID, PILLPROV)
# A tibble: 4 x 4
EAID SAMPLE FACILITYID PILLPROV
<dbl> <int+lbl> <int+lbl> <int+lbl>
1 7111 85405 [Burkina Faso 2017 Round 5] 7210 1 [Yes]
2 7111 85405 [Burkina Faso 2017 Round 5] 7029 0 [No]
3 7111 85405 [Burkina Faso 2017 Round 5] 7441 1 [Yes]
4 7111 85405 [Burkina Faso 2017 Round 5] 7403 1 [Yes]
We want to use a summarize
function to create a variable like ANY_PILLPROV
, which should simply indicate whether any of these four facilities provide contraceptive pills. Three of them do provide pills, so we want ANY_PILLPROV
to be TRUE
.
# A tibble: 1 x 1
ANY_PILLPROV
<lgl>
1 TRUE
Now that we’re familiar with across
, we should be able to do the same thing to all PROV
variables for this particular group of facilities. Let’s also introduce a naming convention where we glue the prefix ANY_
to the column name referenced by the pronoun .x
:
sdp %>%
filter(EAID == 7111, SAMPLE == 85405) %>%
summarize(across(ends_with("PROV"), ~any(.x == 1), .names = "ANY_{.col}"))
# A tibble: 1 x 13
ANY_CONPROV ANY_CYCBPROV ANY_DEPOPROV ANY_DIAPROV ANY_EMRGPROV
<lgl> <lgl> <lgl> <lgl> <lgl>
1 TRUE TRUE TRUE FALSE FALSE
# … with 8 more variables: ANY_FCPROV <lgl>, ANY_FSTPROV <lgl>,
# ANY_FJPROV <lgl>, ANY_IMPPROV <lgl>, ANY_IUDPROV <lgl>,
# ANY_MSTPROV <lgl>, ANY_PILLPROV <lgl>, ANY_SAYPROV <lgl>
Let’s repeat the same procedure for every enumeration area in each of our samples. Rather than using a filter
to select one EAID
in one SAMPLE
, we’ll use group_by
to work with each EAID
in each SAMPLE
.
sdp %>%
group_by(EAID, SAMPLE) %>%
summarize(
.groups = "keep",
across(ends_with("PROV"), ~any(.x == 1), .names = "ANY_{.col}")
)
# A tibble: 142 x 15
# Groups: EAID, SAMPLE [142]
EAID SAMPLE ANY_CONPROV ANY_CYCBPROV ANY_DEPOPROV ANY_DIAPROV
<dbl> <int+lbl> <lgl> <lgl> <lgl> <lgl>
1 7003 85405 [Bur… TRUE TRUE TRUE FALSE
2 7003 85408 [Bur… TRUE TRUE TRUE FALSE
3 7006 85405 [Bur… TRUE TRUE TRUE FALSE
4 7006 85408 [Bur… TRUE TRUE TRUE FALSE
5 7009 85405 [Bur… TRUE FALSE TRUE FALSE
6 7009 85408 [Bur… TRUE TRUE TRUE FALSE
7 7016 85405 [Bur… TRUE TRUE TRUE FALSE
8 7016 85408 [Bur… TRUE TRUE TRUE FALSE
9 7026 85405 [Bur… TRUE TRUE TRUE FALSE
10 7042 85405 [Bur… TRUE TRUE TRUE FALSE
# … with 132 more rows, and 9 more variables: ANY_EMRGPROV <lgl>,
# ANY_FCPROV <lgl>, ANY_FSTPROV <lgl>, ANY_FJPROV <lgl>,
# ANY_IMPPROV <lgl>, ANY_IUDPROV <lgl>, ANY_MSTPROV <lgl>,
# ANY_PILLPROV <lgl>, ANY_SAYPROV <lgl>
This is still quite a bit of information! Suppose we want to summarize it even further: let’s calculate NUM_METHODS_PROV
again with our summary output. This time, NUM_METHODS_PROV
will count the number of methods provided by at least one facility in each group.
sdp %>%
group_by(EAID, SAMPLE) %>%
summarize(
.groups = "keep",
across(ends_with("PROV"), ~any(.x == 1), .names = "ANY_{.col}")
) %>%
transmute(NUM_METHODS_PROV= sum(c_across(ends_with("PROV")), na.rm = T))
# A tibble: 142 x 3
# Groups: EAID, SAMPLE [142]
EAID SAMPLE NUM_METHODS_PROV
<dbl> <int+lbl> <int>
1 7003 85405 [Burkina Faso 2017 Round 5] 8
2 7003 85408 [Burkina Faso 2018 Round 6] 8
3 7006 85405 [Burkina Faso 2017 Round 5] 8
4 7006 85408 [Burkina Faso 2018 Round 6] 8
5 7009 85405 [Burkina Faso 2017 Round 5] 8
6 7009 85408 [Burkina Faso 2018 Round 6] 10
7 7016 85405 [Burkina Faso 2017 Round 5] 8
8 7016 85408 [Burkina Faso 2018 Round 6] 8
9 7026 85405 [Burkina Faso 2017 Round 5] 8
10 7042 85405 [Burkina Faso 2017 Round 5] 10
# … with 132 more rows
These summaries are exactly the type of SDP data we’d like to attach to a Household and Female dataset! Watch for our next post, where we’ll show how to create summaries by both EAID
and EASERVED
, and then match them to records from female respondents sampled from Burkina Faso in 2017 and 2018.
If you see mistakes or want to suggest changes, please create an issue on the source repository.