It’s now much easier to perform row-wise operations with pre-grouped data!
When IPUMS PMA first released harmonized 2018 nutrition surveys from households in Burkina Faso and Kenya, we published a user note describing how to match children and mothers in the same household with pmap - a function designed to iterate over several variables simultaneously.
Our challenge was this: IPUMS PMA nutrition data are actually a harmonized version of three separate surveys. When you download and open a nutrition data extract, any given row may contain variables associated with:
As we mentioned in our last post, all children under age 5 living in one of the screened households were sampled in the subsequent child nutrition survey. However, women aged 10-49 were only selected for the female nutrition survey from a random sub-sample of screened households (45% of sampled households in Burkina Faso, 25% in Kenya).
We imagine that many users will want to link children to mothers where possible, particularly because the female nutrition survey includes questions about antenatal care and nutrtion support the mother received during her most recent pregnancy. Our user note offered code you could use to, at minimum, link sampled mothers to their youngest sampled child.
In this post, we’d like to offer an updated version of this code that uses the rowwise function released with dplyr 1.0.0 last year. Both approaches work great! But, we believe that the rowwise
approach is a quite bit easier to read and understand.
To keep things simple, we’ll use a data extract from Burkina Faso 2018 that contains only household members who participated in either the child nutrition survey or the female nutrition survey (you can request a file with all household members if you select “All Cases” at checkout). We’ll use the tidyverse and ipumsr packages to load our data into R.
library(tidyverse)
library(ipumsr)
dat <- read_ipums_micro(
ddi = "data/pma_00002.xml",
data = "data/pma_00002.dat.gz"
)
For the purposes of this post, we’ll be working with only a small subset of variables (feel free to include any number of variables when you create your own extract). Also: in order to make our coding examples a bit more readable, we’ll take an extra step of reassigning shorter identification numbers to each household and person - this is not recommended practice in general and we do so only to reduce the number of characters in HHID (20) and PERSONID (22) to 4.
dat <- dat %>%
select(
HHID,
ELIGTYPE,
AGEHQ,
RELATEKID,
KIDBIRTHYR,
KIDBIRTHMO,
LASTBIRTHYR,
LASTBIRTHMO,
KIDARMCIRCVAL,
RPANCPREGMO
) %>%
arrange(HHID, ELIGTYPE) %>%
rowid_to_column(var = "PERSONID") %>%
group_by(HHID) %>%
mutate(HHID = cur_group_id()) %>%
ungroup()
For any given household in our dataset, there should always be at least one child under age 5 who participated in the child nutrition survey. We also expect roughly 45% of households should also include at least one woman aged 10-49 who particpated in the female nutrition survey.
A typical household might look something like this:
HHID | PERSONID | AGEHQ | ELIGTYPE | KIDBIRTHMO | KIDBIRTHYR | LASTBIRTHMO | LASTBIRTHYR |
---|---|---|---|---|---|---|---|
1598 | 3521 | 0 | 11 - Infant under age 2 (INF) | December | 2017 | NA | NA |
1598 | 3522 | 0 | 11 - Infant under age 2 (INF) | July | 2017 | NA | NA |
1598 | 3523 | 0 | 11 - Infant under age 2 (INF) | April | 2018 | NA | NA |
1598 | 3524 | 4 | 14 - Aged 2-5 (K) | July | 2013 | NA | NA |
1598 | 3525 | 20 | 20 - Selected women aged 10-49 (WN) | NA | NA | April | 2018 |
1598 | 3526 | 25 | 20 - Selected women aged 10-49 (WN) | NA | NA | December | 2017 |
1598 | 3527 | 45 | 20 - Selected women aged 10-49 (WN) | NA | NA | July | 2017 |
All members of the household share the same household ID in HHID
. Each person has a unique PERSONID
, while AGEHQ gives their age (when the household questionnaire was completed) and ELIGTYPE describes their eligibility for sub-modules on either the child or female nutrition questionnaire (both the numeric code and label are shown).
In this household, it’s easy to spot the mother for each infant: each infant’s birth month KIDBIRTHMO and year KIDBIRTHYR match the month and year of one woman’s most recent delivery shown in LASTBIRTHMO and LASTBIRTHYR. Unfortunately, it’s not possible to match the 4 year-old child to a mother (women were only asked about the date of their most recent birth).
Once the matches are identified by both month and year, we could attach any number of variables from the mother’s record onto the record for the child. For demonstration purposes, we’ll just attach her PERSONID
as a new variable called MOMID
. If the child’s mother cannot be determined - or if a particular row contains data from a mother, rather than a child - we’ll assign the value NA
to MOMID
.
Imagine dividing our search in two basic steps: first, we’ll want to group our data into households. Then, we’ll iterate through person in the dataset: if that person is a child, we’ll search the members of their group for a woman with a matching birth month and year.
Let’s focus on the first step for a moment. A simple way to minimize the number of records involved with the search for each child’s mother is to group_by HHID
, and then create a unique subset of the larger dataset for each child. After we’ve grouped the data, we’ll use cur_data to return the current data for each group. For convenience, you might save this miniature dataset as a column in dat
- we’ll call ours HH_DATA
:
dat <- dat %>%
group_by(HHID) %>%
mutate(HH_DATA = list(cur_data())) %>%
ungroup()
If you now look at dat
, you’ll see we’ve stored one table in HH_DATA
for each person. If you opened one of those tables, you’d see:
dat
except for the grouping column HHID
(10 in total)dat %>% select(PERSONID, HHID, HH_DATA)
# A tibble: 5,267 × 3
PERSONID HHID HH_DATA
<int> <int> <list>
1 1 1 <tibble [3 × 10]>
2 2 1 <tibble [3 × 10]>
3 3 1 <tibble [3 × 10]>
4 4 2 <tibble [2 × 10]>
5 5 2 <tibble [2 × 10]>
6 6 3 <tibble [2 × 10]>
7 7 3 <tibble [2 × 10]>
8 8 4 <tibble [1 × 10]>
9 9 5 <tibble [5 × 10]>
10 10 5 <tibble [5 × 10]>
# … with 5,257 more rows
Let’s pull HH_DATA
for the first four individuals in dat
(essentially printing the contents of the list):
dat %>%
slice(1:4) %>%
pull(HH_DATA)
[[1]]
# A tibble: 3 × 10
PERSONID ELIGTYPE AGEHQ RELATEKID KIDBIRTHYR KIDBIRTHMO LASTBIRTHYR
<int> <int+lbl> <int> <int+lbl> <dbl> <int+lbl> <int+lbl>
1 1 14 [Aged… 4 1 [Mothe… 2014 6 [June] 9999 [NIU …
2 2 14 [Aged… 2 1 [Mothe… 2016 3 [March] 9999 [NIU …
3 3 14 [Aged… 2 2 [Fathe… 2016 5 [May] 9999 [NIU …
# … with 3 more variables: LASTBIRTHMO <int+lbl>,
# KIDARMCIRCVAL <dbl+lbl>, RPANCPREGMO <int+lbl>
[[2]]
# A tibble: 3 × 10
PERSONID ELIGTYPE AGEHQ RELATEKID KIDBIRTHYR KIDBIRTHMO LASTBIRTHYR
<int> <int+lbl> <int> <int+lbl> <dbl> <int+lbl> <int+lbl>
1 1 14 [Aged… 4 1 [Mothe… 2014 6 [June] 9999 [NIU …
2 2 14 [Aged… 2 1 [Mothe… 2016 3 [March] 9999 [NIU …
3 3 14 [Aged… 2 2 [Fathe… 2016 5 [May] 9999 [NIU …
# … with 3 more variables: LASTBIRTHMO <int+lbl>,
# KIDARMCIRCVAL <dbl+lbl>, RPANCPREGMO <int+lbl>
[[3]]
# A tibble: 3 × 10
PERSONID ELIGTYPE AGEHQ RELATEKID KIDBIRTHYR KIDBIRTHMO LASTBIRTHYR
<int> <int+lbl> <int> <int+lbl> <dbl> <int+lbl> <int+lbl>
1 1 14 [Aged… 4 1 [Mothe… 2014 6 [June] 9999 [NIU …
2 2 14 [Aged… 2 1 [Mothe… 2016 3 [March] 9999 [NIU …
3 3 14 [Aged… 2 2 [Fathe… 2016 5 [May] 9999 [NIU …
# … with 3 more variables: LASTBIRTHMO <int+lbl>,
# KIDARMCIRCVAL <dbl+lbl>, RPANCPREGMO <int+lbl>
[[4]]
# A tibble: 2 × 10
PERSONID ELIGTYPE AGEHQ RELATEKID KIDBIRTHYR KIDBIRTHMO LASTBIRTHYR
<int> <int+lbl> <int> <int+lbl> <dbl> <int+lbl> <int+lbl>
1 4 14 [Aged… 3 1 [Moth… 2014 9 [Septe… 9999 [NIU …
2 5 20 [Sele… 25 99 [NIU … 9999 99 [NIU (… 2014
# … with 3 more variables: LASTBIRTHMO <int+lbl>,
# KIDARMCIRCVAL <dbl+lbl>, RPANCPREGMO <int+lbl>
The first three individuals are members of the same household, so they appear in eachother’s HH_DATA
. The fourth is a member of a second household, and so on.
For our purposes, it’s perfectly fine to leave the household datasets in a list. Once we define a search function to find each child’s mother, we’ll simply apply it to each row of HH_DATA
. For example, suppose we just wanted to find the age of the oldest person in each household. We would use rowwise to iterate over each row in dat
, looking for the maximum value of AGEHQ
in the HH_DATA
table stored in each row.
dat %>%
rowwise() %>%
mutate(HH_AGE_OLDEST = max(HH_DATA$AGEHQ)) %>%
select(PERSONID, HHID, AGEHQ, HH_AGE_OLDEST)
# A tibble: 5,267 × 4
# Rowwise:
PERSONID HHID AGEHQ HH_AGE_OLDEST
<int> <int> <int+lbl> <int+lbl>
1 1 1 4 4
2 2 1 2 4
3 3 1 2 4
4 4 2 3 25
5 5 2 25 25
6 6 3 1 4
7 7 3 4 4
8 8 4 2 2
9 9 5 4 27
10 10 5 2 27
# … with 5,257 more rows
The main difference between max
and the function we’ll use for our search is that max
returns the same value for every person in the household; strictly speaking, it’s not necessary to do a row-wise search through pre-grouped household data in this case. Our problem is more complex: we want to return a specific value for each person via rowwise
, but we also want to restrict each person’s search to members of a pre-defined group.
Now that we’ve isolated data for each person’s household in HH_DATA
, we need to write a custom function that will match birth months and years, and then return a value for MOMID
only if exactly one match is found.
To save time, we’ll only perform our search if ELIGTYPE
shows that the current row contains data for a sampled child. In other words, the value for ELIGTYPE
should be less than 20.
dat %>% count(ELIGTYPE)
# A tibble: 3 × 2
ELIGTYPE n
<int+lbl> <int>
1 11 [Infant under age 2 (INF)] 1194
2 14 [Aged 2-5 (K)] 1662
3 20 [Selected women aged 10-49 (WN)] 2411
The case_when function makes this job simple: we’ll explicitly define a function for the “case when” ELIGTYPE < 20
, but we’ll not specify what to in the “case when” ELIGTYPE == 20
. A nice feature of case_when
is that it automatically returns the value NA
for any cases that are not handled explicitly:
dat %>%
count(ELIGTYPE) %>%
mutate(EXAMPLE = case_when(ELIGTYPE < 20 ~ TRUE))
# A tibble: 3 × 3
ELIGTYPE n EXAMPLE
<int+lbl> <int> <lgl>
1 11 [Infant under age 2 (INF)] 1194 TRUE
2 14 [Aged 2-5 (K)] 1662 TRUE
3 20 [Selected women aged 10-49 (WN)] 2411 NA
As we’ve seen, one way to identify the likely mother for each child is to match the mother’s LASTBIRTHMO
and LASTBIRTHYR
to the child’s KIDBIRTHMO
and KIDBIRTHYR
. If there is only one match in the household, we’ll return the mother’s PERSONID
from HH_DATA
as a new column MOMID
. On the other hand, if multiple women could be the mother or if no match could be found, we’ll return the value NA
.
dat <- dat %>%
rowwise() %>%
mutate(MOMID = case_when(ELIGTYPE < 20 ~ {
# re-name these in order to avoid confusion with the columns in HH_DATA:
kid_month <- KIDBIRTHMO
kid_year <- KIDBIRTHYR
# pull the PERSONID for any match
moms <- HH_DATA %>%
filter(LASTBIRTHMO == kid_month, LASTBIRTHYR == kid_year) %>%
pull(PERSONID)
# if exactly one match was found, return that PERSONID
# otherwise, return NA
ifelse(length(moms) == 1, moms, NA)
})) %>%
ungroup()
Let’s return to our example household:
HHID | PERSONID | MOMID | AGEHQ | ELIGTYPE | KIDBIRTHMO | KIDBIRTHYR | LASTBIRTHMO | LASTBIRTHYR |
---|---|---|---|---|---|---|---|---|
1598 | 3521 | 3526 | 0 | 11 - Infant under age 2 (INF) | December | 2017 | NA | NA |
1598 | 3522 | 3527 | 0 | 11 - Infant under age 2 (INF) | July | 2017 | NA | NA |
1598 | 3523 | 3525 | 0 | 11 - Infant under age 2 (INF) | April | 2018 | NA | NA |
1598 | 3524 | NA | 4 | 14 - Aged 2-5 (K) | July | 2013 | NA | NA |
1598 | 3525 | NA | 20 | 20 - Selected women aged 10-49 (WN) | NA | NA | April | 2018 |
1598 | 3526 | NA | 25 | 20 - Selected women aged 10-49 (WN) | NA | NA | December | 2017 |
1598 | 3527 | NA | 45 | 20 - Selected women aged 10-49 (WN) | NA | NA | July | 2017 |
Success! MOMID
shows the correct PERSONID
for the mother of each infant, and it contains expected NA
values as discussed above.
But how many children were matched to a mother, overall? Remember that a slight majority of sampled children (approximately 55%) live in households where no women were sampled for the female nutrition survey. We’ll flag the rest with an indicator WOMEN_IN_HH
, then test whether each child was LINKED
with a mother:
dat %>%
group_by(HHID) %>%
mutate(WOMEN_IN_HH = any(ELIGTYPE == 20)) %>%
filter(ELIGTYPE < 20) %>%
ungroup() %>%
count(WOMEN_IN_HH, LINKED = !is.na(MOMID)) %>%
mutate(pct = 100*prop.table(n))
# A tibble: 3 × 4
WOMEN_IN_HH LINKED n pct
<lgl> <lgl> <int> <dbl>
1 FALSE FALSE 1579 55.3
2 TRUE FALSE 481 16.8
3 TRUE TRUE 796 27.9
So far, we’ve managed to locate the mother for only about 28% of the children in our sample. Around 17% of the remaining children do live with a sampled woman, but could not be linked with a mother by matched birthdates.
Let’s see if we can improve on these results.
In our first search, our use of LASTBIRTHMO
and LASTBIRTHYR
ensured that only mother’s most recent child could be linked to her record. It’s possible to expand these criteria in certain circumstances using RELATEKID, which describes the relationship between each child and the person who provided responses to the interviewer on their behalf.
dat %>% count(RELATEKID)
# A tibble: 7 × 2
RELATEKID n
<int+lbl> <int>
1 1 [Mother] 2646
2 2 [Father] 57
3 3 [Grandmother] 94
4 4 [Grandfather] 7
5 5 [Sister] 4
6 94 [Other] 48
7 99 [NIU (not in universe)] 2411
When RELATEKID == 1
, this respondent is the child’s mother. So, suppose we have a household like this one:
HHID | PERSONID | MOMID | AGEHQ | ELIGTYPE | RELATEKID | KIDBIRTHMO | KIDBIRTHYR | LASTBIRTHMO | LASTBIRTHYR |
---|---|---|---|---|---|---|---|---|---|
61 | 147 | 149 | 0 | 11 - Infant under age 2 (INF) | Mother | May | 2018 | NA | NA |
61 | 148 | NA | 1 | 11 - Infant under age 2 (INF) | Mother | December | 2016 | NA | NA |
61 | 149 | NA | 19 | 20 - Selected women aged 10-49 (WN) | NA | NA | NA | May | 2018 |
We’ve already matched the youngest child in this household with our first search, but the older child remains unlinked. Fortunately, RELATEKID
tells us that the older child’s mother also lives in the household, since she provided responses on their behalf.
Because there is only one reproductive age woman living in this particular household, it’s reasonable to assume that the same woman is the mother for both children.
Let’s implement these criteria in a new search, creating MOMID2
.
dat <- dat %>%
rowwise() %>%
mutate(MOMID2 = case_when(ELIGTYPE < 20 ~ {
# This is the same as above: if we find a match, we'll call it `mom`
kid_month <- KIDBIRTHMO
kid_year <- KIDBIRTHYR
moms <- HH_DATA %>%
filter(LASTBIRTHMO == kid_month, LASTBIRTHYR == kid_year) %>%
pull(PERSONID)
mom <- ifelse(length(moms) == 1, moms, NA)
# Additionally, if `mom` is NA, we'll look at `RELATEKID`
# If the respondent was the child's mother, and only one woman in the
# household has ever given birth, we'll define that woman as `mom`
if(is.na(mom)){
moms <- HH_DATA %>%
filter(LASTBIRTHYR < 9000) %>%
pull(PERSONID)
mom <- ifelse(length(moms) == 1 & RELATEKID == 1, moms, NA)
}
mom
})) %>%
ungroup()
Now, both of the children in our example household will show the same MOMID2
:
HHID | PERSONID | MOMID | MOMID2 | AGEHQ | ELIGTYPE | RELATEKID | KIDBIRTHMO | KIDBIRTHYR | LASTBIRTHMO | LASTBIRTHYR |
---|---|---|---|---|---|---|---|---|---|---|
61 | 147 | 149 | 149 | 0 | 11 - Infant under age 2 (INF) | Mother | May | 2018 | NA | NA |
61 | 148 | NA | 149 | 1 | 11 - Infant under age 2 (INF) | Mother | December | 2016 | NA | NA |
61 | 149 | NA | NA | 19 | 20 - Selected women aged 10-49 (WN) | NA | NA | NA | May | 2018 |
How much does MOMID2
improve on the results from MOMID
?
dat %>%
group_by(HHID) %>%
mutate(WOMEN_IN_HH = any(ELIGTYPE == 20)) %>%
filter(ELIGTYPE < 20) %>%
ungroup() %>%
count(WOMEN_IN_HH, LINKED1 = !is.na(MOMID), LINKED2 = !is.na(MOMID2)) %>%
mutate(pct = 100*prop.table(n))
# A tibble: 4 × 5
WOMEN_IN_HH LINKED1 LINKED2 n pct
<lgl> <lgl> <lgl> <int> <dbl>
1 FALSE FALSE FALSE 1579 55.3
2 TRUE FALSE FALSE 198 6.93
3 TRUE FALSE TRUE 283 9.91
4 TRUE TRUE TRUE 796 27.9
Our second search found mothers for an additional 283 children, or about 10% of the overall child nutrition sample. Setting aside the 55% of child who could not possibly be linked because of sample design restrictions, we were unable to find a mother in only 7% of cases (many of these may not live with their mother, or else we have insufficient information to identify a match).
Now that we’ve linked as many children as possible to their mother’s data, we’ll be able to explore how certain antenatal interventions might impact child growth and nutrition. Consider our first example household again:
PERSONID | MOMID | AGEHQ | KIDARMCIRCVAL | RPANCPREGMO |
---|---|---|---|---|
3521 | 3526 | 0 | 13.3 | NA |
3522 | 3527 | 0 | 13.6 | NA |
3523 | 3525 | 0 | 10.9 | NA |
3524 | NA | 4 | 14.1 | NA |
3525 | NA | 20 | NA | 7 |
3526 | NA | 25 | NA | 3 |
3527 | NA | 45 | NA | 3 |
We’ve added two new variables here:
Now that we’ve established a link between the infants and mothers, it’s possible to investigate how infants are impacted by, for example, the timing or quality of antenatal care provided to their mother during pregnancy. In this household, the KIDARMCIRCVAL
shown for the child in row 3 suggests that they may be at risk for acute malnutrition. We also observe in RPANCPREGMO
that their mother received no antenatal care until the third trimester of her pregnancy (month 7). If we took the additional step of attaching the value in RPANCPREGMO
to the appropriate child’s record - for example, in a variable we might call MOM_ANCPREGMO
- we could then build a model examining this relationship in the larger dataset.
In our next post, we’ll continue our series on PMA nutrition surveys as we consider more ways to measure nutritional outcomes in children. In particular, we’ll be looking at common assessment tools for recommended infant and young child feeding practices (IYCFP) and how to create them from dietary intake data in the child nutrition survey.
If you see mistakes or want to suggest changes, please create an issue on the source repository.