Update: Matching Records in 2018 Nutrition Surveys

Nutrition Data Manipulation rowwise group_by cur_data pmap

It’s now much easier to perform row-wise operations with pre-grouped data!

Matt Gunther (IPUMS PMA Senior Data Analyst)
09-15-2021

When IPUMS PMA first released harmonized 2018 nutrition surveys from households in Burkina Faso and Kenya, we published a user note describing how to match children and mothers in the same household with pmap - a function designed to iterate over several variables simultaneously.

Our challenge was this: IPUMS PMA nutrition data are actually a harmonized version of three separate surveys. When you download and open a nutrition data extract, any given row may contain variables associated with:

As we mentioned in our last post, all children under age 5 living in one of the screened households were sampled in the subsequent child nutrition survey. However, women aged 10-49 were only selected for the female nutrition survey from a random sub-sample of screened households (45% of sampled households in Burkina Faso, 25% in Kenya).

We imagine that many users will want to link children to mothers where possible, particularly because the female nutrition survey includes questions about antenatal care and nutrtion support the mother received during her most recent pregnancy. Our user note offered code you could use to, at minimum, link sampled mothers to their youngest sampled child.

In this post, we’d like to offer an updated version of this code that uses the rowwise function released with dplyr 1.0.0 last year. Both approaches work great! But, we believe that the rowwise approach is a quite bit easier to read and understand.

Setup

To keep things simple, we’ll use a data extract from Burkina Faso 2018 that contains only household members who participated in either the child nutrition survey or the female nutrition survey (you can request a file with all household members if you select “All Cases” at checkout). We’ll use the tidyverse and ipumsr packages to load our data into R.

library(tidyverse)
library(ipumsr)

dat <- read_ipums_micro(
  ddi = "data/pma_00002.xml",
  data = "data/pma_00002.dat.gz"
)

For the purposes of this post, we’ll be working with only a small subset of variables (feel free to include any number of variables when you create your own extract). Also: in order to make our coding examples a bit more readable, we’ll take an extra step of reassigning shorter identification numbers to each household and person - this is not recommended practice in general and we do so only to reduce the number of characters in HHID (20) and PERSONID (22) to 4.

dat <- dat %>% 
  select(
    HHID,
    ELIGTYPE,
    AGEHQ,
    RELATEKID,
    KIDBIRTHYR,
    KIDBIRTHMO,
    LASTBIRTHYR,
    LASTBIRTHMO,
    KIDARMCIRCVAL,
    RPANCPREGMO
  ) %>% 
  arrange(HHID, ELIGTYPE) %>% 
  rowid_to_column(var = "PERSONID") %>% 
  group_by(HHID) %>% 
  mutate(HHID = cur_group_id()) %>% 
  ungroup()

Matching with multiple varialbes

For any given household in our dataset, there should always be at least one child under age 5 who participated in the child nutrition survey. We also expect roughly 45% of households should also include at least one woman aged 10-49 who particpated in the female nutrition survey.

A typical household might look something like this:

HHID PERSONID AGEHQ ELIGTYPE KIDBIRTHMO KIDBIRTHYR LASTBIRTHMO LASTBIRTHYR
1598 3521 0 11 - Infant under age 2 (INF) December 2017 NA NA
1598 3522 0 11 - Infant under age 2 (INF) July 2017 NA NA
1598 3523 0 11 - Infant under age 2 (INF) April 2018 NA NA
1598 3524 4 14 - Aged 2-5 (K) July 2013 NA NA
1598 3525 20 20 - Selected women aged 10-49 (WN) NA NA April 2018
1598 3526 25 20 - Selected women aged 10-49 (WN) NA NA December 2017
1598 3527 45 20 - Selected women aged 10-49 (WN) NA NA July 2017

All members of the household share the same household ID in HHID. Each person has a unique PERSONID, while AGEHQ gives their age (when the household questionnaire was completed) and ELIGTYPE describes their eligibility for sub-modules on either the child or female nutrition questionnaire (both the numeric code and label are shown).

In this household, it’s easy to spot the mother for each infant: each infant’s birth month KIDBIRTHMO and year KIDBIRTHYR match the month and year of one woman’s most recent delivery shown in LASTBIRTHMO and LASTBIRTHYR. Unfortunately, it’s not possible to match the 4 year-old child to a mother (women were only asked about the date of their most recent birth).

Once the matches are identified by both month and year, we could attach any number of variables from the mother’s record onto the record for the child. For demonstration purposes, we’ll just attach her PERSONID as a new variable called MOMID. If the child’s mother cannot be determined - or if a particular row contains data from a mother, rather than a child - we’ll assign the value NA to MOMID.

Identifying the “current group”

Imagine dividing our search in two basic steps: first, we’ll want to group our data into households. Then, we’ll iterate through person in the dataset: if that person is a child, we’ll search the members of their group for a woman with a matching birth month and year.

Let’s focus on the first step for a moment. A simple way to minimize the number of records involved with the search for each child’s mother is to group_by HHID, and then create a unique subset of the larger dataset for each child. After we’ve grouped the data, we’ll use cur_data to return the current data for each group. For convenience, you might save this miniature dataset as a column in dat - we’ll call ours HH_DATA:

dat <- dat %>% 
  group_by(HHID) %>% 
  mutate(HH_DATA = list(cur_data())) %>% 
  ungroup()

If you now look at dat, you’ll see we’ve stored one table in HH_DATA for each person. If you opened one of those tables, you’d see:

dat %>% select(PERSONID, HHID, HH_DATA)
# A tibble: 5,267 × 3
   PERSONID  HHID HH_DATA          
      <int> <int> <list>           
 1        1     1 <tibble [3 × 10]>
 2        2     1 <tibble [3 × 10]>
 3        3     1 <tibble [3 × 10]>
 4        4     2 <tibble [2 × 10]>
 5        5     2 <tibble [2 × 10]>
 6        6     3 <tibble [2 × 10]>
 7        7     3 <tibble [2 × 10]>
 8        8     4 <tibble [1 × 10]>
 9        9     5 <tibble [5 × 10]>
10       10     5 <tibble [5 × 10]>
# … with 5,257 more rows

Let’s pull HH_DATA for the first four individuals in dat (essentially printing the contents of the list):

dat %>% 
  slice(1:4) %>% 
  pull(HH_DATA)
[[1]]
# A tibble: 3 × 10
  PERSONID  ELIGTYPE AGEHQ RELATEKID KIDBIRTHYR KIDBIRTHMO LASTBIRTHYR
     <int> <int+lbl> <int> <int+lbl>      <dbl>  <int+lbl>   <int+lbl>
1        1 14 [Aged…     4 1 [Mothe…       2014  6 [June]  9999 [NIU …
2        2 14 [Aged…     2 1 [Mothe…       2016  3 [March] 9999 [NIU …
3        3 14 [Aged…     2 2 [Fathe…       2016  5 [May]   9999 [NIU …
# … with 3 more variables: LASTBIRTHMO <int+lbl>,
#   KIDARMCIRCVAL <dbl+lbl>, RPANCPREGMO <int+lbl>

[[2]]
# A tibble: 3 × 10
  PERSONID  ELIGTYPE AGEHQ RELATEKID KIDBIRTHYR KIDBIRTHMO LASTBIRTHYR
     <int> <int+lbl> <int> <int+lbl>      <dbl>  <int+lbl>   <int+lbl>
1        1 14 [Aged…     4 1 [Mothe…       2014  6 [June]  9999 [NIU …
2        2 14 [Aged…     2 1 [Mothe…       2016  3 [March] 9999 [NIU …
3        3 14 [Aged…     2 2 [Fathe…       2016  5 [May]   9999 [NIU …
# … with 3 more variables: LASTBIRTHMO <int+lbl>,
#   KIDARMCIRCVAL <dbl+lbl>, RPANCPREGMO <int+lbl>

[[3]]
# A tibble: 3 × 10
  PERSONID  ELIGTYPE AGEHQ RELATEKID KIDBIRTHYR KIDBIRTHMO LASTBIRTHYR
     <int> <int+lbl> <int> <int+lbl>      <dbl>  <int+lbl>   <int+lbl>
1        1 14 [Aged…     4 1 [Mothe…       2014  6 [June]  9999 [NIU …
2        2 14 [Aged…     2 1 [Mothe…       2016  3 [March] 9999 [NIU …
3        3 14 [Aged…     2 2 [Fathe…       2016  5 [May]   9999 [NIU …
# … with 3 more variables: LASTBIRTHMO <int+lbl>,
#   KIDARMCIRCVAL <dbl+lbl>, RPANCPREGMO <int+lbl>

[[4]]
# A tibble: 2 × 10
  PERSONID  ELIGTYPE AGEHQ RELATEKID KIDBIRTHYR KIDBIRTHMO LASTBIRTHYR
     <int> <int+lbl> <int> <int+lbl>      <dbl>  <int+lbl>   <int+lbl>
1        4 14 [Aged…     3  1 [Moth…       2014  9 [Septe… 9999 [NIU …
2        5 20 [Sele…    25 99 [NIU …       9999 99 [NIU (… 2014       
# … with 3 more variables: LASTBIRTHMO <int+lbl>,
#   KIDARMCIRCVAL <dbl+lbl>, RPANCPREGMO <int+lbl>

The first three individuals are members of the same household, so they appear in eachother’s HH_DATA. The fourth is a member of a second household, and so on.

For our purposes, it’s perfectly fine to leave the household datasets in a list. Once we define a search function to find each child’s mother, we’ll simply apply it to each row of HH_DATA. For example, suppose we just wanted to find the age of the oldest person in each household. We would use rowwise to iterate over each row in dat, looking for the maximum value of AGEHQ in the HH_DATA table stored in each row.

dat %>% 
  rowwise() %>% 
  mutate(HH_AGE_OLDEST = max(HH_DATA$AGEHQ)) %>% 
  select(PERSONID, HHID, AGEHQ, HH_AGE_OLDEST)
# A tibble: 5,267 × 4
# Rowwise: 
   PERSONID  HHID     AGEHQ HH_AGE_OLDEST
      <int> <int> <int+lbl>     <int+lbl>
 1        1     1         4             4
 2        2     1         2             4
 3        3     1         2             4
 4        4     2         3            25
 5        5     2        25            25
 6        6     3         1             4
 7        7     3         4             4
 8        8     4         2             2
 9        9     5         4            27
10       10     5         2            27
# … with 5,257 more rows

The main difference between max and the function we’ll use for our search is that max returns the same value for every person in the household; strictly speaking, it’s not necessary to do a row-wise search through pre-grouped household data in this case. Our problem is more complex: we want to return a specific value for each person via rowwise, but we also want to restrict each person’s search to members of a pre-defined group.

Row-wise search within groups

Now that we’ve isolated data for each person’s household in HH_DATA, we need to write a custom function that will match birth months and years, and then return a value for MOMID only if exactly one match is found.

To save time, we’ll only perform our search if ELIGTYPE shows that the current row contains data for a sampled child. In other words, the value for ELIGTYPE should be less than 20.

dat %>% count(ELIGTYPE)
# A tibble: 3 × 2
                             ELIGTYPE     n
                            <int+lbl> <int>
1 11 [Infant under age 2 (INF)]        1194
2 14 [Aged 2-5 (K)]                    1662
3 20 [Selected women aged 10-49 (WN)]  2411

The case_when function makes this job simple: we’ll explicitly define a function for the “case when” ELIGTYPE < 20, but we’ll not specify what to in the “case when” ELIGTYPE == 20. A nice feature of case_when is that it automatically returns the value NA for any cases that are not handled explicitly:

dat %>% 
  count(ELIGTYPE) %>% 
  mutate(EXAMPLE = case_when(ELIGTYPE < 20 ~ TRUE))
# A tibble: 3 × 3
                             ELIGTYPE     n EXAMPLE
                            <int+lbl> <int> <lgl>  
1 11 [Infant under age 2 (INF)]        1194 TRUE   
2 14 [Aged 2-5 (K)]                    1662 TRUE   
3 20 [Selected women aged 10-49 (WN)]  2411 NA     

As we’ve seen, one way to identify the likely mother for each child is to match the mother’s LASTBIRTHMO and LASTBIRTHYR to the child’s KIDBIRTHMO and KIDBIRTHYR. If there is only one match in the household, we’ll return the mother’s PERSONID from HH_DATA as a new column MOMID. On the other hand, if multiple women could be the mother or if no match could be found, we’ll return the value NA.

dat <- dat %>% 
  rowwise() %>%
  mutate(MOMID = case_when(ELIGTYPE < 20 ~ {
    # re-name these in order to avoid confusion with the columns in HH_DATA:
    kid_month <- KIDBIRTHMO
    kid_year <- KIDBIRTHYR
    
    # pull the PERSONID for any match
    moms <- HH_DATA %>% 
      filter(LASTBIRTHMO == kid_month, LASTBIRTHYR == kid_year) %>% 
      pull(PERSONID)
    
    # if exactly one match was found, return that PERSONID
    # otherwise, return NA
    ifelse(length(moms) == 1, moms, NA) 
  })) %>% 
  ungroup()

Let’s return to our example household:

HHID PERSONID MOMID AGEHQ ELIGTYPE KIDBIRTHMO KIDBIRTHYR LASTBIRTHMO LASTBIRTHYR
1598 3521 3526 0 11 - Infant under age 2 (INF) December 2017 NA NA
1598 3522 3527 0 11 - Infant under age 2 (INF) July 2017 NA NA
1598 3523 3525 0 11 - Infant under age 2 (INF) April 2018 NA NA
1598 3524 NA 4 14 - Aged 2-5 (K) July 2013 NA NA
1598 3525 NA 20 20 - Selected women aged 10-49 (WN) NA NA April 2018
1598 3526 NA 25 20 - Selected women aged 10-49 (WN) NA NA December 2017
1598 3527 NA 45 20 - Selected women aged 10-49 (WN) NA NA July 2017

Success! MOMID shows the correct PERSONID for the mother of each infant, and it contains expected NA values as discussed above.

But how many children were matched to a mother, overall? Remember that a slight majority of sampled children (approximately 55%) live in households where no women were sampled for the female nutrition survey. We’ll flag the rest with an indicator WOMEN_IN_HH, then test whether each child was LINKED with a mother:

dat %>%
  group_by(HHID) %>% 
  mutate(WOMEN_IN_HH = any(ELIGTYPE == 20)) %>% 
  filter(ELIGTYPE < 20) %>% 
  ungroup() %>% 
  count(WOMEN_IN_HH, LINKED = !is.na(MOMID)) %>% 
  mutate(pct = 100*prop.table(n))
# A tibble: 3 × 4
  WOMEN_IN_HH LINKED     n   pct
  <lgl>       <lgl>  <int> <dbl>
1 FALSE       FALSE   1579  55.3
2 TRUE        FALSE    481  16.8
3 TRUE        TRUE     796  27.9

So far, we’ve managed to locate the mother for only about 28% of the children in our sample. Around 17% of the remaining children do live with a sampled woman, but could not be linked with a mother by matched birthdates.

Let’s see if we can improve on these results.

Expanded search

In our first search, our use of LASTBIRTHMO and LASTBIRTHYR ensured that only mother’s most recent child could be linked to her record. It’s possible to expand these criteria in certain circumstances using RELATEKID, which describes the relationship between each child and the person who provided responses to the interviewer on their behalf.

dat %>% count(RELATEKID)
# A tibble: 7 × 2
                   RELATEKID     n
                   <int+lbl> <int>
1  1 [Mother]                 2646
2  2 [Father]                   57
3  3 [Grandmother]              94
4  4 [Grandfather]               7
5  5 [Sister]                    4
6 94 [Other]                    48
7 99 [NIU (not in universe)]  2411

When RELATEKID == 1, this respondent is the child’s mother. So, suppose we have a household like this one:

HHID PERSONID MOMID AGEHQ ELIGTYPE RELATEKID KIDBIRTHMO KIDBIRTHYR LASTBIRTHMO LASTBIRTHYR
61 147 149 0 11 - Infant under age 2 (INF) Mother May 2018 NA NA
61 148 NA 1 11 - Infant under age 2 (INF) Mother December 2016 NA NA
61 149 NA 19 20 - Selected women aged 10-49 (WN) NA NA NA May 2018

We’ve already matched the youngest child in this household with our first search, but the older child remains unlinked. Fortunately, RELATEKID tells us that the older child’s mother also lives in the household, since she provided responses on their behalf.

Because there is only one reproductive age woman living in this particular household, it’s reasonable to assume that the same woman is the mother for both children.

Let’s implement these criteria in a new search, creating MOMID2.

dat <- dat %>% 
  rowwise() %>%
  mutate(MOMID2 = case_when(ELIGTYPE < 20 ~ {
    # This is the same as above: if we find a match, we'll call it `mom`
    kid_month <- KIDBIRTHMO
    kid_year <- KIDBIRTHYR
    moms <- HH_DATA %>%
      filter(LASTBIRTHMO == kid_month, LASTBIRTHYR == kid_year) %>%
      pull(PERSONID)
    mom <- ifelse(length(moms) == 1, moms, NA)
    
    # Additionally, if `mom` is NA, we'll look at `RELATEKID`
    # If the respondent was the child's mother, and only one woman in the 
    # household has ever given birth, we'll define that woman as `mom`
    if(is.na(mom)){
      moms <- HH_DATA %>% 
        filter(LASTBIRTHYR < 9000) %>% 
        pull(PERSONID) 
      mom <- ifelse(length(moms) == 1 & RELATEKID == 1, moms, NA)
    }
    mom
  })) %>% 
  ungroup() 

Now, both of the children in our example household will show the same MOMID2:

HHID PERSONID MOMID MOMID2 AGEHQ ELIGTYPE RELATEKID KIDBIRTHMO KIDBIRTHYR LASTBIRTHMO LASTBIRTHYR
61 147 149 149 0 11 - Infant under age 2 (INF) Mother May 2018 NA NA
61 148 NA 149 1 11 - Infant under age 2 (INF) Mother December 2016 NA NA
61 149 NA NA 19 20 - Selected women aged 10-49 (WN) NA NA NA May 2018

How much does MOMID2 improve on the results from MOMID?

dat %>%
  group_by(HHID) %>% 
  mutate(WOMEN_IN_HH = any(ELIGTYPE == 20)) %>% 
  filter(ELIGTYPE < 20) %>% 
  ungroup() %>% 
  count(WOMEN_IN_HH, LINKED1 = !is.na(MOMID), LINKED2 = !is.na(MOMID2)) %>% 
  mutate(pct = 100*prop.table(n))
# A tibble: 4 × 5
  WOMEN_IN_HH LINKED1 LINKED2     n   pct
  <lgl>       <lgl>   <lgl>   <int> <dbl>
1 FALSE       FALSE   FALSE    1579 55.3 
2 TRUE        FALSE   FALSE     198  6.93
3 TRUE        FALSE   TRUE      283  9.91
4 TRUE        TRUE    TRUE      796 27.9 

Our second search found mothers for an additional 283 children, or about 10% of the overall child nutrition sample. Setting aside the 55% of child who could not possibly be linked because of sample design restrictions, we were unable to find a mother in only 7% of cases (many of these may not live with their mother, or else we have insufficient information to identify a match).

New research questions

Now that we’ve linked as many children as possible to their mother’s data, we’ll be able to explore how certain antenatal interventions might impact child growth and nutrition. Consider our first example household again:

PERSONID MOMID AGEHQ KIDARMCIRCVAL RPANCPREGMO
3521 3526 0 13.3 NA
3522 3527 0 13.6 NA
3523 3525 0 10.9 NA
3524 NA 4 14.1 NA
3525 NA 20 NA 7
3526 NA 25 NA 3
3527 NA 45 NA 3

We’ve added two new variables here:

Now that we’ve established a link between the infants and mothers, it’s possible to investigate how infants are impacted by, for example, the timing or quality of antenatal care provided to their mother during pregnancy. In this household, the KIDARMCIRCVAL shown for the child in row 3 suggests that they may be at risk for acute malnutrition. We also observe in RPANCPREGMO that their mother received no antenatal care until the third trimester of her pregnancy (month 7). If we took the additional step of attaching the value in RPANCPREGMO to the appropriate child’s record - for example, in a variable we might call MOM_ANCPREGMO - we could then build a model examining this relationship in the larger dataset.

In our next post, we’ll continue our series on PMA nutrition surveys as we consider more ways to measure nutritional outcomes in children. In particular, we’ll be looking at common assessment tools for recommended infant and young child feeding practices (IYCFP) and how to create them from dietary intake data in the child nutrition survey.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.