2021 Blog Round-Up

R Tips R packages

Where to find example code for all of the key concepts we’ve covered so far.

Matt Gunther (IPUMS PMA Senior Data Analyst)
2021-11-15

We launched the IPUMS PMA Data Analysis Hub one year ago this week, and we’ve covered a lot since then! As we’re gearing up for new series exploring Client Exit Interviews and longitudinal analysis with Round 2 Panel Data in the coming months, we thought it might be a nice time to pause and take inventory of the important concepts we’ve covered in our first year.

Introduction Course

Looking for a place to start learning about IPUMS PMA data with coding examples in your preferred programming language? Check out our new online introduction course! This course is free and open to all registered IPUMS PMA data users. You’ll learn all about PMA surveys, the IPUMS data extract system, and basic analysis tools for R or Stata.

Data Discovery

Our Data Discovery posts are where you’ll find announcements about new data releases and deep-dives into available data for a featured topic. So far, we’ve explored:

Data Manipulation

A Data Manipulation post typically showcases tips for recoding, reshaping, describing, or visualizing variables included in an IPUMS PMA data extract. Often, we’ll feature tools from one or two of our favorite R packages; while we always recommend visiting package documentation websites, our goal here is to show why these tools are important specifically for working with data from IPUMS PMA.

IPUMS labelled data

Almost every post uses the ipumsr package to import an IPUMS PMA data extract into R. Once loaded, you’ll find that many IPUMS PMA variables are labelled; the ipumsr package contains functions for exploring and manipulating those labels. We’ve shown how to:

Iteration with columns

When you want to apply changes to several IPUMS PMA variables, it’s usually most efficient to leverage the across function from dplyr. We use across all the time, but some of our best examples show how to:

Iteration with rows

Many functions in R are vectorized: when you apply them to a column in your data frame, you typically get one result for each row.

Non-vectorized functions instead return one result summarizing the entire column. For instance, the function mean gives you one mean value derived from all rows in a given column. Tools from dplyr like rowwise and group_by allow us to apply non-vectorized functions to individual rows or groups of rows. We’ve used these functions to:

Joining and Reshaping

All of the IPUMS PMA data extracts we’ve examined so far are rectangular in shape: each row represents one person or facility. Sometimes, it’s necessary to change this structure: we might want to spread multiple observations of the same individual into separate rows, or we might want to leverage hierarchical data to situate a person within some larger context (like a household, a region, or a country). With help from data-structuring packages like tidyr, we’ve shown how to:

Visualization

Whenever we summarize variables or model output on this blog, we like to make figures with ggplot2 and tables with gtsummary. For maps, we’ve focused on ggspatial - an extension of ggplot2 that supports raster data from both the raster package and the terra package. Topics include:

Spatial data

Lastly, we’ve devoted several Data Manipulation posts to exploring external sources for spatial data. If you’re new to spatial data - or if you’re an ArcGIS user interested in learning more about R - we recommend checking out the free, open-source text Geocomputation with R. You’ll be introduced to both of the major packages we use on this blog: sf for vector data and terra for raster data. We’re planning to cover much more spatial content in the coming months, but so far we’ve explained how to:

Data Analysis

We like to end every series with a Data Analysis post. Here, we use what we’ve learned in earlier posts to build a model and discuss important findings. Often, we’ll use this as an opportunity to showcase new and exciting published research from authors working with PMA data. These posts discuss:

Plans for 2022

We’ve been trilled to connect with so many of you - more than 2,000 readers in over 100 countries - in our first year! Coming in the months ahead, we’ll continue providing bi-weekly blog posts covering the latest IPUMS PMA data releases as they become available. We’ve mentioned upcoming series planned for Client Exit Interviews and Round 2 Panel Data, and we’ll also return to the relationship between COVID-19 and family planning in new samples. We plan to continue emphasizing spatial analysis made possible with external data sources, and we’d love to expand our coverage of PMA publications in collaboration with researchers using IPUMS PMA data in their own work. We’re also planning a new course on longitudinal analysis using the first two rounds of PMA panel data; look for new announcements on this in the early spring. And, as always, we’ll continue exploring free and powerful tools for working with survey data in R that we hope will make IPUMS PMA accessible to more researchers than ever before.

Special thanks to all of our outstanding partners at the Bill & Melinda Gates Foundation, Johns Hopkins University, and here at IPUMS for tremendous support as we launched this new project in 2021!

This blog would not be possible without brilliant insight and ongoing encouragement from the whole team at IPUMS PMA: professors Kathryn Grace, Elizabeth Heger Boyle, and Nina Brooks; project director Devon Kristiansen; post-doctoral associates Maya Luetke and Jiao Yu; and graduate research assistants Shelby Rutzick, Saeun Park, and Tayler Nelson who contributed posts this year - thanks to each of you!

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.