IPUMS Terra: Integrated Population and Environment Data

The IPUMS Terra project allows you to create datasets for your research that seamlessly combine area-level data, raster data and microdata and use whichever of these data formats is best for your analysis.

Raster data is data that describes the world on a grid - each cell in the grid gets its own value. Temperature is an example of data available in raster form from IPUMS Terra - satellite imagery allows geographers to create estimates of average temperature for approximately 1 km square grids across the globe.

Area level data is data that represents a particular area. In IPUMS Terra, the areas are political boundaries, like countries, states or provinces.

Microdata is data that describe each person separately. In IPUMS Terra, this data generally comes from census data from the IPUMS International project.

Learn more about the kinds of data in IPUMS Terra here: https://data.terrapop.org

ipumsr helps you read this data into R so you can continue your analysis.

library(ipumsr)
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(sf)

Learning more about spatial data in R

This vignette focuses on the mechanics of getting spatial data from IPUMS Terra into R, but it glosses over the details about what you can do after you’ve loaded it. To learn more about these, here are some resources:

Example: Age and migration patterns by temperature in the US

On this cold February day in Minnesota, I wonder how migration between areas of the US is affected by the temperature and whether I see any evidence of my preconception that older people retire to warmer climates. As a quick aside - this analysis might be better served by data from the IPUMS NHGIS project (see vigntette("ipums-nhgis")), because it often has more granular geographic areas than IPUMS Terra, which which pulls from our microdata projects, but I use Terra to show off the way it helps moving between data types.

Raster data

IPUMS Terra includes raster data on the average temperature 1950-2000 by month from the WorldClim dataset. It also converts area-level summaries of some variables from the US census microdata to raster grid, so we can get the total population and population by age on a grid.

For the example in this vignette, I use the sample 2010 US Census and variables:

  • Area Level: POPTOT, POP6569, POP7074, POP7579, POP80 (year-specific and at the lowest geographic level available)
  • Raster: TEMPAVGFEB (mean, also year-specific and at lowest geographic level available)

Once we’ve made the extract and downloaded it, let’s check what’s available in it using the function ipums_list_files().

ipums_list_files(raster_extract)
#> # A tibble: 7 × 2
#>   type   file                                          
#>   <chr>  <chr>                                         
#> 1 raster TEMPAVGFEBUS2013.tiff                         
#> 2 raster UnitedStatesStates-2010_rasterized_geoids.tiff
#> 3 raster UnitedStatesStates-2010-TOTPOP.tiff           
#> 4 raster UnitedStatesStates-2010-POP6569.tiff          
#> 5 raster UnitedStatesStates-2010-POP7074.tiff          
#> 6 raster UnitedStatesStates-2010-POP7579.tiff          
#> 7 raster UnitedStatesStates-2010-POP80.tiff

To load the data, ipumsr provides two functions: read_terra_raster() and read_terra_raster_list(). The first reads a single raster, and the second loads 1 or more and puts them into a list.

Reading a single raster

Let’s first look at loading a single layer, by making a map of the TEMPAVGFEB layer (called TEMPAVGFEBUS2013.tiff in the file, but we can refer to it with the matches() function).

febtemp <- read_terra_raster(raster_extract, matches("TEMPAVGFEB"))
#> Use of data from IPUMS Terra is subject to conditions including that users
#> should cite the data appropriately. Use command `ipums_conditions()` for more
#> details.

The details of making good maps with raster data are beyond this tutorial, (see below for some pointers on where to learn more), but just to prove that we got the data, we can use the raster package’s built in plot() method.

# Crop map a bit
zoomed_extent <- raster::extent(c(-180, -64, 16.5, 72.5))

raster::plot(febtemp / 10, ext = zoomed_extent)
title("Average temp (deg C) in Feb\n1950-2010 (US raster)")

Reading multiple rasters

We can also read multiple rasters and put them into a list object.

all_rasters <- read_terra_raster_list(raster_extract)
#> Use of data from IPUMS Terra is subject to conditions including that users
#> should cite the data appropriately. Use command `ipums_conditions()` for more
#> details.

Now we can make a map as before; this would be identical to the one above.

raster::plot(all_rasters[["TEMPAVGFEBUS2013"]] / 10, ext = zoomed_extent)
title("Average temp (deg C) in Feb\n1950-2010 (US raster)")

Again the details of spatial analysis of raster data in R are beyond the scope of this document (see below for some resources), but it one thing to keep in mind is that IPUMS Terra data, especially data from different sources (like rasterized area level data vs. the temperature data) may not have the same spatial extent (coverage area) or resolution, so you may need to do some conversion.

For example, to get the percent of the total population that is between ages 65-69, we can do this (because the data for total population and age-specific population have the same extent and resolution):

pop_pct65 <- all_rasters[["UnitedStatesStates-2010-POP6569"]] / 
  all_rasters[["UnitedStatesStates-2010-TOTPOP"]]

raster::plot(pop_pct65, ext = zoomed_extent)
title("Percent of population ages 65-69 in 2010\n(raseterized area)")

We can kind of see that the percent of population that is aged 65-69 is highest in a few places in warm places (in Arizona and Florida). To make this code run faster, I’ve only looked at 65-69 year olds, but really it’d be better to look at the percentage of the population that is 65+ by adding together all of the age categories. We leave that as an exercise for the reader

Also, comparing the population to the temperature on a grid level is more difficult. One solution is to use the resample() function from the raster package, which will use interpolation to put the rasters on the same grid. This is beyond the scope of this vignette.

Area level data

IPUMS Terra includes data from both the raster datasets and microdata that have been aggregated up to a geographic boundary.

For the example in this vignette, I use the sample 2010 US Census and variables:

  • Area Level: POPTOT, POP6569, POP7074, POP7579, POP80
  • Raster: TEMPAVGFEB (Mean)

Also, be sure to include the boundary files in the extract.

Once we’ve made the extract and downloaded it, let’s check what’s available in it using the function ipums_list_files().

ipums_list_files(area_extract)
#> # A tibble: 2 × 2
#>   type  file                                  
#>   <chr> <chr>                                 
#> 1 data  data_4618_IPUMS_US_SLAD_2010.csv      
#> 2 shape boundaries_4618_IPUMS_US_SLAD_2010.zip

Reading area-level data

We can load the data and geography together using the functions: read_terra_area_sf() or read_terra_area_sp()

(The first loads as an sf object, while the second loads them as the older SpatialPolygonsDataFrame object)

Or we could load them separately using the functions: read_terra_area() and read_ipums_sf()/read_ipums_sp()

For now, we focus on reading them together as an sf object, but the general idea of loading them is similar across all of these functions.

terra_area <- read_terra_area_sf(area_extract)
#> Use of data from IPUMS Terra is subject to conditions including that users
#> should cite the data appropriately. Use command `ipums_conditions()` for more
#> details.
#> options:        ENCODING=UTF-8 
#> Reading layer `boundaries_4618_IPUMS_US_SLAD_2010' from data source 
#>   `C:\Users\derek\AppData\Local\Temp\Rtmp6VsTzP\file32ec6ecf2c6b\boundaries_4618_IPUMS_US_SLAD_2010.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 2070 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -179.1526 ymin: 18.91236 xmax: 179.7758 ymax: 71.38875
#> Geodetic CRS:  WGS 84

With the (currently) still in-development version of ggplot2, we can use the function geom_sf() to make maps of the sf object.

Here’s a map of the average temperature in February, averaged over all rasters in a given geographic boundary:

terra_area <- terra_area %>% 
  mutate(feb_temp_c = TEMPAVGFEB_mean_GEO2_US2010_WORLDCLIM / 10)


ggplot(terra_area, aes(fill = feb_temp_c)) + 
  geom_sf(linetype = "blank") + 
  coord_sf(xlim = c(-180, -64), ylim = c(16.5, 72.5)) + # Crop out Alaska's East Hemisphere tail
  ggtitle("Average temp (deg C) in Feb\n1950-2010 (US Geo2)")

Or we can look at a scatter comparing percentage of population older than 65 compared to average temperature in February.

terra_area <- terra_area %>%
  mutate(
    pct_pop_65 =
      (POP6569_GEO2_US2010_US2010A + POP7074_GEO2_US2010_US2010A +
         POP7579_GEO2_US2010_US2010A + POP80_GEO2_US2010_US2010A) /
      (TOTPOP_GEO2_US2010_US2010A)
  )

ggplot(terra_area, aes(x = pct_pop_65, y = feb_temp_c)) + 
  geom_point(alpha = 0.2) + 
  ggtitle("Average temperature against percent of\npopulation over 65 by US Geo 2")
#> Warning: Removed 1 rows containing missing values (geom_point).

Here we again see those retirement havens are in the warmer climates, though in general it doesn’t seem to be an incredibly strong relationship.

Microdata

Finally, IPUMS Terra includes microdata with the area-level and raster data attached to it.

For the example in this vignette, I use the sample 2010 US Census and variables: - Microdata: - Household: GEO2_US2010, GEO1_US21010 - Person: AGE, MIGUS2 - Area Level: (None) - Raster: TEMPAVGFEB (mean, also year-specific and at lowest geographic level available)

Also, be sure to include the boundary files in the extract.

Once we’ve made the extract and downloaded it, let’s check what’s available in it using the function ipums_list_files().

ipums_list_files(micro_extract)
#> # A tibble: 2 × 2
#>   type  file                                  
#>   <chr> <chr>                                 
#> 1 data  terrapop_extract_4621.csv.gz          
#> 2 shape boundaries_4621_IPUMS_US_SLAD_2010.zip

Using ipumsr to load the microdata

For microdata, we should read the microdata using read_terra_micro() and the boundary files using either read_ipums_sf() or read_ipums_sp(). We don’t want to merge these two right away, because this would create a copy of each polygon for every observation in the microdata. Instead, generally we create area level summaries from the microdata and then merge them together using the ipums_shape_*_join() functions.

terra_micro <- read_terra_micro(micro_extract)
#> Use of data from IPUMS Terra is subject to conditions including that users
#> should cite the data appropriately. Use command `ipums_conditions()` for more
#> details.
terra_micro_shapes <- read_ipums_sf(micro_extract)
#> options:        ENCODING=latin1 
#> Reading layer `boundaries_4621_IPUMS_US_SLAD_2010' from data source 
#>   `C:\Users\derek\AppData\Local\Temp\Rtmp6VsTzP\file32ec3b3c626e\boundaries_4621_IPUMS_US_SLAD_2010.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 2070 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -179.1526 ymin: 18.91236 xmax: 179.7758 ymax: 71.38875
#> Geodetic CRS:  WGS 84

Now we’ll create area level summaries for people older than 65 of what percent of the population has moved states in the past year.

graph_data <- terra_micro %>%
  mutate(moved_states = MIGUS2 != GEO1_US2010) %>%
  group_by(GEO2_US2010, age65plus = ifelse(AGE >= 65, "65 and older", "Under 65")) %>%
  summarize(
    moved_states = weighted.mean(moved_states, PERWT),
    feb_temp_c = TEMPAVGFEB_mean_GEO2_US2010_WORLDCLIM[1] / 10
  )
#> `summarise()` has grouped output by 'GEO2_US2010'. You can override using the
#> `.groups` argument.

# ipums_shape_inner_join() will join the 2 datasets together, and keep
# only observations that are in both files. In this case, we only
# lose one observation that turns out to be the Great lakes.
graph_data <- ipums_shape_inner_join(
  graph_data, 
  terra_micro_shapes,
  by = c("GEO2_US2010" = "GEOID")
)
#> Some observations were lost in the join (1 observations in the shape file). See
#> `join_failures(...)` for more details.

# Look at percent of population 65 and older who have moved in last year
ggplot(
  graph_data %>% filter(age65plus == "65 and older"), 
  aes(fill = moved_states)
) + 
  geom_sf(linetype = "blank") + 
  coord_sf(xlim = c(-180, -64), ylim = c(16.5, 72.5)) + # Crop out Alaska's East Hemisphere tail
  ggtitle("65 and older: percent moved from another\nstate in last year")

Looks like some parts of Florida and Arizona might be sticking out again, but it’s hard to say. Again, there’s lots more we could look at here, but I’ll leave that up to you!