Read tabular data from an NHGIS extract

Read a csv or fixed-width (.dat) file downloaded from the NHGIS extract system.

To read spatial data from an NHGIS extract, use read_ipums_sf().

Usage

read_nhgis(
  data_file,
  file_select = NULL,
  vars = NULL,
  col_types = NULL,
  n_max = Inf,
  guess_max = min(n_max, 1000),
  do_file = NULL,
  var_attrs = c("val_labels", "var_label", "var_desc"),
  remove_extra_header = TRUE,
  verbose = TRUE,
  data_layer = deprecated()
)

Arguments

data_file

Path to a .zip archive containing an NHGIS extract or a single file from an NHGIS extract.

file_select

If data_file is a .zip archive that contains multiple files, an expression identifying the file to load. Accepts a character vector specifying the file name, a tidyselect selection, or an index position. This must uniquely identify a file.

vars

Names of variables to include in the output. Accepts a vector of names or a tidyselect selection. If NULL, includes all variables in the file.

col_types

One of NULL, a cols() specification or a string. If NULL, all column types will be inferred from the values in the first guess_max rows of each column. Alternatively, you can use a compact string representation to specify column types:

c = character
i = integer
n = number
d = double
l = logical
f = factor
D = date
T = date time
t = time
? = guess
_ or - = skip

See read_delim() for more details.

n_max

Maximum number of lines to read.

guess_max

For .csv files, maximum number of lines to use for guessing column types. Will never use more than the number of lines read.

do_file

For fixed-width files, path to the .do file associated with the provided data_file. The .do file contains the parsing instructions for the data file.

By default, looks in the same path as data_file for a .do file with the same name. See Details section below.

var_attrs

Variable attributes to add from the codebook (.txt) file included in the extract. Defaults to all available attributes.

See set_ipums_var_attributes() for more details.

remove_extra_header

If TRUE, remove the additional descriptive header row included in some NHGIS .csv files.

This header row is not usually needed as it contains similar information to that included in the "label" attribute of each data column (if var_attrs includes "var_label").

verbose

Logical controlling whether to display output when loading data. If TRUE, displays IPUMS conditions, a progress bar, and column types. Otherwise, all are suppressed.

Will be overridden by readr.show_progress and readr.show_col_types options, if they are set.

data_layer

Please use file_select instead.

Value

A tibble containing the data found in data_file

Details

The .do file that is included when downloading an NHGIS fixed-width extract contains the necessary metadata (e.g. column positions and implicit decimals) to correctly parse the data file. read_nhgis() uses this information to parse and recode the fixed-width data appropriately.

If you no longer have access to the .do file, consider resubmitting the extract that produced the data. You can also change the desired data format to produce a .csv file, which does not require additional metadata files to be loaded.

For more about resubmitting an existing extract via the IPUMS API, see vignette("ipums-api", package = "ipumsr").

Examples

# Example files
csv_file <- ipums_example("nhgis0972_csv.zip")
fw_file <- ipums_example("nhgis0730_fixed.zip")

# Provide the .zip archive directly to load the data inside:
read_nhgis(csv_file)
#> Use of data from NHGIS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
#> Rows: 71 Columns: 25
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (9): GISJOIN, STUSAB, CMSA, PMSA, PMSAA, AREALAND, AREAWAT, ANPSADPI, F...
#> dbl (13): YEAR, MSA_CMSAA, INTPTLAT, INTPTLNG, PSADC, D6Z001, D6Z002, D6Z003...
#> lgl  (3): DIVISIONA, REGIONA, STATEA
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 71 × 25
#>    GISJOIN  YEAR STUSAB CMSA  DIVISIONA MSA_CMSAA PMSA      PMSAA REGIONA STATEA
#>    <chr>   <dbl> <chr>  <chr> <lgl>         <dbl> <chr>     <chr> <lgl>   <lgl> 
#>  1 G0080    1990 OH     28    NA             1692 Akron, O… 0080  NA      NA    
#>  2 G0360    1990 CA     49    NA             4472 Anaheim-… 0360  NA      NA    
#>  3 G0440    1990 MI     35    NA             2162 Ann Arbo… 0440  NA      NA    
#>  4 G0620    1990 IL     14    NA             1602 Aurora--… 0620  NA      NA    
#>  5 G0845    1990 PA     78    NA             6282 Beaver C… 0845  NA      NA    
#>  6 G0875    1990 NJ     70    NA             5602 Bergen--… 0875  NA      NA    
#>  7 G1120    1990 MA     07    NA             1122 Boston, … 1120  NA      NA    
#>  8 G1125    1990 CO     34    NA             2082 Boulder-… 1125  NA      NA    
#>  9 G1145    1990 TX     42    NA             3362 Brazoria… 1145  NA      NA    
#> 10 G1160    1990 CT     70    NA             5602 Bridgepo… 1160  NA      NA    
#> # ℹ 61 more rows
#> # ℹ 15 more variables: AREALAND <chr>, AREAWAT <chr>, ANPSADPI <chr>,
#> #   FUNCSTAT <chr>, INTPTLAT <dbl>, INTPTLNG <dbl>, PSADC <dbl>, D6Z001 <dbl>,
#> #   D6Z002 <dbl>, D6Z003 <dbl>, D6Z004 <dbl>, D6Z005 <dbl>, D6Z006 <dbl>,
#> #   D6Z007 <dbl>, D6Z008 <dbl>

# For extracts that contain multiple files, use `file_select` to specify
# a single file to load. This accepts a tidyselect expression:
read_nhgis(fw_file, file_select = matches("ds239"), verbose = FALSE)
#> # A tibble: 1 × 114
#>   YEAR  STUSAB NATION NATIONA AIHHTLI MEMI  PCI   GEOID NAME_E AJWBE001 AJWBE002
#>   <chr> <chr>  <chr>  <chr>   <chr>   <chr> <chr> <chr> <chr>     <dbl>    <dbl>
#> 1 2014… US     Unite… 1       NA      NA    NA    0100… Unite…   3.23e8   1.59e8
#> # ℹ 103 more variables: AJWBE003 <dbl>, AJWBE004 <dbl>, AJWBE005 <dbl>,
#> #   AJWBE006 <dbl>, AJWBE007 <dbl>, AJWBE008 <dbl>, AJWBE009 <dbl>,
#> #   AJWBE010 <dbl>, AJWBE011 <dbl>, AJWBE012 <dbl>, AJWBE013 <dbl>,
#> #   AJWBE014 <dbl>, AJWBE015 <dbl>, AJWBE016 <dbl>, AJWBE017 <dbl>,
#> #   AJWBE018 <dbl>, AJWBE019 <dbl>, AJWBE020 <dbl>, AJWBE021 <dbl>,
#> #   AJWBE022 <dbl>, AJWBE023 <dbl>, AJWBE024 <dbl>, AJWBE025 <dbl>,
#> #   AJWBE026 <dbl>, AJWBE027 <dbl>, AJWBE028 <dbl>, AJWBE029 <dbl>, …

# Or an index position:
read_nhgis(fw_file, file_select = 2, verbose = FALSE)
#> # A tibble: 84 × 28
#>    GISJOIN STATE         STATEFP STATENH A00AA1790 A00AA1800 A00AA1810 A00AA1820
#>    <chr>   <chr>         <chr>   <chr>       <dbl>     <dbl>     <dbl>     <dbl>
#>  1 G010    Alabama       01      010            NA        NA        NA    127901
#>  2 G020    Alaska        02      020            NA        NA        NA        NA
#>  3 G025    Alaska Terri… NA      025            NA        NA        NA        NA
#>  4 G040    Arizona       04      040            NA        NA        NA        NA
#>  5 G045    Arizona Terr… NA      045            NA        NA        NA        NA
#>  6 G050    Arkansas      05      050            NA        NA        NA        NA
#>  7 G055    Arkansas Ter… NA      055            NA        NA        NA     14273
#>  8 G060    California    06      060            NA        NA        NA        NA
#>  9 G080    Colorado      08      080            NA        NA        NA        NA
#> 10 G085    Colorado Ter… NA      085            NA        NA        NA        NA
#> # ℹ 74 more rows
#> # ℹ 20 more variables: A00AA1830 <dbl>, A00AA1840 <dbl>, A00AA1850 <dbl>,
#> #   A00AA1860 <dbl>, A00AA1870 <dbl>, A00AA1880 <dbl>, A00AA1890 <dbl>,
#> #   A00AA1900 <dbl>, A00AA1910 <dbl>, A00AA1920 <dbl>, A00AA1930 <dbl>,
#> #   A00AA1940 <dbl>, A00AA1950 <dbl>, A00AA1960 <dbl>, A00AA1970 <dbl>,
#> #   A00AA1980 <dbl>, A00AA1990 <dbl>, A00AA2000 <dbl>, A00AA2010 <dbl>,
#> #   A00AA2020 <dbl>

# For CSV files, column types are inferred from the data. You can
# manually specify column types with `col_types`. This may be useful for
# geographic codes, which should typically be interpreted as character values
read_nhgis(csv_file, col_types = list(MSA_CMSAA = "c"), verbose = FALSE)
#> # A tibble: 71 × 25
#>    GISJOIN  YEAR STUSAB CMSA  DIVISIONA MSA_CMSAA PMSA      PMSAA REGIONA STATEA
#>    <chr>   <dbl> <chr>  <chr> <lgl>     <chr>     <chr>     <chr> <lgl>   <lgl> 
#>  1 G0080    1990 OH     28    NA        1692      Akron, O… 0080  NA      NA    
#>  2 G0360    1990 CA     49    NA        4472      Anaheim-… 0360  NA      NA    
#>  3 G0440    1990 MI     35    NA        2162      Ann Arbo… 0440  NA      NA    
#>  4 G0620    1990 IL     14    NA        1602      Aurora--… 0620  NA      NA    
#>  5 G0845    1990 PA     78    NA        6282      Beaver C… 0845  NA      NA    
#>  6 G0875    1990 NJ     70    NA        5602      Bergen--… 0875  NA      NA    
#>  7 G1120    1990 MA     07    NA        1122      Boston, … 1120  NA      NA    
#>  8 G1125    1990 CO     34    NA        2082      Boulder-… 1125  NA      NA    
#>  9 G1145    1990 TX     42    NA        3362      Brazoria… 1145  NA      NA    
#> 10 G1160    1990 CT     70    NA        5602      Bridgepo… 1160  NA      NA    
#> # ℹ 61 more rows
#> # ℹ 15 more variables: AREALAND <chr>, AREAWAT <chr>, ANPSADPI <chr>,
#> #   FUNCSTAT <chr>, INTPTLAT <dbl>, INTPTLNG <dbl>, PSADC <dbl>, D6Z001 <dbl>,
#> #   D6Z002 <dbl>, D6Z003 <dbl>, D6Z004 <dbl>, D6Z005 <dbl>, D6Z006 <dbl>,
#> #   D6Z007 <dbl>, D6Z008 <dbl>

# Fixed-width files are parsed with the correct column positions
# and column types automatically:
read_nhgis(fw_file, file_select = contains("ts"), verbose = FALSE)
#> # A tibble: 84 × 28
#>    GISJOIN STATE         STATEFP STATENH A00AA1790 A00AA1800 A00AA1810 A00AA1820
#>    <chr>   <chr>         <chr>   <chr>       <dbl>     <dbl>     <dbl>     <dbl>
#>  1 G010    Alabama       01      010            NA        NA        NA    127901
#>  2 G020    Alaska        02      020            NA        NA        NA        NA
#>  3 G025    Alaska Terri… NA      025            NA        NA        NA        NA
#>  4 G040    Arizona       04      040            NA        NA        NA        NA
#>  5 G045    Arizona Terr… NA      045            NA        NA        NA        NA
#>  6 G050    Arkansas      05      050            NA        NA        NA        NA
#>  7 G055    Arkansas Ter… NA      055            NA        NA        NA     14273
#>  8 G060    California    06      060            NA        NA        NA        NA
#>  9 G080    Colorado      08      080            NA        NA        NA        NA
#> 10 G085    Colorado Ter… NA      085            NA        NA        NA        NA
#> # ℹ 74 more rows
#> # ℹ 20 more variables: A00AA1830 <dbl>, A00AA1840 <dbl>, A00AA1850 <dbl>,
#> #   A00AA1860 <dbl>, A00AA1870 <dbl>, A00AA1880 <dbl>, A00AA1890 <dbl>,
#> #   A00AA1900 <dbl>, A00AA1910 <dbl>, A00AA1920 <dbl>, A00AA1930 <dbl>,
#> #   A00AA1940 <dbl>, A00AA1950 <dbl>, A00AA1960 <dbl>, A00AA1970 <dbl>,
#> #   A00AA1980 <dbl>, A00AA1990 <dbl>, A00AA2000 <dbl>, A00AA2010 <dbl>,
#> #   A00AA2020 <dbl>

# You can also read in a subset of the data file:
read_nhgis(
  csv_file,
  n_max = 15,
  vars = c(GISJOIN, YEAR, D6Z002),
  verbose = FALSE
)
#> # A tibble: 15 × 3
#>    GISJOIN  YEAR D6Z002
#>    <chr>   <dbl>  <dbl>
#>  1 G0080    1990  11593
#>  2 G0360    1990  95737
#>  3 G0440    1990   8988
#>  4 G0620    1990   8982
#>  5 G0845    1990   1814
#>  6 G0875    1990  20476
#>  7 G1120    1990  58143
#>  8 G1125    1990   9467
#>  9 G1145    1990   6774
#> 10 G1160    1990   9710
#> 11 G1170    1990   3209
#> 12 G1200    1990   3551
#> 13 G1280    1990  12072
#> 14 G1600    1990 111582
#> 15 G1640    1990  37225