Read a csv or fixed-width (.dat) file downloaded from the NHGIS extract system.
To read spatial data from an NHGIS extract, use read_ipums_sf().
Usage
read_nhgis(
  data_file,
  file_select = NULL,
  vars = NULL,
  col_types = NULL,
  n_max = Inf,
  guess_max = min(n_max, 1000),
  do_file = NULL,
  var_attrs = c("val_labels", "var_label", "var_desc"),
  remove_extra_header = TRUE,
  verbose = TRUE,
  data_layer = deprecated()
)
Arguments
- data_file
Path to a .zip archive containing an NHGIS extract or a single file from an NHGIS extract.
- file_select
If data_file is a .zip archive that contains multiple files, an expression identifying the file to load. Accepts a character vector specifying the file name, a tidyselect selection, or an index position. This must uniquely identify a file.
- vars
Names of variables to include in the output. Accepts a vector of names or a tidyselect selection. If NULL, includes all variables in the file.
- col_types
One of NULL, a cols() specification, or a string. If NULL, all column types will be inferred from the values in the first guess_max rows of each column. Alternatively, you can use a compact string representation to specify column types:
c = character
i = integer
n = number
d = double
l = logical
f = factor
D = date
T = date time
t = time
? = guess
_ or - = skip
See read_delim() for more details.
- n_max
Maximum number of lines to read.
- guess_max
For .csv files, maximum number of lines to use for guessing column types. Will never use more than the number of lines read.
- do_file
For fixed-width files, path to the .do file associated with the provided data_file. The .do file contains the parsing instructions for the data file.
By default, looks in the same path as data_file for a .do file with the same name. See Details section below.
- var_attrs
Variable attributes to add from the codebook (.txt) file included in the extract. Defaults to all available attributes.
See set_ipums_var_attributes() for more details.
- remove_extra_header
If TRUE, remove the additional descriptive header row included in some NHGIS .csv files.
This header row is not usually needed as it contains similar information to that included in the "label" attribute of each data column (if var_attrs includes "var_label").
- verbose
Logical controlling whether to display output when loading data. If TRUE, displays IPUMS conditions, a progress bar, and column types. Otherwise, all are suppressed.
Will be overridden by readr.show_progress and readr.show_col_types options, if they are set.
- data_layer
Deprecated. Use file_select instead.
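The guess_max argument limits how many rows are inspected when column types are inferred. The following is a minimal base R sketch of why that matters. It uses utils::type.convert() on hypothetical values purely as an illustration and is not the inference routine that read_nhgis() relies on.

```r
# Hypothetical column values; the last one is not numeric.
values <- c("1692", "4472", "2162", "N/A*")

# Guess from only the first three values: the column looks like integers.
guess_from_subset <- class(type.convert(values[1:3], as.is = TRUE))

# Guess from every value: the column has to stay character.
guess_from_all <- class(type.convert(values, as.is = TRUE))

print(c(subset = guess_from_subset, all = guess_from_all))
```

If a guessed type does not fit later rows, supplying col_types explicitly avoids depending on the guess.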
Value
A tibble containing the data found in data_file.
Details
The .do file that is included when downloading an NHGIS fixed-width extract contains the necessary metadata (e.g. column positions and implicit decimals) to correctly parse the data file. read_nhgis() uses this information to parse and recode the fixed-width data appropriately.
If you no longer have access to the .do file, consider resubmitting the extract that produced the data. You can also change the desired data format to produce a .csv file, which does not require additional metadata files to be loaded.
For more about resubmitting an existing extract via the IPUMS API, see vignette("ipums-api", package = "ipumsr").
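To make the role of column positions and implicit decimals concrete, here is a minimal base R sketch. The record layout, column names, and number of decimals are hypothetical, and the code uses utils::read.fwf() rather than anything from ipumsr, so it only illustrates the kind of metadata a .do file carries, not how read_nhgis() is implemented.

```r
# Hypothetical fixed-width records: a 4-character code, then a 6-digit value
# stored with two implicit decimals (so "012345" stands for 123.45).
lines <- c("G010012345", "G020000250")

# The widths play the role of the column positions given in a .do file.
con <- textConnection(lines)
raw <- read.fwf(
  con,
  widths = c(4, 6),
  col.names = c("CODE", "VALUE"),
  colClasses = c("character", "numeric")
)
close(con)

# Apply the implicit decimals: shift the decimal point two places left.
implicit_decimals <- 2
parsed <- transform(raw, VALUE = VALUE / 10^implicit_decimals)

print(parsed)
```

Without the widths and the decimal count, the same text could not be split or scaled correctly, which is why the .do file is needed for fixed-width extracts.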
See also
read_ipums_sf() to read spatial data from an IPUMS extract.
read_nhgis_codebook() to read metadata about an IPUMS NHGIS extract.
ipums_list_files() to list files in an IPUMS extract.
Examples
# Example files
csv_file <- ipums_example("nhgis0972_csv.zip")
fw_file <- ipums_example("nhgis0730_fixed.zip")
# Provide the .zip archive directly to load the data inside:
read_nhgis(csv_file)
#> Use of data from NHGIS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
#> Rows: 71 Columns: 25
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (9): GISJOIN, STUSAB, CMSA, PMSA, PMSAA, AREALAND, AREAWAT, ANPSADPI, F...
#> dbl (13): YEAR, MSA_CMSAA, INTPTLAT, INTPTLNG, PSADC, D6Z001, D6Z002, D6Z003...
#> lgl (3): DIVISIONA, REGIONA, STATEA
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 71 × 25
#> GISJOIN YEAR STUSAB CMSA DIVISIONA MSA_CMSAA PMSA PMSAA REGIONA STATEA
#> <chr> <dbl> <chr> <chr> <lgl> <dbl> <chr> <chr> <lgl> <lgl>
#> 1 G0080 1990 OH 28 NA 1692 Akron, O… 0080 NA NA
#> 2 G0360 1990 CA 49 NA 4472 Anaheim-… 0360 NA NA
#> 3 G0440 1990 MI 35 NA 2162 Ann Arbo… 0440 NA NA
#> 4 G0620 1990 IL 14 NA 1602 Aurora--… 0620 NA NA
#> 5 G0845 1990 PA 78 NA 6282 Beaver C… 0845 NA NA
#> 6 G0875 1990 NJ 70 NA 5602 Bergen--… 0875 NA NA
#> 7 G1120 1990 MA 07 NA 1122 Boston, … 1120 NA NA
#> 8 G1125 1990 CO 34 NA 2082 Boulder-… 1125 NA NA
#> 9 G1145 1990 TX 42 NA 3362 Brazoria… 1145 NA NA
#> 10 G1160 1990 CT 70 NA 5602 Bridgepo… 1160 NA NA
#> # ℹ 61 more rows
#> # ℹ 15 more variables: AREALAND <chr>, AREAWAT <chr>, ANPSADPI <chr>,
#> # FUNCSTAT <chr>, INTPTLAT <dbl>, INTPTLNG <dbl>, PSADC <dbl>, D6Z001 <dbl>,
#> # D6Z002 <dbl>, D6Z003 <dbl>, D6Z004 <dbl>, D6Z005 <dbl>, D6Z006 <dbl>,
#> # D6Z007 <dbl>, D6Z008 <dbl>
# For extracts that contain multiple files, use `file_select` to specify
# a single file to load. This accepts a tidyselect expression:
read_nhgis(fw_file, file_select = matches("ds239"), verbose = FALSE)
#> # A tibble: 1 × 114
#> YEAR STUSAB NATION NATIONA AIHHTLI MEMI PCI GEOID NAME_E AJWBE001 AJWBE002
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 2014… US Unite… 1 NA NA NA 0100… Unite… 3.23e8 1.59e8
#> # ℹ 103 more variables: AJWBE003 <dbl>, AJWBE004 <dbl>, AJWBE005 <dbl>,
#> # AJWBE006 <dbl>, AJWBE007 <dbl>, AJWBE008 <dbl>, AJWBE009 <dbl>,
#> # AJWBE010 <dbl>, AJWBE011 <dbl>, AJWBE012 <dbl>, AJWBE013 <dbl>,
#> # AJWBE014 <dbl>, AJWBE015 <dbl>, AJWBE016 <dbl>, AJWBE017 <dbl>,
#> # AJWBE018 <dbl>, AJWBE019 <dbl>, AJWBE020 <dbl>, AJWBE021 <dbl>,
#> # AJWBE022 <dbl>, AJWBE023 <dbl>, AJWBE024 <dbl>, AJWBE025 <dbl>,
#> # AJWBE026 <dbl>, AJWBE027 <dbl>, AJWBE028 <dbl>, AJWBE029 <dbl>, …
# Or an index position:
read_nhgis(fw_file, file_select = 2, verbose = FALSE)
#> # A tibble: 84 × 28
#> GISJOIN STATE STATEFP STATENH A00AA1790 A00AA1800 A00AA1810 A00AA1820
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 G010 Alabama 01 010 NA NA NA 127901
#> 2 G020 Alaska 02 020 NA NA NA NA
#> 3 G025 Alaska Terri… NA 025 NA NA NA NA
#> 4 G040 Arizona 04 040 NA NA NA NA
#> 5 G045 Arizona Terr… NA 045 NA NA NA NA
#> 6 G050 Arkansas 05 050 NA NA NA NA
#> 7 G055 Arkansas Ter… NA 055 NA NA NA 14273
#> 8 G060 California 06 060 NA NA NA NA
#> 9 G080 Colorado 08 080 NA NA NA NA
#> 10 G085 Colorado Ter… NA 085 NA NA NA NA
#> # ℹ 74 more rows
#> # ℹ 20 more variables: A00AA1830 <dbl>, A00AA1840 <dbl>, A00AA1850 <dbl>,
#> # A00AA1860 <dbl>, A00AA1870 <dbl>, A00AA1880 <dbl>, A00AA1890 <dbl>,
#> # A00AA1900 <dbl>, A00AA1910 <dbl>, A00AA1920 <dbl>, A00AA1930 <dbl>,
#> # A00AA1940 <dbl>, A00AA1950 <dbl>, A00AA1960 <dbl>, A00AA1970 <dbl>,
#> # A00AA1980 <dbl>, A00AA1990 <dbl>, A00AA2000 <dbl>, A00AA2010 <dbl>,
#> # A00AA2020 <dbl>
# For CSV files, column types are inferred from the data. You can
# manually specify column types with `col_types`. This may be useful for
# geographic codes, which should typically be interpreted as character values:
read_nhgis(csv_file, col_types = list(MSA_CMSAA = "c"), verbose = FALSE)
#> # A tibble: 71 × 25
#> GISJOIN YEAR STUSAB CMSA DIVISIONA MSA_CMSAA PMSA PMSAA REGIONA STATEA
#> <chr> <dbl> <chr> <chr> <lgl> <chr> <chr> <chr> <lgl> <lgl>
#> 1 G0080 1990 OH 28 NA 1692 Akron, O… 0080 NA NA
#> 2 G0360 1990 CA 49 NA 4472 Anaheim-… 0360 NA NA
#> 3 G0440 1990 MI 35 NA 2162 Ann Arbo… 0440 NA NA
#> 4 G0620 1990 IL 14 NA 1602 Aurora--… 0620 NA NA
#> 5 G0845 1990 PA 78 NA 6282 Beaver C… 0845 NA NA
#> 6 G0875 1990 NJ 70 NA 5602 Bergen--… 0875 NA NA
#> 7 G1120 1990 MA 07 NA 1122 Boston, … 1120 NA NA
#> 8 G1125 1990 CO 34 NA 2082 Boulder-… 1125 NA NA
#> 9 G1145 1990 TX 42 NA 3362 Brazoria… 1145 NA NA
#> 10 G1160 1990 CT 70 NA 5602 Bridgepo… 1160 NA NA
#> # ℹ 61 more rows
#> # ℹ 15 more variables: AREALAND <chr>, AREAWAT <chr>, ANPSADPI <chr>,
#> # FUNCSTAT <chr>, INTPTLAT <dbl>, INTPTLNG <dbl>, PSADC <dbl>, D6Z001 <dbl>,
#> # D6Z002 <dbl>, D6Z003 <dbl>, D6Z004 <dbl>, D6Z005 <dbl>, D6Z006 <dbl>,
#> # D6Z007 <dbl>, D6Z008 <dbl>
# Fixed-width files are parsed with the correct column positions
# and column types automatically:
read_nhgis(fw_file, file_select = contains("ts"), verbose = FALSE)
#> # A tibble: 84 × 28
#> GISJOIN STATE STATEFP STATENH A00AA1790 A00AA1800 A00AA1810 A00AA1820
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 G010 Alabama 01 010 NA NA NA 127901
#> 2 G020 Alaska 02 020 NA NA NA NA
#> 3 G025 Alaska Terri… NA 025 NA NA NA NA
#> 4 G040 Arizona 04 040 NA NA NA NA
#> 5 G045 Arizona Terr… NA 045 NA NA NA NA
#> 6 G050 Arkansas 05 050 NA NA NA NA
#> 7 G055 Arkansas Ter… NA 055 NA NA NA 14273
#> 8 G060 California 06 060 NA NA NA NA
#> 9 G080 Colorado 08 080 NA NA NA NA
#> 10 G085 Colorado Ter… NA 085 NA NA NA NA
#> # ℹ 74 more rows
#> # ℹ 20 more variables: A00AA1830 <dbl>, A00AA1840 <dbl>, A00AA1850 <dbl>,
#> # A00AA1860 <dbl>, A00AA1870 <dbl>, A00AA1880 <dbl>, A00AA1890 <dbl>,
#> # A00AA1900 <dbl>, A00AA1910 <dbl>, A00AA1920 <dbl>, A00AA1930 <dbl>,
#> # A00AA1940 <dbl>, A00AA1950 <dbl>, A00AA1960 <dbl>, A00AA1970 <dbl>,
#> # A00AA1980 <dbl>, A00AA1990 <dbl>, A00AA2000 <dbl>, A00AA2010 <dbl>,
#> # A00AA2020 <dbl>
# You can also read in a subset of the data file:
read_nhgis(
  csv_file,
  n_max = 15,
  vars = c(GISJOIN, YEAR, D6Z002),
  verbose = FALSE
)
#> # A tibble: 15 × 3
#> GISJOIN YEAR D6Z002
#> <chr> <dbl> <dbl>
#> 1 G0080 1990 11593
#> 2 G0360 1990 95737
#> 3 G0440 1990 8988
#> 4 G0620 1990 8982
#> 5 G0845 1990 1814
#> 6 G0875 1990 20476
#> 7 G1120 1990 58143
#> 8 G1125 1990 9467
#> 9 G1145 1990 6774
#> 10 G1160 1990 9710
#> 11 G1170 1990 3209
#> 12 G1200 1990 3551
#> 13 G1280 1990 12072
#> 14 G1600 1990 111582
#> 15 G1640 1990 37225
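The col_types example above notes that geographic codes are usually best kept as character values. A small base R sketch of the reason follows. It uses hypothetical CSV text and utils::read.csv() rather than read_nhgis(), so the exact inference behavior shown is that of base R, not of ipumsr or readr.

```r
# Hypothetical CSV text with a code column that has leading zeros.
csv_text <- "PMSAA,POP\n0080,11593\n0360,95737"

# Base R's default type inference treats the codes as numbers,
# which drops the leading zeros.
guessed <- read.csv(text = csv_text)

# Forcing the code column to character keeps the codes intact.
as_char <- read.csv(text = csv_text, colClasses = c(PMSAA = "character"))

print(guessed$PMSAA)
print(as_char$PMSAA)
```

Codes stored as numbers can no longer be matched reliably against identifiers in other files, so an explicit character type is the safer choice whenever a column holds codes rather than quantities.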