Slightly modified implementation of tidyselect selection language in ipumsr.
Syntax
In general, the selection language in ipumsr operates the same as in tidyselect.
Where applicable, variables can be selected with:
- A character vector of variable names (c("var1", "var2"))
- A bare vector of variable names (c(var1, var2))
- A selection helper from tidyselect (starts_with("var")). See below for a list of helpers.
Primary differences
tidyselect selection is generally intended for use with column variables in data.frame-like objects. In contrast, ipumsr allows selection language syntax in other cases as well (for instance, when selecting files from within a .zip archive). ipumsr functions will indicate whether they support the selection language.
Selection with where() is not consistently supported.
Selection helpers (from tidyselect)
- var1:var10: variables lying between var1 on the left and var10 on the right.
- starts_with("a"): names that start with "a"
- ends_with("z"): names that end with "z"
- contains("b"): names that contain "b"
- matches("x.y"): names that match the regular expression x.y
- num_range("x", 1:4): names following the pattern x1, x2, ..., x4
- all_of(vars)/any_of(vars): matches names stored in the character vector vars. all_of(vars) will error if any of the variables aren't present; any_of(vars) will match just the variables that exist.
- everything(): all variables
- last_col(): furthest column to the right
Operators for combining those selections:
- !selection: only variables that don't match selection
- selection1 & selection2: only variables included in both selection1 and selection2
- selection1 | selection2: all variables that match either selection1 or selection2
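As a minimal sketch of how the helpers and operators listed above behave, the following applies them with tidyselect directly to a small data frame. The data frame df, its column names, and the selected() wrapper are hypothetical and exist only for this illustration. ipumsr's own handling may differ slightly from plain tidyselect, so this is not a description of ipumsr's implementation.

```r
library(tidyselect)

# Hypothetical data frame; only the column names matter here
df <- data.frame(YEAR = 1, MONTH = 2, INCTOT = 3, INCWAGE = 4, ASECWT = 5)

# Illustrative wrapper (not part of ipumsr or tidyselect):
# returns the names of the columns matched by a quoted selection
selected <- function(expr) names(eval_select(expr, df))

# Negation: every column that does not start with "INC"
not_inc <- selected(quote(!starts_with("INC")))

# Union: columns that start with "INC" or end with "WT"
inc_or_wt <- selected(quote(starts_with("INC") | ends_with("WT")))

# Intersection: columns that start with "INC" and end with "TOT"
inc_and_tot <- selected(quote(starts_with("INC") & ends_with("TOT")))

# any_of() silently skips names that are not present ...
lenient <- selected(quote(any_of(c("YEAR", "NOT_A_COLUMN"))))

# ... whereas all_of() raises an error for the same input
strict <- tryCatch(
  selected(quote(all_of(c("YEAR", "NOT_A_COLUMN")))),
  error = function(e) "error"
)
```

Here not_inc holds YEAR, MONTH and ASECWT, while lenient holds only YEAR and strict records that all_of() signalled an error.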
Examples
cps_file <- ipums_example("cps_00157.xml")
# Load 3 variables by name
read_ipums_micro(
  cps_file,
  vars = c("YEAR", "MONTH", "PERNUM"),
  verbose = FALSE
)
#> # A tibble: 7,668 × 3
#> YEAR MONTH PERNUM
#> <dbl> <int+lbl> <dbl>
#> 1 1962 3 [March] 1
#> 2 1962 3 [March] 2
#> 3 1962 3 [March] 3
#> 4 1962 3 [March] 1
#> 5 1962 3 [March] 1
#> 6 1962 3 [March] 1
#> 7 1962 3 [March] 1
#> 8 1962 3 [March] 2
#> 9 1962 3 [March] 3
#> 10 1962 3 [March] 4
#> # ℹ 7,658 more rows
# "Bare" variables are supported
read_ipums_micro(
  cps_file,
  vars = c(YEAR, MONTH, PERNUM),
  verbose = FALSE
)
#> # A tibble: 7,668 × 3
#> YEAR MONTH PERNUM
#> <dbl> <int+lbl> <dbl>
#> 1 1962 3 [March] 1
#> 2 1962 3 [March] 2
#> 3 1962 3 [March] 3
#> 4 1962 3 [March] 1
#> 5 1962 3 [March] 1
#> 6 1962 3 [March] 1
#> 7 1962 3 [March] 1
#> 8 1962 3 [March] 2
#> 9 1962 3 [March] 3
#> 10 1962 3 [March] 4
#> # ℹ 7,658 more rows
# Standard tidyselect selectors are also supported
read_ipums_micro(cps_file, vars = starts_with("ASEC"), verbose = FALSE)
#> # A tibble: 7,668 × 2
#> ASECWTH ASECWT
#> <dbl> <dbl>
#> 1 1476. 1476.
#> 2 1476. 1471.
#> 3 1476. 1579.
#> 4 1598. 1598.
#> 5 1707. 1707.
#> 6 1790. 1790.
#> 7 4355. 4355.
#> 8 4355. 1386.
#> 9 4355. 1629.
#> 10 4355. 1432.
#> # ℹ 7,658 more rows
# Selection methods can be combined
read_ipums_micro(
  cps_file,
  vars = c(YEAR, MONTH, contains("INC")),
  verbose = FALSE
)
#> # A tibble: 7,668 × 3
#> YEAR MONTH INCTOT
#> <dbl> <int+lbl> <dbl+lbl>
#> 1 1962 3 [March] 4883
#> 2 1962 3 [March] 5800
#> 3 1962 3 [March] 999999998 [Missing. (1962-1964 only)]
#> 4 1962 3 [March] 14015
#> 5 1962 3 [March] 16552
#> 6 1962 3 [March] 6375
#> 7 1962 3 [March] 999999999 [N.I.U.]
#> 8 1962 3 [March] 0
#> 9 1962 3 [March] 600
#> 10 1962 3 [March] 999999999 [N.I.U.]
#> # ℹ 7,658 more rows
# Operators can combine helpers, e.g. names that start with "S" and end with "P"
read_ipums_micro(
  cps_file,
  vars = starts_with("S") & ends_with("P"),
  verbose = FALSE
)
#> # A tibble: 7,668 × 1
#> STATEFIP
#> <int+lbl>
#> 1 55 [Wisconsin]
#> 2 55 [Wisconsin]
#> 3 55 [Wisconsin]
#> 4 27 [Minnesota]
#> 5 27 [Minnesota]
#> 6 27 [Minnesota]
#> 7 19 [Iowa]
#> 8 19 [Iowa]
#> 9 19 [Iowa]
#> 10 19 [Iowa]
#> # ℹ 7,658 more rows
# Other selection arguments also support this syntax.
# For instance, load a particular file based on a tidyselect match:
read_nhgis(
  ipums_example("nhgis0731_csv.zip"),
  file_select = contains("nominal_state"),
  verbose = FALSE
)
#> # A tibble: 84 × 28
#> GISJOIN STATE STATEFP STATENH A00AA1790 A00AA1800 A00AA1810 A00AA1820
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 G010 Alabama 01 010 NA NA NA 127901
#> 2 G020 Alaska 02 020 NA NA NA NA
#> 3 G025 Alaska Terri… NA 025 NA NA NA NA
#> 4 G040 Arizona 04 040 NA NA NA NA
#> 5 G045 Arizona Terr… NA 045 NA NA NA NA
#> 6 G050 Arkansas 05 050 NA NA NA NA
#> 7 G055 Arkansas Ter… NA 055 NA NA NA 14273
#> 8 G060 California 06 060 NA NA NA NA
#> 9 G080 Colorado 08 080 NA NA NA NA
#> 10 G085 Colorado Ter… NA 085 NA NA NA NA
#> # ℹ 74 more rows
#> # ℹ 20 more variables: A00AA1830 <dbl>, A00AA1840 <dbl>, A00AA1850 <dbl>,
#> # A00AA1860 <dbl>, A00AA1870 <dbl>, A00AA1880 <dbl>, A00AA1890 <dbl>,
#> # A00AA1900 <dbl>, A00AA1910 <dbl>, A00AA1920 <dbl>, A00AA1930 <dbl>,
#> # A00AA1940 <dbl>, A00AA1950 <dbl>, A00AA1960 <dbl>, A00AA1970 <dbl>,
#> # A00AA1980 <dbl>, A00AA1990 <dbl>, A00AA2000 <dbl>, A00AA2010 <dbl>,
#> # A00AA2020 <dbl>