Read part one first.
Last time, I discussed the Fixed Length Record (FLR) format and showed how to use the ‘hflr’ Ruby gem to read in hierarchical data in FLR format and produce Ruby structs. In this post I’ll demonstrate how to combine ‘hflr’ with a simple importer class to load a database with the FLR data. Then I’ll show you how to get some real-world FLR data to import. Finally, I’ll discuss specific issues that come up when importing this data, and explore the IPUMS microdata just a bit.
Efficiently Import Data
One could simply write code such as the following to import small numbers of records into a database:
However, performance will degrade unacceptably as the size of the import grows to a few thousand records.
What’s needed is a way to avoid: (1) Instantiating all records as ActiveRecord objects, and (2) Issuing an insert statement to the database for each of those records. The ‘activerecord-import’ gem groups the inserts and, optionally, avoids instantiation into ActiveRecord objects altogether. In essence, activerecord-import allows you to pass in a large set of data as arrays or hashes or AR objects and do one insert for the lot of them. Doing this improves import time by around 50x. The speed-up will depend on the database type, hardware, and validations (they can be enabled or disabled.) Passing data to activerecord-import as arrays or hashes is substantially faster than passing ActiveRecord objects.
Using activerecord-import, Your code would look like
The ‘import’ method can take an array of arrays, and a ‘fields’ argument giving the order of fields in each inner array. Alternatively ‘import’ may take an array of hashes, removing the necessity for the ‘fields’ argument as the names of fields and their values are linked in the hashes.
You may instead pass in already instantiated Active Record models, but this will be slower than the first two options.
We can package up the buffering logic shown in the code snippets above, along with a way to manage the various options for passing data to ‘import’. In the file ‘importer.rb’ you can find the full source. flrdata example code on github.com.
In the Git repository there’s an ‘input_data’ directory with a sample data file ‘usa_00001.dat’. This is a slice of a much larger data file created with the IPUMS-USA data extraction service. The dataset will let us study the housing boom, bust and recovery. The data comes from the American Community Survey, covering the years 2001 to 2013. We have both household records and person records in this dataset. We can create some interesting tables using just the household records, but some tabulations will require the people associated with those households.
The example tabulations that will follow later on were done with the full dataset. The slice of data provided in the example source is only for testing the import code.
[To learn more about the IPUMS and to recreate the full 3.3 GB dataset in our example, See usa.ipums.org. You may browse the system and begin requesting data without signing in, but to get your data you must create a free account.]
The ‘codebook.txt’ and ‘usa_0001.sps’ files were created by the data extract system and were downloaded with the data file. The ‘codebook.txt’ is a human readable description of the data file, and the .sps file is for the SPSS statistics analysis software to read, to help you to use the data in SPSS. The extract system also produces files for SAS and Stata. For extracts with only one record type, the extraction service offers CSV and native binary formats of all three stats applications.
To figure out what to pass to the FLRFile class initializer, you’d look at the code book or possibly the SPSS file supplied with the dataset:
This is fairly convenient, and if you regularly needed to import IPUMS data you might write a simple script to parse this sort of code book. Whatever the source of your FLR data you’ll need a description with comparable information.
In our example, we’re loading data into a database, so we need to define database tables with corresponding columns. See the ‘schema.rb’ for the translation of the above SPSS code into an Active Record migration script like one found in a Rails project. Again, you’d probably write a migration generator script if importing extracts was a frequent task.
As you can see from the ‘schema.rb’ migrations we’re making two tables, corresponding to the two record types in the data file ‘usa_00001.dat’: Household and person. We’ll use Rails conventions and call these tables ‘households’ and ‘people’.
Since the data come with keys to link the two tables together, we’ll make sure to index the two columns holding those keys. The ‘households’ table has a column ‘serial’ which will get the value of the SERIAL variable – it’s a unique household serial number. Within one dataset SERIAL is unique; since we’re extracting data across thirteen datasets, we’ll need to use ‘ACSYR’ (the year) to differentiate records. The ‘people’ table has a ‘serialp’ column that will hold the SERIALP variable; SERIALP has the value of the household serial number for the household the person record belongs to. Again, we’d need the ACSYR (which is on the people table as well) to differentiate between datasets within the ‘people’ table. The tables already have the Rails standard ‘id’ columns as unique keys. A nice feature for the importer would be to add a ‘household_id’ column to the people table and copy the value of ‘id’ from the correct household record, so we wouldn’t need to use composite keys in our joins. (I’ve done this in production code but it’s rather specialized and gets in the way of the example code.)
To read the data we use the ‘hflr’ gem as in the above example code and in the previous post. We need to provide a layout to ‘hflr’. To see the HFLR layout data, look in ‘extract_layout.rb’.
To run the example, just do
# ruby import.rb . You may use JRuby instead. The required gems are given at the top of the ‘import.rb’ file.
The import process should take a few seconds on the data included with the example code. Before importing data, the script executes the migrations in the ‘schema.rb’ file, rebuilding the database schema from scratch. The full dataset referenced in the ‘codebook.txt’ file has approximately thirty-one million person records and fourteen million household records. It will take between one and five hours to import depending on your storage hardware, CPU and Ruby version.
Once the import has finished we should do a few simple queries to verify that all records got imported. To make effective use of this data you’d need to read documentation at usa.ipums.org. You can begin exploring the data just with the variable category labels included in the codebook, however. For now, you just need to know that each household record has a weight variable and each person record has a person weight variable. The value of these variables, in the ‘households.HHWT’ and ‘people.PERWT’ columns respectively, indicates the number of people or households the record represents from the larger population, as this dataset is only a sample of the entire United States population. Divide the weight values by 100 (the two least significant digits are fractional amounts) to get a useful whole number version of the weight. For instance, the value 10100 really means “this record represents 101.00 records like it.”
So, to see how many people the U.S. Census estimated were in the country in 2013, you would do:
And we can check the population growth:
From this result we know that not only were the right number of person records imported but they were records with correct weights, indicating that the data in those records is probably all correct. Of course we’d have to check every column to be absolutely sure. As a final check we can join the people with their household data to get geography and housing information for the state of Arizona (STATEICP=61) for every year in the data. The OWNERSHP variable will distinguish between people living in rented households (2) and owner occupied (1). OWNERSHP=0 is for other living situations like prisons, military bases, dormitories, etc.
This all looks quite believable. It’s interesting to notice how the ratio of people living in rentals to those living in owner-occupied households changed over the boom and bust cycle of the housing market. There are a few more tables included in the example source repository that explore the topic further.
More Efficient Analysis
The example pushes Sqlite3 to its limits; if you recreated my extract in full you’d need to wait an hour or two just to import it. Some of the queries require five or more minutes to run, even though the necessary indexes are present. (This shows why analyzing hierarchical data in a relational database can be challenging.)
The first option, clearly, is to use the same import script with a full fledged database server on the back-end. MySql or Postgres will not import much more quickly, but the queries will execute faster in many cases, especially if your server has a lot of memory and it’s tuned correctly. I’ve provided a ‘pivot()’ and ‘pivoted_html_table()’ function because Sqlite3 doesn’t support pivoting result sets. You won’t need those helpers with a full featured database.
If you want to run frequent ad-hoc queries on this data, you might find that even MySql, Postgres or other databases are painfully slow, especially when joining the two tables. Column-store databases will allow for very fast tabulations across large tables when the tabulation involves limited numbers of columns. You could use a script similar to the example to load a column oriented database such as InfiniDB. Check InfiniDB at Maria DB to download it. Or try InfoBright.
There’s a plug-in for Postgres which may not perform quite as well as the above two options but integrates with a standard Postgres install and is super convenient. See Citus Data for a quick introduction and installation instructions.
Alternatives to Databases
Consider looking into the iPython / iRuby projects. iPython provides tools to analyze data locally and has visualization tools. iRuby is built on the same libraries as iPython but lets you script your work in Ruby.
Get the iRuby project source and documentation from iRuby on github.com. You can also install iRuby with
# gem install iruby . You should look into the JuPytR project as well (Julia, Python, Ruby). There are now many other languages supported as well as these three. Get it on the iPython page.
When using iRuby or JuPyter you’d find the HFLR gem useful in formatting your hierarchical fixed-length data into sets of CSV files for easy consumption. You may run into size limitations, but not with even the full-sized example extract in my example code.
Colin Davis Code · Infrastructure