Keeping it Simple: Exploiting CSV and csvkit at the MPC

How we use csvkit to wrangle data around here.

In a job interview long ago I was asked what gave me satisfaction at work. I thought about it for a short time and finally answered with the very brief “I like to solve puzzles,” which I think was one of the best answers I’ve given in an interview (I got that job), and my answer still holds true today. I could also probably write a very entertaining article about all of the really bad rambling interview answers I’ve given. I could, but I won’t.

One of the things I love about working at the Minnesota Population Center (MPC) is that I’m presented with so many interesting puzzles to solve, plus tons of intellectual freedom to solve them in inventive ways. In this article I’m going to write about one of the puzzle-solving journeys I’m on with regard to data processing here.

A lot of my activities at MPC revolve around a common programming task:

“See these data over here? We need to turn them into those data over there, but in a specific format for ingestion required by that technology.”

That task might require speed, efficiency, automation, ease of use, or some combination thereof. In this post I’m going to discuss how I’ve been driving a refactoring of our processes to use CSV instead of XML as a data exchange format, and introduce a set of utilities, csvkit, focused on the last of those requirements: ease of use. I’ll also describe how exploring ways to make a process easier can open doors to performance gains elsewhere, both by surfacing alternative approaches and by flushing inefficient legacy code out of the processing flow.

Description of IPUMS Data and Metadata

Many of the products from MPC fall inside the Integrated Public Use Microdata Series (IPUMS) framework. Internally these products follow nearly identical workflows. MPC researchers pull in public use data from elsewhere (e.g. the U.S. Census Bureau) and go through several steps to define the metadata and reorganize the microdata into a form that we can then distribute publicly as “harmonized” microdata. This includes organizing the data files into a hierarchical format, typically a household record followed by the person records that reside in that household, with one line per record. If you want to know more about the data specifics, we’ve got a blog post that goes into further detail on how we process and deliver IPUMS data at MPC.

An IPUMS microdata file for a given sample is stored as a large fixed-width file of coded variable information. To give meaning to those coded variables, the file must be accompanied by metadata describing each variable’s location and meaning; this is the job of “data dictionaries”. Data dictionaries are initially authored by MPC demographic researchers as Excel spreadsheets. Downstream data processing, working toward the eventual public dissemination of IPUMS data, depends on these two types of files.
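
To make “location and meaning” concrete, here is a minimal Python sketch of how a dictionary’s column locations are used to slice a fixed-width record. The three locations are real ones from the CPS example later in this post; the record line itself is made up for illustration:

# Column locations from a data dictionary: (name, start column, width).
layout = [('rectype', 1, 1), ('hhNum', 2, 6), ('perNum', 8, 2)]

def parse_record(line):
    # Data dictionaries use 1-based columns; Python strings are 0-based.
    return {name: line[start - 1:start - 1 + width]
            for name, start, width in layout}

print(parse_record('H00000100'))
# {'rectype': 'H', 'hhNum': '000001', 'perNum': '00'}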

There are also high-level spreadsheets, known as control files, that describe product-wide variables, location of data files and data dictionaries, product-specific variable naming conventions, and so on.

Stop using XML where CSV will cover all the bases

Excel is not an ideal environment for dealing with data downstream, so much of what is compiled there has historically been shipped out in XML format (via VBA macros) for ingestion by our Data Conversion Program (DCP) and onward to the website. In the case of control files, I thought XML was unnecessarily heavy for what we were using those outputs for, and I found it trivial to amend the macros to also output comma-separated values (CSV). In fact, the CSV export code turned out to be far simpler than the existing XML export code.

Now that we have .csv as well as .xml control file exports, IT core developer Colin Davis has been able to rework the mission-critical and processor-intensive DCP to parse and use CSV instead of XML. He has reported that for some larger sample files this has cut DCP’s memory footprint by a factor of 30. That’s a really important improvement made with a fairly small refactoring of code. Very gratifying!
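
One plausible mechanism behind a gain that large (a hedged illustration, not DCP’s actual code, and the file name here is hypothetical): a CSV parser can stream one row at a time, while a DOM-style XML parse materializes the whole document tree before you can touch any of it.

import csv

# Streaming parse: only one row is ever held in memory at a time.
row_count = 0
with open('control_file.csv') as f:
    for row in csv.reader(f):
        row_count += 1  # handle each row, then let it go
print(row_count)

# Contrast with a DOM-style parse, which builds the entire tree up front:
# import xml.dom.minidom
# doc = xml.dom.minidom.parse('control_file.xml')  # whole file in memory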

csvkit: A Set of Command-Line Utilities To Create and Manipulate CSV Files

Christopher Groskopf of NPR has produced a set of command-line utilities, collectively called csvkit, that offer a great deal of value: to MPC researchers as a way to explore data, and to MPC IT people like me as a quick means of massaging grids of Excel data into CSV for further use downstream. There is an excellent quick-read article in the journalism code blog Source that covers the highlights of csvkit well. I’m going to largely reiterate those highlights here, but with a specific bent towards the particularities of MPC data and metadata.
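
(csvkit is an ordinary Python package, so if you want to follow along, it installs from PyPI with a plain pip install csvkit.)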

Let’s start by looking at a data dictionary spreadsheet. As an example, here is a screenshot of the April 2013 data dictionary from IPUMS-CPS (Current Population Survey):

[Screenshot: the IPUMS-CPS April 2013 data dictionary spreadsheet]

Data dictionary spreadsheets have some complexity to them. Variable information is stored in a hierarchical organization, with one row dedicated to a variable’s specifics, like record type (person or household), column location, and label. Below each of these are rows that describe what the codes found in that variable mean, and the frequency with which those codes appear in the data. The format of MPC data dictionaries is entirely predictable, but not as simple as a typical two-dimensional data grid.

Using csvkit’s insanely handy in2csv utility, I can export the tabular data from a spreadsheet to a CSV file in one command:

benklaas@frak:~/csvkit$ in2csv data_dict_cps2013_04b.xls > data_dict_cps2013_04b.csv
benklaas@frak:~/csvkit$ head -10 data_dict_cps2013_04b.csv
RecordType,Var,Col,Wid,Frm,Value,VarLabel,ValueLabel,VarLabelOrig,ValueLabelOrig,Freq,Sel,Notes,Svar,ValueSvar,VarLabelSvar,ValueLabelSvar,UnivSvar,NoRec,NonTab,Hide,Decim,String,CommP,CodeTy,DDoc1,DTag1,JDoc1,JTag1,DDoc2,DTag2,JDoc2,JTag2
CR,rectype,1,1,a,,Record type,,,,,,,CPS2013_04B_0001,,Record type,,,1,1,1,,,,,,,,,,,,
,,,,,H,,Household,,,72503.0,,,,H,,Household,,,,,,,,,,,,,,,,
,,,,,P,,Person,,,132635.0,,,,P,,Person,,,,,,,,,,,,,,,,
C,hhNum,2,6,,,Household number,,,,skip: 72503,,,CPS2013_04B_0002,,Household number,,,1,1,1,,,,,,,,,,,,
C,perNum,8,2,,,Person number within household,,,,,,,CPS2013_04B_0003,,Person number within household,,,1,1,1,,,,,,,,,,,,
,,,,,,,[no label],,,72503.0,,,,,,[no label],,,,,,,,,,,,,,,,
,,,,,1.0,,[no label],,,53738.0,,,,1.0,,[no label],,,,,,,,,,,,,,,,
,,,,,2.0,,[no label],,,38688.0,,,,2.0,,[no label],,,,,,,,,,,,,,,,
,,,,,3.0,,[no label],,,20124.0,,,,3.0,,[no label],,,,,,,,,,,,,,,,
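
That one export already makes the hierarchical layout described above easy to work with programmatically. As a hedged Python sketch (using the Var, Value, ValueLabel, and Freq headings visible above), each value row, which leaves Var blank, can be attached to the variable row that precedes it:

import csv

# Hedged sketch: group value rows (blank Var column) under the
# variable row that precedes them.
variables = []
with open('data_dict_cps2013_04b.csv') as f:
    for row in csv.DictReader(f):
        if row['Var']:                    # a variable row
            variables.append({'var': row['Var'], 'values': []})
        elif variables and row['Value']:  # a value row beneath it
            variables[-1]['values'].append(
                (row['Value'], row['ValueLabel'], row['Freq']))

print(variables[0])
# e.g. {'var': 'rectype', 'values': [('H', 'Household', '72503.0'), ...]}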

Now that we’ve got the data in CSV, we can sic the other csvkit utilities on the file. First, csvcut with the -n flag will give us a rundown of the CSV file’s headings:

benklaas@frak:~/csvkit$ csvcut -n data_dict_cps2013_04b.csv
  1: RecordType
  2: Var
  3: Col
  4: Wid
  5: Frm
  6: Value
  7: VarLabel
  8: ValueLabel
  9: VarLabelOrig
 10: ValueLabelOrig
 11: Freq
 12: Sel
 13: Notes
 14: Svar
 15: ValueSvar
 16: VarLabelSvar
 17: ValueLabelSvar
 18: UnivSvar
 19: NoRec
 20: NonTab
 21: Hide
 22: Decim
 23: String
 24: CommP
 25: CodeTy
 26: DDoc1
 27: DTag1
 28: JDoc1
 29: JTag1
 30: DDoc2
 31: DTag2
 32: JDoc2
 33: JTag2
benklaas@frak:~/csvkit$

Learning from that output, we can use csvcut with the -c flag, specifying particular columns by name, to “scalpel out” just the ones we’re interested in:

benklaas@frak:~/csvkit$ csvcut -c RecordType,Var,Col,Wid data_dict_cps2013_04b.csv
RecordType,Var,Col,Wid
CR,rectype,1,1
,,,
,,,
C,hhNum,2,6
C,perNum,8,2
,,,
,,,
,,,
,,,
,,,
,,,
,,,
,,,
,,,
,,,
,,,
,,,
,,,
,,,
,,,
,,,
,,,
C,persons,10,2
,,,
,,,
,,,
,,,
,,,
,,,

Uh oh. Remember that spreadsheet screenshot from up above? Because our original data dictionaries follow an atypical tabular organization, so does the output of the csvcut utility. In this case, what I really wanted was a list of column locations for H (household) records. Is there anything within the csvkit toolset that can help? Glad you asked! Indeed there is:

benklaas@frak:~/csvkit$ csvcut -c RecordType,Var,Col,Wid data_dict_cps2013_04b.csv | csvgrep -c RecordType -m H
RecordType,Var,Col,Wid
H,cpsid,12,14
H,hhid,26,15
H,hrmonth,41,2
H,hryear4,43,4
H,hurespli,47,2
H,hufinal,49,3
H,huspnish,52,2
H,hetenure,54,2
H,hehousut,56,2
H,hetelhhd,58,2
H,hetelavl,60,2
H,hephoneo,62,2
H,hefaminc,64,2
H,hutypea,66,2
H,hutypeb,68,2
H,hutypec,70,2
H,hwhhwgt,72,10
H,hrintsta,82,2
H,hrnumhou,84,2
H,hrhtype,86,2
H,hrmis,88,2
H,huinttyp,90,2
H,huprscnt,92,2
H,hrlonglk,94,2
H,hrhhid2,96,5

Since csvkit utilities are built following the Unix philosophy of command-line tools, the output of one csvkit utility can be piped directly into another. In this case, we’re now using csvcut to grab 4 columns of the original data, then piping that to csvgrep, which allows us to limit rows to only those that have H in the RecordType column (household records).

However, we still have a problem. You’ll notice the first data row of the output starts at column 12. What happened to columns 1-11? Well, in IPUMS data we also have the concept of common, or “C”, records, whose variables apply to all record types. We want those rows too. Luckily, csvgrep has a -r option to match with a regular expression:

benklaas@frak:~/csvkit$ csvcut -c RecordType,Var,Col,Wid data_dict_cps2013_04b.csv | csvgrep -c RecordType -r '^[CH]'
RecordType,Var,Col,Wid
CR,rectype,1,1
C,hhNum,2,6
C,perNum,8,2
C,persons,10,2
H,cpsid,12,14
H,hhid,26,15
H,hrmonth,41,2
H,hryear4,43,4
H,hurespli,47,2
H,hufinal,49,3
H,huspnish,52,2
H,hetenure,54,2
H,hehousut,56,2
H,hetelhhd,58,2
H,hetelavl,60,2
H,hephoneo,62,2
H,hefaminc,64,2
H,hutypea,66,2
H,hutypeb,68,2
H,hutypec,70,2
H,hwhhwgt,72,10
H,hrintsta,82,2
H,hrnumhou,84,2
H,hrhtype,86,2
H,hrmis,88,2
H,huinttyp,90,2
H,huprscnt,92,2
H,hrlonglk,94,2
H,hrhhid2,96,5

And just like that we have a full schema map of household variables, which alone has lots of uses internally, but is also an important step toward the next topic: converting the fixed-width microdata flat files into CSV.
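
One wrinkle the commands below gloss over: in2csv’s fixed-width schema file expects column, start, and length headings, not our RecordType, Var, Col, and Wid. Here is a hedged Python sketch of how household_schema.csv might be produced from the same ‘^[CH]’ selection used above:

import csv

# Hedged sketch: translate the data dictionary's RecordType/Var/Col/Wid
# rows into the column,start,length schema in2csv's fixed-width mode expects.
with open('data_dict_cps2013_04b.csv') as src, \
        open('household_schema.csv', 'w') as dest:
    writer = csv.writer(dest)
    writer.writerow(['column', 'start', 'length'])
    for row in csv.DictReader(src):
        if row['RecordType'].startswith(('C', 'H')):  # same as '^[CH]'
            # Note: check whether your csvkit version treats start as
            # 0-based; the Col values here are 1-based.
            writer.writerow([row['Var'], row['Col'], row['Wid']])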

benklaas@frak:~/csvkit$ egrep '^H' cps2013_04b.dat > cps2013_04b_households.dat
benklaas@frak:~/csvkit$ in2csv -e cp1252 -f fixed -s household_schema.csv cps2013_04b_households.dat > cps2013_04b_households.csv

in2csv isn’t just for xls files! It has a wonderful facility for converting fixed-width files, provided you can pass it a schema (which we derived in the previous step). So, with a pretty straightforward set of commands, I’ve gone from a complex data dictionary spreadsheet and a flat file of microdata to a household variable schema and microdata in CSV form. With the microdata in CSV format, I can throw more csvkit commands at it:

benklaas@frak:~/csvkit$ csvstat -c persons cps2013_04b_households.csv
  4. persons
	<type 'unicode'>
	Nulls: False
	Unique values: 16
	5 most frequent values:
		00:	18765
		02:	18564
		01:	15050
		03:	8052
		04:	7042
	Max length: 2

Row count: 72503

csvstat is really useful for exploratory things like finding out the frequencies of values for a given variable, but it’s not particularly performant. csvkit wasn’t designed to be blazing fast, but to be really easy to use and to patch together. That’s great and has tons of value at MPC. When performance matters, I can throw faster technologies at these steps:

benklaas@frak:~/csvkit$ ipython
IPython 2.3.0 -- An enhanced Interactive Python.

In [1]: import pandas as pd

In [2]: %timeit cpsdata = pd.read_csv('cps2013_04b_households.csv')
1 loops, best of 3: 1.03 s per loop

In [3]: cpsdata = pd.read_csv('cps2013_04b_households.csv')

In [4]: cpsdata.head()
Out[4]:
  rectype  hhNum  perNum  persons           cpsid           hhid  hrmonth  \
0       H      1       0        1  20130400000100    61002006171        4
1       H      2       0        0  20120200000100  2600310997690        4
2       H      3       0        2  20120100000200  2909470991091        4
3       H      4       0        5  20120400000300  5906350895161        4
4       H      5       0        0  20120400000500  8520510992294        4   

   hryear4  hurespli  hufinal     ...       gtindvpc  gtcbsasz  gtcsa  \
0     2013         1      201     ...              0         3    290
1     2013        -1      219     ...              0         3    290
2     2013         1      201     ...              0         3    290
3     2013         1      201     ...              0         5      0
4     2013        -1      226     ...              0         5      0   

   hxtenure  hxhousut  hxtelhhd  hxtelavl  hxphoneo  hxfaminc  hetenure_df
0        -1         0         0         1         0         0            1
1        -1         0         1         1         0         0            B
2        -1         0         0         1         0        23            1
3        -1         0         0         1         0         0            2
4        -1         0         1         1         0         1            B  

[5 rows x 54 columns]

In [5]: %timeit cpsdata['persons'].value_counts()
100 loops, best of 3: 2.92 ms per loop

In [6]: cpsdata['persons'].value_counts()
Out[6]:
0     18765
2     18564
1     15050
3      8052
4      7042
5      3187
6      1168
7       410
8       151
9        71
10       24
11       11
16        2
15        2
13        2
12        2
dtype: int64

Using Python and the pandas data library, I can run a process that took 1m20s in csvkit in under 10 seconds. And because the easy csvkit steps got the data into a simple, portable format, the high-performance task was that much easier to set up.

Making improvements like these is really gratifying and is an example of why I enjoy working at MPC. We work on cool data, which keeps things interesting, but on top of that there are fertile avenues for learning, which makes me look forward to every day I come to work here.