ISRDI IT Blog

Ingesting the 2020 U.S. Census with DuckDB

Mon, 27 Nov 2023 00:00:00 +0000

Introduction

Our IPUMS NHGIS data collection aims to make public U.S. Census Bureau (USCB) data releases easier to use. The USCB publishes statistics from every decennial census and the annual American Community Survey (ACS). NHGIS makes this information easier to navigate: users can request data on specific topics, from specific times and places. The stats are available for different geographic levels: states, counties, tracts and even city blocks (as well as many more). When you hear about census stats for your state or county or city, they’re probably available from NHGIS. In addition to making the information easier to find, NHGIS makes the data comparable across many decades so that you can more easily look at change over time.

Behind the scenes, the IPUMS team has to get the original form of the data from the public USCB publication and transform it along with the metadata also provided by USCB (labels of locations, names of tables and so forth). The metadata helps us match up topics and labels to prior censuses. We download and reshape the data to match all of our existing data. We refer to this process of consuming the USCB’s metadata and data and transforming it into the NHGIS schema as “ingesting”. The ingest process is partly automated but always involves some amount of manual adjustment to data and metadata to account for changes in the latest census or survey.

There are a few steps, all of which may be performed mostly automatically or mostly manually if necessary:

1. Download from USCB
2. Prepare the NHGIS metadata description of the NHGIS version of the census
3. Transform census data into our format suitable for driving our NHGIS data extract system
4. Evaluate and validate, going back to steps 2 and 3 when necessary until reaching a good quality release
5. Publish to the NHGIS data dissemination system

Most of this article is concerned with step 3, the data transformation. The 2020 census release from USCB threw us some curveballs, forcing us to improvise in order to get the data out the door in a reasonable timeframe. One of the things that really helped us achieve this is DuckDB, which allowed us to do our 2020 ingest in a completely different way from the 2010 ingest. Let’s take a closer look.

The Problem

When the full 2020 U.S. Census data was released last June, we discovered the public data didn’t match our expected format at all. We expected something like the format used for the annual ACS, for which we are very prepared with a lot of automation. Instead, the 2020 census release, formally known as the 2020 Demographic and Housing Characteristics File (DHC), was mostly still in the legacy format USCB used in 2010. To quickly incorporate it into our NHGIS system we would have to improvise.

The last time we did a full census ingest, for 2010, it took considerable time. Much of the effort went to the time needed for our research experts to iteratively evaluate the data and the metadata needed to describe the data in our extract system. Some of this effort is irreducible: the hard work is in finding the unique quirks and mistakes in the USCB published data and metadata and addressing them with artisanal code or hand-crafted data and metadata changes. Next, we look at the differences since the last census. New geographic locations appear, new questions get asked, new cross-tabulations get published. All that new stuff has to get identified and incorporated into our NHGIS extract system and documentation.

The overall process took several weeks. However, in addition to the required research time, the data transformation work was also slow. Many days were spent reading and reshaping the data. If that aspect could be improved we’d be able to focus more on the parts that are necessarily not automatable.

In recent years, we had modernized our ingest software to support new ACS formats, an investment that quickly paid off since we ingest new ACS releases every year. However, since censuses only come around once a decade and the 2020 census was still using a legacy 2010 format, extending our new ingest software to support the 2020 census would be a big effort that would only be used once, since we fully expect the 2030 census to use a different format. Building highly-engineered reusable software for this use case seemed a poor use of our resources. After all, the goal is to get the 2020 census into NHGIS, not write software that would get retired immediately.

We needed a different approach.

A Closer Look at the 2020 DHC NHGIS Data

First, let’s dive deeper into the data we get from USCB for the 2020 DHC census.

Census Tables

USCB organizes the census into a set of tables. Let’s look at a simple table: HOUSING UNITS (what USCB refers to as table H1 in the 2020 DHC). This table has only one variable:

Total

which contains the number of housing units in this geographic area. Tables exist at multiple levels of geography, so there will be an H1 table for the entire state of Alaska, for each county in Alaska, for each county subdivision in Alaska, and so on.

Let’s look at a slightly more complicated table: URBAN AND RURAL (housing units), table H2. This table has four variables:

Total
Urban
Rural
Not defined for this file

Its total matches the number of housing units found in table H1; H2 breaks that total down by urban/rural characteristic.

There are around 250 tables containing nearly 10,000 variables in the 2020 DHC census data.

Segmenting

USCB distributed the data as one .zip file per state, and one for the whole nation. D.C. and Puerto Rico are included as “states,” so there were fifty-three files in total.

When you unzip one of these files you find that each state has been divided into a set of 44 “segment” files which are pipe-delimited text containing the data itself (e.g. ak000012020.dhc.gz below), plus a geographic header file (e.g. akgeo2020.dhc.gz below). We end up gzipping these files to save significant space.

Here’s a sample of how the input data is structured:

.
├── ak2020.dhc
│   ├── ak000012020.dhc.gz
│   ├── ak000022020.dhc.gz
..... snip ........
│   ├── ak000442020.dhc.gz
│   └── akgeo2020.dhc.gz
├── al2020.dhc
│   ├── al000012020.dhc.gz
│   ├── al000022020.dhc.gz
.... snip ....
│   ├── al000442020.dhc.gz
│   └── algeo2020.dhc.gz
├── ar2020.dhc
│   ├── ar000012020.dhc.gz
│   ├── ar000022020.dhc.gz
│   ├── ar000032020.dhc.gz
... and so on for all the states

Every state has the same number of segments. Each segment comprises many published census tables from USCB. USCB provides mappings so that users can figure out which segment(s) hold their tables of interest.

The file name ak000012020.dhc.gz can be broken down as follows: ak for Alaska, 00001 for segment #1, 2020 for the 2020 census, dhc is for Demographic and Housing Characteristics (there are various data products for the 2020 census, DHC is only one of them), and .gz because we converted it to a gzip file for space savings. (In the remainder of this post assume we’ve unzipped it to do the work).
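The naming convention can be unpacked mechanically. Here is a minimal Python sketch, assuming every segment file follows the pattern just described; the helper name parse_segment_name and the regular expression are our own illustration, not part of any USCB or NHGIS tooling:

```python
import re

# Pattern assumed from the description above: two-letter state code,
# five-digit segment number, four-digit year, then the product suffix.
SEGMENT_NAME = re.compile(
    r"^(?P<state>[a-z]{2})(?P<segment>\d{5})(?P<year>\d{4})"
    r"\.(?P<product>[a-z]+)(?:\.gz)?$"
)


def parse_segment_name(filename):
    """Split a segment file name into its parts (hypothetical helper)."""
    match = SEGMENT_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a segment file name: {filename}")
    parts = match.groupdict()
    parts["segment"] = int(parts["segment"])
    parts["year"] = int(parts["year"])
    return parts


print(parse_segment_name("ak000012020.dhc.gz"))
```

Geographic header files such as akgeo2020.dhc.gz deliberately fail this pattern, since they carry no segment number.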

Each row in a segment file represents a specific geographic unit within that state, so while all of Alaska’s segment files will have the same number of rows, Alabama’s segment files will have a different number of rows than Alaska’s segment files, because Alabama has a different number of counties, county subdivisions, and so on than Alaska.

This is what a row in the ak000012020.dhc.gz segment file looks like:

DHCST|AK|000|01|0000001|326200|326200|195718|130482|0|326200|269148|57052|269148|112016|59992|97140|182067|83335|42134|56598|8061|2290|589|5182|33029|8486|10489|14054|12248|5107|1952|5189|2810|547|94|2169|5776|2164|889|2723|25157|10087|3845|11225|13959|5187|1653|7119|178056|81766|41764|54526|7745|2207|569|4969|32268|8224|10395|13649|12077|5035|1936|5106|2751|533|84|2134|1736|770|363|603|20556|8294|3228|9034|4011|1569|370|2072|316|83|20|213|761|262|94|405|171|72|16|83|59|14|10|35|4040|1394|526|2120|4601|1793|617|2191|57052|10070|1244|3152|1183|29722|336|11345|269148|182067|8061|33029|12248|2810|5776|25157|269148|255189|178056|7745|32268|12077|2751|1736|20556|13959|4011|316|761|171|59|4040|4601|703100|269148|73681|88573|40371|33204|17103|8869|7347|269148|172008|125469|2879|18975|7059|641|3053|13932|97140|56598|5182|14054|5189|2169|2723|11225|269148|172008|165168|6840|97140|90021|7119|269148|172008|38409|61202|26633|23076|11663|5970|5055|97140|35272|27371|13738|10128|5440|2899|2292|182067|125469|29330|48723|18861|16076|7247|3140|2092|56598|22564|17107|7365|5316|2463|1130|653|8061|2879|699|885|484|365|229|110|107|5182|1949|1318|746|534|322|172|141|33029|18975|3938|4671|2929|2564|1844|1485|1544|14054|4496|3250|2124|1688|1134|730|632

The first column describes the data product: DHCST for DHC at the state level. The next 4 columns describe the geographic unit this row refers to, and the remaining columns are the table data.

Geographic Header Files and Summary Levels

As I just mentioned, columns 2-5 contain geographic unit data. Starting in column 2, we have AK for Alaska; the 000 and 01 relate to something called Characteristic Iteration, which is beyond our scope here; and the 0000001 is a Logical Record Number, which allows us to link to the geography header file to get more information about this geographic unit.

Alaska has 43,234 distinct geographic units in the 2020 DHC census, which means there are 43,234 rows in each of Alaska’s segment files and 43,234 geographic variants of many of the 2020 DHC data tables for Alaska (not all data tables are available at all geography levels). The geographic header file for Alaska therefore has 43,234 rows to describe all of these places. Other states have more or fewer geographic units, but all told there are more than 11 million geographic units in the 2020 DHC census.

Each of those more than 11 million geographic units maps to a particular geography level, or what USCB calls a “Summary Level”. Common summary levels include state, county, county subdivision, place, census tract, block group and block, but there are also others such as school and legislative districts, plus separate hierarchies for American Indian, Alaska Native and Native Hawaiian Areas. This concept becomes important later in our process.

Getting back to the specific example row above, linking into Alaska’s geography header file using the Logical Record Number, we can determine that 0000001 means “the state of Alaska”. Here is the relevant row:

DHCST|AK|040|00|00|000|00|0000001|0400000US02|02|4|9|02|01785533|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||1478941109938|245380162784|Alaska|Alaska|A||733391|326200|+63.3473560|-152.8397334|00||

The 040 in column 3 is the summary level code for “state”. You can see there are quite a few columns, and many of them are blank in this case. Let’s look at a more interesting example, 0000036, which represents the municipality of Anchorage:

DHCST|AK|050|00|00|000|00|0000036|0500000US02020|02020|4|9|02|01785533|020|H6|01416061|||||||||||||||||||||||||||||||||11260|1|999|99999||||||||||||||||||||||||||||||||4420591349|621302197|Anchorage|Anchorage Municipality|C||291247|118640|+61.1742503|-149.2843294|12||

States can have different classification systems for their places, but you can see here from the 050 in column 3 that USCB considers this the equivalent of a county, since that’s the summary level code representing the State-County hierarchy. As the geographic units get smaller in size, the records contain more of this “nesting” information. Here’s an example of the lowest geographic unit, a census block:

DHCST|AK|100|00|00|000|00|0018802|1000000US021220013001063|021220013001063|4|9|02|01785533|122|H1|01419972|68610|Z5|01939944|99999|99|99999999||||99999|99|99999999|99999|99|99999999|001300|1|1063|9999|9|99999|99|99999999|999|99999|99|99999999|999999|9|14410|E7|02419341|99999|9|999|99999|99999|9|999|99999|9|9|99999|9|R|00|||||00P|||||032|||||32-847|A|99664|99999|99999|00390|00200|105812|0|1063|Block 1063|S||0|0|+60.0818687|-149.3458873|BK||99999

Most fields contain data in this case. So as you can see, there is quite a bit of contextual geographic data available, especially as you get into the smaller geographies. This type of information is incredibly useful for users of this census data. Rather than repeating this information on each row within each of the 44 segment files, USCB pulled that information out into its own file and put the Logical Record Number linking keys in the data itself to save space. We will need to re-join this data back to the table data within NHGIS so we can provide this useful contextual geographic data to users in their data extracts.
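To make the linking idea concrete, here is a small Python sketch of that re-join. It uses only the first eight fields of the two header rows shown above and the start of the Alaska data row; the dictionary layout and field names in the result are our own illustration, not the NHGIS schema:

```python
# First eight fields of two geographic header rows shown above.
geo_rows = [
    "DHCST|AK|040|00|00|000|00|0000001",
    "DHCST|AK|050|00|00|000|00|0000036",
]
# Start of the Alaska segment 1 row shown earlier.
data_rows = ["DHCST|AK|000|01|0000001|326200"]

# Index the header rows by (state, Logical Record Number).
geo_by_key = {}
for row in geo_rows:
    f = row.split("|")
    geo_by_key[(f[1], f[7])] = {"sumlev": f[2]}

# Attach the summary level to each data row via the same key.
joined = []
for row in data_rows:
    f = row.split("|")
    geo = geo_by_key[(f[1], f[4])]
    joined.append(
        {"state": f[1], "logrecno": f[4], "sumlev": geo["sumlev"], "first_value": int(f[5])}
    )

print(joined)
```

The state code is part of the key because Logical Record Numbers restart within each state's files.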

Table Data

The actual table data starts at column 6. Using table-to-segment mapping information from USCB, we know that table H1 is the first table in segment 1. We saw above that table H1 only has one variable, for the total number of housing units. So that 326200 in the sixth column is telling us there are 326,200 housing units in the state of Alaska. The next four columns are for table H2. We see the same 326200 total of housing units, with 195718 of them being urban, 130482 rural, and 0 unspecified (which checks out since 195,718 + 130,482 = 326,200). The rest of the columns in segment 1 represent tables H3 through H12C.
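The same arithmetic can be checked in a few lines of Python. This sketch uses only the first ten fields of the Alaska row shown earlier; the identification column names in the comment are an assumption borrowed from the column names that appear in the cleanup SQL later in this post:

```python
# First ten fields of the Alaska segment 1 row shown earlier.
row = "DHCST|AK|000|01|0000001|326200|326200|195718|130482|0"
fields = row.split("|")

# Identification columns (assumed names: FILEID, STUSAB, CHARITER, CIFSN, LOGRECNO).
ident = fields[:5]

# Table H1 has a single variable: total housing units.
h1_total = int(fields[5])

# Table H2 has four variables: total, urban, rural, not defined.
h2_total, h2_urban, h2_rural, h2_undefined = (int(v) for v in fields[6:10])

# H2 should break the H1 total into its urban/rural parts.
assert h1_total == h2_total
assert h2_urban + h2_rural + h2_undefined == h2_total

print(ident[1], h1_total, h2_urban, h2_rural, h2_undefined)
```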

If you are wondering why USCB does this segmenting business, it’s to keep each data file under 255 columns to make it easier to use these files in spreadsheet software. A rough calculation of 10,000 variables divided by 255 columns would require at least 40 segments. We end up with 44 because USCB doesn’t want to split tables across two different segments so they can’t be perfectly efficient with using all 255 columns per segment.
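A quick Python sketch of that calculation, plus a toy illustration of why refusing to split tables costs extra segments. The table widths below are invented purely for illustration and are not the real DHC layout, and pack_tables is a hypothetical helper, not USCB's actual method:

```python
import math

MAX_COLUMNS = 255          # spreadsheet-friendly limit described above
TOTAL_VARIABLES = 10_000   # approximate variable count for the 2020 DHC

# Lower bound if tables could be split freely across segments.
lower_bound = math.ceil(TOTAL_VARIABLES / MAX_COLUMNS)


def pack_tables(table_widths, capacity=MAX_COLUMNS):
    """Place whole tables into segments in order, never splitting one."""
    segments, used = 1, 0
    for width in table_widths:
        if width > capacity:
            raise ValueError("table wider than one segment")
        if used + width > capacity:
            # This table does not fit; start a new segment.
            segments += 1
            used = 0
        used += width
    return segments


# Hypothetical table widths: 500 columns would fit in 2 segments if
# tables could be split, but keeping tables whole takes more.
example_widths = [100, 100, 100, 100, 100]
print(lower_bound, pack_tables(example_widths))
```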

From USCB Segments to NHGIS Datasets

NHGIS groups the 250 tables in the 2020 DHC census into NHGIS “datasets”. Remember how I said above that not all tables are available at all geographies? One of the main reasons NHGIS groups tables into separate datasets is to group data together which is available for all of the same geographic summary levels. NHGIS creates three distinct datasets for the 2020 DHC. At NHGIS we also don’t have the same 255 column restriction that USCB has, so we can also get rid of segments at this point.

While learning about the 2010 DHC workflow which handled a “similar to but not exactly the same as” 2020 data format, it seemed to me that the underlying transformations on the data tables looked like pretty straightforward operations. Our past system in 2010 employed a series of scripts to transform the data as text files, but what they were doing was pretty much translatable to relational database terms.

The general overview of our ingest process is as follows:

1. Merge segments together (i.e. a “horizontal” merge) based on which NHGIS dataset their tables will be in. Luckily segments don’t split across datasets: all of the tables in each segment belong to one and only one NHGIS dataset, which makes the transformation from segments to datasets a bit easier.
2. Merge states together (i.e. a “vertical” merge). Note that steps 1 and 2 can happen in either order; this becomes important in a moment.
3. Join the geography information from the geog header files for each state onto the rows of data.
4. Enhance the records with NHGIS-specific value adds. Much of this involves bringing in enhanced geographic information.
5. Do QA checks and look for errors in USCB data or metadata.
6. Transform data into a format which is compatible with and performant for the NHGIS data extract system.

Could a Database Help?

All of these steps could benefit from being done within a database engine, but particularly steps 3-6, which are the most compute intensive, whereas the file munging in steps 1 and 2 is already quite efficient with UNIX command line tools like cat and paste. In theory, these activities map directly in relational database terms to select with join and union (steps 1 and 2), join (step 3), update (step 4), select (step 5) and select and join (step 6). Could we really boil things down to a series of SQL statements? That would be so nice. Representing the data reshaping tasks as SQL would bring the ingest process much closer to a self-documenting procedure for the future when we’ve all forgotten what the heck was going on ten years ago or when it turns out we need to totally reprocess the 2010 and 2020 censuses for some reason I’m sure will make sense in the year 2029.

The decennial census data presents challenges the annual ACS does not: the decennial census contains finer-grained geography levels like “block group” and “block”, meaning we have some records for every block in the census, of which there are about 11 million across the country. That may not sound especially large, except that these tables can have thousands of columns and we need to do some checking on them all – or did in the past anyhow.

In 2010, the obstacles to executing some of the queries would have been twofold: the simple version of the schema would have required many thousands of columns, not something databases available to us in 2010 could support, and more raw memory than was available to us in 2010. Well, now we have servers with 256GB and even one TB of RAM, as well as dozens of cores! And solid-state storage. And very fast ethernet between servers. It’s so much better than things were ten years ago, so maybe a database-based approach is now feasible?

Another reason we hadn’t used a database tool earlier was that our work style requires that we do a lot of import-transform-export workflows and having a single database server as a bottleneck slows this down and prevents parallel workflows. We may want many versions of the same set of data at once for concurrent processing. These are things expensive commercial products can accommodate, but we use open source tools (partly out of principle and partly out of budgetary constraints). So while the query execution is something we could have really benefitted from in the past, the client-server database model wasn’t a great fit. Perhaps the situation has improved a decade later?

Choosing a Database Tool

If we use a database tool there would be essentially two broad phases: cleaning + loading, then transforming + enhancing. With luck the export is a straightforward afterthought. We set out to identify an open-source database tool that might be able to accommodate this workflow. We looked at a number of options, but long story short, we chose DuckDB.

Looking at DuckDB’s home page:

When to use DuckDB
• Processing and storing tabular datasets, e.g., from CSV or Parquet files
• Interactive data analysis, e.g., join & aggregate multiple large tables
• Concurrent large changes, to multiple large tables, e.g., appending rows, adding/removing/updating columns
• Large result set transfer to client

In addition, DuckDB is a stand-alone tool and doesn’t concern itself with running a server.

Seems like a perfect match.

How DuckDB Supported our Workflow

Now for how we actually used DuckDB. We should also note that we used DuckDB 0.7.1 for most of our work; 0.9.2 is available now and you should probably use that.

Loading the Data

I mentioned above that steps 1 and 2 were already quite efficient using UNIX command line utilities. Let’s say that we determine segments 1-27 are all destined for NHGIS dataset A. An alternative cheap, low-memory way to join the data before loading into the database would be to use paste like so:

$ paste -d '|' ak000012020.dhc ak000022020.dhc .. ak000272020.dhc > ak_dataset_a.dhc

to join all of a state’s segments together, and then:

$ cat ak_dataset_a.dhc al_dataset_a.dhc .. wy_dataset_a.dhc > us_dataset_a.dhc

to combine all of the states into one data file for the NHGIS dataset A.
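For readers less familiar with paste and cat, here is a rough Python equivalent of the two merges. It is only a sketch on made-up rows; the function names are ours, and unlike paste it refuses segments of unequal length rather than padding them:

```python
def paste_segments(segments, delimiter="|"):
    """Horizontal merge: join line i of every segment, like `paste -d '|'`."""
    if len({len(lines) for lines in segments}) > 1:
        raise ValueError("segments must have the same number of rows")
    return [delimiter.join(parts) for parts in zip(*segments)]


def cat_states(states):
    """Vertical merge: append each state's rows in turn, like `cat`."""
    merged = []
    for rows in states:
        merged.extend(rows)
    return merged


# Made-up rows for illustration only: two segments per state.
ak = paste_segments([["AK|1|10", "AK|2|20"], ["AK|1|11", "AK|2|21"]])
al = paste_segments([["AL|1|30"], ["AL|1|31"]])
print(cat_states([ak, al]))
```

Note that the identification columns repeat once per segment in each pasted row, so they would need to be dropped or ignored downstream.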

Then the loading process would simply be one database “create table from csv file” operation per dataset, no join needed. We considered doing this, but the database would need a robust CSV reader to support the large number of columns, which might differ from the number of columns allowed in an internally created table. This turned out to be an issue on the version of DuckDB we used, so instead we loaded segments into temporary tables and joined in the DB.

So, instead, to initially load the DHC data into a database we first:

Make files to serve as the headers for the pipe-delimited segment files provided by USCB (e.g. S44_heading.dhc.gz, created using metadata from USCB)
Concatenate all the state files for each segment into consolidated files, one per segment. We simply use cat for this as shown above, but create one file per segment so they are not too wide (it looks like we do have something akin to a 255-column limit after all, at least until we get into the database!)
Ensure the input files use the correct character encoding (in our case they had to be re-encoded from Windows-1252 to UTF-8 to work with the rest of our workflow).
In DuckDB, load those segment files in with “read_csv_auto()” – there are similar functions in other database systems.
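The re-encoding step in the list above can be sketched in a couple of lines of Python. This is only an illustration of the idea on a made-up sample, not the tool we necessarily used; the function name is hypothetical:

```python
def reencode_cp1252_to_utf8(raw_bytes):
    """Decode Windows-1252 bytes and re-encode them as UTF-8."""
    return raw_bytes.decode("cp1252").encode("utf-8")


# Made-up sample: byte 0xF1 is "ñ" in Windows-1252.
sample = b"Pe\xf1uelas|123\n"
converted = reencode_cp1252_to_utf8(sample)
print(converted.decode("utf-8"))
```

Plain ASCII content, including the pipe delimiters and digits, passes through unchanged; only the accented characters change their byte representation.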

The concatenation looked like this (yes, you can cat together gzipped files!):

cat S44_heading.dhc.gz \
/tmp/2020dhc_data/segmented_states/tx2020.dhc/tx000442020.dhc.gz \
/tmp/2020dhc_data/segmented_states/nc2020.dhc/nc000442020.dhc.gz \
/tmp/2020dhc_data/segmented_states/co2020.dhc/co000442020.dhc.gz \
/tmp/2020dhc_data/segmented_states/nv2020.dhc/nv000442020.dhc.gz \
... 
> /tmp/2020dhc_data/segmented_states/segment_S44.dhc.gz	

for every state, on every segment file up to segment 44. We end up with one file per segment.
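If concatenating compressed files seems suspicious, the behavior is easy to confirm with Python's standard gzip module. The byte strings below are stand-ins for a header file and one state's segment file, not real data:

```python
import gzip

# Stand-ins for a header file and one state's segment file.
header = gzip.compress(b"FILEID|STUSAB|LOGRECNO\n")
state = gzip.compress(b"DHCST|AK|0000001\n")

# Byte-level concatenation, which is all `cat` does.
combined = header + state

# A gzip stream may hold several members; decompressing yields them in order.
print(gzip.decompress(combined).decode("ascii"))
```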

And here is a sample of the loading script:

ccd@build:/tmp/2020dhc_data/work$ head 02_load_into_duckdb.sh
duckdb-71 new_cleaned_tmp.nhgis.data.db -s "PRAGMA memory_limit='100GB';create table geo as select * from read_csv_auto('geou.psv', header=true, sep='|', sample_size=-1,all_varchar=1);alter table geo alter LOGRECNO type integer"
duckdb-71 new_cleaned_tmp.nhgis.data.db -s "PRAGMA memory_limit='100GB';create table S01 as select * from read_csv_auto('segment_S01.dhc.gz', header=true, sep='|')"
duckdb-71 new_cleaned_tmp.nhgis.data.db -s "PRAGMA memory_limit='100GB';create table S02 as select * from read_csv_auto('segment_S02.dhc.gz', header=true, sep='|')"
duckdb-71 new_cleaned_tmp.nhgis.data.db -s "PRAGMA memory_limit='100GB';create table S03 as select * from read_csv_auto('segment_S03.dhc.gz', header=true, sep='|')"
duckdb-71 new_cleaned_tmp.nhgis.data.db -s "PRAGMA memory_limit='100GB';create table S04 as select * from read_csv_auto('segment_S04.dhc.gz', header=true, sep='|')"
duckdb-71 new_cleaned_tmp.nhgis.data.db -s "PRAGMA memory_limit='100GB';create table S05 as select * from read_csv_auto('segment_S05.dhc.gz', header=true, sep='|')"
duckdb-71 new_cleaned_tmp.nhgis.data.db -s "PRAGMA memory_limit='100GB';create table S06 as select * from read_csv_auto('segment_S06.dhc.gz', header=true, sep='|')"
duckdb-71 new_cleaned_tmp.nhgis.data.db -s "PRAGMA memory_limit='100GB';create table S07 as select * from read_csv_auto('segment_S07.dhc.gz', header=true, sep='|')"
duckdb-71 new_cleaned_tmp.nhgis.data.db -s "PRAGMA memory_limit='100GB';create table S08 as select * from read_csv_auto('segment_S08.dhc.gz', header=true, sep='|')"
duckdb-71 new_cleaned_tmp.nhgis.data.db -s "PRAGMA memory_limit='100GB';create table S09 as select * from read_csv_auto('segment_S09.dhc.gz', header=true, sep='|')"
ccd@build:/tmp/2020dhc_data/work$

Now we have an initial database and are ready to join the temporary segment tables. All segment files will be in the same order as long as the concatenation process reads the state directories in the same order for each segment. We now have a database consisting of tables matching the concatenated segment files so that there are tables like “S01”, “S02” … “S44”.
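Since the load commands differ only in the segment number, a script like the one above is a natural candidate for generation. Here is a hedged Python sketch of one way to produce it; load_command is a hypothetical helper, and the command text simply mirrors the pattern in the sample script:

```python
def load_command(segment_number, db="new_cleaned_tmp.nhgis.data.db"):
    """Build one DuckDB CLI call following the pattern in the script above."""
    name = f"S{segment_number:02d}"
    sql = (
        "PRAGMA memory_limit='100GB';"
        f"create table {name} as select * from "
        f"read_csv_auto('segment_{name}.dhc.gz', header=true, sep='|')"
    )
    return f'duckdb-71 {db} -s "{sql}"'


# One command per segment, 1 through 44.
commands = [load_command(n) for n in range(1, 45)]
print("\n".join(commands[:2]))
```

Writing the joined commands to a file gives a shell script that can be rerun whenever the inputs change.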

Transforming the Data

This part is all pure SQL. However, we can invoke DuckDB as a command line tool against the working database file (much like SQLite), so we can still accomplish many tasks with a generated shell script that includes the query we want to execute. We used a mix of shell scripts calling the DuckDB CLI tool and the Python library for DuckDB.

Here is each transformation step with sample SQL.

Join segment tables into dataset tables. As we mentioned above, NHGIS groups DHC data tables into three datasets. Our segment files happen to all be in the same order so that the join is simple. All segment tables in the same dataset have the exact same number of rows because we grouped segments based on the geography levels contained in the tables within. There are about 11 million geographic units and therefore rows in dataset “A”, 2.5 million in dataset “B” and 38 thousand in dataset “C”. The join runs fairly fast considering there are 3500 and 5700 columns in A and B respectively. The memory footprint is pretty large, however.

The joins were done like this (here STUSAB is the USCB code for the state):

duckdb-71 new_cleaned_tmp.nhgis.data.db -s  "
create index S16_idx on S16(S16STUSAB,S16LOGRECNO);
create index S17_idx on S17(S17STUSAB,S17LOGRECNO);
drop table if exists tmp_dataset_cph_2020_DHCc;
PRAGMA force_index_join;
create table tmp_dataset_cph_2020_DHCc as select * from S16 left join S17 on S16.S16STUSAB = S17.S17STUSAB and S16.S16LOGRECNO = S17.S17LOGRECNO;
"

This is for “Dataset C” only, the smallest one, which is composed of only two segments. You see that we use the Logical Record Number to join across the two segments, to ensure that the joining rows represent the same geographic place. It’s technically a left join but segments 16 and 17 should contain the same number of rows since they have tables at all of the same geographic summary levels.
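The assumption that both segments cover exactly the same geographic units is cheap to verify before joining. A minimal Python sketch on made-up rows follows; the helper names are ours, and the key positions assume the state code and Logical Record Number sit in the second and fifth fields as in the segment row shown earlier:

```python
def segment_keys(rows):
    """Return the (STUSAB, LOGRECNO) pair for each pipe-delimited row."""
    keys = []
    for row in rows:
        fields = row.split("|")
        keys.append((fields[1], fields[4]))
    return keys


def same_geographies(left_rows, right_rows):
    """True when both segments list the same units in the same order."""
    return segment_keys(left_rows) == segment_keys(right_rows)


# Made-up rows; only the identification columns matter here.
s16 = ["DHCST|AK|000|16|0000001|5", "DHCST|AK|000|16|0000002|7"]
s17 = ["DHCST|AK|000|17|0000001|9", "DHCST|AK|000|17|0000002|3"]
print(same_geographies(s16, s17))
```

If a check like this passes, a left join and an inner join on those keys produce the same rows.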

Next we join the geo table, which is populated by the geographic header files, to each of the dataset tables; this can be an inner join using the state code (STUSAB) and Logical Record Number (LOGRECNO) as the key on both sides. The temp dataset tables will have columns like S01STUSAB, S01LOGRECNO and so on, indicating their origin on different segments. The values all match. These joins take some time and are one place where a row-oriented DBMS may have performed better.

create table dataset_2020_DHCc as 
 select * from geo 
  inner join tmp_dataset_2020_DHCc on 
    geo.STUSAB = tmp_dataset_2020_DHCc.S16STUSAB and 
    geo.LOGRECNO = tmp_dataset_2020_DHCc.S16LOGRECNO

We have now created the final dataset tables, but they still need some updates.

Drop redundant columns such as the segment variants of LOGRECNO and STUSAB; drop the temporary dataset tables and segment tables

For example on dataset C:

drop table S16;
drop table S17;

drop table tmp_cph_2020_DHCc;

alter table cph_2020_DHCc drop S16FILEID;
alter table cph_2020_DHCc drop S16STUSAB;
alter table cph_2020_DHCc drop S16CHARITER;
alter table cph_2020_DHCc drop S16CIFSN;
alter table cph_2020_DHCc drop S16LOGRECNO;
alter table cph_2020_DHCc drop S17FILEID;
alter table cph_2020_DHCc drop S17STUSAB;
alter table cph_2020_DHCc drop S17CHARITER;
alter table cph_2020_DHCc drop S17CIFSN;
alter table cph_2020_DHCc drop S17LOGRECNO;

This is fast.

We also do little bits of housekeeping, like adding a US convenience column:

duckdb-71 new_cleaned_tmp.nhgis.data.db -s "
-- Add the US convenience column to datasets.
alter table dataset_2020_DHCa add column US varchar;
update dataset_2020_DHCa set US=1 where SUMLEV=10;
alter table dataset_2020_DHCb add column US varchar;
update dataset_2020_DHCb set US=1 where SUMLEV=10;
alter table dataset_2020_DHCc add column US varchar;
update dataset_2020_DHCc set US=1 where SUMLEV=10;
"

This is very fast because of DuckDB’s column store.

Next we can add and compute the GISJOIN column for each dataset. GISJOIN is a concatenated geographic identifier we construct for internal NHGIS usage. The algorithm for generating a GISJOIN varies depending on the summary level. This is a rather complex update, but with DuckDB it runs quite fast, so if we find we made a mistake it’s cheap to redo.

-- Create and update all GISJOIN rows for each dataset
alter table cph_2020_DHCa
  add column GISJOIN varchar;

UPDATE cph_2020_DHCa
  SET GISJOIN = CONCAT('G', RIGHT(CONCAT('00000', US), 1))
  WHERE SUMLEV = '010';

UPDATE cph_2020_DHCa
  SET GISJOIN = CONCAT('G', RIGHT(CONCAT('00000', REGION), 1))
  WHERE SUMLEV = '020';

UPDATE cph_2020_DHCa
  SET GISJOIN = CONCAT('G', RIGHT(CONCAT('00000', DIVISION), 1))
  WHERE SUMLEV = '030';

UPDATE cph_2020_DHCa
  SET GISJOIN = CONCAT('G', RIGHT(CONCAT('00000', STATE), 2), '0')
  WHERE SUMLEV = '040';
,....

and so on for every geographic level.

Add and compute the GN_GISJOIN columns on each dataset. Adding the columns is trivial.

alter table cph_2020_DHCa 
  add column if not exists nation_gn_gisjoin varchar;
alter table cph_2020_DHCa 
  add column if not exists region_gn_gisjoin varchar;
alter table cph_2020_DHCa 
  add column if not exists division_gn_gisjoin varchar;
alter table cph_2020_DHCa 
  add column if not exists state_gn_gisjoin varchar;
alter table cph_2020_DHCa 
  add column if not exists county_gn_gisjoin varchar;
alter table cph_2020_DHCa 
  add column if not exists cty_sub_gn_gisjoin varchar;
alter table cph_2020_DHCa 
  add column if not exists place_gn_gisjoin varchar;

and so on, for all geographies on every dataset table. This is very fast.

The updates look like this:

UPDATE cph_2020_DHCa 
  SET county_gn_gisjoin = CONCAT('G', RIGHT(CONCAT('00000', STATE), 2), '0', RIGHT(CONCAT('00000', COUNTY), 3), '0') 
  WHERE SUMLEV IN ('310', '311', '312', '313', '314', '315', '316', '320', '321', '322', '323', '324', '332', '333', '341');

These occur for every geography on each dataset. These are also fast.
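To see what these CONCAT/RIGHT expressions actually produce, here is a small Python mirror of the state-level GISJOIN update and the county_gn_gisjoin update shown above. It covers only those two cases; the function names are ours, and the real rules for other summary levels are not reproduced here:

```python
def zero_padded_right(value, width):
    """Mirror RIGHT(CONCAT('00000', value), width) from the SQL above."""
    return ("00000" + str(value))[-width:]


def state_gisjoin(state):
    # 'G' + two-digit state code + '0', as in the SUMLEV = '040' update.
    return "G" + zero_padded_right(state, 2) + "0"


def county_gisjoin(state, county):
    # State part, then three-digit county code + '0', as in the
    # county_gn_gisjoin update.
    return state_gisjoin(state) + zero_padded_right(county, 3) + "0"


# Alaska is state 02 and Anchorage is county 020 in the header rows
# shown earlier in this post.
print(state_gisjoin("02"), county_gisjoin("02", "020"))
```

The padding means a code that lost its leading zeros somewhere along the way (say 2 instead of 02) still yields the same identifier.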

Add and compute the PUMA columns. These are “Public Use Microdata Area” values useful for matching geography to microdata published only with PUMA identifiers for geographic location. The update runs quickly and is easy to iterate on.

create table puma_x_tract as
  select * from read_csv_auto('layouts/pumas.csv', header=true);

UPDATE cph_2020_DHCa
  SET puma = puma_x_tract.PUMA5CE FROM puma_x_tract
  WHERE cph_2020_DHCa.STATE = puma_x_tract.STATEFP AND
    cph_2020_DHCa.COUNTY = puma_x_tract.COUNTYFP AND
    cph_2020_DHCa.TRACT = puma_x_tract.TRACTCE AND
    cph_2020_DHCa.SUMLEV IN ('080', '085', '090', '091', '140', '144', '150', '154', '158', '511', '631', '636');

UPDATE cph_2020_DHCb 
  SET puma = puma_x_tract.PUMA5CE FROM puma_x_tract
  WHERE cph_2020_DHCb.STATE = puma_x_tract.STATEFP AND 
    cph_2020_DHCb.COUNTY = puma_x_tract.COUNTYFP AND 
    cph_2020_DHCb.TRACT = puma_x_tract.TRACTCE  AND 
    cph_2020_DHCb.SUMLEV IN ('080', '085', '140', '144', '158', '511', '631', '636'); 
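Conceptually, each of these updates is a keyed lookup restricted to certain summary levels. The following Python sketch shows the same logic for dataset “A” on a single record. The crosswalk entry and its PUMA code are invented for illustration, and assign_puma is a hypothetical helper rather than anything in our pipeline:

```python
# Summary levels that receive a PUMA in the dataset "A" update above.
PUMA_SUMLEVS = {
    "080", "085", "090", "091", "140", "144",
    "150", "154", "158", "511", "631", "636",
}


def assign_puma(record, puma_by_tract):
    """Return the record's PUMA code, or None when no crosswalk row applies."""
    if record["SUMLEV"] not in PUMA_SUMLEVS:
        return None
    key = (record["STATE"], record["COUNTY"], record["TRACT"])
    return puma_by_tract.get(key)


# Hypothetical crosswalk entry; the PUMA code here is invented.
crosswalk = {("02", "122", "001300"): "00400"}
tract = {"SUMLEV": "140", "STATE": "02", "COUNTY": "122", "TRACT": "001300"}
state = {"SUMLEV": "040", "STATE": "02", "COUNTY": "", "TRACT": ""}
print(assign_puma(tract, crosswalk), assign_puma(state, crosswalk))
```

Records at summary levels outside the list, such as the state row here, are left without a PUMA, just as the SQL leaves those rows untouched.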

Adding this sort of geographic information to each record is extremely valuable, and in the past has also been extremely time consuming and error prone. I can’t overstate how useful DuckDB has been in allowing us to accelerate this phase of ingest and iterate over the data until we get it right.

QA Checks

Finally, we use this database to check data against the metadata USCB gave us and make corrections; perhaps some geography shows up in the data that’s not in the metadata, or labels don’t quite match, or many other things. We make spot corrections to data and metadata until they are ready to publish.

Having this database in DuckDB format is a great advantage in the checking phase, as the checks run in seconds or less instead of hours. Querying for specific values in a single column or collecting stats on a few columns is extremely fast with DuckDB compared to row-oriented databases or text files. When we make corrections, the updates typically run very fast as well.

Here’s a very basic example of a quality check query. It launches the DuckDB CLI tool, opens a database file (the 2020 DHC database, which is 44 GB) from shared storage and runs a single aggregate query. The server it’s running on isn’t particularly fast.

ccd@gp1:/pkg/ipums/istads/ingest/census_2020/dhc/05_data/db$ time duckdb-71 new_cleaned_tmp.nhgis.data.db -c "
select count(*) as areas, region from cph_2020_DHCa group by region
"                                                                                                                                                                     
┌─────────┬─────────┐                                                                                                                                                       
│  areas  │ REGION  │                                                                                                                                                       
│  int64  │ varchar │                                                                                                                                                       
├─────────┼─────────┤                                                                                                                                                       
│ 4307349 │ 3       │                                                                                                                                                       
│ 2221626 │ 4       │                                                                                                                                                       
│ 3443783 │ 2       │                                                                                                                                                       
│ 1547447 │ 1       │                                                                                                                                                       
│   89424 │ 9       │                                                                                                                                                       
│   51175 │         │                                                                                                                                                       
└─────────┴─────────┘                                                                                                                                                       
                                                                                                                                                                            
real    0m2.626s                                                                                                                                                            
user    0m2.241s                                                                                                                                                            
sys     0m0.555s                                                                                                                                                            

This is on an 11,660,804-row, 3,212-column table with no indexes on the columns in the query.

Finalizing the Database

After transforming the database it looks like this:

ccd@gp1:/pkg/ipums/istads/ingest/census_2020/dhc/05_data/db$ duckdb-71 new_cleaned_tmp.nhgis.data.db                                                                        
v0.7.1 b00b93f0b1                                                                                                                                                           
Enter ".help" for usage hints.                                                                                                                                              
D show tables;                                                                                                                                                              
100% ▕████████████████████████████████████████████████████████████▏                                                                                                         
┌───────────────┐                                                                                                                                                           
│     name      │                                                                                                                                                           
│    varchar    │                                                                                                                                                           
├───────────────┤                                                                                                                                                           
│ cph_2020_DHCa │                                                                                                                                                           
│ cph_2020_DHCb │                                                                                                                                                           
│ cph_2020_DHCc │                                                                                                                                                           
│ geo           │                                                                                                                                                           
└───────────────┘                                                                                                                                                           
D

Here’s the shape of the data. We have three tables with loads of columns. The output is a bit confusing: the “6 columns” refers to the number of columns in the describe output, whereas the “3212 rows” refers to the number of columns in the actual dataset table. Geographic columns come first, followed by table data, with our constructed GN_ columns at the end.

D describe select * from cph_2020_DHCa;                                                                                                                             
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐                                                                                                       
│ column_name │ column_type │  null   │   key   │ default │  extra  │                                                                                                       
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │                                                                                                       
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤                                                                                                       
│ FILEID      │ VARCHAR     │ YES     │         │         │         │                                                                                                       
│ STUSAB      │ VARCHAR     │ YES     │         │         │         │                                                                                                       
│ SUMLEV      │ VARCHAR     │ YES     │         │         │         │                                                                                                       
│ LOGRECNO    │ INTEGER     │ YES     │         │         │         │                                                                                                       
│ GEOID       │ VARCHAR     │ YES     │         │         │         │                                                                                                       
│ GEOCODE     │ VARCHAR     │ YES     │         │         │         │                                                                                                       
│ REGION      │ VARCHAR     │ YES     │         │         │         │                                                                                                       
│ STATE       │ VARCHAR     │ YES     │         │         │         │                                                                                                       
│ COUNTY      │ VARCHAR     │ YES     │         │         │         │                                                                                                       
│    ·        │    ·        │  ·      │    ·    │    ·    │    ·    │                                                                                                       
│    ·        │    ·        │  ·      │    ·    │    ·    │    ·    │                                                                                                       
│    ·        │    ·        │  ·      │    ·    │    ·    │    ·    │                                                                                                       
│ P1_001      │ INTEGER     │ YES     │         │         │         │                                                                                                         
│ P2_001      │ INTEGER     │ YES     │         │         │         │                                                                                                         
│ P2_002      │ INTEGER     │ YES     │         │         │         │                                                                                                         
│ P2_003      │ INTEGER     │ YES     │         │         │         │                                                                                                         
│ P2_004      │ INTEGER     │ YES     │         │         │         │                                                                                                         
│    ·        │    ·        │  ·      │    ·    │    ·    │    ·    │                                                                                                       
│    ·        │    ·        │  ·      │    ·    │    ·    │    ·    │                                                                                                       
│    ·        │    ·        │  ·      │    ·    │    ·    │    ·    │                                                                                                       
│ GN_state    │ VARCHAR     │ YES     │         │         │         │                                                                                                       
│ GN_county   │ VARCHAR     │ YES     │         │         │         │                                                                                                       
├─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤                                                                                                       
│ 3212 rows (40 shown)                                    6 columns │                                                                                                       
└───────────────────────────────────────────────────────────────────┘                                                                                                       
D

D describe cph_2020_DHCb;                                                                                                                                                   
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐                                                                                                         
│ column_name │ column_type │  null   │   key   │ default │ extra │                                                                                                         
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │                                                                                                         
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤                                                                                                         
│ FILEID      │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ STUSAB      │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ SUMLEV      │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ LOGRECNO    │ INTEGER     │ YES     │         │         │       │                                                                                                       
│ GEOID       │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ GEOCODE     │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ REGION      │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ STATE       │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ COUNTY      │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│    ·        │    ·        │  ·      │    ·    │    ·    │   ·   │                                                                                                         
│    ·        │    ·        │  ·      │    ·    │    ·    │   ·   │                                                                                                         
│ PCT1_001    │ INTEGER     │ YES     │         │         │       │                                                                                                         
│ PCT1_002    │ INTEGER     │ YES     │         │         │       │                                                                                                         
│    ·        │    ·        │  ·      │    ·    │    ·    │   ·   │                                                                                                         
│ PCT1_016    │ INTEGER     │ YES     │         │         │       │                                                                                                         
│ PCT1_017    │ INTEGER     │ YES     │         │         │       │                                                                                                         
│ PCT2_001    │ INTEGER     │ YES     │         │         │       │                                                                                                         
│ PCT2_002    │ INTEGER     │ YES     │         │         │       │                                                                                                         
│    ·        │    ·        │  ·      │    ·    │    ·    │   ·   │                                                                                                         
│ PCT2_018    │ INTEGER     │ YES     │         │         │       │                                                                                                         
│ PCT2_019    │ INTEGER     │ YES     │         │         │       │                                                                                                         
│    ·        │    ·        │  ·      │    ·    │    ·    │   ·   │                                                                                                         
│    ·        │    ·        │  ·      │    ·    │    ·    │   ·   │                                                                                                         
│ GN_state    │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ GN_county   │ VARCHAR     │ YES     │         │         │       │                                                                                                       
├─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┤                                                                                                         
│ 5767 rows (40 shown)                                  6 columns │                                                                                                         
└─────────────────────────────────────────────────────────────────┘                                                                                                         
D  

D describe cph_2020_DHCc;                                                                                                                                                   
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐                                                                                                         
│ column_name │ column_type │  null   │   key   │ default │ extra │                                                                                                         
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │                                                                                                         
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤                                                                                                         
│ FILEID      │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ STUSAB      │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ SUMLEV      │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ LOGRECNO    │ INTEGER     │ YES     │         │         │       │                                                                                                       
│ GEOID       │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ GEOCODE     │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ REGION      │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ STATE       │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ COUNTY      │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│    ·        │    ·        │  ·      │    ·    │    ·    │   ·   │                                                                                                         
│    ·        │    ·        │  ·      │    ·    │    ·    │   ·   │                                                                                                         
│ PC1_001     │ INTEGER     │ YES     │         │         │       │                                                                                                         
│ PC1_002     │ INTEGER     │ YES     │         │         │       │                                                                                                         
│    ·        │    ·        │  ·      │    ·    │    ·    │   ·   │                                                                                                         
│ PC1_038     │ INTEGER     │ YES     │         │         │       │                                                                                                         
│ PC1_039     │ INTEGER     │ YES     │         │         │       │                                                                                                         
│ PC2_001     │ INTEGER     │ YES     │         │         │       │                                                                                                         
│ PC2_002     │ INTEGER     │ YES     │         │         │       │                                                                                                         
│    ·        │    ·        │  ·      │    ·    │    ·    │   ·   │                                                                                                         
│ PC2_038     │ INTEGER     │ YES     │         │         │       │                                                                                                         
│ PC2_039     │ INTEGER     │ YES     │         │         │       │                                                                                                         
│    ·        │    ·        │  ·      │    ·    │    ·    │   ·   │                                                                                                         
│    ·        │    ·        │  ·      │    ·    │    ·    │   ·   │                                                                                                         
│ GN_state    │ VARCHAR     │ YES     │         │         │       │                                                                                                       
│ GN_county   │ VARCHAR     │ YES     │         │         │       │                                                                                                       
├─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┤                                                                                                         
│ 442 rows (40 shown)                                   6 columns │                                                                                                         
└─────────────────────────────────────────────────────────────────┘                                                                                                         
D  

The table row counts:

D                                                                                                                                                                           
D select count(*) from cph_2020_DHCa;                                                                                                                                       
┌──────────────┐                                                                                                                                                            
│ count_star() │                                                                                                                                                            
│    int64     │                                                                                                                                                            
├──────────────┤                                                                                                                                                            
│     11660804 │                                                                                                                                                            
└──────────────┘                                                                                                                                                            
D select count(*) from cph_2020_DHCb;                                                                                                                                       
┌──────────────┐                                                                                                                                                            
│ count_star() │                                                                                                                                                            
│    int64     │                                                                                                                                                            
├──────────────┤                                                                                                                                                            
│      2388853 │                                                                                                                                                            
└──────────────┘                                                                                                                                                            
D select count(*) from cph_2020_DHCc;                                                                                                                                       
┌──────────────┐                                                                                                                                                            
│ count_star() │                                                                                                                                                            
│    int64     │                                                                                                                                                            
├──────────────┤                                                                                                                                                            
│        31722 │                                                                                                                                                            
└──────────────┘                                                                                                                                                            
D  

These counts represent the number of different “places” in each of the datasets. Since dataset A goes all the way down to the “block” summary level and there are nearly 11 million census blocks in the US, we see there are more than 11 million rows in dataset A (accounting for all the other summary levels present as well).

Exporting

You might think that we are now done, and could serve our extract system with this database directly, or export each of these dataset tables as three big Parquet files and serve NHGIS’ data extracts from those. However, our existing vast repository of NHGIS data has another, more complex layout that conveys important information and is also designed to be extracted with Apache Spark. The export arranges the data into a structure representing the harmonized-across-time NHGIS geography scheme which isn’t contained in our prepared database – we only have the 2020 specific geography from USCB. We have separate processing code to deal with preparing data for dissemination in this way, but it expects inputs to be one summary level per file.

So, we need to export from DuckDB into one summary level per file: we will use DuckDB’s COPY statement to export to CSV files in a hierarchical structure by geography.

The export queries look like this (the details aren’t terribly important; the key idea is that we need to export large amounts of data to CSV using DuckDB):

copy (
	select * from cph_2020_DHCa 
	where geocomp='00' and sumlev = '040' and 
		not (STUSAB = 'US' and SUMLEV in ('040', '050', '060', '070', '155', '160', '170', '172', '230', '500', '610', '620')) 
	order by STUSAB, LOGRECNO ) 
	to '/tmp/2020dhc_data/work/export_data/cph_2020_dhca/2020/state_040/ge00_file.csv' (HEADER, DELIMITER '|');


copy (
	select * from cph_2020_DHCa 
	where geocomp='00' and sumlev = '050' 
		and not (STUSAB = 'US' and SUMLEV in ('040', '050', '060', '070', '155', '160', '170', '172', '230', '500', '610', '620')) 
	order by STUSAB, LOGRECNO ) 
	to '/tmp/2020dhc_data/work/export_data/cph_2020_dhca/2020/county_050/ge00_file.csv' (HEADER, DELIMITER '|');
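Because these statements differ only in a few values, they are easy to generate. The sketch below shows one way to do that in Python. The function name, its parameters and the path pattern are assumptions inferred from the two examples above, and the STUSAB = 'US' exclusion clause is left out for brevity.

```python
from pathlib import PurePosixPath


def export_query(table, geo_name, sumlev, geocomp, root):
    """Build one COPY statement for a single summary level and geocomp."""
    out_path = (
        PurePosixPath(root) / table.lower() / "2020"
        / f"{geo_name}_{sumlev}" / f"ge{geocomp}_file.csv"
    )
    return (
        f"copy (select * from {table} "
        f"where geocomp = '{geocomp}' and sumlev = '{sumlev}' "
        f"order by STUSAB, LOGRECNO) "
        f"to '{out_path}' (HEADER, DELIMITER '|');"
    )


geographies = [("state", "040"), ("county", "050")]
queries = [
    export_query("cph_2020_DHCa", name, sumlev, "00",
                 "/tmp/2020dhc_data/work/export_data")
    for name, sumlev in geographies
]
for query in queries:
    print(query)
```

Each generated string can then be passed to the database connection, one statement per output file.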

There’s one export for every geography and dataset (and “geocomp”, which we don’t need to get into). This works, but DuckDB has some trouble on the largest geographies. Since we’re generating the queries in a Python script, we can break the largest result sets into chunks and call them separately. An even better solution is to pass the in-memory results of the queries to Polars to export to CSV. Here we used the DuckDB Python library along with Polars to help export data quickly.

(This is simplified)

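import duckdb
import polars as pl  # Polars must be installed for the .pl() conversion below

# db_name, datasets, data_dirname, output_csv and the *_query helpers
# are defined elsewhere (omitted from this simplified version).
con = duckdb.connect(str(db_name), read_only=True)
for ds in datasets:
    # Each result row describes one file to export from this dataset.
    con.execute(dataset_files_query(ds))
    results = con.fetchall()
    datafiles = []

    for data_file in results:
        sumlev, geocomp, ct = data_file
        output_table = f"export_{data_dirname}_{geocomp}_{ds}"
        ...
        output_query = dataset_slice_query(ds, sumlev, geocomp)
        # Fetch the slice as a Polars DataFrame and let Polars write the CSV.
        # Older Polars releases spell include_header as has_header.
        con.execute(output_query).pl().write_csv(
            output_csv, include_header=True, separator="|")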

Notice how we took the result of the query and passed it to Polars to actually write the CSV? The results are in Arrow and can be passed to Polars. DuckDB + Polars can be a powerful combination.
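The chunking alternative mentioned earlier can be sketched in a similar way. How the chunks are split is an assumption here: this version splits one large summary level by groups of state abbreviations, and the function names and sample values are hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def chunked_slice_queries(table, sumlev, geocomp, states, chunk_size):
    """Build one SELECT per group of states for a single summary level."""
    queries = []
    # Sorting keeps the concatenated chunks in STUSAB order.
    for group in chunked(sorted(states), chunk_size):
        in_list = ", ".join(f"'{s}'" for s in group)
        queries.append(
            f"select * from {table} "
            f"where geocomp = '{geocomp}' and sumlev = '{sumlev}' "
            f"and STUSAB in ({in_list}) "
            f"order by STUSAB, LOGRECNO"
        )
    return queries


chunk_queries = chunked_slice_queries(
    "cph_2020_DHCa", "150", "00", ["MN", "AK", "WI", "IA", "ND"], chunk_size=2
)
print(len(chunk_queries))
print(chunk_queries[0])
```

Each chunk is small enough to export on its own, and appending the outputs in order gives the same file as one large export.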

In the end we have a directory structure organized by geography. We hand off this file structure to our Parquet-producing tool to make the data compatible with the existing Spark NHGIS extract engine. While DuckDB can export directly to Parquet, our format requires some particular nested map type columns.

Why DuckDB

Our story with DuckDB has three pillars:

An in-process database solution is now technically feasible. Since the last U.S. Census in 2010, computer permanent storage and memory grew a lot; the Census only grew a little.
In 2010 open source database software had severe limits on the maximum number of columns per table and NHGIS data has thousands of columns per dataset table. The backend storage didn’t favor some of the operations we needed – column store was just beginning its rise in popularity and was mostly only available commercially.
DuckDB offers a uniquely convenient set of features including columnar storage with tens of thousands of columns allowed in tables and great performance for our quality checking tasks.

With enough local memory and disk, a single database server solution was finally usable, and DuckDB had the features to make that solution easy to build.

With this approach, you can have a single database file. Deployment can be as easy as copying the file. Versioning data is as simple as versioning the database files. To formalize and automate versioning one could also put the database files into DVC.

Why not SQLite?

DuckDB in many respects serves as a drop-in replacement for SQLite. It can persist data locally in a database file or in memory. As with SQLite, you can embed DuckDB into your application to programmatically use its features as part of your app; there’s no server, it is an in-process database. Like SQLite, you can get DuckDB as a stand-alone binary. It’s a virtually zero-setup tool: just download it to where you want to work and get going.

DuckDB focuses on flexibility and performance. It can import from CSV, Parquet and JSON and export to CSV or Parquet, and do so quite fast. DuckDB uses a physical layout for storing data by column rather than row as traditional RDBMS do; this allows for extremely fast aggregate functions. The huge number of columns per table allowed also makes it attractive for use with flat aggregated data like census or survey data tables.

In addition to persisting data in DuckDB native format, it can treat external CSV and Parquet files as read-only tables. In many applications you don’t need to formally import data at all. In addition, DuckDB can now query Sqlite format database files. Together, these features offer a very practical advantage for simple number crunching or data shaping jobs, as they don’t need an import step at all.

In contrast, while we could have done most tasks with Sqlite (after rebuilding Sqlite with a higher maximum column limit), a number of steps would have been much slower and we wouldn’t get the excellent CSV and Parquet import/export. Extracting data from the final database would have been a lot slower too.

Why not Spark?

Spark would also work, but at least in past versions, SparkSQL doesn’t perform well on datasets wider than one thousand columns or so. That’s actually one reason we developed a complicated format for our exported data that the Spark-driven extract engine reads.

In addition, DuckDB is a super simple, no-setup tool using standard SQL which you can use from the command line or directly from Python or Rust or C++. Spark is more complicated to set up on a developer box and to maintain in production. If you need to distribute work across many compute cluster nodes Spark makes sense, but when your servers are powerful enough to do the job on one node it’s overkill and you pay a performance penalty for distributed computation. For data within DuckDB’s grasp, queries run substantially faster than on Spark.

Why not Polars?

We could have used Polars and it would have performed the steps quickly – as fast or faster than DuckDB in most cases. It would not give a single database artifact, however. Also, we wanted to preserve the data transformations in accessible language (which is SQL for our researchers on NHGIS.) They have the ability to launch the DuckDB CLI tool and independently inspect the NHGIS data – they already do so with the NHGIS metadata which we store in Sqlite. So again, it was a great fit.

A decade from now it may be that any database engine can execute our SQL, whereas a Python Polars program might be a little harder for a non-expert to understand and get running. That said, Polars is great and if you like dataframes it’s going to serve you well.

Conclusion

When the 2020 DHC census data was released in a legacy format, we really had to improvise to figure out an efficient way to ingest it into NHGIS. Thanks to advances in computing technology and database software, it’s now feasible to load entire NHGIS datasets into a database for processing. Our choice of DuckDB as the particular database engine brought additional benefits of a stand-alone/embeddable database, a column-store backend, strong import/export support for Parquet and CSV, and an easily versionable and deployable database artifact. DuckDB has been a great tool to add to our toolbox, and we’re looking forward to seeing where else we can apply it within our workflows.

Hopefully, this post also shed a bit of light onto the intricacies of this ingest process and the effort required to create IPUMS datasets, even as this post left out and glossed over several of the nuances and complexities of the full 2020 DHC ingest. This project has created a more efficient, accessible, and sustainable process for performing NHGIS data ingest, which will benefit future IT and research staff and allows us to refocus efforts on the parts of the process which intrinsically require more of our attention.

Reading the Parquet Data Format in Rust

Fri, 01 Jul 2022 00:00:00 +0000

Effectively using the parquet Rust crate to read data in the Parquet format isn’t too difficult, but more detailed examples than those in the official documentation would really help. In this article I’ll present some sample code to fill that gap.

I assume basic Rust knowledge. You should know how to set up a project with Cargo and know some Rust. The code here is kept simple and skips most error handling for brevity.

Introduction

Recently, a new IPUMS utility needed to read Parquet data, and it needed to be fast and use low amounts of memory. We chose Rust because of its performance, and the fact it had good Parquet support. It required use of a few key parquet crate features I’ll outline in the post. The project was to allow the Stata statistical software to read from Parquet data. While that code has too many Stata specific quirks to make it easy to present as the first example code in an introductory article, the main points outlined next were what made it possible. The utility had to feed Stata data at a fast rate, as if it were reading directly from a data file.

What and Why of Parquet

The Parquet data format groups data into files that you can think of as collections of records, or tables. Unlike typical record-oriented formats, Parquet physically organizes data first by column, to accelerate operations concerned with only a few columns at once. The Parquet format supports high-performance analytic workloads or really any type of data task that you might characterize as requiring a long vertical slice out of horizontally wide data.

On top of the fundamental columnar structure, Parquet employs powerful compression and encoding of data to reduce on-disk storage and retrieval times.

Parquet files have a schema, much like a database table has a schema, though a Parquet schema is potentially more complex than a flat table, with nested schemas.

Parquet in Rust

Since Parquet is built for high performance, it makes sense that you may wish to use it with a high-performance language like Rust.

The Rust Arrow library, arrow-rs, has recently become a first-class project outside the main Arrow project. The official Apache Rust crate supporting Parquet lives in the arrow-rs project.

The Arrow and Parquet projects have undergone a lot of development over the last few years. Last year when I began using the Apache “parquet” crate it was at version 4.0, and they’re now up to 15.0 with many features recently available.

I’ll give example source code to accomplish each of the following:

Printing the schema including logical and physical column types, column numbers and names
Getting parquet file metadata statistics like numbers of rows and row groups
Reading a subset of columns from the file (a “schema projection”.) Doing this could vastly speed up reads.
Extracting data row by row and working with the values

All these are possible and fairly easy with the Rust parquet crate if you can see it done. The documentation has some extremely basic example code which may not be enough to get you started, especially if you’re not super familiar with the Parquet and Arrow APIs for Java or C++. To work out what to do you’ll need to read the source code to the Parquet crate, but even then it helps to know where to look. Reading the test source code can help too.

Arrow-rs gives you three ways to read data from Parquet: a high-level Arrow-backed record batch API, a row-oriented record interface on Parquet files, and a low-level column API.

Those who have worked with Parquet schema types for the Java and C++ Parquet and Arrow APIs will find the Rust implementation familiar.

If you need to work with large amounts of Parquet data efficiently in memory or perform calculations on the data in memory you will want to investigate the arrow module and the Arrow Parquet reader.
If you need to get high performance when accessing a few columns at once you should look at the low-level column reader and writer support in the parquet crate: column API. (It’s not the simplest approach though.)
The simplest and potentially most memory efficient way to access Parquet data is by using the parquet::file and parquet::record modules. That’s what we’ll look at here.

Setting Up

Create a new project with cargo new parquet_examples. Then add the parquet crate as a dependency.

To use the parquet crate in your own project, add parquet = "15.0.0" to your Cargo.toml file (15.0.0 is the latest version available for download as of this writing.) The 15.0.0 Parquet version requires a recent Rust version. I used version 1.59.0. You can update using the Rustup utility.

Now you should be able to add the example code to your src/main.rs file.

Simple Parquet Reader Example

From the documentation for the parquet::record::Row struct, you can see how to set up a reader and process records (not columns) from a Parquet file.

use std::fs::File;
use std::path::Path;
use parquet::file::reader::{FileReader, SerializedFileReader};

let file = File::open(&Path::new("/path/to/file")).unwrap();
let reader = SerializedFileReader::new(file).unwrap();
let mut iter = reader.get_row_iter(None).unwrap();
while let Some(record) = iter.next() {
    println!("{}", record);
}

This gets you started, but leaves a few questions unanswered. How do I work with data? How do I select only the columns I need? You can dig into the Parquet source code and learn how to extract column values from the records; if you have an IDE like VScode that task will be easier. However, it’s not obvious how to use the API effectively for some essential tasks.

Reading the Schema

Parquet allows nested schema definitions, but I’m not going to go into that much here. To start with you probably at least want to know how to read column information from a simple Parquet file made up of a group of columns.

To begin with, understand that Parquet files have file metadata, and schema information. The file metadata has statistics like number of rows and row groups; the schema has the description of columns in the file. A schema can be a group or primitive node: A group node has inside of it a list of other schema nodes which in turn may be group or primitive nodes. In the simplest (and most common) type of Parquet file the schema is a group node that holds a list of primitive columns. The file metadata holds the schema.

Note: A Parquet “file” can actually be multiple files in a directory (normally with a .parquet suffix on its name.) These are sometimes produced by big data frameworks because they’re transforming data in parallel; each individual file represents one worker’s output. Sometimes each file could be part of a partitioned dataset where the files contain related data, like states, provinces, or ranges of serial numbers – really anything sortable. To start with we’ll stick to reading single files.

File Metadata

let file = File::open(&Path::new(&self.parquet_path)).expect("Couldn't open parquet data");
let reader = SerializedFileReader::new(file).unwrap();
let parquet_metadata = reader.metadata();
let _rows = parquet_metadata.file_metadata().num_rows();

Schema

You get the schema from the file metadata:

let fields = parquet_metadata.file_metadata().schema().get_fields();

// Iterate over the fields.
// enumerate() gives us the column number along with the other information
// about the column; the column number can be used to access a column in a Parquet file.
for (pos, column) in fields.iter().enumerate() {
	let name = column.name();
	println!("Column {}: {}", pos, name);
}

Column Types

The parquet::schema module lets you describe a column by a general logical type or by a specific physical type. Until recently the generic types were called Logical Types. These are now known as “converted types” (this is for the transition to Parquet version 4.0.)

The actual type names have changed: Check the documentation in the source code carefully and use the converted type to avoid confusion. Just Googling for “logical types Rust” will serve up old or contradictory information.

You may actually want the physical Parquet type names to assist moving data into other Rust variables. These type names include double, float, int64, int32, byte array, bool, and others. The physical type names correspond to types a programming language might use. To make things even more confusing, in the Rust module where the physical types are defined the enum is named Type. You alias it to something else to avoid a name clash:

use parquet::basic::Type as PhysicalType;

Here’s how you could print a flat Parquet schema. This is a complete program:

extern crate parquet;
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::basic::Type as PhysicalType;
use std::fs::File;
use std::path::Path;
use std::env;
use std::process;

fn main(){
	let args: Vec<String> = env::args().collect();
	if args.len()<2{
		println!("Usage: {} file.parquet", &args[0]);
		process::exit(1);
	}
	
	let parquet_path = &args[1];	
	let file = File::open(&Path::new(parquet_path)).expect("Couldn't open parquet data");
	let reader = SerializedFileReader::new(file).unwrap();
	let parquet_metadata = reader.metadata();               
	let fields = parquet_metadata.file_metadata().schema().get_fields();              
	
	for (pos, column) in fields.iter().enumerate() {
		let name = column.name();
		        
		let p_type = column.get_physical_type();
		// print type names you'd need if a Rust program consumed the data...
		let output_rust_type = match p_type {					
			PhysicalType::FIXED_LEN_BYTE_ARRAY=>"String",
			PhysicalType::BYTE_ARRAY=> "String",
			PhysicalType::INT64=>"i64",
			PhysicalType::INT32=> "i32",
			PhysicalType::FLOAT => "f32",
			PhysicalType::DOUBLE=> "f64",
			_ =>panic!(
				"Cannot convert  this parquet file, unhandled data type'{}'  for column {}",
				&p_type, name),									
		};
		println!("{} {} {}",pos, name, output_rust_type);			
	} // for each column
}		

If you only want to print types and names rather than validating the schema and printing the Rust equivalent type, you could skip the match and print the type names:

println!("{} : {}", name, &p_type);

Read a Schema Projection to Unlock Parquet’s Superpower

Reading only a part of the full schema saves time if you don’t need all the values in the rows. Not only will the retrieved rows be smaller but the big advantage is that the Parquet reader will only need to visit the parts of the Parquet file that have the data you want.

Imagine an extreme scenario where you need to access all records in a very large amount of data. It could be in a database or a raw file of records. We have delivery records for packages handled by a home delivery service. There are a lot of them and they’re big records.

The problem: You need only a delivery date and ship time for every delivery out of the orders table. That table also has eighty other columns and in particular one column full of very large data: A picture of the delivery location and package. You want to access all records at once for some analytical purpose, but only need those delivery times. The whole file of 50,000 orders might take three hundred gigabytes, but the delivery date and ship time would only take up 800 KB or so on their own.

In a traditional arrangement, you might only store a link to the image data to keep the orders table smaller. In a columnar structure you can maintain a simpler system and keep everything together, because you’re reading only the columns you’re interested in. The file is organized not by record, but by column. So, if you don’t care about pictures you don’t read the pictures column, so you don’t read that region of the file. Since the delivery date and ship time columns are each stored in contiguous blocks in the file it takes no longer than it would normally take to read less than a megabyte of data.

Parquet gives you that power. To read serialized records only containing your columns of choice, you must construct a “schema projection”, which is a schema definition with only the columns you want. You pass this to the serialized reader’s get_row_iter() method. If you pass in None the reader reads the entire schema stored in the Parquet file; otherwise you would do something like:

	let mut row_iter = reader.get_row_iter(Some(schema_projection)).unwrap();

Building a Schema Projection

We’ll pass in the column names we want to extract as arguments on the command line. We want to retain the part of the Parquet schema that has columns with these same names.

// consider everything on the command line after the parquet file 
// name to be a column name
let requested_fields = &args[2..];
		
let mut selected_fields = fields.to_vec();
if requested_fields.len()>0{
	selected_fields.retain(|f|  
		requested_fields.contains(&String::from(f.name())));
}			

// Now build a schema from these selected fields:
let schema_projection = Type::group_type_builder("schema")
	.with_fields(&mut selected_fields)
	.build()
	.unwrap();

Reading Selected Data

Now that we have a schema projection we can make a reader and pass the projection to it:

use parquet::file::reader::{FileReader, SerializedFileReader};

let reader:SerializedFileReader<File> = SerializedFileReader::new(file).unwrap();
let mut row_iter = reader.get_row_iter(Some(schema_projection)).unwrap();

As we read rows, we’ll want to manipulate data on the rows, or display it somehow depending on the task. Let’s start with formatting the rows returned by the reader. Here’s a small formatter to make a row with values separated by a delimiter. You can see how data from individual columns can be accessed:

// This is for demonstration purposes; if you have string data to format
// consider using the CSV Writer library or picking your delimiter carefully.
fn format_row(row : &parquet::record::Row, delimiter: &str) -> String {    
	row.get_column_iter()
		.map(|c| c.1.to_string())
		.collect::<Vec<String>>()
		.join(delimiter)
}

The main loop to read Parquet data and turn it into records would look like this.

while let Some(record) = row_iter.next() {	
	println!("{}",format_row(&record, &delimiter));
}

A Working Program

Let’s finish up by putting all the pieces together. The final program will be able to print a flat schema if it consists of simple data types that could reasonably be formatted as a CSV or other type of delimited text data. The same program will alternatively accept a list of column names to extract and format into CSV output or print all data if no column names are given.

While this isn’t much, it’s an actually useful program that lets us convert to CSV so that some tools that can’t read Parquet can work with data originally stored in the Parquet format. Note that string data isn’t escaped, so pick your delimiter carefully or add that logic, preferably a proper CSV writer.

extern crate parquet;
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::record::Row;
use parquet::schema::types::Type;
use parquet::basic::Type as PhysicalType;

use std::fs::File;
use std::path::Path;
use std::env;
use std::process;
use std::sync::Arc;


fn print_schema(
		fields:&[Arc<parquet::schema::types::Type>]
	){
	
	for (pos, column) in fields.iter().enumerate() {
		let name = column.name();		       
		let p_type = column.get_physical_type();
		let output_rust_type = match p_type {					
			PhysicalType::FIXED_LEN_BYTE_ARRAY=>"String",
			PhysicalType::BYTE_ARRAY=> "String",
			PhysicalType::INT64=>"i64",
			PhysicalType::INT32=> "i32",
			PhysicalType::FLOAT => "f32",
			PhysicalType::DOUBLE=> "f64",
			_ =>panic!(
				"Cannot convert  this parquet file, unhandled data type for column {}", 
				name),									
		};
		println!("{} {} {}",pos, name, output_rust_type);			
	} // for each column	
}


fn print_data(
	reader: &SerializedFileReader<File>, 
	fields:&[Arc<parquet::schema::types::Type>], 
	args:Vec<String>){
	
	let delimiter = ",";	
	let requested_fields = &args[2..];
		
	let mut selected_fields = fields.to_vec();
	if requested_fields.len()>0{
		selected_fields.retain(|f|  
			requested_fields.contains(&String::from(f.name())));
	}			
	
	let header: String = format!("{}",
		selected_fields.iter().map(|v| v.name())
		.collect::<Vec<&str>>().join(delimiter));
				
	let schema_projection = Type::group_type_builder("schema")
			.with_fields(&mut selected_fields)
			.build()
			.unwrap();

	let mut row_iter = reader
		.get_row_iter(Some(schema_projection)).unwrap();
	println!("{}",header);
	while let Some(record) = row_iter.next() {	
		println!("{}",format_row(&record, &delimiter));
	}
}

fn format_row(
		row : &parquet::record::Row, 
		delimiter: &str) -> String {  
	
	row.get_column_iter()
		.map(|c| c.1.to_string())
		.collect::<Vec<String>>()
		.join(delimiter)
}


fn main(){
	// Keeping argument handling extra-simple here. 
	// For anything more complex consider using the 
	// "clap" crate.
	let args: Vec<String> = env::args().collect();
	if args.len()<2{
		println!("Usage: {} file.parquet [--schema] [column-name1 column-name2 ...]",
			&args[0]);					
		process::exit(1);
	}
	
	let parquet_path = &args[1];	
	let file = File::open(
		&Path::new(parquet_path))
		.expect("Couldn't open parquet data");
		
	let reader:SerializedFileReader<File> = SerializedFileReader::new(file).unwrap();
	let parquet_metadata = reader.metadata();               
	
	// Writing the type signature here, to be super 
	// clear about the return type of get_fields()
	let fields:&[Arc<parquet::schema::types::Type>] = parquet_metadata
		.file_metadata()
		.schema()
		.get_fields();              
	
	if args.len()>2 && args[2] == "--schema"{
		print_schema(fields);
	}else{
		print_data(&reader, fields, args);
	}	
}	

Implementing The Fastest (Pseudo) Jaro-Winkler Algorithm in Rust

Fri, 24 Jun 2022 00:00:00 +0000

After binging these very nerdy conference videos focused on performance improvements:

CppCon 2019: Andrei Alexandrescu “Speed Is Found In The Minds of People”
CppCon 2016: Timur Doumler “Want fast C++? Know your hardware!”
CppCon 2015: Andrei Alexandrescu “Declarative Control Flow” (this one I actually used as a sleep aid..)

I got the itch to try to speed something up that I’ve been struggling with for at least 5 years: the Jaro-Winkler algorithm.

This is a blog series split into two posts. The first post is on the Jaro Winkler similarity measure itself, how it is constructed, and my implementation of this formula in Rust using bitwise operations. The second is an interesting technical journey I went on in order to improve the performance of the “trailing_zeros()” function.

Part 1: Speeding up Jaro Winkler with rust and bitwise operations

The Jaro Winkler distance is a string metric used to calculate the similarity between two strings. We use it extensively at IPUMS in order to create links between US censuses, because often people will have variations in their names between enumerations. For example, one year someone could have “Jakub” as their name and the next census it could be “Jacob”.

Jaro-Winkler metric explained

For those that aren’t familiar with the Jaro Winkler metric, here’s a quick rundown of what it is.

The Jaro Winkler algorithm computes a similarity score between two strings and is calculated in two parts. The “Jaro” part checks for matching characters and transpositions of characters between the two strings. The “Winkler” part is an adjustment to the “Jaro” score by weighting characters matched at the beginning of the string more than those at the end.

The “Jaro” part

To calculate the Jaro part:

jaro(s1, s2) = ⅓ * ( m / ||s1|| + m / ||s2|| + (m-t) / m)

s1 = the first string
s2 = the second string
||s1|| = length of the first string
||s2|| = length of the second string
m = # of "matching characters"
t = # of "transpositions"

This looks complicated, but is really just calculating an average of 3 different things:

The percent of “matching” characters in the first string: m / ||s1||
The percent of “matching” characters in the second string: m / ||s2||
The percentage of matching characters that aren’t “transposed”: (m-t) / m

The secret sauce in the jaro formula is how it defines “matching” and “transposition”. A character is considered “matching” if it occurs in both strings within a specified maximum distance “d”. This distance “d” is defined as max(||s1||, ||s2||) / 2 - 1, which means half of the length of the longest string minus 1. The number of transpositions is defined as half the number of “matching” characters that are out of sequence.

Examples:

jaro("jake", "joe")
- First find the maximum distance “d” that two characters can be from each other and still considered matching: d = max(||s1||, ||s2||) / 2 - 1 = 4 / 2 -1 = 1
- Then go through each character of the first string and find if it has a match. I’m writing the matches in the form of: (character in s1, index in s1) -> (character in s2, index in s2)
```
j,0 -> j,0
a,1 -> NO MATCH
k,2 -> NO MATCH
e,3 -> e,2
```
- So in this case there are 2 matches and m = 2. The matching characters are “j” and “e”, which occur in the exact same order in both strings, so there are no transpositions. This gives us the formula: jaro("jake", "joe") = ⅓ * ( m / ||s1|| + m / ||s2|| + (m-t) / m) = ⅓ * (2 / 4 + 2 / 3 + 2 / 2) = .722
jaro("amy", "mary")
- Here’s another example with a transposition. First find the maximum distance “d”: d = max(||s1||, ||s2||) / 2 - 1 = 4 / 2 -1 = 1
- Then find the matches:
```
a,0 -> a,1
m,1 -> m,0
y,2 -> y,3
```
- Because the “a” and “m” matches are out of sequence, the number of transpositions “t” is equal to 2 / 2 = 1. Therefore: jaro("amy", "mary") = ⅓ * ( m / ||s1|| + m / ||s2|| + (m-t) / m) = ⅓ * (3 / 3 + 3 / 4 + (3-1) / 3) = 0.806
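
The two worked examples above can be reproduced with a short Rust function. This is only a minimal sketch of the formula as it is defined in this post, not the optimized implementation discussed later; the function name `jaro` and the greedy left-to-right matching strategy are my own assumptions for illustration.

```rust
/// Jaro similarity following the definitions in this post (illustrative sketch).
/// Assumption: each character of s1 is greedily paired with the first
/// still-unmatched equal character of s2 that lies within distance d.
fn jaro(s1: &str, s2: &str) -> f64 {
    let a: Vec<char> = s1.chars().collect();
    let b: Vec<char> = s2.chars().collect();
    if a.is_empty() || b.is_empty() {
        return 0.0;
    }
    // d = max(|s1|, |s2|) / 2 - 1, using integer division and never below 0.
    let d = (a.len().max(b.len()) / 2).saturating_sub(1);

    let mut a_matched = vec![false; a.len()];
    let mut b_matched = vec![false; b.len()];
    let mut m = 0usize;
    for i in 0..a.len() {
        let lo = i.saturating_sub(d);
        let hi = (i + d + 1).min(b.len());
        for j in lo..hi {
            if !b_matched[j] && a[i] == b[j] {
                a_matched[i] = true;
                b_matched[j] = true;
                m += 1;
                break;
            }
        }
    }
    if m == 0 {
        return 0.0;
    }

    // Collect the matched characters of each string in their original order.
    let a_seq: Vec<char> = (0..a.len()).filter(|&i| a_matched[i]).map(|i| a[i]).collect();
    let b_seq: Vec<char> = (0..b.len()).filter(|&j| b_matched[j]).map(|j| b[j]).collect();

    // t is half the number of matched characters that are out of sequence.
    let out_of_sequence = a_seq.iter().zip(b_seq.iter()).filter(|(x, y)| x != y).count();
    let t = out_of_sequence as f64 / 2.0;

    let m = m as f64;
    (m / a.len() as f64 + m / b.len() as f64 + (m - t) / m) / 3.0
}
```

Calling `jaro("jake", "joe")` and `jaro("amy", "mary")` walks through exactly the matches listed above and gives roughly 0.722 and 0.806.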

The “Winkler” part

Finally, we can move on to the “winkler” part of the jaro-winkler formula. The change is small but powerful and can be described with the following formula:

jaro_winkler(s1, s2) = jaro(s1, s2) + L *  P * (1 - jaro(s1, s2))

L = length of common prefix at the start of the 
    string up to a maximum of 4 characters

P = scaling factor for how much the score is adjusted
    upwards for having common prefix

The “winkler” addition allows for a boost to the Jaro score when there are matching characters at the beginning of the strings. The size of this boost is controlled by the scaling factor “P”, which is usually set to 0.1. This is a very useful addition to the Jaro score for our use case of name comparison, because often the beginning characters of names are more useful than other characters. For example nicknames, such as “Jake” for “Jacob”, usually begin very similarly to their full counterparts.
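
As a rough sketch, the Winkler adjustment can be written as a small function that takes an already computed Jaro score. The name `winkler_boost` and its signature are hypothetical, chosen here just to mirror the formula; `p` plays the role of “P” and the prefix length “L” is capped at 4 characters as described above.

```rust
/// Apply the Winkler prefix boost to an existing Jaro score (illustrative sketch).
/// `p` is the scaling factor from the formula, usually 0.1.
fn winkler_boost(jaro_score: f64, s1: &str, s2: &str, p: f64) -> f64 {
    // L: length of the common prefix, looking at no more than 4 characters.
    let prefix_len = s1
        .chars()
        .zip(s2.chars())
        .take(4)
        .take_while(|(a, b)| a == b)
        .count();
    jaro_score + prefix_len as f64 * p * (1.0 - jaro_score)
}
```

For “jake” and “joe” the common prefix is just “j”, so L = 1 and the Jaro score of about 0.722 is raised to 0.722 + 1 * 0.1 * (1 - 0.722) = 0.75.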

The Jaro-Winkler measurement is very useful for us because it gives results which are quite similar to how a human would rate the similarities of two names and is able to be computed very quickly in comparison to other algorithms. At IPUMS, some of the largest datasets that we need to link are the 1930 and 1940 full count historical US censuses. In 1940 there were about 130,000,000 people in the United States, and in 1930 there were about 123,000,000 people. If we were to naively run a Jaro Winkler on every single pair of people then it would take about 1.6e+16 comparisons to see who might be a match. That would take an unreasonable amount of computational power to do, so we start by narrowing down the scope of the problem by only comparing people’s names who match on other attributes like sex, age, and birth place. Unfortunately, even after narrowing down the number of people that we compare, we still have significant performance issues with computing all these string comparisons.

Performance improvements

There are two performance improvements that we have been experimenting with to speed up that process.

Lookup tables

The first major improvement to performance came from the batch_jaro_winkler implementation by Dominik Bousquet. The idea behind this implementation is to create a lookup table for one of the datasets that has tuples of names and pointers to the records with that name. This lookup table is keyed on the letters which are present in the names. For example:

{
  "a" => [("anne", [p1, p2, p3]), ("jane", [p4, p5, p6]), ("aaron", [p7, p8, p9]), …]
  "b" => [("bob", [p9, p10, p11]), ("kalib", [p12, p13, p14]), ("bruno", [p15, p16, p17]), …]
}

Then for each name in the other dataset, you use the lookup table to only compare it against names which have at least one matching letter. For example if you were going to try to find matches for the name “joe”, then you would look up the letters “j”, “o”, and “e” in your lookup table to get the possible matches. This avoids doing computations that would result in very low similarity scores.
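
Here is a simplified sketch of that idea in Rust. It is not the batch_jaro_winkler data structure itself: as an assumption for illustration, the index maps each letter to the positions of names in a slice rather than to tuples of names and record pointers, and the function names are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

/// Map each letter to the indices (into `names`) of the names that contain it.
fn build_letter_index(names: &[&str]) -> HashMap<char, BTreeSet<usize>> {
    let mut index: HashMap<char, BTreeSet<usize>> = HashMap::new();
    for (i, name) in names.iter().enumerate() {
        for c in name.chars() {
            index.entry(c).or_default().insert(i);
        }
    }
    index
}

/// Indices of every name that shares at least one letter with `query`.
fn candidates(index: &HashMap<char, BTreeSet<usize>>, query: &str) -> BTreeSet<usize> {
    let mut out = BTreeSet::new();
    for c in query.chars() {
        if let Some(ids) = index.get(&c) {
            out.extend(ids.iter().copied());
        }
    }
    out
}
```

With the names from the table above (“anne”, “jane”, “aaron”, “bob”, “kalib”, “bruno”), looking up “joe” returns every name except “kalib”, which shares no letters with it and so never needs a Jaro Winkler computation.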

Character bitmasks

The second major improvement was related to how the strings were stored. Generally, programmers think of strings as a sequence of contiguous characters in memory. For example, in the C programming language, the string “Jake” is stored as an array of 5 bytes: one byte for each letter in its ASCII representation and a final null character “\0” which signifies the end of the string. It would be quite natural to implement the Jaro Winkler algorithm using this representation, and almost all implementations do so. This works great most of the time, but starts to run into trouble when comparing very large datasets of strings where each individual pair may have little in common.

This is where a new representation of strings comes into play that I’ll call the “bitmask representation”. In the bitmask representation the string “jake” would be stored as the following structure:

[("j", "1000"), ("a", "0100"), ("k", "0010"), ("e", "0001")].

Each unique character in the string is stored as a 2-tuple (a pair of elements). The first element of the pair is the character and the second element is a set of flags which indicate where in the word that character appears, with a 1 indicating the character occupies that position in the string and a 0 indicating it does not. If the character appears twice in a string, then two of the flags are set to 1. For example “mom” in bitmask representation is:

[("m", "101"), ("o", "010")].
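
A minimal sketch of building this representation in Rust might look like the following. The names `char_masks` and `mask_to_string` are hypothetical, and the layout is an assumption made for illustration: position 0 of the string is stored in the least significant bit of a `u64`, so the helper prints the flags position-first to match the way they are written above. Only the first 64 characters are considered.

```rust
/// One (character, mask) pair per unique character, in order of first appearance.
/// Bit i of the mask (counting from the least significant bit) is set when the
/// character occupies position i of the string.
fn char_masks(s: &str) -> Vec<(char, u64)> {
    let mut masks: Vec<(char, u64)> = Vec::new();
    for (pos, c) in s.chars().take(64).enumerate() {
        let bit = 1u64 << pos;
        match masks.iter().position(|e| e.0 == c) {
            Some(i) => masks[i].1 |= bit,
            None => masks.push((c, bit)),
        }
    }
    masks
}

/// Render a mask as flags with position 0 first, as in the text.
fn mask_to_string(mask: u64, len: usize) -> String {
    (0..len.min(64))
        .map(|i| if (mask >> i) & 1 == 1 { '1' } else { '0' })
        .collect()
}
```

Under these assumptions `char_masks("mom")` yields the pairs for “m” and “o”, and `mask_to_string` renders them as “101” and “010”.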

This new representation allows for the use of bitwise operations. These operations act on binary data – each digit can only be a zero or a one – and because computer architecture is natively binary these operations are very fast. In fact many of the bitwise operations can be performed in a single machine instruction. Further, these operations are sometimes able to take advantage of a processor’s SIMD features, which allows for the processor to run a single instruction on multiple pieces of data in parallel.

A very common example of a useful bitwise operation is the ‘&’ (bitwise AND) operator, which compares two bit strings digit by digit and returns a bit string with 1’s where there was a 1 in both bit strings. In our use case this can tell us when a letter was in the same position in the two names. Let’s say that I want to see if there are any “m”s matching in the strings “mom” and “martha”. The bitmask representations for the “m”s in “mom” and “martha” are: “101000” and “100000” (you’ll notice that I padded the “mom” representation to a length of 6 with extra zeros so that the two matched in length). If you do the ‘&’ operation on those two bitmasks then you get “100000”, which is a new bitmask where all the “1” values represent places where the m’s match each other. Because “mom” and “martha” only have matching “m”s in the first place, the new bitmask has only one “1” in it. So in this example it took only a single machine instruction to check if there were matching “m”s. However if we did this in the traditional way then we would have to traverse each string and check if the characters are “m”s, which would take many many more machine instructions as you are now maintaining pointers, converting ascii to binary, and so on.
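
The “mom” and “martha” comparison can be sketched in a few lines of Rust. The function names are hypothetical and, as an assumption for illustration, position 0 lives in the least significant bit of a `u64`; because both masks are fixed-width integers, the zero padding described above happens implicitly. This only shows the same-position check, not the full matching logic of the algorithm.

```rust
/// Positions of `target` in `s` as a bitmask: bit i is set when the
/// character at position i equals `target` (first 64 characters only).
fn mask_for(s: &str, target: char) -> u64 {
    s.chars()
        .take(64)
        .enumerate()
        .filter(|&(_, c)| c == target)
        .fold(0u64, |mask, (pos, _)| mask | (1u64 << pos))
}

/// Number of positions where both strings have `target` at the same index.
fn same_position_matches(s1: &str, s2: &str, target: char) -> u32 {
    // A single bitwise AND keeps only the positions set in both masks.
    (mask_for(s1, target) & mask_for(s2, target)).count_ones()
}
```

Here `same_position_matches("mom", "martha", 'm')` is 1, because the only position where both names have an “m” is the first one.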

It would be extremely verbose for me to explain all of the small bitwise operations that are in use within the algorithm and would probably require an entire blog post of its own. If you are curious about the full implementation of the algorithm you can check out the code in our (relatively well-documented) repo on GitHub.

SIMD

When running this algorithm on an architecture that supports SIMD operations, the compiler will optimize the code to compare many names all at once with a single machine instruction. In our use case if we have a “jake” in 1940 and are trying to find potential matches in 1930, then we can compare “jake” to many 1930 names all at once. For instance we can look for ‘j’s in a set of 1930 names and find all of the matching ‘j’s in that set with one machine instruction. This is incredibly efficient. The compiler can only do SIMD optimizations like this under very specific conditions. Code complexity such as branching logic often makes this type of optimization impossible. However the core of this implementation of the Jaro Winkler algorithm is a loop exclusively composed of bitwise operations which the compiler can optimize easily. If you are interested in learning more about SIMD optimization you can check out this video from CppCon.

Performance Results

In the end I was able to achieve a 10-15x speedup compared to the batch_jaro_winkler library, and a 40-50x speedup compared to common string comparison libraries. However, there is one caveat: in some cases the algorithm overcounts transpositions, which occasionally produces scores that are lower than those from other implementations. This difference is relatively minor: the average error size is 0.002 or less when testing against names from the 1880 U.S. census. Our next steps with the algorithm will be to extend it to include computing Jaro Winkler scores on multiple names for the same record (first and last), which will be required before we integrate it into our larger linking pipeline.

Upgrading the Shared Conda Installation at ISRDI

Sat, 28 Nov 2020 00:00:00 +0000

Background

The Python ecosystem offers a few ways to manage Python and third-party libraries. For both conda and pip + virtualenv, these solutions are designed for managing environments on an individual’s local computer, not a shared work environment.

From the Conda website:

Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.

It’s clear from this description, especially the “environments on your local computer” part, that the main focus of this technology is for local development.

At ISRDI, we develop and deploy a large amount of software, mostly command-line tools, for our researchers to process large-scale demographic data. These scripts reside on multi-user linux servers, and it’s a requirement that they perform consistently and reproducibly for all team members. For this purpose, we need shared, named Python environments in which dependencies can be managed for all users by safe and controlled means.

We’ve implemented conda as our solution for this need for multi-user shared environments. Conda has a few feature areas that are more user-friendly than the alternatives, making it a better fit for our particular situation:

It provides a slightly cleaner solution than the pip/virtualenv/wheel combination for building binary, installable, portable packages for internal libraries.
We are heavy users of our JupyterHub/Lab set-up, and the Jupyter ecosystem is built to work very nicely with named conda environments, particularly for managing notebook kernels amongst multiple users.
Conda is language agnostic and can install non-python libraries. We have a few places where we’ve pulled non-Python (C++, R) packages into an environment.

In our case, we want conda both to be a server-based solution and to serve as a shared installation in which named environments can be used by multiple users on demand. Everyone working on our production servers uses these shared conda environments when running Python scripts to do data work. We have a custom conda channel where we serve builds of our own internal Python libraries. We also use a number of these named environments as custom kernels for our shared JupyterHub/Lab installation (more on this to come in a future blog post!).

This setup has been a viable and beneficial solution for years at ISRDI, but it has its challenges, most notably upgrading conda itself. Updating to new conda versions has been nearly impossible in the past because conda update conda has broken lots of things every time we’ve done version updates. We desperately wanted to update our conda however, as we were at a stage where the current conda version was no longer supported. Even more critically, as a result of unfixed bugs in the version of conda we were using, many of our shared environments were not able to be updated. Outside of the perils of conda update conda, we also wanted a way to reduce cruft as we moved from version to version, so we wanted more of a “migration” than an “update” process to allow us some intentional housekeeping focus to remove some unused environments and package caches.

To achieve this goal, we devised a way to install multiple installations of conda side-by-side so that we can symlink conda to the current version installation while doing quality checking on the new qa version installation. This allows a full qualification before changing the symlink to the newer version (and thus putting the new installation into production for all users). We perform what is essentially an environment migration by exporting environment specifications from current and then recreating them in the qa installation. By directly invoking the python executables in the qa conda bin, we are able to test these environments before a zero-down-time cutover to the new installation via a one-line command to point the current symlink to the new installation.

Directory organization

We install all conda installations under one folder:

/pkg/ipums/programming/
  └─ conda/
       ├─ current ⇒ v4.3  ← symlink to current conda version
       ├─ qa      ⇒ v4.8  ← symlink to next conda version undergoing QA
       │
       ├─ envs    ⇒ current/envs  ← symlink to folder with current environments
       │
       ├─ v4.3/        ← conda 4.3 root (base) environment
       │    ├─ bin/
       │    ├─ envs/   ← default location for conda 4.3 environments
       │    ├─ pkgs/   ← default location for conda 4.3 package cache
       │    └─ ...
       ├─ v4.8/        ← conda 4.8 root (base) environment
       │    ├─ bin/
       │    ├─ envs/   ← default location for conda 4.8 environments
       │    ├─ pkgs/   ← default location for conda 4.8 package cache
       │    └─ ...
       ├─ kernels/     ← for JupyterHub/JupyterLab, each points to environment-specific ipykernel
       ├─ mpc/         ← our custom conda channel for local packages
       └─ tools/       ← installation directory for the code in this repository

The path /pkg/ipums/programming/conda/current is what all users get as the path to the shared conda environment, with that path’s bin/ folder being in all users’ $PATH when logged in via a bash shell. This structure allows us to leave our production, or “current”, conda intact while allowing us to test and qualify a new qa version in a separate installation. When we are ready to switch to the qa version, we simply change the current symlink to point to this new version.
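The symlink switch itself can be sketched in Python. This is a hedged illustration rather than the command we actually run: the function name repoint_symlink and the ".new" staging name are assumptions, and the rename-over-existing-link behavior relied on here is the POSIX one.

```python
import os
from pathlib import Path


def repoint_symlink(link, target):
    """Point the symlink at `link` to `target`, replacing any existing link."""
    link = Path(link)
    staging = link.with_name(link.name + ".new")  # hypothetical staging name
    if staging.is_symlink():
        staging.unlink()  # clear a leftover from an interrupted run
    os.symlink(target, staging)
    # Renaming over the old link swaps it in one step on POSIX systems,
    # so users never see a moment where the link is missing.
    os.replace(staging, link)


# Example (not run here):
# repoint_symlink("/pkg/ipums/programming/conda/current", "v4.8")
```

Building the new link under a temporary name and renaming it into place avoids the brief gap that deleting and recreating the link would leave.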

Migration process

To begin a migration to a new conda version, we first use a miniconda installer to install the new version in the directory as described above. Then we symlink qa to the new version directory to allow us to both programmatically and manually access the qa installation in the following steps.

Getting information about the current conda install

We then capture some information about the current conda install so we can get ready to reproduce it. We manually created an .ini file with a list of all current environments. This can be done by inventorying the current directory envs/ or with conda env list in the current conda install. We communicate with all of the people responsible for those environments on Slack about whether or not they would like to have their environment migrated into the qa conda installation. If so, we indicate this in the .ini file with = REPLICATE. If not, we mark it as = export to flag it for exporting YAML but not creating it. By doing this we have available specs to recreate any of these environments should the need arise later.

Further, we need to take care to make sure that all environments currently used as kernels in our shared JupyterLab/Hub are replicated.

Here is an example .ini file:

[environments]

# v4.3 environments as of 2020-10-19
# keep
hlink = REPLICATE
nhis_reformat = REPLICATE
super_important_jupyter_kernel = REPLICATE

# do not migrate
old_not_used = export
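A file in this shape can be read with Python’s standard configparser module. The sketch below is one possible way to split the environments into the two groups, not our actual script; the function name read_migration_plan is hypothetical.

```python
import configparser

EXAMPLE_INI = """\
[environments]

# v4.3 environments as of 2020-10-19
# keep
hlink = REPLICATE
nhis_reformat = REPLICATE
super_important_jupyter_kernel = REPLICATE

# do not migrate
old_not_used = export
"""


def read_migration_plan(ini_text):
    """Return (environments to replicate, environments to export only)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep environment names case-sensitive
    parser.read_string(ini_text)
    replicate, export_only = [], []
    for env_name, action in parser["environments"].items():
        if action.strip().upper() == "REPLICATE":
            replicate.append(env_name)
        else:
            export_only.append(env_name)
    return replicate, export_only


replicate, export_only = read_migration_plan(EXAMPLE_INI)
print(replicate)
print(export_only)
```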

Exporting YAML spec files for all environments

We wrote a script to export one .yml file per environment in the .ini file to an env_specs directory in the conda directory structure. As stated above, we are exporting YAML specs for all existing environments, not just those we are migrating.

The script uses the command conda env export --name ENV_NAME --no-builds. The --no-builds flag is important because many of the exact builds¹ are no longer available in package repositories, and pretty much every single environment build will fail with them in our situation. Because we were on version 4.3.22, and the --no-builds flag was broken in that minor version, we chanced a minor in-place conda version update to 4.3.31 where this bug fix was applied so that we didn’t need to do post-processing of the yaml files on the command line (with tools such as sed, cut, grep, etc.) before recreating the environments.

While we want to control the specific version of software in a particular environment, getting the specific build of that version is not important to us, especially when those specific builds sometimes disappear from repositories. In other words, we’re going to trust that v1.3.4 of a package is exactly v1.3.4 no matter where or when it was built.
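For illustration, the export step could look something like the following Python sketch. Only the conda env export command and its flags come from the description above; the function names and the injectable run parameter (which makes the sketch testable without conda) are assumptions.

```python
import subprocess
from pathlib import Path


def export_command(env_name):
    """Build the export command for one environment, without build strings."""
    return ["conda", "env", "export", "--name", env_name, "--no-builds"]


def export_spec(env_name, spec_dir, run=subprocess.run):
    """Write the exported YAML for env_name to spec_dir/<env_name>.yml."""
    spec_dir = Path(spec_dir)
    spec_dir.mkdir(parents=True, exist_ok=True)
    result = run(
        export_command(env_name), capture_output=True, text=True, check=True
    )
    spec_path = spec_dir / f"{env_name}.yml"
    spec_path.write_text(result.stdout)
    return spec_path
```

Calling export_spec once per name in the .ini file would produce one .yml file per environment in the chosen directory.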

Replicating current environments in the QA installation

We wrote a second script to sequentially replicate the environments marked as = REPLICATE in the .ini file in the qa conda installation. We observe the error output to the console as the script runs to see which envs fail. How we managed this during the first migration was to create a spreadsheet from the .ini file, and paste errors from the CLI output into the spreadsheet for manual remediation on an env-by-env basis. We hope future iterations automate this error capture.
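One way the error capture might be automated is sketched below. This is not our script: the function name replicate_all, the injectable run parameter, and the exact path to the qa conda executable are assumptions made for illustration.

```python
import subprocess

# Assumed location of the qa conda executable, based on the directory layout above.
QA_CONDA = "/pkg/ipums/programming/conda/qa/bin/conda"


def replicate_all(spec_paths, conda=QA_CONDA, run=subprocess.run):
    """Create each environment in turn; return {spec_path: error_text} for failures."""
    failures = {}
    for spec_path in spec_paths:
        result = run(
            [conda, "env", "create", "--file", str(spec_path)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # Keep the error text with the spec it came from instead of
            # copying it out of the console by hand.
            failures[str(spec_path)] = result.stderr.strip()
    return failures
```

The returned dictionary could then be written to a CSV file to take the place of the hand-built spreadsheet.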

One thing to note is that we had to add special handling in the script for local legacy noarch conda packages. We identified the list of noarch packages in our custom conda channel and saved these in a text file. If the script found any of these packages in the YAML spec for an environment, it would strip it from the YAML before replicating it, then run a subsequent process to conda install each noarch package once the initial environment was created. We will rebuild these noarch packages now that we have migrated to a newer conda version to avoid this step in future migrations/updates.
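The stripping step can be sketched as a filter over the dependency list of an exported spec. This assumes the YAML has already been loaded into Python objects; the function name and the package names in the usage example are hypothetical.

```python
def split_noarch(dependencies, noarch_names):
    """Separate legacy noarch packages from an exported spec's dependency list."""
    kept, deferred = [], []
    for dep in dependencies:
        # conda entries look like "name=version"; the nested pip section is a dict.
        name = dep.split("=", 1)[0] if isinstance(dep, str) else None
        if name in noarch_names:
            deferred.append(dep)  # install these after the environment exists
        else:
            kept.append(dep)
    return kept, deferred


kept, deferred = split_noarch(
    ["python=3.7.3", "our_noarch_lib=1.2.0", {"pip": ["requests==2.24.0"]}],
    noarch_names={"our_noarch_lib"},
)
print(kept)
print(deferred)
```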

Troubleshooting failed environments

Out of the 60 environments we migrated from 4.3 to 4.8.5, 48 replicated using conda 4.8.5 successfully, while 12 had failures. Some common issues causing failures were:

Custom conda channel packages with really old versions no longer housed in our custom channel.
Noarch packages listed in the pip installed package list often needed to be moved to the conda section of the .yml file. If the package had been installed with both conda and pip in the old environment, it needed to be removed from the list of pip installed packages in the .yml file.
Packages installed from git clones with pip needed to be removed from the .yml file and manually installed in the qa environment after the new environment is recreated.

Testing

We ran tests in several of the qa environments once we had recreated them to make sure the environment was behaving as expected with the qa conda. We took extra care to make sure we tested the environments that initially failed and needed some manual intervention, especially because several of these support essential internal services (mesos cluster, JupyterLab, etc.)

Wrapping up loose ends

There will be a future blog post to describe our shared JupyterHub/Lab installation, and how we have reconfigured that to make the conda update process easier and less interdependent. For now, the gist is that we have now containerized this deployment, and it needs to be pointed at the new conda installation when it is made current.

Finally, we change the current conda symlink to the qa version and remove the old qa symlink. If everything is broken and things are falling apart, our old current installation is still just hanging out in the conda directory, and we can simply switch the symlink back to buy us time for investigation and a fix without blocking our users’ work during that time.

Some final thoughts

If Stack Overflow posts are any indication, we are far from the first work group to have conda update conda ruin our week with broken environments. We struggled to find much if anything written about managing a shared conda environment such as we have set up. Now that we have a path forward for updating our conda, we are even more convinced this is the right architecture for us. The shared environments ensure replicable and reproducible results with our software across users and/or data sets, and save enormous amounts of time in project on-boarding and set-up because the environments are automatically the same across users.

Are you using a shared conda installation at your institution? Please let us know because we’d love to compare notes!

Acknowledgements

Ben Klaas and Jimm Domingo edited drafts of this blog post and collaborated on the described project.

¹ To be clear, in the conda ecosystem a build hash tags one specific build of a specific version.

GPL: Do We Need to Wait for the Courts?

Mon, 02 Nov 2020 00:00:00 +0000

In an earlier post, Fran, our IT director, shared the story of how we struggled with the ambiguity that arose from linking IPUMS software with GPL code. I participated in some of the discussions, and provided feedback on that post. I’m not an attorney, just a developer who’s had to wrestle with these issues throughout my career.

Until the Court Decides

Midway through that story, Fran quoted the GPL FAQ where the Free Software Foundation (FSF) acknowledges the presence of legal ambiguity:

Where’s the line between two separate programs, and one program with two parts? This is a legal question, which ultimately judges will decide.

He concluded the story, describing IPUMS’ decision to follow the less-than-ideal option that’s common practice in the R community for linking GPL and non-GPL code – a practice which acquiesces to the uncertainty while waiting “for the legal community to clarify the issue, if ever”.

This need for legal clarification about GPL and linking is also noted by Van Lindberg – an attorney, developer and author – in his book Intellectual Property and Open Source: A Practical Guide to Protecting Code:¹

Copyright law, especially as applied to computer software, is a difficult subject. Until a court rules on the exact terms of the GPL in exactly these circumstances, we just don’t know how the law deals with the issue of linking and derivative works.

But I wonder… do we have to wait for the courts?

Is the court the only way to resolve this ambiguity? Statements like those above can easily be interpreted as saying that the legal system is the only avenue to resolution. But might there be another way forward? Could this issue with the GPL be fixed outside of the courtroom?

When & Where Does GPL Apply?

Lindberg devotes a whole chapter in his book to how to work with the GPL and its legal ambiguity. The very first sentence in the chapter acknowledges the multitude of questions about the GPL:

A lot of the most difficult questions in free and open source software revolve around the GPL.

Yet he continues the opening paragraph praising what the GPL has accomplished:

The GPL has a lot of things going for it: it is the single most common open source software license, it has brought together a large and vibrant community of developers, and it is a brilliant hack, socially and legally.

He continues that, despite its accomplishments, the GPL has exasperated not just open source programmers, like us here in ISRDI IT, but legal professionals too:

At the same time there is no single license that is more mistrusted or reviled than the GPL. Many open source developers refuse to accept or release code under the GPL because it imposes restrictions at the same time that it grants freedoms. I know from personal experience that the GPL gives most lawyers fits.

Lindberg identifies two key issues that drive people’s strong opinions of the GPL:

In short, very few people have a balanced or nuanced view of the GPL – they either love it or hate it. Speaking in broad generalizations, though, I think that these strong emotional reactions arise from two core issues.

The first issue is the philosophy of free software. More than any other single document, the GPL has come to embody the free software movement, so people’s reactions to the GPL mirror their opinions of free software as a moral imperative. … These social issues are interesting but beyond the scope of this book.

The second issue, though, is quite appropriate to our discussion: legal ambiguity. (emphasis added) There is basically no argument that the GPL is a valid and enforceable license. There is, however, a lot of confusion about when and where the GPL applies.

That’s the confusion recounted in our earlier post – confusion that we spent a lot of time and effort struggling with.

“We Just Don’t Know”

This confusion has been around for 3 decades. Lindberg highlights the issue’s longevity:

Nevertheless, there is a persistent issue that won’t go away – whether linking programs together creates a derivative work. If linking creates a derivative work, the GPL applies to the linked program; otherwise, the GPL doesn’t apply.

In legal practice, this arises as a common concern of clients just getting into open source. This question is usually phrased as either, “Can I load and use a GPL-licensed library without applying the GPL to my application?” or, “Do I have to apply the GPL to my plug-in for a particular program if that program is licensed under the GPL?”

I won’t keep you in suspense; the short answer is that we don’t know.

We still don’t. Because there’s no definitive answer, we used our best judgment to navigate through all the confusion and make a decision so we could get on with other work.

Linking = Derivative Work?

According to Lindberg, the crux of the matter is “whether linking creates a derivative work.”

If linking creates a derivative work, the GPL applies to the linked program; otherwise, the GPL doesn’t apply.

In other words, “the scope of the GPL is intrinsically tied to the scope of copyright”.

The GPLv2 makes an explicit tie between derivative works under copyright and the reach of the GPL; the GPL applies to “either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications” (emphasis added in the book). The GPLv3 similarly ties its interpretation to copyright law.

… Copyright law must be interpreted to determine what constitutes derivative work; the GPL intends to go only as far as copyright law does.

The ambiguity arises because GPL extends the definition of a derivative work beyond the law.

The more fundamental problem is that the arguments over linking and licensing are really arguments over the scope of copyright…

… The problem is that in a number of public statements from the FSF, the tactical decision has been to take an expansive view of copyright, applying the GPL in the broadest range of situations possible.

… The result is uncertainty and a perception, fed by the FSF itself, that the GPL is more infective than the license and the law may support.

Lindberg acknowledges that this long-running controversy has been very difficult to solve:

The controversy over linking and licensing… isn’t an issue that is easily resolved. There are arguments and prominent open source experts on both sides of the divide. For example, Eben Moglen (attorney for the FSF and founder of the Software Freedom Conservancy) and Lawrence Rosen (former general counsel and director of the OSI[Open Source Initiative]) disagree on the scope of linking and licensing.

He devotes a lot of his GPL chapter to examining the details of the controversy. He concludes the chapter with a Q&A section, with the intent of providing guidance. In one answer, he explains why his own interpretation of some linking scenarios differs from the GPL FAQ:

The GPL FAQ was written in inexact language, and gives the impression that the rules regarding derivative works may have greater reach than current copyright law allows. The FSF has repeatedly stated, however, that they believe in copyright minimalism and that the GPL should not be interpreted to extend beyond the reach of copyright.

Of course, Lindberg’s final answer acknowledges that the courts ultimately need to resolve the ambiguity:

(Q) Can I depend on the answers in this Q&A to keep me out of trouble?

(A) No. This is our best understanding of copyright law as it stands right now,² but it could change tomorrow – and nobody really knows until these questions are resolved in a court of law.

Can FSF Resolve the Ambiguity?

But do we really need to wait for the courts? Could the FSF resolve the issue outside of the courtroom? Could the FSF modify the GPL and its FAQ to fix the ambiguity? If they remove any confusing language that extends the definition of a derivative work, would the ambiguity be eliminated?

What if, instead, they simply stated the conditions under which the license applies? Rather than trying to define linked programs as derivative works, the GPL could simply state that the act of linking with GPL code and distributing the linked program requires that the program be licensed under the GPL. That would avoid arguing over whether the linked program is a derivative work or not.

Imagine if the FSF revised the GPL and its FAQ to eliminate all ambiguity without waiting for court action. This long overdue clarity – simply stating unambiguously that the GPL applies when linking to GPL code – would benefit the whole free and open source community.

¹ It’s a well-written resource that I highly recommend for an open-source developer’s bookshelf.
² Note: the book was published a decade ago, in 2009.

More Better Python: 10 Cool New (to me) Python Things

Wed, 08 Jan 2020 00:00:00 +0000

It’s the time of year when people compile lists of things, so I thought I’d join the fray with a list of 10 things I’ve recently folded into my own Python toolbox.

1. Paint it Black

Black is a code formatter that takes the reins over a tedious job.

As my co-worker Joe put it succinctly, “using Black is so freeing.”

Black is really well described in the first paragraph of its documentation:

By using Black, you agree to cede control over minutiae of hand-formatting. In return, Black gives you speed, determinism, and freedom from pycodestyle nagging about formatting. You will save time and mental energy for more important matters.

Well said. Since adopting Black in my workflow (and adopting automatic blackening as described in the next item), I don’t really pay much attention to formatting any more. There’s no need. 98% of what Black does looks great. The remaining 2% is passable, and if that 2% really grates at you because it isn’t just so, consider re-evaluating what matters in code writing.

2. Automatic Black and Flake8 Before Git Commits

Black (auto-formatting) and Flake8 (PEP8 linter) are great tools for writing clean code, but they only work if you use them. Using them manually: sure. Using them automatically: YES PLEASE!

There are a number of ways one might fashion a precommit git hook to do black and flake8 before allowing a commit, but I found Precommits using Black and Flake8 to be just the ticket. This is a really cool way of doing automatic execution of Black and Flake8 before a git commit, which for my (and maybe your!) purposes is exactly where I want those tools to do their magic.

3. Napoleon

Sphinx is a technology for producing documentation for Python projects from the code’s associated docstrings. It’s an invaluable tool for delivering API documentation to developers. The docstrings you have to write in ReStructuredText though…seriously awful.

:param path: The path of the file to wrap
:type path: str
:param field_storage: The :class:`FileStorage` instance to wrap
:type field_storage: FileStorage
:param temporary: Whether or not to delete the file when the File instance is destructed
:type temporary: bool
:returns: A buffered writable file descriptor
:rtype: BufferedFileStorage

MY EYES! With the Napoleon Sphinx extension, I can write the above in Google-style docstrings (Numpy-style also supported) like so:

Args:
    path (str): The path of the file to wrap
    field_storage (FileStorage): The :class:`FileStorage` instance to wrap
    temporary (bool): Whether or not to delete the file when the File
       instance is destructed
   
Returns:
    BufferedFileStorage: A buffered writable file descriptor

Napoleon under the hood takes these legible docstrings and pre-processes them to ReStructuredText to be fed into Sphinx to create your API docs. My library documentation is so much easier to write now.
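Turning Napoleon on is a small change to a project’s Sphinx configuration. A minimal sketch of the relevant part of conf.py, assuming autodoc is what pulls in the docstrings:

```python
# conf.py (Sphinx configuration file)
extensions = [
    "sphinx.ext.autodoc",   # pulls docstrings out of the code
    "sphinx.ext.napoleon",  # converts Google/NumPy-style docstrings to reStructuredText
]
```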

4. How to use print(). Wait, WAT?

Really? I’m going to waste an entire part of this article on print()? I have no doubt if you’re reading this you know how to Hello World.

Hear me out though: have a read through Your Guide to the Python Print Function and see if you can get through it without finding something new and valuable. For example, I’ve been using an awkward combination of sys.stdout.write() and sys.stdout.flush() for years to cobble hold-the-newline print statements that don’t buffer, when I could have been doing print("isn't this easier?", end="", flush=True).
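Here is a small sketch of that pattern. The helper name show_progress is hypothetical, and the io.StringIO buffer stands in for the console only so the result can be inspected.

```python
import io
import sys


def show_progress(message, stream=sys.stdout):
    """Write message with no trailing newline and flush it immediately."""
    print(message, end="", flush=True, file=stream)


buffer = io.StringIO()
show_progress("step 1...", stream=buffer)
show_progress(" done", stream=buffer)
print(repr(buffer.getvalue()))  # 'step 1... done'
```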

5. TQDM

If you write command-line programs that push feedback to the console, you’ve undoubtedly needed to deal with how best to message progress while long iterations are proceeding. The ideal UI for this in many cases is a progress bar, and my de facto approach for this involves the low-overhead, full-featured library TQDM.

tqdm means “progress” in Arabic (taqadum, تقدّم) and is an abbreviation for “I love you so much” in Spanish (te quiero demasiado).

Instantly make your loops show a smart progress meter - just wrap any iterable with tqdm(iterable), and you’re done!

Well, not quite. The only rub I have with TQDM is that the defaults for the tqdm progress bar are a bit of a mess. When I use TQDM, the typical thing I’m processing is files, and the thing I want to see progress for is how far along I am in processing those files as well as the percentage completion. So, I wrote this simple wrapper around TQDM that uses default values I’m happy with while still allowing the full power of customization by passing through arguments (or overriding those defaults) to tqdm():

from tqdm import tqdm


def progress_bar(iterable, **kwargs):
    """Wrapper on tqdm progress bar to give sensible defaults.

    Args:
        iterable: a data structure to be iterated with progress bar.
        kwargs: pass-through arguments to tqdm
    Returns:
        tqdm: progress bar with sensible defaults
    """
    defaults = {
        "bar_format": "{l_bar}{bar}|{n_fmt}/{total_fmt} {percentage:3.0f}%",
        "ncols": 80,
        "unit": "files",
        "desc": "Files",
    }
    # Fill in each default only where the caller has not supplied a value.
    for k, v in defaults.items():
        if k not in kwargs:
            kwargs[k] = v
    return tqdm(iterable, **kwargs)

This progress_bar() wrapper function can be invoked exactly as tqdm() is, but now the default progress bar is much more sensible. Happy progress_bar()ing!

6. Python’s stdlib has so. much. stuff.

Sure, there’s a positively stunning number of 3rd party libraries available at PyPI, but I have to remind myself: always start with the libraries provided by the Python Standard Library (stdlib). Writing a program using only stdlib makes it far more portable and deployable. Some personal favorites:

collections.defaultdict: create dicts where missing keys return a default value. Avoids the tedious pattern of my_dict = {}; if someKey not in my_dict: do_this() else: do_that(). Since no checking of existing keys is necessary, when using a defaultdict that code pattern reduces to simply do_that().
pathlib.Path: treat filepaths as objects. Stop it with the os.path functions already! The most important library you can learn if you deal with filepaths.
functools.lru_cache: when you use the @lru_cache decorator, subsequent calls with the same arguments to a function will fetch the value from the in-memory lru_cache instead of re-running the function. This is supremely useful if your function is computationally expensive to run from scratch each time. Be careful though, you can really chew up a lot of RAM if you @lru_cache indiscriminately, as you are holding the results of the functions in memory.
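A quick sketch of all three favorites in action. The table names, file path, and function here are made up purely for illustration.

```python
from collections import defaultdict
from functools import lru_cache
from pathlib import Path

# defaultdict: no need to check whether the key exists before appending.
tables_by_year = defaultdict(list)
for year, table in [(2010, "P1"), (2020, "P1"), (2020, "P2")]:
    tables_by_year[year].append(table)

# pathlib: pull a path apart without string slicing.
spec = Path("env_specs") / "hlink.yml"
renamed = spec.with_suffix(".yaml")

# lru_cache: repeated calls with the same argument reuse the stored result.
calls = []


@lru_cache(maxsize=None)
def slow_square(n):
    calls.append(n)  # record each time the body really runs
    return n * n


slow_square(12)
slow_square(12)  # served from the cache; the body does not run again
print(dict(tables_by_year), spec.stem, renamed.name, calls)
```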

stdlib modules I should use but keep forgetting to:

collections.Counter stop writing your own counting code!
string.Template how many hand-cobbled, complicated format strings do I need to write before using string.Template?
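As a reminder to myself, a minimal sketch of both; the data and the message text are invented for the example.

```python
from collections import Counter
from string import Template

# Counter: tally items without writing the loop yourself.
levels = ["tract", "county", "tract", "state", "tract", "county"]
counts = Counter(levels)
top_two = counts.most_common(2)

# Template: named placeholders instead of a hand-built format string.
summary = Template("Processed $n files in $folder")
message = summary.substitute(n=3, folder="/tmp/census")

print(top_two)
print(message)
```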

Two invaluable resources for exploring stdlib:

PyMOTW3: Python3 Module of the Week is a great resource of code-by-example style writeups of stdlib modules. Indeed, all of the above links go there!
Modern Python Standard Library If books are your thing, this is the one for stdlib info. Like PyMOTW3, it adopts a code-by-example model for teaching stdlib, and is a more comprehensive resource. I know price is a fungible thing, but at the time of writing this post the ebook version of this was $5. Money well spent.

7. Visual Studio Code Insiders + RemoteSSH + Python

I will readily admit I am a caveman. I do a great deal of my code authoring via command-line vim. My first pass at debugging is almost always with print() statements, which is literally called “caveman debugging”.

Lots of people are beholden to heavyweight IDEs like PyCharm. PyCharm et al. are great tools, but you either give yourself over to them or become disenchanted with the overhead, restrictions, and particulars of such tools. As a caveman, I lean strongly towards the latter. Microsoft’s Visual Studio Code forges a middle ground, a lightweight IDE extensible for specific applications. For my purposes, the RemoteSSH and Python extensions make VS Code a compelling tool.

The work I do at ISRDI, including code development, requires access to a shared network drive with all of our data and metadata, accessible via ssh on Linux servers. Using an IDE in this environment can be tricky. A typical solution might be to NFS mount or sshfs mount the drives locally and then use an IDE locally. Anyone who does this regularly knows it to be a shaky proposition as well as a hassle. The VS Code Remote-SSH approach is novel: over ssh to your remote host it drops in a remote headless “installation” (in quotes because nothing is actually compiled/installed). The VS Code front-end remains locally driven, giving you the speedy quickness of a desktop GUI, while the VS Code back-end is run on the host. Once initially configured, it’s remarkably seamless and easy to use. While JupyterLab via JupyterHub is generally my go-to daily productivity tool, I lean on VS Code a lot for certain things. For example, it’s really good at guiding you through git merge conflicts.

The “Insiders” version (a fancy name for a published beta branch) is what I went with early because that’s how they released Remote-SSH and Python extensions initially. My guess is that these are all in the production version, but since Insiders has turned out to be totally stable for day-to-day use, I continue to recommend this version.

8. better_exceptions

Stack traces from raised exceptions are obviously a key tool in debugging Python programs. With better_exceptions, it gets even better.

For example, here’s a program intended to convert a list of Fahrenheit temperatures to Celsius temperatures, but there’s a bug:

gp1  ~$ cat buggy.py
fahrenheit_temps = [
32, 5, 6, 15, 26, 212, 24, 33,
-55, 30, '20', 31, 55, 15,
15, 60, 1, -20, 1, ]

celsius_temps = []
for temp in fahrenheit_temps:
    celsius_temps.append(int((temp-32)*(5/9)))

Running this program will result in a TypeError when it hits the string in the list:

That’s the stack trace without better_exceptions. Now let’s take a look at the same exception with better_exceptions installed:

In the enhanced exception, better_exceptions improves on the standard stack trace by:

Pointing directly at the part of the code that failed
Printing the values of variables at the code line of failure
Color coding the exception in a helpful manner.

In this trivial example, you can see that the TypeError was hit when hitting the string ‘20’, and not for example when hitting a negative value earlier in the list of numbers.
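If you want a rough, standard-library-only way to see which value tripped the exception, something like the sketch below works. It is not what better_exceptions does internally; it just records the position and value by hand, using a shortened copy of the list and the usual (F - 32) * 5/9 conversion.

```python
fahrenheit_temps = [32, 5, 6, 15, 26, 212, 24, 33, -55, 30, '20', 31, 55]

celsius_temps = []
bad_values = []  # (position, value, error message) for anything that fails
for position, temp in enumerate(fahrenheit_temps):
    try:
        celsius_temps.append(int((temp - 32) * (5 / 9)))
    except TypeError as exc:
        # Subtracting an int from the string '20' raises TypeError;
        # negative numbers earlier in the list convert without trouble.
        bad_values.append((position, temp, str(exc)))

for position, temp, message in bad_values:
    print(f"index {position}: {temp!r} -> {message}")
```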

So that’s a simple example, but you (hopefully) are writing more complicated code, with objects and very big data structures and outside libraries, etc. This is where better_exceptions truly shines.

better_exceptions is also wonderful in that it is a “set it and forget it”. Once you have it installed, you get a free lifetime subscription to better_exceptions in all the code you execute from the terminal.

9. bpython

It’s hard not to have a pretty strong bias for IPython. As a REPL, it is a dependable workhorse and a huge step up from the default python interpreter. More importantly perhaps, it provides the principal kernel that powers Jupyter Notebooks, which have been a game changer for reproducible code workflows. I’ve been using IPython for years. It’s great.

But…could there be a compelling alternative command-line REPL to IPython? YES!

From the bpython about page:

“bpython doesn’t attempt to create anything new or groundbreaking, it simply brings together a few neat ideas and focuses on practicality and usefulness”

Check out these features:

In-line syntax highlighting
Readline-like autocomplete with suggestions displayed as you type.
Expected parameter list for any Python function.
“Rewind” function to pop the last line of code from memory and re-evaluate.
Send the code you’ve entered off to a pastebin.
Save the code you’ve entered to a file.
Auto-indentation.

If that’s not enough to sway you to give bpython a try, check out these screenshots

bpython uses the curses library (ergo: no native Windows support) to do its magic, and as a result can provide a lot more interactivity than IPython can from a console. Give it a shot. My guess is you’ll use it once or twice and it will win you over like it did me.

10. Attend PyCon

Maybe this should have been the first one in the list…PyCon is a GREAT way to get in touch with what’s going on in the general Python community, and gets credit for a lot of the items listed above. For example, last year I both attended a talk on Black by its author and spoke on the conference room floor with Microsoft people about Visual Studio Code and its new support for remote ssh and Python. The conference is exceptionally well attended and organized and has something to offer across many perspectives and skill levels.

If you can’t make it, browse through some of the talks from PyCon 2019. There are some great ones in there.

Credits

My co-worker Joe Grover is always jonesing for new tools, and a lot of the things above (e.g. Black, better_exceptions) first came on to my radar through Joe.
PyCon, as mentioned above, is a great spark generator for all things Python.
I heard about the print() article on the Talk Python to Me podcast episode Top 10 RealPython Articles of 2019, which covered it and several other interesting articles. RealPython is a great resource for focused Python articles and tutorials.
I wrote this blog post using the browser-based markdown editor StackEdit.

Reflections on Creating an API Program

Mon, 06 Jan 2020 00:00:00 +0000

More than two years ago, IPUMS launched an ambitious plan to create an API Program and make IPUMS data available programmatically via API. Today, IPUMS is thrilled to announce that our IPUMS NHGIS Metadata and Data Extract APIs are leaving beta and reaching the v1.0 milestone! This marks the first time that IPUMS data has ever been publicly available programmatically via a non-beta API.

As we reach this important milestone, this is a great time to look back and reflect on our API journey as an organization. This post will take a look at some of the challenges we encountered as we built an API Program from scratch, and some of the decisions that helped get us here today.

Creating an API Habit

As an organization we had not focused on building APIs previously, so we knew we would need to think about how to make space for this new sort of work. Very early on, we formed what we eventually called the API Program Support Team (APST), made up half of developers and half of researchers (our equivalent of the “business” folks) and chaired by the IT Director (who straddles both worlds). This new team created a space where we could begin to tackle some of the most fundamental business and technical questions, such as developing an API roadmap aligned with our business goals, defining the minimum infrastructure we needed to begin developing and publishing APIs, establishing the metrics by which we would measure our API Program’s success, and fostering API development best practices and standards across our development teams.

Perhaps the one thing we were focused on more than anything else was the creation of an “APIs as First-Class Products” mindset. We knew that to be successful APIs could not be viewed as a technical concern or an internal IT project. Rather, they had to be seen as products in their own right, a means towards novel ways of growing our mission (which in our case is getting more of our data into the hands of more users and new types of users). Stakeholder buy-in was an important task for us from the beginning, so in those early days we held focus groups and interest meetings to begin the long process of pivoting towards an API-minded organization.

An early stakeholder presentation

For the past two years, the APST has worked to educate the organization on APIs and their benefits, to gather stakeholder perspective and priorities, and to assist development teams as our initial APIs have been built out. The APST is also the group in charge of overarching API Program tasks such as measuring key success metrics and providing the shared infrastructure and scaffolding needed to support our API Program, such as managing the API gateway and developer documentation services and developing an API styleguide for our developers.

Example page from our internal API Styleguide

A final key thing we did in the early going was to recognize that we were not experts and that we would benefit from bringing in an outside perspective. Over the past two years, we’ve had multiple visits from an API consultant who has helped us with everything from refining our API vision to effective stakeholder communications to the nuts and bolts of how best to actually design and build these APIs.

One API or Many?

One of the big design challenges we hit early on was whether to conceptualize our API offering as one unified IPUMS API or a series of distinct APIs. A unified API would be conceptually clean for our end users, but the reality is that we serve up a wide variety of data that differs in scale, structure, and organization. Wedging these all into a single API would be cumbersome at best.

Where we landed was something of a compromise - we have multiple APIs, but with a design principle that similar APIs should behave similarly. For example, we will have a Microdata Data Extract API and an NHGIS Data Extract API, and for those APIs the endpoints and HTTP verbs will be the same where possible, even though the payloads will necessarily be different due to data structure differences. As one example, IPUMS NHGIS data is organized by table, while IPUMS USA data is organized by variable, so the vocabulary of the payloads will differ to reflect that. A similar story will be the case for our Metadata APIs and eventually our Data Streaming APIs, as well.
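To make the “similar APIs should behave similarly” principle a little more concrete, here is a rough sketch in Python. It is not the actual IPUMS API: the URL pattern, collection names, and payload fields are all hypothetical and exist only to show the same verb and endpoint shape carrying differently structured bodies.

```python
# All paths, collection names, and payload fields below are hypothetical.

def build_extract_request(collection: str, payload: dict) -> dict:
    """Same HTTP verb and URL pattern for every collection; only the body differs."""
    return {
        "method": "POST",
        "path": f"/extracts?collection={collection}",
        "body": payload,
    }


# Table-oriented vocabulary for NHGIS-style data.
nhgis_request = build_extract_request(
    "nhgis",
    {"datasets": {"example_dataset": {"tables": ["T1", "T2"]}}},
)

# Variable-oriented vocabulary for microdata-style data.
usa_request = build_extract_request(
    "usa",
    {"samples": ["example_sample"], "variables": ["AGE", "VALUEH"]},
)

for request in (nhgis_request, usa_request):
    print(request["method"], request["path"], sorted(request["body"]))
```

A client written against one collection only has to learn a new payload vocabulary, not a new workflow, to use the other.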

We strive to standardize as much as possible, and we considered things like URL design very carefully to design a scheme that will be consistent yet flexible enough to adapt to all of our products. As we’re still early in the process, we don’t yet know how successful we ultimately were, but we’re hopeful!

Our Data Model is Not Obvious

A second design challenge we encountered is how to communicate our data model to our users so that they could assemble data requests. Unlike more ubiquitous domains such as e-commerce or photo collections or social media messages, the structure and even the vocabulary of the IPUMS data model is not widely understood and can be challenging to comprehend. IPUMS data involves some pretty specific demographic and internal-to-IPUMS concepts such as variable and dataset groups, NHGIS codes, breakdown values, extents, universes… the list goes on. How do we make our data model accessible and intuitive for our users, particularly for new non-traditional user groups such as data journalists?

So many options! How to make this intuitive via API?

On our websites we have the ability to provide lots of context and interface elements to help users understand the choices they are making as they browse and assemble a data request. We don’t have those same opportunities with the API, so we had to think carefully about request payload design. In some cases, we changed the names of concepts to make more sense for external users. In other cases, we simply tried to write really good developer documentation and code samples.

When operating in a niche space like we are with the diversity of data we are trying to provide, it’s probably unrealistic to expect that our API will be as intuitive to use as a more mainstream or narrowly scoped API. Nevertheless, we have put a lot of energy into making the API as intuitive and usable as we can, given these constraints.

Real-Time Streaming of IPUMS Data is Hard

The final major design challenge we had in our early API efforts was whether to aim for real-time streaming of data back to the users right from the get-go, which is the most intriguing target in terms of unlocking new models of IPUMS data dissemination. Real-time data streaming would present some quite difficult technical challenges - it’s not uncommon for IPUMS users to make data requests on the order of 10, 20 or more gigabytes, and for our more popular products we currently have queue wait times that occasionally stretch to an hour or more. We would have to tackle these challenges if we were going to offer up an API that provided real-time data streaming. We weren’t sure we wanted to go that way with our first API - we were more comfortable doing something a bit smaller in scale.

We compromised by building our Data Extract APIs first. In this API, the user does not get data in real-time, but rather submits an extract request to be enqueued alongside the extract requests coming in from our websites. We then outfitted the API with a status endpoint so that users can check the status of their requests. When the extract is complete, users can use the API to retrieve the data extract zip file, and then can programmatically unpack it and operate on the data as needed.

‘status’ is complete, can now download zip file
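The submit, check status, download cycle looks roughly like the sketch below from the user’s side. To keep it runnable without network access it talks to a small in-memory stand-in rather than the real API, so the class, method names, and status strings are all assumptions made for illustration.

```python
import time
from itertools import count


class FakeExtractService:
    """In-memory stand-in for an extract API; every name here is hypothetical."""

    def __init__(self, polls_until_done: int = 3):
        self._ids = count(1)
        self._polls_left = {}
        self._polls_until_done = polls_until_done

    def submit(self, payload: dict) -> int:
        """Enqueue a request and hand back an extract number."""
        extract_id = next(self._ids)
        self._polls_left[extract_id] = self._polls_until_done
        return extract_id

    def status(self, extract_id: int) -> str:
        """Report 'queued' a fixed number of times, then 'complete'."""
        if self._polls_left[extract_id] > 0:
            self._polls_left[extract_id] -= 1
            return "queued"
        return "complete"

    def download(self, extract_id: int) -> bytes:
        """Return placeholder bytes standing in for the extract zip file."""
        if self._polls_left[extract_id] > 0:
            raise RuntimeError("extract is not finished yet")
        return b"placeholder zip contents"


def wait_for_extract(service, extract_id, poll_seconds=0.0, max_polls=20):
    """Poll the status endpoint until the extract is complete; return the poll count."""
    for attempt in range(1, max_polls + 1):
        if service.status(extract_id) == "complete":
            return attempt
        time.sleep(poll_seconds)
    raise TimeoutError(f"extract {extract_id} not complete after {max_polls} polls")


service = FakeExtractService(polls_until_done=3)
extract_id = service.submit({"description": "example request"})
polls = wait_for_extract(service, extract_id)
archive = service.download(extract_id)
print(f"extract {extract_id} ready after {polls} polls, {len(archive)} bytes")
```

Against a real service you would use a poll interval of seconds or minutes rather than zero, given that queue waits can stretch to an hour or more.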

We recognize that this might be an awkward model for interacting with IPUMS data programmatically as there is now added delay and code complexity for the user as they must check on the request status and then download and deal with the zip file. However, the ability to provide this solution to users relatively quickly versus the real-time data streaming option tipped our API roadmap towards this approach. We still intend to provide real-time data streaming APIs in the future, and have already begun early work towards that goal, but in the meantime there are many users and use cases that can benefit from the current solution.

Conclusion: Don’t Let Perfect be the Enemy of Good

Our API journey thus far can probably be summed up this way: don’t let perfect be the enemy of good. Compromise and pragmatism have been important principles for our success thus far. Building an IPUMS API Program, especially as an organization with no prior API development history, felt like a hugely daunting task two years ago. By putting in place good support structures and figuring out how to make incremental progress and release value to our users as quickly as possible, we’ve begun to chip away at the task and make real progress, even as we’re still a long way from our ultimate API goals. I hope our experience resonates with others and serves as inspiration for those who are just beginning their API journey!

Our API Program is growing up so fast! (https://developer.ipums.org)

Exploring the Weird World of IPUMS Data Availability

Thu, 03 Oct 2019 00:00:00 +0000

The IPUMS data products aim to make comparison of census and survey data across time and nations as easy as possible. The first step in effectively using IPUMS is to discover data that’s available for the time and place you’re interested in studying. The IPUMS data request websites’ goal is to give users an efficient method for finding and requesting data from IPUMS.

“Efficiently finding data” presents some interesting user interface challenges when it comes to IPUMS. In the rest of this article we’ll take a visual, high-level view of the IPUMS metadata that drives the data request system and underpins our data integration process.

To start with, we’ll focus on the U.S. Census (IPUMS USA) since most of us have some familiarity with a national census, but the discussion also applies to other IPUMS data drawn from health and employment surveys.

Why Data Discovery is Not Dead Simple

In a perfect world the U.S. Census Bureau would have asked every possible question of every person since the very first census and the questions would always have had the same range of responses.

In that utopian world, everyone would have happily responded to the lengthy questioning and census-takers would all have behaved exactly the same, making no mistakes, and would have displayed perfect penmanship and consistent spelling.

In the real world, however, we must build IPUMS with the census and survey data we actually have. You can request a customized IPUMS extract from our site by requesting the original datasets (census or survey year) and variables (questions) you’d like to include in your IPUMS extract. In a perfect world, you could select what you wanted from a simple list of datasets and another simple list of variables. Unfortunately, not all questions were asked in every year, which complicates the data discovery process.

The U.S. Census changed throughout the history of the nation. Some questions got dropped, while others were added. The census also had two versions – a long form and a short form – for many decades, and most people would get the short form while a smaller subset would get the long form. The long form had many more questions, which meant that for some questions, there were far fewer respondents than for others. In 2001, the U.S. Census Bureau introduced the American Community Survey as a more detailed survey that goes to about 3.5 million households annually. As a result, the Bureau dropped the long form as of the 2010 census.

Because of these changes in the U.S. Census over time, if IPUMS took the naive approach to requesting data – simply picking and choosing what you wanted from static lists of datasets and questions – then the resulting dataset would have lots of missing data in years where a chosen question was not asked.

Users could resort to reading documentation on each census question before requesting it, to learn if it was asked in every year they want in their analysis. That could take a lot of time.

Let’s imagine you want to study the relationship between home values and time spent commuting to work. You discover the IPUMS “VALUEH” variable (home value). And you find the “TRANTIME” variable (travel time to work, one-way). So this looks promising. “TRANTIME” was first asked in 1960, while “VALUEH” was in the Census in 1930 and 1940, not in 1950, then again in every census since 1960. That’s good to know - now we’ve determined that for this research question, the data is fully available since 1960.

A more difficult example: Say you learn about the “DURMARR” variable (duration of marriage). You can study qualities of long-married people with this. Are their homes worth more than those of briefly married people of the same ages? (Probably.) Anyway, you proceed to request “VALUEH”, “DURMARR” and “AGE”. Unfortunately you’ll discover “DURMARR” only exists in 1900, 1910 and 1950, all years when “VALUEH” wasn’t asked.
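The check you just did in your head is a set intersection. Here is a small sketch using the years mentioned above; the table is illustrative only: it stops at 2000, ignores the ACS, and the “AGE” row is an assumption that the variable is present in every listed census.

```python
# Illustrative availability table: variable name -> census years it appears in.
availability = {
    "VALUEH": {1930, 1940, 1960, 1970, 1980, 1990, 2000},
    "TRANTIME": {1960, 1970, 1980, 1990, 2000},
    "DURMARR": {1900, 1910, 1950},
    "AGE": set(range(1900, 2001, 10)),  # assumed present in every decennial year
}


def common_years(variables):
    """Years in which every requested variable is available."""
    year_sets = [availability[name] for name in variables]
    return sorted(set.intersection(*year_sets))


print(common_years(["VALUEH", "TRANTIME"]))        # [1960, 1970, 1980, 1990, 2000]
print(common_years(["VALUEH", "DURMARR", "AGE"]))  # []
```

An empty result is exactly the situation where a naive request would come back full of missing data.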

Finding Only Data That Exists - the Availability Grid

To simplify this “data availability exploration” process, we present a construct called an “availability grid” on our websites when users are selecting data. The grid shows the cross-product of choosable variables from within a selected topic group by chosen datasets (or all datasets if none have yet been chosen.) We mark every cell in the grid – every variable-dataset combination – as available or not available.

This way, users know ahead of time if their desired question was asked in a particular year. The horizontal axis represents census years, while the vertical is variables. Users can then scan a column to know what variables exist in a given year, and a row to see in which years a particular variable exists.

Screenshot of an availability grid with a few samples and variables
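In text form, the same idea can be sketched in a few lines of Python. This is only an illustration of the cross-product, not how the IPUMS websites build the grid, and the availability data is a small hand-typed sample.

```python
# Hand-typed sample: variable name -> years in which it is available.
availability = {
    "VALUEH": {1930, 1940, 1960},
    "DURMARR": {1900, 1910, 1950},
}
years = [1900, 1910, 1930, 1940, 1950, 1960]


def render_grid(availability, years):
    """One row per variable, one column per year; X = available, . = not."""
    width = max(len(name) for name in availability)
    rows = [" " * width + " " + " ".join(str(year) for year in years)]
    for name in sorted(availability):
        cells = " ".join(
            ("X" if year in availability[name] else ".").center(4) for year in years
        )
        rows.append(f"{name:<{width}} {cells}")
    return "\n".join(rows)


print(render_grid(availability, years))
```

Scanning a column shows what exists in a given year, and scanning a row shows the years in which a variable exists.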

The data availability grid is foundational to the whole IPUMS concept. Integrating data across time and countries is an imperfect business. We make available data as comparable as possible but we can’t conjure comparable answers out of thin air.* (*Except when we do. We impute some variables when we have enough other information to allow it. See IMPREL.)

I mentioned that the data availability grid is foundational to IPUMS, and indeed it manifests itself not just within the web UI but at several points of the IPUMS data pipeline, starting at the data integration step, continuing through the modeling of the metadata driving the website and data extract engines, and ending in the UI for selecting data.

Navigating Available Data

So far, so good - the availability grid approach has definitely proven its value over many years now. But what if you need to look at a large set of years and variables simultaneously? The grid approach, in a literal sense, does not scale:

Screenshot of a full availability grid

The x-axis extends far to the right of even the largest screen and you can scroll down for page after page to see all of the variables.

Other IPUMS data products have the additional difficulty that they contain many more questions than the U.S. Census, and sometimes many more datasets as well. So the dimensions of this style of grid can get out of hand very easily and a one-size-fits-all solution is hard to find. It makes for an interesting user interface design challenge.

We encourage users to select a small set of datasets (“samples”) up-front to limit the width of the x-axis. To limit the size of the y-axis we guide the user to view subsets of variables grouped by topics or alphabetically. For products with many more variables or datasets than IPUMS USA we make more variable topic groups and make dataset selection easier. For instance IPUMS CPS, with nearly six hundred datasets, has the Rotation Pattern Explorer tool to help users navigate its many datasets.

The rise of mobile computing with much smaller screen sizes makes the limitations of the grid approach even more acute. Due to the screen real estate limitations, the IPUMS Abacus mobile real-time tabulation app had to dispense with a grid altogether and rely on a dynamic, interactive data selection approach: users choose a year and immediately all topics and variables narrow down to what’s available for that year. If users choose a topic and variable first, the choices of years narrow accordingly as well. The dynamic feedback loop actually makes exploring what’s possible to tabulate more efficient than the grid approach. It’s an elegant solution, especially when few variables and years are involved.

Screenshot of IPUMS Abacus showing variables available for the 1940 Census sample

This “dynamic narrowing” approach could be confusing if a user could select dozens of variables or years at once, potentially missing exactly which year or variable choice it was that may have dramatically reduced their choices in the other dimension. Fortunately for Abacus, tabulation naturally limits itself to a few variables and datasets at a time, so the effect of each choice on the other dimension is evident.
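The narrowing logic itself is simple to sketch. The functions and data below are hypothetical, not taken from Abacus, and they assume a variable stays selectable only if it is available in every chosen year.

```python
# Hypothetical sample: variable name -> years in which it is available.
availability = {
    "AGE": {1930, 1940, 1950, 1960},
    "VALUEH": {1930, 1940, 1960},
    "TRANTIME": {1960},
    "DURMARR": {1950},
}


def variables_for_years(chosen_years):
    """Variables still selectable once the user has picked these years."""
    chosen = set(chosen_years)
    return sorted(name for name, years in availability.items() if chosen <= years)


def years_for_variables(chosen_variables):
    """Years still selectable once the user has picked these variables."""
    remaining = set().union(*availability.values())
    for name in chosen_variables:
        remaining &= availability[name]
    return sorted(remaining)


print(variables_for_years([1940]))                # ['AGE', 'VALUEH']
print(years_for_variables(["VALUEH"]))            # [1930, 1940, 1960]
print(years_for_variables(["VALUEH", "DURMARR"])) # []
```

With only a handful of selections, it is easy to see which choice emptied the other list; with dozens, that becomes the confusion described above.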

Visualizing IPUMS Data Availability

Let’s now take a bird’s-eye view of variable availability across all years for all variables in IPUMS-USA (U.S. Census data). In this graph, the x-axis is time and the y-axis is variables, sorted by their variable group (topic).

IPUMS-USA Data availability Grid

The dark spots show where data is available. You see many gaps, with availability generally increasing over time as automation of the Census took hold and more questions could be asked or inferred every year.

The right-hand side has more coverage because since 2001 we include the annual ACS (American Community Survey) data, which contains many more questions than the Census.

The vertical gap towards the right-hand side represents the 2010 decennial Census, which had only a handful of questions and no long form. If you’re wondering why it takes up two vertical spaces instead of one, that’s because the Puerto Rico and U.S. states Censuses are considered separate datasets.

The grid has too many points to actually label them; you’d need a software “microscope” to navigate this grid directly, zooming in on topics of interest and plucking records from the pile near your area of interest. You could imagine a third dimension representing available cases (number of people responding to this question) or total categories (possible answers to a question). You could imagine turning the graph ninety degrees on its back so that the most recent data lives on the “surface” of the graph, and to go back in time you’d “dive” down into the graph instead of moving left along the x-axis.

Visualizing data like this is the closest thing I can think of to William Gibson’s description of wandering through a giant corporate database in Cyberspace:

Cyberspace. A consensual hallucination experienced daily by billions of legitimate operators, in every nation, by children being taught mathematical concepts… A graphic representation of data abstracted from banks of every computer in the human system. Unthinkable complexity. Lines of light ranged in the nonspace of the mind, clusters and constellations of data. Like city lights, receding…

William Gibson

Okay, maybe I shouldn’t get too excited.

Let’s look at how some other data products can be visualized in this way:

The IPUMS-International availability Grid

Here we see a fairly well populated bottom half of the grid with the top half being quite sparse. Why? It’s because the top half represents country-specific questions which mirror the original census questions from each country, sometimes uniquely from each year, and each of these questions is represented as a unique variable in our metadata model. While these country-specific variables aren’t very “integrated” in the IPUMS sense, they are quite useful, so we compromise on the amount of integration required to be in the IPUMSI data, keeping a core of widely comparable variables while still making country-specific data easily available.

Now look at this IPUMS Higher Ed survey data:

The Higher-Ed Survey Data availability Grid

Here we have a much smaller number of datasets and variables, and the survey appears to be quite consistent across time, which makes sense since this survey was designed to measure the same things each time it was taken, and there aren’t many years in which the survey might have changed.

On the other hand:

NHIS Health Interview Data availability Grid

The NHIS health survey has had a vast number of questions asked every year since 1963. Some of these years had supplemental questionnaires, which cause those years’ variable counts to balloon.

Conclusion

The diagrams you’ve just seen give you a macro view of the diversity of the landscape of variable-dataset availability that the IPUMS extract system is designed to navigate. To provide for effective data discovery, we employ a combination of interactive user interface approaches along with robust documentation to help users navigate the available data.

The GPL License and Linking: Still Unclear After 30 Years

Thu, 04 Oct 2018 00:00:00 +0000

It all started with a simple idea from my colleague who maintains our ipumsr R package, which we released on CRAN under the Mozilla Public License v2. His idea was “I’d like to fork the readr package from CRAN and add functionality to deal with hierarchical data so I can use it in ipumsr. But readr has a GPLv3 license.”

From there, it got anything but simple - we unwittingly waded into a decades-old debate.

Understanding Why GPL Exists

It’s useful to spend a few minutes getting some background context on the GNU General Public License (GPL).

Richard Stallman, author of the GPL and founder of the Free Software Foundation, wrote the first version of the Emacs text editor in 1975 and made the source freely available. In 1982, James Gosling wrote the first C version of Emacs. For a while, Gosling made the source code of his version freely available too, and Stallman subsequently used Gosling Emacs in 1985 to create GNU Emacs. Later Gosling sold the rights for Gosling Emacs to UniPress, which then told Stallman to stop distributing the source code. Stallman was forced to comply. After that experience, he set out to create a license to ensure that would never happen again. He wanted to preserve access to the source code of any derivative software that benefited from free software.

There are two key properties Stallman put into the GPL that are critical to this discussion. The first is that it’s a very “sticky” or “viral” license. The GPL is what is known as a strong copyleft license, which means that all derived works of the GPL’ed work must also be licensed under GPL if distributed. The second is that the GPL requires that the source code be freely available to anyone who wants it. Combined, these properties mean that you cannot use GPL code in any software system you distribute unless you also make your source code available under a GPL license. (Offering software purely as a hosted service is generally not treated as distribution under the GPL itself; the separate Affero GPL was written to cover that case.)

IPUMS, We Have a Problem

In contrast, our ipumsr package is released as Mozilla Public License v2 (MPLv2), which is the preferred license we use at IPUMS when releasing open source code. MPL is a weak copyleft license, which means that if you modify MPL’ed code and distribute it, you need to make the source code for those modifications available, but you’re not required to also make available your code that simply uses the MPL’ed code. We chose MPL because it strikes a good balance between keeping our own work, including improvements to it, freely available while not restricting what people can do with their own software just because they find our library useful. In other words, we don’t want to impose our licensing philosophy on other people beyond our own code. We want to preserve their freedom to license their code as they wish.

It’s also worth noting the third major class of license, the permissive license. The best known examples of a permissive license are probably the MIT and Apache licenses. The main idea of a permissive license is to place as few restrictions as possible on the use of the code (not even requiring that distributed modifications of the code also be released as open source). If you look at the MIT license for example, it essentially says only that you must keep the copyright notice present in the code, and that if you break anything you’re on your own. That’s it.

Sometimes the GPL sort of restriction and “viral license propagation” of a strong copyleft license is what you want, but it’s not what we want, so my colleague knew we had an issue to solve. He had some ideas about how to work around that and comply with GPL, and he was coming to me for a second opinion.

Let’s Create an Intermediary

Our idea was that we would create a third package, “hipread” (hierarchical ipums reader). hipread would be a fork of readr, which we would then modify to add the hierarchical support. We would release hipread as GPL, which is naturally required since we would take GPL readr code and modify it to make hipread.

Essentially, hipread would be a small wrapper/extension of readr, we’d release it as GPL, we’d use that library in our ipumsr library, and we’re all good. Right?

Not so fast…

What Constitutes a Derived Work?

When we started researching our proposed solution to see if it met all licensing requirements and goals, we determined that the first part of our idea - writing hipread as a wrapper extension around readr and releasing hipread as GPL - would be a fine option. However, we then came across quite the surprise…

It seems that there’s quite a bit of debate around whether or not simply using a GPL’ed library (e.g. via an import or use statement) in your code constitutes creating a derived work and therefore subjects your code to the GPL license!

Just using hipread in ipumsr might require ipumsr to be released under GPL? Yikes. And the more we researched it, the more confusing it became.

Let’s first examine what the GPLv3 license itself has to say about this. The relevant section is titled “Corresponding Source” in the license text:

The “Corresponding Source” for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work’s System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work.

“Corresponding Source for a work means all the source code needed to…run the object code… Corresponding Source includes…the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require.” Well, that sure makes it sound like if ipumsr imports readr, it has created a larger derivative work of readr.

The GPL FAQ confirms this strict interpretation:

Linking a GPL covered work statically or dynamically with other modules is making a combined work based on the GPL covered work. Thus, the terms and conditions of the GNU General Public License cover the whole combination.

I suppose this isn’t that surprising. Agree with Stallman’s viewpoint or not, given his experience it makes sense that his intent would be that anyone who uses GPL’ed code to build a derived software system should have to release their source code back to the world, and that his interpretation of what makes a derived system would be fairly broad.

On that latter point, the FAQ goes on to say:

If the main program dynamically links plug-ins, and they make function calls to each other and share data structures, we believe they form a single combined program, which must be treated as an extension of both the main program and the plug-ins. If the main program dynamically links plug-ins, but the communication between them is limited to invoking the ‘main’ function of the plug-in with some options and waiting for it to return, that is a borderline case.

Now we’re getting oddly situational about what does and does not constitute a combined work. That last bit about “only invoking main” is a bit confusing in how that would apply to the readr-hipread-ipumsr relationship. readr has a function to read a csv file which is used by hipread which is in turn used by ipumsr. The file to read is an option to that function. Is that use case covered under this exception? Or because the function is returning a data structure which ipumsr is going to interact with, are we indeed creating a combined work? Not very clarifying.

It gets even stranger when you dig into the FSF’s answer to this FAQ question: What is the difference between an “aggregate” and other kinds of “modified versions”?

Where’s the line between two separate programs, and one program with two parts? This is a legal question, which ultimately judges will decide. We believe that a proper criterion depends both on the mechanism of communication (exec, pipes, rpc, function calls within a shared address space, etc.) and the semantics of the communication (what kinds of information are interchanged).

If the modules are included in the same executable file, they are definitely combined in one program. If modules are designed to run linked together in a shared address space, that almost surely means combining them into one program.

By contrast, pipes, sockets and command-line arguments are communication mechanisms normally used between two separate programs. So when they are used for communication, the modules normally are separate programs. But if the semantics of the communication are intimate enough, exchanging complex internal data structures, that too could be a basis to consider the two parts as combined into a larger program.

The interesting part of this answer is not whether we’re distributing an aggregate or not, but rather the insight this answer offers into what FSF considers to be a single program. They come back to “if the semantics of the communication are intimate enough”, but also assert “this is a legal question, which ultimately judges will decide.”

However, I think it’s fair to conclude at this point that in the opinion of the FSF, they want what we’re doing to be bound by GPL. Our ipumsr library doesn’t work without significant interaction with readr, so therefore in their eyes we’ve created a combined work.

Is the FSF Position Enforceable?

For a counterpoint, we can turn to Larry Rosen, former general counsel of the Open Source Initiative. He wrote a concise Linux Journal article on his opinion of what constitutes a derivative work in 2003. His key conclusion is:

The meaning of derivative work will not be broadened to include software created by linking to library programs that were designed and intended to be used as library programs. When a company releases a scientific subroutine library, or a library of objects, for example, people who merely use the library, unmodified, perhaps without even looking at the source code, are not thereby creating derivative works of the library.

and he goes on to assert why he feels this is important:

You should care about this issue to encourage that free and open-source software be created without scaring proprietary software users away. We need to make sure that companies know, with some degree of certainty, when they’ve created a derivative work and when they haven’t.

Rosen speaks in more depth about the topic as it specifically relates to the GPL in a 3-page white paper titled “The Unreasonable Fear of Infection” and comes to the same strong conclusion that linking to GPL code is not enough to meet the definition of a derivative work.

Malcolm Bain, a Barcelona lawyer, also explored this topic in depth in a 2011 white paper, but frustratingly concludes, more or less, “it’s unclear”.

This pattern of confusion is reflected across the internet as a whole. You can find plenty of people who argue that using a library does expose your code to the GPL conditions. You can find plenty who say no, it doesn’t.

Ultimately, it has not been sorted out in a court yet, so there’s no clear answer as to the enforceability of the GPL as the FSF wants it to be. In the meantime, lots of folks are avoiding the GPL because of this uncertainty.

LGPL: A Failed(?) Attempt to Address This Problem

By 1991, shortly after the GPL was created, people started to realize that while the GPL is useful for protecting whole software applications, it created complications for library code. The FSF subsequently released the first version of the GNU Library General Public License, now known as the Lesser General Public License (LGPL), as a compromise between the strong copyleft of the GPL and the permissive nature of licenses like the MIT license. The LGPL is a weak copyleft license and it’s very similar to the MPL that we use.

The basic idea of a “weak copyleft” license is “I want to ensure that if you modify my code and distribute the modified code, then you give the source code for those modifications back to the world freely, but I really don’t care to restrict how you can simply use my code as part of your larger system.” If someone writes a library and wants to ensure:

that the source code is available for modified copies of the library that are distributed, and
that application developers using that library have the freedom to license their application code as they want,

then the LGPL was designed for those library developers.

Unfortunately, the Free Software Foundation doesn’t seem to like their own weak copyleft license much. In addition to intentionally labeling it “Lesser”, they specifically encourage folks to NOT use LGPL for libraries. They argue that doing so allows free libraries to be used in proprietary software, and that we shouldn’t be giving proprietary software companies any more advantages. Rather, we should create unique functionality, release it as GPL, and force companies to release their code for free if they want to use the library functionality.

LGPL has remained much less common than the GPL. According to this study, as of 2016 LGPL was 6% of the open source license “market”, whereas GPL(v2 and v3) was 34%.

In any case, readr uses GPL, not LGPL, so LGPL can’t help us with our ipumsr problem.

So… What Do We Do for Our ipumsr Problem?

It seems we’re left with three less-than-ideal choices:

Release ipumsr under the GPL, which goes against our desire to let anyone benefit from ipumsr, whether that’s free software that prefers to use a different license than GPL, or commercial software, or whatever.
Jump in with many, many others in the R community (and elsewhere) and use GPL’ed libraries in our non-GPL’ed code, and wait for the legal community to clarify the issue, if ever.
Write our own library that provides the same functionality as readr and license it as we wish.

The goal we’re trying to achieve here is to simply make IPUMS data easier to use for R users. We don’t charge for IPUMS data, and if you know anything about our mission, we strongly believe in keeping data free. We’re not going to profit in any way from incorporating readr in our library.

It’s true that someone downstream may take ipumsr and use it in a way that they profit from it. I don’t know how the authors of readr would feel about that. At IPUMS, we’d be ok with that. If they distributed a modified version of our library, we’d want the source code for those modifications to be released back to the public, and the MPLv2 license that we use formalizes that wish. But their own code that simply uses our library? That’s for them to decide.

So, we’re going with option 2. It doesn’t feel great, but it is the most pragmatic option and is in keeping with the spirit of being as helpful as we can to the R community. If the enforceability of the GPL on code that simply uses a library is ever sorted out (and it’s been 30 years, so we’re not holding our breath) we will of course adjust accordingly, but until then, we’re just happy that our library will be available for others to use with few strings attached.

And on a pragmatic note, ipumsr already imported multiple GPL’ed packages before this issue ever came onto our radar, so we’re not creating any additional exposure we didn’t already have. That shows our prior ignorance on this topic. But it’s also completely in line with what hundreds if not thousands of other CRAN packages are doing today, so perhaps our ignorance can be forgiven?

What About the R Community at Large?

In full disclosure, I am not a member of the R community. I’ve never written R code beyond a few tutorials I did to get the flavor of it. But as an IT Director who is trying to provide guidance to our organization about how we can share our code with the world in the most usable way, the GPL is a big mess that I would prefer to avoid altogether, at least until the linking issue is sorted out.

Usage of the GPL has been in general decline, along with the other copyleft licenses. In fact, between 2012 and 2018, permissive licenses overtook copyleft licenses as the most commonly used open source licenses.

And yet, the R community seems to prefer GPL as one of its favored licenses. If this is due to the community being especially principled about free software, I absolutely respect that. If, on the other hand, this propagation of GPL to so many libraries is simply due to folks being unaware of the implications, perhaps it’s time for a reckoning around this topic.

Putting aside for a moment the motivations for using GPL for so many libraries, the R community definitely has a potential looming disaster around the GPL linking issue. Spend a few minutes clicking around R’s CRAN package repository and see just how many non-GPL packages are importing GPL’ed packages. Just looking at packages which import readr, a random sampling showed almost half of them were distributed with licenses other than GPL. If a court ever were to rule that merely importing a GPL’ed library requires that code to also be GPL’ed, there’s going to be an awful lot of scrambling that would need to happen.

As it turns out, I don’t need to merely wonder about the community’s intentions. The R Consortium conducted a survey last year on this topic. Here’s some of what they found:

60% of respondents want other software developers to be able to use their package(s) without imposing license requirements on the software that uses their package (via API), with only 15% disagreeing.
The most popular license used among respondents is ‘GPL-3’ at 35% with ‘GPL-2 or GPL-3’ a close second at 34% and ‘GPL-2’ next at 24%.

Those two findings confirm that there is indeed a lot of confusion about licensing in the R community. Perhaps it is time for that reckoning after all.

Big Data on a Laptop: Tools and Strategies - Part 3

Wed, 19 Sep 2018 00:00:00 +0000

Introduction

Doing analysis on big data doesn’t have to mean mastering enterprise Big Data tools or accessing hundred-node compute clusters in the cloud. In this series, we’ll find out what your humble laptop can do with the help of modern tools first.

In the first part of this series, we looked at how to get out of Excel and work with big datasets more effectively. In the second part, we introduced Parquet to achieve better performance and scalability. In today’s final part, we will discuss how to harness the power of all of the cores of your system to do work in parallel using Spark.

Introducing Spark: Spark + Parquet, Harness all the Cores on Your System and Beyond

Spark is a distributed data processing engine that supports batch, iterative, and streaming types of workloads. Learning to use Apache Spark can seem intimidating, but don’t worry, getting started is pretty easy. While it’s true that learning everything about the Spark API, exploiting Spark on a cluster, and mastering performance tuning is a big undertaking, here I’ll show that limited but very useful Spark applications are simple to create and run.

You can start Spark in “local” mode with almost no configuration. You can run interactively with Python, Scala or even Ruby. When your data outgrows your local machine you can put your Spark application onto a larger machine or even a cluster of machines, and as long as you move the data to the new environment the application should function identically (except faster).

The Spark API, Very Briefly

Spark was written in Scala (a JVM language) and provides a Java API in addition to the Scala API. There’s also an extensive Python API. The main disadvantage to Python is simply slower performance as a language compared to Java and Scala, but in many cases, where Python acts mainly as the glue code, the difference is negligible since most of the data processing load gets handled by the Spark libraries, not Python.

Spark supports Spark SQL for a query language, as well as user-defined functions callable from Spark SQL, so you can rely on your SQL skills if that makes the most sense. Spark provides other ways to manipulate data as well.

The foundational data abstraction in Spark is the RDD (Resilient Distributed Dataset). An RDD distributes data across nodes in a Spark cluster so that a Spark application can access the data in parallel, running on multiple cores and machines. You can work directly with RDDs, but the interface is fairly low level. Generally, you will want to work with Spark’s higher level APIs instead: DataFrames and Datasets. Today, we’ll use DataFrames for working with our data.

DataFrame is an API for working with structured data in a table-like form, which will be familiar to users of relational databases or spreadsheets. Results of Spark SQL queries can be returned and further processed using the DataFrame API. For importing and exporting data, the API includes lots of functionality to ingest and output various formats, including JSON, text, CSV, and Parquet.

When to Use Spark

It’s important to understand a little bit about how Spark executes queries so that you know what to expect performance-wise and when to use Spark vs. the previously discussed approach with Python and Parquet. Spark imposes a bit of overhead due to its need to coordinate multiple processes/processors. In “local” mode this overhead is small, but for certain problems, it can become a factor. Also, Spark’s implementation of Parquet is not the most efficient, so there is some performance cost if your dataset has many variables (columns).

What does this mean in practice? Spark will be great if you have lots of rows of data and a reasonable number of variables (say, < 100), which is a pretty common use case. However, if you use Spark to perform a simple aggregation on relatively large data with many columns you may start to notice slightly worse performance compared to directly using Parquet with C++ or with the PyArrow library, just because of the overhead of inter-process communication and the way Spark implements Parquet.

Using Spark even for these sorts of queries may still make sense if you’re developing a solution that you eventually need to scale up to an actual compute cluster, since with Spark it will transition seamlessly. Another reason you may choose Spark may be to get the full SQL query support, or if you simply prefer to code in Java or Scala instead of Python.

Unlike the Parquet examples with PyArrow from the last post, Spark can use a multi-core system for more than just reading columns in parallel - it can take full advantage of all the cores on your machine for computation as well. Assuming you have enough RAM to hold the data involved in the computation, you’ll see a big speed-up.

Setting Up Spark on Your Laptop

You can set up Spark on Linux, Windows or MacOS. You’ll need Java 8. If you’re on Windows 10 I recommend using WSL (Windows Subsystem for Linux) instead of installing under native Windows, because most of the examples you’ll find online are Linux-based. My example below is on Linux.

Go to the Apache Spark downloads page and select your version (I’ll use the latest as of this writing.)

 # Untar with
 tar -xf  spark-2.3.1-bin-hadoop2.7.tgz

 # Now before anything else, set your SPARK_HOME directory.
 # Assuming you put Spark in /opt/spark/current:

 export SPARK_HOME=/opt/spark/current

Spark and Python

If you want to work only in Python you can just install with ‘pip’:

pip install pyspark

This installs pyspark with a bundled Spark engine. Then you can run ‘pyspark’ and begin an interactive session. (Beware: if you already have SPARK_HOME set, PySpark will try to use that version of Spark instead of the version you just installed as part of PySpark.)
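The SPARK_HOME caveat is easy to check before launching a session. The following is a minimal sketch using only the Python standard library; the function name spark_home_status is hypothetical and is not part of PySpark.

```python
import os


def spark_home_status(environ=None):
    """Report whether SPARK_HOME would override pip's bundled Spark.

    `environ` defaults to the real environment; passing a dict makes the
    function easy to try out without changing your shell.
    """
    env = os.environ if environ is None else environ
    spark_home = env.get("SPARK_HOME", "").strip()
    if spark_home:
        return "SPARK_HOME is set to %s; pyspark will use that installation" % spark_home
    return "SPARK_HOME is not set; pyspark will use its bundled Spark"


if __name__ == "__main__":
    print(spark_home_status())
```

If the first message appears and you wanted the pip-installed engine, unsetting SPARK_HOME in your shell before running ‘pyspark’ avoids the mix-up.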

If instead you downloaded Spark from spark.apache.org as I did above, you can go to the ‘bin’ directory in your Spark installation and run the ‘pyspark’ program located there.

Once you get your PySpark shell up, you can read Parquet data directly and get the result as a DataFrame. In this example, I’ll use the “extract65.parquet” dataset again - the same one I used in the previous post. It contains individual person records from samples of the U.S. Census for years 1960-2016 and has columns full of IPUMS variables concerning commuting, work, industry and home ownership.

$ pyspark

>>> df=spark.sql("select count(YEAR) from parquet.`./extract65.parquet`")
>>> df.show()
+-----------+
|count(YEAR)|
+-----------+
|   17634469|
+-----------+

The query isn’t actually executed until you call “show()”.

You can read the data into a dataframe:

>>> df_extract65 = spark.read.load("./extract65.parquet")

then programmatically work with the data in SQL:

>>> df_extract65.createOrReplaceTempView("travel")

>>> df_commute = spark.sql("select YEAR,int(sum(TRANTIME*PERWT)/60) as hours_commuting from travel where YEAR> 1980 group by YEAR order by YEAR")

You can also see the schema for the data frame:

>>> df_commute
DataFrame[YEAR: int, hours_commuting: int]

and the query results:

>>> df_commute.show()
+----+---------------+
|YEAR|hours_commuting|
+----+---------------+
|1990|       41527560|
|2000|       52791989|
|2010|       55293482|
|2016|       63408548|
+----+---------------+

The “TRANTIME” variable holds responses to the question of how many minutes it typically takes someone to travel to work. I have divided by 60 to give the hours spent commuting to work on a workday. (For total travel time you could roughly double these numbers.)
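To see what sum(TRANTIME*PERWT)/60 computes, here is a plain-Python sketch on three made-up records (the values are invented for illustration and do not come from extract65). Each sampled person stands in for PERWT people in the population, so their minutes are multiplied by that weight before summing.

```python
# Hypothetical sample records: (TRANTIME in minutes, PERWT person weight).
# These values are invented for illustration only.
records = [
    (30, 100),  # one respondent representing 100 people, 30-minute commute
    (45, 80),
    (0, 120),   # no travel time reported
]


def weighted_commute_hours(rows):
    """Mirror int(sum(TRANTIME*PERWT)/60) from the SQL query."""
    total_minutes = sum(trantime * perwt for trantime, perwt in rows)
    return int(total_minutes / 60)


print(weighted_commute_hours(records))
```

Without the PERWT weight, the sum would only describe the people in the sample rather than the population they represent.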

That’s a lot of time spent in traffic. Americans spent more than 120 million hours per workday going to and from work! They aren’t getting paid for that time - let’s try to put a value on it. A quick check of the BLS data for earnings shows a bit more than $25 per hour in 2016, giving an approximate “commuting cost” of $3 billion per workday or $750 billion per year, assuming 250 days working per person.
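The back-of-the-envelope arithmetic can be written out as a short Python sketch. The inputs are the 2016 figure from the query output above plus the assumptions stated in the text (round trips double the one-way time, $25 per hour, 250 workdays per year).

```python
# One-way commuting hours per workday in 2016, from the query output above.
ONE_WAY_HOURS_2016 = 63408548

# Assumptions taken from the text: round trips double the one-way time,
# earnings are roughly $25 per hour, and there are 250 working days a year.
HOURLY_EARNINGS = 25
WORKDAYS_PER_YEAR = 250

round_trip_hours = 2 * ONE_WAY_HOURS_2016
cost_per_workday = round_trip_hours * HOURLY_EARNINGS
cost_per_year = cost_per_workday * WORKDAYS_PER_YEAR

print("round-trip hours per workday: {:,}".format(round_trip_hours))
print("cost per workday: ${:,}".format(cost_per_workday))
print("cost per year: ${:,}".format(cost_per_year))
```

The unrounded figures come out a little higher (about $3.2 billion per workday and $790 billion per year) than the rounded estimates in the text, which start from an even $3 billion per workday.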

[Aside: to find a more accurate number we could adjust for systematic differences between commute time and income. This would require computing an hourly wage per person to produce a new variable; we could call it “COMMUTE_COST.” I’ll leave that as an exercise for the reader.]

Now let’s suppose you want to hand off a subset of the “extract65” dataset. It needs only people over age 50, their state of residence, and their travel time to work. You’ll also need to include the PERWT variable to calculate numbers matching the U.S. population.

To deliver this you can work with the dataframe programmatically using the DataFrame API:

>>> df_over_50 = df_extract65.filter(df_extract65['AGE'] > 50).select('YEAR','TRANTIME','STATEFIP','PERWT')

Now that we have just the data we want, we can convert to a different format:

>>> df_over_50.write.csv("over_50")

The ability to read and write many formats is extremely useful.
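One thing to be aware of: Spark writes “over_50” as a directory containing one or more part-*.csv files rather than a single CSV file, and by default it writes no header row. If whoever receives the data isn’t using Spark, a sketch like the following (standard library only; the function name read_spark_csv_dir is hypothetical) reads the parts back as one list of rows:

```python
import csv
import glob
import os


def read_spark_csv_dir(directory):
    """Read every part-*.csv file in a Spark output directory.

    Returns a list of rows, each a list of strings. Files are read in
    sorted name order; Spark also writes marker files such as _SUCCESS,
    which the glob pattern skips.
    """
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, "part-*.csv"))):
        with open(path, newline="") as handle:
            rows.extend(csv.reader(handle))
    return rows
```

Calling read_spark_csv_dir("over_50") would return the YEAR, TRANTIME, STATEFIP and PERWT values as strings, in the column order used in the select above.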

There’s a ton more to learn, so check out the PySpark documentation.

Spark and Ruby

Ruby isn’t directly supported by Spark but thanks to the JRuby project you have access to an interactive shell (REPL) for Ruby that also gives you access to compiled Java classes.

I used JRuby to create a small interactive Spark environment. JRuby can use Java classes directly, so all I had to do was instantiate Spark Java classes in JRuby and use the Java API from Ruby’s “irb” REPL. What I’m doing here is calling the Java API for Spark; JRuby is simply a convenient way to script the use of the Java API so I don’t have to compile Java programs every time I make a change. You should be able to achieve a similar setup with any interpreted JVM language such as Clojure.

The following example code shows how you could start up Spark and make an ad-hoc query from JRuby, and then work with the Java DataFrame API interactively to do similar tasks to what we did with Python.

Spark JRuby example

There are two basic approaches you can take if you wish to run a Spark job from an unsupported language like Ruby:

Write a small Java class to issue Spark SQL queries and perform pre-determined actions on DataFrame instances. This allows you to skip importing all the Spark JAR files directly in your JRuby program, and is probably the right approach to take if your app requirements are known and you don’t need interactive access to Spark through JRuby.
Simply import all necessary Spark Java libraries into your JRuby program. Then you can load your code in “jirb” and interact with Spark.

Here’s an example of approach #1. I’m using a Spark “helper” library. All it does is put all the Spark libraries on the classpath and import all necessary packages so you can use them in JRuby.

require 'spark_env_helper'

# For example:
df = $spark.sql("select int(sum(TRANTIME*PERWT)/60) as hours_commuting
	from parquet.`./extract65.parquet` where YEAR=2016 and AGE > 50")

# At this point you can manipulate the data frame with the Spark API
df.show

And now for an example of approach #2, the interactive approach.

Before starting an interactive session you may wish to reduce Spark’s default log level, otherwise you’ll get a large amount of informational messages. Change the first setting in the /conf/log4j.properties from INFO,console to WARN (see the comments.) Spark will read the log4j.properties file in the conf subdirectory of the directory pointed to by your SPARK_HOME environment variable. On the other hand, if you’re iteratively developing a performance-critical task you should consider leaving the logging level on “INFO” because you can gain a lot of insight into how Spark distributes work and what resources Spark is using.
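In the Spark 2.x template, that first setting is the log4j.rootCategory line. As a sketch of the edit (assuming your file still has the stock “log4j.rootCategory=INFO, console” line; the helper name quiet_log4j is hypothetical), the change can be scripted with the Python standard library:

```python
import re


def quiet_log4j(properties_text, level="WARN"):
    """Return the properties text with the root logger level changed.

    Only the log4j.rootCategory line is touched; every other line,
    including comments, is returned unchanged.
    """
    pattern = re.compile(r"^(log4j\.rootCategory\s*=\s*)\w+(\s*,.*)$", re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + level + m.group(2), properties_text)
```

Applying it to the contents of the log4j.properties file in your SPARK_HOME conf directory and writing the result back has the same effect as editing the file by hand.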

The interactive session would look like:

	irb> load "spark_env_helper.rb"
	Loading test helper, SPARK_HOME is /home/ccd/spark-2.3.1/spark-2.3.1-bin-hadoop2.7, setting classpath

	irb(main):004:0>  df = $spark.sql("select int(sum(TRANTIME*PERWT)/60) as hours_commuting
	irb(main):005:0>  from parquet.`./extract65.parquet` where YEAR=2016 and AGE > 50")

You will get an object of type:

	=> #<Java::OrgApacheSparkSql::Dataset:0x7b4619a3>

Now, to actually execute the job:

	irb(main):006:0> df.show

	+---------------+
	|hours_commuting|
	+---------------+
	|       19282789|
	+---------------+

	=> nil
	irb(main):007:0>

As with the Python examples, you can save the data and work with it further using commands like df.write.csv("dataset_name").

A Note about Spark-Friendly Parquet

You should know that Spark doesn’t deal especially well with very large monolithic Parquet files. While in principle – and as we’ve shown with a C++ reader or PyArrow – you could read from different parts of a Parquet file in parallel, the Spark Parquet reader doesn’t automatically do this well unless you’re using HDFS (Hadoop Distributed File System.) Additionally, extremely large numbers of columns in a Parquet dataset will hurt Spark performance, not due to a limitation of the Parquet format itself but rather the Spark implementation of it.

However, we can work around this by playing to the Spark Parquet reader’s strengths. By breaking up the Parquet monolith into chunks (multiple Parquet files), you can take advantage of all the cores Spark can bring to bear without paying these excessive I/O costs. Spark prefers to have your dataset chunked out over multiple Parquet files rather than a single monolithic file.

If you’re making your Parquet files with PyArrow you can save Parquet as “flavor=spark” to make Spark-friendly Parquet. Alternatively, when constructing Parquet files manually, ensure “block groups” are around 1 GB in size and consult the Spark documentation for supported column data types. Sizes of “row groups” should be larger than the HDFS block size (this is 128 MB by default, but it’s recommended to use larger HDFS block sizes for Spark and Hadoop).

To calculate a good row group size first compute the approximate size of each row in the Parquet file by taking the data types of each column into account.

For instance given the schema:

col1:int32, col2:int32, col3:int64, col4:double ......

Compute the size in bytes:

row_size =  4 + 4 + 8 + 8 + ....

Then divide your target row group size in bytes (1,000,000,000 for a 1 GB target) by the row size in bytes:

rows_in_group = 1000000000 / row_size

You’ll find that Spark will perform a bit better if the Parquet files have been tailored in this way.
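The calculation above can be sketched in Python. This is an estimate under stated assumptions: only fixed-width column types are handled, compression and encoding are ignored, and the 1 GB target is taken from the size guidance earlier in this section. The names TYPE_SIZES and rows_per_group are made up for this example.

```python
# Byte widths for a few fixed-width Parquet physical types.
TYPE_SIZES = {"int32": 4, "int64": 8, "float": 4, "double": 8}


def rows_per_group(column_types, target_bytes):
    """Estimate how many rows fit in a row group of target_bytes.

    This uses uncompressed, fixed-width sizes only, so it is a rough
    estimate; encoding and compression usually make the stored data
    smaller than this calculation suggests.
    """
    row_size = sum(TYPE_SIZES[t] for t in column_types)
    return target_bytes // row_size


# The four columns from the example schema, with a 1 GB target.
schema = ["int32", "int32", "int64", "double"]
print(rows_per_group(schema, 1000 ** 3))
```

For the four example columns the row size is 24 bytes, so a 1 GB target works out to roughly 41 million rows per group.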

Conclusion

This brings us to the end of our “big data on a laptop” journey. Hopefully, you’ve found this series useful for learning how to leverage your local computer for more scalable and performant data analysis than you thought was possible!

The steps outlined in this series will take you just about as far as you can go for processing large-scale data on a local system. If you need more memory or performance, you’ll need to graduate to a larger server or even a full-fledged compute cluster (or perhaps use some more esoteric tooling, which I refer to briefly in the appendix below). I’ve put some next steps in the appendix that you can take if you find yourself needing more.

Thanks for reading!

Appendix: Next steps

Spark on an Actual Cluster

One major benefit of working with Spark on your local system is that it is then very easy to “graduate” to an actual Spark cluster - just move the code and data over, and it should all essentially “just work”, but perform faster.

Learn about running Spark on a cluster
Learn about submitting jobs to a Spark cluster with spark-submit
Explore Spark tools beyond Spark SQL: MLlib for machine learning, and Spark Streaming for incorporating streams of data (Kafka, ZeroMQ, Twitter, others) into your Spark workflow.

Extreme Column Store Solutions

The following two products are designed for the fastest possible queries on analytic workloads. Neither is as approachable as what I’ve discussed in this series, but both may be of interest to those seeking the absolute most performant tooling.

Yandex ClickHouse

Yandex, the “Google of Russia”, has open-sourced their column store database and it benchmarks quite impressively. Think of ClickHouse as a standard relational SQL database but tuned for analytic queries. You get a relatively conventional relational database interface with a nice bulk loading system - see all supported formats (https://clickhouse.yandex/docs/en/interfaces/formats/). You can productively run ClickHouse on a single machine – it’s designed to use hardware very efficiently – but it scales easily to multiple servers for more resiliency and performance. For a quick introduction to loading, querying and deploying the server, just read the tutorial.

Q and KDB

The “Q” and “K” language and KDB+ database ecosystem, and the J and JDB open source variant, are what I consider the expert power tool for high speed data analysis on a single machine. Here’s some more background. These are column oriented databases. Unlike Parquet format they allow very fast appends / inserts, while keeping very fast query times and compact size. JDB supports a fairly friendly query language, though it’s not SQL. The actual J and K languages are, like APL from which they are derived, extremely terse with single ASCII symbols used as keywords.