Blog Category: Code

Ingesting the 2020 U.S. Census with DuckDB by Colin Davis · November 27, 2023
DuckDB made it possible for us to reshape the U.S. Census with pure SQL and a sprinkling of classic UNIX utilities.

Reading the Parquet Data Format in Rust by Colin Davis · July 1, 2022
Move Beyond the Basic Examples

Implementing The Fastest (Pseudo) Jaro-Winkler Algorithm in Rust by Jake Wellington · June 24, 2022
Creating one of the fastest Jaro-Winkler implementations available.

Upgrading the Shared Conda Installation at ISRDI by Kelly Thompson · November 28, 2020
A shared conda installation provides a multitude of benefits for an organization, but upgrading it can be a challenge. This is that journey.

GPL: Do We Need to Wait for the Courts? by Jimm Domingo · November 2, 2020
Is there another way to resolve the ambiguity with the GPL that doesn't involve the courts?

More Better Python: 10 Cool New (to me) Python Things by Ben Klaas · January 8, 2020
One of the sublime pleasures of Python is discovering newer, better ways of doing Python. Here are 10 I've discovered recently.

Reflections on Creating an API Program by Fran Fabrizio · January 6, 2020
As the IPUMS NHGIS Metadata and Data Extract APIs reach v1.0, this is a great time to look back and reflect on how we got here.

The GPL License and Linking: Still Unclear After 30 Years by Fran Fabrizio · October 4, 2018
The prevalence of GPL licensing for R libraries in CRAN, the challenge that creates for users, and the reckoning that the R community might not see coming.

Big Data on a Laptop: Tools and Strategies - Part 3 by Colin Davis · September 19, 2018
In our final installment of this series, we show how to harness all the compute cores available on your local system, turning it into a personal cluster for parallel computing.

Big Data on a Laptop: Tools and Strategies - Part 2 by Colin Davis · July 5, 2018
When you've hit the memory or storage limits of your local machine, it's time to look at more efficient data storage formats. Today, we explore Parquet.

Big Data on a Laptop: Tools and Strategies - Part 1 by Colin Davis · May 25, 2018
Doing analysis on big data doesn’t have to mean mastering enterprise Big Data tools or accessing hundred-node compute clusters in the cloud. In this series, we’ll first find out what your humble laptop can do with the help of modern tools.

ipumsr - Integrating IPUMS Data with R by Greg Freedman Ellis · November 17, 2017
We are excited to announce the ipumsr R package, which helps make importing IPUMS data into R easy.

Flame Graphs: Making the Opaque Obvious by Colin Davis · August 9, 2017
With a flame-graph-style profile of your application, you can spot performance hotspots at a glance.

Slurping Up Excel Data on the Quick: Python, Pandas, and Pickle by Ben Klaas · February 14, 2017
If you have very large tables of data imprisoned in a vendor-locked Excel jail, consider setting them free by caching worksheets using Python+Pandas+Pickle.

Towards a Sustainable Excel by Ben Klaas · February 3, 2017
Building Excel Macros With Python, part 3 of a series on reinventing our metadata management environment.

Automated Analysis of a Data Workflow - Part 2 by Jesse Erdmann · September 14, 2016
The conclusion of the story of how we created DCP Analytics - our in-house automated, web-based analysis tool using Pandas, Bokeh, Jupyter and Conda to help our researchers quickly find data anomalies and processing errors in our data production pipelines.

Automated Analysis of a Data Workflow - Part 1 by Jesse Erdmann · August 24, 2016
The story of how we created DCP Analytics - our in-house automated, web-based analysis tool using Pandas, Bokeh, Jupyter and Conda to help our researchers quickly find data anomalies and processing errors in our data production pipelines.

Excel VBA and Version Control by Jimm Domingo · May 19, 2016
The second post in the series about Team Unicorn Rainbows' work in the first round of MPC IT Shark Tank.

Improving Menu Creation in Excel with VBA by Jimm Domingo · February 25, 2016
In this series, we present some highlights from Team Unicorn Rainbows' work in the first round of MPC IT Shark Tank. This first post describes how we improved menu creation in Excel.

High Performance Analysis of Big Spatial Data by MPC IT · November 18, 2015
Our own HPC specialist Ankit Soni and the TerraPop team presented their published article at the IEEE Big Data 2015 conference in Santa Clara earlier this month.

Importing Fixed Length Data Using Ruby (Part Two) by Colin Davis · May 7, 2015
In this follow-up to my post on the 'hflr' Ruby gem for reading hierarchical data in FLR format, I'll demonstrate how to combine 'hflr' with a simple importer class to load a database with the data.

Fixed Length Record Data by Colin Davis · January 28, 2015
Dealing with fixed-length record (FLR) data is a reality for us at the MPC. Colin introduces readers to his Ruby Gem, HFLR, which makes processing hierarchical fixed-length record data a bit easier.

Ember for Rails Devs: Understanding How Ember Thinks by Jake Wellington · December 10, 2014
An introduction to Ember.js for devs who are used to thinking in Rails.

Keeping it Simple: Exploiting CSV and csvkit at the MPC by Ben Klaas · November 21, 2014
How we use csvkit to wrangle data around here.

Feeling Fuzzy: Name Matching at the MPC by Fran Fabrizio · October 10, 2014
The MPC's data has been cited thousands of times. In this article, we explore how we connected those citations with our user accounts using fuzzy name matching.

Data Duplication Detection by Jesse Erdmann · October 8, 2014