First, let’s define what we mean by big data. “Big data” is a polarizing phrase - it means different things to different people and disciplines, and has been co-opted for marketing purposes by so many companies that no one knows what it means any more. When we say big data here, we simply mean datasets that are traditionally thought to be “too large” to work with using a local laptop or desktop. Around here, that might mean the entire 1940 US census, for example (132 million rather wide records).
As dataset sizes continue to grow across all disciplines, researchers are increasingly turning to what the industry labels “big data” tools. These include technologies such as Hadoop and Spark and all of the commercial variants. These tools can be wonderful, but there are steep barriers to entry, both in the learning curve and in the required infrastructure. Through this series, we’ll show how – with a little ingenuity and tooling – we can push that barrier farther out and keep working with larger and larger datasets on our local machine.
The barriers that you hit on a local machine with large datasets can be either hard (out of memory or storage) or soft (it “works” but takes so long!) If you’re hitting these barriers on your local machine, you may need to re-think your choice of tools and data formats. Average size and speed of storage and memory is improving, but so is the amount of data generated by business processes and scientific studies. What’s not growing so fast: computer processor speeds. (Just recently performance has actually gone backwards.)
Depending on your volume of data and sophistication of the software you use you may find your poor computer:
- Spending lots of time loading enormous spreadsheets and running macros and formulas, and you find you are putting effort into organizing and versioning spreadsheets.
- Waiting while loading data into a database or waiting for slow analytic queries / reports.
- Dealing with text data in CSV, JSON or other bulky formats.
- Running out of space or needing a bigger hard drive.
- Crashing your program or computer as it runs out of memory.
Solutions (beyond upgrading your computer) include:
- Bring better tools to your data: Don’t just hack something together with Excel.
- Change your data format: Columnar formats allow for faster analytic queries and save lots of disk space.
- Parallelize work: Utilize the multiple cores on your machine. A modern laptop usually supports at least four concurrent threads of execution or more.
To make your data life better, we’ll look at a number of tools that put these solutions into practice. There’s nothing to lose, except excuses for coffee breaks.
This series is divided into three parts:
- Today we’re covering Part 1, Easier CSV
- Next will be Part 2, all about the Parquet data format
- Finally we’ll cover Part 3 on using Spark locally.
Let’s get started!
Easier CSV : Get out of Excel, Use SQL
If you’re dealing with CSV data by importing it into Excel you should consider using query tools directly on the CSV instead. Examples include q - Text as Data or CSV Kit. We’ve previously blogged about how we use CSVKit at IPUMS, so let’s explore
If your data is natively in Excel format, you can export it as CSV and then apply these same techniques. You can even automate the conversion of Excel spreadsheets to CSV with the
intocsv command that’s part of CSV Kit.
q tool allows SQL queries directly against CSV data as if it were a database table, rather than importing the CSV data into an actual database. If you already know the query you’d like to run and the data is of reasonable size those tools may be overkill. The
q utility lets you not only query the CSV file directly, but also join two or more “tables” (files).
q is extremely simple and it’s free. If you know the UNIX command line, you can combine
q with other command line UNIX tools, because
q defaults to outputting plain text consumable by other UNIX tools such as
sed and so forth.
My examples are in Ubuntu Linux, but the
q package is also available for Windows and Mac.
$ sudo apt-get install python3-q-text-as-data
If you’re not the administrator, you can download the program right off the author’s page or the Github repository; it’s one file. It depends on a few things, namely Python and Sqlite but if you have those it will “just work.”
For our example I’ll use one of the log files our internal software generates. IPUMS keeps a pipe-delimited log of every run of our data conversion tool. Every time a researcher or developer produces an IPUMS file we record basic information to have an historical record separate from the application logs. The file has a few columns:
and currently there are nearly 224,000 rows:
$ wc dcp_multi.completed.csv 223949 990559 12617458 dcp_multi.completed.csv
q, we can easily query this CSV file to find how many times we’ve produceed datasets for each data product:
$ time q -d"|" -H -b \ "select count(*) as runs, product \ from ./dcp_multi.completed.csv \ group by product \ order by runs desc \ limit 15" 132642|usa 33604 |dhs 31712 |cps 5917 |mtus 5057 |ihis 3154 |health 2990 |atus 2170 |pma 1742 |brfss 1604 |yrbss 1480 |highered 858 |nsduh 405 |nyts 292 |fullusa 153 |ahtus real 0m2.491s user 0m2.432s sys 0m0.044s
That was fast! 2.5 seconds. And pretty straightforward, too. The
-d"|" flag defines the delimiter;
-b formats the output a bit;
-H reads the header for column names; the
\ just tells bash to continue reading the command on the next line.
Let’s explore a bigger example now. Recently, using IPUMS USA data I built a moderately large demographic dataset created to study changes in commuting patterns and housing costs among U.S. workers since 1960. I’ve downloaded the data to my computer. It contains over 17 million individual records and 82 columns, so it’s pretty large. The data is in CSV format in the file
usa_00065.csv and is about 4.2GB uncompressed.
At IPUMS IT, we really like bike commuting a lot - a huge chunk of our team bike commutes to work regularly. Suppose we’re sitting around one day and someone asks “Hey, is biking to work really more popular among software developers compared to other kinds of workers, or is it just us?” Well, this dataset has all the IPUMS variables to answer that question. How about I just throw a query right at the
In the following query the OCC2010 column uses numeric codes to represent job classifications. Codes 1000 to 1099 can be considered software developers and similar workers. The TRANWORK (mode of transport to work) variable uses 20 for motorcycles and 40 for bikes, and all other modes of transport are also represented. (How do I know this? Check the IPUMS documentation. Googling “IPUMS variable TRANWORK” or other IPUMS variable names will also bring it right up.)
Let’s start with bicycling during 2016. We’ll exclude unemployed people by using the clause OCC2010 < 9900, since unemployed is represented by 99XX codes. The “sum(PERWT)” gives us a total population count adjusted for the 1% sample size of the 2016 American Community Survey this data comes from.
I’ll use the
-D (output delimiter),
-b (beautify) and
-O(format print output header) flags to make the results prettier.
$ time q -d, -b -O -D"|" -H \ "select sum(perwt) as total, \ case when TRANWORK=40 then 'Biking' else 'other' end as biker, \ case when OCC2010/100 = 10 then 'Programmers' else 'other' end as programmer \ from ./usa_00065.csv \ where YEAR=2016 and OCC2010<9900 and TRANWORK>0 \ group by biker, programmer" total |biker |programmer 36719 |Biking|Programmers 824999 |Biking|other 3983940 |other |Programmers 145593645|other |other real 24m53.442s user 24m30.220s sys 0m21.928s
All right, that was easy, and for smaller datasets this could be a great stategy. But for our 17 million rows, each with 82 columns, it took forever. If we’re planning to query several times a day on this dataset we need to find a better way. Looking at memory use of the above process, it’s clear the
q program creates an in-memory database after loading all the CSV data into memory (this is a common model - many other data tools operate in a similar manner). If my machine had less than 8GB the script would fail; if I want to analyze a larger dataset I’ll hit a memory wall. We’ll tackle this issue in Part 2. But often, the dataset you’re interested in is a lot smaller than 17 million rows and 82 columns, so this may be the only optimization you need to keep working with your data on your local machine.
SQLite as Alternative to q for Frequent Querying
q needs to load the data into memory on every query. If you’re focusing on one dataset of moderate size you could skip
q and import your CSV into Sqlite, then repeatedly query it. This helps if the data isn’t too large and you don’t need to constantly import different new data. Here’s how to import our sample data:
sqlite> .mode csv sqlite> .import ./usa_00065.csv extract sqlite> select count(*) from extract; 17634469 sqlite> .schema extract CREATE TABLE extract( "YEAR" TEXT, "DATANUM" TEXT, "SERIAL" TEXT, "HHWT" TEXT, "STATEFIP" TEXT, "METAREA" TEXT, "METAREAD" TEXT, ........ [many more columns ] ....
The import takes a while, but not as long as
q. For the duration of your session you have an in-memory database you can query. The big catch is that you’re limited by available RAM. If I had included every year from 2000 to 2016 from the American Community Survey in my data it would not fit in my computer’s memory. We’ll need different tricks for datasets that large.
Conclusion to Part 1
That brings us to the end of Part 1 of our series.
In answer to the research question, programmers seem to bike at a 61% higher rate than the average of all other types of workers, but the absolute rates (0.91% for information technology workers, 0.56% for others) is really low.
I’d really like to break out mode of transportation by more categories but that will take a long time with the current method. In addition, I’d like to classify workers by income, because I’m suspecting biking to work may be connected to income and class more than profession. So, on to something better. In our next post in Part 2, we’ll introduce some techniques for dealing with datasets that are bigger than your system’s memory or larger than your available storage.
Colin Davis Code · Data
Ruby CSV Parquet C++ Spark Data Science