Gannett Foundation Award for Technical Innovation in the Service of Digital Journalism finalist

Tabula

Finalist(s)
Manuel Aristaran, Mike Tigas, Jeremy Merrill

Organizations
LA NACION
Mozilla OpenNews
ProPublica

Award
Gannett Foundation Award for Technical Innovation in the Service of Digital Journalism

Program
2014

About the Project

Tabula is a tool for liberating data tables trapped inside Adobe Acrobat PDF files. If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful this is. You can’t easily copy-and-paste rows of data out of them.

Why is PDF data so hard to extract? PDFs are designed to maintain perfect fidelity to the appearance of the original document, so they discard structural information, such as which text belongs to which row or column, in favor of precise coordinates: goo.gl/8UeMcO

Drawing upon the extensive literature in the academic field of document analysis, Tabula implements algorithms and heuristics to reconstruct tabular information found in PDF files. The extracted information can be downloaded as CSV. Tabula is free and open source, and runs on Windows, Mac OS and Linux.
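
To give a feel for the kind of reconstruction involved, here is a minimal sketch in Python. It is not Tabula's actual algorithm: it assumes the text has already been pulled out of the PDF as hypothetical (x, y, text) fragments, with y increasing down the page, and it simply treats fragments at nearly the same height as one row. The function name `fragments_to_csv` and the `y_tolerance` parameter are made up for this illustration.

```python
import csv
import io


def fragments_to_csv(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows by y, then order each row by x.

    Assumption for this sketch: y grows downward, so smaller y means
    higher on the page. Real PDFs often use the opposite convention.
    """
    rows = []  # each entry: (y of the row's first fragment, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            # Close enough vertically: same row as the previous fragment.
            rows[-1][1].append((x, text))
        else:
            # Too far down the page: start a new row.
            rows.append((y, [(x, text)]))

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for _, cells in rows:
        # Left-to-right order within the row.
        writer.writerow(text for _, text in sorted(cells))
    return out.getvalue()


# Hypothetical fragments, deliberately out of order and slightly misaligned.
sample = [
    (50, 100.4, "Alaska"), (200, 100.0, "12"),
    (50, 115.0, "Ohio"), (200, 115.3, "7"),
    (50, 85.0, "State"), (200, 85.2, "Count"),
]
print(fragments_to_csv(sample))
```

A sketch like this breaks down as soon as a table has empty cells, wrapped text, or merged columns, which is where more careful heuristics are needed.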

With Tabula, getting data out of PDFs at scale has gone from a labor-intensive chore to a solved problem for a wide variety of fields, including journalism and open-data activism around the world. Mary Jo Webster at Digital First used it for a project on Obamacare subsidies and told us Tabula helped make the project possible: “If we’d have had to fight states to get data in another format, I’m certain that story never would’ve made it to publication.”

Another example: as we write this, the Kenya Media Program’s data journalism training program in Naivasha, Kenya, is using Tabula to unlock government PDFs and find stories: twitter.com/pudo/status/476343213813673984

Other 2014 finalists in this category

Quartz Chartbuilder, Quartz
Journalists’ Toolbox, Northwestern University Knight Lab

Winners in this category

360° Drone, CNN
Trint, Trint
Publish2, Publish2
NPR API, NPR

Winners in the 2014 awards

Betrayed by Silence, MPR News
Fugitives Next Door, Gannett Digital
Planet Money Makes A T-Shirt, NPR
Innocents Lost: A Miami Herald I-Team Investigation, Miami Herald
