Tabula is a tool for liberating data tables trapped inside PDF files. If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful it is: you can’t easily copy and paste rows of data out of them.
Why is PDF data so hard to extract? PDFs are designed to maintain perfect fidelity to the appearance of the original document, so they discard structural information (rows, columns, cells) in favor of the precise coordinates of each piece of text: goo.gl/8UeMcO
Drawing upon the extensive literature in the academic field of document analysis, Tabula implements algorithms and heuristics to reconstruct tabular information found in PDF files. The extracted information can be downloaded as CSV. It’s free and open source, and runs on Windows, Mac OS and Linux.
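To give a feel for what "reconstructing" a table from coordinates involves, here is a minimal Python sketch of one very simple heuristic: group text fragments into rows by their vertical position, then order each row left to right. This is an illustration only and not Tabula's actual algorithm; the `fragments` data, the function names and the `y_tolerance` parameter are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical input: (text, x, y) fragments, roughly how a PDF stores text.
# There are no rows or cells here, only positions on the page.
fragments = [
    ("State", 50.0, 700.0), ("Enrollees", 200.0, 700.2),
    ("Ohio", 50.0, 680.1), ("1200", 200.0, 679.9),
    ("Iowa", 50.0, 660.0), ("340", 200.0, 660.3),
]

def rows_from_fragments(fragments, y_tolerance=2.0):
    """Group fragments whose y values are within y_tolerance into one row."""
    rows = []  # each entry: [row_y, [(x, text), ...]]
    # Walk the page top to bottom (larger y is higher on the page).
    for text, x, y in sorted(fragments, key=lambda f: -f[2]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Within each row, order the cells left to right by x.
    return [[text for _, text in sorted(cells)] for _, cells in rows]

def to_csv(rows):
    """Serialize the reconstructed rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

table = rows_from_fragments(fragments)
csv_text = to_csv(table)
print(csv_text)
```

Real documents are much messier than this toy input: cells can wrap onto several lines, columns can be empty, and ruling lines may or may not be present, which is why a tool like Tabula needs more than one heuristic.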
With Tabula, getting data out of PDFs at scale has gone from a labor-intensive chore to a solved problem in a wide variety of fields, including journalism and open-data activism around the world. Mary Jo Webster at Digital First used it for a project on Obamacare subsidies and told us Tabula helped make the project possible: “If we’d have had to fight states to get data in another format, I’m certain that story never would’ve made it to publication.”
Another example: as we write this, the Kenya Media Program’s data journalism training program in Naivasha, Kenya, is using Tabula to unlock government PDFs and find stories: twitter.com/pudo/status/476343213813673984