Raw-fi Data

Extract CSV from tables in PDF

2024-10-06

Issue

Public agencies and governments sometimes publish data only as tables embedded in PDFs. To load that data into our data pipeline, we first need to extract it from the PDF.

Such data can be extracted by hand by converting the file in the following order: PDF -> Word -> Excel -> CSV. However, this route is quite difficult to automate and integrate into a data pipeline.

This problem can be solved with a program that extracts table data directly from PDFs and converts it to CSV. In this article, we introduce how to extract table data from PDF files using the Python library tabula-py.

tabula-py

tabula-py is a Python library for extracting table data from PDF files. Internally, it uses the Java library tabula-java to analyze PDFs, and the extracted data can be handled as pandas DataFrames. Tabula itself, by contrast, is a standalone tool for extracting table data from PDF files.

tabula-py acts as a wrapper to make tabula-java functions easily available from Python. In other words, by using tabula-py, you can efficiently extract PDF tables in a Python environment.

PDFs that can be handled by tabula-py

tabula-py can only handle text-based PDFs, that is, PDFs in which you can select text by clicking and dragging. PDFs that carry no text information, such as scanned documents, cannot be processed with it.

How to use tabula-py

Install

To use tabula-py, you must first install it with the following command:

pip install tabula-py

A Java runtime environment is required, so if it is not installed, install it beforehand.
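Before calling tabula-py, it can help to confirm that a Java runtime is actually reachable. The following is a minimal sketch using only the Python standard library; the function name java_version is a hypothetical helper introduced here, not part of tabula-py, and it assumes the runtime is exposed as a java executable on PATH.

```python
import shutil
import subprocess


def java_version():
    """Return the text printed by `java -version`, or None if Java is unavailable."""
    java = shutil.which("java")  # look for a java executable on PATH
    if java is None:
        return None
    try:
        # `java -version` conventionally writes to stderr, so capture both streams.
        result = subprocess.run([java, "-version"], capture_output=True, text=True)
    except OSError:
        return None
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    version = java_version()
    print(version if version else "Java not found on PATH; install a Java runtime first.")
```

If this prints a version string, tabula-py should be able to start tabula-java. If it reports that Java was not found, install a Java runtime before continuing.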

Next, use the read_pdf function to read the PDF file.

import tabula

# Load a PDF file and read it

dfs = tabula.read_pdf("test.pdf", pages='all')

The pages argument selects which pages to read. If it is omitted, only the first page is read. Specifying 'all' extracts tables from all pages. The extracted table data is stored in dfs as a list of pandas DataFrames.
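Because read_pdf returns a list with one DataFrame per detected table, a pipeline step usually has to loop over that list. The sketch below writes each table to its own CSV file. The helper names table_csv_path and save_tables and the "_tableN.csv" naming scheme are assumptions made for illustration; only tabula.read_pdf and pandas' DataFrame.to_csv are real library calls.

```python
from pathlib import Path


def table_csv_path(pdf_path, index, out_dir="."):
    """Build an output name such as report_table1.csv (illustrative naming scheme)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_table{index}.csv"


def save_tables(pdf_path, pages="all", out_dir="."):
    """Extract tables from the given pages and write each one to its own CSV file."""
    # Imported here so the path helper above works without tabula-py or Java.
    import tabula

    # pages accepts an int, a list of ints, or a string such as "1-3" or "all".
    dfs = tabula.read_pdf(pdf_path, pages=pages)  # list of pandas DataFrames
    written = []
    for i, df in enumerate(dfs, start=1):
        path = table_csv_path(pdf_path, i, out_dir)
        df.to_csv(path, index=False)  # drop the DataFrame index column
        written.append(path)
    return written
```

For example, save_tables("test.pdf", pages="1-3") would read the first three pages and return the paths of the CSV files it wrote. If the returned list is empty, no tables were detected on those pages.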

Extract table data from PDF and save it as a CSV file

By using the convert_into function, you can also save the extracted table data in formats such as CSV, TSV, and JSON.

tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')
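In a data pipeline there is often a whole folder of PDFs rather than a single file. One way to handle that, sketched here under assumptions, is to loop over the folder and call convert_into once per file. The helper names csv_target and convert_folder and the directory layout are hypothetical, and the broad exception handling is a deliberate simplification so that one bad PDF does not stop the batch.

```python
from pathlib import Path


def csv_target(pdf_path, out_dir):
    """Map e.g. reports/a.pdf to <out_dir>/a.csv."""
    return Path(out_dir) / (Path(pdf_path).stem + ".csv")


def convert_folder(in_dir, out_dir, pages="all"):
    """Convert every *.pdf in in_dir to a CSV in out_dir; return the failures."""
    # Deferred import: needs tabula-py and a Java runtime at call time.
    import tabula

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        try:
            tabula.convert_into(str(pdf), str(csv_target(pdf, out_dir)),
                                output_format="csv", pages=pages)
        except Exception as exc:  # broad on purpose: keep the batch going
            failed.append((pdf.name, str(exc)))
    return failed
```

Calling convert_folder("pdfs", "csv_out") would leave one CSV per input PDF in csv_out and return a list of (file name, error message) pairs for any files that could not be converted.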

Demo

I deployed a demo of this with Streamlit. Please check it out.

Conclusion

The data extracted by tabula-py is available as pandas DataFrames, which can be updated, joined with other data sources, or visualized with matplotlib and similar libraries.

For detailed information on how to use tabula-py, please refer to the official documentation and the tutorial article provided in the source code.

tabula-py is a powerful tool for extracting table data from PDF files. If you need to work with PDF data in a Python environment, please try it out.