How to convert tables from PDF to Excel or CSV with Tabula

Pass and convert pdf to csv and excel

Looking at the historical data offered by a meteorological observatory in my city, I see that they only offer them graphically and for download as PDF. I don't understand why they don't let you download them in csv, which would be much more useful for everyone.

So I've been looking for one solution to pass these tables from pdf to csv or if someone wants to format Excel or Libre Office. I like csv because with a csv you do everything you can handle it with python and its libraries or you can easily import it into any spreadsheet.

As the idea is to get an automated process, what I want is a script to work with Python and this is where Tabula comes in.

Convert pdf to csv with Tabula

The steps and operation is very simple. The first will be install the Tabula library in our development environment. Tabula allows us to extract data from tables in PDF into Pandas dataframes, the Python library optimized for working with csv and arrays.

Also allows extract and convert between PDF, JSON, CSV and TSV. A gem. You can find much more information in its github repository

I take advantage of all the work from previous days and install it in Anaconda. In the link you can see how install Anaconda.

We install Tabula

#primero activamos nuestro entorno de desarrollo en nuestro caso sería conda activate comparador
pip install tabula-py

When executing it, it gave me an error

the solution as indicated in their documentation was to uninstall the old version of Tabula and install the new one.

pip uninstall tabula
pip install tabula-py

We create the executable .py

read tables from pdf to csv

I create the executable .py that I call pdftocsv.py I put it in my Downloads / eltiempo folder and it is a file with the following code

import tabula
# Extaer los datos del pdf al DataFrame
df = tabula.read_pdf("inforatge.pdf")
# lo convierte en un csv llamdo out.csv codificado con utf-8
df.to_csv('out.csv', sep='\t', encoding='utf-8')

The pdf to read is called inforatge.pdf and I tell it that the output is called out.csv and it will stay in the folder in which we are working.

We go to the directory where we have both the executable and the pdf that we want to convert. It is important because if it will tell us that it cannot find the file.

cd Descargas/eltiempo

In this directory we have the PDF, the .py file that we have created and there it will return the csv we want.

We execute the code

python pdftocsv.py

Notice that I have used python, that is, I tell it to run it with python 2 and not with python3 that fails. And that's it if it doesn't return any error, we already have it.

run Tabula in our Anaconda development environment

We have added 3 more lines to the file for runtime control. at the end we have left our pdftocsv.py file as

import tabula
import time

start_time = time.time()

df = tabula.read_pdf("inforatge.pdf")
df.to_csv('out.csv', sep='\t', encoding='utf-8')

print("--- %s seconds ---" % (time.time() - start_time))

More options from Tabula

More examples of things we can do. There are many options, it is best to go through the official Github repository that I have left

# Leer PDF remotos y convertirlos en DataFrame
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# Convertir un PDF en CSV
tabula.convert_into("test.pdf", "output.csv", output_format="csv")

And without a doubt one of the most useful things to convert all PDF, JSON, etc. files in a directory.

tabula.convert_into_by_batch("input_directory", output_format='csv')

With this we can automate tasks that would otherwise be long and tedious. In the end, this is one of the reasons for using this library.

Convert pdf to excel online

If what we want is to simply convert a file, extract the data from the table from PDF to Excel, Librecalc or similar, it is not necessary to complicate it so much. There are tools available to do this, some to install and some to get the job done online.

I have tried these two online tools and they work very well.

Keep in mind that this is not an automated job, and that is why the study of these tools has not been exhaustive. I only comment on them for those who may be interested.

The classical method

And we always have the classic method, the most sloppy and expensive but in the end it is an option if there is little work.

Copy the table cells from the pdf and paste them into our spreadsheet.

If you are a restless person like us and want to collaborate in the maintenance and improvement of the project, you can make a donation. All the money will go to buy books and materials to experiment and do tutorials

Leave a comment