Looking at the historical data offered by a meteorological observatory in my city, I see that they only offer them graphically and for download as PDF. I don't understand why they don't let you download them in csv, which would be much more useful for everyone.
So I've been looking for one solution to pass these tables from pdf to csv or if someone wants to format Excel or Libre Office. I like csv because with a csv you do everything you can handle it with python and its libraries or you can easily import it into any spreadsheet.
As the idea is to get an automated process, what I want is a script to work with Python and this is where Tabula comes in.
Convert pdf to csv with Tabula
The steps and operation is very simple. The first will be install the Tabula library in our development environment. Tabula allows us to extract data from tables in PDF into Pandas dataframes, the Python library optimized for working with csv and arrays.
Also allows extract and convert between PDF, JSON, CSV and TSV. A gem. You can find much more information in its github repository
I take advantage of all the work from previous days and install it in Anaconda. In the link you can see how install Anaconda.
We install Tabula
#primero activamos nuestro entorno de desarrollo en nuestro caso sería conda activate comparador pip install tabula-py
When executing it, it gave me an error
the solution as indicated in their documentation was to uninstall the old version of Tabula and install the new one.
pip uninstall tabula pip install tabula-py
We create the executable .py
I create the executable .py that I call pdftocsv.py I put it in my Downloads / eltiempo folder and it is a file with the following code
import tabula # Extaer los datos del pdf al DataFrame df = tabula.read_pdf("inforatge.pdf") # lo convierte en un csv llamdo out.csv codificado con utf-8 df.to_csv('out.csv', sep='\t', encoding='utf-8')
The pdf to read is called inforatge.pdf and I tell it that the output is called out.csv and it will stay in the folder in which we are working.
We go to the directory where we have both the executable and the pdf that we want to convert. It is important because if it will tell us that it cannot find the file.
In this directory we have the PDF, the .py file that we have created and there it will return the csv we want.
We execute the code
Notice that I have used python, that is, I tell it to run it with python 2 and not with python3 that fails. And that's it if it doesn't return any error, we already have it.
We have added 3 more lines to the file for runtime control. at the end we have left our pdftocsv.py file as
import tabula import time start_time = time.time() df = tabula.read_pdf("inforatge.pdf") df.to_csv('out.csv', sep='\t', encoding='utf-8') print("--- %s seconds ---" % (time.time() - start_time))
More options from Tabula
More examples of things we can do. There are many options, it is best to go through the official Github repository that I have left
# Leer PDF remotos y convertirlos en DataFrame df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf") # Convertir un PDF en CSV tabula.convert_into("test.pdf", "output.csv", output_format="csv")
And without a doubt one of the most useful things to convert all PDF, JSON, etc. files in a directory.
With this we can automate tasks that would otherwise be long and tedious. In the end, this is one of the reasons for using this library.
Convert pdf to excel online
If what we want is to simply convert a file, extract the data from the table from PDF to Excel, Librecalc or similar, it is not necessary to complicate it so much. There are tools available to do this, some to install and some to get the job done online.
I have tried these two online tools and they work very well.
- https://www.pdftoexcelonline.com/es/ works fine but sends you the document by mail
Keep in mind that this is not an automated job, and that is why the study of these tools has not been exhaustive. I only comment on them for those who may be interested.
The classical method
And we always have the classic method, the most sloppy and expensive but in the end it is an option if there is little work.
Copy the table cells from the pdf and paste them into our spreadsheet.