Let’s extract data from PDFs using Python

Welcome back! Let’s do some data extraction with Python, specifically from PDF files. Typically we extract data from Excel files and websites, so this time let’s try PDFs. In this tutorial I’ll be using a Google Colab notebook, but you can build this project in any IDE you like. Let’s get started!

Building The Project!

First off, we need a PDF to extract data from. This is the easiest step in the whole tutorial. I chose this PDF: http://www.africau.edu/images/default/sample.pdf, but you can use whichever one you want. Start by downloading the PDF to your machine, and make sure to note the file path (exactly where it’s stored).
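If you would rather fetch the file from code (handy in a Colab notebook), a minimal sketch using only the standard library might look like this. The helper name download_pdf and the default file name sample.pdf are placeholders of mine, not part of the original walkthrough.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_pdf(url: str, dest: str = "sample.pdf") -> Path:
    """Download url to dest and return the absolute path of the saved file.

    download_pdf and the default file name are illustrative placeholders.
    """
    saved_to, _headers = urlretrieve(url, dest)
    return Path(saved_to).resolve()


# Example call (needs network access), using the sample PDF from this tutorial:
# pdf_path = download_pdf("http://www.africau.edu/images/default/sample.pdf")
# print(pdf_path)  # this is the file path to hand to textract later
```

Returning the absolute path means you have the exact location written down, whichever folder your notebook or script happens to be running in.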

Next up, let’s open our Python IDE and get to coding! We first want to install the textract package, which will let us parse the text out of PDF files. To install it, run the following line (in a notebook cell such as Colab, prefix it with !):

pip install textract

Next up we want to import this package:

import textract

Great! Now we want to read that PDF file into a variable. We do so with textract’s process() function:

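# Read the PDF's contents into a variable (process() returns bytes)
text = textract.process("PUT/THE/FILEPATH/HERE/.pdf")
# Decode the bytes into a regular string (assumes textract's default UTF-8 output)
text = text.decode("utf-8")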

Make sure to change the file path above to wherever your PDF file is located. The second line of that code block converts the result into a string, which makes it easier to parse later. As a precaution, I added a print() statement to verify the contents of the variable:

#PRINTING OUT TEXT FOR ASSURANCE
print(text)
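It helps to know why the conversion step matters. textract.process() returns a bytes object, and the two common ways of turning bytes into a string behave differently. The sketch below uses a hard-coded stand-in value (raw is my own made-up example, not real textract output) so it runs without a PDF, and it assumes the bytes are UTF-8:

```python
# Stand-in for what textract.process() returns: a bytes object.
# (Hypothetical sample value, not real output from the tutorial's PDF.)
raw = b"Hello PDF\nSecond line"

as_repr = str(raw)             # keeps the b'...' wrapper and a literal backslash-n
as_text = raw.decode("utf-8")  # plain text with an actual line break

print(as_repr)
print(as_text)
```

If your printed text starts with b' and shows \n where line breaks should be, you are looking at the str() form. The decode() form gives cleaner input for anything you do with the text afterwards.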

The output of print(text) should look something like this:

Awesome! At this point you can do pretty much anything you want with this data. Something I saw in another Colab project was a word cloud, so let’s make one. We’ll import the wordcloud package and matplotlib and plot the most frequent words. To do this, use the following code:

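# Requires: pip install wordcloud matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Build the word cloud from the extracted text
wc = WordCloud().generate(text)

# Show the cloud as an image with the axes hidden
plt.imshow(wc)
plt.axis("off")
plt.show()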

This is the output from that code:

There you have it! That’s how you read in PDF files and do some visualizations of the data as well!
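If you want the actual counts behind a picture like this, a rough sketch with the standard library’s collections.Counter is below. The helper top_words, the sample string, and the simple regex tokenizer are assumptions of mine for illustration. WordCloud does its own tokenizing and stop-word filtering, so its ranking may differ.

```python
import re
from collections import Counter


def top_words(text: str, n: int = 5) -> list:
    """Return the n most common lowercase words in text as (word, count) pairs."""
    # Treat runs of letters and apostrophes as words; everything else is a separator.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Hypothetical sample; in the tutorial you would pass the extracted `text` instead.
sample = "PDF text, more text, and even more PDF text."
print(top_words(sample, 3))
```

Swapping sample for the text variable from earlier gives you a quick list of the words that dominate your PDF.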

As Always

If you have any suggestions or thoughts, or just want to connect, feel free to contact or follow me on Twitter! Also, below is a link to some of my favorite resources for learning programming, Python, R, data science, etc.

Thanks so much for reading!

Data Scientist / Engineer
