
Let’s extract data from PDFs using Python

Manpreet Singh
3 min read · May 27, 2021


Welcome back! Let’s do some data extraction with Python, specifically from PDF files. Typically we extract data from Excel files and websites, so this time let’s pull it out of a PDF instead. In this tutorial I’ll be using a Google Colab project, but you can build this in any IDE you like. Let’s get started!

Building The Project!

First off, we need a PDF to extract data from. This is the easiest step in the whole tutorial. I chose this one: http://www.africau.edu/images/default/sample.pdf, but you can use whichever PDF you want. Download it to your machine and make sure to note the file path (exactly where it’s stored).
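If you would rather fetch the file from Python than from the browser, here is a minimal sketch using only the standard library. The helper names `local_path_for` and `download_pdf` are made up for illustration, and the sketch assumes the URL is reachable from your machine or Colab session.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# The sample PDF used in this tutorial.
PDF_URL = "http://www.africau.edu/images/default/sample.pdf"


def local_path_for(url: str, folder: str = ".") -> Path:
    """Return the absolute path the file will be saved to (hypothetical helper)."""
    # Use the last segment of the URL path as the file name.
    name = Path(urlparse(url).path).name or "download.pdf"
    return Path(folder).resolve() / name


def download_pdf(url: str, folder: str = ".") -> Path:
    """Download the PDF into `folder` and return where it was stored."""
    target = local_path_for(url, folder)
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(target))  # performs the actual network request
    return target
```

Calling `download_pdf(PDF_URL)` returns the absolute path of the saved file, which is the path you will want to note for the later steps.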

Next up, let’s open our Python IDE and get to coding! We first want to install the textract package, which will let us parse the text out of PDF files. To install it, run the following line (in a Colab cell, put a ! in front of it):

pip install textract

Next up we want to import this package:

import textract

Great, now we want to read that PDF file into a variable. We do so using textract’s process() function:

# Reading data into a variable
text =…
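As a rough sketch of how process() is commonly called: it takes a file path and returns the extracted text as bytes, so you usually decode the result before working with it. The helper names `decode_extracted` and `extract_pdf_text` and the file name `sample.pdf` are assumptions for illustration, not taken from this article.

```python
def decode_extracted(raw: bytes, encoding: str = "utf-8") -> str:
    """Turn the bytes returned by textract into a clean str (hypothetical helper)."""
    # errors="replace" keeps going if the PDF contains odd byte sequences.
    return raw.decode(encoding, errors="replace").strip()


def extract_pdf_text(pdf_path: str) -> str:
    """Read a PDF with textract and return its text as a str (hypothetical helper)."""
    # Imported inside the function so this file loads even before
    # `pip install textract` has been run.
    import textract

    raw = textract.process(pdf_path)  # returns bytes
    return decode_extracted(raw)


# Example usage, assuming the PDF was saved next to your script or notebook:
# print(extract_pdf_text("sample.pdf"))
```

Keeping the decoding in its own small function makes it easy to change the encoding if a particular PDF comes out garbled.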
