How to scrape data from Reddit using Python!

Welcome back! Reddit is a massive platform with tons of different data points, so let’s go ahead and scrape some data from this platform!

First off

As always, if you have any suggestions, thoughts or just want to connect, feel free to contact / follow me on Twitter! Also, below is a link to some of my favorite resources for learning programming, Python, R, Data Science, etc.

A basic introduction you could probably skip (copied from my other article)

First things first, we will need to have Python installed; read my article here to make sure you have Python and an IDE set up. Next, I wrote an article on using Selenium in Python. Selenium is a browser automation package that lets us drive a real web browser from Python, which is exactly what we need for scraping. It might be best to read that article for more background on web scraping, but it's not a necessity; you can read that article here.

Let’s get started!

Now that we have our Python environment set up, let's open up a blank Python script. We'll need the Selenium package you hopefully installed in the previous paragraph (just pip install selenium; you may also want pip install pandas webdriver-manager to cover the rest of the imports below). Once installed, import the following packages:

#IMPORT THESE PACKAGES
import selenium
from selenium import webdriver
import pandas as pd
#OPTIONAL PACKAGE, BUT MAYBE NEEDED TO MANAGE THE CHROME DRIVER
from webdriver_manager.chrome import ChromeDriverManager

As I stated in my previous articles, we are using the Google Chrome browser as our GUI, but you can use other browsers within Selenium. If you'd like to use a different browser, go for it (a Firefox example follows the Chrome setup below); just make sure you have that specific browser installed on your machine.

Within Selenium we need to define our web browser, so let’s do so by using the following line of code:

#THIS INITIALIZES THE DRIVER (AKA THE WEB BROWSER)
driver = webdriver.Chrome(ChromeDriverManager().install())
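If you do want to swap Chrome out, here is a minimal sketch of the same setup using Firefox instead. It assumes you have Firefox installed and uses the GeckoDriverManager class from the same webdriver_manager package; treat it as a starting point rather than a tested drop-in:

#OPTIONAL: THE SAME SETUP, BUT WITH FIREFOX INSTEAD OF CHROME
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager

#THIS INITIALIZES A FIREFOX BROWSER AS THE DRIVER
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())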

I would recommend running all of the code you just typed out to check that a blank Google Chrome window opens up. If it does, you're doing great 👍!

At this point we want to create an empty Pandas data frame; this gives us somewhere to store the data we scrape from Reddit so we can work with it later. For this article, I wanted to scrape some of the most common data points of a Reddit post: the post title, the upvotes (likes) and how old the post is (time). This is exactly how to set up that data frame within Python:

data1 = {'Post Title':[], 'Upvotes':[], 'Time':[]}
fulldf = pd.DataFrame(data1)

So in this example our data frame has 3 columns: Post Title, Upvotes and Time, and we store them in a pandas data frame called “fulldf”. Next up, we want to point our Google Chrome instance at the specific Reddit page we want to scrape; in this case we'll be using the Python subreddit, whose URL is https://www.reddit.com/r/Python/ . This is how we tell our Chrome browser to navigate to that page using Selenium:

import time #NEEDED FOR time.sleep BELOW

driver.get("https://www.reddit.com/r/Python/")
time.sleep(2)

The “time.sleep” command simply pauses the Python program; this is sometimes needed in case the website hasn't finished loading before the next line of code runs. Next up, we want to set up the variables that we will later add to the data frame we created before. To do this, we'll create 3 different variables: TITLE, UPVOTES and TIME. We declare our element scrapers by copying and pasting the following code below:

TITLE = driver.find_element_by_xpath('PASTE XPATH HERE').text
UPVOTES = driver.find_element_by_xpath('PASTE XPATH HERE').text
TIME = driver.find_element_by_xpath('PASTE XPATH HERE').text

We will be replacing the 'PASTE XPATH HERE' placeholders in the code above.
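A quick aside before we grab those XPaths: the fixed time.sleep(2) works, but if the page loads slowly you can still end up looking for an element that hasn't rendered yet. Selenium also has explicit waits, which pause only until the element actually shows up. Here's a minimal sketch of that approach (the placeholder XPath still needs to be filled in, just like above):

#OPTIONAL: WAIT FOR THE ELEMENT INSTEAD OF SLEEPING FOR A FIXED 2 SECONDS
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

#WAIT UP TO 10 SECONDS FOR THE ELEMENT TO APPEAR, THEN GRAB ITS TEXT
TITLE = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, 'PASTE XPATH HERE'))
).text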

Great, now all we have to do is get the XPath of the exact data points that we need. Head over to that specific URL, find a specific post, right click the title > click Inspect > right click on the highlighted portion within the inspector > click Copy > and click Copy full XPath. Use the image below to help you navigate there:

Once copied, paste it into the TITLE variable in the code from before.

We want to do the same thing with the upvotes:

And the time as well:

Make sure to copy each one into its respective line of code. Our complete code should now look something like this:

Finally, we want to put the data into a row and append that row to the “fulldf” data frame we made before; this is the code to do so:

row = [TITLE, UPVOTES, TIME]
fulldf.loc[len(fulldf)] = row

Awesome, that's it! Here is the final code for me; I added a few extra things, but it essentially does the same thing!

import time
import selenium
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd


# SETTING UP THE DRIVER TO USE CHROME
driver = webdriver.Chrome(ChromeDriverManager().install())

data1 = {'Post Title':[], 'Upvotes':[], 'Time':[]}
fulldf = pd.DataFrame(data1)

driver.get("https://www.reddit.com/r/Python/")
time.sleep(2)

TITLE = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[2]/div/div/div/div[2]/div[3]/div[1]/div[4]/div[4]/div/div/div[2]/div[2]/div[2]/a/div/h3').text
UPVOTES = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[2]/div/div/div/div[2]/div[3]/div[1]/div[4]/div[4]/div/div/div[1]/div/div').text
TIME = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/div[2]/div/div/div/div[2]/div[3]/div[1]/div[4]/div[4]/div/div/div[2]/div[1]/div/div[1]/a').text
print(TITLE)
print(UPVOTES)
print(TIME)
row = [TITLE, UPVOTES, TIME]
fulldf.loc[len(fulldf)] = row
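One thing this script never does is close the browser it opened. Once you're done scraping, you can shut the Chrome window down from Python as well; a tiny optional addition to the end of the script:

#CLOSE THE BROWSER ONCE WE'RE DONE SCRAPING
driver.quit()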

When you run this code you will get a Chrome browser pop-up on your screen; you will then see the browser go straight to the Reddit page, and the data points will be printed in your Python console, just like this:

That’s it! Now this is a very bare bones project, but a very good start for web scraping. I would encourage you to think of ways to improve on this program: could you scrape more than one post at a time (see the sketch below for one way to start)? Maybe the whole page of posts? Could you make a front end to paste links into? These are great ways to build upon this project.
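As a starting point for that first idea, here is a rough sketch of how you might grab every post title on the page instead of just one. It uses find_elements_by_xpath (the plural version) and matches every h3 tag, since the full XPath we copied earlier ends in an h3; that selector is an assumption based on Reddit's layout at the time of writing, so verify it in the inspector and expect to filter out the odd extra heading:

import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

#SAME DRIVER SETUP AS BEFORE
driver = webdriver.Chrome(ChromeDriverManager().install())

#ONE-COLUMN DATA FRAME JUST FOR THE TITLES
fulldf = pd.DataFrame({'Post Title': []})

driver.get("https://www.reddit.com/r/Python/")
time.sleep(2)

#FIND ALL TITLE ELEMENTS ON THE PAGE INSTEAD OF JUST ONE
#(THE FULL XPATH WE COPIED ENDED IN AN h3 TAG, SO WE MATCH EVERY h3 HERE)
titles = driver.find_elements_by_xpath('//h3')

#APPEND ONE ROW PER TITLE
for t in titles:
    fulldf.loc[len(fulldf)] = [t.text]

print(fulldf)

#CLOSE THE BROWSER WHEN FINISHED
driver.quit()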
