How to scrape Twitter data without an API

Welcome back! If you follow my articles you know I love scraping data, but you don’t care about my background, so let’s talk about scraping some Twitter data! In this tutorial we will be using Python, and we’ll be pulling the data straight from Twitter’s website. That means no API, no rate limits, and no credentials getting in the way of us collecting valuable data!

First off

As always, if you have any suggestions, thoughts or just want to connect, feel free to contact / follow me on Twitter! Also, below is a link to some of my favorite resources for learning programming, Python, R, Data Science, etc.

Basic introduction you could probably skip that I copied from my other article

First things first, we will need to have Python installed; read my article here to make sure you have Python and an IDE set up. Next, I wrote an article on using Selenium in Python. Selenium is a browser automation package that lets us control a real web browser from Python, which is exactly what we need for scraping. It might be best to read that article for a better understanding of web scraping, but it’s not a necessity; you can read that article here.

Let’s get started!

Now that we have our Python environment set up, let’s open up a blank Python script. Let’s import the Selenium package that you hopefully installed in the previous section (just pip install selenium; the optional webdriver manager is pip install webdriver-manager). Once installed, import the following packages:

#IMPORT THESE PACKAGES
import time
import selenium
from selenium import webdriver
import pandas as pd
#OPTIONAL PACKAGE, BUT MAYBE NEEDED
from webdriver_manager.chrome import ChromeDriverManager

As I stated in my previous articles, we are using the Google Chrome browser as our GUI, but you can use other browsers within Selenium. If you’d like to use a different browser, go for it! Just make sure to have that specific browser installed on your machine.

Within Selenium we need to define our web browser, so let’s do so by using the following line of code:

#THIS INITIALIZES THE DRIVER (AKA THE WEB BROWSER)
driver = webdriver.Chrome(ChromeDriverManager().install())
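One heads-up: this positional-path style works on Selenium 3.x, but on Selenium 4 and newer the driver path has to go through a Service object instead. A minimal sketch of the newer form:

# Selenium 4+ equivalent: wrap the driver path in a Service object
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))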

I would recommend running all of the code you just typed out to see if a blank Google Chrome window opens up. If so, you’re doing great 👍 !

At this point we want to create an empty Pandas data frame; this will let us store our data from Twitter in a data frame that we can append to and call later. For this article, I wanted to scrape some of the most common data points of a Twitter post: the text of the Tweet, and the likes and retweets for that post. This is exactly how to set up this data frame within Python:

data1 = {'Tweet': [], 'Likes': [], 'Retweets': []}
fulldf = pd.DataFrame(data1)
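To see how rows will land in this frame later, here’s a quick sanity check with made-up sample values (hypothetical, not scraped data):

# Append one hypothetical row and print to confirm the column layout
fulldf.loc[len(fulldf)] = ['Hello #programming!', '12', '3']
print(fulldf)  # one row with Tweet, Likes, Retweets columns

If you try this, recreate the empty frame afterwards so the sample row doesn’t end up in your scraped data.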

Awesome! Now, when you think of Twitter, think about the different ways a tweet can be discovered: a user retweeting it, someone sending it to you, etc. In this specific case we’ll be using a hashtag page to get some tweets from. This program will essentially go to a specific hashtag page, scrape the first tweet that it finds, and store that data in our data frame. Now, this is a pretty simple task, but these are the building blocks for your future ☺️

At this point, we want our opened Chrome browser to go to a specific hashtag web page. To do this we call Selenium’s “driver.get” function and place our link within the quotes. This may sound intense, but this is the line we use:

driver.get("INSERT LINK HERE")

All we have to do now is find the specific web page we want to bring in. Let’s say we wanted the page for the hashtag “programming”: we go to that page and copy the link, then insert it into the line above. Our code will look like this:

driver.get("https://twitter.com/search?q=%23programming&src=typed_query")
time.sleep(10)

The “time.sleep(10)” line just tells the program to wait 10 seconds before going to the next line (this is why we imported time earlier). That pause is important if you’re loading a web page that has a lot of elements (like Twitter); you want to make sure to have some sort of command like that.
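A fixed sleep works, but it either wastes time or isn’t long enough. A more robust option is Selenium’s explicit waits; here’s a minimal sketch, assuming tweets render inside article elements (which was true of Twitter’s markup at the time of writing):

# Wait up to 10 seconds for the first tweet (<article>) to appear, then continue
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'article')))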

We are almost done. Now all we have to do is select the specific elements we want from that page and copy the full XPath for each one. To do this, we first store our variables using Selenium’s “driver.find_element_by_xpath” function; this scrapes the data straight into a variable which we can pull into the data frame we made before. This is the code to do so:

Tweet = driver.find_element_by_xpath('INSERT XPATH HERE').text
Likes = driver.find_element_by_xpath('INSERT XPATH HERE').text
Retweets = driver.find_element_by_xpath('INSERT XPATH HERE').text

We will fill in those XPaths in a moment, but try to understand the process in these lines.
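One more heads-up: on Selenium 4 and newer, find_element_by_xpath was removed. The equivalent call, sketched with the same placeholder, is:

# Selenium 4+ equivalent of find_element_by_xpath
from selenium.webdriver.common.by import By

Tweet = driver.find_element(By.XPATH, 'INSERT XPATH HERE').text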

Next up, with the web page loaded in our Selenium Google Chrome environment, we want to right click on the text we want to store (in this case the actual Tweet), then click Inspect or Inspect Element. The browser’s developer-tools panel will open up.

Great! At this point, hover over the highlighted line of code in the panel and confirm it matches the text you see on the web page. All you have to do now is right click on that highlighted portion of code > Click Copy > then Copy full XPath.

We then paste this into the Tweet variable line we wrote before; the completed line will look like this:

Tweet = driver.find_element_by_xpath(
'/html/body/div/div/div/div[2]/main/div/div/div/div/div/div[2]/div/div/section/div/div/div[1]/div/div/article/div/div/div/div[2]/div[2]/div[2]/div[1]/div/span[1]').text
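Fair warning: absolute XPaths like this are brittle, since Twitter reshuffles its markup often, so don’t be surprised if yours looks different or stops working. A possibly more stable alternative is a relative selector on the tweet-text container (the data-testid value here is an assumption about Twitter’s markup at the time of writing):

# Hedged alternative: target the tweet text by its data-testid attribute
Tweet = driver.find_element_by_xpath(
    '//article//div[@data-testid="tweetText"]').text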

Awesome! We now want to do the same with the likes: hover over the like count > right click on it > click Inspect > go to the highlighted portion of the code and right click on it > Copy > Copy full XPath.

We then paste this into the Likes variable line from before; the completed line will look like this:

Likes = driver.find_element_by_xpath(
'/html/body/div/div/div/div[2]/main/div/div/div/div/div/div[2]/div/div/section/div/div/div[1]/div/div/article/div/div/div/div[2]/div[2]/div[2]/div[3]/div[3]/div/div/div[2]/span/span').text

Awesome! We now want to do the same with retweets: hover over the retweet count > right click on it > click Inspect > go to the highlighted portion of the code and right click on it > Copy > Copy full XPath.

We then paste this into the Retweets variable line from before; the completed line will look like this:

Retweets = driver.find_element_by_xpath(
'/html/body/div/div/div/div[2]/main/div/div/div/div/div/div[2]/div/div/section/div/div/div[1]/div/div/article/div/div/div/div[2]/div[2]/div[2]/div[3]/div[2]/div/div/div[2]/span/span').text

Finally, we want to gather our three data points into a row and append it to the “fulldf” data frame we made before. To do this, use the following lines:

row = [Tweet, Likes, Retweets]
fulldf.loc[len(fulldf)] = row
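The “fulldf.loc[len(fulldf)]” trick writes the row at the next free index. If you want the data to outlive the browser session, pandas can dump the frame to disk (the filename here is my own choice):

# Persist the scraped rows to a CSV file
fulldf.to_csv('tweets.csv', index=False)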

Awesome! Here is our completed code block. I did make some additions to it, but for the most part it’s the same thing:

import time
import selenium
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

# SETTING UP THE DRIVER TO USE CHROME
driver = webdriver.Chrome(ChromeDriverManager().install())

data1 = {'Tweet': [], 'Likes': [], 'Retweets': []}
fulldf = pd.DataFrame(data1)

driver.get("https://twitter.com/search?q=%23programming&src=typed_query")
time.sleep(10)

Tweet = driver.find_element_by_xpath(
'/html/body/div/div/div/div[2]/main/div/div/div/div/div/div[2]/div/div/section/div/div/div[1]/div/div/article/div/div/div/div[2]/div[2]/div[2]/div[1]/div/span[1]').text
Likes = driver.find_element_by_xpath(
'/html/body/div/div/div/div[2]/main/div/div/div/div/div/div[2]/div/div/section/div/div/div[1]/div/div/article/div/div/div/div[2]/div[2]/div[2]/div[3]/div[3]/div/div/div[2]/span/span').text
Retweets = driver.find_element_by_xpath(
'/html/body/div/div/div/div[2]/main/div/div/div/div/div/div[2]/div/div/section/div/div/div[1]/div/div/article/div/div/div/div[2]/div[2]/div[2]/div[3]/div[2]/div/div/div[2]/span/span').text
print(Tweet)
print(Likes)
print(Retweets)
row = [Tweet, Likes, Retweets]
fulldf.loc[len(fulldf)] = row

Great! Now let’s run the program! When you run it, you will see the Google Chrome browser open up > navigate straight to that web page > and within a few seconds the selected data points will be printed in the Python console and stored in our data frame!

That’s pretty much it! Now, as I mentioned before, this is a pretty basic project, but think about ways you can expand on it. Could you pull more than one tweet at a time (hint: use a loop and iterate through the numbers in the XPath, as sketched below)? Could you build a front end where someone pastes in a hashtag and searches through the tweets that way (hint: use Streamlit or another GUI-building Python tool and vary the hashtag in the URL)? These are some great ways to add to this project and improve your skill set!
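Here’s a rough sketch of that first hint. The XPath template is an assumption based on the single-tweet paths above, with the section’s child index swapped for a loop variable; Twitter’s markup changes often, so treat it as a pattern rather than gospel:

# Grab the first few tweets by iterating the div index in the xpath
for i in range(1, 6):
    base = ('/html/body/div/div/div/div[2]/main/div/div/div/div/div/div[2]'
            f'/div/div/section/div/div/div[{i}]/div/div/article/div/div/div'
            '/div[2]/div[2]/div[2]')
    try:
        tweet = driver.find_element_by_xpath(base + '/div[1]/div/span[1]').text
        likes = driver.find_element_by_xpath(
            base + '/div[3]/div[3]/div/div/div[2]/span/span').text
        retweets = driver.find_element_by_xpath(
            base + '/div[3]/div[2]/div/div/div[2]/span/span').text
        fulldf.loc[len(fulldf)] = [tweet, likes, retweets]
    except Exception:
        break  # no tweet at this index; stop looping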
