How to scrape data from Instagram with Python!

Welcome back! Instagram is a massive social media platform with tons of data, so let’s scrape some using Python! This is a pretty basic beginner project: we’ll give our code an Instagram URL, run the program, pull out specific data points and store them in a data frame. If that sounds like fun, let’s get started!

Basic introduction you could probably skip that I copied from my other article

First things first, we will need to have Python installed. Read my article here to make sure you have Python and an IDE installed. I also wrote an article on using Selenium in Python. Selenium is a web scraping package that lets us drive a web browser from Python. It might be best to read that article for more background on web scraping, but it’s not a necessity. You can read that article here.

Python has several different web scraping packages, and Beautiful Soup and Selenium are two of them. In this tutorial we’re going to be using Selenium. In another article we’ll talk about Beautiful Soup.

Instagram Website

Before we write any code, we want to check which data points we want from the website. Let’s go open up an Instagram page. I used Kim Kardashian’s account below:

As you can see, there are a few data points we can scrape: the number of posts, followers and following. Let’s develop some code to scrape all three.

Developing The Code

Starting off in our Python environment, we want to import the Selenium package, the pandas package and the webdriver manager we need for this project. We do so by using the following lines of code:

#IMPORT THESE PACKAGES
import selenium
from selenium import webdriver
import pandas as pd
#OPTIONAL PACKAGE, BUT MAY BE NEEDED
from webdriver_manager.chrome import ChromeDriverManager

Next up, we want to install and declare our driver and point it to a website. Our driver is pretty much the web browser we’re using. To do this, use the following code:

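#THIS INITIALIZES THE DRIVER (AKA THE WEB BROWSER)
#SELENIUM 4 TAKES THE DRIVER PATH THROUGH A SERVICE OBJECT
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))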
#THIS PRETTY MUCH TELLS THE WEB BROWSER WHICH WEBSITE TO GO TO
driver.get('https://www.instagram.com/kimkardashian/?hl=en')
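
If you want to point the browser at a different account, one option is to build the URL from the username instead of typing it out each time. This is only a sketch: profile_url is a made-up helper name, and it assumes every profile page follows the same URL pattern as the one above.

```python
from urllib.parse import quote


def profile_url(username, language='en'):
    """Build a profile URL in the same shape as the one used in this tutorial.

    profile_url is a hypothetical helper, not part of Selenium.
    """
    # Allow the handle to be typed with or without a leading @
    handle = username.strip().lstrip('@')
    return f'https://www.instagram.com/{quote(handle)}/?hl={language}'


print(profile_url('@kimkardashian'))
```

The result can then be passed straight to driver.get().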

The driver.get function just tells the browser which website we want to go to. Next up, we want to declare 3 variables: Posts, Followers and Following. These will hold the text of those values from the website (it will make more sense in a second). We will use the Selenium function driver.find_element() with the 'xpath' locator to get the text from the website and store it inside of those variables. Recent Selenium 4 releases removed the older driver.find_element_by_xpath() helper, so this is how we would do that:

#NUMBER OF POSTS
Posts = driver.find_element('xpath', 'PUT FULL XPATH HERE').text
#NUMBER OF FOLLOWERS
Followers = driver.find_element('xpath', 'PUT FULL XPATH HERE').text
#NUMBER FOLLOWING
Following = driver.find_element('xpath', 'PUT FULL XPATH HERE').text

Next up, we want to get the full XPath. Open up your web browser, go to that specific web page and right click on the number of posts. Click Inspect, look at the highlighted line in the inspector panel, right click on it, then click Copy > Copy full XPath. Use the following image as a resource:

We now want to copy and paste that full xpath in between the quotes in the corresponding variable, so that code will now look like this:

#NUMBER OF POSTS
Posts = driver.find_element('xpath', '/html/body/div[1]/section/main/div/ul/li[1]/a/span').text

Awesome, we now want to do the exact same thing for the other variables. Right click on the followers number and click Inspect, look at the highlighted line in the inspector panel, right click on it, then click Copy > Copy full XPath. Use the following image as a resource:

We now want to copy and paste that full xpath in between the quotes in the corresponding variable, so that code will now look like this:

#NUMBER OF FOLLOWERS
Followers = driver.find_element('xpath', '/html/body/div[1]/section/main/div/ul/li[2]/a/span').text

Finally, we want to do the exact same thing for the following number. Right click on the following number and click Inspect, look at the highlighted line in the inspector panel, right click on it, then click Copy > Copy full XPath. Use the following image as a resource:

We now want to copy and paste that full xpath in between the quotes in the corresponding variable, so that code will now look like this:

#NUMBER FOLLOWING
Following = driver.find_element('xpath', '/html/body/div[1]/section/main/div/ul/li[3]/a/span').text
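
Since the three lookups only differ by their XPath, one way to tidy them up is a small helper that loops over all of them. This is just a sketch: PROFILE_XPATHS and scrape_counts are names made up for this example, it assumes driver is the Selenium driver created earlier, and it uses driver.find_element('xpath', ...), which is the call available in current Selenium releases.

```python
# Hypothetical mapping of column name -> full XPath copied from the browser
PROFILE_XPATHS = {
    'Posts': '/html/body/div[1]/section/main/div/ul/li[1]/a/span',
    'Followers': '/html/body/div[1]/section/main/div/ul/li[2]/a/span',
    'Following': '/html/body/div[1]/section/main/div/ul/li[3]/a/span',
}


def scrape_counts(driver, xpaths=PROFILE_XPATHS):
    """Return {name: text} for every XPath, using an already-open driver."""
    return {
        name: driver.find_element('xpath', xpath).text
        for name, xpath in xpaths.items()
    }
```

Calling scrape_counts(driver) after driver.get() would give back a dictionary with the same three text values as the Posts, Followers and Following variables.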

Awesome, let’s go ahead and print these variables out when the program is run. We do so by using the following lines of code:

#PRINTS OUT THE DATA PULLED FROM ABOVE
print(Posts)
print(Followers)
print(Following)
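
Keep in mind that these values come back as text, not numbers. If you want to do math on them later, a small converter could help. The sketch below assumes the text looks like '5,912', '12.5k' or '364m'. That format is an assumption, so check what your own run prints first. parse_count is a made-up helper name.

```python
def parse_count(text):
    """Convert a count string such as '5,912' or '12.5k' into an integer.

    Hypothetical helper: the accepted formats are an assumption about
    how the numbers are displayed on the page.
    """
    multipliers = {'k': 1_000, 'm': 1_000_000, 'b': 1_000_000_000}
    cleaned = text.strip().lower().replace(',', '')
    if cleaned and cleaned[-1] in multipliers:
        # Abbreviated value: scale the number in front of the suffix
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(cleaned)


print(parse_count('5,912'))
print(parse_count('12.5k'))
```

Anything that does not match these shapes raises a ValueError, which makes an unexpected format easy to spot.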

Awesome! Now all we have to do is create an empty pandas data frame with one column for each of our variables. To do this, use the following lines of code:

#CREATES AN EMPTY DATAFRAME
data1 = {'Posts':[], 'Followers':[], 'Following':[],}
fulldf = pd.DataFrame(data1)

Almost done! Now we collect the three values into a list called row, then add that list as a new row at the end of our pandas data frame. Use the following lines of code to do so:

#APPENDING THE DATA PULLED FROM ABOVE INTO THE EXISTING DATAFRAME
row = [Posts, Followers, Following]
fulldf.loc[len(fulldf)] = row
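
If you end up scraping more than one account, an alternative is to collect every row in a plain list first and build the data frame once at the end. This is a sketch of that idea, not the method used in the rest of this article: build_dataframe is a made-up helper name and the numbers are invented example values, not real scraped data.

```python
import pandas as pd

COLUMNS = ['Posts', 'Followers', 'Following']


def build_dataframe(rows):
    """Build a data frame from a list of [Posts, Followers, Following] rows."""
    return pd.DataFrame(rows, columns=COLUMNS)


# Invented example values, one inner list per account
rows = [
    ['120', '1,204', '310'],
    ['45', '982', '150'],
]
fulldf = build_dataframe(rows)
print(fulldf)
```

Each inner list plays the same role as the row variable above, so you could append one to rows after every page you scrape.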

Awesome! This is all of the code (with some extras) we developed in this project:

#IMPORT THESE PACKAGES
import selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import pandas as pd
#OPTIONAL PACKAGE, BUT MAY BE NEEDED
from webdriver_manager.chrome import ChromeDriverManager

#THIS INITIALIZES THE DRIVER (AKA THE WEB BROWSER)
#SELENIUM 4 TAKES THE DRIVER PATH THROUGH A SERVICE OBJECT
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

#THIS PRETTY MUCH TELLS THE WEB BROWSER WHICH WEBSITE TO GO TO
driver.get('https://www.instagram.com/kimkardashian/?hl=en')

#NUMBER OF POSTS
Posts = driver.find_element('xpath', '/html/body/div[1]/section/main/div/ul/li[1]/a/span').text
#NUMBER OF FOLLOWERS
Followers = driver.find_element('xpath', '/html/body/div[1]/section/main/div/ul/li[2]/a/span').text
#NUMBER FOLLOWING
Following = driver.find_element('xpath', '/html/body/div[1]/section/main/div/ul/li[3]/a/span').text

#PRINTS OUT THE DATA PULLED FROM ABOVE
print(Posts)
print(Followers)
print(Following)

#CREATES AN EMPTY DATAFRAME
data1 = {'Posts':[], 'Followers':[], 'Following':[],}
fulldf = pd.DataFrame(data1)

#APPENDING THE DATA PULLED FROM ABOVE INTO THE EXISTING DATAFRAME
row = [Posts, Followers, Following]
fulldf.loc[len(fulldf)] = row

Running The Program

There are 2 main ways to run this program: running the .py file from your command prompt / terminal, or running the program line by line. I personally like running it line by line. Either way, when you run the program you will see the Chrome browser pop up on your display and navigate to the Instagram account page, and the number of posts, followers and following will print out in your Python console!

Awesome! You have just scraped some data from Instagram! As always, I would encourage you to look into ways you can improve this project, maybe by making a front end using Streamlit! I hope you enjoyed reading this article!

As Always

As always, if you have any suggestions, thoughts or just want to connect, feel free to contact / follow me on Twitter! Also, below is a link to some of my favorite resources for learning programming, Python, R, Data Science, etc.

Thanks for reading!

Data Scientist / Engineer
