How to scrape data from any website using Python

Welcome back! Web scraping is one of the most powerful skills you can learn, so let’s scrape some data from a website using Python!

A basic introduction (copied from my other article) that you could probably skip

First things first, we need Python installed; read my article here to make sure you have Python and an IDE set up. I also wrote an article on using Selenium in Python. Selenium is a web scraping package that lets us mimic a web browser using Python. It may help to read that article first for more background on web scraping, but it’s not a necessity; you can read it here.

Python has several web scraping packages; Beautiful Soup and Selenium are two of the most popular. In this tutorial we’re going to use Selenium; we’ll cover Beautiful Soup in another article.

The Website

First off, we need a website to scrape some data from; in this case, let’s just use Amazon. Make your way over to a product page on Amazon. I searched for a laptop, clicked on a product, and this is what we see:

As you can see, there are tons of different things we could scrape. Let’s say we want the title and the price of this product; here’s how we would do it.

Developing The Code

Starting off in our Python environment, we want to import the Selenium package, the pandas package, and a webdriver manager we may need for this project. We do so with the following lines of code:

#IMPORT THESE PACKAGES
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
#OPTIONAL PACKAGE, BUT MAYBE NEEDED ON OLDER SELENIUM VERSIONS
from webdriver_manager.chrome import ChromeDriverManager

Next up, we want to declare our driver and point it at a website. The driver is essentially the web browser we’re controlling. To do this, use the following code:

#THIS INITIALIZES THE DRIVER (AKA THE WEB BROWSER)
#ON SELENIUM 4.6+ THE CHROMEDRIVER IS DOWNLOADED AUTOMATICALLY
driver = webdriver.Chrome()
#ON OLDER VERSIONS, USE WEBDRIVER-MANAGER INSTEAD:
#driver = webdriver.Chrome(ChromeDriverManager().install())

#THIS PRETTY MUCH TELLS THE WEB BROWSER WHICH WEBSITE TO GO TO
driver.get('https://www.amazon.com/Acer-Display-Graphics-Keyboard-A515-43-R19L/dp/B07RF1XD36/ref=sr_1_3?dchild=1&keywords=laptop&qid=1618857971&sr=8-3')

The driver.get() function just tells the browser which page to open. Next up, we want to declare two variables, Title and Price, which will hold the text of those values from the page (it will make more sense in a second). We then use Selenium’s driver.find_element() function with an XPath locator to pull the text from the page and store it in those variables. This is how we do that:

#TITLE OF PRODUCT
Title = driver.find_element(By.XPATH, 'PASTE THE FULL XPATH HERE').text
#PRICE OF PRODUCT
Price = driver.find_element(By.XPATH, 'PASTE THE FULL XPATH HERE').text

Next up, we need the full XPath. Open your web browser, go to that specific page, and right-click anywhere on the title text. Click Inspect > find the highlighted portion in the inspector console > right-click it > Copy > Copy full XPath. Use the following image as a resource:

We then want to paste that XPath between the quotes in the Title variable above; it should now look something like this:

Title = driver.find_element(By.XPATH, '/html/body/div[2]/div[3]/div[9]/div[4]/div[4]/div[1]/div/h1/span').text
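If you’re curious what a full XPath is actually doing, it just walks the document tree one tag at a time. Here’s a self-contained sketch using Python’s built-in ElementTree on a toy page (the markup and title are made up for illustration, not Amazon’s real HTML):

```python
import xml.etree.ElementTree as ET

# A tiny, made-up page standing in for a real product page
page = """
<html>
  <body>
    <div>
      <h1><span id="productTitle">Acer Aspire 5 Laptop</span></h1>
      <span>$364.99</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)
# Follow the path one tag at a time, just like a full XPath does:
# html -> body -> div -> h1 -> span
title = root.find('./body/div/h1/span').text
print(title)
```

The full XPath Selenium uses works the same way, which is also why it’s brittle: if Amazon changes the page layout, the path breaks.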

Awesome! Next up, let’s do the same thing for the price: right-click anywhere on the price text, click Inspect > find the highlighted portion in the inspector console > right-click it > Copy > Copy full XPath. Use the following image as a resource:

Then paste that into the Price variable above; it will now look like this:

Price = driver.find_element(By.XPATH, '/html/body/div[2]/div[3]/div[9]/div[4]/div[4]/div[10]/div[1]/div/table/tbody/tr/td[2]/span[1]').text
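One thing to keep in mind: the Price variable holds text, not a number. If you want to do math on it later, you’ll need to clean it up first. A minimal sketch, assuming a “dollar sign plus digits” format (the value here is made up):

```python
# Selenium's .text returns a string, so a scraped price looks like
# "$1,249.99" rather than a number (this value is made up)
raw_price = "$1,249.99"

# Strip the currency symbol and thousands separator, then convert
price = float(raw_price.replace('$', '').replace(',', ''))
print(price)  # 1249.99
```

Real pages can also show ranges or “See price in cart,” so in practice you’d want to handle the cases where this conversion fails.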

Awesome! Now all we have to do is create an empty pandas DataFrame with our column names, then append our scraped values to it as a row. To do this, use the following lines of code:

#CREATES AN EMPTY DATAFRAME
data1 = {'Title': [], 'Price': []}
fulldf = pd.DataFrame(data1)

Almost done! Now we pack the two variables into a row, then append that row to our pandas DataFrame. Use the following lines of code to do so:

#APPENDING THE DATA PULLED FROM ABOVE INTO THE EXISTING DATAFRAME
row = [Title, Price]
fulldf.loc[len(fulldf)] = row
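This row-appending pattern is handy because it scales to more than one product: loop over several pages, append one row per product, and save the frame when you’re done. A sketch with made-up product data (the file name is just an example):

```python
import pandas as pd

# Same empty frame as in the tutorial
fulldf = pd.DataFrame({'Title': [], 'Price': []})

# Pretend these pairs came from two scraped product pages
scraped = [('Acer Aspire 5', '$364.99'), ('Example Mouse', '$19.99')]
for title, price in scraped:
    # Each new row is written at the next integer position
    fulldf.loc[len(fulldf)] = [title, price]

# Save the results so they survive after the browser closes
fulldf.to_csv('products.csv', index=False)
```

Note that appending one row at a time is fine for a handful of products; for thousands of rows it’s faster to collect tuples in a list and build the DataFrame once at the end.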

Awesome! This is all of the code (with some extras) we developed in this project:

#IMPORT THESE PACKAGES
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
#OPTIONAL PACKAGE, BUT MAYBE NEEDED ON OLDER SELENIUM VERSIONS
from webdriver_manager.chrome import ChromeDriverManager

#THIS INITIALIZES THE DRIVER (AKA THE WEB BROWSER)
#ON SELENIUM 4.6+ THE CHROMEDRIVER IS DOWNLOADED AUTOMATICALLY
driver = webdriver.Chrome()
#ON OLDER VERSIONS, USE WEBDRIVER-MANAGER INSTEAD:
#driver = webdriver.Chrome(ChromeDriverManager().install())

#THIS PRETTY MUCH TELLS THE WEB BROWSER WHICH WEBSITE TO GO TO
driver.get('https://www.amazon.com/Acer-Display-Graphics-Keyboard-A515-43-R19L/dp/B07RF1XD36/ref=sr_1_3?dchild=1&keywords=laptop&qid=1618857971&sr=8-3')

#TITLE OF PRODUCT
Title = driver.find_element(By.XPATH, '/html/body/div[2]/div[3]/div[9]/div[4]/div[4]/div[1]/div/h1/span').text
#PRICE OF PRODUCT
Price = driver.find_element(By.XPATH, '/html/body/div[2]/div[3]/div[9]/div[4]/div[4]/div[10]/div[1]/div/table/tbody/tr/td[2]/span[1]').text

#PRINTS OUT THE DATA PULLED FROM ABOVE
print(Title)
print(Price)

#CREATES AN EMPTY DATAFRAME
data1 = {'Title': [], 'Price': []}
fulldf = pd.DataFrame(data1)

#APPENDING THE DATA PULLED FROM ABOVE INTO THE EXISTING DATAFRAME
row = [Title, Price]
fulldf.loc[len(fulldf)] = row

Running The Program

There are two main ways to run this program: running the .py file from your command prompt / terminal, or running it line by line. I personally like running it line by line. Either way, when you run the program you will see the Chrome browser pop up on your display and navigate to the Amazon product page, and the title and price will print out in your Python console!

Awesome! You have just scraped some data from Amazon! As always, I encourage you to look into ways you can improve this project; maybe build a front end using Streamlit! I hope you enjoyed reading this article!

As Always

As always, if you have any suggestions, thoughts or just want to connect, feel free to contact / follow me on Twitter! Also, below is a link to some of my favorite resources for learning programming, Python, R, Data Science, etc.

Thanks for reading!

Data Scientist / Engineer
