How to scrape data from any website using R!

Manpreet Singh
5 min read · Apr 24, 2021

Welcome back! Web scraping is one of my favorite things to do (if you couldn’t tell from the many articles I’ve written about it), so let’s do some web scraping using the fantastic programming language R! This is a very beginner-friendly tutorial, but I’m assuming you have R installed on your machine and know a little bit about how the language works. If that sounds like you, then let’s get started!

Installation

The specific package we’re going to be using is rvest, which is pretty much Beautiful Soup (the Python package) but for our R environment. To install it, run the following command in your R console:

install.packages("rvest")

Awesome! You’ve just installed the package for this tutorial!
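If you’d rather not reinstall the package every time a script runs, a minimal sketch like the one below installs rvest only when it is missing and then loads it. The CRAN mirror URL is my assumption; any mirror you normally use will do.

```r
# Install rvest only if it is not already available.
# The mirror URL is an assumption; swap in your preferred CRAN mirror.
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest", repos = "https://cloud.r-project.org")
}

# Load the package so its functions can be called directly.
library(rvest)
```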

Understanding HTML

Before we start pulling data, we first need to understand how the data we want is laid out on the page. When I started using this package, I saw a ton of tutorials speed past this part, which left me stuck on a lot of basic steps, so this is a very important concept to understand in web scraping. Let’s take a look at the following HTML code:

<!DOCTYPE html>
<html>
<body>
<p1>This is a test.</p1>
<p2>This is not a test.</p2>
<p3>This is still a test.</p3>
</body>
</html>

As most of you know, nearly every web page is built using HTML, CSS, JavaScript, or some combination of these languages. HTML is essentially where the raw data lives, and since we’re scraping text, this is where we want to focus our attention. Let’s say we wanted to scrape the “This is a test.” text: we would point our web scraper at the p1 HTML tag, since it appears only once on the page and so uniquely identifies that text. If we wanted the “This is not a test.” text instead, we would scrape the text from the p2 HTML tag. (p1, p2 and p3 are made-up tag names that keep this example simple; real pages use standard tags such as p, usually told apart by an id or class attribute.)
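To make that concrete, here is a rough sketch of how pointing the scraper at a tag could look with rvest. It assumes rvest 1.0.0 or newer (for html_element()), parses the HTML from a string rather than a live site, and uses the made-up p1 and p2 tags from the example with matching closing tags. The variable names are just for illustration.

```r
library(rvest)

# The example page as a string (assumption: closing tags match the opening ones).
html <- "<!DOCTYPE html>
<html>
<body>
<p1>This is a test.</p1>
<p2>This is not a test.</p2>
<p3>This is still a test.</p3>
</body>
</html>"

# Parse the HTML; read_html() also accepts a URL in place of the string.
page <- read_html(html)

# Select the first element matching each tag name and pull out its text.
first_text  <- html_text(html_element(page, "p1"))
second_text <- html_text(html_element(page, "p2"))

print(first_text)
print(second_text)
```

On a real page you would usually pass a more specific CSS selector, such as a standard tag combined with an id or class, for example html_element(page, "p#intro"), where intro is a hypothetical id.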

Also, if you’re wondering how you can find the HTML code of a website, almost every browser (including Safari, Firefox and Chrome) allows you to see it. To do so, right-click on any part of the page and select Inspect; you may have to enable this functionality in your browser’s settings. That’s a very quick walkthrough but that type of…
