How to scrape a website

Published June 11, 2018
Reading time: 5 minutes

We often hear about how much data is on the web and how it’s growing exponentially from year to year.

That leads to discussions of Big Data, and Machine Learning, and so on. But in the end, what do YOU do with web data?

The answer is probably nothing, because most websites don't give you easy access to their data. To use that information, you need a way to access it at scale.

Luckily, there's web scraping to the rescue!

Web scraping lets you automatically extract content from websites. You can scrape virtually anything, from e-commerce shops to GitHub repositories.

How it works

First, you have to understand how a web page is created, and particularly how HTML works.

A web browser renders HTML documents. These documents describe the structure of the page semantically.

Think of it as a tree with branches. To render a web page, browsers organize the HTML document into a tree structure called the DOM (Document Object Model).

<!DOCTYPE html>
<html>
    <head>
        <title>This is a title</title>
    </head>
    <body>
        <h1>Heading</h1>
        <p>Hello world!</p>
    </body>
</html>
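To make the tree idea concrete, here is a minimal sketch using Python's built-in html.parser module. It walks the document above and prints each tag and text node indented by its depth. The TreeOutline class is a name I made up for this illustration, and the sketch assumes every tag is explicitly closed (it does not handle void tags such as br).

```python
from html.parser import HTMLParser

HTML_DOC = """<!DOCTYPE html>
<html>
    <head>
        <title>This is a title</title>
    </head>
    <body>
        <h1>Heading</h1>
        <p>Hello world!</p>
    </body>
</html>"""


class TreeOutline(HTMLParser):
    """Hypothetical helper: records each tag and text node with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # An opening tag is a new branch: record it, then go one level deeper.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag brings us back up one level.
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only nodes between tags
            self.lines.append("  " * self.depth + "#text: " + text)


outline = TreeOutline()
outline.feed(HTML_DOC)
outline.close()
print("\n".join(outline.lines))
```

The deeper a line is indented in the printed outline, the deeper the node sits in the tree.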

What you need to keep in mind is that everything is nested.

...
<body>
    <h1>Heading</h1>
    <p>
        <span>I'm nested <b>I'm nested and bold!</b>
            <span>Wow, too much nesting for me, I'm getting lost!
                <span>Wait... can you actually do that?</span>
            </span>
        </span>
    </p>
</body>
...

There are some rules to respect, but that's not the topic of this article.

The elements "h1" and "p" are tags. They can be described by attributes:

<h1 class="nice-heading" id="main-heading">Heading</h1>

Attributes further describe the tags (nodes). They are very, very useful, mostly because they let you describe a path to the data.

When you say "I want to extract data" from the single line of code above, what you mean is the "Heading" value, which is a text value.
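As a rough illustration, here is how that single line looks once parsed in Python. This sketch uses the standard library's xml.etree.ElementTree, which only works here because the snippet happens to be well-formed XML; real-world HTML often is not, so treat that as an assumption.

```python
import xml.etree.ElementTree as ET

# The single-line example from above.
snippet = '<h1 class="nice-heading" id="main-heading">Heading</h1>'
h1 = ET.fromstring(snippet)

print(h1.tag)          # the tag name
print(h1.attrib)       # the attributes, as a dictionary
print(h1.get("id"))    # one attribute looked up by name
print(h1.text)         # the text value you actually want to extract
```

The tag and its attributes tell you where you are in the document; the text is the data you came for.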

But how do you access this data?

Accessing Data

Okay, so let's say we have the following code:

...
<body>
 <div class="container">  
   ...
   <div class="card">
     <h3 class="use-case">Repositories</h3>
     <p>Enrich your business database or find new leads to feed your CRM.</p>
   </div>
   ...
 </div>
</body>
...

The previous (simplified) code outputs the following:

[Image: Scraping Example]

In this case, how do you access the text of the first card described by our HTML code?

Easy! You need to use the XPath language.

The XPath language is based on a tree representation of the XML document, and provides the ability to navigate around the tree, selecting nodes by a variety of criteria.

Remember how I said that an HTML document is like a tree with branches? Well, it's the same for XML. Both are markup languages.

XPath lets you navigate the DOM (remember, HTML is organized into a tree structure with branches!).

In the end, it's very simple to access data, because what you get is the following structure:

div.container
 -- div.card
   -- h3.use-case
     #text
   -- p
     #text

Now, if you want to access the text inside the <p> tag using XPath:

document.xpath('//div[@class="container"]//div[@class="card"]//p/text()')

Basically, this code says: "Take the div container, then go to the div card, and extract the text inside the p tag."

This way, you're able to extract the text "Enrich your business database or find new leads to feed your CRM".
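If you want to try this yourself, here is a minimal runnable sketch in Python using only the standard library. Two assumptions to keep in mind: xml.etree.ElementTree supports only a limited subset of XPath (no text() step, so the text is read from the .text attribute instead), and it requires well-formed markup, so the "..." placeholders from the example are left out. A third-party library such as lxml offers fuller XPath support.

```python
import xml.etree.ElementTree as ET

# The card example from above, trimmed so it parses as well-formed XML.
PAGE = """
<body>
  <div class="container">
    <div class="card">
      <h3 class="use-case">Repositories</h3>
      <p>Enrich your business database or find new leads to feed your CRM.</p>
    </div>
  </div>
</body>
"""

root = ET.fromstring(PAGE.strip())

# Same path as before: container -> card -> p.
# The leading "." makes the search relative to the root element.
paragraph = root.find(".//div[@class='container']//div[@class='card']//p")

# ElementTree has no text() step, so read the text from the element itself.
text = paragraph.text
print(text)
```

The path is the same idea as the XPath expression above, just written in the subset that ElementTree understands.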

Amazing, isn't it?

The Next Level

Now that you understand the basics, you need to dive a bit deeper into programming.

For most scraping use cases, I generally recommend using Python.

There's an amazing community and tons of packages and libraries that you can use to scrape web data.

Here is an example of how to scrape websites with Python and BeautifulSoup.

Among others:

We've only been talking about basic HTML pages, but you probably know that websites nowadays use more and more JavaScript to build very cool stuff.

Unfortunately, JS does not simplify web scraping. But there's a solution to every problem :)

Some examples of useful libraries:

To help you a bit, here's a great XPath Cheatsheet to use whenever you want to access complicated nested data.

If you need help with web scraping, be sure to get in touch.

Be sure to check out our blog to get a sense of what you can do with web scraping.


Guillaume Odier
Co-founder