
How to scrape websites with Python and BeautifulSoup

Published: November 8, 2018
Reading time: 5 minutes

What do you do when you can't download a website's information? Do you copy it by hand? Wow, you're brave!

I'm a web developer, so I'm way too lazy to do things manually :)

If you're about to scrape data for the first time, go ahead and read How To Scrape A Website.

Today, let's say that you need to enrich your CRM with company data.

To make it interesting for you, we will scrape Angel List.

More specifically, we'll scrape Uber's company profile.

Please scrape responsibly!

Getting started

Before starting to code, be sure to have Python 3 installed, as we won't cover the installation here. Chances are you already have it.

You also need pip, Python's package manager. Recent Python 3 releases ship with it; if yours doesn't, you can bootstrap it with:

python -m ensurepip --upgrade

The full code and dependencies are available here.

We'll be using BeautifulSoup, the standard Python library for parsing the HTML you scrape.

pip install BeautifulSoup4

You could also create a virtual environment and install all the dependencies inside the requirements.txt file:

pip install -r requirements.txt
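If you haven't set up the virtual environment yet, a typical setup on macOS or Linux looks like this (venv is just a conventional folder name):

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt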

Inspecting Content

Open https://angel.co/uber in your web browser (I recommend using Chrome).

Right-click and open your browser's inspector.

Hover your cursor on the description.

This example is pretty straightforward: you want the <h2> tag with the js-startup_high_concept class.

Thanks to that class, this is a unique location for our data.

Extracting Data

Let's dive right in with a bit of code:

from urllib import request
from bs4 import BeautifulSoup

# we'll get back to this
headers = {}

# the Uber company page you're about to scrape!
company_page = 'https://angel.co/uber'

# open the page
page_request = request.Request(company_page, headers=headers)
page = request.urlopen(page_request)

# parse the html using beautifulsoup
html_content = BeautifulSoup(page, 'html.parser')
description = html_content.find('h2', attrs={'class': 'js-startup_high_concept'})

print(description)

Let's get into the details:

  • We create a variable headers (more on this very soon)
  • The company_page variable is the page we're targeting
  • Then we build our request: we pass the company_page and headers variables to the Request object, then open the URL with that parameterized request.
  • We parse the HTML response with BeautifulSoup
  • We look for our text content with the find() method
  • We print our result!

Save this as script.py and run it in your shell, like this: python script.py.

You should get the following:

urllib.error.HTTPError: HTTP Error 403: Forbidden

Oh :( What happened?

Well, it seems that AngelList has detected that we are a bot. Clever people!

Okay, so change the headers variable for this one:

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

Run the code with python script.py. Now it should be good:

<h2 class="js-startup_high_concept u-fontSize15 u-fontWeight400 u-colorGray3">
 The better way to get there
</h2>

Yeah! Our first piece of data :D

Want to find the website? Easy:

# we extract the website
website = html_content.find('a', attrs={'class': 'company_url'})

print(website)

And you get:

<a class="u-uncoloredLink company_url" href="http://www.uber.com/" rel="nofollow noopener noreferrer" target="_blank">uber.com</a>

Ok, but how do I get the value of the website?

Easy. Tell the program to extract the href:

print(website['href'])

Make sure to use the strip() method, otherwise you'll end up with leading and trailing whitespace:

description = description.text.strip()
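One thing to keep in mind: find() returns None when the element isn't on the page, and indexing a missing attribute raises a KeyError. Here's a slightly more defensive sketch (not in the original script) that falls back to an empty string instead of crashing:

# html_content is the BeautifulSoup object we built earlier
description_tag = html_content.find('h2', attrs={'class': 'js-startup_high_concept'})
description = description_tag.text.strip() if description_tag else ''

website_tag = html_content.find('a', attrs={'class': 'company_url'})
# .get() returns a default instead of raising KeyError when href is missing
website = website_tag.get('href', '').strip() if website_tag else ''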

I won't cover in detail all the elements you could extract. If you're having issues, you can always check this amazing XPath cheatsheet.

Save results to CSV

Pretty useless to print data, right? We should definitely save it!

The Comma-Separated Values format is really a standard for this purpose.

You can import it very easily in Excel or Google Sheets.

import csv

# open the csv in append (a) mode so new rows add to previous ones
with open('angel.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([description, website])

What you get is a single line of data. Since we told the program to append every result, new lines won't erase previous results.
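If you want column names at the top of the file, here's a minimal sketch that writes a header row only when angel.csv doesn't exist yet (the column names are just the two fields we extracted above):

import csv
import os

csv_path = 'angel.csv'
# write the header only once, when the file is first created
write_header = not os.path.exists(csv_path)

with open(csv_path, 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    if write_header:
        writer.writerow(['description', 'website'])
    # description and website come from the extraction steps above
    writer.writerow([description, website])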

Check out the whole script

from urllib import request
from datetime import datetime
from bs4 import BeautifulSoup
import csv


# add the correct User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

# the company page you're about to scrape
company_page = 'https://angel.co/uber'

# open the page
page_request = request.Request(company_page, headers=headers)
page = request.urlopen(page_request)

# parse the html using beautiful soup
html_content = BeautifulSoup(page, 'html.parser')

# we parse the title
title = html_content.find('h1')
title = title.text.strip()
print(title)

# we parse the description
description = html_content.find('h2', attrs={'class': 'js-startup_high_concept'})
description = description.text.strip()
print(description)

# we extract the website
website = html_content.find('a', attrs={'class': 'company_url'})
website = website['href'].strip()
print(website)

# open a csv with the append (a) parameter. We also save the date which is always a good indicator.
with open('angel.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([title, description, website, datetime.now()])

Conclusion

It wasn't that hard, right?

We covered a very basic example. You could also add multiple pages and parse them inside a for loop.
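For instance, here's a rough sketch of what that loop could look like, reusing the same extraction code as above; the list of profile URLs is just a placeholder you'd fill with your own targets:

from urllib import request
from bs4 import BeautifulSoup
import csv

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

# placeholder list: put the profile URLs you actually want to scrape here
company_pages = [
    'https://angel.co/uber',
]

with open('angel.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for company_page in company_pages:
        page_request = request.Request(company_page, headers=headers)
        page = request.urlopen(page_request)
        html_content = BeautifulSoup(page, 'html.parser')

        title = html_content.find('h1')
        description = html_content.find('h2', attrs={'class': 'js-startup_high_concept'})
        website = html_content.find('a', attrs={'class': 'company_url'})

        writer.writerow([
            title.text.strip() if title else '',
            description.text.strip() if description else '',
            website.get('href', '').strip() if website else '',
        ])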

Remember how we got blocked by the website's security and resolved this by adding a custom User-Agent?

We wrote a small paper about anti-scraping techniques. It'll help you understand how websites try to block bots.


Guillaume Odier
Co-founder