How to scrape websites with Python and BeautifulSoup

Guillaume Odier
November 8, 2018

What do you do when you can't download a website's information? You do it by hand? Wow, you're brave!

I'm a web developer, so I'm way too lazy to do things manually :)

If you're about to scrape data for the first time, go ahead and read How To Scrape A Website. You can also read a small intro about web scraping.

Today, let's say that you need to enrich your CRM with company data.

To make it interesting for you, we will scrape AngelList.

More specifically, we'll scrape Uber's company profile.

Please scrape responsibly!

Getting started

Before starting to code, make sure you have Python 3 installed, as we won't cover the installation here. Chances are you already have it.

You also need pip, a package management tool for Python.

The full code and dependencies are available here.

We'll be using BeautifulSoup, a standard Python scraping library.

You could also create a virtual environment and install all the dependencies inside the requirements.txt file:

Inspecting Content

Open https://angel.co/uber in your web browser (I recommend using Chrome).

Right-click and open your browser's inspector.

[Screenshot: the browser inspector. Sorry, it's in French!]

Hover your cursor over the description.

[Screenshot: the selector for Uber's description]

This example is pretty straightforward: you want the <h2> tag with the js-startup_high_concept class.

The class attribute is what pins down the location of our data: it should be the only element on the page carrying that class.
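As a minimal sketch of that lookup, the snippet below runs find() on a small hand-written HTML fragment rather than on the live page. The fragment and its tagline text are made up for illustration; only the tag name and the class name come from the inspector step above.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the relevant part of the page
# (an assumption for illustration, not real AngelList markup).
sample_html = """
<div>
  <h1>Uber</h1>
  <h2 class="js-startup_high_concept">Example tagline goes here</h2>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Look up the <h2> tag that carries the js-startup_high_concept class.
high_concept = soup.find("h2", attrs={"class": "js-startup_high_concept"})

print(high_concept.text)
```

If no element matches, find() returns None, so it is worth checking the result before reading .text on a real page.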

Extracting Data

Let's dive right in with a bit of code:

Let's get into the details:

  • We create a variable headers (more on this very soon)
  • The company_page variable is the page we're targeting
  • Then we build our request. We inject the company_page and headers variables inside the Request object. Then we open the URL with the parameterized request.
  • We parse the HTML response with BeautifulSoup
  • We look for our text content with the find() method
  • We print our result!
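The steps above can be sketched roughly as follows. This is a reconstruction under assumptions, not necessarily the original script: it assumes the Request object is the one from urllib.request, the initial headers value is a placeholder, and the function names are hypothetical helpers introduced here to keep each step separate.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Placeholder value (an assumption); more on headers very soon.
headers = {"User-Agent": "python-script"}

# The page we're targeting.
company_page = "https://angel.co/uber"


def build_request(url, request_headers):
    """Inject the URL and the headers into a Request object."""
    return Request(url, headers=request_headers)


def extract_high_concept(html):
    """Parse the HTML and return the text of the target <h2>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h2", attrs={"class": "js-startup_high_concept"})
    return tag.text if tag is not None else None


def scrape_high_concept(url, request_headers):
    """Open the URL with the parameterized request and extract the text."""
    page_request = build_request(url, request_headers)
    with urlopen(page_request) as response:
        html = response.read()
    return extract_high_concept(html)


# To print the result, call:
# print(scrape_high_concept(company_page, headers))
```

The final call is left as a comment so that nothing is requested from the site until you decide to run it.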

Save this as script.py and run it in your shell, like this: python script.py.

You should get the following:

Oh :( What happened?

Well, it seems that AngelList has detected that we are a bot. Clever people!

Okay, so change the headers variable for this one:
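For illustration, a headers variable that mimics a regular browser might look like the sketch below. The exact User-Agent string is an assumption; the value sent by any current browser could be used in its place.

```python
# Example browser-like User-Agent (illustrative value, not a required one).
headers = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
    )
}
```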

Run the code with python script.py. Now it should be good:

Yeah! Our first piece of data :D

Want to find the website? Easy:

And you get:

OK, but how do I get just the URL of the website?

Easy. Tell the program to extract the href:
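Here is a small sketch of both steps, finding the link and then reading its href, run against a made-up fragment. The class name company_url and the markup are hypothetical; check the inspector for the real selector on the page you are scraping.

```python
from bs4 import BeautifulSoup

# Made-up fragment; "company_url" is a hypothetical class name.
sample_html = '<a class="company_url" href="https://www.uber.com">uber.com</a>'

soup = BeautifulSoup(sample_html, "html.parser")
website = soup.find("a", attrs={"class": "company_url"})

print(website)          # the whole <a> tag
print(website["href"])  # only the value of the href attribute
```

Indexing a tag like a dictionary raises KeyError when the attribute is missing; website.get("href") returns None instead.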

Make sure to use the strip() method, otherwise you'll get a lot of whitespace before and after the text:
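To see the difference, the sketch below compares the raw text of a tag with its stripped version. The HTML fragment is hand-written for illustration, with the kind of indentation and line breaks that real pages often contain.

```python
from bs4 import BeautifulSoup

# Hand-written fragment with whitespace around the text (illustrative only).
sample_html = """
<h2 class="js-startup_high_concept">
    Example tagline goes here
</h2>
"""

soup = BeautifulSoup(sample_html, "html.parser")
tag = soup.find("h2", attrs={"class": "js-startup_high_concept"})

raw_text = tag.text            # keeps the surrounding newlines and spaces
clean_text = tag.text.strip()  # leading and trailing whitespace removed

print(repr(raw_text))
print(repr(clean_text))
```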

I won't cover in detail all the elements you could extract. If you're having issues, you can always check this amazing XPath cheatsheet.

Save results to CSV

Pretty useless to print data, right? We should definitely save it!

The Comma-Separated Values format is a de facto standard for this purpose. You can import it very easily into Excel or Google Sheets.
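One way to write the scraped values to a CSV file is sketched below with Python's built-in csv module. The function name, the file name and the example row are hypothetical; the file is opened in append mode so that each run adds a row.

```python
import csv


def append_result(csv_path, row):
    """Append one row to the CSV file, creating the file if needed."""
    # Mode "a" appends instead of overwriting; newline="" is what the csv
    # module documentation recommends for files handed to csv.writer.
    with open(csv_path, "a", newline="", encoding="utf-8") as csv_file:
        csv.writer(csv_file).writerow(row)


# Example call (hypothetical file name and values):
# append_result("angel.csv", ["Example tagline goes here", "https://www.uber.com"])
```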

What you get is a single line of data. Since we told the program to append every result, new lines won't erase previous results.

Check out the whole script

Conclusion

It wasn't that hard, right?

We covered a very basic example. You could also add multiple pages and parse them inside a for loop.
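As a rough sketch of that idea, the loop below visits a list of pages and collects one result per page. The function names, the one-second pause and the fetch parameter are all assumptions made for this example; the pause is there in the spirit of scraping responsibly.

```python
import time
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url, headers):
    """Download one page and return its raw HTML."""
    with urlopen(Request(url, headers=headers)) as response:
        return response.read()


def scrape_pages(company_pages, headers, fetch=fetch_html, delay_seconds=1.0):
    """Return a list of (url, high_concept) pairs, one per page."""
    results = []
    for index, url in enumerate(company_pages):
        if index > 0:
            time.sleep(delay_seconds)  # small pause between requests
        soup = BeautifulSoup(fetch(url, headers), "html.parser")
        tag = soup.find("h2", attrs={"class": "js-startup_high_concept"})
        results.append((url, tag.text.strip() if tag is not None else None))
    return results
```

Passing the download function in as fetch makes it possible to try the loop on saved HTML before pointing it at a live site.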

Remember how we got blocked by the website's security and resolved this by adding a custom User-Agent? We wrote a small paper about anti-scraping techniques. It'll help you understand how websites try to block bots.

If you feel like web scraping is too difficult for you or you're getting blocked, you can always contact us!

You can also use a more advanced version of this script on our platform.