Previous icon
Back

Anti-Scraping - Why you already lost the war

published
October 27, 2018
Reading time
5
minutes

What the heck is Web Scraping ?

Web scraping is a data extraction technique used on a website: you create a script (a bot) that automatically fetches data, without you needing to do anything.

The subject is well known amongst startups' (digital) marketing services, with the big rise of Growth Hacking. It's been a few years that everyone wants to scrape data for everything.

Understand that everyone extracts data on the web. Among others:

  • The startup that wants to enrich its business database or feed its CRM
  • The company that wants to monitor its competitors
  • The big corporate that analyzes its market penetration
  • The SME that sells a product requiring web data

Problem is, most, if not all, websites (shall we say 99% ?) do not offer a direct access to their data. They don't offer an Application Programming Interface (API). Worse, some websites offer badly designed APIs that forces us to scrape their content to get everything. What a shame!

Web scraping and hacking

At Captain Data, people often ask us about the legal aspect of things, GDPR or technical feasibility:

  • Concerning legality, that would require an entire article :) In one word: yes, it's legal (or at least, it's not illegal). Still, you need to behave yourself and respect copyrights, for example.
  • As for GDPR, no problem since we only deal with business data. It happens that we're asked to extract sensitive information, but in that case we save it directly in our client's database.
  • We're left with the interesting point that concerns the technical feasibility: there are more and more scraping services and solutions, and therefore more and more solutions to protect from scraping.

There's a connection that we often like to make between hacking and scraping: in the end, it's only a matter of means.

Anti-Scraping: a matter of means ?

When you learn the basic of IT security, you're often told that no system is perfect.

The more you invest in secure solutions, the more you can hope to be protected - provided that your employees do not leave their passwords on a small piece of paper!

It's the same for scraping: the more you protect yourself, you increase your chance of spotting bots. But remember that everything is about risk management: what is the (reasonable) percentage of successful bots spotting you're comfortable with.

However, if some IT securities seem really non-crackable from a hacking point of view, we have always managed to deliver our bots.

Protections technique are based on two major factors:

  • The digital footprint
  • Machine learning and statistics

Digital Footprint

When you surf on the web, you leave what's called a digital footprint. This is a set of parameters that you accumulate over time: cookies, your IP, your browser settings, etc.

This footprint helps to determine whether you are a human behind your screen or a bot harvesting the website. Bad news for website owners, recent technologies, including Headless Chrome (a "simplified" version of Google Chrome), make it extremely easy to reproduce these settings.

In other words: pretending to be a human is quite possible and easy for a bot.

Machine Learning

Machine learning allows to create high quality anti-scraping solutions. In short, companies harvest web data (Big Data) to create behavioral detection patterns.

Statistical analysis: number of IPs, number of sessions by IPs, speed of extraction, etc., make the extraction process much more complicated.

There are few anti-scraping scraping solutions based on machine learning on the market. When you have to counter one, it can become pretty hard :).

"Fortunately", like in hacking, the integration of such technical solutions relies on the expertise of men and women, which is therefore prone to error. And you don't need much, just a slight detail: a page is not protected properly, a piece of internal API is exposed when it should not, etc. And boom, the door is left open for a smart developer.

As scraping experts, what we bring on the table is easy to get: we're used to find the small hidden door(s).

One might think that the company that spends the most will have the last word in the story. In theory, it's true. However, practice shows us that no protection is unbreakable.

Web scraping as an innovation leverage

We're not telling you not to protect yourself! Indeed, products like Cloudflare bring so much more to the table and are not only an anti-scraping protection.

We juste do not have the same vision of the web :)

At Captain Data, we see scraping as a mean to innovate:

  • It allows you to enrich a database, to maximize its value
  • You can create new ways to do business and attract customers
  • It facilitates the creation of innovative products
  • And finally, it allows to modernize old application layers (legacy softwares)

Rather than trying to protect your data, you'd better think about how you can use it to foster innovation or enhance your offer.

You do not have an API? Build one!

You don't know how to do? We can help.

Building an API is too expensive? You can always ask us to scrap your application to create one on the fly and make a smooth transition by modernizing your infrastructure at a lower cost.

You read it everywhere « data is the new oil ». So what are you waiting for?

{{automation-component}}

Guillaume Odier
Co-founder
table of contents
The rise of Operations

Understand how these data-centered roles are shaping the future of business growth in 2023 and beyond.

Our focus? Your growth.

A data-driven approach is key to hitting your targets. Discover strategies and insights you need to get there.

Thank you! You're successfully subscribed to our newsletter 💌
Oops! Something went wrong while submitting the form.
Eliminate the guesswork.

Business decisions should be backed by fresh and accurate insights. Power your growth with data-driven automations that adapt to your needs.

Extract your data with Captain Data

Seamlessly navigate the web's massive unstructured data, and capture the leads that will drive your business forward.

supercharge your data automation skills

Get our newsletter

Get exclusive tips and industry insights directly to your mailbox, every month

Thank you! You're successfully subscribed to our newsletter 💌
Oops! Something went wrong while submitting the form.
© 
 Captain Data, All rights reserved.
The Rise of Operations

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.

Crafted for leaders, designed for growth

Channel the full potential of revenue automation to save time and drive growth.  

The best decision is an informed one

Easily extract, enrich and integrate the data you need to scale your operations and supercharge your growth.

Markets evolve, and leaders adapt.

Fully automate your Inbound and Outbound lead gen using Captain Data.  

Turn data points into vantage points

Channel the full potential of revenue automation to transform raw data into actionable insights

Evolving markets demand evolving strategies

Leverage the power of automation to eliminate unnecessary data entry, save time, and drive growth.

Make sense of your market one byte at a time

Easily extract, enrich and integrate the data you need to scale your operations and drive your growth.

Captain Data in 5 minutes

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.

The Rise of Operations

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.

Thank you! You're successfully subscribed to our newsletter 💌
Oops! Something went wrong while submitting the form.