A primer on web scraping bots: good, bad or a little of both?


Posted: 26th July 2016 12:32

There are plenty of times you think you know everything you need to know based on a small amount of information. It could be something as minor as the way a name sounds. A cornea scraping. Debt litigation. New Weird Al song. It would seem the name is telling you as much as you would want to know. And then there are times when you think you can safely assume something is unpleasant based on how it sounds when, in fact, there’s more to the story. That’s the category in which you will find web scraping bots.

Lots and lots of bots

A bot is a software application that runs automated tasks, otherwise known as scripts, over the internet. Bots are great at performing simple, repetitive tasks quickly, much quicker than a person could. This makes bots valuable for good reasons as well as bad ones. Bots make up nearly half of all web traffic overall, and for websites getting fewer than 10,000 visitors per day, bots account for over 70% of traffic. Upwards of two-thirds of all bot traffic is malicious.

Bots do great things for the internet like search engine crawling, measuring site speed, monitoring the health of websites, fetching web content, powering APIs, automating security auditing and scanning for website vulnerabilities. Bots also do terrible things like launching distributed denial of service attacks, scanning for vulnerabilities in order to compromise websites, impersonating Googlebot, and spamming comment sections and message boards. They also scrape web content.

Web scraping

Web scraping bots are in the business of automatically collecting information from the internet, most often in the form of site scraping. When a web scraping bot scrapes your site, it accesses your site’s source code and grabs the data it wants. Typically, your site’s content is scraped so it can be reposted on another website.
A web scraping bot can also scrape your site’s database. It does so by interacting with the target site’s web application to pull data out of the database behind it, stealing customer lists, price lists, intellectual property, and other datasets that a human would not have the time or patience to retrieve.
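To make the mechanics concrete, here is a minimal sketch of what a simple scraping bot does, written with Python’s standard library. The target URL and the span class="price" markup it looks for are purely illustrative assumptions, not any particular site’s layout:

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Hypothetical target page; a real scraper would crawl many URLs.
TARGET_URL = "https://example.com/products"


class PriceScraper(HTMLParser):
    """Collects the text inside every <span class="price"> element (illustrative)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def scrape(url):
    # Many scraping bots spoof a browser User-Agent so the request looks human.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(request, timeout=10).read().decode("utf-8", errors="replace")
    parser = PriceScraper()
    parser.feed(html)
    return parser.prices


if __name__ == "__main__":
    print(scrape(TARGET_URL))
```

A real scraper would typically use a dedicated parsing library, rotate user agents and IP addresses, and loop over thousands of pages, but the pattern is the same: fetch the page source and pull out the data it wants.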

Bad news bots?

Admittedly, the things web scraping bots do as outlined above all sound terrible. Web scraping bots are capable of stealing the entire contents of your website and reposting it on another site, or stealing your pricing datasets and using them to undercut you. But there is such a thing as legitimate web scraping.
Consider all of those travel sites, concert and sporting ticket sales sites, air travel booking sites and hotel booking sites that compare all of the deals available to you across the internet. They wouldn’t be possible without web scraping bots, and those bots ultimately end up driving traffic to the websites from which they scraped content.

A nuanced approach to mitigation

Essentially, web scraping bots are like every other kind of bot. Some are good, some are bad, some are beneficial to your site, and some are useless at best and malicious at worst. In order to reap the benefits of good bots and keep bad bots from getting on your site, you need a security solution that takes a nuanced approach to both classifying and dealing with bots. Security firm Imperva Incapsula recommends the following four methods for classifying and mitigating bots:

1. Using an analysis tool. A static analysis tool examines the structure of incoming web requests and their header information, and compares what it finds with what the bot claims to be. This allows the tool to determine what kind of bot it is actually dealing with and block it if necessary (a minimal sketch of this kind of check appears after this list).

2. Employing a challenge-based approach. Proactive web components evaluate visitor behavior, determining things like whether or not the visitor supports cookies or JavaScript. If a visitor exhibits suspicious behavior, your solution can issue progressive challenges that let the visitor prove it actually is what it claims to be.

3. Employing a behavioral approach. This approach compares a bot’s activity with the expected activity of whatever it claims to be. Most bots claim an association with a parent program, such as a browser like Chrome, and if the bot’s characteristics don’t align with that program’s, the anomaly lets you detect and block a bad bot.

4. Using robots.txt. This can protect your site from bad scraping bots…sort of. Robots.txt basically lets bots know which parts of your site they’re not welcome to visit, but since bad bots aren’t exactly known for their rule-following, they may simply ignore those directives (see the second sketch below).
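To give a sense of what the first, static style of check looks like in practice, here is a minimal sketch that tests whether a visitor claiming to be Googlebot really resolves back to Google, using a reverse-then-forward DNS lookup. The sample IP address and User-Agent string are placeholders; in a real deployment they would come from your web server’s logs:

```python
import socket


def is_genuine_googlebot(ip_address, user_agent):
    """Static check: does a visitor presenting itself as Googlebot resolve to Google?"""
    # Only visitors that claim to be Googlebot need this particular verification.
    if "Googlebot" not in user_agent:
        return False
    try:
        # Reverse DNS: the IP should map back to a Google-owned hostname.
        hostname, _, _ = socket.gethostbyaddr(ip_address)
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward DNS: that hostname should resolve back to the same IP.
        return socket.gethostbyname(hostname) == ip_address
    except (socket.herror, socket.gaierror):
        # No matching DNS records is a strong hint the claim is false.
        return False


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(is_genuine_googlebot("66.249.66.1",
                               "Googlebot/2.1 (+http://www.google.com/bot.html)"))
```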
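And as a sketch of how robots.txt works from the bot’s side, the snippet below shows a well-behaved crawler consulting a site’s robots.txt before fetching a page. The site URL and crawler name are hypothetical; the key point is that the check is entirely voluntary, which is exactly why bad bots skip it:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site; robots.txt is purely advisory -- it is the bot that
# chooses whether to read and respect it.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

# A polite crawler asks before fetching; a scraper can simply never ask.
allowed = parser.can_fetch("MyFriendlyBot", "https://example.com/private/report.html")
print("Allowed to fetch:", allowed)
```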

Getting the whole story

Just as you need to know the whole story when it comes to bots, so does your bot defense solution. It needs to be able to assess the full impact a bot will have before it decides whether or not that bot is going to be allowed on your website. This really is the best approach. Except when it comes to new Weird Al songs. Go with your gut in that case.