Friday, June 5, 2009

How to block robots... before they hit robots.txt - a la mod_security

As many of you know, robots (in their many forms) can be quite pesky when it comes to crawling your site and indexing things that you don't want indexed. Yes, there is the standard approach of putting a robots.txt in your webroot, but that is often not very effective. This is for a number of reasons... not the least of which is that robots tend to be poorly written to begin with and thus simply ignore robots.txt anyway.

This comes up because a friend of mine who runs a big e-com site recently asked me, "J, how can I block everything from these robots? I simply don't want them crawling our site." My typical response to this was "you know that you will then block the search engines and keep them from indexing your site"... to wit: "yes, none of our sales are organic, they all come from referring partners and affiliate programs"... That's all I needed to know... as long as it doesn't break anything that they need, heh.

After putting some thought into it, and deciding that there was no really easy way to do this on a firewall, I decided that the best approach was to create some mod_security rules that look for known robots and return a 404 whenever any such monster hits the site. This made the most sense because they are running an Apache reverse proxy in front of their web application servers with mod_security (and some other fun).
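For the curious, that kind of front end boils down to something like the following. This is a minimal sketch only... the hostname and backend address are made up, and the module names/paths will vary by OS and Apache build:

#####################Begin Example Proxy Config#######################
# mod_security2 (plus mod_unique_id, which it requires) and the proxy
# modules need to be loaded somewhere in your config... paths vary
# LoadModule unique_id_module  libexec/apache22/mod_unique_id.so
# LoadModule security2_module  libexec/apache22/mod_security2.so
# LoadModule proxy_module      libexec/apache22/mod_proxy.so
# LoadModule proxy_http_module libexec/apache22/mod_proxy_http.so

<VirtualHost *:80>
    # hypothetical site name and backend app server
    ServerName www.example.com

    # turn the rule engine on for this vhost
    SecRuleEngine On

    # reverse proxy only... never act as an open forward proxy
    ProxyRequests Off
    ProxyPass        / http://10.0.0.10:8080/
    ProxyPassReverse / http://10.0.0.10:8080/
</VirtualHost>
#####################End Example Proxy Config#######################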

A quick search on the internet turned up the robotstxt.org site, which contains a listing (http://www.robotstxt.org/db/all.txt) of quite a few common robots. Looking through this file, all that I really cared about was the robot-useragent value. As such, I quickly whipped up the following perl, which automatically creates a file named modsecurity_crs_36_all_robots.conf. Simply place this file in the apt path (for me /usr/local/etc/apache/Includes/mod_security2/) and restart your apache... voila... now only (for the most part) real users can browse your webserver. I'll not get into other complex setups, but you could also do this on a per-directory level from your httpd.conf and mimic robots.txt (except the robots can't ignore the 404, muahahaha)... there's a quick config sketch of that just below.
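In case your httpd.conf doesn't already pull in that Includes directory on its own, the generated rules can be included globally, or scoped to just part of the tree if you want robots.txt-style granularity. The /private path below is purely hypothetical:

#####################Begin Example Include#######################
# block the known robots everywhere...
Include /usr/local/etc/apache/Includes/mod_security2/modsecurity_crs_36_all_robots.conf

# ...or only for a chunk of the site, much like a robots.txt Disallow
# (except this one actually gets enforced)
#<LocationMatch "^/private/">
#    Include /usr/local/etc/apache/Includes/mod_security2/modsecurity_crs_36_all_robots.conf
#</LocationMatch>
#####################End Example Include#######################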

#####################Begin Perl#######################
#!/usr/bin/perl

##
## Quick little routine to pull the user-agent string out of the
## all.txt file from the robots project, with the intention of creating
## regular expression block rules so that they can no longer crawl
## against the rules!
## Copyright JJ Cummings 2009
## cummingsj@gmail.com
##

use strict;
use warnings;
use File::Path;

my ($line,$orig);
my $c = 1000000;
my $file = "all.txt";
my $write = "modsecurity_crs_36_all_robots.conf";
open (DATA,"<$file") or die "Unable to open $file: $!\n";
my @lines = <DATA>;
close (DATA);

open (WRITE,">$write") or die "Unable to write $write: $!\n";
print WRITE "#\n#\tQuick list of known robots that are parsable via http://www.robotstxt.org/db/all.txt\n";
print WRITE "#\tgenerated by robots.pl written by JJ Cummings <cummingsj\@gmail.com>\n\n";
foreach $line (@lines){
    if ($line=~/robot-useragent:/i){
        $line=~s/robot-useragent://i;
        $line=~s/^\s+//;
        $line=~s/\s+$//;
        $orig=$line;
        # escape regex metacharacters so the user-agent string is matched literally
        $line=~s/\//\\\//g;
        #$line=~s/\s/\\ /g;
        $line=~s/\./\\\./g;
        $line=~s/\!/\\\!/g;
        $line=~s/\?/\\\?/g;
        $line=~s/\$/\\\$/g;
        $line=~s/\+/\\\+/g;
        $line=~s/\|/\\\|/g;
        $line=~s/\{/\\\{/g;
        $line=~s/\}/\\\}/g;
        $line=~s/\(/\\\(/g;
        $line=~s/\)/\\\)/g;
        $line=~s/\*/\\\*/g;
        # a capital X is used as a placeholder in some db entries, so make it a regex wildcard
        $line=~s/X/\./g;
        $line=lc($line);
        chomp($line);
        # skip blank entries and the literal "no"/"none" placeholders in the db
        if (($line ne "") && ($line ne "no") && ($line !~ /none/i)) {
            $c++;
            $orig=~s/'//g;
            $orig=~s/`//g;
            chomp($orig);
            print WRITE "SecRule REQUEST_HEADERS:User-Agent \"$line\" \\\n";
            print WRITE "\t\"phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Automated Web Crawler Block Activity',id:'$c',tag:'AUTOMATION/BOTS',severity:'2'\"\n";
        }
    }
}
close (WRITE);
$c=$c-1000000;
print "$c total robots\n";


#####################End Perl#######################

To use the above, you have to save the all.txt file to the same directory as the perl... and of course have +w permissions so that the perl can create the apt new file. This is a pretty basic routine... I wrote it in about 5 minutes (with a few extra minutes for tweaking the ruleset format output, displayed below). So please, feel free to modify / enhance / whatever to fit your own needs as best you deem. **yes, I did shrink it so that it would format correctly here**
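If you'd rather not grab all.txt by hand every time, a few extra lines of perl will pull it down for you. This is just a sketch and assumes you have LWP::Simple installed:

#####################Begin Perl Fetch Sketch#######################
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# grab the current robots database and drop it next to robots.pl
my $url  = "http://www.robotstxt.org/db/all.txt";
my $code = getstore($url, "all.txt");
die "Fetch of $url failed with HTTP $code\n" unless $code == 200;
print "all.txt updated, now run robots.pl\n";
#####################End Perl Fetch Sketch#######################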

#####################Begin Example Output#######################
SecRule REQUEST_HEADERS:User-Agent "abcdatos botlink\/1\.0\.2 \(test links\)" \
"phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Automated Web Crawler Block Activity',id:'1000001',tag:'AUTOMATION/BOTS',severity:'2'"
SecRule REQUEST_HEADERS:User-Agent "'ahoy\! the homepage finder'" \
"phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Automated Web Crawler Block Activity',id:'1000002',tag:'AUTOMATION/BOTS',severity:'2'"
SecRule REQUEST_HEADERS:User-Agent "alkalinebot" \
"phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Automated Web Crawler Block Activity',id:'1000003',tag:'AUTOMATION/BOTS',severity:'2'"
SecRule REQUEST_HEADERS:User-Agent "anthillv1\.1" \
"phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Automated Web Crawler Block Activity',id:'1000004',tag:'AUTOMATION/BOTS',severity:'2'"
SecRule REQUEST_HEADERS:User-Agent "appie\/1\.1" \
"phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Automated Web Crawler Block Activity',id:'1000005',tag:'AUTOMATION/BOTS',severity:'2'"

#####################End Example Output#######################

And that, folks, is how you destroy robots that you don't like. You can modify the error that gets returned to whatever suits you best... 403, 404, and so on.
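For example, handing back a 403 instead of a 404 is just a matter of changing the status action in the generated rules:

#####################Begin Example 403 Variant#######################
SecRule REQUEST_HEADERS:User-Agent "alkalinebot" \
	"phase:2,t:none,t:lowercase,deny,log,auditlog,status:403,msg:'Automated Web Crawler Block Activity',id:'1000003',tag:'AUTOMATION/BOTS',severity:'2'"
#####################End Example 403 Variant#######################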

Cheers,
JJC
