Hi, here is another post from me. You should know a bit of PHP, MySQL and the HTTP protocol to follow it. It’s not my intention to describe how to manage statistics on your website; I only sketch that part in order to explain how I detect and trap bots.
I have a website with mod_rewrite on, because I redirect everything to index.php no matter what you type into the address bar. To be more specific, I redirect everything to index.php?q=*, so I can use the $_GET['q'] variable to handle different URLs.
The next step is logging every single visit into a MySQL database. To do that I have a visits table which looks something like this:
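As a minimal sketch of how index.php might read that variable (not the exact code running on the site): the function name routeFor and the 'home' fallback are made up for illustration, and the only thing taken from the setup above is that the requested path arrives in q.

```php
<?php
// Hypothetical helper: turns the rewritten ?q= value into a page name.
// 'home' as the name for an empty path is an assumption for this sketch.
function routeFor(array $query): string
{
    $q = $query['q'] ?? '';
    if (!is_string($q)) {
        $q = ''; // e.g. ?q[]=x arrives as an array, treat it as empty
    }
    $q = trim($q, '/');
    return $q === '' ? 'home' : $q;
}
```

In index.php this would be called as routeFor($_GET), and the result compared against the pages the site knows about.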
CREATE TABLE `visits` (
`id` int(20) NOT NULL AUTO_INCREMENT,
`datefield` datetime NOT NULL,
`ip` varchar(100) NOT NULL,
`useragent` varchar(255) NOT NULL,
`uri` varchar(255) NOT NULL,
`referer` varchar(255) NOT NULL,
`session` varchar(32) NOT NULL,
PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8;
When somebody visits my website I can insert some data into that DB table:
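// Assumes $pdo is an open PDO connection to the database that holds `visits`.
$ip = $_SERVER['REMOTE_ADDR'];
// These headers are sent by the client and may be missing, so default them.
$agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
$uri = $_SERVER['REQUEST_URI'];
$referer = $_SERVER['HTTP_REFERER'] ?? '';
$session = session_id();
// mysql_query() was removed in PHP 7, and pasting request data into the SQL
// string allows SQL injection, so a prepared statement is used instead.
$stmt = $pdo->prepare("
INSERT INTO visits (datefield, ip, useragent, uri, referer, session)
VALUES (NOW(), ?, ?, ?, ?, ?)
");
$stmt->execute([$ip, $agent, $uri, $referer, $session]);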
OK, so that’s really easy. The visits table holds raw data about every single visit, and this is only the beginning. If you want to get some real benefit out of your statistics, you should build a summary that turns the raw data into useful information: daily hits, unique hits, referrers, bot visits, users’ browsers, users’ operating systems and so on.
There are many ways to get that done right, but as I said before, it is not my intention to talk about that. Let’s just concentrate on bot visits. As you know, there are many robots crawling the web and collecting data from websites. It’s always good to know who they are and what they are doing on your website. It’s also useful to trap bad robots and redirect them away.
Look again at the visits table and check the useragent column. It holds data about users’ browsers and it looks like Mozilla/5.0 (Windows; U; Windows NT 6.1; sl,en:us; rv:1.9.2) Gecko/20100115 Firefox/3.6 (.NET CLR 3.5.30729) when the user is human and something like Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) when the user is a robot. I could look for the word ‘bot’ in my useragent column and I would find most of them really easily.
But I found an even easier way to do that. It’s true, robots are not very smart. They can’t resist trying to open robots.txt. When a robot comes around, there is a good chance that it will ask for the robots.txt file. Of course, I don’t have one. If there were such a file, the robot would open it directly and so it would slip past my statistics collector. But in my case the .htaccess file redirects the request to the index.php script, where I see in my $_GET['q'] variable that it wanted robots.txt.
That’s the first step in detecting bots, because humans rarely ask for robots.txt. Combined with the word ‘bot’ in the useragent column, I can be pretty sure whether you are a human or just another bot. Of course, I don’t want bots to get the index.php page when they request robots.txt. So right after I insert the visit into my database, I generate a fake robots.txt with PHP code:
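To make the idea of a summary a bit more concrete, here is a rough sketch that folds raw rows into daily hits and unique hits. The function name summarizeDaily is hypothetical, and counting a "unique" hit as one distinct session per day is an assumption; the rows are plain arrays shaped like the visits table.

```php
<?php
// Hypothetical helper: folds raw `visits` rows into per-day totals.
// "Unique" means distinct session IDs per day, which is an assumption.
function summarizeDaily(array $visits): array
{
    $seen = [];    // day => [session => true]
    $summary = []; // day => ['hits' => int, 'unique' => int]
    foreach ($visits as $v) {
        // First 10 characters of a DATETIME string are 'YYYY-MM-DD'.
        $day = substr($v['datefield'], 0, 10);
        if (!isset($summary[$day])) {
            $summary[$day] = ['hits' => 0, 'unique' => 0];
            $seen[$day] = [];
        }
        $summary[$day]['hits']++;
        if (!isset($seen[$day][$v['session']])) {
            $seen[$day][$v['session']] = true;
            $summary[$day]['unique']++;
        }
    }
    return $summary;
}
```

On a busy site the same grouping would more likely be done in SQL; the PHP version is only meant to show what is being counted.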
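The ‘bot’ check can be sketched in a couple of lines. The function name looksLikeBot is made up for this example, and the check is deliberately naive: it only catches crawlers that put "bot" somewhere in their User-Agent string.

```php
<?php
// Hypothetical helper: true when the User-Agent string contains "bot".
// stripos() is case-insensitive, so "Googlebot" and "BOT" both match.
function looksLikeBot(string $userAgent): bool
{
    return stripos($userAgent, 'bot') !== false;
}
```

Since the User-Agent header is supplied by the client, a robot can easily pretend to be a browser, so this is a hint rather than proof.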
// $_GET['q'] may be missing, so fall back to an empty string.
if (($_GET['q'] ?? '') === 'robots.txt') {
    $text = "User-agent: *\r\nDisallow: /email-list/";
    header("Content-Type: text/plain");
    echo $text;
    exit();
}
There is a slight trap for bad robots included. As you can see the robot requests robots.txt file and gets:
User-agent: *
Disallow: /email-list/
Good robots obey and don’t try to access the email-list folder. But bad robots do just that! They immediately try to get into my email-list folder… which doesn’t exist, of course! It’s a simple trap which helps me separate good robots from bad ones. I have a separate table in my database just for robots, where I record whether a robot is good or bad.
So, that’s it. What you do with bad robots is up to your imagination. You can simply answer them with a die('spammer'); command, you can trap them in some PHP script and have fun with them, or you can immediately redirect them to www.google.com! You can do whatever you want and live happily ever after!
I get approximately 20 spam comments every day. All of them are on a single post: Get IMDB ID (tt number) from movie title. I don’t know why spammers are so persistent on this one, because the other posts are not attacked at all! Maybe the post’s link on Twitter caused that, maybe…
Anyway, thanks to WordPress, I can delete them before they get published. Comments are no longer allowed on that article.
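The good/bad decision can be sketched from the logged rows alone. Everything named here is hypothetical: the function classifyRobots, the choice to identify a robot by its IP address, and the assumption that the uri column holds paths such as /robots.txt and /email-list/.

```php
<?php
// Hypothetical helper: labels each robot IP found in `visits` rows.
// An IP counts as a robot once it asks for /robots.txt, and it is 'bad'
// if it ever requests the disallowed /email-list/ path.
function classifyRobots(array $visits): array
{
    $robots = [];  // ip => 'good' | 'bad'
    $trapped = []; // ip => true when the trap path was requested
    foreach ($visits as $v) {
        $path = explode('?', $v['uri'], 2)[0]; // drop any query string
        if ($path === '/robots.txt') {
            $robots[$v['ip']] = 'good';
        }
        if ($path === '/email-list' || strpos($path, '/email-list/') === 0) {
            $trapped[$v['ip']] = true;
        }
    }
    foreach ($robots as $ip => $label) {
        if (isset($trapped[$ip])) {
            $robots[$ip] = 'bad';
        }
    }
    return $robots;
}
```

Visitors who never asked for robots.txt are left out of the result, so a human who stumbles on the trap URL is not labeled as a robot by this sketch.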
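One possible way to act on that, as a sketch only: the function responseFor, the $robotLabels map (IP address to 'good' or 'bad', in the spirit of the robots table mentioned earlier) and the 403 status code are all assumptions for this example.

```php
<?php
// Hypothetical helper: decides how to answer a request from $ip.
// $robotLabels maps an IP address to 'good' or 'bad' (assumed shape).
function responseFor(string $ip, array $robotLabels): array
{
    if (($robotLabels[$ip] ?? null) === 'bad') {
        // Bad robots get a short refusal instead of the page.
        return ['status' => 403, 'body' => 'spammer'];
    }
    // Everyone else continues to the normal page.
    return ['status' => 200, 'body' => null];
}
```

In index.php the result could be applied with http_response_code($r['status']) followed by die($r['body']) whenever the status is 403.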