Tag-Archive for ◊ http ◊

When you post a link on Twitter…
Sunday, March 07th, 2010 | Author:

I made a custom 404 error page this morning. It’s really easy with mod_rewrite enabled, because you can catch every URL request with PHP. For broken or bad requests you may want to tell your visitors that they simply missed… It looks something like this:

if ($request == false) {
    // Send the 404 status line before any output, then show the custom page.
    header("HTTP/1.0 404 Not Found");
    include('#error.php');
    exit();
}

My #error.php file is a simple XHTML file with a custom design and content. I put the # character at the beginning of the file’s name so that you can’t access it directly through a URL: a browser treats everything after # as a fragment and never sends it to the server. Now, if you visit http://www.hladnik.net/ups, you should get my custom 404 error response.
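
A quick way to check how a URL containing # gets split up is PHP’s parse_url(). This is only a sketch; the URL below is a hypothetical attempt to reach the error page directly, not something taken from my logs.

```php
<?php
// Hypothetical URL a visitor might type to reach the error page directly.
$url = 'http://www.hladnik.net/#error.php';

$parts = parse_url($url);

// Everything after '#' is the fragment. Browsers keep it on the client side,
// so the server only ever sees the path.
echo 'path: ' . $parts['path'] . "\n";
echo 'fragment: ' . $parts['fragment'] . "\n";
```

A client that sends the character URL-encoded as %23 is a different case, so whether the file stays unreachable then probably depends on the rewrite rules in place.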

I posted this link also on my Twitter account at 11:06 am (CET). In the first 12 minutes I got 18 visits from these bots:

2010-03-07 11:06:23	85.114.136.243		Mozilla/5.0 (compatible; Windows NT 6.0) Gecko/20090624 Firefox/3.5 NjuiceBot
2010-03-07 11:06:23	204.236.249.194		JS-Kit URL Resolver, http://js-kit.com/
2010-03-07 11:06:24	64.13.147.188		Mozilla/5.0 (compatible; abby/1.0; +http://www.ellerdale.com/crawler.html)
2010-03-07 11:06:25	66.249.71.179		Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
2010-03-07 11:06:25	174.129.90.99		Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)
2010-03-07 11:06:26	65.52.26.149		Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
2010-03-07 11:06:27	216.24.142.47		Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 OneRiot/1.0 (http://www.oneriot.com)
2010-03-07 11:06:28	89.151.116.52		Mozilla/5.0 (compatible; MSIE 6.0b; Windows NT 5.0) Gecko/2009011913 Firefox/3.0.6 TweetmemeBot
2010-03-07 11:06:32	70.37.70.230		Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
2010-03-07 11:06:35	67.207.201.153		Mozilla/5.0 (compatible; mxbot/1.0; +http://www.chainn.com/mxbot.html)
2010-03-07 11:06:53	67.202.7.134		PycURL/7.18.2
2010-03-07 11:06:54	72.13.91.40		Java/1.6.0_18
2010-03-07 11:07:06	79.99.6.106		Twingly Recon
2010-03-07 11:07:46	208.74.66.39		Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly.html) Gecko/2009032608 Firefox/3.0.8
2010-03-07 11:08:34	142.166.170.104		radian6_linkcheck_(www.radian6.com/crawler)
2010-03-07 11:09:13	142.166.170.103		R6_FeedFetcher(www.radian6.com/crawler)
2010-03-07 11:12:38	204.236.203.128		Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)
2010-03-07 11:18:43	184.73.20.47		Python-urllib/2.5
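
Numbers like “18 visits in 12 minutes” can be pulled out of such a log with a few lines of PHP. The sketch below assumes the tab-separated layout shown above (timestamp, IP, user agent); the helper names parse_log_line() and visit_span_seconds() are made up for this example, and only three of the lines are used as sample input.

```php
<?php
// Three lines copied from the log above: timestamp, IP and user agent, tab-separated.
$log = [
    "2010-03-07 11:06:23\t85.114.136.243\t\tMozilla/5.0 (compatible; Windows NT 6.0) Gecko/20090624 Firefox/3.5 NjuiceBot",
    "2010-03-07 11:06:25\t66.249.71.179\t\tMozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "2010-03-07 11:18:43\t184.73.20.47\t\tPython-urllib/2.5",
];

// Split one log line into its three fields (hypothetical helper).
function parse_log_line(string $line): array
{
    [$time, $ip, $agent] = preg_split('/\t+/', trim($line), 3);
    return ['time' => $time, 'ip' => $ip, 'agent' => $agent];
}

// Seconds between the first and the last visit in a list of parsed entries.
function visit_span_seconds(array $entries): int
{
    $utc = new DateTimeZone('UTC');
    $first = new DateTimeImmutable($entries[0]['time'], $utc);
    $last = new DateTimeImmutable($entries[count($entries) - 1]['time'], $utc);
    return $last->getTimestamp() - $first->getTimestamp();
}

$entries = array_map('parse_log_line', $log);
echo count($entries) . ' visits in ' . visit_span_seconds($entries) . " seconds\n";
```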

Four of them also looked for my robots.txt file:

2010-03-07 11:06:24	64.13.147.188		Mozilla/5.0 (compatible; abby/1.0; +http://www.ellerdale.com/crawler.html)
2010-03-07 11:06:36	67.207.201.153		Mozilla/5.0 (compatible; mxbot/1.0; +http://www.chainn.com/mxbot.html)
2010-03-07 11:07:45	208.74.66.36		Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly.html) Gecko/2009032608 Firefox/3.0.8
2010-03-07 11:08:47	142.166.170.103		R6_FeedFetcher(www.radian6.com/crawler)

And that was it. No more visits after 11:18 am. Is that good, bad, or does it mean nothing? These robots didn’t try to visit any other content on my website, although there are 5 links on my 404 error page! So I can only say: “Much Ado About Nothing!”

Category: Web development

My approach to detecting and trapping bots
Saturday, February 27th, 2010 | Author:

Hi, here is another post from me. You should know something about PHP, MySQL and the HTTP protocol to follow it. It’s not my intention to describe how to manage statistics on your website; I only sketch that part in order to explain how I detect and trap bots.

I have a website with mod_rewrite enabled, because I redirect every request to index.php no matter what you type into the address bar. To be more specific, I redirect everything to index.php?q=*, so I can use the $_GET['q'] variable to handle the different URLs.
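
To give an idea of what index.php can do with that variable, here is a minimal sketch of a lookup. The page map and the function name resolve_request() are assumptions made up for this example, not the real pages or code of my site.

```php
<?php
// Hypothetical map from the value of $_GET['q'] to a template file.
// The names are placeholders, not the real pages of the site.
$pages = [
    ''        => 'home.php',
    'about'   => 'about.php',
    'contact' => 'contact.php',
];

// Decide which template answers a request; false means "no such page".
function resolve_request(string $q, array $pages)
{
    $q = trim($q, '/');
    return isset($pages[$q]) ? $pages[$q] : false;
}

// $_GET['q'] is filled in by the rewrite rule; default to the home page.
$q = isset($_GET['q']) ? $_GET['q'] : '';
$request = resolve_request($q, $pages);
```

A result of false is the kind of value a check like if($request == false) could react to by sending a 404 response.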

The next step is logging every single visit into a MySQL database. To do that I have a visits table which looks something like this:

CREATE TABLE `visits` (
    `id` int(20) NOT NULL AUTO_INCREMENT,
    `datefield` datetime NOT NULL,
    `ip` varchar(100) NOT NULL,
    `useragent` varchar(255) NOT NULL,
    `uri` varchar(255) NOT NULL,
    `referer` varchar(255) NOT NULL,
    `session` varchar(32) NOT NULL,
    PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8;

When somebody visits my website I can insert some data into that DB table:

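// Assumes $pdo is an open PDO connection to the MySQL database.
// Bots and some browsers omit these headers, so fall back to an empty string.
$ip = $_SERVER['REMOTE_ADDR'];
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$uri = $_SERVER['REQUEST_URI'];
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$session = session_id();
// A prepared statement keeps header values from being run as SQL.
$stmt = $pdo->prepare("
    INSERT INTO visits(datefield, ip, useragent, uri, referer, session)
    VALUES (NOW(), ?, ?, ?, ?, ?)
");
$stmt->execute(array($ip, $agent, $uri, $referer, $session));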

OK, so that’s really easy. The visits table holds raw data about every single visit, and this is only the beginning. If you want to get some real benefit out of your statistics you should build summaries that turn the raw data into useful information: daily hits, unique hits, referrers, bot visits, users’ browsers, users’ operating systems and so on.
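
Just to illustrate what such a summary might look like, here is a small sketch that counts hits and distinct IPs per day. It works on rows already fetched from the visits table; the sample rows and the function name daily_summary() are invented for this example.

```php
<?php
// Sample rows shaped like the visits table; the values are made up for illustration.
$rows = [
    ['datefield' => '2010-02-26 09:15:00', 'ip' => '10.0.0.1'],
    ['datefield' => '2010-02-26 09:20:00', 'ip' => '10.0.0.1'],
    ['datefield' => '2010-02-26 18:02:00', 'ip' => '10.0.0.2'],
    ['datefield' => '2010-02-27 07:45:00', 'ip' => '10.0.0.1'],
];

// Count hits and distinct IPs per calendar day.
function daily_summary(array $rows): array
{
    $ips = [];
    $summary = [];
    foreach ($rows as $row) {
        $day = substr($row['datefield'], 0, 10); // 'YYYY-MM-DD'
        if (!isset($summary[$day])) {
            $summary[$day] = ['hits' => 0, 'unique' => 0];
            $ips[$day] = [];
        }
        $summary[$day]['hits']++;
        $ips[$day][$row['ip']] = true;           // remember each IP once per day
        $summary[$day]['unique'] = count($ips[$day]);
    }
    return $summary;
}

print_r(daily_summary($rows));
```

The same numbers could also be computed in the database with a GROUP BY on the date part of datefield; doing it in PHP here just keeps the example self-contained.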

There are many ways to get that done right, but as I said before, it is not my intention to talk about that. Let’s just concentrate on bot visits. As you know, there are many robots crawling the web and collecting data from websites. It’s always good to know who they are and what they are doing on your website. It’s also useful to trap bad robots and redirect them away.

Look again at the visits table and check the useragent column. It holds data about the user’s browser and looks like Mozilla/5.0 (Windows; U; Windows NT 6.1; sl,en:us; rv:1.9.2) Gecko/20100115 Firefox/3.6 (.NET CLR 3.5.30729) when the user is human and something like Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) when the user is a robot. I could look for the word ‘bot’ in my useragent column and I should find most of them really easily.
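
The search for the word ‘bot’ could be sketched like this. The function name looks_like_bot() is made up for this example; the two user agent strings are the ones quoted above.

```php
<?php
// Rough check: does the user agent string contain the word "bot"?
// stripos() is case-insensitive, so "Googlebot" and "NjuiceBot" both match.
function looks_like_bot(string $useragent): bool
{
    return stripos($useragent, 'bot') !== false;
}

$human = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; sl,en:us; rv:1.9.2) Gecko/20100115 Firefox/3.6 (.NET CLR 3.5.30729)';
$robot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

var_dump(looks_like_bot($human)); // bool(false)
var_dump(looks_like_bot($robot)); // bool(true)
```

Keep in mind that this only finds robots that call themselves one. A script that announces itself as PycURL/7.18.2 or Java/1.6.0_18 would pass as human here.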

But I found an even easier way to do that. It’s true, robots are not very smart. They can’t resist trying to open the robots.txt document. When a robot comes around, there is a good chance that it will ask for the robots.txt file. Of course, I don’t have one. If there were such a file, a robot would open it and so slip past my statistics collector. But in my case I just see in my $_GET['q'] variable that it wanted a robots.txt file (because my .htaccess file redirects it to the index.php script).

That’s the first step in how I detect bots, because humans don’t ask for the robots.txt file very often. Together with the word ‘bot’ in the useragent column, I can be pretty sure whether you are a human or just another bot. Of course, I don’t want bots to land on index.php when they request the robots.txt file. So right after I insert the visit into my database, I create a fake robots.txt file with PHP code:

if (isset($_GET['q']) && $_GET['q'] == 'robots.txt') {
    // Answer with a plain-text robots.txt built on the fly.
    $text = "User-agent: *\r\nDisallow: /email-list/";
    header("Content-Type: text/plain");
    echo $text;
    exit();
}

There is a small trap for bad robots included. As you can see, a robot that requests the robots.txt file gets:

User-agent: *
Disallow: /email-list/

Good robots obey and don’t try to access the email-list folder. But bad robots do just that! They immediately try to get into my email-list folder… which doesn’t exist, of course! It’s a simple trap which helps me to separate good robots from bad ones. I have a separate table in my database just for robots where I specify if a robot is good or bad.
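
How the logged requests could be turned into a good/bad verdict is sketched below. This is only an illustration under my own assumptions: the function name classify_visitors(), the labels and the sample rows are made up, and the real robots table may well be filled differently.

```php
<?php
// Classify each IP from a list of logged requests (hypothetical helper).
// An IP that asked for robots.txt counts as a robot; one that went into
// the disallowed /email-list/ folder counts as a bad robot.
function classify_visitors(array $visits): array
{
    $result = [];
    foreach ($visits as $visit) {
        $ip = $visit['ip'];
        $path = ltrim($visit['uri'], '/');
        if (!isset($result[$ip])) {
            $result[$ip] = 'unknown';
        }
        if (strpos($path, 'email-list') === 0) {
            $result[$ip] = 'bad robot';      // walked into the trap
        } elseif ($path === 'robots.txt' && $result[$ip] === 'unknown') {
            $result[$ip] = 'good robot';     // asked for the rules, so far obeys them
        }
    }
    return $result;
}

// Made-up sample data in the shape of the visits table.
$visits = [
    ['ip' => '10.0.0.1', 'uri' => '/robots.txt'],
    ['ip' => '10.0.0.1', 'uri' => '/'],
    ['ip' => '10.0.0.2', 'uri' => '/robots.txt'],
    ['ip' => '10.0.0.2', 'uri' => '/email-list/'],
    ['ip' => '10.0.0.3', 'uri' => '/about'],
];

print_r(classify_visitors($visits));
```

In this sketch a visitor that goes straight to /email-list/ without ever reading robots.txt is also marked as a bad robot, since no normal link points there.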

So, that’s it. It is up to your imagination what to do with bad robots. You can simply give them a die('spammer'); command, you can trap them in some PHP script and have fun with them, or you can immediately redirect them to www.google.com! You can do whatever you want and live happily ever after!

Category: Web development