The robots.txt file is often overlooked, even though it plays an important part in getting your webpages indexed properly and is very easy to set up.
I know that robots.txt is nothing new. But I’ve been preparing an SEO sheet for a while and wanted to share this small and useful portion with you.

What is robots.txt?

Robots.txt is a plain text file, placed in the root folder of a website, that is used to exclude content from the crawling process of search engine spiders / bots. The convention it follows is called the Robots Exclusion Protocol.

Why use robots.txt?

In general, we want our webpages to be indexed by the search engines. But there may be some content that we don’t want to be crawled and indexed: a personal images folder, the website administration folder, a web developer’s test folder for a customer, folders with no search value like cgi-bin, and many more. The main idea is that we don’t want them to be indexed.

Is the robots.txt file a guaranteed solution?

No. Standards-based bots, such as those of Google, Yahoo and the other big search engines, obey your robots.txt file, but only because they are programmed to. Any bot can be configured to ignore the file, and robots.txt does not block access by itself. Result: there is no guarantee.
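
To make the "no guarantee" point concrete, here is a minimal sketch in Python of where the check actually happens: inside the bot. It uses the standard library's urllib.robotparser. The rules, the bot names, the URL and the honor_robots flag are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, parsed from a string so the sketch needs no network access.
RULES = """\
User-agent: *
Disallow: /admin/
"""


def may_fetch(url: str, agent: str, honor_robots: bool = True) -> bool:
    """Return True if this (hypothetical) crawler will go on to request `url`."""
    if not honor_robots:
        # A bot that skips the check is never stopped by robots.txt itself.
        return True
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, url)


url = "http://www.example.com/admin/login.html"
print(may_fetch(url, "PoliteBot"))                    # False
print(may_fetch(url, "RudeBot", honor_robots=False))  # True
```

Nothing on the server side changes between the two calls. Content that really must stay private needs real access control, such as a password.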

How to use the robots.txt file?

The robots.txt file has some simple directives which manage the bots. These are:
  • User-agent: defines which bots the directives that follow apply to. * is a wildcard which means all bots, while a name such as Googlebot targets Google's bot only.
  • Disallow: defines which folders or files will be excluded. An empty value means nothing will be excluded, / means everything will be excluded, and /foldername/ or /filename can be used to exclude specific content. Values are matched against the beginning of the URL path. With a trailing slash, /foldername/ excludes everything inside that folder. Without it, /foldername excludes every path that begins with /foldername, which covers the folder and its content but also files such as /foldername.html.
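
The Disallow matching described above can be tried out with Python's standard urllib.robotparser module. This is only a sketch: the is_allowed helper, the ExampleBot name and the /private paths are invented for the demonstration, and other parsers may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules: str, path: str, agent: str = "ExampleBot") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


with_slash = "User-agent: *\nDisallow: /private/\n"
without_slash = "User-agent: *\nDisallow: /private\n"
block_all = "User-agent: *\nDisallow: /\n"
block_none = "User-agent: *\nDisallow:\n"

# "/private/" only matches paths inside the folder.
print(is_allowed(with_slash, "/private/photo.jpg"))      # False
print(is_allowed(with_slash, "/private-notes.html"))     # True

# "/private" matches every path that begins with those characters.
print(is_allowed(without_slash, "/private/photo.jpg"))   # False
print(is_allowed(without_slash, "/private-notes.html"))  # False

# "/" excludes everything, an empty value excludes nothing.
print(is_allowed(block_all, "/index.html"))              # False
print(is_allowed(block_none, "/index.html"))             # True
```
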
There are also some other directives, which are supported by only some bots. These are:
  • Allow: works as the opposite of Disallow. Here you can list the content that is allowed to be crawled, for example a single file inside a disallowed folder. * can be used as a wildcard by bots that support it.
  • Request-rate: defines the crawl rate as pages/seconds. 1/20 means 1 page every 20 seconds.
  • Crawl-delay: defines how many seconds to wait after each successful crawl.
  • Visit-time: defines the hours between which you want your pages to be crawled. Example usage: 0100-0330, which means that pages will be crawled between 01:00 AM and 03:30 AM GMT.
  • Sitemap: shows where your sitemap file is. You must use the complete URL address of the file.
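
As a rough illustration of how a bot might read some of these extra directives, the sketch below uses Python's standard urllib.robotparser, which understands Crawl-delay, Request-rate and Sitemap (site_maps() needs Python 3.8 or newer). It has no support for Visit-time, so that one is left out. The rules, the ExampleBot name and the example.com sitemap URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string (no network access needed).
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Crawl-delay: 10
Request-rate: 1/20
Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

delay = parser.crawl_delay("ExampleBot")   # seconds to wait between requests
rate = parser.request_rate("ExampleBot")   # named tuple: (requests, seconds)
sitemaps = parser.site_maps()              # list of sitemap URLs, or None

print(delay)                        # 10
print(rate.requests, rate.seconds)  # 1 20
print(sitemaps)                     # ['http://www.example.com/sitemap.xml']
```

A bot that honors these values would then sleep for the delay between requests. Whether it does so is, again, up to the bot.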

Robots.txt example:

User-agent: * # the rules below apply to all search engine spiders.
Disallow: /secretcontent/ # do not crawl the secretcontent folder.

Resources:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360
http://www.robotstxt.org/
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt