Spider problems

**Etcher** · Fri 4 Jul '03, 4:18am

A robots.txt file gives search engine robots a standard set of instructions for navigating a Web site. If proprietary content or privacy is an issue, we suggest you identify the folders on your Web site that should be excluded from searching. A robots.txt file can then make these folders off-limits. The following discussion about robots will be updated frequently.

The Inktomi robot respects the robots.txt file. Starting at the root URL, the spider proceeds through the site by following links from this root. The robots.txt file will also help other search engines traverse your Web site while keeping them out of areas you do not want visited. To facilitate this, many Web robots offer facilities that let Web site administrators and content providers limit robot activities. This exclusion can be achieved through two mechanisms: the Robots Exclusion Protocol and the Robots META tag.

The Robots Exclusion Protocol

A Web site administrator can indicate which parts of the site should not be visited by a robot by providing a specially formatted file on the site at http://.../robots.txt.

The robots.txt file needs to reside in the root directory of the Web site!

| Site URL | Corresponding robots.txt URL |
| --- | --- |
| http://www.here.com/ | http://www.here.com/robots.txt |
| http://www.here.com:80/ | http://www.here.com:80/robots.txt |

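
The mapping from a site URL to its robots.txt URL can be sketched in a few lines of Python. This is only an illustration: the function name `robots_txt_url` is made up for this example, and it assumes the rule stated above, that the file always sits at the root of the same scheme, host, and port.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(site_url: str) -> str:
    """Return the robots.txt URL for the site that serves site_url.

    Hypothetical helper: keeps the scheme and host[:port] of the input,
    replaces the path with /robots.txt, and drops any query or fragment.
    """
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.here.com/"))
print(robots_txt_url("http://www.here.com:80/"))
print(robots_txt_url("http://www.here.com/dept/index.html"))
```

Note that a page deep inside the site still maps to the single robots.txt at the root, which is why the file cannot live in a subdirectory.
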
The actual text file would contain directives like this:

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /test/
    Disallow: /~dept/

In this example, three directories are excluded. The User-agent line specifies which robots the record applies to. In this case the * means the record applies to all robots. You need a separate "Disallow" line for every URL prefix you want to exclude; you cannot say "Disallow: /cgi-bin/ /tmp/".
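
To see how a well-behaved robot might apply these rules, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robot name `ExampleBot` and the test paths are invented for illustration; a real robot would fetch the file from the site instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# The example record from this post, held inline so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /test/
Disallow: /~dept/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical robot name; the * record applies to it.
for path in ("/index.html", "/cgi-bin/search", "/test/page.html", "/~dept/staff.html"):
    url = "http://www.here.com" + path
    verdict = "allowed" if parser.can_fetch("ExampleBot", url) else "disallowed"
    print(path, verdict)
```

Each Disallow value is matched as a prefix of the URL path, so everything below the three listed directories is off-limits while the rest of the site stays open.
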
Note also that a record may not contain blank lines, because blank lines are used to delimit multiple records.

The Robots META tag

A Web author can indicate whether a page may be indexed or analyzed for links by using a special HTML META tag. The tag takes a form such as <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> and is placed with the other meta tags in the <HEAD> area of the Web page.

The CONTENT value of the Robots META tag is a list of directives separated by commas. The INDEX directive tells an indexing robot to index the page. The FOLLOW directive tells a robot to follow links on the page. Both INDEX and FOLLOW are defaults. The values ALL and NONE set all directives on or off: ALL=INDEX,FOLLOW and NONE=NOINDEX,NOFOLLOW. For example, CONTENT="NOINDEX,FOLLOW" keeps a page out of the index while still letting the robot follow its links, and CONTENT="INDEX,NOFOLLOW" does the reverse.
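
As a rough illustration of how a robot could read this tag, the sketch below uses Python's standard `html.parser` module. The class `RobotsMetaParser` and the function `robots_policy` are hypothetical names for this example. It also assumes that when directives conflict the more restrictive one wins, which the text above does not specify.

```python
from html.parser import HTMLParser
from typing import Tuple


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <META NAME="ROBOTS"> tag on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().upper()
            if token:
                self.directives.add(token)


def robots_policy(html: str) -> Tuple[bool, bool]:
    """Return (may_index, may_follow) for a page.

    INDEX and FOLLOW are the defaults. Assumption: a restrictive
    directive (NONE, NOINDEX, NOFOLLOW) overrides a permissive one.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    may_index = not ({"NONE", "NOINDEX"} & found)
    may_follow = not ({"NONE", "NOFOLLOW"} & found)
    return may_index, may_follow


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></head></html>'
print(robots_policy(page))
```

A page with no Robots META tag at all yields the defaults, so it may be both indexed and followed.
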

Unfortunately, this META tag has a few drawbacks: few robots adhere to the standard, and not many people know about or use the Robots META tag. In addition, it cannot exclude an individual robot by name. This may change soon.
