Robots.txt File - Exclude Pages From Search Engines

  1. #1
    Senior Member
    Join Date
    Aug 2001
    Posts
    356

    Robots.txt File - Security / Tutorial

    I noticed there has been a lot of talk about Google and other search engines on the forums lately. I came across this article/tutorial on how to use the robots.txt file to tell a search engine NOT to list certain pages of your site. It's simple to do.

    Please keep the following notes in mind before using the robots.txt file:

    Please Note: Not all search engines will read the robots.txt file. Some smaller search engines ignore it, but most of the popular ones honor it. So this is not a foolproof way to keep pages unlisted, but it will work in most cases.

    Security Note: Using robots.txt is fine if there are pages that you don't want a search engine to add to its index, but it should NOT be used as a form of security to hide pages from people. The robots.txt file can be read by anyone simply by pointing their browser at it. An example of the robots.txt file in use is: http://www.altavista.com/robots.txt

    You should not list directories in the file that you don't want people to know about. All anyone has to do is look at the file and they will see the listed directories. If you already use the robots.txt file, keep that security note in mind and review your file.

    Anyway, here is the article. Hope some of you find it useful.

    ---------------------------------------

    This is a useful file that keeps search engines from indexing pages you do not want spidered. Why would you not want a page indexed by a search engine? Perhaps you want to display a page that shows an example of spamming the search engines. This type of page might include an example of repeated keywords, hidden tags with keywords, and other things that could get a page or an entire site banned from a search engine.

    An example of such a page is on this server; it is another one of the articles here, and it talks about search engine spammers. To look at the article, see The "Secrets" of Spamdexers.

    The robots.txt file is a good way to prevent this page from getting indexed. However, not every site can use it. The only robots.txt file that the spiders will read is the one in the root directory of your domain. This means you can only use it if you run your own domain. The spiders will look for the file in a location similar to these below:

    http://www.pageresource.com/robots.txt
    http://www.javascriptcity.com/robots.txt
    http://www.mysite.com/robots.txt

    Any other location of the robots.txt file will not be read by a search engine spider, so the file locations below will have no effect:

    http://www.pageresource.com/html/robots.txt
    http://members.someplace.com/you/robots.txt
    http://someisp.net/~you/robots.txt

    Now, if you have your own domain, you can see where to place the file. So let's take a look at exactly what needs to go into the robots.txt file to tell the spiders what you want done.

    If you want to exclude all the search engine spiders from your entire domain, you would write just the following into the robots.txt file:

    User-agent: *
    Disallow: /
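    As a sanity check, you can test rules like these with Python's standard urllib.robotparser module, which applies the same matching a well-behaved spider would (the example URLs here are made up):

    ```python
    from urllib.robotparser import RobotFileParser

    # A robots.txt that blocks every spider from the whole site
    rp = RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /",
    ])

    # No page on the site may be fetched
    print(rp.can_fetch("*", "http://www.mysite.com/index.html"))  # False
    print(rp.can_fetch("*", "http://www.mysite.com/aboutme/"))    # False
    ```

    can_fetch() returns False for any path on the site, since "Disallow: /" matches everything.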

    If you want to exclude all the spiders from a certain directory within your site, you would write the following:

    User-agent: *
    Disallow: /aboutme/

    If you want to do this for multiple directories, you add on more Disallow lines:

    User-agent: *
    Disallow: /aboutme/
    Disallow: /stats/

    If you want to exclude certain files, then type in the rest of the path to the files you want to exclude:

    User-agent: *
    Disallow: /aboutme/album.html
    Disallow: /stats/refer.htm
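    The directory and single-file rules above can be checked the same way with urllib.robotparser (again, the site name is made up). Note that a directory rule blocks everything under it, while a file rule blocks only that exact path:

    ```python
    from urllib.robotparser import RobotFileParser

    # Directory and single-file rules from the examples above
    rp = RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /aboutme/",
        "Disallow: /stats/refer.htm",
    ])

    # Everything under /aboutme/ is blocked
    print(rp.can_fetch("*", "http://www.mysite.com/aboutme/album.html"))  # False
    # The single listed file is blocked...
    print(rp.can_fetch("*", "http://www.mysite.com/stats/refer.htm"))     # False
    # ...but other files in /stats/ are still allowed
    print(rp.can_fetch("*", "http://www.mysite.com/stats/index.html"))    # True
    ```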

    If you are curious, here is what I used to keep the spamming article from getting indexed:

    User-agent: *
    Disallow: /zine/spam1.htm

    If you want to keep a specific search engine spider from indexing your site, do this:

    User-agent: Robot_Name
    Disallow: /zine/spam1.htm

    You'll need to know the name of the search engine spider or robot, and place it where Robot_Name is above. You can find these names from the web sites of the various search engines.
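    A quick sketch of how a per-robot rule behaves, again using urllib.robotparser ("ExampleBot" and "OtherBot" are made-up robot names): the rule only applies to the named robot, and robots with no matching User-agent line (and no * fallback) are not restricted.

    ```python
    from urllib.robotparser import RobotFileParser

    # Block only one named robot from a single file
    rp = RobotFileParser()
    rp.parse([
        "User-agent: ExampleBot",
        "Disallow: /zine/spam1.htm",
    ])

    # The named robot is blocked from that file
    print(rp.can_fetch("ExampleBot", "http://www.mysite.com/zine/spam1.htm"))  # False
    # Any other robot is unaffected
    print(rp.can_fetch("OtherBot", "http://www.mysite.com/zine/spam1.htm"))    # True
    ```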

    So, if you need to exclude something from search engine indexing, this is the most effective tool recognized by the search engines, so use it to keep the spiders out of any part of your site you want them to avoid.

    http://www.pageresource.com/zine/robotstxt.htm
    An Ounce of Prevention is Worth a Pound of Cure...
     

  2. #2
    Senior Member
    Join Date
    Oct 2001
    Location
    Helsinki, Finland
    Posts
    570
    If you want to prevent single pages from being indexed, the easiest way is to use META tags in your document: inside the <HEAD>...</HEAD> section of your page, put <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">. Now robots won't index the page, nor follow the links on it. NOFOLLOW can be replaced by FOLLOW and NOINDEX by INDEX; you can guess what they do.
    More about the META-tags: http://vancouver-webpages.com/META/
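    For the curious, here is a crawler-side sketch of reading that META tag with Python's standard html.parser module (the class name RobotsMetaParser is mine, just for illustration):

    ```python
    from html.parser import HTMLParser

    class RobotsMetaParser(HTMLParser):
        """Collects the directives from a <META NAME="ROBOTS"> tag, if present."""

        def __init__(self):
            super().__init__()
            self.directives = []

        def handle_starttag(self, tag, attrs):
            # HTMLParser lowercases tag and attribute names for us
            a = dict(attrs)
            if tag == "meta" and a.get("name", "").lower() == "robots":
                self.directives = [d.strip().upper()
                                   for d in a.get("content", "").split(",")]

    p = RobotsMetaParser()
    p.feed('<html><head><META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">'
           '</head><body></body></html>')
    print(p.directives)  # ['NOINDEX', 'NOFOLLOW']
    ```

    A polite spider that sees NOINDEX in that list would drop the page from its index, and one that sees NOFOLLOW would ignore the page's links.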

    -ZeroOne
    Q: Why do computer scientists confuse Christmas and Halloween?
    A: Because Oct 31 = Dec 25
