Thread: MSNBot Spider: I do not understand. Good or bad?

  1. #1
    Senior Member
    Join Date
    Oct 2001
    Location
    Texas!
    Posts
    271

    MSNBot Spider: I do not understand. Good or bad?

    Are these "spiders/web crawlers" bad or good? I really do not understand. Are they supposed to help the pages rank in a search engine?

    "MSNBot Spider" is on my forum right now. I have never seen this or any other spider there before. What is it doing?

    I found this: http://www.robotstxt.org/wc/faq.html#what, but it's not very much help in figuring out how this "spider" got to my site, and why it's there.

    Thank you in advance, AO.

  2. #2
    Well, I found this on Googlebot. I believe they're all similar...

    Googlebot has two versions, deepbot and freshbot. Deepbot, the deep crawler, tries to follow every link on the web and download as many pages as it can to the Google indexers. Currently (March 2006), it completes this process about once a month. Freshbot crawls the web looking for fresh content. It visits websites that change frequently, according to how frequently they change. Ideally, freshbot would visit a daily newspaper's website every day and a weekly ezine would get crawled once every 7 days.

    Googlebot discovers pages by harvesting all of the links on every page it finds. It then follows these links to other web pages. New web pages must be linked to from another known page on the web in order to be crawled and indexed.

    Source: Wikipedia
    Spiders just look at websites, search for keywords and text, and index what they find. A friend of mine made a thread whose entire text read "Smack a Hoe" (thread: http://www.pureehosting.com/postt58.html ). Since not many websites have that phrase, once Googlebot crawled his site, found the thread, and indexed the phrase, the thread became #1 when googling for smack a hoe.
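
    The link-harvesting step in the quoted passage above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Googlebot or MSNBot is actually implemented; the class name, the helper function, and the sample page are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def harvest_links(base_url, html):
    """Return the links a crawler would queue up after reading this page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content and URLs, purely for illustration.
SAMPLE_HTML = '<a href="/forum/">Forum</a> <a href="http://example.org/x">X</a>'
found = harvest_links("http://example.com/index.html", SAMPLE_HTML)
print(found)
```

    A real crawler would fetch each harvested URL, repeat the process, and keep track of pages it has already seen. That is also why a page nobody links to is unlikely to be found this way.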
    O.G at A.O

  3. #3
    Master-Jedi-Pimps0r & Moderator thehorse13's Avatar
    Join Date
    Dec 2002
    Location
    Washington D.C. area
    Posts
    2,885
    The MSN bot should be harmless to your site while crawling it. It crawled our site last week; however, we had a poorly written app that went south during the process.

    All it does is index your site then rank it in their search engine. No voodoo here.

    --TH13
    Our scars have the power to remind us that our past was real. -- Hannibal Lecter.
    Talent is God given. Be humble. Fame is man-given. Be grateful. Conceit is self-given. Be careful. -- John Wooden

  4. #4
    Senior Member Falcon21's Avatar
    Join Date
    Dec 2002
    Location
    Singapore
    Posts
    252
    I consider it a good thing. Those spiders index your webpages so that people searching for certain info will arrive at your website, which exposes it to a larger audience. If you wish to prevent them from crawling and indexing certain files or directories for some reason, you can use a robots.txt file ( http://www.robotstxt.org/wc/norobots.html ). Here is a validator that checks robots.txt for syntax errors: http://tool.motoricerca.info/robots-checker.phtml
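
    As a rough sketch of how such a file is interpreted, Python's standard urllib.robotparser module can evaluate a set of rules. The rules, paths, and the "SomeOtherBot" name below are invented for the example; "msnbot" is used as the user-agent token on the assumption that it matches how the MSN crawler identifies itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one section for msnbot, one for everyone else.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """True if a well-behaved crawler with this name may fetch the path."""
    return parser.can_fetch(agent, path)


for agent, path in [
    ("msnbot", "/private/page.html"),
    ("msnbot", "/index.html"),
    ("SomeOtherBot", "/cgi-bin/search"),
]:
    print(agent, path, allowed(agent, path))
```

    Keep in mind that robots.txt is only a request. Polite crawlers honor it, but nothing forces a client to.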

  5. #5
    AO übergeek phishphreek's Avatar
    Join Date
    Jan 2002
    Posts
    4,325
    A good example of how to write a robots.txt file is the way the White House does theirs.

    There is a lot that they don't want cached. I'll leave the conspiracy theorists to wonder why...

    http://www.whitehouse.gov/robots.txt

    Google also has a pretty extensive one.

    http://www.google.com/robots.txt

    AO doesn't block much at all... looks like they block only for performance reasons?
    Block bots that may use too many resources?

    http://www.antionline.com/robots.txt

    So, if you want more exposure, let them crawl. If not, then block them.

    Oh, some "offline" website browsers (such as httrack) also obey the robots.txt file. So, if you don't want someone copying your whole website, you can block their "default" settings. Keep in mind that most of them let the user change those settings at will and bypass robots.txt entirely.

    On the security side of things... be careful what info you put in your robots.txt file. If you have directories that are "hidden" and nothing on your website links to them, there is no reason to put those paths in your robots.txt file, as they will then become public knowledge.

    A lot can be told about the layout of one's site by the robots.txt file. Which directories they have their scripts in, images, etc. etc. Also, if you don't want a directory "cached"... an attacker will think "why not" and may investigate those directories first looking for "private" goods.
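
    To make that point concrete, here is a minimal sketch of how little effort it takes to pull every disallowed path out of a robots.txt file. The function name and the sample file are hypothetical, and the parsing is deliberately simplified (it ignores which User-agent section a rule belongs to).

```python
def disclosed_paths(robots_txt):
    """Return every path named in a Disallow line, in order, without repeats."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            # An empty Disallow means "nothing is blocked", so skip it.
            if value and value not in paths:
                paths.append(value)
    return paths


# Hypothetical robots.txt, purely for illustration.
SAMPLE = """\
User-agent: *
Disallow: /cgi-bin/      # scripts live here
Disallow: /admin-backup/
Disallow:
"""

exposed = disclosed_paths(SAMPLE)
print(exposed)
```

    Anyone can fetch the file and run something like this, so a "secret" directory listed there is really just a directory with a signpost on it.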
    Quitmzilla is a firefox extension that gives you stats on how long you have quit smoking, how much money you've saved, how much you haven't smoked and recent milestones. Very helpful for people who quit smoking and used to smoke at their computers... Helps out with the urges.
