Google Security

Some background first...
Google as a Hacking Tool by 3rr0r:
http://www.antionline.com/showthread...hreadid=257512
Google Hacking Honeypots:
http://www.antionline.com/showthread...hreadid=260050
Google hacking and Credit Card Security:
http://www.antionline.com/showthread...hreadid=260580
Google: Net Hacker Tool:
http://www.antionline.com/showthread...hreadid=240791
Google Aids Hackers:
http://www.antionline.com/showthread...hreadid=240734
Google is watching you:
http://www.antionline.com/showthread...hreadid=260700


It seems that Google is becoming a problem for some webmasters. I decided to check what Google knew about the site I took over, and I wrote this tut as a reference while I worked.

Control the Spiders

Nearly all crawlers follow something called the Robots Exclusion Standard, which lets webmasters specify which parts of their website may be indexed.

To do this, we place a text file called robots.txt at the top level of the document root. Here is an example file:
Code:
User-agent:  *
Disallow:
This code sucks: it allows all crawlers to index whatever they want. Let's write rules that deny all crawlers instead.

Code:
User-agent:  *
Disallow: /
Notice the slash: it tells all crawlers to ignore everything under the document root.

Code:
User-agent:  *
Disallow: /admin
Disallow: /cgi-bin
This tells crawlers to ignore everything under the admin and cgi-bin folders in the document root. Now let's define which crawlers we like and don't like. Each User-agent/Disallow group is called a record, and the blank lines matter: leave exactly one blank line between records.

Code:
#Denies access to Google's spider (its User-agent is Googlebot)
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
You can also deny a single file:

Code:
User-agent:  *
Disallow: /admin/index.html
Note that the wildcard (*) only works in the "User-agent" line; Disallow values are treated as plain path prefixes.
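For example, a sketch of what this means under the original standard (the .bak extension is just a made-up illustration):

Code:
# Does NOT work under the original standard; the * in Disallow is taken literally
User-agent: *
Disallow: /*.bak
To block files like these you would have to list their paths, or the folder that holds them, explicitly.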

Meta Tag Crawler Denial

You may not have permission to put a robots.txt file in the document root of your webserver. In that case this method is still available, though crawlers do not support it as widely. It is simple: place one of these meta tags in your pages:

Permission to index, permission to follow links:
<meta name="robots" content="index,follow">

Do not index, permission to follow links:
<meta name="robots" content="noindex,follow">

Permission to index, do not follow links:
<meta name="robots" content="index,nofollow">

Do not index, do not follow links:
<meta name="robots" content="noindex,nofollow">

This method is more work per page and is not as well supported, but it requires no special permissions to set up.
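
As a quick illustration, the tag goes in the <head> of each page you want to control (the page below is just a skeleton, not any particular site):

Code:
<html>
<head>
  <title>Admin login</title>
  <!-- Tell crawlers not to index this page or follow its links -->
  <meta name="robots" content="noindex,nofollow">
</head>
<body>
  ...
</body>
</html>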

Dumping info in Google

This is an easy trick, though not practical for large sites. Enter this into the Google search engine:

site:www.YOURSITEHERE.com

You'll see that it dumps everything Google knows about your site. If your site isn't too big, you can skim through the results to see exactly what is exposed.
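
If the dump is too large to skim, you can narrow it with Google's other operators. For instance (YOURSITEHERE is a placeholder, and the search terms are just examples of things you probably don't want indexed):

Code:
site:www.YOURSITEHERE.com filetype:sql
site:www.YOURSITEHERE.com inurl:admin
site:www.YOURSITEHERE.com intitle:"index of"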

Foundstone's SiteDigger

To use this great tool, you need to register for a Google license key. Get it done here:
https://www.google.com/accounts/NewAccount

SiteDigger can be found here:
http://www.foundstone.com/resources/...sitedigger.htm

Install SiteDigger and enter your license key in the bottom right corner. After that, update your signatures via Options > Update Signatures. Enter your domain where it says "please enter your domain" and click Search.

SiteDigger runs automated signature-based searches against your domain, looking for common indexing mistakes left behind by webmasters. Hackers use this tool, so you should too. Anything it finds should be handled accordingly.

In short, learn to protect your public files. You can learn to use .htaccess files for Apache webservers here:
http://www.antionline.com/showthread...hreadid=231380
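
As a minimal sketch (assuming an Apache server with .htaccess overrides enabled), dropping a file like this into a folder such as /admin blocks all web access to it, crawlers included:

Code:
# .htaccess - deny all HTTP access to this folder
Order allow,deny
Deny from all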

All done.
Comments and criticisms encouraged.

SOURCES:
http://www.robotstxt.org/
http://www.antionline.com/