This tutorial is an introduction to the security risks associated with common internet search engines. In the past few years, Google has become the most popular search engine in the world, so much so that many consider it the only one worth using. For this reason, this tutorial will focus on Google; however, it can be safely assumed that most of what is said here applies to other, similar engines as well.

WEB VULNERABILITIES

A system is vulnerable when an attacker is able to make it do something it's not supposed to. There are generally two types of vulnerabilities to be found on the web: software vulnerabilities and user misconfigurations.

Although there are some sophisticated attackers who target a specific system and try to discover vulnerabilities which will allow them access, the vast majority of attackers start out with a specific software vulnerability or common user misconfiguration which they already know how to exploit, and simply try to find - or scan for - systems which have this vulnerability. Google is of limited use to the first attacker, but invaluable to the second, as will be explained in the next section.

SCANNERS

When an attacker knows the sort of vulnerability they want to exploit but has no specific target, they employ a scanner: a program which automates the process of examining a massive quantity of systems for a security flaw. The earliest computer-related scanner, for example, was the wardialer, a program which would dial long lists of phone numbers and record which ones responded with a modem handshake.

Today there are scanners which automatically query IP addresses to see which ports they have open and determine which operating system they are probably running; some can even estimate the geographic location of the system. One of the most popular IP scanners is NMap ( http://www.insecure.org ). When using NMap, one specifies a range of hosts and the specific services to scan for on each one. The program then returns a list of the available (and presumably vulnerable) systems.
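To make the idea concrete, here is a minimal sketch in Python of the core operation of any port scanner: attempt a TCP handshake and record whether it succeeds. The function names are mine, and real tools like NMap add timing control, OS fingerprinting, and much more on top of this.

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Try a TCP connection; an accepted handshake means the port is open."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out - treat all of these as closed.
        return False

def scan(host, ports):
    """Return the subset of `ports` on `host` that accept connections."""
    return [p for p in ports if is_port_open(host, p)]
```

Running `scan("127.0.0.1", range(1, 1025))` against your own machine shows which well-known ports are listening, which is exactly the information an attacker starts from.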

GOOGLE AS A SCANNER

With a little creativity, Google can be made to operate in much the same way as NMap, though the two use different protocols. As an example, let's pretend that we know a great new exploit that will allow us to steal credit card information from any online store that uses the SHOP.PL scripts. We know that www.secure.com uses SHOP.PL, but when we try our exploit, it turns out that they have already patched the vulnerability. As dedicated malicious hackers, though, we don't give up. We turn to Google and enter the following search string:

inurl:shop.pl

Feel free to try this - it's not illegal and shouldn't get you into trouble. Note that the above search employs advanced operators, which are described here: ( http://www.google.com/help/operators.html ). The result is a list of all sites which have "shop.pl" somewhere in their URL - essentially a list of potentially vulnerable targets. Just as with NMap, all that's left to do is try our exploit against each site on the list.
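For illustration, the process could be automated along these lines in Python. This is only a sketch: the URL format follows Google's standard search interface, but Google's result markup changes over time, automated querying may violate its terms of service, and the crude regex below is no substitute for a real HTML parser.

```python
import re
import urllib.parse

def dork_query_url(dork):
    """Build a Google search URL for an advanced-operator query ("dork")."""
    return "http://www.google.com/search?" + urllib.parse.urlencode({"q": dork})

def extract_result_links(html):
    """Pull absolute href targets out of a results page.

    A crude regex stands in for a real HTML parser here.
    """
    return re.findall(r'href="(https?://[^"]+)"', html)
```

An attacker's script would fetch `dork_query_url("inurl:shop.pl")`, run the response through `extract_result_links`, and then try the exploit against each host in the list.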

There are countless variations on this scheme, including some rather clever ways to find particular versions of server programs. For example, if one were to enter:

"seeing this instead" intitle:"test page for apache"

it would return a list of sites running Apache 1.3.11 - 1.3.26, because those specific phrases appear on the default page for those versions. Once again, if an attacker had an exploit for Apache 1.3.11 - 1.3.26, it would take very little effort to compromise a large number of systems.
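An attacker with a specific target can also just ask the server directly: many web servers announce their exact version in the Server response header, which is the same information the search above recovers in bulk. A rough Python sketch (the function name is mine):

```python
import http.client

def server_banner(host, port=80, timeout=5):
    """Ask a web server to identify itself via the Server response header."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        # Servers that hide their version send a shortened or empty header.
        return conn.getresponse().getheader("Server", "")
    finally:
        conn.close()
```

Against an unhardened host this returns something like "Apache/1.3.27 (Unix)", though, as discussed later, careful administrators suppress the version detail.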

GOOGLE AS AN EXPLOITER

It seems ridiculous, but sometimes administrators misconfigure their sites so badly that it's not even necessary to use a "third party" exploit in order to gain access to a system. Google indexes the web very aggressively, and unless a file is placed in a password-protected or otherwise access-restricted area of your website, there is a good chance that it will be searchable in Google. This includes password files, credit reports, medical records, etc.

In cases where the files are not adequately protected from Google, the search engine has basically already performed the exploit for the attacker. If, for example, a script kiddie wanted to deface a random web site, he or she would simply search for:

intitle:"Index of" htpasswd

which would return a list of all the poor users who allowed Google to index their .htpasswd file, which probably contains the administrative username and password for their web page. All the attacker would need to do is open the file, crack the password, and deface away.
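To see why an exposed .htpasswd file is so dangerous, consider the {SHA} entry format, one of several formats the htpasswd tool can produce (others use crypt or Apache's own MD5 variant). Cracking such an entry is a plain dictionary attack, sketched here in Python with function names of my own choosing:

```python
import base64
import hashlib

def sha_entry(user, password):
    """Build an htpasswd line in {SHA} format: base64 of the raw SHA-1 digest."""
    digest = hashlib.sha1(password.encode()).digest()
    return f"{user}:{{SHA}}{base64.b64encode(digest).decode()}"

def crack_sha_entry(entry, wordlist):
    """Try each candidate password against a {SHA} htpasswd line."""
    user, _, _stored = entry.partition(":")
    for word in wordlist:
        # Hash the guess the same way and compare whole entries.
        if sha_entry(user, word) == entry:
            return word
    return None
```

With a decent wordlist, a weak password in a leaked .htpasswd file falls in seconds - which is exactly why the file must never be web-accessible in the first place.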

GOOGLE AS A PROXY

A proxy is an intermediary system which an attacker can use to disguise his or her identity. For example, if I were to gain remote access to Bill Gates' computer and cause it to run attacks on cia.gov, it would appear to the Feds that Bill Gates was hacking them. His computer would be acting as a proxy. Google can be used in a similar way, as explained below.

Even if Google didn't provide such an easy way to locate vulnerabilities, there are other tools which can do that particular job. A program called AccessDiver ( http://www.accessdiver.com/ ) allows the user to specify a domain name, and it tries to access URLs which commonly lead to sensitive data or system access. This tool was, however, intended for use by administrators on their own networks. Anyone who tried to use it against cia.gov would be denied access, have their IP logged, and could well face prosecution for attempted computer trespass.
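The core of such a tool is straightforward: request a list of commonly exposed paths and see what the server hands back. A minimal Python sketch follows (the path list is purely illustrative); note that, unlike a Google search, every one of these requests shows up in the target's logs:

```python
import http.client

# Illustrative examples only - real tools ship lists of thousands of paths.
COMMON_PATHS = ["/.htpasswd", "/admin/", "/backup/", "/cgi-bin/test-cgi"]

def probe(host, paths=COMMON_PATHS, port=80, timeout=5):
    """Request each path directly; report the ones the server serves (HTTP 200)."""
    found = []
    for path in paths:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        try:
            conn.request("HEAD", path)
            if conn.getresponse().status == 200:
                found.append(path)
        except OSError:
            pass  # host unreachable or connection dropped - skip this path
        finally:
            conn.close()
    return found
```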

When using Google, there is very little danger of detection, because the attacker's computer doesn't have to access every site itself and ask suspicious questions like "Do you have a page containing .htpasswd?" The search engine has already gathered this information and will give it up freely, without a peep to the vulnerable site. Things get even more interesting when you consider the Google cache function. If you have never used this feature, try this:

Do a Google search for "USA Today". Click on the first result and read a few of the headlines. Now click back to return to your search. This time, click the "Cached" link to the right of the URL of the page you just visited. Notice anything weird? You're probably looking at the headlines from yesterday or the day before. Why, you ask? It's because whenever Google indexes a page, it saves a copy of the entire thing to its server. You are accessing the most recent copy Google made of www.usatoday.com.

This can be used for a lot more than reading old newspapers. Our attacker can now use Google to scan for sensitive files without alerting potential targets, and even when a target is found, the attacker can access its files from the Google cache without ever making contact with the target's server. The only server with any logs of the attack would be Google's, and it's unlikely that anyone there will realize an attack has taken place.
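Fetching a cached copy is as simple as issuing a search with the cache: operator (one of the advanced operators documented on the page referenced earlier), and the URL for it can be built programmatically. A small Python sketch:

```python
import urllib.parse

def cache_url(page_url):
    """Build a search URL for Google's cached copy of a page (cache: operator)."""
    # The query string is percent-encoded so special characters survive intact.
    return "http://www.google.com/search?" + urllib.parse.urlencode(
        {"q": "cache:" + page_url}
    )
```

The attacker reads the target's pages entirely out of Google's copy; the target's own server never sees a request.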

An even more elaborate trick involves crafting a special URL that would not normally be indexed by Google, perhaps one containing a buffer overflow or SQL injection payload (both common web exploits). This URL is then submitted to Google as a new web page at ( http://www.google.com/addurl.html ). Google automatically accesses it, stores the resulting data in its searchable cache, and the rest is history.

SECURING AGAINST GOOGLE EXPLOITS

This probably doesn't even have to be mentioned at this point, but make sure you are comfortable with sharing everything in your public web folder with the whole world, because Google will share it, whether you like it or not.
Also, in order to prevent attackers from easily figuring out what server software you are running, change the default error messages and other identifiers. Often, when a requested page is not found, servers will return a "404 Not Found" page like this:

Not Found
The requested URL /cgi-bin/xxxxxx was not found on this server.

Apache/1.3.27 Server at www.countrybookshop.co.uk Port 80
The only information that the legitimate user really needs is on the top line; restricting the rest will prevent your page from turning up in an attacker's search for a specific flavour of server.
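For Apache, a few standard configuration directives accomplish this. The directives below are real Apache ones, but the error-page path is only an example and must point to a page you create yourself:

```apache
# Serve a custom page instead of the default, version-revealing one
# (the path here is an example - create this page yourself)
ErrorDocument 404 /errors/404.html

# Drop the "Apache/1.3.27 Server at ..." footer from server-generated pages
ServerSignature Off

# Send a bare "Server: Apache" header instead of the full version string
ServerTokens Prod
```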

The Google cache issue raises another interesting security concern: just because you take a document off your site doesn't mean it's inaccessible. Google periodically purges its cache, but until then your sensitive files are still being happily offered to the public. If you realize that the search engine has cached files which you want to be unavailable, go to ( http://www.google.com/remove.html ) and follow the instructions to remove your page, or parts of your page, from its database.
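Better still, keep sensitive material out of Google's index in the first place. A robots.txt file in your web root asks well-behaved crawlers, Google included, to skip parts of your site (the paths below are examples). Bear in mind that it is purely advisory, and that the file itself is publicly readable - so it should never double as a map of your secrets:

```
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
```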

There is not really anything that an administrator can do to prevent the use of Google as a proxy, so the best thing to do is ensure that your system is not vulnerable to any HTTP attacks that could be conducted through Google. A good HTTP vulnerability scanning tool is N-Stealth ( http://www.nstalker.com/nstealth/ ). Run this against your network, and it will hopefully point out any holes that you need to patch up.

OTHER RESOURCES

There are many, many variations on the Google hacking techniques described here. An excellent place to find out more about these exploits is ( http://johnny.ihackstuff.com/ ). ( http://www.oxygen-inc.com/google.html ) also contains a pretty good tutorial, focusing specifically on how to find sensitive information using Google.

Hope this information is useful!