About the only thing you can do is check for a robots.txt at the root of their webserver. This file lists the directories the site owner doesn't want robots to crawl.
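To give a rough idea of what that file looks like and how you might read it, here is a minimal sketch. The function name `disallowed_paths` and the sample contents are made up for illustration, and it only looks at Disallow lines, ignoring which User-agent each rule belongs to, so treat it as a starting point rather than a full parser.

```python
def disallowed_paths(robots_txt: str) -> list:
    """Return the paths named in Disallow rules, in file order."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = line.split(":", 1)
        value = value.strip()
        # An empty Disallow value means nothing is disallowed, so skip it.
        if field.strip().lower() == "disallow" and value:
            paths.append(value)
    return paths


# Hypothetical robots.txt contents, not taken from any real site.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /backup/   # old site copies
Disallow:
Allow: /public/
"""

print(disallowed_paths(SAMPLE))
```

If you would rather not roll your own, Python's standard library also ships `urllib.robotparser`, which can answer whether a given user agent is allowed to fetch a given URL.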
It is funny you mention robots.txt. I always thought it was kind of a double-edged sword. Isn't it kind of like leaving a note for a burglar saying the jewels are hidden in the vase on the coffee table? Ironically, I have looked at a few robots.txt files, then tried to browse the listed directories, only to find directory indexing enabled. They went to the trouble of writing a robots.txt, yet overlooked something as simple as turning off indexing.

Back to the topic, however. Go here
Perhaps you can read some of these scripts, then alter them and write your own.


Be safe and stay free