You could always just grab the HTML, write a simple script to strip out all the HTML tags and put every word on its own line, and then go through the file to remove the duplicate words.

If you think I have this script, you're wrong. ;p
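
That said, a rough sketch of the idea in Python might look something like this. It's only a guess at one way to do it, using the standard library's `html.parser`. The names `TextExtractor` and `unique_words` are made up for illustration, and what counts as a "word" (letters, digits and apostrophes, lowercased) is my own assumption, so adjust to taste.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, skips <script>/<style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside script/style blocks is collected.
        if self._skip_depth == 0:
            self.chunks.append(data)


def unique_words(html):
    """Return the distinct words in the page text, in first-seen order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Assumption: a word is a run of letters, digits or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(words))


if __name__ == "__main__":
    sample = "<html><body><p>The cat saw the <b>other</b> cat.</p></body></html>"
    # One word per line, duplicates removed.
    print("\n".join(unique_words(sample)))
```

To run it on a real page you'd still need to grab the HTML first, for example with `urllib.request.urlopen(url).read()` and decoding the bytes with whatever encoding the page actually uses. Deduplicating in memory also skips the "write every word to a file, then search the file" step, which seemed simpler for a sketch.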