Around the middle of November 2006, I decided to experiment with ways to protect email addresses from spam-harvesting robots. My testbed was this website, which is admittedly not the highest-traffic site in the world. As such, it is probably fairly representative of ordinary websites.
I planted addresses on my elsewhere page, in a hidden <div>. (They have since been removed.)
results
Interestingly, the spambots don't include a dash or a plus as part of an address. s-mailto became just mailto, and s-prot+REMOVEME became removeme. Note the change to lowercase. Posting an address in uppercase, and filtering out all mail sent to the lowercased form, looks like it would be effective. It's also permitted by RFC 822: email programs are allowed to be case-sensitive, except with regard to the special address POSTMASTER, which must be accepted regardless of case.
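The uppercase idea can be sketched as a small mail filter. This is only an illustration under assumptions: the function name `is_harvested_copy` and the address `JSMITH@example.com` are hypothetical, and a real setup would hook this check into whatever mail server or filter is actually in use.

```python
def is_harvested_copy(recipient: str, published: str) -> bool:
    """Return True if `recipient` is a case-altered copy of `published`.

    Assumption: `published` is the address exactly as it appears on the
    web page (for example, with an uppercase local part).
    """
    local, _, domain = recipient.rpartition("@")
    pub_local, _, pub_domain = published.rpartition("@")

    # The domain part is case-insensitive, so compare it loosely.
    if domain.lower() != pub_domain.lower():
        return False

    # POSTMASTER must be accepted regardless of case, so never flag it.
    if local.lower() == "postmaster":
        return False

    # Same letters but different case: a harvester most likely lowercased it.
    return local != pub_local and local.lower() == pub_local.lower()


PUBLISHED = "JSMITH@example.com"  # hypothetical published address
print(is_harvested_copy("jsmith@example.com", PUBLISHED))
print(is_harvested_copy("JSMITH@example.com", PUBLISHED))
```

Mail for which the function returns True could then be discarded, while mail addressed with the original capitalization passes through.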
| address | spams | days to harvest (after first spam) |
|---|---|---|
| s-mailto (became mailto) | 107 | 0 |
| s-prot+REMOVEME (became removeme) | 22 | 3 |
| s-inline (became inline) | 29 | 9 |
| s-span | none! | none! |
From this, it appears that the simple strategy of using HTML <span> tags to enclose the recipient username and the host parts of the address will scare off just about all spam-harvesting parsers. Apparently they use badly-tuned regexes against the raw text of the page, rather than doing the really hard work of parsing HTML.
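To see why a regex run over the raw page text would behave this way, here is a rough sketch of what such a harvester might look like. The pattern, the `harvest` function, and the `example.com` addresses are all guesses for illustration; the actual harvester code and the exact `<span>` markup used in the experiment are not known.

```python
import re

# Hypothetical harvester pattern. The local part deliberately excludes
# "-" and "+", which would explain the truncated addresses in the results.
NAIVE_ADDRESS = re.compile(r"[A-Za-z0-9._]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest(raw_html: str) -> list:
    """Scan raw page text for addresses and lowercase each match."""
    return [match.lower() for match in NAIVE_ADDRESS.findall(raw_html)]


# Assumed markup for each planted address.
mailto_link = '<a href="mailto:s-mailto@example.com">mail me</a>'
plus_address = "s-prot+REMOVEME@example.com"
span_wrapped = "<span>s-span</span>@<span>example.com</span>"

print(harvest(mailto_link))   # ['mailto@example.com']
print(harvest(plus_address))  # ['removeme@example.com']
print(harvest(span_wrapped))  # []
```

In the span-wrapped case the character just before the `@` is `>`, which the pattern does not accept as part of a username, so nothing matches. A harvester that parsed the HTML and read the rendered text would not be fooled this way.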