Anti-Spam Technology Alternatives: Part 2, Pattern-Matching
Most sophisticated anti-spam systems rely on known-spam-blocking or whitelists. Using those approaches, anti-spam vendors are
able to keep ahead of the ever-changing tricks the spammers devise. However, when anti-spam systems rely on pattern-matching, it's the spammers who are
ahead, with the anti-spammers constantly playing catch-up.
It's actually not hard to write a set of anti-spam
rules that pick up 90%+ of all spam. I did it
in my email client program Eudora. The main
kinds of rule triggers were:
1. Words and phrases about spammish subjects: porn, hair loss, mortgages, home-based businesses,
dead Nigerian husbands, etc.
2. Words and phrases indicative of a mass-mailing: basically, a lot of variants on "Click here
to remove"
3. Words and phrases indicative of a sales pitch: "free", "opportunity", "open now", etc., plus of course
the string "adv"
4. Headers suggestive of a spammish sender: freemail addresses such as yahoo.com or hotmail,
traces of certain mass-mailing programs, etc. (triggers of this type worked better a year
or two ago)
5. Letter/character combinations like those that arise when an Asian-language email is
sent to a client without Asian fonts.
Obviously, this leads to a lot of false positives. Mailing lists I actually want to be on can be
handled by filtering them to their own folders (and hence not the spam folders). Mail from old friends who have new Yahoo accounts
is, however, more of a problem.
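
The kinds of rule triggers listed above can be sketched as a small weighted rule table. This is only an illustration: the patterns, weights, threshold, and function names are assumptions made up for the example, not the actual rules I used in Eudora.

```python
import re

# Hypothetical rule table: (trigger type, pattern, weight).
# Patterns and weights are illustrative assumptions only.
RULES = [
    ("subject", re.compile(r"\b(mortgage|hair loss|home-based business)\b", re.I), 2),
    ("mass_mailing", re.compile(r"click here to (be )?removed?", re.I), 3),
    ("sales_pitch", re.compile(r"\b(free|opportunity|open now)\b", re.I), 1),
    ("sender", re.compile(r"^From:.*@(yahoo\.com|hotmail\.com)", re.I | re.M), 1),
]


def spam_score(message):
    """Add up the weight of every rule that fires at least once."""
    return sum(weight for _, pattern, weight in RULES if pattern.search(message))


def is_spam(message, threshold=2):
    """Flag the message once its score reaches the (assumed) threshold."""
    return spam_score(message) >= threshold


spam = ("From: deals@example.com\nSubject: Free mortgage opportunity\n\n"
        "Click here to remove yourself.")
friend = ("From: oldfriend@yahoo.com\nSubject: Free for lunch?\n\n"
          "I have a new job opportunity to tell you about.")

print(is_spam(spam))    # fires the subject, mass-mailing, and sales-pitch rules
print(is_spam(friend))  # fires the sender and sales-pitch rules
```

With these made-up weights, the message from the old friend with a Yahoo address is flagged too, which is exactly the false-positive problem described above.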
To get around filters of this kind, spammers have
recently started sending messages with random characters inserted, along the lines of
En!large y&ou]r bo(dy p*ar^ts.
They also send email which consists of little more than a text-filled
graphical image. Eudora 6.0, however, comes
with a built-in spam filter that captures 90% or more of all spam, including spam of the
newer types. In fact, I would guess (and this is just a guess) that words broken up with punctuation marks like that
automatically trigger spam rules, as graphics-only emails probably do. The Eudora spam filter obviously has a rich set of
old-fashioned pattern-matching rules as well; indeed, it filters almost 100% of everything
I can catch with my hand-built rule set.
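
To make the guess about punctuation-broken words concrete, here is a minimal sketch of how such a rule might work. It is not how Eudora does it (I don't know how Eudora does it); the regular expression, the function names, and the 30% threshold are all assumptions chosen for illustration.

```python
import re

# A punctuation mark sandwiched between two letters, as in "En!large".
# Apostrophes and hyphens are excluded because ordinary words contain them.
INTRA_WORD_PUNCT = re.compile(r"(?<=[A-Za-z])[^\w\s'\-](?=[A-Za-z])")


def count_broken_words(text):
    """Count whitespace-separated tokens with punctuation inside them."""
    return sum(1 for token in text.split() if INTRA_WORD_PUNCT.search(token))


def strip_obfuscation(text):
    """Remove the inserted characters so ordinary word rules can fire again."""
    return INTRA_WORD_PUNCT.sub("", text)


def looks_obfuscated(text, min_fraction=0.3):
    """Flag text in which a large share of tokens are broken up (assumed cutoff)."""
    tokens = text.split()
    if not tokens:
        return False
    return count_broken_words(text) / len(tokens) >= min_fraction


sample = "En!large y&ou]r bo(dy p*ar^ts."
print(count_broken_words(sample))
print(strip_obfuscation(sample))
print(looks_obfuscated("Please review the attached report, and call me."))
```

A rule this crude would also fire on things like domain names and abbreviations with internal periods, so it carries its own false-positive risk; the fraction threshold only softens that.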
While pattern-based anti-spam like this is a lot better
than nothing, it has one enormous drawback: false
positives. My Eudora Junk folder
is now replete with messages that aren't really junk, so I wind up having to look
through it every time I download email. And
that largely defeats the purpose of anti-spam technology in the first place.
This doesn't appear to be a war the anti-spam
vendors can win. It's been suggested
that Bayesian techniques could make anti-spam pattern-matching better, by identifying not
just words that indicate a high probability of spam, but also words that indicate a low
probability of spam. However, were such
Bayesian anti-spam systems to be widely relied on, spammers could confound them by seeding
their spams with low-spam-probability words. And
it's tough to imagine a Bayesian system sensitive enough to weed out those tricks
without producing a lot of false positives as well.
After all, if a system that powerful could be built, text
indexing and web content-filtering censorware would be a lot more
effective than they actually are.
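
The seeding trick is easy to see in a toy version of Bayesian scoring. The sketch below combines per-word spam probabilities under a naive independence assumption; the word list, the probabilities, and the function name are invented for illustration, and a real system would estimate the probabilities from large volumes of labeled mail.

```python
import math

# Hypothetical per-word spam probabilities. The values are made up;
# low numbers mark words that indicate a low probability of spam.
WORD_SPAM_PROB = {
    "mortgage": 0.95,
    "free": 0.90,
    "opportunity": 0.85,
    "meeting": 0.10,
    "thesis": 0.05,
    "regards": 0.10,
}


def spam_probability(message, unknown=0.5):
    """Combine word probabilities as prod(p) / (prod(p) + prod(1 - p))."""
    log_spam = 0.0
    log_ham = 0.0
    for word in set(message.lower().split()):
        p = WORD_SPAM_PROB.get(word, unknown)  # unknown words are neutral
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Same formula as in the docstring, computed in log space.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


plain = "free mortgage opportunity"
seeded = plain + " meeting thesis regards"  # spammer adds "innocent" words

print(spam_probability(plain))
print(spam_probability(seeded))
```

With these invented numbers, adding three low-spam-probability words pulls the score from above 0.99 to below 0.5. Real probabilities would differ, but the direction of the effect is the point.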
So we believe that the most robust approach to spam
filtering is known-spam-blockers, based on actual real-world
spam harvesting, augmented by whitelists and pattern-matching to fill in the gaps.
For more information, please contact Curt Monash.
To reach Monash
Information Services by phone, please call 978-266-1815.
Copyright 1996-2003, Monash Information Services. All
rights reserved.
Updated: 05/11/04