Anti-Spam Technology Alternatives: Part 2 – Pattern-Matching 

Most sophisticated anti-spam systems rely on known-spam-blocking or whitelists.  Using those approaches, anti-spam vendors are able to keep ahead of the ever-changing tricks the spammers devise.  However, when anti-spam systems rely on pattern-matching, it’s the spammers who are ahead, with the anti-spammers constantly playing catch-up. 

It’s actually not hard to write a set of anti-spam rules that picks up 90% or more of all spam.  I did it in my email client, Eudora.  The main kinds of rule triggers were: 

1.      Words and phrases about spammish subjects:  porn, hair loss, mortgages, home-based businesses, dead Nigerian husbands, etc.

2.      Words and phrases indicative of a mass-mailing:  basically, a lot of variants on “Click here to remove” 

3.      Words and phrases indicative of a sales pitch:  free, opportunity, open now, etc., plus of course the string “adv”

4.      Headers suggestive of a spammish sender:  freemail addresses such as yahoo.com or hotmail.com, traces of certain mass-mailing programs, etc. (triggers of this type worked better a year or two ago)

5.      Letter/character combinations like “=?G”, which tend to arise when an Asian-language email is sent to a client without Asian fonts. 
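The five trigger types above can be sketched as a small rule table.  This is only an illustration in Python, not the actual Eudora filters: the patterns, the `RULES` table, and the message dictionary with `from`, `subject`, and `body` keys are all assumptions made for the example.

```python
import re

# Hypothetical rule table: (label, pattern, message field to search).
# These patterns are illustrative stand-ins for the five trigger types,
# not the author's real Eudora rules.
RULES = [
    ("spammish subject", re.compile(r"\b(porn|hair loss|mortgage|home-based business)\b", re.I), "body"),
    ("mass mailing", re.compile(r"click here to (be )?remove", re.I), "body"),
    ("sales pitch", re.compile(r"\b(free|opportunity|open now|adv)\b", re.I), "subject"),
    ("freemail sender", re.compile(r"@(yahoo|hotmail)\.com\b", re.I), "from"),
    ("undecoded charset", re.compile(r"=\?G"), "subject"),
]

def matched_rules(message):
    """Return the labels of every rule that fires, in rule-table order."""
    hits = []
    for label, pattern, field in RULES:
        if pattern.search(message.get(field, "")):
            hits.append(label)
    return hits

def is_spam(message):
    """Flag a message as spam if any single rule fires (a deliberately blunt policy)."""
    return bool(matched_rules(message))

pitch = {
    "from": "someone@yahoo.com",
    "subject": "Free opportunity",
    "body": "Click here to remove yourself",
}
friend = {
    "from": "oldfriend@yahoo.com",
    "subject": "Lunch next week?",
    "body": "Are you around on Tuesday?",
}
print(matched_rules(pitch))
print(matched_rules(friend))
```

Note that the second message, from a friend with a new Yahoo account, trips the freemail rule even though nothing else about it looks like spam.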

Obviously, this leads to a lot of false positives.  Mailing lists I actually want to be on can be handled by filtering them to their own folders (and hence not the spam folders).  Mail from old friends who have new Yahoo accounts is, however, more of a problem. 

To get around filters of this kind, spammers have recently started sending messages with random characters inserted, along the lines of “En!la’rge y&ou]r bo(dy p*ar^ts”.  They also send email that consists of little more than a text-filled graphical image.  Eudora 6.0, however, comes with a built-in spam filter that captures 90% or more of all spam, including spam of the newer types.  In fact, I would guess (and this is just a guess) that words broken up with punctuation marks like that automatically trigger spam rules, as probably do graphics-only emails.  The Eudora spam filter obviously has a rich set of old-fashioned pattern-matching rules as well; indeed, it catches almost 100% of everything I can catch with my hand-built rule set. 
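As a rough illustration of how a rule might react to words broken up with punctuation, here is a minimal Python sketch.  It is a guess at the general idea, not a description of how Eudora 6.0 works; the names `INTERIOR_JUNK`, `strip_interior_junk`, and `broken_word_ratio` are hypothetical.

```python
import re

# One or more punctuation/symbol characters sandwiched between two letters,
# as in "En!la'rge".  Hypothetical pattern, chosen for this example only.
INTERIOR_JUNK = re.compile(r"(?<=[A-Za-z])[^A-Za-z0-9\s]+(?=[A-Za-z])")

def strip_interior_junk(text):
    """Remove symbols embedded inside words so ordinary word rules can match again."""
    return INTERIOR_JUNK.sub("", text)

def broken_word_ratio(text):
    """Fraction of whitespace-separated words that contain embedded symbols."""
    words = text.split()
    if not words:
        return 0.0
    broken = sum(1 for w in words if INTERIOR_JUNK.search(w))
    return broken / len(words)

sample = "En!la'rge y&ou]r bo(dy p*ar^ts"
print(strip_interior_junk(sample))
print(broken_word_ratio(sample))
```

A rule like this has its own false-positive risk: legitimate words such as "don't" or "e-mail" also contain interior punctuation, so a real filter would need a threshold on the ratio rather than firing on any single broken word.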

While pattern-based anti-spam like this is a lot better than nothing, it has one enormous drawback:  false positives.  My Eudora “Junk” folder is now replete with messages that aren’t really junk, so I wind up having to look through it every time I download email.  And that largely defeats the purpose of anti-spam technology in the first place. 

This doesn’t appear to be a war the anti-spam vendors can win.  It’s been suggested that Bayesian techniques could make anti-spam pattern-matching better, by identifying not just words that indicate a high probability of spam, but also words that indicate a low probability of spam.  However, were such Bayesian anti-spam systems to be widely relied on, spammers could confound them by seeding their spams with low-spam-probability words.  And it’s tough to imagine a Bayesian system sensitive enough to weed out those tricks without producing a lot of false positives as well.  After all, if a system that powerful could be built, text indexing and web content-filtering “censorware” would be a lot more effective than they actually are. 
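The seeding trick can be shown with a toy example.  The Python sketch below is a simplified naive-Bayes-style score that looks only at the words present in a message, with add-one smoothing; it is not any vendor's actual algorithm, and the training messages and function names are invented for illustration.

```python
import math
from collections import Counter

def train(spam_docs, ham_docs):
    """Count, per word, how many spam and non-spam ("ham") messages contain it."""
    spam_counts = Counter(w for d in spam_docs for w in set(d.lower().split()))
    ham_counts = Counter(w for d in ham_docs for w in set(d.lower().split()))
    return spam_counts, ham_counts, len(spam_docs), len(ham_docs)

def spam_probability(text, model):
    """Combine per-word evidence as log-odds, using only words present in the text."""
    spam_counts, ham_counts, n_spam, n_ham = model
    log_odds = math.log(n_spam / n_ham)  # prior odds from the training mix
    for w in set(text.lower().split()):
        # Add-one smoothing keeps unseen words from producing zero probabilities.
        p_w_spam = (spam_counts[w] + 1) / (n_spam + 2)
        p_w_ham = (ham_counts[w] + 1) / (n_ham + 2)
        log_odds += math.log(p_w_spam / p_w_ham)
    return 1 / (1 + math.exp(-log_odds))

# Invented toy training data.
spam_docs = [
    "free mortgage offer click here",
    "enlarge free offer now",
    "free pills click now",
]
ham_docs = [
    "meeting agenda for tuesday",
    "lunch on tuesday with the team",
    "quarterly report agenda attached",
]
model = train(spam_docs, ham_docs)

plain = "free mortgage offer"
seeded = plain + " meeting agenda tuesday quarterly report lunch"
print(spam_probability(plain, model))
print(spam_probability(seeded, model))
```

In this toy setup the seeded message scores well below the plain one, even though the sales pitch itself is unchanged: each low-spam-probability word pulls the combined odds down.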

So we believe that the most robust approach to spam filtering is known-spam blocking, based on real-world spam harvesting, augmented by whitelists and pattern-matching to fill in the gaps.
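A layered filter of that kind might be sketched as follows.  This Python example is only a sketch under stated assumptions: the fingerprint scheme, the order of the checks, the sample addresses, and the contents of `KNOWN_SPAM`, `WHITELIST`, and `FALLBACK_PATTERN` are all hypothetical, and a real known-spam blocker would draw on a continuously updated feed of harvested spam.

```python
import hashlib
import re

def fingerprint(body):
    """Hash the body after normalizing case and whitespace (a hypothetical scheme)."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Stand-in for a feed of fingerprints from harvested real-world spam.
KNOWN_SPAM = {fingerprint("Click here for a FREE mortgage quote")}
# Senders the user has explicitly approved (invented address).
WHITELIST = {"oldfriend@yahoo.com"}
# Pattern rules used only to fill the gaps left by the first two layers.
FALLBACK_PATTERN = re.compile(r"click here to remove|\bfree\b", re.I)

def classify(sender, body):
    """Apply the layers in order: known spam, then whitelist, then patterns."""
    if fingerprint(body) in KNOWN_SPAM:
        return "blocked: known spam"
    if sender.lower() in WHITELIST:
        return "inbox: whitelisted"
    if FALLBACK_PATTERN.search(body):
        return "junk: pattern match"
    return "inbox"

print(classify("x@example.com", "click here   for a free MORTGAGE quote"))
print(classify("oldfriend@yahoo.com", "Are you free for lunch?"))
print(classify("stranger@example.com", "Are you free for lunch?"))
```

The last two calls show why the whitelist matters: the same harmless message is delivered when it comes from a whitelisted friend and lands in the junk folder when the pattern layer is left to decide on its own.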

 

For more information, please contact Curt Monash.

To reach Monash Information Services by phone, please call 978-266-1815.

 

 

Copyright 1996-2003, Monash Information Services. All rights reserved.
Updated: 05/11/04