Spam Filters

Updated on 01-Mar-2006

People have different reactions to spam. Some occasionally click on spam-looking messages out of curiosity; some just hit the [Delete] key; some use a non-learning spam filter and let it quietly do its job; some go all the way, training their Bayesian spam filters for months.

There’s something about spam that triggers a primal sort of hatred in us. If you think about it rationally, just how long does it take to delete all your junk mail, one keystroke at a time? Two minutes? Out of 24 hours? But still, we whine and groan at the spam “menace.”

OK, it is indeed a menace, because if you put together all the man-hours dedicated to deleting spam, it actually makes a difference to the economy! It’s been documented, at least in America: the total cost of spam was $21.6 billion (Rs 97,000 crore!) in 2003, for 1.2 billion hours of lost productivity, according to Talkgold.com.

But forget the figures; you aren’t bothered about the economy. The point is, what do you do with your spam? It probably depends on how much spam you get. The smart thing to do, of course, is to prevent spam from getting to your Inbox in the first place: use a dedicated junk mail account for Web forms, visit that account once a month, and delete everything (except the one or two valid mails that get there). Also, if you’ve got a Web site, put up your address in the format “agent underscore 001 at thinkdigit dot com”: spell it out instead of writing it the normal way.
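If you have many addresses to put up, the spelling-out can be automated. Here is a minimal sketch in Python; the function name spell_out and the word chosen for a hyphen ("dash") are our own assumptions, not part of any standard.

```python
def spell_out(address: str) -> str:
    """Spell out an e-mail address so it no longer looks like one to a simple harvester."""
    # Replacement words; "dash" for a hyphen is an assumed convention.
    words = {"@": " at ", ".": " dot ", "_": " underscore ", "-": " dash "}
    spelled = "".join(words.get(ch, ch) for ch in address)
    # Collapse any doubled spaces left by adjacent symbols.
    return " ".join(spelled.split())


print(spell_out("agent_001@thinkdigit.com"))
```

Run on the example address, this prints the spelled-out form used above. A determined harvester could still reverse it, so treat it as a deterrent and not a guarantee.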

It’s when you must use your primary address that the spam problem, and therefore spam filters, come in.

Smart And Smarter
The first step is to consider using Outlook 2003, if you don’t already. It has a built-in spam filter, which you can set to one of two sensitivity levels: Low or High. High, of course, is stricter about which messages get through to your Inbox.

Outlook’s spam filter is not Bayesian, meaning it doesn’t use AI, and it doesn’t learn from experience. Several Bayesian filters that do learn from experience are available, including SpamBayes, POPFile, and K9; this is by no means an exhaustive list.

There are a few small differences amongst the filters we’ve mentioned. K9, for example, is very light on system resources. Also, it doesn’t ignore words such as “the” and “and”, which could make a difference to the analysis. It has an interface of its own, which displays all your mail. SpamBayes and POPFile are pretty much neck and neck in terms of fan followings, and use your Web browser as the interface. They also offer more configuration options than K9 does.

Some Popular Free Spam Filters

Name	URL	Comments
SPAMfighter	www.spamfighter.com	Includes an anti-phishing tool
K9	http://keir.net/k9.html	Lightweight
SpamBayes	http://spambayes.sourceforge.net	Has an “Unsure” rating
POPFile	http://popfile.sourceforge.net	Considered by many the best
Spamihilator	www.spamihilator.com	Has plugins for it
Blue Frog	www.bluesecurity.com	Works on the principle of actively reporting spam
SpamAssassin	http://spamassassin.apache.org	Very powerful

Installation of these filters is straightforward enough; the only “tricky” parts are configuring your mail client’s POP server address and setting it to put the messages into different folders.

“Learning from experience” means you train the software: when you first install one of these programs, it doesn’t know anything. You tell it whether each message is spam or not, and after a while, the software figures out for itself which of your messages are spam.
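To give a feel for what this training involves, here is a toy word-counting naive Bayes classifier in Python. It is only a sketch of the general idea: the class name TinyBayesFilter and its methods are made up for illustration, and SpamBayes, POPFile and K9 each use their own, more elaborate tokenising and scoring.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Toy naive Bayes spam filter (illustrative only, not any real product's method)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lower-case words and numbers; common words are deliberately not dropped.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Tell the filter that this message is 'spam' or 'ham'."""
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        total = sum(self.message_counts.values())
        if total == 0:
            return 0.5  # a freshly installed filter knows nothing
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        log_score = {}
        for label in ("spam", "ham"):
            # Add-one smoothing keeps unseen words and classes away from log(0).
            prior = (self.message_counts[label] + 1) / (total + 2)
            score = math.log(prior)
            n_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (n_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into a probability, clamped to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1 / (1 + math.exp(diff))


f = TinyBayesFilter()
f.train("Cheap Viagra from Online Pharmacy", "spam")
f.train("Cheap pills, buy now", "spam")
f.train("Meeting agenda for Monday", "ham")
f.train("Lunch on Monday?", "ham")
print(f.spam_probability("cheap pharmacy pills"))  # above 0.5
print(f.spam_probability("Monday meeting"))        # below 0.5
```

With only four training messages the scores mean little; the point is that every message you classify shifts the word counts, which is why accuracy improves over weeks of use.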

Spam filters have another use: in POPFile, for example, you can classify messages not just as spam and good mail, but also into any category you like, such as “work,” “home,” “subscriptions,” and so on. You could even filter out mail from your mother-in-law this way!

Now, how does Outlook filter mail without being trained? According to David Coursey of ZDNet.com, “Microsoft has looked at millions of spam and non-spam e-mails and created a filtering mechanism based on some 100,000 variables.” Coursey’s method for the Outlook spam filter is to first set the filtering level to Low, which offers about 95 per cent accuracy, and diligently check your Deleted Items folder for good mail. If you find good mail there, put the senders on your Allowed Senders list. Do this for a couple of weeks, and then set the filtering level to High; you can get up to 98 per cent accuracy, he says.

A Question Of Maths
So what’s not to like about spam filters? Accuracy can get to 95 per cent or even 98 per cent! Great, right? Not in the least! The entire problem with spam filters is false positives: good mail that ends up in the Spam folder. Instead of two minutes a day deleting junk mail, you now spend a minute and a half checking your Junk folder for good mail. And worse, since you expect that folder to be full of junk, you skim through it, making for a good chance that you’ll miss a valid mail! What’s worse: having to delete 50 spam messages, or losing one good mail?

Let’s do some maths. If the system is 98 per cent accurate, one in 50 mails gets misclassified. Let’s say the mistakes are split evenly: one in 100 spam messages gets to your Inbox, and one in 100 good mails gets sent to the Spam folder. The former is very acceptable indeed. But the latter just isn’t! If you get 20 good mails a day, that’s one good message trashed every five days!

And that’s precisely where Bayesian filters come in. They learn, and their accuracy can improve with training. By the same arithmetic, 99 per cent would mean one good message trashed every 10 days; 99.5 per cent would make that 20 days; and 99.9 per cent would mean less than one good mail trashed in three months! Now that is a figure we’d be quite happy with, but it would take a rather long time to reach 99.9 per cent accuracy.
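The arithmetic above is easy to rerun with your own numbers. The sketch below is ours, and it bakes in the same rough assumption the figures rest on: half of the filter’s mistakes fall on good mail, so the share of good mail lost is half the overall error rate. The function name and the default of 20 good mails a day are illustrative.

```python
def days_per_lost_mail(accuracy_pct, good_mails_per_day=20):
    """Average number of days between good mails wrongly sent to the Spam folder."""
    error_rate = (100 - accuracy_pct) / 100
    # Assumption: mistakes are split evenly between spam let through and good mail trashed.
    false_positive_rate = error_rate / 2
    lost_per_day = good_mails_per_day * false_positive_rate
    return 1 / lost_per_day


for accuracy in (98, 99, 99.5, 99.9):
    days = days_per_lost_mail(accuracy)
    print(f"{accuracy} per cent accurate: one good mail lost about every {days:.0f} days")
```

For 20 good mails a day this gives roughly 5, 10, 20 and 100 days. Change good_mails_per_day to your own volume to see how quickly a busy Inbox eats into even a very accurate filter.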

You could speed up the process: go to your junk e-mail account, forward all your junk to your main address, and train your filter on all of it. You could also tell your friends to forward all their junk mail to you! But then again, there’s a catch: SpamBayes, for example, warns you if the spam and ham (good mail) it’s been trained on are imbalanced. “Too much spam and too little ham,” it’ll say. “SpamBayes works best with approximately equal numbers of each.” Now how are you to manage that? Write yourself good mails and train the filter based on those? Tell your friends to mail you more often? (“I’m training my spam filter! Write me!”)
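A balance check of this kind is simple to picture. The sketch below is purely illustrative: the function balance_warning and its max_ratio threshold of 2 are made up for this example and are not SpamBayes’s actual rule, which we haven’t examined.

```python
def balance_warning(n_spam, n_ham, max_ratio=2.0):
    """Return a warning if the training counts are lopsided, else None.

    max_ratio is a hypothetical threshold chosen for this sketch.
    """
    if n_spam == 0 and n_ham == 0:
        return "No training data yet."
    if n_ham == 0 or n_spam / n_ham > max_ratio:
        return "Too much spam and too little ham."
    if n_spam == 0 or n_ham / n_spam > max_ratio:
        return "Too much ham and too little spam."
    return None


print(balance_warning(500, 40))  # lopsided towards spam
print(balance_warning(100, 90))  # roughly balanced, so no warning
```

Forwarding 500 junk mails into a filter that has seen only 40 good ones is exactly the situation such a check would flag.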

That apart, we must mention that since Bayesian filters learn, they evolve with spam! About three years ago, it was common to see spam with such subject lines as “Cheap Viagra”, coming from “Online Pharmacy”. Now, the same message is more likely to come from “Dave Perry”, and the subject line is more likely to be “Azure eucalyptus”! Bayesian filters, therefore, are recommended over Outlook’s “hard-coded” filter, because they evolve with the spammeisters’ tricks.

Is The Solution Worse Than The Problem?
Just as people have different reactions to spam, they have different reactions to spam filters. Some feel they can’t live without them, some feel they’ve made life a little easier, and some can’t see any use for them.

One of the problems in choosing a spam filter (and, in fact, in deciding to use one at all) is the lack of solid data on how good they are. There are dozens of forums that discuss just one thing: how good a certain filter is. And on these forums, you’ll find every third post asking what others’ experiences with a particular piece of software have been like. Everyone wants to know, but no-one knows for sure.

The reason is that experiences themselves vary wildly, for two main reasons: the first is different expectations of the software, and the second is the different kinds and volumes of spam people receive. As we said earlier, Bayesian filters tend to work best if the ratio of ham to spam is about 1, and everyone’s ratio is different. Also, the type of spam people receive differs depending on where on the Net they’ve given out their e-mail address; porn and “free offers”, for example, are relatively easy to filter out.

We’re still stuck with our original question: should you use a spam filter? That, as you can see by now, is difficult to answer. In the end, it’s all about throwing technology at the problem: it’s fun! The reason many of us here at Digit use spam filters is just for the fun of it; their real value in terms of productivity is very questionable. But it’s fun to train your filter. It’s fun to see spam automatically going into the Junk folder. It’s fun to see mail being automatically classified into “Work” and “Personal”. It’s fun to boast to your friends that your filter catches all your spam, though you’d be lying if you said so! Ultimately, it’s fun to see tech doing its trick.

We encourage you to experiment with various spam filters and tell us about your experiences with them. But remember, you’ll be spending much more time playing around with your spam filter than just getting rid of the spam using [Delete]!

Team Digit

Team Digit is made up of some of the most experienced and geekiest technology editors in India!
