by Jeremy Kirk

Spam project pits humans vs. machines

news
Jun 26, 2006, 3 mins

Spam fighter's new project asks people to donate time to classify e-mail messages and test the accuracy of spam filters

John Graham-Cumming is about 666,666 clicks away from a new weapon that could help kill spam — that’s unsolicited e-mail, not the salty canned meat — for good.

Graham-Cumming, an Englishman who lives in Toulouse, France, is a seasoned spam fighter who wrote Popfile, an open-source e-mail classification tool. He also wrote Polymail, an antispam library licensed by other companies for use in spam filters.

Spam still comprises about 80 percent of all e-mail, although it has become less of an annoyance due to much-improved filtering. But spammers persevere, finding technical ways of slipping e-mail through, and the race continues to develop sharper filters.

“I don’t think spam is going to go away,” Graham-Cumming said. “Clearly spammers are still making money or they wouldn’t be sending lots of spam.”

Graham-Cumming’s new project asks people to donate their time to classify a “corpus” of 100,000 e-mail messages used to test the accuracy of spam filters. He’s set up a site, www.spamorham.org, where people can sort randomly selected messages as either “spam” or “ham,” which is good e-mail.

The e-mail messages make up the TREC (Text Retrieval Conference) 2005 Public Spam Corpus, affiliated with the U.S. National Institute of Standards and Technology (NIST).

An unlikely major donor of the e-mail was Enron, the U.S. energy company whose errant accounting practices led to bankruptcy in 2001. The e-mail of dozens of Enron employees was subpoenaed and eventually released to the public.

The Enron e-mail messages are a hot commodity for spam research — a rich trove of private e-mail and spam that’s hard to come by, Graham-Cumming said.

The idea is for each e-mail to be classified 10 times for a majority consensus. So far, the project is about one-third done.
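The consensus rule and the click count can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the function names, the "spam" and "ham" string labels, and the choice to leave a tie undecided are all hypothetical. Only the figures (100,000 messages, 10 classifications each, about one-third done) come from the article.

```python
from collections import Counter
from fractions import Fraction

CORPUS_SIZE = 100_000    # messages in the corpus, per the article
VOTES_PER_MESSAGE = 10   # classifications wanted per message, per the article


def majority_label(votes):
    """Return the label picked by more than half of the votes.

    Returns None for an empty list or a tie (assumed behavior; the
    article does not say how ties are handled).
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count * 2 > len(votes) else None


def clicks_remaining(fraction_done):
    """Classifications still needed, given the fraction already done."""
    total = CORPUS_SIZE * VOTES_PER_MESSAGE  # 1,000,000 clicks in all
    return int(total * (1 - fraction_done))
```

With the project one-third done, `clicks_remaining(Fraction(1, 3))` gives 666,666, which matches the figure at the top of the story. A message with seven "spam" votes and three "ham" votes comes out as "spam", while a five-to-five split stays undecided under the tie assumption above.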

I stepped up to the challenge. I started classifying e-mail, hoping to run across Enron employee gossip about what happened at the last company party, such as stories of accountants wearing lamp shades on their heads (which appears to have continued well into their working day).

I buzzed through 25 e-mail messages, most of which were obviously spam and devoid of scuttlebutt. Unfortunately, the real messages I came across were strictly numbing work chatter, which made the seedy spam subject lines at least mildly amusing by comparison.

I disagreed with the machines on one message, which was classified as real by the filters. The message was composed of complete sentences that appeared to be from news stories but were strung together as utter non sequiturs. The e-mail also lacked a bull’s eye zinger such as +V1a*gra! 2nite!

The message was obviously junk and didn’t make any sense, yet it somehow wriggled through the spam filter’s clutches.

Most messages are easy to classify for anyone vaguely familiar with e-mail. But overall, machines and people disagree about one out of 10 times, Graham-Cumming said.

Not surprisingly, phishing e-mail messages, which often look quite legitimate but dupe people into divulging personal details, are hardest for people to distinguish, Graham-Cumming said.

The research could be used to publish an updated corpus, one that more precisely classifies what is spam and what is ham, Graham-Cumming said. It also may lend new insight into phishing attempts, which continue to flourish despite better awareness.

“I’d be very interested in discovering if there are certain sorts of legitimate mail that always gets filtered,” Graham-Cumming said.

Those who participate in classifying messages have a chance to win a suite of Austin Powers movie trinkets, including an “Enlarger.”

What’s an enlarger? Check your junk mail box.