Σπάμ

6
COMPUTING SCIENCE SPAM ,SPAM ,SPAM ,LOVELY SPAM Brian Hayes A reprint from American Scientist the magazine of Sigma Xi, the Scientific Research Society Volume 91, Number 3 May–June, 2003 pages 200–204 This reprint is provided for personal and noncommercial use. For any other use, please send a request to Permissions, American Scientist, P.O. Box 13975, Research Triangle Park, NC, 27709, U.S.A., or by electronic mail to [email protected]. © 2003 Brian Hayes.

description

Σπάμ

Transcript of Σπάμ

Page 1: Σπάμ

COMPUTING SCIENCE

SPAM, SPAM, SPAM, LOVELY SPAM

Brian Hayes

A reprint from

American Scientistthe magazine of Sigma Xi, the Scientific Research Society

Volume 91, Number 3May–June, 2003pages 200–204

This reprint is provided for personal and noncommercial use. For any other use, please send arequest to Permissions, American Scientist, P.O. Box 13975, Research Triangle Park, NC, 27709, U.S.A.,or by electronic mail to [email protected]. © 2003 Brian Hayes.

Page 2: Σπάμ

Iused to feel forlorn whenever I checked mye-mail and found nothing waiting for me. Notmuch risk of that these days. There’s always

someone who wants to help me lose 32 pounds,or clean up my tarnished credit report, or get methat college degree I skipped in my recklessyouth. Absolute strangers send me tips on thestock market and ideas for starting a new busi-ness that I could run with a laptop computerfrom my new condo in Hawaii. Then there arethe ill-considered schemes to share the ill-gottengains of African dictators, as well as a volumi-nous and varied stream of messages that are bestdescribed as indecent proposals.

When the very first of these missives began ap-pearing in my mailbox, some years ago, they weremildly intriguing, like messages in bottles washedup on the beach. I had to wonder—in the millisec-ond before I hit the delete button—who had sentthem and from where and why. Most of all I won-dered for whom they were meant, since they wereclearly of no use or interest to me. By now, ofcourse, the sense of mystery is long gone. The oc-casional message in a bottle has become a dailytide of wrack and flotsam; it’s as if whole cargoesare being dumped overboard to litter our shores;an unstoppable oil slick of oleaginous marketingsludge slops into every e-mail inbox around theworld, wave after wave and day after day.

The very efficiency and convenience of elec-tronic communication gets some of the blame forthis flood of unwanted and thoroughly unlovedjunk e-mail. By dramatically reducing costs, theInternet makes it economically feasible to blanketthe globe with boring sales-pitch messages, evenif only the tiniest percentage of the recipients re-spond. But if technology created this problem,maybe it can also contribute to the solution. Orare social and legal remedies more promising?

Green Cards and SpamIt is worth remembering, in this era of Web pagesfestooned with blinking and bleeping bannerads, and accompanied by pop-ups and pop-un-ders, that once upon a time the Internet was an

advertising-free zone. As long as the U.S. govern-ment controlled a major part of the backbone,most forms of commercial activity were forbid-den. The occasional violations of this rule attract-ed swift and severe retribution. For example, in1978 Digital Equipment Corporation (since ab-sorbed into Compaq) sent a notice about a newcomputer system to 600 subscribers on theARPANET, one of the ancestors of the Internet. Themessage was immediately labeled a “flagrant vi-olation” of government policy, with the assurancethat “appropriate action is being taken to pre-clude its occurrence again.”

The rules changed in 1993, as the Net was pri-vatized, but social strictures on indiscriminate ad-vertising remained powerful for some years. InApril of 1994 a message with the subject heading“Green Card Lottery- Final One?” was posted si-multaneously to 6,000 Usenet news groups. Theadvertisement, signed by the Phoenix law firm ofCanter & Siegel, offered information and legalservices to immigrants. Thousands of Usenet reg-ulars—incensed not only by the commercial na-ture of the message but also by the waste of band-width and the breach of “netiquette”—houndedCanter & Siegel by e-mail and fax and telephone.The lawyers’ Internet access was cut off, andeventually the firm went out of business; Canterwas disbarred. There have not been many suchvictories in the fight against spam.

As it happens, the Canter & Siegel incident wasthe event that first popularized the term “spam.”According to Brad Templeton, a Usenet pioneerand current chair of the Electronic Frontier Foun-dation, certain small online communities hadused the word earlier to describe various kinds ofunwelcome verbiage, but it was the Green Cardaffair that made it widely known. The ultimatesource was a 1970 skit on the British televisionshow Monty Python’s Flying Circus, about a restau-rant with a limited menu and a chorus of Vikingschanting “Spam, spam, spam, spam, lovely spam,lovely spam.” (Incidentally, Hormel Foods, whomake SPAM rather than spam, attempted a de-fense of their brand name, then decided to have asense of humor about it.)

Although spam today is mainly a plague ofe-mail, the Canter & Siegel ad and several other

200 American Scientist, Volume 91

COMPUTING SCIENCE

SPAM, SPAM, SPAM, LOVELY SPAM

Brian Hayes

Brian Hayes is Senior Writer for American Scientist. Address:211 Dacian Avenue, Durham, NC 27701; [email protected]

Page 3: Σπάμ

early examples were never sent as mail; insteadthey were posted to news groups. The Usenetnews service is especially vulnerable to spam-ming because the complete list of groups is freelyavailable to everyone, unlike e-mail addresses,for which there is no central directory. Further-more, because a single news group can havemany readers, and a single reader may look atmany groups, posting to a few thousand groupsannoys millions. E-mail, in contrast, is generallyone-on-one, and it takes more effort to cause thesame amount of consternation.

In view of the vulnerability of the news sys-tem, it’s encouraging to report that the Usenetcommunity was able to organize itself to copewith the problem, if not solve it. Both manualand automated controls have been put in place,allowing a message that can be identified asspam to be removed or at least flagged as suspectbefore it reaches the reader. (One of the anti-spamprotocols is called NoCeM, pronounced “no see‘em.”) Of course there is the potential for abuseby individuals who maliciously cancel legitimatepostings, but safeguards are available, and thesystem seems generally to be working. Somespam still gets through to news groups, but thenoise level peaked several years ago, and in mostgroups it is now quite tolerable. On the otherhand, part of the reason that spammers leaveUsenet in peace these days may be that they justdon’t care: Only a small fraction of Internet usersever look at news groups.

The Spam on My PlateIn the case of e-mail, the magnitude of the spamcrisis is hard to pin down, but there is certainly awidespread perception that the amount hasmushroomed in the past year or two. Lately theproblem has been getting frequent attention inboth the popular and the technical press, andthere have been several recent conferences andworkshops to explore remedies. The Internet En-gineering Task Force has just formed an Anti-Spam Research Group. At the governmental lev-el, the Federal Trade Commission has taken aninterest (although not, as yet, much action).About half the states have enacted laws on spam;federal legislation is the subject of intense lobby-ing efforts on both sides. Elsewhere, the Euro-pean parliament has adopted a stern policy, buttheir success in enforcing it is not yet known.

The statistics on e-mail spam that appear innews reports come mainly from companies thatsell products and services for combatting spam.One of these vendors, Brightmail, Inc., says thatspam made up 8 percent of all e-mail traffic in2001 but had grown to 36 percent by the middleof 2002 and was over 40 percent by the end of lastyear. Postini, Inc., another company providinganti-spam services, reports that the proportion ofspam in the mail they monitor climbed from 20percent in January 2002 to 60 percent by Decem-ber. I have no reason to doubt the accuracy of

these figures, but it is only fair to point out thatthe companies’ own interest lies in emphasizingthe severity of the crisis.

A few consulting firms and foundations havealso surveyed the volume of spam. Jupiter Re-search estimates that the average e-mail user getsabout 2,200 spams a year, and the Gartner Groupsays that corporate e-mail is 25 to 35 percentspam. But a discordant note comes from the PewInternet & American Life Project, which sur-veyed 2,500 Internet users, asking only aboute-mail they receive at work. Half said they getno spam at all in their workplace accounts, and71 percent reported no more than “a little.”

Out of curiosity, I have been keeping track ofspam in my own in-box for the past few months.From November 2002 through March 2003 I re-ceived 1,571 items that I would unequivocallyclassify as spam; another 287 are doubtful cases.The total number of messages received in the pe-riod was 6,028. Even giving the benefit of thedoubt to all the doubtful ones, this tally suggeststhat I’m getting more than my fair share of spamwhen measured in absolute numbers; at this rate,I can expect almost 3,800 spams a year ratherthan the 2,200 predicted by Jupiter Research. Yetthe proportion of spam in my mail is only 26 per-cent, less than the averages reported by Bright-mail and Postini, and near the low end of theGartner estimate. (Obviously, percentage mea-surements are sensitive to the amount of bothspam and nonspam mail.)

Spammers are said to harvest most of their ad-dresses on the World Wide Web. My address hasbeen posted on the American Scientist Web site foreight years, which may have something to do withmy popularity among the spammers. But there’smore to the story. Another of my e-mail accountshas never been published on the Web or anywhereelse, yet it attracted more than 70 spams.

2003 May–June 201www.americanscientist.org

1

2

3

4

5

6

7

0S O N D J F M A M J J A S O N D J F

2001 2002 2003

spam

atta

cks

(mill

ions

)

Figure 1. Number of spam attacks has quadrupled in the past year anda half. Each attack sends out up to several million pieces of unwantede-mail. The data were collected by Brightmail, Inc., using a network of“honeypot” mail accounts set up deliberately to attract spam.

Page 4: Σπάμ

202 American Scientist, Volume 91

Classifying my mail as spam or nonspam tookmore thought than I expected. Of course there aremany echt spams that I could recognize in an in-stant, without glancing beyond the subject line:“Eat pizza, watch TV ... and lose 22 pounds,”“Absolutely FREE - Receive $610.00!!!” I have ahard time believing that anyone would actuallywant to receive some of these messages. Theworst of them can only be understood as “trolls,”deliberately meant to be annoying, and thereby toprovoke the recipient into responding—thus ver-ifying that the e-mail address is valid.

Alongside these blackest of spams, however,there are also shades of gray—mail that I don’twant and that I didn’t ask for (or at least I don’tremember asking for it) but that might conceiv-ably interest someone else and that comes from asource I know. For example, when a certain scien-tific society of which I’m a member (not SigmaXi!) sends me seven invitations to register for itsannual meeting, is that spam? What about a cata-logue merchant from whom I’ve made purchasesin the past and who, unbidden, sends me weeklysales flyers? Or an employment recruiter lookingfor candidates to fill a job? Would it make a dif-ference if the job might interest me? Then there’smy cousin Larry, who sends everyone in the fam-ily a continuing stream of chain letters, urban leg-ends and unfunny jokes. Is Larry a spammer?

I find it worthwhile to distinguish betweenmail that comes from a known source (my bank,my travel agency, my cousin) and mail whosesender is unknown and perhaps unknowable.Some of the mail from identified sources maywell be unwanted and indeed may qualify asspam, but at least there are direct ways of dealingwith the issue. I can call my bank, or have a wordwith Cousin Larry. The anonymous spam is aharder problem. I have no reliable means of throt-tling back the glut of e-mail I get from Your-ShoppingRewards and Freddys-Fabulous-Finds.

Spam is often defined as “unsolicited commer-cial e-mail,” and indeed almost all of it that I’m

receiving currently appears to have some com-mercial purpose. Whoever is sending this junk istrying to make money out of it (although it’s notalways obvious how). Still, lots of noncommer-cial bulk mail would be equally unwelcome. Anearly Usenet spam—even before Canter &Siegel—had the subject header “Global Alert ForAll: Jesus is Coming Soon”; the aim was to savesouls, not earn bucks, but readers were just asunhappy with the author. Spam wars can alsobreak out between political factions or over is-sues such as abortion or the death penalty. (If Iwere a spammer, I would make 10 percent of mymailings political, just to bolster the freedom-of-speech argument against regulation of bulk e-mail. I hope I haven’t just given someone anidea.) In any case, plans for controlling spamshould probably not be tied too closely to itscommercial nature.

Canning SpamIn the early years of the Internet, dealing withmiscreants was remarkably straightforward. Al-though Net lore and legend celebrate the lack ofany central governing body, the network was ac-tually run by a cohesive community with sharedaims and values. Any serious breach of the rulescould be punished by expulsion—by cancelingthe violator’s account. The situation is differentnow. If Hotmail kicks you out, you just moveover to AOL or MSN. Furthermore, Net pariahswith enough resources can set up their own In-ternet service provider.

Yet the power to isolate renegade sites has notbeen lost entirely. You cannot easily reach out andunplug an offending node of the network, butyou can set up your own system to ignore anyinformation coming from that node. In particu-lar, a system administrator can configure an In-ternet router so that it refuses traffic from selectedsites. Some years ago Paul Vixie, who conceivedseveral early Internet protocols, began publishinga list of network nodes from which spam was em-anating. This Realtime Blackhole List, or RBL, isnow maintained by a nonprofit organizationcalled MAPS. For subscribers to the RBL, the list-ed sites become black holes—no e-mail can getout, and in some cases traffic of all kinds isblocked. The weapon is quite blunt, in that itstops not only the spam but also other innocentcommunication. The rationale for this policy isthat legitimate users of a blacklisted service willexert pressure to shut down the spammer so thatthey can again reach the outside world. It’s ratherlike keeping the whole class after school untilsomeone turns in the naughty child. Not every-one approves of this strategy, and there have beenseveral lawsuits against Vixie and MAPS. Mean-while, spammers cope by keeping on the moveand by disguising their whereabouts.

Other approaches to filtering out spam try toblock only the offending mail. The filter can beinstalled on the individual user’s computer, on a

Figure 2. Daily tally of spam received in my own mailbox shows astrong uptick in the most recent weeks. The green curve records anony-mous spam—unwanted mail from unknown senders. The purple curvetraces messages that many would also classify as spam but that fall in adifferent category because they come from sources known to me. In thefive months studied, there was not a day without spam.

0

10

20

30m

essa

ges

per

day

4 11 18 25 2 9 16 23 30 6 13 20 27 3 10 17 24 3 10 17 24Jan FebNov Dec Mar

Page 5: Σπάμ

mail server or farther upstream. Much ingenuityhas been brought to bear on designing filters.Also on evading them.

The simplest kind of filtering uses static crite-ria to sort incoming mail into various folders ordirectories. For example, a filter might reject anymail that comes from “Bargain Blizzard” or thathas “inkjet” in the subject line. But the spam-mer’s response is all too easy: The sender be-comes “Blizzard of Bargains” and the subject be-comes “i-n-k-j-e-t” or one of a thousand othervariations. There is also the problem that a friendwriting for advice on an inkjet printer may havea hard time getting through.

Large-scale services such as Brightmail andPostini cannot hope to keep up with the evolvingspam ecosystem by hand-crafting filter rules. Thekey to their methodology is to set up thousandsof “honeypots”—e-mail accounts whose onlypurpose is to attract spam. Since these addressesshould have no legitimate e-mail sent to them,messages collected there can serve as templatesfor filtering the stream of mail going to the ser-vice’s subscribers. In this way one of the essen-tial, defining characteristics of spam—the factthat it goes simultaneously to thousands of ad-dresses—is turned into a weapon against it.

Another mechanism for building filters isbased on the collaborative effort of thousands ofpeople performing the routine daily chore of sort-ing their e-mail. If you are participating in such acooperative network, then every time you mark aspam message for deletion, a copy of the e-mail issent to a central repository; there many such re-ports are gathered and compiled into filter crite-ria. When the same message arrives again, ad-dressed either to you or to another participant inthe cooperative, the mail is automatically shuntedto the spam bin. This idea originated with VipulVed Prakash, a San Francisco programmer, whocreated a public-domain program called Vipul’sRazor. A commercial version called SpamNet hasbeen in testing for the past year. The SpamNet co-operative has more than 300,000 members.

Schemes that filter out only identical copies of aknown exemplar message have a serious weak-ness: The spammer can overcome them by mak-ing each copy of a mailing slightly different, per-haps by adding a few random characters to thetext. (Presumably, this explains subject headerssuch as “Married, Lonely, and home alone !2563SEpT0-115eltW64-18.”) Brightmail reportsthat 90 percent of all spam messages are nowunique, and so more-elaborate algorithms areneeded to establish a match between a templateand a target. The SpamNet technique is vulnera-ble to the same countermeasure, and so again thefilter cannot rely on a simple, exact match. Givenmany copies to examine, however, an algorithmcan determine which regions of the message areconstant and which are variable, and thereafterfocus only on the stable, identifying features. Butalready a counter-countermeasure has appeared.

It is “scramblespam,” where random charactersare not a minor addition but make up most of themessage, typically with the actual content of thead embedded in an image. Cloudmark, the com-pany behind SpamNet, reports it has recently de-vised an algorithm for identifying scramblespam.

Yet another filtering strategy abandons thewhole idea of matching messages to templatesand simply looks at the statistical properties dis-tinguishing desirable e-mails from spams. Thisidea began to gain momentum last summerwhen Paul Graham, a computer scientist bestknown for his books on the Lisp programminglanguage, circulated an article titled “A Plan forSpam.” It soon emerged that similar principleswere already familiar and well-developed in oth-er fields, such as automated text analysis andcomputational learning theory. By the time of aconference on spam held at MIT in January, sev-eral variations of the algorithm were being ac-tively explored.

2003 May–June 203www.americanscientist.org

Figure 3. Money and sex are leading topics in the spam that lands in myaccount, as they are in broader surveys; but my spam sample also has afew surprises. Some 65 messages deal with the art market; all of themmay come from a single mailer. Two dozen advertisements for labora-tory equipment and other items of possible interest to people in the sci-ences suggest that some spam may be targeted to a particular audiencerather than broadcast to the entire world. On the other hand, more than250 messages are clearly not meant for me, since they are written in al-phabets my computer cannot display. Another 29 messages, labeled“enigmas,” are written in English, but I haven’t a clue what they’reabout. Six political and religious tracts are classified as noncommercial.

0 50 100 150 200

moneymaking loans, credit

investments, taxes sweepstakes, etc.

grants, fundraising Nigerian scams

diplomas, degrees insurance

other services

pornography dating, marriage sexual products

weight loss health products

computers, telecom Internet services

spam services

home and auto entertainment

travel, vacations gambling

magazines other products

art

lab equipment, etc.

noncommercial

enigmas Russian

Japanese Korean

unknown languages

other spam

number of messages

Page 6: Σπάμ

The statistical filtering process requires a rea-sonably large corpus of e-mail messages, alreadydivided into spam and nonspam categories sothat they can serve as a training set. The programbreaks messages down into individual wordsand other “tokens,” recording the number of ap-pearances of each token in the two groups ofmessages. The resulting frequency tables give theprobability that any spam or nonspam messagecontains a specified token. Furthermore, from thesame information it is also possible to calculatethe inverse probability: Given any token, the ta-bles determine the probability that a messagecontaining the token is either spam or nonspam.New messages are classified by finding the most“interesting” tokens—those whose probabilitiesare closest either to 0 or to 1—and then comput-ing an overall composite probability that thee-mail is spam. In Graham’s experiments, theprobability distribution turned out to be stronglybimodal: Most messages were either close to 0 orclose to 1, with few in the middle. His reportederror rate is about five per 1,000 for false nega-tives (spams that squeak by as legitimate mail),with no false positives (legitimate mail misiden-tified as spam).

Graham argues that a filter based on the entirecontent of a message cannot be evaded withoutaltering the content itself. “It would not be enoughfor spammers to make their emails unique or tostop using individual naughty words,” he writes.“They’d have to make their mails indistinguish-able from your ordinary mail.… Spam is mostlysales pitches, so unless your regular mail is allsales pitches, spams will inevitably have a differ-ent character.” The argument seems compellingand the results so far are impressive, but the realtest will come when such filters are widely de-ployed, putting pressure on spam authors to in-vent countermeasures.

The Price of SpamPurely technical remedies are not the only optionfor controlling spam. There are also legal and eco-nomic maneuvers, as well as a few tricks of socialengineering.

At last count 26 states had enacted some formof law imposing sanctions on bulk e-mail, butmany spam opponents are lukewarm about theseremedies, worrying that regulating the spambusiness will tend to legitimize it. Most of thelaws take an “opt-out” approach, meaning it’s therecipient’s responsibility to get off the mailing listrather than the sender’s responsibility to obtainpermission first. The guidelines adopted by theEuropean Union require an “opt-in” mechanism.But whatever the details of the law, enforcementis always problematic because e-mail crossesjurisdictional boundaries so readily.

Another legal tactic is to forbid concealing ordisguising the origin of e-mail, on the principlethat spam can usually be stopped if it can betraced back to its true source. Proposed changes

in the Internet protocols for transporting e-mailwould have the same effect, making it more diffi-cult to send messages with a forged identity.

Economic remedies would shift the cost ofspam from the recipients back onto the senders. Itnow costs only $99 (according to some spam I re-ceived) to spew out a million e-mails. At that priceit could well be profitable to irk and inconvenience999,900 people in order to swindle the mostgullible 100. A small tax or fee imposed on eachmessage might restore the balance. Paying a pen-ny per message, the ordinary user would hardlynotice the charge, but a spammer sending 100 mil-lion e-mails would face a bill of $1 million. Thecatch again is enforcement and jurisdiction.

Scott E. Fahlman has proposed a variation inwhich the fee would be paid directly to the recip-ient. Under this plan, e-mail software would ac-cept a message from unknown correspondentsonly if the sender agreed to pay for “interruptrights.” The charge could be waived retroactivelyat the recipient’s option, so that nonspammerswould never actually have to pay.

Most of the social and economic anti-spamschemes would work only if a large majority ofe-mail users were to adopt them. Assemblingthat majority is the challenge. Indeed, if we couldachieve universal agreement on the issue ofspam, there would be no need for strategy at all.The very simplest approach would suffice: Wecould just say no. In the end, all that’s needed todefeat spam is for everyone to ignore it. But thatdoesn’t seem to be happening so far. Looking atthe 1,571 tantalizing offers in my mailbox, I don’tfeel the slightest itch to buy anything, but some-one must be taking the bait.

Even if all the filters, blackhole lists and othermeasures fail to abolish spam, efforts to control itare still worth making. They can reduce the vol-ume. They might even bring us better spam: Inorder to get through to reluctant or jaded readers,advertisers will have make their spam more ap-petizing. Live with it long enough, and youmight develop a taste for the stuff. Spam, spam,spam, spam, spam, spam, lovely spam, wonder-ful spam.

BibliographyFahlman, Scott E. 2002. Selling interrupt rights: A way to con-

trol unwanted e-mail and telephone calls. IBM SystemsJournal 41:759–766.

Garcia, Dan. Dan Garcia’s spam homepage.http://www.cs.berkeley.edu/~ddgarcia/spam.html

Gleick, James. 2003. Tangled up in spam. The New York TimesMagazine, February 9, 2003.

Graham, Paul. 2002. A plan for spam. http://www.paulgraham.com/spam.html

Graham-Cumming, John. 2003. The spammer’s compendi-um. http://spamconference.org/proceedings2003.html

Krim, Jonathan. 2003. Spam’s cost to business escalates. TheWashington Post, March 13, 2003, page A01.http://www.washingtonpost.com/ac2/wp-dyn/A17754-2003Mar12.html

Templeton, Brad. Origin of the term “spam” to mean netabuse. http://www.templetons.com/brad/spamterm.html

204 American Scientist, Volume 91