In the Hewlett-Packard spam data, a set of n = 4601 emails was classified according to whether each was spam, with "0" meaning not spam and "1" meaning spam. Fifty-seven explanatory variables based on the content of the emails were recorded, including various word and symbol frequencies. The emails were sent to George Forman (not the boxer) at Hewlett-Packard Labs, so emails containing the words "George" or "hp" are likely to be non-spam, while "credit" or "!" suggest spam. The data were collected by Hopkins et al. [1999] and are in the data matrix Spam. (They are also in the R data frame spam from the ElemStatLearn package [Halvorsen, 2009], as well as at the UCI Machine Learning Repository [Frank and Asuncion, 2010].)
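
As a rough illustration of the remark about "George", "hp", "credit" and "!", the sketch below compares average frequencies within each class. It runs on a tiny made-up matrix (`toy`, with invented values that are not taken from the real data) whose column names follow the Format section; the helper name `class_means` is hypothetical and only serves this example.

```r
# Hypothetical helper: mean of selected columns within each class
# (0 = not spam, 1 = spam).
class_means <- function(x, vars, response = "spam") {
  # Split the rows by the 0/1 response and average each selected column.
  sapply(vars, function(v) tapply(x[, v], x[, response], mean))
}

# Made-up toy matrix using the same column names as Spam
# (values are illustrative only, not real observations).
toy <- cbind(
  WFgeorge = c(1.5, 0.0, 2.5, 0.0),
  WFhp     = c(0.8, 0.0, 1.2, 0.0),
  WFcredit = c(0.0, 0.6, 0.0, 0.4),
  CFexclam = c(0.0, 0.9, 0.1, 0.7),
  spam     = c(0,   1,   0,   1)
)

m <- class_means(toy, c("WFgeorge", "WFhp", "WFcredit", "CFexclam"))
print(m)  # rows "0" and "1" hold the non-spam and spam averages
```

With the real data, and assuming the matrix Spam is already loaded, the same call would be `class_means(Spam, c("WFgeorge", "WFhp", "WFcredit", "CFexclam"))`.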

Spam

Format

A double matrix with 4601 observations on the following 58 variables.

WFmake

Percentage of words in the e-mail that match make.

WFaddress

Percentage of words in the e-mail that match address.

WFall

Percentage of words in the e-mail that match all.

WF3d

Percentage of words in the e-mail that match 3d.

WFour

Percentage of words in the e-mail that match our.

WFover

Percentage of words in the e-mail that match over.

WFremove

Percentage of words in the e-mail that match remove.

WFinternet

Percentage of words in the e-mail that match internet.

WForder

Percentage of words in the e-mail that match order.

WFmail

Percentage of words in the e-mail that match mail.

WFreceive

Percentage of words in the e-mail that match receive.

WFwill

Percentage of words in the e-mail that match will.

WFpeople

Percentage of words in the e-mail that match people.

WFreport

Percentage of words in the e-mail that match report.

WFaddresses

Percentage of words in the e-mail that match addresses.

WFfree

Percentage of words in the e-mail that match free.

WFbusiness

Percentage of words in the e-mail that match business.

WFemail

Percentage of words in the e-mail that match email.

WFyou

Percentage of words in the e-mail that match you.

WFcredit

Percentage of words in the e-mail that match credit.

WFyour

Percentage of words in the e-mail that match your.

WFfont

Percentage of words in the e-mail that match font.

WF000

Percentage of words in the e-mail that match 000.

WFmoney

Percentage of words in the e-mail that match money.

WFhp

Percentage of words in the e-mail that match hp.

WFgeorge

Percentage of words in the e-mail that match george.

WF650

Percentage of words in the e-mail that match 650.

WFlab

Percentage of words in the e-mail that match lab.

WFlabs

Percentage of words in the e-mail that match labs.

WFtelnet

Percentage of words in the e-mail that match telnet.

WF857

Percentage of words in the e-mail that match 857.

WFdata

Percentage of words in the e-mail that match data.

WF415

Percentage of words in the e-mail that match 415.

WF85

Percentage of words in the e-mail that match 85.

WFtechnology

Percentage of words in the e-mail that match technology.

WF1999

Percentage of words in the e-mail that match 1999.

WFparts

Percentage of words in the e-mail that match parts.

WFpm

Percentage of words in the e-mail that match pm.

WFdirect

Percentage of words in the e-mail that match direct.

WFcs

Percentage of words in the e-mail that match cs.

WFmeeting

Percentage of words in the e-mail that match meeting.

WForiginal

Percentage of words in the e-mail that match original.

WFproject

Percentage of words in the e-mail that match project.

WFre

Percentage of words in the e-mail that match re.

WFedu

Percentage of words in the e-mail that match edu.

WFtable

Percentage of words in the e-mail that match table.

WFconference

Percentage of words in the e-mail that match conference.

CFsemicolon

Percentage of characters in the e-mail that match SEMICOLON.

CFparen

Percentage of characters in the e-mail that match PARENTHESES.

CFbracket

Percentage of characters in the e-mail that match BRACKET.

CFexclam

Percentage of characters in the e-mail that match EXCLAMATION.

CFdollar

Percentage of characters in the e-mail that match DOLLAR.

CFpound

Percentage of characters in the e-mail that match POUND.

CRLaverage

Average length of uninterrupted sequences of capital letters.

CRLlongest

Length of longest uninterrupted sequence of capital letters.

CRLtotal

Total number of capital letters in the e-mail.

spam

Indicates whether the e-mail was considered spam, i.e. unsolicited commercial e-mail (1), or not (0).
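
Because the response shares the matrix with the explanatory variables, a common first step is to separate the two. The following is a minimal sketch under stated assumptions: `split_response` is a hypothetical helper, and `toy` is a made-up matrix with a few of the column names listed above (its values are invented for illustration).

```r
# Hypothetical helper: separate the 0/1 response column from the
# explanatory columns of a matrix laid out like Spam.
split_response <- function(x, response = "spam") {
  stopifnot(is.matrix(x), response %in% colnames(x))
  y <- x[, response]
  stopifnot(all(y %in% c(0, 1)))  # response must be coded 0/1
  X <- x[, colnames(x) != response, drop = FALSE]
  list(X = X, y = y)
}

# Made-up toy matrix (illustrative values only, not real observations).
toy <- cbind(
  WFfree     = c(0.0, 1.2, 0.0, 0.5, 0.0),
  CFdollar   = c(0.0, 0.3, 0.0, 0.2, 0.1),
  CRLlongest = c(4,   120, 7,   45,  9),
  spam       = c(0,   1,   0,   1,   0)
)

parts <- split_response(toy)
dim(parts$X)   # 5 rows, 3 explanatory columns
mean(parts$y)  # proportion of rows labelled spam
```

Applied to the real data, assuming Spam is loaded, `split_response(Spam)$X` would be expected to have 4601 rows and 57 columns, matching the description above.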

Source

Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. Spam data. Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304, 1999.