In the Hewlett-Packard spam data, n = 4601 emails were classified according to whether they were spam, where "0" means not spam and "1" means spam. Fifty-seven explanatory variables based on the content of the emails were recorded, including various word and symbol frequencies. The emails were sent to George Forman (not the boxer) at Hewlett-Packard labs, so the words "George" or "hp" in an email would likely indicate non-spam, while "credit" or "!" would suggest spam. The data were collected by Hopkins et al. [1999], and are in the data matrix Spam. (They are also in the R data frame spam from the ElemStatLearn package [Halvorsen, 2009], as well as at the UCI Machine Learning Repository [Frank and Asuncion, 2010].)
Spam
A double matrix with 4601 observations on the following 58 variables.
Percentage of words in the e-mail that match make.
Percentage of words in the e-mail that match address.
Percentage of words in the e-mail that match all.
Percentage of words in the e-mail that match 3d.
Percentage of words in the e-mail that match our.
Percentage of words in the e-mail that match over.
Percentage of words in the e-mail that match remove.
Percentage of words in the e-mail that match internet.
Percentage of words in the e-mail that match order.
Percentage of words in the e-mail that match mail.
Percentage of words in the e-mail that match receive.
Percentage of words in the e-mail that match will.
Percentage of words in the e-mail that match people.
Percentage of words in the e-mail that match report.
Percentage of words in the e-mail that match addresses.
Percentage of words in the e-mail that match free.
Percentage of words in the e-mail that match business.
Percentage of words in the e-mail that match email.
Percentage of words in the e-mail that match you.
Percentage of words in the e-mail that match credit.
Percentage of words in the e-mail that match your.
Percentage of words in the e-mail that match font.
Percentage of words in the e-mail that match 000.
Percentage of words in the e-mail that match money.
Percentage of words in the e-mail that match hp.
Percentage of words in the e-mail that match george.
Percentage of words in the e-mail that match 650.
Percentage of words in the e-mail that match lab.
Percentage of words in the e-mail that match labs.
Percentage of words in the e-mail that match telnet.
Percentage of words in the e-mail that match 857.
Percentage of words in the e-mail that match data.
Percentage of words in the e-mail that match 415.
Percentage of words in the e-mail that match 85.
Percentage of words in the e-mail that match technology.
Percentage of words in the e-mail that match 1999.
Percentage of words in the e-mail that match parts.
Percentage of words in the e-mail that match pm.
Percentage of words in the e-mail that match direct.
Percentage of words in the e-mail that match cs.
Percentage of words in the e-mail that match meeting.
Percentage of words in the e-mail that match original.
Percentage of words in the e-mail that match project.
Percentage of words in the e-mail that match re.
Percentage of words in the e-mail that match edu.
Percentage of words in the e-mail that match table.
Percentage of words in the e-mail that match conference.
Percentage of characters in the e-mail that match SEMICOLON.
Percentage of characters in the e-mail that match PARENTHESES.
Percentage of characters in the e-mail that match BRACKET.
Percentage of characters in the e-mail that match EXCLAMATION.
Percentage of characters in the e-mail that match DOLLAR.
Percentage of characters in the e-mail that match POUND.
Average length of uninterrupted sequences of capital letters.
Length of longest uninterrupted sequence of capital letters.
Total number of capital letters in the e-mail.
Denotes whether the e-mail was considered spam, i.e. unsolicited commercial e-mail (1), or not (0).
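As a rough illustration of how measurements of this kind could be derived from raw text, the following Python sketch computes one word percentage, one character percentage, and the three capital-letter summaries for a single message. The function names, the sample message, the tokenization rule (a word is a maximal run of letters and digits), and the case-insensitive word matching are assumptions made for illustration; they are not taken from the data collectors' code.

```python
import re


def word_percentage(text, word):
    """Percentage of words in `text` that equal `word`.

    Assumption: a word is a maximal run of letters and digits,
    and matching ignores case.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * hits / len(words)


def char_percentage(text, char):
    """Percentage of characters in `text` that equal `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_run_stats(text):
    """Return (average, longest, total) over runs of capital letters."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)


# Hypothetical message, made up for this example (not from the data set).
email = "FREE money!!! Claim your free credit NOW"

print(word_percentage(email, "free"))  # 2 of 7 words
print(char_percentage(email, "!"))     # 3 of 40 characters
print(capital_run_stats(email))        # runs: FREE, C, NOW
```

For the sample message, 2 of the 7 words match "free", 3 of the 40 characters are "!", and the capital-letter runs are FREE, C, and NOW, giving an average run length of 8/3, a longest run of 4, and 8 capital letters in total.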
Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. Spam data. Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304, 1999.