Resources:
In a continuation from Lab 5, we will be making a Bayesian spam classifier for this lab. Use your favorite programming language for the program.
Recall from Lab 5 and the discussion on Wednesday that the basic calculation multiplies the prior probability of a class by the conditional probability of each word given that class. The product is proportional to (∝) the probability of the class given the word list:
P(Spam|WordList) ∝ P(Spam) * P(word1|Spam) * P(word2|Spam) * ... * P(wordn|Spam)
P(Not-Spam|WordList) ∝ P(Not-Spam) * P(word1|Not-Spam) * P(word2|Not-Spam) * ... * P(wordn|Not-Spam)
For this lab, use the Spambase dataset created at HP and archived at UC Irvine: http://archive.ics.uci.edu/ml/datasets/Spambase
This dataset contains a comma-separated list of numbers representing the following data format:
word frequency list, punctuation frequency list, character characteristic list, class
where each is as follows:
See spambase.names for the actual words in the list.
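To make the record layout concrete, here is a minimal Python sketch that splits one comma-separated line into the four groups named above. The group sizes (48 word frequencies, 6 punctuation frequencies, 3 character characteristics, 1 class value, with 1 meaning spam) are an assumption taken from the Spambase documentation; check them against spambase.names. The function name parse_row and the example row are hypothetical.

```python
# Group sizes assumed from the Spambase documentation (verify in spambase.names).
N_WORD, N_PUNCT, N_CHAR = 48, 6, 3
N_TOTAL = N_WORD + N_PUNCT + N_CHAR + 1  # the last value is the class


def parse_row(line):
    """Split one comma-separated line into (words, punctuation, characters, is_spam)."""
    values = [float(v) for v in line.strip().split(",")]
    if len(values) != N_TOTAL:
        raise ValueError(f"expected {N_TOTAL} values, got {len(values)}")
    words = values[:N_WORD]
    punctuation = values[N_WORD:N_WORD + N_PUNCT]
    characters = values[N_WORD + N_PUNCT:-1]
    is_spam = int(values[-1]) == 1  # assumed encoding: 1 = spam, 0 = not-spam
    return words, punctuation, characters, is_spam


# Made-up row for illustration only; these numbers are not from the dataset.
example = ",".join(["0.1"] * 48 + ["0.02"] * 6 + ["3.5", "40", "191"] + ["1"])
words, punctuation, characters, is_spam = parse_row(example)
print(len(words), len(punctuation), characters, is_spam)
```

For the classifier itself, the three feature groups can simply be concatenated into one list of fields per entry, since the pseudocode treats every field the same way.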
Given: database of labeled values D
Output: Accuracy, error rate, precision and recall statistics where "positive" is mapped to "spam" and "negative" is mapped to "not-spam"
Split D into a training set and testing set using the Holdout method
For each field in the dataset
    Calculate the average and standard deviation for Spam training entries
    Calculate the average and standard deviation for Not-Spam training entries
EndFor
Set tp, fp, tn, fn all to 0
For each entry in the testing dataset
    Calculate P(Spam|entry) and P(Not-Spam|entry)
        Use the continuous function in the book for each P(FieldInEntry|Spam) and P(FieldInEntry|Not-Spam)
    Label the entry as either Spam or Not-Spam
    Compare the guessed label to the actual label in the dataset
        If guess == Spam and actual == Spam, increment tp
        If guess == Not-Spam and actual == Not-Spam, increment tn
        If guess == Spam and actual == Not-Spam, increment fp
        If guess == Not-Spam and actual == Spam, increment fn
EndFor
Output the accuracy, error rate, precision, and recall.
Turn in your code as an email to the instructor.
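As one possible reading of the pseudocode above, the following Python sketch walks through the same steps on synthetic data. Several choices here are assumptions rather than part of the lab: the "continuous function in the book" is taken to be the normal (Gaussian) density, the standard deviation is the population form, the holdout split keeps two thirds for training, and the per-field probabilities are summed as logarithms instead of multiplied, which picks the same label while avoiding underflow. All function names are hypothetical, and the sketch assumes both classes appear in the training set.

```python
import math
import random


def mean_std(values):
    """Average and population standard deviation of a sequence of numbers."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)


def log_gaussian(x, mu, sigma, eps=1e-6):
    """Log of the normal density; eps guards against a zero standard deviation."""
    sigma = max(sigma, eps)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def holdout_split(rows, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train(rows):
    """rows: list of (fields, is_spam). Returns {label: (prior, per-field stats)}."""
    model = {}
    for label in (True, False):
        subset = [fields for fields, y in rows if y == label]
        stats = [mean_std(column) for column in zip(*subset)]
        model[label] = (len(subset) / len(rows), stats)
    return model


def classify(model, fields):
    """Return True (spam) when the spam log-score is the larger of the two."""
    scores = {}
    for label, (prior, stats) in model.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, mu, sd) for x, (mu, sd) in zip(fields, stats)
        )
    return scores[True] > scores[False]


def evaluate(model, test_rows):
    """Count tp, fp, tn, fn over the test set and derive the four statistics."""
    tp = fp = tn = fn = 0
    for fields, actual in test_rows:
        guess = classify(model, fields)
        if guess and actual:
            tp += 1
        elif not guess and not actual:
            tn += 1
        elif guess and not actual:
            fp += 1
        else:
            fn += 1
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Synthetic two-field data for illustration only (not Spambase values).
rng = random.Random(1)
rows = [([rng.gauss(5, 1), rng.gauss(3, 1)], True) for _ in range(150)]
rows += [([rng.gauss(0, 1), rng.gauss(0, 1)], False) for _ in range(150)]
train_rows, test_rows = holdout_split(rows)
metrics = evaluate(train(train_rows), test_rows)
print(metrics)
```

To run this on the real data, the synthetic rows would be replaced by entries read from the Spambase file, one (fields, is_spam) pair per line.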