CMPS 445 - Lab 5

Lab 5 - Bayesian Spam Classifier

Due: Friday by 3:30pm

Resources:

In this lab, we are going to explore using Bayesian classifiers for filtering spam email. We will begin the lab with a brief lecture on the techniques used.

Assume you have the following data from your training dataset for this lab. Your training dataset contains 200 samples, 100 of which are spam and 100 of which are not-spam:

Word	Percent observed in Spam	Percent observed in Non-spam
sir	50%	8%
madam	50%	2%
beneficiary	20%	2%
opportunity	65%	50%
account	38%	20%
reputable	40%	19%
rare	35%	10%
important	40%	35%
confidential or confidentially	60%	30%
business	50%	45%
urgent or urgently	50%	50%
overseas or foreign	42%	2%
million	60%	10%

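To decide whether a message is spam or not spam, we would perform the following steps:

1. Create a word list of the words seen in the email that also appear in the training data table.

2. Calculate P(WordList|Spam) and P(WordList|Not-Spam) as follows, treating each word as independent of the others given the class:

     P(WordList|Spam) = P(word1|Spam) * P(word2|Spam) * ... * P(wordn|Spam)
     P(WordList|Not-Spam) = P(word1|Not-Spam) * P(word2|Not-Spam) * ... * P(wordn|Not-Spam)

3. Classify the email as spam if P(WordList|Spam) > P(WordList|Not-Spam). Because the training set is evenly split between spam and not-spam, the two classes have equal prior probability, so comparing these two values is equivalent to comparing P(Spam|WordList) and P(Not-Spam|WordList).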
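The steps above can be sketched in Python as a way to check hand calculations. This is only an illustrative sketch, not the required method for the lab: the names TRAINING, ALIASES, word_list, likelihoods, and classify are made up for this example, the sample message is invented, and the sketch assumes each table word is counted at most once per email and that word variants sharing a table row (such as "urgent or urgently") map to a single entry.

```python
import re

# Values copied from the training table, written as probabilities:
# word -> (P(word|Spam), P(word|Not-Spam))
TRAINING = {
    "sir": (0.50, 0.08),
    "madam": (0.50, 0.02),
    "beneficiary": (0.20, 0.02),
    "opportunity": (0.65, 0.50),
    "account": (0.38, 0.20),
    "reputable": (0.40, 0.19),
    "rare": (0.35, 0.10),
    "important": (0.40, 0.35),
    "confidential": (0.60, 0.30),
    "business": (0.50, 0.45),
    "urgent": (0.50, 0.50),
    "overseas": (0.42, 0.02),
    "million": (0.60, 0.10),
}

# Assumption: variants that share a table row map to one key.
ALIASES = {
    "confidentially": "confidential",
    "urgently": "urgent",
    "foreign": "overseas",
}


def word_list(message):
    """Step 1: table words seen in the message, each listed once."""
    seen = []
    for token in re.findall(r"[a-z]+", message.lower()):
        key = ALIASES.get(token, token)
        if key in TRAINING and key not in seen:
            seen.append(key)
    return seen


def likelihoods(words):
    """Step 2: multiply the per-word probabilities for each class."""
    p_spam = 1.0
    p_not_spam = 1.0
    for word in words:
        spam_prob, not_spam_prob = TRAINING[word]
        p_spam *= spam_prob
        p_not_spam *= not_spam_prob
    return p_spam, p_not_spam


def classify(message):
    """Step 3: label the message Spam only if P(WordList|Spam) is larger."""
    words = word_list(message)
    p_spam, p_not_spam = likelihoods(words)
    label = "Spam" if p_spam > p_not_spam else "Not-Spam"
    return words, p_spam, p_not_spam, label


# Invented sample message, for illustration only.
sample = ("Dear Sir, I have an urgent business opportunity involving "
          "10 million dollars in a foreign account.")
words, p_spam, p_not_spam, label = classify(sample)
print("Word list:", words)
print("P(WordList|Spam)     =", p_spam)
print("P(WordList|Not-Spam) =", p_not_spam)
print("Classification:", label)
```

Note that a message containing none of the table words leaves both products at 1.0, so this sketch labels it Not-Spam; that tie-breaking behavior follows from the strict ">" in the classification rule.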

An example will be shown in class.

What to Do For This Assignment

Find an example of a Nigerian/419 scam email. Apply the above steps to it to determine whether this training set would classify it as Spam or Not-Spam. Submit a writeup that contains the message, the word list from step 1, the calculations from step 2, and the classification of the email from step 3.