In this lab, we are going to explore using Bayesian classifiers for filtering spam email. We will begin the lab with a brief lecture on the techniques used.
Assume you have the following data from your training dataset for this lab. Your training dataset contains 200 samples, 100 of which are spam and 100 of which are not-spam:
Word | Percent of Spam messages containing word | Percent of Not-Spam messages containing word |
---|---|---|
sir | 50% | 8% |
madam | 50% | 2% |
beneficiary | 20% | 2% |
opportunity | 65% | 50% |
account | 38% | 20% |
reputable | 40% | 19% |
rare | 35% | 10% |
important | 40% | 35% |
confidential or confidentially | 60% | 30% |
business | 50% | 45% |
urgent or urgently | 50% | 50% |
overseas or foreign | 42% | 2% |
million | 60% | 10% |
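To decide whether a message is spam, we compute the likelihood of its word list under each class, treating the words as independent given the class, and choose the class with the larger value. Because the training set contains equal numbers of spam and not-spam messages, the two priors are equal and can be left out of the comparison: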
P(WordList|Spam) = P(word1|Spam) * P(word2|Spam) * ... * P(wordn|Spam)

P(WordList|Not-Spam) = P(word1|Not-Spam) * P(word2|Not-Spam) * ... * P(wordn|Not-Spam)
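The two products above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the lab's reference solution. The names `LIKELIHOODS`, `ALIASES`, `word_list_likelihoods`, and `classify` are hypothetical. Skipping words that are not in the table and treating a tie as not-spam are also choices made here for simplicity. The `prior_spam` parameter defaults to 0.5 to match the 100/100 split in the training set.

```python
# Per-word likelihoods copied from the table, as fractions:
# word -> (P(word | Spam), P(word | Not-Spam)).
LIKELIHOODS = {
    "sir": (0.50, 0.08),
    "madam": (0.50, 0.02),
    "beneficiary": (0.20, 0.02),
    "opportunity": (0.65, 0.50),
    "account": (0.38, 0.20),
    "reputable": (0.40, 0.19),
    "rare": (0.35, 0.10),
    "important": (0.40, 0.35),
    "confidential": (0.60, 0.30),
    "business": (0.50, 0.45),
    "urgent": (0.50, 0.50),
    "overseas": (0.42, 0.02),
    "million": (0.60, 0.10),
}

# Table rows that list two variants share one entry (hypothetical mapping).
ALIASES = {
    "confidentially": "confidential",
    "urgently": "urgent",
    "foreign": "overseas",
}


def word_list_likelihoods(words):
    """Return (P(WordList | Spam), P(WordList | Not-Spam)) as plain products."""
    p_spam = 1.0
    p_not_spam = 1.0
    for word in words:
        key = word.lower()
        key = ALIASES.get(key, key)
        if key not in LIKELIHOODS:
            continue  # assumption: words outside the table are ignored
        spam_rate, not_spam_rate = LIKELIHOODS[key]
        p_spam *= spam_rate
        p_not_spam *= not_spam_rate
    return p_spam, p_not_spam


def classify(words, prior_spam=0.5):
    """Compare likelihood times prior for each class; a tie goes to not-spam."""
    p_spam, p_not_spam = word_list_likelihoods(words)
    score_spam = p_spam * prior_spam
    score_not_spam = p_not_spam * (1.0 - prior_spam)
    return "spam" if score_spam > score_not_spam else "not-spam"


if __name__ == "__main__":
    message = ["sir", "account", "million"]
    print(word_list_likelihoods(message))
    print(classify(message))
```

For the word list sir, account, million, the spam product is 0.50 * 0.38 * 0.60, about 0.114, and the not-spam product is 0.08 * 0.20 * 0.10, about 0.0016, so the sketch labels the message as spam. For long word lists, the products become very small, and summing logarithms instead of multiplying is a common alternative.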