Your task this week is to write a very simple spam classifier in Python. It will classify messages as either SPAM (unwanted) or HAM (wanted). You will also define a spam threshold, which reflects the allowed percentage of spam words in the message. You'll compute a 'spam indicator', which is the ratio of spam words to the total number of unique words in the message, rounded to two decimals. If the spam indicator exceeds the spam threshold, the message is classified as spam; otherwise it is classified as ham. We'll assume the spam threshold is a constant with a value of 0.10. Your program will prompt the user for a message and then print the corresponding classification. The program is case insensitive: spam words are detected whether they are in lower case, upper case, or mixed case. For simplicity, we'll ignore punctuation.
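As a quick worked example of the indicator arithmetic, here are the counts for the first test case below (I tallied the words by hand from that message):

```python
# Message: "The widow of a deposed dictator wants your help
#           in getting his money out of the country"
unique_words = 15  # distinct words, case-insensitive
spam_words = 4     # 'widow', 'dictator', 'help', 'money'

indicator = round(spam_words / unique_words, 2)
print('SPAM indicator:', indicator)  # SPAM indicator: 0.27
```

Since 0.27 exceeds the 0.10 threshold, this message is classified as SPAM.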
Testing: Make sure that you test your solution before you submit it. Here are a few test cases with the expected output. Feel free to add your own.
Test case 1 - classify message correctly as SPAM - make sure the SPAM indicator is correct
Please enter your message: The widow of a deposed dictator wants your help in getting his money out of the country
SPAM indicator: 0.27
This message is: SPAM

Test case 2 - classify message correctly as HAM
Please enter your message: I got a new job offer today. It looks good. Are you free for lunch tomorrow? We can meet downtown at noon.
SPAM indicator: 0.09
This message is: HAM

Test case 3 - classify message correctly regardless of the case
Please enter your message: Do not miss out on this once in a lifetime OPPORTUNITY call NOW
SPAM indicator: 0.23
This message is: SPAM

Test case 4 - classify message correctly based on the number of unique words
Please enter your message: It is urgent that you call us immediately yada yada yada yada yada yada
SPAM indicator: 0.11
This message is: SPAM

Test case 5 - a message with a SPAM indicator of exactly 0.1 is classified as HAM
Please enter your message: Congratulations on your new job! I hope you like it.
SPAM indicator: 0.1
This message is: HAM
Here is the template she gave us to fill in:
"""
Enter your module docstring with a one-line overview here

and a more detailed description here.
"""

SPAM_WORDS = {'opportunity', 'inheritance', 'money', 'rich', 'dictator',
              'discount', 'save', 'free', 'offer', 'credit', 'loan',
              'winner', 'warranty', 'lifetime', 'medicine', 'claim',
              'now', 'urgent', 'expire', 'top', 'plan', 'prize',
              'congratulations', 'help', 'widow'}


def spam_indicator(text):
    """
    Enter your function docstring here
    """
    # this function returns the spam indicator rounded to two decimals


def classify(indicator):
    """
    Enter your function docstring here
    """
    # this function prints the spam classification


def get_input():
    """
    Enter your function docstring here
    """
    # prompt the user for input and return the input


def main():
    # get the user input and save it in a variable
    # Call spam_indicator to compute the spam indicator
    # Print the spam_indicator
    # Call classify to print the classification


if __name__ == '__main__':
    main()
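For comparison, here is one way the skeleton could be filled in. This is my own sketch, not the instructor's answer; in particular, stripping punctuation with `str.translate` is just one simple reading of "ignore punctuation":

```python
"""Spam classifier sketch (my attempt at filling in the template).

Computes the spam indicator as the ratio of spam words to unique
words, rounded to two decimals, and compares it to SPAM_THRESHOLD.
"""
import string

SPAM_THRESHOLD = 0.10

SPAM_WORDS = {'opportunity', 'inheritance', 'money', 'rich', 'dictator',
              'discount', 'save', 'free', 'offer', 'credit', 'loan',
              'winner', 'warranty', 'lifetime', 'medicine', 'claim',
              'now', 'urgent', 'expire', 'top', 'plan', 'prize',
              'congratulations', 'help', 'widow'}


def spam_indicator(text):
    """Return the spam indicator rounded to two decimals."""
    # lowercase and drop punctuation, then collect the unique words
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    words = set(cleaned.split())
    return round(len(words & SPAM_WORDS) / len(words), 2)


def classify(indicator):
    """Print SPAM if the indicator exceeds the threshold, else HAM."""
    if indicator > SPAM_THRESHOLD:
        print('This message is: SPAM')
    else:
        print('This message is: HAM')


def get_input():
    """Prompt the user for a message and return it."""
    return input('Please enter your message: ')


def main():
    message = get_input()
    indicator = spam_indicator(message)
    print('SPAM indicator:', indicator)
    classify(indicator)


# the real submission would call main() here; using a fixed message
# instead so the sketch runs without interactive input
if __name__ == '__main__':
    msg = ('The widow of a deposed dictator wants your help '
           'in getting his money out of the country')
    print('SPAM indicator:', spam_indicator(msg))  # 0.27
    classify(spam_indicator(msg))                  # This message is: SPAM
```

With this implementation all five test cases above produce the expected indicators (0.27, 0.09, 0.23, 0.11, 0.1).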
Solution
from collections import Counter
import os
import random

from nltk import NaiveBayesClassifier, classify, word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords

stoplist = stopwords.words('english')


def init_lists(folder):
    # read every file in the folder into a list of message strings
    a_list = []
    for a_file in os.listdir(folder):
        with open(os.path.join(folder, a_file), errors='ignore') as f:
            a_list.append(f.read())
    return a_list


def preprocess(sentence):
    # lowercase, tokenize, and lemmatize the message
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(sentence)]


def get_features(text, setting):
    if setting == 'bow':
        # bag of words: map each non-stopword to its count
        return {word: count for word, count in Counter(preprocess(text)).items()
                if word not in stoplist}
    # otherwise map each non-stopword to True (boolean features)
    return {word: True for word in preprocess(text) if word not in stoplist}


def train(features, samples_proportion):
    train_size = int(len(features) * samples_proportion)
    # initialise the training and test sets
    train_set, test_set = features[:train_size], features[train_size:]
    print('Training set size = ' + str(len(train_set)) + ' emails')
    print('Test set size = ' + str(len(test_set)) + ' emails')
    # train the classifier
    classifier = NaiveBayesClassifier.train(train_set)
    return train_set, test_set, classifier


def evaluate(train_set, test_set, classifier):
    # check how the classifier performs on the training and test sets
    print('Accuracy on the training set = ' + str(classify.accuracy(classifier, train_set)))
    print('Accuracy on the test set = ' + str(classify.accuracy(classifier, test_set)))
    # check which words are most informative for the classifier
    classifier.show_most_informative_features(20)


if __name__ == "__main__":
    # initialise the data
    spam = init_lists('enron1/spam/')
    ham = init_lists('enron1/ham/')
    all_emails = [(email, 'spam') for email in spam]
    all_emails += [(email, 'ham') for email in ham]
    random.shuffle(all_emails)
    print('Corpus size = ' + str(len(all_emails)) + ' emails')
    # extract the features (boolean features; pass 'bow' for counts)
    all_features = [(get_features(email, ''), label) for (email, label) in all_emails]
    print('Collected ' + str(len(all_features)) + ' feature sets')
    # train the classifier on 80% of the data
    train_set, test_set, classifier = train(all_features, 0.8)
    # evaluate its performance
    evaluate(train_set, test_set, classifier)
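A side note on `get_features`: the `setting` flag switches between count features (`'bow'`) and boolean presence features. A stripped-down, NLTK-free version of the same idea, with a hand-rolled whitespace tokenizer and a stand-in stoplist (both my simplifications, in place of `word_tokenize`, lemmatization, and `stopwords.words('english')`):

```python
from collections import Counter

STOPLIST = {'the', 'a', 'is', 'now'}  # stand-in for the real stopword list

def get_features(text, setting):
    # simplified tokenizer: lowercase and split on whitespace
    tokens = [w.lower() for w in text.split()]
    if setting == 'bow':
        # bag of words: word -> count, stopwords removed
        return {w: c for w, c in Counter(tokens).items() if w not in STOPLIST}
    # boolean features: word -> True, stopwords removed
    return {w: True for w in tokens if w not in STOPLIST}

print(get_features("call now call us", 'bow'))  # {'call': 2, 'us': 1}
print(get_features("call now call us", ''))     # {'call': True, 'us': True}
```

The solution above passes `''` as the setting, so it trains the Naive Bayes classifier on boolean presence features rather than counts.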



