Your task this week is to write a very simple spam classifier in Python. It will classify messages as either SPAM (unwanted) or HAM (wanted). You will define a spam threshold, which reflects the allowed fraction of spam words in the message. You'll compute a 'spam indicator', which is the ratio of spam words to the total number of unique words in the message, rounded to two decimals. If the spam indicator exceeds the spam threshold, the message is classified as spam; otherwise it is classified as ham. We'll assume that the spam threshold is a constant with a value of 0.10. Your program will prompt the user for a message and then print the corresponding classification. The program will be case insensitive: spam words are detected whether they are in lower case, upper case, or mixed case. For simplicity, we'll ignore punctuation.
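The computation described above can be sketched as follows. This is only a minimal sketch under stated assumptions, not the instructor's reference solution: it reads "ignore punctuation" as splitting on whitespace only, it counts each spam word once (unique spam words over unique words), and it treats an empty message as having an indicator of 0. The helper name `label_for` is hypothetical; the spam word list is the one from the assignment template.

```python
SPAM_THRESHOLD = 0.10  # constant given in the assignment

# Spam words copied from the assignment template.
SPAM_WORDS = {
    'opportunity', 'inheritance', 'money', 'rich', 'dictator', 'discount',
    'save', 'free', 'offer', 'credit', 'loan', 'winner', 'warranty',
    'lifetime', 'medicine', 'claim', 'now', 'urgent', 'expire', 'top',
    'plan', 'prize', 'congratulations', 'help', 'widow',
}


def spam_indicator(text):
    """Return (unique spam words) / (unique words), rounded to two decimals."""
    # Lower-casing makes the match case insensitive; the set keeps unique words.
    unique_words = set(text.lower().split())
    if not unique_words:
        return 0.0  # assumption: an empty message has indicator 0
    spam_count = len(unique_words & SPAM_WORDS)
    return round(spam_count / len(unique_words), 2)


def label_for(indicator):
    """Hypothetical helper: SPAM only if the indicator exceeds the threshold."""
    return 'SPAM' if indicator > SPAM_THRESHOLD else 'HAM'


message = 'Do not miss out on this once in a lifetime OPPORTUNITY call NOW'
indicator = spam_indicator(message)
print('SPAM indicator:', indicator)               # 3 spam words / 13 unique words
print('This message is:', label_for(indicator))
```

Because the comparison is a strict `>`, a message whose indicator is exactly 0.10 is classified as HAM.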

Testing: Make sure that you test your solution before you submit it. Here are a few test cases with the expected output. Feel free to add your own.

Test case 1 - classify message correctly as SPAM - Make sure the SPAM indicator is correct
Please enter your message: The widow of a deposed dictator wants your help in getting his money out of the country
SPAM indicator: 0.27
This message is: SPAM

Test case 2 - classify message correctly as HAM
Please enter your message: I got a new job offer today. It looks good. Are you free for lunch tomorrow? We can meet downtown at noon.
SPAM indicator: 0.09
This message is: HAM

Test case 3 - classify message correctly regardless of the case
Please enter your message: Do not miss out on this once in a lifetime OPPORTUNITY call NOW
SPAM indicator: 0.23
This message is: SPAM

Test case 4 - classify message correctly based on the number of unique words
Please enter your message: It is urgent that you call us immediately yada yada yada yada yada yada
SPAM indicator: 0.11
This message is: SPAM

Test case 5 - a message with a SPAM indicator of 0.1 is classified as HAM
Please enter your message: Congratulations on your new job! I hope you like it.
SPAM indicator: 0.1
This message is: HAM
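One way to check these cases without retyping them is a small table-driven test. The sketch below is illustrative only: `spam_indicator` is a compact whitespace-splitting implementation of the rule the assignment describes, while `label_for`, `CASES`, and `run_cases` are hypothetical names introduced here. The expected HAM label for test case 5 is taken from that case's title.

```python
SPAM_THRESHOLD = 0.10
SPAM_WORDS = {
    'opportunity', 'inheritance', 'money', 'rich', 'dictator', 'discount',
    'save', 'free', 'offer', 'credit', 'loan', 'winner', 'warranty',
    'lifetime', 'medicine', 'claim', 'now', 'urgent', 'expire', 'top',
    'plan', 'prize', 'congratulations', 'help', 'widow',
}


def spam_indicator(text):
    """Unique spam words divided by unique words, rounded to two decimals."""
    words = set(text.lower().split())
    return round(len(words & SPAM_WORDS) / len(words), 2) if words else 0.0


def label_for(indicator):
    """Return the label instead of printing it, so it can be compared."""
    return 'SPAM' if indicator > SPAM_THRESHOLD else 'HAM'


# (message, expected indicator, expected label), copied from the test cases.
CASES = [
    ('The widow of a deposed dictator wants your help in getting his money '
     'out of the country', 0.27, 'SPAM'),
    ('I got a new job offer today. It looks good. Are you free for lunch '
     'tomorrow? We can meet downtown at noon.', 0.09, 'HAM'),
    ('Do not miss out on this once in a lifetime OPPORTUNITY call NOW',
     0.23, 'SPAM'),
    ('It is urgent that you call us immediately yada yada yada yada yada yada',
     0.11, 'SPAM'),
    ('Congratulations on your new job! I hope you like it.', 0.1, 'HAM'),
]


def run_cases():
    """Return the cases whose computed output differs from the expected one."""
    failures = []
    for message, want_indicator, want_label in CASES:
        got = spam_indicator(message)
        if (got, label_for(got)) != (want_indicator, want_label):
            failures.append((message, got, label_for(got)))
    return failures


failures = run_cases()
print('all cases pass' if not failures else failures)
```

Test case 4 is the one that depends on counting unique words: the six repetitions of "yada" count once, giving 1 spam word out of 9 unique words.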

This is the template our instructor gave us to fill in:

"""
Enter your module docstring with a one-line overview here
and a more detailed description here.
"""
SPAM_WORDS = {'opportunity', 'inheritance', 'money', 'rich', 'dictator',
              'discount', 'save', 'free', 'offer', 'credit', 'loan',
              'winner', 'warranty', 'lifetime', 'medicine', 'claim', 'now',
              'urgent', 'expire', 'top', 'plan', 'prize', 'congratulations',
              'help', 'widow'}

def spam_indicator(text):
    """
    Enter your function docstring here
    """
    # this function returns the spam indicator rounded to two decimals

def classify(indicator):
    """
    Enter your function docstring here
    """
    # this function prints the spam classification

def get_input():
    """
    Enter your function docstring here
    """
    # prompt the user for input and return the input

def main():
    # get the user input and save it in a variable
    # Call spam_indicator to compute the spam indicator
    # Print the spam_indicator
    # Call classify to print the classification

if __name__ == '__main__':
    main()

Solution

# Requires the NLTK tokenizer, stopwords, and WordNet data (see nltk.download()).
import os
import random
from collections import Counter

import nltk
from nltk import word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier, classify

# a set makes the stop-word membership tests fast
stoplist = set(stopwords.words('english'))

def init_lists(folder):
    # read every file in the folder into a list of strings
    a_list = []
    for a_file in os.listdir(folder):
        # the with-block closes each file; undecodable bytes are skipped
        with open(os.path.join(folder, a_file), 'r', encoding='utf-8', errors='ignore') as f:
            a_list.append(f.read())
    return a_list

def preprocess(sentence):
    # lower-case, tokenize, and lemmatize the text
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(sentence)]

def get_features(text, setting):
    if setting == 'bow':
        # bag of words: map each non-stop word to its count
        return {word: count for word, count in Counter(preprocess(text)).items() if word not in stoplist}
    else:
        # presence features: map each non-stop word to True
        return {word: True for word in preprocess(text) if word not in stoplist}

def train(features, samples_proportion):
    train_size = int(len(features) * samples_proportion)
    # initialise the training and test sets
    train_set, test_set = features[:train_size], features[train_size:]
    print('Training set size = ' + str(len(train_set)) + ' emails')
    print('Test set size = ' + str(len(test_set)) + ' emails')
    # train the classifier
    classifier = NaiveBayesClassifier.train(train_set)
    return train_set, test_set, classifier

def evaluate(train_set, test_set, classifier):
    # check how the classifier performs on the training and test sets
    print('Accuracy on the training set = ' + str(classify.accuracy(classifier, train_set)))
    print('Accuracy on the test set = ' + str(classify.accuracy(classifier, test_set)))
    # check which words are most informative for the classifier
    classifier.show_most_informative_features(20)

if __name__ == "__main__":
    # initialise the data
    spam = init_lists('enron1/spam/')
    ham = init_lists('enron1/ham/')
    all_emails = [(email, 'spam') for email in spam]
    all_emails += [(email, 'ham') for email in ham]
    random.shuffle(all_emails)
    print('Corpus size = ' + str(len(all_emails)) + ' emails')
    # extract the features (any setting other than 'bow' selects presence features)
    all_features = [(get_features(email, ''), label) for (email, label) in all_emails]
    print('Collected ' + str(len(all_features)) + ' feature sets')
    # train the classifier on 80% of the data
    train_set, test_set, classifier = train(all_features, 0.8)
    # evaluate its performance
    evaluate(train_set, test_set, classifier)
