Simple Spam Filter Using Bag-of-Words

This is a script that trains a Multinomial Naive Bayes model to detect spam mails. The script can be executed as follows:

To train a model and save it to a file run

script.py learn path/to/spam path/to/nospam path/to/model

To evaluate the model using 10-fold cross-validation (using cross_val_score) and save the fitted model run

script.py cross path/to/spam path/to/nospam path/to/model

To find the best set of parameters using a grid search (using GridSearchCV) and save the best model run

script.py grid path/to/spam path/to/nospam path/to/model

To classify the mails in a directory with a previously saved model run

script.py classify path/to/model path/to/mails path/to/output

The grid search does a limited sweep over the vocabulary size (no limit, 1000000) and the document-frequency cut-off used for detecting stop words (1.0, 0.99, 0.98, 0.95). The latter parameter drops a word from the vocabulary if it occurs in more than the given fraction of the documents.

Spam Detection

The Problem

Classifying e-mails into spam and no-spam is not an easy task because the distribution of labels is very imbalanced. Often well over 95% of the labels are spam mails. This makes training a classifier difficult because the naive prediction (always predicting spam) already achieves 95% accuracy. Of course, this solution is useless, as all relevant mails would be filtered out as well.

There are several strategies to overcome this problem:

  • oversampling the minority class to get a 50/50 distribution of labels
  • undersampling the majority class to get a 50/50 distribution of labels
  • generating new samples in the minority class that are close to the existing samples in the feature space

In this script I didn’t employ any of these techniques as the dataset was already balanced.
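
None of these is needed here, but for reference the first strategy, randomly oversampling the minority class, fits in a few lines. This is only a sketch with made-up arrays (the variable names are illustrative), using sklearn.utils.resample:

from numpy import array, concatenate
from sklearn.utils import resample

# toy imbalanced data: many spam bodies, few no-spam bodies
spam = array(['spam mail %d' % i for i in range(8)])
nospam = array(['normal mail %d' % i for i in range(2)])

# draw from the minority class with replacement until both classes match
nospam_oversampled = resample(nospam, replace=True, n_samples=len(spam), random_state=0)

bodies = concatenate([spam, nospam_oversampled])
labels = ['SPAM'] * len(spam) + ['NOSPAM'] * len(nospam_oversampled)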

Simple NLP Approach

Many features from the mails can be used to classify spam. For example, one can use meta information from the mail header such as time, IP address, sender, and so forth. I'm sure sophisticated spam filter systems use this kind of information, but modern spam filters rely heavily on natural language processing of the mail body for classification. This is the approach demonstrated here.

Bag of Words

In this script I use the approach of modelling the natural language as a bag-of-words. That is, for every mail we have a long vector with the length of the vocabulary, where each entry represents the number of times that word occurs in the mail. In scikit-learn this can be done with the CountVectorizer.
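
As a minimal sketch of what the CountVectorizer produces (the toy mails are made up purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer

toy_mails = ['win money now', 'meeting at noon', 'win a free meeting']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_mails)

# one column per vocabulary word, one row per mail; each entry is the
# number of times that word occurs in that mail
print(vectorizer.get_feature_names_out())
print(counts.toarray())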

Additionally, one approach that is often used in document classification and retrieval is the so-called TF-IDF statistic, which stands for term frequency–inverse document frequency. The term frequency of a term in a specific document is weighted by the inverse document frequency of that term over the whole corpus.

The idea is that some words have a high frequency in a document but also occur in a lot of documents in the corpus. They say less about the specific document compared to a word that is frequent in the document but doesn't occur in many other documents.
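
In its basic textbook form the statistic is simply the product of the two quantities, where tf(t, d) is the count of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t (scikit-learn's TfidfTransformer additionally smooths the IDF term and L2-normalizes the resulting vectors by default):

tfidf(t, d) = tf(t, d) * idf(t)    with    idf(t) = log(N / df(t))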

Here I use the TfidfTransformer from scikit-learn to generate the TF-IDF features out of the word-count matrix (the count of each vocabulary word for every document).
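
Chained after the count matrix this is only a couple of lines; a sketch continuing the toy_mails example from above:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy_mails = ['win money now', 'meeting at noon', 'win a free meeting']
counts = CountVectorizer().fit_transform(toy_mails)

# reweight the raw counts: words that appear in many documents are damped
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray())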

Naive Bayes

For classification I use a Naive Bayes classifier based on the multinomial distribution, which models the probability of counts. A perfect fit for word-count features! Specifically, I use the MultinomialNB model from scikit-learn.
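
Put together, a minimal version of the whole approach looks like the sketch below (toy data only; the actual script wraps the same steps in a scikit-learn Pipeline):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_mails = ['win money now', 'free money, click here',
               'meeting at noon', 'see you at lunch tomorrow']
train_labels = ['SPAM', 'SPAM', 'NOSPAM', 'NOSPAM']

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_mails), train_labels)

# classify an unseen mail
print(classifier.predict(vectorizer.transform(['free money if you reply now'])))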

import sys
import os
import pickle
import numpy
from pandas import DataFrame, concat
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# TODO: SVM Model
# TODO: Try Binary feature
# TODO: subject line feature

# TODO: Grid search for Binary vs TF vs TF*IDF


def load_from_folders(dirs):
    print(dirs)
    frames = [DataFrame({'body': [], 'label': []})]
    for path, label in dirs:
        mails, fnames = [], []
        # os.walk already descends into sub-directories, so every file
        # below `path` is visited exactly once here
        for root_dir, _dir_names, file_names in os.walk(path):
            for file_name in file_names:
                file_path = os.path.join(root_dir, file_name)
                if os.path.isfile(file_path):
                    # the mail body starts after the first blank line;
                    # everything before that is the header and is skipped
                    past_header, lines = False, []
                    with open(file_path, encoding="latin-1") as f:
                        for line in f:
                            if past_header:
                                lines.append(line)
                            elif line == '\n':
                                past_header = True
                    content = '\n'.join(lines)
                    mails.append({'body': content, 'label': label})
                    fnames.append(file_path)
        frames.append(DataFrame(mails, index=fnames))
    return concat(frames)

def create_dataframe(dirs):
    data = load_from_folders(dirs)
    #data.reset_index().drop_duplicates(subset='index').set_index('index')
    return data.reindex(numpy.random.permutation(data.index))

def write_prediction(prediction, file_name):
    with open(file_name, "w", encoding="latin-1") as f:
        # one "<file name>\t<label>" line per classified mail
        for file_path, label in zip(prediction.index.values, prediction['label']):
            f.write('{0}\t{1}\n'.format(os.path.basename(file_path), label))

def create_model():
    # HashingVectorizer ?
    return Pipeline([
            ('count_vectorizer', CountVectorizer(
                ngram_range=(1, 2), 
                strip_accents='unicode', 
                min_df=2, 
                max_df=0.90, 
                stop_words=None, 
                max_features=1000000,
                binary=True)),
            # ('idf_transformer', TfidfTransformer(
            # norm='l2', 
            # use_idf=True, 
            # smooth_idf=True, 
            # sublinear_tf=False)),
            ('classifier', MultinomialNB(
                alpha=0.001, 
                fit_prior=True, 
                class_prior=None))
        ])

def load_model(model_name):
    with open(model_name, 'rb') as f:
        model_attributes = pickle.load(f)
        pipeline = create_model()
        pipeline.named_steps['count_vectorizer'].vocabulary_ = model_attributes[0]
        pipeline.named_steps['count_vectorizer'].stop_words_ = None

        pipeline.named_steps['classifier'].class_count_ = model_attributes[1]
        pipeline.named_steps['classifier'].feature_count_ = model_attributes[2]
        # class prior: log of the fraction of training documents in each class
        pipeline.named_steps['classifier'].class_log_prior_ = numpy.log(
            model_attributes[1] / numpy.sum(model_attributes[1]))
        # feature log-probabilities, reconstructed with the standard smoothed
        # multinomial estimate: log((feature_count + alpha) / per-class total)
        alpha = pipeline.named_steps['classifier'].alpha
        smoothed_fc = model_attributes[2] + alpha
        pipeline.named_steps['classifier'].feature_log_prob_ = (
            numpy.log(smoothed_fc) -
            numpy.log(smoothed_fc.sum(axis=1).reshape(-1, 1)))
        pipeline.named_steps['classifier'].classes_ = model_attributes[3]

        return pipeline

def save_pipeline(pipeline, model_name):
    with open(model_name, 'wb') as f:
        pickle.dump([
            pipeline.named_steps['count_vectorizer'].vocabulary_, 
            pipeline.named_steps['classifier'].class_count_, 
            pipeline.named_steps['classifier'].feature_count_,
            pipeline.named_steps['classifier'].classes_], f)

if __name__ == "__main__":
    arguments = sys.argv

    if len(sys.argv) == 5:
        if arguments[1] == 'classify':
            # load model
            pipeline = load_model(arguments[2])

            # load mails from directory
            data = create_dataframe([(arguments[3], '')])

            # predict class labels
            data['label'] = pipeline.predict(data['body'])

            # output the result
            print('\nTotal emails classified:', len(data), 
                '\nvocab size:', len(pipeline.named_steps['count_vectorizer'].vocabulary_))

            write_prediction(data, arguments[4])

        elif arguments[1] == 'learn':
            # load training data
            data = create_dataframe([(arguments[2], 'SPAM'), (arguments[3], 'NOSPAM')])

            # create pipeline
            pipeline = create_model()

            # train classifier
            pipeline.fit(data['body'].values, data['label'].values.astype(str))

            # save the model
            save_pipeline(pipeline, arguments[4])

            with open('backup.model', 'wb') as f:
                pickle.dump(pipeline, f)

        elif arguments[1] == 'cross':
            # load training data
            data = create_dataframe([(arguments[2], 'SPAM'), (arguments[3], 'NOSPAM')])

            # create pipeline
            pipeline = create_model()

            # perform 10-fold crossvalidation
            scores = cross_val_score(pipeline, data['body'].values, data['label'].values.astype(str), cv=10, n_jobs=2, pre_dispatch=3)

            # train classifier
            pipeline.fit(data['body'].values, data['label'].values.astype(str))

            # output the result
            print('Total emails classified:', len(data), '\nvocab size:', len(pipeline.named_steps['count_vectorizer'].vocabulary_), 
                '\nstop words:', len(pipeline.named_steps['count_vectorizer'].stop_words_), '\n\n10-Fold-Cross-Validation:')
            for index, kscore in enumerate(scores):
                print('{0:3}: {1:.3f}  '.format(index+1, kscore), end='\n')
            print('----------\nAvg: {0:.3f}'.format(sum(scores)/len(scores)))

            # save the model
            save_pipeline(pipeline, arguments[4])

        elif arguments[1] == 'grid':
            # load training data
            data = create_dataframe([(arguments[2], 'SPAM'), (arguments[3], 'NOSPAM')])

            # create pipeline
            pipeline = create_model()

            # define parameters for grid
            param_grid = dict(count_vectorizer__max_features=[None, 1000000],
                              count_vectorizer__max_df=[1.0, 0.99, 0.98, 0.95])
            grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=0, cv=5, n_jobs=2, pre_dispatch=2)

            # start grid search
            grid_search.fit(data['body'].values, data['label'].values.astype(str))

            # print result
            print('Cross-Validation result:\n', grid_search.cv_results_)
            print('\nbest\n', grid_search.best_params_, '\nscore:', grid_search.best_score_)

            # save the best model found by the grid search; the original
            # pipeline object is never fitted here, GridSearchCV works on clones
            save_pipeline(grid_search.best_estimator_, arguments[4])

        else:
            print('Mode', arguments[1], 'not known.')
    else:
        print('Wrong number of command line arguments!')