Yesterday, I participated in the Refugees United Modding Day 2013 organised by Rewired State for Refugees United. I worked on a hack that tried to implement and algorithm to identify spam and eventually also identify spammers. Our presentation was OK but not fantastic as time was short. The short of it was that by the end of the day I wasn't convinced the algorithm as coded worked so today I decided to get some statistics out of it and tweak it a little bit.
The idea was to implement a Bayesian filtering algorithm to detect spam. This type of algorithms requires initial training using a body of existing messages comprising both spam and non-spam where you tell it which is which. By doing that, it will build a corpus of know good data and a corpus of known spam data. It will then use that stored information to score incoming messages. The resulting score is a probability between 0 and 1 of the message being spam (in practice the probability is between 0.01 and 0.99 because you can never be completely sure). After that, each time the algorithm gets its decision wrong, the user has the ability to correct it and it will learn from the correction. This is exactly the sort of algorithms that is implemented in popular email clients like Mozilla Thunderbird.
In order to train the filter, we had a file containing 5500 SMS messages, some spam, some ham. So in order to see how good the filter was, I decided to do the following:
- Train the filter on the first few hundred messages in the file;
- The score the remainder of the file and see for each score whether it correctly identified the message as ham or spam.
The main factor that influences how such a filter works is the way the message is broken into individual tokens. The simplest algorithm is to break the message into words. Some words will be used mainly in spam messages, others mainly in innocent messages and yet others in both. Using 5% of the file for training and the remainder for scoring and applying that simple tokenizing algorithm, I got the following statistics:
- Correct guesses: 72.03%
- Neutral score (doesn't know if it's spam or ham): 25.68%
- False positives (identified as spam when it is ham): 1.74%
- False negatives (identified as ham when it is spam): 0.55%
That's not bad for a basic algorithm! So I decided to apply three additional tokenizing techniques on top of the core one and see what would happen:
- The first technique consists in adding word pairs in addition to individual words. The logic behind this is that words like fantastic and offer may not be typical of spam on their own but the combination fantastic offer may be.
- The second technique consists in adding the lowercase version of every word to the list of tokens in an attempt to see if weird capitalisation would catch spammers out.
- The third one is a combination of the first two.
The results show that word pairs seem to be better than capitalisation but both together are even better.
However, this is not the whole story. The important metric is the bad stuff that ends up in the user's inbox as well as the good stuff that gets rejected. The former is composed of two parts: messages that the filter considered good when in fact they should have been caught as spam (false negatives) and messages that had a neutral score that are in fact spam (bad neutral). The latter is messages that the filter marked as spam when in fact they are good (false positives). Looking at those metrics, the result is not great.
False positives increase slightly with word pairs and quite significantly when introducing capitalisation variation. This may mean that a lot of legit users actually capitalise their messages like spammers. So not such a good algorithm after all.
One aspect of spam is that it is better to have false negatives than false positives. That is, it is better to have a bit of spam in the user's inbox but not block legit messages by mistake. So one typical way to do this is to introduce some bias into the algorithm to give a bit more weight to positive probabilities. Compared to the basic algorithm, it definitely cuts into false positives.
The Impact of Training
An interesting statistic is to see how much training the algorithm makes a difference. So I decided to compare the baseline where 5% of the file is used for training to a version where 10 of the file is used for training.
And of course, a well trained biased algorithm returns some decent numbers too when compared to the baseline.
The strength of statistical algorithms like Bayesian filtering is that it learns from the mistakes it does and it doesn't matter if spammers change their wording, the algorithm will adapt. This is demonstrated by the reasonably high rate of success reached by a very basic implementation of the technique with training on a few hundred messages.
In order to improve the efficiency of such an algorithm, there are a few options:
- Include message meta-data such as sender and recipient details, in addition to the message text: this would also be a first step to identify spammers in addition to spam, which was one of the aims of the hack.
- Use advanced tokenization techniques such as sparse binary polynomial hashing.
- Use Markovian discrimination rather than Bayesian filtering.
All those improvements would come at the cost of increased complexity and processing power requirements so may not all be practical.