I have a modified version of the SpamLookup plugin that is distributed with Movable Type 3.2. This is a single-file, drop-in, (mostly) backwards-compatible replacement. To install, get the spamlookup.pm file from the distribution and use it to replace the file of the same name in the plugins/spamlookup/lib directory.

SpamLookup Extension is now available for

Additional Capabilities

The modification provides the ability to apply word filters to specific fields in comments and trackbacks, rather than to the conglomeration of all fields. It also includes a number of specialized checks that I have found useful, most of which I had previously put into a custom spam filter.

The new syntax for filter lines is

word ( fields ) weight

where

word
A space delimited word or a ‘/’ delimited regular expression.
fields
A space delimited set of field keywords. This is optional.
weight
A numeric value specifying the weight of this word. This is optional. A weight can be negative (with a leading ‘-’), which makes the rule a whitelisting instead of a blacklisting.

A word can be either a space delimited literal word or a ‘/’ delimited regular expression. In the latter case, the characters ‘-ismx’ are permitted directly following the trailing ‘/’. If present, these are passed to the Perl regular expression engine to control the interpretation of the regular expression.
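To make the syntax concrete, here is a rough sketch of how a filter line could be split into its parts. It is written in Python purely for illustration (the plugin itself is Perl), and the names `FILTER_LINE`, `FilterRule` and `parse_filter_line` are my own inventions, not anything in spamlookup.pm. A missing weight is returned as `None` rather than guessing at the plugin's default.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Hypothetical grammar for "word ( fields ) weight"; not taken from the plugin.
FILTER_LINE = re.compile(
    r"""^\s*
        (?:/(?P<regex>.+)/(?P<flags>[-ismx]*)   # /pattern/ plus optional flags
          |(?P<literal>.+?))                    # or a literal word or phrase
        (?:\s*\(\s*(?P<fields>[^)]*?)\s*\))?    # optional ( field keywords )
        (?:\s+(?P<weight>-?\d+(?:\.\d+)?))?     # optional weight, may be negative
        \s*$""",
    re.VERBOSE,
)


class FilterRule(NamedTuple):
    word: str                 # literal text or regular expression body
    is_regex: bool
    flags: str                # any of "-ismx" found after the trailing "/"
    fields: Tuple[str, ...]   # field keywords; ("all",) when none are given
    weight: Optional[float]   # None when the line has no weight


def parse_filter_line(line: str) -> FilterRule:
    m = FILTER_LINE.match(line)
    if m is None:
        raise ValueError("not a filter line: %r" % line)
    is_regex = m.group("regex") is not None
    word = m.group("regex") if is_regex else m.group("literal")
    # No field list means "all", as described in the text.
    fields = tuple((m.group("fields") or "").split()) or ("all",)
    weight = m.group("weight")
    return FilterRule(word, is_regex, m.group("flags") or "", fields,
                      float(weight) if weight is not None else None)
```

For example, `parse_filter_line("Annoying Old Guy (name) -10")` yields the literal phrase, the single field `name`, and a weight of -10.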

The valid field keywords are

FIELD    MEANING                                     DISPLAY
name     Name for a comment author                   Commentor
email    Email address for a comment author          Email
home     Homepage URL for a comment author           URL
content  Text content of a comment                   (text box)
blog     Weblog name for trackback source            Source Site
title    Title for the source post of the trackback  Source Title
source   Source URL for a trackback                  URL
excerpt  Trackback excerpt                           (text box)
url      Equivalent to home or source
text     Equivalent to content or excerpt
all      All fields together, separated by newlines

If no fields are specified, “all” is used, thereby preserving (almost) the original behavior and syntax. The url and text fields are provided to make it easy to scan equivalent fields for comments and trackbacks. If multiple fields are specified, they are all scanned in order. A match on any field counts as a match.
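The field handling described above can be sketched roughly as follows. This is an illustration in Python, not the plugin's Perl code: a comment or trackback is modeled as a plain dict keyed by field keyword, and `field_text` and `rule_matches` are hypothetical names. I am also assuming that a field which does not exist on the item (say, excerpt on a comment) is simply skipped; the plugin may treat that case differently.

```python
import re

# Keywords that stand for "whichever of these the item has".
ALIASES = {"url": ("home", "source"), "text": ("content", "excerpt")}


def field_text(item, keyword):
    """Return the text to scan for one field keyword, or None if not applicable."""
    if keyword == "all":
        # All fields together, separated by newlines.
        return "\n".join(item.values())
    for name in ALIASES.get(keyword, (keyword,)):
        if name in item:
            return item[name]
    return None  # assumption: fields the item lacks are skipped


def rule_matches(pattern, fields, item):
    """Scan the listed fields in order; a match on any field is a match."""
    for keyword in fields:
        text = field_text(item, keyword)
        if text is not None and re.search(pattern, text):
            return True
    return False


comment = {"name": "Bob", "email": "bob@poker.example",
           "home": "http://example.com/", "content": "Nice post."}
trackback = {"blog": "Some Blog", "title": "A Post", "source": "", "excerpt": ""}
```

With these sample items, `rule_matches("poker", ("url", "email"), comment)` is true because of the email address, and `rule_matches(r"^$", ("source",), trackback)` is true because the source URL is empty.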

Examples

Most of these are from my customized spam filter, which can now be simplified. All of them are things I have seen in comment and trackback spam.

Empty trackback excerpt.
/^$/ (excerpt)
Double dash in domain.
-- (url email)
Comment is the single word 'Hi.'
/^Hi\.$/ (content)
No source URL for trackback
/^$/ (source)
The word 'poker' in the email address or homepage
poker (url email)
Homepage looks like an archive, not a base URL
/[[:digit:]]{3,}\.(?:html|htm|shtml|php)$/ (home)
Email address is purely numeric
/^[[:digit:]]+@/ (email)

Note that some of these are simply not possible in the original release (particularly the empty field checks). Others are possible but too broad to be practical (such as checking for archive-like links as a home page, or double dashes).

One feature that has come in handy is the ability to search for phrases at the beginning of a comment, which detects spammers with much less impact on legitimate commentors. For instance, one spammer starts his comments with ‘Hello, Admin!’. This can be detected with

/^Hello, Admin!/ (text)

The ‘^’ means “match only at the start” when inside a regular expression (denoted by the ‘/’ characters). The other handy character is ‘$’, which means “match only at the end”. Together with ‘^’ this means you can match on phrases that are the entire comment, not just part of it. E.g., if a spammer just puts the word ‘Hi.’ in his comments, you can catch that with

/^Hi\.$/ (text)

Note that we have to put a backslash in front of the ‘.’ because otherwise it means “match any single character”. With this rule, someone who writes ‘Hi.’ at the start won’t be matched if he writes any additional text.
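If you want to convince yourself how the anchors behave, a quick check like the one below does it. It uses Python's `re` module only because it is easy to run; for these simple patterns Perl behaves the same way. The sample comment texts are made up.

```python
import re

whole_comment = re.compile(r"^Hi\.$")        # the entire text must be "Hi."
starts_with = re.compile(r"^Hello, Admin!")  # only the start must match
unescaped = re.compile(r"^Hi.$")             # "." matches any single character

# Collect the results so they can be inspected or printed.
results = {
    "only Hi.": bool(whole_comment.search("Hi.")),
    "Hi. plus more text": bool(whole_comment.search("Hi. Great article!")),
    "unescaped dot on Hix": bool(unescaped.search("Hix")),
    "escaped dot on Hix": bool(whole_comment.search("Hix")),
    "greeting at start": bool(starts_with.search("Hello, Admin! Buy now")),
    "greeting in middle": bool(starts_with.search("He wrote Hello, Admin!")),
}
```

Only "only Hi.", "unescaped dot on Hix" and "greeting at start" come out true, which is exactly the behavior described above.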

One common request is to ban a specific email address. This is now trivial. To ban the email address “neo@hotmail.com” you can do

neo@hotmail.com (email)

If you want to detect the word “poker” in the email address, commentor name, or home page URL, you can do

poker (email url name)

and your commentors can still use the word in the text without impediment.

The weight of a filter can be negative, which makes it a whitelisting. If I wanted to give myself a bypass, I could use the rule

Annoying Old Guy (name) -10

which would subtract 10 from the junk score of all of my comments so that my comment would pass even if it hit several filters. This could easily be used to give bypasses to other commentors by name, email, or home page URL. One could even use this to create a magic bypass word, say ‘ciscomyyahoo’. Put in the rule

ciscomyyahoo (content) -10

and then when you want a bypass, put the magic word in an HTML comment, e.g. <!-- ciscomyyahoo --> in your comment, and you’re through the filters, yet other people don’t see the magic word.
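The arithmetic behind a bypass can be sketched like this. The sketch assumes that matching weights are simply added up into one junk score and that the two ordinary rules carry a weight of 1; both are simplifications of mine for illustration, and `junk_score` is a hypothetical helper, not a function in the plugin or in Movable Type.

```python
import re


def junk_score(rules, item):
    """Add up the weights of every rule whose pattern matches its field."""
    return sum(weight for pattern, field, weight in rules
               if re.search(pattern, item.get(field, "")))


# (pattern, field, weight); the weights of 1 are assumed for the example.
rules = [
    (r"poker", "content", 1),
    (r"^Hello, Admin!", "content", 1),
    (r"ciscomyyahoo", "content", -10),   # the magic bypass word
]

spam = {"content": "Hello, Admin! Play poker here"}
bypassed = {"content": "Hello, Admin! Play poker here <!-- ciscomyyahoo -->"}
```

Here `junk_score(rules, spam)` adds up to 2, while the same text with the magic word comes to -8, which is why one large negative weight can outvote several ordinary hits.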

My current working set of filters can be found here, which has numerous other examples.

Differences

There are a few minor functional differences.

Bug fixes

The original distribution of SpamLookup had bugs in its handling of decoded [1] text. For junk filters, the decoded version of the text was not scanned if the raw version had no filter hits. For moderation filters, the decoded text was always scanned, regardless of hits in the raw text (making the raw text scan pointless).

The new version now applies filters per field first to the raw text for the field, and then, if the raw text did not have a match, the decoded version (if different) is scanned. A match in either case counts as a match for the filter.
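The raw-then-decoded order can be sketched as below. Python's `html.unescape` stands in for whatever entity decoding the plugin actually uses, and `field_matches` is a made-up name, so treat this as an outline of the logic rather than the real code.

```python
import html
import re


def field_matches(pattern, raw):
    """Scan the raw field text first, then the entity-decoded text if needed."""
    if re.search(pattern, raw):
        return True
    decoded = html.unescape(raw)
    # Only rescan when decoding actually changed something.
    return decoded != raw and re.search(pattern, decoded) is not None
```

A spammer who writes `p&#111;ker` to hide the word from the raw scan is still caught, because the decoded text reads `poker`.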

The processing of standard words (not regular expressions) is improved. Previously, words that started with a non-word character would usually fail to match, because an expression like ‘\b<’ can match only when a word character immediately precedes the ‘<’. These cases are now handled correctly.
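One way to avoid the boundary problem is to add ‘\b’ only on the sides of the word that actually begin or end with a word character. The sketch below shows that idea in Python; it is my illustration of the general technique and not necessarily how the plugin does it, and `literal_to_regex` is a hypothetical name.

```python
import re


def literal_to_regex(word):
    """Escape a literal word, anchoring with \\b only next to word characters."""
    prefix = r"\b" if re.match(r"\w", word) else ""
    suffix = r"\b" if re.search(r"\w$", word) else ""
    return prefix + re.escape(word) + suffix


# The naive form always adds \b, which fails for a word starting with "<"
# when the "<" follows a space or the start of the text.
naive = r"\b" + re.escape("<b>") + r"\b"
naive_hit = re.search(naive, "see <b>this") is not None
fixed_hit = re.search(literal_to_regex("<b>"), "see <b>this") is not None
```

With this, `naive_hit` is false while `fixed_hit` is true, and ordinary words such as poker still match only as whole words.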

The base implementation has a problem where backreferences do not work correctly: the regular expression is compiled before the outermost parentheses are added, which leads to various problems. This does not occur in this version, and the suggested workaround won’t break. The only thing to keep in mind is that all backreferences need to be incremented by one before being used.
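The reason for the increment is that groups are numbered by their opening parenthesis, so wrapping a pattern in one more pair of parentheses pushes every existing group up by one. A small Python demonstration (Perl numbers groups the same way); the doubled-letter pattern is just an example I picked:

```python
import re

# Standalone pattern: group 1 is the letter, \1 repeats it.
standalone = re.compile(r"(\w)\1")

# Same pattern inside outer parentheses: the letter is now group 2,
# so the backreference has to be written as \2.
wrapped = re.compile(r"((\w)\2)")

standalone_hit = standalone.search("book").group(0)
wrapped_hit = wrapped.search("book").group(1)
```

Both find the doubled letter in "book", but only because the wrapped version uses \2 instead of \1.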

Undocumented feature

The original release, when combining the fields into the combined text, would prefix each field with the string “field:”, where field was one of ‘name’, ‘email’, ‘url’, ‘text’ for comments and one of ‘blog’, ‘title’, ‘url’, ‘text’ for trackbacks. Presumably this was done to provide a semblance of the functionality of my modification (at least, one could check if a specific field began with a specific filter word), but as there are no comments in the code there and no documentation, this is only speculation. In any case, I could see no use for it at all in this modification, so it was removed.

Log enhancement

If a match occurs for a specific field other than ‘all’, the name of the field is added to the pattern in the log message on the comment / trackback.

For each word filter match, the specific score for that match is logged.

The patterns and matched text are HTML encoded before being placed into the log so that a pattern or match text that contains HTML code does not disturb the log display. In addition, literal question marks ‘?’ are transformed into entities because they mess up the log display (although I have not had the time to track down why that is).
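The encoding step amounts to something like the following sketch. It uses Python's `html.escape` as a stand-in for the plugin's own HTML encoding, and the numeric entity `&#63;` for the question mark is my assumption; the post does not say which entity the plugin emits. `encode_for_log` is a hypothetical name.

```python
import html


def encode_for_log(text):
    """HTML-encode text for the log, then turn "?" into a numeric entity."""
    # Escape first, so the "&" of the inserted entity is not escaped again.
    return html.escape(text).replace("?", "&#63;")
```

For instance, the pattern `<b>(?:spam)</b>` is logged as `&lt;b&gt;(&#63;:spam)&lt;/b&gt;`, which displays as the original text instead of being interpreted as markup.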

MT 3.3

The MT 3.34 replacement lib is available here. This is currently in use on multiple live weblogs.

The new version has a bit more tuning, although it’s unlikely to be noticeable in the real world. I have added more comments in the code for the curious and tweaked the log messages to be a bit clearer.

The only known potential issue is with Japanese and word boundaries. As best as I understood the code, the special checks for that should still work correctly. The original checks were removed, as they should be handled correctly by my bug fix for broken word boundaries in the original version.


[1] “Decoded” means entities decoded to their textual equivalents.