I have a modified version of the SpamLookup plugin that is distributed with Movable Type 3.2. This is a single-file, drop-in, (mostly) backwards-compatible replacement. To install, get the spamlookup.pm file from the distribution and use it to replace the existing file in the plugins/spamlookup/lib directory.
SpamLookup Extension is now available for
- Movable Type 3.34 and 3.35: see here.
- Movable Type 4.0 (beta): distribution. This has the same release notes as the 3.34 / 3.35 version.
Additional Capabilities
The modification provides the ability to apply word filters to specific fields in comments and trackbacks, rather than to the conglomeration of all fields. It also includes a number of specialized checks that I have found useful, most of which I had previously put into a custom spam filter.
The new syntax for filter lines is
word ( fields ) weight
where
- word
- A space delimited word or a ‘/’ delimited regular expression.
- fields
- A space delimited set of field keywords. This is optional.
- weight
- A numeric value specifying the weight of this word. This is optional. A weight can be negative (with a leading ‘-’), which makes it a white listing instead of a blacklisting.
A word can be either a space delimited literal word or a ‘/’ delimited regular expression. In the latter case, the characters ‘-ismx’ are permitted directly following the trailing ‘/’. If present, these are passed to the Perl regular expression engine to control the interpretation of the regular expression.
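To make the syntax concrete, here is a rough sketch of how a filter line could be split into its three parts. It is written in Python for illustration only and is not the plugin's Perl code; the names `FilterRule` and `parse_filter_line` are hypothetical, and the choice to allow spaces inside a literal word is an assumption based on the 'Annoying Old Guy' rule shown later.

```python
import re
from typing import NamedTuple, Optional, Tuple


class FilterRule(NamedTuple):
    word: str                  # literal text, or the regex body without the slashes
    is_regex: bool
    flags: str                 # any of '-ismx' found after the closing '/'
    fields: Tuple[str, ...]    # empty when the line names no fields
    weight: Optional[float]    # None when the line gives no weight


def parse_filter_line(line: str) -> FilterRule:
    rest = line.strip()

    # Optional trailing weight, possibly negative.
    weight = None
    m = re.search(r"\s+(-?\d+(?:\.\d+)?)$", rest)
    if m:
        weight = float(m.group(1))
        rest = rest[:m.start()]

    # Optional "( fields )" group at the end of what remains.
    fields: Tuple[str, ...] = ()
    m = re.search(r"\s*\(([^()]*)\)$", rest)
    if m and m.start() > 0:
        fields = tuple(m.group(1).split())
        rest = rest[:m.start()]

    # '/'-delimited regular expression with optional flag characters.
    m = re.fullmatch(r"/(.*)/([-ismx]*)", rest)
    if m:
        return FilterRule(m.group(1), True, m.group(2), fields, weight)
    return FilterRule(rest, False, "", fields, weight)
```

For example, `parse_filter_line("poker (url email)")` yields the literal word `poker` with the fields `url` and `email` and no explicit weight.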
The valid field keywords are
| FIELD | MEANING | DISPLAY |
| name | Name for a comment author | Commentor |
| email | Email address for a comment author | |
| home | Homepage URL for a comment author | URL |
| content | Text content of a comment | (text box) |
| blog | Weblog name for trackback source | Source Site |
| title | Title for the source post of the trackback | Source Title |
| source | Source URL for a trackback | URL |
| excerpt | Trackback excerpt | (text box) |
| url | Equivalent to home or source | |
| text | Equivalent to content or excerpt | |
| all | All fields together, separated by newlines | |
If no fields are specified, “all” is used, thereby preserving (almost) the original behavior and syntax. The url and text fields are provided to make scanning equivalent fields for comments and trackbacks easy. If multiple fields are specified, they are all scanned in order. A match on any field counts as a match.
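The field rules above can be sketched roughly as follows. This is a Python illustration, not the plugin's code: `texts_to_scan` and `rule_matches` are hypothetical names, and a comment or trackback is assumed to be a plain dictionary holding only its own fields.

```python
import re

# 'url' and 'text' stand for whichever of the two underlying fields exists.
ALIASES = {"url": ("home", "source"), "text": ("content", "excerpt")}


def texts_to_scan(item, keywords):
    """Return the texts named by the keywords, in order; 'all' if none given."""
    texts = []
    for keyword in (keywords or ("all",)):
        if keyword == "all":
            # All fields together, separated by newlines.
            texts.append("\n".join(item.values()))
        else:
            for name in ALIASES.get(keyword, (keyword,)):
                if name in item:
                    texts.append(item[name])
    return texts


def rule_matches(pattern, item, keywords):
    """A match on any of the selected fields counts as a match."""
    return any(pattern.search(text) for text in texts_to_scan(item, keywords))


comment = {
    "name": "Bob",
    "email": "bob@poker.example",
    "home": "http://example.com/",
    "content": "See you at game night",
}
trackback = {"blog": "b", "title": "t", "source": "http://x", "excerpt": ""}
```

With this sketch, `poker (url email)` hits the comment through its email address, and `/^$/ (excerpt)` hits the trackback, while the same `/^$/` pattern applied to the combined text does not, which is why empty-field checks need per-field scanning.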
Examples
Most of these are from my customized spam filter, which can now be simplified. All of them are things I have seen in comment and trackback spam.
- Empty trackback excerpt.
- /^$/ (excerpt)
- Double dash in domain.
- -- (url email)
- Comment is the single word 'Hi.'
- /^Hi\.$/ (content)
- No source URL for trackback
- /^$/ (source)
- The word 'poker' in the email address or homepage
- poker (url email)
- Homepage looks like an archive, not a base URL
- /[[:digit:]]{3,}\.(?:html|htm|shtml|php)$/ (home)
- Email address is purely numeric
- /^[[:digit:]]+@/ (email)
Note that some of these are simply not possible in the original release (particularly the empty field checks). Others are possible but too broad to be practical (such as checking for archive-like links as a home page, or double dashes).
One feature that has come in handy is the ability to search for phrases at the beginning of a comment, which detects spammers with much less impact on legitimate commentors. For instance, one spammer starts his comments with ‘Hello, Admin!’. This can be detected with
/^Hello, Admin!/ (text)
The ‘^’ means “match only at the start” when inside a regular expression (denoted by the ‘/’ characters). The other handy character is ‘$’, which means “match only at the end”. Together with ‘^’ this means you can match on phrases that are the entire comment, not just part of it. E.g., if a spammer just puts the word ‘Hi.’ in his comments, you can catch that with
/^Hi\.$/ (text)
Note that we have to put a backslash in front of the ‘.’ because otherwise that means “match any single character”. With this, someone who writes ‘Hi.’ at the start won’t be matched if he writes any additional text.
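The effect of ‘^’, ‘$’ and the escaped ‘.’ can be checked with a few lines of Python, whose regular expression syntax agrees with Perl's for these simple patterns. The helper name `hits` is made up for this sketch.

```python
import re

whole_comment = re.compile(r"^Hi\.$")        # the entire text must be 'Hi.'
starts_with = re.compile(r"^Hello, Admin!")  # only the beginning is checked
unescaped = re.compile(r"^Hi.$")             # '.' left unescaped on purpose


def hits(pattern, text):
    """True when the pattern matches somewhere in the text."""
    return pattern.search(text) is not None
```

Here `hits(whole_comment, "Hi. Great post!")` is false, while the unescaped variant also accepts a three-letter text such as 'Hip'.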
One common request is to ban a specific email address. This is now trivial. To ban the email address “neo@hotmail.com” you can do
neo@hotmail.com (email)
If you want to detect the word “poker” in email address, commentor name, or home page URL, you can do
poker (email url name)
and your commentors can still use the word in the text without impediment.
The weight of a filter can be negative, which causes it to be a white listing. If I wanted to give myself a bypass, I could use the rule
Annoying Old Guy (name) -10
which would subtract 10 from the junk score of all of my comments so that my comment would pass even if it hit several filters. This could easily be used to give bypasses to other commentors by name, email, or home page URL. One could even use this to create a magic bypass word, say ‘ciscomyyahoo’. Put in the rule
ciscomyyahoo (content) -10
and then when you want a bypass, put the magic word in an HTML comment, e.g. <!-- ciscomyyahoo --> in your comment and you’re through the filters, yet other people don’t see the magic word.
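The way a negative weight offsets other hits can be sketched as a simple sum. This is an illustrative Python model, not the plugin's scoring code: the rule list, the weights of 1 for the two spam words, and the name `junk_score` are all assumptions.

```python
import re

# (pattern, field, weight); the weights of 1 are illustrative assumptions.
RULES = [
    (re.compile(r"poker"), "content", 1),
    (re.compile(r"casino"), "content", 1),
    (re.compile(r"ciscomyyahoo"), "content", -10),  # the magic bypass word
]


def junk_score(item, rules=RULES):
    """Sum the weights of every rule whose pattern matches its field."""
    return sum(
        weight
        for pattern, field, weight in rules
        if pattern.search(item.get(field, ""))
    )
```

A comment that trips both spam words scores 2, and the same comment with the hidden bypass word scores -8.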
My current working set of filters can be found here, which has numerous other examples.
Differences
There are a few minor functional differences.
Bug fixes
The original distribution of SpamLookup had bugs in its handling of decoded[1] text. For junk filters, the decoded version of the text was not scanned if the raw version did not have any filter hits. For moderation filters, the decoded text was always scanned, regardless of hits in the raw text (making the raw text scan pointless).
The new version now applies filters per field first to the raw text for the field, and then, if the raw text did not have a match, the decoded version (if different) is scanned. A match in either case counts as a match for the filter.
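The raw-then-decoded order can be sketched as below. This is a Python illustration under the assumption that "decoded" means HTML entities replaced by their characters; `html.unescape` stands in for whatever decoding the plugin actually performs, and `field_matches` is a hypothetical name.

```python
import html
import re


def field_matches(pattern, raw):
    """Scan the raw text first, then the decoded text if it differs."""
    if pattern.search(raw):
        return True
    decoded = html.unescape(raw)  # e.g. 'p&#111;ker' becomes 'poker'
    return decoded != raw and pattern.search(decoded) is not None
```

A spammer who writes 'p&#111;ker' to dodge a literal 'poker' filter is caught on the second pass.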
The processing of standard words (not regular expressions) is improved. Previously, words that started with non-word characters generally failed to match, because an expression like ‘\b<’ matches only when the ‘<’ is immediately preceded by a word character. These cases are now handled correctly.
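One way to handle such words is to add a word boundary only on a side where the literal itself begins or ends with a word character. The Python sketch below shows that idea; it is an assumption about the approach, not a copy of the plugin's logic, and `literal_to_regex` is a made-up name.

```python
import re


def literal_to_regex(word):
    """Wrap a literal word in boundaries only where a boundary can exist."""
    start = r"\b" if re.match(r"\w", word) else ""
    end = r"\b" if re.search(r"\w$", word) else ""
    return re.compile(start + re.escape(word) + end)


# The naive form: always surround the literal with \b.
naive = re.compile(r"\b" + re.escape("<b>") + r"\b")
fixed = literal_to_regex("<b>")
```

Against the text 'see <b>bold</b>', `naive` finds nothing because there is no boundary between the space and ‘<’, while `fixed` matches; an ordinary word such as 'poker' still gets both boundaries.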
The base implementation has a problem where backreferences do not work correctly: the regular expression is compiled before the outermost parentheses are added, which leads to various problems. This does not occur in this version, and the suggested workaround won’t break. The only thing to keep in mind is that all backreferences need to be incremented by one before being used.
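The "increment by one" rule follows from the extra outer group, as this small Python sketch shows. It assumes the filter body ends up inside exactly one additional capturing group; `wrap` is a hypothetical helper, not a plugin function.

```python
import re


def wrap(pattern):
    # Assumed behaviour: the filter body sits inside one extra capturing group.
    return "(" + pattern + ")"


# Intended meaning: the same word character twice in a row.
# The user's own first group is now group 2, so the rule is written with \2.
doubled = re.compile(wrap(r"(\w)\2"))
match = doubled.search("book")
```

Here group 1 is the whole match ('oo') and group 2 is the user's own group ('o'); an unshifted \1 would point at the outer group instead.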
Undocumented feature
The original release, when combining the fields into the combined text, would prefix each field with the string “field:” where field was one of ‘name’, ‘email’, ‘url’, ‘text’ for comments and one of ‘blog’, ‘title’, ‘url’, ‘text’ for trackbacks. Presumably this was done to provide a semblance of the functionality of my modification (at least, one could check if a specific field began with a specific filter word), but as there are no comments in the code there and no documentation, this is only speculation. In any case, I could see no use for it at all in this modification, so it was removed.
Log enhancement
If a match occurs for a specific field other than ‘all’, the name of the field is added to the pattern in the log message on the comment / trackback.
For each word filter match, the specific score for that match is logged.
The patterns and matched text are HTML encoded before being placed into the log so that pattern or match text that contains HTML code does not disturb the log display. In addition, literal question marks ‘?’ are transformed into entities because they mess up the log display (although I have not had the time to track down why that is).
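The encoding step amounts to something like the following Python sketch. The function name `encode_for_log` and the choice of the numeric entity `&#63;` for the question mark are assumptions; the post does not say which entity the plugin emits.

```python
import html


def encode_for_log(text):
    """HTML-encode text for the log, then turn '?' into an entity."""
    # Escape first so the '&' of the question-mark entity is not re-encoded.
    return html.escape(text).replace("?", "&#63;")
```

For instance, a pattern such as `<a href=?>` is logged as `&lt;a href=&#63;&gt;`.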
MT 3.3
The MT 3.34 replacement lib is available here. This is currently in use on multiple live weblogs.
The new version has a bit more tuning, although it’s unlikely to be noticeable in the real world. I have added more comments in the code for the curious and tweaked the log messages to be a bit clearer.
The only known potential issue is with Japanese and word boundaries. As best as I understood the code, the special checks for that should still work correctly. The original checks were removed, as they should be handled correctly by my bug fix for broken word boundaries in the original version.
[1] “Decoded” means entities decoded to their textual equivalents.
Trackback URL: http://blog.thought-mesh.net/solidwallofcode/movable_type/spamlookup_exte.php/ping
From daily babble: uh oh on Monday, 03 April 2006 at 09:13
From deaddooor.net ファッションモデル大集合: 田中美保のCM 画像 鈴木えみ on Thursday, 01 June 2006 at 00:06
This truly is an excellent hack.
Obviously it doesn’t work for MT 3.3. Have you tested it out on the new version of Spam Lookup in MT 3.3? Will you be updating this hack? Just curious…
I plan to update. Since I got booted off the ProNet mailing list I haven’t kept up, so I didn’t realize that MT 3.3 was out. I’ll try to take a look in the near future.
Bummer about getting booted. That must have just happened because I still see messages from you in May.
Thanks again, AOG, for all the work. I had the same issue with ProNet zealously booting me and thus missing all the 3.3 announcements… so much for my plugins.
I got booted off the mailing list as well. How can I get back on?
I just waited a bit and re-applied. I think their mailing list software is over eager in culling dead addresses.
AOG- Have you checked this with the most recent release of MT (3.34)?
Not yet. I will try to do that this week.
AOG- Does the 3.34 version need to be modified for 3.35?
Jack, yes it does. It really should be in MT4…
Kevin;
It doesn’t look to me like it should have to have any changes. The only change in the base version from 3.34 to 3.35 is one line, the checkout details from the source control system. Otherwise the files are byte for byte identical.
Jack;
Did I miss this and fail to respond when you originally commented? If so, sorry! Much sorry!