Training bogofilter

2025-04-16

I've been running a MovableType blog for quite some time now, and have added numerous hacks over the years (months?) to fend off comment spam. I have a lengthy series of regexps and a few other checks that reject most spam, but small amounts still get through, which I clean up by hand. (As a side effect, I have a database of about 5,500 'ham' comments, and, since I started logging the comments I was rejecting, I now have about 40,000 spams.)

I'd like to play with feeding comments through bogofilter and letting it decide whether each one is spam. But I'm new to the whole Bayesian concept, so a few questions:

- How often do you train it? Obviously there's the initial training, but are you supposed to 'train' it on stuff it filters correctly? (This could be the difference between a 95% certainty and a 100% certainty?)

- I have about 10 people who account for 99% of the (legitimate) comments. Is whitelisting/blacklisting before bogofilter advised (it would cut down on CPU), or is it a bad idea (bogofilter would then only ever see the uncertain stuff)?

- I'm planning on feeding the given name, e-mail, URL, and text through. What are your thoughts on using an IP? On one hand, it should make things more clear-cut: it's not like the IP changes every 30 minutes. On the other hand, over time, the IPs it saw would change.
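
To make the plan concrete, here is a rough sketch of what I have in mind, not something I've run in production. It wraps a comment in a fake e-mail message and pipes it to the bogofilter command. The `X-Comment-*` header names and the function names are my own invention; bogofilter doesn't know anything about them, it would just see them as more tokens. The exit codes (0 spam, 1 ham, 2 unsure, 3 error) and the `-s`/`-n` registration flags are what the man page describes.

```python
import subprocess
from email.message import EmailMessage

# Exit codes the bogofilter man page gives for classification.
VERDICTS = {0: "spam", 1: "ham", 2: "unsure"}


def _one_line(value):
    """Collapse whitespace so a field cannot inject extra header lines."""
    return " ".join(value.split())


def build_message(name, email, url, text, ip=None):
    """Wrap a comment in a pseudo e-mail; header names are hypothetical."""
    msg = EmailMessage()
    msg["X-Comment-Name"] = _one_line(name)
    msg["X-Comment-Email"] = _one_line(email)
    msg["X-Comment-URL"] = _one_line(url)
    if ip is not None:
        # Optional: leave the IP out entirely to compare results.
        msg["X-Comment-IP"] = _one_line(ip)
    msg.set_content(text)
    return msg


def classify(msg):
    """Pipe the message to bogofilter and map its exit code to a verdict."""
    result = subprocess.run(
        ["bogofilter"], input=msg.as_bytes(), stdout=subprocess.DEVNULL
    )
    if result.returncode not in VERDICTS:
        raise RuntimeError(f"bogofilter failed with code {result.returncode}")
    return VERDICTS[result.returncode]


def train(msg, is_spam):
    """Register the message as spam (-s) or ham (-n)."""
    flag = "-s" if is_spam else "-n"
    result = subprocess.run(
        ["bogofilter", flag], input=msg.as_bytes(), stdout=subprocess.DEVNULL
    )
    if result.returncode == 3:
        raise RuntimeError("bogofilter reported an error while training")
```

The initial training would just loop over my two piles and call `train(build_message(...), is_spam=True)` or `is_spam=False` on each. What I don't know is whether, after that, `train` should be called on everything or only on the comments `classify` gets wrong or marks unsure, and whether passing `ip` helps or hurts.
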

Any thoughts on this?