Data Lust

Feb 20, 2007 Jay

I love Mozilla Thunderbird, partly because it’s a Mozilla-branded product, but largely because of its adaptive junk mail filter. For every email you get, you can mark it as “junk” or as “not junk,” and from both kinds of marking Thunderbird begins to learn (through Bayesian filtering) how to identify spam.
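For the curious, here is a rough idea of what "learning through Bayesian filtering" can look like. This is a minimal naive Bayes sketch in Python, not Thunderbird's actual algorithm; the class name `TinyBayesFilter`, the whitespace tokenizer, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy junk filter: counts how many junk / not-junk messages contain each token."""

    def __init__(self):
        self.junk_tokens = Counter()
        self.good_tokens = Counter()
        self.junk_messages = 0
        self.good_messages = 0

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase and split on whitespace; real filters tokenize far more carefully.
        return set(text.lower().split())

    def train(self, text, is_junk):
        """Record one message the user marked as junk (True) or not junk (False)."""
        tokens = self.tokenize(text)
        if is_junk:
            self.junk_tokens.update(tokens)
            self.junk_messages += 1
        else:
            self.good_tokens.update(tokens)
            self.good_messages += 1

    def junk_probability(self, text):
        """Combine per-token evidence into a single probability between 0 and 1."""
        # Start from the prior odds of junk, smoothed so an untrained filter says 0.5.
        log_odds = math.log((self.junk_messages + 1) / (self.good_messages + 1))
        for token in self.tokenize(text):
            # Add-one smoothing keeps unseen tokens from producing a zero probability.
            p_junk = (self.junk_tokens[token] + 1) / (self.junk_messages + 2)
            p_good = (self.good_tokens[token] + 1) / (self.good_messages + 2)
            log_odds += math.log(p_junk / p_good)
        # Clamp before exponentiating so very long messages cannot overflow.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Note that the "not junk" counts carry as much weight as the "junk" counts: each token's score is a ratio of the two, so a filter trained only on spam has nothing to compare against.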

If you’re anything like me you’ve noticed that spammers are getting a lot craftier in recent months; I’ve even had a few spam emails slip into my Gmail inbox, when Gmail has in my experience been nothing short of astounding in its ability to identify spam. Which is to say, Thunderbird isn’t catching everything for me, at least not yet. I mark every spam I get as such, but the filtering relies on your marking the non-spam as well.

Anyway, it’s not hard work to mark all these emails (especially if you can highlight a bunch from a number of trusted senders and mark “not spam”), but it’s still work, and I’d hate to see it all go to waste if my hard drive crashed, or even if Thunderbird’s development suddenly halted — the data could prove useful elsewhere. And the idea of even having that data accessible to me outside of a practical implementation within a single program — in raw, browsable form — is really, really appealing.

Through very little Googling I found out that Thunderbird keeps all this training data in a single file, named, aptly, training.dat. Mine is in the “Documents and Settings\Jay\Application Data\Thunderbird\Profiles\2e8vm8m0.default” folder; swap in your own user name and profile ID for “Jay” and “2e8vm8m0.” And apparently, simply putting it in another profile folder migrates all the training you’ve done to that other profile. Amazingly simple.
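If you would rather script that copy than drag the file around by hand, something like the following Python sketch should do it. The function name `migrate_training_data` and the `.bak` backup step are my own additions, and the profile paths are whatever yours happen to be. It is probably safest to run it with Thunderbird closed.

```python
import shutil
from pathlib import Path


def migrate_training_data(source_profile, target_profile, backup=True):
    """Copy training.dat from one profile folder into another.

    Both arguments are paths to profile folders (hypothetical examples:
    ".../Profiles/xxxxxxxx.default"). Returns the path of the new copy.
    """
    source = Path(source_profile) / "training.dat"
    target = Path(target_profile) / "training.dat"

    if not source.is_file():
        raise FileNotFoundError(f"no training.dat in {source_profile}")

    # Keep the target profile's existing training data around, just in case.
    if backup and target.exists():
        shutil.copy2(target, target.with_name("training.dat.bak"))

    shutil.copy2(source, target)
    return target
```

The same function doubles as a crude backup tool: point `target_profile` at any existing folder on another drive.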

Here’s what the first ten lines of mine look like:

þíúÎ justifies, meaningful sublicense propelling direct flyer-ing, herbaliseratt aggression (surprise, inflatable

I don’t get it either, and it just goes on like that, with no immediately recognizable structure or indication of what significance these words have, save for some seemingly random paragraph breaks.
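The gibberish at the very start is a hint that training.dat is a binary file rather than plain text, which would explain why a text editor makes such a mess of it. Here is a minimal Python sketch of a reader. It assumes a layout that nothing in this post confirms: a 4-byte magic number (0xFEEDFACE), big-endian 32-bit counts of good and junk messages, then two token tables (good, then junk), each a count followed by (occurrences, length, word bytes) records. Treat every one of those details as an assumption to check against your own file.

```python
import io
import struct

# Assumed magic number at the start of the file (not confirmed by this post).
MAGIC = b"\xfe\xed\xfa\xce"


def _read_u32(stream):
    """Read one big-endian unsigned 32-bit integer."""
    data = stream.read(4)
    if len(data) != 4:
        raise ValueError("unexpected end of data")
    return struct.unpack(">I", data)[0]


def _read_tokens(stream):
    """Read one token table: a count, then (occurrences, length, word) records."""
    tokens = {}
    for _ in range(_read_u32(stream)):
        occurrences = _read_u32(stream)
        length = _read_u32(stream)
        word = stream.read(length)
        if len(word) != length:
            raise ValueError("unexpected end of data inside a token")
        tokens[word.decode("utf-8", errors="replace")] = occurrences
    return tokens


def parse_training_data(raw):
    """Parse the raw bytes of a training file under the assumed layout."""
    stream = io.BytesIO(raw)
    if stream.read(4) != MAGIC:
        raise ValueError("missing magic number; layout assumption does not hold")
    good_messages = _read_u32(stream)
    junk_messages = _read_u32(stream)
    good_tokens = _read_tokens(stream)
    junk_tokens = _read_tokens(stream)
    return {
        "good_messages": good_messages,
        "junk_messages": junk_messages,
        "good_tokens": good_tokens,
        "junk_tokens": junk_tokens,
    }
```

To try it, read the file in binary mode, for example `parse_training_data(open("training.dat", "rb").read())`, and work on a copy rather than the live file. If the layout guess is wrong, the function raises a `ValueError` or returns nonsense instead of real words.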

BUT, when I Googled what I now knew to be the filename of the training data, I found that Mozilla created a little Java program called the Bayes Junk Tool, which makes this data surprisingly legible, AND exportable as XML, AND allows you to edit this data arbitrarily!! I couldn’t have asked for more.

Truthfully, I’m a little disappointed in the relatively rudimentary Bayesian approach. I thought for sure this training.dat file would be riddled with regular expressions, teaching Thunderbird that “v1agar” is the same thing as “\/|a gra.” Although that’s probably too subtle even for regular expressions. I can dream, can’t I?

None of this is to undercut the value of MozBackup, which keeps settings, cookies, extensions, cached files, and more within a single backup file.
