Data Lust

I love Mozilla Thunderbird, partly because it’s a Mozilla-branded product, but largely because of its adaptive junk mail filter. You mark each email you receive as “junk” or “not junk,” and from those markings Thunderbird learns (through Bayesian filtering) how to identify spam.
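
To make “Bayesian filtering” a little more concrete, here is a minimal sketch of the general idea: count how often each word shows up in mail you marked junk versus not junk, then combine those counts into a probability for a new message. This is a plain naive Bayes toy with a hypothetical `TinyBayes` class of my own naming; it is not Thunderbird’s actual algorithm, which is more involved.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive Bayes junk filter (illustrative only, not Thunderbird's code)."""

    def __init__(self):
        # Per-label word counts and per-label message counts.
        self.tokens = {"junk": Counter(), "good": Counter()}
        self.messages = {"junk": 0, "good": 0}

    def train(self, text, label):
        """Record one message that the user marked as 'junk' or 'good'."""
        self.messages[label] += 1
        self.tokens[label].update(text.lower().split())

    def junk_probability(self, text):
        """Return an estimate in (0, 1) that the text is junk."""
        # Start from the prior odds, smoothed so an empty filter gives 0.5.
        log_odds = math.log(
            (self.messages["junk"] + 1) / (self.messages["good"] + 1)
        )
        vocab = len(set(self.tokens["junk"]) | set(self.tokens["good"]))
        junk_total = sum(self.tokens["junk"].values())
        good_total = sum(self.tokens["good"].values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing things out.
            p_junk = (self.tokens["junk"][word] + 1) / (junk_total + vocab + 1)
            p_good = (self.tokens["good"][word] + 1) / (good_total + vocab + 1)
            log_odds += math.log(p_junk / p_good)
        return 1 / (1 + math.exp(-log_odds))


f = TinyBayes()
f.train("cheap pills buy now", "junk")
f.train("lunch meeting tomorrow at noon", "good")
print(f.junk_probability("cheap pills") > 0.5)     # True
print(f.junk_probability("lunch tomorrow") < 0.5)  # True
```

Note that the “good” counts matter as much as the “junk” ones, which is why marking the non-spam is part of the job.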

If you’re anything like me, you’ve noticed that spammers have gotten a lot craftier in recent months; I’ve even had a few spam emails slip into my Gmail inbox, even though Gmail has, in my experience, been nothing short of astounding at identifying spam. Which is to say, Thunderbird isn’t catching everything for me, at least not yet. I mark every spam I get as such, but the filtering relies on your marking the non-spam as well.

Anyway, it’s not hard work to mark all these emails (especially if you can highlight a bunch from a number of trusted senders and mark them “not spam”), but it’s still work, and I’d hate to see it all go to waste if my hard drive crashed, or even if Thunderbird’s development suddenly halted, since the data could prove useful elsewhere. And the idea of having that data accessible to me in raw, browsable form, outside of its practical implementation within a single program, is really, really appealing.

With very little Googling I found out that Thunderbird keeps all this training data in a single file, aptly named training.dat. It lives in your profile folder; mine is “Documents and Settings\Jay\Application Data\Thunderbird\Profiles\2e8vm8m0.default”. And apparently, simply copying it into another profile folder migrates all the training you’ve done to that other profile. Amazingly simple.
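
If you’d rather script that copy than drag the file around, something like the following would do it. It’s a rough sketch: `copy_training_data` is a hypothetical helper of my own, the profile paths are whatever yours happen to be, and closing Thunderbird first is my own precaution rather than a documented requirement.

```python
import shutil
from pathlib import Path


def copy_training_data(src_profile, dst_profile, filename="training.dat"):
    """Copy the junk-filter training file from one profile folder to another.

    Any existing file in the destination is kept alongside as '<name>.bak'.
    """
    src = Path(src_profile) / filename
    dst = Path(dst_profile) / filename
    if not src.is_file():
        raise FileNotFoundError(f"no {filename} in {src_profile}")
    if dst.exists():
        # Keep the destination's old training data rather than clobbering it.
        shutil.copy2(dst, dst.with_name(dst.name + ".bak"))
    shutil.copy2(src, dst)
    return dst


# Example call (placeholder paths, substitute your own profile folders):
# copy_training_data(r"C:\path\to\old.profile", r"C:\path\to\new.profile")
```

The same function works as a cheap backup if the destination is just some folder on another drive.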

Here’s what the first ten lines of mine look like:

þíúÎ
justifies,
meaningful
sublicense
propelling direct
flyer-ing,
herbaliseratt
aggression
(surprise,
inflatable

I don’t get it either, and it just goes on like that, with no immediately recognizable structure or indication of what significance these words have, save for some seemingly random paragraph breaks.
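
One thing that does decode: read as Latin-1 bytes, that opening “þíúÎ” is FE ED FA CE, i.e. “feed face,” which looks like a magic number marking a binary file. That would explain the odd breaks, since binary length and count fields can contain bytes a text editor shows as newlines. Below is a speculative sketch of a reader. The layout it assumes (magic, two big-endian 32-bit message counts, then a “good” and a “junk” token table of count/length/bytes records) is my assumption, not something confirmed here, and every function name is hypothetical.

```python
import struct

MAGIC = bytes.fromhex("feedface")  # what an editor shows as "þíúÎ" in Latin-1


def read_u32(data, offset):
    """Read one big-endian unsigned 32-bit integer; return (value, new_offset)."""
    (value,) = struct.unpack_from(">I", data, offset)
    return value, offset + 4


def read_tokens(data, offset):
    """Read an ASSUMED token table: a token total, then (count, length, bytes) records."""
    total, offset = read_u32(data, offset)
    tokens = {}
    for _ in range(total):
        count, offset = read_u32(data, offset)
        length, offset = read_u32(data, offset)
        word = data[offset:offset + length].decode("utf-8", errors="replace")
        offset += length
        tokens[word] = count
    return tokens, offset


def parse_training(data):
    """Parse bytes in the ASSUMED training.dat layout described above."""
    if data[:4] != MAGIC:
        raise ValueError("missing FE ED FA CE magic number")
    good_messages, offset = read_u32(data, 4)
    junk_messages, offset = read_u32(data, offset)
    good_tokens, offset = read_tokens(data, offset)
    junk_tokens, offset = read_tokens(data, offset)
    return {
        "good_messages": good_messages,
        "junk_messages": junk_messages,
        "good_tokens": good_tokens,
        "junk_tokens": junk_tokens,
    }


print("þíúÎ".encode("latin-1") == MAGIC)  # True
```

If the layout guess is wrong, the magic-number check still holds, but the counts and words that come out will be garbage, so treat any output with suspicion and work on a copy of the file.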

BUT, when I Googled what I now knew to be the filename of the training data, I found that Mozilla created a little Java program called the Bayes Junk Tool, which makes this data surprisingly legible, AND exportable as XML, AND allows you to edit this data arbitrarily!! I couldn’t have asked for more.

Truthfully, I’m a little disappointed in the relatively rudimentary Bayesian approach. I thought for sure this training.dat file would be riddled with regular expressions, teaching Thunderbird that “v1agra” is the same thing as “\/|a gra.” Although that’s probably too subtle even for regular expressions. I can dream, can’t I?
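
For what it’s worth, the dream isn’t completely out of reach for the easy cases. Here’s a rough sketch of the kind of thing I had in mind: a handful of regular-expression substitutions that collapse obfuscated spellings of a single word down to one form before counting it. The substitution list is a made-up handful of my own, nothing Thunderbird actually does, and it would be hopeless against anything more inventive.

```python
import re

# Hypothetical look-alike substitutions, applied in order.
SUBSTITUTIONS = [
    (r"\\/", "v"),    # backslash + slash drawn as a "v"
    (r"[1|!]", "i"),  # digits and punctuation standing in for "i"
    (r"@", "a"),
    (r"0", "o"),
    (r"\s+", ""),     # spaces wedged into the middle of a word
]


def normalize(word):
    """Collapse one obfuscated word to a plain lowercase spelling."""
    result = word.lower()
    for pattern, replacement in SUBSTITUTIONS:
        result = re.sub(pattern, replacement, result)
    return result


print(normalize("v1agra"))      # viagra
print(normalize("\\/|a gra"))   # viagra
```

Because it strips whitespace, this only makes sense applied to one suspect word at a time, not to a whole message.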

None of this is to undercut the invaluability of MozBackup, which keeps settings, cookies, extensions, cached files, and more within a single backup file.