Data Lust

3 Responses · February 20, 2007

I love Mozilla Thunderbird, not least because it’s a Mozilla-branded product, but also largely because of its adaptive junk mail filter. For every email you get, you can mark it as “junk” or as “not junk,” and from both kinds of marking Thunderbird begins to learn (through Bayesian filtering) how to identify spam.
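To make "learning from both kinds of marking" concrete, here is a minimal sketch of a naive Bayes junk score. It is not Thunderbird's actual algorithm, only the textbook idea under simplifying assumptions: tokens are whitespace-separated words, each token counts once per message, and add-one smoothing stands in for whatever Thunderbird really does. The function names and sample messages are made up for illustration.

```python
import math
from collections import Counter

def train(messages):
    """Count, per token, how many messages of one class contain it."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts

def spam_probability(text, junk_counts, good_counts, n_junk, n_good):
    """Naive Bayes estimate of P(junk | tokens), with add-one smoothing."""
    # Start from the prior odds of junk vs. not junk, in log space.
    log_odds = math.log(n_junk / n_good)
    for token in set(text.lower().split()):
        p_token_junk = (junk_counts[token] + 1) / (n_junk + 2)
        p_token_good = (good_counts[token] + 1) / (n_good + 2)
        log_odds += math.log(p_token_junk / p_token_good)
    # Convert log odds back into a probability.
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical training data: messages marked "junk" and "not junk".
junk = ["cheap pills online", "cheap watches online now"]
good = ["lunch on friday", "notes from the meeting on friday"]
junk_counts, good_counts = train(junk), train(good)

score_junk = spam_probability("cheap pills", junk_counts, good_counts, len(junk), len(good))
score_good = spam_probability("friday lunch", junk_counts, good_counts, len(junk), len(good))
```

Note that the "not junk" counts pull the score down just as the "junk" counts push it up, which is why marking only the spam leaves the filter with half the picture.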

If you’re anything like me, you’ve noticed that spammers have gotten a lot craftier in recent months; I’ve even had a few spam emails slip into my Gmail inbox, and Gmail has in my experience been nothing short of astounding in its ability to identify spam. Which is to say, Thunderbird isn’t catching everything for me, at least not yet. I mark every spam I get as such, but the filtering relies on your marking the non-spam as well.

Anyway, it’s not hard work to mark all these emails (especially if you can highlight a bunch from a number of trusted senders and mark “not spam”), but it’s still work, and I’d hate to see it all go to waste if my hard drive crashed, or even if Thunderbird’s development suddenly halted, since the data could prove useful elsewhere. And the idea of having that data accessible to me outside of a single program’s practical implementation, in raw, browsable form, is really, really appealing.

Through very little Googling I found out that Thunderbird keeps all this training data in a single file, named, aptly, training.dat. It’s in your profile folder; mine is “Documents and Settings\Jay\Application Data\Thunderbird\Profiles\2e8vm8m0.default”. And apparently, simply copying it into another profile folder migrates all the training you’ve done to that other profile. Amazingly simple.
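The migration described above is just a file copy, so it can be sketched in a few lines. This is a rough illustration, not an official procedure: the function name is hypothetical, the profile paths are whatever yours happen to be, and it is probably safest to run with Thunderbird closed.

```python
import shutil
from pathlib import Path

def migrate_training(src_profile, dst_profile, filename="training.dat"):
    """Copy the junk-filter training file from one profile folder to another."""
    src = Path(src_profile) / filename
    dst = Path(dst_profile) / filename
    if not src.is_file():
        raise FileNotFoundError(f"no {filename} in {src_profile}")
    if dst.exists():
        # Keep the destination's old training data instead of silently overwriting it.
        shutil.copy2(dst, dst.with_name(dst.name + ".bak"))
    shutil.copy2(src, dst)
    return dst
```

The same function doubles as a backup step if the destination is any folder outside the profile, which covers the hard-drive-crash worry as well.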

Here’s what the first ten lines of mine look like:

þíúÎ
justifies,
meaningful
sublicense
propelling direct
flyer-ing,
herbaliseratt
aggression
(surprise,
inflatable

I don’t get it either, and it just goes on like that, with no immediately recognizable structure or indication of what significance these words have, save for some seemingly random paragraph breaks.

BUT, when I Googled what I now knew to be the filename of the training data, I found that Mozilla created a little Java program called the Bayes Junk Tool, which makes this data surprisingly legible, AND exportable as XML, AND allows you to edit this data arbitrarily!! I couldn’t have asked for more.
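For the curious, the odd first line of the dump, “þíúÎ”, is what the four bytes FE ED FA CE look like when a text editor decodes them as Latin-1, which suggests a binary file with a magic-number header rather than a word list. The sketch below reads such a file under an assumed layout that is not confirmed by anything in this post: the magic bytes, then big-endian 32-bit counts of good and junk messages, then a good-token table and a junk-token table, each made of (occurrences, length, bytes) records. That would also explain the stray line breaks, which are presumably count and length bytes. Treat the layout and every function name here as assumptions; the Bayes Junk Tool remains the reliable way in.

```python
import struct

MAGIC = b"\xfe\xed\xfa\xce"  # shows up as "þíúÎ" when decoded as Latin-1

def read_u32(stream):
    """Read one big-endian unsigned 32-bit integer."""
    data = stream.read(4)
    if len(data) != 4:
        raise ValueError("unexpected end of file")
    return struct.unpack(">I", data)[0]

def read_tokens(stream):
    """Read one token table (assumed layout: count, then occurrences/length/bytes records)."""
    tokens = {}
    for _ in range(read_u32(stream)):
        occurrences = read_u32(stream)
        length = read_u32(stream)
        word = stream.read(length)
        if len(word) != length:
            raise ValueError("truncated token")
        tokens[word.decode("utf-8", errors="replace")] = occurrences
    return tokens

def read_training(stream):
    """Parse a whole file under the assumed layout described above."""
    if stream.read(4) != MAGIC:
        raise ValueError("bad magic bytes; not the assumed training.dat layout")
    good_messages = read_u32(stream)
    junk_messages = read_u32(stream)
    good_tokens = read_tokens(stream)
    junk_tokens = read_tokens(stream)
    return good_messages, junk_messages, good_tokens, junk_tokens
```

It would be called on a file opened in binary mode, for example `read_training(open(path, "rb"))`, ideally on a copy of training.dat rather than the live file.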

Truthfully, I’m a little disappointed in the relatively rudimentary Bayesian approach. I thought for sure this training.dat file would be riddled with regular expressions, teaching Thunderbird that “v1agar” is the same thing as “\/|a gra.” Although that’s probably too subtle even for regular expressions. I can dream, can’t I?
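As a rough illustration of why regular expressions alone fall short here, the sketch below does not try to write one pattern per disguise. Instead it swaps look-alike characters back to letters, strips everything else, and then uses fuzzy matching (Python's `difflib.SequenceMatcher`) to tolerate transposed letters. The look-alike table, the function names, and the 0.8 threshold are all made up for the example; nothing suggests Thunderbird does anything like this.

```python
import re
from difflib import SequenceMatcher

# Hypothetical look-alike table; a real one would need many more entries.
LOOKALIKES = [("\\/", "v"), ("|", "i"), ("1", "i"), ("@", "a"), ("0", "o")]

def normalize(token):
    """Undo simple character disguises and drop anything that is not a letter."""
    token = token.lower()
    for fake, real in LOOKALIKES:
        token = token.replace(fake, real)
    return re.sub(r"[^a-z]", "", token)

def looks_like(token, target, threshold=0.8):
    """True if the normalized token is a close fuzzy match for the target word."""
    similarity = SequenceMatcher(None, normalize(token), target).ratio()
    return similarity >= threshold
```

With this, both `looks_like("v1agar", "viagra")` and `looks_like("\\/|a gra", "viagra")` come out true, while an ordinary word such as "meaningful" does not match. The catch is that every target word has to be listed in advance, which is exactly the manual upkeep a trained filter is meant to avoid.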

None of this is to undercut the invaluability of MozBackup, which keeps settings, cookies, extensions, cached files, and more within a single backup file.

a poem:

þíúÎ
justifies,
meaningful
sublicense
propelling direct
flyer-ing,
herbaliseratt
aggression
(surprise,
inflatable

jessi · 21 Feb 2007

it’s so beautiful :’(

Jay · 28 Feb 2007

:’ )

jessi · 2 Mar 2007
