« April 2004 | Main | June 2004 »

May 2004 Archives

May 7, 2004

SVMMail

I reached a great milestone with SVMMail today. I will be doing more test and releasing more information next week. The initial results are a 97% accuracy on the filter. Also with real world testing (so far a low number of 65 mails), there was only 2 errors (1 false positive) in prediction. I had training data of 550 real emails and 713 spam emails (all of which i collected in the 3 weeks or so that I ahve been working on this project.) I am really excited that I have past the stumbling blocks that I was on the last 3 days where I was actually getting a 0% accuracy because a bug the was generating a pretty much random model.

There is currently no web interface and it is all just run directly from java (jbuilder in my case). I will add features like that and the ability to track how many of each type of error my system makes later.

This is a great I am really happy with how this is working out.

May 11, 2004

SVM Mail testing

Today I got a chance to let SVM mail sort 252 messages that it has never seen before. The messages were moved sucessfully fromt he inbox to either my real folder, or my spam folder. Out of the 252 messages the system incorrectly categorized 8 emails. Three of the 8 were actually the same message sent out from the frienster network. I have added those messages to real training model. This means on the first really good real world test my filter achieved an 96.8% accuracy. This comparess well to the 98% estimated acuuracy of the model, by categorizing the training data 98% correctly.

The system has been through about 400 emails, and my accuracy is now at 97.25 %. I have only retrained the model once including little bits of the new info I have collected. I places all of the incorrectly identified emails in the proper categories and retrained the system. Then just to see if that would make it correctly identify those emails I recategoried them, it cut the misclassifications in half, but some where still missclassified. It seems that forwarded messages with attachments are what it will missclassify still. Other emails seem good.

I am currently not working on or extending this project because it was just a testing project for some of my code, which I am now focusing back on my News Shaker project, which is using SVM classification to create a google news like site.

May 18, 2004

News Shaker Update

After doing some initial work with the 8 category problem, I have run into some problems. Nothing that can�t be solved but just some initial hiccups as expected. The very first run through I was getting approximately 30% accuracy on my categorizations. Better than random guessing but still pretty worthless. After changing to a different layout of the model, I am now getting around 43%. Which 43% (on average of the 8 models some are higher) also sucks. I now have about 5 different ideas after talking with a professor at CU at how to improve my overall percents. I am trying to get over 75% accuracy once I have about that level (which isn�t that high) I am hoping with some user feedback on the site that the model will train and improve itself. Which would be really cool, and possible since pretty much the whole process is automated now.

I first was taking all of the categories and creating a positive and a negative vector. The positive was all of the categorized data in the model. The negative was all of the other data in all the other categories. This wasn�t doing so well, so I removed the general category from the negative vector. I also removed the uncategorized data from the negative vector since it is possible these could fit in the category. Doing this increased my model accuracy from the 30% to the 43%.

I am now considering other things I could do to improve the accuracy. One of the things I am considering is a two level model. The first would only say if the model relates to special education the second level would then categorize within the special education category. This would allow me to quickly dump anything I know isn�t related to special education at all. It would also allow me on the final site to have users help with the categorization process. Anything that couldn�t be categorized better than just special education related could be placed in a general category. The general category users could view and then place in the proper category which would in turn help train the system.

I am also now considering a move from SVMlight to libSVM. Apparently libSVM offers some better options and optimizations, but still uses the same input format. This is important because text2SVM, took awhile and was written with SVMlight in mind. I have done some other optimizations on text2SVM which isn�t included in the released source because the project has begun to become less general and more specific to my project. It has improved and become far faster though. If I move to libSVM this would allow me to get results of a categorization attempt as a percent. If I had percents I could compare the results to different models better which would be useful since the value comparison between models isn�t scaled the same.

One of the problems I am running into is testing time. It takes about 2 1/2 hours or so to create and run a new test. It requires a few different steps. If I run them all at once my machine runs out of memory and crashes. So I have to run the steps one at time even though the code is completely automated, it can�t run as such without time to dump the memory. Perhaps I will have to start looking around CU for a gigantic machine that I can use to do testing much faster.

The spam filter has gone through over 500 emails now and has an accuracy of 97.5% on unseen new email. This is great, if it wasn�t so specialized to my mail I would make the filter available to everyone.

That�s it for now. The good news is I think I am still headed in the right direction and I think I will end up with a capable system. The bad news is that I think it is going to be harder and more time consuming than originally planned. I will be busy with some other stuff and out of town over the next 3 weeks so there will probably be little updated information available on the project.

May 19, 2004

SVMMail vs Apple Mail

Well it looks like my SVM mail is very similar to apple's Mail. Apples mail also uses a vector representation and training to achieve 98%+ accuracy (the claim). After reading through this, it makes sense why my filter is achieving such good accuracy. They are using a little more complicated vector analysis tool than I. I use SVMlight, while Apple is using LSA (Latentic Semantic Analysis), which i used to work with but I found the tools far less developed and harder to work with. It was causing me all sorts of problems to tell LSA to do the simple clustering I was doing very simple with SVMlight. The main reason it looks like they are using LSA is they first reduce the space vector and then using LSA on the reduced vector they are claiming a major performance increase. This is really believable, especially since LSA offers quick tools for folding new information into a model instead of recreating it. Anyways, I am happy to learn that the approach I took with my mail filter appears to be very similar to one of the best computer companies out there. It gives me even more reason to believe that I am on the right trail with all of my various Vector based learning algorithms.

Explaining the Apple Mail Filter

P.S. This means that my filter that has barely been developed and has many enhancements possible is achieving about .5% less than apples filter!

May 20, 2004

SVM Mail feature vector

I have gotten a few emails and questions from others researchers in the community and I decided that I would begin to answer questions on my site rather than through email so any others could also benefit from the answer. So here is the first response I am posting on the web. Feel free to contact me if you have any other questions and I can try to respond.

I found your project while googling for various alternatives to spam
filtering; I've been thinking about trying SVM for mail filtering myself,
but I'm slightly at loss as to what features to use.

A bag-of-words model comes naturally to mind, but it is not the most
efficient computationally; are you perhaps using it or something similar?

Continue reading "SVM Mail feature vector" »

May 24, 2004

SVMMail testing update

I have now gone through about 800 emails with my SVM / SMO mail filter. I am still getting about 97.5% accuracy. I have blocked over 700 spams from my mail account. I am taking off (to europe) and I am sure my mail will fill up with about 600 emails while I am gone so I will have a much larger test results when I return. Good luck to anyone else out there working with or thinking about SVMs / SMOs for spam filters. It looks like it is a winning combination. I hope my various code and articles can help you on your way.

peace,
Dan Mayer

Web 2.0 craziness

View Dan Mayer's profile on LinkedIn


I Power Seekler
I Power Seekler

www.flickr.com
This is a Flickr badge showing public photos and videos from mayer_dan. Make your own badge here.

Creative Commons License
This weblog is licensed under a Creative Commons License.