18 May 2004

News Shaker Update

Dan Mayer

After doing some initial work on the 8-category problem, I have run into some problems. Nothing that can't be solved, just some initial hiccups, as expected. On the very first run I was getting approximately 30% accuracy on my categorizations: better than random guessing, but still pretty worthless. After changing to a different layout of the model, I am now getting around 43% (averaged across the 8 models; some are higher), which also isn't great. After talking with a professor at CU, I now have about 5 different ideas for improving my overall percentages. I am trying to get over 75% accuracy. Once I reach about that level (which isn't that high), I am hoping that with some user feedback on the site the model will train and improve itself. That would be really cool, and it's possible since pretty much the whole process is automated now.

I first was taking each of the categories and creating a positive and a negative vector for it. The positive vector was all of the categorized data for that model's category; the negative vector was all of the data in all the other categories. This wasn't doing so well, so I removed the general category from the negative vector. I also removed the uncategorized data from the negative vector, since it is possible that data could fit in the category. Doing this increased my model accuracy from 30% to 43%.

I am now considering other things I could do to improve the accuracy. One of them is a two-level model. The first level would only say whether an article relates to special education; the second level would then categorize it within the special education category. This would allow me to quickly dump anything I know isn't related to special education at all. It would also allow me, on the final site, to have users help with the categorization process. Anything that couldn't be categorized more specifically than "special education related" could be placed in a general category.
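The positive/negative vector setup above can be sketched in a few lines. This is a hypothetical illustration, not the project's actual code: the function name, category names, and dict layout are all made up, but it shows the key decision of excluding the "general" and "uncategorized" pools from every model's negative set, since those articles might actually belong to the category being trained.

```python
def build_training_sets(articles_by_category,
                        exclude_from_negatives=("general", "uncategorized")):
    """articles_by_category: dict mapping category name -> list of articles.

    Returns a dict mapping each modeled category to a
    (positives, negatives) pair for a one-model-per-category SVM setup.
    """
    training = {}
    for category, positives in articles_by_category.items():
        if category in exclude_from_negatives:
            continue  # no dedicated model for these catch-all pools
        # Negatives: everything from the other categories, minus the
        # excluded pools whose articles might really fit this category.
        negatives = [
            article
            for other, docs in articles_by_category.items()
            if other != category and other not in exclude_from_negatives
            for article in docs
        ]
        training[category] = (list(positives), negatives)
    return training

# Toy example with made-up category names:
data = {
    "legislation": ["bill passes", "new law"],
    "funding": ["grant awarded"],
    "general": ["misc article"],
    "uncategorized": ["unknown article"],
}
sets = build_training_sets(data)
```

In this sketch, "misc article" and "unknown article" never appear in any model's negative set, which is the change that took the accuracy from 30% to 43%.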
Users could view the general category and place articles into the proper category, which would in turn help train the system.

I am also now considering a move from SVMlight to libSVM. Apparently libSVM offers some better options and optimizations, but still uses the same input format. This is important because text2SVM took a while to write and was written with SVMlight in mind. I have made some other optimizations to text2SVM that aren't included in the released source, because the project has become less general and more specific to my project. It has improved and become far faster, though. Moving to libSVM would allow me to get the result of a categorization attempt as a probability. With probabilities I could better compare the results of different models, which would be useful since the raw output values of different models aren't scaled the same.

One of the problems I am running into is testing time. It takes about 2 1/2 hours or so to create and run a new test, and it requires a few different steps. If I run them all at once my machine runs out of memory and crashes, so I have to run the steps one at a time; even though the code is completely automated, it can't run that way without time between steps to free the memory. Perhaps I will have to start looking around CU for a gigantic machine that I can use to do testing much faster.

The spam filter has now gone through over 500 emails and has an accuracy of 97.5% on unseen new email. This is great; if it weren't so specialized to my mail, I would make the filter available to everyone.

That's it for now. The good news is I think I am still headed in the right direction, and I think I will end up with a capable system. The bad news is that I think it is going to be harder and more time consuming than originally planned. I will be busy with some other stuff and out of town over the next 3 weeks, so there will probably be little updated information available on the project.
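The scaling problem with comparing models can be illustrated with a small sketch. Raw SVM decision values from separately trained models live on different scales, so simply taking the largest raw score across models isn't a fair comparison; libSVM's probability estimates (its `-b 1` option uses a per-model sigmoid fit, i.e. Platt scaling) map each model's output onto [0, 1] so the winners can be compared directly. The decision values and sigmoid parameters below are invented for illustration only:

```python
import math

def platt_probability(decision_value, a, b):
    """Platt scaling: a per-model sigmoid, P = 1 / (1 + exp(a*d + b)).

    The parameters (a, b) are fit separately for each trained model,
    which is what makes the resulting probabilities comparable.
    """
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

# Made-up raw decision values from two hypothetical category models:
raw_scores = {"legislation": 1.2, "funding": 0.4}

# Made-up per-model Platt parameters (different scales per model):
platt_params = {"legislation": (-0.5, 0.0), "funding": (-3.0, 0.0)}

probs = {cat: platt_probability(v, *platt_params[cat])
         for cat, v in raw_scores.items()}
best = max(probs, key=probs.get)
```

With these invented numbers, "legislation" wins on raw scores but "funding" wins after calibration, which is exactly why comparing raw values across models is misleading.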