
April 2004 Archives

April 9, 2004

Busy

Alright, I have been really busy with my senior project for the last couple of weeks, so I haven't had time to get any work done. I did, however, have a meeting with a CU AI professor who is familiar with the use of SVMs. He thought I was on the right track and was approaching everything properly. He answered the one last question I had before I can really set the system up to start working on its own for categorization. So that was exciting. Now all I really need is a week or two to spend some time coding. If I had the time, I really think I could make a very impressive first-run beta of the entire system. Then with some more time I think it could turn into a pretty cool application. I am hoping to replace some of the hard-coded values to allow for much more flexibility, so the system can be set up anywhere and sort documents into whatever categories the user wants. The categories are currently hard coded into a few of the functions, and in the end that won't be the best way to do things.

Separate good news is that my senior project just passed a one-and-a-half-hour live test with 80 law students and a professor, with no errors and no problems. That is the longest the system has been used continuously and by far the most users on the system at once. It is really kind of cool to think that there were 80 law students actively using something that I was a large part of creating. We are going to run a larger, more extensive test on Monday, where the professor will be broadcasting a bunch of questions in class. We will see how that works out; I am excited about it. The site will move soon, but if anyone is interested in looking at what my senior project can do, take a look at

http://blackout.cs.colorado.edu/mtroom3/jsp

It is an interactive classroom for the law school, so that professors can more easily get feedback and quiz large classes of students.

Finally, a good friend of mine from my research lab was noticed by Google, and they called her up to talk to her and ask for her resume. So that is some pretty exciting news, that they are actively searching out talent. Perhaps as my project matures they will stumble upon what I am doing.

April 13, 2004

A step closer to fully automated

After recent success with my models, I wanted to do some much more involved and useful tests. The only problem was that I was doing half of the work by hand. I had written some Java software, Text2SVM, that would help with the conversions and such, but I had to give it the names of the files and everything myself. I have now integrated Text2SVM more closely into Newsshaker. It uses the database to find out the names of all the categories. It recursively sorts through all the text files in the given directories. It is quite nice. I ran into some problems where I was running the entire system out of memory even with a full gig given to Java, which would cause way too much swapping and slow the whole system down. I rewrote my code to break the steps up into smaller parts that use much less memory, and it now doesn't crash and runs 2 or 3 times faster.
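For anyone curious what the recursive file search might look like, here is a minimal sketch in present-day Java. It is not the actual Text2SVM code; the class name TextFileFinder, the method name, and the ".txt" filter are all made up for illustration. It only collects paths, so the file contents can still be read one at a time, which is the kind of step-by-step processing that keeps memory use down.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not the real Text2SVM class.
class TextFileFinder {

    /** Returns every regular file under root whose name ends in ".txt", sorted by path. */
    static List<Path> findTextFiles(Path root) throws IOException {
        // Files.walk visits the whole directory tree lazily; the stream must be closed.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Calling TextFileFinder.findTextFiles on a category directory would give back the list of text files below it, which could then be converted one file at a time instead of loading everything at once.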

I can now go straight from the database to 9 SVMlight-formatted text training datasets for models. The next step will be to write some code that generates the models using an interface between Java and SVMlight. After that I will be writing some code to keep all the results of the testing organized and worthwhile, which I will be storing in the database. The final step will then be writing code that takes many random documents from my database, tries to categorize them, and stores whether each one was correctly placed in the category from which it came. I will then be storing a matrix of which category each document came from and where it was actually categorized. This should be highly exciting because it will really let me see the info I need to make my system work, to know how reliably it works, and to see where the problems are occurring. It would let me, for instance, see if special education parenting is too similar and close to special education school for the two to be told apart, and whether the two combined would serve as a valid category separate from everything else. School will probably keep me busy for a while, but it is nice to be making some good progress and to see a first beta version coming into very close view. Then I will have a wonderful testing and development application for text categorization. Hopefully by the end of summer I will have everything generalized to the point that anyone could add categories, begin maintaining a category, and see how they can get different results with different information retrieval schemes or with different categorization algorithms besides SVMs.
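The matrix of "where a document came from" versus "where it was categorized" is usually called a confusion matrix. Here is a rough sketch of how one could be kept in Java. The class and method names are invented for this example and nothing here touches the database; it just counts results in memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a confusion matrix; not actual Newsshaker code.
class ConfusionMatrix {
    private final List<String> categories;
    private final int[][] counts; // counts[actual][predicted]

    ConfusionMatrix(List<String> categories) {
        this.categories = new ArrayList<>(categories);
        this.counts = new int[categories.size()][categories.size()];
    }

    /** Records one test document: the category it came from and the one the model chose. */
    void record(String actual, String predicted) {
        counts[indexOf(actual)][indexOf(predicted)]++;
    }

    /** How many documents from 'actual' were categorized as 'predicted'. */
    int count(String actual, String predicted) {
        return counts[indexOf(actual)][indexOf(predicted)];
    }

    /** Fraction of recorded documents placed in the category they came from (0 if none recorded). */
    double accuracy() {
        int correct = 0;
        int total = 0;
        for (int a = 0; a < counts.length; a++) {
            for (int p = 0; p < counts.length; p++) {
                total += counts[a][p];
                if (a == p) {
                    correct += counts[a][p];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    private int indexOf(String category) {
        int i = categories.indexOf(category);
        if (i < 0) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        return i;
    }
}
```

A large off-diagonal count, say many "special education parenting" documents landing in "special education school", would be the sign that two categories are too close to separate and might work better merged.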

Anyone have some other algorithms worth looking into for text categorization, besides LSA/LSI and SVMs?

Programming progress

Today I accomplished a lot on many different programming projects. Three, to be exact. Anyway, I got my comp org program almost done, which is nice. I made some really good progress on my text categorization project for work. Then finally, for my 3D graphics class, I made some good progress on my very simple 3D shooter.

I am happy to have accomplished so much, but on the downside it is 10:30 and I have been programming the entire day on one project or another. Argg...

I swear I will have to do something fun and exciting for all of you soon. Then I will have something worth writing about again, but until then here is a picture of my 3D shooter in progress:

3d shooter.jpg

April 18, 2004

SVM Spam Filtering

Support Vector Machine (SVM) spam filtering. After working more with SVMs, and with my software for using them moving along nicely, I am beginning to work on a spam filter for my email. I have been getting a much larger amount of email in the last couple of months. The spam is highly recognizable and it is simple to see patterns in it; 90% of it is related to buying prescription drugs. I figured that an SVM should be able to detect this very easily. Since I have already written a few Java email applications, I thought it would be easy enough to write a program that will let me log on and mark things as spam, and then use that to train an SVM model, which takes far less training data than many other machine learning methods. For now I have done a very simplistic design. The first step will be building a simple webmail system. It will really only be used to mark mail as spam and to check the spam folder to make sure there were no false positives. Then I will use my normal mail system to read and respond to the mail. When I first log into my simple webmail, all new mail will be downloaded and categorized. I will then be able to submit any missed items as spam. Anything newly categorized as spam or not will be placed in either the spam or the real mail section of the database. The database will store who the mail was from, the subject, and the body of the message.

I have now created two folders in my mail system: old and spam. I have moved all my old real mail to the old folder and all spam to the spam folder. I have about 300 spam messages and about 520 real messages. I have finished the code for the mail checking and sorting using Java. I have now imported my Text2SVM software and used it to create the original model file. Now, similarly to my Newsshaker software, all that is left to do is the autocategorization. For spam this should be a little simpler since there are only two classes. This should let me learn the process that I will need to use for Newsshaker, but at a little simpler level. I created a model using 170 real mail messages vs. spam. With it I was getting 80% recognition on my test data. I think that with some simple fixes in the way I am creating my text from my mail, plus stop listing and stemming, I should be able to increase this rate. I should also be able to increase the rate once I add more training data.
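To give an idea of what turning a mail message into SVM training data involves, here is a small Java sketch. It is not the real Text2SVM code. The class name MailToSvmLight, the tiny stop list, and the use of raw word counts as feature values are all assumptions made for the example, and stemming is left out entirely. The output follows SVMlight's sparse line format: a target (+1 or -1) followed by feature:value pairs in increasing feature order, with feature numbers starting at 1.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; the stop list and word-count features are illustrative assumptions.
class MailToSvmLight {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "and", "to", "of", "is", "in", "for", "you"));

    // Maps each word seen so far to its feature number (1, 2, 3, ...).
    private final Map<String, Integer> vocabulary = new HashMap<>();

    /** Builds one SVMlight-style line: "+1" for spam or "-1" for real mail, then id:count pairs. */
    String toLine(boolean isSpam, String text) {
        // TreeMap keeps feature ids sorted, as the SVMlight format expects.
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // stop listing: skip very common words
            }
            Integer id = vocabulary.get(token);
            if (id == null) {
                id = vocabulary.size() + 1;
                vocabulary.put(token, id);
            }
            counts.merge(id, 1, Integer::sum);
        }
        StringBuilder line = new StringBuilder(isSpam ? "+1" : "-1");
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            line.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return line.toString();
    }
}
```

With a fresh converter, toLine(true, "Buy cheap drugs, buy now") gives "+1 1:2 2:1 3:1 4:1". The same converter has to be reused for every message, training and test alike, so that each word keeps the same feature number.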

(This was created as a subproject to create and test code that will be used for Newsshaker.)


Here is my initial design.


gif_1.gif

Creative Commons License
This weblog is licensed under a Creative Commons License.