13 April 2004

A step closer to fully automated

Dan Mayer @danmayer

After recent success with my models, I wanted to run some much more involved and useful tests. The only problem was that I was doing half of the work by hand. I had written a Java tool, Text2SVM, that helped with the conversions and such, but I had to give it the names of the files and everything else myself. I have now integrated Text2SVM more closely into newsshaker. It uses the database to find the names of all the categories, and it recursively works through all the text files in the given directories. It is quite nice. I ran into some problems where the whole system ran out of memory even with a full gigabyte given to Java, which caused far too much swapping and slowed the whole system down. I rewrote my code to break the steps into smaller parts that use much less memory, and it now doesn't crash and runs two or three times faster. I can now go straight from the database to 9 SVMlight-formatted text training datasets for models.
The next step will be to write some code that generates the models using an interface between Java and SVMlight. After that I will be writing some code to keep all the testing results organized and worthwhile, which I will store in the database. The final step will be writing code that takes many random documents from my database, tries to categorize them, and stores whether each one was correctly placed in the category it came from. I will then store a matrix that records, for each category, how many of its documents were tried and where they were actually categorized. This should be highly exciting because it will really let me see the information I need to make my system work, to know how reliably it works, and to see where the problems are occurring. It would let me see, for instance, whether special education parenting is too similar and too close to special education school to be told apart, and whether the two combined would still serve as a valid category separate from everything else.
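The matrix of actual versus assigned categories described above is usually called a confusion matrix. Here is a minimal sketch of what one could look like in Java. The class name, method names, and category labels are all hypothetical and chosen only for illustration; nothing here is taken from newsshaker or Text2SVM, and a real version would presumably read and write its counts through the database.

```java
// Hypothetical sketch of a confusion matrix for category predictions.
// Rows: the category a document actually came from.
// Columns: the category the classifier assigned to it.
final class ConfusionMatrix {
    private final String[] categories;
    private final int[][] counts;

    ConfusionMatrix(String... categories) {
        this.categories = categories.clone();
        this.counts = new int[categories.length][categories.length];
    }

    // Linear search is fine for a handful of categories.
    private int indexOf(String category) {
        for (int i = 0; i < categories.length; i++) {
            if (categories[i].equals(category)) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unknown category: " + category);
    }

    // Record one categorization attempt.
    void record(String actual, String predicted) {
        counts[indexOf(actual)][indexOf(predicted)]++;
    }

    // How many documents from `actual` were assigned to `predicted`.
    int count(String actual, String predicted) {
        return counts[indexOf(actual)][indexOf(predicted)];
    }

    // Fraction of documents from `actual` that were assigned back to it.
    // Returns 0.0 when no documents from that category were recorded.
    double recall(String actual) {
        int row = indexOf(actual);
        int total = 0;
        for (int c : counts[row]) {
            total += c;
        }
        return total == 0 ? 0.0 : (double) counts[row][row] / total;
    }

    // Documents mixed up between two categories, in either direction.
    int confusionBetween(String a, String b) {
        return count(a, b) + count(b, a);
    }

    public static void main(String[] args) {
        // Illustrative labels and made-up attempts, not real results.
        ConfusionMatrix m = new ConfusionMatrix(
                "special-ed-parenting", "special-ed-school", "sports");
        m.record("special-ed-parenting", "special-ed-school");
        m.record("special-ed-parenting", "special-ed-parenting");
        m.record("special-ed-school", "special-ed-parenting");
        m.record("sports", "sports");
        System.out.println("Mixed up: "
                + m.confusionBetween("special-ed-parenting", "special-ed-school"));
        System.out.println("Recall: " + m.recall("special-ed-parenting"));
    }
}
```

A high `confusionBetween` value for two categories alongside low counts against every other category would be the kind of signal that suggests merging them into one.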
School will probably keep me busy for a while, but it is nice to be making good progress and to see a first beta version coming into very close view. Then I will have a wonderful testing and development application for text categorization. Hopefully by the end of summer I will have everything generalized to the point that anyone could add categories, begin maintaining a category, and see how they get different results with different information retrieval schemes or with categorization algorithms other than SVMs. Does anyone have other algorithms worth looking into for text categorization, besides LSA/LSI and SVMs?
