09 March 2004

Text2SVM

Dan Mayer

This is the main page for “Text2SVM”, a Java program that converts text documents to the format required by SVMlight. The project is currently in beta. I figured that I should release an early working version for anyone else who has been finding it very difficult to find software to help them work with SVMs. This project is distributed as is, with no guarantees. So use it, try it out.

Right now it is written very specifically to convert text files to a single format for SVMlight. It should be possible to edit it to support many other formats, though. To read more about SVMlight visit their page.

If you make any good modifications or changes to this program I would love to hear about it; please contact me at ddmayer (at) colorado (dot) edu (spam protection). You can also leave comments on this page so other users can help solve your problems. To download the code click here.

This first release, 0.001, includes these features:

- Import an entire directory of files
- Save the dictionary to generate other word vectors based on the same SVM model
- Export files in SVMlight format
- Stemming to increase the occurrence of words (for example, "stopping" and "stopped" are counted as the same word)
- The ability to import saved dictionaries and continue adding to the model
- Debugging code left in

What is SVM, and why would I want to convert text to a word vector?

SVM stands for Support Vector Machines. They are a machine learning method for classification. In this example we are working with text categorization using SVM. The purpose of this program is to take a bunch of documents in a category and a bunch of documents unrelated to that category. It then creates word vectors out of all the documents. The word vectors are then converted to SVMlight's format and given a positive or negative weight (the label at the start of each line). Any time a document vector has a positive weight, that means we consider it part of our category. A negative weight means it is outside of our category.
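To make the word vector idea concrete, here is a minimal sketch in Java of how one document could be turned into one line of SVMlight input. This is not the actual Text2SVM code. The class name SvmLightSketch, the method toSvmLightLine, and the choice of raw word counts as feature values are all assumptions made for illustration, and stemming is left out to keep it short.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the Text2SVM source.
public class SvmLightSketch {

    // Turns one document into a single SVMlight line: "<label> <id>:<count> ...".
    // The dictionary maps each word to a feature id (starting at 1) and grows
    // as new words appear, so the same ids are reused across documents.
    static String toSvmLightLine(String text, int label, Map<String, Integer> dictionary) {
        // TreeMap keeps feature ids in increasing order, which SVMlight expects.
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield an empty first token
            }
            Integer id = dictionary.get(word);
            if (id == null) {
                id = dictionary.size() + 1; // next unused feature id
                dictionary.put(word, id);
            }
            counts.merge(id, 1, Integer::sum);
        }
        // Positive label for documents in the category, negative otherwise.
        StringBuilder line = new StringBuilder(label > 0 ? "+1" : "-1");
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            line.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        // A document in the category, then one outside it.
        System.out.println(toSvmLightLine("The cat saw the dog", 1, dictionary));
        System.out.println(toSvmLightLine("The dog ran", -1, dictionary));
    }
}
```

Passing the same dictionary to every call is what keeps the feature ids consistent between documents, which is why being able to save and reload the dictionary matters when you later convert unknown text.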
After the text is written in this format you can use SVMlight to compare unknown text and see if it belongs in this category.

KNOWN ISSUES:

- Occasionally cuts the first letter off of a document.
- Uses a high amount of memory. To run the program, add -Xmx256m to the java command, followed by the program you're trying to run. wc.java is known to run out of memory with large amounts of text.
