This is the main page for "Text2SVM", a Java program that converts text documents to the format required by SVMlight. The project is currently in beta. I figured that I should release an early working version for anyone else who has been finding it very difficult to find software to help them work with SVMs. This project is distributed as is, with no guarantees, so use it and try it out. Right now it is written very specifically to handle converting text files to a single format for SVMlight, though it should be possible to edit it to support many other formats. To read more about SVMlight, visit their page. If you make any good modifications or changes to this program I would love to hear about it; please contact me at ddmayer (at) colorado (dot) edu (spam protection). You can also leave comments on this page so other users can help solve your problems.
To download the code click here.
This first release, 0.001, includes these features:
-Import an entire directory of files
-Save the dictionary to generate other word vectors based on the same SVM model
-Export files in SVMlight format
-Stemming to increase the occurrence counts of related words (like "stopping" and "stopped")
-The ability to import saved dictionaries and continue adding to the model
-Debugging code left in
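To give a rough idea of what saving and re-importing a dictionary could look like, here is a minimal sketch. It is not the code from Text2SVM: the class name DictionarySketch, its methods, and the tab-separated "word, id" text format are all made up for illustration. It assumes ids are handed out as 1, 2, 3, ... in the order words are first seen.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a word dictionary; not the actual Text2SVM class.
class DictionarySketch {
    // Maps each word to its feature id, in the order the words were added.
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    // Returns the id for a word, assigning the next free id if the word is new.
    public int idFor(String word) {
        Integer id = ids.get(word);
        if (id == null) {
            id = ids.size() + 1; // assumes ids are contiguous, starting at 1
            ids.put(word, id);
        }
        return id;
    }

    // Turns the dictionary into text, one "word<TAB>id" pair per line
    // (an assumed format, chosen only for this example).
    public String serialize() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : ids.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Rebuilds a dictionary from text produced by serialize().
    public static DictionarySketch parse(String saved) {
        DictionarySketch d = new DictionarySketch();
        for (String line : saved.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\t");
            d.ids.put(parts[0], Integer.parseInt(parts[1]));
        }
        return d;
    }

    public static void main(String[] args) {
        DictionarySketch first = new DictionarySketch();
        first.idFor("stop");
        first.idFor("vector");
        String saved = first.serialize();

        // Later run: reload the saved dictionary and keep adding to it.
        DictionarySketch second = DictionarySketch.parse(saved);
        System.out.println(second.idFor("vector")); // prints 2 (existing word keeps its id)
        System.out.println(second.idFor("model"));  // prints 3 (new word gets the next id)
    }
}
```

The point of keeping the ids stable is that vectors generated later line up with the ones the model was trained on. The string from serialize() could be written to a file and read back in whatever way suits you.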
What is SVM, and why would I want to convert text to a word vector?
SVM stands for Support Vector Machine. SVMs are a machine learning method for classification. In this example we are working with text categorization using SVMs. This program's purpose is to take a bunch of documents in a category and a bunch of documents unrelated to that category. It then creates word vectors out of all the documents. The word vectors are then converted to SVMlight's format and given a positive or negative weight. A positive weight on a document vector means we consider it part of our category; a negative weight means it is outside of our category. After the text is written in this format you can use SVMlight to check unknown text and see if it belongs in this category.
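The conversion described above can be sketched roughly like this. This is only an illustration under my own assumptions, not the Text2SVM source: the class SvmLightLineSketch and its methods are hypothetical, tokens are simply lowercase runs of the letters a-z, no stemming is done, and the feature value is the raw word count. What does come from SVMlight itself is the line layout: a target of +1 or -1 followed by feature:value pairs, with feature numbers in increasing order.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of turning one document into one SVMlight line.
class SvmLightLineSketch {
    // Shared dictionary: word -> feature id, starting at 1.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Looks up a word's feature id, adding the word if it has not been seen.
    public int featureId(String word) {
        Integer id = dictionary.get(word);
        if (id == null) {
            id = dictionary.size() + 1;
            dictionary.put(word, id);
        }
        return id;
    }

    // Builds "<target> <id>:<count> <id>:<count> ..." for one document.
    public String toSvmLightLine(String text, boolean inCategory) {
        // TreeMap keeps feature ids sorted in increasing order.
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty first token
            }
            counts.merge(featureId(token), 1, Integer::sum);
        }
        StringBuilder line = new StringBuilder(inCategory ? "+1" : "-1");
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            line.append(' ').append(e.getKey()).append(':').append(e.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        SvmLightLineSketch sketch = new SvmLightLineSketch();
        // A document from our category, then one from outside it.
        System.out.println(sketch.toSvmLightLine("the cat sat on the mat", true));
        // prints: +1 1:2 2:1 3:1 4:1 5:1
        System.out.println(sketch.toSvmLightLine("the dog sat", false));
        // prints: -1 1:1 3:1 6:1
    }
}
```

Both documents go through the same dictionary, so "the" and "sat" get the same feature numbers in each line. That shared numbering is what lets SVMlight compare documents against each other.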
KNOWN ISSUES:
- Occasionally cuts the first letter off of a document.
- Uses a large amount of memory. When running the program, add -Xmx256m to the java command before the name of the program you're trying to run. wc.java is known to run out of memory with large amounts of text.


Comments (8)
Hi,
Please be informed that the download link is broken:
http://www.deadawakemovie.com/ml/archives/files/Text2SVM.zip
We would be very thankful for any update or repair.
Regards
Mahdi Akbar
Posted by Mahdi Akbar | April 5, 2004 11:53 AM
I am sorry about that. I fixed the link today. I hope it didn't cause you any problems. Thanks for pointing out this problem to me.
Thanks,
Dan
Posted by Dan Mayer | April 8, 2004 9:44 PM
Hello! This tool is nice. I found that it works only with UTF8 encoding. Is that right? I tried to convert text files to SVM format and it didn't work until I converted the encoding of my text files to UTF8.
Posted by Baurzhan | April 18, 2007 8:33 AM
I was not aware of that, but I bet you're right that it only works on UTF8. I haven't worked on this code since sometime back in 2004, likely around when I first made this post. So I am glad you have found it useful.
Posted by Dan | May 17, 2007 1:48 AM
Does anyone know where to get this file from? The link seems to be broken again. It looks like a really useful piece of software.
D.
Posted by Dave | August 17, 2007 9:40 AM
Hello,
Thank you very much for this useful software. Actually, I don't have any background in Java...
So would you kindly tell me which file to run to start execution, and which software I could use...
Thank you...
Posted by Ola | March 31, 2008 8:44 PM
Hello,
This is exactly what I need for my study project!
Could you please upload that tool again? (The link is broken.)
Thanks a lot!
Posted by ElComandante | May 31, 2008 9:58 AM
I updated and fixed the link again... This code is really old, so there are probably much better solutions out there.
Posted by Dan Mayer | May 31, 2008 4:26 PM