Text2SVM

This is the main page for "Text2SVM", a Java program that converts text documents to the format required by SVMlight. The project is currently in beta. I figured that I should release an early working version for anyone else who has been finding it very difficult to find software to help them work with SVMs. This project is distributed as is, with no guarantees, so use it and try it out. Right now it is written very specifically to convert text files to a single format for SVMlight, though it should be possible to edit it to support many other formats. To read more about SVMlight, visit their page. If you make any good modifications or changes to this program, I would love to hear about it; please contact me at ddmayer (at) colorado (dot) edu (spam protection). You can also leave comments on this page so other users can help solve your problems.

To download the code click here.

This first release, 0.001, includes these features:
-Import an entire directory of files
-Save the dictionary to generate other word vectors based on the same SVM model
-Export files in SVMlight format
-Stemming, so that word forms like "stopping" and "stopped" count toward the same word
-The ability to import saved dictionaries and continue adding to the model
-Debugging code left in
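
To illustrate why a saved dictionary matters, here is a minimal sketch of the idea, not the actual Text2SVM code. The class name DictionarySketch, its methods, and the tab-separated file layout are all assumptions made for this example. The point is that a word must keep the same feature index across runs, otherwise vectors generated later would not line up with the model trained earlier.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and file layout are NOT taken from Text2SVM.
final class DictionarySketch {

    // Word -> 1-based feature index, kept in insertion order.
    private final Map<String, Integer> indexByWord = new LinkedHashMap<>();

    // Returns the index of a word, assigning the next free index to new words.
    int indexOf(String word) {
        Integer existing = indexByWord.get(word);
        if (existing != null) {
            return existing;
        }
        int next = indexByWord.size() + 1;
        indexByWord.put(word, next);
        return next;
    }

    // Writes one "word<TAB>index" line per entry (assumed format).
    void save(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : indexByWord.entrySet()) {
            lines.add(entry.getKey() + "\t" + entry.getValue());
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    // Reads a file written by save() so new words can be appended to it.
    static DictionarySketch load(Path file) throws IOException {
        DictionarySketch dictionary = new DictionarySketch();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\t");
            dictionary.indexByWord.put(parts[0], Integer.parseInt(parts[1]));
        }
        return dictionary;
    }

    public static void main(String[] args) throws IOException {
        DictionarySketch first = new DictionarySketch();
        first.indexOf("stop");
        first.indexOf("start");

        Path file = Files.createTempFile("dictionary", ".tsv");
        first.save(file);

        // A later run reloads the dictionary: old words keep their index,
        // new words get the next free one.
        DictionarySketch reloaded = DictionarySketch.load(file);
        System.out.println(reloaded.indexOf("stop"));
        System.out.println(reloaded.indexOf("finish"));
        Files.delete(file);
    }
}
```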

What is SVM, and why would I want to convert text to a word vector?
SVM stands for Support Vector Machine, a machine learning method for classification. In this example we are working with text categorization using SVMs. This program's purpose is to take a bunch of documents in a category and a bunch of documents unrelated to that category. It then creates word vectors out of all the documents. The word vectors are then converted to SVMlight's format and given a positive or negative weight. A positive weight on a document vector means we consider it part of our category; a negative weight means it is outside of our category. After the text is written in this format, you can use SVMlight to train a model and then check whether unknown text belongs in this category.
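
The conversion described above can be sketched roughly as follows. This is a simplified illustration, not the Text2SVM source: the class SvmLightLineSketch and its toLine method are made up for this example, the tokenizer is a plain lowercase split with no stemming, and raw word counts are used as feature values. What is taken from SVMlight itself is the line layout: a target (+1 or -1) followed by index:value pairs in increasing index order.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not part of the Text2SVM source.
final class SvmLightLineSketch {

    // Word -> 1-based feature index, shared by every document we convert.
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Builds one SVMlight line for a document: "<label> <index>:<count> ...".
    String toLine(String text, boolean inCategory) {
        // TreeMap keeps feature indices sorted, as SVMlight expects.
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            Integer index = dictionary.get(token);
            if (index == null) {
                index = dictionary.size() + 1;
                dictionary.put(token, index);
            }
            counts.merge(index, 1, Integer::sum);
        }
        StringBuilder line = new StringBuilder(inCategory ? "+1" : "-1");
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            line.append(' ').append(entry.getKey()).append(':').append(entry.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        SvmLightLineSketch sketch = new SvmLightLineSketch();
        // prints "+1 1:2 2:1 3:1 4:1 5:1"
        System.out.println(sketch.toLine("the cat sat on the mat", true));
        // prints "-1 1:1 3:1 6:1"
        System.out.println(sketch.toLine("the dog sat", false));
    }
}
```

Each output line is one document. A file of such lines, with in-category documents labeled +1 and the unrelated ones labeled -1, is the kind of input SVMlight trains on.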

KNOWN ISSUES:
- Occasionally cuts the first letter off of a document.
- Uses a high amount of memory. To give Java more memory, add -Xmx256m before the name of the program you're trying to run. wc.java is known to run out of memory with large amounts of text.

Comments (9)

Hi,
Please be informed about the download's broken link:
http://www.deadawakemovie.com/ml/archives/files/Text2SVM.zip

We would be very thankful for an update or a fix.

Regards
Mahdi Akbar

I am sorry about that. I fixed the link today. I hope it didn't cause you any problems. Thanks for pointing out this problem to me.
Thanks,
Dan

Baurzhan:

Hello! This tool is nice. I found that it works only with UTF8 encoding. Is that right? I tried to convert text files to SVM format and it didn't work until I converted the encoding of my text files to UTF8.

Dan:

I was not aware of that, but I bet you're right that it only works on UTF8. I haven't worked on this code since sometime back in 2004, likely around when I first made this post. So I am glad you have found it useful.
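
For anyone hitting the same encoding problem, a workaround along the lines Baurzhan describes might look like the sketch below. It is not part of Text2SVM: the class ToUtf8Sketch and the convert method are made up for this example, and it assumes you already know the original encoding of your files (ISO-8859-1 is used here only as a placeholder default).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; not part of the Text2SVM source.
final class ToUtf8Sketch {

    // Decodes the source file with its original charset and rewrites it as UTF-8.
    static void convert(Path source, Charset sourceCharset, Path target) throws IOException {
        String text = new String(Files.readAllBytes(source), sourceCharset);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    // Usage: java ToUtf8Sketch <input file> <output file> [source charset name]
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ToUtf8Sketch <input> <output> [charset]");
            return;
        }
        // ISO-8859-1 is an assumed default; pass the real charset of your files.
        Charset sourceCharset = args.length > 2
                ? Charset.forName(args[2])
                : StandardCharsets.ISO_8859_1;
        convert(Paths.get(args[0]), sourceCharset, Paths.get(args[1]));
    }
}
```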

Dave:

Does anyone know where to get this file from? The link seems to be broken again. It looks like a really useful piece of software.

D.

Ola:

Hello,

Thank you very much for this useful software. Actually, I don't have any background in Java ...

So would you kindly tell me which file to run to start execution, and which software I could use...

Thank you...

ElComandante:

Hello,

this is exactly what I need for my study project!

Could you please upload that tool again? (the link is broken..)

thanks a lot!

I updated and fixed the link again... This code is really old, so there are probably much better solutions out there.

MyD:

Would you mind telling me how to use your program?

I have 2 categories, +class and -class. Furthermore, I have 2 text files: one containing the text for +class and the other the text for -class.

How does your program segment the text? I already segmented my text and it is separated by whitespace. Thanks in advance.

Regards,
MyD

Creative Commons License
This weblog is licensed under a Creative Commons License.