March 2004 Archives

March 1, 2004

Setting Up

This is going to be a research blog kept by Dan Mayer to show his progress on his computer science research into machine learning.

I will be setting up a collection of all the work and research that I have done on machine learning. I am working on a project for the L3D labs at the University of Colorado. I am working specifically with text categorization, currently focused on using LSA and SVM to categorize. I will post all the links, projects, examples, and resources that I have used and learned from during my progress on this project.

This documents the work on my project News Shaker, which will be similar to Google News. It uses a web spider and text categorization to organize the data into particular categories. The categories I am currently working with are special education related. The topics of the categories were chosen so that this project would become a subproject of Web2gether, which is a site to help the special education community.

Currently all of the categories seen on the News Shaker site are manually sorted and loaded into the system after the spider crawls specific topics on the web. I am now working with LSA and SVMs to begin the text categorization of newly crawled websites and results. This project is also setting up an easy-to-use, web-managed content management system. This would allow for the quick setup and design of any topic-based categorization using the methods developed while creating the original News Shaker, which is focused on special education.

Very cool clustering

This isn't quite related to my own machine learning research, but it is cool to see what others out there are doing with machine learning. This is a very cool search engine that uses text clustering.

http://vivisimo.com

March 2, 2004

Space Altering

I ran some initial tests to tell me how closely related the documents within a category were. We were looking for numbers in the .75 range. The first attempt actually returned .339 for one category and .328 for another. These were rather low. To improve the category means, we thought creating a larger space with a larger overall vocabulary would help the documents appear more related. We increased the space from 987 total documents to 1798 documents. I didn't add any documents to either of the test categories, only to the overall space (documents that were really not related to either category). I also added a simple stoplist filtering out some common but irrelevant words. This didn't seem to help the mean relatedness of the two categories. The first category dropped from .339 to .320 while the second increased from .338 to .349. Now I have to decide whether to greatly increase the space or greatly increase the documents within the categories (which I have to do by hand, so it is a slow and time-consuming process).
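
For readers who want to reproduce a score like this, here is a minimal sketch. It assumes the per-category number is the mean pairwise cosine similarity between document vectors; the post does not say exactly which measure the LSA tools reported, and the class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical helper, not part of the LSA tools mentioned in the post.
final class CategoryCohesion {

    // Cosine of the angle between two equal-length vectors; 0 if either is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Average cosine over every unordered pair of documents in one category.
    static double meanPairwiseCosine(List<double[]> docs) {
        if (docs.size() < 2) {
            throw new IllegalArgumentException("need at least two documents");
        }
        double sum = 0;
        int pairs = 0;
        for (int i = 0; i < docs.size(); i++) {
            for (int j = i + 1; j < docs.size(); j++) {
                sum += cosine(docs.get(i), docs.get(j));
                pairs++;
            }
        }
        return sum / pairs;
    }
}
```

Under this assumption, adding unrelated documents to the space only changes a category's score indirectly, through the vectors it produces for the category's own documents, which may be why the means barely moved.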

March 4, 2004

Converting documents to word vectors

After unsuccessfully searching all over the web for source code that would do this, I began writing my own code to convert all of my documents to word vectors based on an overall word space. I have a program that came with LSA called Pindex that does this, but it is built for LSA and gives some odd results that I am not sure are compatible with SVMs. (I know that the formatting is different, which I will be converting to SVMlight formatting, but I am not sure if the data is valid.) I am now making a word vector for every document, where each entry is just the ratio of the document's count of a word to the total count of that word in the whole space. I will then have a separate Java app that takes two sets of documents and generates the proper SVMlight input file, giving one set of documents the positive value and the other set the negative. Then I should be able to take any new text, create its word vector based on the same dictionary, and run comparisons in SVMlight. I am hoping to have the text-to-vector code done this weekend. It shouldn't be too difficult. Then I will release it on the web, since I had a next to impossible time trying to find anything like this.
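
A rough sketch of the conversion described above, not the actual Text2SVM source. It assumes each feature value is a word's count in the document divided by that word's total count across the whole space, and that feature numbers follow the order in which words were first seen. The class and method names are made up for illustration. The output follows the SVMlight input format: a label, then `index:value` pairs with indices starting at 1 and in ascending order.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of text-to-vector conversion.
final class WordVectorSketch {

    // Lowercases the text and counts each run of letters or digits as one word.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Total count of every word across the whole space, kept in first-seen order.
    // The position of a word in this map (starting at 1) is its feature number.
    static Map<String, Integer> spaceTotals(List<String> allDocs) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String doc : allDocs) {
            countWords(doc).forEach((word, count) -> totals.merge(word, count, Integer::sum));
        }
        return totals;
    }

    // Builds one SVMlight line for a document; totals must come from spaceTotals.
    static String toSvmLightLine(int label, String doc, Map<String, Integer> totals) {
        Map<String, Integer> counts = countWords(doc);
        StringBuilder line = new StringBuilder(label > 0 ? "+1" : "-1");
        int featureIndex = 0;
        for (Map.Entry<String, Integer> entry : totals.entrySet()) {
            featureIndex++;                      // SVMlight feature numbers start at 1
            Integer inDoc = counts.get(entry.getKey());
            if (inDoc == null) continue;         // sparse format: zero features are omitted
            double ratio = (double) inDoc / entry.getValue();
            line.append(' ').append(featureIndex).append(':')
                .append(String.format(Locale.ROOT, "%.4f", ratio));
        }
        return line.toString();
    }
}
```

Words in a new document that are missing from the dictionary are simply ignored here, which is one simple way to keep new vectors comparable with the ones the model was trained on.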

Text to vectors progress

I finished the basic text-to-vector code, but in its current form it is kinda useless. It only outputs to System.out, in a format that isn't useful yet. It also doesn't save the dictionary that it uses to generate the vectors, so you can't generate any new word vectors to run comparisons with. I will be adding features to save the generation dictionary, to import a saved dictionary, and to output the word vectors in the SVMlight format. I also recently found some cool Java stemming software that I will add to my project after I get the first usable version out. If you're interested in the stemming software, here are the links:

Porter Stemming Algorithm
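
One simple way the planned save and import features could work is sketched below. The file layout (one `word<TAB>count` line per entry) and the class name are assumptions for illustration, not the format Text2SVM ended up using.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dictionary persistence; the tab-separated layout is an assumption.
final class DictionaryStore {

    // Writes one "word<TAB>count" line per entry, in the map's iteration order.
    static void save(Map<String, Integer> dictionary, Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : dictionary.entrySet()) {
            lines.add(entry.getKey() + "\t" + entry.getValue());
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    // Reads the file back into a map that keeps the saved order, so each word
    // keeps the same position (and therefore the same feature number).
    static Map<String, Integer> load(Path file) throws IOException {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            if (line.isEmpty()) continue;
            String[] parts = line.split("\t", 2);
            dictionary.put(parts[0], Integer.parseInt(parts[1]));
        }
        return dictionary;
    }
}
```

Keeping the word order stable matters more than the counts: a new document's vector only lines up with a trained model if every word maps to the same feature number it had at training time.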

Lancaster Stemmer

March 8, 2004

Week of stuff

Well, this is going to be a busy week. Instead of doing homework on Saturday, I went to Ft. Collins because my friend Matt wanted me to come up. Also, since Steve is in town I have been hanging out with him and doing stuff, and therefore not getting my work done. Today, oddly enough, instead of spending my time working on my various school projects, I worked for 6 hours on my work project. I accomplished quite a bit and I am now ready to introduce my Machine Learning blog. It follows my work and progress on some projects that I am working on and serves as a center for most of my coding and source code.

I am sure that doesn't interest 90% of the people that read my blog, but oh well. I am planning to be really busy until spring break, which is kind of annoying. I wanted to ask this girl out, but I really don't have any time to go out on a date before spring break. Also, the whole spring break thing is still up in the air. I seriously have no clue what I will be doing. Scott and Dom still don't know when they are off work, how much time they will have and such. Also, I am pretty sure they both would rather just spend it with their girlfriends. So I am really thinking about just going to Vegas with some of my other friends. Also Megan, one of my good friends from high school, is coming to Colorado the same week as my break for her spring break. I would really like to see her, but it is likely that no matter what I do I will be out of town. Maybe I should just flip a coin a couple of times and figure it out.

Well now I am off to a statistics review session.. yeaaa school from 7-9! You got to love being inside the entire day when it is beautiful outside.

March 9, 2004

Text2SVM

This is the main page for "Text2SVM", a Java program that converts text documents to the format required for SVMlight. The project is currently in beta. I figured that I should release an early working version for anyone else who has been finding it very difficult to find software to help them work with SVMs. This project is distributed as is, with no guarantees. So use it and try it out. Right now it is written very specifically to handle converting text files to a single format for SVMlight. It should be capable of being edited to support many other formats, though. To read more about SVMlight, visit their page. If you make any good modifications or changes to this program I would love to hear about it; please contact me at ddmayer (at) colorado (dot) edu (spam protection). You can also leave comments on this page so other users can help solve your problems.

To download the code click here.

This first release, 0.001, includes these features:
- Import an entire directory of files
- Save the dictionary to generate other word vectors based on the same SVM model
- Export files in SVMlight format
- Stemming to increase the occurrence of words like (stopping and stopped)
- The ability to import saved dictionaries and continue adding to the model
- Debugging code left in

What is SVM, and why would I want to convert text to a word vector?
SVM stands for Support Vector Machines. They are a machine learning method for classification. In this example we are working with text categorization using SVM. This program's purpose is to take a bunch of documents in a category and a bunch of documents unrelated to that category. It then creates word vectors out of all the documents. The word vectors are then converted to SVMlight's format and given a positive or a negative weight (label). Any time a document vector has a positive weight, that means we consider it part of our category. A negative weight means it is outside of our category. After the text is written in this format you can use SVMlight to compare unknown text and see if it belongs in this category.
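
To make the positive/negative idea concrete, here is a small sketch of reading results back. It assumes the unknown documents have already been run through SVMlight's svm_classify, which writes one decision value per line to its predictions file; a value above zero is read as "in the category". The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for the lines of an SVMlight predictions file.
final class PredictionReader {

    // Returns true for each document whose decision value is above zero,
    // meaning the model places it inside the category.
    static List<Boolean> inCategory(List<String> predictionLines) {
        List<Boolean> result = new ArrayList<>();
        for (String line : predictionLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;   // skip blank lines
            result.add(Double.parseDouble(trimmed) > 0);
        }
        return result;
    }
}
```

The size of the value is also useful: a value far from zero is a more confident call than one close to it.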

KNOWN ISSUES:
- Occasionally cuts the first letter off of a document.
- Uses a high amount of memory. To run the program, add -Xmx256m to the java command before the name of the program you're trying to run. wc.java is known to run out of memory with large amounts of text.

March 29, 2004

NewsShaker: Features needed / bug report V.001

Features:
* from empty DB to fully functional (no manual data entry)
* No code should contain hard-coded values referencing categories
* Easy way to define where search should put results by default (in a category or in the uncategorized)
* Ability for user to say they think the categorization is incorrect (if enough users say this it should be moved to the new category)
* Store all SVM results in the database
* a way to visualize the distances between the categories
* how many categorizations per category have been correct or incorrect
* a way to start over the correct and incorrect after generating new SVM models
* SVM weighting with C towards the positive examples
* a way to let a user create their own account and their own categories to manage themselves
* When administrator is adding a URL to crawl, they should be able to pick the depth to crawl and the default category that all results will be placed in.
* Search entire database or by category using Lucene
* Ability to add single page to a category
* More administration features
* Ability to start or stop auto categorization
* caching the front page
* Text2SVM integration
* Text2SVM configuration file
* ability to create the SVM categories from the web as administrator
* create only the dictionary and store it as one function
* use stored dictionary to create all needed spaces for SVM models
* ability to distinguish between multiple categories instead of single boolean.
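
For the last feature in the list, one common approach is one-vs-rest: train a separate yes/no SVM model per category and pick the category whose model gives the highest decision value. The sketch below assumes those per-category values are already available; the class name and the fallback label are hypothetical.

```java
import java.util.Map;

// Hypothetical one-vs-rest chooser over per-category SVM decision values.
final class OneVsRest {

    // Returns the category with the highest decision value above zero,
    // or the fallback (for example "uncategorized") if no model says yes.
    static String pick(Map<String, Double> decisionValues, String fallback) {
        String best = fallback;
        double bestValue = 0.0;   // a category must score above zero to win
        for (Map.Entry<String, Double> entry : decisionValues.entrySet()) {
            if (entry.getValue() > bestValue) {
                best = entry.getKey();
                bestValue = entry.getValue();
            }
        }
        return best;
    }
}
```

The fallback also covers the "uncategorized" default mentioned earlier in the list.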

Bugs:
* If the user chooses to move a page from one category to another but doesn't choose a new category it should do nothing and give them an error.
* Crawling from the web doesn't work anymore
* SVM first word spacing?? with the first character removal??? is this still a bug?
* Counts for the categories should be switched to be autocounted
* Text2SVM runs out of memory on large examples
* front page loads too slowly
* categories are manually counted; let MySQL do the counting!

March 30, 2004

FIRST GOOD SUCCESS!!

Using my Text2SVM, after learning how to increase the Java memory size so that I could do a large test, was very successful. I used about 2,700 documents and then put them into SVMlight. I used 90% training data and 10% testing data, leaving around 270 tests. I achieved 95% accuracy in training the SVM to recognize one text category from another. I am highly excited! Today I tested with my largest category; tomorrow I will be testing with several of my smaller categories. If these trends continue.... eeehhhh. I was really only hoping to achieve over 80% correct. So let's hope this is the start of something truly wonderful. I know that results continuing this high are very unlikely, but I am still really excited about the first good results. I am pumped about doing more testing tomorrow!!
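
A sketch of the kind of 90/10 test described above, under a few assumptions: the split is a seeded random shuffle (the post does not say how the documents were divided), labels are +1 or -1, and accuracy is the share of test documents whose decision value has the same sign as the true label. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical holdout evaluation helper.
final class HoldoutEvaluation {

    // Shuffles a copy of the examples with a fixed seed and returns
    // a two-element list: [training examples, testing examples].
    static <T> List<List<T>> split(List<T> examples, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled, new Random(seed));
        int trainSize = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainSize)));
        parts.add(new ArrayList<>(shuffled.subList(trainSize, shuffled.size())));
        return parts;
    }

    // Fraction of test examples whose decision value agrees in sign
    // with the true label (+1 or -1).
    static double accuracy(int[] trueLabels, double[] decisionValues) {
        if (trueLabels.length == 0 || trueLabels.length != decisionValues.length) {
            throw new IllegalArgumentException("labels and values must be non-empty and the same length");
        }
        int correct = 0;
        for (int i = 0; i < trueLabels.length; i++) {
            boolean predictedPositive = decisionValues[i] > 0;
            if (predictedPositive == (trueLabels[i] > 0)) correct++;
        }
        return (double) correct / trueLabels.length;
    }
}
```

With 2,700 documents and a training fraction of 0.9, this gives 2,430 training documents and 270 test documents, matching the numbers in the post.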

Current Percent levels

These are the current success levels for models based on different categories. These models were built with the current data in the News Shaker database. They were converted to SVMlight models with the use of Text2SVM without stemming enabled.

Creative Commons License
This weblog is licensed under a Creative Commons License.