Machine Learning Archives

June 17, 2004

Initial Description of HAMCOD

This is a description of an idea that may become the next phase of my project as I continue my work with text categorization. Right now it is just an initial idea and outline, so that I have something to start from when I get to that point.

HAMCOD
Human Assisted Machine Categorized Open Directory

A collaborative, human-assisted, machine-categorized directory that will extend the functionality of the ODP (http://www.dmoz.com) project. This project will combine an already large and extensive base of human knowledge with text categorization and social collaboration techniques to increase the number of well categorized and defined websites, without the high level of human interaction that DMOZ requires. This project's goal is to eventually use machine learning techniques to replace the slow and time-consuming process of human categorization.

Humans will interact with the machine learning process as it contributes to and works with the categorized directory. They will be able to say when something is miscategorized and recommend a new category, or recommend deletion from the directory. The spider will crawl for new pages that aren't yet listed in the directory. If the spider discovers a new page, the system will attempt to categorize it to the lowest level of a category that it can.

So if there is a main category like "shoes" it might have a hierarchy like this:

Shoes: Nike, Reebok, Vans
Car: Ford, Toyota
Cheese: American, Cheddar


The system would first categorize at the top level: is this page about shoes? If it is determined to be about shoes, it would then use different models, knowing that the page is about shoes, to try to determine which shoe company the page is about.
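The two-stage idea above can be sketched in Java roughly as follows. This is only a minimal sketch under assumptions: the class and method names are made up for illustration, and the keyword scorers are placeholders standing in for trained models (for example one SVM per category), which is where the real work would be.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch: classify top-down, one directory level at a time. */
final class HierarchicalCategorizer {

    /** A directory node: its name, a scorer standing in for a trained model, and its children. */
    static final class Node {
        final String name;
        final ToDoubleFunction<String> scorer; // score > 0 means "the page belongs here"
        final List<Node> children = new ArrayList<>();

        Node(String name, ToDoubleFunction<String> scorer) {
            this.name = name;
            this.scorer = scorer;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Stand-in scorer (an assumption, not an SVM): +1 if the keyword occurs, else -1. */
    static ToDoubleFunction<String> keyword(String word) {
        return text -> text.toLowerCase(Locale.ROOT).contains(word) ? 1.0 : -1.0;
    }

    /** Follows the best positively scored child at each level; stops when no child accepts the page. */
    static String categorize(Node root, String pageText) {
        StringBuilder path = new StringBuilder();
        Node current = root;
        while (true) {
            Node best = null;
            double bestScore = 0.0;
            for (Node child : current.children) {
                double score = child.scorer.applyAsDouble(pageText);
                if (score > bestScore) {
                    best = child;
                    bestScore = score;
                }
            }
            if (best == null) {
                return path.toString(); // deepest category reached; empty means uncategorized
            }
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(best.name);
            current = best;
        }
    }

    /** The example hierarchy from this post, with keyword scorers as placeholders. */
    static Node exampleDirectory() {
        Node root = new Node("Top", text -> 1.0);
        root.add(new Node("Shoes", keyword("shoe"))
                .add(new Node("Nike", keyword("nike")))
                .add(new Node("Reebok", keyword("reebok")))
                .add(new Node("Vans", keyword("vans"))));
        root.add(new Node("Car", keyword("car"))
                .add(new Node("Ford", keyword("ford")))
                .add(new Node("Toyota", keyword("toyota"))));
        root.add(new Node("Cheese", keyword("cheese"))
                .add(new Node("American", keyword("american")))
                .add(new Node("Cheddar", keyword("cheddar"))));
        return root;
    }
}
```

A page that matches a top-level category but none of its children stays at the top level, which is the "lowest level of a category that it can" behavior described earlier.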

In its early stages it will only work through the Computers category of the DMOZ project, which already contains 149,512 websites. To first determine whether a site belongs in the Computers category, I will need to get about 400,000 websites at random from other parts of the DMOZ directory. Alternatively, I could assume that if I can't find an appropriate subcategory in Computers, the site doesn't belong in the Computers category at all.

Initially I will design the system with no login or registration required, since use will be practically nonexistent. Once use begins I will require an email address and require that address to be confirmed. This will make it rather hard for companies or individuals to artificially increase their rankings within the directory. Within each category the sites will need some sort of ranking system, but initially I won't worry about ranking the sites. I will just assume that as you get specific enough there will only be a small number of sites in each category. (This seems to be true for DMOZ.)

May 19, 2004

SVMMail vs Apple Mail

Well, it looks like my SVM mail filter is very similar to Apple's Mail. Apple's Mail also uses a vector representation and training to achieve 98%+ accuracy (the claim). After reading through this, it makes sense why my filter is achieving such good accuracy. They are using a slightly more complicated vector analysis tool than I am. I use SVMlight, while Apple is using LSA (Latent Semantic Analysis), which I used to work with, but I found the tools far less developed and harder to work with. It was causing me all sorts of problems to get LSA to do the simple clustering that I was doing very easily with SVMlight. The main reason it looks like they are using LSA is that they first reduce the vector space, and by using LSA on the reduced vectors they claim a major performance increase. This is really believable, especially since LSA offers quick tools for folding new information into a model instead of recreating it. Anyway, I am happy to learn that the approach I took with my mail filter appears to be very similar to that of one of the best computer companies out there. It gives me even more reason to believe that I am on the right track with all of my various vector-based learning algorithms.

Explaining the Apple Mail Filter

P.S. This means that my filter, which has barely been developed and has many possible enhancements, is achieving only about 0.5% less than Apple's filter!

May 11, 2004

SVM Mail testing

Today I got a chance to let SVM mail sort 252 messages that it had never seen before. The messages were moved successfully from the inbox to either my real folder or my spam folder. Out of the 252 messages the system incorrectly categorized 8 emails. Three of the 8 were actually the same message sent out from the Friendster network. I have added those messages to the real training model. This means that on the first really good real-world test my filter achieved 96.8% accuracy. This compares well to the model's 98% estimated accuracy, which came from categorizing the training data 98% correctly.
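For reference, the 96.8% figure is just the share of the 252 messages that were not among the 8 misclassified ones. A tiny sketch of that arithmetic (the class and method names are only for illustration):

```java
/** Illustrative helper for the accuracy arithmetic; not part of SVM mail itself. */
final class FilterAccuracy {

    /** Percentage of messages that were classified correctly. */
    static double accuracyPercent(int total, int misclassified) {
        if (total <= 0 || misclassified < 0 || misclassified > total) {
            throw new IllegalArgumentException("need 0 <= misclassified <= total and total > 0");
        }
        return 100.0 * (total - misclassified) / total;
    }

    public static void main(String[] args) {
        // 252 messages sorted, 8 of them put in the wrong folder: (252 - 8) / 252
        System.out.println(accuracyPercent(252, 8));
    }
}
```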

The system has now been through about 400 emails, and my accuracy is at 97.25%. I have only retrained the model once, including little bits of the new info I have collected. I placed all of the incorrectly identified emails in the proper categories and retrained the system. Then, just to see whether that would make it correctly identify those emails, I recategorized them. It cut the misclassifications in half, but some were still misclassified. It seems that forwarded messages with attachments are what it still misclassifies. Other emails seem good.

I am currently not working on or extending this project because it was just a testing project for some of my code. I am now focusing back on my News Shaker project, which uses SVM classification to create a Google News-like site.

May 7, 2004

SVMMail

I reached a great milestone with SVMMail today. I will be doing more tests and releasing more information next week. The initial results are 97% accuracy on the filter. Also, in real-world testing (so far a low number of 65 mails), there were only 2 errors (1 false positive) in prediction. I had training data of 550 real emails and 713 spam emails (all of which I collected in the 3 weeks or so that I have been working on this project). I am really excited that I have gotten past the stumbling block I was stuck on for the last 3 days, where I was actually getting 0% accuracy because of a bug that was generating a pretty much random model.

There is currently no web interface, and it is all just run directly from Java (JBuilder in my case). Later I will add features like that, along with the ability to track how many of each type of error my system makes.

This is great. I am really happy with how this is working out.

April 18, 2004

SVM Spam Filtering

Support Vector Machine (SVM) spam filtering. After working more with SVM, and with my software to utilize it moving along nicely, I am beginning work on a spam filter for my email. I have been getting a much larger amount of email in the last couple of months. The spam is highly recognizable and it is simple to see patterns in it: 90% of it is related to buying prescription drugs. I figured that SVM should be able to detect this very easily. Since I have already written a few Java email applications, I thought it would be easy enough to write a program that will let me log on and mark things as spam, and then use that to train an SVM model, which takes far less training data than many other machine learning methods. For now I have done a very simplistic design. The first step will be building a simple webmail system. It will really only be used to mark mail as spam and to check the spam folder to make sure there were no false positives. Then I will use my normal mail system to read and respond to the mail. When I first log into my simple webmail, all new mail will be downloaded and categorized. I will then be able to submit any missed items as spam. Anything newly categorized as spam or not will be placed in either the spam or the real mail section of the database. The database will store who the mail was from, the subject, and the body of the message.

I have now created two folders in my mail system: old and spam. I have moved all my old real mail to the old folder and all spam to the spam folder. I have about 300 spam messages and about 520 real messages. I have finished the code for the mail checking and sorting using Java. I have now imported my Text2SVM software and used it to create the original model file. Now, similarly to my News Shaker software, all that is left to do is autocategorization. For spam this should be a little simpler since there are only two classification levels. This should let me learn the process that I will need to use for News Shaker, but at a slightly simpler level. I created a model using 170 real messages vs. spam. With it I was getting 80% recognition on my test data. I think that with some simple fixes in the way I am creating my text from my mail, plus stop listing and stemming, I should be able to increase this rate. I should also be able to increase the rate once I add more training data.

(This was created as a subproject to create and test code that will be used for News Shaker.)
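As a rough illustration of "creating my text from my mail" with stop listing, here is a minimal Java sketch. It is an assumption about how this step could look, not the actual SVM mail code: the class name, the tiny stoplist, and the choice to split on anything that is not a letter or digit are all made up for the example, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: turn one stored mail (from, subject, body) into a list of word tokens. */
final class MailTokenizer {

    // Tiny illustrative stoplist; a real one would be much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "to", "of", "is", "in", "for", "you"));

    /** Lowercases the three stored fields, splits on non-alphanumerics, and drops stop words. */
    static List<String> tokenize(String from, String subject, String body) {
        String text = (from + " " + subject + " " + body).toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        for (String word : text.split("[^a-z0-9]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }
}
```

The resulting token list is the kind of input that would then be counted into a word vector for the spam or real class.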


Here is my initial design.


gif_1.gif

March 9, 2004

Text2SVM

This is the main page for "Text2SVM", a Java program that converts text documents to the format required for SVMlight. The project is currently in beta. I figured that I should release an early working version for anyone else who has been finding it very difficult to find software to help them work with SVMs. This project is distributed as is with no guarantees. So use it and try it out. Right now it is written very specifically to handle conversion of text files to a single format for SVMlight. It should be capable of being edited to support many other formats, though. To read more about SVMlight, visit their page. If you make any good modifications or changes to this program I would love to hear about it; please contact me at ddmayer (at) colorado (dot) edu (spam protection). You can also leave comments on this page so other users can help solve your problems.

To download the code click here.

This first release, 0.001, includes these features:
-Import an entire directory of files
-Save the dictionary to generate other word vectors based on the same SVM model
-Export files in SVMlight format
-Stemming to increase the occurrence counts of related words (like stopping and stopped)
-The ability to import saved dictionaries and continue adding to the model
-Debugging code left in

What is SVM, and why would I want to convert text to a word vector?
SVM stands for Support Vector Machines. They are a machine learning method for classification. In this example we are working with text categorization using SVM. This program's purpose is to take a bunch of documents in a category and a bunch of documents unrelated to that category. It then creates word vectors out of all the documents. The word vectors are then converted to SVMlight's format and given either a positive or a negative value. Any time a document vector has a positive value, that means we consider it part of our category. A negative value means it is outside of our category. After the text is written in this format you can use SVMlight to compare unknown text and see if it belongs in this category.
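To make the target format concrete: each line of an SVMlight input file is a target value (+1 or -1) followed by feature:value pairs, with feature numbers starting at 1 and listed in increasing order. The sketch below writes one such line. It is a hedged illustration, not the Text2SVM source; the class name and the four-decimal formatting are my own choices for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: format one document's word vector as a line of SVMlight input. */
final class SvmLightLine {

    /**
     * @param inCategory true for a positive example (+1), false for a negative one (-1)
     * @param features   feature id (1-based) mapped to its value for this document
     */
    static String format(boolean inCategory, Map<Integer, Double> features) {
        StringBuilder line = new StringBuilder(inCategory ? "+1" : "-1");
        // TreeMap sorts by feature id, since SVMlight expects ids in increasing order.
        for (Map.Entry<Integer, Double> entry : new TreeMap<>(features).entrySet()) {
            int id = entry.getKey();
            double value = entry.getValue();
            if (id < 1) {
                throw new IllegalArgumentException("feature ids start at 1, got " + id);
            }
            if (value == 0.0) {
                continue; // the format is sparse, so zero entries are simply left out
            }
            line.append(' ').append(id).append(':')
                .append(String.format(Locale.ROOT, "%.4f", value));
        }
        return line.toString();
    }
}
```

A file of such lines is what SVMlight's svm_learn reads to train a model, and svm_classify reads lines in the same format when checking unknown text against that model.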

KNOWN ISSUES:
- Occasionally cuts the first letter off of a document.
- Uses a high amount of memory. To run the program, add -Xmx256m to the java command before the name of the program you're trying to run. wc.java is known to run out of memory with large amounts of text.

March 4, 2004

Text to vectors progress

I finished the basic text to vector code, but in its current form it is kind of useless. It only outputs to System.out, in a format that isn't useful yet. It also doesn't save the dictionary that it uses to generate the vectors, so you can't generate any new word vectors to run comparisons with. I will be adding features to save the generation dictionary, to import a saved dictionary, and to output the feature word vectors in the SVMlight format. I also recently found some cool Java stemming software that I will add to my project after I get the first usable version out. If you're interested in the stemming software, here are the links:

Porter Stemming Algorithm

Lancaster Stemmer

Converting documents to word vectors

After unsuccessfully searching all over the web for source code that would do this, I began writing my own code to convert all of my documents to word vectors based on an overall word space. I have a program that came with LSA called Pindex that does this, but for LSA, and it seems to give some odd results that I am not sure are compatible with SVMs. (I know that the formatting is different, and I will be converting it to SVMlight formatting, but I am not sure if the data is valid.) I am now making a word vector for every document, in which each entry is the ratio of a word's count in the document to the total count of that word across the whole space. I will then have a separate Java app that allows you to give it two sets of documents, and it will generate the proper SVMlight input file, giving one set of documents the positive value and the other set the negative. Then I should be able to take any new text, create its word vector based on the same dictionary, and then run comparisons in SVMlight. I am hoping to have the text to vector code done this weekend. It shouldn't be too difficult. Then I will release it on the web, since I had a next to impossible time trying to find anything like this on the web.
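Here is a minimal sketch of the dictionary and word vector step. It assumes one particular reading of the description above: each vector entry is the word's count in the document divided by that word's total count across the whole space. The class and method names are hypothetical, and words that are not in the dictionary are simply ignored when building a vector for new text.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: a shared dictionary (word space) and ratio-based word vectors. */
final class WordSpace {
    private final Map<String, Integer> ids = new LinkedHashMap<>(); // word -> feature id, starting at 1
    private final Map<Integer, Integer> totals = new TreeMap<>();   // feature id -> count across the space

    /** Adds one document's tokens to the space, growing the dictionary as new words appear. */
    void addDocument(List<String> tokens) {
        for (String token : tokens) {
            Integer id = ids.get(token);
            if (id == null) {
                id = ids.size() + 1;
                ids.put(token, id);
            }
            totals.merge(id, 1, Integer::sum);
        }
    }

    /** Builds a sparse vector: count in this document / total count of that word in the space. */
    Map<Integer, Double> vectorFor(List<String> tokens) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            Integer id = ids.get(token);
            if (id != null) { // unknown words have no feature id, so they are skipped
                counts.merge(id, 1, Integer::sum);
            }
        }
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            vector.put(entry.getKey(), entry.getValue() / (double) totals.get(entry.getKey()));
        }
        return vector;
    }
}
```

Because new text is mapped through the same dictionary, its vector uses the same feature numbering as the training documents, which is what makes later comparisons meaningful.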

March 2, 2004

Space Altering

I ran some initial tests that would tell me how closely related all the documents within a category were. We were looking for numbers in the .75 range. The first attempt actually returned .339 for one category and .328 for another. These were rather low. To improve the mean of the categories, we thought creating a larger space with a larger overall vocabulary would help the documents be more related. We increased the space from 987 total documents to 1798 documents. I didn't add any documents to either of the test categories, only to the overall space (documents that were really not related to either category). I also added a simple stoplist filtering out some common but irrelevant words. This didn't seem to help the relational mean of the two categories. The first category dropped from .339 to .320 while the second increased from .338 to .349. Now I have to decide whether to greatly increase the space, or greatly increase the documents within the categories (which I have to do by hand, so it is a slow and time-consuming process).
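One way to get a single "how related is this category" number is the mean cosine similarity over every pair of document vectors in the category. The sketch below shows that calculation. It is an assumption that this matches the measure used for the numbers above; the class and method names are made up for the example.

```java
/** Hypothetical sketch: mean pairwise cosine similarity of the documents in one category. */
final class CategoryCohesion {

    /** Cosine of the angle between two equal-length vectors; 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Averages the cosine over every unordered pair of documents. */
    static double meanPairwiseSimilarity(double[][] docs) {
        if (docs.length < 2) {
            throw new IllegalArgumentException("need at least two documents");
        }
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < docs.length; i++) {
            for (int j = i + 1; j < docs.length; j++) {
                sum += cosine(docs[i], docs[j]);
                pairs++;
            }
        }
        return sum / pairs;
    }
}
```

With a measure like this, adding unrelated documents to the overall space only changes the result indirectly (through the vocabulary and any weighting), which may be part of why the means barely moved.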

March 1, 2004

Very cool clustering

This isn't quite related to my own machine learning research, but it is cool to see what others out there are doing with machine learning. This is a very cool search engine that uses text clustering.

http://vivisimo.com

Setting Up

This is going to be a research blog kept by Dan Mayer to show his progress on his computer science research into machine learning.

I will be setting up a collection of all the work and research that I have done on machine learning. I am working on a project for the L3D labs at the University of Colorado. I am working specifically with text categorization, currently focused on using LSA and SVM to categorize. I will post all the links, projects, examples, and resources that I have used and learned from during my progress on this project.

This is documenting the work on my project News Shaker, which will be similar to Google News. It uses a web spider and text categorization to organize the data into particular categories. The categories I am currently working with are special education related. The topics of the categories were chosen so that this project would become a subproject of Web2gether, which is a site to help the special education community.

Currently all of the categories seen on the News Shaker site are manually sorted and loaded into the system after the spider crawls specific topics on the web. I am now working with LSA and SVMs to begin the text categorization of newly crawled websites and results. This project is also setting up an easy to use, web-managed content management system. This would allow for the quick setup and design of any topic-based categorization using the methods developed while creating the original News Shaker, which is focused on special education.
