June 2004 Archives

June 16, 2004

Done testing and work on SVMMail

After going through a little less than 3000 emails, I have finished testing and working on SVMMail. It is still sitting at 97.5% accuracy. I am sure this could be increased, but I need to move all of my focus back to my primary project, News Shaker.

On the News Shaker front, I have added about 350 new manually categorized sites across the database. I am going to rebuild all of the models and see if the increased training data brings my accuracy up to a more reasonable level. Once the accuracy is a little better, I will begin writing the auto-categorization code and just let the system go crazy, to see how many sites it can categorize correctly when left to its own devices. Should be an interesting time next week. That is, if the machine boots up. Someone was working on my system and now it freezes on boot. I am sure all my data is still there, and I have a fairly recent backup, but hopefully this can be sorted out before the beginning of next week.

I have also begun reading Managing Gigabytes, which is about compressing and indexing documents and images. I also ordered a new book about machine learning and artificial intelligence that I will begin reading soon. Perhaps they will give me some new ideas on how to improve my system.

June 17, 2004

Initial Description of HAMCOD

This is a description of an idea that may be the next phase of my project as I continue my work with text categorization. Right now it is just an initial outline, so that I have something to begin working with when I get to that point.

HAMCOD
Human Assisted Machine Categorized Open Directory

A collaborative, human-assisted, machine-categorized directory that will extend the functionality of the ODP (http://www.dmoz.com) project. This project will combine an already large and extensive base of human knowledge with text categorization and social collaboration techniques to increase the number of well-categorized and well-defined websites, without the high level of human interaction that DMOZ requires. The project's goal is to eventually use machine learning techniques to replace the slow and time-consuming process of human categorization.

Humans will interact with the machine learning process as it contributes to and works with the categorized directory. They will be able to flag a site as miscategorized and recommend a new category, or recommend deleting it from the directory. The spider will crawl for new pages that aren't yet listed. If the spider discovers a new page, the system will attempt to categorize it to the lowest level of a category that it can.

So if there is a main category like "shoes" it might have a hierarchy like this:

Shoes: Nike, Reebok, Vans
Car: Ford, Toyota
Cheese: American, Cheddar


The system would first categorize at the level of "is this page about shoes?" If the page is determined to be about shoes, the system would then use different models, knowing that the page is about shoes, to try to determine which shoe company the page is about.
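A rough sketch of that two-stage idea, to make it concrete. Everything here is an assumption for illustration: the keyword scorer is a toy stand-in for a trained model, and the names (keyword_scorer, best_match, categorize) and the 0.5 threshold are made up rather than taken from any existing code.

```python
from typing import Callable, Dict, List, Optional

# A scorer takes page text and returns a confidence between 0 and 1.
Scorer = Callable[[str], float]


def keyword_scorer(keywords: List[str]) -> Scorer:
    """Toy stand-in for a trained model: fraction of keywords found in the text."""
    def score(text: str) -> float:
        words = set(text.lower().split())
        return sum(k in words for k in keywords) / len(keywords)
    return score


def best_match(text: str, scorers: Dict[str, Scorer], threshold: float) -> Optional[str]:
    """Return the highest-scoring label, or None if nothing reaches the threshold."""
    if not scorers:
        return None
    label, score = max(((name, s(text)) for name, s in scorers.items()),
                       key=lambda pair: pair[1])
    return label if score >= threshold else None


def categorize(text: str,
               top_level: Dict[str, Scorer],
               sub_level: Dict[str, Dict[str, Scorer]],
               threshold: float = 0.5) -> List[str]:
    """Classify top-down and stop at the deepest level that can be decided."""
    path: List[str] = []
    top = best_match(text, top_level, threshold)
    if top is None:
        return path  # not confident about any main category
    path.append(top)
    # Only the models that belong to the chosen main category are consulted.
    sub = best_match(text, sub_level.get(top, {}), threshold)
    if sub is not None:
        path.append(sub)
    return path


# Hypothetical models for part of the hierarchy above.
top_level = {
    "Shoes": keyword_scorer(["shoes", "sneakers"]),
    "Car": keyword_scorer(["car", "engine"]),
}
sub_level = {
    "Shoes": {
        "Nike": keyword_scorer(["nike"]),
        "Reebok": keyword_scorer(["reebok"]),
    },
}

print(categorize("new nike running shoes", top_level, sub_level))
print(categorize("discount shoes and sneakers", top_level, sub_level))
```

A page that is clearly about shoes but matches no company stays at the "Shoes" level instead of being forced into a subcategory.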

In its early stages it will only work through the Computers category of the DMOZ project, which already contains 149,512 websites. To first determine if a site belongs in the Computers category, I will need to get about 400,000 websites at random from other parts of the DMOZ directory. Alternatively, I could assume that if I can't find an appropriate subcategory in Computers, the site doesn't belong in the Computers category at all.

Initially I will design the system with no login or registration required, since use will be practically nonexistent. Once use begins, I will require an email address and require that the address be confirmed. This will make it rather hard for companies or individuals to artificially increase their rankings within the directory. Within each category the sites will need some sort of ranking system, but initially I won't worry about ranking the sites. I will just assume that as you get specific enough there will only be a small number of sites in each category. (This seems to be true for DMOZ.)

June 18, 2004

New NewsShaker Feature

After weeks of meaning to add this feature, I finally did it. It actually took me less than an hour, when I thought I was going to have to write all sorts of new code and that everything would somehow end up being far more complex than I wanted it to be.

It is a simple feature: instead of telling the system to crawl an entire site, you can now tell it to add a single page to the database. This makes things easier when I find an article that should be added but links to entirely useless data. So I am glad I finally took the time to add this simple feature. It was also good to see that I still remember a lot more of the code structure of the spidering system than I would have thought.

Starting next week I am going to finish making the system entirely automated. I should be able to finish that in a couple of days. Then I am going to make the system very general, so it doesn't have to remain so specific to special education. The same code base for NewsShaker would then be adaptable to other systems such as the HAMCOD project (which is a horrible name, but since I am more interested in just working on the idea for now, I am not going to spend any time working on a name until it is successful). Man, I could make some amazing progress on the system if I could get about 3 people coding on these machine learning systems. Oh well, it is good to be back and making some progress again.
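For anyone curious what the difference between the two modes looks like, here is a minimal sketch. It is not the actual spider code: pages_to_add is a made-up name, the links dictionary is a hypothetical in-memory link graph standing in for real page fetching, and the example.com URLs are placeholders.

```python
from collections import deque
from typing import Dict, List
from urllib.parse import urlparse


def pages_to_add(start_url: str, links: Dict[str, List[str]], single_page: bool) -> List[str]:
    """Return the URLs that would be stored for one request.

    `links` maps a URL to the URLs it links to (a stand-in for fetching pages).
    """
    if single_page:
        return [start_url]  # store the page itself and follow nothing

    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    order: List[str] = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for nxt in links.get(url, []):
            # Stay on the starting site and skip pages already queued.
            if urlparse(nxt).netloc == host and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Hypothetical link graph: page "a" links to "b" and to another site.
links = {
    "http://example.com/a": ["http://example.com/b", "http://other.com/x"],
    "http://example.com/b": ["http://example.com/a"],
}

print(pages_to_add("http://example.com/a", links, single_page=True))
print(pages_to_add("http://example.com/a", links, single_page=False))
```

In single-page mode the links on the article are never followed, so whatever it points to stays out of the database.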

June 29, 2004

News Shaker

Over the last couple of weeks I have done a lot of work on News Shaker, including lots of testing. I have brought all of the categories (about 12) to an approximate average accuracy of 88%. For 12 categories this is really good.

First I began by adding more and more data to the categories and rebuilding the models. This initially increased the accuracy, but it ceased to help after all of the categories had about 90 documents in them. I then began to play with the weight of the positive terms. This was highly successful: after increasing the weighting on all of the positive training vectors, I could take all of the training data and recategorize it with 88% accuracy, with the remaining documents not wrongly categorized but declared to be of an unknown category.

I then started real-world testing, giving each of my categories unseen documents that had been hand-categorized. The results for the few real-world tests I have done so far have been fairly poor, showing only 15-20% accuracy. I am not sure why varying how the model is made dramatically increases the accuracy on known documents but seems to have no effect on unseen documents. This is the problem I am currently working on. It is possible to get the accuracy on known documents to range from 85% to 93%. After a little more real-world testing and some other discussion, I might be able to come to a conclusion as to what is going on between known and unknown examples.
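A small sketch of how the known-versus-unseen comparison could be tallied, keeping "unknown" answers separate from wrong ones. The function name tally and the labels in the example are made up for illustration, and None is assumed to mean the system declared a document to be of an unknown category.

```python
from typing import Dict, List, Optional


def tally(predicted: List[Optional[str]], actual: List[str]) -> Dict[str, float]:
    """Split results into correct, wrong, and unknown fractions.

    A prediction of None means the system declined to pick a category.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    if not actual:
        raise ValueError("need at least one document")
    total = len(actual)
    correct = sum(p == a for p, a in zip(predicted, actual))
    unknown = sum(p is None for p in predicted)
    wrong = total - correct - unknown
    return {
        "correct": correct / total,
        "wrong": wrong / total,
        "unknown": unknown / total,
    }


# Illustrative labels only, not real results.
predicted = ["a", "b", None, "a"]
actual = ["a", "b", "b", "b"]
print(tally(predicted, actual))
```

Running the same tally once on the training documents and once on documents the models have never seen gives two sets of numbers to compare. A large gap between them is often a sign that the models fit the training documents too closely, though that is only one possible explanation.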

Creative Commons License
This weblog is licensed under a Creative Commons License.