This is a description of an idea that may become the next phase of my project as I continue my work with text categorization. Right now it is just an initial outline, so that I have something to start from when I get to that point.

HAMCOD
Human Assisted Machine Categorized Open Directory

A collaborative, human-assisted, machine-categorized directory that will extend the functionality of the ODP (http://www.dmoz.com) project. This project will combine an already large and extensive base of human knowledge with text categorization and social collaboration techniques to increase the number of well-categorized and well-defined websites, without the high level of human interaction that DMOZ requires. The project’s goal is to eventually use machine learning techniques to replace the slow and time-consuming process of human categorization.

Humans will interact with the machine learning process as it contributes to and works with the categorized directory. They will be able to flag a site as miscategorized and recommend a new category for it, or recommend that it be deleted from the directory.

The spider will crawl for new pages that aren’t listed in the directory. If the spider discovers a new page, the system will attempt to categorize it to the lowest level of a category that it can. So if there is a main category like “shoes”, the hierarchy might look like this:

Shoes: Nike, Reebok, Vans
Car: Ford, Toyota
Cheese: American, Cheddar

The system would first categorize at the top level: is this page about shoes? If it is determined to be about shoes, it would then use different models, knowing that the page is about shoes, to try to determine which shoe company the page is about.

In its early stages it will only work through the Computers category of the DMOZ project, which already contains 149,512 websites.
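The top-down categorization described above (first pick the main category, then pick among only that category's children) can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the keyword sets are hypothetical stand-ins for real trained models, and the names `TREE`, `KEYWORDS`, `score`, `best_match`, and `categorize` are all made up for this example.

```python
import re
from collections import Counter

# Category tree copied from the example hierarchy in the post.
TREE = {
    "Shoes": ["Nike", "Reebok", "Vans"],
    "Car": ["Ford", "Toyota"],
    "Cheese": ["American", "Cheddar"],
}

# Hypothetical keyword sets standing in for trained per-category models.
KEYWORDS = {
    "Shoes": {"shoe", "shoes", "sneakers", "laces"},
    "Car": {"car", "cars", "engine", "sedan"},
    "Cheese": {"cheese", "dairy", "aged"},
    "Nike": {"nike"},
    "Reebok": {"reebok"},
    "Vans": {"vans"},
    "Ford": {"ford"},
    "Toyota": {"toyota"},
    "American": {"american"},
    "Cheddar": {"cheddar"},
}


def score(category, counts):
    """Count how many words on the page match the category's keywords."""
    return sum(counts[word] for word in KEYWORDS[category])


def best_match(candidates, counts):
    """Return the best-scoring candidate, or None if nothing matches."""
    top = max(candidates, key=lambda c: score(c, counts))
    return top if score(top, counts) > 0 else None


def categorize(text):
    """Walk the tree top-down and return the deepest path we can assign."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    path = []
    top = best_match(list(TREE), counts)
    if top is None:
        return path  # not about any known top-level category
    path.append(top)
    # Only the children of the chosen category are considered here.
    sub = best_match(TREE[top], counts)
    if sub is not None:
        path.append(sub)
    return path


print(categorize("New Nike running shoes with bright laces"))  # ['Shoes', 'Nike']
print(categorize("Comfortable shoes for walking"))  # ['Shoes']
```

The second call shows the "lowest level that it can" behavior: the page is clearly about shoes, but no brand matches, so it stays at the Shoes level.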
To first determine whether a site belongs in the Computers category, I will need to get about 400,000 websites at random from other parts of the DMOZ directory. Alternatively, I could assume that if I can’t find an appropriate subcategory in Computers, the site doesn’t belong in the Computers category at all.

Initially I will design the system with no login or registration required, since use will be practically nonexistent. Once use begins, I will require an email address and require that the address be confirmed. This should make it harder for companies or individuals to artificially increase their rankings within the directory.

Within each category the sites will need some sort of ranking system, but initially I won’t worry about ranking the sites. I will just assume that as categories get specific enough, there will only be a small number of sites in each one. (This seems to be true for DMOZ.)
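The two ways of deciding whether a site belongs in Computers at all, mentioned earlier, could look something like the sketch below. Everything here is an assumption made for illustration: the site records with a `category` path, the per-subcategory scores, the 0.5 threshold, and the function names `sample_negatives` and `belongs_in_computers`.

```python
import random


def sample_negatives(sites, n, seed=0):
    """Option 1: pick up to n random sites from outside Computers."""
    outside = [s for s in sites if s["category"].split("/")[0] != "Computers"]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(outside, min(n, len(outside)))


def belongs_in_computers(subcategory_scores, threshold=0.5):
    """Option 2: reject the site if no Computers subcategory fits well enough."""
    if not subcategory_scores:
        return False, None
    best = max(subcategory_scores, key=subcategory_scores.get)
    if subcategory_scores[best] >= threshold:
        return True, best
    return False, None


# Hypothetical directory entries, used only to exercise the functions.
sites = [
    {"url": "http://example.com/a", "category": "Computers/Programming"},
    {"url": "http://example.com/b", "category": "Arts/Music"},
    {"url": "http://example.com/c", "category": "Science/Biology"},
]
negatives = sample_negatives(sites, 2)
print(belongs_in_computers({"Programming": 0.8, "Hardware": 0.3}))
print(belongs_in_computers({"Programming": 0.2, "Hardware": 0.1}))
```

The first option needs a large labeled sample but gives an explicit "Computers or not" model to train. The second needs no extra data, but it depends entirely on how well the subcategory scores are calibrated.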


Welcome to Dan Mayer's development blog. I primarily write about Ruby development, distributed teams, and dev/PM process. The archives go back to my first CS classes during college, when I was first learning programming. I contribute to a few OSS projects and often work on my own projects. You can find my code on GitHub.

Twitter @danmayer

Github @danmayer