I ran some initial results that would tell me how closely related all the documents within a category were related. we were looking for numbers in the .75 range. The first attempt was actually returning the .339 for one category and .328 for another. These were rather low. To improve the mean of the categories we thought creating a larger space with an overall larger vocabulary would help the documents to be more related. We increased the space from 987 total documents to 1798 documents. I didn’t add any documents to either of the test categories, only to the overall space (documents that were really not related to either catgory). I also added a simple stoplist filtering out some common but irrelavent words. This didn’t seem to help the relational mean between the two categories. The first category lowered from .339 to .320 while the second increased from .338 to .349. Now I have to decide wether to greatly increase the space, or greatly increase the documents within the categories (which i have to do by hand so it is a slow and time consuming process.)

blog comments powered by Disqus
Dan Mayer Profile Pic
Welcome to Dan Mayer's development blog. I primary write about Ruby development, distributed teams, and dev/PM process. The archives go back to my first CS classes during college when I was first learning programming. I contribute to a few OSS projects and often work on my own projects, You can find my code on github.

Twitter @danmayer

Github @danmayer