I have gotten a few emails and questions from other researchers in the community, and I decided to start answering questions on my site rather than through email so that others can benefit from the answers too. So here is the first response I am posting on the web. Feel free to contact me if you have any other questions and I will try to respond.

The question:

"I found your project while googling for various alternatives to spam filtering; I've been thinking about trying SVM for mail filtering myself, but I'm slightly at a loss as to what features to use. A bag-of-words model comes naturally to mind, but it is not the most efficient computationally; are you perhaps using it or something similar?"

First the text content is run through a common-words stop list. This removes all the very common words that are useless for categorization, such as "the", "is", "that", and so on. Then the text is converted to the SVM format of feature vectors. The feature vectors are created using each word's frequency relative to the total word space:

Feature # : frequency in document / frequency in word space

So let's say you have 3 documents in your space. All of the words in all of the documents end up in your word space, each with a value equal to its total occurrences across all of the documents. Each feature represents one word, and a document's value for that feature is the ratio of the word's count in that document to its count in the word space.

Let's do a tiny example (the stop list is skipped here to keep it short):

Documents
1: "There is a little cat"
2: "Where is the little cat"
3: "There is a little dog, very little"

Space
There: 2
Is: 3
A: 2
Little: 4
Cat: 2
Where: 1
The: 1
Dog: 1
Very: 1

Each of the words is assigned a feature number in order of appearance, so "There" is 1, "Is" is 2, "A" is 3, "Little" is 4, "Cat" is 5, "Where" is 6, "The" is 7, "Dog" is 8, and "Very" is 9. Then to make the vector you take each document and calculate the frequency ratios. For example, we will do this for the 3rd document.

There: 1/2
Is: 1/3
A: 1/2
Little: 2/4
Dog: 1/1
Very: 1/1

Replacing the words with their feature numbers and the fractions with decimal values gives the final document vector based on frequency:

1:0.5 2:0.333 3:0.5 4:0.5 8:1 9:1

You mentioned bag of words, which is another way to create vectors to represent the text. I thought about that method and decided that the frequency ratio would provide a better representation of each document. Neither way is that computationally efficient, I guess. You can leave out all features with a value of 0, since they will be ignored by the SVM anyway. I also leave out all ratios below 0.000001, because I figure that means the feature is really worthless in comparison to the space.

What makes this fairly efficient is that the dictionary with the word counts is saved. So while building the model is time consuming, converting new text and making comparisons is really pretty fast. You also don't have to rebuild the model each time you place something in a category; you can rebuild only as needed or on a set schedule, like once a week.

If you would like to see all of the libraries and math that I wrote to create these vectors, download the source code to Text2SVM, which I use to do these conversions. It also lets you save a dictionary, since each word must map to the same feature number in the model and in any future files you plan to test against the model. This allows you to create new vectors that will match up with your old model.
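For anyone who would rather read code than prose, here is a rough Python sketch of the steps described above: build the word space, number the features in order of first appearance, compute the ratio for each word, and drop the tiny ratios. This is not the Text2SVM source. The function names, the simple regex tokenizer, and the JSON dictionary format are all assumptions made up for illustration.

```python
import json
import re
from collections import Counter

MIN_RATIO = 0.000001  # ratios below this are dropped, as described in the post


def tokenize(text, stop_words=frozenset()):
    """Lowercase, split into words, and drop anything in the stop list.
    The regex is an assumed, deliberately simple tokenizer."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop_words]


def build_dictionary(documents, stop_words=frozenset()):
    """Return (feature_ids, space_counts) for a list of document strings."""
    feature_ids = {}          # word -> feature number, in order of first appearance
    space_counts = Counter()  # word -> total occurrences across all documents
    for doc in documents:
        for word in tokenize(doc, stop_words):
            feature_ids.setdefault(word, len(feature_ids) + 1)
            space_counts[word] += 1
    return feature_ids, space_counts


def to_vector(text, feature_ids, space_counts, stop_words=frozenset()):
    """Map feature number -> (count in document / count in word space)."""
    doc_counts = Counter(tokenize(text, stop_words))
    vector = {}
    for word, count in doc_counts.items():
        if word not in feature_ids:
            continue  # unseen words have no feature in the saved dictionary
        ratio = count / space_counts[word]
        if ratio >= MIN_RATIO:
            vector[feature_ids[word]] = ratio
    return vector


def to_svm_line(vector, label=None):
    """Format a vector as sparse 'index:value' pairs sorted by feature number."""
    pairs = " ".join(f"{i}:{v:.6g}" for i, v in sorted(vector.items()))
    return pairs if label is None else f"{label} {pairs}"


documents = [
    "There is a little cat",
    "Where is the little cat",
    "There is a little dog, very little",
]
feature_ids, space_counts = build_dictionary(documents)
print(to_svm_line(to_vector(documents[2], feature_ids, space_counts)))

# Saving the dictionary lets vectors built later line up with the same model.
saved = json.dumps({"feature_ids": feature_ids, "space_counts": space_counts})
```

Zero-valued features never show up in the output because only words that actually occur in the document are visited. Passing a set of common words as `stop_words` to both `build_dictionary` and `to_vector` would apply the stop list step.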



Welcome to Dan Mayer's development blog. I primarily write about Ruby development, distributed teams, and dev/PM process. The archives go back to my first CS classes during college, when I was first learning programming. I contribute to a few OSS projects and often work on my own projects. You can find my code on GitHub.

Twitter @danmayer

Github @danmayer