March 2004 Archives

March 1, 2004

Setting Up

This is going to be a research blog kept by Dan Mayer to show his progress on his computer science research into machine learning.

I will be setting up a collection of all the work and research that I have done on machine learning. I am working on a project for the L3D labs at the University of Colorado. I am working specifically with text categorization, currently focused on using LSA and SVM to categorize. I will post all the links, projects, examples, and resources that I have used and learned from during my progress on this project.

This is documenting the work on my project News Shaker, which will be similar to Google News. It uses a web spider and text categorization to organize the data into particular categories. The categories I am currently working with are special education related. The topics of the categories were chosen so that this project would become a subproject of Web2gether, which is a site to help the special education community.

Currently all of the categories seen on the News Shaker site are manually sorted and loaded into the system after the spider crawls specific topics on the web. I am now working with LSA and SVMs to begin the text categorization of newly crawled websites and results. This project is also setting up an easy-to-use, web-managed content management system. This would allow for the quick setup and design of any topic-based categorization using the methods developed while creating the original News Shaker, which is focused on special education.

Very cool clustering

This isn't quite related to my own machine learning research, but it is cool to see what others out there are doing with machine learning. This is a very cool search engine that uses text clustering.

http://vivisimo.com

March 2, 2004

Space Altering

I ran some initial results that would tell me how closely related all the documents within a category were. We were looking for numbers in the .75 range. The first attempt actually returned .339 for one category and .328 for another. These were rather low. To improve the mean of the categories, we thought creating a larger space with an overall larger vocabulary would help the documents to be more related. We increased the space from 987 total documents to 1798 documents. I didn't add any documents to either of the test categories, only to the overall space (documents that were really not related to either category). I also added a simple stoplist filtering out some common but irrelevant words. This didn't seem to help the relational mean between the two categories. The first category lowered from .339 to .320 while the second increased from .338 to .349. Now I have to decide whether to greatly increase the space, or greatly increase the documents within the categories (which I have to do by hand, so it is a slow and time-consuming process).
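
For anyone wondering what a "relational mean" for a category could look like in code, here is a minimal sketch in Python (not the LSA tools used for the numbers above). It assumes the relatedness of two documents is the cosine similarity of their vectors and that the category score is the average over every pair of documents. Both are assumptions for illustration, and the toy vectors are made up.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over every unordered pair of documents."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two documents")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy vectors standing in for documents in one category.
category = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(round(mean_pairwise_similarity(category), 3))
```

A score near 1 would mean the documents in the category point in almost the same direction in the space, while a score near 0 would mean they share very little.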

Argg like an angry cat

Things are going well I guess. I am making some good progress on some stuff at work. I have been catching up well in school, and everything seems to be going alright. I am just feeling like I am really stuck in a rut again. I feel like an angry cat that has no power to do anything. Our cat always gets frustrated and doesn't have the ability to do anything about it except meow. So here is what I feel like:

dancatl.jpg

Today I was singing in the shower and I came up with a really cool song. The problem is that I can't sing and write at the same time. I wish I could rewrite some of the lyrics here for you, but I don't remember anything except being impressed that it flowed so well and sounded so good... Maybe I was just tired though. One day I guess I will wake up.

March 3, 2004

Connections crossed

It is impressive how many lives we drift in and out of. I mean, how many people have you known that were once a part of your life that you now know nothing about? Nothing about what is going on in their life, where they are, what their plans are, how they are doing? How often do you still think about these people? For how many people have you been the person that has disappeared from their life? How many people wonder where you have gone? How you're doing, what is going on in your life? Perhaps our lives cross over into too many other people's lives, perhaps we haven't crossed paths with enough people. It just seems there are thousands of people that have been a small part of my life, a teacher, a lab partner, or a friend, that eventually just fade away. Half the people I probably never think about again. Others you see once in a while and trade that knowing nod with nothing more said. Others you never see but still think about occasionally: what happened to that cool kid from art camp... Things of that nature. I guess there is no big conclusion or anything, just something I was pondering over and thinking about.

March 4, 2004

Converting documents to word vectors

After unsuccessfully searching all over the web for source that would do this, I began writing my own code to convert all of my documents to word vectors based on an overall word space. I have a program that came with LSA called Pindex that does this, but for LSA, and it seems to give some odd results that I am not sure are compatible with SVMs. (I know that the formatting is different, which I will be converting to SVMlight formatting, but I am not sure if the data is valid.) I am now making a word vector for every document, where each entry is just the ratio of the document's count for a word to the total count of that word across the whole space. I will then have a separate Java app that allows you to give it two sets of documents, and it will generate the proper SVMlight input file, giving one set of documents the positive value and the other set the negative. Then I should be able to take any new text, create its word vector based on the same dictionary, and then run comparisons in SVMlight. I am hoping to have the text to vector done this weekend. It shouldn't be too difficult. Then I will release it on the web, since I had a next to impossible time trying to find anything like this on the web.
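
A minimal sketch of the word vector idea, written in Python rather than the Java of the actual program. It assumes a simple lowercase letters-only tokenizer and that each vector entry is the document's count of a word divided by that word's total count across the whole space. The function names are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters as words (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_space_counts(documents):
    """Total count of every word across all documents in the space."""
    totals = Counter()
    for text in documents:
        totals.update(tokenize(text))
    return totals


def build_dictionary(space_counts):
    """Give each word a fixed 1-based feature id, in alphabetical order."""
    return {word: i for i, word in enumerate(sorted(space_counts), start=1)}


def document_vector(text, dictionary, space_counts):
    """Map feature id -> (count in this document / count in the whole space).

    Words missing from the dictionary are skipped, so new text can be
    vectorized against the same dictionary as the training documents.
    """
    counts = Counter(tokenize(text))
    return {
        dictionary[word]: count / space_counts[word]
        for word, count in counts.items()
        if word in dictionary
    }


docs = ["the cat sat", "the dog sat down"]
space_counts = build_space_counts(docs)
dictionary = build_dictionary(space_counts)
print(document_vector(docs[0], dictionary, space_counts))
print(document_vector("a cat and a bird", dictionary, space_counts))
```

Because the ids come from one shared dictionary, feature 1 means the same word in every vector, which is what makes later comparisons meaningful.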

Power tools never looked so good.

the hottest music video. EVER! hehe

If you want to suddenly think power tools are cool... Click the link above.

Anyways, it seems that I need to get ready for some hardcore constant coding. It seems that I have a coding project due Thursday, 2 on Monday, a paper Tuesday, and a large homework assignment Wednesday. Argg, so basically my week is screwed until next Thursday night. I am highly excited. By that I mean this sucks.

Seconds ago I learned that my good friend Steve is in town. He is in the Air Force out by Florida now. I guess he is going to be here a week and a half. That makes my above paragraph suck way worse, because now it is going to be really hard to A) spend time with him, B) get my work done, and C) become the new leprechaun on the front of the Lucky Charms box.

Sometimes the weather gives insight into the future. It was 70 this morning and it was snowing like hell by 5. I have a feeling that will be my life over the next week: from great to a snowstorm that there is no way out of. Well, wish me luck... Also I am trying to ask out this girl I am interested in sometime, but there never seems to be any time.

Text to vectors progress

I finished the basic text to vector, but in its current form it is kinda useless. It only outputs to System.out, in a format that isn't useful yet. Also, it doesn't save the dictionary that it uses to generate the vectors, so you can't generate any new word vectors to run comparisons with. I will be adding in features to save the generation dictionary, to import a saved dictionary, and to output the feature word vectors in the SVMlight format. I also have recently found some cool Java stemming software that I will add in to my project after I get the first usable version out. If you're interested in the stemming software, here are the links:
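
To show why the saved dictionary matters, here is a rough Python sketch (again not the Java code itself). It assumes the dictionary is a plain word-to-id map stored as JSON. The file format and the function names are assumptions made for this example, not what the program actually does.

```python
import json
import os
import tempfile


def save_dictionary(dictionary, path):
    """Write the word -> feature id map to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(dictionary, handle, sort_keys=True)


def load_dictionary(path):
    """Read a previously saved word -> feature id map."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def extend_dictionary(dictionary, words):
    """Add unseen words with fresh ids; ids already handed out never change."""
    next_id = max(dictionary.values(), default=0) + 1
    for word in words:
        if word not in dictionary:
            dictionary[word] = next_id
            next_id += 1
    return dictionary


dictionary = {"cat": 1, "dog": 2}
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "dictionary.json")
    save_dictionary(dictionary, path)
    restored = load_dictionary(path)

extended = extend_dictionary(restored, ["dog", "bird"])
print(extended)
```

Keeping old ids fixed is the important part: a vector written last week and a vector written today only line up if "cat" is the same feature number in both.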

Porter Stemming Algorithm

Lancaster Stemmer

March 5, 2004

Duct Taped to a wall

How can college ever be boring, with the rare, yet precious nights like tonight!!!
Dave, Scott, Steve, Amy, Isaac, Megan, Sean, and some others helped with the duct-taping-Dan-to-the-wall project, done to entertain us on a night where it was snowing too much to just go out.

danwalls.jpg

(3 rolls of duct tape were used in this experiment)

What is one of the dumbest fun things you have done in your life?

March 6, 2004

More pics

More, better pics... going to Ft. Collins.

danwall2s.jpg

March 8, 2004

Week of stuff

Well this is going to be a busy week. Instead of doing homework on Saturday, I went to Ft. Collins because my friend Matt wanted me to come up. Also, since Steve is in town I have been hanging out with him and doing stuff and therefore not getting my work done. Today, oddly enough, instead of spending my time working on my various school projects, I worked for 6 hours on my work project. I accomplished quite a bit and I am now ready to introduce my Machine Learning blog. It follows my work and progress on some projects that I am working on and serves as a center for most of my coding and source code.

I am sure that doesn't interest 90% of the people that read my blog, but oh well. I am planning to be really busy until spring break, which is kind of annoying. I wanted to ask this girl out, but I really don't have any time to go out on a date before spring break. Also, the whole spring break thing is still up in the air. I seriously have no clue what I will be doing. Scott and Dom still don't know when they are off work, how much time they will have and such. Also, I am pretty sure they both would rather just spend it with their girlfriends. So I am really thinking about just going to Vegas with some of my other friends. Also Megan, one of my good friends from high school, is coming to Colorado the same week as my break for her spring break. I would really like to see her, but it is likely that no matter what I do I will be out of town. Maybe I should just flip a coin a couple of times and figure it out.

Well now I am off to a statistics review session.. yeaaa school from 7-9! You got to love being inside the entire day when it is beautiful outside.

March 9, 2004

Text2SVM

This is the main page for "Text2SVM", a Java program that converts text documents to the format required by SVMlight. The project is currently in beta. I figured that I should release an early working version for anyone else that has been finding it very difficult to find software to help them work with SVMs. This project is distributed as is, with no guarantees. So use it, try it out. Right now it is written very specifically to handle converting text files to a single format for SVMlight. It should be capable of being edited to support many other formats though. To read more about SVMlight visit their page. If you make any good modifications or changes to this program I would love to hear about it; please contact me at ddmayer (at) colorado (dot) edu (spam protection). You can also leave comments on this page so other users can help solve your problems.

To download the code click here.

This first release, 0.001, includes these features:
-Import an entire directory of files
-Save the dictionary to generate other word vectors based on the same SVM model
-Export files in SVMlight format
-Stemming, so that words like "stopping" and "stopped" are counted as the same word
-The ability to import saved dictionaries and continue adding to the model
-Debugging code left in

What is SVM, and why would I want to convert text to a word vector?
SVM stands for Support Vector Machine, a machine learning method for classification. In this example we are working with text categorization using SVMs. This program's purpose is to take a bunch of documents in a category and a bunch of documents unrelated to that category. It then creates word vectors out of all the documents. The word vectors are then converted to SVMlight's format and given a positive or negative weight. Any time a document vector has a positive weight, that means we consider it part of our category. A negative weight means it is outside of our category. After the text is written in this format, you can use SVMlight to compare unknown text and see if it belongs in this category.
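
To make the output format concrete, here is a small Python sketch (not Text2SVM's own Java code) that writes examples the way SVMlight expects them: one example per line, a target of +1 or -1 first, then `id:value` pairs with the feature ids in increasing order. The "positive or negative weight" described above corresponds to that leading target. The helper names and the sample vectors are made up for illustration.

```python
def svmlight_line(label, vector):
    """Format one example as '<target> <id>:<value> ...' with ids ascending."""
    features = " ".join(
        f"{feature_id}:{value:.6g}"
        for feature_id, value in sorted(vector.items())
        if value != 0  # zero-valued features can simply be left out
    )
    return f"{label:+d} {features}".rstrip()


def training_file(positive_vectors, negative_vectors):
    """Label in-category documents +1 and out-of-category documents -1."""
    lines = [svmlight_line(+1, vector) for vector in positive_vectors]
    lines += [svmlight_line(-1, vector) for vector in negative_vectors]
    return "\n".join(lines) + "\n"


# Hypothetical word vectors keyed by feature id.
in_category = [{5: 0.5, 1: 1.0}]
out_of_category = [{2: 0.25, 3: 1.0}]
print(training_file(in_category, out_of_category), end="")
```

The text this produces can be saved to a file and handed to SVMlight's `svm_learn` as training data.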

KNOWN ISSUES:
- Occasionally cuts the first letter off of a document.
- Uses a high amount of memory. To give Java more memory, add -Xmx256m before the name of the program you're trying to run. wc.java is known to run out of memory with large amounts of text.

March 10, 2004

Boulder life

Well I worked my butt off Sunday, Monday, and Tuesday. This allows me to hang out with Steve on Wednesday, Thursday, Friday, and Saturday. I still have classes and school, but I should be able to come home by five every day and not worry about homework again until Sunday. It is quite nice. I am not sure what we will be doing the next couple days, but I am looking forward to whatever is thrown my way.

There has been lots of talk of relationships and marriage from many of my friends lately. Many are beginning to feel pressure to get married. Some are deciding they are not ready and ending relationships that aren't quite headed towards what they would want from marriage. Others are moving in, away, or closer to their respective others. I guess it has slowly pushed the relationship thoughts back into my head. I don't really know what to do about it right now, because Boulder actually just doesn't seem to have the type of girl I am looking for. Everyone in Boulder is too extreme towards one way of life. We have a ton of Boulder hippies, we have the sorority happy-go-luckiest, and the anal career and business women. I want a little bit of everything. I guess that is one of the reasons that I think I am done with Boulder. I am ready to move on and find another place to call home.

Balance is something everyone and every place needs. Perhaps Boulder's balance is between all of the very extreme groups and their always interesting, but pointed, views. Oh well, another day, another little bit of progress towards my life, wherever it may be headed.

March 14, 2004

ending the weekend

Well it is hard to get things done when there is fun stuff around and fun people. I am having a hard time getting back on the motivation train after a fun weekend of being social and partying. I just am not ready for how much time school is going to be consuming from me for the next couple months. I guess I will get used to it or something. Anyways, I had a great week and I got to hang out with a bunch of my friends. School sucks, but I guess things are going well... Now I just need to find some free time so I can ask out this girl who gave me her number this weekend.

March 15, 2004

Ah crap bad timing!

Well, there is a good chance I am sick, given that all my muscles ache, my nose is running like crazy, and a friend (Penny) staying at our house has been sick the last few days. Oddly enough, if I can make it to spring break I just may live through this semester. I just may have to work an over-40-hour week on various projects and work to stay afloat though.

Anyways, always trying to be the bearer of good news, here is a fun little tidbit from a friend that was recently in Vegas:

Vegas Stripper: "What's your major?"
Friend: "Computer Science"
Vegas Stripper: "You can call me R2D2"

If that's not hot and funny, I don't know what is...... Perhaps I should begin looking for love in strip clubs around the country. If you have time to party right now, raise a glass for me, cause I will be there in spirit for a while, while I slave away to code.

March 19, 2004

Spring Break

Even though I have a ton to do over break and am not really doing anything cool or that fun, it feels great to have a huge weight of stress off my shoulders for a week. I am sure I am going to spend a good portion of break working on various homework and job related activities. Oh well, it is my last hard semester. After college all the hard work better pay off and I better get a good job, and all the crazy, always drunk, and overpaid people better end up working their asses off just to find a job and realizing it isn't going to pay them millions and doesn't give them everything they want. Anger issues??? Maybe.... hehe. Anyways, I am going to try to spend some time with my best friends and some time just relaxing and doing things I haven't really had time for in a while. In reality it means I am going to spend too much, sleep too much, and drink too much. That said, now that I am awake from my nap I guess it is time to start spring break, even if I am still sick and I tend to cough every 45 seconds.... Really, you can keep time with my cough.

March 20, 2004

Cry to you

A poem that I wrote who knows when that I just found on my computer. It is still relevant. I guess while a lot has changed in college, not a lot about who I am has really changed that much. Many things I have said in the past I still hold true today.

Cry to You:

Cold fears;
Let them be said.
Silent tears;
Let them be shed.
Open to the world
I am now dead.
I've been reborn, instead.
Now I wake open my eyes
I see life first time I realize
Is this what that I've missed
This wonderful life created from a kiss.

March 25, 2004

Springbreak

Well, break is going well, except that I am still a little sick. Anyways, we hit up Scott's cabin in Breck for a couple days and got some skiing in. The first two days of break I actually did about 6 hours of homework each day. Tomorrow I will really have to get back on the homework and job related activities. Today I am just relaxing and laying around and trying to get rid of my cold.

Last night I went to Ft. Fun and hung out with a friend, Megan, who was in town for her spring break. I also got to see Matt and Jesse, which was nice. We didn't really do anything special, just chat and talk and catch up. It was a good time. I really do enjoy hanging out with Megan. She and her friend headed back to IL today though.

Well, I think I will take a shower then make lunch. Enjoy your day.

March 27, 2004

a thought from my dream.

Break is coming to an end. I guess I am ready to get back to an almost endless stream of work. I would have enjoyed break far more if I wasn't sick the whole time. It really took some fun out of everything.

Technology can't fix and help everything in our world. Many of the problems are actually now stemming from the horrible traits and beliefs taught to our children through a marketing-controlled media: sickening body images and impossible ideals of relationships and love. To truly change our society, besides changing technology and science, we must begin to change what beliefs are held and taught to the public. I believe otherwise we will continue to fall into a higher divorce rate and more single-parent families, which is bad for the upbringing of the nation's children. Many kids even within our day are beginning to believe that staying in a family unit is an outdated concept. Kids are raising themselves more and more with nothing but the media's influence.

Any thoughts about that?

March 29, 2004

NewsShaker: Features needed / bug report V.001

Features:
* from empty DB to fully functional (no manual data entry)
* No code should contain hard-coded values referencing categories
* Easy way to define where search should put results default (in a category or in the uncategorized)
* Ability for user to say they think the categorization is incorrect (if enough users say this it should be moved to the new category)
* Store all SVM results in the database
* a way to visualize the distances between the categories..
* how many categorizations per category have been correct or incorrect
* a way to start over the correct and incorrect after generating new SVM models
* SVM weighting with C towards the positive examples
* a way to let a user create their own account and their own categories to manage themselves
* When an administrator is adding a URL to crawl, they should be able to pick the depth to crawl and the default category that all results will be placed in.
* Search entire database or by category using Lucene
* Ability to add single page to a category
* More administration features
* Ability to start or stop auto categorization
* caching the front page
* Text2SVM integration
* Text2SVM configuration file
* ability to create the SVM categories from the web as administrator
* create only the dictionary and store it as one function
* use stored dictionary to create all needed spaces for SVM models
* ability to distinguish between multiple categories instead of single boolean.

Bugs:
* If the user chooses to move a page from one category to another but doesn't choose a new category it should do nothing and give them an error.
* Crawling from the web doesn't work anymore
* SVM first word spacing?? with the first character removal??? is this still a bug?
* Counts for the categories should be switched to be autocounted
* Text2SVM runs out of memory on large examples
* front page loads too slowly
* categories are manually counted; let MySQL do the counting!
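
As a note on the last bug, letting the database do the counting could look something like the sketch below. It uses Python's built-in sqlite3 with an in-memory database purely as a stand-in for MySQL, and the `pages` table, its columns, and the sample rows are all hypothetical. The `GROUP BY` query itself is plain SQL that MySQL accepts as well.

```python
import sqlite3

# In-memory stand-in for the real database; this schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT, category TEXT)"
)
conn.executemany(
    "INSERT INTO pages (url, category) VALUES (?, ?)",
    [
        ("http://example.com/a", "autism"),
        ("http://example.com/b", "autism"),
        ("http://example.com/c", "dyslexia"),
    ],
)

# One query returns the current count for every category, so no stored
# counter can drift out of date when pages are moved between categories.
counts = dict(
    conn.execute("SELECT category, COUNT(*) FROM pages GROUP BY category")
)
print(counts)
conn.close()
```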

March 30, 2004

FIRST GOOD SUCCESS!!

Using my Text2SVM, after learning how to increase the Java memory size so that I could do a large test, was very successful. I used about 2,700 documents and then put them into SVMlight. I used 90% training data and 10% testing data, leaving around 270 tests. I achieved 95% accuracy in training the SVM to recognize one text category from another. I am highly excited! Today I tested with my largest category; tomorrow I will be testing with several of my smaller categories. If these trends continue.... eeehhhh. I was really only hoping to achieve over 80% correct. So let's hope this is the start of something truly wonderful. I know that the results continuing this high are very unlikely, but I am still really excited about the first good results. I am pumped about doing more testing tomorrow!!
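
For reference, a test like this can be set up roughly as in the Python sketch below. It assumes a random 90/10 split and that a prediction counts as correct when the sign of the classifier's output matches the true +1/-1 label. The seed, function names, and sample decision values are made up for illustration and are not the numbers from my run.

```python
import random


def split_examples(examples, test_fraction=0.1, seed=0):
    """Shuffle the examples and hold out a fraction of them for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


def accuracy(true_labels, decision_values):
    """Fraction of examples whose predicted sign matches the true label."""
    if not true_labels or len(true_labels) != len(decision_values):
        raise ValueError("need two non-empty lists of the same length")
    correct = sum(
        1
        for label, value in zip(true_labels, decision_values)
        if (value > 0) == (label > 0)
    )
    return correct / len(true_labels)


# 2,700 placeholder document ids split 90/10.
train, test = split_examples(range(2700))
print(len(train), len(test))

# Hypothetical true labels and classifier outputs for four test documents.
print(accuracy([1, -1, 1, -1], [0.8, -1.2, -0.1, -0.4]))
```

Keeping the test documents completely out of training is what makes the accuracy number meaningful.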

Current Percent levels

These are the current success levels for models based on different categories. These models were built with the current data in the News Shaker database. They were converted to SVMlight models with the use of Text2SVM without stemming enabled.

the end of BREAK!

Repeated failure to achieve
can lead to lack of belief in worth
of the idea of learning from trying.

Ambition isn't always good if it leads to finding some failures.

--------------------------Or-------------------------------------------

It is great to relax and not have much to worry about once in awhile.