I'm reading this book currently. A lot of it is covered in traditional AI courses in CS programs, but I like seeing the Pythonic representations of some of the concepts.
The site itself seems kind of neat, feels like you'd need to use it for a while to get it working well.
Cool site. I am working on the same thing (but for school/fun). You should look into Support Vector Machines, as they are much better at text classification than kNN.
I have looked into SVMs but I don't think they would work well in this case because:
1) A separate classifier would have to be trained for each user, and this would take too many resources.
2) I think an SVM would require too many training cases before it becomes useful.
This is a personalized news site that I wrote in two months based on algorithms from the book Programming Collective Intelligence. Please tell me what you think.
It has two main features: the ability to identify related/similar links, and suggestions/recommendations that actually work.
The basis for all of the algorithms is a document similarity metric presented in Chapter 3: Discovering Groups. Basically, to compare document A with document B, we calculate the Pearson correlation coefficient between the word frequencies of document A and the word frequencies of document B. (You can imagine this as plotting a series of points on a graph, one point per word: each point's x coordinate is that word's frequency in document A and its y coordinate is that word's frequency in document B. The Pearson correlation coefficient is a measure of how well the line of best fit fits the points.)
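For anyone who wants to see the metric in code, here is a minimal sketch of the idea, not the site's actual implementation. The names word_counts and pearson are made up for illustration, the tokenizer is deliberately crude, and the vocabulary is assumed to be the union of the words in the two documents (the book uses a word list shared across the whole collection).

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase words in a piece of text (very rough tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def pearson(counts_a, counts_b):
    """Pearson correlation between two word-count mappings.

    Assumption: the vocabulary is the union of words seen in either
    document; a word missing from one document counts as 0 there.
    """
    vocab = sorted(set(counts_a) | set(counts_b))
    n = len(vocab)
    if n == 0:
        return 0.0
    xs = [counts_a.get(w, 0) for w in vocab]
    ys = [counts_b.get(w, 0) for w in vocab]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Every word has the same count, so the correlation is undefined.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


doc_a = word_counts("python python clustering")
doc_b = word_counts("python python clustering")
doc_c = word_counts("python clustering clustering")
print(pearson(doc_a, doc_b))  # same word profile
print(pearson(doc_a, doc_c))  # opposite word profile
```

A score near 1 means the two documents use words in similar proportions, and a score near -1 means words that are frequent in one are rare in the other.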
Using this similarity metric, links can be clustered together using K-means clustering. This is what you get when you click on “related” at the bottom of each link. Clicking on “similar” gives the results of running K-NN. (“related” doesn't work as well as it could right now because there are too few links for any given link to be similar to, though http://fyynd.com/links/197/related/ is an example of where it does work. “similar” usually works better right now.)
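A rough sketch of what the “similar” (K-NN) lookup could look like, assuming every link has already been turned into a word-count vector over one shared vocabulary. The function name nearest_links, the link ids, and the toy vectors are hypothetical; the K-means side behind “related” is not shown.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length count vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def nearest_links(target_id, vectors, k=3):
    """Return the ids of the k links most correlated with target_id."""
    scored = [
        (pearson(vectors[target_id], vec), link_id)
        for link_id, vec in vectors.items()
        if link_id != target_id
    ]
    scored.sort(reverse=True)  # highest correlation first
    return [link_id for _, link_id in scored[:k]]


# Toy data: word counts over a hypothetical 4-word vocabulary.
vectors = {
    "a": [3, 0, 1, 0],
    "b": [2, 0, 1, 0],
    "c": [0, 3, 0, 2],
    "d": [3, 1, 1, 0],
}
print(nearest_links("a", vectors, k=2))
```

This compares the target against every other link, which is fine for a small site but is one reason results might be precomputed rather than served live.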
There are two algorithms for giving recommendations, “Suggested” and “Recommended”. "Recommended" generally works better than "Suggested" when you haven't yet voted much, but "Suggested" should be more in tune with your preferences in the long run.
In layman's terms, the Recommendation algorithm works by "averaging" together the links that you liked and then finding links that are similar to that average, while the Suggestion algorithm tries to determine whether you will like a particular link by checking whether it is similar to any page that you have already rated highly. As a result, "Recommended" will list pages in your general area of interest but will be insensitive to any "niche" interests that you might have. The "Suggested" page will be sensitive to "niche" interests but requires more votes to train. For example, if most of the links you rate highly are about computer science, with only a few about biology, then when the recommendation algorithm averages them together, the biology links count for very little. As a result, you wouldn't see much on biology. The suggestion algorithm is not hindered by this, though it will have trouble if you don't vote much.
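To make the difference concrete, here is a small sketch of the two scoring ideas as described above. It is only an illustration under assumptions: the function names, the use of Pearson correlation for both scores, and the toy word-count vectors are mine, and the real algorithms on the site may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length count vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def recommended_score(candidate, liked):
    """'Recommended' idea: average the liked links, then compare."""
    profile = [sum(column) / len(liked) for column in zip(*liked)]
    return pearson(profile, candidate)


def suggested_score(candidate, liked):
    """'Suggested' idea: similarity to the closest single liked link."""
    return max(pearson(vec, candidate) for vec in liked)


# Hypothetical vocabulary: [code, compiler, cell, gene]
liked = [
    [5, 3, 0, 0],  # computer science
    [4, 4, 0, 0],  # computer science
    [6, 2, 0, 0],  # computer science
    [0, 0, 5, 3],  # biology (the "niche" interest)
]
biology_candidate = [0, 0, 4, 4]

print(recommended_score(biology_candidate, liked))  # negative: averaged away
print(suggested_score(biology_candidate, liked))    # high: matches one liked link
```

With three computer science links and one biology link, the averaged profile is dominated by computer science words, so a new biology link scores poorly under the averaging approach but well under the nearest-liked-link approach.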
Please note that because predictions are so computationally intensive, they are not updated in real time but on an hourly basis. Thus, you may have to wait a bit before new predictions appear. Please be patient!
Please check it out and tell me what you think! Any questions/comments/suggestions are more than welcome!
P.S.: I forgot to mention: the voting system normalizes your ratings. Thus, if you vote 5 stars on everything, it is the same as not voting at all! You must tell it what you don't like as well as what you like.
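One simple scheme that has this property is subtracting each user's mean rating from their votes. The sketch below assumes that scheme purely for illustration; the post doesn't say how the site actually normalizes, and normalize_ratings is a made-up name.

```python
def normalize_ratings(ratings):
    """Mean-center one user's ratings (assumed normalization scheme).

    ratings maps link id -> star rating. After centering, links rated
    above the user's average are positive and those below are negative.
    """
    if not ratings:
        return {}
    mean = sum(ratings.values()) / len(ratings)
    return {link_id: rating - mean for link_id, rating in ratings.items()}


print(normalize_ratings({101: 5, 102: 5, 103: 5}))  # every value becomes 0.0
print(normalize_ratings({101: 5, 102: 1}))          # one liked, one disliked
```

Under this assumption, a user who gives everything 5 stars ends up with all zeros, which carries no more information than having no votes.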
I really like the interface. It has some features I wish HN had, like the ability to hide items from view. I've been meaning to write something like this for a while but never got around to it. Keep up the good work.
oh, please create a bookmarklet to let users submit stories while browsing, this is VERY IMPORTANT, and shouldn't take much effort (use the HN one as an example).
I'd like to feed the site with stories from here and create a Greasemonkey plugin to automatically rate items on your site when I vote them up here (if I can find a good way to vote up items programmatically on fyynd).
Bookmarklets: done. See http://fyynd.com/bookmarklets/. As for rating links programmatically: it is a simple POST to "http://fyynd.com/links/[link_id]/rate/" with a parameter "rating". "rating" should be a float between 0 and 5. A rating of 0 will delete that vote.
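For scripting that POST from Python, something along these lines might work. It is a sketch using only the standard library; the helper name build_rating_request is made up, and the request presumably needs to carry your login session (for example a cookie), which isn't described here and isn't handled below.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_rating_request(link_id, rating):
    """Build (but do not send) the POST request described above.

    Assumption: the endpoint takes a form-encoded "rating" field.
    Authentication is not covered here.
    """
    if not 0 <= rating <= 5:
        raise ValueError("rating must be between 0 and 5")
    url = f"http://fyynd.com/links/{link_id}/rate/"
    body = urlencode({"rating": rating}).encode("ascii")
    return Request(url, data=body, method="POST")


request = build_rating_request(197, 4.5)
print(request.get_method(), request.full_url, request.data)

# To actually send it (requires network access and a valid session):
# from urllib.request import urlopen
# with urlopen(request) as response:
#     print(response.status)
```

Remember that a rating of 0 deletes the vote, so a script that maps upvotes to ratings should avoid sending 0 by accident.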
I viewed your site and I had a case of information overload. I would suggest adding a feature where a person can type in what they like (cars, technology, etc.) so that you feed them what they want rather than dumping everything on them at once.