A Programmer's Guide to Data Mining

399 points by carlosgg over 11 years ago

18 comments

While scanning the table of contents, I thought this was simple stuff. But then I dived into a chapter and I was converted. This is good because it shows you how to use all those techniques in the real world, with examples mining data from Twitter and Facebook streams. Probably the best hands-on guide I've seen for data mining/sentiment analysis.

terramars over 11 years ago

before i say anything directly about the book, i'd like to point out that for simple systems (like these), the most challenging parts are overwhelmingly data collection, normalization / featurization, and model testing, rather than actually creating or using models. while there are rare cases where a simple solution (hey, let's throw naive bayes at it) will give you a good answer, these are almost always because someone did a very good job collecting and sanitizing the input. furthermore, stuff like the twitter movie sentiment analysis - while great in theory - rarely ends up doing what you expect in practice. product recommendation and collaborative filtering are proven to work very well in practice, but sentiment systems are a totally different monster.

onto the book - it looks promising for an intro to recommendation systems. no opinion about classification yet. doesn't appear to have anything on graphs or network effects which is somewhat disappointing. that being said i need to review bayesian stuff / teach myself some of the harder stuff and it will be nice to have a practical walkthrough.

that being said no one should be implementing these themselves (except the dumb stuff like distance metrics). it's useful to learn but scikit-learn is amazing when it comes to fancy algorithms.
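
The comment above singles out distance metrics as the one piece worth writing by hand. A minimal sketch of three common ones follows, assuming each user's ratings are stored as a dict from item name to score. The users, items, and ratings are made up for illustration and are not taken from the book.

```python
import math


def manhattan(a, b):
    """Sum of absolute rating differences over items both users rated."""
    shared = a.keys() & b.keys()
    return sum(abs(a[k] - b[k]) for k in shared)


def euclidean(a, b):
    """Straight-line distance over items both users rated."""
    shared = a.keys() & b.keys()
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in shared))


def cosine_similarity(a, b):
    """Cosine of the angle between the two rating vectors.

    Items a user has not rated count as 0, so the norms use all of
    that user's ratings while the dot product uses only shared items.
    """
    shared = a.keys() & b.keys()
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical ratings, for illustration only.
alice = {"film_a": 5.0, "film_b": 3.0, "film_c": 1.0}
bob = {"film_a": 4.0, "film_b": 3.0, "film_d": 2.0}

d_manhattan = manhattan(alice, bob)
d_euclidean = euclidean(alice, bob)
s_cosine = cosine_similarity(alice, bob)
```

Restricting the two distances to shared items is one possible choice; a real recommender would also need a policy for users with few or no items in common.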

ville over 11 years ago

This looks nice. I've also heard many recommendations for the book Programming Collective Intelligence [1], which touches on the same subjects and also has examples in Python. Now I'm tempted to read both :)

[1]: http://shop.oreilly.com/product/9780596529321.do

natebod over 11 years ago

Looking through chapter 6 on Bayesian Classifiers. I do not think it is correct from page 52. He appears to be using the p.d.f. of the standard normal distribution for point estimators. I have training in classical/frequentist stats, so correct me if I'm wrong, but probability estimates from a pdf are given by the area under the curve; the value at a point is meaningless. In fact the probability at a given point is always zero.
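
The distinction drawn above, between the height of a density curve and the probability of an interval, can be checked numerically. The sketch below uses only the standard library and is not the book's code; the function names are made up for illustration. Whether a density value is an acceptable stand-in for a likelihood when comparing classes is a separate question that this sketch does not settle.

```python
import math


def normal_pdf(x, mean, std):
    """Height of the normal density curve at x (a density, not a probability)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def normal_cdf(x, mean, std):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))


def interval_probability(lo, hi, mean, std):
    """P(lo <= X <= hi): the area under the density between lo and hi."""
    return normal_cdf(hi, mean, std) - normal_cdf(lo, mean, std)


# Standard normal: mean 0, standard deviation 1.
density_at_mean = normal_pdf(0.0, 0.0, 1.0)
point_probability = interval_probability(0.0, 0.0, 0.0, 1.0)
within_one_std = interval_probability(-1.0, 1.0, 0.0, 1.0)

# With a small standard deviation the density exceeds 1,
# which a probability never can.
narrow_density = normal_pdf(0.0, 0.0, 0.1)
```

Here `point_probability` is exactly zero because the interval has no width, while `density_at_mean` is about 0.399 and `narrow_density` is about 3.99.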

pigscantfly over 11 years ago

As an alternative for anyone who wants to delve a little further into data mining, I'm currently taking the Stanford data mining class, STATS202. The book we're using has been really great (published this year) and covers a great deal more than this site seems to. It's called "An Introduction to Statistical Learning with Applications in R." It's free online through the Stanford libraries, but I'm not sure about accessing it for free elsewhere. The lectures are also probably recorded online somewhere, if anyone is really interested.

SeppoErviala over 11 years ago

Check out gensim if you want to do topic modeling or similarity comparisons in Python.

http://radimrehurek.com/gensim/

It has good implementations of various algorithms, some of which support streaming or distribution, and it allows loading and dumping data in various formats.

I've used it for building a content-based recommender using tf-idf, LSI and a similarity index. After the index is built, queries to it are really fast. It can handle quite large corpora with little memory.
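
To make the tf-idf plus similarity idea concrete, here is a small plain-Python sketch. It deliberately does not use gensim or imitate its API; the weighting (term frequency times log(N / document frequency)), the function names, and the three-document corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn a list of token lists into a list of {term: weight} dicts."""
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized by whitespace.
corpus = [
    "data mining with python".split(),
    "python data mining guide".split(),
    "cooking pasta at home".split(),
]

vectors = tfidf_vectors(corpus)
sim_related = cosine(vectors[0], vectors[1])    # share three terms
sim_unrelated = cosine(vectors[0], vectors[2])  # share no terms
```

This version compares every pair directly, which only works for tiny corpora; a prebuilt index of the kind the comment describes is what makes queries fast at scale.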

crandles over 11 years ago

This is from one of my college professors. Nice to see it make it onto HN, and it looks like there's a bit more material since I used it in class. I found it helpful in explaining basic concepts (more so than the bland textbook that I had to pay for).

garraeth over 11 years ago

I just scanned it but it looks awesome! Thanks for putting this together! I didn't look terribly hard but did you mention the work of Ziegler and Golbeck: "Investigating interactions of trust and interest similarity"? It's a bit old (2006) but I think it's a great reference for real-world engines and helped me a ton back in the day.

sown over 11 years ago

This is neat! Fantastic, even! The math is less theoretical and more systems oriented. The choice of Python, modern pseudocode that runs, is great, too. The naive Bayes chapter is useful, too. One might want to look at Udacity's AI course for more info about this topic or as a supplement. Bayes seems to be one of those things where the math is short and difficult; I've been reading about it recently, myself. Just practice, I guess. To engineer stuff with it you may not need to understand it perfectly (until you get bugs ;). Anyways, it's still good. It's a hard topic and Bayes' law/tricks appear in AI often so it's worth knowing more about.

Thank you, Ron Zacharski!

(disclaimer: you do not want my opinion regarding any topic).

gautamnarula over 11 years ago

This looks great! Is there an email list or any other way I can get notifications as new material is added/revised?

sushirain over 11 years ago

After reading chapter two, my conclusion is that this book is also suitable for the high-school level. Not many books simplify things as much as this one. The Python implementation even avoids NumPy, which makes it very easy to understand (even though using NumPy is more practical).

frik over 11 years ago

Is the code also available in a C-like syntax? (C, C++, PHP, JS, etc.)

Porting Python code can be painful. (I checked the chapter 7 py file and it isn't filled with functional style code, though various kinds of arrays with index starting with 1 or so may still be an issue)

cmao3 over 11 years ago

My feeling is that it's a very interesting book even for high school kids.

karangoeluw over 11 years ago

This is awesome. How about a PDF with all chapters combined?

lovegratisbooks over 11 years ago

As of January 5, 2014, the pdf for this book will be available for free, with the consent of the publisher, on the book website.

nashequilibrium over 11 years ago

I actually went through this book almost two years ago, i remember the author did not finish it, but i enjoyed it! Thanks!

ewharton over 11 years ago

This is great - I love that it's in Python

LambdaAlmighty over 11 years ago

Not bad as an introductory text, but the code could use some love. Disappointing when it says "programmer's" in the title.

Ever heard of PEP8 for Python coding style? List comprehensions?

I'm afraid this falls in no man's land, with code too weak for practitioners and theory too weak for theoreticians.
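
For readers unfamiliar with the list comprehensions mentioned above, here is a hypothetical before-and-after. Neither function is taken from the book; min-max normalization is just a convenient small example, and both versions return the same result.

```python
def normalize_loop(values):
    """Min-max normalization written with an explicit loop and append."""
    lo = min(values)
    span = max(values) - lo
    result = []
    for value in values:
        if span == 0:
            result.append(0.0)  # all values equal, so nothing to scale
        else:
            result.append((value - lo) / span)
    return result


def normalize(values):
    """The same normalization written with list comprehensions."""
    lo = min(values)
    span = max(values) - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(value - lo) / span for value in values]


# Hypothetical input, for illustration only.
scores = [2, 4, 6]
scaled_loop = normalize_loop(scores)
scaled = normalize(scores)
```

Which form reads better is partly a matter of taste; a loop can be easier to follow for beginners, which may be why an introductory text would prefer it.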
