A Programmer's Guide to Data Mining

399 points by carlosgg over 11 years ago

18 comments

While scanning the table of contents, I thought this was simple stuff. But then I dived into a chapter and was converted. This is good because it shows you how to use all those techniques in the real world, with examples mining data from Twitter and Facebook streams. Probably the best hands-on guide I've seen for data mining/sentiment analysis.

terramars over 11 years ago

before i say anything directly about the book, i'd like to point out that for simple systems (like these), the most challenging parts are overwhelmingly data collection, normalization / featurization, and model testing, rather than actually creating or using models. while there are rare cases where a simple solution (hey, let's throw naive bayes at it) will give you a good answer, these are almost always because someone did a very good job collecting and sanitizing the input. furthermore, stuff like the twitter movie sentiment analysis - while great in theory - rarely ends up doing what you expect in practice. product recommendation and collaborative filtering are proven to work very well in practice, but sentiment systems are a totally different monster.

onto the book - it looks promising for an intro to recommendation systems. no opinion about classification yet. it doesn't appear to have anything on graphs or network effects, which is somewhat disappointing. that being said, i need to review bayesian stuff / teach myself some of the harder stuff, and it will be nice to have a practical walkthrough.

that being said, no one should be implementing these themselves (except the dumb stuff like distance metrics). it's useful to learn, but scikit-learn is amazing when it comes to fancy algorithms.
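
For reference, the "dumb stuff like distance metrics" mentioned above could look roughly like this minimal sketch. The function names are illustrative and not taken from the book; both functions assume two equal-length sequences of numbers.

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences (assumes len(a) == len(b))."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance (assumes len(a) == len(b))."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical rating vectors for two users over the same two items.
user_a = [1, 2]
user_b = [4, 6]
print(manhattan(user_a, user_b))
print(euclidean(user_a, user_b))
```

Anything beyond this (weighting, missing values, sparse data) is where a library tends to pay off.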

ville over 11 years ago

This looks nice. I've also heard many recommendations for the book Programming Collective Intelligence[1], which touches on the same subjects and also has examples in Python. Now I'm tempted to read both :)

[1]: http://shop.oreilly.com/product/9780596529321.do

natebod over 11 years ago

Looking through chapter 6 on Bayesian classifiers. I do not think it is correct from page 52 onward. He appears to be using the p.d.f. of the standard normal distribution for point estimators. I have training in classical/frequentist stats, so correct me if I'm wrong, but probability estimates from a pdf are given by the area under the curve; the value at a point is meaningless. In fact, the probability at a given point is always zero.
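
To make the distinction in this comment concrete, here is a small sketch using only the standard library. It is not the book's code, and the function names are made up for illustration. It shows that the density at a point is not a probability (it can exceed 1), while the probability of an interval is an area and shrinks to zero as the interval does. For what it's worth, Gaussian naive Bayes implementations commonly compare density values across classes as relative likelihoods, which is a different use from reading a density as a probability.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x. This is not a probability."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

def interval_probability(lo, hi, mean, sd):
    """P(lo <= X <= hi): the area under the density between lo and hi."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)

# A narrow distribution has a density above 1 at its mean.
print(normal_pdf(0.0, 0.0, 0.1))

# The area over an interval is a real probability; a zero-width interval has none.
print(interval_probability(-1.0, 1.0, 0.0, 1.0))
print(interval_probability(0.0, 0.0, 0.0, 1.0))
```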

pigscantfly over 11 years ago

As an alternative for anyone who wants to delve a little further into data mining, I'm currently taking the Stanford data mining class, STATS202. The book we're using has been really great (published this year) and covers a great deal more than this site seems to. It's called "An Introduction to Statistical Learning with Applications in R." It's free online through the Stanford libraries, but I'm not sure about accessing it for free elsewhere. The lectures are also probably recorded online somewhere, if anyone is really interested.

SeppoErviala over 11 years ago

Check out gensim if you want to do topic modeling or similarity comparisons in Python.

http://radimrehurek.com/gensim/

It has good implementations of various algorithms, some of which support streaming or distribution, and it allows loading and dumping data in various formats.

I've used it for building a content-based recommender using tf-idf, LSI and a similarity index. After the index is built, queries to it are really fast. It can handle quite large corpora with little memory.
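
As a rough illustration of the tf-idf similarity idea behind such a recommender, here is a standard-library sketch. It deliberately does not use gensim's API; the function names, the whitespace tokenizer, and the toy documents are all assumptions for illustration, and it skips LSI and any real indexing.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (a dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus, made up for this example.
docs = [
    "data mining with python",
    "python data mining guide",
    "cooking pasta at home",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shares terms with the first document
print(cosine(vecs[0], vecs[2]))  # shares no terms with the first document
```

A library such as gensim adds what this sketch leaves out: streaming over corpora that do not fit in memory, dimensionality reduction, and a prebuilt index for fast queries.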

crandles over 11 years ago

This is from one of my college professors. Nice to see it make it onto HN, and it looks like there's a bit more material since I used it in class. I found it helpful in explaining basic concepts (more so than the bland textbook that I had to pay for).

garraeth over 11 years ago

I just scanned it but it looks awesome! Thanks for putting this together! I didn't look terribly hard but did you mention the work of Ziegler and Golbeck: "Investigating interactions of trust and interest similarity"? It's a bit old (2006) but I think it's a great reference for real-world engines and helped me a ton back in the day.

sown over 11 years ago

This is neat! Fantastic, even! The math is less theoretical and more systems-oriented. The choice of Python, modern pseudocode that runs, is great, too. The naive Bayes chapter is useful, too. One might want to look at Udacity's AI course for more info about this topic or as a supplement. Bayes seems to be one of those things where the math is short and difficult; I've been reading about it recently, myself. Just practice, I guess. To engineer stuff with it you may not need to understand it perfectly (until you get bugs ;). Anyways, it's still good. It's a hard topic and Bayes' law/tricks appear in AI often, so it's worth knowing more about.

Thank you, Ron Zacharski!

(disclaimer: you do not want my opinion regarding any topic).

gautamnarula over 11 years ago

This looks great! Is there an email list or any other way I can get notifications as new material is added/revised?

sushirain over 11 years ago

After reading chapter two, my conclusion is that this book is also suitable for the high-school level. Not many books simplify things as much as this one. The Python implementation even avoids NumPy, which makes it very easy to understand (even though using NumPy is more practical).

frik over 11 years ago

Is the code also available in a C-like syntax? (C, C++, PHP, JS, etc.)

Porting Python code can be painful. (I checked the chapter 7 py file and it isn't filled with functional-style code, though various kinds of arrays with indexes starting at 1 or so may still be an issue.)

cmao3 over 11 years ago

My feeling is that it's a very interesting book, even for high school kids.

karangoeluw over 11 years ago

This is awesome. How about a PDF with all chapters combined?

lovegratisbooks over 11 years ago

As of January 5, 2014, the pdf for this book will be available for free, with the consent of the publisher, on the book website.

nashequilibrium over 11 years ago

I actually went through this book almost two years ago. I remember the author had not finished it, but I enjoyed it! Thanks!

ewharton over 11 years ago

This is great - I love that it's in Python

LambdaAlmighty over 11 years ago

Not bad as an introductory text, but the code could use some love. Disappointing when it says "programmer's" in the title.

Ever heard of PEP8 for Python coding style? List comprehensions?

I'm afraid this falls in no man's land, with code too weak for practitioners and theory too weak for theoreticians.
