Surviving Data Science at the Speed of Hype

224 points by mistermcgruff, over 10 years ago

16 comments

mwetzler, over 10 years ago
I'm a data scientist who works with companies on their analytics problems every day. This article is spot on.

By far the biggest factor influencing the success of an analytics project is that the company has a *human* who has the time and inclination to think and reason about the business. They figure out what questions are important to ask and then go look at the data to see what they find. Collecting the data is the easy part. There is no analytics product that asks & answers your most important business questions for you.

I enjoyed the jab at predictive modeling; it's almost comical how many companies dream about predictive modeling when they haven't yet got basic tracking in place for what's _already_ happening in their business.

Love the post, thanks for sharing.

rm999, over 10 years ago
Good article. The author is completely correct that people often underestimate the fragility of predictive models, and that summary analysis (I group this into a more general concept called "insights") is simpler and more robust. I think the article is a little harsh towards predictive models though.

The primary difference between a model and an insight is that insights require a human to process; anything more automatic is a model. Insights are easy to implement and are great for finding patterns and anomalies (the human mind is basically designed to pick these out). But the human element makes insights less scalable, with significantly higher latency. For some problems these are unacceptable tradeoffs, and this has little to do with how stable a company's environment is. It's purely a product/strategy question, and about understanding all the tradeoffs.

LargeWu, over 10 years ago
I once worked at a major big box retailer where somebody came up with a visualization that purported to show, for a given product category, purchases made in other categories. One surprising purchase correlation was that customers bought TV stands after buying DVD players. So, this nugget was trumpeted at countless meetings about the value of big data analytics. Multiple marketing campaigns were designed around this discovery.

Of course, that made no sense, so I checked a little deeper. You know what else people also buy when they buy DVD players? TVs. The DVD/furniture relationship was an artifact of the high degree of correlation between TVs and DVD players, which the visualization tool failed to account for.

I brought this up immediately, but received a tepid response. Of course, months later, I was still hearing about DVD players and furniture. It had become part of the institutional lore, and no facts were going to replace that.
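
A minimal sketch of the kind of check LargeWu describes, using hypothetical toy baskets invented purely for illustration. The item names, the counts, and the `stand_rate` helper are all assumptions, not data or code from the retailer. It compares TV stand purchase rates for DVD player buyers and non-buyers, first overall and then with TV purchase held fixed.

```python
# Hypothetical toy baskets, invented for illustration only.
# By construction, half of TV buyers also buy a TV stand,
# whether or not they buy a DVD player.
baskets = (
    [{"tv", "dvd_player", "tv_stand"}] * 4
    + [{"tv", "dvd_player"}] * 4
    + [{"tv", "tv_stand"}] * 1
    + [{"tv"}] * 1
    + [{"dvd_player"}] * 2
    + [{"batteries"}] * 8
)


def stand_rate(baskets, require=(), forbid=()):
    """Share of baskets containing a TV stand, among baskets that hold
    every item in `require` and no item in `forbid`."""
    selected = [
        b for b in baskets
        if all(item in b for item in require)
        and not any(item in b for item in forbid)
    ]
    if not selected:
        return None  # no baskets match the filter
    return sum("tv_stand" in b for b in selected) / len(selected)


# Naive view: DVD player buyers look far more likely to buy a TV stand.
print(stand_rate(baskets, require=("dvd_player",)))  # 0.4
print(stand_rate(baskets, forbid=("dvd_player",)))   # 0.1

# Holding TV purchase fixed, the DVD player makes no difference.
print(stand_rate(baskets, require=("tv", "dvd_player")))             # 0.5
print(stand_rate(baskets, require=("tv",), forbid=("dvd_player",)))  # 0.5
print(stand_rate(baskets, require=("dvd_player",), forbid=("tv",)))  # 0.0
```

In this toy data the gap between 0.4 and 0.1 vanishes once baskets are split by TV purchase, which is the pattern the comment says the visualization tool failed to account for.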

mfdupuis, over 10 years ago
Very good post. Refreshing.

I think that the hype and buzzwords around Big Data and data science cause more than just bad business decisions. I believe they are also damaging the industry and creating a larger sense of disillusionment (I'm mostly thinking of "deep learning"). Not sure what this means for data science in the long term though, just thinking out loud.

I'll also add that I frequently see sledgehammers being used to hang a picture frame. By that I mean using huge clusters to run algos that would actually run in Tableau, Excel, etc.

threeseed, over 10 years ago
Firstly, someone needs to explain to me why smart people get worked up over vendor marketing. Since the beginning of time it has always been about exaggerated claims and bold, specific numbers (e.g. 80% better), and it always targets those who make purchasing decisions. Do people really expect them to say, "Hey, our product is great but you know you probably don't need it. But maybe buy it anyway?"

Secondly, the author seems to have conflated two different parts of the data science picture. Yes, great analysts who do amazing work are important. But that work relies on having data (a) available and (b) in the right format. For those of us doing significant volume ingestions it is not trivial to do this. Hadoop is painfully slow, and overall data science end-to-end tooling is slow, fragmented and incomplete. Some of us do need vendors to be bold and come up with new technologies/approaches.

And the point about IBM is just stupid. Did you ever think that maybe Watson DID help them slow their sales losses? Weird that a data scientist would make predictions based on inadequate data.

tel, over 10 years ago
This is perhaps the first halfway sensible post on "big data" or "analytics" that I've seen hit the front page of HN in a *long* time.

qthrul, over 10 years ago
Timely. I did a "big data" presentation yesterday and hoped to convey how important it was to read original source materials to form opinions and avoid the hype.

Since slide decks get busy I moved my bibliography of links to a gist. So, while it didn't factor into my presentation, I've now added this blog post. :-)

https://gist.github.com/JayCuthrell/8bcd9597d37a8602c639

dchuk, over 10 years ago
I just love the way this guy writes. His book, Data Smart, is hands down the most approachable intro to data science you could ever possibly read if you don't have sufficient math background to dive into full-on textbooks. And it's hilarious too.

kiyoto, over 10 years ago
First of all, John Foreman is great. Read his book "Data Smart" and http://analyticsmadeskeezy.com/blog/

(disclaimer: I am in no way tied to John Foreman. Also, I work at a company that provides a data processing/collaboration SaaS...for big data! http://www.treasuredata.com)

A quote from the OP:

> If your business is currently too chaotic to support a complex model, don't build one. Focus on providing solid, simple analysis until an opportunity arises that is revenue-important enough and stable enough to merit the type of investment a full-fledged data science modeling effort requires.

This is consistent with what we see in our customers. The use cases we see most for processing big data boil down to generating reports.

Generating reports may sound really prosaic, but as I learned from our customers, most organizations are very, very far from providing access to their data in a cogent, accessible manner. Just to generate reports/summaries/basic descriptive statistics, incredibly complex enterprise architectures have been proposed, built by a cadre of enterprise architects and deployed with obscenely high maintenance subscription fees billed by various vendors. That's the reality at many companies.

As bad and confusing as the buzzword "big data" is, one good byproduct is that it has forced slow-moving enterprises to rethink their data collection/storage/management/reporting systems.

Finally, I am starting to see folks do meaningful predictive modelling on top of large-ish data (on the order of terabytes). Some of them are our customers at Treasure Data, some aren't, but they are definitely not "build[ing] a clustering algorithm that leverages storm and the Twitter API" but actually doing the hard work of thinking through how (or if) the data they collect is meaningful and useful.

And that's a good thing.

tbjohns, over 10 years ago
An important distinction is that the author's experience is mostly with the businessy side of data science, and his jab is at people who use buzzword tools that add complexity rather than simple solutions.

In defense of the hype, many tools like storm are worth their hype many times over when used for the right application.

The author makes this distinction, but it can easily be lost in the post.

Fomite, over 10 years ago
I'm a working scientist, rather than someone in the corporate world, but this rings true for me as well. During a recent outbreak, we've had very fast turnaround demands, and while we've done great work in that time, I think some of our best ideas have come from being able to slow the hell down and think.

muser, over 10 years ago
There's lots written in the credit scoring space that I think other industries could look at, especially when it comes to calibration of models. It doesn't matter if the prediction is weak as long as it is consistent over time periods. Banks rely on this consistency to ensure they are provisioning properly for losses.

maxxxxx, over 10 years ago
I view IBM or especially HP jumping on a bandwagon as a strong negative signal for that technology.

whatsgood, over 10 years ago
all this will change with the internet of things. once every "thing" is networked, then these optimization platforms won't need to wait for some human to input info about altered environments. the platform will "sense" it.

nartz, over 10 years ago
Amen

trhway, over 10 years ago
> And that is not primarily a tool problem.

> A lot of vendors want to cast the problem as a technological one. That if only you had the right tools then your analytics could stay ahead of the changing business in time for your data to inform the change rather than lag behind it.

many people like the author just don't get it and it is fine. The same way people didn't get search before Google.

> But how do I feel good about my graduate degree if all I'm doing is pulling a median?

the graduate degree is what allows you to receive $Nx10e5/year (for a respectable value of N) for that pulling of a median

> If your goal is to positively impact the business, not to build a clustering algorithm that leverages storm and the Twitter API, you'll be OK.

on the other hand if your goal is power(OK, OK) instead of just OK then the clustering algorithm/storm/twitter is the way to go.