These types of posts validate my concern about the people entering my field right now.<p>Data science, as a line of work, is distinct from other technical roles in its focus on <i>creating business value using machine learning and statistics</i>. This quality is easily observed in the most successful data scientists I've worked with (whether at unicorn startups, big companies like my current employer, or "mission-driven" companies).<p>Implicit in this definition is <i>avoiding the destruction of business value by misapplying ML/statistics</i>. In that sense, I am concerned about blog posts like these (which list 50 libraries and zero textbooks or papers) and about commenters who question the relevance of "real math" in the era of computers.<p>Speaking bluntly: if you are a "data scientist" who can't derive a posterior distribution or explain the architecture of a neural network in rigorous detail, you're only going to solve easy problems amenable to black-box approaches, which is code for "toss things into pandas and throw sklearn at it". In that case, I would look for a separate line of work.
The roles of statistician and data scientist are not substitutes but more like complements. This guy definitely <i>is</i> a data scientist. Here are some ways to tell:<p>- He works on non-mission-critical components, e.g. he's not doing statistics for when the wing will fall off your airplane, but he can help you figure out business problems more open to interpretation, e.g. subject line open rates.<p>- His publishing tools favor flair over convention, e.g. Ctrl+F for "latex" has zero results, but he does have D3, C3, Bokeh, and, surprisingly, no Tableau.<p>- I'm not sure he even references a single classical statistics package. The vast majority of people publishing in the social sciences or "old school" life sciences are using Minitab, JMP, R, or SAS (correct me if I'm wrong, please, it's an outsider's perspective).<p>This skillset is neither inherently "cutting edge!" nor deceptively "all talk, no walk". They really are completely different roles that use some of the same tools, formulas, and jargon. To cut to the heart of it: when a company builds a plane and says "I wonder how unlikely it would be for the wing to fall off?", that creates the demand for a statistician. When a company is trying to out-compete others, or maximize profit or charitable effectiveness, often in a service or a field that is heavily influenced by human psychology, that creates the potential for a data scientist to add value.
Cassandra is mentioned. I agree it's great for storing metadata and can be used to build efficient graph implementations, but it's cited for Graphs and Relationships? I think that can be misleading, as Cassandra is a distributed, column-based key-value store.
Am I the only one who came here looking for someone's experience as a tool set? For a second there I thought I might have stumbled over real honesty, a rare treat these days. Maybe, if we stop putting each other in stupid labeled boxes to please our bullshit-peddling masters, we would get somewhere...
From the article:<p>> Machine learning and data mining are not well distinguished, but machine learning techniques increasingly favor “unsupervised” learning algorithms.<p>The statement above puzzles me because it does not align with what I see in the news. Maybe I'm just uninformed, so please let me know if I'm wrong.<p>According to what I read in the news:<p>1 - Almost all of the recent ML developments that I can think of are in the field of supervised learning / reinforcement learning.<p>2 - The only field that I can think of where unsupervised learning techniques are prevalent is data mining, which is precisely why I see it as a very specific field.<p>Am I missing something?
I'm not sure this is what a data scientist is. It was supposed to be a research scientist (which is where the <i>scientist</i> part came from) who wrangles data and code. This individual should have both domain knowledge and coding chops, and should know how to conduct research.