Hadoop “fails us”

62 points by cdvonstinkpot about 8 years ago

9 comments

darose about 8 years ago

This really strikes me as more of a marketing piece by Snowflake than a well-researched piece of reporting. The article mostly just quotes one person - Bob Muglia - who is, as they say on Wall Street, "talking his book", i.e. giving an opinion that is not coincidentally in line with his own financial interests. Sure, Hadoop is getting old, and is quickly being replaced by Spark. But loads of organizations have used, and continue to use, Hadoop/Spark successfully. And the part about Kafka replacing Hadoop/Spark is just silly. They're completely different technologies, used for very different purposes, and many organizations use both side by side.

c3534l about 8 years ago

> Hadoop is great if you’re a data scientist who knows how to code in MapReduce or Pig

I guess that kind of makes sense. What programmers like might not necessarily be a good basis for the long term. I do have to say, though, that part of the failure of Hadoop is that it generated a lot of hype, and so better tools and alternatives developed to meet that need. So when you're saying a better alternative is Spark or Kafka, I feel like it's almost as if you're saying "Oracle is a failure that never materialized the promised benefits. Instead we should be using Postgres and MySQL."

But I also know that with a lot of big data hype, businesses were wanting to do Hadoop and NoSQL and all this stuff because it was the cool new thing, not because they actually needed it. I've heard data scientists make the joke that every business thinks they need these tools because they're having difficulty running their business out of a spreadsheet.

I think it's important to remember that for most businesses, Spark, Hive, whatever; those aren't the right tools, either. SQL is still what most companies need. Businesses want machine learning, but usually what they need is boring old statistics. In an industry that always wants to be ahead of the curve, we tend to forget that it's not always the right thing to have the newest toys. Sometimes companies do well utilizing the latest and greatest, but sometimes they just use what already exists wisely. I suspect that for many of those companies that used Hadoop and felt it didn't work for them, the problem wasn't that they needed Spark instead, it was that they were trying to solve problems that didn't exist.

Man, I'm too young to sound this old. But, eh, yeah, we need to respect our elders' technology first, and consider the newest stuff only when we have a definable need for it.

75dvtwin about 8 years ago

My observation is that traditional database technologies are transforming themselves into 'hybrids' (as far as document-oriented data types go).

An example is Postgres. It now has JSONB as a document-oriented field type. It now has Postgres-XL as a horizontally scalable ACID database. It will have, by approximately August, the ability to maintain views in memory (aka the lambda architecture speed layer) via the PipelineDB extension, for fast streaming analytics.

It seems that a combination of Postgres (with extensions) + Kafka + Redis is a strong stack for a lambda architecture and, as the initial data hub, a component of the overall puzzle.

Spark (or even Python + Dask), meanwhile, can be viewed as distributed data analytics platforms that replace 'non-UI-centric' BI. I think UI-centric BI (e.g. ad hoc reports/visualizations) is still going to be dominated within enterprises by Tableau/QlikView type solutions.

For traditional BI-oriented data marts (that are organized downstream from the data hub), probably traditional column-oriented databases, and the new open source ones, make sense.

To me, the promise of Hadoop being a silver bullet for 'all the big data needs' was always nothing more than unsubstantiated hype.

So it definitely failed the ones who believed the hype, but did not fail others who did not buy into it.

samspenc about 8 years ago

I worked on Hadoop and HBase extensively from 2011 - 2013, working on engines processing 30 billion raw data points a month and storing a subset of those, and then we migrated to other Big Data technologies. Just wanted to add my thoughts here.

Hadoop (and its general ecosystem, which includes HBase) is a fairly good idea. Its core ideas - map/reduce on Hadoop, and a large distributed key/value store for HBase - are actually pretty solid.

And for many years, there were simply no alternatives to Hadoop. Think of the years 2008 to 2012/13. If you had to process terabytes or petabytes of data, what were your solutions? No wonder Yahoo and Facebook (and others) put so much effort into their Hadoop solutions.

But, IMHO, there were several issues with Hadoop and its ilk.

1. The core infrastructure wasn't stable enough. Hadoop / HBase were supposed to be distributed systems, and they worked well, but small failures could bring down your entire cluster. Given that Hadoop and HBase were being used in mission-critical systems in the cloud, and given the amount of DevOps or sys-admin work that went into maintaining these, I'm not surprised people eventually migrated to distributed systems that were easier to maintain and run.

2. There are now plenty of "hosted on the cloud" solutions such as Amazon DynamoDB or similar cloud solutions. When your company depends on 99.99% or similar SLAs, you don't want to have downtime on your database systems and spend time debugging complicated core dumps on your Hadoop or HBase clusters when you can just store it "on the cloud" and be done with it. Sure, there's a higher price point, but those are the trade-offs you live with.

3. If you want to be in-house, there are plenty of alternatives out there as well today: Apache Spark for processing, Kafka for a messaging bus / streaming data, Elasticsearch for large-scale storage with multiple indices. Many of them are much more robust than Apache Hadoop / HBase, and I'm not surprised they've gotten more traction recently.

Ultimately, I think Hadoop / HBase are just showing their age. They were fantastic for the first wave of Big Data technologies, and you had little alternative if you were building large-scale systems circa 2008 to 2013, but now you just have a plethora of choices from various vendors.
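
For anyone who hasn't seen the map/reduce idea mentioned above, here is a minimal single-process sketch in plain Python. It is only an illustration, assuming a word-count job: it does not use Hadoop's API, nothing in it is distributed, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop Spark", "spark kafka spark"]))
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network, which is where much of the operational pain described above tends to come from.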

ghc about 8 years ago

The article is on point about how we need better data infrastructure to support data scientists and analysts. In the past I've worked to develop very scalable data infrastructure to support data science workloads on high-variety sensor data, but it always felt like the only reason we were doing it was that nobody developed tools made for companies like ours.

We can build better data infrastructure for data scientists, but in practice it's hard to sell "10x easier to use" into organizations with Hadoop or half-broken bespoke infrastructure, because the IT groups running the show don't really care that they're making their data scientists miserable.

If change is to come, it's going to have to be from data scientists embedded within business units demanding better tools, because Hadoop works just fine if you don't care how hard it is for your users to access their data.

ktamura about 8 years ago

Hadoop didn't fail us. We failed Hadoop.

The decline of Hadoop as a software category is Software Product Marketing 101: it did not identify pervasive killer use cases critical to running a business. Yes, it's true that Hadoop was a revolutionary way to store and process massive datasets on commodity hardware, but what's the use case for that? If you are Visa/AMEX (fraud detection), Facebook/Google (various ML-based data products) or one of a few other types of companies with obvious applications of massive data processing, yes, Hadoop has been great.

But here's the thing: beyond a few such corner cases, it never found a use case that an enterprise data warehouse couldn't handle.

Then came Redshift, then BigQuery, and now Snowflake (as a BigQuery on AWS, really). While there are some key technical differences between Redshift and BigQuery/Snowflake, they are all _much_ cheaper than the previous generation of data warehouses (Vertica, Netezza, Greenplum, etc.). The lower price meant greater access, and developers who previously couldn't imagine using data warehouses could finally spin one up with a credit card swipe.

Hadoop, too, took a lot of collateral damage, because many developers realized that they didn't need much of Hadoop beyond SQL-on-Hadoop.

Redshift was a beautiful feat of product strategy and marketing: they just took what used to cost a lot and offered it for much less in an environment where developers already had a lot of data (AWS). This was much simpler to execute than what Hadoop had to do: introduce new technology, identify use cases, and finally compete with incumbent solutions.

We failed Hadoop (as you can see from www.cloudera.com, even Cloudera, the Hadoop company, hardly mentions Hadoop on its top page). Not the other way around.

nazilla12 about 8 years ago

Disclaimer: I work for a corporate IT consulting giant.

The trend I'm seeing in the big data sphere in my company is a by-and-large move away from technologies that implement MapReduce and complicated batch processing with HDFS as a data store. More and more customers want insight as soon as they get/produce their data, so we've seen a particularly large increase in interest in technologies such as Kafka/Samza and Spark/PySpark.

I see a trend toward Kafka, but I think the community needs to jump behind it too, and keep it as a pipeline tool and not a querying engine.

I don't see Hadoop-based solutions going away any time soon though.

shmerl about 8 years ago

What about competition like HPCC[1]?

[1]: https://en.wikipedia.org/wiki/HPCC

frik about 8 years ago

Kafka relies on ZooKeeper, which comes from the Hadoop ecosystem. ZooKeeper is not so great.