Is the "modern data stack" still a useful idea?

129 points by tim_sw over 1 year ago

13 comments

I'm a newly minted head of analytics who transitioned from a different domain, so I never had to muck my way through the MDS but attentively watched others from the sidelines over the last few years. Best I can tell, "the modern data stack" is just a marketing phrase invented by a cadre of vampire vendors. The lessons I learned watching others translated into a few simple requirements for our nascent "stack": transparent pricing I can reason about and divvy up, as many integrations as possible so I can minimize rolling my own, and a straightforward framework for ETL code. These three requirements plainly disqualify most of the MDS universe.

With the benefit of starting basically from scratch and not having to mess around with real-time analytics, it's pretty easy to ignore the MDS vendors. So far I've landed on BigQuery, Airbyte, GitHub, BI Engine, Looker Studio, and Pandas 2.x or DuckDB for local stuff. I send as many things as possible straight to BQ, lock junior analysts out of gigantic tables, archive periodically to partitioned parquet files in cold storage, use mostly turnkey integrations, and ruthlessly prioritize custom ETL jobs. Putting GitHub in the mix isn't super ergonomic and we may be in the market for new tools once we cross the "big data" frontier, but that'll be a while from now. I'll probably never know or care what the MDS vendors think I'm missing.

dm03514 over 1 year ago

Sorry, nothing positive to say here.

I’ve been using dbt and MDS for nearly 3.5 years and I believe the entire approach is profoundly broken. There’s really nothing “modern” about it, especially compared to software engineering.

https://on-systems.tech/blog/135-draining-the-data-swamp/

Building on extracted operational data is hard at best and a business-altering security liability at worst.

I believe the MDS trails at least a decade behind modern software engineering practices, lacking industry guidance and generic tooling to support: CI/CD, versioned deployment artifacts, zero downtime deployments, unit testing, observability, monitoring and alerting.

MDS data engineering is a meme in the industry; the expectation of “100%” combined with the lack of modern tooling makes success really hard to achieve.

Lyngbakr over 1 year ago

I've worked in data for a while now as a Data Engineer, Data Scientist, and Director, and working with experienced software engineers has highlighted to me that most of the data stack is fluff. All the layers/tools that are heaped upon one another just lead to complexity and dependence on paid services. I'm not advocating for reinventing the wheel, but rather that bespoke solutions seem to be worth the cost. In short, I'd prefer to spend the budget on experienced developers who can build maintainable systems than on a plethora of MDS tools. YMMV, though.

karakanb over 1 year ago

Disclaimer: I am the co-founder of a competitor, Bruin (https://getbruin.com). We are exactly the kind of integrated platform Tristan is talking about.

The article resonates with me a lot, and it is because we called this out months ago. I find the idea of Tristan walking back on the premise of MDS and claiming it to be not useful anymore funny because they were one of the main drivers of the term and the whole hype around it.

He even acknowledges that they played ball with other companies in the space:

> There was a lot of valuable co-marketing, partnership deals, co-sponsored events, and co-selling. This had real value for everyone involved—customers and vendors alike. Companies voluntarily integrated their products together, cross-promoted each other publicly, and built partnerships that made owning and operating these technologies far easier for customers.

Sorry, but no, this didn't have value for the customers, only for the vendors. The customers were left alone by themselves to deal with all of this complexity, and the vendors made a lot of money off of that. They convinced companies that they needed a bunch of different tools to build a simple pipeline, and rode on the wave of huge valuations based on these same ideas that they are walking back on.

The companies prefer integrated solutions now because they woke up. Instead of paying 150k/y each to Fivetran, dbt and whatnot, they realize that they are better off just hiring an engineer or two in the worst case. It is 2024, and none of these tools still properly talk to each other. Do you want to get an end-to-end lineage of your data? Good luck with that. How about quality? How about governance? Companies are left alone with this hype cycle.

I'd claim that a significant part of the blame lies on the executives and leaders in the companies, who just jumped on the ship for the sake of building their CVs and skipped the critical thinking step. None of them seriously asked themselves the question of whether or not it makes sense.

To be honest, I feel sorry for all the money spent on building solutions around all of these hype-driven products.

jonmoore over 1 year ago

The Modern Data Stack / MLOps product space was succinctly described by one actually-technical CEO as "vending into ignorance"; the author corroborates this with a commendably candid take:

> Imagine it’s 2021, peak MDS, and you meet the CDO of a large bank. “Oh cool,” she says, “you’re the CEO of a tech company. What does your product do?” What do you say?

> “We build a tool that leverages the power of the cloud to apply standard SQL and software engineering best practices to the historically mundane (but critical!) job of data transformation.”

> “We’re the standard for data transformation in the modern data stack.”

> I will tell you that, empirically, option #2 is more effective.

This tallies with what I've seen from a lot of enterprise CxOs and their teams as technology hype moved from big data and blockchain and onto data science/machine learning.

There is so much to write about this, but I'll just recommend "Life Cycle of a Silver Bullet" (http://freyr.websages.com/Life_Cycle_of_a_Silver_Bullet.pdf), which deserves more attention than it's had on HN.

gms over 1 year ago

I've been in data and analytics for over a decade and co-founded a consolidated-ETL company (https://www.polytomic.com).

Nice to see Tristan realising this (people should be commended for changing their minds). There were three problems with this term:

1. It mostly resonated with VCs and industry observers whose jobs are to peddle in buzzwords, rather than users who simply want their problems solved.

2. It was (and is) ill-defined. Ask a group of people in the industry to define it and you're guaranteed to get different answers.

3. It committed the cardinal sin of using the adjective 'modern' in a noun. At some point, everything today stops being modern. Couple this with (2) and your term is now even more meaningless.

At a mundane level there's no drama here: just another example in the long list of noise contributors to the free-money party that was going on during the Covid years (as the essay acknowledges).

williamcotton over 1 year ago

It seems that there’s a philosophy with these kinds of cloud data services: measure everything, then figure out what to measure after.

It seems that the better approach is to make a hypothesis first, collect only the data you need, and then analyze.

Otherwise, the indirect costs of retooling around “measure all the things indiscriminately” and the direct costs of paying for such services don’t seem worth it.

Can someone offer a reasonable rebuttal?

Animats over 1 year ago

Oh, it's something for the ad industry.

Ads are watched only by people who don't have good ad blockers.

m0llusk over 1 year ago

compiled a quick list of terms to help understand this piece:

MDS - modern data stack
BI - business intelligence
Redshift - amazon redshift data warehouse service product; handles large scale data sets and database migrations; can handle analytic workloads on big data sets with a column oriented approach; built on top of massive parallel processing (MPP) technology from data warehouse company ParAccel (later acquired by Actian)
ETL - extract, transform, load
ELT - extract, load, transform -- an alternative to ETL that stores raw data
Looker - does BI, started with looker data sciences, acquired by google
Tableau - data visualization focused on business intelligence
clickstream data - user website navigation records
snowflake - an MDS data platform solution
mongo - nosql database
datadog - monitoring, analytics for devops
confluent - real time data streams
databricks - web based cluster management and data lakes for machine learning
meme-ification - reduction of an idea to cartoons
peak - highest point preceding drop off
fivetran - data ingestion
dbt - dbt labs, data transformation
CDO - chief data officer
co-marketing - integrated marketing such as linked brands
co-sponsored - cooperative sponsorship
co-selling - shared sales assets including data records
ARR - annual recurring revenue
ZIRP - zero interest rate policy
private multiples - private company valuation calculations, metrics
forward revenue - revenue expected in the future
PowerBI - microsoft business analytics

hn72774 over 1 year ago

DBT in simplest terms is just a way to orchestrate transformations in dependency order. Table "A" needs to be updated before table "B," because "B" selects from "A." It works well for that.

I've seen it used to get wild west SQL logic into version control. To replace scheduled SQL workbooks running entire data pipelines.

I've also seen over-engineered integrations with other orchestration tools in the "stack."

Once a DBT project grows to a certain size, roughly 500-1000 .SQL files, it gets hard to manage. Not impossible, though it takes intentionality about how to group things together and scale them out operationally.

"Slim CI" is a recent buzzword that has some nice ideas about build automations.

Cosmos is supposed to solve a lot of the automation gripes with dbt. I haven't tried it yet but would like to. https://www.astronomer.io/cosmos/
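
The dependency ordering described in this comment (table "A" must be updated before "B" because "B" selects from "A") amounts to a topological sort over the table graph. Below is a minimal sketch of that idea only. It assumes three hypothetical tables and uses Python's standard-library `graphlib`; it is not dbt's own resolver, and the `build_order` helper and the table "C" are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical project: each table maps to the set of tables it selects from.
# "A" and "B" mirror the comment above; "C" is an invented third table.
dependencies = {
    "A": set(),
    "B": {"A"},
    "C": {"A", "B"},
}


def build_order(deps):
    """Return a run order in which every table comes after the tables it selects from."""
    # TopologicalSorter treats each value as the set of predecessors of its key
    # and raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(deps).static_order())


order = build_order(dependencies)
print(order)  # prints ['A', 'B', 'C']
```

With only three tables the order is obvious by eye; the point of delegating it to a sorter is that the same call keeps working, and fails loudly on circular references, when a project reaches the hundreds of files mentioned above.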

datadrivenangel over 1 year ago

The Modern Data Stack is Dead! Long Live the Analytics Stack!

Interesting to see one of the strongest popularizers of the term acknowledging this shift. The free money is over, so we need to get back to work and dbt is the least bad way to organize lots of SQL for data management.

extr over 1 year ago

What would HN commenters define as a truly modern data stack right now? Mostly asking what this chart https://a16z.com/emerging-architectures-for-modern-data-infrastructure/ would look like in 2024.

tomrod over 1 year ago

Intriguing that the CEO of DBT declares MDS ("modern data stack") to be a meaningless term (i.e. no longer useful). Maybe a good demarcation that we are entering the post-modern phase of cloud tech generally? Cloud still there, but not the only option.

I look forward to this happening with GenAI -- the tools we come up with will be pretty cool in the long run. I hope we can find ways around platform enshittification because that really sucks. Similar for blockchain tech.

I see a common thread in these techs too -- the same type of LinkedIn influencers shop these during their hype phases. Ultimately, useful technology and best practices come out of it!
