Emerging Architectures for Modern Data Infrastructure

431 points by soumyadeb over 4 years ago

14 comments

This article has a large gap in the story: it ignores sensor data sources, which are both the highest velocity and highest volume data models by multiple orders of magnitude. They have become ubiquitous in diverse, medium-sized industrial enterprises and it has turned them into some of the largest customers of cloud providers due to the data intensity. Organizations routinely spend $100M/year to deal with this data, and the workloads are literally growing exponentially. Almost no one provides tooling and platforms that address it. (This is not idle speculation, I’ve run just about every platform you can name through lab tests in anger. They are uniformly inadequate for these data models, everyone relies on bespoke platforms designed by specialists if they can afford the tariff.)

If you add real-time sensor data sources to the mix, the rest of the architecture model kind of falls apart. Requirements upstream have cascading effects on architecture downstream. The deficiencies are both technical and economic.

First, you need a single ordinary server (like EC2) to be able to ingest, transform, and store about 10M events per second continuously, while making that data fully online for basic queries. You can’t afford the latency overhead and systems cost of these being separate systems. You need this efficiency because the raw source may be 1B events per second; even at that rate, you’ll need a fantastic cluster architecture. Most of the open source platforms tap out at 100k events per second per server for these kinds of mixed workloads and no one can afford to run 20k+ servers because the software architecture is throughput limited (never mind the cluster management aspects at that scale).

Second, storage cost and data motion are the primary culprits that make these data models uneconomical. Open source tends to be profligate in these dimensions, and when you routinely operate on endless petabytes of data, it makes the entire enterprise problematic. To be fair, this is not to blame open source platforms per se, they were never designed for workloads where storage and latency costs were critical for viability. It can be done, but it was never a priority and you would design the software very differently if it was.

I will make a prediction. When software that can address sensor data models becomes a platform instead of bespoke, it will eat the lunch of a lot of adjacent data platforms that aren’t targeted at sensor data for a simple reason: the extreme operational efficiency of data infrastructure required to handle sensor data models applies just as much to any other data model, there simply hasn’t been an existential economic incentive to build it for those other data models. I've seen this happen several times; someone pays for bespoke sensor data infrastructure and realizes they can adapt it to run their large-scale web analytics (or whatever) many times faster and at a fraction of the infrastructure cost, even though it wasn't designed for it. And it works.
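A back-of-the-envelope sketch of the server counts implied by the rates quoted above (1B events/s at the source, about 10M events/s per efficient server, about 100k events/s per typical open source server). The function name and constants are illustrative only, and the calculation assumes perfectly even sharding with no replication or headroom, which a real cluster would not get.

```python
def servers_needed(source_events_per_sec: int, per_server_events_per_sec: int) -> int:
    """Ceiling division: servers required just to keep up with the source rate."""
    return -(-source_events_per_sec // per_server_events_per_sec)


# Rates taken from the comment above; everything else is an assumption.
SOURCE_RATE = 1_000_000_000      # raw source: 1B events per second
EFFICIENT_SERVER = 10_000_000    # target: about 10M events per second per server
TYPICAL_OSS_SERVER = 100_000     # typical open source: about 100k events per second per server

if __name__ == "__main__":
    print(servers_needed(SOURCE_RATE, EFFICIENT_SERVER))    # 100
    print(servers_needed(SOURCE_RATE, TYPICAL_OSS_SERVER))  # 10000
```

Raw division gives 10,000 servers at 100k events/s each, so the 20k+ figure in the comment presumably includes headroom, replication, or a lower effective per-server rate, none of which is modeled here.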

ethanwillis over 4 years ago

While this is an article about data infrastructure, I feel like we're missing the forest for the trees.

What is most important here, in my opinion, is that the underlying data is useful. If your underlying data wasn't collected, wasn't collected properly, or, even worse, the wrong data was collected, then setting up data infrastructure will be a boondoggle that will cause your organization to be data hostile.

Just as much effort, if not more, needs to go into collecting the right data in the right way to fill your data infrastructure with. Most of the projects I've seen or heard of are just people taking the same old data that Ted in accounting, Jill in BI, etc. are already pretty proficient at using. So the gains you get by moving that into a modern infrastructure are marginal. How many more questions can you really ask of the same data that people have decades of experience with and an intuitive sense for?

malisper over 4 years ago

For a post detailing the modern data infrastructure, I'm surprised they intentionally leave out SaaS analytics tools. I find this especially surprising given a16z has invested >$65M into Mixpanel.

Based on my experience working at an analytics company and running one myself, what this post misses is that an increasing number of people working with data today are not engineers. These people range from product managers trying to figure out what features the company should focus on building, to marketers trying to figure out how to drive more traffic to their website, to the CEO trying to understand how the business as a whole is doing.

For that reason, you'll still see many companies pay for full stack analytics tools (Mixpanel, Amplitude, Heap) in addition to building out their own data stack internally. It's becoming more and more important that the data is accessible to everyone at your company, including the non-technical users. If you try to get everyone to use your own in-house built system, that's not going to happen.

huy over 4 years ago

For those who are interested in learning more about the history and evolution of data infrastructure/BI (basically why and how it has come to this stage), check out this short guidebook [1] that my colleagues and I put together a few months back.

It goes into detail on how relevant the practices of the past (OLAP, Kimball's modeling) remain given the changes brought by the cloud era (MPP, cheap storage/compute, etc.). Chapter 4 will be the most interesting for the HN audience: it walks through the different waves of data adoption since BI was invented in the 1960s-70s.

[1] https://holistics.io/books/setup-analytics/

tuckerconnelly over 4 years ago

The ELT (rather than ETL) insight was really cool, hadn't heard of that before.

Unless you're at a massive, massive scale, though, Just Use Postgres and write your ETL (ELT now?) queues normally. Keep It Simple, Stupid.
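For anyone else new to the ELT ordering, here is a minimal sketch of the idea: load the raw rows into the database untouched first, then do the transform as SQL inside the database. It uses Python's built-in `sqlite3` purely as a stand-in so it runs anywhere; it is not Postgres, and the table names, column names, and `run_elt` helper are all made up for illustration.

```python
import sqlite3


def run_elt(raw_events):
    """Load raw (user_id, amount_cents) rows as-is, then transform in SQL."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real database such as Postgres

    # Extract + Load: store the raw rows untouched, including incomplete ones.
    conn.execute("CREATE TABLE raw_events (user_id TEXT, amount_cents INTEGER)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?)", raw_events)

    # Transform: runs inside the database, after loading, and can be re-run
    # or changed later because the raw table is still there.
    conn.execute(
        """
        CREATE TABLE user_totals AS
        SELECT user_id, SUM(amount_cents) AS total_cents
        FROM raw_events
        WHERE amount_cents IS NOT NULL
        GROUP BY user_id
        """
    )
    rows = conn.execute(
        "SELECT user_id, total_cents FROM user_totals ORDER BY user_id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_elt([("a", 100), ("b", 50), ("a", 25), ("b", None)]))
```

In classic ETL the cleanup and aggregation would happen in application code before anything is written; here the only thing the loader does is insert.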

cageface over 4 years ago

While I think data science is a very interesting field with a lot of beneficial applications it also seems to be the one that's right at the heart of a lot of the negative impact some tech is having on society right now. I seriously considered specializing in it for a while but ultimately decided it was too likely I'd be asked to work on things that make me uncomfortable.

dm03514 over 4 years ago

I'm really excited about the state of data infrastructure and the emergence of the data lake. I feel like the technical aspects of data engineering are reduced to getting data into some cloud storage (S3) as Parquet. Transforms are "solved" using ELT from the data lake, or streaming using Kafka/Spark.

I think executing this in orgs with legacy data technologies is hard, but it is much more a people problem than a tech problem. In orgs that have achieved this foundation, it's really cool to see the business and analytic impact on the company.

msolujic over 4 years ago

Good start for this vast and complex topic. One thing that pops out here as missing is Data Mesh [1]. It is an emerging pattern for complex data management and data exchange between multiple products and product components/services.

[1] https://martinfowler.com/articles/data-monolith-to-mesh.html

m3kw9 over 4 years ago

I wonder how many of those companies in the proposed architecture have a16z as an investor?

fouc over 4 years ago

The recent HN threads about Excel made me think there's definitely room for a new kind of Excel that works well for big data.

fluffy87 over 4 years ago

Citation needed?

We connect all our sensors to an edge AI server that handles sensor data, and only uploads to the cloud what’s actually relevant.

It works quite well, and there are many OEMs that offer such systems, with accelerators for inference, sensor data compression, 5G, etc.
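One very simple way an edge box can decide what is "relevant" is a deadband filter: forward a reading only when it differs from the last forwarded value by more than a threshold. This is a toy sketch of that one technique, not a description of the system mentioned above (which uses inference accelerators and compression); the `deadband_filter` name and the threshold are assumptions for illustration.

```python
from typing import Iterable, List


def deadband_filter(readings: Iterable[float], threshold: float) -> List[float]:
    """Keep a reading only if it moved more than `threshold` from the last kept one."""
    kept: List[float] = []
    for value in readings:
        # The first reading is always kept so the cloud side has a baseline.
        if not kept or abs(value - kept[-1]) > threshold:
            kept.append(value)
    return kept


if __name__ == "__main__":
    samples = [20.0, 20.1, 20.2, 25.0, 25.1, 19.0]
    # Only the baseline and the two large jumps would be uploaded.
    print(deadband_filter(samples, threshold=1.0))  # [20.0, 25.0, 19.0]
```

The trade-off is that anything below the threshold never reaches the cloud, so the threshold has to be chosen per sensor with the downstream questions in mind.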

nicholast over 4 years ago

I considered this piece a sort of loose validation that the Automunge library is filling an unmet need for data scientists. It is intended for tabular data preprocessing in the steps immediately preceding the application of machine learning.

ca123 over 4 years ago

Great article, but it is surprising that it does not mention or use the concept of DataOps. Even Gartner has recently written at length about the role of DataOps [1], and of course, we at Composable [2] are biased, as they just named us a Cool Vendor in DataOps [3].

[1] https://www.gartner.com/en/documents/3970916/introducing-dataops-into-your-data-management-discipline
[2] https://composable.ai
[3] https://www.gartner.com/en/documents/3991447/cool-vendors-in-dataops

cblconfederate over 4 years ago

What's the point of data hoarding? Intelligent systems in nature ingest the data, learn, and discard it.