Questioning the Lambda Architecture

178 points by boredandroid almost 11 years ago

4 comments

mashraf almost 11 years ago

I have never used Samza, but I have built similar pipelines using Kafka, Storm, Hadoop, etc. In my experience you almost always end up writing your transformation logic twice, once for batch and once for real time, and with that setup Jay's approach looks exactly like the Lambda Architecture, with your stream processing framework doing both the real-time and the batch computation.

Using a stream processing framework like Storm may be fine when you run exactly the same code for both real time and batch, but it breaks down in more complex cases where the code is not the same. Say we need to compute the top-K trending items over the last 30 minutes, one day, and one week. We also know that a simple count will always make socks and underwear trend for an e-commerce shop, and Justin Bieber and Lady Gaga for Twitter (http://goo.gl/1SColQ). So we use a count-min sketch for real time and a slightly more complex ML algorithm for batch on Hadoop, and merge the results at the end. IMO, training and running complex ML is not currently feasible on the streaming frameworks we have today, so we cannot use them for both real time and batch.

edited for typos.
Comment #7978133 not loaded
Comment #7977310 not loaded
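For reference, a minimal Python sketch of the count-min-sketch side of the split mashraf describes: an approximate counter for the real-time top-K path. The width/depth parameters and the hashing scheme are illustrative choices, not anything from the comment.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: may overestimate, never underestimates."""
    def __init__(self, width=2048, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One independent hash per row, derived by salting with the row number.
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._indexes(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The true count is at most the minimum over all rows.
        return min(self.table[row][col] for row, col in self._indexes(item))

# Track top-K candidates over a stream of item IDs.
sketch = CountMinSketch()
candidates = {}
for item in ["socks", "gaga", "socks", "phone", "socks", "gaga"]:
    sketch.add(item)
    candidates[item] = sketch.estimate(item)

top_k = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top_k)  # e.g. [('socks', 3), ('gaga', 2)]
```

The batch side of the split (the "slightly more complex ML algorithm") is exactly the part this sketch cannot replace, which is the commenter's point.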
haddr almost 11 years ago

While I really liked the main idea of the Lambda Architecture, this article answers the doubts I had about it. It really is more reasonable to run another stream processing pipeline over the historical data than to maintain a separate Hadoop or other map-reduce framework. If you need to reprocess the data, you instantiate another pipeline and stream the historical data through it. Clever!
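A minimal sketch of the replay recipe haddr likes: reprocessing is just a second consumer of the same log, started from the earliest retained offset. This assumes the kafka-python client; the topic, group id, and output names are hypothetical.

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Version 2 of the job consumes the SAME topic under a NEW group id,
# starting from the earliest retained offset, and writes to a fresh
# output table. Once it catches up, readers switch over and the old
# job is torn down -- no separate batch framework involved.
reprocessor = KafkaConsumer(
    "page_views",                       # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="views-job-v2",            # fresh group => fresh offsets
    auto_offset_reset="earliest",       # replay history from the start
)

for msg in reprocessor:
    record = msg.value
    # ... apply the new transformation logic and write to output_v2 ...
```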
vicpara almost 11 years ago

The ideas presented in the article aren't new. Now we just use all these hipsterish technologies that we hope will magically solve our problems by sticking one's output into another's input. If we think about what happens on a single machine while processing data, we have had exactly the same problems for decades. How do you process 2 CDs' worth of data when you only have a 486 with 4/8/16 MB of RAM?

- Historically, the data rarely (never) fit into memory and was at least 100x larger than it.

- If we want to keep it for the long run, we need to store it on disk. Smarter, dumber, compact or verbose... you have to do it.

- We have to make sure we spend the little CPU time we have on processing data, not juggling it. Map-Reduce jobs take ages to initialize and burn CPU just reading and writing file partitions.

- In a long processing pipeline there are two major concepts we use: buffers and pumps. Files, caches, and DBs act as buffers. Kafka is essentially a pump with a buffer.

- When you process data, depending on what you compute, you may or may not need multiple passes over it. ML and AI usually need them. Descriptive stats with some smart math tricks avoid a second pass (see the sketch after this list). This variable number of passes is the party pooper in stream analytics. In cryptography they solved the problem by breaking the stream into blocks of equal size. That makes sense for raw data because it is assembled back with buffers and pumps at some upper layer. Mathematically and statistically, though, it doesn't make sense to randomly split data into chunks and apply your choice of algorithms.

- I still don't understand why so many of us rely on out-of-the-box solutions instead of solving the specific problems we have on our own. Why wouldn't a developer stick his Java code directly into the pipeline to suck data from Kafka and do his bespoke magic? It will be super fast because it is very specific and does exactly one job. Yes, there will be maintenance time, but all solutions require that. Instead of debugging Apache Hadoop/Spark/Tez code you debug your own.

What is mentioned above just scratches the surface of the knobs and tuning points of a processing pipeline. These are decisions we need to take, and expecting fast-food solutions to make them for us is a completely unrealistic expectation.
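One concrete instance of the "smart math tricks" mentioned above: Welford's algorithm computes mean and variance in a single pass, so no second sweep over the data is needed. A small self-contained sketch:

```python
def one_pass_stats(stream):
    """Welford's algorithm: mean and sample variance in one pass,
    with no need to buffer the stream or read it twice."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # uses the freshly updated mean
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance

print(one_pass_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
# (8, 5.0, 4.571428571428571)
```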
ntoshev almost 11 years ago

There seems to be a fundamental trade-off between latency and throughput, with stream processors optimizing for latency and batch processors optimizing for throughput. The post says stream processors are actually used for reprocessing in production at LinkedIn, and the simplification may well be worth the slowdown, but do they know how much slower this is than using Hadoop?
Comment #7983939 not loaded
Comment #7985653 not loaded
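The trade-off ntoshev points at has a familiar knob: buffering records into batches buys throughput at the cost of per-record latency, the same dial Kafka producers expose as batch.size / linger.ms. A toy sketch of that mechanism, with arbitrary illustrative parameters:

```python
import time

def micro_batcher(source, max_batch=500, max_wait_s=0.2):
    """Flush when the batch is full (throughput) or when the timer
    expires (latency bound). The timer is only checked as records
    arrive, which is enough for a sketch."""
    batch, deadline = [], time.monotonic() + max_wait_s
    for record in source:
        batch.append(record)
        if len(batch) >= max_batch or time.monotonic() >= deadline:
            yield batch
            batch, deadline = [], time.monotonic() + max_wait_s
    if batch:
        yield batch  # flush the tail

for batch in micro_batcher(range(1200), max_batch=500):
    print(len(batch))  # 500, 500, 200 on a fast in-memory source
```

Raising max_batch and max_wait_s moves the dial toward batch-style throughput; shrinking them toward per-record processing moves it toward stream-style latency.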