Big data is dead

879 points, posted by davidgomes over 2 years ago

96 comments

fdgsdfogijq, over 2 years ago
"For more than a decade now, the fact that people have a hard time gaining actionable insights from their data has been blamed on its size."

The real issue is that business people usually ignore what the data says. Wading through data takes a huge amount of thought, which is in short supply. Data scientists are commonly disregarded by VPs in large corporations, despite the claims about being "data driven". Most corporate decision making is highly political; the needs of / what's best for the business is just one parameter in a complex equation.
travisgriggs, over 2 years ago
I've made anecdotal observations similar to this over the last 10 years. I work in AgTech. A big push for a while here has been "more and more and more data": sensor-the-heck out of your farm, and We'll Tell You Things(tm).

Most of what we as an industry are able to tell growers is stuff they already know or suspect. There is the occasional surprise or "aha" moment where some correlation becomes apparent, but the thing about these is that once they've been observed and understood, the value of ongoing observation drops rapidly.

A great example of this is soil moisture sensors. Every farmer that puts these in goes geek-crazy for the first year or so. It's so cool to see charts that illustrate the effect of their irrigation efforts. They may even learn a little and make some adjustments. But once those adjustments and that knowledge have been applied, it's not like they really need the ongoing telemetry as much anymore. They'll check periodically (maybe) to continue to validate their new assumptions, but 3 years later the probes are often forgotten and left to rot, or reduced in count.
ryadh, over 2 years ago
While I get that they're sometimes useful to trigger debate, I don't really subscribe to very bold statements.

We are drowning in data; it's all around us. Information overload is real. Data enables most of our daily digital experiences, from operational data to insights in the form of user-facing analytics. Data systems are the backbone of digital life.

It's an ocean, and it's all about the vessel you pick to navigate it. I don't believe the vessel should dictate the size of the ocean; it's simply constrained by its capabilities. The trick is to pick the right vessel for the job, whether you want to go fast, go far, or fish for insights (ok, I need to stop pushing on this metaphor).

This visionary paper from Michael Stonebraker (2005) predicted it quite accurately and I think it is still relevant: https://cs.brown.edu/~ugur/fits_all.pdf

Databases come in various flavours, and the "trends" are simply a reflection of what the current era needs.

Disclaimer: I work at ClickHouse
taftster, over 2 years ago
This post was great; highly recommend reading it through. It gets really good when the author hits "Data is a Liability".

> *An alternate definition of Big Data is "when the cost of keeping data around is less than the cost of figuring out what to throw away."*

This is exactly it. It's way too hard to go through and make decisions about what to throw away. In many respects, companies are the ultimate hoarders and can't fathom throwing any data away, Just In Case.

Really appreciated the post overall. Very insightful.

As an anecdote to this article: when business folks have come up to me and asked about storing their data in a Big Data facility, I have never found the justification to recommend it. Like, if your data can fit into RAM, what exactly are we talking about Big Data for?
guardiangod, over 2 years ago
There is literally a post on the front page about ChatGPT, and Microsoft and Google are preparing to duke it out starting in the _next 2 days_ over big-data-generated 'chat' results.

Big data was never going to be useful to even medium-size enterprises, unless anyone can get public access to PBs of data, but that doesn't mean big data is dead. ChatGPT is literally changing how schools will test their students, for a start.

Maybe what the author is trying to say is 'small-scale big data is dead, but big data chugs on.'
carlineng, over 2 years ago
MotherDuck has been making the rounds with a big funding announcement [1], and a lot of posts like this one. As a life-long data industry person, I agree with nearly all of what Jordan and Ryan are saying. It all tracks with my personal experience on both the customer and vendor side of "Big Data".

That being said, what's the product? The website says "Commercializing DuckDB", but that doesn't give much of an idea of what they're offering. DuckDB is already super easy to use out of the box, so what's their value-add? It's still a super young company, so I'm sure all that is being figured out as we speak, but if any MotherDuckers are on here, I'd love to hear more about the actual thing that you're building.

[1]: https://techcrunch.com/2022/11/15/motherduck-secures-investment-from-andreessen-horowitz-to-commercialize-duckdb/
danuker, over 2 years ago
I agree with many of the points here.

My cheap no-name old laptop SSD writes at 170 MB/s.

A customer has a name, address, email and order. Let's say 200 bytes per customer. That means I can write 844,000 new customers per second, far outside my personal marketing reach.

My disk is 240 GB, which means I can store data for 1.2 billion customers. It'll take a while until I become that successful.
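A minimal back-of-envelope sketch of that arithmetic (the throughput, record size, and disk size are the figures quoted in the comment above; the exact customers-per-second number depends on which megabyte definition you use):

```python
# Back-of-envelope check of the comment's numbers (assumed figures, not measured).
write_speed_mb_s = 170        # cheap laptop SSD, sequential write
bytes_per_customer = 200      # name + address + email + order, rough guess
disk_gb = 240

customers_per_second = write_speed_mb_s * 1_000_000 // bytes_per_customer
customers_on_disk = disk_gb * 1_000_000_000 // bytes_per_customer

print(f"{customers_per_second:,} new customers/second")   # ~850,000 (the comment says 844,000)
print(f"{customers_on_disk:,} customers fit on the disk")  # 1,200,000,000
```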
andix, over 2 years ago
I see it all the time: people develop applications that will never ever get a database larger than 100 GB and are using big data databases or distributed cloud databases. Often queries only hit a small subset of the data (one customer, one user), so you could easily fit everything into one SQL database.

Using any of the traditional SQL databases takes away a lot of complications. You can do transactions, you can query whatever you want, ...

And if the database may get up to 1 TB, that is still no problem for SQL. If you exceed that, you may need a professional ops team for your database and a few giant servers, but they should easily be able to go up to 10 TB, offload some queries to secondary servers, ...
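As a minimal sketch of that pattern (sqlite3 here purely for illustration; the table and column names are made up), a plain single-node SQL database with an index on the customer key covers the "queries only hit one customer" case, transactions included:

```python
import sqlite3

con = sqlite3.connect("app.db")
con.execute("""CREATE TABLE IF NOT EXISTS orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    total REAL NOT NULL)""")
# One index on the access key is what makes "queries only hit a small subset" cheap.
con.execute("CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders (customer_id)")

with con:  # a real transaction, for free
    con.execute("INSERT INTO orders (customer_id, total) VALUES (?, ?)", (42, 19.99))

count, revenue = con.execute(
    "SELECT count(*), coalesce(sum(total), 0) FROM orders WHERE customer_id = ?", (42,)
).fetchone()
print(count, revenue)
```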
noobermin, over 2 years ago
The claim that less-than-a-terabyte datasets are the common case had me awestruck. I, a singular post-doctoral scientist, noobermin[0], have processed terabytes of data at a time on HPC systems. Sure, a lot of it was garbage and I had to wade through it, but no one paid me millions to do it; I just did it to publish the papers. Sure, I needed the system, which cost someone a lot of money, I suppose. But I considered myself a small fry compared to some of the things others did on the system, particularly the hydrodynamics modellers. Moreover, I know I can probably process 100 GB datasets on my own home PC, which isn't too impressive; it would just take longer (say a day or so instead of an hour or a few minutes). And this is with, I don't know, Python scripts using MPI. Yes, MPI, because I'm a computational scientist and that's what HPC systems use: nothing fancy, and likely the "legacy systems" he railed against in his pitches, but it worked.

I'm just awestruck. I could tell anyone that "large data" isn't really a bottleneck; making sense of it is the very difficult part. My mentors kept pushing me to mention the sheer size of the datasets I process in talks because it sounds impressive, and I do so, but I always knew it didn't matter, because the interpretation and analysis is the hard part, not the "sheer size."

[0] not going to use my real name
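For what it's worth, the "nothing fancy" MPI pattern described above can be as simple as splitting files across ranks; a rough mpi4py sketch, with the file glob and the per-file work as placeholders:

```python
import glob
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Hypothetical pile of simulation dumps; each rank takes every size-th file.
files = sorted(glob.glob("dumps/*.dat"))
my_files = files[rank::size]

local_bytes = 0
for path in my_files:
    with open(path, "rb") as f:
        data = f.read()          # stand-in for the real per-file analysis
        local_bytes += len(data)

# Combine per-rank results on rank 0.
total_bytes = comm.reduce(local_bytes, op=MPI.SUM, root=0)
if rank == 0:
    print(f"processed {total_bytes / 1e9:.1f} GB across {size} ranks")
```

Launched with something like `mpirun -n 8 python process.py`, which is essentially the workflow the comment describes.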
nerpderp82, over 2 years ago
Big Data was whatever someone couldn't handle in a spreadsheet or on their laptop using R.

This paper is 8 years old, and it was somewhat obvious then. "Scalability! But at what COST?" https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

A big single machine can handle 98% of people's data reduction needs. This has always been true. Just because your laptop only has 16 GB doesn't mean you need a Hadoop (or Spark, or Snowflake) cluster.

And it was always in the best interest of the big data vendors and cloud vendors to say "collect it all" and analyze on, or using, our platform.

The future of data analysis is doing it at the point of use and incorporating it into your system directly. Your actionable insights should be ON your Grafana dashboard seconds after the event occurred.
KaiserPro, over 2 years ago
Big data isn't big anymore.

1) 10 years ago, having access to 300 TB of data that could sustain 10 gigabytes/s of throughput would require something like two racks of disks with some SSD cache and junk.

2) People thought Hadoop was a good idea.

3) People assumed that everything could be solved with map-reduce.

4) Machine learning was much less of a thing.

5) People realised that Postgres does virtually everything that Mongo claimed it could.

6) People realised that Cassandra was a very expensive way to make a write-only database.

I gave a talk about using big data, and basically at the time the best definition I could come up with was "anything that's too big to reasonably fit in one computer; so think 4 x 60-disk direct-attached SAS boxes".

Most of the time people were chasing the stuff for the CV, rather than actually stopping to think whether it was a good idea (think k8s two years ago, ChatGPT now, chat bots in 2020). Most businesses just wanted metrics, and instead of building metrics into the app, they decided to boil the ocean by parsing unstructured logs.

Not surprisingly it turned to shit pretty quick. Nowadays people are much better at building metrics generation directly into apps, so it's much easier to plot and correlate stuff.
datan3rd, over 2 years ago
Detailed web event telemetry is where I have seen the "biggest" data, not application-generated data. Orders, customers, products will always be within reasonable limits. Generating 100s of events (and their associated properties) for every single page/app view to track impressions, clicks, scrolls, and page-quality measurements can get you to billions of rows and TBs of data pretty quickly for a moderately popular site. Convincing technical leaders to delete old, unused data has been difficult; convincing product owners to instrument fewer events is even harder.
ankrgyl, over 2 years ago
I love DuckDB and am cheering for MotherDuck, but I think bragging about how fast you can query small data is really no different than bragging about big data. In reality, big data's success *is not* about data volume. It's about enabling people to effectively collaborate on data and share a single source of truth.

I don't know much about MotherDuck's plans, but I hope they're focused on making it as easy to collaborate on "small data" as Snowflake/etc. have made it to collaborate on "big data".
lucidguppy, over 2 years ago
Some of Mongo's leveling off is the adoption of good jsonb columns in Postgres.

Mongo's got sharding out of the box, which is nice, but you have to get your key right or it will suck.

Also, no one should want to host a Mongo DB themselves, unless that's your business.
bfrog, over 2 years ago
This reminds me of a great blog post by Frank McSherry (Materialize, timely dataflow, etc.) talking about how using the right tools on a laptop could beat out a bunch of these JVM distributed querying tools because... data locality, basically.

https://github.com/frankmcsherry/blog/blob/master/posts/2015-02-04.md
ThereIsNoWorry, over 2 years ago
Big Data is dead? Seems alive and well to me. If you're not a big company with big customers, it never affected you to begin with.
alexpetralia, over 2 years ago
I am writing an essay series on this topic: last-mile analytics and how an abundance of data must ultimately be converted into (measurably correct) action.

If anyone wants to follow along, the series is here: https://alexpetralia.com/2023/01/19/working-with-data-from-start-to-finish/
spaintech, over 2 years ago
Not that big data is dead; it's more that real-time data is coming to life, but you need the old stuff around to make a buck or two... Well, that's my view. LLMs and transformer-model techniques are making data more relevant than ever. If you are a business, you are in for a "now real" digital transformation.

Making data the centerpiece of your business could mean that the effectiveness of your business processes increases by several orders of magnitude. Funny thing is, you will not use someone else's model (unless you are building a ChatBox to infer); you will need to build your own model, trained on your own business processes, to be successful.

Consider a bank; here is my prediction of expected outcomes:

Enhanced Customer Experience: The system can act as a virtual banking assistant, providing customers with instant access to their account information, real-time transactions, and balance updates. The system can also answer customer inquiries and provide relevant information, improving the overall customer experience.

Improved Fraud Detection: The system can monitor the bank's financial transactions in real time and identify potential fraud, helping the bank reduce its exposure to financial losses.

Automated Loan Processing: The system can analyze loan applications, credit scores, and other relevant data to approve or reject loan applications in real time, reducing the time and effort required for manual loan processing.

Personalized Marketing: The system can analyze customer behavior, transaction history, and demographic information to provide personalized marketing and cross-selling opportunities, increasing the bank's revenue and customer loyalty.

Real-Time Insights: The system can provide real-time insights into the bank's financial performance, customer behavior, and market trends, enabling the bank to make informed decisions and respond to market changes quickly.

What is interesting to me is that this is just the beginning of what could be...
pier25, over 2 years ago
> *Are you in the big data one percent?*

Exactly, and I'd go further: are you in the perf/scale/data one percent?

So many people worry about scaling when in reality 99% of web apps will never reach above 100 reqs/s.

I've been in web dev for 20+ years. Only once, when working for a big international corporate client, did I have to worry about traffic spikes. And that was just for one of their multiple web apps.
LeanderK, over 2 years ago
Who has ever believed those claims? There's a common saying, "garbage in, garbage out", about what happens with all those fancy models if the data quality is not high. That's really independent of dataset size. There's no magic insight you get because your dataset is bigger. You need a quality analyst to handle your data, regardless of its size.

Also, who thought their company would cease to function because surely they would hit Google-scale dataset sizes in the near future? Impossible for most except the biggest of the biggest.
sixdimensional, over 2 years ago
It is amusing that in 2005, "VLDB" (a precursor term to "big data") was defined on Wikipedia as "larger than 1 TB". After reading through the post and the author's experience, it would appear that this was not actually a completely terrible estimate, although there are larger and smaller: https://en.wikipedia.org/w/index.php?title=Very_large_database&oldid=20738417

The current version of that article states: "There is no absolute amount of data that can be cited. For example, one cannot say that any database with more than 1 TB of data is considered a VLDB. This absolute amount of data has varied over time as computer processing, storage and backup methods have become better able to handle larger amounts of data.[5] That said, VLDB issues may start to appear when 1 TB is approached,[8][9] and are more than likely to have appeared as 30 TB or so is exceeded.[10]" https://en.wikipedia.org/wiki/Very_large_database
college_physics, over 2 years ago
Not dead, just complying with the Gartner hype cycle.

There is probably a rational, well-thought-out classification of different types of data bigness (CERN-big, Google-big, MegaBank-big, down to wordpress-log-big), and on the basis of that one would probably find that different designs are indispensable, address different pain points, and cannot really "die". Hype has a more erratic lifecycle than real needs.
itamarst, over 2 years ago
This is an excellent summary, but it glosses over part of the problem (perhaps because the author has an obvious, and often quite good, solution, namely DuckDB).

The implicit problem is that even if the dataset fits in memory, the software processing that data often uses more RAM than the machine has. And unlike using too much CPU, which just slows you down, using too much memory means your process is either dead or so slow it may as well be. It's _really easy_ to use way too much memory with e.g. Pandas. And there are three ways to approach this:

* As mentioned in the article, throw more money at the problem with cloud VMs. This gets expensive at scale, can be a pain, and (unless you pursue the next two solutions) is in some sense a workaround.

* Better data processing tools: use a tool smart enough to use efficient query planning and streaming algorithms to limit memory usage. There's DuckDB, obviously, and Polars; here's a writeup I did showing how Polars uses much less memory than Pandas for the same query: https://pythonspeed.com/articles/polars-memory-pandas/

* Better visibility/observability: make it easier to actually see where memory usage is coming from, so that the problems can be fixed. It's often very difficult to get good visibility here, partially because the tooling for performance and memory is often biased towards web apps, which have different requirements than data processing. In particular, the bottleneck is _peak_ memory, which requires a particular kind of memory profiling.

In the Python world, relevant memory profilers are pretty new. The most popular open source one at this point is Memray (https://bloomberg.github.io/memray/), but I also maintain Fil (https://pythonspeed.com/fil/). Both can give you visibility into sources of memory usage that were previously painfully difficult to get. On the commercial side, I'm working on https://sciagraph.com, which does memory and also performance profiling for Python data processing applications, and is designed to support running in development but also in production.
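As a rough illustration of the "streaming/lazy" approach described in the second bullet (not the linked benchmark; the file and column names are made up, and the exact Polars method names vary a bit between versions):

```python
import polars as pl

# An eager pandas-style approach would load the whole file into RAM first.
# A lazy query lets the engine do projection pushdown (only read the columns
# it needs) and stream the aggregation instead of materializing everything.
lazy = (
    pl.scan_csv("events.csv")                # nothing is read yet
      .filter(pl.col("status") == "error")   # predicate pushdown
      .group_by("service")                   # called 'groupby' in older Polars versions
      .agg(pl.col("duration_ms").mean())
)

result = lazy.collect(streaming=True)        # process in chunks, bounded memory
print(result)
```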
fredliu, over 2 years ago
The title might be hyperbole (intentionally), but the observations are more or less in line with what I experienced through a few Big Data initiatives over the years in different enterprise environments (although I have reservations about the "one percenter" comment). To me, Big Data was never about how "big" the data was, but more about the tools/systems/practices needed to overcome the limitations of the previous generation. From that perspective, yes, the "monolith" may be making a comeback for now due to the improvement of underlying single-node performance. But I do think data size will keep growing, and everything needed to make Big Data work will still be there when the pendulum swings back to where a single node can't handle it anymore.
moooo99, over 2 years ago
I feel like big data has rarely lived up to its promise in most organizations. My own experience working in large orgs largely supports the point that collected data is rarely queried. But this is rarely due to a lack of interest; it is mostly because a) nobody really has a great overview of what is even collected, b) even if you know or assume something is collected, you usually have no idea where, and c) if you find the data, there is a decent chance that it is in some sort of weird format that requires a ton of processing to be usable.

This has been, to varying extents, my own experience working in large organizations that don't have tech as their core business.

Although there are some successful data analysis projects, the potential of the collected data remains largely underutilized.
juujian, over 2 years ago
> Most data is rarely queried

Right on point. In the past I was obsessed with big data, looking for insights. Then I realized that a medium-sized, specific data set is always better than a gargantuan general big-data monster. There are so many applications in my field where only outliers matter anyway, and everything is very "centralized" around a few relevant observations. So the only thing about big data is that you maybe throw away 99.9% of the data right away, and then you have some observations that you actually care about. There is soooo much data out there that is just noise, and so little that I actually care about. And that's why I still end up hand-collecting stuff every now and then.
edpichler, over 2 years ago
I believe we are living in the "emotional era", so data is being ignored and 'feelings' come first when making decisions or creating processes. This is happening not only in companies but in our current society in general.
gesman, over 2 years ago
A customer pays a data analytics vendor to tackle a bunch of their [low quality, big size] data.

If you have no tangible capabilities to do the above, asking the customer "ARE YOU IN THE BIG DATA ONE PERCENT?" will be the quickest way out of the door.
articsputnik, over 2 years ago
I love DuckDB's simplicity and think it will solve many problems. Still, transitioning from a local single-file DB to concurrent updates and serving it online will be different. I'm curious about what MotherDuck will come up with to solve DuckDB at scale.

I love use cases like Rill Data (https://youtube.com/watch?v=XvP2-dJ4nVM), where you can suddenly run analytics with a single command-line prompt and see your data instantly visualized. Such use cases are only possible because of the "small data" approach that DuckDB takes.
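In that spirit, here is a minimal sketch of the "analytics in one go" experience with DuckDB's Python API; the Parquet file name and columns are placeholders, not taken from the talk linked above:

```python
import duckdb

# DuckDB queries the Parquet file in place: no server, no separate load step.
con = duckdb.connect()  # in-memory database
rows = con.execute("""
    SELECT status, count(*) AS n, avg(duration_ms) AS avg_ms
    FROM 'events.parquet'
    GROUP BY status
    ORDER BY n DESC
""").fetchall()
for status, n, avg_ms in rows:
    print(status, n, round(avg_ms, 1))
```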
sgt101, over 2 years ago
Looks at 15 hr Spark job (running since this morning)

Sighs...
harish_dash, over 2 years ago
Confirmation bias exists almost everywhere. Confirmation bias among senior management is especially dangerous, as decisions are then based not on data and facts but on anecdotes, hunches and feelings, with a high probability of going wrong. This is precisely where data scientists play a significant role: by providing recommendations and presenting facts based on hard data and mathematical models, in order to ensure that senior management decisions are based on facts and data, not on anecdotes and hunches. Furthermore, a data-driven organisation must have a supporting culture, where data-driven decisions are given precedence, and data scientists (data messengers) must be empowered to present facts as they are, no matter whether those facts are aligned with the basic assumptions and biases held by the senior management team. Creating such a supportive organizational culture is extremely important but definitely not easy. Culture is one of the factors that makes the difference between success and failure in a data-driven organisation.
morelisp, over 2 years ago
To the extent "Big Data" originally meant, and is still often claimed to mean, "data beyond what fits on a single [process/RAM/disk/etc.]", it's always been strange to me how much it's identified with analytics pipelines doing largely trivial transformations producing ultra-expensive "BI" pablum.

Yes, thank goodness that part is dead. But meanwhile, we've still got more *actual data* than ever to store, and ever-tighter deadlines on finding and delivering it. If we can get back to that and let the PySpark bootcampers fade away, maybe things can get a little better for once.

In other words:

*Even when querying giant tables, you rarely end up needing to process very much data. Modern analytical databases can do column projection to read only a subset of fields, and partition pruning to read only a narrow date range. They can often go even further with segment elimination to exploit locality in the data via clustering or automatic micro partitioning. Other tricks like computing over compressed data, projection, and predicate pushdown are ways that you can do less IO at query time. And less IO turns into less computation that needs to be done, which turns into lower costs and latency.*

Big data is "dead" because data engineers (the programming ones, not the analysts-in-all-but-title) spent a ton of effort building DBs with new techniques that scale better than before, with different storage patterns than before. Someone still has to write and maintain those! And it would be even better if those tools and techniques could escape the half dozen major data cloud companies and be more directly accessible to the average small team.
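As a concrete sketch of the quoted tricks (column projection, partition pruning, predicate pushdown), here is roughly what such a query looks like in DuckDB over a date-partitioned Parquet layout; the directory layout and column names are invented for illustration:

```python
import duckdb

con = duckdb.connect()
# Assumed layout: events/day=2023-02-01/part-0.parquet, events/day=2023-02-02/..., etc.
query = """
    SELECT user_id, count(*) AS clicks          -- projection: only the columns used are read
    FROM read_parquet('events/*/*.parquet', hive_partitioning = true)
    WHERE day = '2023-02-07'                    -- partition pruning: only one directory is scanned
      AND event_type = 'click'                  -- predicate pushed down into the Parquet scan
    GROUP BY user_id
"""
print(con.execute(query).fetchall())
```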
anktor, over 2 years ago
"90% of queries processed less than 100 MB of data" [in BigQuery].

I think there is a problem when someone with such proclaimed knowledge of the sector gets to this, and similar, pieces of data and does not attribute it to pricing. Could it be that queries are small because BigQuery's pricing for analysis, as confusing as these models are, is based on the amount of data scanned? [0]

Because the other line of reasoning is that a big chunk of that 90% of professionals, being paid to do their jobs, do NOT take into account the pricing of the tool and are using it for small data; rather than thinking that people are using the best tool at the lowest price, because there are plenty of options to process and analyse data in the cloud right now.

On "businesses have low amounts of data", that matches my experience as well. At first I thought I was simply dealing with smaller companies, but it's a trend of doing big data projects for data that would fit on a pendrive.

[0] https://cloud.google.com/bigquery/pricing#analysis_pricing_models
heisenbit, over 2 years ago
Sampling has proven extremely useful. Pi can be approximated with it, and nuclear bombs were designed using statistical methods. Flame graphs based on stack samples are used to optimize servers. Government does planning with it. Management does its thing by wandering around.

It usually does not take many data points for an actionable insight, and most actions will then invalidate small details in old data anyhow. Better to start every round with fresh eyes.
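As a tiny illustration of the pi-by-sampling point, a Monte Carlo sketch that lands within a fraction of a percent of pi from a million random points:

```python
import random

def approx_pi(samples: int = 1_000_000) -> float:
    # The fraction of random points in the unit square that fall inside the
    # quarter circle of radius 1 approximates pi/4.
    inside = sum(
        1 for _ in range(samples)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    return 4 * inside / samples

print(approx_pi())   # e.g. 3.1408...
```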
poorman, over 2 years ago
This entire post reads like "you probably don't actually have big data."

What do those blockchains do that have to keep data around forever, with high throughput, and need to expose it quickly? Are you saying they should delete parts of the data in the chain?

Seriously, I've spent my career working on big data systems, and while the answer is sometimes "yes, you need to delete your data", I don't think that's always going to work.
Flatcircle, over 2 years ago
Seems like just yesterday every business magazine's cover story was about "big data." Wonder what the next batch of business buzzwords will be?
idlewords, over 2 years ago
Pretty funny to see this when every other headline on this site is about how large language models are about to revolutionize dentistry, beekeeping, etc.
wizwit999, over 2 years ago
Perhaps this is true for business data (though I'm skeptical of the claims), but, for example, for security data this isn't true at all. Collecting cloud, identity, SaaS, and network logs/data can easily exceed hundreds of terabytes. That's a big reason why we're building Matano as a data lake for security.

It seems an odd pitch in general to say: hey, my product specifically performs poorly on large datasets.
jl6, over 2 years ago
To add to the "the real issue is..." pile:

Most orgs collect the data that is easy to collect, and they are extremely lucky if that happens to be the data that enables the insights they desire. When the data they really *need* looks too hard to get, the org tries to compensate by collecting more of the easy stuff, and hoping that if blood can't be squeezed out of a stone, maybe it can be squeezed out of 100bn stones.
Agingcoder, over 2 years ago
I remember the big data craze. People had very little data, and low-quality data at that, so they had a data problem before they had a big data one!
sortalongo, over 2 years ago
> Customer data sizes followed a power-law distribution. The largest customer had double the storage of the next largest customer, the next largest customer had half of that, etc.

I'm no statistician, but I'm like 99% sure that's an exponential, not a power law.

There's a world of difference. The point of an exponential is that you can ignore big things. The point of a power law is that you can't.
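For reference, a quick way to write the two candidates down (notation mine, not the commenter's): storage that halves with each rank is geometric, i.e. exponential in rank, whereas a power law decays only polynomially with rank.

```latex
% Rank-n customer size under the two models (S = size of the largest customer):
\[
  \text{exponential (halving per rank):}\quad s(n) = S \cdot 2^{-(n-1)},
  \qquad
  \text{power law:}\quad s(n) = S \cdot n^{-\alpha}.
\]
```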
andreygrehov, over 2 years ago
With all the LLM craziness, this is just the beginning. How else are they going to train all those models? I'm not an expert, just imho.
mikepk, over 2 years ago
We need to rethink how to make data _useful_. The fact that the value hasn't materialized after decades of attempts, billions of dollars, and lots of tools and technology points to the fact that our core assumptions and patterns are wrong.

This post doesn't go far enough. It challenges the assumption that everyone's data is "big data" or that every company's data will eventually grow to be big data. I agree that "big data" was the wrong model. We also need to challenge the idea that all data should be stored in one place (warehouse, lake, lakehouse). We need to challenge the idea that one tool can be used for every data need. We need to challenge how we build systems, both from a technology and a people standpoint. We need to embrace that the problems and needs of companies _are always changing_.

We are living with conceptual inertia. Many of our patterns are an evolution from the 70's and 80's and the first relational databases. It's time to rethink how we "do data" from first principles.
alluro2, over 2 years ago
I'm quite surprised by the data sizes mentioned in the article, and wondering if I'm missing something... We are a very small 2-year-old company handling route optimization and delivery management / field service. Even with our very small number of customers, their relatively small sizes (e.g. number of "tasks" per day), and being very early in development in terms of the data that we collect, our database containing just customer data for 2 years is ~100 GB. I previously considered that small, and if we collected useful user metrics, had more elaborate analytics, location tracking history, etc., I would expect it to be at least 3x.

We don't use any "BigData" products yet, as there wasn't any need for them, even though we provide full search and a relatively nice and rich set of analytics over all the data. Yet, based on the article, we're way above most of the companies relying heavily on such tools. Confusing.
alentred, over 2 years ago
Another problem with "BigData": hiring, and the tendency of the ecosystem to "sustain" itself (like any system). As a company hires traditional BigData Architects, Developers, Data Scientists, Engineers, etc., it will naturally have a tendency to choose traditional BigData technology and solutions like BigQuery, Spark, storing everything in HDFS, etc.

A trick I saw is companies hiring experienced jack-of-all-trades back-end engineers into data teams. A lot of things get migrated from Spark to Postgres, from Kafka to REST API calls, and keep working fine and become generally more responsive.

I'm on the same page as the author here: traditional BigData tech has its place and its uses, but before choosing it, companies (CTOs, architects) should carefully consider whether it is necessary, especially considering the cost of it and the risk of locking themselves into a very specialized domain.
CobrastanJorji, over 2 years ago
Tableau's "Medium Data" April Fools' Day ad from several years ago still rings amazingly true.
mejakethomas, over 2 years ago
So what I'm hearing is: it's not the size of your data that matters, it's how you use it?
cmrdporcupine, over 2 years ago
From about 2008/2009/2010 or so on, there was perhaps an over-emphasis on specialized tools for the mass acquisition of streams of data, maybe in large part due to the explosion of $$ in ad-tech. Some people had legitimately insane click/impression streams; I worked at a couple of companies like that. Development of DBs based on LSM trees or other write-specialized storage structures became important. Existing relational databases weren't particularly well built for this stuff. This was part of, but not the whole story with, the whole NoSQL thing. People were willing to go completely denormalized in order to gain some advantage or ability here. It helped that much of the data involved was of perhaps little structural complexity.

In the meantime SSD storage took off, so the IOPS from a stock drive have skyrocketed, business domains for large data sets have broadened beyond click/impression streams, and the challenge now is not "can I store all this data" but "WTH do I do with it?"

Regardless of the quantity of data, structuring, analyzing, and querying said data remains paramount. The challenge for anybody working with data is to represent and extract knowledge. I remain convinced that logic (first-order logic and its offshoot, the relational model) remains the best tool for reasoning about knowledge. Codd's prognostications on data from the 1970s are still profound.

I think we're in a space now where we can turn our attention to knowledge management, not just accumulating streams of unstructured data. The challenge in a business is to discover and capture the rules and relationships in data. SQL is an existing but poor tool for this, based on some of the concepts in the relational model but tossing them together in a relatively uncomposable and awkward way (though it remains better than the dog's breakfast of "NoSQL" alternatives that were tossed together for a while there).

My employer is working in this space; I think they have a really good product: https://relational.ai/
zX41ZdbW, over 2 years ago
My presentation from FOSDEM 2023 is very sympathetic to the "Big data is dead" statement: https://www.youtube.com/watch?v=JlcI2Vfz_uk

It is about using modern tools (ClickHouse) for data engineering without the fluff: you can take whatever dataset or data stream and make what you need without complex infrastructure.

Nevertheless, the statement "big data is dead" is short-sighted, and I don't entirely follow this opinion.

For example, here is one of ClickHouse's use cases:

> The main cluster is 110 PB of NVMe storage, 100k+ CPU cores, 800 TB of RAM. The uncompressed data size on the main cluster is 1 EB.

And when you have this sort of data for real-time processing, no other technology can help you.
cmollis, over 2 years ago
We regularly run audits on over 12 years of customer order history. This requires scanning about 40 TB of data, and growing. They used to jump through hoops on the Oracle cluster just to get data out for one customer. We pushed all of the order history into S3 Parquet using Spark, and I can query it in about 20 seconds using Spark or Presto. It's now streamed through Kafka and Spark Structured Streaming, so it's up to date within about 3 minutes. The click-bait-y title notwithstanding, I get that not all data is 'big', and DuckDB (and DataFusion, Polars, etc.) is probably great for certain use cases, but what I work on every day can't be done on a single machine.
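A rough sketch of that kind of per-customer audit query over a Parquet lake with PySpark (the bucket, path, and column names are placeholders, not the commenter's actual schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("order-history-audit").getOrCreate()

# Column-projected scan of the Parquet order history in S3.
orders = spark.read.parquet("s3a://example-bucket/order_history/")

audit = (
    orders
    .where(F.col("customer_id") == "C-12345")      # only this customer's rows
    .groupBy(F.year("order_ts").alias("year"))
    .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue"))
    .orderBy("year")
)
audit.show()
```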
zmmmmm, over 2 years ago
To be honest, I slightly disagree about data size. I think the big data is there to be had; the real story is that data science itself has not panned out to provide the business value that people asserted would come from it. Data volumes haven't risen more because, in the end, it turns out most of the things businesses need to know are easily ascertainable from much smaller data, and their ability to act on even these smaller, very obvious things is already saturated.

It doesn't help that we've shifted into a climate where hoarding data comes with a huge regulatory and compliance price tag, not to mention risk. But if the value were there, we would do it, so this is not the primary driver.
luckydata, over 2 years ago
It's kind of weird to read this. The whole argument is "we didn't have databases that could handle the sizes and use cases emerging, we worked on the problem for 20 years, and now it's no biggie."

Mission accomplished, more than big data is dead, IMHO.
low_tech_punk, over 2 years ago
Long live Big Model, I guess? Instead of independent data warehouses, we are now moving towards a few centralized companies using supercomputers in physical data centers. The "winner takes all" effect will only increase as the trend goes on.
qikInNdOutReply, over 2 years ago
Congratulations to the proud dad, Big Data, on the birth of little data are in order.

Well, obviously, the realization that management is largely emotion-driven and only a little data-driven is a prelude to the CEO-AI still in the making.

Of course this still has a face to it: a CEO who speaks and talks as the voice commands, but does not do the part that even humans who think they are good at it are bad at, namely decision making. The ground truth is there ("corporate history"), going back to the merchants of Sumeria. Let's learn that lesson, pack it into a decision tree, and wrap that bundle in ChatGPT smooth talking.
winterismute, over 2 years ago
The database was the key technology of the 2001-2011 decade: it allowed companies to store massive amounts of data in an organized way, so that they could provide basic functionality (search, monitoring) to users. Statistical learning has been the key "technology" from 2011 to today: it allowed companies, which had stored massive amounts of data, to feed predictions back to users. I think AR/computer graphics will be the key technology of the next decade: it will allow users to interact directly and seamlessly with the insights produced by ML systems, and possibly feed information back.
jacobsenscott, over 2 years ago
nosql is dead, client side SPAs are dead. Nice to see the complexity pendulum swinging back to the correct side again. Curious what the merchants of complexity will reach for next. Are applets going to be the new hot thing?
blakeburch, over 2 years ago
Great post, and it really resonates with my experience. Good to have some confirmation that most organizations aren't using their large swaths of data.

Although I don't think most organizations are blaming the lack of actionable insights on data size. It's the lack of prioritizing data usage over data accessibility. We need to be teaching data people business levers and teaching business people data levers.

Data should be a byproduct of an actionable idea that you want to execute. It shouldn't exist until you have that experiment in mind.
singularity2001, over 2 years ago
Big Data lives on in LLMs.
pelatimtt, over 2 years ago
Agree. And one thing I noticed is that tools like Apache Spark have become the de facto standard for any data engineering work, even when the data size does not require it. The result is that many jobs are much harder to maintain and often slower (due to all the shuffling) than running on a single node.
fijiaarone, over 2 years ago
Somewhere along the line people were tricked into thinking that logging was data, and that we needed to turn every trace log up to 11 on every production system.

Logs are where data goes to die.
donretag, over 2 years ago
My personal definition of Big Data has always been when you gather/store data without having a planned use for it. Do we need this data? Don't know, let's just store it for now.

The article does allude to this definition when it states that "Most data is rarely queried". We have become data hoarders. Technology has made it easy (and relatively cheap) to store data, but the ideas of what to do with this data have not scaled in comparison.
freedude, over 2 years ago
"Among customers who were using the service heavily, the median data storage size was much less than 100 GB"

Eye-opening. Especially when combined with a recent quote from Satya Nadella: "First, as we saw customers accelerate their digital spend during the pandemic, we’re now seeing them optimize their digital spend to do more with less."

Conclusion: SaaS is easy to drop in downturns. Just as easy as it is to buy initially.
nemo44x, over 2 years ago
Why would I use DuckDB instead of Clickhouse or similar? Is it just because I want to have the database embedded in my app and not connect to a server?
glogla, over 2 years ago
I agree with a lot of the sentiments of the MotherDuck people, but boy are they loud and proud for someone who has never delivered anything more than blog posts and a vague promise to somehow exploit the MIT-licensed DuckDB.

Meanwhile, for example, boilingdata.com seems to have already done that, by using AWS Lambda + DuckDB as a distributed compute engine, which I can't decide is awesome, deranged, or both.
diceduckmonk, over 2 years ago
Unlike quantum, which cracks computationally complex algorithms, BigData was just about costs.

SSDs were limited in capacity and still expensive.

Parallelizing work with MapReduce allowed using cheap, fault-prone commodity hardware and disks.

If you're dealing with terabytes rather than petabytes of data, you probably don't need BigData.
hugesniff, over 2 years ago
"Very often when a data warehousing customer moves from an environment where they didn’t have separation of storage and compute into one where they do have it, their storage usage grows tremendously..."

Can someone explain why this is the case? Is it due to more replication or maintaining more indices?
dbjt_baki, over 2 years ago
Well then, if businesses do not require data, the "AI world" might need some. So switching to being a machine learning engineer might not seem too bad.
ammar_x, over 2 years ago
Well, we have less than 2 TB of data, and although we are running MySQL on a large instance with ~120 GB of RAM, it's extremely slow when dealing with big tables (like a 25 GB table), and that's why we need "big data" tools like BigQuery.
hinkley, over 2 years ago
It has always kind of amazed me how closely Big Data was followed by the KonMari method, and it really seems like the nerds were not paying attention to that at all. Or were just happy to take a paycheck from people who weren't paying attention.

Hoarding is not a winning strategy.
therealbilly, over 2 years ago
I think server hardware solved the big data issue. The stuff we have now can blitz through data in the blink of an eye. For national governments like our own, mainframes still have a place. For me personally, I don't even talk about big data anymore.
lern_too_spel, over 2 years ago
People don't want to deal with having to rearchitect when their workload no longer fits on a single instance. Yes, optimize for the small data case, but if you build a product that can handle only the small data case, you have a tough sell.
zzzeek, over 2 years ago
> The most surprising thing that I learned was that most of the people using "BigQuery" don't really have Big Data.

Wow, ya think? Must have been eye-opening to see all those customers with a few million rows thinking they had "big data", huh?
CommieBobDole, over 2 years ago
It's not dead; it has just entered the plateau of productivity, where people use it for whatever it's useful for and don't try to solve every problem with it just because it's the cool new thing.
miguelazo, over 2 years ago
On to the next hype theme (AI)!
H8crilA, over 2 years ago
Big data starts somewhere around a petabyte, maybe a bit lower than that. That's when you need some serious, dedicated systems. But as always, everyone wants to (pretend to) do what the big players do.
papito, over 2 years ago
First they came for the sacred microservices, now they are after Big Data. What. Is. Happening.

Don't get me wrong, I love it. It's about time people got off these stupid and shockingly expensive bandwagons.
xnx, over 2 years ago
Similar to the "we must have microservices so that we can *scale*" fad, a lot of people thought they had big data even though their records easily fit on a single machine.
posharma, over 2 years ago
We're going to reach a point where we might say the same thing about large language models. Fine-tuned LMs (based off of their large parents) are going to be the bread and butter.
vonnik, over 2 years ago
Agree with this post.

Big data was vendor-generated hype that convinced many engineers to confuse the size of their dataset with their, ahem, shoe size.

They didn't do their employers any favors.
siliconc0w, over 2 years ago
Big data is dead because executives are rewarded when decisions are reactionary and politically savvy; data doesn't enter the picture.
cubefox, over 2 years ago
This is a bit ironic given that generative AI models like GPT-3 and Dall-E only work because they were trained on very large datasets.
revskill, over 2 years ago
The main goal of Big Data, as I see it, is to profile performance and metrics: number of user registrations, number of converted users, ...
5tefan, over 2 years ago
I quite often say: if you need KPIs, you're too far removed from how the company actually conducts business.
kthejoker2, over 2 years ago
So the argument is you can do everything with an OLAP database because we shrunk "Big Data" back inside RAM?

K, good luck!
xiaodai, over 2 years ago
Medium data is where it's at! diskframe.com, Polars, and Arrow are good enough for most use cases!
AaronBBrown, over 2 years ago
The truth is that most "big data" problems aren't big and can often be solved with awk and xargs.
rvieira, over 2 years ago
What about IoT?
ralph84, over 2 years ago
Big Data got replaced by Big Parameters.
ThomPete, over 2 years ago
On the contrary. Now that AI is here, big data is going to be more alive than ever.
anonymousDan, over 2 years ago
I think the relevant phrase here is 'selling your book'.
twwittr1, over 2 years ago
This is what most of the companies are doing.
anon223345, over 2 years ago
Long live big data!
cbreynoldson, over 2 years ago
Long live Big Meaning
blipvert, over 2 years ago
Listen to “Reason”
wowJustwow, over 2 years ago
Interesting take from a Googler.

Big data hype never felt to me like anything more than a campaign to help big tech research ML/AI.

Larry even rambled as much: https://arstechnica.com/information-technology/2013/05/larry-page-wants-you-to-stop-worrying-and-let-him-fix-the-world/

It appears not all Googlers got the memo.

Everyone else is in the way of him solving big problems! Not like such work could not be distributed among technologists and researchers around the globe via the internet. Help Google do it!

I am leaning into "Deep Work" going forward; I will slowly iterate on my own model creation and collaborate with like-minded folks. I'm fucking done with intentionally empowering a billionaire minority who convinced an ignorant political gerontocracy that this minority is capable of magic.

Anyone prattling on with the common tropes of "longtermism" (nation-state nutters, the religious, technocrats) is appealing to a non-existent authority: they see a magical future for us! Give them your money to ensure it arrives! They have zero ability to ensure such outcomes, and a lot of upside in making people believe they can today.
meindnoch, over 2 years ago
Good riddance.