
Uses and abuses of cloud data warehouses

156 points by Malp · almost 2 years ago

10 comments

mrbungie · almost 2 years ago
I remember one time I was working as a Data & Analytics Lead (almost a Chief Data Officer, but without the title) at a company where I no longer work, and I was "challenged" by our parent company's CDO about our data tech stack and operations. For context, my team at the time was me, working as the lead and main Data Engineer, plus 3 Data Analysts that I was coaching/teaching to grow into Data Engineers/Data Scientists.

At the time we were mostly a batch data shop, based on Apache Airflow + K8S + BigQuery + GCS on Google Cloud Platform, with BigQuery + GCS as the central data lake techs for analytics and processing. We still had real-time capabilities, thanks to some Flink processes running in the K8S cluster, plus time-critical (time, not latency) processes running in microbatches of minutes for near-real-time. It was pretty cheap and sufficiently reliable, with both Airflow and Flink having self-healing capabilities at least at the node/process level (and even at the cluster/region level, should we need it and be willing to increase costs), while also allowing for changes down the road, like moving off BQ if costs scaled up too much.

What they wanted us to implement was what, according to them, were the industry "best practices" circa 2021: a Kafka-based data lake (KSQL and co.), at least 4 other engines (Trino, Pinot, Postgres and Flink), and an external object store, with most of the stuff running inside Docker containers orchestrated by Ansible across N compute instances manually controlled from a bastion instance. For some reason, they insisted on having a real-time data lake based on Kafka. It was an insane mix of cargo cult, FOMO, high operational complexity, and low reliability in one package.

I resisted the idea until my last second at that place. I met up with some of my team members for drinks months after my departure, and they told me the new CDO was already convinced that said "RT-based" data lake was the way forward. I still shudder every time I remember the architectural diagram, and I hope they didn't end up following that terrible advice.

tl;dr: I will never understand the cargo cult around real-time data and analytics, but it is a thing that appeals to both decision makers and "data workers". Most businesses and operations (especially those whose main focus is not IT itself) won't act or decide in hours, but rather in days. Build around your main use case and then make exceptions, not the other way around.
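For illustration, the minutes-level microbatch pattern described above is simple to sketch as an Airflow DAG. This is a minimal sketch, not the commenter's actual pipeline; the DAG id, window logic, and destination are all hypothetical:

```python
# Minimal sketch of a near-real-time microbatch pipeline in Apache Airflow 2.x.
# All names (dag_id, task logic) are hypothetical placeholders; a real pipeline
# would copy the last window of events from GCS into a BigQuery staging table.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_microbatch(**context):
    # "logical_date" is the scheduled run time (Airflow >= 2.2); the microbatch
    # window is the few minutes leading up to it.
    window_end = context["logical_date"]
    window_start = window_end - timedelta(minutes=5)
    print(f"Loading events from {window_start} to {window_end}")


with DAG(
    dag_id="nrt_events_microbatch",
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(minutes=5),  # minutes-level NRT, not streaming
    catchup=False,
    # Retries give the self-healing-at-the-task-level behavior mentioned above.
    default_args={"retries": 2, "retry_delay": timedelta(minutes=1)},
) as dag:
    PythonOperator(task_id="load_microbatch", python_callable=load_microbatch)
```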
albert_e · almost 2 years ago
Aren't a lot of businesses being sold on "real-time analytics" these days?

That mixes the use cases of analytics and operations, because everyone is led to believe that things that happened in the last 10 minutes must go through the analytics lens and yield actionable insights in real time, so their operational systems can react/adapt instantly.

Most business processes probably don't need anywhere near such real-time analytics capability, but it is very easy to think (or be convinced) that we do. Especially if I am the owner of a given business process (with an IT budget), why wouldn't I want the ability to understand trends in real time and react to them, if not get ahead of them and predict/be prepared? Anything less than that is seen as being shamefully behind on the tech curve.

In this context, the section of the article that says present data is of virtually zero importance to analytics is no longer true. We need a real solution, even if we apply those (presumably complex and costly) solutions only to the most deserving use cases (and don't abuse them).

What is the current thinking in this space? I am sure there are technical solutions here, but what is the framework for evaluating which use cases actually deserve such a setup?

Curious to hear.
spullara · almost 2 years ago
These reasons are why Snowflake is building hybrid tables (under the Unistore umbrella). Those tables keep recent data in an operational store and historical data in their typical data warehouse storage systems. Best of both worlds. Still in private preview, but definitely the answer to how you build applications that need both, without using multiple databases and syncing.

https://www.snowflake.com/guides/htap-hybrid-transactional-and-analytical-processing
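For a sense of the shape of the feature: hybrid tables are declared with a HYBRID keyword and require a primary key, which backs the row store used for operational access. A minimal sketch via the Python connector follows; the connection parameters, table, and columns are hypothetical, and the feature was still in preview at the time, so details may differ:

```python
# Sketch: creating and querying a Snowflake hybrid table from Python.
# Connection parameters and the table/column names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# Hybrid tables require a primary key; it backs the row store that serves
# operational point reads/writes, while analytical scans use columnar storage.
cur.execute("""
    CREATE HYBRID TABLE IF NOT EXISTS orders (
        order_id INT PRIMARY KEY,
        customer_id INT,
        amount NUMBER(10, 2),
        created_at TIMESTAMP_NTZ
    )
""")

# Operational point lookup and analytical aggregate against the same table,
# with no sync between two databases.
cur.execute("SELECT * FROM orders WHERE order_id = %s", (42,))
cur.execute("SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")
```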
andrenotgiant · almost 2 years ago
It seems like Snowflake is going all-in on building features and doing marketing that encourage their customers to build applications, serve operational workloads, etc. on them. Things like in-product analytics, usage-based billing, personalization, etc.

Anyone here taking them up on it? I'm genuinely curious how it's going.
hodgesrm · almost 2 years ago
This article uses an either/or definition that leaves out a big set of use cases that combine operational *and* analytic usage:

> First, a working definition. An operational tool facilitates the day-to-day operation of your business. Think of it in contrast to analytical tools that facilitate historical analysis of your business to inform longer-term resource allocation or strategy.

Security information and event management (SIEM) is a typical example. You want fast notification on events *combined with* the ability to sift through history extremely quickly to assess problems. This is precisely the niche occupied by real-time analytic databases like ClickHouse, Druid, and Pinot.
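As a sketch of that dual usage against one real-time analytic database, here is what the two sides of a SIEM workload might look like in ClickHouse via the clickhouse-connect client. The security_events table, its columns, and the thresholds are all invented for illustration:

```python
# Sketch: one ClickHouse table serving both an operational alert check and a
# historical investigation query. Table and column names are hypothetical.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Operational side: poll for failed logins in the last minute to drive alerts.
recent = client.query("""
    SELECT source_ip, count() AS failures
    FROM security_events
    WHERE event_type = 'login_failed' AND ts > now() - INTERVAL 1 MINUTE
    GROUP BY source_ip
    HAVING failures > 20
""")

# Analytical side: sift months of history for a suspect IP during an incident.
history = client.query("""
    SELECT toStartOfHour(ts) AS hour, event_type, count() AS n
    FROM security_events
    WHERE source_ip = '203.0.113.7' AND ts > now() - INTERVAL 90 DAY
    GROUP BY hour, event_type
    ORDER BY hour
""")
```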
mritchie712 · almost 2 years ago
I caught myself wondering how Google, Microsoft, and Amazon let Snowflake win. You can argue they haven't won, but let's assume they have. Two things:

1. SNOW's market cap is $50B. GOOGL, MSFT, and AMZN are all over $1T. Owning Snowflake would be a drop in the bucket for any of them (let alone if they were splitting the revenue).

2. Snowflake runs on AWS, GCP, or Azure (customer's choice), so a good chunk of their revenue goes back to these services.

Looking at these two points as the CEO of GOOGL, MSFT, or AMZN, I'd shrug away Snowflake "beating us". It's crazy that you can build a $50B company that your largest competitors barely care about.
atwong · almost 2 years ago
There are other databases today that do real-time analytics (ClickHouse, Apache Druid, and StarRocks, along with Apache Pinot). I'd look at the ClickHouse Benchmark to see who the competitors in that space are and their relative performance.
bob1029 · almost 2 years ago
> Operational workloads have fundamental requirements that are diametrically opposite from the requirements for analytical systems, and we're finding that a tool designed for the latter doesn't always solve for the former.

We aren't even going to consider the *other* direction? Running your analytics on top of a basic-ass SQL database?

In our shop, we aren't going for a separation between operational and analytical. The scale of our business and the technology available mean we can use one big database for everything [0]. The only remaining challenge is to construct the schema such that consumers of the data are made aware of the rates of change and freshness of the rows (load interval, load date, etc.).

If someone wants to join operational with analytical, I think they shouldn't have to reach for a weird abstraction. Just write SQL like you always would, and be aware that certain data sources might change faster than others.

Sticking everything onto one target might sound risky, but I find many of these other "best practices" DW architectures far less palatable (aka sketchier) than one big box. Disaster recovery of 100% of our data is handled by replicating a single transaction log and is easy to validate.

[0]: https://learn.microsoft.com/en-us/azure/azure-sql/database/hyperscale-architecture?view=azuresql#hyperscale-architecture-overview
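For illustration of that schema convention, here is a minimal sketch: the analytical table carries its own freshness metadata as ordinary columns, and a plain SQL join mixes it with operational rows. The table names, columns, and connection string are hypothetical, and pyodbc against an Azure SQL database is an assumption:

```python
# Sketch: joining operational and analytical tables in one database, with
# freshness metadata exposed as ordinary columns. All names are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=...;DATABASE=...;"
)
cur = conn.cursor()

# The analytical table advertises its own freshness: load_date says when the
# batch landed, load_interval_minutes says how often to expect new rows.
cur.execute("""
    SELECT o.order_id,
           o.status,                 -- operational: changes in real time
           f.lifetime_value,         -- analytical: refreshed in batches
           f.load_date,              -- when this analytical row was loaded
           f.load_interval_minutes   -- expected refresh cadence
    FROM dbo.orders AS o
    JOIN dbo.customer_facts AS f
      ON f.customer_id = o.customer_id
    WHERE o.order_id = ?
""", 42)
row = cur.fetchone()
```

Consumers see the row-level freshness right in the result set, so no extra abstraction layer is needed to mix fast-changing and slow-changing data.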
debarshri · almost 2 years ago
15 years ago, when I joined the workforce, business intelligence was all the rage. The data world was pretty much straightforward: you had transactional data in OLTP databases, which would be shipped to operational data stores and then rolled into the data warehouse. Data warehouses were actual specialized hardware appliances (Netezza et al.), and reporting tools were robust too.

Every time I moved from one org to another, these data warehouse concepts somehow got more muddled.
dontupvoteme · almost 2 years ago
The random bolding of words reeks of adtech.

Is the usage of such an old HTML tag itself now a trigger to send something to /dev/null?