The Rise and Fall of the OLAP Cube

194 points by shadowsun7 over 5 years ago

20 comments

th0ma5 over 5 years ago

I used to admin OLAP infrastructure for a large Fortune 100. I think the article makes a lot of great points, but ultimately there are many paths to the same result. I'm not entirely sure what that group is up to these days, but I'm sure column stores are a big part of it. Ultimately, I think deciding between precalculated vs. dynamically calculated pivots is the big choice.

Whatever system you have can do that, but the real work IMHO is understanding the org enough to cover the 80/20 of what people will want to see. Ideally you want to get to a higher level of abstraction, such that you continuously codify your methods of analysis or pivots into traditional tables where possible, for maximum repeatability.

Sometimes I wonder if graphs or trees really make more sense, though, and whether OLAP being a tree of sorts is just a symptom of RDBMS ubiquity. But this is just a meandering notion, perhaps.

quantified over 5 years ago

This piece reiterates a bunch of the usual misunderstandings of OLAP, likely because it’s following the DBMS community’s redefinition of OLAP towards cross-tab group-by, and away from the higher end of the OLAP market. The “cube” is just as much a logical construct as a SQL table is, and just because row stores were bad at certain analytics didn’t mean that SQL was.

Compressed column stores hurt OLAP, because update throughput (the “on-line” in OLAP) is relatively bad. Uncompressed / array stores are quite good.

The quantity and complexity of SQL necessary for a non-trivial OLAP view is daunting: building time-series views (mixing YTD and current-period calcs for transactional and balance accounts), getting the ratio measures calculated, and handling the joins necessary for calculating and aggregating from the many (easily dozens of) fact tables that have different dimensionality and granularity and need to be joined at their finest detail. The user will want to see a set of metrics for a pair of organizational units, and also the percentage difference between the two org units. SQL does not have inter-row calculations, so quite a bit of work goes into re-shaping the SQL cursor’s results into something the user wants to see.

All that query generation and result transformation is part of the value-add of the “cube” server.

So the OLAP cube as a logical construct has definitely not fallen. Just the bad ones. They’re less flexible than SQL but provide way more productivity within the query space they’re built for.
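To make the re-shaping step above concrete, here is a minimal Python sketch of what a cube server might do with a cursor's output: pivot long rows into one line per metric and add the percentage difference between two org units. The row layout, the unit names, and the `compare_units` helper are all made up for illustration; real servers do far more than this.

```python
from collections import defaultdict

# Hypothetical rows as a SQL cursor might return them: (org_unit, metric, value).
rows = [
    ("East", "revenue", 120.0),
    ("East", "headcount", 40.0),
    ("West", "revenue", 100.0),
    ("West", "headcount", 50.0),
]


def compare_units(rows, unit_a, unit_b):
    """Pivot long rows into one entry per metric and add a percentage difference."""
    by_metric = defaultdict(dict)
    for unit, metric, value in rows:
        by_metric[metric][unit] = value

    report = {}
    for metric, values in by_metric.items():
        a, b = values[unit_a], values[unit_b]
        pct_diff = (a - b) / b * 100.0  # difference of unit_a relative to unit_b
        report[metric] = (a, b, pct_diff)
    return report


report = compare_units(rows, "East", "West")
```

Each entry in `report` is a tuple of (value for the first unit, value for the second unit, percentage difference), which is the side-by-side shape a user would expect to see on screen.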

willvarfar over 5 years ago

I'm conflicted; it's a nice write-up, and probably generally true right now, for most stuff.

However, I still live with databases big enough to still need cubes, although these cubes can afford to be less refined these days. Saying 'bigtable can do a regex on 30M rows per second' isn't saying it can't be done cheaper and quicker without paying Google etc., if you just have some cubes.

And I think it's going to track the normal sine wave: over time, data sets get bigger, and we keep oscillating between needing to cube and being able to have the reporting tool 'cube on the fly' behind the scenes.

I think there's a general move not mentioned in the article too: data lakes become faster, then data outstrips them, and so on.

The strength will be tooling that transparently cubes on demand. I wish there were efficient statistics and CDC that tracked metadata so tools could say 'this mysql table has been written to since I last snapshotted something', and, even better, 'this materialized view that I have in this database is now out of date because of writes that affect the expression it is built from on that other database over there' etc. Basic classic data sources can do a lot of new things to make downstream tools able to cache better.

I have a slight problem with the terminology in the middle of the article, as I'm so far down the rabbit hole that I think of cubes _as_ databases; I suffer cognitive dissonance when I read about shifts from cubes to databases etc. To me, a cube is just a fancy term for a table/view for a particular use case.

One tool that I'm terribly excited about these days is Presto (https://prestosql.io/). It allows you to take a constellation of different normal databases and query them as though they were one big database. And you can just keep on adding data sources. Awesome!

buremba over 5 years ago

The columnar database engines are powerful enough to answer ad-hoc questions, so you often don't need to materialize the summary data somewhere else or use BI tools such as Tableau that fetch the data into their server and let you run queries on their platform.

ELT solutions such as Airflow and DBT let you materialize the data in your database with (incremental) materialized views, similar to the way OLAP cubes work but inside your database and using only SQL. That way, you won't get stuck with vendor lock-in (looking at you, Tableau and Looker) and can instead manage the ELT workflow easily using these open-source tools.

These tools target analysts and data engineers, not business users, though. Your data team needs to model your data, manage the ETL workflow and adopt a BI tool for you. When you want to get a new measure into a summary table, you need to contact the analyst in your company and have them change the data model. As someone who is working in this industry, I can say that we still have a way to go, but BI workflows will be much more efficient in a few years thanks to the columnar databases.

Shameless plug: we're also working on a data platform where you model your data (dimensions, measures, relations, etc.) and build ad-hoc analytics interfaces for the business users. If the business user wants to optimize a specific set of queries (OLAP cubes), they simply select the dimension/measure pairs and the system automatically creates a DBT model that creates a summary table in your database, similar to OLAP cubes, thanks to the GROUPING SETS feature in ANSI SQL. Here are some of the public models if you're interested: https://github.com/rakam-io/recipes
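For readers who haven't used GROUPING SETS, the idea behind such a summary table can be sketched in plain Python: aggregate the same facts once per chosen combination of dimensions, with `None` standing in for the NULL that SQL puts in rolled-up columns. The fact rows, dimension names, and `summarize` helper below are hypothetical and only illustrate the shape of the result, not how any particular tool generates it.

```python
from collections import defaultdict

# Hypothetical fact rows: (country, device, revenue).
facts = [
    ("DE", "mobile", 10),
    ("DE", "desktop", 5),
    ("US", "mobile", 7),
]
DIMENSIONS = ("country", "device")


def summarize(facts, grouping_sets):
    """Sum revenue once per grouping set; rolled-up dimensions become None."""
    summary = defaultdict(int)
    for row in facts:
        values = dict(zip(DIMENSIONS, row[:2]))
        revenue = row[2]
        for grouping in grouping_sets:
            key = tuple(values[d] if d in grouping else None for d in DIMENSIONS)
            summary[key] += revenue
    return dict(summary)


# Roughly: GROUP BY GROUPING SETS ((country, device), (country), ())
summary = summarize(facts, [("country", "device"), ("country",), ()])
```

A dashboard query for revenue per country can then read the `("DE", None)` and `("US", None)` rows directly instead of scanning the raw facts again.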

_Codemonkeyism over 5 years ago

After maintaining an OLAP cube system for some years, I'm not that sure after reading the article.

The nice thing about an OLAP cube is the UI and how business users can easily drag and drop items to explore data (standard reports are best created automatically and don't need an OLAP layout/setup).

If the UI (Tableau, Excel Power Pivot) is the same, then yes, OLAP cubes are a thing of the past. Otherwise not.

simo7 over 5 years ago

I think the article is a bit simplistic.

It's true that _often_ OLAP cubes are not needed. That's simply because the amount of data and the latency requirements are _often_ not too demanding.

Also, materialized views don't solve the major issue with OLAP cubes: the need to maintain data pipelines.

I wonder if a solution to this problem could come from a different way of caching result sets: new queries that would produce a subset of a previously cached result could be run against the cached result itself. Of course this opens up a new set of problems, cache invalidation etc.
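One narrow reading of that caching idea can be sketched in Python, under strong assumptions: only SUM-style additive measures, no filters, and no invalidation at all. A cached result grouped by a finer set of columns answers any query grouped by a subset of those columns by re-aggregating it. The `RollupCache` class and the column names are invented for illustration.

```python
from collections import defaultdict


class RollupCache:
    """Caches SUM results per set of group-by columns (hypothetical helper)."""

    def __init__(self):
        self._cache = {}  # tuple of column names -> {tuple of values: sum}

    def put(self, columns, result):
        self._cache[tuple(columns)] = dict(result)

    def get(self, columns):
        """Answer a coarser query by re-aggregating a finer cached result."""
        wanted = tuple(columns)
        for cached_cols, result in self._cache.items():
            if set(wanted) <= set(cached_cols):
                idx = [cached_cols.index(c) for c in wanted]
                rolled = defaultdict(int)
                for key, total in result.items():
                    rolled[tuple(key[i] for i in idx)] += total
                return dict(rolled)
        return None  # cache miss: the caller has to query the database


cache = RollupCache()
cache.put(
    ("country", "device"),
    {("DE", "mobile"): 10, ("DE", "desktop"): 5, ("US", "mobile"): 7},
)
by_country = cache.get(("country",))
```

This only works because sums can be rolled up; measures like distinct counts or medians cannot be re-aggregated this way, which is one of the "new set of problems" alongside invalidation.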

simo7 over 5 years ago

Two false statements in this article:

> ...Amazon, Airbnb, Uber and Google have rejected the data cube...

Airbnb uses Druid, which is essentially an OLAP cube.

> BigQuery, for instance, doesn’t allow you to update data at all

That hasn't been the case for several years.

beefield over 5 years ago

Sorry, could someone ELI5 what OLAP is? And while you are at it, what is the Tabular Model? As background, I have worked with SQL and relational databases, and I occasionally keep hearing these terms, but nobody has ever explained to me what they are and why I should be interested. So far I have just shrugged and thought that I guess my workloads/datamodels/whatnot just do not need these fancy things, but whenever I see them, there is someone nagging at the back of my head that maybe I should have a look...

lukehan over 5 years ago

Well, I would say the OLAP cube is just rising again now. 1000+ companies have deployed Apache Kylin (an OLAP engine for big data) in the past 5 years, for 100+B rows, for 100+ concurrent users... Many different use cases are based on that technology. It works very well with BI tools and is friendly to analysts who have been using such "old-fashioned" tools every day for the past decade (how hard would it be for them to become data scientists?). Check more here: http://kylin.apache.org/community/poweredby.html

iblaine over 5 years ago

For anyone wondering, OLAP != OLAP Cubes.

OLAP = A category of databases meant for analyzing data. These are eventually consistent db's, and not OLTP db's. OLAP db's include Redshift, Teradata, Snowflake, BigQuery, and others. Generally what makes a database an MPP database is partitioning compute and storage. Generally what differentiates one MPP db from another is whether or not data and compute are colocated.

OLAP Cubes = A feature built into SQL Server that has its own dialect of SQL called MDX. OLAP Cubes are decreasing in popularity because you can achieve the same results through other means and with less effort.

inshadows over 5 years ago

What's a concrete example of an OLAP cube? What does Alice, a data analyst, actually do when she gets to work at her computer? What does she interact with on the screen? Does she use some specialized software to project the data cube into two dimensions (contingency tables) to find hidden meaning in the data? There's a lot of abstract talk and no actual examples on the Internet, except for SQL Server tutorials, which always end up with some kind of E-R diagram.

liyang-kylin over 5 years ago

Cost is a big factor the author underestimated in this big data era. A precalculated cube is not only faster but also many times cheaper in the cloud, thanks to the reuse of precalculated results.

Dynamic query services in the cloud, like Google BigQuery and Amazon Redshift/Athena, basically charge by processed data volume. For small and medium datasets, this works well. But for big data close to or above billions of rows, the cost will make you reconsider.

At the recent Apache Kylin Meetup in Berlin, OLX Group shared their comparison between OLAP cubes and dynamic queries in a real-world case. Given 0.1 billion rows, cube technology (Apache Kylin and SSAS) easily prevails over MPP+columnar (Redshift). In particular, Apache Kylin is 3.8x faster and 4.4x cheaper than Redshift for their business. (https://www.slideshare.net/TylerWishnoff/apache-kylin-meetup-berlin-with-olx-group)

For me, a mix of precalculation (80%) and dynamic calculation (20%) should hit the sweet spot between cost effectiveness and query flexibility.

lkcubing over 5 years ago

Thanks for sharing. Interesting write-up.

While this article accurately captures the issues with traditional OLAP cubes, it fails to recognize the latest developments in this domain.

Projects like Apache Kylin, and its commercial version Kyligence, leverage modern computing architectures such as columnar storage, distributed processing, and AI optimization to build cubes over hundreds of billions of rows of data covering hundreds of dimensions. The performance is unprecedented in either traditional OLAP cubes or today's MPP data warehouses. That's why the world's largest banks, retailers, insurance companies, and manufacturers are turning to Kylin/Kyligence for their most challenging analytical problems.

Not to mention the rich semantic layer that modern OLAP cube technology provides, which greatly simplifies analytics architecture in enterprises.

And comparing columnar stores to OLAP cubes is like comparing apples to oranges. The former is a storage format and the latter is an analytical pattern. Modern OLAP cube technology like Kylin/Kyligence stores cubes in columnar stores anyway.

ovi256 over 5 years ago

If you're a hacker interested in SQL and OLAP, you might enjoy Greenspun's (he of Greenspun's tenth rule fame) writings on these subjects:

https://philip.greenspun.com/wtr/data-warehousing.html
https://philip.greenspun.com/sql/

They are technically obsolete (as in talking about 90s database systems that have since bitten the dust), but that's a minor defect. He spends most of his effort on teaching timeless principles.

huy over 5 years ago

Interesting perspective. What do you say to someone who's been using OLAP cube for their entire BI implementation? What would be the transition plan to adopting MPP databases?

markus_zhang over 5 years ago

I have one question for you guys. If my company is focused on Spark and Vertica, and I want to learn data modelling on top of those, does Kimball still make sense? The article says yes in general, but I'd like to know your opinions.

Currently the BI team doesn't do much dimensional modelling as far as I can see. Everything is taken from Kafka and dumped into some wide tables with all the columns that we analysts need. Actually there is no data modelling at all.

js8 over 5 years ago

It seems that the article makes a categorical error, arguing that OLAP cubes were replaced by columnar data stores. I always understood OLAP cube as an abstract concept that can have various technical implementations, while column store is a kind of optimization in that technical implementation.

sgt over 5 years ago

How does something like Tableau fit in? I know of people using Tableau with a Postgres connector, but I am not sure if that allows you the same kind of performance as you'd get with OLAP or even a columnar DB.

dexterzz over 5 years ago

The author doesn't know today's OLAP. Look at Tabular Modeling in SQL Server Analysis Services (Power BI), an in-memory column-oriented analytics engine.

lou1306 over 5 years ago

> Codd got called out for his conflict of interest and was forced to retract his paper … but without much fallout, it seems: today, Codd is still regarded as ‘the father of the relational database’

I found this passage confusing. He is regarded as such because of his work on the relational algebra, and the shady OLAP backstory is unrelated to that.
