The Rise of SQL-Based Data Modeling

192 points by huy over 5 years ago

26 comments

dbatten over 5 years ago

> This implies that SQL is not reusable, causing similar code with slightly different logic to be repeated all over the place. For example, one cannot easily write a SQL ‘library’ for accounting purposes and distribute it to the rest of your team for reuse whenever accounting-related analysis is required.

Data scientist here. I think some of this problem is handled by effective use of views. Oh, everybody is constantly joining these three accounting-related tables and aggregating by, say, order number? Have your Data Engineer/DBA/analyst/whoever create a view that takes care of that. Boom. Now everybody's using the same data, calculated the same way, nobody's reinventing the wheel, and you don't have to worry about somebody fat-fingering something when they re-write that query for the 10th time.

With that being said, I still think there's some truth to this criticism, in that it's not as easy/common to be able to build an abstract query that does a common operation on arbitrary data. You can't import trend_forecast.sql, hand it arbitrary time-series data, and generate an N-month linear forecast from your historical data points. At least, not easily in ANSI SQL.
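To illustrate the view approach described above, here is a minimal sketch. Everything in it is an assumption made for the example: the tables `orders`, `payments`, and `refunds`, the view `order_totals`, and the helper `net_by_order` are hypothetical, and an in-memory SQLite database stands in for whatever warehouse a team actually uses.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, customer TEXT);
CREATE TABLE payments (order_id INTEGER, amount REAL);
CREATE TABLE refunds  (order_id INTEGER, amount REAL);

INSERT INTO orders   VALUES (1, 'acme'), (2, 'globex');
INSERT INTO payments VALUES (1, 100.0), (1, 50.0), (2, 80.0);
INSERT INTO refunds  VALUES (1, 30.0);

-- The shared definition: join the three tables once, aggregate by order.
-- Each side is pre-aggregated so the joins cannot multiply rows.
CREATE VIEW order_totals AS
SELECT o.order_id,
       o.customer,
       COALESCE(p.paid, 0)                           AS paid,
       COALESCE(r.refunded, 0)                       AS refunded,
       COALESCE(p.paid, 0) - COALESCE(r.refunded, 0) AS net
FROM orders o
LEFT JOIN (SELECT order_id, SUM(amount) AS paid
           FROM payments GROUP BY order_id) p ON p.order_id = o.order_id
LEFT JOIN (SELECT order_id, SUM(amount) AS refunded
           FROM refunds GROUP BY order_id) r ON r.order_id = o.order_id;
""")


def net_by_order(connection):
    """Every consumer queries the view instead of rewriting the joins."""
    query = "SELECT order_id, net FROM order_totals ORDER BY order_id"
    return connection.execute(query).fetchall()


rows = net_by_order(conn)
print(rows)
```

The view fixes one definition of "net per order" for everyone. It does not address the second point in the comment: the view is still tied to these specific tables, so it is not a generic routine that can be handed arbitrary data.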

cjf4 over 5 years ago

One of the things I'm constantly baffled by given the growth of data science and analytics is that data modelling isn't treated as a first class concern. It's absolutely fundamental to doing quantitative work efficiently at an organization with any amount of complexity, yet the majority of people in the field seem to be unaware of the concepts.

This ignorance is especially surprising given that it's essentially a solved problem (Kimball), yet if you talk about data modelling, people usually think regression, not schema.

swalsh over 5 years ago

"Up till a few years ago, the traditional way of managing data (in SQL-based RDBMSs) was considered a relic of the past, as these systems couldn't scale to cope with such a huge amount of data."

I must have missed the boat on this one. I remember in 2010 there was a brief period of time where NoSQL was in fashion, but it rightfully died pretty quickly to a small set of specialized use cases. There have been some cases where big data systems have replaced more traditional RDBMS systems, but now you can use SQL for those too (like Hive SQL).

SQL is the one skill that has not become obsolete in the course of my career. Frankly I've started relying on it more, because it never goes obsolete. Also it's fast as hell. When I first started my career I did C#, and the .NET Framework 2 was fairly new. Since then WinForms and WebForms have gone away. ORMs changed, JavaScript changed. Then I moved to Ruby, and Python, and PHP. Those ecosystems have evolved too. But the one thing that I learned 15 years ago, that I still use every day, is SQL.

harshaw over 5 years ago

This quote irks me: "Any tool that relies heavily on SQL is off-limits to business users who want more than viewing static charts and dashboards but don't know the language"

The good business folks know SQL and aren't afraid of it. I used not to be sure of this until I worked in an organization where most PM types use SQL with comfort.

trollied over 5 years ago

Don't really get this article. SQL never went away or got less popular. Always been in the background doing its awesome thing.

blowski over 5 years ago

I don't really understand what this article is trying to say, other than "isn't Looker great!".

Having timely and detailed data available to a wide range of people in the organisation is now seen as a competitive advantage in many industries. There's a lot of tech out there to help with this.

But I could have written this article talking about how companies like Reltio are relying on NoSQL solutions to "empower the enterprise", or how Firebase is allowing startups to not worry about data structures, or how HSBC is deploying blockchain solutions, or how Spark means you can combine data from all different sources. It would still be just as accurate and meaningful. As it is, it just sounds like an infomercial for Looker.

oneofthose over 5 years ago

DBT (data build tool) [0] embraces this idea - it's like make for data transformation. Just like make, its syntax is sub-optimal. But that's the only drawback. There is an open source version, it generates documentation automatically, you can materialize tables on a schedule, and it allows you to write unit tests for your data ("this select must return 0 rows" kind of tests). I'm not affiliated with them, just a happy user.

[0] https://www.getdbt.com/
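The "this select must return 0 rows" idea can be sketched without any particular tool. The snippet below is not dbt's API; the helper `failing_rows`, the `orders` table, and both test queries are hypothetical names invented for this illustration, and SQLite is used only so the example runs on its own.

```python
import sqlite3


def failing_rows(connection, test_sql):
    """Run a test query; every row it returns is a violation of the rule."""
    return connection.execute(test_sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 25.0), (2, 40.0), (3, -5.0);
""")

# Each test is a SELECT written to find bad data, so "passing" means no rows.
NEGATIVE_AMOUNTS = "SELECT order_id FROM orders WHERE amount < 0"
DUPLICATE_IDS = """
SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1
"""

negative = failing_rows(conn, NEGATIVE_AMOUNTS)   # order 3 violates the rule
duplicates = failing_rows(conn, DUPLICATE_IDS)    # empty, so this test passes

for name, rows in [("negative_amounts", negative), ("duplicate_ids", duplicates)]:
    status = "PASS" if not rows else f"FAIL ({len(rows)} offending row(s))"
    print(f"{name}: {status}")
```

Writing the test as a query for offending rows means a failure also reports exactly which records broke the rule.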

iblaine over 5 years ago

The problem with modeling data isn't the lack of tools or the need to approach this problem differently. Plenty of solutions exist. People just don't care, and this is ok.

It used to be that you needed tools like Informatica and Kimball-inspired datamarts, but databases are now bigger and faster. Whatever data modeling problems you may have, you can easily clean up in an ETL or a BI layer, with relatively little effort. This makes tools like Looker, dbt, and Holistics a luxury and not something you need to have. I wish the industry would put more effort into defining clean data models, but I think that ship has sailed. The prevailing trend seems to be to create Data Lakes, add a BI layer, then call it a day.

[edit] Also...some database points. The industry never shifted to using NoSQL to replace RDBMS systems. But event processing matured, and NoSQL DBs are ideal for storing unstructured data, so you see them in data engineering stacks. Greenplum is a free MPP database that has been around for nearly a decade. The point about Spanner SQL is interesting for the fact that Spanner evolved from NoSQL-like methods to a SQL-like dialect, but Spanner is a unique flower in the industry, due to being an HTAP DB.

ckastner over 5 years ago

The article doesn't mention performance.

I haven't benchmarked this yet, but after my first experiences with somewhat complex data transformations in numpy and pandas, I was left with the feeling that despite them being optimized, any modern RDBMS would still have run circles around them.

They've been optimized for this kind of stuff for decades, after all.

Twisell over 5 years ago

Having been a total stubborn asshole to my boss over the last decade, because I refused to even investigate how we could replace our old and "complicated" SQL integration/export workflows with a modern and "intuitive" visual proprietary ETL, this article is a real relief for me!

I'm now officially a bleeding edge DevOps with 10 years' expertise on the brand new "old school" ELT (Extract, Load, Transform).

LOL

sixdimensional over 5 years ago

It’s a little bit off topic, but the tone of this discussion brings up an interesting conversation I’ve been having with some of my colleagues in the same age group - a perceived skills gap between the newest developers and ourselves.

I don’t mean that in a negative way. I mean it in the sense that many of the newest developers don’t know where their cloud based NoSQL database came from (for example). They never were taught the history of what came before, during and after RDBMS. They are only now rediscovering some of what the “old” tools could do.

Many of these developers seem eager to learn, and I am happy to mentor them and teach them the history that I know.

But it has surprised me. It almost feels a little like so many years of waves of marketing and hype maybe actually were a real detriment to teaching people what is real, what is the best tool for the job in different cases, etc.

I have no real evidence other than my anecdotal experience, but this article lends credence to the argument that some never learned or never were given the time to understand the discipline of databases.

Possibly, the discipline of databases and related development has just been continually developing and never settled, so that is why the curriculum hasn’t kept up. But it really does concern me when a new developer doesn’t even understand what a JOIN is.

Edit: Or even moreso that SQL is an interface to a data engine, and was not necessarily always tightly coupled to relational databases (although it evolved often in lock step with them, which is why you see them there more often).

0x5002 over 5 years ago

> Instead, NoSQL systems like HBase, Cassandra, and MongoDB became fashionable as they marketed themselves as being scalable and easier to use

HBase did not - the project has always been very clear that they cater towards a very specific set of use cases - fast writes with few schema constraints, fast single-key and range/fuzzy lookups, not big ETL pipelines.

Even during the rise of Hadoop (everything is a file... I mean file based!) and the subsequent absorption of that into the Public Cloud vendors, SQL has always been there, just wrapped in different tools. These days, someone else hosts it and it's now called Athena instead of Hive, but it's fundamentally the same thing and always has been.

Even Apache Spark's entire Dataset/DataFrame interface yields SQL-like execution plans, exposing the same functions that an RDBMS would, just in Scala/Python/R.

code4tee over 5 years ago

Yes. Plus all the SQL-like things you can do now. For example storing data on S3 and then querying it with AWS Athena is a simple and powerful way to keep a huge archive of data that you may want to query at some point.

Also, never underestimate the power and speed of a well tuned SQL-family server/cluster, even at surprisingly large scale. A lot of use cases for the older “Hadoop Cluster” type stuff have been overtaken by these approaches. I’ve seen a lot of operations spend silly sums to build ultimately quite clunky Hadoop-based systems when really they probably just needed one half decent SQL admin and a well tuned cluster.

__ian__ over 5 years ago

BigQuery is good. Looker is OK. This reads like paid-for PR content.

davewritescode over 5 years ago

I think the real shift back to SQL is in part because modern streaming platforms like Kafka give relatively simple mechanisms to implement eventually consistent databases. Aside from the simple key-value store use case, eventual consistency was often a very good reason to use something like Cassandra instead of MySQL.

lostsoul8282 over 5 years ago

Great article. SQL is so widely known and relatively simple to implement that I'm baffled businesses don't start with it as a solution and then work their way into our solutions if they find it doesn't meet their needs.

It's so easy and quick to get started, it should be most people's first choice.

kohtatsu over 5 years ago

A few days ago I designed a DSL for CREATE TABLE statements.

https://gist.github.com/nomsolence/69bc0b5fe1ba943d82cd37fdbd23df6d#file-db-png

Being able to focus on the relationships without worrying about commas is nice.

I'm still writing the compiler (it's my first, I'm sure it will be awful), but I'm starting from the finish so it's been easy for me to pick up where I left off; I started by deciding the language, then writing the outputs by hand for the tokenizer, my two stages of AST, and the actual SQL.

cube2222 over 5 years ago

Hey, just wanted to chime in with a tool I'm a co-author of, OctoSQL [1].

I too really like a common interface to various data sources. OctoSQL allows you to use plain SQL to transform, analyse and join data from various datasources, currently including MySQL, PostgreSQL, Redis, JSON files, CSV, Excel.

However, we're very inspired by yesteryear's paper "One SQL to rule them all" and should have ergonomic streaming support, with Kafka as a new datasource, available very soon.

[1]: https://github.com/cube2222/octosql

exabrial over 5 years ago

Does anyone have any ML Packages that directly integrate with SQL Databases? I feel like the ETL market is well covered with a myriad of tools, but our data scientists at the end of the day still want to extract giant CSVs b/c their python progs "just don't work natively" on an RDBMS (they probably do, but it's not the way it's done, and these guys aren't programmers for their first job anyway).

32gbsd over 5 years ago

Looks like a promo

threeseed over 5 years ago

Just a slight conflict of interest having a company that makes SQL-based data modelling tools telling us that SQL-based data modelling is on the rise.

But given how many companies have set up data lakes with unstructured and semi-structured data (think SaaS exports) and how SQL layers have largely been unimpressive, I'm not sure it's the case.

sashavingardt2 over 5 years ago

This article was written by yet another kid a few years out of college who doesn't know the history or the tooling, and thinks he and his teammates are providing a solution to an existing problem. What he doesn't realize is that SQL reusability and data modeling were solved 20 years ago.

IpV8 over 5 years ago

I'm surprised to see only a cursory mention of Snowflake in this article. In my experience, they are really the pioneers of the new distributed, cloud-first database. They really enabled large scale relational data warehouses, and are still miles ahead of even the big cloud players.

tkyjonathan over 5 years ago

Already started a DataOps team in my company!

cryptonector over 5 years ago

That's because SQL is pretty awesome.

lincpa over 5 years ago

Everything is RMDB: https://github.com/linpengcheng/PurefunctionPipelineDataflow/blob/master/doc/Everything_is_RMDB.md
