I've personally grown to love SQL and I think it is by far the clearest (if verbose) way to describe data transformation work. I learned traditional programming (e.g. languages like Python) long before I stumbled on SQL, but I'm not sure I could've understood Pandas/R as well without having learned SQL, particularly the concept of joins.<p>That said, my affection for SQL correlates with my overall growth as a software engineer, particularly grokking the concept of piping data between scripts: e.g. writing a SQL script to extract, filter, and transform data, and then a separate Python/R script that creates a visualization. I think SQL seems so unattractive to data scientists (i.e. people who may not be particularly focused on software development) because it seems so alien and separate from the frameworks (R/Pandas) they use to do the actual analysis and visualization. And so it's understandable to just learn how to do it all in R/Pandas, rather than a separate language and paradigm for data wrangling.
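The split described here (SQL for the join/filter/aggregate, a general-purpose language for whatever comes after) can be sketched with Python's built-in sqlite3 module; the tables and column names are made up for illustration:

```python
import sqlite3

# In-memory database standing in for the "extract and transform" side.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 12.5), (3, 2, 99.0);
""")

# SQL does the join, grouping, and sorting in one declarative statement...
rows = conn.execute("""
    SELECT c.name, SUM(o.total) AS spend
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY spend DESC
""").fetchall()

# ...and Python takes over for the presentation/visualization step
# (here just printing, where a real script might plot instead).
for name, spend in rows:
    print(f"{name}: {spend:.2f}")
```

The point is the division of labor: the data wrangling lives entirely in the query, and the consuming script never loops over raw rows to aggregate them.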
Not to step on anybody’s toes, but… I often suspect that there are lots of people who tried, and failed, to learn relational database techniques and gravitated toward schemaless solutions like Mongo just because they’re easier to understand. Maybe not everybody, but more than a few that I’ve interacted with.
The author of this essay is beating a dead horse. "NoSQL" (I really dislike the term) has proven its value over the years and deserves a seat at the table. People will always misuse technology or implement it poorly, but I don't think that warrants yet another oversimplified "SQL vs NoSQL" rant.
I've been working with SQL a lot in my job lately. For what it's worth, I'm a big fan.<p>It seems the central argument here is that NoSQL doesn't force you into good design habits, so it's overrated. I'll concede this is partially true from my perspective, because much of my work involves trying to sanitize, transform, and otherwise refactor poorly structured or designed NoSQL datasets.<p>But I've also seen my fair share of SQL databases which are poorly designed and don't use features which are meant to benefit developers (I haven't seen a Foreign Key smartly implemented in a LONG time). It's not really fair to say NoSQL has encouraged poor design practices; from my experience, data model design is given little effort in general.<p>NoSQL takes the 'training wheels' off data model implementation, where SQL keeps them on; but even with them on you can still fall off the bike if you aren't careful, though it's much harder.
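As an aside on foreign keys: a minimal sketch of what enforcement buys you, using Python's bundled SQLite (which, notably, leaves FK enforcement off unless you ask for it); the schema and names here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        title TEXT NOT NULL
    );
    INSERT INTO authors VALUES (1, 'alice');
""")

conn.execute("INSERT INTO posts VALUES (1, 1, 'hello')")  # OK: author 1 exists

rejected = False
try:
    conn.execute("INSERT INTO posts VALUES (2, 99, 'orphan')")  # no author 99
except sqlite3.IntegrityError:
    rejected = True  # the database refused to store a dangling reference
```

This is the "training wheels" in action: the bad row never makes it into the table, so there is nothing to sanitize later.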
It's nice that the article mentions Codd and his relational model of data, but what it doesn't mention <i>is how badly SQL parrots relational algebra</i>. The language is inspired by the idea ("based on a real story"(c)), yes, but it takes a really clean and sound model and makes an unbelievable mess out of it.<p>SQL is just an ugly historical accident. Unfortunately, this is how it often works...<p>NoSQL is a different story, of course. BTW, I believe non-relational databases predate Codd's work: there were many examples of hierarchical and network DBs in the '60s and '70s.
The people who bemoan SQL are like the people who bemoan staff notation. Every [music|data] student thinks they can do better when they first learn it, instead of embracing the thought that's gone into it over many years.
I don't agree with the vilification of NoSQL, but I do agree that SQL is a great query language.<p>That's partly why I wanted to create a tool to query various databases (NoSQL ones, or even files) with SQL. We're still at an early stage with OctoSQL regarding the variety of databases, but take a look if that sounds appealing to you: <a href="https://github.com/cube2222/octosql/" rel="nofollow">https://github.com/cube2222/octosql/</a>
Coming from the boring enterprise world, where we didn't really get swept along in the NoSQL hype, this doesn't seem like it should remotely be a surprise to anyone.<p>"Hey, turns out there's lots of great ways to use SQL!" Yeah, we know; SQL databases have been at the core of our business for at least the last 20 years.<p>Lots of enterprise shops started out as SQL databases to keep track of all the data, and web apps grew up organically around them.
SQL is a declarative, essentially functional way of expressing computation over data. No imperative programming model will be cleaner or clearer at expressing intent. But you have to understand SQL to begin with.
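A small illustration of the declarative-vs-imperative contrast, using Python's built-in sqlite3 (toy table and column names invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 10), ('west', 20), ('east', 5);
""")

# Imperative: spell out *how* to walk the rows and accumulate totals.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals[region] = totals.get(region, 0) + amount

# Declarative: state *what* you want; the engine chooses the plan.
sql_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

assert totals == sql_totals  # same result, very different expression of intent
```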
I've worked extensively with SQL and relational data modeling. It's been very useful for accomplishing my work. I haven't come across better tools and so I haven't adopted alternatives. If the time ever comes where there is truly a superior tool set to the one I am currently using, I will gladly stop using antiquated technology. Newness isn't sufficient to sway me. No dogma here. Just pragmatism.
Data structures still matter, and very much so! If you run in the cloud, data structures have to be very efficient. Data should be normalized. MySQL even has built-in hash table support in the form of adaptive hash indexes.<p>I think one should also not over-complicate the data layer.<p>Boring tech: I like the safety of the C and D in SQL's ACID.
<a href="http://boringtechnology.club/" rel="nofollow">http://boringtechnology.club/</a>
> On the other hand, if you check Postgres’ configuration file, most of the parameters are straightforward and tradeoffs not so hard to spot.<p>The article also refers to MongoDB as hard to configure.
It may be a matter of taste, but I find MongoDB's documentation way better than Postgres'. They have things like this:<p><a href="https://docs.mongodb.com/manual/administration/production-checklist-operations/" rel="nofollow">https://docs.mongodb.com/manual/administration/production-ch...</a>
<a href="https://docs.mongodb.com/manual/administration/analyzing-mongodb-performance/" rel="nofollow">https://docs.mongodb.com/manual/administration/analyzing-mon...</a><p>These I can easily follow and apply, and they are actionable: set noatime in fstab, use XFS, raise the max number of file handles, etc.<p>For Postgres (<a href="https://www.postgresql.org/docs/12/index.html" rel="nofollow">https://www.postgresql.org/docs/12/index.html</a>) I cannot easily find anything similar to the MongoDB pages.
Good article, but it and a lot of the comments here seem to be conflating SQL as a language and Postgres (or other RDBMS implementations).<p>For the record, I like both these things and they often go together well. But once your data gets too big for Postgres, you don't have to immediately jump to NoSQL: modern distributed SQL tools like Presto are quite good and can let you continue to use the same SQL interaction patterns, even at the terabyte or petabyte scale.<p>You have to be a <i>little</i> more aware of what's going on under the hood, so you don't write something pathological, but it's quite a powerful approach.<p>I am even using Presto for a lot of what would normally be considered ETL: using SQL to select from a "dirty" table and inserting into a "clean" table on a schedule.
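The "dirty table to clean table" pattern mentioned here is essentially one `INSERT INTO … SELECT` statement. A toy sketch, with SQLite standing in for Presto and made-up table names and validity rules:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (user_id TEXT, amount TEXT);
    CREATE TABLE clean_events (user_id INTEGER, amount REAL);
    INSERT INTO raw_events VALUES ('1', '9.99'), ('', 'oops'), ('2', '3.50');
""")

# The whole scheduled "ETL" hop is a single statement: filter out malformed
# rows and cast the survivors on their way into the clean table.
conn.execute("""
    INSERT INTO clean_events (user_id, amount)
    SELECT CAST(user_id AS INTEGER), CAST(amount AS REAL)
    FROM raw_events
    WHERE user_id GLOB '[0-9]*' AND amount GLOB '[0-9]*.[0-9]*'
""")

clean = conn.execute("SELECT * FROM clean_events ORDER BY user_id").fetchall()
```

On a real cluster the same query runs distributed, which is the appeal: the "pipeline" is just SQL plus a scheduler.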
Start with Postgres. Don't start with SQLite. SQLite is a file format, not a database; it scales atrociously (I've seen simple queries run for tens of seconds with 100MB of data), it basically doesn't work in concurrent update scenarios, and the more data you put into it, the more pain you'll have migrating to Postgres.<p>Use SQLite if you want a file format where you don't want to rewrite the whole file for every update, or if you're in a situation or environment where it's not feasible to have an actual database server.
One point got me curious:<p>>As a rule of thumb, vertical scalability is in most cases more economical than the horizontal one.<p>Isn't <i>the opposite</i> the reason why we started to scale horizontally in the first place?
Querying data in Cypher is so much easier. I have 20 years as a database developer, and Cypher is way better. I can live-code 50 complex queries in Cypher before I get one done in SQL.
It’s best not to take criticism seriously from people who have no formal CS education and can’t even define the premises of relational algebra. Ah well.
Alright, here's a relevant question I've been having along these lines. Let's say I have the code to gather / scrape / load some stats into Postgres, but then want to run projections on them.<p>For example, if I'm trying to predict the next day's stats from past stats by simply taking the average, how many days back should I look? Is the last 3-day average best? 5 days? 8? 10?<p>There are clearly a couple of ways to do it. One is getting a data frame of all the stats for all the objects, writing the Python logic to loop through the days, fetching the stats from the past X days (not including the current day), taking the average, and then storing that back to Postgres in a projections table: a function like set_projections(5), where 5 is the number of days in the past I'm averaging over.<p>The second way is to write that function as a PL/pgSQL function which uses subqueries to find the past X days of stats for the players and then creates or updates the projections table all in SQL, so we can run `select set_projections(5)` and it does it all itself.<p>So the question becomes: which one is "best"? I have to imagine it's mostly a case-by-case basis. Since it's only me here, I've been doing it in Postgres alone, since it can be done in one query (with multiple subqueries, yes). With Python, it'd involve many more steps. On the other hand, the SQL looks huge, and I've been running into the question of whether to split some of the subqueries into sub-functions, since that's what I'd do in Python. If more people were involved, would it be bad to have larger logic like that in Postgres, since we wouldn't know the skill of the others? Mostly they'd be coders and could write the projections in whatever languages they wanted.<p>Another example of this tradeoff: how should I interact with the database?
I have a Rails background, and ActiveRecord is an incredibly good ORM, whereas SQLAlchemy hasn't done it for me. In either case, there's a ton of overhead in getting the connections running with the correct class variables. So instead, I kind of created my own ORM / interface where I can quickly write queries on my own and use them in the code. This is especially easy since most of the queries can be written as Postgres functions, so the query strings in the code are incredibly tiny!<p>What I've learned from this project is that SQL is very, very powerful in what it does. I've shifted to using it for more actions than I would have in the past, and pretty much all the thinking I do is about writing as little code as possible.<p>Anyone make it through reading this and have comments about what I'm doing and what they like to do?
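For the projections question above: the pure-SQL route can be a single window-function query rather than a loop over days. A sketch with a hypothetical daily_stats table, run here against Python's bundled SQLite (window functions need SQLite 3.25+; the same query shape works in Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_stats (player_id INTEGER, day INTEGER, points REAL);
    INSERT INTO daily_stats VALUES
        (1, 1, 10), (1, 2, 20), (1, 3, 30), (1, 4, 40);
""")

# Each day's projection = average of the previous 3 days for that player,
# current day excluded, expressed as a window frame instead of Python loops.
rows = conn.execute("""
    SELECT player_id, day,
           AVG(points) OVER (
               PARTITION BY player_id ORDER BY day
               ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING
           ) AS projection
    FROM daily_stats
    ORDER BY day
""").fetchall()
```

Day 1 has an empty frame (projection NULL), day 4 averages days 1-3. Parameterizing the `3` is what a hypothetical `set_projections(X)` function would do, inserting this result into the projections table.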
I might be under-informed but "data science" seems to involve a lot of vague BS. Computing has always centered on data.<p>"Data practitioners", "data practice", "data management" just seems like weird rebranding of stuff that businesses have been doing since the 1960s. Partly what the article seems to be relating regarding SQL.<p>What is "data science" besides computing, data storage of some sort, and analyzing the data? Because it's "BIG" now? Because now we "extract knowledge and insights" from data, since apparently giant databases were amassed for no reason in the past? Because now we "combine multidisciplinary fields like statistics", since apparently nobody had thought to run statistics on their data before? Because "AI"?
The damage SQL has done to the general public's perception of the relational model cannot be undone. The relational model is beautiful; it's in no way more complex than objects and attributes, but SQL makes it seem so by conflating orthogonal aspects into it.
In the Hadoop world, things have evolved to support SQL. Spark, Hive, and Impala all have full support for SQL. The Spark implementation is actually faster than doing low-level RDD processing yourself, as there are optimizers that work very well. In addition, you can create UDFs that are easy to integrate.<p>The only reason you might choose other approaches is to make the problem look significantly more complicated, thereby justifying more maintenance and resources.<p>SQL is just too easy, and some super smart engineers don't like it because of that. That said, NoSQL does have value in some corner cases where it can perform better, when most of the logic is simple lookups.
“Why didn’t we use an RDBMS in the first place?”<p>Because initial application specifications are sparse and definitely wrong. If your application is still up and running 5 years later and your data definition hasn't changed much in the past 3, then maybe refactor around an RDBMS. Designing around rigid structures during your first pass is costly. This is why there's been a rise in NoSQL and dynamic languages.