I've personally grown to love SQL and I think it is by far the clearest (if verbose) way to describe data transformation work. I learned traditional programming (e.g. languages like Python) long before I stumbled on SQL, but I'm not sure I could've understood Pandas/R as well without having learned SQL, particularly the concept of joins.<p>That said, my affection for SQL correlates with my overall growth as a software engineer, particularly grokking the concept of piping data between scripts: e.g. writing a SQL script to extract, filter, and transform data, and then a separate Python/R script that creates a visualization. I think SQL seems so unattractive to data scientists (i.e. people who may not be particularly focused on software development) because it seems so alien and separate from the frameworks (R/Pandas) they use to do the actual analysis and visualization. And so it's understandable to just learn how to do it all in R/Pandas, rather than a separate language and paradigm for data wrangling.
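The split described here (SQL for the join/filter/aggregate, a general-purpose language for whatever comes after) can be sketched with Python's built-in sqlite3 module; the tables and column names are made up for illustration:

```python
import sqlite3

# In-memory database standing in for the "extract and transform" side.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 12.5), (3, 2, 99.0);
""")

# SQL does the join, grouping, and sorting in one declarative statement...
rows = conn.execute("""
    SELECT c.name, SUM(o.total) AS spend
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY spend DESC
""").fetchall()

# ...and Python takes over for the presentation/visualization step
# (here just printing, where a real script might plot instead).
for name, spend in rows:
    print(f"{name}: {spend:.2f}")
```

The point is the division of labor: the data wrangling lives entirely in the query, and the consuming script never loops over raw rows to aggregate them.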
Not to step on anybody’s toes, but… I often suspect that there are lots of people who tried, and failed, to learn relational database techniques and gravitated toward schemaless solutions like Mongo just because they’re easier to understand. Maybe not everybody, but more than a few that I’ve interacted with.
The author of this essay is beating a dead horse. "NoSQL" (I really dislike the term) has proven its value over the years and deserves a seat at the table. People will always misuse technology or implement it poorly, but I don't think that warrants yet another oversimplified "SQL vs NoSQL" rant.
I've been working with SQL a lot in my job lately. For what it's worth, I'm a big fan.<p>It seems the central argument here is that NoSQL doesn't force you into good design habits, so it's overrated. I'll concede this is partially true from my perspective, because much of my work involves trying to sanitize, transform, and otherwise refactor poorly structured or designed NoSQL datasets.<p>But I've also seen my fair share of SQL databases which are poorly designed and don't use features which are meant to benefit developers (I haven't seen a Foreign Key smartly implemented in a LONG time). It's not really fair to say NoSQL has encouraged poor design practices; from my experience, data model design is given little effort in general.<p>NoSQL takes the 'training wheels' off data model implementation, where SQL keeps them on; but even with them on you can still fall off the bike if you aren't careful, though it's much harder.
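As an aside on foreign keys: a minimal sketch of what enforcement buys you, using Python's bundled SQLite (which, notably, leaves FK enforcement off unless you ask for it); the schema and names here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        title TEXT NOT NULL
    );
    INSERT INTO authors VALUES (1, 'alice');
""")

conn.execute("INSERT INTO posts VALUES (1, 1, 'hello')")  # OK: author 1 exists

rejected = False
try:
    conn.execute("INSERT INTO posts VALUES (2, 99, 'orphan')")  # no author 99
except sqlite3.IntegrityError:
    rejected = True  # the database refused to store a dangling reference
```

This is the "training wheels" in action: the bad row never makes it into the table, so there is nothing to sanitize later.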
It's nice that the article mentions Codd and his relational model of data, but what it doesn't mention <i>is how badly SQL parrots relational algebra</i>. The language is inspired by the idea ("based on a real story"(c)), yes, but it takes a really clean and sound model and makes an unbelievable mess out of it.<p>SQL is just an ugly historical accident. Unfortunately, this is how it often works...<p>NoSQL is a different story, of course. BTW, I believe non-relational databases predate Codd's work: there were many examples of hierarchical and network DBs in the '60s and '70s.
The people who bemoan SQL are like the people who bemoan staff notation. Every [music|data] student thinks they can do better when they first learn it, instead of embracing the thought that's gone into it over many years.
I don't agree with the vilification of NoSQL, but I do agree that SQL is a great query language.<p>That's partly why I wanted to create a tool to query various databases (NoSQL ones, or even files) with SQL. We're still at an early stage with OctoSQL regarding the variety of databases, but take a look if that sounds appealing to you: <a href="https://github.com/cube2222/octosql/" rel="nofollow">https://github.com/cube2222/octosql/</a>
Coming from the boring enterprise world, where we didn't really get swept along in the NoSQL hype, this doesn't seem like it should remotely be a surprise to anyone.<p>"Hey, turns out there's lots of great ways to use SQL!" Yeah, we know; SQL databases have been at the core of our business for at least the last 20 years.<p>Lots of enterprise shops started out as SQL databases to keep track of all the data, and web apps grew up organically around them.
SQL is a declarative, essentially functional way of expressing computation over data. No imperative programming model will be cleaner or clearer at expressing intent. But you have to understand SQL to begin with.
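A small illustration of the declarative-vs-imperative contrast, using Python's built-in sqlite3 (toy table and column names invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 10), ('west', 20), ('east', 5);
""")

# Imperative: spell out *how* to walk the rows and accumulate totals.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals[region] = totals.get(region, 0) + amount

# Declarative: state *what* you want; the engine chooses the plan.
sql_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

assert totals == sql_totals  # same result, very different expression of intent
```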
I've worked extensively with SQL and relational data modeling. It's been very useful for accomplishing my work. I haven't come across better tools and so I haven't adopted alternatives. If the time ever comes where there is truly a superior tool set to the one I am currently using, I will gladly stop using antiquated technology. Newness isn't sufficient to sway me. No dogma here. Just pragmatism.
Data structures still matter, and very much so! If you run in the cloud, data structures have to be very efficient. Data should be normalized. MySQL even has built-in hash table support in the form of adaptive hash indexes.<p>I think one should also not over-complicate the data layer.<p>Boring tech: I like the safety of the C and D in SQL's ACID.
<a href="http://boringtechnology.club/" rel="nofollow">http://boringtechnology.club/</a>
> On the other hand, if you check Postgres’ configuration file, most of the parameters are straightforward and tradeoffs not so hard to spot.<p>The article also refers to MongoDB as hard to configure.
It may be a matter of taste, but I find MongoDB's documentation way better than Postgres'. They have things like this:<p><a href="https://docs.mongodb.com/manual/administration/production-checklist-operations/" rel="nofollow">https://docs.mongodb.com/manual/administration/production-ch...</a>
<a href="https://docs.mongodb.com/manual/administration/analyzing-mongodb-performance/" rel="nofollow">https://docs.mongodb.com/manual/administration/analyzing-mon...</a><p>These I can easily follow and apply, and they are actionable: set noatime in fstab, use XFS, raise the max number of file handles, etc.<p>For Postgres (<a href="https://www.postgresql.org/docs/12/index.html" rel="nofollow">https://www.postgresql.org/docs/12/index.html</a>) I cannot easily find anything similar to the MongoDB pages.
Good article, but it and a lot of the comments here seem to be conflating SQL as a language and Postgres (or other RDBMS implementations).<p>For the record, I like both these things and they often go together well. But once your data gets too big for Postgres, you don't have to immediately jump to NoSQL: modern distributed SQL tools like Presto are quite good and can let you continue to use the same SQL interaction patterns, even at the terabyte or petabyte scale.<p>You have to be a <i>little</i> more aware of what's going on under the hood, so you don't write something pathological, but it's quite a powerful approach.<p>I am even using Presto for a lot of what would normally be considered ETL: using SQL to select from a "dirty" table and inserting into a "clean" table on a schedule.
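The "dirty table to clean table" pattern mentioned here is essentially one `INSERT INTO … SELECT` statement. A toy sketch, with SQLite standing in for Presto and made-up table names and validity rules:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_events (user_id TEXT, amount TEXT);
    CREATE TABLE clean_events (user_id INTEGER, amount REAL);
    INSERT INTO raw_events VALUES ('1', '9.99'), ('', 'oops'), ('2', '3.50');
""")

# The whole scheduled "ETL" hop is a single statement: filter out malformed
# rows and cast the survivors on their way into the clean table.
conn.execute("""
    INSERT INTO clean_events (user_id, amount)
    SELECT CAST(user_id AS INTEGER), CAST(amount AS REAL)
    FROM raw_events
    WHERE user_id GLOB '[0-9]*' AND amount GLOB '[0-9]*.[0-9]*'
""")

clean = conn.execute("SELECT * FROM clean_events ORDER BY user_id").fetchall()
```

On a real cluster the same query runs distributed, which is the appeal: the "pipeline" is just SQL plus a scheduler.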
Start with Postgres. Don't start with SQLite. SQLite is a file format, not a database; it scales atrociously (I've seen simple queries run for tens of seconds with 100MB of data), it basically doesn't work in concurrent update scenarios, and the more data you put into it, the more pain you'll have migrating to Postgres.<p>Use SQLite if you want a file format where you don't want to rewrite the whole file for every update, or if you're in a situation or environment where it's not feasible to have an actual database server.
One point got me curious:<p>>As a rule of thumb, vertical scalability is in most cases more economical than the horizontal one.<p>Isn't <i>the opposite</i> the reason why we started to scale horizontally in the first place?
Querying data in Cypher is so much easier. I have 20 years as a database developer, and Cypher is way better. I can live-code 50 complex queries in Cypher before I get one done in SQL.
It’s best not to take criticism seriously from people who have no formal CS education and can’t even define the premises of relational algebra. Ah well.
Alright, here's a relevant question I've been having along these lines. Let's say I have the code to gather / scrape / load some stats into Postgres, but then want to run projections on them.<p>For example, if I'm trying to predict the next day's stats from past stats by simply taking the average, how many days back should I look? Is the last 3-day average best? 5 days? 8? 10?<p>There are clearly a couple of ways to do it. One is getting a data frame of all the stats for all the objects, writing the Python logic to loop through the days, fetching the stats from the past X days (not including the current day), taking the average, and then storing that back to Postgres in a projections table: a function like set_projections(5), where 5 is the number of days in the past I'm averaging over.<p>The second way is to write that function as a PL/pgSQL function which uses subqueries to find the past X days of stats for the players and then creates or updates the projections table all in SQL, so we can run `select set_projections(5)` and it does it all itself.<p>So the question becomes: which one is "best"? I have to imagine it's mostly a case-by-case basis. Since it's only me here, I've been doing it in Postgres alone, since it can be done in one query (with multiple subqueries, yes). With Python, it'd involve many more steps. On the other hand, the SQL looks huge, and I've been running into the question of whether to split some of the subqueries into sub-functions, since that's what I'd do in Python. If more people were involved, would it be bad to have larger logic like that in Postgres, since we wouldn't know the skill of the others? Mostly they'd be coders and could write the projections in whatever languages they wanted.<p>Another example of this tradeoff: how should I interact with the database?
I have a Rails background, and ActiveRecord is an incredibly good ORM, whereas SQLAlchemy hasn't done it for me. In either case, there's a ton of overhead in getting the connections running with the correct class variables. So instead, I kind of created my own ORM / interface where I can quickly write queries on my own and use them in the code. This is especially easy since most of the queries can be written as Postgres functions, so the query strings in the code are incredibly tiny!<p>What I've learned from this project is that SQL is very, very powerful in what it does. I've shifted to using it for more actions than I would have in the past, and pretty much all the thinking I do is about writing as little code as possible.<p>Anyone make it through reading this and have comments about what I'm doing and what they like to do?
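For the projections question above: the pure-SQL route can be a single window-function query rather than a loop over days. A sketch with a hypothetical daily_stats table, run here against Python's bundled SQLite (window functions need SQLite 3.25+; the same query shape works in Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_stats (player_id INTEGER, day INTEGER, points REAL);
    INSERT INTO daily_stats VALUES
        (1, 1, 10), (1, 2, 20), (1, 3, 30), (1, 4, 40);
""")

# Each day's projection = average of the previous 3 days for that player,
# current day excluded, expressed as a window frame instead of Python loops.
rows = conn.execute("""
    SELECT player_id, day,
           AVG(points) OVER (
               PARTITION BY player_id ORDER BY day
               ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING
           ) AS projection
    FROM daily_stats
    ORDER BY day
""").fetchall()
```

Day 1 has an empty frame (projection NULL), day 4 averages days 1-3. Parameterizing the `3` is what a hypothetical `set_projections(X)` function would do, inserting this result into the projections table.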
I might be under-informed but "data science" seems to involve a lot of vague BS. Computing has always centered on data.<p>"Data practitioners", "data practice", "data management" just seems like weird rebranding of stuff that businesses have been doing since the 1960s. Partly what the article seems to be relating regarding SQL.<p>What is "data science" besides computing, data storage of some sort, and analyzing the data? Because it's "BIG" now? Because now we "extract knowledge and insights" from data, since apparently giant databases were amassed for no reason in the past? Because now we "combine multidisciplinary fields like statistics", since apparently nobody had thought to run statistics on their data before? Because "AI"?
The damage SQL has done to the general public's perception of the relational model cannot be undone. The relational model is beautiful; it's in no way more complex than objects and attributes, but SQL makes it seem so by conflating orthogonal aspects into it.
In the Hadoop world, things have evolved to support SQL. Spark, Hive, and Impala all have full support for SQL. The Spark implementation is actually faster than doing low-level RDD processing yourself, as there are optimizers that work very well. In addition, you can create UDFs that are easy to integrate.<p>The only reason you might choose other approaches is to make the problem look significantly more complicated, thereby justifying more maintenance and resources.<p>SQL is just too easy, and some super smart engineers don't like it because of that. That said, NoSQL does have value in some corner cases where it can perform better, when most of the logic is simple lookups.
“Why didn’t we use an RDBMS in the first place?”<p>Because initial application specifications are sparse and definitely wrong. If your application is still up and running 5 years later and your data definition hasn't changed much in the past 3, then maybe refactor around an RDBMS. Designing around rigid structures during your first pass is costly. This is why there's been a rise in NoSQL and dynamic languages.