<i>Avoid functions in WHERE clauses</i><p>Avoid them on the column-side of expressions. This is called sargability [1], and refers to the ability of the query engine to limit the search to a specific index entry or data range.
For example, WHERE SUBSTRING(field, 1, 1) = 'A' will cause a full table scan even if <i>field</i> is indexed, with the SUBSTRING function evaluated for every row, while WHERE field LIKE 'A%' can use a partial index scan, provided an index on the <i>field</i> column exists.<p><i>Prefer = to LIKE</i><p>Which is why this advice is wrong: as long as your LIKE expression doesn't start with a wildcard, LIKE can use an index just fine.<p><i>Filter with WHERE before HAVING</i><p>This usually isn't an issue, because the search terms you would use under HAVING can't be used in the WHERE clause. But yes, the other way around is possible, so the rule of thumb is: if a condition can be evaluated in the WHERE clause, it should be.<p><i>WITH</i><p>Be aware that not all database engines perform predicate propagation across CTE boundaries. That is, a query like this:<p><pre><code> WITH allRows AS (
  SELECT id,
         difficult_calculation(col) AS result
  FROM table)
SELECT result
FROM allRows
WHERE id = 15;
</code></pre>
might cause the database engine to perform difficult_calculation() on all rows, not just row 15. Most big database engines support predicate pushdown across CTEs nowadays, but it's not a given.<p>[1] <a href="https://en.wikipedia.org/wiki/Sargable" rel="nofollow">https://en.wikipedia.org/wiki/Sargable</a>
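The sargability point is easy to see in a query plan. Here's a minimal sketch using Python's bundled sqlite3 module (the table, column, and index names are made up for illustration; note that SQLite only applies its LIKE prefix optimization when LIKE is case-sensitive, hence the pragma):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (field TEXT)")
conn.execute("CREATE INDEX idx_field ON books (field)")
conn.executemany("INSERT INTO books VALUES (?)",
                 [("Apple",), ("Avocado",), ("Banana",)])
# SQLite only considers the LIKE-prefix optimization when LIKE is case-sensitive
conn.execute("PRAGMA case_sensitive_like = ON")

def plan(sql):
    """Return the EXPLAIN QUERY PLAN detail strings for a query."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function on the column side: the index can't be used, so a full scan
print(plan("SELECT * FROM books WHERE substr(field, 1, 1) = 'A'"))
# Sargable prefix match: the planner rewrites it into an index range search
print(plan("SELECT * FROM books WHERE field LIKE 'A%'"))
```

The same experiment with EXPLAIN works on any engine; the SCAN vs SEARCH distinction in the plan output is what sargability buys you.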
This is an aside, but a colleague years back showed me his preferred method of formatting SQL statements, and I've always found it to be the best in terms of readability; I just wish there were more automated tool support for this format. The idea is to right-align the keywords so the first value from each clause lines up. Visually it makes it extremely easy to "chunk" the statement by clause, e.g.:<p><pre><code>   SELECT a.foo, b.bar, g.zed
     FROM alpha a
     JOIN beta b ON a.id = b.alpha_id
LEFT JOIN gamma g ON b.id = g.beta_id
    WHERE a.val > 1
      AND b.col < 2
 ORDER BY a.foo</code></pre>
> Although it’s possible to join using a WHERE clause (an implicit join), prefer an explicit JOIN instead, as the ON keyword can take advantage of the database’s index.<p>This implies that a WHERE-style join can't use indices.<p>I can understand why some would prefer either syntax for readability/style reasons. But the idea that one uses indices and the other doesn't seems highly dubious.<p>Looking at the postgres manual [1], the WHERE syntax is clearly presented as the main way of inner joining tables. The JOIN syntax is described as an "alternative syntax":<p>> This [INNER JOIN] syntax is not as commonly used as the one above, but we show it here to help you understand the following topics.<p>Maybe some database somewhere cannot optimise queries properly unless JOIN is used? Or is this just FUD?<p>[1] <a href="https://www.postgresql.org/docs/13/tutorial-join.html" rel="nofollow">https://www.postgresql.org/docs/13/tutorial-join.html</a>
I'd add "be aware of window functions"[1]. Certain gnarly aggregates and joins can often be much better expressed using window functions.<p>And at least for the database we use at work, if the sole reason for a join is to reduce the data, prefer EXISTS.<p>[1]: <a href="https://www.sqltutorial.org/sql-window-functions/" rel="nofollow">https://www.sqltutorial.org/sql-window-functions/</a>
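For instance, picking the largest order per customer needs no self-join with ROW_NUMBER(), and EXISTS filters rows without multiplying them. A sketch using Python's bundled sqlite3 (the orders/customers tables are invented for illustration; SQLite has window functions since 3.25):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INT);
    INSERT INTO orders VALUES ('a', 10), ('a', 30), ('b', 20);
    CREATE TABLE customers (name TEXT);
    INSERT INTO customers VALUES ('a'), ('b'), ('c');
""")

# Window function: rank each customer's orders, keep only the top one
rows = conn.execute("""
    SELECT customer, amount
    FROM (SELECT customer, amount,
                 ROW_NUMBER() OVER (PARTITION BY customer
                                    ORDER BY amount DESC) AS rn
          FROM orders)
    WHERE rn = 1
    ORDER BY customer
""").fetchall()
print(rows)  # [('a', 30), ('b', 20)]

# EXISTS as a pure filter: customers with at least one order,
# without joining (and possibly duplicating) the order rows
names = [r[0] for r in conn.execute("""
    SELECT name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer = c.name)
    ORDER BY name
""")]
print(names)  # ['a', 'b']
```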
<i>prefer an explicit JOIN</i><p>Yes, absolutely, and not just for performance benefits. It's much easier to track what, how, and why you're joining to something when it's not jumbled together in a list of a dozen conditions in the WHERE clause.<p>I can't tell you how much bad data I've had to fix because, when I break the implicit conditions apart into explicit joins, the query is absolutely not doing what the original author intended, and it would have been obvious with an explicit join.<p>And then in the explicit join, always be explicit about the join type: don't just write JOIN when you want an INNER JOIN. Otherwise I have to wonder if the author accidentally left something off.
Does anyone have any good resources for practicing SQL queries? I recently had an interview where I did well on a project and the programming portions, but fumbled on the SQL queries that went beyond basic joins. I didn't realize how much I need to learn and practice. I don't know yet whether my lack of knowledge was enough to cost me the position, but I'd like to prepare for the future either way.<p>I've seen a few websites, but I don't know which ones to use. Or maybe there is a dataset with practice questions I could download?<p>Edit: I found <a href="https://pgexercises.com" rel="nofollow">https://pgexercises.com</a> and it's been fantastic so far. Much more responsive than other sites, clear questions, and free.
This seems like reasonable discussion, but you would get far more traction if you have an opportunity to write an entire schema from scratch in the proper way.<p>Not having to fight assumptions along the lines of improperly denormalized columns (i.e. which table is the source of truth for a specific fact) can auto-magically simplify a lot of really horrible joins and other SQL hack-arounds that otherwise wouldn't be necessary. The essential vs accidental complexity battle begins right here with domain modeling.<p>You should be seeking something around 3rd normal form when developing a SQL schema for any arbitrary problem domain. Worry about performance after it's actually slow. A business expert who understands basic SQL should be able to look at your schema and understand what every fact and relation table is for. They might even be able to help confirm the correctness of business logic throughout, or even author some of it themselves. SQL can be an extremely powerful contract between the technology wizards and the business people.<p>More along the lines of the original topic: I would strongly advocate for views in cases where repetitive, complex queries are being made throughout the application. These serve as single points of reference for a particular projection of facts and can dramatically simplify downstream queries.
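A minimal sketch of the view idea, using Python's bundled sqlite3 (the schema is invented for illustration): the join is written once, and downstream queries hit the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (title TEXT, author_id INT);
    CREATE TABLE authors (id INT, last_name TEXT);
    INSERT INTO authors VALUES (1, 'Austen');
    INSERT INTO books VALUES ('Emma', 1);

    -- Single point of reference for a repetitive join:
    CREATE VIEW books_with_authors AS
    SELECT b.title, a.last_name
    FROM books AS b
    LEFT JOIN authors AS a ON b.author_id = a.id;
""")

# Downstream code queries the view, not the join
rows = conn.execute("SELECT * FROM books_with_authors").fetchall()
print(rows)  # [('Emma', 'Austen')]
```

If the join logic ever changes (say, a new source-of-truth table), only the view definition needs updating.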
Lots of mistakes (or at least rare opinions going against the crowd) here.<p>Here's a better general performance tuning handbook - <a href="https://use-the-index-luke.com/" rel="nofollow">https://use-the-index-luke.com/</a>
Sorry to rain on your parade, but there is nothing in that article that is not included in basic SQL manuals like Itzik Ben-Gan's.<p>Also, a few things are dead wrong: "make the haystack small" is optimization (it should be at the end, as the first rule says), "prefer UNION ALL to UNION" is missing the context (a good dev knows what is needed, not what to prefer), and the usage of CTEs is nice but sometimes slower than other options, and in SQL slower can easily mean orders of magnitude, so nice is not enough. Same for "avoid sorting where possible, especially in subqueries" or "use composite indexes" (really? it's a basic thing, not a best practice).<p>In the past few months I interviewed and hired several DBAs; this list is ok-ish for a junior but a fail for a senior. I am not working for FAANG, so the bar is pretty low; this article would not even pass for a junior there.
CTE advice is somewhat questionable, as it is database specific.<p>CTEs were for a very long time an optimization fence in PostgreSQL: they were not inlined and behaved more like temporary materialized views.<p>Only with the release of PostgreSQL 12 did some CTE inlining start happening, with limitations: the CTE must not be recursive, must have no side effects, and must be referenced only once in a later part of the query.<p>More info: <a href="https://hakibenita.com/be-careful-with-cte-in-postgre-sql" rel="nofollow">https://hakibenita.com/be-careful-with-cte-in-postgre-sql</a>
NB: this post is mostly performance advice, and it only applies to traditional databases. Specifically, it is not good advice for big data columnar DBs, for instance a limit clause doesn't help you at all on BigQuery and grabbing fewer columns <i>really</i> does.
Is there a good place to read, from an advanced casual "lay user's" perspective, about what SQL query optimizers do in the background after you submit a query?<p>I would love to know, so that I can tell which optimizations and WHERE / JOIN conditions I should really be careful to make efficient, versus which ones I don't have to worry about because the optimizer will take care of them.<p>For example, if I'm joining 2 long tables together, should I be very careful to create 2 subtables with restrictive WHERE conditions first, so that it doesn't try to join the whole thing, or does the optimizer take care of that if I lump the query into one entire join and only apply WHERE afterwards? How do you tell which columns are indexed and inexpensive to query frequently, and which are not? Is it better to avoid joining on floating-point BETWEEN conditions?<p>And other questions like this.
"Best practices for writing SQL queries in metabase" should be the title here.<p>10 or so years ago when SQL Server, Oracle & MySQL dominated the industry, you could talk about SQL optimization with the expectation that all advice was good advice. There are too many flavors of databases to do that today.
Overall an enjoyable read, but as someone who includes SQL queries in code, I disagree with two points:<p>I despise table aliases and usually remove them from queries. To me, they add a level of abstraction that obscures the purpose of the query. They're usually meaningless strings generated automatically by the tools used by data analysts who rarely inspect the underlying SQL for readability. I fully agree that you should reference columns explicitly with the table name, which I think is the real point they're trying to make in the article.<p>While it's true that sorting is expensive, the downstream benefits can be huge. The ability to easily diff sorted result sets helps with troubleshooting and can also save significant storage space whenever the results are archived.
> Avoid<p><pre><code> SELECT
  title,
  last_name,
  first_name
FROM books
LEFT JOIN authors
  ON books.author_id = authors.id
</code></pre>
> Prefer<p><pre><code> SELECT
  b.title,
  a.last_name,
  a.first_name
FROM books AS b
LEFT JOIN authors AS a
  ON b.author_id = a.id
</code></pre>
Couldn't disagree more. One letter abbreviations hurt readability IMO.
>LIKE compares characters, and can be paired with wildcard operators like %, whereas the = operator compares strings and numbers for exact matches. The = can take advantage of indexed columns.<p>Unless this is specific to certain databases, LIKE can take advantage of indexes too; without wildcards, LIKE should be nearly identical in performance to =, with both seeking the index.<p>>Using wildcards for searching can be expensive. Prefer adding wildcards to the end of strings. Prefixing a string with a wildcard can lead to a full table scan.<p>Which is contradictory to the first quote: it recognizes that a wildcard at the end can take advantage of an index. A full table scan is the same thing as not taking advantage of an index, hence LIKE can take advantage of normal indexes so long as there are characters before the first wildcard, or no wildcards at all.
Metabase is an amazing product, but I'm using Superset[0] in my company because it supports Azure AD SSO, which became a necessity for us. But as soon as this feature appears in Metabase, we are switching.<p>[0] <a href="https://superset.apache.org/" rel="nofollow">https://superset.apache.org/</a>
> Although it’s possible to join using a WHERE clause (an implicit join), prefer an explicit JOIN instead, as the ON keyword can take advantage of the database’s index.<p>Don’t most databases figure this out as part of the query planner anyway? Postgres has no problems using indexes for joins inside WHERE.
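At least in SQLite this is easy to check: the planner produces the same plan either way. A sketch with Python's bundled sqlite3 and a toy schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (author_id INT);
    CREATE TABLE authors (id INT PRIMARY KEY);
""")

def plan(sql):
    """Collect the EXPLAIN QUERY PLAN detail strings for a query."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

implicit = plan("SELECT * FROM books, authors WHERE books.author_id = authors.id")
explicit = plan("SELECT * FROM books JOIN authors ON books.author_id = authors.id")
print(implicit == explicit)  # True: identical plans, index used in both
```

Other engines' EXPLAIN output can settle the same question there.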
This article nudged me to google "SQL optimization tool". I found one that says: "Predict performance bottlenecks and optimize SQL queries, using AI". Basically, it gives you suggestions on how to improve your queries.<p>I wonder what the results would be if I ran the queries from this article through that tool.
Personal habit is to start my WHERE clause with a TRUE or a FALSE so that adding or removing clauses becomes seamless:<p><pre><code> SELECT foo
FROM bar
WHERE TRUE
AND baz > boom
</code></pre>
For OR conditions it's a bit different:<p><pre><code> SELECT foo
FROM bar
WHERE FALSE
OR baz > boom</code></pre>
Some of these are inaccurate.<p>"a = 'foo'" is exactly the same performance as "a like 'foo'", and very close to the performance of "a like 'foo%'", and is fully indexed. When you put a wildcard at the front, the entire index is avoided, so you gotta switch to full-text search.