A few cool tricks I use with window functions:<p>1- To find blocks of contiguous values, you can use something similar to Gauss' trick for calculating arithmetic progressions: sort them by descending order and add each value to the row number. All contiguous values will add to the same number. You can then apply max/min and get rows that correspond to the blocks of values.<p><pre><code> select min(n), max(n) from (
select n, n+row_number() over (order by n desc) group
from numbers
)
group by group
order by 1
</code></pre>
2- You can use a window function with exponential/logarithms in order to calculate the accumulated inflation for the last n months:<p><pre><code> select date, inflation, (exp(accumulated)-1)*100 from (
select date, inflation, sum(ln(1+(inflation/100))) over (order by date desc rows between current row and 11 following) as accumulated
from inflation
)
</code></pre>
3- You can do all the paging in SQL (fetch page n of m) or simply add a column with the total number of rows (this often makes it easier to process the results).
Introduction to Window Functions with the same examples (salaries per department)<p><a href="https://www.postgresql.org/docs/13/tutorial-window.html" rel="nofollow">https://www.postgresql.org/docs/13/tutorial-window.html</a><p>While you are at it, check out common table expressions as well<p><a href="https://www.postgresql.org/docs/13/queries-with.html" rel="nofollow">https://www.postgresql.org/docs/13/queries-with.html</a>
I've been working on a messaging app, and at the beginning I started out with Elasticsearch. Now, ES is really a wonderful tool and it got me pretty far, but its lack of window functions made one particular requirement impossible that we all take for granted in messengers: a listing of most-recently updated threads represented by the most recent message on each thread, sorted reverse chronologically. Here is the SQL that makes that happen:<p><pre><code> SELECT
tm.*
FROM
(
SELECT
row_number() OVER (
PARTITION BY tm_1.thread_id
ORDER BY tm_1.delivered_at DESC
) AS "row",
tm_1.*
FROM
thread_message tm_1
) tm
WHERE
tm."row" = 1;
</code></pre>
The inner query groups all messages by thread, orders them to find the "most recent" message on a thread given my ordering requirements, and then assigns a row number to each such message such that I can pick the most recent message on each thread in the outside query. I still have no idea how I would have done this with Elasticsearch.
I find window functions to be an excellent way to find the max version of a set of things. The trick is to partition by some columns (similar to how you would use a group by), order by descending on your version number field, and use the row_number() function which is very lightweight. Then you filter for all entries where rownumber = 1 and voila you have the max version without having to link back on yourself!
I made gifs to explain the different types of window functions: <a href="https://dataschool.com/how-to-teach-people-sql/how-window-functions-work/" rel="nofollow">https://dataschool.com/how-to-teach-people-sql/how-window-fu...</a>
What i like most about window functions is that they give me a way to do a sort of 'extended group by' which i have always wanted.<p>If you want to know the highest salary in each department, that's easy:<p><pre><code> select department, max(gross_salary)
from salary
group by department
</code></pre>
If you want to know who it is who earns that salary, you might try to do this:<p><pre><code> select department, first_name, max(gross_salary)
from salary
group by department
</code></pre>
But this doesn't work, because it's meaningless to ask for first_name in a situation where you're grouping by department. You could ask for an aggregation of all names, but there's no straightforward way to ask for the name of the person who earned that salary. You end up having to write a join against the group by, as in the article, which is pretty grim, and falls apart if you want to order by multiple columns to break ties.<p>Window functions let you re-frame this kind of group by like this:<p><pre><code> select department, gross_salary
from (
select *, row_number() over (partition by department order by gross_salary desc) as n
from salary
) _
where n = 1
</code></pre>
Because the outer query is no longer a group by, you can select any columns you like. The natural query works fine:<p><pre><code> select department, first_name, gross_salary
from (
select *, row_number() over (partition by department order by gross_salary desc) as n
from salary
) _
where n = 1
</code></pre>
This only works where the group by is based on an aggregate function that picks one value, like min or max. I somewhat think it was a mistake to model that kind of thing as an aggregation in the first place. If SQL had a way of picking one row from a group, rather than aggregating over it, that would be immensely useful.
I like how the article teaches by explicit example. Very helpful when it comes to data.<p>The "lag" and "lead" SQL functions are particularly useful when analysing sequences of timestamped events (GPS fixes, task completions etc). They allow you to easily return the delta of a value (time/distance) between the current and previous row, which is really useful if you want to compare expected vs actual on those deltas.
Question for the pros: In doing some data engineering work, I found that creating temporary tables and dropping them after the run was much more performant and memory-efficient than using CTEs. No other change was made to the queries in the CTE, just putting them in a separate CREATE TABLE AS... script before the part that needed the calculations.<p>Why is this the case? Shouldn't CTEs be more efficient?
Analytic functions as they are known by in Oracle, SQL have been around for more than a decade. They are useful for calculating running balances, getting a previous or next row value, doing a "group by" for a subset of columns.. etc. They're the best thing in SQL "since sliced bread".<p><a href="https://towardsdatascience.com/analytical-functions-in-oracle-4ac48e022370" rel="nofollow">https://towardsdatascience.com/analytical-functions-in-oracl...</a>
Wow I love me a window function!<p>Favorite/weirdest thing I learned to do most recently with window functions is using/abusing min() and max(). If there is no value in a column in a particular window, min and max will give NULL.<p>We use this to do really interesting stuff (entirely in sql!) like "partition usage into sessions where a session is defined as a bunch of activity by a user where they did an action at least once every 30 minutes"
SQL is only secondary in my work but I'd rather go with the explicit and verbose queries rather than with stuff with surprises like how the last_value gives wrong results when used the way you'd expect i.e. the reverse of first_value.
I've always wondered and perhaps someone here might know....do postgres' CTEs translate into big select/sub select queries under the hood? Or are they something special entirely? Ie (forgive formatting as I'm on phone) does:<p>With mything as (
Select * from table where...
),<p>Myotherthing as (
Select * from mything where...
)<p>Get translated to<p>Select * from (select * from ( select * from...)...)...)<p>So I'm just wondering are CTEs just easier to read, or do they offer any other known optimizations? We use them loads but mainly just for keeping larger queries easier to manage
I've been encountering a lot of issues, when doing dedicated queries, specially when a team is small, I've found is sometimes more useful to just do the grouping or left join (if more than 2 are required) manually.<p>The problem is ofc scalability, if your team is small it is better to just have some few, but very specific queries, and do whatever transactions required on a layer above.<p>It may be the difference between correcting a few lines, vs reworking 20+ different queries, and testing each one individually.