> “Here we typically expect that the combined dataset will have the same number of rows as our original left side dataset.”<p>For left join this isn’t entirely true. If there are more matching cases on the right joined table then you’ll get additional rows for each match. That is unless you take steps to ensure only at most one row is matched per row on the left (eg using something like DISTINCT ON in Postgres)
If I was going to add anything -<p>"Vertical join" is normally called union.<p>"Right join" is just (bad) syntax sugar and should be avoided. Left join with the tables reversed is the usual convention.<p>The join condition for inner is optional- if it's always true then you get a "cross join". Can be useful to show all the possible combinations of two fields.
Excellent learning material, thanks for sharing. I've noticed an interesting trend: JOINS are crucial for data analytics, yet many new open-source data analytics products, or open-source OLAP products, offer limited support for JOINS. Examples include ClickHouse, Apache Druid, and Apache Pinot. It seems that currently, only Trino and StarRocks provide robust support for JOINS. Commercial products tend to have better support in this area. I hope the various open-source projects will also enhance their JOIN capabilities.
The article omits the "natural join". A natural join automatically joins tables on columns having the same name, and the columns in the resulting output appear exactly once. Natural joins often simplify queries, and they automatically adapt to schema changes in source tables. Natural joins also encourage standardized naming conventions, which has longer-term benefits.
Good material.<p>Joins can be also categorized by the used join algorithm.<p>The simplest join is a nested loop join. In this case a larger dataset is iterated and another small dataset is combined using a simple key lookup.<p>Then there is a merge join. In this case two larger datasets are first sorted using merge sort and then aligned linearly.<p>Then there is a hash join. For the hash join a hash table is generated first based on the smaller dataset and the larger dataset is iterated and the join is made by making a hash lookup to the generated hash table.<p>The difference between nested loop join and hash join might be confusing.<p>In case of a nested loop join the second table is not loaded from the storage medium first, instead an index is used to lookup the location of the records. This has O(log n) complexity for each lookup. For hash join the table is loaded and hash table is generated. In this case each lookup has O(1) complexity but creation of hash table is expensive (it has O(n) complexity) and is only worth the cost when the dataset is relatively large.
There are lots of other joins not mentioned there.<p>A popular one with time series is the asof join.<p>There are also literal joins, which are generalizations of adjacent difference.
when I saw the term "horizontal joins", I immediately went to, what? what's a "vertical join?" must be a UNION, scrolled down and sure enough.<p>Is there some reason to use non-standard terminology for posts that are trying to be in-depth, authoritative tutorials ?
Dplyr actually supports some really cool join functionalities that I wish were in SQL implementations, including:<p>- Ability to specify whether your join should be one-to-one, many-to-one, etc. So that R will throw an error instead of quietly returning 100x as many rows as expected (which I've seen a lot in SQL pipelines).<p>- A direct anti_join function. Much cleaner than using LEFT JOIN... WHERE b IS NULL to replicate an anti join.<p>- Support for rolling joins. E.g. for each user, get the price of their last transaction. Super common but can be a pain in SQL since it requires nested subqueries or CTEs.
Its funny that the older i get 40 the more it seems like my surrounding doesn't even understand the basics of sql. In my 20s everyone knew sql. It was required for almost every job. And it is still very useful knowledge.
Difference between left, right and inner joins is ultra basic. If you dont know that i would be very worried.
<i>> Let’s try this again using R.</i><p>This should actually be "Let's try this again using dplyr, one of the most elegant pieces of software ever written".<p>Hadley Wickham is a treasure to humanity.
asof join:<p><a href="https://code.kx.com/q/ref/asof/" rel="nofollow">https://code.kx.com/q/ref/asof/</a><p><a href="https://code.kx.com/q/learn/brief-introduction/#time-joins" rel="nofollow">https://code.kx.com/q/learn/brief-introduction/#time-joins</a><p><a href="https://clickhouse.com/docs/en/sql-reference/statements/select/join#asof-join-usage" rel="nofollow">https://clickhouse.com/docs/en/sql-reference/statements/sele...</a><p><a href="https://duckdb.org/docs/guides/sql_features/asof_join.html" rel="nofollow">https://duckdb.org/docs/guides/sql_features/asof_join.html</a><p>> Do you have time series data that you want to join, but the timestamps don’t quite match? Or do you want to look up a value that changes over time using the times in another table?<p>DuckDB blog post on temporal joins:<p><a href="https://duckdb.org/2023/09/15/asof-joins-fuzzy-temporal-lookups" rel="nofollow">https://duckdb.org/2023/09/15/asof-joins-fuzzy-temporal-look...</a><p>note: this idea is very useful for timestamps but can also be used for other types of values
From the opening of the "Vertical Joins" section:<p>> <i>Similar to horizontal joins, there are many use cases for joining data horizontally, also called appending data.</i><p>Shouldn't this read "joining data vertically"? This seems like a typo.