Let's talk about joins

169 pointsby r4umover 1 year ago

14 comments

jarymover 1 year ago

> “Here we typically expect that the combined dataset will have the same number of rows as our original left side dataset.”For left join this isn’t entirely true. If there are more matching cases on the right joined table then you’ll get additional rows for each match. That is unless you take steps to ensure only at most one row is matched per row on the left (eg using something like DISTINCT ON in Postgres)

评论 #39072505 未加载

评论 #39071745 未加载

评论 #39074608 未加载

评论 #39075110 未加载

andy81over 1 year ago

If I was going to add anything -"Vertical join" is normally called union."Right join" is just (bad) syntax sugar and should be avoided. Left join with the tables reversed is the usual convention.The join condition for inner is optional- if it's always true then you get a "cross join". Can be useful to show all the possible combinations of two fields.

评论 #39072410 未加载

评论 #39074865 未加载

评论 #39071839 未加载

HermitXover 1 year ago

Excellent learning material, thanks for sharing. I've noticed an interesting trend: JOINS are crucial for data analytics, yet many new open-source data analytics products, or open-source OLAP products, offer limited support for JOINS. Examples include ClickHouse, Apache Druid, and Apache Pinot. It seems that currently, only Trino and StarRocks provide robust support for JOINS. Commercial products tend to have better support in this area. I hope the various open-source projects will also enhance their JOIN capabilities.

评论 #39070969 未加载

评论 #39070848 未加载

评论 #39070796 未加载

评论 #39072948 未加载

评论 #39076560 未加载

jethklover 1 year ago

The article omits the "natural join". A natural join automatically joins tables on columns having the same name, and the columns in the resulting output appear exactly once. Natural joins often simplify queries, and they automatically adapt to schema changes in source tables. Natural joins also encourage standardized naming conventions, which has longer-term benefits.

评论 #39070385 未加载

评论 #39069410 未加载

评论 #39070580 未加载

评论 #39069699 未加载

评论 #39082212 未加载

genmanover 1 year ago

Good material.Joins can be also categorized by the used join algorithm.The simplest join is a nested loop join. In this case a larger dataset is iterated and another small dataset is combined using a simple key lookup.Then there is a merge join. In this case two larger datasets are first sorted using merge sort and then aligned linearly.Then there is a hash join. For the hash join a hash table is generated first based on the smaller dataset and the larger dataset is iterated and the join is made by making a hash lookup to the generated hash table.The difference between nested loop join and hash join might be confusing.In case of a nested loop join the second table is not loaded from the storage medium first, instead an index is used to lookup the location of the records. This has O(log n) complexity for each lookup. For hash join the table is loaded and hash table is generated. In this case each lookup has O(1) complexity but creation of hash table is expensive (it has O(n) complexity) and is only worth the cost when the dataset is relatively large.

评论 #39072028 未加载

mgaunardover 1 year ago

There are lots of other joins not mentioned there.A popular one with time series is the asof join.There are also literal joins, which are generalizations of adjacent difference.

评论 #39072251 未加载

zzzeekover 1 year ago

when I saw the term "horizontal joins", I immediately went to, what? what's a "vertical join?" must be a UNION, scrolled down and sure enough.Is there some reason to use non-standard terminology for posts that are trying to be in-depth, authoritative tutorials ?

nikhilsimhaover 1 year ago

Never heard a union be called a “vertical join” before.

评论 #39072531 未加载

评论 #39082255 未加载

frogamelover 1 year ago

Dplyr actually supports some really cool join functionalities that I wish were in SQL implementations, including:- Ability to specify whether your join should be one-to-one, many-to-one, etc. So that R will throw an error instead of quietly returning 100x as many rows as expected (which I've seen a lot in SQL pipelines).- A direct anti_join function. Much cleaner than using LEFT JOIN... WHERE b IS NULL to replicate an anti join.- Support for rolling joins. E.g. for each user, get the price of their last transaction. Super common but can be a pain in SQL since it requires nested subqueries or CTEs.

评论 #39076643 未加载

holodukeover 1 year ago

Its funny that the older i get 40 the more it seems like my surrounding doesn't even understand the basics of sql. In my 20s everyone knew sql. It was required for almost every job. And it is still very useful knowledge. Difference between left, right and inner joins is ultra basic. If you dont know that i would be very worried.

评论 #39073446 未加载

评论 #39074510 未加载

airstrikeover 1 year ago

> Let’s try this again using R.This should actually be "Let's try this again using dplyr, one of the most elegant pieces of software ever written".Hadley Wickham is a treasure to humanity.

toshover 1 year ago

asof join:<a href="https://code.kx.com/q/ref/asof/" rel="nofollow">https://code.kx.com/q/ref/asof/</a><a href="https://code.kx.com/q/learn/brief-introduction/#time-joins" rel="nofollow">https://code.kx.com/q/learn/brief-introduction/#time-joins</a><a href="https://clickhouse.com/docs/en/sql-reference/statements/select/join#asof-join-usage" rel="nofollow">https://clickhouse.com/docs/en/sql-reference/statements/sele...</a><a href="https://duckdb.org/docs/guides/sql_features/asof_join.html" rel="nofollow">https://duckdb.org/docs/guides/sql_features/asof_join.html</a>> Do you have time series data that you want to join, but the timestamps don’t quite match? Or do you want to look up a value that changes over time using the times in another table?DuckDB blog post on temporal joins:<a href="https://duckdb.org/2023/09/15/asof-joins-fuzzy-temporal-lookups" rel="nofollow">https://duckdb.org/2023/09/15/asof-joins-fuzzy-temporal-look...</a>note: this idea is very useful for timestamps but can also be used for other types of values

评论 #39073149 未加载

adrianmonkover 1 year ago

From the opening of the "Vertical Joins" section:> Similar to horizontal joins, there are many use cases for joining data horizontally, also called appending data.Shouldn't this read "joining data vertically"? This seems like a typo.

评论 #39070616 未加载

评论 #39070224 未加载

lakomenover 1 year ago

Beginners talking about SQL joins. What is this newbienews or hackernews? "But but ChatGPT says"...Disgusting

评论 #39074401 未加载

评论 #39073854 未加载