Awesome guide! If you want to see a very minimal walkthrough of actually implementing this in code, the third post of my "write a SQL database in Go" series does just this [0].<p>Once you have the index set up and are able to use it the next challenging part is pattern matching (plus algebraic rewrites) to figure out if the index is actually applicable.<p>For example, in my linked blog post, an index is only used if the WHERE expression has only AND-ed operations and one of those expressions is a filter using <, >, =, <> (and a few other operators) and one of the columns is indexed. This is simpler than what a real database might attempt.<p>Another interesting part of seeing it in code (and messing with the code) is observing cases like where even though you're using an index it doesn't speed anything up if you filter still matches every item in the index.<p>Also if you just want to learn more about indexes as a user, I highly recommend the site "Use the index, Luke" that has many useful guides/explanations.<p>[0] <a href="https://notes.eatonphil.com/database-basics-indexes.html" rel="nofollow">https://notes.eatonphil.com/database-basics-indexes.html</a><p>[1] <a href="https://use-the-index-luke.com/" rel="nofollow">https://use-the-index-luke.com/</a>
The article claims to talk about database indexing but as soon as it goes into any sort of details the information only applies to mysql.<p>> So here, indexing comes into picture. What indexing does is, sets up the column your search conditions are on, in a sorted order to assist in optimising query performance.<p>That's mysql specific, and not just that but it's <i>innodb</i> specific: innodb uses the primary key as a clustering index, or the first UNIQUE index if there's no PK.<p>Other engines don't usually do that, some (e.g. postgres) don't even have clustered indexes (aka indexes which affect the table's own layout), in postgres clustering is an explicit discrete table-rewriting operation.<p>And of course a table can have multiple indices, so it's not possible for an index to always determine the table order. In databases which support clustered indices, only one index per table can be clustered.<p>> The major advantage of B-tree is that the data in it is sortable<p>That's… not really true. A b-tree is sorted, period. You can't have an unordered btree.<p>> What are other types of indexes?<p>There are tons of index types not mentioned by the article e.g. BRIN, GIN, GiST<p>> For example, suppose you want to find out all of the companies which has company_id less than 40. How could you do that with a hash table index? Well, it’s not possible because a hash table is only good for looking up key value pairs – which means queries that check for equality (like “WHERE company_id = 18”).<p>And because btrees naturally handle "range" queries, they also handle <i>prefix</i> queries well. After all, "foo*" is just looking for every value between "foo" included and "fop" excluded.
An important detail which this guide leaves out: The "actual location of the row" is usually the leaf node of another B-Tree. That is the primary B-Tree and it's indexed by the primary key.<p>The major consequence is every non-primary index query involves dereferencing N pointers, which could mean loading N pages from disk/SSD to get leaf nodes spread out across the primary tree. Whereas if you query a range in the primary B-Tree directly, the rows are consecutive so you would only load N/M consecutive pages based on the ratio of rowsize to pagesize.<p>That's why some people use composite keys for primary key, to get better data locality in the primary B-Tree index.<p>See "How We've Scaled Dropbox"[1] to hear the same thing.<p>At 48:36 they explain why they changed from PRIMARY KEY (id) to PRIMARY KEY (ns_id, latest, id). ns_id is kinda like user_id. So that change groups journal entries for users together on disk<p>Specifically. PRIMARY KEY (id) orders things by creation date whereas PRIMARY KEY (ns_id, latest, id) orders things by ns_id primarily.<p>[1]: <a href="https://youtu.be/PE4gwstWhmc?t=2914" rel="nofollow">https://youtu.be/PE4gwstWhmc?t=2914</a>
I don't understand why
```
SELECT total(amount)
FROM orders
WHERE createdAt BETWEEN '2020-01-01 00:00:00' AND '2020-12-31 23:59:59';
```<p>would do a full table scan. Wouldn't the engine be able to use the index to find just the correct rows and then only do the total on those?
Great video from the SQLite founder that covers a lot of ground in terms of how indices work under the hood: <a href="https://m.youtube.com/watch?v=gpxnbly9bz4" rel="nofollow">https://m.youtube.com/watch?v=gpxnbly9bz4</a>