What goes around comes around and around [pdf]

130 points by craigkerstiens 11 months ago

11 comments

bob1029 11 months ago

This paper is a really good treatment of the space from my perspective.

I think the greatest power in the relational model comes from its ability to directly represent cyclical dependencies without forcing weird workarounds. Many real-world domains have ambiguities regarding which types should be strict dependents of another. This confounds approaches relying on serialization. As mentioned in the paper, many major providers offer extensions to SQL which allow you to iterate through the graph implied by these relations with a single logical command.

> The impact of AI/ML on DBMSs will be significant

I agree with this but not in the way the authors may have intended. I think the impact will be mostly negative. The amount of energy being spent on black-box query generator approaches could be better spent elsewhere. You can get extremely close, but this often doesn't matter.

> Do not ignore the out-of-box experience.

This is why everyone says to start with SQLite now.
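To make the graph-traversal point concrete, here is a minimal sketch using a recursive common table expression, run through Python's built-in `sqlite3` module. The `edge` table, its columns, and the `reachable` helper are hypothetical names made up for illustration; the data deliberately contains a cycle (a -> b -> c -> a) to show that a cyclic relation can be walked with one statement.

```python
import sqlite3


def reachable(conn, start):
    """Return every node reachable from `start`, following edges transitively."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?                      -- seed row: the starting node
            UNION                         -- UNION (not UNION ALL) drops repeats,
                                          -- so the walk stops even on a cycle
            SELECT e.dst
            FROM edge AS e
            JOIN reach AS r ON e.src = r.node
        )
        SELECT node FROM reach ORDER BY node
        """,
        (start,),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical schema: a plain edge list that happens to contain a cycle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT NOT NULL, dst TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO edge VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("x", "y")],
)

print(reachable(conn, "a"))  # → ['a', 'b', 'c', 'd']
```

The choice of `UNION` over `UNION ALL` is what keeps the query from looping forever on the cycle, at the cost of deduplicating every intermediate row.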

paulsutter 11 months ago

Great article, one bit of errata: actually ChatGPT does not expose its internal embedding, so the use of embeddings for RAG is just optional or even coincidental. You can also use ordinary search like Elasticsearch (a point that's somehow often lost).

Besides, the internal embedding for ChatGPT is per-token (~word), whereas the embedding used for RAG search is per-document (a retrieval document might be as small as a paragraph or page, or as large as the whole source document), so these wouldn't be usable for this purpose anyway.

> One compelling feature of vector DBMSs is that they provide better integration with AI tools (e.g., ChatGPT [16], LangChain [36]) than RDBMSs. These systems natively support transforming a record’s data into an embedding upon insertion using these tools and then uses the same transformation to convert a query’s input arguments into an embedding to perform the ANN search; other DBMSs require the application to perform these transformations outside of the database.
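As a rough illustration of the "ordinary search" point, the sketch below does the retrieval step of RAG with plain keyword matching and no embeddings at all. Everything here is an assumption for illustration: the function names, the scoring rule (a simple count of matching terms), and the prompt layout are invented, and a real deployment would more likely lean on a search engine's own ranking than on this toy scorer.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_search(query, documents, k=2):
    """Return up to k documents ranked by how often they contain query terms."""
    query_terms = set(tokenize(query))
    scored = []
    for position, doc in enumerate(documents):
        counts = Counter(tokenize(doc))
        score = sum(counts[term] for term in query_terms)
        if score > 0:  # skip documents that share no term with the query
            scored.append((score, position))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [documents[position] for _, position in scored[:k]]


def build_prompt(question, documents):
    """Paste the retrieved documents into a prompt (hypothetical layout)."""
    context = "\n".join(keyword_search(question, documents))
    return f"Context:\n{context}\n\nQuestion: {question}"


docs = [
    "SQLite is an embedded relational database.",
    "Redis is an in-memory key-value cache.",
    "Postgres is a relational database server.",
]
print(build_prompt("relational database", docs))
```

The model only ever sees the retrieved text, which is why the retrieval step can be swapped between keyword search and vector search without touching the model.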

simonz05 11 months ago

The paper is inspired by a Hacker News comment: https://x.com/andy_pavlo/status/1807799839616614856

SoftTalker 11 months ago

In a technology career that started in the early 1990s, one of the constants has been relational databases and SQL. There is no better general-purpose data storage and query architecture, and it's the first (and usually last) thing I consider for almost any new development project that involves storing and retrieving data.

joatmon-snoo 11 months ago

I don't know how I feel about this paper: on the one hand, I agree with the sentiment that the relational data model is the natural end state if you keep adding features to a data system (and it perfectly captures my sentiment about vector DBs) and it's silly to not use SQL out of the gate.

On the other hand, the paper is kind of dismissive about engineering nuance and gets some details blatantly wrong.

- MapReduce is alive and well, it just has a different name now (for Googlers, that name is Flume). I'm pretty confident that your cloud bill, whether you use GCP, AWS, or Azure, is powered by a couple hundred, if not a couple thousand, jobs like this.

- Pretty sure anyone running in production has a hard serving dependency on Redis or Memcache _somewhere_ in their stack, because even if you're not using it directly, I would bet that one of your cloud service providers uses a distributed, shared-nothing KV cache under the hood.

- The vast majority of software is not backed by a truly serializable ACID database implementation.

-- MySQL's default isolation level has internal consistency violations[1] and its DDL is non-transactional.

-- The classic transaction example of a "bank transfer" is hilariously misrepresentative: ACH is very obviously not implemented using an inter-bank database that supports serializable transactions.

-- A lot of search applications, I would venture to say most, don't need transactional semantics. Do you think Google Search is transactional? Or GitHub code search?

[1]: https://jepsen.io/analyses/mysql-8.0.34

paulsutter 11 months ago

This paper has a very concise and easy-to-understand definition of Google's MapReduce:

> To a first approximation, MR runs a single query:

> SELECT map() FROM crawl_table GROUP BY reduce()

Or you could read the entire Google MapReduce paper.
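For anyone who wants the quoted one-liner spelled out, here is a minimal single-process sketch of the same shape: apply a map function to every record, group the emitted values by key, then reduce each group. The names `map_reduce`, `count_words`, and `total`, and the three-row stand-in for `crawl_table`, are made up for illustration; this says nothing about how Google's system is actually implemented, and it ignores distribution and fault tolerance entirely.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map_fn over every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in map_fn(record):
            # "Shuffle": collect values that share a key (the GROUP BY step).
            groups[key].append(value)
    # Reduce phase: collapse each group into a single result per key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_words(doc):
    """Hypothetical map function: emit (word, 1) for every word in a document."""
    for word in doc.lower().split():
        yield word, 1


def total(_key, values):
    """Hypothetical reduce function: sum the counts emitted for one word."""
    return sum(values)


crawl_table = ["the quick brown fox", "the lazy dog", "the fox"]
counts = map_reduce(crawl_table, count_words, total)
print(counts["the"], counts["fox"])  # → 3 2
```

In SQL terms, the word-count case is roughly a `SELECT word, COUNT(*) ... GROUP BY word` over the mapped rows, which is the correspondence the quoted query is pointing at.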

didgetmaster 11 months ago

When I was building an object store years ago, I needed a way to attach metadata tags to each object. The objects themselves could be files like a picture, a document, or some music, and I wanted to allow tags to denote things like the author, the camera, or the music genre.

Most systems use things like file extended attributes or a separate database to store such metadata, but I wanted something different. It needed to be able to attach tags to hundreds of millions of objects and find things that matched certain tags quickly.

I invented a key-value store to hold the metadata and got it working well. When it started to look like a big columnar store with sparsely populated rows, I decided to see if it could handle queries like a relational database. To my surprise, it not only did it well, it could outperform many of them.

There are data models besides relational that can work extremely well for certain data sets.
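The comment gives no implementation details, so the following is only a guess at the general idea, not the author's design: tags stored as key-value pairs per object, plus an inverted index from (tag, value) to object ids, which together behave like a sparse table that can answer simple relational-style queries. The `TagStore` class and its methods are hypothetical names invented for this sketch.

```python
from collections import defaultdict


class TagStore:
    """Toy in-memory tag store: sparse rows plus an inverted index."""

    def __init__(self):
        self._by_object = defaultdict(dict)  # object_id -> {tag: value}
        self._index = defaultdict(set)       # (tag, value) -> {object_id, ...}

    def put(self, object_id, tag, value):
        """Attach (or overwrite) one tag on one object."""
        old = self._by_object[object_id].get(tag)
        if old is not None:
            # Keep the index consistent when a tag value is replaced.
            self._index[(tag, old)].discard(object_id)
        self._by_object[object_id][tag] = value
        self._index[(tag, value)].add(object_id)

    def find(self, **criteria):
        """Return ids of objects matching every tag=value pair (a WHERE ... AND ...)."""
        sets = [self._index.get(pair, set()) for pair in criteria.items()]
        return set.intersection(*sets) if sets else set(self._by_object)

    def select(self, columns, **criteria):
        """Project the chosen tags for matching objects; absent tags come back as None."""
        return [
            tuple(self._by_object[oid].get(col) for col in columns)
            for oid in sorted(self.find(**criteria))
        ]


store = TagStore()
store.put("img1", "author", "ann")
store.put("img1", "camera", "nikon")
store.put("img2", "author", "ann")
store.put("song1", "genre", "jazz")

print(store.select(["author", "camera"], author="ann"))
# → [('ann', 'nikon'), ('ann', None)]
```

Rows are sparse because an object only stores the tags it actually has; the `None` in the second result row is the "sparsely populated" column showing through.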

mr_gibbins 11 months ago

The number of times I misspelt Stonebraker in my Ph.D. thesis... An absolute pioneer. I'm glad he's still around and active, sadly unlike many of his late contemporaries.

paulsutter 11 months ago

More specifically, blockchains are designed to avoid double-spending in a low-trust environment. If you're not trying to avoid double-spending, OR you're not in a low-trust environment, you probably don't need a blockchain.

> The ideal use case for blockchain databases is peer-to-peer applications where one cannot trust anybody. There is no centralized authority that controls the ordering of updates to the database. Thus, blockchain implementations use a BFT commit protocol to determine which transaction to apply to the database next.

bitwize 11 months ago

The relational model is to data what Lisp is to code: despite attempts to beat it, nothing really can because all those other models are expressible in terms of it (and, usually, can be made very efficient in practice).

RDBMS and Lisp sit near the tao of their respective domains, which is why I advise people to stick with an RDBMS unless they have a really, really, really good reason not to. Or as Nik Suresh put it, "Just use Postgres. You nerd. You dweeb."

burcs 11 months ago

What an amazing read, here's hoping they'll both be around for the 2044 edition. 101 is not too old to write another research paper, Dr. Stonebraker!
