Why I love databases

238 points by strzalek over 10 years ago

19 comments

Nice article. I love databases too, for similar reasons, but as someone who designs database engines, I think some of the technical points are off the mark. I never really stop learning in this area; the technical range is incredibly deep and nuanced.

Some of the points that caught my eye as being quite off:

- Contrary to footnote 2, modern database designs bypass the OS file system cache and schedule their own I/O. This has an enormous performance impact versus the OS cache (2-3x is pretty typical) and is a good litmus test for the technical sophistication of the database implementation. Relying on the OS cache is the primary reason many open source database engines, even "write-oriented" ones, have relatively poor write performance to disk.

- The three data retrieval models enumerated are textbook, but if you were designing a new engine today you probably would not use any of them for a general-purpose design. Modern spatial access methods (e.g. HyperDex, SpaceCurve, Laminar) are superior in almost every way, though the literature is much sparser. Also, a number of real-time analytical databases are bitmap-structured (e.g. ParStream), which is incredibly fast for some types of query workloads.

- Many distributed database challenges are a side effect of "one server, one shard" type models. It is not necessary to do things this way; it is just simpler to implement. Some distributed database systems have thousands of shards per server. The latter model is operationally more robust and better behaved under failure and load skew.

- Tombstones are usually trivial if the database engine is properly designed. Complications are a side effect of poor architecture. The big challenge for tombstones is deciding when and how tombstoned records are garbage collected. It is outside the scope of the normal execution pathways, but you also don't want a Java-like GC thread in the background.

Of course, any of these is a long blog post in itself. :-)
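
To make the tombstone point concrete, here is a minimal Python sketch of one possible policy: a delete writes a marker instead of removing the key, and an explicit compaction pass drops markers once they are older than a grace period. The class name `TombstoneStore`, the `grace_seconds` parameter, and the grace-period policy itself are assumptions for illustration, not something the comment describes.

```python
from dataclasses import dataclass
from typing import Dict, Optional
import time

TOMBSTONE = object()  # sentinel value that marks a key as deleted


@dataclass
class Entry:
    value: object
    written_at: float


class TombstoneStore:
    """Hypothetical in-memory store; deletes leave a tombstone behind."""

    def __init__(self, grace_seconds: float):
        self.grace_seconds = grace_seconds
        self.entries: Dict[str, Entry] = {}

    def put(self, key: str, value: object, now: Optional[float] = None) -> None:
        self.entries[key] = Entry(value, time.time() if now is None else now)

    def delete(self, key: str, now: Optional[float] = None) -> None:
        # Record the delete as data, so it can still be seen by anything
        # that reads or syncs this store before compaction runs.
        self.put(key, TOMBSTONE, now)

    def get(self, key: str) -> Optional[object]:
        entry = self.entries.get(key)
        if entry is None or entry.value is TOMBSTONE:
            return None
        return entry.value

    def compact(self, now: Optional[float] = None) -> int:
        """Drop tombstones at least grace_seconds old; return how many."""
        now = time.time() if now is None else now
        expired = [
            key for key, entry in self.entries.items()
            if entry.value is TOMBSTONE
            and now - entry.written_at >= self.grace_seconds
        ]
        for key in expired:
            del self.entries[key]
        return len(expired)
```

The "when" in this sketch is whoever calls `compact`. A real engine would more likely fold that work into something it already does, such as merging storage files, rather than run it on a background timer.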

pavlov over 10 years ago

I hate databases. People tend to have way too much faith in them (or their surrounding marketing), and thus make poor database choices that don't actually fit the shape of their data. Persistence is fundamentally the programmer's responsibility; a magic box behind a socket can't design it for you.

Most applications I've seen wouldn't even need a database, but apparently a lot of programmers are conditioned into believing that writing anything to disk must involve building a database query string and transmitting it over a socket to another process which parses the string, executes it on an interpreter and stuffs the extracted data into a generic 1970s data model that finally gets written to disk in an opaque format from where it can only be retrieved by sending more strings over sockets. This stuff made sense when 1MB was a huge amount of memory, but today it's just not necessary.
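
As a rough illustration of what taking on persistence yourself involves, here is a minimal Python sketch that saves application state to a single JSON file. The function names `save_state` and `load_state` and the choice of JSON are assumptions, not anything the comment proposes. The write goes to a temporary file first and is then renamed into place, so a crash mid-write should leave the old file intact.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write `state` as JSON to `path` via a temp file plus rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        # The rename is atomic on POSIX file systems, so readers see
        # either the old file or the new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path, default=None):
    """Read state back, or return `default` if nothing was saved yet."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

Even this small sketch has to think about flushing, crash safety, and cleanup, which is the kind of design work the comment argues belongs to the programmer either way.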

jeeyoungk over 10 years ago

Thank you everybody for your valuable comments. As I said in the post, this was my first attempt to explain my passion. I read every one of your comments, and spent the next few days fully understanding their implications.

This post is a bit biased towards what I've interacted with. I think I do have a good sample of various workflows, but there are obviously a large number of databases and use cases that I'm missing, and my view of what is "modern" may not incorporate a lot of bleeding-edge theory and technologies.

Operational complexity is definitely my #1 concern. The service I originally maintained is an OLTP system, located at the top of the service & data dependency graph. Availability is the top concern.

The current system I'm writing is a metrics database. The operational burden is much lighter. It is almost a leaf node in the service & data dependency graph, and I can take downtime to restart the cluster. A very different workflow, indeed.

Thanks!

gfodor over 10 years ago

"Designing Data-Intensive Applications" is shaping up to be an excellent treatment of modern databases and their underpinnings. It's at an excellent level of abstraction, deep enough to convey database internals while high-level enough (so far at least) to be able to cover a wide variety of database systems. It also has its feet firmly planted in database history, and is NoSQL-koolaid free. Highly recommended.

http://shop.oreilly.com/product/0636920032175.do

Slackwise over 10 years ago

I've always loved databases, but after having discovered write-only timestamped databases like Datomic, I can't imagine going backwards. It's a real shame that Datomic isn't fully open source.

(Aren't BigTable and Spanner also write-only and timestamped?)
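
For readers who have not met the idea, here is a minimal Python sketch of a store that only ever appends timestamped values and can answer "what was the value as of time T". It illustrates the general concept only. The class `AppendOnlyStore` and its methods are hypothetical and are not Datomic's, BigTable's, or Spanner's API.

```python
import bisect
from collections import defaultdict


class AppendOnlyStore:
    """Hypothetical store: writes append a new version, nothing is overwritten."""

    def __init__(self):
        self._times = defaultdict(list)   # key -> ascending logical timestamps
        self._values = defaultdict(list)  # key -> values, aligned with _times
        self._clock = 0                   # logical clock, bumped on every write

    def write(self, key, value):
        """Append a new version of `key` and return its timestamp."""
        self._clock += 1
        self._times[key].append(self._clock)
        self._values[key].append(value)
        return self._clock

    def read(self, key, as_of=None):
        """Return the value of `key` as of a timestamp (default: latest)."""
        times = self._times.get(key, [])
        if as_of is None:
            as_of = self._clock
        # Index of the first version written strictly after `as_of`.
        i = bisect.bisect_right(times, as_of)
        if i == 0:
            return None  # key did not exist yet at that time
        return self._values[key][i - 1]


db = AppendOnlyStore()
t1 = db.write("name", "Ann")
t2 = db.write("name", "Anne")
latest = db.read("name")            # the newest version
earlier = db.read("name", as_of=t1)  # the value before the second write
```

Because old versions are never destroyed, reading the past is just a lookup, which is a large part of the appeal described above.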

quonn over 10 years ago

"The study of databases intersects almost every topic in computer science" - I've heard this before, especially for compilers. But it has been false for a long time; CS is far more diverse now. For example, how do databases intersect AI/Machine Learning/Computer Vision, Computer Graphics, Numerics/Simulation, Robotics, Bioinformatics, Computer Architecture or Cryptography?

Pxtl over 10 years ago

I love databases, but I loathe SQL. And no, I don't mean NoSQL is better - that's throwing out the baby with the bathwater.

To me, SQL is the Common Lisp of relational languages - a brilliant invention of its time that has long since overstayed its welcome and should be replaced by modern considerations of the problem it solves. The difference is that there are a million rethinks and descendants and redesigns of Lisp out there that happily threw out the mistakes and made great strides in the language. You could argue that every modern programming language is a descendant of Lisp thanks to the prevalence of great concepts like lexical closures. SQL, on the other hand, has a teeny tiny few spiritual fringe descendants like the various attempts at Date and Darwen's "Tutorial D".

I love the relational model, but who says the only way to manage the relational model is this hoary old thing? It's immensely frustrating that every implementation of SQL bolts on a tacky and half-assed procedural language, but doesn't solve simple underlying frustrations.

Simply accessing related objects is immensely wordy for a "relational" language. In an Algol-derived language, I can say Group.Manager.Person.Address.PostalCode to walk the graph. In SQL, I have to deal with a zillion joins.

Yes, some SQL variants let you join by the foreign key name to make the join a little more terse, but it's still hairy compared to every modern functional or procedural language.

And the APIs - maybe the reason so many sites have SQL injection problems is the hideous APIs. Ever tried to build a WHERE IN (id1, id2, id3... idN) statement with proper parameterized queries? Holy crap, what an icky mass of boilerplate. I mean, it's not a hard problem, but how many times have you solved it, and how many times have you found a tedious bug in your solution? Just give me a proper way to concatenate the parametrization inline with the query FFS.

<pre><code>db.RunQuery("SELECT * FROM MYTABLE WHERE ID IN " + db.SomeParameterListFunc(a, b, c) + " ORDER BY HOLYCRAP_WAS_THAT_SO_HARD");
</code></pre>

The above syntax would be trivial in any language with operator overloading on the "+" sign, on the off chance that your SQL dialect is so messy it's impossible to safely build a properly-escaped initializer for the list containing a, b and c in text form.

And that's not even getting into real actual first-class language support like ORMs give you.

And speaking of APIs, the fact that a single "SELECT" is the baseline operation... that you work on one resultset at a time. I don't want a single pile of rows. This is not an Excel spreadsheet, it's a relational database, and that means I want a graph of data. I don't want to write three queries to get my Customers, their Personnel, and their Addresses, nor do I want a single row of CustomerPersonnelAddresses. Once and Only Once is good for the data, why the heck isn't it good for result sets?

Where's the code reuse? Why can't I have a pile of SELECTs and a pile of WHEREs and combine them however I see fit? Oh right, I can use a VIEW... but see the previous point, a VIEW is a single glorified Excel spreadsheet, not a proper graph of data. If I want to bundle a bunch of SELECTs together, I have to just write a stored procedure, but then I can't use the proc with a JOIN statement against my VIEW that provides a custom WHERE clause. You could do something monstrous with table-valued parameters, I guess, but those aren't generally well-supported at the API level. This is not a hard problem in every modern language (except Go, of course - yes, you do freaking need map/reduce/filter).

Namespaces. Real, actual, organizational tools for your giant list of 9000 tables and their related objects. No, schemas don't freaking count - you can't nest them and they're overly tied to the security model - using schemas for organization instead of security leads to madness.

And of course, so many common problems simply aren't nice to work with using the relational model. How do I make a nice audited row where I have the full history of all the row's changes? Well, I can insert it every time, but that's a lot of wasted space. Yes, again, there are ways to do this, but it's something I'd expect to come out-of-the-box since it's such a common problem. Common problems should be solved by the standard library. But SQL can't solve things like this with the standard library, because SQL standard libraries are limited to crude things like functions and views and procs and not actual large-scale reusable constructs. It's like a programming language where they gave you a bunch of general tools for manipulating unicode points and dynamically sized arrays but no coherent "string" object.

Or trees. Holy crap, you have a "relational" database where a relationship like a "tree" is a nightmare to actually query out! I know that's not what "relational" means, but still - this isn't exactly a rare edge-case, y'know? But it's not in the standard library because the standard library is limited to crude objects like data-types, functions, procedures, etc. that work below the row-level. Any concept of reusable schema concepts is completely left off the table.

/rant
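
For reference, the boilerplate the WHERE IN complaint refers to looks roughly like this in Python with the standard-library `sqlite3` module. The helper names `in_clause` and `fetch_by_ids` and the `mytable` schema are made up for the example. Only placeholder text is concatenated into the SQL, and the values themselves travel as bound parameters.

```python
import sqlite3


def in_clause(values):
    """Return (sql_fragment, params) for a parameterized IN list."""
    values = list(values)
    if not values:
        # "IN ()" is a syntax error in many dialects; "IN (NULL)" is
        # valid SQL and matches no rows.
        return "(NULL)", []
    placeholders = ", ".join("?" for _ in values)
    return f"({placeholders})", values


def fetch_by_ids(conn, ids):
    fragment, params = in_clause(ids)
    # The fragment contains only "?" marks, never user data.
    sql = f"SELECT id, name FROM mytable WHERE id IN {fragment} ORDER BY id"
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c")],
)
rows = fetch_by_ids(conn, [1, 3])
```

The empty-list case and the cap that drivers place on the number of bound parameters are the usual sources of the "tedious bug" mentioned above. This sketch handles only the first.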

read over 10 years ago

> You cannot give up partition tolerance.

It'd be more accurate to say you don't want to give it up all the time. You don't want CAP; you want PACELC.

http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html

rattray over 10 years ago

I wonder why RethinkDB hasn't gotten a mention here. Have folks here used it? Thoughts?

cfolgar over 10 years ago

Since we're discussing databases... is there any "gold standard" learning resource/introduction to PostgreSQL? As a college student, I do not have much experience with them yet, but I am aware of how important it would be to be comfortable with them in your day-to-day development. Something tells me that my usual approach of diving in and tinkering by building out an idea wouldn't serve me as well for DBs; it just seems to me that there are some fundamental database concepts that I would be missing if I went down that path.

Any advice as to where to start a structured approach to learning about databases would be highly appreciated :)

Roboprog over 10 years ago

I have lived a very sheltered existence. I have never worked on an application which had a database cluster or sharding, rather than just running on a single server.

Of course, the servers are a little bigger now than they were back around say 1990.

fogleman over 10 years ago

I love databases for the data they contain. And for the ability to make sense of that data more easily when it's in a nice, structured format.

I don't care as much for the dev-ops side of it.

pm90 over 10 years ago

What I find both fascinating and scary about databases is the question of how to choose among the wide variety of them without understanding exactly how they work. And it doesn't help that there are new databases springing up all the time.

Is there a way for application developers to understand these databases quickly without spending weeks working with them?

tkyjonathan over 10 years ago

I like SQL. It makes sense to me, personally. Even when it gets really unwieldy, I can always take a step back and break it down into parts. If I'm allowed to use temp tables, I can do almost anything in it, and usually a lot faster than developers can in their language.

metaphorm over 10 years ago

this was a great article. extremely helpful to me, coming from a position of being a server-side application developer often tasked with getting several disparate data-stores talking to each other.

the author does a great job of showing the full depth of the field while providing useful hooks and links for further study. this one made my bookmarks folder. i'll surely be going back to check it out again.

smartpants over 10 years ago

This is what I needed. Great read.

digital-rubber over 10 years ago

In my opinion, the more important question is: does the database love your data? Only then can you truly love your database.

marknadal over 10 years ago

I'm glad he loves databases; databases have been the bane of my existence.

However, the torment they have given me has also led to a similar fascination - and now I'm writing my own database! So I've become very familiar with the topics he writes on, and they are very good points for anybody interested in the subject.

Why would I write my own database? Because databases are hard, and I am determined to make them easy (even if that means sacrificing years of my life to doing all this crazy research). Check out http://github.com/amark/gun !

- CAP Theorem: he is correct, P cannot be sacrificed. GUN is AP with eventual consistency. The beauty of this, though, is that you can always build strong consistency out of eventual consistency (it just requires knowing X amount of peers in advance, and doing a trivial lock until you've heard back from all of them - in fact, I do this in one of the example apps) but you can never go from strong consistency down to an eventually consistent system.

- Distributed Systems: this is incredibly, incredibly important. I cannot repeat this enough times, there should be no "master" or "single source of truth" in any database. If there is, you're going to have a nightmare of a time being woken up at 3am to fix it when it crashes (my personal experience with other databases). Why? Because single points of failure will fail; centralized systems suck. Solution: Distribute and Decentralize! We make this easy for you.

- Correctness vs Efficiency: as he says, Paxos is difficult - all of them are: Raft, Quorum, leader election, etc. DO NOT USE THEM unless you are Google, Amazon, Walmart, or what not. Even then, do not use them. Instead, I've solved this challenging problem by developing a new Conflict Resolution system that (very poorly) can be summarized as Vector Clocks + Timestamps; you get the advantages of both without either of their weaknesses. What this means is that data integrity is guaranteed because every machine is using a deterministic algorithm, without any extra gossip between machines. Let me repeat, you'll get the same result on every machine, eventually consistent, without any multi-machine coordination. This means every peer is a master, and that is awesome, even if you are running it on an ephemeral server/cloud - completely resilient to failures, terminations, restarts, and reboots.

- Empowering the App: Yes. Databases should serve you, not the other way around. Answers to his questions about abstractions are at http://github.com/amark/gun .

- Operational Challenges: This is where I diverge from him. If something seems wrong, like things suddenly becoming slow, you can easily just restart it without any damage/harm/failure occurring. And then you can look through your logs, taking your time, to see what went wrong.

- Basic Building Blocks: Because GUN is a graph database, you get key-value like access, as well as documents and relational styles. That is because mathematically graphs are the superset of relational algebra and hierarchy trees.

Happy to answer any questions!
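
The claim that every peer can reach the same result without coordination rests on the merge function being deterministic and independent of the order in which updates arrive. The Python sketch below shows that property with a plain last-writer-wins merge and a fixed tie-break. It is an assumption-laden illustration, not GUN's actual conflict resolution algorithm, and the `merge` function and `(timestamp, value)` layout are invented for the example.

```python
from typing import Dict, Tuple

Versioned = Tuple[float, str]   # (timestamp, value)
State = Dict[str, Versioned]    # key -> latest known version


def merge(local: State, incoming: State) -> State:
    """Deterministically merge two peers' states, key by key."""
    merged = dict(local)
    for key, candidate in incoming.items():
        current = merged.get(key)
        # Tuples compare by timestamp first and by value second, so a
        # timestamp tie is broken the same way on every peer.
        if current is None or candidate > current:
            merged[key] = candidate
    return merged


a: State = {"x": (1.0, "a")}
b: State = {"x": (2.0, "b"), "y": (1.0, "c")}
c: State = {"x": (2.0, "z")}

# Three peers receive the same updates in different orders.
peer1 = merge(merge(a, b), c)
peer2 = merge(merge(c, a), b)
peer3 = merge(a, merge(b, c))
```

All three peers end up with identical state. The cost of this simple scheme is that one of two concurrent writes is silently discarded, and it trusts the peers' clocks, which is presumably part of what a combined vector clock and timestamp design is meant to address.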

MrBra over 10 years ago

Well, the intro was promising and full of energy, like "I am really going to pass on some good love for databases and explain why it's so stimulating to deal with them!", but then right after this sparkling start it's all just about the same old redundancy, consistency, scaling...