Why do we need database joins?

28 pointsby timfover 14 years ago

10 comments

Groxxover 14 years ago

Sets up a straw-man:<pre><code> Article table like: authorID, authorName, publisher, title </code></pre> Where an article only ever has one author. Then proceeds to knock it down (summarized and slightly satirized):>That's not even in second normal form. /makes an Author table. Viola! Future name changes are free.>But wait! We don't change author names on published articles! * gasp * we don't need joins!No, you've just demonstrated that blind optimizations are sometimes incorrect. Duh. They are correct in this respect, however:>The dogma of normalization [too] often leads to over-engineering.But all dogmatic adherence to X results in sometimes-incorrect use of X. This is true everywhere, not just for RDBMS normalization.

blahedoover 14 years ago

Meh. If you don't have "database joins", meaning SQL queries that have the word JOIN in them, then you end up effectively replicating the join in your glue code somewhere (either on the insert side or on the query side). And if you put the join in the code, it is algorithmically fixed; there is no chance that the database can optimise this for you.On the other hand, if you write your query in SQL on a moderately normalised database, you have the potential for the database itself to figure out what to denormalise, and what to parallelise, and any other sorts of optimisations it might do. It just has to know what the typical queries are.There's a lot of power in using a declarative rather than an imperative paradigm.

jazzychadover 14 years ago

Ok, serious question. I have been writing SQL for 7 years now and can normalize schemas with the best of them; but I am wanting to learn the proper techniques for creating scalable data sets like what one would store in GAE BigTable or EC2 SimpleDB...Let's say I have a book review site. In my SQL world I have tables of 'users', 'authors', 'books', and 'reviews'. When a user creates a review, it creates a new row in 'reviews' which contains the userid, bookid, and rating. Throw some JOINs together and I can pull out rows that have the user data, author data, book data, and the rating all together.What do you do in non-normalized world? When a user creates a review, just insert a row into a 'review' table that has user_id, user_name, author_name, book_title, book_isbn, book_etc, ..., rating ??I can see how this would be a much more scalable table and fetching a review is just one row, but it just seems like so much duplicate data...? Or maybe I am thinking about it the wrong way?

评论 #1972795 未加载

评论 #1972923 未加载

评论 #1972883 未加载

boyterover 14 years ago

I think that his arguement is a bit flawed. Just becuase you can come up with one arguement why joins dont work dosnt mean they are never the best solution. I know that for my day job they are the correct solution because I need the data to be transactional and correct. I shudder to think how much effort it would be to port them to a keyvalue store and maintain the same levels of data integrity.

评论 #1972941 未加载

tzsover 14 years ago

I looked into using some kind of NoSQL solution at work, and one thing puzzles me. Assume a key/value database, where values are JSON documents or something similar.Using such a database, with the key being customer account ID, and the value being a document that contains all the information about that customer's account would be convenient and useful. Information about the account would include things like name, address, email, status of the account, when their subscription to our service expires, re-billing information, all their purchases, and so on.This would definitely be a lot more convenient than our current SQL-based solution for two of the three things we do with our customer data.One of those things is display a customer's information for our customer support people when they are dealing with the customer on the phone, or working on a ticket in our help desk. With the NoSQL, it would be a simple get the JSON document for that customer, and display it nicely.The second thing is providing information to the installed software on the end user's computer. Our software provides a service that can partly be done locally, and partly requires back end support. All the back end information needed to support a given customer's installed software would be in that JSON document, so again the key/value store would work great.However, I'm at a loss as to how the third thing would be handled efficiently: reporting.For instance, suppose I'm asked for a quick report on how well a new SKU has sold over the last week. With our SQL database, this is a trivial select from the "orders" and "line_items" table. It is almost instantaneous. With the NoSQL solution, it would appear that I would have to go look at every single customer's entry.All in all, we've got a couple dozen regular reports, many running hourly, and a few being interactive on-demand via a web interface, that are fairly straightforward and efficient in our SQL-based solution (some might involve joining 6 or 7 tables, but it is still straightforward and efficient).Thus, at first glance, it would seem that going to NoSQL would be a nightmare for my reports. Many times an hour, it would appear that the database would get slammed by reports that would have to be accessing every freaking record in the database. That doesn't sound like something that's going to make me happy.How do people deal with reporting when using NoSQL?

评论 #1972965 未加载

评论 #1972931 未加载

wanderrover 14 years ago

Thought experiment: imagine you are going to port Bugzilla to a noSQL solution. How would you do it? There are some pretty obvious relationships between bugs, authors, commenters and comments.Imagine a bug that has 100 comments from as many commenters. A normalized soluion with no joins is going to require at least 200 round trips to look up all comment and commenter information, so the need to denormalize is pretty obvious.So let's say you enormalize all the way and save all relevant information in he bug document. Each bug has comments embedded in it and each comment has all of the commenter's information in it. Now each bug requires one lookup, which is awesome, but what about the fact that commenter information can change? Sure it's a somewhat rare event, but users can have thousands and thousands of comments. When a user changes their email address, do you tell them, this might take a few hours, in the meantime I hope you still have access to your old address? What kind of load do those massive updates place on your system? Unlike reads which are fairly steady and regular, these high load queries will be sporadic. How do you plan for and accommodate the spikes?

评论 #1973163 未加载

评论 #1973830 未加载

评论 #1973142 未加载

andrewcookeover 14 years ago

We need database joins so that we can derive new, interesting propositions from the data already in the database.A database is a set of facts about something. Joins are the logic reasoning that lets you deduce new things.If you never need to find out new things, then you don't need joins. So you can get away without joins when you have a simple system that is doing repetitive work. You can't avoid them when you are doing complex analysis or presenting data in a variety of different ways.

zmmmmmover 14 years ago

This seems to overlook really hard part where joins become important which is for things that need to transactional, such as how much money I have in my bank account. I might not care if it takes 30 seconds for my change of name to propagate to 300 different papers I have written, but it is crucially important that I have only one total for my bank account and that it is never allowed to fall below zero (simplistically).

rfuggerover 14 years ago

I can only conclude that this is an ironic statement implying that we need database joins to keep a web server responsive...

Swizecover 14 years ago

The site is down for me.Anyway, I don't think database joins are all that important. They make sense when structuring your schema - only store a unique piece of data once and then connect other data to it. This is a very sound design choice.But in light of modern key/value stores, I don't think one needs to perform joins on the database anymore. Sure it's theoretically a bit faster than performing the join in your app, but considering all the other implications and quickly you realise you don't need those.Consider these two options: Pull data from a db, it's neatly tied together with all its meta-data. But the query takes almost a second and when the DB is always worried about consistency and stuff so writes are kind of slow too.Pull data from a db. If you need some related data, you make another query based on the key stored in the first piece of data. Each query takes a fraction of a second, together they don't take as long as the full join query ... and you're only joining stuff when you actually need all of the data.The added benefit is that in this situation the DB doesn't have to worry about almost anything and is thus much faster.If that made any sense, I'm preparing a seminar on NoSQL so I'd love to hear some input on my understanding of things :)

评论 #1972851 未加载

评论 #1973015 未加载

评论 #1972780 未加载

评论 #1972736 未加载