> but few … nay, none who could tell me what the properly normalized data model was

I guess we have an entire generation of software engineers who never had to cram the degrees of normal forms for an exam? [0] The term "Boyce-Codd Normal Form" is etched in my neurons.

For anyone who seriously studied relational databases, the normal forms are very clearly defined. This is not some "unknown" used to justify arbitrary design.

Likewise, anyone who has seriously worked at delivering data at scale in an efficient way *has* consciously considered the trade-offs between normalized and denormalized data.

Furthermore, the reason for normalization isn't that it's pretty, but that it can drastically reduce the complexity of updates when you have duplicated but mutable data, as well as reduce overall storage space back in an era when that was a major issue.

0. https://en.wikipedia.org/wiki/Database_normalization#Normal_forms
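To illustrate the update problem with duplicated, mutable data, here is a rough sketch using hypothetical customers and orders tables:

```sql
-- Hypothetical denormalized table: the customer's email is copied onto every order.
-- If the customer changes their email, every one of their rows must be rewritten.
CREATE TABLE orders_denormalized (
    order_id       INTEGER PRIMARY KEY,
    customer_id    INTEGER NOT NULL,
    customer_email TEXT    NOT NULL,  -- duplicated, mutable data
    total_cents    INTEGER NOT NULL
);

-- One logical change touches an unbounded number of rows:
UPDATE orders_denormalized
SET customer_email = 'new@example.com'
WHERE customer_id = 42;

-- Normalized version: the mutable fact lives in exactly one place.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
    total_cents INTEGER NOT NULL
);

-- The same logical change is now a single-row update:
UPDATE customers SET email = 'new@example.com' WHERE customer_id = 42;
```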
There is some confusion in this thread about what normalization means. Here's a quick intro for the young 'uns. It refers to a bunch of constraints or suggestions about how you should organise information in a relational database.

The basic requirement is that each logical attribute goes into a separate scalar-valued column. Keep zip code separate from address, for example. Don't have arrays or JSON-valued fields. If all you do is obey this constraint, the db is in "first normal form".

The next set of constraints refers to the keys. As a mnemonic, you can think of it as a variation on "the truth, the whole truth, and nothing but the truth". As in, "the key, the whole key, and nothing but the key".

"The key" refers to a constraint that all other columns must functionally depend on a singular ("_the_") primary key. There is no point having two columns that are each unique ("emp_seq_number_when_joining", "emp_id") and can each be used to look up the rest of the related data.

"The whole key": the dependency must need the whole key, not part of it. For example, both employee-id-plus-year and employee-id are unique, but the latter alone is sufficient to look up the rest of the employee info. Put only as much info in the key as is needed to make the row unique; anything extra goes into a separate column.

"Nothing but the key": there should not be any dependencies on other non-key columns. For example, one can have "department-name" and "department-id" as columns. This is redundant: department-id is sufficient to obtain the name. In practice one keeps the extra column to avoid the expense of joins. This is classic denormalisation.

This is the basic set of requirements (called "third normal form" or 3NF) needed to call the database normalized. For the most part, most dbs are in 3NF. Denormalization is done for performance reasons, typically to avoid expensive joins. For example, one could squirrel away a tuple or an entire JSON document in a db cell, or keep redundant column info as illustrated earlier.

The article is simply a reminder to be serious about normalization, and to introduce optimizations cautiously, with appropriate justification in documentation.
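For illustration, a minimal sketch of the "nothing but the key" case, using hypothetical employees and departments tables:

```sql
-- Violates "nothing but the key": department_name depends on department_id,
-- a non-key column, so it is stored redundantly on every employee row.
CREATE TABLE employees_denormalized (
    emp_id          INTEGER PRIMARY KEY,
    emp_name        TEXT    NOT NULL,
    department_id   INTEGER NOT NULL,
    department_name TEXT    NOT NULL  -- transitive dependency: emp_id -> department_id -> department_name
);

-- 3NF version: the department name lives with its own key.
CREATE TABLE departments (
    department_id   INTEGER PRIMARY KEY,
    department_name TEXT    NOT NULL
);

CREATE TABLE employees (
    emp_id        INTEGER PRIMARY KEY,
    emp_name      TEXT    NOT NULL,
    department_id INTEGER NOT NULL REFERENCES departments (department_id)
);

-- Reading the name back now costs a join, which is exactly the cost
-- the denormalized version was trying to avoid.
SELECT e.emp_name, d.department_name
FROM employees e
JOIN departments d ON d.department_id = e.department_id;
```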
I have had pretty good results over the years following the mantra “normalize until it hurts, denormalize until it works”. Start with as little redundancy as possible, then carefully consider anything you want to denormalize and plan around the complexity it introduces. Materialized views can also help a lot with this.

The main issue I’ve run into following this approach has been with ORMs that don’t provide nice raw SQL escape hatches: many times, something you’ll jump to denormalize is easily solved with a better query, if your tools let you do so.
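For illustration, roughly what the materialized-view approach looks like, assuming hypothetical customers(customer_id, email) and orders(order_id, customer_id, total_cents) tables already exist (Postgres-style syntax):

```sql
-- Keep the base tables normalized; precompute the expensive join/aggregate
-- as a materialized view instead of copying columns around.
CREATE MATERIALIZED VIEW customer_order_totals AS
SELECT c.customer_id,
       c.email,
       COUNT(o.order_id)               AS order_count,
       COALESCE(SUM(o.total_cents), 0) AS lifetime_cents
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.email;

-- Reads are as cheap as a denormalized table...
SELECT * FROM customer_order_totals WHERE customer_id = 42;

-- ...and staleness is handled in one place rather than in every write path.
REFRESH MATERIALIZED VIEW customer_order_totals;
```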
Denormalization and nested data structures. Sometimes we need to spend a little more time thinking about data models rather than just dumping the original JSON into Mongo and working around that for eternity.

/rant
Oddly, I'd argue that normalization is often the "simplification" that is not accurate to how things are. Mayhap an example could show that I'm thinking of different ideas here?

My gut is that I'm thinking of what is normally called "materialization" more so than just "denormalization". The idea being that what is denormalized is largely read data, and doesn't offer direct editing of the items that are copied. I view this as a bit more realistic for a lot of scenarios as well, as that is likely what is happening to the items being represented.

This is especially true for any inventory-style database. There are physical things backing the data, such that if you have normalized the representation, you are doing a simplification for yourself in the data. You are not more accurately representing it.
This is super easy, actually.

Unfortunately people see the relational model starting from a "source of truth" database schema and then only know about "(de)normalization" in the form of "NORMAL forms".

The relational model is ALL about normalizing and denormalizing relations (aka tables).

*All ways* you can combine, recombine, add, remove, change, extend, ... a relation are doing (de)normalization. It means: add/remove columns, add/remove rows.

So, `SELECT field FROM table` and `SELECT * FROM table WHERE id=1` are this.

---

"NORMAL forms" are just observations about the shape the data is in ("does it have duplicated rows?"), and their properties are useful to understand, similar to how it is good to know what is different between a `set` and a `vector` of things.

And from this, you see how this is useful for modeling your RDBMS ("so, I don't need to duplicate all my customer information, I can just use a "pointer"/id for it!").
This article seems to ignore common denormalization patterns that don't cause problems, e.g. denormalizing immutable data such as row ids. These reduce the number of joins needed, and make code simpler and faster, refuting one of the article's claims about code complexity due to denormalization.

The article seems to be saying that the trade-offs of denormalization often aren't being correctly evaluated. Without examples or specific advice, this is not particularly useful information.
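For illustration, a rough sketch of that pattern with hypothetical org/team/member tables, where a team's org never changes, so the copied id can't go stale:

```sql
CREATE TABLE orgs    (org_id  INTEGER PRIMARY KEY);
CREATE TABLE teams   (team_id INTEGER PRIMARY KEY,
                      org_id  INTEGER NOT NULL REFERENCES orgs (org_id));
CREATE TABLE members (member_id INTEGER PRIMARY KEY,
                      team_id   INTEGER NOT NULL REFERENCES teams (team_id),
                      -- Denormalized copy of the team's org_id. Because a team never
                      -- moves between orgs in this model, the copy cannot drift.
                      org_id    INTEGER NOT NULL REFERENCES orgs (org_id));

-- "All members of org 7" without touching teams at all:
SELECT member_id FROM members WHERE org_id = 7;

-- The strictly normalized equivalent needs a join:
SELECT m.member_id
FROM members m
JOIN teams t ON t.team_id = m.team_id
WHERE t.org_id = 7;
```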
You know, given the math-y introduction, this rant would read even better with "denormalization" replaced with "numerical methods". Euler's beautiful identity becomes ugly, not at all the same thing, and certainly not "simpler". They only give you approximate solutions! All in the name of performance! And many number crunchers can offer the "numerical solution" option, but few... nay, none could tell what the precise analytical model would have been. Oh, the humanity!

Well, it sure does suck that instead of perfect analytical solutions/perfectly normalized data models we have to deal with numerical approximations/denormalized schemas that may or may not have dubious properties, which we usually tend to ignore in the hope that they are actually fine, but... that's just how it is. Deal with it, I guess.
Every problem that denormalization purports to solve has had a (better) solution for decades now. If the data isn't loading fast enough, it can be cached (either in an edge cache or a materialized view). Those solutions have costs - but so does denormalization!
> in the name of “simplification” and/or “performance,” it deprives us of the correct answer to the problem,

This is not a universally useful way to think. For many, many problems computers cannot give you the right solution, and thus the most relevant question is what makes a good approximation (a trade-off between fast, accurate, small, implementable, etc.).

In fact, Euler's Identity is very hard to get out of a computer. Just telling it what 'pi' is, is not trivial in the slightest. Certainly 'pi' isn't a floating point number, and if you import the mathematics library of your favorite language, the thing called 'pi' is just a slightly more sophisticated version of 22/7.
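To put rough numbers on that (my own arithmetic, assuming IEEE 754 double precision):

$$
\frac{22}{7} = 3.142857\overline{142857}\,,\qquad
\pi = 3.14159265\ldots\,,\qquad
\left|\tfrac{22}{7}-\pi\right| \approx 1.26\times10^{-3},
$$

while a 64-bit double stores pi as 3.141592653589793: correct to about sixteen significant figures where 22/7 manages roughly three. Either way, it's an approximation.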
I may be "denormalizing" the question itself a bit, but I think of it thusly:

- The 3rd normal form is when I'm thinking about the entities in a top-down way.

- In a key-value store, I'm thinking of a GUID with a related document, typically JSON, with some arbitrary structure. It's a bottom-up approach. The individual records matter more than the entity.

Argue the theoretical, top-down case all day long, and watch me agree. But, like as not, your MongoDB / Elasticsearch / DynamoDB tables are doing a lot in production in the Real World(TM). They get the job done.

Probably driving the point too far here, but it all boils down to Plato vs. Aristotle.
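For illustration, the same order modeled both ways (made-up names; Postgres jsonb standing in for a document store):

```sql
-- Top-down, 3NF-style: entities first, records fit the entities.
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL);
CREATE TABLE order_lines (order_id INTEGER NOT NULL REFERENCES orders (order_id),
                          sku      TEXT    NOT NULL,
                          qty      INTEGER NOT NULL);

-- Bottom-up, key-value style: one key, one self-contained document.
CREATE TABLE order_docs (
    order_guid UUID  PRIMARY KEY,
    doc        JSONB NOT NULL
);

INSERT INTO order_docs VALUES (
    '6b1f0d3e-0000-4000-8000-000000000000',
    '{"customer_id": 42, "lines": [{"sku": "A-100", "qty": 2}, {"sku": "B-7", "qty": 1}]}'
);
```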
> The tricky part of this proposal is that I’ve encountered many data modellers in my career who can offer the “denormalized” option, but few … nay, none who could tell me what the properly normalized data model was.

We have a data model where a User has many Xs and many Ys, and both Xs and Ys carry a user_id foreign key. The normalized version would be for the User to have its Ys through its Xs, so only X would need user_id. We denormalize and store user_id on both for performance reasons. This seems like a) so common, and b) so straightforward a refutation of this claim, that I must be misunderstanding the author.
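If it helps, a minimal sketch of the shape I mean (table and column names are made up):

```sql
CREATE TABLE users (id INTEGER PRIMARY KEY);

CREATE TABLE xs (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users (id)  -- normalized: X belongs to User
);

CREATE TABLE ys (
    id      INTEGER PRIMARY KEY,
    x_id    INTEGER NOT NULL REFERENCES xs (id),    -- normalized path: Y -> X -> User
    user_id INTEGER NOT NULL REFERENCES users (id)  -- denormalized copy, so "all Ys for a user"
                                                    -- doesn't have to join through xs
);
```

The properly normalized model the quoted author asks about is simply this schema with ys.user_id dropped, always joining through xs; we keep the extra column because that join sits on a hot path.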
The author is all about "the correct answer to the problem", but is never explicit about what is meant by "the problem". In the case of software development, the problem includes performance constraints and resource usage management. It is not a correct answer if it cannot be supplied within the time needed, using no more than the available memory and disk.
The example doesn't seem that convincing to me? 22/7 is correct to 3sf, and the identity would remain -1.00 (-0.00j) if also specified to 3sf.