> Normalization was built for a world with very different assumptions. In the data centers of the 1980s, storage was at a premium and compute was relatively cheap. But the times have changed. Storage is cheap as can be, while compute is at a premium.

Normalisation isn't primarily about saving storage; it's about avoiding update anomalies, i.e. correctness.
Author here! If you want more, I just released a book on DynamoDB yesterday: https://www.dynamodbbook.com/. There's a launch discount for the next few days.

The book is highly recommended by folks at AWS, including Rick Houlihan, the leader of the NoSQL Blackbelt Team at AWS [0].

Happy to answer any questions you have! Also available on Twitter and via email (I'm easily findable).

[0] https://twitter.com/houlihan_rick/status/1247522640278859777
> "With denormalizing, data integrity is more of an application concern. You'll need to consider when this duplicated data can change and how to update it if needed. But this denormalization will give you a greater scale than is possible with other databases."<p>There's the big catch. As another poster pointed out, normalisation is not about efficiency. It's about correctness. People have been quick to make the comparison between storage and compute cost. The high cost of development and bug-fixing time trumps both of them by an order of magnitude. The guarantee of referential-integrity alone that SQL offers helps eradicate an entire class of bugs for your application with no added effort. This article glosses so blithely over this critical caveat. Whenever this discussion comes up I'm quick to refer back to the yardstick of "Does your application have users? If so, then its data is relational". I can't wait for the day when we look back at NoSQL as the 'dancing sickness' of the IT world.<p>It's also worth questioning: 'At what scale does this tradeoff become worthwhile?' Another poster here correctly pointed out that modern versions of Postgres scale remarkably well. The tipping point where this kind of NoSQL implementation becomes the most efficient option is likely to be far beyond the scale of most products. It's true that completely denormalising your data will make reads much faster, this is undeniable. This does not mean you need to throw the baby out with the bathwater and store your master data in NoSQL.
At what point do these auto-sharding databases like DynamoDB become worth the effort these days? You can squeeze a lot out of a single Postgres instance, and much more if you go with read replicas or Redis caches.

When you start with a relational model, you don't need a priori knowledge of your data access patterns, and you get solid performance and guarantees. If you need that access knowledge beforehand, is DynamoDB really best for scaling mature products?
There is a lot more you should learn about DynamoDB, but I appreciate the effort of the author. Please read the AWS documentation; it's not that big, and it explains vital things that just aren't in this article. Very important things like:

- LSIs can't be created after the table is created (sketched below)

- Creating a GSI late generates back pressure on the main table

- If you have an LSI, each item collection (all items sharing a partition key) can't grow beyond 10 GB

- How often a table can scale up and down per day/hour

- The cost of auto-scaling in CloudWatch (alarms aren't free)

...and so much more. I've been working with DynamoDB for over 2 years now and I love it.
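On the first point: because an LSI can only be defined at table creation, it has to go into the initial CreateTable call. A minimal boto3 sketch, with hypothetical table, attribute, and index names:

```python
import boto3

client = boto3.client("dynamodb")

# LSIs share the table's partition key and can only be defined here, at
# creation time; there is no way to add one to an existing table later.
client.create_table(
    TableName="Orders",  # hypothetical name
    BillingMode="PAY_PER_REQUEST",
    AttributeDefinitions=[
        {"AttributeName": "CustomerId", "AttributeType": "S"},
        {"AttributeName": "OrderId", "AttributeType": "S"},
        {"AttributeName": "OrderDate", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "CustomerId", "KeyType": "HASH"},
        {"AttributeName": "OrderId", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[
        {
            "IndexName": "ByOrderDate",
            "KeySchema": [
                {"AttributeName": "CustomerId", "KeyType": "HASH"},
                {"AttributeName": "OrderDate", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
)
```

And the moment that LSI exists, the 10 GB item-collection cap applies, which is why it's worth deciding up front whether a GSI would do the job instead.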
I love PostgreSQL.

Did business with a startup, signed up, started getting service; they play an intermediary biller/payor role.

Because of an issue with the company name used in signup, their billing system fell over and didn't set up billing.

But what was crazy is I quickly realized this shop was a NoSQL shop. NOTHING connected to anything, so since they hadn't built any reports to cross-check any of this, they literally did not notice (I noticed other consistency issues elsewhere).

In a SQL database this stuff, especially around accounting/money, is 101 stuff, but NoSQL seemed to really struggle here based on how they'd set it up.

I finally bugged them to charge us, but even that was a bit goofy (basically it looked like they exported some transactions to a credit card system, but I doubt they had any logic to handle failed payment issues, etc.).

We have one other vendor where the amount actually charged is a few pennies off the receipt totals: issues with doubles, rounding, and application logic that doesn't simply use the same number (from the database) for order/item detail, total, and billing.

So at least in finance/accounting, a ledger that is basically a source of truth, and is linked/summarized/etc. in various ways to other parts of the system (sales tax, receipts, credit card charges by order, etc.), really results in some free consistency wins that don't seem to be free in NoSQL land.
This is a great post, and DDB is a great database for the right use cases. I want to give a shout out to FaunaDB for anybody looking for alternatives: it's also serverless, crazy scalable, and usage-based in its pricing. Its downside is its proprietary FQL query language, not because it sucks (it doesn't!) but because there is a learning curve. They provide a rich GraphQL interface as an alternative to FQL. Its upside vs. DDB is a much richer set of functions, including aggregations, and first-class support for user-defined functions. Their attribute-based permissions system is phenomenal. It's definitely worth a look if you're considering DynamoDB but want something that takes less upfront planning about access patterns.
Having spent a few years working with DynamoDB to build multi-region, multi-tenancy platforms, I must say that DynamoDB is a good fit as a supplementary data store, i.e. you should only store a subset of the information managed by your serverless microservice. DynamoDB multi-region replication is just amazing. Unfortunately, we had a few massive billing spikes with DynamoDB, and we ended up adding pricing measurement and tests to track read/write units in all our functions.

I generally don't recommend DynamoDB as a primary data store, irrespective of your use case. It takes too much time to model the data. With every new requirement, you have to redo a lot of the modelling exercise. Choices you made in the beginning start looking bad, and you will not remember why you created that particular combination of composite key, or a local secondary index that offers no benefit after incremental changes. Transaction support is painful; the existing SDKs just don't cut it.

I often wish some of the GCP Firebase features were available in DynamoDB, like namespaces, control over daily throughput to avoid billing spikes, and better transaction support.
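For anyone who wants to do the same kind of read/write-unit tracking, DynamoDB will report the capacity each call consumed if you ask for it. A minimal boto3 sketch, with a hypothetical table name and key layout:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("TenantData")  # hypothetical name

def query_with_cost(tenant_id: str):
    # ReturnConsumedCapacity makes DynamoDB return the RCUs this call used,
    # which you can log or push into your metrics system to catch the kind
    # of billing spikes described above.
    resp = table.query(
        KeyConditionExpression=Key("PK").eq(f"TENANT#{tenant_id}"),
        ReturnConsumedCapacity="TOTAL",
    )
    consumed = resp["ConsumedCapacity"]["CapacityUnits"]
    print(f"query consumed {consumed} RCUs")  # or emit a custom metric here
    return resp["Items"]
```

We wrapped every data-access function this way so the per-function unit counts could be asserted in tests and graphed in production.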
Curious: besides DynamoDB being a truly serverless and scalable database, why else would one choose to model relational data in it? With the 'single table design' scheme the author talks about, you're in a world of hurt if you need new access patterns, which is highly probable for most systems.
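Concretely, the usual escape hatch seems to be bolting another GSI onto the table over generic attributes and then backfilling every existing item with the new index keys, which is exactly the kind of migration pain I mean. A rough boto3 sketch (table, index, and attribute names are hypothetical):

```python
import boto3

client = boto3.client("dynamodb")

# GSIs (unlike LSIs) can be added to an existing table, so a new access
# pattern usually means creating an index over generic attributes
# (GSI1PK / GSI1SK here) and backfilling those attributes onto old items.
client.update_table(
    TableName="AppTable",  # hypothetical name
    AttributeDefinitions=[
        {"AttributeName": "GSI1PK", "AttributeType": "S"},
        {"AttributeName": "GSI1SK", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "GSI1",
                "KeySchema": [
                    {"AttributeName": "GSI1PK", "KeyType": "HASH"},
                    {"AttributeName": "GSI1SK", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        }
    ],
)
```

For provisioned-capacity tables the Create block also needs a ProvisionedThroughput entry, and the backfill itself is a scan-and-update job you have to write and run yourself.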
DynamoDB seems to be so low level that it takes a lot of design and programming effort to get right. Are there any higher-level solutions that build on DynamoDB and take care of these things automatically? For example, denormalization sounds pretty error-prone if you implement it by hand.
"Normalization was built for a world with very different assumptions. In the data centers of the 1980s, storage was at a premium and compute was relatively cheap."<p>But forget to do normalisation and you will be paying 5 figures a month on your AWS RDS server.<p>"Storage is cheap as can be, while compute is at a premium."<p>This person fundamentally does not understand databases. Compute has almost nothing to do with the data layer - or at least, if your DB is maxing on CPU, then something is wrong like a missing index. And for storage, its not like you are just keeping old movies on your old hard disk - you are actively accessing that data.<p>It would be more correct to say: Disk storage is cheap, but SDRAM cache is x1000 more expensive.<p>The main issue with databases is IO and the more data you have to read, process and keep in cache, the slower your database becomes. Relational or non-relation still follows these rules of physics.
> To handle [compound primary keys] in DynamoDB, you would need to create two items in a transaction where each operation asserts that there is not an existing item with the same primary key.

There are other approaches. In cases where I've needed compound keys, I've had success using version-5 UUIDs as the primary key, constructed from a concatenation of the compound key fields. The advantage is that Dynamo's default optimistic locking works as expected with no transaction needed. A potential disadvantage is that if you frequently need to look records up by just one component, you'd need a secondary index instead of the primary key doing double duty.
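A minimal sketch of the idea in Python (the field names and separator are hypothetical); uuid5 is deterministic, so the same pair of fields always yields the same primary key:

```python
import uuid

# Any fixed namespace UUID works, as long as it never changes;
# NAMESPACE_DNS is just a convenient built-in choice.
KEY_NAMESPACE = uuid.NAMESPACE_DNS

def compound_id(tenant_id: str, order_id: str) -> str:
    # Deterministic: the same (tenant_id, order_id) pair always maps to the
    # same UUID, so it behaves like a compound primary key. The "#" separator
    # avoids ambiguous concatenations like ("ab", "c") vs ("a", "bc").
    return str(uuid.uuid5(KEY_NAMESPACE, f"{tenant_id}#{order_id}"))

print(compound_id("tenant-42", "order-7"))
```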
Thanks for this, Alex! Will definitely check this out. DynamoDBGuide.com was a huge help to me when I was learning serverless last year to build wanderium.com. There's definitely a learning curve for DynamoDB, but the performance is really good.

Do you talk about the best way to do aggregations in your book? That's one of the more annoying downsides of DynamoDB that I've kind of had to hack my way around. (I combine DynamoDB streams with a Lambda function to increment/decrement a count.)
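For others hitting the same wall, the streams-plus-Lambda counter looks roughly like this sketch (hypothetical table, key, and attribute names; not a copy of anyone's production code):

```python
import boto3

table = boto3.resource("dynamodb").Table("Counts")  # hypothetical table

def handler(event, context):
    # Lambda handler attached to a DynamoDB stream: bump a counter item up
    # or down as items are inserted into or removed from the source table.
    delta = 0
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            delta += 1
        elif record["eventName"] == "REMOVE":
            delta -= 1
    if delta != 0:
        # Note: stream processing is at-least-once, so retries can
        # double-count; the total is approximate unless you de-duplicate.
        table.update_item(
            Key={"PK": "STATS", "SK": "ITEM_COUNT"},
            UpdateExpression="ADD itemCount :d",
            ExpressionAttributeValues={":d": delta},
        )
```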
Once you get comfortable modeling your data in Dynamo, it becomes hard to justify using RDBMS with all of the overhead that goes along with it. Dynamo is not for every use case, of course, but as long as you design the table generically enough to handle adjacency lists and you don't mind making more than 1 query sometimes, it works really well.
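For anyone unfamiliar with the adjacency-list pattern being referenced: entities and the relationships between them live in one table under a shared partition key, so a single Query returns an entity plus its edges. A small boto3 sketch with hypothetical key and attribute names:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")  # hypothetical name

# Adjacency-list style items: the org's own record and its membership
# "edges" share a partition key, so one Query fetches them all together.
table.put_item(Item={"PK": "ORG#acme", "SK": "ORG#acme", "name": "Acme Inc"})
table.put_item(Item={"PK": "ORG#acme", "SK": "USER#alice", "role": "admin"})
table.put_item(Item={"PK": "ORG#acme", "SK": "USER#bob", "role": "member"})

# Access pattern 1: the org plus all of its members in a single query.
org_and_members = table.query(
    KeyConditionExpression=Key("PK").eq("ORG#acme")
)["Items"]

# Access pattern 2: just the membership edges. The "more than 1 query"
# case shows up for the reverse direction (all orgs for a user), which
# typically needs a GSI keyed on SK/PK.
members = table.query(
    KeyConditionExpression=Key("PK").eq("ORG#acme")
    & Key("SK").begins_with("USER#")
)["Items"]
```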
IMS (https://en.wikipedia.org/wiki/IBM_Information_Management_System) called and wants its segmented hierarchies back. :-)
There is an unhealthy attachment to relational data stores today. It's a tool, not an architecture or a solution. We shouldn't start with them, and we often should exclude them from our operational implementations. Reporting and analysis systems benefit tremendously from relational data stores, but we learned years ago that separate operational and reporting systems provide optimal performance.

I suggest those of you still unfamiliar with NoSQL operational data storage patterns trust companies like Trek10 and Accenture (where I saw great success).
There's an aspect of software development relating to speed and agility. NoSQL data stores offer a schemaless approach that reduces a great deal of unnecessary friction. The amount of time we've spent managing relational schemas is ludicrously expensive. There are still great usage patterns for relational databases, but operational storage is not one of them; I'd argue it's an anti-pattern.