
Flexible schemas are the mindkiller

79 points by l0b0 about 1 year ago

16 comments

MattPalmer1086 about 1 year ago
I once worked at a place that had many different specialised applications serving different needs and different data.

The incoming CEO, who wanted to make his mark (let's call him Derek) and had an impressive background in marketing, decided that the right thing to do was to rationalise everything into one huge database.

They employed a small army of consultants who covered the entire office (it was a big office) in huge database schema diagrams. This took several months. Eventually they "discovered" that the only data the schemas had in common was a reference, a description, and a created-date field.

So they immediately cancelled the project and we all breathed a sigh of relief. Haha, only kidding. Of course they didn't. The right solution was to go schemaless. Enter MongoDB. They would achieve full flexibility going forward, and wouldn't have to subsidise a small diagrammatic wallpapering industry as a bonus.

At this point I was no longer involved and just watched the slow train wreck from afar. Years later all the applications had finally been migrated to use the new OneDatabaseToRuleThemAll™ system. I don't think any new functionality was delivered over that period. There were persistent performance issues for the larger data sets. Changing anything required the code to support all previously possible schemas, because they never migrated any data (it's schemaless, it's all super flexible!).

I think Derek left to go ruin somewhere else, with a huge success story on his CV.
shoo about 1 year ago
One useful piece of rhetoric is using "schema-on-read" instead of "schemaless". It helps make it clearer that the schemas aren't eliminated by using a "schemaless" data store; you're just pushing the responsibility of data validation into all of the data-consuming applications. Shift right!

I think I encountered "schema-on-read" vs "schema-on-write" in the book Designing Data-Intensive Applications.
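A minimal sketch of the "shift right" shoo describes; the `read_event` reader and its field names are hypothetical, not from any real codebase:

```python
import json

# "Schema-on-read": the store accepted anything, so this (hypothetical)
# consumer has to enforce the schema itself -- the schema didn't
# disappear, it moved into every reader.
def read_event(raw: str) -> dict:
    doc = json.loads(raw)
    if not isinstance(doc.get("id"), int):
        raise ValueError("event.id must be an int")
    if not isinstance(doc.get("created_at"), str):
        raise ValueError("event.created_at must be a timestamp string")
    # Every other consumer needs these same checks, or its own
    # slightly divergent copy of them -- the "shift right".
    return doc

print(read_event('{"id": 1, "created_at": "2024-03-01T12:00:00Z"}'))
```
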
andrelaszlo about 1 year ago
> there is a certain type of person that probably has all the mental horsepower required to be a phenomenal engineer that simply gets stuck on their "elegance"

and

> I do not understand how they are both smarter than me in many respects, and then still don't understand how stupid this all is.

This is such a good question. The absolute worst codebases I've worked on have been created by brilliant individuals. I remember spending *months* tracking down bugs leading to inconsistencies in the universal "Things" table. "Everything is a thing! Right?!" Wrong...
brylie about 1 year ago
> there is a schema, it lives in their incomprehensible code

I've started to think of this as a diffuse schema. I.e., the schema never really goes away in "schemaless" databases. It just spreads throughout the application in helper functions, mappings, and backward-compatibility hacks.

Perhaps the distinction is between diffuse and consolidated, or implicit vs. explicit, schema. Are there any similar models or articles that have further described this realization?
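A toy illustration of the diffuse-vs-consolidated distinction, with all names hypothetical: the same shape knowledge, first scattered across helpers, then stated once:

```python
from dataclasses import dataclass

# Diffuse schema: each helper quietly assumes part of the document's
# shape, and the backward-compat hacks live wherever they were needed.
def get_price(doc: dict) -> float:
    return float(doc.get("price", 0))   # assumes "price" is numeric-ish

def get_sku(doc: dict) -> str:
    return doc.get("sku") or doc["id"]  # old docs used "id" instead

# Consolidated schema: the same knowledge stated once, explicitly.
@dataclass
class Product:
    sku: str
    price: float

    @classmethod
    def from_doc(cls, doc: dict) -> "Product":
        return cls(sku=doc.get("sku") or doc["id"],
                   price=float(doc.get("price", 0)))

print(Product.from_doc({"id": "A-1", "price": "9.99"}))
```
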
polotics about 1 year ago
I am surprised no one mentioned the, to me, obvious explanation for Derek's behaviour: wilful obfuscation, at least partly aware of the consequences, with a calculated maximization of the consulting fee.

If the incentive structure is set up in such a way as to provide an easy local maximum for the individual that is not a good outcome for the company, then it is a management failure.

Having a situation where all individuals up the management chain are doing short-term local maximization that, medium term, leads to a bad outcome for everyone is a societal failure.
baazaa about 1 year ago
Tangentially related: it now takes our data eng team 6 months to mirror some tables (into Databricks) due to data vault modelling... presumably to handle schema changes. And then at the end of it everything is riddled with duplicates and missing data, because they don't know what they're doing. But none of the source systems can do schema evolution anyway, so we know the schemas would never change.

I think as an industry we should stop warning juniors about 'premature optimisation' (kids aren't even choosing the right data structures/algos/architectures and are getting terrible perf), and instead warn them away from premature scalability and premature 'flexibility'.
fabian2k about 1 year ago
Obviously, if you actually have a defined schema, then using anything like EAV or JSON support in your relational database is a bad, if not outright terrible, idea. Your queries get a lot more complex and you lose most of the type safety a rigid schema provides.

But there are cases where you need flexibility, and the very categorical dismissal of EAV and anything similar is not particularly helpful if you find yourself in a situation where you need that kind of feature. It's a lot better today with good JSON support in relational databases, but even that doesn't give you good index support unless you give up some of the flexibility. EAV is actually superior in that aspect if you don't put all your values into a string column.
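A sketch of the index-support point in SQLite, with hypothetical table and column names: giving EAV a typed numeric value column lets a range query use a real index, which a single catch-all string column cannot do correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attrs (
    entity_id  INTEGER NOT NULL,
    name       TEXT    NOT NULL,
    value_text TEXT,               -- typed value columns instead of
    value_num  REAL,               -- one catch-all string column
    PRIMARY KEY (entity_id, name)
);
CREATE INDEX attrs_by_num ON attrs (name, value_num);
""")
conn.executemany(
    "INSERT INTO attrs (entity_id, name, value_num) VALUES (?, ?, ?)",
    [(1, "weight_kg", 2.5), (2, "weight_kg", 40.0)],
)
# The range predicate can use the numeric index; had the values been
# stored as strings, '40.0' > '10.0' would only hold by lexicographic
# accident and '2.5' > '10.0' would (wrongly) match too.
rows = conn.execute(
    "SELECT entity_id FROM attrs WHERE name = ? AND value_num > ?",
    ("weight_kg", 10.0),
).fetchall()
print(rows)  # [(2,)]
```
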
andyjohnson0 about 1 year ago
Lone wolf developer in a small organisation with minimal supervision. Seen it before. I may even have been one once, way back.
j-pb about 1 year ago
Shoving triples into SQL is braindead, but there's a multi-billion-dollar industry around triple stores, graph databases, and RDF.

Datomic is in the same group, and I'd consider Rich Hickey one of the best programmers there are.
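For readers unfamiliar with the model: a toy triple store with wildcard pattern queries. This is only a sketch of the entity-attribute-value idea behind RDF stores and Datomic, not any real product's API.

```python
# A toy triple store: facts are (entity, attribute, value) tuples,
# and queries are patterns where None is a wildcard.
triples = {
    (1, "person/name",  "Rich"),
    (1, "person/likes", "Clojure"),
    (2, "person/name",  "Derek"),
}

def match(e=None, a=None, v=None):
    return [t for t in sorted(triples)
            if (e is None or t[0] == e)
            and (a is None or t[1] == a)
            and (v is None or t[2] == v)]

print(match(a="person/name"))  # every entity's name
print(match(e=1))              # everything known about entity 1
```
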
MrBuddyCasino about 1 year ago
> *I do not understand how they are both smarter than me in many respects, and then still don't understand how stupid this all is.*

Met this kind of person three times so far, and have been asking myself the same question. I suspect they live too much in their own head.
jfisher4024 about 1 year ago
I’m working with a Derek right now. Highly motivated and highly incompetent. Usually it’s one or the other. How do you deal with people like this?
peteradio about 1 year ago
This is such a good article. Poor Derek probably lacks communication skills, any firm mentorship, or a spine.
sam_lowry_ about 1 year ago
@lucidity can you give an example of a database test with views?
fifticon about 1 year ago
I have a question about this that I probably can't explain well in a single comment. Normally I have well-defined schemas for my data, and obviously prefer this for its uncountable benefits.

However, over the years I have occasionally and reluctantly used the 'Derek table' for importing domain data with a dynamic schema, aware that I will be paying dearly for it.

My question is: what alternatives are there? Other kinds of databases, maybe those used for 'big data'?

I must clarify:

(1) I know I could instead create proper db tables on the fly; this way, I _can_ have varying columns depending on each new set of domain data I import. However, if I do this, I will now have to dynamically build my SQL queries to refer to these varying tables and column names. The building of those SQL queries is not to fear, but their execution is, since each new variant is a not-previously-seen db execution plan, and some of those will hit weird performance problems. I am painfully aware of this, because I have worked (still do) 'lifeguard duty' on a production database, where I routinely had to investigate how a user had yet again created a dbms-choking query with exactly this technique.

(2) In the Derek approach, the usual move is to violently retrieve 'all the query-relevant records' (e.g., the contents of 'project'/'document'), and then do the actual calculation in code, e.g. C#. This of course has the downsides that:

(2.1) We must haul huge (relatively speaking) amounts of raw data off the db, since we are not summarizing it to the calculation end result beforehand. But this is also the 'benefit': we know we won't bother the db further than this initial and rather simple haul/read. (I am aware I could also query the Derek monstrosity directly, but I have seen enough of that to know to avoid it, to not bring the dbms to the curb.)

(2.2) This is an extension of 2.1: since we are working with the raw data, we must 'pay' both for moving the big chunk of data over the network and, often, for holding it in working memory on the external processing/calculation server (this may not be true if the calculation can be done piecewise, working on a stream). And, of course, there is the entire cost of 'a single table cell is now a whole db row'.

Echoing poor Derek, the benefit of the described approach is that it takes relatively simple code to build the solution this way, at a tremendous cost in resources/efficiency. If I did 'the right thing', I would have to write considerably more, and more complex, code to handle the dynamic DDL/schema processing and work with the real DB schema dynamically.

Back in the 90's, I would by necessity have 'done the right thing', since the Derek approach was doomed performance-wise then. But now, in the 2020's, we have so much computing power that we can survive, for a time, by wasting extravagant amounts of resources.

To recap/PS: whenever I have done this, it has been for a small subset of specific data; I have never done it as the insane 'one single table', with dynamic tables too. My case has always been 'dynamic fields'.

Also, for context: the 'calculation' to be done typically amounts to what could generously be called a big pivot-table operation on a heterogeneous set of data (which is why, expressed in proper SQL, the query would be rather verbose and unwieldy, accounting for all the heterogeneous and possibly-present fields on the different source tables/datasets).
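One possible answer to the "dynamic fields" question, sketched in SQLite (assuming a build with the JSON1 functions; table and field names are hypothetical): keep the stable core relational, put the dynamic fields in a JSON column, and add an expression index for any field you filter on often. Postgres jsonb with a GIN index is the analogous option there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (
    id      INTEGER PRIMARY KEY,
    project TEXT NOT NULL,   -- the stable core stays relational
    extra   TEXT NOT NULL    -- per-import dynamic fields live as JSON
);
-- An expression index makes one known-hot dynamic field cheap to
-- filter on, at the price of pinning down that bit of "flexibility".
CREATE INDEX doc_status ON document (json_extract(extra, '$.status'));
""")
conn.execute(
    "INSERT INTO document (project, extra) VALUES (?, ?)",
    ("alpha", '{"status": "open", "weight": 3}'),
)
rows = conn.execute(
    "SELECT id, project FROM document"
    " WHERE json_extract(extra, '$.status') = ?",
    ("open",),
).fetchall()
print(rows)  # [(1, 'alpha')]
```
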
FredPret about 1 year ago
Amazing writing! Reminds me of The Cuckoo's Egg for some reason.
edg5000 about 1 year ago
So Derek left after 8 months without delivering anything useful. The data was in a format that prompted immediate conversion to a correct relational database model.

You then took over and worked for ~8? months, taking good pay, working from home for a substantial portion, complaining about a less-than-perfect air conditioner, and left as soon as your paperwork got approved.

By the end of your time there, did you deliver something that they could put into use?

Of course those first months are painful, as management was unaware of the unusable state of the code.