Databases always have a schema (2018)

147 点作者 ash大约 6 年前

13 条评论

kerng大约 6 年前

At a high level there are names for what the author attempts to describe in computer science:1) Schema on writeThe schema is enforced by the storage layer when data is persistet, optimized for defined use cases and fast. ACID. Typical RDBMS systems.2) Schema on readThis is was big data query processing and data lakes are about. Data is not interpreted or given meaning until it is read from the store and put into context.Just thought that might add value to the discussion.

评论 #19249865 未加载

fmjrey大约 6 年前

I would definitely prefer an approach where schema is explicit and defined in a minimum number of places. RDBMS/SQL schema do let you define schemas, however an important aspect of how SQL handles schemas which links back to their inflexibility is how they're defined at the level of a table, meaning a (possibly ordered) set of attributes.A more flexible approach as taken by Datomic is to define schemas at the level of attributes, meaning an entity (table) does not have a schema, only attributes do. This requires attribute names to be within namespaces which can certainly be seen as equivalent to table names, but they don't have to, namespaces could also refer to an entire domain/subdomain/org/whatever.Defining schema at the level of attributes provides a greater level of flexibility because client only declare/specify/depend-on a partial schema that only lists the attributes it needs. It also helps avoiding the "place oriented programming" approach of result sets, whereby attributes must be in a certain order, and if one does not have a value, you have to introduce NULL values.

评论 #19247198 未加载

评论 #19247192 未加载

评论 #19246481 未加载

virtualwhys大约 6 年前

I don't necessarily buy the two schema argument brought up in the article.If you generate your model/domain based on the underlying database schema then you effectively have a single view of the world, one that will ideally blow up at compile time (if your language is statically typed) when the database schema changes.It's not perfect, if someone changes the schema in production then you're SOL, but at least it brings some sanity to the table vs. NoSQL where you simply have no idea what the state of the world is until you run your program.F# Type Providers are probably the gold standard wrt to binding schema to application code, but any code generation library will provide similar benefit.

评论 #19248939 未加载

评论 #19246477 未加载

评论 #19248042 未加载

评论 #19247528 未加载

jfengel大约 6 年前

The distinction between "database schema" and "data type" is the same impedance mismatch problem we've had for decades. You end up making an arbitrary transformation somewhere between your operational code and the persistence layer, at which time you have two different formats to represent your data. Any time you represent something twice, at least one of them is wrong, sooner or later.The "schema" is the persistence-layer discussion of the contracts that programmers use to keep separate parts of the program consistent (even when they're divided up among developers, or with yourself over time). The persistence layer may use that information to provide additional services that it can't provide on a blob, like indexing. It's "natural" to do that for performance reasons: you want to lay stuff out on the disk, and limit network traffic. But in terms of code maintenance, that's optimization, and the root of all evil.Few programming environments really take seriously the idea of eliminating that distinction. What I want is a language that pushes up notions like transactions and persistence right into the language itself, and let the compiler/VM take care of whether a piece of data is persistent or not. Ideally, I'd love a programming environment that could just be shut off at any moment and restore itself transparently, including unwinding that had become out of date. I gather that Inferno took that approach, but I haven't had a chance to use it.The relevant bit of that is that it would eliminate the arbitrary distinction between "database schema" and the rest of your program. You'd use the same data specification for both, without a translator (or rather, with the translator completely hidden in the operating environment).

评论 #19248694 未加载

评论 #19248876 未加载

评论 #19250053 未加载

jayd16大约 6 年前

Now, this is just my opinion but I hate ORMs for several reasons. Highest on the list is that I fundamentally disagree with their reason for existence. They're designed to help manage the code schema vs db schema "problem" stated in the article.The database schema is for the integrity and performance of data storage. The code schema is for the integrity of the business logic and api. Forcing both schema to solve both problems is a huge and IMO unnecessary pain. ORMs attempt to be all things to all use cases and you often end up with a lot of hoop jumping.They're not totally disjoint, of course. Its the same data so it will naturally be in a similar format and how code and db interact will often inform how the schemas are designed. The take away is, use the best schema for the current use case and solve the mapping along the intersections.My sense is that this is where a lot of the passion for NoSQL came from although its a bit misplaced.

评论 #19247911 未加载

aboutruby大约 6 年前

In my current case I have highly unstructured data stored as a jsonb in postgres with some basic fields for indexing. So the database doesn't have a schema and making many tables for each variation of the data would take an enormous amount of time.

评论 #19245879 未加载

评论 #19245920 未加载

评论 #19245774 未加载

kamikaz1k大约 6 年前

Pretty interesting that the author was upset someone said SQL = Schema and NoSQL = No Schema. It's usually pretty clear that the no schema part is usually in reference to no certainty in the schema of the data. Or I guess one could model it as a spectrum of certainty.Actually now mulling over it some more, I see their point. It's just that their consideration of schema is more holistic than typical. I find people tend to use "schema" to refer to what's in the database, while in code people refer to it as a "model", and that a schema is just an implementation detail of a model.

_bxg1大约 6 年前

This is a good take. I've been offended by this aspect of NoSQL for some time now, always using something like Mongoose to add schemas on top of it, but I'm starting to think that what I was really offended by was the lack of types inherent to plain JavaScript. It now occurs to me that a type system could be used to define the canonical form of a given DB schema.I spent a year and a half at an enterprise Java shop, and there was a ton of work there put into maintaining and syncing the Java types and the SQL schema, even though that Java application was the only thing that would ever use that database.I wonder what it would look like to have a schema definition system that was equally understood by the DB and the code, so you didn't have to pick one or the other or both.

评论 #19247419 未加载

评论 #19246752 未加载

评论 #19246489 未加载

najarvg大约 6 年前

Very interesting. FWIW Martin Fowler, Pramod Sadalge et al have spoken about the implicit schema for NOSQL dbs from a design pattern perspective for quite a while now. Here is an early 2013 slide from him which is very informational - <a href="https://martinfowler.com/articles/schemaless/" rel="nofollow">https://martinfowler.com/articles/schemaless/</a>

eadmund大约 6 年前

He's correct that there's always a version of the schema in one's code — he's also correct that it's also nice to have a version of the schema in SQL, where certain properties can be checked & verified.Wouldn't it be nice to have a schema DSL which enabled one to declare the schema in code, and have it checked at compile time? Sounds to me like that'd be the best of both worlds.And client programs could just import that package as a library, so every would be on the same page …

评论 #19245995 未加载

评论 #19246049 未加载

评论 #19245974 未加载

评论 #19246520 未加载

taude大约 6 年前

In the context of this conversation, I assume we're talking about a single data store for all of an apps data needs. However, I'd argue in more modern application (newer) that one would design with a couple different data stores. using NoSQL for true data that can't be definied by a schema, and then the traditional RDMS when both ACID and schemas are needed. And then, there's several other types of data stores in between.

nathan_long大约 6 年前

> your stored data always has a schema, unless your code neither reads nor writes it as anything except an opaque blob.This is well-put. I'd go further: if there's no schema, you don't have data at all. A row with labelled email and name fields is data; a string of free-form input in which the user describe him/herself is not data. You might possibly be able to extract data from it, but if you do, you'll be building a set of labelled fields.> You can tell that your code's implicit schema exists even with an SQL database, even if your code auto-reads the SQL schemas, by asking what happens if the DBAs decide to rename a bunch of tables and a bunch of fields in those tables, maybe dropping some and adding othersMaybe a quibble, but I'd say your code has partial knowledge of the relational db's schema, not that it duplicates the schema, just as it may have knowledge of the file system structure, the domain names of various servers, etc, and depend on that knowledge being accurate, without duplicating them entirely.> In some ways NoSQL is more honest than SQL, because it tells you straight up that it's entirely up to your code to have and maintain a schema.With a relational db it's not entirely up to your code to have and maintain a schema. Eg, your code can assume that every "user" record has an "email" field and that it's NOT NULL and always a string and always unique, assuming the db is set up to guarantee that. The part of your code that reads records can be sure of that, even if the part of your code that writes records is lax about checking (and therefore blows up a lot). With a NoSQL database, it's possible that some records have blank emails, or numeric emails, or don't have that field at all, or have that field named something different. Those things will happen unless you're very careful to ensure in your code that they don't, and also unless you're careful to never to let anything write to the database except your code.I think a differentiating question is "what is it that guarantees all your records of type X have the same schema?" If you use a SQL database, the answer can be "the db". With NoSQL it might be "a combination of app logic, background jobs and manual intervention for weird cases" or "nothing".In cases where you want to store JSON blobs, an RDBMS like PostgreSQL lets you have a JSON column. This is useful (eg) for cases where you need to capture some input now, and might get around to parsing it out into real data later - eg "we got these records from the legacy system and don't yet know if / whether we can import them".Finally, although I wouldn't advocate moving complex application logic to the database, there are some validations that can only be reliably done by the database itself or by leaning on the database. Specifically, any validation that relies on the current contents of the database, such as "don't allow duplicate user names" (unique constraint) or "don't allow creating a comment for a post that was deleted" (foreign keys) or "don't allow overlapping reservations" (PostgreSQL exclusion constraint). Application code can do a read, check, and insert, but two threads may have a race condition and insert conflicting data. Application code can ask the db to lock while it does this, but that's leaning on the db. Or the database schema itself can guarantee this in a safe and performant way. But only if the database has a real schema.

ianamartin大约 6 年前

This blog sort of makes the definition of schema so broad that it ceases to have any real meaning. And it's a stupid argument to have anyway--whether or not something qualifies as being called schema. The point is how it functions, not what you want to call it.I like to think of model code and sql schema as two branches of government in your data democracy. Where your code is roughly congress and your SQL schema is the supreme court. In a well-designed system, your code can do a lot of different things without running afoul of the courts. Code has quite a lot of freedom to operate on existing data and store what it needs for the future. But when there's a conflict, your SQL schema pops up and says, "No. You can't do that. That's in direct conflict of the stated goals of what this data is supposed to mean. Either don't do that, or go back and explicitly change what you mean."Regardless of whether you want to call it schema or not, NoSQL is an extremely weak court of final appeal when it comes to deciding whether the code is obeying the intention of the data design. That is sometimes a problem and sometimes not.In general, if you are dealing with data, you need type checking somewhere. Regardless of your opinion about programming languages, if you care about data integrity at all, it has to be checked for type correctness. Strong SQL schemas are far more important when the access layer is via dynamically typed or weakly typed languages. You can feel a lot better about a weaker schema if your language is statically typed. Sure, it's weaker and violates a certain separation of powers and flips the power dynamic to the code being final arbiter, but it's not always completely awful.On the other hand, if your application code is written in javascript and your database has no enforced schematic constraints, your data is already corrupt. You just don't know it yet.While theoretically we can account for weaker schemata in NoSQL data stores, the reality is that every useful dataset will eventually be accessed and written to by a client you didn't originally plan for. And that client isn't going to be aware of or play by the rules you intended to enforce in your code. So once again, your data is, in reality, already corrupt if you don't/can't enforce schema in your data store.This isn't always a killer problem. There are a lot of cases where your data isn't all that important and this is an okay trade off. But pretending that application code is schema because you want to call it schema and it sort of does some of the same things so therefore schema is an energy field created by all living code that surrounds us, penetrates us, and binds our data together is just kind of bonkers.