Schema on write is better to live by

93 pointsby hrishiover 3 years ago

12 comments

simonwover 3 years ago

I've come around to almost the opposite approach.I pull all of the data I can get my hands on (from Twitter, GitHub, Swarm, Apple Health, Pocket, Apple Photos and more) into SQLite database tables that match the schema of the system that they are imported from. I call this family of scripts Dogsheep - <a href="https://dogsheep.github.io/" rel="nofollow">https://dogsheep.github.io/</a>For my own personal Dogsheep (<a href="https://simonwillison.net/2020/Nov/14/personal-data-warehouses/" rel="nofollow">https://simonwillison.net/2020/Nov/14/personal-data-warehous...</a>) that's 119 tables right now.Then I use SQL queries against those tables to extract and combine data in ways that are useful to me.If the schema of the systems I am importing from changes, I can update my queries to compensate for the change.This protects me from having to solve for a standard schema up front - I take whatever those systems give me. But it lets me combine and search across all of the data from disparate systems essentially at runtime.I even have a search engine for this, which is populated by SQL queries against the different source tables. You can see an example of how that works at <a href="https://github.com/simonw/datasette.io/blob/main/templates/dogsheep-beta.yml" rel="nofollow">https://github.com/simonw/datasette.io/blob/main/templates/d...</a> - which powers the search interface at <a href="https://datasette.io/-/beta" rel="nofollow">https://datasette.io/-/beta</a>

评论 #28261413 未加载

评论 #28262275 未加载

评论 #28264273 未加载

评论 #28263566 未加载

评论 #28261397 未加载

评论 #28262660 未加载

johnorourkeover 3 years ago

(commenting specifically in the context of a personal information collection)Schema-on-write increases the cost (mental load, time) of finding useful things in the first place, which isn't always what you want. I happen upon an interesting HN article (like this) and (like the author) dump it in my "later" pile - often that's during a little ADHD-brain distraction, where I just browse and read for a few mins before jumping back into a piece of work. I specifically DO NOT want extra cognitive load at that point.The balance is always a bit of both - sometimes organise, sometimes defer it. But always allocate some time for "emptying" those queues, lists, inboxes, etc - that's the critical part of staying organised and usually the hardest.The worst temptation is to automate the process - eg. "I'll auto-import all my data into organised tables" or "I'll use ML to automatically tag stuff" - that's fine, but the purpose is to help you learn and remember, and that automation will just create even more data for your "to look at later" pile.I'm having some joy using Obsidian - it's so quick for creating interlinked documents and notes, task lists, etc, and from a technical PoV they're all just plain Markdown documents stored in Dropbox - future-proof and device-proof.

评论 #28263552 未加载

tjoffover 3 years ago

This hits home in lots of ways and I'm going through a similar struggle and realizations.Thing is, I used to have an excellent memory. I could recall a four year old blog post I read once and instinctively know, whether it would help me with my current task (or enough of it to solve it right away).But now my memory is failing me (overload? stress? who knows). I still instinctively know I've read something related to something but I can't remember enough for it to help me, nor to find it again. I can't remember the punch-line, or if there even was one. It makes for the most boring and cringe anecdotes you've ever heard. "Oh, yeah! I read about that, there was a thing and then there was a conclusion. I'm not sure on the thing and the conclusion could go either way." Worthless. Yet I still go for it, because I'm used to remember enough of it for it to be helpful/relevant. Now I can barely trust what I do think I remember.I've come to the same conclusion. Schema on write. But as is noted "It is a lot of work.". And I'm struggling. Because previously the very useful and rewarding "hoarding" (wasn't set out to hoard information, was just a side effect driven by curiosity) was dirt cheap. But now it takes a lot of work to condense, and it isn't at all clear that the effort is worth it. For sure it depends. But even the act of looking up condensed information tilts the scale a bit, you ~need to know that the thing you are looking for exists in the first place or it might just become another distraction. But it doesn't matter, I'm too tired for it anyway and my mind refuses to adapt. Wanting more, but I can't cope and gets nothing done.More than a schema and order I need balance, perhaps by (forcefully?) constrain myself. Someone on HN tried text-mode-only linux ( recommend the read/skim <a href="https://dev.to/jackdoe/tty-only-1ijn" rel="nofollow">https://dev.to/jackdoe/tty-only-1ijn</a> ). Maybe. I am lost. (hence the ramble)

评论 #28263537 未加载

评论 #28261519 未加载

评论 #28262686 未加载

bccdeeover 3 years ago

I somewhat disagree with this.I've tried to organize notes and documents schema-first several times, and it's never worked. For instance, consider a system of notes where each note has tags: I write a note, come up with some tags for it, then I write a second similar note and end up thinking "hang on, I feel like I chose the wrong tags for the first one." So I have to come up with new tags for both.Either I'm constantly revising how I've tagged older notes, or I have to think so hard about how I'm going to tag each new note before I write it that I never want to write any notes. Before I write anything, I need to think up a set of tags which are specific enough to be useful but general enough to re-use.Instead, these days, I use Logseq [1], and I'm really liking it. In Logseq, everything is bullet lists. If you have a new note to write, you just append it to your root list. If two notes feel like they belong together, indent them under a parent. If any list starts becoming too bloated, you can move a subtree to a new page.A list like<pre><code> Main List - Entry 1 - Entry 2 - Entry 3 </code></pre> becomes<pre><code> Main List - Entry 1 - Category A - Entry 2 - Entry 3 </code></pre> which becomes<pre><code> Main List - Entry 1 - <hyperlink to "Category A"> Category A - Entry 2 - Entry 3 </code></pre> This way, the schema grows organically over time, from the bottom up. Instead of having to think up a system of classification before I start writing, I just write, and then classify later as information accrues.Logseq has tags and block embeds and many other features too, but the core nested list model is what has really attracted me to it. I'm sure it's not the only note app that works along these lines (I'm always open to suggestions), but it's open-source and it works quite nicely.[1]: <a href="https://logseq.com/" rel="nofollow">https://logseq.com/</a>

评论 #28263515 未加载

kthejoker2over 3 years ago

It's very simple:Schema on write is upfront costs for lower per unit costs later.Schema on read is no upfront costs and then higher per unit costs thereafter.The value of schema on write is based on how much lower the per unit costs will be, how many units you expect, and how much the upfront costs will be.Most people in ELT circles completely discount the per unit costs (many vendors are indeed incentivized to maximize these costs)And the flipside is a lot of ETL is somewhere between YAGNI and unnecessary extravagance.But broadly anyone who recommends one or the other without showing even a back of the napkin analysis of these costs are not serious data engineers.PsThree other real world metaphors for schema on read vs schema on write:* renting vs a mortgage* eating out every night vs learning how to cook* (pre pandemic) living far away from work to save money vs living closer to work.

Timothycquinnover 3 years ago

To OP - You sound kinda like me. I went over to Ecco Pro when I started doing programming as the amount of information coming into my head was way too much and I needed something that had low overhead seamless moving and organization of data. I found A single panel outliner (like Ecco Pro, Omni outliner...) works great for my brain.Every time I download a notetaking app to try a new system, I pray it has an single panel outlining feature but nope. I can't count the number of times I considered writing my own.Now you have mentioned Notion, I checked and it has an outline mode built right in. Woo Woo!! The only thing left to sell me is Linux support which is my primary DT env.Thanks for the post!

评论 #28263522 未加载

throwthereover 3 years ago

This is kind of the "organize your file structure versus throw everything in a single directory and rely on full text search" argument from the old days when everything important was saved in a file.I think if most of what you're organizing is just queued things to do-- videos to watch, articles to read-- your problem isn't organizing those, it's throughput (focus).

评论 #28261507 未加载

pjdkochover 3 years ago

I think the schema is already there. You can read and understand it, right? So its has more structure for you than just entropy.How to capitalize this is indeed a harder question. My mind gravitates more towards something like GPT3. If you fed it all your memories (expensive, I know) and gave it a prompt of "I was thinking about a funny gif with a dog and a mug in a room on fire", it'd for sure get you the this_is_fine.jpg. But the answer is not quite what I typically see myself wanting. Instead, I want to navigate this graph (the understanding necessary to make the connection from the prompt to the jpg in the first place) myself, and see what's interesting and non-obvious there. That's what I interpret as "being able to talk with your zettelkasten".

评论 #28263512 未加载

snidaneover 3 years ago

If you believe in ELT over ETL, you want both.Schema on read is good for raw data stages. You want to get the data from external systems in the database no matter what. You don't have a control over it and it can change any time. Just get it into a blob, variant or text column in a raw table and then transform it to your schema later. This way you can implement validation logic for the raw data within your database, instead of doing that outside in your ETL system. So if you believe in ELT, just get the data in first.Of course you don't want to use stringly typed tables for any serious data analysis. You need to type the columns to make calculations perform and make the schema stable over data history. Instead of doing your typing logic, converting string and variant columns to native database types, in an ETL script, you do it in a the data warehouse.The transformation into types has to happen somewhere and if you skip schema on read, that only means you never adopted ELT and are stuck with your database AND an ETL system. Adopting ELT would mean you'd only use one system, the data warehouse to do both.

riwskyover 3 years ago

That is arguably why it’s called “in formation” in the first place

评论 #28303743 未加载

offskyover 3 years ago

I’m bookmarking this thread for later. So much useful information here that can help me in the future :)

InGoldAndGreenover 3 years ago

This is super interesting! I've always had a strong preference for schema on write, both in databases and in life - prefer to organize the cupboards when I first set up house, rather than dumping everything in and hoping for the best. So I'm definitely very inclined to immediately accept your basic premise: schema on write is substantially better.But I've ended up in several headed discussions on this, in the context of SQL vs NoSQL. The one argument I've been slightly persuaded by is: schema-on-read is significantly more flexible than schema-on-write. In most of my actual programming applications I still use SQL, because in coding I think it's better to prioritize planning and structure over flexibility.If there's one area where flexibility is necessary, it's real life. When I first start researching something new, I don't usually have enough knowledge to actually structure my schema effectively, and the usefulness degrades. Of course, you can update the schema. This isn't always great. If your schema needs to change constantly, it wastes a lot of time. Depending on complexity, it can also just be a massive cost. Eg: when I have the kitchen set up all nicely, but then we get a new blender with five different attachments and now I need to find an empty shelf for them. Took about an hour to reshuffle everything satisfactorily.All that said, I'd still say that schema-on-write is better than schema-on-read. Some structure is typically always better than no structure.However, I've recently been reading a book that I think gives an interesting different insight to this problem - Designing Data-Intensive Applications, by Martin Kleppmann. I've always considered the main categorization of databases to be schema-on-write vs schema-on-read, but this gives a completely different method: databases are either document-based, relational, or graph. Relational databases we're all generally familiar with, while document-based is similar to today's NoSQL.Graph databases have fallen out of favour, but I actually think that they might be the best at representing the human information gathering process. They have a structure that's provided on write, but isn't always consistent across entries - because it's flexible, and can be added to very easily. This lets us expand the schema as we gather more information and our view of the world changes, without needing to rearrange our past knowledge. I also feel like a graphlike structure better represents how we think.Honestly, the main useful point I got from the book is that schema-on-write doesn't need to mean lack of flexibility. That is the case with most of the RDSs we use today, so it's what I've come to expect. But that shows a lack of imagination on my part, rather than any inherent restriction.

评论 #28260738 未加载

评论 #28263505 未加载

评论 #28261433 未加载