I'm currently working on an application where I use SQLite as the file format. I want to keep the usual workflow for users, where you can make edits to your document and the file only changes when you save it.<p>So to open a file I copy it into a :memory: database [1]; then the user can do whatever manipulation they want and I make the changes directly in the database, so I don't need a model of the document other than its database format. To save the document I VACUUM [2] it back into the database file. It works quite well, at least for reasonably sized files (which is always the case for my app) :).<p>[1] <a href="https://www.sqlite.org/inmemorydb.html" rel="nofollow noreferrer">https://www.sqlite.org/inmemorydb.html</a><p>[2] <a href="https://www.sqlite.org/lang_vacuum.html" rel="nofollow noreferrer">https://www.sqlite.org/lang_vacuum.html</a>
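Roughly, with Python's standard sqlite3 module the workflow looks like the sketch below. The file name and the "paragraph" table are made up for illustration; the real app obviously differs:

    import os
    import sqlite3

    def open_document(path):
        disk = sqlite3.connect(path)
        mem = sqlite3.connect(":memory:")
        disk.backup(mem)                 # copy the on-disk file into the in-memory db
        disk.close()
        return mem

    def save_document(mem, path):
        mem.commit()                     # VACUUM cannot run inside an open transaction
        tmp = path + ".tmp"
        if os.path.exists(tmp):
            os.remove(tmp)               # VACUUM INTO requires a non-existent target
        mem.execute(f"VACUUM INTO '{tmp}'")
        os.replace(tmp, path)            # swap the fresh copy in atomically

    doc = open_document("document.db")
    # ... edit the in-memory copy with normal SQL, e.g.:
    doc.execute("UPDATE paragraph SET body = 'edited' WHERE id = 1")
    save_document(doc, "document.db")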
The problem with SQLite is that it's not a standardized file format. It's well documented and pretty well understood, for sure, but there's no ISO standard defining how to interpret an SQLite file in excruciating detail. The same goes for competing implementations: Zip and XML have a much smaller API surface than SQLite, whose API, apart from a bunch of C functions, is the SQL language itself. Writing an XML parser is not a trivial task, but it's still simpler than writing an SQL parser, query optimizer, compiler, bytecode VM, full-text search engine, and whatever else SQLite offers, without any data corruption in the process. If OpenOffice used SQLite, its programmers would inevitably start using its more esoteric features and writing queries that a less capable engine wouldn't be able to optimize well.<p>This isn't a concern for most software. If you're writing a domain-specific, closed-source application where interoperability with other apps or ISO standardization isn't a concern, SQLite is a perfectly fine file format, but as far as I understand the situation, those concerns did exist for OpenOffice.
I was optimistic that Audacity adopting SQLite would be a substantial improvement in its file-saving capabilities. In practice I encountered many gotchas:<p>- On Linux, saving to a new file on a root-owned but world-writable NTFS mount created in /etc/fstab fails with permission errors (or something like that). Saving into an existing file works as usual.<p>- Files are modified on disk as you edit the project in the program, creating spurious Git diffs if you check Audacity projects into Git as binary blobs. And when you save the file, old and deleted data is left in the SQLite file until you close the project's window (unlike saving a file in a text editor), and you can accidentally commit that into a Git repo if you don't close the window before committing. (I recall at one point you had to manually vacuum the .aup3 file, but now closing the window is sufficient.) I'm getting Word 2003 Fast Save vibes.
Good article. Although one thing I do like about OpenDocument being just a bunch of XML files in a ZIP archive is that it is fairly easy to generate documents like spreadsheets without using a (potentially hefty) library that knows about the document format.<p>I have a use case where users of a web service want to use data, exported as a bunch of rows in a table, in a variety of tools. Now, CSV with UTF-8 encoding is, of course, totally open, conventional, and workable, but anyone who has ever offered CSV files to end users will know the pain of those users getting stuck when they want to open the files in a spreadsheet application¹. So I saved a sample spreadsheet in OpenDocument's ODS and another in that Microsoft XML abomination called OOXML as XLSX, and just figured out the basics of those XML formats. I trimmed the ZIP archives down to the essentials, marked the places where content goes, and now build a new spreadsheet file whenever data is requested in that format (a sketch of the approach is below). Now I can output CSV, ODS, and XLSX (and JSON thrown in for good measure) from the same data.<p>Doing this with SQLite would be possible of course, just a tad more complex and slower to develop. Being able to fire up the office suite, create a template document, and just dig into the XML files in the saved file is a nice feature (although admittedly of niche interest).<p>1: More specifically, users who use Excel in a locale like nl_NL, where CSV files are, hardcoded, assumed to have their columns separated by semicolons, because Microsoft once notoriously decided that the Dutch did not use commas in a comma-separated values file.
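To give a flavour of the template-filling idea in Python (just zipfile, no spreadsheet library): the template name and the "@@ROWS@@" placeholder are inventions for this sketch, and a real ODS also expects the "mimetype" entry to come first and be stored uncompressed.

    import zipfile

    def render_ods(template_path, out_path, rows_xml):
        with zipfile.ZipFile(template_path) as src, \
             zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                data = src.read(item.filename)
                if item.filename == "content.xml":
                    # splice the generated rows into the marked spot
                    data = data.replace(b"@@ROWS@@", rows_xml)
                if item.filename == "mimetype":
                    # the mimetype entry must be stored, not deflated
                    dst.writestr(item, data, compress_type=zipfile.ZIP_STORED)
                else:
                    dst.writestr(item, data)

    rows = (b"<table:table-row><table:table-cell office:value-type='string'>"
            b"<text:p>hello</text:p></table:table-cell></table:table-row>")
    render_ods("template.ods", "export.ods", rows)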
As an aside, this blew me away. I can hardly believe it. No nested query required?<p>> SELECT manifest, versionId, max(checkinTime) FROM version;<p>> "Aside: Yes, that second query above that uses "max(checkinTime)" really does work and really does return a well-defined answer in SQLite. Such a query either returns an undefined answer or generates an error in many other SQL database engines, but in SQLite it does what you would expect: it returns the manifest and versionId of the entry that has the maximum checkinTime.)"
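It's easy to check with the stock sqlite3 module; this throwaway example (made-up data) shows the bare columns coming from the row that holds the maximum:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE version (manifest TEXT, versionId TEXT, checkinTime INT)")
    db.executemany("INSERT INTO version VALUES (?, ?, ?)",
                   [("m1", "v1", 100), ("m2", "v2", 300), ("m3", "v3", 200)])
    print(db.execute(
        "SELECT manifest, versionId, max(checkinTime) FROM version").fetchone())
    # -> ('m2', 'v2', 300): the other columns come from the row with the max value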
I shipped a product that used both SQLite and XML files.<p>One of the improvements I made was moving a few tables that contained small amounts of data to XML files. Because these files were small and rarely written, this simplified the data access layer and simplified diagnostics. (I made sure the files were multi-line, tab-indented XML.)<p>For "technical" people who needed to diagnose the product, asking them to crack open an SQLite database was a huge ask; but for the major part of the product that used SQLite, it was hands-down better than XML files. (An older version of the product used XML files. It had scalability problems because there's no good way to make an incremental update to an XML file.)<p>The advantages of XML, specifically being a human-readable format, really only hold for small files where the schema is designed for readable XML. Unfortunately, the need to always rewrite the entire XML file, and the "complexities" that come with lots and lots of features, quickly erode XML's biggest advantages.<p>IMO: a "lay" person needing to muck around with the internals of an office document is fringe enough that learning to use an SQLite reader is an acceptable speed bump. The limitations of XML + Zip, when it comes to random writes in the middle of a file, just can't be overcome by Moore's law.
ODT was designed to be standardised: while its predecessor format was very similar, it relies very heavily on existing standards such as XHTML, SVG, and CSS, to name but three (there are a lot more).<p>Without being able to call out to existing standards, the ODT spec itself would become massive. The effort to update the standards appears to be significant and hasn't progressed much in recent years already :/<p>I think realistically, an SQLite-based format could be offered as an option, but the office-doc ship has really sailed.<p>It's a good argument for formalising the SQLite file format as a standard, though...
Coupling a file format to SQLite smells wrong.<p>SQLite is good, but it is also fairly unique in this space. Why? Because it’s hard to replicate everything it does, because it does a lot.<p>But… for this use case, do we need it to do a lot? No, not really. We don’t need the full SQL standard, a query optimiser, etc. for basic (and safe) transaction semantics and the ability to store data in a simple table structure.<p>Perhaps there is a better file format we could use, but it would be better if it were decoupled from SQLite.
Another example: raster map tiles (basically up to millions of tiny square pictures).<p>Zip vs. tar vs. the filesystem vs. SQLite: I tested all these scenarios, and SQLite was the fastest and the smallest, even beating plain archives with no overhead.
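For anyone curious, the layout I mean is roughly the MBTiles convention: one row per tile, keyed by zoom/column/row. A minimal sketch (file names invented):

    import sqlite3

    db = sqlite3.connect("tiles.db")
    db.execute("""CREATE TABLE IF NOT EXISTS tiles (
                      zoom_level  INTEGER,
                      tile_column INTEGER,
                      tile_row    INTEGER,
                      tile_data   BLOB,
                      PRIMARY KEY (zoom_level, tile_column, tile_row))""")

    with open("0_0_0.png", "rb") as f:        # a hypothetical tile image
        db.execute("INSERT OR REPLACE INTO tiles VALUES (?, ?, ?, ?)",
                   (0, 0, 0, f.read()))
    db.commit()

    # indexed point lookup, no need to scan an archive
    row = db.execute("SELECT tile_data FROM tiles "
                     "WHERE zoom_level=? AND tile_column=? AND tile_row=?",
                     (0, 0, 0)).fetchone()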
OpenDocument is zipped images and XML, which implies you parse the entire format and put it in RAM. Frankly, I don't see how SQLite improves on this. XML isn't ideal, but it's zipped, so there's no huge size penalty here.<p>All the benefits the SQLite article lists (and I love SQLite to death, by the way) can be had by making SQLite the runtime model of the document, on disk and in memory. But SQLite doesn't need to be the transport format. In fact, an SQLite file can easily get bigger than the current format: SQLite files accumulate unused space as you mutate them, and can get fragmented and sparse. And if you need to optimize the file on every save, the "fast save" benefit goes away.<p>There are formats that do need delta updates and quick indexed look-ups without fully loading the file into RAM, and this is why so many apps use SQLite as a file format. I just feel OpenDocument was a bad pick for SQLite in this hypothetical scenario.
Implementing versioning in the file format conflicts with git, because each document is essentially its own little source control system. This can be surprising to users who copy the file and don’t realize that they’ve effectively copied the entire repo. Copying a file will sometimes include drafts they didn’t want to share. It can mean you lose control over when things are committed, so you don’t end up with a useful history.<p>If you then check the file into git, you are storing one source control system inside another, and older versions appear in two different histories. To be git-friendly, you don’t want to save anything other than the current version, and then let git do its thing.<p>Possibly the answer is “don’t use git, we have it covered,” but then the app developer should realize that they are implementing something like a source control system. How do people share drafts, review them, and merge changes? How do you publish a release that includes only the version you wanted to release?<p>And it does seem relevant that the developers of SQLite actually did implement their own source control system [1]. Maybe they could have warned people about what they’re getting themselves into if they go down this route?<p>I wonder how terrible it would be to either use a git repo as your file format, or to build git compatibility into your app somehow so you could push and pull?<p>[1] <a href="https://en.m.wikipedia.org/wiki/Fossil_(software)" rel="nofollow noreferrer">https://en.m.wikipedia.org/wiki/Fossil_(software)</a>
Man do I love SQLite.<p>Over the past 1.5 years I've built a computer vision tool, from the recording hardware/software, to deep learning pipelines, to the front-end; we had some requirements on the recording side that were difficult to solve with existing solutions (storing exactly timestamped camera frames, GPS data, car telemetry and other metadata).<p>Using an SQLite-backed data format for the video recordings made implementing things ourselves super straightforward.
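(For a sense of what such a format can look like; this is a made-up sketch with one table per stream, all keyed by timestamp, not the actual schema:)

    import sqlite3

    db = sqlite3.connect("recording.db")
    db.executescript("""
    CREATE TABLE frames (
        camera_id INTEGER NOT NULL,
        ts_ns     INTEGER NOT NULL,   -- capture timestamp, nanoseconds
        jpeg      BLOB NOT NULL,
        PRIMARY KEY (camera_id, ts_ns)
    );
    CREATE TABLE gps (
        ts_ns INTEGER PRIMARY KEY,
        lat REAL, lon REAL, speed_mps REAL
    );
    CREATE TABLE telemetry (
        ts_ns INTEGER NOT NULL,
        key   TEXT NOT NULL,
        value REAL,
        PRIMARY KEY (ts_ns, key)
    );
    """)

    # streams can then be joined by nearest timestamp when post-processing, e.g.:
    row = db.execute("SELECT lat, lon FROM gps ORDER BY abs(ts_ns - ?) LIMIT 1",
                     (123456789,)).fetchone()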
I don't want people to read my drafts. That could be highly embarrassing, and they should not make it into the final saved document.<p>Past versions and undo history should be stored separately from the document. They should be stored out of tree, where they won't be committed into some git repository or automatically synced or anything like that.
Sadly they did not include the bad sides:<p>1) Vulnerabilities: not only in SQLite, but also in wrappers like <a href="https://nvd.nist.gov/vuln/detail/CVE-2023-32697" rel="nofollow noreferrer">https://nvd.nist.gov/vuln/detail/CVE-2023-32697</a><p>2) Lack of transparency: a zip of XMLs contains only the XMLs; an SQLite file, by design, contains all kinds of traces with sensitive information, as well as empty blocks. Attempts to fix these issues (see the sketch below) remove the very benefits that were mentioned.<p>3) Lack of implementer support. It was one of the reasons WebSQL was deprecated many years ago.<p>4) Lack of standardization of the file format. SQLite does not even promise forward compatibility, only backward compatibility. That means new documents might not open in old software, or the vendor has to fork SQLite and only backport security patches.
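To illustrate point 2: the usual mitigation is something like the following, which works but forces a full rewrite of the file, undoing the incremental-save advantage touted elsewhere (the "drafts" table is hypothetical):

    import sqlite3

    db = sqlite3.connect("document.db")
    db.execute("PRAGMA secure_delete = ON")   # overwrite deleted content with zeros
    db.execute("DELETE FROM drafts")          # hypothetical table of old drafts
    db.commit()
    db.execute("VACUUM")                      # rewrite the whole file, dropping free pages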
Love the vibe of articles which present, let's say, reason-driven development vs. habit-driven development.<p>Why habit? Well, I can imagine that back when OpenOffice was a fresh project, it went like this:
"XML is going to stay forever and everybody uses XML, so ofc we use one... oh, it is so big! And there are many files, so we just zip'em"...<p>To be fair, the author of this excellent article doesn't even say about getting rid of XML in this format- but that could also be achieved by storing stuff in a SQLite file. Usage of XML was habitual thinking there- and not very visionary, as the format is dead now...
Yeah, what if? Then they wouldn't really have understood the purpose of markup languages as plain text files you can view and edit with generic text editors. There was no lack of proprietary formats such as MS Structured Storage (used by MS Office), and it was considered a big success when customers demanded open formats such as SGML/XML-based ones in the late '90s and '00s. The alternatives aren't even sequential (they have fragments, cross-pointers, etc.). Yes, they might be faster because they're closer to the in-memory representations used by the original/historic app, or are even primitive memory dumps; but marginal speed or size improvements were never the consideration. And if anything, SQL (almost as old as SGML, btw) is a joke as a document query language compared to basically any alternative specifically designed for the job (the ISO Topic Maps query language, i.e. Datalog; XPath and co.; SPARQL; DSSSL/Scheme; ...), because of SQL's COBOL-ness, its rigid schemas, lock semantics/granularity that are a really bad fit, etc.
Related:<p><i>What If OpenDocument Used SQLite? (2014)</i> - <a href="https://news.ycombinator.com/item?id=25462814">https://news.ycombinator.com/item?id=25462814</a> - Dec 2020 (194 comments)<p><i>What If OpenDocument Used SQLite?</i> - <a href="https://news.ycombinator.com/item?id=15607316">https://news.ycombinator.com/item?id=15607316</a> - Nov 2017 (190 comments)
SQLite-based file formats are also very easy to debug, which saves a lot of dev time. If my app writes a file and loading it back doesn't work, I can just open the file in SQLite and inspect it any way I wish, because I have the full power of SQL at my fingertips.
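For example (a trivial sketch; the file and table names are made up), opening the suspect file read-only and dumping its schema:

    import sqlite3

    db = sqlite3.connect("file:broken-output.db?mode=ro", uri=True)
    for name, sql in db.execute(
            "SELECT name, sql FROM sqlite_master WHERE type='table'"):
        print(name, sql)
    # then poke at whatever table looks suspicious
    print(db.execute("SELECT count(*) FROM some_table").fetchall())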
It's somewhat off topic I know, but is there something like sqlite but tailored for hierachical data? Like a xml document store rather than for relational data like sqlite is.
BLOBs in SQLite can be at most 2GB, or less depending on compile-time flags (the default SQLITE_MAX_LENGTH limit is 1 billion bytes). If you store a 2GB BLOB and another application uses an SQLite build with a smaller BLOB limit, good luck getting them to work together...
If you want to store content larger than 2GB in SQLite, you have to chunk it, manage the chunk sequence, etc. And with plain SQL you can't overwrite a fixed-size 2KB portion at a given offset; you have to rewrite the entire value (the incremental blob I/O API can patch a range in place, but it can't resize the blob), which is another reason to chunk.
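A rough sketch of that chunking approach (all names and the 1 MiB chunk size are arbitrary choices for illustration; the patch helper assumes the edit fits inside one chunk):

    import sqlite3

    CHUNK = 1 << 20   # 1 MiB chunks

    db = sqlite3.connect("bigfile.db")
    db.execute("""CREATE TABLE IF NOT EXISTS chunks (
                      file_id INTEGER, seq INTEGER, data BLOB,
                      PRIMARY KEY (file_id, seq))""")

    def write_file(db, file_id, stream):
        seq = 0
        while True:
            buf = stream.read(CHUNK)
            if not buf:
                break
            db.execute("INSERT OR REPLACE INTO chunks VALUES (?, ?, ?)",
                       (file_id, seq, buf))
            seq += 1
        db.commit()

    def overwrite_at(db, file_id, offset, data):
        # patching a range means rewriting only the chunk it touches,
        # not the whole multi-GB payload
        seq, start = divmod(offset, CHUNK)
        old = db.execute("SELECT data FROM chunks WHERE file_id=? AND seq=?",
                         (file_id, seq)).fetchone()[0]
        new = old[:start] + data + old[start + len(data):]
        db.execute("UPDATE chunks SET data=? WHERE file_id=? AND seq=?",
                   (new, file_id, seq))
        db.commit()

    with open("big.bin", "rb") as f:          # hypothetical large input
        write_file(db, 1, f)
    overwrite_at(db, 1, 4096, b"\x00" * 2048)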
Shameless plug of a couple of Python libraries I’ve been involved with that work around memory issues of ODS files (for very specific use cases):<p><a href="https://github.com/uktrade/stream-read-ods">https://github.com/uktrade/stream-read-ods</a>
<a href="https://github.com/uktrade/stream-write-ods">https://github.com/uktrade/stream-write-ods</a>
> The use of a ZIP archive to encapsulate XML files plus resources is an elegant approach to an application file format. It is clearly superior to a custom binary file format.<p>I feel like I have considerable disagreement with the author of these sentences.
I’m curious to know what a gsheet/doc/slide file actually is under the hood. As the user, I’m only ever presented with a link; there’s no way to download a gsheet in its native format.
The SQLite format is smaller than the original format only because XML is super verbose, so just about any uncompressed binary format ends up smaller than lightly zipped XML.<p>But SQLite files aren't small. One thing I don't understand is why they don't do string deduplication in SQLite (as in: you store a string only once, and every other occurrence is just a pointer to that string). It seems such an obvious and easy way to reduce file size and memory consumption, and therefore increase performance (less I/O). Is there a technical reason why this would not be desirable?
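You can get that effect today at the schema level by interning strings in a lookup table (which is essentially what XLSX's shared-strings part does); a rough sketch with invented table names:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE strings (id INTEGER PRIMARY KEY, s TEXT UNIQUE);
    CREATE TABLE cells (row INTEGER, col INTEGER,
                        string_id INTEGER REFERENCES strings(id));
    """)

    def intern(db, s):
        # store the text once, hand back its integer id
        db.execute("INSERT OR IGNORE INTO strings(s) VALUES (?)", (s,))
        return db.execute("SELECT id FROM strings WHERE s=?", (s,)).fetchone()[0]

    sid = intern(db, "repeated label")
    db.execute("INSERT INTO cells VALUES (1, 1, ?)", (sid,))
    db.execute("INSERT INTO cells VALUES (2, 1, ?)", (sid,))  # second use costs one integer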
> there's no ISO standard defining how to interpret an SQLite file in excruciating detail.<p>There comes a point where ISO-ing things doesn't help. The SQLite format belongs to SQLite, and an ISO standard would be rendered irrelevant by the SQLite team the moment they wished to make a change for any reason. Also, people would have to pay ISO for access to the specification. SQLite should be treated as a de facto standard defined by the SQLite project.
At this point, why are we still using JSON/XML for new projects when there is SQLite? Stop the nonsense of JSON/XML. SQLite is like JSON, but very queryable. Just send SQLite files around.<p>(MongoDB also stores documents in its own document-DB type of storage, just FYI.)