Hey HN!<p>I’m Miles, co-founder of Splitgraph along with Artjoms (mildbyte). We met back in 2018, when I reached out to him after reading his blog on HN and realizing we lived right next to each other. Neither of us had a “real job,” and we both wanted to build something truly innovative and cool. We tossed around a few ideas, but ultimately we couldn’t resist the idea of building “GitHub for data,” which seemed like an obvious gap in the market. After nearly two years of development, we are finally ready, and extremely excited, to share it with the world.<p>We are not the first to notice this gap or try to build this product, so we wanted to make sure we did it right. We started from “first principles” and really analyzed the problem space. We ended up realizing that it’s not strictly Git or GitHub that people want “for data.” Rather, people just want to be able to work with data as easily as they can work with code. They want to experiment, build and maintain data without needless overhead.<p>Tools like Git and Docker are ubiquitous in any software engineer’s workflow, and we took a lot of inspiration from them when designing Splitgraph. We thought about <i>why</i> people like and use these tools, and tried to translate their benefits to the domain of data science. Our core philosophy is to stay out of the way, and work with existing abstractions instead of introducing new ones. You can version your code with Git without switching filesystems. You can build Docker images without changing your code to work in Docker. Our goal with Splitgraph is to provide an easy path to incremental adoption, so you can introduce it into your existing workflows where and when it makes sense.<p>Splitgraph is powered by Postgres, and provides an easy way to build and share versioned datasets, along with a whole bunch of other benefits. We encourage you to read the landing page which (hopefully) explains it well.
The documentation goes into much more detail, and if you have ten minutes and Docker installed, you can try Splitgraph for yourself. [0] If you work with data, we really hope you’ll give Splitgraph a try.<p>We’re here to answer any questions, and we’ve also created a Discord server [1] to hopefully build a bit of a community around Splitgraph.<p>[0] <a href="https://www.splitgraph.com/docs/getting-started/five-minute-demo" rel="nofollow">https://www.splitgraph.com/docs/getting-started/five-minute-...</a><p>[1] <a href="https://discord.gg/eFEFRKm" rel="nofollow">https://discord.gg/eFEFRKm</a>
Personally, I'm more drawn to the dotmesh approach (<a href="https://docs.dotmesh.com/concepts/architecture/" rel="nofollow">https://docs.dotmesh.com/concepts/architecture/</a>), but the one problem with data is that as it gets massive, it becomes really hard to move around, and I guess that's where layering Git-like workflows on top of it becomes intractable. It's like data has its own gravity, and oftentimes it's just easier to bring other things to the data than the other way around. IIRC, Bryan Cantrill said something similar about data when Joyent was developing its object storage system, Manta (<a href="https://www.youtube.com/watch?v=79fvDDPaIoY" rel="nofollow">https://www.youtube.com/watch?v=79fvDDPaIoY</a>); ergo, perhaps the Splitgraph approach will meet with better success.
One of the DVC maintainers here :) Congrats! It's great to see more tools for codifying data in different scenarios.<p>To be honest, since you introduce a new workflow and a few new concepts, it's not that easy to get the right perspective in 5 minutes (I know the same problem exists with DVC, and we've been iterating on the docs a lot). Mind a few questions?<p>Do I understand correctly that it is mostly focused on tabular data? Kind of like git checkout for an SQL table?
This is so cool!<p>I have been looking around for databases that have any sort of cryptographic digest of data to ensure integrity. And this is the first time I have seen something do that.<p>Could the snapshots and content addressability be used for regular backups of application databases?
I'm probably a bit naive about this, but could it make it unnecessary to explicitly create database dumps as backups in scenarios where you need a rollback? I.e., could I just tag the database and be guaranteed to get that data back later, simply by checking out the tag, if, for example, my upgrade failed and I wanted to restore?