> The single SQL endpoint is well suited for a data marketplace. Data vendors currently ship data in CSV files or other ad-hoc formats. They have to maintain pages of instructions on ingesting this data. With Splitgraph, data consumers will be able to acquire and interact with data directly from their applications and clients.

I appreciate the effort to make it easier for users to access heterogeneous data sets, but I really hope data vendors keep shipping raw CSV files. I don't want a company to gate access to the data, merely offering a proxy. I want to be able to download the whole raw datasets from the vendor directly if I want to.
I see a new job title coming into being - Enterprise Data Librarian.

40,000 data sets - even if many are just diff versions - is a ridiculous number to manage, or even know about, on a non-full-time basis.

Data-driven decisions need data, yes, but they also need people who know the data exists. And what it means.

And this is just external curated data - use this as the standard for what each department should be producing internally.

In fact, that's a good idea - a data publishing standard - not just the data types / schema, but actually supplying it through a format that is consumable by others.
As someone who has tried, and almost succeeded, to get rid of Pachyderm for the last two years, I like what I just read.

Something is not entirely clear to me right now: an image is an immutable snapshot of a dataset at a given point in time - great - but can I query the same dataset at two different points in time using layered querying in SQL? Something like this: SELECT * FROM dataset:version-1, dataset:version-2 (a fuller sketch below).

Also, are you storing each version as an entirely new dataset, or only the diff between versions (reconstructing the full image later)?

Now, onto the things that could be improved...

- Git-like semantics (pull, push, checkout, commit) are poorly suited for versioned, immutable datasets. Just (intelligently) abstract fetching and sending datasets by looking at the SQL query (dataset:version-2, above).

- Versions should be at least partially ordered and monotonically increasing. Hashes don't convey the information necessary to decide whether dataset:de4d is an earlier version of dataset:123a or not.

- Tracing a derived dataset's provenance will only work if you can assert that the "code" or transformations applied to the original dataset are deterministic (side-effect free). So either you build your own ETL language that you can execute in a sandbox, and add a myriad of useless stuff for creating and scheduling pipelines (please don't do that!), or you just let it go and don't end up becoming Pachyderm (sounds great!).
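To make that point-in-time question concrete - assuming Splitgraph's quoted-schema addressing carries over, where the image tag or hash follows a colon in the schema name (the repository, tags, table and column here are all hypothetical) - the kind of query I have in mind is:

    -- Hypothetical: compare two point-in-time images of one dataset,
    -- assuming each image is addressable as a schema "namespace/repo:tag".
    SELECT 'v1' AS image, count(*) AS n FROM "acme/sales:v1".orders
    UNION ALL
    SELECT 'v2' AS image, count(*) AS n FROM "acme/sales:v2".orders;

If that works, joining two versions to diff them row-by-row would follow the same pattern.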
This is very cool. Relatedly, as a data scientist, I wish companies would expose their APIs through SQL. I've spent a lot of time pulling data into ETL jobs from things like Mixpanel, AdWords, etc., and having a unified interface would make things much simpler.

I'm trying to understand the architecture of Splitgraph. Are all foreign data wrappers controlled directly by you, or can third parties host a database and connect it to Splitgraph in a federation?
Is there a CPU limit or timeout for queries? I’d be a little concerned that an intentionally slow and inefficient query could pin the CPU at 100% and ruin the performance for other users.
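For what it's worth, stock Postgres has knobs for exactly this, so presumably a multi-tenant endpoint sets something similar per role or session (standard settings shown; nothing here is documented Splitgraph behaviour):

    -- Cap query runtime and per-query memory for an untrusted role:
    ALTER ROLE readonly_user SET statement_timeout = '30s';
    ALTER ROLE readonly_user SET work_mem = '32MB';
    -- Or just for the current session:
    SET statement_timeout = '30s';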
> postgresql://data.splitgraph.com:5432/ddn

That’s actually pretty cool, to see a public URL with a PostgreSQL protocol signifier like that.

Makes me wonder if any developers or DB architects ever thought of putting their resume in a DB and putting a public read-only postgresql:// URL on their business card :D
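The gag is only a few statements of completely standard Postgres, too (every name, host and password here is hypothetical):

    -- A resume as a database: one table, one read-only guest role.
    CREATE TABLE resume (section text, entry text);
    INSERT INTO resume VALUES
      ('experience', 'DBA, Example Corp, 2015-2020'),
      ('skills',     'PostgreSQL, FDWs, query tuning');
    CREATE ROLE guest LOGIN PASSWORD 'letmein';
    GRANT SELECT ON resume TO guest;
    -- Business card: postgresql://guest:letmein@db.example.com:5432/resume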
Very neat indeed. I thought Postgres had a max identifier length of 63 characters, so I was surprised to see "cityofchicago/covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc".covid19_daily_cases_deaths_and_hospitalizations in the FROM part of the statement. Does the max identifier length not apply for some reason here, or have Splitgraph done something to increase it?

On a related note, I've long wanted longer identifier lengths in Postgres so we can have more meaningful column names, but the powers-that-be have always refused... hopefully one day it'll increase in the default distribution.
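For reference, the limit is NAMEDATALEN - 1 = 63 bytes on stock Postgres, and the truncation is easy to demonstrate, which is what makes that long schema name surprising:

    -- Stock Postgres silently truncates identifiers to 63 bytes:
    SELECT length(repeat('x', 100)::name);  -- returns 63

Raising it means changing NAMEDATALEN in src/include/pg_config_manual.h and recompiling, so presumably Splitgraph either rebuilt with a larger value or the endpoint isn't a plain vanilla Postgres parser out front.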
I think FDWs aren't used more because they're not easy to get into.

The best link/example I found was https://github.com/beargiles/passwd-fdw and it's quite easy to follow the code and understand all the moving parts.

Once you've written an FDW you'll see them everywhere.

In fact the same author wrote a zip file FDW (https://github.com/beargiles/zip) and https://github.com/beargiles/tarfile-fdw

If you (still?) need inspiration and want to see what already exists: https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Simpler, using a 'generic' file-based FDW that ships with pg's sources:
https://aaronparecki.com/2015/02/19/8/monitoring-cpu-memory-usage-from-postgres

There's a Python wrapper to get your feet wet (or prototype an idea): https://github.com/Segfault-Inc/Multicorn (though I'm not sure how maintained it is).

The only annoying part is that you're plugging your code into an interface that might break (and sometimes has) between releases of PG. So it's kind of the same fun as maintaining a gcc plugin...

By the way, does anyone have any idea about the licensing terms/issues of PG FDWs and PG extensions in general?
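To make the 'generic file-based FDW' route concrete, this is roughly the example from the file_fdw contrib docs (the table mirrors /etc/passwd's colon-separated layout):

    CREATE EXTENSION file_fdw;
    CREATE SERVER files FOREIGN DATA WRAPPER file_fdw;
    -- Expose /etc/passwd as a read-only foreign table:
    CREATE FOREIGN TABLE passwd (
      username text, pass text, uid int, gid int,
      gecos text, home text, shell text
    ) SERVER files
    OPTIONS (filename '/etc/passwd', format 'csv', delimiter ':', null '');

    SELECT username, shell FROM passwd WHERE uid >= 1000;

No C required for that one, which makes it a nice on-ramp before writing a wrapper of your own.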
Ok, I signed up and used your recommended client (DBeaver 7.1.5), but I don't see the schemas in your picture.

https://ibb.co/gwLfHVz
Thanks for opening up a new way to work with public data and discover it. I have several ideas regarding this. I've used public free APIs, and the worst thing about them is that they are all unreliable - unreliable in their conditions and limits - and they usually don't scale. And you cannot blame the API providers, because you don't pay for them. I vote for premium, resource-based access to the data with a free tier: pay to get the level of service you need, or use the tiny, limited free access.
Mm, interesting... we have this open Postgres instance (read-only) for COVID-19 research: https://covid19.eng.ox.ac.uk/

We have it running on our own (cheap) server, but we fear we may get overwhelmed by too much traffic if the project becomes very successful. Would this be a solution for us? Is it free?
Postgres foreign data wrappers are a weird choice of engine. Most queries to this service will be scans, in which case a column-oriented, vectorized, massively parallel engine like Presto will be on the order of 1000x faster. Postgres’ underlying engine is optimized for scenarios where you read a small number of rows using an index.
Looks lovely - I can see real use for this in my work. Postgres and the availability of the PostGIS extension are really useful for mapping data and spatially related queries.
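e.g. the kind of query that becomes trivial once PostGIS is in the mix (table and column names hypothetical):

    -- Hypothetical: stations within 1 km of downtown Chicago (PostGIS).
    SELECT name
    FROM stations
    WHERE ST_DWithin(
      geom::geography,
      ST_SetSRID(ST_MakePoint(-87.63, 41.88), 4326)::geography,
      1000  -- metres
    );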