> The single SQL endpoint is well suited for a data marketplace. Data vendors currently ship data in CSV files or other ad-hoc formats. They have to maintain pages of instructions on ingesting this data. With Splitgraph, data consumers will be able to acquire and interact with data directly from their applications and clients.

I appreciate the effort to make it easier for users to access heterogeneous data sets, but I really hope data vendors keep shipping raw CSV files. I don't want a company to gate access to the data, merely offering a proxy. I want to be able to download the whole raw datasets from the vendor directly if I want to.
I see a new job title coming into being - Enterprise Data Librarian.

40,000 data sets - even if many are just diff versions - is a ridiculous number to manage, or even know about, on a non-full-time basis.

Data-driven decisions need data, yes, but they also need people who know the data exists. And what it means.

And this is just external curated data - use this as the standard for what each department should be producing internally.

In fact, that's a good idea - a data publishing standard - not just the data types / schema, but actually supplying it through a format that is consumable by others.
As someone who has tried, and almost succeeded, to get rid of Pachyderm for the last two years, I like what I just read.

Something is not entirely clear to me right now: an image is an immutable snapshot of a dataset at a given point in time - great - but can I query the same dataset at two different points in time using layered querying in SQL? Something like this: SELECT * FROM dataset:version-1, dataset:version-2 (a fuller sketch below).

Also, are you storing each version as an entirely new dataset, or only the diff between versions (reconstructing the full image later)?

Now, onto the things that could be improved...

- Git-like semantics (pull, push, checkout, commit) are poorly suited for versioned, immutable datasets. Just (intelligently) abstract fetching and sending datasets by looking at the SQL query (dataset:version-2, above).

- Versions should be at least partially ordered and monotonically increasing. Hashes don't convey the information necessary to decide whether dataset:de4d is an earlier version of dataset:123a or not.

- Tracing a derived dataset's provenance will only work if you can assert that the "code" or transformations applied to the original dataset are deterministic (side-effect free). So either you build your own ETL language that you can execute in a sandbox, and add a myriad of useless stuff for creating and scheduling pipelines (please don't do that!), or you just let it go and don't end up becoming Pachyderm (sounds great!).
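To make that point-in-time question concrete - assuming Splitgraph's quoted-schema addressing carries over, where the image tag or hash follows a colon in the schema name (the repository, tags, table and column here are all hypothetical) - the kind of query I have in mind is:

    -- Hypothetical: compare two point-in-time images of one dataset,
    -- assuming each image is addressable as a schema "namespace/repo:tag".
    SELECT 'v1' AS image, count(*) AS n FROM "acme/sales:v1".orders
    UNION ALL
    SELECT 'v2' AS image, count(*) AS n FROM "acme/sales:v2".orders;

If that works, joining two versions to diff them row-by-row would follow the same pattern.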
This is very cool. Relatedly, as a data scientist, I wish companies would expose their APIs through SQL. I've spent a lot of time pulling data into ETL jobs from things like Mixpanel, AdWords, etc., and having a unified interface would make things much simpler.

I'm trying to understand the architecture of Splitgraph. Are all foreign data wrappers controlled directly by you, or can third parties host a database and connect it to Splitgraph in a federation?
Is there a CPU limit or timeout for queries? I’d be a little concerned that an intentionally slow and inefficient query could pin the CPU at 100% and ruin the performance for other users.
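For what it's worth, stock Postgres has knobs for exactly this, so presumably a multi-tenant endpoint sets something similar per role or session (standard settings shown; nothing here is documented Splitgraph behaviour):

    -- Cap query runtime and per-query memory for an untrusted role:
    ALTER ROLE readonly_user SET statement_timeout = '30s';
    ALTER ROLE readonly_user SET work_mem = '32MB';
    -- Or just for the current session:
    SET statement_timeout = '30s';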
> postgresql://data.splitgraph.com:5432/ddn

That’s actually pretty cool, to see a public URL with a PostgreSQL protocol signifier like that.

Makes me wonder if any developers or DB architects ever thought of putting their resume in a DB and putting a public read-only postgresql:// URL on their business card :D
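The gag is only a few statements of completely standard Postgres, too (every name, host and password here is hypothetical):

    -- A resume as a database: one table, one read-only guest role.
    CREATE TABLE resume (section text, entry text);
    INSERT INTO resume VALUES
      ('experience', 'DBA, Example Corp, 2015-2020'),
      ('skills',     'PostgreSQL, FDWs, query tuning');
    CREATE ROLE guest LOGIN PASSWORD 'letmein';
    GRANT SELECT ON resume TO guest;
    -- Business card: postgresql://guest:letmein@db.example.com:5432/resume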
Very neat indeed. I thought Postgres had a max identifier length of 63 characters, so I was surprised to see "cityofchicago/covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc".covid19_daily_cases_deaths_and_hospitalizations in the FROM part of the statement. Does the max identifier length not apply for some reason here, or have Splitgraph done something to increase it?

On a related note, I've long wanted longer identifier lengths in Postgres so we can have more meaningful column names, but the powers-that-be have always refused... hopefully one day it'll increase in the default distribution.
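For reference, the limit is NAMEDATALEN - 1 = 63 bytes on stock Postgres, and the truncation is easy to demonstrate, which is what makes that long schema name surprising:

    -- Stock Postgres silently truncates identifiers to 63 bytes:
    SELECT length(repeat('x', 100)::name);  -- returns 63

Raising it means changing NAMEDATALEN in src/include/pg_config_manual.h and recompiling, so presumably Splitgraph either rebuilt with a larger value or the endpoint isn't a plain vanilla Postgres parser out front.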
I think FDWs aren't used more because they're not easy to get into.

The best link/example I found was https://github.com/beargiles/passwd-fdw and it's quite easy to follow the code and understand all the moving parts.

Once you've written an FDW you'll see them everywhere.

In fact the same author wrote a zip file FDW (https://github.com/beargiles/zip) and https://github.com/beargiles/tarfile-fdw

If you (still?) need inspiration and want to see what already exists: https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Simpler, using a 'generic' file-based FDW that ships with pg's sources:
https://aaronparecki.com/2015/02/19/8/monitoring-cpu-memory-usage-from-postgres

There's a Python wrapper to get your feet wet (or prototype an idea): https://github.com/Segfault-Inc/Multicorn (though I'm not sure how maintained it is).

The only annoying part is that you're plugging your code into an interface that might break (and sometimes has) between releases of PG. So it's kind of the same fun as maintaining a gcc plugin...

By the way, does anyone have any idea about the licensing terms/issues of PG FDWs and PG extensions in general?
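To make the 'generic file-based FDW' route concrete, this is roughly the example from the file_fdw contrib docs (the table mirrors /etc/passwd's colon-separated layout):

    CREATE EXTENSION file_fdw;
    CREATE SERVER files FOREIGN DATA WRAPPER file_fdw;
    -- Expose /etc/passwd as a read-only foreign table:
    CREATE FOREIGN TABLE passwd (
      username text, pass text, uid int, gid int,
      gecos text, home text, shell text
    ) SERVER files
    OPTIONS (filename '/etc/passwd', format 'csv', delimiter ':', null '');

    SELECT username, shell FROM passwd WHERE uid >= 1000;

No C required for that one, which makes it a nice on-ramp before writing a wrapper of your own.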
Ok, I signed up and used your recommended client (DBeaver 7.1.5), but I don't see the schemas in your picture.

https://ibb.co/gwLfHVz
Thanks for opening up a new way to work with public data and discover it. I have several ideas regarding this. I've used public free APIs, and the worst thing about them is that they are all unreliable - unreliable in their conditions and limits - and they usually don't scale. And you cannot blame the API providers, because you don't pay for them. I vote for premium, resource-based access to the data with a free tier: pay to get the level of service you need, or use the tiny, limited free access.
Mm, interesting... we have this open Postgres instance (read-only) for COVID-19 research: https://covid19.eng.ox.ac.uk/

We have it running on our own (cheap) server, but we fear we may get overwhelmed by too much traffic if the project becomes very successful. Would this be a solution for us? Is it free?
Postgres foreign data wrappers are a weird choice of engine. Most queries to this service will be scans, in which case a column-oriented, vectorized, massively parallel engine like Presto will be on the order of 1000x faster. Postgres’ underlying engine is optimized for scenarios where you read a small number of rows using an index.
Looks lovely - I can see real use for this in my work. Postgres and the availability of the PostGIS extension are really useful for mapping data and spatially related queries.
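e.g. the kind of query that becomes trivial once PostGIS is in the mix (table and column names hypothetical):

    -- Hypothetical: stations within 1 km of downtown Chicago (PostGIS).
    SELECT name
    FROM stations
    WHERE ST_DWithin(
      geom::geography,
      ST_SetSRID(ST_MakePoint(-87.63, 41.88), 4326)::geography,
      1000  -- metres
    );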