TechEcho
The Splitgraph Data Delivery Network – query over 40k public datasets

297 points by mildbyte almost 5 years ago

19 comments

emersion almost 5 years ago
> The single SQL endpoint is well suited for a data marketplace. Data vendors currently ship data in CSV files or other ad-hoc formats. They have to maintain pages of instructions on ingesting this data. With Splitgraph, data consumers will be able to acquire and interact with data directly from their applications and clients.

I appreciate the effort to make it easier for users to access heterogeneous data sets, but I really hope data vendors keep shipping raw CSV files. I don't want a company to gate access to the data, merely offering a proxy. I want to be able to download the whole raw datasets from the vendor directly if I want to.
lifeisstillgood almost 5 years ago
I see a new job title coming into being - Enterprise Data Librarian

40,000 data sets - even if many are just different versions - is a ridiculous number to manage or even know about on anything less than a full-time basis.

Data-driven decisions need data, yes, but they also need people to know the data exists. And what it means.

And this is just external curated data - use this as the standard for what each department should be producing internally.

In fact that's a good idea - a data publishing standard - not just the data types / schema, but actually supplying it in a format that is consumable by others.
Fiahil almost 5 years ago
As someone who tried, and almost succeeded, to get rid of pachyderm for the last two years, I like what I just read.

Something is not entirely clear to me right now: An image is an immutable snapshot of a dataset at a given point-in-time - great - but can I query the same dataset at two different PITs using layered querying in SQL? Something like this: SELECT * FROM dataset:version-1, dataset:version-2

Also, are you storing the entire dataset as new, or only the diff between versions (and later reconstructing the full image)?

Now, onto the things that could be improved...

- Git-like semantics (pull, push, checkout, commit) are poorly suited for versioned, immutable datasets. Just (intelligently) abstract fetching and sending datasets by looking at the SQL query (dataset:version-2, above)

- Versions should be at least partially ordered and monotonically increasing. Hashes don't convey the information necessary to decide whether dataset:de4d is an earlier version of dataset:123a or not.

- Tracing a derived dataset's provenance will only work if you can assert that the "code" or transformations applied to the original dataset are deterministic (side-effect free). So, either you have your own ETL language that you can execute in a sandbox and add a myriad of useless stuff for creating and scheduling pipelines (please don't do that!), or you just let it go and don't end up becoming Pachyderm (sounds great!).
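
The ordering point in the comment above can be illustrated with a small sketch. It assumes each image records the hash of its parent; the `Image` record, the `is_ancestor` helper and the two-entry history are hypothetical and are not taken from Splitgraph's actual metadata model. With a parent link, ancestry can be recovered by walking the chain, whereas comparing the hash strings themselves says nothing about version order.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Image:
    """Hypothetical version record: a content hash plus a link to its parent."""
    hash: str
    parent: Optional[str]  # None for the first version


def is_ancestor(images: Dict[str, Image], older: str, newer: str) -> bool:
    """Return True if `older` is reachable from `newer` by following parent links."""
    current = images[newer].parent
    while current is not None:
        if current == older:
            return True
        current = images[current].parent
    return False


# Made-up history for illustration: de4d was committed first, then 123a.
history = {
    "de4d": Image("de4d", None),
    "123a": Image("123a", "de4d"),
}

# Parent links answer the "is this an earlier version?" question.
print(is_ancestor(history, "de4d", "123a"))
print(is_ancestor(history, "123a", "de4d"))

# Lexicographic comparison of the hashes is unrelated to commit order.
print("de4d" < "123a")
```

A sequence number or timestamp stored next to the hash would give the same answer without walking the chain; the parent-link version is shown only because it needs no extra field.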
eadan almost 5 years ago
This is very cool. Relatedly, as a data scientist, I wish companies would expose their APIs through SQL. I've spent a lot of time pulling data into ETL jobs from things like mixpanel, adwords etc., and having a unified interface would make things much simpler.

I'm trying to understand the architecture of Splitgraph. Are all foreign data wrappers controlled directly by you, or can third parties host a database and connect it to Splitgraph in a federation?
big-malloc almost 5 years ago
Is there a CPU limit or timeout for queries? I’d be a little concerned that an intentionally slow and inefficient query could pin the CPU at 100% and ruin the performance for other users.
codetrotter almost 5 years ago
> postgresql://data.splitgraph.com:5432/ddn

That’s actually pretty cool, to see a public URL with PostgreSQL protocol signifier like that.

Makes me wonder if any developers or DB Architects ever thought of putting their resume in a DB and putting a public read-only postgresql:// URL on their business card :D
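
For readers unfamiliar with this URL form, here is a minimal sketch that splits the quoted address into its parts using Python's standard library. The `describe` helper is a hypothetical name introduced for illustration; nothing here connects to the server.

```python
from urllib.parse import urlsplit


def describe(url: str) -> dict:
    """Split a postgresql:// URL into the pieces a client would need."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,               # the protocol signifier
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),   # path component names the database
    }


info = describe("postgresql://data.splitgraph.com:5432/ddn")
for key, value in info.items():
    print(f"{key}: {value}")
```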
jarym almost 5 years ago
Very neat indeed. I thought Postgres had a max identifier length of 63 characters so I was surprised to see "cityofchicago/covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc".covid19_daily_cases_deaths_and_hospitalizations in the FROM part of the statement. Does the max identifier length not apply for some reason here or have Splitgraph done something to increase it?

On a related note, I've long wanted longer identifier lengths in Postgres so we can have more meaningful column names but the powers-that-be have always refused... hopefully one day it'll increase in the default distribution.
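
The observation above is easy to check by measuring the two identifiers. The sketch below assumes the default Postgres build, where identifiers are limited to 63 bytes; the constant and the `exceeds_limit` helper are names introduced here for illustration. It only measures the strings and says nothing about how Splitgraph accepts the longer one, which the thread does not explain.

```python
# Assumption: default Postgres build, where identifiers are limited to 63 bytes.
MAX_IDENTIFIER_BYTES = 63


def exceeds_limit(identifier: str) -> bool:
    """Return True if the identifier is longer than the default limit, in bytes."""
    return len(identifier.encode("utf-8")) > MAX_IDENTIFIER_BYTES


# The two identifiers quoted in the comment above.
schema = "cityofchicago/covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc"
table = "covid19_daily_cases_deaths_and_hospitalizations"

for name in (schema, table):
    size = len(name.encode("utf-8"))
    print(f"{size:3d} bytes, over limit: {exceeds_limit(name)}  {name}")
```

Only the schema part is over the limit; the table name fits.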
touisteur almost 5 years ago
I think FDWs aren't used more because they're not easy to get into.

The best link/example I found was https://github.com/beargiles/passwd-fdw and it's quite easy to follow the code and understand all the moving parts.

Once you've written a FDW you'll see them everywhere.

In fact the same author wrote https://github.com/beargiles/zip file-fdw and https://github.com/beargiles/tarfile-fdw

If you (still?) need inspiration and want to see what already exists: https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Simpler, using a 'generic' file-based FDW shipping with pg's sources: https://aaronparecki.com/2015/02/19/8/monitoring-cpu-memory-usage-from-postgres

There's a Python wrapper to get your feet wet (or prototype an idea): https://github.com/Segfault-Inc/Multicorn (though I'm not sure how maintained this is).

The only annoying part is that you're plugging your code into an interface that might break (and sometimes has broken) between releases of PG. So kind of the same fun as maintaining a gcc plugin...

By the way, does anyone have any idea about the licensing terms/issues of PG FDWs and PG extensions in general?
dumbfounder almost 5 years ago
Ok I signed up and used your recommended client (DBeaver 7.1.5) but I don't see the schemas in your picture.

https://ibb.co/gwLfHVz
varelaz almost 5 years ago
Thanks for opening a new way to work with public data and discover it. I have several ideas regarding this. I have used free public APIs, and the worst thing about them is that they are all unreliable: unreliable in their conditions and limits, and they usually don't scale. And you cannot blame the API providers, because you don't pay for it. I vote for premium, resource-based access to the data with a free tier, so that you can pay and get the level of service you need, or use a tiny, limited free tier.
geordee almost 5 years ago
SQL is the API for data.
dsr_ almost 5 years ago
I can find a privacy policy. It's not awful.

I can't find pricing.
sradman almost 5 years ago
Does the Splitgraph Data Delivery Network allow queries that ORDER BY an unsorted column? This seems like a vector for a Denial of Service attack.
dariosalvi78 almost 5 years ago
mm interesting... we have this open Postgres instance (read only) for covid19 research: https://covid19.eng.ox.ac.uk/

we have it running on our own (cheap) server, but we fear we may get overwhelmed by too much traffic if the project becomes very successful. Would this be a solution for us? Is it for free?
georgewfraser almost 5 years ago
Postgres foreign data wrappers are a weird choice of engine. Most queries to this service will be scans, in which case a column-oriented, vectorized, massively parallel engine like Presto will be 1000 times faster or so. Postgres’ underlying engine is optimized for scenarios where you read a small number of rows using an index.
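
The layout difference behind this argument can be shown with a toy sketch. The data and function names below are made up for illustration, and the code does not model Postgres or Presto internals or measure performance. It only shows that a scan over one column has to visit every full record in a row-oriented layout, while a column-oriented layout lets it read a single contiguous list.

```python
# Made-up sample data, stored two ways.
row_store = [
    {"id": 1, "city": "Chicago", "cases": 120},
    {"id": 2, "city": "Oxford", "cases": 80},
    {"id": 3, "city": "Boston", "cases": 45},
]

column_store = {
    "id": [1, 2, 3],
    "city": ["Chicago", "Oxford", "Boston"],
    "cases": [120, 80, 45],
}


def total_cases_rows(rows: list) -> int:
    """Scan in a row layout: every record is visited, including unused fields."""
    return sum(record["cases"] for record in rows)


def total_cases_columns(columns: dict) -> int:
    """Scan in a column layout: only the one column that is needed is read."""
    return sum(columns["cases"])


print(total_cases_rows(row_store))
print(total_cases_columns(column_store))
```

Both functions return the same total; the difference is in how much unrelated data each layout forces the scan to step over.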
ekzhu almost 5 years ago
How do you handle expensive queries? Several JOINs over multiple large data sources can easily take minutes if not hours.
CharlesDodgson almost 5 years ago
Looks lovely, I can see real use for this in my work. Postgres and the availability of the PostGIS extension are really useful for mapping data and spatially related queries.
Vaslo almost 5 years ago
Can other database systems, like SQL Server or Oracle, be used to do this? Many of us are forced to use systems other than Postgres.
molszanski almost 5 years ago
Great project! Congrats on the launch.