Readers who like SQL may also enjoy Steampipe [1], an open source tool to live query 99+ services with SQL (e.g. AWS, GitHub, CSV, Kubernetes, etc). It uses Postgres Foreign Data Wrappers under the hood and supports joins etc across the services. (Disclaimer - I'm a lead on the project.)<p>1 - <a href="https://github.com/turbot/steampipe">https://github.com/turbot/steampipe</a>
This is crazy cool. Instead of searching Google which returns info from literally any random source (occasional good sites among the ocean of SEO spam, malicious sites, ad-ridden clone sites, annoying trolls, paywalled, etc), you could have your own set of diverse query sources you've deemed to be actually useful and trustworthy.<p>I suspect this is only a basic, naive idea compared to the true potential capabilities Trustfall could unlock.
This is a nifty tool. Its existence alongside the emerging LLMs reminds me of the two diametrically opposed approaches to harnessing it all:<p>1. Store the knowledge in a highly structured way and interrogate it with a precise and rigorous query language to extract the exact answer you want based on a well-defined set of rules.<p>2. Store the knowledge in whatever ad hoc way it’s produced, and then rely on a higher form of intelligence to take an equally ad hoc query, feed it through the entire universe of knowledge with some attention mechanism, and magically return a (statistically significant) response.<p>Both approaches are so satisfying when they work. Of course, you also have everything in between, and then you have tools like LangChain that start to bring it all together.
I imagine with these types of things the vast majority of the work is writing integrations. Could you explain how this makes writing integrations easier?
Trustfall author here, pleasantly surprised to find this posted!<p>The goal of Trustfall is to be the LLVM of data sources. GraphQL, OpenAPI, JSON (with JSON schema or not), SQL, RDF/SPARQL -- and none of them can natively talk to each other. Sure, you can stick JSON into Postgres, or compile GraphQL to SQL -- I've done both in production and it's always ultimately a poor fit because you're <i>cramming one system into another</i> when it was never originally designed to support that.<p>Here's an example: tell me the GitHub or Twitter accounts of HN users that have commented on HN stories about OpenAI. The data is available from the HN APIs on Firebase (for item lookup) and Algolia (for search). I know all of us could write a script to do it -- but would we? Or is it too annoying and difficult, and not worth it? That "activation energy" barrier is something I want to eliminate. Here's that same query in the Trustfall Playground, where it took just a minute or two to put together: <a href="https://play.predr.ag/hackernews#?f=1&q=IyBDcm9zcyBBUEkgcXVlcnkgKEFsZ29saWEgKyBGaXJlYmFzZSk6CiMgRmluZCBjb21tZW50cyBvbiBzdG9yaWVzIGFib3V0ICJvcGVuYWkuY29tIiB3aGVyZQojIHRoZSBjb21tZW50ZXIncyBiaW8gaGFzIGF0IGxlYXN0IG9uZSBHaXRIdWIgb3IgVHdpdHRlciBsaW5rCnF1ZXJ5IHsKICAjIFRoaXMgaGl0cyB0aGUgQWxnb2xpYSBzZWFyY2ggQVBJIGZvciBIYWNrZXJOZXdzLgogICMgVGhlIHN0b3JpZXMvY29tbWVudHMvdXNlcnMgZGF0YSBpcyBmcm9tIHRoZSBGaXJlYmFzZSBITiBBUEkuCiAgIyBUaGUgdHJhbnNpdGlvbiBpcyBzZWFtbGVzcyAtLSBpdCBpc24ndCB2aXNpYmxlIGZyb20gdGhlIHF1ZXJ5LgogIFNlYXJjaEJ5RGF0ZShxdWVyeTogIm9wZW5haS5jb20iKSB7CiAgICAuLi4gb24gU3RvcnkgewogICAgICAjIEFsbCBkYXRhIGZyb20gaGVyZSBvbndhcmQgaXMgZnJvbSB0aGUgRmlyZWJhc2UgQVBJLgogICAgICBzdG9yeVRpdGxlOiB0aXRsZSBAb3V0cHV0CiAgICAgIHN0b3J5TGluazogdXJsIEBvdXRwdXQKICAgICAgc3Rvcnk6IHN1Ym1pdHRlZFVybCBAb3V0cHV0CiAgICAgICAgICAgICAgICAgICAgICAgICAgQGZpbHRlcihvcDogInJlZ2V4IiwgdmFsdWU6IFsiJHNpdGVQYXR0ZXJuIl0pCgogICAgICBjb21tZW50IHsKICAgICAgICByZXBseSBAcmVjdXJzZShkZXB0aDogNSkgewogICAgICAgICAgY29tbWVudDogdGV4dFBsYWluIEBvdXRwd
XQKCiAgICAgICAgICBieVVzZXIgewogICAgICAgICAgICBjb21tZW50ZXI6IGlkIEBvdXRwdXQKICAgICAgICAgICAgY29tbWVudGVyQmlvOiBhYm91dFBsYWluIEBvdXRwdXQKCiAgICAgICAgICAgICMgVGhlIHByb2ZpbGUgbXVzdCBoYXZlIGF0IGxlYXN0IG9uZQogICAgICAgICAgICAjIGxpbmsgdGhhdCBwb2ludHMgdG8gZWl0aGVyIEdpdEh1YiBvciBUd2l0dGVyLgogICAgICAgICAgICBsaW5rCiAgICAgICAgICAgICAgQGZvbGQKICAgICAgICAgICAgICBAdHJhbnNmb3JtKG9wOiAiY291bnQiKQogICAgICAgICAgICAgIEBmaWx0ZXIob3A6ICI%2BPSIsIHZhbHVlOiBbIiRtaW5Qcm9maWxlcyJdKQogICAgICAgICAgICB7CiAgICAgICAgICAgICAgY29tbWVudGVySURzOiB1cmwgQGZpbHRlcihvcDogInJlZ2V4IiwgdmFsdWU6IFsiJHNvY2lhbFBhdHRlcm4iXSkKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBAb3V0cHV0CiAgICAgICAgICAgIH0KICAgICAgICAgIH0KICAgICAgICB9CiAgICAgIH0KICAgIH0KICB9Cn0%3D&v=ewogICJzaXRlUGF0dGVybiI6ICJodHRwW3NdOi8vKFteLl0qXFwuKSpvcGVuYWkuY29tLy4qIiwKICAibWluUHJvZmlsZXMiOiAxLAogICJzb2NpYWxQYXR0ZXJuIjogIihnaXRodWJ8dHdpdHRlcilcXC5jb20vIgp9" rel="nofollow">https://play.predr.ag/hackernews#?f=1&q=IyBDcm9zcyBBUEkgcXVl...</a><p>Trustfall is designed for interoperation from day 1. It separates the queries from the data providers, allowing the infrastructure to evolve and change how it serves queries <i>without</i> any of the queries noticing anything except faster execution. In practice, that means you don't have to rewrite your product to make it run faster -- which makes both the product side and the infra side happier :)<p>Here's a real-world example of that. The `cargo-semver-checks` Rust semantic versioning linter implements its lints as Trustfall queries over JSON files describing the package API, and I recently was able to speed up its execution by over 2000x without changing a single query -- just by changing how those queries execute under the hood. More details in my blog post here: <a href="https://predr.ag/blog/speeding-up-rust-semver-checking-by-over-2000x/" rel="nofollow">https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...</a><p>AMA, I guess :)
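To make the "queries are separate from data providers" point concrete, here is a minimal sketch in Python. It is not Trustfall's actual adapter API: `LinearAdapter`, `IndexedAdapter`, `run_query`, and the sample data are all hypothetical names invented for illustration. The only thing it tries to show is that the query function stays byte-for-byte the same while the provider behind it is swapped for a faster one.

```python
# Hypothetical sketch: none of these names come from Trustfall itself.

ITEMS = [
    {"name": "alice", "comment": "first"},
    {"name": "bob", "comment": "second"},
    {"name": "alice", "comment": "third"},
]


class LinearAdapter:
    """Naive provider: scans every item on each lookup."""

    def __init__(self, items):
        self.items = items

    def lookup(self, name):
        return [item for item in self.items if item["name"] == name]


class IndexedAdapter:
    """Faster provider: builds a dict index once, then answers lookups from it."""

    def __init__(self, items):
        self.index = {}
        for item in items:
            self.index.setdefault(item["name"], []).append(item)

    def lookup(self, name):
        return self.index.get(name, [])


def run_query(adapter, names):
    """The 'query': it only talks to the adapter interface, so it never changes."""
    return [(name, len(adapter.lookup(name))) for name in names]


slow = run_query(LinearAdapter(ITEMS), ["alice", "bob", "carol"])
fast = run_query(IndexedAdapter(ITEMS), ["alice", "bob", "carol"])
```

Both calls return the same rows; only the work done behind `lookup` differs, which is the kind of change that can speed things up without touching any query.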
Clicking through to the python repo and looking at the examples there, I guess the query language is GraphQL.<p>Reading further, it seems that it is actually a variation of GraphQL.<p>I've never used GraphQL before, maybe one can get used to it, but it doesn't look very nice for defining data queries.
Nice! I wrote something similar for my workplace a couple of years ago, based around Rx and libs like <a href="https://pypi.org/project/lquery/" rel="nofollow">https://pypi.org/project/lquery/</a> to do pushdown queries. I picked Rx because it has implementations in many different languages (we had a multi-node browser/server requirement) and because optimization in a streaming DSL is easier than in SQL, as you've done, since you can hint and order lazy materializations.<p>Are you planning on doing reactive/live/materialized queries?
You could use a config file to specify the live queries (in the Trustfall DSL) which can be fed to the engine on startup.
You may be interested in RDF and SPARQL, which support federated queries. Much simpler than reinventing the wheel.<p>For more context, see <a href="https://ontop-vkg.org/guide/" rel="nofollow">https://ontop-vkg.org/guide/</a>
Could this be used to manage ingesting data from messier sources, à la files (PDF, etc.), web pages, etc.?<p><i>edit</i>: Admittedly, the website/video is a bit hand-wavy about data sources. I'd have loved some real-world examples of how one goes about adding a data source, and of how to handle problems of scale, i.e. passing filters to the data source rather than drinking from a firehose and filtering after the fact.<p>With that said, the idea is becoming interesting to me. At the very least, I like the idea of a standardized query interface to "things". It just feels like edge cases might drown me.
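For readers unfamiliar with the "pass filters to the data source" concern, here is a small Python sketch of the difference. It assumes a made-up in-memory source: `Source`, `scan`, `filter_after`, and `filter_pushed_down` are hypothetical names and do not reflect how Trustfall or any particular engine exposes this. The `transferred` counter stands in for rows sent over the wire.

```python
# Hypothetical sketch of filter pushdown vs. filtering after the fact.

ROWS = [{"id": i, "kind": "pdf" if i % 10 == 0 else "html"} for i in range(1000)]


class Source:
    """Pretend remote source that counts every row it hands to the caller."""

    def __init__(self, rows):
        self.rows = rows
        self.transferred = 0

    def scan(self, kind=None):
        # If `kind` is given, the source filters before "sending" a row.
        for row in self.rows:
            if kind is None or row["kind"] == kind:
                self.transferred += 1
                yield row


def filter_after(source, kind):
    """Drink from the firehose: fetch everything, then filter locally."""
    return [row for row in source.scan() if row["kind"] == kind]


def filter_pushed_down(source, kind):
    """Push the filter into the source so only matching rows are fetched."""
    return list(source.scan(kind=kind))


firehose = Source(ROWS)
pushed = Source(ROWS)
after_rows = filter_after(firehose, "pdf")
pushed_rows = filter_pushed_down(pushed, "pdf")
print(firehose.transferred, pushed.transferred)  # prints "1000 100"
```

Both strategies return the same 100 rows, but the pushed-down version only moves the rows it needs. Whether a given data source can accept a filter like this depends on what its API supports.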
Very interesting, sounds kind of like a GraphQL counterpart to <a href="https://github.com/cube2222/octosql">https://github.com/cube2222/octosql</a>
Isn't this the same as converting websites into a GraphQL API? Aren't there already dozens of projects and services that do the "convert websites into an API", at scale? What exactly is the innovation?
Also relevant - High Performance Open Source ELT Framework - <a href="https://github.com/cloudquery/cloudquery">https://github.com/cloudquery/cloudquery</a>
This looks super cool. Going to put this together with Dagster tonight.<p>@obi1kenobi, could you comment a bit on the motivation and background to this project?
What join algorithms does this support? How many data sources are integrated so far, and what’s the MoM growth? What’s the largest dataset this integrates with? And how many different people do you estimate have written queries in the language?