Trustfall: How to Query (Almost) Everything

288 points by sbt567 over 2 years ago

20 comments

Readers who like SQL may also enjoy Steampipe [1], an open source tool to live query 99+ services with SQL (e.g. AWS, GitHub, CSV, Kubernetes, etc). It uses Postgres Foreign Data Wrappers under the hood and supports joins etc across the services. (Disclaimer - I'm a lead on the project.)

[1] <a href="https://github.com/turbot/steampipe">https://github.com/turbot/steampipe</a>

metadat over 2 years ago

This is crazy cool. Instead of searching Google, which returns info from literally any random source (occasional good sites among the ocean of SEO spam, malicious sites, ad-ridden clone sites, annoying trolls, paywalled, etc), you could have your own set of diverse query sources you've deemed to be actually useful and trustworthy.

I suspect this is only a basic, naive idea compared to the true potential capabilities Trustfall could unlock.

crosen99 over 2 years ago

This is a nifty tool. Its existence alongside the emerging LLMs reminds me of the two diametrically opposed approaches to harnessing it all:

1. Store the knowledge in a highly structured way and interrogate it with a precise and rigorous query language to extract the exact answer you want based on a well-defined set of rules.

2. Store the knowledge in whatever ad hoc way it’s produced, and then rely on a higher form of intelligence to take an equally ad hoc query, feed it through the entire universe of knowledge with some attention mechanism, and magically return a (statistically significant) response.

Both approaches are so satisfying when they work. Of course you also have everything in between, and then you have tools like LangChain that start to bring it all together.

vsroy over 2 years ago

I imagine with these types of things the vast majority of the work is writing integrations. Could you explain how this makes writing integrations easier?

obi1kenobi over 2 years ago

Trustfall author here, pleasantly surprised to find this posted!

The goal of Trustfall is to be the LLVM of data sources. There's GraphQL, OpenAPI, JSON (with JSON schema or not), SQL, RDF/SPARQL -- and none of them can natively talk to each other. Sure, you can stick JSON into Postgres, or compile GraphQL to SQL -- I've done both in production and it's always ultimately a poor fit because you're cramming one system into another when it was never originally designed to support that.

Here's an example: tell me the GitHub or Twitter accounts of HN users that have commented on HN stories about OpenAI. The data is available from the HN APIs on Firebase (for item lookup) and Algolia (for search). I know all of us could write a script to do it -- but would we? Or is it too annoying and difficult, and not worth it? That "activation energy" barrier is something I want to eliminate. Here's that same query in the Trustfall Playground, where it took just a minute or two to put together: <a href="https://play.predr.ag/hackernews#?f=1&q=IyBDcm9zcyBBUEkgcXVlcnkgKEFsZ29saWEgKyBGaXJlYmFzZSk6CiMgRmluZCBjb21tZW50cyBvbiBzdG9yaWVzIGFib3V0ICJvcGVuYWkuY29tIiB3aGVyZQojIHRoZSBjb21tZW50ZXIncyBiaW8gaGFzIGF0IGxlYXN0IG9uZSBHaXRIdWIgb3IgVHdpdHRlciBsaW5rCnF1ZXJ5IHsKICAjIFRoaXMgaGl0cyB0aGUgQWxnb2xpYSBzZWFyY2ggQVBJIGZvciBIYWNrZXJOZXdzLgogICMgVGhlIHN0b3JpZXMvY29tbWVudHMvdXNlcnMgZGF0YSBpcyBmcm9tIHRoZSBGaXJlYmFzZSBITiBBUEkuCiAgIyBUaGUgdHJhbnNpdGlvbiBpcyBzZWFtbGVzcyAtLSBpdCBpc24ndCB2aXNpYmxlIGZyb20gdGhlIHF1ZXJ5LgogIFNlYXJjaEJ5RGF0ZShxdWVyeTogIm9wZW5haS5jb20iKSB7CiAgICAuLi4gb24gU3RvcnkgewogICAgICAjIEFsbCBkYXRhIGZyb20gaGVyZSBvbndhcmQgaXMgZnJvbSB0aGUgRmlyZWJhc2UgQVBJLgogICAgICBzdG9yeVRpdGxlOiB0aXRsZSBAb3V0cHV0CiAgICAgIHN0b3J5TGluazogdXJsIEBvdXRwdXQKICAgICAgc3Rvcnk6IHN1Ym1pdHRlZFVybCBAb3V0cHV0CiAgICAgICAgICAgICAgICAgICAgICAgICAgQGZpbHRlcihvcDogInJlZ2V4IiwgdmFsdWU6IFsiJHNpdGVQYXR0ZXJuIl0pCgogICAgICBjb21tZW50IHsKICAgICAgICByZXBseSBAcmVjdXJzZShkZXB0aDogNSkgewogICAgICAgICAgY29tbWVudDogdGV4dFBsYWluIEBvdXRwdXQKCiAgICAgICAgICBieVVzZXIgewogICAgICAgICAgICBjb21tZW50ZXI6IGlkIEBvdXRwdXQKICAgICAgICAgICAgY29tbWVudGVyQmlvOiBhYm91dFBsYWluIEBvdXRwdXQKCiAgICAgICAgICAgICMgVGhlIHByb2ZpbGUgbXVzdCBoYXZlIGF0IGxlYXN0IG9uZQogICAgICAgICAgICAjIGxpbmsgdGhhdCBwb2ludHMgdG8gZWl0aGVyIEdpdEh1YiBvciBUd2l0dGVyLgogICAgICAgICAgICBsaW5rCiAgICAgICAgICAgICAgQGZvbGQKICAgICAgICAgICAgICBAdHJhbnNmb3JtKG9wOiAiY291bnQiKQogICAgICAgICAgICAgIEBmaWx0ZXIob3A6ICI%2BPSIsIHZhbHVlOiBbIiRtaW5Qcm9maWxlcyJdKQogICAgICAgICAgICB7CiAgICAgICAgICAgICAgY29tbWVudGVySURzOiB1cmwgQGZpbHRlcihvcDogInJlZ2V4IiwgdmFsdWU6IFsiJHNvY2lhbFBhdHRlcm4iXSkKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBAb3V0cHV0CiAgICAgICAgICAgIH0KICAgICAgICAgIH0KICAgICAgICB9CiAgICAgIH0KICAgIH0KICB9Cn0%3D&v=ewogICJzaXRlUGF0dGVybiI6ICJodHRwW3NdOi8vKFteLl0qXFwuKSpvcGVuYWkuY29tLy4qIiwKICAibWluUHJvZmlsZXMiOiAxLAogICJzb2NpYWxQYXR0ZXJuIjogIihnaXRodWJ8dHdpdHRlcilcXC5jb20vIgp9" rel="nofollow">https://play.predr.ag/hackernews#?f=1&q=IyBDcm9zcyBBUEkgcXVl...</a>

Trustfall is designed for interoperation from day 1. It separates the queries from the data providers, allowing the infrastructure to evolve and change how it serves queries without any of the queries noticing anything except faster execution. In practice, that means you don't have to rewrite your product to make it run faster -- which makes both the product side and the infra side happier :)

Here's a real-world example of that. The `cargo-semver-checks` Rust semantic versioning linter implements its lints as Trustfall queries over JSON files describing the package API, and I recently was able to speed up its execution by over 2000x without changing a single query -- just by changing how those queries execute under the hood. More details in my blog post here: <a href="https://predr.ag/blog/speeding-up-rust-semver-checking-by-over-2000x/" rel="nofollow">https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...</a>

AMA, I guess :)
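The separation of queries from data providers described above can be illustrated with a small sketch. This is not Trustfall's API: `ItemProvider`, `ScanProvider`, `IndexedProvider`, `names_owned_by`, and the sample `ROWS` are all hypothetical names made up for illustration. The point is only that the "query" is written once against an interface, and the provider behind it can be swapped for a faster one without touching the query.

```python
from typing import Iterable, Protocol


class ItemProvider(Protocol):
    """Hypothetical provider interface; not part of Trustfall."""

    def items_by_owner(self, owner: str) -> Iterable[dict]: ...


class ScanProvider:
    """Naive provider: scans every row on each lookup."""

    def __init__(self, rows: list[dict]) -> None:
        self.rows = rows

    def items_by_owner(self, owner: str) -> Iterable[dict]:
        return [row for row in self.rows if row["owner"] == owner]


class IndexedProvider:
    """Faster provider: builds an owner -> rows index once, up front."""

    def __init__(self, rows: list[dict]) -> None:
        self.index: dict[str, list[dict]] = {}
        for row in rows:
            self.index.setdefault(row["owner"], []).append(row)

    def items_by_owner(self, owner: str) -> Iterable[dict]:
        return self.index.get(owner, [])


def names_owned_by(provider: ItemProvider, owner: str) -> list[str]:
    """The 'query': it knows only the interface, not how lookups are served."""
    return sorted(item["name"] for item in provider.items_by_owner(owner))


# Made-up sample data for the sketch.
ROWS = [
    {"name": "parse", "owner": "alice"},
    {"name": "render", "owner": "bob"},
    {"name": "lint", "owner": "alice"},
]

if __name__ == "__main__":
    # Same query, two providers, identical results.
    print(names_owned_by(ScanProvider(ROWS), "alice"))
    print(names_owned_by(IndexedProvider(ROWS), "alice"))
```

Swapping `ScanProvider` for `IndexedProvider` changes how the lookup is served but leaves `names_owned_by` untouched, which is the property the comment attributes to the `cargo-semver-checks` speedup.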

sbt567 over 2 years ago

Online playground for querying Hacker News: <a href="https://play.predr.ag/hackernews" rel="nofollow">https://play.predr.ag/hackernews</a>

parhamn over 2 years ago

When I saw a demo of this I was blown away by how easy it is to add another data source. Great work, looking forward to using this soon.

wodenokoto over 2 years ago

Clicking through to the Python repo and looking at the examples there, I guess the query language is GraphQL.

Reading further, it seems that it is actually a variation of GraphQL.

I've never used GraphQL before, maybe one can get used to it, but it doesn't look very nice for defining data queries.

alexisread over 2 years ago

Nice! I wrote something similar for my workplace a couple of years ago based around Rx (as Rx has many implementations in different languages - we had a multi-node browser/server requirement, and optimization in a streaming DSL is easier than SQL, as you've done, as you can hint+order lazy materializations) and libs like <a href="https://pypi.org/project/lquery/" rel="nofollow">https://pypi.org/project/lquery/</a> to do pushdown queries.

Are you planning on doing reactive/live/materialized queries? You could use a config file to specify the live queries (in the Trustfall DSL) which can be fed to the engine on startup.

flanked-evergl over 2 years ago

You may be interested in RDF and SPARQL, which support federated queries. Much simpler than reinventing the wheel.

For more context, see <a href="https://ontop-vkg.org/guide/" rel="nofollow">https://ontop-vkg.org/guide/</a>

unshavedyak over 2 years ago

Could this be used to manage ingesting data from messier sources, à la files (PDF, etc.), web pages, and so on?

edit: Admittedly the website/video tends to talk about data sources in a hand-wavy way. I'd have loved some real-world examples of how one goes about adding a data source, and of how to handle problems of scale, i.e. passing filters to the data source rather than drinking from a firehose and filtering after the fact.

With that said, the idea is becoming interesting to me. At the very least I am liking the idea of a standardized query interface to "things". It just feels like edge cases might drown me.
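For readers unfamiliar with the scale concern raised here, the following is a minimal sketch of the difference between filtering after the fact and passing the filter to the data source. It says nothing about how Trustfall actually handles this; `CountingSource`, `filter_after`, `filter_pushed_down`, and the `RECORDS` sample are hypothetical and exist only to make the two strategies comparable.

```python
from typing import Iterator, Optional

# Made-up sample data: ids 0, 3 and 6 are stories, the rest are comments.
RECORDS = [
    {"id": i, "kind": "story" if i % 3 == 0 else "comment"} for i in range(9)
]


class CountingSource:
    """Hypothetical data source that counts how many records it hands out."""

    def __init__(self, records: list[dict]) -> None:
        self.records = records
        self.yielded = 0

    def fetch(self, kind: Optional[str] = None) -> Iterator[dict]:
        """Yield records; when `kind` is given, the source filters internally."""
        for record in self.records:
            if kind is not None and record["kind"] != kind:
                continue
            self.yielded += 1
            yield record


def filter_after(source: CountingSource, kind: str) -> list[dict]:
    """Firehose approach: pull everything, then filter on the caller's side."""
    return [record for record in source.fetch() if record["kind"] == kind]


def filter_pushed_down(source: CountingSource, kind: str) -> list[dict]:
    """Pushdown approach: hand the filter to the source itself."""
    return list(source.fetch(kind=kind))


if __name__ == "__main__":
    firehose, pushdown = CountingSource(RECORDS), CountingSource(RECORDS)
    same = filter_after(firehose, "story") == filter_pushed_down(pushdown, "story")
    print(same, firehose.yielded, pushdown.yielded)
```

Both functions return the same records, but the source in the pushdown case hands out only the matching ones. With a remote API in place of the in-memory list, that difference would show up as fewer requests or less data transferred.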

sdfhbdf over 2 years ago

Very interesting, sounds kinda like a GraphQL counterpart to <a href="https://github.com/cube2222/octosql">https://github.com/cube2222/octosql</a>

debarshri over 2 years ago

At first glance, the goal of the project feels very similar to SPARQL.

Somehow it didn't make it to the mainstream.

whoopdeepoo over 2 years ago

Already way too many tabs in the example. I could see this getting completely unreadable very quickly.

loveparade over 2 years ago

Isn't this the same as converting websites into a GraphQL API? Aren't there already dozens of projects and services that do the "convert websites into an API", at scale? What exactly is the innovation?

yevpats over 2 years ago

Also relevant - High Performance Open Source ELT Framework - <a href="https://github.com/cloudquery/cloudquery">https://github.com/cloudquery/cloudquery</a>

haolez over 2 years ago

This looks a lot like what Apache Drill already does.

darkteflon over 2 years ago

This looks super cool. Going to put this together with Dagster tonight.

@obi1kenobi, could you comment a bit on the motivation and background to this project?

srcreigh over 2 years ago

What join algorithms does this support? How many data sources are integrated so far (and what’s the MoM growth)? What’s the largest dataset this integrates with? And how many different people do you estimate have written queries in the language?

dilawar over 2 years ago

obligatory, "written in Rust".
