
Show HN: Engraph – Automated ETL Pipelines

5 points by jleguina about 2 years ago
Hey HN, we're Ross and Javier, co-founders of Engraph (www.engraph.ai). Our goal is to completely automate the process of building ETL pipelines, from ad hoc pipelines for question answering to fully fledged ETL pipelines within large organisations. For ad hoc pipelines, we offer a question-answering platform that lets users ask natural-language questions about their organisation's data.

Traditionally, access to data within organisations is limited to a handful of data engineers. If an employee needs some data, they have to go through a lengthy process of requesting it from a data engineer, who then spends their day dealing with ad hoc requests. This is time-consuming and inefficient for everyone involved.

We solve this by passing natural-language questions into a planner LLM that decomposes the task given the available data sources. The planner spawns workers that query the appropriate information from each individual data source, whether via SQL or via searches across vector embeddings of unstructured information.

Once the planner receives the data, it executes Python code to aggregate and operate on it, and presents the answer to the user.

For persistent ETL pipelines, instead of inferring the format of the output data from a natural-language question, users can provide a data output specification (e.g. YAML) through an API. This gets fed into a planner similar to the above; however, instead of running Python code to return an answer, we load the relevant data into an intermediate data lake.

Internally (depending on the user's preferences), we also use this API for the question-answering platform: we learn from recurring patterns in the natural-language questions (by clustering question embeddings) and implement persistent data storage that, in expectation, reduces the number of operations that need to be performed across the org's data sources.
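To make the planner/worker flow concrete, here is a minimal sketch in Python. The plan is hard-coded and the worker results are stubbed (the real system prompts an LLM and hits live SQL and vector-search backends); all names, queries, and data here are illustrative assumptions, not Engraph's actual code.

```python
from dataclasses import dataclass

# Hypothetical worker task emitted by the planner. Engraph's real planner
# is an LLM; here the "plan" is hard-coded to illustrate the shape.
@dataclass
class WorkerTask:
    source: str   # e.g. "warehouse" (SQL) or "wiki" (vector index)
    kind: str     # "sql" or "vector_search"
    payload: str  # SQL text, or a search query over embeddings

def plan(question: str) -> list[WorkerTask]:
    """Decompose a question into per-source worker tasks (stubbed)."""
    return [
        WorkerTask("warehouse", "sql",
                   "SELECT region, SUM(amount) FROM sales GROUP BY region"),
        WorkerTask("wiki", "vector_search",
                   "regional sales reporting conventions"),
    ]

def run_worker(task: WorkerTask) -> dict:
    # Each worker queries exactly one source; canned results stand in
    # for real SQL rows / embedding-search hits.
    if task.kind == "sql":
        return {"rows": [("EMEA", 120), ("APAC", 80)]}
    return {"passages": ["Sales are reported per region, monthly."]}

def answer(question: str) -> dict:
    results = [run_worker(t) for t in plan(question)]
    # Aggregation step: the post says planner-generated Python runs here.
    rows = results[0]["rows"]
    return {"total": sum(v for _, v in rows),
            "context": results[1]["passages"]}

print(answer("What were total sales by region last quarter?"))
```

The key structural point is that the planner owns decomposition and aggregation, while each worker only ever touches a single data source.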
For the user there is a trade-off between the cost of the intermediate data storage and the cost/load of performing the same operations for every question.

You may be wondering how we manage security and data access. Our goal is to never access company data on our side; ideally, all operations are performed on-prem. We take data privacy very seriously and we are building our platform accordingly.

We charge a per-seat licence with lenient fair-usage terms. We'd rather do this than charge per usage, which we feel could disincentivise use of the platform.

If you're interested in trying out our platform, check out our demo video (https://youtu.be/Q8dNPQ8ofHk) or send us an email at {javier, ross}@engraph.ai. Thanks!
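That storage-vs-recompute trade-off reduces to simple break-even arithmetic. The numbers below are made up for illustration and are not Engraph's pricing:

```python
# Illustrative break-even: when does materialising an intermediate
# table beat re-running the same source queries for every question?
storage_cost_per_month = 40.0   # keeping the intermediate data-lake table
cost_per_adhoc_query = 0.25     # compute/load of re-querying the sources

# Materialisation pays off once monthly question volume exceeds:
break_even_questions = storage_cost_per_month / cost_per_adhoc_query
print(break_even_questions)  # 160.0
```

Above that volume, the persistent store is cheaper; below it, recomputing per question wins, which is presumably what the clustering of question embeddings is trying to estimate.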

1 comment

bottlewhistler about 2 years ago
This is very interesting! Super cool features; I'm sure many non-technical stakeholders will find value in this solution.

On the security aspect: won't users need to trust that you are only querying the schemas, and not accessing the data in the databases? My understanding is that (regardless of the SQL flavour) you cannot grant access to read only the names of the schemas/tables without granting access to the data they contain.

Do you envision larger (corporate) customers being unable to subscribe to your service because of these security concerns? Do you think there is any way you can bypass this limitation, e.g. by installing an on-premise client which holds the access credentials and never shares them with your servers?
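The on-premise client the commenter proposes can be sketched: a local process that holds the credentials and exposes only schema metadata (table and column names), never row data. This is a hypothetical illustration using Python's stdlib sqlite3 as a stand-in for a production warehouse, not Engraph's actual architecture:

```python
import sqlite3

# In-memory database standing in for a customer warehouse that
# contains sensitive rows the vendor must never see.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (employee TEXT, amount REAL)")
conn.execute("INSERT INTO salaries VALUES ('alice', 95000.0)")

def schema_only(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Return {table: [columns]} without ever selecting row data."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    return {t: [col[1] for col in conn.execute(f"PRAGMA table_info({t})")]
            for t in tables}

# Only this metadata would leave the customer's network; the
# credentials and the connection itself stay on-prem.
print(schema_only(conn))  # {'salaries': ['employee', 'amount']}
```

Whether a vendor can be audited to guarantee the client issues only metadata queries is, of course, exactly the trust question raised above.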