Show HN: Engraph – Automated ETL Pipelines

5 points by jleguina about 2 years ago
Hey HN, we’re Ross and Javier, co-founders of Engraph (www.engraph.ai). Our goal is to completely automate the process of building ETL pipelines, from ad hoc pipelines for question answering to fully fledged ETL pipelines within large organisations. For ad hoc pipelines, we offer a question answering platform that lets users ask questions in natural language about their organisation's data.

Traditionally, access to data within organisations is limited to a handful of data engineers. This means that if an employee needs access to some data, they have to go through a lengthy process of requesting it from a data engineer, who then spends their day dealing with ad hoc requests. This process is time-consuming and inefficient for everyone involved.

We solve this problem by passing natural language questions into a planner LLM that decomposes the task given the available data sources. This planner spawns workers that retrieve the relevant information from each individual data source, whether via SQL or via searches across vector embeddings of unstructured information.

Once the planner receives the data, it executes Python code to aggregate and operate on the data, and presents the answer to the user.

For persistent ETL pipelines, instead of inferring the format of the output data from a natural language question, users can provide a data output specification (e.g. YAML) through an API. This gets fed into a planner similar to the one above. However, instead of running Python code to return an answer, we load the relevant data into an intermediate data lake.

Internally (depending on the user’s preferences), we also use this API for the question answering platform: we learn from recurring patterns in the natural language questions (by clustering question embeddings), and implement persistent data storage that, in expectation, reduces the number of operations that need to be performed across the org’s data sources.
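
As a rough illustration of spotting recurring questions by clustering their embeddings, here is a minimal sketch. It is not Engraph's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and the greedy clustering, the 0.6 similarity threshold, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def embed(question: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(question.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_questions(questions, threshold=0.6):
    """Greedy clustering: a question joins the first cluster whose seed
    vector is at least `threshold` similar, else it starts a new cluster."""
    clusters = []  # list of (seed_vector, member_questions)
    for q in questions:
        vec = embed(q)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(q)
                break
        else:
            clusters.append((vec, [q]))
    return [members for _, members in clusters]


def recurring(questions, threshold=0.6, min_size=2):
    """Clusters large enough to be candidates for persistent storage."""
    return [c for c in cluster_questions(questions, threshold) if len(c) >= min_size]
```

Calling `recurring(...)` on a question log returns only the groups of similar questions that occur at least `min_size` times. In a setup like the one described, those groups would be the candidates for a persistent pipeline.
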
For the user there is a trade-off between the cost of the intermediate data storage and the cost/load of performing the same operations for every question.

You’re probably wondering how we manage security and data access. Our goal is to never have to access company data on our side. Ideally all operations are performed on-prem. We take data privacy very seriously and we are building our platform accordingly.

We charge a per-seat licence with lenient fair usage terms. We’d rather do this than charge per usage, which we feel could disincentivize the use of the platform.

If you're interested in trying out our platform, check out our demo video (https://youtu.be/Q8dNPQ8ofHk) or send us an email at {javier, ross}@engraph.ai. Thanks!
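
For readers who want something concrete, the planner and worker flow described in the post could look roughly like the sketch below. This is an illustration under stated assumptions, not Engraph's code: `plan` returns a fixed decomposition in place of a planner LLM, `search_worker` uses keyword overlap in place of a vector-embedding search, and the `Task` type, source names, and table schema are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Task:
    source: str  # which data source the worker should query
    query: str   # SQL text or a search phrase, depending on the source


def plan(question: str) -> list[Task]:
    """Hypothetical stand-in for the planner LLM: a hard-coded decomposition."""
    return [
        Task("sales_db", "SELECT region, SUM(amount) FROM sales GROUP BY region"),
        Task("notes", "refund policy"),
    ]


def sql_worker(conn: sqlite3.Connection, query: str) -> list[tuple]:
    """Worker for structured sources: run the SQL and return all rows."""
    return conn.execute(query).fetchall()


def search_worker(docs: list[str], phrase: str) -> list[str]:
    """Worker for unstructured sources: toy keyword overlap, not real vectors."""
    terms = set(phrase.lower().split())
    return [d for d in docs if terms & set(d.lower().split())]


def answer(question: str, conn: sqlite3.Connection, docs: list[str]) -> dict:
    """Dispatch each planned task to its worker, then aggregate in Python."""
    workers = {
        "sales_db": lambda q: sql_worker(conn, q),
        "notes": lambda q: search_worker(docs, q),
    }
    results = {t.source: workers[t.source](t.query) for t in plan(question)}
    region, total = max(results["sales_db"], key=lambda row: row[1])
    return {"top_region": region, "total": total, "related_notes": results["notes"]}


def demo() -> dict:
    """Build a tiny in-memory data set and answer one question against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("EMEA", 100.0), ("EMEA", 50.0), ("APAC", 120.0)],
    )
    docs = ["Refund policy updated in March", "Office party on Friday"]
    return answer("Which region sold the most, and what do our notes say about refunds?", conn, docs)
```

The point of the structure is that each worker only sees its own data source, and the combining step runs as ordinary Python over the returned results.
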

1 comment

bottlewhistler about 2 years ago
This is very interesting! Super cool features, I'm sure many non-technical stakeholders will find value in this solution.

On the security aspect: users will need to trust that you are only querying the schemas, and that you are not accessing the data in the databases? My understanding is that (regardless of the SQL flavor) you cannot grant access to read only the names of the schemas/tables without granting access to the data they contain.

Do you envision larger (corporate) customers not being able to subscribe to your service because of these security concerns? Do you think there is any way in which you can bypass this limitation, e.g. by installing an on-premise client which holds the access credentials and never shares them with your servers?