Hey HN, we’re Ross and Javier, co-founders of Engraph (www.engraph.ai). Our goal is to fully automate the process of building data pipelines, from ad hoc pipelines for question answering to fully fledged ETL pipelines within large organisations:
For ad hoc pipelines, we offer a question-answering platform that enables users to ask questions in natural language about their organisation’s data.

Traditionally, access to data within organisations is limited to a handful of data engineers. If an employee needs access to some data, they have to go through a lengthy process of requesting it from a data engineer, who then spends their day dealing with ad hoc requests. This process is time-consuming and inefficient for everyone involved.

We solve this problem by passing natural-language questions to a planner LLM that decomposes the task given the available data sources. The planner spawns workers that query the appropriate information from each individual data source, whether via SQL or searches across vector embeddings of unstructured information.

Once the planner receives the data, it executes Python code to aggregate and operate on the data, and presents the answer to the user.

For persistent ETL pipelines, instead of inferring the format of the output data from a natural-language question, users can provide a data output specification (e.g. YAML) through an API. This gets fed into a planner similar to the above; however, instead of running Python code to return an answer, we load the relevant data into an intermediate data lake.

Internally (depending on the user’s preferences), we also use this API for the question-answering platform: we learn from recurring patterns in the natural-language questions (by clustering question embeddings) and implement persistent data storage that, in expectation, reduces the number of operations that need to be performed across the org’s data sources. For the user there is a trade-off between the cost of the intermediate data storage and the cost/load of performing the same operations for every question.

You may be wondering how we manage security and data access. Our goal is to never have to access company data on our side.
Ideally, all operations are performed on-prem. We take data privacy very seriously and are building our platform accordingly.

We charge a per-seat licence with lenient fair-usage terms. We’d rather do this than charge per usage, which we feel could disincentivise use of the platform.

If you’re interested in trying out our platform, check out our demo video (https://youtu.be/Q8dNPQ8ofHk) or send us an email at {javier, ross}@engraph.ai. Thanks!
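To make the planner/worker flow described above more concrete, here is a minimal Python sketch. Every name in it (SubTask, plan_question, run_worker, the canned rows) is illustrative and not Engraph’s actual API; a real planner would call an LLM with a catalogue of the available data sources rather than return hard-coded sub-tasks.

```python
# Hypothetical sketch of the planner/worker decomposition: a planner breaks a
# natural-language question into sub-tasks, one worker per data source, then
# the results are aggregated in Python.
from dataclasses import dataclass


@dataclass
class SubTask:
    source: str  # which data source to hit
    kind: str    # "sql" for structured data, "vector" for embedded documents
    query: str   # SQL text, or a natural-language search string


def plan_question(question: str) -> list[SubTask]:
    """Stand-in for the planner LLM: decompose a question into sub-tasks.
    (Hard-coded here; a real planner would generate these with an LLM.)"""
    return [
        SubTask(source="warehouse", kind="sql",
                query="SELECT region, SUM(revenue) FROM sales GROUP BY region"),
        SubTask(source="wiki", kind="vector",
                query="definition of 'active region' used by finance"),
    ]


def run_worker(task: SubTask):
    """Each worker queries one source; here we return canned rows."""
    if task.kind == "sql":
        return [("EMEA", 120), ("APAC", 95)]
    return ["An active region is one with >10 monthly transactions."]


def answer(question: str):
    tasks = plan_question(question)
    results = [run_worker(t) for t in tasks]
    # The planner would then execute Python over `results` to aggregate and
    # present an answer; here we just return the raw pieces.
    return results


print(answer("What was revenue by active region last quarter?"))
```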
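The persistent-pipeline path takes a data output specification instead of a question. The schema below is invented for illustration (Engraph’s real spec format may differ); it shows the spec as the Python dict a YAML parser would produce, plus the kind of minimal validation a planner might run before building a pipeline.

```python
# Hypothetical data-output spec, as a YAML parser (e.g. yaml.safe_load)
# would deserialise it. All field names are invented for illustration.
output_spec = {
    "pipeline": "weekly_revenue_by_region",
    "schedule": "0 6 * * MON",  # cron-style refresh cadence
    "columns": [
        {"name": "region",  "type": "string"},
        {"name": "revenue", "type": "decimal"},
        {"name": "week",    "type": "date"},
    ],
    "destination": {"lake": "intermediate", "format": "parquet"},
}


def validate_spec(spec: dict) -> list[str]:
    """Minimal sanity checks before the planner builds a pipeline."""
    errors = []
    for key in ("pipeline", "columns", "destination"):
        if key not in spec:
            errors.append(f"missing required key: {key}")
    for col in spec.get("columns", []):
        if not {"name", "type"} <= col.keys():
            errors.append(f"incomplete column entry: {col}")
    return errors


print(validate_spec(output_spec))  # an empty list means the spec is usable
```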
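The recurring-pattern idea (clustering question embeddings to decide what to materialise in the intermediate data lake) can be sketched with a greedy similarity clustering. The tiny hand-made vectors and the 0.9 threshold are stand-ins; in practice the embeddings would come from an embedding model, and the clustering method may differ.

```python
# Sketch: group past questions by embedding similarity; a cluster that keeps
# recurring is a candidate for a persistent pipeline instead of recomputing
# the same operations per question.
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def cluster(embeddings, threshold=0.9):
    """Greedy single-pass clustering: assign each vector to the first cluster
    whose representative is similar enough, else start a new cluster."""
    clusters = []  # list of (representative_vector, member_indices)
    for i, e in enumerate(embeddings):
        for rep, members in clusters:
            if cosine(rep, e) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((e, [i]))
    return clusters


questions = [
    "revenue by region last week",   # 0
    "revenue by region this week",   # 1
    "headcount in engineering",      # 2
]
embeds = [(0.9, 0.1), (0.88, 0.12), (0.1, 0.95)]  # stand-in 2-d embeddings
groups = cluster(embeds)
print([members for _, members in groups])  # → [[0, 1], [2]]
```

Questions 0 and 1 land in the same cluster, so their shared aggregation is worth persisting; question 2 stays ad hoc.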