Show HN: We're building an open data warehouse inspired by Git scraping

5 points by liquorice over 1 year ago
Hey everyone, this is Jason and Nathan from https://subsets.io, a new open data warehouse. Our goal is to make public data easier to find and access, whether for human analysis, for use in apps, or as a source of up-to-date data for retrieval-augmented generation.

Inspired by git scraping [1], the core idea is that people don't upload a snapshot of their dataset directly, as you might on Kaggle or Huggingface. Instead, anyone can contribute code (connectors), which we then run continuously, making the fetched data available to everyone in our shared, public data warehouse. We currently have connectors for 120+ datasets, including an index of YC companies, U.S. house prices, and Wikipedia search volumes.

Separately, open data portals, such as those from NGOs, can be hard to use because they follow semantic web principles, i.e., representing data as a graph and adding structured metadata. We're taking a less structured approach: each dataset is just a table that you can download or query using SQL, and we're building a machine learning engine for ranking, pre-processing, and generating relevant subsets/views from the data warehouse.

We use BigQuery as the data warehouse and dagster for the data pipelines, running on top of Kubernetes. The frontend is Next.js. The data pipelines are currently centralised in our repo, but we're building our own engine where you can just upload simple scripts. Search is currently basic semantic search, with one big index that stores unique strings across tables, columns, and rows. We previously used LLMs for search, which gave better results, but the cost, latency, and rate limits mean we're still investigating the right way to go.

The project is in its very early stages, but we'd like to get some early feedback and find people who either want to help us build connectors or use the data to build something cool. The connectors are available at https://github.com/subsetsio/subsets-connectors, and you can visually explore the datasets and get your own free API key at https://www.subsets.io.

[1] https://simonwillison.net/2020/Oct/9/git-scraping/
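
For readers who want a concrete picture of the connector idea in the post, here is a minimal sketch. It is not the subsets.io code: `example_connector`, `refresh`, and `build_string_index` are hypothetical names, SQLite stands in for BigQuery so the example runs locally, and the index is a plain exact-match lookup rather than the semantic search the post mentions.

```python
import sqlite3
from typing import Callable, Iterable

Row = tuple[str, int]


def example_connector() -> list[Row]:
    """Hypothetical connector: returns rows for one dataset.

    A real connector would fetch from a public source; fixed sample rows
    are returned here so the sketch runs offline.
    """
    return [("alpha", 1), ("beta", 2), ("gamma", 3)]


def refresh(conn: sqlite3.Connection, table: str,
            connector: Callable[[], Iterable[Row]]) -> int:
    """Re-run a connector and replace the table with the freshly fetched rows."""
    rows = list(connector())
    with conn:  # commits on success, rolls back on error
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{table}" (name TEXT, value INTEGER)'
        )
        conn.execute(f'DELETE FROM "{table}"')
        conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)
    return len(rows)


def build_string_index(conn: sqlite3.Connection,
                       tables: Iterable[str]) -> dict[str, set[str]]:
    """Map each unique string (table name, column name, text cell) to its tables."""
    index: dict[str, set[str]] = {}
    for table in tables:
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        columns = [description[0] for description in cursor.description]
        strings = {table, *columns}
        for row in cursor:
            strings.update(value for value in row if isinstance(value, str))
        for text in strings:
            index.setdefault(text, set()).add(table)
    return index
```

Calling `refresh` on a schedule keeps each table current without anyone uploading snapshots, and every dataset stays a plain table that can be queried with SQL, e.g. `SELECT name FROM "sample" WHERE value > 1`.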

1 comment

j4yav over 1 year ago
Hey everyone, Jason from subsets here. Some of you may be wondering about pricing. For the core features, we are setting up a simple usage-based model in which we charge a small premium on top of bandwidth and query execution to cover our costs.

For now, the only focus is getting people to use it in a cost-neutral way; once we have potential customers to work with, we'll figure out the rest of the details. In the meantime, you're welcome to run our open source connectors on your own infrastructure if that works better for you, though we appreciate any support we get (whether through using our paid services or contributing code) as we aim to break even.
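
One possible reading of the usage-based model above, sketched as arithmetic. The comment gives no rates and does not say whether the premium is proportional or flat, so `usage_charge` and its `premium_rate` parameter are assumptions for illustration only.

```python
def usage_charge(bandwidth_cost: float, query_cost: float,
                 premium_rate: float) -> float:
    """Return the billed amount: underlying cost plus a proportional premium.

    Assumption: the premium is a fraction of the underlying cost
    (e.g. 0.1 for 10%). The actual pricing scheme is not specified.
    """
    if min(bandwidth_cost, query_cost, premium_rate) < 0:
        raise ValueError("costs and premium rate must be non-negative")
    return (bandwidth_cost + query_cost) * (1.0 + premium_rate)
```

With a premium rate of zero the charge equals the underlying cost.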