TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Show HN: ScratchDB – Open-Source Snowflake on ClickHouse

261 点作者 memset超过 1 年前
Hello! For the past year I’ve been working on a fully-managed data warehouse built on Clickhouse. I built this because I was frustrated with how much work was required to run an OLAP database in prod: re-writing my app to do batch inserts, managing clusters and needing to look up special CREATE TABLE syntax every time I made a change. I found pricing for other warehouses confusing (what is a “credit” exactly?) and worried about getting capacity-planning wrong.<p>I was previously building accounting software for firms with millions of transactions. I desperately needed to move from Postgres to an OLAP database but didn’t know where to start. I eventually built abstractions around Clickhouse: My application code called an insert() function but in the background I had to stand up Kafka for streaming, bulk loading, DB drivers, Clickhouse configs, and manage schema changes.<p>This was all a big distraction when all I wanted was to save data and get it back. So I decided to build a better developer experience around it. The software is open-source: <a href="https:&#x2F;&#x2F;github.com&#x2F;scratchdata&#x2F;ScratchDB">https:&#x2F;&#x2F;github.com&#x2F;scratchdata&#x2F;ScratchDB</a> and and the paid offering is a hosted version: <a href="https:&#x2F;&#x2F;www.scratchdb.com&#x2F;">https:&#x2F;&#x2F;www.scratchdb.com&#x2F;</a>.<p>It&#x27;s called “ScratchDB” because the idea is to make it easy to get started from scratch. It’s a massively simpler abstraction on top of Clickhouse.<p>ScratchDB provides two endpoints [1]: one to insert data and another to query. When you send any JSON, it automatically creates tables and columns based on the structure [2]. Because table creation is automated, you can just start sending data and the system will just work [3]. It also means you can use Scratch as any webhook destination without prior setup [4,5]. When you query, just pass SQL as a query param and it returns JSON.<p>It handles streaming and bulk loading data. When data is inserted, I append it to a file on disk, which is then bulk loaded into Clickhouse. The overall goal is for the platform to automatically handle managing shards and replicas.<p>The whole thing runs on regular servers. Hetzner has become our cloud of choice, along with Backblaze B2 and SQS. It is written in Go. From an architecture perspective I try to keep things simple - want folks to make economical use of their servers.<p>So far ScratchDB has ingested about 2 TB of data and 4,000 requests&#x2F;second on about $100 worth of monthly server costs.<p>Feel free to download it and play around - if you’re interested in this stuff then I’d love to chat! Really looking for feedback on what is hard about analytical databases and what would make the developer experience easier!<p>[1] <a href="https:&#x2F;&#x2F;scratchdb.com&#x2F;docs">https:&#x2F;&#x2F;scratchdb.com&#x2F;docs</a><p>[2] <a href="https:&#x2F;&#x2F;scratchdb.com&#x2F;blog&#x2F;flatten-json&#x2F;">https:&#x2F;&#x2F;scratchdb.com&#x2F;blog&#x2F;flatten-json&#x2F;</a><p>[3] <a href="https:&#x2F;&#x2F;scratchdb.com&#x2F;blog&#x2F;scratchdb-email-signups&#x2F;">https:&#x2F;&#x2F;scratchdb.com&#x2F;blog&#x2F;scratchdb-email-signups&#x2F;</a><p>[4] <a href="https:&#x2F;&#x2F;scratchdb.com&#x2F;blog&#x2F;stripe-data-ingest&#x2F;">https:&#x2F;&#x2F;scratchdb.com&#x2F;blog&#x2F;stripe-data-ingest&#x2F;</a><p>[5] <a href="https:&#x2F;&#x2F;scratchdb.com&#x2F;blog&#x2F;shopify-data-ingest&#x2F;">https:&#x2F;&#x2F;scratchdb.com&#x2F;blog&#x2F;shopify-data-ingest&#x2F;</a>

16 条评论

CharlesW超过 1 年前
Can you explain what &quot;open-source Snowflake&quot; means, since you don&#x27;t explain it in this description, in the repo, or on the site?<p>Is your goal explicitly to replicate all Snowflake capabilities? <a href="https:&#x2F;&#x2F;docs.snowflake.com&#x2F;en&#x2F;user-guide&#x2F;intro-supported-features" rel="nofollow noreferrer">https:&#x2F;&#x2F;docs.snowflake.com&#x2F;en&#x2F;user-guide&#x2F;intro-supported-fea...</a>
评论 #38043068 未加载
tbragin超过 1 年前
Disclaimer: I work at ClickHouse.<p>Thank you! Looks really interesting!<p>I personally agree that real-time OLAP databases have potential to better serve workloads currently in Postgres or cloud data warehouses that need real-time ingest and analytical queries. And simplifying developer experience on top of that, so you don&#x27;t have to learn about all the details of a powerful database, really speeds up developer velocity.<p>I&#x27;m curious, how you see your project differs from GraphJSON (<a href="https:&#x2F;&#x2F;www.graphjson.com&#x2F;" rel="nofollow noreferrer">https:&#x2F;&#x2F;www.graphjson.com&#x2F;</a>) and Tinybird (<a href="https:&#x2F;&#x2F;www.tinybird.co&#x2F;" rel="nofollow noreferrer">https:&#x2F;&#x2F;www.tinybird.co&#x2F;</a>)?<p>Congratulations again on the launch!
评论 #38042285 未加载
评论 #38043350 未加载
giovannibonetti超过 1 年前
Great product! Thanks for sharing it!<p>Question: I thought Clickhouse already has native support for flattening JSON [1], although it was released recently (version 22.3.1). Did you start working on it [2] before that? Or is it a different take? I&#x27;m curious about the pros and cons of each one.<p>[1] <a href="https:&#x2F;&#x2F;clickhouse.com&#x2F;docs&#x2F;en&#x2F;integrations&#x2F;data-formats&#x2F;json#semi-structured-approach" rel="nofollow noreferrer">https:&#x2F;&#x2F;clickhouse.com&#x2F;docs&#x2F;en&#x2F;integrations&#x2F;data-formats&#x2F;jso...</a> [2] <a href="https:&#x2F;&#x2F;scratchdb.com&#x2F;blog&#x2F;flatten-json&#x2F;">https:&#x2F;&#x2F;scratchdb.com&#x2F;blog&#x2F;flatten-json&#x2F;</a>
评论 #38043367 未加载
评论 #38045224 未加载
tiffanyh超过 1 年前
AGPL-3.0 license, for those wondering.
评论 #38046623 未加载
throwaway295729超过 1 年前
Congrats on the release! Can this be used for log data? How long is ingested data kept?
评论 #38038597 未加载
pitah1超过 1 年前
Thanks for sharing. Looks very clean and simple to use.<p>Do you plan on supporting non-JSON data types for insertion? For example, inserting CSV files, parquet files, Avro or Protobuf messages?
评论 #38045314 未加载
didip超过 1 年前
You should submit your benchmarks to ClickBench.
shrubble超过 1 年前
What does the license mean, if I don&#x27;t change any of the code you provide, but use it to provide a public-facing service? Like if I use it for a forum, for instance, but am using a separate bit of code to push data into and retrieve out of ScratchDB?
ddorian43超过 1 年前
Why is your storage 10X that of bigquery? How does your compute price compare to bigtable?<p>Edit: bigtable-&gt;bigquery
评论 #38042927 未加载
gbrits超过 1 年前
Congrats with the launch. This looks great. Inferring schemas on the fly is awesome to get started quickly, but are there ways to explicitly define a schema if I wanted to? For example, thinking of setting column specific compression
评论 #38045219 未加载
jed_sanders12超过 1 年前
This looks great. I have one question. When you are automatically creating tables, how do you choose primary keys order for clickhouse table?
评论 #38043402 未加载
anon3949494超过 1 年前
Just signed up but didn&#x27;t receive a confirmation email. Are you currently accepting new sign-ups for the managed service?
评论 #38049473 未加载
yoav超过 1 年前
I love everything about your story and what you built. In the process of doing something similar.<p>Nice work!
OmarAssadi超过 1 年前
&gt; The whole thing runs on regular servers. Hetzner has become our cloud of choice, along with Backblaze B2 and SQS. It is written in Go. From an architecture perspective I try to keep things simple - want folks to make economical use of their servers.<p>Cool, glad to see Hetzner, at least presumably for compute, rather than the almost routine, absurdly expensive, mega cloud providers.<p>I have a few questions if you&#x27;ve got time.<p>1. What made you pick Hetzner in particular, and did you evaluate any of their primary competitors? (e.g., OVH, etc)<p>2. In your $100&#x2F;month figure, did you decide to go with dedicated servers or the &quot;cloud&quot; VPS line? If the latter, was there any particular reason over going with the bare-metal offerings?<p>3. Are you making use of Hetzner&#x27;s U.S. servers as well or is everything currently in Europe (or vice-versa)?<p>4. Was there any particular reason for choosing B2 and SQS as opposed to self-hosting object-storage on the SX servers?<p>Normally, I wouldn&#x27;t even wonder why someone wouldn&#x27;t want the burden of more infrastructure. But given the choice of going with relatively unmanaged Hetzner servers, presumably self-hosting clickhouse, etc, and then with your compute provider also happening to offer fairly large storage servers on the cheap, I might&#x27;ve been tempted to cut out the additional providers and DIY it:<p>- less costly for large amounts of data<p>- zero lock-in [1]<p>- fewer companies to deal with<p><pre><code> - likely better negotiating power with Hetzner when the time comes if a bigger percentage of your overhead is with them as opposed to spread out across three providers - fewer points of failure; if the Hetzner servers are down, I would assume you&#x27;re in trouble anyway, so perhaps keeping [most] of your eggs on the same network might not be as bad as it sounds - presumably better latency and bandwidth + the ability to communicate over a private network [2] </code></pre> 5. I see the license is AGPL. But I don&#x27;t see the usual &quot;you must dual-license all contributions under MIT&#x2F;BSD&#x2F;ISC as well [so that only we can re-license the project]&quot; nor &quot;before contributing, sign this agreement transferring copyright [and your first born child]&quot;.<p>Was this just an oversight, or do you intend to be one of the few SaaS companies that really truly is open-source rather than <i>&quot;open-source&quot;</i> [until peopled are locked-in] and then going <i>&quot;open&quot;</i>-core? If the latter, then awesome -- cool to see.<p>6. Any regrets, disasters, or lessons learned so far? Usually, I find these stories the most interesting but unfortunately too few are willing to share.<p>---<p>[1]: I know B2 provides a relatively standard, at this point, S3-compatible API and everything as well. But I think there is also still something to be said about a somewhat Juche-esque approach to infrastructure, wherein should prices rise, contracts change, service degrades, or whatever else, you&#x27;d have the ability to almost immediately switch at a moment&#x27;s notice to literally anyone else who can lease you a box with some hard drives or any colo provider.<p>[2]: This goes out the window somewhat if you&#x27;re using the VPS line and American servers, though.
评论 #38046592 未加载
esafak超过 1 年前
TiDB is an HTAP whose OLAP component (TiFlash) was based on Clickhouse: <a href="https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=23584022">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=23584022</a><p>If you have analyzed the competition, what are your selling points? Benchmarks welcome. Thank you!
评论 #38042422 未加载
评论 #38045291 未加载
wkoszek超过 1 年前
ScratchDB has save my business and it&#x27;s awesome. I think if you need a columnar store, you should really try these guys
评论 #38042966 未加载
评论 #38042557 未加载