Show HN: ScratchDB – Open-Source Snowflake on ClickHouse

261 points by memset, over 1 year ago

Hello! For the past year I’ve been working on a fully-managed data warehouse built on ClickHouse. I built this because I was frustrated with how much work was required to run an OLAP database in prod: rewriting my app to do batch inserts, managing clusters, and needing to look up special CREATE TABLE syntax every time I made a change. I found pricing for other warehouses confusing (what is a “credit” exactly?) and worried about getting capacity planning wrong.

I was previously building accounting software for firms with millions of transactions. I desperately needed to move from Postgres to an OLAP database but didn’t know where to start. I eventually built abstractions around ClickHouse: my application code called an insert() function, but in the background I had to stand up Kafka for streaming, bulk loading, DB drivers, and ClickHouse configs, and manage schema changes.

This was all a big distraction when all I wanted was to save data and get it back. So I decided to build a better developer experience around it. The software is open-source (https://github.com/scratchdata/ScratchDB) and the paid offering is a hosted version (https://www.scratchdb.com/).

It's called “ScratchDB” because the idea is to make it easy to get started from scratch. It’s a massively simpler abstraction on top of ClickHouse.

ScratchDB provides two endpoints [1]: one to insert data and another to query. When you send any JSON, it automatically creates tables and columns based on the structure [2]. Because table creation is automated, you can just start sending data and the system will just work [3]. It also means you can use Scratch as any webhook destination without prior setup [4,5]. When you query, just pass SQL as a query param and it returns JSON.

It handles streaming and bulk loading data. When data is inserted, I append it to a file on disk, which is then bulk loaded into ClickHouse. The overall goal is for the platform to automatically handle managing shards and replicas.

The whole thing runs on regular servers. Hetzner has become our cloud of choice, along with Backblaze B2 and SQS. It is written in Go. From an architecture perspective I try to keep things simple - want folks to make economical use of their servers.

So far ScratchDB has ingested about 2 TB of data and handled 4,000 requests/second on about $100 worth of monthly server costs.

Feel free to download it and play around - if you’re interested in this stuff then I’d love to chat! Really looking for feedback on what is hard about analytical databases and what would make the developer experience easier!

[1] https://scratchdb.com/docs
[2] https://scratchdb.com/blog/flatten-json/
[3] https://scratchdb.com/blog/scratchdb-email-signups/
[4] https://scratchdb.com/blog/stripe-data-ingest/
[5] https://scratchdb.com/blog/shopify-data-ingest/
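
To illustrate the idea of deriving columns from the structure of incoming JSON, here is a minimal Python sketch. It is not ScratchDB's code (the project is written in Go). The function names `flatten` and `merge_columns`, the underscore separator, and the choice to index array elements by position are all assumptions made for illustration; see [2] for how ScratchDB actually flattens JSON.

```python
import json
from typing import Any


def flatten(value: Any, prefix: str = "", sep: str = "_") -> dict[str, Any]:
    """Flatten nested JSON into a single-level dict of column name -> value.

    Hypothetical naming rule: nested keys are joined with `sep`, and list
    elements are addressed by their position.
    """
    flat: dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            name = f"{prefix}{sep}{key}" if prefix else str(key)
            flat.update(flatten(child, name, sep))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            name = f"{prefix}{sep}{index}" if prefix else str(index)
            flat.update(flatten(child, name, sep))
    else:
        # Scalar leaf: the accumulated prefix is the column name.
        flat[prefix] = value
    return flat


def merge_columns(known: list[str], row: dict[str, Any]) -> list[str]:
    """Return the known columns plus any new ones found in `row`, in order."""
    return known + [name for name in row if name not in known]


event = {"user": {"name": "Ada", "tags": ["a", "b"]}, "amount": 12.5}
row = flatten(event)
columns = merge_columns([], row)
print(json.dumps(row))
# {"user_name": "Ada", "user_tags_0": "a", "user_tags_1": "b", "amount": 12.5}
```

In a setup like the one described, the column list returned by `merge_columns` is what a service could compare against the existing table to decide whether new columns need to be added before loading the row.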

16 comments

CharlesW, over 1 year ago

Can you explain what "open-source Snowflake" means, since you don't explain it in this description, in the repo, or on the site?

Is your goal explicitly to replicate all Snowflake capabilities? https://docs.snowflake.com/en/user-guide/intro-supported-features

tbragin, over 1 year ago

Disclaimer: I work at ClickHouse.

Thank you! Looks really interesting!

I personally agree that real-time OLAP databases have the potential to better serve workloads currently in Postgres or cloud data warehouses that need real-time ingest and analytical queries. And simplifying the developer experience on top of that, so you don't have to learn all the details of a powerful database, really speeds up developer velocity.

I'm curious how you see your project differing from GraphJSON (https://www.graphjson.com/) and Tinybird (https://www.tinybird.co/).

Congratulations again on the launch!

giovannibonetti, over 1 year ago

Great product! Thanks for sharing it!

Question: I thought ClickHouse already had native support for flattening JSON [1], although it was released recently (version 22.3.1). Did you start working on it [2] before that? Or is it a different take? I'm curious about the pros and cons of each one.

[1] https://clickhouse.com/docs/en/integrations/data-formats/json#semi-structured-approach
[2] https://scratchdb.com/blog/flatten-json/

tiffanyh, over 1 year ago

AGPL-3.0 license, for those wondering.

throwaway295729, over 1 year ago

Congrats on the release! Can this be used for log data? How long is ingested data kept?

pitah1, over 1 year ago

Thanks for sharing. Looks very clean and simple to use.

Do you plan on supporting non-JSON data types for insertion? For example, inserting CSV files, Parquet files, or Avro or Protobuf messages?

didip, over 1 year ago

You should submit your benchmarks to ClickBench.

shrubble, over 1 year ago

What does the license mean, if I don't change any of the code you provide, but use it to provide a public-facing service? Like if I use it for a forum, for instance, but am using a separate bit of code to push data into and retrieve out of ScratchDB?

ddorian43, over 1 year ago

Why is your storage 10X that of BigQuery? How does your compute price compare to BigQuery?

Edit: bigtable->bigquery

gbrits, over 1 year ago

Congrats on the launch. This looks great. Inferring schemas on the fly is awesome for getting started quickly, but are there ways to explicitly define a schema if I wanted to? For example, I'm thinking of setting column-specific compression.

jed_sanders12, over 1 year ago

This looks great. I have one question. When you are automatically creating tables, how do you choose the primary key order for the ClickHouse table?

anon3949494, over 1 year ago

Just signed up but didn't receive a confirmation email. Are you currently accepting new sign-ups for the managed service?

yoav, over 1 year ago

I love everything about your story and what you built. I'm in the process of doing something similar.

Nice work!

OmarAssadi, over 1 year ago

> The whole thing runs on regular servers. Hetzner has become our cloud of choice, along with Backblaze B2 and SQS. It is written in Go. From an architecture perspective I try to keep things simple - want folks to make economical use of their servers.

Cool, glad to see Hetzner, at least presumably for compute, rather than the almost routine, absurdly expensive, mega cloud providers.

I have a few questions if you've got time.

1. What made you pick Hetzner in particular, and did you evaluate any of their primary competitors (e.g., OVH)?

2. In your $100/month figure, did you decide to go with dedicated servers or the "cloud" VPS line? If the latter, was there any particular reason over going with the bare-metal offerings?

3. Are you making use of Hetzner's U.S. servers as well or is everything currently in Europe (or vice-versa)?

4. Was there any particular reason for choosing B2 and SQS as opposed to self-hosting object-storage on the SX servers?

Normally, I wouldn't even wonder why someone wouldn't want the burden of more infrastructure. But given the choice of going with relatively unmanaged Hetzner servers, presumably self-hosting ClickHouse, etc, and then with your compute provider also happening to offer fairly large storage servers on the cheap, I might've been tempted to cut out the additional providers and DIY it:

- less costly for large amounts of data
- zero lock-in [1]
- fewer companies to deal with
  - likely better negotiating power with Hetzner when the time comes if a bigger percentage of your overhead is with them as opposed to spread out across three providers
  - fewer points of failure; if the Hetzner servers are down, I would assume you're in trouble anyway, so perhaps keeping [most] of your eggs on the same network might not be as bad as it sounds
  - presumably better latency and bandwidth + the ability to communicate over a private network [2]

5. I see the license is AGPL. But I don't see the usual "you must dual-license all contributions under MIT/BSD/ISC as well [so that only we can re-license the project]" nor "before contributing, sign this agreement transferring copyright [and your first born child]".

Was this just an oversight, or do you intend to be one of the few SaaS companies that really truly is open-source rather than "open-source" [until people are locked in] and then going "open"-core? If the latter, then awesome -- cool to see.

6. Any regrets, disasters, or lessons learned so far? Usually, I find these stories the most interesting but unfortunately too few are willing to share.

---

[1]: I know B2 provides a relatively standard, at this point, S3-compatible API and everything as well. But I think there is also still something to be said about a somewhat Juche-esque approach to infrastructure, wherein should prices rise, contracts change, service degrade, or whatever else, you'd have the ability to switch at a moment's notice to literally anyone else who can lease you a box with some hard drives or any colo provider.

[2]: This goes out the window somewhat if you're using the VPS line and American servers, though.

esafak, over 1 year ago

TiDB is an HTAP database whose OLAP component (TiFlash) was based on ClickHouse: https://news.ycombinator.com/item?id=23584022

If you have analyzed the competition, what are your selling points? Benchmarks welcome. Thank you!

wkoszek, over 1 year ago

ScratchDB has saved my business and it's awesome. I think if you need a columnar store, you should really try these guys.
