12 comments
mritchie712 · about 1 year ago

> Recently, the most interesting rift in the Postgres vs OLAP space is Hydra (https://www.hydra.so), an open-source, column-oriented distribution of Postgres that was very recently launched (after our migration to ClickHouse). Had Hydra been available during our decision-making period, we might have made a different choice.

There will likely be a good OLAP solution (possibly implemented as an extension) in Postgres within the next year or so. A few companies are working on it (Hydra, ParadeDB [0], Tembo, etc.).

[0] https://www.paradedb.com/
joshstrange · about 1 year ago

With all the ClickHouse praise on HN, I feel like we /must/ be doing something fundamentally wrong, because I hate every interaction I have with ClickHouse.

* Timeouts (only 30s???) unless I used the CLI client

* Cancelling rows: just kill me. So many bugs, and FINAL/PREWHERE are massive foot-guns.

* The cluster just feels annoying and fragile; don't forget "ON CLUSTER" or you'll have a bad time.

Again, I feel like we must be doing something wrong, but we are paying an arm and a leg for that "privilege".
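For readers who haven't hit these, a minimal sketch of the two foot-guns mentioned; the table, column, and cluster names here are hypothetical, not from the post:

```sql
-- DDL without ON CLUSTER only runs on the node you are connected to;
-- forgetting it leaves replicas with divergent schemas.
ALTER TABLE events ON CLUSTER my_cluster
    ADD COLUMN plan_code String DEFAULT '';

-- With ReplacingMergeTree-style tables, FINAL forces deduplication
-- at query time so you see merged results, but it can be much slower
-- than a plain SELECT.
SELECT customer_id, sum(amount)
FROM events FINAL
GROUP BY customer_id;
```

Omitting either of these doesn't fail loudly; queries simply return stale or duplicated rows, which is what makes them foot-guns.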
HermitX · about 1 year ago

Is ClickHouse a suitable engine for analyzing events? Absolutely: as long as you're analyzing a single large table, its speed is definitely fast enough. However, you might want to consider the cost of maintaining an OSS ClickHouse cluster, especially when you need to scale up, as the operational costs can be quite high.

If your analysis in Postgres was based on multiple tables and required a lot of JOIN operations, I don't think ClickHouse is a good choice. In such cases, you often need to denormalize multiple tables into one large table in advance, which means complex ETL and maintenance costs.

For these more common scenarios, I think StarRocks (www.StarRocks.io) is a better choice. It's a Linux Foundation open-source project with single-table query speeds comparable to ClickHouse (see ClickBench), unmatched multi-table join query speeds, and the ability to query open data lakes directly.
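A sketch of what that denormalization looks like in ClickHouse terms, with hypothetical `events` and `customers` tables (none of these names are from the post):

```sql
-- A pre-joined wide table, so dashboards never need the JOIN at
-- query time. Customer attributes are copied into every event row.
CREATE TABLE events_wide
(
    event_time    DateTime,
    event_type    LowCardinality(String),
    customer_id   UInt64,
    customer_name String,   -- denormalized from customers
    customer_plan String    -- denormalized from customers
)
ENGINE = MergeTree
ORDER BY (customer_id, event_time);

-- Kept in sync by the ingestion pipeline (the "complex ETL" the
-- comment mentions), e.g. rebuilt or appended to periodically:
INSERT INTO events_wide
SELECT e.event_time, e.event_type, e.customer_id, c.name, c.plan
FROM events AS e
INNER JOIN customers AS c ON c.id = e.customer_id;
```

The maintenance cost the comment refers to is exactly this second step: every change to `customers` has to be propagated into the wide table by your own pipeline.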
breadchris · about 1 year ago

ClickHouse is awesome, but as the post shows, some code is involved in getting the data there.

I have been working on Scratchdata [1], which makes it easy to try out a column database to optimize aggregation queries (avg, sum, max). We have helped people [2] take their Postgres with 1 billion rows of data (1.5 TB) and significantly reduce their real-time analysis query times. Because their data was stored more efficiently, they also saved on their storage bill.

You can send data as a curl request and it will get batch-processed and flattened into ClickHouse:

    curl -X POST "http://app.scratchdata.com/api/data/insert/your_table?api_key=xxx" --data '{"user": "alice", "event": "click"}'

The founder, Jay, is super nice and just wants to help people save time and money. If you give us a ring, he or I will personally help you [3].

[1] https://www.scratchdb.com/
[2] https://www.scratchdb.com/blog/embeddables/
[3] https://q29ksuefpvm.typeform.com/to/baKR3j0p?typeform-source=www.scratchdb.com#source=hero
alooPotato · about 1 year ago

We use BigQuery a lot for internal analytics and we've been super happy. I don't see a lot of love for BigQuery on HN and I wonder why. Tons of features, no hassle, and it's easy to throw a bunch of TB at it.

I guess maybe the cost?
drewda · about 1 year ago

This change may make sense for Lago as a hosted multi-tenant service, as offered by Lago the company.

Simultaneously, this change may *not* make sense for Lago as an open-source project self-hosted by a single tenant.

But that may also mean that it effectively makes sense for Lago as a business... to make it harder to self-host.

I don't at all fault Lago for making decisions that prioritize their multi-tenant cloud offering. That's probably just the nature of running open-source SaaS these days.
stephen123 · about 1 year ago

How were they doing millions of events per minute with Postgres?

I'm struggling with pg write performance at the moment and would like some tips.
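Not from the article, but the usual first steps for Postgres write throughput are batching rows per statement and switching to COPY; a sketch with a hypothetical `events` table:

```sql
-- One INSERT per event pays per-statement and per-commit overhead:
INSERT INTO events (occurred_at, name, payload)
VALUES (now(), 'click', '{}');

-- Batching many rows per statement amortizes that overhead:
INSERT INTO events (occurred_at, name, payload)
VALUES (now(), 'click', '{}'),
       (now(), 'view',  '{}'),
       (now(), 'click', '{}');

-- COPY streams rows in bulk and is usually the fastest ingest path:
COPY events (occurred_at, name, payload) FROM STDIN WITH (FORMAT csv);
```

Dropping unneeded indexes on the hot table and committing in larger transactions tend to be the next levers after batching.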
mathnode · about 1 year ago

And if you use MariaDB, just enable ColumnStore. Why not treat yourself to S3-backed storage while you're there?

It is extremely cost-effective when you can scale a different workload without migrating.
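For anyone unfamiliar, a minimal sketch of what "enable columnstore" means in MariaDB, assuming the ColumnStore engine plugin is installed; the table and column names are illustrative:

```sql
-- Columnar tables are just another storage engine; they live
-- alongside ordinary InnoDB tables in the same server.
CREATE TABLE events_analytics (
    occurred_at DATETIME,
    event_name  VARCHAR(64),
    amount      DECIMAL(18, 2)
) ENGINE = ColumnStore;

-- The same SQL dialect then serves both the OLTP (InnoDB) and
-- analytical (ColumnStore) workloads, which is the "no migration"
-- point the comment makes:
SELECT event_name, SUM(amount)
FROM events_analytics
GROUP BY event_name;
```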
samber · about 1 year ago

I'm curious: how many rows does Lago store in its CH cluster? Do they collect data for fighting fraud?

PG can handle a billion rows easily.
jackbauer24 · about 1 year ago

Scale is becoming more and more important, not just for cost, but also as a key technology feature that helps deal with unexpected traffic and reduces the cost of manual operations.
andretti1977 · about 1 year ago

I have a tangentially related question, since I don't use an OLAP DB: is deleting data so hard to perform? Is it necessarily immutable storage?

If so, is it a GDPR-compliant storage solution? I'm asking because GDPR compliance may require data deletion (or at least anonymization).
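In ClickHouse specifically, deletion is possible but asynchronous, which is typically how GDPR erasure requests are handled there; a sketch with a hypothetical `events` table:

```sql
-- A mutation: rewrites the affected data parts in the background,
-- so the rows do eventually disappear from disk, just not instantly.
ALTER TABLE events DELETE WHERE user_id = 42;

-- Newer ClickHouse versions also support lightweight deletes, which
-- mask rows immediately and clean them up during later merges:
DELETE FROM events WHERE user_id = 42;
```

So the storage isn't strictly immutable; deletes are simply expensive background operations rather than the cheap row-level operations Postgres users expect.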
dangoodmanUT · about 1 year ago

deleting this comment because apparently jokes are not received well here