Also wanted to share my overall positive experience with Clickhouse.<p>UPSIDES<p>* started a 3-node cluster using the official Docker images super quickly<p>* ingested billions of rows super fast<p>* great compression (of course, depends on your data's characteristics)<p>* features like <a href="https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree/" rel="nofollow">https://clickhouse.tech/docs/en/engines/table-engines/merget...</a> are amazing to see<p>* ODBC support. I initially said "Who uses that??", but we used it to connect PostgreSQL and so we can keep the non-timeseries data in PostgreSQL but still access PostgreSQL tables in Clickhouse (!)<p>* you can go the other way too: read Clickhouse from PostgreSQL (see <a href="https://github.com/Percona-Lab/clickhousedb_fdw" rel="nofollow">https://github.com/Percona-Lab/clickhousedb_fdw</a>, although we didn't try this)<p>* PRs welcome, and quickly reviewed. (We improved the ODBC UUID support)<p>* code quality is pretty high.<p>DOWNSIDES<p>* limited JOIN capabilities, which is expected from a timeseries-oriented database like Clickhouse. It's almost impossible to implement JOINs at this kind of scale. The philosophy is "If it won't be fast at scale, we don't support it"<p>* not-quite-standard SQL syntax, but they've been improving it<p>* limited DELETE support, which is also expected from this kind of database, but rarely used in the kinds of environments that CH usually runs in (how often do people delete data from ElasticSearch?)<p>It's really an impressive piece of engineering. Hats off to the Yandex crew.
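For anyone wondering why the AggregatingMergeTree engine linked above is worth a look: the idea is that rows sharing a key are stored as partial aggregate states, and those states are combined when parts are merged, so queries do not need to rescan the raw rows. Here is a toy Python sketch of that idea only. It is not ClickHouse's implementation, and every name in it (AvgState, build_part, merge_parts) is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Partial state for an average: enough to merge later without the raw rows."""
    total: float = 0.0
    count: int = 0

    def add(self, value):
        # Fold one raw value into the state.
        return AvgState(self.total + value, self.count + 1)

    def merge(self, other):
        # Combine two partial states; the order of merging does not matter.
        return AvgState(self.total + other.total, self.count + other.count)

    def finalize(self):
        # Turn the state into the final answer at query time.
        return self.total / self.count if self.count else None


def build_part(rows):
    """Collapse (key, value) rows from one insert batch into one state per key."""
    part = {}
    for key, value in rows:
        part[key] = part.get(key, AvgState()).add(value)
    return part


def merge_parts(a, b):
    """Merge two parts key by key, as a background merge conceptually would."""
    merged = dict(a)
    for key, state in b.items():
        merged[key] = merged.get(key, AvgState()).merge(state)
    return merged


part1 = build_part([("a", 1.0), ("a", 3.0), ("b", 10.0)])
part2 = build_part([("a", 5.0)])
merged = merge_parts(part1, part2)
averages = {key: state.finalize() for key, state in merged.items()}
```

The point of keeping (total, count) instead of a finished average is that averages cannot be merged correctly on their own, while the partial states can.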
I think it's an unfair comparison, notably because:<p>1) Clickhouse is rigid-schema + append-only - you can't simply dump semi-structured data (csv/json/documents) into it and worry about schema (index definition) + querying later. The only clickhouse integration I've seen up close had a lot of "json" blobs in it as a workaround, which cannot be queried with the same ease as in ES.<p>2) Clickhouse scalability is not as simple/documented as elasticsearch. You can set up a 200-node ES cluster with a relatively simple helm config or readily-available cloudformation recipe.<p>3) Elastic is more than elasticsearch - kibana and the "on top of elasticsearch" featureset is pretty substantial.<p>4) Every language/platform under the sun (except powerbi... god damnit) has native + mature client drivers for elasticsearch, and you can fall back to bog-standard http calls for querying if you need/want. ClickHouse supports some very elementary SQL primitives ("ANSI") and even those have some gotchas and are far from drop-in.<p>In this manner, I think that clickhouse is better compared as a self-hosted alternative to Aurora and other cloud-native scalable SQL databases, and less a replacement for elasticsearch. If you're using Elasticsearch for OLAP, you're probably better off ETLing the semi-structured/raw data that you specifically want out of ES into a more suitable database which is meant for that.
> SQL is a perfect language for analytics.<p>Slightly off topic, but I strongly agree with this statement and wonder why the languages used for a lot of data science work (R, Python) don't have such a strong focus on SQL.<p>It might just be my brain, but SQL makes so much logical sense as a query language and, with small variances, is used to directly query so many databases.<p>In R, why learn the data.table (OK, speed) or dplyr paradigms, when SQL can be easily applied directly to dataframes? There are libraries to support this like sqldf[1], tidyquery[2] and duckdf[3] (author). And I'm sure the situation is similar in Python.<p>This is not a post against great libraries like data.table and dplyr, which I do use from time to time. It's more of a question about why SQL is not more popular as the query language du jour for data science.<p>[1] <a href="https://cran.r-project.org/web/packages/sqldf/index.html" rel="nofollow">https://cran.r-project.org/web/packages/sqldf/index.html</a><p>[2] <a href="https://github.com/ianmcook/tidyquery" rel="nofollow">https://github.com/ianmcook/tidyquery</a><p>[3] <a href="https://github.com/phillc73/duckdf" rel="nofollow">https://github.com/phillc73/duckdf</a>
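On the Python side, here is a minimal sketch of the same "just use SQL on in-memory tabular data" idea. It uses only the standard library's sqlite3 module as a stand-in for the R libraries mentioned above, not any of those libraries themselves. The table name, column names and sample rows are made up for illustration.

```python
import sqlite3


def mean_latency_by_service(rows):
    """Aggregate (service, latency_ms) tuples with plain SQL in an in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema, chosen only for this example.
        conn.execute("CREATE TABLE requests (service TEXT, latency_ms REAL)")
        conn.executemany("INSERT INTO requests VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT service, COUNT(*) AS n, AVG(latency_ms) AS mean_ms "
            "FROM requests GROUP BY service ORDER BY service"
        )
        return cur.fetchall()
    finally:
        conn.close()


rows = [("api", 120.0), ("api", 80.0), ("web", 30.0)]
result = mean_latency_by_service(rows)
```

The GROUP BY here is the same query you would write against most SQL databases, which is the portability argument being made above.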
ClickHouse is incredible. It has also replaced a large, expensive and slow Elasticsearch cluster at Contentsquare. We are actually starting an internal team to improve it and upstream patches, email me if interested!
Sentry.io is using ClickHouse for search, with an API they built on top of it to make it easier to transition if need be. They blogged about it at the time they adopted it:<p><a href="https://blog.sentry.io/2019/05/16/introducing-snuba-sentrys-new-search-infrastructure" rel="nofollow">https://blog.sentry.io/2019/05/16/introducing-snuba-sentrys-...</a>
I am using Clickhouse at my workplace as a side project. I wrote a Rust app that dumps the daily traffic data collected from my company's products into a ClickHouse database.<p>That's 1-5 billion rows, per day, with 60 days of data, onto a single i5 3500 desktop I have lying around. It returns a complex query in less than 5 minutes.<p>I was gonna get a beefier server, but 5 minutes is fine for my task. I was flabbergasted.
Uber recently blogged that they rebuilt the log analytics platform based on ClickHouse, replacing the previous ELK based one. The table schema choices made it easy to handle JSON formatted logs with changing schemas. <a href="https://eng.uber.com/logging/" rel="nofollow">https://eng.uber.com/logging/</a>
I've been recording a podcast with Commercial Open Source company founders (Plug! <a href="https://www.flagsmith.com/podcast" rel="nofollow">https://www.flagsmith.com/podcast</a>) and have been surprised how often Clickhouse has come up. It is <i>always</i> referred to with glowing praise/couldn't have built our business without it etc etc etc.
My biggest problem with Elasticsearch is how easy it is to get data in there and think everything is just fine... until it falls flat on its face the moment you hit some random use case that, according to Murphy's law, will also be a very important one.<p>I wish Elasticsearch were maybe a little more opinionated in its defaults. In some ways Clickhouse feels like it filled the gap that <i>not</i> having opinionated defaults created. My usage is from a few years back so maybe things have improved.
How does clickhouse compare to druid, pinot, rockset (commercial), memsql (commercial)?
I know clickhouse is easier to deploy.<p>But from a user's perspective, is clickhouse superior to the others?
Does ClickHouse, or anything else out there, even remotely compete with Splunk for adhoc troubleshooting/forensics/threat hunting type work?<p>I started off with Splunk and every time I try Elasticsearch I feel like I'm stuck in a cage. Probably why they can charge so much for it.
Sorry to hijack the thread but can anyone suggest alternatives to the 'search' side of Elasticsearch?<p>I haven't been following the topic and there are probably new and interesting developments, like ClickHouse is for logging.
A related database using ideas from Clickhouse:<p><a href="https://github.com/VictoriaMetrics/VictoriaMetrics" rel="nofollow">https://github.com/VictoriaMetrics/VictoriaMetrics</a>
I just wish there was a FOSS Loki-like solution built on CH that was stable and used in production.<p>I know there are a few projects (see below), but I'm not aware of anything mature.<p><a href="https://github.com/QXIP/cloki-go" rel="nofollow">https://github.com/QXIP/cloki-go</a><p><a href="https://github.com/lmangani/cloki" rel="nofollow">https://github.com/lmangani/cloki</a>
Regarding #1 in the article, Elastic does have SQL query support[1]. I can’t speak to performance or other comparative metrics, but it’s worked well for my purposes.<p>[1] <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-sql.html" rel="nofollow">https://www.elastic.co/guide/en/elasticsearch/reference/curr...</a>
I don't have any production experience running Clickhouse, but I have used it on a side project for an OLAP workload. Compared to Postgres, Clickhouse was a couple of orders of magnitude faster (for the query pattern), and it was pretty easy to set up a single-node configuration compared to lots of the "big data" stuff. Clickhouse is really a game changer.
I've been using it successfully in production for a year and a half.
I can think of no other database that would give me real-time aggregation over hundreds of millions of rows inserted every day for virtually zero cost.
It's just marvelous work.
> ElasticSearch repo has jaw-dropping 1076 PRs merged for the same month<p>Code change frequency is not a measure of quality or development speed.<p>One organization can encourage bigger PRs while another encourages tiny, frequent changes.<p>One can care about quality and stability while another can care very little about bugs.
Anyone know a more lightweight alternative to the (ELK) Elastic Stack? I found <a href="https://vector.dev" rel="nofollow">https://vector.dev</a> but it seems to be only the "L" part.
If you are looking for an OSS ES replacement, CrateDB might also be worth a look :)<p>Basically a best-of-both-worlds combination of ES and PostgreSQL, perfect for time-series and log analytics.
I am curious how they deal with GDPR or PII when they do the logging. At first sight it looks like they are doing the logs themselves and not the API provider.
I'm happy that more people are "discovering" ClickHouse.<p>ClickHouse is an outstanding product, with great capabilities that serve a wide array of big data use cases.<p>It's simple to deploy, simple to operate, simple to ingest large amounts of data, simple to scale, and simple to query.<p>We've been using ClickHouse to handle hundreds of TB of data for workloads that require ranking on multi-dimensional timeseries aggregations, and we can resolve most complex queries in less than 500ms under load.