ClickHouse as an alternative to Elasticsearch for log storage and analysis

383 points by jetter about 4 years ago

25 comments

Also wanted to share my overall positive experience with Clickhouse.

UPSIDES
* started a 3-node cluster using the official Docker images super quickly
* ingested billions of rows super fast
* great compression (of course, depends on your data's characteristics)
* features like https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree/ are amazing to see
* ODBC support. I initially said "Who uses that??", but we used it to connect PostgreSQL, so we can keep the non-timeseries data in PostgreSQL but still access PostgreSQL tables in Clickhouse (!)
* you can go the other way too: read Clickhouse from PostgreSQL (see https://github.com/Percona-Lab/clickhousedb_fdw, although we didn't try this)
* PRs welcome, and quickly reviewed. (We improved the ODBC UUID support)
* code quality is pretty high.

DOWNSIDES
* limited JOIN capabilities, which is expected from a timeseries-oriented database like Clickhouse. It's almost impossible to implement JOINs at this kind of scale. The philosophy is "If it won't be fast at scale, we don't support it"
* not-quite-standard SQL syntax, but they've been improving it
* limited DELETE support, which is also expected from this kind of database, but rarely used in the kinds of environments that CH usually runs in (how often do people delete data from ElasticSearch?)

It's really an impressive piece of engineering. Hats off to the Yandex crew.
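
For readers who have not used the AggregatingMergeTree engine linked above: the general idea, as the ClickHouse documentation describes it, is that rows sharing a sorting key are collapsed into stored aggregate-function states, which can be combined later without rereading the raw rows. The following is only a toy Python sketch of that "mergeable partial state" idea, not ClickHouse's implementation; `AvgState`, `state_of`, and the sample latencies are made-up names and data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Partial state for an average: enough to merge later without the raw rows."""
    total: float = 0.0
    count: int = 0

    def add(self, value: float) -> "AvgState":
        # Fold one raw value into the partial state.
        return AvgState(self.total + value, self.count + 1)

    def merge(self, other: "AvgState") -> "AvgState":
        # Combine two partial states; order does not matter.
        return AvgState(self.total + other.total, self.count + other.count)

    def finalize(self):
        # Turn the state into the final answer only at query time.
        return self.total / self.count if self.count else None


def state_of(values):
    """Build a partial state from one batch of raw values."""
    state = AvgState()
    for value in values:
        state = state.add(value)
    return state


# Two separately ingested batches for the same key (hypothetical latencies in ms).
part_a = state_of([100.0, 200.0])
part_b = state_of([300.0])
merged = part_a.merge(part_b)
average = merged.finalize()
```

The point of the design is that only the small states need to be kept and merged, which is why pre-aggregated tables can stay cheap to query as raw data grows.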

tgtweak about 4 years ago

I think it's an unfair comparison, notably because:

1) Clickhouse is rigid-schema + append-only - you can't simply dump semi-structured data (csv/json/documents) into it and worry about schema (index definition) + querying later. The only clickhouse integration I've seen up close had a lot of "json" blobs in it as a workaround, which cannot be queried with the same ease as in ES.

2) Clickhouse scalability is not as simple/documented as elasticsearch. You can set up a 200-node ES cluster with a relatively simple helm config or readily-available cloudformation recipe.

3) Elastic is more than elasticsearch - kibana and the "on top of elasticsearch" featureset is pretty substantial.

4) Every language/platform under the sun (except powerbi... god damnit) has native + mature client drivers for elasticsearch, and you can fall back to bog-standard http calls for querying if you need/want. ClickHouse supports some very elementary SQL primitives ("ANSI") and even those have some gotchas and are far from drop-in.

In this manner, I think that clickhouse is better compared as a self-hosted alternative to Aurora and other cloud-native scalable SQL databases, and less a replacement for elasticsearch. If you're using Elasticsearch for OLAP, you're probably better off ETLing the semi-structured/raw data that you specifically want out of ES into a more suitable database which is meant for that.

phillc73 about 4 years ago

> SQL is a perfect language for analytics.

Slightly off topic, but I strongly agree with this statement and wonder why the languages used for a lot of data science work (R, Python) don't have such a strong focus on SQL.

It might just be my brain, but SQL makes so much logical sense as a query language and, with small variances, is used to directly query so many databases.

In R, why learn the data.table (OK, speed) or dplyr paradigms, when SQL can be easily applied directly to dataframes? There are libraries to support this like sqldf[1], tidyquery[2] and duckdf[3] (author). And I'm sure the situation is similar in Python.

This is not a post against great libraries like data.table and dplyr, which I do use from time to time. It's more of a question about why SQL is not more popular as the query language du jour for data science.

[1] https://cran.r-project.org/web/packages/sqldf/index.html
[2] https://github.com/ianmcook/tidyquery
[3] https://github.com/phillc73/duckdf
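
On the Python side, a minimal sketch of the same idea needs nothing beyond the standard library: load in-memory rows into an in-memory SQLite database and query them with plain SQL. This is only an illustration, not a claim about what any of the R packages above do internally; the `requests` table, its columns, and the sample rows are invented for the example.

```python
import sqlite3

# Hypothetical in-memory "dataframe": (service, HTTP status, latency in ms).
rows = [
    ("api", 200, 12.5),
    ("api", 500, 250.0),
    ("web", 200, 30.0),
    ("web", 200, 50.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (service TEXT, status INTEGER, latency_ms REAL)")
conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", rows)

# Filter, group, and aggregate in one declarative statement.
query = """
    SELECT service, COUNT(*) AS n, AVG(latency_ms) AS avg_latency_ms
    FROM requests
    WHERE status = 200
    GROUP BY service
    ORDER BY service
"""
result = conn.execute(query).fetchall()
conn.close()
```

Here `result` holds one tuple per service with the count and mean latency of successful requests. The equivalent dataframe code would chain a filter, a group-by, and an aggregation.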

sylvinus about 4 years ago

ClickHouse is incredible. It has also replaced a large, expensive and slow Elasticsearch cluster at Contentsquare. We are actually starting an internal team to improve it and upstream patches, email me if interested!

js2 about 4 years ago

Sentry.io is using ClickHouse for search, with an API they built on top of it to make it easier to transition if need be. They blogged about it at the time they adopted it:

https://blog.sentry.io/2019/05/16/introducing-snuba-sentrys-new-search-infrastructure

guardiangod about 4 years ago

I am using Clickhouse at my workplace as a side project. I wrote a Rust app that dumps the daily traffic data collected from my company's products into a ClickHouse database.

That's 1-5 billion rows per day, with 60 days of data, on a single i5 3500 desktop I had lying around. A complex query returns in less than 5 minutes.

I was gonna get a beefier server, but 5 minutes is fine for my task. I was flabbergasted.

kaak3 about 4 years ago

Uber recently blogged that they rebuilt the log analytics platform based on ClickHouse, replacing the previous ELK based one. The table schema choices made it easy to handle JSON formatted logs with changing schemas. https://eng.uber.com/logging/
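
The comment does not spell out the schema, so the following is only a sketch of one general technique for fitting schema-less JSON logs into a fixed set of columns. It keeps a few well-known fields as dedicated columns and packs everything else into parallel arrays of names and values, split by type. The column names, the `to_row` helper, and the sample log line are all assumptions made for illustration, not a description of Uber's actual tables.

```python
import json

# Hypothetical fields that get their own dedicated columns.
FIXED_COLUMNS = ("timestamp", "level", "message")


def to_row(raw_line):
    """Turn one JSON log line into a row with a fixed set of columns."""
    record = json.loads(raw_line)
    row = {col: record.pop(col, None) for col in FIXED_COLUMNS}

    string_keys, string_values = [], []
    number_keys, number_values = [], []
    for key in sorted(record):
        value = record[key]
        # bool is a subclass of int in Python, so route it to strings explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number:
            number_keys.append(key)
            number_values.append(float(value))
        else:
            string_keys.append(key)
            # Non-string values (bools, nested objects) are kept as JSON text.
            string_values.append(value if isinstance(value, str) else json.dumps(value))

    row["string_keys"] = string_keys
    row["string_values"] = string_values
    row["number_keys"] = number_keys
    row["number_values"] = number_values
    return row


line = (
    '{"timestamp": "2021-03-02T10:00:00Z", "level": "error", '
    '"message": "timeout", "user_id": 42, "region": "eu", "retry": true}'
)
row = to_row(line)
```

With this layout a new log field never requires a schema change: it simply shows up as another entry in the key and value arrays.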

dabeeeenster about 4 years ago

I've been recording a podcast with Commercial Open Source company founders (Plug! https://www.flagsmith.com/podcast) and have been surprised how often Clickhouse has come up. It is always referred to with glowing praise/couldn't have built our business without it etc etc etc.

BoorishBears about 4 years ago

My biggest problem with Elasticsearch is how easy it is to get data in there and think everything is just fine... until it falls flat on its face the moment you hit some random use case that, according to Murphy's law, will also be a very important one.

I wish Elasticsearch were maybe a little more opinionated in its defaults. In some ways Clickhouse feels like it filled the gap that not having opinionated defaults created. My usage is from a few years back, so maybe things have improved.

dominotw about 4 years ago

How does clickhouse compare to druid, pinot, rockset (commercial), memsql (commercial)? I know clickhouse is easier to deploy.

But from a user's perspective, is clickhouse superior to the others?

cduzz about 4 years ago

Almost nobody wants to use elasticsearch.

People want to use kibana and put up with elasticsearch.

jcims about 4 years ago

Does ClickHouse, or anything else out there, even remotely compete with Splunk for adhoc troubleshooting/forensics/threat hunting type work?

I started off with Splunk and every time I try Elasticsearch I feel like I'm stuck in a cage. Probably why they can charge so much for it.

moralestapia about 4 years ago

Sorry to hijack the thread, but can anyone suggest alternatives to the 'search' side of Elasticsearch?

I haven't been following the topic and there are probably new and interesting developments, like ClickHouse is for logging.

wakatime about 4 years ago

A related database using ideas from Clickhouse:

https://github.com/VictoriaMetrics/VictoriaMetrics

e12e about 4 years ago

I just wish there was a foss loki-like solution built on ch - that was stable and used in production.

I know there's a few projects (see below) - but I'm not aware of anything mature..

https://github.com/QXIP/cloki-go
https://github.com/lmangani/cloki

valiant-comma about 4 years ago

Regarding #1 in the article, Elastic does have SQL query support[1]. I can’t speak to performance or other comparative metrics, but it’s worked well for my purposes.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-sql.html
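
For context, Elasticsearch's SQL support is exposed over its REST API: a statement is posted as JSON to the `_sql` endpoint. The sketch below only builds such a request with the Python standard library and does not send it. The base URL, the `logs` index, the `level` field, and the `build_sql_request` helper are assumptions for illustration, and a real cluster would typically also need authentication.

```python
import json
import urllib.request


def build_sql_request(sql, base_url="http://localhost:9200"):
    """Build (but do not send) a POST request for Elasticsearch's SQL endpoint."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_sql?format=json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical index and field names.
request = build_sql_request("SELECT level, COUNT(*) AS n FROM logs GROUP BY level")

# To actually run it against a cluster:
#     with urllib.request.urlopen(request) as response:
#         print(json.load(response))
```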

harporoeder about 4 years ago

I don't have any production experience running Clickhouse, but I have used it on a side project for an OLAP workload. Compared to Postgres, Clickhouse was a couple of orders of magnitude faster (for the query pattern), and it was pretty easy to set up a single-node configuration compared to lots of the "big data" stuff. Clickhouse is really a game changer.

pachico about 4 years ago

I've been using it successfully in production for a year and a half. I can think of no other database that would give me real-time aggregation over hundreds of millions of rows inserted every day for virtually zero cost. It's just a marvelous piece of work.

eeZah7Ux about 4 years ago

> ElasticSearch repo has jaw-dropping 1076 PRs merged for the same month

Code change frequency is not a measure of quality or development speed.

One organization can encourage bigger PRs while another encourages tiny, frequent changes.

One can care about quality and stability while another can care very little about bugs.

wiradikusuma about 4 years ago

Does anyone know a more lightweight alternative to the Elastic Stack (ELK)? I found https://vector.dev but it seems to be only the "L" part.

crb002 about 4 years ago

I wish they had a data store shoot-out like Techempower has for Web stacks.

proddata about 4 years ago

If you are looking for an OSS ES replacement, CrateDB might also be worth a look :)

Basically a best-of-both-worlds combination of ES and PostgreSQL, perfect for time-series and log analytics.

didip about 4 years ago

Does ClickHouse have integration with Superset and Grafana?

wdb about 4 years ago

I am curious how they deal with GDPR or PII when they do the logging. At first sight it looks like they are doing the logs themselves and not the API provider.

moralsupply about 4 years ago

I'm happy that more people are "discovering" ClickHouse.

ClickHouse is an outstanding product, with great capabilities that serve a wide array of big data use cases.

It's simple to deploy, simple to operate, simple to ingest large amounts of data, simple to scale, and simple to query.

We've been using ClickHouse to handle hundreds of TB of data for workloads that require ranking on multi-dimensional timeseries aggregations, and we can resolve most complex queries in less than 500ms under load.