
Building Real Time Analytics APIs at Scale

83 points · by willlll · about 7 years ago

7 comments

ozgune · about 7 years ago
(Ozgun from Citus Data)

What really excites me about this blog post is how PostgreSQL is becoming central across diverse workloads - including real-time analytics.

A few Postgres resources that relate to this blog post are the following.

1. TopN: Several Citus customers were already using the TopN extension. Algolia contributed to revising the public APIs in this extension. With these revised APIs, we felt pretty comfortable in open sourcing the extension for the Postgres community to use: https://github.com/citusdata/postgresql-topn

2. Postgres JIT improvements: Postgres 11 is coming with LLVM JIT improvements. For analytical queries that run in-memory, these changes will improve query performance by up to 3x. This will significantly speed up roll-up performance mentioned in this blog post: https://news.ycombinator.com/item?id=16782052

3. For those interested, this tutorial talks about how to build real-time analytics ingest pipelines with Postgres: https://www.youtube.com/watch?v=daeUsVox8hs
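A minimal sketch of how the TopN extension from point 1 might be wired into a roll-up from a Go service using database/sql. The connection string, tables, and columns are invented for illustration, and the topn_add_agg / topn call shapes follow my reading of the extension's README, so treat them as assumptions rather than a verified API:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver; any database/sql driver works
)

func main() {
	// Hypothetical connection string and schema; adjust for your setup.
	db, err := sql.Open("postgres", "postgres://localhost/analytics?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Fold recent raw events into a per-app, five-minute roll-up row that
	// stores a JSONB "top N" sketch built by the topn extension's aggregate.
	_, err = db.Exec(`
		INSERT INTO rollups_5m (app_id, bucket, top_queries)
		SELECT app_id,
		       to_timestamp(floor(extract(epoch FROM event_time) / 300) * 300),
		       topn_add_agg(query)
		FROM raw_events
		WHERE event_time >= now() - interval '5 minutes'
		GROUP BY 1, 2`)
	if err != nil {
		log.Fatal(err)
	}

	// Read the ten most frequent queries back out of the latest sketch.
	rows, err := db.Query(`
		SELECT (topn(top_queries, 10)).*
		FROM (
			SELECT top_queries
			FROM rollups_5m
			WHERE app_id = $1
			ORDER BY bucket DESC
			LIMIT 1
		) latest`, "app_42")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var item string
		var freq int64
		if err := rows.Scan(&item, &freq); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d\n", item, freq)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```

The point of the sketch is the shape of the workflow: heavy counting happens once at roll-up time, and reads only unpack a small pre-aggregated JSONB value.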
ryanworl · about 7 years ago
I think the choice not to go with Clickhouse deserves a bit more explanation than what was given in the article.

Instead of writing all this code to do roll-ups they could’ve used an AggregatingMergeTree table over their raw events table and... gotten back to work.

Cloudflare is using Clickhouse for their DNS analytics and (maybe even by now) soon their HTTP analytics. And the system they migrated off of looked a heck of a lot like this one in the article.

Edit: I should add that I am *not* saying their decision was wrong. I just think the sentence that was given in the article does not justify the decision by itself on an engineering level.

The data load processes of Clickhouse and Citus (in this configuration) are nearly identical. Clickhouse takes CSV files just fine like Citus. The default settings are fine for the volume mentioned in the article of single-digit billions of records per day. This would probably fit on a single server if you age out the raw logs after your coarsest aggregate is created. Queries over the AggregatingMergeTree table at five-minute resolution will finish in high double-digit to low triple-digit milliseconds if the server is not being hammered with queries and the time range is days to weeks.
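For readers unfamiliar with the AggregatingMergeTree pattern described here, a rough sketch of what it looks like: raw events plus a materialized view that maintains five-minute aggregates at insert time, instead of hand-written roll-up jobs. The schema, DSN, and driver details are assumptions (not Cloudflare's or Algolia's setup), using a ClickHouse database/sql driver:

```go
package main

import (
	"database/sql"
	"log"

	// Assumes a ClickHouse database/sql driver; the import path and the DSN
	// format below may differ between driver versions.
	_ "github.com/ClickHouse/clickhouse-go"
)

// Illustrative DDL: the materialized view writes partial aggregate states
// into the AggregatingMergeTree table on every insert into events.
var ddl = []string{
	`CREATE TABLE IF NOT EXISTS events (
		app_id     String,
		query      String,
		event_time DateTime
	) ENGINE = MergeTree() ORDER BY (app_id, event_time)`,

	`CREATE TABLE IF NOT EXISTS events_5m (
		app_id String,
		bucket DateTime,
		hits   AggregateFunction(count)
	) ENGINE = AggregatingMergeTree() ORDER BY (app_id, bucket)`,

	`CREATE MATERIALIZED VIEW IF NOT EXISTS events_5m_mv TO events_5m AS
		SELECT app_id,
		       toStartOfFiveMinute(event_time) AS bucket,
		       countState() AS hits
		FROM events
		GROUP BY app_id, bucket`,
}

func main() {
	db, err := sql.Open("clickhouse", "tcp://127.0.0.1:9000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for _, stmt := range ddl {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}

	// Reading the roll-up merges the partial count states at query time.
	var hits uint64
	err = db.QueryRow(`
		SELECT countMerge(hits)
		FROM events_5m
		WHERE app_id = ? AND bucket >= now() - INTERVAL 1 DAY`, "app_42").Scan(&hits)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("hits in the last day: %d", hits)
}
```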
al_james · about 7 years ago
A great article, and I am a big fan of Algolia, Citus and Redshift. However this article ends up making an odd apples-to-oranges comparison.

They state that "However, achieving sub-second aggregation performances on very large datasets is prohibitively expensive with RedShift"; this suggests that they want to do sub-second aggregations across raw event data. However, later in the article, the solution they build is to use rollup tables for sub-second responses.

You can also do rollup tables in Redshift, and I can assure you (if you enable the fast query acceleration option) you can get sub-second queries from the rolled-up, lower-cardinality tables. If you want even better response times, you can store the rollups in plain old Postgres and use something like dblink or postgres_fdw to perform the periodic aggregations on Redshift and insert into the local rollup tables (see [1]). In this model the solution ends up being very similar to their solution with Citus... and I would predict that this is cheaper than Citus Cloud, as Redshift really is a great price point for a hosted system.

So the question of performing sub-second aggregations across the raw data remains unanswered... however that really is the ideal end game, as you can then offer way more flexibility in terms of filtering than any rollup-based solution.

Right now, research suggests Clickhouse, Redshift or BigQuery are probably the fastest solutions for that. Not sure about Druid, I don't know it. GPU databases appear to be the future of this. I would be interested to see benchmarks of Citus under this use case. I should imagine that Citus is also way better if you have something like a mixed OLAP and OLTP workload (e.g. you need the analytics and the row data to match exactly at all times).

Aside: It would be great to see Citus benchmarked against the 1.1 billion taxi rides benchmark by Mark Litwintschik [2]. Any chance of that?

[1] https://aws.amazon.com/blogs/big-data/join-amazon-redshift-and-amazon-rds-postgresql-with-dblink/
[2] http://tech.marksblogg.com/benchmarks.html
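A rough sketch of the dblink pattern described above: the aggregation runs on Redshift, and only the small, low-cardinality result set is inserted into a local Postgres rollup table that serves the fast queries. The connection strings, the named 'redshift' dblink connection (assumed to be configured beforehand), and the tables and columns are all hypothetical:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver for the local rollup database
)

func main() {
	// Local Postgres that serves dashboard reads; connection string is made up.
	db, err := sql.Open("postgres", "postgres://localhost/rollups?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The heavy GROUP BY runs on Redshift via dblink; only the per-minute
	// aggregate rows come back over the wire and land in the local table.
	const rollup = `
		INSERT INTO rollup_minute (app_id, bucket, hits)
		SELECT *
		FROM dblink('redshift', $$
			SELECT app_id,
			       date_trunc('minute', event_time) AS bucket,
			       count(*) AS hits
			FROM raw_events
			WHERE event_time >= dateadd(minute, -5, getdate())
			GROUP BY 1, 2
		$$) AS t(app_id text, bucket timestamp, hits bigint)`

	if _, err := db.Exec(rollup); err != nil {
		log.Fatal(err)
	}

	// Dashboard-style reads then hit only the small local rollup table.
	var hits int64
	err = db.QueryRow(`
		SELECT coalesce(sum(hits), 0)
		FROM rollup_minute
		WHERE app_id = $1 AND bucket >= now() - interval '1 hour'`, "app_42").Scan(&hits)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("hits in the last hour: %d", hits)
}
```

Run periodically (cron or a scheduler), this gives the "aggregate on the warehouse, serve from Postgres" split the comment describes.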
shrumm · about 7 years ago
Most of the discussion (rightly so) focused on DB optimization. The decision to build the API in Go was barely mentioned. I’m curious if you evaluated any other frameworks / languages or was Go just an automatic choice?
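For context on the Go question, here is what a bare-bones analytics endpoint looks like with only the standard library's net/http plus database/sql. The route, schema, and connection string are illustrative guesses, not Algolia's actual code:

```go
package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"net/http"
	"time"

	_ "github.com/lib/pq" // assumes rollups are served out of Postgres/Citus
)

type countPoint struct {
	Bucket time.Time `json:"bucket"`
	Hits   int64     `json:"hits"`
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/analytics?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical endpoint: last 24 hours of five-minute counts for one app.
	http.HandleFunc("/1/analytics/counts", func(w http.ResponseWriter, r *http.Request) {
		appID := r.URL.Query().Get("app")
		rows, err := db.Query(`
			SELECT bucket, hits
			FROM rollup_5m
			WHERE app_id = $1 AND bucket >= now() - interval '24 hours'
			ORDER BY bucket`, appID)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer rows.Close()

		var out []countPoint
		for rows.Next() {
			var p countPoint
			if err := rows.Scan(&p.Bucket, &p.Hits); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
			out = append(out, p)
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(out)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

That the standard library already covers routing, JSON, and connection pooling is presumably part of why Go needs little justification for an API like this, though the article itself does not say.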
wjossey · about 7 years ago
“A request targeting a single customer app will only ever need to target a single Postgres instance.”

This seems remarkably dangerous to me. Isn’t hotspotting a big concern? I suppose they are large enough at this point to know what a “large” customer app looks like, but anytime I see sharding done in this manner alarm bells go off.

Happy to see another positive Citus case. I was skeptical a year ago but they’re building up great success stories. We need great options like Citus!

Also, a happy Algolia customer. If you’re not using them yet, give it a try!
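To make the hotspotting worry concrete, a toy sketch of the kind of tenant-to-node routing the quoted sentence implies. This is generic hash routing, not Citus's actual placement logic: every request for one app lands on exactly one node, so one disproportionately busy app can saturate that node no matter how many nodes exist.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// nodeFor picks a Postgres instance for a customer app by hashing the app ID.
// All traffic for a given app maps to the same node, which is exactly the
// property that makes a single very large app a hotspot risk.
func nodeFor(appID string, nodes []string) string {
	h := fnv.New32a()
	h.Write([]byte(appID))
	return nodes[h.Sum32()%uint32(len(nodes))]
}

func main() {
	nodes := []string{"pg-node-0", "pg-node-1", "pg-node-2", "pg-node-3"}

	// A uniform spread of small apps looks fine...
	for _, app := range []string{"app_1", "app_2", "app_3", "app_4"} {
		fmt.Println(app, "->", nodeFor(app, nodes))
	}

	// ...but one very hot app still hits only its own node, which is why
	// knowing up front what a "large" customer app looks like matters.
	fmt.Println("whale_app ->", nodeFor("whale_app", nodes))
}
```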
bigger_cheese · about 7 years ago
Seems similar to the approach used by process historians in the industrial control world, i.e. store at native frequency out of the PLC, then periodically aggregate.
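The "store raw at native frequency, aggregate periodically" idea in miniature: a toy Go scheduler against an assumed Postgres schema (a unique key on (app_id, bucket) is assumed for the upsert). Real systems would typically use cron, pg_cron, or a background worker rather than a long-running loop like this:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // any SQL store works; Postgres assumed for illustration
)

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/analytics?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Raw samples keep arriving at native frequency; every five minutes we
	// fold the newest ones into a coarser roll-up, historian-style.
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()

	for range ticker.C {
		_, err := db.Exec(`
			INSERT INTO rollup_5m (app_id, bucket, hits)
			SELECT app_id,
			       to_timestamp(floor(extract(epoch FROM event_time) / 300) * 300),
			       count(*)
			FROM raw_events
			WHERE event_time >= now() - interval '5 minutes'
			GROUP BY 1, 2
			ON CONFLICT (app_id, bucket) DO UPDATE
			  SET hits = rollup_5m.hits + EXCLUDED.hits`)
		if err != nil {
			log.Printf("rollup failed: %v", err)
		}
	}
}
```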
napoleond · about 7 years ago
Just use Keen.io and be done with it :)