Great to see it in ClickHouse.<p>Snowflake released a white paper back in its pre-IPO days that mentioned this same feature (secretly exploding JSON into columns). That explains why Snowflake feels faster than it should: they've quietly done a lot of amazing things and simply offered it all as a polished product, Apple-style.
I've heard wonderful things about ClickHouse, but every time I try to use it, I get stuck on "how do I get data into it reliably". I search around, and inevitably end up with "by combining ClickHouse and Kafka", at which point my desire to keep going drops to zero.<p>Are there any setups for reliable data ingestion into ClickHouse that <i>don't</i> involve spinning up Kafka & ZooKeeper?
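For what it's worth, one Kafka-free pattern is to let ClickHouse do the batching itself with asynchronous inserts and send rows straight from your app over the native or HTTP interface. A minimal sketch, assuming a recent ClickHouse version where the async_insert settings are available; the `events` table is purely illustrative:

    CREATE TABLE events
    (
        ts      DateTime,
        user_id UInt64,
        payload String
    )
    ENGINE = MergeTree
    ORDER BY (user_id, ts);

    -- Each client issues small INSERTs directly; the server buffers them and
    -- flushes them into parts, so no external queue is needed.
    INSERT INTO events
    SETTINGS async_insert = 1, wait_for_async_insert = 1
    VALUES (now(), 42, '{"path": "/pricing"}');

The trade-off is that you rely on the server (plus retries in your client) instead of a durable log, which is often acceptable for analytics workloads.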
>Dynamically changing data: allow values with different data types (possibly incompatible and not known beforehand) for the same JSON paths without unification into a least common type, preserving the integrity of mixed-type data.<p>I'm so excited for this! One of my major bugbears with storing logs in Elasticsearch is the set-the-type-on-first-seen-occurrence headache.<p>Hope to see this graduate from experimental support soon!
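For anyone curious what that looks like in practice, here is a minimal sketch against the new type (assumes ClickHouse 24.8+ with the experimental JSON type enabled; the table and values are made up):

    SET allow_experimental_json_type = 1;

    CREATE TABLE logs (data JSON) ENGINE = MergeTree ORDER BY tuple();

    -- The same path "level" arrives as a string, a number, and an array,
    -- and each value keeps its own type instead of being coerced.
    INSERT INTO logs VALUES
        ('{"level": "warn"}'),
        ('{"level": 3}'),
        ('{"level": [1, 2]}');

    SELECT data.level, dynamicType(data.level) FROM logs;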
We've been waiting for better JSON support in ClickHouse - the new type looks promising, and the dynamic columns, with no need to specify subtypes, are particularly helpful for us.
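On the "no need to specify subtypes" point: as I read the announcement, you can declare the column as bare JSON and let everything be inferred, and optionally pin or skip specific paths if you do know them. A rough sketch based on the syntax shown there (path names are invented, and the experimental setting from above is assumed):

    -- Everything inferred dynamically:
    CREATE TABLE t1 (json JSON) ENGINE = Memory;

    -- Optional hints for paths you do know, plus paths to skip entirely:
    CREATE TABLE t2
    (
        json JSON(user.id UInt64, SKIP debug.trace)
    )
    ENGINE = Memory;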
If you're evaluating ClickHouse, take a look at Apache Pinot as well. ClickHouse was designed for single-machine installations and has since been enhanced to support clusters, but that support is lacking; for example, if you add nodes, it is not easy to redistribute existing data. Pinot is much easier to scale horizontally. Also take a look at Pinot's star-tree indexes [1]. If you're doing multi-dimensional analysis (pivot tables etc.), there is a huge difference in performance when you take advantage of star-tree.<p>[1] <a href="https://docs.pinot.apache.org/basics/indexing/star-tree-index" rel="nofollow">https://docs.pinot.apache.org/basics/indexing/star-tree-inde...</a>
ClickHouse is great stuff. I use it for OLAP with a modest database (~600 million rows, ~300 GB before compression) and it handles everything I throw at it without issues. I'm hopeful this new JSON data type will be better at a use case that I currently solve with nested tuples.
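For context on the nested-tuples approach versus the new type, a rough before/after sketch (column names invented; the Tuple version needs every field declared up front, the JSON version doesn't, assuming the experimental type is enabled):

    -- Before: the structure is fixed in the schema.
    CREATE TABLE events_tuple
    (
        attrs Tuple(device Tuple(os String, version String))
    )
    ENGINE = MergeTree ORDER BY tuple();

    -- After: same dotted access, but new keys can appear without a schema change.
    CREATE TABLE events_json (attrs JSON) ENGINE = MergeTree ORDER BY tuple();

    SELECT attrs.device.os FROM events_json;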
When I tried it a few weeks ago, ClickHouse named its data files after column names, so weird JSON keys produced very long filenames (some containing slashes) that did not play well with the file system and caused errors. I wonder if that has been fixed?
>For example, if we have two integers and a float as values for the same JSON path a, we don’t want to store all three as float values on disk<p>Well, if you want to do things exactly the way JS does, then storing them all as floats is correct, since JS represents every number as a double. However, the JSON standard doesn't require numbers to be handled the way JS handles them.
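If I'm reading the post right, you can see the non-unification directly: a sketch like the following (assuming the experimental JSON type) should report Int64 for the integer rows and Float64 for the float row, rather than Float64 for all three:

    CREATE TABLE nums (data JSON) ENGINE = Memory;

    INSERT INTO nums VALUES ('{"a": 2}'), ('{"a": 3}'), ('{"a": 1.5}');

    -- Expect Int64 for the first two rows and Float64 for the third.
    SELECT data.a, dynamicType(data.a) FROM nums;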
This sounds like, instead of storing one specific kind of JSON value (int, string, array) in the column, you can store any JSON type in it - much like "enums with fields" in Swift, Kotlin, or Rust, or algebraic data types in Haskell, a feature not present in many other languages.
Looks like Snowflake was the first popular warehouse to have a variant type that could put JSON values into separate columns.<p>It turned out to be a great idea that inspired other databases.
Oracle 23ai also has a similar feature that "explodes" JSON into relational tables/columns for storage while still providing JSON-based access APIs: <a href="https://www.oracle.com/database/json-relational-duality/" rel="nofollow">https://www.oracle.com/database/json-relational-duality/</a>
Using ClickHouse is one of the best decisions we've made here at PostHog. It has allowed us to scale performance while letting us build more products on the same set of data.<p>Since we'd been using ClickHouse long before this JSON functionality was available (or even before the earlier version of it, `Object('json')`, was available), we ended up setting up a job that would materialize JSON fields out of a JSON blob into materialized columns, based on query patterns against the keys in the blob. Then, once those materialized columns were created, we would route queries to them at runtime if they were available. This saved us a _ton_ of CPU and IO. Even though ClickHouse uses some really fast SIMD JSON functions, the best way to make a computer go faster is to make it do less, and this new JSON type does exactly that - and it's so turnkey!<p><a href="https://posthog.com/handbook/engineering/databases/materialized-columns">https://posthog.com/handbook/engineering/databases/materiali...</a><p>The team over at ClickHouse Inc., as well as the community behind it, moves surprisingly fast. I can't recommend it enough, and I'm excited for everything else on the roadmap, especially what's on the horizon with Parquet and Iceberg support.
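For anyone who wants the pre-JSON-type version of this trick, the core of it is just a materialized column computed from a JSON string blob, something like the following sketch (column and key names are made up, not PostHog's actual schema):

    -- Assume events(properties String) holds the raw JSON blob.
    ALTER TABLE events
        ADD COLUMN mat_plan String
        MATERIALIZED JSONExtractString(properties, 'plan');

    -- New inserts populate mat_plan automatically; hot queries can then read
    -- the narrow column instead of re-parsing the whole blob.
    SELECT count() FROM events WHERE mat_plan = 'enterprise';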
ClickHouse is criminally underused.<p>It's common knowledge that "Postgres is all you need" - but if you somehow reach the stage of "Postgres isn't all I need, and I have hard proof", this should be the next tech you look at.<p>Also, clickhouse-local is rather amazing at CSV processing using SQL. Highly recommended for when you are fed up with Google Sheets or even Excel.
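Seconding clickhouse-local: you can point plain SQL at a CSV with the file() table function and run it via `clickhouse-local --query "..."`. A small sketch (the file name and columns are invented):

    -- Aggregate a CSV without loading it into a server first.
    SELECT
        status,
        count() AS hits,
        avg(duration_ms) AS avg_ms
    FROM file('access_log.csv', 'CSVWithNames')
    GROUP BY status
    ORDER BY hits DESC;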
I admit that I didn't read the entire article in depth, but I did my best to meaningfully skim-parse it.<p>Can someone briefly explain how or if adding data types to JSON - a standardized grammar - leaves something that still qualifies as JSON?<p>I have no problem with people creating supersets of JSON, but if my standard lib JSON parser can't read your "JSON" then wouldn't it be better to call it something like "CH-JSON"?<p>If I am wildly missing something, I'm happy to be schooled. The end result certainly sounds cool, even though I haven't needed ClickHouse yet.