We built a new powerful JSON data type for ClickHouse

230 points · markhneedham · 9 hours ago

42 comments

fuziontech · 6 hours ago

Using ClickHouse is one of the best decisions we've made here at PostHog. It has allowed us to scale performance all while allowing us to build more products on the same set of data.

Since we've been using ClickHouse long before this JSON functionality was available (or even before the earlier version of this called `Object('json')` was avaiable) we ended up setting up a job that would materialize json fields out of a json blob and into materialized columns based on query patterns against the keys in the JSON blob. Then, once those materialized columns were created we would just route the queries to those columns at runtime if they were available. This saved us a _ton_ on CPU and IO utilization. Even though ClickHouse uses some really fast SIMD JSON functions, the best way to make a computer go faster is to make the computer do less and this new JSON type does exactly that and it's so turn key!

https://posthog.com/handbook/engineering/databases/materiali...

The team over at ClickHouse Inc. as well as the community behind it moves surprisingly fast. I can't recommend it enough and excited for everything else that is on the roadmap here. I'm really excited for what is on the horizon with Parquet and Iceberg support.

breadwinner · 5 hours ago

If you're evaluating ClickHouse take a look at Apache Pinot as well. ClickHouse was designed for single-machine installations, although it has been enhanced to support clusters. But this support is lacking, for example if you add additional nodes it is not easy to redistribute data. Pinot is much easier to scale horizontally. Also take a look at star-tree indexes of Pinot [1]. If you're doing multi-dimensional analysis (Pivot table etc.) there is a huge difference in performance if you take advantage of star-tree.

[1] https://docs.pinot.apache.org/basics/indexing/star-tree-inde...

Show replies

ramraj07 · 7 hours ago

Great to see it in ClickHouse.

Snowflake released a white paper before its IPO days and mentioned this same feature (secretly exploding JSON into columns). Explains how snowflake feels faster than it should, they’ve secretly done a lot of amazing things and just offered it as a polished product like Apple.

everfrustrated · 5 hours ago

>Dynamically changing data: allow values with different data types (possibly incompatible and not known beforehand) for the same JSON paths without unification into a least common type, preserving the integrity of mixed-type data.

I'm so excited for this! One of my major bug-bears with storing logs in Elasticsearch is the set-type-on-first-seen-occurrence headache.

Hope to see this leave experimental support soon!

Show replies

abe94 · 9 hours ago

We've been waiting for more JSON support for Clickhouse - the new type looks promising - and the dynamic column, and no need to specifcy subtypes is particularly helpful for us.