Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse
Captured source
source ↗Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse 2026-05-14 James Morrison
Christian Endres
9 min read This post is also available in 简体中文 , 日本語 , 한국어 and 繁體中文 . At Cloudflare, we are heavy users of ClickHouse, an open source online analytical processing (OLAP) database. Every day, we make millions of calls to ClickHouse to determine how much users should be billed for their usage of Cloudflare products. If we don't finish those jobs in a timely fashion, the invoices become very difficult to reconcile. This pipeline powers hundreds of millions of dollars in usage revenue, fraud systems, and more, so being delayed has major downstream implications. Which is why it was a big problem when the daily aggregation jobs in ClickHouse – responsible for ensuring Cloudflare’s bills go out – had slowed way down, following a migration. All the usual suspects looked clean: I/O, memory, rows scanned, parts read. Everything we would normally check when a ClickHouse query is slow appeared to be normal. This is the story of how we discovered a hidden bottleneck buried deep within ClickHouse’s internals, and the three patches we wrote to fix it.
The setup: a petabyte-scale analytics platform
We use ClickHouse to store over a hundred petabytes of data across a few dozen clusters. To simplify onboarding for our many internal teams, we built a system called "Ready-Analytics" in early 2022. The premise is simple: instead of designing new tables, teams can stream data into a single, massive table. Datasets are disambiguated by a namespace , and each record uses a standard schema (e.g., 20 float fields, 20 string fields, a timestamp, and an indexID ). In ClickHouse, the way data is sorted is crucial to query performance. This is where the indexID comes into play. It’s a string field, which forms part of the primary key, meaning that every individual namespace can have its data sorted in a way that is optimal for the queries the owners of that namespace expect to be running. Altogether, we end up with a primary key that looks like this: ( namespace , indexID , timestamp ). This system is popular, with hundreds of applications using it. It had already grown to more than 2PiB of data by December 2024, and an ingestion rate of millions of rows per second. But it had one critical flaw: its retention policy.
The problem: one retention policy to rule them all
Cloudflare has been using ClickHouse for many years, since before it had native Time-to-Live (TTL) features. Consequently, we built our own retention system based on partitioning. The Ready-Analytics table was partitioned by day , and our retention job simply dropped partitions older than 31 days. This "one-size-fits-all" 31-day retention was a major limitation. Some teams needed to store data for years due to legal or contractual obligations, while others needed only a few days. This restriction meant these use cases couldn't use Ready-Analytics and had to opt for a conventional setup, which has a far more complex onboarding process. We needed a new system that allowed per-namespace retention .
The solution: a new partitioning scheme
We considered two main approaches: A Table-per-Namespace: This would naturally solve the retention problem but would require significant new automation to manage thousands of tables on demand.
A New Partitioning Key: We could change the partitioning key from just (day) to (namespace, day) .
We chose the second option. This would allow our existing retention system to continue managing partitions, but now with per-namespace granularity. We knew this would increase the total number of data parts in the table, but we made a key assumption: since every query is filtered by a specific namespace, the number of parts read by any single query shouldn't change. We believed this meant performance would be unaffected.
This shows how we changed the partitioning, allowing us to cheaply drop data for a single namespace This new system also allowed us to build a sophisticated storage management layer. Using the max-min fairness algorithm , we could set a target disk utilization (e.g., 90%) and automatically "share" available space. Namespaces using less than their fair share would cede their unused capacity to those that needed more. This allowed us to confidently run our clusters at 90% utilization. We began the migration in January 2025. Using ClickHouse's Merge table feature, we combined the old and new tables, writing all new data to the new partitioned table while the old data aged out.
The mystery: when billing starts to break
Two months later, in late March 2025, our billing team reported that their daily aggregation jobs were slowing down. These jobs are time-critical; if they don't finish, bills don't go out. The jobs were getting progressively slower, and we were approaching a deadline. We investigated, but none of the usual suspects were to blame. I/O was fine. Memory was fine. The metrics for individual queries showed they were not reading more data or more parts than before. Our initial assumption seemed correct, yet the system was grinding to a halt. It took several days before we even had a theory. Finally, we made a plot of query duration against the total part count in the cluster. The correlation was undeniable.
Average SELECT Query Durations on the Ready Analytics ClickHouse Cluster, showing progressive performance degradation.
Linear Growth in Total Data Part Count per Table Replica, following the new (namespace, day) partitioning scheme. But why ? If we weren't reading the extra parts, why did their mere existence slow us down?
The investigation: hunting bottlenecks with flame graphs
We turned to ClickHouse's built-in trace_log to generate flame graphs. This is a built-in table that records traces from the running ClickHouse server. It not only includes traces of what code is being executed, but it associates these with specific users, query IDs and other metadata, meaning you can filter down to quite precise sets of events if necessary. In our case, we wanted to look specifically at leaf SELECT queries . This was easy thanks to the available metadata in this table. The first CPU-based flame graph quickly confirmed our suspicion: a huge amount of time was being spent in query planning . This is the phase before execution when ClickHouse decides which parts to read.
Flame...
Excerpt shown — open the source for the full document.
Notability
notability 1.0/10Low HN traction, routine technical post