WritingDatabricks (DBRX)Databricks (DBRX)published Jun 11, 2026seen 2h

Ingesting the Milky Way: Petabyte-Scale with Zerobus Ingest

Open original ↗

Captured source

source ↗

Ingesting the Milky Way: Petabyte-Scale with Zerobus Ingest | Databricks Blog Skip to main content

Summary

Databricks Zerobus Ingest is a serverless streaming API that enables teams to instantly deploy petabyte-scale data pipelines without manual infrastructure management.

Zerobus’ architecture relies on dynamic partitioning to automatically scale compute resources, efficiently handling unpredictable data volumes without complex tuning.

This zero-setup framework easily processes massive workloads, demonstrating the ability to sustain over 12 GB/s throughput to a single table during 24-hour benchmarks.

Telemetry data is everywhere. IoT sensors on factory floors. Satellite arrays scanning the atmosphere. Autonomous vehicles are logging thousands of events per second. Every one of these systems has the same underlying problem: a continuous, high-volume stream of time-series observations that needs to land somewhere queryable. It needs to be fast, reliable, and without an engineering team spending weeks tuning and maintaining infrastructure that is typical of Kafka based workloads. That's the problem Zerobus Ingest is built to solve. Zerobus is Databricks' fully managed, serverless streaming ingest service. It's a push-based API that accepts data from any producer and writes it directly into Delta tables, governed by Unity Catalog. No infrastructure to provision. No connector pipeline to maintain. No partitions or broker decision-making.

Instead, you create a table and push data. It lands in your lakehouse, ready to query in seconds. You no longer need to run Kafka as a pipe when your destination is the lakehouse. We used NASA’s NEOWISE dataset , representing 200 billion data points over 11 years, to benchmark Zerobus Ingest, ingesting 1 petabyte in under 24 hours, with zero pre-configuration and stable latency. By ingesting 1PB within 24 hours, we demonstrate Zerobus’s ability to maintain continuous throughput of 12 GB/s to a single table! 🚀 Now Delivering Petabyte scale: Streaming the Milky Way (12GB/sec/table)

Expand

The visualization above replays one year of data. The Milky Way's disk emerges in orange as those detections land in the table; the cyan crescent marks the sky region the telescope was pointing at at any given moment.

For more on how to run the benchmark yourself, read this companion blog on Databricks Community. This post walks through three of our design decisions that made this possible. Designing a system that autoscales via dynamic partitioning. Building our own zero-copy protobuf decoder. Implementing a latency-optimized write-ahead log before data is published to the lakehouse.

Our key design decisions Our aspiration was to build a streaming system that could support petabyte-scale and auto-scale to handle fluctuating ingestion patterns. Traditional streaming architectures require you to decide how many brokers and partitions a given workload needs. This requires knowledge of peak load and consumer ingestion constraints, as well as forecasting and an understanding of the end-to-end pipeline. By going back to first principles, we designed and built a system that scales to handle petabyte-sized workloads for data producers “magically.” Autoscaling achieved through dynamic partitioning The problem we were trying to solve was how to have efficient autoscaling to achieve elastic “limitness” scaling. Our thesis was that by moving away from static partitioning and toward the logical unit of a stream/connection, we could unlock true autoscaling and rebalancing while maintaining ordering guarantees, which are important for consumption workloads. The static partition problem In message bus architectures, partitions are the unit of both parallelism and ordering . This coupling creates a constraint that can be painful once you have consumers who depend on it. Ordering is typically a per-partition guarantee, not per-producer. The number of partitions and the distribution of data across them affect a consumer's ability to keep up with ingestion. This means: If your partition count changes, the routing function that maps a producer's messages to a partition may now send them to a different partition. The consumer now has to reconcile this. In practice, most teams treat partition topology as immutable. You provision for peak load and carry that infrastructure forever. You can add partitions but you typically cannot safely decrease them. The standard workaround is a partition routing key derived from a field in the message. This helps with ordering consistency but still doesn't solve the scale-down problem.

We moved the ordering guarantee to the stream connection In traditional systems, ordering is a partition-level guarantee . In Zerobus Ingest, ordering is a stream connection-level guarantee . When a producer opens a stream with Zerobus (a connection to our server), they're registering a logical identity with the service. For the lifetime of that connection, their data arrives in order, regardless of which “partition” pod processes it. "Your stream is ordered" , not "your partition is ordered." That's the contract. Hot routing and true autoscaling Internally, Zerobus Ingest distributes streams across a pool of pods. Routing is heuristic-based: if a pod is running hot, new incoming streams are routed to a different pod. The producer is unaware. Their ordering guarantee is unaffected. Ordering lives at the stream level, which means pods can be added when demand spikes and removed when demand drops . Existing streams then drain gracefully, and new streams stop routing there. The pool then shrinks, keeping compute utilization efficient. This is true autoscaling. The unit of granularity is the stream connection, not the partition assignment. Our dynamic partitioning design enables Zerobus to autoscale to over 12GB per second throughput for a table while remaining cost-efficient.

Zero-copy high-performance data handling Zerobus's main goal is to allow an efficient, row-by-row transfer of data streams of any volume. To achieve this, we needed to completely avoid any needless copying and memory allocations - from the input formats that clients send to Zerobus, to the internal formats that guarantee durability and open Delta formats. Zerobus currently supports the following message formats. Zerobus Format When to use

protobuf Generic, fast record-by-record ingestion.

arrow Fast batch ingestion.

json Batch or...

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

Substantial technical blog post on data ingestion tool.