Repo: Databricks (DBRX), published Nov 11, 2025

databricks/dicer

Scala


Description: Dicer auto-sharder: Infrastructure for building sharded services

Language: Scala

License: Apache-2.0

Stars: 268

Forks: 25

Open issues: 1

Created: 2025-11-11T00:25:21Z

Pushed: 2026-06-10T23:46:03Z

Default branch: master

Fork: no

Archived: no

README:

Dicer: Databricks’ Auto-Sharder

Dicer is a foundational infrastructure system for building sharded services. By colocating in-memory state with the computation that operates on it, Dicer enables applications to achieve low latency, high availability, and cost efficiency at scale. It is widely used across Databricks, where it has driven substantial reliability and performance improvements in production systems.

To motivate Dicer, we contrast it with a typical stateless model, in which the application does not retain in-memory state across requests. This model is inherently expensive: every request incurs a database hit (and possibly additional RPCs to other services), driving up both operational costs and latency. Introducing a remote cache helps, but still fails to solve several fundamental inefficiencies:

  • Network Latency: Every request still pays the "tax" of network hops to the caching layer.
  • CPU Overhead: Significant cycles are wasted on (de)serialization as data moves between the cache and the application.
  • The Overread Problem: Stateless services often fetch entire objects or large blobs from the cache only to use a small fraction of the data. These overreads waste bandwidth and memory, as the application discards the majority of the data it just spent time fetching.

Sharding avoids these inefficiencies by keeping state in the memory of the pods that operate on it, but it can introduce challenges of its own if not done correctly. We built Dicer to change this. Services integrate with a small Dicer library, and Dicer runs as an intelligent control plane that continuously and asynchronously updates the service's shard assignments. It reacts to a wide range of signals, including application health, load, termination notices, and other environmental inputs. As a result, Dicer keeps services highly available and well balanced even during rolling restarts, crashes, autoscaling events, and periods of severe load skew.

Figure 1. Dicer system overview. See the section below for more details.

Dicer powers critical production workloads at Databricks, and has significantly improved reliability and performance. For example:

  • Unity Catalog: Migrated to being a sharded service with an in-memory cache using Dicer, reducing database load by more than 10x.
  • SQL Query Orchestration Engine: Improved availability by two nines by replacing static sharding with Dicer’s dynamic sharding. Achieved zero downtime during rolling updates and autoscaling, as well as improved load balancing.
  • Softstore Distributed Remote Cache: Our internal distributed remote caching service utilizes Dicer’s state transfer feature (to be available in a future release) to seamlessly transfer values between pods during planned restarts such as rolling updates, resulting in negligible impact to cache hit rates during rolling restarts.

In general, Dicer has many use cases, such as:

  1. Caching and serving database state from local memory with low latency
  2. Implementing purely in-memory systems such as remote caches or quota services
  3. Rendezvous between publishers and subscribers
  4. Wherever batching can help (e.g. for high throughput data ingestion)
  5. Streaming data aggregation and summarization
  6. Distributing regular background work among a set of workers
  7. Implementing a highly available controller service using soft leader selection
  8. Sharding user sessions (e.g. for KV cache reuse with LLMs)

See the blog post for more information on background and motivation for Dicer. In this README, we discuss Dicer, its model and features, the structure of this repository, and how to get started. See the [docs](docs/) directory for further documentation.

1. Overview

1.1. Application model

Dicer models an application as serving requests or otherwise performing some work associated with a logical key. For example, a service that serves user profiles might use user IDs as its keys. Dicer shards the application by continuously generating an assignment of keys to pods to keep the service highly available and load balanced.

1.2. Basic concepts

We now describe the basic concepts of Dicer. Figure 2 shows an example Dicer Assignment capturing these concepts.

Figure 2. Dicer assigns "slices" of the SliceKey key space to resources (pods).

Target: A string identifier of the application that Dicer will auto-shard (e.g. “myservice”), used in configs, metrics, and so on.

Resource: An entity to which Dicer will assign slices (today, this can only be a Kubernetes pod).

SliceKey: The representation of an application key to Dicer. Applications map each key in their application key space to a SliceKey, and Dicer assigns ranges of the SliceKey space to resources. SliceKeys should be a hash of the application key to evenly distribute keys across the SliceKey space (e.g. using a suitable hash function, like FarmHash).
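
As an illustration of the key mapping described above, here is a minimal Python sketch. It is not Dicer's API: `to_slice_key` is a hypothetical helper, and BLAKE2b is swapped in for FarmHash only because it ships with Python's standard library. The 8-byte key width is likewise an assumption made for the example.

```python
import hashlib


def to_slice_key(app_key: str) -> bytes:
    """Map an application key to a fixed-width, evenly distributed key.

    Hypothetical helper: BLAKE2b stands in for FarmHash, and the 8-byte
    width is an arbitrary choice for this sketch.
    """
    return hashlib.blake2b(app_key.encode("utf-8"), digest_size=8).digest()


# Neighbouring application keys land at unrelated points of the hashed space,
# so a run of sequential user IDs does not pile up in a single key range.
for user_id in ("user-1001", "user-1002", "user-1003"):
    print(user_id, to_slice_key(user_id).hex())
```

Because the mapping is deterministic, every client and server that hashes the same application key arrives at the same point in the key space.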

Slice: To scale to applications with millions or billions of keys, Dicer operates on key ranges rather than individual keys. It partitions the SliceKey space into contiguous ranges, called slices, and assigns these slices to resources. Slices are automatically split, merged, replicated, or dereplicated as needed to maintain balanced load. Hot keys can also be isolated into their own slice and individually replicated (the red slice in Figure 1) when needed for load balancing.
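
The split and merge operations on slices can be sketched as follows. This is an illustrative model under stated assumptions, not Dicer's implementation: it assumes keys are byte strings ordered lexicographically, represents a slice as a half-open range `[low, high)`, and uses hypothetical function names.

```python
from typing import List, Optional, Tuple

# A slice is a half-open range [low, high) of byte-string keys (assumed model);
# high=None means the range is unbounded above.
Slice = Tuple[bytes, Optional[bytes]]


def contains(s: Slice, key: bytes) -> bool:
    low, high = s
    return low <= key and (high is None or key < high)


def isolate_hot_key(s: Slice, hot_key: bytes) -> List[Slice]:
    """Split s so that hot_key sits alone in its own slice."""
    if not contains(s, hot_key):
        raise ValueError("hot key is outside the slice")
    low, high = s
    # Smallest byte string that sorts strictly after hot_key.
    successor = hot_key + b"\x00"
    pieces: List[Slice] = []
    if low < hot_key:
        pieces.append((low, hot_key))
    pieces.append((hot_key, successor))
    if high is None or successor < high:
        pieces.append((successor, high))
    return pieces


def merge(left: Slice, right: Slice) -> Slice:
    """Merge two adjacent slices back into one contiguous slice."""
    if left[1] != right[0]:
        raise ValueError("slices are not adjacent")
    return (left[0], right[1])


pieces = isolate_hot_key((b"", None), b"\x42")
print(pieces)
```

The single-key slice `[hot_key, hot_key + b"\x00")` can then be given to several resources without moving any of its neighbours.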

Assignment: A complete set of non-overlapping slices that covers the full key space from “” (the empty key) to infinity, where each slice is assigned to one or more resources.
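
The two properties in this definition (full coverage and no overlap) and the key lookup they enable can be sketched in Python. The tuple layout, function names, and pod names are assumptions made for the example and do not reflect Dicer's actual data structures.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (low, high, resources): half-open key range [low, high); high=None means "inf".
AssignedSlice = Tuple[bytes, Optional[bytes], List[str]]


def validate(assignment: List[AssignedSlice]) -> None:
    """Check that the slices tile the key space from b"" to inf exactly once."""
    if not assignment or assignment[0][0] != b"":
        raise ValueError("assignment must start at the empty key")
    if assignment[-1][1] is not None:
        raise ValueError("last slice must be unbounded")
    for low, high, resources in assignment:
        if high is not None and not low < high:
            raise ValueError("slice bounds must be increasing")
        if not resources:
            raise ValueError("every slice needs at least one resource")
    for (_, high, _), (next_low, _, _) in zip(assignment, assignment[1:]):
        if high != next_low:
            raise ValueError("gap or overlap between adjacent slices")


def lookup(assignment: List[AssignedSlice], slice_key: bytes) -> List[str]:
    """Return the resources owning slice_key, via binary search on lower bounds."""
    lows = [low for low, _, _ in assignment]
    return assignment[bisect_right(lows, slice_key) - 1][2]


assignment = [
    (b"", b"\x40", ["pod-0"]),
    (b"\x40", b"\x40\x00", ["pod-1", "pod-2"]),  # single hot key, replicated
    (b"\x40\x00", None, ["pod-1"]),
]
validate(assignment)
print(lookup(assignment, b"\x40"))
```

Because the slices are sorted and contiguous, a lookup costs one binary search regardless of how many application keys exist.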

1.3. Key system components

Dicer is composed of an Assigner service and two client libraries, shown in use by an application above in Figure 1.

Assigner: The Dicer service that gathers target signals, generates assignments, and distributes those assignments to clients. The service is multi-tenant (designed to serve all Targets in a region) and supports high availability (HA).
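
To make the idea of generating an assignment from load signals concrete, here is a deliberately simple greedy sketch. It is not the Assigner's algorithm, which this README does not describe: the function name, the per-slice load numbers, and the heaviest-first heuristic are all assumptions chosen for illustration.

```python
import heapq
from typing import Dict, List


def greedy_assign(slice_loads: Dict[str, float], pods: List[str]) -> Dict[str, str]:
    """Place each slice on the currently least-loaded pod, heaviest slices first."""
    if not pods:
        raise ValueError("need at least one healthy pod")
    heap = [(0.0, pod) for pod in pods]  # (accumulated load, pod name)
    heapq.heapify(heap)
    placement: Dict[str, str] = {}
    for slice_id, load in sorted(slice_loads.items(), key=lambda kv: -kv[1]):
        pod_load, pod = heapq.heappop(heap)
        placement[slice_id] = pod
        heapq.heappush(heap, (pod_load + load, pod))
    return placement


# Hypothetical load report: one number per slice, in arbitrary units.
loads = {"s0": 5.0, "s1": 3.0, "s2": 2.0, "s3": 2.0}
placement = greedy_assign(loads, ["pod-0", "pod-1"])
print(placement)
```

A production sharder also has to weigh concerns this sketch ignores, such as limiting how many slices move between consecutive assignments and reacting to health and termination signals.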

Slicelet: Dicer library integrated into the servers of the Target service. Watches the Assigner service for assignment updates and caches assignments locally for fast lookup on the…

