Fork · Databricks (DBRX) · published Mar 26, 2024

databricks/rollout-operator

forked from grafana/rollout-operator


Description: Kubernetes Rollout Operator

Language: Go

License: Apache-2.0

Stars: 1

Forks: 3

Open issues: 1

Created: 2024-03-26T17:29:26Z

Pushed: 2025-09-01T23:07:00Z

Default branch: db_main

Fork: yes

Parent repository: grafana/rollout-operator

Archived: no

README:

Kubernetes Rollout Operator

This operator coordinates the rollout of pods between different StatefulSets within a specific namespace and can be used to manage multi-AZ deployments where pods running in each AZ are managed by a dedicated StatefulSet.

How updates work

The operator coordinates the rollout of pods belonging to StatefulSets that have the rollout-group label and an update strategy set to OnDelete. The label value should identify the group of StatefulSets to which the StatefulSet belongs. Make sure the StatefulSet has a name label in its spec.template, as the operator uses it to find the pods belonging to that StatefulSet.

For example, given the following StatefulSets in a namespace:

  • ingester-zone-a with rollout-group: ingester
  • ingester-zone-b with rollout-group: ingester
  • compactor-zone-a with rollout-group: compactor
  • compactor-zone-b with rollout-group: compactor

The operator independently coordinates the rollout of pods of each group:

  • Rollout group: ingester
      • ingester-zone-a
      • ingester-zone-b
  • Rollout group: compactor
      • compactor-zone-a
      • compactor-zone-b
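
The grouping above can be sketched in Go as follows. This is a minimal illustration, not the operator's actual code: the statefulSet type and the groupByRolloutGroup function are hypothetical names, and only the rollout-group label key comes from the text.

```go
package main

import "sort"

// statefulSet is a hypothetical, simplified stand-in for a Kubernetes
// StatefulSet: only the name and labels matter for grouping.
type statefulSet struct {
	Name   string
	Labels map[string]string
}

// rolloutGroupLabel is the label key described in the README.
const rolloutGroupLabel = "rollout-group"

// groupByRolloutGroup buckets StatefulSet names by the value of their
// rollout-group label. StatefulSets without the label are ignored.
func groupByRolloutGroup(sets []statefulSet) map[string][]string {
	groups := make(map[string][]string)
	for _, s := range sets {
		group, ok := s.Labels[rolloutGroupLabel]
		if !ok || group == "" {
			continue
		}
		groups[group] = append(groups[group], s.Name)
	}
	// Sort names so each group is listed in a stable order.
	for _, names := range groups {
		sort.Strings(names)
	}
	return groups
}
```

Each resulting group can then be processed on its own, which mirrors the statement that groups are coordinated independently.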

For each rollout group, the operator guarantees:

  1. Pods in two different StatefulSets are not rolled out at the same time.
  2. Pods in a StatefulSet are rolled out if and only if all pods in all other StatefulSets of the same group are Ready (otherwise the operator starts or continues the rollout once this check is satisfied).
  3. Pods are rolled out if and only if all StatefulSets in the same group have the OnDelete update strategy (otherwise the operator skips the group and logs an error).
  4. The maximum number of not-Ready pods in a StatefulSet doesn't exceed the value configured in the rollout-max-unavailable annotation (if not set, it defaults to 1). Values:
      • <= 0: invalid (defaults to 1 and logs a warning)
      • 1: pods are rolled out sequentially
      • > 1: pods are rolled out in parallel (honoring the configured number of max unavailable pods)
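
To make the max-unavailable guarantee concrete, here is a rough Go sketch of how such a budget could be computed. The function names are hypothetical, and treating unparsable or non-positive values as the default of 1 is an assumption of this sketch rather than a statement about the operator's code.

```go
package main

import "strconv"

// maxUnavailableAnnotation is the annotation key described in the README.
const maxUnavailableAnnotation = "rollout-max-unavailable"

// maxUnavailable reads the annotation and falls back to 1 when it is
// missing. Falling back to 1 for unparsable or non-positive values is an
// assumption made for this sketch.
func maxUnavailable(annotations map[string]string) int {
	raw, ok := annotations[maxUnavailableAnnotation]
	if !ok {
		return 1
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return 1
	}
	return n
}

// deletableNow returns how many outdated pods could be deleted right now
// without pushing the number of not-Ready pods above the budget.
func deletableNow(maxUnavail, notReady, outdated int) int {
	budget := maxUnavail - notReady
	if budget <= 0 || outdated <= 0 {
		return 0
	}
	if outdated < budget {
		return outdated
	}
	return budget
}
```

With a budget of 3 and one pod already not Ready, deletableNow(3, 1, 5) allows two more deletions; with the default budget of 1, pods are effectively replaced one at a time.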

How scaling up and down works

The operator can also optionally coordinate scaling up and down of StatefulSets that are part of the same rollout-group based on the grafana.com/rollout-downscale-leader annotation. When using this feature, the grafana.com/min-time-between-zones-downscale label must also be set on each StatefulSet.

This can be useful for automating the tedious scaling of stateful services like Mimir ingesters. Making use of this feature requires adding a few annotations and labels to configure how it works.

If the grafana.com/rollout-upscale-only-when-leader-ready annotation is set to true on a follower StatefulSet, the operator will only scale up the follower once all replicas in the leader StatefulSet are ready. This ensures that the follower zone does not scale up until the leader zone is completely stable.

Example usage for a multi-AZ ingester group:

  • For ingester-zone-a, add the following:
      • Labels:
          • grafana.com/min-time-between-zones-downscale=12h (change the value here to an appropriate duration)
          • grafana.com/prepare-downscale=true (to allow the service to be notified when it will be scaled down)
      • Annotations:
          • grafana.com/prepare-downscale-http-path=ingester/prepare-shutdown (to call a specific endpoint on the service)
          • grafana.com/prepare-downscale-http-port=80 (to call a specific endpoint on the service)
  • For ingester-zone-b, add the following:
      • Labels:
          • grafana.com/min-time-between-zones-downscale=12h (change the value here to an appropriate duration)
          • grafana.com/prepare-downscale=true (to allow the service to be notified when it will be scaled down)
      • Annotations:
          • grafana.com/rollout-downscale-leader=ingester-zone-a (zone b will follow zone a, after a delay)
          • grafana.com/rollout-upscale-only-when-leader-ready=true (zone b will only scale up once all replicas in zone a are ready)
          • grafana.com/prepare-downscale-http-path=ingester/prepare-shutdown (to call a specific endpoint on the service)
          • grafana.com/prepare-downscale-http-port=80 (to call a specific endpoint on the service)
  • For ingester-zone-c, add the following:
      • Labels:
          • grafana.com/min-time-between-zones-downscale=12h (change the value here to an appropriate duration)
          • grafana.com/prepare-downscale=true (to allow the service to be notified when it will be scaled down)
      • Annotations:
          • grafana.com/rollout-downscale-leader=ingester-zone-b (zone c will follow zone b, after a delay)
          • grafana.com/rollout-upscale-only-when-leader-ready=true (zone c will only scale up once all replicas in zone b are ready)
          • grafana.com/prepare-downscale-http-path=ingester/prepare-shutdown (to call a specific endpoint on the service)
          • grafana.com/prepare-downscale-http-port=80 (to call a specific endpoint on the service)
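
The per-zone metadata in the example follows a regular pattern: every zone carries the same labels and prepare-downscale annotations, and every zone except the first names the previous zone as its leader. A minimal Go sketch of that pattern is shown below; zoneMetadata and zoneChain are hypothetical helper names, while the label and annotation keys and the example values are taken from the list above.

```go
package main

const (
	labelMinTime          = "grafana.com/min-time-between-zones-downscale"
	labelPrepareDownscale = "grafana.com/prepare-downscale"
	annDownscaleLeader    = "grafana.com/rollout-downscale-leader"
	annUpscaleWhenReady   = "grafana.com/rollout-upscale-only-when-leader-ready"
	annPreparePath        = "grafana.com/prepare-downscale-http-path"
	annPreparePort        = "grafana.com/prepare-downscale-http-port"
)

// zoneMetadata is a hypothetical container for the labels and annotations
// to put on one StatefulSet.
type zoneMetadata struct {
	Name        string
	Labels      map[string]string
	Annotations map[string]string
}

// zoneChain builds metadata for an ordered list of zones where each zone
// follows the one before it. The path and port are the example's values.
func zoneChain(zones []string, minTime string) []zoneMetadata {
	out := make([]zoneMetadata, 0, len(zones))
	for i, name := range zones {
		z := zoneMetadata{
			Name: name,
			Labels: map[string]string{
				labelMinTime:          minTime,
				labelPrepareDownscale: "true",
			},
			Annotations: map[string]string{
				annPreparePath: "ingester/prepare-shutdown",
				annPreparePort: "80",
			},
		}
		// Every zone after the first follows its predecessor.
		if i > 0 {
			z.Annotations[annDownscaleLeader] = zones[i-1]
			z.Annotations[annUpscaleWhenReady] = "true"
		}
		out = append(out, z)
	}
	return out
}
```

Calling zoneChain with ingester-zone-a, ingester-zone-b, ingester-zone-c and "12h" reproduces the three configurations listed in the example.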

Scaling based on reference resource

The rollout-operator can use a custom resource with scale and status subresources as a "source of truth" for the number of replicas of a target StatefulSet. The "source of truth" resource (or "reference resource") is configured using the following annotations:

  • grafana.com/rollout-mirror-replicas-from-resource-name
  • grafana.com/rollout-mirror-replicas-from-resource-kind
  • grafana.com/rollout-mirror-replicas-from-resource-api-version
  • grafana.com/rollout-mirror-replicas-from-resource-write-back

These annotations must be set on the StatefulSet that the rollout-operator will scale (i.e. the target StatefulSet). The number of replicas in the target StatefulSet follows the replicas in the reference resource (read from its scale subresource). The reference resource's status subresource is updated with the current number of replicas in the target StatefulSet, unless this is explicitly disabled by setting the grafana.com/rollout-mirror-replicas-from-resource-write-back annotation to false.
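
As an illustration of how these annotations fit together, the Go sketch below reads them from a StatefulSet's annotation map. The replicaReference type and parseReplicaReference function are hypothetical, and requiring kind and api-version whenever a name is set is an assumption of this sketch. Write-back stays enabled unless the annotation is exactly false, as described above.

```go
package main

import "fmt"

const (
	annMirrorName       = "grafana.com/rollout-mirror-replicas-from-resource-name"
	annMirrorKind       = "grafana.com/rollout-mirror-replicas-from-resource-kind"
	annMirrorAPIVersion = "grafana.com/rollout-mirror-replicas-from-resource-api-version"
	annMirrorWriteBack  = "grafana.com/rollout-mirror-replicas-from-resource-write-back"
)

// replicaReference is a hypothetical description of the reference resource.
type replicaReference struct {
	Name       string
	Kind       string
	APIVersion string
	WriteBack  bool
}

// parseReplicaReference extracts the reference resource from annotations.
// configured is false when no reference resource name is set.
func parseReplicaReference(annotations map[string]string) (ref replicaReference, configured bool, err error) {
	name := annotations[annMirrorName]
	if name == "" {
		return replicaReference{}, false, nil
	}
	kind, apiVersion := annotations[annMirrorKind], annotations[annMirrorAPIVersion]
	if kind == "" || apiVersion == "" {
		// Assumption: a name without kind and api-version is a misconfiguration.
		return replicaReference{}, false, fmt.Errorf("reference %q needs both kind and api-version annotations", name)
	}
	return replicaReference{
		Name:       name,
		Kind:       kind,
		APIVersion: apiVersion,
		// Write-back is on unless explicitly disabled with "false".
		WriteBack: annotations[annMirrorWriteBack] != "false",
	}, true, nil
}
```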

This is similar to using grafana.com/rollout-downscale-leader, but the reference resource can be any kind of resource, not just a StatefulSet. Furthermore, grafana.com/min-time-between-zones-downscale is not respected when scaling is based on a reference resource.

This can be used in combination with a HorizontalPodAutoscaler when it is undesirable to set the number of replicas directly on the target StatefulSet, because we want to add custom logic to the scaledown (see next…
