togethercomputer/slurm-operator
forked from SlinkyProject/slurm-operator
Description: This project provides a framework that runs Slurm in Kubernetes.
Language: Go
Stars: 0
Forks: 0
Open issues: 3
Created: 2025-03-29T01:30:43Z
Pushed: 2026-05-27T17:30:16Z
Default branch: main
Fork: yes
Parent repository: SlinkyProject/slurm-operator
Archived: no
README:
Kubernetes Operator for Slurm Clusters
Run [Slurm] on [Kubernetes], by [SchedMD]. A [Slinky] project.
Table of Contents
- [Kubernetes Operator for Slurm Clusters](#kubernetes-operator-for-slurm-clusters)
- [Table of Contents](#table-of-contents)
- [Overview](#overview)
- [Slurm Cluster](#slurm-cluster)
- [Features](#features)
- [Controller](#controller)
- [NodeSets](#nodesets)
- [LoginSets](#loginsets)
- [Hybrid Support](#hybrid-support)
- [Slurm](#slurm)
- [Compatibility](#compatibility)
- [Quick Start](#quick-start)
- [Upgrades](#upgrades)
- [1.Y Releases](#1y-releases)
- [0.Y Releases](#0y-releases)
- [Documentation](#documentation)
- [Support and Development](#support-and-development)
- [License](#license)
Overview
[Slurm] and [Kubernetes] are workload managers originally designed for different kinds of workloads. In broad strokes: Kubernetes excels at scheduling workloads that typically run for an indefinite amount of time, with potentially vague resource requirements, on a single node, with loose policy, and it can scale its resource pool on demand; Slurm excels at quickly scheduling workloads that run for a finite amount of time, with well-defined resource requirements and topology, on multiple nodes, with strict policy, against a resource pool that is known in advance.
This project enables the best of both workload managers, unified on Kubernetes. It contains a [Kubernetes] operator to deploy and manage certain components of [Slurm] clusters. This repository implements [custom-controllers] and [custom resource definitions (CRDs)][crds] designed for the lifecycle (creation, upgrade, graceful shutdown) of Slurm clusters.

For additional architectural notes, see the [architecture] docs.
Slurm Cluster
Slurm clusters are very flexible and can be configured in various ways. Our Slurm Helm chart provides a reference implementation that is highly customizable and tries to expose everything Slurm has to offer.

For additional information about Slurm, see the [slurm][slurm-docs] docs.
Features
Controller
The Slurm control-plane is responsible for scheduling Slurm workloads onto its worker nodes and managing their states.
Changes to the Slurm configuration files are automatically detected and the Slurm cluster is reconfigured seamlessly with zero downtime of the Slurm control-plane.
> [!NOTE]
> The kubelet's `configMapAndSecretChangeDetectionStrategy` and `syncFrequency`
> settings directly affect when pods have their mounted ConfigMaps and Secrets
> updated. By default, the kubelet is in `Watch` mode with a polling frequency
> of 60 seconds.
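
The README does not show how configuration changes are detected. One common pattern, sketched here purely as an assumption, is to hash the rendered configuration files and compare the digest against the one last applied. The names `configHash` and `needsReconfigure` are hypothetical and are not taken from the operator's source.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// configHash returns a deterministic digest of a set of config files keyed
// by file name (for example, the data of a ConfigMap).
// Hypothetical helper; not taken from the operator's source.
func configHash(files map[string]string) string {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random, so sort first

	h := sha256.New()
	for _, name := range names {
		// Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
		fmt.Fprintf(h, "%d:%s%d:%s", len(name), name, len(files[name]), files[name])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// needsReconfigure reports whether the files differ from the digest that
// was last applied.
func needsReconfigure(lastApplied string, files map[string]string) bool {
	return configHash(files) != lastApplied
}

func main() {
	files := map[string]string{"slurm.conf": "ClusterName=demo\n"}
	applied := configHash(files)
	fmt.Println(needsReconfigure(applied, files)) // false

	files["slurm.conf"] = "ClusterName=demo\nSlurmctldTimeout=120\n"
	fmt.Println(needsReconfigure(applied, files)) // true
}
```

Sorting the file names keeps the digest stable across runs, so only a real content change triggers a reconfigure.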
NodeSets
A set of homogeneous Slurm workers (compute nodes), which are delegated to execute the Slurm workload.
The operator will take into consideration the running workload among Slurm nodes as it needs to scale-in, upgrade, or otherwise handle node failures. Slurm nodes will be marked as [drain][slurm-drain] before their eventual termination pending scale-in or upgrade.
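
A minimal sketch of workload-aware scale-in follows. It assumes a simplified node model and a "least busy first" heuristic; the `node` type, `scaleInOrder`, and `nextAction` are hypothetical illustrations, not the operator's actual logic.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical, simplified view of a Slurm worker.
type node struct {
	Name        string
	RunningJobs int
	Drained     bool
}

// scaleInOrder returns a copy of nodes in the order they might be removed:
// idle nodes first, then the least busy ones. This is an assumed heuristic.
func scaleInOrder(nodes []node) []node {
	out := append([]node(nil), nodes...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].RunningJobs < out[j].RunningJobs
	})
	return out
}

// nextAction decides what to do with a node selected for removal.
func nextAction(n node) string {
	switch {
	case !n.Drained:
		return "drain" // stop new jobs from landing on the node
	case n.RunningJobs > 0:
		return "wait" // drained, but existing jobs are still finishing
	default:
		return "terminate" // drained and idle, safe to delete the pod
	}
}

func main() {
	nodes := []node{
		{Name: "cpu-1-0", RunningJobs: 2},
		{Name: "cpu-1-1", RunningJobs: 0},
		{Name: "cpu-1-2", RunningJobs: 1, Drained: true},
	}
	for _, n := range scaleInOrder(nodes) {
		fmt.Printf("%s: %s\n", n.Name, nextAction(n))
	}
}
```

The point of the ordering is that a pod is only terminated once its node is both drained and idle, so running jobs are not interrupted by a scale-in or upgrade.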
Slurm node states (e.g. Idle, Allocated, Mixed, Down, Drain, Not Responding) are applied to each NodeSet pod via its pod conditions, so each NodeSet pod's status reflects its own Slurm node state.
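
As a rough illustration of mapping node states to conditions, the sketch below splits a compound state string such as `IDLE+DRAIN` into one condition per state. The local `condition` type only mirrors the type/status shape of a Kubernetes pod condition, and the `SlurmNodeState` prefix is a hypothetical naming choice; the operator's real condition types may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// condition mirrors the type/status shape of a Kubernetes pod condition.
// It is a local stand-in so the example has no external dependencies.
type condition struct {
	Type   string
	Status string // "True" or "False"
}

// stateConditions converts a compound Slurm state such as "IDLE+DRAIN" into
// one condition per state. The "SlurmNodeState" prefix is hypothetical.
func stateConditions(state string) []condition {
	var conds []condition
	for _, part := range strings.Split(state, "+") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		// Normalize case: "IDLE" becomes "Idle".
		name := strings.ToUpper(part[:1]) + strings.ToLower(part[1:])
		conds = append(conds, condition{Type: "SlurmNodeState" + name, Status: "True"})
	}
	return conds
}

func main() {
	for _, c := range stateConditions("IDLE+DRAIN") {
		fmt.Printf("%s=%s\n", c.Type, c.Status)
	}
}
```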
The operator supports NodeSet scale to zero, scaling the resource down to zero replicas. Hence, any Horizontal Pod Autoscaler (HPA) that also supports scale to zero pairs well with NodeSets.
NodeSets can be resolved by hostname. This allows login pods and worker pods to reach each other directly, pod to pod, using predictable hostnames (e.g., cpu-1-0, gpu-2-1).
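
The example hostnames suggest a `<nodeset>-<ordinal>` pattern, similar to StatefulSet pod naming. That reading is an assumption, and `podHostname` and `hostnames` below are hypothetical helpers written only to illustrate it.

```go
package main

import "fmt"

// podHostname builds "<nodeset>-<ordinal>". The pattern is assumed from the
// example hostnames and is not confirmed by the operator's source.
func podHostname(nodeSet string, ordinal int) string {
	return fmt.Sprintf("%s-%d", nodeSet, ordinal)
}

// hostnames lists the expected hostnames for a NodeSet with the given
// replica count. A negative count is treated as zero.
func hostnames(nodeSet string, replicas int) []string {
	if replicas < 0 {
		replicas = 0
	}
	names := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		names = append(names, podHostname(nodeSet, i))
	}
	return names
}

func main() {
	fmt.Println(hostnames("gpu-2", 2)) // [gpu-2-0 gpu-2-1]
}
```

Under this assumption, a login pod could reach the second worker of a NodeSet named `gpu-2` at `gpu-2-1` without looking up its pod IP.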
LoginSets
A set of homogeneous login nodes (submit node, jump host) for Slurm, which manage user identity via SSSD.
The operator supports LoginSet scale to zero, scaling the resource down to zero replicas. Hence, any Horizontal Pod Autoscaler (HPA) that also supports scale to zero pairs well with LoginSets.
Hybrid Support
Sometimes a Slurm cluster has some, but not all, of its components in Kubernetes. The operator and its CRDs are designed to support these use cases.
Slurm
Slurm is a full-featured HPC workload manager. To highlight a few features:
- [Accounting][slurm-accounting]: collect accounting information for every job and job step executed.
- [Partitions][slurm-arch]: job queues with sets of resources and constraints (e.g. job size limit, job time limit, users permitted).
- [Reservations][slurm-reservations]: reserve resources for jobs being executed by select users and/or select accounts.
- [Job Dependencies][slurm-dependency]: defer the start of jobs until the specified dependencies have been satisfied.
- [Job Containers][slurm-containers]: jobs which run an unprivileged OCI container bundle.
- [MPI][slurm-mpi]: launch parallel MPI jobs, supports various MPI implementations.
- [Priority][slurm-priority]: assigns priorities to jobs upon submission and on an ongoing basis (e.g. as they age).
- [Preemption][slurm-preempt]: stop one or more low-priority jobs to let a high-priority job run.
- [QoS][slurm-qos]: sets of policies affecting scheduling priority, preemption, and resource limits.
- [Fairshare][slurm-fairshare]: distribute resources equitably among users and accounts based on historical usage.
- [Node Health Check][slurm-healthcheck]: periodically check node health via script.
Compatibility
| Software   | Minimum Version |
| :--------- | :-------------: |
| Kubernetes |      v1.29      |
| Slurm      |      25.11      |
| Cgroup     |       v2        |
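
A pre-flight check against these minimums could look like the sketch below. It compares only the MAJOR.MINOR part of a version string; `majorMinor` and `atLeast` are hypothetical helpers, and decorated versions (for example a minor version with a `+` suffix) would need extra handling.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// majorMinor parses "v1.29", "1.31.2", or "25.11" into major and minor parts.
func majorMinor(v string) (major, minor int, err error) {
	parts := strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("version %q: want MAJOR.MINOR", v)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("version %q: bad major: %w", v, err)
	}
	if minor, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, fmt.Errorf("version %q: bad minor: %w", v, err)
	}
	return major, minor, nil
}

// atLeast reports whether version is greater than or equal to minimum,
// comparing the major part first and then the minor part.
func atLeast(version, minimum string) (bool, error) {
	vMaj, vMin, err := majorMinor(version)
	if err != nil {
		return false, err
	}
	mMaj, mMin, err := majorMinor(minimum)
	if err != nil {
		return false, err
	}
	if vMaj != mMaj {
		return vMaj > mMaj, nil
	}
	return vMin >= mMin, nil
}

func main() {
	checks := []struct{ name, have, minimum string }{
		{"Kubernetes", "v1.31.2", "v1.29"},
		{"Slurm", "24.05", "25.11"},
	}
	for _, c := range checks {
		ok, err := atLeast(c.have, c.minimum)
		fmt.Println(c.name, ok, err)
	}
}
```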
Quick Start
Install [cert-manager] with its CRDs:
    helm repo add jetstack https://charts.jetstack.io
    helm repo update
    helm install cert-manager jetstack/cert-manager \
      --set 'crds.enabled=true'…