What does this fork signal mean?

Together AI created togethercomputer/slurm-operator as a fork of SlinkyProject/slurm-operator. This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo togethercomputer/slurm-operator · parent SlinkyProject/slurm-operator · Routine fork, no community traction. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

Together AI Fork: togethercomputer/slurm-operator

Captured source

GitHub · github.com/togethercomputer/slurm-operator

togethercomputer/slurm-operator repository metadata

published Mar 29, 2025 · seen 5d · captured 16h · http 200 · method plain

togethercomputer/slurm-operator

Description: This project provides a framework that runs Slurm in Kubernetes.

Language: Go

Stars: 0

Forks: 0

Open issues: 3

Created: 2025-03-29T01:30:43Z

Pushed: 2026-05-27T17:30:16Z

Default branch: main

Fork: yes

Parent repository: SlinkyProject/slurm-operator

Archived: no

README:

Kubernetes Operator for Slurm Clusters

Run [Slurm] on [Kubernetes], by [SchedMD]. A [Slinky] project.

[Kubernetes Operator for Slurm Clusters](#kubernetes-operator-for-slurm-clusters)
[Table of Contents](#table-of-contents)
[Overview](#overview)
[Slurm Cluster](#slurm-cluster)
[Features](#features)
[Controller](#controller)
[NodeSets](#nodesets)
[LoginSets](#loginsets)
[Hybrid Support](#hybrid-support)
[Slurm](#slurm)
[Compatibility](#compatibility)
[Quick Start](#quick-start)
[Upgrades](#upgrades)
[1.Y Releases](#1y-releases)
[0.Y Releases](#0y-releases)
[Documentation](#documentation)
[Support and Development](#support-and-development)
[License](#license)

Overview

[Slurm] and [Kubernetes] are workload managers originally designed for different kinds of workloads. In broad strokes: Kubernetes excels at scheduling workloads that typically run for an indefinite amount of time, with potentially vague resource requirements, on a single node, with loose policy, but can scale its resource pool infinitely to meet demand; Slurm excels at quickly scheduling workloads that run for a finite amount of time, with well-defined resource requirements and topology, on multiple nodes, with strict policy, but its resource pool is known.

This project enables the best of both workload managers, unified on Kubernetes. It contains a [Kubernetes] operator to deploy and manage certain components of [Slurm] clusters. This repository implements [custom-controllers] and [custom resource definitions (CRDs)][crds] designed for the lifecycle (creation, upgrade, graceful shutdown) of Slurm clusters.

!["Slurm Operator Architecture"](./docs/_static/images/architecture-operator.svg)

For additional architectural notes, see the [architecture] docs.

Slurm Cluster

Slurm clusters are very flexible and can be configured in various ways. Our Slurm helm chart provides a reference implementation that is highly customizable and tries to expose everything Slurm has to offer.

!["Slurm Architecture"](./docs/_static/images/architecture-slurm.svg)

For additional information about Slurm, see the [slurm][slurm-docs] docs.

Features

Controller

The Slurm control-plane is responsible for scheduling Slurm workload onto its worker nodes and managing their states.

Changes to the Slurm configuration files are automatically detected and the Slurm cluster is reconfigured seamlessly with zero downtime of the Slurm control-plane.

> [!NOTE]
> The kubelet's configMapAndSecretChangeDetectionStrategy and syncFrequency
> settings directly affect when pods have their mounted ConfigMaps and Secrets
> updated. By default, the kubelet is in Watch mode with a polling frequency
> of 60 seconds.
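As a rough illustration of how a controller might notice that mounted configuration has changed, the sketch below hashes the rendered config files and compares the digest with the last one applied. This is an assumption about one common pattern, not the operator's confirmed mechanism; `config_digest`, `needs_reconfigure`, and the sample file contents are hypothetical.

```python
import hashlib
from typing import Dict, Optional


def config_digest(files: Dict[str, str]) -> str:
    """Return a stable SHA-256 digest over file names and contents."""
    h = hashlib.sha256()
    # Sort by name so the digest does not depend on dict ordering.
    for name in sorted(files):
        h.update(name.encode("utf-8"))
        h.update(b"\0")  # separator between name and content
        h.update(files[name].encode("utf-8"))
        h.update(b"\0")  # separator between entries
    return h.hexdigest()


def needs_reconfigure(last_digest: Optional[str], files: Dict[str, str]) -> bool:
    """True when the config differs from what was last applied."""
    return last_digest != config_digest(files)


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    old = {"slurm.conf": "ClusterName=demo\n"}
    new = {"slurm.conf": "ClusterName=demo\nSchedulerType=sched/backfill\n"}
    applied = config_digest(old)
    print(needs_reconfigure(applied, old))  # same content as applied
    print(needs_reconfigure(applied, new))  # content was edited
```

In a real deployment, the branch where `needs_reconfigure` returns `True` is where something like `scontrol reconfigure` would be triggered. Because the kubelet refreshes mounted files on its own schedule, a pod-side check like this can only react after that refresh has happened.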

NodeSets

A set of homogeneous Slurm workers (compute nodes), which are delegated to execute the Slurm workload.

The operator will take into consideration the running workload among Slurm nodes as it needs to scale-in, upgrade, or otherwise handle node failures. Slurm nodes will be marked as [drain][slurm-drain] before their eventual termination pending scale-in or upgrade.
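The drain-before-terminate ordering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the operator's actual selection logic: the `Node` type, `plan_scale_in`, and the rule of preferring the least busy nodes are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    name: str
    running_jobs: int
    draining: bool = False


def plan_scale_in(nodes: List[Node], remove: int) -> Tuple[List[str], List[str]]:
    """Pick `remove` nodes to retire; return (to_drain, safe_to_delete).

    Nodes with the fewest running jobs are chosen first so running work is
    disturbed as little as possible. Every chosen node is drained; only
    those with no running jobs are reported as safe to delete right now.
    """
    count = max(remove, 0)  # guard against negative requests
    # Fewest running jobs first; the name breaks ties deterministically.
    candidates = sorted(nodes, key=lambda n: (n.running_jobs, n.name))[:count]
    to_drain: List[str] = []
    safe_to_delete: List[str] = []
    for node in candidates:
        if not node.draining:
            node.draining = True  # no new jobs should land on this node
            to_drain.append(node.name)
        if node.running_jobs == 0:
            safe_to_delete.append(node.name)
    return to_drain, safe_to_delete
```

Calling `plan_scale_in` again on a later pass would report a drained node as safe to delete once its `running_jobs` count reaches zero, which mirrors the idea of termination waiting on the running workload.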

Slurm node states (e.g. Idle, Allocated, Mixed, Down, Drain, Not Responding) are applied to each NodeSet pod via its pod conditions, so each pod's status reflects the state of its own Slurm node.
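To make the state-to-condition mapping concrete, here is a small sketch that turns a Slurm state string into a list of pod-style conditions. It assumes the state arrives as a `+`-joined string such as `MIXED+DRAIN`; the condition type names (`SlurmNodeState...`) and the `pod_conditions` helper are hypothetical and are not taken from the operator's CRDs.

```python
from typing import Dict, List

# States named in the text above; the condition type names are hypothetical.
KNOWN_STATES = ["IDLE", "ALLOCATED", "MIXED", "DOWN", "DRAIN", "NOT_RESPONDING"]


def pod_conditions(slurm_state: str) -> List[Dict[str, str]]:
    """Translate a '+'-joined Slurm state string into pod-style conditions."""
    active = {
        part.strip().upper()
        for part in slurm_state.split("+")
        if part.strip()
    }
    return [
        {
            # "NOT_RESPONDING" becomes "SlurmNodeStateNotResponding".
            "type": "SlurmNodeState" + name.title().replace("_", ""),
            "status": "True" if name in active else "False",
        }
        for name in KNOWN_STATES
    ]
```

Emitting one condition per known state, rather than only the active ones, means a consumer can distinguish "not draining" from "state unknown".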

The operator supports NodeSet scale to zero, scaling the resource down to zero replicas. Hence, any Horizontal Pod Autoscaler (HPA) that also supports scale to zero pairs well with NodeSets.

NodeSets can be resolved by hostname. This lets login pods and worker pods communicate directly, pod to pod, using predictable hostnames (e.g., cpu-1-0, gpu-2-1).
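The example hostnames suggest a `<nodeset>-<ordinal>` pattern, similar to how StatefulSet pods are named. The sketch below assumes that pattern; it is inferred from the examples only, and `nodeset_hostnames` is a hypothetical helper rather than part of the operator.

```python
from typing import List


def nodeset_hostnames(nodeset: str, replicas: int) -> List[str]:
    """Return '<nodeset>-<ordinal>' names for ordinals 0..replicas-1."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return [f"{nodeset}-{i}" for i in range(replicas)]


if __name__ == "__main__":
    # A NodeSet named "cpu-1" with two replicas.
    print(nodeset_hostnames("cpu-1", 2))
```

Under this assumption, a login pod can work out every worker's hostname from the NodeSet name and replica count alone, without querying the cluster.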

LoginSets

A set of homogeneous login nodes (submit node, jump host) for Slurm, which manage user identity via SSSD.

The operator supports LoginSet scale to zero, scaling the resource down to zero replicas. Hence, any Horizontal Pod Autoscaler (HPA) that also supports scale to zero pairs well with LoginSets.

Hybrid Support

Sometimes a Slurm cluster has some, but not all, of its components in Kubernetes. The operator and its CRDs are designed to support these use cases.

Slurm

Slurm is a full-featured HPC workload manager. To highlight a few features:

- [Accounting][slurm-accounting]: collect accounting information for every job and job step executed.
- [Partitions][slurm-arch]: job queues with sets of resources and constraints (e.g. job size limit, job time limit, users permitted).
- [Reservations][slurm-reservations]: reserve resources for jobs being executed by select users and/or select accounts.
- [Job Dependencies][slurm-dependency]: defer the start of jobs until the specified dependencies have been satisfied.
- [Job Containers][slurm-containers]: jobs which run an unprivileged OCI container bundle.
- [MPI][slurm-mpi]: launch parallel MPI jobs, supports various MPI implementations.
- [Priority][slurm-priority]: assigns priorities to jobs upon submission and on an ongoing basis (e.g. as they age).
- [Preemption][slurm-preempt]: stop one or more low-priority jobs to let a high-priority job run.
- [QoS][slurm-qos]: sets of policies affecting scheduling priority, preemption, and resource limits.
- [Fairshare][slurm-fairshare]: distribute resources equitably among users and accounts based on historical usage.
- [Node Health Check][slurm-healthcheck]: periodically check node health via script.

Compatibility

| Software   | Minimum Version |
| :--------- | :-------------: |
| Kubernetes |      v1.29      |
| Slurm      |      25.11      |
| Cgroup     |       v2        |
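A preflight check against these minimums could look like the sketch below. The minimum versions come from the table; everything else (`parse_version`, `unmet_minimums`, and the idea of passing installed versions in as strings) is a hypothetical illustration, and the parsing deliberately looks only at the numeric components.

```python
import re
from typing import Dict, Tuple

# Minimum versions from the compatibility table above.
MINIMUMS: Dict[str, Tuple[int, ...]] = {
    "Kubernetes": (1, 29),
    "Slurm": (25, 11),
    "Cgroup": (2,),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the numeric components, e.g. 'v1.29.3' -> (1, 29, 3)."""
    return tuple(int(n) for n in re.findall(r"\d+", text))


def unmet_minimums(installed: Dict[str, str]) -> Dict[str, str]:
    """Return the components whose installed version is below the minimum."""
    failures: Dict[str, str] = {}
    for name, minimum in MINIMUMS.items():
        found = parse_version(installed.get(name, ""))
        # Tuple comparison is lexicographic: (1, 28, 9) < (1, 29).
        # A missing component parses to (), which is below every minimum.
        if found < minimum:
            failures[name] = installed.get(name, "missing")
    return failures
```

An empty result means every listed component meets its minimum; otherwise the returned mapping names each component that falls short together with the version that was reported for it.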

Quick Start

Install the [cert-manager] with its CRDs:

```sh
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --set 'crds.enabled=true'…
```

Excerpt shown — open the source for the full document.

Notability

notability 1.0/10

Routine fork, no community traction.