Release · Nebius · published May 26, 2026

nebius/soperator 4.0.0

Repository: nebius/soperator

Tag: 4.0.0

Published: 2026-05-26T14:09:47Z

Prerelease: no

Release notes: Changes made after version 3.0.4, up to version 4.0.0:

🚀 Features

  • SCHED-795, SCHED-797: TaskProlog should check for recursive CPU bindings
  • PR: #2316
  • SCHED-1024 SSSD for LDAP integration
  • PR: #2295
  • SCHED-1210 use cuda_force_upgrade for upgrading CUDA version and upgr…
  • PR: #2355
  • SCHED-1228 Create active check with force options to upgrade cuda, nccl-tests, etc
  • PR: #2359
  • SCHED-1231 Do not collect diff and logs from sensitive files like .bashrc
  • PR: #2361
  • SCHED-856: Implement PersistentPodState CRD to ensure pods always schedule on the same node
  • PR: #2362
  • SCHED-1347: Add customizable liveness and readiness probe templates for all soperator CRDs and cluster components via Helm values
  • PR: #2401
  • SCHED-1079 First iteration of e2e using Godog
  • PR: #2277
  • SCHED-1260: add the initial number of powered up nodes
  • PR: #2406
  • SCHED-1295 Cluster Creation Acceptance
  • PR: #2428
  • SCHED-1320 Make Internal SSH test up to date
  • PR: #2443
  • SCHED-1302 Add RUN_UNSTABLE_TESTS flag
  • PR: #2447
  • SCHED-1080 Use Nebius docker registry proxies for public images
  • PR: #2469
  • SCHED-1298 Enroot container test
  • PR: #2473
  • SCHED-1297 SCHED-1298 Docker container test
  • PR: #2448
  • SCHED-1669: Cherry-pick Slurm controller metrics to release-4.0
  • PR: #2521
  • Remove ClusterType (refactoring)
  • PR: #2547

🐛 Fixes

  • SCHED-1204: Revert task prolog feature (PR #2316)
  • PR: #2349
  • fix: handle unregistered node in scontrol update during worker init
  • PR: #2330
  • fix: avoid nil pointer to empty string for ProcMount in slurmd contai…
  • PR: #2329
  • SCHED-951: e2e jail upload should not fail on semicolons
  • PR: #2358
  • SCHED-1232 fix Ansible warning "ansible_facts["fact_name"]"
  • PR: #2363
  • SCHED-1056 run slurm-divert twice
  • PR: #2365
  • runAfterCreation: false for manage-jail-state-force
  • PR: #2369
  • SCHED-1272: hostUsers to activechecks
  • PR: #2364
  • fix custom envs in nodesets
  • PR: #2387
  • SCHED-1229: Automatically undrain nodes after pod_ephemeral_storage check
  • PR: #2399
  • SCHED-1206 Do not set-unhealthy to instances assigned after drain time
  • PR: #2397
  • SCHED-1402: Requeue when populate jail job exists but has not completed yet
  • PR: #2421
  • SCHED-1429 Fix activecheck_jobs_controller skipping unfinished jobs
  • PR: #2423
  • SCHED-1372 pin libcublas-dev-13-0 package version
  • PR: #2430
  • SCHED-1372 add force option for upgrading nccl-tests
  • PR: #2441
  • SCHED-1471: Allow initialNumberEphemeralNodes to be set to 0
  • PR: #2453
  • SCHED-1464: Gate otel-collector jail-logs on soperator-outputs creation
  • PR: #2455
  • SCHED-1471: helm/nodesets: render initialNumberEphemeralNodes when set to 0
  • PR: #2458
  • SCHED-1389 Bind-mount SSSD sockets if they exist to the jail
  • PR: #2461
  • SCHED-1498 upgrade mocks for libnvidia-compute and create mock for libnvidia-ml1 and libnvidia-ml.so.1
  • PR: #2505
  • SCHED-1654 Change activeDeadlineSeconds for manage-jail-state checks
  • PR: #2511
  • remove [node_problem] prefix for nvme health check
  • PR: #2529
  • remove [node_problem] prefix for nvme health check
  • PR: #2534
  • SCHED-1660 Bind-mount libdummy not only on login nodes but also on CPU-only workers in GPU clusters
  • PR: #2538
  • Fix populate_jail_entrypoint for NFS
  • PR: #2542
  • remove if clusterType statement in Slurm healthCheckConfig for cpu clusters
  • PR: #2543
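
The two SCHED-1471 fixes above concern rendering `initialNumberEphemeralNodes` when it is set to 0. The notes do not say how the chart was changed. The Go sketch below only illustrates the general pitfall in Go templates, which Helm builds on: an `if` guard treats the number 0 as empty, so a value of 0 is silently dropped. The template strings and the `render` function are hypothetical, and `hasKey` is a local stand-in for the Sprig helper of the same name rather than the chart's actual code.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// truthyGuard skips the field when the value is 0, because `if` treats 0 as empty.
const truthyGuard = `{{ if .initialNumberEphemeralNodes }}initialNumberEphemeralNodes: {{ .initialNumberEphemeralNodes }}{{ end }}`

// presenceGuard renders the field whenever the key exists, including for 0.
const presenceGuard = `{{ if hasKey . "initialNumberEphemeralNodes" }}initialNumberEphemeralNodes: {{ .initialNumberEphemeralNodes }}{{ end }}`

// render executes a template body against a values map and returns the output.
func render(body string, values map[string]any) string {
	funcs := template.FuncMap{
		// Local stand-in for Sprig's hasKey; reports whether the key is present.
		"hasKey": func(m map[string]any, key string) bool {
			_, ok := m[key]
			return ok
		},
	}
	tmpl := template.Must(template.New("nodeset").Funcs(funcs).Parse(body))
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, values); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	values := map[string]any{"initialNumberEphemeralNodes": 0}
	fmt.Printf("truthy guard:   %q\n", render(truthyGuard, values))
	fmt.Printf("presence guard: %q\n", render(presenceGuard, values))
}
```

With the value set to 0, the truthy guard renders an empty string while the presence guard renders `initialNumberEphemeralNodes: 0`. An absent key renders nothing in both cases.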

📦 Dependencies

  • Bump dorny/paths-filter from 3 to 4
  • PR: #2320
  • Bump softprops/action-gh-release from 2.5.0 to 2.6.1
  • PR: #2327
  • Bump actions/upload-artifact from 6 to 7
  • PR: #2303
  • Bump google.golang.org/grpc from 1.72.1 to 1.79.3 in the go_modules group across 1 directory
  • PR: #2335
  • Bump cryptography from 46.0.5 to 46.0.6 in /ansible in the pip group across 1 directory
  • PR: #2367

📔 Docs

  • Add Readme for SSSD integration
  • PR: #2345

Other

  • Merge to soperator release 4.0 from/pr 2429/sched 1347/1
  • PR: #2436
  • Grafana dashboards: new panels and bug fixes
  • PR: #2418
  • New dashboard: GPU stats
  • PR: #2477

Contributors: @theyoprst, @github-actions[bot], @dependabot[bot], @faucct, @asteny, @ivaravko, @Uburro, @rdjjke, @itechdima, @ChessProfessor, @ali-sattari

| 📁 Categorized PRs | 📂 Uncategorized PRs | 📥 Commits | ➕ Lines added | ➖ Lines deleted |
| :---: | :---: | :---: | :---: | :---: |
| 4059 | 178 | 310 | 29065 | 6470 |

Notability

notability 5.0/10

Major version release for AI infra tool, but lacks community traction indicators.