nebius/slurm-exporter
License: Apache-2.0
Stars: 0
Forks: 0
Open issues: 0
Created: 2025-08-13T12:58:11Z
Pushed: 2025-09-15T10:17:02Z
Default branch: main
Fork: no
Archived: no
README:
SLURM Exporter
Overview
The SLURM Exporter is a component of Soperator that collects metrics from SLURM clusters and exports them in Prometheus format. It provides comprehensive monitoring capabilities for SLURM cluster health, job status, node states, and controller performance metrics.
The exporter integrates seamlessly with the Prometheus monitoring stack and enables observability for SLURM workloads running on Kubernetes through Soperator.
Key Features
- Asynchronous metrics collection with configurable intervals (default: 30s)
- Real-time monitoring of SLURM nodes, jobs, and controller performance
- Prometheus-native metrics with standardized naming conventions
- Rich labeling for detailed filtering and aggregation
- Controller RPC diagnostics similar to SLURM's sdiag command
- Kubernetes-native deployment as part of Soperator
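The README does not say how asynchronous collection is implemented. As a rough illustration only, one common pattern is a background loop that refreshes a cached snapshot at a fixed interval, so that Prometheus scrapes read the cache instead of calling SLURM directly. The class name `CachedCollector`, the `fetch` callable, and the snapshot shape below are all hypothetical and are not taken from the exporter's code.

```python
import threading


class CachedCollector:
    """Hypothetical sketch: refresh a snapshot periodically, serve reads from cache."""

    def __init__(self, fetch, interval_seconds=30.0):
        self._fetch = fetch                # callable returning a dict of collected data
        self._interval = interval_seconds  # 30s mirrors the documented default interval
        self._lock = threading.Lock()
        self._snapshot = {}
        self._stop = threading.Event()

    def collect_once(self):
        """Run one collection and atomically swap in the new snapshot."""
        data = self._fetch()
        with self._lock:
            self._snapshot = data

    def snapshot(self):
        """Return a copy of the latest snapshot; never blocks on a collection."""
        with self._lock:
            return dict(self._snapshot)

    def run(self):
        """Collect until stopped, keeping the previous snapshot if a fetch fails."""
        while not self._stop.is_set():
            try:
                self.collect_once()
            except Exception:
                pass  # assumption: a failed collection leaves stale data in place
            self._stop.wait(self._interval)

    def start(self):
        thread = threading.Thread(target=self.run, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

With this shape, a scrape handler would only call `snapshot()`, so scrape latency stays independent of how long a SLURM API call takes.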
Configuration
The SLURM Exporter can be configured using either command-line flags or environment variables. Environment variables take precedence over defaults but are overridden by explicitly provided command-line flags.
Configuration Priority
The configuration follows this priority order:
1. Command-line flags (highest priority) - Explicitly provided flags override all other settings
2. Environment variables - Used when flags are not provided
3. Default values (lowest priority) - Used when neither flags nor environment variables are set
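The priority order above can be sketched in a few lines. This is a minimal illustration and not the exporter's own code: the `resolve` function is hypothetical, only two of the documented options are included, and treating an empty environment variable as "set" is an assumption.

```python
import argparse

# Defaults for two of the options documented in this README.
DEFAULTS = {
    "collection_interval": "30s",
    "slurm_api_server": "http://localhost:6820",
}


def resolve(argv, environ):
    """Resolve each option as: flag > environment variable > default."""
    parser = argparse.ArgumentParser()
    for name in DEFAULTS:
        # default=None distinguishes "flag not provided" from "flag provided".
        parser.add_argument("--" + name.replace("_", "-"), default=None)
    flags = vars(parser.parse_args(argv))

    resolved = {}
    for name, default in DEFAULTS.items():
        env_value = environ.get("SLURM_EXPORTER_" + name.upper())
        if flags[name] is not None:
            resolved[name] = flags[name]   # 1. explicit flag wins
        elif env_value is not None:
            resolved[name] = env_value     # 2. environment variable
        else:
            resolved[name] = default       # 3. built-in default
    return resolved
```

For example, with `SLURM_EXPORTER_COLLECTION_INTERVAL=60s` in the environment, `resolve([], env)` yields `60s` for the interval, while `resolve(["--collection-interval", "10s"], env)` yields `10s`.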
Configuration Options
All configuration options can be set via command-line flags or environment variables:
| Environment Variable | Flag | Description | Default |
|---------------------|------|-------------|---------|
| SLURM_EXPORTER_CLUSTER_NAME | --cluster-name | The name of the SLURM cluster (required) | *none* |
| SLURM_EXPORTER_CLUSTER_NAMESPACE | --cluster-namespace | The namespace of the SLURM cluster | soperator |
| SLURM_EXPORTER_SLURM_API_SERVER | --slurm-api-server | The address of the SLURM REST API server | http://localhost:6820 |
| SLURM_EXPORTER_COLLECTION_INTERVAL | --collection-interval | How often to collect metrics from SLURM APIs | 30s |
| SLURM_EXPORTER_METRICS_BIND_ADDRESS | --metrics-bind-address | Address for the main metrics endpoint | :8080 |
| SLURM_EXPORTER_MONITORING_BIND_ADDRESS | --monitoring-bind-address | Address for the self-monitoring metrics endpoint | :8081 |
| SLURM_EXPORTER_LOG_FORMAT | --log-format | Log format: plain or json | json |
| SLURM_EXPORTER_LOG_LEVEL | --log-level | Log level: debug, info, warn, error | debug |
Exported Metrics
Core Metrics (Node and Job)
| Metric Name & Type | Description & Labels |
|-------------------|---------------------|
| slurm_node_info *Gauge* | Provides detailed information about SLURM nodes<br>Labels:<br>• node_name - Name of the SLURM node<br>• instance_id - Kubernetes instance identifier<br>• state_base - Base node state (IDLE, ALLOCATED, DOWN, ERROR, MIXED, UNKNOWN)<br>• state_is_drain - Whether node is in drain state ("true"/"false")<br>• state_is_maintenance - Whether node is in maintenance state ("true"/"false")<br>• state_is_reserved - Whether node is in reserved state ("true"/"false")<br>• address - IP address of the node |
| slurm_node_gpu_seconds_total *Counter* | Total GPU seconds accumulated on SLURM nodes<br>Labels:<br>• node_name - Name of the SLURM node<br>• state_base - Base node state<br>• state_is_drain - Drain state flag<br>• state_is_maintenance - Maintenance state flag<br>• state_is_reserved - Reserved state flag |
| slurm_node_fails_total *Counter* | Total number of node state transitions to failed states (DOWN/DRAIN)<br>Labels:<br>• node_name - Name of the SLURM node<br>• state_base - Base node state at time of failure<br>• state_is_drain - Drain state flag<br>• state_is_maintenance - Maintenance state flag<br>• state_is_reserved - Reserved state flag<br>• reason - Reason for the node failure |
| slurm_job_info *Gauge* | Detailed information about SLURM jobs<br>Labels:<br>• job_id - SLURM job identifier<br>• job_state - Current job state (PENDING, RUNNING, COMPLETED, FAILED, etc.)<br>• job_state_reason - Reason for current job state<br>• slurm_partition - SLURM partition name<br>• job_name - User-defined job name<br>• user_name - Username who submitted the job<br>• user_id - Numeric user ID who submitted the job<br>• standard_error - Path to stderr file<br>• standard_output - Path to stdout file<br>• array_job_id - Array job ID (if applicable)<br>• array_task_id - Array task ID (if applicable)<br>• submit_time - When the job was submitted (Unix timestamp seconds, empty if not available or zero)<br>• start_time - When the job started execution (Unix timestamp seconds, empty if not available or zero)<br>• end_time - When the job completed (Unix timestamp seconds, empty if not available or zero). Warning: For non-terminal states like RUNNING, this may contain a future timestamp representing the forecasted end time based on the job's time limit<br>• finished_time - When the job actually finished for terminal states only (Unix timestamp seconds, empty for non-terminal states or if end_time is zero). Unlike end_time, this field only contains actual completion times, never forecasted values |
| slurm_node_job *Gauge* | Mapping between jobs and the nodes they're running on<br>Labels:<br>• job_id - SLURM job identifier<br>• node_name - Name of the node running the job |
| slurm_job_duration_seconds *Gauge* | Job duration in seconds. For running jobs, this is the time elapsed since the job started. For completed jobs, this is the total execution time.<br>Labels:<br>• job_id - SLURM job identifier<br>Notes:<br>• Only exported for jobs with a valid start time<br>• For non-terminal states (RUNNING, etc.): duration = current_time - start_time<br>• For terminal states (COMPLETED, FAILED, etc.): duration = end_time - start_time (only if end_time is valid) |
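The notes for slurm_job_duration_seconds and the finished_time label describe a small piece of branching logic, which the following sketch restates in code. It is an illustration under stated assumptions, not the exporter's implementation: the function names are hypothetical, a "valid" timestamp is assumed to mean greater than zero, and the terminal-state set is assumed because the README only names COMPLETED and FAILED followed by "etc.".

```python
# Assumed terminal states; the README names only COMPLETED and FAILED explicitly.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


def job_duration_seconds(state, start_time, end_time, now):
    """Return the duration in seconds, or None when no value would be exported.

    All times are Unix timestamps in seconds; zero is treated as "not set".
    """
    if start_time <= 0:
        return None  # only jobs with a valid start time get a duration
    if state in TERMINAL_STATES:
        if end_time <= 0:
            return None  # terminal job without a valid end time
        return end_time - start_time
    # Non-terminal: end_time may be a forecast, so measure against the clock.
    return now - start_time


def finished_time_label(state, end_time):
    """Return the finished_time label: set only for terminal states."""
    if state in TERMINAL_STATES and end_time > 0:
        return str(end_time)
    return ""  # never expose a forecasted end time as a finish time
```

For a RUNNING job started at 1000 with a forecasted end_time of 5000, calling `job_duration_seconds("RUNNING", 1000, 5000, 1600)` gives 600 and `finished_time_label("RUNNING", 5000)` gives an empty string, which matches the warning about forecasted end times.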
Controller RPC Metrics
These metrics provide insights into SLURM controller performance, similar to the output of the sdiag command, and were implemented to address issue #1027.
| Metric Name & Type | Description & Labels |
|-------------------|---------------------|
|...
Excerpt shown — open the source for the full document.
Notability
notability 3.0/10: Routine new repo, tool for SLURM, no traction