NVIDIA/k8s-test-infra v0.2.0
NVIDIA/k8s-test-infra
Captured source
source ↗published Jun 12, 2026seen 8hcaptured 8hhttp 200method plain
v0.2.0
Repository: NVIDIA/k8s-test-infra
Tag: v0.2.0
Published: 2026-06-12T08:05:10Z
Prerelease: no
Release notes:
Highlights
- Mock InfiniBand subsystem: real
ibstat,ibstatus,iblinkinfo,ibv_devinfo, and cross-nodeibpingnow work inside the nvml-mock DaemonSet without InfiniBand hardware.LD_PRELOADshims (libibmocksys.so,libibmockumad.so,libibmockverbs.so) redirect sysfs/umad/verbs access to a fake tree rendered per profile, and the in-podmock-ibdaemon relays MAD traffic between pods over the Kubernetes network. The daemon and Service are only created for profiles with InfiniBand enabled. (#367) - New GPU profile `gb300`: NVIDIA GB300 NVL (Grace-Blackwell Ultra) — 8 GPUs/node, 288 GiB HBM3e per GPU, 1.4 kW TDP, PCIe Gen6, NVLink v5, driver line 570.124.06, with an NVL72-shaped PCIe topology out of the box.
- PCIe sysfs topology: profiles carry a
pcie_topology:block and the newrender-pci-sysfsbinary materializes a fake/sys/bus/pci/devicestree, so topology-aware consumers (NVIDIA DRA driverdra.k8s.io/pcieRoot, device-plugin NUMA hints) resolve PCIe root complexes. (#263) - NVML library-size padding:
libnvidia-ml.sois padded to land within ~10% of the real driver library size, so size-sanity tooling accepts the mock; configurable or fully disableable. (#247) - Canonical sysfs `bus_id` form in profiles (
0000:07:00.0), aligning the mock with real Linux PCI sysfs. (#263) - Test suite standardized on `testify/require`; soft
t.Errorfchecks upgraded to hard failures. (#386)
Fixes
docs/demo/standalone/demo.shruns on macOS stock bash 3.2 (no moremapfile). (#385)- Helm chart OCI publishing signs correctly again: cosign now authenticates to GHCR and signs the chart by digest. (#388)
Container Image
docker pull ghcr.io/nvidia/nvml-mock:0.2.0
Helm Chart
helm install nvml-mock oci://ghcr.io/nvidia/k8s-test-infra/chart/nvml-mock --version 0.2.0
Image and chart are cosign-signed (keyless). Full details in CHANGELOG.md.
Full diff: https://github.com/NVIDIA/k8s-test-infra/compare/v0.1.0...v0.2.0
Addendum (2026-06-12)
The following v0.2.0 features were missing from the original notes (the changelog has been amended to match):
- Dynamic per-query metric sampling: utilization, temperature, power, and clocks vary plausibly across calls. (#323)
- GPU failure injection:
ecc_uncorrectable,lost, andfallen_off_busmodes with Xid 79 propagation. (#328) - ComputeDomain / NVLink fabric simulation:
nvmlDeviceGetGpuFabricInfo(+V) driven by a cluster topology ConfigMap, with fakenvidia-imex/nvidia-imex-ctlpeer-readiness coordination. (#337, #342) - Toolkit-ready marker file for GPU Operator validator compatibility. (#346)
Notability
notability 2.0/10Routine release of a Kubernetes test infrastructure tool.