GPU Job Scheduling Using an Idle Inference GPU Pool
source ↗GPU Job Scheduling Using an Idle Inference GPU Pool | by LG AI Research | Jun, 2026 | Medium
LG AI Research
8 min read
1 hour ago
With the advancement of AI technology, demand for GPU resources has been increasing rapidly. In particular, since training and inference for large language models (LLMs) require massive computational resources, securing stable GPU infrastructure has become a key competitive advantage for AI research and service operations. However, GPUs remain a costly resource, and there are practical limitations to scaling infrastructure within a limited budget. As a result, situations frequently arise where the demand for GPUs in research and service environments cannot be fully met.
Meanwhile, companies that provide AI services tend to allocate resources based on peak times in order to prepare for fluctuations in traffic. In such cases, during periods of low traffic, the allocated resources remain in an “idle state,” consuming GPU memory while remaining underutilized.
The project “GPU Job Scheduling using an idle inference GPU pool” began with this idea. It is an attempt to maximize infrastructure utilization by allocating idle GPU resources to model training or research workloads, while ensuring that service stability is not compromised. In this article, we share how we defined and addressed the problem in order to maximize the efficiency of GPU resource operations.
1. Auto-scaling of LLM services
In typical services, it is relatively easy to predict resource usage based on metrics such as the number of requests, CPU utilization, and memory usage, and to configure auto-scaling policies accordingly. Since these metrics generally increase in proportion to service load, threshold-based scaling strategies tend to operate reliably.
However, it is difficult to apply the same approach to LLM-based services, because the GPU resources consumed by each request vary with the number of input tokens, the number of generated tokens, and the size and architecture of the model. In other words, resource demand for LLM services cannot be predicted accurately from simple metrics alone.
Metrics such as GPU utilization and memory usage also have limitations in an LLM environment. Due to the nature of LLM inference, a single request can consume most of the available GPU resources, causing utilization to spike momentarily. Such a spike does not necessarily indicate high overall system load. Similarly, because a significant portion of memory is already allocated during the model loading phase, memory usage fluctuates only minimally as traffic changes and therefore does not reflect actual load variations.
Due to these characteristics, traditional CPU/memory-based auto-scaling or simple request-count-based approaches do not work effectively in LLM services, and new metrics are needed to better reflect the actual computational load.
To address this issue, we utilized the internal metrics provided by vLLM as auto-scaling signals. Since vLLM is designed for LLM serving, its metrics, such as real-time throughput and queue status, better capture the runtime characteristics of LLM workloads. By using these metrics, we were able to implement a more sophisticated auto-scaling strategy optimized for LLM services.
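As an illustration of how queue-based signals could drive scaling, the sketch below reads two gauges from a vLLM Prometheus `/metrics` payload and converts them into a replica count. The article does not say which vLLM metrics or which formula were actually used; the choice of `vllm:num_requests_running` and `vllm:num_requests_waiting`, the `target_per_replica` parameter, and the proportional rule are all assumptions made for this example.

```python
import math


def parse_gauges(metrics_text, names):
    """Sum the values of the named gauges in Prometheus text-format output.

    Assumes each sample line is 'name{labels} value' with no trailing
    timestamp, and sums across all label sets (for example, all models).
    """
    totals = {name: 0.0 for name in names}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metric = series.split("{", 1)[0]
        if metric in totals:
            totals[metric] += float(value)
    return totals


def desired_replicas(running, waiting, target_per_replica,
                     min_replicas, max_replicas):
    """Proportional rule: enough replicas to hold all in-flight requests
    at the (hypothetical) target concurrency, clamped to the allowed range."""
    in_flight = running + waiting
    wanted = math.ceil(in_flight / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

In practice the same signals would more likely be fed to an existing autoscaler than evaluated by hand; the function only makes the relationship between queue depth and replica count explicit.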
2. Auto-scaling analysis
Image 1. Replica Auto-scaling graph (24h)
A time-series analysis of the service replica count after auto-scaling was applied revealed distinct GPU resource usage patterns across time periods. During daytime hours (8:00 AM to 8:00 PM), the replica count increased rapidly as traffic rose, reached a peak, and then gradually decreased as traffic declined. In contrast, during the off-peak nighttime hours (8:00 PM to 8:00 AM the following day), the system stayed at the minimum replica count with minimal fluctuations, leaving a large portion of GPU resources unused for extended periods.
For example, in an environment where a single replica uses four GPUs, an average of 52 GPUs remain idle for approximately 12 hours per day. This buffer of spare resources is necessary to ensure service stability, but it leaves room for improvement in GPU utilization efficiency. By leveraging these idle resources for research and experimental tasks, we can significantly increase overall GPU utilization without expanding the infrastructure.
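To put the figures above in perspective, the following small calculation converts them into idle GPU-hours per day and into the equivalent number of four-GPU replicas. It uses only the numbers quoted in the text (an average of 52 idle GPUs, roughly 12 hours, four GPUs per replica); the function names are made up for this example, and the results are approximate because the inputs are averages.

```python
def idle_gpu_hours(idle_gpus, idle_hours_per_day):
    """Idle capacity per day, in GPU-hours."""
    return idle_gpus * idle_hours_per_day


def replica_equivalents(idle_gpus, gpus_per_replica):
    """How many whole replicas the idle GPUs correspond to."""
    return idle_gpus // gpus_per_replica


print(idle_gpu_hours(52, 12))      # 624 GPU-hours per day
print(replica_equivalents(52, 4))  # 13 replicas' worth of GPUs
```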
Image 2. GPU Replica Pool Status Example over Time
Based on this analysis, we built a pipeline to run GPU workloads during nighttime hours when service traffic is low. This pipeline is designed to avoid compromising the availability of existing service resources. It operates on a best-effort basis, meaning that running workloads can be interrupted at any time if GPU resources need to be reclaimed due to increased service traffic.
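Because a best-effort workload can be interrupted at any time, it helps if the job itself can stop cleanly and resume later. The sketch below shows one common pattern: the job checks a stop flag between steps, saves its progress atomically, and picks up from the saved step on the next run. The article does not describe how reclamation is signaled to running jobs; handling `SIGTERM` (the signal Kubernetes sends to a pod's containers on termination) is an assumption here, and `StopFlag`, `run_job`, and the JSON checkpoint format are hypothetical names chosen for illustration.

```python
import json
import os
import signal


class StopFlag:
    """Set by a signal handler; polled by the job between steps."""

    def __init__(self):
        self.requested = False

    def request(self, signum=None, frame=None):
        self.requested = True


def load_step(path):
    """Return the next step to run, or 0 if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["step"]
    return 0


def save_step(path, step):
    """Write the checkpoint to a temp file, then rename it atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def run_job(path, total_steps, do_step, flag):
    """Run steps until done or interrupted. Returns True when complete."""
    step = load_step(path)
    while step < total_steps:
        if flag.requested:
            save_step(path, step)
            return False  # interrupted; a later run resumes from `step`
        do_step(step)
        step += 1
        save_step(path, step)
    return True


def install_sigterm_handler(flag):
    """Call once from the main thread before starting the job."""
    signal.signal(signal.SIGTERM, flag.request)
```

A real training job would save model and optimizer state rather than a step counter, and would likely checkpoint every N steps instead of every step, but the control flow is the same.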
3. Designing a GPU Job Pipeline
Image 3. GPU Job Pipeline Based on Argo Workflows
To utilize idle GPU resources, a pipeline capable of executing a wide variety of tasks is required. Research and experimental workloads call for diverse execution methods, and the same task is often rerun under different configurations; therefore, the pipeline must support versatility, flexibility, and reproducibility at the same time.
- Versatility:...