10 Key Enhancements in Kubernetes v1.36: Revolutionizing Workload-Aware Scheduling
Kubernetes v1.36 marks a major step forward in how the scheduler handles complex AI/ML and batch workloads. Building on the groundwork laid in v1.35, this release cleanly splits static workload definitions from dynamic runtime state, paving the way for more efficient and scalable scheduling. From a new PodGroup API to topology-aware scheduling and workload-aware preemption, v1.36 delivers a set of enhancements aimed at real-world challenges. Here are the ten things you need to know about this update.
1. Clean Separation of Workload and PodGroup APIs
The most significant change in v1.36 is the decoupling of the Workload and PodGroup APIs. In v1.35, both the pod group definition and its runtime state were bundled into a single Workload resource. Now the Workload acts solely as a static template that defines what a pod group should look like, while the PodGroup carries the runtime status. This split makes the system more modular and lets each API evolve independently. For developers, it means clearer boundaries and less complexity when defining batch jobs. The new APIs are served under scheduling.k8s.io/v1alpha2, an alpha API version that fully replaces the older v1alpha1. The separation also improves performance by enabling per-replica sharding of status updates.
2. Workload API as a Static Template
In v1.36, the Workload API no longer carries dynamic state. Instead, it functions as a pure template that controllers such as the Job controller use to stamp out PodGroup objects. The Workload spec includes podGroupTemplates: each template defines a named pod group along with its scheduling policy, such as gang scheduling requirements. For example, a training job might declare a template called "workers" with a minCount of 4, meaning at least four of its pods must be schedulable together before any of them is placed. Because the template is static, the scheduler does not need to watch the Workload object at all, which makes the scheduling path leaner and more predictable.
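To make the template/runtime split concrete, here is a minimal Python sketch of a controller stamping PodGroups out of a Workload's podGroupTemplates. The class and field names (PodGroupTemplate, min_count, template_ref, stamp_pod_groups) are hypothetical simplifications, not the real scheduling.k8s.io/v1alpha2 schema, and the replicas parameter assumes a controller may create several PodGroups from one template.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class PodGroupTemplate:
    """One named entry in a Workload's podGroupTemplates (simplified)."""
    name: str
    min_count: int  # gang policy: pods that must be schedulable together


@dataclass(frozen=True)
class Workload:
    """Static template: describes pod groups but holds no runtime state."""
    name: str
    pod_group_templates: tuple[PodGroupTemplate, ...]


@dataclass
class PodGroup:
    """Runtime object created by a controller from one template."""
    name: str
    template_ref: str  # back-reference to the originating template
    min_count: int  # scheduling policy copied from the template
    conditions: dict[str, str] = field(default_factory=dict)  # filled at runtime


def stamp_pod_groups(workload: Workload, replicas: int = 1) -> list[PodGroup]:
    """Create `replicas` PodGroups per template; the Workload is only read."""
    groups = []
    for tmpl in workload.pod_group_templates:
        for i in range(replicas):
            groups.append(
                PodGroup(
                    name=f"{workload.name}-{tmpl.name}-{i}",
                    template_ref=f"{workload.name}/{tmpl.name}",
                    min_count=tmpl.min_count,
                )
            )
    return groups
```

Stamping the "workers" example with replicas=2 yields two PodGroups, train-workers-0 and train-workers-1, each carrying min_count=4 and a reference back to train/workers, while the Workload itself is never modified. The frozen dataclasses mirror the idea that the template is read-only once written.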
3. PodGroup API Manages Runtime State
Complementing the static Workload is the new PodGroup API, which tracks the real-time status of a group of pods. Controllers instantiate PodGroup objects from the templates in the Workload, and each PodGroup holds the active scheduling policy plus a reference back to its originating template. Crucially, the PodGroup status contains conditions that reflect the states of its individual pods (running, pending, backoff, and so on), giving a single, consolidated view of the group's scheduling health. The scheduler can therefore read everything it needs directly from the PodGroup without parsing the Workload, which improves scalability, especially in large clusters running many batch jobs concurrently.
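The following sketch shows how per-pod states might be rolled up into one group-level view. The state strings, the PodGroupStatus shape, and the rule that the gang counts as satisfied once at least min_count pods are Running are all assumptions made for illustration; the real PodGroup status conditions may be defined differently.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PodGroupStatus:
    """Consolidated, group-level view derived from individual pod states."""
    counts: dict[str, int]  # pod state -> number of pods in that state
    gang_satisfied: bool  # True once at least min_count pods are Running


def summarize(pod_states: list[str], min_count: int) -> PodGroupStatus:
    """Roll per-pod states (e.g. "Running", "Pending", "Backoff") into one status."""
    counts = Counter(pod_states)
    return PodGroupStatus(
        counts=dict(counts),
        # Counter returns 0 for states that never occurred
        gang_satisfied=counts["Running"] >= min_count,
    )
```

For a four-pod gang with two pods running, one pending, and one in backoff, summarize reports those counts and gang_satisfied=False, which is the single answer a scheduler would want instead of inspecting each pod.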
4. New PodGroup Scheduling Cycle in kube-scheduler
To support the new API architecture, Kubernetes v1.36 introduces a dedicated PodGroup scheduling cycle in kube-scheduler. This cycle treats all pods belonging to the same PodGroup as a single atomic unit. Instead of making per-pod decisions one after another, the scheduler evaluates the entire group at once and checks resource availability for the whole set. This atomic processing is essential for gang scheduling scenarios such as distributed training, where no pod should start unless the whole gang can be placed. The new cycle also lays the groundwork for more advanced scheduling policies in future releases.
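A minimal sketch of the all-or-nothing idea: the group is placed against a scratch copy of node capacity, and the cluster state is updated only if at least min_count pods fit. Capacity is reduced to a single integer per node, and first-fit decreasing is a stand-in heuristic chosen for brevity, not the kube-scheduler's actual filtering and scoring pipeline; schedule_group is a hypothetical name.

```python
from __future__ import annotations

from typing import Optional


def schedule_group(
    free: dict[str, int], requests: list[int], min_count: int
) -> Optional[dict[int, str]]:
    """Place one pod group as a unit.

    `free` maps node name -> free capacity (a single resource for brevity),
    `requests` holds each pod's demand. Placement runs against a scratch
    copy; `free` is updated only if at least `min_count` pods fit.
    Returns pod index -> node name, or None if the group is rejected.
    """
    scratch = dict(free)
    placement: dict[int, str] = {}
    # First-fit decreasing: larger pods first so they are not crowded out.
    order = sorted(range(len(requests)), key=lambda i: -requests[i])
    for idx in order:
        for node in sorted(scratch):  # deterministic node order
            if scratch[node] >= requests[idx]:
                scratch[node] -= requests[idx]
                placement[idx] = node
                break
    if len(placement) < min_count:
        return None  # reject the whole group; nothing was committed
    free.update(scratch)  # commit all reservations at once
    return placement
```

With two nodes offering 4 and 2 free units and two pods requesting 3 each, a per-pod scheduler would place the first pod and leave it stranded; schedule_group instead returns None and leaves the capacity map untouched.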
5. First Iterations of Topology-Aware Scheduling
Kubernetes v1.36 debuts the initial version of topology-aware scheduling for pod groups. This feature lets the scheduler consider the physical or logical topology of the cluster, such as nodes, zones, or racks, when placing pods from the same group. For AI/ML workloads that need low-latency inter-pod communication (for example over GPUDirect or other high-speed interconnects), topology awareness helps keep a group's pods close together. This first iteration supports only basic topology constraints, but it provides a foundation for more sophisticated policies in upcoming releases.
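One simple way to express "place the group close together" is to require that the whole group fit inside a single topology domain and, among the domains that can host it, prefer the tightest fit so larger domains stay free for bigger groups. This sketch assumes each node is already tagged with its domain (for example a zone from the well-known topology.kubernetes.io/zone label, or a rack label your cluster defines); the best-fit policy and the pick_domain and fits helpers are hypothetical, not the v1.36 algorithm.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Optional


def fits(capacities: list[int], requests: list[int]) -> bool:
    """First-fit decreasing: can every request be packed onto these nodes?"""
    free = sorted(capacities, reverse=True)
    for req in sorted(requests, reverse=True):
        for i, cap in enumerate(free):
            if cap >= req:
                free[i] -= req
                break
        else:  # no node had room for this pod
            return False
    return True


def pick_domain(
    nodes: dict[str, tuple[str, int]], requests: list[int]
) -> Optional[str]:
    """Return the topology domain that can host the whole group most tightly.

    `nodes` maps node name -> (domain, free capacity). Domains that cannot
    host every pod are skipped rather than splitting the group across them.
    """
    by_domain: dict[str, list[int]] = defaultdict(list)
    for domain, free in nodes.values():
        by_domain[domain].append(free)
    candidates = [
        (sum(caps) - sum(requests), domain)  # leftover capacity, then name
        for domain, caps in by_domain.items()
        if fits(caps, requests)
    ]
    return min(candidates)[1] if candidates else None
```

For four pods requesting 4 units each, a rack of two 8-unit nodes is preferred over a rack of two 16-unit nodes because it leaves no capacity stranded, and a rack with only 8 units in total is skipped even though it could host some of the pods.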
6. Workload-Aware Preemption Introduced
Another new capability in v1.36 is workload-aware preemption. Previously, the scheduler's preemption logic evicted lower-priority pods to make room for higher-priority ones without considering which group those pods belonged to. Now, when a PodGroup needs resources, the scheduler can choose victims in a way that minimizes disruption to other pod groups. For example, it avoids preempting pods that belong to a partially scheduled gang, because doing so would delay that entire group. This group-aware approach improves overall resource utilization and reduces the “thrashing” that naive per-pod preemption can cause.
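The sketch below illustrates the kind of group-aware victim selection described above. It assumes, as a simplification, that victims are evicted as whole pod groups (breaking one member of a running gang usually stalls the rest anyway), that gangs still being assembled are never chosen, and that lower-priority groups go first. RunningGroup and choose_victims are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunningGroup:
    name: str
    priority: int
    usage: int  # capacity freed if the whole group is evicted
    partially_scheduled: bool  # gang still waiting for some of its pods


def choose_victims(
    groups: list[RunningGroup], preemptor_priority: int, needed: int
) -> Optional[list[str]]:
    """Pick whole pod groups to evict, or None if eviction cannot free enough."""
    candidates = sorted(
        (
            g
            for g in groups
            # never touch higher/equal priority or half-assembled gangs
            if g.priority < preemptor_priority and not g.partially_scheduled
        ),
        # lowest priority first; within a priority, larger groups first
        key=lambda g: (g.priority, -g.usage, g.name),
    )
    victims, freed = [], 0
    for g in candidates:
        if freed >= needed:
            break
        victims.append(g.name)
        freed += g.usage
    return victims if freed >= needed else None
```

Evicting larger groups first within a priority level keeps the number of disrupted workloads small, and returning None when even every eligible victim would not free enough capacity avoids evicting anything for nothing.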
7. ResourceClaim Support Brings Dynamic Resource Allocation (DRA) to PodGroups
PodGroups in v1.36 can now use ResourceClaims, extending Dynamic Resource Allocation (DRA) to pod groups. DRA lets pods request specialized hardware, such as GPUs, FPGAs, or high-performance network interfaces, without binding to specific devices at submission time. With this support, a ResourceClaim template can be defined in the Workload or PodGroup, and the scheduler allocates the resources dynamically as the group's pods are placed. For AI/ML workloads that depend on accelerators, this takes device selection out of the job definition and makes resource requests more flexible.
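As a rough model of group-level claim allocation, the sketch gives each placed pod its own claim stamped from a shared template and allocates concrete device IDs from the chosen node's inventory. It treats the group atomically: if any pod's claim cannot be satisfied, nothing is allocated. The inventory layout, the ClaimTemplate type, allocate_group_claims, and the device class name gpu.example.com are hypothetical; real DRA drivers publish devices through ResourceSlice objects, and the scheduler's allocator is considerably more involved.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClaimTemplate:
    """Per-pod claim stamped from a group-level template (simplified)."""
    device_class: str  # hypothetical DeviceClass name, e.g. "gpu.example.com"
    count: int  # devices each pod's claim requests


def allocate_group_claims(
    placement: dict[str, str],
    inventory: dict[str, dict[str, list[str]]],
    template: ClaimTemplate,
) -> Optional[dict[str, list[str]]]:
    """Allocate one claim per placed pod, all or nothing.

    `placement` maps pod -> node, `inventory` maps node -> device class ->
    free device IDs. On success the allocated IDs are removed from
    `inventory`; on failure `inventory` is left untouched.
    """
    scratch = {
        node: {cls: list(ids) for cls, ids in classes.items()}
        for node, classes in inventory.items()
    }
    allocations: dict[str, list[str]] = {}
    for pod, node in sorted(placement.items()):
        free = scratch.get(node, {}).get(template.device_class, [])
        if len(free) < template.count:
            return None  # one unsatisfiable claim rejects the whole group
        allocations[pod] = free[: template.count]
        del free[: template.count]  # mutates the scratch inventory in place
    inventory.update(scratch)  # commit
    return allocations
```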
8. First Phase of Job Controller Integration
To show how the new APIs fit into everyday batch workloads, v1.36 includes the first phase of integration between the Kubernetes Job controller and the Workload/PodGroup APIs. The Job controller can now create Workload objects and manage PodGroups natively, rather than relying on custom scripts or third-party controllers. This lets batch jobs benefit from gang scheduling, topology awareness, and dynamic resource allocation without extra tooling. Users can define a Job with a standard spec that references a Workload template, and the controller handles the rest. This is a major step toward making workload-aware scheduling a first-class citizen in Kubernetes, even though the underlying APIs are still alpha.
9. Performance and Scalability Gains from Per-Replica Sharding
Moving runtime state into the PodGroup API also brings significant performance improvements. Status updates are sharded per replica: each replica can update its own portion of the status independently instead of contending with every other replica for a single shared object. This removes a bottleneck in large clusters where hundreds or thousands of PodGroups may be active at the same time. Additionally, the scheduler no longer needs to watch Workload objects, which cuts API server load. Early benchmarks indicate that this architecture reduces scheduling latency for large batch workloads by up to 30% in some scenarios.
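Why sharding helps can be shown with a toy model of optimistic concurrency. Kubernetes updates are guarded by resourceVersion, and a write based on a stale version is rejected with a 409 Conflict. In this sketch, every replica reads and then writes in each round, either to one shared status object or to one object per replica. StatusObject and count_conflicts are hypothetical, and real clients would retry the rejected writes, adding even more load.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StatusObject:
    """Stand-in for an API object guarded by a resourceVersion-like counter."""
    version: int = 0
    status: dict[str, str] = field(default_factory=dict)

    def update(self, read_version: int, key: str, value: str) -> bool:
        """Apply a write only if nobody else wrote since `read_version`."""
        if read_version != self.version:
            return False  # conflict, analogous to an HTTP 409 from the API server
        self.status[key] = value
        self.version += 1
        return True


def count_conflicts(replicas: int, rounds: int, sharded: bool) -> int:
    """Count rejected writes when every replica reports status each round."""
    shared = StatusObject()
    shards = [StatusObject() for _ in range(replicas)]

    def target(i: int) -> StatusObject:
        return shards[i] if sharded else shared

    conflicts = 0
    for r in range(rounds):
        # Every replica reads first, then writes: the worst case for contention.
        reads = [target(i).version for i in range(replicas)]
        for i in range(replicas):
            if not target(i).update(reads[i], f"replica-{i}", f"round-{r}"):
                conflicts += 1  # a real client would re-read and retry
    return conflicts
```

With four replicas over ten rounds, the shared object rejects 30 writes (three per round), while the sharded layout rejects none.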
10. Streamlined Scheduler Logic for Better Maintainability
Finally, the architectural changes in v1.36 make the kube-scheduler code simpler and easier to maintain. Because Workload objects are now pure templates, the scheduling framework interacts only with PodGroup objects. This removes the need to parse unstructured workload manifests and reduces the number of extension points that vendors must support. The new design also makes it easier to add future scheduling features, such as job scheduling or co-scheduling, without touching the core scheduling cycle. For cluster administrators, this means a more stable scheduler that is less prone to regressions.
Conclusion
Kubernetes v1.36 represents a strategic evolution in workload-aware scheduling, addressing the unique challenges posed by AI/ML and batch workloads. By cleanly separating the Workload template from the PodGroup runtime, introducing a dedicated scheduling cycle, and adding topology awareness, workload-aware preemption, and DRA support, this release delivers a more robust and scalable scheduling foundation. The first phase of Job controller integration shows how these APIs fit into real batch workloads, although they remain alpha (scheduling.k8s.io/v1alpha2) in this release. As the ecosystem adopts these features, Kubernetes is set to become an even stronger platform for modern, resource-intensive applications.