Accelerating Linux Page Migration with AMD’s New Batch Copy Patches: A Developer’s Guide

From Hpimall, the free encyclopedia of technology

Overview

Page migration is a critical operation in modern Linux memory management, especially on systems with heterogeneous memory (e.g., NUMA nodes, disaggregated memory, or CXL-attached devices). When a process accesses memory on a remote node, performance degrades due to the added latency. Migration moves pages to the accessing node, but traditional single-page migration is CPU-intensive and incurs high overhead. The new patch series, initially proposed by an NVIDIA engineer in early 2025 and now advanced by AMD engineers, introduces accelerated page migration using batch copies and hardware offloading. This guide will help you understand, apply, and test these patches to boost system performance.

Prerequisites

  • Linux kernel development environment: You need a recent kernel source tree (mainline or linux-next) and build tools (gcc, make, binutils).
  • Hardware with migration acceleration: Typically AMD EPYC processors with certain features (check PCIe ATS/PRI, CXL.mem support).
  • Familiarity with memory management concepts: NUMA, page tables, direct memory access (DMA), and memory failure handling.
  • Access to the kernel mailing list (LKML): To fetch the latest patch revision (v4 or later).
  • Test workload: A program that stresses NUMA migrations (e.g., a modified STREAM benchmark or the sysbench memory test).

Step-by-Step Instructions

1. Understand the Patch Series

The patches extend the existing migrate_pages() system call and the internal migrate_vma mechanism. The key innovation is turning single-page copy operations into batch requests sent to hardware DMA engines (like the AMD IOMMU or CXL.mem controllers). This reduces per-page overhead and leverages dedicated copy engines. The series also adds a new MIGRATE_BATCH flag and modifies the kernel's page migration path to aggregate multiple pages before offloading.

Read the cover letter on LKML (link). Focus on the changes to mm/migrate.c, include/linux/migrate.h, and the architecture-specific DMA setup (e.g., the AMD IOMMU driver under drivers/iommu/amd/).

2. Apply the Patches to Your Kernel

  1. Clone the latest linux-next tree: git clone https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
  2. Download the patch series from LKML (usually as an mbox file). The b4 tool can fetch a series by message ID (b4 am <message-id>) and prepare it for git am.
  3. Apply patches in order: git am *.patch
  4. Configure kernel: Enable CONFIG_MIGRATION (on by default) and the new, experimental CONFIG_MIGRATE_BATCH. Also ensure CONFIG_AMD_IOMMU is enabled for hardware offloading.
  5. Build kernel: make -j$(nproc)
  6. Install modules and new kernel: sudo make modules_install install
  7. Reboot into the new kernel.

3. Verify Feature Availability

After booting, check kernel messages for batch migration support:

dmesg | grep -i "migrate_batch"

You should see something like: "migrate_batch: acceleration enabled via IOMMU". Also examine /sys/kernel/debug/migration if debugfs is mounted. The directory may contain a batch_stats file.

4. Configure Hardware Offloading (Optional)

By default, batch offloading may be disabled. To enable, echo to sysfs:

echo 1 > /sys/module/migrate_batch/parameters/enable_offload

To set batch size (number of pages per batch, default 32):

echo 64 > /sys/module/migrate_batch/parameters/batch_size

Note: Larger batches may reduce per-page overhead but increase latency for synchronous operations. Tune based on workload.

5. Run Benchmark and Compare Performance

  1. Test workload: Write a simple program that allocates memory on node 0, then binds to node 1 and accesses memory repeatedly, triggering page migration.
  2. Compile with gcc -o migrate_test migrate_test.c -lnuma (install libnuma-dev).
  3. Run with and without batch/hardware offloading. Disable offloading by writing 0 to the parameter.
  4. Measure migration time: Use perf stat or add internal timing. Example command:
sudo numactl --cpunodebind=1 timeout 10 ./migrate_test
(Do not add --membind=0: binding memory to node 0 would forbid the very migration you want to measure. Let the program place its initial allocation on node 0 itself, e.g. with numa_alloc_onnode().)

Collect results and compare. Expected improvement: 2x-5x reduction in migration latency for large data sets.

Common Mistakes

  • Applying patches out of order: Patches in the series depend on one another. Apply them in order with git am, or use a tool like b4 to fetch and apply the whole series correctly.
  • Missing kernel configuration symbols: Ensure CONFIG_MIGRATE_BATCH and CONFIG_AMD_IOMMU are enabled. The kernel may compile without errors but batch support will be a no-op.
  • Not enabling hardware offloading: The sysfs parameter defaults to off. Forgetting to enable it means you still use software batch copies (which still benefit from reduced syscall overhead but not DMA).
  • Overlooking hardware requirements: IOMMU v2 and CXL.mem hardware are required for true offloading. Check dmesg | grep -i "AMD-Vi" for IOMMU initialization and verify that the IOMMU is enabled in BIOS settings.
  • Using incompatible workload: Very small pages (4KB) may not benefit; test with 2MB huge pages for better batch packing.
  • Ignoring kernel warnings: If dmesg shows "migrate_batch: no DMA engine found", fall back to software batch mode.

Summary

The AMD batch migration patches represent a significant step toward reducing page migration overhead in Linux. By grouping migrations into large batches and offloading copy operations to hardware DMA engines, the kernel can improve performance for memory-intensive workloads on NUMA and CXL systems. This guide provided an overview, prerequisites, step-by-step instructions for applying and testing the patches, and common pitfalls to avoid. With careful tuning, developers can achieve substantial latency reductions. Keep an eye on LKML for future revisions that might extend support to other architectures or improve batch scheduling.