
Pinterest Engineers Eliminate Zombie Cgroups to Resolve Production Bottlenecks

Pinterest engineers diagnosed and resolved an intermittent CPU starvation issue causing machine learning training failures on their Kubernetes-based P

Deep Analysis

This technical postmortem from Pinterest provides a fascinating case study in modern infrastructure debugging, revealing how abstracted systems can create opaque failure modes and the critical need for deep, layered diagnostics.

The Core Problem: Invisible Resource Contention

At its heart, the issue was CPU starvation, specifically of system (kernel-mode) CPU time. The symptoms, intermittent network resets (ENA device resets) and Ray task crashes, were disruptive but initially misleading. The key insight was that aggregate CPU metrics showed nothing wrong. This is a classic observability pitfall: averages can hide severe, transient, and localized bottlenecks. The starvation occurred on individual cores that were handling critical kernel work, which effectively halted the system's ability to process network interrupts.
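To make the averaging pitfall concrete, here is a minimal arithmetic sketch. The 64-core host and the utilization figures are hypothetical and chosen only for illustration; they are not taken from Pinterest's data.

```python
# Hypothetical snapshot: system-CPU utilization (%) per core on a 64-core host.
# One core is pinned at 100% in kernel mode; the other 63 are nearly idle.
per_core_system_pct = [100.0] + [1.0] * 63

# What a fleet dashboard typically plots: the mean across all cores.
aggregate = sum(per_core_system_pct) / len(per_core_system_pct)

# What actually matters for a stalled interrupt handler: the worst core.
worst = max(per_core_system_pct)

print(f"aggregate system CPU: {aggregate:.1f}%")
print(f"worst core:           {worst:.1f}%")
```

With these made-up numbers the host-wide average stays in the low single digits even though one core is fully saturated, which is why a dashboard built on averages would show nothing wrong.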

The Diagnostic Journey: From Aggregates to Atoms

The investigation's success hinged on changing the diagnostic lens:

  1. Shift in Tooling: Moving from high-level dashboards to mpstat for per-core analysis was pivotal. This exposed that individual cores were hitting 100% system CPU utilization for sustained periods.
  2. Understanding the Cascade: Engineers connected the CPU saturation to its consequence. A saturated core couldn't service the NAPI polling thread for the Elastic Network Adapter (ENA). This triggered the ENA's built-in self-healing mechanism (a device reset after 5 seconds of stalled transactions), which severed network connections and killed dependent tasks. This shows a deep understanding of the Linux kernel networking stack and cloud hardware behavior.
  3. Continuous Profiling for Precision: Instead of relying on snapshots, they implemented time-series performance capture (every 2 minutes over a 12-hour window). Visualizing this data in Netflix Flamescope allowed them to correlate the exact moment of network resets with CPU spikes, pinpointing kubelet as the culprit.
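The per-core view from step 1 can be approximated without mpstat by diffing two snapshots of /proc/stat. The sketch below is an illustration rather than Pinterest's tooling: the function names, the 95% threshold, and the sample snapshots are all assumptions made for this example.

```python
def parse_proc_stat(text):
    """Return {core_name: [counters]} for per-core lines (cpu0, cpu1, ...)."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "intr".
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            # First 8 counters: user nice system idle iowait irq softirq steal
            cores[fields[0]] = [int(v) for v in fields[1:9]]
    return cores


def system_pct_per_core(before, after):
    """Percentage of elapsed CPU time each core spent in system (kernel) mode."""
    result = {}
    for core, new in after.items():
        old = before.get(core)
        if old is None:
            continue
        deltas = [n - o for n, o in zip(new, old)]
        total = sum(deltas)
        # Index 2 is the "system" counter.
        result[core] = 100.0 * deltas[2] / total if total > 0 else 0.0
    return result


def saturated_cores(pcts, threshold=95.0):
    """Names of cores at or above the (hypothetical) saturation threshold."""
    return sorted(core for core, pct in pcts.items() if pct >= threshold)


# Hypothetical snapshots taken a short interval apart on a 2-core machine.
BEFORE = (
    "cpu  2000 0 1000 8000 0 0 0 0 0 0\n"
    "cpu0 1000 0 500 4000 0 0 0 0 0 0\n"
    "cpu1 1000 0 500 4000 0 0 0 0 0 0\n"
    "intr 12345\n"
)
AFTER = (
    "cpu  2010 0 1105 8085 0 0 0 0 0 0\n"
    "cpu0 1010 0 505 4085 0 0 0 0 0 0\n"
    "cpu1 1000 0 600 4000 0 0 0 0 0 0\n"
    "intr 12399\n"
)

pcts = system_pct_per_core(parse_proc_stat(BEFORE), parse_proc_stat(AFTER))
print(pcts)
print(saturated_cores(pcts))
```

On a real host the two snapshots would come from reading /proc/stat twice with a sleep in between. Sampling this on a fixed interval and keeping the timestamps is what makes it possible to line up CPU spikes with reset events later.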

Unmasking the "Zombie" Cgroups: A Layer Violation

The most profound finding was that kubelet—a core Kubernetes component normally using <1% CPU—was spending most of its time in the kernel function mem_cgroup_nr_lru_pages. This led to the discovery of the "zombie" memory cgroups.

  • The Mechanism: An Amazon ECS agent, enabled by default in their base AMI but unused by Pinterest, was caught in a crash loop. Each crash leaked a cgroup without cleaning it up.
  • The Scale: While only ~240 cgroups were active, roughly 70,000 "zombies" had accumulated. During its routine cgroup state synchronization, kubelet had to traverse this bloated list.
  • The Impact: This traversal monopolized a CPU core for seconds at a time, starving the network stack. This is a perfect example of a layer violation: a user-space daemon (the ECS agent) leaked kernel-level state (orphaned cgroups), which then crippled a different user-space orchestrator (kubelet).
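One common way to estimate how many zombie cgroups a host is carrying is to compare the number the kernel tracks for a controller (the num_cgroups column of /proc/cgroups) with the number of cgroup directories actually visible in the filesystem. The sketch below assumes a cgroup v1 layout mounted at /sys/fs/cgroup/memory; the function names and sample values are hypothetical, and this summary does not say whether Pinterest used exactly this comparison.

```python
import os


def kernel_cgroup_counts(proc_cgroups_text):
    """Parse /proc/cgroups text into {controller: num_cgroups}."""
    counts = {}
    for line in proc_cgroups_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header row
        name, _hierarchy, num_cgroups, _enabled = line.split()
        counts[name] = int(num_cgroups)
    return counts


def visible_cgroup_count(root):
    """Count cgroup directories under root, including root itself."""
    return sum(1 for _ in os.walk(root))


def estimate_zombies(proc_cgroups_text, root, controller="memory"):
    """Kernel-tracked cgroups minus visible ones (never negative)."""
    tracked = kernel_cgroup_counts(proc_cgroups_text)[controller]
    return max(tracked - visible_cgroup_count(root), 0)


# Hypothetical /proc/cgroups contents, sized to echo the scale in the article.
SAMPLE_PROC_CGROUPS = (
    "#subsys_name\thierarchy\tnum_cgroups\tenabled\n"
    "cpuset\t4\t12\t1\n"
    "memory\t9\t70240\t1\n"
)

print(kernel_cgroup_counts(SAMPLE_PROC_CGROUPS))

# On a real cgroup v1 host (assumed paths):
# with open("/proc/cgroups") as f:
#     print(estimate_zombies(f.read(), "/sys/fs/cgroup/memory"))
```

A large, growing gap between the two numbers suggests cgroups that were removed from the filesystem but are still pinned in kernel memory, which matches the pattern described above.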

Solution and Broader Lessons

The fix was elegantly simple: disabling the ECS agent's systemd unit in the AMI and rebooting to clear the accumulated state. This simplicity, however, underscores the complexity of the diagnosis.

Pinterest's experience imparts several crucial lessons:

  • Question Default Configurations: Base images and AMIs often include software you don't use. These "defaults" can become active failure points. A rigorous hardening and slimming process for production images is essential.
  • Observability Must Be Granular: High-level metrics are necessary but insufficient. Per-core, time-indexed profiling is vital for diagnosing intermittent performance issues in complex, multi-tenant environments.
  • Embrace the Full Stack: Modern SRE/DevOps requires understanding the entire stack, from user-space applications, through orchestration layers (Kubernetes), down to the Linux kernel and hardware drivers. Problems often hide in the intersections between these layers.
  • The Future is Proactive Profiling: The team advocates for continuous profiling tools like gProfiler, Parca, and Grafana Pyroscope. These tools, often using eBPF, provide cluster-wide, continuous visibility, turning a reactive "forensic investigation" into a proactive pattern-detection exercise.
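The first lesson, questioning defaults, lends itself to a simple automated check at image-build time: list the systemd units that are enabled and flag anything not on an explicit allowlist. The sketch below parses text in the shape printed by `systemctl list-unit-files --state=enabled`. The allowlist, the sample output, and the unit name `ecs.service` are assumptions for illustration.

```python
def enabled_units(list_unit_files_output):
    """Extract unit names whose STATE column is 'enabled'."""
    units = set()
    for line in list_unit_files_output.splitlines():
        fields = line.split()
        # Header and footer lines do not have 'enabled' in the second column.
        if len(fields) >= 2 and fields[1] == "enabled":
            units.add(fields[0])
    return units


def unexpected_units(list_unit_files_output, allowlist):
    """Enabled units that are not on the allowlist, sorted by name."""
    return sorted(enabled_units(list_unit_files_output) - set(allowlist))


# Hypothetical output captured from a freshly built image.
SAMPLE_OUTPUT = """\
UNIT FILE            STATE   VENDOR PRESET
sshd.service         enabled enabled
kubelet.service      enabled disabled
ecs.service          enabled enabled

3 unit files listed.
"""

# Hypothetical allowlist of services this fleet intends to run.
ALLOWLIST = ["sshd.service", "kubelet.service"]

print(unexpected_units(SAMPLE_OUTPUT, ALLOWLIST))
```

Failing the image build when this list is non-empty turns an unused default service into a review item instead of a latent production risk.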

In summary, this incident demonstrates that in large-scale distributed systems, performance and stability are emergent properties of the entire software and configuration stack. Pinterest's methodical approach—drilling from macro metrics down to kernel functions and finally to a single misconfigured service—provides a robust blueprint for debugging the insidious, layered failures that plague modern cloud infrastructure.