When a 10% CPU Spike Felt Like a Mystery — How Flamegraphs Solved It

How I used flamegraphs to find a sneaky CPU hotspot in production, with practical commands, tool choices, and real tradeoffs for Indian teams.

Written by: Rohan Deshpande

Image: a developer at a laptop with terminal windows showing code and profiling output (credit: Pexels / fauxels).

A few months ago our payments service started showing a 10% CPU increase on the busiest node. Logs were clean, traces were noisy but inconclusive, and the pager was politely persistent. We tried increasing instance size, sampling traces, and adding caching — none fixed the root cause. What did work was embarrassingly simple: a 30‑second flamegraph.

If you haven’t used flamegraphs much, they feel like a detective’s magnifying glass for CPU and latency hotspots. You get a single interactive SVG that shows where time is spent, grouped and prioritized visually. For teams in India — small budgets, conservative production access, and tight SLAs — flamegraphs are a low-cost, high-signal tool you should learn to run cautiously and quickly.
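
Under the hood, that SVG is built from “folded” stacks: one line per unique call path, frames joined by semicolons, with a sample count at the end, and the renderer merges common prefixes so the widest boxes are the paths that burned the most CPU. If you want to see the grouping in isolation, here is a minimal sketch with made-up frame names (it assumes you have already cloned Brendan Gregg’s FlameGraph repo, as in the pipeline sketch further down):

    # A few hand-written folded stacks: frames joined by ";", sample count last
    # (the frame names are purely illustrative)
    printf '%s\n' \
      'server;handleRequest;encodeResponse;json.Marshal 412' \
      'server;handleRequest;encodeResponse 55' \
      'server;handleRequest;queryDB;driver.Exec 120' \
      'server;background;gc 30' > example.folded

    # Render them; box width is proportional to the sample count on that path
    ./FlameGraph/flamegraph.pl example.folded > example.svg

Open example.svg in a browser and the grouping is obvious at a glance: handleRequest dominates the width, and most of that width sits under the made-up json.Marshal frame.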

Why flamegraphs first (and not flame-retina-blah)

A quick, practical pipeline (Linux perf + Brendan Gregg’s scripts), sketched a little further down

Language-specific helpers that save time

When flamegraphs are the wrong tool

Real constraints and tradeoffs (what we learned)
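
Here is that perf + FlameGraph pipeline in its most basic form: a minimal sketch assuming a Linux host where you are allowed to run perf (root, or a relaxed kernel.perf_event_paranoid) and a 30-second, whole-system sample at 99 Hz. Tune the flags to your environment.

    # One-time: grab Brendan Gregg's stackcollapse and flamegraph scripts
    git clone https://github.com/brendangregg/FlameGraph

    # Sample on-CPU stacks across all CPUs at 99 Hz for 30 seconds (writes perf.data)
    sudo perf record -F 99 -a -g -- sleep 30

    # Fold the samples into one line per unique stack, then render the interactive SVG
    sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > cpu-flame.svg

One practical note: for compiled services, make sure frame pointers or debug symbols are available, otherwise the stacks come out truncated or as bare addresses and the graph is far less readable.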

How I used the flamegraph

In our case the flamegraph quickly showed a large block inside a JSON marshaller path — not the DB or network as we’d suspected. A recent change had enabled a legacy logger to marshal entire payloads on every request. We reverted the logger change and the CPU normalized. The flamegraph saved us from an expensive horizontal scale-up and many noisy hypotheses.
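
A number helps alongside the picture. If you keep the folded output around instead of piping it straight into the SVG, a one-liner tells you what share of samples pass through the suspect frame. A minimal sketch, where "marshal" stands in for whatever symbol you are chasing:

    # Keep the folded stacks instead of discarding them after rendering
    sudo perf script | ./FlameGraph/stackcollapse-perf.pl > out.folded

    # Share of samples whose stack contains a frame matching "marshal" (case-insensitive)
    awk '{ t += $NF; if (tolower($0) ~ /marshal/) s += $NF }
         END { printf "%d of %d samples (%.1f%%)\n", s, t, (t ? 100*s/t : 0) }' out.folded

The SVG’s built-in search does the same thing interactively, but a percentage pasted into the incident channel travels well.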

Communicating results

One underrated advantage: flamegraphs are persuasion-ready. Instead of “I think the marshaller is slow”, you can paste an SVG into the incident chat and show a colleague exactly which call path dominates CPU. That helped us get quick approval for the rollback.

A small checklist before you run one in production
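
The exact checks depend on your change-control rules, but as a rough sketch of the shape they take (the specific numbers are assumptions to adapt, and <PID> is a placeholder for your target process):

    # Will the kernel let you profile? (-1 is most permissive; higher values are more restrictive)
    sysctl kernel.perf_event_paranoid

    # Prefer one process over the whole host, at a modest rate, for a short window
    sudo perf record -F 49 -g -p <PID> -- sleep 15

    # Check the size of the artifact before copying it off the box
    ls -lh perf.data

Shorter windows and lower frequencies mean a coarser graph, but on a box that is already struggling that is usually the right trade.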

Takeaway (my position)

Flamegraphs aren’t a silver bullet, but they ought to be your default first instrument for mysterious CPU or latency hotspots. They’re cheap to run, produce a single actionable artifact, and force you to look at where time is actually spent. The downside is coordination and a tiny runtime cost — but compared to hours of blind guessing or a needless scale-up, they pay for themselves fast.

If you haven’t used them in a production incident, try this tonight on a staging replica: clone the FlameGraph repo, run a 30s profile, and open the SVG. The first time you see the actual hotspot laid out visually, you’ll understand why I keep one in my incident toolkit.