Why I Use eBPF for Production Profiling — and When It Sucks

How eBPF profiling gave our small Indian team deep, low-overhead visibility—what worked, what broke, and when to stop using it.

Written by: Rohan Deshpande

Rack of servers in a data centre lit with blue LEDs
Image credit: Unsplash

We shipped a performance fix in the middle of an Indian festival week because a tiny spike in tail latency started costing customer trust. The usual suspects—DB slow queries, noisy cache keys—looked fine. Instrumentation showed nothing useful. Then a teammate suggested: try eBPF profiling on the live service for a few minutes. Within ten minutes we had a clear off-CPU flamegraph pointing at a surprising syscall-heavy code path. We fixed it the next day.

This is why I use eBPF profiling: it gives low-overhead, deep visibility into what code actually does on real hosts. But it also comes with practical headaches that make it a tool for specific problems, not a silver bullet.

What eBPF profiling actually buys you

A few real examples from our stack

How to get started (pragmatic checklist)

  1. Confirm kernel and distro support. For useful features you’ll want Linux 5.x or recent stable 4.14+ with CONFIG_BPF and related options enabled. Many managed VMs in India (cheap VPS, older company fleets) still run older kernels—check first.
  2. Pick your tools. I use bpftrace for ad‑hoc sampling and flamegraphs, and BCC tools for quick syscall and tcp traces. For longer runs, consider OpenTelemetry integrations or eBPF-based SaaS like Pixie—if your security and budget allow.
  3. Start with sampling. A simple bpftrace one-liner that samples stack traces every N microseconds will show hot paths without instrumenting code.
  4. Limit scope and time. Run on a single instance or a small canary subset for 30–120 seconds. Filter by PID or container ID to reduce noise.
  5. Archive traces and correlate. Save raw stacks and correlate them with deployment tags, commit IDs, and business events. That’s how you derive repeatable fixes.
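The support check in step 1 can be scripted. Here is a minimal sketch — kernel_ok is my own helper name, the version cutoffs mirror the ones above, and the config path in the comment assumes an Ubuntu-style /boot layout:

```shell
#!/bin/sh
# Pre-flight check before an eBPF session (a sketch; config paths vary by distro).

# Returns 0 if a kernel release string is new enough for comfortable
# eBPF profiling: 5.x, or 4.14+ as suggested above.
kernel_ok() {
  k="${1:-$(uname -r)}"
  major="${k%%.*}"
  rest="${k#*.}"
  minor="${rest%%.*}"
  minor="${minor%%-*}"
  [ "$major" -ge 5 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 14 ]; }
}

if kernel_ok; then
  echo "kernel $(uname -r): ok for eBPF profiling"
else
  echo "kernel $(uname -r): likely too old; expect missing features"
fi

# Then confirm CONFIG_BPF and friends in the build config, e.g.:
#   grep -E 'CONFIG_BPF(_SYSCALL|_JIT)?=' "/boot/config-$(uname -r)"
```

Running this on the canary host first saves the annoyance of a bpftrace one-liner failing halfway through an incident.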

A quick command I'll run when nothing else helps (safe, short sampling that collects user and kernel stacks for PID 1234, then exits on its own after 60 seconds):

    sudo bpftrace -e 'profile:hz:97 /pid == 1234/ { @[kstack, ustack(100)] = count(); } interval:s:60 { exit(); }' > stacks.txt

Then convert the counts to a flamegraph locally. It's boringly effective.
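The conversion step can be sketched as a small awk fold. The real stackcollapse-bpftrace.pl from Brendan Gregg's FlameGraph repo is more robust and is what I'd use in practice; fold_stacks and the file names here are my own illustration of what it does:

```shell
# Fold bpftrace's multi-line stack output into the one-line
# "frame;frame;frame count" format that flamegraph.pl expects.
fold_stacks() {
  awk '
    /^@\[$/ { n = 0; collecting = 1; next }      # a stack map entry opens
    collecting && /^\]: / {                      # "]: <count>" closes it
      s = ""
      for (i = n; i >= 1; i--)                   # emit root frame first
        s = s (s == "" ? "" : ";") f[i]
      print s, $2
      collecting = 0
      next
    }
    collecting { gsub(/^[ \t]+|[ \t]+$/, ""); f[++n] = $0 }  # collect frames
  '
}

# Typical use after the sampling run above:
#   fold_stacks < stacks.txt > stacks.folded
#   ./FlameGraph/flamegraph.pl stacks.folded > profile.svg
```

The folded file is plain text, so it also diffs nicely between two capture runs when you want to compare before and after a fix.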

When eBPF profiling shines

Real constraints and tradeoffs (the part nobody markets)

Practical governance that saved our skin

When to stop and use something else

If a problem is reproducible in a test environment, instrument code or add metrics. If you need long-term SLO monitoring, invest in lightweight application metrics and tracing. eBPF profiling is best as a fast, last-mile diagnostic—especially when other tools disagree.

Parting thought

In resource-conscious Indian teams (small infra budgets, mixed host control), eBPF profiling is a force-multiplier: a short burst of kernel-level truth that often points at a fix. But it’s not a replacement for good telemetry, and it brings operational and security friction. Use it for the weird, stubborn cases; treat it with the same discipline you’d apply to a hotfix in the middle of the night.

If you want, I can share the exact bpftrace snippets and a short runbook we use for canary traces—practical templates you can adapt to an Ubuntu 22.04 fleet or a small GCP project.