Why We Gave Our Team an Error Budget — and How It Stopped Burnout Without Killing Velocity

A practical playbook for small Indian engineering teams to use an error budget to balance reliability and feature velocity—what worked, what didn't.

Written by: Rohan Deshpande

Image: A small engineering team standing around a whiteboard, sketching diagrams and discussing. (Credit: Pixabay / StockSnap)

We’d been doing the usual startup thing: ship fast, fix faster. Except at 2 a.m. on a Tuesday, when the payment gateway hiccuped during a flash sale, three of us were awake, scrambling, and apologising to an angry client while the rest of the company watched Slack explode. We told ourselves this was “part of growth.” After the third such night in three months, we tried something different: we gave the team an error budget.

This isn’t a fancy academic experiment. It’s a simple guardrail that helped us stop glorifying on-call suffering as a metric of commitment, and instead made trade-offs explicit: how much downtime are we willing to accept in exchange for faster releases? In small Indian teams—where people juggle on-call with client work, tight budgets, and festival-season traffic spikes—this clarity actually matters.

What an error budget is (in plain terms)

Pick a reliability target — an SLO, say 99.9% of payment requests succeeding over 30 days. The gap between that target and perfection is your error budget: the amount of failure you are allowed before behaviour has to change. You can spend it on risky launches; once it runs out, reliability work takes priority over features.

Why this helped us

A practical 30‑minute playbook to get started

  1. Choose one SLI and a 30-day SLO
    • Keep it narrow: e.g., “Payment API success rate.” Wider SLOs feel good but are hard to measure accurately.
  2. Convert SLO to an error budget number
    • 99.9% -> ~43.2 minutes/month. 99.95% -> ~21.6 minutes/month. Pick what reflects your product and customers.
  3. Instrument and measure
    • Use whatever you have: Prometheus, CloudWatch, or a simple success-rate metric logged to a dashboard. Accuracy matters; an over-counted error skews decisions.
  4. Define a burn policy (simple thresholds)
    • Burn <25%: safe to continue normal launches.
    • Burn 25–50%: review risky releases; increase automated checks.
    • Burn >50%: pause non-critical launches, mobilise stability fixes.
  5. Tie the policy to concrete actions
    • E.g., if burn >50% for 7 days, freeze feature releases until SLI is back within SLO for a rolling 3-day window.
  6. Communicate visibly
    • A one-line Slack status or a dashboard widget that shows remaining minutes removes guesswork.
  7. Blameless postmortems and replenishment
    • After incidents, do a short postmortem and prioritise work that replenishes the budget (root-cause fixes, better tests, or feature rollbacks).
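The budget math and the burn-policy tiers from the steps above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's API — the function names and action strings are made up for the example:

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Convert an availability SLO into an error budget in minutes.

    A 30-day window has 43,200 minutes; the budget is the slice of it
    you are allowed to fail (e.g. 0.1% for a 99.9% SLO).
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def burn_policy(budget_minutes: float, minutes_burned: float) -> str:
    """Map current burn onto the three-tier policy (25% / 50% thresholds)."""
    burn = minutes_burned / budget_minutes
    if burn < 0.25:
        return "normal launches"
    if burn <= 0.50:
        return "review risky releases; increase automated checks"
    return "pause non-critical launches; mobilise stability fixes"


budget = error_budget_minutes(99.9)   # ~43.2 minutes/month
print(round(budget, 1))               # 43.2
print(burn_policy(budget, 15))        # 15 minutes burned is ~35% of budget
```

The two numbers this produces — minutes remaining and the current tier — are exactly what a one-line Slack status or a dashboard widget should show, so nobody has to redo the arithmetic mid-incident.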

Real constraints and tradeoffs (what will bite you)

Tools that actually work for small teams in India

Example numbers we used

When to not use an error budget

The down-to-earth truth: it's a governance tool, not magic

An error budget won't reduce errors by itself. It gives you a shared language and a visible constraint to make sane trade-offs. In our case it stopped the all-nighters from becoming the default and made launches feel like informed bets instead of heroic gambles. We still had messy incidents—sometimes we miscalculated third-party risk or underinvested in testing—but we stopped letting those incidents define our culture.

If you try this, start narrow, keep the math simple, and be honest about your metrics. And when someone asks whether reliability should win over a new feature, point to the dashboard and have the conversation. It’s the kind of argument an engineer, a product manager, and a CTO can all understand without anyone having to sacrifice another weekend.

Thanks for reading—if you want, ping me a one-line description of your SLO and I’ll tell you roughly how many minutes you’re willing to lose this month.