How I Tamed a Flaky Test Epidemic Without Rewriting the Suite

A practical, low-cost playbook for identifying, triaging, and reducing flaky tests so CI stops eating time and engineers stop ignoring the pipeline.

Written by: Arjun Malhotra

Image: laptop screen showing a code editor and terminal with test failures. Credit: Unsplash / Glenn Carstens-Peters

We hit the breaking point the week our main branch stopped being trustworthy. Developers were ignoring red builds. Pull requests sat unmerged because a green CI run was a lottery. The sprint board filled with “fix flaky tests” tickets, but nobody had time for a full test rewrite.

If you’ve been there, you know the cost: wasted CI minutes, interrupted flow, and a slow-burn culture of “just rerun.” I’ll walk through the practical system I introduced at a small Indian product team that cut our CI noise by half in three months. It’s not magic—just focused tooling, policy, and a few uncomfortable tradeoffs.

Why flaky tests are a problem (beyond annoyance)

The name for this problem is simple: flaky tests. Treat them as measurable engineering debt, not background noise.

Step 1 — Measure before you act

We started with two metrics:

How to measure cheaply: instrument your CI (GitHub Actions, GitLab CI, CircleCI) to attach a small JSON artifact on failure that logs the run ID, test name, environment, node label, and stack trace. A tiny Node script (ours ran as a Lambda) aggregated these into a CSV, and we plotted flake rates on a simple Grafana panel. You don’t need an enterprise test-flakiness product—just logs and a dashboard.
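The aggregation step fits in a few lines. Here is a sketch in Python rather than Node; the field names (`test_name`, and the shape of the artifacts) are illustrative, not from a real pipeline:

```python
# Sketch of the aggregation step, assuming each failing CI run uploads a
# small JSON artifact with run ID, test name, environment, and stack trace.
# Field names here are illustrative assumptions.
from collections import Counter

def flake_rates(failure_artifacts, total_runs_per_test):
    """Return per-test failure rate: failures / total runs."""
    fail_counts = Counter(a["test_name"] for a in failure_artifacts)
    return {
        test: fail_counts.get(test, 0) / total
        for test, total in total_runs_per_test.items()
    }
```

Feed the output into whatever dashboard you already have; the point is the ratio per test, not the plumbing.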

Step 2 — Fast triage: quarantine, don’t bury

We introduced a “quarantine” label for tests that meet a flakiness threshold (≥ 30% failure rate over 7 days). Quarantining means:

This reduced merge-blocking reruns immediately. Important caveat: quarantine is temporary. Treat it like a loan, not a dump.

Step 3 — Make flakiness actionable at the moment of failure

When a quarantined or flaky test fails on a PR, the CI job:

This gives engineers quick evidence: a transient network glitch, an assertion that depends on time, or a genuine regression. We found that over 40% of flaky failures showed a clear environmental pattern (slow DNS, DB timeouts on certain runners) that was fixable without changing test logic.
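A rough sketch of the kind of triage that surfaced those environmental patterns. The regexes are examples, not an exhaustive taxonomy:

```python
# Illustrative triage helper: bucket a failure's stack trace into the rough
# categories described above. Patterns are example assumptions, not complete.
import re

ENV_PATTERN = re.compile(r"ETIMEDOUT|ENOTFOUND|Connection refused|DNS")
TIME_PATTERN = re.compile(r"sleep|setTimeout|within \d+ms|clock")

def classify_failure(stack_trace):
    if ENV_PATTERN.search(stack_trace):
        return "environmental"        # slow DNS, DB timeouts on a runner
    if TIME_PATTERN.search(stack_trace):
        return "time-dependent"       # assertion tied to wall-clock timing
    return "possible-regression"      # needs a human look
```

Even a crude classifier like this, posted as a PR comment next to the failure, saves the engineer the first ten minutes of digging.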

Step 4 — Ownership + SLAs

Assign an owner to any test that gets quarantined. The owner has two simple obligations:

At first this felt bureaucratic, but it worked: someone’s name on a ticket increases the odds the test gets fixed instead of sleeping in limbo.
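One way to make the SLA enforceable rather than aspirational is a small nightly check. The 14-day window and the field names here are assumptions for illustration, not numbers from the text:

```python
# Sketch of a nightly ownership check. The 14-day SLA is an assumed value;
# the point is that every quarantined test has an owner and an expiry date,
# and breaches get surfaced loudly (Slack ping, ticket escalation, etc.).
from datetime import datetime, timedelta

SLA = timedelta(days=14)  # assumption: tune to your team

def sla_breaches(quarantined, now):
    """quarantined: list of dicts with test, owner, quarantined_at."""
    return [
        (q["test"], q["owner"])
        for q in quarantined
        if now - q["quarantined_at"] > SLA
    ]
```

The output is a short list of (test, owner) pairs, which is exactly the shape you want for a ping in the team channel.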

Step 5 — Fixes that actually stick

Common root causes and pragmatic fixes we used:

We avoided “big rewrites” as the first step. Small refactors and better isolation fixed the majority of high-value flakes.
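As an example of the small-refactor category: a classic flake source is a fixed sleep racing an asynchronous operation, and a bounded poll usually fixes it. A minimal sketch:

```python
# One fix pattern that stuck: replace fixed sleeps with a bounded poll.
# wait_for retries a predicate until it passes or a timeout expires.
import time

def wait_for(predicate, timeout_s=5.0, interval_s=0.1):
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval_s)
```

Unlike `time.sleep(2)`, this passes as soon as the condition holds and fails with a clear error when it never does, so slow runners stop masquerading as broken tests.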

Tradeoffs and the costs we accepted

India-specific notes that helped

How much did we improve?

In three months we:

If you only do one thing, measure flaky tests. Data focuses attention and removes finger-pointing.

Parting candid advice

Flaky tests are symptoms of brittle integration and fragile environments. You won’t eliminate them all—some things (browser rendering, third-party networks) will always be intermittent. But you can make them visible, owned, and expensive to ignore. The system we used is intentionally lightweight: visibility, temporary quarantine, ownership, and small targeted fixes. It kept shipping predictable without a test-suite rewrite—and that’s often all a small product team needs.

Now, pull up your CI dashboard. Find the five tests that flip between pass and fail most often. Triage one today, assign it an owner, and you’ll feel the relief within a sprint.