Treadmills exhaust teams. Flywheels compound trust.

Team Design Change Agile DevOps Leadership

 In a recent conversation, an engineering leader highlighted success as how quickly managers jump in and close escalations. Crisis skills matter: calm triage, clear communication, decisive calls. The leaders I admire do that and then make tomorrow quieter.

Use the visual above in your mind as you read:

  • Treadmill: escalation → status update → next escalation

  • Flywheel: escalation → insight → guardrail → runbook → fewer pages

The goal is simple. Keep the ability to respond under pressure. Turn every incident into a small step that reduces the chance and impact of the next one.


What the treadmill looks like

  • Status meetings move faster than learning.

  • Fixes rely on a few heroes.

  • The same alerts reappear with new ticket numbers.

  • Confidence rises after a hot fix and fades by the next release.

What the flywheel looks like

  • Each escalation produces one clear insight.

  • Insights become guardrails: SLOs, timeouts, backpressure, flags, canaries, rollback.

  • Guardrails are backed by a short runbook that anyone on call can follow.

  • Fewer pages arrive next week. Confidence accumulates.


Turning an escalation into a flywheel step

Capture the insight

One sentence. Example: “Cache stampede on product detail under peak traffic.”

OR,

a 5-whys style root-cause: Example: 

  • Why did latency and 5xx spike on the product page?
  • Many concurrent cache misses triggered recomputes that overloaded the source service.
    • Why were there many recomputes at once?
    • All pods refreshed the same hot key when the TTL expired at the same moment.
      • Why did they all refresh simultaneously?
      • No single-flight or request coalescing, and no stale-while-revalidate path.
        • Why were those mechanisms missing?
        • The caching library used plain GET/SET with a hard TTL and no jitter.
          • Why was the library that limited?
          • Caching patterns were not standardized; the design checklist lacked “cache truth” and dogpile control, and ownership was unclear.

Add a guardrail

Examples:

  • Stale-while-refresh to limit stampedes
  • Idempotent handlers and retry policy
  • Circuit breaker and explicit timeouts
  • SLO with burn alerts

Write the runbook

One page or less. Triggers, commands, owner, rollback, verification checks.


Practice once

A team drill. Prove that anyone on call can execute it.



Measure fewer pages

Track pages per week and time to rollback. Expect the curve to bend.


Guardrail starters that pay off quickly

  • Deploy and release are separate. Use feature flags.

  • Releases are reversible. Keep a tested rollback for each change.

  • Requests fail predictably. Timeouts, budgets, and backoff.

  • Data paths are safe. Idempotency for writes and migrations.

  • Traffic is honest. Rate limits and backpressure at ingress.

  • Observability is ready. Dashboards for latency, errors, saturation, cost, and “what changed”.


A tiny runbook template

  • Trigger: when to use this playbook

  • Checks: dashboards to open

  • Actions: commands or steps

  • Decision points: continue or roll back

  • Rollback: exact steps

  • Validation: what good looks like

  • Owner: who has the baton


Metrics that make tomorrow quieter

  • Pages per week and per person

  • Mean and p95 time to rollback

  • Incidents per change

  • Guardrail changes per week

  • Percentage of changes that are reversible

  • Preventive work completed each sprint

Keep them lightweight. Learn, do not blame.


Roles: manager vs leaders

  • Managers close escalations and keep the queue moving.

  • Leaders  turn escalations into guardrails, training, and fewer pages across teams. They scale prevention.

Be both: a manager and a leader.


Anti-patterns to retire

  • Status without a design change

  • Hero mode as a habit

  • Postmortems that name people instead of conditions

  • Big-bang fixes with no rollback


TL;DR

Firefighting builds short term credibility. The best leaders turn that credibility into quieter systems by scaling prevention, not just response.

Choose the flywheel.