03 / 2026.02.02 / Ops / 8 min read

On-call without tears

Eight engineers, fifteen production systems, zero on-call cries in fourteen months. Here's the actual workflow — bots included.

We rotate on-call across all eight of us. There is no dedicated ops team. There is no SRE. The same humans who write features also wake up when those features break. Most months, nobody wakes up.

That's not an accident. It's a consequence of three changes we made over the last year, and an admission about how much of this we now hand off to AI.

Change one was killing the false-alarm tax. We went through every alert in PagerDuty and asked, point blank, “If this fires at 3am, do we need to act before 9am?” Two-thirds of the alerts could not justify themselves. We deleted them. What remained were alerts we trust: when one fires, it is real.
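A minimal sketch of that audit, assuming each alert has already been given a yes-or-no answer to the 3am question. The `Alert` type, the `audit` function, and the alert names are hypothetical and exist only for illustration; nothing here talks to PagerDuty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    # Answer to: "If this fires at 3am, do we need to act before 9am?"
    must_act_before_morning: bool


def audit(alerts: list[Alert]) -> tuple[list[Alert], list[Alert]]:
    """Split alerts into (keep, delete) using the 3am question."""
    keep = [a for a in alerts if a.must_act_before_morning]
    delete = [a for a in alerts if not a.must_act_before_morning]
    return keep, delete


# Hypothetical alert names, for illustration only.
alerts = [
    Alert("checkout-5xx-rate", must_act_before_morning=True),
    Alert("disk-70-percent", must_act_before_morning=False),
    Alert("nightly-report-slow", must_act_before_morning=False),
]

keep, delete = audit(alerts)
print("keep:", [a.name for a in keep])
print("delete:", [a.name for a in delete])
```

The point of the boolean is that it forces a decision per alert. An alert nobody can answer "yes" for does not get to page anyone.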

Change two was a written runbook for every alert that survived. The runbook is not “consult the on-call engineer's tribal knowledge.” It is the actual sequence of commands and links that resolves the incident, written in the voice of whoever last solved it. New engineers can run the runbook on their first on-call shift. They mostly do.
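One way to hold that rule in place is a check that every surviving alert has a runbook with at least one concrete step. This is a sketch under assumptions: the `Step` and `Runbook` shapes, the script path, and the alert names are all hypothetical, not our actual format.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    description: str
    command: str = ""  # command to run, if the step has one
    link: str = ""     # dashboard or doc link, if the step has one


@dataclass
class Runbook:
    alert_name: str
    last_solved_by: str  # the runbook is written in this person's voice
    steps: list[Step] = field(default_factory=list)


def missing_runbooks(alert_names: list[str], runbooks: list[Runbook]) -> list[str]:
    """Return alerts that have no runbook, or a runbook with no steps."""
    covered = {r.alert_name for r in runbooks if r.steps}
    return [name for name in alert_names if name not in covered]


# Hypothetical data, for illustration only.
alerts = ["checkout-5xx-rate", "queue-backlog"]
runbooks = [
    Runbook(
        alert_name="checkout-5xx-rate",
        last_solved_by="a hypothetical engineer",
        steps=[
            Step("Check whether a deploy went out in the last hour."),
            Step("Roll back the last deploy.", command="./scripts/rollback.sh checkout"),
        ],
    ),
]

print(missing_runbooks(alerts, runbooks))
```

An empty runbook counts as missing on purpose. A page that says "ask whoever fixed it last time" is the tribal-knowledge problem in a different file.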

Change three was AI in the diagnosis loop. When an alert fires, a bot reads the most recent commits, the deploy log, and the relevant traces, then posts a paragraph in the incident channel suggesting what changed and what the likely cause is. About forty percent of the time it's correct on the first try. The rest of the time it at least narrows the search. The on-call engineer is no longer the entire diagnostic engine. The on-call engineer is the human who confirms or rejects what the bot said.
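The shape of that loop can be sketched in a few lines. Everything named here is an assumption for illustration: `IncidentContext`, `build_prompt`, and `diagnose` are hypothetical, and the model call and the channel post are passed in as plain functions so the sketch does not pretend to know any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class IncidentContext:
    alert_name: str
    recent_commits: list[str]
    deploy_log: list[str]
    traces: list[str]


def build_prompt(ctx: IncidentContext) -> str:
    """Assemble the evidence into a single prompt for the model."""
    sections = [
        f"Alert: {ctx.alert_name}",
        "Recent commits:\n" + "\n".join(ctx.recent_commits),
        "Deploy log:\n" + "\n".join(ctx.deploy_log),
        "Relevant traces:\n" + "\n".join(ctx.traces),
        "In one paragraph: what changed, and what is the likely cause?",
    ]
    return "\n\n".join(sections)


def diagnose(
    ctx: IncidentContext,
    ask_model: Callable[[str], str],
    post_to_channel: Callable[[str], None],
) -> str:
    """Get a first diagnosis and post it, marked unconfirmed, for a human to judge."""
    suggestion = ask_model(build_prompt(ctx))
    post_to_channel(f"[bot, unconfirmed] {ctx.alert_name}: {suggestion}")
    return suggestion


# Stubs stand in for the real model and the real incident channel.
posted: list[str] = []
ctx = IncidentContext(
    alert_name="checkout-5xx-rate",
    recent_commits=["abc123 change payment retry timeout"],
    deploy_log=["02:41 deployed checkout"],
    traces=["POST /pay timed out after 30s"],
)
diagnose(
    ctx,
    ask_model=lambda prompt: "Likely the retry timeout change in abc123.",
    post_to_channel=posted.append,
)
print(posted[0])
```

The "unconfirmed" tag is the design choice that matters. The bot proposes; the on-call engineer confirms or rejects before anything in the runbook gets run.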

This is the part where we are obligated to admit how much AI is involved. When an incident page goes off, the first eyes on the data are usually a model. The first attempted diagnosis is usually a model. The human confirms the diagnosis, runs the runbook, and goes back to bed. That is the actual workflow.

Something we don't tell clients, though it is true: paying for the model is meaningfully cheaper than paying for the SRE we used to need for the same coverage. We're an eight-person studio handling production for fifteen-plus customer-facing systems. We could not do that without the model in the loop. We are not pretending we could.

We know everyone else is doing some version of this and not saying so. We are saying so because lying about it is exhausting and the agencies that lie about it will have to start telling the truth eventually anyway.

If you want a real ops team, hire a real ops team. We are not pretending to be one. What we are is a small team of engineers whose ops burden is small enough (thanks to flags, observability, and yes, the bots) that nobody on our team has cried in fourteen months. That's the bar.

We track it. Cried-on-call count. It's been zero for fourteen months.

Brigada.dev
Prishtinë, Kosovo