
Alert Systems at Scale: When the Cure Is the Disease

Six months designing alerts for an industrial monitoring platform taught me the failure mode the spec couldn't see: alerts and silencing share the same off-switch, and once silenced, the system is structurally lying to itself.

I once spent six months designing an alert system that worked exactly as specified and made the thing it monitored less safe. The technicians who lived with it taught me why, mostly by silencing it.

This isn’t a story about bad design getting fixed. It’s about how the most common pattern in industrial monitoring — alert thresholds, notifications, paging — has a failure mode that looks like working. The cure for “we don’t know when something’s wrong” is alerts. The disease that emerges is too many alerts. And the cure for that, if you let it happen, is muting. Once muted, the system is back where it started — except now everyone trusts it less.

I want to walk through three phases of getting this wrong, because the pattern is the same whether you’re designing for industrial refrigeration, brewery telemetry, fleet logistics, or trading-platform risk dashboards. The substrate is different; the failure mode is identical.

Phase 1 — basic alerts, basic chaos

The simplest alert system is also the easiest to ship. You define thresholds. You wire them to a notification channel. You declare victory.

This is what I helped build for an industrial monitoring platform serving four enterprise customers, the largest of which ran a fleet of refrigeration units across distributed sites. The first version did three things: trigger on threshold breach, send an email, page someone if the breach persisted. It was correct. It was complete. It did exactly what the spec said.
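The Phase 1 logic is small enough to sketch in full. What follows is a minimal illustration, not the platform's code: the class name, the `page_after_s` parameter, and the "email"/"page" labels are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Phase 1 style alert: email on breach, page if the breach persists."""
    name: str
    threshold: float
    page_after_s: float  # hypothetical: seconds a breach must last before paging
    breach_started: Optional[float] = None
    paged: bool = False

    def observe(self, value: float, now: float) -> list:
        """Return the notifications this reading triggers ("email", "page")."""
        if value <= self.threshold:
            # Breach cleared: reset, so the next crossing emails again.
            self.breach_started = None
            self.paged = False
            return []
        actions = []
        if self.breach_started is None:
            self.breach_started = now
            actions.append("email")  # every new crossing emails immediately
        if not self.paged and now - self.breach_started >= self.page_after_s:
            self.paged = True
            actions.append("page")
        return actions
```

Feed this a reading that hovers around the threshold and every fresh crossing produces another email, even though nothing ever persists long enough to page. The logic is correct, and the inbox fills anyway.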

Within three weeks of broad rollout, we noticed two things. First, the alerts were firing — a lot. Some sensors generated dozens of alerts per day, the majority for transient conditions that resolved themselves before anyone could check. Second, the technicians had figured out how to silence them. Some used the in-app mute. Others set up email filters. A few asked their managers to be removed from the notification list entirely, citing “noise.”

The default names didn’t help. The system auto-generated names like Alert 1, Alert 2, Alert 17. By the time a customer had eighty alerts configured across six sites, the management view was incomprehensible: a wall of Alert 43s with no way to triage by relevance.

The thing nobody told us, because they didn’t have the language for it: the technicians weren’t muting bad alerts. They were muting the system that didn’t know how to tell good alerts from bad ones. The fix wasn’t fewer alerts. The fix was a system that could distinguish.

The hidden cost — silence is not a signal

Here’s the part of alert design that took me longest to understand. Once a technician mutes an alert, you have lost the most important diagnostic capability in the entire system: the ability to distinguish “muted because it wasn’t useful” from “muted because the operator gave up.”

From the platform’s point of view, both look the same — a flag flipped from “on” to “off.” From the operator’s point of view, they’re radically different decisions. The first is information hygiene. The second is professional surrender.
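One way a platform could keep that distinction is to record a mute as an event with a reason instead of a boolean. This is a sketch of the idea only, not something the engagement shipped; `MuteReason`, `MuteEvent`, and `surrender_rate` are hypothetical names, and the two reasons simply mirror the two decisions described above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class MuteReason(Enum):
    # Hypothetical categories mirroring the two decisions in the text.
    NOT_USEFUL = "not_useful"  # information hygiene
    TOO_NOISY = "too_noisy"    # the operator is giving up on the alert


@dataclass(frozen=True)
class MuteEvent:
    alert_name: str
    operator: str
    reason: MuteReason


def surrender_rate(events: list) -> float:
    """Fraction of mutes where the operator reported giving up on noise."""
    if not events:
        return 0.0
    counts = Counter(event.reason for event in events)
    return counts[MuteReason.TOO_NOISY] / len(events)
```

A rising `surrender_rate` across a site says something a wall of "off" flags cannot: the silence there is not healthy.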

This matters because the failures that should page someone — the ones that aren’t transient, the ones that indicate a real problem developing — happen against a background of muted alerts that the system can no longer distinguish from healthy silence. A muted alert system is worse than no alert system, because no alert system at least makes its uselessness visible. A muted alert system pretends to be working.

This is what I mean by the cure is the disease. The cure (alerts) and the disease (alert silencing) share an off-switch, and once the off-switch is flipped, the system is structurally lying to itself.

Phase 2 — refined alerts

The second version did four things differently, each one a direct response to a Phase 1 failure mode.

Names came from context, not counters. Instead of Alert 17, an alert configured to monitor a refrigeration unit’s compressor temperature got auto-named “Compressor temp > 75°C, Site B, Unit 12,” generated from the configuration. This isn’t naming as a UI nicety; it’s naming as a triage primitive. When the management view becomes sortable and scannable, alert proliferation stops being an information-hierarchy crisis.
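Generating a name from configuration is a few lines of string formatting. In this sketch the `AlertConfig` fields are assumptions chosen to reproduce the example name above; the platform's real schema is not described in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertConfig:
    # Hypothetical fields, chosen to match the example name in the text.
    metric: str       # e.g. "Compressor temp"
    comparator: str   # ">" or "<"
    threshold: float
    unit: str         # e.g. "°C"
    site: str
    equipment: str


def alert_name(cfg: AlertConfig) -> str:
    """Build a human-readable alert name from its configuration."""
    # ":g" drops a trailing ".0" so 75.0 renders as "75".
    return (
        f"{cfg.metric} {cfg.comparator} {cfg.threshold:g}{cfg.unit}, "
        f"{cfg.site}, {cfg.equipment}"
    )
```

Because the name is derived rather than typed, the same fields can drive sorting and filtering in the management view, for example sorting by `(cfg.site, cfg.equipment)`.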

Connectivity alerts learned to discriminate. “Disconnected” used to mean one thing — a missing heartbeat. Phase 2 distinguished a power outage (the building lost power; the device is fine when it comes back) from a signal-loss event (the device is up but can’t reach us, often a router problem) from a micro-outage (the heartbeat skipped one cycle and resumed, which is almost always nothing). Three categories, three notification policies, one massive reduction in noise. This was the single biggest lever in the redesign.
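The three categories can be sketched as a classifier that runs when a device reconnects. Two assumptions here are mine, not the platform's: that the device reports its uptime on return, and that a gap of up to two heartbeat intervals counts as a single skipped cycle. How the real system told a power outage from a signal loss is not recorded in these notes.

```python
from enum import Enum


class GapKind(Enum):
    MICRO_OUTAGE = "micro_outage"
    POWER_OUTAGE = "power_outage"
    SIGNAL_LOSS = "signal_loss"


def classify_gap(gap_s: float, heartbeat_s: float, uptime_on_return_s: float) -> GapKind:
    """Classify a heartbeat gap once the device has reconnected.

    gap_s: seconds between the last heartbeat before the gap and the first after.
    heartbeat_s: the expected heartbeat interval.
    uptime_on_return_s: device-reported uptime at reconnect (assumed available).
    """
    if gap_s <= 2 * heartbeat_s:
        # At most one cycle was skipped before the heartbeat resumed.
        return GapKind.MICRO_OUTAGE
    if uptime_on_return_s < gap_s:
        # The device booted during the gap, so it was off for part of it.
        # A crash-and-reboot would land here too; this sketch does not separate them.
        return GapKind.POWER_OUTAGE
    # The device stayed up throughout but could not reach the platform.
    return GapKind.SIGNAL_LOSS
```

Each `GapKind` can then carry its own notification policy. Note that this version only classifies after the fact; deciding what to send while the device is still dark needs a separate holding rule.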

Schedule flexibility stopped being an advanced feature. The first version assumed 24/7 paging. Real operations have business hours. Some have a graveyard shift; some don’t. Some treat weekends as customer-non-impacting and shift to daily-summary instead of immediate-page. Building this in as a default — not a paid tier, not an advanced setting — meant the operator’s actual workflow stopped fighting the alert system.
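A schedule-aware delivery decision might look like the sketch below. The `Schedule` fields, the 08:00 to 18:00 default, and the two delivery modes are illustrative assumptions; in particular, sending off-hours alerts to the daily summary when no night shift exists is my simplification, not a rule from the platform.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Schedule:
    # Hypothetical defaults for a site with ordinary business hours.
    day_start: time = time(8, 0)
    day_end: time = time(18, 0)
    night_shift: bool = False     # does the site staff a graveyard shift?
    weekend_summary: bool = True  # weekends go to the daily summary


def delivery_mode(schedule: Schedule, when: datetime) -> str:
    """Return "page" or "daily_summary" for an alert raised at site-local `when`."""
    if when.weekday() >= 5 and schedule.weekend_summary:
        return "daily_summary"  # Saturday is 5, Sunday is 6
    in_hours = schedule.day_start <= when.time() < schedule.day_end
    if in_hours or schedule.night_shift:
        return "page"
    return "daily_summary"
```

The point of the default values is that a new site gets sensible behavior without anyone opening an advanced settings page.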

Notifications met operators where they already were. The first version was email-default. Operators almost universally preferred WhatsApp — not because WhatsApp is technically superior, but because they already had it open, already monitored it, already responded faster on it. Technical convenience and user preference aren’t the same thing. Asking which channel an operator already lives in is a more reliable question than asking which channel they’d theoretically prefer.

The pattern across all four changes is the same: respect the operator’s existing context. The alert system isn’t dropping into a vacuum. It’s joining an existing workflow, and its job is to add signal without adding friction.

Phase 3 — what intelligent alerts would actually mean

I don’t have receipts for this phase, because the Phase 2 redesign was where the engagement ended for me. But every conversation I had with operators converged on the same question: can the system learn what I already know?

The thing operators already know — the thing currently locked in their heads — is which transient patterns are normal at their site. A given compressor on a given site has a daily warm-up curve that occasionally crosses a threshold for thirty seconds and then settles. The technician knows this. The alert system doesn’t. The technician mutes the alert, or learns to ignore it. The system has lost the technician’s expertise.

Phase 3 alerts learn the local patterns. They auto-tune thresholds based on each site’s history. They judge a reading against the site’s expected pattern, not against an absolute limit. They route based on what isn’t happening (the silent dropout) as much as on what is.
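Since I never built this phase, the following is only a sketch of what judging against a local pattern could mean in its simplest form: a per-hour baseline learned from one sensor's history, with a reading flagged when it strays too far from that hour's norm. The function names, the factor `k`, and the `min_std` floor are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_baseline(history: list) -> dict:
    """Map hour-of-day to (mean, std) from (hour, value) history for one sensor."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def is_abnormal(baseline: dict, hour: int, value: float,
                k: float = 3.0, min_std: float = 0.5) -> bool:
    """True when value is more than k deviations from the site's norm for that hour."""
    if hour not in baseline:
        # No history yet. A real system would fall back to a fixed threshold
        # here; this sketch simply stays quiet.
        return False
    mu, sigma = baseline[hour]
    # min_std keeps a very steady sensor from alerting on tiny wobbles.
    return abs(value - mu) > k * max(sigma, min_std)
```

With a baseline like this, a compressor that always runs warm during its morning warm-up stays quiet at that hour, while the same temperature in mid-afternoon gets flagged.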

This is the direction every monitoring system I’ve worked on since has needed to go, regardless of substrate. Industrial refrigeration. Trading-platform risk. Production telemetry. The pattern is invariant: alerts that distinguish unusual from abnormal are the only ones that survive contact with a real operator.

The principle, finally

Most of the alert systems I’ve reviewed since this engagement have Phase 1’s structure. They were specified by a team that wanted to know when something was wrong. They were built by engineers who wired thresholds to channels. They were rolled out to operators who, within weeks, had figured out how to silence them.

The reason the pattern repeats is that the design problem isn’t the alert. The design problem is the asymmetry between the system’s view of an alert (a row in a database, a payload in a queue) and the operator’s view (a small interruption that has to be worth its cost). When the system can’t model the operator’s cost-of-interruption, it doesn’t matter how accurate the thresholds are — the operator is the bottleneck, and the operator will route around any system that doesn’t respect their cost function.

The cure for “we don’t know when something’s wrong” is not more alerts. It’s an alert system that can be trusted enough to not be silenced. Trust gets built one un-fired alert at a time.