
How AT&T Reduced Network Incident Response Time by 40% with AI

Velocity AI · April 16, 2026 · 6 min read

How Velocity AI deployed an autonomous AI triage agent for AT&T's network operations center, cutting mean time to resolution by 40% and eliminating 80% of false-positive alert noise — in 60 days.

AI transformations in telecom network operations don't get more concrete than this: AT&T's network operations center was processing 3,200 alerts per day, with 80% being false positives that required a human analyst to review and close. Mean time to resolution on genuine incidents was 47 minutes — not because the fixes were complicated, but because analysts were buried under noise.

Sixty days after engagement start, an AI triage agent was handling 68% of incoming alerts autonomously. Mean time to resolution dropped to 28 minutes. False-positive escalations to human analysts dropped by 83%.

"The first week the agent was in production, our Tier 1 lead came to me and said, 'I think something's wrong — we're not getting any tickets.' That was the point." — Director of Network Operations, AT&T

The Challenge

AT&T's network operations center monitors a distributed infrastructure spanning hundreds of thousands of network nodes across the continental United States. The monitoring system is necessarily sensitive — missing a genuine fault is far more costly than generating a false positive.

The result was an alert volume that had grown far beyond what the analyst team could meaningfully process. Analysts were spending 60–70% of their time reviewing and closing false-positive alerts — tickets that required a human to look at a dashboard, confirm nothing was wrong, and close it. The remainder of their time went to genuine incidents, but by the time a genuine incident surfaced, it had already been sitting in a queue alongside hundreds of false positives.

Three specific problems needed to be solved:

Alert classification at scale. The monitoring system could not distinguish between a genuine fault and a false positive generated by routine maintenance, known intermittent issues, or monitoring system artifacts. Every alert required human judgment.

Runbook execution without human involvement. For confirmed faults, Tier 1 resolution followed documented runbook procedures in 85% of cases. Analysts were executing the same 12 runbooks repeatedly. There was no technical reason this required human execution.

Context aggregation before escalation. When a genuine fault was escalated to Tier 2, analysts were spending 8–12 minutes gathering context before they could begin diagnosis. Incident history, asset information, related alert patterns — all of this was available in disparate systems, but required manual retrieval.

The Solution

Velocity AI deployed a multi-step AI triage agent integrated with AT&T's ServiceNow instance and network monitoring infrastructure.

Alert classification layer. The agent received incoming alerts and assessed them against a classification model trained on 18 months of historical alert data, including analyst disposition decisions. Alerts classified as false positives with high confidence were automatically closed with a documented rationale. Alerts classified as genuine or uncertain were passed to the remediation layer.
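The exact model interface is AT&T's, but the routing logic is easy to picture. Here is a minimal sketch in Python — the thresholds, field names, and dispositions are hypothetical illustrations, not values from the actual deployment:

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    AUTO_CLOSE = "auto_close"      # high-confidence false positive
    REMEDIATE = "remediate"        # high-confidence genuine fault
    HUMAN_REVIEW = "human_review"  # uncertain -- route to an analyst


@dataclass
class AlertAssessment:
    alert_id: str
    p_false_positive: float  # model output: probability the alert is noise
    rationale: str           # documented reason attached to the ticket


# Hypothetical thresholds -- in practice these were tuned during the
# supervised production period rather than fixed up front.
FALSE_POSITIVE_THRESHOLD = 0.97
GENUINE_FAULT_THRESHOLD = 0.10


def route_alert(assessment: AlertAssessment) -> Disposition:
    """Route a scored alert to auto-close, remediation, or human review."""
    if assessment.p_false_positive >= FALSE_POSITIVE_THRESHOLD:
        return Disposition.AUTO_CLOSE
    if assessment.p_false_positive <= GENUINE_FAULT_THRESHOLD:
        return Disposition.REMEDIATE
    return Disposition.HUMAN_REVIEW
```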

Runbook execution layer. For classified genuine faults, the agent matched the fault signature against AT&T's runbook library — a corpus of 400+ documented procedures that had been loaded into the agent's context. For faults matching a runbook within the agent's authorization scope, the agent executed the remediation steps via ServiceNow API calls, confirmed resolution, and closed the ticket.
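For the ticket-closing step, the write-back goes through ServiceNow. A rough sketch of what that could look like against ServiceNow's standard Table API — the instance URL, field values, and runbook-matching logic below are illustrative assumptions, not AT&T's configuration:

```python
import requests

SERVICENOW_INSTANCE = "https://example.service-now.com"  # placeholder instance URL


def match_runbook(fault_signature: str, runbook_index: dict[str, str]) -> str | None:
    """Look up a documented runbook for this fault signature, if one exists."""
    return runbook_index.get(fault_signature)


def close_remediated_incident(session: requests.Session, sys_id: str, runbook_id: str) -> None:
    """Mark a ServiceNow incident resolved after the runbook steps succeed.

    Uses ServiceNow's standard Table API; the state code and close fields
    assume a default incident workflow and would differ per instance.
    """
    resp = session.patch(
        f"{SERVICENOW_INSTANCE}/api/now/table/incident/{sys_id}",
        json={
            "state": "6",  # "Resolved" in the out-of-the-box incident model
            "close_code": "Solved (Permanently)",
            "close_notes": f"Auto-remediated by triage agent via runbook {runbook_id}",
        },
        timeout=30,
    )
    resp.raise_for_status()
```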

Context pre-aggregation for escalation. For faults requiring Tier 2 escalation — either outside the agent's authorization scope or not matching a documented runbook — the agent assembled a context packet before escalating: incident history for the affected assets, related alert patterns from the preceding 24 hours, current network topology data, and the agent's assessment of likely root cause. Human analysts received escalations with context already assembled.
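As a sketch of what such a context packet might contain — the field names and retrieval clients here are placeholders for whatever interfaces the real CMDB, alert store, and topology systems expose, not the production schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ContextPacket:
    """Everything a Tier 2 analyst needs in hand before starting diagnosis."""
    incident_id: str
    asset_history: list[dict]   # prior incidents on the affected assets
    related_alerts: list[dict]  # alert patterns from the preceding 24 hours
    topology_snapshot: dict     # current network topology around the assets
    likely_root_cause: str      # the agent's own assessment, clearly labeled


def assemble_context(incident_id: str, assets: list[str], now: datetime,
                     cmdb, alert_store, topology, agent) -> ContextPacket:
    """Gather escalation context from each backing system before handing off."""
    window_start = now - timedelta(hours=24)
    return ContextPacket(
        incident_id=incident_id,
        asset_history=[cmdb.incident_history(asset) for asset in assets],
        related_alerts=alert_store.query(assets=assets, since=window_start),
        topology_snapshot=topology.snapshot(assets),
        likely_root_cause=agent.assess_root_cause(incident_id),
    )
```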

The Results

−40%

Mean time to resolution on genuine network faults — down from 47 minutes to 28 minutes — driven by eliminating false-positive queue wait and pre-assembling Tier 2 context automatically.

Source: Velocity AI client delivery data, 2025

Day 58: Full production deployment. The supervised period ended 3 days early after agent accuracy on live traffic exceeded the threshold for autonomous operation.

| Metric | Before | After | Change |
|---|---|---|---|
| Alert volume handled autonomously | 0% | 68% | +68pp |
| False positives escalated to analysts | ~2,500/day | ~430/day | −83% |
| Mean time to resolution (genuine faults) | 47 min | 28 min | −40% |
| Time to context for Tier 2 escalations | 8–12 min | 45 sec | −92% |
| Analyst hours on Tier 1 triage | ~6 hrs/analyst/day | ~1 hr/analyst/day | −83% |

The 40% reduction in mean time to resolution came from two sources in roughly equal measure: the elimination of queue wait time (genuine incidents no longer sat behind hundreds of false positives) and the elimination of context-gathering time for Tier 2 escalations.

What Made It Work

Three decisions early in the engagement determined the outcome.

Authorization scope was defined before development began. The team spent the first week mapping exactly what the agent was authorized to do — which runbook steps, which ServiceNow actions, which network assets — before writing a line of code. This prevented scope creep during development and gave the compliance team a clear authorization boundary to sign off on.
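One way to make that boundary concrete and machine-checkable is an explicit allowlist the agent consults before every action. A hypothetical sketch — the runbook IDs, action names, and asset classes are placeholders, not the signed-off scope:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorizationScope:
    """Explicit allowlist checked before the agent takes any action."""
    runbooks: frozenset[str]            # runbook IDs the agent may execute
    servicenow_actions: frozenset[str]  # e.g. "close_incident", "update_work_notes"
    asset_classes: frozenset[str]       # network asset classes in scope

    def permits(self, runbook: str, action: str, asset_class: str) -> bool:
        return (runbook in self.runbooks
                and action in self.servicenow_actions
                and asset_class in self.asset_classes)


# Hypothetical scope -- the real boundary was defined jointly with compliance.
SCOPE = AuthorizationScope(
    runbooks=frozenset({"RB-014", "RB-027"}),
    servicenow_actions=frozenset({"close_incident", "update_work_notes"}),
    asset_classes=frozenset({"edge_router", "aggregation_switch"}),
)
```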

The agent was trained on AT&T's own runbooks, not generic documentation. The value of an AI agent in a network operations context is directly proportional to the quality of the knowledge it can draw on. AT&T's runbook library was comprehensive and well-maintained. The agent's classification and remediation accuracy reflected that investment.

The supervised production period was taken seriously. The 21-day supervised period — where the agent operated on live traffic but every action was reviewed by a human before execution — was not treated as a formality. Analysts logged every disagreement, and the team used those disagreements to refine the agent's confidence thresholds and escalation logic. The production agent that went live at day 58 was meaningfully better than the agent that entered supervised operation at day 37.
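A toy version of that refinement loop might pick the auto-close threshold that would have produced the fewest wrong auto-closes on the logged disagreements. The data structures and selection criterion below are illustrative; the real tuning also weighed missed genuine faults far more heavily than extra escalations:

```python
from dataclasses import dataclass


@dataclass
class Disagreement:
    alert_id: str
    agent_disposition: str    # what the agent wanted to do
    analyst_disposition: str  # what the reviewing analyst actually did
    agent_confidence: float   # the agent's false-positive confidence at the time


def tune_auto_close_threshold(disagreements: list[Disagreement],
                              candidates: list[float]) -> float:
    """Return the candidate threshold minimizing wrong auto-closes on the log."""
    def wrong_auto_closes(threshold: float) -> int:
        return sum(
            1 for d in disagreements
            if d.agent_confidence >= threshold
            and d.agent_disposition == "auto_close"
            and d.analyst_disposition != "auto_close"
        )
    return min(candidates, key=wrong_auto_closes)
```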

Key Takeaways

  • Define authorization scope before development begins — not during or after
  • AI agents in operational contexts are only as good as the knowledge they draw on; invest in documentation quality before deployment
  • Supervised production periods produce better agents — treat analyst disagreements as training signal
  • The ROI driver in this case was not eliminating analyst roles but redeploying them from low-value triage to high-value diagnosis
  • A 60-day deployment timeline is achievable with clean data, well-documented runbooks, and a clear use case scope

Frequently Asked Questions

What AI technology was used in the AT&T network operations deployment?
The deployment used a large language model as the core reasoning engine, integrated with AT&T's existing ServiceNow instance and network monitoring infrastructure via REST APIs. The agent was built using a custom orchestration layer rather than an off-the-shelf agent framework, which allowed fine-grained control over tool permissions and escalation logic. The underlying model was a fine-tuned version of an enterprise LLM trained on AT&T's internal runbook documentation.
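To illustrate what fine-grained control over tool permissions can mean in practice, here is a hypothetical single step of such an orchestration loop — the function names, state shape, and gating logic are assumptions for illustration, not Velocity AI's implementation:

```python
from typing import Any, Callable


def run_agent_step(
    propose: Callable[[dict], dict],        # LLM wrapper: state -> proposed tool call
    tools: dict[str, Callable[..., Any]],   # registered tool implementations
    allowed_tools: frozenset[str],          # the signed-off authorization boundary
    state: dict,
) -> dict:
    """One iteration of a hypothetical orchestration loop.

    The model proposes a tool call; the permission gate rejects anything
    outside the authorized tool set before it can touch ServiceNow or the
    network, and flags the incident for human escalation instead.
    """
    proposal = propose(state)  # e.g. {"tool": "close_incident", "args": {...}}
    if proposal["tool"] not in allowed_tools:
        return {**state, "escalated": True,
                "reason": f"tool {proposal['tool']} outside authorization scope"}
    result = tools[proposal["tool"]](**proposal["args"])
    return {**state, "last_action": proposal, "last_result": result}
```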
How was the AI triage agent integrated with existing NOC workflows?
The agent was integrated as an additional tier in the existing escalation model — it sat between the automated monitoring system and the human Tier 1 analysts. Alerts that the monitoring system flagged were routed to the agent first. The agent assessed the alert against runbook criteria, attempted remediation actions within its authorized scope, and escalated to human analysts for issues outside its confidence threshold or authorization boundary. Human analysts retained full override capability.
How long did the deployment take?
The full deployment — from engagement start to live production traffic — took 58 days. The first 14 days were spent on data audit and runbook analysis. Days 15–42 covered agent development and integration. Days 43–58 were a supervised production period where the agent handled live alerts with human review of every decision before action. The supervised period was cut short from the planned 21 days because agent accuracy in the first week of live traffic exceeded the threshold for full autonomous operation.
What happened to the NOC analysts whose work the AI agent took over?
This is a common and important question. The Tier 1 analysts who had been handling false-positive alert triage were not eliminated — they were redeployed. With the agent handling Tier 1 alert triage, analysts shifted focus to Tier 2 escalations, runbook documentation improvement, and agent performance review. Analyst job satisfaction scores increased in the post-deployment survey, which the team attributed to spending less time on repetitive, low-value ticket resolution.