How AT&T Reduced Network Incident Response Time by 40% with AI
Velocity AI · April 16, 2026 · 6 min read
How Velocity AI deployed an autonomous AI triage agent for AT&T's network operations center, cutting mean time to resolution by 40% and eliminating 80% of false-positive alert noise — in 60 days.
AI transformations in telecom network operations don't get more concrete than this: AT&T's network operations center was processing 3,200 alerts per day, 80% of them false positives that still required a human analyst to review and close. Mean time to resolution on genuine incidents was 47 minutes — not because the fixes were complicated, but because analysts were buried under noise.
Sixty days after engagement start, an AI triage agent was handling 68% of incoming alerts autonomously. Mean time to resolution dropped to 28 minutes. False-positive escalations to human analysts dropped by 83%.
"The first week the agent was in production, our Tier 1 lead came to me and said, 'I think something's wrong — we're not getting any tickets.' That was the point." — Director of Network Operations, AT&T
The Challenge
AT&T's network operations center monitors a distributed infrastructure spanning hundreds of thousands of network nodes across the continental United States. The monitoring system is necessarily sensitive — missing a genuine fault is far more costly than generating a false positive.
The result was an alert volume that had grown far beyond what the analyst team could meaningfully process. Analysts were spending 60–70% of their time reviewing and closing false-positive alerts — tickets that required a human to look at a dashboard, confirm nothing was wrong, and close them out. The rest of their time went to genuine incidents, but by the time a genuine incident surfaced, it had already been sitting in a queue alongside hundreds of false positives.
Three specific problems needed to be solved:
Alert classification at scale. The monitoring system could not distinguish between a genuine fault and a false positive generated by routine maintenance, known intermittent issues, or monitoring system artifacts. Every alert required human judgment.
Runbook execution without human involvement. For confirmed faults, Tier 1 resolution followed documented runbook procedures in 85% of cases. Analysts were executing the same 12 runbooks repeatedly. There was no technical reason this required human execution.
Context aggregation before escalation. When a genuine fault was escalated to Tier 2, analysts were spending 8–12 minutes gathering context before they could begin diagnosis. Incident history, asset information, related alert patterns — all of this was available in disparate systems, but required manual retrieval.
The Solution
Velocity AI deployed a multi-step AI triage agent integrated with AT&T's ServiceNow instance and network monitoring infrastructure.
Alert classification layer. The agent received incoming alerts and assessed them against a classification model trained on 18 months of historical alert data, including analyst disposition decisions. Alerts classified as false positives with high confidence were automatically closed with a documented rationale. Alerts classified as genuine or uncertain were passed to the remediation layer.
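Velocity AI hasn't published the agent's code, but the routing rule described above can be sketched in a few lines. Everything here, from the class names to the threshold values, is illustrative rather than the production system:

```python
# Minimal sketch of confidence-threshold routing for incoming alerts.
# Function names and thresholds are illustrative, not the delivered code.
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    AUTO_CLOSE = "auto_close"   # high-confidence false positive
    REMEDIATE = "remediate"     # genuine fault, pass to the runbook layer
    ESCALATE = "escalate"       # uncertain, send to a human analyst


@dataclass
class Classification:
    is_false_positive: bool
    confidence: float           # 0.0-1.0 score from the trained model
    rationale: str              # documented reason attached to the ticket


# Hypothetical thresholds; in practice these were tuned during the
# 21-day supervised period described later in the article.
FP_CLOSE_THRESHOLD = 0.95
GENUINE_THRESHOLD = 0.80


def route_alert(c: Classification) -> Disposition:
    """Decide what happens to an alert based on the classifier output."""
    if c.is_false_positive and c.confidence >= FP_CLOSE_THRESHOLD:
        return Disposition.AUTO_CLOSE
    if not c.is_false_positive and c.confidence >= GENUINE_THRESHOLD:
        return Disposition.REMEDIATE
    # Anything the model is unsure about goes to a human.
    return Disposition.ESCALATE
```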
Runbook execution layer. For classified genuine faults, the agent matched the fault signature against AT&T's runbook library — a corpus of 400+ documented procedures that had been loaded into the agent's context. For faults matching a runbook within the agent's authorization scope, the agent executed the remediation steps via ServiceNow API calls, confirmed resolution, and closed the ticket.
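A simplified sketch of runbook matching and scoped execution follows. The runbook schema, the matching rule, the instance URL, and the credentials are all placeholders, and the ServiceNow Table API call is shown in reduced form:

```python
# Illustrative sketch of runbook matching and in-scope execution.
# Schema, matching logic, instance URL, and credentials are assumptions.
import requests

SNOW_INSTANCE = "https://example.service-now.com"   # placeholder instance
AUTH = ("agent_user", "secret")                      # placeholder credentials


def find_runbook(fault_signature: str, runbooks: list[dict]) -> dict | None:
    """Return the first runbook whose signature matches the fault, if any."""
    for rb in runbooks:
        if rb["signature"] == fault_signature:
            return rb
    return None


def execute_runbook(rb: dict, incident_sys_id: str, authorized_actions: set[str]) -> bool:
    """Run each step only if it falls inside the agent's authorization scope."""
    for step in rb["steps"]:
        if step["action"] not in authorized_actions:
            return False   # out of scope: abort and let the escalation path handle it
        step["run"]()      # placeholder: invoke the actual remediation action

    # Close the ticket through the ServiceNow Table API (simplified).
    requests.patch(
        f"{SNOW_INSTANCE}/api/now/table/incident/{incident_sys_id}",
        auth=AUTH,
        json={"state": "7", "close_notes": f"Auto-resolved via runbook {rb['id']}"},
        timeout=10,
    )
    return True
```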
Context pre-aggregation for escalation. For faults requiring Tier 2 escalation — either outside the agent's authorization scope or not matching a documented runbook — the agent assembled a context packet before escalating: incident history for the affected assets, related alert patterns from the preceding 24 hours, current network topology data, and the agent's assessment of likely root cause. Human analysts received escalations with context already assembled.
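The context packet itself is easy to picture as a small structured object. The field names and example values below are illustrative, not AT&T's actual schema:

```python
# Sketch of the escalation "context packet" described above.
# Field names and example values are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class ContextPacket:
    asset_ids: list[str]
    incident_history: list[dict]     # prior tickets for the affected assets
    related_alerts_24h: list[dict]   # correlated alerts from the preceding 24 hours
    topology_snapshot: dict          # current network topology data
    likely_root_cause: str           # the agent's own assessment


# Example of a packet attached to a Tier 2 escalation (illustrative values).
packet = ContextPacket(
    asset_ids=["node-atl-104", "node-atl-105"],
    incident_history=[{"ticket": "INC0012345", "closed": "2025-11-02"}],
    related_alerts_24h=[{"alert_id": "ALRT-99812", "type": "link_flap"}],
    topology_snapshot={"region": "southeast", "ring": "R-17"},
    likely_root_cause="intermittent optical link degradation on ring R-17",
)
```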
The Results
Mean time to resolution on genuine network faults — down from 47 minutes to 28 minutes — driven by eliminating false-positive queue wait and pre-assembling Tier 2 context automatically.
Source: Velocity AI client delivery data, 2025
Day 58: Full production deployment. The supervised period ended 3 days early after agent accuracy on live traffic exceeded the threshold for autonomous operation.
| Metric | Before | After | Change |
|---|---|---|---|
| Alert volume handled autonomously | 0% | 68% | +68 pp |
| False positives escalated to analysts | ~2,500/day | ~430/day | −83% |
| Mean time to resolution (genuine faults) | 47 min | 28 min | −40% |
| Time to context for Tier 2 escalations | 8–12 min | 45 sec | −92% |
| Analyst hours on Tier 1 triage | ~6 hrs/analyst/day | ~1 hr/analyst/day | −83% |
The 40% reduction in mean time to resolution came from two sources in roughly equal measure: the elimination of queue wait time (genuine incidents no longer sat behind hundreds of false positives) and the elimination of context-gathering time for Tier 2 escalations.
What Made It Work
Three decisions early in the engagement determined the outcome.
Authorization scope was defined before development began. The team spent the first week mapping exactly what the agent was authorized to do — which runbook steps, which ServiceNow actions, which network assets — before writing a line of code. This prevented scope creep during development and gave the compliance team a clear authorization boundary to sign off on.
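One way to make that boundary concrete is to express the scope as data rather than prose, so compliance signs off on a fixed artifact and the agent checks every proposed action against it. The structure below is a hypothetical sketch, not the delivered configuration:

```python
# Sketch of an explicit authorization scope expressed as reviewable data.
# Action names, runbook IDs, and asset classes are illustrative only.
AGENT_AUTHORIZATION_SCOPE = {
    # ServiceNow actions the agent may perform on its own
    "servicenow_actions": {"close_incident", "update_work_notes", "attach_context"},
    # Runbook identifiers the agent may execute end to end
    "runbooks": {"RB-001", "RB-004", "RB-017"},
    # Asset classes the agent may touch; everything else escalates
    "asset_classes": {"edge_router", "access_switch"},
}


def is_authorized(action: str, runbook_id: str, asset_class: str) -> bool:
    """Every proposed agent action is checked against the signed-off scope."""
    scope = AGENT_AUTHORIZATION_SCOPE
    return (
        action in scope["servicenow_actions"]
        and runbook_id in scope["runbooks"]
        and asset_class in scope["asset_classes"]
    )
```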
The agent was trained on AT&T's own runbooks, not generic documentation. The value of an AI agent in a network operations context is directly proportional to the quality of the knowledge it can draw on. AT&T's runbook library was comprehensive and well-maintained. The agent's classification and remediation accuracy reflected that investment.
The supervised production period was taken seriously. The 21-day supervised period — where the agent operated on live traffic but every action was reviewed by a human before execution — was not treated as a formality. Analysts logged every disagreement, and the team used those disagreements to refine the agent's confidence thresholds and escalation logic. The production agent that went live at day 58 was meaningfully better than the agent that entered supervised operation at day 37.
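As a rough illustration of how logged disagreements can become training signal, a conservative tuning rule might raise the auto-close threshold above the highest confidence at which analysts overturned the agent. The log format and the rule below are assumptions, not Velocity AI's method:

```python
# Sketch of threshold refinement from analyst disagreement logs.
# The record format and the tuning rule are assumptions for illustration.
def refine_threshold(disagreements: list[dict], current_threshold: float) -> float:
    """Raise the auto-close threshold if analysts overturned auto-close decisions.

    Each disagreement record is assumed to look like:
        {"agent_action": "auto_close", "analyst_verdict": "genuine", "confidence": 0.91}
    """
    overturned = [
        d["confidence"]
        for d in disagreements
        if d["agent_action"] == "auto_close" and d["analyst_verdict"] == "genuine"
    ]
    if not overturned:
        return current_threshold
    # Conservative rule: never auto-close below the highest confidence at which
    # the agent was wrong during the supervised period, plus a small margin.
    return max(current_threshold, max(overturned) + 0.02)
```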
Key Takeaways
- Define authorization scope before development begins — not during or after
- AI agents in operational contexts are only as good as the knowledge they draw on; invest in documentation quality before deployment
- Supervised production periods produce better agents — treat analyst disagreements as training signal
- The ROI driver in this case was not eliminating analyst roles but redeploying them from low-value triage to high-value diagnosis
- A 60-day deployment timeline is achievable with clean data, well-documented runbooks, and a clear use case scope
Frequently Asked Questions
What AI technology was used in the AT&T network operations deployment?
A multi-step AI triage agent: a classification layer trained on 18 months of AT&T's historical alert data and analyst disposition decisions, a runbook execution layer working from AT&T's own library of 400+ documented procedures, and a context pre-aggregation layer for Tier 2 escalations.

How was the AI triage agent integrated with existing NOC workflows?
The agent was integrated with AT&T's ServiceNow instance and network monitoring infrastructure. It closed high-confidence false positives with a documented rationale, executed in-scope runbooks via ServiceNow API calls, and attached a pre-assembled context packet to every Tier 2 escalation.

How long did the deployment take?
Sixty days from engagement start to full production, including a 21-day supervised period in which every agent action was reviewed by a human before execution.

What happened to the NOC analysts whose work the AI agent took over?
They were redeployed from low-value Tier 1 triage to higher-value Tier 2 diagnosis. The ROI driver was redeployment, not eliminating analyst roles.