Robot Terror

Tech experiments, travel adventures, and code explorations by Robert Taylor


1 January 2026

The Robot in the Loop

by Robert Taylor

What if your newest hire knew what to do next?



The Problem Everyone’s Still Trying to Solve

“My site is slow.”

That’s the ticket. A customer’s production system is struggling, and this is all the information you have to work with.

An experienced engineer hears those four words and immediately starts filling in blanks. What kind of site? What’s the hosting stack? What does “slow” actually mean: timeouts, delayed responses, intermittent failures? Is this affecting one user or everyone? They’re not analyzing data yet. They’re generating the questions that will produce useful data. Over thousands of previous cases, they’ve built a mental map that tells them which direction to walk when starting from almost nothing.

A newer agent hears the same four words and freezes. Not because they’re less intelligent, but because they don’t have that map yet. They don’t know which of the hundred possible first questions is the one that matters. So they ask the customer to elaborate, or they start checking things semi-randomly, or they escalate to someone more senior. Each path introduces delay.

This is the knowledge accessibility problem, and every tool in a support organization exists to address it. Account dashboards. Ticket history. Knowledge bases. Internal wikis. Runbooks. Slack channels. Post-It notes on monitor edges. All of them are trying to get the right information to the right person at the right moment. All of them assume the agent already knows what they’re looking for. None of them help with the actual hard part: transforming “my site is slow” into a diagnostic path.

What Experience Actually Provides

I spent several years as a senior systems engineer at Rackspace, handling escalated technical issues. One of the strangest compliments I ever received came from a colleague who said it “freaked him out” that I could hear “my Plesk server’s load average is over 1000” and immediately respond: “rename file X in directory Y.”

It wasn’t magic. It was pattern recognition, but not the kind people usually mean when they use that phrase.

When a ticket came in saying “my site is slow,” I wasn’t scanning through a mental database of symptoms and solutions. I was following a decision tree I’d built over years of exposure. For non-Plesk customers, the first thing I checked wasn’t application logs or server metrics. It was the ingress/egress chart at the network switch. Was there a huge amount of traffic coming in, going out, or neither?

That single observation immediately narrowed the possibility space. Heavy inbound traffic suggested either a DDoS attack or the Slashdot effect, a legitimate traffic spike the server couldn’t handle. Heavy outbound traffic was more concerning: likely a compromised server attacking the rest of the internet. Normal traffic patterns pointed toward application-level issues. One glance at a graph, and I’d eliminated two-thirds of the possibilities before touching the server.
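
The command-line version of that glance looks something like this. It’s a minimal sketch against a switch that exposes IF-MIB counters over SNMP; the address, community string, interface index, and sampling window are all placeholders, not anything specific to that environment:

SWITCH=10.0.0.1; COMMUNITY=public; IFINDEX=12; WINDOW=30
# Sample the 64-bit in/out octet counters on the customer's port, twice
in1=$(snmpget -v2c -c $COMMUNITY -Oqv $SWITCH IF-MIB::ifHCInOctets.$IFINDEX)
out1=$(snmpget -v2c -c $COMMUNITY -Oqv $SWITCH IF-MIB::ifHCOutOctets.$IFINDEX)
sleep $WINDOW
in2=$(snmpget -v2c -c $COMMUNITY -Oqv $SWITCH IF-MIB::ifHCInOctets.$IFINDEX)
out2=$(snmpget -v2c -c $COMMUNITY -Oqv $SWITCH IF-MIB::ifHCOutOctets.$IFINDEX)
# Turn the byte deltas over the window into megabits per second
echo "ingress: $(( (in2 - in1) * 8 / WINDOW / 1000000 )) Mbps"
echo "egress:  $(( (out2 - out1) * 8 / WINDOW / 1000000 )) Mbps"

Crude, but it answers the only question that matters at this stage: in, out, or neither.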

If I saw heavy outbound traffic, I didn’t start hunting through logs manually. I had a script ready that would parse Apache access logs across all domains on the server, correlate requests by time window, and surface the suspect domain within seconds:

TOPFILES=100;TOPIPS=100;DATE=`date +%d.%b.%Y:%H`;for i in `lsof -p $(netstat \
-ltpn|awk '$4 ~ /:80$/ {print substr($7,1,index($7,"/")-1)}')| awk '$9 ~ \
/access.log$/ {print $9| "sort -u"}'` ; do echo "-------"$i"------"$DATE; awk \
'$4 ~ /^.'$DATE'/ {day=substr($4,2,2);hour=substr($4,14,2);sixth= \
substr($4,17,1); hit[day"\t"hour"."sixth"0 - "hour"."sixth"9"]++;ip[$1]++; \
bytes[day"\t"hour"."sixth"0 - "hour"."sixth"9"]+=$10; flds=split($7, req, \
/\//);toss=split(req[flds],fn,/\?/);files[fn[1]]++ } END { for (i in hit) { \
print hit[i]"\t"i"\t"sprintf("%2.2d",bytes[i]/1024)"K"|"sort -k 3,3n"}; print \
"";for (i in ip) { if(ip[i] > '$TOPIPS') { print ip[i]"\t"i|"sort -n;echo \
\"\""}}; for (i in files) { if(files[i] > '$TOPFILES') {print files[i], i| \
"sort -k 1,1n;echo \"\""}} }' $i;done

This wasn’t elegant code. It was a tool forged from necessity, refined over dozens of similar incidents. The point isn’t the script itself. The point is that I knew to look at the switch first, knew what the traffic patterns meant, and had the right diagnostic tool ready before I ever saw the specific ticket. A newer agent facing the same “my site is slow” report would be checking CPU and memory, maybe restarting Apache, maybe asking the customer for more details. All reasonable steps, none of them addressing the actual problem.

The knowledge existed elsewhere too. In resolved tickets from years past. In a wiki page someone wrote in 2007. Probably in a forum thread somewhere. But none of that helped the tier 1 agent who received the initial ticket, because they didn’t know to search for “outbound traffic compromised server” when the customer just said “my site is slow.”

This is the gap. Institutional knowledge exists, but it isn’t operationally accessible at the moment of ambiguity. The experienced engineer’s value isn’t access to special information. It’s knowing how to transform a vague report into a specific investigation.

Why Traditional Solutions Fall Short

The standard approaches to this problem share a common flaw: they assume the agent already has a direction.

Training gives agents a foundation, but you can’t compress years of pattern exposure into a two-week bootcamp. More importantly, training prepares people for scenarios that have been anticipated and documented. The real challenge is the ticket that doesn’t match anything in the curriculum, where the agent has to reason from sparse clues rather than recognize a textbook case.

Documentation fails for the same reason. A runbook is useful when you know which runbook you need. Searching the knowledge base works when you know the right search terms. But when a customer says “my site is slow,” which of the 200 articles on performance issues is relevant? The agent is back to square one, guessing at directions.

Ticketing system search has potential. Somewhere in those thousands of resolved tickets is probably a case just like this one. But finding it requires knowing what to search for, which is exactly the knowledge the junior agent lacks. The experienced engineer could find it because they’d recognize the pattern. The junior agent scrolls through irrelevant results.

Adding more senior engineers doesn’t solve the problem either. It helps in the short term, but it doesn’t scale. It’s expensive. And it creates a bottleneck where experts spend most of their time answering questions they’ve answered hundreds of times before, instead of tackling genuinely novel challenges.

Where AI Actually Helps

The promise of AI in customer service has mostly been framed around two use cases: chatbots that handle simple inquiries automatically, and analytics tools that surface trends from large datasets. Both have value. Neither addresses the core problem of transforming ambiguity into direction.

What would actually help is an AI system that does what the experienced engineer does: take a sparse initial report, pull relevant context automatically, and suggest a diagnostic path. Not solve the problem autonomously, but collapse the possibility space so the human agent knows where to start.

Here’s what that might look like in practice. A ticket comes in: “My site is slow.” The AI agent, with access to the customer’s account information, network metrics, and historical ticket patterns, assembles a situation report:

Customer: Acme Corp (dedicated server DAL-42)
Reported issue: "My site is slow"

INFRASTRUCTURE CHECK:
Switch port metrics for DAL-42 (last 30 minutes):
  Ingress: 2.3 Mbps (normal baseline: 2-4 Mbps)
  Egress:  847 Mbps (normal baseline: 8-15 Mbps)

Egress traffic is approximately 60x normal levels.

SERVER STATE:
  Load average: 34.2 (baseline: 0.4-1.2)
  Apache connections: 847 active
  Memory: 94% utilized

PATTERN ANALYSIS:
Primary hypothesis (91% confidence): Compromised server / outbound attack
  Egress spike with high Apache connections matches attack pattern.
  47 similar cases in history, typically compromised WordPress or
  outdated CMS. Median resolution: 15-30 minutes once source identified.

Secondary hypothesis (23% confidence): Legitimate traffic spike
  Inconsistent with pattern. Legitimate spikes show elevated ingress
  AND egress. This shows egress only, which is characteristic of
  a server generating attack traffic rather than serving legitimate
  requests.

SUGGESTED DIAGNOSTIC PATH:
1. Identify which domain is generating outbound traffic.
   Parse Apache access logs by request volume and bytes served:
   [diagnostic script available - click to copy]

2. Once source domain identified, check for:
   - Recently modified PHP files (common malware injection point)
   - Outdated CMS versions (WordPress, Joomla, Drupal)
   - Suspicious cron jobs or running processes

3. If confirmed compromised: isolate domain, notify customer,
   provide remediation options.

NOTE: If traffic analysis shows distributed requests rather than
concentrated attack traffic, revisit the legitimate spike hypothesis.
Customer may have content going viral.
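
Step 2 of that path, for instance, translates into a few commands any tier 1 agent can run once the suspect docroot is known. A sketch with a made-up path and a made-up user; the real layout will differ:

DOCROOT=/var/www/vhosts/example.com/httpdocs   # placeholder path
# PHP files modified in the last week: the usual malware injection point
find $DOCROOT -name '*.php' -mtime -7 -ls
# If it's WordPress, check how far behind the core version is
grep '^\$wp_version' $DOCROOT/wp-includes/version.php
# Cron jobs and processes running as the site's user (placeholder user)
crontab -l -u example_user
ps -u example_user -o pid,etime,cmd

None of this is sophisticated. The value is that the agent no longer has to know, from memory, that these are the things worth checking.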

The agent hasn’t received an answer. They’ve received a map. Instead of staring at “my site is slow” and wondering where to begin, they have the same infrastructure-first view that an experienced engineer would check, a contextualized interpretation of what the metrics mean, and a concrete first step. The AI has performed the experienced engineer’s first thirty seconds of mental processing.

The Human Remains Essential

This approach only works if the human agent stays in the loop, and not as a formality.

The AI system is pattern matching against historical data, which means it’s only as good as the patterns it has seen. Novel problems, unusual combinations of factors, edge cases that don’t fit the training data: all of these will produce weak or misleading suggestions. The system in the example above includes a secondary hypothesis and explicitly notes conditions under which the primary hypothesis might be wrong. That isn’t a disclaimer. It’s a design requirement.

Human judgment remains essential for interpreting confidence levels. A 91% match sounds high, but it means that roughly one in ten times, the primary hypothesis is wrong. Experienced agents learn to hold hypotheses loosely, to notice when evidence doesn’t quite fit, to recognize when they need to step back and reconsider. Junior agents following AI suggestions need to develop that same skill, which means the AI must be designed to support human reasoning rather than replace it.

There’s also the customer relationship. Technical diagnosis is only part of customer service. Reading emotional cues. Adjusting communication style. Knowing when to apologize versus when to reassure. Deciding how to tell a customer their server was compromised because they didn’t update WordPress for three years. None of this can be automated, and all of it determines whether the customer walks away feeling helped or merely handled.

The right framing isn’t “AI plus human oversight.” It’s human expertise amplified by AI tooling. The agent is still doing the job. They just have access to a tool that compresses the experience gap.

When the Pattern Match Is Wrong

Any honest discussion of AI-assisted support must address failure modes. They’re inevitable, and how the system handles them determines whether it’s actually useful.

Pattern matching is statistical correlation. The AI observes that certain combinations of symptoms have historically been associated with certain root causes, and it projects that association onto new cases. This works well when the new case genuinely resembles the historical pattern. It fails when the resemblance is superficial.

Consider a variation on the earlier example. The AI sees high outbound traffic and flags a likely compromised server. But the customer just launched a new feature that serves large video files, and they didn’t mention it in their ticket. The egress spike is real but benign. An agent who fixates on the compromise hypothesis will waste time hunting for malware that doesn’t exist, possibly alarming the customer unnecessarily, while the actual problem—the server isn’t provisioned for video delivery—goes unaddressed.
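
The evidence that separates those two stories is usually one command away, if the agent thinks to look. A rough sketch, with a placeholder log path: legitimate delivery shows up as large responses to many distinct client IPs, while a compromised box tends to concentrate its established connections on one remote port.

# Largest responses in the access log: benign video delivery looks like a
# handful of media URLs with big byte counts served to many different clients
awk '{print $10, $7}' /var/www/vhosts/example.com/logs/access.log | sort -rn | head
# Remote ports across established connections: inbound web clients show
# scattered ephemeral ports; a box flooding someone else piles up on 80/443/25
netstat -ntp | awk '$6 == "ESTABLISHED" {n=split($5, a, ":"); print a[n]}' | sort | uniq -c | sort -rn | head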

A well-designed system mitigates this in several ways. It surfaces confidence levels honestly, so agents know when to trust a suggestion and when to hold it loosely. It flags what doesn’t fit the pattern, not just what does, giving agents cues for when to reconsider. It provides secondary hypotheses, so agents have somewhere to pivot when the primary path doesn’t pan out. And it explicitly acknowledges uncertainty rather than presenting its best guess as definitive.

The goal isn’t an AI that’s always right. That’s not achievable, and pretending otherwise would make the system dangerous. The goal is an AI that helps humans be right faster, and knows when to say “I’m not sure—proceed carefully.”

The Organizational Case

The benefits of this approach flow to multiple levels of the organization.

For tier 1 and tier 2 agents, it means more successful resolutions and fewer frustrating dead ends. One of the main drivers of burnout in support roles is the feeling of helplessness when facing a problem you don’t know how to solve. Giving agents a diagnostic starting point transforms that experience. They’re still doing the work, but they’re no longer starting from zero.

For tier 3 engineers and senior specialists, it means fewer interruptions for pattern-matchable questions and more time for genuinely complex problems. Over several years of handling escalations, I found that the number of truly unique situations requiring novel investigation was remarkably small. For every case that made me think, there were several where I knew the answer within seconds of reading the ticket. Not because I was smarter, but because I’d seen that exact constellation of symptoms before. If AI tooling can surface those patterns to tier 1 and 2 agents directly, the escalation queue shrinks and senior engineers can focus on work that actually requires their expertise.

For customers, it means faster resolution. The delay in most support interactions isn’t the time it takes to fix the problem once it’s understood. It’s the time spent figuring out what the problem actually is. Compressing that diagnostic phase improves the customer experience even if nothing else changes.

For the business, it means better economics. Tier 1 support costs less than tier 3 support. Faster resolutions mean higher customer satisfaction and lower churn. Senior engineers doing complex work instead of pattern-matching means better retention of expensive talent. None of this requires reducing headcount. It requires making existing headcount more effective.

The Conversation We Should Be Having

The debate around AI in customer service has calcified into two camps. Enthusiasts see full automation as inevitable and desirable, a way to cut costs and scale without limit. Skeptics see AI as a threat to jobs and service quality, a race to the bottom that will leave customers talking to machines that don’t understand them.

Both framings miss what’s actually interesting.

Full automation fails because customer service is fundamentally a human relationship mediated by technical problem-solving, not the reverse. When your production system is down at 2 AM, you don’t want to interact with a chatbot following a decision tree. You want a person who understands the urgency, who can think flexibly, who can tell you honestly what’s happening and what to expect. The technical diagnosis is necessary but not sufficient. The human relationship is what makes it service rather than just support.

But refusing to augment human agents with better tools isn’t a defense of service quality. It’s a decision to leave agents under-equipped, to force them to reinvent the wheel on every shift, to make customers wait while institutional knowledge stays locked in the heads of people who may not be available. The status quo isn’t human-centered service. It’s human-limited service, constrained by the accident of which agent happens to pick up the ticket and what they happen to have seen before.

The interesting question isn’t “AI or humans?” It’s this: what would it look like to make every customer service agent as effective as your best one, without losing what makes your best ones good in the first place?

That’s the future worth building. Not automation. Augmentation.

tags: ai - customer-service - support - augmentation - rackspace