Autonomous AI Operations Agent: LangGraph, a multi-model LLM stack, and human-in-the-loop learning
A production AI agent that autonomously triages and processes customer-support tickets for an NDA-protected SaaS. It proposes every response to a human reviewer in the CRM, executes only what's approved, and learns durable rules from every answer. Replaces a ~2,000-line n8n monolith with a modular Python and LangGraph service running a multi-model LLM stack (DeepSeek for reasoning, Qwen3-VL for vision, Whisper for audio) to keep inference cost predictable at scale.
Replacing a ~2,000-line n8n workflow with a modular Python and LangGraph service that runs a multi-model LLM stack, proposes ticket responses for human review, and learns durable rules from every answer.
Client: NDA-protected B2B SaaS (customer-support and operations platform).
Role: Lead developer. Architecture, full implementation, deployment, ongoing iteration.
Stack: Python 3.12, LangGraph, LangChain, DeepSeek V4 Flash (direct API), Qwen3-VL via Ollama Cloud, OpenAI Whisper, Supermemory, PostgreSQL, Docker Swarm, Traefik, GitHub Actions, Langfuse.
The starting point
The company ran its entire ticket-triage and follow-up automation as a single ~2,000-line workflow in n8n. A new branch was bolted on every time a business rule changed, and by year two no one could safely modify it. Every escalation required a human to read the ticket, dig through the CRM, write a reply, and remember the rule so the next similar ticket would be handled the same way. Knowledge lived in people's heads, not in the system.
The brief was deliberate: make the agent autonomous, but keep the human as the source of truth. No silent resolutions. No black-box rules. Every learned rule traceable to the human answer that produced it.
Architecture
A four-stage pipeline: ingest, route, process, then resolve and learn.
- Ingest. Polls Freshdesk, relays Telegram, accepts Microsoft Teams webhooks, and maps everything to a normalised InboundWorkItem. PostgreSQL holds the audit trail (conversations, messages, external actions).
- Route. An LLM classifier maps the ticket to a service area, type and subtype from a controlled taxonomy. A rule engine (baseline rules plus rules learned from human answers) decides where it goes.
- Process. A worker agent walks a 10-step checklist per ticket using around 40 typed tools wrapped around Freshdesk, Close.com, the CRM, Telegram and Teams.
- Resolve and Learn. Knowledge gaps are grouped by gap type, surfaced to the operator one question at a time, and the answer becomes a durable rule in Postgres plus a semantic chunk in the knowledge base. Tickets blocked on the same gap are automatically re-processed.
Orchestration is LangGraph: explicit work-item state, deterministic transitions, every node observable in Langfuse. The legacy n8n path was free-form; this one trades a little freedom for full auditability.
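To make "explicit work-item state, deterministic transitions" concrete, here is a minimal sketch in plain Python. It is not the LangGraph code from the service: in the real system these stages would be nodes and edges in a LangGraph graph. The `Stage` enum, the fields on `InboundWorkItem`, and the `advance` and `run` helpers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(str, Enum):
    INGEST = "ingest"
    ROUTE = "route"
    PROCESS = "process"
    RESOLVE = "resolve_and_learn"
    DONE = "done"


# Deterministic transitions: every stage has exactly one successor.
NEXT_STAGE = {
    Stage.INGEST: Stage.ROUTE,
    Stage.ROUTE: Stage.PROCESS,
    Stage.PROCESS: Stage.RESOLVE,
    Stage.RESOLVE: Stage.DONE,
}


@dataclass
class InboundWorkItem:
    # Hypothetical fields; the real schema is not shown in this write-up.
    source: str        # e.g. "freshdesk", "telegram", "teams"
    external_id: str
    body: str
    stage: Stage = Stage.INGEST
    history: list = field(default_factory=list)  # stages already completed


def advance(item: InboundWorkItem) -> InboundWorkItem:
    """Move the item one stage forward and record the stage it just left."""
    if item.stage is Stage.DONE:
        raise ValueError("work item already finished")
    item.history.append(item.stage.value)
    item.stage = NEXT_STAGE[item.stage]
    return item


def run(item: InboundWorkItem) -> InboundWorkItem:
    """Drive the item through every stage until it is done."""
    while item.stage is not Stage.DONE:
        advance(item)
    return item
```

Because the state travels with the work item and the transition table is data, the path a ticket took can be reconstructed from `history` alone, which is the property the audit trail depends on.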
Multi-model LLM stack: pick the cheapest model that can do the job
Running one frontier model on every call would have been wasteful. The agent uses three providers, picked per task:
| Task | Model | Provider | Why |
|---|---|---|---|
| Reasoning and tool-calling (propose, execute, chat, classifier, summariser) | DeepSeek V4 Flash | DeepSeek direct API | Cheapest model in its class with reliable function calling. Roughly an order of magnitude lower cost per million tokens than the frontier alternatives we benchmarked, with no measurable quality drop on this workload. |
| Vision (identity documents, utility bills, inline ticket attachments) | Qwen3-VL 235B | Ollama Cloud | DeepSeek doesn't serve vision. Qwen3-VL on Ollama handles UK identity-document extraction reliably at a fraction of a hosted vision API's cost. |
| Call transcription | OpenAI Whisper | OpenAI | Mature, accurate on UK English, priced per minute. No reason to reinvent it. |
The code talks to all three providers through the same OpenAI-compatible interface. Provider precedence is configured at the Settings layer, so swapping or A/B-testing a model is a one-line env change, not a code change. A small benchmark script runs candidate models against a fixed suite (null-param discipline, by-email versus by-client-id routing, boolean argument passing, high-risk routing) so model swaps are measured decisions, not guesses.
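A Settings layer of this kind could look roughly like the sketch below. The environment variable names (`AGENT_<TASK>_MODEL`, `AGENT_<TASK>_BASE_URL`), the `ModelConfig` type, and the placeholder URLs and model ids are all hypothetical; only the idea of per-task overrides read from the environment comes from the description above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    base_url: str
    model: str


# Placeholder defaults for illustration only, not the production values.
DEFAULTS = {
    "reasoning": ModelConfig("https://reasoning.example/v1", "reasoning-model"),
    "vision": ModelConfig("https://vision.example/v1", "vision-model"),
}


def resolve_model(task: str, env=None) -> ModelConfig:
    """Return the config for a task, letting env vars override the defaults."""
    env = os.environ if env is None else env
    default = DEFAULTS[task]
    prefix = f"AGENT_{task.upper()}"
    return ModelConfig(
        base_url=env.get(f"{prefix}_BASE_URL", default.base_url),
        model=env.get(f"{prefix}_MODEL", default.model),
    )
```

Under these assumptions, trying a candidate vision model means setting `AGENT_VISION_MODEL` in the environment and restarting the service; no call site changes.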
One non-obvious detail that mattered in production: deepseek-v4-flash is a hybrid thinking model and returns HTTP 400 on multi-round tool calls unless the previous turn's reasoning_content is replayed, which LangChain's ChatOpenAI drops by default. The fix was to pin the model to non-thinking mode (extra_body={"thinking": {"type": "disabled"}}), which also matches how the Ollama-hosted variant serves the same model.
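As a sketch, the pin can live in one helper that builds the client arguments. The helper name `reasoning_model_kwargs` is hypothetical, and the base URL is an assumption that should be checked against the provider's documentation; the `extra_body` payload is the one quoted above.

```python
def reasoning_model_kwargs(api_key: str) -> dict:
    """Constructor arguments for an OpenAI-compatible chat client."""
    return {
        "model": "deepseek-v4-flash",
        "api_key": api_key,
        # Assumed endpoint; confirm against the provider's documentation.
        "base_url": "https://api.deepseek.com",
        # Disable thinking mode so no reasoning_content has to be
        # replayed between tool-calling rounds.
        "extra_body": {"thinking": {"type": "disabled"}},
    }


# Assumed usage with LangChain (requires the langchain-openai package):
#   from langchain_openai import ChatOpenAI
#   llm = ChatOpenAI(**reasoning_model_kwargs(api_key))
```

Keeping the pin in a single place means every entry point (propose, execute, chat, classifier, summariser) gets the same mode without repeating the payload.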
The "verified write" contract
Every tool that mutates external state, whether replying to a ticket, updating a workflow step or saving client data, implements the same pattern:
- POST the change.
- Immediately GET the affected resource.
- Compare intended state against actual state.
- Return {ok, verified, discrepancies, ...}.
If verification fails, the tool result tells the agent why. The agent cannot silently claim "Done" because its own tool output contradicts that claim. This single contract caught a class of silent-failure bugs that an optimistic-write agent would never have surfaced: expired presigned S3 URLs, race conditions on workflow-step transitions, unicode mishandling in custom fields, third-party rate-limit recoveries.
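The write-then-read-back pattern can be sketched in a few lines. This is an illustration under assumptions, not the production tool code: `write` and `read` stand in for the POST and GET calls against the external system, and the `WriteResult` shape mirrors only the `ok`, `verified` and `discrepancies` fields named above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class WriteResult:
    ok: bool         # the write call itself did not raise
    verified: bool   # the read-back matched every intended field
    discrepancies: Dict[str, Dict[str, Any]] = field(default_factory=dict)


def verified_write(
    write: Callable[[Dict[str, Any]], None],
    read: Callable[[], Dict[str, Any]],
    intended: Dict[str, Any],
) -> WriteResult:
    """Apply a change, read the resource back, and report any mismatch."""
    try:
        write(intended)
    except Exception as exc:
        # The write failed outright, so there is nothing to verify.
        return WriteResult(
            ok=False,
            verified=False,
            discrepancies={"_error": {"message": str(exc)}},
        )
    actual = read()
    # Compare only the fields the caller intended to set.
    discrepancies = {
        key: {"intended": value, "actual": actual.get(key)}
        for key, value in intended.items()
        if actual.get(key) != value
    }
    return WriteResult(ok=True, verified=not discrepancies, discrepancies=discrepancies)
```

The useful case is `ok=True, verified=False`: the API accepted the request but the resource did not end up in the intended state, and the `discrepancies` map gives the agent the field-level reason.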
Human-in-the-loop: one question at a time
Escalations go to Telegram and the contract is deliberately narrow:
- One question at a time. The agent never dumps a list. It picks the highest-leverage gap, asks, and waits.
- The answer becomes a rule. Saved to Postgres (deterministic, queryable) and to the Supermemory KB (semantic, retrievable across new tickets).
- The whole backlog re-processes. Other tickets blocked on the same gap are picked up automatically. One operator answer unblocks N tickets.
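The three rules above can be sketched with an in-memory backlog. Everything here is hypothetical: the `GapBacklog` class, the gap keys, and in particular the definition of "highest-leverage" as "blocks the most tickets", which is one plausible reading rather than the service's actual heuristic. The real system persists rules to Postgres and the KB instead of a dict.

```python
from collections import defaultdict


class GapBacklog:
    def __init__(self):
        self.blocked = defaultdict(list)  # gap key -> ticket ids waiting on it
        self.rules = {}                   # gap key -> operator answer

    def block(self, ticket_id, gap_key):
        """Park a ticket until the given knowledge gap is answered."""
        self.blocked[gap_key].append(ticket_id)

    def next_question(self):
        """Return the single gap that blocks the most tickets, or None."""
        open_gaps = {key: ids for key, ids in self.blocked.items() if ids}
        if not open_gaps:
            return None
        return max(open_gaps, key=lambda key: len(open_gaps[key]))

    def answer(self, gap_key, rule):
        """Store the answer as a rule and release every ticket it unblocks."""
        self.rules[gap_key] = rule
        return self.blocked.pop(gap_key, [])
```

The list returned by `answer` is what would be fed back into the processing stage, so one operator reply releases every ticket that was waiting on that gap.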
For Telegram chat, the same agent answers direct asks ("check ticket 12345", "process ticket 67890") through the same tool registry. Chat and autonomy aren't two systems; they're two entry points to the same one.
Source of truth lives in the CRM
The autonomous loop never resolves a ticket on its own. It generates a Markdown proposal with a confidence score and posts it to an "AI Ticket Review" surface in the CRM. A human clicks Approve and Send (or Leave for Human), and only then does the executor run the full write tool-chain on Freshdesk and the CRM. Approvals are deduplicated in Postgres so a missed webhook or a container restart cannot re-execute a ticket.
This is a deliberate inversion of the usual "agent acts, human supervises" pattern. The CRM is the source of truth; the agent is a competent intern that always asks before sending.
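Approval deduplication of this kind usually comes down to a uniqueness constraint plus an idempotent insert. The sketch below uses SQLite from the standard library as a stand-in so it runs anywhere; in PostgreSQL the equivalent statement would be `INSERT ... ON CONFLICT DO NOTHING`. The table name, columns, and the `claim_approval` helper are assumptions, not the production schema.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory approvals table keyed by ticket id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE approvals (ticket_id TEXT PRIMARY KEY, decision TEXT NOT NULL)"
    )
    return conn


def claim_approval(conn: sqlite3.Connection, ticket_id: str, decision: str) -> bool:
    """Return True only for the first approval recorded for a ticket."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO approvals (ticket_id, decision) VALUES (?, ?)",
            (ticket_id, decision),
        )
    # rowcount is 1 when the row was inserted and 0 when it already existed.
    return cur.rowcount == 1
```

The executor would run the write tool-chain only when `claim_approval` returns True, so a replayed webhook or a restart that re-delivers the same approval becomes a no-op.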
Memory boundaries
Three memory surfaces, kept strictly separate:
- Supermemory user::{user_id}. Per-user episodic memory and thread summaries.
- Supermemory kb:global. Shared KB: learned rules, operator preferences, safety policies.
- PostgreSQL. Audit trail only. Never used for LLM context.
KB ingestion (URLs, pasted text, call transcripts) is a separate pipeline from conversation memory. Mixing them produced subtle bleed in early versions, with a private user fact ending up in a global retrieval, so the boundary is now enforced at the service layer, not by convention.
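Enforcing the boundary at the service layer can be as simple as making the container tag impossible to construct incorrectly. The sketch below assumes a single choke point, `tag_for_write`, that every memory write goes through; that function, `BoundaryError`, and the user-id pattern are hypothetical, while the `user::{user_id}` and `kb:global` tags are the ones listed above.

```python
import re
from typing import Optional

KB_GLOBAL = "kb:global"
_USER_ID = re.compile(r"^[A-Za-z0-9_-]+$")  # assumed id format


class BoundaryError(ValueError):
    """Raised when a write would cross the user/KB memory boundary."""


def user_tag(user_id: str) -> str:
    """Build the per-user tag, rejecting ids that could spoof another tag."""
    if not _USER_ID.match(user_id):
        raise BoundaryError(f"invalid user id: {user_id!r}")
    return f"user::{user_id}"


def tag_for_write(kind: str, user_id: Optional[str] = None) -> str:
    """Return the only tag a write of this kind is allowed to use."""
    if kind == "kb":
        if user_id is not None:
            raise BoundaryError("KB writes must not carry a user id")
        return KB_GLOBAL
    if kind == "conversation":
        if user_id is None:
            raise BoundaryError("conversation memory requires a user id")
        return user_tag(user_id)
    raise BoundaryError(f"unknown memory kind: {kind!r}")
```

With this shape, the bleed described above (a private user fact landing in a global retrieval) fails loudly at write time instead of surfacing later as a bad retrieval.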
Infrastructure
- Docker Swarm single-node, behind Traefik with Let's Encrypt (HTTP-01).
- GitHub Actions: tests, lint, rsync, remote docker service update --force, healthz poll. Concurrency-serialised so two pushes cannot fight for the same swarm task. Manual workflow_dispatch rollback to any ref.
- .env lives on the host (bind-mounted read-only), untouched by deploys. Env-only changes are a one-line edit plus a forced service update, no rebuild.
- Langfuse for trace inspection: every LLM call, every tool call, every retry, with cost and latency rolled up per ticket.
- 184 automated tests on every push. Verified-write tools have contract tests against recorded HTTP fixtures.
What this build is actually about
The technically interesting bits are the multi-model stack and the verified-write contract. The important bit is more boring: the system is auditable. Every action has a trace, every rule has a source answer, every escalation has one clear question. That is what made it safe to flip from "human does everything" to "agent does everything except claim it is done."
The replacement of the n8n monolith ran in shadow for two weeks, then took over autonomously. Average human-touch time on the long tail of routine tickets dropped from minutes to seconds, and operator attention moved to the few cases that actually need a person.