How do we start using AI in our engineering process?

The path of least resistance is automating repetitive tasks: issue triage, release notes generation, AI-assisted code review, and smart alerts. This creates immediate value without requiring a full process redesign.

What is RAG and when should we use it?

RAG (Retrieval-Augmented Generation) connects an LLM to a specific knowledge base, so responses are grounded in your company's internal documents rather than only in the model's training data. It's ideal for internal chatbots, documentation search, and support assistants.

How much does it cost to run LLMs in production?

It depends on the model and volume. GPT-4o costs approximately $2.50 per 1M input tokens. For high volumes, open-source models (Llama, Mistral) running on your own infrastructure can reduce costs by up to 80%, but require more engineering investment. The right choice depends on the cost-latency-quality tradeoff.

What is an AI Agent and how is it different from a chatbot?

A chatbot answers questions. An AI Agent executes tasks: it accesses systems, makes context-based decisions, calls APIs, processes documents, and acts autonomously within defined boundaries. The difference is between a response and an action.

AI agents in production: why most prototypes never ship

The demo is the easy part. The hard part is making it safe enough to leave running.

The prototype worked. The agent read the ticket, queried the database, wrote the report, and emailed the summary. Everyone in the room saw the future.

Then someone asked the obvious question: can we turn this on for real customers? That’s when the conversation shifts from capabilities to consequences. What happens when it deletes the wrong record? What happens when it loops? What happens when someone prompt-injects the ticket description? The demo didn’t have answers. Production demands them.

Most AI agent prototypes never ship because the gap between demo and production is larger than teams expect. It is not a matter of adding tests and calling it done. It is a redesign of how the agent thinks, acts, and is held accountable.

Prototypes hide the hard problems

A prototype is optimized for the happy path. The inputs are clean, the tools respond as expected, and the human is watching. In production, inputs are messy, tools fail silently, and the agent runs unattended for hours.

The prototype also hides cost. A ten-step agent loop that calls an LLM three times per step and queries APIs along the way makes thirty LLM calls per task, which can easily add up to tens of thousands of tokens. Multiply that by concurrent users and you have a budget problem before you have a scaling problem.
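To make the budget problem concrete, here is a rough back-of-the-envelope estimate in Python. The step and call counts come from the paragraph above, and the input price is the approximate GPT-4o figure quoted earlier in this article. The tokens-per-call figure and the daily task volume are purely illustrative assumptions, and output tokens are ignored.

```python
# Back-of-the-envelope input-token cost for one agent task.
# ASSUMPTIONS (illustrative, not measured): tokens per call and tasks per day.

STEPS_PER_TASK = 10              # from the text: a ten-step loop
LLM_CALLS_PER_STEP = 3           # from the text: three LLM calls per step
INPUT_TOKENS_PER_CALL = 2_000    # assumption: prompt plus accumulated context
PRICE_PER_MILLION_INPUT = 2.50   # USD, approximate figure quoted in this article
TASKS_PER_DAY = 5_000            # assumption: hypothetical daily volume


def input_tokens_per_task() -> int:
    """Total input tokens sent to the LLM for a single task."""
    return STEPS_PER_TASK * LLM_CALLS_PER_STEP * INPUT_TOKENS_PER_CALL


def input_cost_per_task() -> float:
    """Input-token cost in USD for a single task."""
    return input_tokens_per_task() / 1_000_000 * PRICE_PER_MILLION_INPUT


def input_cost_per_day() -> float:
    """Input-token cost in USD for a day of tasks at the assumed volume."""
    return input_cost_per_task() * TASKS_PER_DAY


if __name__ == "__main__":
    print(f"tokens per task: {input_tokens_per_task()}")
    print(f"cost per task:   ${input_cost_per_task():.2f}")
    print(f"cost per day:    ${input_cost_per_day():.2f}")
```

Under these assumptions a single task sends 60,000 input tokens and costs about $0.15, which comes to roughly $750 per day at 5,000 tasks. Swap in your own measured numbers; the point is that the multiplication is worth doing before launch.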

Finally, prototypes hide failure modes. An agent that confidently performs the wrong action is worse than a broken script because it looks like it’s working. Deterministic automation fails in obvious ways. Agentic systems fail in plausible-sounding ways.

You need boundaries, not just goals

A demo agent is given a goal: “summarize the support tickets from the last week.” A production agent needs boundaries: which tickets it can read, which systems it can touch, how many tokens it can spend, how long it can run, and what it must never do.

Boundaries should be enforced in architecture, not in prompts. Prompts can be bypassed by better prompts. Architecture can’t. The right model is a small, deterministic permissions layer that sits between the agent and every tool it wants to use.

We usually design three permission scopes:

Read-only tools for research and observation.
Sandboxed write tools that can propose changes but require human approval.
Approved automated actions that are reversible and low-blast-radius.

An agent should not discover mid-run that it has write access to production. It should be impossible for it to have that access in the first place.
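As a minimal sketch of what such a permissions layer could look like, the Python below puts a deny-by-default gateway between the agent and its tools. The names `Scope`, `ToolGateway`, `PermissionDenied`, and the two example tools are hypothetical, chosen for illustration; a real system would also need authentication, auditing, and a proper approval workflow.

```python
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class Scope(Enum):
    READ_ONLY = "read_only"              # research and observation
    SANDBOXED_WRITE = "sandboxed_write"  # proposes changes, needs human approval
    AUTOMATED = "automated"              # reversible, low-blast-radius actions


class PermissionDenied(Exception):
    """Raised when the agent asks for something the gateway does not allow."""


class ToolGateway:
    """Deterministic layer that every tool call must pass through."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tuple[Scope, Callable[..., object]]] = {}

    def register(self, name: str, scope: Scope, fn: Callable[..., object]) -> None:
        self._tools[name] = (scope, fn)

    def call(self, name: str, *, approved_by: Optional[str] = None, **kwargs):
        # Deny by default: an unregistered tool simply does not exist for the agent.
        if name not in self._tools:
            raise PermissionDenied(f"tool not registered: {name}")
        scope, fn = self._tools[name]
        # Sandboxed writes run only when a human approver is recorded.
        if scope is Scope.SANDBOXED_WRITE and approved_by is None:
            raise PermissionDenied(f"{name} requires human approval")
        return fn(**kwargs)


# Hypothetical tools, registered once at startup rather than by the agent.
gateway = ToolGateway()
gateway.register("read_ticket", Scope.READ_ONLY, lambda ticket_id: {"id": ticket_id})
gateway.register(
    "update_ticket",
    Scope.SANDBOXED_WRITE,
    lambda ticket_id, status: f"{ticket_id} -> {status}",
)
```

The agent only ever receives the `gateway.call` interface. Because the check lives in code rather than in the prompt, no instruction in a ticket description can grant access to a tool that was never registered.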

Tool design is more important than model choice

Teams spend weeks comparing models and days designing tools. That’s backwards. A mediocre model with well-designed tools outperforms a great model with brittle tools.

Good tools are specific, idempotent, and observable. They do one thing, they can be retried safely, and they log what happened. Bad tools are broad, stateful, and opaque — “update the database” is a terrible tool because it hides intent and consequences.

Every tool should expose a contract: inputs, outputs, failure modes, side effects, and whether it’s reversible. That contract is what lets you test the agent in isolation, simulate failures, and audit behavior after an incident.
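One way to make such a contract explicit is a small data structure that travels with the tool. The sketch below is an assumption about how this could be modeled in Python; `ToolContract`, its field names, and the `close_ticket` example are hypothetical and not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ToolContract:
    name: str
    inputs: Dict[str, str]      # parameter name -> type description
    outputs: Dict[str, str]     # result field -> type description
    failure_modes: List[str]    # named failures the caller must handle
    side_effects: List[str]     # what changes in the outside world
    reversible: bool
    idempotent: bool

    def validate_call(self, kwargs: Dict[str, Any]) -> List[str]:
        """Return a list of problems; an empty list means the call matches."""
        problems = [f"missing input: {k}" for k in self.inputs if k not in kwargs]
        problems += [f"unexpected input: {k}" for k in kwargs if k not in self.inputs]
        return problems


# Hypothetical example: a narrow tool instead of "update the database".
CLOSE_TICKET = ToolContract(
    name="close_ticket",
    inputs={"ticket_id": "int", "reason": "str"},
    outputs={"closed": "bool"},
    failure_modes=["ticket_not_found", "already_closed", "timeout"],
    side_effects=["sets ticket status to closed"],
    reversible=True,
    idempotent=True,
)
```

With the contract in hand, a test harness can reject malformed calls before they reach the tool, inject each listed failure mode in simulation, and check after an incident whether the observed side effects matched the declared ones.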

Evaluation is the new test suite

Unit tests work for code. They don’t work for agents because the output space is too large and the correct answer is often contextual. You need evaluation — systematic scoring of agent outputs against representative cases.

A good evaluation pipeline has:

A dataset of real tasks with expected outcomes.
Automatic judges for objective criteria: did the agent complete the task, did it use the right tools, did it exceed token or time limits.
Human review for subjective criteria: tone, accuracy, safety.
Regression tests that compare new versions against the old baseline.

Without this, every change is a gamble. With it, you can iterate on prompts, models, and tools with confidence.
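The automatic part of that pipeline can be sketched in a few lines of Python. Everything here is a simplified assumption: `EvalCase`, `RunResult`, `judge`, `pass_rate`, and `is_regression` are hypothetical names, the agent is any callable that takes a task string, and the judge only covers the objective criteria. Human review of tone, accuracy, and safety stays outside this code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    task: str
    expected_tools: List[str]   # tools the agent should call, in order
    max_tokens: int             # token limit for this task


@dataclass
class RunResult:
    completed: bool
    tools_used: List[str]
    tokens_used: int


def judge(case: EvalCase, result: RunResult) -> bool:
    """Automatic judge: task completed, right tools used, within token limit."""
    return (
        result.completed
        and result.tools_used == case.expected_tools
        and result.tokens_used <= case.max_tokens
    )


def pass_rate(cases: List[EvalCase], agent: Callable[[str], RunResult]) -> float:
    """Fraction of cases the agent passes; 0.0 for an empty dataset."""
    if not cases:
        return 0.0
    passed = sum(judge(case, agent(case.task)) for case in cases)
    return passed / len(cases)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True when the new version scores below the old baseline by more than tolerance."""
    return candidate < baseline - tolerance
```

Running `pass_rate` for both the current and the candidate version on the same dataset, then gating the release on `is_regression`, is what turns a prompt or model change from a gamble into a measured comparison.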

Observability means understanding intent

Standard observability tells you what happened. Agent observability needs to tell you why. A trace should capture the goal, the reasoning steps, the tools called, the inputs and outputs, and the final action. If you can’t reconstruct the agent’s decision chain, you can’t fix it.

We log four things for every agent run:

The original user request and any injected context.
The agent’s plan and any changes to that plan.
Every tool call with input, output, and timestamp.
The final result and any human overrides.

That log is what turns a confusing incident into a five-minute diagnosis.
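A trace record covering those four items might look like the following Python sketch. The class and field names (`RunTrace`, `ToolCall`, `plan_versions`, and so on) are hypothetical, and the serialization to JSON is just one convenient choice; the structure matters more than the storage format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


@dataclass
class ToolCall:
    tool: str
    input: Dict[str, Any]
    output: Any
    timestamp: str  # ISO 8601, UTC


@dataclass
class RunTrace:
    request: str                                                  # original user request
    injected_context: List[str] = field(default_factory=list)    # retrieved or system context
    plan_versions: List[List[str]] = field(default_factory=list) # plan and every revision
    tool_calls: List[ToolCall] = field(default_factory=list)
    final_result: Optional[str] = None
    human_overrides: List[str] = field(default_factory=list)

    def record_plan(self, steps: List[str]) -> None:
        # Append rather than overwrite, so plan changes stay visible.
        self.plan_versions.append(list(steps))

    def record_tool_call(self, tool: str, input: Dict[str, Any], output: Any) -> None:
        timestamp = datetime.now(timezone.utc).isoformat()
        self.tool_calls.append(ToolCall(tool, input, output, timestamp))

    def to_json(self) -> str:
        # default=str keeps the log writable even for non-JSON outputs.
        return json.dumps(asdict(self), default=str)
```

Keeping every plan revision, instead of only the latest plan, is the design choice that lets you see where the agent changed its mind and which tool output triggered the change.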

Human-in-the-loop is a design pattern, not a fallback

The default instinct is to automate everything and add human approval only when something looks risky. The better model is to define which actions the agent can take autonomously and which always require a human.

Autonomous actions are reversible, low-cost, and well-tested. Actions that always require a human are destructive, high-cost, or novel. The boundary should be explicit in code, not left to the agent’s judgment. An agent that decides whether to ask for permission is an agent that will eventually decide wrong.

The production checklist

Before turning an agent loose on real work, confirm you can answer yes to these:

Every tool has a defined contract and limited permissions.
Token and runtime budgets are enforced by the infrastructure.
There is a dataset of real tasks with passing evaluation criteria.
Every run is logged with enough detail to reconstruct decisions.
Destructive actions require human approval.
There is a kill switch and a rollback path.

If you can’t check all six, you have a demo with credentials. Not a production system.
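For the budget and kill-switch items, a minimal Python sketch of enforcement outside the agent could look like this. `RunBudget` and `BudgetExceeded` are hypothetical names, and the in-process flag stands in for whatever your infrastructure actually uses to stop a run (a feature flag, a queue pause, a revoked credential).

```python
import time
from typing import Callable


class BudgetExceeded(Exception):
    """Raised by the runner, not the agent, when a run must stop."""


class RunBudget:
    def __init__(
        self,
        max_tokens: int,
        max_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock
        self._started = clock()
        self.tokens_used = 0
        self.killed = False

    def kill(self) -> None:
        """Kill switch: the next charge() call stops the run."""
        self.killed = True

    def charge(self, tokens: int) -> None:
        """Call around every LLM step; raises instead of letting the run continue."""
        if self.killed:
            raise BudgetExceeded("kill switch engaged")
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded("token budget exceeded")
        if self._clock() - self._started > self.max_seconds:
            raise BudgetExceeded("runtime budget exceeded")
```

The runner that drives the agent loop calls `charge` for every step and treats `BudgetExceeded` as a hard stop. The agent never sees the budget object, so it cannot talk itself into an extension.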

Production agents are possible. We build and operate them for clients. But they require the same discipline as any critical automation, just with more uncertain outputs. If your team is trying to cross that gap, our AI automation services cover the architecture, tooling, and governance that make agents shippable.
