How do we start using AI in our engineering process?

The path of least resistance is automating repetitive tasks: issue triage, release notes generation, AI-assisted code review, and smart alerts. This creates immediate value without requiring a full process redesign.

What is RAG and when should we use it?

RAG (Retrieval-Augmented Generation) connects an LLM to a specific knowledge base, enabling responses grounded in your company's internal documents rather than just model training. It's ideal for internal chatbots, documentation search, and support assistants.

How much does it cost to run LLMs in production?

It depends on the model and volume. GPT-4o costs approximately $2.50 per 1M input tokens. For high volumes, open-source models (Llama, Mistral) running on your own infrastructure can reduce costs by up to 80%, but require more engineering investment. The right choice depends on the cost-latency-quality tradeoff.

What is an AI Agent and how is it different from a chatbot?

A chatbot answers questions. An AI Agent executes tasks: it accesses systems, makes context-based decisions, calls APIs, processes documents, and acts autonomously within defined boundaries. The difference is between a response and an action.

How a modern CI/CD pipeline works — and where AI actually fits

Most engineers can tell you what CI/CD is. Fewer can tell you where AI actually fits into it — beyond the vague “AI makes it smarter” pitch every vendor is running right now.

I’ve set up pipelines for teams at every stage. From a 3-person startup deploying to a single EC2 instance, to 40-engineer orgs running Kubernetes across multiple regions. The fundamentals don’t change much. What’s changing is what AI can do inside each stage.

Let me walk you through both.

What a modern pipeline actually looks like

A CI/CD pipeline moves code from a developer’s branch to production. Every stage has one job.

Commit trigger. Someone pushes code or opens a PR. The pipeline wakes up. Everything else is a response to this event.

Build. Code compiles. Dependencies install. Docker images get built. This step exists to answer one question: does this code even build? You’d be surprised how often that fails.

Test. Unit tests, integration tests, contract tests. This is where bugs should get caught. If your tests aren’t catching bugs here, you have a coverage problem that no pipeline can fix.

Static analysis. Linters, security scanners, code quality tools. Ideally running in parallel with tests, not after them. Sequential security checks are how you add 20 minutes to every build.

Staging deploy. A production-like environment gets the new version. Smoke tests run. QA gates happen here if you have them.

Approval gate. Automated or human. On critical systems, a human reviews before anything touches prod. On mature pipelines with real test coverage, this is automated.

Production deploy. Blue-green, canary, or rolling. Your risk tolerance decides which one.

Post-deploy validation. Health checks, error rate checks, synthetic transactions. The pipeline isn’t done when the deploy command finishes. It’s done when the system says it’s healthy.

That’s the skeleton. The quality of a pipeline comes from what happens inside each stage — not just whether the stages exist on a diagram.

Where AI is genuinely useful

Test failure triage

A failing build at 9am is already friction. Spending 20 minutes figuring out why it failed makes it worse.

AI can read the failure log, cross-reference it with the diff, compare it to historical failures, and surface a probable cause before you open the terminal. That 20-minute debugging session becomes a 2-minute confirmation. Some teams are shipping this today — the model reads the output and writes a one-paragraph explanation of what broke and why.

Code review on every PR

Static analysis catches syntax errors and known vulnerability patterns. It doesn’t catch logic holes, missed edge cases, or the architectural inconsistency that won’t break anything today but will cause a painful refactor in six months.

An AI reviewer running on every PR can flag those patterns. A function that handles the happy path but silently fails on null input. A security pattern inconsistent with the rest of the codebase. A direct dependency introduced without a version pin.

This isn’t replacing the human reviewer. It’s giving them a first-pass summary so they can stop reviewing spacing decisions and start reviewing decisions that matter.

Deployment risk scoring

Not every change carries the same risk. A one-line typo fix in a README is different from a database migration touching three tables.

AI trained on your deploy history can score incoming changes by predicted incident probability — factoring in file surface area, test coverage of the changed code, time since the last incident in this area, and deploy frequency. Teams using this automatically route high-risk changes to stricter gates. It’s a smarter version of “require two reviewers for database changes.”

Post-deploy anomaly detection

Did this deploy break anything? That question is harder than it sounds when you’re relying on static thresholds.

AI-assisted monitoring watches error rate, latency, and key business metrics in the window right after a deploy. Instead of firing when error rate crosses 1%, it compares the current pattern to historical baselines and flags deviations that look like past incidents — even if no static threshold is crossed. Quieter normal. Louder when something’s actually wrong.

Where AI doesn’t belong

Skipping tests. I’ve heard “AI can just tell us if it looks right.” No. Tests are ground truth. They’re not the slow part — a test suite that takes 15 minutes is a you-need-better-tests problem, not an AI-can-skip-this problem.

Autonomous production deploys on critical systems. Confidence scores aren’t guarantees. If a bad deploy costs you users, revenue, or sleep — keep a human at the production gate. The cost of that review is lower than the cost of a bad deploy.

Undocumented automated decisions. If an AI agent skipped a test, approved a gate, or triggered a rollback — that needs to be in the logs. “The system decided” isn’t an acceptable answer when something goes wrong.

The thing most people get backwards

Teams ask “where does AI fit in my pipeline?” before they have a pipeline worth fitting AI into.

Before AI optimization makes sense, you need:

Tests that actually run deterministically
A staging environment that resembles production
Rollback capability in under 5 minutes
Some baseline for what “healthy” looks like post-deploy

Without those, AI adds noise to a system that doesn’t have signal yet. Get the foundation right. Then add the intelligence.

I work with teams building production systems and developer tooling. If this topic resonates, you can find more of my work at https://huntermussel.com.

How a modern CI/CD pipeline works — and where AI actually fits

What a modern pipeline actually looks like

Where AI is genuinely useful

Test failure triage

Code review on every PR

Deployment risk scoring

Post-deploy anomaly detection

Where AI doesn’t belong

The thing most people get backwards

Share

Frequently Asked Questions

How a modern CI/CD pipeline works — and where AI actually fits

What a modern pipeline actually looks like

Where AI is genuinely useful

Test failure triage

Code review on every PR

Deployment risk scoring

Post-deploy anomaly detection

Where AI doesn’t belong

The thing most people get backwards

Share

Related articles

Frequently Asked Questions

How do we start using AI in our engineering process?

What is RAG and when should we use it?

How much does it cost to run LLMs in production?

What is an AI Agent and how is it different from a chatbot?