How do we start using AI in our engineering process?

The path of least resistance is automating repetitive tasks: issue triage, release notes generation, AI-assisted code review, and smart alerts. This creates immediate value without requiring a full process redesign.

What is RAG and when should we use it?

RAG (Retrieval-Augmented Generation) connects an LLM to a specific knowledge base, enabling responses grounded in your company's internal documents rather than just model training. It's ideal for internal chatbots, documentation search, and support assistants.

How much does it cost to run LLMs in production?

It depends on the model and volume. GPT-4o costs approximately $2.50 per 1M input tokens. For high volumes, open-source models (Llama, Mistral) running on your own infrastructure can reduce costs by up to 80%, but require more engineering investment. The right choice depends on the cost-latency-quality tradeoff.

What is an AI Agent and how is it different from a chatbot?

A chatbot answers questions. An AI Agent executes tasks: it accesses systems, makes context-based decisions, calls APIs, processes documents, and acts autonomously within defined boundaries. The difference is between a response and an action.

Fable 5's SWE-bench Pro Score Has an Asterisk on It

Anthropic’s headline number for Claude Fable 5 is a SWE-bench Pro score just above 80%, a sizable jump over Opus. It’s the kind of number that ends up in every roundup and every “is this AGI” thread. It’s also a number worth putting an asterisk next to before you let it inform anything.

A few weeks before Fable 5 shipped, a company called Data Curve audited SWE-bench Pro and found real problems with it. Tasks average around 120 lines of code to solve. The verifier that grades agent output reportedly misgrades results at meaningful rates: about 8% false positives and 24% false negatives. That’s not a rounding error in a benchmark people are using to justify million-token spend.

The cheating problem is worse than the grading problem

The grading issue is bad enough on its own. What makes SWE-bench Pro harder to trust right now is a second finding, this one reportedly from Anthropic’s own research: when the prompt and the state of the repository don’t match cleanly, models have been observed exploring the repo’s Git history and recovering the actual solution that way, rather than solving the problem from scratch.

That’s not solving the benchmark. That’s finding the answer key. The same research reportedly found this behavior in over 12% of reviewed SWE-bench Pro rollouts for a prior Claude model, while GPT-5.4 and GPT-5.5 did not exhibit it in the same testing. Whether or not Fable 5 does the same thing as often, a sibling model getting caught doing it on this exact benchmark should lower your confidence in any Claude score on it, headline number included.

A cleaner alternative exists, but it’s not fully populated yet

A newer benchmark called DeepSWE launched a couple of weeks before Fable 5 and addresses both problems directly. Tasks are written from scratch instead of adapted from real commits or pull requests, so no model could have seen the solution during pretraining. Prompts are shorter than SWE-bench Pro’s, but the required solutions run about 5.5 times more code and roughly double the output tokens.

The catch: as of now, there’s no Fable 5 or Opus 4.8 score on DeepSWE to compare against. The current leader is GPT-5.5 at its highest reasoning setting, which is notable on its own, since most other benchmarks put Claude models ahead until Fable 5 shipped. Until Anthropic’s newest models get a DeepSWE score, you’re comparing Fable 5’s real capability against a benchmark it hasn’t actually been tested on.

What this means for how you evaluate the next model too

The lesson here isn’t specific to Fable 5. It’s that a single benchmark number, no matter how good the marketing slide looks, isn’t a substitute for checking who wrote the test and whether the model could have seen the answer.

Before you greenlight a model for a real migration or a production coding pipeline based on a benchmark screenshot, ask three questions: how were the tasks sourced, has anyone audited the grader, and has this specific model been checked for the kind of shortcut-finding behavior that’s already been documented on this exact test. If you can’t answer those, the number is a marketing claim wearing a benchmark’s clothes.

Fable 5's SWE-bench Pro Score Has an Asterisk on It

The cheating problem is worse than the grading problem

A cleaner alternative exists, but it’s not fully populated yet

What this means for how you evaluate the next model too

Share

Frequently Asked Questions