We set out to build a code visualization tool. AI can write code faster than you can review it, and we wanted to give developers a way to keep up: interactive architecture graphs, real-time structure views, a shared picture of what's actually happening in the codebase.
We quickly realized that to build any type of precise visualization, documentation, or code review, you first need a good graph. The graph is the precursor. And when we looked around, we saw the same thing everywhere: every code review tool, every documentation generator, every AI coding assistant that needs to understand codebase structure ends up building its own parser, its own import resolver, its own symbol graph. It's the same foundational work rebuilt independently by dozens of teams. Nobody had put together one comprehensive set of graph primitives that's well-maintained and available for anyone to build on top of.
So we decided to do that. We think code graphs are a core primitive, especially now, as the industry moves toward software factories where agents need structural understanding of what they're working on. Our focus is on maintaining precise graphs and parsing so that everyone building on top doesn't have to duplicate that effort.
This post is about dead code detection, the first tool we built on our own graph primitives and the one we've benchmarked most extensively. We tested it across 14 real-world repositories, from 449-star libraries to 138K-star monorepos like next.js. The result: 156x cheaper, 11x faster, and 2x better performance than Claude Opus 4.6 alone, with 94.1% average F1 and 100% precision across every task. But the thesis is bigger than dead code: graphs are a core primitive for software factories. This dead code removal tool is one example of what can be built with our public API. If you have your own interpretation of how this problem, or another, can be better solved with graph primitives, we're happy to provide you with the raw materials to do so.
The Dead Code Problem
We discovered the dead code problem by accident. We pointed an agent at a directory containing both live and dead code and asked it to write documentation. It documented the dead features as if they were live.
Vibe-coded software, especially after multiple refactors, is likely to leave dead code behind, clogging the context. Engineers who code by hand know the frustration of editing a method and seeing no change, only to discover that the implementation drifted from the spec and the method is dead. AI agents don't have that intuition. They see every function as equally real.
Dead code clogs context windows, confuses agents, and wastes the most expensive resource in AI-powered development: tokens spent reasoning about code that doesn't matter.
Our insight: good prompting is high-signal prompting. We want to give the model a high volume of high-signal context and eliminate as much noise as possible. If we could identify and remove dead code before the agent sees it, we could dramatically improve the quality of every downstream task: documentation, code review, refactoring, feature development.
From Graphs to Dead Code Candidates
Our insight was that with a well-made call graph and a well-made dependency graph, we could in many cases discover "dead code candidates." The naive rule would be: anything that is never imported or never called is dead. But that rule breaks down quickly. Generated code patterns may contain symbols that aren't called until the system is built. Framework entry points (Express route handlers, Next.js pages, NestJS controllers) are never "called" by your code; they're invoked at runtime. And services gated by an API may contain code that appears dead but isn't, because the caller sits on the other side of a network boundary: a REST handler with zero internal callers, a webhook endpoint waiting for external events, a plugin loaded by convention rather than by import.
With these constraints in mind, though, it's possible to build an agent-enabled system that starts from a set of items that appear dead, ranked by probability. Such a system can self-improve by learning project conventions: project structures that follow a generator pattern, for example, typically use common directory names like target/. The result is a system that generates progressively better-ranked dead code candidates, with the caveat that false positives remain and must be sorted through.
Still, this greatly reduces the context load on an LLM. On smaller projects, an LLM can effectively trace the entire execution path inside of the context window. On larger projects this becomes increasingly infeasible. By using graph analysis primitives, we can eliminate a huge chunk of known noise. After that we can use agents to sort through candidates to remove false positives. Finally, over time we can learn how project structures and design patterns create false positives to make a more refined system that further reduces the false positives the agent needs to sort through.
The cumulative effect of this process is that we can build CI pipelines and refactoring tools that will reduce dead code with increasing accuracy and precision. The final outcome once the dead code is removed is less wasted context, fewer agent errors, and more work done.
Why Naive Reachability Isn't Enough
Static analysis can trace imports and function calls. What it can't easily see are the boundaries of indirection that make code appear dead when it isn't:
Framework entry points. A Next.js page.tsx, a NestJS @Controller(), an Express route handler. None of these are "called" by your code. They're invoked by the framework at runtime. A naive dead code detector would flag every API endpoint as unused.
Event-driven and plugin architectures. Webhook handlers, message queue consumers, dynamically loaded plugins. All registered through patterns that static analysis struggles to trace.
API boundaries. When a service exposes functions through a REST or GraphQL API, the callers live on the other side of a network boundary. The server-side handler has zero internal callers, but it's the most critical code in the system.
Generated code patterns. Code generators (ORMs, gRPC stubs, GraphQL codegen) produce symbols that aren't called until the rest of the system is wired up. These often live in conventionally-named directories like generated/, target/, or __generated__/.
Re-exports and type-level usage. A type that's re-exported through a barrel file (index.ts), or a constant used only in type annotations. These are alive but invisible to call-graph-only analysis.
These aren't edge cases. In a typical production codebase, they represent 30-60% of all exported symbols. Flag them all as dead and you've built a tool nobody trusts.
Our Approach: Probabilistic Candidates + Agent Verification
Instead of trying to build a perfect static analyzer (an impossible task), we designed a system that works with AI agents rather than replacing them.
Step 1: Graph Analysis. Parse the codebase with tree-sitter. Build the call graph and dependency graph. Run BFS reachability from identified entry points (framework conventions, main files, test files). Everything unreachable becomes a candidate.
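As a minimal sketch of Step 1 (using a toy adjacency-list representation for illustration, not our actual graph format), reachability-based candidate generation looks like this:

```python
from collections import deque

def unreachable_symbols(graph: dict, entry_points: list) -> set:
    """BFS from entry points over combined call/import edges.

    `graph` maps a symbol to the symbols it calls or imports.
    Anything never visited becomes a dead-code *candidate*,
    not a verdict -- framework entry points, codegen output,
    and API handlers still need downstream filtering.
    """
    seen = set()
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return set(graph) - seen

# Toy graph: main -> handler -> helper. legacyAuth calls live
# code but has no path *from* any entry point, so it's a candidate.
graph = {
    "main": ["handler"],
    "handler": ["helper"],
    "helper": [],
    "legacyAuth": ["helper"],
}
print(unreachable_symbols(graph, ["main"]))  # {'legacyAuth'}
```

Note the direction of the edge matters: legacyAuth calling live code does not make legacyAuth reachable.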
Step 2: Probabilistic Ranking. Not all candidates are equally likely to be dead. We rank by signals: Is it in a generated directory? Does it follow a framework naming convention? Is it a type re-export? How deep is it in the import chain? This produces a ranked list of candidates, from "almost certainly dead" to "suspicious but uncertain."
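The ranking step can be sketched as a simple weighted scorer. The signals below mirror the ones named above, but the weights and field names are illustrative, not the production model:

```python
def rank_candidates(candidates: list) -> list:
    """Order unreachable symbols from most to least likely dead."""
    def score(c: dict) -> float:
        s = 0.5  # base prior for any unreachable symbol
        if c.get("in_generated_dir"):
            s -= 0.4  # codegen is often wired up later
        if c.get("framework_convention"):
            s -= 0.3  # e.g. page.tsx, @Controller(): invoked at runtime
        if c.get("type_reexport"):
            s -= 0.2  # may be alive via type-level usage
        # deep in the import chain *and* unreachable: likelier dead
        s += 0.1 * min(c.get("import_depth", 0), 3)
        return max(0.0, min(1.0, s))
    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"name": "page", "framework_convention": True},   # score 0.2
    {"name": "legacyAuth", "import_depth": 3},        # score 0.8
]
ranked = rank_candidates(candidates)
# legacyAuth (deep, no rescue signals) outranks the framework page
```

The negative weights act as "rescue" signals: they push known false-positive shapes toward the "suspicious but uncertain" end of the list.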
Step 3: Agent Verification. Hand the ranked candidates to an AI agent. The agent can read surrounding code, check for dynamic usage patterns, and apply judgment that static analysis can't. The key insight: the agent's job is now filtering a short list, not searching an entire codebase. This is dramatically more tractable.
Step 4: Learn and Refine. Track which candidates turn out to be false positives. Learn that projects using Next.js have page.tsx files that look dead but aren't. Learn that __mocks__/ directories are test infrastructure. Feed this back into the ranking model.
The cumulative effect: each iteration produces fewer false positives for the agent to sort through, the verification gets faster and cheaper, and the system builds institutional knowledge about project patterns.
Benchmarking: How We Measured
We used mcpbr (Model Context Protocol Benchmark Runner), built by Grey Newell, to run controlled experiments. Grey also contributed critical fixes to the Supermodel API — including confidence calibration, OOM prevention, dead export detection, and the StreamReader fix that made baseline evaluation reliable — and built the codegraph-bench code navigation benchmark. The setup:
- Model: Claude Opus 4.6 via the Anthropic API
- Agent harness: Claude Code
- Two conditions: (A) Agent with Supermodel MCP server providing graph analysis, (B) Baseline agent with only grep, glob, and file reads
- Same prompt, same tools (minus the MCP server), same evaluation
Ground Truth: How Do You Know What's Actually Dead?
This is the hardest part of benchmarking dead code detection. You need to know, with certainty, which symbols in a codebase are dead. We used two approaches:
Synthetic codebases. We built a 35-file TypeScript Express app and intentionally planted 102 dead code items: legacy integrations, deprecated auth methods, feature flags that were never cleaned up, replaced utility functions. We know exactly what's dead because we put it there. This is useful for development but doesn't reflect real-world complexity.
Real pull requests from open-source projects. This is where the benchmark gets interesting. We searched GitHub for merged PRs whose commit messages and descriptions explicitly mention removing dead code, unused functions, or deprecated features. The logic: if a developer identified code as dead, removed it in a PR, the tests still pass, and the PR was approved by reviewers and merged, that's confirmed dead code.
For each PR, we extracted ground truth by parsing the diff: every exported function, class, interface, constant, or type that was deleted (not moved or renamed) became a ground truth item. The agent's job is to identify these same items by analyzing the codebase at the commit before the PR, the state where the dead code still exists.
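A minimal sketch of that extraction, assuming a unified diff and a deliberately simplified regex for single-line TypeScript export declarations (a real pipeline also handles multi-line declarations and renames):

```python
import re

# Matches "export [default] [async] function|class|interface|const|type Name"
EXPORT_RE = re.compile(
    r"^export\s+(?:default\s+)?(?:async\s+)?"
    r"(?:function|class|interface|const|type)\s+(\w+)"
)

def deleted_exports(diff_text: str) -> set:
    """Exported symbols whose definitions were removed in a diff.

    Lines starting with '-' (excluding '---' file headers) are
    deletions. Symbols re-added elsewhere in the diff are treated
    as moves/renames, not deletions.
    """
    removed, added = set(), set()
    for line in diff_text.splitlines():
        if line.startswith("---") or line.startswith("+++"):
            continue
        if line.startswith("-"):
            m = EXPORT_RE.match(line[1:].strip())
            if m:
                removed.add(m.group(1))
        elif line.startswith("+"):
            m = EXPORT_RE.match(line[1:].strip())
            if m:
                added.add(m.group(1))
    return removed - added

diff = """\
--- a/src/auth.ts
+++ b/src/auth.ts
-export function legacyAuth(token: string) {
-export const OLD_FLAG = true
+export function newAuth(token: string) {
-export class Widget {
+export class Widget {
"""
print(sorted(deleted_exports(diff)))  # ['OLD_FLAG', 'legacyAuth']
```

Widget is dropped because it reappears on a `+` line: deleted, not moved or renamed, is the ground-truth criterion.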
This methodology has a key strength: it's grounded in real engineering decisions, not synthetic judgment calls. A human developer, with full context of the project, decided this code was dead. We're asking: can an AI agent reach the same conclusion?
We tested against PRs from 14 repositories spanning small libraries to massive monorepos:
| Repository | Stars | Task |
|---|---|---|
| track-your-regions | -- | tyr_pr258 |
| podman-desktop | 16K | podman_pr16084 |
| gemini-cli | 7K | gemini_cli_pr18681 |
| jsLPSolver | 449 | jslpsolver_pr159 |
| strapi | 71.7K | strapi_pr24327 |
| mimir | -- | mimir_pr3613 |
| opentelemetry-js | 3.3K | otel_js_pr5444 |
| TanStack/router | 14K | tanstack_router_pr6735 |
| latitude-llm | -- | latitude_pr2300 |
| storybook | 89.6K | storybook_pr34168 |
| Maskbook | 1.6K | maskbook_pr12361 |
| directus | 34.6K | directus_pr26311 |
| cal.com | 40.9K | calcom_pr26222 |
| next.js | 138K | nextjs_pr87149 |
Across 60+ benchmark runs, we evaluated both agents on precision (what fraction of reported items are actually dead), recall (what fraction of actually dead items were found), and F1 score.
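The metrics are computed in the standard way. The worked example below uses made-up item names but reproduces the shape of the next.js result (100% precision, 80% recall, 88.9% F1):

```python
def prf1(reported: set, ground_truth: set):
    """Precision, recall, and F1 for a set of reported dead items."""
    tp = len(reported & ground_truth)
    precision = tp / len(reported) if reported else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1

# Agent reports 4 items, all truly dead; 1 dead item is missed:
p, r, f = prf1({"a", "b", "c", "d"}, {"a", "b", "c", "d", "e"})
# p = 1.0, r = 0.8, f ~= 0.889
```

100% precision with imperfect recall means every miss lowers F1, but nothing reported is wrong, which is the conservative profile the graph agent shows throughout.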
Results
The Headline: 156x Cheaper, 11x Faster, 2x Better
Let's start with the numbers that matter. Across 14 real-world tasks, each drawn from a merged PR in an open-source repository, the graph-enhanced agent using the Supermodel MCP server dominated the baseline agent on every dimension:
| Metric | MCP (Graph) Agent | Baseline Agent | Improvement |
|---|---|---|---|
| Avg F1 | 94.1% | 52.0% | 2x |
| Avg Precision | 100% | varies | Perfect |
| Avg Recall | 90% | varies | -- |
| Total Cost | $1.40 | $219 | 156x cheaper |
| Total Runtime | 28 min | 306 min | 11x faster |
| Total Tool Calls | 28 | 4,079 | 146x fewer |
| Avg Tool Calls/Task | 2 | 291 | -- |
| Head-to-Head Wins | 11 | 0 | -- |
| Ties | 3 | 3 | -- |
100% precision across all 14 tasks. Zero false positives. Every single item the graph agent reported was confirmed dead code.
The baseline agent spent 4,079 tool calls grepping through codebases, trying to reconstruct call graphs at runtime. The graph agent made 28 tool calls total, 2 per task on average. It read the pre-computed analysis, reported the candidates, and was done. The graph pre-computes the expensive work, so the agent doesn't have to.
Per-Task Breakdown
Here's every task, sorted by the gap between MCP and baseline performance:
| Task | Repo | Stars | MCP F1 | Base F1 | MCP P | MCP R |
|---|---|---|---|---|---|---|
| storybook_pr34168 | storybook | 89.6K | 100% | 0% | 100% | 100% |
| otel_js_pr5444 | opentelemetry-js | 3.3K | 100% | 17.6% | 100% | 100% |
| tanstack_router_pr6735 | TanStack/router | 14K | 100% | 12% | 100% | 100% |
| directus_pr26311 | directus | 34.6K | 100% | 14.3% | 100% | 100% |
| nextjs_pr87149 | next.js | 138K | 88.9% | CRASH | 100% | 80% |
| latitude_pr2300 | latitude-llm | -- | 92.3% | 35.3% | 100% | 86% |
| calcom_pr26222 | cal.com | 40.9K | 100% | 57.1% | 100% | 100% |
| gemini_cli_pr18681 | gemini-cli | 7K | 80% | 42.9% | 100% | 67% |
| podman_pr16084 | podman-desktop | 16K | 100% | 67.7% | 100% | 100% |
| maskbook_pr12361 | Maskbook | 1.6K | 81% | 68.4% | 100% | 68% |
| tyr_pr258 | track-your-regions | -- | 97.6% | 81.6% | 100% | 95% |
| strapi_pr24327 | strapi | 71.7K | 100% | 100% | 100% | 100% |
| mimir_pr3613 | mimir | -- | 100% | 100% | 100% | 100% |
| jslpsolver_pr159 | jsLPSolver | 449 | 78.3% | 78.6% | 100% | 64% |
The storybook result is striking: 89.6K stars, massive monorepo, and the baseline agent couldn't find a single confirmed dead code item. The graph agent found all of them. The same pattern plays out across OpenTelemetry JS (17.6% vs 100%), TanStack Router (12% vs 100%), and Directus (14.3% vs 100%). On next.js, the largest repo in the benchmark at 138K stars, the baseline agent crashed entirely. The graph agent scored 88.9% F1.
The three ties (strapi, mimir, jslpsolver) are instructive. On strapi and mimir, both agents achieved perfect scores -- these tasks had clean, well-scoped dead code that even grep-based search could find. On jslpsolver, the baseline agent actually edged out the graph agent by 0.3 percentage points on F1 (within noise, so we score it a tie), the only task where that happened. The graph agent's 100% precision (vs the baseline's lower precision) shows the tradeoff: the graph agent is more conservative, sometimes missing items the baseline stumbles onto, but never reporting false positives.
What Changed: From 10% F1 to 94% F1
If you've been following our benchmarking journey, you'll notice these numbers look dramatically different from our earlier results. In our February and early March runs, the graph agent achieved high recall but terrible precision -- single-digit percentages, with hundreds or thousands of false positives per task. What happened?
Three things changed:
1. Parser improvements. Barrel re-export filtering, cross-package import resolution, class rescue patterns, and seven new pipeline phases dramatically reduced the candidate list. Fewer false candidates means fewer false positives.
2. MCP server instead of analysis dump. Previously, we pre-computed a large JSON analysis file and handed it to the agent. Files with 6,000+ candidates exceeded tool output limits, causing truncation and errors. The MCP server delivers candidates through a structured API call, solving the file size wall entirely.
3. Better agent prompting. We stopped asking the agent to verify candidates with grep (which was less accurate than the graph analysis it was checking) and instead told the agent to trust the graph analysis. This restored recall to expected levels and, combined with the improved parser, achieved the precision breakthrough.
The cumulative effect: the same architectural approach -- graph-based candidate generation plus agent verification -- went from promising-but-rough to production-grade. The thesis was right. The implementation needed iteration.
Failure Modes We Discovered (and Fixed)
1. The File Size Wall (Fixed)
A 6,000-candidate analysis file exceeds the 25K-token tool output limit, so the agent either gets a truncated view or errors out.
Fix: Moving to the MCP server architecture eliminated this entirely. Instead of dumping a massive JSON file, the agent makes a structured API call and gets back a clean candidate list. This was one of the key changes that took us from single-digit precision to 100%.
2. API Recall Gaps (Mostly Fixed)
Sometimes the Supermodel parser misses ground truth items entirely. In earlier benchmarks, the Logto task found 0 of 8 ground truth items in the analysis. No amount of agent intelligence can find what the analysis doesn't contain.
Root causes we identified and fixed: export default not tracked, type re-exports (export type { X } from) missed, test file imports not scanned, barrel re-export filtering. These parser improvements, combined with the MCP server delivery, are why recall went from 85% to 90% average across a larger and harder set of tasks.
3. Agent Verification Can Hurt Performance (Fixed)
This one surprised us. In an earlier benchmark run, we instructed the agent to verify each candidate by grepping for the symbol name across the codebase. The idea was sound: if a symbol appears in other files, it's probably alive.
The result: recall dropped from 95.5% to 40% on our best-performing task (tyr_pr258). The agent's grep verification was killing real dead code.
Why? The grep used word-boundary matching (grep -w). A function named hasRole would match the word hasRole appearing in a comment, a string literal, or a completely unrelated variable name in another file. The agent would see the match and mark the function as "alive." A false negative introduced by the verification step.
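The failure mode is easy to reproduce. Python's re module with \b boundaries has the same word-boundary semantics as grep -w; both lines below contain the word hasRole, and neither is a call site:

```python
import re

# grep -w matches on word boundaries; \b behaves the same way here
pattern = re.compile(r"\bhasRole\b")

non_call_sites = [
    "// TODO: hasRole was replaced by checkPermission",  # a comment
    'logger.warn("hasRole is deprecated")',              # a string literal
]
# Every line "matches", so grep-based verification marks hasRole
# as alive even though nothing actually calls it.
assert all(pattern.search(line) for line in non_call_sites)
```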
The irony: the static analyzer had already performed proper call graph and dependency analysis to identify these candidates. The agent's grep check was a less accurate version of what the analyzer already did. By asking the agent to verify the analysis, we made it worse.
The fix was simple: tell the agent to trust the analysis and pass through all candidates without grep verification. The lesson: don't let a less precise tool override a more precise one. Graph-based reachability analysis is strictly more accurate than grep-based name matching for determining whether code is alive.
4. Agent Non-Determinism (Mitigated)
Same task, same config, different results. One run finds 3 true positives; the rerun finds 0. This is an inherent property of LLM-based agents.
The mitigation that worked: reduce the agent's degrees of freedom. With the MCP server delivering a short, well-ranked candidate list via a structured API, the agent has almost no room to go off-track. The result is 2 tool calls per task on average, and highly reproducible outcomes. When the agent's job is "read this list and report it," non-determinism effectively disappears.
The Scaling Insight
This is the finding we keep coming back to. Across the 14 tasks:
- Total MCP cost: $1.40. Total baseline cost: $219. That's 156x cheaper.
- Total MCP runtime: 28 minutes. Total baseline runtime: 306 minutes. That's 11x faster.
- MCP tool calls: 28 total (2 per task average). Baseline tool calls: 4,079 total (291 per task average).
The economics are striking. At 2 tool calls per task, the graph agent's cost is nearly constant regardless of codebase size. The baseline agent's cost scales with the size and complexity of the repo -- 291 tool calls on average, but the variance is enormous. On next.js (138K stars), the baseline agent crashed before producing a result, consuming tokens the entire way.
As codebases grow:
- Baseline cost explodes. More files means more tool calls spent building a mental model of the codebase.
- Graph cost stays flat. The agent makes an MCP call, gets structured candidates, and reports them. Two tool calls whether the repo has 50 files or 50,000.
- Baseline quality degrades. On the four largest repos (storybook, next.js, strapi, directus -- all 34K+ stars), the baseline averaged 28.4% F1. The graph agent averaged 97.2% F1.
- Graph quality stays high. 90% average recall and 100% precision regardless of repo size.
The graph absorbs the complexity that would otherwise land on the agent. This is the fundamental value proposition, and it applies to any tool built on graph primitives, not just dead code detection.
Lessons for Building AI-Powered Code Tools
1. Context engineering matters more than model capability
Same model, same tools, different input structure: 2x better F1, 156x cheaper. The model wasn't the bottleneck. The signal-to-noise ratio of its input was.
This is the core lesson. Good prompting is high-signal prompting. The best thing you can do for an AI agent isn't give it a smarter model. It's give it pre-computed, structured, relevant context and eliminate the noise.
2. Pre-compute what you can, delegate judgment to the agent
Static analysis is good at exhaustive enumeration. AI agents are good at judgment calls. The worst outcome is making the agent do both: enumerate and judge. That's 291 tool calls per task and $219 total for worse results.
The best outcome is a pipeline: graphs enumerate candidates, agents verify them. Each component does what it's best at. Two tool calls per task and $1.40 total for better results.
3. Precision is achievable, not just aspirational
In our earlier benchmarks, we wrote that "precision is the frontier" -- we were achieving high recall but single-digit precision on real codebases. We believed precision was solvable through better ranking and filtering. We were right.
100% precision across 14 tasks. Zero false positives. The combination of parser improvements, MCP server delivery, and better agent prompting solved a problem we'd been publicly struggling with for months. The lesson: iterate on the system, not the model.
4. Real-world codebases are dramatically harder than synthetic ones -- but solvable
On our synthetic benchmark, we hit 95% F1. On real-world codebases in earlier runs, F1 was in the single digits despite high recall. We were worried this gap might be fundamental.
It wasn't. With the right system improvements, real-world F1 rose to 94.1% average. The four largest repos in our benchmark (storybook at 89.6K stars, next.js at 138K, strapi at 71.7K, directus at 34.6K) averaged 97.2% F1. Repo size is no longer the limiting factor.
5. The system improves iteratively
Every benchmark run taught us something:
- jsLPSolver taught us that well-organized small repos favor grep-based search
- Maskbook taught us about the file size wall
- Logto taught us about parser gaps in export default
- Directus taught us about the analysis-dump failure mode
- storybook taught us that the MCP server approach scales to massive monorepos
Each lesson fed back into the parser, the delivery mechanism, and the agent prompt. The system got 9x better on F1 (from ~10% to 94.1%) not through model improvements, but through better context engineering.
The Bigger Picture: Graphs as Factory Primitives
The industry is moving toward software factories: automated pipelines where agents write, review, test, and deploy code with increasing autonomy. These factories need infrastructure primitives. The LLM is becoming a commodity. What isn't commodity is structural understanding of what agents are working on.
Dead code detection is one application. But the underlying primitive, a structured graph of code relationships, enables an entire category of tools:
- Impact analysis: "If I change this function, what breaks?" (call graph)
- Architecture documentation: "What are the domains and boundaries in this system?" (domain graph)
- Dependency auditing: "Which packages are actually used?" (dependency graph)
- Refactoring assistance: "Show me all the callers of this deprecated API" (call graph)
- Security surface mapping: "What code paths lead from user input to database queries?" (call graph + data flow)
Each of these has the same structure: pre-compute the graph, rank candidates, let agents handle judgment. The graph is the primitive. The applications are built on top.
Every team building agent-powered workflows, whether it's code review, documentation generation, CI pipelines, or full factory orchestration, needs this structural awareness. Right now, most of them are building it from scratch. We think there should be one well-maintained set of graph primitives that everyone can build on, rather than dozens of teams independently duplicating the same foundational work.
The Benchmarking Journey: What We Got Wrong Along the Way
Building the dead code tool was one thing. Benchmarking it honestly was harder. Here's what we learned the hard way.
The evolution in numbers
Our benchmark results improved dramatically over three months of iteration:
| Period | Avg F1 | Avg Precision | Avg Recall | Tasks | Key Change |
|---|---|---|---|---|---|
| Feb 2026 | ~6% | ~3% | ~85% | 10 | Initial graph analysis dump |
| Mar 9, 2026 | ~10% | ~6% | ~97% | 10 | Parser improvements |
| Mar 30, 2026 | 94.1% | 100% | 90% | 14 | MCP server + prompt fixes |
The jump from 10% to 94% F1 didn't come from a better model. It came from three system-level changes: parser improvements that reduced false candidates, the MCP server that eliminated the file size wall, and prompt changes that stopped the agent from second-guessing the graph analysis.
Measuring the wrong thing
Our initial benchmark prompt told the agent to read the analysis file, then "verify" each candidate by grepping the codebase to see if the symbol appeared in other files. This seemed rigorous. The agent would filter false positives before reporting.
It backfired. On our best-performing task (tyr_pr258), recall dropped from 95.5% to 40%. The agent's grep verification was less accurate than the graph analysis it was checking. A function named hasRole would match the word "hasRole" in a comment, a string literal, or an unrelated variable. The agent would incorrectly mark it as alive.
The lesson: don't verify a precise tool with a less precise tool. Graph-based reachability is strictly more accurate than text search for determining if code is reachable.
Two layers of invisible caching
After implementing parser improvements (barrel re-export filtering, 7 new pipeline phases, class rescue patterns), we ran the benchmark expecting dramatic improvement. The numbers were identical to the previous run.
It took investigation to discover why: the benchmark had two layers of result caching. A local file cache keyed on the zip hash short-circuited the API call entirely. Even when we busted through that, the API's server-side idempotency cache returned the old parser's results because the input hadn't changed (same repo, same commit, same zip).
We had to clear the local cache AND change the idempotency key to actually measure the improved parser. Without this, we would have published results that showed "no improvement" when the improvements were real but unmeasured.
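A sketch of why identical inputs short-circuited both layers. The key schemes below are illustrative of the failure mode, not the actual implementation:

```python
import hashlib

def local_cache_key(zip_bytes: bytes) -> str:
    # Layer 1 (local file cache): keyed purely on input content.
    # A parser upgrade changes the *output*, not this key, so a
    # stale result is served until the cache is cleared.
    return hashlib.sha256(zip_bytes).hexdigest()

def idempotency_key(commit: str, parser_version: str) -> str:
    # Layer 2 (server-side): same repo + same commit used to mean
    # "same request". Folding the parser version into the key makes
    # a new parser look like a new request.
    return f"{commit}:{parser_version}"

zip_bytes = b"same repo, same commit, same zip"
assert local_cache_key(zip_bytes) == local_cache_key(zip_bytes)
assert idempotency_key("abc123", "parser-v2") != idempotency_key("abc123", "parser-v3")
```

The general lesson: any cache keyed only on inputs silently hides improvements to the system that processes those inputs.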
What honest benchmarking looks like
These mistakes taught us that benchmark infrastructure has as many failure modes as the system being benchmarked. Our checklist now includes:
- Cache invalidation: Clear all analysis caches when the parser changes
- Prompt isolation: The benchmark prompt must not introduce behaviors (like grep verification) that interact with what we're measuring
- Agent behavior logging: Always inspect the agent's transcript, not just the final numbers
- A/B discipline: Change one variable at a time (parser version, prompt, agent model) or you can't attribute results
All of our benchmark data, including the runs where we got it wrong, is available in our benchmark repository. Transparency about methodology matters more than impressive numbers.
Scream tests: validating beyond PR ground truth
One thing to note about our benchmarks: our ground truth only captures dead code that a human developer explicitly removed in a PR. In a multi-million line project, there could be lots of dead code that a targeted PR missed. In earlier benchmarks with low precision, some of our "false positives" may have been genuinely dead code that the human developer didn't catch.
We've begun performing "scream test verification": systematically deleting reported dead code candidates, then running the project build and CI suite to confirm that things are truly dead. If the tests still pass after deletion, the candidate was genuinely dead regardless of whether a human had flagged it. Early scream test results are consistent with our current precision numbers and have revealed dead code that humans missed.
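A scream test harness can be sketched in a few lines. The file-level deletion and the npm test command below are illustrative; our actual runs batch deletions and run the build as well as the test suite:

```python
import subprocess
from pathlib import Path

def scream_test(repo: Path, candidate: Path, test_cmd=("npm", "test")) -> bool:
    """Delete a dead-code candidate, run the suite, restore the file.

    Returns True when the suite still passes after deletion --
    i.e. nobody screamed, so the candidate was genuinely dead.
    """
    original = candidate.read_bytes()
    try:
        candidate.unlink()
        result = subprocess.run(test_cmd, cwd=repo, capture_output=True)
        return result.returncode == 0
    finally:
        # Always restore, whether the candidate survived or not
        candidate.write_bytes(original)
```

The restore-in-finally matters: a scream test must be non-destructive so candidates can be evaluated independently, then removed in a single reviewed change.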
Try It
The Supermodel API is available today. Generate a dead code analysis for your codebase:
```bash
# Create a repo archive
cd /path/to/repo
git archive -o /tmp/repo.zip HEAD

# Analyze (via the dead code endpoint)
curl -X POST "https://api.supermodeltools.com/v1/analysis/dead-code" \
  -H "X-Api-Key: $SUPERMODEL_API_KEY" \
  -H "Idempotency-Key: $(git rev-parse --short HEAD)" \
  -F "file=@/tmp/repo.zip"
```
Or use the Supermodel MCP server to give your AI agent direct access to graph analysis in real time.
The graph endpoints (call graph, dependency graph, domain graph, parse graph) are all available through the same API. Our focus is on maintaining precise graphs and parsing so that you don't have to. If you're building agent workflows, code review tools, documentation generators, CI pipelines, or factory orchestration, anything that needs structural understanding of a codebase, we want to be the graph layer you build on top of.
If you have your own interpretation of how this problem or another can be better solved with graph primitives, we are happy to provide you with the raw materials to do so.
We maintain the graphs. You build the tools.
Methodology Notes
- Benchmark framework: mcpbr by Grey Newell, with Claude Code harness
- Model: Claude Opus 4.6 (claude-opus-4-6-20260330)
- Agent harness: Claude Code
- Total benchmark runs: 60+ (Feb 6 - Mar 30, 2026)
- Total cost: ~$220 across all runs (dominated by baseline agent runs)
- Repositories tested: 14 open-source projects (449 to 138K GitHub stars)
- Ground truth sources: Merged PRs with passing CI from real open-source projects
- All runs logged with timestamps, configs, full agent transcripts, and structured metrics
- Analysis engine: Supermodel MCP server (tree-sitter-based parsing, BFS reachability analysis)
Supermodel maintains precise code graph primitives so you don't have to. Get started with the API or try the MCP server.
