Step by step — same bug, two agent workflows
CI verified — no LLM
39/39 fixtures pass
39 fixtures · 9 feature-build · 4 repair-plan loop · check-json ~170–260 tok vs TS paste · 78–92% less context.
bun run proof:agent-repair
Full CI suite: bun test tests/agent-repair-sufficiency.test.ts tests/agent-repair-multistep.test.ts && bun run benchmark:agent-repair
Live model benchmark
Point 100% · TS 100%
May 22, 2026, 3:16 AM · gpt-4.1, claude-opus-4-6, claude-sonnet-4-6 on 13 fixtures (35 single-shot, 4 CI-only loop). Success = point check passes.
Latest proof run: 78/78 API runs pass point check.
- Claude Opus 4.6: Point 13/13 · TS 13/13
- Claude Sonnet 4.6: Point 13/13 · TS 13/13
- GPT-4.1: Point 13/13 · TS 13/13
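The 78/78 figure is consistent with the per-model list above if every model ran every fixture once under each workflow. The page does not state that breakdown outright, so treat the sketch below as an assumption; the `ModelResult` shape and the `tally` helper are hypothetical names used only for illustration.

```typescript
// Hypothetical tally of the per-model results listed above.
type WorkflowScore = { passed: number; total: number };
type ModelResult = { model: string; point: WorkflowScore; ts: WorkflowScore };

const modelResults: ModelResult[] = [
  { model: "Claude Opus 4.6", point: { passed: 13, total: 13 }, ts: { passed: 13, total: 13 } },
  { model: "Claude Sonnet 4.6", point: { passed: 13, total: 13 }, ts: { passed: 13, total: 13 } },
  { model: "GPT-4.1", point: { passed: 13, total: 13 }, ts: { passed: 13, total: 13 } },
];

// Sum passes and totals across both workflows for every model.
function tally(rows: ModelResult[]): WorkflowScore {
  return rows.reduce(
    (acc, r) => ({
      passed: acc.passed + r.point.passed + r.ts.passed,
      total: acc.total + r.point.total + r.ts.total,
    }),
    { passed: 0, total: 0 },
  );
}

const overall = tally(modelResults);
console.log(`${overall.passed}/${overall.total} API runs pass point check`);
```

Three models times 13 fixtures times two workflows gives the 78 runs reported.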
Pick a CI fixture below. Each column shows what the agent sees at every step, how many tokens that step costs, and whether we prove it in CI or illustrate it from typical chat pastes.
What is actually verified?
| Claim | Point workflow | TypeScript paste workflow |
|---|---|---|
| Repair without an LLM | Yes — golden line → point check in CI | No — no equivalent tsc→fix CI test |
| Context size measured | Yes — real check-json on fixtures | Estimated — paste heuristic, not a logged trace |
| Model can repair with this context | Live API: GPT-4.1 and Claude, point check gate | Live API: same .point file, larger paste prompt |
- CI verified: runs in the Point repo on every bun test, using real check-json output, the golden line, and point check. No LLM.
- Estimated: illustrative TS agent paste sizes (~4 chars/token). Not captured from a live Cursor trace.
- Live model: real API calls; success = point check passes on the applied line.
Fixture: tests/fixtures/agent-repair/unknown-field-broken.point — same broken file both workflows must repair. Model-eval context sizes: ~263 Point vs ~3000 TS for this case.
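The two context sizes can be reproduced from the character counts this page reports for the fixture (1,051 chars of check-json output and an illustrative ~12,000-char TS paste). The sketch below assumes the ~4 chars/token heuristic applies to both sides; `estimateTokens` and `contextReduction` are hypothetical helpers, not part of the Point CLI.

```typescript
// Assumption: tokens are approximated as characters / 4, the heuristic
// this page cites for paste sizes. A real tokenizer will differ.
const CHARS_PER_TOKEN = 4;

function estimateTokens(chars: number): number {
  return Math.round(chars / CHARS_PER_TOKEN);
}

// Fraction of context saved by the smaller prompt relative to the larger one.
function contextReduction(smallTokens: number, largeTokens: number): number {
  return 1 - smallTokens / largeTokens;
}

const pointTokens = estimateTokens(1051); // check-json output for this fixture
const tsTokens = estimateTokens(12000); // illustrative TS paste
const reduction = contextReduction(pointTokens, tsTokens);

console.log(`Point ~${pointTokens} tok, TS ~${tsTokens} tok`);
console.log(`${(reduction * 100).toFixed(1)}% less context`);
```

Under that heuristic the fixture lands at roughly 263 versus 3,000 tokens, which falls inside the 78–92% range quoted at the top of the page.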
Agent task: Fix a typo in an existing launch readiness rule.
With Point
Point agent loop
check-json → patch one line → point check
- 1. Run point check-json (+~263 tok)
Agent (or you) runs the CLI. Compiler returns structured JSON — no file paste.
CI verified · What the agent reads

    {
      "schemaVersion": "point.core.check.v1",
      "ok": false,
      "diagnostics": [
        {
          "code": "unknown-field",
          "message": "Unknown field unknownField on LaunchSignals",
          "path": "fn.launchReadinessScore.if.condition",
          "ref": "point://semantic/Math/rule.launch readiness",
          "severity": "error",
          "span": {
            "start": { "line": 12, "column": 1, "offset": 207 },
            "end": { "line": 12, "column": 36, "offset": 242 }
          },
          "expected": [
            "has bundle id",
            "submitted for review",
            "has passing tests"
          ],
          "actual": "unknownField",
          "repair": "Use one of: has bundle id, submitted for review, has passing tests.",
          "relatedRefs": [
            "point://semantic/Math/record.Launch Signals.field.has bundle id",
            "point://semantic/Math/record.Launch Signals.field.submitted for review",
            "point://semantic/Math/record.Launch Signals.field.has passing tests"
          ]
        }
      ]
    }

Measured from real CLI output on this fixture (1,051 chars). CI asserts context stays under 1,200 chars.
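A consumer of this output might type it and enforce the same size budget before handing it to an agent. The sketch below infers the field names from the single sample above; which fields are optional, and whether other diagnostic codes share this shape, is an assumption. `parseCheckJson` is a hypothetical helper.

```typescript
// Field names are copied from the sample output; optionality is assumed.
interface SourcePosition { line: number; column: number; offset: number }

interface PointDiagnostic {
  code: string;
  message: string;
  path: string;
  ref: string;
  severity: string;
  span: { start: SourcePosition; end: SourcePosition };
  expected?: string[];
  actual?: string;
  repair?: string;
  relatedRefs?: string[];
}

interface CheckResult {
  schemaVersion: string;
  ok: boolean;
  diagnostics: PointDiagnostic[];
}

// Hypothetical helper: enforce the character budget, then parse.
function parseCheckJson(raw: string, maxChars: number = 1200): CheckResult {
  if (raw.length >= maxChars) {
    throw new Error(`context is ${raw.length} chars; budget is under ${maxChars}`);
  }
  // No runtime validation here; a real consumer would check the shape.
  return JSON.parse(raw) as CheckResult;
}

// Abbreviated copy of the sample above (relatedRefs omitted for brevity).
const rawCheckJson = JSON.stringify({
  schemaVersion: "point.core.check.v1",
  ok: false,
  diagnostics: [{
    code: "unknown-field",
    message: "Unknown field unknownField on LaunchSignals",
    path: "fn.launchReadinessScore.if.condition",
    ref: "point://semantic/Math/rule.launch readiness",
    severity: "error",
    span: {
      start: { line: 12, column: 1, offset: 207 },
      end: { line: 12, column: 36, offset: 242 },
    },
    expected: ["has bundle id", "submitted for review", "has passing tests"],
    actual: "unknownField",
    repair: "Use one of: has bundle id, submitted for review, has passing tests.",
  }],
});

const checkResult = parseCheckJson(rawCheckJson);
const firstDiagnostic = checkResult.diagnostics[0];
console.log(firstDiagnostic?.span.start.line, firstDiagnostic?.expected);
```

The agent needs only `span`, `expected`, and `repair` to act; the rest helps locate the rule without searching the repo.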
- 2. Patch the line at ref (+0 tok)
Agent picks a field from expected, replaces line 12. No repo search — ref and repair tell it where.
CI verified · One-line change

    - add 30 when signals.unknown field
    + add 30 when signals.has bundle id
CI applies the golden line from the fixed fixture — no LLM — and point check passes.
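The patch step amounts to replacing a single line by number. A minimal sketch follows, assuming the `span` line numbers are 1-indexed (the sample reports line 12, column 1, which suggests so). `replaceLine` is a hypothetical helper, and the placeholder lines around the rule are not Point syntax.

```typescript
// Hypothetical helper: replace one 1-indexed line and leave the rest untouched.
function replaceLine(source: string, lineNumber: number, replacement: string): string {
  const lines = source.split("\n");
  if (lineNumber < 1 || lineNumber > lines.length) {
    throw new RangeError(`line ${lineNumber} is outside 1..${lines.length}`);
  }
  lines[lineNumber - 1] = replacement;
  return lines.join("\n");
}

// Toy three-line file: only the middle line comes from the fixture diff.
const brokenSource = [
  "<placeholder line 1>",
  "add 30 when signals.unknown field",
  "<placeholder line 3>",
].join("\n");

const fixedSource = replaceLine(brokenSource, 2, "add 30 when signals.has bundle id");
console.log(fixedSource);
```

In the real fixture the target would be line 12, taken from `span.start.line` in the diagnostic.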
- 3. Run point check (+0 tok)
Same gate as CI. If check passes, the repair loop is done.
CI verified · Terminal

    $ point check tests/fixtures/agent-repair/unknown-field-broken.point
    Point core check passed: tests/fixtures/agent-repair/unknown-field-broken.point
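The three steps above (check-json, patch one line, point check) can be sketched as a bounded loop. Every name here is hypothetical: the callbacks stand in for the CLI calls so that the sketch assumes nothing about their flags or exit codes, and the in-memory "file" is a single line.

```typescript
// Hypothetical shapes; the callbacks stand in for the real CLI calls.
interface LoopDiagnostic { line: number; expected: string[] }

interface LoopDeps {
  checkJson: () => LoopDiagnostic[]; // stand-in for `point check-json`
  patch: (d: LoopDiagnostic) => void; // apply a one-line fix
  check: () => boolean; // stand-in for `point check`
}

function repairLoop(deps: LoopDeps, maxRounds: number = 3): { ok: boolean; rounds: number } {
  for (let round = 1; round <= maxRounds; round++) {
    const [first] = deps.checkJson();
    // Nothing to act on: fall through to the final gate.
    if (first === undefined) return { ok: deps.check(), rounds: round - 1 };
    deps.patch(first);
    if (deps.check()) return { ok: true, rounds: round };
  }
  return { ok: false, rounds: maxRounds };
}

// In-memory simulation of the fixture's single broken line.
const validFields = ["has bundle id", "submitted for review", "has passing tests"];
let currentLine = "add 30 when signals.unknown field";
const lineIsValid = (): boolean =>
  validFields.some((f) => currentLine === `add 30 when signals.${f}`);

const outcome = repairLoop({
  checkJson: () => (lineIsValid() ? [] : [{ line: 12, expected: validFields }]),
  patch: (d) => {
    const pick = d.expected[0]; // naive choice: first candidate
    if (pick !== undefined) currentLine = `add 30 when signals.${pick}`;
  },
  check: lineIsValid,
});

console.log(outcome);
```

Picking the first entry of `expected` is a simplification; an agent would choose the candidate that matches the rule's intent.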
Without Point
TypeScript + chat paste
tsc error → paste files → guess → often retry
- 1. Run tsc: error only (+~31 tok)
Compiler returns a line number in emitted JS/TS. No list of valid field names.
Estimated · What the agent reads first

    error TS2339: Property 'unknownField' does not exist on type 'LaunchSignals'.
        at launchReadinessScore (lib/math.ts:18:15)
Real error shape; token count from this message only.
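For comparison, this is roughly everything an agent can extract from that error text. The regular expressions below are written only for the message shape shown above and are not a general tsc output parser; `parseUnknownProperty` is a hypothetical helper.

```typescript
interface TscUnknownProperty {
  code: string;
  property: string;
  typeName: string;
  file: string;
  line: number;
  column: number;
}

// Hypothetical helper: matches only the TS2339 message shape shown above.
function parseUnknownProperty(text: string): TscUnknownProperty | null {
  const msg = /error (TS\d+): Property '([^']+)' does not exist on type '([^']+)'/.exec(text);
  const loc = /\(([^():]+):(\d+):(\d+)\)/.exec(text);
  if (msg === null || loc === null) return null;
  const [, code, property, typeName] = msg;
  const [, file, line, column] = loc;
  if (code === undefined || property === undefined || typeName === undefined) return null;
  if (file === undefined || line === undefined || column === undefined) return null;
  return { code, property, typeName, file, line: Number(line), column: Number(column) };
}

const tscText =
  "error TS2339: Property 'unknownField' does not exist on type 'LaunchSignals'.\n" +
  "    at launchReadinessScore (lib/math.ts:18:15)";

const parsedTsc = parseUnknownProperty(tscText);
console.log(parsedTsc);
```

The result carries a code, a property name, a type name, and a location, but nothing equivalent to the `expected` list in the check-json output, so the agent has to read the type definition to find valid names.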
- 2. Paste surrounding code (+~2,969 tok)
Typical Cursor/Codex workflow: paste component, lib, tests — agent hunts for the typo.
Estimated · Representative paste (truncated)

    // ReadinessPanel.tsx — excerpt (~320 lines total in real repos)
    import { useMemo } from "react";
    import type { LaunchSignals } from "../../types";
    import { launchReadinessScore, scoreStatusLabel } from "../../lib/math";

    export function ReadinessPanel({ signals }: { signals: LaunchSignals }) {
      const score = useMemo(() => launchReadinessScore(signals), [signals]);
      const label = scoreStatusLabel(score);
      return (<section><h2>Launch readiness</h2><p>Score: {score} — {label}</p></section>);
    }

    // lib/math.ts — agent often pastes this too when tsc fails
    export function launchReadinessScore(signals: LaunchSignals): number {
      let score = 0;
      if (signals.unknownField) score += 30;
      if (signals.submittedForReview) score += 40;
      if (signals.hasPassingTests) score += 30;
      return score;
    }

Illustrative ~12,000-char paste heuristic. Not from a logged agent session.
- 3. Wrong fix → same paste again (+~3,000 tok)
If the model emits camelCase or patches the wrong file, the next turn reloads the same context.
Estimated · Second turn (common)
Same ~3,000 tokens pasted again + new error output
Not CI-tested. Shown because retry loops are common without structured repair hints.
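The per-step figures in both columns add up as follows. This is a sketch using only the token counts quoted in the walkthrough; the retry count is a free parameter, the extra error output on a retry is ignored, and the Point column is assumed to succeed in one pass as it does in the CI fixture.

```typescript
// Per-step token figures quoted in the walkthrough above.
const pointSteps = [263, 0, 0]; // check-json, patch, point check
const tsFirstAttempt = [31, 2969]; // tsc error, pasted code
const tsRetryCost = 3000; // same paste again, per retry (estimated)

function sum(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0);
}

// Total TS-paste context after a given number of failed attempts.
function tsTotal(retries: number): number {
  return sum(tsFirstAttempt) + retries * tsRetryCost;
}

for (const retries of [0, 1, 2]) {
  console.log(`retries=${retries}: Point ${sum(pointSteps)} tok, TS ${tsTotal(retries)} tok`);
}
```

Each retry in the paste workflow adds roughly the full first-turn context again, while the Point column stays at the size of one check-json response.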
Reproduce CI proof: bun test tests/agent-repair-sufficiency.test.ts tests/agent-repair-multistep.test.ts && bun run benchmark:agent-repair · Fixtures: tests/fixtures/agent-repair/
