
Agent repair walkthrough

Step by step: what a coding agent loads with Point check-json versus a TypeScript paste, with each step labeled as CI-verified or estimated and a token count per step.

Step by step — same bug, two agent workflows

CI verified — no LLM

39/39 fixtures pass

39 fixtures · 9 feature-build · 4 repair-plan loop · check-json ~170–260 tok vs TS paste · 78–92% less context.

bun run proof:agent-repair

Full CI suite: bun test tests/agent-repair-sufficiency.test.ts tests/agent-repair-multistep.test.ts && bun run benchmark:agent-repair


Live model benchmark

Point 100% · TS 100%

May 22, 2026, 3:16 AM · gpt-4.1, claude-opus-4-6, claude-sonnet-4-6 on 13 fixtures (35 single-shot, 4 CI-only loop). Success = point check passes.

Latest proof run: 78/78 API runs pass point check.

  • Claude Opus 4.6: Point 13/13 · TS 13/13
  • Claude Sonnet 4.6: Point 13/13 · TS 13/13
  • GPT-4.1: Point 13/13 · TS 13/13
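
One reading consistent with the totals above is that each of the 3 models ran each of the 13 fixtures once per workflow. The sketch below enumerates that grid. The 3 × 13 × 2 breakdown is an inference from the reported numbers, not something the benchmark output states, and `countRuns` is an illustrative name.

```typescript
// Assumption: one API run per (model, fixture, workflow) combination.
const models = ["gpt-4.1", "claude-opus-4-6", "claude-sonnet-4-6"];
const workflows = ["point", "ts"];
const fixtureCount = 13;

function countRuns(modelCount: number, fixtures: number, workflowCount: number): number {
  return modelCount * fixtures * workflowCount;
}

const totalRuns = countRuns(models.length, fixtureCount, workflows.length);
console.log(totalRuns); // 78
```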

Pick a CI fixture below. Each column shows what the agent sees at every step, how many tokens that step costs, and whether we prove it in CI or illustrate it from typical chat pastes.

What is actually verified?

Claim | Point workflow | TypeScript paste workflow
Repair without an LLM | Yes: golden line → point check in CI | No: no equivalent tsc→fix CI test
Context size measured | Yes: real check-json on fixtures | Estimated: paste heuristic, not a logged trace
Model can repair with this context | Live API: GPT-4.1 and Claude, point check gate | Live API: same .point file, larger paste prompt
  • CI verified: Runs in the point repo on every bun test, using real check-json, the golden line, and point check. No LLM.
  • Estimated: Illustrative TS agent paste sizes (~4 chars/token). Not captured from a live Cursor trace.
  • Live model: Real API calls; success = point check passes on the applied line.
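
The "Estimated" label rests on the ~4 chars/token heuristic. A minimal sketch of that conversion, applied to the two character counts quoted later in this walkthrough (1,051 chars of check-json output and an illustrative ~12,000-char paste). `estimateTokens` and `CHARS_PER_TOKEN` are illustrative names, and real tokenizers will give somewhat different counts.

```typescript
// Rough token estimate from a character count, using the ~4 chars/token heuristic.
const CHARS_PER_TOKEN = 4;

function estimateTokens(charCount: number): number {
  return Math.round(charCount / CHARS_PER_TOKEN);
}

const pointTokens = estimateTokens(1051);  // check-json output for the fixture
const pasteTokens = estimateTokens(12000); // illustrative TS paste size
console.log(pointTokens, pasteTokens); // 263 3000
```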

Fixture: tests/fixtures/agent-repair/unknown-field-broken.point — same broken file both workflows must repair. Model-eval context sizes: ~263 Point vs ~3000 TS for this case.

Agent task: Fix a typo in an existing launch readiness rule.

Point context: ~263 tokens · one turn
vs
TS paste: ~3,000 tokens · one turn
Saved: 91% · same repair task
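
The 91% figure follows from the two context sizes. A short sketch of the arithmetic, assuming the percentage is computed as one minus the ratio of the two token counts and rounded to the nearest whole percent (`percentSaved` is an illustrative name):

```typescript
// Percentage of context saved when `smaller` tokens replace `larger` tokens.
function percentSaved(smaller: number, larger: number): number {
  if (larger <= 0) {
    throw new RangeError("larger must be positive");
  }
  return Math.round((1 - smaller / larger) * 100);
}

console.log(percentSaved(263, 3000)); // 91
```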

With Point

Point agent loop

check-json → patch one line → point check

  1. Run point check-json (+~263 tok)

    Agent (or you) runs the CLI. Compiler returns structured JSON — no file paste.

    CI verified

    What the agent reads

    {
      "schemaVersion": "point.core.check.v1",
      "ok": false,
      "diagnostics": [
        {
          "code": "unknown-field",
          "message": "Unknown field unknownField on LaunchSignals",
          "path": "fn.launchReadinessScore.if.condition",
          "ref": "point://semantic/Math/rule.launch readiness",
          "severity": "error",
          "span": {
            "start": {
              "line": 12,
              "column": 1,
              "offset": 207
            },
            "end": {
              "line": 12,
              "column": 36,
              "offset": 242
            }
          },
          "expected": [
            "has bundle id",
            "submitted for review",
            "has passing tests"
          ],
          "actual": "unknownField",
          "repair": "Use one of: has bundle id, submitted for review, has passing tests.",
          "relatedRefs": [
            "point://semantic/Math/record.Launch Signals.field.has bundle id",
            "point://semantic/Math/record.Launch Signals.field.submitted for review",
            "point://semantic/Math/record.Launch Signals.field.has passing tests"
          ]
        }
      ]
    }

    Measured from real CLI output on this fixture (1,051 chars). CI asserts context stays under 1,200 chars.

  2. Patch the line at ref (+0 tok)

    Agent picks a field from expected, replaces line 12. No repo search — ref and repair tell it where.

    CI verified

    One-line change

    -   add 30 when signals.unknown field
    +   add 30 when signals.has bundle id

    CI applies the golden line from the fixed fixture — no LLM — and point check passes.

  3. Run point check (+0 tok)

    Same gate as CI. If check passes, the repair loop is done.

    CI verified

    Terminal

    $ point check tests/fixtures/agent-repair/unknown-field-broken.point
    Point core check passed: tests/fixtures/agent-repair/unknown-field-broken.point
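
Steps 1 and 2 above can be sketched as: parse the diagnostic, read the line number and candidate fields, and replace that one line. The interfaces are inferred from the sample output in step 1; which fields are optional is an assumption, and `replaceLine` plus the stand-in file contents are hypothetical.

```typescript
// Shapes inferred from the sample check-json output; optionality is an assumption.
interface Position { line: number; column: number; offset: number }
interface Diagnostic {
  code: string;
  message: string;
  severity: string;
  span: { start: Position; end: Position };
  expected?: string[];
  actual?: string;
}
interface CheckResult { schemaVersion: string; ok: boolean; diagnostics: Diagnostic[] }

// Replace one 1-indexed line and leave every other line untouched.
function replaceLine(source: string, lineNumber: number, newLine: string): string {
  const lines = source.split("\n");
  if (lineNumber < 1 || lineNumber > lines.length) {
    throw new RangeError(`line ${lineNumber} is outside the file`);
  }
  lines[lineNumber - 1] = newLine;
  return lines.join("\n");
}

// Abbreviated copy of the diagnostic shown in step 1.
const raw = `{"schemaVersion":"point.core.check.v1","ok":false,"diagnostics":[
  {"code":"unknown-field","message":"Unknown field unknownField on LaunchSignals",
   "severity":"error",
   "span":{"start":{"line":12,"column":1,"offset":207},"end":{"line":12,"column":36,"offset":242}},
   "expected":["has bundle id","submitted for review","has passing tests"],
   "actual":"unknownField"}]}`;

const result = JSON.parse(raw) as CheckResult;
const first = result.diagnostics[0];

// Stand-in file: 11 placeholder lines, then the broken line 12 from the diff.
const placeholders = Array.from({ length: 11 }, () => "(placeholder line)");
const broken = [...placeholders, "  add 30 when signals.unknown field"].join("\n");

// Pick the first candidate, as in the one-line change shown in step 2.
const candidate = first.expected?.[0] ?? "";
const fixed = replaceLine(broken, first.span.start.line, `  add 30 when signals.${candidate}`);
console.log(fixed.split("\n")[11]);
```

The line number comes straight from `span.start.line`, so nothing in this sketch searches the file for the typo.
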
Context this turn: ~263 tokens
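
The whole check-json → patch → point check loop can be sketched as a small driver. This is a sketch under stated assumptions: the checker and patcher are injected functions standing in for the CLI and for the agent's edit, so nothing here calls the real `point` binary, and `repairLoop`, `fakeCheck`, and `pickFirst` are hypothetical names.

```typescript
interface Finding { line: number; expected: string[] }
interface CheckOutcome { ok: boolean; findings: Finding[] }
type Checker = (source: string) => CheckOutcome;           // stand-in for point check-json
type Patcher = (line: string, finding: Finding) => string; // stand-in for the agent's edit

// Check, patch each flagged line, and re-check until clean or out of turns.
function repairLoop(source: string, check: Checker, patch: Patcher, maxTurns = 3) {
  let current = source;
  for (let turn = 0; turn < maxTurns; turn++) {
    const outcome = check(current);
    if (outcome.ok) {
      return { source: current, turns: turn, ok: true };
    }
    const lines = current.split("\n");
    for (const finding of outcome.findings) {
      const old = lines[finding.line - 1];
      if (old !== undefined) {
        lines[finding.line - 1] = patch(old, finding);
      }
    }
    current = lines.join("\n");
  }
  return { source: current, turns: maxTurns, ok: check(current).ok };
}

// Demonstration-only checker: flags any line that mentions the bad field.
const fakeCheck: Checker = (source) => {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    if (text.includes("signals.unknown field")) {
      findings.push({
        line: index + 1,
        expected: ["has bundle id", "submitted for review", "has passing tests"],
      });
    }
  });
  return { ok: findings.length === 0, findings };
};

// Demonstration-only patcher: swaps in the first expected field name.
const pickFirst: Patcher = (line, finding) =>
  line.replace("unknown field", finding.expected[0] ?? "unknown field");

const outcome = repairLoop(
  "rule launch readiness\n  add 30 when signals.unknown field",
  fakeCheck,
  pickFirst,
);
console.log(outcome.ok, outcome.turns);
```

With these stand-ins the loop finishes after a single patch turn, matching the one-turn flow described above.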

Without Point

TypeScript + chat paste

tsc error → paste files → guess → often retry

  1. Run tsc: error only (+~31 tok)

    Compiler returns a line number in emitted JS/TS. No list of valid field names.

    Estimated

    What the agent reads first

    error TS2339: Property 'unknownField' does not exist on type 'LaunchSignals'.
      at launchReadinessScore (lib/math.ts:18:15)

    Real error shape; token count from this message only.

  2. Paste surrounding code (+~2,969 tok)

    Typical Cursor/Codex workflow: paste component, lib, tests — agent hunts for the typo.

    Estimated

    Representative paste (truncated)

    // ReadinessPanel.tsx — excerpt (~320 lines total in real repos)
    import { useMemo } from "react";
    import type { LaunchSignals } from "../../types";
    import { launchReadinessScore, scoreStatusLabel } from "../../lib/math";
    
    export function ReadinessPanel({ signals }: { signals: LaunchSignals }) {
      const score = useMemo(() => launchReadinessScore(signals), [signals]);
      const label = scoreStatusLabel(score);
      return (<section><h2>Launch readiness</h2><p>Score: {score} — {label}</p></section>);
    }
    
    // lib/math.ts — agent often pastes this too when tsc fails
    export function launchReadinessScore(signals: LaunchSignals): number {
      let score = 0;
      if (signals.unknownField) score += 30;
      if (signals.submittedForReview) score += 40;
      if (signals.hasPassingTests) score += 30;
      return score;
    }

    Illustrative ~12,000-char paste heuristic. Not from a logged agent session.

  3. Wrong fix → same paste again (+~3,000 tok)

    If the model emits camelCase or patches the wrong file, the next turn reloads the same context.

    Estimated

    Second turn (common)

    Same ~3,000 tokens pasted again + new error output

    Not CI-tested. Shown because retry loops are common without structured repair hints.
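
Step 1 of the TypeScript column gives the agent a property name and a type name, but no list of valid fields. A minimal sketch of what can be extracted from that message; the regular expression and `parseMissingProperty` are illustrative, not part of tsc.

```typescript
// Matches the TS2339 message text shown in step 1.
const MISSING_PROPERTY =
  /error TS2339: Property '([^']+)' does not exist on type '([^']+)'\./;

interface MissingProperty { property: string; typeName: string }

function parseMissingProperty(output: string): MissingProperty | null {
  const match = MISSING_PROPERTY.exec(output);
  if (match === null) {
    return null;
  }
  return { property: match[1], typeName: match[2] };
}

const parsed = parseMissingProperty(
  "error TS2339: Property 'unknownField' does not exist on type 'LaunchSignals'.",
);
console.log(parsed);
```

The parsed result names the wrong property and its type, but nothing in it corresponds to the `expected` list in the check-json output, so the agent still has to find the valid field names elsewhere.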

Context this turn: ~3,000 tokens
After one wrong guess: ~6,000 tokens
Step breakdown sums to ~6,000 (model-eval measured total shown above)
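
The ~6,000 figure can be checked from the per-step estimates. A short sketch of the arithmetic, assuming each wrong guess reloads the same ~3,000-token context (`totalTokens` and `contextAfterRetries` are illustrative names):

```typescript
// Per-step estimates from the TypeScript column: tsc error, first paste, second paste.
const stepTokens = [31, 2969, 3000];

function totalTokens(steps: number[]): number {
  return steps.reduce((sum, tokens) => sum + tokens, 0);
}

// Assumption: every wrong guess costs one more full-context turn.
function contextAfterRetries(perTurn: number, wrongGuesses: number): number {
  return perTurn * (1 + wrongGuesses);
}

console.log(totalTokens(stepTokens));      // 6000
console.log(contextAfterRetries(3000, 1)); // 6000
```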

Reproduce CI proof: bun test tests/agent-repair-sufficiency.test.ts tests/agent-repair-multistep.test.ts && bun run benchmark:agent-repair · Fixtures: tests/fixtures/agent-repair/