Score

Average score per attempt row (0–1). Rewards partial progress.

Formula: Mean of all attempt scores across all results rows: SUM(score) / COUNT(*) over the results table.

Ranks models that make consistent partial progress on hard tasks. On this metric, a model that scores 0.5 on every task ties with one that passes half the tasks outright and fails the rest.

69.54
Tasks passed: 53/64 (1st attempt: 46 · 2nd attempt: 7 · failed: 11)
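The Score formula can be sketched in a few lines, assuming a `results` list where every attempt row carries a 0–1 `score` field (field and variable names are illustrative, not the benchmark's actual schema):

```python
# Score = SUM(score) / COUNT(*) over all attempt rows (0-1 scale).
def mean_score(results: list[dict]) -> float:
    return sum(r["score"] for r in results) / len(results)

# Partial credit counts: two half-credit rows average the same as
# one outright pass plus one outright fail.
rows = [{"score": 1.0}, {"score": 0.5}, {"score": 0.0}, {"score": 0.5}]
mean_score(rows)  # 0.5
```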
Cost / run

Average total LLM cost per benchmark run in USD.

Formula: SUM(cost_usd) / run_count across all runs for this model.

Use to compare operating cost across models with similar pass rates. Does not account for quality. Combine with $/Pass for a cost-efficiency view.

$0.36
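Cost / run reduces to a plain average over run totals. A minimal sketch, assuming each run record exposes a `cost_usd` total (names and figures are illustrative):

```python
def cost_per_run(runs: list[dict]) -> float:
    # SUM(cost_usd) / run_count across all runs for the model.
    return sum(r["cost_usd"] for r in runs) / len(runs)

runs = [{"cost_usd": 0.30}, {"cost_usd": 0.36}, {"cost_usd": 0.42}]
cost_per_run(runs)  # ~0.36
```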
Latency p50

Median per-task wall time (LLM call + compile + test). Recorded in milliseconds (duration_ms); displayed as minutes and seconds.

Formula: 50th percentile of per-task duration_ms: LLM latency + compile time + test time.

Use p50 for a typical-case latency expectation; as a median, it is unaffected by outlier slow tasks.

2m 30s
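The p50 here is an ordinary median over per-task durations, which Python's standard library computes directly (the duration values below are made up):

```python
import statistics

def latency_p50(durations_ms: list[int]) -> float:
    # Median per-task wall time: LLM latency + compile time + test time.
    return statistics.median(durations_ms)

# One pathological 15-minute task does not move the median at all.
durations = [90_000, 150_000, 150_000, 160_000, 900_000]
latency_p50(durations)  # 150000 ms (2m 30s)
```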
Pass Rate

Fraction of distinct tasks solved in any attempt across all runs.

Formula: (tasks_passed_attempt_1 + tasks_passed_attempt_2_only) / tasks_attempted_distinct

Primary ranking metric. Compare models here first: it directly measures how often the model delivers working code.

82.8% (95% CI: 71.8–90.1%)
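Pass Rate needs only distinct-task bookkeeping: a task counts as passed if any attempt in any run passed. A sketch over hypothetical `(task_id, passed)` attempt records:

```python
def pass_rate(attempts: list[tuple[str, bool]]) -> float:
    # Distinct tasks with at least one passing attempt / distinct tasks tried.
    attempted = {task for task, _ in attempts}
    passed = {task for task, ok in attempts if ok}
    return len(passed) / len(attempted)

attempts = [
    ("T1", False), ("T1", True),   # passed on the second attempt
    ("T2", True),
    ("T3", False), ("T4", False),
]
pass_rate(attempts)  # 2 of 4 distinct tasks -> 0.5
```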
pass^n (strict)

Fraction of tasks the model solved in every single run (strict consistency).

Formula: tasks where ALL runs produced a passing result / tasks_attempted_distinct

Measures reliability under repetition. A high pass^n means the model is unlikely to regress on a re-run, which matters for CI and production use.

76.6%
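pass^n tightens the quantifier from "any run" to "every run". A sketch, assuming per-run pass maps keyed by task (the structure and names are illustrative):

```python
def pass_n_strict(runs: dict[str, dict[str, bool]]) -> float:
    # Tasks that passed in every run / all distinct tasks attempted.
    tasks = set().union(*(per_run.keys() for per_run in runs.values()))
    stable = [t for t in tasks
              if all(per_run.get(t, False) for per_run in runs.values())]
    return len(stable) / len(tasks)

runs = {
    "run1": {"T1": True, "T2": True,  "T3": False},
    "run2": {"T1": True, "T2": False, "T3": True},
}
pass_n_strict(runs)  # only T1 passes in both runs -> 1/3
```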
$/Pass

Total cost divided by number of distinct tasks passed. Lower is better.

Formula: SUM(cost_usd) / tasks_passed_distinct across all runs.

Best single cost-efficiency metric. Penalises expensive models that pass few tasks and rewards cheap models with high pass rates.

$0.4309
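$/Pass is a single division, but the denominator is distinct tasks passed, not attempts. A sketch with made-up figures:

```python
def cost_per_pass(total_cost_usd: float, tasks_passed_distinct: int) -> float:
    # SUM(cost_usd) across all runs / distinct tasks passed. Lower is better.
    return total_cost_usd / tasks_passed_distinct

cost_per_pass(21.6, 54)  # ~$0.40 per solved task
```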
Latency p95

95th-percentile per-task wall time. Captures tail latency.

Formula: 95th percentile of per-task duration_ms across all tasks in all runs.

Use p95 to understand worst-case latency. A low p95 means the model rarely stalls, which is relevant for automated pipelines with timeouts.

3m 1s
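Unlike p50, a p95 needs a percentile convention; linear interpolation between the closest order statistics is one common choice. A sketch (the durations are invented):

```python
def percentile(values: list[float], p: float) -> float:
    # Linear interpolation between closest ranks (0 <= p <= 100).
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

# One 10-second straggler dominates the tail but barely touches the median.
durations_ms = [100, 120, 130, 140, 150, 160, 170, 180, 190, 10_000]
percentile(durations_ms, 50)  # 155.0
percentile(durations_ms, 95)  # pulled far above the median by the straggler
```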

Overview

Claude Opus 4.6 has completed 6 runs, attempting 64 distinct tasks with an average score of 69.54.

Settings

Generation parameters used across this model's runs. "varies" indicates the value differed between runs.

Temperature
varies
Thinking budget
varies
Avg tokens / run
224,748
Consistency
90.6%

History

6 runs · latest 19h ago · oldest 10d ago

Cost

Cost per run over time (chart: mean and p95 series).

Failure modes

  • AL0104 (670) Syntax error, '=' expected
  • AL0111 (211) Semicolon expected. Add a semicolon (;) to terminate the statement.
  • AL0000 (123) App generation failed
  • AL0185 (80) Page '0' is missing
  • AL0224 (73) Expression expected. Provide a valid expression (variable, constant, calculation, or method call).
  • AL0107 (56) Syntax error, identifier expected. Provide a valid name (letters, digits, and underscores only).
  • AL0198 (48) Expected one of the application object keywords (table, tableextension, page, pageextension, pagecustomization, profile, profileextension, codeunit, report, reportextension, xmlport, query, controladdin, dotnet, enum, enumextension, interface, permissionset, permissionsetextension, entitlement)
  • AL0275 (48) 'Product' is an ambiguous reference between 'Product' defined by the extension 'CentralGauge_CG-AL-M001_2 by CentralGauge (1.0.0.0)' and 'Product' defined by the extension 'CG-AL-M001 Prereq by CentralGauge (1.0.0.0)'.
  • AL0118 (32) The name 'AreAllApprovalsComplete' does not exist in the current context.
  • AL0105 (29) Syntax error, identifier expected; 'key' is a keyword
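A table like the one above is a frequency count over compiler diagnostics. The tally can be sketched with Python's `collections.Counter` over hypothetical per-attempt error codes:

```python
from collections import Counter

# Each failed attempt can emit several AL diagnostic codes; tallying them
# surfaces the dominant failure modes (the codes here are illustrative).
codes = ["AL0104", "AL0111", "AL0104", "AL0000", "AL0104", "AL0111"]
Counter(codes).most_common(2)  # [('AL0104', 3), ('AL0111', 2)]
```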

Shortcomings

AL concepts Claude Opus 4.6 struggles with. Click a row for description, correct pattern, and observed error codes.

No shortcomings analyzed yet

Shortcomings analysis is on the roadmap: the first analyzer run is scheduled for the P8 release, so this section currently shows no data.


Recent runs

Started | Model | Tasks | Score | Cost | Duration | Status
------- | ----- | ----- | ----- | ---- | -------- | ------
19h ago | Claude Opus 4.6 | 300/506 | 71.88 | $3.49 | 0 ms | completed
22h ago | Claude Opus 4.6 | 300/506 | 69.12 | $3.62 | 0 ms | completed
1d ago | Claude Opus 4.6 | 300/506 | 71.88 | $3.58 | 0 ms | completed
9d ago | Claude Opus 4.6 | 300/506 | 68.83 | $3.86 | 0 ms | completed
10d ago | Claude Opus 4.6 | 300/506 | 67.21 | $4.09 | 0 ms | completed
10d ago | Claude Opus 4.6 | 300/506 | 68.38 | $4.18 | 0 ms | completed

See all 6 runs →

Methodology

Scores are computed per task, averaged across attempts. See the about page for details.