A2A OpenSkill Leaderboard

DynamoDB calbench-openskill-ratings · calbench-mixed · 648 events · cache miss (0.0s)

OpenSkill Model Rankings

Calendar MMR = 0.40 coordination + 0.35 excess-cost avoidance + 0.25 excess-VPS avoidance.
Rank  Model                                    Calendar MMR    N    Coordination (raw)    Excess-cost avoidance (raw)  Excess-VPS avoidance (raw)
#1    gemini-3.1-pro-preview                   27.610 ± 1.366  521  26.58 ± 2.44 (0.867)  27.84 ± 2.41 (25.35)         28.93 ± 1.81 (9.36)
#2    qwen/qwen3.6-plus                        25.681 ± 1.401  357  26.60 ± 2.51 (0.877)  23.40 ± 2.46 (23.20)         27.41 ± 1.85 (9.49)
#3    gpt-5.4-mini                             25.373 ± 1.351  548  23.17 ± 2.37 (0.802)  23.21 ± 2.41 (26.61)         31.93 ± 1.83 (5.88)
#4    gemini-3-flash-preview                   25.321 ± 1.334  557  25.38 ± 2.38 (0.849)  25.73 ± 2.35 (24.29)         24.64 ± 1.79 (9.63)
#5    deepseek/deepseek-v4-pro                 25.139 ± 1.407  364  25.77 ± 2.50 (0.880)  27.57 ± 2.48 (24.04)         20.72 ± 1.90 (16.01)
#6    claude-sonnet-4-6                        24.413 ± 1.340  541  24.71 ± 2.37 (0.863)  27.30 ± 2.37 (25.61)         19.89 ± 1.85 (14.60)
#7    llama-4-maverick-17b-128e-instruct-maas  23.885 ± 1.469  352  25.42 ± 2.61 (0.830)  23.67 ± 2.61 (21.66)         21.73 ± 1.94 (10.73)
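
As a rough check on the weighting stated above, the sketch below recomputes the Calendar MMR for the #1 row from its three component ratings. The weights come from the formula line; the function names and dictionary keys are illustrative only. The combined uncertainty assumes the three components are independent, so their weighted uncertainties add in quadrature. That is an assumption, not something the page states, although it reproduces the listed value to within rounding.

```python
import math

# Weights from the Calendar MMR formula line.
WEIGHTS = {"coordination": 0.40, "cost": 0.35, "vps": 0.25}

def calendar_mmr(rating):
    """Weighted sum of the three component ratings."""
    return sum(WEIGHTS[k] * rating[k] for k in WEIGHTS)

def combined_uncertainty(uncertainty):
    """Assumption: components are independent, so the weighted
    uncertainties combine as a root sum of squares."""
    return math.sqrt(sum((WEIGHTS[k] * uncertainty[k]) ** 2 for k in WEIGHTS))

# Component values for the #1 row (gemini-3.1-pro-preview).
top_rating = {"coordination": 26.58, "cost": 27.84, "vps": 28.93}
top_uncertainty = {"coordination": 2.44, "cost": 2.41, "vps": 1.81}

print(round(calendar_mmr(top_rating), 2))               # close to the listed 27.610
print(round(combined_uncertainty(top_uncertainty), 2))  # close to the listed 1.366
```

Small differences in the last digit are expected because the component values shown on the page are already rounded.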

Baseline Raw Rankings

Baselines are ranked by raw scores only: coordination ratio descending, then excess cost ascending, then excess VPS ascending.
Rank  Baseline     Coordination ratio  Excess cost  Excess VPS
#1    IMAP         1.000               14.64        12.40
#2    DSM-welfare  0.996               38.37        24.30
#3    SD-MAP       0.627               182.82       0.10
#4    DSM-private  0.473               104.04       0.00
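
The baseline ordering rule described above can be sketched as a single sort with a composite key. The tuple layout and variable names are illustrative, and the numbers are the values as read from the rows above (three decimals for the coordination ratio, two for the others).

```python
# (name, coordination ratio, excess cost, excess VPS), deliberately unsorted.
baselines = [
    ("DSM-private", 0.473, 104.04, 0.00),
    ("IMAP", 1.000, 14.64, 12.40),
    ("SD-MAP", 0.627, 182.82, 0.10),
    ("DSM-welfare", 0.996, 38.37, 24.30),
]

# Coordination ratio descending (negated), then excess cost ascending,
# then excess VPS ascending as the final tie-breaker.
ranked = sorted(baselines, key=lambda b: (-b[1], b[2], b[3]))

for rank, (name, coord, cost, vps) in enumerate(ranked, start=1):
    print(f"#{rank} {name} {coord:.3f} {cost:.2f} {vps:.2f}")
```

With these four rows the coordination ratio alone decides the order; the cost and VPS keys only matter when two baselines tie on coordination.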

Queued Live Jobs

pending 0 · running 0 · done 54 · failed 0
No pending or running live jobs.