We Open-Sourced a Reproducibility-First Benchmark for Major AI Models
A public harness, roster, seed suite, result schema, scoring path, and methodology for comparing current major AI models without hiding prompts or raw outputs.
The model market moves too fast for static comparisons.
Every few weeks, the default answer to "which model should we use?" changes. OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Qwen, Kimi, MiniMax, GLM, Cohere, and Meta are all shipping capable models with different strengths, pricing, context windows, tool behavior, and deployment constraints.
So we open-sourced a benchmark repo that starts with the thing most benchmark writeups skip: reproducibility.
The repo is here:
https://github.com/spfunctions/major-model-benchmark
It is not a leaderboard yet. That is intentional. It is a harness, roster, seed suite, result schema, scoring path, and methodology for comparing current major models on general work.
What is in the repo
The first version includes:
- A shared model roster snapshot, dated May 5, 2026
- A public seed task suite for reasoning, coding, structured output, long-context extraction, data analysis, and calibration
- JSONL task and response formats
- A deterministic scorer for exact match, label match, JSON exact, JSON subset, substring, and numeric range tasks
- Example response files
- Unit tests
- Methodology and research notes
The roster currently tracks:
- gpt-5.5
- claude-opus-4-7
- gemini-3.1-pro-preview
- grok-4.3
- deepseek-v4-pro
- mistral-medium-3.5
- command-a-reasoning-08-2025
- qwen3.6-plus
- kimi-k2.6
- MiniMax-M2.7
- glm-5
- meta-llama/Llama-4-Maverick-17B-128E-Instruct
Provider docs move quickly, so the roster records a source and status for each entry. Some IDs are stable API IDs; others are preview or partner-hosted IDs. Gemini 3.1 Pro, for example, has been announced for developer preview through the Gemini API, but the exact API model code should be rechecked before a paid benchmark run.
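As an example of what that looks like in the roster, an entry can carry its provenance next to the ID. The field names below are illustrative assumptions, not the repo's exact schema:

{
  "model_id": "gemini-3.1-pro-preview",
  "provider": "google",
  "status": "developer preview",
  "source": "provider announcement, checked 2026-05-05",
  "note": "recheck exact API model code before a paid run"
}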
That kind of caveat belongs in the benchmark, not in a footnote after the rankings.
Why another benchmark?
Most public model comparisons fail in one of three ways.
First, they hide the prompt and output details. A score without raw outputs is hard to debug and easy to overinterpret.
Second, they collapse everything into one number. A model that is excellent at coding may be mediocre at structured extraction. A model with strong reasoning may be too verbose, too expensive, or too tool-fragile for production.
Third, they do not preserve run metadata. If a model ID is an alias, then the same benchmark run next month may not mean the same thing.
The benchmark repo starts from a stricter rule: no ranking without raw outputs, scoring code, model IDs, run timestamps, and task definitions.
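Concretely, that rule means every result row has to carry enough metadata to be audited and re-run. A hypothetical scored row, with field names that are assumptions rather than the repo's exact schema:

{"model_id": "gpt-5.5", "task_id": "reasoning-001", "run_timestamp": "2026-05-05T14:03:00Z", "raw_output": "42", "score": 1.0, "scoring": "exact_match"}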
What the seed suite measures
The seed suite is deliberately small. It is not meant to settle which model is best. It exists to verify that the harness works end to end.
The initial task categories are:
- Reasoning: compact logic and quantitative reasoning
- Coding: bug diagnosis and patch generation
- Structured output: JSON extraction and schema discipline
- Long context: extracting the relevant clause from distractors
- Data analysis: arithmetic and small-table interpretation
- Calibration: probability math and uncertainty discipline
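To give a feel for the JSONL task format, here are two hypothetical rows in the spirit of these categories; the task IDs, field names, and values are illustrative assumptions, not copies of the shipped tasks:

{"task_id": "structured-001", "category": "structured_output", "prompt": "Extract the invoice number and total as JSON.", "scoring": "json_subset", "expected": {"invoice_number": "INV-1042", "total": 199.0}}
{"task_id": "calibration-001", "category": "calibration", "prompt": "Two fair coins are flipped. What is the probability that both land heads? Answer with a number.", "scoring": "numeric_range", "expected": {"min": 0.24, "max": 0.26}}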
Larger suites should be versioned separately and should distinguish public tasks from holdout tasks. Public tasks are useful for debugging. They are not enough for durable leaderboard claims because models and prompts can be optimized against them.
Engineering choices
The repo is intentionally simple:
python src/sf_benchmark/runner.py \
--models model_roster.json \
--tasks tasks/general_seed.jsonl \
--out results/dry-run.json
That command validates the roster and tasks and writes placeholder rows for every model-task pair.
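A placeholder row in the dry-run output might look roughly like this (again, field names are assumptions):

{"model_id": "claude-opus-4-7", "task_id": "coding-003", "response": null, "score": null, "status": "pending"}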
To score outputs:
python src/sf_benchmark/runner.py \
--models model_roster.json \
--tasks tasks/general_seed.jsonl \
--responses examples/responses.seed.jsonl \
--out results/scored-seed.json
The scorer is local and deterministic. Provider adapters can come later. We want the prompt, task format, result schema, and scoring behavior to be stable before adding API keys and paid runs.
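To illustrate what local and deterministic means in practice, the checks can be plain Python with no network calls and no randomness. This is a sketch of a few of the scoring modes named earlier, not the repo's actual implementation:

import json

def score(mode, expected, output):
    # Returns 1.0 for a pass, 0.0 for a fail. No API calls, no randomness.
    if mode == "exact_match":
        return float(output.strip() == expected.strip())
    if mode == "substring":
        return float(expected in output)
    if mode == "json_subset":
        try:
            got = json.loads(output)
        except ValueError:
            return 0.0
        # Every expected key must be present with an equal value.
        return float(all(got.get(k) == v for k, v in expected.items()))
    if mode == "numeric_range":
        try:
            value = float(output.strip())
        except ValueError:
            return 0.0
        return float(expected["min"] <= value <= expected["max"])
    raise ValueError(f"unknown scoring mode: {mode}")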
What comes next
There are three obvious next steps.
First, add provider adapters that preserve raw request and response metadata. For a serious run, the benchmark needs the exact temperature, max output tokens, reasoning mode, tool policy, latency, token counts, and cost.
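One way to keep all of that without committing to any provider yet is a thin adapter interface that returns the raw payload alongside the text. This is a hypothetical shape, not a committed design:

from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class ModelResponse:
    text: str                    # what the scorer sees
    raw: dict[str, Any]          # untouched provider response body
    request: dict[str, Any]      # exact parameters sent: temperature, max output tokens, reasoning mode, tool policy
    latency_s: float             # wall-clock latency for the call
    usage: dict[str, int] = field(default_factory=dict)  # token counts, for cost accounting

class ProviderAdapter(Protocol):
    def generate(self, model_id: str, prompt: str, **params: Any) -> ModelResponse: ...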
Second, add larger suites. A useful general benchmark should cover software engineering, document analysis, structured data extraction, multi-step research planning, instruction following, and adversarial schema discipline.
Third, publish full result artifacts before publishing rankings. A leaderboard is only useful if people can inspect why a model won or failed.
The repo is public so the benchmark can improve in the open:
https://github.com/spfunctions/major-model-benchmark
The companion benchmark for prediction-market-specific tasks is here:
https://github.com/spfunctions/prediction-market-model-benchmark
This article was primarily written by the SimpleFunctions engine and does not represent the views of the company.