We Open-Sourced a Reproducibility-First Benchmark for Major AI Models
A public harness, roster, seed suite, result schema, scoring path, and methodology for comparing current major AI models without hiding prompts or raw outputs.
The model market moves too fast for static comparisons.
Every few weeks, the default answer to "which model should we use?" changes. OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Qwen, Kimi, MiniMax, GLM, Cohere, and Meta are all shipping capable models with different strengths, pricing, context windows, tool behavior, and deployment constraints.
So we open-sourced a benchmark repo that starts with the thing most benchmark writeups skip: reproducibility.
The repo is here:
https://github.com/spfunctions/major-model-benchmark
It is not a leaderboard yet. That is intentional. It is a harness, roster, seed suite, result schema, scoring path, and methodology for comparing current major models on general work.
What is in the repo
The first version includes:
- A shared model roster snapshot, dated May 5, 2026
- A public seed task suite for reasoning, coding, structured output, long-context extraction, data analysis, and calibration
- JSONL task and response formats
- A deterministic scorer for exact match, label match, JSON exact, JSON subset, substring, and numeric range tasks
- Example response files
- Unit tests
- Methodology and research notes
The roster currently tracks:
- gpt-5.5
- claude-opus-4-7
- gemini-3.1-pro-preview
- grok-4.3
- deepseek-v4-pro
- mistral-medium-3.5
- command-a-reasoning-08-2025
- qwen3.6-plus
- kimi-k2.6
- MiniMax-M2.7
- glm-5
- meta-llama/Llama-4-Maverick-17B-128E-Instruct
Provider docs move quickly, so the roster records a source and status for each entry. Some IDs are stable API IDs; others are preview or partner-hosted IDs. Gemini 3.1 Pro, for example, has been announced for developer preview through the Gemini API, but the exact API model code should be rechecked before a paid benchmark run.
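As an example of what that looks like in the roster, an entry can carry its provenance next to the ID. The field names below are illustrative assumptions, not the repo's exact schema:

{
  "model_id": "gemini-3.1-pro-preview",
  "provider": "google",
  "status": "developer preview",
  "source": "provider announcement, checked 2026-05-05",
  "note": "recheck exact API model code before a paid run"
}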
That kind of caveat belongs in the benchmark, not in a footnote after the rankings.
Why another benchmark?
Most public model comparisons fail in one of three ways.
First, they hide the prompt and output details. A score without raw outputs is hard to debug and easy to overinterpret.
Second, they collapse everything into one number. A model that is excellent at coding may be mediocre at structured extraction. A model with strong reasoning may be too verbose, too expensive, or too tool-fragile for production.
Third, they do not preserve run metadata. If a model ID is an alias, then the same benchmark run next month may not mean the same thing.
The benchmark repo starts from a stricter rule: no ranking without raw outputs, scoring code, model IDs, run timestamps, and task definitions.
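Concretely, that rule means every result row has to carry enough metadata to be audited and re-run. A hypothetical scored row, with field names that are assumptions rather than the repo's exact schema:

{"model_id": "gpt-5.5", "task_id": "reasoning-001", "run_timestamp": "2026-05-05T14:03:00Z", "raw_output": "42", "score": 1.0, "scoring": "exact_match"}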
What the seed suite measures
The seed suite is deliberately small. It is not meant to settle which model is best. It exists to verify that the harness works end to end.
The initial task categories are:
- Reasoning: compact logic and quantitative reasoning
- Coding: bug diagnosis and patch generation
- Structured output: JSON extraction and schema discipline
- Long context: extracting the relevant clause from distractors
- Data analysis: arithmetic and small-table interpretation
- Calibration: probability math and uncertainty discipline
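To give a feel for the JSONL task format, here are two hypothetical rows in the spirit of these categories; the task IDs, field names, and values are illustrative assumptions, not copies of the shipped tasks:

{"task_id": "structured-001", "category": "structured_output", "prompt": "Extract the invoice number and total as JSON.", "scoring": "json_subset", "expected": {"invoice_number": "INV-1042", "total": 199.0}}
{"task_id": "calibration-001", "category": "calibration", "prompt": "Two fair coins are flipped. What is the probability that both land heads? Answer with a number.", "scoring": "numeric_range", "expected": {"min": 0.24, "max": 0.26}}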
Larger suites should be versioned separately and should distinguish public tasks from holdout tasks. Public tasks are useful for debugging. They are not enough for durable leaderboard claims because models and prompts can be optimized against them.
Engineering choices
The repo is intentionally simple:
python src/sf_benchmark/runner.py \
--models model_roster.json \
--tasks tasks/general_seed.jsonl \
--out results/dry-run.json
That command validates the roster and tasks and writes placeholder rows for every model-task pair.
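A placeholder row in the dry-run output might look roughly like this (again, field names are assumptions):

{"model_id": "claude-opus-4-7", "task_id": "coding-003", "response": null, "score": null, "status": "pending"}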
To score outputs:
python src/sf_benchmark/runner.py \
--models model_roster.json \
--tasks tasks/general_seed.jsonl \
--responses examples/responses.seed.jsonl \
--out results/scored-seed.json
The scorer is local and deterministic. Provider adapters can come later. We want the prompt, task format, result schema, and scoring behavior to be stable before adding API keys and paid runs.
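To illustrate what local and deterministic means in practice, the checks can be plain Python with no network calls and no randomness. This is a sketch of a few of the scoring modes named earlier, not the repo's actual implementation:

import json

def score(mode, expected, output):
    # Returns 1.0 for a pass, 0.0 for a fail. No API calls, no randomness.
    if mode == "exact_match":
        return float(output.strip() == expected.strip())
    if mode == "substring":
        return float(expected in output)
    if mode == "json_subset":
        try:
            got = json.loads(output)
        except ValueError:
            return 0.0
        # Every expected key must be present with an equal value.
        return float(all(got.get(k) == v for k, v in expected.items()))
    if mode == "numeric_range":
        try:
            value = float(output.strip())
        except ValueError:
            return 0.0
        return float(expected["min"] <= value <= expected["max"])
    raise ValueError(f"unknown scoring mode: {mode}")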
What comes next
There are three obvious next steps.
First, add provider adapters that preserve raw request and response metadata. For a serious run, the benchmark needs the exact temperature, max output tokens, reasoning mode, tool policy, latency, token counts, and cost.
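One way to keep all of that without committing to any provider yet is a thin adapter interface that returns the raw payload alongside the text. This is a hypothetical shape, not a committed design:

from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class ModelResponse:
    text: str                    # what the scorer sees
    raw: dict[str, Any]          # untouched provider response body
    request: dict[str, Any]      # exact parameters sent: temperature, max output tokens, reasoning mode, tool policy
    latency_s: float             # wall-clock latency for the call
    usage: dict[str, int] = field(default_factory=dict)  # token counts, for cost accounting

class ProviderAdapter(Protocol):
    def generate(self, model_id: str, prompt: str, **params: Any) -> ModelResponse: ...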
Second, add larger suites. A useful general benchmark should cover software engineering, document analysis, structured data extraction, multi-step research planning, instruction following, and adversarial schema discipline.
Third, publish full result artifacts before publishing rankings. A leaderboard is only useful if people can inspect why a model won or failed.
The repo is public so the benchmark can improve in the open:
https://github.com/spfunctions/major-model-benchmark
The companion benchmark for prediction-market-specific tasks is here:
https://github.com/spfunctions/prediction-market-model-benchmark
This article was primarily written by the SimpleFunctions engine and does not represent the views of the company.