A Benchmark for AI Models on Prediction-Market Tasks
A domain benchmark for testing whether models can handle forecast calibration, resolution logic, orderbooks, cross-venue equivalence, thesis updates, and risk gates.
General AI benchmarks miss the hard parts of prediction-market work.
Prediction markets are not just another finance domain. A useful model has to parse market rules, reason about unresolved outcomes, compare non-identical contracts across venues, interpret orderbooks, update probabilities without double-counting evidence, and refuse trades when edge disappears after fees, spread, liquidity, or risk limits.
So we published a domain benchmark:
https://github.com/spfunctions/prediction-market-model-benchmark
It uses the same model roster as our general benchmark, but the tasks are prediction-market-specific.
What this benchmark tests
The first seed suite covers seven task families:
- Resolution: decide YES, NO, UNRESOLVED, or INVALID from rules and evidence
- Calibration: evaluate or produce probabilistic forecasts
- Microstructure: reason about spreads, depth, stale quotes, and executable edge
- Cross-venue mapping: decide whether two contracts are equivalent enough for arbitrage
- Thesis updates: revise a probability after new evidence without double-counting
- Risk gates: reject trades that violate bankroll, liquidity, conviction, or operator constraints
- Evidence independence: separate duplicated reports from independent signals
These are the failure modes that matter in real prediction-market systems.
A model can sound smart and still fail here. It can explain a macro thesis beautifully while missing that a Kalshi contract resolves on certified results, not media projections. It can forecast a 57 percent probability and still recommend a bad trade because the best ask, fees, and liquidity make it non-actionable. It can compare two contracts with similar titles while ignoring that one resolves at a June FOMC meeting and the other resolves by July 31.
Those mistakes are not formatting issues. They change money, risk, and truth.
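To make the cross-venue failure concrete, here is a minimal Python sketch of an equivalence gate that compares resolution source and deadline before allowing an arbitrage pairing. The Contract fields, dates, and logic are hypothetical illustrations, not the benchmark's schema.

from dataclasses import dataclass
from datetime import date

@dataclass
class Contract:
    # Hypothetical fields, not the benchmark's schema.
    event: str              # what the contract is about
    resolution_source: str  # e.g. certified results vs. media projections
    deadline: date          # date by which the contract resolves

def equivalent_for_arbitrage(a: Contract, b: Contract) -> bool:
    # Reject pairs that differ on anything that changes resolution.
    if a.resolution_source != b.resolution_source:
        return False
    if a.deadline != b.deadline:
        return False  # e.g. resolves at the June FOMC vs. by July 31
    return a.event == b.event

june = Contract("Fed raises rates", "certified FOMC statement", date(2026, 6, 17))
july = Contract("Fed raises rates", "certified FOMC statement", date(2026, 7, 31))
assert not equivalent_for_arbitrage(june, july)  # similar titles, different deadlines

A gate like this is deliberately conservative: any mismatch on a resolution-relevant field blocks the pairing, which is the behavior the benchmark is probing for.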
Same model group, different task surface
The domain benchmark mirrors the roster in the general benchmark:
- gpt-5.5
- claude-opus-4-7
- gemini-3.1-pro-preview
- grok-4.3
- deepseek-v4-pro
- mistral-medium-3.5
- command-a-reasoning-08-2025
- qwen3.6-plus
- kimi-k2.6
- MiniMax-M2.7
- glm-5
- meta-llama/Llama-4-Maverick-17B-128E-Instruct
Keeping the roster identical matters. It lets us ask a sharper question: does a model that performs well on general reasoning also transfer to prediction-market reasoning?
The answer is not obvious. Forecasting and trading require calibration, rule grounding, and risk discipline. More reasoning is not automatically better if it produces overconfident probabilities or invents equivalence between contracts.
Forecast quality is not trade quality
One design rule in the repo is that forecast quality and trade quality should be scored separately.
A model can make a good probability estimate and still make a bad trade. For example:
- Fair value: 63 percent
- Market ask: 56c
- Fee: 1c
- Available liquidity: $80
- Requested size: $400
- Max per-trade risk: 20 percent of a $1,000 bankroll
The forecast edge is positive before risk constraints: 63 cents of fair value against 57 cents of all-in cost. The trade should still be rejected because the requested $400 exceeds both the $80 of available liquidity and the $200 per-trade risk cap (20 percent of the $1,000 bankroll).
That distinction is central to prediction-market agents. We do not want models that simply say "positive EV, buy." We want models that know when not to act.
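A minimal sketch of that separation, using the numbers above. The function name and gate structure are illustrative, not the repo's API.

def trade_decision(fair_value, ask, fee, liquidity, requested_size, bankroll, max_risk_frac):
    # Forecast-level edge: fair value minus all-in cost per contract.
    edge = fair_value - (ask + fee)              # 0.63 - (0.56 + 0.01) = 0.06
    # Trade-level gates: size must fit both liquidity and the risk cap.
    max_risk = bankroll * max_risk_frac          # 1000 * 0.20 = 200
    gates = {
        "positive_edge": edge > 0,
        "fits_liquidity": requested_size <= liquidity,  # 400 <= 80 fails
        "fits_risk_cap": requested_size <= max_risk,    # 400 <= 200 fails
    }
    return edge, gates, all(gates.values())

edge, gates, allowed = trade_decision(
    fair_value=0.63, ask=0.56, fee=0.01,
    liquidity=80, requested_size=400,
    bankroll=1000, max_risk_frac=0.20,
)
# edge is positive, but allowed is False: a good forecast and a rejected trade.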
Point-in-time snapshots matter
Prediction-market data changes continuously. Odds move, liquidity appears and disappears, news arrives, and markets resolve.
For a serious benchmark run, each task should preserve:
- venue
- contract title and rules
- market snapshot timestamp
- orderbook state
- market probability or price at the time
- evidence available at the time
- resolution outcome, when known
This is why dynamic forecasting benchmarks such as ForecastBench are useful reference points. Static knowledge tests are easier to contaminate and less representative of real forecasting. Prediction-market evaluation needs the same discipline, with the additional complexity of venue rules and executable prices.
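As an illustration only, a frozen task could carry fields along these lines; the names and values are assumptions, not the repo's actual schema.

# Illustrative only; the repo's task schema may differ.
snapshot = {
    "venue": "kalshi",
    "contract_title": "Example contract",
    "contract_rules": "Resolves YES on certified results only.",
    "snapshot_ts": "2025-06-01T14:30:00Z",            # when the market was observed
    "orderbook": {"bids": [[0.55, 120]], "asks": [[0.56, 80]]},
    "market_price": 0.56,                             # price at snapshot time
    "evidence": ["report A", "report B"],             # only what was knowable then
    "resolution": None,                               # filled in once the market settles
}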
What is in the repo today
The first version includes:
- Shared model roster
- Seed prediction-market task suite
- JSONL task and response format
- Deterministic scorer for exact match, label match, JSON exact, JSON subset, substring, and numeric range tasks
- Example response files
- Unit tests
- Methodology and research notes
The dry run is simple:
python src/sf_pm_benchmark/runner.py \
--models model_roster.json \
--tasks tasks/prediction_market_seed.jsonl \
--out results/dry-run.json
Scoring sample responses:
python src/sf_pm_benchmark/runner.py \
--models model_roster.json \
--tasks tasks/prediction_market_seed.jsonl \
--responses examples/responses.seed.jsonl \
--out results/scored-seed.json
This is not a final leaderboard. It is the benchmark substrate.
What comes next
The next engineering step is adding Brier score, log loss, executable-edge scoring, and invalid-trade rate. The next data step is freezing larger Kalshi and Polymarket task sets with point-in-time snapshots and clear licensing.
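Brier score and log loss on binary outcomes are standard formulas; a minimal sketch:

import math

def brier_score(probs, outcomes):
    # Mean squared error between forecast probabilities and 0/1 outcomes.
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-12):
    # Negative mean log-likelihood; probabilities are clipped to avoid log(0).
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

probs, outcomes = [0.57, 0.90, 0.20], [1, 1, 0]
print(brier_score(probs, outcomes), log_loss(probs, outcomes))

Both reward calibration rather than confident-sounding answers, which is why they come before executable-edge scoring and invalid-trade rate in the roadmap.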
After that, the interesting comparison begins:
- Which models are calibrated?
- Which models understand resolution rules?
- Which models avoid false cross-venue equivalence?
- Which models can separate a good forecast from a permissible trade?
- Which models refuse to trade when the operator constraints say no?
The repo is public here:
https://github.com/spfunctions/prediction-market-model-benchmark
The companion general benchmark is here:
This article was primarily written by the SimpleFunctions engine and does not represent the views of the company.