A Benchmark for AI Models on Prediction-Market Tasks
A domain benchmark for testing whether models can handle forecast calibration, resolution logic, orderbooks, cross-venue equivalence, thesis updates, and risk gates.
General AI benchmarks miss the hard parts of prediction-market work.
Prediction markets are not just another finance domain. A useful model has to parse market rules, reason about unresolved outcomes, compare non-identical contracts across venues, interpret orderbooks, update probabilities without double-counting evidence, and refuse trades when edge disappears after fees, spread, liquidity, or risk limits.
So we published a domain benchmark:
https://github.com/spfunctions/prediction-market-model-benchmark
It uses the same model roster as our general benchmark, but the tasks are prediction-market-specific.
What this benchmark tests
The first seed suite covers seven task families:
- Resolution: decide YES, NO, UNRESOLVED, or INVALID from rules and evidence
- Calibration: evaluate or produce probabilistic forecasts
- Microstructure: reason about spreads, depth, stale quotes, and executable edge
- Cross-venue mapping: decide whether two contracts are equivalent enough for arbitrage
- Thesis updates: revise a probability after new evidence without double-counting
- Risk gates: reject trades that violate bankroll, liquidity, conviction, or operator constraints
- Evidence independence: separate duplicated reports from independent signals
These are the failure modes that matter in real prediction-market systems.
A model can sound smart and still fail here. It can explain a macro thesis beautifully while missing that a Kalshi contract resolves on certified results, not media projections. It can forecast a 57 percent probability and still recommend a bad trade because the best ask, fees, and liquidity make it non-actionable. It can compare two contracts with similar titles while ignoring that one resolves at a June FOMC meeting and the other resolves by July 31.
Those mistakes are not formatting issues. They change money, risk, and truth.
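To make the cross-venue failure concrete, here is a minimal Python sketch of an equivalence gate that compares resolution source and deadline before allowing an arbitrage pairing. The Contract fields, dates, and logic are hypothetical illustrations, not the benchmark's schema.

from dataclasses import dataclass
from datetime import date

@dataclass
class Contract:
    # Hypothetical fields, not the benchmark's schema.
    event: str              # what the contract is about
    resolution_source: str  # e.g. certified results vs. media projections
    deadline: date          # date by which the contract resolves

def equivalent_for_arbitrage(a: Contract, b: Contract) -> bool:
    # Reject pairs that differ on anything that changes resolution.
    if a.resolution_source != b.resolution_source:
        return False
    if a.deadline != b.deadline:
        return False  # e.g. resolves at the June FOMC vs. by July 31
    return a.event == b.event

june = Contract("Fed raises rates", "certified FOMC statement", date(2026, 6, 17))
july = Contract("Fed raises rates", "certified FOMC statement", date(2026, 7, 31))
assert not equivalent_for_arbitrage(june, july)  # similar titles, different deadlines

A gate like this is deliberately conservative: any mismatch on a resolution-relevant field blocks the pairing, which is the behavior the benchmark is probing for.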
Same model group, different task surface
The domain benchmark mirrors the roster in the general benchmark:
- gpt-5.5
- claude-opus-4-7
- gemini-3.1-pro-preview
- grok-4.3
- deepseek-v4-pro
- mistral-medium-3.5
- command-a-reasoning-08-2025
- qwen3.6-plus
- kimi-k2.6
- MiniMax-M2.7
- glm-5
- meta-llama/Llama-4-Maverick-17B-128E-Instruct
Keeping the roster identical matters. It lets us ask a sharper question: does a model that performs well on general reasoning also transfer to prediction-market reasoning?
The answer is not obvious. Forecasting and trading require calibration, rule grounding, and risk discipline. More reasoning is not automatically better if it produces overconfident probabilities or invents equivalence between contracts.
Forecast quality is not trade quality
One design rule in the repo is that forecast quality and trade quality should be scored separately.
A model can make a good probability estimate and still make a bad trade. For example:
- Fair value: 63 percent
- Market ask: 56c
- Fee: 1c
- Available liquidity: $80
- Requested size: $400
- Max per-trade risk: 20 percent of a $1,000 bankroll
The forecast edge is positive before risk constraints: 63 cents of fair value against 57 cents of all-in cost. The trade should still be rejected because the requested $400 exceeds both the $80 of available liquidity and the $200 per-trade risk cap (20 percent of the $1,000 bankroll).
That distinction is central to prediction-market agents. We do not want models that simply say "positive EV, buy." We want models that know when not to act.
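A minimal sketch of that separation, using the numbers above. The function name and gate structure are illustrative, not the repo's API.

def trade_decision(fair_value, ask, fee, liquidity, requested_size, bankroll, max_risk_frac):
    # Forecast-level edge: fair value minus all-in cost per contract.
    edge = fair_value - (ask + fee)              # 0.63 - (0.56 + 0.01) = 0.06
    # Trade-level gates: size must fit both liquidity and the risk cap.
    max_risk = bankroll * max_risk_frac          # 1000 * 0.20 = 200
    gates = {
        "positive_edge": edge > 0,
        "fits_liquidity": requested_size <= liquidity,  # 400 <= 80 fails
        "fits_risk_cap": requested_size <= max_risk,    # 400 <= 200 fails
    }
    return edge, gates, all(gates.values())

edge, gates, allowed = trade_decision(
    fair_value=0.63, ask=0.56, fee=0.01,
    liquidity=80, requested_size=400,
    bankroll=1000, max_risk_frac=0.20,
)
# edge is positive, but allowed is False: a good forecast and a rejected trade.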
Point-in-time snapshots matter
Prediction-market data changes continuously. Odds move, liquidity appears and disappears, news arrives, and markets resolve.
For a serious benchmark run, each task should preserve:
- venue
- contract title and rules
- market snapshot timestamp
- orderbook state
- market probability or price at the time
- evidence available at the time
- resolution outcome, when known
This is why dynamic forecasting benchmarks such as ForecastBench are useful reference points. Static knowledge tests are easier to contaminate and less representative of real forecasting. Prediction-market evaluation needs the same discipline, with the additional complexity of venue rules and executable prices.
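As an illustration only, a frozen task could carry fields along these lines; the names and values are assumptions, not the repo's actual schema.

# Illustrative only; the repo's task schema may differ.
snapshot = {
    "venue": "kalshi",
    "contract_title": "Example contract",
    "contract_rules": "Resolves YES on certified results only.",
    "snapshot_ts": "2025-06-01T14:30:00Z",            # when the market was observed
    "orderbook": {"bids": [[0.55, 120]], "asks": [[0.56, 80]]},
    "market_price": 0.56,                             # price at snapshot time
    "evidence": ["report A", "report B"],             # only what was knowable then
    "resolution": None,                               # filled in once the market settles
}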
What is in the repo today
The first version includes:
- Shared model roster
- Seed prediction-market task suite
- JSONL task and response format
- Deterministic scorer for exact match, label match, JSON exact, JSON subset, substring, and numeric range tasks
- Example response files
- Unit tests
- Methodology and research notes
The dry run is simple:
python src/sf_pm_benchmark/runner.py \
--models model_roster.json \
--tasks tasks/prediction_market_seed.jsonl \
--out results/dry-run.json
Scoring sample responses:
python src/sf_pm_benchmark/runner.py \
--models model_roster.json \
--tasks tasks/prediction_market_seed.jsonl \
--responses examples/responses.seed.jsonl \
--out results/scored-seed.json
This is not a final leaderboard. It is the benchmark substrate.
What comes next
The next engineering step is adding Brier score, log loss, executable-edge scoring, and invalid-trade rate. The next data step is freezing larger Kalshi and Polymarket task sets with point-in-time snapshots and clear licensing.
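Brier score and log loss on binary outcomes are standard formulas; a minimal sketch:

import math

def brier_score(probs, outcomes):
    # Mean squared error between forecast probabilities and 0/1 outcomes.
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-12):
    # Negative mean log-likelihood; probabilities are clipped to avoid log(0).
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

probs, outcomes = [0.57, 0.90, 0.20], [1, 1, 0]
print(brier_score(probs, outcomes), log_loss(probs, outcomes))

Both reward calibration rather than confident-sounding answers, which is why they come before executable-edge scoring and invalid-trade rate in the roadmap.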
After that, the interesting comparison begins:
- Which models are calibrated?
- Which models understand resolution rules?
- Which models avoid false cross-venue equivalence?
- Which models can separate a good forecast from a permissible trade?
- Which models refuse to trade when the operator constraints say no?
The repo is public here:
https://github.com/spfunctions/prediction-market-model-benchmark
The companion general benchmark is here:
This article was primarily written by the SimpleFunctions engine and does not represent the views of the company.