SimpleFunctions Research · April 2026
Calibrated World Models for AI Agents
Prediction Market Data as Real-Time Context
Patrick Liu · SimpleFunctions · hello@simplefunctions.dev
Headline result: Baseline 2.3% → With world state 70.5% (31x improvement)
Abstract
Large language models have a knowledge cutoff that prevents them from reasoning accurately about current events. Existing mitigations — web search, news APIs, retrieval-augmented generation — return narrative text that requires parsing and provides no calibrated probabilities. We propose injecting prediction market data as a compact, structured world model into agent system prompts. Prediction markets aggregate the judgments of participants with real money at risk, producing calibrated probability estimates for geopolitical events, economic indicators, commodity prices, and elections. We introduce the World Awareness Benchmark (WAB), a 44-question evaluation testing whether AI agents can accurately report current world conditions. Ground truth is derived from live prediction market prices. On WAB, a baseline Claude Haiku 4.5 scores 2.3% while the same model augmented with an 800-token world state scores 70.5% — a 31x improvement. The world state injection requires no fine-tuning, no retrieval infrastructure, and adds only ~800 tokens to the system prompt.
The Problem
Ask an LLM "What is the probability of a US recession in 2026?" and it will either hallucinate a number, hedge with "I don't have access to real-time data," or give you a figure from its training data. Web search returns narratives. News APIs return headlines. Neither provides a number an agent can reason over.
Web search: "According to recent reports, tensions in the Middle East remain elevated..."
News API: {"title": "Iran tensions rise", "source": "Reuters"}
Prediction market: Iran invasion: 53% (+5pp, $225K volume)
The first two provide narrative. The third provides a calibrated probability — backed by people who lose money when they're wrong.
Method
Anchor Contract Selection
Naive selection by price delta picks noisy daily contracts. We instead score by volume x macroBoost: a 5x multiplier for recession, invasion, and rate-cut keywords, and a 0.1x penalty for daily closes. Critical contracts (Fed rate, recession probability) always appear regardless of movement.
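The selection rule above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the dict field names (`title`, `volume`), the exact keyword list, and the `k` cutoff are assumptions; only the 5x/0.1x multipliers and the always-include rule come from the text.

```python
# Sketch of volume x macroBoost anchor selection (assumed field names).
MACRO_KEYWORDS = ("recession", "invasion", "rate cut")
CRITICAL = ("fed rate", "recession probability")  # always included

def macro_boost(title: str) -> float:
    t = title.lower()
    if "daily" in t:
        return 0.1        # penalize noisy daily closes
    if any(k in t for k in MACRO_KEYWORDS):
        return 5.0        # boost macro-relevant contracts
    return 1.0

def select_anchors(contracts: list[dict], k: int = 10) -> list[dict]:
    """Rank by volume x macroBoost; critical contracts lead regardless of score."""
    scored = sorted(contracts,
                    key=lambda c: c["volume"] * macro_boost(c["title"]),
                    reverse=True)
    critical = [c for c in contracts
                if any(kw in c["title"].lower() for kw in CRITICAL)]
    out, seen = [], set()
    for c in critical + scored:   # critical first, then best-scored
        if c["title"] not in seen:
            seen.add(c["title"])
            out.append(c)
    return out[:k]
```

With this rule, a low-volume Fed-rate contract still outranks a high-volume daily close, which is the behavior the scoring is designed to produce.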
Title Deduplication
Strip prices, dates, and numbers from contract titles to find the semantic core. "Natural gas > $2.720" and "Natural gas > $2.725" collapse to one entry.
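The deduplication step can be approximated with simple regex normalization. The exact stripping rules are an assumption; the example reproduces the collapse described above.

```python
import re

def dedup_key(title: str) -> str:
    """Collapse near-duplicate contract titles to a semantic core by
    stripping prices, dates, and numbers (illustrative rules)."""
    core = title.lower()
    # Remove prices, counts, and percentages, e.g. "$2.720", "53%"
    core = re.sub(r"\$?\d[\d,]*(\.\d+)?%?", "", core)
    # Remove month names and abbreviations, e.g. "jan", "january"
    core = re.sub(r"\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\b", "", core)
    # Drop leftover symbols and collapse whitespace
    core = re.sub(r"[^a-z ]", " ", core)
    return " ".join(core.split())
```

For example, `dedup_key("Natural gas > $2.720")` and `dedup_key("Natural gas > $2.725")` both reduce to `"natural gas"`, so the two contracts collapse to one entry.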
Delta Updates
Full state: ~800 tokens. Delta since last check: ~30-50 tokens. For long-running agents, this is roughly a 16-27x reduction in per-cycle context overhead.
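A delta update can be computed by diffing two probability snapshots and emitting only contracts that moved. This is a sketch under assumed data shapes (title → probability dicts) and an assumed 2-point movement threshold; the library's actual delta format may differ.

```python
def world_delta(prev: dict, curr: dict, min_move: float = 0.02) -> dict:
    """Return only contracts whose probability moved >= min_move
    since the last check, or that are newly listed."""
    delta = {}
    for title, p in curr.items():
        p_prev = prev.get(title)
        if p_prev is None or abs(p - p_prev) >= min_move:
            delta[title] = p
    return delta
```

A contract that moved from 48% to 53% appears in the delta; one that sat at 30% does not, which is why per-cycle overhead drops from the full ~800 tokens to a few dozen.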
Token Efficiency
| Source | Tokens | Latency | Calibrated |
|---|---|---|---|
| Web search | 2,000-5,000 | 2-5s | No |
| News API | 500-1,000 | 500ms | No |
| RAG | 1,000-3,000 | 1-3s | No |
| World state | ~800 | 200ms | Yes |
| World delta | ~30-50 | 100ms | Yes |
Results: Per-Category Accuracy
- Economy: recession probability, Fed rate path, SPY/TLT prices
- Elections: presidential odds, Senate control
- Crypto and markets: Bitcoin, Ethereum, Gold, mispriced edges
- Energy: oil prices, OPEC, supply disruption
- Geopolitical: Iran invasion, Hormuz, nuclear test, Taiwan
Economy and Elections are strongest (80%) — these have the most liquid, well-calibrated prediction market contracts. Geopolitical is lower (50%) because some questions reference specific contracts not in the 800-token snapshot; tool-use would close this gap.
Key Insight
"The world awareness problem is not a model capability problem — it is a context problem. The same model that scores 2.3% without context scores 70.5% with 800 tokens of structured data. Investment in better world state construction may be more impactful than scaling model parameters for current-events reasoning."
Reproduce the Results
```python
# Install: pip install simplefunctions-ai

# Inject world state into any LLM
from simplefunctions import world

state = world()  # ~800 tokens, free, no auth
# → Inject into system prompt
# → Agent now scores 70.5% on WAB instead of 2.3%
```
Citation
@article{liu2026calibrated,
title = {Calibrated World Models for AI Agents:
Prediction Market Data as Real-Time Context},
author = {Liu, Patrick},
year = {2026},
url = {https://simplefunctions.dev/papers/world-model},
note = {World Awareness Benchmark: 2.3% → 70.5% (31x)
with 800-token prediction market context injection}
}