# Calibrated World Models for AI Agents: Prediction Market Data as Real-Time Context

**Patrick Liu**
SimpleFunctions
patrick@simplefunctions.dev

**April 2026**

---

## Abstract

Large language models have a knowledge cutoff that prevents them from reasoning about current events. Existing solutions (web search, news APIs, RAG over recent documents) return narrative text that requires parsing and provides no calibrated probabilities. We propose injecting prediction market data as a structured world model into agent system prompts. Prediction markets aggregate the judgments of participants with money at risk, producing calibrated probability estimates for future events. We introduce the World Awareness Benchmark (WAB), a 44-question evaluation testing whether agents can accurately report current geopolitical risk levels, recession probabilities, commodity prices, and election odds. On WAB, a baseline Claude Haiku 4.5 model scores 2.3%, while the same model augmented with an 800-token prediction market world state scores 70.5%, a 31× improvement. The world state context requires no fine-tuning, no retrieval infrastructure, and adds only ~800 tokens to the prompt. We release the benchmark, dataset, and API as open resources.

## 1. Introduction

AI agents are increasingly deployed for tasks requiring awareness of current world conditions: portfolio analysis, risk assessment, policy research, travel planning, supply chain monitoring. These tasks require the agent to know, at minimum, the current state of geopolitical tensions, economic indicators, commodity prices, and political developments.

Current LLMs cannot provide this. Their training data has a cutoff date, and their outputs about current events are either hedged ("I don't have access to real-time data") or hallucinated (confidently stated but factually wrong).

The standard mitigation is tool use — typically web search or news API integration. But these sources have a fundamental limitation: they return *narrative text*, not *structured data*. "According to recent reports, tensions in the Middle East remain elevated" is not actionable information for an agent. It cannot be compared, thresholded, or used in conditional logic.

We propose an alternative: injecting *prediction market data* into the agent's context. Prediction markets are exchanges where participants trade contracts on the outcomes of future events. A contract on "US recession in 2026" trading at 33 cents represents a 33% market-implied probability, backed by real money at risk. This produces a fundamentally different kind of information than news coverage:

- **Calibrated**: Prices converge toward true probabilities because miscalibrated traders lose money (Tetlock & Gardner, 2015; Arrow et al., 2008).
- **Structured**: Each data point is a probability with a clear referent — not a paragraph requiring interpretation.
- **Comprehensive**: Major prediction exchanges cover geopolitics, economics, energy, elections, technology, and cryptocurrency with thousands of active contracts.
- **Compact**: The entire world state can be summarized in ~800 tokens — far less than a single web search result.

## 2. Method

### 2.1 World State Construction

We aggregate data from 9,706 prediction market contracts across two venues: Kalshi (CFTC-regulated, US-based) and Polymarket (blockchain-based, global). Contracts are organized into six topic categories: Geopolitics, Economy, Energy, Elections, Crypto, and Tech.

For each topic, we select contracts using a two-tier approach:

**Anchor contracts** are selected by a composite score:

$$\text{score}(c) = \text{volume}(c) \times \text{macroBoost}(c)$$

where $\text{macroBoost}$ assigns a 5× multiplier to contracts matching fundamental macro keywords (recession, invasion, rate cut, etc.) and a 0.1× penalty to daily close contracts. This ensures that high-importance contracts (recession probability, war probability) always appear, regardless of daily price movement.

**Mover contracts** fill remaining slots, ranked by $\text{volume} \times |\Delta\text{price}|$, with deduplication by title similarity to prevent near-identical contracts (e.g., natural gas at adjacent strike prices) from consuming multiple slots.
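As a concrete sketch, the two-tier selection can be implemented as below. The contract fields (`title`, `volume`, `delta_price`), the keyword list, and the 0.8 similarity threshold are illustrative assumptions, not the production configuration:

```python
from difflib import SequenceMatcher

MACRO_KEYWORDS = ("recession", "invasion", "rate cut")  # illustrative subset


def macro_boost(contract: dict) -> float:
    """5x multiplier for fundamental macro keywords, 0.1x penalty for daily closes."""
    title = contract["title"].lower()
    if any(kw in title for kw in MACRO_KEYWORDS):
        return 5.0
    if "daily close" in title:
        return 0.1
    return 1.0


def anchor_score(contract: dict) -> float:
    """Composite anchor score: volume x macroBoost."""
    return contract["volume"] * macro_boost(contract)


def select_movers(contracts: list[dict], slots: int,
                  sim_threshold: float = 0.8) -> list[dict]:
    """Rank by volume x |delta price|, skipping titles too similar to picks so far."""
    ranked = sorted(contracts,
                    key=lambda c: c["volume"] * abs(c["delta_price"]),
                    reverse=True)
    picked: list[dict] = []
    for c in ranked:
        if len(picked) == slots:
            break
        # Deduplicate near-identical contracts (e.g., adjacent strike prices).
        if all(SequenceMatcher(None, c["title"], p["title"]).ratio() < sim_threshold
               for p in picked):
            picked.append(c)
    return picked
```

In practice, anchors would be taken first from the top of the `anchor_score` ranking, with `select_movers` filling whatever slots remain for the topic.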

The resulting output is ~800 tokens of structured markdown, including:
- SF Index: Uncertainty (0-100), Geopolitical Risk (0-100), Momentum (-1 to +1)
- Traditional market prices (SPY, VIX, Gold, Oil, Treasuries)
- 2-4 contracts per topic with current price, 24h change, and venue
- Top mispriced edges (model price vs. market price divergences)
- Cross-market divergence alerts
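For illustration, a fragment of the rendered output might look like the following; every value and contract name here is a placeholder, not live data:

```markdown
## World State (YYYY-MM-DD HH:MM UTC)

SF Index: Uncertainty 62/100 | Geopolitical Risk 55/100 | Momentum +0.2
Markets: SPY 512.30 | VIX 18.4 | GLD 215.10 | USO 74.20

### Economy
- US recession in 2026: 33% (+2pp 24h, Kalshi)
- Fed rate cut by June: 41% (-1pp 24h, Polymarket)
```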

### 2.2 Integration Patterns

We evaluate three integration patterns of increasing depth:

**System prompt injection**: The world state markdown is prepended to the system prompt. The agent receives current data as context without any tool calls. Cost: ~800 additional input tokens per request.
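A minimal sketch of this pattern, assuming the world state endpoint (see Resources) returns plain markdown and that the agent framework accepts a system prompt string; the helper names are our own:

```python
import urllib.request

WORLD_URL = "https://simplefunctions.dev/api/agent/world"  # free, no auth (see Resources)


def fetch_world_state(url: str = WORLD_URL, timeout: float = 5.0) -> str:
    """Fetch the ~800-token world state markdown (assumes a plain-text body)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def inject_world_state(base_system_prompt: str, world_state_md: str) -> str:
    """Prepend the world state so the agent sees current data before its instructions."""
    return f"{world_state_md}\n\n---\n\n{base_system_prompt}"
```

The combined string is then passed as the system prompt of whatever chat API the agent uses; no tool calls or retrieval infrastructure are involved.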

**Tool use**: The agent is given tools to query focused world state (`?focus=energy,geopolitics` concentrates the same token budget on fewer topics), search specific markets, and retrieve incremental updates (`/delta?since=1h` returns ~30-50 tokens of changes).

**MCP server**: The world state is exposed via the Model Context Protocol (Anthropic, 2024), enabling automatic discovery by compatible agent frameworks.

### 2.3 Delta Updates

For agents in long-running sessions, re-reading 800 tokens of world state every reasoning cycle is wasteful. We introduce a delta endpoint that returns only what changed since a given timestamp. The delta output is typically 30-50 tokens, a roughly 16-27× reduction in context overhead for ongoing sessions.
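A simple scheduling policy for a long-running session alternates full snapshots with deltas. In the sketch below, only the `/delta?since=1h` path comes from the description above; the `refresh_every` cadence is an arbitrary illustration:

```python
WORLD_BASE = "https://simplefunctions.dev/api/agent/world"


def world_state_url(cycle: int, refresh_every: int = 20) -> str:
    """Return the full ~800-token snapshot URL on the first cycle and
    periodically thereafter; otherwise the ~30-50-token delta URL."""
    if cycle % refresh_every == 0:
        return WORLD_BASE
    return WORLD_BASE + "/delta?since=1h"
```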

## 3. World Awareness Benchmark

We introduce the World Awareness Benchmark (WAB), designed to test whether an agent can accurately report current world conditions. The benchmark consists of 44 questions across five categories:

| Category | Questions | Example |
|----------|-----------|---------|
| Geopolitical | 10 | "What is the SF Geopolitical Risk Index score?" |
| Economy | 10 | "What is the prediction market probability of a US recession in 2026?" |
| Energy | 7 | "What is the current oil ETF (USO) price?" |
| Elections | 5 | "What probability does the market assign to [candidate]?" |
| Markets | 12 | "What is the current gold ETF (GLD) price?" |

Ground truth is derived from live prediction market prices at the time of evaluation. Each answer is scored:
- **2 points**: Exact match within tolerance (±5 for probabilities, ±$5 for prices)
- **1 point**: Correct direction or within 3× tolerance
- **0 points**: Wrong, hallucinated, or refused

Maximum score: 88 (44 questions × 2 points).
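The tolerance-based part of this rubric is a few lines of code; the "correct direction" clause requires a reference value per question and is omitted from this sketch:

```python
def score_answer(predicted: float, truth: float, tol: float) -> int:
    """Score one WAB answer against ground truth.

    tol is 5 for probabilities (percentage points) and 5 for dollar prices.
    2 points within tolerance, 1 point within 3x tolerance, 0 otherwise.
    """
    err = abs(predicted - truth)
    if err <= tol:
        return 2
    if err <= 3 * tol:
        return 1
    return 0
```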

The benchmark is regenerated monthly from live market data, ensuring it always tests current knowledge rather than memorized facts.

## 4. Results

We evaluate Claude Haiku 4.5 (Anthropic, 2025) under two conditions: baseline (no world state) and augmented (800-token world state in system prompt).

### 4.1 Aggregate Results

| Condition | Score | Accuracy | Exact Match | Partial | Wrong |
|-----------|-------|----------|-------------|---------|-------|
| Baseline | 2/88 | 2.3% | 1 | 0 | 43 |
| + World State | 62/88 | 70.5% | 31 | 0 | 13 |

Improvement: **31×**

### 4.2 Per-Category Results

| Category | Baseline | + World State | Improvement |
|----------|----------|---------------|-------------|
| Geopolitical | 0/20 (0%) | 10/20 (50%) | +50pp |
| Economy | 0/20 (0%) | 16/20 (80%) | +80pp |
| Energy | 0/14 (0%) | 10/14 (71%) | +71pp |
| Elections | 2/10 (20%) | 8/10 (80%) | +60pp |
| Markets | 0/24 (0%) | 18/24 (75%) | +75pp |

Economy and Elections reach the highest augmented accuracy (80%), likely because these categories have the most liquid and best-calibrated prediction market contracts. Geopolitical accuracy is lower (50%) because some questions reference specific contracts that are not directly present in the 800-token snapshot.

### 4.3 Token Efficiency

| Source | Tokens | Latency | Calibration |
|--------|--------|---------|-------------|
| Web search | 2,000-5,000 | 2-5s | None |
| News API | 500-1,000 | 500ms | None |
| World state (full) | ~800 | ~200ms | Money-backed |
| World state (delta) | ~30-50 | ~100ms | Money-backed |

The world state approach uses 2.5-6× fewer tokens than web search while providing calibrated probabilities rather than narrative text.

## 5. Related Work

**Prediction market calibration.** Prediction markets have been shown to be well-calibrated forecasting instruments (Wolfers & Zitzewitz, 2004; Arrow et al., 2008; Tetlock & Gardner, 2015). Prices aggregate private information efficiently (Hayek, 1945) and outperform expert panels on average (Surowiecki, 2004).

**LLM grounding and tool use.** Retrieval-augmented generation (Lewis et al., 2020) and tool use (Schick et al., 2023) address the knowledge cutoff problem but typically rely on unstructured text retrieval. Our approach provides structured, calibrated data rather than passages requiring interpretation.

**Agent architectures.** Recent work on autonomous agents (Yao et al., 2023; Wang et al., 2023) focuses on reasoning and tool use but does not address the world awareness gap. The world state can be integrated into any agent architecture as an additional context source.

## 6. Limitations

- **Market availability**: Not all events have active prediction market contracts. Coverage is strongest for US geopolitics, economics, and elections.
- **Thin markets**: Low-volume contracts may not be well-calibrated. Our anchor selection mechanism mitigates this by prioritizing high-volume contracts.
- **Manipulation risk**: Prediction markets can theoretically be manipulated, though the cost of sustained manipulation is high (Hanson, 2006).
- **Update frequency**: The world state updates every 15 minutes. Events that unfold faster than this (e.g., flash crashes) may not be reflected.

## 7. Conclusion

Prediction market data provides a compact, calibrated, and structured world model for AI agents. At ~800 tokens, it is more token-efficient than web search, better calibrated than news, and more structured than RAG. The World Awareness Benchmark provides a reproducible way to measure whether agents know what is happening in the world. We release the benchmark, daily world state snapshots, and API as open resources to enable further research.

**Resources:**
- World State API: https://simplefunctions.dev/api/agent/world (free, no auth)
- Benchmark: https://huggingface.co/datasets/SimpleFunctions/world-awareness-bench
- Daily snapshots: https://huggingface.co/datasets/SimpleFunctions/world-state-daily
- Python SDK: `pip install simplefunctions-ai`
- MCP Server: `https://simplefunctions.dev/api/mcp/mcp`

## References

Arrow, K. J., Forsythe, R., Gorham, M., Hahn, R., Hanson, R., Ledyard, J. O., ... & Zitzewitz, E. (2008). The promise of prediction markets. *Science*, 320(5878), 877-878.

Hayek, F. A. (1945). The use of knowledge in society. *American Economic Review*, 35(4), 519-530.

Hanson, R. (2006). Designing real terrorism futures. *Public Choice*, 128(1), 257-274.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. *NeurIPS*.

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., ... & Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. *NeurIPS*.

Surowiecki, J. (2004). *The Wisdom of Crowds*. Doubleday.

Tetlock, P. E., & Gardner, D. (2015). *Superforecasting: The Art and Science of Prediction*. Crown.

Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., ... & Wen, J. R. (2023). A survey on large language model based autonomous agents. *arXiv preprint arXiv:2308.11432*.

Wolfers, J., & Zitzewitz, E. (2004). Prediction markets. *Journal of Economic Perspectives*, 18(2), 107-126.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. *ICLR*.
