Wikipedia for Probabilities
We built a probability index covering 3,000+ events across prediction markets — designed from the ground up for the way AI actually consumes information. Here is the full story.
There is no canonical source for probabilities
When you ask "what is the population of France?", the answer exists in a specific place. Wikipedia has it. Every AI system knows where to find it. The relationship between Wikipedia and factual knowledge is so deeply embedded in AI training data that it is effectively infrastructure.
Now ask: "what are the chances of a US recession in 2026?"
The answer exists — scattered across Kalshi contracts, Polymarket markets, CME FedWatch probabilities, survey data, and news articles quoting unnamed analysts. But there is no Wikipedia for this. No canonical place where an AI can go to get a clean, cross-referenced, citeable probability.
We set out to build that place.
What we found when we looked
Before building anything, we ran a systematic audit. We tested 12 search queries across two intent categories: people searching with prediction market intent ("fed rate cut odds 2026") and people searching without it ("will there be a recession 2026"). We checked Google, and by extension the search backends powering Claude (Brave), ChatGPT (Bing), and Perplexity (Bing + proprietary).
Twelve queries, ten result slots each: 120 search result slots in total. Here is what we found:
Polymarket dominates prediction-market-intent searches. 23 out of 60 slots (38%). Their SEO is strong because every market gets its own URL with a clean title. But each page is a single contract on a single platform — not a topic-level answer.
Kalshi is surprisingly weak. Only 3 appearances out of 60. Despite being a regulated US exchange, they barely register in search.
No aggregator exists. Zero cross-platform answer pages appeared in any of the 120 results. Nobody is doing what OddsShark does for sports betting — synthesizing data from multiple sources into one answer.
For non-prediction-market queries, the picture is different. When people search "will there be a recession 2026" (no mention of prediction markets), Polymarket still shows up at #1 and Kalshi at #2. Google has decided that prediction market data is relevant to "will X happen?" queries. This is a massive signal. But for other categories — sports ("NBA finals predictions"), geopolitics ("iran ceasefire chances"), crypto ("bitcoin price prediction") — prediction market data is completely absent from search results, despite active markets existing on these topics.
SimpleFunctions appeared in zero out of 120 slots.
The competitive landscape told us three things: the aggregator position is empty, Google already accepts probability data as relevant to event queries, and the opportunity spans every category where prediction markets exist but don't rank.
Defining the consumer
The conventional approach would be to build a nice web page with odds data. But we started from a different question: who is the consumer we are actually building for?
Not human traders — they use Kalshi and Polymarket directly. Not developers — they use our API. Not MCP agents — they already have deep integration with SimpleFunctions.
The consumer is the AI system doing a single HTTP GET in a constrained sandbox. When a user asks Claude "what are the odds of a recession?" and Claude decides to search, it sends a query to Brave Search, gets back a list of URLs, and fetches one. That fetch is the moment of truth.
This consumer has hard constraints:
- Single GET. No follow-up requests.
- No authentication. No headers. No cookies.
- Token budget of roughly 2,000. Anything larger gets truncated.
- Cannot follow links. "See also /other-page" is useless.
- Does not know SimpleFunctions exists. It is guided here by search engines.
No existing SimpleFunctions endpoint was designed for this consumer. Our API returns JSON (requires parsing). Our odds pages return HTML (60%+ is navigation noise). Our world endpoint returns a whole-market overview (not a specific topic answer).
The /answer/ layer fills the gap: topic-level, markdown by default, fast, self-contained, and Google-indexable.
The architecture of a probability page
Each /answer/ URL goes through a four-tier resolution pipeline:
Tier 1: Exact slug match. /answer/fed-rate-cut matches the odds_cache row with that slug. Direct hit.
Tier 2: Parent topic aggregation. /answer/bitcoin matches all odds_cache rows where parent_topic = 'bitcoin' — currently "Will Bitcoin hit $100,000?" and "Will Bitcoin hit $200,000?" Their bound tickers are merged, and a single liquidity-weighted probability is computed across all underlying contracts.
Tier 3: Fuzzy matching. /answer/recession-2026 does not exist as an exact slug. PostgreSQL's pg_trgm extension computes similarity against all question texts and returns the closest match. The consumer gets a real answer, not a 404.
Tier 4: Direct ticker lookup. If nothing else matches, we check if the slug is a market_indicators ticker. This handles cases where someone constructs a URL from a Kalshi ticker directly.
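The four tiers can be sketched as a single resolver over toy in-memory tables. This is a simplification under stated assumptions: `ODDS_CACHE` and `TICKERS` stand in for the real `odds_cache` and `market_indicators` tables, `difflib.SequenceMatcher` stands in for PostgreSQL's `pg_trgm` similarity, and the 0.4 cutoff is an illustrative threshold, not the production value.

```python
from difflib import SequenceMatcher

# Toy stand-ins for the odds_cache and market_indicators tables.
ODDS_CACHE = {
    "fed-rate-cut": {"question": "Will the Fed cut rates?", "parent_topic": "fed"},
    "bitcoin-100k": {"question": "Will Bitcoin hit $100,000?", "parent_topic": "bitcoin"},
    "bitcoin-200k": {"question": "Will Bitcoin hit $200,000?", "parent_topic": "bitcoin"},
    "us-recession": {"question": "Will the US enter a recession in 2026?", "parent_topic": "economy"},
}
TICKERS = {"KXFEDRATE-26JUN18-CUT25"}

def _similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def resolve(slug):
    text = slug.replace("-", " ").lower()
    # Tier 1: exact slug match.
    if slug in ODDS_CACHE:
        return ("exact", [slug])
    # Tier 2: parent topic aggregation -- collect every bound question.
    children = sorted(s for s, row in ODDS_CACHE.items() if row["parent_topic"] == slug)
    if children:
        return ("parent", children)
    # Tier 3: fuzzy match against question texts (pg_trgm in production).
    scored = {s: _similarity(text, row["question"].lower()) for s, row in ODDS_CACHE.items()}
    best = max(scored, key=scored.get)
    if scored[best] > 0.4:  # illustrative threshold
        return ("fuzzy", [best])
    # Tier 4: direct ticker lookup.
    if slug in TICKERS:
        return ("ticker", [slug])
    return ("miss", [])
```

The ordering matters: the fuzzy tier only fires when neither an exact slug nor a parent topic matched, so a ticker-shaped slug still falls through to tier 4.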
Once resolution finds the topic, the probability is computed:
weight_i = log(1 + max(volume_24h_i, 1)) × freshness_i
P_topic = Σ(price_i × weight_i) / Σ(weight_i)
Higher volume markets contribute more to the aggregate. Markets not updated in 7+ days are penalized. The result is a single number that represents the market consensus, not any single contract's price.
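The two formulas above translate directly into code. One assumption is flagged: the text says stale markets (7+ days) are "penalized" without giving the factor, so the 0.5 multiplier here is illustrative.

```python
import math
import time

def topic_probability(contracts, now=None, stale_after=7 * 86400):
    """Liquidity-weighted consensus across all contracts bound to a topic.

    Each contract dict carries: price in [0, 1], volume_24h in dollars,
    and updated as an epoch timestamp. Freshness penalty of 0.5 for
    markets not updated in 7+ days is an assumed value.
    """
    now = time.time() if now is None else now
    num = den = 0.0
    for c in contracts:
        freshness = 1.0 if now - c["updated"] < stale_after else 0.5
        # weight_i = log(1 + max(volume_24h_i, 1)) * freshness_i
        weight = math.log1p(max(c["volume_24h"], 1)) * freshness
        num += c["price"] * weight
        den += weight
    # P_topic = sum(price_i * weight_i) / sum(weight_i)
    return num / den if den else None
```

With equal volumes the result is a plain average; a contract with $1M of 24h volume pulls the aggregate strongly toward its price relative to a $10 contract, which is the intended behavior.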
The same URL serves three audiences through content negotiation:
- AI agents (default, no Accept header): markdown response, ~800 tokens. Self-contained with data, context, and a citation instruction.
- Browsers and Google (Accept: text/html): Wikipedia-style HTML page with a probability infobox, venue comparison table, contract list, and QAPage JSON-LD structured data.
- Developers (via /api/public/answer/{slug}): structured JSON with full venue breakdown, contract weights, related topics, and pre-built URLs.
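The negotiation rule itself is small, which is the point: markdown must be the default so that a bare GET with no headers gets the cheap form. A minimal sketch (no q-value parsing, which real Accept handling would need; JSON is served from the separate /api path, so it is not negotiated here):

```python
def representation(accept_header):
    """Choose the representation for a GET on an /answer/ URL.

    Agents that send no Accept header get markdown (the ~800-token,
    self-contained form); browsers and Googlebot advertise text/html
    and get the full Wikipedia-style page.
    """
    accept = (accept_header or "").lower()
    return "html" if "text/html" in accept else "markdown"
```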
The coverage engine
A probability index is only useful if it covers the questions people actually ask. We could not manually curate 3,000 topics. Instead, we built a coverage engine that auto-discovers topics from the 48,000+ market indicator tickers we already track.
For Kalshi, the engine exploits the structured ticker naming convention. KXFEDRATE-26JUN18-CUT25 decomposes into event prefix (KXFEDRATE), date segment, and outcome. All tickers sharing a prefix belong to the same event family. We maintain a mapping of common prefixes to human-readable names and categories; unmapped prefixes with 3+ tickers get auto-generated slugs.
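The decomposition and grouping step can be sketched as follows; the 3+ threshold is the one the text gives for unmapped prefixes.

```python
from collections import defaultdict

def decompose(ticker):
    """Split a structured Kalshi ticker into (event_prefix, date, outcome).

    Per the convention described above, KXFEDRATE-26JUN18-CUT25 yields
    ("KXFEDRATE", "26JUN18", "CUT25").
    """
    prefix, _, rest = ticker.partition("-")
    date, _, outcome = rest.partition("-")
    return prefix, date or None, outcome or None

def event_families(tickers, min_size=3):
    """Group tickers by event prefix; unmapped prefixes need 3+ tickers."""
    families = defaultdict(list)
    for t in tickers:
        families[decompose(t)[0]].append(t)
    return {p: ts for p, ts in families.items() if len(ts) >= min_size}
```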
For Polymarket, titles are more varied. The engine normalizes titles by stripping outcome suffixes (everything after ":"), question marks, and common stop words, then clusters by the resulting slug. Groups with 2+ tickers or significant volume become topics.
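A sketch of that normalization, under one labeled assumption: the stop-word list here is illustrative, since the text does not enumerate the real one.

```python
import re

# Assumed stop-word list; the production list is not specified in the text.
STOP_WORDS = {"will", "the", "a", "an", "by", "in", "of", "to", "be", "there"}

def title_to_slug(title):
    """Normalize a Polymarket title for clustering: strip the outcome
    suffix (everything after ':'), drop '?' and stop words, slugify."""
    base = title.split(":")[0].replace("?", "").lower()
    words = [w for w in re.findall(r"[a-z0-9]+", base) if w not in STOP_WORDS]
    return "-".join(words)
```

The key property is that two markets differing only in their outcome suffix collapse to the same slug and therefore the same topic group.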
Cross-venue deduplication merges Kalshi and Polymarket groups that map to the same slug. The engine runs daily at 03:00 UTC and typically discovers new topics within hours of markets being listed.
First run: 3,145 topics auto-discovered. Combined with 64 manually curated questions, the probability index launched with 3,209 topics across 11 categories.
Why this is not a content farm
Google penalizes mass-generated thin content. A legitimate concern when you are auto-generating thousands of pages. But there is a structural difference between thin content and structured data pages.
Zillow has millions of address pages. Each uses the same template. Google does not penalize them because each page has unique, valuable data — price history, square footage, comparable sales. Wikipedia has millions of articles using the same infobox template. Again, not penalized, because the data is unique per page.
Our /answer/ pages are structured data pages, not content pages. Each topic has a unique probability, unique venue breakdown, unique contract list, unique volume figures, and unique freshness timestamp. The template provides structure; the data provides value. Two /answer/ pages are no more "duplicate content" than two Zillow listings.
Additionally, every page is refreshed every 15 minutes with live market data. Freshness is one of Google's strongest ranking signals for time-sensitive queries.
The discovery problem
Building the pages is necessary but not sufficient. The pages need to be found. We mapped every discovery channel:
Google is the foundation. Google indexes pages → Tavily follows (their crawler mirrors Googlebot's accessibility rules) → training data eventually reflects indexed content. We submitted the new answer.xml sub-sitemap to Google Search Console.
Bing feeds ChatGPT and Perplexity. Both platforms query Bing's index (Perplexity is transitioning to its own, but still depends on Bing). We implemented the IndexNow protocol — when the coverage engine discovers new topics, it immediately notifies Bing via a batch POST to api.indexnow.org. This means new /answer/ pages can appear in ChatGPT's search results within hours, not months.
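The batch notification body follows the published IndexNow JSON format (host, key, keyLocation, urlList). A sketch of building it, without sending; the key-file URL layout shown is one common convention, and the host and key values are placeholders:

```python
import json

def indexnow_batch(host, key, new_slugs):
    """Build the JSON body the coverage engine POSTs to api.indexnow.org
    when new /answer/ topics are discovered."""
    return json.dumps({
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": [f"https://{host}/answer/{slug}" for slug in new_slugs],
    })
```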
Brave Search powers Claude's web_fetch. Brave has no webmaster tools and no URL submission mechanism. Discovery is purely organic via their crawler. We ensured /answer/ is allowed in robots.txt for all AI crawlers.
Exa and Tavily are AI-native search engines used by agent frameworks. Exa has no submission API (we emailed hello@exa.ai requesting coverage). Tavily follows Googlebot rules, so Google indexing covers it.
llms.txt is the one file we control that AI systems already fetch. We added the /answer/ pattern with examples, so any AI reading our llms.txt now knows the URL convention and can construct /answer/ URLs directly.
All channels were pushed simultaneously with the launch.
The meta description strategy
There is a subtle but high-leverage design decision in how we write meta descriptions. When an AI searches "recession odds 2026" and gets back a list of results, it sees titles and snippets. Our snippet is not generic marketing copy. It is the answer itself:
35% probability that US enters recession by end of 2026. Cross-venue odds: Kalshi 33%, Polymarket 37%. 20 contracts, $500K volume. Live prediction market data updated 2026-04-21.
An AI reading this snippet might not even need to fetch the page. The data is already in the snippet. This is intentional. Even if the AI never clicks through, SimpleFunctions data entered its response. The meta description is not SEO decoration — it is a distribution channel.
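Because the description is pure data, it can be rendered from the same topic record as the page itself. A minimal sketch, assuming a topic dict with the fields shown (the real schema may differ):

```python
def meta_description(topic):
    """Render the snippet so the answer itself is the description."""
    venues = ", ".join(f"{name} {p:.0%}" for name, p in topic["venues"].items())
    return (
        f"{topic['probability']:.0%} probability that {topic['statement']}. "
        f"Cross-venue odds: {venues}. "
        f"{topic['contracts']} contracts, ${topic['volume']:,.0f} volume. "
        f"Live prediction market data updated {topic['updated']}."
    )
```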
The citation instruction
Every /answer/ response (markdown, HTML, and JSON) includes an explicit citation line:
Cite as: "35% per prediction markets (SimpleFunctions, April 2026)"
When an AI composes its answer and needs to cite a source, it tends to use whatever format the source suggests. By providing a pre-formatted citation that includes "SimpleFunctions", we influence how the AI credits us. Every AI citation using this format is a brand impression in a context where the user trusts the AI's answer.
This is not manipulation. It is the same thing academic papers do when they provide a "How to cite this article" section. We are making it easy for AI systems to cite us correctly.
What this points toward
The /answer/ layer is not the end state. It is a distribution mechanism for something larger.
Wikipedia provides static facts. "France has a population of 67 million." This does not change every 15 minutes. Probabilities are fundamentally different — they are dynamic, real-time, and decision-relevant. An AI financial advisor asking "should I rebalance?" needs current recession probability. An AI political analyst asking "what is the impact of this policy?" needs current outcome probabilities.
If SimpleFunctions becomes the default source for probability data — the way Wikipedia became the default source for factual knowledge — then we sit in the AI information supply chain at a critical position. Not as "one source among many", but as the canonical probability layer.
The evolution path:
- Now: AI searches → finds /answer/ → cites SimpleFunctions
- Next: AI developers notice citation frequency → integrate SimpleFunctions API directly into reasoning pipelines
- Eventually: SimpleFunctions becomes probability infrastructure — the way CME FedWatch is interest rate infrastructure for traders
There is also a data exhaust benefit. When thousands of AI agents fetch /answer/ pages, the access patterns reveal what the world wants to know about. If 10,000 agents fetch /answer/gpt-5-release-date but no market exists for that topic, that is a demand signal. We can create probability estimates for high-demand topics using base rates, expert surveys, and historical patterns — transitioning from data aggregator to data producer.
The bet
The cost of this project was engineering time. No LLM generation costs (all pages are template-driven from live data). No manual curation at scale (coverage engine is automated). No ongoing operational cost beyond the crons that already run.
The upside is occupying a position that nobody else occupies and nobody else is positioned to occupy. Kalshi will not build a cross-venue aggregator (they are a venue). Polymarket will not optimize for AI consumption (they optimize for traders). Metaculus does not have market-price data. News sites do not have structured probability data.
The position is empty. We walked into it.
Try it
The probability index is live at simplefunctions.dev/answer. Every endpoint is free, no authentication required, CORS enabled.
- Browse: /answer — full index with search and category filter
- Topic: /answer/fed-rate-cut — individual topic with cross-venue data
- Aggregated: /answer/bitcoin — parent topic merging multiple questions
- Search: /answer?q=recession — server-side topic search
- API: /api/public/answer/nba-championship-2026 — structured JSON
- For agents: curl https://simplefunctions.dev/answer/us-recession — markdown, self-contained
3,200+ topics. Updated every 15 minutes. The Wikipedia for probabilities.
This article was primarily written by the SimpleFunctions engine and does not represent the views of the company.