Insights · Mar 24, 2026 · 10 min read

Why Your AI Agent Needs a Thesis, Not Just Data

Most AI trading agents make money for a week, then blow up. The problem isn't the model — it's the architecture. Here's why structured reasoning beats raw data every time.

SimpleFunctions Research · AI-generated
#AI agent #trading bot #thesis #prediction market #causal tree #track record



The Graveyard of AI Trading Bots

There's a pattern that plays out every few months in prediction market Discord servers. Someone posts a screenshot of their AI agent's P&L — up 40% in two weeks, fully automated, running on GPT-4 or Claude with a simple prompt: "Here's the market data for these prediction markets. Tell me what to buy."

Two weeks later, they're quiet. The bot gave back all the gains plus half the bankroll. The post-mortem is always the same: "It was working, then it just... stopped."

This isn't a mystery. It's a structural failure, and it's the same structural failure almost every AI trading agent has baked into its architecture from day one. The agent has no thesis. It has data, a language model, and a prayer.

By March 2026, prediction markets on Polymarket and Kalshi have matured into serious instruments. Combined open interest regularly exceeds $1 billion. The markets cover geopolitics, economics, technology, climate — anything with a measurable outcome. The opportunity is real. But the vast majority of AI agents built to capture that opportunity share the same fatal design flaw.

They treat the LLM as an oracle instead of an analyst.


The Data Fallacy

The standard architecture for an AI prediction market agent looks something like this:

  1. Pull current market prices and recent news
  2. Stuff it into a prompt
  3. Ask the LLM: "Should I buy YES or NO on this market?"
  4. Execute the trade
  5. Repeat
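Part of the appeal is that this loop is only a few lines of code. A minimal sketch of the stateless cycle, where `fetch_market_data`, `ask_llm`, and `execute_trade` are hypothetical stubs supplied by the caller:

```python
def naive_agent_cycle(fetch_market_data, ask_llm, execute_trade, market_id):
    """One cycle of the stateless "data in, trade out" architecture.

    Nothing persists between calls: the agent re-derives its view of the
    market from scratch every time it runs.
    """
    snapshot = fetch_market_data(market_id)              # 1. pull prices + news
    prompt = f"Market data: {snapshot}. Buy YES or NO?"  # 2. stuff into a prompt
    decision = ask_llm(prompt)                           # 3. ask the model
    execute_trade(market_id, decision)                   # 4. execute the trade
    return decision                                      # 5. repeat next cycle
```

Note that no state survives the function call. That absence is the architecture, and it is the root of every flaw discussed below.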

This is seductive because it works — briefly. LLMs are good at pattern matching. Given a snapshot of information, they can produce a plausible-sounding analysis. And in low-volatility periods where markets are drifting toward their eventual resolution, "plausible-sounding" is often good enough to make money.

But this architecture has four fatal flaws that guarantee eventual failure.

Flaw 1: No Persistent Memory of Reasoning

When your agent buys YES on "Will the Fed cut rates before July 2026?" at 42 cents, it had reasons. Maybe the latest jobs report showed softening. Maybe two Fed governors made dovish comments. Maybe the yield curve shifted.

But three hours later, when the agent runs again, those reasons are gone. The new prompt has new data. The agent doesn't remember why it entered the position. It only knows it has one.

This means the agent can't do the single most important thing a trader must do: evaluate whether the original reasoning still holds. It's flying blind on positions it opened with conviction.

Flaw 2: New Data Overwrites Old Reasoning

Context windows are finite. Even with 200K tokens, you can't fit three weeks of continuous market analysis into a prompt. So you're forced to choose: recent data or historical reasoning?

Everyone chooses recent data. The result is that a single noisy data point — a misleading headline, an outlier poll, a tweet from a marginal analyst — can cause the agent to reverse a well-reasoned position. The agent has no framework to weight new information against the accumulated evidence that drove the original trade.

Consider a concrete scenario. Your agent has been tracking a Kalshi market on whether US GDP growth will exceed 2.5% in Q1 2026. Over two weeks, it has seen twelve data points suggesting strong growth: retail sales, PMI data, employment figures. It bought YES at 45 cents based on this body of evidence.

Then one weak housing starts number comes in. In the next prompt cycle, the agent sees: current position is YES at 45c, housing starts missed expectations, current price is 43c. Without memory of the twelve confirming data points, that single miss looms large. The agent sells. The next day, a strong consumer confidence report pushes the price to 51c. Classic whipsaw, caused entirely by architectural amnesia.

Flaw 3: No Framework to Distinguish Signal from Noise

Not all information is equal. A Federal Reserve chair's press conference matters more than a regional Fed president's speech at a community college. A top-line jobs number matters more than a revision to the previous month's construction employment figure.

But when you dump a stream of headlines into a prompt, the LLM has no calibrated framework for importance weighting. It treats each piece of information as roughly equivalent, because in the context of a single prompt, it has no basis for comparison.

Experienced human traders solve this by having a mental model — a thesis — that acts as a filter. "I'm long this market because of monetary policy trajectory. Employment data is primary. Housing data is secondary. Individual Fed speeches are noise unless they come from the chair or vice chair." That filter lets them process a firehose of information without getting whipsawed by every data point.

An agent without a thesis has no filter. Every data point is equally important, which means no data point is appropriately important.

Flaw 4: No Exit Framework

Ask any professional trader what separates amateurs from professionals and most will give some version of the same answer: amateurs focus on entries, professionals focus on exits.

When should your agent close a position? The "data in, decision out" architecture has no answer to this question beyond "when the model says so." But the model's opinion shifts with every prompt cycle. There's no concept of "what would have to change for my thesis to be wrong?"

Without a defined exit framework, agents hold losers too long (because no single data point is dramatic enough to trigger a sell) and cut winners too short (because any wobble in price creates uncertainty in the next prompt cycle). This is the exact behavioral pattern that destroys trading accounts, and it's hardwired into the architecture.


The Thesis Anchor

The fix isn't better data or a smarter model. The fix is giving your agent a structure for reasoning that persists across time.

A thesis is not a prediction. "I think the Fed will cut rates" is a prediction. A thesis is a complete reasoning framework with four components.

Component 1: The Causal Tree

A thesis starts with a decomposition: "I believe X will happen because of A, B, and C, where A depends on A1 and A2, and B depends on B1."

For example, take the question "Will the US enter a recession by Q4 2026?" A causal tree might look like:

  • Thesis: Recession likely (>55%)
    • Branch A: Consumer spending declines
      • A1: Credit card delinquencies rising (currently at 3.2%, threshold: 3.8%)
      • A2: Savings rate below 4% for 3+ consecutive months
    • Branch B: Labor market weakens
      • B1: Unemployment claims trend above 250K weekly average
      • B2: Job openings-to-unemployed ratio falls below 1.0
    • Branch C: Fed policy remains restrictive too long
      • C1: No rate cut by June 2026
      • C2: Real rates remain above 2%

This tree does something that a flat prompt never can: it gives the agent a structured lens for processing new information. When a jobs report comes in, the agent doesn't ask "is this good or bad?" It asks "does this update B1 or B2, and if so, does it strengthen or weaken Branch B?"

The tree also makes reasoning transparent. You can look at your agent's thesis and understand exactly why it holds a position. When Branch A is strong and Branch C is weakening, you can see the tension. When all three branches are confirmed, you can see the conviction. This isn't a black box — it's an explicit model of the world.
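One way to hold such a tree in a program is a small recursive data structure. A sketch, where the `Branch` class and its simple confirmed-leaf-counting rule are illustrative assumptions, not SimpleFunctions' actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class Branch:
    """One node of a causal tree: a claim plus measurable sub-conditions."""
    claim: str
    children: list = field(default_factory=list)
    status: str = "open"  # "open", "confirmed", or "denied"

    def leaves(self):
        """All leaf conditions under this node (the measurable indicators)."""
        if not self.children:
            return [self]
        out = []
        for child in self.children:
            out.extend(child.leaves())
        return out

    def strength(self) -> float:
        """Fraction of leaf conditions currently confirmed."""
        leaves = self.leaves()
        if not leaves:
            return 0.0
        return sum(1 for leaf in leaves if leaf.status == "confirmed") / len(leaves)


# The recession tree from above, abbreviated to two branches:
tree = Branch("US recession by Q4 2026", [
    Branch("Consumer spending declines", [
        Branch("Credit card delinquencies above 3.8%"),
        Branch("Savings rate below 4% for 3+ months"),
    ]),
    Branch("Labor market weakens", [
        Branch("Claims trend above 250K weekly average"),
        Branch("Openings-to-unemployed ratio below 1.0"),
    ]),
])
```

When a jobs report arrives, the agent updates the status of one specific leaf, and the strength of the affected branch changes in a way anyone can audit.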

Component 2: The Edge Calculation

A thesis without a market comparison is just an opinion. The second component maps your reasoning to price.

"My causal tree suggests recession probability is 58%. Polymarket prices the 'US Recession in 2026' contract at 41 cents. Kalshi prices a similar contract at 39 cents. My edge is 17-19 cents."

This is where prediction markets become powerful for AI agents. Unlike stock markets, prediction markets give you a clean, bounded probability to compare against. Your thesis produces a probability estimate. The market quotes a price. The difference is your edge.

But the edge isn't static. It updates every time the thesis updates. If Branch C weakens because the Fed signals a June cut, your recession probability might drop from 58% to 49%. Your edge against a 41-cent market just went from 17 cents to 8 cents. The agent can mechanically adjust position sizing — or exit entirely if the edge disappears.

This is fundamentally different from the "should I buy or sell?" approach. The agent isn't making binary decisions. It's continuously maintaining a quantified view of how its reasoning differs from the market's, and sizing positions accordingly.
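The arithmetic above can be made mechanical. A sketch, where the linear sizing rule and its parameters (`full_size_edge`, `cap`) are assumptions for illustration rather than a recommended money-management policy:

```python
def edge_cents(thesis_prob: float, market_price_cents: float) -> float:
    """Edge = thesis probability expressed in cents, minus the market price."""
    return thesis_prob * 100.0 - market_price_cents


def size_fraction(edge: float, discount: float = 0.0,
                  full_size_edge: float = 20.0, cap: float = 0.05) -> float:
    """Map a calibration-discounted edge to a bankroll fraction.

    Hypothetical linear rule: scale up to `cap` of bankroll as the
    discounted edge approaches `full_size_edge` cents; stand aside
    entirely when the discounted edge is non-positive.
    """
    calibrated = edge * (1.0 - discount)
    if calibrated <= 0:
        return 0.0
    return cap * min(calibrated / full_size_edge, 1.0)
```

With the numbers from the text, a 58% thesis against a 41-cent Polymarket quote yields a 17-cent edge, and against a 39-cent Kalshi quote a 19-cent edge; if the thesis drops to 49%, the Polymarket edge shrinks to 8 cents and the position shrinks with it.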

Component 3: The Kill Condition

Every thesis should come with its own death certificate, pre-written.

"If unemployment claims drop below 200K for four consecutive weeks AND the Fed cuts rates by 50bps or more, my recession thesis is dead. Exit all related positions regardless of P&L."

Kill conditions solve the exit problem entirely. The agent doesn't need to "decide" whether to exit. It monitors its own kill conditions the same way it monitors its causal tree. When the conditions are met, the thesis is falsified, and the position closes.

This is how professional macro traders operate. George Soros didn't exit his famous pound sterling short because the price moved against him. He had a thesis about the unsustainability of the ERM peg. If the Bank of England had demonstrated credible capacity to defend the peg — that was his kill condition. Until that happened, price moves against him were noise to be endured, not signals to be obeyed.

Your AI agent needs the same framework. Without it, every tick against the position is a potential panic sell.

Component 4: The Track Record

The fourth component is the one almost nobody builds: a quantified history of the agent's own accuracy.

"In the last 90 days, this agent has created 23 theses. 15 resolved in the direction of the thesis (65%). The average edge at entry was 14 cents. The average resolution value was 11 cents in the thesis direction. The agent is slightly overconfident — it should discount its estimated edges by approximately 20%."

This track record isn't just a report card. It's an active input to future decisions. When the agent estimates a 17-cent edge on the recession market, it can apply its own calibration: "My historical accuracy suggests my real edge is closer to 14 cents. I should size this position accordingly."

This is a feedback loop that makes the system genuinely adaptive. Not in the vague "machine learning" sense of retraining a model, but in the precise sense of Bayesian updating: using your own prediction history to calibrate future confidence.


Cold Start vs. Continuous Context

Here's a distinction that matters enormously in practice: the difference between one-shot research and accumulated thesis context.

A one-shot research agent works like this: you ask it to analyze a market, it does a deep dive, produces a report, and you make a decision. This is useful but perishable. By tomorrow, the research is stale. By next week, it's dangerous — you're acting on conclusions drawn from data that may have been superseded three times over.

A thesis-driven system works differently. The initial research creates the thesis — the causal tree, the edge calculation, the kill conditions. But from that point forward, the system is continuously maintaining the thesis against incoming information.

Continuous News Scanning

Every 15 minutes, the system scans for news relevant to the thesis branches. But it isn't scanning randomly. It's scanning with structure: "Find information that updates Branch A1 (credit card delinquencies), Branch B1 (unemployment claims), or Branch C1 (Fed rate decisions)."

This targeted scanning is dramatically more effective than general news monitoring. The agent isn't trying to process the entire information universe. It's looking for specific signals that update specific parts of its model.

More importantly, the scanning is adversarial. The system doesn't just look for confirming evidence. It actively searches for thesis-breaking information. "Find evidence that consumer spending is actually accelerating." "Find analysis arguing that labor market tightness is structural, not cyclical." This adversarial search is what prevents the agent from falling into confirmation bias — a trap that catches human traders and unstructured AI agents alike.
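The targeted, adversarial scan can be sketched as query generation over the thesis's open leaf conditions. The pairing of a confirming and a disconfirming query per condition is the point; the query phrasing and the downstream news API are assumed, not specified by the system:

```python
def adversarial_queries(leaf_conditions):
    """For each open leaf condition, emit both a confirming and a
    disconfirming search query, so every scan cycle actively hunts
    for thesis-breaking evidence rather than just confirmation."""
    queries = []
    for cond in leaf_conditions:
        queries.append(("confirm", f"evidence that {cond}"))
        queries.append(("break", f"evidence against: {cond}"))
    return queries
```

Feeding these into a news search gives the agent a bounded, structured information diet: it checks specific branches instead of attempting to digest the entire information universe.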

Price-Driven Edge Recalculation

As market prices move, the system continuously recalculates edges across venues. If Polymarket's recession contract moves from 41 cents to 48 cents while the thesis hasn't changed, the edge just shrank from 17 cents to 10 cents. Maybe it's time to take profit on part of the position. If Kalshi's price stays at 39 cents while Polymarket moves, there's now a cross-venue spread worth monitoring.

This continuous recalculation is only possible because the thesis provides a stable reference point. Without a thesis, a price move is ambiguous — does it mean the market knows something you don't, or does it mean the market is wrong and your edge just got bigger? With a thesis, you can answer that question: check the causal tree. Has any branch changed? No? Then the market moved on noise, and your edge widened.
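With a stable thesis probability as the reference point, the recalculation itself is arithmetic. A sketch using the prices from the scenario above (venue names are just dictionary keys here, not API integrations):

```python
def recalc_edges(thesis_cents: float, venue_prices: dict):
    """Recompute the edge at each venue against a fixed thesis probability,
    plus the cross-venue spread worth monitoring."""
    edges = {venue: thesis_cents - price for venue, price in venue_prices.items()}
    spread = max(venue_prices.values()) - min(venue_prices.values())
    return edges, spread


# Thesis holds at 58c; Polymarket has moved to 48c while Kalshi sits at 39c.
edges, spread = recalc_edges(58, {"polymarket": 48, "kalshi": 39})
```

Here the Polymarket edge has shrunk from 17 to 10 cents while the Kalshi edge remains 19 cents, and the 9-cent cross-venue spread is flagged for monitoring.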

The Agent Never Forgets

This is the key difference. A thesis-driven system doesn't have the amnesia problem. The causal tree persists. The edge calculations persist. The kill conditions persist. When the agent processes new information, it's updating a living model, not building a new one from scratch every cycle.

The context doesn't decay over time — it accumulates. After two weeks of monitoring, the agent has a richer, more nuanced model than it did on day one. It has seen twelve data points that strengthen Branch A and three that weaken Branch C. It has watched the market price oscillate between 38 and 46 cents while its thesis has remained stable. It knows, with quantified confidence, that its view differs from the market's.

A cold-start agent on day fourteen has the same amount of context as a cold-start agent on day one: whatever fits in the prompt window. A thesis-driven agent on day fourteen has fourteen days of accumulated evidence, structured and weighted.


The Feedback Loop

Track record injection is where the system becomes genuinely intelligent in a way that goes beyond what the underlying LLM can do on its own.

Here's how it works in practice. Say your agent has been running for three months. It has created theses on 30 different markets. Of those, 22 have resolved. Of the 22, the agent's thesis direction was correct 15 times (68%). The average edge the agent identified at entry was 16 cents. The actual average gain on correct predictions was 12 cents. The actual average loss on incorrect predictions was 28 cents.

This tells you several things:

  1. Direction accuracy is decent — 68% is meaningfully above chance.
  2. The agent is overconfident on edge sizing — it estimates 16 cents but captures 12. There's a systematic 25% overestimate.
  3. Loss sizing is a problem — losing 28 cents on misses versus gaining 12 cents on hits means the agent needs to be right 70% of the time just to break even. At 68%, it's slightly underwater.

This feedback transforms the system. The agent can now apply specific corrections:

  • Discount edge estimates by 25%
  • Tighten kill conditions to limit loss magnitude
  • Require a minimum edge of 20 cents (pre-discount) to enter a position, ensuring post-discount edges remain profitable

None of this requires retraining a model. It's mechanical calibration based on empirical results. And it's only possible because the system has been tracking structured theses with quantified edges — not just a log of trades.
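The three corrections fall straight out of the raw numbers. A sketch that reproduces the arithmetic above (the report structure is illustrative, not the system's actual output format):

```python
def calibration_report(n_resolved: int, n_correct: int,
                       avg_est_edge: float, avg_gain: float, avg_loss: float):
    """Turn a resolved-thesis history into three calibration facts:
    direction accuracy, the systematic edge overestimate, and the win
    rate required to break even given the gain/loss asymmetry."""
    return {
        "accuracy": n_correct / n_resolved,
        "edge_discount": 1.0 - avg_gain / avg_est_edge,
        "breakeven_win_rate": avg_loss / (avg_gain + avg_loss),
    }


# 22 resolved theses, 15 correct; 16c estimated edge, 12c avg gain, 28c avg loss.
report = calibration_report(22, 15, 16, 12, 28)
```

The report shows 68% accuracy, a 25% edge overestimate, and a 70% break-even win rate. Because accuracy sits below break-even, the system knows it is slightly underwater and can tighten kill conditions and raise its minimum-edge threshold accordingly.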

After another three months, the calibration tightens further. Maybe the agent discovers it's particularly good at political markets (74% accuracy) and weaker on economic markets (61% accuracy). It can weight position sizes accordingly. Maybe it discovers its kill conditions trigger too early on high-volatility markets, causing it to exit positions that would have been profitable. It can adjust thresholds by market type.

This is what genuine learning looks like in a trading system. Not a neural network silently adjusting weights, but a transparent, auditable process where the system's own prediction history feeds back into its decision-making framework.


Building This in Practice

Conceptually, this is straightforward. In practice, building the infrastructure — causal tree decomposition, multi-venue market scanning, continuous monitoring, adversarial news search, track record maintenance — is a serious engineering project.

This is the problem SimpleFunctions was built to solve. The practical workflow looks like this:

You start with a thesis:

sf create "The US will enter a recession by Q4 2026 because consumer 
spending is weakening, the labor market is softening, and Fed policy 
remains too restrictive"

The system decomposes this into a causal tree automatically, identifying the key branches, the measurable indicators for each branch, and the thresholds that would confirm or deny each branch. It scans both Kalshi and Polymarket for relevant contracts and calculates initial edges.

From that point, the system runs continuously:

  • Every 15 minutes, targeted news scanning updates relevant branches of the causal tree. Each scan is adversarial — actively searching for thesis-breaking evidence, not just confirmation.
  • As prices change, edges recalculate across both venues. Cross-venue spreads are flagged.
  • Weekly, the causal tree evolves. Branches that have been conclusively confirmed or denied are marked. New sub-branches emerge as the situation develops. The thesis sharpens over time rather than decaying.
  • On resolution, the thesis joins the track record. Edges, timing, accuracy — all feed back into future calibration.

The output isn't "buy" or "sell." It's a continuously maintained analytical framework that tells you why a position exists, what would change your mind, and how confident you should be based on your own history.


The Uncomfortable Truth

The reason most people build the simple "data in, trade out" agent is that it feels like automation. The thesis approach feels like work. You have to articulate what you believe and why. You have to think about what would prove you wrong. You have to confront your own track record.

But that's exactly why it works. The thesis isn't overhead — it's the product. It's the thing that makes an AI agent meaningfully different from a coin flip with extra steps.

The prediction market opportunity in 2026 is enormous. Liquid markets, binary outcomes, quantifiable edges — it's arguably the best environment for AI-assisted trading that has ever existed. But capturing that opportunity requires more than a smart model and a data feed. It requires a system that can hold a view, defend it against noise, update it against evidence, and learn from its own mistakes.

That system doesn't start with data. It starts with a thesis.
