insights · Apr 16, 2026 · 8 min read

Compute ROI in Agent Economies: A Framework and Early Data

Most AI systems measure cost per token. We measure dollars of compute per dollars of information discovered. Prediction markets make this possible.

Patrick Liu
#compute-roi #framework #prediction-markets #ai-infrastructure #alpha


The dominant metrics for evaluating AI systems measure inputs: cost per token, latency per request, benchmark scores on standardized tasks. These are useful for comparing models, but they don't answer the question that matters for agent economies: what did the compute produce?

Most agent systems can't answer this question because their outputs lack market-priced value. A coding agent produces code — valuable, but not denominated in dollars until it ships and generates revenue, a feedback loop too slow and indirect to attribute value to specific compute calls. A research agent produces summaries — useful, but "usefulness" is subjective and unpriced.

Prediction markets create a rare setting where this measurement becomes possible. An AI system that identifies a 5¢ edge on a prediction market contract has produced information whose expected value is precisely measurable: 5¢ per contract held, scaled by position size, with a known probability of realization based on the market's resolution. The cost of the compute that found that edge is also precisely known. The ratio is a real ROI — not a proxy, not an estimate, but a direct measurement of dollars-of-compute to dollars-of-information.

This article introduces a framework for thinking about compute ROI in systems where this measurement is possible, and presents early data from a production system.


The Framework

1. Value Anchors

Most AI applications measure input-side metrics: cost per million tokens, requests per second, task completion rate. These tell you how efficiently you're consuming compute, but not how much value the compute created.

A value anchor is a mechanism that prices the output of AI compute in dollar terms. In quantitative trading, the value anchor is P&L. In prediction markets, it's edge — the difference between a model's assessed probability and the market price, denominated in cents. If your system identifies a 7¢ edge on a contract and you buy 100 shares, the discovery has a maximum payoff of $7 and an expected value proportional to your assessment's accuracy. The cost of producing it is some known amount. The ratio is meaningful.
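The arithmetic above is simple enough to sketch directly. This is a minimal illustration, not code from the production system; the helper name `edge_roi` and all numbers are hypothetical.

```python
# Sketch of the edge arithmetic for a binary contract paying $1 on YES.
# All inputs are illustrative.

def edge_roi(model_prob: float, market_price: float,
             shares: int, compute_cost: float):
    """Value of a discovered edge on a binary $1 contract.

    Returns (maximum dollar value of the discovery, dollars of
    information per dollar of compute). The expected value is this
    maximum scaled by how accurate the model's probability turns out
    to be.
    """
    edge = model_prob - market_price        # per-contract edge, in dollars
    info_value = abs(edge) * shares         # e.g. 7 cent edge x 100 shares = $7
    return info_value, info_value / compute_cost

value, roi = edge_roi(model_prob=0.62, market_price=0.55,
                      shares=100, compute_cost=0.50)
print(f"${value:.2f} of information, {roi:.1f}x compute ROI")
```

The point of the ratio in the last line is that both numerator and denominator are in the same unit — dollars — which is exactly what input-side metrics like cost per token cannot provide.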

Value anchors are rare. They require domains where information has market-priced consequences and where the feedback loop is tight enough to attribute value to specific compute. Prediction markets, quantitative trading, insurance underwriting, and commodity arbitrage are examples. Most agent applications — customer service, code generation, content creation — don't have natural value anchors, which is why input-side metrics dominate the discourse.

2. Four Measurable Dimensions

Once a value anchor exists, four dimensions of compute efficiency become quantifiable:

Cost per signal. How much compute does it take to produce one meaningful evaluation — defined as a confidence change above a significance threshold? This varies dramatically across model tiers. In our data, the cheapest effective tier produces signals at roughly 1/20th the cost per unit of confidence movement compared to the primary evaluation tier, while detecting signals at a substantially higher rate per call.
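One way this metric could be computed, sketched under assumptions: a flat significance threshold on confidence change, and per-evaluation cost records. The threshold value and data shape are hypothetical.

```python
# Hypothetical per-tier accounting for "cost per signal". A signal is an
# evaluation whose confidence change clears a significance threshold.

SIGNAL_THRESHOLD = 0.05  # assumed threshold, in confidence units

def cost_per_signal(evaluations):
    """evaluations: list of (cost_usd, confidence_delta) for one model tier."""
    total_cost = sum(cost for cost, _ in evaluations)
    signals = [d for _, d in evaluations if abs(d) >= SIGNAL_THRESHOLD]
    if not signals:
        return float("inf")    # tier produced no signal at all
    return total_cost / len(signals)

cheap_tier   = [(0.002, 0.08), (0.002, 0.00), (0.002, 0.11), (0.002, 0.01)]
primary_tier = [(0.040, 0.09), (0.040, 0.02), (0.040, 0.01), (0.040, 0.03)]
print(cost_per_signal(cheap_tier), cost_per_signal(primary_tier))
```

In this toy data the cheap tier lands a signal for $0.004 and the primary tier for $0.16 — a 40x gap, in the same direction as (though not the same magnitude as) the tier gap reported above.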

Marginal returns by frequency. How does per-evaluation information yield change as you increase evaluation frequency on the same question? We observe clear diminishing returns: the per-evaluation signal contribution drops substantially between low-frequency and high-frequency evaluation buckets. Total information still increases with frequency, but each additional dollar of compute produces less new information. This is not waste — it's a curve, and the question is where on the curve you should operate.
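The curve can be made concrete with a small sketch: divide each frequency bucket's total signal by its evaluation count to get per-evaluation yield. The bucket values below are illustrative, not the article's data.

```python
# Sketch: per-evaluation information yield vs. evaluation frequency.
# Keys are evaluations per day; values are total confidence moved.

def marginal_yield(total_signal_by_freq):
    """Map {evals_per_day: total_signal} -> {evals_per_day: per-eval yield}."""
    return {f: total / f for f, total in total_signal_by_freq.items()}

observed = {1: 0.10, 4: 0.24, 12: 0.36}  # total signal rises with frequency...
print(marginal_yield(observed))           # ...but per-evaluation yield falls
```

In the toy numbers, going from 1 to 12 evaluations per day more than triples total signal while cutting per-evaluation yield from 0.10 to 0.03 — the shape of the curve the paragraph describes.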

Signal decay rate. How long does a discovered edge persist before the market absorbs it? This determines how frequently you need to monitor. If edges disappear in hours, monitoring frequency must be measured in hours or finer. If edges persist for days, daily monitoring may suffice. The distribution of edge lifespans determines the operationally meaningful monitoring frequency.

Cross-signal propagation. When one monitored thesis changes significantly, how often do related theses also change? If correlated theses reliably move together, a single signal detection can trigger re-evaluation of multiple positions — multiplying the value of the initial compute. This correlation structure is not visible in any single thesis's data and must be measured across the portfolio.
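Measuring this correlation structure could look like the following sketch: for each significant move in one thesis, check whether a related thesis also moved within some window. The window length and timestamps are assumptions for illustration.

```python
# Sketch: how often does a significant change in thesis A propagate to a
# significant change in thesis B within a window? Times are in hours.

def comovement_rate(events_a, events_b, window_hours=6.0):
    """Fraction of A's signal events followed by a B event within the window."""
    if not events_a:
        return 0.0
    hits = sum(
        any(0.0 <= tb - ta <= window_hours for tb in events_b)
        for ta in events_a
    )
    return hits / len(events_a)

a = [10.0, 40.0, 70.0]        # hours at which thesis A moved significantly
b = [12.0, 41.5, 200.0]       # thesis B's significant moves
print(comovement_rate(a, b))  # 2 of A's 3 moves propagated within 6h
```

Computed pairwise across the portfolio, rates like this would expose the causal clusters discussed later — which is precisely the structure that no single thesis's data can reveal.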

3. The Asymmetric Judgment Principle

In alpha-finding systems, there is a fundamental asymmetry: the cost of under-evaluating (missing an edge that appears and disappears while you weren't looking) is categorically different from the cost of over-evaluating (spending compute on an evaluation that finds nothing new).

Over-evaluation wastes compute. The money is gone, but the loss is bounded and the mistake is correctable — you can always reduce frequency tomorrow.

Under-evaluation misses alpha. The edge that appeared and disappeared during the gap is gone forever. You can't recover it by evaluating more aggressively tomorrow.

This asymmetry has a practical consequence: any framework for compute allocation should be biased toward spending more, not less. The threshold for skipping an evaluation should be high (multiple confirming signals that nothing is changing). The threshold for escalating an evaluation should be low (any single signal that something might be changing). This is the opposite of how most cost-optimization frameworks work. For alpha-discovery, it is the orientation that matches the underlying risk asymmetry.
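The asymmetric thresholds above can be sketched as a tiny decision rule. The signal names and the skip threshold are hypothetical; the point is only the asymmetry — one change signal escalates, while skipping requires several confirming quiet signals.

```python
# Sketch of the asymmetric judgment rule: a low bar to escalate,
# a high bar to skip, and "evaluate" (spend the compute) as the default.

def decide(quiet_signals: int, change_signals: int,
           skip_threshold: int = 3) -> str:
    if change_signals >= 1:              # any hint of change: escalate
        return "escalate"
    if quiet_signals >= skip_threshold:  # multiple confirmations of calm: skip
        return "skip"
    return "evaluate"                    # default bias: spend, don't save

print(decide(quiet_signals=1, change_signals=0))  # evaluate
print(decide(quiet_signals=3, change_signals=0))  # skip
print(decide(quiet_signals=3, change_signals=1))  # escalate
```

Note that a cost-minimizing framework would invert this: evaluate only on strong evidence of change, skip by default. The rule here defaults to spending because missed alpha is unrecoverable while wasted compute is merely bounded.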


Early Data

The following observations come from 30 days of production data in SimpleFunctions, a system that uses tiered LLM models to evaluate prediction market theses on a continuous basis. The system architecture is described in Portfolio Autopilot.

Total system cost: approximately $400/month. This covers all LLM inference and external data costs for continuous monitoring of active theses across multiple prediction market venues. The single largest cost center is the primary evaluation model tier, which accounts for roughly 70% of spend.

Model tiers differ dramatically in cost-efficiency. We operate four model tiers spanning a wide cost range. When measured by cost per unit of confidence movement — a proxy for information production efficiency — the cheapest effective tier is more than an order of magnitude more efficient than the primary high-volume tier. But the expensive tier produces far more total signal. This is the marginal returns curve in action: the cheap tier is efficient but shallow; the expensive tier is thorough but redundant on quiet days.

Edge lifespans concentrate in a specific band. Across roughly 1,000 unique edge observations over the observation window, the distribution is unimodal — not bimodal as we had initially expected. The majority of edges survive between 12 hours and 7 days. Short-lived edges (under 4 hours) exist but are a minority. Very long-lived edges (over a week) are also a minority, and tend to be slightly larger in magnitude. This distribution has direct implications for monitoring frequency — it suggests that evaluation cycles measured in hours, not minutes, would capture most edges.
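The implication for monitoring frequency can be sketched as a capture-rate calculation over a lifespan sample. The lifespans below are illustrative stand-ins loosely shaped like the distribution described above, and the capture criterion is a deliberate worst-case simplification: an edge is guaranteed to be seen only if it outlives one full polling interval.

```python
# Sketch: what fraction of edges would a given polling cadence catch at
# least once? Lifespans (hours) are illustrative, not measured data.

def capture_rate(lifespans_hours, interval_hours):
    """Worst case: an edge is captured only if it outlives the interval."""
    caught = sum(1 for life in lifespans_hours if life >= interval_hours)
    return caught / len(lifespans_hours)

lifespans = [2, 3, 18, 24, 36, 48, 72, 96, 120, 200]  # hours
for interval in (1, 6, 24):
    print(f"{interval:>2}h polling -> {capture_rate(lifespans, interval):.0%}")
```

With this toy sample, hourly polling catches everything, 6-hour polling still catches 80%, and daily polling drops to 70% — consistent with the claim that cycles measured in hours, not minutes, capture most of the distribution.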

Theses share causal structure. Active theses are not independent. Theses cluster around shared causal drivers — certain themes (regional risk, macroeconomic regime, policy uncertainty) appear in the causal trees of multiple active theses simultaneously. When one thesis experiences a significant confidence change, related theses often follow. This correlation is thematic (shared causal drivers), not positional (they track different markets). It means a signal detected in one thesis is informative for others in the same causal cluster — and a system that doesn't propagate signals across correlated theses is leaving value on the table.


What These Numbers Don't Tell You

Several limitations must be stated clearly.

This data comes from one system operating in specific prediction markets. SimpleFunctions monitors theses on Kalshi and Polymarket. The market microstructure, liquidity characteristics, and participant behavior on these venues are specific. The numbers above should not be generalized to other agent systems, other markets, or other domains without independent validation.

"Confidence change" is an intermediate metric, not P&L. Our system deliberately decouples thesis evaluation from position management — thesis confidence informs but does not mechanically determine trades. This means we cannot directly validate each thesis's compute ROI against realized P&L. We use confidence change as a proxy for information production, which is informative but not definitive. This decoupling is intentional. Coupling thesis evaluation to position management would force every thesis to be tradeable in the operator's portfolio — but theses are independent informational outputs that can serve other use cases (research, signal licensing, third-party agents). Decoupling preserves their independence.

Thirty days is a short observation window. Some of our findings — particularly the edge lifespan distribution and the cross-thesis correlation patterns — may look different over longer periods or under different market regimes. We present them as initial observations, not settled facts.

Models are iterating rapidly. The cost-efficiency comparisons between model tiers reflect the models available today. A new model release could change these ratios significantly within months. The framework's value is in the measurement structure — what to measure and why — not in the specific numbers at any given snapshot.


Implications

To the extent that autonomous agents in economic environments multiply in coming years — a trajectory current systems suggest but do not guarantee — each will need some form of compute ROI measurement. Not "how fast is my model" or "how cheap is my inference," but "how much value did this dollar of compute create?"

Most agent deployments today can't answer this question. The value anchors don't exist, the feedback loops are too slow, or the outputs are too diffuse to price. Prediction markets happen to provide unusually clean conditions for this measurement — bounded prices, binary outcomes, transparent markets, and tight feedback loops.

What we've described here is an early attempt at building the measurement layer. The specific numbers will change. The framework — value anchors, marginal returns curves, signal decay, cross-signal propagation, asymmetric judgment — is what we think will persist.