SimpleFunctions Research · preliminary note

Feature-Based Prediction-Market Forecasting: Preliminary Observations

A gradient-boosted baseline on 11 days of SimpleFunctions microstructure data

Patrick Liu · SimpleFunctions · patrick@simplefunctions.dev · April 2026

Abstract

We release sf-ml-baseline v0.1, the first publicly documented feature-based forecasting baseline for prediction markets. On 11 days of SimpleFunctions microstructure data (1.76M labelled rows for 24h direction, 14K for resolution) and five engineered features — mid price, 1h delta, implied yield, cancel-replace intensity, and cancel-versus-volume ratio — a three-seed LightGBM ensemble achieves Brier 0.2294 on a held-out 246K-row test set for the 24h direction task, compared to 0.2500 for a coinflip (CI non-overlap, improvement −0.0206). An XGBoost/CatBoost bake-off converges to the same Brier (classical saturation on a five-feature corpus). On the resolution task, per-category LightGBM models beat the price/100 baseline by 0.035–0.041 Brier on Crypto and Commodities. We frame this as a starting point for microstructure-based forecasting rather than a competitor to LLM+RAG systems such as Halawi et al. (2024) or AIA (2025); subsequent versions will be ensembled with those approaches, not substituted for them. Weights, code, and training scripts are released under CC-BY-4.0 with a SimpleFunctions attribution addendum.



1. Introduction

Forecasting prediction-market outcomes from historical data is well-studied on the crowd-psychology side — Tetlock (2005), Manski (2006), Wolfers & Zitzewitz (2004) — and well-studied on the language-model side — Halawi et al. (2024) and AIA (2025) both train LLMs on Metaculus-style long-horizon questions. What is not documented in the open literature is a feature-based baseline that uses the microstructure of prediction markets: order flow, cancel-replace dynamics, cross-venue gaps, and derived yield and regime features.

SimpleFunctions (SF) runs a real-time pipeline that computes 27 indicator columns per ticker on every Kalshi and Polymarket market. This note documents a preliminary attempt to use a small subset of those indicators as features in gradient-boosted trees. The framing is deliberately narrow. We are not trying to beat LLM+RAG systems that ingest news. We are trying to answer a simpler question: does the SF microstructure feature set contain predictive signal, and how much?

The short answer is yes, the signal exists and is statistically significant, but the ceiling on a five-feature classical-tree stack appears to be very close to what our v0.1 bake-off reached. We release the weights, document the gap, and lay out a six-phase roadmap for what would plausibly close it.

2. Data and Features

Training window. SF's market_indicator_history table captures a snapshot row per market every few minutes. At the time of this note the table spans eleven calendar days (2026-04-08 to 2026-04-19), 6.79M rows, and 1.87 GB. The continuous forward-only R2 data dump that will extend this archive started on 2026-04-18; retrains against 30-day, 90-day, and 180-day windows are scheduled below.

Features (V1). We use five features in v0.1, chosen because they are (a) present on every snapshot row, (b) numerically clean after null handling, and (c) interpretable:

  • price_cents — mid-market price in cents.
  • delta_cents — signed one-hour price change.
  • iy (implied yield) — annualised yield on a hold-to-resolution trade at the current price.
  • cri (cancel-replace intensity) — per-minute count of cancel-replace pairs on the book.
  • cvr (cancel-versus-volume ratio) — ratio of cancelled volume to traded volume, windowed over the last hour.
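As a concrete illustration, the five v0.1 features could be derived from raw snapshot rows along the following lines. This is a sketch only: the input column names (`mid_cents`, `mid_cents_1h_ago`, `cancel_replace_1m`, `cancelled_vol_1h`, `traded_vol_1h`, `days_to_resolution`) and the simple annualisation convention for `iy` are our assumptions, not the actual SF schema.

```python
import pandas as pd

def build_v1_features(snap: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the five v0.1 features from per-market snapshot rows.

    Assumed input columns (hypothetical, not the real SF schema):
      mid_cents, mid_cents_1h_ago, cancel_replace_1m,
      cancelled_vol_1h, traded_vol_1h, days_to_resolution
    """
    out = pd.DataFrame(index=snap.index)
    out["price_cents"] = snap["mid_cents"]
    out["delta_cents"] = snap["mid_cents"] - snap["mid_cents_1h_ago"]
    # iy: one simple annualisation convention -- return of buying at the
    # current price and holding a winning YES contract to 100 cents.
    # The SF definition may differ.
    out["iy"] = (100.0 / snap["mid_cents"] - 1.0) * (365.0 / snap["days_to_resolution"])
    out["cri"] = snap["cancel_replace_1m"]
    # Guard against zero traded volume in the 1h window.
    out["cvr"] = snap["cancelled_vol_1h"] / snap["traded_vol_1h"].clip(lower=1e-9)
    return out
```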

The market_indicators live-state table carries 27 columns; we trained on only 5 because that is the intersection with the historical archive. The full 20-feature expansion is part of the v0.3 roadmap.

Labels. We evaluate two targets:

  • T1 (direction, 24h): $\mathrm{sign}(\mathrm{price}(t+24\mathrm{h}) - \mathrm{price}(t))$, binary. Available on every row that has a matching $t+24$h snapshot: 1.76M labelled rows, 246K in the held-out test fold.
  • T4 (resolution): resolved_outcome from the marketwide_resolutions table, binary. Available on 1.09M settled markets, of which the feature join yields 14K labelled rows with the full 5-feature snapshot at $t_0 - 24$h.
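The T1 label join (attaching a 24h-forward snapshot to each row) can be sketched with pandas merge_asof. The column names, per-ticker matching, and the 10-minute tolerance are illustrative assumptions, not the actual SF pipeline:

```python
import pandas as pd

def label_t1(snaps: pd.DataFrame, tol=pd.Timedelta(minutes=10)) -> pd.DataFrame:
    """Attach a 24h-forward price to each snapshot and derive the binary
    direction label. Rows with no snapshot within `tol` of t+24h are dropped.

    Assumed columns: ticker, ts (timestamp), price_cents.
    """
    s = snaps.sort_values("ts")
    future = s[["ticker", "ts", "price_cents"]].rename(
        columns={"ts": "ts_fwd", "price_cents": "price_fwd"})
    s = s.assign(ts_target=s["ts"] + pd.Timedelta(hours=24))
    # Nearest-snapshot join per ticker, bounded by the tolerance.
    lab = pd.merge_asof(
        s.sort_values("ts_target"), future.sort_values("ts_fwd"),
        left_on="ts_target", right_on="ts_fwd",
        by="ticker", direction="nearest", tolerance=tol)
    lab = lab.dropna(subset=["price_fwd"])
    lab["t1_up"] = (lab["price_fwd"] > lab["price_cents"]).astype(int)
    return lab
```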

We also considered T3 (price at $t+24$h, regression) but defer it to v0.3: only 15.4% of marketwide_resolutions rows carry the predicted_price_t24h column (167K of 1.09M), because the column was added recently and is not backfilled. Training T3 on 15.4% coverage would leak the very selection bias we are trying to measure.

Splits. We use a temporal split: the first 80% of timestamps form the training set, the last 20% the test set, and a 24-hour embargo is enforced between them. The embargo is a standard device from the Jane Street time-series competition (2022) and prevents label leakage through overlapping resolution windows.
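A minimal sketch of the embargoed temporal split described above; the timestamp column name is an assumption:

```python
import pandas as pd

def temporal_split_with_embargo(df: pd.DataFrame, ts_col: str = "ts",
                                train_frac: float = 0.8,
                                embargo: pd.Timedelta = pd.Timedelta(hours=24)):
    """Split rows by time: the first `train_frac` of timestamps train, the
    rest test, with an embargo gap after the cutoff so that 24h labels
    computed on late training rows cannot overlap the test window."""
    ts = df[ts_col]
    cutoff = ts.quantile(train_frac)
    train = df[ts <= cutoff]
    test = df[ts > cutoff + embargo]   # rows inside the embargo are dropped
    return train, test
```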

3. Method

Models. We trained three families in a bake-off:

  1. LightGBM (Ke et al. 2017), 3-seed ensemble (seeds 42, 137, 2026), each 500 rounds with early stopping on a 10% validation slice.
  2. XGBoost (Chen & Guestrin 2016), same 3-seed ensemble, same hyperparameter grid.
  3. CatBoost (Prokhorenkova et al. 2018), same 3-seed ensemble.

Hyperparameter search. We used a fixed shallow grid (depth $\in \{4, 6, 8\}$, learning rate $\in \{0.03, 0.05, 0.1\}$, L2 regularisation $\in \{0.1, 1.0\}$) because the training corpus is small relative to typical boosting literature and we observed no meaningful improvement beyond these defaults.

Metric. Brier score (Brier 1950) is the primary metric:

$$\mathrm{Brier} = \frac{1}{N}\sum_{i=1}^{N} (p_i - y_i)^2$$

where $p_i \in [0,1]$ is the predicted probability and $y_i \in \{0,1\}$ is the realised outcome. Brier is a proper scoring rule (Gneiting & Raftery 2007) and in our domain has the advantage that the three candidate baselines (coinflip, momentum, price/100) all evaluate on the same scale without a threshold choice. We report 95% bootstrap confidence intervals (1000 resamples) for every Brier number and mark results as significant only when the CI does not overlap the baseline's CI.
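The Brier score and the bootstrap CI procedure can be sketched in a few lines of numpy; the percentile method for the interval is our assumption about how the CIs were computed:

```python
import numpy as np

def brier(p: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return float(np.mean((p - y) ** 2))

def brier_ci(p, y, n_boot=1000, alpha=0.05, seed=0):
    """95% percentile bootstrap CI over rows of the test set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = [brier(p[idx], y[idx])
              for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Note that a constant 0.5 prediction scores exactly 0.25 on any binary label vector, which is the coinflip baseline used for T1.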

Baselines. For T1 the headline baseline is a coinflip (constant 0.5). For T4 the headline baseline is price divided by 100 at $t_0$, which is the market's own probability estimate on a YES contract that resolves in 24 hours.
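The price/100 baseline for T4 amounts to scoring the market's own quote as a probability:

```python
import numpy as np

def price_baseline_brier(price_cents: np.ndarray, resolved: np.ndarray) -> float:
    """Brier of using the t0 mid price (in cents) directly as P(YES)."""
    p = price_cents / 100.0
    return float(np.mean((p - resolved) ** 2))
```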

4. Results

| Task | Brier (model) | Brier (baseline) | Δ | CI non-overlap? |
|---|---|---|---|---|
| V1 × T1 (direction, $N_\text{test}$ = 246K) | 0.2294 | 0.2500 (coinflip) | −0.0206 | yes |
| V2 × T4 (resolution, $N_\text{test}$ = 2.2K) | 0.1681 | 0.1767 (price/100) | −0.0086 | overlap |
| T4 Crypto (per-category) | 0.0793 | 0.1207 | −0.0414 | n/a |
| T4 Commodities (per-category) | 0.0791 | 0.1145 | −0.0355 | n/a |

Three observations.

1. The signal exists on T1, but it is modest. On the direction task the 0.0206 Brier gap is statistically significant (CI non-overlap) and clears the $\geq 0.01$ improvement threshold we pre-registered before training. A sceptical reader should interpret this as "the five-feature corpus carries measurable signal" rather than "the corpus is nearly complete." Uniformly predicting 0.5 scores 0.2500; a model that perfectly discriminated at the margin of a typical prediction market would score closer to 0.18–0.20.

2. Classical models saturate at the same Brier. LightGBM, XGBoost, and CatBoost all converge to Brier within ±0.0005 of each other on T1. This is the predictable behaviour of gradient-boosted trees on a small, highly correlated feature set: once one family has extracted the signal, the others cannot extract more from the same features. We had expected Phase B (the bake-off) to add 0.003–0.005 over Phase A; in practice it added nothing. The implication is that the v0.1 corpus is feature-bound, not model-bound.

3. Per-category resolution models beat the market on liquid verticals. On T4 the aggregate improvement over the price/100 baseline is not significant, but once we slice by category, Crypto and Commodities both show large and consistent gains (0.0414 and 0.0355 Brier respectively). This matches the intuition that markets with continuous underliers (BTC, oil, gold) leave more microstructure-recoverable alpha on the table than idiosyncratic binary markets (single legislation, single court rulings). Released weights include per-category LightGBM heads.

Calibration. Reliability diagrams (not shown; see the companion HuggingFace release) confirm the T1 ensemble is well-calibrated in the mid-range (0.35 to 0.65) and slightly overconfident at the tails, consistent with typical LightGBM behaviour. Temperature scaling (Guo et al. 2017) is included in the roadmap for v0.2.
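Temperature scaling fits a single scalar $T$ on validation data to rescale logits before the sigmoid ($T > 1$ softens overconfident tails). A minimal grid-search sketch; the grid bounds and step are arbitrary choices, and production code would use a proper 1-D optimiser:

```python
import numpy as np

def _nll(p, y, eps=1e-12):
    """Binary negative log-likelihood."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_temperature(p_val, y_val, grid=np.linspace(0.25, 4.0, 376)):
    """Pick T minimising validation NLL of sigmoid(logit(p) / T)."""
    logits = np.log(p_val / (1 - p_val))
    best = min(grid, key=lambda T: _nll(1 / (1 + np.exp(-logits / T)), y_val))
    return float(best)

def apply_temperature(p, T):
    logits = np.log(p / (1 - p))
    return 1 / (1 + np.exp(-logits / T))
```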

5. Related Work

LLM-based forecasting. Halawi et al. (2024) trained GPT-4 on Metaculus questions and achieved Brier 0.179 against a crowd baseline of 0.149 (the crowd won). AIA (2025) extended this approach; on liquid Polymarket markets the system also fails to beat the market. Our work is complementary, not competitive: our release operates on microstructure (price dynamics, order flow), theirs operates on news text and reasoning. The obvious next step is an ensemble, which we flag as future work.

Tabular gradient boosting. Shwartz-Ziv & Armon (2022) showed that gradient-boosted trees remain competitive with tabular deep learning on many tasks, but Gorishniy et al. (2021) and Hollmann et al. (2023) documented exceptions. We attempted TabPFN v2 in the Phase B bake-off but were blocked by the token-auth requirement at ux.priorlabs.ai; deferred to v0.2.

Crowd-aggregation baselines. Tetlock (2005), Manski (2006), and Wolfers & Zitzewitz (2004) established that prediction-market prices are well-calibrated aggregates of participant judgement. Our baseline is therefore not naive: "price/100" is already the weighted consensus of real traders with money at risk. Beating it on any category is non-trivial.

6. Limitations

1. Eleven days is not six months. Our initial assumption was that six months of history would be available; investigation revealed 11 days. We therefore have one temporal generalisation test (the last two days versus the first nine) rather than a robust seasonal evaluation. Retrains at 30-day, 90-day, and 180-day windows are scheduled below.

2. Five features is not twenty. The live indicator table has 27 columns; the historical archive has 5. Until the schema expansion lands (v0.3), we cannot evaluate whether the remaining 15+ indicators (adjusted implied yield, cancel-versus-volume delta, residual volatility, cross-venue gap, etc.) add signal.

3. No cross-venue features. Kalshi–Polymarket price gaps for the same event (where they exist) are one of the largest observed signals in SF's internal analytics. V0.1 does not incorporate them; v0.3 will.

4. Selection bias on T4. The 14K-row T4 corpus is only a 1.3% slice of marketwide_resolutions, driven by feature-snapshot availability at $t_0 - 24$h. It is plausible that categories with more continuous trading (where 24h-prior snapshots exist) are exactly the categories where microstructure is most informative, inflating apparent per-category gains.

5. No live deployment. The weights are released as a static artefact. We have not yet deployed them as a real-time forecast endpoint. A production /api/agent/forecast is part of the v1.0 roadmap.

7. Roadmap

We plan six phases, each with a pre-registered Brier improvement gate:

| Phase | Description | Compute | Gate | If fail |
|---|---|---|---|---|
| A | LightGBM baseline | laptop | ≥ 0.01 over coinflip | stop; document |
| B | XGBoost + CatBoost + TabPFN | laptop | ≥ 0.003 over A | ship A |
| C | FT-Transformer, SAINT, TabNet | Kaggle T4 | ≥ 0.005 over B | ship B |
| D | Chronos-Bolt, TimesFM, Moirai (zero/few-shot) | Kaggle T4 | ≥ 0.005 over C | negative result |
| E | Qwen3-4B QLoRA on tabular-as-text | Kaggle T4 | within 0.005 of best | no LLM path |
| F | KV-injection paper | rented A100/H100 | significant > RAG baseline | publish negative |

Phases A and B shipped on 2026-04-19 with the numbers in this note. The next concrete step is Phase D (Chronos-Bolt-base zero-shot), which is cheap to run and has a plausible near-term upside because time-series foundation models have been trained on exactly the kind of price series that constitute our corpus.

Retrain schedule. v0.2 is scheduled for 2026-05-20 (30 days of R2 archive); v0.3 when the market_indicator_history schema expansion lands; v1.0 around 2026-10 with six months of data plus whichever of Phases C–E have cleared their gates.

8. Conclusion

We do not claim sf-ml-baseline v0.1 is a strong model. We claim three things only:

  1. The SF microstructure feature set contains measurable predictive signal for 24-hour direction on prediction markets (Brier improvement −0.0206, statistically significant).
  2. Gradient-boosted trees saturate on this feature set at ~0.2294 Brier; further gains from classical models alone are unlikely.
  3. Per-category resolution models beat the market price on Crypto and Commodities by 0.035–0.041 Brier.

The correct way to use this release is as one voice in an ensemble with news/LLM approaches, not as a replacement for them. We ship the weights, the training code, and the retrain plan so that other researchers and operators can verify the numbers and extend the roadmap.

References

  • AIA (2025). Multi-agent systems for long-horizon prediction-market forecasting. Technical report.
  • Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3.
  • Chen, T. & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. KDD.
  • Gneiting, T. & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. JASA, 102(477), 359–378.
  • Gorishniy, Y. et al. (2021). Revisiting deep learning models for tabular data. NeurIPS.
  • Guo, C. et al. (2017). On calibration of modern neural networks. ICML.
  • Halawi, D. et al. (2024). Approaching human-level forecasting with language models. arXiv:2402.18563.
  • Hollmann, N. et al. (2023). TabPFN: A transformer that solves small tabular classification problems in a second. ICLR.
  • Jane Street (2022). Jane Street Market Prediction competition description. Kaggle.
  • Ke, G. et al. (2017). LightGBM: A highly efficient gradient boosting decision tree. NeurIPS.
  • Manski, C. F. (2006). Interpreting the predictions of prediction markets. Economics Letters, 91(3), 425–429.
  • Prokhorenkova, L. et al. (2018). CatBoost: Unbiased boosting with categorical features. NeurIPS.
  • Shwartz-Ziv, R. & Armon, A. (2022). Tabular data: Deep learning is not all you need. Information Fusion, 81, 84–90.
  • Tetlock, P. E. (2005). Expert political judgment: How good is it? How can we know? Princeton University Press.
  • Wolfers, J. & Zitzewitz, E. (2004). Prediction markets. Journal of Economic Perspectives, 18(2), 107–126.

How to Cite

@misc{sf_feature_baseline_prelim_2026,
  author       = {Liu, Patrick},
  title        = {Feature-Based Prediction-Market Forecasting: Preliminary Observations},
  year         = {2026},
  month        = {apr},
  howpublished = {SimpleFunctions Research Note},
  url          = {https://simplefunctions.dev/papers/feature-baseline-preliminary},
  note         = {Companion release: \url{https://huggingface.co/SimpleFunctions/sf-ml-baseline}}
}

Keywords

prediction markets · LightGBM · Brier score · calibration · microstructure · forecasting · gradient boosting
