Machine Learning Research for Polymarket Trading

Executive Summary

Based on analysis of our data (300+ snapshots, 50 unique markets) and current research, prediction markets exhibit significant inefficiencies that can be exploited with ML. The key insight: markets are demonstrably inefficient - academic research shows ~$40M in arbitrage profits extracted from Polymarket in 2024 alone.

Bottom Line: Start simple with classical approaches, add ML incrementally as data grows. Focus on arbitrage detection and mean reversion rather than outcome prediction.

Current Data Limitations

  • Only 6 snapshots per market (collected over ~22 minutes based on timestamps)
  • No resolution data yet (0 resolved markets in database)
  • No price momentum history (insufficient time-series depth)
  • Wide price dispersion (prices ranging from 0.0025 to 0.9965 within a single snapshot)

Reality Check: You need more data before training supervised models on outcomes. But you can trade inefficiencies NOW.

1. ML Patterns That Work for Prediction Markets

A. Arbitrage Detection (IMMEDIATE OPPORTUNITY)

Why it works: Academic research shows Polymarket is structurally inefficient.

Recent 2024-2025 research findings:

  • Nearly $40M extracted via arbitrage in one year
  • Top 3 wallets profited $4.2M from 10,200+ arbitrage trades
  • Median sum of condition prices = $0.60 (should be $1.00)
  • 93% of PredictIt markets had pricing inefficiencies
  • On Polymarket, Harris + Trump contracts summed to ≠ $1 on 62 of 65 days before the 2024 election

What to detect:

  • Single-market arbitrage: P(YES) + P(NO) ≠ 1.00
  • Cross-market arbitrage: related events mispriced (e.g., "Trump wins" vs "Trump wins popular vote")
  • Cross-platform arbitrage: same event, different prices on Kalshi/PredictIt/Polymarket

Implementation:

# Simple rule-based (no ML needed initially)
def detect_arbitrage(yes_price, no_price):
    """Flag a market whose YES + NO prices deviate from $1.00 by more than 2%."""
    total = yes_price + no_price
    if total < 0.98 or total > 1.02:  # 2% deviation threshold
        return True, total
    return False, total

# ML approach (as data grows)
# Use isolation forests or autoencoders to detect anomalous pricing
from sklearn.ensemble import IsolationForest

model = IsolationForest(contamination=0.05)  # expect ~5% of snapshots to be mispriced
# Fit on one row per market snapshot with columns such as:
# [yes_price, no_price, volume_24h, liquidity, time_to_expiry]

Your stat_arb strategy already does this - it's tracking spread z-scores between correlated markets. Keep it.

B. Mean Reversion Models

Why it works: Prediction markets overreact to news, then correct.

Features that matter (computed in the sketch below):

  • Price velocity (dp/dt over the last N snapshots)
  • Volume spikes (24h volume / average volume)
  • Time to expiration (markets stabilize near resolution)
  • Spread from consensus (how far from 50%)
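
A minimal pandas sketch of these features; the DataFrame and column names (timestamp, yes_price, volume_24h, end_date) are assumptions about how snapshots are stored, not the existing schema:

import pandas as pd

def mean_reversion_features(snapshots: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Per-snapshot features for one market, sorted by timestamp (column names illustrative)."""
    feats = pd.DataFrame(index=snapshots.index)
    # Price velocity: price change over the last `window` snapshots
    feats["price_velocity"] = snapshots["yes_price"].diff(window)
    # Volume spike: current 24h volume relative to its rolling average
    feats["volume_spike"] = snapshots["volume_24h"] / snapshots["volume_24h"].rolling(window).mean()
    # Time to expiration in hours
    feats["hours_to_expiry"] = (
        pd.to_datetime(snapshots["end_date"]) - pd.to_datetime(snapshots["timestamp"])
    ).dt.total_seconds() / 3600
    # Spread from consensus: distance from the 50% coin-flip price
    feats["spread_from_consensus"] = (snapshots["yes_price"] - 0.5).abs()
    return feats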

Recommended approach (300 samples is LIMITED):

# Classical time-series > ML for small samples
from statsmodels.tsa.arima.model import ARIMA

# Once you have 100+ snapshots per market:
# 1. Fit ARIMA(1,0,1) or exponential smoothing
# 2. Predict reversion to mean
# 3. Trade when z-score > 2.0 (like your stat_arb strategy)

# When data > 1000 snapshots, upgrade to:
from sklearn.ensemble import GradientBoostingRegressor
# Train on: price_change ~ volume_spike + time_to_expiry + momentum

C. Sentiment-Driven Price Prediction (FUTURE)

Why it could work: News moves markets, ML can extract signals.

Not viable yet because:

  • Need labeled training data (resolved markets with outcomes)
  • Need to collect external data (Twitter sentiment, news, etc.)
  • Small sample size (50 markets is tiny)

When viable (6+ months of data):

  • Use LSTMs to model price trajectories
  • Incorporate sentiment scores from news/social media
  • Train a classifier: P(YES wins | price_history, sentiment, time_to_expiry)

Warning from research: 2025 studies show LSTM/DNN predictors create "false positives" if temporal context is ignored. Don't use LSTMs until you have 1000+ sequential observations per market.

D. Market Microstructure Patterns

Features to engineer NOW (even with limited data):

Feature | Why It Matters | Implementation
Bid-Ask Spread | Liquidity proxy, slippage risk | best_ask - best_bid from orderbook
Depth Imbalance | Buy/sell pressure | bid_depth / (bid_depth + ask_depth)
Volume Velocity | Momentum indicator | volume_24h_current / volume_24h_avg
Price Impact | How much the price moves per $1k traded | Track in live trading
Time Decay | Markets converge near expiry | days_until_expiry
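
A minimal sketch of the first two rows of the table, assuming an orderbook snapshot shaped like {"bids": [(price, size), ...], "asks": [(price, size), ...]} with best prices first (the shape is an assumption, not the API's actual format):

def orderbook_features(orderbook):
    """Bid-ask spread and depth imbalance from a single orderbook snapshot."""
    best_bid = orderbook["bids"][0][0]
    best_ask = orderbook["asks"][0][0]
    bid_depth = sum(size for _, size in orderbook["bids"])
    ask_depth = sum(size for _, size in orderbook["asks"])
    return {
        "spread": best_ask - best_bid,
        "depth_imbalance": bid_depth / (bid_depth + ask_depth),
    }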

2. Most Predictive Features

Based on research and market structure:

Tier 1 (Use Immediately)

  1. Arbitrage signals: P(YES) + P(NO) deviation from 1.0
  2. Spread z-scores: Current spread vs historical mean (your stat_arb strategy)
  3. Volume anomalies: 24h volume spikes (>2σ from mean)
  4. Time to expiration: Markets stabilize <48hrs before resolution

Tier 2 (Need More Data - 100+ snapshots)

  1. Price momentum: Rolling 5-period return
  2. Mean reversion indicators: Distance from moving average
  3. Liquidity shifts: Change in bid/ask depth
  4. Cross-market correlations: Implied relationships between related events

Tier 3 (Need 1000+ snapshots + external data)

  1. Sentiment scores: News/Twitter sentiment
  2. Order flow toxicity: Informed vs uninformed trading
  3. Market maker behavior: Spread widening patterns
  4. Macro correlations: BTC price, stock market, etc.

3. Model Roadmap by Data Stage

NOW (300 snapshots, no resolutions)

1. Rule-Based Arbitrage Bot

  • Detect P(YES) + P(NO) ≠ 1.0
  • Trade when |sum - 1.0| > threshold (see the worked example below)
  • No ML needed, pure logic
  • Expected edge: 2-5% per opportunity (based on academic research)
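
A worked example of where the 2-5% edge comes from (prices are illustrative): buying one YES and one NO share locks in the gap, because exactly one side pays out $1 at resolution.

yes_price, no_price = 0.46, 0.50      # illustrative quotes
cost = yes_price + no_price           # $0.96 to hold both sides
payout = 1.00                         # exactly one side resolves to $1
gross_edge = (payout - cost) / cost   # ~4.2% before slippage and fees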

2. Statistical Arbitrage (Your Current Approach)

  • Z-score based mean reversion (see the sketch below)
  • Track the spread between correlated markets
  • Exit when the spread normalizes
  • Keep this - it's the right approach for your data constraints
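
A minimal sketch of the spread z-score logic, assuming prices_a and prices_b are timestamp-aligned pandas Series of YES prices for two correlated markets (names are illustrative, not your stat_arb internals):

import pandas as pd

def spread_zscore(prices_a: pd.Series, prices_b: pd.Series, window: int = 30) -> pd.Series:
    """Z-score of the price spread relative to its rolling mean and standard deviation."""
    spread = prices_a - prices_b
    mean = spread.rolling(window).mean()
    std = spread.rolling(window).std()
    return (spread - mean) / std

# Enter when |z| > 2.0, exit when the spread normalizes (|z| falls back toward 0)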

3. Isolation Forest for Anomaly Detection

from sklearn.ensemble import IsolationForest

# Detect mispriced markets (market_data: one row per market snapshot with these columns)
features = ['yes_price', 'no_price', 'volume_24h', 'liquidity', 'spread']
clf = IsolationForest(contamination=0.1)  # Flag ~10% of snapshots as anomalies
anomalies = clf.fit_predict(market_data[features])  # -1 = anomaly, 1 = normal

# Trade the anomalies

SOON (1000+ snapshots, 50+ resolutions)

4. Gradient Boosted Trees (XGBoost/LightGBM)

  • Classification: will YES win? (binary outcome)
  • Regression: what will the price be in 1 hour?
  • Features: price history, volume, time decay, correlations

import lightgbm as lgb

# Target: Price change over next hour
# Features: last_N_prices, volume_24h, time_to_expiry, etc.
model = lgb.LGBMRegressor(n_estimators=100, max_depth=5)
model.fit(X_train, y_train)

# Predict price movement
predicted_change = model.predict(X_test)

Why GBMs over deep learning?

  • Work with small datasets (hundreds of samples)
  • Interpretable (SHAP values show feature importance; see the example below)
  • Fast to train
  • Research shows they outperform NNs on tabular data
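
To illustrate the interpretability point, the LightGBM model fitted above exposes split-based feature importances directly; this sketch assumes X_train is a pandas DataFrame, and the shap lines are optional:

import pandas as pd

# Split-based importance from the LightGBM model trained above
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))

# Finer-grained, per-prediction attribution if the `shap` package is installed:
# import shap
# shap_values = shap.TreeExplainer(model).shap_values(X_test)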

5. ARIMA for Time Series

  • Predict price reversion
  • Model: price_t = φ * price_(t-1) + ε (fitted in the sketch below)
  • Works with 100+ sequential observations
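
A minimal statsmodels sketch of that AR(1) fit, assuming `prices` is a pandas Series of sequential snapshots for one market (the variable name is illustrative):

from statsmodels.tsa.arima.model import ARIMA

# ARIMA(1, 0, 0) is the AR(1) model price_t = c + φ * price_(t-1) + ε
fitted = ARIMA(prices, order=(1, 0, 0)).fit()
phi = fitted.params["ar.L1"]            # |φ| < 1 implies reversion toward the mean
next_price = fitted.forecast(steps=1)   # one-step-ahead forecast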

LATER (5000+ snapshots, 200+ resolutions)

6. LSTM Networks

  • Model price trajectories over time
  • Incorporate sentiment/news embeddings
  • Need MUCH more data to avoid overfitting

Warning: 2025 research shows LSTMs fail without proper temporal validation. Use walk-forward testing, not random train/test splits.
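
A minimal sketch of that temporal validation with scikit-learn's TimeSeriesSplit; X and y are a feature matrix and targets already ordered by snapshot time (names are illustrative):

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit on the past window, evaluate on the window that immediately follows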

7. Reinforcement Learning

  • Agent learns optimal entry/exit timing
  • Reward: realized P&L
  • State: orderbook, price history, positions
  • Needs 10,000+ trades to converge

4. Research on Prediction Market Efficiency

Key Academic Findings (2024-2025)

Markets ARE Inefficient (Good for us):

  • Polymarket showed $40M in arbitrage opportunities in 2024
  • Cross-platform pricing differences persist despite arbitrage
  • "Noise traders" (vibes-based betting) create exploitable patterns
  • Accuracy: only 67% of Polymarket markets beat random chance

But Efficiency Varies (Adaptive Markets Hypothesis):

  • Politics markets: most inefficient (highest arb profits)
  • Sports markets: most arb opportunities, lower profits
  • High-liquidity markets: more efficient (harder to beat)
  • Markets tighten near resolution (less edge in the final 48hrs)

Machine Learning vs Efficient Markets:

  • 2025 study: ML accuracy is inversely correlated with market efficiency
  • In highly efficient markets, ML barely beats a random walk
  • In inefficient markets (like Polymarket), ML can extract edge
  • Key insight: don't try to predict outcomes, exploit inefficiencies

Practical Takeaways

  1. Focus on structural edge (arbitrage, mean reversion) not outcome prediction
  2. Trade inefficient market categories (politics, long-dated events)
  3. Avoid ultra-efficient markets (high volume, near expiration)
  4. Use simple models first (ARIMA, GBMs) before deep learning
  5. Validate with calibration (are 70% confidence bets winning 70%?)
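
A minimal calibration check for point 5, assuming arrays of predicted probabilities and realized 0/1 outcomes from resolved signals (names are illustrative):

import numpy as np

def calibration_table(predicted_prob, outcome, n_bins=10):
    """Bucket forecasts and compare average predicted probability vs realized win rate."""
    predicted_prob = np.asarray(predicted_prob, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bucket = np.digitize(predicted_prob, edges[1:-1])  # bucket index 0 .. n_bins-1
    rows = []
    for b in range(n_bins):
        mask = bucket == b
        if mask.any():
            rows.append({
                "bin": (edges[b], edges[b + 1]),
                "avg_forecast": predicted_prob[mask].mean(),
                "realized_rate": outcome[mask].mean(),
                "n": int(mask.sum()),
            })
    return rows  # well-calibrated: avg_forecast ≈ realized_rate in every bin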

5. Actionable Recommendations

Phase 1: Immediate (Next 2 Weeks)

Goal: Exploit arbitrage with existing data

  1. Enhance Arbitrage Detection

    • Add cross-market checks (related events)
    • Implement Isolation Forest to flag anomalies
    • Alert on P(YES) + P(NO) > 1.02 or < 0.98
  2. Feature Engineering

    • Calculate: volume_velocity, spread_z_score, time_to_expiry_hours
    • Store in the database for ML training later
    • Track bid-ask spread from orderbook data
  3. Paper Trade Aggressively

    • Log all signals and outcomes (see the logging sketch after this list)
    • Build a labeled dataset (did the trade profit? by how much?)
    • This IS your training data
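
A minimal sketch of the signal logging step, so every paper trade becomes a labeled training example later; the signal_log table and its columns are hypothetical, not the existing schema:

import json
import sqlite3
import time

def log_signal(db_path, market_id, strategy, signal, features):
    """Append one signal with its feature snapshot; outcome/pnl are filled in at resolution."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS signal_log (
               ts REAL, market_id TEXT, strategy TEXT, signal TEXT,
               features TEXT, outcome INTEGER, pnl REAL)"""
    )
    conn.execute(
        "INSERT INTO signal_log (ts, market_id, strategy, signal, features) VALUES (?, ?, ?, ?, ?)",
        (time.time(), market_id, strategy, signal, json.dumps(features)),
    )
    conn.commit()
    conn.close()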

Phase 2: Short-Term (1-3 Months)

Goal: Train simple predictive models as data accumulates

  1. Collect Resolution Data

    • Store winning outcomes in the market_resolutions table
    • Calculate realized P&L on each signal
    • Build ground truth for supervised learning
  2. Train First Models

    • LightGBM classifier: will this arbitrage opportunity profit?
    • ARIMA: what's the expected price reversion?
    • Features: spread_z_score, volume_velocity, time_decay
  3. Backtesting Framework

    • Walk-forward validation (NO random splits - temporal data!)
    • Measure: Sharpe ratio, win rate, expected value per trade (see the metrics sketch after this list)
    • Calibration analysis: are predictions well-calibrated?
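
A minimal sketch of the backtest metrics, assuming a list of realized P&L values (as a fraction of stake) per closed trade; the annualization factor of 365 assumes roughly one trade per day and is illustrative:

import numpy as np

def trade_metrics(pnl_per_trade, trades_per_year=365):
    """Win rate, expected value per trade, and a rough annualized Sharpe ratio."""
    pnl = np.asarray(pnl_per_trade, dtype=float)
    win_rate = float((pnl > 0).mean())
    ev_per_trade = float(pnl.mean())
    std = pnl.std(ddof=1)
    sharpe = float(ev_per_trade / std * np.sqrt(trades_per_year)) if std > 0 else float("nan")
    return {"win_rate": win_rate, "ev_per_trade": ev_per_trade, "sharpe": sharpe}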

Phase 3: Medium-Term (3-6 Months)

Goal: Scale profitable strategies

  1. Expand Data Collection

    • Add external signals (news sentiment, correlated assets)
    • Increase snapshot frequency (every 5 minutes instead of hourly)
    • Track more markets (100+ active markets)
  2. Advanced Models

    • Multi-output GBMs (predict price movement for all outcomes)
    • Correlation models (trade related event pairs)
    • Market regime detection (is the market in "efficient" or "chaotic" mode?)
  3. Automated Execution

    • Real-time signal generation
    • Risk-adjusted position sizing (Kelly criterion; see the sizing sketch after this list)
    • Stop-losses on adverse selection
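
A minimal fractional-Kelly sketch for a binary market: buying YES at price q with an estimated win probability p, full Kelly reduces to (p - q) / (1 - q), and scaling it down (Kelly/4, as recommended below) cushions estimation error:

def kelly_fraction(p_win, price, fraction=0.25):
    """Bankroll fraction to stake buying YES at `price` given estimated P(win) = p_win.

    Net odds are b = (1 - price) / price, so full Kelly (p_win * b - (1 - p_win)) / b
    simplifies to (p_win - price) / (1 - price). `fraction` applies fractional Kelly.
    """
    edge = p_win - price
    if edge <= 0:
        return 0.0
    return fraction * edge / (1 - price)

# Example: model says 60%, market prices YES at 50c -> full Kelly 20%, quarter Kelly 5%
# kelly_fraction(0.60, 0.50)  # 0.05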

Phase 4: Long-Term (6-12 Months)

Goal: Build production ML trading system

  1. Deep Learning (If Justified)

    • LSTM for price trajectory prediction (need 1000+ sequences)
    • Transformer models for multi-market attention
    • Reinforcement learning for dynamic position management
  2. Ensemble Methods

    • Combine rule-based + ML predictions
    • Weighted by historical performance
    • Adaptive model selection (use best model for each market type)
  3. Continuous Learning

    • Online learning (update models with new data daily)
    • Concept drift detection (market behavior changes)
    • A/B testing of strategies

Critical Success Factors

Do This

  • Start with arbitrage (it's proven, works with limited data)
  • Use classical stats (ARIMA, z-scores) before deep learning
  • Validate everything with backtests (walk-forward, not random splits)
  • Measure calibration (are predictions well-calibrated?)
  • Size positions with Kelly criterion (avoid ruin)
  • Paper trade for 30+ days before live trading

Don't Do This

  • Train LSTMs with <1000 samples (will overfit)
  • Use random train/test splits (temporal data leaks information)
  • Predict outcomes directly (predict inefficiencies instead)
  • Ignore transaction costs (Polymarket has fees + slippage)
  • Over-leverage (Kelly/4 is safer than full Kelly)
  • Trade near market resolution (edge disappears <48hrs)

Expected Performance

Based on academic research and market conditions:

Strategy | Win Rate | Avg Profit/Trade | Sharpe Ratio | Data Required
Simple Arbitrage | 85-95% | 2-5% | 2.0-3.0 | Minimal
Stat Arb (Mean Rev) | 60-70% | 3-8% | 1.5-2.5 | 100+ snapshots
GBM Classifier | 55-65% | 5-12% | 1.0-2.0 | 1000+ snapshots + labels
LSTM Price Pred | 52-58% | 4-10% | 0.8-1.5 | 5000+ sequences

Reality Check: Academic research shows top wallets achieved ~$1.4M profit each over one year. That's the ceiling. Start small, scale cautiously.

Next Steps

  1. Immediate (This Week)

    • Review your stat_arb strategy - it's sound for current data constraints
    • Add Isolation Forest anomaly detection
    • Log ALL signals to build a training dataset
  2. Short-Term (This Month)

    • Collect 30 days of continuous data (1000+ snapshots)
    • Implement resolution tracking
    • Paper trade arbitrage signals
  3. Medium-Term (Next Quarter)

    • Train first LightGBM models
    • Backtest with walk-forward validation
    • Go live with the best-performing strategy (if EV > 0)

References

Academic Research:

  • Unravelling the Probabilistic Forest: Arbitrage in Prediction Markets - 2025 study showing $40M in Polymarket arbitrage
  • Machine learning, stock market forecasting, and market efficiency - 2025 analysis of ML accuracy vs market efficiency
  • The perils of election prediction markets - 2024 election market inefficiency research

Time Series with Limited Data:

  • Finding an Accurate Early Forecasting Model from Small Dataset - methods for small-sample forecasting
  • Very long and very short time series - classical methods for limited data

Polymarket-Specific:

  • Top 10 Polymarket Trading Strategies - practitioner insights
  • Polymarket users lost millions to 'bot-like' bettors - evidence of exploitable inefficiencies


Bottom Line: Prediction markets are demonstrably inefficient. Your stat_arb strategy is the right approach. Add simple ML (Isolation Forest, GBMs) as data grows. Avoid deep learning until you have 5000+ samples. Focus on exploiting structural inefficiencies, not predicting outcomes.

The edge is real. Start trading it.

System Overview

  • Polymarket API - market data source
  • Data Collector - pulls data every 5 minutes
  • SQLite Database - price history + trades
  • Strategy Engine - signal generation
  • ML Model - XGBoost (72% acc)
  • Execution Engine - paper trading
  • Dashboard - you are here!
  • Telegram - alerts & updates

Trading Strategies

Each strategy looks for different market inefficiencies:

  • Dual Arbitrage (Active) - Finds when YES + NO prices don't add to 100%. Risk-free profit.
  • Mean Reversion (Active) - Buys when the price drops too far from average, sells when it recovers.
  • Market Maker (Active) - Places bid/ask orders to capture the spread.
  • Time Arbitrage (Active) - Exploits predictable price patterns at certain hours.
  • ML Prediction (Active) - Uses machine learning to predict 6-hour price direction.
  • Value Betting (Disabled) - Finds underpriced outcomes based on implied probability.

Data Storage (Single Source of Truth)

All data lives on EC2. Local machines are for development only. The EC2 instance is the authoritative source for all market data, trades, and positions.

Database | Purpose | Location
market_history.db | Price snapshots every 5 minutes (8.2 MB) | EC2 (primary)
pqap_staging.db | Trades, positions, P&L history | EC2 (primary)
paper_trading_state.json | Current portfolio state | EC2 (primary)

Environment Architecture

EC2 (Production)

  • Runs 24/7
  • All databases live here
  • Executes all trades
  • Single source of truth

Local (Development)

  • For code changes only
  • Syncs code to EC2
  • No production data
  • Can be turned off

Environment Details

Component | Details
Dashboard URL | https://pqap.tailwindtech.ai
Server | AWS EC2 (us-east-1)
SSL | Let's Encrypt via Traefik
Mode | Paper Trading (simulated)

How It Works (Simple Version)

1. Data Collection: Every 5 minutes, we fetch prices from Polymarket for 50 markets and save them to our database.

2. Analysis: Our strategies analyze this data looking for patterns - like prices that moved too far from normal, or markets where the math doesn't add up.

3. Signals: When a strategy finds an opportunity, it generates a "signal" - a recommendation to buy or sell.

4. Execution: The execution engine takes these signals and simulates trades (paper trading). Eventually, this will place real orders.

5. Monitoring: This dashboard shows you what's happening. Telegram sends alerts for important events.