This document provides complete mathematical documentation of Forecaster Arena's scoring system.
Forecaster Arena uses dual scoring to evaluate LLM forecasting performance:
- Brier Score - Measures calibration (how well confidence matches accuracy)
- Portfolio Returns (P/L) - Measures practical value (can predictions generate returns?)
Both metrics are important because:
- A well-calibrated forecaster (good Brier) may make small, safe bets
- An aggressive trader (high P/L) may be poorly calibrated but lucky
- The best forecasters excel at both
The Brier Score measures the accuracy of probabilistic predictions.
Formula:
$$\text{BS} = (f - o)^2$$
Where:
- $f$ = forecast probability (0 to 1)
- $o$ = actual outcome (1 if event occurred, 0 if not)
Score Range:
- 0 = Perfect prediction
- 0.25 = Random guessing (50/50)
- 1 = Completely wrong
Example:
- You predict 80% chance of rain, it rains: $(0.8 - 1)^2 = 0.04$ (Good)
- You predict 80% chance of rain, it doesn't rain: $(0.8 - 0)^2 = 0.64$ (Bad)
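The rain example can be checked with a few lines of Python. This is a minimal sketch; the function name `brier_score` and its input validation are illustrative choices, not part of Forecaster Arena's codebase.

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Squared error between a forecast probability and a binary outcome."""
    if not 0.0 <= forecast <= 1.0:
        raise ValueError("forecast must be between 0 and 1")
    if outcome not in (0, 1):
        raise ValueError("outcome must be 0 or 1")
    return (forecast - outcome) ** 2


# 80% chance of rain, and it rains
print(round(brier_score(0.8, 1), 2))  # 0.04
# 80% chance of rain, and it stays dry
print(round(brier_score(0.8, 0), 2))  # 0.64
```

Lower is better: the same 80% forecast scores 0.04 when it is right and 0.64 when it is wrong.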
In Forecaster Arena, LLMs don't explicitly state probabilities. Instead, confidence is derived from bet size.
Rationale: A larger bet indicates higher confidence. A maximum bet (25% of balance) represents 100% confidence.
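Formula:
$$\text{confidence} = \frac{\text{bet amount}}{\text{max bet}}$$
Where:
- $\text{max bet} = 0.25 \times \text{cash balance}$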
Examples:
| Cash Balance | Bet Amount | Max Bet | Implied Confidence |
|---|---|---|---|
| $10,000 | $2,500 | $2,500 | 100% |
| $10,000 | $1,250 | $2,500 | 50% |
| $10,000 | $500 | $2,500 | 20% |
| $10,000 | $50 | $2,500 | 2% |
| $8,000 | $2,000 | $2,000 | 100% |
| $8,000 | $500 | $2,000 | 25% |
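The table rows can be reproduced with a short sketch, assuming implied confidence is the bet amount divided by the maximum bet (25% of cash balance). The name `implied_confidence` is hypothetical, and rejecting bets above the maximum is an assumption made here for illustration.

```python
MAX_BET_FRACTION = 0.25  # from the text: the maximum bet is 25% of cash balance


def implied_confidence(bet_amount: float, cash_balance: float) -> float:
    """Map a bet size to an implied confidence in (0, 1]."""
    max_bet = cash_balance * MAX_BET_FRACTION
    if max_bet <= 0:
        raise ValueError("cash balance must be positive")
    # Assumption: bets outside (0, max_bet] are invalid rather than clamped.
    if not 0 < bet_amount <= max_bet:
        raise ValueError("bet must be positive and at most the maximum bet")
    return bet_amount / max_bet


for cash, bet in [(10_000, 2_500), (10_000, 1_250), (8_000, 500)]:
    print(cash, bet, implied_confidence(bet, cash))
```

Because the maximum bet scales with the current cash balance, the same $500 bet implies 20% confidence at a $10,000 balance but 25% at $8,000.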
Brier Score requires a forecast of the YES probability. We convert based on bet side:
For YES bets:
$$f_{YES} = \text{confidence}$$
For NO bets:
$$f_{YES} = 1 - \text{confidence}$$
Intuition: Betting NO with 80% confidence means you think YES has only 20% chance.
Examples:
| Bet Side | Implied Confidence | f_YES | Outcome | Brier Score |
|---|---|---|---|---|
| YES | 0.80 | 0.80 | YES wins | $(0.80 - 1)^2 = 0.04$ |
| YES | 0.80 | 0.80 | NO wins | $(0.80 - 0)^2 = 0.64$ |
| NO | 0.80 | 0.20 | YES wins | $(0.20 - 1)^2 = 0.64$ |
| NO | 0.80 | 0.20 | NO wins | $(0.20 - 0)^2 = 0.04$ |
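The side conversion and the resulting per-bet score can be sketched as follows, assuming a NO bet with confidence c implies a YES probability of 1 - c. The helper names `yes_probability` and `bet_brier` are illustrative only.

```python
def yes_probability(side: str, confidence: float) -> float:
    """Convert an implied confidence on one side into a YES probability."""
    if side == "YES":
        return confidence
    if side == "NO":
        return 1.0 - confidence
    raise ValueError("side must be 'YES' or 'NO'")


def bet_brier(side: str, confidence: float, yes_won: bool) -> float:
    """Brier score of a single bet, scored against the YES outcome."""
    f_yes = yes_probability(side, confidence)
    outcome = 1 if yes_won else 0
    return (f_yes - outcome) ** 2


for side in ("YES", "NO"):
    for yes_won in (True, False):
        print(side, yes_won, round(bet_brier(side, 0.80, yes_won), 2))
```

A confident bet on the winning side scores 0.04 regardless of which side it was, and a confident bet on the losing side scores 0.64.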
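Per Agent:
$$\text{BS}_{agent} = \frac{1}{N} \sum_{i=1}^{N} (f_i - o_i)^2$$
Where $N$ is the number of resolved bets placed by the agent, and $f_i$ and $o_i$ are the YES forecast and the outcome of bet $i$.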
Across Cohorts: For comparing models across multiple cohorts, we take the mean of all individual Brier scores for that model.
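A small sketch of the aggregation, assuming each resolved bet is stored as a `(f_yes, outcome)` pair (a hypothetical layout). It reads "the mean of all individual Brier scores" as pooling every bet across cohorts, which can differ from averaging the per-cohort means when cohorts contain different numbers of bets.

```python
from statistics import fmean


def mean_brier(bets):
    """Average Brier score over a list of (f_yes, outcome) pairs."""
    scores = [(f - o) ** 2 for f, o in bets]
    if not scores:
        raise ValueError("no resolved bets to score")
    return fmean(scores)


def pooled_brier(cohorts):
    """Mean over every individual bet from every cohort of one model."""
    all_bets = [bet for cohort in cohorts for bet in cohort]
    return mean_brier(all_bets)


cohort_a = [(0.8, 1), (0.8, 0)]  # scores 0.04 and 0.64
cohort_b = [(0.2, 0)]            # score 0.04
print(round(mean_brier(cohort_a), 2))                # 0.34
print(round(pooled_brier([cohort_a, cohort_b]), 2))  # 0.24
```

Here the pooled score is 0.24, whereas averaging the two cohort means (0.34 and 0.04) would give 0.19.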
To contextualize Brier scores, we can compute a skill score relative to a baseline:
$$\text{BSS} = 1 - \frac{\text{BS}}{\text{BS}_{ref}}$$
Reference Baselines:
- Random guesser (always 50%): $\text{BS}_{ref} = 0.25$
- Climatology (always market price): $\text{BS}_{ref} = \text{avg market Brier}$
Interpretation:
- BSS > 0: Better than reference
- BSS = 0: Same as reference
- BSS < 0: Worse than reference
Example:
- Brier Score = 0.15, Reference = 0.25
- BSS = 1 - (0.15 / 0.25) = 0.40 (40% better than random)
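The skill score calculation is a one-liner; the sketch below assumes the form BSS = 1 - BS / BS_ref used in the example, with the random-guesser baseline of 0.25 as the default. The function name is illustrative.

```python
def brier_skill_score(bs: float, bs_ref: float = 0.25) -> float:
    """Relative improvement of a Brier score over a reference score."""
    if bs_ref <= 0:
        raise ValueError("reference Brier score must be positive")
    return 1.0 - bs / bs_ref


print(round(brier_skill_score(0.15), 2))  # 0.4, i.e. 40% better than random
print(round(brier_skill_score(0.30), 2))  # -0.2, i.e. worse than random
```

Passing the average market Brier score as `bs_ref` gives the comparison against the market-price baseline instead.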
Buying Shares:
When placing a bet, you buy shares at the current market price:
$$\text{shares} = \frac{\text{bet amount}}{\text{price}}$$
- For YES bets: price = current YES probability
- For NO bets: price = 1 - current YES probability
Example:
- Bet $500 on YES at price 0.40
- Shares = $500 / 0.40 = 1,250 shares
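The share calculation can be sketched as below, assuming shares equal the bet amount divided by the entry price, where a NO share is priced at 1 minus the YES probability. The function names are hypothetical.

```python
def entry_price(side: str, yes_price: float) -> float:
    """Price per share for the chosen side, given the current YES probability."""
    if not 0.0 < yes_price < 1.0:
        raise ValueError("yes_price must be strictly between 0 and 1")
    if side == "YES":
        return yes_price
    if side == "NO":
        return 1.0 - yes_price
    raise ValueError("side must be 'YES' or 'NO'")


def shares_bought(bet_amount: float, side: str, yes_price: float) -> float:
    """Number of shares purchased for a given bet amount."""
    return bet_amount / entry_price(side, yes_price)


print(round(shares_bought(500, "YES", 0.40), 2))  # 1250.0
print(round(shares_bought(500, "NO", 0.40), 2))   # 833.33
```

The same $500 buys fewer NO shares here because NO is the more expensive side (0.60 per share).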
Position value fluctuates with market price:
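For YES positions:
$$\text{value} = \text{shares} \times \text{current YES price}$$
For NO positions:
$$\text{value} = \text{shares} \times (1 - \text{current YES price})$$
Unrealized P/L:
$$\text{unrealized P/L} = \text{value} - \text{cost basis}$$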
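Mark-to-market valuation can be sketched as follows, assuming a position is worth its share count times the current price of its side, and unrealized P/L is that value minus the cost basis. Names and the sample prices are illustrative.

```python
def position_value(shares: float, side: str, yes_price: float) -> float:
    """Current value of an open position at the current YES price."""
    if side == "YES":
        price = yes_price
    elif side == "NO":
        price = 1.0 - yes_price
    else:
        raise ValueError("side must be 'YES' or 'NO'")
    return shares * price


def unrealized_pnl(shares: float, side: str, yes_price: float,
                   cost_basis: float) -> float:
    """Paper profit or loss on an open position."""
    return position_value(shares, side, yes_price) - cost_basis


# 1,250 YES shares bought for $500 at 0.40; the YES price later moves to 0.50
print(position_value(1250, "YES", 0.50))       # 625.0
print(unrealized_pnl(1250, "YES", 0.50, 500))  # 125.0
```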
When a market resolves:
Winning positions: Each share pays \$1
Losing positions: Each share pays \$0
Realized P/L:
$$\text{realized P/L} = \text{settlement value} - \text{cost basis}$$
Example:
- Bought 1,250 YES shares for $500 (cost basis)
- Market resolves YES → Settlement = 1,250 × $1 = $1,250
- Realized P/L = $1,250 - $500 = +$750
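Settlement can be sketched in a few lines, assuming each winning share pays $1, each losing share pays $0, and realized P/L is the settlement value minus the cost basis. The function name and argument layout are hypothetical.

```python
def realized_pnl(shares: float, side: str, resolved_side: str,
                 cost_basis: float) -> float:
    """Profit or loss once the market has resolved."""
    for value in (side, resolved_side):
        if value not in ("YES", "NO"):
            raise ValueError("sides must be 'YES' or 'NO'")
    payout_per_share = 1.0 if side == resolved_side else 0.0
    settlement = shares * payout_per_share
    return settlement - cost_basis


print(realized_pnl(1250, "YES", "YES", 500))  # 750.0
print(realized_pnl(1250, "YES", "NO", 500))   # -500.0
```

A losing position forfeits its entire cost basis, so the maximum loss on any bet is the amount staked.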
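Total Portfolio Value:
$$\text{total value} = \text{cash balance} + \sum \text{position values}$$
Total P/L:
$$\text{total P/L} = \text{total value} - \text{initial balance}$$
Return Percentage:
$$\text{return \%} = \frac{\text{total P/L}}{\text{initial balance}} \times 100$$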
Example:
- Initial balance: $10,000
- Current cash: $7,500
- Position values: $3,200
- Total value: $10,700
- Total P/L: +$700 (+7.0%)
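The portfolio example above can be reproduced with a short sketch, assuming total value is cash plus the sum of open position values and the return is measured against the initial balance. The function name and return shape are illustrative.

```python
def portfolio_summary(initial_balance, cash, position_values):
    """Return (total value, total P/L, return in percent)."""
    if initial_balance <= 0:
        raise ValueError("initial balance must be positive")
    total_value = cash + sum(position_values)
    total_pnl = total_value - initial_balance
    return_pct = 100.0 * total_pnl / initial_balance
    return total_value, total_pnl, return_pct


total, pnl, pct = portfolio_summary(10_000, 7_500, [3_200])
print(total, pnl, pct)  # 10700 700 7.0
```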
Definition:
$$\text{win rate} = \frac{\text{winning bets}}{\text{total resolved bets}}$$
A bet is "winning" if the side bet on matches the resolution outcome.
Note: Win rate alone is misleading without context:
- 90% win rate with tiny bets may underperform
- 40% win rate with smart sizing may outperform
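A minimal sketch of the win-rate calculation, assuming it is the share of resolved bets whose side matches the resolution outcome. Representing each bet as a `(side, resolved_side)` pair is a hypothetical layout chosen for illustration.

```python
def win_rate(resolved_bets):
    """Fraction of resolved bets placed on the side that won."""
    if not resolved_bets:
        raise ValueError("no resolved bets")
    wins = sum(1 for side, resolved_side in resolved_bets if side == resolved_side)
    return wins / len(resolved_bets)


bets = [("YES", "YES"), ("NO", "YES"), ("NO", "NO"), ("YES", "NO")]
print(win_rate(bets))  # 0.5
```

Note that the calculation ignores bet size entirely, which is why win rate needs to be read alongside P/L.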
| Metric | Measures | Rewards | Penalizes |
|---|---|---|---|
| Brier Score | Calibration | Accurate confidence | Over/under confidence |
| P/L | Practical value | Correct directional bets | Incorrect bets |
| Win Rate | Hit rate | Being right often | Being wrong often |
Key Insight:
A model with excellent Brier score but modest P/L is well-calibrated but conservative.
A model with excellent P/L but poor Brier score got lucky or is poorly calibrated.
The ideal model excels at both: confident when right, cautious when uncertain.
- At bet placement: Record implied confidence
- Daily: Update position mark-to-market values
- At resolution: Calculate Brier score and realized P/L
- Aggregate: Compute running averages for leaderboard
| Brier Score | Interpretation |
|---|---|
| 0.00 | Perfect |
| 0.00 - 0.10 | Excellent |
| 0.10 - 0.20 | Good |
| 0.20 - 0.25 | Fair (at or near random) |
| 0.25 - 0.40 | Poor |
| 0.40+ | Very Poor |
| Return % | Interpretation |
|---|---|
| > +50% | Exceptional |
| +20% to +50% | Excellent |
| +5% to +20% | Good |
| -5% to +5% | Neutral |
| -20% to -5% | Poor |
| < -20% | Very Poor |