What each pick badge means
Calibrated on backtest run_id=13 (1,838 regular-season games, 2025–26). Updated 2026-05-28.
Every pick falls into exactly one tier, based on model confidence. Each tier is a strict rule on per-model probabilities, calibrated on roughly 1,800 games.
| Badge | Rule | Hit % | n / season | Notes |
|---|---|---|---|---|
| ELITE | all 4 models agree > 62% | 82.9% | 35 | Rare (~2% of slate) · safest tier |
| STRONG | any 2 of 3 primary models > 60% | 62.0% | 234 | Workhorse · most volume |
| VALUE | all 3 primary models > 55% | 64.7% | 153 | Best ROI tier · +odds friendly · parlay-grade |
| TOTALS-LOCK | LR ≥ 0.62 and sim ≥ 0.62, same side | 68.2% | 107 | D2 consensus · highest confidence band |
| TOTALS-STRONG | LR ≥ 0.60 and sim ≥ 0.55, same side | 65.7% | 277 | D2 consensus · workhorse totals band |
| TOTALS-VALUE | LR ≥ 0.58 and sim ≥ 0.52, same side | 58.4% | 197 | D2 consensus · wider band, parlay-grade |
| TOTALS-LR-SOLO | LR ≥ 0.66, sim disagrees | — | 0 | Forward-compatible · 0 picks across n=3,638 games to date |
| PASS | no tier rule matches | ~50% | — | Collapsed by default · add manually if your read differs |
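The moneyline rows of the table (ELITE, STRONG, VALUE, PASS) can be read as a first-match cascade. The sketch below is illustrative only: the function name, the argument names, and the top-to-bottom precedence are assumptions, not the production logic. Each probability is that model's win probability for the side being graded, so "agree" means all of them clear the threshold for the same side.

```python
from typing import Optional


def moneyline_tier(analytic: float, sim: float, lr: float,
                   xgb: Optional[float] = None) -> str:
    """Return the badge for one side, checking rules in table order (assumed)."""
    primary = [analytic, sim, lr]

    # ELITE: all four models above 62%, so XGBoost must have produced a vote.
    if xgb is not None and all(p > 0.62 for p in primary + [xgb]):
        return "ELITE"

    # STRONG: any two of the three primary models above 60%.
    if sum(p > 0.60 for p in primary) >= 2:
        return "STRONG"

    # VALUE: all three primary models above 55%.
    if all(p > 0.55 for p in primary):
        return "VALUE"

    # PASS: no tier rule matched.
    return "PASS"
```

Under this ordering, a pick reaches VALUE only when all three primary models clear 55% but fewer than two clear 60%.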
How to read tier colors: green = LOCK band (~9–10/10 confidence), amber = STRONG, blue = LEAN/VALUE, grey = PASS.
Lineup gate: predictions only appear once both starting lineups are posted. Games waiting on lineups show ⏳ lineups in the left column — usually 2–3 hours before first pitch. For future-dated slates, predictions show with a ⏳ tentative roster badge — they use probable pitchers + team strength only (no lineup data yet).
Models behind the picks: an analytic generative model (Negative-Binomial run distribution), a Monte Carlo simulation (2,000 sims/game), an LR classifier (elastic-net), and XGBoost (gradient-boosted trees). ELITE requires all four; STRONG / VALUE use only the three primary voters (XGB optional). On totals, the user-facing tier is the new D2 LR + sim consensus rule: both voters must agree on side at the threshold. LR is the peaked-accuracy anchor and sim is the agreement check.
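The D2 totals consensus rule can be sketched the same way. The thresholds come from the table; everything else is an assumption for illustration: the function name, taking each voter's probability of the over as input, letting LR pick the side, and reading "sim disagrees" as sim putting less than 50% on LR's side.

```python
def totals_tier(lr_over: float, sim_over: float) -> str:
    """Return the totals badge from LR and sim probabilities of the over."""
    # LR is the anchor: grade whichever side LR favors.
    favors_over = lr_over >= 0.5
    lr = lr_over if favors_over else 1.0 - lr_over
    sim = sim_over if favors_over else 1.0 - sim_over

    # (badge, LR minimum, sim minimum), strictest band first.
    bands = [
        ("TOTALS-LOCK", 0.62, 0.62),
        ("TOTALS-STRONG", 0.60, 0.55),
        ("TOTALS-VALUE", 0.58, 0.52),
    ]
    for badge, lr_min, sim_min in bands:
        if lr >= lr_min and sim >= sim_min:
            return badge

    # LR-SOLO: LR is very confident but sim leans the other way (assumed reading).
    if lr >= 0.66 and sim < 0.5:
        return "TOTALS-LR-SOLO"

    return "PASS"
```

Because sim must clear at least 0.52 on LR's side in every consensus band, a split between the two voters can only ever land in TOTALS-LR-SOLO or PASS.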
Sim is now multi-line (D1): sim emits a probability at every standard market line from 7.5 to 11.5 in 0.5 increments, matching LR's classifier coverage. The 8.0 / 8.5 / 11.0 / 11.5 calibrators are persisted (isotonic, refuse-if-worse gate); 7.5 / 9.0 / 9.5 / 10.0 / 10.5 currently use sim's raw probability because the calibrator didn't improve held-out log-loss.
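The refuse-if-worse gate amounts to comparing held-out log-loss with and without the calibrator. The following is a minimal sketch of that comparison, not the actual pipeline: `gate_calibrator` and `log_loss` are hypothetical names, and the calibrator is passed in as any callable rather than a fitted isotonic model.

```python
import math
from typing import Callable, Sequence


def log_loss(probs: Sequence[float], labels: Sequence[int],
             eps: float = 1e-12) -> float:
    """Mean binary log-loss, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def gate_calibrator(calibrate: Callable[[float], float],
                    holdout_probs: Sequence[float],
                    holdout_labels: Sequence[int]) -> Callable[[float], float]:
    """Keep the calibrator only if it lowers held-out log-loss."""
    raw_loss = log_loss(holdout_probs, holdout_labels)
    cal_loss = log_loss([calibrate(p) for p in holdout_probs], holdout_labels)
    if cal_loss < raw_loss:
        return calibrate
    # Refuse: fall back to the raw probability unchanged.
    return lambda p: p
```

In this sketch a tie counts as "not better", so the raw probability is kept unless the calibrator strictly improves the held-out score.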
Why fewer non-8.5 ensemble totals (D3): at non-8.5 lines, only sim emits, so there is nothing to average against and the "ensemble" would just be sim's raw probability. To prevent single-voter overconfidence at the tails, the ensemble blender requires ≥ 2 voters per line; below that it emits nothing. Net effect with the gate on: ensemble Total Brier 0.2454 → 0.2451, and the Δ +0.13 / +0.22 overconfidence at P ≥ 0.60 fell to within ±0.04.
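A minimal sketch of the minimum-voter gate described above. The function name is hypothetical, and an unweighted mean stands in for whatever blend the ensemble actually uses.

```python
from typing import Dict, Optional


def blend_line(voter_probs: Dict[str, Optional[float]],
               min_voters: int = 2) -> Optional[float]:
    """Blend one market line, or return None when too few voters emitted."""
    # A voter that did not emit at this line is represented by None.
    votes = [p for p in voter_probs.values() if p is not None]
    if len(votes) < min_voters:
        return None  # gate closed: no ensemble total at this line
    # Assumption: simple unweighted average of the available voters.
    return sum(votes) / len(votes)
```

For example, `blend_line({"lr": None, "sim": 0.70})` returns `None`, so no ensemble total is shown at that line, while a line where both voters emit is averaged as usual.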
Headline lift from Wave 1 + D1 + D2 + D3 (n=1,800): totals consensus tier hit-rate 62.5% → 63.7% · ROI at −110: +19.4% → +21.6% · ensemble Brier 0.2340 → 0.2316. ML side: essentially unchanged (only C8 batter recency feeds into ML, and the n=1,800 delta of +0.0001 is noise).
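The link between hit rate and ROI at −110 is simple arithmetic. A small sketch, assuming flat one-unit stakes and ignoring pushes; the function name is illustrative:

```python
def flat_bet_roi(hit_rate: float, american_odds: int = -110) -> float:
    """Expected profit per unit staked at a given hit rate and American odds."""
    if american_odds < 0:
        # Negative odds: risk |odds| to win 100.
        profit_per_unit = 100.0 / -american_odds
    else:
        # Positive odds: risk 100 to win odds.
        profit_per_unit = american_odds / 100.0
    # Wins pay the profit; losses forfeit the one-unit stake.
    return hit_rate * profit_per_unit - (1.0 - hit_rate)
```

At −110 the break-even hit rate is 110/210, about 52.4%. Plugging in the rounded hit rates quoted above gives roughly +19.3% for 62.5% and +21.6% for 63.7%, in line with the reported ROI figures once rounding of the hit rates is allowed for.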
Full model card with Brier progression, totals verification, and EV-backtest tables: /model-card