f1 predict
live
./docs / data-pipeline

Data Pipeline

Jobs

All jobs live in data-engine/src/jobs/. They are invoked via src/main.py --job <name>.

Main Race Jobs

JobPurpose
sync_schedulePopulates the races table for a season from FastF1 (includes sprint dates, event_format)
sync_seasonPopulates teams and drivers for a season from FastF1
ingest_qualifyingIngests Q1/Q2/Q3 qualifying session data and sector times — 2018+
ingest_qualifying_legacyIngests qualifying from Ergast — pre-2018
ingest_raceIngests race results, lap times, and race conditions (weather, SC/VSC, temps) — 2018+
ingest_race_legacyIngests race results from Ergast (no lap data) — pre-2018
compute_season_statsAggregates driver_season_stats and team_season_stats for a year (includes sprint aggregates)
compute_featuresComputes the 12 feature scores per driver for a grand prix
compute_predictionsRuns softmax on feature scores, writes win probabilities and predicted positions

Sprint Race Jobs

JobPurpose
ingest_sprint_qualifyingIngests SQ session — stores SQ1/SQ2/SQ3 times + sector times + speed in sprint_results; uses messages=True; has date guard
ingest_sprintIngests sprint race results, sprint lap times, and sprint conditions (weather, SC/VSC, temps)
compute_sprint_featuresComputes 8 sprint-specific feature scores per driver
compute_sprint_predictionsRuns softmax on sprint feature scores → sprint win probabilities

Job Chain

Conventional Weekend

sync_schedule        races table must exist before ingest can find rounds

sync_season          teams and drivers must exist before results can reference them

ingest_qualifying    qualifying_results must exist before compute_features

ingest_race          race_results and lap_times ingested; race.status → completed

compute_season_stats driver_season_stats and team_season_stats must be fresh
                     before compute_features reads them

compute_features     produces driver_prediction_features (raw weighted scores)

compute_predictions  reads raw scores, runs softmax, writes probabilities

Sprint Weekend

sync_schedule + sync_season

ingest_sprint_qualifying   → sprint_results (sq1/sq2/sq3 times + grid)
                             race.status → sprint_qualifying_done

compute_sprint_features    → driver_sprint_features
compute_sprint_predictions → sprint_predictions

ingest_sprint              → sprint_results (finish positions, points)
                             → sprint_lap_times, races (sprint conditions)
                             race.status → sprint_done

compute_season_stats       → updates sprint_wins, sprint_total_points, etc.

ingest_qualifying          → qualifying_results
                             race.status → qualifying_done

compute_features + compute_predictions  → main race prediction

ingest_race                → race_results + lap_times
                             race.status → completed

compute_season_stats       → final season stats update

compute_season_stats should be re-run after each race so that rolling stats (win rate, DNF rate, avg position gain, sprint aggregates) are up to date before the next prediction.


Cron Schedule (Render)

Conventional Weekend

Time (UTC)Jobs
Saturday 22:00ingest_qualifyingcompute_featurescompute_predictions
Sunday 18:00ingest_racecompute_season_stats

Sprint Weekend

Time (UTC)Jobs
Friday 22:00ingest_sprint_qualifyingcompute_sprint_featurescompute_sprint_predictions
Saturday 16:00ingest_sprintcompute_season_stats
Saturday 22:00ingest_qualifyingcompute_featurescompute_predictions
Sunday 18:00ingest_racecompute_season_stats

Running Jobs Locally

cd data-engine
source venv/bin/activate

# Sync
python src/main.py --job sync_schedule     --year 2026
python src/main.py --job sync_season       --year 2026 --round 1

# Main race pipeline
python src/main.py --job ingest_qualifying --year 2026 --round 6
python src/main.py --job ingest_race       --year 2026 --round 6
python src/main.py --job compute_season_stats --year 2026
python src/main.py --job compute_features  --race_id 42
python src/main.py --job compute_predictions --race_id 42

# Sprint pipeline
python src/main.py --job ingest_sprint_qualifying --year 2026 --round 9
python src/main.py --job compute_sprint_features  --race_id 55
python src/main.py --job compute_sprint_predictions --race_id 55
python src/main.py --job ingest_sprint            --year 2026 --round 9

For pre-2018:

python src/main.py --job ingest_qualifying_legacy --year 2015 --round 5
python src/main.py --job ingest_race_legacy       --year 2015 --round 5

Historical Backfill

Two backfill scripts in data-engine/:

Full backfill (backfill_full.py)

Runs the complete pipeline (sync → ingest → sprint pipeline → season stats → features → predictions) for every round in a year range. Sprint weekends are automatically detected and handled.

cd data-engine
source venv/bin/activate

python backfill_full.py                        # 2000–2026 (legacy years skip gracefully)
python backfill_full.py --start 2018           # 2018–2026 (recommended — full FastF1 coverage)
python backfill_full.py --start 2025 --end 2025

Sprint-only backfill (backfill_sprint.py)

Re-runs just the sprint pipeline for specific years (useful after sprint schema changes).

python backfill_sprint.py --years 2021 2022 2024 2026
python backfill_sprint.py --years 2026          # single year

Data coverage by era:

YearsQualifyingLap timesSprint
2018–presentFull Q1/Q2/Q3 + sector timesFull per-lapFull SQ + sprint lap times
2006–2017Q1/Q2/Q3 timesNoneNone (sprint format started 2021)
2000–2005Single best lap onlyNoneNone

Sprint format years: 2021 (3 rounds), 2022 (3 rounds), 2023 (6 rounds), 2024 (6 rounds), 2025 (6 rounds), 2026+.

Both scripts use _run() wrappers — a single failing round does not abort the whole year. Failures print as [SKIP] lines.


FastF1 Cache

FastF1 caches API responses locally. Enable it in development to avoid re-fetching:

import fastf1
fastf1.Cache.enable_cache('./cache')

The cache is already enabled via src/config.py. The cache/ directory is gitignored.


Idempotency

Every job uses INSERT ... ON CONFLICT DO UPDATE. Running any job twice produces identical results — no duplicate rows. This makes retries safe.

The sprint qualifying ingest uses exclude_update to prevent SQ data from overwriting already-ingested sprint race finish positions (finish_position, points, status, total_sprint_time_ms, fastest_lap are never overwritten by SQ ingest).

ingest_qualifying and ingest_sprint_qualifying both have a date guard: they check qualifying_date/sprint_qualifying_date <= today before loading FastF1, and raise an error if no results came back. This prevents future rounds from being incorrectly set to qualifying_done/sprint_qualifying_done by a backfill run.


Error Handling

  • Jobs exit with code 1 on failure so Render marks the job failed for manual retrigger.
  • Never use sleep() inside jobs — Render has a job timeout.
  • Use structured logging: {"job": "ingest_race", "round": 14, "status": "failed", "error": "..."}.