Time Series Forecasting: From Prophet to Transformers
A finance director asks for a quarterly sales forecast. The analyst builds a SARIMA model, gets 8.3% MAPE on the test set, and deploys. Two months later, production MAPE is 23%. Why: the model was trained on pre-COVID data and tested on a stable period, while production hit promotions and supply disruptions. Data leakage plus distribution shift means pretty metrics in the notebook and a broken forecast in production.
Common Forecasting Problems
Wrong cross-validation. Using a standard train_test_split on time series is an error: a random split leaks the future into training, so the model effectively "sees" what it is asked to predict. Correct: TimeSeriesSplit or walk-forward validation with an expanding window.
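A minimal sketch of leakage-free splitting with scikit-learn's TimeSeriesSplit (the expanding-window scheme described above); the toy series and fold sizes here are illustrative, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(100)  # toy series; array index doubles as time order

# Expanding window: each fold trains on everything before its test block
tscv = TimeSeriesSplit(n_splits=4, test_size=10)
for train_idx, test_idx in tscv.split(y):
    # Training always ends strictly before testing begins: no future leakage
    assert train_idx.max() < test_idx.min()
    print(f"train 0..{train_idx.max()}  test {test_idx.min()}..{test_idx.max()}")
```

Contrast with a random split, where shuffled "future" points land in the training set and inflate test metrics.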
Multiple seasonality. Hourly electricity consumption has three seasonal periods: daily (24h), weekly (168h), and yearly (8760h). SARIMA handles only one; Prophet handles several but scales poorly to thousands of series.
Missing data and anomalies. A sensor gap is information (the sensor was off), not just NaN. Linear interpolation kills that signal; the right handling depends on the nature of the gap.
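One way to keep the gap as a signal instead of erasing it (a hedged sketch; the column names and the fill limit are illustrative choices): add an explicit missingness flag before any imputation, and only fill short gaps.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0])

df = s.to_frame("value")
# Preserve "sensor was off" as a model feature before imputing anything
df["was_missing"] = df["value"].isna().astype(int)
# Forward-fill only 1-step gaps; longer outages stay NaN for separate treatment
df["value_filled"] = df["value"].ffill(limit=1)
```

The downstream model then sees both the filled value and the fact that it was filled, instead of a fabricated smooth line through the outage.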
Cold start in hierarchies. A new SKU in a 50k-product assortment has no history but still needs a forecast. Standard per-series models fail here; you need cross-learning across series or feature-based models.
Tools and When to Apply
Prophet (Meta) excels on business data with clear seasonality and holiday effects: quick to set up, interpretable, with built-in outlier and missing-value handling. It fails on irregular patterns and doesn't scale to thousands of series without parallelization. neuralprophet adds autoregressive terms and AutoML features but loses interpretability.
Gradient boosting on features (LightGBM, XGBoost) is often underestimated. Build features manually: lags (t-1, t-7, t-28), moving averages, categoricals (day of week, month), exogenous variables. One model trains on all series simultaneously, which also solves cold start through similar series. With proper feature engineering it often beats neural models on retail MAPE.
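A sketch of the manual feature construction described above, in pandas (the column names are illustrative). The key detail: lags and rolling means must be computed within each series, otherwise features leak across SKUs.

```python
import pandas as pd

# Toy long-format frame: two SKUs, ten time steps each
df = pd.DataFrame({
    "sku": ["A"] * 10 + ["B"] * 10,
    "t": list(range(10)) * 2,
    "sales": [float(x) for x in range(20)],
})

g = df.groupby("sku")["sales"]
df["lag_1"] = g.shift(1)   # yesterday's sales, per SKU
df["lag_7"] = g.shift(7)   # same step one week back, per SKU
# Trailing weekly average, shifted so it uses only past values
df["ma_7"] = df.groupby("sku")["sales"].transform(
    lambda s: s.shift(1).rolling(7).mean()
)
```

These columns, plus calendar categoricals and exogenous variables, feed straight into LightGBM/XGBoost as a tabular dataset.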
TFT (Temporal Fusion Transformer) is a transformer designed for interpretable forecasting with covariates. Built in: variable selection (which features matter), temporal self-attention (which time steps drive the forecast), and quantile predictions. Available in pytorch-forecasting; needs roughly 10k records per series for stable training.
PatchTST divides the time series into patches, analogous to ViT for images, and captures local patterns better than classic transformers. It works well at long horizons (96-720 steps). An implementation is available in neuralforecast from Nixtla.
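The patching idea can be illustrated with numpy's sliding_window_view (a toy sketch, not the neuralforecast implementation); the patch length and stride values here are assumptions in the spirit of the PatchTST paper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(96.0)    # one input window of 96 steps
patch_len, stride = 16, 8   # assumed PatchTST-style hyperparameters

# Overlapping patches, each treated as one "token" by the transformer:
# shape (num_patches, patch_len) = ((96 - 16) // 8 + 1, 16)
patches = sliding_window_view(series, patch_len)[::stride]
print(patches.shape)  # (11, 16)
```

Each 16-step patch becomes a single token, so attention operates over 11 tokens instead of 96 raw time steps, which is what makes long horizons tractable.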
N-HiTS and N-BEATS are attention-free neural architectures: faster than TFT with competitive accuracy. N-BEATS shows top results on the M4/M5 benchmarks for tasks without covariates.
Deeper: TFT in Production
TFT requires careful data preparation. A typical pipeline via pytorch-forecasting:
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer

max_encoder_length = 120     # days of history fed to the encoder
max_prediction_length = 28   # forecast horizon in days

training = TimeSeriesDataSet(
    data,  # long-format DataFrame: one row per (store, sku, time_idx)
    time_idx="time_idx",
    target="sales",
    group_ids=["store", "sku"],
    min_encoder_length=max_encoder_length // 2,
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=["store_type", "category"],
    time_varying_known_reals=["price", "promo_flag"],
    time_varying_unknown_reals=["sales"],
    target_normalizer=GroupNormalizer(
        groups=["store", "sku"], transformation="softplus"
    ),
)
Common mistake: leaving the default target_normalizer (standard scaling) in place breaks series with many zero values (no weekend sales). GroupNormalizer with transformation="softplus" is the right choice for count data.
Case study: retail demand forecasting. 120 stores, 8,000 SKUs, 28-day horizon. Baseline: SARIMA per series, MAPE 18.4%, full retraining cycle of 6 hours. TFT via PyTorch + pytorch-forecasting: one model for all series, MAPE 11.2%, retraining in 40 minutes on an A10G. Bonus: feature importance via variable selection showed that day_before_holiday matters more than the holiday itself.
Forecast Quality Evaluation
Don't rely on RMSE alone: it disproportionately weights large errors, so high-volume series dominate the score. For retail forecasting:
- MAPE: interpretable, but unstable near zero
- sMAPE: symmetric variant that avoids the division-by-zero problem
- MASE: normalized against a seasonal naive forecast; excellent for cross-series comparison
- Quantile (pinball) loss: for probabilistic forecasts and interval coverage
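Hedged reference implementations of the point metrics above in numpy (the seasonal period m of the naive baseline is a parameter, and zero-handling is a modeling choice, not a standard):

```python
import numpy as np

def mape(y, yhat):
    """Mean absolute percentage error; unstable when y has zeros."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 100 * np.mean(np.abs(y - yhat) / np.abs(y))

def smape(y, yhat):
    """Symmetric MAPE: denominator uses both actual and forecast."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 100 * np.mean(2 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))

def mase(y, yhat, y_train, m=1):
    """Scaled by in-sample seasonal-naive error; < 1 beats the naive model."""
    y, yhat, y_train = (np.asarray(a, float) for a in (y, yhat, y_train))
    naive_error = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y - yhat)) / naive_error

print(mape([100, 200], [110, 190]))  # 7.5
```

For cross-series model comparison, MASE is the safest of the three: its scale is the naive forecast's error, so zero-heavy and high-volume series are comparable.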
Workflow
Start with EDA: visualizations, ADF stationarity test, STL decomposition, missing-value and outlier analysis. It takes 2-3 days but reveals the systemic data issues that block forecasting.
Then: baselines (seasonal naive, Prophet), feature engineering for LightGBM, neural architecture selection. Walk-forward validation at a realistic horizon. Deploy via an API with scheduled retraining through Airflow or Prefect.
Timelines: an MVP forecast for one data type takes 3-6 weeks; hierarchical forecasting with full automation, 2-5 months.