A lightweight machine-learning forecast of feeder-level congestion across PG&E's distribution system, built entirely on public data, with a headroom RMSE of roughly 240 kW.
Random Forest models trained on weather, load shapes, EV adoption, and DER data predicted feeder headroom ~6× more accurately than linear baselines — offering utilities a fast, scalable complement to PG&E's labor-intensive ICA studies.
California's distribution grid is being asked to absorb electrification, rooftop solar, and EV charging far faster than its planning assumptions anticipated. The tool utilities lean on, PG&E's Integration Capacity Analysis (ICA) maps, gives only static snapshots, takes weeks to regenerate, and is updated only quarterly.
The result is a planning gap: regulators know some feeders are heading toward congestion, but can't see which ones, when, or under what scenarios. Programs that depend on locational targeting — managed EV charging, behind-the-meter storage incentives — have historically underperformed for exactly this reason.
We built a pipeline that joins hourly weather data (CIMIS), granular residential load-shape profiles (CALMAC), ZIP-level EV adoption (CEC), and feeder-level Integration Capacity Analysis values (PG&E GRIP) into a single feature table at the feeder × month × hour grain.
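A minimal sketch of how a feature table at that grain might be assembled with pandas. The column names (`feeder_id`, `zip`, `temp_c`, `res_load_index`, `ev_count`) and the toy rows are assumptions for illustration, not the real schemas of the CIMIS, CALMAC, CEC, or PG&E sources.

```python
import pandas as pd

# Hypothetical toy inputs; the real source schemas differ.
ica = pd.DataFrame({
    "feeder_id": ["F1", "F1", "F2", "F2"],
    "zip": ["94501", "94501", "95616", "95616"],
    "month": [7, 7, 7, 7],
    "hour": [17, 18, 17, 18],
    "headroom_kw": [1200.0, 900.0, 3100.0, 2800.0],
})
weather = pd.DataFrame({
    "zip": ["94501", "94501", "95616", "95616"],
    "month": [7, 7, 7, 7],
    "hour": [17, 18, 17, 18],
    "temp_c": [24.0, 22.5, 35.0, 33.0],
})
load_shape = pd.DataFrame({
    "month": [7, 7],
    "hour": [17, 18],
    "res_load_index": [0.82, 0.95],
})
ev = pd.DataFrame({"zip": ["94501", "95616"], "ev_count": [1500, 900]})


def build_feature_table(ica, weather, load_shape, ev):
    """Left-join every source onto the feeder x month x hour rows."""
    # validate="many_to_one" raises if a right-hand key is duplicated,
    # which would otherwise silently multiply feeder rows.
    out = ica.merge(weather, on=["zip", "month", "hour"],
                    how="left", validate="many_to_one")
    out = out.merge(load_shape, on=["month", "hour"],
                    how="left", validate="many_to_one")
    out = out.merge(ev, on="zip", how="left", validate="many_to_one")
    return out


features = build_feature_table(ica, weather, load_shape, ev)
print(features)
```

Left joins keep one row per feeder × month × hour, so a missing weather or EV record shows up as a NaN to investigate rather than a dropped feeder.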
On top of that table we trained four model families — OLS, Ridge, Lasso, and Random Forest — across three reframings of the same underlying question: continuous headroom forecasting for short-term operational planning, binary classification for near-term overload risk, and multi-class tiering for multi-year capital planning.
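To make the setup concrete, here is a hedged sketch using scikit-learn on synthetic data. The features, the DER × midday interaction, the hyperparameters, and the 2,000 kW and 3,000 kW thresholds are all assumptions for illustration; the project's actual features and tier cut-offs are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for real features (assumed, not project data).
hour = rng.integers(0, 24, n)
der_kw = rng.uniform(0, 2000, n)
temp_c = rng.uniform(5, 40, n)

# Synthetic headroom with a DER x time-of-day interaction that a
# purely additive linear model cannot represent.
midday = ((hour >= 10) & (hour <= 15)).astype(float)
headroom_kw = 3000 - 40 * temp_c + der_kw * midday + rng.normal(0, 100, n)

X = np.column_stack([hour, der_kw, temp_c])

# Three framings of the same target (thresholds are hypothetical).
y_reg = headroom_kw                                   # continuous headroom
y_bin = (headroom_kw < 2000).astype(int)              # 1 = congested
y_tier = np.digitize(headroom_kw, bins=[2000, 3000])  # 0 = least headroom, 2 = most

X_tr, X_te, y_tr, y_te = train_test_split(X, y_reg, test_size=0.25, random_state=0)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse[name] = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

print(rmse)
```

The binary and tier labels would be fitted the same way with classifier counterparts such as `RandomForestClassifier`; only the regression loop is shown to keep the sketch short.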
Across all three problems, ensemble methods captured non-linear interactions between customer mix, DER penetration, weather, and time-of-day that linear models simply could not see. For the regression task, predicting continuous headroom, the gap was roughly sixfold in RMSE.
| Model | RMSE | R² | Notes |
|---|---|---|---|
| OLS | ~1,580 kW | 0.71 | Misses non-linear DER × time-of-day effects |
| Ridge | ~1,560 kW | 0.72 | Marginal gain over OLS |
| Lasso | ~1,540 kW | 0.72 | Useful for feature selection only |
| Random Forest | ~240 kW | 0.99 | Captures interaction structure cleanly |
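As a quick check of the size of the gap, the RMSE ratios can be computed directly from the approximate figures in the table:

```python
# Approximate RMSE values (kW) copied from the table above.
rmse_kw = {"OLS": 1580, "Ridge": 1560, "Lasso": 1540, "Random Forest": 240}

rf = rmse_kw["Random Forest"]
ratios = {
    name: round(value / rf, 1)
    for name, value in rmse_kw.items()
    if name != "Random Forest"
}
print(ratios)  # {'OLS': 6.6, 'Ridge': 6.5, 'Lasso': 6.4}
```

Each linear baseline's error is about six and a half times that of the Random Forest.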
For the binary congested-vs-not task, the tuned Random Forest hit 97.95% accuracy with a ROC-AUC of 0.997. The metric that matters most operationally — recall on the congested class — came in above 95%, meaning the model rarely tells a planner "you're fine" when in fact the feeder is hitting its limit.
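A small sketch of how those three metrics relate, using scikit-learn on ten made-up predictions. The labels, scores, and the 0.5 decision threshold are illustrative assumptions, not project outputs.

```python
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# 1 = congested, 0 = not congested (made-up labels and model scores).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.3, 0.4, 0.2, 0.1, 0.1, 0.05, 0.6]

# Hypothetical decision threshold of 0.5.
y_pred = [int(s >= 0.5) for s in y_score]

accuracy = accuracy_score(y_true, y_pred)
# Recall on the congested class: the share of truly congested feeders
# that the model flags. A miss here is the costly "you're fine" error.
congested_recall = recall_score(y_true, y_pred, pos_label=1)
# ROC-AUC uses the raw scores, so it does not depend on the threshold.
auc = roc_auc_score(y_true, y_score)

print(accuracy, congested_recall, auc)
```

In this toy case accuracy is 0.8 while congested-class recall is only 0.75, which is why recall is reported separately from accuracy.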
The multi-class tiering model, which I led, performed strongly on the dominant medium-risk tier and meaningfully better than logistic regression on the high-risk extremes — exactly the feeders utilities most need to identify for capital planning.
The honest framing matters here. These models don't enforce power-system physics. They smooth over peak events because they're trained at month-hour resolution, so a single extreme day is averaged in with the typical ones. EV data is annual and tied to registration addresses, not actual charging locations. All of these limitations tend to bias the model toward underestimating risk.
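The smoothing effect is easy to see with made-up numbers. The sketch below assumes a hypothetical feeder with a 5,000 kW limit and one heat-wave day in a 30-day month; none of the values come from the project data.

```python
# 6 pm load on each day of a 30-day month (kW): 29 typical days
# plus one heat-wave peak. All values are illustrative.
load_6pm_kw = [4000.0] * 29 + [5500.0]
capacity_kw = 5000.0  # hypothetical feeder limit

month_hour_mean_kw = sum(load_6pm_kw) / len(load_6pm_kw)
peak_kw = max(load_6pm_kw)

headroom_from_mean_kw = capacity_kw - month_hour_mean_kw
headroom_at_peak_kw = capacity_kw - peak_kw

print(month_hour_mean_kw)     # 4050.0
print(headroom_from_mean_kw)  # 950.0
print(headroom_at_peak_kw)    # -500.0
```

At month-hour resolution the feeder appears to have 950 kW of headroom, even though it exceeded its limit by 500 kW on the peak day.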
But for the use case we set out to address — giving regulators and program administrators a fast, scenario-friendly first pass before commissioning full ICA studies — the models work. A feeder flagged as Tier 2 deserves engineering attention. A managed-charging pilot rolled out to flagged ZIPs gets meaningfully better locational targeting than the historical baseline.
The hardest part wasn't the modeling — it was getting four messy public datasets to agree on what a "feeder" was. If I redid this, I'd start with the join logic and work outward.