How to Evaluate Weather Forecasting Models for Extreme Events: A Step-by-Step Guide

Introduction

Extreme weather events—such as record-breaking heatwaves, cold snaps, and storms—cause hundreds of billions of dollars in damages annually and threaten lives. Accurate forecasts are crucial for early warning systems, but a recent study in Science Advances reveals a critical gap: artificial intelligence (AI) weather models, despite their rapid advances, still underperform traditional physics-based models when predicting these exceptional events. This step-by-step guide will help you understand the trade-offs between AI and traditional models, evaluate their performance for extreme weather, and make informed decisions about which forecasting approach to rely on. Whether you’re a meteorologist, emergency planner, or data scientist, these steps will clarify why old-fashioned physics still holds the edge for extremes—and how to use both methods wisely.

Source: www.carbonbrief.org

What You Need

Basic understanding of weather modeling concepts (e.g., numerical weather prediction, machine learning)
Access to historical weather data for training and verification (e.g., from NOAA, ECMWF, or reanalysis datasets)
Record-breaking event catalogs (such as the 2018 and 2020 extreme events used in the study)
Performance metrics (frequency and intensity comparison, e.g., using RMSE, bias, or hit rates)
Computational resources or model output from both physics-based and AI models
Critical thinking to avoid over-reliance on any single method
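
The metrics named in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the study's verification code: the function names, the sample temperatures, and the 40-degree threshold are all assumptions made up for the example.

```python
import math

def rmse(forecast, observed):
    """Root-mean-square error between paired forecast and observed values."""
    squared = [(f - o) ** 2 for f, o in zip(forecast, observed)]
    return math.sqrt(sum(squared) / len(squared))

def mean_bias(forecast, observed):
    """Mean of (forecast - observed); a negative value means underprediction."""
    errors = [f - o for f, o in zip(forecast, observed)]
    return sum(errors) / len(errors)

def hit_rate(forecast, observed, threshold):
    """Fraction of observed threshold exceedances that the forecast also exceeded."""
    events = [(f, o) for f, o in zip(forecast, observed) if o >= threshold]
    if not events:
        return float("nan")  # no observed events, so the rate is undefined
    hits = sum(1 for f, _ in events if f >= threshold)
    return hits / len(events)

# Hypothetical peak temperatures in degrees Celsius, for illustration only.
observed = [31.0, 35.5, 40.2, 42.8, 38.1]
forecast = [30.5, 35.0, 38.9, 40.1, 38.4]

print("RMSE:", round(rmse(forecast, observed), 2))
print("Mean bias:", round(mean_bias(forecast, observed), 2))
print("Hit rate at 40 C:", hit_rate(forecast, observed, threshold=40.0))
```

RMSE alone can hide a tail problem, because a model that is accurate on ordinary days scores well overall. Reporting the bias and the hit rate for a high threshold alongside it makes underprediction of extremes visible.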

Step-by-Step Guide

Step 1: Understand the Strengths of Traditional Physics-Based Models

Traditional numerical weather prediction models are built on fundamental laws of physics: equations that simulate atmospheric and oceanic processes. These models have been refined over decades and excel at capturing rare, record-breaking events because they simulate the atmosphere directly rather than learning from historical patterns. The study found that physics-based models accurately reproduced the frequency and intensity of thousands of extreme hot, cold, and windy events from 2018 and 2020. To evaluate them, examine their performance on the extreme tails of the distribution, which is where AI models struggle. For a deeper dive, see Step 3.
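
One way to examine performance on the tails is to compute the forecast error only for cases where the observation itself was extreme. The sketch below assumes a nearest-rank quantile as the tail cutoff; the function names, the 0.8 quantile, and both sets of model values are hypothetical and chosen only to show the idea.

```python
import math

def tail_threshold(observed, quantile):
    """Nearest-rank empirical quantile of the observed values."""
    ranked = sorted(observed)
    rank = max(1, math.ceil(quantile * len(ranked)))
    return ranked[rank - 1]

def mean_error(forecast, observed):
    """Mean of (forecast - observed) over all cases."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(observed)

def tail_mean_error(forecast, observed, quantile=0.9):
    """Mean error restricted to cases whose observation lies in the upper tail."""
    cutoff = tail_threshold(observed, quantile)
    errors = [f - o for f, o in zip(forecast, observed) if o >= cutoff]
    return sum(errors) / len(errors)

# Hypothetical daily maximum temperatures (degrees Celsius), illustration only.
observed = [22, 24, 25, 27, 28, 30, 31, 36, 41, 44]
model_a = [22, 24, 26, 27, 28, 29, 31, 34, 37, 39]  # stands in for a model that damps extremes
model_b = [21, 25, 25, 28, 27, 30, 32, 36, 40, 45]  # stands in for a model that tracks the tail

for name, model in (("model_a", model_a), ("model_b", model_b)):
    print(name,
          "overall:", round(mean_error(model, observed), 2),
          "tail:", round(tail_mean_error(model, observed, quantile=0.8), 2))
```

In this made-up data the first model's overall error looks modest, while its tail error is several degrees too low. That is the pattern to look for when comparing models on extremes.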

Step 2: Recognize AI Model Limitations for Extremes

AI models (e.g., graph neural networks or transformer-based forecasts) are trained on historical data. As study author Prof. Sebastian Engelke warns, they are “relatively constrained to the range of this dataset.” When presented with conditions never seen before, such as a record-breaking temperature, the AI tends to underestimate both their likelihood and magnitude. Check any AI forecast against the historical record; if the forecast event exceeds the 99th percentile, suspect underprediction. See Step 5 for how to cross-check the forecast against a physics-based model.
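
The percentile check described above might look like the following sketch. It assumes the historical record is a plain list of values for one location and variable; the function names and the synthetic record are illustrative assumptions, not part of the study.

```python
from bisect import bisect_left

def empirical_percentile(history, value):
    """Percentage of historical values strictly below `value`."""
    ranked = sorted(history)
    return 100.0 * bisect_left(ranked, value) / len(ranked)

def needs_extra_scrutiny(history, forecast_value, percentile_cutoff=99.0):
    """True when the forecast sits at or above the cutoff percentile of the record."""
    return empirical_percentile(history, forecast_value) >= percentile_cutoff

def exceeds_record(history, forecast_value):
    """True when the forecast is higher than anything in the record."""
    return forecast_value > max(history)

# Synthetic record of 200 values from 20.0 to 39.9, for illustration only.
history = [20 + i / 10 for i in range(200)]

for value in (30.05, 39.85, 41.0):
    print(value,
          "percentile:", empirical_percentile(history, value),
          "scrutinize:", needs_extra_scrutiny(history, value),
          "record:", exceeds_record(history, value))
```

A forecast flagged here is not necessarily wrong. The flag only marks cases where the AI model is operating near or beyond the edge of its training range and deserves a second opinion.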

Step 3: Compare Model Outputs Against Record-Breaking Benchmarks

Gather a set of observed extreme events (e.g., from the 2018 and 2020 catalogs). Run both AI and traditional models on these cases. For each event, record:

Predicted intensity (e.g., peak temperature, wind speed)
Predicted frequency (how often such an event is forecast within a time window)
Actual observed intensity and frequency

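Then calculate the bias (forecast minus observed) for both intensity and frequency. The study found that AI models systematically underestimated both. Use statistical tests (e.g., Kolmogorov-Smirnov) to check whether the differences between the forecast and observed distributions are significant. This comparison is the basis for the study’s “warning shot” against replacing traditional models too hastily.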
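
As a rough sketch of this comparison, the code below computes the mean intensity bias and a two-sample Kolmogorov-Smirnov statistic using only the standard library. The p-value uses the large-sample (asymptotic) formula, which is only approximate for small event sets like this one; in practice a library routine such as `scipy.stats.ks_2samp` is a safer choice. All names and numbers here are assumptions for illustration.

```python
import math
from bisect import bisect_right

def mean_bias(forecast, observed):
    """Mean of (forecast - observed) over paired events."""
    return sum(f - o for f, o in zip(forecast, observed)) / len(observed)

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

def ks_p_value_asymptotic(d, n, m, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the survival function is effectively 1 this close to zero
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

# Hypothetical peak intensities for eight events (degrees Celsius).
observed = [38, 40, 41, 43, 44, 45, 47, 48]
ai_forecast = [35, 36, 37, 38, 39, 40, 41, 42]

d = ks_statistic(ai_forecast, observed)
p = ks_p_value_asymptotic(d, len(ai_forecast), len(observed))
print("Mean bias:", mean_bias(ai_forecast, observed))
print("KS statistic:", d, "approx. p-value:", round(p, 3))
```

The bias tells you the direction and size of the error, while the KS statistic tells you whether the whole forecast distribution is shifted relative to the observations. Running the same comparison for the physics-based model gives the benchmark.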

Step 4: Evaluate the Training Data Quality for AI Models

AI performance depends heavily on the training dataset. Ensure your training data includes a sufficient number of extreme events. If the dataset is dominated by “normal” weather, the AI will be biased toward the mean; in that case, consider augmenting it with synthetic extremes or reanalysis data. For record-breaking predictions, consider whether the AI has ever seen anything comparable. If not, its forecasts are unreliable, and you should trust physics-based models for such scenarios.
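
A quick audit of the training set can make these questions concrete. The sketch below counts how many training samples fall in the tail and how many resemble a given target event. The function names, the minimum count of 30, and the 2-degree margin are arbitrary assumptions for illustration; suitable values depend on the variable and the model.

```python
def tail_coverage(training_values, threshold):
    """Count and fraction of training samples at or above `threshold`."""
    count = sum(1 for v in training_values if v >= threshold)
    return count, count / len(training_values)

def has_enough_extremes(training_values, threshold, min_count=30):
    """True when the tail holds at least `min_count` samples (an assumed rule of thumb)."""
    count, _ = tail_coverage(training_values, threshold)
    return count >= min_count

def comparable_cases(training_values, target_value, margin):
    """Number of training samples within `margin` of the target event."""
    return sum(1 for v in training_values if abs(v - target_value) <= margin)

# Hypothetical training set dominated by ordinary days (degrees Celsius).
training = [25.0] * 900 + [35.0] * 95 + [40.0] * 5

count, fraction = tail_coverage(training, threshold=40.0)
print("Samples at or above 40 C:", count, "fraction:", fraction)
print("Enough extremes:", has_enough_extremes(training, threshold=40.0))
print("Cases comparable to a 43 C event:", comparable_cases(training, 43.0, margin=2.0))
```

A count of zero comparable cases is the situation the step warns about: the model has nothing similar to learn from, so its forecast for that event is an extrapolation.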

Step 5: Use Physics-Based Models as a Baseline for Validation

Even when using AI for routine forecasts, always compare its output against a physics-based model when an extreme event is predicted. Create a checklist:

Does the AI forecast show an intensity >2 standard deviations above the mean?
Does the physics model agree?
If not, investigate the discrepancy—likely the AI is underestimating.
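
The checklist above can be sketched as a small function. This version treats the check symmetrically, so it also flags the case where the physics model signals an extreme and the AI does not. The climatology values, the function names, and the returned messages are assumptions for the example.

```python
import statistics

def z_score(value, climatology):
    """Number of sample standard deviations between `value` and the climatological mean."""
    mean = statistics.fmean(climatology)
    std = statistics.stdev(climatology)
    return (value - mean) / std

def cross_check(ai_value, physics_value, climatology, z_cutoff=2.0):
    """Compare whether each model signals an extreme relative to climatology."""
    ai_extreme = z_score(ai_value, climatology) > z_cutoff
    physics_extreme = z_score(physics_value, climatology) > z_cutoff
    if ai_extreme and physics_extreme:
        return "both models signal an extreme"
    if ai_extreme != physics_extreme:
        return "models disagree: investigate"
    return "no extreme signalled"

# Hypothetical climatology for one location and date (degrees Celsius).
climatology = [28, 30, 31, 29, 32, 30, 27, 33, 30, 30]

print(cross_check(33.0, 36.0, climatology))  # AI below the cutoff, physics above it
print(cross_check(35.0, 36.0, climatology))
print(cross_check(31.0, 31.0, climatology))
```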

The laws of physics still govern the atmosphere; AI’s pattern recognition is an approximation. Using both models in an ensemble can improve overall reliability.

Step 6: Implement a Hybrid Forecasting Strategy

Given that AI excels at short-term, routine forecasts with lower computational costs, and physics-based models outperform for extremes, adopt a tiered approach:

Use AI models for initial, fast guidance (e.g., 0–7 day forecasts).
Apply physics-based models for verification when AI signals an extreme.
For official warnings, blend outputs using weighting that favors physics for rare events.

This leverages the strengths of each while compensating for weaknesses.
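
One possible way to implement the blending in the third tier is a weight that shifts toward the physics-based model as the event becomes rarer. The sketch below is only an illustration: the linear ramp, the 0.5 and 0.9 weights, and the 95th-percentile cutoff are assumptions, not values from the study or from any operational system.

```python
def physics_weight(rarity_percentile, base_weight=0.5, rare_weight=0.9, rare_cutoff=95.0):
    """Weight on the physics model; ramps linearly from base to rare above the cutoff."""
    if rarity_percentile <= rare_cutoff:
        return base_weight
    fraction = (rarity_percentile - rare_cutoff) / (100.0 - rare_cutoff)
    return base_weight + fraction * (rare_weight - base_weight)

def blend(ai_value, physics_value, rarity_percentile):
    """Weighted average of the two forecasts, favoring physics for rare events."""
    w = physics_weight(rarity_percentile)
    return w * physics_value + (1.0 - w) * ai_value

# Hypothetical forecasts (degrees Celsius) at increasing rarity.
print(blend(30.0, 32.0, rarity_percentile=50.0))
print(round(blend(36.0, 40.0, rarity_percentile=97.5), 2))
print(round(blend(38.0, 42.0, rarity_percentile=100.0), 2))
```

The rarity percentile could come from an empirical comparison of the forecast against the historical record. Any real weighting scheme should be calibrated against past events rather than set by hand.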

Tips for Success

Don’t abandon traditional models prematurely. The study is a “warning shot”—AI is not a silver bullet for extremes. Maintain both systems in parallel.
Monitor model drift. As the climate changes, the definition of “record-breaking” shifts. Physics-based models remain robust because they don’t rely on historical norms.
Incorporate ensemble methods. Use multiple AI and physics models to gauge uncertainty. For extremes, a large spread between them is a red flag.
Communicate uncertainty honestly. If an AI model is known to underestimate extremes, adjust forecasts upward or note the limitation to end users.
Keep training data updated. Adding recent extreme events can help AI models, but retraining cannot overcome the fundamental shortage of rare data.
Plan for continuous evaluation. Re-run the 2018/2020 comparison annually with new models to see if AI improves on extremes.
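
The ensemble tip above can be made concrete with a simple spread check. The sketch below measures the standard deviation across all members and the gap between the AI and physics group means. The member names, forecast values, typical spread, and factor of 2 are hypothetical choices for the example.

```python
import statistics

def ensemble_spread(member_forecasts):
    """Sample standard deviation across ensemble members."""
    return statistics.stdev(member_forecasts)

def group_gap(ai_forecasts, physics_forecasts):
    """Physics group mean minus AI group mean; positive means AI is lower."""
    return statistics.fmean(physics_forecasts) - statistics.fmean(ai_forecasts)

def spread_is_red_flag(member_forecasts, typical_spread, factor=2.0):
    """True when the spread exceeds `factor` times the spread seen on ordinary days."""
    return ensemble_spread(member_forecasts) > factor * typical_spread

# Hypothetical peak-temperature forecasts (degrees Celsius) for one event.
ai_members = {"ai_a": 36.0, "ai_b": 35.5}
physics_members = {"physics_a": 41.0, "physics_b": 40.5}
all_values = list(ai_members.values()) + list(physics_members.values())

print("Spread:", round(ensemble_spread(all_values), 2))
print("Physics minus AI:", group_gap(list(ai_members.values()), list(physics_members.values())))
print("Red flag:", spread_is_red_flag(all_values, typical_spread=1.0))
```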

By following these steps, you can make informed choices about weather forecasting models, ensuring record-breaking events don’t catch systems off guard. The key takeaway: AI is a powerful tool, but for extremes, physics still rules.