Spatial Risk Modeling#

This document explains the models used for spatial risk assessment. The examples focus on deforestation, but the same methods apply to any binary-outcome phenomenon, including:

  • Deforestation & forest degradation risk

  • Forest fire occurrence probability

  • Flooding risk zones

  • Disease outbreak spatial patterns

  • Landslide susceptibility

  • Species habitat suitability

  • Urban expansion patterns

  • Crop yield failure zones

Model Classification Summary#

All models predict binary outcomes (event occurs: yes/no) at the pixel/location level. The models can be categorized into:

  • Supervised Machine Learning Models: Trained on historical occurrence data with environmental/contextual predictors

  • Unsupervised/Heuristic Models: Rule-based approaches using spatial patterns

Models Description#

1. Moving Window (MW) Model#

Notebook: 4.mw_model.ipynb

Type: Unsupervised spatial heuristic model

Method:

  • Calculates local event rates within moving windows of different sizes (e.g., 5×5, 11×11, 21×21 pixels)

  • Uses historical patterns in the neighborhood to predict future risk

  • No machine learning training required

Key Features:

  • Window sizes: Typically 5, 11, and 21 pixels

  • Based on spatial proximity assumption: areas near recent events are at higher risk

  • Fast computation, no training phase needed

Output: Probability/risk map based on neighborhood event density

Application Examples:

  • Deforestation: Areas near recent forest loss

  • Fire risk: Zones near previous burn scars

  • Flooding: Areas near historically flooded zones

When to use: Quick baseline model, captures spatial clustering of events
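The neighborhood event rate described above can be sketched with NumPy alone. This is a minimal illustration, not the notebook's implementation: the function name `moving_window_rate`, the NaN padding at the raster border, and the toy 7×7 raster are all assumptions made for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def moving_window_rate(events, window):
    """Return the share of event pixels in a square window around each pixel."""
    if window % 2 == 0:
        raise ValueError("window must be an odd number of pixels")
    pad = window // 2
    # Pad with NaN so that border pixels average only over cells inside the raster.
    padded = np.pad(events.astype(float), pad, constant_values=np.nan)
    windows = sliding_window_view(padded, (window, window))
    return np.nanmean(windows, axis=(-2, -1))


# Toy 7x7 raster with a single historical event in the centre.
events = np.zeros((7, 7), dtype=np.uint8)
events[3, 3] = 1
risk = moving_window_rate(events, window=5)
```

Running the same function with several window sizes (for example 5, 11, and 21) gives one candidate risk map per size, which can then be compared during evaluation.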

2. Generalized Linear Model (GLM)#

Notebook: 5.2.far_glm.ipynb

Type: Supervised binary classification (regression-based)

Algorithm: Logistic Regression

Method:

  • Uses environmental and contextual variables to predict event probability

  • Linear combination of features with logistic transformation

  • Trained via maximum likelihood estimation

Features Used (examples vary by application):

  • Continuous variables (scaled): altitude, slope, distances to relevant features (roads, rivers, infrastructure, previous events)

  • Categorical variables: land use, protected status, soil type, administrative units

Application Examples:

  • Deforestation: altitude, slope, distance to roads/rivers/towns/forest edge, protected areas

  • Fire risk: temperature, humidity, wind speed, vegetation type, distance to settlements

  • Flooding: elevation, slope, distance to water bodies, soil permeability, drainage density

Training:

  • Algorithm: sklearn.linear_model.LogisticRegression

  • Sample points from event and non-event locations

  • Binary target: I(event_occurred) where 1 = event happened, 0 = no event

Output: Probability of event occurrence (0-1 scale, rescaled to 1-65535 for raster storage, with 0 reserved for no data)

Advantages: Fast to fit; coefficients are interpretable and show which factors drive risk

Limitations: Assumes linear relationships (in logit space), doesn’t model spatial autocorrelation
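As a rough illustration of the GLM workflow, the sketch below fits `sklearn.linear_model.LogisticRegression` to synthetic samples. The data, the two predictors (`dist_roads`, `slope`), and the "true" coefficients used to simulate outcomes are invented for the example; the scaling step mirrors the "continuous variables (scaled)" note above but is not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic predictors (hypothetical units: metres and degrees).
dist_roads = rng.uniform(0, 10_000, n)
slope = rng.uniform(0, 30, n)

# Synthetic truth: risk falls with distance to roads and with slope.
logit = 1.0 - 0.0006 * dist_roads - 0.05 * slope
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # 1 = event happened

X = np.column_stack([dist_roads, slope])
glm = make_pipeline(StandardScaler(), LogisticRegression())
glm.fit(X, y)

# Probability of the event (class 1) for each sample.
proba = glm.predict_proba(X)[:, 1]

# Coefficients on the scaled predictors, in logit units.
coefs = glm.named_steps["logisticregression"].coef_[0]
```

A negative coefficient means that risk decreases as the predictor increases, which is the kind of interpretation the GLM is chosen for.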

3. iCAR Model (Intrinsic Conditional Autoregressive)#

Notebook: 5.3.far_icar.ipynb

Type: Supervised Bayesian spatial classification

Algorithm: Bayesian hierarchical model with spatial random effects

Method:

  • Extends logistic regression by adding spatial random effects

  • Explicitly models spatial autocorrelation via neighborhood structure

  • Accounts for the fact that nearby locations tend to have similar risk (spatial dependence)

Features Used:

  • Same environmental predictors as GLM

  • Additional spatial component: Cell adjacency matrix (spatial neighborhood structure)
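To make the neighborhood structure concrete, the sketch below builds neighbor lists for a small grid of spatial cells and computes, for each cell, the mean of its neighbors' random effects. In an intrinsic CAR prior, that neighbor mean is the conditional expectation of a cell's effect. The 8-neighbor (queen) rule, the function name `grid_neighbors`, and the toy `rho` values are assumptions for illustration only.

```python
def grid_neighbors(n_rows, n_cols):
    """Return, for each cell (row-major id), the ids of its adjacent cells."""
    neighbors = []
    for r in range(n_rows):
        for c in range(n_cols):
            ids = [
                rr * n_cols + cc
                for rr in range(max(r - 1, 0), min(r + 2, n_rows))
                for cc in range(max(c - 1, 0), min(c + 2, n_cols))
                if (rr, cc) != (r, c)
            ]
            neighbors.append(ids)
    return neighbors


# Toy 3x3 grid of spatial cells with made-up random effects.
neighbors = grid_neighbors(3, 3)
rho = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Conditional mean of each cell's effect given its neighbors.
cond_mean = [sum(rho[i] for i in ids) / len(ids) for ids in neighbors]
```

Corner cells have 3 neighbors and the centre cell has 8, so cells with more neighbors are informed by more of the surrounding landscape.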

Training:

  • Bayesian inference via MCMC (Markov Chain Monte Carlo) sampling

  • Estimates both the regression coefficients (βs) and the spatial random effects (ρ, one value per spatial cell)

  • Computationally intensive (requires burn-in and sampling iterations)

Special Features:

  • Spatial random effects smooth predictions across space

  • Interpolates the cell-level spatial random effects (rho) to a finer grid for fine-scale predictions

  • Provides uncertainty estimates via posterior distributions

Output: Spatially-smoothed probability map with uncertainty estimates

Advantages: Accounts for spatial dependence, more realistic for clustered phenomena like fires, diseases, or deforestation

Limitations: Computationally expensive, requires careful tuning (MCMC iterations)

Application Examples:

  • Deforestation: Smooth risk transitions accounting for spatial contagion

  • Disease spread: Model spatial correlation in outbreak patterns

  • Fire risk: Account for fire spread patterns and neighborhood effects

4. Random Forest (RF)#

Notebook: 5.4.far_rf.ipynb

Type: Supervised binary classification (ensemble method)

Algorithm: Random Forest Classifier

Method:

  • Ensemble of decision trees trained on bootstrap samples

  • Each tree makes a prediction, and the final output is the average across trees

  • Captures complex non-linear relationships and interactions between features

Features Used:

  • Same predictors as GLM and iCAR (context-dependent)

  • Automatically handles feature interactions and non-linear effects

  • Can include temporal features, climate variables, socioeconomic data, etc.

Training:

  • Algorithm: sklearn.ensemble.RandomForestClassifier

  • Parameters: number of trees (typically 100), min samples per leaf, max depth

  • No explicit spatial modeling (though can include spatial coordinates)

Output: Probability of event occurrence (averaged from all trees)
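The sketch below shows how `sklearn.ensemble.RandomForestClassifier` can pick up an interaction that a plain logistic model would miss. The data are synthetic: the rule "fire only when hot and dry", the variable names, and the hyperparameter values are assumptions chosen for the example, not settings from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 3000

# Synthetic predictors; `noise` carries no information about the outcome.
temperature = rng.uniform(10, 40, n)
humidity = rng.uniform(10, 90, n)
noise = rng.normal(size=n)

# Synthetic rule: the event occurs only when it is both hot and dry.
y = ((temperature > 30) & (humidity < 40)).astype(int)
X = np.column_stack([temperature, humidity, noise])

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0)
rf.fit(X, y)

# Event probability, averaged over the trees.
proba = rf.predict_proba(X)[:, 1]

# Impurity-based importance of each predictor (sums to 1).
importances = rf.feature_importances_

# Probability for one hot, dry location.
hot_dry = rf.predict_proba(np.array([[35.0, 20.0, 0.0]]))[0, 1]
```

Here the uninformative `noise` column should receive a much lower importance than the two real drivers, which is how importance scores help screen predictors.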

Advantages:

  • Captures non-linear relationships and complex interactions

  • Robust to outliers; handling of missing values depends on the implementation and version

  • Feature importance scores show which variables matter most

  • Generally high predictive accuracy across diverse applications

Limitations:

  • “Black box” model (less interpretable than GLM)

  • Can overfit if not properly tuned

  • Computationally more expensive than GLM

Application Examples:

  • Fire risk: Complex interactions between weather, vegetation, human activity

  • Flooding: Non-linear relationships between rainfall, topography, land cover

  • Species distribution: Complex habitat suitability with multiple interacting factors

5. Benchmark/Stratification Model#

Notebook: 3.benchmark_jnr_model.ipynb

Type: Unsupervised rule-based spatial stratification

Method:

  • Stratifies landscape based on key risk factors:

    • Distance to relevant features (e.g., forest edge, water bodies, fault lines)

    • Administrative/ecological units (sub-regions, soil types, etc.)

  • Assigns historical event rates to each stratum

  • Deterministic assignment (no statistical learning)

Approach:

  1. Identify distance threshold where most events (e.g., 99.5%) occurred

  2. Divide landscape into distance bins from key feature

  3. Calculate historical event rate for each bin × category combination

  4. Apply these rates as vulnerability scores
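The four steps above can be sketched on a handful of pixels. All values here are made up: `dist_edge`, `jurisdiction`, the bin edges, and the way the threshold is only reported (rather than used to cap the outermost bin) are simplifying assumptions, not the JNR procedure itself.

```python
import numpy as np

# Hypothetical per-pixel inputs: distance to forest edge (m), jurisdiction id,
# and whether the event occurred historically (1 = yes).
dist_edge = np.array([10, 20, 150, 180, 400, 450, 30, 160], dtype=float)
jurisdiction = np.array([0, 0, 0, 0, 0, 0, 1, 1])
event = np.array([1, 0, 1, 0, 0, 0, 1, 1])

# Step 1: distance within which 99.5% of past events occurred.
threshold = np.quantile(dist_edge[event == 1], 0.995)

# Step 2: assign each pixel to a distance bin (edges chosen by hand here).
bin_edges = np.array([100.0, 300.0])
dist_bin = np.digitize(dist_edge, bin_edges)

# Step 3: historical event rate for each bin x jurisdiction combination.
rates = {}
for b in np.unique(dist_bin):
    for j in np.unique(jurisdiction):
        mask = (dist_bin == b) & (jurisdiction == j)
        if mask.any():
            rates[(int(b), int(j))] = float(event[mask].mean())

# Step 4: give every pixel the rate of its stratum as a vulnerability score.
vulnerability = np.array(
    [rates[(int(b), int(j))] for b, j in zip(dist_bin, jurisdiction)]
)
```

Because each score is a simple lookup of a historical rate, the resulting map can be audited stratum by stratum.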

Output: Vulnerability map with risk classes based on historical patterns

Advantages: Simple, transparent, auditable, follows established methodologies (e.g., JNR for deforestation)

Limitations: Cannot capture complex interactions; assumes future patterns will resemble past ones

Application Examples:

  • Deforestation: Distance to forest edge × jurisdictions (JNR methodology)

  • Fire risk: Distance to ignition sources × vegetation types

  • Flooding: Elevation zones × drainage basins

  • Landslides: Slope classes × geological units

Model Comparison Table#

| Model | Type | Supervised? | Spatial Modeling | Complexity | Interpretability |
|---|---|---|---|---|---|
| Moving Window (MW) | Heuristic | No | Implicit (neighborhood) | Low | High |
| GLM (Logistic) | Regression | Yes | No | Low | High |
| iCAR | Bayesian Spatial | Yes | Explicit (CAR structure) | High | Medium |
| Random Forest | Ensemble | Yes | No | Medium | Low |
| JNR Benchmark | Rule-based | No | Implicit (distance-based) | Low | High |

Data Requirements for Training vs. Prediction#

For SUPERVISED models (GLM, iCAR, Random Forest)#

Training Phase - Requires BOTH:

  • Y (target): Historical deforestation labels, stored as 0 = deforested, 1 = remained forest and inverted in the model formula so that the fitted target is 1 = deforested

  • X (features): Environmental and accessibility variables (altitude, slope, distances, protected areas, etc.)

Prediction Phase - Requires ONLY:

  • X (features): The same predictor variables for new/future areas

  • The trained model applies learned relationships to predict Y

For UNSUPERVISED models (Moving Window, JNR Benchmark)#

No Training Phase - They directly compute predictions from:

  • Historical deforestation patterns (used as input, not as labeled training data)

  • Spatial/distance features (forest edge distance, jurisdictions)

  • No Y/X distinction - they use deforestation history to create risk zones directly

Training Data#

All supervised models (GLM, iCAR, RF) use the same training data generated in:

Notebook: 5.1.far_models_sampling.ipynb

What is Y (Target/Dependent Variable)?#

The binary outcome for each sampled location between two time periods:

Raw coding of the sampled variable, by application:

  • Deforestation: 0 = deforested, 1 = remained forest

  • Fire: 0 = burned, 1 = not burned

  • Flooding: 0 = flooded, 1 = not flooded

  • Disease: 0 = outbreak occurred, 1 = no outbreak

Formula in code: I(event_occurred), an indicator equal to 1 where the event happened and 0 where it did not

Note

In the deforestation notebooks, the raw variable is stored with 1 = forest remained, so the target is written as I(1-deforestation), which equals 1 where deforestation occurred

What are X’s (Features/Independent Variables)?#

The predictor variables depend on your specific application. Here are examples across different domains:

For Deforestation Risk:

  • Environmental: altitude, slope, soil_type

  • Accessibility: dist_roads, dist_rivers, dist_towns, dist_forest_edge

  • Policy: protected_areas, indigenous_territories, jurisdiction

For Fire Risk:

  • Climate: temperature, humidity, wind_speed, precipitation

  • Vegetation: vegetation_type, ndvi, fuel_load, canopy_cover

  • Accessibility: dist_settlements, dist_roads, dist_previous_fires

  • Temporal: season, fire_season_index

For Flooding Risk:

  • Topography: elevation, slope, aspect, topographic_wetness_index

  • Hydrology: dist_rivers, drainage_density, flow_accumulation

  • Land cover: imperviousness, land_use, soil_permeability

  • Infrastructure: dist_drainage_systems, dams_upstream

For Disease Outbreak:

  • Climate: temperature, humidity, rainfall

  • Demographics: population_density, age_structure, mobility_patterns

  • Infrastructure: healthcare_access, sanitation_quality

  • Proximity: dist_previous_cases, dist_high_risk_areas

Spatial Variables (for iCAR):

  • cell: Spatial cell ID for modeling neighborhood structure

  • X, Y: Geographic coordinates

Sampling Strategy#

  • Stratified random sampling from event and non-event locations

  • Typically 10,000+ samples (adaptive based on study area and event prevalence)

  • Spatial cell IDs (grid cells of ~10×10 km) to account for spatial autocorrelation

  • Balanced or weighted representation of outcome classes
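A balanced version of this sampling scheme might look like the sketch below. The function name `stratified_sample`, the equal number of samples per class, and the cell size expressed in pixels are assumptions for illustration; the notebook may weight classes differently.

```python
import numpy as np


def stratified_sample(event_raster, n_per_class, cell_size, rng):
    """Sample pixels from each outcome class and attach a spatial cell ID."""
    rows_all, cols_all = [], []
    for value in (0, 1):
        rows, cols = np.nonzero(event_raster == value)
        n_pick = min(n_per_class, rows.size)
        pick = rng.choice(rows.size, size=n_pick, replace=False)
        rows_all.append(rows[pick])
        cols_all.append(cols[pick])
    rows = np.concatenate(rows_all)
    cols = np.concatenate(cols_all)
    # Row-major ID of the coarse cell that contains each sampled pixel.
    n_cell_cols = -(-event_raster.shape[1] // cell_size)  # ceiling division
    cell = (rows // cell_size) * n_cell_cols + cols // cell_size
    return rows, cols, event_raster[rows, cols], cell


# Toy 20x20 raster with events confined to the top-left 5x5 block.
event_raster = np.zeros((20, 20), dtype=np.uint8)
event_raster[:5, :5] = 1
rng = np.random.default_rng(0)
rows, cols, y, cell = stratified_sample(
    event_raster, n_per_class=10, cell_size=5, rng=rng
)
```

The returned `cell` array plays the role of the spatial cell ID used by the iCAR model to link each sample to its neighborhood.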

How Training Works#

Supervised models learn the relationship:

P(event) = f(X1, X2, X3, ..., Xn)

Examples:

  • P(deforestation) = f(altitude, slope, dist_roads, dist_towns, dist_edge, protected_areas, ...)

  • P(fire) = f(temperature, humidity, wind_speed, vegetation_type, dist_settlements, ...)

  • P(flooding) = f(elevation, slope, dist_rivers, rainfall, land_use, soil_type, ...)

Training process:

  1. Sample locations where we KNOW the outcome (Y = event occurred or not)

  2. Extract predictor values (X’s) at those locations

  3. Fit model to learn: Given these X values, what’s the probability of the event?

  4. Apply learned model to predict event probability for ALL locations using their X values
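Step 4 amounts to reshaping the predictor rasters into a table, predicting, and reshaping back. The sketch below uses a stand-in `toy_model` with made-up coefficients in place of a fitted model; `predict_raster` and the small rasters are likewise hypothetical.

```python
import numpy as np


def predict_raster(stack, predict_proba):
    """Apply a per-pixel model to a (n_features, n_rows, n_cols) raster stack."""
    n_features, n_rows, n_cols = stack.shape
    X = stack.reshape(n_features, -1).T  # one row per pixel
    valid = ~np.isnan(X).any(axis=1)  # skip pixels with missing predictors
    out = np.full(n_rows * n_cols, np.nan)
    if valid.any():
        out[valid] = predict_proba(X[valid])
    return out.reshape(n_rows, n_cols)


def toy_model(X):
    """Stand-in for a fitted model: logistic function with made-up coefficients."""
    z = 0.5 - 0.002 * X[:, 0] - 0.05 * X[:, 1]
    return 1 / (1 + np.exp(-z))


dist_roads = np.array([[0.0, 1000.0, 2000.0], [500.0, np.nan, 250.0]])
slope = np.array([[0.0, 10.0, 20.0], [5.0, 5.0, 5.0]])
risk = predict_raster(np.stack([dist_roads, slope]), toy_model)
```

With a scikit-learn classifier, the second argument could be something like `lambda X: model.predict_proba(X)[:, 1]`.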

Model Evaluation#

Notebook: 6.models_evaluation.ipynb

All models are compared using validation metrics on coarse grid cells:

Metrics:

  • R²: Coefficient of determination (share of the variance in observations explained by the predictions)

  • RMSE: Root Mean Square Error (average prediction error)

  • wRMSE: Weighted RMSE (accounts for varying grid cell sizes)

  • MedAE: Median Absolute Error (robust to outliers)
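The four metrics can be written in a few lines of NumPy. The definitions below are the usual textbook ones and are assumptions about what the evaluation notebook computes: the first metric is taken to be R² as 1 minus the ratio of residual to total sum of squares, and wRMSE is taken to weight squared errors by a per-cell weight such as cell area.

```python
import numpy as np


def r_squared(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(obs, pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((obs - pred) ** 2)))


def wrmse(obs, pred, weights):
    """RMSE with squared errors weighted per grid cell."""
    return float(np.sqrt(np.sum(weights * (obs - pred) ** 2) / np.sum(weights)))


def medae(obs, pred):
    """Median absolute error."""
    return float(np.median(np.abs(obs - pred)))


# Toy observed vs. predicted values for four coarse grid cells.
obs = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([1.0, 2.0, 3.0, 6.0])
weights = np.array([1.0, 1.0, 1.0, 0.0])
```

In this toy case a single large error drives RMSE to 1 while MedAE stays at 0, which illustrates why MedAE is described as robust to outliers.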

Evaluation Periods:

  • Calibration: Training period (e.g., 2015-2020)

  • Validation: Testing period (e.g., 2020-2024)

  • Historical: Full historical period (e.g., 2015-2024)

  • Forecast: Future projections using latest data

Which Model to Choose?#

Use Moving Window (MW) when#

  • Quick assessment needed

  • Limited computational resources

  • Events are highly spatially clustered (fires, deforestation, disease outbreaks)

  • Transparency is critical

  • Neighborhood effects dominate other factors

Use GLM when#

  • Need interpretable coefficients (understand which factors increase/decrease risk)

  • Want to quantify driver importance and effect sizes

  • Computational efficiency is important

  • Linear relationships (in logit space) are reasonable

  • Regulatory or policy context requires explainability

Use iCAR when#

  • Spatial autocorrelation is strong (contagious processes like fires, disease, deforestation)

  • Need spatially-smooth predictions without artificial boundaries

  • Have computational resources for MCMC sampling

  • Want uncertainty quantification and credible intervals

  • Spatial spillover effects are important

Use Random Forest when#

  • Maximum predictive accuracy is the priority

  • Relationships are complex/non-linear (e.g., climate thresholds, tipping points)

  • Feature interactions are important (e.g., temperature × humidity for fire risk)

  • Have many predictors and unsure which matter

  • Less concerned about interpretability, more about prediction performance

Use Benchmark/Stratification when#

  • Following established standards (e.g., JNR for REDD+, official flooding protocols)

  • Need simple, auditable, transparent methodology

  • Historical patterns are reliable predictors of the future

  • Administrative or jurisdictional reporting required

  • Stakeholder communication and buy-in are critical

  • Limited data or technical capacity

References#

  • riskmapjnr: Python package for JNR risk mapping methodology

  • forestatrisk: Python package for deforestation risk modeling (GLM, iCAR)

  • sklearn: Scikit-learn, used for the logistic regression and Random Forest implementations

Additional Notes#

Prediction Output Format#

All models produce raster maps with values 0-65535 representing deforestation probability:

  • 0 = No data / non-forest

  • 1-65535 = Risk level (rescaled probability)
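One way to produce this encoding is a linear rescaling of probabilities from [0, 1] to [1, 65535], with 0 kept for no-data pixels. The exact formula used by the notebooks is not stated here, so the mapping below and the function name `rescale_to_uint16` are assumptions.

```python
import numpy as np


def rescale_to_uint16(prob, nodata_mask):
    """Map probabilities in [0, 1] to 1-65535 and set no-data pixels to 0."""
    # Replace no-data values first so NaN never reaches the integer cast.
    filled = np.where(nodata_mask, 0.0, prob)
    scaled = np.rint(1 + filled * 65534).astype(np.uint16)
    scaled[nodata_mask] = 0
    return scaled


prob = np.array([0.0, 0.5, 1.0, np.nan])
nodata = np.isnan(prob)
encoded = rescale_to_uint16(prob, nodata)
```

Storing integers in a 16-bit raster keeps files small while preserving fine probability steps of about 1/65534.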

Spatial Resolution#

  • Typically 30m pixels (matching forest cover data)

  • Coarse grid evaluation: 300+ pixel cells for validation
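Coarse-grid validation compares totals per coarse cell rather than individual pixels. A minimal sketch of the aggregation step is shown below; the function name `block_sum`, the trimming of incomplete edge blocks, and the toy 4×4 raster are assumptions for illustration.

```python
import numpy as np


def block_sum(arr, block):
    """Sum a 2-D raster over non-overlapping block x block cells."""
    n_rows, n_cols = arr.shape
    # Trim edge pixels that do not fill a complete coarse cell.
    r = n_rows - n_rows % block
    c = n_cols - n_cols % block
    trimmed = arr[:r, :c]
    return trimmed.reshape(r // block, block, c // block, block).sum(axis=(1, 3))


pixels = np.arange(16).reshape(4, 4)
coarse = block_sum(pixels, block=2)
```

Applying the same aggregation to the predicted probabilities and to the observed events gives one predicted and one observed value per coarse cell, which can then be fed to the validation metrics.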

Temporal Periods#

  • Calibration: Model training period

  • Validation: Independent test period

  • Historical: Full observed period (for final model)

  • Forecast: Future projection period