Spatial Risk Modeling#
This document provides a clear explanation of different models used for spatial risk assessment. While the examples focus on deforestation, these same methods apply to any binary outcome phenomenon including:
Deforestation & forest degradation risk
Forest fire occurrence probability
Flooding risk zones
Disease outbreak spatial patterns
Landslide susceptibility
Species habitat suitability
Urban expansion patterns
Crop yield failure zones
Model Classification Summary#
All models predict binary outcomes (event occurs: yes/no) at the pixel/location level. The models can be categorized into:
Supervised Machine Learning Models: Trained on historical occurrence data with environmental/contextual predictors
Unsupervised/Heuristic Models: Rule-based approaches using spatial patterns
Models Description#
1. Moving Window (MW) Model#
Notebook: 4.mw_model.ipynb
Type: Unsupervised spatial heuristic model
Method:
Calculates local event rates within moving windows of different sizes (e.g., 5×5, 11×11, 21×21 pixels)
Uses historical patterns in the neighborhood to predict future risk
No machine learning training required
Key Features:
Window sizes: Typically 5, 11, and 21 pixels
Based on spatial proximity assumption: areas near recent events are at higher risk
Fast computation, no training phase needed
Output: Probability/risk map based on neighborhood event density
Application Examples:
Deforestation: Areas near recent forest loss
Fire risk: Zones near previous burn scars
Flooding: Areas near historically flooded zones
When to use: Quick baseline model, captures spatial clustering of events
2. Generalized Linear Model (GLM)#
Notebook: 5.2.far_glm.ipynb
Type: Supervised binary classification (regression-based)
Algorithm: Logistic Regression
Method:
Uses environmental and contextual variables to predict event probability
Linear combination of features with logistic transformation
Trained via maximum likelihood estimation
Features Used (examples vary by application):
Continuous variables (scaled): altitude, slope, distances to relevant features (roads, rivers, infrastructure, previous events)
Categorical variables: land use, protected status, soil type, administrative units
Application Examples:
Deforestation: altitude, slope, distance to roads/rivers/towns/forest edge, protected areas
Fire risk: temperature, humidity, wind speed, vegetation type, distance to settlements
Flooding: elevation, slope, distance to water bodies, soil permeability, drainage density
Training:
Algorithm:
sklearn.linear_model.LogisticRegressionSample points from event and non-event locations
Binary target:
I(event_occurred)where 1 = event happened, 0 = no event
Output: Probability of event occurrence (0-1 scale, rescaled to 0-65535 for raster storage)
Advantages: Fast, interpretable coefficients, understand which factors drive risk
Limitations: Assumes linear relationships (in logit space), doesn’t model spatial autocorrelation
3. iCAR Model (Intrinsic Conditional Autoregressive)#
Notebook: 5.3.far_icar.ipynb
Type: Supervised Bayesian spatial classification
Algorithm: Bayesian hierarchical model with spatial random effects
Method:
Extends logistic regression by adding spatial random effects
Explicitly models spatial autocorrelation via neighborhood structure
Accounts for the fact that nearby locations tend to have similar risk (spatial dependence)
Features Used:
Same environmental predictors as GLM
Additional spatial component: Cell adjacency matrix (spatial neighborhood structure)
Training:
Bayesian inference via MCMC (Markov Chain Monte Carlo) sampling
Estimates both coefficients (βs) and spatial autocorrelation parameter (ρ)
Computationally intensive (requires burn-in and sampling iterations)
Special Features:
Spatial random effects smooth predictions across space
Interpolates spatial correlation parameter (rho) for fine-scale predictions
Provides uncertainty estimates via posterior distributions
Output: Spatially-smoothed probability map with uncertainty estimates
Advantages: Accounts for spatial dependence, more realistic for clustered phenomena like fires, diseases, or deforestation
Limitations: Computationally expensive, requires careful tuning (MCMC iterations)
Application Examples:
Deforestation: Smooth risk transitions accounting for spatial contagion
Disease spread: Model spatial correlation in outbreak patterns
Fire risk: Account for fire spread patterns and neighborhood effects
4. Random Forest (RF)#
Notebook: 5.4.far_rf.ipynb
Type: Supervised binary classification (ensemble method)
Algorithm: Random Forest Classifier
Method:
Ensemble of decision trees trained on bootstrap samples
Each tree makes predictions, final output is averaged
Captures complex non-linear relationships and interactions between features
Features Used:
Same predictors as GLM and iCAR (context-dependent)
Automatically handles feature interactions and non-linear effects
Can include temporal features, climate variables, socioeconomic data, etc.
Training:
Algorithm:
sklearn.ensemble.RandomForestClassifierParameters: number of trees (typically 100), min samples per leaf, max depth
No explicit spatial modeling (though can include spatial coordinates)
Output: Probability of event occurrence (averaged from all trees)
Advantages:
Captures non-linear relationships and complex interactions
Robust to outliers and missing data
Feature importance scores show which variables matter most
Generally high predictive accuracy across diverse applications
Limitations:
“Black box” model (less interpretable than GLM)
Can overfit if not properly tuned
Computationally more expensive than GLM
Application Examples:
Fire risk: Complex interactions between weather, vegetation, human activity
Flooding: Non-linear relationships between rainfall, topography, land cover
Species distribution: Complex habitat suitability with multiple interacting factors
5. Benchmark/Stratification Model#
Notebook: 3.benchmark_jnr_model.ipynb
Type: Unsupervised rule-based spatial stratification
Method:
Stratifies landscape based on key risk factors:
Distance to relevant features (e.g., forest edge, water bodies, fault lines)
Administrative/ecological units (sub-regions, soil types, etc.)
Assigns historical event rates to each stratum
Deterministic assignment (no statistical learning)
Approach:
Identify distance threshold where most events (e.g., 99.5%) occurred
Divide landscape into distance bins from key feature
Calculate historical event rate for each bin × category combination
Apply these rates as vulnerability scores
Output: Vulnerability map with risk classes based on historical patterns
Advantages: Simple, transparent, auditable, follows established methodologies (e.g., JNR for deforestation)
Limitations: Cannot capture complex interactions, assumes future patterns similar to past
Application Examples:
Deforestation: Distance to forest edge × jurisdictions (JNR methodology)
Fire risk: Distance to ignition sources × vegetation types
Flooding: Elevation zones × drainage basins
Landslides: Slope classes × geological units
Model Comparison Table#
Model |
Type |
Supervised? |
Spatial Modeling |
Complexity |
Interpretability |
|---|---|---|---|---|---|
Moving Window (MW) |
Heuristic |
No |
Implicit (neighborhood) |
Low |
High |
GLM (Logistic) |
Regression |
Yes |
No |
Low |
High |
iCAR |
Bayesian Spatial |
Yes |
Explicit (CAR structure) |
High |
Medium |
Random Forest |
Ensemble |
Yes |
No |
Medium |
Low |
JNR Benchmark |
Rule-based |
No |
Implicit (distance-based) |
Low |
High |
Data Requirements for Training vs. Prediction#
For SUPERVISED models (GLM, iCAR, Random Forest)#
Training Phase - Requires BOTH:
Y (target): Historical deforestation labels (0 = deforested, 1 = remained forest)
X (features): Environmental and accessibility variables (altitude, slope, distances, protected areas, etc.)
Prediction Phase - Requires ONLY:
X (features): The same predictor variables for new/future areas
The trained model applies learned relationships to predict Y
For UNSUPERVISED models (Moving Window, JNR Benchmark)#
No Training Phase - They directly compute predictions from:
Historical deforestation patterns (used as input, not as labeled training data)
Spatial/distance features (forest edge distance, jurisdictions)
No Y/X distinction - they use deforestation history to create risk zones directly
Training Data#
All supervised models (GLM, iCAR, RF) use the same training data generated in:
Notebook: 5.1.far_models_sampling.ipynb
What is Y (Target/Dependent Variable)?#
The binary outcome for each sampled location between two time periods:
Examples by application:
Deforestation: 0 = deforested, 1 = remained forest
Fire: 0 = burned, 1 = not burned
Flooding: 0 = flooded, 1 = not flooded
Disease: 0 = outbreak occurred, 1 = no outbreak
Formula in code: I(event_occurred) where event is 0 (happened) or 1 (didn’t happen)
Note
In the deforestation notebooks, this is coded as I(1-deforestation) where 1 = forest remained
What are X’s (Features/Independent Variables)?#
The predictor variables depend on your specific application. Here are examples across different domains:
For Deforestation Risk:
Environmental:
altitude,slope,soil_typeAccessibility:
dist_roads,dist_rivers,dist_towns,dist_forest_edgePolicy:
protected_areas,indigenous_territories,jurisdiction
For Fire Risk:
Climate:
temperature,humidity,wind_speed,precipitationVegetation:
vegetation_type,ndvi,fuel_load,canopy_coverAccessibility:
dist_settlements,dist_roads,dist_previous_firesTemporal:
season,fire_season_index
For Flooding Risk:
Topography:
elevation,slope,aspect,topographic_wetness_indexHydrology:
dist_rivers,drainage_density,flow_accumulationLand cover:
imperviousness,land_use,soil_permeabilityInfrastructure:
dist_drainage_systems,dams_upstream
For Disease Outbreak:
Climate:
temperature,humidity,rainfallDemographics:
population_density,age_structure,mobility_patternsInfrastructure:
healthcare_access,sanitation_qualityProximity:
dist_previous_cases,dist_high_risk_areas
Spatial Variables (for iCAR):
cell: Spatial cell ID for modeling neighborhood structureX, Y: Geographic coordinates
Sampling Strategy#
Stratified random sampling from event and non-event locations
Typically 10,000+ samples (adaptive based on study area and event prevalence)
Spatial cell IDs (grid cells of ~10×10 km) for accounting spatial autocorrelation
Balanced or weighted representation of outcome classes
How Training Works#
Supervised models learn the relationship:
P(event) = f(X1, X2, X3, ..., Xn)
Examples:
P(deforestation) = f(altitude, slope, dist_roads, dist_towns, dist_edge, protected_areas, ...)P(fire) = f(temperature, humidity, wind_speed, vegetation_type, dist_settlements, ...)P(flooding) = f(elevation, slope, dist_rivers, rainfall, land_use, soil_type, ...)
Training process:
Sample locations where we KNOW the outcome (Y = event occurred or not)
Extract predictor values (X’s) at those locations
Fit model to learn: Given these X values, what’s the probability of the event?
Apply learned model to predict event probability for ALL locations using their X values
Model Evaluation#
Notebook: 6.models_evaluation.ipynb
All models are compared using validation metrics on coarse grid cells:
Metrics:
R²: Explained variance (how well predictions match observations)
RMSE: Root Mean Square Error (average prediction error)
wRMSE: Weighted RMSE (accounts for varying grid cell sizes)
MedAE: Median Absolute Error (robust to outliers)
Evaluation Periods:
Calibration: Training period (e.g., 2015-2020)
Validation: Testing period (e.g., 2020-2024)
Historical: Full historical period (e.g., 2015-2024)
Forecast: Future projections using latest data
Which Model to Choose?#
Use Moving Window (MW) when#
Quick assessment needed
Limited computational resources
Events are highly spatially clustered (fires, deforestation, disease outbreaks)
Transparency is critical
Neighborhood effects dominate other factors
Use GLM when#
Need interpretable coefficients (understand which factors increase/decrease risk)
Want to quantify driver importance and effect sizes
Computational efficiency is important
Linear relationships (in logit space) are reasonable
Regulatory or policy context requires explainability
Use iCAR when#
Spatial autocorrelation is strong (contagious processes like fires, disease, deforestation)
Need spatially-smooth predictions without artificial boundaries
Have computational resources for MCMC sampling
Want uncertainty quantification and credible intervals
Spatial spillover effects are important
Use Random Forest when#
Maximum predictive accuracy is priority
Relationships are complex/non-linear (e.g., climate thresholds, tipping points)
Feature interactions are important (e.g., temperature × humidity for fire risk)
Have many predictors and unsure which matter
Less concerned about interpretability, more about prediction performance
Use Benchmark/Stratification when#
Following established standards (e.g., JNR for REDD+, official flooding protocols)
Need simple, auditable, transparent methodology
Historical patterns are reliable predictors of future
Administrative or jurisdictional reporting required
Stakeholder communication and buy-in are critical
Limited data or technical capacity
References#
riskmapjnr: Python package for JNR risk mapping methodology
forestatrisk: Python package for deforestation risk modeling (GLM, iCAR)
sklearn: Scikit-learn for Random Forest implementation
Additional Notes#
Prediction Output Format#
All models produce raster maps with values 0-65535 representing deforestation probability:
0 = No data / non-forest
1-65535 = Risk level (rescaled probability)
Spatial Resolution#
Typically 30m pixels (matching forest cover data)
Coarse grid evaluation: 300+ pixel cells for validation
Temporal Periods#
Calibration: Model training period
Validation: Independent test period
Historical: Full observed period (for final model)
Forecast: Future projection period