===========================
Spatial Risk Modeling
===========================

This document provides a clear explanation of different models used for **spatial risk assessment**. While the examples focus on deforestation, these same methods apply to **any binary outcome phenomenon** including:

- **Deforestation & forest degradation** risk
- **Forest fire** occurrence probability
- **Flooding** risk zones
- **Disease outbreak** spatial patterns
- **Landslide** susceptibility
- **Species habitat** suitability
- **Urban expansion** patterns
- **Crop yield** failure zones

Model Classification Summary
==============================

All models predict **binary outcomes** (event occurs: yes/no) at the pixel/location level. The models can be categorized into:

- **Supervised Machine Learning Models**: Trained on historical occurrence data with environmental/contextual predictors
- **Unsupervised/Heuristic Models**: Rule-based approaches using spatial patterns

Models Description
==================

1. Moving Window (MW) Model
----------------------------

**Notebook**: ``4.mw_model.ipynb``

**Type**: Unsupervised spatial heuristic model

**Method**:

- Calculates local event rates within moving windows of different sizes (e.g., 5×5, 11×11, 21×21 pixels)
- Uses historical patterns in the neighborhood to predict future risk
- No machine learning training required

**Key Features**:

- Window sizes: Typically 5, 11, and 21 pixels
- Based on spatial proximity assumption: areas near recent events are at higher risk
- Fast computation, no training phase needed

**Output**: Probability/risk map based on neighborhood event density

**Application Examples**:

- **Deforestation**: Areas near recent forest loss
- **Fire risk**: Zones near previous burn scars
- **Flooding**: Areas near historically flooded zones

**When to use**: Quick baseline model, captures spatial clustering of events

2. Generalized Linear Model (GLM)
----------------------------------

**Notebook**: ``5.2.far_glm.ipynb``

**Type**: Supervised binary classification (regression-based)

**Algorithm**: Logistic Regression

**Method**:

- Uses environmental and contextual variables to predict event probability
- Linear combination of features with logistic transformation
- Trained via maximum likelihood estimation

**Features Used** (examples vary by application):

- **Continuous variables** (scaled): altitude, slope, distances to relevant features (roads, rivers, infrastructure, previous events)
- **Categorical variables**: land use, protected status, soil type, administrative units

**Application Examples**:

- **Deforestation**: altitude, slope, distance to roads/rivers/towns/forest edge, protected areas
- **Fire risk**: temperature, humidity, wind speed, vegetation type, distance to settlements
- **Flooding**: elevation, slope, distance to water bodies, soil permeability, drainage density

**Training**:

- Algorithm: ``sklearn.linear_model.LogisticRegression``
- Sample points from event and non-event locations
- Binary target: ``I(event_occurred)`` where 1 = event happened, 0 = no event

**Output**: Probability of event occurrence (0-1 scale, rescaled to 0-65535 for raster storage)

**Advantages**: Fast, interpretable coefficients, understand which factors drive risk

**Limitations**: Assumes linear relationships (in logit space), doesn't model spatial autocorrelation

3. iCAR Model (Intrinsic Conditional Autoregressive)
-----------------------------------------------------

**Notebook**: ``5.3.far_icar.ipynb``

**Type**: Supervised Bayesian spatial classification

**Algorithm**: Bayesian hierarchical model with spatial random effects

**Method**:

- Extends logistic regression by adding spatial random effects
- Explicitly models spatial autocorrelation via neighborhood structure
- Accounts for the fact that nearby locations tend to have similar risk (spatial dependence)

**Features Used**:

- Same environmental predictors as GLM
- **Additional spatial component**: Cell adjacency matrix (spatial neighborhood structure)

**Training**:

- Bayesian inference via MCMC (Markov Chain Monte Carlo) sampling
- Estimates both coefficients (βs) and spatial autocorrelation parameter (ρ)
- Computationally intensive (requires burn-in and sampling iterations)

**Special Features**:

- Spatial random effects smooth predictions across space
- Interpolates spatial correlation parameter (rho) for fine-scale predictions
- Provides uncertainty estimates via posterior distributions

**Output**: Spatially-smoothed probability map with uncertainty estimates

**Advantages**: Accounts for spatial dependence, more realistic for clustered phenomena like fires, diseases, or deforestation

**Limitations**: Computationally expensive, requires careful tuning (MCMC iterations)

**Application Examples**:

- **Deforestation**: Smooth risk transitions accounting for spatial contagion
- **Disease spread**: Model spatial correlation in outbreak patterns
- **Fire risk**: Account for fire spread patterns and neighborhood effects

4. Random Forest (RF)
----------------------

**Notebook**: ``5.4.far_rf.ipynb``

**Type**: Supervised binary classification (ensemble method)

**Algorithm**: Random Forest Classifier

**Method**:

- Ensemble of decision trees trained on bootstrap samples
- Each tree makes predictions, final output is averaged
- Captures complex non-linear relationships and interactions between features

**Features Used**:

- Same predictors as GLM and iCAR (context-dependent)
- Automatically handles feature interactions and non-linear effects
- Can include temporal features, climate variables, socioeconomic data, etc.

**Training**:

- Algorithm: ``sklearn.ensemble.RandomForestClassifier``
- Parameters: number of trees (typically 100), min samples per leaf, max depth
- No explicit spatial modeling (though can include spatial coordinates)

**Output**: Probability of event occurrence (averaged from all trees)

**Advantages**:

- Captures non-linear relationships and complex interactions
- Robust to outliers and missing data
- Feature importance scores show which variables matter most
- Generally high predictive accuracy across diverse applications

**Limitations**:

- "Black box" model (less interpretable than GLM)
- Can overfit if not properly tuned
- Computationally more expensive than GLM

**Application Examples**:

- **Fire risk**: Complex interactions between weather, vegetation, human activity
- **Flooding**: Non-linear relationships between rainfall, topography, land cover
- **Species distribution**: Complex habitat suitability with multiple interacting factors

5. Benchmark/Stratification Model
-----------------------------------

**Notebook**: ``3.benchmark_jnr_model.ipynb``

**Type**: Unsupervised rule-based spatial stratification

**Method**:

- Stratifies landscape based on key risk factors:

  - Distance to relevant features (e.g., forest edge, water bodies, fault lines)
  - Administrative/ecological units (sub-regions, soil types, etc.)

- Assigns historical event rates to each stratum
- Deterministic assignment (no statistical learning)

**Approach**:

1. Identify distance threshold where most events (e.g., 99.5%) occurred
2. Divide landscape into distance bins from key feature
3. Calculate historical event rate for each bin × category combination
4. Apply these rates as vulnerability scores

**Output**: Vulnerability map with risk classes based on historical patterns

**Advantages**: Simple, transparent, auditable, follows established methodologies (e.g., JNR for deforestation)

**Limitations**: Cannot capture complex interactions, assumes future patterns similar to past

**Application Examples**:

- **Deforestation**: Distance to forest edge × jurisdictions (JNR methodology)
- **Fire risk**: Distance to ignition sources × vegetation types
- **Flooding**: Elevation zones × drainage basins
- **Landslides**: Slope classes × geological units

Model Comparison Table
=======================

.. list-table::
   :header-rows: 1
   :widths: 20 15 10 20 15 15

   * - Model
     - Type
     - Supervised?
     - Spatial Modeling
     - Complexity
     - Interpretability
   * - **Moving Window (MW)**
     - Heuristic
     - No
     - Implicit (neighborhood)
     - Low
     - High
   * - **GLM (Logistic)**
     - Regression
     - Yes
     - No
     - Low
     - High
   * - **iCAR**
     - Bayesian Spatial
     - Yes
     - Explicit (CAR structure)
     - High
     - Medium
   * - **Random Forest**
     - Ensemble
     - Yes
     - No
     - Medium
     - Low
   * - **JNR Benchmark**
     - Rule-based
     - No
     - Implicit (distance-based)
     - Low
     - High

Data Requirements for Training vs. Prediction
===============================================

For SUPERVISED models (GLM, iCAR, Random Forest)
--------------------------------------------------

**Training Phase** - Requires BOTH:

- **Y (target)**: Historical deforestation labels (0 = deforested, 1 = remained forest)
- **X (features)**: Environmental and accessibility variables (altitude, slope, distances, protected areas, etc.)

**Prediction Phase** - Requires ONLY:

- **X (features)**: The same predictor variables for new/future areas
- The trained model applies learned relationships to predict Y

For UNSUPERVISED models (Moving Window, JNR Benchmark)
-------------------------------------------------------

**No Training Phase** - They directly compute predictions from:

- **Historical deforestation patterns** (used as input, not as labeled training data)
- **Spatial/distance features** (forest edge distance, jurisdictions)
- No Y/X distinction - they use deforestation history to create risk zones directly

Training Data
=============

All supervised models (GLM, iCAR, RF) use the same training data generated in:

**Notebook**: ``5.1.far_models_sampling.ipynb``

What is Y (Target/Dependent Variable)?
---------------------------------------

The **binary outcome** for each sampled location between two time periods:

**Examples by application**:

- **Deforestation**: 0 = deforested, 1 = remained forest
- **Fire**: 0 = burned, 1 = not burned
- **Flooding**: 0 = flooded, 1 = not flooded
- **Disease**: 0 = outbreak occurred, 1 = no outbreak

Formula in code: ``I(event_occurred)`` where event is 0 (happened) or 1 (didn't happen)

.. note::
   In the deforestation notebooks, this is coded as ``I(1-deforestation)`` where 1 = forest remained

What are X's (Features/Independent Variables)?
-----------------------------------------------

The predictor variables depend on your specific application. Here are examples across different domains:

**For Deforestation Risk:**

- Environmental: ``altitude``, ``slope``, ``soil_type``
- Accessibility: ``dist_roads``, ``dist_rivers``, ``dist_towns``, ``dist_forest_edge``
- Policy: ``protected_areas``, ``indigenous_territories``, ``jurisdiction``

**For Fire Risk:**

- Climate: ``temperature``, ``humidity``, ``wind_speed``, ``precipitation``
- Vegetation: ``vegetation_type``, ``ndvi``, ``fuel_load``, ``canopy_cover``
- Accessibility: ``dist_settlements``, ``dist_roads``, ``dist_previous_fires``
- Temporal: ``season``, ``fire_season_index``

**For Flooding Risk:**

- Topography: ``elevation``, ``slope``, ``aspect``, ``topographic_wetness_index``
- Hydrology: ``dist_rivers``, ``drainage_density``, ``flow_accumulation``
- Land cover: ``imperviousness``, ``land_use``, ``soil_permeability``
- Infrastructure: ``dist_drainage_systems``, ``dams_upstream``

**For Disease Outbreak:**

- Climate: ``temperature``, ``humidity``, ``rainfall``
- Demographics: ``population_density``, ``age_structure``, ``mobility_patterns``
- Infrastructure: ``healthcare_access``, ``sanitation_quality``
- Proximity: ``dist_previous_cases``, ``dist_high_risk_areas``

**Spatial Variables (for iCAR):**

- ``cell``: Spatial cell ID for modeling neighborhood structure
- ``X, Y``: Geographic coordinates

Sampling Strategy
-----------------

- Stratified random sampling from event and non-event locations
- Typically 10,000+ samples (adaptive based on study area and event prevalence)
- Spatial cell IDs (grid cells of ~10×10 km) for accounting spatial autocorrelation
- Balanced or weighted representation of outcome classes

How Training Works
-------------------

**Supervised models learn the relationship:**

.. code-block:: text

   P(event) = f(X1, X2, X3, ..., Xn)

**Examples:**

- ``P(deforestation) = f(altitude, slope, dist_roads, dist_towns, dist_edge, protected_areas, ...)``
- ``P(fire) = f(temperature, humidity, wind_speed, vegetation_type, dist_settlements, ...)``
- ``P(flooding) = f(elevation, slope, dist_rivers, rainfall, land_use, soil_type, ...)``

**Training process:**

1. Sample locations where we KNOW the outcome (Y = event occurred or not)
2. Extract predictor values (X's) at those locations
3. Fit model to learn: Given these X values, what's the probability of the event?
4. Apply learned model to predict event probability for ALL locations using their X values

Model Evaluation
================

**Notebook**: ``6.models_evaluation.ipynb``

All models are compared using validation metrics on coarse grid cells:

**Metrics**:

- **R²**: Explained variance (how well predictions match observations)
- **RMSE**: Root Mean Square Error (average prediction error)
- **wRMSE**: Weighted RMSE (accounts for varying grid cell sizes)
- **MedAE**: Median Absolute Error (robust to outliers)

**Evaluation Periods**:

- **Calibration**: Training period (e.g., 2015-2020)
- **Validation**: Testing period (e.g., 2020-2024)
- **Historical**: Full historical period (e.g., 2015-2024)
- **Forecast**: Future projections using latest data

Which Model to Choose?
========================

Use Moving Window (MW) when
----------------------------

- Quick assessment needed
- Limited computational resources
- Events are highly spatially clustered (fires, deforestation, disease outbreaks)
- Transparency is critical
- Neighborhood effects dominate other factors

Use GLM when
------------

- Need interpretable coefficients (understand which factors increase/decrease risk)
- Want to quantify driver importance and effect sizes
- Computational efficiency is important
- Linear relationships (in logit space) are reasonable
- Regulatory or policy context requires explainability

Use iCAR when
-------------

- Spatial autocorrelation is strong (contagious processes like fires, disease, deforestation)
- Need spatially-smooth predictions without artificial boundaries
- Have computational resources for MCMC sampling
- Want uncertainty quantification and credible intervals
- Spatial spillover effects are important

Use Random Forest when
-----------------------

- Maximum predictive accuracy is priority
- Relationships are complex/non-linear (e.g., climate thresholds, tipping points)
- Feature interactions are important (e.g., temperature × humidity for fire risk)
- Have many predictors and unsure which matter
- Less concerned about interpretability, more about prediction performance

Use Benchmark/Stratification when
----------------------------------

- Following established standards (e.g., JNR for REDD+, official flooding protocols)
- Need simple, auditable, transparent methodology
- Historical patterns are reliable predictors of future
- Administrative or jurisdictional reporting required
- Stakeholder communication and buy-in are critical
- Limited data or technical capacity

References
==========

- **riskmapjnr**: Python package for JNR risk mapping methodology
- **forestatrisk**: Python package for deforestation risk modeling (GLM, iCAR)
- **sklearn**: Scikit-learn for Random Forest implementation

Additional Notes
=================

Prediction Output Format
-------------------------

All models produce raster maps with values 0-65535 representing deforestation probability:

- 0 = No data / non-forest
- 1-65535 = Risk level (rescaled probability)

Spatial Resolution
------------------

- Typically 30m pixels (matching forest cover data)
- Coarse grid evaluation: 300+ pixel cells for validation

Temporal Periods
----------------

- **Calibration**: Model training period
- **Validation**: Independent test period
- **Historical**: Full observed period (for final model)
- **Forecast**: Future projection period