1 Introduction to the framework and application



In this notebook, we run through the basic steps to implement the geospatial data integration framework (G-DIF) presented in “A geospatial data integration framework to rapidly estimate post-disaster damage” by Sabine Loos, Jack Baker, David Lallemant, Nama Budhathoki, Sang-Ho Yun, Jamie McCaughey, Ritika Singh, and Feroz Khan, which is available online at https://doi.org/10.1177/8755293020926190.

The goal of this paper is to provide a framework to integrate multiple damage data sources that become available after an earthquake, including: field surveys, remotely-sensed proxies, engineering forecasts, and other geospatial covariates. More information can be found in the paper, but here is the general flowchart:


1.1 Packages and functions

The code for this is split between this main markdown and multiple functions. The entire source code can be found at https://github.com/sabineloos/GDIF-damageprediction. The main packages we use are caret for cross-validation and stepwise selection in the trend model and gstat for kriging of the residuals. The packages that need to be installed are also listed in the install.R script.

This code is specifically meant to show how we develop the framework for the example scenario and dataset presented in the paper; however, it can be modified with new spatial dataframes for new settings.

rm(list=ls(all = T))
# source("code/install.R") # if you need to install packages uncomment this
require(car) # for VIF checking
require(caret) # for mixed stepwise selection
require(gstat) # for kriging
require(sp) # for reading/writing to shapefile
require(rgeos)
require(maptools) # for spcbind
require(automap) # for autofit variogram
require(raster) # for making rasters
require(rasterVis) # for plotting rasters
require(ggplot2); theme_set(theme_bw()) # for plotting;
require(gridExtra) # for arranging plots side-by-side
require(foreach) # for foreach parallel loops
require(future) # for future parallel backend
require(doFuture) # for using the future backend for foreach loops
require(GGally) # for plotting correlation matrices

# Outside functions
source("code/Functions/PlottingFunctions.R") #contains functions used to create plots in this markdown
source("code/Functions/RK_LS_trend.R") # main function to define the trend model
source("code/Functions/OK_parallel.R") # function to perform ordinary kriging in parallel for spatial correlation of residuals
source("code/Functions/RK_errormet.R") # function to quantify error metrics of prediction and uncertainties
source("code/Functions/nscore.R") # function to perform a normal score transformation
source("code/Functions/polygon2points.R") # function to turn SpatialPolygonsDataFrame to SpatialPointsDataFrame
source("code/Functions/comb_dup.R") # for combining dataframes with duplicate entries
source("code/Functions/sp2gg.R") #contains functions to convert spatial to ggplot dataframe

1.2 Load data from the 2015 Nepal Earthquake

Here, we implement G-DIF with data from the 2015 Nepal earthquake. Our area of interest is the 11 districts surrounding Kathmandu Valley. This code depends on an already prepared spatial dataframe (i.e. grids, polygons, or points) that has values for the primary data (e.g. the dependent variable, typically from field surveys, to be predicted) and the secondary data (e.g. the independent variables). Here we use the following data sources that were either already available or produced after the earthquake:

  1. Field surveys of impact at the household level, conducted by Nepal’s Central Bureau of Statistics and organized by our collaborators at Kathmandu Living Labs. This data has been made openly available in the 2015 Nepal Earthquake Open Data Portal.
  2. A map of shaking intensity from the United States Geological Survey (USGS). This provides the Modified Mercalli Intensity at 1.75 km resolution. This uses the latest ShakeMap (version 4) (Worden et al. 2010; Worden et al. 2016)
  3. A self-developed engineering forecast of damage. This combines the USGS shaking intensity with fragility curves from NSET and LandScan population exposure to predict the mean damage ratio at a 1 km resolution (JICA 2002; LandScan 2012)
  4. A damage proxy map (DPM) developed through InSAR-based coherence differencing from the ARIA project at the NASA Jet Propulsion Laboratory. This provides a damage proxy value from -1 to 1 at 30 m resolution (Yun et al. 2015)
  5. A digital elevation model from CGIAR Consortium for Spatial Information. This provides the elevation at a 90 m resolution. (Jarvis et al. 2008)

We convert each dataset to the same support and load it as one spatial dataframe (alldat_shp) before inputting it into the framework. The format is a gridded dataframe at a resolution of 0.0028° × 0.0028° (~290 m × 290 m). The variables in the dataframe are:

  • GRID_ID – ID number of the grid
  • GT_mnDG – Groundtruth mean damage grade, average field surveyed damage within grid
  • ENG_mnDR – Mean damage ratio from the engineering forecast
  • DPM_bldg_med – Median DPM pixel value overlaying buildings within a grid
  • MMI_mn - Modified Mercalli Intensity value over the centroid of each grid
  • DEM_mn - Elevation in meters over centroid of each grid
  • Area_deg - Area of the grid in degrees
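As a sketch of what loading this dataframe might look like (the file path below is an assumption, not necessarily how the repository stores the prepared data), one could load alldat_shp and sanity-check its columns as follows:

```r
# Load the prepared spatial dataframe (file path is illustrative)
load("data/alldat_shp.Rdata") # assumed to contain the object alldat_shp

# Sanity-check that the expected variables are present
expected_vars <- c("GRID_ID", "GT_mnDG", "ENG_mnDR", "DPM_bldg_med",
                   "MMI_mn", "DEM_mn", "Area_deg")
stopifnot(all(expected_vars %in% names(alldat_shp)))
summary(alldat_shp@data[, expected_vars])
```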

2 G-DIF methodology


This code uses the centroid of each grid to calculate distances. We therefore check whether the data is a points dataframe and, if not, convert it to one using polygon2points.
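A minimal version of this check, using the sp class names and the polygon2points function sourced above, might look like:

```r
# Convert polygons to centroid points if needed (polygon2points is sourced above)
if (inherits(alldat_shp, "SpatialPolygonsDataFrame")) {
  alldat_shp <- polygon2points(alldat_shp)
}
stopifnot(inherits(alldat_shp, "SpatialPointsDataFrame"))
```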

2.1 Define modeling parameters and dataframes

In order to model, we need to know what we are predicting, what we use to predict it, and how many data points we have. The user needs to define the following variables:

  • target_var = Target variable: the variable we would like to predict for every grid in the dataframe, based on the field-surveyed damage; we store it in a new column called GT_OUT. In this case, we use GT_mnDG, the mean damage grade per grid.
  • nsurvey = Number of grids that will be surveyed. In this case, we use 100 randomly selected grids. Note that a survey of 100 grids covers more than 100 buildings. In actuality, we know GT_mnDG at all locations, but we reserve the rest for validation in Section 3.
  • seconddat_vars = Secondary data variables, these are the variables we will use to predict the target variable, based on all secondary damage data. We use ENG_mnDR, DPM_bldg_med, MMI_mn, DEM_mn.
  • seed = the random seed we use for reproducibility. Here we use 977, which is the area code for Kathmandu :)
  • id_var = ID variable
  • maxdist.multiplier = For the spatial correlation model, we krige the residuals only using points within 2 times the spatial correlation range of a point
  • nmin and nmax = For the spatial correlation model, the minimum and maximum number of points to consider as an alternative to maxdist.multiplier
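Collecting the parameters above in one place might look like the following (the nmin and nmax values here are illustrative placeholders, not values taken from the paper):

```r
# User-defined modeling parameters
target_var         <- "GT_mnDG"  # mean damage grade per grid (stored as GT_OUT)
nsurvey            <- 100        # number of field-surveyed grids
seconddat_vars     <- c("ENG_mnDR", "DPM_bldg_med", "MMI_mn", "DEM_mn")
seed               <- 977        # Kathmandu's area code, for reproducibility
id_var             <- "GRID_ID"
maxdist.multiplier <- 2          # krige using points within 2x the correlation range
nmin <- 20                       # illustrative minimum number of kriging neighbors
nmax <- 100                      # illustrative maximum number of kriging neighbors
```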

Also, one of our secondary damage data variables (DPM_bldg_med) is only at a subset of our entire area of interest. Because of this, we split up our field surveys proportionally according to the percentage of the area that the DPM covers (p_area_dpm = 41.35%). Therefore, out of the 100 grids we survey, 41 have DPM values and 59 do not have DPM values.
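One way to draw such a proportional sample, as a sketch that assumes grids without DPM coverage carry NA in DPM_bldg_med:

```r
# Proportional sampling of field-survey grids by DPM coverage (sketch)
set.seed(977)
nsurvey    <- 100
p_area_dpm <- 0.4135
n_dpm   <- round(nsurvey * p_area_dpm)  # 41 grids with DPM values
n_nodpm <- nsurvey - n_dpm              # 59 grids without DPM values
idx_dpm   <- sample(which(!is.na(alldat_shp$DPM_bldg_med)), n_dpm)
idx_nodpm <- sample(which(is.na(alldat_shp$DPM_bldg_med)),  n_nodpm)
field_idx <- c(idx_dpm, idx_nodpm)      # row indices of the surveyed grids
```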

We use these parameters to create two dataframes:

  1. pred_shp = The prediction spatial dataframe where we add all our predictions. Note that here, we only predict at the 80200 grids with buildings to speed up computation (but one could predict at all points as well).
  2. field_sample_shp = The field spatial dataframe (i.e. the training dataframe) that contains the subset of 100 grids at the location of the field surveys.

Here, we also perform a normal score transformation (NST) of the secondary data variables using nscore.
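A sketch of how nscore might be applied to each secondary variable (the return structure, a list with an $nscore element, is an assumption based on common normal-score-transform implementations, not a documented detail of the sourced nscore.R):

```r
# Normal score transformation of the secondary variables (sketch)
for (v in c("ENG_mnDR", "DPM_bldg_med", "MMI_mn", "DEM_mn")) {
  ns <- nscore(alldat_shp@data[[v]])              # nscore.R is sourced above
  alldat_shp@data[[paste0(v, "_nst")]] <- ns$nscore
}
```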

2.2 Exploring the data

As a summary, we will be using the following 100 grids for training our model:

Ideally, we would like to see a relationship between our target variable, GT_mnDG, and the secondary variables. We would also want our field survey sample to reasonably match the distributions of the data from all locations. We can verify this using the following correlation matrix.
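With GGally (loaded above), such a correlation matrix could be produced along these lines; this is a sketch rather than the paper's exact plotting call, and it assumes field_sample_shp is the training dataframe defined in Section 2.1:

```r
# Correlation matrix of the target and secondary variables in the field sample
vars <- c("GT_mnDG", "ENG_mnDR", "DPM_bldg_med", "MMI_mn", "DEM_mn")
GGally::ggpairs(as.data.frame(field_sample_shp@data[, vars]))
```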