In this notebook, we run through the basic steps to implement the geospatial data integration framework (G-DIF) presented in “A geospatial data integration framework to rapidly estimate post-disaster damage” by Sabine Loos, Jack Baker, David Lallemant, Nama Budhathoki, Sang-Ho Yun, Jamie McCaughey, Ritika Singh, and Feroz Khan, which is available online at https://doi.org/10.1177/8755293020926190.
The goal of this paper is to provide a framework to integrate multiple damage data sources that become available after an earthquake, including: field surveys, remotely-sensed proxies, engineering forecasts, and other geospatial covariates. More information can be found in the paper, but here is the general flowchart:
The code for this is split between this main markdown document and multiple function files. The entire source code can be found at https://github.com/sabineloos/GDIF-damageprediction. The main packages we use are caret for cross-validation and stepwise selection in the trend model and gstat for kriging of the residuals. The packages that need to be installed are also listed in the install.R script.
This code is specifically meant to show how we develop the framework for the example scenario and dataset presented in the paper; however, the code can be modified with new spatial dataframes for new settings.
rm(list=ls(all = T))
# source("code/install.R") # if you need to install packages uncomment this
require(car) # for VIF checking
require(caret) # for mixed stepwise selection
require(gstat) # for kriging
require(sp) # for reading/writing to shapefile
require(rgeos)
require(maptools) # for spcbind
require(automap) # for autofit variogram
require(raster) # for making rasters
require(rasterVis) # for plotting rasters
require(ggplot2); theme_set(theme_bw()) # for plotting;
require(gridExtra) # for arranging plots side-by-side
require(foreach) # for foreach parallel loops
require(future) # for future parallel backend
require(doFuture) # for using future backend for foreach loops
require(GGally) # for plotting correlation matrices
# Outside functions
source("code/Functions/PlottingFunctions.R") #contains functions used to create plots in this markdown
source("code/Functions/RK_LS_trend.R") # main function to define the trend model
source("code/Functions/OK_parallel.R") # function to perform ordinary kriging in parallel for spatial correlation of residuals
source("code/Functions/RK_errormet.R") # function to quantify error metrics of prediction and uncertainties
source("code/Functions/nscore.R") # function to perform a normal score transformation
source("code/Functions/polygon2points.R") # function to turn SpatialPolygonsDataFrame to SpatialPointsDataFrame
source("code/Functions/comb_dup.R") # for combining dataframes with duplicate entries
source("code/Functions/sp2gg.R") # contains functions to convert spatial dataframes to ggplot dataframes

Here, we implement G-DIF with data from the 2015 Nepal earthquake. Our area of interest is the 11 districts surrounding Kathmandu Valley. This code depends on an already prepared spatial dataframe (i.e. grids, polygons, or points) that has values for the primary data (e.g. the dependent variable, typically from field surveys, to be predicted) and the secondary data (e.g. the independent variables). Here we use the following data sources that were either already available or produced after the earthquake:
We convert each dataset to the same support and load as one spatial dataframe (alldat_shp) before inputting in the framework. The format is a gridded dataframe at a resolution of .0028\(^\circ\) \(\times\) .0028\(^\circ\) (~290m \(\times\) 290m). The variables in the dataframe are:
- GRID_ID – ID number of the grid
- GT_mnDG – Groundtruth mean damage grade, the average field surveyed damage within a grid
- ENG_mnDR – Mean damage ratio from the engineering forecast
- DPM_bldg_med – Median DPM pixel value overlaying buildings within a grid
- MMI_mn – Modified Mercalli Intensity value over the centroid of each grid
- DEM_mn – Elevation in meters over the centroid of each grid
- Area_deg – Area of the grid in degrees

# Load data frame that has been prepared for the framework
alldat_shp <- readRDS(file = "data/Nepal_11dist_damagedata_grids.rds") # this doesn't work, since data is not available
head(alldat_shp@data[which(!is.na(alldat_shp$DPM_bldg_med)),])

This code uses the centroid of each grid to calculate distances. We therefore check whether the data is a points dataframe and, if not, convert it to one using polygon2points.
if(class(alldat_shp) != "SpatialPointsDataFrame"){ # check if already a spatial points dataframe
alldat_shp <- polygon2points(alldat_shp)}

In order to model, we need to know what we are predicting, what we use to predict, and how many data points we have. The user needs to define the variables below:
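The actual conversion lives in code/Functions/polygon2points.R; as a rough illustration of what such a function can do (a sketch, not the sourced implementation), each polygon is replaced by its centroid while the attribute table is carried over:

```r
# Hypothetical sketch of a polygon-to-points conversion; the real
# polygon2points() is defined in code/Functions/polygon2points.R
library(sp)
library(rgeos)

polygon2points_sketch <- function(poly_shp){
  # one centroid per polygon (byid = TRUE keeps one point per feature,
  # and the result inherits the CRS of the input)
  centroids <- rgeos::gCentroid(poly_shp, byid = TRUE)
  # rebuild a SpatialPointsDataFrame carrying over the original attributes
  sp::SpatialPointsDataFrame(centroids, data = poly_shp@data)
}
```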
# DEFINE the target (or output) variable and create a new column called GT_OUT.
target_var = c("GT_mnDG")
ind_target = which(names(alldat_shp) == target_var[1])
alldat_shp$GT_OUT = alldat_shp@data[,ind_target]
# DEFINE the ID variable
id_var = "GRID_ID"
# DEFINE names for secondary data
seconddat_vars <- c("ENG_mnDR","DPM_bldg_med","MMI_mn","DEM_mn")
# DEFINE the number of field surveys (taking a random sample)
nsurvey = 100
# DEFINE seed (for reproducibility purposes)
RNGkind(sample.kind = "Rounding") # define random number generator (for R 3.6.0 difference in RNG)
seed = 977 # this is the country code for Nepal ;)
# DEFINE local kriging parameters (optional)
# kriging options for local kriging (saves time)
maxdist.multiplier <- 2 # use 2x the spatial correlation range for kriging
nmin <- 2 # minimum number of points to consider
nmax <- 20 # maximum number of points to consider
# The range of predictions allowed
targ_min <- min(alldat_shp$GT_OUT, na.rm = T)
targ_max <- max(alldat_shp$GT_OUT, na.rm = T)

- target_var = Target variable, the variable we would like to predict for every grid in the dataframe, based on the field surveyed damage. We create a new column called GT_OUT. In this case, we use GT_mnDG, the mean damage grade per grid.
- nsurvey = Number of grids that will be surveyed. In this case, we use 100 randomly selected grids. Note that a survey of 100 grids covers more than 100 buildings. In actuality we know GT_mnDG at all locations (we will use this for validation in Section 3).
- seconddat_vars = Secondary data variables, the variables we will use to predict the target variable, based on all secondary damage data. We use ENG_mnDR, DPM_bldg_med, MMI_mn, and DEM_mn.
- seed = The random seed we use for reproducibility. Here we use 977, the country code for Nepal :)
- id_var = ID variable.
- maxdist.multiplier = For the spatial correlation model, we krige the residuals using only points within 2 times the spatial correlation range of a point.
- nmin and nmax = For the spatial correlation model, the minimum and maximum number of points to consider, as an alternative to maxdist.multiplier.

# grids with buildings
ind_bldgs <- which(!is.na(alldat_shp$GT_mnDG))
# read in boundary of F540 and F550 frames used to create the DPM (these have been merged separately
boundary_dpm <- readRDS(file = "data/DPM_f540550boundary.rds")
# Find index for areas with and without dpm (because we're making separate models for each)
ind_nodpm <- which(is.na(alldat_shp$DPM_bldg_med[ind_bldgs]))
ind_dpm <- which(!is.na(alldat_shp$DPM_bldg_med[ind_bldgs]))
# Calculate the percentage of the total area that is covered by the DPM. We will use this to calculate the proportion of field surveys in and out of the DPM.
p_area_dpm = rgeos::gArea(boundary_dpm)/sum(alldat_shp$Area_deg)
n_grids_dpm <- length(ind_dpm)
n_grids_nodpm <- length(ind_nodpm)

Also, one of our secondary damage data variables (DPM_bldg_med) covers only a subset of our entire area of interest. Because of this, we split our field surveys proportionally according to the percentage of the area that the DPM covers (p_area_dpm = 41.35%). Therefore, of the 100 grids we survey, 41 have DPM values and 59 do not.
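The proportional allocation of surveys can be checked with a line of arithmetic (using p_area_dpm = 0.4135 and nsurvey = 100 from above):

```r
# number of surveyed grids inside the DPM footprint, proportional to its area
n_dpm_surveys <- round(nsurvey * p_area_dpm) # round(100 * 0.4135) = 41
n_nodpm_surveys <- nsurvey - n_dpm_surveys   # 100 - 41 = 59
```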
We use these parameters to create two dataframes:
- pred_shp = The prediction spatial dataframe where we add all our predictions. Note that here, we only predict at the 80200 grids with buildings to speed up computation (but one could predict at all points as well).
- field_sample_shp = The field spatial dataframe (i.e. the training dataframe) that contains the subset of 100 grids at the location of the field surveys.

Here, we also perform a normal score transformation (NST) of the secondary data variables using nscore.
# apply a normal score transformation
pred_shp <- maptools::spCbind(alldat_shp[ind_bldgs,c(id_var, "GT_OUT")],
data.frame(apply(alldat_shp@data[,seconddat_vars],
2, nscore))[ind_bldgs,])
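The nscore function is loaded from code/Functions/nscore.R; one common way to implement a rank-based normal score transformation (a sketch, not necessarily the exact sourced version) is:

```r
# Sketch of a normal score transformation: map each value to the standard
# normal quantile of its plotting-position rank. NAs are passed through.
nscore_sketch <- function(x){
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x)
  n <- sum(ok)
  # (rank - 0.5)/n keeps the quantiles strictly inside (0, 1)
  out[ok] <- qnorm((rank(x[ok]) - 0.5) / n)
  out
}
```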
# field_sample_shp is the subset of locations of the grids that have been field surveyed
set.seed(seed);
ind_dpm_field <- ind_dpm[sample(n_grids_dpm, round(nsurvey*p_area_dpm), replace = F)]
ind_nodpm_field <- ind_nodpm[sample(n_grids_nodpm, (nsurvey - round(nsurvey*p_area_dpm)), replace = F)]
ind_field <- append(ind_dpm_field, ind_nodpm_field)
field_sample_shp <- pred_shp[ind_field,]
# redefine indices for those grids that don't have dpm values in dpm boundary and move to ind_nodpm
ind_nodpm_field <- which(is.na(field_sample_shp$DPM_bldg_med))
ind_dpm_field <- which(!is.na(field_sample_shp$DPM_bldg_med))

As a summary, we will be using the following 100 grids for training our model:
Ideally, we would like to see a relationship between our target variable, GT_mnDG, and the secondary variables. We would also want our field survey sample to reasonably match the distributions of the data from all locations. We can verify this using the following correlation matrix.
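One way to produce such a correlation matrix (a sketch using the dataframes and variable names defined above; GGally is already loaded) is:

```r
library(GGally)
# scatterplot matrix with pairwise correlations between the target variable
# and the (normal-score-transformed) secondary variables at the surveyed grids
GGally::ggpairs(field_sample_shp@data[, c("GT_OUT", seconddat_vars)])
```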