In this notebook, we run through the basic steps to implement the geospatial data integration framework (G-DIF) presented in “A geospatial data integration framework to rapidly estimate post-disaster damage” by Sabine Loos, Jack Baker, David Lallemant, Nama Budhathoki, Sang-Ho Yun, Jamie McCaughey, Ritika Singh, and Feroz Khan, which is available online at https://doi.org/10.1177/8755293020926190.
The goal of this paper is to provide a framework to integrate multiple damage data sources that become available after an earthquake, including: field surveys, remotely-sensed proxies, engineering forecasts, and other geospatial covariates. More information can be found in the paper, but here is the general flowchart:
The code for this is split between this main markdown document and multiple function files. The entire source code can be found at https://github.com/sabineloos/GDIF-damageprediction. The main packages we use are caret for cross-validation and stepwise selection in the trend model and gstat for kriging of the residuals. The packages that need to be installed are also listed in the install.R script.
This code is specifically meant to show how we develop the framework for the example scenario and dataset presented in the paper; however, the code can be modified with new spatial dataframes for new settings.
rm(list=ls(all = T))
# source("code/install.R") # if you need to install packages uncomment this
require(car) # for VIF checking
require(caret) # for mixed stepwise selection
require(gstat) # for kriging
require(sp) # for reading/writing to shapefile
require(rgeos)
require(maptools) # for spcbind
require(automap) # for autofit variogram
require(raster) # for making rasters
require(rasterVis) # for plotting rasters
require(ggplot2); theme_set(theme_bw()) # for plotting;
require(gridExtra) # for arranging plots side-by-side
require(foreach) # for foreach parallel loops
require(future) # for future parallel backend
require(doFuture) # for using future backend for foreach loops
require(GGally) # for plotting correlation matrices
# Outside functions
source("code/Functions/PlottingFunctions.R") #contains functions used to create plots in this markdown
source("code/Functions/RK_LS_trend.R") # main function to define the trend model
source("code/Functions/OK_parallel.R") # function to perform ordinary kriging in parallel for spatial correlation of residuals
source("code/Functions/RK_errormet.R") # function to quantify error metrics of prediction and uncertainties
source("code/Functions/nscore.R") # function to perform a normal score transformation
source("code/Functions/polygon2points.R") # function to turn SpatialPolygonsDataFrame to SpatialPointsDataFrame
source("code/Functions/comb_dup.R") # for combining dataframes with duplicate entries
source("code/Functions/sp2gg.R") # contains functions to convert spatial dataframes to ggplot dataframes

Here, we implement G-DIF with data from the 2015 Nepal earthquake. Our area of interest is the 11 districts surrounding Kathmandu Valley. This code depends on an already prepared spatial dataframe (i.e. grids, polygons, or points) that has values for the primary data (e.g. the dependent variable, typically from field surveys, to be predicted) and the secondary data (e.g. the independent variables). Here we use the following data sources that were either already available or produced after the earthquake:
We convert each dataset to the same support and load as one spatial dataframe (alldat_shp) before inputting in the framework. The format is a gridded dataframe at a resolution of .0028\(^\circ\) \(\times\) .0028\(^\circ\) (~290m \(\times\) 290m). The variables in the dataframe are:
- GRID_ID – ID number of the grid
- GT_mnDG – Groundtruth mean damage grade, the average field surveyed damage within a grid
- ENG_mnDR – Mean damage ratio from the engineering forecast
- DPM_bldg_med – Median DPM pixel value overlaying buildings within a grid
- MMI_mn – Modified Mercalli Intensity value over the centroid of each grid
- DEM_mn – Elevation in meters over the centroid of each grid
- Area_deg – Area of the grid in degrees

# Load data frame that has been prepared for the framework
alldat_shp <- readRDS(file = "data/Nepal_11dist_damagedata_grids.rds") # this doesn't work, since data is not available
head(alldat_shp@data[which(!is.na(alldat_shp$DPM_bldg_med)),])

This code uses the centroid of each grid to calculate distances. We therefore check whether the data is a points dataframe and, if not, convert it to one using polygon2points.
if(class(alldat_shp) != "SpatialPointsDataFrame"){ # check if already a spatial points dataframe
alldat_shp <- polygon2points(alldat_shp)}

In order to model, we need to know what we are predicting, what we use to predict, and how many data points we have. The user needs to define the variables below:
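The actual conversion lives in code/Functions/polygon2points.R; as a rough illustration of what such a function can do (a sketch, not the sourced implementation), each polygon is replaced by its centroid while the attribute table is carried over:

```r
# Hypothetical sketch of a polygon-to-points conversion; the real
# polygon2points() is defined in code/Functions/polygon2points.R
library(sp)
library(rgeos)

polygon2points_sketch <- function(poly_shp){
  # one centroid per polygon (byid = TRUE keeps one point per feature,
  # and the result inherits the CRS of the input)
  centroids <- rgeos::gCentroid(poly_shp, byid = TRUE)
  # rebuild a SpatialPointsDataFrame carrying over the original attributes
  sp::SpatialPointsDataFrame(centroids, data = poly_shp@data)
}
```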
# DEFINE the target (or output) variable and create a new column called GT_OUT.
target_var = c("GT_mnDG")
ind_target = which(names(alldat_shp) == target_var[1])
alldat_shp$GT_OUT = alldat_shp@data[,ind_target]
# DEFINE the ID variable
id_var = "GRID_ID"
# DEFINE names for secondary data
seconddat_vars <- c("ENG_mnDR","DPM_bldg_med","MMI_mn","DEM_mn")
# DEFINE the number of field surveys (taking a random sample)
nsurvey = 100
# DEFINE seed (for reproducibility purposes)
RNGkind(sample.kind = "Rounding") # define random number generator (for R 3.6.0 difference in RNG)
seed = 977 # this is the country code for Nepal ;)
# DEFINE local kriging parameters (optional)
# kriging options for local kriging (saves time)
maxdist.multiplier <- 2 # use 2x the spatial correlation range for kriging
nmin <- 2 # minimum number of points to consider
nmax <- 20 # maximum number of points to consider
# The range of predictions allowed
targ_min <- min(alldat_shp$GT_OUT, na.rm = T)
targ_max <- max(alldat_shp$GT_OUT, na.rm = T)

- target_var = Target variable, the variable we would like to predict for every grid in the dataframe, based on the field surveyed damage. We create a new column called GT_OUT. In this case, we use GT_mnDG, the mean damage grade per grid.
- nsurvey = Number of grids that will be surveyed. In this case, we use 100 randomly selected grids. Note that a survey of 100 grids covers more than 100 buildings. In actuality we know GT_mnDG at all locations (we will use this for validation in Section 3).
- seconddat_vars = Secondary data variables, the variables we will use to predict the target variable, based on all secondary damage data. We use ENG_mnDR, DPM_bldg_med, MMI_mn, and DEM_mn.
- seed = The random seed we use for reproducibility. Here we use 977, the country code for Nepal :)
- id_var = ID variable.
- maxdist.multiplier = For the spatial correlation model, we krige the residuals using only points within 2 times the spatial correlation range of a point.
- nmin and nmax = For the spatial correlation model, the minimum and maximum number of points to consider, as an alternative to maxdist.multiplier.

# grids with buildings
ind_bldgs <- which(!is.na(alldat_shp$GT_mnDG))
# read in boundary of F540 and F550 frames used to create the DPM (these have been merged separately
boundary_dpm <- readRDS(file = "data/DPM_f540550boundary.rds")
# Find index for areas with and without dpm (because we're making separate models for each)
ind_nodpm <- which(is.na(alldat_shp$DPM_bldg_med[ind_bldgs]))
ind_dpm <- which(!is.na(alldat_shp$DPM_bldg_med[ind_bldgs]))
# Calculate the percentage of the total area that is covered by the DPM. We will use this to calculate the proportion of field surveys in and out of the DPM.
p_area_dpm = rgeos::gArea(boundary_dpm)/sum(alldat_shp$Area_deg)
n_grids_dpm <- length(ind_dpm)
n_grids_nodpm <- length(ind_nodpm)

Also, one of our secondary damage data variables (DPM_bldg_med) covers only a subset of our entire area of interest. Because of this, we split our field surveys proportionally according to the percentage of the area that the DPM covers (p_area_dpm = 41.35%). Therefore, of the 100 grids we survey, 41 have DPM values and 59 do not.
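The proportional allocation of surveys can be checked with a line of arithmetic (using p_area_dpm = 0.4135 and nsurvey = 100 from above):

```r
# number of surveyed grids inside the DPM footprint, proportional to its area
n_dpm_surveys <- round(nsurvey * p_area_dpm) # round(100 * 0.4135) = 41
n_nodpm_surveys <- nsurvey - n_dpm_surveys   # 100 - 41 = 59
```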
We use these parameters to create two dataframes:
- pred_shp = The prediction spatial dataframe where we add all our predictions. Note that here, we only predict at the 80200 grids with buildings to speed up computation (but one could predict at all points as well).
- field_sample_shp = The field spatial dataframe (i.e. the training dataframe) that contains the subset of 100 grids at the location of the field surveys.

Here, we also perform a normal score transformation (NST) of the secondary data variables using nscore.
# apply a normal score transformation
pred_shp <- maptools::spCbind(alldat_shp[ind_bldgs,c(id_var, "GT_OUT")],
data.frame(apply(alldat_shp@data[,seconddat_vars],
2, nscore))[ind_bldgs,])
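The nscore function is loaded from code/Functions/nscore.R; one common way to implement a rank-based normal score transformation (a sketch, not necessarily the exact sourced version) is:

```r
# Sketch of a normal score transformation: map each value to the standard
# normal quantile of its plotting-position rank. NAs are passed through.
nscore_sketch <- function(x){
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x)
  n <- sum(ok)
  # (rank - 0.5)/n keeps the quantiles strictly inside (0, 1)
  out[ok] <- qnorm((rank(x[ok]) - 0.5) / n)
  out
}
```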
# field_sample_shp is the subset of locations of the grids that have been field surveyed
set.seed(seed);
ind_dpm_field <- ind_dpm[sample(n_grids_dpm, round(nsurvey*p_area_dpm), replace = F)]
ind_nodpm_field <- ind_nodpm[sample(n_grids_nodpm, (nsurvey - round(nsurvey*p_area_dpm)), replace = F)]
ind_field <- append(ind_dpm_field, ind_nodpm_field)
field_sample_shp <- pred_shp[ind_field,]
# redefine indices for those grids that don't have dpm values in dpm boundary and move to ind_nodpm
ind_nodpm_field <- which(is.na(field_sample_shp$DPM_bldg_med))
ind_dpm_field <- which(!is.na(field_sample_shp$DPM_bldg_med))

As a summary, we will be using the following 100 grids for training our model:
Ideally, we would like to see a relationship between our target variable, GT_mnDG, and the secondary variables. We would also want our field survey sample to reasonably match the distributions of the data from all locations. We can verify this using the following correlation matrix.
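One way to produce such a correlation matrix (a sketch using the dataframes and variable names defined above; GGally is already loaded) is:

```r
library(GGally)
# scatterplot matrix with pairwise correlations between the target variable
# and the (normal-score-transformed) secondary variables at the surveyed grids
GGally::ggpairs(field_sample_shp@data[, c("GT_OUT", seconddat_vars)])
```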