This notebook runs through the “Recommendations” section of the article “Efficacy of damage data integration: a comparative analysis of four major earthquakes.” This article is available at: ⚠️ *link to be added*⚠️. The full GitHub repository for this code is here: https://github.com/sabineloos/GDIF-Gen.

We will be demonstrating this model building process using the New Zealand February 2011 earthquake data. This data is not publicly available, but was made available for this research by New Zealand’s Earthquake Commission and Tonkin + Taylor.

This code goes through one realization of a field survey sample that is randomly sampled throughout the study area. It also builds on previous code to develop G-DIF in Nepal.

Before we build G-DIF with this dataset, we need to do some initial setup to load the required packages, functions, and data for this example. Everything in this initial stage would be necessary to implement G-DIF in real-time, except for developing the field survey sample. The locations of the field surveys would already be identified by field survey teams before a modeler started to develop G-DIF. We also discuss field surveying strategies in the article, for any interested readers.

The functions required to run this code are as follows. These are also included in the GitHub repository.

`logtr`
: Several transformation functions; here we use the `std` function to standardize the x variables.

`FS_sample`
: Develops a field survey sample using various techniques, including a `random` sample, `biased` sample, `clustered` sample, and `largedistance` sample.

`RK_trend`
: Develops the trend function for G-DIF. Several inputs can be specified, including the type of trend model and whether to use stepwise selection.

`OK_parallel`
: A wrapper function to implement Ordinary Kriging using the `gstat` package. It gives the option to run Ordinary Kriging in parallel, which is useful for especially large prediction areas. Here, we use it to krige the residuals of the trend function.

`RK_errormet`
: Calculates several error metrics at specified locations. Here, we use it to calculate the errors of the final G-DIF damage estimate at *all points that were not used to train the models in G-DIF*.

`PlottingFunctions`
: Several plotting functions, typically using the `ggplot2` package.

We also read in the New Zealand data, which has already been prepped by connecting the field-surveyed damage data with multiple secondary datasets. As you can see, this is of class `SpatialPointsDataFrame`, and there were originally 78489 buildings in this dataset.

```
# clear data
rm(list = ls())
# load packages
packages <- c("gstat", "foreach", "dplyr", "sp","raster", "car", "gridExtra")
for (pkg in packages) {
if(!require(pkg, character.only = T)){install.packages(pkg, repos = 'http://cran.us.r-project.org')}; library(pkg, character.only = T)
}
# load functions
functions <- c('OK_parallel','RK_trend','RK_errormet',"FS_sample", "logtr",
'PlottingFunctions')
for (fn in functions) {
source(file = paste0("functions/", fn, ".R"))
}
rm(packages, functions, fn, pkg)
# set date
date <- format(Sys.Date(), "%m%d%y")
# read in the spatial points data frame that contains all primary and secondary new zealand damage data
## this is unfortunately not publicly available
sp.nz <- readRDS("../data/forsim/nz.rds")
sp.nz
```

```
## class : SpatialPointsDataFrame
## features : 78489
## extent : 172.4635, 172.7688, -43.75119, -43.35616 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +ellps=GRS80 +no_defs
## variables : 22
## names : OBJECTID, GT_BDR, MMI_USGS, PGA_USGS, PGV_USGS, PGA_BB, vs30, DEM, PAGER_pcollapse, DPM_med, DPM_ctr, gwdMedian, gwlMedian, SoilType, GT_nbldgs, ...
## min values : 1, 0, 6, 0.16, 10, 0.152428, 163.113876343376, 12.1234645843506, 0, 0, 0, 0.100000001490116, -0.887591123580654, Ai0, 1, ...
## max values : 79221, 0.75, 8.9, 0.88, 119, 0.760193, 675.594543430965, 143.56413269043, 0.0325, 63, 63, 25.8282852171285, 20.6647624969511, ZZZsoil, 1, ...
```

At this point, once a modeler has loaded their prepped dataset, they should identify their independent variables (e.g. the secondary data, or \(X\)) and their dependent variable (e.g. the primary data, or \(Y\)).

The independent variables (\(X\), `xvar`) in New Zealand are:

`MMI_USGS`
: Shaking intensity, measured with Modified Mercalli Intensity (MMI) from the USGS ShakeMap. Any other ground motion estimate can be used here as well.

`DPM_med`
: Damage Proxy Map (DPM), from NASA JPL-ARIA.

`PAGER_pcollapse`
: An engineering forecast of the probability of collapse using a vulnerability curve from USGS.

`vs30`
: An estimate of Vs30 from USGS.

The dependent variable (\(Y\), `yvar`) in New Zealand is:

`GT_BDR`
: Building damage ratio per building,

though several other field-surveyed metrics exist as well.
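Before subsetting the data, the variable names listed above can be collected into `xvar` and `yvar`, which the later code chunks reference. This is a minimal sketch assuming the column names shown in the dataset printout:

```
# independent variables (secondary data, X)
xvar <- c("MMI_USGS", "DPM_med", "PAGER_pcollapse", "vs30")
# dependent variable (primary field-surveyed data, Y)
yvar <- "GT_BDR"
```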

A modeler would not have a full damage dataset, however, like the `sp.nz` dataset. Instead, they would have a prepared dataset *only* at the field survey locations. In this code, we have to simulate this situation, so in the next few code chunks we develop a random field survey sample.

First we define the index for the prediction area, which in this case is all the buildings in `sp.nz` that have all secondary variables, identified through `ind_studyarea`. `nstudyarea` is the number of buildings in `ind_studyarea`.

```
# need to define the prediction area to draw sample from.
ind_studyarea <- which(!is.na(sp.nz$gwdMedian) & !is.na(sp.nz$DPM_med) & !is.na(sp.nz$vs30))
nstudyarea = length(ind_studyarea)
nstudyarea
```

`## [1] 58426`

The prediction area is all x and y variables, at `ind_studyarea`:

```
# Define sp.pred
## sp.pred is the spatial dataframe with all predictors and final variables ONLY at study area index
sp.pred <- sp.nz[ind_studyarea,c("OBJECTID", "fsAREA","GT_nbldgs", yvar, xvar)]
# add GT_OUT to sp.pred, for modeling purposes. This is the same as GT_BDR
sp.pred$GT_OUT <- sp.pred@data[,yvar]
head(sp.pred)
```

We then develop the field survey index using the following parameters:

`pctsurvey`
: fraction of `nstudyarea` that is sampled; in this case we want to sample 100 buildings, or about 0.17% of the data.

`fstype`
: the field sampling strategy, in this case `random`.

`nsurvey`
: the final number of surveys, given `pctsurvey`; in this case `nsurvey` = 100.

```
# parameter definition for the field surveys
pctsurvey = 0.001711567 # 100 bldgs (100/58426)
fstype <- "random"
nsurvey = round(pctsurvey*nstudyarea)
nsurvey
```

`## [1] 100`

We will use the above parameters as input to the `FS_sample` function. This wouldn’t happen in real time, as the field sample would already be defined by the surveyors. `FS_sample` outputs a set of indices to draw from the study area.

```
# set seed
seed = 12; set.seed(seed)
# get indices for field sample
ind_field <- FS_sample(method = fstype, # random
datatosample = sp.pred@data,
nsurvey = nsurvey,
nbldgs = floor(nsurvey*mean(sp.pred$GT_nbldgs, na.rm = T)),
seed = seed,
location = "nz")
head(ind_field)
```

`## [1] 53594 17655 2222 42053 546 53339`

The field survey sample is therefore `sp.pred` at `ind_field`:

Here, we’ll run through several model building steps and diagnostics for building the models in G-DIF. This is organized in the same order as the steps listed in the “Recommendations” section of the paper. At the beginning of each step, we directly quote the recommendation from the paper.

RECOMMENDATION FROM PAPER: The size of the field survey sample will dictate how a modeler trains and evaluates the accuracy of G-DIF’s damage estimate. Methods such as k-fold cross-validation, leave-one-out cross-validation, and bootstrapping can be used for training and validating the models within G-DIF. In addition, with more field data, a modeler can create a separate test set to evaluate G-DIF’s potential prediction performance on unsurveyed locations. Generally, more field data will lead to a more accurate damage estimate with less variation.

Now imagine that we didn’t just define the field survey sample size ourselves, but instead received the sample from the field. We should first check how big it is:

`## [1] 100`

This should be a big enough sample to sufficiently build the trend and spatial correlation models (pending some other checks in the next few steps). Importantly, it is big enough to build these models with a training/validation split or k-fold cross-validation. When there are enough field samples, a test set should also be held out to evaluate the performance of the models within G-DIF. Here, since we know the damage at all other 58326 buildings in the study area, we will just use those for the final evaluation of G-DIF, though in reality we would split the 100 buildings into training and test sets.
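The train/test split described above can be sketched like this; the 80/20 split fraction and the `ind_train`/`ind_test` names are illustrative assumptions, not part of the original analysis:

```
# illustrative 80/20 split of the field survey indices into training and test sets
set.seed(seed)
ind_train <- sample(ind_field, size = round(0.8 * length(ind_field)))
ind_test  <- setdiff(ind_field, ind_train)
length(ind_train); length(ind_test)
```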

We can evaluate some basic summary statistics about the field sample of building damage, including the quartiles, mean, and standard deviation.
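The outputs below were likely produced with something along these lines, using the original `sp.pred`, `ind_field`, and `yvar` objects:

```
# summary statistics of the field-surveyed building damage ratio
summary(sp.pred@data[ind_field, yvar])
sd(sp.pred@data[ind_field, yvar])
```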

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.007155 0.070091 0.122685 0.260221 0.447687 0.750000
```

`## [1] 0.2642037`

We might even want to see the full distribution of values. A simple histogram will do.
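A minimal base-R sketch of such a histogram (the article’s `PlottingFunctions` helpers may use `ggplot2` instead, and the bin count here is arbitrary):

```
# histogram of the field-surveyed building damage ratio
hist(sp.pred@data[ind_field, yvar],
     breaks = 20,
     main = "Field sample of building damage ratio",
     xlab = "Building damage ratio (GT_BDR)")
```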