This notebook runs through the “Recommendations” section of the article “Efficacy of damage data integration: a comparative analysis of four major earthquakes.” This article is available at: ⚠️ link to be added⚠️. The full Github repository for this code is here: https://github.com/sabineloos/GDIF-Gen.

We will be demonstrating this model building process using the New Zealand February 2011 earthquake data. This data is not publicly available, but was made available for this research by New Zealand’s Earthquake Commission and Tonkin + Taylor.

This code goes through one realization of a field survey sample that is randomly sampled throughout the study area. It also builds on previous code to develop G-DIF in Nepal.

Setup G-DIF code and data

Before we build G-DIF with this dataset, we need to do some initial setup to load the required packages, functions, and data for this example. Everything in this initial stage would be necessary to implement G-DIF in real-time, except for developing the field survey sample. The locations of the field surveys would already be identified by field survey teams before a modeler started to develop G-DIF. We also discuss field surveying strategies in the article, for any interested readers.

The functions that are necessary to carry out this code are as follows. These are also included in the Github repository.

  • logtr: Several transformation functions, here we use the std function to standardize x variables.
  • FS_sample: This function develops a field survey sample using various techniques, including a random sample, biased sample, clustered sample, and largedistance sample.
  • RK_trend: This is a function to develop the trend function for G-DIF. Several inputs can be identified, including the type of trend model and whether to use stepwise selection.
  • OK_parallel: This is a wrapper function to implement Ordinary Kriging using the gstat package. It gives the option to run Ordinary Kriging in parallel, which is useful for especially large prediction areas. Here, we use it to krige the residuals of the trend function.
  • RK_errormet: This is a function to calculate several error metrics at certain locations. Here, we use it to calculate the errors of the final G-DIF damage estimate at all points that were not used to train the models in G-DIF.
  • PlottingFunctions: These include several plotting functions, typically using the ggplot2 package

We also read in the New Zealand data, which has already been prepped by connecting the field surveyed damage data with multiple secondary datasets. As you can see, this is of class SpatialPointsDataFrame and there were originally 78489 buildings in this dataset.

## class       : SpatialPointsDataFrame 
## features    : 78489 
## extent      : 172.4635, 172.7688, -43.75119, -43.35616  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +ellps=GRS80 +no_defs 
## variables   : 22
## names       : OBJECTID, GT_BDR, MMI_USGS, PGA_USGS, PGV_USGS,   PGA_BB,             vs30,              DEM, PAGER_pcollapse, DPM_med, DPM_ctr,         gwdMedian,          gwlMedian, SoilType, GT_nbldgs, ... 
## min values  :        1,      0,        6,     0.16,       10, 0.152428, 163.113876343376, 12.1234645843506,               0,       0,       0, 0.100000001490116, -0.887591123580654,      Ai0,         1, ... 
## max values  :    79221,   0.75,      8.9,     0.88,      119, 0.760193, 675.594543430965,  143.56413269043,          0.0325,      63,      63,  25.8282852171285,   20.6647624969511,  ZZZsoil,         1, ...

Define variables

At this point, once a modeler has loaded their prepped dataset, they should identify their independent variables (e.g. the secondary data, or \(X\)) and their dependent variables (e.g. the primary data, or \(Y\))

The independent variables (\(X\), xvar) in New Zealand are:

  • MMI_USGS = Shaking intensity, measured with Modified Mercalli Intensity (MMI) from USGS’s ShakeMap. Any other ground motion estimate can be used here as well.
  • DPM_med = Damage Proxy Map (DPM), from NASA JPL-ARIA
  • PAGER_pcollapse = An engineering forecast of the probability of collapse using a vulnerability curve from USGS
  • vs30 = An estimate of Vs30 from USGS.

The ependent variable (\(Y\), yvar) in New Zealand is:

  • GT_BDR = Building damage ratio per building

though several other field surveyed metrics exist as well.

Field survey sample

A modeler would not have a full damage dataset however, like the sp.nz dataset. Instead, they would have a prepared dataset only at the field survey locations. In this code, we have to simulate this situation, so in the next few code chunks we develop a random field survey sample.

First we define the index for the prediction area, which in this case is all the buildings in sp.nz that have all secondary variables, identified through ind_studyarea. nstudyarea is the number of buildings in ind_studyarea.

## [1] 58426

The prediction area is all x and y variables, at ind_studyarea:

We then develop the field survey index using the following parameters:

  • pctsurvey = fraction of nstudyarea that is sampled, in this case we want to sample 100 buildings, or 0.17% of the data.
  • fstype = the field sampling strategy, in this case random
  • nsurvey = the final number of surveys, given pctsurvey, in this case nsurvey = 100.
## [1] 100

We will use these above parameters as an input into the FS_sample function. This wouldn’t happen in real time, as the field sample would already be defined by the surveyors. FS_sample outputs a set of indices to draw from the study area.

## [1] 53594 17655  2222 42053   546 53339

The field survey sample is therefore sp.pred at ind_field:

G-DIF Diagnostics

Here, we’ll run through several model building steps and diagnostics for building the models in G-DIF. This is organized in the same order as the steps listed in the “Recommendations” section of the paper. At the beginning of each step, we directly quote the recommendation from the paper.

1. Evaluate field survey sample size

RECOMMENDATION FROM PAPER : The size of the field survey sample will dictate how a modeler trains and evaluates the accuracy of G-DIF’s damage estimate. Methods such as k-fold cross-validation, leave-one-out cross-validation, and bootstrapping can be used for training and validating the models within G-DIF. In addition, with more field data, a modeler can create a separate test set to evaluate G-DIF’s potential prediction performance on unsurveyed locations. Generally, more field data will lead to a more accurate damage estimate with less variation.

Now imagining that we didn’t just define the field survey sample size… and we got it from the field. We should first see how big the sample is:

## [1] 100

This size should be a big enough sample to create to sufficiently build the trend and spatial correlation models (pending some other checks in the next few steps). Importantly, it is big enough to build these models with a training/validation set, or k-fold cross-validation. Importantly, a test set should be held out, when there enough field samples, to evaluate the performance of the models within G-DIF. Here, since we know the damage at all other 58326 buildings in the study area, we will just use these for final evaluation of G-DIF, though in reality, we would split the 100 buildings into training and test sets.

We can evaluate some basic summary statistics about the field sample of building damage, including the quartiles, mean, and standard deviation.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.007155 0.070091 0.122685 0.260221 0.447687 0.750000
## [1] 0.2642037
We might even want to see the full distribution of values. A simple histogram will do.