This notebook runs through the “Recommendations” section of the article “Efficacy of damage data integration: a comparative analysis of four major earthquakes.” This article is available at: ⚠️ link to be added⚠️. The full Github repository for this code is here: https://github.com/sabineloos/GDIF-Gen.

We will be demonstrating this model building process using the New Zealand February 2011 earthquake data. This data is not publicly available, but was made available for this research by New Zealand’s Earthquake Commission and Tonkin + Taylor.

This code goes through one realization of a field survey sample that is randomly sampled throughout the study area. It also builds on previous code to develop G-DIF in Nepal.

Setup G-DIF code and data

Before we build G-DIF with this dataset, we need to do some initial setup to load the required packages, functions, and data for this example. Everything in this initial stage would be necessary to implement G-DIF in real-time, except for developing the field survey sample. The locations of the field surveys would already be identified by field survey teams before a modeler started to develop G-DIF. We also discuss field surveying strategies in the article, for any interested readers.

The functions that are necessary to carry out this code are as follows. These are also included in the Github repository.

  • logtr: Several transformation functions, here we use the std function to standardize x variables.
  • FS_sample: This function develops a field survey sample using various techniques, including a random sample, biased sample, clustered sample, and largedistance sample.
  • RK_trend: This is a function to develop the trend function for G-DIF. Several inputs can be identified, including the type of trend model and whether to use stepwise selection.
  • OK_parallel: This is a wrapper function to implement Ordinary Kriging using the gstat package. It gives the option to run Ordinary Kriging in parallel, which is useful for especially large prediction areas. Here, we use it to krige the residuals of the trend function.
  • RK_errormet: This is a function to calculate several error metrics at certain locations. Here, we use it to calculate the errors of the final G-DIF damage estimate at all points that were not used to train the models in G-DIF.
  • PlottingFunctions: These include several plotting functions, typically using the ggplot2 package

We also read in the New Zealand data, which has already been prepped by connecting the field surveyed damage data with multiple secondary datasets. As you can see, this is of class SpatialPointsDataFrame and there were originally 78489 buildings in this dataset.

## class       : SpatialPointsDataFrame 
## features    : 78489 
## extent      : 172.4635, 172.7688, -43.75119, -43.35616  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +ellps=GRS80 +no_defs 
## variables   : 22
## names       : OBJECTID, GT_BDR, MMI_USGS, PGA_USGS, PGV_USGS,   PGA_BB,             vs30,              DEM, PAGER_pcollapse, DPM_med, DPM_ctr,         gwdMedian,          gwlMedian, SoilType, GT_nbldgs, ... 
## min values  :        1,      0,        6,     0.16,       10, 0.152428, 163.113876343376, 12.1234645843506,               0,       0,       0, 0.100000001490116, -0.887591123580654,      Ai0,         1, ... 
## max values  :    79221,   0.75,      8.9,     0.88,      119, 0.760193, 675.594543430965,  143.56413269043,          0.0325,      63,      63,  25.8282852171285,   20.6647624969511,  ZZZsoil,         1, ...

Define variables

At this point, once a modeler has loaded their prepped dataset, they should identify their independent variables (e.g. the secondary data, or \(X\)) and their dependent variables (e.g. the primary data, or \(Y\))

The independent variables (\(X\), xvar) in New Zealand are:

  • MMI_USGS = Shaking intensity, measured with Modified Mercalli Intensity (MMI) from USGS’s ShakeMap. Any other ground motion estimate can be used here as well.
  • DPM_med = Damage Proxy Map (DPM), from NASA JPL-ARIA
  • PAGER_pcollapse = An engineering forecast of the probability of collapse using a vulnerability curve from USGS
  • vs30 = An estimate of Vs30 from USGS.

The ependent variable (\(Y\), yvar) in New Zealand is:

  • GT_BDR = Building damage ratio per building

though several other field surveyed metrics exist as well.

Field survey sample

A modeler would not have a full damage dataset however, like the sp.nz dataset. Instead, they would have a prepared dataset only at the field survey locations. In this code, we have to simulate this situation, so in the next few code chunks we develop a random field survey sample.

First we define the index for the prediction area, which in this case is all the buildings in sp.nz that have all secondary variables, identified through ind_studyarea. nstudyarea is the number of buildings in ind_studyarea.

## [1] 58426

The prediction area is all x and y variables, at ind_studyarea:

We then develop the field survey index using the following parameters:

  • pctsurvey = fraction of nstudyarea that is sampled, in this case we want to sample 100 buildings, or 0.17% of the data.
  • fstype = the field sampling strategy, in this case random
  • nsurvey = the final number of surveys, given pctsurvey, in this case nsurvey = 100.
## [1] 100

We will use these above parameters as an input into the FS_sample function. This wouldn’t happen in real time, as the field sample would already be defined by the surveyors. FS_sample outputs a set of indices to draw from the study area.

## [1] 53594 17655  2222 42053   546 53339

The field survey sample is therefore sp.pred at ind_field:

G-DIF Diagnostics

Here, we’ll run through several model building steps and diagnostics for building the models in G-DIF. This is organized in the same order as the steps listed in the “Recommendations” section of the paper. At the beginning of each step, we directly quote the recommendation from the paper.

1. Evaluate field survey sample size

RECOMMENDATION FROM PAPER : The size of the field survey sample will dictate how a modeler trains and evaluates the accuracy of G-DIF’s damage estimate. Methods such as k-fold cross-validation, leave-one-out cross-validation, and bootstrapping can be used for training and validating the models within G-DIF. In addition, with more field data, a modeler can create a separate test set to evaluate G-DIF’s potential prediction performance on unsurveyed locations. Generally, more field data will lead to a more accurate damage estimate with less variation.

Now imagining that we didn’t just define the field survey sample size… and we got it from the field. We should first see how big the sample is:

## [1] 100

This size should be a big enough sample to create to sufficiently build the trend and spatial correlation models (pending some other checks in the next few steps). Importantly, it is big enough to build these models with a training/validation set, or k-fold cross-validation. Importantly, a test set should be held out, when there enough field samples, to evaluate the performance of the models within G-DIF. Here, since we know the damage at all other 58326 buildings in the study area, we will just use these for final evaluation of G-DIF, though in reality, we would split the 100 buildings into training and test sets.

We can evaluate some basic summary statistics about the field sample of building damage, including the quartiles, mean, and standard deviation.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.007155 0.070091 0.122685 0.260221 0.447687 0.750000
## [1] 0.2642037
We might even want to see the full distribution of values. A simple histogram will do.
Field surveyed damage

Field surveyed damage

You’ll notice that the distribution of field surveyed damage in New Zealand is Right-Skewed and has a spike at 0.75. The right-skew is not always the case, sometimes, the distribution can be left-skewed, as was seen in Nepal. The spike at 0.75 was due to how the building damage ratio was assessed in New Zealand, again this will not always be the case. Here, surveyors marked all collapsed buildings as 0.75, which was the reason for this spike.

2. Establish boundary of prediction area

RECOMMENDATION FROM PAPER : In real time, the modeler would establish the spatial boundary for the forward prediction of damage using G-DIF. Here, we draw the boundaries based on the regions where all of the secondary datasets are available. In Haiti, New Zealand, and Nepal, this meant constraining the prediction area to the footprint of the remote sensing-derived data. However, it is likely that a damage estimate would be needed outside of this footprint. An initial boundary could be the area of strong shaking in the USGS ShakeMap, and users on the ground can help refine this by considering areas they will focus on during response and recovery. To accommodate datasets that are not available throughout the entire prediction area, separate trend models can be built for multiple subregions. Alternatively, one can integrate these datasets using Bayesian methods. However, developing explicit methods to do this in a spatial manner requires future research.

In this case, we have already defined the prediction area as the area where we have all secondary datasets. Like what is stated in the recommendation, this is a big point for the modeler. They should establish this boundary based on both available data and through discussion with stakeholders on where they would like to know damage.

Here we can plot sp.pred, the study area, with the ward boundaries of Christchurch, NZ.

We can see that the current prediction area covers most of the central wards of Christchurch. However, a modeler could define a different prediction area depending on the needs of stakeholders. Here, we keep this prediction area (even though it misses some wards), because we can use these gray buildings for validation later.

3. Assess spatial distribution of field sample

RECOMMENDATION FROM PAPER : Combining the previous two steps, modelers should evaluate whether the field sample is spatially distributed across the prediction area. Highly clustered field samples will converge to the trend model in regions outside of the field surveyed area. Field surveys collected at distances greater than the expected spatial correlation may result in a ``nugget’’ variogram, or a variogram with constant semivariance at all distances. In both cases, modelers should focus on building a predictive trend model that accurately reflects the relationships between the field survey samples and secondary data (discussed in step 5).

The first thing we should do is map our field survey sample, sp.field, with relation to the study area, sp.pred. As you can see below, the black dots are the locations of the buildings in the field survey sample, and the gray dots are the locations of all the buildings in our study area (where we want to predict the results of G-DIF).