This notebook runs through the “Recommendations” section of the article “Efficacy of damage data integration: a comparative analysis of four major earthquakes.” This article is available at: ⚠️ link to be added⚠️. The full Github repository for this code is here: https://github.com/sabineloos/GDIF-Gen.

We will be demonstrating this model building process using the New Zealand February 2011 earthquake data. This data is not publicly available, but was made available for this research by New Zealand’s Earthquake Commission and Tonkin + Taylor.

This code goes through one realization of a field survey sample that is randomly sampled throughout the study area. It also builds on previous code to develop G-DIF in Nepal.

Setup G-DIF code and data

Before we build G-DIF with this dataset, we need to do some initial setup to load the required packages, functions, and data for this example. Everything in this initial stage would be necessary to implement G-DIF in real-time, except for developing the field survey sample. The locations of the field surveys would already be identified by field survey teams before a modeler started to develop G-DIF. We also discuss field surveying strategies in the article, for any interested readers.

The functions that are necessary to carry out this code are as follows. These are also included in the Github repository.

  • logtr: Several transformation functions, here we use the std function to standardize x variables.
  • FS_sample: This function develops a field survey sample using various techniques, including a random sample, biased sample, clustered sample, and largedistance sample.
  • RK_trend: This is a function to develop the trend function for G-DIF. Several inputs can be identified, including the type of trend model and whether to use stepwise selection.
  • OK_parallel: This is a wrapper function to implement Ordinary Kriging using the gstat package. It gives the option to run Ordinary Kriging in parallel, which is useful for especially large prediction areas. Here, we use it to krige the residuals of the trend function.
  • RK_errormet: This is a function to calculate several error metrics at certain locations. Here, we use it to calculate the errors of the final G-DIF damage estimate at all points that were not used to train the models in G-DIF.
  • PlottingFunctions: These include several plotting functions, typically using the ggplot2 package

We also read in the New Zealand data, which has already been prepped by connecting the field surveyed damage data with multiple secondary datasets. As you can see, this is of class SpatialPointsDataFrame and there were originally 78489 buildings in this dataset.

## class       : SpatialPointsDataFrame 
## features    : 78489 
## extent      : 172.4635, 172.7688, -43.75119, -43.35616  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +ellps=GRS80 +no_defs 
## variables   : 22
## names       : OBJECTID, GT_BDR, MMI_USGS, PGA_USGS, PGV_USGS,   PGA_BB,             vs30,              DEM, PAGER_pcollapse, DPM_med, DPM_ctr,         gwdMedian,          gwlMedian, SoilType, GT_nbldgs, ... 
## min values  :        1,      0,        6,     0.16,       10, 0.152428, 163.113876343376, 12.1234645843506,               0,       0,       0, 0.100000001490116, -0.887591123580654,      Ai0,         1, ... 
## max values  :    79221,   0.75,      8.9,     0.88,      119, 0.760193, 675.594543430965,  143.56413269043,          0.0325,      63,      63,  25.8282852171285,   20.6647624969511,  ZZZsoil,         1, ...

Define variables

At this point, once a modeler has loaded their prepped dataset, they should identify their independent variables (e.g. the secondary data, or \(X\)) and their dependent variables (e.g. the primary data, or \(Y\))

The independent variables (\(X\), xvar) in New Zealand are:

  • MMI_USGS = Shaking intensity, measured with Modified Mercalli Intensity (MMI) from USGS’s ShakeMap. Any other ground motion estimate can be used here as well.
  • DPM_med = Damage Proxy Map (DPM), from NASA JPL-ARIA
  • PAGER_pcollapse = An engineering forecast of the probability of collapse using a vulnerability curve from USGS
  • vs30 = An estimate of Vs30 from USGS.

The ependent variable (\(Y\), yvar) in New Zealand is:

  • GT_BDR = Building damage ratio per building

though several other field surveyed metrics exist as well.

Field survey sample

A modeler would not have a full damage dataset however, like the sp.nz dataset. Instead, they would have a prepared dataset only at the field survey locations. In this code, we have to simulate this situation, so in the next few code chunks we develop a random field survey sample.

First we define the index for the prediction area, which in this case is all the buildings in sp.nz that have all secondary variables, identified through ind_studyarea. nstudyarea is the number of buildings in ind_studyarea.

## [1] 58426

The prediction area is all x and y variables, at ind_studyarea:

We then develop the field survey index using the following parameters:

  • pctsurvey = fraction of nstudyarea that is sampled, in this case we want to sample 100 buildings, or 0.17% of the data.
  • fstype = the field sampling strategy, in this case random
  • nsurvey = the final number of surveys, given pctsurvey, in this case nsurvey = 100.
## [1] 100

We will use these above parameters as an input into the FS_sample function. This wouldn’t happen in real time, as the field sample would already be defined by the surveyors. FS_sample outputs a set of indices to draw from the study area.

## [1] 53594 17655  2222 42053   546 53339

The field survey sample is therefore sp.pred at ind_field:

G-DIF Diagnostics

Here, we’ll run through several model building steps and diagnostics for building the models in G-DIF. This is organized in the same order as the steps listed in the “Recommendations” section of the paper. At the beginning of each step, we directly quote the recommendation from the paper.

1. Evaluate field survey sample size

RECOMMENDATION FROM PAPER : The size of the field survey sample will dictate how a modeler trains and evaluates the accuracy of G-DIF’s damage estimate. Methods such as k-fold cross-validation, leave-one-out cross-validation, and bootstrapping can be used for training and validating the models within G-DIF. In addition, with more field data, a modeler can create a separate test set to evaluate G-DIF’s potential prediction performance on unsurveyed locations. Generally, more field data will lead to a more accurate damage estimate with less variation.

Now imagining that we didn’t just define the field survey sample size… and we got it from the field. We should first see how big the sample is:

## [1] 100

This size should be a big enough sample to create to sufficiently build the trend and spatial correlation models (pending some other checks in the next few steps). Importantly, it is big enough to build these models with a training/validation set, or k-fold cross-validation. Importantly, a test set should be held out, when there enough field samples, to evaluate the performance of the models within G-DIF. Here, since we know the damage at all other 58326 buildings in the study area, we will just use these for final evaluation of G-DIF, though in reality, we would split the 100 buildings into training and test sets.

We can evaluate some basic summary statistics about the field sample of building damage, including the quartiles, mean, and standard deviation.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.007155 0.070091 0.122685 0.260221 0.447687 0.750000
## [1] 0.2642037
We might even want to see the full distribution of values. A simple histogram will do.
Field surveyed damage

Field surveyed damage

You’ll notice that the distribution of field surveyed damage in New Zealand is Right-Skewed and has a spike at 0.75. The right-skew is not always the case, sometimes, the distribution can be left-skewed, as was seen in Nepal. The spike at 0.75 was due to how the building damage ratio was assessed in New Zealand, again this will not always be the case. Here, surveyors marked all collapsed buildings as 0.75, which was the reason for this spike.

2. Establish boundary of prediction area

RECOMMENDATION FROM PAPER : In real time, the modeler would establish the spatial boundary for the forward prediction of damage using G-DIF. Here, we draw the boundaries based on the regions where all of the secondary datasets are available. In Haiti, New Zealand, and Nepal, this meant constraining the prediction area to the footprint of the remote sensing-derived data. However, it is likely that a damage estimate would be needed outside of this footprint. An initial boundary could be the area of strong shaking in the USGS ShakeMap, and users on the ground can help refine this by considering areas they will focus on during response and recovery. To accommodate datasets that are not available throughout the entire prediction area, separate trend models can be built for multiple subregions. Alternatively, one can integrate these datasets using Bayesian methods. However, developing explicit methods to do this in a spatial manner requires future research.

In this case, we have already defined the prediction area as the area where we have all secondary datasets. Like what is stated in the recommendation, this is a big point for the modeler. They should establish this boundary based on both available data and through discussion with stakeholders on where they would like to know damage.

Here we can plot sp.pred, the study area, with the ward boundaries of Christchurch, NZ.

We can see that the current prediction area covers most of the central wards of Christchurch. However, a modeler could define a different prediction area depending on the needs of stakeholders. Here, we keep this prediction area (even though it misses some wards), because we can use these gray buildings for validation later.

3. Assess spatial distribution of field sample

RECOMMENDATION FROM PAPER : Combining the previous two steps, modelers should evaluate whether the field sample is spatially distributed across the prediction area. Highly clustered field samples will converge to the trend model in regions outside of the field surveyed area. Field surveys collected at distances greater than the expected spatial correlation may result in a ``nugget’’ variogram, or a variogram with constant semivariance at all distances. In both cases, modelers should focus on building a predictive trend model that accurately reflects the relationships between the field survey samples and secondary data (discussed in step 5).

The first thing we should do is map our field survey sample, sp.field, with relation to the study area, sp.pred. As you can see below, the black dots are the locations of the buildings in the field survey sample, and the gray dots are the locations of all the buildings in our study area (where we want to predict the results of G-DIF).

Field survey map (black) compared to full study area (gray)

Field survey map (black) compared to full study area (gray)

We notice that the 100 points in our field survey are sufficiently scattered throughout the prediction area, and are not clustered, so hopefully it captures a good distribution of the secondary data (but we’ll be checking that in the next step).

However, are the points are at a close enough distance to build a variogram for the spatial correlation model? We can evaluate this by looking at the distance matrix of the separation distances between the field surveyed points. Here we can see the distances between the first five points in the field survey sample:

##           [,1]      [,2]      [,3]     [,4]      [,5]
## [1,]        NA  1.169016  5.080011 7.574704 14.302552
## [2,]  1.169016        NA  5.344604 6.727222 13.906492
## [3,]  5.080011  5.344604        NA 6.365345 10.147449
## [4,]  7.574704  6.727222  6.365345       NA  8.359726
## [5,] 14.302552 13.906492 10.147449 8.359726        NA

It might be easier to just see the distribution of separation distances in our distance matrix though:

Distances between field surveyed buildings

Distances between field surveyed buildings

We can see that the majority of separation distances are between 0-10km. This is often within the range of spatial correlation of trend residuals, so we have enough points at close enough distances to build our variogram later.

4. Compare field survey sample values to other locations

RECOMMENDATION FROM PAPER : The modeler should compare values of secondary data at the field surveyed locations to the full distribution of secondary data at all locations. As much as possible, the variance of the secondary data at the field surveyed locations should reflect the variance over the entire affected region. Otherwise, the trend estimate may not extrapolate well outside of sample distribution’s range. This might occur if field surveys are clustered so that there is little variation in the secondary data. This could also be due to datasets available at too low of a resolution, though this is unlikely with the datasets represented in this study. If this occurs, additional field surveys should be collected, if possible, for a wider range of secondary data values.

Remember how we said that the field survey should capture much of the secondary data? We should also analytically check this. We could potentially just compare the summary statistics of the secondary data at the field survey locations…

##     MMI_USGS        DPM_med       PAGER_pcollapse       vs30      
##  Min.   :6.600   Min.   : 0.000   Min.   :0.0000   Min.   :167.3  
##  1st Qu.:7.800   1st Qu.: 1.000   1st Qu.:0.0080   1st Qu.:182.5  
##  Median :8.200   Median : 5.966   Median :0.0150   Median :188.2  
##  Mean   :8.094   Mean   : 8.746   Mean   :0.0151   Mean   :193.2  
##  3rd Qu.:8.400   3rd Qu.:13.985   3rd Qu.:0.0200   3rd Qu.:192.8  
##  Max.   :8.900   Max.   :38.262   Max.   :0.0325   Max.   :362.3

to those same summary statistics over the entire prediction area:

##     MMI_USGS        DPM_med       PAGER_pcollapse        vs30      
##  Min.   :6.400   Min.   : 0.000   Min.   :0.00000   Min.   :163.1  
##  1st Qu.:7.800   1st Qu.: 1.000   1st Qu.:0.00800   1st Qu.:182.5  
##  Median :8.200   Median : 5.321   Median :0.01500   Median :188.6  
##  Mean   :8.067   Mean   : 9.478   Mean   :0.01448   Mean   :190.8  
##  3rd Qu.:8.400   3rd Qu.:15.000   3rd Qu.:0.02000   3rd Qu.:193.8  
##  Max.   :8.900   Max.   :63.000   Max.   :0.03250   Max.   :675.6

But, again it might be easier to visualize these distributions. Here, we plot the histograms of all the secondary data (or the xvar) at the field survey locations compared to the study area.

Comparison between field survey locations and full study aea of the distributions of all secondary data

Comparison between field survey locations and full study aea of the distributions of all secondary data

Now we can see that the field sample indeed covers the same ranges as the full study area of the secondary data. Therefore, we have more confidence in using these samples and secondary data to build G-DIF.

One note here, is that we are planning to try out Ordinary Least Squares (OLS) for our trend model. We’d like to be able to compare our coefficients from the OLS model, so we also standardize these variables to have mean 0 and sd of 1.

##     MMI_USGS           DPM_med         PAGER_pcollapse         vs30        
##  Min.   :-2.80514   Min.   :-0.90329   Min.   :-1.62472   Min.   :-1.2265  
##  1st Qu.:-0.51088   1st Qu.:-0.80798   1st Qu.:-0.72685   1st Qu.:-0.4325  
##  Median : 0.25388   Median :-0.33466   Median : 0.05879   Median :-0.1348  
##  Mean   : 0.05122   Mean   :-0.06975   Mean   : 0.07001   Mean   : 0.1268  
##  3rd Qu.: 0.63625   3rd Qu.: 0.42953   3rd Qu.: 0.61995   3rd Qu.: 0.1044  
##  Max.   : 1.59219   Max.   : 2.74336   Max.   : 2.02287   Max.   : 8.9625

5. Explore relationships between primary and secondary data

RECOMMENDATION FROM PAPER :When building the trend model, the modeler should incorporate the relationships that exist between each secondary dataset and the field survey sample. Many relationships between the secondary data and the field data are close to linear. One can evaluate this by examining the moving average of the field damage at various bins of each secondary data, or creating a ``loess" curve. Nonlinear trends can be considered by transforming the secondary dataset using a nonlinear trend model.

When figuring out how to define the trend model, we should look at the relationships between each secondary dataset and the field surveyed damage using our field dataset. One simple way to do this is to look at scatterplots between the two and see what relationship exists. Here, we plot the scatterplots between the yvar, GT_BDR on the y-axis and each xvar on the x-axis. Then we look at their relationships through both a moving average or “loess” curve and a linear model.

Assessing relationships between field damage and each secondary data source

Assessing relationships between field damage and each secondary data source

As we can see here, many of these variables exhibit positive relationships with the groundtruth, as expected. This is good, otherwise we might consider manually removing that variable. We can also see that in many of these cases, a linear relationship might be sufficient for each predictor (though one could explore defining other nonlinear relationships as well depending ont he results of these exploratory plots).

6. Address redundant data

RECOMMENDATION FROM PAPER : Some damage datasets may be collinear (e.g., an engineering forecast may be closely aligned with a ShakeMap). Collinearity can lead to unreliable estimates of the coefficients for each secondary dataset and to overfitting. Ways to address collinearity are by evaluating the variance inflation factor before modeling, applying variable selection techniques, or aggregating collinear variables using principal component analysis.

We should also get a handle on our predictor variables themselves, and evaluate whether any of them might exhibit collinearity. This will pose issues when fitting the trend model.

First, we can just do some visual comparisons of each predictor by mapping them out. Remember these are the transformed variable (they were standardized earlier)/

Visual comparisons of all secondary data

Visual comparisons of all secondary data

We can see just from these maps that the Shaking Intensity and the Engineering Forecast might be highly correlated just from their similarities in their spatial patterns. This is because the Engineering forecast is essentially a function of the Shaking Intensity.

A more quantitative way to evaluate this is through the Variance Inflation Factor (VIF). The VIF is a metric that captures the multicollinearity between variables through the following equation:

\(VIF = \frac{1}{1-R_i^2}\) where \(i\) is the variable of interest. VIF’s greater than 5 are problematic and VIF’s greater than 10 should be addressed.

To evaluate the VIF for our secondary datasets in R, we first have to create a linear model that uses all of our xvar.

## GT_BDR ~ MMI_USGS + DPM_med + PAGER_pcollapse + vs30
## 
## Call:
## lm(formula = form, data = sp.field@data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52857 -0.11629 -0.03722  0.08960  0.51565 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.269963   0.020491  13.175  < 2e-16 ***
## MMI_USGS         0.060344   0.066117   0.913    0.364    
## DPM_med          0.174928   0.024649   7.097 2.29e-10 ***
## PAGER_pcollapse -0.015746   0.067468  -0.233    0.816    
## vs30             0.003714   0.014800   0.251    0.802    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2015 on 95 degrees of freedom
## Multiple R-squared:  0.4418, Adjusted R-squared:  0.4183 
## F-statistic:  18.8 on 4 and 95 DF,  p-value: 2.061e-11

Then we can evaluate the variance inflation factor (VIF) using the car package.

##        MMI_USGS         DPM_med PAGER_pcollapse            vs30 
##       11.214211        1.160541       11.331673        1.051380

The VIF’s here confirm our suspicion. The VIF for the Shaking intensity and the Engineering Forecast are both over 10, meaning these two variables exhibit high collinearity with another variable which should be addressed in our trend model.

Another way to evaluate collinearity between secondary data is through the correlation matrix.

##                   MMI_USGS   DPM_med PAGER_pcollapse       vs30
## MMI_USGS         1.0000000 0.3220170       0.9543410 -0.1406438
## DPM_med          0.3220170 1.0000000       0.3350945  0.1124256
## PAGER_pcollapse  0.9543410 0.3350945       1.0000000 -0.1402013
## vs30            -0.1406438 0.1124256      -0.1402013  1.0000000

Again, we see that the engineering forecast and the shaking intensity have a high correlation (0.954341), confirming the need to address this in the trend model.

Methods to address collinearity

This collinearity between the shaking intensity and the engineering forecast can pose problems with obtaining accurate coefficient estimates. There are multiple ways to address this, as described in the earlier recommendation. Here, we will implement mixed stepwise selection when training the trend model. We could also manually remove one of the variables, depending on our level of trust in the data.

7. Build and refine the trend model

RECOMMENDATION FROM PAPER : After building the trend model, the modeler should check the coefficients for each secondary dataset. A coefficient opposite from expectations (for example, predictions of decreasing damage with increasing ground shaking intensity), may indicate that the secondary dataset is unreliable or that outliers exist and should be removed from the model. Otherwise the directions of the coefficients should reflect the relationships seen in Step 5. To evaluate the relative utility of each secondary dataset, one should ensure that each variable is standardized to ensure that coefficients are comparable. The coefficients and standard errors of each variable in the trend model will provide intuition behind which secondary dataset has the most influence on the trend estimate.

Now that we’ve thoroughly explored our data, we can start building the models in G-DIF. First, we need to build the trend model that connects the x and y variables.Here we use RK_trend to build the trend model. At this stage, we have an opportunity to choose the best trend model through cross-validation, which we did in the previous investigation. To see more of that, please visit here. One could compare an Ordinary Least Squares, to a Generalized Least Squares, to potentially even other regression models. However, for the purpose of this example we stick to the OLS.

Now that we’ve built our trend model, we want to make sure that its results make sense. Let’s take a look at the coefficients found in the model and their standard errors. We want to make sure these make sense and match the directions we’d expect and that we found in our exploratory scatterplots.

All coefficients look OK, they exhibit positive relationships, which is both what we expected and what we saw in our exploratory analysis. You’ll also noticed that two variables were removed through the stepwise selection: the engineering forecast PAGER_pcollapse and vs30. This addresses the multicollinearity found between MMI_USGS and PAGER_pcollapse and the previous step, and also the fact that vs30 was a non-predictive dataset (we can see this through the scatterplots in Step 5).

And just to make sure, let’s take a look at the map of the trend prediction.

Since the next step in G-DIF is to use the residuals of the trend model to develop the spatial correlation model, we should take a look at these residuals. Let’s first see a histogram of these residuals.

Histogram of the trend residuals, vertical line is the mean

Histogram of the trend residuals, vertical line is the mean

These residuals look pretty good! The mean of the residuals is at 0, which is what we’d hope. An assumption for kriging is that the data we are kriging is Gaussian. Our histogram looks like our residuals might be normal, but we can develop another plot to compare our distribution above to the theoretical distribution of the Normal distribution. We do this by developing the QQ-plot, which compares our empirical cumlative distribution to that of a normal distribution with the same mean and standard deviation.

QQ-plot of the trend residuals

QQ-plot of the trend residuals

This seems good! As you can see, our data (the black dots) lies along the 1-1 line, which is what we’d hope for if our data would match the theoretical Normal distribution.

In the case of non-Gaussian residuals, we can transform the residuals and continue to model in the Gaussian space. However, because our residuals resemble a normal distribution, this is not necessary. One method to transform your residuals is through the normal score transformation. We provide code for this in the functions folder called nscore.R.

8. Build and refine the spatial correlation model

RECOMMENDATION FROM PAPER : The choice of variogram and kriging method affects the final spatial pattern of damage, especially in situations where the secondary data are poor predictors of the true damage. In addition to selecting the variogram based on best fit, modelers should also consider the expected spatial pattern that result from the selected variogram. If the fitted variogram exhibits a trend (or the semivariance increases with distance), the field surveyed damage may not have been successfully detrended with the trend estimate or perhaps the selected variogram model is too flexible. In this case, it might be preferable to consider local kriging, where only the closest surveyed points to the unsurveyed location are considered when making the kriging prediction. If a nugget variogram is fit to the detrended survey points, the trend model may have fully captured the spatial correlation in damage. The modeler should compare the variogram fit to the detrended field surveys with the variogram fit to the original field surveys, to ensure the nugget variogram is not arising from a lack of closely spaced field data (as discussed in Step 3).

Now that we’ve checked the assumptions for kriging in the previous step, we can start to develop our spatial correlation model. First we can develop a variogram of the residuals. Here, we compare three different variogram models, the Exponential, the Spherical, and the Matern, using cross-validation. We see that the Matern model fits best.

We should however, visually evaluate the variogram when compared to the empirical variogram. We can also compare the original variogram (of the field surveyed damage) to the residual variogram, to see if we successfully detrended the data.

We can see how by detrending our data, our variogram has become much more defined and the range has reduced.

Now we can krige our residuals using the OK_parallel function and vgm.resid. Here we perform local kriging, partly because the New Zealand dataset is pretty large. One could use cross-validation to decide and nmin and nmax. Here, we manually define these parameters.

## [using ordinary kriging]

And again, let’s just check the results of the kriging before saving the results in sp.pred

9. Calculate performance metrics

RECOMMENDATION FROM PAPER : Finally, the modeler should calculate the model performance on a test set of field surveys held out from all model fitting. The modeler can use the fitted model to predict damage at the test set locations and calculate the mean error, variance in the error, and mean squared error. Prediction errors for newly acquired field data should be monitored—low errors would confirm the current G-DIF damage estimate, whereas high errors would trigger a model update.

We can develop our final damage prediction by adding the trend model and the spatial correlation model results. In this example, we bound our predictions to be between 0 and 0.75, but another way to address this would be to use Normal Score Transformation on our dependendent variable, GT_BDR, and transform back after modeling.

We should compare our G-DIF estimate to the true damage. In this example, we know what the true damage was so we can visually compare our predictions. However, in a real-time scenario, one could only compare the results at our test set locations.

Visual comparison of G-DIF damage estimate to true damage

Visual comparison of G-DIF damage estimate to true damage

Again, we can quantitatively assess the error in our prediction by looking at the error at all the points in the study area that were NOT used for fitting the models. A histogram of these errors are:

But what would be better is to evaluate a few key error metrics. We can do this by using the RK_errormet function.

We find that the Mean Squared Error, or MSE, is

## [1] 0.05340667

The variance of the error, or VE, is

## [1] 0.05337985

And the mean of the error, or ME, is

## [1] 0.005265925

And there you have it! If you have any questions, please email Sabine Loos at sloos@alumni.stanford.edu.