R: Machine Learning for Natural Disaster Preparedness


Kathmandu, Nepal

In 2015, the city of Kathmandu in central Nepal was struck by a highly destructive, magnitude 7.8 earthquake that took the lives of approximately 9,000 people. Following the initial recovery efforts, the Nepalese government performed a door-to-door survey in 11 of the surrounding districts with the aim of assessing eligibility for reconstruction assistance; as of this writing, it is the largest post-disaster structural data collection effort ever undertaken. Kathmandu Living Labs updated the data in 2019 to further organize it and to reflect changes to the federal districting system, recognizing that a dataset covering over 750,000 households would be of interest to anyone in the data science community concerned with disaster recovery or socio-economic issues in developing countries. It is this data we consider in this project, in which we develop exploratory visualizations of the extent of building damage. Additionally, we use machine learning (specifically, the xgboost package) to investigate the significance of structural factors for building survivability.

Acquisition and Nature of the Dataset

The source of data for this project is the fantastic Open Data Portal made available for the 2015 Nepal Earthquake by the nonprofit Kathmandu Living Labs (KLL), with data initially collected by the Nepalese government. It allows us to explore quite a few demographic features of the 11 districts surrounding the city of Kathmandu. I am focusing on the ‘Structural Data’ set, which covers roughly 760,000 buildings.
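
The analysis below leans on a handful of packages, so it helps to load them up front. This list is inferred from the function calls that appear throughout the post, so adjust it as needed:

#packages used throughout this analysis
library(ggplot2)     #charts and maps
library(forcats)     #factor reordering for figure 2
library(sf)          #reading and plotting GADM shapefiles
library(fastDummies) #one-hot encoding of categorical columns
library(xgboost)     #gradient boosted tree model
library(caret)       #confusion matrix and test metrics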

#load locally stored dataframe
df <- read.csv("/Volumes/EXT128/Data Science Projects/building_structure/csv_building_structure.csv")
#review structure of raw data
str(df,vec.len=1)
## 'data.frame':    762106 obs. of  31 variables:
##  $ building_id                           : num  1.2e+11 ...
##  $ district_id                           : int  12 12 ...
##  $ vdcmun_id                             : int  1207 1207 ...
##  $ ward_id                               : int  120703 120703 ...
##  $ count_floors_pre_eq                   : int  1 1 ...
##  $ count_floors_post_eq                  : int  1 1 ...
##  $ age_building                          : int  9 15 ...
##  $ plinth_area_sq_ft                     : int  288 364 ...
##  $ height_ft_pre_eq                      : int  9 9 ...
##  $ height_ft_post_eq                     : int  9 9 ...
##  $ land_surface_condition                : chr  "Flat" ...
##  $ foundation_type                       : chr  "Other" ...
##  $ roof_type                             : chr  "Bamboo/Timber-Light roof" ...
##  $ ground_floor_type                     : chr  "Mud" ...
##  $ other_floor_type                      : chr  "Not applicable" ...
##  $ position                              : chr  "Not attached" ...
##  $ plan_configuration                    : chr  "Rectangular" ...
##  $ has_superstructure_adobe_mud          : int  0 0 ...
##  $ has_superstructure_mud_mortar_stone   : int  1 1 ...
##  $ has_superstructure_stone_flag         : int  0 0 ...
##  $ has_superstructure_cement_mortar_stone: int  0 0 ...
##  $ has_superstructure_mud_mortar_brick   : int  0 0 ...
##  $ has_superstructure_cement_mortar_brick: int  0 0 ...
##  $ has_superstructure_timber             : int  0 0 ...
##  $ has_superstructure_bamboo             : int  1 1 ...
##  $ has_superstructure_rc_non_engineered  : int  0 0 ...
##  $ has_superstructure_rc_engineered      : int  0 0 ...
##  $ has_superstructure_other              : int  0 0 ...
##  $ condition_post_eq                     : chr  "Damaged-Used in risk" ...
##  $ damage_grade                          : chr  "Grade 3" ...
##  $ technical_solution_proposed           : chr  "Major repair" ...

Data Cleaning

I’m studying the damage_grade column primarily, which is supposed to have five categories: Grade 1 is essentially undamaged, while Grade 5 requires total demolition or reconstruction. More details on that scale can be found in the FAQ provided by KLL. With this priority and my intended mapping visualization in mind, I’m going to convert this column to a numeric. I’m also going to add a damage description variable for a chart later on.

#damage grade to integers
df$damage_grade[df$damage_grade == 'Grade 5'] <- 5
df$damage_grade[df$damage_grade == 'Grade 4'] <- 4
df$damage_grade[df$damage_grade == 'Grade 3'] <- 3
df$damage_grade[df$damage_grade == 'Grade 2'] <- 2
df$damage_grade[df$damage_grade == 'Grade 1'] <- 1
#change data type
df$damage_grade <- as.integer(df$damage_grade)
#add description column
df$damage_desc <- NA
#input appropriate damage grades
df$damage_desc[df$damage_grade == 1] <- 'Cosmetic Repair'
df$damage_desc[df$damage_grade == 2] <- 'Minor Structural Repair'
df$damage_desc[df$damage_grade == 3] <- 'Major Structural Repair'
df$damage_desc[df$damage_grade == 4] <- 'Unsafe / Structurally Unsound'
df$damage_desc[df$damage_grade == 5] <- 'Total Collapse'

Another thing: I’m interested in district-level damage statistics, so it will help to have the district names available. I retrieved these from the data dictionary and matched them to the ID numbers; I’ll attach this information to the dataframe using an inner join.

#get names
dist_name <- c('Okhaldhunga','Sindhuli','Ramechhap','Dolakha','Sindhupalchok','Kavrepalanchok','Nuwakot','Rasuwa','Dhading','Makwanpur','Gorkha')
#get ids
dist_id <- c(12,20,21,22,23,24,28,29,30,31,36)
#create district information dataframe
districts <- data.frame(district_id = dist_id, district_name = dist_name)
#inner join in district_id using merge() 
df <- merge(districts,df,by="district_id")

In the raw data there are 762,106 buildings observed with 31 variables recorded, and we need to check the quality of the data in terms of errors and missing values. We can check for NAs and numerical outliers in the traditional way, but one additional strategy is to convert the categorical variables to factors, which lets us spot anomalies in the number or naming of levels. From the str() call earlier, we can see that the superstructure variable is already one-hot encoded, which is great! But all of the other categorical fields are characters, so I’m going to loop over the character columns, convert each to a factor, and then check the levels of each.

#change any character variables to factors
for (c in names(df)[sapply(df,is.character)])
  df[[c]] <- as.factor(df[[c]])
#check factor levels for errors or missing information
lapply(df[sapply(df, is.factor)], levels)
## $district_name
##  [1] "Dhading"        "Dolakha"        "Gorkha"         "Kavrepalanchok"
##  [5] "Makwanpur"      "Nuwakot"        "Okhaldhunga"    "Ramechhap"     
##  [9] "Rasuwa"         "Sindhuli"       "Sindhupalchok" 
## 
## $land_surface_condition
## [1] "Flat"           "Moderate slope" "Steep slope"   
## 
## $foundation_type
## [1] "Bamboo/Timber"          "Cement-Stone/Brick"     "Mud mortar-Stone/Brick"
## [4] "Other"                  "RC"                    
## 
## $roof_type
## [1] "Bamboo/Timber-Heavy roof" "Bamboo/Timber-Light roof"
## [3] "RCC/RB/RBC"              
## 
## $ground_floor_type
## [1] "Brick/Stone" "Mud"         "Other"       "RC"          "Timber"     
## 
## $other_floor_type
## [1] "Not applicable"    "RCC/RB/RBC"        "Timber-Planck"    
## [4] "TImber/Bamboo-Mud"
## 
## $position
## [1] ""                "Attached-1 side" "Attached-2 side" "Attached-3 side"
## [5] "Not attached"   
## 
## $plan_configuration
##  [1] ""                                "Building with Central Courtyard"
##  [3] "E-shape"                         "H-shape"                        
##  [5] "L-shape"                         "Multi-projected"                
##  [7] "Others"                          "Rectangular"                    
##  [9] "Square"                          "T-shape"                        
## [11] "U-shape"                        
## 
## $condition_post_eq
## [1] "Covered by landslide"                   
## [2] "Damaged-Not used"                       
## [3] "Damaged-Repaired and used"              
## [4] "Damaged-Rubble clear"                   
## [5] "Damaged-Rubble Clear-New building built"
## [6] "Damaged-Rubble unclear"                 
## [7] "Damaged-Used in risk"                   
## [8] "Not damaged"                            
## 
## $technical_solution_proposed
## [1] ""               "Major repair"   "Minor repair"   "No need"       
## [5] "Reconstruction"
## 
## $damage_desc
## [1] "Cosmetic Repair"               "Major Structural Repair"      
## [3] "Minor Structural Repair"       "Total Collapse"               
## [5] "Unsafe / Structurally Unsound"

This actually revealed something pretty important: a few of the categorical variables have empty-string ("") levels, which would not have been detected by searching for NA values. I’m going to fix this issue by imputing NAs where the empty strings are so that the missing information can be detected, and then check the number of rows that are incomplete.

#replace empty strings with NA
df[df == ""] <- NA
#complete cases check
paste0('Number of rows with missing data: ', sum(!complete.cases(df)))
## [1] "Number of rows with missing data: 12"

At least in the realm of blank entries and NAs, this data seems to be pretty high quality. We won’t lose anything important by removing these 12 observations, since we have over 760,000.

#remove the incomplete observations
df <- na.omit(df)
#review dimensions of cleaned data
paste0('Cleaned Data Dimension: ', dim(df)[1],' rows, ',dim(df)[2],' columns')
## [1] "Cleaned Data Dimension: 762094 rows, 33 columns"

Exploratory Visualizations

We’ve done enough cleaning to begin visualizing the damage done by this quake. The first basic question is: how are the damage grades distributed? I’ll answer this with a bar chart which displays the damage levels, their definitions, and a color code established for the rest of the visualizations.
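
The code that produced Figure 1 isn’t included in the original output, so here is a minimal sketch of how such a chart could be built from the damage_grade and damage_desc information created above (the published figure may differ in layout and theming):

#Figure 1 (sketch): distribution of damage grades with their descriptions
ggplot(data=df,aes(x=factor(damage_grade),fill=factor(damage_grade)))+
  geom_bar(color="gray30")+
  scale_fill_viridis_d(option="viridis",direction=-1,name="Grade",
                       labels=paste(1:5,c('Cosmetic Repair','Minor Structural Repair','Major Structural Repair','Unsafe / Structurally Unsound','Total Collapse'),sep=": "))+
  labs(x='Damage Grade',y='Building Assessments',title='Figure 1: Distribution of Damage Grades',caption='Data: KLL')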


Somewhat surprisingly, and certainly tragically, the majority of buildings under study were flagged as irreparably damaged or close to it! How does this break down by district?

#forcats fct_infreq used in ggplot aes() for ordering the legend properly
ggplot(data=df,aes(x=fct_rev(fct_infreq(district_name)),fill=factor(damage_grade)))+ 
  geom_bar(color = "gray30",position = position_stack(reverse = TRUE),stat = "count")+
  labs(x='District',y='Building Assessments',title='Figure 2: Damaged Buildings by Grade Occurrence and District',caption='Data: KLL')+
  scale_fill_viridis_d(option = "viridis", name = "Grade",direction=-1,guide=guide_legend(reverse=TRUE))+
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

We can see that Sindhupalchok has a disproportionate share of grade 5 damage but does not have the highest total number of damaged buildings, while nearly every observed building in Rasuwa was destroyed, though Rasuwa accounts for relatively few buildings overall. To get the whole picture, consider this chart in tandem with the next.

Mapping

Next, I’m going to map out the average damage using ggplot’s geographic plotting (with the spatial data handled by the sf package) so that anyone unfamiliar with the area surrounding Kathmandu can better understand the distribution. To do this we need geographic data as well as some additional preprocessing. The Database of Global Administrative Areas (GADM) offers this kind of spatial data for most countries at various levels of granularity. I am retrieving their level 3 data for the administrative districts of Nepal.

nepal <- st_read('/Volumes/EXT128/Data Science Projects/building_structure/building_structure/gadm41_NPL_shp/gadm41_NPL_3.shp')

The earthquake data covers 11 districts surrounding the capital city of Kathmandu. As we will see, it does not cover Kathmandu itself, nor the nearby central districts of Lalitpur and Bhaktapur. KLL explains in its FAQ that this is because the project was built from Phase I of the government data collection; those three districts were surveyed in Phase II and the remaining districts of Nepal in Phase III. I’m going to isolate a map layer which shows those 11 districts and attach the damage information to the resulting dataframe.

#isolate earthquake-affected districts from map data
nepal_eq <- nepal[nepal$NAME_3 %in% dist_name,]
#aggregation of districts by mean damage grade
mean_dmg_dist <- aggregate(df$damage_grade, list(df$district_name), FUN=mean)
#assign column names for matching with geodata
colnames(mean_dmg_dist) <- c("NAME_3","DMG")
#inner join damage and map
nepal_eq <- merge(nepal_eq,mean_dmg_dist,by="NAME_3")

Now the spatial data has the average damage grade attached for each district. One more step is to prepare a layer for the central districts which do not have data, so that they at least appear as neutrally-colored shapes instead of a hole in the center of the visualization.

#vector of missing districts
blank_name <- c('Kathmandu','Lalitpur','Bhaktapur')
#map subset of missing districts
nepal_blank <- nepal[nepal$NAME_3 %in% blank_name,]

And finally, for this map visualization, we’ll make the lengthy call to ggplot:

ggplot() +
  #unused districts layer
  geom_sf(data=nepal_blank,fill='grey30')+ 
  geom_sf_text(data=nepal_blank,aes(label=NAME_3),size = 1.5,fun.geometry = st_centroid,colour="white")+
  #measured districts layer
  geom_sf(data=nepal_eq,aes(fill=DMG))+
  geom_sf_text(data=nepal_eq,aes(label=NAME_3),size = 3,fun.geometry=st_centroid,colour="white")+
  labs(x=NULL,y=NULL,title='Figure 3: 2015 "Gorkha" Earthquake, Average Structural Damage',subtitle='11 Survey Districts Surrounding Kathmandu, Nepal',caption='Data: KLL')+
  guides(colour = guide_colourbar(reverse = FALSE))+
  scale_fill_viridis_c(option = "viridis", name = "Grade", limits=c(0,5),direction=-1)+
  theme(aspect.ratio=5/8)

Among the things we can surmise from this visualization is that the regions north of Kathmandu had the highest proportion of heavily damaged buildings.

Machine Learning on Structural Features

Now, of course, earthquake outcomes depend an awful lot on where you happen to be at the moment. But what about the structural information present in this data? Surely there’s something useful we can gain from it, so I’m going to try a machine learning algorithm to check the predictive ability of structural details on the damage grade outcome. This is a supervised classification task, and the method I have chosen is XGBoost. This particular project isn’t really meant to be an in-depth look at machine learning algorithms, so for more information I recommend you check out the documentation for the xgboost R package. Suffice it to say, it’s one of the most popular learning methods for classification tasks and is frequently utilized in award-winning projects.

Feature Selection

The charts so far suggest that location is a very significant factor in building damage, so we are going to subset our data to remove the columns associated with location. Additionally, the post-earthquake structural information (with the exception of the target variable, damage grade) gets deleted, since it is obviously heavily correlated with damage grade; it wouldn’t make sense to include those columns in a model for predicting post-disaster damage.

#create dataframe for machine learning on structural details
structure <- df
#remove some columns
structure$district_id <- NULL
structure$district_name <- NULL
structure$building_id <- NULL
structure$vdcmun_id <- NULL
structure$ward_id <- NULL
structure$count_floors_post_eq <- NULL
structure$height_ft_post_eq <- NULL
structure$condition_post_eq <- NULL
structure$technical_solution_proposed <- NULL
structure$damage_desc <- NULL
structure$other_floor_type <- NULL

Data Preparation for XGBoost

The next step is a subjective one. There are five damage grades, but we’re going to reduce that to two: anything grade 3 or lower we will consider low damage, and anything higher than 3 counts as high damage. The descriptions in the raw data suggest there is a lot of overlap between assessments of grades 4 and 5, so reducing the problem to a binary classification is justifiable for the scope of this project.

#create binary target variable
structure$damage_grade <- ifelse(structure$damage_grade %in% c(1,2,3), 0, 1)
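
It’s also worth a quick look at how balanced the new binary target is, since the “always predict the most common class” baseline discussed later depends on it. A quick sanity check (the proportions aren’t reported in the original write-up):

#class balance of the binary target (0 = low damage, 1 = high damage)
prop.table(table(structure$damage_grade))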

One interesting thing about this data is that Nepal is a very ancient place. We’d expect the age of a building to have something to do with its structural integrity, right? My question is, to what degree should this study apply to ancient buildings? Their structural features don’t have much in common with buildings in typical modern use, so I consider the 1,000+ year old buildings I noticed in the data to be outliers.
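
A quick count of how many observations fall into that range (not reported in the original, so treat this as a sanity check rather than a result):

#how many buildings report an age of 1,000 years or more?
sum(structure$age_building >= 1000)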

#eliminate ancient buildings
structure <- structure[structure$age_building < 1000,]

The next step is critical. Since the algorithm can only work with numbers, we need to get around the fact that much of the data here is categorical. Somewhat unusually for a dataset like this, part of the problem is already solved out of the box (Kathmandu Living Labs did a great job here, what can I say): the superstructure variable has in fact been one-hot encoded. This means that instead of a single column with a handful of possible values, we have 11 columns with names like has_superstructure_adobe_mud taking values of 0 and 1 to indicate the state of a building’s superstructure. This is perfect as is, so we’re going to copy that pattern by one-hot encoding the other categorical variables. This time I’ll use a package, fastDummies, which carries out the process brilliantly. I just have to pass it the column names that need to be expanded, along with an argument that deletes the originals.

#one-hot encoding via fastDummies package
onehot <- dummy_cols(structure,select_columns=c('land_surface_condition','foundation_type','roof_type','ground_floor_type','position','plan_configuration'),remove_first_dummy = FALSE, remove_selected_columns = TRUE)

In the next step we’re going to split the data for testing and training. We’re holding out 15% of the set to test the model on.

#train/test split at 85%
set.seed(43022)
train_size <- floor(0.85 * nrow(onehot))
train_ind <- sample(seq_len(nrow(onehot)), size = train_size)
train <- onehot[train_ind, ]
test <- onehot[-train_ind, ]

The next part is particular to the xgboost package. In order to train and test a model, it wants the matrix format specifically, whereas we currently have a dataframe. Matrices can be thought of as dataframes with additional optimization for strictly numeric cells, which we have thanks to the one-hot encoding from earlier. Also in this step, I am separating the target low/high damage labels (coded as 0 and 1) from the predictor variables.

#separation of labels and predictors; damage_grade is the target column
target_col <- which(colnames(train) == "damage_grade")
train_features <- as.matrix(train[, -target_col])
train_labels <- as.matrix(train[, target_col])
test_features <- as.matrix(test[, -target_col])
test_labels <- as.matrix(test[, target_col])

Modeling with XGBoost

Now we’re ready for the modeling. I’m using xgboost’s own xgb.DMatrix object type, which is easy to create from a standard R matrix. For the purposes of this exploration, I’m going to use the default tuning parameters for xgboost, though these are surely something I will explore in another article.

#xgboost objects
sindhuli_xgb_train <- xgb.DMatrix(data = train_features, label= train_labels)
sindhuli_xgb_test <- xgb.DMatrix(data = test_features, label= test_labels)
#xgboost model with 25 runs, binary prediction objective
model <- xgboost(data = sindhuli_xgb_train,
                 nrounds = 25,
                 verbose = 0,
                 objective = "binary:logistic") 
#test set predictions, rounded since they are probabilities between 0 and 1 by default
pred <- round(predict(model, sindhuli_xgb_test))
#test metrics from caret
test_matrix <- confusionMatrix(as.factor(pred), as.factor(test_labels))

Model Evaluation

test_matrix$table
##           Reference
## Prediction     0     1
##          0 20328  4689
##          1 25250 64048
paste0('Accuracy: ',round(test_matrix$overall[1],2),' / P-Value: ',test_matrix$overall[6])
## [1] "Accuracy: 0.74 / P-Value: 0"

We obtained an accuracy of 74% with default xgboost parameters. There were a lot of misclassifications, particularly buildings predicted as high damage when the true outcome was low damage. But there is definitely something here: the p-value indicates that the model’s accuracy is significantly better than simply predicting the most common class. I should note that if you include the location information, the metrics turn out far better; I even did that with all five classes and still got significant results. However, as discussed earlier, that would not tell us much more than Figures 2 and 3 do.
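
For readers who want more than raw accuracy, the caret object computed above already carries per-class statistics, which is a quick way to quantify the misclassification pattern just described (output not shown here):

#per-class test metrics from the same caret confusionMatrix object
test_matrix$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]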

Variable Importance

So what of the structural details? To wrap up, I’m going to investigate the variable importance as determined by the xgboost model.

#extract variable importance matrix, top 10
xgb_imp <- xgb.importance(feature_names = colnames(train_features), model = model)[1:10]
#plot feature vs. gain
ggplot(data=xgb_imp)+
  geom_bar(aes(x=reorder(Feature,Gain),y=Gain,color=Gain),stat='identity',size=1)+
  theme(axis.text.y = element_text(angle = 45, vjust = -2, hjust=1, size = 6),legend.position='none')+
  labs(x= 'Feature', y = 'Gain', title='Figure 4: Variable Importance',subtitle='Structural features predicting earthquake damage during the 2015 Nepal Quake',caption='Data: KLL')+
  coord_flip()

That does tell us something: the mud mortar-stone superstructure flag has very high gain compared to everything else, meaning it is the feature that provides the most information for predicting the binary damage variable we created. The significance of the difference in outcomes between buildings with and without this superstructure type can be checked with a two-sample t-test:

#vector of classifications with each mud/mortar/stone flag
mms_t <- onehot$damage_grade[onehot$has_superstructure_mud_mortar_stone == 1]
mms_f <- onehot$damage_grade[onehot$has_superstructure_mud_mortar_stone == 0]
t.test(mms_t, mms_f, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  mms_t and mms_f
## t = 393.53, df = 253571, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.4755483 0.4803089
## sample estimates:
## mean of x mean of y 
## 0.6985456 0.2206170

Since the mean outcome for buildings with this type of superstructure is significantly closer to 1 (high damage, i.e., grade 4 or 5), you would not want to be in one of these buildings when an earthquake hits! And that makes sense, with many of the other superstructure types in the dataset utilizing timber or concrete.

References

 Earthquake Data Portal. Kathmandu Living Labs. https://www.kathmandulivinglabs.org/projects/earthquake-data-portal.

 2015 Nepal earthquake: Open data portal. 2015 Nepal Earthquake: Open Data Portal. https://eq2015.npc.gov.np/#/download.

 Documentation. 2015 Nepal Earthquake: Open Data Portal. http://eq2015.npc.gov.np/docs/#/faqs/faqs.

 Country: Nepal. GADM. https://gadm.org/download_country.html.

 XGBoost 1.6.2 documentation. dmlc XGBoost. https://xgboost.readthedocs.io/en/stable/.