In 2015, the city of Kathmandu in central Nepal was struck by a highly destructive, magnitude 7.8 earthquake that took the lives of approximately 9,000 people. Following the initial recovery efforts, the Nepalese government conducted a door-to-door survey of 11 affected districts surrounding Kathmandu to assess eligibility for reconstruction assistance; as of this writing, it is the largest post-disaster structural data collection effort. Kathmandu Living Labs (KLL) updated the data in 2019 to further organize it and to reflect changes to the federal districting system, recognizing that a dataset covering over 750,000 households would interest anyone in the data science community concerned with disaster recovery or socio-economic issues in developing countries. That data is the subject of this project, in which we develop exploratory visualizations of the extent of building damage. Additionally, we use machine learning (specifically, the xgboost package) to investigate the significance of structural factors for building survivability.
Acquisition and Nature of the Dataset
The source of data for this project is the fantastic Open Data Portal made available for the 2015 Nepal Earthquake by the nonprofit Kathmandu Living Labs, with data originally collected by the Nepalese government. It allows us to explore quite a few demographic features of the 11 districts surrounding the city of Kathmandu. I am focusing on the 'Structural Data' set, which covers roughly 760,000 buildings.
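The code in this article leans on several packages beyond base R. The chunk that attaches them is not shown in the output, but judging from the functions used later (ggplot2/forcats for plotting, sf for mapping, fastDummies for encoding, xgboost for modeling, caret for metrics), a setup along these lines is assumed:
#packages assumed by the analysis below
library(ggplot2)     #figures
library(forcats)     #factor reordering (fct_infreq, fct_rev)
library(sf)          #reading and plotting the GADM shapefile
library(fastDummies) #one-hot encoding via dummy_cols()
library(xgboost)     #gradient boosted tree model
library(caret)       #confusionMatrix() for test metrics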
#load locally stored dataframe
df <- read.csv("/Volumes/EXT128/Data Science Projects/building_structure/csv_building_structure.csv")
#review structure of raw data
str(df,vec.len=1)
## 'data.frame': 762106 obs. of 31 variables:
## $ building_id : num 1.2e+11 ...
## $ district_id : int 12 12 ...
## $ vdcmun_id : int 1207 1207 ...
## $ ward_id : int 120703 120703 ...
## $ count_floors_pre_eq : int 1 1 ...
## $ count_floors_post_eq : int 1 1 ...
## $ age_building : int 9 15 ...
## $ plinth_area_sq_ft : int 288 364 ...
## $ height_ft_pre_eq : int 9 9 ...
## $ height_ft_post_eq : int 9 9 ...
## $ land_surface_condition : chr "Flat" ...
## $ foundation_type : chr "Other" ...
## $ roof_type : chr "Bamboo/Timber-Light roof" ...
## $ ground_floor_type : chr "Mud" ...
## $ other_floor_type : chr "Not applicable" ...
## $ position : chr "Not attached" ...
## $ plan_configuration : chr "Rectangular" ...
## $ has_superstructure_adobe_mud : int 0 0 ...
## $ has_superstructure_mud_mortar_stone : int 1 1 ...
## $ has_superstructure_stone_flag : int 0 0 ...
## $ has_superstructure_cement_mortar_stone: int 0 0 ...
## $ has_superstructure_mud_mortar_brick : int 0 0 ...
## $ has_superstructure_cement_mortar_brick: int 0 0 ...
## $ has_superstructure_timber : int 0 0 ...
## $ has_superstructure_bamboo : int 1 1 ...
## $ has_superstructure_rc_non_engineered : int 0 0 ...
## $ has_superstructure_rc_engineered : int 0 0 ...
## $ has_superstructure_other : int 0 0 ...
## $ condition_post_eq : chr "Damaged-Used in risk" ...
## $ damage_grade : chr "Grade 3" ...
## $ technical_solution_proposed : chr "Major repair" ...
Data Cleaning
I'm studying the damage_grade column primarily, which should have five categories: Grade 1 is essentially undamaged, and Grade 5 requires total demolition or reconstruction. More details on that scale can be found in the FAQ provided by KLL. With this priority and my intended mapping visualization in mind, I'm going to convert this column to a numeric. I'm also going to add a damage description variable for a chart later on.
#damage grade to integers
df$damage_grade[df$damage_grade == 'Grade 5'] <- 5
df$damage_grade[df$damage_grade == 'Grade 4'] <- 4
df$damage_grade[df$damage_grade == 'Grade 3'] <- 3
df$damage_grade[df$damage_grade == 'Grade 2'] <- 2
df$damage_grade[df$damage_grade == 'Grade 1'] <- 1
#change data type
df$damage_grade <- as.integer(df$damage_grade)
#add description column
df$damage_desc <- NA
#input appropriate damage grades
df$damage_desc[df$damage_grade == 1] <- 'Cosmetic Repair'
df$damage_desc[df$damage_grade == 2] <- 'Minor Structural Repair'
df$damage_desc[df$damage_grade == 3] <- 'Major Structural Repair'
df$damage_desc[df$damage_grade == 4] <- 'Unsafe / Structurally Unsound'
df$damage_desc[df$damage_grade == 5] <- 'Total Collapse'
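As a quick sanity check (my addition, not part of the original cleaning), cross-tabulating the numeric grades against the descriptions should show exactly one nonzero cell per row:
#sanity check: each numeric grade should map to exactly one description
table(df$damage_grade, df$damage_desc, useNA = "ifany")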
I'm also interested in district-level damage statistics, so it will help to have the district names available. I retrieved them from the data dictionary to match up with the ID numbers, and I'll attach this information to the dataframe using an inner join.
#get names
dist_name <- c('Okhaldhunga','Sindhuli','Ramechhap','Dolakha','Sindhupalchok','Kavrepalanchok','Nuwakot','Rasuwa','Dhading','Makwanpur','Gorkha')
#get ids
dist_id <- c(12,20,21,22,23,24,28,29,30,31,36)
#create district information dataframe
districts <- data.frame(district_id = dist_id, district_name = dist_name)
#inner join in district_id using merge()
df <- merge(districts,df,by="district_id")
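Because merge() performs an inner join, any district_id missing from the lookup table would silently drop building records. A quick check (my addition, using the raw row count of 762,106 noted below) confirms that nothing was lost:
#verify the inner join kept every building record
stopifnot(nrow(df) == 762106)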
In the raw data there are 762,106 buildings observed with 31 variables recorded, and we need to check the data quality with respect to errors and missing values. We can check for NAs and numerical outliers in the traditional way, but one additional strategy we can employ is detecting other kinds of errors by converting the categorical variables to factors. Factors will allow us to see whether there are any anomalies in the number or naming of levels. From the str() call earlier, we can see that the superstructure variables are already one-hot encoded, which is great! But all of the other categorical fields are characters. I'm going to use the character datatype as a loop condition for transforming to factors, and then check the levels of each.
#change any character variables to factors
for (c in names(df)[sapply(df,is.character)])
df[[c]] <- as.factor(df[[c]])
#check factor levels for errors or missing information
lapply(df[sapply(df, is.factor)], levels)
## $district_name
## [1] "Dhading" "Dolakha" "Gorkha" "Kavrepalanchok"
## [5] "Makwanpur" "Nuwakot" "Okhaldhunga" "Ramechhap"
## [9] "Rasuwa" "Sindhuli" "Sindhupalchok"
##
## $land_surface_condition
## [1] "Flat" "Moderate slope" "Steep slope"
##
## $foundation_type
## [1] "Bamboo/Timber" "Cement-Stone/Brick" "Mud mortar-Stone/Brick"
## [4] "Other" "RC"
##
## $roof_type
## [1] "Bamboo/Timber-Heavy roof" "Bamboo/Timber-Light roof"
## [3] "RCC/RB/RBC"
##
## $ground_floor_type
## [1] "Brick/Stone" "Mud" "Other" "RC" "Timber"
##
## $other_floor_type
## [1] "Not applicable" "RCC/RB/RBC" "Timber-Planck"
## [4] "TImber/Bamboo-Mud"
##
## $position
## [1] "" "Attached-1 side" "Attached-2 side" "Attached-3 side"
## [5] "Not attached"
##
## $plan_configuration
## [1] "" "Building with Central Courtyard"
## [3] "E-shape" "H-shape"
## [5] "L-shape" "Multi-projected"
## [7] "Others" "Rectangular"
## [9] "Square" "T-shape"
## [11] "U-shape"
##
## $condition_post_eq
## [1] "Covered by landslide"
## [2] "Damaged-Not used"
## [3] "Damaged-Repaired and used"
## [4] "Damaged-Rubble clear"
## [5] "Damaged-Rubble Clear-New building built"
## [6] "Damaged-Rubble unclear"
## [7] "Damaged-Used in risk"
## [8] "Not damaged"
##
## $technical_solution_proposed
## [1] "" "Major repair" "Minor repair" "No need"
## [5] "Reconstruction"
##
## $damage_desc
## [1] "Cosmetic Repair" "Major Structural Repair"
## [3] "Minor Structural Repair" "Total Collapse"
## [5] "Unsafe / Structurally Unsound"
This actually revealed something pretty important: a few of the categorical variables have empty-string ("") levels, which means those missing entries would not have been detected by searching for NA values. I'm going to fix this by replacing the empty strings with NA so that missing information can be detected, and then check how many rows are incomplete.
#replace empty strings with NA
df[df == ""] <- NA
#complete cases check
paste0('Number of rows with missing data: ', sum(!complete.cases(df)))
## [1] "Number of rows with missing data: 12"
At least in the realm of blank entries or NAs, this data seems to be pretty high quality. With over 760,000 observations, I don't think we will lose anything important by removing these 12 incomplete rows.
#remove the incomplete observations
df <- na.omit(df)
#review dimensions of cleaned data
paste0('Cleaned Data Dimension: ', dim(df)[1],' rows, ',dim(df)[2],' columns')
## [1] "Cleaned Data Dimension: 762094 rows, 33 columns"
Exploratory Visualizations
We've done enough cleaning to begin visualizing the damage done by this quake. The first basic question is: how are the damage grades distributed? I'll answer this with a bar chart that displays the damage levels, their definitions, and the color code used for the rest of the visualizations.
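The chunk that produced that first chart is not reproduced here; a minimal sketch of it, assuming the cleaned df from above and the same viridis palette used in the later figures, might look like this:
#Figure 1 (sketch): distribution of damage grades with their descriptions
ggplot(data=df,aes(x=factor(damage_grade),fill=factor(damage_grade)))+
  geom_bar(color="gray30")+
  scale_fill_viridis_d(option="viridis",name="Grade",direction=-1)+
  scale_x_discrete(labels=paste0(1:5,": ",c('Cosmetic Repair','Minor Structural Repair','Major Structural Repair','Unsafe / Structurally Unsound','Total Collapse')))+
  labs(x='Damage Grade',y='Building Assessments',title='Figure 1: Distribution of Damage Grades',caption='Data: KLL')+
  theme(axis.text.x=element_text(angle=20,hjust=1))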
Somewhat surprisingly, and certainly tragically, the majority of buildings under study were flagged as irreparably damaged or close to it! How does this break down by district?
#forcats fct_infreq used in ggplot aes() for ordering the legend properly
ggplot(data=df,aes(x=fct_rev(fct_infreq(district_name)),fill=factor(damage_grade)))+
geom_bar(color = "gray30",position = position_stack(reverse = TRUE),stat = "count")+
labs(x='District',y='Building Assessments',title='Figure 2: Damaged Buildings by Grade Occurrence and District',caption='Data: KLL')+
scale_fill_viridis_d(option = "viridis", name = "Grade",direction=-1,guide=guide_legend(reverse=TRUE))+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Mapping
Next, I'm going to map the average damage grade using ggplot2's sf-based geographic plotting (geom_sf) so that anyone unfamiliar with the area surrounding Kathmandu can better understand the distribution. To do this we need geographical data as well as some additional preprocessing. The Database of Global Administrative Areas (GADM) offers this kind of spatial data for most countries at various levels of granularity. I am retrieving their level 3 data for the administrative districts of Nepal.
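The shapefile is read from a local copy below. For reproducibility, a download step along these lines should work, though the URL pattern is my assumption based on GADM's download page and should be verified:
#sketch: download and unzip the GADM 4.1 shapefile bundle for Nepal (URL pattern assumed)
gadm_url <- "https://geodata.ucdavis.edu/gadm/gadm4.1/shp/gadm41_NPL_shp.zip"
download.file(gadm_url, destfile = "gadm41_NPL_shp.zip", mode = "wb")
unzip("gadm41_NPL_shp.zip", exdir = "gadm41_NPL_shp")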
nepal <- st_read('/Volumes/EXT128/Data Science Projects/building_structure/building_structure/gadm41_NPL_shp/gadm41_NPL_3.shp')
#isolate earthquake-affected districts from map data
nepal_eq <- nepal[nepal$NAME_3 %in% dist_name,]
#aggregation of districts by mean damage grade
mean_dmg_dist <- aggregate(df$damage_grade, list(df$district_name), FUN=mean)
#assign column names for matching with geodata
colnames(mean_dmg_dist) <- c("NAME_3","DMG")
#inner join damage and map
nepal_eq <- merge(nepal_eq,mean_dmg_dist,by="NAME_3")
#vector of missing districts
blank_name <- c('Kathmandu','Lalitpur','Bhaktapur')
#map subset of missing districts
nepal_blank <- nepal[nepal$NAME_3 %in% blank_name,]
ggplot() +
#unused districts layer
geom_sf(data=nepal_blank,fill='grey30')+
geom_sf_text(data=nepal_blank,aes(label=NAME_3),size = 1.5,fun.geometry = st_centroid,colour="white")+
#measured districts layer
geom_sf(data=nepal_eq,aes(fill=DMG))+
geom_sf_text(data=nepal_eq,aes(label=NAME_3),size = 3,fun.geometry=st_centroid,colour="white")+
labs(x=NULL,y=NULL,title='Figure 3: 2015 "Ghorka" Earthquake, Average Structural Damage',subtitle='11 Survey Districts Surrounding Kathmandu, Nepal',caption='Data: KLL')+
guides(colour = guide_colourbar(reverse = FALSE))+
scale_fill_viridis_c(option = "viridis", name = "Grade", limits=c(0,5),direction=-1)+
theme(aspect.ratio=5/8)
One thing we can surmise from this visualization is that the districts north of Kathmandu suffered the highest average structural damage.
Machine Learning on Structural Features
Now, of course, earthquake outcomes depend an awful lot on where you happen to be at the moment. But what about the structural information present in this data? Surely there's something useful we can gain from it, so I'm going to try a machine learning algorithm to check the predictive ability of structural details on the outcome of damage grade. This is a supervised classification task, and the method I have chosen is XGBoost. This particular project isn't really meant to be an in-depth look at machine learning algorithms, so for more information I recommend checking out the documentation for the xgboost R package. Suffice it to say, it is one of the most popular learning methods for classification tasks and is frequently used in award-winning projects.
Feature Selection
The charts so far suggest that location is a very significant factor in building damage, so we are going to subset our data to remove the columns associated with location. Additionally, post-earthquake structural information (with the exception of the target variable, damage grade) is dropped, since it is obviously heavily correlated with damage grade; it wouldn't make sense to include it in a model for predicting post-disaster damage!
#create dataframe for machine learning on structural details
structure <- df
#remove some columns
structure$district_id <- NULL
structure$district_name <- NULL
structure$building_id <- NULL
structure$vdcmun_id <- NULL
structure$ward_id <- NULL
structure$count_floors_post_eq <- NULL
structure$height_ft_post_eq <- NULL
structure$condition_post_eq <- NULL
structure$technical_solution_proposed <- NULL
structure$damage_desc <- NULL
structure$other_floor_type <- NULL
Data Preparation for XGBoost
The next step is a subjective one. There are five damage grades, but we're going to reduce that to two: anything grade 3 or lower we will consider low damage, and anything higher counts as high damage. The descriptions in the raw data suggest there is a lot of overlap between assessments of grades 4 and 5, so reducing the problem to a binary classification is justifiable for the scope of this project. I'm also removing a handful of buildings whose recorded age is 1,000 years or more, which look like data entry artifacts.
#create binary target variable
structure$damage_grade <- ifelse(structure$damage_grade %in% c(1,2,3), 0, 1)
#remove buildings with implausibly large recorded ages (1,000 years or more)
structure <- structure[structure$age_building < 1000,]
#one-hot encoding via fastDummies package
onehot <- dummy_cols(structure,select_columns=c('land_surface_condition','foundation_type','roof_type','ground_floor_type','position','plan_configuration'),remove_first_dummy = FALSE, remove_selected_columns = TRUE)
#train/test split at 85%
set.seed(43022)
train_size <- floor(0.85 * nrow(onehot))
train_ind <- sample(seq_len(nrow(onehot)), size = train_size)
train <- onehot[train_ind, ]
test <- onehot[-train_ind, ]
#separate the label (damage_grade) from the predictors by column name
label_col <- which(names(onehot) == "damage_grade")
train_features <- as.matrix(train[,-label_col])
train_labels <- as.matrix(train[,label_col])
test_features <- as.matrix(test[,-label_col])
test_labels <- as.matrix(test[,label_col])
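Before modeling, it is worth glancing at the class balance of the new binary target, since that sets the baseline accuracy any classifier has to beat (this check is my addition):
#proportion of low-damage (0) vs. high-damage (1) buildings among the training labels
prop.table(table(train_labels))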
Modeling with XGBoost
Now we're ready for the modeling. I'm utilizing the xgb.DMatrix object type, which is easily constructed from a standard R matrix. For the purposes of this exploration, I'm going to use the default tuning parameters for xgboost, though these are surely something I will be exploring in another article.
#xgboost objects
sindhuli_xgb_train <- xgb.DMatrix(data = train_features, label= train_labels)
sindhuli_xgb_test <- xgb.DMatrix(data = test_features, label= test_labels)
#xgboost model with 25 runs, binary prediction objective
model <- xgboost(data = sindhuli_xgb_train,
nrounds = 25,
verbose = 0,
objective = "binary:logistic")
Model Evaluation
#test set predictions, rounded since they are probabilities between 0 and 1 by default
pred <- round(predict(model, sindhuli_xgb_test))
#test metrics from caret
test_matrix <- confusionMatrix(as.factor(pred), as.factor(test_labels))
test_matrix$table
## Reference
## Prediction 0 1
## 0 20328 4689
## 1 25250 64048
paste0('Accuracy: ',round(test_matrix$overall[1],2),' / P-Value: ',test_matrix$overall[6])
## [1] "Accuracy: 0.74 / P-Value: 0"
We obtained an accuracy of 74% with default xgboost parameters. There were a lot of misclassifications, particularly buildings predicted as high damage when the true outcome was low damage. But there is definitely something here, since the p-value indicates a statistically significant improvement over simply predicting the most common class. I should note that if you include the location information, the metrics turn out much better; I even did that with all five classes and still got significant results. However, as I discussed earlier, that would not tell us much more than Figures 2 and 3 do.
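For reference, the baseline that p-value tests against is the No Information Rate, the accuracy of always predicting the majority class; from the confusion matrix above it works out to roughly 0.60, and caret reports it directly:
#accuracy of always predicting the majority class, for comparison with 0.74
test_matrix$overall["AccuracyNull"]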
Variable Importance
So what of the structural details? To conclude, I'm going to investigate the variable importance as determined by the xgboost model.
#extract variable importance matrix, top 10
xgb_imp <- xgb.importance(feature_names = colnames(train_features), model = model)[1:10]
#plot feature vs. gain
ggplot(data=xgb_imp)+
geom_bar(aes(x=reorder(Feature,Gain),y=Gain,color=Gain),stat='identity',size=1)+
theme(axis.text.y = element_text(angle = 45, vjust = -2, hjust=1, size = 6),legend.position='none')+
labs(x= 'Feature', y = 'Gain', title='Figure 4: Variable Importance',subtitle='Structural features predicting earthquake damage during the 2015 Nepal Quake',caption='Data: KLL')+
coord_flip()
That does tell us something: the mud mortar-stone superstructure flag has very high gain compared to everything else, meaning it is the feature carrying the most information for predicting the binary damage variable we created. The significance of the difference between buildings with and without this superstructure type can be verified with a two-sample t-test:
#vector of classifications with each mud/mortar/stone flag
mms_t <- onehot$damage_grade[onehot$has_superstructure_mud_mortar_stone == 1]
mms_f <- onehot$damage_grade[onehot$has_superstructure_mud_mortar_stone == 0]
t.test(mms_t, mms_f, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: mms_t and mms_f
## t = 393.53, df = 253571, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4755483 0.4803089
## sample estimates:
## mean of x mean of y
## 0.6985456 0.2206170
Since the mean outcome for buildings with this type of superstructure was significantly closer to 1 (the high-damage class), you would not want to be in this type of building when an earthquake hits! And that makes sense, given that many of the other superstructures in the dataset use timber or reinforced concrete.
References
Earthquake Data Portal. Kathmandu Living Labs. https://www.kathmandulivinglabs.org/projects/earthquake-data-portal.
2015 Nepal earthquake: Open data portal. 2015 Nepal Earthquake: Open Data Portal. https://eq2015.npc.gov.np/#/download.
Documentation. 2015 Nepal Earthquake: Open Data Portal. http://eq2015.npc.gov.np/docs/#/faqs/faqs.
Country: Nepal. GADM. https://gadm.org/download_country.html.
XGBoost 1.6.2 documentation. dmlc XGBoost. https://xgboost.readthedocs.io/en/stable/.