---
title: "SHAP Tutorial"
date: "2023-03-02"
output:
  html_document:
    toc: yes
    toc_float: yes
    number_sections: yes
    df_print: paged
    theme: paper
    math_method: katex
knit: (function(input, ...) {rmarkdown::render(input, output_dir = "docs")})
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  eval = TRUE
)
```
This accompanying notebook produces selected results from the tutorial, with the aim of showing how SHAP is used in practice. The results may differ slightly from those in the tutorial, and the plots are organized differently.

You can download the data from GitHub or from OpenML.
# Load and inspect data
```{r}
# Data
library(farff) # 1.1.1 (required by OpenML)
library(OpenML) # 1.12 (if you download the data from OpenML)
library(withr) # 2.5.0
library(tidyverse) # 2.0.0
# Models
library(caret) # 6.0.94
library(keras) # 2.11.1
library(lightgbm) # 3.3.5
# SHAP
library(shapviz) # 0.9.1
library(kernelshap) # 0.3.7
library(patchwork) # 1.1.2
# Alternative: read the data from a local parquet file (e.g., downloaded from GitHub)
# df <- arrow::read_parquet("rdata/df.parquet")

# Download the data from OpenML
df <- getOMLDataSet(data.id = 45106L)$data
dim(df)
head(df)
summary(df)
```
# Modeling
In this section, we split the data into 90% training and 10% test observations and fit three statistical models: a GLM, a deep neural net, and LightGBM. Furthermore, we define the function that returns the true model values.
## Data split
```{r}
with_seed(
  8300,
  ix <- sample(nrow(df), 0.9 * nrow(df))
)

y <- "claim_nb"
x <- c("year", "town", "driver_age", "car_weight", "car_power", "car_age")

train <- df[ix, ]
y_train <- train[[y]]
X_train <- data.matrix(train[x])
```
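The remaining 10% of the rows form the hold-out test set. It is not needed for the SHAP analysis below, but could be kept along the following lines (a sketch; the object names `test`, `y_test`, and `X_test` are not used later):

```{r}
# Hold-out test set (remaining 10% of rows); kept here only for completeness
test <- df[-ix, ]
y_test <- test[[y]]
X_test <- data.matrix(test[x])
```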
## Generalized linear model (GLM)
We start by fitting a (naive) additive linear Poisson regression model.
```{r}
(fit_glm <- glm(reformulate(x, y), data = train, family = poisson()))
```
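Because the Poisson GLM uses a log link, exponentiating the coefficients gives multiplicative effects on the expected claim frequency. A quick way to inspect them (a small sketch, not part of the original tutorial):

```{r}
# Multiplicative effects on the expected frequency (log link => exp(coefficients))
exp(coef(fit_glm))
```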
## Deep neural net
We use TensorFlow to fit the deep neural net.
```{r}
# Standardization
scaler <- preProcess(X_train, method = "range", rangeBounds = c(-1, 1))

# Callbacks
cb <- list(
  callback_early_stopping(patience = 20),
  callback_reduce_lr_on_plateau(patience = 5)
)

# Architecture
make_nn <- function() {
  k_clear_session()
  tensorflow::set_random_seed(4349)
  input <- layer_input(length(x))
  output <- input %>%
    layer_dense(units = 40, activation = "tanh") %>%
    layer_dense(units = 20, activation = "tanh") %>%
    layer_dense(units = 10, activation = "tanh") %>%
    layer_dense(units = 1, activation = "exponential")
  keras_model(input, output)
}

# Create, compile and fit model
fit_nn <- make_nn() %>%
  compile(optimizer = optimizer_adam(learning_rate = 1e-4), loss = loss_poisson)
summary(fit_nn)

history <- fit_nn %>%
  fit(
    x = predict(scaler, X_train),
    y = y_train,
    epochs = 200,
    batch_size = 1e4,
    validation_split = 0.1,
    callbacks = cb,
    verbose = 0
  )
plot(history, metrics = "loss")
```
## LightGBM
To fit a boosted trees model, we use LightGBM. The parameters were tuned outside this script via randomized parameter search with cross-validation, combined with early stopping.
```{r}
dtrain <- lgb.Dataset(
  X_train,
  label = y_train,
  params = list(feature_pre_filter = FALSE)
)

params <- list(
  learning_rate = 0.05,
  objective = "poisson",
  metric = "poisson",
  num_leaves = 7,
  min_data_in_leaf = 50,
  min_sum_hessian_in_leaf = 0.001,
  colsample_bynode = 0.8,
  bagging_fraction = 0.8,
  lambda_l1 = 3,
  lambda_l2 = 5,
  num_threads = 7
)

fit_lgb <- lgb.train(params = params, data = dtrain, nrounds = 300)
```
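As a rough plausibility check (not part of the tutorial itself), the average predicted frequency of each model can be compared with the observed average claim frequency on the training data:

```{r}
# Average predicted vs. observed claim frequency on the training data (sketch)
c(
  observed = mean(y_train),
  glm      = mean(predict(fit_glm, train, type = "response")),
  nn       = mean(predict(fit_nn, predict(scaler, X_train), batch_size = 1e4, verbose = 0)),
  lgb      = mean(predict(fit_lgb, X_train))
)
```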
## True model
Since we use simulated data, the true underlying frequency model is known:
```{r}
age_effect <- function(age) {
  x <- (age - 66) / 60
  0.05 + x^8 + 0.4 * x^3 + 0.3 * x^2 + 0.06 * x
}

true_model <- function(df) {
  log_lambda <- with(
    df,
    0.15 * town +
      log(age_effect(driver_age)) +
      (0.3 + 0.15 * town) * car_power / 100 +  # interaction 1
      # 0.1 * car_power / (car_weight / 100)^2 +  # interaction 2
      -0.02 * car_age
  )
  exp(log_lambda)
}

# Check
true_model(head(df))
```
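Assuming `claim_nb` was simulated as Poisson counts with exactly these frequencies (an assumption about the data-generating process), the average true frequency should be close to the observed average claim frequency:

```{r}
# Rough consistency check (assumes claim_nb was simulated from these frequencies)
c(true_mean = mean(true_model(df)), observed_mean = mean(df$claim_nb))
```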
# SHAP analysis
Let's analyze our models with SHAP.
## Preparations
The first step of the SHAP analysis is to select a dataset of 1000 rows to be explained. Furthermore, for model-agnostic Kernel SHAP, we additionally sample a smaller dataset serving as background data over which features are integrated out (i.e., for estimating marginal expectations).
```{r}
with_seed(
  3948, {
    X_explain <- train[sample(nrow(train), 1000), x]
    bg <- train[sample(nrow(train), 200), ]
  }
)
```
## LightGBM and TreeSHAP
Let's explain our LightGBM model with the extremely efficient TreeSHAP algorithm.
```{r}
system.time(
  shap_lgb <- shapviz(fit_lgb, X_pred = data.matrix(X_explain))
)
```
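SHAP values are locally additive: per row, they sum (together with the baseline) to the model's raw prediction, which is on the log scale here because of the Poisson objective. A rough check could look as follows; it assumes the `S` and `baseline` components of the "shapviz" object and the `rawscore` argument of `predict()` as in lightgbm 3.x (newer versions use `type = "raw"`):

```{r}
# Rough additivity check on the log scale (lightgbm 3.x: rawscore = TRUE)
raw_pred <- predict(fit_lgb, data.matrix(X_explain), rawscore = TRUE)
summary(rowSums(shap_lgb$S) + shap_lgb$baseline - raw_pred)
```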
### Waterfall plot of first observation
```{r}
sv_waterfall(shap_lgb, row_id = 1)
```
### SHAP importance: Barplot and summary plot
```{r}
sv_importance(shap_lgb, show_numbers = TRUE)
sv_importance(shap_lgb, kind = "bee")
```
### SHAP dependence plots
Each subplot colors the points by the feature with the (heuristically determined) strongest interaction. Therefore, the color scale differs from subplot to subplot (and later also from model to model).
```{r}
theme_set(theme_gray(base_size = 8))

sv_dependence(shap_lgb, x, alpha = 0.5) &
  ylim(-0.5, 1.05)
```
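Instead of relying on the heuristic, the color feature can also be fixed manually via the `color_var` argument of `sv_dependence()`. For instance, one could color the `car_power` effect by `town`, its interaction partner in the true model (a sketch):

```{r}
# Dependence plot of car_power, colored by its interaction partner town
sv_dependence(shap_lgb, v = "car_power", color_var = "town", alpha = 0.5)
```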
## GLM with Kernel SHAP
For all other models, including our GLM, we use model-agnostic Kernel SHAP and focus on SHAP dependence plots. (Setting `verbose = FALSE` suppresses the progress bar, which does not play well with R Markdown.)
```{r}
system.time(
  shap_glm <- shapviz(kernelshap(fit_glm, X = X_explain, bg_X = bg, verbose = FALSE))
)

sv_dependence(shap_glm, x, alpha = 0.5) &
  ylim(-0.5, 1.05)
```
## Neural net with Kernel SHAP
```{r}
# Function that maps data.frame to neural net input and calculates (log) predictions
pred_nn_ln <- function(model, df) {
  X <- data.matrix(df[x])
  X_scaled <- predict(scaler, X)
  log(predict(model, X_scaled, batch_size = 1e4, verbose = 0))
}

# Kernel SHAP
system.time(
  shap_nn <- shapviz(
    kernelshap(fit_nn, X = X_explain, bg_X = bg, pred_fun = pred_nn_ln, verbose = FALSE)
  )
)

# Dependence plots
sv_dependence(shap_nn, x, alpha = 0.5) &
  ylim(-0.5, 1.05)
```
## True model with Kernel SHAP
```{r}
system.time(
  shap_truth <- shapviz(
    kernelshap(
      "truth",
      X = X_explain,
      bg_X = bg,
      pred_fun = function(m, X) log(true_model(X)),
      verbose = FALSE
    )
  )
)

sv_dependence(shap_truth, x, alpha = 0.5) &
  ylim(-0.5, 1.05)
```
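Finally, single-feature dependence plots of the four models can be combined with patchwork (loaded above) to compare them directly; a sketch for `driver_age`:

```{r}
# Side-by-side comparison of the driver_age dependence across models (sketch)
(sv_dependence(shap_lgb, "driver_age") + ggtitle("LightGBM") |
   sv_dependence(shap_glm, "driver_age") + ggtitle("GLM")) /
  (sv_dependence(shap_nn, "driver_age") + ggtitle("Neural net") |
     sv_dependence(shap_truth, "driver_age") + ggtitle("True model"))
```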