# Survival Analysis
<a target="_blank" href="https://colab.research.google.com/github/intro-stat-learning/ISLP_labs/blob/v2.2/Ch11-surv-lab.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/intro-stat-learning/ISLP_labs/v2.2?labpath=Ch11-surv-lab.ipynb)
In this lab, we perform survival analyses on three separate data
sets. In Section~\ref{brain.cancer.sec} we analyze the `BrainCancer`
data that was first described in Section~\ref{sec:KM}. In Section~\ref{time.to.pub.sec}, we examine the `Publication`
data from Section~\ref{sec:pub}. Finally, Section~\ref{call.center.sec} explores
a simulated call-center data set.
We begin by importing some of our libraries at this top
level. This makes the code more readable, since scanning the first few
lines of the notebook tells us what libraries are used throughout.
```{python}
from matplotlib.pyplot import subplots
import numpy as np
import pandas as pd
from ISLP.models import ModelSpec as MS
from ISLP import load_data
```
We also collect the new imports
needed for this lab.
```{python}
from lifelines import \
     (KaplanMeierFitter,
      CoxPHFitter)
from lifelines.statistics import \
     (logrank_test,
      multivariate_logrank_test)
from ISLP.survival import sim_time
```
## Brain Cancer Data
We begin with the `BrainCancer` data set, contained in the `ISLP` package.
```{python}
BrainCancer = load_data('BrainCancer')
BrainCancer.columns
```
The rows index the 88 patients, while the 8 columns contain the predictors and outcome variables.
We first briefly examine the data.
```{python}
BrainCancer['sex'].value_counts()
```
```{python}
BrainCancer['diagnosis'].value_counts()
```
```{python}
BrainCancer['status'].value_counts()
```
Before beginning an analysis, it is important to know how the
`status` variable has been coded. Most software
uses the convention that a `status` of 1 indicates an
uncensored observation (often death), and a `status` of 0 indicates a censored
observation. But some scientists might use the opposite coding. For
the `BrainCancer` data set, 35 patients died before the end of
the study, so we are using the conventional coding.
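If a data set did use the opposite convention, recoding is a one-line operation. A toy sketch (the values here are made up for illustration):

```{python}
import pandas as pd

# Hypothetical status column using the opposite convention:
# 0 = event observed, 1 = censored.
status_flipped = pd.Series([0, 1, 0, 0, 1])

# Convert to the conventional coding (1 = event, 0 = censored).
status = 1 - status_flipped
status.tolist()
```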
To begin the analysis, we re-create the Kaplan-Meier survival curve shown in Figure~\ref{fig:survbrain}. The main
package we will use for survival analysis
is `lifelines`.
The variable `time` corresponds to $y_i$, the time to the $i$th event (either censoring or
death). The first argument to `km.fit` is the event time, and the
second argument is the censoring variable, with a 1 indicating an observed
failure time. The `plot()` method produces a survival curve with pointwise confidence
intervals. By default, these are 95% confidence intervals, but the level can be changed
by setting the `alpha` argument to one minus the desired
confidence level.
```{python}
fig, ax = subplots(figsize=(8,8))
km = KaplanMeierFitter()
km_brain = km.fit(BrainCancer['time'], BrainCancer['status'])
km_brain.plot(label='Kaplan Meier estimate', ax=ax)
```
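Under the hood, the curve that `km.fit` estimates is the product-limit formula $\hat{S}(t) = \prod_{t_i \le t} (1 - d_i/n_i)$, where $d_i$ is the number of deaths at event time $t_i$ and $n_i$ the number still at risk. A minimal pure-`numpy` sketch on toy data (an illustration of the formula, not the `lifelines` implementation):

```{python}
import numpy as np

def km_estimate(time, status):
    """Kaplan-Meier product-limit estimate at each distinct event time."""
    time = np.asarray(time)
    status = np.asarray(status)
    event_times = np.unique(time[status == 1])
    surv = []
    s = 1.0
    for t in event_times:
        n_at_risk = np.sum(time >= t)                      # still under observation just before t
        n_events = np.sum((time == t) & (status == 1))     # deaths at t
        s *= 1 - n_events / n_at_risk                      # product-limit update
        surv.append(s)
    return event_times, np.array(surv)

# Toy data: 6 subjects; status 0 marks a censored time.
times  = [2, 3, 3, 5, 8, 9]
status = [1, 1, 0, 1, 0, 1]
t, s = km_estimate(times, status)
t, s
```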
Next we create Kaplan-Meier survival curves that are stratified by
`sex`, in order to reproduce Figure~\ref{fig:survbrain2}.
We do this using the `groupby()` method of a dataframe.
This method returns a generator that can
be iterated over in the `for` loop. In this case,
the items in the `for` loop are 2-tuples representing
the groups: the first entry is the value
of the grouping column `sex` while the second value
is the dataframe consisting of all rows in the
dataframe matching that value of `sex`.
We will want to use this data below
in the log-rank test, hence we store this
information in the dictionary `by_sex`. Finally,
we have also used the notion of
*string interpolation* to automatically
label the different lines in the plot. String
interpolation is a powerful technique to format strings ---
`Python` has many ways to facilitate such operations.
```{python}
fig, ax = subplots(figsize=(8,8))
by_sex = {}
for sex, df in BrainCancer.groupby('sex'):
    by_sex[sex] = df
    km_sex = km.fit(df['time'], df['status'])
    km_sex.plot(label='Sex=%s' % sex, ax=ax)
```
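The `groupby()` iteration and string interpolation used above can be illustrated on a toy data frame. Here we use an f-string, one of several interpolation styles `Python` offers; note that `groupby()` sorts the group keys by default:

```{python}
import pandas as pd

toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male', 'Female'],
                    'time': [10, 12, 7, 15]})

labels = []
for sex, df in toy.groupby('sex'):
    # `sex` is the group key; `df` holds the rows matching that key.
    labels.append(f'Sex={sex} (n={df.shape[0]})')   # f-string interpolation
labels
```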
As discussed in Section~\ref{sec:logrank}, we can perform a
log-rank test to compare the survival of males to females. We use
the `logrank_test()` function from the `lifelines.statistics` module.
The first two arguments are the event times for the two groups;
the next two (optional) arguments are the corresponding censoring indicators.
```{python}
logrank_test(by_sex['Male']['time'],
             by_sex['Female']['time'],
             by_sex['Male']['status'],
             by_sex['Female']['status'])
```
The resulting $p$-value is $0.23$, indicating no evidence of a
difference in survival between the two sexes.
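The statistic being computed compares, at each event time, the deaths observed in one group to the number expected if the two groups shared a common hazard. A simplified two-sample sketch on toy data (not how `logrank_test` is implemented; it handles ties and edge cases more carefully):

```{python}
import numpy as np

def logrank_stat(time1, status1, time2, status2):
    """Two-sample log-rank chi-square statistic (no special tie handling)."""
    time = np.concatenate([time1, time2])
    status = np.concatenate([status1, status2])
    group = np.concatenate([np.zeros(len(time1)), np.ones(len(time2))])
    O1 = E1 = V = 0.0
    for t in np.unique(time[status == 1]):
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 0)).sum()          # group-1 subjects at risk
        d = ((time == t) & (status == 1)).sum()      # deaths at t
        d1 = ((time == t) & (status == 1) & (group == 0)).sum()
        O1 += d1                                     # observed deaths in group 1
        E1 += d * n1 / n                             # expected under equal hazards
        if n > 1:
            V += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return (O1 - E1) ** 2 / V

stat = logrank_stat(np.array([2, 4, 6]), np.array([1, 1, 0]),
                    np.array([1, 3, 5]), np.array([1, 0, 1]))
stat
```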
Next, we use the `CoxPHFitter()` estimator
from `lifelines` to fit Cox proportional hazards models.
To begin, we consider a model that uses `sex` as the only predictor.
```{python}
coxph = CoxPHFitter # shorthand
sex_df = BrainCancer[['time', 'status', 'sex']]
model_df = MS(['time', 'status', 'sex'],
              intercept=False).fit_transform(sex_df)
cox_fit = coxph().fit(model_df,
                      'time',
                      'status')
cox_fit.summary[['coef', 'se(coef)', 'p']]
```
The first argument to `fit` should be a data frame containing
at least the event time (the second argument `time` in this case),
as well as an optional censoring variable (the argument `status` in this case).
Note also that the Cox model does not include an intercept, which is why
we used the `intercept=False` argument to `ModelSpec` above.
The `summary()` method delivers many columns; we chose to abbreviate its output here.
It is possible to obtain the likelihood ratio test comparing this model to the one
with no features as follows:
```{python}
cox_fit.log_likelihood_ratio_test()
```
Regardless of which test we use, we see that there is no clear
evidence for a difference in survival between males and females. As
we learned in this chapter, the score test from the Cox model is
exactly equal to the log-rank test statistic!
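The partial likelihood that `CoxPHFitter` maximizes can be written down directly. For a single covariate with no tied event times, each death contributes $e^{\beta x_i}$ divided by the sum of $e^{\beta x_j}$ over the risk set. A sketch on toy data (for intuition only; `lifelines` handles ties, multiple covariates, and the optimization):

```{python}
import numpy as np

def cox_partial_loglik(beta, time, status, x):
    """Cox partial log-likelihood for one covariate (assumes no tied event times)."""
    time, status, x = map(np.asarray, (time, status, x))
    ll = 0.0
    for i in np.flatnonzero(status == 1):
        risk_set = time >= time[i]   # everyone still at risk at this event time
        ll += beta * x[i] - np.log(np.sum(np.exp(beta * x[risk_set])))
    return ll

time   = np.array([4, 3, 1, 6])
status = np.array([1, 1, 1, 0])
x      = np.array([0, 1, 1, 0])      # e.g. a 0/1 coding of sex

# At beta = 0 every subject in a risk set is equally likely to fail,
# so the log-likelihood reduces to -sum(log(size of each risk set)).
cox_partial_loglik(0.0, time, status, x)
```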
Now we fit a model that makes use of additional predictors. We first note
that one of our `diagnosis` values is missing, hence
we drop that observation before continuing.
```{python}
cleaned = BrainCancer.dropna()
all_MS = MS(cleaned.columns, intercept=False)
all_df = all_MS.fit_transform(cleaned)
fit_all = coxph().fit(all_df,
                      'time',
                      'status')
fit_all.summary[['coef', 'se(coef)', 'p']]
```
The `diagnosis` variable has been coded so that the baseline
corresponds to HG glioma. The results indicate that the risk associated with HG glioma
is more than eight times (i.e. $e^{2.15}=8.62$) the risk associated
with meningioma. In other words, after adjusting for the other
predictors, patients with HG glioma have much worse survival compared
to those with meningioma. In addition, larger values of the Karnofsky
index, `ki`, are associated with lower risk, i.e. longer survival.
Finally, we plot estimated survival curves for each diagnosis category,
adjusting for the other predictors. To make these plots, we set the
values of the other predictors equal to the mean for quantitative variables
and equal to the mode for categorical variables. To do this, we use the
`apply()` method with `axis=0`, which applies a function
`representative` to each column; `representative` checks whether a column
is categorical, and returns its mode if so and its mean otherwise.
```{python}
levels = cleaned['diagnosis'].unique()
def representative(series):
    if hasattr(series.dtype, 'categories'):
        return pd.Series.mode(series)
    else:
        return series.mean()
modal_data = cleaned.apply(representative, axis=0)
```
We make four
copies of these representative values and assign the `diagnosis` column to be the four different
diagnoses.
```{python}
modal_df = pd.DataFrame(
    [modal_data.iloc[0] for _ in range(len(levels))])
modal_df['diagnosis'] = levels
modal_df
```
We then construct the model matrix based on the model specification `all_MS` used to fit
the model, and name the rows according to the levels of `diagnosis`.
```{python}
modal_X = all_MS.transform(modal_df)
modal_X.index = levels
modal_X
```
We can use the `predict_survival_function()` method to obtain the estimated survival function.
```{python}
predicted_survival = fit_all.predict_survival_function(modal_X)
predicted_survival
```
This returns a data frame whose `plot()` method yields the different survival curves. To avoid clutter in
the plots, we do not display confidence intervals.
```{python}
fig, ax = subplots(figsize=(8, 8))
predicted_survival.plot(ax=ax);
```
## Publication Data
The `Publication` data presented in Section~\ref{sec:pub} can be
found in the `ISLP` package.
We first reproduce Figure~\ref{fig:lauersurv} by plotting the Kaplan-Meier curves
stratified on the `posres` variable, which records whether the
study had a positive or negative result.
```{python}
fig, ax = subplots(figsize=(8,8))
Publication = load_data('Publication')
by_result = {}
for result, df in Publication.groupby('posres'):
    by_result[result] = df
    km_result = km.fit(df['time'], df['status'])
    km_result.plot(label='Result=%d' % result, ax=ax)
```
As discussed previously, the $p$-values from fitting Cox’s
proportional hazards model to the `posres` variable are quite
large, providing no evidence of a difference in time-to-publication
between studies with positive versus negative results.
```{python}
posres_df = MS(['posres',
                'time',
                'status'],
               intercept=False).fit_transform(Publication)
posres_fit = coxph().fit(posres_df,
                         'time',
                         'status')
posres_fit.summary[['coef', 'se(coef)', 'p']]
```
However, the results change dramatically when we include other
predictors in the model. Here we exclude the funding mechanism
variable.
```{python}
model = MS(Publication.columns.drop('mech'),
           intercept=False)
coxph().fit(model.fit_transform(Publication),
            'time',
            'status').summary[['coef', 'se(coef)', 'p']]
```
We see that there are a number of statistically significant variables,
including whether the trial focused on a clinical endpoint, the impact
of the study, and whether the study had positive or negative results.
## Call Center Data
In this section, we will simulate survival data using the relationship
between cumulative hazard and
the survival function explored in Exercise \ref{ex:all3}.
Our simulated data will represent the observed
wait times (in seconds) for 2,000 customers who have phoned a call
center. In this context, censoring occurs if a customer hangs up
before his or her call is answered.
There are three covariates: `Operators` (the number of call
center operators available at the time of the call, which can range
from $5$ to $15$), `Center` (either A, B, or C), and
`Time` of day (Morning, Afternoon, or Evening). We generate data
for these covariates so that all possibilities are equally likely: for
instance, morning, afternoon and evening calls are equally likely, and
any number of operators from $5$ to $15$ is equally likely.
```{python}
rng = np.random.default_rng(10)
N = 2000
Operators = rng.choice(np.arange(5, 16),
                       N,
                       replace=True)
Center = rng.choice(['A', 'B', 'C'],
                    N,
                    replace=True)
Time = rng.choice(['Morn.', 'After.', 'Even.'],
                  N,
                  replace=True)
D = pd.DataFrame({'Operators': Operators,
                  'Center': pd.Categorical(Center),
                  'Time': pd.Categorical(Time)})
```
We then build a model matrix, omitting the intercept.
```{python}
model = MS(['Operators',
            'Center',
            'Time'],
           intercept=False)
X = model.fit_transform(D)
```
It is worthwhile to take a peek at the model matrix `X`, so
that we can be sure that we understand how the variables have been coded. By default,
the levels of categorical variables are sorted and, as usual, the first column of the one-hot encoding
of the variable is dropped.
```{python}
X[:5]
```
Next, we specify the coefficients and the hazard function.
```{python}
true_beta = np.array([0.04, -0.3, 0, 0.2, -0.2])
true_linpred = X.dot(true_beta)
hazard = lambda t: 1e-5 * t
```
Here, we have set the coefficient associated with `Operators` to
equal $0.04$; in other words, each additional operator leads to a
$e^{0.04}=1.041$-fold increase in the “risk” that the call will be
answered, given the `Center` and `Time` covariates. This
makes sense: the greater the number of operators at hand, the shorter
the wait time! The coefficient associated with `Center == B` is
$-0.3$, and `Center == A` is treated as the baseline. This means
that the risk of a call being answered at Center B is 0.74 times the
risk that it will be answered at Center A; in other words, the wait
times are a bit longer at Center B.
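A quick numeric check of the hazard ratios quoted above:

```{python}
import numpy as np

# Hazard ratios implied by the simulation coefficients above
hr_operators = np.exp(0.04)    # per additional operator
hr_center_b  = np.exp(-0.3)    # Center B relative to baseline Center A
round(hr_operators, 3), round(hr_center_b, 2)
```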
Recall from Section~\ref{Ch2-statlearn-lab:loading-data} the use of `lambda`
for creating short functions on the fly.
We use the function
`sim_time()` from the `ISLP.survival` package. This function
uses the relationship between the survival function
and cumulative hazard $S(t) = \exp(-H(t))$ and the specific
form of the cumulative hazard function in the Cox model
to simulate data based on values of the linear predictor
`true_linpred` and the cumulative hazard.
We need to provide the cumulative hazard function, which we do here.
```{python}
cum_hazard = lambda t: 1e-5 * t**2 / 2
```
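The relationship $S(t \mid x) = \exp(-H(t)\,e^{\eta})$, where $\eta$ is the linear predictor, suggests one way such data can be generated: if $U$ is uniform on $(0,1)$, then $T = H^{-1}(-\log U / e^{\eta})$ has the desired survival function. Assuming `sim_time()` works by inverting the cumulative hazard (a sketch under that assumption, not its actual implementation), with $H(t) = 10^{-5} t^2 / 2$ the inverse is available in closed form:

```{python}
import numpy as np

rng = np.random.default_rng(0)

def sim_time_sketch(linpred, rng, c=1e-5):
    """Inverse-transform draw from a Cox model with H(t) = c * t**2 / 2."""
    U = rng.uniform()
    # Solve exp(-H(t) * exp(linpred)) = U for t:
    #   H(t) = -log(U) / exp(linpred),  and  t = sqrt(2 * H(t) / c)
    target = -np.log(U) / np.exp(linpred)
    return np.sqrt(2 * target / c)

draws = np.array([sim_time_sketch(0.5, rng) for _ in range(1000)])
draws.min(), draws.mean()
```

Larger values of the linear predictor shrink the target cumulative hazard, and hence the simulated wait time, which matches the interpretation of a higher "risk" of the call being answered.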
We are now ready to generate data under the Cox proportional hazards
model. We truncate the maximum time to 1000 seconds to keep
simulated wait times reasonable. The function
`sim_time()` takes a linear predictor,
a cumulative hazard function and a
random number generator.
```{python}
W = np.array([sim_time(l, cum_hazard, rng)
              for l in true_linpred])
D['Wait time'] = np.clip(W, 0, 1000)
```
We now simulate our censoring variable, for which we assume
90% of calls were answered (`Failed==1`) before the
customer hung up (`Failed==0`).
```{python}
D['Failed'] = rng.choice([1, 0],
                         N,
                         p=[0.9, 0.1])
D[:5]
```
```{python}
D['Failed'].mean()
```
We now plot Kaplan-Meier survival curves. First, we stratify by `Center`.
```{python}
fig, ax = subplots(figsize=(8,8))
by_center = {}
for center, df in D.groupby('Center'):
    by_center[center] = df
    km_center = km.fit(df['Wait time'], df['Failed'])
    km_center.plot(label='Center=%s' % center, ax=ax)
ax.set_title("Probability of Still Being on Hold")
```
Next, we stratify by `Time`.
```{python}
fig, ax = subplots(figsize=(8,8))
by_time = {}
for time, df in D.groupby('Time'):
    by_time[time] = df
    km_time = km.fit(df['Wait time'], df['Failed'])
    km_time.plot(label='Time=%s' % time, ax=ax)
ax.set_title("Probability of Still Being on Hold")
```
It seems that calls at Call Center B take longer to be answered than
calls at Centers A and C. Similarly, it appears that wait times are
longest in the morning and shortest in the evening hours. We can use a
log-rank test to determine whether these differences are statistically
significant using the function `multivariate_logrank_test()`.
```{python}
multivariate_logrank_test(D['Wait time'],
                          D['Center'],
                          D['Failed'])
```
Next, we consider the effect of `Time`.
```{python}
multivariate_logrank_test(D['Wait time'],
                          D['Time'],
                          D['Failed'])
```
As in the case of a categorical variable with 2 levels, these
results are similar to the likelihood ratio test
from the Cox proportional hazards model. First, we
look at the results for `Center`.
```{python}
X = MS(['Wait time',
        'Failed',
        'Center'],
       intercept=False).fit_transform(D)
F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test()
```
Next, we look at the results for `Time`.
```{python}
X = MS(['Wait time',
        'Failed',
        'Time'],
       intercept=False).fit_transform(D)
F = coxph().fit(X, 'Wait time', 'Failed')
F.log_likelihood_ratio_test()
```
We find that differences between centers are highly significant, as
are differences between times of day.
Finally, we fit Cox's proportional hazards model to the data.
```{python}
X = MS(D.columns,
       intercept=False).fit_transform(D)
fit_queuing = coxph().fit(X,
                          'Wait time',
                          'Failed')
fit_queuing.summary[['coef', 'se(coef)', 'p']]
```
The $p$-values for Center B and evening time
are very small. It is also clear that the
hazard --- that is, the instantaneous risk that a call will be
answered --- increases with the number of operators. Since we
generated the data ourselves, we know that the true coefficients for
`Operators`, `Center = B`, `Center = C`,
`Time = Even.` and `Time = Morn.` are $0.04$, $-0.3$,
$0$, $0.2$, and $-0.2$, respectively. The coefficient estimates
from the fitted Cox model are fairly accurate.