Introduction #36

Open
wants to merge 21 commits into
base: main
Merge branch 'main' into introduction
RaphaelS1 committed Dec 5, 2024
commit 51484095ea0715dbccfe491ed82ea1542924a11c
37 changes: 1 addition & 36 deletions book/P1C4_survival.qmd
@@ -37,42 +37,7 @@ In practice the differences are often blurred as time-measurement will naturally
Also discrete-time methods are often applied to continuous time data and vice versa.
It is nevertheless important to make the distinction as it informs mathematical treatment and definition of the different quantities introduced below.


___

<!-- In their broadest and most basic definitions, survival analysis is the study of temporal data from a given origin until the occurrence of one or more events or 'end-points' [@Collett2014], and machine learning is the study of models and algorithms that learn from data in order to make predictions or find patterns [@Hastie2001]. Reducing either field to these definitions is ill-advised. -->
<!--
Survival analysis is concerned with time-to-event data, i.e. data where the outcome is a duration from some origin until the occurrence of one or multiple events of interest.
In many medical settings, this duration will often be literal survival time, i.e. time until death after some procedure or treatment.
When collecting such data, however, the outcome of interest often cannot be (fully) observed, e.g. because subjects drop out of the study, have not experienced the event of interest by the end of the study period, or because another, competing event occurs.

Often analysis of time-to-event data targets the estimation of the (improper) distribution of the event times or, equivalently, modelling the transitions between different states (e.g. alive -> dead) while taking into account censoring and truncation as well as other peculiarities of time-to-event data. However, the target of estimation can also be a relative risk score or the expected time-to-event (cf. @sec-surv-set-types for details).
-->


As discussed in the introduction, *Survival Analysis* is concerned with data where the outcome is a time-to-event.
Because the collection of such data takes place in the temporal domain (it takes time to observe a duration), the event of interest is often unobservable, for example because the event did not occur by the end of the data collection period or because another event occurred that prevents the event of interest from being observed.
In survival analysis terminology these are referred to as *censoring* and *competing risks*, respectively.

This chapter defines these and related terms and introduces basic terminology and mathematical definitions.
@sec-surv-set-math starts with the common single-event, right-censored data setting and then extends to further types of censoring as well as truncation.
@sec-eha introduces event-history analysis, which is a generalisation to settings with multiple, potentially competing or recurrent events.
@sec-surv-set-types defines common prediction types of survival models, which is particularly important for machine learning based survival analysis.
Finally, in order to cleanly discuss *machine learning survival analysis*, the *survival task* is introduced in @sec-surv-setmltask.

While these definitions and concepts are not new to survival analysis, we feel that it is of utmost importance for machine learning practitioners to be able to identify and specify the survival problem present in their data correctly, as misspecification cannot be detected by comparing the predictive performance of alternative models.
Predictive performance can only indicate whether one model is better suited to minimize a given objective function, not whether the objective function itself is specified correctly.
The latter depends on the (assumptions about the) data generating process and must also be reflected in the definition of the evaluation measure.

## Survival Data and Definitions {#sec-surv-set-math}

This section describes the basic template for a survival analysis problem and introduces key definitions that will be used throughout this book.

### Quantifying the Distribution of Event Times {#sec-distributions}

This section introduces functions that can be used to fully characterise a probability distribution, termed here *distribution defining functions*.
Particular focus is given to distribution defining functions that are important in survival analysis.
### Continuous Time {#sec-distributions-continuous}

For now, assume a continuous, positive, random variable $Y$ taking values in (t.v.i.) $\NNReals$.
A standard representation of the distribution of $Y$ is given by the probability density function (pdf), $f_Y: \NNReals \rightarrow \NNReals$, and cumulative distribution function (cdf), $F_Y: \NNReals \rightarrow [0,1]; (\tau) \mapsto P(Y \leq \tau)$.
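
To make the relation between these two functions concrete, the following is a minimal worked sketch. The exponential distribution and its rate $\lambda > 0$ are assumptions chosen purely for illustration and do not appear in the text above. For a continuous $Y$ on $\NNReals$, the cdf is the integral of the pdf, and the pdf is the derivative of the cdf wherever that derivative exists:

$$
F_Y(\tau) = \int_0^{\tau} f_Y(u) \, du, \qquad f_Y(\tau) = \frac{d}{d\tau} F_Y(\tau)
$$

Under the illustrative exponential assumption this gives:

$$
f_Y(\tau) = \lambda e^{-\lambda \tau} \quad \Rightarrow \quad F_Y(\tau) = \int_0^{\tau} \lambda e^{-\lambda u} \, du = 1 - e^{-\lambda \tau}
$$

Either function therefore determines the other, which is why each can serve as a distribution defining function.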
6 changes: 4 additions & 2 deletions book/experiments/code.R
@@ -2,7 +2,9 @@ remotes::install_github("mlr-org/mlr3proba", ref = 'v0.5.7', upgrade = "never")
remotes::install_github("mlr-org/mlr3", ref = 'v0.16.1', upgrade = "never")
remotes::install_github("mlr-org/paradox", ref = 'v0.11.1', upgrade = "never")

## Introduction
library(ggplot2)
theme_set(theme_bw())

library(distr6)
library(ggplot2)
g = dstr("Gompertz", shape = 2, decorators = "ExoticStatistics")
@@ -15,7 +17,7 @@ g = ggplot(d, aes(x = t, y = y, color = fun)) +
theme(legend.position = "n")
ggsave("book/Figures/introduction/gompertz.png", g, height = 3, units = "in",
dpi = 600)

## Ranking
rm(list = ls())
library(dplyr)