Chapter 3 - Nonparametric Estimation and Testing
Department of Biostatistics & Medical Informatics
University of Wisconsin-Madison
Nelsen–Aalen estimator of cumulative hazard
Kaplan–Meier estimator of survival function
Log-rank test and variations
Analysis of the German Breast Cancer study
\[\newcommand{\d}{{\rm d}}\] \[\newcommand{\dd}{{\rm d}}\] \[\newcommand{\pr}{{\rm pr}}\] \[\newcommand{\var}{{\rm var}}\] \[\newcommand{\se}{{\rm se}}\] \[\newcommand{\indep}{\perp \!\!\! \perp}\]
Recall in Chapter 2…\[\begin{align} E\{\dd N(t)\mid X\geq t\}&=\frac{\pr\{\dd N^*(t)=1, C\geq t\}}{\pr(T\geq t, C\geq t)} \notag\\ &=\frac{\pr\{\dd N^*(t)=1\}\pr(C\geq t)}{\pr(T\geq t)\pr(C\geq t)} \notag\\ &=E\{\dd N^*(t)\mid T\geq t\}\notag\\ &=\dd\Lambda(t), \end{align}\]
So \[\dd\Lambda(t_j) = E\{\dd N(t_j)\mid X\geq t_j\}=\frac{E\{\dd N(t_j)\}}{\pr(X\geq t_j)}\]
Motivates empirical estimator \[\begin{align} \dd\hat\Lambda(t_j) & = \frac{d_j}{n_j} = \frac{\sum_{i=1}^n \dd N_i(t_j)}{\sum_{i=1}^n I(X_i\geq t_j)}\\ &=\mbox{proportion of failures among those at risk} \end{align}\]
Cumulative hazard \[ \hat\Lambda(t)=\sum_{j:t_j\leq t}\frac{d_j}{n_j} =\int_0^t\frac{\sum_{i=1}^n \dd N_i(u)}{\sum_{i=1}^n I(X_i\geq u)} \]
Continuous relationship \[ \tilde S(t)=\exp\left\{-\hat\Lambda(t)\right\} \]
Discrete relationship (more general and intuitive)
\(S(t_j)\): probability of surviving past \(t_j\) \[\begin{align*} S(t_j)&=\pr(T>t_j)\\ &=\pr(T>t_1)\pr(T>t_2\mid T>t_1)\cdots\pr(T>t_j\mid T>t_{j-1})\\ &=\pr(T>t_1\mid T\geq t_1)\pr(T>t_2\mid T\geq t_2)\cdots\pr(T>t_j\mid T\geq t_j)\\ &=\prod_{l=1}^j\pr(T>t_l\mid T\geq t_l), \end{align*}\]
Overall \[\begin{equation*} S(t)=\prod_{j:t_j\leq t}\pr(T>t_j\mid T\geq t_j) \end{equation*}\]
Note \[ \pr(T>t_j\mid T\geq t_j) = 1-\pr(T=t_j\mid T\geq t_j) = 1-\dd\Lambda(t_j) \]
Plug in the Nelsen–Aalen \[\begin{equation} \hat S(t)=\prod_{j:t_j\leq t}\{1-\dd\hat\Lambda(t_j)\}=\prod_{j:t_j\leq t}(1-d_j/n_j). \end{equation}\]
Easier under log-transform from product to sum \[ \log\hat S(t)=\sum_{j:t_j\leq t}\log(1-d_j/n_j) \]
By delta method\(^*\) \[ \hat\var\{\hat S(t)\}=\hat S(t)^2\hat\var\{\log\hat S(t)\}, \]
Delta Method
If approximately \(S_n \sim N(\mu, \sigma^2)\), then approximately \(g(S_n) \sim N\left\{g(\mu), \dot g(\mu)^2\sigma^2\right\}\), where \(\dot g(\mu)=\dd g(\mu)/\dd\mu\).
Variance of KM \[\begin{equation}\label{eq:km:greenwood} \hat\var\{\hat S(t)\}=\hat S(t)^2\sum_{j:t_j\leq t}\frac{d_j}{n_j(n_j-d_j)} \end{equation}\]
Naive 95% confidence interval (CI) \[\begin{equation}\label{eq:km:ci_plain} \left[\hat S(t)-1.96\hat\se\{\hat S(t)\}, \hat S(t)+1.96\hat\se\{\hat S(t)\}\right] \end{equation}\]
Transform \(\zeta(t)=\log\{-\log S(t)\} \in \mathbb R\)
CI for \(\zeta(t)\) \[\begin{equation} \left[\hat\zeta(t)-1.96\hat\se\{\hat\zeta(t)\},\hat\zeta(t)+1.96\hat\se\{\hat\zeta(t)\}\right] \end{equation}\]
Transform the bounds back to \(S(t)\) \[ \left[\hat S(t)^{\exp[1.96\hat\se\{\hat\zeta(t)\}]}, \hat S(t)^{\exp[-1.96\hat\se\{\hat\zeta(t)\}]}\right] \subset [0, 1] \]
Remains to calculate \(\hat\se\{\hat\zeta(t)\}\) by delta method (Exercise)
survival::survfit()
(I)Surv(time, status) ~ 1
: fit curve to a homogeneous sample
Surv(time, status) ~ group
: fit curve to each level of group
data = df
: input data frame df
conf.type = "log-log"
: log-log transformation for CI
"log"
: default log transformation"plain"
: naive CIsurvival::survfit()
(II)surfit
object containing KM estimates
summary()
and plot()
summary(obj)
: a list containing
time
: \(t_j\) \((j=1,\ldots, m)\)surv
: \(\hat S(t_j)\)n.risk
: \(n_j\)n.event
: \(d_j\)std.err
: \(\hat\se\{\hat S(t_j)\}\)...
rats.rx
obj <- survfit(Surv(time, status) ~ 1, data = rats.rx,
conf.type = "log-log")
summary(obj)
# Call: survfit(formula = Surv(time, status) ~ 1, data = rats.rx,
# conf.type = "log-log")
#
# time n.risk n.event survival std.err lower 95% CI upper 95% CI
# 34 99 1 0.990 0.0100 0.930 0.999
# 39 98 1 0.980 0.0141 0.922 0.995
# 45 97 1 0.970 0.0172 0.909 0.990
# 67 89 1 0.959 0.0202 0.894 0.984
# 70 86 1 0.948 0.0228 0.879 0.978
# ...
survival::survdiff()
Surv(time, status) ~ group
: test survival function between levels of variable group
strata(str_var)
: stratified by variable str_var
(optional)rho = r
: weights \(\hat S(t_j-)^\rho\) with \(\rho=\) r
pvalue
(p-value of the test)rx
stratified by sex
head(rats)
# litter rx time status sex
# 1 1 1 101 0 f
# 2 1 0 49 1 f
# 3 1 0 104 0 f
# ...
survdiff(Surv(time, status) ~ rx + strata(sex), data = rats)
# Call:
# survdiff(formula = Surv(time, status) ~ rx + strata(sex), data = rats)
#
# N Observed Expected (O-E)^2/E (O-E)^2/V
# rx=0 200 21 28.9 2.16 6.99
# rx=1 100 21 13.1 4.77 6.99
#
# Chisq= 7 on 1 degrees of freedom, p= 0.008
npsm::gehan.test()
nph::logrank.maxtest()
(maximum of multiple weighted statistics)survival::survfit()
survival::survdiff()
Choose one
Problem 3.17
(Extra credit) Problem 3.15