Chapter 4 - Cox Proportional Hazards Regression
Department of Biostatistics & Medical Informatics
University of Wisconsin-Madison
Model specification
Partial-likelihood estimation and inference
Residual analysis and goodness-of-fit
Time-dependent covariates
\[\newcommand{\d}{{\rm d}}\] \[\newcommand{\T}{{\rm T}}\] \[\newcommand{\dd}{{\rm d}}\] \[\newcommand{\pr}{{\rm pr}}\] \[\newcommand{\var}{{\rm var}}\] \[\newcommand{\se}{{\rm se}}\] \[\newcommand{\indep}{\perp \!\!\! \perp}\] \[\newcommand{\Pn}{n^{-1}\sum_{i=1}^n}\]
Parametrize baseline function \[\begin{equation}\label{eq:cox:ph_param} \lambda(t\mid Z;\theta)=\lambda_0(t;\eta)\exp(\beta^\T Z) \end{equation}\]
Estimation
Semiparametric model

Partial likelihood \[\begin{align*} PL_n(\beta)=\prod_{j=1}^m\pr(\mathcal F_j\mid \mathcal R_j,d_j=1) =\prod_{j=1}^m\frac{\exp(\beta^\T Z_{(j)})}{\sum_{i\in\mathcal R_j}\exp(\beta^\T Z_i)} \end{align*}\]
Log-partial likelihood \[\begin{align}\label{eq:cox:pln} pl_n(\beta)=n^{-1}\log PL_n(\beta)&=n^{-1}\sum_{j=1}^m\left\{\beta^\T Z_{(j)}-\log\sum_{i\in\mathcal R_j}\exp(\beta^\T Z_i)\right\}\notag\\ &= n^{-1}\sum_{i=1}^n \delta_i\left\{\beta^\T Z_i- \log\sum_{j=1}^n I(X_j\geq X_i)\exp(\beta^\T Z_j) \right\}\notag\\ &=n^{-1}\sum_{i=1}^n\int_0^\infty \left\{\beta^\T Z_i- \log\sum_{j=1}^n I(X_j\geq t)\exp(\beta^\T Z_j) \right\}\dd N_i(t). \end{align}\]
Exercise: Conditional survival
A patient with covariate value \(z\) is censored at time \(c\). Use the fit model to predict his/her future survival probabilities.
survival::coxph() (I)Surv(time, status) ~ covariates: \((X, \delta) \sim Z\)strata(str_var): stratified by variable str_var (optional)coxph
obj$coefficients: \(\hat\beta\)obj$var: \(\hat\var(\hat\beta)\)survival::coxph() (II)Lambda0$time: \(t\); Lambda0$hazard: \(\hat\Lambda_0(t)\)Lambda0$strata: levels of str_var if stratified by itsurvival::coxph() (III)summary(obj)
summary(obj)$coefficients: a matrix containing \(\hat\beta\), hazard ratio \(\exp(\hat\beta)\), \(\hat\se(\hat\beta)\), test statistic, and \(p\)-value as columnsgtsummary::tbl_regression()
#--- Data frame ---------------------
head(df)
# id time status hormone age meno size grade nodes prog estrg
# 1 1 43.836066 1 1 38 1 18 3 5 141 105
# 3 2 46.557377 1 1 52 1 20 1 1 78 14
# 5 3 41.934426 1 1 47 1 30 2 1 422 89
# 7 4 4.852459 0 1 40 1 24 1 3 25 11
# 8 5 61.081967 0 2 64 2 19 2 1 19 9
# 9 6 63.377049 0 2 49 2 56 1 3 356 64
#--- Model fitting ------------------
obj <- coxph(Surv(time, status)~ hormone + meno + age + size + grade
+ nodes + prog + estrg, data = df)Summary results
#--- Summary results ------------------
summary(obj)
#> n= 686, number of events= 299
#> coef exp(coef) se(coef) z Pr(>|z|)
#> hormone2 -0.346278 0.707316 0.129075 -2.683 0.007301 **
#> meno2 0.258445 1.294915 0.183476 1.409 0.158954
#> age -0.009459 0.990585 0.009301 -1.017 0.309126
#> size 0.007796 1.007827 0.003939 1.979 0.047794 *
#> grade2 0.636112 1.889121 0.249202 2.553 0.010693 *
#> grade3 0.779654 2.180718 0.268480 2.904 0.003685 **
#> nodes 0.048789 1.049998 0.007447 6.551 5.7e-11 ***
#> prog -0.221724 0.801137 0.057353 -3.866 0.000111 ***
#> estrg 0.019731 1.019927 0.045037 0.438 0.661307
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Tumor grade?
grade2 and grade3#--- Breslow estimates of baseline function ------------------
Lambda0 <- basehaz(obj, centered = FALSE)
# Plot the baseline hazard function
plot(stepfun(Lambda0$time, c(0, Lambda0$hazard)), do.points = FALSE,
cex.axis = 0.9, lwd = 2, frame.plot = FALSE, xlim = c(0, 100),
ylim = c(0, 0.8), xlab = "Time (months)",
ylab = "Baseline cumulative hazard", main = "")Exercise: Survival prediction
Predict survival function for arbitrary \(z\).
obj$coefficients (\(\hat\beta\)) and Lambda0 (\(\hat\Lambda_0(t)\))survfit(obj, newdata)
survival::resid() (I)# Calculate Cox-Snell by martingale residuals
# df: data frame
coxsnellres <- df$status - resid(obj, type = "martingale")
# Compute Nelsen-Aalen estimates for cumulative hazard
fit <- survfit(Surv(coxsnellres, df$status) ~ 1)
Lambda <- cumsum(fit$n.event / fit$n.risk)
# Scatter plot
plot(log(fit$time), log(Lambda), xlab = "log(t)",
ylab = "log-cumulative hazard")
abline(0, 1, lty = 3) # add a dotted reference linesurvival::cox.zph()sch$table: \(\chi^2\) tests of proportionality (each covariate and overall)sch$time: \(m\)-vector of the \(X_i\)sch$y: \((m\times p)\)-matrix of the rescaled \(\hat r_i\)plot(sch): plot the residuals, one covariate per panelsurvival::resid() (II)Martingale/Deviance residuals
df: data framezk: name of covariate of interest# Get the residuals
mart_resid <- resid(obj, type = 'martingale') # n-vector of martingale
dev_resid <- resid(obj, type = 'deviance') # n-vector of deviance
# Scatter plot residuals against covariate
plot(df$zk, mart_resid)
lines(lowess(df$zk, mart_resid)) # add a smooth line through points
abline(0, 0, lty = 3) # add a dotted reference line# Compute the residuals
coxsnellres <- df$status-resid(obj, type="martingale")
## Then use N-A method to estimate the cumulative
## hazard function for residuals
fit <- survfit(Surv(coxsnellres,df$status) ~ 1)
Htilde <- cumsum(fit$n.event/fit$n.risk)
# Scatter plot
plot(log(fit$time), log(Htilde), xlab="log(t)", frame.plot = FALSE,
ylab="log-cumulative hazard", xlim = c(-8, 2), ylim = c(-8, 2))
abline(0, 1, lty = 3, lwd = 1.5) # add a reference line



## Martingale residuals
mart_resid <- resid(obj_stra,type = 'martingale')
#plot residuals against, e.g., age
plot(df$age, mart_resid, main='Age', cex.lab=1.2, cex.axis=1.2,
xlab="Age (years)", ylab="Martingale residuals")
lines(lowess(df$age, mart_resid), lwd = 2) # add a smooth line
abline(0, 0, lty = 3, lwd = 2) # add a reference line

agec=1: \(\leq 40\) yrs; agec=2: \((40, 60]\) yrs; agec=3: \(> 60\) yrslog_nodes = log(nodes)# Call:
# coxph(formula = Surv(time, status) ~ hormone + meno + agec +
# size + log_nodes+ prog + estrg + strata(grade), data = df)
# coef exp(coef) se(coef) z Pr(>|z|)
# hormone2 -0.402106 0.668910 0.128995 -3.117 0.00183 **
# meno2 0.247223 1.280465 0.156266 1.582 0.11363
# agec2 -0.499979 0.606544 0.197677 -2.529 0.01143 *
# agec3 -0.497538 0.608026 0.249509 -1.994 0.04614 *
# size 0.006665 1.006688 0.003956 1.685 0.09205 .
# log_nodes 0.477734 1.612417 0.066199 7.217 5.33e-13 ***
# prog -0.229536 0.794902 0.057190 -4.014 5.98e-05 ***
# estrg 0.030350 1.030816 0.046469 0.653 0.51367 
# Tabulate final model results
tbl_final <- tbl_regression(
obj_stra_final,
exponentiate = TRUE,
label = list(
hormone = "Hormone",
meno = "Menopause",
agec = "Age group",
size = "Tumor size (mm)",
log_nodes = "Log-number of positive lymph nodes",
prog = "Progesterone (100 fmol/mg)",
estrg = "Estrogen (100 fmol/mg)"
)
)
survival::coxph() (IV)(start, stop): period during which covariate takes on a particular valueevent: event indicator at time stop
event = 0 need not mean censoring (why?)coxph objectsurvival::coxph() (V)
# Call:
# coxph(formula = Surv(start, stop, event) ~ age + accpt + surgery +
# transplant, data = heart)
#
# n= 172, number of events= 75
#
# coef exp(coef) se(coef) z Pr(>|z|)
# age 0.02717 1.02754 0.01371 1.981 0.0476 *
# accpt -0.14635 0.86386 0.07047 -2.077 0.0378 *
# surgery -0.63721 0.52877 0.36723 -1.735 0.0827 .
# transplant1 -0.01025 0.98980 0.31375 -0.033 0.9739
---cox.zph() based on Grambsch and Therneau (1994)Time-varying coefficients
survival::coxph()PHREGdf$status - residuals(obj, type = "martingale")cox.zph(obj)residuals(obj, type = c("martingale", "deviance"))coxph(Surv(start, stop, event) ~ covariates)