Chapter 3 - Nonparametric Estimation and Testing

Slides

Lecture slides here. (To convert the HTML slides to PDF, press E \(\to\) Print \(\to\) Destination: Save as PDF.)

Chapter Summary

Conditioning on the cohort at risk at each observed event time naturally leads to discrete hazard-based approaches for analyzing time-to-event data. These approaches illuminate how nonparametric estimators, such as the Kaplan–Meier curve, and group-comparison tests, such as the log-rank statistic, arise directly from examining discrete hazards over the course of a study.

The discrete hazard

A discrete hazard captures the probability of an event occurring exactly at a specific observed time, given that the subject remains event-free just before that time. Let \(d_j\) be the number of events at time \(t_j\) and \(n_j\) the size of the at-risk cohort immediately before \(t_j\). Then the discrete hazard at \(t_j\) is often estimated by \[ \hat\lambda(t_j) = \frac{d_j}{n_j}. \] As event times become dense in larger samples, the sum of these discrete hazards approximates the cumulative hazard. This perspective incorporates censoring by reducing \(n_j\) each time individuals drop out or experience the event. The discrete hazard thus defines a step-by-step mechanism for survival analysis, tying observed failures at each \(t_j\) to the relevant risk set.

Working with a discrete hazard emphasizes the fact that each observed event time carries information about the underlying risk. By updating \(n_j\) dynamically, these methods properly weight the observed data for unbiased inference under independent censoring. In practical studies where event times are not necessarily continuous, such discrete formulations provide an elegant way to handle censored observations and can be generalized to more complex sampling and trial designs.
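As a minimal illustration (using made-up follow-up times, not data from this chapter), the discrete hazards \(d_j/n_j\) can be computed directly from a vector of times and event indicators:

```r
# Toy data: follow-up times and event indicators (1 = event, 0 = censored)
time   <- c(2, 3, 3, 5, 7, 8, 8, 10)
status <- c(1, 1, 0, 1, 0, 1, 1, 0)

# Distinct observed event times t_j
ts <- sort(unique(time[status == 1]))

# d_j: events at t_j; n_j: subjects still under observation just before t_j
d <- sapply(ts, function(t) sum(time == t & status == 1))
n <- sapply(ts, function(t) sum(time >= t))

# Discrete hazard estimates and their running sum (Nelson-Aalen increments)
haz <- d / n
data.frame(ts, d, n, haz, cumhaz = cumsum(haz))
```

Summing the hazard column gives the Nelson–Aalen cumulative hazard, the same computation carried out on real data in the R code at the end of this chapter.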

The Kaplan–Meier estimator

From discrete hazards, the Kaplan–Meier estimator emerges as a product-limit construction. If \(\hat\lambda(t_j)\) denotes the discrete hazard at \(t_j\), the survival function is estimated by \[ \hat S(t) = \prod_{t_j \le t} \bigl(1 - \hat\lambda(t_j)\bigr) = \prod_{t_j \le t} \Bigl(1 - \frac{d_j}{n_j}\Bigr). \] Each factor \(1 - d_j / n_j\) represents the proportion of subjects who “survive” past \(t_j\), given that \(n_j\) remain at risk immediately before \(t_j\). Because \(n_j\) is updated to exclude prior failures and censored observations, the estimator remains consistent even when censoring occurs at arbitrary times. At large sample sizes, \(\hat S(t)\) converges in probability to the true \(S(t)\), making it a cornerstone of nonparametric survival analysis.

Beyond plotting the full curve, summary measures like the median survival time emerge by identifying \(t\) such that \(\hat S(t)\) falls to 0.5. This approach can easily extend to other percentiles or even partial survival probabilities at fixed times of interest. The stepwise nature of the Kaplan–Meier curve also offers straightforward visualization, indicating event occurrences at each jump and seamlessly accounting for censoring up to those points.
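Both the product-limit computation and the median read-off can be sketched with made-up counts (not the chapter's data):

```r
# Illustrative event times t_j with d_j events out of n_j at risk
ts <- c(2, 3, 5, 8)
d  <- c(1, 1, 1, 2)
n  <- c(8, 7, 5, 3)

# Kaplan-Meier estimate: cumulative product of (1 - d_j/n_j)
S <- cumprod(1 - d / n)
round(S, 3)  # 0.875 0.750 0.600 0.200

# Estimated median: first event time at which S(t) drops to 0.5 or below
ts[which(S <= 0.5)[1]]  # 8
```

Replacing 0.5 with another probability yields the corresponding percentile in the same way.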

The log-rank test

Groups often require formal comparison to assess whether one experiences faster or slower rates of failure than another. The log-rank test accumulates differences between observed and expected group-specific event counts over time. Let \(d_{1j}\) and \(n_{1j}\) denote the observed failures and risk set in group 1 at time \(t_j\), with \(d_j\) and \(n_j\) as the totals across all groups. The log-rank statistic sums \[ d_{1j} - d_j \frac{n_{1j}}{n_j} \] over \(t_j\). Under the null hypothesis of no group difference, the fraction \(n_{1j}/n_j\) captures group 1’s expected share of failures, so repeated departures from that share indicate differing survival experiences. In large samples, this sum, squared and divided by its estimated variance, follows an approximate chi-square distribution with one degree of freedom, allowing standard significance testing. The log-rank test has strong power under proportional hazards, where one group’s hazard is a constant multiple of the other’s. Nonetheless, real data may violate strict proportionality, motivating alternative weighting schemes or more flexible tests.
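The observed-minus-expected sum and its hypergeometric variance can be computed by hand on toy data (invented here for illustration) and checked against survdiff():

```r
library(survival)

# Toy two-group data (illustrative, not from the chapter)
time   <- c(2, 4, 4, 6, 8, 3, 5, 7, 9, 9)
status <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 1)
group  <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)

O_minus_E <- 0  # running sum of d_1j - d_j * n_1j / n_j
V         <- 0  # accumulated hypergeometric variance
for (t in sort(unique(time[status == 1]))) {
  d  <- sum(time == t & status == 1)                # total events at t
  n  <- sum(time >= t)                              # total at risk at t
  d1 <- sum(time == t & status == 1 & group == 1)   # group-1 events at t
  n1 <- sum(time >= t & group == 1)                 # group-1 at risk at t
  O_minus_E <- O_minus_E + d1 - d * n1 / n
  if (n > 1) V <- V + d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
}
chisq <- O_minus_E^2 / V  # compare to survdiff(Surv(time, status) ~ group)$chisq
```

The loop makes explicit that each event time contributes one observed-minus-expected term, with the risk sets recomputed as subjects fail or are censored.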

Extensions of the log-rank test

Weighted log-rank variants highlight different parts of the follow-up period. When early effects dominate, decreasing weights place emphasis on initial events. For delayed effects, increasing weights spotlight later intervals. A “max-combo test” addresses uncertainty about the exact timing of effects by calculating multiple weighted statistics and taking their maximum, adjusting for correlation to preserve correct Type I error. This approach captures a broad array of hazard patterns, albeit with more complex null distributions. Stratification extends the log-rank framework further by partitioning subjects into strata defined by confounders (e.g., menopausal status). Within each stratum, the test proceeds as usual, and these stratum-level statistics combine into a single global test. This adjustment effectively controls for measured covariates that could otherwise bias group comparisons.
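One weighted family is directly available through survdiff()'s rho argument (the \(G^\rho\) family of Harrington and Fleming): rho = 0 gives the standard log-rank test, while rho = 1 weights each event time by the pooled survival estimate and hence emphasizes early differences. The data below are invented for illustration; max-combo tests require additional packages not shown here.

```r
library(survival)

# Illustrative two-group data (made up for this example)
dat <- data.frame(
  time   = c(2, 4, 4, 6, 8, 3, 5, 7, 9, 9),
  status = c(1, 1, 0, 1, 0, 1, 1, 1, 0, 1),
  group  = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
)

# rho = 0: standard (unweighted) log-rank test
survdiff(Surv(time, status) ~ group, data = dat, rho = 0)

# rho = 1: weights proportional to S(t), emphasizing early events
# (Peto-Peto modification of the Gehan-Wilcoxon test)
survdiff(Surv(time, status) ~ group, data = dat, rho = 1)
```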

Statistical properties via martingales

Beyond heuristic arguments, martingale theory provides rigorous large-sample derivations for the Kaplan–Meier estimator and the log-rank test. This is because the estimators and test statistics can be written as stochastic integrals with respect to martingales, which simplifies the computation of variances and covariances.
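In outline (standard counting-process notation, not tied to this chapter's data): let \(N_i(t)\) count observed events for subject \(i\) and \(Y_i(t)\) indicate that subject \(i\) is still at risk at time \(t\). Under independent censoring, \[ M_i(t) = N_i(t) - \int_0^t Y_i(s)\,\lambda(s)\,\mathrm{d}s \] is a martingale, and the Nelson–Aalen estimator satisfies, heuristically, \[ \hat\Lambda(t) - \Lambda(t) \approx \int_0^t \frac{\mathrm{d}\bar M(s)}{\bar Y(s)}, \qquad \bar M = \sum_i M_i, \quad \bar Y = \sum_i Y_i, \] a stochastic integral whose predictable variation leads to the variance estimators and asymptotic normality underlying these methods.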

Software use

Nonparametric survival estimates and log-rank tests are easily performed in R via the survival package. The survfit function fits Kaplan–Meier curves from a time-to-event variable and an event indicator, while the survdiff function calculates log-rank tests (and variations such as stratified or weighted log-rank). Many standard analyses can therefore be completed with just a few lines of code, shown below:

library(survival)

# Fit Kaplan–Meier curves for two groups
km_fit <- survfit(Surv(time, status) ~ group, data = mydata)

# Summarize survival estimates and times
summary(km_fit)

# Perform a log-rank test to compare groups
lr_test <- survdiff(Surv(time, status) ~ group, data = mydata)
lr_test

Additional utilities are available for generating publication-ready tables and plots. The gtsummary package, for instance, provides a tbl_survfit() function that creates neatly formatted tables of survival estimates by group or stratum, including confidence intervals and median times. The ggsurvfit package extends ggplot2 to produce enhanced Kaplan–Meier graphs with at-risk tables, confidence interval shading, and optional log-rank \(p\)-values annotated on the plot:

library(gtsummary)
library(ggsurvfit)

# Summarize survival estimates in a neat table
tbl <- tbl_survfit(km_fit, times = c(12, 24, 48, 96))
tbl

# Create a KM plot with confidence intervals and at-risk tables
ggsurvfit(km_fit) +
  add_confidence_interval() +
  add_risktable()

These higher-level functions streamline the presentation of nonparametric survival analyses, ensuring that both the numeric results and the visual displays are clear and publication-ready.

Conclusion

Discrete hazard constructs offer a flexible pathway to nonparametric survival estimation and testing. By updating risk sets at each event time, they integrate censoring into hazard estimators, leading to the Kaplan–Meier survival curve. For group comparisons, the log-rank test accumulates deviations in observed vs. expected failures across time, capturing hazard differences in a simple chi-square framework. Weighted extensions and max-combo designs handle a variety of hazard patterns and timing effects, while stratification addresses possible confounders. Underlying it all, martingale theory justifies the asymptotic properties of the estimators and tests, ensuring their validity and robustness in large samples.

R Code

Show the code
###############################################################################
# Chapter 3 R Code
#
# This script reproduces major numerical results in Chapter 3, including
#  1. The Nelson–Aalen estimator
#  2. Kaplan–Meier (KM) analysis
#  3. Log-rank tests for the rat study and the GBC study
#  4. Additional code for generating selected plots used in the chapter
###############################################################################

# =============================================================================
# Setup
# =============================================================================
library(survival)   # Core survival analysis (Surv, survfit, survdiff)
library(tidyverse) # Data manipulation and ggplot2-based graphics
library(patchwork) # Combine multiple ggplot2 plots
library(broom)     # Tidy extraction of model outputs
library(gtsummary) # Formatted survival summaries and tables
library(ggsurvfit) # ggplot2-based KM plots with risk tables
library(ggsci)     # Scientific journal color palettes

# =============================================================================
# (A) Nelson–Aalen Estimator for the Rat Study
# =============================================================================
# The 'rats' dataset records times to tumor or censoring in rats receiving a drug
# vs control. Each row corresponds to one rat, with variables:
#   litter:  Litter ID, from 1 to 100 (3 rats per litter)
#   rx:      Treatment (1 = drug, 0 = control)
#   time:    Time to tumor or last follow-up
#   status:  Event status (1 = tumor, 0 = censored)
#   sex:     'male' or 'female'
#
# Data source:
#   N. Mantel, N. R. Bohidar, and J. L. Ciminera.
#   "Mantel-Haenszel analyses of litter-matched time-to-response data, with
#    modifications for recovery of interlitter information."
#   Cancer Research, 37:3863-3868, 1977.

# -----------------------------------------------------------------------------
# 1. Read and inspect the dataset
# -----------------------------------------------------------------------------
rats <- read.table("Data//Rat Tumorigenicity Study//rats.txt", header = TRUE)
head(rats)

# -----------------------------------------------------------------------------
# 2. Subset to the treatment group (rx == 1)
# -----------------------------------------------------------------------------
rats.rx <- rats[rats$rx == 1, ]

time   <- rats.rx$time
status <- rats.rx$status

# -----------------------------------------------------------------------------
# 2a. A small illustrative subset for plotting follow-up patterns
# -----------------------------------------------------------------------------
rats_sub <- rats.rx[order(time), ] |>
  as_tibble() |>
  filter(row_number() <= 12)

# Follow-up as a single horizontal timeline with event/censor marks
fig_rats_sub_line <- rats_sub |>
  ggplot(aes(x = time, y = 1)) +
  geom_segment(
    x = 0, xend = 70, y = 1, yend = 1,
    arrow = arrow(length = unit(0.2, "cm"), ends = "last", type = "open"),
    linewidth = 0.5
  ) +
  # Use different point shapes for censoring vs event
  geom_point(aes(shape = factor(status)), size = 3, fill = "white") +
  scale_shape_manual(values = c(23, 19), labels = c("Censoring", "Tumor development")) +
  scale_x_continuous(
    limits = c(0, max(rats_sub$time)),
    breaks = c(0, unique(rats_sub$time)),
    expand = expansion(c(0, 0.05))
  ) +
  labs(x = "Time (days)", y = "All") +
  theme_classic() +
  theme(
    axis.line     = element_blank(),
    axis.ticks.y  = element_blank(),
    axis.ticks.x  = element_blank(),
    axis.text.y   = element_blank(),
    axis.title.y  = element_text(angle = 0, vjust = 0.5),
    axis.title.x  = element_text(margin = margin(t = 15)),
    legend.position = "none"
  )

# Swim-lane view of the same subset
fig_rats_sub_swim <- rats_sub |>
  mutate(id = row_number()) |>
  ggplot(aes(x = time, y = reorder(id, time))) +
  geom_linerange(aes(xmin = 0, xmax = time)) +
  geom_point(aes(shape = factor(status)), size = 3, fill = "white") +
  geom_vline(xintercept = 0, linewidth = 1) +
  theme_minimal() +
  scale_y_discrete(name = "Rats") +
  scale_x_continuous(
    name = "Time (days)",
    breaks = seq(0, 100, by = 20),
    expand = expansion(c(0, 0.05))
  ) +
  scale_shape_manual(values = c(23, 19), labels = c("Censoring", "Tumor development")) +
  theme(
    legend.position   = "top",
    legend.title      = element_blank(),
    axis.text.y       = element_blank(),
    axis.title.x      = element_blank(),
    axis.ticks.y      = element_blank(),
    panel.grid.major.y= element_blank(),
    legend.text       = element_text(size = 11)
  )

# Stack the two plots (legend is shown in the swim plot)
fig_rats_followup <- fig_rats_sub_swim / fig_rats_sub_line +
  plot_layout(heights = c(0.8, 0.2))

fig_rats_followup

# -----------------------------------------------------------------------------
# 3. Extract unique sorted event times (status == 1) within the treatment group
# -----------------------------------------------------------------------------
ts <- unique(sort(time[status == 1]))
m  <- length(ts)
ts

# Number of failures at each event time (named vector)
ds <- table(time[status == 1])

# Prepare vectors for risk set sizes, hazard increments, and cumulative hazard
ns <- rep(NA_real_, m)  # number at risk
dL <- rep(NA_real_, m)  # hazard increment dLambda
L  <- rep(NA_real_, m)  # cumulative hazard Lambda

# -----------------------------------------------------------------------------
# 4. Compute Nelson–Aalen estimates
# -----------------------------------------------------------------------------
for (j in seq_len(m)) {
  # Risk set size at ts[j]: subjects still under observation at ts[j]
  ns[j] <- sum(time >= ts[j])

  # Nelson–Aalen increment: dLambda = d_j / n_j
  dL[j] <- ds[j] / ns[j]

  # Cumulative hazard up to ts[j]
  L[j] <- sum(dL[1:j])
}

# A compact display table of intermediate quantities
results_NA <- cbind(ts, ds, ns, dL, L)
round(results_NA, 3)

# -----------------------------------------------------------------------------
# 5a. Plot cumulative hazard using base R step function
# -----------------------------------------------------------------------------
par(mfrow = c(1, 1))
plot(
  stepfun(ts, c(0, L)),
  do.points  = FALSE,
  ylim       = c(0, 0.4),
  xlim       = c(0, 120),
  lwd        = 2,
  frame.plot = FALSE,
  xlab       = "Time (days)",
  ylab       = "Cumulative hazard function",
  cex.lab    = 1.5,
  cex.axis   = 1.5,
  main       = ""
)

# -----------------------------------------------------------------------------
# 5b. Plot cumulative hazard using ggplot2
# -----------------------------------------------------------------------------
df_NA <- data.frame(
  time   = c(0, ts),
  cumhaz = c(0, L)
) |>
  add_row(time = 115, cumhaz = tail(L, 1))

ggplot(df_NA, aes(x = time, y = cumhaz)) +
  geom_step(linewidth = 0.8) +
  labs(x = "Time (days)", y = "Cumulative hazard function") +
  theme_minimal()

# =============================================================================
# (B) Kaplan–Meier Estimates for the Rat Study
# =============================================================================
# The hazard increments dL = d_j / n_j computed above imply the KM recursion:
#   S(t_j) = S(t_{j-1}) * (1 - d_j / n_j)
# Greenwood's formula provides an approximate SE.

# -----------------------------------------------------------------------------
# 1. Define incremental survival and Greenwood components
# -----------------------------------------------------------------------------
csurv     <- 1 - dL                         # 1 - d_j/n_j
var.csurv <- ds / (ns * (ns - ds))          # d_j / [n_j*(n_j - d_j)]

KMsurv <- rep(NA_real_, m)
se     <- rep(NA_real_, m)

# -----------------------------------------------------------------------------
# 2. Compute KM survival and Greenwood SE
# -----------------------------------------------------------------------------
for (j in seq_len(m)) {
  KMsurv[j] <- prod(csurv[1:j])
  se[j]     <- KMsurv[j] * sqrt(sum(var.csurv[1:j]))
}

results_KM <- cbind(ts, ds, ns, csurv, KMsurv, se)
round(results_KM, 3)

# -----------------------------------------------------------------------------
# 3. Compare with survfit() from survival
# -----------------------------------------------------------------------------
obj_km_1 <- survfit(Surv(time, status) ~ 1, data = rats.rx, conf.type = "log-log")
summary(obj_km_1)

tidy_obj <- broom::tidy(obj_km_1)
tidy_obj

# Plot KM curve with base R (and the default CI style from survfit)
plot(
  obj_km_1,
  ylim       = c(0, 1),
  xlim       = c(0, 100),
  lwd        = 2,
  frame.plot = FALSE,
  xlab       = "Time (days)",
  ylab       = "Tumor-free probabilities",
  cex.axis   = 0.8
)

legend(
  1, 0.2,
  c("Kaplan-Meier curve", "95% Confidence limits"),
  lty = 1:2, lwd = 2
)

# =============================================================================
# (C) Summaries at Specific Times (KM) and Tidy Tables
# =============================================================================
# gtsummary::tbl_survfit() provides formatted summaries at specified time points.

obj_km_1_simple <- survfit(Surv(time, status) ~ 1, data = rats.rx)

# Survival estimates at specified times
tbl_time <- tbl_survfit(
  x            = obj_km_1_simple,
  times        = seq(40, 100, by = 20),
  label_header = "{time} days"
)
tbl_time

# Selected quantiles (e.g., 90% and 80%)
tbl_perc <- tbl_survfit(
  x            = obj_km_1_simple,
  label_header = "{100 * prob}% quantile",
  probs        = c(0.1, 0.2)
)
tbl_perc

# -----------------------------------------------------------------------------
# Example KM plot via ggsurvfit (single group)
# -----------------------------------------------------------------------------
ggsurvfit(obj_km_1_simple) +
  add_confidence_interval() +
  add_risktable() +
  scale_x_continuous(breaks = seq(0, 100, by = 20)) +
  ylim(0, 1) +
  labs(x = "Time (days)", y = "Tumor-free probabilities") +
  theme_minimal()

# =============================================================================
# (D) Log-Rank Test for Rat Study (Treatment vs Control)
# =============================================================================
head(rats)

# Stratified log-rank test by sex (rho = 0 gives the standard log-rank test)
logrank_obj <- survdiff(Surv(time, status) ~ rx + strata(sex), data = rats, rho = 0)
logrank_obj

# NOTE: compute the p-value from the chi-square statistic (older versions of
# survdiff() do not return a p-value field directly)
logrank_p <- 1 - pchisq(logrank_obj$chisq, df = length(logrank_obj$n) - 1)
logrank_p

# -----------------------------------------------------------------------------
# 1. Two-group KM summaries using gtsummary
# -----------------------------------------------------------------------------
obj_km_2 <- survfit(Surv(time, status) ~ rx, data = rats)

tbl_surv <- tbl_survfit(
  x            = obj_km_2,
  times        = seq(40, 100, by = 20),
  label_header = "{time} days",
  label        = list(rx ~ "Treatment")
)
tbl_surv

# -----------------------------------------------------------------------------
# 2. Two-group KM plot with ggsurvfit (includes log-rank p-value annotation)
# -----------------------------------------------------------------------------
obj_km_2b <- survfit2(Surv(time, status) ~ rx, data = rats)

ggsurvfit(obj_km_2b, linetype_aes = TRUE, linewidth = 1) +
  add_pvalue(caption = "Log-rank {p.value}") +
  add_risktable(
    risktable_stats = "n.risk",
    theme = list(
      theme_risktable_default(),
      scale_y_discrete(labels = c("Drug", "Control"))
    )
  ) +
  labs(x = "Time (days)", y = "Tumor-free probabilities") +
  scale_linetype_manual(values = c(2, 1), labels = c("Control", "Drug")) +
  scale_color_discrete(labels = c("Control", "Drug")) +
  scale_x_continuous(breaks = seq(0, 100, by = 20)) +
  scale_y_continuous(
    limits = c(0, 1),
    breaks = seq(0, 1, by = 0.2),
    expand = c(0, 0)
  ) +
  theme_classic() +
  theme(
    legend.position    = "top",
    legend.key.width   = unit(1, "cm"),
    panel.grid.major.y = element_line(color = "gray90"),
    legend.text        = element_text(size = 10),
    plot.caption       = element_text(size = 10)
  )

# Optional export (commented to avoid side effects)
# ggsave("images//rats_KM_gg2.png", width = 7, height = 4.5)
# ggsave("images//rats_KM_gg2.eps", device = cairo_ps, width = 7, height = 4.5)

# =============================================================================
# (E) German Breast Cancer (GBC) Study: KM and Log-Rank
# =============================================================================
# The full dataset contains multiple event types; here we focus on first-event times.

gbc <- read.table("Data//German Breast Cancer Study//gbc.txt", header = TRUE)

# Sort by subject id, then time
o <- order(gbc$id, gbc$time)
gbc <- gbc[o, ]

# Keep only the first record per subject (first event / first censoring)
data.CE <- gbc[!duplicated(gbc$id), ]

# Collapse status into 0/1 (event vs censoring)
data.CE$status <- ifelse(data.CE$status > 0, 1, 0)

# -----------------------------------------------------------------------------
# 1. Base R KM curves by hormone, overall and stratified by menopausal status
# -----------------------------------------------------------------------------
par(mfrow = c(1, 3))

# Overall
km_overall <- survfit(Surv(time, status) ~ hormone, data = data.CE)
plot(
  km_overall,
  xlim     = c(0, 80),
  lwd      = 2,
  frame    = FALSE,
  lty      = c(2, 1),
  xlab     = "Time (months)",
  ylab     = "Relapse-free survival probabilities",
  main     = "Overall",
  cex.lab  = 1.5,
  cex.axis = 1.5,
  cex.main = 1.5
)
legend(1, 0.2, lty = 2:1, c("No Hormone", "Hormone"), lwd = 2, cex = 1.5)

# Pre-menopausal (meno == 1)
km_pre <- survfit(Surv(time, status) ~ hormone, data = data.CE[data.CE$meno == 1, ])
plot(
  km_pre,
  xlim     = c(0, 80),
  lwd      = 2,
  frame    = FALSE,
  lty      = c(2, 1),
  xlab     = "Time (months)",
  ylab     = "Relapse-free survival probabilities",
  main     = "Pre-Menopausal",
  cex.lab  = 1.5,
  cex.axis = 1.5,
  cex.main = 1.5
)
legend(1, 0.2, lty = 2:1, c("No Hormone", "Hormone"), lwd = 2, cex = 1.5)

# Post-menopausal (meno == 2)
km_post <- survfit(Surv(time, status) ~ hormone, data = data.CE[data.CE$meno == 2, ])
plot(
  km_post,
  xlim     = c(0, 80),
  lwd      = 2,
  frame    = FALSE,
  lty      = c(2, 1),
  xlab     = "Time (months)",
  ylab     = "Relapse-free survival probabilities",
  main     = "Post-Menopausal",
  cex.lab  = 1.5,
  cex.axis = 1.5,
  cex.main = 1.5
)
legend(1, 0.2, lty = 2:1, c("No Hormone", "Hormone"), lwd = 2, cex = 1.5)

# -----------------------------------------------------------------------------
# 2. ggsurvfit version: three panels (overall, pre-, post-menopausal)
# -----------------------------------------------------------------------------
pre_obj2  <- survfit2(Surv(time, status) ~ hormone, data = data.CE |> filter(meno == 1))
post_obj2 <- survfit2(Surv(time, status) ~ hormone, data = data.CE |> filter(meno == 2))
all_obj2  <- survfit2(Surv(time, status) ~ hormone, data = data.CE)

plot_gbc_KM <- function(obj, title = NULL) {
  p <- ggsurvfit(obj, linetype_aes = TRUE, linewidth = 1) +
    add_risktable(
      risktable_stats = "n.risk",
      theme = list(
        theme_risktable_default(),
        scale_y_discrete(labels = c("Hormone", "No Hormone"))
      )
    ) +
    scale_y_continuous(
      "Relapse-free survival probabilities",
      limits = c(0, 1),
      breaks = seq(0, 1, by = 0.2),
      expand = expansion(c(0, 0.005))
    ) +
    scale_x_continuous("Time (months)", breaks = seq(0, 84, by = 12)) +
    scale_color_jama(labels = c("No Hormone", "Hormone")) +
    scale_linetype_manual(values = c(2, 1), labels = c("No Hormone", "Hormone")) +
    theme_classic() +
    theme(
      plot.margin        = margin(0, 0, 0, 0),
      legend.position    = "top",
      legend.key.width   = unit(1, "cm"),
      panel.grid.major.y = element_line(color = "gray80"),
      legend.text        = element_text(size = 10),
      plot.caption       = element_text(size = 10),
      plot.title         = element_text(size = 12)
    )

  if (!is.null(title)) p <- p + ggtitle(title)
  p
}

pre_fig  <- plot_gbc_KM(pre_obj2,  "Pre-Menopausal")
post_fig <- plot_gbc_KM(post_obj2, "Post-Menopausal")
all_fig  <- plot_gbc_KM(all_obj2,  "Overall")

fig_meno <- wrap_plots(ggsurvfit_build(pre_fig), ggsurvfit_build(post_fig), ncol = 2)

wrap_plots(ggsurvfit_build(all_fig), plot_spacer(), fig_meno, ncol = 1) +
  plot_layout(heights = c(1, 0.02, 1))

# -----------------------------------------------------------------------------
# 3. Log-rank tests: stratified by menopausal status and unstratified
# -----------------------------------------------------------------------------
survdiff(Surv(time, status) ~ hormone + strata(meno), data = data.CE)
survdiff(Surv(time, status) ~ hormone, data = data.CE)