Statistical Methods for Composite Endpoints: Win Ratio and Beyond

Chapter 1. Introduction

Lu Mao

lmao@biostat.wisc.edu

Department of Biostatistics & Medical Informatics

University of Wisconsin-Madison

May 31, 2025

Outline

Examples and regulatory guidelines
Traditional methods
- Time to first event
- Weighted total events (Wcompo package)
Win ratio and hierarchical endpoints
- The estimand issue

\[\newcommand{\d}{{\rm d}}\] \[\newcommand{\T}{{\rm T}}\] \[\newcommand{\dd}{{\rm d}}\] \[\newcommand{\cc}{{\rm c}}\] \[\newcommand{\pr}{{\rm pr}}\] \[\newcommand{\var}{{\rm var}}\] \[\newcommand{\se}{{\rm se}}\] \[\newcommand{\indep}{\perp \!\!\! \perp}\] \[\newcommand{\Pn}{n^{-1}\sum_{i=1}^n}\] \[ \newcommand\mymathop[1]{\mathop{\operatorname{#1}}} \] \[ \newcommand{\Ut}{{n \choose 2}^{-1}\sum_{i<j}\sum} \] \[ \def\a{{(a)}} \def\b{{(1-a)}} \def\t{{(1)}} \def\c{{(0)}} \def\d{{\rm d}} \def\T{{\rm T}} \]

Example and Guidelines

Motivating Example: Colon Cancer

Landmark colon cancer trial
- Population: 619 patients with stage C disease (Moertel et al., 1990)
- Arms: Levamisole + fluorouracil ($n=304$) vs control ($n=315$)
- Endpoint: relapse-free survival (log-rank test p<0.001)
  - Death = Relapse
  - 258 deaths (89%) after relapse ignored

Motivating Example: HF-ACTION

A cardiovascular trial (HF-ACTION)
- Subpopulation: 426 heart failure patients (O’Connor et al., 2009)
- Arms: Exercise training + usual care ($n=205$) vs usual care ($n=221$)
- Endpoint: hospitalization-free survival (log-rank test p=0.100)
  - Death = Hospitalization
  - 82 (88%) deaths + 707 (69%) recurrent hospitalizations ignored

Composite Endpoints

Traditional composite endpoint (TCE)
- Time to first event
  - Relapse/Progression-free survival
  - First major adverse cardiac event (MACE): death, heart failure, myocardial infarction, stroke (event-free survival)
- Limitations
  - Lack of clinical priority
  - Statistical inefficiency (waste of data)

Hierarchical composite endpoint (HCE)
- Example: Death > nonfatal MACE > six-minute walk test (6MWT)/NYHA class

Why Composite

Advantages
- More events $\to$ higher power $\to$ smaller sample size/lower costs
- No need for multiplicity adjustment
- A unified measure of treatment effect
ICH-E9 “Statistical Principles for Clinical Trials” (ICH, 1998)
- “There should generally be only one primary variable”
- “If a single primary variable cannot be selected …, another useful strategy is to integrate or combine the multiple measurements into a single or composite variable …”
- “[composite endpoint] addresses the multiplicity problem without adjustment to the type I error”

Regulatory Guidelines: FDA

Main points
- Typically first event but can do total events
- Component-wise analysis important for interpretation
FDA Guidance for Industry: “Multiple Endpoints in Clinical Trials” (FDA, 2022)
- “Composite endpoints are often assessed as the time to first occurrence of any one of the components, …, it also may be possible to analyze total endpoint events”
- “The treatment effect on the composite rate can be interpreted as characterizing the overall clinical effect when the individual events all have reasonably similar clinical importance”
- “…analyses of the components of the composite endpoint are important and can influence interpretation of the overall study results”

Regulatory Guidelines: Europe

Main points
- Combine events of similar importance
- Include mortality as a component
European Network for Health Technology Assessment “Endpoints used for Relative Effectiveness Assessment – Composite Endpoints” (EUnetHTA, 2015)
- “All components of a composite endpoint should be separately defined as secondary endpoints and reported with the results of the primary analysis”
- “Components of similar clinical importance and sensitivity to intervention should preferably be combined”
- “If adequate, mortality should however be included if it is likely to have a censoring effect on the observation of other components”

A Tricky Example

The EMPA-REG Trial (NCT01131676)
- Population: 7,020 patients with type 2 diabetes (Zinman et al., 2015)
- Treatment arms: Empagliflozin vs control
- Endpoint: Time to first CV death, nonfatal MI, nonfatal stroke

Traditional Composites

Data and Notation

Full data $\mathcal H^*(\infty)$
- $D$: survival time; $N^*_D(t)=I(D\leq t)$
- $N^*_1(t), \ldots, N^*_K(t)$: counting processes for $K$ nonfatal event types
- Cumulative data: $\mathcal H^*(t)=\{N^*_D(u), N^*_1(u), \ldots, N^*_K(u):0\leq u\leq t\}$

Observed (censored) data $\{\mathcal H^*(X), X\}$
- $\mathcal H^*(X)$: outcomes up to time $X$
- $X=D\wedge C$: length of follow-up ($a\wedge b = \min(a, b)$)
- $C$: independent censoring time
- Goal: estimate/test features of $\mathcal H^*(\infty)$ using $\{\mathcal H^*(X), X\}$
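
As a minimal sketch of the notation above, the observed data for one patient can be built from the full data by truncating at $X = D\wedge C$. The numbers and variable names (`D`, `C`, `hosp_times`) are hypothetical and chosen only for illustration.

```r
# Hypothetical full data for one patient (times in years)
D <- 3.1                        # survival time
C <- 2.5                        # independent censoring time
hosp_times <- c(0.5, 1.2, 2.8)  # nonfatal event times in the full data

# Length of follow-up: X = min(D, C)
X <- min(D, C)

# Observed data: only events occurring by time X are seen
observed_hosp  <- hosp_times[hosp_times <= X]
death_observed <- D <= C        # FALSE here: the patient is censored before death

X
observed_hosp
death_observed
```

Here the hospitalization at 2.8 years and the death at 3.1 years both fall after censoring, so neither appears in the observed data.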

First Event

Univariate endpoint
- $N^*_{\rm TFE}(t) = I\{N^*_D(t)+\sum_{k=1}^KN^*_k(t)\geq 1\}$
  - $I(\cdot)$: 0-1 indicator
- $\tilde T$: time to first event
  - Kaplan–Meier curve, log-rank test, Cox model

Component-wise weighting
- Upweight death over nonfatal events
  - E.g., Death = 2 $\times$ hospitalization

Total Events

Weighted composite event process
- $N^*_{\rm R}(t)=w_DN^*_D(t)+\sum_{k=1}^Kw_kN^*_k(t)$
  - $w_D, w_1, \ldots, w_K$: weights to death and nonfatal events
- Proportional means model (Mao & Lin, 2016) \[ E\{N^*_{\rm R}(t)\mid Z\} = \exp(\beta^\T Z)\mu_0(t) \]
  - $\exp(\beta)$: mean ratio of weighted total events comparing treatment $(Z=1)$ vs control $(Z=0)$
- R-package: Wcompo
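
To make the weighted composite process concrete, the sketch below evaluates $N^*_{\rm R}(t)$ by hand for a single hypothetical patient, with death weighted twice as much as a hospitalization. The records, the helper `N_R()`, and the status coding (1 = death, 2 = hospitalization) are illustrative assumptions; this is not how `Wcompo` computes it internally.

```r
# Hypothetical event records for one patient
# status: 1 = death, 2 = hospitalization (assumed coding for this sketch)
rec <- data.frame(time   = c(0.5, 1.2, 2.0, 3.1),
                  status = c(2,   2,   2,   1))

# Weights indexed by status: w[1] = w_D = 2 (death), w[2] = 1 (hospitalization)
w <- c(2, 1)

# N_R(t): sum of the weights of all events occurring by time t
N_R <- function(t) sum(w[rec$status[rec$time <= t]])

# Evaluate the process at t = 1, 2, 3, 4 years
sapply(c(1, 2, 3, 4), N_R)
#> [1] 1 3 3 5
```

The process jumps by 1 at each hospitalization and by 2 at death, so the final value 5 counts three hospitalizations plus one doubly weighted death.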

Software: `Wcompo::CompoML()`

Basic syntax
- id: unique patient identifier; time: event times; status: event types (0: censoring; 1: death; 2, ..., K: nonfatal event types); Z: covariate matrix
- w: $K$-vector of weights for event types 1, ..., K; default is unweighted

library(Wcompo)
obj <- CompoML(id, time, status, Z, w = c(2, 1))

Output: a list of class CompoML
- obj$beta: $\hat\beta$; obj$var: $\hat\var(\hat\beta)$
- plot(obj, z): plot mean function $\exp(\hat\beta^{\rm T} z)\hat\mu_0(t)$

HF-ACTION: An Example

High-risk subgroup (n=426)
- Baseline cardiopulmonary exercise (CPX) test $\leq$ 9 min

Table 1: Summary statistics for a high-risk subgroup (n=426) in HF-ACTION trial.

		Usual care (N = 221)	Exercise training (N = 205)
Age	≤ 60 years	122 (55.2%)	128 (62.4%)
	> 60 years	99 (44.8%)	77 (37.6%)
Follow-up	(months)	28.6 (18.4, 39.3)	27.6 (19, 40.2)
Death		57 (25.8%)	36 (17.6%)
Hospitalizations	0	51 (23.1%)	60 (29.3%)
	1-3	114 (51.6%)	102 (49.8%)
	4-10	49 (22.2%)	39 (19%)
	>10	7 (3.2%)	4 (2%)

HF-ACTION: Preparation

Load packages and data

library(survival) # for standard survival analysis
library(Wcompo) # for weighted total events
library(rmt) # for hfaction data
library(tidyverse) # for data wrangling

# Load data
data(hfaction)
head(hfaction) # trt_ab=1: training; 0: usual care
#>        patid       time status trt_ab age60
#> 1 HFACT00001 0.60506502      1      0     1
#> 2 HFACT00001 1.04859685      0      0     1
#> 3 HFACT00002 0.06297057      1      0     1
#> 4 HFACT00002 0.35865845      1      0     1
#> 5 HFACT00002 0.39698836      1      0     1
#> 6 HFACT00002 3.83299110      0      0     1

HF-ACTION: Data

Data processing

# For the weighted total analysis by CompoML():
# recode status so that 1 = death and 2 = hospitalization (0 = censoring is unchanged)
hfaction <- hfaction |> 
  mutate(
    status = case_when(
      status == 1 ~ 2,
      status == 2 ~ 1,
      status == 0 ~ 0)
  )

# TFE: take the first event per patient id
hfaction_TFE <- hfaction |> 
  arrange(patid, time) |> # sort by patid and time
  group_by(patid) |> 
  slice_head() |> # take first row
  ungroup()

HF-ACTION: Mortality

Cox model for death
- HR: $\exp(-0.3973) = 67.2\%$ ($32.8\%$ reduction in risk)
- $P$-value: 0.0621 (borderline significant)

## Get mortality data
hfaction_D <- hfaction |> 
  filter(status != 2) # remove hospitalization records

## Cox model for death against trt_ab
obj_D <- coxph(Surv(time, status) ~ trt_ab, data = hfaction_D)
summary(obj_D)
#> n= 426, number of events= 93 
#>           coef exp(coef) se(coef)      z      p
#> trt_ab -0.3973    0.6721   0.2129 -1.866 0.0621

HF-ACTION: TFE

Cox model for hospitalization-free survival
- HR: $\exp(-0.1770) = 83.8\%$ ($16.2\%$ reduction in risk)
- $P$-value: 0.111 (less significant than death)

# Cox model for TFE against trt_ab
obj_TFE <- coxph(Surv(time, status > 0) ~ trt_ab, data = hfaction_TFE)
summary(obj_TFE)
#>   n= 426, number of events= 326 
#>           coef exp(coef) se(coef)      z Pr(>|z|)
#> trt_ab -0.1770    0.8378   0.1112 -1.592    0.111

HF-ACTION: Death vs TFE

Hospitalizations dilute effect on death …
- An EMPA-REG-like situation

HF-ACTION: Weighted Total

Proportional means model (death = $2\times$ hosp)

- MR: $\exp(-0.15398) = 85.7\%$ ($14.3\%$ reduction in total number of composite events)
- $P$-value: 0.170 (less significant than TFE)
- Limitation: Survival $\uparrow$ $\to$ cumulative total $\uparrow$ $\to$ attenuated effect

# Total events (proportional mean) -------------------------------
obj_ML <- CompoML(hfaction$patid, hfaction$time, hfaction$status, 
                  hfaction$trt_ab, w = c(2, 1))
obj_ML
#>         Event 1 (Death) Event 2
#> Weight               2       1
#>         Estimate      se z.value p.value
#> trt_ab -0.15398  0.11215 -1.3729  0.1698
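
The three analyses can be put side by side using only the coefficients and standard errors printed above. The sketch below assumes the usual Wald normal approximation to recover ratios, 95% confidence intervals, and two-sided $P$-values. The data frame `est` and its column names are introduced here for illustration, and results may differ slightly from the package output because the inputs are rounded.

```r
# Log-ratio estimates and standard errors copied from the outputs above
est <- data.frame(
  analysis = c("Death (Cox)", "Time to first event (Cox)",
               "Weighted total (proportional means)"),
  coef = c(-0.3973, -0.1770, -0.15398),
  se   = c(0.2129,  0.1112,  0.11215)
)

# Wald-type summaries on the ratio scale
z_crit    <- qnorm(0.975)
est$ratio <- exp(est$coef)                      # hazard ratio or mean ratio
est$lower <- exp(est$coef - z_crit * est$se)    # lower 95% limit
est$upper <- exp(est$coef + z_crit * est$se)    # upper 95% limit
est$p     <- 2 * pnorm(-abs(est$coef / est$se)) # two-sided P-value

print(est, digits = 3)
```

Laying the estimates out this way shows the point estimate moving toward 1 as nonfatal events are added to the endpoint.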

HF-ACTION: Cumulative Means

Model-based mean functions

# Model-based mean functions: z = 0 (usual care) and z = 1 (training)
plot(obj_ML, 0, ylim = c(0, 5), xlab = "Time (years)", col = "red", lwd = 2)
plot(obj_ML, 1, add = TRUE, col = "blue", lwd = 2)
legend(0, 5, legend = c("Usual care", "Training"), col = c("red", "blue"), lwd = 2)

Lessons Learned

Adding nonfatal events $\neq$ higher power
- Component may be less discriminating (Freemantle et al., 2003)
- Length of exposure (death as competing risk) (Schmidli et al., 2023)

Solutions
- Hierarchically prioritize death
  - Evaluate nonfatal components only on survivors
- Quantitative weighting $\to$ adjust for survival time
  - Loss rate = cumulative total / length of exposure (Ch 3)

Hierarchical Composites

Win Ratio: Basics

A common approach to HCE
- Proposed and popularized by Pocock et al. (2012)
- Treatment vs control: generalized pairwise comparisons
- Win-loss: sequential comparison on components
  - Longer survival > fewer/later nonfatal MACE > better 6MWT/NYHA score
- Effect size: WR $=$ wins / losses

Alternative metrics
- Proportion in favor (net benefit): PIF $=$ wins $-$ losses (Buyse, 2010)
- Win odds: WO $=$ (wins $+$ $2^{-1}$ties) / (losses $+$ $2^{-1}$ties) (Brunner et al., 2021; Dong et al., 2020)
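
The pairwise comparison and the three metrics can be sketched on a toy example. The data below are hypothetical and uncensored, with two tiers only: longer survival wins, and ties on survival are broken by fewer hospitalizations. The objects `trt`, `ctl`, and `result` are illustrative names, not part of any package.

```r
# Hypothetical uncensored outcomes: survival time (years), number of hospitalizations
trt <- data.frame(death = c(5, 3, 4), hosp = c(1, 0, 2))
ctl <- data.frame(death = c(4, 3, 2), hosp = c(2, 1, 0))

# Tier 1: survival; +1 = treatment wins, -1 = loses, 0 = tie
by_surv <- sign(outer(trt$death, ctl$death, "-"))
# Tier 2: hospitalizations (fewer is better), used only for survival ties
by_hosp <- sign(outer(trt$hosp, ctl$hosp, function(a, b) b - a))

# Hierarchical rule: decide on survival first, then on hospitalizations
result <- ifelse(by_surv != 0, by_surv, by_hosp)

wins   <- mean(result == 1)
losses <- mean(result == -1)
ties   <- mean(result == 0)

WR  <- wins / losses                            # win ratio
PIF <- wins - losses                            # proportion in favor (net benefit)
WO  <- (wins + ties / 2) / (losses + ties / 2)  # win odds

c(wins = wins, losses = losses, ties = ties, WR = WR, PIF = PIF, WO = WO)
```

Of the 9 treatment-control pairs, 7 are wins, 1 is a loss, and 1 is a tie, giving WR $= 7$, PIF $= 6/9$, and WO $= 7.5/1.5 = 5$. The win odds is pulled toward 1 relative to the win ratio because ties are split evenly between the two arms.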

Win Ratio: Gaining Popularity

More trials are using it…

An Important Caveat

WR’s estimand depends on censoring …
- Luo et al. (2015), Bebu & Lachin (2016), Oakes (2016), Mao (2019), Dong et al. (2020a), Li et al. (2024), etc.

What is an estimand?
- Population-level quantity to be estimated
  - Population-mean difference, (true) risk ratio, etc.
- Specifies how treatment effect is measured
- ICH E9 (R1) addendum: estimand construction one of the “central questions for drug development and licensing” (ICH, 2020)

Win-Loss Changes with Time

Illustration
- Win-loss status and the deciding component change with time
- Longer follow-up …
  - Parameters: win/loss proportions $\uparrow$ (WR uncertain); tie proportion $\downarrow$
  - Component contributions: prioritized $\uparrow$; deprioritized $\downarrow$
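
A small simulation can illustrate how these proportions move with follow-up. The sketch below assumes independent exponential times to death and to first hospitalization, with hypothetical rates, and compares every treatment-control pair over $[0, \tau]$ for several values of $\tau$. The function `win_loss_at()` and all parameter values are illustrative assumptions.

```r
set.seed(2025)
n <- 200

# Hypothetical event times (years); the rates are illustrative only
D_trt <- rexp(n, rate = 0.10); H_trt <- rexp(n, rate = 0.50)  # treatment
D_ctl <- rexp(n, rate = 0.15); H_ctl <- rexp(n, rate = 0.60)  # control

win_loss_at <- function(tau) {
  # Tier 1: survival restricted to [0, tau]; longer survival wins
  d <- sign(outer(pmin(D_trt, tau), pmin(D_ctl, tau), "-"))
  # Tier 2: first hospitalization restricted to [0, tau]; later is better.
  # Used only when both patients are alive at tau (d == 0).
  h <- sign(outer(pmin(H_trt, tau), pmin(H_ctl, tau), "-"))
  res <- ifelse(d != 0, d, h)
  c(tau = tau,
    win = mean(res == 1), loss = mean(res == -1), tie = mean(res == 0),
    by_death = mean(d != 0),          # pairs decided on death
    by_hosp  = mean(d == 0 & h != 0)) # pairs decided on hospitalization
}

tab <- t(sapply(c(0.5, 1, 2, 4), win_loss_at))
round(tab, 3)
```

In this setup the tie proportion can only shrink and the share of pairs decided on death can only grow as $\tau$ increases, so a win ratio computed at one follow-up length need not match one computed at another.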

Trial-Dependent Estimand

Actual estimand
- Average WR mixing shorter-term with longer-term comparisons
- Weight set (haphazardly) by censoring distribution
  - Staggered entry, random withdrawal $\to$ non-scientific

Testing vs estimation
- Testing (qualitative): okay
  - Valid under $H_0$, powerful if treatment consistently outperforms control over time
- Estimation (quantitative): not okay
  - Pre-define restriction time $\to$ use censoring weight for unbiased estimation (Ch 3)
  - Specify a time-constant WR model (Ch 4)

Conclusion

Notes

More on
- Regulatory guidelines for composite endpoints (Mao & Kim, 2021)
- ICH E9 (R1) implementation (Akacha et al., 2017; Ionan et al., 2022; Qu & Lipkovich, 2021; Ratitch et al., 2020)
- Practical guidance (Pocock et al., 2024; Redfors et al., 2020)
- Defining estimand for win ratio (Mao, 2024)
- Generalized pairwise comparisons (Deltuvaite-Thomas et al., 2022; Dong et al., 2022; Péron et al., 2016; Verbeeck et al., 2023)
Cumulative total events
- Based on cumulative incidence/frequency under competing risks (Fine & Gray, 1999; Ghosh & Lin, 2000; Gray, 1988)

Summary

Composite endpoints
- Death + hospitalization/progression/relapse
- Regulatory recommendation
Traditional
- Time to first: death = nonfatal (survival::coxph())
- Weighted total: death = $w_D\times$ nonfatal (Wcompo::CompoML())
Hierarchical
- Win ratio, net benefit, win odds: death > nonfatal
- Estimand issue - ICH E9 (R1)

References

Akacha, M., Bretz, F., Ohlssen, D., Rosenkranz, G., & Schmidli, H. (2017). Estimands and Their Role in Clinical Trials. Statistics in Biopharmaceutical Research, 9(3), 268–271. https://doi.org/10.1080/19466315.2017.1302358

Bebu, I., & Lachin, J. M. (2016). Large sample inference for a win ratio analysis of a composite outcome based on prioritized components. Biostatistics, 17(1), 178–187. https://doi.org/10.1093/biostatistics/kxv032

Brunner, E., Vandemeulebroecke, M., & Mütze, T. (2021). Win odds: An adaptation of the win ratio to include ties. Statistics in Medicine, 40(14), 3367–3384. https://doi.org/10.1002/sim.8967

Buyse, M. (2010). Generalized pairwise comparisons of prioritized outcomes in the two-sample problem. Statistics in Medicine, 29(30), 3245–3257. https://doi.org/10.1002/sim.3923

Deltuvaite-Thomas, V., Verbeeck, J., Burzykowski, T., Buyse, M., Tournigand, C., Molenberghs, G., & Thas, O. (2022). Generalized pairwise comparisons for censored data: An overview. Biometrical Journal, 65(2). https://doi.org/10.1002/bimj.202100354

Dong, G., Hoaglin, D. C., Qiu, J., Matsouaka, R. A., Chang, Y.-W., Wang, J., & Vandemeulebroecke, M. (2020). The Win Ratio: On Interpretation and Handling of Ties. Statistics in Biopharmaceutical Research, 12(1), 99–106. https://doi.org/10.1080/19466315.2019.1575279

Dong, G., Huang, B., Chang, Y.-W., Seifu, Y., Song, J., & Hoaglin, D. C. (2020a). The win ratio: Impact of censoring and follow-up time and use with nonproportional hazards. Pharmaceutical Statistics, 19(3), 168–177. https://doi.org/10.1002/pst.1977

Dong, G., Huang, B., Verbeeck, J., Cui, Y., Song, J., Gamalo-Siebers, M., Wang, D., Hoaglin, D. C., Seifu, Y., Mütze, T., & Kolassa, J. (2022). Win statistics (win ratio, win odds, and net benefit) can complement one another to show the strength of the treatment effect on time-to-event outcomes. Pharmaceutical Statistics, 22(1), 20–33. https://doi.org/10.1002/pst.2251

EUnetHTA. (2015). Endpoints used for Relative Effectiveness Assessment: Composite endpoints. https://www.eunethta.eu/wp-content/uploads/2018/01/Endpoints-used-for-Relative-Effectiveness-Assessment-Composite-endpoints_Amended-JA1-Guideline_Final-Nov-2015_0.pdf

FDA. (2022). Guidance for industry: Multiple endpoints in clinical trials. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/multiple-endpoints-clinical-trials-guidance-industry

Fine, J. P., & Gray, R. J. (1999). A Proportional Hazards Model for the Subdistribution of a Competing Risk. Journal of the American Statistical Association, 94(446), 496–509. https://doi.org/10.1080/01621459.1999.10474144

Freemantle, N., Calvert, M., Wood, J., Eastaugh, J., & Griffin, C. (2003). Composite Outcomes in Randomized Trials. JAMA, 289(19), 2554. https://doi.org/10.1001/jama.289.19.2554

Ghosh, D., & Lin, D. Y. (2000). Nonparametric Analysis of Recurrent Events and Death. Biometrics, 56(2), 554–562. https://doi.org/10.1111/j.0006-341x.2000.00554.x

Gray, R. J. (1988). A class of $K$-sample tests for comparing the cumulative incidence of a competing risk. The Annals of Statistics, 16(3). https://doi.org/10.1214/aos/1176350951

ICH. (1998). Statistical principles for clinical trials. London: European Medicines Evaluation Agency.

ICH. (2020). ICH E9 (R1) addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials, step 5. London: European Medicines Evaluation Agency.

Ionan, A. C., Paterniti, M., Mehrotra, D. V., Scott, J., Ratitch, B., Collins, S., Gomatam, S., Nie, L., Rufibach, K., & Bretz, F. (2022). Clinical and Statistical Perspectives on the ICH E9(R1) Estimand Framework Implementation. Statistics in Biopharmaceutical Research, 15(3), 554–559. https://doi.org/10.1080/19466315.2022.2081601

Li, H., Chen, W.-C., Lu, N., Tang, R., & Zhao, Y. (2024). The elusiveness of the win ratio parameter in the presence of missing data. Therapeutic Innovation & Regulatory Science, 1–2.

Luo, X., Tian, H., Mohanty, S., & Tsai, W. Y. (2015). An Alternative Approach to Confidence Interval Estimation for the Win Ratio Statistic. Biometrics, 71(1), 139–145. https://doi.org/10.1111/biom.12225

Mao, L. (2019). On the Alternative Hypotheses for the Win Ratio. Biometrics, 75(1), 347–351. https://doi.org/10.1111/biom.12954

Mao, L. (2024). Defining estimand for the win ratio: Separate the true effect from censoring. Clinical Trials, 21(5), 584–594.

Mao, L., & Kim, K. (2021). Statistical Models for Composite Endpoints of Death and Nonfatal Events: A Review. Statistics in Biopharmaceutical Research, 13(3), 260–269. https://doi.org/10.1080/19466315.2021.1927824

Mao, L., & Lin, D. Y. (2016). Semiparametric regression for the weighted composite endpoint of recurrent and terminal events. Biostatistics, 17(2), 390–403. https://doi.org/10.1093/biostatistics/kxv050

Moertel, C. G., Fleming, T. R., Macdonald, J. S., Haller, D. G., Laurie, J. A., Goodman, P. J., Ungerleider, J. S., Emerson, W. A., Tormey, D. C., Glick, J. H., Veeder, M. H., & Mailliard, J. A. (1990). Levamisole and Fluorouracil for Adjuvant Therapy of Resected Colon Carcinoma. New England Journal of Medicine, 322(6), 352–358. https://doi.org/10.1056/nejm199002083220602

O’Connor, C. M., Whellan, D. J., Lee, K. L., Keteyian, S. J., Cooper, L. S., Ellis, S. J., Leifer, E. S., Kraus, W. E., Kitzman, D. W., Blumenthal, J. A., Rendall, D. S., Miller, N. H., Fleg, J. L., Schulman, K. A., McKelvie, R. S., Zannad, F., & Piña, I. L., for the HF-ACTION Investigators. (2009). Efficacy and Safety of Exercise Training in Patients With Chronic Heart Failure. JAMA, 301(14), 1439. https://doi.org/10.1001/jama.2009.454

Oakes, D. (2016). On the win-ratio statistic in clinical trials with multiple types of event. Biometrika, 103(3), 742–745. https://doi.org/10.1093/biomet/asw026

Péron, J., Buyse, M., Ozenne, B., Roche, L., & Roy, P. (2016). An extension of generalized pairwise comparisons for prioritized outcomes in the presence of censoring. Statistical Methods in Medical Research, 27(4), 1230–1239. https://doi.org/10.1177/0962280216658320

Pocock, S. J., Ariti, C. A., Collier, T. J., & Wang, D. (2012). The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. European Heart Journal, 33(2), 176–182. https://doi.org/10.1093/eurheartj/ehr352

Pocock, S. J., Gregson, J., Collier, T. J., Ferreira, J. P., & Stone, G. W. (2024). The win ratio in cardiology trials: Lessons learnt, new developments, and wise future use. European Heart Journal, 45(44), 4684–4699.

Qu, Y., & Lipkovich, I. (2021). Implementation of ICH E9 (R1): A Few Points Learned During the COVID-19 Pandemic. Therapeutic Innovation & Regulatory Science, 55(5), 984–988. https://doi.org/10.1007/s43441-021-00297-6

Ratitch, B., Bell, J., Mallinckrodt, C., Bartlett, J. W., Goel, N., Molenberghs, G., O’Kelly, M., Singh, P., & Lipkovich, I. (2020). Choosing Estimands in Clinical Trials: Putting the ICH E9(R1) Into Practice. Therapeutic Innovation & Regulatory Science, 54(2), 324–341. https://doi.org/10.1007/s43441-019-00061-x

Redfors, B., Gregson, J., Crowley, A., McAndrew, T., Ben-Yehuda, O., Stone, G. W., & Pocock, S. J. (2020). The win ratio approach for composite endpoints: practical guidance based on previous experience. European Heart Journal, 41(46), 4391–4399. https://doi.org/10.1093/eurheartj/ehaa665

Schmidli, H., Roger, J. H., & Akacha, M. (2023). Rejoinder to Commentaries on “Estimands for Recurrent Event Endpoints in the Presence of a Terminal Event”. Statistics in Biopharmaceutical Research, 15(2), 255–256. https://doi.org/10.1080/19466315.2023.2166098

Verbeeck, J., De Backer, M., Verwerft, J., Salvaggio, S., Valgimigli, M., Vranckx, P., Buyse, M., & Brunner, E. (2023). Generalized Pairwise Comparisons to Assess Treatment Effects. Journal of the American College of Cardiology, 82(13), 1360–1372. https://doi.org/10.1016/j.jacc.2023.06.047

Zinman, B., Wanner, C., Lachin, J. M., Fitchett, D., Bluhmki, E., Hantel, S., Mattheus, M., Devins, T., Johansen, O. E., Woerle, H. J., Broedl, U. C., & Inzucchi, S. E. (2015). Empagliflozin, Cardiovascular Outcomes, and Mortality in Type 2 Diabetes. New England Journal of Medicine, 373(22), 2117–2128. https://doi.org/10.1056/nejmoa1504720