
Praise From the Experts

“For the nonstatistician, the array of statistical issues in design and analysis of clinical trials can be

overwhelming. Drawing on their years of experience dealing with data analysis plans and the

regulatory environment, the authors of Analysis of Clinical Trials Using SAS: A Practical Guide have

done a great job in organizing the statistical issues one needs to consider both in the design phase and

in the analysis and reporting phase. As the authors demonstrate, SAS provides excellent tools for

designing, analyzing and reporting results of comparative trials, making this book very attractive for

the nonstatistician.

This book will also be very useful for statisticians who wish to learn the most current statistical

methodologies in clinical trials. The authors make use of recent developments in SAS - including

stratification, multiple imputation, mixed models, nonparametrics, and multiple comparisons

procedures - to provide cutting-edge tools that are either difficult to find or unavailable in other

software packages. Statisticians will also appreciate the fact that sufficient technical details are

provided.

Because clinical trials are so highly regulated, some of the most rigorous and highly respected

statistical tools are used in this arena. The methodologies covered in this book have applicability to the

design and analysis of experiments well beyond clinical trials; researchers in all fields who carry out

comparative studies would do well to have it on their bookshelves.”

Peter Westfall, Texas Tech University

“This is a very impressive book. Although it doesn't pretend to cover all types of analyses used in

clinical trials, it does provide unique coverage of five extremely important areas. Each chapter

combines a detailed literature review and theoretical discussion with a useful how-to guide for

practitioners. This will be a valuable book for clinical biostatisticians in the pharmaceutical industry.”

Steve Snapinn, Amgen

“Chapter 1 (‘Analysis of Stratified Data’) will make a very fine and useful addition to the literature on

analyzing clinical trials using SAS.”

Stephen Senn, University of Glasgow

“Chapter 2 (‘Multiple Comparisons and Multiple Endpoints’) provides an excellent single-source

overview of the statistical methods available to address the critically important aspect of multiplicity

in clinical trial evaluation. The treatment is comprehensive, and the authors compare and contrast

methods in specific applications. Newly developed gatekeeping methods are included, and the

available software makes the approach accessible. The graphical displays are particularly useful in

understanding the concept of adjusted p-values.”

Joe Heyse, Merck


Group Sequential Designs for Efficacy and Futility Testing: Alternative

Approaches

As pointed out by Dr. Gordon Lan, the described group sequential approach to simultaneous efficacy

and futility testing may have serious fundamental flaws in practice. It was explained in the beginning of

this section that the lower stopping boundary for futility testing is driven by the Type II error probability

under the alternative hypothesis and thus depends heavily on the assumed value of the effect size, say

θ = θ1. For example, the lower boundary in the severe sepsis example was computed assuming an effect size of θ1 = 0.1352, and the boundary’s shape will change if a smaller or larger value of θ1 is used in the calculations.

When the experimental drug appears futile and the trial sponsor or members of the data monitoring

committee consider early termination due to lack of efficacy, is it still reasonable to assume that the true effect size is θ1 and to rely on the futility boundary determined from it? Since the

estimated effect size θ̂ is no longer consistent with θ1, it is prudent to consider a range of alternatives

from θ = 0 to θ = θ̂ in order to predict the final outcome of the trial. This approach appears to be more

robust than the traditionally used group sequential approach and is closely related to adaptive stochastic

curtailment tests that will be introduced in Section 4.3.
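As a language-agnostic illustration of predicting the final outcome under a range of alternatives, the sketch below computes conditional power under the standard Brownian-motion model for the test statistic (Zt = B(t)/√t). The function name, the drift parameterization, and the numeric inputs are illustrative assumptions, not values taken from the text.

```python
import math
from statistics import NormalDist

nd = NormalDist()

def conditional_power(z_t, t, drift, alpha=0.025):
    """Probability that the final statistic exceeds the critical value,
    given the interim statistic z_t at information fraction t.
    Assumes Z(t) = B(t)/sqrt(t), where B is Brownian motion whose drift
    equals the expected value of the final test statistic."""
    b = z_t * math.sqrt(t)                 # current value of B(t)
    z_crit = nd.inv_cdf(1 - alpha)         # final critical value
    # Remaining increment B(1) - B(t) ~ N(drift * (1 - t), 1 - t)
    return 1 - nd.cdf((z_crit - b - drift * (1 - t)) / math.sqrt(1 - t))

# A weak interim result at the halfway point, examined under a range of
# alternatives from the null (drift 0) up to an optimistic alternative
z_t, t = 0.3, 0.5
powers = [conditional_power(z_t, t, d) for d in (0.0, 1.0, 2.0)]
```

Scanning the drift from 0 (the null) up to the value implied by the current estimate θ̂ shows how sensitive the stop/continue decision is to the assumed effect size, which is the robustness check suggested above.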

4.2.3 Error Spending Approach

There are two approaches to sequential data monitoring in clinical trials. The first one relies on a set of

prespecified time points at which the data are reviewed. The Pocock and O’Brien-Fleming stopping

boundaries were originally proposed for this type of inspection scheme. An alternative approach was

introduced by Lan and DeMets (1983). It is based on an error spending strategy and enables the study

sponsor or data monitoring committee to change the timing and frequency of interim looks. Within the

error spending framework, interim monitoring preserves the philosophical rationale of the original design while allowing considerable flexibility.

Why do we need flexibility with respect to timing and frequency of analyses? It is sometimes

convenient to tie interim looks to calendar time rather than information time related to the sample size.

Nonpharmaceutical studies often employ interim analyses performed at regular intervals, such as every

3 or 6 months; see, for example, Van Den Berghe et al. (2001). Flexible strategies are also preferable in

futility monitoring. From a logistical perspective, it is more convenient to perform futility analyses on a

monthly or quarterly basis rather than after a prespecified number of patients have been enrolled into the

trial. In this case the number of patients changes unpredictably between looks and one needs to find a

way to deal with random increments of information in the data analysis.

This section provides a review of the error spending methodology proposed by Lan and DeMets

(1983) and illustrates it using the clinical trial examples from the introduction to this chapter. For a

further discussion of the methodology, see Jennison and Turnbull (2000, Chapter 7).

To introduce the error spending approach, consider a two-arm clinical trial with n patients in each

treatment group. The clinical researchers are interested in implementing a group sequential design to

facilitate the detection of early signs of therapeutic benefit. A Type I error spending function α(t) is a

nondecreasing function of the fraction of the total sample size t (0 ≤ t ≤ 1) with

α(0) = 0 and α(1) = α,

where α is the prespecified Type I error rate. Suppose that analyses are performed after n1, . . . , nm

patients have been accrued in each of the treatment groups (nm = n is the total number of patients in

each group). It is important to emphasize that interim looks can occur at arbitrary time points and thus

n1, . . . , nm are neither prespecified nor equally spaced. Let tk = nk/n denote the fraction of the total sample size and Zk the test statistic at the kth look. The joint distribution of the test statistics is assumed to

be multivariate normal. Finally, denote the true treatment difference by δ.
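This excerpt does not commit to a particular spending function, but two Lan-DeMets choices in common use, an O'Brien-Fleming-type and a Pocock-type function, can serve as a concrete illustration (they are shown here only as assumed examples); both satisfy α(0) = 0 and α(1) = α:

```python
import math
from statistics import NormalDist

nd = NormalDist()

def obf_spend(t, alpha=0.05):
    """O'Brien-Fleming-type spending: spends very little error early."""
    if t <= 0:
        return 0.0
    return 2 * (1 - nd.cdf(nd.inv_cdf(1 - alpha / 2) / math.sqrt(t)))

def pocock_spend(t, alpha=0.05):
    """Pocock-type spending: spends error more evenly over the trial."""
    return alpha * math.log(1 + (math.e - 1) * t)
```

At the halfway point (t = 0.5) the O'Brien-Fleming-type function has spent only about 0.006 of a 0.05 error budget, while the Pocock-type function has already spent about 0.031: the two represent conservative versus aggressive early stopping.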

Put simply, the selected error spending function determines the rate at which the overall Type I error

probability is spent during the trial. To see how it works, suppose that the first interim look is taken

when the sample size in each treatment group is equal to n1 patients. An upper one-sided critical value,


denoted by u1, is determined in such a way that the amount of Type I error spent equals α(n1/n), which

is equal to α(t1). In other words, choose u1 to satisfy the following criterion:

P{Z1 > u1 if δ = 0} = α(t1).

The trial is stopped at the first interim analysis if Z1 is greater than u1 and a decision to continue the

trial is made otherwise.

Since we have already spent a certain fraction of the overall Type I error at the first analysis, the

amount we have left for the second analysis is

α(n2/n) − α(n1/n) = α(t2) − α(t1).

Therefore, at the time of the second interim look, the critical value u2 is obtained by solving the

following equation:

P{Z1 ≤ u1, Z2 > u2 if δ = 0} = α(t2) − α(t1).

Again, compare Z2 to u2 and proceed to the next analysis if Z2 does not exceed u2.

Likewise, the critical value u3 used at the third interim analysis is defined in such a way that

P{Z1 ≤ u1, Z2 ≤ u2, Z3 > u3 if δ = 0} = α(t3) − α(t2),

and a similar argument is applied in order to compute how much Type I error can be spent at each of the

subsequent analyses and determine the corresponding critical values u4, . . . , um . It is easy to verify that

the overall Type I error associated with the constructed group sequential test is equal to

α(t1) + [α(t2) − α(t1)] + [α(t3) − α(t2)] + . . . + [α(tm) − α(tm−1)] = α(tm),

and, by the definition of an α-spending function,

α(tm) = α(1) = α.

The described sequential monitoring strategy preserves the overall Type I error rate regardless of the

timing and frequency of interim looks.
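The recursion for u1, u2, . . . can also be sketched numerically. The Monte Carlo sketch below simulates the test statistics under the null as Zk = B(tk)/√tk for a standard Brownian motion B (which yields the assumed multivariate normal joint distribution) and, at each look, sets uk so that the newly spent error matches α(tk) − α(tk−1). The Pocock-type spending function and all numerical settings are illustrative assumptions, not the book's own implementation.

```python
import math
import random

def pocock_spend(t, alpha=0.025):
    # Pocock-type spending function (an illustrative choice)
    return alpha * math.log(1 + (math.e - 1) * t)

def critical_values(ts, spend, n_sim=200_000, seed=1):
    """Return (u_1,...,u_m) and the overall simulated Type I error rate."""
    rng = random.Random(seed)
    # Simulate each path's statistics Z_k = B(t_k)/sqrt(t_k) under H0
    paths = []
    for _ in range(n_sim):
        b, t_prev, zs = 0.0, 0.0, []
        for t in ts:
            b += rng.gauss(0.0, math.sqrt(t - t_prev))
            zs.append(b / math.sqrt(t))
            t_prev = t
        paths.append(zs)
    alive = [True] * n_sim    # paths that have not yet crossed a boundary
    us, spent_prev, rejected = [], 0.0, 0
    for k, t in enumerate(ts):
        target = spend(t) - spent_prev          # error to spend at this look
        m = int(round(target * n_sim))          # paths that should reject now
        zk = sorted((p[k] for p, a in zip(paths, alive) if a), reverse=True)
        u = zk[m - 1] if m > 0 else math.inf    # empirical critical value
        for i, p in enumerate(paths):
            if alive[i] and p[k] >= u:
                alive[i] = False
                rejected += 1
        us.append(u)
        spent_prev = spend(t)
    return us, rejected / n_sim

us, overall = critical_values([0.25, 0.5, 0.75, 1.0], pocock_spend)
```

By construction the simulated overall Type I error rate equals α regardless of how the looks [0.25, 0.5, 0.75, 1.0] are chosen, which is the key property of the Lan-DeMets approach described above.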

To reiterate, the Lan-DeMets error spending approach allows for flexible interim monitoring without

sacrificing the overall Type I error probability. However, the power of a group sequential trial may be

dependent on the chosen monitoring strategy. As demonstrated by Jennison and Turnbull (2000, Section

7.2), the power is generally lower than the target value when the looks are more frequent than

anticipated and greater than the target value otherwise. In the extreme cases examined by Jennison and

Turnbull, the attained power differed from its nominal value by about 15%.

As shown by Pampallona and Tsiatis (1994) and Pampallona, Tsiatis and Kim (2001),3 the outlined

Type I error spending approach is easily extended to group sequential trials for futility monitoring, in

which case a Type II error spending function is introduced, or to simultaneous efficacy and futility

monitoring, which requires a specification of both Type I and Type II error spending functions.

Finally, it is worth reminding the reader about an important property of the error spending approach.

The theoretical properties of this approach hold under the assumption that the timing of future looks is

independent of what has been observed in the past. In theory, the overall probability of Type I errors may

no longer be preserved if one modifies the sequential testing scheme due to promising findings at one of

the interim looks. Several authors have studied the overall Type I error rate of sequential plans in which

this assumption is violated. For example, Lan and DeMets (1989) described a realistic scenario in which

a large but nonsignificant test statistic at an interim look causes the data monitoring committee to request

additional looks at the data. Lan and DeMets showed via simulations that data-dependent changes in the

frequency of interim analyses generally have a minimal effect on the overall α level. Further, as pointed

3 Although the paper by Pampallona, Tsiatis and Kim was published in 2001, it was actually written prior to 1994.

Index

T

t distribution statistics, pushback test on 63

TABLE statement, FREQ procedure 17

Tarone-Ware tests 42, 44, 48

likelihood-ratio tests vs. 48–49

stratified inferences, time-to-event data 50

TECHNIQUE= option, NLMIXED procedure 347

ten-look error spending functions 206–207

TEST statement, LIFETEST procedure 46–48

stratified inferences, time-to-event data 49–50

TEST statement, MIANALYZE procedure 313

TEST statement, MULTTEST procedure 92–93

Fisher option 92–93

FT option 92–93

ties, time-to-event data with 53–55

TIES= option, MODEL statement 54

time-to-event data

model-based tests 50–55

randomization-based tests 42–50

stratified analysis of 40–56

Toeplitz covariance matrix ("model 4") 284–285

tolerance limits, reference intervals based on

133–135

%TollLimit macro 135, 366

TPHREG procedure 51

CLASS statement 51

trialwise error rate

See familywise error rate

Tsiatis-Rosner-Mehta point estimates and

confidence limits 222–224

two-sided group sequential testing 182

See also stopping boundaries

two-way cell-means models 5–6

TYPE= option, REPEATED statement

(GENMOD) 162, 164, 287, 292, 339

Type I analysis 6–8, 12

Type I error rate

See familywise error rate

Type I error rate, spending functions for 206–209

simultaneous efficacy and futility monitoring

214–215

Type II analysis 8–10, 12

Type II error rate, spending functions for 206–209

Type III analysis 10–14

pretesting with 13–14

Type 3 statistics 161

Type IV analysis 6

TYPE3 option, MODEL statement 37

U

ulcerative colitis trial

analysis of safety data 157–158

GEE analysis 159–166

random effect models for binary responses

167–169

Westfall-Young resampling-based method

90–92

UNIVARIATE procedure

OUTPUT statement 132

sample quantile computations 131–132

univariate reference intervals

based on sample quantiles 131–133

based on tolerance limits 133–135

Koenker-Bassett method 136–137

upper stopping boundaries

See stopping boundaries

urinary incontinence trial 16–19

V

van Elteren statistic 15–19

VAR statement

MI procedure 312

MIANALYZE procedure 312–313

%VarPlot macro 153, 367

W

Wald statistics

asymptotic model-based tests 36–37

proportional hazards regression 52

Wang-Tsiatis sequential design 184

expected sample size 230

maximum sample size 229

stopping boundaries, computation of 224

weak control of familywise error rate 69

weighted GEE 332, 335–336, 344–347

Westfall-Young resampling-based method 89–92

gatekeeping strategies 122–123

multiple endpoint testing 101

WGEE (weighted generalized estimating equations)

332, 335–336, 344–347

Wilcoxon test 42–44

LIFETEST procedure with STRATA statement

44–46

LIFETEST procedure with TEST statement

46–48

pooling 49–50

Tarone-Ware and Harrington-Fleming tests vs.

48–49

within-imputation variances 309–310

WITHINSUBJECT= option, REPEATED

statement 339

working correlation matrix 334

working covariance matrix 159


Numbers

100�Jth quantile, computing with IRLS algorithm

137–138

Symbols

α-spending functions 206–209

simultaneous efficacy and futility monitoring

214–215

β-spending functions 206
