naked statistics

Stripping the Dread from the Data

CHARLES WHEELAN

malaria drug).

In a courtroom, the threshold for rejecting the presumption of innocence is the qualitative assessment that the defendant is “guilty beyond a reasonable doubt.” The judge or jury is left to define what exactly that means. Statistics harnesses the same basic idea, but “guilty beyond a reasonable doubt” is defined quantitatively instead. Researchers typically ask, If the null hypothesis is true, how likely is it that we would observe this pattern of data by chance? To use a familiar example, medical researchers might ask, If this experimental drug has no effect on heart disease (our null hypothesis), how likely is it that 91 out of 100 patients getting the drug would show improvement compared with only 49 out of 100 patients getting a placebo? If the data suggest that the null hypothesis is extremely unlikely, as in this medical example, then we must reject it and accept the alternative hypothesis (that the drug is effective in treating heart disease).
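
One way to put a number on the drug example is a two-proportion z-test. The sketch below is an illustration under stated assumptions, not a calculation the text itself performs: it assumes a pooled standard error and a normal approximation, and the function name `two_proportion_z` is hypothetical.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (z, one-sided p-value) for H0: both groups share one success rate.

    Assumption: the samples are large enough for the normal approximation.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Under the null hypothesis the two groups are pooled into a single rate.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    # Upper-tail probability of a standard normal variable: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Figures from the text: 91 of 100 improve on the drug, 49 of 100 on placebo.
z, p_value = two_proportion_z(91, 100, 49, 100)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.2e}")
```

With these figures z comes out near 6.5, so the p-value sits far below any conventional threshold, which is consistent with the text treating the null hypothesis as extremely unlikely.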

In that vein, let us revisit the Atlanta standardized-test cheating scandal alluded to at several points in the book. The Atlanta test score results were first flagged because of a high number of “wrong-to-right” erasures. Obviously students taking standardized exams erase answers all the time. And some groups of students may be particularly lucky in their changes, without any cheating necessarily being involved. For that reason, the null hypothesis is that the standardized test scores for any particular school district are legitimate and that any irregular patterns of erasures are merely a product of chance. We certainly do not want to be punishing students or administrators because an unusually high proportion of students happened to make sensible changes to their answer sheets in the final minutes of an important state exam.

But “unusually high” does not begin to describe what was happening in Atlanta. Some classrooms had answer sheets on which the number of wrong-to-right erasures was twenty to fifty standard deviations above the state norm. (To put this in perspective, remember that most observations in a distribution typically fall within two standard deviations of the mean.) So how likely was it that Atlanta students happened to erase massive numbers of wrong answers and replace them with correct answers just as a matter of chance? The official who analyzed the data described the probability of the Atlanta pattern occurring without cheating as roughly equal to the chance of having 70,000 people show up for a football game at the Georgia Dome who all happen to be over seven feet tall.² Could it happen? Yes. Is it likely? Not so much.
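
To get a feel for how extreme twenty to fifty standard deviations is, the sketch below computes tail probabilities for a standard normal variable. Treating erasure counts as normally distributed is an assumption made here for illustration only; the text does not say which distribution the analysts used, and the function names are hypothetical.

```python
import math

def within_k_sd(k):
    """P(|Z| <= k) for a standard normal variable."""
    return math.erf(k / math.sqrt(2))

def upper_tail(k):
    """P(Z >= k) for a standard normal variable."""
    return 0.5 * math.erfc(k / math.sqrt(2))

def log10_upper_tail_approx(k):
    """Large-k approximation of log10 P(Z >= k).

    Uses P(Z >= k) ~ exp(-k**2 / 2) / (k * sqrt(2 * pi)), which is only
    accurate far out in the tail; working in logs avoids underflow.
    """
    natural_log = -k * k / 2 - math.log(k * math.sqrt(2 * math.pi))
    return natural_log / math.log(10)

print(f"share within 2 SD: {within_k_sd(2):.4f}")
print(f"P(Z >= 20): {upper_tail(20):.2e}")
print(f"log10 P(Z >= 50), approx: {log10_upper_tail_approx(50):.0f}")
```

Under that assumption, about 95 percent of observations fall within two standard deviations, while the chance of landing twenty or more above the mean is too small to have any practical meaning. At fifty standard deviations the probability is smaller than an ordinary floating-point number can represent, which is why the last function works on a log scale.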

Georgia officials still could not convict anybody of wrongdoing, just as my professor could not (and should not) have had me thrown out of school because my final exam grade in statistics was out of sync with my midterm grade. Atlanta officials could not prove that cheating was going on. They could, however, reject the null hypothesis that the results were legitimate. And they could do so with a “high degree of confidence,” meaning that the observed pattern was nearly impossible among normal test takers. They therefore explicitly accepted the alternative hypothesis, which is that something fishy was going on. (I suspect they used more official-sounding language.) Subsequent investigation did in fact uncover the “smoking erasers.” There were reports of teachers changing answers, giving out answers, allowing low-scoring children to copy from high-scoring children, and even pointing to answers while standing over students’ desks. The most egregious cheating involved a group of teachers who held a weekend pizza party during which they went through exam sheets and changed students’ answers.

In the Atlanta example, we could reject the null hypothesis of “no cheating” because the pattern of test results was so wildly improbable in the absence of foul play. But how implausible does the null hypothesis have to be before we can reject it and invite some alternative explanation?

One of the most common thresholds that researchers use for rejecting a null hypothesis is 5 percent, which is often written in decimal form: .05. This probability is known as a significance level, and it represents the upper bound for the likelihood of observing some pattern of data if the null hypothesis were true. Stick with me for a moment, because it’s not really that complicated.
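
The threshold described above can be sketched as a simple decision rule. The coin-flip data below are hypothetical and are not taken from the text; the example assumes a one-sided exact binomial test, and the function names are made up for illustration.

```python
import math

ALPHA = 0.05  # the significance level discussed in the text

def binomial_upper_tail(k, n, p=0.5):
    """Exact P(X >= k) when X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )

def reject_null(p_value, alpha=ALPHA):
    """Reject the null hypothesis when the p-value falls below alpha."""
    return p_value < alpha

# Hypothetical data: heads observed in 100 flips of a supposedly fair coin.
for heads in (55, 60):
    p_value = binomial_upper_tail(heads, 100)
    print(heads, round(p_value, 4), reject_null(p_value))
```

In this sketch, 55 heads is not unusual enough to reject the null hypothesis of a fair coin at the .05 level, while 60 heads is. The cutoff itself is a convention, so the same data could lead to a different decision at a stricter level such as .01.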

Let’s think about a significance level of .05. We can reject a null hypothesis at the .05 level if there is less than a 5 percent chance of getting an outcome at least as extreme as what we’ve observed if the null hypothesis were true. A simple example can make this much clearer. I hate to do this to you, but assume once again that you’ve been put on missing-bus duty (in part because of your valiant efforts in the last chapter). Only now you are working full-time for the researchers at the Changing Lives study, and they have given you some excellent data to help inform your work. Each bus operated by the organizers of the study has roughly 60 passengers, so we can treat the passengers on any bus as a random sample drawn from the entire Changing Lives population. You are awakened early one morning by the news that a bus in the Boston area has been hijacked by a pro-obesity terrorist group.* Your job is to drop from a helicopter onto the roof of the moving bus, sneak inside through the emergency exit, and then stealthily determine whether the passengers are Changing Lives participants, solely on the basis of their weights. (Seriously, this is no more implausible than most action-adventure plots, and it’s a lot more educational.) As the helicopter takes off from the commando base, you are given a machine

Copyright

Copyright © 2013 by Charles Wheelan

All rights reserved

Printed in the United States of America

First Edition

For information about permission to reproduce selections from this book, write to Permissions, W. W.

Norton & Company, Inc.,

500 Fifth Avenue, New York, NY 10110

For information about special discounts for bulk purchases,

please contact W. W. Norton Special Sales at

[email protected] or 800-233-4830

Manufacturing by Courier Westford

Production manager: Anna Oler

ISBN 978-0-393-07195-5 (hardcover)

eISBN 978-0-393-08982-0

W. W. Norton & Company, Inc.

500 Fifth Avenue, New York, N.Y. 10110

www.wwnorton.com

W. W. Norton & Company Ltd.

Castle House, 75/76 Wells Street, London W1T 3QT

Also by Charles Wheelan

10½ Things No Commencement Speaker Has Ever Said

Naked Economics: Undressing the Dismal Science
