238 SIMULATION-BASED BAYESIAN INFERENCE

sampling points below the graph of cQ and accepting the horizontal positions of the points falling below the graph of P. The remaining points are rejected. The coordinates of the points below the cQ graph are sampled as follows. The horizontal position θ is obtained by drawing it from the candidate distribution with density Q. Next, the vertical position ũ is sampled uniformly from the interval (0, cQ(θ)), that is, ũ = cQ(θ)u with u ∼ U(0, 1). As the point (θ, ũ) is accepted if and only if ũ falls in the interval (0, P(θ)), the acceptance probability for this point is P(θ)/cQ(θ). The following rejection algorithm collects a sample of size n from the target distribution with density P.

Initialize the algorithm:
    Set the set of accepted draws to empty: S = ∅.
    Set the number of accepted draws to zero: i = 0.
Do while i < n:
    Obtain θ from the candidate distribution with density Q.
    Obtain u from the uniform distribution U(0, 1).
    If u < P(θ)/cQ(θ), accept θ:
        Add θ to the set of accepted draws: S = S ∪ {θ}.
        Update the number of accepted draws: i = i + 1.
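The algorithm above can be sketched in Python. The target kernel P (an unnormalized standard normal) and the Cauchy candidate Q are illustrative choices, not taken from the text; for this particular pair, the bound c = max P(θ)/Q(θ) is attained at θ = ±1, giving c = 2πe^(−1/2).

```python
import numpy as np

rng = np.random.default_rng(0)

def P(theta):
    """Unnormalized target kernel: standard normal, exp(-theta^2/2)."""
    return np.exp(-0.5 * theta**2)

def Q(theta):
    """Candidate density: standard Cauchy."""
    return 1.0 / (np.pi * (1.0 + theta**2))

# For this pair, c = max P/Q is attained at theta = ±1: c = 2*pi*exp(-1/2).
c = 2.0 * np.pi * np.exp(-0.5)

def rejection_sample(n):
    """Collect n accepted draws from the target with kernel P."""
    draws = []                             # S, the set of accepted draws
    while len(draws) < n:                  # do while i < n
        theta = rng.standard_cauchy()      # draw from candidate density Q
        u = rng.uniform()                  # u ~ U(0, 1)
        if u < P(theta) / (c * Q(theta)):  # accept with probability P/(cQ)
            draws.append(theta)
    return np.array(draws)

sample = rejection_sample(10_000)
```

The accepted draws form an exact sample from the standard normal target, even though they were proposed from a Cauchy distribution.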

Although rejection sampling is based on an approximating candidate distribution, the method yields an exact sample from the target distribution. However, the big drawback of the rejection approach is that many candidate draws might be required to obtain an accepted sample of moderate size, making the method inefficient. For example, in Figure 7.9 most points are located above the P graph, so that many draws are thrown away. For large n, the fraction of accepted draws tends to the ratio of the area below the P graph to the area below the cQ graph. As the candidate density Q integrates to 1, this acceptance rate is given by ∫P(θ) dθ/c, so that a smaller value of c results in more efficiency. Clearly, the best (smallest admissible) choice is

    c = max_θ P(θ)/Q(θ),    (7.29)

implying that the optimal c is small if the variation in the ratio P(θ)/Q(θ) is small. This explains why an appropriate candidate density, providing a good approximation to the target density, is desirable.
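To make (7.29) concrete, consider the same illustrative pair used above (an assumption for the sake of example): a standard normal kernel P and a standard Cauchy candidate Q. A simple grid search recovers the analytic maximum 2πe^(−1/2) and the resulting acceptance rate ∫P(θ) dθ/c:

```python
import numpy as np

def P(theta):
    return np.exp(-0.5 * theta**2)            # standard normal kernel

def Q(theta):
    return 1.0 / (np.pi * (1.0 + theta**2))   # standard Cauchy density

# Grid search for c = max_theta P(theta)/Q(theta), as in (7.29).
grid = np.linspace(-10.0, 10.0, 200_001)
c_numeric = np.max(P(grid) / Q(grid))

# The analytic maximum is attained at theta = ±1: c = 2*pi*exp(-1/2) ≈ 3.81.
c_exact = 2.0 * np.pi * np.exp(-0.5)

# Acceptance rate = ∫P(theta) dtheta / c = sqrt(2*pi)/c ≈ 0.66 for this pair.
accept_rate = np.sqrt(2.0 * np.pi) / c_exact
```

So with this candidate roughly a third of all proposals would be discarded; a candidate that tracks the target more closely would push the acceptance rate toward 1.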

Clever rejection sampling methods have been developed for simulating draws from the univariate standard normal distribution, which serve as building blocks for more involved algorithms. However, in higher dimensions it may be nontrivial to determine the maximum of the ratio P(θ)/Q(θ). Moreover, it may be difficult to find a candidate density with a small c in (7.29), so that the acceptance rate may be very low. Therefore, rejection sampling is not so popular in higher dimensions; one mostly prefers one of the other sampling methods that will be discussed later in this chapter.


A PRIMER ON SIMULATION METHODS 239

7.3.3.2 Importance sampling

Importance sampling is another indirect approach to obtaining an estimate of E[g(θ)], where θ is a random variable from the target distribution. It was initially discussed by Hammersley and Handscomb (1964) and introduced into econometrics by Kloek and Van Dijk (1978). The method is related to rejection sampling. The rejection method either accepts or rejects candidate draws; that is, draws either receive full weight or no weight at all. Importance sampling builds on this notion of assigning weights to draws. However, in contrast with the rejection method, these weights are not all-or-nothing. Instead, they can take any possible value, representing the relative importance of the draws. If Q is the candidate density (the importance function) and P is a kernel of the target density, importance sampling is based on the relationship

    E[g(θ)] = ∫g(θ)P(θ) dθ / ∫P(θ) dθ = ∫g(θ)w(θ)Q(θ) dθ / ∫w(θ)Q(θ) dθ = E[w(θ̃)g(θ̃)] / E[w(θ̃)],    (7.30)

where θ̃ is a random variable from the candidate distribution, and w(θ̃) = P(θ̃)/Q(θ̃) is the weight function, which should be bounded. It follows from (7.30) that a consistent estimate of E[g(θ)] is given by the weighted mean

    Ê[g(θ)]IS = Σ_{i=1}^{n} w(θ̃_i)g(θ̃_i) / Σ_{j=1}^{n} w(θ̃_j),    (7.31)
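The weighted mean (7.31) can be sketched in a few lines of Python. The standard normal target kernel and the Student's t(3) importance function are illustrative assumptions; note that the sampler only needs a kernel P, since the unknown normalizing constant cancels between numerator and denominator.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def P(theta):
    """Kernel of the standard normal target; the normalizing constant is not needed."""
    return np.exp(-0.5 * theta**2)

def Q(theta, nu=3):
    """Student's t density with nu degrees of freedom (the importance function)."""
    const = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return const * (1.0 + theta**2 / nu) ** (-(nu + 1) / 2)

n = 100_000
theta = rng.standard_t(3, size=n)   # draws from the candidate distribution
w = P(theta) / Q(theta)             # importance weights w(theta) = P(theta)/Q(theta)

# Self-normalized estimates (7.31) of E[theta] and E[theta^2] under the target.
est_mean = np.sum(w * theta) / np.sum(w)
est_msq = np.sum(w * theta**2) / np.sum(w)
```

For this target the estimates should come out close to 0 and 1, the mean and second moment of the standard normal distribution.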

where θ̃1, . . . , θ̃n are realizations from the candidate distribution and w(θ̃1), . . . , w(θ̃n) are the corresponding weights. As relationship (7.30) would still hold after redefining the weight function as w(θ̃) = P(θ̃)/cQ(θ̃), which is the acceptance probability of θ̃, there exists a clear link between rejection sampling and importance sampling: the importance sampling method weights draws with the acceptance probabilities from the rejection approach. Figure 7.10 provides a graphical illustration of the method. Points for which the graph of the target density lies above the graph of the candidate density are not sampled often enough. To correct for this, such draws are assigned relatively large weights (weights larger than unity). The reverse holds in the opposite case. Although importance sampling can be used to estimate characteristics of the target density (such as the mean), it does not provide a sample from this density, as the draws are generated from the candidate distribution. So, in a strict sense, importance sampling should not be called a sampling method; rather, it is a pure integration method.

The performance of the importance sampler is greatly affected by the choice of the candidate distribution. If the importance function Q is inappropriate, the weight function w(θ̃) = P(θ̃)/Q(θ̃) varies a lot, and it might happen that only a few draws with extreme weights almost completely determine the estimate Ê[g(θ)]IS. This estimate would be very unstable. In particular, a situation in which the tails of the target density are fatter than the tails of the candidate density is worrying, as this would imply that the weight function is unbounded, so that the candidate expectations in the final expression of (7.30) need not exist. Roughly stated, it is much less harmful for the importance sampling results if the candidate density's tails are too fat than if they are too thin, as compared with the target density. It is for this reason that a fat-tailed Student's t importance function is usually preferred over a normal candidate density.
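The tail issue can be illustrated numerically. In this sketch the roles are deliberately reversed relative to the example above: the target is a fat-tailed Student's t(3) kernel, and two candidates are compared, a standard normal (tails too thin, so the weights are unbounded and grow like e^(θ²/2)) and a standard Cauchy (tails fatter than the target, so the weights are bounded by 9π/8). All distributional choices are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def P(theta):
    """Kernel of a Student's t(3) target: polynomially decaying (fat) tails."""
    return (1.0 + theta**2 / 3.0) ** -2

def q_normal(theta):
    """Thin-tailed candidate: standard normal density."""
    return np.exp(-0.5 * theta**2) / np.sqrt(2.0 * np.pi)

def q_cauchy(theta):
    """Fat-tailed candidate: standard Cauchy density."""
    return 1.0 / (np.pi * (1.0 + theta**2))

n = 100_000
th_n = rng.standard_normal(n)
th_c = rng.standard_cauchy(n)

w_n = P(th_n) / q_normal(th_n)   # thin-tailed candidate: a few huge weights
w_c = P(th_c) / q_cauchy(th_c)   # fat-tailed candidate: weights capped at 9*pi/8

ratio_n = w_n.max() / w_n.mean()
ratio_c = w_c.max() / w_c.mean()
```

With the normal candidate, the largest weight dwarfs the average weight, so a handful of extreme draws dominate the estimate; with the Cauchy candidate the weights stay within a small factor of their mean.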

