Applied Chemometrics for Scientists - R. Brereton (Wiley, 2007)
Page 198


Figure 5.27 Two groups, one modelled by one PC and one by two PCs

Each group is independently modelled using PCA. Note that each group could be
described by a different number of PCs. Figure 5.27 represents two groups each
characterized by three raw measurements, which may, for example, be chromatographic
peak heights or physical properties. One group falls mainly on a straight line, which is
defined as the first PC of the group. The second group falls roughly on a plane: the axes
of this plane are the first two PCs of this group. This way of looking at PCs (as axes that
best fit the data) is sometimes used by chemists, and is complementary to the definitions
introduced previously (Section 5.2.2). It is important to note that there are a number of
proposed methods for determining how many PCs are most suited to describing a class, of
which the original advocates of SIMCA preferred cross-validation (Section 5.10).
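The per-class modelling step can be sketched in a few lines of NumPy (a minimal sketch, not the book's code: the `fit_class_model` helper and the toy data are invented for illustration):

```python
import numpy as np

def fit_class_model(X, n_pcs):
    """Fit an independent PCA model to one class: mean-centre, then take
    the leading loadings from the SVD of the centred data."""
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_pcs]            # class mean and loadings P

# Two hypothetical groups of three raw measurements, as in Figure 5.27:
# one lies close to a line (1 PC), the other close to a plane (2 PCs).
rng = np.random.default_rng(0)
line = rng.normal(size=(20, 1)) @ np.array([[1.0, 0.5, 2.0]])
plane = rng.normal(size=(20, 2)) @ np.array([[1.0, 0.0, 1.0],
                                             [0.0, 1.0, 0.5]])
mean1, P1 = fit_class_model(line, 1)   # loadings shape (1, 3)
mean2, P2 = fit_class_model(plane, 2)  # loadings shape (2, 3)
```

Each class keeps its own mean and its own number of loadings, which is exactly what allows one group to be a line and the other a plane.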

The class distance can be calculated as the geometric distance from the PC models (see
Figure 5.28). The unknown is much closer to the plane formed from the group represented
by squares than to the line formed by the group represented by circles, and so is tentatively
assigned to this class. A rather more elaborate approach is in fact usually employed, in
which each group is bounded by a region of space representing 95 % confidence that
a particular object belongs to the class. Hence geometric class distances can be converted to
statistical probabilities.
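The geometric class distance can be sketched as the size of the residual left after projecting the unknown onto each class's PC subspace (the class models and the unknown below are made up for illustration):

```python
import numpy as np

def class_distance(x, mean, P):
    """Distance of sample x to a class PC model (mean plus loadings P):
    the norm of the part of x the model cannot reproduce."""
    xc = x - mean
    residual = xc - (xc @ P.T) @ P     # remove the projection onto the PCs
    return np.linalg.norm(residual)

# Class A modelled by one PC (a line), class B by two PCs (a plane)
mean_a, P_a = np.zeros(3), np.array([[1.0, 0.0, 0.0]])
mean_b, P_b = np.zeros(3), np.array([[1.0, 0.0, 0.0],
                                     [0.0, 1.0, 0.0]])
unknown = np.array([2.0, 3.0, 0.1])
d_a = class_distance(unknown, mean_a, P_a)   # far from the line
d_b = class_distance(unknown, mean_b, P_b)   # close to the plane
print("tentatively assigned to", "A" if d_a < d_b else "B")
```

Converting such distances to the 95 % confidence regions mentioned above would need a statistical model of the residuals, which is beyond this sketch.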

Sometimes it is interesting to see which variables are useful for discrimination. There are
often good reasons for this: in gas chromatography-mass spectrometry we may have
hundreds of peaks in a chromatogram yet be primarily interested in a very small number

Page 199



Figure 5.28 Distance of an unknown sample (asterisk) to two known classes

of, for example, biomarkers that are used to distinguish two groups, so this interpretation
can have a chemical basis.

The modelling power of each variable j in each class is defined by:

M_j = 1 − s_j(resid) / s_j(raw)

where s_j(raw) is the standard deviation of variable j in the raw data, and s_j(resid) the
standard deviation of variable j in the residuals given by:

E = X − T·P
which is the difference between the observed data and the PC model as described earlier.
The modelling power varies between 1 (excellent) and 0 (no discrimination). Variables with
M below 0.5 are of little use.
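The modelling power follows directly from the two definitions above (a sketch assuming a simple SVD-based PCA; the toy data, with two correlated variables and one pure-noise variable, are invented):

```python
import numpy as np

def modelling_power(X, n_pcs):
    """M_j = 1 - s_j(resid)/s_j(raw) for each variable j of one class."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_pcs] * S[:n_pcs]       # scores
    P = Vt[:n_pcs]                     # loadings
    E = Xc - T @ P                     # residuals, E = X - T.P
    return 1.0 - E.std(axis=0, ddof=1) / Xc.std(axis=0, ddof=1)

rng = np.random.default_rng(0)
t = rng.normal(size=(50, 1))
X = np.hstack([t, 2.0 * t, rng.normal(size=(50, 1))])  # third column is noise
M = modelling_power(X, 1)
# The first two variables are well modelled by one PC (M near 1);
# the noise variable is not (M well below 0.5).
```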

A second measure is how well a variable discriminates between two classes. This
is distinct from modelling power: being able to model one class well does not necessarily
imply being able to discriminate two groups effectively. In order to determine this, it is
necessary to fit each sample to both class models. For example, fit sample 1 to the PC
models of both class A and class B. The residual matrices are then calculated, just as for
modelling power, but there are now four such matrices:

1. Samples in class A fitted to the model of class A.
2. Samples in class A fitted to the model of class B.
3. Samples in class B fitted to the model of class A.
4. Samples in class B fitted to the model of class B.
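The four residual matrices can be sketched as follows. The ratio used for the discriminatory power at the end is one common SIMCA-style formulation, assumed here because the text is cut short by the page break; the data and helper names are invented:

```python
import numpy as np

def fit(X, n_pcs):
    """Per-class PCA model: class mean and leading loadings."""
    mean = X.mean(axis=0)
    Vt = np.linalg.svd(X - mean, full_matrices=False)[2]
    return mean, Vt[:n_pcs]

def residuals(X, mean, P):
    """Residuals of the samples in X fitted to the model (mean, P)."""
    Xc = X - mean
    return Xc - (Xc @ P.T) @ P

# Invented data: class A varies along variables 1-2, class B along 3-4,
# and the two classes differ in mean on variables 1 and 2.
rng = np.random.default_rng(1)
XA = (np.array([5.0, 0.0, 0.0, 0.0])
      + rng.normal(size=(30, 1)) @ np.array([[1.0, 1.0, 0.0, 0.0]])
      + 0.1 * rng.normal(size=(30, 4)))
XB = (np.array([0.0, 5.0, 0.0, 0.0])
      + rng.normal(size=(30, 1)) @ np.array([[0.0, 0.0, 1.0, 1.0]])
      + 0.1 * rng.normal(size=(30, 4)))
mA, PA = fit(XA, 1)
mB, PB = fit(XB, 1)

# The four residual matrices listed above:
E = {("A", "A"): residuals(XA, mA, PA), ("A", "B"): residuals(XA, mB, PB),
     ("B", "A"): residuals(XB, mA, PA), ("B", "B"): residuals(XB, mB, PB)}
ss = {k: (v ** 2).sum(axis=0) for k, v in E.items()}

# Assumed discriminatory power per variable: large when a variable fits
# the wrong class much worse than its own.
D = np.sqrt((ss[("A", "B")] + ss[("B", "A")])
            / (ss[("A", "A")] + ss[("B", "B")]))
```

In this toy example the first two variables, on which the class means differ, come out with large D values.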

Page 396


descriptive 68–9
outliers 64
terminology 43

descriptive 63
distributions 63–4
hypothesis testing 64
prediction confidence 64
quality control 64

sum of squares 17–9
supervised pattern recognition 146–7,

bootstrap 173
cross-validation 172–3
industrial process control 174
method stability 174
model application 174
model improvement 173
model optimization 173–4
and Partial Least Squares 213–4
and prediction 171–3
test set 172–4
training set 171–4

Support Vector Machines (SVMs) 303–6
boundary establishment 304–6
kernel methods 305–6
linearity vs. nonlinearity 304–6
multiclass problems 306
optimization 306

Taguchi designs 36–7
taxonomy 288–313

algorithmic methods 288–9
Bayesian methods 300–3
class distance determination 297–300
defined 289
Discriminant Partial Least Squares 306–8
discrimination methods 291–7
Fisher’s iris data 289–90, 291, 293–4
micro-organisms 308–12
numerical 290–1
Support Vector Machines 303–6

test set 172, 215–6
and compound correlations 216
dataset representation 215
and outlier detection 216

three-way chemical data 187–8
time series 130–1, 134

time series analysis 111–2
Total Ion Chromatogram (TIC) 316–7
training set 58–60, 171

and calibration experiments 205–6
and classification 171–3
designs 216

tree diagrams 14–6, 40, 42, 324–9
branches 324–5
cladistics 325–6
dendrograms 325
evolutionary theory 325–6
nodes 324–5
phylograms 326–7
representation 324–5

trilinear PLS1 218–9
data types 219

t-statistic 83–4, 94
t-test 81–5, 94–5

one-tailed 83–4
principle 82
statistics tabulated 83–4
uses 83–4
two-tailed 81

Tucker3 models 188–9, 347–8

unfolding 190, 217–8
disadvantages 218
of images 333, 339, 341–2

univariate calibration 195–202
classical 195–7
equations 198–9
extra terms 199
graphs 199–202
inverse 196–7
matrix 198–9
outliers 197–8
terminology 195

univariate classification 175
unsupervised pattern recognition 146,

see also cluster analysis

UPGMA 327–8
UV/visible spectroscopy 138, 224, 249, 252,


variable combination 228
variable selection 224–8

noise 225–6
optimum number 225–6

Page 397


variance 19, 68–9
comparisons 85–6
and F-test 85–6

varimax rotation 247
food chemistry 359–65

wavelets 142–3
applications 143
in datasets 142–3
defined 142

weights 188
Window Factor Analysis 236–9

eigenvalues 237–8
expanding windows 237
fixed size windows 238–9
variations 239

yield improvement 267–86
optimization 269, 271–4
