Exploratory Data Analysis with MATLAB
```
Exploratory Data Analysis with MATLAB
Preface
Chapter 1 Introduction to Exploratory Data Analysis
1.1 What is Exploratory Data Analysis
1.2 Overview of the Text
1.3 A Few Words About Notation
1.4 Data Sets Used in the Book
1.4.1 Unstructured Text Documents
1.4.2 Gene Expression Data
1.4.3 Oronsay Data Set
1.4.4 Software Inspection
1.5 Transforming Data
1.5.1 Power Transformations
1.5.2 Standardization
1.5.3 Sphering the Data
Exercises
Chapter 2 Dimensionality Reduction - Linear Methods
2.1 Introduction
2.2 Principal Component Analysis - PCA
2.2.1 PCA Using the Sample Covariance Matrix
2.2.2 PCA Using the Sample Correlation Matrix
2.2.3 How Many Dimensions Should We Keep?
2.3 Singular Value Decomposition - SVD
2.4 Factor Analysis
2.5 Intrinsic Dimensionality
Exercises
Chapter 4 Data Tours
4.1 Grand Tour
4.1.1 Torus Winding Method
4.1.2 Pseudo Grand Tour
4.2 Interpolation Tours
4.3 Projection Pursuit
4.4 Projection Pursuit Indexes
4.4.1 Posse Chi-Square Index
4.4.2 Moment Index
Exercises
Chapter 5 Finding Clusters
5.1 Introduction
5.2 Hierarchical Methods
5.3 Optimization Methods - k-Means
5.4 Evaluating the Clusters
5.4.1 Rand Index
5.4.2 Cophenetic Correlation
5.4.3 Upper Tail Rule
5.4.4 Silhouette Plot
5.4.5 Gap Statistic
Exercises
Chapter 6 Model-Based Clustering
6.1 Overview of Model-Based Clustering
6.2 Finite Mixtures
6.2.1 Multivariate Finite Mixtures
6.2.2 Component Models - Constraining the Covariances
6.3 Expectation-Maximization Algorithm
6.4 Hierarchical Agglomerative Model-Based Clustering
6.5 Model-Based Clustering
6.6 Generating Random Variables from a Mixture Model
Exercises
Chapter 7 Smoothing Scatterplots
7.1 Introduction
7.2 Loess
7.3 Robust Loess
7.4 Residuals and Diagnostics
7.4.1 Residual Plots
7.4.3 Loess Envelopes - Upper and Lower Smooths
7.5 Bivariate Distribution Smooths
7.5.1 Pairs of Middle Smoothings
7.5.2 Polar Smoothing
7.6 Curve Fitting Toolbox
Exercises
Chapter 8 Visualizing Clusters
8.1 Dendrogram
8.2 Treemaps
8.3 Rectangle Plots
8.4 ReClus Plots
8.5 Data Image
Exercises
Chapter 9 Distribution Shapes
9.1 Histograms
9.1.1 Univariate Histograms
9.1.2 Bivariate Histograms
9.2 Boxplots
9.2.1 The Basic Boxplot
9.2.2 Variations of the Basic Boxplot
9.3 Quantile Plots
9.3.1 Probability Plots
9.3.2 Quantile-quantile Plot
9.3.3 Quantile Plot
9.4 Bagplots
Exercises
Chapter 10 Multivariate Visualization
10.1 Glyph Plots
10.2 Scatterplots
10.2.1 2-D and 3-D Scatterplots
10.2.2 Scatterplot Matrices
10.2.3 Scatterplots with Hexagonal Binning
10.3 Dynamic Graphics
10.3.1 Identification of Data
10.3.3 Brushing
10.4 Coplots
10.5 Dot Charts
10.5.1 Basic Dot Chart
10.5.2 Multiway Dot Chart
10.6 Plotting Points as Curves
10.6.1 Parallel Coordinate Plots
10.6.2 Andrews’ Curves
10.6.3 More Plot Matrices
10.7 Data Tours Revisited
10.7.1 Grand Tour
10.7.2 Permutation Tour
Exercises
Appendix A Proximity Measures
A.1 Definitions
A.1.1 Dissimilarities
A.1.2 Similarity Measures
A.1.3 Similarity Measures for Binary Data
A.1.4 Dissimilarities for Probability Density Functions
A.2 Transformations
Appendix B Software Resources for EDA
B.1 MATLAB Programs
B.2 Other Programs for EDA
B.3 EDA Toolbox
Appendix C Description of Data Sets
Appendix D Introduction to MATLAB
D.1 What Is MATLAB?
D.2 Getting Help in MATLAB
D.3 File and Workspace Management
D.4 Punctuation in MATLAB
D.5 Arithmetic Operators
D.6 Data Constructs in MATLAB
Basic Data Constructs
Building Arrays
Cell Arrays
Structures
D.7 Script Files and Functions
D.8 Control Flow
for Loop
while Loop
if-else Statements
switch Statement
D.9 Simple Plotting
D.10 Where to get MATLAB Information
Appendix E MATLAB Functions
E.1 MATLAB
E.2 Statistics Toolbox - Versions 4 and 5
E.3 Exploratory Data Analysis Toolbox
References
Color Figures
FIGURE 3.7
FIGURE 3.8
FIGURE 3.9 (TOP)
FIGURE 3.9 (BOTTOM)
FIGURE 3.10
FIGURE 5.4
FIGURE 8.7 (TOP)
FIGURE 8.7 (BOTTOM)
FIGURE 8.8
FIGURE 8.9
FIGURE 10.3 (TOP)
FIGURE 10.3 (BOTTOM)
FIGURE 10.5
FIGURE 10.7
FIGURE 10.9
```
##### Document Text Contents
Page 1

Exploratory Data Analysis with MATLAB®

Computer Science and Data Analysis Series

Page 2

Chapman & Hall/CRC

Series in Computer Science and Data Analysis

The interface between the computer and statistical sciences is increasing,
as each discipline seeks to harness the power and resources of the other.
This series aims to foster the integration between the computer sciences
and statistical, numerical and probabilistic methods by publishing a broad
range of reference works, textbooks and handbooks.

SERIES EDITORS
John Lafferty, Carnegie Mellon University
Fionn Murtagh, Queen’s University Belfast
Padhraic Smyth, University of California Irvine

Proposals for the series should be sent directly to one of the series editors
above, or submitted to:

Chapman & Hall/CRC Press UK
London SW15 2NU
UK

Published Titles

Bayesian Artificial Intelligence
Kevin B. Korb and Ann E. Nicholson

Exploratory Data Analysis with MATLAB®

Wendy L. Martinez and Angel R. Martinez

Forthcoming Titles

Correspondence Analysis and Data Coding with JAVA and R
Fionn Murtagh

R Graphics
Paul Murrell

Nonlinear Dimensionality Reduction
Vin de Silva and Carrie Grimes

Page 181

ylabel('Log [ Defects / Page ]')

Next we set up the parameters (α = 0.5, λ = 2) for a loess smooth and show
the smoothed scatterplot in Figure 7.8.

% Set up the parameters.
alpha = 0.5;
lambda = 2;
% Do the loess on this.
x0 = linspace(min(X),max(X));
y0 = loess(X,Y,x0,alpha,lambda);
% Plot the curve and scatterplot.
plot(X,Y,'.',x0,y0)
xlabel('Log[PrepTime (mins)/Page]')
ylabel('Log[Defects/Page]')
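
To make the roles of α and λ concrete, here is a minimal sketch of a loess smooth, assuming the usual formulation: α is the fraction of observations in each local neighborhood, λ is the degree of the local polynomial, and tricube weights are used. The helper `loess_sketch` and the synthetic data are hypothetical stand-ins, not the `loess` function or the data used in the text. The sketch assumes distinct `x` values and MATLAB R2016b or later (local functions in scripts, implicit expansion).

```matlab
% Synthetic stand-in data (assumption; not the software inspection data).
rng(1);
X = linspace(-2, 4, 60)';
Y = -4 + 0.8*X - 0.1*X.^2 + 0.3*randn(size(X));

alpha = 0.5;    % fraction of the points used in each local fit
lambda = 2;     % degree of the local polynomial
x0 = linspace(min(X), max(X), 50);
y0 = loess_sketch(X, Y, x0, alpha, lambda);

plot(X, Y, '.', x0, y0)

function y0 = loess_sketch(x, y, x0, alpha, lambda)
% Hypothetical helper: local polynomial fit with tricube weights.
x = x(:);
y = y(:);
n = numel(x);
% Neighborhood size, kept large enough for a degree-lambda fit.
k = min(n, max(lambda + 2, floor(alpha*n)));
y0 = zeros(size(x0));
for i = 1:numel(x0)
    d = abs(x - x0(i));
    ds = sort(d);
    h = ds(k);                        % distance to the k-th nearest point
    w = (1 - min(d/h, 1).^3).^3;      % tricube weights, zero outside h
    % Design matrix in the centered variable (x - x0(i)).
    A = zeros(n, lambda + 1);
    for p = 0:lambda
        A(:, p+1) = (x - x0(i)).^p;
    end
    sw = sqrt(w);
    beta = (A .* sw) \ (y .* sw);     % weighted least squares
    y0(i) = beta(1);                  % intercept is the fit at x0(i)
end
end
```

Increasing `alpha` widens each neighborhood and gives a smoother curve; `lambda = 1` fits local lines instead of local quadratics.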

We can assess our results by looking at the residual plots. First we find the
residuals and plot them in Figure 7.9 (top), where we see that they are
roughly symmetric about zero. Then we plot the absolute value of the
residuals against the fitted values (Figure 7.9 (bottom)). A loess smooth of
these observations shows that the variance does not seem to depend on
the fitted values.

% Get the residuals.
% First find the loess values at the observed X values.

FIGURE 7.7

This is the scatterplot of observations showing the number of defects found per page versus
the time spent inspecting each page. We see that the relationship is approximately linear.

[Scatterplot. Horizontal axis: Log [ PrepTime (mins) / Page ], from −2 to 4. Vertical axis: Log [ Defects / Page ], from −7 to 1.]

Page 182

yhat = loess(X,Y,X,alpha,lambda);
resid = Y - yhat;
% Now plot the residuals.
plot(1:length(resid),resid,'.')
ax = axis;
axis([ax(1), ax(2), -4 4])
xlabel('Index')
ylabel('Residuals')
% Plot the absolute value of the residuals
% against the fitted values.
r0 = linspace(min(yhat),max(yhat),30);
rhat = loess(yhat,abs(resid),r0,0.5,1);
plot(yhat,abs(resid),'.',r0,rhat)
xlabel('Fitted Values')
ylabel('| Residuals |')

The following code constructs a residual dependence plot for this loess
smooth. We include a loess smooth for this scatterplot to better understand
the results. This is shown in Figure 7.10; we do not see any indication of bias.

% Now plot the residuals on the vertical
% and the independent values on the
% horizontal. This is the residual
% dependence plot. Include a loess curve.
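
The listing breaks off at this point in the excerpt. As a rough sketch of what a residual dependence plot involves, and not the book's own code, the following plots the residuals against the independent variable and overlays a loess curve. The synthetic data and the helper `loess_sketch` are assumptions made for illustration; with the EDA Toolbox available, one would presumably call `loess` with the argument order shown earlier. The sketch assumes distinct `X` values and MATLAB R2016b or later.

```matlab
% Synthetic stand-in data (assumption; not the software inspection data).
rng(2);
X = linspace(-2, 4, 80)';
Y = -4 + 0.8*X - 0.1*X.^2 + 0.3*randn(size(X));
alpha = 0.5;
lambda = 2;

% Residuals from the loess fit evaluated at the observed X values.
yhat = loess_sketch(X, Y, X, alpha, lambda);
resid = Y - yhat;

% Smooth the residuals against X with a local linear fit.
% A curve that stays close to zero suggests little bias in the fit.
x0 = linspace(min(X), max(X), 30);
rsm = loess_sketch(X, resid, x0, alpha, 1);

plot(X, resid, '.', x0, rsm)
xlabel('X')
ylabel('Residuals')

function y0 = loess_sketch(x, y, x0, alpha, lambda)
% Hypothetical helper: local polynomial fit with tricube weights.
x = x(:);
y = y(:);
n = numel(x);
k = min(n, max(lambda + 2, floor(alpha*n)));   % neighborhood size
y0 = zeros(size(x0));
for i = 1:numel(x0)
    d = abs(x - x0(i));
    ds = sort(d);
    h = ds(k);                        % distance to the k-th nearest point
    w = (1 - min(d/h, 1).^3).^3;      % tricube weights, zero outside h
    A = zeros(n, lambda + 1);
    for p = 0:lambda
        A(:, p+1) = (x - x0(i)).^p;   % centered polynomial terms
    end
    sw = sqrt(w);
    beta = (A .* sw) \ (y .* sw);     % weighted least squares
    y0(i) = beta(1);                  % intercept is the fit at x0(i)
end
end
```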

FIGURE 7.8

After we add the loess curve (α = 0.5, λ = 2), we see that the relationship is not completely
linear.

[Scatterplot with loess curve. Horizontal axis: Log [ PrepTime (mins) / Page ], from −2 to 4. Vertical axis: Log [ Defects / Page ], from −7 to 1.]

Page 362

FIGURE 10.5

This shows a scatterplot of the oronsay data based on hexagonal binning. The color of the
symbols represents the value of the probability density at that bin.

FIGURE 10.7

The red points in this scatterplot were highlighted using the scattergui function.


Page 363

FIGURE 10.9

This is the scatterplot matrix with brushing and linking. This mode is transient, where only points inside the brush are highlighted. Corresponding
points are highlighted in all scatterplots.

[Scatterplot matrix. Panel labels and axis ranges as printed: .18.25mm (8.9 to 40.4), .125.18mm (0.3 to 40.1), .09.125mm (0.2 to 9.9).]