Title: Exploratory Data Analysis with MATLAB
Language: English
File Size: 7.8 MB
Total Pages: 363
Table of Contents
	Exploratory Data Analysis with MATLAB
		Preface
		Table of Contents
	Chapter 1 Introduction to Exploratory Data Analysis
		1.1 What is Exploratory Data Analysis
		1.2 Overview of the Text
		1.3 A Few Words About Notation
		1.4 Data Sets Used in the Book
			1.4.1 Unstructured Text Documents
			1.4.2 Gene Expression Data
			1.4.3 Oronsay Data Set
			1.4.4 Software Inspection
		1.5 Transforming Data
			1.5.1 Power Transformations
			1.5.2 Standardization
			1.5.3 Sphering the Data
		1.6 Further Reading
		Exercises
	Chapter 2 Dimensionality Reduction - Linear Methods
		2.1 Introduction
		2.2 Principal Component Analysis - PCA
			2.2.1 PCA Using the Sample Covariance Matrix
			2.2.2 PCA Using the Sample Correlation Matrix
			2.2.3 How Many Dimensions Should We Keep?
		2.3 Singular Value Decomposition - SVD
		2.4 Factor Analysis
		2.5 Intrinsic Dimensionality
		2.6 Summary and Further Reading
		Exercises
	Chapter 4 Data Tours
		4.1 Grand Tour
			4.1.1 Torus Winding Method
			4.1.2 Pseudo Grand Tour
		4.2 Interpolation Tours
		4.3 Projection Pursuit
		4.4 Projection Pursuit Indexes
			4.4.1 Posse Chi-Square Index
			4.4.2 Moment Index
		4.5 Summary and Further Reading
		Exercises
	Chapter 5 Finding Clusters
		5.1 Introduction
		5.2 Hierarchical Methods
		5.3 Optimization Methods - k-Means
		5.4 Evaluating the Clusters
			5.4.1 Rand Index
			5.4.2 Cophenetic Correlation
			5.4.3 Upper Tail Rule
			5.4.4 Silhouette Plot
			5.4.5 Gap Statistic
		5.5 Summary and Further Reading
		Exercises
	Chapter 6 Model-Based Clustering
		6.1 Overview of Model-Based Clustering
		6.2 Finite Mixtures
			6.2.1 Multivariate Finite Mixtures
			6.2.2 Component Models - Constraining the Covariances
		6.3 Expectation-Maximization Algorithm
		6.4 Hierarchical Agglomerative Model-Based Clustering
		6.5 Model-Based Clustering
		6.6 Generating Random Variables from a Mixture Model
		6.7 Summary and Further Reading
		Exercises
	Chapter 7 Smoothing Scatterplots
		7.1 Introduction
		7.2 Loess
		7.3 Robust Loess
		7.4 Residuals and Diagnostics
			7.4.1 Residual Plots
			7.4.2 Spread Smooth
			7.4.3 Loess Envelopes - Upper and Lower Smooths
		7.5 Bivariate Distribution Smooths
			7.5.1 Pairs of Middle Smoothings
			7.5.2 Polar Smoothing
		7.6 Curve Fitting Toolbox
		7.7 Summary and Further Reading
		Exercises
	Chapter 8 Visualizing Clusters
		8.1 Dendrogram
		8.2 Treemaps
		8.3 Rectangle Plots
		8.4 ReClus Plots
		8.5 Data Image
		8.6 Summary and Further Reading
		Exercises
	Chapter 9 Distribution Shapes
		9.1 Histograms
			9.1.1 Univariate Histograms
			9.1.2 Bivariate Histograms
		9.2 Boxplots
			9.2.1 The Basic Boxplot
			9.2.2 Variations of the Basic Boxplot
		9.3 Quantile Plots
			9.3.1 Probability Plots
			9.3.2 Quantile-quantile Plot
			9.3.3 Quantile Plot
		9.4 Bagplots
		9.5 Summary and Further Reading
		Exercises
	Chapter 10 Multivariate Visualization
		10.1 Glyph Plots
		10.2 Scatterplots
			10.2.1 2-D and 3-D Scatterplots
			10.2.2 Scatterplot Matrices
			10.2.3 Scatterplots with Hexagonal Binning
		10.3 Dynamic Graphics
			10.3.1 Identification of Data
			10.3.2 Linking
			10.3.3 Brushing
		10.4 Coplots
		10.5 Dot Charts
			10.5.1 Basic Dot Chart
			10.5.2 Multiway Dot Chart
		10.6 Plotting Points as Curves
			10.6.1 Parallel Coordinate Plots
			10.6.2 Andrews’ Curves
			10.6.3 More Plot Matrices
		10.7 Data Tours Revisited
			10.7.1 Grand Tour
			10.7.2 Permutation Tour
		10.8 Summary and Further Reading
		Exercises
	Appendix A Proximity Measures
		A.1 Definitions
			A.1.1 Dissimilarities
			A.1.2 Similarity Measures
			A.1.3 Similarity Measures for Binary Data
			A.1.4 Dissimilarities for Probability Density Functions
		A.2 Transformations
		A.3 Further Reading
	Appendix B Software Resources for EDA
		B.1 MATLAB Programs
		B.2 Other Programs for EDA
		B.3 EDA Toolbox
	Appendix C Description of Data Sets
	Appendix D Introduction to MATLAB
		D.1 What Is MATLAB?
		D.2 Getting Help in MATLAB
		D.3 File and Workspace Management
		D.4 Punctuation in MATLAB
		D.5 Arithmetic Operators
		D.6 Data Constructs in MATLAB
			Basic Data Constructs
			Building Arrays
			Cell Arrays
			Structures
		D.7 Script Files and Functions
		D.8 Control Flow
			for Loop
			while Loop
			if-else Statements
			switch Statement
		D.9 Simple Plotting
		D.10 Where to get MATLAB Information
	Appendix E MATLAB Functions
		E.1 MATLAB
		E.2 Statistics Toolbox - Versions 4 and 5
		E.3 Exploratory Data Analysis Toolbox
	References
	Color Figures
	FIGURE 3.7
	FIGURE 3.8
	FIGURE 3.9 (TOP)
		FIGURE 3.9 (BOTTOM)
	FIGURE 3.10
	FIGURE 5.4
	FIGURE 8.7 (TOP)
		FIGURE 8.7 (BOTTOM)
	FIGURE 8.8
	FIGURE 8.9
	FIGURE 10.3 (TOP)
		FIGURE 10.3 (BOTTOM)
	FIGURE 10.5
	FIGURE 10.7
	FIGURE 10.9
                        
Document Text Contents
Page 1

Exploratory Data Analysis with MATLAB®

Computer Science and Data Analysis Series

Page 2

Chapman & Hall/CRC

Series in Computer Science and Data Analysis

The interface between the computer and statistical sciences is increasing,
as each discipline seeks to harness the power and resources of the other.
This series aims to foster the integration between the computer sciences
and statistical, numerical and probabilistic methods by publishing a broad
range of reference works, textbooks and handbooks.

SERIES EDITORS
John Lafferty, Carnegie Mellon University
David Madigan, Rutgers University
Fionn Murtagh, Queen’s University Belfast
Padhraic Smyth, University of California Irvine

Proposals for the series should be sent directly to one of the series editors
above, or submitted to:

Chapman & Hall/CRC Press UK
23-25 Blades Court
London SW15 2NU
UK

Published Titles

Bayesian Artificial Intelligence
Kevin B. Korb and Ann E. Nicholson

Exploratory Data Analysis with MATLAB®
Wendy L. Martinez and Angel R. Martinez

Forthcoming Titles

Correspondence Analysis and Data Coding with JAVA and R
Fionn Murtagh

R Graphics
Paul Murrell

Nonlinear Dimensionality Reduction
Vin de Silva and Carrie Grimes

Page 181

ylabel('Log [ Defects / Page ]')

Next we set up the parameters (α = 0.5, λ = 2) for a loess smooth and show
the smoothed scatterplot in Figure 7.8.

% Set up the parameters.
alpha = 0.5;
lambda = 2;
% Do the loess on this.
x0 = linspace(min(X),max(X));
y0 = loess(X,Y,x0,alpha,lambda);
% Plot the curve and scatterplot.
plot(X,Y,'.',x0,y0)
xlabel('Log[PrepTime (mins)/Page]')
ylabel('Log[Defects/Page]')
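
The roles of α and λ in the call above can be illustrated with a short stand-alone sketch. This is only an assumption about what a loess smoother of this kind computes (a tricube-weighted local polynomial fit), not the code of the loess function used in the text. The synthetic data, the deterministic wiggle standing in for noise, and the names k, h, w and beta are all hypothetical.

```matlab
% Minimal loess sketch (illustrative stand-in, NOT the toolbox loess).
% Synthetic data: a quadratic trend plus a deterministic wiggle.
n = 100;
X = linspace(-2,4,n)';
Y = -4 + 0.8*X - 0.1*X.^2 + 0.3*sin(15*X);

alpha = 0.5;                 % fraction of the data used in each local fit
lambda = 2;                  % degree of the local polynomial
k = floor(alpha*n);          % neighborhood size implied by alpha
x0 = linspace(min(X),max(X),50)';
y0 = zeros(size(x0));

for i = 1:length(x0)
    d = abs(X - x0(i));      % distances from the target point
    ds = sort(d);
    h = ds(k);               % distance to the k-th nearest neighbor
    u = d/h;
    w = (1 - u.^3).^3;       % tricube weights inside the neighborhood
    w(u >= 1) = 0;           % zero weight outside it
    % Design matrix centered at x0(i), so the intercept of the
    % weighted least squares fit is the smoothed value at x0(i).
    A = ones(n,lambda+1);
    for j = 1:lambda
        A(:,j+1) = (X - x0(i)).^j;
    end
    sw = sqrt(w);
    beta = (A.*repmat(sw,1,lambda+1)) \ (sw.*Y);
    y0(i) = beta(1);
end
```

Under these assumptions, plot(X,Y,'.',x0,y0) overlays the smooth on the synthetic scatterplot. A larger alpha widens each neighborhood and gives a smoother curve, while lambda = 1 would fit local lines instead of local quadratics.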

We can assess our results by looking at the residual plots. First we find the
residuals and plot them in Figure 7.9 (top), where we see that they are
roughly symmetric about zero. Then we plot the absolute value of the
residuals against the fitted values (Figure 7.9 (bottom)). A loess smooth of
these observations shows that the variance does not seem to depend on
the fitted values.

FIGURE 7.7

This is the scatterplot of observations showing the number of defects found per page versus
the time spent inspecting each page. We see that the relationship is approximately linear.
[Axes: Log [ PrepTime (mins) / Page ] (horizontal), Log [ Defects / Page ] (vertical)]

% Get the residuals.
% First find the loess values at the observed X values.
yhat = loess(X,Y,X,alpha,lambda);
resid = Y - yhat;
% Now plot the residuals.
plot(1:length(resid),resid,'.')
ax = axis;
axis([ax(1), ax(2), -4 4])
xlabel('Index')
ylabel('Residuals')
% Plot the absolute value of the residuals
% against the fitted values.
r0 = linspace(min(yhat),max(yhat),30);
rhat = loess(yhat,abs(resid),r0,0.5,1);
plot(yhat,abs(resid),'.',r0,rhat)
xlabel('Fitted Values')
ylabel('| Residuals |')
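
The two residual plots are judged by eye, but a rough numeric companion can be sketched as well. Everything in this sketch is an assumption made for illustration: the data are synthetic, a quadratic polyfit stands in for the loess smooth, and the names medResid, spreadLow, spreadHigh and spreadRatio are hypothetical.

```matlab
% Numeric companion to the two residual plots (illustrative only).
% Synthetic data and a quadratic polyfit are stand-ins; they are not
% the software inspection data or the loess smooth used in the text.
X = linspace(-2,4,100)';
Y = -4 + 0.8*X - 0.1*X.^2 + 0.3*sin(15*X);
p = polyfit(X,Y,2);          % stand-in smoother
yhat = polyval(p,X);
resid = Y - yhat;

% Check 1: residuals roughly symmetric about zero,
% so their median should be close to zero.
medResid = median(resid);

% Check 2: spread does not depend on the fitted values. Compare the
% mean absolute residual for the lower and upper halves of yhat.
[~,idx] = sort(yhat);
half = floor(length(idx)/2);
spreadLow  = mean(abs(resid(idx(1:half))));
spreadHigh = mean(abs(resid(idx(half+1:end))));
spreadRatio = spreadHigh/spreadLow;
fprintf('median = %.3f, spread ratio = %.2f\n', medResid, spreadRatio);
```

A median near zero and a spread ratio near one are consistent with what the plots are meant to show. They do not replace the plots, which can reveal patterns that two summary numbers would miss.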

The following code constructs a residual dependence plot for this loess
smooth. We include a loess smooth for this scatterplot to better understand
the results. This is shown in Figure 7.10; we do not see any indication of bias.

% Now plot the residuals on the vertical
% and the independent values on the
% horizontal. This is the residual
% dependence plot. Include a loess curve.

FIGURE 7.8

After we add the loess curve (α = 0.5, λ = 2), we see that the relationship is not completely
linear.
[Axes: Log [ PrepTime (mins) / Page ] (horizontal), Log [ Defects / Page ] (vertical)]

Page 362

FIGURE 10.5



This shows a scatterplot of the oronsay data based on hexagonal binning. The color of the
symbols represents the value of the probability density at that bin.



FIGURE 10.7



The red points in this scatterplot were highlighted using the scattergui function.


Page 363

FIGURE 10.9



This is the scatterplot matrix with brushing and linking. This mode is transient, where only points inside the brush are highlighted. Corresponding
points are highlighted in all scatterplots.

[Panel labels: .18-.25mm, .125-.18mm, .09-.125mm]
