Table of Contents
1 Introduction
2 A Cognitive Interpretation of Data Analysis
	2.1 Introduction
	2.2 A theory of data analysis
	2.3 The role of cognition in data analysis
	2.4 Making sense of measured data
	2.5 A conceptual model of data analysis
	2.6 Implications for data analysis practice
	2.7 Conclusion
3 Visualizing complex data with embedded plots
	3.1 Introduction
	3.2 Case Study: Analyzing complex data
	3.3 Benefits of embedded plots
	3.4 Implementing embedded plots with the grammar of graphics
	3.5 Conclusion
4 Dates and times made easy with lubridate
	4.1 Introduction
	4.2 Motivation
	4.3 Parsing date-times
	4.4 Manipulating date-times
	4.5 Arithmetic with date-times
	4.6 Rounding dates
	4.7 Time zones
	4.8 Daylight savings time
	4.9 Case study 1
	4.10 Case study 2
	4.11 Conclusion
5 How and why to teach statistical inference with simulations in R
	5.1 Introduction
	5.2 Background
	5.3 Why teach with visual simulations
	5.4 Why program visual simulations in R
	5.5 How to implement visual simulations in R
	5.6 How to use simulations in the classroom
	5.7 Conclusion
6 Conclusion
	6.1 Original contributions
	6.2 Future Work
	6.3 Final thoughts
Tools and theory to improve data analysis


Garrett Grolemund

This thesis proposes a scientific model to explain the data analysis process. I argue that data analysis is primarily a procedure to build understanding and as such, it dovetails with the cognitive processes of the human mind. Data analysis tasks closely resemble the cognitive process known as sensemaking. I demonstrate how data analysis is a sensemaking task adapted to use quantitative data. This identification highlights a universal structure within data analysis activities and provides a foundation for a theory of data analysis. The model identifies two competing challenges within data analysis: the need to make sense of information that we cannot know and the need to make sense of information that we cannot attend to. Classical statistics provides solutions to the first challenge, but has little to say about the second. However, managing attention is the primary obstacle when analyzing big data. I introduce three tools for managing attention during data analysis. Each tool is built upon a different method for managing attention. ggsubplot creates embedded plots, which transform data into a format that can be easily processed by the human mind. lubridate helps users automate sensemaking outside of the mind by improving the way computers handle date-time data. Visual Inference Tools develop expertise in young statisticians that

and Risk Advisory Group, Commerzbank Securities, 2009), tis (Hallman, 2010), timeSeries (Wuertz and Chalabi, 2010), fts (Armstrong, 2009), and tseries (Trapletti and Hornik, 2009) objects.

Note that lubridate overrides the + and - methods for POSIXt, Date, and difftime objects in base R. This allows users to perform simple arithmetic on date-time objects with the new timespan classes introduced by lubridate, but it does not alter the way R implements addition and subtraction for non-lubridate objects.
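
The behavior described above can be illustrated with a minimal sketch. It assumes a current lubridate release rather than the version 0.2 described in this paper, and the variable names are illustrative only.

```r
library(lubridate)

# Parse a date, then shift it with lubridate's timespan helpers.
date <- ymd("2010-01-01")
next_day  <- date + days(1)    # add a one-day timespan to a Date
last_week <- date - weeks(1)   # subtract a one-week timespan

# Arithmetic between base R objects is unchanged:
# adding a plain number to a Date still adds that many days.
base_sum <- as.Date("2010-01-01") + 1
```

Here the + and - operators accept lubridate timespans on the right-hand side, while the last line uses only base R objects and behaves as it would without lubridate loaded.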

lubridate introduces four new object classes based on the Java language Joda Time project (Colebourne and O’Neill, 2010). Joda Time introduces a conceptual model of the different ways to measure timespans. Section 4.5 describes this model and explains how lubridate uses it to perform easy and accurate arithmetic with dates in R.
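
To give a flavor of why more than one way of measuring a timespan is useful, the sketch below contrasts two of the timespan types found in current lubridate releases: periods, which track clock and calendar units, and durations, which track exact elapsed time. The function names days(), years(), and ddays() come from current lubridate and are an assumption with respect to version 0.2.

```r
library(lubridate)

start <- ymd("2012-01-01")           # 2012 is a leap year (366 days)

# A period advances the calendar by one year.
by_period <- start + years(1)

# A duration advances the clock by exactly 365 days' worth of seconds.
by_duration <- start + ddays(365)

# The two answers differ by one day because of the leap day.
gap_in_days <- as.numeric(as.Date(by_period) - as.Date(by_duration))
```

The period lands on 1 January 2013, while the fixed-length duration stops one day short, on 31 December 2012.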

This paper demonstrates the convenient tools provided in the lubridate package and ends with a case study, which uses lubridate in a real-life example. This paper describes lubridate 0.2, which can be downloaded from the Comprehensive R Archive Network. Development versions are also available.

4.2 Motivation

To see how lubridate simplifies things, consider a common scenario. Given a character string, we would like to read it in as a date-time, extract the month, and change it to February (i.e., 2). Table 4.1 shows two ways we could do this. On the left are the base R methods we would use for these three tasks. On the right are the lubridate methods.

Base R method                             lubridate method

date <- as.POSIXct("01-01-2010",          date <- dmy("01-01-2010")
  format = "%d-%m-%Y", tz = "UTC")

as.numeric(format(date, "%m")) # or       month(date)
as.POSIXlt(date)$month + 1

date <- as.POSIXct(format(date,           month(date) <- 2
  "%Y-2-%d"), tz = "UTC")

Table 4.1: lubridate provides a simple way to parse a date into R, extract the month value and change it to February.

Now we will go a step further. In Table 4.2, we move our date back in time by one day and display our new date in the Greenwich Meridian time zone (GMT). Again, base R methods are shown on the left, lubridate methods on the right.

Base R method                             lubridate method

date <- seq(date, length = 2,             date <- date - days(1)
  by = "-1 day")[2]

as.POSIXct(format(as.POSIXct(date),       with_tz(date, "GMT")
  tz = "UTC"), tz = "GMT")

Table 4.2: lubridate easily displays a date one day earlier and in the GMT time zone.
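
The lubridate steps from Tables 4.1 and 4.2 can be strung together into one short script. This is a sketch that assumes a current lubridate release; the tz = "UTC" argument to dmy() is added here so that the result is a date-time in a known time zone, and the names original_month, shown, and base_result are illustrative.

```r
library(lubridate)

# Table 4.1: parse the string, extract the month, set it to February.
date <- dmy("01-01-2010", tz = "UTC")
original_month <- month(date)
month(date) <- 2

# Table 4.2: move back one day, then display the result in GMT.
date  <- date - days(1)
shown <- with_tz(date, "GMT")

# Base R equivalent of the one-day step, for comparison.
feb_first   <- as.POSIXct("2010-02-01", tz = "UTC")
base_result <- seq(feb_first, length.out = 2, by = "-1 day")[2]
```

Both routes arrive at the same instant, 31 January 2010 at midnight; with_tz() changes only the time zone in which that instant is displayed.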

lubridate makes basic date-time manipulations much more straightforward. Plus, the same lubridate methods work for most of the popular date-time object classes (Date, POSIXt, chron, etc.), which isn’t always true for base R methods.
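
As a small illustration of this point, the sketch below applies the same accessor, month(), to three base R classes. It assumes a current lubridate release and leaves out chron to avoid depending on an additional package; the variable names are illustrative.

```r
library(lubridate)

# The same calendar day stored in three different classes.
d_date <- as.Date("2010-03-15")
d_ct   <- as.POSIXct("2010-03-15 12:00:00", tz = "UTC")
d_lt   <- as.POSIXlt(d_ct)

# One accessor works on all of them.
months_found <- c(month(d_date), month(d_ct), month(d_lt))
```

Each call returns the month number for March, with no class-specific formatting code.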

Table 4.3 provides a more complete comparison between lubridate methods and base R methods. It shows how lubridate can simplify each of the common date-time tasks presented in the article “Date and Time Classes in R” (Grothendieck and Petzoldt, 2004). It also provides a useful summary of lubridate methods.

Weick, K., Sutcliffe, K., and Obstfeld, D. (2005), “Organizing and the Process of Sensemaking,” Organization Science, 16, 409–421. 2.3.1

Weir, C., McManus, I., and Kiely, B. (1990), “Evaluation of the teaching of statistical concepts by interactive experience with Monte Carlo simulations,” British Journal of Educational Psychology, 61, 240–247. 5.1

Wender, K. and Muehlboeck, J. (2003), “Animated diagrams in teaching statistics,” Behavior Research Methods, 35, 255–258. 5.1

Wertheimer, M. (1938), “Laws of organization in perceptual forms,” A source book of Gestalt psychology, 71–88. 2.3.1

Wickham, H. (2009), ggplot2: Elegant graphics for data analysis, Springer New York. 3.1, 3.4, 4.10

Wickham, H. (2010), “A layered grammar of graphics,” Journal of Computational and Graphical Statistics, 19, 3–28. 3.1, 3.4, 3.4

Wickham, H. (2011), “The split-apply-combine strategy for data analysis,” Journal of Statistical Software, 40, 1–29. 3.4.2

Wickham, H., Hofmann, H., Wickham, C., and Cook, D. (Submitted), “Glyph-maps for Visually Exploring Temporal Patterns in Climate Data and Models,” Environmetrics. 3.1

Wild, C. (1994), “Embracing the “Wider View” of Statistics,” The American Statistician, 48, 163–171. 2.1, 2.2

Wild, C. and Pfannkuch, M. (1999), “Statistical thinking in empirical enquiry,” International Statistical Review/Revue Internationale de Statistique, 67, 223–248. 2.1, 2.2, 2.4.2, 2.5, 2.5.2, 2.5.3

Wild, C., Pfannkuch, M., Regan, M., and Horton, N. (2011), “Towards more accessible conceptions of statistical inference,” Journal of the Royal Statistical Society: Series A (Statistics in Society), 174, 247–295. 5.2

Wild, C., Pfannkuch, M., Regan, M., and Horton, N. (2010), “Inferential reasoning: Learning to “make a call” in theory,” in 8th International Conference on the Teaching of Statistics. 5.2

Wilkinson, L. and Wills, G. (2005), The grammar of graphics, Springer Verlag. 3.1,

Woodworth, R. (1971), Experimental psychology, Holt, Rinehart and Winston. 2.6.1

Wu, A., Zhang, X., and Cai, G. (2010), “An interactive sensemaking framework for mobile visual analytics,” in Proceedings of the 3rd International Symposium on Visual Information Communication, ACM, p. 22. 2.3.1

Wuertz, D. and Chalabi, Y. (2010), timeSeries: Rmetrics - Financial Time Series Objects, R package version 2110.87. 4.1

Yi, J., Kang, Y., Stasko, J., and Jacko, J. (2008), “Understanding and characterizing insights: how do people gain insights using information visualization?” in BELIV ’08: Proceedings of the 2008 conference on BEyond time and errors: novel evaLuation methods for Information Visualization, New York, NY, USA: ACM, pp. 1–6.

Zeileis, A. and Grothendieck, G. (2005), “zoo: S3 Infrastructure for Regular and Irregular Time Series,” Journal of Statistical Software, 14, 1–27. 4.1

Zhang, J. (1997), “The nature of external representations in problem solving,” Cognitive science, 21, 179–217. 2.3.2

Zhang, J. (2000), “External representations in complex information processing tasks,” Encyclopedia of library and information science, 68, 164–180. 2.3.2

Zhang, P. (2010), “Sensemaking: Conceptual changes, cognitive mechanisms, and structural representations. A qualitative user study,” Unpublished doctoral dissertation, University of Maryland.

Ziemer, H. and Lane, D. (2000), “Evaluating the Efficacy of the Rice University Virtual Statistics Lab,” Poster presented at the 22nd Annual Meeting of the National Institute on the Teaching of Psychology, St. Petersburg Beach, FL. 5.1

Zwaan, R. and Yaxley, R. (2003), “Spatial iconicity affects semantic relatedness judgments,” Psychonomic Bulletin & Review, 10, 954–958. 3.3.2, 5.3.1
