Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large Scale Data Analysis

Title: Big Data Analytics with Spark: A Practitioner's Guide to Using Spark for Large Scale Data Analysis
Language: English
File Size: 5.3 MB
Total Pages: 290
Table of Contents
Contents at a Glance
Contents
About the Author
About the Technical Reviewers
Acknowledgments
Introduction
Chapter 1: Big Data Technology Landscape
	Hadoop
		HDFS (Hadoop Distributed File System)
		MapReduce
		Hive
	Data Serialization
		Avro
		Thrift
		Protocol Buffers
		SequenceFile
	Columnar Storage
		RCFile
		ORC
		Parquet
	Messaging Systems
		Kafka
		ZeroMQ
	NoSQL
		Cassandra
		HBase
	Distributed SQL Query Engine
		Impala
		Presto
		Apache Drill
	Summary
Chapter 2: Programming in Scala
	Functional Programming (FP)
		Functions
			First-Class
			Composable
			No Side Effects
			Simple
		Immutable Data Structures
		Everything Is an Expression
	Scala Fundamentals
		Getting Started
		Basic Types
		Variables
		Functions
			Methods
			Local Functions
			Higher-Order Methods
			Function Literals
			Closures
		Classes
		Singletons
		Case Classes
		Pattern Matching
		Operators
		Traits
		Tuples
		Option Type
		Collections
			Sequences
				Array
				List
				Vector
			Sets
			Map
			Higher-Order Methods on Collection Classes
				map
				flatMap
				filter
				foreach
				reduce
	A Standalone Scala Application
	Summary
Chapter 3: Spark Core
	Overview
		Key Features
			Easy to Use
			Fast
			General Purpose
			Scalable
			Fault Tolerant
		Ideal Applications
			Iterative Algorithms
			Interactive Analysis
	High-level Architecture
		Workers
		Cluster Managers
		Driver Programs
		Executors
		Tasks
	Application Execution
		Terminology
		How an Application Works
	Data Sources
	Application Programming Interface (API)
		SparkContext
		Resilient Distributed Datasets (RDD)
			Immutable
			Partitioned
			Fault Tolerant
			Interface
			Strongly Typed
			In Memory
		Creating an RDD
			parallelize
			textFile
			wholeTextFiles
			sequenceFile
		RDD Operations
			Transformations
				map
				filter
				flatMap
				mapPartitions
				union
				intersection
				subtract
				distinct
				cartesian
				zip
				zipWithIndex
				groupBy
				keyBy
				sortBy
				pipe
				randomSplit
				coalesce
				repartition
				sample
			Transformations on RDD of key-value Pairs
				keys
				values
				mapValues
				join
				leftOuterJoin
				rightOuterJoin
				fullOuterJoin
				sampleByKey
				subtractByKey
				groupByKey
				reduceByKey
			Actions
				collect
				count
				countByValue
				first
				max
				min
				take
				takeOrdered
				top
				fold
				reduce
			Actions on RDD of key-value Pairs
				countByKey
				lookup
			Actions on RDD of Numeric Types
				mean
				stdev
				sum
				variance
		Saving an RDD
			saveAsTextFile
			saveAsObjectFile
			saveAsSequenceFile
	Lazy Operations
		Action Triggers Computation
	Caching
		RDD Caching Methods
			cache
			persist
		RDD Caching Is Fault Tolerant
		Cache Memory Management
	Spark Jobs
	Shared Variables
		Broadcast Variables
		Accumulators
	Summary
Chapter 4: Interactive Data Analysis with Spark Shell
	Getting Started
		Download
		Extract
		Run
	REPL Commands
	Using the Spark Shell as a Scala Shell
	Number Analysis
	Log Analysis
	Summary
Chapter 5: Writing a Spark Application
	Hello World in Spark
	Compiling and Running the Application
		sbt (Simple Build Tool)
			Build Definition File
			Directory Structure
		Compiling the Code
		Running the Application
	Monitoring the Application
	Debugging the Application
	Summary
Chapter 6: Spark Streaming
	Introducing Spark Streaming
		Spark Streaming Is a Spark Add-on
		High-Level Architecture
		Data Stream Sources
		Receiver
		Destinations
	Application Programming Interface (API)
		StreamingContext
			Creating an Instance of StreamingContext
			Starting Stream Computation
			Checkpointing
			Stopping Stream Computation
			Waiting for Stream Computation to Finish
		Basic Structure of a Spark Streaming Application
		Discretized Stream (DStream)
		Creating a DStream
			Basic Sources
				socketTextStream
				textFileStream
				actorStream
			Advanced Sources
		Processing a Data Stream
			Basic Transformation
				map
				flatMap
				filter
				repartition
				union
			Aggregation Transformation
				count
				reduce
				countByValue
			Transformations Available Only on DStream of key-value Pairs
				cogroup
				join
				groupByKey
				reduceByKey
			Special Transformations
				transform
				updateStateByKey
		Output Operations
			Saving to a File System
				saveAsTextFiles
				saveAsObjectFiles
				saveAsHadoopFiles
				saveAsNewAPIHadoopFiles
			Displaying on Console
				print
			Saving into a Database
				foreachRDD
		Window Operation
			window
			countByWindow
			countByValueAndWindow
			reduceByWindow
			reduceByKeyAndWindow
	A Complete Spark Streaming Application
	Summary
Chapter 7: Spark SQL
	Introducing Spark SQL
		Integration with Other Spark Libraries
		Usability
		Data Sources
		Data Processing Interface
		Hive Interoperability
	Performance
		Reduced Disk I/O
		Partitioning
		Columnar Storage
		In-Memory Columnar Caching
		Skip Rows
		Predicate Pushdown
		Query Optimization
	Applications
		ETL (Extract Transform Load)
		Data Virtualization
		Distributed JDBC/ODBC SQL Query Engine
		Data Warehousing
	Application Programming Interface (API)
		Key Abstractions
			SQLContext
				Creating an Instance of SQLContext
				Executing SQL Queries Programmatically
			HiveContext
				Creating an Instance of HiveContext
				Executing HiveQL Queries Programmatically
			DataFrame
			Row
		Creating DataFrames
			Creating a DataFrame from an RDD
				toDF
				createDataFrame
			Creating a DataFrame from a Data Source
				JSON
				Parquet
				ORC
				Hive
				JDBC-Compliant Database
		Processing Data Programmatically with SQL/HiveQL
		Processing Data with the DataFrame API
			Basic Operations
				cache
				columns
				dtypes
				explain
				persist
				printSchema
				registerTempTable
				toDF
			Language-Integrated Query Methods
				agg
				apply
				cube
				distinct
				explode
				filter
				groupBy
				intersect
				join
				limit
				orderBy
				randomSplit
				rollup
				sample
				select
				selectExpr
				withColumn
			RDD Operations
				rdd
				toJSON
			Actions
				collect
				count
				describe
				first
				show
				take
			Output Operations
				write
		Saving a DataFrame
			JSON
			Parquet
			ORC
			Hive
			JDBC-Compliant Database
	Built-in Functions
		Aggregate
		Collection
		Date/Time
			Conversion
			Field Extraction
			Date Arithmetic
			Miscellaneous
		Math
		String
		Window
	UDFs and UDAFs
	Interactive Analysis Example
	Interactive Analysis with Spark SQL JDBC Server
	Summary
Chapter 8: Machine Learning with Spark
	Introducing Machine Learning
		Features
			Categorical Features
			Numerical Features
		Labels
		Models
		Training Data
			Labeled
			Unlabeled
		Test Data
		Machine Learning Applications
			Classification
			Regression
			Clustering
			Anomaly Detection
			Recommendation
			Dimensionality Reduction
		Machine Learning Algorithms
			Supervised Machine Learning Algorithms
				Regression Algorithms
				Classification Algorithms
			Unsupervised Machine Learning Algorithms
				k-means
				Principal Components Analysis (PCA)
				Singular value decomposition (SVD)
		Hyperparameter
		Model Evaluation
			Area Under Curve (AUC)
			F-measure
			Root Mean Squared Error (RMSE)
		Machine Learning High-level Steps
	Spark Machine Learning Libraries
	MLlib Overview
		Integration with Other Spark Libraries
		Statistical Utilities
		Machine Learning Algorithms
	The MLlib API
		Data Types
			Vector
				DenseVector
				SparseVector
			LabeledPoint
			Rating
		Algorithms and Models
			Regression Algorithms
				train
				trainRegressor
			Regression Models
				predict
				save
				load
				toPMML
			Classification Algorithms
				train
				trainClassifier
			Classification Models
				predict
				save
				load
				toPMML
			Clustering Algorithms
				train
				run
			Clustering Models
				predict
				computeCost
				save
				load
				toPMML
			Recommendation Algorithms
				train
				trainImplicit
			Recommendation Models
				predict
				recommendProducts
				recommendProductsForUsers
				recommendUsers
				recommendUsersForProducts
				save
				load
		Model Evaluation
			Regression Metrics
			Binary Classification Metrics
			Multiclass Classification Metrics
			Multilabel Classification Metrics
			Recommendation Metrics
	An Example MLlib Application
		Dataset
		Goal
		Code
	Spark ML
		ML Dataset
		Transformer
			Feature Transformer
			Model
		Estimator
		Pipeline
		PipelineModel
		Evaluator
		Grid Search
		CrossValidator
	An Example Spark ML Application
		Dataset
		Goal
		Code
	Summary
Chapter 9: Graph Processing with Spark
	Introducing Graphs
		Undirected Graphs
		Directed Graphs
		Directed Multigraphs
		Property Graphs
	Introducing GraphX
	GraphX API
		Data Abstractions
			VertexRDD
			Edge
			EdgeRDD
			EdgeTriplet
			EdgeContext
			Graph
		Creating a Graph
		Graph Properties
		Graph Operators
			Property Transformation Operators
				mapVertices
				mapEdges
				mapTriplets
			Structure Transformation Operators
				reverse
				subgraph
				mask
				groupEdges
			Join Operators
				joinVertices
				outerJoinVertices
			Aggregation Operators
				aggregateMessages
			Graph-Parallel Operators
				pregel
			Graph Algorithms
				pageRank
				staticPageRank
				connectedComponents
				stronglyConnectedComponents
				triangleCount
	Summary
Chapter 10: Cluster Managers
	Standalone Cluster Manager
		Architecture
		Setting Up a Standalone Cluster
			Manually Starting a Cluster
			Manually Stopping a Cluster
			Starting a Cluster with a Script
			Stopping a Cluster with a Script
		Running a Spark Application on a Standalone Cluster
			Client Mode
			Cluster Mode
	Apache Mesos
		Architecture
		Setting Up a Mesos Cluster
		Running a Spark Application on a Mesos Cluster
			Deploy Modes
			Run Modes
				Fine-grained Mode
				Coarse-grained Mode
	YARN
		Architecture
		Running a Spark Application on a YARN Cluster
	Summary
Chapter 11: Monitoring
	Monitoring a Standalone Cluster
		Monitoring a Spark Master
		Monitoring a Spark Worker
	Monitoring a Spark Application
		Monitoring Jobs Launched by an Application
		Monitoring Stages in a Job
		Monitoring Tasks in a Stage
		Monitoring RDD Storage
		Monitoring Environment
		Monitoring Executors
		Monitoring a Spark Streaming Application
		Monitoring Spark SQL Queries
		Monitoring Spark SQL JDBC/ODBC Server
	Summary
Bibliography
Index
                        
Document Text Contents
Page 1

www.apress.com

BOOKS FOR PROFESSIONALS BY PROFESSIONALS®
THE EXPERT’S VOICE® IN SPARK

Big Data Analytics with Spark
A Practitioner’s Guide to Using Spark for Large Scale Data Analysis

Mohammed Guller

The book also includes a chapter on Scala, the hottest functional programming language, and the language that underlies Spark. You’ll learn the basics of functional programming in Scala, so that you can write Spark applications in it.

What’s more, Big Data Analytics with Spark provides an introduction to other big data technologies that are commonly used along with Spark, such as HDFS, Avro, Parquet, Kafka, Cassandra, HBase, Mesos, and so on. It also provides an introduction to machine learning and graph concepts. So the book is self-sufficient; all the technologies that you need to know to use Spark are covered. The only thing that you are expected to have is some programming knowledge in any language.

From this book, you’ll learn how to:

• Write Spark applications in Scala for processing and analyzing large-scale data
• Interactively analyze large-scale data with Spark SQL using just SQL and HiveQL
• Process high-velocity stream data with Spark Streaming
• Develop machine learning applications with MLlib and Spark ML
• Analyze graph-oriented data and implement graph algorithms with GraphX
• Deploy Spark with the Standalone cluster manager, YARN, or Mesos
• Monitor Spark applications

ISBN 978-1-4842-0965-3

Shelve in: Databases/General
User level: Beginning–Advanced

Page 2

Big Data Analytics with Spark
A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing

Mohammed Guller

Page 145

Chapter 7 ■ Spark SQL


By default, the orderBy method sorts in ascending order. You can explicitly specify the sort order for each column using Column expressions, as shown next. Note that sort is an alias for orderBy, so the two methods are interchangeable.

val sortedByAgeNameDF = customerDF.sort($"age".desc, $"name".asc)
sortedByAgeNameDF.show

+---+--------+---+------+
|cId| name|age|gender|
+---+--------+---+------+
| 4|Jennifer| 45| F|
| 6| Sandra| 45| F|
| 5| Robert| 41| M|
| 3| John| 31| M|
| 2| Liz| 25| F|
| 1| James| 21| M|
+---+--------+---+------+
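
For comparison, the same ordering can be written with orderBy itself; this snippet assumes the customerDF used above and produces identical output.

// orderBy delegates to sort, so this is equivalent to the previous example.
val sortedDF = customerDF.orderBy($"age".desc, $"name".asc)
sortedDF.show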

randomSplit

The randomSplit method splits the source DataFrame into multiple DataFrames. It takes an array of weights as an argument and returns an array of DataFrames. It is a useful method for machine learning, where you want to split the raw dataset into training, validation, and test datasets.

val dfArray = homeDF.randomSplit(Array(0.6, 0.2, 0.2))
dfArray(0).count
dfArray(1).count
dfArray(2).count
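
When the same split must be reproducible across runs, an overload of randomSplit that also takes a seed can be used. The following is a minimal sketch assuming the same homeDF; the seed value is arbitrary.

// Fixing the seed makes repeated runs produce the same three DataFrames.
val Array(trainingDF, validationDF, testDF) =
  homeDF.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)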

rollup

The rollup method takes the names of one or more columns as arguments and returns a multi-dimensional rollup (a GroupedData object, on which an aggregation method such as sum can then be invoked). It is useful for subaggregation along a hierarchical dimension such as geography or time.

Assume you have a dataset that tracks annual sales by city, state, and country. The rollup method can be used to calculate both the grand total and subtotals by city, state, and country.

case class SalesByCity(year: Int, city: String, state: String,
                       country: String, revenue: Double)
val salesByCity = List(
  SalesByCity(2014, "Boston", "MA", "USA", 2000),
  SalesByCity(2015, "Boston", "MA", "USA", 3000),
  SalesByCity(2014, "Cambridge", "MA", "USA", 2000),
  SalesByCity(2015, "Cambridge", "MA", "USA", 3000),
  SalesByCity(2014, "Palo Alto", "CA", "USA", 4000),
  SalesByCity(2015, "Palo Alto", "CA", "USA", 6000),
  SalesByCity(2014, "Pune", "MH", "India", 1000),
  SalesByCity(2015, "Pune", "MH", "India", 1000),
  SalesByCity(2015, "Mumbai", "MH", "India", 1000),
  SalesByCity(2014, "Mumbai", "MH", "India", 2000))




val salesByCityDF = sc.parallelize(salesByCity).toDF()
val rollup = salesByCityDF.rollup($"country", $"state", $"city").sum("revenue")
rollup.show

+-------+-----+---------+------------+
|country|state| city|sum(revenue)|
+-------+-----+---------+------------+
| India| MH| Mumbai| 3000.0|
| USA| MA|Cambridge| 5000.0|
| India| MH| Pune| 2000.0|
| USA| MA| Boston| 5000.0|
| USA| MA| null| 10000.0|
| USA| null| null| 20000.0|
| USA| CA| null| 10000.0|
| null| null| null| 25000.0|
| India| MH| null| 5000.0|
| USA| CA|Palo Alto| 10000.0|
| India| null| null| 5000.0|
+-------+-----+---------+------------+
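
In the result, a null marks a column that has been rolled up, so each row is a subtotal at some level of the hierarchy, and the all-null row is the grand total. As a small follow-on sketch, assuming the rollup DataFrame created above, the null markers can be filtered with Column predicates to extract a single level, such as the per-country subtotals.

// Keep only rows where state (and hence city) was rolled up; these are
// the per-country totals, such as (USA, null, null, 20000.0).
val countryTotals = rollup.filter($"country".isNotNull && $"state".isNull)
countryTotals.show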

sample

The sample method returns a DataFrame containing the specified fraction of the rows in the source DataFrame. It takes two arguments. The first argument is a Boolean value indicating whether sampling should be done with replacement. The second argument specifies the fraction of the rows that should be returned; this fraction is an expected value, so the exact number of returned rows may vary from run to run.

val sampleDF = homeDF.sample(true, 0.10)
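
An overload of sample that also accepts a seed is available when the sample must be reproducible. This one-liner is a sketch assuming the same homeDF; the seed value is arbitrary.

// Sampling without replacement; fixing the seed makes repeated runs
// return the same rows.
val reproducibleSampleDF = homeDF.sample(false, 0.10, 7L)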

select

The select method returns a DataFrame containing only the specified columns from the source DataFrame.

val namesAgeDF = customerDF.select("name", "age")
namesAgeDF.show

+--------+---+
| name|age|
+--------+---+
| James| 21|
| Liz| 25|
| John| 31|
|Jennifer| 45|
| Robert| 41|
| Sandra| 45|
+--------+---+
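
The select method also accepts Column expressions, which makes it possible to return derived columns. Here is a short sketch assuming the same customerDF; the alias ageNextYear is just an illustrative name.

// Select the name column along with a column computed from age.
val derivedDF = customerDF.select($"name", ($"age" + 1).as("ageNextYear"))
derivedDF.show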


Page 289

■ Index

Spark SQL
	aggregate functions, 139
	API (see Application programming interface (API))
	arithmetic functions, 140
	collection functions, 140
	columnar caching, 106
	columnar storage, 105
	conversion functions, 140
	data processing interfaces, 104
	data sources, 104
	definition, 103
	Disk I/O, 105
	field extraction functions, 140
	GraphX, 103
	Hive, 105
	interactive analysis
		JDBC server, 149
		Spark shell, 142
		SQL/HiveQL queries, 143
	math functions, 141
	miscellaneous functions, 141
	partitioning, 105
	pushdown predication, 106
	query optimization
		analysis phase, 106
		code generation phase, 107
		data virtualization, 108
		data warehousing, 108
		ETL, 107
		JDBC/ODBC server, 108
		logical optimization phase, 106
		physical planning phase, 107
	skip rows, 106
	Spark ML, 103
	Spark Streaming, 103
	string values, 141
	UDAFs, 141
	UDFs, 141
	uses, 104
	window functions, 141

Spark standalone cluster
	cores and memory, 245
	DEBUG and INFO logs, 246
	Spark workers, 246–247
	Web UI
		monitoring executors, 246
		Spark master, 244

Spark Streaming
	API (see Application programming interface (API))
	application source code, 97
	command-line arguments, 99
	data stream sources, 80
	definition, 79
	destinations, 81
	getLang method, 100
	hashtag, 97, 101
	machine learning algorithms, 80
	map transformation, 100
	micro-batches, 80
	monitoring, 260–262
	Receiver, 81
	SparkConf class, 100
	Spark core, 80
	start method, 102
	StreamingContext class, 100
	Twitter4J library, 99–100
	updateStateByKey method, 101

Sparse method, 174
SparseVector, MLlib, 174
split method, 73
Standalone cluster manager
	architecture, 231
		master, 232
		worker, 232
	client mode, 235
	cluster mode, 235
	master process, 233
	setting up, 232
	spark-submit script, 234
	start-all.sh script, 234
	stop-all.sh script, 234
	stopping process, 233
	worker processes, 233

start method, 83–84, 102
staticPageRank method, 228
Statistical analysis, 172
stop method, 84
Strongly connected component (SCC), 229
Structure transformation operators
	groupEdges, 220–221
	mask, 219–220
	reverse, 218
	subgraph, 219
subgraph method, 219
Supervised machine learning algorithm, 158
Support Vector Machine (SVM) algorithm, 163–164
SVMWithSGD, 181

T
take method, 136
Task metrics, 255
textFile method, 67–68, 73
textFileStream method, 86
toDF method, 112, 121
toJSON method, 134


Tokenizer, 202
toPMML method, 180–181
trainRegressor method, 177–178
transform method, 90, 101
triangleCount method, 230
Twitter, 97

U
union method, 88
Unsupervised machine learning algorithm, 167
updateStateByKey method, 91, 101
User-defined aggregation functions (UDAFs), 141
User-defined functions (UDFs), 141

V
Vector type, MLlib, 173
VertexRDD, 210–211

W, X
window method, 95
withColumn method, 132
WordCount application
	args argument, 72
	compile, 74
	debugging, 77
	destination path, 72
	flatMap method, 73
	input parameters, 71
	IntelliJ IDEA, 71
	main method, 72
	map method, 73
	monitor, 77
	reduceByKey method, 73
	running, 75
	sbt, 73
	small and large dataset, 72
	SparkContext class, 72–73
	spark-submit script, 73
	split method, 73
	structure, 72
	textFile method, 73

write method, 136

Y, Z
Yet another resource negotiator (YARN)
	architecture, 240
	JobTracker, 239
	running, 241
