Download Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark PDF

Title: Agile Data Science 2.0: Building Full-Stack Data Analytics Applications with Spark
Tags: Demonstrating Work Values Apache Spark
File Size: 11.5 MB
Total Pages: 351
Table of Contents
	Agile Data Science Mailing List
	Data Syndrome, Product Analytics Consultancy
		Live Training
	Who This Book Is For
	How This Book Is Organized
	Conventions Used in This Book
	Using Code Examples
	O’Reilly Safari
	How to Contact Us
Part I. Setup
	Chapter 1. Theory
			Methodology as Tweet
			Agile Data Science Manifesto
		The Problem with the Waterfall
			Research Versus Application Development
		The Problem with Agile Software
			Eventual Quality: Financing Technical Debt
			The Pull of the Waterfall
		The Data Science Process
			Setting Expectations
			Data Science Team Roles
			Recognizing the Opportunity and the Problem
			Adapting to Change
		Notes on Process
			Code Review and Pair Programming
			Agile Environments: Engineering Productivity
			Realizing Ideas with Large-Format Printing
	Chapter 2. Agile Tools
		Scalability = Simplicity
		Agile Data Science Data Processing
		Local Environment Setup
			System Requirements
			Setting Up Vagrant
			Downloading the Data
		EC2 Environment Setup
			Downloading the Data
		Getting and Running the Code
			Getting the Code
			Running the Code
			Jupyter Notebooks
		Touring the Toolset
			Agile Stack Requirements
			Python 3
			Serializing Events with JSON Lines and Parquet
			Collecting Data
			Data Processing with Spark
			Publishing Data with MongoDB
			Searching Data with Elasticsearch
			Distributed Streams with Apache Kafka
			Processing Streams with PySpark Streaming
			Machine Learning with scikit-learn and Spark MLlib
			Scheduling with Apache Airflow (Incubating)
			Reflecting on Our Workflow
			Lightweight Web Applications
			Presenting Our Data
	Chapter 3. Data
		Air Travel Data
			Flight On-Time Performance Data
			OpenFlights Database
		Weather Data
		Data Processing in Agile Data Science
			Structured Versus Semistructured Data
		SQL Versus NoSQL
			NoSQL and Dataflow Programming
			Spark: SQL + NoSQL
			Schemas in NoSQL
			Data Serialization
			Extracting and Exposing Features in Evolving Schemas
Part II. Climbing the Pyramid
	Chapter 4. Collecting and Displaying Records
		Putting It All Together
		Collecting and Serializing Flight Data
		Processing and Publishing Flight Records
			Publishing Flight Records to MongoDB
		Presenting Flight Records in a Browser
			Serving Flights with Flask and pymongo
			Rendering HTML5 with Jinja2
		Agile Checkpoint
		Listing Flights
			Listing Flights with MongoDB
			Paginating Data
		Searching for Flights
			Creating Our Index
			Publishing Flights to Elasticsearch
			Searching Flights on the Web
	Chapter 5. Visualizing Data with Charts and Tables
		Chart Quality: Iteration Is Essential
		Scaling a Database in the Publish/Decorate Model
			First Order Form
			Second Order Form
			Third Order Form
			Choosing a Form
		Exploring Seasonality
			Querying and Presenting Flight Volume
		Extracting Metal (Airplanes [Entities])
			Extracting Tail Numbers
			Assessing Our Airplanes
		Data Enrichment
			Reverse Engineering a Web Form
			Gathering Tail Numbers
			Automating Form Submission
			Extracting Data from HTML
			Evaluating Enriched Data
	Chapter 6. Exploring Data with Reports
		Extracting Airlines (Entities)
			Defining Airlines as Groups of Airplanes Using PySpark
			Querying Airline Data in Mongo
			Building an Airline Page in Flask
			Linking Back to Our Airline Page
			Creating an All Airlines Home Page
		Curating Ontologies of Semi-structured Data
		Improving Airlines
			Adding Names to Carrier Codes
			Incorporating Wikipedia Content
			Publishing Enriched Airlines to Mongo
			Enriched Airlines on the Web
		Investigating Airplanes (Entities)
			SQL Subqueries Versus Dataflow Programming
			Dataflow Programming Without Subqueries
			Subqueries in Spark SQL
			Creating an Airplanes Home Page
			Adding Search to the Airplanes Page
			Creating a Manufacturers Bar Chart
			Iterating on the Manufacturers Bar Chart
			Entity Resolution: Another Chart Iteration
	Chapter 7. Making Predictions
		The Role of Predictions
		Predict What?
		Introduction to Predictive Analytics
			Making Predictions
		Exploring Flight Delays
		Extracting Features with PySpark
		Building a Regression with scikit-learn
			Loading Our Data
			Sampling Our Data
			Vectorizing Our Results
			Preparing Our Training Data
			Vectorizing Our Features
			Sparse Versus Dense Matrices
			Preparing an Experiment
			Training Our Model
			Testing Our Model
		Building a Classifier with Spark MLlib
			Loading Our Training Data with a Specified Schema
			Addressing Nulls
			Replacing FlightNum with Route
			Bucketizing a Continuous Variable for Classification
			Feature Vectorization with
			Classification with Spark ML
	Chapter 8. Deploying Predictive Systems
		Deploying a scikit-learn Application as a Web Service
			Saving and Loading scikit-learn Models
			Groundwork for Serving Predictions
			Creating Our Flight Delay Regression API
			Testing Our API
			Pulling Our API into Our Product
		Deploying Spark ML Applications in Batch with Airflow
			Gathering Training Data in Production
			Training, Storing, and Loading Spark ML Models
			Creating Prediction Requests in Mongo
			Fetching Prediction Requests from MongoDB
			Making Predictions in a Batch with Spark ML
			Storing Predictions in MongoDB
			Displaying Batch Prediction Results in Our Web Application
			Automating Our Workflow with Apache Airflow (Incubating)
		Deploying Spark ML via Spark Streaming
			Gathering Training Data in Production
			Training, Storing, and Loading Spark ML Models
			Sending Prediction Requests to Kafka
			Making Predictions in Spark Streaming
			Testing the Entire System
	Chapter 9. Improving Predictions
		Fixing Our Prediction Problem
		When to Improve Predictions
		Improving Prediction Performance
			Experimental Adhesion Method: See What Sticks
			Establishing Rigorous Metrics for Experiments
			Time of Day as a Feature
		Incorporating Airplane Data
			Extracting Airplane Features
			Incorporating Airplane Features into Our Classifier Model
		Incorporating Flight Time
Appendix A. Manual Installation
	Installing Hadoop
	Installing Spark
	Installing MongoDB
	Installing the MongoDB Java Driver
	Installing mongo-hadoop
		Building mongo-hadoop
		Installing pymongo_spark
	Installing Elasticsearch
	Installing Elasticsearch for Hadoop
	Setting Up Our Spark Environment
	Installing Kafka
	Installing scikit-learn
	Installing Zeppelin
About the Author
Document Text Contents
Page 1

Russell Jurney

Agile Data Science 2.0

Now with Kafka and Spark!

Page 175

# Is Delta around?
airlines.filter(airlines.C3 == 'DL').show()

produces the following result:

+----+---------------+----+---+---+-----+-------------+---+
|  C0|             C1|  C2| C3| C4|   C5|           C6| C7|
+----+---------------+----+---+---+-----+-------------+---+
|2009|Delta Air Lines|null| DL|DAL|DELTA|United States|  Y|
+----+---------------+----+---+---+-----+-------------+---+

Now let’s filter this data down to just the airline names and two-letter carrier codes,
and join it to the unique carrier codes from the on-time performance dataset:

# Drop fields except for C1 as name, C3 as carrier code
airlines = spark.sql("SELECT C1 AS Name, C3 AS CarrierCode FROM airlines")

# Join our 14 carrier codes to the airlines table to get our set of airlines
our_airlines = carrier_codes.join(
  airlines,
  carrier_codes.Carrier == airlines.CarrierCode
)
our_airlines ='Name', 'CarrierCode')

This results in:

+--------------------+-----------+
|                Name|CarrierCode|
+--------------------+-----------+
|   American Airlines|         AA|
|     Spirit Airlines|         NK|
|   Hawaiian Airlines|         HA|
|     Alaska Airlines|         AS|
|     JetBlue Airways|         B6|
|     United Airlines|         UA|
|          US Airways|         US|
|             SkyWest|         OO|
|      Virgin America|         VX|
|  Southwest Airlines|         WN|
|     Delta Air Lines|         DL|
|Atlantic Southeas...|         EV|
|   Frontier Airlines|         F9|
|American Eagle Ai...|         MQ|
+--------------------+-----------+

Finally, let’s store this intermediate data as JSON:

our_airlines.write.json('data/our_airlines.json')
and again, copy it into a JSON Lines file:

cp data/our_airlines.json/part* data/our_airlines.jsonl
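The shell glob works on macOS and Linux; where a shell isn't handy, the same concatenation can be sketched in a few lines of standard-library Python (the demo directory below is a throwaway stand-in, not the book's actual data path):

```python
import glob
import os
import tempfile

def concat_part_files(input_dir, output_path):
    """Concatenate Spark's part-* output files into one JSON Lines file."""
    with open(output_path, 'w') as out:
        # Sort for a deterministic ordering of the part files
        for part in sorted(glob.glob(os.path.join(input_dir, 'part*'))):
            with open(part) as f:
                for line in f:
                    out.write(line)

# Demo against a throwaway directory standing in for data/our_airlines.json
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, 'part-00000'), 'w') as f:
    f.write('{"Name":"Delta Air Lines","CarrierCode":"DL"}\n')
jsonl_path = os.path.join(tmp, 'our_airlines.jsonl')
concat_part_files(tmp, jsonl_path)
print(open(jsonl_path).read(), end='')  # {"Name":"Delta Air Lines","CarrierCode":"DL"}
```

Sorting the part files keeps the output order stable across runs, which the shell glob does not guarantee.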

Then we can take a peek with cat data/our_airlines.jsonl:


Page 176

{"Name":"American Airlines","CarrierCode":"AA"}
{"Name":"Spirit Airlines","CarrierCode":"NK"}
{"Name":"Hawaiian Airlines","CarrierCode":"HA"}
{"Name":"Alaska Airlines","CarrierCode":"AS"}
{"Name":"JetBlue Airways","CarrierCode":"B6"}
{"Name":"United Airlines","CarrierCode":"UA"}
{"Name":"US Airways","CarrierCode":"US"}
{"Name":"Virgin America","CarrierCode":"VX"}
{"Name":"Southwest Airlines","CarrierCode":"WN"}
{"Name":"Delta Air Lines","CarrierCode":"DL"}
{"Name":"Atlantic Southeast Airlines","CarrierCode":"EV"}
{"Name":"Frontier Airlines","CarrierCode":"F9"}
{"Name":"American Eagle Airlines","CarrierCode":"MQ"}
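Reading a JSON Lines file back is just as simple. The read_json_lines_file helper imported from utils in the next listing is not shown in this excerpt; a minimal version of such a helper (a sketch, not the book's actual utils module) could be:

```python
import json
import os
import tempfile

def read_json_lines_file(path):
    """Parse a JSON Lines file: one JSON object per line, blank lines skipped."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a two-record stand-in for data/our_airlines.jsonl
demo_path = os.path.join(tempfile.mkdtemp(), 'our_airlines.jsonl')
with open(demo_path, 'w') as f:
    f.write('{"Name":"Delta Air Lines","CarrierCode":"DL"}\n')
    f.write('{"Name":"United Airlines","CarrierCode":"UA"}\n')
records = read_json_lines_file(demo_path)
print(records[0]['Name'])  # Delta Air Lines
```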

Incorporating Wikipedia Content
Now that we have airline names, we can use Wikipedia to get information about
each airline, such as a summary, logo, and company website! To do so, we make
use of the wikipedia package for Python, which wraps the MediaWiki API. We’ll be
using BeautifulSoup again to parse the page’s HTML.

Check out ch06/
import sys, os, re
import utils

import wikipedia
from bs4 import BeautifulSoup
import tldextract

# Load our airlines...
our_airlines = utils.read_json_lines_file('data/our_airlines.jsonl')

# Build a new list that includes Wikipedia data
with_url = []
for airline in our_airlines:
    # Get the Wikipedia page for the airline name
    wikipage =['Name'])

    # Get the summary
    summary = wikipage.summary
    airline['summary'] = summary

    # Get the HTML of the page
    page = BeautifulSoup(wikipage.html(), 'html.parser')

    # Task: get the logo from the right 'vcard' column
    # 1) Get the vcard table
    vcard_table = page.find_all('table', class_='vcard')[0]
    # 2) The logo is always the first image inside this table
    first_image = vcard_table.find('img')
    airline['logo_url'] = first_image.get('src')
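The excerpt above relies on BeautifulSoup; the same "first image inside the vcard table" idea can also be sketched with nothing but the standard library's html.parser (the HTML snippet and URL below are illustrative, not Wikipedia's actual markup):

```python
from html.parser import HTMLParser

class VcardImgFinder(HTMLParser):
    """Record the src of the first <img> inside a <table class="vcard">."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0   # >0 while inside a vcard table (handles nesting)
        self.img_src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'table':
            classes = (attrs.get('class') or '').split()
            # Enter on a vcard table; count nested tables so we exit correctly
            if self.table_depth > 0 or 'vcard' in classes:
                self.table_depth += 1
        elif tag == 'img' and self.table_depth > 0 and self.img_src is None:
            self.img_src = attrs.get('src')

    def handle_endtag(self, tag):
        if tag == 'table' and self.table_depth > 0:
            self.table_depth -= 1

# Illustrative stand-in for a Wikipedia infobox
html = '<table class="infobox vcard"><tr><td><img src="//"></td></tr></table>'
finder = VcardImgFinder()
finder.feed(html)
print(finder.img_src)  # //
```

BeautifulSoup is the better tool in practice; the sketch just shows there is no magic in the extraction step.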


Page 350

test command (Airflow), 68

testing

DAGs in Airflow, 262
entire predictive systems, 282-284
flight delay regression API, 231
regression model, 205-207
tasks in Airflow, 68, 261

third order form (normalization), 123
Thrift serialization system, 85
time of day of flights, 298-302
time series charts, 121
timestamps, 195-196, 240-241
training data

collecting, 235-236, 265
features and, 188
loading with specified schema, 208
predictive analytics and, 187
preparing, 201

training the regression model, 204
Tunkelang, Daniel, 4

UDFs (user-defined functions), 217
Unicode standard, 43
user experience designers (team role), 18, 20
UTF-8 character encoding, 43
UUID (Universally Unique Identifier), 267


Vagrant

Elasticsearch and, 50
Jupyter Notebooks and, 40
Python 3 and, 39
setting up, 33
system requirements, 33

variables

categorical, 187, 201, 203, 220
continuous, 187, 201, 211-219, 220
nominal, 187, 201, 203, 220

VectorAssembler class, 220, 238, 287

vectorizing

features, 201-203, 219-221
regression results, 200

VirtualBox

installing, 33
system requirements, 33

visualizing data
histograms and, 211

with charts and tables, 119-148
with D3.js, 74

VM (virtual machine)
setting up Vagrant, 33
system requirements, 33

Warden, Pete, 85
waterfall method

about, 5
problems with, 10-11
pull of the, 4, 15
research versus application development,

WBAN Master List, 80
weather data, 80, 185-223
web applications, lightweight, 70-72
web developers (team role), 18, 20
web forms

automating submission, 143
reverse engineering, 140

web pages
building in Flask, 135, 151
creating home page, 153, 166
improving with multimedia content,

linking back to, 138, 152
publishing enriched data to, 159-161
semi-structured data in, 154

web services, deploying scikit-learn applications as, 225-233

Wickham, Hadley, 198
Wikipedia content, incorporating into flight data, 158
wikipedia package, 158
Williams, Hugh E., 81

workflows

automating with Airflow, 255-264
lightweight web applications, 70
software stack, 70

xgboost library, 198

Zeppelin (Apache), 262, 321
Zookeeper (Apache), 55, 266


Page 351

About the Author
Russell Jurney cut his data teeth in casino gaming, building web apps to analyze the
performance of slot machines in the US and Mexico. After dabbling in entrepreneurship, interactive media, and journalism, he moved to Silicon Valley to build analytics
applications at scale at Ning and LinkedIn. Russell is now principal consultant at Data
Syndrome, where he helps companies apply the principles and methods in this book
to build analytics products.

The animal on the cover of Agile Data Science is a silvery marmoset (Mico argentatus). These small New World monkeys live in the eastern parts of the Amazon rainforest and Brazil. Despite their name, silvery marmosets can range in color from near-white to dark brown. Brown marmosets have hairless ears and faces and are sometimes referred to as bare-ear marmosets. Reaching an average size of 22 cm, marmosets are about the size of squirrels, which makes their travel through tree canopies and dense vegetation very easy. Silvery marmosets live in extended families of around 12, where all the members help care for the young. Marmoset fathers carry their infants around during the day and return them to the mother every two to three hours to be fed. Babies wean from their mother’s milk at around six months and full maturity is reached at one to two years old. The marmoset’s diet consists mainly of sap and tree gum. They use their sharp teeth to gouge holes in trees to reach the sap, and will occasionally eat fruit, leaves, and insects as well. As the deforestation of the rainforest continues, however, marmosets have begun to eat food crops grown by people; as a result, many farmers view them as pests. Large-scale extermination programs are underway in agricultural areas, and it is still unclear what impact this will have on the overall silvery marmoset population. Because of their small size and mild disposition, marmosets are regularly used as subjects of medical research. Studies on the fertilization, placental development, and embryonic stem cells of marmosets may reveal the causes of developmental problems and genetic disorders in humans. Outside of the lab, marmosets are popular at zoos because they are diurnal (active during daytime) and full of energy; their long claws mean they can quickly move around in trees, and both males and females communicate with loud vocalizations.

Many of the animals on O’Reilly covers are endangered; all of them are important to
the world. To learn more about how you can help, go to

The cover image is from Lydekker’s Royal Natural History. The cover fonts are URW
Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font
is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.
