Table of Contents
Preface ix
Part I Setup
1 Theory 3
Introduction 3
Definition 5
Methodology as Tweet 5
Agile Data Science Manifesto 6
The Problem with the Waterfall 10
Research Versus Application Development 11
The Problem with Agile Software 14
Eventual Quality: Financing Technical Debt 14
The Pull of the Waterfall 15
The Data Science Process 16
Setting Expectations 17
Data Science Team Roles 18
Recognizing the Opportunity and the Problem 19
Adapting to Change 21
Notes on Process 23
Code Review and Pair Programming 25
Agile Environments: Engineering Productivity 25
Realizing Ideas with Large-Format Printing 27
2 Agile Tools 29
Scalability = Simplicity 30
Agile Data Science Data Processing 30
Local Environment Setup 32
System Requirements 33
Setting Up Vagrant 33
Downloading the Data 33
EC2 Environment Setup 34
Downloading the Data 38
Getting and Running the Code 38
Getting the Code 38
Running the Code 38
Jupyter Notebooks 39
Touring the Toolset 39
Agile Stack Requirements 39
Python 3 39
Serializing Events with JSON Lines and Parquet 42
Collecting Data 45
Data Processing with Spark 45
Publishing Data with MongoDB 48
Searching Data with Elasticsearch 50
Distributed Streams with Apache Kafka 54
Processing Streams with PySpark Streaming 57
Machine Learning with scikit-learn and Spark MLlib 58
Scheduling with Apache Airflow (Incubating) 59
Reflecting on Our Workflow 70
Lightweight Web Applications 70
Presenting Our Data 73
Conclusion 75
3 Data 77
Air Travel Data 77
Flight On-Time Performance Data 78
OpenFlights Database 79
Weather Data 80
Data Processing in Agile Data Science 81
Structured Versus Semistructured Data 81
SQL Versus NoSQL 82
SQL 83
NoSQL and Dataflow Programming 83
Spark: SQL + NoSQL 84
Schemas in NoSQL 84
Data Serialization 85
Extracting and Exposing Features in Evolving Schemas 85
Conclusion 86
Part II Climbing the Pyramid
4 Collecting and Displaying Records 89
Putting It All Together 90
Collecting and Serializing Flight Data 91
Processing and Publishing Flight Records 94
Publishing Flight Records to MongoDB 95
Presenting Flight Records in a Browser 96
Serving Flights with Flask and pymongo 97
Rendering HTML5 with Jinja2 98
Agile Checkpoint 102
Listing Flights 103
Listing Flights with MongoDB 103
Paginating Data 106
Searching for Flights 112
Creating Our Index 112
Publishing Flights to Elasticsearch 113
Searching Flights on the Web 114
Conclusion 117
5 Visualizing Data with Charts and Tables 119
Chart Quality: Iteration Is Essential 120
Scaling a Database in the Publish/Decorate Model 120
First Order Form 121
Second Order Form 122
Third Order Form 123
Choosing a Form 123
Exploring Seasonality 124
Querying and Presenting Flight Volume 124
Extracting Metal (Airplanes [Entities]) 132
Extracting Tail Numbers 132
Assessing Our Airplanes 139
Data Enrichment 140
Reverse Engineering a Web Form 140
Gathering Tail Numbers 142
Automating Form Submission 143
Extracting Data from HTML 144
Evaluating Enriched Data 147
Conclusion 148
6 Exploring Data with Reports 149
Extracting Airlines (Entities) 150
Defining Airlines as Groups of Airplanes Using PySpark 150
Querying Airline Data in Mongo 151
Building an Airline Page in Flask 151
Linking Back to Our Airline Page 152
Creating an All Airlines Home Page 153
Curating Ontologies of Semistructured Data 154
Improving Airlines 155
Adding Names to Carrier Codes 156
Incorporating Wikipedia Content 158
Publishing Enriched Airlines to Mongo 159
Enriched Airlines on the Web 160
Investigating Airplanes (Entities) 162
SQL Subqueries Versus Dataflow Programming 164
Dataflow Programming Without Subqueries 164
Subqueries in Spark SQL 165
Creating an Airplanes Home Page 166
Adding Search to the Airplanes Page 167
Creating a Manufacturers Bar Chart 172
Iterating on the Manufacturers Bar Chart 174
Entity Resolution: Another Chart Iteration 177
Conclusion 183
7 Making Predictions 185
The Role of Predictions 186
Predict What? 186
Introduction to Predictive Analytics 187
Making Predictions 187
Exploring Flight Delays 189
Extracting Features with PySpark 193
Building a Regression with scikit-learn 198
Loading Our Data 198
Sampling Our Data 199
Vectorizing Our Results 200
Preparing Our Training Data 201
Vectorizing Our Features 201
Sparse Versus Dense Matrices 203
Preparing an Experiment 204
Training Our Model 204
Testing Our Model 205
Conclusion 207
Building a Classifier with Spark MLlib 208
Loading Our Training Data with a Specified Schema 208
Addressing Nulls 210
Replacing FlightNum with Route 210
Bucketizing a Continuous Variable for Classification 211
Feature Vectorization with pyspark.ml.feature 219
Classification with Spark ML 221
Conclusion 223
8 Deploying Predictive Systems 225
Deploying a scikit-learn Application as a Web Service 225
Saving and Loading scikit-learn Models 226
Groundwork for Serving Predictions 227
Creating Our Flight Delay Regression API 228
Testing Our API 232
Pulling Our API into Our Product 232
Deploying Spark ML Applications in Batch with Airflow 234
Gathering Training Data in Production 235
Training, Storing, and Loading Spark ML Models 237
Creating Prediction Requests in Mongo 239
Fetching Prediction Requests from MongoDB 245
Making Predictions in a Batch with Spark ML 248
Storing Predictions in MongoDB 252
Displaying Batch Prediction Results in Our Web Application 253
Automating Our Workflow with Apache Airflow (Incubating) 256
Conclusion 264
Deploying Spark ML via Spark Streaming 264
Gathering Training Data in Production 265
Training, Storing, and Loading Spark ML Models 265
Sending Prediction Requests to Kafka 266
Making Predictions in Spark Streaming 277
Testing the Entire System 283
Conclusion 285
9 Improving Predictions 287
Fixing Our Prediction Problem 287
When to Improve Predictions 288
Improving Prediction Performance 288
Experimental Adhesion Method: See What Sticks 288
Establishing Rigorous Metrics for Experiments 289
Time of Day as a Feature 298
Incorporating Airplane Data 302
Extracting Airplane Features 302
Incorporating Airplane Features into Our Classifier Model 305
Incorporating Flight Time 310
Conclusion 313
A Manual Installation 315
Index 323