Data Science Bookcamp: Five real-world Python projects
Ebook, 1,376 pages, 14 hours


About this ebook

Learn data science with Python by building five real-world projects! Experiment with card game predictions, tracking disease outbreaks, and more, as you build a flexible and intuitive understanding of data science.

In Data Science Bookcamp you will learn:

- Techniques for computing and plotting probabilities
- Statistical analysis using SciPy
- How to organize datasets with clustering algorithms
- How to visualize complex multi-variable datasets
- How to train a decision tree machine learning algorithm

In Data Science Bookcamp you’ll test and build your knowledge of Python with the kind of open-ended problems that professional data scientists work on every day. Downloadable datasets and thoroughly explained solutions help you lock in what you’ve learned, building your confidence and making you ready for an exciting new data science career.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology
A data science project has a lot of moving parts, and it takes practice and skill to get all the code, algorithms, datasets, formats, and visualizations working together harmoniously. This unique book guides you through five realistic projects, including tracking disease outbreaks from news headlines, analyzing social networks, and finding relevant patterns in ad click data.

About the book
Data Science Bookcamp doesn’t stop with surface-level theory and toy examples. As you work through each project, you’ll learn how to troubleshoot common problems like missing data, messy data, and algorithms that don’t quite fit the model you’re building. You’ll appreciate the detailed setup instructions and the fully explained solutions that highlight common failure points. In the end, you’ll be confident in your skills because you can see the results.

What's inside

- Web scraping
- Organize datasets with clustering algorithms
- Visualize complex multi-variable datasets
- Train a decision tree machine learning algorithm

About the reader
For readers who know the basics of Python. No prior data science or machine learning skills required.

About the author
Leonard Apeltsin is the Head of Data Science at Anomaly, where his team applies advanced analytics to uncover healthcare fraud, waste, and abuse.

Table of Contents
CASE STUDY 1 FINDING THE WINNING STRATEGY IN A CARD GAME
1 Computing probabilities using Python
2 Plotting probabilities using Matplotlib
3 Running random simulations in NumPy
4 Case study 1 solution
CASE STUDY 2 ASSESSING ONLINE AD CLICKS FOR SIGNIFICANCE
5 Basic probability and statistical analysis using SciPy
6 Making predictions using the central limit theorem and SciPy
7 Statistical hypothesis testing
8 Analyzing tables using Pandas
9 Case study 2 solution
CASE STUDY 3 TRACKING DISEASE OUTBREAKS USING NEWS HEADLINES
10 Clustering data into groups
11 Geographic location visualization and analysis
12 Case study 3 solution
CASE STUDY 4 USING ONLINE JOB POSTINGS TO IMPROVE YOUR DATA SCIENCE RESUME
13 Measuring text similarities
14 Dimension reduction of matrix data
15 NLP analysis of large text datasets
16 Extracting text from web pages
17 Case study 4 solution
CASE STUDY 5 PREDICTING FUTURE FRIENDSHIPS FROM SOCIAL NETWORK DATA
18 An introduction to graph theory and network analysis
19 Dynamic graph theory techniques for node ranking and social network analysis
20 Network-driven supervised machine learning
21 Training linear classifiers with logistic regression
22 Training nonlinear classifiers with decision tree techniques
23 Case study 5 solution
Language: English
Publisher: Manning
Release date: Dec 7, 2021
ISBN: 9781638352303

    Book preview

    Data Science Bookcamp - Leonard Apeltsin

    inside front cover

    Core algorithms inside the book

    A trained logistic regression classifier distinguishes between two classes of points by slicing like a cleaver through 3D space (see section 21).

    Data Science Bookcamp

    Five real-world Python projects

    Leonard Apeltsin

    To comment go to liveBook

    Manning

    Shelter Island

    For more information on this and other Manning titles go to

    www.manning.com

    Copyright

    For online information and ordering of these and other Manning books, please visit www.manning.com. The publisher offers discounts on these books when ordered in quantity.

    For more information, please contact

    Special Sales Department

    Manning Publications Co.

    20 Baldwin Road

    PO Box 761

    Shelter Island, NY 11964

    Email: [email protected]

    ©2021 by Manning Publications Co. All rights reserved.

    No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

    Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

    ♾ Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

    ISBN: 9781617296253

    dedication

    To my teacher, Alexander Vishnevsky, who taught me how to think

    brief contents

    Part 1.   Case study 1: Finding the winning strategy in a card game

      1   Computing probabilities using Python

      2   Plotting probabilities using Matplotlib

      3   Running random simulations in NumPy

      4   Case study 1 solution

    Part 2.   Case study 2: Assessing online ad clicks for significance

      5   Basic probability and statistical analysis using SciPy

      6   Making predictions using the central limit theorem and SciPy

      7   Statistical hypothesis testing

      8   Analyzing tables using Pandas

      9   Case study 2 solution

    Part 3.   Case study 3: Tracking disease outbreaks using news headlines

    10   Clustering data into groups

    11   Geographic location visualization and analysis

    12   Case study 3 solution

    Part 4.   Case study 4: Using online job postings to improve your data science resume

    13   Measuring text similarities

    14   Dimension reduction of matrix data

    15   NLP analysis of large text datasets

    16   Extracting text from web pages

    17   Case study 4 solution

    Part 5.   Case study 5: Predicting future friendships from social network data

    18   An introduction to graph theory and network analysis

    19   Dynamic graph theory techniques for node ranking and social network analysis

    20   Network-driven supervised machine learning

    21   Training linear classifiers with logistic regression

    22   Training nonlinear classifiers with decision tree techniques

    23   Case study 5 solution

    contents

    front matter

    preface

    acknowledgments

    about this book

    about the author

    about the cover illustration

    Part 1.   Case study 1: Finding the winning strategy in a card game

      1   Computing probabilities using Python

    1.1  Sample space analysis: An equation-free approach for measuring uncertainty in outcomes

    Analyzing a biased coin

    1.2  Computing nontrivial probabilities

    Problem 1: Analyzing a family with four children

    Problem 2: Analyzing multiple die rolls

    Problem 3: Computing die-roll probabilities using weighted sample spaces

    1.3  Computing probabilities over interval ranges

    Evaluating extremes using interval analysis

      2   Plotting probabilities using Matplotlib

    2.1  Basic Matplotlib plots

    2.2  Plotting coin-flip probabilities

    Comparing multiple coin-flip probability distributions

      3   Running random simulations in NumPy

    3.1  Simulating random coin flips and die rolls using NumPy

    Analyzing biased coin flips

    3.2  Computing confidence intervals using histograms and NumPy arrays

    Binning similar points in histogram plots

    Deriving probabilities from histograms

    Shrinking the range of a high confidence interval

    Computing histograms in NumPy

    3.3  Using confidence intervals to analyze a biased deck of cards

    3.4  Using permutations to shuffle cards

      4   Case study 1 solution

    4.1  Predicting red cards in a shuffled deck

    Estimating the probability of strategy success

    4.2  Optimizing strategies using the sample space for a 10-card deck

    Part 2.   Case study 2: Assessing online ad clicks for significance

    Problem statement

    Dataset description

    Overview

      5   Basic probability and statistical analysis using SciPy

    5.1  Exploring the relationships between data and probability using SciPy

    5.2  Mean as a measure of centrality

    Finding the mean of a probability distribution

    5.3  Variance as a measure of dispersion

    Finding the variance of a probability distribution

      6   Making predictions using the central limit theorem and SciPy

    6.1  Manipulating the normal distribution using SciPy

    Comparing two sampled normal curves

    6.2  Determining the mean and variance of a population through random sampling

    6.3  Making predictions using the mean and variance

    Computing the area beneath a normal curve

    Interpreting the computed probability

      7   Statistical hypothesis testing

    7.1  Assessing the divergence between sample mean and population mean

    7.2  Data dredging: Coming to false conclusions through oversampling

    7.3  Bootstrapping with replacement: Testing a hypothesis when the population variance is unknown

    7.4  Permutation testing: Comparing means of samples when the population parameters are unknown

      8   Analyzing tables using Pandas

    8.1  Storing tables using basic Python

    8.2  Exploring tables using Pandas

    8.3  Retrieving table columns

    8.4  Retrieving table rows

    8.5  Modifying table rows and columns

    8.6  Saving and loading table data

    8.7  Visualizing tables using Seaborn

      9   Case study 2 solution

    9.1  Processing the ad-click table in Pandas

    9.2  Computing p-values from differences in means

    9.3  Determining statistical significance

    9.4  41 shades of blue: A real-life cautionary tale

    Part 3.   Case study 3: Tracking disease outbreaks using news headlines

    Problem statement

    Dataset description

    Overview

    10   Clustering data into groups

    10.1  Using centrality to discover clusters

    10.2  K-means: A clustering algorithm for grouping data into K central groups

    K-means clustering using scikit-learn

    Selecting the optimal K using the elbow method

    10.3  Using density to discover clusters

    10.4  DBSCAN: A clustering algorithm for grouping data based on spatial density

    Comparing DBSCAN and K-means

    Clustering based on non-Euclidean distance

    10.5  Analyzing clusters using Pandas

    11   Geographic location visualization and analysis

    11.1  The great-circle distance: A metric for computing the distance between two global points

    11.2  Plotting maps using Cartopy

    Manually installing GEOS and Cartopy

    Utilizing the Conda package manager

    Visualizing maps

    11.3  Location tracking using GeoNamesCache

    Accessing country information

    Accessing city information

    Limitations of the GeoNamesCache library

    11.4  Matching location names in text

    12   Case study 3 solution

    12.1  Extracting locations from headline data

    12.2  Visualizing and clustering the extracted location data

    12.3  Extracting insights from location clusters

    Part 4.   Case study 4: Using online job postings to improve your data science resume

    Problem statement

    Dataset description

    Overview

    13   Measuring text similarities

    13.1  Simple text comparison

    Exploring the Jaccard similarity

    Replacing words with numeric values

    13.2  Vectorizing texts using word counts

    Using normalization to improve TF vector similarity

    Using unit vector dot products to convert between relevance metrics

    13.3  Matrix multiplication for efficient similarity calculation

    Basic matrix operations

    Computing all-by-all matrix similarities

    13.4  Computational limits of matrix multiplication

    14   Dimension reduction of matrix data

    14.1  Clustering 2D data in one dimension

    Reducing dimensions using rotation

    14.2  Dimension reduction using PCA and scikit-learn

    14.3  Clustering 4D data in two dimensions

    Limitations of PCA

    14.4  Computing principal components without rotation

    Extracting eigenvectors using power iteration

    14.5  Efficient dimension reduction using SVD and scikit-learn

    15   NLP analysis of large text datasets

    15.1  Loading online forum discussions using scikit-learn

    15.2  Vectorizing documents using scikit-learn

    15.3  Ranking words by both post frequency and count

    Computing TFIDF vectors with scikit-learn

    15.4  Computing similarities across large document datasets

    15.5  Clustering texts by topic

    Exploring a single text cluster

    15.6  Visualizing text clusters

    Using subplots to display multiple word clouds

    16   Extracting text from web pages

    16.1  The structure of HTML documents

    16.2  Parsing HTML using Beautiful Soup

    16.3  Downloading and parsing online data

    17   Case study 4 solution

    17.1  Extracting skill requirements from job posting data

    Exploring the HTML for skill descriptions

    17.2  Filtering jobs by relevance

    17.3  Clustering skills in relevant job postings

    Grouping the job skills into 15 clusters

    Investigating the technical skill clusters

    Investigating the soft-skill clusters

    Exploring clusters at alternative values of K

    Analyzing the 700 most relevant postings

    17.4  Conclusion

    Part 5.   Case study 5: Predicting future friendships from social network data

    Problem statement

    Introducing the friend-of-a-friend recommendation algorithm

    Predicting user behavior

    Dataset description

    The Profiles table

    The Observations table

    The Friendships table

    Overview

    18   An introduction to graph theory and network analysis

    18.1  Using basic graph theory to rank websites by popularity

    Analyzing web networks using NetworkX

    18.2  Utilizing undirected graphs to optimize the travel time between towns

    Modeling a complex network of towns and counties

    Computing the fastest travel time between nodes

    19   Dynamic graph theory techniques for node ranking and social network analysis

    19.1  Uncovering central nodes based on expected traffic in a network

    Measuring centrality using traffic simulations

    19.2  Computing travel probabilities using matrix multiplication

    Deriving PageRank centrality from probability theory

    Computing PageRank centrality using NetworkX

    19.3  Community detection using Markov clustering

    19.4  Uncovering friend groups in social networks

    20   Network-driven supervised machine learning

    20.1  The basics of supervised machine learning

    20.2  Measuring predicted label accuracy

    Scikit-learn’s prediction measurement functions

    20.3  Optimizing KNN performance

    20.4  Running a grid search using scikit-learn

    20.5  Limitations of the KNN algorithm

    21   Training linear classifiers with logistic regression

    21.1  Linearly separating customers by size

    21.2  Training a linear classifier

    Improving perceptron performance through standardization

    21.3  Improving linear classification with logistic regression

    Running logistic regression on more than two features

    21.4  Training linear classifiers using scikit-learn

    Training multiclass linear models

    21.5  Measuring feature importance with coefficients

    21.6  Linear classifier limitations

    22   Training nonlinear classifiers with decision tree techniques

    22.1  Automated learning of logical rules

    Training a nested if/else model using two features

    Deciding which feature to split on

    Training if/else models with more than two features

    22.2  Training decision tree classifiers using scikit-learn

    Studying cancerous cells using feature importance

    22.3  Decision tree classifier limitations

    22.4  Improving performance using random forest classification

    22.5  Training random forest classifiers using scikit-learn

    23   Case study 5 solution

    23.1  Exploring the data

    Examining the profiles

    Exploring the experimental observations

    Exploring the Friendships linkage table

    23.2  Training a predictive model using network features

    23.3  Adding profile features to the model

    23.4  Optimizing performance across a steady set of features

    23.5  Interpreting the trained model

    Why are generalizable models so important?

    index

    front matter

    preface

    Another promising candidate had failed their data science interview, and I began to wonder why. The year was 2018, and I was struggling to expand the data science team at my startup. I had interviewed dozens of seemingly qualified candidates, only to reject them all. The latest rejected applicant was an economics PhD from a top-notch school. Recently, the applicant had transitioned into data science after completing a 10-week bootcamp. I asked the applicant to discuss an analytics problem that was very relevant to our company. They immediately brought up a trendy algorithm that was not applicable to the situation. When I tried to debate the algorithm’s incompatibilities, the candidate was at a loss. They didn’t know how the algorithm actually worked or the appropriate circumstances under which to use it. These details hadn’t been taught to them at the bootcamp.

    After the rejected candidate departed, I began to reflect on my own data science education. How different it had been! Back in 2006, data science was not yet a coveted career choice, and DS bootcamps did not yet exist. In those days, I was a poor grad student struggling to pay the rent in pricey San Francisco. My graduate research required me to analyze millions of genetic links to diseases. I realized that my skills were transferable to other areas of analysis, and thus my data science consultancy was born.

    Unbeknownst to my graduate advisor, I began to solicit analytics work from random Bay Area companies. That freelance work helped pay the bills, so I could not be too choosy about the data-driven assignments I tackled. Thus, I would sign up for a variety of data science tasks, ranging from simple statistical analyses to complex predictive modeling. Sometimes I would find myself overwhelmed by a seemingly intractable data problem, but in the end, I’d persevere. My struggles taught me the nuances of diverse analytics techniques and how to best combine them to reach elegant solutions. More importantly, I learned how common techniques can fail and how to surmount these failure points to deliver impactful results. As my skill set grew, my data science career began to flourish. Eventually, I became a leader in the field.

    Would I have achieved the same level of success through rote memorization at a 10-week bootcamp? Probably not. Many bootcamps prioritize the study of standalone algorithms over more cohesive problem-solving skills. Furthermore, an algorithm’s strengths tend to be emphasized while its weaknesses are glossed over. Consequently, students are sometimes ill prepared to handle data science in real-world settings. That insight inspired me to write this book.

    I decided to replicate my own data science education by exposing you, my readers, to a set of increasingly challenging analytics problems. Additionally, I chose to arm you with tools and techniques required to handle these problems effectively. My aim is to holistically help you cultivate your analytic problem-solving skills. This way, when you interview for that junior data science position, you will be much more likely to get the job.

    acknowledgments

    Writing this book was very hard. I definitely could not have done it alone. Fortunately, my family and friends provided their support during this arduous journey. First and foremost, I thank my mother, Irina Apeltsin. She kept me motivated during those difficult days when the task before me seemed insurmountable. Additionally, I thank my grandmother, Vera Fisher, whose pragmatic advice kept me on track as I plowed through the material for my book.

    Furthermore, I’d like to thank my childhood friend Vadim Stolnik. Vadim is a brilliant graphic designer who helped me with the book’s myriad illustrations. Also, I want to acknowledge my friend and colleague Emmanuel Yera, who had my back during my initial writing efforts. Moreover, I must mention my dear dance partner Alexandria Law, who kept my spirits up during my struggles and also helped pick out this book’s cover.

    Next, I thank my editor at Manning, Elesha Hyde. Over the course of the past three years, you’ve worked tirelessly to ensure that I deliver something truly of value to my readers. I will forever be grateful for your patience, optimism, and ceaseless commitment to quality. You’ve pushed me to become a better writer, and my readers will ultimately benefit from these efforts. Additionally, I’d like to acknowledge my technical development editor Arthur Zubarov and my technical proofreader Raffaella Ventaglio. Your inputs helped me craft a better, cleaner book. I also thank Deirdre Hiam, my project editor; Tiffany Taylor, my copyeditor; Katie Tennant, my proofreader; and everyone else at Manning who had a hand in this book.

    To all the reviewers—Adam Scheller, Adriaan Beiertz, Alan Bogusiewicz, Amaresh Rajasekharan, Ayon Roy, Bill Mitchell, Bob Quintus, David Jacobs, Diego Casella, Duncan McRae, Elias Rangel, Frank L Quintana, Grzegorz Bernas, Jason Hales, Jean-François Morin, Jeff Smith, Jim Amrhein, Joe Justesen, John Kasiewicz, Maxim Kupfer, Michael Johnson, Michał Ambroziewicz, Raffaella Ventaglio, Ravi Sajnani, Robert Diana, Simone Sguazza, Sriram Macharla, and Stuart Woodward—thank you. Your suggestions helped make this a better book.

    about this book

    Open-ended problem-solving abilities are essential for a data science career. Unfortunately, these abilities cannot be acquired simply by reading. To become a problem solver, you must persistently solve difficult problems. With this in mind, I’ve structured my book around case studies: open-ended problems modeled on real-world situations. The case studies range from online advertisement analysis to tracking disease outbreaks using news data. Upon completing these case studies, you will be well suited to begin a career in data science.

    Who should read this book

    This book’s intended reader is an educated novice who is interested in transitioning to a data science career. When I imagine a typical reader, I picture a fourth-year college student studying economics who wishes to explore a broader range of analytics opportunities, or a chemistry major already out of school who is searching for a more data-centric career path. Or perhaps the reader is a successful frontend web developer with a very limited mathematics background who would like to give data science a shot. None of my potential readers have ever taken a data science class, leaving them inexperienced when it comes to diverse data analysis. The purpose of this book is to eliminate that skill deficiency.

    My readers are required to know the bare-bones basics of Python programming. Self-taught beginning Python should be sufficient to explore the exercises in the book. Your mathematical knowledge is not expected to extend beyond basic high-school trigonometry.

    How this book is organized

    This book contains five case studies of increasing difficulty. Each case study begins with a detailed problem statement, which you will need to resolve. The problem statement is followed by two to five sections that introduce the data science skills required to solve the problem. These skill sections cover fundamental libraries, as well as mathematical and algorithmic techniques. The final section of each case study presents the solution to the problem.

    Case study 1 pertains to basic probability theory:

    Section 1 discusses how to compute probabilities using straightforward Python.

    Section 2 introduces the concept of probability distributions. It also introduces the Matplotlib visualization library, which can be used to visualize the distributions.

    Section 3 discusses how to estimate probabilities using randomized simulations. The NumPy numerical computing library is introduced to facilitate efficient simulation execution.

    Section 4 contains the case study solution.

    Case study 2 extends beyond probability into statistics:

    Section 5 introduces simple statistical measures of centrality and dispersion. It also introduces the SciPy scientific computing library, which contains a useful statistics module.

    Section 6 dives deep into the central limit theorem, which can be used to make statistical predictions.

    Section 7 discusses various statistical inference techniques, which can be used to distinguish interesting data patterns from random noise. Additionally, this section illustrates the dangers of incorrect inference usage and how these dangers can be best avoided.

    Section 8 introduces the Pandas library, which can be utilized to preprocess tabular data before statistical analysis.

    Section 9 contains the case study solution.

    Case study 3 focuses on the unsupervised clustering of geographic data:

    Section 10 illustrates how measures of centrality can be used to cluster data into groups. The scikit-learn library is also introduced to facilitate efficient clustering.

    Section 11 focuses on geographic data extraction and visualization. Extraction from text is carried out with the GeoNamesCache library, while visualization is achieved using the Cartopy map-plotting library.

    Section 12 contains the case study solution.

    Case study 4 focuses on natural language processing using large-scale numeric computations:

    Section 13 illustrates how to efficiently compute similarities between texts using matrix multiplication. NumPy’s built-in matrix optimizations are used extensively for this purpose.

    Section 14 shows how to utilize dimension reduction for more efficient matrix analysis. Mathematical theory is discussed in conjunction with scikit-learn’s dimension-reduction methods.

    Section 15 applies natural language processing techniques to a very large text dataset. The section discusses how to best explore and cluster that text data.

    Section 16 shows how to extract text from online data using the Beautiful Soup HTML-parsing library.

    Section 17 contains the case study solution.

    Case study 5 completes the book with a discussion of network theory and supervised machine learning:

    Section 18 introduces basic network theory in conjunction with the NetworkX graph analysis library.

    Section 19 shows how to utilize network flow to find clusters in network data. Probabilistic simulations and matrix multiplications are used to achieve effective clustering.

    Section 20 introduces a simple supervised machine learning algorithm based on network theory. Common machine learning evaluation techniques are also illustrated using scikit-learn.

    Section 21 discusses additional machine learning techniques, which rely on memory-efficient linear classifiers.

    Section 22 dives into the flaws of previously introduced supervised learning methodologies. The flaws are subsequently circumvented using nonlinear decision tree classifiers.

    Section 23 contains the case study solution.

    Each section of the book builds on the algorithms and libraries introduced in previous sections. Hence, you are encouraged to go through this book cover to cover to minimize confusion. But if you are already familiar with a subset of the material in the book, feel free to skip that familiar material. Finally, I strongly recommend that you tackle each case study problem on your own before reading the solution. Independently trying to solve each problem will maximize the value of this book.

    About the code

    This book contains many examples of source code, both in numbered listings and inline with normal text. In both cases, the source code is formatted in a fixed-width font like this to separate it from ordinary text. The source code in the listings is structured in modular chunks, with written explanations that precede each modular bit of code. That code presentation style is well suited for display in a Jupyter notebook since notebooks bridge functional code samples with written explanations. Consequently, the source code for each case study is available for download in a Jupyter notebook at www.manning.com/books/data-science-bookcamp. These notebooks combine code listings with summarized explanations from the book. Per usual notebook style, interdependencies exist between separate notebook cells. Thus, it’s recommended that you run the code samples in the exact order they appear in the notebook: otherwise you risk encountering a dependency-driven error.

    about the author

    Leonard Apeltsin is the head of data science at Anomaly. His team applies advanced analytics to uncover healthcare fraud, waste, and abuse. Prior to Anomaly, Leonard led the machine learning development efforts at Primer AI, a startup that specializes in natural language processing. As a founding member, Leonard helped grow the Primer AI team from 4 to nearly 100 employees. Before venturing into startups, Leonard worked in academia, uncovering hidden patterns in genetically linked diseases. His discoveries have been published in the subsidiaries of the journals Science and Nature. Leonard holds BS degrees in biology and computer science from Carnegie Mellon University and a PhD in bioinformatics from the University of California, San Francisco.

    about the cover illustration

    The figure on the cover of Data Science Bookcamp is captioned Habitante du Tyrol, or resident of Tyrol. The illustration is taken from a collection of dress costumes from various countries by Jacques Grasset de Saint-Sauveur (1757–1810), titled Costumes de Différents Pays, published in France in 1797. Each illustration is finely drawn and colored by hand. The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how culturally apart the world’s towns and regions were just 200 years ago. Isolated from each other, people spoke different dialects and languages. On the streets or in the countryside, it was easy to identify where they lived and what their trade or station in life was just by their dress.

    The way we dress has changed since then and the diversity by region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants of different continents, let alone different towns, regions, or countries. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life.

    At a time when it is hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Grasset de Saint-Sauveur’s pictures.

    Part 1. Case study 1: Finding the winning strategy in a card game

    Problem statement

    Would you like to win a bit of money? Let’s wager on a card game for minor stakes. In front of you is a shuffled deck of cards. All 52 cards lie face down. Half the cards are red, and half are black. I will proceed to flip over the cards one by one. If the last card I flip over is red, you’ll win a dollar. Otherwise, you’ll lose a dollar.

    Here’s the twist: you can ask me to halt the game at any time. Once you say Halt, I will flip over the next card and end the game. That next card will serve as the final card. You will win a dollar if it’s red, as shown in figure CS1.1.

    Figure CS1.1 The card-flipping game. We start with a shuffled deck. I repeatedly flip over the top card from the deck. (A) I have just flipped the fourth card. You instruct me to stop. (B) I flip over the fifth and final card. The final card is red. You win a dollar.

    We can play the game as many times as you like. The deck will be reshuffled every time. After each round, we’ll exchange money. What is your best approach to winning this game?
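    If you’d like a feel for the problem before tackling it, here is a minimal simulation sketch of my own; it is not the book’s solution, which is developed in section 4. The play_round helper and the 100,000-round estimate are illustrative assumptions: they measure the naive strategy of never halting, so the last card always decides the outcome.

    import random

    def play_round():
        # Shuffle 26 red and 26 black cards; under the naive "never halt"
        # strategy, the final card decides whether we win a dollar.
        deck = ['red'] * 26 + ['black'] * 26
        random.shuffle(deck)
        return deck[-1] == 'red'

    num_rounds = 100_000
    win_rate = sum(play_round() for _ in range(num_rounds)) / num_rounds
    print(f'Estimated win rate when never halting: {win_rate:.3f}')

    The no-halt baseline wins roughly half the time; the case study asks whether any halting rule can beat that baseline.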

    Overview

    To address the problem at hand, we will need to know how to

    Compute the probabilities of observable events using sample space analysis.

    Plot the probabilities of events across a range of interval values.

    Simulate random processes, such as coin flips and card shuffling, using Python.

    Evaluate our confidence in decisions drawn from simulations using confidence interval analysis.

    1 Computing probabilities using Python

    This section covers

    The basics of probability theory

    Computing probabilities of a single observation

    Computing probabilities across a range of observations

    Few things in life are certain; most things are driven by chance. Whenever we cheer for our favorite sports team, or purchase a lottery ticket, or make an investment in the stock market, we hope for some particular outcome, but that outcome cannot ever be guaranteed. Randomness permeates our day-to-day experiences. Fortunately, that randomness can still be mitigated and controlled. We know that some unpredictable events occur more rarely than others and that certain decisions carry less uncertainty than other much-riskier choices. Driving to work in a car is safer than riding a motorcycle. Investing part of your savings in a retirement account is safer than betting it all on a single hand of blackjack. We can intrinsically sense these trade-offs in certainty because even the most unpredictable systems still show some predictable behaviors. These behaviors have been rigorously studied using probability theory. Probability theory is an inherently complex branch of math. However, aspects of the theory can be understood without knowing the mathematical underpinnings. In fact, difficult probability problems can be solved in Python without needing to know a single math equation. Such an equation-free approach to probability requires a baseline understanding of what mathematicians call a sample space.

    1.1 Sample space analysis: An equation-free approach for measuring uncertainty in outcomes

    Certain actions have measurable outcomes. A sample space is the set of all the possible outcomes an action could produce. Let’s take the simple action of flipping a coin. The coin will land on either heads or tails. Thus, the coin flip will produce one of two measurable outcomes: heads or tails. By storing these outcomes in a Python set, we can create a sample space of coin flips.

    Listing 1.1 Creating a sample space of coin flips

    sample_space = {'Heads', 'Tails'}    ❶

    ❶ Storing elements in curly brackets creates a Python set. A Python set is a collection of unique, unordered elements.

    Suppose we choose an element of sample_space at random. What fraction of the time will the chosen element equal Heads? Well, our sample space holds two possible elements. Each element occupies an equal fraction of the space within the set. Therefore, we expect Heads to be selected with a frequency of 1/2. That frequency is formally defined as the probability of an outcome. All outcomes within sample_space share an identical probability, which is equal to 1 / len(sample_space).

    Listing 1.2 Computing the probability of heads

    probability_heads = 1 / len(sample_space)
    print(f'Probability of choosing heads is {probability_heads}')

    Probability of choosing heads is 0.5

    The probability of choosing Heads equals 0.5. This relates directly to the action of flipping a coin. We’ll assume the coin is unbiased, which means the coin is equally likely to fall on either heads or tails. Thus, a coin flip is conceptually equivalent to choosing a random element from sample_space. The probability of the coin landing on heads is therefore 0.5; the probability of it landing on tails is also equal to 0.5.

    We’ve assigned probabilities to our two measurable outcomes. However, there are additional questions we could ask. What is the probability that the coin lands on either heads or tails? Or, more exotically, what is the probability that the coin will spin forever in the air, landing on neither heads nor tails? To find rigorous answers, we need to define the concept of an event. An event is the subset of those elements within sample_space that satisfy some event condition (as shown in figure 1.1). An event condition is a simple Boolean function whose input is a single sample_space element. The function returns True only if the element satisfies our condition constraints.

    Figure 1.1 Four event conditions applied to a sample space. The sample space contains two outcomes: heads and tails. Arrows represent the event conditions. Every event condition is a yes-or-no function. Each function filters out those outcomes that do not satisfy its terms. The remaining outcomes form an event. Each event contains a subset of the outcomes found in the sample space. Four events are possible: heads, tails, heads or tails, and neither heads nor tails.

    Let’s define two event conditions: one where the coin lands on either heads or tails, and another where the coin lands on neither heads nor tails.

    Listing 1.3 Defining event conditions

    def is_heads_or_tails(outcome):
        return outcome in {'Heads', 'Tails'}

    def is_neither(outcome):
        return not is_heads_or_tails(outcome)

    Also, for the sake of completeness, let’s define event conditions for the two basic events in which the coin satisfies exactly one of our two potential outcomes.

    Listing 1.4 Defining additional event conditions

    def is_heads(outcome):
        return outcome == 'Heads'

    def is_tails(outcome):
        return outcome == 'Tails'

    We can pass event conditions into a generalized get_matching_event function. That function is defined in listing 1.5. Its inputs are an event condition and a generic sample space. The function iterates through the generic sample space and returns the set of outcomes where event_condition(outcome) is True.

    Listing 1.5 Defining an event-detection function

    def get_matching_event(event_condition, sample_space):
        return set([outcome for outcome in sample_space
                    if event_condition(outcome)])

    Let’s execute get_matching_event on our four event conditions. Then we’ll output the four extracted events.

    Listing 1.6 Detecting events using event conditions

    event_conditions = [is_heads_or_tails, is_heads, is_tails, is_neither]
    for event_condition in event_conditions:
        print(f'Event Condition: {event_condition.__name__}')    ❶
        event = get_matching_event(event_condition, sample_space)
        print(f'Event: {event}\n')

    Event Condition: is_heads_or_tails
    Event: {'Tails', 'Heads'}

    Event Condition: is_heads
    Event: {'Heads'}

    Event Condition: is_tails
    Event: {'Tails'}

    Event Condition: is_neither
    Event: set()

    ❶ Prints the name of an event_condition function

    We’ve successfully extracted four events from sample_space. What is the probability of each event occurring? Earlier, we showed that the probability of a single-element outcome for a fair coin is 1 / len(sample_space). This property can be generalized to include multi-element events. The probability of an event is equal to len(event) / len(sample_space), but only if all outcomes are known to occur with equal likelihood. In other words, the probability of a multi-element event for a fair coin is equal to the event size divided by the sample space size. We now use event size to compute the four event probabilities.

    Listing 1.7 Computing event probabilities

    def compute_probability(event_condition, generic_sample_space):
        event = get_matching_event(event_condition, generic_sample_space)    ❶
        return len(event) / len(generic_sample_space)                        ❷

    for event_condition in event_conditions:
        prob = compute_probability(event_condition, sample_space)
        name = event_condition.__name__
        print(f"Probability of event arising from '{name}' is {prob}")

    Probability of event arising from 'is_heads_or_tails' is 1.0
    Probability of event arising from 'is_heads' is 0.5
    Probability of event arising from 'is_tails' is 0.5
    Probability of event arising from 'is_neither' is 0.0

    ❶ The compute_probability function extracts the event associated with an inputted event condition to compute its probability.

    ❷ Probability is equal to event size divided by sample space size.

    The executed code outputs a diverse range of event probabilities, the smallest of which is 0.0 and the largest of which is 1.0. These values represent the lower and upper bounds of probability; no probability can ever fall below 0.0 or rise above 1.0.

    1.1.1 Analyzing a biased coin

    We computed probabilities for an unbiased coin. What would happen if that coin was biased? Suppose, for instance, that a coin is four times more likely to land on heads relative to tails. How do we compute the likelihoods of outcomes that are not weighted in an equal manner? Well, we can construct a weighted sample space represented by a Python dictionary. Each outcome is treated as a key whose value maps to the associated weight. In our example, Heads is weighted four times as heavily as Tails, so we map Tails to 1 and Heads to 4.

    Listing 1.8 Representing a weighted sample space

    weighted_sample_space = {'Heads': 4, 'Tails': 1}

    Our new sample space is stored in a dictionary. This allows us to redefine the size of the sample space as the sum of all dictionary weights. Within weighted_sample_space, that sum will equal 5.

    Listing 1.9 Checking the weighted sample space size

    sample_space_size = sum(weighted_sample_space.values())
    assert sample_space_size == 5

    We can redefine event size in a similar manner. Each event is a set of outcomes, and those outcomes map to weights. Summing over the weights yields the event size. Thus, the size of the event satisfying the is_heads_or_tails event condition is also 5.

    Listing 1.10 Checking the weighted event size

    event = get_matching_event(is_heads_or_tails, weighted_sample_space)    ❶
    event_size = sum(weighted_sample_space[outcome] for outcome in event)
    assert event_size == 5

    ❶ As a reminder, this function iterates over each outcome in the inputted sample space. Thus, it will work as expected on our dictionary input. This is because Python iterates over dictionary keys, not key-value pairs as in many other popular programming languages.
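    As a quick illustration of that iteration behavior (this snippet is mine, not one of the book’s listings), looping over a dictionary yields only its keys:

    for outcome in {'Heads': 4, 'Tails': 1}:
        print(outcome)    # prints the keys, not the key-value pairs

    Heads
    Tails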

    Our generalized definitions of sample space size and event size permit us to create a compute_event_probability function. The function takes as input a generic_sample_space variable that can be either a weighted dictionary or an unweighted set.

    Listing 1.11 Defining a generalized event probability function

    def compute_event_probability(event_condition, generic_sample_space):
        event = get_matching_event(event_condition, generic_sample_space)
        if type(generic_sample_space) == type(set()):                     ❶
            return len(event) / len(generic_sample_space)

        event_size = sum(generic_sample_space[outcome]
                         for outcome in event)
        return event_size / sum(generic_sample_space.values())

    ❶ Checks whether generic_sample_space is a set

    We can now output all the event probabilities for the biased coin without needing to redefine our four event condition functions.

    Listing 1.12 Computing weighted event probabilities

    for event_condition in event_conditions:
        prob = compute_event_probability(event_condition, weighted_sample_space)
        name = event_condition.__name__
        print(f"Probability of event arising from '{name}' is {prob}")

    Probability of event arising from 'is_heads' is 0.8
    Probability of event arising from 'is_tails' is 0.2
    Probability of event arising from 'is_heads_or_tails' is 1.0
    Probability of event arising from 'is_neither' is 0.0

    With just a few lines of code, we have constructed a tool for solving many problems in probability. Let’s apply this tool to problems more complex than a simple coin flip.

    1.2 Computing nontrivial probabilities

    We’ll now solve several example problems using compute_event_probability.

    1.2.1 Problem 1: Analyzing a family with four children

    Suppose a family has four children. What is the probability that exactly two of the children are boys? We’ll assume that each child is equally likely to be either a boy or a girl. Thus we can construct an unweighted sample space where each outcome represents one possible sequence of four children, as shown in figure 1.2.

    Figure 1.2 The sample space for four sibling children. Each row in the sample space contains 1 of 16 possible outcomes. Every outcome represents a unique combination of four children. The sex of each child is indicated by a letter: B for boy and G for girl. Outcomes with two boys are marked by an arrow. There are six such arrows; thus, the probability of two boys equals 6 / 16.

    Listing 1.13 Computing the sample space of children

    possible_children = ['Boy', 'Girl']
    sample_space = set()
    for child1 in possible_children:
        for child2 in possible_children:
            for child3 in possible_children:
                for child4 in possible_children:
                    outcome = (child1, child2, child3, child4)    ❶
                    sample_space.add(outcome)

    ❶ Each possible sequence of four children is represented by a four-element tuple.

    We ran four nested for loops to explore the sequence of four births. This is not an efficient use of code. We can more easily generate our sample space using Python’s built-in itertools.product function, which returns every combination of elements drawn from its input lists (their Cartesian product). Next, we input four instances of the possible_children list into itertools.product. The product function then iterates over all four instances of the list, computing all the combinations of list elements. The final output equals our sample space.

    Listing 1.14 Computing the sample space using product

    from itertools import product
    all_combinations = product(*(4 * [possible_children]))    ❶
    assert set(all_combinations) == sample_space              ❷

    ❶ The * operator unpacks multiple arguments stored within a list. These arguments are then passed into a specified function. Thus, calling product(*(4 * [possible_children])) is equivalent to calling product(possible_children, possible_children, possible_children, possible_children).

    ❷ Note that after running this line, all_combinations will be empty. This is because product returns a Python iterator, which can be iterated over only once. For us, this isn’t an issue. We are about to compute the sample space even more efficiently, and all_combinations will not be used in future code.
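    The single-pass behavior of iterators is easy to confirm directly. The following snippet is an illustrative aside of my own, assuming product has already been imported from itertools as in listing 1.14:

    pairs = product(['Boy', 'Girl'], repeat=2)
    assert len(set(pairs)) == 4    # the first pass consumes the iterator
    assert len(set(pairs)) == 0    # a second pass yields nothing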

    We can make our code even more efficient by executing set(product(possible_children, repeat=4)). In general, running product(possible_children, repeat=n) returns an iterable over all possible combinations of n children.

    Listing 1.15 Passing repeat into product

    sample_space_efficient = set(product(possible_children, repeat=4))
    assert sample_space == sample_space_efficient

    Let’s calculate the fraction of sample_space that is composed of families with two boys. We define a has_two_boys event condition and then pass that condition into compute_event_probability.

    Listing 1.16 Computing the probability of two boys

    def has_two_boys(outcome):
        return len([child for child in outcome
                    if child == 'Boy']) == 2

    prob = compute_event_probability(has_two_boys, sample_space)
    print(f'Probability of 2 boys is {prob}')

    Probability of 2 boys is 0.375

    The probability of exactly two boys being born in a family of four children is 0.375. By implication, we expect 37.5% of families with four children to contain an equal number of boys and girls. Of course, the actual observed percentage of families with two boys will vary due to random chance.
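    One quick way to see that variation (a sketch of my own; random simulations are covered properly in section 3) is to generate simulated families and compare the observed fraction with 0.375:

    import random

    # Simulate 1,000 families of four children and count those with exactly two boys.
    num_families = 1000
    two_boy_families = 0
    for _ in range(num_families):
        children = [random.choice(['Boy', 'Girl']) for _ in range(4)]
        if children.count('Boy') == 2:
            two_boy_families += 1

    # Prints a value that hovers near, but rarely equals, 0.375
    print(f'Observed fraction of two-boy families: {two_boy_families / num_families}')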

    1.2.2 Problem 2: Analyzing multiple die rolls

    Suppose we’re shown a fair six-sided die whose faces are numbered from 1 to 6. The die is rolled six times. What is the probability that these six die rolls add up to 21?

    We begin by defining the possible values of any single roll. These are integers that range from 1 to 6.

    Listing 1.17 Defining all possible rolls of a six-sided die

    possible_rolls = list(range(1, 7))
    print(possible_rolls)

    [1, 2, 3, 4, 5, 6]

    Next, we create the sample space for six consecutive rolls using the product function.

    Listing 1.18 Sample space for six consecutive die rolls

    sample_space = set(product(possible_rolls, repeat=6))

    Finally, we define a has_sum_of_21 event condition that we’ll subsequently pass into compute_event_probability.

    Listing 1.19 Computing the probability of a die-roll sum

    def has_sum_of_21(outcome):
        return sum(outcome) == 21

    prob = compute_event_probability(has_sum_of_21, sample_space)
    print(f'6 rolls sum to 21 with a probability of {prob}')    ❶

    6 rolls sum to 21 with a probability of 0.09284979423868313

    ❶ Conceptually, rolling a single die six times is equivalent to rolling six dice simultaneously.

    The six die rolls will sum to 21 more than 9% of the time. Note that our analysis can be coded more concisely using a lambda expression. Lambda expressions are one-line anonymous functions that do not require a name. In this book, we use lambda expressions to pass short functions into other functions.

    Listing 1.20 Computing the probability using a lambda expression

    prob = compute_event_probability(lambda x: sum(x) == 21, sample_space)    ❶
    assert prob == compute_event_probability(has_sum_of_21, sample_space)

    ❶ Lambda expressions allow us to define short functions in a single line of code. Coding lambda x: is functionally equivalent to coding def func(x):. Thus, lambda x: sum(x) == 21 is functionally equivalent to has_sum_of_21.

    1.2.3 Problem 3: Computing die-roll probabilities using weighted sample spaces

    We’ve just computed the likelihood of six die rolls summing to 21. Now, let’s recompute that probability using a weighted sample space. We need to convert our unweighted sample space set into a weighted sample space dictionary; this will require us to identify all possible die-roll sums. Then we must count the number of times each sum appears across all possible die-roll combinations. These combinations are already stored in our computed sample_space set. By mapping the die-roll sums to their occurrence counts, we will produce a weighted_sample_space result.

    Listing 1.21 Mapping die-roll sums to occurrence counts

    from collections import defaultdict                 ❶

    weighted_sample_space = defaultdict(int)            ❷
    for outcome in sample_space:                        ❸
        total = sum(outcome)                            ❹
        weighted_sample_space[total] += 1               ❺

    ❶ This module returns dictionaries whose keys are all assigned a default value. For instance, defaultdict(int) returns a dictionary where the default value for each key is set to zero.

    ❷ The weighted_sample_space dictionary maps each possible sum of six die rolls to its occurrence count.

    ❸ Each outcome contains a unique combination of six die rolls.

    ❹ Computes the summed value of six unique die rolls

    ❺ Updates the occurrence count for a summed dice value

    Before we recompute our probability, let’s briefly explore the properties of weighted_sample_space. Not all weights in the sample space are equal—some of the weights are much smaller than others. For instance, there is only one way for the rolls to sum to 6: we must roll precisely six 1s to achieve that dice-sum combination. Hence, we expect weighted_sample_space[6] to equal 1. We expect weighted_sample_space[36] to also equal 1, since we must roll six 6s to achieve a sum of 36.

    Listing 1.22 Checking very rare die-roll combinations

    assert weighted_sample_space[6] == 1
    assert weighted_sample_space[36] == 1

    Meanwhile, the value of weighted_sample_space[21] is noticeably higher.

    Listing 1.23 Checking a more common die-roll combination

    num_combinations = weighted_sample_space[21]
    print(f'There are {num_combinations} ways for 6 die rolls to sum to 21')

    There are 4332 ways for 6 die rolls to sum to 21

    As the output shows, there are 4,332 ways for six die rolls to sum to 21. For example, we could roll four 4s, followed by a 3 and then a 2. Or we could roll three 4s followed by a 5, a 3, and a 1. Thousands of other combinations are possible. This is why a sum of 21 is much more probable than a sum of 6.

    Listing 1.24 Exploring different ways of summing to 21

    assert sum([4, 4, 4, 4, 3, 2]) == 21
    assert sum([4, 4, 4, 5, 3, 1]) == 21

    Note that the observed count of 4,332 is equal to the length of an unweighted event whose die rolls add up to 21. Also, the sum of values in weighted_sample_space is equal to the length of sample_space. Hence, a direct link exists between unweighted and weighted event probability computation.

    Listing 1.25 Comparing weighted events and regular events

    event = get_matching_event(lambda x: sum(x) == 21, sample_space)
    assert weighted_sample_space[21] == len(event)
    assert sum(weighted_sample_space.values()) == len(sample_space)

    Let’s now recompute the probability using the weighted_sample_space dictionary. The final probability of rolling a 21 should remain unchanged.

    Listing 1.26 Computing the weighted event probability of die rolls

    prob = compute_event_probability(lambda x: x == 21,
                                     weighted_sample_space)
    assert prob == compute_event_probability(has_sum_of_21, sample_space)
    print(f'6 rolls sum to 21 with a probability of {prob}')

    6 rolls sum to 21 with a probability of 0.09284979423868313

    What is the benefit of using a weighted sample space over an unweighted one? Less memory usage! As we see next, the unweighted sample_space set has on the order of 150 times more elements than the weighted sample space dictionary.

    Listing 1.27 Comparing weighted to unweighted event space size

    print('Number of Elements in Unweighted Sample Space:')
    print(len(sample_space))
    print('Number of Elements in Weighted Sample Space:')
    print(len(weighted_sample_space))

    Number of Elements in Unweighted Sample Space:
    46656
    Number of Elements in Weighted Sample Space:
    31

    1.3 Computing probabilities over interval ranges

    So far, we’ve only analyzed event conditions that satisfy some single value. Now we’ll analyze event conditions that span intervals of values. An interval is the set of all the numbers between and including two boundary cutoffs. Let’s define an is_in_interval function that checks whether a number falls within a specified interval. We’ll control the interval boundaries by passing a minimum and a maximum parameter.

    Listing 1.28 Defining an interval function

    def is_in_interval(number, minimum, maximum):
        return minimum <= number <= maximum    ❶

    ❶ Defines a closed interval in which the min/max boundaries are included. However, it’s also possible to define open intervals when needed. In open intervals, at least one of the boundaries is excluded.
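
    For instance, an open-interval check only requires swapping the comparison operators; the is_in_open_interval helper below is hypothetical and is not used elsewhere in the book.

    def is_in_open_interval(number, minimum, maximum):
        # Hypothetical variant of is_in_interval: both boundary values are excluded
        return minimum < number < maximum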

    Given the is_in_interval function, we can compute the probability that an event’s associated value falls within some numeric range. For instance, let’s compute the likelihood that our six consecutive die rolls sum to a value between 10 and 21 (inclusive).

    Listing 1.29 Computing the probability over an interval

    prob = compute_event_probability(lambda x: is_in_interval(x, 10, 21),  ❶
                                     weighted_sample_space)
    print(f'Probability of interval is {prob}')

    Probability of interval is 0.5446244855967078

    ❶ Lambda function that takes some input x and returns True if x falls in an interval between 10 and 21. This one-line lambda function serves as our event condition.

    The six die rolls will fall into that interval range more than 54% of the time. Thus, if a roll sum of 13 or 20 comes up, we should not be surprised.

    1.3.1 Evaluating extremes using interval analysis

    Interval analysis is critical to solving a whole class of very important problems in probability and statistics. One such problem involves the evaluation of extremes: the problem boils down to whether observed data is too extreme to be believable.

    Data seems extreme when it is too unusual to have occurred by random chance. For instance, suppose we observe 10 flips of an allegedly fair coin, and that coin lands on heads 8 out of 10 times. Is this a sensible result for a fair coin? Or is our coin secretly biased toward landing on heads? To find out, we must answer the following question: what is the probability that 10 fair coin flips lead to an extreme number of heads? We’ll define an extreme head count as eight heads or more. Thus, we can describe the problem as follows: what is the probability that 10 fair coin flips produce from 8 to 10 heads?

    We’ll find our answer by computing an interval probability. However, first we need the sample space for every possible sequence of 10 flipped coins. Let’s generate a weighted sample space. As previously discussed, this is more efficient than using a non-weighted representation.

    The following code creates a weighted_sample_space dictionary. Its keys equal the total number of observable heads, ranging from 0 through 10. These head counts map to values. Each value holds the number of coin-flip combinations that contain the associated head count. We thus expect weighted_sample_space[10] to equal 1, since there is just one possible way to flip a coin 10 times and get 10 heads. Meanwhile, we expect weighted_sample_space[9] to equal 10, since a single tail among 9 heads can occur across 10 different positions.

    Listing 1.30 Computing the sample space for 10 coin flips

    def generate_coin_sample_space(num_flips=10):                        ❶
        weighted_sample_space = defaultdict(int)
        for coin_flips in product(['Heads', 'Tails'], repeat=num_flips):
            heads_count = len([outcome for outcome in coin_flips         ❷
                               if outcome == 'Heads'])
            weighted_sample_space[heads_count] += 1
        return weighted_sample_space

    weighted_sample_space = generate_coin_sample_space()
    assert weighted_sample_space[10] == 1
    assert weighted_sample_space[9] == 10

    ❶ For reusability, we define a general function that returns a weighted sample space for num_flips coin flips. The num_flips parameter is preset to 10 coin flips.

    ❷ Number of heads in a unique sequence of num_flips coin flips

    Our weighted sample space is ready. We now compute the probability of observing an interval from 8 to 10 heads.

    Listing 1.31 Computing an extreme head-count probability

    prob = compute_event_probability(lambda x: is_in_interval(x, 8, 10),
                                     weighted_sample_space)
    print(f'Probability of observing more than 7 heads is {prob}')

    Probability of observing more than 7 heads is 0.0546875
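
    As a quick cross-check (this calculation is not one of the book’s listings), the same value follows from binomial coefficients: of the 2^10 = 1,024 equally likely flip sequences, C(10, 8) + C(10, 9) + C(10, 10) = 45 + 10 + 1 = 56 contain at least eight heads, and 56 / 1,024 = 0.0546875.

    from math import comb

    # Count the flip sequences containing 8, 9, or 10 heads (requires Python 3.8+)
    extreme_sequences = sum(comb(10, heads) for heads in range(8, 11))
    assert extreme_sequences == 56
    assert extreme_sequences / 2 ** 10 == 0.0546875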

    Ten fair coin flips produce more than seven heads approximately 5% of the time. Our observed head count does not commonly occur. Does this mean the coin is biased? Not necessarily. We haven’t yet considered extreme tail counts. If we had observed eight tails and not eight heads, we would have still been suspicious of the coin. Our computed interval did not take this extreme into account—instead, we treated eight or more tails as just another normal possibility. To evaluate the fairness of our coin, we must include the likelihood of observing eight tails or more. This is equivalent to observing two heads or fewer.

    Let’s formulate the problem as follows: what is the probability that 10 fair coin flips produce either 0 to 2 heads or 8 to 10 heads? Or, stated more concisely, what is the probability that the coin flips do not produce from 3 to 7 heads? That probability is computed here.

    Listing 1.32 Computing an extreme interval probability

    prob = compute_event_probability(lambda x: not is_in_interval(x, 3, 7),
                                     weighted_sample_space)
    print(f'Probability of observing more than 7 heads or 7 tails is {prob}')

    Probability of observing more than 7 heads or 7 tails is 0.109375

    Ten fair coin flips produce at least eight identical results approximately 10% of the time. That probability is low but still within the realm of plausibility. Without additional evidence, it’s difficult to decide whether the coin is truly biased. So, let’s collect that evidence. Suppose we flip the coin 10 additional times, and 8 more heads come up. This brings us to 16 heads out of 20 coin flips total. Our confidence in the fairness of the coin has been reduced, but by how much? We can find out by measuring the change in probability. Let’s find the probability of 20 fair coin flips not producing from 5 to 15 heads.

    Listing 1.33 Analyzing extreme head counts for 20 fair coin flips

    weighted_sample_space_20_flips = generate_coin_sample_space(num_flips=20)
    prob = compute_event_probability(lambda x: not is_in_interval(x, 5, 15),
                                     weighted_sample_space_20_flips)
    print(f'Probability of observing more than 15 heads or 15 tails is {prob}')

    Probability of observing more than 15 heads or 15 tails is 0.01181793212890625

    The updated probability has dropped from approximately 0.1 to approximately 0.01. Thus, the added evidence has caused a tenfold decrease in our confidence in the coin’s fairness. Despite this probability drop, the ratio of heads to tails has remained constant at 4 to 1. Both our original and updated experiments produced 80% heads and 20% tails. This leads to an interesting question: why does the probability of observing an extreme result decrease as the coin is flipped more times? We can find out through detailed mathematical analysis. However, a much more intuitive solution is to just visualize the distribution of head counts across our two sample space dictionaries. The visualization would effectively be a plot of keys (head counts) versus values (combination counts) present in each dictionary. We can create this plot using Matplotlib, Python’s most popular visualization library. In the subsequent section, we discuss Matplotlib usage and its application to probability theory.
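
    For readers who want a preview before the Matplotlib discussion, here is a minimal sketch of the kind of plot just described. It assumes the weighted_sample_space and weighted_sample_space_20_flips dictionaries from the listings above are still in memory; the plotting calls themselves are explained in the next section.

    import matplotlib.pyplot as plt

    # Head counts (dictionary keys) versus combination counts (dictionary values)
    x_10 = sorted(weighted_sample_space.keys())
    y_10 = [weighted_sample_space[count] for count in x_10]
    x_20 = sorted(weighted_sample_space_20_flips.keys())
    y_20 = [weighted_sample_space_20_flips[count] for count in x_20]

    plt.plot(x_10, y_10, label='10 coin flips')
    plt.plot(x_20, y_20, label='20 coin flips')
    plt.legend()
    plt.xlabel('Head count')
    plt.ylabel('Number of coin-flip combinations')
    plt.show()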

    Summary

    A sample space is the set of all the possible outcomes an action can produce.

    An event is a subset of the sample space containing just those outcomes that satisfy some event condition. An event condition is a Boolean function that takes as input an outcome and returns either True or False.

    The probability of an event equals the fraction of event outcomes over all the possible outcomes in the entire sample space.

    Probabilities can be computed over numeric intervals. An interval is defined as the set of all the numbers sandwiched between two boundary values.

    Interval probabilities are useful for determining whether an observation appears extreme.

    2 Plotting probabilities using Matplotlib

    This section covers

    Creating simple plots using Matplotlib

    Labeling plotted data

    What is a probability distribution?

    Plotting and comparing multiple probability distributions

    Data plots are among the most valuable tools in any data scientist’s arsenal. Without good visualizations, we are severely limited in our ability to glean insights from our data. Fortunately, we have at our disposal the external Python Matplotlib library, which is well suited to producing high-quality plots and data visualizations. In this section, we use Matplotlib to better comprehend the coin-flip probabilities that we computed in section 1.

    2.1 Basic Matplotlib plots

    Let’s begin by installing the Matplotlib library.

    Note Call pip install matplotlib from the command line terminal to install the Matplotlib library.

    Once installation is complete, import matplotlib.pyplot, which is the library’s main plot-generation module. According to convention, the module is commonly imported using the shortened alias plt.

    Listing 2.1 Importing Matplotlib

    import matplotlib.pyplot as plt

    We will now plot some data using plt.plot. That method takes as input two iterables: x and y. Calling plt.plot(x, y) prepares a 2D plot of x versus y; displaying the plot requires a subsequent call to plt.show(). Let’s assign our x to equal the integers 0 through 9 and our y values to equal double the values of x. The following code visualizes that linear relationship (figure 2.1).

    Listing 2.2 Plotting a linear relationship

    x = range(0, 10)
    y = [2 * value for value in x]
    plt.plot(x, y)
    plt.show()

    Figure 2.1 A Matplotlib plot of x versus 2x. The x variable represents integers 0 through 9.

    Warning The two axes in the linear plot are not scaled equally, so the slope of the plotted line appears less steep than it actually is. We can equalize both axes by calling plt.axis('equal'). However, this will lead to an awkward visualization containing too much empty space. Throughout this book, we rely on Matplotlib’s automated axes adjustments while also carefully observing the adjusted lengths.
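
    Should equal scaling ever be needed despite the extra whitespace, the call is a one-line addition; the following is a minimal sketch reusing the x and y defined above.

    plt.plot(x, y)
    plt.axis('equal')    # Force identical scaling on both axes
    plt.show()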

    The visualization is complete. Within it, our 10 y-axis points have been connected using smooth line segments. If we prefer to visualize the 10 points individually, we can do so using the plt.scatter method (figure 2.2).

    Listing 2.3 Plotting individual data points

    plt.scatter(x, y)
    plt.show()

    Figure 2.2 A Matplotlib scatter plot of x versus 2 * x. The x variable represents integers 0 through 9. The individual integers are visible as scattered points in the plot.

    Suppose we want to emphasize the interval where x begins at 2 and ends at 6. We do this by shading the area under the plotted curve over the specified interval, using the plt.fill_between method. The method takes as input both x and y and also a where parameter, which defines the interval coverage. The input of the where parameter is a list of Boolean values in which an element is True if the x value at the corresponding index falls within the interval we specified. In the following code, we set the where parameter to equal [is_in_interval(value, 2, 6) for value in x]. We also execute plt.plot(x,y) to juxtapose the shaded interval with the smoothly connected line (figure 2.3).

    Listing 2.4 Shading an interval beneath a connected plot

    plt.plot(x, y)
    where = [is_in_interval(value, 2, 6) for value in x]
    plt.fill_between(x, y, where=where)
    plt.show()

    Figure 2.3 A connected plot with a shaded interval. The interval covers all values between 2 and 6.

    So far, we have reviewed three visualization methods: plt.plot, plt.scatter, and plt.fill_between. Let’s execute all three methods in a single plot (figure 2.4). Doing so highlights an interval beneath a continuous line while also exposing individual coordinates.

    Listing 2.5 Exposing individual coordinates within a continuous plot

    plt.scatter(x, y)
    plt.plot(x, y)
    plt.fill_between(x, y, where=where)
    plt.show()

    Figure 2.4 A connected plot and a scatter plot combined with a shaded interval. The individual integers in the plot appear as points marking a smooth, indivisible line.

    No data plot is ever truly complete without descriptive x-axis and y-axis labels. Such labels can be set using the plt.xlabel and plt.ylabel methods (figure 2.5).

    Listing 2.6 Adding axis labels

    plt.plot(x, y)
    plt.xlabel('Values between zero and ten')
    plt.ylabel('Twice the values of x')
    plt.show()

    Figure 2.5 A Matplotlib plot with x-axis and y-axis labels

    Common Matplotlib methods

    plt.plot(x, y)—Plots the elements of x versus the elements of y. The plotted points are connected using smooth line segments.

    plt.scatter(x, y)—Plots the elements of x versus the elements of y. The points are displayed individually as dots rather than being connected by line segments.
