Data Science Bookcamp: Five real-world Python projects
Ebook, 1,376 pages, 14 hours


About this ebook

Learn data science with Python by building five real-world projects! Experiment with card game predictions, tracking disease outbreaks, and more, as you build a flexible and intuitive understanding of data science.

In Data Science Bookcamp you will learn:

- Techniques for computing and plotting probabilities
- Statistical analysis using SciPy
- How to organize datasets with clustering algorithms
- How to visualize complex multi-variable datasets
- How to train a decision tree machine learning algorithm

In Data Science Bookcamp you’ll test and build your knowledge of Python with the kind of open-ended problems that professional data scientists work on every day. Downloadable datasets and thoroughly explained solutions help you lock in what you’ve learned, building your confidence and making you ready for an exciting new data science career.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology
A data science project has a lot of moving parts, and it takes practice and skill to get all the code, algorithms, datasets, formats, and visualizations working together harmoniously. This unique book guides you through five realistic projects, including tracking disease outbreaks from news headlines, analyzing social networks, and finding relevant patterns in ad click data.

About the book
Data Science Bookcamp doesn’t stop with surface-level theory and toy examples. As you work through each project, you’ll learn how to troubleshoot common problems like missing data, messy data, and algorithms that don’t quite fit the model you’re building. You’ll appreciate the detailed setup instructions and the fully explained solutions that highlight common failure points. In the end, you’ll be confident in your skills because you can see the results.

What's inside

- Web scraping
- Organize datasets with clustering algorithms
- Visualize complex multi-variable datasets
- Train a decision tree machine learning algorithm

About the reader
For readers who know the basics of Python. No prior data science or machine learning skills required.

About the author
Leonard Apeltsin is the Head of Data Science at Anomaly, where his team applies advanced analytics to uncover healthcare fraud, waste, and abuse.

Table of Contents
CASE STUDY 1 FINDING THE WINNING STRATEGY IN A CARD GAME
1 Computing probabilities using Python
2 Plotting probabilities using Matplotlib
3 Running random simulations in NumPy
4 Case study 1 solution
CASE STUDY 2 ASSESSING ONLINE AD CLICKS FOR SIGNIFICANCE
5 Basic probability and statistical analysis using SciPy
6 Making predictions using the central limit theorem and SciPy
7 Statistical hypothesis testing
8 Analyzing tables using Pandas
9 Case study 2 solution
CASE STUDY 3 TRACKING DISEASE OUTBREAKS USING NEWS HEADLINES
10 Clustering data into groups
11 Geographic location visualization and analysis
12 Case study 3 solution
CASE STUDY 4 USING ONLINE JOB POSTINGS TO IMPROVE YOUR DATA SCIENCE RESUME
13 Measuring text similarities
14 Dimension reduction of matrix data
15 NLP analysis of large text datasets
16 Extracting text from web pages
17 Case study 4 solution
CASE STUDY 5 PREDICTING FUTURE FRIENDSHIPS FROM SOCIAL NETWORK DATA
18 An introduction to graph theory and network analysis
19 Dynamic graph theory techniques for node ranking and social network analysis
20 Network-driven supervised machine learning
21 Training linear classifiers with logistic regression
22 Training nonlinear classifiers with decision tree techniques
23 Case study 5 solution
Language: English
Publisher: Manning
Release date: Dec 7, 2021
ISBN: 9781638352303

    Book preview

    Data Science Bookcamp - Leonard Apeltsin

    inside front cover

    Core algorithms inside the book

    A trained logistic regression classifier distinguishes between two classes of points by slicing like a cleaver through 3D space (see section 21).

    Data Science Bookcamp

    Five real-world Python projects

    Leonard Apeltsin

    To comment go to liveBook

    Manning

    Shelter Island

    For more information on this and other Manning titles go to

    www.manning.com

    Copyright

    For online information and ordering of these and other Manning books, please visit www.manning.com. The publisher offers discounts on these books when ordered in quantity.

    For more information, please contact

    Special Sales Department

    Manning Publications Co.

    20 Baldwin Road

    PO Box 761

    Shelter Island, NY 11964

    Email: [email protected]

    ©2021 by Manning Publications Co. All rights reserved.

    No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

    Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

    ♾ Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

    ISBN: 9781617296253

    dedication

    To my teacher, Alexander Vishnevsky, who taught me how to think

    brief contents

    Part 1.   Case study 1: Finding the winning strategy in a card game

      1   Computing probabilities using Python

      2   Plotting probabilities using Matplotlib

      3   Running random simulations in NumPy

      4   Case study 1 solution

    Part 2.   Case study 2: Assessing online ad clicks for significance

      5   Basic probability and statistical analysis using SciPy

      6   Making predictions using the central limit theorem and SciPy

      7   Statistical hypothesis testing

      8   Analyzing tables using Pandas

      9   Case study 2 solution

    Part 3.   Case study 3: Tracking disease outbreaks using news headlines

    10   Clustering data into groups

    11   Geographic location visualization and analysis

    12   Case study 3 solution

    Part 4.   Case study 4: Using online job postings to improve your data science resume

    13   Measuring text similarities

    14   Dimension reduction of matrix data

    15   NLP analysis of large text datasets

    16   Extracting text from web pages

    17   Case study 4 solution

    Part 5.   Case study 5: Predicting future friendships from social network data

    18   An introduction to graph theory and network analysis

    19   Dynamic graph theory techniques for node ranking and social network analysis

    20   Network-driven supervised machine learning

    21   Training linear classifiers with logistic regression

    22   Training nonlinear classifiers with decision tree techniques

    23   Case study 5 solution

    contents

    front matter

    preface

    acknowledgments

    about this book

    about the author

    about the cover illustration

    Part 1.   Case study 1: Finding the winning strategy in a card game

      1   Computing probabilities using Python

    1.1  Sample space analysis: An equation-free approach for measuring uncertainty in outcomes

    Analyzing a biased coin

    1.2  Computing nontrivial probabilities

    Problem 1: Analyzing a family with four children

    Problem 2: Analyzing multiple die rolls

    Problem 3: Computing die-roll probabilities using weighted sample spaces

    1.3  Computing probabilities over interval ranges

    Evaluating extremes using interval analysis

      2   Plotting probabilities using Matplotlib

    2.1  Basic Matplotlib plots

    2.2  Plotting coin-flip probabilities

    Comparing multiple coin-flip probability distributions

      3   Running random simulations in NumPy

    3.1  Simulating random coin flips and die rolls using NumPy

    Analyzing biased coin flips

    3.2  Computing confidence intervals using histograms and NumPy arrays

    Binning similar points in histogram plots

    Deriving probabilities from histograms

    Shrinking the range of a high confidence interval

    Computing histograms in NumPy

    3.3  Using confidence intervals to analyze a biased deck of cards

    3.4  Using permutations to shuffle cards

      4   Case study 1 solution

    4.1  Predicting red cards in a shuffled deck

    Estimating the probability of strategy success

    4.2  Optimizing strategies using the sample space for a 10-card deck

    Part 2.   Case study 2: Assessing online ad clicks for significance

    Problem statement

    Dataset description

    Overview

      5   Basic probability and statistical analysis using SciPy

    5.1  Exploring the relationships between data and probability using SciPy

    5.2  Mean as a measure of centrality

    Finding the mean of a probability distribution

    5.3  Variance as a measure of dispersion

    Finding the variance of a probability distribution

      6   Making predictions using the central limit theorem and SciPy

    6.1  Manipulating the normal distribution using SciPy

    Comparing two sampled normal curves

    6.2  Determining the mean and variance of a population through random sampling

    6.3  Making predictions using the mean and variance

    Computing the area beneath a normal curve

    Interpreting the computed probability

      7   Statistical hypothesis testing

    7.1  Assessing the divergence between sample mean and population mean

    7.2  Data dredging: Coming to false conclusions through oversampling

    7.3  Bootstrapping with replacement: Testing a hypothesis when the population variance is unknown

    7.4  Permutation testing: Comparing means of samples when the population parameters are unknown

      8   Analyzing tables using Pandas

    8.1  Storing tables using basic Python

    8.2  Exploring tables using Pandas

    8.3  Retrieving table columns

    8.4  Retrieving table rows

    8.5  Modifying table rows and columns

    8.6  Saving and loading table data

    8.7  Visualizing tables using Seaborn

      9   Case study 2 solution

    9.1  Processing the ad-click table in Pandas

    9.2  Computing p-values from differences in means

    9.3  Determining statistical significance

    9.4  41 shades of blue: A real-life cautionary tale

    Part 3.   Case study 3: Tracking disease outbreaks using news headlines

    Problem statement

    Dataset description

    Overview

    10   Clustering data into groups

    10.1  Using centrality to discover clusters

    10.2  K-means: A clustering algorithm for grouping data into K central groups

    K-means clustering using scikit-learn

    Selecting the optimal K using the elbow method

    10.3  Using density to discover clusters

    10.4  DBSCAN: A clustering algorithm for grouping data based on spatial density

    Comparing DBSCAN and K-means

    Clustering based on non-Euclidean distance

    10.5  Analyzing clusters using Pandas

    11   Geographic location visualization and analysis

    11.1  The great-circle distance: A metric for computing the distance between two global points

    11.2  Plotting maps using Cartopy

    Manually installing GEOS and Cartopy

    Utilizing the Conda package manager

    Visualizing maps

    11.3  Location tracking using GeoNamesCache

    Accessing country information

    Accessing city information

    Limitations of the GeoNamesCache library

    11.4  Matching location names in text

    12   Case study 3 solution

    12.1  Extracting locations from headline data

    12.2  Visualizing and clustering the extracted location data

    12.3  Extracting insights from location clusters

    Part 4.   Case study 4: Using online job postings to improve your data science resume

    Problem statement

    Dataset description

    Overview

    13   Measuring text similarities

    13.1  Simple text comparison

    Exploring the Jaccard similarity

    Replacing words with numeric values

    13.2  Vectorizing texts using word counts

    Using normalization to improve TF vector similarity

    Using unit vector dot products to convert between relevance metrics

    13.3  Matrix multiplication for efficient similarity calculation

    Basic matrix operations

    Computing all-by-all matrix similarities

    13.4  Computational limits of matrix multiplication

    14   Dimension reduction of matrix data

    14.1  Clustering 2D data in one dimension

    Reducing dimensions using rotation

    14.2  Dimension reduction using PCA and scikit-learn

    14.3  Clustering 4D data in two dimensions

    Limitations of PCA

    14.4  Computing principal components without rotation

    Extracting eigenvectors using power iteration

    14.5  Efficient dimension reduction using SVD and scikit-learn

    15   NLP analysis of large text datasets

    15.1  Loading online forum discussions using scikit-learn

    15.2  Vectorizing documents using scikit-learn

    15.3  Ranking words by both post frequency and count

    Computing TFIDF vectors with scikit-learn

    15.4  Computing similarities across large document datasets

    15.5  Clustering texts by topic

    Exploring a single text cluster

    15.6  Visualizing text clusters

    Using subplots to display multiple word clouds

    16   Extracting text from web pages

    16.1  The structure of HTML documents

    16.2  Parsing HTML using Beautiful Soup

    16.3  Downloading and parsing online data

    17   Case study 4 solution

    17.1  Extracting skill requirements from job posting data

    Exploring the HTML for skill descriptions

    17.2  Filtering jobs by relevance

    17.3  Clustering skills in relevant job postings

    Grouping the job skills into 15 clusters

    Investigating the technical skill clusters

    Investigating the soft-skill clusters

    Exploring clusters at alternative values of K

    Analyzing the 700 most relevant postings

    17.4  Conclusion

    Part 5.   Case study 5: Predicting future friendships from social network data

    Problem statement

    Introducing the friend-of-a-friend recommendation algorithm

    Predicting user behavior

    Dataset description

    The Profiles table

    The Observations table

    The Friendships table

    Overview

    18   An introduction to graph theory and network analysis

    18.1  Using basic graph theory to rank websites by popularity

    Analyzing web networks using NetworkX

    18.2  Utilizing undirected graphs to optimize the travel time between towns

    Modeling a complex network of towns and counties

    Computing the fastest travel time between nodes

    19   Dynamic graph theory techniques for node ranking and social network analysis

    19.1  Uncovering central nodes based on expected traffic in a network

    Measuring centrality using traffic simulations

    19.2  Computing travel probabilities using matrix multiplication

    Deriving PageRank centrality from probability theory

    Computing PageRank centrality using NetworkX

    19.3  Community detection using Markov clustering

    19.4  Uncovering friend groups in social networks

    20   Network-driven supervised machine learning

    20.1  The basics of supervised machine learning

    20.2  Measuring predicted label accuracy

    Scikit-learn’s prediction measurement functions

    20.3  Optimizing KNN performance

    20.4  Running a grid search using scikit-learn

    20.5  Limitations of the KNN algorithm

    21   Training linear classifiers with logistic regression

    21.1  Linearly separating customers by size

    21.2  Training a linear classifier

    Improving perceptron performance through standardization

    21.3  Improving linear classification with logistic regression

    Running logistic regression on more than two features

    21.4  Training linear classifiers using scikit-learn

    Training multiclass linear models

    21.5  Measuring feature importance with coefficients

    21.6  Linear classifier limitations

    22   Training nonlinear classifiers with decision tree techniques

    22.1  Automated learning of logical rules

    Training a nested if/else model using two features

    Deciding which feature to split on

    Training if/else models with more than two features

    22.2  Training decision tree classifiers using scikit-learn

    Studying cancerous cells using feature importance

    22.3  Decision tree classifier limitations

    22.4  Improving performance using random forest classification

    22.5  Training random forest classifiers using scikit-learn

    23   Case study 5 solution

    23.1  Exploring the data

    Examining the profiles

    Exploring the experimental observations

    Exploring the Friendships linkage table

    23.2  Training a predictive model using network features

    23.3  Adding profile features to the model

    23.4  Optimizing performance across a steady set of features

    23.5  Interpreting the trained model

    Why are generalizable models so important?

    index

    front matter

    preface

    Another promising candidate had failed their data science interview, and I began to wonder why. The year was 2018, and I was struggling to expand the data science team at my startup. I had interviewed dozens of seemingly qualified candidates, only to reject them all. The latest rejected applicant was an economics PhD from a top-notch school. Recently, the applicant had transitioned into data science after completing a 10-week bootcamp. I asked the applicant to discuss an analytics problem that was very relevant to our company. They immediately brought up a trendy algorithm that was not applicable to the situation. When I tried to debate the algorithm’s incompatibilities, the candidate was at a loss. They didn’t know how the algorithm actually worked or the appropriate circumstances under which to use it. These details hadn’t been taught to them at the bootcamp.

    After the rejected candidate departed, I began to reflect on my own data science education. How different it had been! Back in 2006, data science was not yet a coveted career choice, and DS bootcamps did not yet exist. In those days, I was a poor grad student struggling to pay the rent in pricey San Francisco. My graduate research required me to analyze millions of genetic links to diseases. I realized that my skills were transferable to other areas of analysis, and thus my data science consultancy was born.

    Unbeknownst to my graduate advisor, I began to solicit analytics work from random Bay Area companies. That freelance work helped pay the bills, so I could not be too choosy about the data-driven assignments I tackled. Thus, I would sign up for a variety of data science tasks, ranging from simple statistical analyses to complex predictive modeling. Sometimes I would find myself overwhelmed by a seemingly intractable data problem, but in the end, I’d persevere. My struggles taught me the nuances of diverse analytics techniques and how to best combine them to reach elegant solutions. More importantly, I learned how common techniques can fail and how to surmount these failure points to deliver impactful results. As my skill set grew, my data science career began to flourish. Eventually, I became a leader in the field.

    Would I have achieved the same level of success through rote memorization at a 10-week bootcamp? Probably not. Many bootcamps prioritize the study of standalone algorithms over more cohesive problem-solving skills. Furthermore, an algorithm’s strengths tend to be emphasized while its weaknesses are glossed over. Consequently, students are sometimes ill prepared to handle data science in real-world settings. That insight inspired me to write this book.

    I decided to replicate my own data science education by exposing you, my readers, to a set of increasingly challenging analytics problems. Additionally, I chose to arm you with tools and techniques required to handle these problems effectively. My aim is to holistically help you cultivate your analytic problem-solving skills. This way, when you interview for that junior data science position, you will be much more likely to get the job.

    acknowledgments

    Writing this book was very hard. I definitely could not have done it alone. Fortunately, my family and friends provided their support during this arduous journey. First and foremost, I thank my mother, Irina Apeltsin. She kept me motivated during those difficult days when the task before me seemed insurmountable. Additionally, I thank my grandmother, Vera Fisher, whose pragmatic advice kept me on track as I plowed through the material for my book.

    Furthermore, I’d like to thank my childhood friend Vadim Stolnik. Vadim is a brilliant graphic designer who helped me with the book’s myriad illustrations. Also, I want to acknowledge my friend and colleague Emmanuel Yera, who had my back during my initial writing efforts. Moreover, I must mention my dear dance partner Alexandria Law, who kept my spirits up during my struggles and also helped pick out this book’s cover.

    Next, I thank my editor at Manning, Elesha Hyde. Over the course of the past three years, you’ve worked tirelessly to ensure that I deliver something truly of value to my readers. I will forever be grateful for your patience, optimism, and ceaseless commitment to quality. You’ve pushed me to become a better writer, and my readers will ultimately benefit from these efforts. Additionally, I’d like to acknowledge my technical development editor Arthur Zubarov and my technical proofreader Raffaella Ventaglio. Your inputs helped me craft a better, cleaner book. I also thank Deirdre Hiam, my project editor; Tiffany Taylor, my copyeditor; Katie Tennant, my proofreader; and everyone else at Manning who had a hand in this book.

    To all the reviewers—Adam Scheller, Adriaan Beiertz, Alan Bogusiewicz, Amaresh Rajasekharan, Ayon Roy, Bill Mitchell, Bob Quintus, David Jacobs, Diego Casella, Duncan McRae, Elias Rangel, Frank L Quintana, Grzegorz Bernas, Jason Hales, Jean-François Morin, Jeff Smith, Jim Amrhein, Joe Justesen, John Kasiewicz, Maxim Kupfer, Michael Johnson, Michał Ambroziewicz, Raffaella Ventaglio, Ravi Sajnani, Robert Diana, Simone Sguazza, Sriram Macharla, and Stuart Woodward—thank you. Your suggestions helped make this a better book.

    about this book

    Open-ended problem-solving abilities are essential for a data science career. Unfortunately, these abilities cannot be acquired simply by reading. To become a problem solver, you must persistently solve difficult problems. With this in mind, I’ve structured my book around case studies: open-ended problems modeled on real-world situations. The case studies range from online advertisement analysis to tracking disease outbreaks using news data. Upon completing these case studies, you will be well suited to begin a career in data science.

    Who should read this book

    This book’s intended reader is an educated novice who is interested in transitioning to a data science career. When I imagine a typical reader, I picture a fourth-year college student studying economics who wishes to explore a broader range of analytics opportunities, or a chemistry major already out of school who is searching for a more data-centric career path. Or perhaps the reader is a successful frontend web developer with a very limited mathematics background who would like to give data science a shot. None of my potential readers have ever taken a data science class, leaving them inexperienced when it comes to diverse data analysis. The purpose of this book is to eliminate that skill deficiency.

    My readers are required to know the bare-bones basics of Python programming. Self-taught beginning Python should be sufficient to explore the exercises in the book. Your mathematical knowledge is not expected to extend beyond basic high-school trigonometry.

    How this book is organized

    This book contains five case studies of increasing difficulty. Each case study begins with a detailed problem statement, which you will need to resolve. The problem statement is followed by two to five sections that introduce the data science skills required to solve the problem. These skill sections cover fundamental libraries, as well as mathematical and algorithmic techniques. The final section of each case study presents the solution to the problem.

    Case study 1 pertains to basic probability theory:

    Section 1 discusses how to compute probabilities using straightforward Python.

    Section 2 introduces the concept of probability distributions. It also introduces the Matplotlib visualization library, which can be used to visualize the distributions.

    Section 3 discusses how to estimate probabilities using randomized simulations. The NumPy numerical computing library is introduced to facilitate efficient simulation execution.

    Section 4 contains the case study solution.

    Case study 2 extends beyond probability into statistics:

    Section 5 introduces simple statistical measures of centrality and dispersion. It also introduces the SciPy scientific computing library, which contains a useful statistics module.

    Section 6 dives deep into the central limit theorem, which can be used to make statistical predictions.

    Section 7 discusses various statistical inference techniques, which can be used to distinguish interesting data patterns from random noise. Additionally, this section illustrates the dangers of incorrect inference usage and how these dangers can be best avoided.

    Section 8 introduces the Pandas library, which can be utilized to preprocess tabular data before statistical analysis.

    Section 9 contains the case study solution.

    Case study 3 focuses on the unsupervised clustering of geographic data:

    Section 10 illustrates how measures of centrality can be used to cluster data into groups. The scikit-learn library is also introduced to facilitate efficient clustering.

    Section 11 focuses on geographic data extraction and visualization. Extraction from text is carried out with the GeoNamesCache library, while visualization is achieved using the Cartopy map-plotting library.

    Section 12 contains the case study solution.

    Case study 4 focuses on natural language processing using large-scale numeric computations:

    Section 13 illustrates how to efficiently compute similarities between texts using matrix multiplication. NumPy’s built-in matrix optimizations are used extensively for this purpose.

    Section 14 shows how to utilize dimension reduction for more efficient matrix analysis. Mathematical theory is discussed in conjunction with scikit-learn’s dimension-reduction methods.

    Section 15 applies natural language processing techniques to a very large text dataset. The section discusses how to best explore and cluster that text data.

    Section 16 shows how to extract text from online data using the Beautiful Soup HTML-parsing library.

    Section 17 contains the case study solution.

    Case study 5 completes the book with a discussion of network theory and supervised machine learning:

    Section 18 introduces basic network theory in conjunction with the NetworkX graph analysis library.

    Section 19 shows how to utilize network flow to find clusters in network data. Probabilistic simulations and matrix multiplications are used to achieve effective clustering.

    Section 20 introduces a simple supervised machine learning algorithm based on network theory. Common machine learning evaluation techniques are also illustrated using scikit-learn.

    Section 21 discusses additional machine learning techniques, which rely on memory-efficient linear classifiers.

    Section 22 dives into the flaws of previously introduced supervised learning methodologies. The flaws are subsequently circumvented using nonlinear decision tree classifiers.

    Section 23 contains the case study solution.

    Each section of the book builds on the algorithms and libraries introduced in previous sections. Hence, you are encouraged to go through this book cover to cover to minimize confusion. But if you are already familiar with a subset of the material in the book, feel free to skip that familiar material. Finally, I strongly recommend that you tackle each case study problem on your own before reading the solution. Independently trying to solve each problem will maximize the value of this book.

    About the code

    This book contains many examples of source code, both in numbered listings and inline with normal text. In both cases, the source code is formatted in a fixed-width font like this to separate it from ordinary text. The source code in the listings is structured in modular chunks, with written explanations that precede each modular bit of code. That code presentation style is well suited for display in a Jupyter notebook since notebooks bridge functional code samples with written explanations. Consequently, the source code for each case study is available for download in a Jupyter notebook at www.manning.com/books/data-science-bookcamp. These notebooks combine code listings with summarized explanations from the book. Per usual notebook style, interdependencies exist between separate notebook cells. Thus, it’s recommended that you run the code samples in the exact order they appear in the notebook: otherwise you risk encountering a dependency-driven error.

    about the author

    Leonard Apeltsin is the head of data science at Anomaly. His team applies advanced analytics to uncover healthcare fraud, waste, and abuse. Prior to Anomaly, Leonard led the machine learning development efforts at Primer AI, a startup that specializes in natural language processing. As a founding member, Leonard helped grow the Primer AI team from 4 to nearly 100 employees. Before venturing into startups, Leonard worked in academia, uncovering hidden patterns in genetically linked diseases. His discoveries have been published in the subsidiaries of the journals Science and Nature. Leonard holds BS degrees in biology and computer science from Carnegie Mellon University and a PhD in bioinformatics from the University of California, San Francisco.

    about the cover illustration

    The figure on the cover of Data Science Bookcamp is captioned Habitante du Tyrol, or resident of Tyrol. The illustration is taken from a collection of dress costumes from various countries by Jacques Grasset de Saint-Sauveur (1757–1810), titled Costumes de Différents Pays, published in France in 1797. Each illustration is finely drawn and colored by hand. The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how culturally apart the world’s towns and regions were just 200 years ago. Isolated from each other, people spoke different dialects and languages. On the streets or in the countryside, it was easy to identify where they lived and what their trade or station in life was just by their dress.

    The way we dress has changed since then and the diversity by region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants of different continents, let alone different towns, regions, or countries. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life.

    At a time when it is hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Grasset de Saint-Sauveur’s pictures.

    Part 1. Case study 1: Finding the winning strategy in a card game

    Problem statement

    Would you like to win a bit of money? Let’s wager on a card game for minor stakes. In front of you is a shuffled deck of cards. All 52 cards lie face down. Half the cards are red, and half are black. I will proceed to flip over the cards one by one. If the last card I flip over is red, you’ll win a dollar. Otherwise, you’ll lose a dollar.

    Here’s the twist: you can ask me to halt the game at any time. Once you say Halt, I will flip over the next card and end the game. That next card will serve as the final card. You will win a dollar if it’s red, as shown in figure CS1.1.

    Figure CS1.1 The card-flipping game. We start with a shuffled deck. I repeatedly flip over the top card from the deck. (A) I have just flipped the fourth card. You instruct me to stop. (B) I flip over the fifth and final card. The final card is red. You win a dollar.

    We can play the game as many times as you like. The deck will be reshuffled every time. After each round, we’ll exchange money. What is your best approach to winning this game?
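    If you’d like a feel for the problem before tackling it, here is a minimal simulation sketch of my own; it is not the book’s solution, which is developed in section 4. The play_round helper and the 100,000-round estimate are illustrative assumptions: they measure the naive strategy of never halting, so the last card always decides the outcome.

    import random

    def play_round():
        # Shuffle 26 red and 26 black cards; under the naive "never halt"
        # strategy, the final card decides whether we win a dollar.
        deck = ['red'] * 26 + ['black'] * 26
        random.shuffle(deck)
        return deck[-1] == 'red'

    num_rounds = 100_000
    win_rate = sum(play_round() for _ in range(num_rounds)) / num_rounds
    print(f'Estimated win rate when never halting: {win_rate:.3f}')

    The no-halt baseline wins roughly half the time; the case study asks whether any halting rule can beat that baseline.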

    Overview

    To address the problem at hand, we will need to know how to

    Compute the probabilities of observable events using sample space analysis.

    Plot the probabilities of events across a range of interval values.

    Simulate random processes, such as coin flips and card shuffling, using Python.

    Evaluate our confidence in decisions drawn from simulations using confidence interval analysis.

    1 Computing probabilities using Python

    This section covers

    The basics of probability theory

    Computing probabilities of a single observation

    Computing probabilities across a range of observations

    Few things in life are certain; most things are driven by chance. Whenever we cheer for our favorite sports team, or purchase a lottery ticket, or make an investment in the stock market, we hope for some particular outcome, but that outcome cannot ever be guaranteed. Randomness permeates our day-to-day experiences. Fortunately, that randomness can still be mitigated and controlled. We know that some unpredictable events occur more rarely than others and that certain decisions carry less uncertainty than other much-riskier choices. Driving to work in a car is safer than riding a motorcycle. Investing part of your savings in a retirement account is safer than betting it all on a single hand of blackjack. We can intrinsically sense these trade-offs in certainty because even the most unpredictable systems still show some predictable behaviors. These behaviors have been rigorously studied using probability theory. Probability theory is an inherently complex branch of math. However, aspects of the theory can be understood without knowing the mathematical underpinnings. In fact, difficult probability problems can be solved in Python without needing to know a single math equation. Such an equation-free approach to probability requires a baseline understanding of what mathematicians call a sample space.

    1.1 Sample space analysis: An equation-free approach for measuring uncertainty in outcomes

    Certain actions have measurable outcomes. A sample space is the set of all the possible outcomes an action could produce. Let’s take the simple action of flipping a coin. The coin will land on either heads or tails. Thus, the coin flip will produce one of two measurable outcomes: heads or tails. By storing these outcomes in a Python set, we can create a sample space of coin flips.

    Listing 1.1 Creating a sample space of coin flips

    sample_space = {'Heads', 'Tails'}    ❶

    ❶ Storing elements in curly brackets creates a Python set. A Python set is a collection of unique, unordered elements.

    Suppose we choose an element of sample_space at random. What fraction of the time will the chosen element equal Heads? Well, our sample space holds two possible elements. Each element occupies an equal fraction of the space within the set. Therefore, we expect Heads to be selected with a frequency of 1/2. That frequency is formally defined as the probability of an outcome. All outcomes within sample_space share an identical probability, which is equal to 1 / len(sample_space).

    Listing 1.2 Computing the probability of heads

    probability_heads = 1 / len(sample_space)
    print(f'Probability of choosing heads is {probability_heads}')

    Probability of choosing heads is 0.5

    The probability of choosing Heads equals 0.5. This relates directly to the action of flipping a coin. We’ll assume the coin is unbiased, which means the coin is equally likely to fall on either heads or tails. Thus, a coin flip is conceptually equivalent to choosing a random element from sample_space. The probability of the coin landing on heads is therefore 0.5; the probability of it landing on tails is also equal to 0.5.

    We’ve assigned probabilities to our two measurable outcomes. However, there are additional questions we could ask. What is the probability that the coin lands on either heads or tails? Or, more exotically, what is the probability that the coin will spin forever in the air, landing on neither heads nor tails? To find rigorous answers, we need to define the concept of an event. An event is the subset of those elements within sample_space that satisfy some event condition (as shown in figure 1.1). An event condition is a simple Boolean function whose input is a single sample_space element. The function returns True only if the element satisfies our condition constraints.

    Figure 1.1 Four event conditions applied to a sample space. The sample space contains two outcomes: heads and tails. Arrows represent the event conditions. Every event condition is a yes-or-no function. Each function filters out those outcomes that do not satisfy its terms. The remaining outcomes form an event. Each event contains a subset of the outcomes found in the sample space. Four events are possible: heads, tails, heads or tails, and neither heads nor tails.

    Let’s define two event conditions: one where the coin lands on either heads or tails, and another where the coin lands on neither heads nor tails.

    Listing 1.3 Defining event conditions

    def is_heads_or_tails(outcome):
        return outcome in {'Heads', 'Tails'}

    def is_neither(outcome):
        return not is_heads_or_tails(outcome)

    Also, for the sake of completeness, let’s define event conditions for the two basic events in which the coin satisfies exactly one of our two potential outcomes.

    Listing 1.4 Defining additional event conditions

    def is_heads(outcome):
        return outcome == 'Heads'

    def is_tails(outcome):
        return outcome == 'Tails'

    We can pass event conditions into a generalized get_matching_event function. That function is defined in listing 1.5. Its inputs are an event condition and a generic sample space. The function iterates through the generic sample space and returns the set of outcomes where event_condition(outcome) is True.

    Listing 1.5 Defining an event-detection function

    def get_matching_event(event_condition, sample_space):
        return set([outcome for outcome in sample_space
                    if event_condition(outcome)])

    Let’s execute get_matching_event on our four event conditions. Then we’ll output the four extracted events.

    Listing 1.6 Detecting events using event conditions

    event_conditions = [is_heads_or_tails, is_heads, is_tails, is_neither]
    for event_condition in event_conditions:
        print(f'Event Condition: {event_condition.__name__}')    ❶
        event = get_matching_event(event_condition, sample_space)
        print(f'Event: {event}\n')

    Event Condition: is_heads_or_tails
    Event: {'Tails', 'Heads'}

    Event Condition: is_heads
    Event: {'Heads'}

    Event Condition: is_tails
    Event: {'Tails'}

    Event Condition: is_neither
    Event: set()

    ❶ Prints the name of an event_condition function

    We’ve successfully extracted four events from sample_space. What is the probability of each event occurring? Earlier, we showed that the probability of a single-element outcome for a fair coin is 1 / len(sample_space). This property can be generalized to include multi-element events. The probability of an event is equal to len(event) / len(sample_space), but only if all outcomes are known to occur with equal likelihood. In other words, the probability of a multi-element event for a fair coin is equal to the event size divided by the sample space size. We now use event size to compute the four event probabilities.

    Listing 1.7 Computing event probabilities

    def compute_probability(event_condition, generic_sample_space):
        event = get_matching_event(event_condition, generic_sample_space)    ❶
        return len(event) / len(generic_sample_space)                        ❷

    for event_condition in event_conditions:
        prob = compute_probability(event_condition, sample_space)
        name = event_condition.__name__
        print(f"Probability of event arising from '{name}' is {prob}")

    Probability of event arising from 'is_heads_or_tails' is 1.0
    Probability of event arising from 'is_heads' is 0.5
    Probability of event arising from 'is_tails' is 0.5
    Probability of event arising from 'is_neither' is 0.0

    ❶ The compute_probability function extracts the event associated with an inputted event condition to compute its probability.

    ❷ Probability is equal to event size divided by sample space size.

    The executed code outputs a diverse range of event probabilities, the smallest of which is 0.0 and the largest of which is 1.0. These values represent the lower and upper bounds of probability; no probability can ever fall below 0.0 or rise above 1.0.

    1.1.1 Analyzing a biased coin

    We computed probabilities for an unbiased coin. What would happen if that coin was biased? Suppose, for instance, that a coin is four times more likely to land on heads relative to tails. How do we compute the likelihoods of outcomes that are not weighted in an equal manner? Well, we can construct a weighted sample space represented by a Python dictionary. Each outcome is treated as a key whose value maps to the associated weight. In our example, Heads is weighted four times as heavily as Tails, so we map Tails to 1 and Heads to 4.

    Listing 1.8 Representing a weighted sample space

    weighted_sample_space = {'Heads': 4, 'Tails': 1}

    Our new sample space is stored in a dictionary. This allows us to redefine the size of the sample space as the sum of all dictionary weights. Within weighted_sample_space, that sum will equal 5.

    Listing 1.9 Checking the weighted sample space size

    sample_space_size = sum(weighted_sample_space.values())
    assert sample_space_size == 5

    We can redefine event size in a similar manner. Each event is a set of outcomes, and those outcomes map to weights. Summing over the weights yields the event size. Thus, the size of the event satisfying the is_heads_or_tails event condition is also 5.

    Listing 1.10 Checking the weighted event size

    event = get_matching_event(is_heads_or_tails, weighted_sample_space)    ❶
    event_size = sum(weighted_sample_space[outcome] for outcome in event)
    assert event_size == 5

    ❶ As a reminder, this function iterates over each outcome in the inputted sample space. Thus, it will work as expected on our dictionary input. This is because Python iterates over dictionary keys, not key-value pairs as in many other popular programming languages.
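    As a quick illustration of that iteration behavior (this snippet is mine, not one of the book’s listings), looping over a dictionary yields only its keys:

    for outcome in {'Heads': 4, 'Tails': 1}:
        print(outcome)    # prints the keys, not the key-value pairs

    Heads
    Tails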

    Our generalized definitions of sample space size and event size permit us to create a compute_event_probability function. The function takes as input a generic_sample_space variable that can be either a weighted dictionary or an unweighted set.

    Listing 1.11 Defining a generalized event probability function

    def compute_event_probability(event_condition, generic_sample_space):
        event = get_matching_event(event_condition, generic_sample_space)
        if type(generic_sample_space) == type(set()):                     ❶
            return len(event) / len(generic_sample_space)

        event_size = sum(generic_sample_space[outcome]
                         for outcome in event)
        return event_size / sum(generic_sample_space.values())

    ❶ Checks whether generic_sample_space is a set

    We can now output all the event probabilities for the biased coin without needing to redefine our four event condition functions.

    Listing 1.12 Computing weighted event probabilities

    for event_condition in event_conditions:
        prob = compute_event_probability(event_condition, weighted_sample_space)
        name = event_condition.__name__
        print(f"Probability of event arising from '{name}' is {prob}")

    Probability of event arising from 'is_heads' is 0.8
    Probability of event arising from 'is_tails' is 0.2
    Probability of event arising from 'is_heads_or_tails' is 1.0
    Probability of event arising from 'is_neither' is 0.0

    With just a few lines of code, we have constructed a tool for solving many problems in probability. Let’s apply this tool to problems more complex than a simple coin flip.

    1.2 Computing nontrivial probabilities

    We’ll now solve several example problems using compute_event_probability.

    1.2.1 Problem 1: Analyzing a family with four children

    Suppose a family has four children. What is the probability that exactly two of the children are boys? We’ll assume that each child is equally likely to be either a boy or a girl. Thus we can construct an unweighted sample space where each outcome represents one possible sequence of four children, as shown in figure 1.2.

    Figure 1.2 The sample space for four sibling children. Each row in the sample space contains 1 of 16 possible outcomes. Every outcome represents a unique combination of four children. The sex of each child is indicated by a letter: B for boy and G for girl. Outcomes with two boys are marked by an arrow. There are six such arrows; thus, the probability of two boys equals 6 / 16.

    Listing 1.13 Computing the sample space of children

    possible_children = ['Boy', 'Girl']
    sample_space = set()
    for child1 in possible_children:
        for child2 in possible_children:
            for child3 in possible_children:
                for child4 in possible_children:
                    outcome = (child1, child2, child3, child4)    ❶
                    sample_space.add(outcome)

    ❶ Each possible sequence of four children is represented by a four-element tuple.

    We ran four nested for loops to explore the sequence of four births. This is not an efficient use of code. We can more easily generate our sample space using Python’s built-in itertools.product function, which returns every combination of elements drawn from its input lists (their Cartesian product). Next, we input four instances of the possible_children list into itertools.product. The product function then iterates over all four instances of the list, computing all the combinations of list elements. The final output equals our sample space.

    Listing 1.14 Computing the sample space using product

    from itertools import product
    all_combinations = product(*(4 * [possible_children]))    ❶
    assert set(all_combinations) == sample_space              ❷

    ❶ The * operator unpacks multiple arguments stored within a list. These arguments are then passed into a specified function. Thus, calling product(*(4 * [possible_children])) is equivalent to calling product(possible_children, possible_children, possible_children, possible_children).

    ❷ Note that after running this line, all_combinations will be empty. This is because product returns a Python iterator, which can be iterated over only once. For us, this isn’t an issue. We are about to compute the sample space even more efficiently, and all_combinations will not be used in future code.
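    The single-pass behavior of iterators is easy to confirm directly. The following snippet is an illustrative aside of my own, assuming product has already been imported from itertools as in listing 1.14:

    pairs = product(['Boy', 'Girl'], repeat=2)
    assert len(set(pairs)) == 4    # the first pass consumes the iterator
    assert len(set(pairs)) == 0    # a second pass yields nothing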

    We can make our code even more efficient by executing set(product(possible_children, repeat=4)). In general, running product(possible_children, repeat=n) returns an iterable over all possible combinations of n children.

    Listing 1.15 Passing repeat into product

    sample_space_efficient = set(product(possible_children, repeat=4))
    assert sample_space == sample_space_efficient

    Let’s calculate the fraction of sample_space that is composed of families with two boys. We define a has_two_boys event condition and then pass that condition into compute_event_probability.

    Listing 1.16 Computing the probability of two boys

    def has_two_boys(outcome):
        return len([child for child in outcome
                    if child == 'Boy']) == 2

    prob = compute_event_probability(has_two_boys, sample_space)
    print(f'Probability of 2 boys is {prob}')

    Probability of 2 boys is 0.375

    The probability of exactly two boys being born in a family of four children is 0.375. By implication, we expect 37.5% of families with four children to contain an equal number of boys and girls. Of course, the actual observed percentage of families with two boys will vary due to random chance.
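    One quick way to see that variation (a sketch of my own; random simulations are covered properly in section 3) is to generate simulated families and compare the observed fraction with 0.375:

    import random

    # Simulate 1,000 families of four children and count those with exactly two boys.
    num_families = 1000
    two_boy_families = 0
    for _ in range(num_families):
        children = [random.choice(['Boy', 'Girl']) for _ in range(4)]
        if children.count('Boy') == 2:
            two_boy_families += 1

    # Prints a value that hovers near, but rarely equals, 0.375
    print(f'Observed fraction of two-boy families: {two_boy_families / num_families}')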

    1.2.2 Problem 2: Analyzing multiple die rolls

    Suppose we’re shown a fair six-sided die whose faces are numbered from 1 to 6. The die is rolled six times. What is the probability that these six die rolls add up to 21?

    We begin by defining the possible values of any single roll. These are integers that range from 1 to 6.

    Listing 1.17 Defining all possible rolls of a six-sided die

    possible_rolls = list(range(1, 7))
    print(possible_rolls)

    [1, 2, 3, 4, 5, 6]

    Next, we create the sample space for six consecutive rolls using the product function.

    Listing 1.18 Sample space for six consecutive die rolls

    sample_space = set(product(possible_rolls, repeat=6))

    Finally, we define a has_sum_of_21 event condition that we’ll subsequently pass into compute_event_probability.

    Listing 1.19 Computing the probability of a die-roll sum

    def has_sum_of_21(outcome):
        return sum(outcome) == 21

    prob = compute_event_probability(has_sum_of_21, sample_space)
    print(f'6 rolls sum to 21 with a probability of {prob}')    ❶

    6 rolls sum to 21 with a probability of 0.09284979423868313

    ❶ Conceptually, rolling a single die six times is equivalent to rolling six dice simultaneously.

    The six die rolls will sum to 21 more than 9% of the time. Note that our analysis can be coded more concisely using a lambda expression. Lambda expressions are one-line anonymous functions that do not require a name. In this book, we use lambda expressions to pass short functions into other functions.

    Listing 1.20 Computing the probability using a lambda expression

    prob = compute_event_probability(lambda x: sum(x) == 21, sample_space)    ❶
    assert prob == compute_event_probability(has_sum_of_21, sample_space)

    ❶ Lambda expressions allow us to define short functions in a single line of code. Coding lambda x: is functionally equivalent to coding def func(x):. Thus, lambda x: sum(x) == 21 is functionally equivalent to has_sum_of_21.

    1.2.3 Problem 3: Computing die-roll probabilities using weighted sample spaces

    We’ve just computed the likelihood of six die rolls summing to 21. Now, let’s recompute that probability using a weighted sample space. We need to convert our unweighted sample space set into a weighted sample space dictionary; this will require us to identify all possible die-roll sums. Then we must count the number of times each sum appears across all possible die-roll combinations. These combinations are already stored in our computed sample_space set. By mapping the die-roll sums to their occurrence counts, we will produce a weighted_sample_space result.

    Listing 1.21 Mapping die-roll sums to occurrence counts

    from collections import defaultdict                 ❶

    weighted_sample_space = defaultdict(int)            ❷
    for outcome in sample_space:                        ❸
        total = sum(outcome)                            ❹
        weighted_sample_space[total] += 1               ❺

    ❶ This module returns dictionaries whose keys are all assigned a default value. For instance, defaultdict(int) returns a dictionary where the default value for each key is set to zero.

    ❷ The weighted_sample_space dictionary maps each possible sum of six die rolls to its occurrence count.

    ❸ Each outcome contains a unique combination of six die rolls.

    ❹ Computes the summed value of six unique die rolls

    ❺ Updates the occurrence count for a summed dice value

    Before we recompute our probability, let’s briefly explore the properties of weighted_sample_space. Not all weights in the sample space are equal—some of the weights are much smaller than others. For instance, there is only one way for the rolls to sum to 6: we must roll precisely six 1s to achieve that dice-sum combination. Hence, we expect weighted_sample_space[6] to equal 1. We expect weighted_sample_space[36] to also equal 1, since we must roll six 6s to achieve a sum of 36.

    Listing 1.22 Checking very rare die-roll combinations

    assert weighted_sample_space[6] == 1
    assert weighted_sample_space[36] == 1

    Meanwhile, the value of weighted_sample_space[21] is noticeably higher.

    Listing 1.23 Checking a more common die-roll combination

    num_combinations = weighted_sample_space[21]
    print(f'There are {num_combinations} ways for 6 die rolls to sum to 21')

    There are 4332 ways for 6 die rolls to sum to 21

    As the output shows, there are 4,332 ways for six die rolls to sum to 21. For example, we could roll four 4s, followed by a 3 and then a 2. Or we could roll three 4s followed by a 5, a 3, and a 1. Thousands of other combinations are possible. This is why a sum of 21 is much more probable than a sum of 6.

    Listing 1.24 Exploring different ways of summing to 21

    assert sum([4, 4, 4, 4, 3, 2]) == 21
    assert sum([4, 4, 4, 5, 3, 1]) == 21

    Note that the observed count of 4,332 is equal to the length of an unweighted event whose die rolls add up to 21. Also, the sum of values in weighted_sample_space is equal to the length of sample_space. Hence, a direct link exists between unweighted and weighted event probability computation.

    Listing 1.25 Comparing weighted events and regular events

    event = get_matching_event(lambda x: sum(x) == 21, sample_space)
    assert weighted_sample_space[21] == len(event)
    assert sum(weighted_sample_space.values()) == len(sample_space)

    Let’s now recompute the probability using the weighted_sample_space dictionary. The final probability of rolling a 21 should remain unchanged.

    Listing 1.26 Computing the weighted event probability of die rolls

    prob = compute_event_probability(lambda x: x == 21,
                                     weighted_sample_space)
    assert prob == compute_event_probability(has_sum_of_21, sample_space)
    print(f'6 rolls sum to 21 with a probability of {prob}')

    6 rolls sum to 21 with a probability of 0.09284979423868313

    What is the benefit of using a weighted sample space over an unweighted one? Less memory usage! As we see next, the unweighted sample_space set has on the order of 150 times more elements than the weighted sample space dictionary.

    Listing 1.27 Comparing weighted to unweighted event space size

    print('Number of Elements in Unweighted Sample Space:')
    print(len(sample_space))
    print('Number of Elements in Weighted Sample Space:')
    print(len(weighted_sample_space))

    Number of Elements in Unweighted Sample Space:
    46656
    Number of Elements in Weighted Sample Space:
    31

    1.3 Computing probabilities over interval ranges

    So far, we’ve only analyzed event conditions that satisfy some single value. Now we’ll analyze event conditions that span intervals of values. An interval is the set of all the numbers between and including two boundary cutoffs. Let’s define an is_in_interval function that checks whether a number falls within a specified interval. We’ll control the interval boundaries by passing a minimum and a maximum parameter.

    Listing 1.28 Defining an interval function

    def is_in_interval(number, minimum, maximum):
        return minimum <= number <= maximum    ❶

    ❶ Defines a closed interval in which the min/max boundaries are included. However, it’s also possible to define open intervals when needed. In open intervals, at least one of the boundaries is excluded.
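
    For instance, an open-interval check only requires swapping the comparison operators; the is_in_open_interval helper below is hypothetical and is not used elsewhere in the book.

    def is_in_open_interval(number, minimum, maximum):
        # Hypothetical variant of is_in_interval: both boundary values are excluded
        return minimum < number < maximum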

    Given the is_in_interval function, we can compute the probability that an event’s associated value falls within some numeric range. For instance, let’s compute the likelihood that our six consecutive die rolls sum to a value between 10 and 21 (inclusive).

    Listing 1.29 Computing the probability over an interval

    prob = compute_event_probability(lambda x: is_in_interval(x, 10, 21),  ❶
                                     weighted_sample_space)
    print(f'Probability of interval is {prob}')

    Probability of interval is 0.5446244855967078

    ❶ Lambda function that takes some input x and returns True if x falls in an interval between 10 and 21. This one-line lambda function serves as our event condition.

    The six die rolls will fall into that interval range more than 54% of the time. Thus, if a roll sum of 13 or 20 comes up, we should not be surprised.

    1.3.1 Evaluating extremes using interval analysis

    Interval analysis is critical to solving a whole class of very important problems in probability and statistics. One such problem involves the evaluation of extremes: the problem boils down to whether observed data is too extreme to be believable.

    Data seems extreme when it is too unusual to have occurred by random chance. For instance, suppose we observe 10 flips of an allegedly fair coin, and that coin lands on heads 8 out of 10 times. Is this a sensible result for a fair coin? Or is our coin secretly biased toward landing on heads? To find out, we must answer the following question: what is the probability that 10 fair coin flips lead to an extreme number of heads? We’ll define an extreme head count as eight heads or more. Thus, we can describe the problem as follows: what is the probability that 10 fair coin flips produce from 8 to 10 heads?

    We’ll find our answer by computing an interval probability. However, first we need the sample space for every possible sequence of 10 flipped coins. Let’s generate a weighted sample space. As previously discussed, this is more efficient than using a non-weighted representation.

    The following code creates a weighted_sample_space dictionary. Its keys equal the total number of observable heads, ranging from 0 through 10. These head counts map to values. Each value holds the number of coin-flip combinations that contain the associated head count. We thus expect weighted_sample_space[10] to equal 1, since there is just one possible way to flip a coin 10 times and get 10 heads. Meanwhile, we expect weighted_sample_space[9] to equal 10, since a single tail among 9 heads can occur across 10 different positions.

    Listing 1.30 Computing the sample space for 10 coin flips

    def generate_coin_sample_space(num_flips=10):                        ❶
        weighted_sample_space = defaultdict(int)
        for coin_flips in product(['Heads', 'Tails'], repeat=num_flips):
            heads_count = len([outcome for outcome in coin_flips         ❷
                               if outcome == 'Heads'])
            weighted_sample_space[heads_count] += 1
        return weighted_sample_space

    weighted_sample_space = generate_coin_sample_space()
    assert weighted_sample_space[10] == 1
    assert weighted_sample_space[9] == 10

    ❶ For reusability, we define a general function that returns a weighted sample space for num_flips coin flips. The num_flips parameter is preset to 10 coin flips.

    ❷ Number of heads in a unique sequence of num_flips coin flips

    Our weighted sample space is ready. We now compute the probability of observing an interval from 8 to 10 heads.

    Listing 1.31 Computing an extreme head-count probability

    prob = compute_event_probability(lambda x: is_in_interval(x, 8, 10),
                                     weighted_sample_space)
    print(f'Probability of observing more than 7 heads is {prob}')

    Probability of observing more than 7 heads is 0.0546875
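
    As a quick cross-check (this calculation is not one of the book’s listings), the same value follows from binomial coefficients: of the 2^10 = 1,024 equally likely flip sequences, C(10, 8) + C(10, 9) + C(10, 10) = 45 + 10 + 1 = 56 contain at least eight heads, and 56 / 1,024 = 0.0546875.

    from math import comb

    # Count the flip sequences containing 8, 9, or 10 heads (requires Python 3.8+)
    extreme_sequences = sum(comb(10, heads) for heads in range(8, 11))
    assert extreme_sequences == 56
    assert extreme_sequences / 2 ** 10 == 0.0546875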

    Ten fair coin flips produce more than seven heads approximately 5% of the time. Our observed head count does not commonly occur. Does this mean the coin is biased? Not necessarily. We haven’t yet considered extreme tail counts. If we had observed eight tails and not eight heads, we would have still been suspicious of the coin. Our computed interval did not take this extreme into account—instead, we treated eight or more tails as just another normal possibility. To evaluate the fairness of our coin, we must include the likelihood of observing eight tails or more. This is equivalent to observing two heads or fewer.

    Let’s formulate the problem as follows: what is the probability that 10 fair coin flips produce either 0 to 2 heads or 8 to 10 heads? Or, stated more concisely, what is the probability that the coin flips do not produce from 3 to 7 heads? That probability is computed here.

    Listing 1.32 Computing an extreme interval probability

    prob = compute_event_probability(lambda x: not is_in_interval(x, 3, 7),
                                     weighted_sample_space)
    print(f'Probability of observing more than 7 heads or 7 tails is {prob}')

    Probability of observing more than 7 heads or 7 tails is 0.109375

    Ten fair coin flips produce at least eight identical results approximately 10% of the time. That probability is low but still within the realm of plausibility. Without additional evidence, it’s difficult to decide whether the coin is truly biased. So, let’s collect that evidence. Suppose we flip the coin 10 additional times, and 8 more heads come up. This brings us to 16 heads out of 20 coin flips total. Our confidence in the fairness of the coin has been reduced, but by how much? We can find out by measuring the change in probability. Let’s find the probability of 20 fair coin flips not producing from 5 to 15 heads.

    Listing 1.33 Analyzing extreme head counts for 20 fair coin flips

    weighted_sample_space_20_flips = generate_coin_sample_space(num_flips=20)
    prob = compute_event_probability(lambda x: not is_in_interval(x, 5, 15),
                                     weighted_sample_space_20_flips)
    print(f'Probability of observing more than 15 heads or 15 tails is {prob}')

    Probability of observing more than 15 heads or 15 tails is 0.01181793212890625

    The updated probability has dropped from approximately 0.1 to approximately 0.01. Thus, the added evidence has caused a tenfold decrease in our confidence in the coin’s fairness. Despite this probability drop, the ratio of heads to tails has remained constant at 4 to 1. Both our original and updated experiments produced 80% heads and 20% tails. This leads to an interesting question: why does the probability of observing an extreme result decrease as the coin is flipped more times? We can find out through detailed mathematical analysis. However, a much more intuitive solution is to just visualize the distribution of head counts across our two sample space dictionaries. The visualization would effectively be a plot of keys (head counts) versus values (combination counts) present in each dictionary. We can create this plot using Matplotlib, Python’s most popular visualization library. In the subsequent section, we discuss Matplotlib usage and its application to probability theory.
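
    For readers who want a preview before the Matplotlib discussion, here is a minimal sketch of the kind of plot just described. It assumes the weighted_sample_space and weighted_sample_space_20_flips dictionaries from the listings above are still in memory; the plotting calls themselves are explained in the next section.

    import matplotlib.pyplot as plt

    # Head counts (dictionary keys) versus combination counts (dictionary values)
    x_10 = sorted(weighted_sample_space.keys())
    y_10 = [weighted_sample_space[count] for count in x_10]
    x_20 = sorted(weighted_sample_space_20_flips.keys())
    y_20 = [weighted_sample_space_20_flips[count] for count in x_20]

    plt.plot(x_10, y_10, label='10 coin flips')
    plt.plot(x_20, y_20, label='20 coin flips')
    plt.legend()
    plt.xlabel('Head count')
    plt.ylabel('Number of coin-flip combinations')
    plt.show()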

    Summary

    A sample space is the set of all the possible outcomes an action can produce.

    An event is a subset of the sample space containing just those outcomes that satisfy some event condition. An event condition is a Boolean function that takes as input an outcome and returns either True or False.

    The probability of an event equals the fraction of event outcomes over all the possible outcomes in the entire sample space.

    Probabilities can be computed over numeric intervals. An interval is defined as the set of all the numbers sandwiched between two boundary values.

    Interval probabilities are useful for determining whether an observation appears extreme.

    2 Plotting probabilities using Matplotlib

    This section covers

    Creating simple plots using Matplotlib

    Labeling plotted data

    What is a probability distribution?

    Plotting and comparing multiple probability distributions

    Data plots are among the most valuable tools in any data scientist’s arsenal. Without good visualizations, we are severely limited in our ability to glean insights from our data. Fortunately, we have at our disposal the external Python Matplotlib library, which is well suited to producing high-quality plots and data visualizations. In this section, we use Matplotlib to better comprehend the coin-flip probabilities that we computed in section 1.

    2.1 Basic Matplotlib plots

    Let’s begin by installing the Matplotlib library.

    Note Call pip install matplotlib from the command line terminal to install the Matplotlib library.

    Once installation is complete, import matplotlib.pyplot, which is the library’s main plot-generation module. According to convention, the module is commonly imported using the shortened alias plt.

    Listing 2.1 Importing Matplotlib

    import matplotlib.pyplot as plt

    We will now plot some data using plt.plot. That method takes as input two iterables: x and y. Calling plt.plot(x, y) prepares a 2D plot of x versus y; displaying the plot requires a subsequent call to plt.show(). Let’s assign our x to equal the integers 0 through 9 and our y values to equal double the values of x. The following code visualizes that linear relationship (figure 2.1).

    Listing 2.2 Plotting a linear relationship

    x = range(0, 10)
    y = [2 * value for value in x]
    plt.plot(x, y)
    plt.show()

    Figure 2.1 A Matplotlib plot of x versus 2x. The x variable represents integers 0 through 9.

    Warning The two axes in the linear plot are not scaled equally, so the slope of the plotted line appears less steep than it actually is. We can equalize both axes by calling plt.axis('equal'). However, this will lead to an awkward visualization containing too much empty space. Throughout this book, we rely on Matplotlib’s automated axes adjustments while also carefully observing the adjusted lengths.
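
    Should equal scaling ever be needed despite the extra whitespace, the call is a one-line addition; the following is a minimal sketch reusing the x and y defined above.

    plt.plot(x, y)
    plt.axis('equal')    # Force identical scaling on both axes
    plt.show()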

    The visualization is complete. Within it, our 10 y-axis points have been connected using smooth line segments. If we prefer to visualize the 10 points individually, we can do so using the plt.scatter method (figure 2.2).

    Listing 2.3 Plotting individual data points

    plt.scatter(x, y)
    plt.show()

    Figure 2.2 A Matplotlib scatter plot of x versus 2 * x. The x variable represents integers 0 through 9. The individual integers are visible as scattered points in the plot.

    Suppose we want to emphasize the interval where x begins at 2 and ends at 6. We do this by shading the area under the plotted curve over the specified interval, using the plt.fill_between method. The method takes as input both x and y and also a where parameter, which defines the interval coverage. The input of the where parameter is a list of Boolean values in which an element is True if the x value at the corresponding index falls within the interval we specified. In the following code, we set the where parameter to equal [is_in_interval(value, 2, 6) for value in x]. We also execute plt.plot(x,y) to juxtapose the shaded interval with the smoothly connected line (figure 2.3).

    Listing 2.4 Shading an interval beneath a connected plot

    plt.plot(x, y)
    where = [is_in_interval(value, 2, 6) for value in x]
    plt.fill_between(x, y, where=where)
    plt.show()

    Figure 2.3 A connected plot with a shaded interval. The interval covers all values between 2 and 6.

    So far, we have reviewed three visualization methods: plt.plot, plt.scatter, and plt.fill_between. Let’s execute all three methods in a single plot (figure 2.4). Doing so highlights an interval beneath a continuous line while also exposing individual coordinates.

    Listing 2.5 Exposing individual coordinates within a continuous plot

    plt.scatter(x, y)
    plt.plot(x, y)
    plt.fill_between(x, y, where=where)
    plt.show()

    Figure 2.4 A connected plot and a scatter plot combined with a shaded interval. The individual integers in the plot appear as points marking a smooth, indivisible line.

    No data plot is ever truly complete without descriptive x-axis and y-axis labels. Such labels can be set using the plt.xlabel and plt.ylabel methods (figure 2.5).

    Listing 2.6 Adding axis labels

    plt.plot(x, y)
    plt.xlabel('Values between zero and ten')
    plt.ylabel('Twice the values of x')
    plt.show()

    Figure 2.5 A Matplotlib plot with x-axis and y-axis labels

    Common Matplotlib methods

    plt.plot(x, y)—Plots the elements of x versus the elements of y. The plotted points are connected using smooth line segments.

    plt.scatter(x, y)—Plots the elements of x versus the elements of y. The points are displayed individually as dots rather than being connected by line segments.
