Mastering Python for Data Science
Mastering Python for Data Science
Copyright
Mastering Python for Data Science
© 2025 Clive et al
First Edition
Published in Nigeria
ISBN: 978-1-234567-89-0
Cover design by Brainiacdesigns
For permissions, inquiries or feedback, contact:
[email protected]
ii
CONTENTS
DEDICATION xiii
ACKNOWLEDGEMENT xiv
PREFACE xv
iv
MODULE 5: FILE HANDLING IN PYTHON FOR DATA
SCIENCE 107
• Introduction
• Basics of File Handling in Python
• Reading Files
• Writing and Appending to Files
• Working with CSV Files
• Handling JSON Files
• Best Practices for Closing Files
• Parsing and Extracting Structured Data
• Handling Missing Values in Files Using pandas
• Integrating File Handling with the Data Science Ecosystem
• Review Questions
vi
• Sorting and Set Operations
• Advanced NumPy Functions
• File I/O with NumPy
• Pseudorandom Number Generation
• Review Questions
vii
• Python Built-in Exceptions
• Importance of Exception Handling in Data Science
• Basic Exception Handling Techniques
• Advanced Exception Handling
• Applying Exception Handling in Data Science
• Debugging and Logging Exceptions
• Best Practices for Exception Handling
• Practical Data Science Scenarios
• Writing Robust and Error-Resilient Code
• Exception Handling in Production Environments
• Review Questions
ix
MODULE 18: REAL-WORLD DATA SCIENCE PROJECTS
517
• End-to-End Data Science Project
• Problem Definition
• Data Collection
• Data Cleaning
• Handling Missing Values, Duplicates, and Inconsistencies
• Model Building
• Feature Selection, Train-Test Split, Algorithm Selection
• Model Evaluation
• Deployment
• Saving the Model (Joblib/Pickle)
• Creating a RESTful API (Flask/FastAPI)
• Docker Containerization
• Cloud Deployment (Heroku, AWS, Azure)
• Feature Engineering in Data Science
• Steps in Feature Engineering
• Feature Creation, Transformation, Selection, Extraction
• Feature Selection
• Ensemble Feature Selection:
• 3ConFA Framework (Chi-Square + IG + DT-RFE)
• Predictive Analytics
• Customer Segmentation
• Sentiment Analysis
• Review Questions
xi
xii
Dedication
This book is dedicated to God almighty and family, friends,
mentors, students. Your support, encouragement, and belief in us
made this journey possible.
xiii
Acknowledgments
We would like to express our heartfelt appreciation to everyone
who contributed to the creation of this book. Special thanks to
our students, whose curiosity and feedback inspired many of the
examples and case studies included. We are grateful to our
colleagues and collaborators for their technical insights and
encouragement throughout the writing process.
We also wish to acknowledge the authors of the following works,
whose content and ideas greatly influenced this book:
• Peter Morgan, for his invaluable contributions in Data
Analysis from Scratch with Python: Step-by-Step Guide, which
provided a foundational understanding of Python in data
analysis.
• Laura Igual & Santi Seguí, for their work in Introduction
to Data Science: A Python Approach to Concepts, Techniques,
and Applications, which greatly enriched the conceptual
depth of the data science methods discussed.
• Wes McKinney, for his comprehensive work in Python for
Data Analysis (2nd Edition), which served as a key reference
for practical Python applications in data analysis.
Thanks
Clive et al
xiv
Preface
Mastering Python for Data Science is designed as a
comprehensive, hands-on guide for learners and professionals
aiming to build solid, practical skills in Python programming and
its applications in data science. This book blends core Python
concepts with real-world data challenges, providing a clear
pathway from fundamental programming techniques to advanced
data analysis, modeling, and visualization.
Each module is structured around practical case studies and real-
life problems drawn from various domains such as finance, health,
social media, and business analytics. To reinforce learning,
modules conclude with practical exercises, short answer questions,
and coding tasks that challenge the reader to apply concepts
immediately.
xv
Target Audience
This book is intended for:
• Students studying data science, computer science, or
analytics.
• Aspiring data scientists and analysts looking to gain
practical Python skills.
• Professionals transitioning into data science roles.
• Anyone with a basic knowledge of programming who
wants to apply Python to solve real-world data problems.
Prerequisites
Before using this book, readers should have a basic understanding
of programming concepts such as variables, loops, and functions.
Familiarity with Python syntax is helpful but not required, as the
book provides a quick refresher in the early modules. A basic
knowledge of mathematics, particularly statistics and linear
algebra, will also be beneficial for understanding data analysis and
machine learning concepts. Access to a computer with Python
installed or a cloud-based coding platform like Google Colab is
recommended to follow along with the practical exercises.
xvi
MODULE 1
FOUNDATIONS OF DATA SCIENCE
AND PYTHON PROGRAMMING
We are about to kick start our journey in the realm of Data
science, where numbers tell stories, Python does the heavy
lifting, and ‘clean data’ is basically a mythical creature. In this
module, we’ll decode the secret language of data nerds (yes,
‘Pandas’ is a library, not a zoo animal), wield Python like a coding
wizard, and learn why ‘NaN’ isn’t just a bread brand but your
worst spreadsheet nightmare. By the end, you’ll be shouting ‘I see
patterns everywhere!’, even in your coffee and soup stain.
This module serves as your gateway into the exciting world of
data science. By the end, readers
will be able to:
1. Explain data science concepts.
2. Write basic Python code for data tasks.
3. Use Jupyter Notebook for analysis.
4. Differentiate Python from R/AI/ML.
1
C. Asuai, H. Houssem & M. Ibrahim Foundation Of Data Science Python Programming
2
C. Asuai, H. Houssem & M. Ibrahim Mastering Python For Data Science
While they’re all related, their roles are distinct: Data Science
explains the why behind the mess (stats), answering, "Here’s why
you’re drunk." AI is the ambitious dreamer, saying, "Let’s build a
robot to drink for you (and watch it fail hilariously)." ML,
however, has learned from experience, predicting with, "I knew
you'd order tequila... because you always do."
Data Science and Data Analysis are closely related but differ in
scope and application. Data Science is a broad, interdisciplinary
field that combines statistics, machine learning, and programming
to extract insights, build predictive models, and develop AI-driven
solutions. It deals with both structured and unstructured data,
using advanced techniques like deep learning and big data
4
C. Asuai, H. Houssem & M. Ibrahim Mastering Python For Data Science
They are almost the same because they share the same goal, which
is to derive insights from data and use it for better decision
making.
Being a data scientist sounds way cooler than being a data analyst.
Although the job functions might be similar and overlapping, it all
deals with discovering patterns and generating insights from data.
"Both cousins are after the same thing: helping organizations make
smarter decisions. But while Data Science is out there building
5
C. Asuai, H. Houssem & M. Ibrahim Foundation Of Data Science Python Programming
"Let’s face it, saying ‘I’m a Data Scientist’ at a party does sound
cooler than ‘I’m a Data Analyst.’ It’s like saying you're a wizard
versus an accountant, even though you both work with magic
(data, that is). But at the end of the day, whether you’re analyzing
historical data or predicting future trends, it’s all about uncovering
insights and using data to make better decisions."
6
C. Asuai, H. Houssem & M. Ibrahim Mastering Python For Data Science
7
C. Asuai, H. Houssem & M. Ibrahim Foundation Of Data Science Python Programming
16
The >>> you see is the prompt where you’ll type code
expressions. To exit the Python interpreter and return to the
command prompt, you can either type exit() or press Ctrl+Z
then press ENTER While some Python programmers execute all
of their Python code in this way using interpreter or command
line style, those doing data analysis or scientific computing make
use of IPython, an enhanced Python interpreter, or Jupyter
notebooks, A web-based interactive environment for writing and
running code, supporting python language originally created
within the IPython project. We will be using jupyter notebooks
for our lessons.
1.7 Python VS R
Python and R are like two superheroes in the data science world,
each with their own special powers. Python is the versatile, all-
rounder hero, loved for its simplicity, scalability, and vast libraries
like Pandas and TensorFlow. It’s perfect for machine learning,
deep learning, and automating tasks basically, the go-to choice for
large-scale, AI-driven projects. R, on the other hand, is the expert
statistician, designed for statistical computing and data
visualization. With powerful libraries like ggplot2 and dplyr, R
shines in exploratory data analysis and statistical modeling,
making it the favorite in academia and research. While Python
dominates the industry, R is still the champion for in-depth,
number-crunching analysis.
1.8 Why Choose Python for Data Science & Machine Learning
9
C. Asuai, H. Houssem & M. Ibrahim Foundation Of Data Science Python Programming
10
C. Asuai, H. Houssem & M. Ibrahim Mastering Python For Data Science
Python didn't just win the data science crown, it bribed the judges
with its "readable syntax" and an army of libraries that do
everything except make coffee (though there's probably a PyBrew
module for that). It's the only language where you can go from
writing "Hello World" to training a neural network in the time it
takes MATLAB to compile "Good morning." With pandas for
data wrestling, Matplotlib for "artistic" bar charts, and scikit-learn
for when you want to predict stock markets but actually just
predict which coworker will steal your lunch, Python is basically
the duct tape of the digital world, holding together everything
from quick scripts to AI that may or may not take over the world.
The best part? When your code fails (and it will), the error
messages are slightly less terrifying than Java's. Now if only it
could fix your Imposter Syndrome too...
Well, sort of… but not that sort of. Here’s the deal: while
mathematical expertise is certainly important, the level of math
knowledge you need really depends on what you’re doing. If
you’re just analyzing data, performing basic analysis, or using pre-
built machine learning models, you don’t need to be able to recite
every formula from memory (unless you're trying to impress at a
data science party, but even then, who’s got the time?). Think of it
like cooking: you don’t need to invent new recipes, but knowing
how to balance flavors helps you create something delicious.
Similarly, knowing linear algebra, statistics, and probability basics
is helpful, but you don’t need to be able to explain a Fourier
transform to your grandma.
It’s not about being a math wizard, but about understanding just
enough to make it work. As the saying goes: "You don’t need to
know how to build a rocket to ride in one, but a little math helps
you not accidentally end up in space." You can definitely survive
with a basic understanding of linear algebra, but as long as you can
Google the math behind a machine learning model and know
when your data’s going off the rails, you're good to go. After all,
math in data science is like the seasoning in a dish: you need just
12
C. Asuai, H. Houssem & M. Ibrahim Mastering Python For Data Science
enough to make it work, but too much, and it’s overcooked. So,
don’t worry if you didn’t ace your calculus exam. Just make sure
you can explain why your model predicted your cat would run for
president.
For any programmer, and by extension, for any data scientist, the
integrated development environment (IDE) is an essential tool.
IDEs are designed to maximize programmer productivity. Thus,
over the years this software has evolved in order to make the
coding task less complicated. Choosing the right IDE for each
person is crucial and, unfortunately, there is no “one-size-fits-all”
programming environment. The best solution is to try the most
popular IDEs among the community and keep whichever fits
better in each case. In general, the basic pieces of any IDE are
three: the editor, the compiler, (or interpreter) and the debugger.
Some IDEs can be used in multiple programming languages,
provided by language-specific plugins, such as Netbeans or Eclipse.
Others are only specific for one language or even a specific
programming task in the case of Python, there are a large number
of specific IDEs, both commercial (PyCharm, WingIDE …) and
open-source. The open-source community helps IDEs to spring
up, thus anyone can customize their own environment and share
it with the rest of the community. For example, Spyder (Scientific
13
C. Asuai, H. Houssem & M. Ibrahim Foundation Of Data Science Python Programming
To start using Jupyter Notebook, users can install it via pip install
jupyter and launch it with jupyter notebook, which opens an
interface in a web browser where they can create and manage
notebooks (.ipynb files) or download ANACONDA
SOFTWARE which has Jupyter Notebook as one of its packages.
Its ability to combine code execution, documentation, and
visualization in a single environment makes Jupyter Notebook
a powerful WIDE for data science, machine learning, and research.
15
C. Asuai, H. Houssem & M. Ibrahim Foundation Of Data Science Python Programming
Installing Anaconda:
1. Download Anaconda from https://www.anaconda.com
2. Follow the installation steps for your operating system
(Windows/Mac/Linux).
3. Open the Anaconda Navigator or use the command line
interface.
16
C. Asuai, H. Houssem & M. Ibrahim Mastering Python For Data Science
notebook files.
18
C. Asuai, H. Houssem & M. Ibrahim Mastering Python For Data Science
19
C. Asuai, H. Houssem & M. Ibrahim Foundation Of Data Science Python Programming
20
C. Asuai, H. Houssem & M. Ibrahim Mastering Python For Data Science
21
C. Asuai, H. Houssem & M. Ibrahim Foundation Of Data Science Python Programming
1.18 Introspection
Introspection is the ability of Python (and Jupyter Notebook) to
examine objects, their attributes, methods, and documentation at
runtime. This helps in understanding how to use functions,
classes, and modules without referring to external documentation.
22
C. Asuai, H. Houssem & M. Ibrahim Mastering Python For Data Science
return "Hello"
In [2]: sample_function??
This will display the function’s definition along with its docstring.
you know what you're doing. Plus, the Python community has
adopted these shorthand conventions, so when you see pd or sns,
you’ll know exactly what’s going on, like a secret handshake
among programmers!
The Python community has adopted a number of naming
conventions for commonly used modules in data science.
Example:
In [1]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
When you see np.arange, this is a reference to the arange function
in NumPy. This is done because it’s considered bad practice in
Python software development to import everything (from numpy
import *) from a large package like NumPy.
24
C. Asuai, H. Houssem & M. Ibrahim Mastering Python For Data Science
25
C. Asuai, H. Houssem & M. Ibrahim Foundation Of Data Science Python Programming
1.22 Comments
In Python, comments are like the helpful notes you leave for
yourself or your future self when you forget what your code was
doing (which happens a lot, trust me). They explain what's going
on in the code, making it easier for others to understand and for
you to avoid scratching your head in confusion a few weeks later.
27
C. Asuai, H. Houssem & M. Ibrahim Foundation Of Data Science Python Programming
A function is like your personal robot that does the heavy lifting
for you without complaining. You give it a task, it performs it,
and then hands you the result,no questions asked. Functions are
defined with the def keyword, followed by a name (because every
robot needs a name, right?), and they can take parameters (the
instructions you give to your robot). When it's time for your
robot to work, you just call it by its name followed by
parentheses, and boom,task completed. If the robot needs any
special tools (i.e., arguments), you just pass them along! It’s the
ultimate life hack to avoid repeating yourself.
28
C. Asuai, H. Houssem & M. Ibrahim Mastering Python For Data Science
def greet(self):
return f"Hello, {self.name}!"
p = Person("Clive")
In [2]: print(p.greet()) # Object method call
We will discuss more on this later in this book.
29
C. Asuai, H. Houssem & M. Ibrahim Foundation Of Data Science Python Programming
QUESTIONS
1. What is the main goal of data science?
2. Name two key differences between data science and data
analysis.
3. Why is "clean data" often called a "mythical creature"?
4. What does EDA stand for, and why is it important?
5. How does machine learning differ from traditional
programming?
6. Why is Python preferred over R for large-scale data tasks?
7. What does NaN mean in a dataset, and why is it problematic?
8. Name two Python libraries used for data manipulation.
9. What is the purpose of Jupyter Notebook?
10. Write a Python code snippet to print "Hello, Data World!".
11. What is the main advantage of using Anaconda for data
science?
12. How do you launch a Jupyter Notebook from the command
line?
13. What does import pandas as pd do?
14. Name one Python library for data visualization.
15. How is a .ipynb file different from a regular Python script?
16. Give one reason why Python is more beginner-friendly than
R.
17. How is AI broader than machine learning?
18. What makes Python better for big data than Excel?
19. Name a task where R might outperform Python.
20. Python's comment system (# and docstrings) is often dismissed
as trivial, but in data science workflows, improper
commenting can create catastrophic misunderstandings.
Analyze how each of these scenarios could lead to failures in a
30
C. Asuai, H. Houssem & M. Ibrahim Mastering Python For Data Science
31
32
MODULE 2
GETTING STARTED WITH PYTHON:
VARIABLES, DATA TYPES AND
FUNCTIONS
This module dives into the magical world of Python, where you'll
meet the essentials: variables, data types, and functions, plus a
sprinkle of essential libraries (just the basics, don’t worry). By the
end of this adventure, you’ll be able to:
1. Understand and Use Variables in Python
• Define and assign values to variables.
• Understand variable naming conventions and best
practices.
• Work with dynamic typing and type inference in
Python.
2. Comprehend Python Data Types
• Understand and differentiate between fundamental
data types (integers, floats, strings, booleans, etc.).
• Work with complex data types (lists, tuples,
dictionaries, and sets).
• Perform type conversions and casting.
3. Master Basic String Operations
• Concatenate and format strings.
• Utilize string methods for manipulation.
4. Understand and Write Functions in Python
• Define functions using the def keyword.
• Use function arguments and return values.
• Work with default arguments, keyword arguments,
and variable-length arguments.
33
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
VARIABLES
A variable in Python is like a box where you can stash your
stuff,except this box doesn’t require any moving or cleaning. It’s a
memory location that holds data. When you assign a variable (aka
give it a name), you're essentially saying, "Hey, this box will hold
this thing over here on the right side of the equals sign!" So, if you
assign x = 5, Python's like, "Got it! The box named 'x' now holds
the number 5, and I’ll be using that box whenever you need it."
It's like giving your cat a name tag... but instead of a cat, it’s data!
When assigning a variable (or name) in Python, you are creating a
reference to the object on the righthand side of the equals sign.
# Example of variables
In [1]: x = 10 # Integer
y = 3.14 # Float
name = "Alice" # String
is_active = True # Boolean
34
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
down on bugs and confusion. It's like giving your future self a
roadmap so you don’t wander into the land of "Why doesn't this
work?" every time you revisit your code. Plus, your colleagues (or
your future self) will love you for it when they can read your code
without needing a secret decoder ring.
DATA TYPES
Variables are like your kitchen pantry, and data types are the
ingredients you store inside them. Now, what exactly is a data
type? It's essentially the kind of ingredient you're working with:
Is it a number you can bake into a cake, a word you can toss into a
soup, or maybe a list of ingredients to cook up later?
In Python, data types are the classification of the value that a
variable holds. These data types determine what kind of
operations you can perform on them without causing a kitchen
disaster (like trying to bake a list of ingredients instead of using it
to store them).
Python supports several built-in data types, each as useful as a
different ingredient in your coding recipe.
1. Integer (int): Whole numbers (e.g., 10, -5).
2. Float (float): Decimal numbers (e.g., 3.14, -0.001).
3. String (str): Text data (e.g., "Hello, World!").
4. Sequence data type (String, List Tuple, Range)
5. Mapping data type
6. Set data type
7. Boolean (bool): Represents True or False.
8. Binary
9. None
Note: Integer (int) and Float (float) are termed Numeric data type
It's crucial to know what kind of "ingredient" you're working
with when cooking up your Python code! Knowing the data type
of a variable helps you decide what operations can be safely
36
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
37
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Example:
In [1]: ival = 10
In [2]: ival ** 2
Out[2]: 100
Floating-point numbers in Python are the drama queens of the
number world. They come with decimal points, always
demanding a bit more attention, and when things get really
serious, they break out the scientific notation.
Python’s float type stores floating-point numbers as double-
precision (64-bit) values. This basically means they’re precise and
can handle a lot of decimals, but don't expect them to always be
perfectly exact,floating-point numbers sometimes have a way of
rounding up or down, just like your hopes for a pizza to show up
in 30 minutes.
# Regular float
pi = 3.14159
print(type(pi)) # Output: <class 'float'>
# Scientific notation
big_number = 1.23e6 # 1.23 * 10^6, a fancy way of writing 1,230,000
print(big_number) # Output: 1230000.0
# If you want a float from integer division, Python's got you covered!
result3 = float(7 // 2)
print(result3) # Output: 3.0
Notice that 7 / 2 produces a float (3.5), while 7 // 2 produces a
whole number (3). But don’t worry, if you really want a floating-
point result, you can easily convert it using float(). Python’s all
about keeping you happy and your numbers neat, even when they
get a little messy.
Ah, yes! If you're in the mood for some "old-school" division,like
the C programming language style,Python's got you covered with
the floor division operator //.
This operator ensures that any division that doesn't result in a
whole number is neatly "floored," meaning it drops the decimal
part and returns just the integer part.
It's like the difference between taking the "easy way" (normal
division) and "getting down to business" (floor division).
For example:
# Floor division
result = 3 // 2
print(result) # Output: 1
39
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
2. Strings
A string in Python is a sequence of characters enclosed in either
single quotes (') or double quotes ("). It can contain letters,
numbers, symbols, or even spaces. Think of a string as a fancy
container that holds text, like a shopping bag full of words..
Ah, the beauty of strings in Python,where you can go all "quote-
ception" and still get the job done! Whether you’re a "single-quote
fan" or a "double-quote enthusiast," Python lets you express your
string literals using either. It’s like picking your favorite flavor of
ice cream,whatever makes you happy!
a = 'Using single quote in writing a string'
b = "This string uses double quote"
Both strings will work equally well, and Python won't judge you
for your choice of quotes. It's like a chill coffee shop where you
can order your drink however you want and still get the same
great taste! The only catch? You should probably avoid mixing
them up unless you're trying to start a fight with your code:
# This is fine:
string3 = "I said 'hello'!"
string4 = 'I said "hello"!'
# But this could cause problems:
string5 = 'I said "hello' # Oops! We forgot the closing quote!
So go ahead,choose your quotes wisely, and let your strings flow!
For multiline strings with line breaks, you can use triple quotes,
either ''' or """:
40
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
c = """
This is a longer string that
spans multiple lines
"""
You might be in for a little surprise,our friendly string c is
sneakier than it looks! Although it just seems like a regular multi-
line message, it’s actually hiding four whole lines inside. Why?
Because even the invisible line breaks after the opening triple
quotes (""") and at the end of each line sneak their way into the
string like uninvited party guests.
Wanna catch them in the act? Just call in the count() method like a
detective on a mission:
In [7]: c.count('\n')
Out[7]: 3
Boom! It finds three newline characters skulking around in there.
The fourth line? That’s just the part after the last newline, quietly
minding its business. Strings: they’re not just text, they’re little
drama queens with hidden secrets
Python strings are immutable; In the world of Python, strings are
divas,they do not like being changed directly! Once you've created
a string, it's set in stone. Try to sneak in and change one of its
characters like this:
In [8]: a = 'this is a string'
In [9]: a[10] = 'f'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-57-5ca625d1e504> in <module>()
----> 1 a[10] = 'f'
TypeError: 'str' object does not support item assignment
41
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
42
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
This turns your sleek little string into a list: ['p', 'y', 't', 'h', 'o', 'n'].
And if you’re only in the mood for a taste of it, slicing steps in:
In [18]: s[:3]
Out[18]: 'pyt'
And bam,just the first three letters: 'pyt'. This slicing trick is pure
magic and works across many Python sequences. It's like your
string has a built-in deli counter!
Now, let’s talk about the mysterious backslash \. It’s not just a
weird slanted line,it’s an escape artist! Use it to sneak special
characters into your strings like \n (newline) or \u2764 (a Unicode
heart ). But if you just want your backslashes to act like... well,
backslashes, you’ll need to double them up:
In [19]: s = '12\\34'
In [20]: print(s)
12\34
Output? 12\34. One backslash to escape the other. Yep, it’s the
buddy system.
Feeling overwhelmed by all those double slashes? Don’t
worry,Python’s got your back with raw strings! Just add an r
before your string and Python will stop trying to be clever:
If you have a string with a lot of backslashes and no special
characters, you might find this a bit annoying. Fortunately, you
can preface the leading quote of the string with r, which means
that the characters should be interpreted as is:
In [21]: s = r'this\has\no\special\characters'
In [22]: s
Out[22]: 'this\\has\\no\\special\\characters'
Now Python won’t try to play detective and interpret those
slashes,it just leaves them alone. What you type is what you get.
The r stands for raw, but we like to think of it as relax, because it
saves you from backslash madness.
43
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Now let’s talk about gluing strings together. If you have two
separate strings and you want to make one mega-string, just add
them like this:
In [23]: a = 'this is the first half '
In [24]: b = 'and this is the second half'
In [25]: a + b
Out[25]: 'this is the first half and this is the second half'
But wait, there’s more! Ever wanted to sneak values into a string,
like prices, quantities, or dramatic punchlines? That’s where
string formatting comes in. Python has a superhero method
called .format() that lets you plug values into placeholders inside a
string template.
In [26]: template = '{0:.2f} {1:s} are worth US${2:d}'
In this string,
• {0:.2f} means to format the first argument as a floating-point
number with two decimal places.
• {1:s} means to format the second argument as a string.
• {2:d} means to format the third argument as an exact integer.
To substitute arguments for these format parameters, we pass a
sequence of arguments to the format method:
In [27]: template.format(4.5560, 'Argentine Pesos', 1)
Out[27]: '4.56 Argentine Pesos are worth US$1'
String formatting in Python is like giving your data a glow up
before sending it out to a fancy party You can dress up your
numbers give your words a trim and even align everything like it
is standing in a military parade.
Want a number to show just two decimal places Python says Sure
thing boss Want your text centered like it is meditating in yoga
class Python nods and goes Namaste.
There are so many ways to format strings it is like having a closet
full of outfits for your variables From old school percent style to
44
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
Imagine a basket where you can throw in fruits but the basket is
super picky It refuses duplicates You drop in an apple another
apple and a banana and it just goes Nope I already have an apple
So in the end your basket only holds one apple and one banana
Example:
text = "Data Science" # String
numbers = [1, 2, 3, 4] # List
coordinates = (10, 20, 30) # Tuple
seq = range(1, 5) # Range
Example:
Agent_Delta = {"name": "Clive", "age": 25, "city": "Sapele"}
Print(Agent_Delta)
47
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Accessing the Value – You just call the agent’s ID (key), and bam!
You get their file (value).
print(Agent_Delta["name"]) # Output: Clive
Adding or Updating Agents (Keys) – New agents can join the
force, or existing ones can get their files updated. Maybe Clive gets
a promotion to age 31!
Agent_Delta["country"] = "Nigeria" # New agent on the team!
Agent_Delta["age"] = 31 # Update agent Clive’s age
48
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
True and False friends and letting them do some heavy lifting in
your code.
Example:
is_raining = True
is_sunny = False
print(is_raining) # Output: True
print(is_sunny) # Output: False
50
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
What is none?
None is Python’s version of null. It represents the absence of a
value, like an invisible placeholder that politely says, “Nothing to
see here, folks!”
In [9]: a = None
In [10]: a is None
Out[10]: True
In [11]: b = 5
In [12]: b is not None
Out[12]: True
If you create a function and forget to return something,don’t
worry! Python’s got you covered. It’ll quietly slip in a None for
you like a waiter who clears your plate without asking.
def mysterious_function():
pass
result = mysterious_function()
print(result) # Output: None
See? No return? No problem. Python just goes, “Eh, let’s make it
None.”
None is also a common default value for function arguments:
def add_and_maybe_multiply(a, b, c=None):
result = a + b
if c is not None:
result = result * c
return result
Note: None is not only a reserved keyword but also a unique
instance of NoneType:
In [13]: type(None)
Out[13]: NoneType
So, the next time you encounter None, don’t be alarmed. It’s just
Python’s elegant way of saying,
51
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
In some languages, such as Visual Basic, the string '6' might get
implicitly converted (or typecasted) to an integer, thus yielding 12.
Yet in other languages, such as JavaScript, the integer 6 might be
casted to a string, yielding the concatenated string '66'. In this
regard Python is considered a strongly typed language, which
means that every object has a specific type (or class), and implicit
conversions will occur only in certain obvious circumstances, such
as the following:
In [1]: a = 32.54
In [2]: b = 4
# String formatting, to be visited later
In [3]: print('a is {0}, b is {1}'.format(type(a), type(b)))
Out [3]: a is <class 'float'>, b is <class 'int'>
Note: The example above returns the data type for variable a and
b.
In [4]: a / b
Out[4]: 8.135
Don’t worry though,we’ll dive deeper into typecasting later in
this module and teach you how to convert types like a pro
magician
53
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
TYPE CASTING
Type casting (or type conversion) is the process of converting a
variable from one data type to another.This is like giving your
variables a wardrobe change , transforming them from one type to
another so they can fit in at different parties (or, well, code
blocks).
Python supports two types of type casting:
54
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
Example:
x = 5 # Integer
y = 2.5 # Float
result = x + y # Integer is converted to float automatically
print(result) # Output: 7.5 (Float)
Here, Python automatically upgrades x from an integer to a float
so the operation goes smoothly. No errors, no drama. Everyone’s
happy.
55
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
57
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
OPERATORS
- Subtraction a–b 7
* Multiplication a*b 30
// Floor Division a // b 3
% Modulus (Remainder) a % b 1
** Exponentiation a ** b 1000
58
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
Example (a = 10, b =
Operator Description Output
3)
== Equal to a == b False
Bitwise
& a&b 1 (0001)
AND
` ` Bitwise OR `a
^ Bitwise a^b 6 (0110)
59
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
XOR
Bitwise
~ ~a -6
NOT
<< Left Shift a << 1 10 (1010)
Exponentiate and
**= x **= 2 x = x ** 2
assign
60
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
Example (x =
Operator Description Output
[1, 2, 3])
the sequence
Returns True if a value does
not in 4 not in x True
NOT exist in the sequence
61
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
2. Comparison Operators
a = 10
b=5
print(a == b) # False
print(a != b) # True
print(a > b) # True
print(a < b) # False
print(a >= b) # True
print(a <= b) # False
3. Logical Operators
x = True
y = False
print(x and y) # False
print(x or y) # True
print(not x) # False
4. Bitwise Operators
a = 5 # Binary: 0101
b = 3 # Binary: 0011
print(a & b) # AND: 1 (0001)
print(a | b) # OR: 7 (0111)
print(a ^ b) # XOR: 6 (0110)
print(~a) # NOT: -6
print(a << 1) # Left Shift: 10 (1010)
print(a >> 1) # Right Shift: 2 (0010)
5. Assignment Operators
x = 10
x += 2 # x = x + 2 → 12
x -= 3 # x = x - 3 → 9
x *= 4 # x = x * 4 → 36
x /= 6 # x = x / 6 → 6.0
x //= 2 # x = x // 2 → 3
x %= 2 # x = x % 2 → 1
x **= 3 # x = x ** 3 → 1
print(x) # Final output: 1
6. Membership Operators
fruits = ["apple", "banana", "cherry"]
62
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
FUNCTIONS
63
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
65
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Scope in Python
Scope refers to the visibility of variables within a program.
Python follows the LEGB (Local, Enclosing, Global, Built-in)
rule to determine variable scope:
• Local Scope – Variables defined inside a function are only
accessible within that function.
• Enclosing Scope – Applies to nested functions where an
inner function can access variables from an outer function.
• Global Scope – Variables defined outside functions are
accessible throughout the program unless modified inside a
function using global.
• Built-in Scope – Includes Python's default functions and
libraries.
Example:
In [1] x = 10 # Global scope
def outer_function():
y = 20 # Enclosing scope
def inner_function():
z = 30 # Local scope
print(x, y, z) # Accessing global, enclosing, and local variables
inner_function()
outer_function()
print(outer())
Nested functions are often used in closures and decorators in
Python, making them powerful for organizing and structuring
code efficiently.
67
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
# Sample dataset
data = [12, 15, 14, 10, 18, 20, 22, 25]
print(compute_statistics(data))
3. Functions for Data Visualization (Matplotlib & Seaborn)
Custom functions can be used to automate plotting for data
exploration.
Example:
import matplotlib.pyplot as plt
# Sample data
categories = ["A", "B", "C"]
values = [10, 20, 15]
68
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
plot_bar_chart(categories, values)
4. Functions in Machine Learning (Scikit-learn)
Functions help in data preprocessing, feature selection, model
training, and evaluation in machine learning.
Example: Training a Machine Learning Model Using a
Function
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Sample dataset
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])
y = np.array([0, 0, 1, 1, 1])
SYNTAX
When I first made the leap from Java and C++ to Python, it was
like finding out that your favorite cereal can also double as an ice
cream topping,mind-blowing! One of the coolest tricks Python
pulled out of its hat was the ability to return multiple values
from a function without any fancy, complicated syntax. It's like
getting a gift bag full of goodies with a single function call. Here’s
how it works:
Here’s an example:
def get_person_info():
name = "Asuai"
age = 30
job = "Data Scientist"
# Returning multiple values as a tuple
return name, age, job
print(age) # Output: 30
print(job) # Output: Data Scientist
In data science (and probably in a few other scientific realms
where we wear lab coats and write code in the dark), you’ll often
find yourself in the magical land of returning multiple values.
Remember when we used that cool tuple trick earlier? Well, here’s
the behind-the-scenes look: The function is actually returning just
one object, but that object is a tuple! Then, Python does a little
sleight of hand to unpack it into multiple variables.
For example, let’s take a step back and see how this looks when
we don’t unpack those values right away:
return_value = f()
# In this case, return_value would be a 3-tuple with a, b, and c
Now, if you’re feeling fancy (and who wouldn’t want to be?), you
might opt for a more organized method of returning values,
especially if your values need labels. Here's where the dictionary
comes into play,kind of like putting your values in labeled boxes:
def f():
a=5
b=6
c=7
# Returning a dictionary instead of a tuple
return {'a': a, 'b': b, 'c': c}
# Call it
return_value = f()
print(return_value)
71
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
def square(x):
return x ** 2
72
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
73
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
74
C. Asuai, H. Houssem & M. Ibrahim Getting Started With Python
75
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
QUESTIONS
1. Explain how Python manages memory for variables and
data types. How does Python handle memory allocation and
garbage collection?
2. Python is a dynamically typed language, but type hints
(typing module) are used in modern Python programming.
Explain the benefits of type hints and provide an example where
they improve code readability.
3. Explain the concept of variable scope in Python. What is a
closure in Python, and how can it be used effectively in data
science applications?
4. Given a long string containing multiple sentences, write a
function that finds and returns the longest word in the string.
Optimize for performance.
5. Write a function that accepts two numbers and performs
division. Ensure the function handles cases where the
denominator is zero and returns a proper error message instead of
raising an exception.
6. Write a Python function that takes any number of
keyword arguments and returns a formatted string of key-value
pairs in alphabetical order.
7. Write a Python class that represents a Temperature object.
Implement methods to convert temperatures between Celsius,
Fahrenheit, and Kelvin.
76
MODULE 3
CONTROL STRUCTURES
Control structures control the flow of program. In this structure,
your Python code learns how to make decisions, repeat itself
(intentionally), and generally behave like a clever little robot. This
module is all about giving your program some brains,with if-else
statements, loops, and loop control tools like break and
continue. Think of it as teaching your code how to choose its
own adventure or do things over and over again without
complaining. Buckle up,Python is about to get a whole lot more
logical (and slightly dramatic)! By the end of this module, the
reader should have understanding on:
1. Introduction to Control Structures
• Importance of control structures in programming.
• Overview of different control structures in Python.
2. Conditional Statements
• if, elif, and else statements.
• Nested conditional statements.
• Using logical operators (and, or, not) in conditions.
3. Looping Constructs
• for loops and iterating over sequences.
• while loops and condition-based iteration.
• Nested loops and their use cases.
4. Loop Control Statements
• break statement to exit loops early.
• continue statement to skip an iteration.
• pass statement as a placeholder.
77
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
INTRODUCTION
Control structures in Python are like the traffic cops of your
code,directing when to stop, go, or take a U-turn! They help
your program make decisions (if), repeat actions (while, for), and
gracefully skip or exit when needed (break, continue, pass).
Python’s built-in keywords make it easy to write logic that flows
smoother than a fresh cup of coffee on a Monday morning.
Whether you're building a calculator, a chatbot, or just trying to
survive your first coding class,control structures are your go-to
tools to keep the code chaos in check.
Example
x = 10
if x > 0:
print("Positive number")
Output: Positive number
If-Else Statements
79
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
80
C. Asuai, H. Houssem & M. Ibrahim Control Structures
Here, both the username and password must match for access to
be granted.
Example 3: Checking Multiple Conditions with or
temperature = 5
raining = True
if temperature < 10 or raining:
print("Wear a jacket!")
else:
print("No jacket needed.")
•This ensures a jacket is worn if either the temperature is
low or it is raining.
Example 4: Using not for Boolean Values
logged_in = False
if not logged_in:
print("Please log in first")
•Since logged_in is False, not logged_in becomes True, and
the message is printed.
Example 5: Complex Decision Making
age = 40
has_valid_id = True
criminal_record = False
if (age >= 18 and has_valid_id) and not criminal_record:
print("You can apply for a driver's license")
else:
print("You cannot apply")
• The person must be 18 or older, have a valid ID, and no
criminal record to apply.
81
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
have your back. Python gives you two main loop types: for loops
(great for going through items like a polite guest at a buffet) and
while loops (perfect when you're not sure how long you’ll need
but you're in it for the long haul). In data science, loops help
automate the boring stuff, so you can focus on the cool stats and
fancy plots.
Types of loops
1. for loops
for loops are for iterating over a collection (like a list or tuple) .
Used when you know the number of iterations (e.g., iterating over
a list, array, or DataFrame).
The standard syntax is:
for variable in iterable:
# Code to execute in each iteration
# Loop through numbers 1 to 5
for i in range(1, 6):
print("Number:", i)
You can advance a for loop to the next iteration, skipping the
remainder of the block, using the continue keyword. Consider
this code, which sums up integers in a list and skips None values:
sequence = [1, 2, None, 4, None, 5]
total = 0
for value in sequence:
if value is None:
continue
total += value
2. while loops
A while loop specifies a condition and a block of code that is to be
executed until the condition evaluates to False or the loop is
explicitly ended with break. It is useful when working with
streaming data or performing operations until a condition is met.
82
C. Asuai, H. Houssem & M. Ibrahim Control Structures
Syntax:
while condition:
# Code to execute in each iteration
Example:
num = 1
while num <= 5:
print(num)
num += 1
Loop control statements alter the flow of loops (e.g., for and
while). Think of loop control statements as traffic signals inside
your loops,they tell your code when to keep going, when to skip a
turn, and when to slam the brakes and stop entirely. Python offers
three main loop control superheroes:
1. break – The “I’m outta here!” statement. It ends the loop
early, like storming out of a boring meeting before it's
over.
2. continue – The “skip this one!” statement. It politely tells
the loop to skip the current iteration and jump to the next.
3. pass – The “nothing to see here” statement. It's a
placeholder that lets your code say “I’ll deal with this
later.”
83
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
84
C. Asuai, H. Houssem & M. Ibrahim Control Structures
print(num)
Output:
0
1
3
4
(2 is skipped, but the loop continues.)
Example 2: continue in a while Loop
num = 0
while num < 5:
num += 1
if num == 3:
continue # Skips when num is 3
print(num)
Output:
1
2
4
5
(3 is skipped.)
86
C. Asuai, H. Houssem & M. Ibrahim Control Structures
RANGE FUNCTION
print(i)
Start, Stop
for i in range(2, 6):
print(i)
Start, Stop, Step
for i in range(1, 10, 2):
print(i)
TERNARY EXPRESSIONS
(CONDITIONAL EXPRESSION)
Output:
Negative
(Since x is -5, it falls into the last condition.)
89
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
90
C. Asuai, H. Houssem & M. Ibrahim Control Structures
QUESTIONS
1. How do control structures improve the efficiency of a
program?
2. How does the elif statement differ from if and else?
3. What are logical operators, and how are they used in
conditional statements?
4. What will be the output of the following code?
x = 10
if x > 5 and x < 15:
print("x is in range")
else:
print("x is out of range")
5. What is an infinite loop, and how can it be avoided?
6. How is the pass statement different from break and
continue?
7. What will be the output of the following code?
for i in range(5):
if i == 3:
break
print(i)
8. How can you use a loop control statement to skip even
numbers in a loop from 1 to 10?
9. Write a code that sums all numbers from 0 to 99,999 that
are multiples of 3 or 5:
10. Write a code that compares two numbers and returns the
minimum number (using the Tenary expression)
11. Consider Table 3-2, write a program that grades the
student of a University based on their examination scores
91
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
1 0-39 F
2 40-45 D
3 46-55 C
4 56-69 B
5 70-100 (>=70) A
6 >100 OUT-OF-RANGE
92
MODULE 4
INTRODUCTION TO PYTHON
LIBRARIES FOR DATA SCIENCE
Python libraries are like a cheat code for programmers: they hand
you powerful tools on a silver platter so you can spend less time
reinventing the wheel and more time doing the fun, brainy stuff.
Want to wrestle with massive datasets? There's a library for that.
Need to crunch numbers like a caffeinated accountant? There's a
library for that too. Dreaming of plotting jaw-dropping graphs
that make your friends go "wow"? Yup, libraries got your back!
By the end of this module, readers will:
1. Understand the concept of Python libraries and their
importance in data science.
2. Learn how Python libraries simplify data analysis,
visualization, and machine learning tasks.
3. Gain knowledge of the different types of Python libraries
used in data science.
4. Learn how to import libraries using Python’s import
statement.
5. Recognize the role of libraries in enhancing efficiency and
reducing repetitive coding.
INTRODUCTION
Python has a rich ecosystem of libraries for data science. These
libraries play a crucial role in making programming more efficient
by providing pre-built functions and tools that eliminate the need
to write complex code from scratch. Instead of manually
implementing algorithms for data manipulation, mathematical
operations, or machine learning, developers can use optimized
93
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Elements of NumPy
• A fast and efficient multidimensional array object ndarray
• Functions for performing element-wise computations with
arrays or mathematical operations between arrays
• Tools for reading and writing array-based datasets to disk
• Linear algebra operations, Fourier transform, and random
number generation
• A mature C API to enable Python extensions and native C or
C++ code to access NumPy’s data structures and computational
facilities
Beyond the fast array-processing capabilities that NumPy adds to
Python, one of its primary uses in data science is as a container for
data to be passed between algorithms and libraries. For numerical
data, NumPy arrays are more efficient for storing and
manipulating data than the other built-in Python data structures.
Also, libraries written in a lower-level language, such as C or
Fortran, can operate on the data stored in a NumPy array without
copying data into some other memory representation.
That’s why most serious number-crunching tools in Python either
worship the NumPy array like a sacred relic or at least make sure
they play nice with it. Long story short: if you're doing numerical
computing in Python, you're basically living in NumPy’s world ,
you’re just paying rent.
95
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Matplotlib
Matplotlib is a popular Python library used for data visualization.
It provides a variety of plotting functions to create static,
animated, and interactive visualizations. The library is highly
customizable and supports plots like line graphs, bar charts,
histograms, scatter plots, and more. The pyplot module in
Matplotlib offers a MATLAB-like interface for easy plotting. It
integrates well with libraries like NumPy and Pandas, making it
useful for data science and machine learning applications.
In short, Matplotlib is like the artsy friend who can turn even the
most boring spreadsheet into a jaw-dropping gallery. Want a
simple line graph? Done. Need a scatter plot that looks like it
belongs in a modern art museum? Easy. Dreaming of a bar chart
so beautiful it deserves its own
Instagram account? Matplotlib says, “Hold my coffee.”
And the best part? It's ridiculously flexible , you can tweak,
stretch, and color your plots until they look just right (or until
you’ve completely forgotten what you were originally analyzing...
oops). With Matplotlib by your side, your data doesn't just speak ,
it throws a full-blown musical concert!
97
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Example
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [4, 5, 6])
plt.show()
SEABORN
Seaborn is a Python data visualization library built on top of
Matplotlib. It provides a high-level interface for creating attractive
and informative statistical graphics. Seaborn simplifies complex
visualizations with functions for drawing histograms, scatter plots,
box plots, violin plots, and heatmaps. It integrates well with
Pandas, allowing for easy visualization of DataFrame-based data.
Seaborn also supports theme customization and statistical analysis,
making it useful for exploratory data analysis and machine
learning.
Think of Seaborn as Matplotlib’s cooler, better-dressed cousin
who shows up to the party and immediately steals the spotlight.
While Matplotlib gives you the raw tools to make a plot, Seaborn
hands you a masterpiece on a silver platter , color-coordinated,
beautifully styled, and ready for Instagram.
Want a heatmap so gorgeous it makes your CPU sweat? A violin
plot so elegant it could play at a royal wedding? Seaborn’s got you
covered. Plus, it plays super nicely with Pandas, so you can throw
a messy DataFrame at it, and Seaborn will somehow turn it into
data art. With Seaborn, your exploratory data analysis isn't just
smart , it’s stunning.
98
C. Asuai, H. Houssem & M. Ibrahim Python Libraries for Data Science
Example
import seaborn as sns
sns.histplot([1, 2, 2, 3, 3, 3])
plt.show()
SCIKIT-LEARN
Scikit-learn is used for machine learning. Scikit-learn is a machine
learning library built from NumPy, SciPy, and Matplotlib. Scikit-
learn offers simple and efficient tools for common tasks in data
analysis and data science such as classification, regression,
clustering, dimensionality reduction, model selection, and
preprocessing.
It includes submodules for such models as:
• Classification: SVM, nearest neighbors, random forest, logistic
regression, etc.
• Regression: Lasso, ridge regression, etc.
• Clustering: k-means, spectral clustering, etc.
• Dimensionality reduction: PCA, feature selection, matrix
factorization, etc.
• Model selection: Grid search, cross-validation, metrics
• Preprocessing: Feature extraction, normalization
Along with pandas, statsmodels, and IPython, scikit-learn has been
critical for enabling Python to be a productive data science
programming language. To use Scikit-learn effectively, it is often
integrated with other libraries in the data science ecosystem.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
99
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
SCIPY LIBRARY
SciPy SciPy (Scientific Python) is an open-source library built on
NumPy that provides advanced mathematical, scientific, and
engineering functions. It is widely used in data science, machine
learning, and scientific computing for tasks like optimization,
signal processing, and statistical analysis.
100
C. Asuai, H. Houssem & M. Ibrahim Python Libraries for Data Science
101
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
return x**2
STATSMODELS
Statsmodels is a Python library for statistical modeling,
hypothesis testing, and econometrics. It provides advanced
statistical tools for analyzing relationships between variables,
making it a powerful alternative to Scikit-learn for regression and
time-series analysis.
Key Elements of Statsmodels
1. Regression Analysis – Linear, logistic, and generalized
linear models.
2. Time-Series Analysis – AR, ARMA, and ARIMA models
for forecasting.
3. Hypothesis Testing – T-tests, ANOVA, and chi-square
tests.
4. Statistical Distributions – Probability distributions and
density estimation.
5. Robust Models – Nonparametric regression and
generalized estimating equations (GEE).
# Sample data
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
X = sm.add_constant(X) # Add intercept term
102
C. Asuai, H. Houssem & M. Ibrahim Python Libraries for Data Science
103
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
104
C. Asuai, H. Houssem & M. Ibrahim Python Libraries for Data Science
QUESTIONS
1. Why are libraries important in data science?
2. How do libraries improve efficiency in programming?
3. What is the difference between a library and a module in
Python?
4. What happens if you try to import a library that is not
installed?
5. What is the difference between import numpy and from
numpy import array?
6. Explain the use of as in the statement import pandas as pd.
7. Why is numpy preferred for numerical computations over
standard Python lists?
8. What is the difference between matplotlib and seaborn for
data visualization?
9. What is scikit-learn used for in data science?
105
106
MODULE 5
FILE HANDLING IN PYTHON FOR
DATA SCIENCE
Working with files (especially CSV, Excel documents) is a very
crucial aspect in datascience. It is imperative that we understand
how these are handled.
Think of files like the treasure chests of data science , hidden
away, locked up, and sometimes a little dusty. Whether it’s a
squeaky clean CSV or a grumpy old Excel file full of weird
formatting decisions, knowing how to open, read, and manipulate
these files is like having the keys to the kingdom. Without proper
file handling skills, you’re basically a pirate without a map
, lots of ambition, but nowhere to sail!
Mastering file handling means you can finally stop fearing the
"File Not Found" error like it's a horror movie jump scare , and
start confidently pulling in data like a boss.
In this module, readers will:
1. Understand File Handling Basics – Learn how to open,
read, write, and append to files in Python.
2. Work with Different File Formats – Explore handling
CSV and JSON files, which are commonly used in data
science.
3. Learn File Closing Best Practices – Understand the
importance of closing files to prevent resource leaks.
4. Integrate File Handling with Data Science Workflows –
Learn how to load and store data efficiently in Python’s
data science ecosystem.
5. Handle Missing Values in Pandas – Discover techniques
for identifying and managing missing data in datasets.
107
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
INTRODUCTION
File handling is essential in data science for reading datasets,
storing processed data, and managing large-scale data pipelines.
Python provides built-in functions and libraries like pandas to
handle various file formats efficiently.
108
C. Asuai, H. Houssem & M. Ibrahim File Handling in Python for Data Science
READING FILES
In Python, reading files is a common operation performed using
the open() function with the 'r' (read) mode. The simplest way is
to use file.read() to fetch the entire content as a string,
or file.readlines() to get a list of lines. For memory efficiency with
large files, looping through the file object line by line is preferred.
1. read() Method
Reads the entire file content as a string.
109
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
3. readlines() Method
Reads all lines and returns a list.
file = open("data.txt", "r")
lines = file.readlines()
print(lines)
file.close()
WRITING TO FILES
In Python, writing to files is done using the open() function with
modes like 'w' (write) or 'a' (append). The 'w' mode overwrites
the file if it exists, while 'a' adds content to the end without
deleting existing data. The write() method is used to insert text,
and writelines() can write a list of strings.
1. write() Method
Writes a string to a file.
file = open("output.txt", "w")
file.write("Hello, Data Science!")
file.close()
For safety, files should be opened in a with block to ensure proper
closing.
Example
# Writing to a file (overwrites existing content)
with open('example.txt', 'w') as file:
110
C. Asuai, H. Houssem & M. Ibrahim File Handling in Python for Data Science
file.write("Hello, World!\n")
file.write("This is a new line.")
2. writelines() Method
Writes a list of strings to a file.
lines = ["Line 1\n", "Line 2\n"]
file = open("output.txt", "w")
file.writelines(lines)
file.close()
For safety, files should be opened in a with block to ensure proper
closing.
lines = ["First line\n", "Second line\n", "Third line\n"]
with open('example.txt', 'w') as file:
file.writelines(lines)
APPENDING TO FILES
Appending allows you to add new content to the end of an
existing file without overwriting it. In Python, this is done by
opening the file in 'a' (append) mode using the open() function.
file = open("output.txt", "a")
file.write("New Data\n")
file.close()
For safety, files should be opened in a with block to ensure proper
closing.
with open("example.txt", "a") as file:
file.write("\nAppending new line.") # Adds content without deleting
existing data
111
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
reader = csv.reader(file)
for row in reader:
print(row)
Using pandas
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())
note: df.head prints the first 5 rows in the dataset/CSV file.
112
C. Asuai, H. Houssem & M. Ibrahim File Handling in Python for Data Science
CLOSING FILES
Properly closing files is essential to free up system resources and
prevent data corruption. Python provides multiple ways to ensure
files are closed correctly.
1. Using with Statement (Recommended)
The safest way to handle files is using the with statement, which
automatically closes the file after the block executes,even if an
error occurs.
with open("example.txt", "r") as file:
content = file.read()
# File is automatically closed here
Method Description
open(filename, mode) Opens a file in the specified mode
read() Reads the entire file as a string
readline() Reads a single line from the file
readlines() Reads all lines and returns a list
write(string) Writes a string to a file
writelines(list) Writes multiple lines from a list
close() Closes the file
csv.reader(file) Reads CSV data into lists
csv.writer(file) Writes data to a CSV file
json.load(file) Loads JSON data from a file
json.dump(data, file) Writes JSON data to a file
114
C. Asuai, H. Houssem & M. Ibrahim File Handling in Python for Data Science
115
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Function Description
string representation.
Read pandas data encoded using the MessagePack
read_msgpack
binary format.
Read an arbitrary object stored in Python pickle
read_pickle
format.
Read a SAS dataset stored in one of the SAS
read_sas
system’s custom storage formats.
Read the results of a SQL query (using
read_sql
SQLAlchemy) as a pandas DataFrame.
read_stata Read a dataset from Stata file format.
read_feather Read the Feather binary file format.
These functions help convert text data into a DataFrame. Each
function has optional arguments that allow customization, such as
specifying delimiters, handling missing values, and defining data
types.
To read this file, you have a couple of options. You can allow
pandas to assign default column names, or you can specify names
yourself:
Examples
In [1]: pd.read_csv('folder1/ex2.csv', header=None)
Out[1]:
01234
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
2 9 10 11 12 foo
117
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
118
C. Asuai, H. Houssem & M. Ibrahim File Handling in Python for Data Science
119
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
df = pd.DataFrame(data)
120
C. Asuai, H. Houssem & M. Ibrahim File Handling in Python for Data Science
Argument Description
na_rep String representation of missing values (e.g., "N/A").
Columns Subset of columns to write.
Mode File mode ("w" for write, "a" for append).
Encoding File encoding (e.g., "utf-8").
121
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
QUESTIONS
1. What is the difference between reading (r), writing (w), and
appending (a) modes in file handling?
2. How do you open a file in Python and ensure it closes
properly?
3. What is the advantage of using the with open() statement
over open() and close()?
4. What will happen if you try to write to a file opened in
read mode (r)?
5. Write a Python script to create a new text file named
data.txt and write the sentence "Python makes file
handling easy." into it.
6. How do you read all lines of a file into a list in Python?
7. Explain the difference between read(), readline(), and
readlines().
8. Write a Python program to read a file named example.txt
and print its contents line by line.
9. How can you write multiple lines into a file using Python?
10. What does the "w" mode do if the file already exists?
11. Write a Python script to read a CSV file named data.csv
and print its contents using the csv module. How can you
write a pandas DataFrame to a CSV file? Provide an
example.
12. Explain the difference between pd.read_csv() and to_csv()
in pandas.
13. How do you convert a Python dictionary into a JSON
object and save it to a file?
14. Why is it important to handle missing values in data
science?
122
C. Asuai, H. Houssem & M. Ibrahim File Handling in Python for Data Science
123
124
MODULE 6
DATA STRUCTURES IN PYTHON
Data structures are the shiny gadgets you’ll need to store,
organize, and juggle information like a true data wizard. Mastering
them means you can finally stop treating your data like random
piles of socks and start organizing it like a pro.
In this module, readers will:
1. Understand the Importance of Data Structures – Learn
why data structures are fundamental for efficient data
storage and manipulation in Python.
2. Work with Built-in Data Structures – Explore lists,
tuples, dictionaries, and sets, and understand their
properties and use cases.
3. Manipulate Lists and Tuples – Learn indexing, slicing,
and common operations such as appending, inserting, and
removing elements.
4. Work with Dictionaries and Sets – Understand key-value
pairs, hash-based lookups, and set operations like union,
intersection, and difference.
5. Understand Mutability and Performance – Learn the
difference between mutable and immutable data structures
and how they impact performance.
6. Apply Data Structures in Data Science – Understand
how Python data structures help in data wrangling,
preprocessing, and analysis
By the end of this module, readers will be able to choose the right
data structure for different programming tasks and efficiently
manipulate data in Python.
125
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
INTRODUCTION
Data structures are used to store and organize data efficiently.
Python provides several built-in data structures, including lists,
tuples, sets, and dictionaries.
Think of data structures as the secret filing cabinets of Python.
Without them, your data would just be lying around like dirty
laundry. Lists, tuples, sets, and dictionaries are not just fancy
words; they are the neat little boxes that keep your information
sorted, accessible, and ready for action. By truly understanding
how they work, you will not just use Python, you will command
it like a data scientist who knows exactly where every sock,
receipt, and plot twist is stored.
So far we have seen and used these data structures in this book,
but still don’t understand how it works. This module will provide
us with the understanding needed to kick start our journey in data
science.
1. Lists
Lists are ordered, mutable collections of items. They can contain
elements of different data types. Lists in contrast with tuples, are
variable-length and their contents can be modified in-place. You
can define list using square brackets [] or using the list type
function:
Syntax
List_name=[‘Values’]
126
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
Syntax:
listName.pop(index)
# Example of a list
fruits = ["apple", "banana", "cherry"]
print(fruits[0]) # Output: apple
# Adding an item
fruits.append("orange")
print(fruits) # Output: ['apple', 'banana', 'cherry', 'orange']
In [9]: b_list.pop(2)
Out[9]: 'peekaboo'
In [10]: b_list
Out[10]: ['foo', 'red', 'baz', 'dwarf']
Elements can be removed by value with remove, which locates the
first such value and removes it from the last:
# Removing an item
fruits.remove("banana")
print(fruits) # Output: ['apple', 'cherry', 'orange']
In [11]: b_list.append('foo')
In [12]: b_list
Out[12]: ['foo', 'red', 'baz', 'dwarf', 'foo']
In [13]: b_list.remove('foo')
In [14]: b_list
Out[14]: ['red', 'baz', 'dwarf', 'foo']
If performance is not a concern, by using append and remove, you
can use a Python list as a perfectly suitable “multiset” data
structure.
Check if a list contains a value using the in keyword:
In [15]: 'dwarf' in b_list
Out[15]: True
The keyword not can be used to negate in:
In [16]: 'dwarf' not in b_list
Out[16]: False
Checking whether a list contains a value is a lot slower than doing
so with dicts and sets (to be introduced shortly), as Python makes
128
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
a linear scan across the values of the list, whereas it can check the
others (based on hash tables) in constant time.
• Sorting
You can sort a list in-place (without creating a new object) by
calling its sort function:
In [21]: a = [7, 2, 5, 1, 3]
In [22]: a.sort()
In [23]: a
Out[23]: [1, 2, 3, 5, 7]
sort has a few options that will occasionally come in handy. One
is the ability to pass a secondary sort key,that is, a function that
129
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
130
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
• Reversed
Reversed iterates over the elements of a sequence in reverse order:
131
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
In [40]: list(reversed(range(10)))
Out[40]: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
Keep in mind that reversed is a generator, so it does not create the
reversed sequence until materialized (e.g., with list or a for loop).
Table 6-1: Common Python List Operations: Methods,
Descriptions, and Examples
Operation/Method Description Example Output
Accesses the element lst = [1, 2, 3];
list[index] 2
at the given index lst[1]
132
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
133
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
2. Tuples
Tuples in Python are similar to lists, except for their key feature
of being immutable. Once a tuple has been created, it will remain
the same throughout its lifespan. Elements cannot be added to or
deleted from a tuple once it has been created which makes it a
suitable choice for storing data that does not change over time.
They are often initialized by listing their contents between
parentheses or brackets.
In [1]: tup = 4, 5, 6
In [2]: tup
Out[2]: (4, 5, 6)
Sometimes you need parentheses around values within a tuple
when constructing more complex expressions such as when
forming a nested tuple, e.g :
In [3]: nested_tup = (4, 5, 6), (7, 8)
In [4]: nested_tup
Out[4]: ((4, 5, 6), (7, 8))
Converting any sequence or iterator into a tuple is achieved by
using the tuple function, e.g :
In [5]: tuple([4, 0, 2])
Out[5]: (4, 0, 2)
In [6]: tup = tuple('string')
In [7]: tup
Out[7]: ('s', 't', 'r', 'i', 'n', 'g')
You can use square brackets to fetch items from the sequence, like
with many other object types. As in C, C++, Java, and many
other languages, sequences are 0-indexed in Python:
134
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
In [8]: tup[0]
Out[8]: 's'
While the objects stored in a tuple may be mutable themselves,
once the tuple is created it’s not possible to modify which object is
stored in each slot:
In [9]: tup = tuple(['foo', [1, 2], True])
In [10]: tup[2] = False
135
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
136
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
Tuple methods
Since the size and contents of a tuple cannot be modified, it is very
light on instance methods. A particularly useful one (also available
on lists) is count, which counts the number of occurrences of a
value:
In [34]: a = (1, 2, 2, 2, 3, 4, 2)
In [35]: a.count(2)
Out[35]: 4
137
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
138
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
Note:
• Tuples are immutable, so operations like tuple.append(),
tuple.remove(), and
tuple.sort() are not allowed.
# Tuples are immutable
coordinates = (10, 20)
# coordinates[0] = 15 # This will raise an error
• If modification is required, convert the tuple to a list:
coordinates = (10, 20)
lst = list(coordinates)
lst.append(30)
my_new_lst = tuple(lst)
print(my_new_lst) # Output: (10, 20, 30)
3. Sets
Sets are unordered collections of unique elements. A set is an
unordered collection of unique elements. You can think of them
like dicts, but keys only, no values. A set can be created in two
ways: via the set function or via a set literal with curly braces:
#Creating a set via the set literal
In [1]: set([2, 2, 2, 1, 3, 3])
Out[2]: {1, 2, 3}
In [7]: a.union(b)
Out[7]: {1, 2, 3, 4, 5, 6, 7, 8}
In [8]: a | b
Out[8]: {1, 2, 3, 4, 5, 6, 7, 8}
The intersection contains the elements occurring in both sets. The
& operator or the intersection method can be used:
In [9]: a.intersection(b)
Out[9]: {3, 4, 5}
In [10]: a & b
Out[10]: {3, 4, 5}
Removes
and 1 (or any
set.pop() returns s = {1, 2, 3}; print(s.pop()) other
an element)
arbitrary
140
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
element
from the
set.
Raises
KeyErro
r if
empty.
Removes
all
s = {1, 2, 3}; s.clear();
set.clear() elements set()
print(s)
from the
set.
Returns
a shallow s1 = {1, 2, 3}; s2 =
set.copy() {1, 2, 3}
copy of s1.copy(); print(s2)
the set.
Returns a new set
set.union(*others) or `A B` containing all elements `{1, 2}
from both sets.
Returns
a new set
set.intersection(*others) or containi
{1, 2} & {2, 3} {2}
A&B ng only
common
elements.
Returns
a new set
set.difference(*others) or A
with {1, 2, 3} - {2, 3} {1}
–B
elements
in A but
141
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
not in B.
Returns
a set
with
set.symmetric_difference(ot elements
{1, 2, 3} ^ {2, 3, 4} {1, 4}
her) or A ^ B in either
A or B
but not
both.
s = {1, 2};
Updates the set by adding s.update({
set.update(*others) or `A = B`
elements from another set. 2, 3, 4});
print(s)
Updates
the set,
keeping s = {1, 2, 3};
set.intersection_update(*others)
only s.intersection_update {2, 3}
or A &= B
elements ({2, 3, 4}); print(s)
found in
both.
Updates
the set,
removin s = {1, 2, 3};
set.difference_update(*others) or
g s.difference_update({
A -= B
elements 2, 3, 4}); print(s)
found in
B.
142
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
4. Dictionaries
Dictionaries store data as key-value pairs. Keys must be unique
and immutable. dict is likely the most important built-in Python
data structure. A more common name for it is hash map or
associative array. It is a flexibly sized collection of key-value pairs,
where key and value are Python objects. One approach for
creating one is to use curly braces {} and colons to separate keys
and values:
# Example of a dictionary
person = {"name": "Alice", "age": 25}
print(person["name"]) # Output: Alice
In [1]: empty_dict = {}
In [2]: d1 = {'a' : 'some value', 'b' : [1, 2, 3, 4]}
In [3]: d1
Out[3]: {'a': 'some value', 'b': [1, 2, 3, 4]}
You can access, insert, or set elements using the same syntax as for
accessing elements
of a list or tuple:
143
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
The keys and values method give you iterators of the dict’s keys
and values, respectively. While the key-value pairs are not in any
particular order, these functions output the keys and values in the
same order:
In [17]: list(d1.keys())
Out[17]: ['a', 'b', 7]
In [18]: list(d1.values())
Out[18]: ['some value', [1, 2, 3, 4], 'an integer']
You can merge one dict into another using the update method:
In [19]: d1.update({'b' : 'foo', 'c' : 12})
In [20]: d1
Out[20]: {'a': 'some value', 'b': 'foo', 7: 'an integer', 'c': 12}
The update method changes dicts in-place, so any existing keys in
the data passed to update will have their old values discarded.
145
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
147
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
148
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
149
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
You might have gotten these names from a couple of files and
decided to organize them by language. Now, suppose we wanted
to get a single list containing all names with two or more e’s in
them. We could certainly do this with a simple for loop:
names_of_interest = []
for names in all_data:
enough_es = [name for name in names if name.count('e') >= 2]
names_of_interest.extend(enough_es)
You can actually wrap this whole operation up in a single nested
list comprehension, which will look like:
In [32]: result = [name for names in all_data for name in names
.....: if name.count('e') >= 2]
In [33]: result
Out[33]: ['Steven']
At first, nested list comprehensions are a bit hard to wrap your
head around. The for parts of the list comprehension are arranged
according to the order of nesting, and any filter condition is put at
the end as before. Here is another example where we “flatten” a
list of tuples of integers into a simple list of integers:
In [34]: some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
In [35]: flattened = [x for tup in some_tuples for x in tup]
In [36]: flattened
Out[36]: [1, 2, 3, 4, 5, 6, 7, 8, 9]
Keep in mind that the order of the for expressions would be the
same if you wrote a nested for loop instead of a list
comprehension:
flattened = []
for tup in some_tuples:
for x in tup:
flattened.append(x)
You can have arbitrarily many levels of nesting, though if you
have more than two or three levels of nesting you should
probably start to question whether this makes sense from a code
readability standpoint. It’s important to distinguish the syntax just
150
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
1. Lists
A list is a mutable, ordered collection of elements that can store
different data types. It allows easy addition, deletion, and
modification of elements.
Applications in Data Science
a. Storing datasets before converting them into structured
formats (e.g., Pandas DataFrames).
b. Holding multiple values retrieved from a dataset.
c. Iterating over sequences in data processing.
Example
Imagine we have a dataset of sales revenue recorded daily. We can
store it in a list and perform calculations.
# List of daily sales revenue
daily_sales = [2500, 3000, 2700, 4000, 3200, 2900, 3100]
151
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Output
Total Revenue: 21400
Highest Sales: 4000
Lowest Sales: 2500
2. Tuples
A tuple is an immutable, ordered collection of elements. It is used
when data should remain unchanged throughout execution.
Applications in Data Science
a. Storing fixed metadata (e.g., coordinates, feature names).
b. Using tuples as keys in dictionaries (since they are
immutable).
c. Faster access to data compared to lists due to immutability.
Example
Let’s consider a dataset where each data point represents a
geographical location (latitude, longitude).
# Tuple representing a location (Latitude, Longitude)
location = (37.7749, -122.4194) # San Francisco
# Accessing values
print("Latitude:", location[0])
print("Longitude:", location[1])
152
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
}
print("Temperature in San Francisco:", temperature_data[location])
Output
Latitude: 37.7749
Longitude: -122.4194
Temperature in San Francisco: 15.5
3. Dictionaries
A dictionary is a collection of key-value pairs that provides fast
lookups and is widely used in data science.
Applications in Data Science
a. Mapping categorical variables to numerical values.
b. Storing data in key-value format for fast lookups.
c. Aggregating statistics from raw data.
Example
Suppose we are working with customer purchase records where
each customer has a unique ID.
# Dictionary of customers and their total purchase amount
customer_purchases = {
"C001": 1200.50,
"C002": 850.75,
"C003": 430.60,
"C004": 1540.90
}
153
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Output
Customer C002's purchase amount: 850.75
Customer C001 spent $1200.5
Customer C002 spent $850.75
Customer C003 spent $430.6
Customer C004 spent $1540.9
Customer C005 spent $920.3
4. Sets
A set is an unordered collection of unique elements. It is useful for
eliminating duplicate values.
Applications in Data Science
a. Removing duplicate entries from a dataset.
b. Performing set operations (union, intersection) on
categorical data.
c. Checking for membership in large datasets.
Example
Let’s say we have a dataset containing email IDs of registered
users, but there are duplicates.
# List of email IDs (some are repeated)
email_list = ["[email protected]", "[email protected]",
"[email protected]",
"[email protected]", "[email protected]", "[email protected]"]
154
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
QUESTIONS
1. What is the difference between a list and a tuple?
2. Create a list of integers from 1 to 10. Write code to reverse
the list and remove the last element.
3. Given a tuple t = (5, 10, 15, 20), write code to access the
second element and print it.
4. Create a list of numbers and use a list comprehension to
filter out even numbers.
5. Given a list
6. Write a Python program to count the frequency of each
element in a list using a dictionary.
7. What is the advantage of using sets over lists?
8. Create a dictionary to store student names and their
corresponding grades. Add a new student and update an
existing student's grade.
9. How do data structures contribute to efficient data
manipulation in data science?
10. Given two sets, A = {1, 2, 3, 4} and B = {3, 4, 5, 6}, write
code to find their intersection and union.
11. Write a Python program to append the number 7 to a
list [1, 2, 3, 4, 5] and then insert the number 6 at the second
position.
12. Create a dictionary where the keys are student names and
the values are their grades. Write code to check if a specific
student exists in the dictionary.
13. Given a list [10, 20, 30, 40, 50], write code to slice the list
and extract the elements [20, 30, 40].
14. Write a Python program to remove duplicates from a
list [1, 2, 2, 3, 4, 4, 5] using a set.
155
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
15. Create a tuple with mixed data types (e.g., integers, strings,
floats). Write code to print its length and the data type of
the third element.
16. Given a dictionary {'a': 1, 'b': 2, 'c': 3}, write code to
iterate through its keys and values,
156
MODULE 7
DATA MANIPULATION AND
ANALYSIS WITH NUMPY AND
PANDAS
Data manipulation and Analysis is where raw, messy data
transforms into polished, meaningful insights. In this module, we
roll up our sleeves and dive into the real action: manipulating and
analyzing data like pros. Armed with NumPy and Pandas, two
of Python’s most powerful libraries, you will learn how to bend
data to your will. NumPy will be your go-to toolkit for high-
speed numerical computations, while Pandas will become your
best friend for working with structured data.
By the end of this module, the reader should be able to:
158
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
Introduction
Data manipulation is a crucial step in data science, allowing
researchers and analysts to clean, transform, and prepare data for
analysis. Python provides powerful libraries for data
manipulation, with NumPy and Pandas being the most widely
used.
In the world of data science, data manipulation is like preparing
jollof rice. If you do not clean, sort, and cook the ingredients
properly, you will serve nonsense and nobody will consume the
meal. Raw data is usually messy, scattered, and stubborn like
Lagos traffic on a Monday morning. That is why Python gave us
NumPy and Pandas, the real MVPs, to help us arrange, transform,
and polish our data until it is sweet and ready for analysis. With
these tools, you will not just manage data, you will package it like
a correct boss!
159
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
# Creating a 1D array
arr = np.array([1, 2, 3, 4, 5])
print(arr) # Output: [1 2 3 4 5]
# Creating a 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2)
# Creating a 2D array
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix)
# Output:
160
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
# [[1 2 3]
# [4 5 6]]
Array Attributes
NumPy arrays include attribute properties such as shape, type and
size.
print(arr.shape) # Output: (5,) (1D array with 5 elements)
print(matrix.shape) # Output: (2, 3) (2 rows, 3 columns)
print(arr.dtype) # Output: int64 (data type of elements)
print(arr.size) # Output: 5 (total number of elements)
Matrix Operations
NumPy allows performing multiple matrix operations, including
addition, multiplication and dot products.
# Matrix addition
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(a + b)
# Output:
# [[ 6 8]
# [10 12]]
# Matrix multiplication
print(np.dot(a, b))
# Output:
# [[19 22]
# [43 50]]
print(a * b) # Output: [2 4 6]
162
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
DataFrames
A DataFrame has rows and columns which resemble a table.
Creating a Pandas DataFrame
# Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
# Output:
# Name Age City
# 0 Alice 25 New York
#1 Bob 30 Los Angeles
# 2 Charlie 35 Chicago
Example 2
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
163
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
print(df)
Indexing, Selection, and Filtering
print(df["Name"]) # Selecting a column
print(df[df["Age"] > 25]) # Filtering rows
164
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
Data Transformation
You can transform data by changing data types, applying
functions, or creating new columns.
# Changing data types
df['A'] = df['A'].astype(float)
# Applying functions
df['B'] = df['B'].apply(lambda x: x * 2 if pd.notnull(x) else x)
df2 = pd.DataFrame({
'Key': ['A', 'B', 'D'],
'Value': [4, 5, 6]
})
# Inner join
merged_df = pd.merge(df1, df2, on='Key', how='inner')
print(merged_df)
# Output:
# Key Value_x Value_y
165
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
#0 A 1 4
#1 B 2 5
Concatenating DataFrames
Concatenation combines DataFrames along a specified axis.
concatenated_df = pd.concat([df1, df2], axis=0)
print(concatenated_df)
# Output:
# Key Value
#0 A 1
#1 B 2
#2 C 3
#0 A 4
#1 B 5
#2 D 6
Cleaning Data
df["Age"] = df["Age"].astype(int) # Converting data type
print(df.drop_duplicates()) # Removing duplicates
Applying Functions
def age_category(age):
return "Young" if age < 30 else "Old"
df["Category"] = df["Age"].apply(age_category)
print(df)
166
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
167
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
QUESTIONS
1. How does broadcasting work in NumPy?
2. Explain the difference between Pandas Series and
DataFrame.
3. What function is used to detect missing values in Pandas?
4. How do you merge two DataFrames in Pandas?
5. What is the purpose of the groupby function in Pandas?
6. Create a 3x3 NumPy array and print its shape and data
type.
7. Perform matrix multiplication on two 2x2 arrays.
8. Use broadcasting to add a scalar value to a 1D array.
9. What is the difference between np.dot() and np.multiply()?
10. Create a 2D array and calculate the sum of each row.
11. Create a DataFrame and fill missing values with the mean
of the column.
12. Write a Python script to merge two DataFrames based on a
common column.
13. Use a lambda function to transform a column in a
DataFrame.
14. What is the difference between pd.merge() and pd.concat()?
15. Create a DataFrame and calculate the sum of each column.
16. Write a Pandas script to load a CSV file and display the
first five rows.
17. Generate a DataFrame with random numbers and compute
its mean and standard deviation.
18. Create a visualization using Pandas’ built-in plotting
functions for table 7-1
Table 7-1 shows the sales performance of different products
for a specific month, measured in units.
168
C. Asuai, H. Houssem & M. Ibrahim Data Structures in Python
169
170
MODULE 8
DATA VISUALIZATION
Data visualization is a critical part of data science, enabling you to
explore data, identify patterns, and communicate insights
effectively. This module introduces two powerful Python libraries
for visualization: Matplotlib and Seaborn.
By the end of this module, the reader will be able to:
Understand the Importance of Data Visualization
• Explain the role of data visualization in data science.
• Recognize how visual representations enhance data
interpretation.
• Differentiate between Matplotlib and Seaborn and their
use cases.
Work with Matplotlib for Basic Plotting
• Create simple line plots using Matplotlib.
• Customize plots by adding titles, axis labels, legends, and
grids.
• Create several plots that are displayed together in a single
figure.
Create Various Plot Types Using Matplotlib
• Make line plots to identify how data changes over time.
• Create scatter plots to explore correlation between
different variables.
• Create bar charts and histograms to represent categorical
and distribution data.
• Create pie charts when you have proportional or
percentage data.
• Use box plots to evaluate a dataset’s distribution as well as
detect anomalous values.
171
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
172
C. Asuai, H. Houssem & M. Ibrahim Data Visualization
INTRODUCTION
Data visualization plays a vital role in the field of data science by
helping analysts and researchers to investigate and communicate
their findings in a clear and meaningful way. This section mainly
deals with using two Python libraries for creating visualizations:
Matplotlib and Seaborn.
173
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
174
C. Asuai, H. Houssem & M. Ibrahim Data Visualization
# Data
x = [1, 2, 3, 4, 5]
175
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Bar Charts
Bar charts allow users to compare different categories of
information. They display information using rectangular bars
where the length or height of each bar is proportional to the value
it represents. Categories are usually placed along one axis while
their values are shown on the other making it easy to see
differences at a glance. Bar charts are perfect for comparing things
like sales across different products survey responses or the
popularity of different movie genres.
Imagine a bar chart as a friendly competition where each bar is a
contestant trying to reach the top The taller the bar the bigger the
bragging rights Whether it is showing which ice cream flavor rules
the summer or which superhero movie crushed the box office bar
charts make it super easy and super fun to spot the winners and
the underdogs. With just a quick look you can cheer for the
champions and spot the ones needing a little extra boost.
176
C. Asuai, H. Houssem & M. Ibrahim Data Visualization
# Data
categories = ['A', 'B', 'C', 'D']
values = [15, 20, 10, 25]
Scatter Plots
Scatter plots are used to show the relationship between two
different variables. Each point on the plot represents an
observation with its position determined by the values of those
two variables. They help you spot patterns clusters outliers or any
kind of trend like whether two things are moving together or not.
Now imagine you are at a lively Nigerian market like Balogun
Market in Lagos or Oil Mill Market in Port Harcourt. Each trader
is selling something different and their prices and number of
customers vary all day long. If you plotted the price of tomatoes
against the number of buyers on a scatter plot you would see little
dots all over the place showing the hustle and bustle of the
market. Some traders would have high prices but few customers
some would have cheap prices and huge crowds and some would
be right in the middle. A scatter plot captures that busy colorful
energy of real life helping you quickly spot who is balling who
177
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
needs to adjust their hustle and where the sweet spot for success
lies
# Data
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]
178
C. Asuai, H. Houssem & M. Ibrahim Data Visualization
179
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Histogram
Histograms are used to show the distribution of a set of
continuous data. They look like bar charts but are a little
different. Instead of comparing categories each bar groups
numbers into ranges called bins and shows how many data points
fall into each range. Histograms help you see things like where
most values are clustered whether the data is spread out or if there
are any unusual gaps or spikes.
Imagine you attend a big Nigerian wedding , you know the kind
where hundreds of guests show up with plenty of jollof rice and
loud music. Suppose you wanted to know the age distribution of
all the guests. If you group them into age ranges like 0 to 10 years
11 to 20 years 21 to 30 years and so on a histogram would show
you which age group had the most people. Maybe you will find
that the 21 to 30 crew came out in full force while only a few
elders sprinkled in from the 61 and above range. Just like at that
owambe party the histogram quickly shows you where the crowd
is gathering and where things are a little quiet
import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=30, color='blue', edgecolor='black')
plt.title("Histogram Example")
plt.show()
180
C. Asuai, H. Houssem & M. Ibrahim Data Visualization
Box Plot
A box plot also called a box and whisker plot is used to show the
spread and distribution of a dataset. It displays the median the
upper and lower quartiles and the minimum and maximum values.
In simple terms a box plot tells you where most of your data
points fall where the middle value is and if there are any unusual
values called outliers.
Now imagine you are comparing the prices of okra in different
markets across Lagos. Some markets sell okra super cheap others
super expensive and most are somewhere in between. A box plot
would show you all of that at a glance. The fat box in the middle
shows where most of the prices fall the line inside shows the
average Lagosian price and the little whiskers stretch out to show
the full range. If one market is selling okra for double the normal
price it would stand out as an outlier , maybe they think they are
selling gold instead of vegetables
plt.boxplot(data)
plt.title("Box Plot Example")
plt.show()
# Simulated data: Okra prices (in naira) across different Lagos markets
181
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
CUSTOMIZING PLOTS
Matplotlib gives you full control over your plots, allowing you to
customize almost every aspect, from colors and labels to the
overall layout. You can change the colors of lines, bars, or
markers to make your plot visually appealing or match your
brand or theme. You can include titles and labels on both the axes
and the plot itself to help your readers easily interpret the
information shown in your graph.
You can also use Matplotlib to combine several graphs in a single
figure so you can easily compare each plot to one another.
Annotations can be used to call attention to important details, so
your plot is easier to understand for your viewers.
With Matplotlib, you can customize labels, layout and gridlines to
produce professional and attractive visualizations of your data.
183
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
EXAMPLE
import matplotlib.pyplot as plt
import numpy as np
EXAMPLE
# Create subplots
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
# Plot 1
ax[0].plot(x, y, color='blue')
ax[0].set_title("Line Plot")
# Plot 2
ax[1].scatter(x, y, color='red')
ax[1].set_title("Scatter Plot")
plt.show()
Annotations
Annotations in Matplotlib let you place text, arrows or other
shapes directly on your plot at the exact locations you want. This
feature allows you to call attention to crucial information, explain
details or make summaries right on your graph. You have the
power to highlight a variety of elements, like exceptional values or
significant occurrences which significantly improves the clarity
and impact of what is being shown by your plots.
Common Uses for Annotations:
• Highlighting data points: Adding a label to an outlier or
a significant point in the graph.
• Explaining trends: Marking where a particular change
happens in a curve or a key event.
• Pointing out values: Labeling specific data points with
their values for better clarity.
186
C. Asuai, H. Houssem & M. Ibrahim Data Visualization
Example
import matplotlib.pyplot as plt
import numpy as np
187
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
plt.show()
plt.plot(x, y, marker='o', linestyle='-', color='b')
plt.annotate('Peak', xy=(4, 30), xytext=(3, 35),
arrowprops=dict(facecolor='black', shrink=0.05))
plt.title("Annotated Plot")
plt.show()
Example
x = [1, 2, 3, 4, 5]
y = [10, 15, 7, 10, 25]
plt.scatter(x, y, color='r')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Scatter Plot Example")
plt.show()
188
C. Asuai, H. Houssem & M. Ibrahim Data Visualization
2. Statistical Plots:
It can produce advanced graphical methods such as violin
plots, box plots and regression plots, making it well-suited
for performing data analysis tasks.
3. Integration with Pandas:
Seaborn integrates easily with pandas DataFrames, enabling
users to quickly visualize information from within their
DataFrames.
4. Simplified Syntax:
Seaborn allows you to produce sophisticated plots with
fewer lines of code than Matplotlib, making plotting easier.
5. Support for Multi-plot Grids:
Seaborn features tools such as FacetGrid and PairGrid ,
allowing for the generation of multivariate plots that can
be used to understand patterns in the data.
189
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
When you use a heatmap for correlation analysis, it’s like a party
tracker for numbers. You’re not just looking at random data
points; you're getting a bird's eye view of how everything is
connected.
In simple terms, heatmaps visually represent the relationship
between variables, using colors. The darker the color, the stronger
the connection, while lighter shades mean the relationship is weak
or non-existent. For example, in predictive maintenance (a hot
topic!), you could use a heatmap to figure out which sensors in
your machines are most strongly linked,like when you notice
your oil pressure and temperature sensors always dance in sync.
So, the next time you see one of those colorful grids with reds,
greens, or blues popping up, just remember: it's like the data’s way
of showing you who’s cool with who, so you can make smarter
decisions without any guesswork!
import seaborn as sns
import numpy as np
# Create a heatmap
sns.heatmap(data, annot=True, cmap='coolwarm')
plt.title("Heatmap Example")
plt.show()
# Import Seaborn
import seaborn as sns
import pandas as pd
# Sample dataset - You can replace this with your own data
data = {
'sensor_1': [1, 2, 3, 4, 5],
'sensor_2': [5, 4, 3, 2, 1],
'sensor_3': [2, 3, 4, 5, 6],
191
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
'sensor_4': [5, 6, 7, 8, 9]
}
# Create a DataFrame
df = pd.DataFrame(data)
Pairplots
Pairplots are an excellent way to visualize relationships between
multiple numerical variables. In Seaborn, the sns.pairplot()
function creates a grid of scatterplots for each pair of features,
along with histograms or density plots along the diagonal to show
distributions.
Think of a pairplot like a party where different groups of people
(your data variables) are mingling. You’ve got several sensor
readings.Let’s say sensor_1, sensor_2, etc.,and you’re curious
about how they interact with each other. Instead of guessing, a
pairplot comes to the rescue by plotting a grid of scatter plots,
showing you exactly how each pair of sensors relates. If two
sensors are in sync, their scatter plot will form a clear line or
pattern. If they’re not, you’ll see more scattered dots. The
diagonal of this grid is where each sensor hangs out on its own,
showing you how it behaves over time with histograms or density
plots.
192
C. Asuai, H. Houssem & M. Ibrahim Data Visualization
# Sample dataset - You can replace this with your own data
data = {
'sensor_1': [1, 2, 3, 4, 5],
'sensor_2': [5, 4, 3, 2, 1],
'sensor_3': [2, 3, 4, 5, 6],
'sensor_4': [5, 6, 7, 8, 9]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Add title
plt.suptitle("Pairplot of Sensor Data", y=1.02)
193
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
# Create a pairplot
sns.pairplot(iris, hue='species')
plt.title("Pairplot Example")
plt.show()
Density Plot
A density plot is a smooth, continuous representation of the
distribution of data, much like a histogram, but without the bins.
It's used to show the probability density function (PDF) of a
continuous random variable, so you get a smooth curve instead of
the blocky look you get with histograms. Think of it as an elegant
way of seeing how your data is spread out, with high peaks
showing where most of your data points are concentrated and
valleys indicating less frequent data.
Imagine a density plot like a smooth Afrobeat groove,it shows the
smooth flow of your data from left to right, highlighting where
the data "moves" or "clusters" the most. If your data was a party,
the density plot is like showing you the hotspots where everyone’s
gathered and the quieter corners where not much is happening.
In Python, you can easily create a density plot using Seaborn’s
sns.kdeplot(). This is especially helpful when you want to visualize
the distribution of individual variables or compare distributions
across multiple variables.
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
194
C. Asuai, H. Houssem & M. Ibrahim Data Visualization
# Sample dataset - You can replace this with your own data
data = {
'sensor_1': [1, 2, 3, 4, 5],
'sensor_2': [5, 4, 3, 2, 1],
'sensor_3': [2, 3, 4, 5, 6],
'sensor_4': [5, 6, 7, 8, 9]
}
# Create a DataFrame
df = pd.DataFrame(data)
Violin Plots
Imagine you're at a party and you see a huge group of people
dancing. Some are doing the Shaku Shaku in one corner, while
others are chilling with some Afrobeat moves in another. Now,
instead of guessing who's dancing better or how many people are
jamming in each group, you get this funky violin plot to give you
the full vibe!
In simpler terms, a violin plot is like a combination of a box plot
and a density plot. It shows you the distribution of your data, like
195
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
how your sensor readings are spread out. The wider the "violin,"
the more data points (or people) are hanging out in that range.
The "stem" in the middle is the line that shows you the range of
your data, and the "bulge" or wider part is where most of your
data points are concentrated, just like where everyone is crowded
at the party!
What makes it even more fun is that a violin plot lets you
compare multiple variables at once. For example, you could check
how temperature varies compared to pressure,are they dancing the
same rhythm, or are they in different parts of the party? It gives
you a quick, clear visual of the spread of your data,whether
they’re partying together or off doing their own thing.
So, if you want to know how sensor data is behaving without
digging deep into boring numbers, a violin plot gives you the
whole vibe, letting you see the highs, lows, and everything in
between, all in one smooth, colorful plot. It’s like seeing the
dancefloor from above,no more guessing!
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample dataset - You can replace this with your own data
data = {
'sensor_1': [1, 2, 3, 4, 5],
'sensor_2': [5, 4, 3, 2, 1],
'sensor_3': [2, 3, 4, 5, 6],
'sensor_4': [5, 6, 7, 8, 9]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Add title
plt.title("Violin Plot of Sensor Data")
197
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
(as shown by the histogram) and the seamless dancing that results
from the coordination of everyone’s movements (shown by the
KDE plot). They provide a deeper understanding of how your
data fluctuates.
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample dataset - You can replace this with your own data
data = {
'sensor_1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 5, 6]
}
# Create a DataFrame
df = pd.DataFrame(data)
198
C. Asuai, H. Houssem & M. Ibrahim Data Visualization
# Sample dataset - You can replace this with your own data
data = {
'temperature': [22, 24, 25, 26, 28, 30, 31, 33, 34, 35],
'pressure': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110]
}
# Create a DataFrame
df = pd.DataFrame(data)
199
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
200
C. Asuai, H. Houssem & M. Ibrahim Data Visualization
QUESTIONS
1. What is the difference between Matplotlib and Seaborn?
2. Explain the purpose of a heatmap.
3. What is the difference between a histogram and a box plot?
4. What is the difference between a box plot and a violin
plot?
5. Create a line plot to visualize the growth of a company's
revenue over 5 years.\
6. Use a bar chart to compare the population of 5 cities.
7. Create a scatter plot to visualize the relationship between
hours studied and exam scores.
8. Customize a plot by adding a title, labels, and grid lines.
9. Create a figure with two subplots: one line plot and one
scatter plot.
10. Create a heatmap to visualize the correlation between
features in the Iris dataset.
11. Use a pairplot to explore relationships between numerical
variables in the Titanic dataset.
12. Create a violin plot to compare the distribution of sepal
lengths across different species in the Iris dataset
201
202
MODULE 9
LINEAR ALGEBRA FOR DATA
SCIENCE
In this module, we covered key linear algebra concepts such as
vectors, matrices, determinants, eigenvalues, and singular value
decomposition, all with Python implementations. Mastering these
concepts is crucial for understanding and implementing machine
learning algorithms.
By the end of this module, the reader will be able to:
• Understand the fundamental concepts of linear algebra and
its importance in data science.
• Define and perform operations on vectors using NumPy.
• Implement vector addition and scalar multiplication in
Python.
• Define and manipulate matrices, including matrix addition
and multiplication.
• Compute the determinant and inverse of a matrix using
NumPy.
• Understand the significance of eigenvalues and eigenvectors
in data science.
• Compute eigenvalues and eigenvectors in Python.
• Perform Singular Value Decomposition (SVD) for
dimensionality reduction and recommendation systems.
• Apply linear algebra techniques in machine learning and
data science tasks such as:
• Dimensionality Reduction (PCA using eigenvalues
and eigenvectors).
• Regression Models (Matrix operations in linear
regression).
203
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
204
C. Asuai, H. Houssem & M. Ibrahim Linear Algebra for Data Science
205
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
# Vector Addition
v_sum = v1 + v2
print("Vector Sum:", v_sum)
# Scalar Multiplication
v_scaled = s * v1
print("Scaled Vector:", v_scaled)
206
C. Asuai, H. Houssem & M. Ibrahim Linear Algebra for Data Science
Matrix operations are like the calculations you do when you want
to find out your total spending or compare prices across stalls You
can add two matrices together if they are the same size by adding
their matching elements one by one You can multiply a matrix by
a number to scale all the values up or down just like when you are
doubling your shopping list before Christmas and you know
everything will cost twice as much
Another important operation is matrix multiplication which is a
little more complex It is like when you are combining information
from two different tables to get a final result Maybe you have one
table for the price of each item and another for the quantity you
bought Matrix multiplication helps you combine these tables in a
smart way to find your total spending automatically
In short matrices are not just scary grids of numbers They are
powerful friends that help you manage and transform data
whether you are solving business problems building machine
learning models or even planning your next shopping trip.
209
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
210
C. Asuai, H. Houssem & M. Ibrahim Linear Algebra for Data Science
work with It’s like taking a big complicated dish and breaking it
into smaller more manageable pieces so you can enjoy the meal in
the right way
SVD is used everywhere especially when you want to simplify
complex data For example in recommendation systems like
Netflix or YouTube SVD helps to find patterns in how users
interact with content and predict what they might like next In
dimensionality reduction SVD helps reduce the number of
features in your data while keeping the important information
intact It’s like when you want to reduce the number of actors in a
movie but still make sure the plot remains interesting and
engaging
SVD works by decomposing your original matrix A into three
smaller matrices U S and V where U and V are orthogonal
matrices and S is a diagonal matrix These three matrices can then
be used to perform things like data compression and noise
reduction making SVD an important part of any data scientist’s
toolkit
211
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
1. Dimensionality Reduction
High-dimensional datasets can be challenging to process and
analyze. Linear algebra techniques help reduce the dimensionality
while retaining meaningful information.
Principal Component Analysis (PCA)
PCA is a popular dimensionality reduction technique that uses
eigenvectors and eigenvalues to transform high-dimensional data
into a lower-dimensional space.
212
C. Asuai, H. Houssem & M. Ibrahim Linear Algebra for Data Science
Linear regression is one of the oldest and most reliable tools in the
world of data science It is a statistical method used to predict a
target value based on input features or independent variables In
simple terms if you wanted to predict the price of a house based
on its size number of rooms and age you could use linear
regression to draw the best-fitting line that connects the dots of
data points to make the best prediction
Now the magic behind linear regression is heavily dependent
on matrix operations These operations help us find the perfect
line that minimizes the difference between the predicted values
and the actual data points
To compute θ the formula relies on matrix multiplication and
inversion These matrix operations allow us to find the values of θ
that best fit the data by minimizing the errors in prediction This is
what we call solving for the best-fitting line
In real-life data science applications using this formula allows us to
make predictions based on the patterns in the data whether we are
predicting house prices stock market trends or any other kind of
continuous data
213
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Example in Python
import numpy as np
3. Neural Networks
Neural networks, which are a key part of machine learning, rely
heavily on matrices for their computations. The network itself is
made up of layers of nodes (neurons), and these nodes are
connected by weights that determine how data flows through the
network. These weights are stored in weight matrices, which are
crucial in transforming inputs into predictions. When an input is
passed through the network, matrix multiplication takes place
between the input data and the weight matrices to generate the
outputs for each layer.
Activation functions are then applied to the results of these matrix
multiplications to introduce non-linearity into the model, making
it capable of learning complex patterns. These activation
functions, such as sigmoid or ReLU, are element-wise operations
that are applied to each value in the matrix of outputs from the
previous layer.
Backpropagation, which is how neural networks learn, also
involves matrices. It’s the process by which the network adjusts its
214
C. Asuai, H. Houssem & M. Ibrahim Linear Algebra for Data Science
Example
This example demonstrates forward propagation (input through
the network) and the basic structure for a neural network layer.
Import numpy as np
# Example Neural Network with 1 hidden layer
# Random weight matrices for input to hidden layer (3 input features -> 4
neurons in hidden layer)
np.random.seed(42)
215
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
weights_input_hidden = np.random.rand(X.shape[1], 4)
# Random weight matrix for hidden to output layer (4 hidden neurons -> 1
output)
weights_hidden_output = np.random.rand(4, 1)
# Forward propagation
hidden_layer_input = np.dot(X, weights_input_hidden) # Input layer to
hidden layer multiplication
hidden_layer_output = sigmoid(hidden_layer_input) # Apply activation
function
print("Predicted Output:")
print(output_layer_output)
# Let's assume we're training and now want to compute the error using
backpropagation
error = y - output_layer_output
print("Error:")
print(error)
# Backpropagation
output_layer_delta = error * sigmoid_derivative(output_layer_output) #
Gradient of the loss with respect to output layer
216
C. Asuai, H. Houssem & M. Ibrahim Linear Algebra for Data Science
hidden_layer_error = output_layer_delta.dot(weights_hidden_output.T) #
Error propagated to hidden layer
hidden_layer_delta = hidden_layer_error *
sigmoid_derivative(hidden_layer_output) # Gradient of the loss with respect to
hidden layer
print("\nUpdated Weights:")
print("Weights (Input -> Hidden):")
print(weights_input_hidden)
print("Weights (Hidden -> Output):")
print(weights_hidden_output)
4. Recommendation Systems
Recommendation systems are a powerful tool used by platforms
like Netflix, Amazon, and YouTube to suggest products, movies,
or content to users. They use information on user preferences to
offer personalized suggestions. Singular Value Decomposition is
a major practice used when developing recommendation systems.
In a recommendation system, the process often begins by creating
a user-item interaction matrix which uses a row for each user
and a column for each item. A ranking is given for each item
depending on how much the user has viewed and used it.
Nonetheless, many entries in the matrix are empty since the same
user does not interact with every item. SVD is used here for a
good reason. A SVD approach to the user-item matrix results in
three smaller matrices called U, S and VT which outline the main
trends in what users prefer and what each item has in common.
This is how it works:
217
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
218
C. Asuai, H. Houssem & M. Ibrahim Linear Algebra for Data Science
import numpy as np
from numpy.linalg import svd
219
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
# Apply SVD
U, S, Vt = svd(ratings)
print("Singular Values:", S)
5. Anomaly Detection
Anomaly detection is the process of identifying rare items, events,
or observations that deviate significantly from the majority of the
data. These outliers can provide valuable insights, such as detecting
fraud in financial transactions, identifying faulty equipment in
predictive maintenance, or spotting unusual behavior in network
trafficIn the process, linear algebra is particularly important for
technology like Mahalanobis Distance.
The Mahalanobis Distance is used to find the distance from a
single point to an entire distribution. Mahalanobis Distance is
different from Euclidean Distance because it includes the effect of
data correlations and scales the distance based on the data values’
variability. As a result, it becomes very handy for finding outliers
in sets of data where features share a connection.
220
C. Asuai, H. Houssem & M. Ibrahim Linear Algebra for Data Science
Use Case:
In a credit card fraud detection system, for instance, the
Mahalanobis Distance could help identify unusual transactions by
considering the correlation between different transaction features
(like transaction amount, frequency, time of day, etc.). If a
transaction's Mahalanobis distance is significantly higher than the
typical values, it may indicate fraudulent activity.
221
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Example
import numpy as np
import pandas as pd
from scipy.spatial import distance
# Create a DataFrame
df = pd.DataFrame(data)
for i in range(df.shape[0]):
diff = df.iloc[i] - mean
mahalanobis_dist = np.sqrt(np.dot(np.dot(diff.T, inv_cov_matrix), diff))
mahalanobis_distances.append(mahalanobis_dist)
df['Mahalanobis_Distance'] = mahalanobis_distances
222
C. Asuai, H. Houssem & M. Ibrahim Linear Algebra for Data Science
# Display results
print(df)
# Sample dataset
X = np.array([[2, 3], [3, 4], [5, 7], [100, 200]]) # The last point is an outlier
223
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
224
C. Asuai, H. Houssem & M. Ibrahim Linear Algebra for Data Science
plt.subplot(1, 2, 1)
plt.imshow(image, cmap='gray')
plt.title('Original Image')
plt.subplot(1, 2, 2)
plt.imshow(edges, cmap='gray')
plt.title('Edge Detected Image')
plt.show()
225
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
QUESTIONS
1. Define a pair of vectors and calculate their dot product
using NumPy codes.
2. Given a matrix M = np.array([[2, 4], [3, 6]]), compute its
determinant.
3. Do singular value decomposition of a 3x3 matrix that you
have selected.
4. Discuss the role of eigenvectors and eigenvalues in
implementing Principal Component Analysis.
226
MODULE 10
ADVANCED NUMPY: ARRAYS AND
VECTORIZED COMPUTATION
This module will help you maximize NumPy arrays and discover
vectorized computation. Working with an entire array is very
quick with a single line of code using NumPy. Focus on being
efficient, rather than trying to do a lot of work! You are about to
work efficiently with data and make the calculations in your code
more effective and organized.
The reader will have the ability to perform the following tasks:
Understand NumPy Arrays
• State what the difference is between Python lists
and NumPy arrays.
• Manage and modify NumPy arrays with ease.
Index, Slice, and Reshape Arrays
• Make use of indexing and slicing to extract items
from an array.
• Take an array of any dimension and shape it to the
form you need.
Utilize Vectorized Computation
• Carry out mathematical and statistical tasks using
vectorized functions.
• Speed up computations using the functions from
NumPy.
Work with Multidimensional Arrays
• Construct and control arrays of different
dimensions.
• Manipulate matrices with the help of NumPy.
227
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
INTRODCUTION
# Multiply each element in the NumPy array by 2 and measure execution time
%time for _ in range(10): my_arr2 = my_arr * 2
# Multiply each element in the Python list by 2 using list comprehension and
measure execution time
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]
229
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
231
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Creating ndarrays
The easiest way to create an array is to use the array function.
This accepts any sequence-like object (including other arrays) and
produces a new NumPy array containing the passed data. For
example, a list is a good candidate for conversion:
In [8]: data1 = [6, 7.5, 8, 0, 1]
In [9]: arr1 = np.array(data1)
In [10]: arr1
Out[10]: array([ 6. , 7.5, 8. , 0. , 1. ])
232
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
233
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
234
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
Function Description
Generates an array of random values from a
np.random.rand()
uniform distribution.
Generates an array of random values from a
np.random.randn()
normal distribution.
Generates an array of random integers within a
np.random.randint()
range.
Creates an uninitialized array (values may be
np.empty()
arbitrary).
This table provides a quick reference for array creation functions
in NumPy, essential for efficient numerical computing.
235
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
236
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
print("Original Array:")
print(arr)
# Element-wise multiplication
arr_mult = arr * arr
print("\nElement-wise Multiplication:")
print(arr_mult)
# Element-wise subtraction
arr_sub = arr - arr
print("\nElement-wise Subtraction:")
print(arr_sub)
237
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
# Accessing elements
print(arr[0]) # First element
238
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
# Slicing
print(arr[1:4]) # Elements from index 1 to 3
Reshaping Arrays
In NumPy, reshaping is the process of changing the shape of an
existing array without modifying its data. This allows you to
transform an array into a different configuration (e.g., from a flat
1D array to a 2D matrix or from a 2D matrix to a higher-
dimensional tensor) while keeping the data intact. It's like
rearranging the seats at a party without changing the number of
guests!
For example, if you have a 1D array with 12 elements, you can
reshape it into a 2D array with 3 rows and 4 columns. The total
number of elements before and after reshaping must remain the
same, but you can adjust how they are arranged.
import numpy as np
# Create a 1D array of 12 elements
data = np.arange(12)
print("Original Array:", data)
239
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
240
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
In [44]: arr2d[0][2]
Out[44]: 3
In [45]: arr2d[0, 2]
Out[45]: 3
In multidimensional arrays, if you omit later indices, the returned
object will be a lower dimensional ndarray consisting of all the
data along the higher dimensions. So in the 2 × 2 × 3 array arr3d:
In [46]: arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
In [47]: arr3d
Out[47]:
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
arr3d[0] is a 2 × 3 array:
In [48]: arr3d[0]
Out[48]:
array([[1, 2, 3],
[4, 5, 6]])
Both scalar values and arrays can be assigned to arr3d[0]:
In [49]: old_values = arr3d[0].copy()
In [50]: arr3d[0] = 42
In [51]: arr3d
Out[51]:
array([[[42, 42, 42],
[42, 42, 42]],
[[ 7, 8, 9],
[10, 11, 12]]])
In [52]: arr3d[0] = old_values
In [53]: arr3d
Out[53]:
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
241
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Similarly, arr3d[1, 0] gives you all of the values whose indices start
with (1, 0), forming a 1 dimensional array:
In [54]: arr3d[1, 0]
Out[54]: array([7, 8, 9])
This expression is the same as though we had indexed in two steps:
In [55]: x = arr3d[1]
In [56]: x
Out[56]:
array([[ 7, 8, 9],
[10, 11, 12]])
In [57]: x[0]
Out[57]: array([7, 8, 9])
Note that in all of these cases where subsections of the array have
been selected, the returned arrays are views.
242
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
As you can see, it has sliced along axis 0, the first axis. A slice,
therefore, selects a range of elements along an axis. It can be
helpful to read the expression arr2d[:2] as “select the first two rows
of arr2d.”
You can pass multiple slices just like you can pass multiple
indexes:
In [62]: arr2d[:2, 1:]
Out[62]:
array([[2, 3],
[5, 6]])
When slicing like this, you always obtain array views of the same
number of dimensions. By mixing integer indexes and slices, you
get lower dimensional slices. For example, I can select the second
row but only the first two columns like so:
In [63]: arr2d[1, :2]
Out[63]: array([4, 5])
Similarly, I can select the third column but only the first two rows
like so:
In [64]: arr2d[:2, 2]
Out[64]: array([3, 6])
Note that a colon by itself means to take the entire axis, so you
can slice only higher dimensional axes by doing:
In [65]: arr2d[:, :1]
Out[65]:
array([[1],
[4],
[7]])
Of course, assigning to a slice expression assigns to the whole
selection:
In [66]: arr2d[:2, 1:] = 0
In [67]: arr2d
Out[67]:
array([[1, 0, 0],
[4, 0, 0],
243
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
[7, 8, 9]])
Boolean Indexing
Let’s consider an example where we have some data in an array
and an array of names with duplicates. I’m going to use here the
randn function in numpy.random to generate some random
normally distributed data:
In [68]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
In [69]: data = np.random.randn(7, 4)
In [70]: names
Out[70]:
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'],
dtype='<U4')
In [71]: data
Out[71]:
array([[ 0.0929, 0.2817, 0.769 , 1.2464],
[ 1.0072, -1.2962, 0.275 , 0.2289],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 1.669 , -0.4386, -0.5397, 0.477 ],
[ 3.2489, -1.0212, -0.5771, 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[-0.7135, -0.8312, -2.3702, -1.8608]])
Suppose each name corresponds to a row in the data array and we
wanted to select all the rows with corresponding name 'Bob'. Like
arithmetic operations, comparisons (such as ==) with arrays are
also vectorized. Thus, comparing names with the string 'Bob'
yields a boolean array:
In [72]: names == 'Bob'
Out[72]: array([ True, False, False, True, False, False, False], dtype=bool)
This boolean array can be passed when indexing the array:
In [73]: data[names == 'Bob']
Out[73]:
array([[ 0.0929, 0.2817, 0.769 , 1.2464],
[ 1.669 , -0.4386, -0.5397, 0.477 ]])
244
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
The boolean array must be of the same length as the array axis it’s
indexing. You can even mix and match boolean arrays with slices
or integers (or sequences of integers; more on this later).
Boolean selection will not fail if the boolean array is not the
correct length, so I recommend care when using this feature.
In these examples, I select from the rows where names == 'Bob'
and index the columns, too:
In [74]: data[names == 'Bob', 2:]
Out[74]:
array([[ 0.769 , 1.2464],
[-0.5397, 0.477 ]])
In [75]: data[names == 'Bob', 3]
Out[75]: array([ 1.2464, 0.477 ])
To select everything but 'Bob', you can either use != or negate the
condition using ~:
In [76]: names != 'Bob'
Out[76]: array([False, True, True, False, True, True, True], dtype=bool)
In [77]: data[~(names == 'Bob')]
Out[77]:
array([[ 1.0072, -1.2962, 0.275 , 0.2289],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 3.2489, -1.0212, -0.5771, 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[-0.7135, -0.8312, -2.3702, -1.8608]])
The ~ operator can be useful when you want to invert a general
condition:
In [78]: cond = names == 'Bob'
In [79]: data[~cond]
Out[79]:
array([[ 1.0072, -1.2962, 0.275 , 0.2289],
[ 1.3529, 0.8864, -2.0016, -0.3718],
[ 3.2489, -1.0212, -0.5771, 0.1241],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[-0.7135, -0.8312, -2.3702, -1.8608]])
245
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
246
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
[ 7. , 7. , 7. , 7. ],
[ 7. , 7. , 7. , 7. ],
[ 7. , 7. , 7. , 7. ],
[ 0.3026, 0.5238, 0.0009, 1.3438],
[ 0. , 0. , 0. , 0. ]])
As we will see later, these types of operations on two-dimensional
data are convenient to do with pandas
247
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
248
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]]])
swapaxes similarly returns a view on the data without making a
copy
In [182]: arr.mean(axis=1)
Out[182]: array([ 1.022 , 0.1875, -0.502 , -0.0881, 0.3611])
In [183]: arr.sum(axis=0)
Out[183]: array([ 3.1693, -2.6345, 2.2381, 1.1486])
Here, arr.mean(1) means “compute mean across the columns”
where arr.sum(0) means “compute sum down the rows.”
Other methods like cumsum and cumprod do not aggregate,
instead producing an array of the intermediate results:
In [184]: arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])
In [185]: arr.cumsum()
Out[185]: array([ 0, 1, 3, 6, 10, 15, 21, 28])
In multidimensional arrays, accumulation functions like cumsum
return an array of the same size, but with the partial aggregates
computed along the indicated axis according to each lower
dimensional slice:
In [186]: arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
In [187]: arr
Out[187]:
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
In [188]: arr.cumsum(axis=0)
Out[188]:
array([[ 0, 1, 2],
[ 3, 5, 7],
[ 9, 12, 15]])
In [189]: arr.cumprod(axis=1)
Out[189]:
array([[ 0, 0, 0],
[ 3, 12, 60],
[ 6, 42, 336]])
250
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
Method Description
Computes the sum of all elements in the array or
Sum
along a specified axis. Zero-length arrays return 0.
Calculates the arithmetic mean. Zero-length arrays
Mean
return NaN.
Computes the standard deviation and variance,
std, var respectively, with an optional degrees of freedom
adjustment (default denominator is n).
Returns the minimum and maximum values in the
min, max
array.
argmin, Returns the indices of the minimum and maximum
argmax elements, respectively.
Computes the cumulative sum of elements, starting
cumsum
from 0.
Computes the cumulative product of elements,
cumprod
starting from 1.
Out[193]: True
In [194]: bools.all()
Out[194]: False
These methods also work with non-boolean arrays, where non-
zero elements evaluate to True.
252
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
Method Description
unique(x) Computes the sorted, unique elements in x.
253
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Method Description
Finds the sorted, common elements present in
intersect1d(x, y)
both x and y.
Computes the sorted union of elements from x
union1d(x, y)
and y.
Returns a boolean array indicating whether
in1d(x, y)
each element of x is present in y.
Computes the set difference, returning
setdiff1d(x, y)
elements in x that are not in y.
Finds the symmetric difference,elements that
setxor1d(x, y)
appear in either x or y, but not in both.
254
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
When loading an .npz file, you get back a dict-like object that
loads the individual arrays lazily:
In [217]: arch = np.load('array_archive.npz')
In [218]: arch['b']
Out[218]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
If your data compresses well, you may wish to use
numpy.savez_compressed instead:
In [219]: np.savez_compressed('arrays_compressed.npz', a=arr, b=arr)
Function Description
Sets the seed for the random number generator to
Seed
ensure it can be repeated later.
Returns a sequence of numbers arranged randomly
permutation
or by a given range in a random order.
Shuffle Randomly permutes a sequence in-place.
Generates samples from a uniform distribution over
Rand
[0, 1).
Draws random integers from a specified low-to-high
Randint
range.
Generates samples from a normal distribution with
Randn mean 0 and standard deviation 1 (MATLAB-like
interface).
The process involves obtaining samples from a
Binomial
binomial distribution.
256
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
Function Description
Creates samples using a normal (Gaussian)
Normal
distribution.
Technique for drawing samples from beta
Beta
distribution.
Based on the selected method, will sample from a
Chisquare
chi-square distribution.
Gamma Gives samples from a gamma distribution.
Makes samples using the uniform distribution in the
uniform
range [0, 1).
1. np.where()
It is used to carry out actions on elements or filter data according
to given conditions using np.where(). It provides the indices or
values that meet the given criterion.
Syntax:
np.where(condition, [x, y])
• This returns values from x where True and values
from y where False.
• If only the condition is given, it produces a list of where
the condition is True.
257
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Examples:
1. Find indices where elements satisfy a condition:
import numpy as np
array = np.array([1, 2, 3, 4, 5])
indices = np.where(array > 3) # Output: (array([3, 4]),)
print(indices)
2. Replace elements based on a condition:
array = np.array([1, 2, 3, 4, 5])
result = np.where(array > 3, 10, 0) # Replace elements > 3 with 10, e
lse 0
print(result) # Output: [0, 0, 0, 10, 10]
2. np.unique()
The np.unique() function is used to find unique elements in an
array. It cleans up the array and returns a list of all the different
elements, sorted in order.
Syntax:
np.unique(array, return_index=False, return_counts=False, axis=0)
• return_index:
If True, the function will return the indices of
where each unique element first appears..
• return_counts: If True, returns the count of each unique
element.
• axis: Specifies the axis along which to find unique elements
(for multi-dimensional arrays).
Examples:
1. Find unique elements:
array = np.array([1, 2, 2, 3, 4, 4, 5])
unique_elements = np.unique(array)
print(unique_elements) # Output: [1, 2, 3, 4, 5]
2. Find unique elements with counts:
array = np.array([1, 2, 2, 3, 4, 4, 5])
unique_elements, counts = np.unique(array, return_counts=True)
print(unique_elements) # Output: [1, 2, 3, 4, 5]
print(counts) # Output: [1, 2, 1, 2, 1]
258
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
3. np.concatenate()
The np.concatenate() function is used to combine two or more
arrays along a specified axis. It can be helpful for integrating
different sets of data or adding more rows/columns to an array.
Syntax:
np.concatenate((array1, array2, ...), axis=0)
• axis=0:
Concatenates vertically (row-wise).
• axis=1: Concatenates horizontally (column-wise).
Examples:
1. Concatenate two 1D arrays:
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
result = np.concatenate((array1, array2))
print(result) # Output: [1, 2, 3, 4, 5, 6]
2. Concatenate two 2D arrays vertically:
array1 = np.array([[1, 2], [3, 4]])
array2 = np.array([[5, 6], [7, 8]])
result = np.concatenate((array1, array2), axis=0)
print(result) # Output: [[1, 2], [3, 4], [5, 6], [7, 8]]
3. Concatenate two 2D arrays horizontally:
array1 = np.array([[1, 2], [3, 4]])
array2 = np.array([[5, 6], [7, 8]])
result = np.concatenate((array1, array2), axis=1)
print(result) # Output: [[1, 2, 5, 6], [3, 4, 7, 8]]
259
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
260
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
Example:
array = np.array([1, 2, 3, 4])
result = np.mean(array) # Output: 2.5
261
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Example:
array = np.array([[1, 2, 3], [4, 5, 6]])
row_sum = np.sum(array, axis=1) # Output: [6, 15] (sum along rows)
col_sum = np.sum(array, axis=0) # Output: [5, 7, 9] (sum along columns)
262
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
263
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
QUESTIONS
1. Create a 3x3 NumPy array and compute its transpose.
2. Generate an array of 10 random numbers and find the
mean.
3. Use np.where to label elements in an array as "Even" or
"Odd".
4. How do you use np.where() to filter elements in an array?
5. What is the purpose of the np.unique() function?
6. How do you calculate the mean of a NumPy array along a
specific axis?
7. What is the difference between np.sum() and np.cumsum()?
8. How do you create a 3x3 identity matrix using NumPy?
9. Create a NumPy array with the values [1, 2, 3, 4, 5] and
print its shape and data type.
10. Perform element-wise multiplication on two arrays: a = [1,
2, 3] and b = [4, 5, 6].
11. Reshape a 1D array [1, 2, 3, 4, 5, 6] into a 2x3 matrix.
12. Extract the second column from the following 2D array:
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
13. Use broadcasting to add the scalar value 5 to every element
in a 2D array.
14. Find the maximum value in the following array:
array = np.array([10, 20, 30, 40, 50])
15. Concatenate two arrays horizontally:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
16. Calculate the mean of the following array along the rows:
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
17. Create a 3D array of shape (2, 3, 4) filled with zeros.
264
C. Asuai, H. Houssem & M. Ibrahim Arrays and Vectorized Computing
265
266
MODULE 11
PROBABILITY AND STATISTICS
Probability and statistics form the backbone of data science,
enabling data-driven decision-making, uncertainty modeling, and
hypothesis testing. This module introduces key concepts in
probability and statistics and their applications in data science
using Python.
By the end of this module, you will be able to:
• Understand fundamental probability concepts.
• Compute descriptive statistics and visualize data
distributions.
• Perform hypothesis testing and inferential statistics.
• Apply probability and statistical techniques in data science
applications.
INTRODUCTION
Probability shows how likely it is for an event to take place. In
data science, uncertainty is represented by this technique when
making predictions.
In recent years, Probability has played a key role in data science
by supporting understanding of uncertainty, prediction and
drawing conclusions from gathered data.
n_trials = 10000
# Calculate the probability of heads
probability = simulate_coin_toss(n_trials)
# Print the result
print(f"Probability of heads in a fair coin toss (after {n_trials} trials):
{probability:.4f}")
269
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
RULES OF PROBABILITY
Addition Rule
• When calculating the probability that either of two events
will occur, you rely on the addition rule.:
▪ Customer Segmentation: Figuring out the odds that
a customer falls under either Segment A or Segment
B.
▪ Risk Assessment: Calculating the likelihood that
one or both of two particular risks may affect a
company’s process.
Example:
For this example, we have a dataset that indicates:
▪ There is a 40% chance that a customer will buy
Product A.
▪ Thirty percent chance that a customer will select
Product B.
▪ It is likely that a customer will buy both products
on only 0.1 occasions.
The addition rule applied yields the answer:
• P(A∪B)=P(A)+P(P(B)−P(A∩B)=0.4+0.3−0.1=0.
6
A customer, then, has a 60% chance of buying either
Product A or Product B.
270
C. Asuai, H. Houssem & M. Ibrahim Probability and Statistics
Multiplication Rule
• It allows you to find the probability that two events
happen at the same time which is important for many
things:
• Feature Engineering: Determining the combination
of age and income (e.g., two features) with an equal
likelihood of appearing in a dataset.
• Bayesian Inference: Updating chances of different
outcomes for increased certainty.
Example:
Suppose:
• Only a 0.2 chance exists that a person will click an
ad.
• Half of the customers who click on the ad end up
making a purchase.
Using the multiplication rule we can multiply the
exponent by each term in the product:
• P(Click∩Purchase)=P(Click)⋅P(Purchase∣Click)=0
.2⋅0.5=0.1
In this case, only 10% of the customers viewing the ad will
make a purchase.
271
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
272
C. Asuai, H. Houssem & M. Ibrahim Probability and Statistics
Where:
n = Number of trials (a fixed number of independent experiments).
k = Number of successes (must be an integer 0≤ k ≤ n).
p = Probability of success in a single trial (0 ≤ p ≤ 1).
• C(n,k)C(n,k) = Combination ("n choose k"), calculated
as:
𝑛!
𝐶(𝑛, 𝑘) =
𝑘! (𝑛 − 𝑘)!
This counts the number of ways to choose k successes out of n trials.
Example:
If you flip a fair coin (p=0.5) 10 times (n=10), the probability of
getting exactly 3 heads (k=3) is:
1
P(X=3)=C(10,3)⋅(0.5)3⋅(0.5)7=120 ∙ ≈ 0.117
1024
Poisson Distribution:
• Models the number of events occurring in a fixed interval
of time or space.
• Example: Number of emails received in an hour.
Poisson Distribution PMF:
𝜄𝜆𝑘 𝑒 −𝜆
𝑃(𝑋 = 𝑘) =
𝑘!
Where:
k= Number of events (a non-negative
integer, k=0,1,2,…k=0,1,2,…).
λ = Average rate of occurrence (mean number of events in the
interval).
e = Euler's number (~2.71828).
273
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Example
Suppose a help desk receives an average of λ=5 calls per hour.
What is the probability of receiving exactly 3 calls in an hour?
53 𝑒 −5 125 ∙ 𝑒 −5
𝑃(𝑋 = 3) = = ≈ 0.1404 (14.04%).
3! 6
Continuous Distributions
Normal (Gaussian) Distribution:
o Symmetric, bell-shaped distribution characterized
by its mean (μμ) and standard deviation (σσ).
o Example: Distribution of heights in a population.
Normal Probability Density Function:
1 (𝑥−𝜇)2
−
𝑓(𝑥) = 𝑒 2𝜎2
𝜎√2Π
Where:
x = Continuous random variable.
μ = Mean (location parameter, center of the distribution).
σ = Standard deviation (scale parameter, measures spread).
σ2 = Variance.
π≈3.14159, e≈2.71828.
Uniform Distribution:
Exponential Distribution:
In a Poisson process, the Exponential Distribution is
used to model the time that passes between individual
events that are independent and happen at a fixed average
rate..
Example: How far apart are customers arriving at a
store?
PDF of the Exponential Distribution:
𝜆Γ𝑒 −𝜆𝑥 𝑖𝑓 𝑥 ≥ 0,
𝑓(𝑥) = {
0, 𝑖𝑓 𝑥 < 0
Where:
x= Time between events (a continuous random
variable, x≥0x≥0).
λ (lambda) = Rate parameter (events per unit time).
Conditional Probability
Conditional probability refers to the probability of an event
taking place if we already know that another event has happened.
Formula:
𝑃(𝐴 ∩ 𝐵)
𝑃(𝐴⁄𝐵) = (𝑖𝑓 𝑃(𝐵) > 0)
𝑃(𝐵)
275
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Where
(A∣B): Probability that iff event B happened event A might occur.
P(A∩B): Probability that A and B take place at the same time.
P(B)P: Probability that event B occurs.
Bayes' Theorem
Bayes' theorem relates the conditional and marginal probabilities
of random events. It Updates the probability of an event based on new
information.
Formula:
𝑃(𝐵⁄𝐴) ∙ 𝑃(𝐴)
𝑃(𝐴⁄𝐵) =
𝑃(𝐵)
Where:
P(A: Prior (initial belief about A).
P(B∣A): Likelihood (probability of observing B if A is true).
P(A∣B): Posterior (revised probability after observing B).
P(B): Marginal likelihood (total probability of B across all
scenarios).
DESCRIPTIVE STATISTICS
276
C. Asuai, H. Houssem & M. Ibrahim Probability and Statistics
277
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Measures of Dispersion
Measures of dispersion compare how far the individual
observations are from the mean. They provide insights into the
consistency and reliability of the data.
Table 11-2: Measures of dispersion
Metric Definition Use in Data Science
- Understanding the
Measures how far each
spread of data (e.g.,
Variance data point is from the
variability in customer
mean.
spending).
- Used in feature
engineering and model
evaluation.
The square root of - Assessing data
Standard variance, providing a consistency (e.g.,
Deviation measure of spread in the consistency in delivery
same units as the data. times).
- Used in normalization
and standardization of
data.
The range between the
- Identifying outliers
Interquartile 25th percentile (Q1) and
(e.g., detecting anomalies
Range (IQR) the 75th percentile
in transaction amounts).
(Q3).
- Robust to extreme
values, making it useful
for skewed datasets.
278
C. Asuai, H. Houssem & M. Ibrahim Probability and Statistics
Data Visualization
279
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Visualization
Description Use in Data Science
Technique
data distribution. income data).
Kurtosis: Measures the - Guiding data
"tailedness" of the transformation and
distribution. preprocessing steps.
INFERENTIAL STATISTICS
Inferential statistics allow data scientists to make predictions or
inferences about a population based on a sample of data. These
techniques are critical for hypothesis testing, model evaluation,
and decision-making.
Table 11-4: Sampling and Estimation
Concept Definition Use in Data Science
- Analyzing samples
Population: The entire set
from large datasets.
Population of data.
- Generalizing
vs. Sample Sample: A subset of the
findings to the
population.
population.
- Justifies the use of
States that the sampling
normal distribution
Central Limit distribution of the mean
in hypothesis
Theorem approaches a normal
testing.
(CLT) distribution as the sample
- Enables confidence
size increases.
interval estimation.
280
C. Asuai, H. Houssem & M. Ibrahim Probability and Statistics
282
C. Asuai, H. Houssem & M. Ibrahim Probability and Statistics
Dataset
These are the following columns in the dataset:
`CustomerID`: The CustomerID is a distinctive number to
identify every customer.
`Gender`: Information on whether the customer is a man or a
woman.
283
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
#Descriptive statistics
print(df.describe())
#Distribution of SpendingScore
plt.figure(figsize=(8, 6))
sns.histplot(df['SpendingScore'], kde=True)
plt.title("Distribution of Spending Score")
plt.xlabel("Spending Score")
plt.ylabel("Frequency")
plt.show()
284
C. Asuai, H. Houssem & M. Ibrahim Probability and Statistics
plt.show()`
Insights
1. The mean `SpendingScore` is 50.
2. The distribution of `SpendingScore` is approximately normal.
3. There is no significant difference in `SpendingScore` between male
and female customers based on the box plot.
Step 2: Inferential Statistics - Hypothesis Testing
We will test whether there is a significant difference in
`SpendingScore` between male and female customers using a two-
sample t-test.
Hypotheses
- Null Hypothesis (H0): There is no difference in mean
`SpendingScore` between male and female customers.
- Alternative Hypothesis (H1): There is a difference in mean
`SpendingScore` between male and female customers.
from scipy.stats import ttest_ind
#Perform t-test
t_stat, p_value = ttest_ind(male_scores, female_scores)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
#Interpret results
alpha = 0.05
if p_value < alpha:
print("Reject the null hypothesis: There is a significant difference in spending
scores between genders.")
else:
285
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Results
- T-statistic: 0.15, P-value: 0.88
- Since the p-value > 0.05, we fail to reject the null hypothesis.
- Conclusion: There is no significant difference in
`SpendingScore` between male and female customers.
Step 3: Predictive Modeling
We will build a logistic regression model to predict whether a
customer has a high `SpendingScore` (above 70).
#Make predictions
y_pred = model.predict(X_test)
286
C. Asuai, H. Houssem & M. Ibrahim Probability and Statistics
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Results
- The model achieves an accuracy of 85%.
- The confusion matrix and classification report show good
performance in predicting high-spending customers.
Conclusion
1. Descriptive Statistics and Visualization:
- Helped summarize and visualize customer spending patterns.
- Identified no significant difference in spending between genders.
2. Inferential Statistics:
- Used a t-test to confirm no significant difference in spending
between genders.
3. Predictive Modeling:
- Built a logistic regression model to predict high-spending customers
with 85% accuracy.
287
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Hypothesis testing
male_scores = df[df['Gender'] == 'Male']['SpendingScore']
female_scores = df[df['Gender'] == 'Female']['SpendingScore']
t_stat, p_value = ttest_ind(male_scores, female_scores)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
Predictive modeling
df['HighSpending'] = df['SpendingScore'] > 70
X = df[['Age', 'AnnualIncome']]
y = df['HighSpending']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
288
C. Asuai, H. Houssem & M. Ibrahim Probability and Statistics
#Predict
new_email = ["free money"]
print(model.predict(vectorizer.transform(new_email))) Output: ['spam']
289
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Load dataset
df = sns.load_dataset("titanic")
Perform t-test
t_stat, p_value = ttest_ind(group_a, group_b)
print(f"P-value: {p_value}") Output: P-value: 0.049 (significant at
alpha=0.05)
Example:
# Example: Predicting loan default risk
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
Simulated data
X = [[25, 50000], [30, 60000], [35, 70000], [40, 80000]] Age, Income
y = [0, 0, 1, 1] 0: No default, 1: Default
Train a model
model = RandomForestClassifier()
model.fit(X, y)
Predict
291
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Example:
from sklearn.linear_model import LinearRegression
import numpy as np
Predict
future_month = np.array([6]).reshape(-1, 1)
print(model.predict(future_month)) Output: [600]
292
C. Asuai, H. Houssem & M. Ibrahim Probability and Statistics
Simulated data
prices = [10, 20, 30, 40, 50]
demand = [100, 80, 60, 40, 20]
revenue = np.array(prices) * np.array(demand)
293
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
QUESTIONS
1. Define probability and explain its importance in data
science.
2. Differentiate between discrete and continuous probability
distributions.
3. Explain the concepts of Type I Error and Type II Error in
hypothesis testing.
4. What is the significance of confidence intervals in
inferential statistics?
5. Discuss the differences between descriptive and inferential
statistics.
6. Describe a real-world application where hypothesis testing
is useful in data science.
7. Generate a dataset of 1000 random numbers from a normal
distribution with a mean of 50 and a standard deviation of
10 using NumPy. Compute and print the mean, median,
variance, and standard deviation of the dataset.
8. Load a sample dataset (e.g., iris or titanic from seaborn) and
compute the correlation matrix between numerical
features.
9. Create a histogram and boxplot for a dataset of your choice
using matplotlib or seaborn. Interpret the distribution of the
data.
10. Using Python, simulate the rolling of two six-sided dice
10,000 times and plot the distribution of their sum.
11. Generate a binomial distribution where the probability of
success is 0.3 in 10 trials. Plot its probability mass function
(PMF).
294
C. Asuai, H. Houssem & M. Ibrahim Probability and Statistics
295
296
MODULE 12
ADVANCED PANDAS FOR DATA
SCIENCE
Dealing with data in Python often relies on Pandas. While the
essential features are enough for many purposes, to efficiently
work with big and complicated data, you need to know the
advanced tools. This module is based on what you have learned
about Pandas and teaches you more advanced methods for data
analysis.
As a result of this module, you will be able to:
• Efficiently manipulate large datasets using Pandas.
• Utilize advanced indexing techniques.
• Apply group operations and time series analysis.
• Optimize Pandas performance for large-scale data
processing.
By mastering the advanced techniques covered in this module, you
will be well-prepared to handle complex data science tasks and
optimize your data processing workflows using Pandas.
INTRODUCTION
This module discusses Pandas features such as multi-indexing, how
to boost performance and handling of big data. At this point, we
should have learned how to import pandas.
In [1]: import pandas as pd
As a result, when you see a pd. in code, it’s referring to pandas.
You may also find it easier to import Series and DataFrame into
the local namespace since they are so frequently used:
In [2]: from pandas import Series, DataFrame
To get started with pandas, you will need to get comfortable with
its two workhorse data structures: Series and DataFrame. While
297
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
they are not a universal solution for every problem, they provide
a solid, easy-to-use basis for most applications.
Series
A Series is a one-dimensional array-like object containing a
sequence of values (of similar types to NumPy types) and an
associated array of data labels, called its index.
The simplest Series is formed from only an array of data:
In [11]: obj = pd.Series([4, 7, -5, 3])
In [12]: obj
Out[12]:
04
17
2 -5
33
dtype: int64
298
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
a -5
c3
dtype: int64
In [17]: obj2.index
Out[17]: Index(['d', 'b', 'a', 'c'], dtype='object')
Compared with NumPy arrays, you can use labels in the index
when selecting single values or a set of values:
In [18]: obj2['a']
Out[18]: -5
In [19]: obj2['d'] = 6
In [20]: obj2[['c', 'a', 'd']]
Out[20]:
c3
a -5
d6
dtype: int64
Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it
contains strings instead of integers. Using NumPy functions or
NumPy-like operations, such as filtering with a boolean array,
scalar multiplication, or applying math functions, will preserve the
index-value link:
In [21]: obj2[obj2 > 0]
Out[21]:
d6
b7
c3
dtype: int64
In [22]: obj2 * 2
Out[22]:
d 12
b 14
a -10
c6
dtype: int64
In [23]: np.exp(obj2)
299
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Out[23]:
d 403.428793
b 1096.633158
a 0.006738
c 20.085537
dtype: float64
Another way to think about a Series is as a fixed-length, ordered
dict, as it is a mapping of index values to data values. It can be used
in many contexts where you might use a dict:
In [24]: 'b' in obj2
Out[24]: True
In [25]: 'e' in obj2
Out[25]: False
Should you have data contained in a Python dict, you can create a
Series from it by passing the dict:
In [26]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [27]: obj3 = pd.Series(sdata)
In [28]: obj3
Out[28]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
When you are only passing a dict, the index in the resulting Series
will have the dict’s keys in sorted order. You can override this by
passing the dict keys in the order you want them to appear in the
resulting Series:
In [29]: states = ['California', 'Ohio', 'Oregon', 'Texas']
In [30]: obj4 = pd.Series(sdata, index=states)
In [31]: obj4
Out[31]:
California NaN
Ohio 35000.0
Oregon 16000.0
300
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
Texas 71000.0
dtype: float64
301
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
302
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
A Series’s index can be altered in-place by assignment:
In [41]: obj
Out[41]:
04
17
2 -5
33
dtype: int64
In [42]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
In [43]: obj
Out[43]:
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
DataFrame
A DataFrame represents a rectangular table of data and contains
an ordered collection of columns, each of which can be a different
value type (numeric, string, boolean, etc.). The DataFrame has
both a row and column index; it can be thought of as a dict of
Series all sharing the same index. Under the hood, the data is
stored as one or more two-dimensional blocks rather than a list,
dict, or some other collection of one-dimensional arrays. The
exact details of DataFrame’s internals are outside the scope of this
book. While a DataFrame is physically two-dimensional, you can
use it to represent higher dimensional data in a tabular format
using hierarchical indexing.
303
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
In [46]: frame.head()
Out[46]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
If you specify a sequence of columns, the DataFrame’s columns
will be arranged in that order:
In [47]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
304
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
Out[47]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2
If you pass a column that isn’t contained in the dict, it will appear
with missing values in the result:
In [48]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
....: index=['one', 'two', 'three', 'four',
....: 'five', 'six'])
In [49]: frame2
Out[49]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
In [50]: frame2.columns
Out[50]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
A column in a DataFrame can be retrieved as a Series either by
dict-like notation or by attribute:
In [51]: frame2['state']
Out[51]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
305
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
In [52]: frame2.year
Out[52]:
one 2000
two 2001
three 2002
four 2001
five 2002
six 2003
Name: year, dtype: int64
306
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
308
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
310
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
Type Notes
labels.
List of lists or tuples Treated as the “2D ndarray” case.
The DataFrame’s indexes are used unless
Another DataFrame
different ones are passed.
Like the “2D ndarray” case except masked
NumPy
values become NA/missing in the
MaskedArray
DataFrame result.
Index Objects
pandas’s Index objects are responsible for holding the axis labels
and other metadata (like the axis name or names). Any array or
other sequence of labels you use when constructing a Series or
DataFrame is internally converted to an Index:
In [76]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
In [77]: index = obj.index
In [78]: index
Out[78]: Index(['a', 'b', 'c'], dtype='object')
In [79]: index[1:]
Out[79]: Index(['b', 'c'], dtype='object')
Index objects are immutable and thus can’t be modified by the
user:
index[1] = 'd' # TypeError
Immutability makes it safer to share Index objects among data
structures:
In [80]: labels = pd.Index(np.arange(3))
In [81]: labels
Out[81]: Int64Index([0, 1, 2], dtype='int64')
In [82]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)
In [83]: obj2
Out[83]:
311
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
0 1.5
1 -2.5
2 0.0
dtype: float64
In [84]: obj2.index is labels
Out[84]: True
Some users will not often take advantage of the capabilities pro‐
vided by indexes, but because some operations will yield results
containing indexed data, it’s important to understand how they
work. In addition to being array-like, an Index also behaves like a
fixed-size set:
In [85]: frame3
Out[85]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [86]: frame3.columns
Out[86]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
In [87]: 'Ohio' in frame3.columns
Out[87]: True
In [88]: 2003 in frame3.index
Out[88]: False
Unlike Python sets, a pandas Index can contain duplicate labels:
In [89]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
In [90]: dup_labels
Out[90]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
Selections with duplicate labels will select all occurrences of that
label. Each Index has a number of methods and properties for set
logic, which answer other common questions about the data it
contains. Some useful ones are summarized in Table 12-2.
312
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
collection.
Computes a new Index
Index([1, 2,
delete() with the element at idx1.delete(2)
4, 5])
index i removed.
Computes a new Index
Index([1, 2,
drop() by deleting specified idx1.drop([3, 5])
4])
values.
Inserts an element at a
Index([1, 10,
insert() specified index, idx1.insert(1, 10)
2, 3, 4, 5])
creating a new Index.
313
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
like when you want to get served jollof rice at a party, .iloc
doesn’t ask for titles or descriptions , it just dives in based on
position. Clean, sharp.
The syntax follows:
df.iloc[row_index, column_index]
• Supports slicing and list-based selection.
Examples:
import pandas as pd
# Sample DataFrame
data = {'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
315
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
For example, if you have a DataFrame and want to access the row
with the label "Chinedu", you'd do df.loc['Chinedu']. No need to
count or guess, just use the label! It’s like calling your friend to
meet you at the local suya joint: .loc is especially useful when you
have a well-defined set of labels for your rows or columns. It’s like
that VIP entrance where your name is on the list, no questions
asked. Whether you’re dealing with dates, IDs, or names, just call
them out directly.
The syntax follows:
df.loc[row_label, column_label]
• Supports slicing and boolean conditions.
Examples:
# Selecting a single value
print(df.loc['row2', 'C']) # Output: 80
# Selecting a row
print(df.loc['row3']) # Output: Third row
316
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
In [233]: df.sum(axis='columns')
Out[233]:
a 1.40
b 2.60
c NaN
d -0.55
dtype: float64
NA values are excluded unless the entire slice (row or column in
this case) is NA.
This can be disabled with the skipna option:
In [234]: df.mean(axis='columns', skipna=False)
Out[234]:
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
In [236]: df.cumsum()
Out[236]:
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
Another type of method is neither a reduction nor an
accumulation. describe is one such example, producing multiple
summary statistics in one shot:
In [237]: df.describe()
Out[237]:
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
On non-numeric data, describe produces alternative summary
statistics:
In [238]: obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
In [239]: obj.describe()
Out[239]:
count 16
unique 3
top a
freq 8
dtype: object
319
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Expected
Method Description Example
Output
Number of
Count non-NA df.count() A: 5, B: 5
values.
Count, Mean,
Summary
Std, Min,
statistics for
25%, 50%,
Describe Series or df.describe()
75%, Max for
DataFrame
each column
columns.
(see below).
Compute min
df['A'].min(),
min, max and max 10, 50
df['A'].max()
values.
Index locations
argmin, (integers) of df['A'].argmin(),
0, 4
argmax min/max df['A'].argmax()
values.
Index labels of
idxmin, df['A'].idxmin(),
min/max 0, 4
idxmax df['A'].idxmax()
values.
Compute
Quantile quantile (0-1 df['A'].quantile(0.75) 40.0
range).
Sum Sum of values. df['A'].sum() 150
Mean of
Mean df['B'].mean() 25.0
values.
320
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
Expected
Method Description Example
Output
Median (50%
Median quantile) of df['A'].median() 30.0
values.
Mean absolute
Mad deviation from df['A'].mad() 12.0
mean.
Product of all
Prod df['A'].prod() 12000000
values.
Sample
Var df['A'].var() 250.0
variance.
Sample
Std standard df['A'].std() 15.81
deviation.
Sample
skewness
Skew df['A'].skew() 0.0
(third
moment).
Sample
kurtosis
Kurt df['A'].kurt() -1.2
(fourth
moment).
Cumulative [10, 30, 60,
Cumsum df['A'].cumsum()
sum. 100, 150]
[10, 10, 10, 10,
cummin, Cumulative df['A'].cummin(),
10], [10, 20,
cummax min/max. df['A'].cummax()
30, 40, 50]
Cumprod Cumulative df['A'].cumprod() [10, 200, 6000,
321
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Expected
Method Description Example
Output
product. 240000,
12000000]
First
[NaN, 10, 10,
Diff arithmetic df['A'].diff()
10, 10]
difference.
Compute [NaN, 1.0,
pct_change percent df['A'].pct_change() 0.5, 0.333,
changes. 0.25]
322
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
323
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
324
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
5b
6b
7c
8c
dtype: object
In [257]: mask = obj.isin(['b', 'c'])
In [258]: mask
Out[258]:
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
In [259]: obj[mask]
Out[259]:
0c
5b
6b
7c
8c
dtype: object
Related to isin is the Index.get_indexer method, which gives you
an index array from an array of possibly non-distinct values into
another array of distinct values:
In [260]: to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
In [261]: unique_vals = pd.Series(['c', 'b', 'a'])
In [262]: pd.Index(unique_vals).get_indexer(to_match)
Out[262]: array([0, 2, 1, 1, 0, 2])
See Table 12-6 for a reference on these methods.
326
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
327
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Cross-sections:
Extract cross-sections of data using pd.DataFrame.xs().
# Cross-section
print(df.xs(key=1, level='number'))
329
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Chunking:
When reading large files, you can process the data in smaller
chunks using the chunksize parameter in pd.read_csv(). This
allows you to work with data that doesn’t fit into memory all at
once.
import pandas as pd
Dask Integration:
Dask helps Pandas to manage large datasets that do not fit in
memory. The dask.dataframe sees to it that data analysis resembles
methods in Pandas, even though it takes place in parallel.
import dask.dataframe as dd
# Read a large CSV file using Dask
df = dd.read_csv('large_dataset.csv')
# Perform operations
df['column'] = df['column'] * 2
result = df.compute() # Converts Dask DataFrame to Pandas DataFrame
print(result.head())
330
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
Memory Optimization:
Improve the efficiency of your program by changing columns
with floating-point data to the shorter (i.e., float32, instead
of float64) format and converting columns with repeated values to
categorical data..
# Convert data types to save memory
df['column'] = df['column'].astype('float32')
# KNN Imputation
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
Data Transformation:
Convert your data by calling pd.melt() to change it to long form
or use pd.pivot_table() to compact and modify your data.
# Melt DataFrame
melted_df = pd.melt(df, id_vars=['id'], value_vars=['col1', 'col2'])
331
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
# Pivot Table
pivot_df = pd.pivot_table(df, values='value', index='date', columns='category
')
String Operations:
When you want to extract substrings, replace patterns or split
your columns in pandas, use the pd.Series.str accessor.
# Extract substrings
df['new_column'] = df['string_column'].str.extract(r'(\d+)')
# Replace patterns
df['string_column'] = df['string_column'].str.replace('old', 'new')
# Split columns
df[['first', 'last']] = df['name'].str.split(' ', expand=True)
Query Method:
With the query() method, you can filter data more conveniently
and quickly than by using regular boolean indexing. You can use a
string expression to apply conditions in your DataFrame which
can make your code appear simpler and more manageable when
you deal with many conditions.
332
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
# Query method
filtered_df = df.query('value > 20')
print(filtered_df)
Aggregation Functions:
After using the groupby() method to organize data, aggregation
functions in pandas allow you to summarize the results and
examine them. They allow you to use statistical methods such as
sum, mean, count and so on with every group, making useful
observations from big databases..
Common Built-in Aggregation Functions:
• sum(): This function adds up all the values within each
group.
• mean(): It calculates the average of values within each
group.
• count(): This counts the number of non-null values in each
group.
• min(): Finds the minimum value in each group.
• max(): Finds the maximum value in each group.
• std(): Calculates the standard deviation of the values in each
group.
• median(): Computes the median of the values in each group.
333
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Example:
For example, df is a DataFrame that has the following columns
and rows:
Category Value
A 10
A 20
B 30
B 40
Using the functions sum(), mean() or count() provides a summary
of the data:
df.groupby('Category')['Value'].sum()
Output:
Category Value
A 30
B 70
df.groupby('Category')['Value'].mean()
Output:
Category Value
A 15.0
B 35.0
df.groupby('Category')['Value'].count()
Output:
Category Value
A 2
B 2
334
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
335
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
336
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
337
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
print(df)
Output:
Date Value Rolling_Mean
2021-01-01 10 NaN
338
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
print(df)
339
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Output:
Date Value Cumulative_Sum
2021-01-01 10 10
2021-01-02 20 30
2021-01-03 30 60
2021-01-04 40 100
2021-01-05 50 150
In this example, expanding().sum() calculates the cumulative sum of
the 'Value' column, progressively adding each new value to the
running total.
Common Operations with Expanding:
• Cumulative sum: df['Value'].expanding().sum()
• Cumulative mean: df['Value'].expanding().mean()
• Cumulative standard deviation: df['Value'].expanding().std()
340
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
df = pd.DataFrame(data)
df.set_index('Day', inplace=True)
341
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Resampling:
To change the frequency of your time series data,
apply resample().
# Resample to monthly
monthly_df = df.resample('M').mean()
# Compute differences
df['diff'] = df['value'].diff()
342
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
Time Zones:
Time zones can be processed using tz_localize() and tz_convert().
# Localize and convert time zones
df = df.tz_localize('UTC').tz_convert('US/Eastern')
# DateOffsets
df['next_month'] = df.index + pd.DateOffset(months=1)
More information on Time series is included in module 15 of this
book.
343
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
2. Feather Format
Feather allows for quick data access, making it a good selection for
exchange between Python and R projects.
# Save to Feather format
df.to_feather('data.feather')
344
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
# Sample DataFrame
df = pd.DataFrame({'column': np.arange(1000000)})
@jit(nopython=True)
def custom_function(x):
return x * 2
3. Parallel Processing
If you’re performing tasks that can be split into chunks, use Joblib
or Python’s multiprocessing to process data in parallel.
Example with Joblib:
from joblib import Parallel, delayed
import pandas as pd
import numpy as np
# Sample DataFrame
df = pd.DataFrame({'column': np.arange(1000000)})
345
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
346
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
# Load data
df = pd.read_csv('messy_data.csv')
# Clean data
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['value'] = pd.to_numeric(df['value'], errors='coerce')
df.dropna(inplace=True)
# Transform data
df['log_value'] = np.log(df['value'])
Case Study 2: Perform advanced group operations on a real-world
dataset to derive insights.
# Group by category and calculate summary statistics
grouped = df.groupby('category')
summary = grouped.agg({'value': ['mean', 'std', 'count']})
print(summary)
Time Series Forecasting
Case Study 3: Build a time series forecasting model using Pandas
and Statsmodels.
import statsmodels.api as sm
# Prepare data
df.set_index('date', inplace=True)
df = df.asfreq('D')
# Clean data
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['value'] = pd.to_numeric(df['value'], errors='coerce')
df.dropna(inplace=True)
# Transform data
df['log_value'] = np.log(df['value'])
# Analyze data
summary = df.groupby('category').agg({'value': ['mean', 'std', 'count']})
print(summary)
Case study 5: Create a time series forecasting model using Pandas
and additional libraries like Statsmodels or Prophet.
import statsmodels.api as sm
# Prepare data
df.set_index('date', inplace=True)
df = df.asfreq('D')
348
C. Asuai, H. Houssem & M. Ibrahim Advanced Pandas for Data Science
QUESTIONS
1. Differentiate between loc[] and iloc[] in Pandas.
2. Perform time series resampling on a dataset with daily data.
3. What is chunking in Pandas, and why is it useful for large
datasets?
Provide an example of reading a CSV file in chunks.
4. What are some techniques to optimize memory usage in
Pandas?
Provide an example of converting a column to a categorical
data type.
5. How can you handle missing data in a DataFrame using
interpolation?
Write a code snippet to interpolate missing values in a column.
6. What is the purpose of pd.melt()?
Provide an example of reshaping a DataFrame using pd.melt().
7. What is a MultiIndex in Pandas?
Create a MultiIndex DataFrame with at least three levels and
perform indexing.
6. How do you access data from a MultiIndex DataFrame
using pd.IndexSlice?
Provide an example of slicing a MultiIndex DataFrame.
7. What is boolean indexing, and how is it used in Pandas?
Write a code snippet to filter rows where a column’s value is
greater than 50.
8. How does the query() method improve readability in
filtering data?
Rewrite the following boolean indexing using
the query() method:
df[df['value'] > 20]
9. What is the purpose of pd.DataFrame.xs()?
349
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
351
352
MODULE 13
ERRORS AND EXCEPTION
HANDLING
Errors and exceptions are inevitable in programming, and
handling them properly is crucial for writing robust, maintainable,
and error-resilient code. This module covers the fundamentals of
errors and exception handling in Python, their importance in data
science, and best practices for writing reliable programs.
By the end of this module, the reader will be able to:
1. Understand the Basics of Errors and Exceptions
2. Use Basic Exception Handling Techniques
3. Implement Advanced Exception Handling
4. Raise and Customize Exceptions
5. Apply Exception Handling in Data Science Workflows
6. Debug and Log Exceptions
7. Follow Best Practices for Exception Handling
8. Apply Exception Handling in Practical Data Science
Scenarios
9. Write Robust and Error-Resilient Code
10. Understand the Role of Exception Handling in Production
Environments
• Learn how exception handling contributes to the
reliability and maintainability of data science
applications in production.
• Implement exception handling strategies
for scalable and distributed data processing
systems.
353
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
INTRODUCTION
Python errors generally fall into three broad categories, but we'll
focus on the two most common ones: Syntax Errors and
Runtime Errors.
Syntax Errors: The Grammar Mistakes of Code
A syntax error occurs when your code breaks the rules of
Python's language. Think of it as submitting a WAEC exam in all
capital letters and forgetting punctuation. Python immediately
flags this and refuses to run your program.
354
C. ASUAI, H. HOUSSEM & M. IBRAHIM ERRORS AND EXCEPTION HANDLING
Example:
print("Hello World" # Missing closing parenthesis
Output:
SyntaxError: unexpected EOF while parsing
Python reads your code like a meticulous English teacher,once it
notices a misplaced comma or an unclosed bracket, it halts
everything. These types of errors are caught before the program
even begins to run.
Runtime Errors: When Trouble Starts Midway
A runtime error, on the other hand, sneaks in after your code has
passed the initial check. Imagine you've written a brilliant exam,
but forgot to sign your name. It’s only during marking that the
omission becomes a problem. That's what runtime errors feel like.
Example:
print(10 / 0)
Output:
ZeroDivisionError: division by zero
Dividing by zero is mathematically undefined, and Python will
not allow it. When this happens, the program crashes unless
you've told it what to do in such a scenario.
355
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Output:
Cannot divide by zero!
Instead of an error message stopping your program, Python
smoothly prints out your custom message and moves on. This is
particularly useful in real-world applications like mobile apps or
automated systems, where you want your software to keep
running even if a small issue occurs.
357
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
try:
df = pd.read_csv("data.csv")
except FileNotFoundError:
print("The dataset could not be found.")
2. Robust Pipelines
In production-grade pipelines (e.g., model training, data
preprocessing, or feature engineering), one bad input can cause the
entire flow to fail. Exception handling helps isolate errors so you
can skip or log faulty data while the rest of the pipeline keeps
running like a Danfo bus that never stops:
for file in file_list:
try:
data = pd.read_csv(file)
# Process data
except Exception as e:
print(f"Error processing {file}: {e}")
continue
This ensures your machine learning jobs don't stop halfway
because of a single bad file or a NaN where you expected a float.
3. User Experience
When building data science tools or applications, good exception
handling leads to clearer, more helpful error messages. Whether
it's a data analyst running a script or a manager using your
dashboard, giving them useful feedback instead of cryptic Python
tracebacks is key:
try:
result = model.predict(user_input)
except ValueError:
print("Invalid input. Please provide numerical values.")
t’s much clearer for users to see a traffic light than just to hear
someone yelling “Stop!”.
358
C. ASUAI, H. HOUSSEM & M. IBRAHIM ERRORS AND EXCEPTION HANDLING
Example
try:
int("abc")
except ValueError:
print("Invalid value for conversion!")
Output:
Invalid value for conversion!
359
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Dealing with data from users or files from outside can lead to
various problems. Many types of exceptions can be managed in
Python using several except blocks..
try:
num = int(input("Enter a number: "))
print(10 / num)
except ZeroDivisionError:
print("Error: Cannot divide by zero!")
except ValueError:
print("Error: Invalid input, please enter a number.")
361
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Example
try:
file = open("data.txt", "r")
content = file.read()
except FileNotFoundError:
print("File not found!")
else:
print("File read successfully.")
finally:
if 'file' in locals():
file.close()
362
C. ASUAI, H. HOUSSEM & M. IBRAHIM ERRORS AND EXCEPTION HANDLING
Output:
Inner: Division by zero!
363
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
def process_data(data):
if not data:
raise InvalidDataError("Data cannot be empty!")
return data
try:
process_data([])
except InvalidDataError as e:
print(e)
Output:
Data cannot be empty!
try:
cleaned_data.append(float(value))
except (TypeError, ValueError):
print(f"Skipping invalid value: {value}")
print(cleaned_data)
Output:
Skipping invalid value: None
[1.0, 2.0, 4.0]
• Example:
try:
with open("data.csv", "r") as file:
content = file.read()
except FileNotFoundError:
print("File not found!")
365
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
except requests.exceptions.RequestException as e:
print(f"API request failed: {e}")
faulty_function()
Output:
ZeroDivisionError: division by zero
Traceback (most recent call last):
366
C. ASUAI, H. HOUSSEM & M. IBRAHIM ERRORS AND EXCEPTION HANDLING
367
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
• Example:
import unittest
def divide(a, b):
if b == 0:
raise ValueError("Cannot divide by zero!")
return a / b
class TestDivision(unittest.TestCase):
def test_divide_by_zero(self):
368
C. ASUAI, H. HOUSSEM & M. IBRAHIM ERRORS AND EXCEPTION HANDLING
with self.assertRaises(ValueError):
divide(10, 0)
unittest.main()
try:
df = pd.read_csv("data.csv")
df["column"] = pd.to_numeric(df["column"], errors="coerce")
except FileNotFoundError:
print("File not found!")
try:
model = LinearRegression()
model.fit(X_train, y_train)
except ValueError as e:
369
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
try:
plt.plot([1, 2, 3], [4, 5, None])
except ValueError:
print("Invalid data for plotting!")
370
C. ASUAI, H. HOUSSEM & M. IBRAHIM ERRORS AND EXCEPTION HANDLING
• Example:
try:
data = load_data("data.csv")
cleaned_data = clean_data(data)
model = train_model(cleaned_data)
except Exception as e:
print(f"Pipeline failed: {e}")
• Example:
try:
result = distributed_computation(data)
371
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
except TimeoutError:
retry_computation(data) # Retry on failure
372
C. ASUAI, H. HOUSSEM & M. IBRAHIM ERRORS AND EXCEPTION HANDLING
QUESTIONS
1. What is the difference between a syntax error and a
runtime error? Provide an example of each.
2. What is an exception in Python? How is it different from a
runtime error?
3. List three common built-in exceptions in Python and
explain when they are raised.
4. Why is exception handling important in data science
workflows? Provide a real-world scenario.
5. Write a Python code snippet that uses
a try and except block to handle a ZeroDivisionError.
6. What is the purpose of handling specific exceptions instead
of catching all exceptions generically?
7. Write a Python code snippet that uses
multiple except blocks to handle
both ValueError and TypeError.
8. What is the purpose of the else clause in a try-except block?
Provide an example.
9. Explain the role of the finally clause in exception handling.
Write a code snippet to demonstrate its use.
10. What are nested try-except blocks? Provide an example
where they might be useful.
11. Can the finally clause be used without an except block?
Explain with an example.
12. Why would you want to create a custom exception instead
of using a built-in exception?
13. Write a Python function that raises a custom exception if
the input is negative.
14. How would you handle missing or corrupted data in a
dataset using exception handling? Provide an example.
373
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
15. Write a Python code snippet to safely read a file and handle
the FileNotFoundError exception.
16. What is a traceback in Python? How can you use it to
debug errors?
17. Write a Python code snippet that logs an exception using
the logging module.
18. Write a Python code snippet that provides a user-friendly
error message for a KeyError.
19. Why is it important to write unit tests for exception
handling? Provide an example.
20. What are some common errors that can occur during
model training, and how would you handle them?
21. Write a Python function that reads a CSV file, processes its
contents, and handles any file-related exceptions.
374
MODULE 14
PLOTTING AND VISUALIZATION
It is very important we tell stories with our data. Pictures they say
speak louder than text, information are better understood when
they are visualized. As a datascientist, visualization forms part of
your day-to-day tasks.
By the end of this module, students will:
1. Understand the importance of data visualization in data
science.
2. Master key Python libraries for visualization (Matplotlib,
Seaborn, Pandas, Plotly).
3. Customize and style visualizations for clarity and impact.
4. Create advanced visualizations, including multi-panel
figures and interactive dashboards.
5. Apply visualization techniques to real-world datasets and
case studies.
6. Tell compelling stories with data using best practices in
visualization.
7. Prepare visualizations for reports, presentations, and web
applications.
8. Evaluate and critique visualizations for effectiveness and
accuracy.
Human Perception
Humans are inherently visual creatures. Research shows that the
brain processes visual information 60,000 times faster than text.
Visualizations leverage this by:
• Simplifying Complexity: A single chart can summarize
thousands of data points, making it easier to identify
trends, outliers, and patterns.
376
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
377
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Numerical Data
plt.show()
• Count Plots: Display the count of observations in each
category.
sns.countplot(x='category', data=df)
plt.title('Count of Categories')
plt.show()
Time-Series Data
Time-series data is collected over time, such as daily stock prices
or monthly sales.
• Line Charts: Show trends over time.
plt.plot(df['date'], df['sales'])
plt.title('Monthly Sales')
plt.show()
• Area Charts: Similar to line charts but with the area below
the line filled, emphasizing volume.
plt.fill_between(df['date'], df['sales'], color='skyblue')
plt.title('Monthly Sales')
plt.show()
Geospatial Data
Geospatial data includes location-based information, such as city
populations or regional sales.
• Maps: Visualize data on a geographic map.
import geopandas as gpd
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.plot()
• Choropleth Maps: Use color gradients to represent data
values across regions.
world['population_density'] = world['pop_est'] / world['area']
world.plot(column='population_density', legend=True)
379
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Relationships
Visualizations can also show relationships between variables.
• Heatmaps: Display correlations or relationships in a
matrix format.
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
• Pair Plots: They illustrate relationships existing between
pairs of variables in a dataset.
sns.pairplot(df)
plt.show()
• Correlation Matrices: Summarize relationships between
numerical variables.
df.corr()
We will cover these visualizations in detail later in this module.
Basics of Matplotlib
Most Python programmers use Matplotlib to make static plots.
You can use it to create various charts and graphs, like lines and
complex charts on several panels. In 2003, John D. Hunter
introduced this tool with a user interface like MATLAB and
developed it to be highly flexible.
Why Use Matplotlib?
• Versatile: Supports line plots, scatter plots, bar charts,
histograms, 3D plots, and more.
• Publication-Quality Output: Allows fine-tuning of every
plot element (fonts, colors, styles).
• Integration: Works well with NumPy, Pandas, and
Jupyter Notebooks.
380
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
1. Figure (plt.figure)
Using plt.figure(), you can create a Figure object to container every
chart element in Matplotlib. Your mind can see a figure as an
empty canvas that you can use to add one or more plots. Any
visual in Matplotlib begins with this statement. You can put
various subplots or Axes, side by side or one above the other in
the same figure to view and compare them together.
It is possible to customize a figure with the plt.figure() function. For
example, you tell Inkscape the size of the figure by using figsize and
passing in the values for width and height in inches. This is
beneficial when organizing plots for presentations or publications.
To adjust the background color of the figure, use the facecolor
parameter. Choosing a high dpi ensures the information in your
plots remains sharp even when exporting as a file.
381
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
In the end, plt.figure() is used to arrange the canvas and its look
before you add any subplots or other details. This aspect is critical
when making multi-panel plots or enhancing a plot for great look.
Example:
fig = plt.figure(figsize=(8, 6), facecolor='lightgray', dpi=100)
382
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
4. Legend (plt.legend)
To understand the colors, markers and line choices in a Matplotlib
plot, a legend is essential. The legend makes it easy for the
audience to see at a glance what data points are represented by the
various graphical patterns on the chart.
You can call plt.legend() after specifying a label for your line in the
main plot. You can call plt.legend() after specifying the labels and it
will add a legend to your graph.
The legend allows players to modify it in many ways. You can
control its position on the plot using the loc parameter, such as
loc='upper left' or loc='best', which automatically selects an optimal
location. You can also add a title to the legend using the title
parameter for added clarity, and toggle the visibility of the legend
frame with frameon=False to create a cleaner look.
Example:
plt.plot([1, 2, 3], label='Line 1')
plt.plot([3, 2, 1], label='Line 2')
383
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
1. Line Plot
A line plot is a main approach for presenting how values change
continuously, for example, during a period of time. Using lines, it
links individual points to draw patterns in data that develops in
sequence. People often use line plots in time-series research,
market trends, record changes in temperature or monitor any type
of progress over time.
Creating line plots is quicker in Python because of Matplotlib and
Seaborn libraries. In Matplotlib, the plt.plot() function is widely
used to change the style and appearance of lines, colors and
markers. When you use lineplot() in Seaborn, you can also get
confidence intervals for the overall data. You can also enrich your
line plot by including notes, making multiple lines visible and
even including interactivity tools provided by Plotly.
384
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
Key Customizations:
• Line style: '-' (solid), '--' (dashed), ':' (dotted).
• Color: 'r' (red), 'g' (green), '#1f77b4' (hex code).
• Marker: 'o' (circle), 's' (square), '*' (star).
Example:
plt.plot(
[1, 2, 3, 4],
[10, 20, 25, 30],
linestyle='--',
color='green',
marker='o',
label='Trend'
)
plt.legend()
plt.show()
385
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Bar Chart
Using a bar chart (or bar graph), we can highlight the values of
data in categories by making each bar as short or tall as its related
value. A bar chart can be used to compare data grouped into sets
such as counts, lists of occurrences or metrics that have been
brought together from various groups (e.g., sales data for each
region, responses to a survey or descriptions of different
populations).
Scatter Plot
# Add title
plt.title('Scatter Plot')
Key Customizations:
• Point size: s=100 (either a scalar or an array).
• Color mapping: Colormap can be expressed as c=[...] (containing the
numbers).
• Transparency: alpha=0.5 (0=transparent, 1=opaque).
389
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
390
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
A plot is clear only if the special elements like labels and legends
help communicate the information.
• Titles (plt.titleThe purpose of the data can be roughly explained
by its title (e.g., “Monthly Sales Trends in 2023”). A carefully
chosen title can make it clear to the audience what the case is
about.
• Axis Labels using plt.xlabel() and plt.ylabel() allows you to set the
names of each axis and their units such as "Temperature (°C)" or
"Revenue (in $). If we do not use labels, the information in the
data becomes confusing.
• Legends (plt.legend()) you should include legends when you have
many lines or groups of data in the same plot. They use various
colors/markers for different sets, so it’s easy to separate the data.
Example
import matplotlib.pyplot as plt
# Data
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
391
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Labels the Y-
plt.ylabel() plt.ylabel('Temperature (°C)')
axis
392
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
Common Customizations
Colors
• Named colors: 'red', 'blue', 'green'
• Hex codes: '#FF5733' (orange), '#1f77b4' (Matplotlib blue)
• Shortcuts: 'r' (red), 'g' (green), 'b' (blue)
Table 14-2: Line Styles
Style Description
Marker Description
'o' Circle
's' Square
'^' Triangle
393
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Marker Description
'*' Star
'D' Diamond
394
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
plt.xlim(min,
Sets X-axis range plt.xlim(0, 10)
max)
plt.ylim(min,
Sets Y-axis range plt.ylim(-5, 5)
max)
plt.yticks(ticks, Customizes Y-
plt.yticks([0, 50, 100])
labels) axis ticks
plt.title('Sales',
Title plt.title()
fontsize=14)
X/Y
plt.xlabel(), plt.ylabel() plt.xlabel('Time (s)')
Labels
plt.legend(loc='upper
Legend plt.legend()
right')
Line
color='red' color='#1f77b4'
Color
395
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Axis
plt.xlim(), plt.ylim() plt.xlim(0, 100)
Limits
plt.xticks([1,2],
Ticks plt.xticks(), plt.yticks()
['Low','High'])
396
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
1. Creating Subplots
Basic Subplot Grid
plt.tight_layout()
plt.show()
397
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
398
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
399
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
400
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
plt.show()
Basic grid
plt.subplots() fig, axs = plt.subplots(2, 2)
of subplots
401
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Single
subplot
plt.subplot() plt.subplot(2, 2, 1)
(index-
based)
Add
subplots to
add_subplot() ax = fig.add_subplot(1, 2, 1)
a figure
object
Complex,
gs = GridSpec(2, 2); ax =
GridSpec flexible
fig.add_subplot(gs[0, :])
layouts
Fine-tune
spacing plt.subplots_adjust(wspace=0.5,
subplots_adjust()
between hspace=0.3)
subplots
INTRODUCTION TO SEABORN
402
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
sns.histplot(data=df,
Frequency
Histogram sns.histplot() x='age', bins=20,
distribution
kde=True)
Smooth sns.kdeplot(data=df,
KDE Plot sns.kdeplot() density x='income',
estimate fill=True)
Marginal sns.rugplot(data=df,
Rug Plot sns.rugplot()
distributions x='age')
Example:
# Combined histogram and KDE
sns.histplot(data=df, x='price', kde=True, bins=15)
plt.title("Price Distribution")
plt.show()
2. Categorical Plots
With categorical plots, you can examine numbers that belong to
different groups and notice any increases, decreases or unusual
404
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
Plot
Function Use Case Example
Type
sns.barplot(x='class',
Bar Compare
sns.barplot() y='survival_rate',
Plot means
data=df)
Show sns.boxplot(x='species',
Box
sns.boxplot() quartiles & y='petal_length',
Plot
outliers data=df)
Distributio
Violin sns.violinplot( sns.violinplot(x='day',
n +
Plot ) y='total_bill', data=df)
density
Example:
# Box plot with hue
sns.boxplot(data=df, x='species', y='sepal_length', hue='region')
plt.title("Sepal Length by Species")
plt.show()
405
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
3. Relationship Plots
Plot
Function Use Case Example
Type
2D sns.scatterplot(x='heig
Scatter sns.scatterplo
relationshi ht', y='weight',
Plot t()
ps data=df)
All
Pair pairwise sns.pairplot(df,
sns.pairplot()
Plot relationshi hue='species')
ps
Example:
# Scatter plot with regression line
sns.lmplot(data=df, x='engine_size', y='mpg', hue='fuel_type')
plt.title("Engine Size vs. MPG")
plt.show()
406
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
Advanced Features
Faceting (Small Multiples)
A FacetGrid allows seaborn to separate the data into different
categories and plot each subset in its own panel. Using this
method, patterns are compared well across groups because all axes
remain the same.
# Create DataFrame
data = {
'total_bill': [25, 18, 40, 15, 30, 22, 50, 10, 45, 20],
'tip': [4, 2, 7, 1, 5, 3, 10, 1, 8, 2],
'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male',
'Female', 'Male', 'Female'],
'smoker': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
'day': ['Sun', 'Sat', 'Sun', 'Sat', 'Sun', 'Sat', 'Sun', 'Sat', 'Sun', 'Sat'],
'time': ['Dinner', 'Lunch', 'Dinner', 'Lunch', 'Dinner', 'Lunch', 'Dinner',
'Lunch', 'Dinner', 'Lunch']
}
df = pd.DataFrame(data)
print(df.head())
g = sns.FacetGrid(df, col="time", row="smoker", margin_titles=True,
height=4)
g.map(sns.histplot, "total_bill", kde=True, bins=5, color="skyblue")
g.set_axis_labels("Total Bill ($)", "Frequency")
g.set_titles(col_template="{col_name}", row_template="Smoker:
{row_name}")
plt.tight_layout()
plt.show()
407
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Example
# Set theme and palette
sns.set_theme(style="ticks", palette="deep", font_scale=1.2)
Regression Plots
Regression analysis in Seaborn allows the user to check the
association between two continuous variables and to plot a
fitted line predicting that relationship. The most popular choice
for this job is sns.regplot() which responds by displaying a scatter
plot and a linear regression line which is already included. By
doing this, we are able to determine if the relationship goes up,
down or up and down.
With seaborn, you can also use the function sns.lmplot() which
behaves like regplot and allows you to chart data from different
subgroups using features such as hue, col or row. In EDA, these
plots play a crucial part since they use statistics and visuals to
clarify data. You can also personalize seaborn regression plots by
408
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
# Sample dataset
tips = sns.load_dataset("tips")
410
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
# Pairwise relationships
sns.pairplot(df, hue='target_column')
High (themes,
Customization Low
palettes)
411
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
One thing that makes Plotly unique is its ability to make data
visualization interactive and interesting to look at. Listed below
are some important features of this site:
• Hover Tooltips – This feature in Plotly makes it easy to
see precise numbers when you move over data points.
• Zoom and Pan Interactions – With these interactions,
you can enlarge some parts of a plot or smoothly browse
along the axis bars, all without having to recreate the chart.
• 3D Plots and Animations – With Plotly, you can make
3D plots, animate data over time and observe how data
changes over time.
• Built-in Themes – Plotly offers a choice of built-in styles
and themes (plotly_dark, ggplot2, seaborn, etc.) that
quickly give any visualization a great look for publication.
412
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
Basic Example
import plotly.express as px
# Interactive scatter plot
fig = px.scatter(
data_frame=df,
x='GDP_per_capita',
y='Life_Expectancy',
color='Continent',
size='Population',
hover_name='Country',
title='Life Expectancy vs. GDP per Capita'
)
fig.show()
# Sample data
df = pd.DataFrame({
'x': [1, 2, 3, 4],
'y': [10, 15, 13, 17],
'z': [5, 6, 7, 8]
413
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
})
fig = go.Figure(data=[go.Scatter3d(
x=df['x'],
y=df['y'],
z=df['z'],
mode='markers',
marker=dict(
size=8,
color=df['z'], # Color by z value
colorscale='Viridis',
opacity=0.8
)
)])
fig.show()
Here, Plotly emphasizes the value of interactivity, taking data
analysis to a new level by supporting 3D graphs instead of the
standard 2D lines.
Example 2:
fig = px.scatter_3d(
df,
x='Height',
y='Weight',
z='Age',
color='Gender'
)
fig.show()
414
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
Animations in Plotly
# Sample data structure (assumes df has columns 'Year', 'Sales', and 'Quarter')
fig = px.bar(
df,
x='Year',
y='Sales',
animation_frame='Quarter',
range_y=[0, 1000]
)
fig.show()
This code creates an animated bar chart where each frame
represents a different quarter, and bars reflect the sales values for
different years. This is helpful in dashboards or business reports
where temporal change is important.
415
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
app.layout = html.Div([
dcc.Graph(figure=fig),
dcc.Slider(min=0, max=10, step=1, value=5)
])
app.run_server(debug=True)
This simple app shows an interactive graph with a slider. You can
connect the slider to filter or animate plots, making dashboards
truly interactive and user-driven.
Geospatial Visualization
GeoPandas Basics
GeoPandas is an extension of Pandas that enables you to work
with geospatial data, such as shapefiles or GeoJSON. It adds
support for geometric operations and plotting maps directly with
Matplotlib.
With GeoPandas, you can easily load and visualize global,
national, or regional boundaries, and even overlay datasets like
population, GDP, or pollution.
Example: Plotting a World Map
import geopandas as gpd
import matplotlib.pyplot as plt
Choropleth Maps
# Set title
plt.title('Population Density by Country')
417
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
plt.show()
Here, we use the given example:
• Population density is found by dividing the population
figure by the country’s size.
• Using the OrRd color scale, the map reflects that lower
population densities are represented by yellow and higher
densities are represented by red..
• The equidistant coloring in coordcat makes each part of
the data about the same size, allowing for a more balanced
display.
From the choropleth map, the darker shades of red on the map
represent countries with larger population densities.
They are valuable for highlighting differences between areas in
various data, especially for spotting any noticeable patterns in
locations..
Geospatial Geopandas,
Choropleth + point maps
Analysis Folium
Correlation
Seaborn Heatmap + pair plot
Analysis
418
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
Advanced Topics
ML Model Interpretation
It is vital to interpret the actions of complex machine learning
models and see which elements are affecting their predictions. A
number of approaches and representations are available to help
people interpret these models easily:
• SHAP (Shapley Additive Explanations): SHAP credit
each part of the input to an answer, helping to understand
what a machine learning model focuses on.
shap.plots.waterfall() displays this data in a row of added
predictions compared to the average result.
Example of a SHAP Waterfall Plot:
import shap
import matplotlib.pyplot as plt
419
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
model = RandomForestRegressor()
model.fit(X_train, y_train) # Fit the model
# Convert to image
img = tf.shade(agg)
img.show()
Datashader efficiently handles large data visualizations by
rendering only the necessary data points for display,
improving performance while maintaining clarity.
• Vaex: A library optimized for out-of-core DataFrames,
allowing users to manipulate and visualize large datasets
that don’t fit into memory. Vaex provides fast operations
on datasets up to billions of rows by using lazy evaluations
and memory-mapping techniques.
Example (Vaex DataFrame and Plotting):
import vaex
421
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
422
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
ax3 = fig.add_subplot(2, 2, 3)
When you issue a plotting command like plt.plot([1.5, 3.5, -2,
1.6]), matplotlib draws on the last figure and subplot used
(creating one if necessary), thus hiding the figure and subplot
creation.
423
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
424
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
Here is a small example where I shrink the spacing all the way to
zero:
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for i in range(2):
for j in range(2):
axes[i, j].hist(np.random.randn(500), bins=50, color='k', alpha=0.5)
plt.subplots_adjust(wspace=0, hspace=0)
You may notice that the axis labels overlap. matplotlib doesn’t
check whether the labels overlap, so in a case like this you would
need to fix the labels yourself by specifying explicit tick locations
and tick labels (we’ll look at how to do this in the following
sections).
425
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
426
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
427
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Adding legends
Legends are another critical element for identifying plot elements.
There are a couple of ways to add one. The easiest is to pass the
label argument when adding each piece of the plot:
In [44]: from numpy.random import randn
In [45]: fig = plt.figure(); ax = fig.add_subplot(1, 1, 1)
In [46]: ax.plot(randn(1000).cumsum(), 'k', label='one')
Out[46]: [<matplotlib.lines.Line2D at 0x7fb624bdf860>]
In [47]: ax.plot(randn(1000).cumsum(), 'k--', label='two')
Out[47]: [<matplotlib.lines.Line2D at 0x7fb624be90f0>]
In [48]: ax.plot(randn(1000).cumsum(), 'k.', label='three')
Out[48]: [<matplotlib.lines.Line2D at 0x7fb624be9160>]
Once you’ve done this, you can either call ax.legend() or
plt.legend() to automatically create a legend.
In [49]: ax.legend(loc='best')
The legend method has several other choices for the location loc
argument. See the docstring (with ax.legend?) for more
information.
428
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
The loc tells matplotlib where to place the plot. If you aren’t
picky, 'best' is a good option, as it will choose a location that is
most out of the way. To exclude one or more elements from the
legend, pass no label or label='_nolegend_'.
429
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
430
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
Argument Description
Explicit file format to use (e.g., 'png', 'pdf', 'svg', 'ps',
Format
'eps').
Matplotlib Configuration
matplotlib comes configured with color schemes and defaults that
are geared primarily toward preparing figures for publication.
Fortunately, nearly all of the default behavior can be customized
via an extensive set of global parameters governing figure size,
subplot spacing, colors, font sizes, grid styles, and so on. One way
to modify the configuration programmatically from Python is to
use the rc method; for example, to set the global default figure size
to be 10 × 10, you could enter:
plt.rc('figure', figsize=(10, 10))
The first argument to rc is the component you wish to customize,
such as 'figure', 'axes', 'xtick', 'ytick', 'grid', 'legend', or many
others.
After that can follow a sequence of keyword arguments indicating
the new parameters. An easy way to write down the options in
your program is as a dict:
font_options = {'family' : 'monospace',
'weight' : 'bold',
'size' : 'small'}
plt.rc('font', **font_options)
For more extensive customization and to see a list of all the
options, matplotlib comes with a configuration file matplotlibrc in
the matplotlib/mpl-data directory. If you customize this file and
place it in your home directory titled .matplotlibrc, it will be
loaded each time you use matplotlib.
432
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
As we’ll see in the next section, the seaborn package has several
built-in plot themes or styles that use matplotlib’s configuration
system internally.
Line Plots
Series and DataFrame each have a plot attribute for making some
basic plot types. By default, plot() makes line plots
In [60]: s = pd.Series(np.random.randn(10).cumsum(), index=np.arange(0, 100,
10))
In [61]: s.plot()
The Series object’s index is passed to matplotlib for plotting on
the x-axis, though you can disable this by passing
use_index=False. The x-axis ticks and limits can be adjusted with
the xticks and xlim options, and y-axis respectively with yticks
and ylim. See Table 9-3 for a full listing of plot options. I’ll
433
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Argument Description
Label Label for the plot legend
A matplotlib subplot object to plot on; if not provided,
Ax
uses the active subplot
Style Style string (e.g., 'ko--') passed to matplotlib
Alpha Plot fill opacity (range: 0 to 1)
Type of plot: 'area', 'bar', 'barh', 'density', 'hist', 'kde', 'line',
Kind
'pie'
Logy Use logarithmic scaling on the y-axis
434
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
Argument Description
use_index Use the object's index for tick labels
Rot Rotation of tick labels (range: 0 to 360 degrees)
Xticks Values to use for x-axis ticks
Yticks Values to use for y-axis ticks
Xlim Limits for the x-axis (e.g., [0, 10])
Ylim Limits for the y-axis
Grid Display axis grid (enabled by default)
DataFrame has a number of options allowing some flexibility with
how the columns are handled; for example, whether to plot them
all on the same subplot or to create separate subplots. See Table 9-
4 for more on these.
Table 9-4: DataFrame-specific plot arguments
Argument Description
Subplots Plot each DataFrame column in a separate subplot
If subplots=True, share the same x-axis (links ticks and
Sharex
limits)
Sharey If subplots=True, share the same y-axis
Size of the figure to create, specified as a tuple (e.g.,
Figsize
(10, 6))
435
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Bar Plots
The plot.bar() and plot.barh() make vertical and horizontal bar
plots, respectively. In this case, the Series or DataFrame index will
be used as the x (bar) or y (barh) ticks
In [64]: fig, axes = plt.subplots(2, 1)
In [65]: data = pd.Series(np.random.rand(16), index=list('abcdefghijklmnop'))
In [66]: data.plot.bar(ax=axes[0], color='k', alpha=0.7)
Out[66]: <matplotlib.axes._subplots.AxesSubplot at 0x7fb62493d470>
In [67]: data.plot.barh(ax=axes[1], color='k', alpha=0.7)
The options color='k' and alpha=0.7 set the color of the plots to
black and use partial transparency on the filling.
With a DataFrame, bar plots group the values in each row
together in a group in bars, side by side, for each value.
In [69]: df = pd.DataFrame(np.random.rand(6, 4),
....: index=['one', 'two', 'three', 'four', 'five', 'six'],
....: columns=pd.Index(['A', 'B', 'C', 'D'], name='Genus'))
In [70]: df
Out[70]:
Genus A B C D
one 0.370670 0.602792 0.229159 0.486744
two 0.420082 0.571653 0.049024 0.880592
three 0.814568 0.277160 0.880316 0.431326
four 0.374020 0.899420 0.460304 0.100843
five 0.433270 0.125107 0.494675 0.961825
six 0.601648 0.478576 0.205690 0.560547
In [71]: df.plot.bar()
Note that the name “Genus” on the DataFrame’s columns is used
to title the legend.
We create stacked bar plots from a DataFrame by passing
stacked=True, resulting in the value in each row being stacked
together
In [73]: df.plot.barh(stacked=True, alpha=0.5)
A useful recipe for bar plots is to visualize a Series’s value
frequency using value_counts: s.value_counts().plot.bar().
436
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
437
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
438
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
In [92]: tips['tip_pct'].plot.hist(bins=50)
A related plot type is a density plot, which is formed by computing
an estimate of a continuous probability distribution that might
have generated the observed data. The usual procedure is to
approximate this distribution as a mixture of “kernels”,that is,
simpler distributions like the normal distribution. Thus, density
plots are also known as kernel density estimate (KDE) plots.
Using plot.kde makes a density plot using the conventional
mixture-of-normals estimate
In [94]: tips['tip_pct'].plot.density()
Seaborn makes histograms and density plots even easier through
its distplot method, which can plot both a histogram and a
continuous density estimate simultaneously. As an example,
consider a bimodal distribution consisting of draws from two
different standard normal distributions:
In [96]: comp1 = np.random.normal(0, 1, size=200)
In [97]: comp2 = np.random.normal(10, 2, size=200)
In [98]: values = pd.Series(np.concatenate([comp1, comp2]))
In [99]: sns.distplot(values, bins=100, color='k')
439
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
440
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
441
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
QUESTIONS
1. What is the purpose of data visualization in data science?
2. Name two advantages of using Seaborn over Matplotlib for
statistical plotting.
3. How would you display the first five rows of a DataFrame
named df before plotting?
4. Write a Python command to create a histogram of a
column named age using Pandas.
5. What type of plot would best show the distribution of a
single numerical variable?
6. Which Matplotlib function is used to display a plot?
7. What does the plt.subplot() function do in Matplotlib?
8. Write the code to create a basic line plot using Matplotlib
for lists x and y.
9. How do you change the color and line style in a Matplotlib
plot?
10. Explain what a boxplot shows in a dataset.
11. What is the difference between a bar chart and a
histogram?
12. Give an example of an interactive visualization library in
Python.
13. What argument would you use in Seaborn's sns.scatterplot()
to change the marker color?
14. How do you create a correlation heatmap using Seaborn?
15. What is the role of figsize in plotting with Matplotlib?
16. How would you save a plot as an image file using
Matplotlib?
17. What does plotly.express simplify compared to the standard
plotly.graph_objects?
442
C. Asuai, H. Houssem & M. Ibrahim Plotting and Visualizing
443
444
MODULE 15
TIME SERIES: AN OVERVIEW
In the world of data, time series is where timestamps and data
shake hands. Whether you're tracking stock prices, recording
temperatures, monitoring machine sensors, or following your
fitness steps, you're likely dealing with time series data.
Time series analysis is a critical aspect of data science, particularly
useful in forecasting, trend analysis, and anomaly detection across
domains like finance, weather prediction, healthcare, and
industrial monitoring.
By the end of this module, you should be able to:
• Understand what time series data is and identify its various
forms and frequencies.
• Use Python's datetime, time, and calendar modules for
date and time manipulation.
• Parse, format, and convert date/time strings using both
standard Python and dateutil.
• Work effectively with Pandas' time series tools including
DatetimeIndex, Timestamp, and to datetime().
• Handle missing values and duplicate timestamps in time
series.
• Generate custom date ranges and shift or resample time
series to different frequencies.
• Perform indexing, slicing, arithmetic, and aggregation on
time series data.
445
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
446
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
Why it matters:
• Calculate future/past dates (e.g., "What’s the date 30 days
from now?").
• Measure time intervals (e.g., "How long did this process
take?")
from datetime import datetime
now = datetime.now()
print(now) # Output: 2025-02-27 10:30:45.123456
447
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Why it matters:
• Benchmarking code performance.
• Scheduling tasks at precise intervals.
Using calendar Module
The calendar module handles date-related calculations:
• calendar.isleap(year) – Checks for leap years.
• calendar.weekday(year, month, day) – Returns day of the
week (0=Monday).
import calendar
# Check if a year is a leap year
print(calendar.isleap(2024)) # True
print(calendar.isleap(2023)) # False
# Get weekday (0=Monday, 6=Sunday)
print(calendar.weekday(2024, 7, 15)) # 0 (Monday)
Why it matters:
• Validating dates (e.g., "Is February 29, 2023 valid?").
• Planning weekly/monthly reporting.
Using the timedelta Class
The timedelta class in Python’s datetime module enables date and
time arithmetic, making it easy to compute differences between
dates or shift them forward/backward. For example, you can:
• Add or subtract days, seconds, or microseconds from a
date.
• Calculate durations (e.g., "How many days until the
project deadline?").
• Generate sequences of dates (e.g., "Every 7 days for the
next month").
A timedelta object represents a duration, not an absolute time, and
supports operations like:
448
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
Date Arithmetic
future_date = datetime(2024, 1, 1) + timedelta(days=7) # Adds 1 week
past_date = datetime.now() - timedelta(hours=3) # Subtracts 3 hours
Duration Comparisons
delta1 = timedelta(days=1)
delta2 = timedelta(hours=36)
print(delta2 > delta1) # True (36 hours > 24 hours)
Time Interval Calculations
start = datetime(2024, 1, 1)
end = datetime(2024, 1, 15)
project_duration = end - start # Returns timedelta(days=14)
Component Extraction
delta = timedelta(days=2, hours=5)
print(delta.days) # 2 (total full days)
print(delta.seconds) # 18000 (5 hours in seconds)
Scaling Operations
double_delta = timedelta(days=1) * 2 # 2 days
half_hour = timedelta(hours=1) / 2 # 30 minutes
ts = pd.Series(np.random.randn(6), index=dates)
print(ts)
Indexing and Subsetting
Pandas allows easy selection and slicing of time series data:
print(ts['2011-01-10']) # Selects a specific date
print(ts['2011']) # Selects all data for the year 2011
print(ts['2011-05']) # Selects all data for May 2011
Using datetime objects for slicing:
print(ts[datetime(2011, 1, 7):])
Performing arithmetic operations between time series aligns them
on timestamps:
print(ts + ts[::2])
Pandas Timestamp and DatetimeIndex
Timestamps in Pandas use NumPy’s datetime64 type at
nanosecond resolution:
print(ts.index.dtype) # Output: dtype('<M8[ns]')
Individual values in a DatetimeIndex are Timestamp objects:
stamp = ts.index[0]
print(stamp) # Output: Timestamp('2011-01-02 00:00:00')
By leveraging Python’s built-in and Pandas functionalities,
handling and analyzing time series data becomes seamless across
various applications.
Time Series with Duplicate Indices
In some applications, there may be multiple data observations
falling on a particular timestamp. Here is an example:
In [63]: dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/200
0', '1/3/2000'])
In [64]: dup_ts = pd.Series(np.arange(5), index=dates)
In [65]: dup_ts
Out[65]:
2000-01-01 0
2000-01-02 1
2000-01-02 2
452
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
2000-01-02 3
2000-01-03 4
dtype: int64
We can tell that the index is not unique by checking
its is_unique property:
In [66]: dup_ts.index.is_unique
Out[66]: False
Indexing into this time series will now either produce scalar values
or slices depending on whether a timestamp is duplicated:
In [67]: dup_ts['1/3/2000'] # not duplicated
Out[67]: 4
In [68]: dup_ts['1/2/2000'] # duplicated
Out[68]:
2000-01-02 1
2000-01-02 2
2000-01-02 3
dtype: int64
Suppose you wanted to aggregate the data having non-unique
timestamps. One way to do this is to use groupby and pass level=0:
In [69]: grouped = dup_ts.groupby(level=0)
In [70]: grouped.mean()
Out[70]:
2000-01-01 0
2000-01-02 2
2000-01-03 4
dtype: int64
In [71]: grouped.count()
Out[71]:
2000-01-01 1
2000-01-02 3
2000-01-03 1
dtype: int64
453
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Out[75]:
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
454
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
455
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
The start and end dates define strict boundaries for the generated
date index. For example, if you wanted a date index containing the
last business day of each month, you would pass
the 'BM' frequency (business end of month; see more complete
listing of frequencies in Table 14-3) and only dates falling on or
inside the date interval will be included:
In [78]: pd.date_range('2000-01-01', '2000-12-01', freq='BM')
Out[78]:
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
'2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
'2000-09-29', '2000-10-31', '2000-11-30'],
dtype='datetime64[ns]', freq='BM')
457
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
Sometimes you will have start or end dates with time information
but want to generate a set of timestamps normalized to midnight
as a convention. To do this, there is a normalize option:
In [80]: pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)
Out[80]:
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
'2012-05-06'],
dtype='datetime64[ns]', freq='D')
458
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
459
C. Asuai, H. Houssem & M. Ibrahim Mastering Python for Data Science
460
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
QUESTIONS
1. What function is used in pandas to convert a
column to datetime format?
2. How do you set a datetime column as the index of a
DataFrame?
3. What method is used to fill in missing dates using
the previous available value?
4. How would you select all rows from January 2025
in a time-indexed DataFrame?
5. Which attributes would you use to extract the day
of the week and the month from a datetime index?
461
462
MODULE 16
ADVANCED TIME SERIES ANALYSIS
WITH PYTHON
This module provides a comprehensive introduction to time series
analysis, covering fundamental concepts, preprocessing
techniques, model building, and evaluation. By mastering these
techniques, you will be well-equipped to tackle real-world time
series problems using Python. When readers complete the module,
they will need to be able to:
1. Understand Time Series Fundamentals:
• Describe time series and how it is used in finance,
economics, healthcare, retail and meteorology.
• Point out the main features in time series data:
trend, seasonal patterns, cyclic trends and anything
remaining(residuals).
2. Preprocess Time Series Data:
• Manage missing values in data using interpolation
methods, filling forward or backward or removing
the values.
• Increase or decrease the frequency of your time
series data by resampling..
• Remove noise by applying methods like moving
average and exponential smoothing to your dataset.
• Use Min-Max Scaling or Standardization to scale
time series data before you start modeling.
3. Perform Time Series Decomposition:
• Understand how to use additive and multiplicative
approaches to separate a time series into its trend,
seasonal and remaining parts.
463
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
464
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
INTRODUCTION
Definition of Time Series
A time series consists of a sequence of observations recorded at
specific time intervals, such as daily stock prices, monthly sales
figures, or annual rainfall data. The goal of time series analysis is
to extract meaningful insights and patterns from this data.
Practical Applications
Time series analysis is widely used across various fields:
• Finance: Stock market predictions, risk assessment.
• Economics: GDP forecasting, employment trends.
465
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
466
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
Multiplicative Decomposition
Expresses the series as the product of its components. this is
applied when the seasonal effect scales with the trend (e.g.,
higher trend → larger seasonal swings). It is common in economic
and financial data (e.g., sales growing over time with amplified
seasonal peaks).
Y(t)=Trend(t)×Seasonality(t)×Residual(t)
Example:
If a time series has:
• Trend: 100 units
• Seasonality: 1.2x (20% increase in peak season)
• Residual: 1.05x (5% random fluctuation)
Then:
Y(t)=100×1.2×1.05=126
468
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
plot_acf(series, lags=40)
plot_pacf(series, lags=40)
470
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
register_matplotlib_converters()
471
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
SARIMA Model
SARIMA (Seasonal AutoRegressive Integrated Moving Average) is
an extension of ARIMA that supports seasonality in your time
series. While ARIMA captures trends and patterns, SARIMA adds
the ability to model seasonal effects that repeat over a fixed period
(like months or quarters).
SARIMA is often denoted as: SARIMA(p, d, q)(P, D, Q, s)
p, d, q: non-seasonal ARIMA parameters.
P, D, Q: seasonal components.
s: the length of the season (e.g., 12 for monthly data with yearly
seasonality).
472
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
Exponential Smoothing
Exponential Smoothing is like putting sunglasses on noisy time
series data,it helps you see the real trend without being blinded by
the noise. It gives more weight to recent observations but doesn’t
completely ignore older ones.
There are three main types:
• Simple Exponential Smoothing (SES) – best for data with
no trend or seasonality.
• Holt’s Linear Trend Method – for data with a trend but no
seasonality.
• Holt-Winters Method – for data with trend and
seasonality. That’s our hero today!
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
474
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
475
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
476
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
477
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
GARCH Model
The GARCH Model (Generalized Autoregressive Conditional
Heteroskedasticity) is a powerful statistical tool designed to
analyze and forecast volatility in time series data, particularly in
financial markets. Unlike traditional models that assume constant
variance, GARCH captures the phenomenon of volatility
clustering - where periods of high volatility tend to persist,
followed by periods of low volatility. This makes it exceptionally
useful for risk management, derivative pricing, and portfolio
optimization in finance.
The core GARCH(1,1) model represents conditional variance
through three key components: long-term average volatility (ω),
reaction to recent market shocks (α), and persistence of past
volatility (β). While extremely valuable for volatility forecasting,
GARCH has limitations including its inability to capture
asymmetric effects (where negative shocks impact volatility
differently than positive ones) and computational complexity with
high-frequency data. More advanced variants like EGARCH and
GJR-GARCH address some of these limitations by modeling
asymmetric volatility responses.
from arch import arch_model
import pandas as pd
forecast = results.forecast(horizon=5)
print(forecast.variance.iloc[-1])
This Python implementation demonstrates a typical GARCH
workflow: loading financial returns data, fitting the GARCH(1,1)
model, and generating volatility forecasts. The ARCH package
provides a comprehensive toolkit for GARCH modeling and its
extensions, making it invaluable for financial analysts and
quantitative researchers working with market risk and volatility
prediction.
ADVANCED MODELS
Prophet (Developed by Facebook)
Facebook Prophet is an open-source forecasting tool designed
for business time series with strong seasonality effects. It uses both
curve fitting and custom seasonality modeling to create accurate
forecasts with easy-to-understand parameters.
from prophet import Prophet
import pandas as pd
479
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
forecast = model.predict(future)
# Plot results
fig = model.plot(forecast)
LSTM Networks
LSTM stands for Long Short-Term Memory. It is a special version
of RNN designed to find long-term dependencies in information
that is arranged in sequence. LSTMs address the issue of vanishing
gradients by storing long-term information with the help of
memory cells and gates.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
data = generate_time_series()
X, y = create_dataset(data)
480
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=20, batch_size=32)
MODEL EVALUATION
Assessing the model you develop is necessary in machine learning.
We need to test a trained model to ensure how well it handles new
data. As a result, the model has a greater ability to notice
important features in various types of inputs. The metrics used to
measure a model’s performance change depending on if the
problem is regression, classification or clustering. We will now
speak about metrics related to regression.
Performance Metrics
MAE or Mean Absolute Error, shows the average size of errors
in predictions and ignores their direction. This means the mean of
the absolute values of the differences between actual and predicted
values is called MAE. It simplifies understanding, as it gives you an
average of the errors in the original units. Better performance
results from a low value of MAE in the model..
Formula:
𝑛
1
𝑀𝐴𝐸 = ∑|𝑦𝑖 − 𝑦̂𝑖 |
𝑛
𝑖=1
Where:
𝑦𝑖 is the real value.
𝑦̂𝑖 is the value that is predicted.
481
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
# Sample data
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
# MAE Calculation
mae = mean_absolute_error(y_true, y_pred)
print("Mean Absolute Error:", mae)
# MSE Calculation
mse = mean_squared_error(y_true, y_pred)
print("Mean Squared Error:", mse)
482
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
Formula:
𝑛
1 𝑦𝑖 − 𝑦̂𝑖
𝑀𝐴𝑃𝐸 = ∑ | | × 100
𝑛 𝑦𝑖
𝑖=1
483
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
Example:
import numpy as np
# MAPE Calculation
mape = np.mean(np.abs((np.array(y_true) - np.array(y_pred)) /
np.array(y_true))) * 100
print("Mean Absolute Percentage Error:", mape)
larger and the model is evaluated during the next period. The
model assumes that the amount of available data increases with the
passage of time.
An example of Rolling Window Cross-Validation is shown below:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
# Create a model
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
485
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
You can use this method when you want to replicate a situation
where predictions from the model rely on data from the past and
get updated when more data is obtained.
Example of Walk-Forward Validation:
# Walk-forward validation (Expanding window)
for i in range(1, len(X)):
X_train, X_test = X[:i], X[i:i+1] # Train on all previous points, test on the
next one
y_train, y_test = y[:i], y[i:i+1]
model.fit(X_train, y_train)
prediction = model.predict(X_test)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
486
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
5. TimeSeriesSplit in Scikit-Learn
The TimeSeriesSplit function in Scikit-learn carries out time series
cross-validation. Data are divided into k groups so that the original
order of the observations is maintained. Each time period in the
test set is separated from data included in the training set.. It
ensures that data leakage does not occur by using future
observations to predict past ones.
Example:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
# Perform cross-validation
for train_index, test_index in tscv.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
model.fit(X_train, y_train)
487
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
predictions = model.predict(X_test)
REAL-WORLD APPLICATION
we will explore three real-world applications of time series
forecasting, detailing the models used for each and providing code
implementations for stock price prediction, sales forecasting, and
weather forecasting.
Code Implementation:
ARIMA Model:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
488
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
# Make predictions
predictions = model_fit.forecast(steps=len(test))
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(test, predictions))
print(f"RMSE: {rmse}")
LSTM Model:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
time_step = 60
X, y = create_dataset(stock_data_scaled, time_step)
# Make predictions
predictions = model.predict(X_test)
490
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
Sales Forecasting
Problem Overview: Time series forecasting plays an important
role in sales forecasting. To do all these things well, businesses
have to forecast sales accurately. Seasonal patterns and trends can
be modelled well with Exponential Smoothing and Prophet, so
these are common choices for forecasting sales.
Models Used:
Exponential Smoothing: This approach provides the highest
significance to the latest data and often helps in predicting sales.
Prophet: a Facebook tool used for forecasting results that can
include daily, weekly and yearly changes, holidays and different
events.
Code Implementation:
Exponential Smoothing Model:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Train-Test Split
train_size = int(len(data) * 0.8)
train, test = data[:train_size], data[train_size:]
491
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
# Make predictions
predictions = model_fit.forecast(steps=len(test))
Prophet Model:
import pandas as pd
from fbprophet import Prophet
import matplotlib.pyplot as plt
492
C. Asuai, H. Houssem & M. Ibrahim Time Series: An Overview
Weather Forecasting
Problem Overview: Weather Forecasting involves predicting
what the weather will do by looking back at history. SARIMA
and LSTM models are suitable for finding seasonal and long-term
trends in the data about weather..
Models Used:
SARIMA: Seasonal ARIMA, a modification of the ARIMA model,
is designed to handle seasonal data, which is crucial in weather
forecasting.
LSTM: A deep learning model capable of capturing complex non-
linear relationships in time series data, making it suitable for
weather prediction.
Code Implementation:
SARIMA Model:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Train-Test Split
train_size = int(len(data) * 0.8)
train, test = data[:train_size], data[train_size:]
# Make predictions
predictions = model_fit.forecast(steps=len(test))
time_step = 60
X, y = create_dataset(weather_data_scaled, time_step)
# Make predictions
predictions = model.predict(X_test)
495
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
QUESTIONS
1. You are given a time series dataset with daily temperature data.
What steps would you follow to forecast the temperature for
the next week using LSTM?
2. What is an ARIMA model, and what does each of its
components (p, d, q) represent?
3. What is the purpose of differencing in time series modeling?
4. How does the SARIMA model extend the ARIMA model?
5. Explain the difference between an AR (AutoRegressive)
model and an MA (Moving Average) model in time series
analysis.
6. You have a dataset with monthly sales data, and you are asked
to fit an ARIMA model. Describe the steps you would follow
to build the model.
7. How would you choose the optimal seasonal period (s) for a
SARIMA model?
8. Given a dataset with daily temperature data, how would you
use an ARIMA model to forecast the temperature for the next
30 days?
9. What is the purpose of using a "grid search" in tuning the
parameters of a time series model, such as ARIMA or
SARIMA?
10. You are given a time series with significant seasonality. How
would you incorporate seasonality into your model, and
which model would you use?
496
MODULE 17
MACHINE LEARNING WITH
PYTHON
This module outlines the main features of machine learning which
is a fundamental area of AI that allows systems to learn and
enhance their functions over time, mostly by themselves. You will
learn important ideas and methods for resolving regression,
classification and clustering issues in Python.
After completing this module, you will learn to:
• Understand the basic principles of machine learning.
• Implement regression, classification, and clustering
algorithms in Python.
• Apply machine learning models to real-world datasets.
• Evaluate and interpret model performance.
INTRODUCTION
Machine learning is a branch of artificial intelligence concerned
with making systems capable of understanding data and making
choices or predictions. You have to guide your computer to stop
being lazy and instead start using its own brain. Rather than
explaining everything, you provide it with lots of data and hope it
discovers how to do what you want it to do. It is a creative AI
field in which computers study patterns, guess what might happen
or occasionally choose something surprising.
Machine learning doesn’t rely on being programmed to draw
conclusions from sets of data. In general, it falls into two main
areas: supervised learning and unsupervised learning.
497
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
# Sample data
X = np.array([[1000], [1500], [2000], [2500]]) # Square footage
y = np.array([200000, 250000, 300000, 350000]) # Prices
498
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
model = LinearRegression()
model.fit(X, y)
model = LogisticRegression()
model.fit(X, y)
499
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
model = DecisionTreeClassifier()
model.fit(X, y)
model = SVC()
model.fit(X, y)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
model = MultinomialNB()
model.fit(X, y)
501
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)
Unsupervised Learning
During unsupervised learning, the model is trained using input
data that does not have any labels. When you apply unsupervised
learning, your computer is left to explore a new place all by itself.
This learning method receives unlabeled data, so the teacher
doesn’t define what the outputs are. The objective is to uncover
patterns, groupings or structures that are present in the data.
Common tasks in unsupervised learning include:
• Clustering: Grouping similar data points together.
• Dimensionality Reduction: Reducing the number of
features while preserving important information.
502
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
1. K-Means Clustering
model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(X)
503
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
X = [[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]]
iris = load_iris()
X = iris.data
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
504
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
digits = load_digits()
X = digits.data
y = digits.target
505
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
𝑇𝑃 + 𝑇𝑁
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = ∗ 100
𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁
where:
• TP = True Positives (correctly predicted positives)
• TN = True Negatives (correctly predicted negatives)
• FP = False Positives (incorrectly predicted positives)
• FN = False Negatives (incorrectly predicted negatives)
We will talk about TP, TN, FP and FN shortly in tis module.
506
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
2. Precision
Precision tells you how many of the items your model identified
as positive are actually positive. It’s particularly important in
imbalanced datasets, where false positives are costly.
Formula:
𝑇𝑃
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = ∗ 100
𝑇𝑃 + 𝐹𝑃
Use Case: Useful when the cost of false positives is high (e.g.,
diagnosing a disease).
Example:
from sklearn.metrics import precision_score
507
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
4. F1-Score
F1-score is the harmonic mean of Precision and Recall. It balances
both the precision and recall, and is particularly useful when you
have an uneven class distribution.
Formula:
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 × 𝑅𝑒𝑐𝑎𝑙𝑙
𝐹1 = 2 ×
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙
Example:
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred)
print(f"F1-Score: {f1:.2f}")
# Sample data
y_prob = [0.9, 0.1, 0.8, 0.4, 0.5] # predicted probabilities for class 1
508
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
plt.legend(loc='lower right')
plt.show()
print(f"AUC: {roc_auc:.2f}")
509
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
Formula:
𝑛
1
𝑀𝑆𝐸 = ∑( 𝑦𝑖 − 𝑦̂𝑖 )2
𝑛
𝑖=1
Example:
from sklearn.metrics import mean_squared_error
3. R-squared (R²)
R² measures the proportion of the variance in the target variable
that is explained by the model. If the R² is closer to 1, the model
fits the data better.
Formula:
2
𝑆𝑆𝑟𝑒𝑠 ∑𝑛𝑖=1(𝑦𝑖 − 𝑦̂𝑖 )2
𝑅 = 1− =1− 𝑛
𝑆𝑆𝑡𝑜𝑡 ∑𝑖=1(𝑦𝑖 − 𝑦̅ )2
Where
𝑆𝑆𝑟𝑒𝑠 = Sum of squared residual (errors) ∑𝑛𝑖=1(𝑦𝑖 − 𝑦̂𝑖 )2
𝑆𝑆𝑡𝑜𝑡 = Total sum of squares∑𝑛𝑖=1(𝑦𝑖 − 𝑦̅ )2
𝑦̅ = mean of actual y values.
Example:
from sklearn.metrics import r2_score
r2 = r2_score(y_true, y_pred)
print(f"R-squared: {r2}")
Confusion Matrix:
510
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
# Confusion Matrix
511
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")
512
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
QUESTIONS
1. What is machine learning and how is it different from
traditional programming?
2. What is the difference between supervised and
unsupervised learning? Provide examples.
3. Given this code, identify if it's supervised or unsupervised
and justify your answer:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(X)
4. In supervised learning, what is the purpose of the label in
training data?
5. What does the label y represent in this code?
X = [[50000, 600], [80000, 700]]
y = [0, 1]
6. Which supervised learning algorithm would you use to
predict continuous values?
7. What is the expected output type of this model?
from sklearn.linear_model import LinearRegression
model = LinearRegression()
8. Write a Python script to train a linear regression model on
a dataset.
9. Use K-Means clustering to group the Iris dataset into 3
clusters.
10. What is the difference between linear regression and
logistic regression?
11. Train a linear regression model on the Boston Housing
dataset and evaluate its performance.
12. Use logistic regression to classify the Iris dataset into two
classes (setosa vs non-setosa).
513
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
515
516
MODULE 18
REAL-WORLD DATA SCIENCE
PROJECTS
This module focuses on applying the concepts and techniques
learned in previous modules to real-world data science projects. It
covers the end-to-end process of a data science project and
provides case studies to demonstrate practical applications.
2. Data Collection
Gathering data is important in any data science task, as it supports
the accuracy and relevance of your model. This means you must
collect and record data that is linked to your problem. This means
extracting information from various databases inside the school,
including student information and management information
systems at Nigerian polytechnics.
Other useful data is obtained from education platforms, forums
for scholars and students or from the Nigerian Bureau of Statistics.
Referring to student dropout, data used would be their grades,
their class attendance, how fees are paid and involvement in social
activities. If data is collected correctly, the model will have all the
518
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
import pandas as pd
# Load data from a CSV file
data = pd.read_csv("student_performance.csv")
print(data.head())
3. Data Cleaning
Preparing data for analysis and modeling begins with data
cleaning. In many practical situations, mainly in Nigerian
polytechnics, some student information may be missing,
duplicated or have incorrect forms. When not addressed such
issues can result in faulty training and predictions by the model.
After gathering the data for dropout prediction, you should
inspect for and treat missing academic scores, adjust inconsistent
forms of categorical variables and eliminate records when key
features are completely blank. Cleaning the data correctly is
important for making sure it can be used effectively for machine
learning.
import pandas as pd
import numpy as np
# Sample dataset
data = {
'student_id': [101, 102, 103, 104, 104],
'attendance_rate': [0.95, 0.60, np.nan, 0.80, 0.80],
'gpa': [3.5, 2.1, 2.8, 3.2, 3.2],
'financial_aid': ['Yes', 'No', 'YES', np.nan, 'Yes'],
'dropout': [0, 1, 0, 1, 1]
}
519
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
df = pd.DataFrame(data)
print("Original Data:")
print(df)
print("\nCleaned Data:")
print(df)
4. Model Building
After preparing the data, the next phase is to develop a model. In
this phase, a machine learning system studies the past data so that
it can make predictions about things it has not encountered
before. When building a model, one should pick the features, set
the target variable, divide the data into training and test parts and
decide on the best algorithm considering the problem.
In trying to predict dropouts, our features are attendance rate,
GPA and financial aid status and the target variable is whether a
student left school or stayed (binary classification: 1 = left school,
0 = stayed).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
520
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
# Make predictions
y_pred = model.predict(X_test)
522
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Output
Evaluation Metrics:
Accuracy: 0.75
Precision: 0.67
Recall: 1.00
F1 Score: 0.80
Confusion Matrix:
[[1 1]
[0 1]]
6. Deployment
After finishing training and testing the machine learning model,
the following stage is to make it useful by deploying it. This
means school principals or student support staff can rely on the
523
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
@app.route('/predict', methods=['POST'])
def predict():
data = request.get_json()
# Example input: {"features": [0.85, 1, 2, 0]}
features = np.array(data['features']).reshape(1, -1)
prediction = model.predict(features)
return jsonify({'dropout_risk': int(prediction[0])})
if __name__ == '__main__':
app.run(debug=True)
524
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
3. Using Docker
To containerize your app with the provided Dockerfile, follow
these steps:
Step 1: Create the Dockerfile:
Save the following content into a file named Dockerfile (with no file
extension):
# Use an official Python runtime as a parent image
FROM python:3.10
# Copy the current directory contents (the app) into the container at /app
COPY . /app
# Specify the command to run your app when the container starts
CMD ["python", "app.py"]
app = Flask(__name__)
@app.route('/')
def home():
return "Hello, Dockerized Flask App!"
if __name__ == '__main__':
525
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
app.run(debug=True, host='0.0.0.0')
This is a basic Flask app for demonstration purposes.
526
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
•
Send predictions to counselors for timely
interventions.
Outcome after deployment
Imagine the registrar of a Nigerian polytechnic logs into a portal
and uploads current student data. Behind the scenes, the deployed
model evaluates each student’s risk of dropping out. A report is
generated with a list of at-risk students, complete with risk scores
and suggestions for counseling or financial aid. This automated
system helps improve student retention in a proactive, data-driven
way.
527
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
528
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
# Define model
model = DecisionTreeClassifier()
rfe = RFE(model, n_features_to_select=3)
529
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
final_scores = scores.sum(axis=1)
selected_features = np.argsort(final_scores)[-3:] # Top 3 features
531
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
(𝑂𝑖 −∈𝑖 )
𝑥2 = ∑ −−−−−−−−−− 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 1
∈𝑖
Where
𝑂𝑖 = Observed frequency of feature-class co-occurrence
∈𝑖 = Expected frequency (assuming no association between
feature and class)
Sum (∑) = Calculated across all categories of the feature and
class
Entropy Calculation:
Measures impurity/disorder in the target variable ∁:
𝐻(∁) = − ∑ 𝑃(∁𝑖 ) log 2 𝑃(∁𝑖 ) − − − − − − − −
𝑖
−− 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 2
532
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
Conditional Entropy:
Computes entropy after splitting data by feature f:
𝐻 (∁⁄𝑓) = − ∑ 𝑃(𝑓𝑖 ) ∑ 𝑃(𝐶𝑖 ⁄𝑓𝑗 ) log 2 𝑃(𝐶𝑖 ⁄𝑓𝑗 ) − − − − − −
𝑗 𝑖
−−−− 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 3
IG Formula:
Difference between original entropy and post-split
entropy:
𝐼𝐺(𝑓) = 𝐻(∁) − 𝐻(𝐶⁄𝐹 ) − − − − − − − − − − 𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 4
4. Iff Conditions
The three different aspects of aggregating features are what define
the 3ConFA framework. For a feature to be included, it has to
meet three different requirements (requirements for aggregation).
533
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
534
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
535
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
Output:
- Optimal feature subset X
- Model performance metrics
Procedure:
1. Initialization:
- X ← ∅ (empty set for final selected features)
- S₁, S₂, S₃ ← ∅ (temporary sets for each method's results)
2. Mutual Information Filtering:
- For each feature f in D:
- Calculate MI score: MI(fᵢ) = I(fᵢ; target)
- Compute mean MI score: h₁ = (∑ MI(fᵢ))/n
- S₁ ← {fᵢ | MI(fᵢ) ≥ α·h₁} (features above threshold)
3. Chi-Squared Filtering:
- For each feature fᵢ in D:
(𝑂𝑖 −∈𝑖 )
- Calculate χ² score: 𝑥 2 (fᵢ) = ∑
∈𝑖
- Compute mean χ² score: h₂ = (∑ χ²(fᵢ))/n
- S₂ ← {fᵢ | χ²(fᵢ) ≥ β·h₂} (features above threshold)
4. Recursive Feature Elimination:
a. Initialize: F ← all features, model ← base estimator
b. Repeat until stopping condition met:
i. Train model on current feature set F
ii. Get importance scores for all f ∈ F
iii. Rank features by importance
iv. Eliminate bottom k features (e.g., k=1)
v. Evaluate model performance
c. S₃ ← optimal feature subset from RFE
5. Feature Aggregation:
- For each feature f in D:
- if (f ∈ S₁ AND f ∈ S₂ AND f ∈ S₃):
- X ← X ∪ {f}
536
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
537
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
CASE STUDIES
We will provide real-world case studies to demonstrate the
application of data science techniques.
538
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
# Train a model
model = LinearRegression()
model.fit(X, y)
# Make predictions
y_pred = model.predict(X)
539
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
# Sample data
X = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]
# Train a model
model = KMeans(n_clusters=2)
model.fit(X)
540
C. Asuai, H. Houssem & M. Ibrahim Machine Learning With Python
# Sample data
reviews = ["I love this product!", "This is the worst product ever.", "It's okay."]
labels = [1, 0, 0] # 1 = Positive, 0 = Negative
# Train a model
model = LogisticRegression()
model.fit(X, labels)
# Make predictions
y_pred = model.predict(X)
541
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
QUESTIONS
1. Define a problem statement for a data science project and
identify the key success metrics.
2. Collect and clean a dataset for your project.
3. Build and evaluate a machine learning model using the
cleaned dataset.
4. What are the key challenges in deploying a machine
learning model?
5. Perform predictive analytics on a dataset of your choice
and evaluate the model.
6. Use K-Means clustering to segment customers in a retail
dataset.
7. Build a sentiment analysis model using customer reviews
and evaluate its performance.
8. What are the key challenges in sentiment analysis?
9. Write a Python script to visualize the results of a clustering
algorithm.
542
MODULE 19
HANDS-ON PROJECTS FOR DATA
SCIENCE FUNDAMENTALS AND
BEYOND
This module presents a set of projects that help you practice
different types of data science. These assignments help students
practice by working on real-life issues, keep learning key points
and sharpen their skills in data, visualization, machine learning,
deep learning and natural language processing with Python and
many useful libraries.
At the beginning, the module includes simpler projects, like EDA,
Regression and Sentiment Analysis using TextBlob. With these,
you can start working on data in data science by loading data,
preprocessing it, using basic models, visualizing the outcomes and
understanding what has happened.
As you move forward, the projects involve more complicated
work on topics such as predicting timelines, classifying images and
text using deep learning, offering recommendations, detecting
anomalies and making interactive dashboards. For example,
predictive maintenance, fraud detection and market basket
analysis represent data science in many industries.
All projects have the following things included in them:
• A clear objective
• Detailed requirements
• Practical use of libraries such as Pandas, Scikit-learn,
Matplotlib, Seaborn, TextBlob, XGBoost, TensorFlow,
Keras, NLTK, Plotly Dash, and Streamlit
543
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
544
C. Asuai, H. Houssem & M. Ibrahim Hands-On Projects For Data Science
PROJECTS
Project 1: Perform EDA on a Simple Dataset
Objective: Perform simple analysis on a basic dataset (such as the
Iris or Titanic dataset) as part of the objective.
Requirements:
• Load the dataset using Pandas.
• Display basic statistics (mean, median, mode).
• Visualize data using histograms, bar plots, and box plots.
• Interpret the distributions and relationships in the data.
545
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
546
C. Asuai, H. Houssem & M. Ibrahim Hands-On Projects For Data Science
547
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
548
C. Asuai, H. Houssem & M. Ibrahim Hands-On Projects For Data Science
549
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
550
C. Asuai, H. Houssem & M. Ibrahim Hands-On Projects For Data Science
551
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
552
C. Asuai, H. Houssem & M. Ibrahim Hands-On Projects For Data Science
553
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
554
C. Asuai, H. Houssem & M. Ibrahim Hands-On Projects For Data Science
555
C. Asuai, H. Houssem & M. Ibrahim Advanced Time Series Analysis With Python
REFRENCES
Morgan, P. (2016). Data analysis from scratch with Python: Step-by-
step guide. AI Sciences LLC.
ISBN-13: 978-1721942817
ISBN-10: 1721942815
556