Data Science with Python
A Beginner to Advanced Guide
By Skill Foundry
💡 Brief Definition:
Data Science is an interdisciplinary field focused on analyzing, modeling, and interpreting data
to support decision-making and automation.
Whether you’re tracking customer behavior on an e-commerce site, analyzing patient records in
healthcare, or training self-driving cars, you’re witnessing the power of Data Science.
Missing any one of these weakens the overall process. For example, a model might perform well
technically but be useless if it doesn't solve a business problem.
🏥 Healthcare
💳 Finance
🚗 Transportation
Predict demand for ride-sharing (e.g., Uber surge pricing)
Optimize delivery routes using traffic and weather data
Improve logistics efficiency and fuel savings
📱 Social Media
👥 Common Roles:
Role Description
Business Analyst: Bridges the gap between data and business decisions
Each role collaborates with others to complete the data science puzzle.
🛠 Technical Skills
🧠 Soft Skills
💡 Tip: You don’t need to master everything at once. Build gradually, layer by layer.
👨💻 Programming Languages
Data science work typically relies on tools such as Python, R, and Hadoop, while traditional statistical analysis more often uses SPSS, SAS, and R.
⚠️ Ethical Challenges:
✅ Best Practice: Always ask — “Could this model harm someone?” and “Would I be okay if my data
were used this way?”
🔧 Technical
🧠 Conceptual
🚫 Organizational
📉 Case Example:
Imagine a bank training a model on biased loan data. Even if the model is 95% accurate, it may
reject many eligible applicants simply because past data reflected systemic bias.
📈 Trends:
1.14 Summary
By now, you should understand:
What Data Science is (and isn’t)
Its applications and tools
Key roles in the data science ecosystem
The workflow followed in most projects
Skills, challenges, and ethical considerations
This chapter has set the stage for what’s to come. From Chapter 2 onward, we’ll begin coding,
cleaning, and exploring real datasets.
2.1 Introduction
Before we dive into coding or analysis, we must properly set up our Python environment for
Data Science. Think of this like preparing your lab before running experiments — without the
right tools and a clean workspace, you can’t perform high-quality work.
In this chapter, we'll guide you through:
Installing Python
Using Anaconda and virtual environments
Managing packages with pip and conda
Working in Jupyter Notebooks and VS Code
Organizing your data science projects for real-world scalability
💡 Brief Definition:
Python is a general-purpose programming language often used in data science for data
manipulation, statistical modeling, and machine learning, due to its clean syntax and robust
ecosystem.
🧠 Installing Python
There are two common ways to install Python: the official installer from python.org, or the Anaconda distribution, which bundles Python with most of the data science packages used in this book.
Verify the installation by opening a terminal or command prompt and typing:
python --version
🖋️ Jupyter Notebooks
Jupyter is the most widely used tool in data science because it lets you write code and see
outputs inline, along with text, math, and charts. It is an open-source web-based environment
where you can write and execute Python code alongside rich text and visualizations.
Why it’s great:
Interactive
Code + commentary (Markdown)
Easy to visualize data
Exportable as HTML, PDF, etc.
To launch it:
jupyter notebook
Use this when doing EDA (Exploratory Data Analysis) or developing models step by step.
While Jupyter is great for analysis, VS Code is better for organizing larger projects and
production-ready scripts.
VS Code Features:
Lightweight but powerful
Git integration
Extensions for Jupyter, Python, Docker
Great for version-controlled data science workflows
Install the Python extension in VS Code for best performance.
💡 Brief Definition
A virtual environment is an isolated Python environment that allows you to install specific
packages and dependencies without affecting your global Python setup or other projects.
For example, pip does not manage environments on its own, while conda handles both packages and environments.
Best practice: Use conda when using Anaconda. Use pip when outside Anaconda or when
conda doesn't support a package.
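A typical project layout (the folder names here are only a suggestion) might look like this:
project_name/
    data/              # raw and processed datasets
    notebooks/         # Jupyter notebooks for exploration
    src/               # reusable Python modules and scripts
    outputs/           # figures, reports, saved models
    requirements.txt   # pinned dependencies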
✅ This structure separates raw data, notebooks, source code, and outputs.
NumPy (short for Numerical Python) is a fundamental package for scientific computing. It offers
powerful N-dimensional array objects and broadcasting operations for fast numerical processing.
🔍 Why NumPy?
🔨 Basic Usage
import numpy as np
a = np.array([1, 2, 3])
b = np.array([[1, 2], [3, 4]])
🧠 Use Cases
Creating arrays/matrices
Vectorized operations
Linear algebra in ML algorithms
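Building on the arrays a and b defined above, a few operations illustrate NumPy's vectorized style (expected results shown in the comments):
a * 2          # array([2, 4, 6]) - element-wise multiplication
a + a          # array([2, 4, 6]) - element-wise addition
b.T            # transpose of the 2x2 matrix
np.dot(b, b)   # matrix multiplication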
Pandas is a fast, powerful, flexible library for data analysis and manipulation, built on top of
NumPy.
🔍 Why Pandas?
🔨 Basic Usage
import pandas as pd
df = pd.read_csv("sales.csv")
print(df.head()) # First 5 rows
print(df.describe()) # Summary statistics
🧠 Use Cases
Matplotlib is a comprehensive library for creating static, animated, and interactive plots in
Python.
🔍 Why Matplotlib?
🔨 Basic Usage
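A minimal example, plotting a short list of values as a line chart:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
plt.plot(x, y)
plt.title("Simple Line Chart")
plt.xlabel("x")
plt.ylabel("y")
plt.show()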
🧠 Use Cases
Line charts, histograms, scatter plots
Visualizing trends and distributions
Custom charts for reports
🔍 Why Seaborn?
🔨 Basic Usage
import seaborn as sns

df = pd.read_csv("iris.csv")
sns.pairplot(df, hue="species")
plt.show()
🧠 Use Cases
Scikit-learn is the most widely used library for building machine learning models in Python. It
includes tools for classification, regression, clustering, and model evaluation.
🔍 Why Scikit-learn?
Easy-to-use interface
Well-documented
Robust set of ML algorithms and utilities
🔨 Basic Usage
X = df[['feature1', 'feature2']]
y = df['target']
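Continuing the snippet above, a typical next step is to split the data and fit a simple model (a minimal sketch; the column names are the placeholders used above):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the held-out data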
🧠 Use Cases
🧠 Examples
💡 Brief Definition
Git is a distributed version control system that tracks changes to your code and lets you
collaborate with others using repositories.
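A minimal day-to-day workflow looks like this:
git init                          # start tracking the current folder
git add .                         # stage all changes
git commit -m "Initial commit"    # record a snapshot
git status                        # inspect pending changes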
💡 Brief Definition:
Docker packages your environment, code, and dependencies into a container — a self-contained
unit that runs the same anywhere.
Example Dockerfile:
FROM python:3.11
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
CMD ["python", "main.py"]
🎯 Scenario
You are starting a project to analyze customer churn using machine learning. You want to:
Structure your project folders cleanly
Use a virtual environment
Install necessary libraries
Track everything with Git
Work in Jupyter notebooks
🧠 Step-by-Step Setup
Step 1: Create Your Project Folder
mkdir customer_churn_analysis
cd customer_churn_analysis
Step 2: Set Up a Virtual Environment
python -m venv env
source env/bin/activate    # macOS/Linux; on Windows use .\env\Scripts\activate
Step 3: Install Your Libraries
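A typical starting set for a churn analysis can be installed and pinned like this (adjust the list to your needs):
pip install pandas numpy matplotlib seaborn scikit-learn jupyter
pip freeze > requirements.txt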
📝 Chapter 2 Summary
This chapter lays the programming foundation for everything that follows. Even if you're
somewhat familiar with Python, revisiting its core concepts from a data science perspective will
help you write cleaner, faster, and more scalable code.
3.2.4 Operators
Arithmetic: +, -, *, /, //, %, **
Comparison: ==, !=, >, <, >=, <=
Logical: and, or, not
x = 15   # example value

if x > 10:
    print("Greater than 10")
elif x == 10:
    print("Equal to 10")
else:
    print("Less than 10")
3.3.2 Loops
For Loop
def greet(name): return f"Hello, {name}!"   # helper assumed from the functions section

for i in range(5):
    print(greet("Lukka"))
3.5.1 Lists
A list is an ordered, mutable collection of elements. Elements can be of any data type.
fruits = ["apple", "banana", "cherry"]
Key Operations:
fruits[0] # Access
fruits.append("kiwi") # Add
fruits.remove("banana") # Remove
fruits.sort() # Sort
len(fruits) # Length
3.5.2 Tuples
A tuple is like a list, but immutable (cannot be changed after creation).
dimensions = (1920, 1080)
Used when you want to protect the integrity of data (e.g., coordinates, feature shapes in ML).
3.5.3 Dictionaries
Dictionaries store data in key-value pairs — extremely useful in data science for mapping,
grouping, or storing metadata.
person = {
    "name": "Lukka",
    "age": 25
}
print(person["name"])   # Access values by key
3.5.4 Sets
A set is an unordered collection of unique elements.
unique_tags = set(["data", "science", "data", "python"])
print(unique_tags) # {'python', 'data', 'science'}
Useful for:
Removing duplicates
Set operations: union, intersection, difference
set1 = {1, 2, 3}
set2 = {3, 4, 5}
set1 & set2 # Intersection → {3}
set1 | set2 # Union → {1, 2, 3, 4, 5}
set1 - set2 # Difference → {1, 2}
# Reading
with open("notes.txt", "r") as f:
content = f.read()
print(content)
# Writing CSV
import csv

with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Age"])
    writer.writerow(["Lukka", 25])
# Reading CSV
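# A minimal sketch: read back the rows written above
with open("data.csv", "r") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)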
# Write JSON
import json

data = {"name": "Lukka", "age": 25}   # example data to serialize
with open("data.json", "w") as f:
    json.dump(data, f)
# Read JSON
with open("data.json", "r") as f:
result = json.load(f)
print(result)
import random
print(random.choice(["a", "b", "c"]))
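The usage below assumes a small module file named utils.py in the same folder; a minimal sketch of its contents:
# utils.py
def is_even(n):
    return n % 2 == 0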
Usage:
import utils
print(utils.is_even(10)) # True
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def greet(self):
        return f"Hello, I'm {self.name} and I'm {self.age} years old."

# Creating an object
p1 = Person("Lukka", 25)
print(p1.greet())
__init__() is the constructor.
self refers to the instance of the class.
3.12.3 Inheritance
You can extend classes to reuse or customize functionality:
class DataScientist(Person):
def __init__(self, name, age, skill):
super().__init__(name, age)
self.skill = skill
def greet(self):
return f"Hi, I'm {self.name}, and I specialize in
{self.skill}."
3.13.1 Iterators
Anything that can be looped over using for is iterable.
nums = [1, 2, 3]
it = iter(nums)
print(next(it)) # 1
print(next(it)) # 2
You can create custom iterators with classes by defining __iter__() and __next__().
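A minimal custom iterator, for illustration, might look like this:
class Countdown:
    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

for n in Countdown(3):
    print(n)   # 3, 2, 1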
3.13.2 Generators
Generators simplify writing iterators using yield.
def countdown(n):
while n > 0:
yield n
n -= 1
for x in countdown(5):
print(x)
Advantages:
Memory-efficient (doesn’t store all values in memory)
Ideal for large file streaming, web scraping, etc.
df = pd.read_csv("sales.csv")
df.columns = [col.strip().lower().replace(" ", "_") for col in
df.columns]
df.dropna(inplace=True)
df.to_csv("cleaned_sales.csv", index=False)
files = os.listdir("data/")
combined = pd.DataFrame()
combined.to_csv("merged.csv", index=False)
now = datetime.now()
print(now) # Current time
# Activate
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
Install dependencies inside the environment:
pip install pandas numpy
Freeze current environment:
pip freeze > requirements.txt
Install from a file:
pip install -r requirements.txt
sns.histplot(df["sales"])
plt.show()
Exercise 3: Comprehensions
Use a list comprehension to create a list of squares for numbers divisible by 3 from 0 to 30.
Data cleaning is a critical step in the data science pipeline, often consuming a large portion of a
data scientist’s time. Raw datasets collected from real-world sources are seldom ready for
analysis. They typically contain missing values, incorrect data types, duplicates, inconsistencies,
and other anomalies that can severely hinder accurate analysis or model performance. Effective
data cleaning ensures the integrity of the dataset, allowing the insights drawn from it to be valid
and the models built upon it to be robust.
Neglecting this stage can result in incorrect conclusions, poor model generalization, and wasted
effort downstream. Therefore, data cleaning and preprocessing are not optional steps but
essential phases in any data-driven project.
df = pd.read_csv('customer_data.csv')
print(df.isnull().sum())
To determine whether the dataset contains any missing values at all:
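For a quick yes/no answer, one line is enough:
print(df.isnull().values.any())   # True if any value in the DataFrame is missing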
Handling Outliers
Outliers are values that fall far outside the normal range of a dataset. While some outliers may
reflect real, meaningful variation, others may result from data entry errors, equipment
malfunctions, or system bugs. In either case, they need to be investigated and treated
appropriately.
Identifying Outliers
Outliers can be visualized using plots:
Box Plot:
import matplotlib.pyplot as plt
df.boxplot(column='AnnualIncome')
plt.show()
Histogram:
df['AnnualIncome'].hist(bins=50)
plt.show()
Z-Score Method:
Outliers can be statistically detected using the Z-score, which measures how many standard
deviations a value is from the mean.
from scipy.stats import zscore
df['zscore'] = zscore(df['AnnualIncome'])
outliers = df[(df['zscore'] > 3) | (df['zscore'] < -3)]
IQR (Interquartile Range) Method:
This is a robust method commonly used in practice.
Q1 = df['AnnualIncome'].quantile(0.25)
Q3 = df['AnnualIncome'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['AnnualIncome'] < Q1 - 1.5 * IQR) |
(df['AnnualIncome'] > Q3 + 1.5 * IQR)]
Label Encoding
Suitable for ordinal variables where the categories have an inherent order:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['EducationLevel'] = le.fit_transform(df['EducationLevel'])   # 'EducationLevel' is a hypothetical ordinal column
One-Hot Encoding
Ideal for nominal variables (no inherent order), this creates binary columns for each category:
df = pd.get_dummies(df, columns=['Gender', 'Region'])
Care should be taken to avoid the dummy variable trap in linear models, where one dummy
column can be linearly predicted from the others. This is usually handled by dropping one
dummy column:
df = pd.get_dummies(df, columns=['Region'], drop_first=True)
For large cardinality categorical variables (e.g., thousands of product IDs), dimensionality
reduction or embedding techniques are often considered instead.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
Each value becomes z = (x − μ) / σ, i.e. the number of standard deviations the original value lies from the column mean.
Min-Max Normalization
This scales features to a fixed range—commonly [0, 1].
from sklearn.preprocessing import MinMaxScaler
df[['Age', 'Income']] = MinMaxScaler().fit_transform(df[['Age', 'Income']])
Robust Scaling
This method uses the median and interquartile range, making it resilient to outliers.
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age',
'Income']])
The choice of scaling technique depends on the distribution of data and the specific machine
learning algorithm being used. It is important to apply the same transformation to both the
training and testing sets to maintain consistency.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # get_dummies is not an sklearn transformer; use OneHotEncoder inside pipelines
])
Final Thoughts
Data cleaning and preprocessing serve as the bedrock of any data science process. While often
overlooked in favor of more glamorous modeling techniques, it is this stage that determines the
upper bound of a model’s performance. No matter how advanced the algorithm, its output is
only as reliable as the input data it receives.
Effective preprocessing requires not only technical skill but also domain knowledge, attention to
detail, and rigorous validation. As datasets grow larger and more complex, the ability to engineer
clean, structured, and meaningful data becomes an increasingly valuable asset in a data scientist’s
toolkit.
Exercises
1. Detect and Handle Missing Values:
o Load a dataset of your choice.
o Identify missing values.
o Apply at least two imputation techniques and compare the results.
2. Data Type Correction:
o Convert date strings to datetime format.
o Change categorical columns to proper data types.
o Convert numerical columns stored as strings to floats or integers.
3. Remove Duplicates:
o Introduce artificial duplicates to a dataset.
o Use pandas functions to detect and remove them.
o Validate that only duplicates were removed.
4. Outlier Detection:
o Visualize numeric columns using boxplots and histograms.
Exploratory Data Analysis (EDA) is the process of systematically examining and understanding
the characteristics of a dataset before any formal modeling or hypothesis testing is performed.
The goal is to discover patterns, spot anomalies, test assumptions, and develop an understanding
of the underlying structure of the data. EDA combines both numerical summaries and visual
techniques to provide a comprehensive view of the dataset.
A well-executed EDA lays the groundwork for meaningful analysis, helping to guide decisions
about feature engineering, data cleaning, and modeling strategies.
Objectives of EDA
Before diving into methods and techniques, it is important to understand what EDA seeks to
accomplish:
Uncover patterns and trends in the data.
Understand distributions of variables.
Identify missing values and outliers.
Explore relationships between variables.
Verify assumptions required for statistical models.
Generate hypotheses for further investigation.
EDA does not follow a rigid structure—it is often iterative and guided by the nature of the
dataset and the goals of the analysis.
df = pd.read_csv('sales_data.csv')
# View dimensions
print(df.shape)
Univariate Analysis
Univariate analysis examines a single variable at a time. This includes analyzing distributions,
central tendencies (mean, median, mode), and dispersion (standard deviation, variance, range).
# Histogram
df['Revenue'].hist(bins=30)
plt.title('Revenue Distribution')
plt.xlabel('Revenue')
plt.ylabel('Frequency')
plt.show()
# Box plot
df.boxplot(column='Revenue')
plt.title('Box Plot of Revenue')
plt.show()
These plots reveal skewness, modality (unimodal, bimodal), and potential outliers.
Analyzing Categorical Features
For categorical variables, frequency counts and bar charts are informative.
# Frequency table
print(df['Region'].value_counts())
# Bar chart
df['Region'].value_counts().plot(kind='bar')
plt.title('Number of Records by Region')
plt.xlabel('Region')
plt.ylabel('Count')
plt.show()
This helps assess the balance of category representation and identify dominant or rare categories.
# Correlation
print(df[['AdvertisingSpend', 'Revenue']].corr())
Categorical vs. Numerical
Box plots and group-wise aggregations are useful when analyzing the effect of a categorical
variable on a numerical variable.
# Box plot
df.boxplot(column='Revenue', by='Region')
plt.title('Revenue by Region')
plt.suptitle('') # Remove automatic title
plt.show()
# Grouped statistics
print(df.groupby('Region')['Revenue'].mean())
Categorical vs. Categorical
Crosstabs and stacked bar charts can show relationships between two categorical variables.
# Crosstab
pd.crosstab(df['Region'], df['MembershipLevel'])
Pair Plots
Pair plots allow for the simultaneous visualization of relationships between multiple numerical
variables.
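A minimal sketch (assuming df holds the numeric columns of interest):
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df.select_dtypes(include='number'))
plt.show()
Correlation Heatmap
A heatmap condenses the full correlation matrix into a single view. The snippet below assumes the matrix has already been computed:
corr_matrix = df.corr(numeric_only=True)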
import seaborn as sns
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm',
linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
This is useful for detecting multicollinearity, identifying redundant features, and guiding feature
selection.
Grouped Aggregations
Grouping data by one or more categorical variables and then analyzing numerical trends helps in
understanding how different segments behave.
# Average revenue by gender and membership level
grouped = df.groupby(['Gender',
'MembershipLevel'])['Revenue'].mean()
print(grouped)
You can also visualize such groupings using grouped bar plots or facet grids.
# Plot
monthly_revenue.plot()
plt.title('Monthly Revenue Trend')
plt.xlabel('Month')
plt.ylabel('Total Revenue')
plt.show()
Time-based EDA helps reveal seasonality, trends, and cycles that may impact forecasting and
decision-making.
Detecting Skewness
print(df['Revenue'].skew())
A skew of 0 indicates a symmetric distribution.
A positive skew means the tail is on the right.
A negative skew means the tail is on the left.
# Log transformation
df['Revenue_log'] = np.log1p(df['Revenue'])
Visual Detection
Box plots are a simple and effective way to visually detect outliers.
# Box plot of revenue
df.boxplot(column='Revenue')
plt.title('Revenue Box Plot')
plt.show()
Z-Score Method
The Z-score represents how many standard deviations a value is from the mean.
from scipy import stats
import numpy as np
z_scores = np.abs(stats.zscore(df['Revenue']))
df_outliers = df[z_scores > 3]
print(df_outliers)
Typically, a Z-score greater than 3 is considered an outlier in a normal distribution.
IQR Method
Interquartile Range (IQR) is the range between the 25th and 75th percentiles.
Handling Outliers
Options for handling outliers depend on the context:
Remove: If they result from data entry errors.
Cap or Floor (Winsorizing): Set to percentile thresholds.
Transform: Apply log or Box-Cox transformations to reduce their impact.
Separate Models: Train different models for normal and anomalous data, if appropriate.
Feature Engineering is the art and science of transforming raw data into meaningful features that
enhance the predictive performance of machine learning models. While algorithms often receive
significant attention, the quality and relevance of features often determine the success of a
model.
In real-world scenarios, raw data is rarely in a format directly usable by models. It may contain
irrelevant fields, inconsistent formats, or hidden information that must be extracted. Feature
engineering bridges the gap between raw data and usable input for algorithms.
What is a Feature?
A feature (also called an attribute or variable) is an individual measurable property or
characteristic of a phenomenon being observed. In the context of supervised learning:
Input features are the independent variables used to predict an outcome.
Target feature (or label) is the dependent variable or output we aim to predict.
The process of identifying, constructing, transforming, and selecting features is collectively
known as feature engineering.
1. Feature Creation
Creating new features from existing data can often reveal patterns and relationships that raw data
does not explicitly present.
a. Mathematical Transformations
Applying arithmetic operations can uncover meaningful ratios, differences, or composite metrics.
df['RevenuePerVisit'] = df['Revenue'] / df['NumVisits']
df['AgeDifference'] = df['Age'] - df['Tenure']
b. Text Extraction
Extract information from strings such as domain names, keywords, or substrings.
df['EmailDomain'] = df['Email'].str.split('@').str[1]
c. Date-Time Decomposition
Decompose timestamps into components like day, month, year, hour, or weekday.
df['OrderDate'] = pd.to_datetime(df['OrderDate'])
df['OrderMonth'] = df['OrderDate'].dt.month
df['OrderWeekday'] = df['OrderDate'].dt.day_name()
This allows the model to learn temporal patterns like seasonality, holidays, or business cycles.
2. Feature Transformation
Transforming variables improves their distribution, removes skewness, or stabilizes variance.
a. Log Transformation
Useful when dealing with positively skewed data (e.g., sales, income):
df['LogRevenue'] = np.log1p(df['Revenue'])
c. Standardization
Centers data around the mean with a standard deviation of 1. Beneficial for algorithms assuming
Gaussian distribution:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['StandardizedAge']] = scaler.fit_transform(df[['Age']])
b. Label Encoding
Assigns a unique integer to each category:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['GenderEncoded'] = le.fit_transform(df['Gender'])
Use cautiously: label encoding may impose an ordinal relationship where none exists.
c. Frequency Encoding
Replaces each category with its frequency in the dataset:
freq_encoding = df['ProductCategory'].value_counts().to_dict()
df['ProductCategoryFreq'] = df['ProductCategory'].map(freq_encoding)
This helps retain cardinality while incorporating category importance.
b. Equal-Frequency Binning
Each bin contains approximately the same number of observations:
df['QuantileTenure'] = pd.qcut(df['Tenure'], q=4)
c. Custom Binning
Apply domain-specific knowledge to define meaningful thresholds:
bins = [0, 18, 35, 60, 100]
labels = ['Teen', 'Young Adult', 'Adult', 'Senior']
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels)
5. Interaction Features
Creating interaction terms captures the combined effect of two or more features.
a. Polynomial Interactions
Generate products or ratios between features:
df['Income_Tenure'] = df['Income'] * df['Tenure']
df['IncomePerTenure'] = df['Income'] / (df['Tenure'] + 1)
These are especially useful in models that do not automatically account for interactions (e.g.,
linear regression).
b. Concatenated Categorical Features
Combine categories to form new compound features:
df['Region_Product'] = df['Region'] + "_" + df['ProductCategory']
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df[['Feature1',
'Feature2', 'Feature3']])
df['PC1'] = principal_components[:, 0]
df['PC2'] = principal_components[:, 1]
PCA features are especially helpful when input features are highly correlated.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
embedding = tsne.fit_transform(df[numerical_columns])
df['TSNE1'] = embedding[:, 0]
df['TSNE2'] = embedding[:, 1]
importances = pd.Series(model.feature_importances_,
index=X.columns)
importances.sort_values().plot(kind='barh')
15. Recursive Feature Elimination (RFE)
Selects features by recursively training a model and removing the least important ones:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
rfe = RFE(estimator=LogisticRegression(),
n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
Summary
Feature engineering is a cornerstone of effective data science. While machine learning algorithms
provide the machinery to discover patterns, it is well-crafted features that feed them meaningful,
structured signals.
Key takeaways from this chapter:
Good features > complex models: Thoughtfully engineered features often outperform
more complex algorithms applied to raw data.
Feature creation includes mathematical combinations, time decomposition, and
domain-specific metrics.
Transformations (e.g., log, standardization) correct skewness, stabilize variance, and
bring comparability across features.
Categorical encoding techniques such as one-hot, label, and target encoding are critical
for handling non-numeric data.
Binning can simplify models, aid interpretability, and capture non-linear patterns.
Advanced strategies such as interaction terms, missingness indicators, and
dimensionality reduction can capture hidden structure.
Cyclical variables (time-based features) must be encoded in ways that respect their
periodic nature.
Feature selection reduces noise, improves interpretability, and often boosts
performance.
Automation tools can rapidly generate useful features but should be guided by domain
understanding.
Ultimately, the feature engineering process is iterative, blending technical skill, statistical
intuition, and domain knowledge. Mastery of feature engineering will make you not only a
better modeler but a more effective problem solver.
Exploratory Data Analysis (EDA) is the practice of analyzing datasets to summarize their main
characteristics, often using visual methods. It is one of the most critical phases in any data
science project. EDA helps uncover patterns, detect anomalies, test hypotheses, and check
assumptions with the help of both statistics and graphical representations.
In many ways, EDA is the bridge between raw data and modeling. Before we can apply
algorithms, we must understand the data we’re working with. This chapter delves into the
principles, techniques, and best practices of EDA in Python, along with examples, tools, and
practical guidance for real-world data.
Univariate Analysis
Bivariate Analysis
Bivariate analysis involves exploring the relationship between two variables. It can help
determine associations, trends, and possible predictive relationships.
Numerical vs. Numerical:
Scatter plots are a common tool for visualizing the relationship between two numeric variables.
sns.scatterplot(x='Height', y='Weight', data=df)
This reveals correlation, linearity, or clusters in the data.
The correlation matrix quantifies the linear relationships:
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
This helps identify highly correlated features, which may lead to multicollinearity in modeling.
Multivariate Analysis
Multivariate visualizations help uncover complex interactions between three or more variables.
Pair Plots:
Seaborn’s pairplot offers a grid of scatterplots for all numerical variable pairs.
sns.pairplot(df, hue='Target')
You can identify clusters, correlations, and class separability.
Facet Grids:
Faceting allows visualizing how relationships change across different subsets.
g = sns.FacetGrid(df, col='Gender', row='Region')
g.map_dataframe(sns.scatterplot, x='Age', y='SpendingScore')
This helps detect segment-specific patterns or stratified relationships.
Colored Scatter Plots:
Use hue or size to add another dimension:
sns.scatterplot(data=df, x='Income', y='SpendingScore',
hue='Gender', size='Age')
This adds richness to visualizations and can reveal trends missed in 2D views.
from scipy.stats import skew, kurtosis
print(skew(df['Income']))
print(kurtosis(df['Income']))
Outlier Detection
Outliers can distort statistics and models. EDA helps identify and decide how to handle them.
Boxplots:
Boxplots are a fast way to spot univariate outliers.
Z-score Method:
from scipy.stats import zscore
z_scores = np.abs(zscore(df['Income']))
df[z_scores > 3]
IQR Method:
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Income'] < Q1 - 1.5 * IQR) | (df['Income']
> Q3 + 1.5 * IQR)]
Use caution—outliers may be errors or important signal, depending on context.
Autocorrelation Analysis
Autocorrelation measures how past values relate to future ones. This can identify lag effects and
cyclical behavior.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(df['Sales'])
plot_pacf(df['Sales'])
plt.show()
Use these plots to decide lag features or model order for forecasting.
Summary
Exploratory Data Analysis is the critical foundation of any data science project. It transforms
raw data into understanding, revealing structure, quality issues, relationships, and modeling clues.
Key techniques include:
Univariate analysis for distributions and outliers.
Bivariate and multivariate analysis for uncovering interactions and associations.
Visual methods like histograms, boxplots, pair plots, and heatmaps to explore data
intuitively.
Statistical summaries for numeric clarity.
Time series EDA, including rolling averages, decomposition, and autocorrelation
analysis.
Exercises
Exploring a Customer Transactions Dataset
Use a dataset containing CustomerID, Age, Gender, Region, TotalSpend, and PurchaseDate:
Generate univariate and bivariate plots for all relevant features.
Identify and explain any outliers or unusual segments.
Create a time series plot of average monthly spend and detect seasonality.
Suggest three new features based on EDA insights.
EDA with a Health Records Dataset
Given columns like Age, BMI, SmokingStatus, BloodPressure, DiseaseOutcome:
Analyze the distribution of BMI across smoking categories.
Build a heatmap of correlations among numeric health metrics.
Use pairplots to explore relationships with DiseaseOutcome.
Time Series EDA with Sales Data
Dataset includes Date, StoreID, ProductID, UnitsSold:
Create time series plots for one store’s sales over a year.
Decompose sales into trend, seasonal, and residual components.
Investigate whether weekend sales differ from weekday sales.
Advanced Multivariate Visualization
Use a dataset with at least 10 features:
Use pairplots to explore clusters or separability in labeled data.
Create facet grids to compare behaviors across regions or demographic groups.
Add interactivity using Plotly or seaborn for deeper exploration.
Train-Test Split
Before training any model, it is essential to split the dataset into training and testing subsets. The
training data is used to train the model, while the testing data evaluates the model’s predictive
performance on unseen samples.
In Python, the train_test_split function from the sklearn.model_selection module is the most
commonly used utility for this task.
from sklearn.model_selection import train_test_split
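A typical call (feature matrix X and target y assumed) reserves 20% of the rows for testing:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)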
Cross-Validation
While a simple train-test split is often adequate for many tasks, it may provide a biased estimate
of model performance if the data split is not representative. Cross-validation techniques help
address this by repeatedly splitting the data into multiple train-test folds and averaging the
performance.
K-Fold Cross-Validation
The most popular method is K-Fold cross-validation. The dataset is divided into K subsets
(folds). The model is trained on K-1 folds and tested on the remaining fold. This process repeats
K times, with each fold serving as the test set once.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV
print("Mean CV accuracy:", scores.mean())
model.fit(X_train, y_train)   # fit separately before predicting on the held-out test set
y_pred = model.predict(X_test)
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=skf)
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
LOOCV is best suited for small datasets where every data point is valuable.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
This method is crucial for forecasting problems to simulate real-world deployment scenarios.
from sklearn.metrics import roc_curve, roc_auc_score

y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC(probability=True)
}
Bias-Variance Trade-Off
The bias-variance trade-off is a fundamental concept that helps explain underfitting and
overfitting:
Bias: Error from overly simplistic models. High bias leads to underfitting.
Variance: Error from overly complex models sensitive to training data noise. High
variance leads to overfitting.
A good model finds the right balance — low bias and low variance — by being both accurate
and generalizable.
Visual Intuition:
High bias = consistent errors regardless of training data.
model.fit(X_train, y_train)
importances = model.feature_importances_
features = pd.Series(importances, index=X.columns)
features.sort_values().plot(kind='barh', title='Feature Importance')
Summary
Evaluating and validating models is not a single-step process—it is a continuous and
multifaceted discipline that determines the trustworthiness of your results. Using the right
metrics, cross-validation techniques, and interpretability tools can help you understand your
model's performance and make it production-ready.
Key Takeaways:
Choose evaluation metrics that match your problem type and business goal.
Use cross-validation to ensure generalization, especially when data is limited.
Pay special attention to imbalanced datasets using metrics like F1, ROC-AUC, and
precision-recall.
Incorporate model interpretability tools to build trust and transparency.
Validate models post-deployment and monitor performance over time.
Avoid pitfalls like data leakage, improper splitting, and overfitting to validation data.
Exercises
1. Train/Test Split & Evaluation
Load the breast cancer dataset from sklearn.datasets. Train a logistic regression model.
Evaluate using accuracy, precision, recall, and F1-score.
2. Cross-Validation Comparison
Use K-Fold cross-validation to evaluate both a Random Forest and SVM classifier on the
iris dataset. Compare their mean F1-scores.
3. ROC Curve Analysis
Train a decision tree on an imbalanced binary classification dataset. Plot the ROC curve
and calculate the AUC score.
Introduction
Building a machine learning model is only part of the data science workflow. For a model to
provide real-world value, it must be integrated into production environments where it can make
predictions on new data. This is the essence of model deployment. Additionally, to streamline
and automate the process of data transformation, model training, and prediction, pipelines are
employed.
In this chapter, we will explore the end-to-end process of preparing machine learning models for
deployment. We will look into building reproducible pipelines using Python tools, deploying
models via APIs, and best practices for versioning, monitoring, and scaling in production
systems.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
In this example, the StandardScaler is applied first, followed by the LogisticRegression model.
The same preprocessing is guaranteed during prediction.
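The snippet below assumes the numeric and categorical column lists and their transformers have already been defined; a minimal sketch of those pieces (the column names are placeholders):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['age', 'balance']           # placeholder column names
categorical_features = ['gender', 'region']     # placeholder column names

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])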
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', LogisticRegression())
])
This setup ensures all preprocessing steps are encapsulated in the pipeline, minimizing the risk of
inconsistencies during deployment.
param_grid = {
'classifier__C': [0.1, 1, 10],
'classifier__penalty': ['l1', 'l2'],
'classifier__solver': ['liblinear']
}
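One way to use this grid and then persist the winning pipeline (a sketch; the loaded_pipeline name matches the prediction snippet that follows):
from sklearn.model_selection import GridSearchCV
import joblib

grid = GridSearchCV(full_pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

joblib.dump(grid.best_estimator_, 'model_pipeline.pkl')   # serialize the best pipeline
loaded_pipeline = joblib.load('model_pipeline.pkl')       # reload it later for scoring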
# Make predictions
preds = loaded_pipeline.predict(X_new)
Make sure all preprocessing steps are part of the pipeline before serialization to avoid
inconsistencies.
from flask import Flask, request, jsonify
import joblib
import numpy as np

app = Flask(__name__)
model = joblib.load('model_pipeline.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)
    input_data = np.array(data['features']).reshape(1, -1)
    prediction = model.predict(input_data)
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(debug=True)
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()
model = joblib.load('model_pipeline.pkl')

class InputData(BaseModel):
    features: list

@app.post("/predict")
def predict(data: InputData):
    input_array = np.array(data.features).reshape(1, -1)
    prediction = model.predict(input_array)
    return {"prediction": prediction.tolist()}
Run the app using:
uvicorn app:app --reload
Access interactive API docs at http://localhost:8000/docs.
import pandas as pd
import joblib

model = joblib.load('model_pipeline.pkl')    # the serialized pipeline from earlier
data = pd.read_csv('new_customers.csv')      # hypothetical input file to score
# Make predictions
predictions = model.predict(data)
data['prediction'] = predictions
# Save results
data.to_csv('scored_data.csv', index=False)
You can schedule this script to run daily using cron, Airflow, or a cloud scheduler.
import mlflow

with mlflow.start_run():
    model = LogisticRegression()
    model.fit(X_train, y_train)
    mlflow.log_metric("accuracy", model.score(X_test, y_test))   # log any parameters and metrics worth tracking
on: [push]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: 3.11
- name: Install dependencies
  run: pip install -r requirements.txt
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=live_df)
report.show()
Summary
Model deployment is the final, business-critical step of the data science workflow. From
serializing your pipeline to deploying it with APIs and Docker, this chapter covered the essentials
of productionizing your machine learning models.
Key Takeaways:
Pipelines streamline data processing and model application.
Joblib and Pickle can serialize pipelines for reuse.
Flask and FastAPI expose models via REST APIs.
Docker containers ensure environment consistency across platforms.
CI/CD and monitoring are essential for robust, scalable deployment.
Cloud platforms offer tools to deploy and monitor models at scale.
Exercises
1. Pipeline Construction
Create a pipeline using scikit-learn that includes preprocessing for numerical and
categorical features and a classifier. Save it using Joblib.
2. Flask API Deployment
Build a Flask app that loads your saved pipeline and exposes a /predict endpoint.
3. Dockerize Your Model
Write a Dockerfile to containerize your Flask model API. Build and run the Docker
container locally.
4. Deploy to Heroku or Render
Push your Dockerized app to Heroku or Render. Test your deployed model by sending
HTTP requests.
5. Batch Prediction Job
Write a script to load a CSV file, apply a saved model, and export predictions to a new
CSV.
6. Track an Experiment with MLflow
Train two models with different hyperparameters. Log metrics, parameters, and artifacts
using MLflow.
7. CI/CD Pipeline for Model Testing
Set up GitHub Actions to automatically run tests whenever you push changes to your
model code.
Introduction
Once a model has been trained, evaluating its performance is the critical next step. The process
of model evaluation ensures that the model not only performs well on training data but also
generalizes to unseen data. This involves selecting appropriate metrics, conducting statistical
tests, and using diagnostic tools to assess the robustness, reliability, and fairness of the model.
This chapter delves into various evaluation techniques for different types of models, including
classification, regression, and clustering. It also addresses concepts such as overfitting,
underfitting, cross-validation, confusion matrices, ROC curves, and advanced metrics like AUC,
F1-score, and adjusted R². The goal is to build a solid foundation for interpreting model outputs
and making informed decisions in production scenarios.
Each cell gives us the components needed for the most common metrics.
Precision
The ratio of correctly predicted positive observations to the total predicted positives: Precision = TP / (TP + FP).
It measures how many predicted positives are actually correct, and matters most when false positives are costly (e.g., spam detection).
Recall (Sensitivity)
The ratio of correctly predicted positive observations to all actual positives: Recall = TP / (TP + FN).
F1-Score
The harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
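These labels can be scored directly with scikit-learn:
from sklearn.metrics import precision_score, recall_score, f1_score

print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))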
Precision-Recall Trade-off
Precision and recall often pull in opposite directions. Increasing recall may reduce precision and
vice versa. Adjusting the decision threshold of your model allows you to navigate this trade-off.
from sklearn.metrics import precision_recall_curve

model = LogisticRegression()
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)
This helps in choosing a threshold that aligns with business objectives (e.g., high precision vs.
high recall).
MSE is sensitive to outliers and is useful when large errors are particularly undesirable.
Root Mean Squared Error (RMSE)
The square root of MSE, giving errors in the original units of the target variable.
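A quick way to compute it (assuming y_true and y_pred come from a fitted regressor):
import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_true, y_pred))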
R² (Coefficient of Determination)
Values range from 0 (no explanatory power) to 1 (perfect fit). Negative values indicate the model performs worse than predicting the mean.
Adjusted R²
Adjusted R² penalizes the addition of irrelevant features and adjusts R² based on the number of
predictors.
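It can be computed from the ordinary R², the number of samples n, and the number of predictors p (X here is assumed to be the feature matrix used to fit the model):
from sklearn.metrics import r2_score

r2 = r2_score(y_true, y_pred)
n, p = len(y_true), X.shape[1]
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)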
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 0, 2, 2, 1]
print(classification_report(y_true, y_pred))
model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5,
scoring='f1_macro')
print("F1 scores across folds:", scores)
Stratified K-Fold
Ensures class distribution is preserved in each fold (important for classification).
Leave-One-Out CV (LOOCV)
Each sample becomes a test set once. Computationally expensive but high fidelity.
Model Calibration
Classification models often output probabilities (e.g., logistic regression, random forest).
However, not all models produce well-calibrated probabilities. A model is well-calibrated if
the predicted probabilities correspond to actual observed frequencies.
For instance, among all samples where the model predicts a 70% probability of being positive,
about 70% should indeed be positive.
Calibration Curve
This plot compares predicted probabilities with actual outcomes. A perfect calibration curve lies
on the diagonal.
Example:
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
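A minimal completion of this example, assuming probs holds predicted probabilities from predict_proba and y_test the true labels:
prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=10)

plt.plot(prob_pred, prob_true, marker='o', label='Model')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated')
plt.xlabel('Predicted probability')
plt.ylabel('Observed frequency')
plt.legend()
plt.show()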
Learning Curves
Learning curves show model performance (e.g., accuracy or loss) as a function of training set
size. They help diagnose whether more data will improve performance or if the model is
underfitting or overfitting.
Typical Learning Curve Patterns:
Underfitting: Low training and validation scores, converging.
Overfitting: High training score, low validation score.
Good fit: Both curves converge and stabilize at high score.
Example:
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
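A minimal sketch of the call (X and y assumed, as elsewhere in this chapter):
import numpy as np

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(train_sizes, train_scores.mean(axis=1), label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation score')
plt.xlabel('Training set size')
plt.ylabel('Score')
plt.legend()
plt.show()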
Validation Curves
Validation curves show model performance as a function of a hyperparameter (e.g., depth of a
tree, regularization parameter).
Useful for:
Selecting optimal hyperparameters
Diagnosing bias-variance tradeoff
Example:
from sklearn.model_selection import validation_curve
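A sketch varying a random forest's max_depth (X and y assumed):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

param_range = np.arange(1, 11)
train_scores, val_scores = validation_curve(
    RandomForestClassifier(), X, y,
    param_name='max_depth', param_range=param_range, cv=5)

plt.plot(param_range, train_scores.mean(axis=1), label='Training score')
plt.plot(param_range, val_scores.mean(axis=1), label='Validation score')
plt.xlabel('max_depth')
plt.ylabel('Score')
plt.legend()
plt.show()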
Bias-Variance Tradeoff
Understanding this tradeoff is key to improving model generalization.
Bias: Error due to overly simplistic assumptions (e.g., linear model on non-linear data).
Variance: Error due to model sensitivity to training data (e.g., overfitting).
A good model balances both — low enough bias and variance to generalize well.
The silhouette score for a sample is (b − a) / max(a, b), where a is the mean intra-cluster distance and b is the mean nearest-cluster distance.
Davies-Bouldin Index: Lower values indicate better clustering.
Calinski-Harabasz Index: Higher values indicate better-defined clusters.
Adjusted Rand Index (ARI): If ground truth is available, measures agreement between
predicted and true clusters.
Example:
from sklearn.metrics import silhouette_score
pca = PCA(n_components=2)
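A minimal example, clustering a scaled feature matrix with K-Means and scoring the result (X_scaled is assumed to be prepared beforehand; the PCA object above can then project the clusters to two dimensions for plotting):
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_scaled)
print(silhouette_score(X_scaled, labels))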
Key Takeaways
Select metrics aligned with your goals: Classification, regression, clustering, etc., all
need different evaluation strategies.
Balance precision and recall: Use F1-score, ROC, and PR curves to assess trade-offs.
Use visual tools: Learning curves, residual plots, and confusion matrices provide critical
insights.
Cross-validation is essential: It gives more reliable estimates of model generalization.
Watch for bias and unfairness: Fairness metrics are becoming essential in responsible
AI.
Temporal Order
Each observation in a time series dataset is associated with a specific timestamp. This ordering is
crucial—shuffling or randomizing the order of observations would destroy the meaning of the
data.
Trend
A trend represents the long-term progression in the data. It can be upward, downward, or even
stationary. Trends are often influenced by external factors such as economic growth,
technological advancements, or policy changes.
Seasonality
Seasonality refers to periodic fluctuations that occur at regular intervals due to seasonal factors.
For example, retail sales often spike during the holiday season, or electricity consumption
increases during hot summers due to air conditioning.
Cyclic Patterns
Unlike seasonality, cyclic variations do not follow a fixed calendar pattern. Cycles are often
influenced by economic conditions or other structural factors and tend to span longer periods.
Noise
Random fluctuations or irregular variations that cannot be attributed to trend, seasonality, or
cyclic patterns constitute noise. Identifying and filtering noise is a key part of time series
preprocessing.
from statsmodels.tsa.stattools import adfuller

result = adfuller(df['value'].dropna())
print('ADF Statistic:', result[0])
print('p-value:', result[1])
The null hypothesis of the ADF test is that the time series is non-stationary. Therefore, a small
p-value (typically less than 0.05) indicates that we can reject the null hypothesis and conclude
that the series is stationary.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(df['value'].dropna(), lags=40)
plt.show()
plot_pacf(df['value'].dropna(), lags=40)
plt.show()
ACF and PACF plots guide the selection of parameters for ARIMA models. For example:
A slow decay in the ACF indicates a non-stationary series.
A sharp cutoff in PACF suggests the presence of autoregressive terms.
ARIMA Modeling
The Autoregressive Integrated Moving Average (ARIMA) model is a powerful and widely
used statistical model for time series forecasting. It combines three components:
AR (Autoregressive): Uses past values to predict future values.
I (Integrated): Refers to differencing of raw observations to make the series stationary.
MA (Moving Average): Uses past forecast errors to correct future predictions.
An ARIMA model is denoted as ARIMA(p, d, q) where:
p: number of lag observations in the model (AR terms)
d: degree of differencing (to induce stationarity)
q: size of the moving average window (MA terms)
To build and fit an ARIMA model in Python:
from statsmodels.tsa.arima.model import ARIMA
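A minimal fit on the value series, with an order chosen from the ACF/PACF plots (the (1, 1, 1) order here is only illustrative):
model = ARIMA(df['value'], order=(1, 1, 1))
model_fit = model.fit()
print(model_fit.summary())

forecast = model_fit.forecast(steps=12)   # forecast the next 12 periods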
Model Diagnostics
Post-estimation diagnostics are essential for validating the quality of an ARIMA model. One
must check if the residuals resemble white noise—random with zero mean and constant
variance.
residuals = model_fit.resid
residuals.plot(title='Residuals')
plt.show()
plot_acf(residuals)
plt.show()
A good model will show no autocorrelation in the residuals, indicating that all the information
has been captured by the model.
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(df['value'],
                order=(1, 1, 1),
                seasonal_order=(1, 1, 1, 12))  # assuming monthly data with yearly seasonality
model_fit = model.fit()
print(model_fit.summary())
The SARIMAX class also allows for inclusion of exogenous variables, making it suitable for
more complex forecasting tasks involving external regressors.
import pmdarima as pm

model = pm.auto_arima(df['value'],
                      seasonal=True,
                      m=12,              # seasonality period
                      trace=True,
                      error_action='ignore',
                      suppress_warnings=True,
                      stepwise=True)
print(model.summary())
The auto_arima function evaluates multiple combinations of parameters and chooses the one
with the best performance. This saves time and improves the accuracy of the initial model setup.
# Plot forecast (assumes a fitted Prophet model and its forecast DataFrame)
model.plot(forecast)
plt.show()
Prophet handles both daily and irregular time series without needing extensive parameter tuning.
It also allows adding holiday effects and custom seasonalities:
# Add holidays
from prophet.make_holidays import make_holidays_df
model.add_country_holidays(country_name='US')
This makes Prophet particularly useful in business and retail environments where calendar-based
effects significantly influence trends.
y_true = df['value'][-10:]
y_pred = forecast['yhat'][-10:]
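Standard error metrics can then be computed on this overlap (assuming the forecast is aligned with the last 10 observations, as above):
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")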
Statistical Thresholding
A simple approach involves computing the rolling mean and standard deviation and flagging
points that fall outside a defined threshold.
rolling_mean = df['value'].rolling(window=12).mean()
rolling_std = df['value'].rolling(window=12).std()
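Points outside a band around the rolling statistics can then be flagged (the three-standard-deviation width is a common but arbitrary choice):
upper = rolling_mean + 3 * rolling_std
lower = rolling_mean - 3 * rolling_std
df['is_anomaly'] = (df['value'] > upper) | (df['value'] < lower)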
# Model-based detection with Isolation Forest
from sklearn.ensemble import IsolationForest
model = IsolationForest(contamination=0.01)
df['anomaly'] = model.fit_predict(df[['value', 'lag1']])  # 'lag1' is a lagged copy of 'value', e.g. df['value'].shift(1)
Points labeled as -1 are considered anomalies.
from statsmodels.tsa.api import VAR

model = VAR(df_multivariate)
model_fitted = model.fit(maxlags=15, ic='aic')
forecast = model_fitted.forecast(df_multivariate.values[-model_fitted.k_ar:], steps=5)
from river import preprocessing, linear_model, metrics

model = preprocessing.StandardScaler() | linear_model.LinearRegression()
metric = metrics.MAE()

for x, y in stream:                  # stream yields (features, target) pairs
    y_pred = model.predict_one(x)
    model.learn_one(x, y)
    metric.update(y, y_pred)
Such pipelines enable models to evolve as data streams in, maintaining updated forecasts with
minimal lag.
Conclusion
Time series analysis is a deeply practical and mathematically rich field. From simple trend
forecasting to real-time streaming analytics, the ability to understand and model time series data
is invaluable in countless domains.
This chapter covered the essential building blocks:
Understanding temporal structure
Stationarity and differencing
Classical models (ARIMA, SARIMA)
Advanced models (Prophet, VAR, LSTM)
Forecast evaluation
Anomaly detection and multivariate analysis
Equipped with these tools, you are now capable of analyzing and forecasting a wide range of
real-world time series datasets.
Q3. What is the role of the ACF and PACF in model selection?
Q5. How does Prophet handle holidays and seasonality differently than ARIMA?
Coding Exercises
E1. Load a dataset with monthly airline passenger numbers. Visualize it and decompose it using
seasonal_decompose.
E2. Perform the Augmented Dickey-Fuller test on the data and apply differencing if required to
achieve stationarity.
E3. Use auto_arima to select optimal ARIMA parameters and forecast the next 12 months.
E4. Implement Facebook Prophet on the same dataset. Compare its forecasts with ARIMA’s.
E6. Create a synthetic multivariate time series dataset and fit a VAR model. Forecast all variables
for the next 5 steps.
E7. Using River, simulate real-time learning and forecasting on a streaming time series dataset.
Text Preprocessing
Preprocessing is a crucial step that simplifies and normalizes text data. Standard steps include:
Lowercasing
Convert all characters to lowercase to ensure consistency.
text = text.lower()
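Tokenization and stop-word removal usually follow; a minimal NLTK sketch (the punkt and stopwords resources must be downloaded once) that also produces the tokens list used in the examples below:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]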
Word Clouds
A visual representation of word frequencies.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(width=800, height=400).generate(" ".join(tokens))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
N-grams
N-grams are contiguous sequences of n words that help capture context.
from nltk import ngrams
bigrams = list(ngrams(tokens, 2))
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

finder = BigramCollocationFinder.from_words(tokens)
finder.nbest(BigramAssocMeasures.likelihood_ratio, 10)
corpus = [
"Data science is an interdisciplinary field.",
"Machine learning is a part of data science.",
"Text mining and NLP are key in data science."
]
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
Pros:
Simple and interpretable.
Works well for small to medium datasets.
Cons:
High dimensionality.
Does not capture semantic similarity or word order.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(corpus)
print(tfidf_vectorizer.get_feature_names_out())
print(X.toarray())
TF-IDF helps suppress common but less informative words like "is" or "and," giving higher
weight to rare but meaningful terms like "interdisciplinary" or "mining".
Word Embeddings
Unlike sparse matrices from BoW/TF-IDF, word embeddings are dense, low-dimensional
vectors that capture semantic relationships. They are trained on large corpora and map similar
words to similar vectors.
Word2Vec
Word2Vec models come in two flavors:
CBOW (Continuous Bag of Words): Predicts a word from context.
Skip-gram: Predicts context from a word.
from gensim.models import Word2Vec

# Train on a tokenized corpus (a list of token lists); the parameters shown are illustrative
model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1)
vector = model.wv['science']
similar_words = model.wv.most_similar('science')
GloVe (Global Vectors) is another embedding method that captures global co-occurrence
statistics.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(corpus)
These embeddings are especially useful for tasks like semantic search, clustering, and sentence
similarity.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=5000, output_dim=128, input_length=100))
model.add(LSTM(units=64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
model.summary()
LSTMs are commonly used for sentiment analysis, text generation, and sequence-labeling tasks such as named entity recognition.
Bidirectional LSTMs
To capture both past and future context, Bidirectional LSTMs process text in both directions.
from keras.layers import Bidirectional
model.add(Bidirectional(LSTM(64)))
Transformer Models
Transformers revolutionized NLP by discarding recurrence and using attention mechanisms.
Attention allows the model to weigh the importance of different words in a sequence, regardless
of their position.
Key Innovations in Transformers:
Self-Attention: allows each word to focus on all other words.
Positional Encoding: injects order information into the model.
Parallelization: transformers train faster than RNNs/LSTMs.
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
Classification Metrics
Useful for tasks like spam detection, sentiment analysis, and topic classification.
Accuracy: Overall correctness.
Precision: Correct positive predictions / All positive predictions.
Recall: Correct positive predictions / All actual positives.
F1 Score: Harmonic mean of precision and recall.
from sklearn.metrics import classification_report

y_true = [0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1]
print(classification_report(y_true, y_pred))
Exercises
Exercise 1: Preprocessing and Feature Engineering
Load a dataset of product reviews.
Preprocess the text (lowercase, remove stopwords, lemmatize).
Extract features using TF-IDF.
Print top 10 keywords for each class.
Exercise 2: Text Classification
Train a sentiment analysis model using logistic regression on the IMDB dataset.
Evaluate it using precision, recall, and F1-score.
Exercise 3: Topic Modeling
Apply Latent Dirichlet Allocation (LDA) to a news dataset.
Identify and label 5 topics with representative keywords.
Exercise 4: Named Entity Recognition
Use spaCy to extract entities from a set of news articles.
Conclusion
Natural Language Processing bridges the gap between human communication and machines,
enabling rich analysis of vast unstructured text data. From foundational techniques like
tokenization and TF-IDF to state-of-the-art transformers like BERT and GPT, NLP has
become central to modern data science applications.
The future of NLP continues to evolve, with breakthroughs in multilingual modeling, real-time
understanding, and reasoning. A solid grasp of NLP will significantly expand your toolkit as a
data scientist, enabling you to tackle problems involving social media, reviews, documents,
customer support, and more.
Introduction
Time series data is everywhere—from stock market prices and weather records to web traffic
logs and sales figures. In data science, time series analysis enables forecasting, anomaly detection,
and understanding temporal trends. This chapter introduces the theory, methods, and practical
tools for analyzing and forecasting time series data using Python.
You’ll learn:
Key concepts and terminology in time series
Techniques for visualization and decomposition
Statistical models like ARIMA and exponential smoothing
Modern approaches including Facebook Prophet and LSTMs
Practical workflows for forecasting
We begin by laying the foundational concepts and types of time series data.
A key property to check before modeling is stationarity, commonly assessed with the Augmented Dickey-Fuller (ADF) test:
from statsmodels.tsa.stattools import adfuller

result = adfuller(df['Passengers'].dropna())
print(f"ADF Statistic: {result[0]}")
print(f"p-value: {result[1]}")
A p-value below 0.05 suggests the series is stationary.
MA components are rarely used on their own; they are most effective when combined with AR terms and differencing in an ARIMA model, as sketched below.
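A minimal ARIMA sketch with statsmodels (the order (2, 1, 2) is only an illustrative starting point; in practice you would choose p, d, q from the ACF/PACF plots or a search):
from statsmodels.tsa.arima.model import ARIMA

# Fit ARIMA(p=2, d=1, q=2) on the passenger series
arima_model = ARIMA(df['Passengers'], order=(2, 1, 2))
arima_fit = arima_model.fit()
print(arima_fit.summary())

# Forecast the next 12 periods
forecast = arima_fit.forecast(steps=12)
print(forecast)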
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, Holt

# Simple exponential smoothing (level only)
model = SimpleExpSmoothing(df['Passengers'])
fit = model.fit()
df['SES'] = fit.fittedvalues

# Holt's linear trend method (level + trend)
model = Holt(df['Passengers'])
fit = model.fit()
df['Holt'] = fit.fittedvalues
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(df['Passengers'], model='additive')
result.plot()
This is useful for understanding structure and preparing for modeling.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(df['Passengers'].dropna(), lags=40)
plot_pacf(df['Passengers'].dropna(), lags=40)
Facebook Prophet
Developed by Meta (Facebook), Prophet is an open-source forecasting tool designed to handle
daily or seasonal patterns with strong trend components. It is intuitive, scalable, and handles
holidays and missing data well.
Installation:
pip install prophet
Usage Example:
from prophet import Prophet

# Prepare data: Prophet expects columns named 'ds' (date) and 'y' (value)
df_prophet = df.reset_index()
df_prophet.columns = ['ds', 'y']

# Fit model
model = Prophet()
model.fit(df_prophet)

# Forecast 12 future periods, then plot
future = model.make_future_dataframe(periods=12, freq='M')
forecast = model.predict(future)
model.plot(forecast)
model.plot_components(forecast)
from xgboost import XGBRegressor

# Lag features (created below) serve as predictors
X = df[['lag1', 'lag12']].dropna()
y = df.loc[X.index, 'Passengers']
X_train, X_test = X[:-12], X[-12:]   # chronological split: last 12 points held out
y_train, y_test = y[:-12], y[-12:]
model = XGBRegressor()
model.fit(X_train, y_train)
preds = model.predict(X_test)
Note: Feature engineering is crucial; models don't "know" time unless you encode it explicitly (e.g., lag features, calendar fields such as month, a trend index).
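For instance, the lag1 and lag12 columns used in the XGBoost snippet above can be created with pandas shift (a minimal sketch, assuming df has a DatetimeIndex):
# Previous month's value and the value 12 months earlier (seasonal lag)
df['lag1'] = df['Passengers'].shift(1)
df['lag12'] = df['Passengers'].shift(12)

# Optional calendar/trend features
df['month'] = df.index.month      # requires a DatetimeIndex
df['trend'] = range(len(df))
df = df.dropna()                  # drop rows without complete lag history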
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Normalize values to the [0, 1] range
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df[['Passengers']])

# Create sliding-window sequences: each sample uses the previous `window` values
def create_sequences(data, window):
    X, y = [], []
    for i in range(window, len(data)):
        X.append(data[i-window:i, 0])
        y.append(data[i, 0])
    return np.array(X), np.array(y)

window = 12
X, y = create_sequences(scaled_data, window)
X = X.reshape((X.shape[0], X.shape[1], 1))  # (samples, timesteps, features)

# LSTM model
model = Sequential()
model.add(LSTM(64, activation='relu', input_shape=(window, 1)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=10, batch_size=16)
Use Cases:
Multivariate time series
High-frequency financial data
Long-term forecasting
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

# CNN-LSTM hybrid: Conv1D extracts local patterns, the LSTM models the sequence
model = Sequential([
    Conv1D(filters=64, kernel_size=2, activation='relu', input_shape=(window, 1)),
    MaxPooling1D(pool_size=2),
    LSTM(64),
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=10)
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    # assumes X, y are pandas objects; use X[train_index] for NumPy arrays
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
This ensures the training data always precedes test data.
Exercises
Exercise 1: Airline Passenger Forecasting
Load the classic airline passenger dataset.
Plot the series and decompose it.
Fit an ARIMA model and forecast 12 months.
Compare the forecast to the actual data using RMSE and MAE.
Exercise 2: Build a Facebook Prophet Model
Load any retail or web traffic time series.
Fit a Prophet model and visualize trend/seasonality components.
Add holidays or changepoints and observe the effect.
Exercise 3: LSTM-based Forecast
Convert the monthly time series into sliding window sequences.
Normalize, build an LSTM model, and train it.
Conclusion
Time series analysis is a powerful tool in any data scientist’s skillset. From basic decomposition
to advanced forecasting with deep learning, Python offers an extensive toolkit for tackling time-
based data.
By understanding structure, choosing the right model, and evaluating forecasts correctly, you can
build robust solutions across finance, retail, healthcare, IoT, and beyond.
In the next chapter, we will dive into Natural Language Processing (NLP) in Python, where
we explore text-based data, vectorization, embeddings, and transformer-based models.
Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on the
interaction between computers and human (natural) languages. It enables machines to read,
understand, and derive meaning from human text or speech. In the era of big data, NLP is
critical for analyzing vast amounts of unstructured text data from social media, customer reviews,
emails, and more.
This chapter explores the core concepts, preprocessing techniques, and practical
implementations of NLP using Python. By the end of this chapter, you will be equipped to
process, analyze, and model textual data for various data science applications.
Text Preprocessing
Raw text data is messy. Preprocessing is essential to prepare it for analysis and modeling.
Common steps include:
Lowercasing
text = "NLP is AMAZING!"
text = text.lower()
Removing punctuation and special characters
import re
text = re.sub(r'[^a-z\s]', '', text)
Tokenization and stopword removal
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w not in stop_words]
Stemming
Reduces words to a crude root form using simple heuristics.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in filtered_tokens]
Lemmatization
More sophisticated than stemming; considers the word’s part of speech.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]
Part-of-Speech Tagging
Tagging words with their grammatical role (noun, verb, etc.).
from nltk import pos_tag
pos_tags = pos_tag(tokens)
With spaCy, POS tagging is built-in:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("NLP is amazing!")
for token in doc:
    print(token.text, token.pos_)
Word Frequency
from collections import Counter
word_freq = Counter(lemmatized)
print(word_freq.most_common(10))
Word Clouds
from wordcloud import WordCloud
import matplotlib.pyplot as plt
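A minimal sketch of building the cloud from the lemmatized tokens computed earlier (word size reflects frequency):
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate(' '.join(lemmatized))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()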
N-grams
Sequences of N consecutive words, useful for context and phrase detection.
from nltk.util import ngrams
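For example, bigrams (N = 2) can be pulled from the token list created earlier:
bigrams = list(ngrams(tokens, 2))
print(bigrams[:5])  # first few word pairs, e.g. ('nlp', 'is'), ('is', 'amazing')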
Text Vectorization
Machine learning models cannot directly understand raw text—they require numerical input.
Vectorization is the process of converting text into vectors (arrays of numbers) while preserving
semantic meaning and structure.
Bag of Words (BoW)
The simplest approach counts how often each vocabulary word appears in each document.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "NLP is fun and powerful",
    "I love studying NLP",
    "NLP helps analyze text"
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())
Each row in the resulting matrix corresponds to a document, and each column represents a word
in the corpus vocabulary.
Limitations:
High-dimensional and sparse
Ignores semantics and word order
TF-IDF (Term Frequency-Inverse Document Frequency) weighs each word by how frequent it is in a document and how rare it is across the corpus:
TF-IDF(t, d) = TF(t, d) × IDF(t)
Where:
TF(t, d): Frequency of term t in document d
IDF(t): Inverse document frequency (rarity of term t across all documents)
Example:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(X.toarray())
TF-IDF is commonly used in text classification, clustering, and search engines due to its balance
of frequency and uniqueness.
Word Embeddings
Unlike BoW and TF-IDF, word embeddings capture semantic relationships between words in
dense vector representations. They are trained on large corpora to learn contextual meaning.
Word2Vec
Introduced by Google, Word2Vec uses a neural network to learn word relationships. There are
two architectures:
CBOW (Continuous Bag of Words): Predicts the target word from surrounding
context.
Skip-gram: Predicts the surrounding context from the target word.
from gensim.models import Word2Vec
sentences = [
    ['nlp', 'is', 'fun'],
    ['nlp', 'is', 'powerful'],
    ['nlp', 'helps', 'analyze', 'text']
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv['nlp'])  # dense vector learned for the word "nlp"
GloVe
Pre-trained GloVe embeddings can be loaded through gensim's downloader:
import gensim.downloader as api
model = api.load("glove-wiki-gigaword-100")
print(model['king'])
from sklearn.manifold import TSNE
import numpy as np
import matplotlib.pyplot as plt

# Example words to visualize (any words in the embedding vocabulary work)
words = ['king', 'queen', 'man', 'woman', 'paris', 'france']
vectors = np.array([model[w] for w in words])
tsne = TSNE(n_components=2, perplexity=3, random_state=42)
Y = tsne.fit_transform(vectors)

plt.figure(figsize=(10, 6))
for i, word in enumerate(words):
    plt.scatter(Y[i, 0], Y[i, 1])
    plt.text(Y[i, 0] + 0.01, Y[i, 1] + 0.01, word)
plt.show()
Text Classification
Text classification is the task of assigning predefined categories to text data. Applications include
spam detection, topic labeling, sentiment detection, and intent recognition.
Let's walk through a full pipeline using scikit-learn's 20 Newsgroups dataset, classifying posts into two categories (baseball vs. medicine).
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

data = fetch_20newsgroups(subset='train',
                          categories=['rec.sport.baseball', 'sci.med'],
                          remove=('headers', 'footers', 'quotes'))
texts = data.data
labels = data.target

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', MultinomialNB())
])
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
Sentiment Analysis
Sentiment analysis determines the emotional tone behind text. It is commonly used in social
media analysis, product reviews, and opinion mining.
Using TextBlob
from textblob import TextBlob
blob = TextBlob("Wow! This is a great product :)")
print(blob.sentiment.polarity)  # ranges from -1 (negative) to +1 (positive)
Using VADER (NLTK)
from nltk.sentiment import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()
scores = sid.polarity_scores("Wow! This is a great product :)")
print(scores)
Output:
{'neg': 0.0, 'neu': 0.382, 'pos': 0.618, 'compound': 0.7783}
The compound score gives the overall sentiment, ranging from -1 (most negative) to +1 (most positive).
Confusion Matrix
Visualizes correct and incorrect predictions:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, preds)
ConfusionMatrixDisplay(cm).plot()
This helps identify where the model is confusing one class with another.
Named Entity Recognition (NER)
NER identifies real-world entities (people, organizations, places, dates) in text. spaCy makes this straightforward:
import spacy
nlp = spacy.load('en_core_web_sm')
text = "Apple was founded by Steve Jobs in Cupertino in 1976."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
Topic Modeling
Topic modeling is an unsupervised technique to discover abstract “topics” within a collection of
documents.
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response'],
    ['graph', 'minors', 'trees'],
    ['graph', 'trees']
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit an LDA model with 2 topics and print the top words per topic
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("NLP with transformers is powerful", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states.shape)  # (batch_size, sequence_length, hidden_size)
last_hidden_states contains contextual embeddings for each token.
Fine-tuning BERT
Fine-tuning involves training BERT on a specific task, such as sentiment analysis or NER, with
labeled data. This adapts BERT’s knowledge for your application.
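As a rough sketch of what that looks like with the Hugging Face Trainer API (here using the IMDB dataset from the datasets library; the hyperparameters and subset sizes are illustrative, not tuned):
from datasets import load_dataset
from transformers import (BertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

# Small labeled dataset (IMDB reviews), tokenized for BERT
dataset = load_dataset('imdb')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

encoded = dataset.map(tokenize, batched=True)

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

training_args = TrainingArguments(output_dir='./results',
                                  num_train_epochs=1,
                                  per_device_train_batch_size=16)

trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=encoded['train'].shuffle(seed=42).select(range(2000)),
                  eval_dataset=encoded['test'].select(range(500)))
trainer.train()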
Alternatively, ready-made pipelines wrap an already fine-tuned model behind a single call:
from transformers import pipeline

sentiment_pipeline = pipeline('sentiment-analysis')
result = sentiment_pipeline("I love using transformers!")
print(result)
Output:
[{'label': 'POSITIVE', 'score': 0.9998}]
Summary
NER extracts entities from text for structured understanding.
Topic modeling uncovers hidden themes in large text corpora.
Transformer models like BERT provide powerful, context-aware representations.
Hugging Face Transformers library simplifies integration of state-of-the-art NLP models.
Chapter Recap
This chapter introduced you to the fascinating field of Natural Language Processing (NLP) and
its implementation using Python. You explored:
The foundational concepts of NLP, including the linguistic layers involved in
understanding text.
Essential text preprocessing techniques such as tokenization, stopword removal,
stemming, and lemmatization.
Methods for representing text numerically: Bag of Words, TF-IDF, and word
embeddings like Word2Vec and GloVe.
Building practical NLP models including text classification and sentiment analysis
pipelines with classical machine learning.
Advanced NLP tasks such as Named Entity Recognition (NER) using spaCy, topic
modeling with Latent Dirichlet Allocation (LDA), and the power of transformer-based
models like BERT.
Leveraging pre-trained transformer models through Hugging Face’s Transformers library
for state-of-the-art results.
AutoML with H2O
AutoML libraries such as H2O automate model selection and tuning:
import h2o
from h2o.automl import H2OAutoML

h2o.init()
data = h2o.import_file("your_dataset.csv")
train, test = data.split_frame(ratios=[0.8])
aml = H2OAutoML(max_runtime_secs=300)
aml.train(y="target_column", training_frame=train)
lb = aml.leaderboard
print(lb)
AutoML tools typically provide a leaderboard of models, automatically perform cross-validation,
and return the best-performing model. This accelerates prototyping, especially when working
under time or resource constraints.
However, AutoML is not a replacement for understanding the data or problem domain. It
should complement—not replace—domain expertise.
Best Practices
Bias audits: Evaluate whether different groups are treated fairly by your model.
Model cards: Document model intent, training data, performance, and ethical
considerations.
Data anonymization: Remove or obfuscate personally identifiable information before
processing.
Ethics checklists: Integrate ethical reviews into every stage of the data science lifecycle.
Ethical concerns are not just technical—they require collaborative input from legal, social, and
domain experts.
Exercises
1. End-to-End Project
Choose a real-world dataset from Kaggle or UCI. Define a business problem, clean the data,
build a model, evaluate it, and write a short report summarizing your approach and results.
2. Explainable AI
Train a random forest or XGBoost model on tabular data and use SHAP to interpret feature
importance and individual predictions. Visualize your findings.
3. Try AutoML
Use H2O, TPOT, or Auto-sklearn to build a model for a classification task. Compare its
performance to your manually tuned model.
4. Ethical Review
Pick an ML application (e.g., credit scoring, hiring). Identify potential ethical risks, biases, and
what steps could be taken to ensure fairness and transparency.
5. MLOps Prototype
Create a small pipeline using MLflow or DVC to track versions of your model and data.
Optionally, serve your model using Flask and monitor basic request metrics.