Machine Learning
with Python
Authors

Tarkeshwar Barua
Professor of Computer Application
Chandigarh Group of Colleges
Punjab, India
[email protected]

Ritesh Kumar Jain
Geetanjali Institute of Technical Studies
Udaipur, India
[email protected]
ISBN 978-3-11-069716-2
e-ISBN (PDF) 978-3-11-069718-6
e-ISBN (EPUB) 978-3-11-069725-4
www.degruyter.com
Contents
Chapter 1
Introduction to Machine Learning 1
1.1 What Is Machine Learning? 1
1.1.1 Definition of Machine Learning 1
1.1.2 Historical Background 2
1.1.3 Types of Machine Learning 4
1.1.4 Applications of Machine Learning 7
1.1.5 Challenges in Machine Learning 8
1.2 Python in the Machine Learning Landscape 11
1.2.1 Why Python for Machine Learning? 11
1.2.2 Popular Python Libraries for ML 12
1.2.3 Python vs Other Programming Languages in ML 13
1.2.4 Community and Resources for Python ML 14
1.3 Setting Up Your Python Environment 15
1.3.1 Installing Python 15
1.3.2 Virtual Environments and Why They’re Important 16
1.3.3 Essential Python Libraries for ML 17
1.3.4 Using Package Managers: Pip and Conda 19
1.3.5 Setting Up Jupyter Notebook 20
1.3.6 Best Practices for Managing ML Projects in Python 22
Summary 25
Exercise (MCQs) 26
Answers 27
Fill in the Blanks 27
Answers 28
Descriptive Questions 28
Chapter 2
Basics of Python Programming 30
2.1 Why Python? 30
2.1.1 Drawbacks of Python 30
2.1.2 History of Python 31
2.1.3 Major Features of Python 31
2.1.4 Market Demand 32
2.1.5 Why Python in Mobile App Development? 33
2.1.6 Python Versions 36
2.2 Python Syntax and Structure 36
2.2.1 Indentation and Whitespace 36
2.2.2 Comments and Documentation 37
2.3 Data Types and Variables 38
Chapter 3
Data Preprocessing in Python 90
3.1 Numerical and Scientific Computing Using NumPy and SciPy 90
3.1.1 Numerical Computations with NumPy 90
3.1.2 Scientific Computations with SciPy 105
3.2 Loading Data with Pandas 109
3.2.1 DataFrame and Series 109
3.2.2 Data Manipulation with Pandas 114
3.3 Data Cleaning and Transformation 120
3.3.1 Handling Missing Data 121
3.3.2 Data Type Conversions 122
3.4 Feature Engineering 123
3.4.1 Encoding Categorical Variables 124
3.4.2 Feature Scaling 126
3.5 Data Visualization with Matplotlib and Seaborn 129
3.5.1 Basic Plotting with Matplotlib 130
3.5.2 Advanced Visualizations with Seaborn 140
Chapter 4
Foundations of Machine Learning 152
4.1 Supervised vs Unsupervised Learning 153
4.1.1 Classification vs Regression 154
4.1.2 Clustering vs Association 156
4.2 Overfitting and Regularization 159
4.2.1 Bias-Variance Trade-Off 159
4.2.2 L1 and L2 Regularization 163
4.3 Evaluation Metrics 167
4.3.1 Metrics for Classification 168
4.3.2 Metrics for Regression 171
4.4 Cross-Validation 175
4.4.1 k-Fold Cross-Validation 176
4.4.2 Leave-One-Out and Stratified K-Fold 178
Summary 182
Exercise (MCQs) 183
Answers 185
Fill in the Blanks 186
Answers 186
Descriptive Questions 187
Chapter 5
Classic Machine Learning Algorithms 188
5.1 Linear Regression 189
5.1.1 Simple Linear Regression 190
5.1.2 Multiple Linear Regression 197
5.1.3 Polynomial Regression 203
5.2 Logistic Regression 206
5.2.1 Binary Classification 210
5.2.2 Multiclass Classification 214
5.2.3 Regularization in Logistic Regression 217
5.3 Decision Trees and Random Forests 222
Chapter 6
Advanced Machine Learning Techniques 316
6.1 Gradient Boosted Trees: XGBoost and LightGBM 316
6.1.1 XGBoost Algorithm 319
6.1.2 LightGBM Versus XGBoost 327
6.2 Kernel Methods 331
6.2.1 Kernel Tricks 332
6.2.2 Radial Basis Function (RBF) Kernel 333
6.2.3 Clustering Techniques Beyond k-Means 336
6.3 Anomaly Detection 347
6.3.1 Statistical Methods 347
6.3.2 Distance-Based Methods 350
6.4 Clustering 351
6.4.1 Isolation Forest 352
6.4.2 Density-Based Methods 352
6.4.3 Other Techniques 353
Summary 354
Exercise (MCQs) 356
Descriptive Type Questions 358
Fill in the Blanks 359
Answers 359
True and False 359
Chapter 7
Neural Networks and Deep Learning 360
7.1 Introduction to Neural Networks 360
7.2 Perceptron 362
7.2.1 Structure of Perceptron 362
7.2.2 Function of Perceptron 363
7.2.3 Where to Use Perceptron 363
7.2.4 Where to Use Activation Function 363
7.3 TensorFlow 367
7.3.1 Computational Graph 367
7.3.2 Eager Execution 368
7.3.3 Keras 369
7.3.4 Sessions 370
7.3.5 Common Operations 370
7.4 Implementing Neural Network Using TensorFlow 370
7.5 Building a Neural Network Using Keras Framework 373
7.5.1 Difference between Keras and Tensorflow 373
7.6 Convolutional Neural Network (CNN) 374
7.6.1 Model: "sequential_2" 386
7.6.2 CNN Architecture 389
7.7 Dropout Layers 390
7.8 Recurrent Neural Networks (RNNs) 391
7.8.1 Types of RNNs 391
7.9 Sequence-to-Sequence Models 398
7.10 Transfer Learning 400
7.11 Using Pretrained Models 406
7.12 Fine-Tuning and Feature Extraction 407
7.12.1 Choosing the Right Approach to Fine-Tune 407
7.13 Generative Adversarial Networks (GANs) 408
7.13.1 Architecture of GANs 408
7.13.2 Best Practices of GANs 409
7.13.3 Application of GANs 409
7.14 Regularization and Optimization in Deep Learning 409
7.15 Batch Normalization 413
7.15.1 How BatchNorm Works? 414
Chapter 8
Specialized Applications and Case Studies 424
8.1 Introduction to Natural Language Processing (NLP) 424
8.1.1 Tokenization 428
8.1.2 Word Embeddings 429
8.1.3 Sequence Modeling for NLP 431
8.2 Time-Series Forecasting 433
8.2.1 Autoregressive Integrated Moving Average (ARIMA) Models 434
8.2.2 Prophet and Neural Networks for Time Series 436
8.3 Recommender Systems 438
8.4 Computer Vision Applications 443
8.4.1 Object Detection and Segmentation 445
8.5 Reinforcement Learning 447
8.5.1 Core Elements of Reinforcement Learning 447
8.5.2 Learning Process 448
8.6 Application of Reinforcement Learning 455
8.7 Challenges of Reinforcement Learning 455
8.8 Q-learning 455
8.9 Deep Q Networks 457
8.10 Policy Gradient Methods 459
Summary 464
Exercise (MCQs) 465
Answers Key 467
Some Basic Question 468
Fill in the Blanks 469
Answers 469
References 471
Index 475
Chapter 1
Introduction to Machine Learning
1.1 What Is Machine Learning?
1.1.1 Definition of Machine Learning
Machine learning (ML) is a discipline within the field of artificial intelligence (AI) that
concentrates on the creation of algorithms and models, allowing computer systems to
acquire knowledge and make forecasts or choices without the need for explicit pro-
gramming. The primary objective of ML is to empower computers to autonomously
learn and enhance their performance based on experience or data.
[Figure: Deep learning is a subset of machine learning (training machines to get better at a task without explicit programming), which in turn is a subset of artificial intelligence (enabling machines to think like humans).]
Different experts and sources may provide slightly varied definitions of ML, reflecting
different perspectives on the field.
1.1.3 Types of Machine Learning
ML, a branch of AI, enables computers to acquire knowledge and reach conclusions
without the need for explicit instructions. This revolutionary discipline encompasses
different methodologies, each designed to address specific learning situations. The
main forms of ML comprise supervised learning, unsupervised learning, and rein-
forcement learning, each providing distinct approaches and applications for solving
various problems. Now, let us delve into an investigation of these foundational
classifications.
Supervised Learning
Key Characteristics
Training data: The dataset used for training purposes comprises pairs of inputs
and corresponding outputs, with each input having its correct output provided.
Learning objective: The algorithm aims to learn the relationship or mapping be-
tween inputs and outputs.
Examples: Common applications include image classification, spam filtering, and
regression problems.
Example Scenario
Task: Predicting whether an email is spam or not.
Training data: Emails categorized as spam or nonspam.
Learning objective: The algorithm acquires knowledge to categorize new emails
by recognizing recurring patterns from the data it has been trained on.
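To make the scenario concrete, here is a minimal sketch using Scikit-learn (introduced later in this chapter); the example emails and labels are purely illustrative:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative training data: inputs (emails) paired with their correct outputs (labels)
emails = ["Win a free prize now", "Meeting agenda for Monday",
          "Claim your free reward", "Lunch tomorrow?"]
labels = ["spam", "not spam", "spam", "not spam"]

# Convert the raw text into word-count features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Learn the mapping from inputs to outputs
model = MultinomialNB()
model.fit(X, labels)

# Categorize a new, unseen email using the learned patterns
new_email = vectorizer.transform(["Free prize waiting for you"])
print(model.predict(new_email))  # expected: ['spam']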
Unsupervised Learning
Key Characteristics
Training data: The dataset is unlabeled, meaning there are no predefined output
labels.
Learning objective: Uncover concealed patterns or interconnections among
the data.
Examples: Clustering, dimensionality reduction, and association rule learning
are prevalent tasks within the field of unsupervised learning.
Example Scenario
Task: Grouping similar customer purchase behaviors.
Training data: Purchase data without specific labels.
Learning objective: The algorithm identifies natural groupings or clusters of
similar purchase patterns.
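As a minimal sketch of this scenario with Scikit-learn, the purchase figures below are purely illustrative:
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled purchase data: [number of purchases, average amount spent]
purchases = np.array([[2, 15], [3, 20], [2, 18],
                      [25, 300], [30, 280], [28, 310]])

# Ask the algorithm to uncover two natural groupings in the data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(purchases)

print(cluster_labels)           # e.g. [0 0 0 1 1 1] – two groups of similar behavior
print(kmeans.cluster_centers_)  # the center of each discovered cluster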
Reinforcement Learning
Key Characteristics
Environment: The agent interacts with an environment and takes actions to
achieve goals.
Feedback: The agent is provided with feedback through the dispensation of re-
wards or punishments contingent upon its enacted actions.
Learning objective: Learn a policy that maps states to actions to maximize cu-
mulative rewards.
Examples: Game playing, such as the case with AlphaGo, robotic control, and the
development of autonomous systems.
Example Scenario
Task: Teaching a computer program to play a game.
Environment: The game environment.
Learning objective: The agent learns a strategy (policy) to take actions that max-
imize the game score over time.
1.1.4 Applications of Machine Learning
Healthcare
Disease diagnosis: ML aids in diagnosing diseases by analyzing medical images
(e.g., X-rays and MRIs) and identifying patterns indicative of specific conditions.
Predictive analytics: Predictive models are employed to forecast the occurrence
of disease outbreaks, the likelihood of patient readmissions, and the estimation of
individual health risks, thus enabling proactive healthcare interventions.
Finance
Fraud detection: ML algorithms scrutinize transaction data to discern atypical
patterns and ascertain potentially deceitful activities in real time.
Algorithmic trading: Predictive models utilize their analytical capabilities to ex-
amine market trends and historical data, enhancing the process of making in-
formed decisions within the realm of algorithmic trading and thereby optimizing
investment strategies.
Autonomous Vehicles
Object detection and recognition: ML algorithms are utilized to analyze sensor
data to identify and categorize various entities such as objects, pedestrians, and
obstacles. These algorithms play a crucial role in the decision-making process
within autonomous vehicles.
Path planning: Reinforcement learning is applied for path planning, enabling ve-
hicles to navigate complex environments and make dynamic decisions.
Education
Personalized learning: ML models adapt educational content based on individ-
ual student progress, tailoring the learning experience to meet specific needs.
Student performance prediction: Predictive analytics identifies students at risk
of academic challenges, allowing for timely intervention and support.
1.1.5 Challenges in Machine Learning
Although ML has made notable progress, it also faces various obstacles that affect its
progress, implementation, and efficacy. Resolving these obstacles is imperative for
furthering the field and guaranteeing morally sound, resilient, and comprehensible
ML systems. Presented below are the principal challenges encountered in ML.
Lack of Standardization
Challenge: The absence of standardized evaluation metrics, datasets, and model
architectures can impede reproducibility and hinder fair comparisons between
different models.
Ethical Considerations
Challenge: Ethical concerns arise from biased models, potential misuse of AI, and
the ethical implications of automated decision-making in critical areas such as
healthcare and criminal justice.
Mitigation: Incorporating ethical considerations in model development, promot-
ing diversity and inclusivity, and adhering to ethical guidelines and standards are
essential.
Adversarial Attacks
Challenge: Adversarial attacks involve manipulating input data to mislead ML
models, compromising their performance and reliability.
Mitigation: Developing robust models, incorporating adversarial training, and
regularly updating models to counter emerging attack strategies are strategies to
address this challenge.
Continuous Learning
Challenge: Many ML models are designed for static datasets, and adapting to
evolving data over time (concept drift) is a challenge.
Mitigation: Implementing online learning approaches, retraining models periodi-
cally, and staying vigilant to changes in data distributions help address continu-
ous learning challenges.
Privacy Concerns
Challenge: Handling sensitive information in training data raises privacy con-
cerns, especially in healthcare and finance applications.
Mitigation: Adopting privacy-preserving techniques, such as federated learning
and differential privacy, helps protect individual privacy while still allowing
model training.
1.2 Python in the Machine Learning Landscape
1.2.1 Why Python for Machine Learning?
Python has become the preferred programming language for ML due to various com-
pelling factors. Efficient development is facilitated by its simple, easy-to-understand
syntax, which makes it accessible to both novices and experienced devel-
opers. Python’s extensive ecosystem incorporates powerful libraries like Scikit-learn,
TensorFlow, and PyTorch, providing robust tools for a wide range of tasks, from data
preprocessing to complex deep learning models.
The language’s versatility is demonstrated by its ability to seamlessly integrate with
other technologies, enabling smooth incorporation into different data science work-
flows and frameworks. Python’s strong community support plays a crucial role, ensur-
ing a vast array of resources, tutorials, and forums for problem-solving. Its popularity
extends beyond the realm of ML, fostering collaborations across different disciplines.
The open-source nature of Python and its compatibility with various platforms contrib-
ute to its widespread adoption in research, industry, and academia. Consequently, Py-
thon’s user-friendly nature, extensive libraries, and community support collectively
establish it as the preferred language for practitioners and researchers navigating the
diverse field of ML.
1.2.2 Popular Python Libraries for ML
In the realm of ML, Python is synonymous with a rich ecosystem of libraries that em-
power developers and researchers to build and deploy sophisticated models. Here are
three standout libraries that have played pivotal roles in shaping the landscape of ML:
1.2.2.1 Scikit-learn
The ML library, Scikit-learn, is renowned for its adaptability and user-centric design,
offering an assortment of uncomplicated and effective resources for the examination
and construction of data. It encompasses an extensive range of algorithms that cater
to classification, regression, clustering, and dimensionality reduction tasks.
1.2.2.2 TensorFlow
Developed by Google, TensorFlow is an immensely robust open-source library that
demonstrates exceptional performance in the realm of deep learning applications. Its
remarkable adaptability and capacity for expansion render it well-suited for individu-
als ranging from novices to seasoned professionals. TensorFlow provides extensive
support for the creation and implementation of intricate neural network frameworks.
1.2.2.3 PyTorch
PyTorch is a dynamic and popular deep learning library known for its imperative pro-
gramming style. Favored by researchers and developers alike, PyTorch facilitates
building dynamic computational graphs, offering flexibility and ease in experiment-
ing with various neural network architectures.
1.2.3 Python vs Other Programming Languages in ML
Python has become the prevailing programming language in the realm of ML; how-
ever, it is crucial to evaluate it alongside other languages that are frequently em-
ployed in this particular domain.
[Table: feature-by-feature comparison of Python with other languages commonly used in ML. On raw performance, Python generally ranks lower than compiled languages, which offer better or more efficient execution in many cases.]
It is of vital significance to take into account that the selection of programming lan-
guage frequently relies on the precise demands of the ML venture, the developer’s
acquaintance with the language, and considerations of effectiveness. Every language
possesses its own merits and is appropriately equipped for specific situations.
1.2.4 Community and Resources for Python ML
The Python community for ML is a lively and cooperative ecosystem that has a crucial
function in the progress, dissemination, and enhancement of ML endeavors. Pre-
sented here is a summary of the community and the ample resources that are accessi-
ble for Python in the field of ML.
Open-Source Collaboration
Collaborative development: The Python ML community thrives on open-source
collaboration, with developers worldwide contributing to libraries, frameworks,
and tools. This collective effort leads to continuous improvements, bug fixes, and
the evolution of the ecosystem.
Educational Platforms
Coursera, edX, and Udacity: These platforms offer a plethora of ML courses and
specializations in Python, providing learners with comprehensive resources and
hands-on projects.
Kaggle: Kaggle serves as both a competition platform and a learning resource,
allowing users to explore datasets, compete, and collaborate on ML projects.
Online blogs and websites: Numerous blogs and websites, such as Towards Data
Science and Analytics Vidhya, provide in-depth tutorials, case studies, and best
practices for ML in Python.
The collaborative and inclusive nature of the Python ML community, coupled with
the abundance of educational resources and platforms, makes it an ideal environment
for developers, researchers, and learners to thrive and contribute to the evolving
landscape of ML in Python.
1.3 Setting Up Your Python Environment
1.3.1 Installing Python
Installing Python is the foundational step for any ML endeavor. Follow these steps for
a seamless installation:
Download Python
Visit the Python website’s official domain, https://www.python.org/downloads/, to ob-
tain the most recent edition that is compatible with your designated operating system.
Verify Installation
Launch a command prompt on Windows or a terminal on macOS/Linux and enter the
command "python --version" or "python -V". The installed version of Python will be
displayed, thereby confirming the successful completion of the installation:
$ python --version
Python 3.8.12
This example demonstrates checking the Python version, with the result indicating
that Python 3.8.12 is installed. Now, with Python installed, you’re ready to proceed to
the next steps of setting up your ML environment.
1.3.2 Virtual Environments and Why They’re Important
In Python, virtual environments are crucial for managing project dependencies and
isolating different projects from each other. Here’s why virtual environments are es-
sential and how to use them.
Dependency Isolation
Virtual environments create isolated spaces for Python projects. Each environment has
its own set of installed packages, preventing conflicts between project dependencies.
Version Control
By enclosing the dependencies of a project within a virtual environment, one can ex-
ercise authority over the versions of libraries employed for a particular project. This
guarantees the preservation of project reproducibility throughout its lifespan.
Easy Replication
Virtual environments make it easy to replicate the development environment on an-
other machine. By sharing the requirements.txt or environment.yml file, others can
recreate the exact environment used for a project.
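As a minimal sketch, a virtual environment can be created and activated with Python's built-in venv module (the environment name venv is arbitrary):
$ python -m venv venv
$ source venv/bin/activate   # macOS/Linux
$ venv\Scripts\activate      # Windows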
Once activated, the terminal prompt changes to indicate the active virtual environ-
ment, ensuring that any installed packages are specific to the project tied to that
environment.
Understanding and leveraging virtual environments is a best practice in Python
development, especially in the context of ML projects, where dependencies can be
project-specific and evolve over time.
1.3.3 Essential Python Libraries for ML
Several essential Python libraries play a crucial role in the advancement of ML by offer-
ing a range of useful tools for tasks such as data manipulation, statistical analysis, and
the construction of ML models. The subsequent section delves into these indis-
pensable libraries.
NumPy
Description: NumPy represents an essential library for performing numerical
computations using the Python programming language. It offers comprehensive
assistance for handling extensive, multidimensional arrays and matrices, while
also offering efficient mathematical operations to manipulate these arrays
effectively.
Importance: NumPy serves as the foundational framework for the manipulation
of data and the execution of numerical operations in the field of ML. It constitutes
the fundamental building block for numerous other libraries within the wider
ecosystem.
Pandas
Description: The Pandas library is a flexible tool for manipulating data, provid-
ing various data structures, such as DataFrames, which are particularly well-
suited for managing structured data. It offers an array of tools for data cleaning,
exploration, and analysis.
Importance: Pandas streamlines the process of data preprocessing, thereby facil-
itating the optimization of data for ML models. Its proficiency lies in its capacity
to effectively handle tabular and time-series data.
Matplotlib
Description: Matplotlib is an extensive plotting library for generating Python-
based visualizations, encompassing 2D static, animated, and interactive represen-
tations. It offers a versatile framework to construct diverse plots and charts.
Importance: Visualization is crucial for understanding data patterns. Matplotlib
facilitates the creation of informative plots, aiding in data exploration and
communication.
Scikit-learn
Description: Scikit-learn, an enduring ML library, furnishes uncomplicated and
effective instruments to extract valuable information and scrutinize data. The li-
brary encompasses a diverse range of algorithms for classification, regression,
clustering, and beyond.
Importance: Scikit-learn serves as a primary resource for deploying ML models.
With its standardized application programming interface (API) and comprehensive
documentation, it provides accessibility to individuals at all levels of expertise.
TensorFlow
Description: TensorFlow, an open-source ML library, was created by Google and
is capable of supporting both deep learning and traditional ML. This library sim-
plifies the process of developing and training intricate neural network models.
PyTorch
Description: PyTorch, renowned for its adaptability and user-friendly interface,
is a dynamic and open-source deep learning framework. It offers a dynamic
computational graph, rendering it well-suited for research and experimentation.
Importance: PyTorch is widely embraced by researchers due to its user-friendly
architecture and flexible computational abilities, rendering it extensively em-
ployed within both academic and industrial domains to construct state-of-the-art
deep learning models.
Keras
Description: Keras, a Python-based high-level neural networks API, is designed
to operate on TensorFlow, Theano, or Microsoft Cognitive Toolkit. It offers a user-
friendly platform for constructing and exploring neural networks.
Importance: Keras simplifies the development of neural network architectures
and is often the choice for quick prototyping and experimentation. It abstracts
low-level details, allowing developers to focus on model design.
These essential Python libraries collectively form a robust ecosystem for ML develop-
ment. Understanding their functionalities and integrating them into projects enables
efficient data handling, model building, and visualization.
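As an illustrative sketch of how these libraries work together, the following example loads a small dataset bundled with Scikit-learn, inspects it with Pandas, and fits a simple model:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small bundled dataset and wrap the features in a Pandas DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.describe())  # quick statistical summary of the features

# Split the data, fit a model, and evaluate it on held-out samples
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))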
1.3.4 Using Package Managers: Pip and Conda
Package managers are essential tools for managing Python libraries and dependen-
cies. Two widely used package managers in the Python ecosystem are pip and conda.
Here’s an overview of how to use them.
pip
Description: pip serves as the primary package manager for the Python program-
ming language. It streamlines the procedure of installation, enhancement, and ad-
ministration of Python packages.
Installation: If one is utilizing Python 3.4 or a more recent version, it is probable
that the pip package manager is already present. To enhance the pip package
manager to the most recent iteration, execute the following command:
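$ python -m pip install --upgrade pip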
conda
Description: conda is a cross-platform package manager and environment man-
agement system. It is particularly powerful for managing packages with complex
dependencies.
Installation: Install conda by downloading and installing Anaconda or Miniconda.
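As a brief illustrative sketch, a typical conda workflow creates an isolated environment and installs packages into it (the environment name ml_env is arbitrary):
$ conda create --name ml_env python=3.10
$ conda activate ml_env
(ml_env) $ conda install numpy pandas scikit-learn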
Using pip and conda appropriately is crucial for managing dependencies in your Py-
thon environment. Choose the one that best fits your project requirements and eco-
system compatibility. It’s common to use pip for general Python packages and conda
for packages with non-Python dependencies or in data science environments.
1.3.5 Setting Up Jupyter Notebook
The Jupyter Notebook, which is extensively utilized for interactive data analysis, ex-
ploration, and ML development, possesses significant capabilities. A comprehensive
tutorial outlining the process of setting up Jupyter Notebook in your Python environ-
ment is presented here.
Installation
– Ensure that your Python environment is activated (either globally or within a vir-
tual environment).
– Use pip to install Jupyter Notebook:
$ pip install jupyter
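– Launch the notebook server from your project directory:
$ jupyter notebook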
This action will initiate the launch of a fresh tab in the web browser, thereby reveal-
ing the Jupyter Notebook dashboard.
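To make the project's virtual environment selectable as a kernel, one common approach is to register it with ipykernel; the display name "My Environment" below is only an example:
(venv) $ pip install ipykernel
(venv) $ python -m ipykernel install --user --name venv --display-name "My Environment"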
Now, in your Jupyter Notebook, you can choose “My Environment” as a kernel.
Exporting Notebooks
– You can export your Jupyter Notebooks to various formats, including HTML, PDF,
and Markdown. Use the “File” menu to access the “Download as” option and
choose your preferred format.
Setting up Jupyter Notebook provides an interactive, visual environment for building
ML models, illustrating data, and documenting your work. It integrates seamlessly
with the Python ecosystem, making it a versatile tool for data scientists and
programmers alike.
1.3.6 Best Practices for Managing ML Projects in Python
Managing ML projects in Python involves more than just writing code. Adopting best
practices ensures project organization, reproducibility, and collaboration. Here’s a de-
tailed guide.
Project Structure
– Organize your project with a clear directory structure. Commonly used structures
include separating data, code, and documentation into distinct directories.
my_project/
├── data/
├── src/
│ ├── __init__.py
│ ├── data_processing.py
│ └── model_training.py
├── notebooks/
├── README.md
└── requirements.txt
Version Control
– Use Git to track changes and collaborate with others. Initialize a repository at the
project root and commit the initial state:
$ git init
$ git add .
$ git commit -m "Initial commit"
Virtual Environments
– Always use virtual environments to isolate project dependencies. Include a re-
quirements.txt or environment.yml file to document and recreate the
environment.
$ python -m venv venv
$ source venv/bin/activate
(venv) $ pip install -r requirements.txt
Documentation
– Ensure thorough documentation is maintained. Clarify the functionality of the
code through the utilization of README files, docstrings, and comments. Provide
instructions on configuring the environment and executing the project.
Automated Testing
– Implement unit tests to ensure code correctness. Tools like pytest can be used for
automated testing. Run tests regularly to catch potential issues early.
(venv) $ pip install pytest
(venv) $ pytest tests/
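As an illustrative sketch, a test file under tests/ might look like the following; the normalize function and its module are hypothetical stand-ins for your own code:
# tests/test_data_processing.py
from src.data_processing import normalize  # hypothetical helper


def test_normalize_scales_values_to_unit_range():
    result = normalize([2, 4, 6])
    assert min(result) == 0.0
    assert max(result) == 1.0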
Continuous Integration
– Configure a CI pipeline (for example, with GitHub Actions) so that dependencies are
installed and the test suite runs automatically on every push:
- name: Install dependencies
  run: |
    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
- name: Run tests
  run: |
    pytest tests/
Reproducibility
– Document the steps to reproduce your experiments. Include information on data
sources, preprocessing steps, and model hyperparameters. This ensures that
others can reproduce your results.
Code Reviews
– Incorporate code review practices. Peer reviews help catch bugs, improve code
quality, and ensure that the project adheres to coding standards.
Environment Variables
– Employ environment variables to safeguard sensitive data, such as API keys or
database credentials. It is inadvisable to embed these values directly within
your code.
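A minimal sketch of reading a credential from the environment instead of hard-coding it (the variable name DB_PASSWORD is illustrative):
import os

# Read the secret from the environment; fail loudly if it is missing
db_password = os.environ.get("DB_PASSWORD")
if db_password is None:
    raise RuntimeError("DB_PASSWORD is not set")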
Collaboration Platforms
– Leverage collaboration platforms like Jupyter Notebooks or Google Colab for in-
teractive development and sharing insights.
By incorporating these best practices, you establish a solid foundation for managing
ML projects in Python. This promotes collaboration and maintainability, and ensures
the reproducibility of your work.
Summary
– Utilize pip and conda for managing Python libraries and dependencies based on
project requirements.
– Set up Jupyter Notebook for interactive data analysis and ML development, allow-
ing seamless integration with Python libraries.
Exercise (MCQs)
3. Which library is known for its imperative programming style in deep learn-
ing?
a) Scikit-learn b) TensorFlow c) PyTorch d) Keras
Answers
1. b) AI subset
2. c) Labeled dataset
3. c) PyTorch
4. c) Isolating project dependencies
5. d) NumPy
6. b) Package management and environment management
7. c) Git
8. c) Interactive data analysis and development
9. c) Travis CI
10. b) To capture errors and important information
Answers
1. explicitly
2. labeled
3. unlabeled
4. environment
5. complex
6. versatility
7. machine learning
8. project dependencies
9. pip
10. ecosystem
Descriptive Questions
5. Elaborate on the essential Python libraries for machine learning, such as NumPy,
Pandas, Matplotlib, Scikit-learn, TensorFlow, PyTorch, and Keras. Describe the
role and importance of each library in the machine learning ecosystem.
6. Compare and contrast the use of pip and conda as package managers in Python,
emphasizing their features and best use cases.
7. Provide a step-by-step guide on setting up Jupyter Notebook in a Python environ-
ment, including installation, starting the notebook, creating a new notebook, and
installing additional kernels.
8. Discuss the best practices for managing machine learning projects in Python, cov-
ering aspects like project structure, version control with Git, documentation, auto-
mated testing, and continuous integration.
9. Explain the importance of proper logging and monitoring in machine learning
projects, emphasizing their role in capturing information and errors, especially
for ML models’ performance and drift over time.
10. How can reproducibility be ensured in machine learning experiments? Discuss
the steps to document and reproduce experiments, including details on data sour-
ces, preprocessing, and model hyperparameters.
Chapter 2
Basics of Python Programming
2.1 Why Python?
2.1.1 Drawbacks of Python
All programming languages possess both advantages and disadvantages. Python, too,
exhibits certain disadvantages, some of which are outlined below.
– In comparison to compiled languages like C, C++, and Java, Python’s program
execution is considerably slower. Java, for instance, gains much of its speed from
the Java Virtual Machine (JVM) and its Just-In-Time (JIT) compiler.
– The generation of intricate graphics places a heavy computational burden, result-
ing in a degradation of graphics quality.
– In the absence of Cython, code execution suffers from sluggishness. Cython allows
for code compilation at the C level, leveraging C compiler optimizations.
2.1.2 History of Python
The Python language, created by Guido van Rossum and first released in 1991, is not
named after the type of snake, but rather after the British comedy group Monty
Python and their show “Monty Python’s Flying Circus.” Guido van Rossum, being a big
fan of the group and their quirky humor, named the language in their honor. In 2000,
nearly a decade after that initial release, Python 2.0 arrived with new features such as
list comprehensions, cycle-detecting garbage collection, and Unicode support. Python
programs often pay tribute to the group by incorporating their jokes and famous
quotes into the code.
Python is available in two major versions: Python 3.x and Python 2.x. Python 2.x,
which is considered a legacy version, was supported only until 2020, while Python 3.x
is the actively developed and popular version. Some features from Python 3 can
be imported into Python 2.x using the “__future__” module. Python 3.0, released in
2008, was a major release that lacked backward compatibility, meaning that code
written in Python 2.x could not run on the Python 3.x series.
However, because a large amount of code had already been written in the Python
2.x series, Python 2.7 was maintained for many years as a bridge between the two.
Initially, support for Python 2.7 was set to end in 2015, but it was extended to 2020.
Guido van Rossum led the Python project until July 2018, when he stepped down and
the role passed to a five-person steering council, which is now responsible for releas-
ing future versions of Python.
2.1.3 Major Features of Python
– Python code is easy to read, comprehend, and learn because code blocks are
delimited by indentation rather than curly braces.
– Its English-like syntax makes writing Python programs significantly easier.
– Python is an interpreted language, which shortens the edit-run-debug cycle and
greatly reduces the time required to develop applications.
– Procedural, object-oriented, and functional programming are all supported; in
addition, the threading module allows the use of multithreading.
– Python template engines make it straightforward to generate dynamic pages for
clients.
– Interactive language: Python code can be run directly in the interactive interpreter
without a separate compilation step into an executable, which allows code to be
modified and tested in real time.
– The Kivy framework allows Python applications to run on Android and iOS.
– The same code can be executed on all available platforms due to Python’s platform
independence.
2.1.4 Market Demand
There are numerous programming languages currently utilized for application develop-
ment in the market. However, I would like to present some facts sourced from various
internet platforms. These facts encompass primary introductory languages, sector-wise
demand, salary considerations, and market share.
Fig. 2.1 depicts the global market demand trends for various programming lan-
guages over the last decade. The graph illustrates fluctuations in the demand for
languages such as Python, Java, C++, and JavaScript across different industries
and regions. Fig. 2.2 presents a comparative analysis of average annual salaries
earned by professionals with these programming language skills. Data is collected
from broad surveys that are conducted across major tech hubs worldwide to provide
insight into remuneration levels for each language. Furthermore, Fig. 2.3 demon-
strates the relative popularity of these programming languages as measured by online
search trends, enrollment in coding courses and adoption rates in software develop-
ment projects among others. This popularity index gives a complete view of the dy-
namic landscape of programming languages which helps in identifying new trends
and possible future shifts in demand.
[Figure: programming languages most commonly used in introductory courses – Python, Java, MATLAB, C, C++, Scheme, and Scratch.]
[Figure: ranking of average annual salaries by programming language skill – Swift (about $115,000) and Python (about $107,000) lead, followed by Ruby, C++, Java, JavaScript, C, SQL, and PHP (about $89,000).]
2.1.5 Why Python in Mobile App Development?
In today’s world, mobile devices have become an integral part of our daily lives, mak-
ing it nearly impossible for individuals to function without their smartphones. These
devices have greatly simplified our lives. Recognizing this, many software companies
have shifted their focus to mobile app development. However, developing mobile
apps presents a number of challenges due to the existence of various mobile phone
platforms such as Android, iOS, Windows, etc., each with its own unique software re-
quirements. Consequently, programmers are required to write code natively for each
platform, a time-consuming task. To mitigate this issue, a cross-platform approach is
recommended. Python, with its ease of use and extensive library support, simplifies
the process of app development.
The importance of Python can be seen in the figure below, which highlights the
market demand in various sectors. Python is renowned for its simplicity and versatil-
ity, as it can be learned and utilized on multiple platforms. It offers robust integration
capabilities with various technologies, leading to increased programming productivity
throughout the development life cycle. Python is particularly well-suited for large and
complex projects.
[Figure: market demand for Python by sector – Finance is by far the largest (roughly 23%), followed by Banking, Games, Front Office, Telecoms, Electronics, Investment, Marketing, Manufacturing, and Retail.]
2.2 Python Syntax and Structure
Python syntax and structure encompass the rules and organization principles that
govern how Python code is written and structured.
Indentation
Python employs indentation as a means to establish code blocks. In contrast to numer-
ous other programming languages that employ braces {} or keywords such as begin
and end to demarcate code blocks, Python relies on uniform indentation. This particu-
lar characteristic distinguishes Python and is indispensable for the legibility and orga-
nization of the code.
def welcome(name):
    if name == "Ritesh":
        print("Hello, Ritesh!")
    else:
        print("Hello, stranger!")
The example provided above illustrates how the scope of the if-else block is deter-
mined by the indentation, specifically the whitespace that precedes the print state-
ments and else statement. Typically, this indentation consists of four spaces, although
it is also possible to use a tab. It is crucial to maintain consistency in the chosen inden-
tation style across the entire codebase.
Whitespace
Whitespace, including spaces and tabs, is used for separation and clarity in Python
code. However, inconsistent indentation can lead to syntax errors, and excessive
whitespace within a line makes code harder to read.
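The kind of illustration the next paragraph refers to might look like this sketch: the first assignment uses whitespace conventionally, while the second is syntactically valid but harder to read.
price, quantity = 4.5, 3
total = price * quantity            # conventional spacing
total    =    price  *     quantity # excessive spaces – legal, but harder to read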
In the illustration, the initial utilization of whitespace is evident and augments legibil-
ity. The subsequent utilization, characterized by an excessive number of spaces, is
technically accurate but may impede the comprehension of the code. It is generally
advisable to adhere to PEP 8, the Python code style guide, which furnishes guidelines
for the arrangement of code, encompassing the utilization of whitespace.
Comments
Comments in Python are used to explain code and make it more understandable.
They are not executed and are preceded by the # symbol.
Comments are essential for documenting code, explaining complex parts, or leaving
notes for other developers. However, it’s crucial not to overuse comments or state the
obvious, as the code should be self-explanatory whenever possible.
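A brief sketch of comments in practice (the tax-calculation example is illustrative):
# Calculate the total price including an 18% tax rate
TAX_RATE = 0.18  # an inline comment explaining the constant
price = 100
total = price * (1 + TAX_RATE)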
Documentation
Documentation refers to more extensive explanations of code, often provided in doc-
strings. Docstrings are triple-quoted strings that document a module, function, class,
or method.
def multiply(a, b):
    """
    Multiply two numbers and return the result.

    Parameters:
    - a (int): The first number.
    - b (int): The second number.

    Returns:
    int: The result of multiplying a and b.
    """
    return a * b
The docstring in this given example offers comprehensive details regarding the func-
tion, which encompasses its objective, parameters, and the value it returns. The signif-
icance of having appropriate documentation cannot be overstated, as it plays a vital
role in aiding fellow developers in comprehending the utilization of your code, while
also facilitating the seamless operation of tools such as automatic documentation
generators.
2.3 Data Types and Variables
Python is a language that is dynamically typed, which implies that there is no require-
ment to explicitly declare the data type of a variable. Nevertheless, comprehending
data types is of utmost importance for proficient programming.
Primitive Data Types
Primitive data types serve as the fundamental constituents for constructing intricate
data structures. Within the domain of Python, prevalent primitive data types encompass:
int (Integer): Represents whole numbers.
float (Float): Represents decimal numbers.
str (String): Represents textual data.
bool (Boolean): Represents True or False values.
NoneType: Represents the absence of a value.
# Integer
age = 25
# Float
height = 5.8
# String
name = "John"
# Boolean
is_student = True
# NoneType
no_value = None
Understanding these primitive data types is foundational for working with variables
in Python.
Lists
In the realm of the Python programming language, a list proves to be a highly adapt-
able and dynamic data structure that assumes a critical role in the arrangement and
manipulation of groups of elements. Lists possess the ability to be modified, follow a
specific order, and accommodate elements of diverse data types. This particular chap-
ter delves into the complex nuances of lists, elucidating their formation, manipula-
tion, and assorted operations through extensive illustrations.
Creating Lists
Defining a sequence of elements enclosed within square brackets [] is a fundamental
operation in Python, known as creating lists. Lists, being mutable, have the ability to
store elements of different data types and allow for modification of their content after
their initial creation.
– Empty list:
An empty list is created without any elements, useful for dynamic population.
empty_list = []
word_list = list("Python")
Nested Lists
Lists can be nested, allowing the creation of multidimensional structures.
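For instance, a small matrix can be represented as a list of lists (the variable name matrix is illustrative):
matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
print(matrix[1][2]) # Output: 6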
Accessing Elements
Retrieving particular values from Python lists involves the act of accessing elements
based on their respective index. The indexing of lists in Python follows a zero-based
2.3 Data Types and Variables 41
approach, whereby the initial element possesses an index of 0, the subsequent ele-
ment possesses an index of 1, and so forth.
– Basic indexing:
Index notation is used to access elements within a list, wherein the indexing begins
from 0.
fruits = ["apple", "banana", "orange"]
first_fruit = fruits[0]
print(first_fruit) # Output: apple
– Negative indexing:
Negative indexing is a functionality in lists that enables us to retrieve elements from
the list’s end, which offers a practical approach to navigating elements in a backward
manner. The final element is assigned an index of -1, while the penultimate element
is assigned an index of -2, and so forth.
List Slicing
List slicing enables the creation of a novel list through the extraction of a subset of
elements from a preexisting list. The slicing syntax, list[start:stop:step], entails the
utilization of the start index as the point of origin, the stop index as the point of termi-
nation, and the step size as the interval between elements.
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Basic slicing
subset1 = numbers[2:6] # Elements from index 2 to 5
print(subset1) # Output: [3, 4, 5, 6]
# Slicing with step
subset2 = numbers[1:8:2] # Elements from index 1 to 7 with a step of 2
print(subset2) # Output: [2, 4, 6, 8]
# Slicing with negative indices
subset3 = numbers[-4:-1]
print(subset3) # Output: [6, 7, 8]
In the above example, subset1 extracts elements from index 2 to 5, subset2 includes
elements from index 1 to 7 with a step of 2, and subset3 selects elements using nega-
tive indices.
Modifying Elements
Modifying elements in Python lists is a crucial aspect of working with mutable data
structures. Lists allow us to change the values of existing elements at specific indices.
– Basic modification:
To alter an element within a list, one employs its index and designates a fresh value
to said index.
fruits = ["apple", "banana", "orange"]
fruits[1] = "kiwi" # fruits is now ['apple', 'kiwi', 'orange']
– Modifying a slice:
Multiple elements can be replaced at once by assigning a new list to a slice.
numbers = [1, 2, 3, 4, 5]
numbers[1:4] = [10, 20, 30] # numbers is now [1, 10, 20, 30, 5]
Adding Elements
Adding elements to Python lists is a fundamental operation that allows you to dynam-
ically expand the size of the list. There are several methods to add elements to a list,
each serving different purposes.
– Using extend():
fruits = ["apple", "banana"]
new_fruits = ["orange", "grape"]
fruits.extend(new_fruits)
In this instance, the elements derived from the new_fruits list are appended to the
concluding segment of the fruits list.
– Using the += operator:
fruits += ["grape"]
In this particular context, the element denoted as “grape” is appended to the fruits list
through the utilization of the += operator.
Removing Elements
Removing elements from Python lists is a fundamental operation for the purpose of
preserving and adjusting lists. Diverse methods can be employed based on the partic-
ular need.
– Using remove():
fruits = ["apple", "banana", "orange"]
fruits.remove("banana")
In this particular instance, the initial instance of the term “banana” is eliminated
from the enumeration of fruits.
– Using pop():
fruits = ["apple", "banana", "orange"]
removed_fruit = fruits.pop(1)
Here, the element at index 1 (“banana”) is removed and assigned to the variable
removed_fruit.
– Using del:
fruits = ["apple", "banana", "orange"]
del fruits[1]
In this example, the element at index 1 (“banana”) is removed using the del statement.
– Using clear():
fruits.clear()
print(fruits) # Output: []
Length of a List
The len() function enables the calculation of the quantity of elements present in a
given list.
numbers = [1, 2, 3, 4, 5]
num_elements = len(numbers)
print(num_elements) # Output: 5
Nesting Lists
Creating lists within lists in the Python programming language entails the creation of
a list where the constituent elements are also lists. This particular technique facili-
tates the generation of intricate data structures, including but not limited to matrices
or lists that contain other lists. It is important to note that each individual element
within the outer list has the potential to be a list in its own right.
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
element = matrix[0][1]
print(element) # Output: 2
Tuples
A Python tuple is a compilation of ordered and unchangeable components. Following
its creation, the components of a tuple remain unmodifiable, unappendable, and un-
removable. Tuples are established through the use of parentheses () and have the abil-
ity to hold components of varying data types. Presented here is a comprehensive elu-
cidation of tuple operations, accompanied by examples and intricate particulars.
Creating Tuples
Creating tuples in Python is a simple procedure that entails the utilization of parentheses
(). Tuples have the capability to incorporate elements of diverse data types, and once they
are formed, their values cannot be altered. Various approaches exist to generate tuples:
– Using parentheses:
The creation of a tuple is typically accomplished through the act of enclosing elements
within parentheses, which is considered to be the most prevalent method.
fruits_tuple = ("apple", "banana", "orange")
– Without parentheses:
Tuples can also be created without explicit parentheses. The commas alone are suffi-
cient to define a tuple.
numbers_tuple = 1, 2, 3
– Empty tuple:
An empty tuple is created with a pair of parentheses containing no elements.
empty_tuple = ()
– Single-element tuple:
A tuple containing only one element necessitates the inclusion of a trailing comma to
differentiate it from a typical value enclosed in parentheses.
single_element_tuple = ("apple",)
Accessing Elements
Accessing elements in a tuple is similar to accessing elements in other sequences in
Python, such as lists or strings. Tuples use zero-based indexing, and individual ele-
ments can be retrieved using square brackets [].
– Basic indexing:
To access an element in a tuple, specify its index within square brackets.
# Output: "apple"
– Negative indexing:
Negative indexing enables the retrieval of elements from the termination of the tuple.
last_fruit = fruits_tuple[-1]
# Output: "orange"
Here, last_fruit is assigned the value of the last element using negative indexing.
– Slicing:
Tuple slicing allows you to create a new tuple by extracting a subset of elements.
numbers_tuple = (1, 2, 3, 4, 5)
subset = numbers_tuple[1:4]
# Output: (2, 3, 4)
– Omitting indices:
If the start or stop index is not provided when slicing a tuple, the default behavior is
to use the beginning or end of the tuple, respectively.
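For example, continuing with numbers_tuple from above:
first_three = numbers_tuple[:3] # Output: (1, 2, 3)
last_two = numbers_tuple[3:]    # Output: (4, 5)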
Immutable Nature
The characteristic that sets tuples in Python apart from other data structures, like
lists, is their immutable nature. Immutability implies that, once a tuple is created, its
elements cannot be altered, adjusted, appended, or erased.
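A short sketch of what happens when a modification is attempted:
fruits_tuple = ("apple", "banana", "orange")
fruits_tuple[0] = "kiwi" # raises TypeError: 'tuple' object does not support item assignment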
Tuple Unpacking
Tuple unpacking is a functionality present in the Python programming language,
which facilitates the assignment of values from a tuple to separate variables within a
single line of code. This characteristic offers a practical and efficient approach to ex-
tracting elements from a tuple and assigning them to variables, thereby eliminating
the need for multiple assignment statements.
# Creating a tuple
coordinates = (3, 7)
# Tuple unpacking
x, y = coordinates
In this illustration, the values of the coordinates tuple are decomposed into the varia-
bles x and y. It is essential that the count of variables on the left side of the assign-
ment corresponds to the count of elements in the tuple.
– Unpacking in functions:
Tuple unpacking is often used in function returns to conveniently capture multiple
values.
def get_coordinates():
    return 5, 10
x, y = get_coordinates()
Here, the function get_coordinates returns a tuple, and the values are unpacked into
variables x and y when calling the function.
– Extended unpacking:
In Python, we can use the * operator to capture remaining elements when unpacking.
# Creating a tuple
numbers = (1, 2, 3, 4, 5)
first, *rest, last = numbers
# first is 1, rest is [2, 3, 4], last is 5
In this example, the *rest syntax captures all elements between the first and last ele-
ments in the tuple.
– Ignoring elements:
We can use an underscore _ to ignore specific elements during unpacking.
# Creating a tuple
point = (8, 3, 5)
x, _, z = point
# x is 8, z is 5
Here, the underscore _ is used to ignore the second element in the tuple.
Dictionaries
A Python dictionary is a nonsequential aggregation of key-value pairs, where every
key must be exclusive. In alternative programming languages, dictionaries are re-
ferred to as associative arrays or hash maps. They are established by employing curly
braces {} and encompass keys along with their corresponding values.
Creating a Dictionary
Defining a collection of key-value pairs using curly braces {} is a fundamental step in
the creation of a dictionary in Python. It is of utmost importance that each key within
the dictionary is distinct, as the keys are intrinsically linked to their corresponding
values. Dictionaries present a versatile and effective means of organizing and retriev-
ing data by means of keys.
– Basic accessing:
# Creating a dictionary (the values shown are illustrative)
student = {"name": "Ritesh", "age": 20, "grade": "A", "courses": ["Math", "Physics"]}
name = student["name"]
age = student["age"]
In this example, the values associated with the keys "name" and "age" are accessed
and assigned to variables.
student["age"] = 21
In this instance, the numerical value linked to the designated identifier “age” has
been altered from 20 to 21.
student["gender"] = "Male"
Here, a novel key-value pair consisting of the key “gender” and the value “Male” is
appended to the dictionary.
# Updating and adding several entries at once (the new values are illustrative)
student.update({"age": 22, "gender": "Male", "city": "Udaipur"})
The update() method is employed in this particular instance to alter the value linked
to the “age” key, introduce a fresh key-value pair for the attribute of “gender,” and
append yet another new key-value pair for the attribute of “city.”
del student["grade"]
In this particular instance, the dictionary’s key “grade” and its corresponding value
are removed from the dictionary by employing the del statement.
# Removing and retrieving the value associated with the "courses" key
courses = student.pop("courses")
The “courses” key and its corresponding value are extracted from the dictionary, and
the value is subsequently assigned to the variable courses.
last_item = student.popitem()
In this instance, the final key-value pair within the dictionary is eliminated and
assigned to the variable last_item.
Sets
A Python set is an unordered assemblage of distinct elements. Sets are established by
employing curly brackets {} and are advantageous for multiple operations such as
verifying membership, discovering intersections, unions, and disparities amidst sets.
Creating a Set
Defining an unordered collection of unique elements is the process of creating a set in
Python. The sets are denoted by curly braces {} and can be formed by using existing
iterables such as lists or by explicitly stating the elements.
colors_list = ["red", "green", "blue", "green", "red"]  # illustrative values
colors_set = set(colors_list)
print(colors_set) # Output (order may vary): {'red', 'green', 'blue'}
The set() constructor is employed in this instance to generate a set called colors_set
from a list known as colors_list. It should be noted that sets inherently eliminate any
duplicate elements, thereby resulting in a set that exclusively consists of distinct
elements.
# Creating a set
fruits_set = {"apple", "banana", "orange"}
fruits_set.add("kiwi")
In this particular instance, the inclusion of the element “kiwi” within the fruits_set is
achieved through the utilization of the add() method.
# Creating a set
fruits_set = {"apple", "banana", "orange"}
fruits_set.update(["grape", "pineapple"])
The fruits_set incorporates the elements “grape” and “pineapple” through the utiliza-
tion of the update() method.
# Creating a set
fruits_set = {"apple", "banana", "orange"}
# Removing an element with remove() – raises a KeyError if the element is absent
fruits_set.remove("banana")
# Creating a set
fruits_set = {"apple", "banana", "orange"}
# Discarding an element
fruits_set.discard("kiwi")
Here, the element “kiwi” is discarded from the fruits_set, but if “kiwi” were not pres-
ent, it would not raise an error.
# Creating a set
fruits_set = {"apple", "banana", "orange"}
popped_element = fruits_set.pop()
In this instance, an arbitrary element is removed from the fruits_set, and its value is
subsequently assigned to the variable popped_element.
# Creating a set
fruits_set = {"apple", "banana", "orange"}
# Removing all elements
fruits_set.clear()
print(fruits_set) # Output: set()
Set Operations
Sets in Python support various operations that enable us to perform common set-
related tasks. Below are the fundamental set operations:
– Union (|) – all elements from both sets:
set1 = {1, 2, 3}
set2 = {3, 4, 5}
union_set = set1 | set2
print(union_set) # Output: {1, 2, 3, 4, 5}
– Intersection (&) – elements common to both sets:
set1 = {1, 2, 3}
set2 = {3, 4, 5}
intersection_set = set1 & set2
print(intersection_set) # Output: {3}
– Difference (-) – elements in the first set but not in the second:
The difference between two sets comprises the elements that exist in the initial set
but do not belong to the subsequent set.
set1 = {1, 2, 3}
set2 = {3, 4, 5}
difference_set = set1 - set2
print(difference_set) # Output: {1, 2}
– Symmetric difference (^) – elements in either set, but not in both:
set1 = {1, 2, 3}
set2 = {3, 4, 5}
symmetric_difference_set = set1 ^ set2
print(symmetric_difference_set) # Output: {1, 2, 4, 5}
2.4.1 Conditional Statements
– if statement:
The if statement is employed to execute a block of code on the condition that a speci-
fied condition holds true.
Syntax
if condition:
    # code to be executed if the condition is true
x = 10
if x > 5:
    print("x is greater than 5")
The print statement will be executed solely if the condition x > 5 is verified, as illus-
trated in this example.
– if-else statement:
The if-else statement grants the capability to execute a specific block of code when a
given condition is found to be true, and alternatively, to execute a distinct block of
code when the condition is found to be false.
if condition:
    # code to be executed if the condition is true
else:
    # code to be executed if the condition is false
x = 3
if x % 2 == 0:
    print("x is even")
else:
    print("x is odd")
In this instance, the program shall output whether x is classified as even or odd con-
tingent upon the condition.
– if-elif-else statement:
The if-elif-else statement is an expansion of the if-else statement, enabling the exami-
nation of multiple conditions in a sequential manner.
if condition1:
    # code to be executed if condition1 is true
elif condition2:
    # code to be executed if condition2 is true
else:
    # code to be executed if none of the conditions are true
x = 0
if x > 0:
    print("x is positive")
elif x < 0:
    print("x is negative")
else:
    print("x is zero")
2.4.2 Loops
– for loop:
The for loop is employed to iterate through a sequence, be it a list, tuple, string, or
range, or any other iterable objects. This loop carries out a set of instructions for
every element present in the sequence.
Syntax
for item in sequence:
    # code to be executed for each item
fruits = ["apple", "banana", "orange"]
for fruit in fruits:
    print(fruit)
In this example, the for loop iterates over the list of fruits, and the code inside the
loop prints each fruit.
– while loop:
The while loop persists in executing a block of code for as long as a specified condi-
tion remains true. It iterates the execution until the condition becomes false.
Syntax
while condition:
    # code to be executed as long as the condition is true
count = 0
while count < 5:
    print(count)
    count += 1
The while loop, in this particular case, outputs the count value and increases it by 1
during each iteration, provided that the count is below 5.
2.5 Functions and Modules
Functions and modules are essential principles in Python that endorse the organiza-
tion, legibility, and reusability of code. Functions encapsulate the logic of code, while
modules and packages furnish a structured approach to arranging and aggregating
correlated functionality.
Functions in Python enable us to encapsulate and reuse code, enhancing the legi-
bility and maintainability of our programs.
Modules and packages aid in the organization of code into manageable units,
thereby facilitating code reuse and maintenance.
Defining Functions
Syntax
def function_name(parameters):
    # code to be executed
    return result  # optional
def add_numbers(a, b):
    result = a + b
    return result
The add_numbers function in this illustrative case accepts two parameters, namely a
and b, performs the summation operation on them, and subsequently provides the
outcome as the returned value.
def greet(name):
    greeting_message = f"Hello, {name}!"
    return greeting_message
Here, the greet function takes a name parameter and returns a personalized greeting
message.
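# The square_number function used below is not shown in the text; a minimal sketch:
def square_number(n):
    return n ** 2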
square_result = square_number(4)
print(square_result) # Output: 16
Lambda functions, which are also referred to as “anonymous functions,” are charac-
terized by their brevity and frequent utilization in performing small, uncomplicated
tasks. These functions are established through the employment of the lambda key-
word and have the ability to accept an arbitrary quantity of arguments; however,
they are limited to solely possessing a solitary expression. The implementation of
lambda functions proves to be especially advantageous when dealing with temporary
operations that can be conveyed as arguments to higher-order functions.
Lambda functions are especially useful when a small, one-off function is needed
and defining a full function using def seems too verbose. They are commonly used
in situations where functions are treated as first-class citizens, such as when passing
functions as arguments to higher-order functions.
Syntax
add_numbers = lambda x, y: x + y
result = add_numbers(5, 3)
print(result) # Output: 8
square = lambda x: x ** 2
result = square(4)
print(result) # Output: 16
A lambda function, named square, is defined to compute the square of a given num-
ber. Subsequently, the lambda function is invoked with the input argument 4, and the
outcome is displayed.
is_even = lambda x: x % 2 == 0
result = is_even(7)
print(result) # Output: False
The fundamental concepts in Python that enhance code organization, readability, and
reusability are modules and packages. Modules enable the encapsulation of related
code into distinct files, while packages offer a means to structure these modules
within a hierarchical directory.
Modules
A Python module is a file that consists of Python definitions, statements, and func-
tions. This enables the organization of related code into distinct files, thereby enhanc-
ing the modularity and maintainability of the code. The utilization of functions,
classes, and variables defined in a module is achieved by importing it into other Py-
thon scripts.
# my_module.py
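# The module body is not shown in the text; per the description below, a minimal sketch
# defining addition and subtraction helpers:
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b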
# main_program.py
import my_module
result = my_module.add(5, 3)
print(result) # Output: 8
In this particular instance, the module, my_module, encompasses functions for addi-
tion and subtraction. The importation of this module into the main_program.py file
allows for the utilization of the add function.
Packages
A package is a method of organizing associated modules within a unified directory
structure. It serves the purpose of preventing naming conflicts among distinct mod-
ules. In essence, a package constitutes a directory that encompasses multiple Python
files, also known as “modules.” For a directory to be regarded as a package, it is im-
perative that it includes a designated file labeled __init__.py.
Package Structure
my_package/
|-- __init__.py
|-- module1.py
|-- module2.py
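A minimal sketch of module1.py, whose square function is assumed from the description below:

# my_package/module1.py
def square(n):
    return n ** 2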
# main_program.py
from my_package import module1

result = module1.square(4)
print(result) # Output: 16
In this example, my_package is a package, and module1 is a module within that pack-
age. The square function from module1 is used in main_program.py.
2.6 Working with Files
The ability to read from and write to files is a crucial aspect of data storage, retrieval,
and manipulation in any programming language. In Python, file operations are made
simple with the aid of built-in functions and methods. Opening, reading, and writing
to files allows for persistent data storage and the exchange of information between
programs.
Dealing with files in Python is essential for managing data persistence and exter-
nal data sources. It offers a means of interacting with data stored on disk, enabling
the integration of Python programs with external data and facilitating efficient data
management.
File I/O in Python pertains to the procedures encompassing the act of both retrieving
data from and storing data into files. Python furnishes an assortment of pre-existing
functions and methods for file I/O, affording you the capability to engage with files on
your operating system. The fundamental constituents of file I/O encompass the act of
initiating file access, acquiring data from files, inscribing data into files, and ulti-
mately terminating file access.
Opening a File
The Python open() function is an inherent function used to initiate the process of
opening files. This function is an essential component of file management within the
Python programming language, facilitating a multitude of file-related tasks such as
reading, writing, and appending. As a result of executing the open() function, a file
object is returned, which in turn grants access to a range of methods for the manipu-
lation of files.
Syntax
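# General form (sketch); the file name and mode string are placeholders
file_object = open("filename.txt", "r")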
Modes
– ‘r’: The file is opened for reading.
– ‘w’: The file is opened for writing. If the file already exists, it is truncated.
– ‘a’: The file is opened for writing, but new data is appended to the end if the file
already exists.
– ‘b’: ‘b’ is appended to the mode for binary files (e.g., ‘rb’ or ‘wb’).
– ‘x’: The file is opened for exclusive creation; the operation fails if the file already exists.
– ‘t’: ‘t’ is appended to the mode for text files (e.g., ‘rt’ or ‘wt’).
Writing to a File
To perform file writing operations in Python, it is necessary to first open the file using
a designated mode, subsequently write the desired data into it, and finally close the
file. In the Python programming language, there exist multiple methods that facilitate
the process of writing data to a file.
– Appending to a file:
To append new content to an existing file without replacing its current content, it is
possible to open the file in append mode (‘a’). Subsequently, the write() method will
append the new content to the conclusion of the file.
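A small sketch of writing and then appending to a file (the file name and text are placeholders):

# Writing: creates the file or truncates existing content
with open("example.txt", "w") as file:
    file.write("First line\n")

# Appending: keeps existing content and adds to the end
with open("example.txt", "a") as file:
    file.write("Appended line\n")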
2.7 Working with Libraries
import math

result = math.sqrt(25)
print(result) # Output: 5.0
The import keyword is used in this instance to import the entire math library. Subse-
quently, the square root of 25 is calculated using the math.sqrt() function.
# Example: Importing only the sqrt function from the math library
from math import sqrt
result = sqrt(25)
print(result) # Output: 5.0
In this case, only the sqrt function from the math library is imported. This approach
eliminates the need to prefix the function with the library name when used.
import pandas as pd

data = pd.read_csv("data.csv")
print(data.head())
Here, the pandas library is imported with the alias pd. This is a common convention
to simplify code and make it more readable. The pd alias is then used to call functions
from the pandas library.
from datetime import date

current_date = date.today()
print(current_date) # Output: 2023-12-12
Importing all names with the * wildcard (for example, from math import *) allows us to use
all of a module's functions and classes without explicitly prefixing them. However, this
approach is generally discouraged because it can lead to name clashes.
The Python libraries form the backbone of various Python applications, providing solu-
tions across different domains, including data science, machine learning, web develop-
ment, and more. Depending on our project requirements, we can choose the libraries
that best suit our needs.
NumPy
– Description: NumPy, which is short for numerical Python, encompasses a com-
prehensive collection of numerical operation functions specifically designed for
Python.
Main Features
– Efficient manipulation of large, homogeneous arrays can be achieved through the
use of multidimensional arrays.
– Mathematical functions: Offer an extensive assortment of mathematical functions
to facilitate array operations.
Use Cases
– Scientific computing and data analysis.
– Linear algebra operations.
– Signal processing and image analysis.
Pandas
– Description: Pandas is a library used for manipulating and analyzing data.
Main Features
– DataFrame: A two-dimensional table for data manipulation.
– Series: A one-dimensional labeled array.
– Data cleaning, merging, and reshaping tools.
Use Cases
– Data exploration and analysis.
– Data cleaning and preparation.
– Time series data analysis.
Matplotlib
– Description: Matplotlib is an esteemed library for generating visualizations, en-
compassing the realms of static, animated, and interactive visualizations.
Main Features
– Line plots, scatter plots, bar plots, etc.
– Support for LaTeX for mathematical expressions.
– Customizable plots and styles.
Use Cases
– Data visualization for analysis and presentation.
– Creating publication-quality plots.
Requests
– Description: Requests is a library designed to facilitate the execution of HTTP
requests.
Main Features
– HTTP methods (GET, POST, PUT, DELETE, etc.).
– Session handling and cookie support.
– Asynchronous requests with asyncio.
Use Cases
– Web scraping and data extraction.
– Interacting with RESTful APIs.
– Sending HTTP requests and handling responses.
Scikit-learn
– Description: Scikit-learn, a library for classical machine learning algorithms, is a
tool widely used in the field of machine learning.
Main Features
– Classification, regression, clustering, and dimensionality reduction algorithms
can be categorized as different types of computational techniques for data
analysis.
– Model selection and evaluation tools.
– Support for preprocessing and feature engineering.
Use Cases
– Building and evaluating machine learning models.
– Data analysis and exploration.
Main Features
– Define, train, and deploy neural networks.
– Support for automatic differentiation.
– Broad ecosystem for machine learning and deep learning.
Use Cases
– Deep learning model development.
– Natural language processing and computer vision tasks.
Django and Flask
– Description: Django and Flask are widely used frameworks for building web applications in Python.
Main Features
– Django: High-level framework with built-in features (admin, authentication, etc.).
– Flask: Micro-framework for lightweight and flexible applications.
Use Cases
– Developing web applications and APIs.
– Rapid development and scalability.
Beautiful Soup
– Description: Beautiful Soup is a web scraping library for pulling data out of
HTML and XML files.
Main Features
– Parses HTML and XML documents.
– Navigates the parse tree and searches for specific elements.
Use Cases
– Scraping data from websites.
– Extracting information from HTML and XML.
SQLAlchemy
– Description: SQLAlchemy is a library that serves as a toolkit for SQL operations
and also provides functionality for object-relational mapping (ORM).
Main Features
– SQL expression language and ORM.
– Connection pooling and transaction management.
Use Cases
– Database interaction in Python.
– Object-relational mapping for databases.
OpenCV
– Description: OpenCV is a computer vision library for image and video processing.
Main Features
– Image and video manipulation.
– Feature extraction and object detection.
Use Cases
– Computer vision applications.
– Image and video analysis.
2.8 Object-Oriented Programming in Python
Classes
– A class serves as a blueprint or a template for the creation of objects, specifying
their attributes and methods.
– The definition of classes in Python employs the use of the keyword “class.”
class ClassName:
def __init__(self, parameter1, parameter2, ...):
# Constructor or initializer method
# Set up instance attributes
– The class keyword is employed for the purpose of declaring a class. In the pro-
vided instance, the name of the class is denoted by ClassName.
– Constructor (__init__) method:
– The __init__ method, which is referred to as the “constructor,” is a distinctive
method that is automatically executed upon the creation of an object from
the class.
– The initialization of the object’s attributes is performed. The self parameter is
employed to refer to the instance of the class and to make references to in-
stance attributes.
– Other methods:
– Additional methods within the class are defined like regular functions, taking
self as the first parameter.
– These methods can perform various actions and access instance attributes.
Example
class Car:
def __init__(self, make, model, year):
self.make = make
self.model = model
self.year = year
def display_info(self):
return f"{self.year} {self.make} {self.model}"
The given example demonstrates the implementation of a class called Car. This class
possesses the attributes of make, model, and year, as well as a method named dis-
play_info. The initialization method, __init__, is responsible for assigning values to the
attributes of the Car object.
Objects
In Python, an object represents an exemplification of a class. A class serves as a design
that specifies characteristics and functionalities, while an object embodies a tangible
manifestation of that design. Objects play a vital role in Python since they enable us
to simulate and engage with real-world entities within our code. The establishment of
objects contributes to upholding a modular and structured framework, thereby en-
hancing the readability and manageability of our code.
– Creating an object:
To instantiate an instance of the class, one invokes the class as though it were a func-
tion. By doing so, the constructor (__init__) method is invoked, passing the specified
initial values.
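A minimal sketch of object creation (the make, model, and year values are placeholders):

car1 = Car("Toyota", "Corolla", 2020)
car2 = Car("Honda", "Civic", 2022)
print(car1.display_info()) # Output: 2020 Toyota Corolla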
Here, car1 and car2 are instances of the Car class. The __init__ method is automatically
called when creating these objects, initializing their attributes.
Example 1
class Car:
def __init__(self, make, model, year):
self.make = make
self.model = model
self.year = year
def display_info(self):
return f"{self.year} {self.make} {self.model}"
This example demonstrates the creation of a simple Car class, instantiation of objects
(car1 and car2), and accessing their attributes and methods.
Example 2
class Dog:
def __init__(self, name, age):
self.name = name
self.age = age
def bark(self):
return "Woof!"
In this particular instance, the class is denoted as “Dog,” while an object named
“my_dog” is created from the aforementioned class. The initialization method,
“__init__,” serves to establish the attributes (name and age), while the bark method
acts as a function connected to the Dog class, capable of being invoked on instances
of the class. Subsequently, the object my_dog possesses the ability to access its attrib-
utes and invoke methods that have been defined within the class.
Inheritance in Python
In the realm of object-oriented programming, the notion of inheritance holds great
significance as it permits a fresh class to acquire attributes and methods from a pre-
existing class. This pre-existing class is commonly known as the “base class” or “par-
ent class,” while the newly formed class takes on the role of the derived class or child
class. By taking advantage of inheritance, one can foster the reuse of code and the
establishment of a hierarchical structure encompassing various classes.
# Base class
class Animal:
def __init__(self, name):
self.name = name
def speak(self):
return "Some generic sound"
# Creating objects
animal = Animal("Generic Animal")
dog = Dog("Buddy")
In this particular instance, the base class Animal possesses a method known as speak.
The Dog class, derived from Animal, supersedes the speak method. Both classes,
namely Animal and Dog, can be utilized, and the speak method exhibits distinct be-
havior contingent upon the class.
Types of Inheritance
1. Single inheritance: A class acquires the attributes and behaviors of a single base
class. For example, the derived class is defined as class DerivedClass(BaseClass).
2. Multiple inheritance: A class obtains the attributes and behaviors of more than
one base class. For example, the derived class is defined as class DerivedClass
(BaseClass1, BaseClass2, . . .).
3. Multilevel inheritance: A class gains the attributes and behaviors of another
class, which itself inherits from a base class. For example, the derived class is de-
fined as class DerivedClass(IntermediateClass).
# Base class
class Animal:
def __init__(self, name):
self.name = name
def speak(self):
return "Some generic sound"
# Creating objects
animal = Animal("Generic Animal")
dog = Dog("Buddy")
In this particular instance, the Animal class serves as the fundamental class, while the
Dog class functions as the subclass. The Dog class acquires the speak method from its
parent class Animal.
# Base classes
class Engine:
def start(self):
return "Engine started"
class Electric:
def charge(self):
return "Charging electric power"
# Creating an object
hybrid_car = HybridCar()
# Base class
class Animal:
def speak(self):
return "Some generic sound"
# Creating an object
poodle = Poodle()
Polymorphism in Python
Polymorphism enables the utilization of objects from diverse classes as objects of a
shared base class. It permits the representation of various types of objects through a
singular interface (method). Polymorphism is accomplished through method overrid-
ing, wherein the child class possesses a method with the identical name as the method
in the parent class.
class Shape:
def area(self):
return "Some generic area calculation"
class Square(Shape):
def __init__(self, side):
self.side = side
    def area(self):
        return self.side ** 2

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.14 * self.radius ** 2
# Using polymorphism
shapes = [Square(5), Circle(3)]
for shape in shapes:
print(f"Area: {shape.area()}")
# Output
# Area: 25
# Area: 28.26
In this particular instance, the class Shape serves as the fundamental class housing a
method for computing area. The classes Square and Circle are subclasses of Shape
and possess an overridden version of the area method. The list shapes encompasses
objects from both classes, employing the identical method (area) to determine the
area, thereby exemplifying the concept of polymorphism.
Summary
Exercise (MCQs)
11. Which type of inheritance involves a class inheriting from more than one
base class?
A. Single inheritance
B. Multiple inheritance
C. Multilevel inheritance
D. Hierarchical inheritance
Answers
1. C
2. B
3. A
4. C
5. A
6. C
7. B
8. C
9. A
10. B
11. B
Answers
1. indentation
2. #
3. booleans
4. primitive
5. mutable, immutable
6. key-value, unique
7. index
8. blueprint
9. objects
10. another
11. multiple
Descriptive Questions
1. Explain the significance of proper indentation in Python syntax and how it influ-
ences the structure of the code.
2. Provide an overview of primitive data types in Python and explain the purpose of
variables in the context of Python programming.
3. Discuss the differences between lists and tuples in Python, emphasizing their mu-
tability or immutability.
4. Explain the key characteristics of dictionaries and sets in Python, highlighting
their use cases and unique properties.
5. Define the concepts of classes and objects in Python’s object-oriented program-
ming paradigm, providing examples for better understanding.
6. Elaborate on the concept of inheritance in Python, discussing how it enables code
reuse. Provide an example to illustrate polymorphism.
7. Describe the various types of inheritance in Python, including single, multiple,
multilevel, hierarchical, and hybrid inheritance.
8. Walk through the polymorphism example involving Shape, Square, and Circle
classes. Explain how the common interface (area method) is utilized.
9. Explore the control statements in Python, including if, for, and while statements.
Provide examples to demonstrate their usage.
10. Discuss the role of functions in Python and explain the process of defining func-
tions. Additionally, explain the concept of modules and how they enhance code
organization.
11. Provide an overview of working with files in Python, explaining the open() func-
tion and file input/output operations.
12. Explain the importance of libraries in Python programming and discuss the pro-
cess of importing libraries. Provide examples of popular Python libraries.
Chapter 3
Data Preprocessing in Python
3.1 Numerical and Scientific Computing Using NumPy and SciPy
In this segment, we delve into the fundamental elements of numerical and scientific
computation employing two potent Python libraries – NumPy and SciPy. These libraries
assume a pivotal function in scientific computation, furnishing a basis for manipulating
extensive, multidimensional arrays and executing diverse mathematical computations.
– Introduction to NumPy
NumPy, which is an abbreviation for Numerical Python, constitutes a robust open-
source Python library that furnishes assistance for extensive, multidimensional ar-
rays and matrices. Furthermore, it encompasses a compilation of sophisticated mathe-
matical functions to carry out operations on these arrays. Notably, it emerges as an
indispensable library for scientific computation in Python and serves as the underpin-
ning for diverse other libraries and tools in the realm of data science and machine
learning.
Here are key aspects of NumPy that contribute to its significance:
1. Arrays:
Multidimensional data structures: NumPy presents the ndarray, an object that
is capable of representing arrays with multiple dimensions such as vectors, matri-
ces, and arrays with higher dimensions. This feature enables the storage and ma-
nipulation of extensive datasets in a manner that is efficient.
Element-wise operations: NumPy facilitates element-wise computations, thereby
enabling mathematical operations to be executed on complete arrays without ne-
cessitating explicit iterations. This results in the production of succinct and effec-
tive code.
2. Mathematical operations:
Universal functions (ufuncs): NumPy offers an extensive assortment of univer-
sal functions (ufuncs) that perform element-wise computations on arrays. These
ufuncs encompass a broad spectrum of mathematical operations, encompassing
fundamental arithmetic, trigonometry, logarithms, and various others.
3. Performance:
Efficient data storage: NumPy's core routines are implemented in C and For-
tran, thereby facilitating efficient storage and manipulation of
numerical data. This capability is of utmost significance when managing exten-
sive datasets and executing intricate computations.
Broadcasting: The broadcasting mechanism of NumPy facilitates the execution
of operations on arrays with different shapes and sizes. This feature simplifies
the process of performing operations on arrays with varying dimensions.
5. Applications:
Scientific computing: NumPy finds extensive applications in the field of scien-
tific computation for endeavors encompassing data analysis, simulations, and the
resolution of mathematical quandaries.
Data science and machine learning: Many libraries used in the fields of data
science and machine learning, such as Pandas and scikit-learn, heavily depend on
NumPy arrays to represent and manipulate data.
Array Creation
Using np.array()
The np.array() function serves as a fundamental means to generate arrays within the
NumPy library. This function accepts a sequence-like object (such as a list or a tuple)
as its input and subsequently yields a NumPy array.
import numpy as np
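# Creating a 2D array from a nested list (illustrative values)
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2d)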
A 2D array (arr2d) is generated from a nested Python list in this instance. The outer
list symbolizes rows, while the inner lists symbolize the elements of each row. Conse-
quently, the resulting structure is a 2D array.
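A small sketch of the type coercion described next (the values are illustrative):

mixed = np.array([1, 2, "three"])
print(mixed.dtype) # a Unicode string dtype, e.g. <U21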
NumPy arrays possess homogeneity, denoting the preference for elements of an iden-
tical data type. Nevertheless, NumPy undertakes the endeavor to convert elements
into a shared data type. In the given illustration, the array includes integers along
with a string, hence NumPy alters all the elements to a common data type, specifically
Unicode.
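# Explicitly specifying the data type (illustrative values)
arr_float = np.array([1, 2, 3], dtype=float)
print(arr_float) # Output: [1. 2. 3.]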
We have the option to explicitly indicate the data type of the array by utilizing the
dtype parameter. In the given instance, we generate a one-dimensional array consist-
ing of integers and specifically designate it as a float.
Using np.zeros()
The np.zeros() function is employed to generate an array that contains solely zeros.
As an argument, it accepts the shape of the array.
import numpy as np
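# Creating a 2D array of zeros with three rows and four columns
zeros_2d = np.zeros((3, 4))
print(zeros_2d)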
Here, a 2D array (zeros_2d) with three rows and four columns, all initialized to zero,
is created using np.zeros().
Using np.ones()
The np.ones() function is similar to np.zeros(), but it creates an array filled with ones.
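A minimal example matching the description below:

import numpy as np

ones_1d = np.ones(6)
print(ones_1d) # Output: [1. 1. 1. 1. 1. 1.]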
This example demonstrates the creation of a 1D array (ones_1d) with six elements, all
initialized to one, using np.ones().
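# Creating a 1D array of zeros with a specified data type (float)
zeros_float = np.zeros(5, dtype=float)
print("1D Array of Zeros with Specified Data Type (float):")
print(zeros_float)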
In this example, a 1D array (zeros_float) with a specified data type (float) is created
using np.zeros().
# Creating a 2D array filled with ones with a specified data type (int)
ones_int = np.ones((3, 2), dtype=int)
print("\n2D Array of Ones with Specified Data Type (int):")
print(ones_int)
Here, a 2D array (ones_int) with a specified data type (int) is created using np.ones().
Using np.arange()
The np.arange() function is utilized to generate an array that contains values that are
evenly spaced within a specified range. It bears resemblance to the pre-existing range()
function in Python; however, it yields a NumPy array.
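A few illustrative calls matching the descriptions that follow:

import numpy as np

arr1d = np.arange(10)              # values 0 to 9
arr_custom = np.arange(2, 10, 2)   # start=2, stop=10 (exclusive), step=2 -> [2 4 6 8]
arr_float_dtype = np.arange(5, dtype=float)  # values 0.0 to 4.0 as floats
print(arr1d)
print(arr_custom)
print(arr_float_dtype)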
In this example, a 1D array (arr1d) is created using np.arange(10). The array contains
values from 0 to 9 (the stop value, 10, is exclusive). The np.arange() function is versatile and can take pa-
rameters like start, stop, and step to customize the array.
In this particular illustration, the np.arange() method is employed with the specified ar-
guments start = 2, stop = 10 (exclusive), and step = 2. Consequently, a one-dimensional
array (arr_custom) is obtained, which comprises the elements [2, 4, 6, 8].
The dtype parameter enables us to specify the data type of the resulting array. In the
present case, an array of dimension 1 (arr_float_dtype) is generated with values rang-
ing from 0 to 4, while a particular data type (float) is assigned to it.
Reshaping arrays constitutes a critical operation within the context of NumPy, entail-
ing the modification of the shape or dimensions of an extant array. The manipulation
of the reshape() method within NumPy represents a widely employed approach to at-
tain this objective.
import numpy as np
# Creating a 1D array
arr1d = np.arange(6)
print("Original 1D Array:")
print(arr1d)
# Creating a 2D array
arr2d_original = np.array([[1, 2, 3], [4, 5, 6]])
print("\nOriginal 2D Array:")
print(arr2d_original)
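# Reshaping (sketch): the 1D array of six elements becomes a 2 x 3 array
arr2d_from_1d = arr1d.reshape(2, 3)
print("\nReshaped to 2x3:")
print(arr2d_from_1d)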
Array Manipulation
Array manipulation in NumPy involves various operations to modify the shape, con-
tent, and structure of arrays.
NumPy arrays facilitate the execution of robust indexing and slicing operations.
import numpy as np
# Creating a 2D array
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Original 2D Array:")
print(arr2d)
# Slicing a single row of the 2D array
print("Sliced row 2:", arr2d[1, :])
# Slicing 2D array
print("Sliced subarray:")
print(arr2d[:2, 1:])
Array Concatenation
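# The arrays being concatenated are not shown in the text; illustrative values:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])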
# Concatenating 1D arrays
concatenated_arr = np.concatenate([arr1, arr2])
print("Concatenated 1D Array:")
print(concatenated_arr)
Transposition
# Creating a 2D array
arr2d_original = np.array([[1, 2, 3], [4, 5, 6]])
print("Original 2D Array:")
print(arr2d_original)
arr2d_transposed = arr2d_original.T
print("\nTransposed 2D Array:")
print(arr2d_transposed)
Reshaping
# Creating a 1D array
arr1d = np.arange(6)
print("Original 1D Array:")
print(arr1d)
Splitting Arrays
Dividing an array into multiple subarrays along a specified axis can be achieved
through the process of array partitioning.
# Creating a 1D array
arr1d_to_split = np.array([1, 2, 3, 4, 5, 6])
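# Splitting the array into three equal subarrays
subarrays = np.split(arr1d_to_split, 3)
print("Split into subarrays:")
print(subarrays)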
Adding/Removing Elements
# Creating a 1D array
arr1d_original = np.array([1, 2, 3, 4, 5])
# Appending an element
arr1d_appended = np.append(arr1d_original, 6)
print("Appended 1D Array:")
print(arr1d_appended)
Element-wise operations
Element-wise computations involve the execution of fundamental arithmetic opera-
tions including but not limited to addition, subtraction, multiplication, and division.
import numpy as np
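# The operand arrays are not shown in the text; illustrative values:
arr1 = np.array([10, 20, 30, 40])
arr2 = np.array([1, 2, 3, 4])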
# Addition
result_addition = arr1 + arr2
print("Addition:", result_addition)
# Subtraction
result_subtraction = arr1 - arr2
print("Subtraction:", result_subtraction)
# Multiplication
result_multiplication = arr1 * arr2
print("Multiplication:", result_multiplication)
# Division
result_division = arr1 / arr2
print("Division:", result_division)
Mathematical Functions
NumPy offers an array of mathematical operations that act on each element, thereby
granting a diverse array of mathematical functions.
# Creating an array
arr = np.array([1, 2, 3, 4, 5])
# Square root
result_sqrt = np.sqrt(arr)
print("Square Root:", result_sqrt)
# Exponential
result_exp = np.exp(arr)
print("Exponential:", result_exp)
# Trigonometric functions
result_sin = np.sin(arr)
print("Sine:", result_sin)
result_cos = np.cos(arr)
print("Cosine:", result_cos)
# Summation
sum_result = np.sum(arr)
print("Sum:", sum_result)
# Mean
mean_result = np.mean(arr)
print("Mean:", mean_result)
Broadcasting
import numpy as np
# Creating a 2D array
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
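# Broadcasting: the scalar 10 is applied to every element of the array
result = arr2d + 10
print(result)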
In this instance, the scalar 10 is broadcast across all elements of the 2D array arr2d.
Vectorization
Vectorization is a key concept in NumPy, where operations are applied to entire ar-
rays instead of individual elements using explicit loops. It leads to more readable and
computationally efficient code.
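A minimal sketch of a vectorized operation:

import numpy as np

arr = np.array([1, 4, 9, 16])
result = np.sqrt(arr)  # applied element-wise, no explicit loop
print(result) # Output: [1. 2. 3. 4.]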
Here, the np.sqrt() function is a universal function that applies the square root opera-
tion element-wise to the array.
Scientific Computations with SciPy
SciPy, a Python library employed for scientific and technical computation, is an open-
source platform. It enhances the functionalities of NumPy by offering supplementary
modules for diverse purposes including optimization, signal processing, statistical op-
erations, linear algebra, and beyond. Researchers, scientists, and engineers working
in diverse disciplines find SciPy to be an indispensable resource.
Key Features
Extensive functionality: SciPy encompasses an extensive assortment of mathe-
matical and scientific computational operations, rendering it an all-inclusive li-
brary catering to a multitude of academic domains.
Integration with NumPy: SciPy seamlessly integrates with NumPy, thereby en-
hancing the capabilities of both libraries and creating a robust environment for
numerical computation.
Interdisciplinary tools: The SciPy library comprises a range of modules that en-
compass optimization, interpolation, signal and image processing, statistical anal-
ysis, and other functionalities. This wide range of capabilities renders it well-
suited for an extensive variety of applications.
Open source and community-driven: Being an open-source project, SciPy bene-
fits from a large community of contributors, ensuring regular updates, bug fixes,
and the inclusion of new features.
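SciPy can typically be installed with pip:

pip install scipy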
This command will install the latest version of SciPy and its dependencies.
import scipy
print(scipy.__version__)

If there are no errors and the version is printed, SciPy is installed successfully.
Dependencies
SciPy relies on NumPy, so it is crucial to have NumPy installed. NumPy can be installed
separately or, as mentioned earlier, along with SciPy. Other optional dependencies exist
for specific modules within SciPy, such as Matplotlib for plotting functionalities.
import numpy as np
from scipy import linalg # Importing SciPy's linear algebra module
# Creating a 1D array
arr_1d = np.array([1, 2, 3])
# Creating a 2D matrix
matrix_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("1D Array:")
print(arr_1d)
print("\n2D Matrix:")
print(matrix_2d)
# Coefficient matrix
coeff_matrix = np.array([[2, 1], [3, -1]])
# Right-hand side
rhs_vector = np.array([8, 1])
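# Solving the linear system coeff_matrix · x = rhs_vector
solution = linalg.solve(coeff_matrix, rhs_vector)
print("Solution of the linear system:")
print(solution)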
# Matrix A
matrix_A = np.array([[4, -2], [1, 1]])
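# Eigenvalue decomposition of matrix_A
eigenvalues, eigenvectors = linalg.eig(matrix_A)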
print("Eigenvalues:")
print(eigenvalues)
print("\nEigenvectors:")
print(eigenvectors)
# Matrix B
matrix_B = np.array([[1, 2], [2, 3]])
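# Singular value decomposition of matrix_B
U, S, Vt = linalg.svd(matrix_B)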
print("U matrix:")
print(U)
print("\nS matrix (singular values):")
print(S)
print("\nVt matrix (transpose of V):")
print(Vt)
Numerical Integration
SciPy provides powerful numerical integration methods through scipy.integrate. The
quad() function is commonly used.
Consider integrating f(x) = x^2 over the interval [0, 1]
from scipy import integrate

# Function to integrate
def func(x):
    return x ** 2
# Numerical integration
result, error = integrate.quad(func, 0, 1)
Numerical Differentiation
scipy.misc.derivative() computes the derivative of a function at a given point using
numerical methods.
Consider differentiating g(x) = e^x at x = 2
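A minimal setup for the call below; func_g is assumed from the description above (note that scipy.misc.derivative is deprecated in recent SciPy releases):

import numpy as np
from scipy.misc import derivative

# Function to differentiate: g(x) = e^x
def func_g(x):
    return np.exp(x)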
# Numerical differentiation
derivative_at_2 = derivative(func_g, 2.0, dx=1e-6)
3.2 Loading Data with Pandas
DataFrame
A DataFrame is a data structure in pandas that is two-dimensional and tabular in na-
ture, closely resembling a table found in a relational database. It is composed of rows
and columns, wherein each column can possess a distinct data type.
Features
– Tabular structure: Data organized in rows and columns.
– Column names: Each column has a name or label.
– Index: Each row has an index, which can be explicitly set or autogenerated.
– Flexibility: Columns can have different data types.
– Extensibility: Supports the addition of new columns.
import pandas as pd
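# Creating a DataFrame from a dictionary (illustrative values)
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['Udaipur', 'Mumbai', 'Delhi']}
df = pd.DataFrame(data)
print(df)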
Output
Fig. 3.1 shows the output produced by the code that creates the pandas DataFrame. It
illustrates the structure and contents of the DataFrame, including its rows, columns, and
the data stored in each cell.
Series
A Series is a one-dimensional labeled array in the pandas library. It is essentially a
single column of a DataFrame, but it can also exist independently. A Series can store
various data types, including integers, floating-point numbers, and strings.
Features
– One-dimensional: Consists of a single column or array of data.
– Labeled index: Each element in the Series has an index.
– Homogeneous data: All elements in a Series have the same data type.
– Similar to NumPy Arrays: Many operations available in NumPy arrays can be ap-
plied to Series.
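A minimal example matching the print statement below (the values are illustrative):

import pandas as pd

ages = pd.Series([25, 30, 35, 40], name='Age')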
print(ages)
Output
Fig. 3.2 shows the output produced by the code that creates the pandas Series object, a
one-dimensional labeled array. The output lists the index label associated with each value
in the Series, giving a clear view of how data is organized and stored in this data structure.
Series versus DataFrame:
– Elements: a Series is a single column of data with an index; a DataFrame holds multiple
columns of data organized in a tabular structure with both row and column indices.
– Use cases: a Series is suitable for representing a single variable or feature; a DataFrame is
suitable for representing a dataset with multiple variables, each as a column.
– Flexibility: a Series has a homogeneous data type for all elements; DataFrame columns can
have different data types.
From a Dictionary
The column names in a dictionary are derived from its keys, and the data is repre-
sented by the corresponding values.
import pandas as pd
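# The dictionary is not shown in the text; illustrative values:
data_dict = {'Name': ['Alice', 'Bob'],
             'Age': [25, 30]}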
df_from_dict = pd.DataFrame(data_dict)
Output:
Data cleansing with pandas ensures that datasets remain consistent, comprehensive,
and primed for meaningful analysis and insights.
Duplicate rows can be identified based on column values or the entire row. The drop_duplicates() function facilitates the elimi-
nation of duplicate rows, resulting in a DataFrame that exclusively contains unique
records. The strategic management of duplicates is imperative in order to preserve
the accuracy of analyses and to prevent biases that may arise from repeated data
points. These operations possess particular value when working with extensive data-
sets or when integrating data from multiple sources, as they ensure that the resulting
DataFrame accurately represents the underlying information.
Identifying duplicates:
Identifying and handling duplicate rows is crucial to maintaining data integrity.
Pandas provides methods to identify duplicate rows based on column values or the
entire row.
import pandas as pd
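# A small DataFrame containing a repeated row (illustrative values)
data = {'Name': ['Alice', 'Bob', 'Alice'],
        'Age': [25, 30, 25]}
df = pd.DataFrame(data)

# Identifying duplicate rows
duplicates = df[df.duplicated()]
print(duplicates)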
Output:
Removing duplicates
Removing duplicate rows is accomplished using the drop_duplicates() function. This
ensures that the DataFrame contains only unique records.
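# Removing duplicate rows from the same DataFrame
df_unique = df.drop_duplicates()
print(df_unique)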
Output:
The identification of duplicate rows in the DataFrame was conducted through the uti-
lization of the duplicated() method in this particular instance. The resulting Data-
Frame, denoted as duplicates, showcases the duplicate rows that were identified.
Subsequently, the removal of duplicate rows was carried out by means of the drop_-
duplicates() function, thereby generating a DataFrame devoid of any duplicates. These
operations are of utmost importance in guaranteeing the accuracy of the data and en-
suring that the analyses conducted are not distorted by the presence of redundant
information.
import pandas as pd
# Creating a DataFrame
data = {'Date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-
02'],
'Category': ['A', 'B', 'A', 'B'],
'Value': [10, 20, 15, 25]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
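# Reshaping from long format to wide format
df_pivot = df.pivot(index='Date', columns='Category', values='Value')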
print("\nDataFrame after Pivot:")
print(df_pivot)
Output:
In this example, the original DataFrame has a long format, with each date having sep-
arate rows for categories A and B. After applying pivot, the data is reshaped into a
wide format, with dates as the index and categories as columns, providing a clearer
tabular structure.
Melting data:
The utilization of the melt function in the pandas library is paramount in the conver-
sion of wide-format data to long-format. This becomes particularly valuable in situations
where the data is arranged in a tabular or horizontal structure and necessitates transfor-
mation into a stacked or vertical format.
The inclusion of the id_vars, var_name, and value_name parameters in the method
affords the opportunity for customization of the melted DataFrame.
import pandas as pd
# Creating a DataFrame
data = {'Date': ['2022-01-01', '2022-01-02'],
'Category_A': [10, 15],
'Category_B': [20, 25]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
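# Converting from wide format to long format
df_melted = pd.melt(df, id_vars=['Date'],
                    value_vars=['Category_A', 'Category_B'],
                    var_name='Category', value_name='Value')
print("\nDataFrame after Melt:")
print(df_melted)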
Output:
3.3 Data Cleaning and Transformation
Data cleansing and data transformation are fundamental stages in the data prepro-
cessing pipeline, with the objective of enhancing the quality and utility of the dataset
for analysis or modeling purposes. The process of data cleansing entails the identifica-
tion and rectification of errors, discrepancies, and missing values within the dataset.
This may involve the elimination of duplicate records, the handling of missing
data through imputation or deletion, and the correction of erroneous values. On the
other hand, data transformation encompasses the reorganization or restructuring of
the data to render it suitable for analysis or modeling. This may encompass the encoding
of categorical variables, the scaling of numerical features, and similar restructuring steps.
Handling missing data is a crucial aspect of data preprocessing, ensuring the robust-
ness and accuracy of analyses and models. There exist various strategies to address
missing data, including imputation, deletion, and modeling. Imputation entails esti-
mating missing values by utilizing other available data, such as utilizing the mean,
median, or mode for numerical variables or employing the most frequent category
for categorical variables.
Deletion involves eliminating records or features with missing values, either en-
tirely or partially, which can be appropriate if the missing data is minimal and ran-
dom. Modeling entails treating missing values as a distinct category or employing
machine learning algorithms to predict missing values based on other variables.
import pandas as pd
from sklearn.impute import SimpleImputer
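# A small DataFrame with missing values (illustrative); the columns match the description below
import numpy as np

df = pd.DataFrame({'Age': [25, np.nan, 35, 40],
                   'Income': [50000, 60000, np.nan, 80000],
                   'Education_Level': [12, 16, np.nan, 14]})

# Imputing missing values with the mean of each column
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)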
print("Original DataFrame:")
print(df)
print("\nDataFrame after Imputation:")
print(df_imputed)
Output:
In this particular instance, there are absent values in the columns denoted as ‘Age’,
‘Income,’ and ‘Education_Level’. To address this issue, we employ the utilization of the
SimpleImputer module from the renowned scikit-learn library to impute the missing
values with the mean value of each corresponding column. The resultant DataFrame
encompasses the imputed values, thereby guaranteeing the wholeness of the dataset
and facilitating further analysis and modeling endeavors.
Data type conversions play a vital role in the preprocessing of data, as they allow for
the representation of data in a manner that is suitable for analysis, modeling, or stor-
age. This process entails the alteration of data from one type to another, for instance,
the conversion of strings into numerical values or the transformation of categorical
variables into numerical representations.
The conversions of data types serve to guarantee the consistency of data and its
compatibility with various algorithms and tools. To illustrate, the conversion of categor-
ical variables into numerical format, accomplished through the utilization of encoding
techniques like one-hot encoding or label encoding, facilitates the effective processing
of such variables by machine learning algorithms. Likewise, the conversion of numeri-
cal values from one data type to another, such as the transition from integers to floats,
may become necessary in order to address precision or scaling requirements.
import pandas as pd
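# A DataFrame with numeric values stored as strings (illustrative)
df = pd.DataFrame({'ID': ['1', '2', '3'],
                   'Name': ['Alice', 'Bob', 'Charlie'],
                   'Income': ['50000', '60000', '70000']})

# Converting the 'ID' and 'Income' columns from strings to integers
df[['ID', 'Income']] = df[['ID', 'Income']].astype(int)
print(df.dtypes)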
Output:
In this example, we have a DataFrame with mixed data types. We convert the ‘ID’ and
‘Income’ columns from strings to integers using the astype method in pandas, ensur-
ing consistency and enabling numerical operations or analysis of these columns. Simi-
larly, other data type conversions can be performed as needed to prepare the data for
further processing or modeling.
3.4 Feature Engineering
Feature engineering involves the process of transforming raw data into a format that
is more suitable for machine learning algorithms, with the goal of improving the per-
formance of the model. It encompasses the selection, creation, and modification of
features to extract meaningful patterns and insights from the data. This may include
converting categorical variables into numerical representations using techniques like
one-hot encoding or label encoding. Additionally, feature engineering includes feature
scaling, which standardizes or normalizes numerical features to ensure consistency
and comparability across different scales. Moreover, it involves techniques for reduc-
ing data dimensionality, such as principal component analysis (PCA) and feature se-
lection methods that aim to decrease the number of features and eliminate irrelevant
or redundant ones. The effective implementation of feature engineering can signifi-
cantly enhance the accuracy, interpretability, and generalization of the model to new
data, making it a critical step in the machine learning pipeline.
One-Hot Encoding
One-hot encoding is a technique that is used to convert categorical variables into a
numerical format that is suitable for machine learning algorithms. In this methodol-
ogy, each category is represented as a binary vector, where only one bit is assigned a
value of 1 to indicate the presence of the respective category. This approach ensures
that the categorical information is preserved without introducing any form of ordinal-
ity. For example, if we have three categories, namely ‘Red’, ‘Green’, and ‘Blue’, they
would be encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively.
import pandas as pd
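# The DataFrame is not shown in the text; the categories follow the example above
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})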
# One-hot encoding
one_hot_encoded = pd.get_dummies(df['Color'])
print("Original DataFrame:")
print(df)
print("\nOne-hot Encoded DataFrame:")
print(one_hot_encoded)
Output:
In this specific instance, the categorical variable ‘Color’ is transformed into binary
vectors through the process of one-hot encoding, which conveys the existence or non-
existence of each category. This particular conversion allows machine learning algo-
rithms to effectively understand and make use of categorical data.
Label Encoding
Label encoding is a method utilized to convert categorical variables into a numerical
format by assigning a unique integer to each category. This specific technique entails
replacing each category with a numerical value, starting from 0 and progressing up to
n-1, where n represents the number of distinct categories. While its simplicity is evi-
dent, it is crucial to acknowledge that label encoding may unintentionally introduce a
semblance of order to the data, thereby implying a meaningful ranking among catego-
ries that may not actually exist. For instance, if we were to examine three categories:
‘Red’, ‘Green’, and ‘Blue’, they would be encoded as 0, 1, and 2, respectively.
from sklearn.preprocessing import LabelEncoder

# Label encoding (reusing the df with the 'Color' column from the previous example)
label_encoder = LabelEncoder()
df['Color_LabelEncoded'] = label_encoder.fit_transform(df['Color'])
print("Original DataFrame:")
print(df[['Color']])
print("\nLabel Encoded DataFrame:")
print(df[['Color_LabelEncoded']])
Output:
In this example, the categorical variable ‘Color’ has been encoded with labels that rep-
resent numerical values. Every distinct category is given a numerical label according
to its sequential appearance in the data. Nevertheless, prudence is advised when uti-
lizing label encoding, particularly in scenarios where the categorical variable lacks
inherent ordinality. This is due to the potential for machine learning algorithms to
misinterpret the encoded labels, resulting in misinterpretation of the data.
Feature Scaling
Feature scaling transforms numerical features so that they contribute comparably to
machine learning models. Various methods are commonly employed for feature scaling,
including Min-Max Scaling, Z-score Scaling, and Robust Scaling.
Standardization
Standardization is a method used to normalize numerical features by transforming
them to have a mean of 0 and a standard deviation of 1. This procedure involves sub-
tracting the mean of each feature from its respective values and dividing the result by
the standard deviation.
The utilization of standardization is beneficial for algorithms that assume the
input data adheres to a normal distribution and assigns equal significance to all fea-
tures. It ensures that features with larger scales do not dominate the learning process.
For example, in a dataset that encompasses features like age, income, and education
level, which exhibit varying scales, standardization would enable comparison among
these features.
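A minimal setup for the standardization code below (the values are illustrative):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Age': [25, 32, 47, 51],
                   'Income': [40000, 55000, 72000, 90000],
                   'Education_Level': [12, 16, 18, 14]})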
# Standardization
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("Original DataFrame:")
print(df)
print("\nStandardized DataFrame:")
print(df_scaled)
Output:
In this instance, the initial numerical data depicting age, income, and education level
is standardized through the utilization of the StandardScaler implemented in scikit-
learn. Each individual characteristic is adjusted in such a way that it possesses an av-
erage value of 0 and a standard deviation of 1, thereby rendering them amenable to
comparison across varying scales. This safeguard guarantees that no individual char-
acteristic holds undue influence over the process of learning in machine learning al-
gorithms that heavily rely on numerical data.
Normalization
Normalization is a technique for scaling features, which transforms numerical data
into a standardized scale that typically ranges from 0 to 1. This process entails adjust-
ing the values of each feature so that they fall within this range, while simultaneously
preserving their relative relationships.
Normalization is particularly advantageous when the magnitudes of the features
exhibit substantial variation, as it ensures that all features contribute equitably to the
learning procedure. It is extensively employed in algorithms that mandate input data
to be confined within a specific range, such as neural networks and distance-based
algorithms like k-nearest neighbors. To elaborate, for instance, if we possess features
such as age, income, and education level that possess dissimilar scales, normalization
would unify them onto a standardized scale, thereby facilitating direct comparability.
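A minimal setup for the normalization code below (the values are illustrative):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = {'Age': [25, 32, 47, 51],
        'Income': [40000, 55000, 72000, 90000],
        'Education_Level': [12, 16, 18, 14]}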
df = pd.DataFrame(data)
# Normalization
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.
columns)
print("Original DataFrame:")
print(df)
print("\nNormalized DataFrame:")
print(df_normalized)
Output:
In this instance, the initial numeric information signifying age, income, and level of
education undergoes normalization by utilizing the MinMaxScaler from scikit-learn.
The values of each characteristic are adjusted to a span ranging from 0 to 1, thereby
maintaining their relative associations while guaranteeing consistency across various
scales. This facilitates equitable comparisons between characteristics and averts the
prevalence of any individual attribute in machine learning algorithms, thereby up-
holding the integrity of the learning process.
3.5 Data Visualization with Matplotlib and Seaborn
The utilization of Matplotlib and Seaborn for data visualization is indispensable when
it comes to thoroughly exploring, meticulously analyzing, and effectively communicat-
ing insights obtained from data. Matplotlib, a widely utilized plotting library in Py-
thon, provides a significant level of customization in generating a wide range of static
plots, including line plots, scatter plots, histograms, bar plots, and more. It empowers
users with precise control over various plot elements, such as colors, markers, labels,
and annotations.
Seaborn, constructed atop Matplotlib, furnishes a more advanced interface for
crafting informative and visually appealing statistical graphics. By providing conve-
nient functions for plotting data with minimal code, it simplifies the otherwise intri-
cate process of generating complex visualizations. Seaborn excels in the production of
visually captivating plots for statistical analysis, including specialized ones like violin
plots, box plots, pair plots, and heatmaps.
Collectively, Matplotlib and Seaborn constitute a potent toolkit for data visualiza-
tion, enabling analysts and data scientists to swiftly and effectively explore patterns,
trends, and relationships within datasets. These libraries facilitate the creation of
plots of publication quality, enhancing data storytelling and presentation and rendering
them indispensable tools in the data analysis workflow.
Plotting with Matplotlib is an essential aspect of data visualization, enabling the crea-
tion of various plots to examine data distributions, trends, and relationships. Matplot-
lib, a versatile Python plotting library, provides a flexible and intuitive interface for
producing high-quality static visualizations suitable for publication. By utilizing Mat-
plotlib, analysts and researchers can generate different types of plots, including line
plots, scatter plots, bar plots, histograms, pie charts, box plots, violin plots, heatmaps,
area plots, and contour plots.
These plots serve distinct objectives, permitting users to examine numerical dis-
tributions, compare categorical variables, visualize relationships between variables,
identify outliers, and explore patterns within data. Whether visualizing time series
data, investigating correlations, or presenting categorical distributions, Matplotlib
provides the necessary tools for creating informative and insightful visualizations.
The establishment of the foundation of data exploration and analysis workflows
is accomplished through the utilization of basic plotting techniques in Matplotlib.
These techniques provide critical understandings of datasets that can assist in deci-
sion-making, enable discoveries, and effectively convey findings. Attaining expertise
in basic plotting with Matplotlib is essential for practitioners in the field of data sci-
ence and analysis, as it is a fundamental skill that allows for the extraction of action-
able insights from data.
Line Plot
Line plots are employed to represent the trajectory of data points across an unbroken
duration. They are constructed by joining the data points with linear segments. For
instance, one could construct a line plot to depict the variation in stock prices over a
given period of time.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot Example')
plt.show()
Output:
Scatter Plot
Scatter plots exhibit individual data points as markers on a Cartesian plane, rendering
them well-suited for illustrating the correlation between two numerical variables. For
instance, one can create a plot to depict the connection between the mileage of a car
and its corresponding price.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
plt.scatter(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot Example')
plt.show()
Output:
Bar Plot
Bar plots depict categorical data using rectangular bars, where the length of each bar
signifies the value of the corresponding category. Such visualizations prove valuable
in the task of comparing the quantities associated with various categories. For in-
stance, they are employed to assess the sales performance of different products.
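# The categories and values are not shown in the text; illustrative data:
import matplotlib.pyplot as plt

categories = ['A', 'B', 'C']
values = [10, 24, 17]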
plt.bar(categories, values)
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Plot Example')
plt.show()
Output:
Histogram
Histograms depict the distribution of quantitative data by partitioning the data into
intervals and enumerating the quantity of data points within each interval. They
serve as a valuable tool for comprehending the frequency distribution of a specific
variable. An instance where histograms are applicable is when visualizing the age dis-
tribution within a population.
import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
plt.hist(data, bins=5)
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.show()
Output:
Pie Chart
Pie charts depict categorical information by dividing a circle into slices, with each
slice denoting a specific category and its magnitude indicating the proportion of that
category in the entirety. These charts serve a practical purpose in demonstrating the
makeup of an entire entity, such as the allocation of expenses in a budget.
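A minimal sketch of a pie chart (the labels and values are illustrative):

import matplotlib.pyplot as plt

labels = ['Rent', 'Food', 'Transport', 'Savings']
sizes = [40, 30, 15, 15]
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart Example')
plt.show()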
Output:
Box Plot
Box plots, alternatively referred to as box-and-whisker plots, serve the purpose of
graphically representing the distribution of numerical data, thereby exhibiting the me-
dian, quartiles, and outliers. Their value lies in the identification of outliers and the
comprehension of the data’s dispersion and central tendency. For instance, they can be
employed to contrast the distribution of test scores among various student groups.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(10)
data = np.random.normal(0, 1, 100)
plt.boxplot(data)
plt.title('Box Plot Example')
plt.show()
Output:
Violin Plot
Violin plots amalgamate a box plot and a kernel density plot to exhibit the distribu-
tion of numerical data. They offer a more intricate perspective of the distribution of
data in comparison to box plots. For instance, they can be utilized to visualize the dis-
tribution of heights among diverse age groups.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(10)
data = np.random.normal(0, 1, 100)
sns.violinplot(data)
plt.title('Violin Plot Example')
plt.show()
Output:
Heatmap
Heatmaps visualize data in a tabular format by assigning colors to cells based on their
values. They are commonly used to visualize correlations or relationships in large da-
tasets. Example: visualizing the correlation matrix of numerical variables in a dataset.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(10)
data = np.random.rand(10, 10)
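# Drawing the heatmap (sketch)
sns.heatmap(data, cmap='viridis')
plt.title('Heatmap Example')
plt.show()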
Output:
Area Plot
Area plots are similar to line plots but fill the area below the line, making them useful
for visualizing cumulative data or stacked data. For example, visualizing the cumula-
tive sales over time for different product categories.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y1 = [1, 2, 3, 4, 5]
y2 = [1, 4, 9, 16, 25]
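# Drawing a stacked area plot (sketch); plt.stackplot fills the area under each series
plt.stackplot(x, y1, y2, labels=['Series 1', 'Series 2'])
plt.legend(loc='upper left')
plt.title('Area Plot Example')
plt.show()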
Output:
Contour Plot
Contour plots represent three-dimensional data in two dimensions by showing con-
tours or lines of constant values. They are commonly used for visualizing geographi-
cal or scientific data. For example, visualizing elevation data on a map.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(X) + np.cos(Y)
plt.contour(X, Y, Z)
plt.title('Contour Plot Example')
plt.show()
Output:
Seaborn, an advanced visualization tool, provides robust features for the demonstra-
tion of data, exhibiting superior capabilities when contrasted with fundamental plot-
ting libraries such as Matplotlib. Seaborn is constructed upon Matplotlib, delivering
an interface at a higher level, which enables the creation of statistically informative
and visually captivating graphics.
Here’s a sample structure for the “dataset.csv” file:
x_variable,y_variable,category
1,2,A
2,3,B
3,4,A
4,5,B
5,6,A
6,7,B
Pair Plot
Pair plots depict the pairwise associations among variables in a dataset by presenting
scatter plots for numerical variables and histograms for the diagonal axes.
data = pd.read_csv('dataset.csv')
sns.pairplot(data, hue='category')
plt.show()
Output:
Joint Plot
Joint plots integrate scatter plots and histograms to visually represent the correlation
between two quantitative variables in addition to their respective distributions.
data = pd.read_csv('dataset.csv')
sns.jointplot(x='x_variable', y='y_variable', data=data, kind='scatter')
Output:
PairGrid
PairGrid allows customization of pair plots by providing access to individual subplots,
facilitating detailed exploration of pairwise relationships.
data = pd.read_csv('dataset.csv')
g = sns.PairGrid(data)
g.map(sns.scatterplot)
Output:
Common resampling strategies include oversampling (increasing the number of samples in the minority category) and under-
sampling (reducing the number of samples in the majority category).
Furthermore, synthetic data generation techniques such as SMOTE (Synthetic Mi-
nority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling) are uti-
lized to generate artificial data points for the minority category, thus achieving a
balanced dataset. Handling imbalanced data ensures that machine learning models
are trained on datasets that are more representative, resulting in improved perfor-
mance and generalization across all categories.
from imblearn.over_sampling import RandomOverSampler
# Instantiate RandomOverSampler
ros = RandomOverSampler()
# Resample dataset
X_resampled, y_resampled = ros.fit_resample(X, y)
In this particular instance, the initial step involves establishing an imbalanced dataset
consisting of two distinct classes. Subsequently, we employ the RandomOverSampler
function, sourced from the imbalanced-learn library, in order to oversample the mi-
nority class. Lastly, we display the class distributions both prior to and subsequent to
the resampling process, enabling us to observe the resultant balancing effect.
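The fragment above assumes the dataset and imports are already in place; a minimal, self-contained sketch of this workflow (using a synthetically generated imbalanced dataset rather than the book's own data) might look like this:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
# Create a deliberately imbalanced two-class dataset (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_classes=2, weights=[0.9, 0.1], random_state=42)
print("Class distribution before resampling:", Counter(y))
# Oversample the minority class until both classes have the same number of samples
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print("Class distribution after resampling: ", Counter(y_resampled))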
Here’s a programming example of undersampling using the imbalanced-learn
library:
from imblearn.under_sampling import RandomUnderSampler
# Instantiate RandomUnderSampler
rus = RandomUnderSampler()
# Resample dataset
X_resampled, y_resampled = rus.fit_resample(X, y)
Synthetic data generation approaches are employed to address the problem of class
imbalance in datasets, with a particular emphasis on the minority class. Methods
such as SMOTE and ADASYN produce synthetic instances for the minority class by uti-
lizing existing data points.
SMOTE accomplishes this by creating synthetic samples through interpolation be-
tween instances of the minority class, while ADASYN adjusts the generation process
by considering the density of samples in the feature space. These approaches aid in
mitigating the impact of class imbalance by augmenting the dataset with artificially
generated data points, thereby enhancing the performance of machine learning
models.
An illustrative example showcasing the application of SMOTE is presented below:
from imblearn.over_sampling import SMOTE
# Instantiate SMOTE and resample the dataset
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)
In this instance, SMOTE is utilized to produce artificial data points for the underrepre-
sented class within the dataset that exhibits imbalance. The fit_resample technique is
utilized to execute the generation of synthetic data, and the resultant distributions of
classes both before and after the application of SMOTE are compared to observe the
effect of achieving balance.
Here is an illustrative programming example showcasing the utilization of ADA-
SYN (Adaptive Synthetic Sampling) for the purpose of creating synthetic data points
for the underrepresented class within an imbalanced dataset:
from imblearn.over_sampling import ADASYN
# Instantiate ADASYN and resample the dataset
adasyn = ADASYN()
X_resampled, y_resampled = adasyn.fit_resample(X, y)
In this particular instance, we are presented with a dataset that exhibits an imbalance
in terms of the number of instances belonging to each class, whereby the minority
class possesses a smaller number of instances. To address this issue, we apply the
ADASYN algorithm, which serves to generate synthetic data points specifically for the
minority class. Subsequently, the fit_resample method is invoked to carry out the gen-
eration of synthetic data. Finally, to observe the effect of the balancing process, we
proceed to display the class distributions both before and after the application of
ADASYN.
Summary
– Numerical and scientific computing using NumPy and SciPy: Covered array crea-
tion, manipulation, and various numerical operations using NumPy and SciPy
libraries.
– Loading data with pandas: Introduction to pandas library for data manipulation
and analysis, including DataFrame and Series, along with data manipulation
techniques.
– Data cleaning and transformation: Discussed strategies for handling missing data
and data type conversions in datasets.
– Feature engineering: Covered techniques such as encoding categorical variables
and feature scaling for preparing data for machine learning.
– Data visualization with Matplotlib and Seaborn: Introduced Matplotlib and Sea-
born libraries for data visualization, including basic and advanced plotting
techniques.
– Handling imbalanced data: Discussed challenges posed by imbalanced datasets in
machine learning and techniques like resampling and synthetic data generation
to address class imbalance.
Exercise (MCQs)
3. What is the primary data structure in pandas for storing and manipulating
data?
a) Arrays b) Lists c) DataFrame d) Tuples
7. Which plot type is used for visualizing pairwise relationships between varia-
bles in Seaborn?
a) Scatter plot b) Pair plot c) Histogram d) Heatmap
9. What is the purpose of synthetic data generation techniques like SMOTE and
ADASYN?
a) To create artificial data points for majority classes
b) To remove outliers from datasets
c) To convert categorical variables into numerical representations
d) To standardize features in datasets
Answers
1. b
2. c
3. c
4. d
5. c
6. d
7. b
8. a
9. a
10. a
Fill in the Blanks
1. ___________ and ___________ are libraries commonly used for numerical and scien-
tific computing tasks in Python.
2. In pandas, the primary data structure for storing and manipulating data is
called ___________.
3. Resampling techniques such as ___________ and ___________ are used to address
class imbalance in machine learning datasets.
4. Feature scaling techniques aim to standardize or normalize the ___________ of fea-
tures in datasets.
5. Matplotlib and Seaborn are popular libraries used for ___________ in Python.
6. Pair plots are used in Seaborn to visualize ___________ relationships between
variables.
7. SMOTE and ADASYN are techniques used for generating ___________ data points in
imbalanced datasets.
8. ___________ and ___________ are methods used for encoding categorical variables in
feature engineering.
9. Data cleaning involves handling ___________ values and converting data types for
analysis.
10. Classes serve as blueprints for creating ___________ in object-oriented
programming.
Answers
1. NumPy, SciPy
2. DataFrame
3. oversampling, undersampling
4. scale
5. data visualization
6. pairwise
7. synthetic
8. Label encoding, one-hot encoding
9. missing
10. objects
Descriptive Questions
1. Explain the importance of proper indentation in Python syntax and how it im-
pacts the readability of code.
2. Describe the role of pandas in data manipulation and analysis, and provide exam-
ples of DataFrame operations.
3. Discuss the significance of feature engineering in machine learning and explain
the difference between encoding categorical variables and feature scaling.
4. Explain the process of data visualization using Matplotlib and Seaborn, and pro-
vide examples of basic and advanced plotting techniques.
5. Describe the challenges posed by imbalanced datasets in machine learning and
discuss techniques such as resampling and synthetic data generation to address
class imbalance.
6. Explain the concept of feature scaling and discuss its importance in preparing
data for machine learning models.
7. Discuss the role of comments in Python code and how they contribute to code
documentation and readability.
8. Explain the concept of object-oriented programming in Python, including classes,
objects, inheritance, and polymorphism.
9. Discuss the strategies for handling missing data in datasets and the importance of
data imputation in data preprocessing.
10. Describe the process of encoding categorical variables in feature engineering and
discuss the differences between label encoding and one-hot encoding.
11. Write a Python program that creates a 2D NumPy array and performs the follow-
ing operations:
a. Compute the mean, median, and standard deviation of the array.
b. Reshape the array into a different shape.
c. Perform element-wise addition and multiplication with another array.
12. Write a Python program that loads a CSV file using pandas and performs the fol-
lowing operations:
a. Display the first few rows of the DataFrame.
b. Calculate summary statistics for numerical columns.
c. Convert a categorical column to a numerical one using label encoding.
13. Write a Python program that generates a line plot using Matplotlib to visualize a
time series dataset.
a. Include labels for the x and y axes.
b. Add a title to the plot.
c. Customize the line style and color.
14. Write a Python program that loads an imbalanced dataset and implements over-
sampling using the SMOTE technique from the imbalanced-learn library.
a. Display the class distribution before and after oversampling.
b. Train a simple machine learning model (e.g., logistic regression) on the bal-
anced dataset and evaluate its performance.
15. Write a Python program that preprocesses a dataset for machine learning using
feature engineering techniques.
a. Encode categorical variables using one-hot encoding.
b. Scale numerical features using Min-Max scaling or standardization.
c. Split the dataset into training and testing sets for model evaluation.
Chapter 4
Foundations of Machine Learning
The foundations of machine learning form the bedrock on which the entire discipline
operates. These foundational elements encompass indispensable principles and con-
cepts that serve as the basis for the formulation, construction, and
assessment of machine learning algorithms. At its core, the objective of machine learn-
ing is to facilitate computers in obtaining knowledge from data without the necessity of
explicit programming. This chapter thoroughly explores the fundamental facets that
hold paramount importance for every practitioner to apprehend to adeptly navigate the
intricate terrain of machine learning.
A fundamental initial step in comprehending machine learning resides in differ-
entiating between supervised and unsupervised learning. The incorporation of la-
beled data characterizes supervised learning, while unsupervised learning operates
without such labeling. Subsequent examination is conducted within the subsections
to clarify the notions of classification versus regression and clustering versus associa-
tion. These notions offer insight into the varied forms of learning tasks and their cor-
responding applications.
The significance of achieving a delicate equilibrium between the ability to capture
intricate patterns in data and the capability to generalize to unobserved instances is
emphasized by the concepts of overfitting and regularization. Overfitting pertains to
the scenario in which a model becomes excessively tailored to the training data, result-
ing in inadequate performance on unseen data. An understanding of the bias-variance
trade-off, together with regularization techniques such as L1/L2 regularization, offers
methods to alleviate overfitting and enhance the resilience of models.
Evaluation metrics play a crucial role in assessing the performance of machine
learning models. This chapter provides an overview of metrics tailored for both classi-
fication and regression tasks. These metrics offer valuable insights into the accuracy,
precision, recall, and other key measures of model performance.
Cross-validation arises as an essential instrument for the evaluation and choice of
models. It diminishes the possibility of overfitting by methodically dividing data into
training and validation sets. A comprehensive explanation of diverse techniques, such
as k-fold cross-validation, leave-one-out, and stratified k-fold, empowers professionals
to proficiently validate their models.
The basic essence of machine learning is to provide learners with the necessary
fundamental principles and methodologies that are required to begin the process of
constructing intelligent systems. These foundations establish a strong and stable basis
for the subsequent exploration of more sophisticated concepts and techniques in later
chapters.
https://doi.org/10.1515/9783110697186-004
4.1 Supervised vs Unsupervised Learning
Supervised and unsupervised learning are two fundamental frameworks in the field
of machine learning, each addressing distinct categories of learning tasks and
methodologies.
Supervised learning, which focuses on the training of a model using labeled infor-
mation, involves associating each input data point with an output label or target. The
primary objective is to establish a mapping from inputs to outputs, allowing the
model to generate predictions on new data. Supervised learning is commonly used in
classification tasks, where the aim is to assign input instances to specific categories or
labels, as well as regression tasks, where the goal is to forecast a continuous value.
Prominent algorithms in supervised learning include decision trees, support vector
machines, and neural networks.
In contrast, unsupervised learning deals with unlabeled data, requiring the model
to identify underlying patterns or structures without explicit guidance. Instead of mak-
ing predictions for specific outputs, unsupervised learning algorithms aim to uncover
inherent relationships or groupings within the data. Clustering algorithms, such as K-
means and hierarchical clustering, segment the data into distinct clusters based on sim-
ilarities, while association algorithms, like Apriori, discover rules or associations be-
tween different attributes. Unsupervised learning is particularly advantageous for
anomaly detection, data compression, and feature extraction tasks.
Supervised learning relies on labeled data to learn the correlations between in-
puts and outputs, enabling prediction tasks. In contrast, unsupervised learning oper-
ates on unlabeled data to reveal hidden structures or patterns, providing insights and
understanding of raw data without explicit guidance. Both paradigms play significant
roles in various machine learning applications, each offering distinct advantages and
challenges depending on the problem and available data.
Clustering: Clustering algorithms partition data points into clusters or groups accord-
ing to their similarity. The objective is to discover natural groupings in the data with-
out any predetermined labels. Customer segmentation, document clustering, and
image segmentation are common examples of this.
Supervised and unsupervised learning methods, which span a broad array of techni-
ques and algorithms, serve as indispensable tools for tackling diverse real-world chal-
lenges in domains including finance, healthcare, e-commerce, and beyond.
For example, consider predicting whether a loan applicant will default. The input features
in this situation could consist of the applicant’s credit score, earnings, and debt-to-
income ratio, while the output would be a binary tag indicating either “default” or
“no default.” A logistic regression model can be educated on past data to categorize
potential loan applicants based on these characteristics.
In contrast, suppose we have an interest in predicting the price of a dwelling
based on its dimensions, number of bedrooms, and geographical location. This sce-
nario introduces a regression issue as the output variable, specifically the dwelling
price, exhibits a continuous characteristic. In this instance, a linear regression model
can be utilized to establish the connection between the input attributes and the dwell-
ing prices by employing a dataset consisting of past real estate transactions. Conse-
quently, this grants us the capability to generate predictions regarding the price of
newly constructed dwellings.
Classification involves assigning data to discrete classes or labels, whereas re-
gression concentrates on the prediction of numerical values. A comprehensive under-
standing of the differences between these two types of supervised learning tasks is
crucial when choosing appropriate algorithms and methodologies to effectively tackle
various real-world problems.
Types of Classification
Binary classification: In the realm of binary classification, the aim is to categorize
instances into either of two separate categories. Instances may encompass the identifi-
cation of spam (differentiating emails into spam or non-spam), the detection of fraud
(distinguishing between fraudulent and non-fraudulent transactions), and medical di-
agnosis (ascertaining the existence or absence of a disease).
Types of Regression
Linear regression: Linear regression constructs a model that represents the relation-
ship between a dependent variable and one or more independent variables using a lin-
ear equation. Its application lies in the forecast of continuous numerical values.
Instances include the anticipation of housing costs based on the area and quantity of
bedrooms, the prediction of stock prices utilizing historical data, and the estimation
of sales revenue by taking marketing expenses into account.
Ridge and Lasso regression: Ridge and Lasso regression are methods utilized to reg-
ularize linear regression models by integrating a penalty term into the cost function.
These methods assist in mitigating overfitting and improving the generalization capa-
bility of the model. Ridge regression incorporates a penalty term that is directly pro-
portional to the square of the coefficients’ magnitude, whereas Lasso regression
incorporates a penalty term that is directly proportional to the absolute value of the
coefficients. These methods are particularly advantageous when dealing with multi-
collinearity and datasets with high dimensionality.
Classification and regression techniques, which are essential tools in the field of ma-
chine learning, are widely used in diverse domains including finance, healthcare,
marketing, and engineering.
Clustering and association are two separate categories of unsupervised learning tasks
in the realm of machine learning, each serving different purposes and employing var-
ied methodologies.
Clustering entails the process of grouping similar data points together based on
their inherent characteristics, without any predefined labels. The primary aim is to
discover inherent groupings or clusters within the data. For example, in the context
of customer segmentation, clustering algorithms can effectively divide customers into
distinct groups based on their purchasing behavior, demographics, or other relevant
features. A well-known algorithm for clustering is the K-means algorithm, which as-
signs data points to clusters by minimizing the distance between each point and the
centroid of its assigned cluster. Another technique, called “hierarchical clustering,”
constructs a hierarchical structure of clusters by recursively merging or splitting clus-
ters based on their similarity.
On the contrary, association rule learning seeks to uncover intriguing connections
or associations amidst various variables within extensive datasets. The primary em-
phasis is on identifying rules that describe how frequently items or attribute values
occur together, as in market basket analysis.
Types of Clustering
K-means clustering: The technique of K-means clustering divides the data into a pre-
established quantity (k) of clusters, aiming to minimize the distance between data
points and the centroid of their designated cluster. This iterative process updates cent-
roids until convergence is reached. K-means clustering is extensively utilized owing to
its straightforwardness and effectiveness, although it necessitates the prior specifica-
tion of the cluster count.
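A minimal sketch of this idea with scikit-learn, using a tiny set of hypothetical 2-D points chosen purely for illustration, might look as follows:
import numpy as np
from sklearn.cluster import KMeans
# Two small, well-separated groups of 2-D points (assumed for illustration)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
# The number of clusters k must be specified in advance; here k = 2
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print("Cluster labels:   ", kmeans.labels_)
print("Cluster centroids:", kmeans.cluster_centers_)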
DBSCAN: DBSCAN (density-based spatial clustering of applications with noise) groups
together points that lie in densely packed regions of the feature space and necessi-
tates the specification of two parameters: epsilon, which denotes the maximum dis-
tance between points for them to be considered part of the same cluster, and minPts,
which represents the minimum number of points required to form a dense region.
Mean shift clustering: Mean shift clustering detects clusters by iteratively shifting
candidate centroids towards regions of higher data density. Like hierarchical cluster-
ing, it does not necessitate the a priori specification of the cluster quantity and can
autonomously ascertain a suitable number of clusters from the distribution of the data.
Types of Association
Apriori algorithm: The Apriori algorithm, which is widely recognized, is a notable al-
gorithm for learning association rules. This algorithm is employed to identify frequent
itemsets in transactional datasets. By generating candidate itemsets and subsequently
eliminating ones that do not meet the minimum support criteria, the algorithm effec-
tively identifies frequent itemsets. From these frequent itemsets, association rules are
derived, thereby providing valuable insights into the probability of one item being pur-
chased given the purchase of another item.
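As a rough illustration (using the third-party mlxtend library, which is an assumption here rather than something introduced in the text), frequent itemsets can be mined from a small one-hot encoded transaction table like this:
import pandas as pd
from mlxtend.frequent_patterns import apriori
# Tiny one-hot transaction table: each row is a basket, each column an item
transactions = pd.DataFrame({
    'bread':  [1, 1, 0, 1, 1],
    'butter': [1, 1, 0, 0, 1],
    'milk':   [0, 1, 1, 1, 1],
}).astype(bool)
# Keep only itemsets that appear in at least 60% of the baskets
frequent_itemsets = apriori(transactions, min_support=0.6, use_colnames=True)
print(frequent_itemsets)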
Eclat algorithm: The Eclat algorithm is an additional and widely recognized algo-
rithm for association rule learning. It explores frequent itemsets by intersecting trans-
action tidsets. Through the utilization of the downward closure property of support, it
effectively detects frequent itemsets.
Various types of clustering and association techniques play a crucial role in the do-
main of data mining and pattern recognition. They facilitate the identification of sig-
nificant patterns and valuable insights from extensive datasets across a wide range of
fields, including market basket analysis, customer segmentation, and recommenda-
tion systems.
4.2 Overfitting and Regularization
The concept of balancing bias and variance is a fundamental aspect of the domain of
machine learning. This balance is essential to attain an equilibrium between the bias
and variance of a specific model. Bias represents the difference between the model’s
expected prediction and the true value, while variance measures the variability in the
model’s predictions across various training datasets.
Let us consider a simple example that involves the application of a polynomial re-
gression model to a collection of data points. Suppose we have a dataset that consists of
only one characteristic, labeled as “x,” and its associated target variable, labeled as “y.”
Our aim is to construct a polynomial regression model that can effectively forecast the
value of “y” based on “x”. To accomplish this task, we can express our model in the
following manner:
y = β₀ + β₁x + β₂x² + . . . + βₙxⁿ + ε
where ε represents the error term, and β₀, β₁, . . ., βₙ are the coefficients of the polyno-
mial terms.
The model’s bias can be measured by determining the disparity between the an-
ticipated forecast of the model and the actual value, which can be computed as:
Bias(x) = E[f̂(x)] − f(x)
where f̂(x) represents the predicted value by the model, f(x) is the true value, and
E[f̂(x)] is the expected value of the predictions over different training datasets.
On the contrary, the model’s variance quantifies the amount of variation in pre-
dictions at a specific point when considering different instances of the model. It can
be computed as:
Var(x) = E[(f̂(x) − E[f̂(x)])²]
The aim is to identify a model that attains an optimal balance between bias and vari-
ance. A model that displays a high level of bias but a low level of variance, like a lin-
ear regression model, may oversimplify the underlying relationship within the data,
thus resulting in systematic errors (underfitting). On the other hand, a model that ex-
hibits low bias but high variance, such as a high-degree polynomial regression model,
may capture the noise present in the training data, leading to an increased suscepti-
bility to fluctuations in the training set (overfitting).
To exemplify the trade-off between bias and variance, we shall examine the pro-
cess of fitting polynomial regression models to a specific dataset, where the degrees of
the polynomials differ. In this scenario, as the degree of the polynomial rises, the bias
decreases (resulting in a more flexible model capable of capturing more complex rela-
tionships within the data), while the variance increases (causing the model to be
more sensitive to fluctuations in the training data).
By choosing the suitable degree of a polynomial, our aim is to find a harmonious
equilibrium between bias and variance that reduces the total error (the sum of the
squared bias and variance). This objective is frequently achieved through methods
like cross-validation or regularization, which work to address overfitting by penaliz-
ing overly complex models.
In summary, the bias-variance trade-off highlights the fundamental trade-off that
exists between the bias and variance of machine learning models. Understanding this
trade-off is essential when choosing the appropriate complexity of a model and pre-
venting cases of underfitting or overfitting in real-world applications.
The concept of balancing bias and variance is a pivotal principle in the domain of
machine learning. It entails achieving an equilibrium between a model’s capacity to
precisely grasp the intrinsic patterns within a dataset (reduced bias) and its adaptability
to various datasets (reduced variance). Numerous Python libraries are accessible, pro-
viding resources and methodologies for comprehending and handling this trade-off.
scikit-learn, a Python library, is extensively employed for a range of machine
learning undertakings, encompassing both supervised and unsupervised learning. Al-
though scikit-learn does not furnish distinct functions for measuring bias and vari-
ance, it does present an array of instruments for assessing models. These instruments
consist of cross-validation, learning curves, and validation curves, which serve the
purpose of evaluating the bias and variance of a model.
An example of utilizing learning curves in scikit-learn to visually represent the
bias-variance trade-off is as follows:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
# Load data and define model
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
# Training and validation accuracy for increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(model, X, y, cv=5, shuffle=True, random_state=42)
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training accuracy')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation accuracy')
plt.legend()
plt.show()
Fig. 4.1 presents a visual representation of the bias-variance tradeoff, illustrating the
relationship between the number of samples and the model’s accuracy. The graph de-
picts how the model’s performance, measured by its accuracy, varies as the size of the
training dataset changes.
TensorFlow and Keras: TensorFlow and its associated high-level API, Keras, pro-
vide a diverse range of tools and methodologies for the construction and training of
deep learning models. These libraries furnish functionalities that enable the imple-
mentation of regularization techniques, dropout, and early stopping, all of which
serve to effectively address the bias-variance trade-off.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
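A minimal sketch of how these pieces might be combined is shown below; the 4-feature input, 3-class output, and placeholder training data are assumptions made purely for illustration, not the book's own example:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping
# Small network with an L2 weight penalty and dropout to reduce variance
model = Sequential([
    tf.keras.Input(shape=(4,)),   # 4 input features assumed for illustration
    Dense(64, activation='relu', kernel_regularizer=l2(0.01)),
    Dropout(0.5),                 # randomly deactivate half of the units during training
    Dense(3, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Stop training when the validation loss stops improving
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# Placeholder training data (4 features, 3 classes), assumed for illustration
X_train = np.random.rand(120, 4).astype('float32')
y_train = np.random.randint(0, 3, size=120)
model.fit(X_train, y_train, validation_split=0.2, epochs=20, callbacks=[early_stop], verbose=0)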
where J(θ) is the cost function, h_θ(x⁽ⁱ⁾) is the predicted value for the ith example, y⁽ⁱ⁾
is the true value, and m is the number of training examples.
In the context of regularization in the L2 norm, which is alternatively referred to as
“Ridge regularization,” an additional term is incorporated into the cost function that is
directly proportional to the square of the magnitude of the coefficients of the model:
J_L2(θ) = J(θ) + λ ∑ⱼ₌₁ⁿ θⱼ²
We fit a linear regression model to this dataset using both L1 and L2 regularization with
λ = 0.1. After training the models, we examine the values of the coefficients θ0 and θ1 .
With L2 regularization, the resulting coefficients might be:
θ0 = 0.5
θ1 = 0.9
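The dataset referred to above is not reproduced in this excerpt, so the sketch below uses hypothetical numbers; it simply illustrates how L2 (Ridge) and L1 (Lasso) fits can be compared in scikit-learn, with the alpha parameter playing the role of λ:
import numpy as np
from sklearn.linear_model import Ridge, Lasso
# Hypothetical one-feature dataset (assumed for illustration)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.1])
# alpha corresponds to the regularization strength lambda
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
print("Ridge intercept and coefficient:", ridge.intercept_, ridge.coef_)
print("Lasso intercept and coefficient:", lasso.intercept_, lasso.coef_)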
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Logistic Regression model with L1 regularization
model = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)  # C is the inverse of the regularization strength
model.fit(X_train_scaled, y_train)
# Evaluate model
accuracy = model.score(X_test_scaled, y_test)
print("Accuracy:", accuracy)
Accuracy: 0.9666666666666667
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load data and create a train/test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
4.3 Evaluation Metrics
Evaluation metrics are essential instruments for assessing the effectiveness of ma-
chine learning models, providing valuable information on the model’s ability to gen-
eralize and make accurate predictions on new data. These metrics measure various
aspects of model performance, such as accuracy, precision, recall, and F1 score,
among others. The understanding and choice of appropriate evaluation metrics are
crucial for evaluating the effectiveness of a model and guiding strategies to improve
its performance.
Accuracy is a commonly used evaluation metric for classification tasks, which as-
sesses the proportion of correctly classified instances out of the total number of in-
stances. However, solely relying on accuracy may not provide a comprehensive
representation of model performance, especially when dealing with imbalanced data-
sets where one class is dominant. In such cases, precision and recall become essential
metrics. Precision measures the proportion of true positive predictions among all pos-
itive predictions, while recall measures the proportion of true positive predictions
among all actual positive instances. The F1 score, which is the harmonic mean of pre-
cision and recall, achieves a balance between these two metrics, making it a suitable
choice for imbalanced datasets.
In the context of regression tasks, the evaluation metrics encompass three meas-
ures: mean squared error (MSE), mean absolute error (MAE), and R-squared (R^2)
score. MSE serves as a quantification of the average squared difference between the
predicted and true values, while MAE provides a measure of the average absolute dif-
ference. On the other hand, the R^2 score quantifies the degree to which the model
explains the variance, with higher values indicating a stronger fit.
The area under the receiver operating characteristic curve (AUC-ROC) and the
area under the precision-recall curve (AUC-PR) are additional evaluation metrics that
are commonly utilized in binary classification tasks. AUC-ROC assesses the balance be-
tween the sensitivity (true positive rate) and the false positive rate, providing valuable
insights into the model’s capacity to distinguish between positive and negative instan-
ces across different thresholds. In the context of imbalanced datasets where precision
and recall are of utmost importance, AUC-PR concisely summarizes the performance
of the precision-recall curve.
Cross-validation is a commonly used method for assessing model performance,
particularly in situations where there is a lack of data. It involves partitioning the da-
taset into several subsets, training the model on one subset, and then assessing its per-
formance on the remaining subset. This procedure is repeated multiple times, and the
average performance across the subsets is calculated to obtain a reliable estimate of
the model’s performance.
In summary, the assessment criteria play a pivotal role in the assessment of the
effectiveness of machine learning models in various tasks and datasets. Through care-
ful selection of suitable assessment criteria and utilization of methods like cross-
validation, experts can gain a valuable understanding of model performance and
make informed choices to improve model accuracy and generalizability.
The assessment of the efficacy of machine learning models in tasks involving categori-
cal output variables is heavily reliant on metrics for classification. These metrics pro-
vide valuable insights into the model’s capacity to accurately categorize instances into
various classes, allowing for the evaluation of its accuracy, precision, recall, and other
performance-related factors.
Accuracy is a noteworthy indicator among the fundamental metrics employed in clas-
sification. It measures the ratio of accurately classified instances to the total number of
instances. The calculation for accuracy entails the application of the subsequent formula:
TP + TN
Accuracy =
TP + TN + FP + FN
TP
Precision =
TP + FP
Recall, also referred to as sensitivity or true positive rate, quantifies the ratio of accu-
rate positive forecasts in relation to the entirety of genuine positive occurrences, and
its computation can be accomplished by:
TP
Recall =
TP + FN
The F1 score, known as the harmonic mean of precision and recall, offers an equilib-
rium between these two metrics and proves particularly advantageous when dealing
with imbalanced datasets. One can compute it using the subsequent formula:
Precision✶ Recall
F1 = 2✶
Precision + Recall
Consider a spam classifier evaluated on a set of emails, with a confusion matrix (shown
in the original figure) tabulating predictions for the “Spam” and “Not spam” classes.
Using this confusion matrix, we can derive the accuracy, precision, recall, and F1
score of the model.
In this particular instance, the model achieved a level of correctness of 85%, signifying
that 85% of the forecasts were accurate. The precision of 87.5% signifies that among
all emails predicted as spam, 87.5% were in fact spam. The recall of 93.3% indicates
that the model accurately identified 93.3% of all genuine spam emails. The F1 score of
90.3% offers a balanced evaluation of precision and recall, taking into account both
false positives and false negatives.
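The underlying counts are not reproduced in this excerpt; one confusion matrix consistent with the percentages above (assuming 100 emails in total) is TP = 70, FP = 10, FN = 5, TN = 15, and the short sketch below recomputes the metrics from those assumed counts:
# Assumed confusion-matrix counts, consistent with the figures quoted above
TP, FP, FN, TN = 70, 10, 5, 15
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(f"Accuracy: {accuracy:.1%}")    # 85.0%
print(f"Precision: {precision:.1%}")  # 87.5%
print(f"Recall: {recall:.1%}")        # 93.3%
print(f"F1 score: {f1:.1%}")          # 90.3%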
Metrics for classification offer valuable insights into the performance of a model by
quantifying its accuracy, precision, recall, and F1 score. Comprehending these metrics is
vital for assessing the efficacy of classification models and making well-informed deci-
sions in applications of machine learning.
Several Python libraries offer functionality for computing metrics commonly
used for evaluating classification models.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Example true and predicted labels (assumed for illustration)
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]
# Compute accuracy
accuracy = accuracy_score(y_true, y_pred)
# Compute precision
precision = precision_score(y_true, y_pred)
# Compute recall
recall = recall_score(y_true, y_pred)
# Compute F1 score
f1 = f1_score(y_true, y_pred)
TensorFlow and PyTorch: While TensorFlow and PyTorch are primarily deep learn-
ing libraries, they also offer functionality for computing classification metrics. These
libraries are particularly useful when working with neural network models.
import tensorflow as tf
# Compute accuracy
accuracy = tf.keras.metrics.Accuracy()
accuracy.update_state(y_true, y_pred)
accuracy_result = accuracy.result().numpy()
Pandas and NumPy: Pandas and NumPy constitute essential libraries utilized for the
manipulation of data and numerical computations within the Python programming
language. Despite their lack of dedicated functions for the computation of classifica-
tion metrics, they are frequently employed in tandem with other libraries to prepro-
cess data and manually derive metrics.
import numpy as np
import pandas as pd
# Example true and predicted labels (assumed for illustration)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])
# Compute accuracy
accuracy = np.mean(y_true == y_pred)
# Compute precision
true_positives = np.sum((y_true == 1) & (y_pred == 1))
false_positives = np.sum((y_true == 0) & (y_pred == 1))
precision = true_positives / (true_positives + false_positives)
# Compute recall
false_negatives = np.sum((y_true == 1) & (y_pred == 0))
recall = true_positives / (true_positives + false_negatives)
# Compute F1 score
f1 = 2 * (precision * recall) / (precision + recall)
Metrics for regression are essential in evaluating the effectiveness of machine learn-
ing models when the output variable is continuous. These metrics provide valuable
insights into the model’s ability to accurately predict numerical values and assist in
evaluating its precision, accuracy, and other performance-related aspects.
One of the main metrics used for regression analysis is the MSE, which measures
the average squared difference between predicted and actual values. Its calculation
involves the use of the following formula:
MSE = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
where n is the number of instances, yᵢ is the true value, and ŷᵢ is the predicted value
for the ith instance.
MAE is another metric that is frequently employed. It quantifies the average abso-
lute disparity between the predicted and actual values. The calculation involves deter-
mining the absolute difference:
MAE = (1/n) ∑ᵢ₌₁ⁿ |yᵢ − ŷᵢ|
These metrics provide different perspectives on model performance. MSE and RMSE
penalize large errors more heavily, making them sensitive to outliers. MAE, on the
other hand, treats all errors equally and is more robust to outliers.
Let’s consider a numerical example to illustrate these metrics. Suppose we have a
dataset with five instances:
(x₁, y₁) = (1, 3)
(x₂, y₂) = (2, 5)
(x₃, y₃) = (3, 7)
(x₄, y₄) = (4, 9)
(x₅, y₅) = (5, 11)
Suppose the model’s predictions yield absolute errors of 0, 1, 1, 1, and 1 on these five
instances. Using these predictions, it is feasible to compute the MSE, MAE, and RMSE of
the model:
MSE = (1/5) × (0² + 1² + 1² + 1² + 1²) = 0.8
MAE = (1/5) × (0 + 1 + 1 + 1 + 1) = 0.8
RMSE = √MSE = √0.8 ≈ 0.894
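To check such a calculation in code, a minimal sketch using scikit-learn's metrics (with hypothetical predictions chosen so the absolute errors match the hand calculation above) could be:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error
# True targets from the example; predictions are hypothetical values with errors (0, 1, 1, 1, 1)
y_true = np.array([3, 5, 7, 9, 11])
y_pred = np.array([3, 6, 8, 10, 12])
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)
print("MSE:", mse, "MAE:", mae, "RMSE:", round(rmse, 3))  # 0.8, 0.8, 0.894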
TensorFlow and PyTorch: While TensorFlow and PyTorch are primarily deep learn-
ing libraries, they also offer functionality for computing regression metrics. These li-
braries are particularly useful when working with neural network models.
import torch
import torch.nn as nn
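A minimal sketch of computing these regression metrics with PyTorch's built-in loss functions, on hypothetical tensors assumed purely for illustration, might look like this:
import torch
import torch.nn as nn
# Hypothetical predictions and targets
y_pred = torch.tensor([2.5, 0.0, 2.1, 7.8])
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
mse = nn.MSELoss()(y_pred, y_true)   # mean squared error
mae = nn.L1Loss()(y_pred, y_true)    # mean absolute error
rmse = torch.sqrt(mse)
print("MSE:", mse.item(), "MAE:", mae.item(), "RMSE:", rmse.item())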
Pandas and NumPy: Pandas and NumPy are essential libraries utilized to manipulate
data and execute numerical computations within the Python programming language.
Although these libraries do not offer dedicated functionalities for the calculation of
regression metrics, they are frequently employed alongside other libraries to prepro-
cess data and manually compute metrics.
import numpy as np
import pandas as pd
4.4 Cross-Validation
scikit-learn: Scikit-learn is an influential Python library for machine learning that of-
fers extensive resources for data analysis and modeling. Within the sklearn model_se-
lection module, it incorporates a flexible KFold class that enables the partitioning of
the dataset into k folds, thereby facilitating cross-validation.
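A minimal sketch of 5-fold cross-validation with this class (using the Iris dataset and logistic regression purely for illustration) might look like this:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
# Split the data into 5 folds and score the model on each held-out fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
print("Fold accuracies:", np.round(scores, 3))
print("Mean accuracy:", scores.mean())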
TensorFlow and PyTorch: TensorFlow and PyTorch, notwithstanding their main pur-
pose as deep learning libraries, possess the capability to undertake k-fold cross-
validation. Nonetheless, given the absence of dedicated built-in features for cross-
validation, the process of implementing k-fold cross-validation may necessitate the in-
clusion of supplementary procedures.
import numpy as np
from sklearn.model_selection import KFold
import torch
Pandas and NumPy: Pandas and NumPy are essential libraries in Python for the ma-
nipulation of data and the computation of numerical values. Despite their lack of inher-
ent cross-validation capabilities, these libraries are frequently employed in tandem
with other libraries to preprocess data and facilitate cross-validation.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
LOOCV and stratified k-fold cross-validation are two extensively utilized techniques
for assessing the effectiveness of machine learning models.
LOOCV entails dividing the dataset into n folds, with n denoting the number of
instances in the dataset. In each iteration, one instance is excluded and used as the
validation set, while the model is trained on the remaining n-1 instances. This process
is repeated n times, with each instance serving as the validation set once. LOOCV pro-
vides a precise evaluation of model performance, although it can be computationally
demanding, especially for large datasets.
Let us illustrate the LOOCV methodology using a numerical example. We can as-
sume that we have a dataset comprising 100 instances. In the first iteration, the model
is trained on instances 2 to 100, and its performance is evaluated on instance 1. Mov-
ing on to the second iteration, the model is trained on instances 1 and 3 to 100, with
the performance assessed on instance 2. This process is repeated for all instances,
leading to the computation of a performance metric (such as accuracy) for each itera-
tion. The final estimation of the model’s performance is obtained by averaging the
performance metrics across all instances.
Stratified k-fold cross-validation, a variant of K-fold cross-validation, ensures that
each fold in the cross-validation process maintains the same class distribution as the
original dataset. This is of particular importance when dealing with imbalanced data-
sets, where one class may be disproportionately represented. The stratified approach
guarantees that each class is proportionately represented in both the training and val-
idation sets, thus yielding more reliable performance estimates.
Let us expound upon the concept of stratified k-fold cross-validation through the
use of a numerical illustration. Assume we are faced with a binary classification prob-
lem, which consists of a total of 100 instances. Out of these instances, 80 belong to
class 0 while the remaining 20 belong to class 1. Our objective is to apply stratified k-
fold cross-validation with a value of k equal to 5. By dividing the dataset into 5 folds,
we ensure that each fold maintains the original dataset’s class distribution. This guar-
antees that each fold contains a proportionate representation of both classes, result-
ing in more dependable performance evaluations.
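A minimal sketch of this scenario, reproducing the 80/20 class split described above with scikit-learn's StratifiedKFold and a placeholder feature matrix, might look like this:
import numpy as np
from sklearn.model_selection import StratifiedKFold
# 100 labels: 80 of class 0 and 20 of class 1, as in the example above
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder feature matrix (assumed for illustration)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Each test fold should contain roughly 16 samples of class 0 and 4 of class 1
    print(f"Fold {fold} test-fold class counts:", np.bincount(y[test_idx]))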
LOOCV and stratified k-fold cross-validation are two powerful methodologies uti-
lized for assessing the effectiveness of machine learning models. LOOCV provides an
accurate evaluation of model performance, although it comes at the expense of
computational overhead. On the other hand, stratified K-fold cross-validation ensures
that each fold preserves the class distribution of the original dataset, leading to more
reliable performance estimates, particularly when dealing with imbalanced datasets.
LOOCV and stratified k-fold cross-validation are commonly used methodologies
for evaluating the effectiveness of machine learning models. Several Python libraries
are available to facilitate the implementation of these methodologies.
LeaveOneOut: The LeaveOneOut class, which falls under the sklearn model_selection
module, creates an instance of LOOCV. This technique generates train/test indices that
divide the data into separate train/test sets, ensuring that each sample is used as a test
set exactly once.
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
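The fragment above only loads the data; a minimal sketch of actually running LOOCV (using logistic regression purely as an illustrative model) might continue as follows:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
# Each of the 150 iris samples is held out exactly once as the test set
loo = LeaveOneOut()
model = LogisticRegression(max_iter=200)
scores = cross_val_score(model, X, y, cv=loo)
print("Number of iterations:", len(scores))  # one per sample
print("LOOCV accuracy:", scores.mean())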
TensorFlow and PyTorch: Although TensorFlow and PyTorch are primarily recog-
nized as deep learning libraries, they can also be employed for implementing cross-
validation techniques. However, compared to scikit-learn, additional steps may be
required.
For TensorFlow: TensorFlow lacks built-in functions for cross-validation, thus neces-
sitating manual dataset splitting and model training within a loop to implement
LOOCV or stratified k-fold cross-validation. TensorFlow’s flexibility allows for custom-
ization based on specified requirements.
For PyTorch: Similar to TensorFlow, PyTorch does not provide built-in functions for
cross-validation. Cross-validation can be implemented manually using techniques like
k-fold splitting within a loop, and subsequently, the model can be trained and as-
sessed accordingly.
import numpy as np
from sklearn.model_selection import StratifiedKFold
import torch
import torch.nn as nn
import torch.optim as optim
optimizer.zero_grad()
outputs = model(X_train)
loss = criterion(outputs, y_train)
loss.backward()
optimizer.step()
mean_accuracy_skf = np.mean(accuracies_skf)
print("Mean accuracy with Stratified K-Fold (PyTorch):",
mean_accuracy_skf)
These libraries provide effective and reliable resources for implementing cross-
validation methods, enabling professionals to accurately assess the performance of
models and make well-informed choices regarding model selection and fine-tuning of
hyperparameters.
Summary
Exercise (MCQs)
4. Which technique involves grouping similar data points together based on their
features?
A) Clustering B) Association C) Classification D) Regression
7. Which cross-validation technique involves dividing the dataset into k folds and
using each fold as a test set exactly once?
A) k-Fold cross-validation
B) Leave-one-out cross-validation
C) Stratified k-fold cross-validation
D) Random split cross-validation
14. Which visualization technique helps analyze the relationship between training
size and model performance?
A) Learning curves
B) Validation curves
C) Residual plots
D) Confusion matrices
15. Which library offers high-level APIs for building and training deep learning mod-
els?
A) Matplotlib B) TensorFlow C) Scikit-learn D) PyTorch
16. Which technique is commonly used for reducing the dimensionality of high-
dimensional datasets?
A) Clustering
B) Association
C) Principal component analysis (PCA)
D) K-nearest neighbors (KNN)
Answers
1. A) Regularization
2. A) Supervised learning involves labeled data, while unsupervised learning in-
volves unlabeled data.
3. B) Regression
4. A) Clustering
5. A) Large parameter values
6. D) F1 score
7. A) k-Fold cross-validation
8. C) Underfitting and overfitting
9. B) TensorFlow
10. A) L1 regularization
11. A) Accuracy
12. A) Mean squared error (MSE)
13. B) Leave-one-out cross-validation
14. A) Learning curves
15. D) PyTorch
16. C) Principal component analysis (PCA)
Fill in the Blanks
1. In machine learning, overfitting occurs when the model learns to _______ the
training data and performs poorly on unseen data.
2. Supervised learning involves _______ data for training, while unsupervised learn-
ing deals with _______ data.
3. Classification predicts _______ class labels, while regression predicts _______ nu-
meric values.
4. Clustering groups _______ data points together based on their features, while asso-
ciation identifies _______ among variables.
5. L1 and L2 regularization techniques are used to prevent _______ in machine learn-
ing models.
6. The F1 score is a metric that combines both _______ and _______.
7. In k-fold cross-validation, the dataset is divided into _______ equal-sized folds.
8. The bias-variance trade-off balances model _______ and _______.
9. TensorFlow and PyTorch are popular libraries for building and training _______
models.
10. L1 regularization adds a penalty equivalent to the _______ of the magnitude of co-
efficients, while L2 regularization adds a penalty equivalent to the _______ of the
magnitude of coefficients.
11. Precision measures the ratio of _______ predictions to _______ predictions.
12. MSE stands for _______ and is calculated as the average of squared _______ between
predicted and actual values.
13. Keras, a high-level API for building neural networks, is now integrated into
_______ as its official high-level API.
14. Principal component analysis (PCA) is a technique commonly used for reducing
the _______ of high-dimensional datasets.
Answers
1. memorize
2. labeled, unlabeled
3. discrete, continuous
4. similar, patterns or relationships
5. overfitting
6. precision, recall
7. k
8. bias, variance
9. neural network
10. absolute value, square
11. true positive, all positive
12. mean squared error, differences
13. TensorFlow
14. dimensionality
Descriptive Questions
Chapter 5
Classic Machine Learning Algorithms
https://doi.org/10.1515/9783110697186-005
5.1 Linear Regression
Linear regression is a statistical technique employed for the purpose of depicting the
relationship between one or more independent variables and a continuous dependent
variable. It postulates that this relationship is approximately linear, indicating that
changes in the independent variables are associated with corresponding linear changes
in the dependent variable.
In the most basic configuration, referred to as Simple Linear Regression, there ex-
ists solely a single independent variable. The equation of the model is expressed as y
= mx + b, wherein y symbolizes the dependent variable, x symbolizes the independent
variable, m denotes the slope of the line (indicating the rate of alteration of y in rela-
tion to x), and b represents the y-intercept (which signifies the value of y when x
equals zero).
In Multiple Linear Regression, the inclusion of two or more independent variables
leads to the expansion of the model equation. This expanded equation encompasses all
the variables, denoted as y = b0 + b1x1 + b2x2 + . . . + bnxn. Here, b0 represents the inter-
cept, while b1, b2, . . ., bn correspond to the coefficients associated with the predictors
x1, x2, . . ., xn.
The primary aim of linear regression is to identify the optimal line (or hyper-
plane, when considering multiple variables) that minimizes the total sum of squared
discrepancies between the observed values of the dependent variable and the values
predicted by the linear equation. This procedure is frequently accomplished through
the utilization of the least squares method.
Linear regression models can be assessed using different metrics, such as the co-
efficient of determination (R^2), which quantifies the extent to which the independent
variables account for the variability in the dependent variable, and hypothesis tests
to determine the statistical significance of the regression coefficients.
The following are the underlying presumptions of linear regression: linearity, in-
dependence of the observations, homoscedasticity (constant variance of errors), nor-
mality of residuals (normal distribution of errors), and lack of multicollinearity (no
significant dependency between independent variables).
Linear regression is extensively employed in diverse domains, encompassing the
realms of economics, finance, social sciences, engineering, and natural sciences, to un-
dertake a multitude of tasks, including prognosticating forthcoming results, compre-
hending associations amid variables, and assessing the impacts of interventions or
treatments. Notwithstanding its uncomplicated nature, linear regression persists as
one of the most potent and comprehensible instruments within the statistical arsenal.
import matplotlib.pyplot as plt
hours_studied = [2, 3, 4, 5, 6]
exam_scores = [65, 70, 75, 80, 85]
plt.scatter(hours_studied, exam_scores)
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.title('Relationship Between Hours Studied and Exam Score')
plt.show()
Fig. 5.1 presents a scatter plot of the relationship between the number of hours of
study and the corresponding test scores before visually applying simple linear regres-
sion. The data points in the graph show the distribution of these two variables, allow-
ing a preliminary examination of the potential relationship between study time and
academic achievement.
The slope m is given by the least squares formula
m = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
where:
– n is the number of data points,
– ∑xy is the sum of the product of x and y,
– ∑x is the sum of x values,
– ∑y is the sum of y values, and
– ∑x² is the sum of the squares of x values.
Substituting the values for hours studied (x) and exam scores (y):
m = (5 × 1550 − 20 × 375) / (5 × 90 − 20²)
m = (7750 − 7500) / (450 − 400)
m = 250 / 50
m = 5
The intercept b is obtained from b = (∑y − m∑x) / n:
b = (375 − 5 × 20) / 5
b = (375 − 100) / 5
b = 275 / 5
b = 55
(Scatter plot of exam scores (Y – Dependent) versus hours studied (X – Independent),
with the fitted regression line; see Fig. 5.2.)
Fig. 5.2 shows a linear regression best fit line superimposed on a scatter plot of test
scores versus hours of practice. A line of best fit is a linear model calculated to best
represent the relationship between two variables, minimizing the deviation between
the actual data points and the predicted values on the line.
y = (5 × 7) + 55 = 90
So, the predicted exam score for a student who studies 7 h is 90.
import numpy as np
# Given data
hours_studied = np.array([2, 3, 4, 5, 6])
exam_scores = np.array([65, 70, 75, 80, 85])
# Least-squares fit: slope m and intercept b
m, b = np.polyfit(hours_studied, exam_scores, 1)
predicted = m * hours_studied + b
print("R^2:", round(1 - np.sum((exam_scores - predicted)**2) / np.sum((exam_scores - exam_scores.mean())**2), 2))
R^2: 1.0
Thus, the R^2 value is 1.00 and the MSE is 0.0. These measurements show how well
the model matches the data, with the number of hours studied accounting for almost
100% of the variation in exam scores.
Let’s look at another scenario in which we wish to estimate a house’s cost depend-
ing on its square footage. We will use the scikit-learn toolkit and Python to do simple
linear regression.
sklearn.linear_model
This module in scikit-learn contains various classes for linear models, including re-
gression models.
LinearRegression class
– To fit a linear regression model to data, use the LinearRegression class offered
by sklearn.linear_model.
– Modeling the relationship between a dependent variable and one or more inde-
pendent variables using a linear technique is known as linear regression.
– In simple linear regression (as used in the example), there is only one indepen-
dent variable. However, scikit-learn’s LinearRegression class can handle multi-
ple independent variables as well (Multiple Linear Regression).
– Ordinary least squares (OLS) regression is implemented by the LinearRegression
class. It finds the best-fitting line (or hyperplane) through the data points by mini-
mizing the sum of squared residuals.
– The predict() function can be used to create predictions once the model has been
trained using training data via the fit() method.
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Example data: house size (in square feet) and corresponding prices
house_sizes = np.array([800, 1000, 1200, 1500, 1800]).reshape(-1, 1)  # reshape to a column vector
house_prices = np.array([100000, 150000, 180000, 210000, 250000])
# Fit an ordinary least squares model and plot the fitted line
model = LinearRegression()
model.fit(house_sizes, house_prices)
plt.scatter(house_sizes, house_prices, label='Observed prices')
plt.plot(house_sizes, model.predict(house_sizes), color='red', label='Fitted line')
plt.legend()
plt.show()
Fig. 5.3 shows an example of using linear regression to predict house prices based on
property size. The Scatter plot shows the relationship between the square footage or
total living area of the homes and their corresponding sale prices. The line of best fit,
estimated using linear regression techniques, is superimposed on the data points, rep-
resenting a linear model that aims to capture the relationship between household size
and the objective variable of house price.
y = b0 + b1x1 + b2x2 + . . . + bnxn
The dependent variable in this equation is y, while the independent variables are x1,
x2, . . ., xn. The coefficients (or weights) b1, b2, . . ., bn correspond to the respective
independent variables, and b0 is the intercept term.
Methods like ordinary least squares (OLS), which minimize the sum of squared
discrepancies between actual and predicted values, are used to estimate the coeffi-
cients. Metrics like MSE and R2 are used to assess the model’s performance and deter-
mine its goodness of fit.
Multiple linear regression enables the modeling of more intricate relationships
and interactions among multiple predictors, making it a versatile tool in diverse fields
such as finance, economics, engineering, and social sciences. However, it assumes lin-
earity, independence of predictors, constant variance of errors, and normally distrib-
uted residuals, all of which should be examined before interpretation.
A multiple linear regression model’s coefficients are interpreted by evaluating
each independent variable’s effect on the dependent variable while holding the other
variables constant. Assuming that all other variables stay constant, the coefficients
show how the dependent variable changes for every unit change in the corresponding
independent variable.
Multiple linear regression can be performed using various methods, each with its
advantages and disadvantages.
– The most common method used to fit a multiple linear regression model is ordi-
nary least squares.
– Its goal is to reduce the total sum of squared differences between the dependent
variable’s expected and observed values.
– OLS estimates the coefficients (weights) for the independent variables by deter-
mining the values that minimize the residual sum of squares (RSS).
– This approach yields unbiased coefficient estimates and is computationally effi-
cient and simple to apply.
To execute OLS in Python, the statsmodels library can be employed, which offers a
convenient interface for fitting statistical models, including linear regression. Below
is an outline of the steps involved in performing OLS using statsmodels:
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Example dataset
data = {
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 3, 4, 5, 6],
    'Y': [2, 3, 4, 5, 6]
}
df = pd.DataFrame(data)
# Add an intercept term and fit the OLS model
X = sm.add_constant(df[['X1', 'X2']])
y = df['Y']
model = sm.OLS(y, X).fit()
print(model.summary())
– Import the required libraries first: statsmodels.api as sm for OLS model fitting, pandas for dataset processing, and numpy for numerical operations.
– Describe a sample dataset that has the dependent variable Y and the independent
variables X1 and X2.
– Use Pandas to structure the data into a DataFrame df.
5.1 Linear Regression 199
– Use the sm.add_constant() function to add an intercept term to the independent variables before fitting the OLS model; it augments the DataFrame with a column of ones representing the intercept term.
– Specify which variables are dependent (y) and which are independent (X).
– Fit OLS model using sm.OLS(y, X).fit(), pass the dependent variable y and the inde-
pendent variables X.
– Finally, print the summary of the fitted model using model.summary(), which
provides detailed information about the regression coefficients, standard errors,
p-values, and goodness-of-fit statistics such as R-squared and adjusted R-squared.
Fig. 5.4 provides a visual representation of the ordinary least squares (OLS) method,
which is a fundamental technique employed in linear regression analysis.
The OLS regression model is summarized in the program output, which is dis-
played above. It includes data like coefficients, standard errors, p-values, R-squared,
and more. This synopsis sheds light on the connections between the independent and
dependent variables as well as the model’s general goodness-of-fit.
Gradient Descent is a technique used in linear regression to choose the best coefficients for minimizing the cost function, which is commonly expressed as the Mean Squared Error. To do this, the coefficients are updated in the direction opposite to the cost function’s gradient.
– One popular optimization approach that seeks to reduce the cost function – such
as the Mean Squared Error – is gradient descent. The coefficients are adjusted
iteratively to accomplish this.
– Gradient Descent, when used in multiple linear regression, updates the coefficients by
advancing in the direction of the cost function’s steepest descent.
– Gradient Descent can also be applied by processing the data in batches or mini-batches
when faced with massive datasets that are too large to fit into memory.
– It is noteworthy, however, that Gradient Descent might not converge to the in-
tended global minimum, but rather to a local minimum. Thus, it becomes vital to
fine-tune hyperparameters such as the learning rate.
import numpy as np
import matplotlib.pyplot as plt
# Example data: y = 4 + 3x plus noise
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]  # add an intercept column
eta = 0.1            # learning rate
n_iterations = 1000
m = 100              # number of data points
theta = np.random.randn(2, 1)  # random initialization of the coefficients
# Gradient Descent
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta -= eta * gradients
plt.scatter(X, y, label='Data')
plt.plot(X, X_b.dot(theta), color='red', label='Regression line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Gradient Descent Linear Regression')
plt.legend()
plt.show()
print("Final coefficients (intercept, slope):", theta.ravel())
– Create a sample set of data called X and Y, where X represents a feature and Y the
desired variable.
– Use np.c_[np.ones((100, 1)), X] to add an intercept term to the independent variables.
– Establish the settings for Gradient Descent, including the number of data points
(m), the number of iterations (n_iterations), and the learning rate (eta).
– Randomly initialize the coefficients theta.
– Update the coefficients theta using the gradient of the cost function with respect to the
coefficients and perform Gradient Descent for a predetermined number of iterations.
– Use Matplotlib to plot the regression line and the data points.
– Print the regression line’s final coefficients at the end.
Program output is a plot with data points and the regression line fitted with gradient
descent. Furthermore, the final regression line coefficients, which reflect the slope
and intercept, are printed.
Fig. 5.5 illustrates the concept of gradient descent, a widely used optimization algo-
rithm in the context of linear regression.
– This method involves solving the normal equation θ = (XᵀX)⁻¹Xᵀy, where X is the ma-
trix of independent variables, y is the vector of the dependent variable, and (XᵀX)⁻¹
is the inverse of the matrix XᵀX.
– The multiple linear regression model’s coefficients can be solved using matrix in-
version in the form of an algebraic statement.
– This operation can be resource-intensive when applied to extensive datasets,
owing to the requirement of calculating the inverse of a matrix. This is particu-
larly true when the matrix is not well conditioned.
– Nevertheless, matrix inversion ensures precise solutions without the necessity for it-
erative optimization, rendering it advantageous for datasets of smaller proportions.
import numpy as np
import matplotlib.pyplot as plt
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]  # add an intercept column
# Normal equation: theta = (X^T X)^(-1) X^T y
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
plt.scatter(X, y, label='Data')
plt.plot(X, X_b.dot(theta_best), color='red', label='Regression line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Multiple Linear Regression with Matrix Inversion Method')
plt.legend()
plt.show()
Selection of Method
– The selection of methodology is contingent upon various factors, including the ex-
tent of the dataset, the availability of computational resources, and the necessity
for interpretability.
– Ordinary least squares (OLS) is the preferred approach for datasets of a smaller
scale and when interpretability is of utmost importance.
– Gradient Descent is well-suited for extensive datasets and environments that fa-
cilitate parallel processing.
– Matrix Inversion is a suitable technique for datasets ranging from small to mod-
erately sized, provided that computational resources are sufficient.
The process of using an ‘n-th degree’ polynomial function to analyze regression data
in order to determine the relationship between the independent variable (represented
by ‘x’) and the dependent variable (represented by ‘y’) is known as polynomial regres-
sion. Unlike simple linear regression, which assumes a linear relationship between
the variables, polynomial regression allows complex and non-linear relationships to
be represented in an efficient manner.
Fig. 5.6 presents a visual representation of a multi-linear regression model, which in-
volves more than one independent variable, fitted using the Matrix Inversion method.
Working Principle
1. Model Representation: Polynomial regression represents the relationship between
x and y by a polynomial function of degree n, y = θ₀ + θ₁x + θ₂x² + ⋯ + θₙxⁿ + ε,
where θ₀, θ₁, . . ., θₙ are the coefficients and ε denotes the error term.
2. Degree of Polynomial: The model’s complexity is based on the degree of the poly-
nomial. Greater degrees have the potential to overfit but can also capture more
complex interactions.
3. Model Fitting: In polynomial regression, the model is fitted to the training set of
data using methods such as matrix inversion, gradient descent, or ordinary least
squares (OLS).
4. Evaluation: Metrics such as cross-validation scores, the R² score, and MSE are used to assess
the model’s performance.
5. Prediction: Once trained, the model can be used to generate predictions on fresh data points.
For instance, consider a scenario where we have data on the relationship between the
temperature (x) and the rate of ice cream sales (y). Simple linear regression may not
capture the non-linear relationship adequately. In such cases, polynomial regression,
such as quadratic or cubic regression, can be used to better model the curvature in
the relationship, potentially improving predictive accuracy.
Applications
– Curve fitting: Polynomial regression is commonly used in curve fitting applica-
tions where the relationship between variables is non-linear.
– Engineering and physics: It is widely used in engineering and physics to model
relationships between variables in physical systems.
– Economics: In economics, polynomial regression can model relationships be-
tween economic variables that exhibit non-linear behavior.
With Python, we can utilize libraries like NumPy and scikit-learn to conduct polyno-
mial regression. NumPy will be utilized for numerical operations, and scikit-learn of-
fers useful methods for regression modeling and polynomial features. We’ll visualize
the data using Matplotlib.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Example data: roughly y = 2x with some added noise
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * X.ravel() + np.random.randn(50)
# Polynomial features
poly_features = PolynomialFeatures(degree=2)  # Quadratic regression
X_poly = poly_features.fit_transform(X)
# Fit a linear regression model on the polynomial features
poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)
# Predictions
y_pred = poly_reg.predict(X_poly)
# Plot the data points and the fitted curve
plt.scatter(X, y, label='Data')
plt.plot(X, y_pred, color='red', label='Polynomial fit')
plt.legend()
plt.show()
# Print coefficients
print("Intercept:", poly_reg.intercept_)
print("Coefficients:", poly_reg.coef_)
– Import NumPy for numerical computations and Matplotlib for data visualization.
– Generate example data with a linear relationship (y = 2x) with some added noise.
– Use PolynomialFeatures from scikit-learn to generate polynomial features up to
degree 2 (quadratic regression).
– Utilizing the polynomial characteristics, fit a linear regression model.
– Based on the fitted model, make predictions.
– Matplotlib can be used to plot the regression line and the data points.
– The polynomial regression model’s coefficients should then be printed.
The resulting output of the program depicts a graphical representation that exhibits
both the data points as well as the regression line, which has been fitted utilizing poly-
nomial regression, more specifically quadratic regression for this particular instance.
Furthermore, it also provides a printed representation of the coefficients associated
with the polynomial regression model, encompassing both the intercept and the coef-
ficients pertaining to the polynomial features.
Working Principle
1. Logistic regression models the probability of the positive class with the sigmoid (logistic) function:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-z}}$$

The linear combination of feature values, denoted as z, can be expressed as z = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ, where θ represents the coefficients.
2. Decision boundary: Logistic regression makes predictions about class labels by
determining if the predicted probability exceeds a specified threshold, usually set
at 0.5. The decision boundary is the hypersurface that distinguishes the different
classes within the feature space.
3. Training: The model undergoes training through the utilization of optimization
algorithms such as gradient descent or Newton’s method, with the purpose of
identifying the most favorable coefficients that lead to the minimization of the
logistic loss function.
Advantages
Limitations
Applications
Example
(Table of each student’s feature values x and y with the corresponding Pass/Fail label.)
Step 3: Training
We employ optimization algorithms such as gradient descent to identify the most favor-
able coefficients (θ0, θ1, θ2) that minimize the logistic loss function. The logistic loss func-
tion gauges the disparity between the anticipated probabilities and the factual categories.
Step 4: Prediction
Once the logistic regression model has undergone training, it becomes capable of fore-
casting the likelihood of success in the examination for incoming students, taking into
consideration their respective characteristic values. To illustrate, if a fresh student
possesses x = 2.8 and y = 3.7, we are able to anticipate the probability of success by em-
ploying the acquired coefficients.
Step 5: Evaluation
We assess the model’s performance by employing various metrics such as accuracy,
precision, recall, and F1-score on a distinct validation or test dataset. These metrics
serve as indicators of the model’s ability to accurately predict the true classes.
Example
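The worked numbers of the original example are not reproduced here; the following minimal sketch (with purely hypothetical coefficient values standing in for the trained ones) shows how the prediction for the new student with x = 2.8 and y = 3.7 would be computed:

import numpy as np

# Hypothetical trained coefficients (theta0, theta1, theta2); the real values
# would come from fitting the logistic regression model on the exam dataset
theta = np.array([-4.0, 1.0, 1.0])
x_new = np.array([1.0, 2.8, 3.7])    # [intercept term, x, y]

z = theta.dot(x_new)                  # linear combination of the features
probability = 1 / (1 + np.exp(-z))    # sigmoid gives P(pass | x, y)
print("Predicted probability of passing:", probability)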
So, the logistic regression model predicts with high probability that the new student
will pass the exam.
Working Principle
– The aim of binary classification is to acquire knowledge about the relationship
between input features and discrete binary labels, commonly represented as 0
or 1, which signify the two classes.
– Dataset Preparation: The dataset is segregated into independent variables (fea-
tures) and their corresponding dependent variable (binary labels). Each data
point comprises feature values and the associated class label.
– Model Selection: Multiple algorithms, including but not limited to Logistic Regres-
sion, Decision Trees, Support Vector Machines (SVM), Random Forests, and Neural
Networks, have the potential to be utilized for the purpose of binary classification.
– Model Training: The selected algorithm is subjected to training with the aid of a
labeled dataset, enabling it to grasp the intricate patterns and interconnections
between various features and class labels. During the duration of the training
procedure, the algorithm modifies its internal parameters with the objective of
minimizing a pre-established loss or error function.
– Model Evaluation: The evaluation of the trained model is conducted by employing
assessment metrics such as accuracy, precision, recall, F1-score, and receiver op-
erating characteristic (ROC) curve. These metrics are utilized to measure the mod-
el’s capacity to make accurate predictions of the true class labels on data that has
not been previously observed.
– Prediction: After the completion of the training and evaluation process, the model
can be employed for the purpose of making predictions on fresh data instances.
This is achieved by classifying each instance into one of the two classes, which is
based on the feature values associated with that particular instance.
Evaluation Metrics
– The accuracy metric denotes the proportion of accurately classified instances out
of the total number of instances.
– Precision, on the other hand, signifies the ratio of true positive predictions to all posi-
tive predictions, which highlights the model’s efficacy in minimizing false positives.
– Recall, also referred to as sensitivity, measures the proportion of true positive
predictions to all actual positive instances, thus demonstrating the model’s capa-
bility to detect all positive instances.
– The F1-score, as the harmonic mean of precision and recall, provides a balanced
evaluation of these two measures.
– ROC-AUC evaluates the area under the Receiver Operating Characteristic curve,
thereby evaluating the model’s capability to distinguish between positive and
negative classes at different thresholds.
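In terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), these metrics take the standard forms:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$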
Applications
Binary classification is applied in diverse domains, encompassing a wide array of
applications:
– Medical Diagnosis: Identifying diseases based on symptoms or medical test results.
– Email Spam Detection: Classifying emails as spam or legitimate.
– Credit Risk Assessment: Predicting whether a loan applicant is likely to default.
– Fraud Detection: Identifying fraudulent transactions in financial systems.
– Sentiment Analysis: Determining whether a text conveys a positive or negative tone.
Example
Suppose we have a dataset containing information about customer transactions, in-
cluding transaction amount, time of transaction, and whether the transaction is fraud-
ulent (1) or legitimate (0). By training a binary classification model on this dataset, we
can predict whether future transactions are likely to be fraudulent, enabling proac-
tive measures to prevent fraudulent activities.
Let’s consider a Python code to perform binary classification using logistic regression,
along with plots and evaluation metrics. We’ll use the famous Iris dataset available in sci-
kit-learn, where we’ll classify whether a given iris flower is of the “setosa” species or not.
– Load the Iris dataset and construct a binary target label indicating whether each
flower belongs to the “setosa” species.
– Divide the dataset into two sets for the purpose of training and testing.
– Employ a logistic regression model to train the data for classification.
– Generate predictions using the test data and assess the model’s performance by
considering accuracy, precision, recall, F1-score, and the confusion matrix.
– Represent the data points together with the decision boundary in a visual manner
to facilitate classification visualization.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)
# Load the Iris dataset; keep the first two features for a 2-D plot
iris = load_iris()
X = iris.data[:, :2]
y = (iris.target == 0).astype(int)  # 1 = setosa, 0 = not setosa
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a logistic regression classifier
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("Confusion Matrix:\n", conf_matrix)
# Plot decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
sns.scatterplot(x=X_test[:, 0], y=X_test[:, 1], hue=y_test)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('Binary Classification (Setosa vs. Not Setosa)')
plt.show()
Example
Consider a dataset consisting of images depicting handwritten digits ranging from 0
to 9. Each individual image is represented as a matrix encompassing pixel values. The
primary objective is to accurately classify each image into one of ten possible digit
classes, specifically 0, 1, 2, . . ., 9.
Working Principle
The fundamental principle behind multiclass classification is to acquire knowledge of
the mapping from input features to discrete class labels. In this context, each data
point is assigned to one and only one class.
– Dataset preparation: The dataset is divided into distinct features, which represent
the input data, and their corresponding class labels. It should be noted that each data
point is characterized by multiple features and is assigned to one of several classes.
– Model selection: There are numerous algorithms that can be utilized for multi-
class classification, such as Logistic Regression, Decision Trees, Random Forests,
Support Vector Machines (SVM), and Neural Networks. Some algorithms inher-
ently support multiclass classification, while others can be extended using techni-
ques like One-vs-Rest (OvR) or One-vs-One (OvO), as illustrated in the sketch after this list.
– Model training: The selected algorithm undergoes training using a labeled data-
set, where it learns the underlying patterns and relationships between the fea-
tures and class labels. Throughout the training process, the algorithm adjusts its
internal parameters to minimize a predefined loss or error function.
– Model evaluation: The performance of the trained model is assessed using vari-
ous metrics, including accuracy, precision, recall, F1-score, and confusion matrix.
These metrics provide an evaluation of how effectively the model predicts the
true class labels for unseen data.
– Prediction: Once the model has been trained and evaluated, it can be utilized to
make predictions on new data instances. This involves assigning each instance to
one of the multiple classes based on its feature values.
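To make the One-vs-Rest strategy mentioned above concrete, here is a brief sketch (an illustration added here, not from the original text) using scikit-learn’s OneVsRestClassifier on the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Wrap a binary classifier so that one model is trained per class
X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=200)).fit(X, y)
print("Number of binary classifiers trained:", len(ovr.estimators_))  # 3, one per class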
Evaluation Metrics
– Accuracy: Accuracy is defined as the ratio of correctly classified instances to the
total number of instances.
– Precision: Precision, on the other hand, is the ratio of true positive predictions
for each class to all positive predictions for that class.
– Recall (sensitivity): Recall, also known as sensitivity, is the ratio of true positive
predictions for each class to all actual positive instances for that class.
– F1-score: The F1-score, which is the harmonic mean of precision and recall,
serves as a means of striking a balance between the two.
– Confusion matrix: A confusion matrix is a table that displays the counts of true
positive, true negative, false positive, and false negative predictions for each
class.
Applications
Multiclass classification has numerous real-world applications across various do-
mains, including:
– Handwritten digit recognition
– Speech recognition
– Image classification
– Natural language processing (e.g., sentiment analysis, topic classification)
– Medical diagnosis (e.g., disease classification)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)
# Load the Iris dataset (three classes) and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a multiclass logistic regression model
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = model.predict(X_test)
# Evaluate with macro-averaged precision, recall, and F1
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("Confusion Matrix:\n", conf_matrix)
sns.heatmap(conf_matrix, annot=True, cmap='Blues')
plt.show()
– The outcomes produced by the code encompass a range of metrics, which include
accuracy, precision, recall, F1-score, as well as the confusion matrix.
– Accuracy serves as a measurement of the proportion of instances that are cor-
rectly classified out of the total number of instances.
– Precision, recall, and F1-score provide valuable insights into the effectiveness of
the model in classifying each individual class.
– The confusion matrix offers a comprehensive breakdown of the number of true
positive, true negative, false positive, and false negative predictions for each class.
– Additionally, the code generates a heatmap of the confusion matrix to visually depict
the performance of the model. The diagonal elements of the matrix correspond to
accurate classifications, while the off-diagonal elements indicate misclassifications.
Working Principle
– The primary objective of logistic regression is to ascertain the optimal coefficients
that minimize the logistic loss function and effectively classify instances into their
corresponding classes.
– Regularization term: Regularization incorporates a penalty component into the
loss function, which in turn discourages the existence of coefficients with large
values. When it comes to logistic regression, the two prevalent methods of regularization are L1 (Lasso) and L2 (Ridge) regularization.
Example
Suppose a logistic regression model is constructed with two features, namely x1 and
x2, alongside a binary target variable denoted as y, which signifies whether a cus-
tomer will engage in purchasing a product (1) or not (0). In the absence of regulariza-
tion, the model may exhibit impeccable fitting to the training data, yet its ability to
generalize to novel data may be limited.
By implementing L1 or L2 regularization, a penalty term is incorporated into the
loss function, taking into account the magnitude of the coefficients. For instance,
when utilizing L2 regularization, the loss function is modified.
$$\text{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right] + \lambda \sum_{j=1}^{p} \theta_j^2$$
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)
# Load the data and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# L2-regularized logistic regression (a smaller C means stronger regularization)
model = LogisticRegression(penalty='l2', C=1.0, max_iter=200).fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
conf_matrix = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("Confusion Matrix:\n", conf_matrix)
Decision Trees and Random Forests have gained considerable popularity as machine
learning algorithms employed in the domains of classification and regression tasks.
Decision Trees
Definition: Decision Trees are hierarchical structures that bear semblance to trees.
Each internal node within these structures represents a decision predicated on a spe-
cific characteristic. Meanwhile, each branch signifies the resulting outcome of said de-
cision. Finally, each leaf node serves as a representation of the ultimate decision or
prediction.
Working principle: Decision Trees repeatedly divide the feature space by consider-
ing the values of input features. At each node, the algorithm selects the feature that
most effectively separates the data into similar subsets. This process persists until ei-
ther all data points are assigned to the same category or a predetermined stopping
criterion is met.
Advantages:
Interpretability: Decision Trees possess a straightforward and comprehensible nature,
rendering them appropriate for elucidating the rationale behind predictions.
Random Forests
Definition: Random Forests are a type of ensemble learning methodology that consists of
a group of Decision Trees. Every individual tree in the forest is trained on a randomly
chosen subset of the training data along with a randomized subset of features. The pre-
dictions made by each tree are then combined to formulate the ultimate prediction.
Working principle: Random Forests combine the predictive power of multiple Deci-
sion Trees to improve generalization performance and mitigate overfitting. In the
training phase, each tree is grown independently using a random subset of the train-
ing data and features. The final prediction is obtained by averaging the predictions
made by all the trees (for regression) or by majority voting (for classification).
Advantages:
Enhanced generalization: Random Forests alleviate the problem of overfitting by ag-
gregating the predictions of numerous individual trees.
Robustness: They adeptly handle noisy data and outliers due to the amalgamation effect.
Feature importance: They furnish a metric of feature importance, affording
users the ability to discern the most pertinent features for prediction.
Disadvantages:
Complexity: Random Forests exhibit greater intricacy compared to individual Deci-
sion Trees, rendering them more arduous to interpret.
Computational cost: Training and predicting with Random Forests can incur consid-
erable computational expenses, particularly when confronted with voluminous datasets.
Applications
Decision Trees and Random Forests are extensively utilized in diverse domains, en-
compassing but not limited to:
– Finance: The assessment of creditworthiness and the identification of fraudulent activities.
– Healthcare: The determination of diseases and prognostic evaluation.
Decision Trees and Random Forests are formidable machine learning algorithms with
their individual merits and demerits. While Decision Trees proffer transparency and
simplicity, Random Forests furnish enhanced generalization aptitude and resilience
via ensemble learning.
The process of constructing Decision Trees entails iteratively dividing the dataset ac-
cording to the input feature values, resulting in the formation of a hierarchical struc-
ture resembling a tree. In this structure, internal nodes signify decisions, while leaf
nodes indicate the ultimate prediction.
– Upon the satisfaction of the stopping criteria, the initiation of the construction of
leaf nodes takes place. These leaf nodes encapsulate the label of the majority
class that is found within the subset. As a result, the tree is empowered to gener-
ate precise predictions by leveraging the provided data.
In summary, the process of building decision trees involves a series of carefully executed
steps, including the selection of an optimal split, the creation of decision nodes and leaf
nodes, and the iterative process of recursive splitting. These steps ultimately result in the
creation of a powerful and interpretable model for making data-driven decisions.
Fig. 5.11 shows the decision tree for credit card approval.
In this example, the decision tree splits the dataset based on the credit score fea-
ture. If an applicant’s credit score is 700 or higher, they are approved for the loan;
otherwise, they are denied.
Let us examine a more intricate numerical illustration of constructing a decision
tree to anticipate whether consumers will procure a product, relying on demographic
and behavioral attributes.
Example: Predicting Purchase Decision
Suppose we have a dataset of customers containing the following features:
– Age (numeric)
– Gender (categorical: Male, Female)
– Income (numeric)
– Website Visit Duration (numeric)
– Product Reviews (numeric)
And the target variable indicates whether the customer made a purchase (Yes or No).
– Selecting the Optimal Division: Our initial step involves the selection of the char-
acteristic that yields the most homogeneous subsets within the dataset. As an il-
lustration, we may discover that segmenting the data based on age leads to
subsets exhibiting the highest degree of purity.
– Generation of Decision Nodes: At the apex of the tree, we establish a decision
node that represents the chosen characteristic and its division point. For example,
if the optimal division is age < 30, we generate a decision node labeled “Age < 30?”
– Subdivision of the Data: The dataset is partitioned into subgroups based on the
values of the chosen attribute. Each subgroup represents a branch originating
from the decision node.
– Recursive Division: We perform the aforementioned process recursively for each
subset until one of the specified stopping conditions is fulfilled:
– All data points within a subset pertain to the same class (homogeneous).
– The maximum depth of the tree has been reached.
– The minimum number of data points within a node has been reached. No sig-
nificant enhancement in the reduction of impurity is observed.
– Generation of Terminal Nodes: Upon fulfillment of the stopping conditions, we
generate terminal nodes that contain the majority class label found within the
respective subset.
Fig. 5.12 depicts a decision tree model that predicts whether a consumer will procure
a specific product or not.
In this particular illustration, the decision tree forecasts a transaction in the
event that the patron is below 30 years of age and possesses an income that falls
below 50 K, or if they are of the male gender. In any other case, the decision tree pro-
ceeds to divide based on the duration of the visit to the website, projecting a transac-
tion if the duration of the visit falls below 10 min.
The scikit-learn library, often employed for Decision Trees, is widely utilized in
the field. This library is a robust tool for machine learning, offering a range of algo-
rithms, such as Decision Trees, to construct models that can make accurate predic-
tions. A comprehensive elucidation of scikit-learn’s DecisionTreeClassifier can be
found below:
– The DecisionTreeClassifier is a class implemented in the scikit-learn library,
which serves as an implementation of the Decision Tree algorithm specifically de-
signed for classification tasks.
– The DecisionTreeClassifier exhibits a prominent characteristic in its ability to ac-
commodate multiple splitting criteria, namely Gini impurity and entropy (infor-
mation gain). By specifying the criterion parameter as either “gini” or “entropy,”
users possess the flexibility to employ the most appropriate criterion for their
classification requirements.
– In order to mitigate the issue of overfitting and improve the overall ability to
make accurate predictions on unseen data, the DecisionTreeClassifier implemen-
tation in scikit-learn provides various pruning techniques. These techniques in-
volve the application of constraints, such as the maximum depth of the tree, the
minimum number of samples allowed in each leaf, and the minimum number of
samples required for a split.
– In terms of handling categorical features, scikit-learn’s DecisionTreeClassifier expects
numeric input, so categorical variables should first be converted using one-hot encoding
or integer encoding.
– Missing values likewise generally need to be handled during preprocessing (for example
by imputation); only recent scikit-learn releases add native support for missing values in
tree splits.
– After training, the DecisionTreeClassifier provides a feature_importances_ attri-
bute, which allows users to assess the significance of each feature in the predic-
tion process.
– The training procedure for a DecisionTreeClassifier model encompasses the in-
stantiation of an object from the DecisionTreeClassifier class and the subsequent
invocation of the fit method. This particular method necessitates the feature ma-
trix (X_train) and target vector (y_train) as its input parameters.
– Once the DecisionTreeClassifier model has been trained, it can be utilized to
make predictions on fresh data by making use of the predict method and provid-
ing the feature matrix of the new data.
– When considering the visualization of decision trees, one can employ tools like
graphviz and matplotlib. In order to simplify the process of visualization, scikit-
learn provides the plot_tree function, which enables the direct visualization of
the decision tree.
– Users have the ability to assess the effectiveness of a DecisionTreeClassifier
model by utilizing different evaluation metrics, including accuracy, precision, re-
call, F1-score, and the confusion matrix. These metrics can be accessed through
the metrics module in the sklearn library.
Example:
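The original listing is not reproduced in this copy; a minimal sketch consistent with the steps described above (assuming the Iris dataset and a Gini-based tree, both assumptions) might look like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load data and split into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gini-based tree with a depth limit to reduce overfitting
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set and report accuracy
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))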
Output:
Accuracy: 1.0
Below is an illustrative Python code snippet showcasing the implementation of
Decision Trees for regression on the Diabetes dataset from the scikit-learn library.
The code encompasses various steps such as dataset loading, model training using DecisionTreeRegressor, prediction on the test set, evaluation with the mean squared error, and visualization of the fitted tree.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, y_train)
print("Mean Squared Error:", mean_squared_error(y_test, model.predict(X_test)))
plot_tree(model, filled=True)
plt.show()
Output:
Mean Squared Error: 4976.797752808989
The MSE computation is executed by the code in order to determine the disparity
between the target values (y_test) and the predicted values (y_pred) on the test set.
MSE represents the average squared deviation between the actual and predicted val-
ues, thus providing an indication of the model’s precision.
Fig. 5.13 presents a decision tree model trained on a diabetes dataset. The tree struc-
ture consists of internal nodes representing tests or decisions based on various fea-
tures or attributes.
The act of visualizing the decision tree is facilitated by the plot_tree function
from scikit-learn. It portrays the structure of the decision tree through the utilization
of nodes and branches. Each node corresponds to a decision based on a specific fea-
ture, while each leaf node corresponds to the predicted target value.
The data’s output and visualization provide valuable insights into the perfor-
mance and decision-making process of the Decision Tree Regression model, which has
been trained on the Diabetes dataset. Understanding the organization of the decision
tree and interpreting its nodes and branches is essential for gaining insights into the
relationships between characteristics and the target variable. Furthermore, evaluat-
ing the model’s performance by utilizing metrics such as MSE assists in assessing its
accuracy and effectiveness in generating predictions.
Entropy and Information Gain are principles utilized in Decision Trees for the pur-
pose of ascertaining the most optimal attribute to divide the data at each node.
Entropy
Entropy quantifies the degree of impurity or uncertainty within a given dataset, and
it is determined through the analysis of the dataset’s distribution in terms of various
classes or categories.
Formula: For a dataset S with K classes and proportion $p_i$ of class i:

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{K} p_i \log_2(p_i)$$
For example, for a dataset of 10 instances containing 6 instances of class A and 4 of class B:

Proportion of class A: $p_A = \frac{6}{10} = 0.6$

Proportion of class B: $p_B = \frac{4}{10} = 0.4$

$$\mathrm{Entropy}(S) = -\left(0.6 \log_2(0.6) + 0.4 \log_2(0.4)\right) \approx 0.971$$
Information Gain
Information Gain assesses the decrease in entropy or uncertainty obtained by divid-
ing the dataset according to a specific characteristic. It quantifies the extent to which
a characteristic imparts knowledge about the class labels.
For a dataset S with N instances, split into m subsets $S_j$ (each with $N_j$ instances) based on feature A:

$$IG(S, A) = \mathrm{Entropy}(S) - \sum_{j=1}^{m} \frac{N_j}{N}\,\mathrm{Entropy}(S_j)$$
Suppose that we are interested in partitioning the dataset according to a specific fea-
ture, denoted as X, thereby yielding two subsets, S1 and S2, which consist of 7 and 3
instances, respectively.
Entropy before the split: $\mathrm{Entropy}(S) = 0.971$

Entropy of subset $S_1$: $\mathrm{Entropy}(S_1) = -\left(0.7 \log_2(0.7) + 0.3 \log_2(0.3)\right)$
Consider the following dataset, where each instance records the Outlook, the Temperature, and a Yes/No class label:

Outlook     Temperature    Class
Sunny       Hot            No
Sunny       Hot            No
Overcast    Hot            Yes
Rainy       Mild           Yes
Rainy       Cool           Yes
Rainy       Cool           No
Overcast    Cool           Yes
Sunny       Mild           No
Sunny       Cool           Yes
Rainy       Mild           Yes
Sunny       Mild           Yes
Overcast    Mild           Yes
Overcast    Hot            Yes
Rainy       Mild           No
For Outlook:
– Split the dataset based on Outlook (Sunny, Overcast, Rainy)
– Calculate the proportion of instances in each subset and the corresponding
entropy.
– Weighted average of entropies to calculate Information Gain.
$$IG(S, \text{Outlook}) = \mathrm{Entropy}(S) - \left[\tfrac{5}{14}\,\mathrm{Entropy}(S_{\text{Sunny}}) + \tfrac{4}{14}\,\mathrm{Entropy}(S_{\text{Overcast}}) + \tfrac{5}{14}\,\mathrm{Entropy}(S_{\text{Rainy}})\right]$$

$$= 0.796 - \left[\tfrac{5}{14} \times 0.971 + \tfrac{4}{14} \times 0 + \tfrac{5}{14} \times 0.971\right]$$
For Temperature:
– Split the dataset based on Temperature (Hot, Mild, Cool)
– Calculate the proportion of instances in each subset and the corresponding
entropy.
– Weighted average of entropies to calculate Information Gain.
$$IG(S, \text{Temperature}) = \mathrm{Entropy}(S) - \left[\tfrac{4}{14}\,\mathrm{Entropy}(S_{\text{Hot}}) + \tfrac{6}{14}\,\mathrm{Entropy}(S_{\text{Mild}}) + \tfrac{4}{14}\,\mathrm{Entropy}(S_{\text{Cool}})\right]$$

$$= 0.796 - \left[\tfrac{4}{14} \times 0.811 + \tfrac{6}{14} \times 0.918 + \tfrac{4}{14} \times 0.811\right]$$
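For readers who want to verify such calculations programmatically, the following sketch (added here, not part of the original text; it assumes the unlabeled Yes/No column of the table is named 'Class') computes the entropy and information gains with pandas:

import numpy as np
import pandas as pd

def entropy(labels):
    """Shannon entropy of a pandas Series of class labels."""
    probs = labels.value_counts(normalize=True)
    return -np.sum(probs * np.log2(probs))

def information_gain(df, feature, target):
    """Reduction in entropy obtained by splitting df on `feature`."""
    weighted = sum(len(subset) / len(df) * entropy(subset[target])
                   for _, subset in df.groupby(feature))
    return entropy(df[target]) - weighted

# The weather table above, with the Yes/No column named 'Class'
data = pd.DataFrame({
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
                'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
                    'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Class': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
              'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

print("Entropy(S):", entropy(data['Class']))
print("IG(S, Outlook):", information_gain(data, 'Outlook', 'Class'))
print("IG(S, Temperature):", information_gain(data, 'Temperature', 'Class'))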
Random Forests and Bagging, specifically ensemble learning techniques, are utilized
in order to augment the effectiveness and robustness of machine learning models by
merging numerous individual models.
Bagging, also known as Bootstrap Aggregating, is a methodology in which numer-
ous foundational models, often in the form of decision trees, are independently
trained on randomly selected subsets of the training data. This selection process in-
volves replacement, resulting in each base model acquiring unique insights into the
underlying data due to the inherent randomness. The predictions generated by these
models are subsequently combined through averaging or aggregation techniques to
formulate the final prediction.
For instance, consider a dataset comprising 1,000 instances, where the objective is
to train a decision tree. In the context of bagging, the following steps are undertaken:
1. A random sample, consisting of 70% of the instances, is selected (with replace-
ment) to train the initial decision tree.
2. This process is repeated multiple times to train several decision trees, with each
model utilizing a distinct subset of the data.
3. To formulate a prediction, the predictions derived from all decision trees are ag-
gregated, employing methods such as averaging in the case of regression or vot-
ing for classification.
By adhering to the principles of bagging, the aforementioned steps facilitate the crea-
tion of an ensemble model capable of leveraging the diverse perspectives acquired by
the individual base models.
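To make the bootstrap-sampling step concrete, the following small sketch (an added illustration, assuming the 1,000-instance dataset and the 70% sample size described above) draws the index sets that the individual base models would be trained on:

import numpy as np

rng = np.random.default_rng(0)
n_instances, sample_size, n_models = 1000, 700, 5

# Each base model receives its own bootstrap sample, drawn with replacement
bootstrap_samples = [rng.choice(n_instances, size=sample_size, replace=True)
                     for _ in range(n_models)]

# Sampling with replacement means each sample contains duplicates and omits
# some instances, so every base model sees a different view of the data
print("Distinct instances seen by the first model:", len(np.unique(bootstrap_samples[0])))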
Random Forests is a method that builds on the idea of bagging by introducing an
additional element of randomness. This is accomplished by choosing a random subset
of features at each decision tree node. The aim of this random feature selection is to
decrease the correlation between the trees, thus reducing the risk of overfitting and
improving the overall generalization capability.
Example
Continuing with the previous example, in Random Forests:
1. When training each decision tree, instead of using all features, we randomly se-
lect a subset of features.
2. The subset of characteristics is employed to ascertain the optimal division at
every node within the tree.
3. The variability in the process of selecting features guarantees that every decision
tree within the collection acquires distinct characteristics of the dataset, thereby
resulting in a more heterogeneous assortment of models.
4. Predictions are aggregated as in bagging.
Advantages
– The phenomenon of overfitting can be effectively mitigated through the utiliza-
tion of Bagging and Random Forests, as these methods employ a strategy of aver-
aging predictions obtained from numerous models, each of which is trained on
distinct subsets of the available data.
– Random Forests are particularly proficient at enhancing the generalization capa-
bilities of models by introducing an additional element of randomness, thereby
fostering the development of more diverse models that exhibit superior perfor-
mance in terms of generalization.
– Ensembling techniques possess a significant advantage in their capacity to tackle
the problem of noise within the data. Through the aggregation of predictions
from various models, ensembling aids in mitigating the influence of noisy data
points and outliers, thereby enhancing the resilience of the approach.
The Python libraries commonly employed for Random Forests and Bagging are pri-
marily implemented by scikit-learn, a well-known Python library for machine learn-
ing. Within this context, we find the main libraries utilized to carry out Random
Forests and Bagging:
scikit-learn (sklearn)
– ensemble: The ensemble module in scikit-learn provides classes for implement-
ing ensemble learning techniques, including Random Forests and Bagging.
– RandomForestClassifier: This class implements the Random Forest algorithm
for classification tasks.
– RandomForestRegressor: This class implements the Random Forest algorithm
for regression tasks.
– BaggingClassifier: This class implements the Bagging ensemble method for clas-
sification tasks.
– BaggingRegressor: This class implements the Bagging ensemble method for re-
gression tasks.
– These classes provide easy-to-use interfaces for training ensemble models and
making predictions.
numpy (np)
– NumPy is an essential module for conducting scientific computations in the Py-
thon programming language.
– It offers comprehensive assistance for dealing with multi-dimensional arrays and
matrices, which are crucial for effectively managing data in various machine
learning algorithms.
– In the context of Random Forests and Bagging, numpy is often used for data ma-
nipulation and numerical computations.
matplotlib.pyplot (plt)
– matplotlib.pyplot is a plotting library used for creating visualizations in Python.
– It provides functions for creating various types of plots, such as line plots, scatter
plots, and histograms.
– In the context of Random Forests and Bagging, matplotlib.pyplot is used to visual-
ize decision boundaries and other relevant plots for model evaluation.
These libraries present a comprehensive array of tools for the implementation and
assessment of Random Forests and Bagging algorithms in the Python programming
language. They offer proficient implementations of these ensemble techniques and
provide supplementary functionalities for data preprocessing, evaluation, and visuali-
zation, rendering them indispensable for the construction of resilient machine learn-
ing models.
Below is a Python code example demonstrating the use of Random Forests and
Bagging with the Iris dataset from scikit-learn. It includes loading the dataset, training
Random Forest and Bagging classifiers, making predictions, evaluating the models,
and visualizing the decision boundaries.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from itertools import product
# Load the Iris dataset (first two features only, for visualization)
iris = load_iris()
X = iris.data[:, :2]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Instantiate classifiers
dt = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)
# Train classifiers
dt.fit(X_train, y_train)
rf.fit(X_train, y_train)
bagging.fit(X_train, y_train)
# Make predictions
y_pred_dt = dt.predict(X_test)
y_pred_rf = rf.predict(X_test)
y_pred_bagging = bagging.predict(X_test)
# Evaluate classifiers
accuracy_dt = accuracy_score(y_test, y_pred_dt)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
print("Decision Tree accuracy:", accuracy_dt)
print("Random Forest accuracy:", accuracy_rf)
print("Bagging accuracy:", accuracy_bagging)
Fig. 5.14: Comparisons of accuracy in Decision Tree, Random Forest, and Bagging.
Fig. 5.14 shows the plots for comparisons of accuracy in Decision Tree, Random Forest,
and Bagging.
The code uses the Iris dataset, which contains 150 samples with 4 features each
(sepal length, sepal width, petal length, and petal width). For visualization purposes,
only the first two features are used.
Output
– The accuracy of the Decision Tree, Random Forest, and Bagging classifiers when
applied to the test data is displayed.
– Decision boundaries of each classifier are plotted to visualize their performance
in separating different classes.
Explanation
– The dataset is initially divided into training and testing sets.
– Subsequently, Decision Tree, Random Forest, and Bagging classifiers are instanti-
ated. By employing the fit method, each classifier is trained on the training data.
– Predictions are generated on the test data through the utilization of the predict
method.
– Using the accuracy_score function from scikit-learn, accuracy scores are com-
puted for each classifier.
– Eventually, decision boundaries are plotted to visually represent how each classi-
fier separates the classes within the feature space.
Visualization
Decision boundaries are plotted for each classifier, showing regions where each class
is predicted. Different colors represent different classes, and data points are plotted
as markers. Decision boundaries help visualize the classification performance of each
classifier in the feature space.
The support vector machine (SVM) is a supervised learning algorithm utilized for clas-
sification and regression purposes. It demonstrates notable effectiveness in high-
dimensional spaces and possesses the ability to effectively capture intricate relation-
ships within data. SVMs function by identifying the optimal hyperplane that separates
classes within the feature space. This process maximizes the margin between classes
while simultaneously minimizing classification errors.
The essential principles of SVM comprise support vectors, which are the data
points closest to the hyperplane, and the margin, which denotes the distance between
the hyperplane and the support vectors.
The objective of SVMs is to maximize this margin, thereby endowing the decision
boundary with resilience against noise and outliers. In situations where classes are
not able to be linearly separated, SVMs can employ kernel functions to map input fea-
tures into a space of higher dimensionality. This allows for the establishment of non-
linear decision boundaries.
The versatility of SVMs is underscored by their manifold applications, including
text classification, image recognition, and bioinformatics. Due to their effectiveness
and scalability, SVMs are prevalently utilized in both binary and multiclass classifica-
tion problems. Nevertheless, it should be noted that SVMs can impose a notable
computational burden and necessitate the judicious selection of hyperparameters in
order to achieve optimal performance. All in all, SVMs are a potent tool for classifica-
tion tasks, as they offer flexibility, accuracy, and robustness in their treatment of di-
verse datasets.
Linear support vector machines: Linear Support Vector Machines classify data by
detecting the optimal hyperplane that efficiently separates classes in the feature
space. They perform particularly well when the data is linearly separable.
Kernel support vector machines: Kernel Support Vector Machines augment the ca-
pabilities of linear Support Vector Machines through the application of kernel func-
tions (such as polynomial and radial basis functions) to convert the input features
into a higher-dimensional space, thereby establishing non-linear decision boundaries.
In an efficient manner, Kernel Support Vector Machines effectively manage data that
is not linearly separable.
Working Principle
– Support vector machines (SVMs) strive to discover the hyperplane that maximizes
the margin, which refers to the distance between the hyperplane and the closest
data points (known as support vectors) from each class.
– In the case of linearly separable data, the equation for the hyperplane can be ex-
pressed as w⋅x + b = 0, where w represents the weight vector, x denotes the fea-
ture vector, and b signifies the bias term.
– The margin is computed as 2/‖w‖, and the objective is to maximize this margin
while minimizing classification errors.
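As an added illustration (not from the original text), the margin 2/‖w‖ can be read off a fitted scikit-learn linear SVM; the snippet below assumes a small synthetic, linearly separable dataset:

import numpy as np
from sklearn import svm

# Two linearly separable clusters
X = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
y = [-1] * 20 + [1] * 20

clf = svm.SVC(kernel='linear', C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("Hyperplane: w =", w, ", b =", b)
print("Margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("Support vectors:\n", clf.support_vectors_)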
Data Preparation: We start by preparing our dataset, consisting of features (x1 and x2)
and corresponding class labels.
Model Training:
– Next, the SVM model is trained on the dataset.
– The SVM algorithm aims to identify the optimal hyperplane that effectively sepa-
rates the two classes in the feature space.
– The linear SVM hyperplane equation is denoted as w⋅x + b = 0, where w represents
the weight vector, x denotes the feature vector, and b signifies the bias term.
– The objective is to determine the w and b parameters that maximize the margin
between the hyperplane and the nearest support vectors, which are the data
points of each class.
Model Evaluation:
– Once the training process of the model is completed, we proceed to assess its per-
formance by utilizing a range of metrics including accuracy, precision, recall, and
F1-score.
– Additionally, we employ the technique of visualizing the decision boundary to
gain insights into the effectiveness of the Support Vector Machine (SVM) in segre-
gating the classes within the feature space.
The execution of Support Vector Machine (SVM) commonly involves the utilization of
Python libraries, which are predominantly provided by scikit-learn (sklearn), a popu-
lar machine learning library in the Python programming language. In the subsequent
discussion, we will introduce the main libraries employed in the implementation
of SVM.
scikit-learn (sklearn)
– SVR (Support Vector Regressor): This class implements the SVM algorithm for re-
gression tasks.
– These classes provide flexible and easy-to-use interfaces for training SVM models,
making predictions, and tuning hyperparameters.
numpy (np)
matplotlib.pyplot (plt)
These libraries offer a comprehensive set of tools for implementing and evaluating
SVM algorithms in Python. They provide efficient implementations of SVM models,
support for data manipulation and numerical computations, and functionalities for
visualization and model evaluation. By leveraging these libraries, users can easily
build, train, and evaluate SVM models for various classification and regression tasks.
The Linear Support Vector Machine (SVM) is a supervised learning technique utilized
for the purpose of binary classification tasks. Its main objective is to ascertain the
most suitable hyperplane that efficiently partitions the classes within the feature
space. A comprehensive explanation of the Linear SVM is provided hereafter:
Objective: The aim of the Linear Support Vector Machine (SVM) is to identify the hyper-
plane possessing the utmost margin, thereby distinguishing the classes within the fea-
ture space.
Margin: The margin is defined as the separation between the hyperplane and the
closest support vectors of each class, in terms of distance. The objective of the linear
SVM is to optimize this margin, aiming to maximize it.
Optimization: The formulation of Linear SVM involves an optimization problem that
aims to minimize the norm of the weight vector w, while ensuring that the constraint
yi(w⋅xi + b) ≥ 1 holds for all training instances (xi,yi).
Classification: Upon completion of training, Linear SVM assigns new data points to
classes based on their position in relation to the hyperplane. Data points falling on
one side are assigned to one class, while those on the other side are assigned to the
other class.
Kernel trick: While the performance of Linear SVM is commendable in the case of
linearly separable data, the utilization of kernel functions allows for the transforma-
tion of data into a higher-dimensional space. This transformation effectively renders
the data linearly separable, thereby facilitating the classification of data that is not
linearly separable.
The Linear Support Vector Machine (SVM) algorithm exhibits remarkable efficacy in
executing binary classification tasks, demonstrating robust resilience to errors, sub-
stantial computational efficiency, and the capability to manage voluminous datasets.
It discerns the most suitable hyperplane that effectively discriminates between dis-
tinct classes within the feature space, thereby contributing to its extensive utilization
in various machine learning tasks.
Let us contemplate a binary classification problem which entails two distinct fea-
tures, denoted as x1 and x2. In the scenario at hand, we are presented with a dataset
that can be described as follows:
(Table of values of the two features x1 and x2 together with the corresponding Class label.)
Model Evaluation:
The evaluation of the Linear Support Vector Machine (SVM) model is conducted by
making predictions on the test data and comparing them with the actual labels. To
measure the performance of the model, different metrics such as accuracy, precision,
recall, and F1-score are computed. Additionally, the decision boundary, which repre-
sents the hyperplane that effectively separates the two classes, is presented visually.
Let’s consider a Python code to perform Linear SVM:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
# Synthetic data: two Gaussian clusters centered at (-2, -2) and (2, 2)
X = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
y = [-1] * 20 + [1] * 20
# Train a linear SVM and recover the hyperplane parameters w and b
clf = svm.SVC(kernel='linear').fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
x0_min, x0_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x1_min, x1_max = X[:, 1].min() - 1, X[:, 1].max() + 1
plt.scatter(X[:, 0], X[:, 1], c=y)
# Plot hyperplane: w0*x0 + w1*x1 + b = 0  =>  x1 = -(w0*x0 + b) / w1
plt.plot([x0_min, x0_max], [-(w[0]*x0_min + b)/w[1], -(w[0]*x0_max + b)/w[1]], 'k--')
plt.xlim(x0_min, x0_max)
plt.ylim(x1_min, x1_max)
plt.show()
Fig. 5.15 presents a scatter plot visualization of the dataset prior to applying the Sup-
port Vector Machine (SVM) algorithm. The plot displays the data points, each repre-
senting an individual observation or instance, distributed across two dimensions or
features. These features are represented by the x-axis and y-axis, respectively.
Fig. 5.16 presents a scatter plot visualization of the dataset after applying the Support
Vector Machine (SVM) algorithm. Similar to the previous scatter plot (Fig. 5.15), the
data points are plotted in the feature space, with the x-axis and y-axis representing
two chosen features or dimensions.
However, in this figure, an additional component is superimposed onto the scatter
plot: the decision boundary or separating hyperplane learned by the SVM model. This
decision boundary is a line (or a higher-dimensional hyperplane in case of more fea-
tures) that optimally separates the different classes or categories present in the dataset.
This piece of code produces artificial data, visualizes it prior to employing Support
Vector Machine (SVM), subsequently trains a model of Linear SVM, and visualizes the
data after utilizing SVM with the decision boundary (hyperplane) and support vectors.
Explanation of Code:
– Import the requisite libraries: numpy for numerical calculations, matplotlib.pyplot
for visualization, and the svm module from sklearn for the Support Vector Machine.
– Generate Synthetic Data:
– Synthetic data is generated using the np.r_ function to concatenate two sets
of randomly generated 2D points.
– The initial collection of data points is generated from a normal (Gaussian) distribution
centered at (−2, −2), and labeled as class −1.
– A subsequent collection of data points is generated from a normal distribution
centered at (2, 2), and labeled as class 1.
– The acquired data is stored in the variable X, with the corresponding labels
being stored in the variable y.
– Train Linear SVM:
– We create an SVM classifier object using svm.SVC with kernel = ‘linear’, indi-
cating a linear kernel.
– The classifier undergoes training on the synthetic data with the utilization of
the fit method.
– Plot Data After SVM:
– We create a contour plot to visualize the decision boundary (hyperplane) ob-
tained after applying SVM.
– The decision boundary is plotted along with the support vectors and data
points.
– Support vectors are marked with circles, and the hyperplane is plotted as a
dashed line.
– The title and axis labels are added to the plot for clarity.
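Putting these steps together, a minimal end-to-end sketch might look as follows; the sample sizes, random seed, and colormap are illustrative choices rather than the exact values used for the figures above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# Synthetic data: two Gaussian blobs centred at (-2, -2) and (2, 2)
np.random.seed(0)
X = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
y = np.array([-1] * 20 + [1] * 20)

# Train a linear SVM
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

# Hyperplane parameters: w[0]*x0 + w[1]*x1 + b = 0
w, b = clf.coef_[0], clf.intercept_[0]
x0_min, x0_max = X[:, 0].min() - 1, X[:, 0].max() + 1

# Data points, support vectors (circled), and the separating hyperplane (dashed)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolor='k')
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=120, facecolors='none', edgecolors='k')
plt.plot([x0_min, x0_max],
         [-(w[0] * x0_min + b) / w[1], -(w[0] * x0_max + b) / w[1]], 'k--')
plt.title('Linear SVM with decision boundary and support vectors')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()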
This code exemplifies the utilization of scikit-learn in order to execute Linear SVM
classification on fabricated data. Initially, it generates fabricated data and graphs it
prior to the application of SVM. Subsequently, it proceeds to train a Linear SVM
model and graphically illustrates the data post SVM application, showcasing the deci-
sion boundary and support vectors.
Let us examine an alternative Python code that executes Linear SVM on a dataset
obtained from scikit-learn, specifically the load_iris dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Load the iris dataset, keeping only the first two features for plotting (assumed setup)
X, y = datasets.load_iris(return_X_y=True)
X = X[:, :2]

# Scale the features and fit a linear SVM
svm_clf = Pipeline([('scaler', StandardScaler()),
                    ('linear_svc', LinearSVC(max_iter=10000))])
svm_clf.fit(X, y)

plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
plt.title('Data before SVM')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
plt.xticks(())
plt.yticks(())
plt.show()
The code above sets up the figure and, in the first subplot, plots the original data points before applying SVM, colored according to their class labels. The second subplot shows the data points after applying linear SVM, along with the decision regions separating the classes; a minimal sketch of that step is given below.
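The following sketch assumes the iris features X, y and the fitted pipeline svm_clf from the listing above; it evaluates the classifier on a dense grid and shades the predicted regions with plt.contourf.

import numpy as np
import matplotlib.pyplot as plt

# Assumes X, y and the fitted pipeline svm_clf from the listing above
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = svm_clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.title('Data after SVM')
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Set1)   # predicted regions
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
plt.show()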
Fig. 5.17 presents a comparative visualization of the Iris dataset before and after ap-
plying the Support Vector Machine (SVM) algorithm.
The kernel Support Vector Machine (SVM) is a significant extension of the conventional SVM algorithm. It allows the SVM to classify data that cannot be separated linearly, which is accomplished by implicitly mapping the data into a feature space of higher dimensionality.
The primary objective of kernel SVM is to identify the optimal hyperplane in that feature space, capable of separating the classes. This is achieved by transforming the data non-linearly using kernel functions.
The technique known as the kernel trick plays a crucial role in allowing the com-
putation of dot products in the feature space of higher dimensionality, all the while
avoiding the explicit transformation of the data. Various kernel functions, such as the
linear, polynomial, radial basis function (RBF), and sigmoid functions, are employed
to gauge the similarity between different data points.
By transforming the data into a space of higher dimensionality, kernel SVM ena-
bles the establishment of decision boundaries that are non-linear in nature within the
original feature space. This allows for the linear separation of classes.
The optimization problem for Kernel SVM is tackled through techniques such as the
Sequential Minimal Optimization (SMO) algorithm or quadratic programming. These
methods ascertain the optimal hyperplane in the higher-dimensional space, either by
maximizing the margin between classes or by minimizing classification errors.
In terms of scalability, Kernel SVM can prove computationally demanding for
large datasets, particularly when non-linear kernels and high-dimensional feature
spaces are involved.
Kernel SVM finds widespread application in diverse machine learning tasks, en-
compassing classification, regression, and anomaly detection, where non-linear rela-
tionships are apparent in the data.
To conclude, Kernel SVM stands as a versatile and effective algorithm for han-
dling non-linear relationships in data, enabling the construction of intricate decision
boundaries in the feature space. Its capacity to implicitly transform data using kernel
functions renders it suitable for a vast array of machine learning applications.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Generate non-linearly separable data and fit an RBF-kernel SVM (assumed setup)
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=42)
clf = SVC(kernel='rbf').fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolor='k')
plt.show()
Fig. 5.18 presents a scatter plot visualization of the dataset prior to applying the kernel
Support Vector Machine (SVM) algorithm. The plot displays the data points, each rep-
resenting an individual observation or instance, distributed across two dimensions or
features. These features are represented by the x-axis and y-axis, respectively.
However, in this figure, an additional component is superimposed onto the scat-
ter plot: the decision boundary or separating hyperplane learned by the kernel SVM
model in the transformed higher-dimensional feature space.
– The first plot displays the synthetic data generated using the make_circles function.
– Data points belonging to different classes are represented by different colors.
– The original feature space does not allow for linear separation due to its circular
distribution.
Fig. 5.19 presents a scatter plot visualization of the dataset after applying the kernel
Support Vector Machine (SVM) algorithm. Similar to the previous scatter plot (Fig.
5.18), the data points are plotted in the feature space, with the x-axis and y-axis repre-
senting two chosen features or dimensions.
– The second plot shows the same synthetic data after applying Kernel SVM.
– The data points are once again depicted using distinct colors to indicate varying
classes.
– Furthermore, the SVM model’s learned decision boundary is illustrated as a con-
tour plot.
– The decision boundary effectively separates the circular clusters into different re-
gions, demonstrating the non-linear separation capability of Kernel SVM.
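The decision boundary itself can be visualized with a contour plot, as in the following minimal sketch; it assumes the make_circles data X, y and the fitted RBF classifier clf from the listing above.

import numpy as np
import matplotlib.pyplot as plt

# Assumes X, y and the fitted RBF-kernel classifier clf from the listing above
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 300),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 300))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z > 0, alpha=0.3, cmap=plt.cm.coolwarm)      # predicted regions
plt.contour(xx, yy, Z, levels=[0], colors='k', linestyles='--')   # decision boundary
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolor='k')
plt.title('Data after kernel SVM (RBF)')
plt.show()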
Let us now examine an alternative Python code that executes the kernel Support Vec-
tor Machine (SVM) algorithm, utilizing the Radial Basis Function (RBF) kernel, on a
dataset known as Iris, sourced from the scikit-learn library.
Steps:
– Loads the iris dataset from the scikit-learn library and specifically chooses solely
the initial two characteristics.
– Creates a pipeline that first scales the data using StandardScaler and then applies
a kernel SVM with an RBF kernel using SVC(kernel = ‘rbf’).
– Fits the pipeline to the data.
– Creates a mesh grid to plot the data before and after applying SVM.
– Plots the original data (before SVM) in the first subplot.
– Plots the data after applying SVM in the second subplot, including the decision
boundary using plt.contourf.
– Displays the plots.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the iris dataset, keeping only the first two features (assumed setup)
X, y = datasets.load_iris(return_X_y=True)
X = X[:, :2]

# Scale the data and fit a kernel SVM with an RBF kernel
rbf_clf = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='rbf'))])
rbf_clf.fit(X, y)

plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
plt.title('Data before SVM')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
plt.show()
Fig. 5.20 presents a comparative visualization of the Iris dataset before and after ap-
plying the Kernel Support Vector Machine (SVM) algorithm. The code displays two
subplots. The first subplot shows the original data points before applying SVM, col-
ored according to their class labels. The second subplot shows the data points after
applying kernel SVM with an RBF kernel, along with the decision boundary separating
the classes.
The process of hyperparameter tuning in Support Vector Machines (SVM) involves the
selection of a set of hyperparameters that will optimize the performance of the SVM
model. These hyperparameters are predetermined before the training process and
cannot be directly estimated from the data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Generate a two-feature dataset and plot it before tuning (assumed setup;
# the full tuning workflow is sketched after the steps below)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolor='k')
plt.show()
Steps:
1. Generate artificial data with two characteristics using the make_classification
function.
2. Split the data into training and testing sets using the train_test_split function.
3. Define a grid of parameters that specifies the hyperparameters to be tuned and
their potential values.
4. Conduct a grid search with cross-validation (cv = 5) to identify the optimal combi-
nation of hyperparameters using the GridSearchCV function.
5. Visualize the artificial data prior to applying SVM. Subsequently, plot the data
again after implementing SVM.
6. Additionally, depict the decision boundary derived from the SVM model.
7. Present the Best Hyperparameters: We present the hyperparameters chosen as
the best through the grid search.
8. Evaluate Model Performance: We assess the model’s performance on the test set
by employing the accuracy score.
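The following minimal sketch illustrates these steps; the parameter grid, dataset size, and random seed are illustrative choices rather than the book's exact values.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Steps 1-2: synthetic two-feature data and a train/test split
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Steps 3-4: parameter grid and grid search with 5-fold cross-validation
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [0.01, 0.1, 1],
              'kernel': ['rbf', 'linear']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Steps 7-8: best hyperparameters and accuracy on the test set
print('Best hyperparameters:', grid_search.best_params_)
print('Test accuracy:', grid_search.score(X_test, y_test))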
Fig. 5.21 presents a scatter plot visualization of the dataset prior to applying the Sup-
port Vector Machine (SVM) algorithm and performing hyperparameter tuning. The
plot displays the data points, each representing an individual observation or instance,
distributed across two dimensions or features. These features are represented by the
x-axis and y-axis, respectively.
Fig. 5.22 presents a scatter plot visualization of the dataset after applying the Support
Vector Machine (SVM) algorithm and performing hyperparameter tuning. Similar to
the previous scatter plot (Fig. 5.21), the data points are plotted in the feature space,
with the x-axis and y-axis representing two chosen features or dimensions.
Algorithm Steps:
– Initialization: The first step is to determine the number of clusters (k) and ran-
domly initialize the centroids of these clusters.
– Assignment Step: Next, we allocate each data point to the nearest centroid by uti-
lizing a distance metric, commonly the Euclidean distance.
– Update Step: We then proceed to recalculate the centroids of the clusters by com-
puting the mean of all data points assigned to each cluster.
– Convergence: Finally, we repeat the assignment and update steps until conver-
gence criteria are satisfied, such as reaching a maximum number of iterations or
observing minimal change in centroids.
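As a concrete illustration of the assignment and update steps, the following minimal NumPy sketch runs a single K-Means iteration on a small, made-up dataset (the points and the choice k = 2 are purely illustrative).

import numpy as np

# Small made-up dataset and k = 2 initial centroids chosen at random
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5], [1.0, 0.5]])
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=2, replace=False)]

# Assignment step: each point goes to the nearest centroid (Euclidean distance)
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)

# Update step: recompute each centroid as the mean of its assigned points
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(labels, centroids, sep='\n')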
Initialization Methods:
– Random Initialization: Select centroids randomly from the available data points.
– K-Means++ Initialization: Opt for centroids that are strategically spaced apart to enhance convergence and avoid inferior local optima.
Scalability:
K-Means is computationally efficient and capable of handling large datasets with nu-
merous features.
Assumptions:
K-Means assumes that clusters are spherical and possess similar sizes. Furthermore, it
assumes that the variance of the distribution of each feature is equal across all clusters.
Applications:
K-Means clustering finds extensive application in diverse domains such as customer
segmentation, image segmentation, anomaly detection, and recommendation systems.
Limitations:
Depending on the initial centroids, K-Means may converge to local optima. It is highly
sensitive to the choice of k and may produce suboptimal outcomes if the true number
of clusters is unknown or if clusters exhibit irregular shapes or varying sizes.
To summarize, K-Means clustering is a versatile and widely employed algorithm
for partitioning data into clusters. Despite its simplicity and efficiency, careful atten-
tion must be given to initialization methods, the choice of k, and the interpretation of
results to ensure meaningful clustering.
Practical use of clustering also involves choosing an appropriate number of clusters, dealing with noise and outliers, and interpreting cluster assignments effectively. Understanding clustering basics is crucial for selecting the appropriate algorithm, interpreting results, and extracting meaningful insights from unlabeled data.
import numpy as np
import matplotlib.pyplot as plt

# Dataset used for selecting the number of clusters
X = np.array([(2, 4), (3, 5), (4, 6), (10, 12), (11, 13), (12, 14),
              (20, 22), (21, 23), (22, 24)])
plt.scatter(X[:, 0], X[:, 1], edgecolor='k')
plt.show()
The scatter plot illustrates that the dataset contains points that could form clusters. Next, we apply the K-Means clustering algorithm for various values of k and use the elbow method to determine the optimal number of clusters.
Fig. 5.23 presents scatter plot of data set used for selecting number of clusters.
Fig. 5.24 illustrates the application of the elbow method, a widely used technique for determining the optimal number of clusters (k) in k-means clustering. The elbow method helps us identify the point at which the rate of decline in the within-cluster sum of squares (WCSS) slows down. In the present scenario, the elbow appears at k = 3, so we can deduce that the most suitable number of clusters for this dataset is 3.
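A minimal sketch of the elbow procedure is shown below; it reuses the array X defined above and takes the inertia_ attribute of scikit-learn's KMeans as the WCSS value for each candidate k.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Assumes the array X defined above; inertia_ is the within-cluster sum of squares
wcss = []
k_values = range(1, 8)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)

plt.plot(k_values, wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.title('Elbow method')
plt.show()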
Below is a Python script that facilitates the execution of K-Means clustering meth-
odology, encompassing visual representations both pre and post K-Means, in conjunc-
tion with a comprehensive elucidation of every procedural stage and corresponding
outcome.
Steps:
1. Create synthetic data by utilizing the make_blobs function from scikit-learn. This
particular dataset comprises 300 samples, encompassing 2 features and 4 clusters.
2. Before applying the K-Means clustering technique, visualize the generated data.
3. Commence by initializing and fitting the K-Means clustering algorithm to the
data. In order to ensure reproducibility, we set the random state and specify the
number of clusters as 4.
4. Upon completion of fitting the K-Means model, we obtain the cluster centers as
well as the labels assigned to each data point.
5. Proceed to plot the data following K-Means clustering, where each data point is
color-coded based on its assigned cluster, and the cluster centers are denoted by
red crosses.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data and fit K-Means (assumed setup: 4 clusters, fixed seed)
X, _ = make_blobs(n_samples=300, n_features=2, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
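A sketch of the plotting stage (steps 4 and 5 above) follows; it assumes X and the fitted kmeans object from the listing above.

import matplotlib.pyplot as plt

# Assumes X and the fitted kmeans object from the listing above
labels = kmeans.labels_
centers = kmeans.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', edgecolor='k')
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=200)  # cluster centers
plt.title('Data after K-Means clustering')
plt.show()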
Fig. 5.25 presents a scatter plot visualization of the dataset prior to applying the k-
means clustering algorithm. The plot displays the data points, each representing an
individual observation or instance, distributed across two dimensions or features.
These features are represented by the x-axis and y-axis, respectively.
Fig. 5.26 presents a scatter plot visualization of the dataset after applying the k-means clustering algorithm. Similar to the previous scatter plot (Fig. 5.25), the data points are plotted in the feature space, with the x-axis and y-axis representing two chosen features or dimensions.
However, in this figure, an additional component is superimposed onto the scat-
ter plot: the cluster assignments resulting from the k-means algorithm.
– The initial plot illustrates the synthetic data prior to the implementation of K-
Means clustering.
– The second plot displays the data after clustering with K-Means. Each point is col-
ored according to its assigned cluster, and the centroids of the clusters are
marked in red. This visualization helps us understand how K-Means has grouped
the data into clusters based on similarity.
Let us examine an additional Python script that performs the k-means clustering algo-
rithm on a dataset obtained from scikit-learn, while simultaneously producing graphi-
cal depictions of the data before and after the utilization of the k-means clustering
algorithm.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans

# Load the iris dataset, keeping only the first two features (assumed setup)
X, y = datasets.load_iris(return_X_y=True)
X = X[:, :2]
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.show()
Steps:
1. Importing the required libraries is essential for conducting various operations.
The libraries that need to be imported include numpy for performing numerical
operations, matplotlib.pyplot for visualization purposes, and datasets and cluster
from sklearn for working with datasets and clustering algorithms respectively.
2. Load the iris dataset from the scikit-learn library and opt to exclusively utilize
the initial two attributes for the objective of visualization.
3. Create a figure with two subplots using plt.figure and plt.subplot.
4. In the first subplot, plot the original data points using plt.scatter with the cmap
= ‘viridis’ colormap (you can choose any colormap you prefer).
5. Instantiate a KMeans instance with n_clusters = 3 (as the iris dataset possesses 3
categories) and specify a random_state to ensure reproducibility.
6. Train the KMeans model on the dataset by invoking the fit method on the kmeans
object.
7. Obtain the cluster labels for every individual data point by utilizing the kmeans.labels_ attribute.
8. In the second subplot, plot the data points using plt.scatter, but this time color them
according to their cluster labels using c = labels and the cmap = ‘viridis’ colormap.
9. Get the coordinates of the cluster centers using kmeans.cluster_centers_.
10. Plot the cluster centers using plt.scatter with c = ‘red’ (red color), s = 100 (larger
size), and alpha = 0.5 (semi-transparent).
11. Display the plots using plt.show().
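The following minimal sketch covers steps 5 through 11; it assumes the iris features X loaded in the listing above.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Assumes the iris features X from the listing above (steps 5-11)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', edgecolor='k')
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=100, alpha=0.5)  # cluster centers
plt.title('Iris data after k-means clustering')
plt.show()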
– The first subplot shows the original data points before applying k-means cluster-
ing, colored with a continuous colormap (viridis in this case).
– The second subplot shows the data points after applying k-means clustering,
where each point is colored according to its assigned cluster. Additionally, the
cluster centers are plotted as larger red dots.
Fig. 5.27 presents a visual comparison of the IRIS dataset before and after applying
the k-means clustering algorithm.
In PCA, the covariance between two features xi and xj is computed as
cov(xi, xj) = (1 / (n − 1)) Σk (xki − μi)(xkj − μj)
where n represents the total number of samples, xki and xkj denote the values of the ith and jth features of the kth sample, and μi and μj correspond to the means of features xi and xj, respectively.
Eigendecomposition
– The principal components are ranked in a descending order based on the eigen-
values, with the first component capturing the highest amount of variance, fol-
lowed by the second component capturing the second highest amount of variance,
and so on.
– It is a widely accepted convention to choose a particular percentage of the total
variance, such as 90%, as a threshold for determining the number of principal
components to retain.
Projection
Finally, the data is projected onto the designated principal components in order to
acquire the representation with reduced dimensions. This projection is accomplished
by performing the multiplication of the original data matrix with the matrix consist-
ing of the chosen eigenvectors (principal components).
Mathematical Representation
Y = X · Vk
where X is the centered data matrix and Vk is the matrix whose columns are the k selected eigenvectors (principal components).
Dimensionality reduction addresses the problems posed by the “curse of dimensionality,” which occurs when the complexity of the data increases exponentially with the number of dimensions. This, in turn, leads to issues like overfitting, increased computational costs, and data sparsity.
There are two primary approaches to dimensionality reduction: feature selection
and feature extraction. Feature selection involves choosing a subset of the original
features, while feature extraction involves creating new features by combining or
transforming the original ones.
PCA is a commonly used technique for feature extraction. It identifies the direc-
tions of maximum variance in the data and projects the data onto a lower-dimensional
subspace defined by these directions, known as principal components.
Other notable techniques for dimensionality reduction include Linear Discrimi-
nant Analysis (LDA), which aims to maximize the separability between classes, and t-
Distributed Stochastic Neighbor Embedding (t-SNE), a non-linear technique suitable
for visualizing high-dimensional data in a lower-dimensional space.
Dimensionality reduction has the potential to improve the performance of ma-
chine learning models by eliminating irrelevant or redundant features, reducing
noise, and enhancing the interpretability of the data. However, it is important to
strike a balance between reducing dimensions and preserving essential information
for the specific task at hand. Dimensionality reduction techniques find extensive use
in diverse domains such as image and signal processing, text mining, bioinformatics,
and recommendation systems, among others. They play a fundamental role in data
preprocessing, visualization, and feature engineering pipelines within the context of
machine learning workflows.
Feature selection and feature extraction are two frequently utilized methodolo-
gies within the domains of machine learning and data analysis, which aim to reduce
the dimensionality of datasets and amplify the efficacy of models. Herein lies an ex-
tensive elucidation of each approach:
Feature Selection
Feature selection involves the careful selection of a subset of the original features
from the dataset, with the exclusion of any features that are deemed irrelevant or re-
dundant. The primary objective is to ameliorate model performance by curbing over-
fitting, diminishing computational complexity, and augmenting interpretability. The
techniques employed for feature selection can be classified into three distinct types:
– Filter Methods: Filter methods assess the relevance of features independently of
the chosen learning algorithm. Standard techniques encompass correlation analy-
sis, information gain, and statistical tests such as ANOVA and chi-square.
– Wrapper Methods: Wrapper methods appraise feature subsets by iteratively
training the model using different combinations of features and evaluating their
performance. Techniques such as forward selection, backward elimination, and
recursive feature elimination (RFE) correspond to this category.
Feature Extraction
Feature extraction is the process of converting the initial features into a fresh collec-
tion of features through the act of combining or altering them, all the while preserv-
ing the crucial information. The primary objective is to decrease the dimensionality
of the dataset while maintaining the utmost amount of pertinent information possi-
ble. There are two main classifications for feature extraction methods: linear and
non-linear techniques:
– Linear Methods: Linear techniques, including PCA and Linear Discriminant Analy-
sis (LDA), generate fresh characteristics by forming linear combinations of the ini-
tial characteristics. PCA detects the orientations that exhibit the highest variance,
whereas LDA concentrates on enhancing the distinction between classes.
– Non-linear Methods: Non-linear techniques, such as t-distributed Stochastic Neigh-
bor Embedding (t-SNE) and Isomap, produce novel characteristics by capturing
non-linear associations within the data. These approaches prove to be especially
advantageous when it comes to representing high-dimensional data in lower-
dimensional spaces while simultaneously conserving local structures.
Eigenvectors and eigenvalues represent pivotal principles within the realm of linear
algebra, possessing extensive utility across diverse domains, encompassing machine
learning and data analysis.
Eigenvectors
– Eigenvectors are vectors of particular significance that pertain to linear transfor-
mations, serving to denote the directions in which the transformation solely im-
parts elongation or compression upon the vector, while leaving its orientation
unaltered.
– From a mathematical standpoint, a vector v is classified as an eigenvector of a
square matrix A if it satisfies the equation:
A · v = λ · v
Eigenvalues
– Eigenvalues are the scalars that represent the factor by which the corresponding
eigenvector is stretched or compressed during a linear transformation.
– Each eigenvector of a matrix A corresponds to a unique eigenvalue.
– Eigenvalues have a significant impact on the determination of the characteristics
of linear transformations, such as the process of diagonalizing matrices, analyz-
ing stability, and finding solutions to differential equations.
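To make these definitions concrete, the short sketch below computes the eigenvalues and eigenvectors of a small symmetric matrix with NumPy and checks the defining relation A · v = λ · v; the matrix values are purely illustrative.

import numpy as np

# Illustrative 2x2 symmetric matrix
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)            # the eigenvalues of A
print(eigenvectors)           # each column is an eigenvector

# Check A · v = λ · v for the first eigenpair
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))  # True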
Applications
– Image and signal processing: Eigenvectors are used for image compression and
noise reduction.
– Structural analysis: Eigenvalues determine the stability and natural frequencies
of structures.
– Machine learning: Eigenvectors and eigenvalues are used in dimensionality re-
duction, feature extraction, and clustering algorithms.
The subsequent Python code demonstrates the implementation of PCA for dimension-
ality reduction.
import numpy as np

# Original data
X = np.array([[2.5, 2.4, 0.5, 0.7],
              [2.1, 1.9, 1.8, 1.3],
              [1.6, 1.6, 1.5, 1.1],
              [1.0, 0.9, 1.0, 0.7],
              [0.5, 0.6, 0.7, 0.5]])

# Step 1: Compute the mean of each feature
mean = X.mean(axis=0)
print("Mean:", mean)
# Step 2: Subtract the mean from each observation to center the data
X_centered = X - mean
# Step 3: Compute the covariance matrix of the centered data
cov_matrix = np.cov(X_centered, rowvar=False)
print("\nCovariance matrix:")
print(cov_matrix)
# Step 4: Compute eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
# Step 5: Sort the eigenvectors by decreasing eigenvalue
order = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[order]
sorted_eigenvectors = eigenvectors[:, order]
print("\nEigenvalues:", sorted_eigenvalues)
print("\nSorted Eigenvectors:")
for i, eigenvector in enumerate(sorted_eigenvectors.T):
    print(f"PC{i+1}: {eigenvector}")
# Step 6: Project the centered data onto the subspace defined by the principal components
X_projected = X_centered @ sorted_eigenvectors
print("\nProjected Data:")
print(X_projected)
This code follows the steps outlined above for dimensionality reduction using PCA. Here's a breakdown of what the code does:
1. The original data X is defined as a NumPy array.
2. The mean of each feature is calculated and subtracted from the data to center it
(Steps 1 and 2).
3. The covariance matrix is calculated using np.cov (Step 3).
4. The eigenvalues and eigenvectors of the covariance matrix are calculated through
the utilization of the np.linalg.eig function, as outlined in Step 4.
5. The eigenvectors are sorted in descending order based on their corresponding ei-
genvalues (Step 5).
6. The centered data X_centered is projected onto the new subspace defined by the
sorted eigenvectors using matrix multiplication (Step 6).
Fig. 5.28 displays the results of Principal Component Analysis (PCA), a technique used
for dimensionality reduction, showcasing the transformed dataset where data points
are represented in a lower-dimensional space while preserving the most significant
variance across the original features.
The code prints out the mean, covariance matrix, eigenvalues, sorted eigenvec-
tors (labeled as PC1, PC2, etc.), and the projected data X_projected.
Below is an exemplification of a Python code that showcases the implementation
of PCA on an IRIS dataset. Additionally, it demonstrates the visualization of the out-
comes both pre- and post-PCA.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
plt.tight_layout()
plt.show()
Steps:
1. Importing the required libraries includes numpy, matplotlib.pyplot, load_iris
from sklearn datasets, StandardScaler from sklearn preprocessing, and PCA from
sklearn.decomposition.
2. Load the Iris dataset.
3. Standardize the features utilizing the StandardScaler.
4. Instantiate PCA with the desired number of components (in this case, 2) and fit it
to the standardized data.
5. Transform the initial dataset into a lower-dimensional space through the utiliza-
tion of the properly adjusted PCA model.
6. Visualize the initial dataset as well as the dataset post-PCA by means of the mat-
plotlib.pyplot.scatter function.
7. Set appropriate titles, labels, and colorbars for better visualization.
8. Display the plots using plt.show().
This code generates two subplots: one showing the original data and another showing
the data after PCA. The data points colored based on the target labels (species) to visu-
alize any patterns or clusters before and after PCA.
Fig. 5.29 illustrates the IRIS dataset both before and after Principal Component Analy-
sis (PCA) transformation. The plot likely demonstrates how PCA reduces the di-
mensionality of the data while retaining the most important information, aiding in
visualizing the dataset’s structure and potential clustering patterns.
Naive Bayes, a classification algorithm that leverages Bayes’ theorem and assumes in-
dependence between features, is both simple and effective. Despite its oversimplified
assumptions, it often performs exceptionally well in practical scenarios, especially in
the realm of text classification problems.
The core idea of Naive Bayes is to calculate the probability of each class based on
the feature values of the instance being classified. Subsequently, predictions are made
by selecting the class with the highest probability.
Naive Bayes, a powerful probabilistic classifier, relies on Bayes’ theorem and as-
sumes feature independence in a straightforward manner.
Bayes’ Theorem
Naive Bayes is grounded in Bayes’ theorem, which expresses the probability of a hypothesis given the evidence:
P(class|data) = (P(data|class) × P(class)) / P(data)
Where:
– P(class|data) is the posterior probability of the class given the observed feature values.
– P(data|class) is the likelihood of the data given the class.
– P(class) denotes the prior probability associated with the class.
– P(data) serves as the evidence and functions as a constant that scales the probabilities.
The “naive” assumption is derived from the notion that Naive Bayes assumes conditional independence among all features, given the class label. This simplification enables P(data|class) to be computed as a product over the individual features. Mathematically, this is expressed as:
P(data|class) = P(x1|class) × P(x2|class) × . . . × P(xn|class)
This assumption, although seldom valid in practical scenarios, renders the computa-
tions significantly more manageable and enables the application of Naive Bayes to
problems with a high number of dimensions.
To conduct training for a Naive Bayes classifier, it is imperative to compute the
prior probabilities P(class) and the likelihood probabilities P(feature|class) based on
the available training data. In the case of numerical features, it is often assumed that
they follow a Gaussian distribution, and consequently, the mean and variance for
each class are estimated. As for categorical features, the frequency of each feature
value for each class can be straightforwardly calculated.
Let’s consider a simple example of using Naive Bayes for email spam classification.
We consider a dataset consisting of emails, each classified as either “spam” or
“not spam” (ham). Our objective is to construct a Naive Bayes classifier that can fore-
cast whether a new email is spam or not by analyzing the presence or absence of spe-
cific words within the email’s body.
Let’s say we have the following training data:
In order to educate the Naive Bayes classifier, it is necessary to compute the prior
probabilities P(spam) and P(ham), as well as the likelihood probabilities P(word|
spam) and P(word|ham) for each individual word.
Given that an equivalent amount of spam and ham emails are present in the
training data, the prior probabilities would be as follows:
For the likelihood probabilities, we can calculate the frequency of each word in the
spam and ham emails. For example:
Once all the requisite probabilities have been estimated from the training data, a new
email can be classified by computing the posterior probability P(spam|email) and P
(ham|email) utilizing Bayes’ theorem and the assumption of naive independence. The
prediction is made by selecting the class with the highest posterior probability.
For example, let’s say we have a new email with the text: “Get rich quickly with
our system!” To classify this email, we would calculate P(spam|email) and P(ham|
email) using the estimated probabilities from the training phase, and choose the class
with the higher probability.
This particular example demonstrates the fundamental operations of Naive Bayes
in the context of text classification. In practical applications, more sophisticated tech-
niques such as feature selection, smoothing, and handling of non-occurring events are
frequently employed to enhance the performance of Naive Bayes classifiers.
Let’s consider a numerical example of using Naive Bayes for classification.
Suppose we possess a dataset comprising weather observations, wherein each in-
stance is categorized as either “Play” or “Don’t Play” contingent upon four character-
istics: Outlook (Sunny, Overcast, Rain), Temperature (Hot, Mild, Cool), Humidity (High,
Normal), and Wind (Strong, Weak).
Here’s the training dataset:
Next, we calculate the likelihood probabilities for each feature value given the class:
For example, let’s calculate P(Outlook = Sunny|Play) and P(Outlook = Sunny|
Don’t Play):
– P(Outlook = Sunny|Play) = 2/9 = 0.222 (2 out of 9 instances with Play have Outlook
= Sunny)
– P(Outlook = Sunny|Don’t Play) = 3/5 = 0.6 (3 out of 5 instances with Don’t Play have
Outlook = Sunny)
We can similarly calculate the likelihood probabilities for all feature values and
classes.
Now, let’s classify a new instance with the feature values: Outlook = Overcast,
Temperature = Cool, Humidity = High, Wind = Strong.
In order to determine the posterior probability of each class, the utilization of
Bayes’ theorem and the assumption of feature independence, commonly referred to
as the “naive” assumption, is employed.
– P(Play|features) = (P(Overcast|Play) × P(Cool|Play) × P(High|Play) × P(Strong|Play) × P(Play)) / P(features)
– P(Don’t Play|features) = (P(Overcast|Don’t Play) × P(Cool|Don’t Play) × P(High|Don’t Play) × P(Strong|Don’t Play) × P(Don’t Play)) / P(features)
We don’t need to calculate P(features) since it’s a scaling factor, and we’re only inter-
ested in the relative probabilities.
Since P(Play|features) > P(Don’t Play|features), we would classify this new in-
stance as “Play.”
This example illustrates the calculations involved in training and using a Naive
Bayes classifier. In practice, techniques like Laplace smoothing are often employed to
handle zero probabilities and prevent overfitting.
Applications
The Gaussian Naive Bayes technique is a variation of the Naive Bayes algorithm,
which proves to be highly advantageous in scenarios involving continuous or numeri-
cal characteristics. It operates under the assumption that the continuous attributes
conform to a Gaussian (normal) distribution for every class.
The key steps in Gaussian Naive Bayes are:
1. Calculate the prior probabilities of each class, P(class), from the training data.
2. For every continuous feature, the mean and standard deviation should be com-
puted for that particular feature in each class.
3. Assuming a Gaussian distribution, the likelihood of a feature value x given a class
is calculated using the probability density function:
P(x|class) = (1 / √(2πσ²)) × exp(−(x − μ)² / (2σ²))
where the mean and standard deviation of the feature in that class are represented by μ and σ, respectively.
4. For categorical features, calculate the likelihood probabilities as in regular Naive
Bayes.
5. To classify a new instance, calculate the posterior probability for each class using
Bayes’ theorem: P(class|data) = (P(data|class) × P(class)) / P(data)
6. P(data|class) is calculated as the product of the likelihood probabilities for each
feature, assuming independence: P(data|class) = P(x1|class) × P(x2|class) × . . . × P(xn|class)
7. Classify the given instance as the category possessing the utmost posterior
probability.
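As a small illustration of step 3, the following sketch evaluates the Gaussian likelihood for a single feature value; the feature value, mean, and standard deviation used here are illustrative.

import numpy as np

def gaussian_likelihood(x, mu, sigma):
    """P(x | class) for a single continuous feature under a Gaussian assumption."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Illustrative values: feature value 6.2 with a class mean of 6.05 and std of 0.354
print(gaussian_likelihood(6.2, mu=6.05, sigma=0.354))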
The Gaussian assumption makes Gaussian Naive Bayes particularly effective for con-
tinuous data, as it captures the distribution of feature values within each class. How-
ever, it may not perform well if the feature distributions are significantly non-
Gaussian or if there are strong dependencies between features.
Like regular Naive Bayes, Gaussian Naive Bayes is computationally efficient and
can be a good baseline classifier, especially when dealing with high-dimensional con-
tinuous data. However, more advanced techniques like kernel density estimation or
semi-supervised learning may be required for complex data distributions.
Let’s consider a numerical example of using Gaussian Naive Bayes for classification.
Suppose we possess a collection of measurements for iris flowers, in which each
individual is categorized as one of three distinct species: Setosa, Versicolor, or Virgin-
ica. The attributes included in this dataset encompass sepal length, sepal width, petal
length, and petal width, all of which are expressed in centimeters.
Here’s a small subset of the training dataset:
First, we calculate the prior probabilities of each class from the training data:
– P(Setosa) = 2/6 = 0.333
– P(Versicolor) = 2/6 = 0.333
– P(Virginica) = 2/6 = 0.333
Next, we calculate the mean (μ) and standard deviation (σ) of each feature for each class.
For Setosa:
– Sepal Length: μ = 5.0, σ = 0.0707
– Sepal Width: μ = 3.25, σ = 0.354
– Petal Length: μ = 1.4, σ = 0.0
– Petal Width: μ = 0.2, σ = 0.0
For Versicolor:
– Sepal Length: μ = 6.7, σ = 0.424
– Sepal Width: μ = 3.2, σ = 0.0
– Petal Length: μ = 4.6, σ = 0.141
– Petal Width: μ = 1.45, σ = 0.071
For Virginica:
– Sepal Length: μ = 6.05, σ = 0.354
– Sepal Width: μ = 3.0, σ = 0.424
– Petal Length: μ = 5.55, σ = 0.636
– Petal Width: μ = 2.2, σ = 0.424
Now, let’s classify a new instance with the feature values: Sepal Length = 6.2, Sepal
Width = 3.4, Petal Length = 5.4, Petal Width = 2.3.
To calculate the posterior probability for each class, we use Bayes’ theorem and
the Gaussian probability density function for the continuous features:
P(Setosa|data) = (P(6.2|Setosa) × P(3.4|Setosa) × P(5.4|Setosa) × P(2.3|Setosa) × P(Setosa)) / P(data)
P(Versicolor|data) = (P(6.2|Versicolor) × P(3.4|Versicolor) × P(5.4|Versicolor) × P(2.3|Versicolor) × P(Versicolor)) / P(data)
P(Virginica|data) = (P(6.2|Virginica) × P(3.4|Virginica) × P(5.4|Virginica) × P(2.3|Virginica) × P(Virginica)) / P(data)
Plugging in the calculated means, standard deviations, and prior probabilities,
we get:
– P(Setosa|data) = 1.97 × 10^−19
– P(Versicolor|data) = 1.86 × 10^−6
– P(Virginica|data) = 1.93 × 10^−3
Since P(Virginica|data) is the highest, we would classify this new instance as the Vir-
ginica species.
This illustration showcases the computations entailed in the training and utiliza-
tion of a Gaussian Naive Bayes classifier for continuous attributes. In practical appli-
cation, methods such as feature scaling and the management of absent values may be
necessary to enhance performance.
Below is a Python code that implements Gaussian Naive Bayes classification on a
dataset.
Steps:
– Firstly, it is essential to import the required libraries: numpy for numerical operations, matplotlib.pyplot for plotting, make_blobs from sklearn.datasets to generate a synthetic dataset, GaussianNB from sklearn.naive_bayes (the Gaussian Naive Bayes classifier), train_test_split from sklearn.model_selection to split the data into train and test sets, and accuracy_score from sklearn.metrics to calculate the classification accuracy.
– To generate a synthetic dataset with two clusters, we can utilize the make_blobs
function. This function will generate a dataset with 1,000 samples, two features,
and two classes.
– In order to evaluate the performance of our model, it is necessary to split the
data into training and test sets. To achieve this, we can make use of the train_-
test_split function. In this case, we will allocate 80% of the data for training and
the remaining 20% for testing.
– Before applying the Naive Bayes classifier to the training data, it is beneficial to
visualize the data and observe the separability of the classes. This can be accom-
plished by plotting the training data using the plt.scatter function.
– Next, we will create an instance of the GaussianNB classifier and fit it to the train-
ing data using the gnb.fit(X_train, y_train) command.
– Once the classifier has been trained, we can proceed to make predictions on the
test set. This can be done using the y_pred = gnb.predict(X_test) command.
– To assess the performance of our model, we need to calculate the classification
accuracy. This can be achieved by utilizing the accuracy_score function and pass-
ing in the true labels (y_test) and the predicted labels (y_pred). The resulting accu-
racy can then be printed for further analysis.
– Finally, we can visualize the test data after applying the Naive Bayes classifier.
This can be done by plotting the test data and coloring the points according to the
predicted class labels (y_pred).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate data, split, fit, and predict (assumed setup)
X, y = make_blobs(n_samples=1000, n_features=2, centers=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
Fig. 5.30 presents a scatter plot visualizing the dataset before applying the Naïve
Bayes classification algorithm.
Accuracy: 0.94
The code gives two plots: one showing the original data before applying Naive
Bayes, and another showing the data after applying Naive Bayes, with the points col-
ored according to their predicted class labels. You should also see the classification
accuracy printed in the console.
The output will depend on the random state used to generate the synthetic data-
set, but you should expect a reasonably high accuracy since the data is well-separated
into two clusters.
This illustration exhibits the utilization of the Gaussian Naive Bayes classifier in the
Python programming language. Additionally, it showcases the visualization of the data
prior to and subsequent to the application of the algorithm. Furthermore, the evaluation
of the performance of the algorithm is conducted by employing the accuracy metric.
Fig. 5.31 shows a scatter plot visualizing the dataset after applying the Naïve Bayes classifi-
cation algorithm. This plot demonstrates how the algorithm has classified the data points
into different classes based on their features, providing insights into the effectiveness of
the Naïve Bayes classifier in separating the data points according to their characteristics.
Let’s consider another Python code example that implements Gaussian Naive
Bayes classification on the iris dataset from scikit-learn.
Steps:
– Firstly, the necessary libraries should be imported: numpy for numerical operations, matplotlib.pyplot for generating plots, load_iris from sklearn.datasets for loading the iris dataset, GaussianNB from sklearn.naive_bayes for the Gaussian Naive Bayes classifier, train_test_split from sklearn.model_selection for splitting the data into train and test sets, and accuracy_score from sklearn.metrics for calculating the classification accuracy.
– To load the iris dataset, the load_iris() function from scikit-learn can be utilized.
To focus on visualization, only the first two features, namely sepal length and
sepal width, are selected.
– In order to divide the data into training and test sets, the train_test_split function
is employed. In this particular case, 80% of the data is allocated for training pur-
poses, while the remaining 20% is designated for testing.
– To visualize the training data prior to applying the Naive Bayes algorithm, the
plt.scatter function can be used. This will provide a visual representation of the data
and the separability of the classes.
– By creating an instance of the GaussianNB classifier and fitting it to the training
data using the gnb.fit(X_train, y_train) syntax, the algorithm can be implemented.
– To make predictions on the test set, the y_pred = gnb.predict(X_test) code can be
executed.
– Once the predictions are made, the accuracy of the classification can be calculated
using the accuracy_score(y_test, y_pred) function. The result can then be printed.
– Finally, to visualize the test data after applying the Naive Bayes algorithm, the pre-
dicted class labels, y_pred, can be used to color the points. This can be achieved by
plotting the test data and assigning colors based on the predicted labels.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load iris (first two features), split, fit, and predict (assumed setup)
X, y = load_iris(return_X_y=True)
X = X[:, :2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
gnb = GaussianNB().fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Plot the test data colored by predicted class
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, cmap='viridis', edgecolor='k')
plt.ylabel('Sepal Width')
plt.show()

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
Fig. 5.32 displays a visualization of the IRIS dataset before applying the Naïve Bayes
classification algorithm.
Accuracy: 0.90
The first plot shows the iris data before applying Naive Bayes, where the points
are colored according to their true class labels. The second plot shows the data after
applying Naive Bayes, with the points colored according to their predicted class labels.
The performance of the Gaussian Naive Bayes classifier in distinguishing the three
classes using the sepal length and sepal width features is deemed to be satisfactory.
Fig. 5.33 presents a visualization of the IRIS dataset after applying the Naïve Bayes
classification algorithm.
The Multinomial Naive Bayes algorithm is a variation of the Naive Bayes algorithm
that is excellently suited for tasks involving the classification of text. In these tasks,
the features are indicative of the occurrence frequency of words or tokens within a
document.
The key assumptions made by Multinomial Naive Bayes are:
1. The data is generated from a multinomial distribution, where each feature
(word) is drawn independently from the same vocabulary.
2. The feature vectors are sparse, meaning that most word counts are zero for a
given document.
3. The order of the words in the document is not important, only the word counts
matter.
Under these assumptions, the likelihood of a document given a class follows a multinomial distribution over the word counts:
P(data|class) ∝ p1^n1 × p2^n2 × . . . × pk^nk
In the given context, n represents the overall number of words, k denotes the size of the vocabulary, ni signifies the count of word i, and pi indicates the probability of word i in that particular class.
The probabilities pi are estimated from the training data as (count of word i in class + alpha) / (total word count in class + alpha × vocabulary size), where alpha is a smoothing parameter to avoid zero probabilities.
To categorize a new document, one computes the posterior probability for each class by utilizing Bayes’ theorem and assigns the class with the highest value.
Multinomial Naive Bayes is particularly effective for text classification because it cap-
tures the frequency information of the words, which is often more important than the
presence or absence of a word in a document. It also handles the sparse nature of text
data well.
However, it assumes that the words are independent, which may not be a valid
assumption in natural language. Additionally, it does not account for word order or
semantic relationships between words.
Despite these limitations, Multinomial Naive Bayes is a simple and efficient algo-
rithm that often works well in practice for text classification tasks. It is widely used as
a baseline model or as part of more complex ensemble models.
Let’s consider a numerical example of using Multinomial Naive Bayes for text
classification.
Suppose we possess a small collection of film reviews, in which each review is labeled as either “Positive” or “Negative.” The features of these reviews are the counts of how often each word occurs in the review.
Review Label
Now, we calculate the likelihood probabilities P(word|class) for each word and class.
To avoid zero probabilities, we use additive smoothing with α = 1.
For the Positive class:
– Total word count = 7 (excluding duplicates)
– P(great|Positive) = (1 + 1) / (7 + 1 × 17) = 0.125
– P(movie|Positive) = (1 + 1) / (7 + 1 × 17) = 0.125
– P(loved|Positive) = (1 + 1) / (7 + 1 × 17) = 0.125
– . . . (remaining words have probability = 1 / (7 + 1 × 17) = 0.0625)
Now, let’s classify a new review: “good movie but boring plot.”
We determine the posterior probability for each category by employing Bayes’
theorem and the multinomial likelihood.
P(Positive|review) = (P(good|Positive) × P(movie|Positive) × P(but|Positive) × P(boring|Positive) × P(plot|Positive) × P(Positive)) / P(review)
P(Negative|review) = (P(good|Negative) × P(movie|Negative) × P(but|Negative) × P(boring|Negative) × P(plot|Negative) × P(Negative)) / P(review)
Plugging in the calculated probabilities, we get:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

# Sample data (categorical features)
X = np.array([['young', 'yes', 'no', 'good'],
              ['young', 'yes', 'no', 'poor'],
              ['old', 'yes', 'yes', 'good'],
              ['old', 'yes', 'yes', 'poor'],
              ['young', 'no', 'no', 'good'],
              ['young', 'no', 'yes', 'poor'],
              ['old', 'no', 'yes', 'good'],
              ['old', 'no', 'yes', 'poor']])
y = np.array([1, 0, 1, 0, 1, 0, 0, 0])

# Encode each categorical column as non-negative integers so that
# MultinomialNB can consume it (one LabelEncoder per column)
X_encoded = np.array([LabelEncoder().fit_transform(col) for col in X.T]).T

# Train/test split (split parameters assumed)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.25,
                                                    random_state=42)

# Model training
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# Predictions
y_pred = mnb.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)
Accuracy: 0.5
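Because Multinomial Naive Bayes is most commonly paired with word-count features, the following sketch shows a typical text-classification pipeline using scikit-learn's CountVectorizer; the tiny review corpus is invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus of labeled reviews (1 = positive, 0 = negative)
reviews = ['great movie loved it', 'terrible plot and boring acting',
           'wonderful acting great plot', 'awful movie hated it']
labels = [1, 0, 1, 0]

# Convert the text to word-count vectors and fit the classifier
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(reviews)
clf = MultinomialNB(alpha=1.0).fit(X_counts, labels)  # alpha is the smoothing parameter

# Classify a new review
new_review = ['good movie but boring plot']
print(clf.predict(vectorizer.transform(new_review)))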
Bagging (Bootstrap Aggregating) and Boosting are two well-known ensemble learning
methods that enhance predictive performance by integrating multiple base models.
Bagging, also referred to as Bootstrap Aggregating, incorporates the training of multiple
instances of the same foundational model on diverse subsets of the training data, which are
sampled with replacement. Each model is independently trained, and the predictions are
combined through averaging (for regression) or voting (for classification). Bagging helps to
mitigate the problem of overfitting by training models on distinct subsets of the data, thus
reducing the variance of the final predictions. An example of Bagging is the Random Forest
algorithm, which constructs a collection of decision trees, each trained on a random subset
of the features and data instances. The final prediction is obtained by averaging the predic-
tions of all decision trees, resulting in improved generalization performance and robustness.
Boosting, in contrast, is an iterative technique that constructs multiple weak learners
sequentially. Each weak learner is trained on the residuals of the previous models, with a
greater focus on the misclassified instances. Boosting algorithms aim to correct the errors
made by the previous models and concentrate on instances that are difficult to classify.
The final prediction is obtained by aggregating the predictions of all weak learners, typi-
cally through a weighted sum. Two popular boosting algorithms are AdaBoost and Gradi-
ent Boosting. In AdaBoost, for instance, each weak learner is assigned a weight based on its
performance, and the final prediction is obtained by combining the weighted predictions
of all weak learners. Gradient Boosting constructs models sequentially to minimize a loss
function, with each model focusing on reducing the errors made by the previous models.
Now let us consider a binary classification problem where the goal is to predict
whether an email is spam or not based on its features. In the Bagging approach, we
can train multiple decision tree classifiers on different subsets of the email dataset
and then combine their predictions using majority voting to classify new emails. In
the Boosting approach, we can sequentially train decision tree classifiers, with each
subsequent model focusing on the misclassified emails from the previous models. By
aggregating the predictions of all models, Bagging and Boosting methods can signifi-
cantly improve the accuracy of spam classification compared to using a single deci-
sion tree model. These ensemble techniques are widely used in practical applications
across various domains to enhance predictive performance and robustness.
Below is the Python code for the Bagging algorithm:
– Generate synthetic data with two features and two classes using the make_classification function from sklearn.
– Split the data into training and testing sets using the train_test_split method.
– Visualize the data before applying Bagging.
– Prior to Bagging, train a single base Decision Tree classifier on the training data, make predictions on the test set, and record its accuracy.
– Then train a Bagging ensemble of Decision Tree classifiers, make predictions on the test set, and record its accuracy.
– Lastly, exhibit both the plots before and after Bagging.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Synthetic two-feature data and train/test split (assumed setup)
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Base Model
base_model = DecisionTreeClassifier(max_depth=4)
base_model.fit(X_train, y_train)
base_pred = base_model.predict(X_test)
base_acc = round(base_model.score(X_test, y_test) * 100, 2)

# Bagging model (newer scikit-learn versions rename base_estimator to estimator)
model = BaggingClassifier(base_estimator=base_model, n_estimators=100,
                          random_state=0)
model.fit(X_train, y_train)
bag_pred = model.predict(X_test)
bag_acc = round(model.score(X_test, y_test) * 100, 2)

# Plot the test data colored by the bagged model's predictions
plt.scatter(X_test[:, 0], X_test[:, 1], c=bag_pred, cmap='coolwarm', edgecolor='k')
plt.title('Test data after Bagging')
plt.show()
Fig. 5.34 illustrates a plot representing the dataset before applying the bagging ensem-
ble technique.
Fig. 5.35 demonstrates a plot representing the dataset after applying the bagging en-
semble technique.
The result of executing this code will yield two graphical representations:
– Plot before Bagging: Displays the spread of the data points prior to the implemen-
tation of Bagging.
– Plot after Bagging: Depicts the spread of the data points after the application of
Bagging with numerous Decision Tree classifiers.
A similar listing applies Gradient Boosting to a synthetic classification dataset:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data, train/test split, and model fitting (assumed setup)
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gbc.fit(X_train, y_train)

# Accuracy
acc = gbc.score(X_test, y_test)
print("Accuracy: ", acc)
Accuracy: 0.995
Fig. 5.36 illustrates a plot representing the dataset before applying the boosting en-
semble technique.
Fig. 5.37 illustrates a plot representing the dataset after applying the boosting ensem-
ble technique
The key concept behind AdaBoost is that it assigns more importance to the sam-
ples that were misclassified by the previous models in each round. Hence, the subse-
quent models aim to rectify the errors made by their predecessors.
The alpha parameter governs the contribution of each weak learner to the final strong learner: weak learners with lower error are assigned a higher alpha and therefore contribute more to the combined prediction.
By training successive models on the errors made by previous models and com-
bining multiple weak models, AdaBoost mitigates bias and variance, resulting in im-
proved performance.
Some advantages of AdaBoost include its ease of implementation, minimal need
for tuning, and compatibility with various simple weak learner models, thereby yield-
ing strong performance.
5. Repeat steps 2–4, fitting new weak models to the residual errors, until a satisfactory error level or a preset limit on the number of models is reached.
The key difference from AdaBoost is that each new model in Gradient Boosting tries to cor-
rect the residual errors from previous step rather than focusing on misclassified examples.
The learning rate shrinks the contribution of each model to prevent overfitting, and the depth of each tree is usually kept small.
Improving the model along the gradient of the residual errors, in a manner similar to gradient descent, leads to a strong overall prediction: combining multiple additive models yields robust performance even though the individual models are weak.
Advantages include built-in regularization and the ability to handle a variety of data, but the method can overfit if it is not properly tuned.
The Python classes used for Gradient Boosting are sklearn.ensemble.GradientBoostingClassifier for classification tasks and sklearn.ensemble.GradientBoostingRegressor for regression tasks. Both are an integral part of the scikit-learn (sklearn) package, a widely adopted Python machine learning library.
Here’s a brief explanation of the key components of the Gradient Boosting library:
1. GradientBoostingClassifier: This class is used for classification tasks. It imple-
ments gradient boosting for classification. It can handle both binary and multi-
class classification problems.
2. GradientBoostingRegressor: This class is used for regression tasks. It imple-
ments gradient boosting for regression. It’s suitable for predicting continuous tar-
get variables.
3. Parameters: Both GradientBoostingClassifier and GradientBoostingRegressor possess a multitude of parameters that can be adjusted in order to enhance the performance of the model. A few noteworthy parameters include the number of boosting stages (n_estimators), the rate at which the model learns (learning_rate), the maximum depth of the individual regression estimators (max_depth), and the loss function (loss), among others.
4. Ensemble learning: Gradient Boosting is a method of ensemble learning that se-
quentially combines several weak learners, most commonly decision trees. The
subsequent models in this technique aim to rectify the mistakes made by their
predecessors, ultimately leading to the development of a powerful learner.
5. Gradient Boosting algorithm: Gradient Boosting constructs a sequence of trees
in an ensemble. In each iteration, a new tree is fitted to the residuals from the
previous iteration. The process of fitting involves minimizing a loss function
through the use of gradient descent.
6. Feature Importance: Gradient Boosting offers a feature importance attribute,
enabling users to comprehend the significance of each feature in the prediction
procedure.
The Python code example provided below demonstrates the application of AdaBoost
and Gradient Boosting classifiers:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Dataset creation and the Gradient Boosting model are not shown in this
# excerpt; the following lines are an assumed minimal reconstruction.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
base_estimator = DecisionTreeClassifier(max_depth=1)

# AdaBoost classifier
ada_boost = AdaBoostClassifier(base_estimator=base_estimator,
                               n_estimators=50, random_state=42)
ada_boost.fit(X_train, y_train)

# Gradient Boosting classifier (assumed settings)
grad_boost = GradientBoostingClassifier(n_estimators=50, random_state=42)
grad_boost.fit(X_train, y_train)

# Predictions
y_pred_ada = ada_boost.predict(X_test)
y_pred_grad = grad_boost.predict(X_test)
# Plotting
plt.figure(figsize=(18, 5))
# Before Boosting
plt.subplot(1, 3, 1)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='coolwarm',
marker='o', edgecolors='k')
plt.title('Before Boosting')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
# After AdaBoost
plt.subplot(1, 3, 2)
plt.scatter(X_train[:, 0], X_train[:, 1], c=ada_boost.predict(X_train),
cmap='coolwarm', marker='o', edgecolors='k')
plt.title('After AdaBoost')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# After Gradient Boosting (this third panel is implied by the three-panel
# figure, but its code is not shown in the excerpt)
plt.subplot(1, 3, 3)
plt.scatter(X_train[:, 0], X_train[:, 1], c=grad_boost.predict(X_train),
            cmap='coolwarm', marker='o', edgecolors='k')
plt.title('After Gradient Boosting')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.tight_layout()
plt.show()
The resulting three-panel figure shows scatter plots of Feature 2 versus Feature 1 for the training data: before boosting, after AdaBoost, and after Gradient Boosting.
Summary
Exercise (MCQs)
c) Linear models
d) Non-linear models
Answers
1. c) Regression
2. b) Classification
3. b) Classification
4. a) High-dimensional spaces
5. d) Clustering
6. b) Dimensionality reduction
7. a) Bayes’ theorem
8. a) Multiple weak learners
9. a) Minimizing a cost function
10. c) Solving linear regression
11. b) Modeling non-linear relationships
12. b) Decrease overfitting
13. b) Unsupervised learning
14. b) Multiple classes
15. a) Improve model performance
Answers
1. dependent, independent
2. classification
3. features
4. hyperplane
5. distinct
6. dimensionality reduction, visualization
7. Bayes’, independent
8. weak, performance
9. optimize
10. inverse
11. non-linear, complex
12. overfitting, cost
13. unsupervised
14. class
15. improve
Descriptive Questions
11. Use the Boston house prices dataset (from sklearn.datasets import load_boston) to predict house prices based on features like the average number of rooms and the crime rate. (Note: load_boston was removed in scikit-learn 1.2; fetch_california_housing is a readily available alternative.)
12. Use the Iris dataset (from sklearn.datasets import load_iris) to classify iris flowers
into different species based on sepal and petal measurements.
13. Use the Iris dataset to build a decision tree classifier to predict the species of an
iris flower.
14. Generate a synthetic dataset using make_blobs from sklearn.datasets and apply k-
means clustering to identify clusters.
15. Apply PCA to the Iris dataset to reduce the dimensionality of the data and visual-
ize the transformed data.
Chapter 6
Advanced Machine Learning Techniques
In the 1990s, Robert Schapire and Yoav Freund developed a very popular algorithm called AdaBoost, which for the first time tackled underfitting by explicitly reducing bias. AdaBoost is regarded as the parent of all gradient boosted decision trees. In the same line of algorithm development, gradient boosted trees (GBTs) are a powerful and versatile machine learning technique that can be used for both regression and classification tasks. GBT is another popular model ensembling method, which works by combining multiple weak learners, such as decision trees, into a single strong learner. Each weak learner is trained to improve the predictions of the previous learner, and the final prediction is made by combining the predictions of all weak learners. The main advantages of GBTs are described below.
High Accuracy
GBTs can often achieve high accuracy on a variety of tasks, even with complex
datasets.
Flexibility
GBTs can be adapted to a wide range of tasks by changing the type of weak learner,
the loss function, and other hyperparameters.
Interpretability
Unlike some other machine learning models, GBTs can be relatively easy to interpret,
which can be helpful for understanding why the model is making certain predictions.
Quick Response
With XGBoost, any type of data is accepted for training and testing the model.
Faster Prediction
Building the ensemble sequentially is complex and time-consuming, but once trained, prediction is quite fast. That is why we can speak of "fast prediction over slow training."
There is no doubt that GBTs perform very well, yet like a coin they have two sides and come with some disadvantages. Boosting can be further optimized by choosing an adequate loss function, and it rests on the idea that "a big set of weak learners can create one strong learner." XGBoost descends from the gradient boosted decision tree algorithm; a single decision tree is a classic weak learner, which is why so many experiments have been built around it. Even so, these algorithms still have some drawbacks:
1. Computational cost: Training a GBT model can be computationally expensive,
especially for large datasets.
2. Overfitting: GBTs can be prone to overfitting if they are not carefully regularized.
import warnings
import numpy as np
import seaborn as sns

warnings.filterwarnings("ignore")

diamonds = sns.load_dataset("diamonds")
diamonds.head()
diamonds.shape
diamonds.describe()
diamonds.describe(exclude=np.number)

from sklearn.model_selection import train_test_split

# Target/feature selection and the loop over text columns are not shown in
# this excerpt; the next lines are an assumed minimal reconstruction.
X, y = diamonds.drop("price", axis=1), diamonds["price"]
for col in X.select_dtypes(exclude=np.number).columns:
    X[col] = X[col].astype('category')
X.dtypes

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=1)

import xgboost as xgb
The code above walks step by step through preparing the dataset and getting the training and testing data ready. The percentage of the dataset used for training and testing may be changed according to the application, but in many cases 20% is kept for testing and 80% for training. In gradient boosting, a new weak model is trained at each step to predict the "error" of the current strong model, where the error is the difference between the predicted value and the expected value; a model with a low error is considered better than one with a higher error rate:
$$F_{i+1} = F_i - f_i$$
where $F_{i+1}$ is the updated strong model, $F_i$ is the strong model at step $i$, and $f_i$ is the weak model fitted at step $i$. This operation keeps repeating until it meets the given maximum accuracy. Some of the points are given below, which will help us to understand
the internal working mechanism of GBTs:
1. GBT works by iteratively building the decision trees. Each tree is trained to im-
prove upon the predictions of the previous tree, and the final prediction is made
by combining the predictions of all the trees.
2. The loss function is a measure of how well the model’s predictions fit the data.
The loss function is used to train each tree, and it is also used to determine when
to stop training the model.
3. GBTs have a number of hyperparameters that can be tuned to improve the mod-
el’s performance. These hyperparameters include the number of trees, the learn-
ing rate, and the maximum depth of the trees.
XGBoost, short for eXtreme Gradient Boosting, is a specific and very popular implementation of GBTs. In the previous code, we saw data collection and splitting into training and testing datasets by a specified percentage. Now let's have a look at how to calculate the root mean squared error (RMSE) of the trained model.
import numpy as np
from sklearn.metrics import mean_squared_error

# `model` and `dtest_reg` are assumed to come from earlier xgb.DMatrix/xgb.train steps
preds = model.predict(dtest_reg)
rmse = mean_squared_error(y_test, preds, squared=False)
In general, we can say that all these boosting algorithms share a common principle,
use boosting to create an ensemble of learners, and not only inherit the strengths of
GBTs like high accuracy, flexibility, and interpretability, but also boast several im-
provements and unique features:
1. Scalability: XGBoost is optimized for speed and efficiency, making it capable of
handling large datasets much faster than traditional GBT implementations.
2. Regularization: XGBoost incorporates various regularization techniques to pre-
vent overfitting, a common issue with GBTs. This allows it to achieve better per-
formance on unseen data.
3. Parallelization: XGBoost can be easily parallelized across multiple cores or ma-
chines, further enhancing its training speed.
4. Second-order optimization: XGBoost utilizes a second-order Taylor approxima-
tion in its loss function, leading to faster convergence and potentially better accu-
racy compared to standard GBTs.
5. Sparse data handling: XGBoost efficiently handles sparse data, where most fea-
tures have missing values for most observations. This is crucial for many real-
world datasets.
6. Customization: XGBoost offers numerous hyperparameters to fine-tune the model
for specific tasks and data characteristics.
Due to its impressive performance and flexibility, XGBoost has become widely adopted
across various domains, including:
1. Finance: Predicting loan defaults, credit risk, and stock prices
2. E-commerce: Recommending products, predicting customer churn, and detecting
fraudulent transactions
3. Manufacturing: Predicting machine failures, optimizing production processes,
and improving quality control
4. Healthcare: Predicting disease diagnoses, analyzing medical images, and person-
alizing treatment plans
XGBoost is a powerful tool for machine learning practitioners, offering a robust and
adaptable solution for various prediction and classification tasks. Let's have a look at the following code:
y_pred = xgb_model.predict(X_test)
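The single prediction line above presupposes a fitted model. A minimal sketch of how xgb_model might be created and used is shown below; the dataset, model type, and hyperparameters are assumptions for illustration, not taken from the text:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# scikit-learn style XGBoost classifier with a few common hyperparameters
xgb_model = xgb.XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
xgb_model.fit(X_train, y_train)

y_pred = xgb_model.predict(X_test)
print("Test accuracy:", (y_pred == y_test).mean())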
Gradient boosting techniques address one of the biggest problems: bias. A single decision tree usually underfits the data, because it splits the dataset only a few times in an attempt to separate the classes, and each split technically only divides the data into two smaller pieces. This is where ensembles improve the performance of the model. Random forest is a good example, where trees are grown and pruned based on the required data; bagging is the technique used there to reduce the overall variance of the algorithm:
$$H(x) = \sum_{j=1}^{m} \alpha_j\, h_j(x)$$
where $\alpha_j$ is the learning rate and $h_j(x)$ is a weak learner; summing them all produces something more powerful than any single weak learner in the ensemble. Let's look at another, more complete extreme gradient boosting example, suitable for a production environment, which produces better accuracy than the previous example. The code below runs from beginning to end:
import warnings
import numpy as np
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

warnings.filterwarnings("ignore")

diamonds = sns.load_dataset("diamonds")
diamonds.head()
diamonds.shape
diamonds.describe()
diamonds.describe(exclude=np.number)

# Assumed target/feature split (not shown in the original excerpt)
X, y = diamonds.drop("price", axis=1), diamonds["price"]
cats = X.select_dtypes(exclude=np.number).columns.tolist()
for col in cats:
    X[col] = X[col].astype('category')
X.dtypes

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=1)

# Assumed: DMatrix wrappers and training call (parameters chosen for illustration)
dtrain_reg = xgb.DMatrix(X_train, y_train, enable_categorical=True)
dtest_reg = xgb.DMatrix(X_test, y_test, enable_categorical=True)
params = {"objective": "reg:squarederror", "tree_method": "hist"}
model = xgb.train(params=params, dtrain=dtrain_reg, num_boost_round=100)

preds = model.predict(dtest_reg)
rmse = mean_squared_error(y_test, preds, squared=False)
# Cross-validation for a classification setup; `params` and `dtrain_clf` are
# assumed to come from a (not shown) multiclass configuration, and `n` is the
# chosen number of boosting rounds
results = xgb.cv(
    params, dtrain_clf,
    num_boost_round=n,
    nfold=5,
    metrics=["mlogloss", "auc", "merror"],
)

results.keys()   # a DataFrame with train/test mean and std columns per metric
results['test-auc-mean'].max()
import xgboost as xgb
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:5000], X[5000:]
y_train, y_test = y[:5000], y[5000:]
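The model that produces the score quoted below is not included in the excerpt. A minimal sketch, assuming an XGBoost classifier is fitted on this split (the hyperparameters are illustrative), is:

from xgboost import XGBClassifier

# make_hastie_10_2 labels are -1/+1, while XGBoost expects 0/1
y_train01 = (y_train > 0).astype(int)
y_test01 = (y_test > 0).astype(int)

clf = XGBClassifier(n_estimators=200, max_depth=1, learning_rate=1.0)
clf.fit(X_train, y_train01)
print(clf.score(X_test, y_test01))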
The code above gives a score of 0.9274285714285714, which is a strong starting point. If we train this model again, the accuracy will improve further because of the bagging and boosting techniques.
Fig. 6.1 shows a plot of the boosting iterations against the deviance of the training model.
The best example of bagging is the random forest. Boosting, on the other hand, means training a bunch of models sequentially, where every model learns from the mistakes of the previous one; the best example of boosting is the gradient boosted tree. Let's have a look at the figure below.
Drawing 6.2 shows the bagging method. The boosting technique, by contrast, combines weak learners sequentially so that every new tree corrects the errors of the previous one.
There are several different loss functions but for multiclass classification, cross-
entropy is a very popular option. The cross-entropy of the distribution q relative to a
distribution p over a given set is defined in the cross-entropy formula:
$$H(p, q) = -\,\mathbb{E}_p[\log q]$$
where $\mathbb{E}_p[\cdot]$ is the expected value operator with respect to the distribution p.
For discrete probability distributions p and q with the same support $\mathcal{X}$, this becomes:
$$H(p, q) = -\sum_{x \in \mathcal{X}} p(x)\log q(x)$$
Finally, we can say that boosting is a core concept in XGBoost and plays a crucial role in its impressive performance and capabilities. While it follows the traditional gradient boosting recipe, it differs in how aggressively the ensemble is optimized. It builds an ensemble of weak learners (often decision trees) sequentially; each new learner focuses on correcting the errors made by previous learners and uses the gradient of the loss function to guide the learning process.
Some of the key components of XGBoost are given below:
Sparse data handling: Efficiently handles data with many missing values, making it
suitable for real-world scenarios.
The boosting procedure used by XGBoost iterates through a few steps, given below:
Iteration
a) Calculate the residual (difference between actual and predicted values) for each
data point.
b) Build a new decision tree on the residuals, focusing on reducing errors.
c) Update the final prediction by weighting the new tree’s predictions based on the
learning rate.
d) Repeat: Iterate steps a-c until a stopping criterion is met (e.g., maximum iterations,
minimal progress).
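To make this iteration concrete, here is a small from-scratch sketch of gradient boosting for regression with a squared-error loss, where each new tree is fitted to the current residuals and scaled by a learning rate (all names and values are illustrative, not taken from the text):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant model
trees = []

for _ in range(100):
    residuals = y - prediction                      # step (a): residuals
    tree = DecisionTreeRegressor(max_depth=2)       # step (b): small tree on residuals
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # step (c): weighted update
    trees.append(tree)                              # step (d): repeat

print("Training MSE:", np.mean((y - prediction) ** 2))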
LightGBM, short for "light gradient boosting machine," is a powerful open-source machine learning library that leverages gradient boosting to tackle both regression and classification tasks. LightGBM is well known for its blazing-fast training speed: it outperforms many gradient boosting alternatives, making it ideal for large datasets. It also consumes significantly less memory than most other algorithms, even for bigger datasets on machines with limited resources, and it supports distributed execution across multiple machines at the same time, which further reduces training and execution time. In addition, it exposes a large set of hyperparameters that give fine-grained control over the learning process.
LightGBM is a compelling choice for machine learning tasks, where speed, efficiency,
and memory usage are critical. Its impressive performance and ease of use make it a
valuable tool for both beginners and experts. This algorithm can be used in multiple
sectors like:
1. Finance: Fraud detection, credit risk assessment, and stock price prediction
2. E-commerce: Product recommendation, customer churn prediction, and anomaly
detection
3. Natural language processing (NLP): Text classification, sentiment analysis, and
machine translation
4. Computer vision: Image classification, object detection, and image segmentation
Both algorithms are widely used across these sectors; a comparison is given below:
Tab. 6.1 compares the LightGBM and XGBoost algorithms with respect to speed, memory usage, accuracy, and ease of use. Let's get hands-on with some code; a snapshot of the sample dataset is shown below. Before executing this code, do not forget to install the "lightgbm" library (for example, with pip install lightgbm).
The code below will report a training accuracy of 0.9647 and a testing accuracy of 0.8163; the numbers may differ on your machine depending on the dataset.
Fig. 6.2 shows the dataset used for model training with LightGBM.
import pandas as pd
from sklearn.model_selection import train_test_split
import lightgbm as lgb
data = pd.read_csv("SVMtrain.csv")
# To define the input and output feature
x = data.drop(['Embarked', 'PassengerId'], axis=1)
y = data.Embarked
# train and test split
x_train, x_test, y_train, y_test = train_test_split(x, y,
test_size=0.33, random_state=42)
model = lgb.LGBMClassifier(learning_rate=0.09, max_depth=-5,
random_state=42)
model.fit(x_train, y_train, eval_set=[(x_test, y_test), (x_train,
y_train)],
verbose=20, eval_metric='logloss')
print('Training accuracy {:.4f}'.format(model.score(x_train, y_train)))
print('Testing accuracy {:.4f}'.format(model.score(x_test, y_test)))
Components of LightGBM
Its major components and related optimizations are given below:
1. Gradient-based one-side sampling (GOSS): This technique intelligently samples
data instances based on their gradients, focusing on informative examples that
contribute more to improving the model. It is a method for efficiently selecting a
subset of the training data during each iteration of the boosting process.
2. Exclusive feature bundling (EFB): Bundles similar features together during tree
construction, reducing memory usage and potentially improving efficiency. EFB
aims to reduce the number of features involved in the learning process, while
minimizing information loss. It focuses on “mutually exclusive” features, meaning
features that rarely take nonzero values simultaneously. Let’s assume features
representing different colors (red, blue, and green). They are mutually exclusive
because an object cannot be red and blue at the same time.
3. Gradient centralization (GC): This technique optimizes calculations around gradients, leading to faster training times, and it aims to improve training efficiency
and potentially even the final model’s performance by modifying the gradients
used in the optimization process. Here, gradients tell the optimizer how much to
update the weights of the neural network in each step. GC centralizes these gra-
dients, meaning it subtracts the mean value from each column of the gradient
matrix. This essentially forces the gradients to have an average of zero. GC differs
from gradient clipping, which limits the magnitude of individual gradients. While
clipping prevents exploding gradients, GC focuses on overall gradient direction.
Overall, we can say GC algorithm is a promising technique for improving the
training of DNNs. It is relatively simple to implement and has shown positive re-
sults in various applications. However, it is important to consider its limitations
and evaluate its effectiveness on a case-by-case basis.
Selective Sampling
1. Large gradients: All data points with large gradients are retained for training.
These points contain valuable information for improvement.
2. Small gradients: Points with small gradients are randomly sampled with a cer-
tain probability. This maintains diversity in the training set while focusing on
more informative points.
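As a rough illustration, GOSS can be enabled when constructing a LightGBM model. The parameter values below are illustrative only, and newer LightGBM releases expose the same behavior through a data-sampling setting rather than the boosting type:

import lightgbm as lgb

# top_rate keeps the fraction of samples with the largest gradients;
# other_rate randomly samples from the remaining small-gradient points
model = lgb.LGBMClassifier(boosting_type="goss", top_rate=0.2, other_rate=0.1)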
6.2 Kernel Methods
Kernel methods are a powerful technique used in various machine learning tasks like classification, regression, and clustering. They offer a way to effectively handle nonlinear data by implicitly transforming it into a higher dimensional space where complex relationships become more apparent. Imagine trying to separate different colored dots on a two-dimensional plane. Let's have a look at the code below:
im2.show()
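The single line above is only a fragment. A self-contained sketch of this kind of image-kernel experiment with Pillow is shown below; the file name and the edge-detection kernel values are assumptions, not taken from the text:

from PIL import Image, ImageFilter

im = Image.open("author_photo.jpg").convert("L")   # hypothetical image file
# 3x3 edge-detection kernel: the output stays black except where intensity changes
edge_kernel = ImageFilter.Kernel((3, 3), [-1, -1, -1, -1, 8, -1, -1, -1, -1], scale=1)
im2 = im.filter(edge_kernel)
im2.show()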
By applying such a kernel to the photograph loaded in memory, the pixel values are updated in place. If we apply these changes, we obtain a mostly black image; this is one way of extracting the important, required features from an image. The result looks like the snapshot below.
Fig. 6.3 shows the author's original image before applying the kernel method.
If the dots are linearly separable (e.g., a straight line can divide them), traditional lin-
ear algorithms like linear regression or linear support vector machines (SVMs) work
well. However, what if the dots form a more complex pattern, like a circle or a spiral?
Linear algorithms would not be able to effectively separate them.
Fig. 6.4 shows the author's image after applying the kernel method.
Kernel methods come to the rescue! They implicitly map the data to a higher di-
mensional space where the separation becomes linear. This mapping is done using a
mathematical function called a kernel.
1. Kernel function: This function takes two data points as input and calculates a
similarity measure between them. Different kernels exist for different data types
and problems (e.g., linear kernel, Gaussian kernel, and polynomial kernel).
2. Feature space: The kernel function essentially represents an inner product in a
high-dimensional feature space, even though we never explicitly compute the co-
ordinates of the data points in that space.
3. Linear algorithm: A standard linear algorithm, like a linear SVM or linear re-
gression, operates in this high-dimensional space using the similarity measure
provided by the kernel.
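A quick way to see this in practice is to compare a linear SVM with an RBF-kernel SVM on data that is not linearly separable; the dataset and parameters below are chosen purely for illustration:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: impossible to separate with a straight line
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X_train, y_train)

print("Linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))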
The “kernel trick” is a key aspect of using kernel methods effectively in machine
learning. It refers to the clever way that kernel methods work with data without ex-
plicitly transforming it into high-dimensional space, saving both computational time
and memory. Imagine you have data points that are not linearly separable in their
original lower dimensional space. To use a linear algorithm for classification or re-
gression, you would need to explicitly map the data into a higher dimensional space,
where it becomes linearly separable. However, this transformation can be computa-
tionally expensive and memory-intensive for large datasets.
Instead of explicitly performing the transformation, the kernel trick leverages a
special function called a kernel. This kernel function takes two data points as input
and computes a measure of their similarity based on their inner product in the high-
dimensional space.
When using the kernel trick, selecting the right kernel and its hyperparameters is crucial for good performance. Understanding the model's behavior in the high-dimensional space can be challenging, and kernel methods can be prone to overfitting if not regularized properly. Even so, the kernel trick is a powerful technique that unlocks the potential of kernel methods in machine learning: by efficiently capturing data similarity in a high-dimensional space without explicit computations, it enables flexible and powerful solutions for nonlinear problems.
The radial basis function (RBF) kernel, also known as the Gaussian kernel, is one of
the most popular and versatile kernels used in machine learning, particularly with
SVMs and other kernel methods. It excels at handling nonlinear data, making it a
valuable tool for various tasks like classification, regression, and clustering. Imagine
data points scattered in a two-dimensional plane. If the data forms a straight line, a
linear kernel can separate them easily. But what if the data forms a circle or a more
complex shape? That’s where RBF comes in. It implicitly maps the data points to a
higher dimensional space, where the separation becomes more linear. This mapping
is done by calculating the similarity between each pair of data points based on their
Euclidean distance. Points closer in the original space have a higher similarity in the
high-dimensional space, represented by a larger kernel value. The interpolant that
takes the form of a weighted sum of RBF interpolation is a mesh-free method, mean-
ing the nodes (points in the domain) need not lie on a structured grid, and does not
require the formation of a mesh. It is often spectrally accurate and stable for large
numbers of nodes even in high dimensions.
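For reference, the Gaussian (RBF) kernel value for two points x and x′ is usually written as follows (this is the standard definition, not quoted from the text):

$$K(x, x') = \exp\!\left(-\gamma\,\lVert x - x'\rVert^2\right), \qquad \gamma = \frac{1}{2\sigma^2}$$

so points that are close in the original space receive a kernel value near 1, while distant points receive a value near 0.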
Fig. 6.5 shows the different stages of RBF optimization.
There are many types of RBFs available and some of them are given below.
1. Gaussian: Gaussian RBF is a powerful tool used in various machine learning algo-
rithms, particularly for handling nonlinear data. It is a general class of functions
that measure the similarity between two data points based on their distance. A
specific type of RBF is shaped like a bell curve, where closer points have higher
similarity scores than farther ones.
2. Multiquadratic: The multiquadratic RBF (MQ-RBF) is another type of RBF used
in machine learning, particularly for interpolation and approximation tasks. It
shares some similarities with the Gaussian RBF. Some of the functions are given
below:
(i) Unlike the bell-shaped Gaussian, the MQ-RBF takes the form $\varphi(r) = \sqrt{r^2 + c^2}$, where r is the distance between two data points and c is a shape parameter.
(ii) It combines the aspects of the linear and inverse MQ-RBFs, making it more
flexible than either alone.
3. Inverse quadratic: The inverse of a function is another function that “undoes”
the original function. When applied to a quadratic function, finding the inverse
isn’t always straightforward. Quadratic functions typically don’t pass the horizon-
tal line test, meaning one input can have multiple outputs, which isn’t a property
of a function and its inverse:
1. However, if we restrict the domain of the quadratic function to a specific
range where it only has one output for each input (passes the horizontal line
test), then we can find its inverse. This restricted function is called a bijection.
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Layer, Dense, Flatten
from keras.losses import binary_crossentropy
from keras import backend as K

class RBFLayer(Layer):
    def __init__(self, units, gamma, **kwargs):
        super(RBFLayer, self).__init__(**kwargs)
        self.units = units
        self.gamma = K.cast_to_floatx(gamma)

    # build() and call() are not shown in the original excerpt; this is the
    # usual reconstruction: one center per unit, output exp(-gamma * ||x - mu||^2)
    def build(self, input_shape):
        self.mu = self.add_weight(name='mu', shape=(int(input_shape[1]), self.units),
                                  initializer='uniform', trainable=True)
        super(RBFLayer, self).build(input_shape)

    def call(self, inputs):
        diff = K.expand_dims(inputs) - self.mu
        return K.exp(-1 * self.gamma * K.sum(K.pow(diff, 2), axis=1))

model = Sequential()
model.add(Flatten(input_shape=(28, 28)))
model.add(RBFLayer(10, 0.5))
model.add(Dense(1, activation='sigmoid', name='foo'))
model.compile(optimizer='rmsprop', loss=binary_crossentropy)
==================================Output===================================
Epoch 1/3
WARNING:tensorflow:From C:\Users\23188\PycharmProjects\workshop\.venv\Lib\site-
packages\keras\src\utils\tf_utils.py:492: The name tf.ragged.RaggedTensorValue is dep-
recated. Please use tf.compat.v1.ragged.RaggedTensorValue instead.
k-Means is a popular and straightforward clustering algorithm, but it does have some limitations, so venturing beyond k-means opens up a diverse toolbox for tackling more complex data structures and clustering needs. The practical approach is to experiment with different techniques, evaluate their performance using appropriate metrics, and choose the one that best suits your needs. Remember that there is no "one-size-fits-all" solution, and exploring beyond k-means opens up a world of possibilities for effective data clustering. Some of the important techniques are given below:
Ward’s method: This approach minimizes the variance within each cluster, aiming
for compact and spherical clusters. However, it might struggle with elongated or ir-
regular shapes.
Average linkage: This method joins clusters based on the average distance between
all pairs of points in the clusters, leading to more balanced clusters but potentially
sacrificing compactness.
Fig. 6.6 shows a plot of sorted observations versus k-NN distance.
Let's have a look at the code below, which implements the DBSCAN algorithm:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from kneed import KneeLocator

df = pd.read_csv("https://reneshbedre.github.io/assets/posts/tsne/tsne_scores.csv")
df.head(2)
# check the shape of the dataset
print(df.shape)

# n_neighbors = 5 because kneighbors() returns the distance of each point to
# itself (i.e. the first column will be zeros)
nbrs = NearestNeighbors(n_neighbors=5).fit(df)
# Find the k-neighbors of a point
neigh_dist, neigh_ind = nbrs.kneighbors(df)
# sort the neighbor distances (lengths to points) in ascending order;
# axis = 0 sorts along the first axis, i.e. along rows
sort_neigh_dist = np.sort(neigh_dist, axis=0)

# The KneeLocator construction is not shown in the original excerpt; this is
# an assumed reconstruction that finds the "elbow" of the k-distance curve
k_dist = sort_neigh_dist[:, 4]
kneedle = KneeLocator(x=range(1, len(k_dist) + 1), y=k_dist, S=1.0,
                      curve="concave", direction="increasing")
kneedle.plot_knee()
plt.show()
– Parameter sensitivity: Choosing optimal values for ε and MinPts can impact re-
sults and require some experimentation.
– Curse of dimensionality: Can be less effective in high-dimensional data due to
the influence of distance calculations.
– Clustering large applications (CLARA): This algorithm partitions data into k clusters around medoids while minimizing the cost of moving points between clusters, leading to compact and well-separated clusters. It is a sampling-based extension of k-medoids known for its efficiency on large datasets. CLARA can be used in anomaly detection (noise identification), image segmentation, customer segmentation, market research, and scientific data analysis.
Fig. 6.8 shows a plot of the final clustering after running DBSCAN.
Mixture models: These models assume that the data arises from a mixture of proba-
bility distributions, where each distribution represents a cluster. Popular examples
include Gaussian mixture models (GMMs) and latent Dirichlet allocation (LDA). GMMs
are a powerful tool for data clustering and density estimation, particularly when deal-
ing with data that can be represented as a mixture of multiple overlapping Gaussian
distributions while LDA is a powerful and versatile probabilistic topic modeling tech-
nique widely used in text analysis and NLP tasks. We can use them in document cate-
gorization and organization, text summarization and topic extraction, information
retrieval and recommendation systems, anomaly detection and plagiarism detection,
sentiment analysis and opinion mining, and language modeling and dialogue systems.
Mixture and topic models do have some limitations:
1. Choosing the number of topics: Requires careful consideration and evaluation.
2. Sensitivity to hyperparameters: Tuning these parameters can impact results.
3. Black box nature: While topics are identified, their semantic meaning might re-
quire further interpretation.
from pyclustering.cluster.clarans import clarans
from pyclustering.utils import timedcall
from sklearn import datasets

# get the iris data. It has 4 features, 3 classes and 150 data points.
iris = datasets.load_iris()
data = iris.data
"""!
The pyclustering library clarans implementation requires list of lists
as its input dataset.
Thus we convert the data from numpy array to list.
"""
data = data.tolist()
"""!
@brief Constructor of clustering algorithm CLARANS.
@details The higher the value of maxneighbor, the closer is CLARANS to
K-Medoids, and the longer is each search of a local minima.
@param[in] data: Input data that is presented as list of points
(objects), each point should be represented by list or tuple.
@param[in] number_clusters: amount of clusters that should be allocated.
@param[in] numlocal: the number of local minima obtained (amount of
iterations for solving the problem).
@param[in] maxneighbor: the maximum number of neighbors examined.
"""
clarans_instance = clarans(data, 3, 6, 4)

# The lines below, which run the algorithm and print the results shown in the
# output, are an assumed reconstruction (they are not part of the excerpt).
print("A peek into the dataset :", data[:4])
(ticks, result) = timedcall(clarans_instance.process)
print("Execution time :", ticks)
print("Index of the points that are in a cluster :", clarans_instance.get_clusters())
print("The target class of each datapoint :", iris.target)
print("The index of medoids that algorithm found to be best :", clarans_instance.get_medoids())
=================================Output====================================
A peek into the dataset : [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2], [4.6, 3.1,
1.5, 0.2]]
Execution time : 1.2831659999792464
Index of the points that are in a cluster : [[50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,
63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 119, 121, 123, 126, 127, 133, 138, 142, 149],
[77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112, 114, 115, 116, 117, 118, 120, 122, 124,
125, 128, 129, 130, 131, 132, 134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 146, 147, 148], [0,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]]
The target class of each datapoint : [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
000000000
0000000000000111111111111111111111111
1111111111111111111111111122222222222
2222222222222222222222222222222222222
2 2]
The index of medoids that algorithm found to be best : [78, 128, 2]
===========================================================================
Spectral clustering: This technique groups data points using the spectrum (eigenvalues) of a similarity graph built from the data, and proceeds as follows:
1. Construct a similarity graph: Represent data points as nodes and connect similar points with edges (weighted based on similarity).
2. Calculate the Laplacian matrix: This matrix encodes the connections between
nodes and reflects the similarity graph’s structure.
3. Extract eigenvectors and eigenvalues: Compute the eigenvectors and eigenval-
ues of the Laplacian matrix.
4. Project data onto lower dimensions: Use the eigenvectors associated with the
smallest eigenvalues to project data points into a lower dimensional space.
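A compact illustration of these steps with scikit-learn, which performs the graph construction, eigen-decomposition, and final grouping internally (the dataset and parameters here are illustrative):

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaving half-moons: a classic case where k-means struggles
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(X)
print(labels[:20])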
Mean shift: This nonparametric technique iteratively moves points toward denser re-
gions until convergence, identifying clusters of arbitrary shapes. Mean shift is a den-
sity-based clustering algorithm that excels at discovering clusters of arbitrary shapes
in your data. It can be used in anomaly detection (noise identification), image segmen-
tation, customer segmentation, market research, and scientific data analysis.
1. Define a kernel function: This function determines how the influence of neigh-
boring points decreases with distance. (The commonly used kernels include flat
and Gaussian.)
2. Start at each data point:
(i) Calculate the average (mean) of its neighboring points within the kernel
bandwidth.
(ii) Shift the point toward this calculated mean.
3. Repeat steps 1 and 2: Iteratively shift each point toward the mean of its neigh-
bors until convergence (meaning points stop moving significantly).
4. Identify clusters: Points that converge to the same location form a cluster.
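These steps are implemented in scikit-learn's MeanShift estimator; a minimal sketch (the bandwidth estimation and data are illustrative) looks like this:

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.7, random_state=0)

# Estimate a reasonable kernel bandwidth from the data itself
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(X)

print("Number of clusters found:", len(ms.cluster_centers_))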
1. Parameter sensitivity: Choosing optimal values for the kernel and bandwidth
can impact results and require some experimentation.
2. Curse of dimensionality: Can be less effective in high-dimensional data due to
the influence of distance calculations.
3. Computationally expensive: Iterative nature and dependence on distance calcu-
lations can lead to higher computational cost compared to some other methods.
The same map-based idea underlies self-organizing maps (SOMs), which are trained as follows:
1. Initialize the map: Create a grid of neurons in the lower dimensional space, each with randomly assigned weights.
2. Present a data point: Randomly select a data point from the high-dimensional space.
3. Find the winning neuron: Calculate the distance between the data point and
each neuron using a similarity measure (e.g., Euclidean distance). Identify the
neuron closest to the data point as the “winner.”
4. Update weights: Adjust the weights of the winning neuron and its neighbors toward
the data point, making them more responsive to similar data points in the future.
5. Repeat steps 2–4: Iterate through the data points multiple times, refining the
map based on the data distribution.
1. Parameter selection: Choosing the right grid size and learning rate can impact
the results.
2. Interpretability: While visually informative, understanding the meaning of spe-
cific regions on the map might require further analysis.
3. Performance in high dimensions: Effectiveness can decrease with increasing di-
mensionality of the input data.
The best clustering technique depends on your specific data and goals. Consider fac-
tors like:
1. Data type: Numerical, categorical, text, etc.
2. Number of clusters: Known or unknown beforehand?
3. Cluster shape: Spherical, elongated, or irregular?
4. Presence of noise: Does the data contain outliers?
5. Interpretability: How important is understanding the reasons behind clusters?
6.3 Anomaly Detection
Anomaly detection, also known as outlier detection, is about identifying the unusual: it involves identifying data points that deviate significantly from the expected pattern in a dataset. This is an important task in various fields, from fraud detection to system health monitoring. Some of the important anomaly detection techniques, grouped by how they detect anomalies, are described below.
6.3.1.1 Z-Score
Z-score, also known as the standard score, is a simple yet powerful statistical method
for anomaly detection. It tells you how many standard deviations a particular data
point is away from the mean of the dataset, providing a standardized measure of de-
viation from the average. It measures how many standard deviations a data point is
away from the mean. Points with high Z-scores are potential anomalies:
$$Z\text{-score} = \frac{x - \text{mean}}{\text{standard deviation}}$$
where mean is the average of all data points in the dataset and standard deviation
measures the spread of the data around the mean.
A Z-score of 0 means the data point is exactly equal to the mean. Positive Z-scores
indicate points above the mean, with higher values representing larger deviations.
Negative Z-scores indicate points below the mean, with absolute values signifying the
degree of deviation. Typically, data points with absolute Z-scores greater than 2 or 3
are considered potential anomalies, as they fall outside the expected range of the ma-
jority of data points. However, this threshold can be adjusted based on your specific
data and desired sensitivity. Let's have a look at the code:

import pandas as pd
import numpy as np

# A small example series; the original data source is not shown in the excerpt
data = np.array([10, 12, 11, 13, 12, 95, 11, 12, 10, 13])
z_scores = (data - data.mean()) / data.std()

# Set a threshold for anomaly detection (e.g., z-score > 2 or < -2)
threshold = 2
# Identify anomalies
anomalies = np.where(np.abs(z_scores) > threshold)[0]
print(anomalies)
– Anomaly detection: Z-score is widely used for identifying unusual data points, such
as fraudulent transactions, system errors, or outliers in scientific experiments.
– Feature scaling: In machine learning, Z-score is often used to standardize fea-
tures before processing by algorithms, ensuring all features have similar scales
and preventing one feature from dominating the analysis.
– Quality control: Monitoring process variables in manufacturing or other industries
often involves using Z-scores to detect deviations from normal operating ranges.
1. Can be sensitive to outliers itself, as they affect the mean and standard deviation
used for normalization.
2. Less effective for skewed or heavy-tailed data distributions.
The interquartile range (IQR) is the spread between the first quartile (Q1) and the third quartile (Q3), IQR = Q3 − Q1. Typical uses include:
1. Outlier detection: Points falling outside the range [Q1 − 1.5·IQR, Q3 + 1.5·IQR] can be considered potential outliers.
2. Data exploration: IQR provides a quick grasp of the central tendency and spread
of your data, complementing the median.
3. Comparing data distributions: You can compare the IQRs of different groups or
datasets to understand their relative variability.
4. Robust measure of variability: IQR is less sensitive to outliers compared to the
range, making it more reliable for skewed or noisy data.
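A minimal NumPy illustration of the IQR rule (the data values are made up for the example):

import numpy as np

data = np.array([10, 12, 11, 13, 12, 95, 11, 12, 10, 13])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print("IQR:", iqr, "outliers:", outliers)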
1. Choose a kernel: Similar to SVMs, OCSVMs use a kernel function to convert data
points into higher dimensional space, potentially enabling better separation of
normal and abnormal regions.
2. Optimize the boundary: The algorithm seeks to find the maximum-margin hy-
perplane (decision boundary) in the high-dimensional space that encloses most of
the training data with the largest possible margin.
3. Identify anomalies: New data points are projected onto the same high-
dimensional space. Points falling outside the learned boundary or exceeding a
distance threshold are considered anomalies.
2. Local outlier factor (LOF): Calculate the ratio of a point’s local density (number
of close neighbors) to the density of its neighbors’ neighbor. High LOF values indi-
cate potential anomalies.
3. Isolation forest: Randomly partition the data and measure the isolation depth of
each point, which reflects the number of splits needed to isolate it completely.
Anomalies are easier to isolate and have lower depths.
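All three of these detectors are available in scikit-learn; the short sketch below (with made-up data) shows the typical calls, where a label of -1 marks an anomaly:

import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([0.3 * rng.randn(200, 2), [[4.0, 4.0]]])  # 200 inliers + 1 obvious outlier

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X)
print("One-class SVM labels:", ocsvm.predict(X)[-3:])

iso = IsolationForest(random_state=0).fit(X)
print("Isolation forest labels:", iso.predict(X)[-3:])

lof = LocalOutlierFactor(n_neighbors=20)
print("LOF labels:", lof.fit_predict(X)[-3:])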
6.4 Clustering
Some of the advantages of clustering techniques are automatic feature learning, han-
dling complex data, identifying complex clusters, and scalability. Some of the common
approaches are given below:
– Deep autoencoders: These networks learn compressed representations of the
data, with similar data points having similar code representations. Clustering can
be performed on these encoded representations.
Isolation forest is a powerful anomaly detection algorithm that utilizes isolation trees
to identify anomalies in your data. This isolates anomalies by randomly partitioning
the data space and measuring the isolation depth of each point. Its functions are
given below:
1. Randomly select features and split points within their ranges.
2. Divide the data into two branches based on the chosen split.
3. Repeat steps 1 and 2 recursively until reaching isolation or a maximum depth.
4. Each data point has a path length representing the number of splits needed to
isolate it.
5. Average the path lengths of a data point across all isolation trees in the forest.
6. Points with shorter path lengths (easier to isolate) are considered more anomalous.
LOF: LOF is another powerful algorithm for anomaly detection, utilizing the concept
of local density to identify data points deviating from their surroundings. It is used to
calculate the ratio of the local density of a point to the density of its neighbors. High
LOF values indicate potential anomalies. LOF compares the local density of a data
point (its neighborhood) to the local densities of its neighbors. Anomalies are consid-
ered points with significantly lower local density, indicating they reside in sparse re-
gions compared to their neighbors. The steps involved in its working mechanism are given below:
1. Define the neighborhood: Choose a parameter “k” representing the number of
NN to consider for each data point.
2. Calculate reachability distance: Measure the reachability distance from each
point to its neighbors, considering both their distance and the density of their
neighborhoods.
3. Compute local density: Estimate the local density of each point by inverting the
average reachability distance of its kNN.
4. Calculate LOF: Divide the local density of a point by the average local density of
its kNN.
5. Identify anomalies: Points with significantly higher LOF values (higher than a
predefined threshold) are considered potential anomalies, residing in sparser re-
gions compared to their surroundings.
1. Time series analysis: Time series analysis is a powerful tool for studying and understanding sequences of data points collected over time. It detects anomalies in time-dependent data as deviations from expected patterns or trends. Time series analysis focuses on extracting meaningful information and patterns from
data points ordered chronologically. Some of the key points about time series
analysis are given below:
– Decomposition: Breaks down the time series into trend, seasonality, and re-
sidual components to analyze each aspect separately.
– Autocorrelation and partial autocorrelation: Examine the relationship be-
tween data points at different time lags to identify patterns and dependencies.
– Statistical modeling: Fit various statistical models (e.g., ARIMA and SARIMA)
to the data to capture seasonality, trends, and random components.
– Machine learning: Utilize techniques like recurrent neural networks or long
short-term memory networks to automatically learn complex patterns and
make predictions.
This could involve anything from stock prices to website traffic, and sensor read-
ings to weather data. It helps to answer questions like:
1. Are there any trends or seasonalities in the data?
2. What are the underlying patterns driving the data’s behavior?
3. Can we predict future values based on the historical data?
2. Spectral analysis: Spectral analysis is also known as delving into the frequency
domain. Spectral analysis is a powerful technique for decomposing signals into
their constituent frequencies, revealing hidden patterns and insights that might
be obscured in the time domain. It analyzes the frequency domain of data to iden-
tify unusual patterns. Imagine a complex signal like music, speech, or an EEG re-
cording. We can use the following sectors:
– Audio analysis: Identifying musical notes, analyzing speech patterns, and de-
tecting audio anomalies
– Image processing: Identifying textures, edges, and objects based on their fre-
quency content
– Signal processing: Filtering noise, extracting specific frequency components,
and analyzing vibrations in mechanical systems
Summary
GBTs are a powerful machine learning technique used for both regression and classifica-
tion tasks. They work by combining multiple weak learners, typically decision trees, into
a single strong learner: XGBoost versus LightGBM, two gradient boosting powerhouses.
Both XGBoost and LightGBM are popular implementations of GBTs known for their
high accuracy and efficiency. They are powerful algorithms for gradient boosting, com-
bining weak learners (decision trees) for improved accuracy and performance.
Ensemble learning is a powerful machine learning technique that combines the
predictions of multiple models to achieve better performance than any single model.
It is like getting multiple experts to weigh in on a problem and then taking the best
guess based on their combined insights.
Kernel methods are a powerful class of algorithms in machine learning, particu-
larly known for their ability to handle nonlinear relationships in data. While tradi-
tional linear models are limited to linear relationships, kernel methods can learn
complex patterns by implicitly mapping data into a higher dimensional space where
these relationships become linear. The “kernel trick” is a crucial aspect of kernel
methods, often referred to as the key to their magic. It allows them to handle nonlin-
ear data while maintaining computational efficiency. In other words, we can say that
they transform data into higher dimensions for linear separation, enabling complex
relationships between features.
The RBF kernel is a popular and powerful kernel function widely used in various machine learning algorithms, particularly SVMs, and it is especially well suited to nonlinear data.
While k-means is a popular and widely used clustering technique, it does have its
limitations. Here are some alternative clustering techniques you can consider depend-
ing on your specific needs and data characteristics:
DBSCAN is a powerful clustering algorithm that groups data points based on their
density and connectivity. It is particularly useful for:
– Identifying clusters of arbitrary shapes and sizes: Unlike k-means, which re-
quires predefining spherical clusters, DBSCAN can handle clusters with complex
shapes, making it suitable for various data types.
– Detecting outliers: DBSCAN can identify and separate outliers from the main
clusters, making it ideal for noisy datasets.
– Handling data without predefined number of clusters: You do not need to
specify the number of clusters beforehand, which can be helpful when the num-
ber of clusters is unknown or varies across different datasets.
Mean shift is a powerful and versatile technique in machine learning and data
analysis, particularly useful for unsupervised learning tasks like clustering and
density estimation. It operates by iteratively shifting data points toward the “dens-
est” region in their vicinity, ultimately converging to the modes or peaks in the un-
derlying data distribution.
Spectral clustering is a powerful technique in machine learning often used for
unsupervised learning tasks like clustering and graph partitioning. It leverages
the spectral properties of a similarity matrix to group data points based on their un-
derlying structure. This makes it particularly useful for identifying nonconvex clus-
ters and handling data with complex shapes, where other clustering algorithms
like k-means might struggle.
Exercise (MCQs)
1. What is GBT?
A) A regression technique that iteratively builds trees to improve predictions
B) A classification technique that uses decision trees to classify data points
C) A dimensionality reduction technique that projects data onto lower-dimensional
subspaces
D) A clustering algorithm that groups data points based on their similarity
13. How DBSCAN can be compared with other clustering algorithms like k-
means?
A) DBSCAN is more efficient for high-dimensional data
B) k-Means requires specifying the number of clusters in advance, while DBSCAN
does not
C) k-Means is more sensitive to outliers
D) All of the above
12. What are the steps involved in using DBSCAN for clustering a dataset?
13. How can you choose the appropriate values for eps and min_samples?
14. How can you evaluate the performance of DBSCAN on a clustering task?
15. What are some of the alternative clustering algorithms to DBSCAN?
Answers
1. AdaBoost
2. overfitting
3. speed and efficiency
4. Ensemble learning
5. LightGBM
Chapter 7
Neural Networks and Deep Learning
7.1 Introduction to Neural Networks
Neural networks are a type of artificial intelligence inspired by the human brain. They consist of interconnected nodes called neurons, which process information in a similar way to biological neurons. The approach is inspired by the biological neural networks that constitute the human brain, and a trained system can learn and make decisions much like a human. Everything rests on prior data and training: the system learns from its mistakes and improves over time. Fundamentally, it mimics the human brain in order to develop algorithms that build predictive models and capture complex patterns. The human brain consists of fundamental cells known as neurons that store and process information. Neurons process the input data in the form of electrical signals, which are then passed on to other neurons: a stimulus is received by the dendrites of a neuron, the information is processed in the neuron's cell body, and the result is passed through the axon to the next available neuron, depending on the strength of the signal. Finally, each neuron can accept or reject the produced output, and the process continues until the output accuracy reaches the required level, as we can see in Fig. 7.1.
Fig. 7.1 shows a biological neuron in the human brain.
These networks learn by adjusting the connections between neurons based on
data, enabling them to perform complex tasks like:
– Image recognition: Classifying objects in images (e.g., cats, dogs, and cars)
– Natural language processing: Understanding and generating text (e.g., machine
translation and chatbots)
– Recommendation systems: Suggesting products or content users might like
– Fraud detection: Identifying suspicious financial transactions
7.2 Perceptron
Perceptrons are the fundamental building blocks of neural networks, serving as the
basic unit of computation. While seemingly simple, they hold immense power when
combined and trained, allowing complex learning and problem-solving. Here’s a
breakdown of their key features:
Perceptrons act as linear classifiers, meaning they can only learn and represent line-
arly separable patterns. This limitation led to the development of more complex archi-
tectures like MLPs with multiple layers and nonlinear activation functions. Despite
their limitations, perceptrons are powerful tools for understanding the basic princi-
ples of neural networks and learning algorithms like perceptron learning rule.
– While not used in complex tasks anymore, perceptrons still find application in:
– Simple classification problems like spam filtering
– Feature extraction and dimensionality reduction
– Understanding the theoretical foundations of neural networks
– To delve deeper, consider exploring concepts like
– MLPs and their ability to learn nonlinear relationships
– Different activation functions and their impact on learning
– Perceptron learning rule and its limitations
– Advanced neural network architectures like CNNs and recurrent neural networks
(RNNs) built upon the foundation of perceptrons
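As a concrete illustration of a single perceptron acting as a linear classifier, scikit-learn's Perceptron can be trained on a linearly separable toy problem (the dataset and settings below are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Linearly separable two-class data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=2.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = Perceptron(max_iter=1000, tol=1e-3).fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))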
The accompanying figure shows a multilayer perceptron: inputs x enter the input layer, weighted connections w feed one or more hidden layers and the output layer, and the difference from the desired values is backpropagated to the output layer to update the weights.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

# The fragments below come from a larger NeuralNetwork class; its weight
# initialization and the train(), predict() and view_error_development()
# methods are not shown in this excerpt.

        # ... inside NeuralNetwork.__init__():
        self.error_list = []
        self.limit = 0.5
        self.true_positives = 0
        self.false_positives = 0
        self.true_negatives = 0
        self.false_negatives = 0

    def test_evaluation(self, input_test, output_test):
        # Count confusion-matrix entries by thresholding each prediction at self.limit
        # (the loop header and the first branch are reconstructed from context)
        for i, test_element in enumerate(input_test):
            if self.predict(test_element) > self.limit and \
                    output_test[i] == 1:
                self.true_positives += 1
            if self.predict(test_element) < self.limit and \
                    output_test[i] == 1:
                self.false_negatives += 1
            if self.predict(test_element) > self.limit and \
                    output_test[i] == 0:
                self.false_positives += 1
            if self.predict(test_element) < self.limit and \
                    output_test[i] == 0:
                self.true_negatives += 1
        print('True positives: ', self.true_positives,
              '\nTrue negatives: ', self.true_negatives,
              '\nFalse positives: ', self.false_positives,
              '\nFalse negatives: ', self.false_negatives,
              '\nAccuracy: ',
              (self.true_positives + self.true_negatives) /
              (self.true_positives + self.true_negatives +
               self.false_positives + self.false_negatives))

# Run a script that trains and evaluates the neural network model
NN = NeuralNetwork()
NN.train(input_train_scaled, output_train_scaled, 200)
NN.predict(input_pred)
NN.view_error_development()
NN.test_evaluation(input_test_scaled, output_test_scaled)
7.3 TensorFlow
TensorFlow is a powerful and popular open-source library for building and training ma-
chine learning models, particularly in the realm of deep learning. The fundamental data structure in TensorFlow is the tensor, which represents multidimensional arrays of numerical values, similar to matrices in linear algebra. Tensors have a specific data type (integers, floats, etc.)
and shape (number of dimensions and elements in each dimension). Operations in Ten-
sorFlow are performed on tensors, allowing for calculations and manipulations.
– TensorFlow 2.0 introduced eager execution, allowing you to see the results of op-
erations immediately, line by line, similar to traditional scripting languages.
– This makes learning and debugging easier compared to the older symbolic execu-
tion mode.
import tensorflow as tf

# Data loading and the model definition are not shown in the excerpt; the lines
# below reconstruct them along the lines of the standard TensorFlow quickstart.
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])
predictions = model(x_train[:1]).numpy()

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_fn(y_train[:1], predictions).numpy()

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test, verbose=2)

probability_model = tf.keras.Sequential([
    model,
    tf.keras.layers.Softmax()
])
probability_model(x_test[:5])
===============================Output======================================
Epoch 1/5
– WARNING:tensorflow:From C:\Users\23188\PycharmProjects\workshop\.venv\Lib
\site-packages\keras\src\utils\tf_utils.py:492: The name tf.ragged.RaggedTensor-
Value is deprecated. Please use tf.compat.v1.ragged.RaggedTensorValue instead.
– WARNING:tensorflow:From C:\Users\23188\PycharmProjects\workshop\.venv\Lib
\site-packages\keras\src\engine\base_layer_utils.py:384: The name tf.executing_ea-
gerly_outside_functions is deprecated. Please use tf.compat.v1.executing_eager-
ly_outside_functions instead.
===========================================================================
7.3.3 Keras
7.3.4 Sessions
– In TensorFlow 1.x, sessions were used to manage the execution of the computa-
tional graph.
– With eager execution, sessions are no longer required, making the code cleaner
and more intuitive.
import tensorflow as tf
7.4 Implementing Neural Network Using TensorFlow
Implementing neural networks with TensorFlow and Keras can be a rewarding expe-
rience, allowing you to build powerful machine learning models from scratch. Start
with simple tasks and gradually increase complexity as you gain experience. Utilize
tutorials and online resources for specific examples and code implementations. Ex-
periment with different architectures and hyperparameters to find the best perform-
ing model for your task. Consider using tools like TensorBoard for visualizing training
progress and model behavior. Here’s a breakdown of the key steps involved:
7.4 Implementing Neural Network Using TensorFlow 371
Let's look at the code below, which demonstrates a simple linear regression model using the TensorFlow deep learning framework with eager execution:
import tensorflow as tf
# Trainable parameters and toy data (assumed; the original definitions were cut from this excerpt)
W, b = tf.Variable(0.0), tf.Variable(0.0)
x_train = tf.constant([1.0, 2.0, 3.0, 4.0])
y_train = tf.constant([3.0, 5.0, 7.0, 9.0])
predict = lambda x: W * x + b
loss = lambda x, y: tf.reduce_mean(tf.square(predict(x) - y))
# Optimizer
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
# Training loop
for epoch in range(100):
    with tf.GradientTape() as tape:
        current_loss = loss(x_train, y_train)  # Calculate loss
    grads = tape.gradient(current_loss, [W, b])
    optimizer.apply_gradients(zip(grads, [W, b]))
# Make a prediction
prediction = predict(5)
print(f"Prediction for x = 5: {prediction}")
7.5 Building a Neural Network Using Keras Framework
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
Keras is a high-level API built on top of TensorFlow, designed for ease of use and
rapid prototyping. It offers prebuilt components like layers, optimizers, and loss func-
tions, simplifying the process of building and experimenting with neural networks.
Keras is known for its readability and Pythonic syntax, making it easier to learn and
use compared to TensorFlow’s low-level APIs. Keras is widely used for quick experi-
mentation, building prototypes, and developing deep learning models where simplic-
ity and speed are priorities. On the other hand, TensorFlow is a comprehensive
framework offering a wide range of functionalities for various machine learning
tasks including data manipulation, numerical computations, and deep learning. It pro-
vides low-level APIs that give you fine-grained control over your model architecture
and training process. This allows for flexibility and customization, but requires more
coding effort and understanding of the underlying concepts. TensorFlow is often used
for research, complex tasks, and production-grade models where fine-tuning and con-
trol are crucial.
If you’re new to deep learning or want to quickly experiment with different archi-
tectures, Keras is a great starting point. As you gain experience and need more control
or flexibility, you can gradually transition to using TensorFlow’s low-level APIs. You can
even combine Keras and TensorFlow by building your model with Keras’ high-level API
and then fine-tuning specific parts using TensorFlow’s lower-level functionalities. We
can classify the TensorFlow and Keras applications based upon the model requirements. Tab. 7.1 shows the differences between the TensorFlow and Keras frameworks based on different features:
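As a minimal illustration of the Keras high-level workflow described above (the input size, layer widths, and class count are placeholder assumptions), a small classifier can be defined, compiled, and inspected in a few lines:

from tensorflow import keras
from tensorflow.keras import layers

# A small fully connected classifier assembled from prebuilt Keras components
model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

The same network written against TensorFlow's low-level APIs would require explicit weight variables and a hand-written training loop, which is exactly the trade-off between the two frameworks discussed above.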
CNNs are a powerful type of deep learning architecture specifically designed for proc-
essing grid-like data, most notably images. Their ability to automatically extract and
learn features from visual inputs has made them a cornerstone of various applica-
tions from image classification and object detection to self-driving cars and medical
image analysis. CNN is well known for image processing. Unlike traditional methods
requiring manual feature engineering, CNNs automatically learn relevant features di-
rectly from the data. This is crucial for complex datasets where manually identifying
features is impractical. CNNs are less sensitive to small shifts in the input image due
to the use of shared weights and pooling. This makes them robust to variations in ob-
ject position and viewpoint. CNNs can be easily scaled to larger and more complex
datasets by adding more layers and increasing the number of filters. Some of their key capabilities and applications are described below:
CNNs excel at recognizing objects, scenes, and activities in images (e.g., classifying handwritten digits and detecting faces in photos). They perform well at locating and classifying objects within an image (e.g., identifying cars, pedestrians, and traffic signs in self-driving car applications). Image segmentation is a very powerful capability of CNNs, dividing an image into different regions corresponding to objects or semantic categories (e.g., segmenting organs in medical images). CNNs are also good at applying the artistic style of one image to another (e.g., creating images that look like they were painted by Van Gogh). Some of the popular CNN architectures include LeNet-5, AlexNet, VGGNet, ResNet, and Inception. Frameworks like TensorFlow and PyTorch offer tools for building and training CNNs. Let's have a look at a CNN model trained in Python:
import pathlib
import PIL.Image
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos.tar',
                                   origin=dataset_url, extract=True)
data_dir = pathlib.Path(data_dir).with_suffix('')
image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)
roses = list(data_dir.glob('roses/*'))
PIL.Image.open(str(roses[0]))
PIL.Image.open(str(roses[1]))
tulips = list(data_dir.glob('tulips/*'))
PIL.Image.open(str(tulips[0]))
PIL.Image.open(str(tulips[1]))
batch_size = 32
img_height = 180
img_width = 180
train_ds = tf.keras.utils.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="training",
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
val_ds = tf.keras.utils.image_dataset_from_directory(
data_dir,
validation_split=0.2,
subset="validation",
seed=123,
image_size=(img_height, img_width),
batch_size=batch_size)
class_names = train_ds.class_names
print(class_names)
plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
for i in range(9):
ax = plt.subplot(3, 3, i + 1)
plt.imshow(images[i].numpy().astype("uint8"))
plt.title(class_names[labels[i]])
plt.axis("off")
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
num_classes = len(class_names)
model = Sequential([
layers.Rescaling(1. / 255, input_shape=(img_height, img_width, 3)),
layers.Conv2D(16, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(32, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(64, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Flatten(),
layers.Dense(128, activation='relu'),
layers.Dense(num_classes)
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.summary()
epochs = 10
history = model.fit(
train_ds,
validation_data=val_ds,
epochs=epochs
)
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs_range = range(epochs)
plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()
data_augmentation = keras.Sequential(
[
layers.RandomFlip("horizontal",
input_shape=(img_height,
img_width,
3)),
layers.RandomRotation(0.1),
layers.RandomZoom(0.1),
]
)
plt.figure(figsize=(10, 10))
for images, _ in train_ds.take(1):
for i in range(9):
augmented_images = data_augmentation(images)
ax = plt.subplot(3, 3, i + 1)
plt.imshow(augmented_images[0].numpy().astype("uint8"))
plt.axis("off")
model = Sequential([
data_augmentation,
layers.Rescaling(1. / 255),
layers.Conv2D(16, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(32, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(64, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Dropout(0.2),
layers.Flatten(),
layers.Dense(128, activation='relu'),
layers.Dense(num_classes, name="outputs")
])
model.compile(optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(
from_logits=True),
metrics=['accuracy'])
model.summary()
epochs = 15
history = model.fit(
train_ds,
validation_data=val_ds,
epochs=epochs
)
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs_range = range(epochs)
plt.figure(figsize=(8, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()
sunflower_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/592px-Red_sunflower.jpg"
sunflower_path = tf.keras.utils.get_file('Red_sunflower', origin=sunflower_url)
img = tf.keras.utils.load_img(
sunflower_path, target_size=(img_height, img_width)
)
img_array = tf.keras.utils.img_to_array(img)
img_array = tf.expand_dims(img_array, 0) # Create a batch
predictions = model.predict(img_array)
score = tf.nn.softmax(predictions[0])
print(
    "This image most likely belongs to {} with a {:.2f} percent confidence."
    .format(class_names[np.argmax(score)], 100 * np.max(score))
)
# TF_MODEL_FILE_PATH refers to a TensorFlow Lite model exported from the trained
# model in a step not shown in this excerpt.
interpreter = tf.lite.Interpreter(model_path=TF_MODEL_FILE_PATH)
interpreter.get_signature_list()
classify_lite = interpreter.get_signature_runner('serving_default')
classify_lite
predictions_lite = classify_lite(sequential_1_input=img_array)['outputs']
score_lite = tf.nn.softmax(predictions_lite)
print(
    "This image most likely belongs to {} with a {:.2f} percent confidence."
    .format(class_names[np.argmax(score_lite)], 100 * np.max(score_lite))
)
print(np.max(np.abs(predictions - predictions_lite)))
===========================Output==========================================
3670
Found 3670 files belonging to 5 classes.
Using 2936 files for training.
2024-02-26 08:25:07.145531: I tensorflow/core/platform/cpu_feature_guard.cc:182] This
TensorFlow binary is optimized to use available CPU instructions in performance-
critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX2 AVX512F
AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate
compiler flags.
0.0 1.0
WARNING:tensorflow:From C:\Users\23188\PycharmProjects\workshop\.venv\Lib\site-
packages\keras\src\layers\pooling\max_pooling2d.py:161: The name tf.nn.max_pool is
deprecated. Please use tf.nn.max_pool2d instead.
WARNING:tensorflow:From C:\Users\23188\PycharmProjects\workshop\.venv\Lib\site-
packages\keras\src\optimizers\__init__.py:309: The name tf.train.Optimizer is depre-
cated. Please use tf.compat.v1.train.Optimizer instead.
Model: "sequential"
Layer (type) Output Shape Param #
===========================================================================
rescaling_1 (Rescaling) (None, 180, 180, 3) 0
conv2d (Conv2D) (None, 180, 180, 16) 448
max_pooling2d (MaxPooling2D) (None, 90, 90, 16) 0
conv2d_1 (Conv2D) (None, 90, 90, 32) 4640
max_pooling2d_1 (MaxPooling2D) (None, 45, 45, 32) 0
conv2d_2 (Conv2D) (None, 45, 45, 64) 18496
max_pooling2d_2 (MaxPooling2D) (None, 22, 22, 64) 0
flatten (Flatten) (None, 30976) 0
dense (Dense) (None, 128) 3965056
dense_1 (Dense) (None, 5) 645
===========================================================================
Total params: 3989285 (15.22 MB)
Trainable params: 3989285 (15.22 MB)
Non-trainable params: 0 (0.00 Byte)
Epoch 1/10
WARNING:tensorflow:From C:\Users\23188\PycharmProjects\workshop\.venv\Lib\site-
packages\keras\src\utils\tf_utils.py:492: The name tf.ragged.RaggedTensorValue is dep-
recated. Please use tf.compat.v1.ragged.RaggedTensorValue instead.
WARNING:tensorflow:From C:\Users\23188\PycharmProjects\workshop\.venv\Lib\site-
packages\keras\src\engine\base_layer_utils.py:384: The name tf.executing_eagerly_out-
side_functions is deprecated. Please use tf.compat.v1.executing_eagerly_outside_func-
tions instead.
Epoch 5/10
92/92 [==============================] - 10s 106ms/step - loss: 0.4579 - accuracy:
0.8372 - val_loss: 0.8921 - val_accuracy: 0.6567
Epoch 6/10
92/92 [==============================] - 10s 108ms/step - loss: 0.2728 - accuracy:
0.9108 - val_loss: 1.0403 - val_accuracy: 0.6635
Epoch 7/10
92/92 [==============================] - 10s 110ms/step - loss: 0.1597 - accuracy:
0.9499 - val_loss: 1.2018 - val_accuracy: 0.6894
Epoch 8/10
92/92 [==============================] - 11s 115ms/step - loss: 0.0928 - accuracy:
0.9751 - val_loss: 1.4531 - val_accuracy: 0.6485
Epoch 9/10
92/92 [==============================] - 12s 126ms/step - loss: 0.0579 - accuracy:
0.9857 - val_loss: 1.4830 - val_accuracy: 0.6635
Epoch 10/10
92/92 [==============================] - 19s 201ms/step - loss: 0.0295 - accuracy:
0.9922 - val_loss: 1.7615 - val_accuracy: 0.6335
===========================================================================
Fig. 7.5: Training and validation accuracy versus training and validation loss.
Fig. 7.5 shows the training and validation accuracy versus the training and validation loss.
Fig. 7.6: Training and validation accuracy versus training and validation loss with 15 epochs.
Fig. 7.6 shows the training and validation accuracy versus the training and validation loss with 15 epochs. The same model is trained for 15 epochs, and the resulting output is shown below:
===========================================================================
Total params: 3989285 (15.22 MB)
Trainable params: 3989285 (15.22 MB)
Nontrainable params: 0 (0.00 Byte)
Epoch 1/15
92/92 [==============================] - 17s 161ms/step - loss: 1.3169 - accuracy:
0.4343 - val_loss: 1.1176 - val_accuracy: 0.5804
Epoch 2/15
92/92 [==============================] - 15s 159ms/step - loss: 1.0208 - accuracy:
0.5947 - val_loss: 0.9498 - val_accuracy: 0.6022
Epoch 3/15
92/92 [==============================] - 16s 173ms/step - loss: 0.9197 - accuracy:
0.6471 - val_loss: 0.9662 - val_accuracy: 0.6444
Epoch 4/15
92/92 [==============================] - 16s 169ms/step - loss: 0.8458 - accuracy:
0.6689 - val_loss: 0.8407 - val_accuracy: 0.6621
Epoch 5/15
92/92 [==============================] - 15s 167ms/step - loss: 0.7843 - accuracy:
0.6996 - val_loss: 0.8002 - val_accuracy: 0.6812
Epoch 6/15
92/92 [==============================] - 15s 160ms/step - loss: 0.7400 - accuracy:
0.7098 - val_loss: 0.7416 - val_accuracy: 0.7112
Epoch 7/15
92/92 [==============================] - 15s 160ms/step - loss: 0.6869 - accuracy:
0.7405 - val_loss: 0.8310 - val_accuracy: 0.7193
Epoch 8/15
92/92 [==============================] - 15s 168ms/step - loss: 0.6741 - accuracy:
0.7486 - val_loss: 0.7771 - val_accuracy: 0.7044
Epoch 9/15
92/92 [==============================] - 15s 162ms/step - loss: 0.6595 - accuracy:
0.7517 - val_loss: 0.7488 - val_accuracy: 0.7166
Epoch 10/15
92/92 [==============================] - 14s 155ms/step - loss: 0.6391 - accuracy:
0.7473 - val_loss: 0.7664 - val_accuracy: 0.7057
Epoch 11/15
92/92 [==============================] - 15s 162ms/step - loss: 0.6100 - accuracy:
0.7698 - val_loss: 0.6865 - val_accuracy: 0.7398
Epoch 12/15
92/92 [==============================] - 17s 183ms/step - loss: 0.5785 - accuracy:
0.7878 - val_loss: 0.7593 - val_accuracy: 0.7180
Epoch 13/15
CNN architecture, or the arrangement and configuration of layers within a CNN, plays
a crucial role in its ability to learn and extract features from images. Some of the core
building blocks of the CNN algorithm are given below:
1. Convolutional layer: The heart of a CNN, responsible for extracting features
through the application of filters (kernels) across the image. Filters slide over the
image, computing dot products to identify local patterns like edges and textures.
2. Pooling layer: Reduces the dimensionality of the data, summarizing the informa-
tion captured by the convolutional layer. Common methods include max pooling
and average pooling. Pooling layers play a crucial role in reducing the dimension-
ality of data while preserving important features. This compression helps manage
computational costs, prevents overfitting, and can even strengthen the feature ex-
traction process. Pooling layers reduce the spatial dimensions (width and height)
of the data, typically by a factor of 2 or more. This results in a smaller output with
fewer elements. By applying different pooling operations, the layer summarizes
the information contained within a specific region of the input data. This region
is often called a pooling window.
3. Activation function: Introduces nonlinearity, allowing the network to learn com-
plex relationships between features. Popular choices include ReLU and Leaky ReLU.
4. Fully connected layer: Typically used in the final stages for tasks like classifica-
tion or regression. Connects all neurons in one layer to all neurons in the next,
integrating the learned features.
5. Inception: Uses filter factorization and parallel processing for efficient feature extraction.
The optimal architecture depends on your specific problem, dataset size, and compu-
tational resources. Experimentation and exploration are key to finding the best con-
figuration for your needs. A few points must be kept in mind before developing any CNN model:
– Number of layers: Deeper networks often learn more complex features, but re-
quire more data and computational resources.
– Filter size and number: Smaller filters capture local details, while larger ones
capture larger features. More filters allow for learning a wider variety of patterns.
– Pooling type and stride: Max pooling identifies dominant features, while average
pooling summarizes information. Stride controls the downsampling rate. Max pool-
ing selects the maximum value within the pooling window. This emphasizes the
strongest activations and can be useful for detecting dominant features like edges.
– Activation function: Choice depends on the task and desired properties (e.g.,
ReLU for efficiency and Leaky ReLU for avoiding dying neurons).
– Batch normalization: Helps stabilize training and improve generalization.
– Regularization techniques: Dropout and L1/L2 regularization prevent overfitting
by reducing model complexity.
– Transfer learning: Utilize pretrained models on large datasets (e.g., ImageNet) as
a starting point for fine-tuning on your specific task.
– Data augmentation: Artificially expand your dataset with variations (e.g., flips
and rotations) to improve generalization.
The dropout layers act as powerful tools for preventing overfitting, a major challenge
where models memorize training data too closely and fail to generalize well to unseen
examples. During training, a dropout layer randomly drops out a certain proportion of
neurons (units) in each layer, effectively turning them off for that training iteration.
These dropped neurons don’t contribute to the calculations or updates in that particular
step. This process forces the remaining neurons to learn more robust and independent
representations of the features, as they can’t rely on the presence of any specific neuron
every time. By preventing neurons from becoming overly reliant on each other, drop-
out encourages them to learn more diverse and generalizable features, improving the
model’s ability to perform well on new data. As different neurons are dropped out in
each training iteration, the model effectively learns an ensemble of slightly different
networks, making it less susceptible to specific noise or patterns in the training data. By
reducing the effective size of the network during training, dropout can sometimes lead to faster convergence and shorter training time, especially for larger models.
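A minimal sketch of how a dropout layer is added in Keras is shown below (the layer sizes and the 50% rate are illustrative assumptions); Keras applies the random dropping only during training and disables it automatically at inference time:

from tensorflow import keras
from tensorflow.keras import layers

# Dropout randomly zeroes half of the previous layer's activations on each
# training step; at inference time the layer passes activations through unchanged
model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])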
7.8 Recurrent Neural Networks (RNNs)
Recurrent neural networks (RNNs) stand out for their ability to process and learn
from sequential data, where elements are ordered and have dependencies. This
makes them ideal for tasks like natural language processing (NLP), speech recogni-
tion, and time series analysis. Unlike traditional neural networks, RNNs have an inter-
nal memory that allows them to retain information from previous inputs and use it to
process the current one. This memory enables them to capture the context and rela-
tionships within sequences as shown in Fig. 7.8.
1. Vanilla RNN: The basic RNN architecture, but can suffer from vanishing and ex-
ploding gradients, limiting its ability to learn long-term dependencies. Vanilla
RNNs struggle with vanishing and exploding gradients, making it difficult to learn
dependencies over long sequences. LSTMs address this by introducing gating
mechanisms that control the flow of information through the network.
2. Long short-term memory (LSTM): Introduces gating mechanisms to control the
flow of information, addressing the gradient issues and enabling learning of lon-
ger dependencies. LSTM networks stand out for their ability to learn and exploit
long-term dependencies within sequential data as shown in Fig. 7.9.
3. Gated recurrent unit (GRU): Similar to LSTM but with simpler architecture and
fewer parameters, offering a balance between performance and efficiency. GRU
emerges as a compelling alternative to LSTMs. While both excel at learning long-
term dependencies within sequential data, GRUs offer a more streamlined archi-
tecture with certain advantages. GRUs aim to provide comparable performance
to LSTMs while being simpler and potentially less computationally expensive.
GRUs combine the Forget and Input gates of LSTMs into a single update gate and
introduce a reset gate to control the flow of information. This reduces the number
of parameters and operations compared to LSTMs, as the short sketch below illustrates.
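A minimal sketch contrasting the two recurrent cells in Keras (the sequence length, feature count, and unit count are placeholder assumptions) shows that swapping one cell for the other is a one-line change, with the GRU ending up with fewer parameters. The longer excerpts that follow apply the same ideas to stock price and airline passenger time series.

from tensorflow import keras
from tensorflow.keras import layers

# Two otherwise identical sequence models over 28 timesteps of 8 features each
lstm_model = keras.Sequential([
    layers.LSTM(32, input_shape=(28, 8)),
    layers.Dense(1),
])
gru_model = keras.Sequential([
    layers.GRU(32, input_shape=(28, 8)),  # same role, fewer parameters
    layers.Dense(1),
])
print(lstm_model.count_params(), gru_model.count_params())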
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

print(os.listdir("../input"))
dataset_train = pd.read_csv('../input/stockprice-train/Stock_Price_Train.csv')
dataset_train.head()
train = dataset_train.loc[:, ['Open']].values  # convert the 'Open' column to a NumPy array
train
from sklearn.preprocessing import MinMaxScaler  # used to scale the values into the 0-1 range
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)
train_scaled
plt.plot(train_scaled)
# Reshaping into (samples, timesteps, features); X_train is assumed to have been
# built from train_scaled with a sliding-window loop that was cut from this excerpt
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
# Initialize RNN (the LSTM and Dense layers added to 'regressor' were cut from this excerpt)
regressor = Sequential()
dataset_test = pd.read_csv('../input/stockprice-test/Stock_Price_Test.csv')
dataset_test.head()
X_test = []
# 'inputs' and 'timesteps' are assumed to come from preprocessing steps that were
# cut from this excerpt
for i in range(timesteps, 70):
    X_test.append(inputs[i-timesteps:i, 0])
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
predicted_stock_price = regressor.predict(X_test)
# inverse_transform maps the scaled predictions back to the original value range
predicted_stock_price = scaler.inverse_transform(predicted_stock_price)
data = pd.read_csv('../input/international-airline-passengers/international-airline-passengers.csv')
data.head()
# 'dataset' is assumed to be the scaled passenger series prepared from 'data' in
# steps that were cut from this excerpt
train_size = int(len(dataset) * 0.5)
test_size = len(dataset) - train_size
train = dataset[0:train_size, :]
test = dataset[train_size:len(dataset), :]
timestemp = 10
# Build (window of length timestemp, next value) pairs from the test split
dataX = []
datay = []
for i in range(len(test) - timestemp - 1):
    a = test[i:(i + timestemp), 0]
    dataX.append(a)
    datay.append(test[i + timestemp, 0])
trainX.shape
# model
model = Sequential()
model.add(LSTM(10, input_shape=(1, timestemp))) # 10 lstm neuron(block)
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainy, epochs=50, batch_size=1)
#make predictions
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainy = scaler.inverse_transform([trainy])
testPredict = scaler.inverse_transform(testPredict)
testy = scaler.inverse_transform([testy])
import math
from sklearn.metrics import mean_squared_error
# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainy[0], trainPredict[:, 0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testy[0], testPredict[:, 0]))
print('Test Score: %.2f RMSE' % (testScore))
# shifting train
trainPredictPlot = np.empty_like(dataset)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[timestemp:len(trainPredict)+timestemp, :] = trainPredict
# shifting test predictions for plotting
testPredictPlot = np.empty_like(dataset)
testPredictPlot[:, :] = np.nan
testPredictPlot[len(trainPredict)+(timestemp*2)+1:len(dataset)-1, :] = testPredict
The seq2seq models stand out for their ability to process and transform one sequence
of data into another. This makes them invaluable for tasks like machine translation,
text summarization, speech recognition, and chatbots, where understanding and gen-
erating sequences are crucial. A seq2seq model has two main components: the encoder-decoder architecture and the attention mechanism. The encoder-decoder architecture consists of two main parts:
– Encoder: Processes the input sequence, capturing its meaning and context. It can
be an RNN like LSTM or GRU, or a transformer-based architecture.
– Decoder: Generates the output sequence, conditioned on the information en-
coded by the encoder. It also uses an RNN or transformer-based architecture,
often with an attention mechanism to focus on relevant parts of the encoded
sequence.
The attention mechanism, on the other hand, is the key component that allows the decoder to selectively attend to different parts of the encoded sequence when generating each element of the output sequence. This helps capture long-range dependencies and improves the accuracy and coherence of the generated output. We can use seq2seq models for translating text from one language to another, considering the context and grammar of both languages, and for generating a concise summary of a longer text document, capturing the main points and overall meaning.
Seq2seq can also be used for converting spoken language into text, taking into account the nuances of pronunciation and context, and for building conversational agents that can understand and respond to user queries in a natural way. It is very useful for generating descriptions of images based on their visual content and for creating musical pieces based on specific styles or themes. Modern seq2seq architectures include transformer-based models like T5 and BART, which rely on advanced attention mechanisms such as self-attention and masked attention and can be trained with different loss functions and training techniques. We can utilize libraries like TensorFlow and PyTorch for building and training seq2seq models. Let's have a look at the Python implementation:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Model

# 'seq2seq_model' is assumed to be an encoder-decoder model built from the layers
# imported above (a sketch of such a definition follows below)
# Compile the model (you may choose an appropriate optimizer and loss function)
seq2seq_model.compile(optimizer='adam', loss='categorical_crossentropy',
                      metrics=['accuracy'])
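The excerpt above shows only the imports and the compile step. A minimal sketch of how the missing encoder-decoder model might be defined is given below; the vocabulary sizes, embedding dimension, and latent dimension are placeholder assumptions, and the compile call from the excerpt then applies to this model.

import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

# Placeholder sizes (assumptions for illustration only)
num_encoder_tokens, num_decoder_tokens = 5000, 5000
embedding_dim, latent_dim = 128, 256

# Encoder: reads the source sequence and keeps only its final states
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(num_encoder_tokens, embedding_dim)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generates the target sequence, initialized with the encoder states
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(num_decoder_tokens, embedding_dim)(decoder_inputs)
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(dec_emb,
                                                initial_state=[state_h, state_c])
decoder_outputs = Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

seq2seq_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)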
The transfer learning emerges as a powerful technique for accelerating and enhanc-
ing the training process, especially for complex tasks and limited data. It involves
leveraging the knowledge gained from a pretrained model on one task (source task)
to improve performance on a related task (target task). Instead of training a model
from scratch on your specific dataset, you utilize a pretrained model that has already
learned valuable representations from a large dataset related to your task. This saves
time and computational resources.
Pretrained models often contain rich feature representations that can be adapted
to your target task, leading to faster convergence and potentially better performance
compared to training from scratch. When you have limited labeled data for your spe-
cific task, transfer learning allows you to leverage the knowledge from a larger data-
set, mitigating the data scarcity issue. You don’t simply copy the entire pretrained
model. Instead, you typically fine-tune its layers, adjusting the weights and biases to-
ward your specific task using your limited data. This balances the benefits of pre-
trained knowledge with the need to adapt to your specific problem.
Transfer learning offers a valuable toolbox for deep learning practitioners, en-
abling faster training, improved performance, and efficient utilization of limited data.
By carefully selecting pretrained models, designing appropriate fine-tuning strategies,
and considering the limitations, you can leverage this technique to unlock the power
of deep learning for your specific tasks. Let's have a look at the following code, which shows the implementation:
import numpy as np
import keras
from keras import layers
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
layer = keras.layers.Dense(3)
layer.build((None, 4)) # Create the weights
print("weights:", len(layer.weights))
print("trainable_weights:", len(layer.trainable_weights))
print("non_trainable_weights:", len(layer.non_trainable_weights))
layer = keras.layers.BatchNormalization()
layer.build((None, 4)) # Create the weights
print("weights:", len(layer.weights))
print("trainable_weights:", len(layer.trainable_weights))
print("non_trainable_weights:", len(layer.non_trainable_weights))
layer = keras.layers.Dense(3)
layer.build((None, 4)) # Create the weights
layer.trainable = False # Freeze the layer
print("weights:", len(layer.weights))
print("trainable_weights:", len(layer.trainable_weights))
print("non_trainable_weights:", len(layer.non_trainable_weights))
# Check that the weights of layer1 have not changed during training
final_layer1_weights_values = layer1.get_weights()
np.testing.assert_allclose(
initial_layer1_weights_values[0], final_layer1_weights_values[0]
)
np.testing.assert_allclose(
initial_layer1_weights_values[1], final_layer1_weights_values[1]
)
inner_model = keras.Sequential(
[
keras.Input(shape=(3,)),
keras.layers.Dense(3, activation="relu"),
keras.layers.Dense(3, activation="relu"),
]
)
model = keras.Sequential(
[
keras.Input(shape=(3,)),
inner_model,
keras.layers.Dense(3, activation="sigmoid"),
]
)
base_model = keras.applications.Xception(
weights='imagenet', # Load weights pre-trained on ImageNet.
input_shape=(150, 150, 3),
include_top=False) # Do not include the ImageNet classifier at the
top.
base_model.trainable = False
model.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(from_logits=True),
metrics=[keras.metrics.BinaryAccuracy()])
# model.fit(new_dataset, epochs=20, callbacks=..., validation_data=...)
# It's important to recompile your model after you make any changes
# to the `trainable` attribute of any inner layer, so that your changes
# are taken into account
model.compile(optimizer=keras.optimizers.Adam(1e-5),
# Very low learning rate
loss=keras.losses.BinaryCrossentropy(from_logits=True),
metrics=[keras.metrics.BinaryAccuracy()])
tfds.disable_progress_bar()
plt.imshow(image)
plt.title(int(label))
plt.axis("off")
augmentation_layers = [
layers.RandomFlip("horizontal"),
layers.RandomRotation(0.1),
]
def data_augmentation(x):
for layer in augmentation_layers:
x = layer(x)
return x
batch_size = 64
train_ds = train_ds.batch(batch_size).prefetch(tf_data.AUTOTUNE).cache()
validation_ds = validation_ds.batch(batch_size).prefetch(tf_data.
AUTOTUNE).cache()
test_ds = test_ds.batch(batch_size).prefetch(tf_data.AUTOTUNE).cache()
base_model = keras.applications.Xception(
    weights="imagenet", input_shape=(150, 150, 3), include_top=False
)  # same arguments as in the earlier call; the rest of this snippet was cut at a page break
model.summary(show_trainable=True)
model.compile(
optimizer=keras.optimizers.Adam(),
loss=keras.losses.BinaryCrossentropy(from_logits=True),
metrics=[keras.metrics.BinaryAccuracy()],
)
epochs = 2
print("Fitting the top layer of the model")
model.fit(train_ds, epochs=epochs, validation_data=validation_ds)
epochs = 1
print("Fitting the end-to-end model")
model.fit(train_ds, epochs=epochs, validation_data=validation_ds)
In the realm of deep learning, pretrained models stand as powerful allies, offering a
significant head start for tackling new tasks. By leveraging their prelearned knowl-
edge, you can accelerate training, enhance performance, and overcome data scarcity
challenges. These are deep neural networks already trained on large, diverse datasets
for general tasks like image recognition or NLP. They act as a foundation upon which
you build further capabilities. This is the core technique where you utilize a pre-
trained model as a starting point for your specific task. It involves:
– Feature extraction: The pretrained model’s earlier layers extract general fea-
tures like edges in images or word embeddings in text.
– Fine-tuning: You adjust the final layers of the pretrained model using your own,
task-specific data to adapt these features to your unique needs.
We can use a pretrained model to leverage its prelearned features, saving time
and computational resources compared to training from scratch. Pretrained models
often capture valuable representations, leading to better results on your task, espe-
cially with limited data. If you lack extensive labeled data for your specific task, pre-
trained models can bridge the gap, boosting your model’s performance. Focus on
building and fine-tuning the final layers, accelerating the development process.
7.12 Fine-Tuning and Feature Extraction
In the domain of transfer learning with pretrained models, both fine-tuning and fea-
ture extraction play crucial roles in adapting the pretrained knowledge to your spe-
cific task. This involves utilizing the pretrained model’s earlier layers as a feature
extractor. These layers, trained on a large, diverse dataset, have learned to capture
general features like edges, textures, and word embeddings. Pass your own input data
through the pretrained model up to a specific layer, typically before the final layers
responsible for the original task (e.g., classification). The advantages are as follows:
– Efficiently extract meaningful features without training from scratch.
– Useful when your own data is limited, leveraging the pretrained model’s
knowledge.
– Can be combined with other feature engineering techniques for further enrichment
Fine-tuning involves adjusting the weights and biases of the pretrained model, typically
in the later layers, using your own task-specific data. This adapts the general features
extracted earlier to your specific problem. Train your model with your data, but only update the weights of the chosen layers (fine-tuning) while keeping the earlier, pretrained layers frozen (not updated), as sketched in the example after the list below. Fine-tuning offers the following advantages:
– Adapts the pretrained features to your specific task, potentially improving
performance.
– More flexible than feature extraction, allowing for more tailored adaptations.
– Requires more data for training compared to feature extraction.
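A minimal sketch contrasting the two approaches is given below; the backbone choice, input size, and learning rates are placeholder assumptions, mirroring the Xception example earlier in the chapter.

from tensorflow import keras

# Shared backbone pretrained on ImageNet
base = keras.applications.Xception(weights="imagenet", include_top=False,
                                   pooling="avg", input_shape=(150, 150, 3))

# 1) Feature extraction: freeze the backbone and train only a new head
base.trainable = False
inputs = keras.Input(shape=(150, 150, 3))
features = base(inputs, training=False)  # keep BatchNorm layers in inference mode
outputs = keras.layers.Dense(1)(features)
model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(),
              loss=keras.losses.BinaryCrossentropy(from_logits=True))
# model.fit(new_dataset, epochs=20)

# 2) Fine-tuning: unfreeze the backbone and recompile with a very small learning rate
base.trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss=keras.losses.BinaryCrossentropy(from_logits=True))
# model.fit(new_dataset, epochs=5)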
Generative adversarial networks (GANs) stand out for their ability to create realistic and diverse data, paving the way for exciting applications in image generation, text creation, and more. This section delves into their inner workings, explores their capabilities, and discusses key considerations for using them effectively. GANs are a powerful but complex tool. Understanding their core concepts, strengths, limitations, and best practices is crucial for successful implementation and achieving your desired generative outcomes.
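Since GANs are described here only in prose, a minimal sketch of the two competing networks may help; the layer sizes, flattened 28x28 image size, and latent dimension are placeholder assumptions, and the full adversarial training loop is omitted.

import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 100  # size of the random noise vector (assumption)

# Generator: maps random noise to a flattened 28x28 "image"
generator = tf.keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(784, activation="tanh"),
])

# Discriminator: scores how "real" a flattened 28x28 image looks
discriminator = tf.keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),
    layers.Dense(1),  # raw logit; larger means "looks more real"
])

# One adversarial step (sketch): the generator tries to fool the discriminator
noise = tf.random.normal([16, latent_dim])
fake_images = generator(noise)
fake_scores = discriminator(fake_images)
print(fake_scores.shape)  # (16, 1)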
A commonly used loss function for regression networks is the mean squared error:

$$\text{Mean squared error} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
Regularization and optimization play intertwined roles in steering the training process toward a successful outcome. Let's delve into the nuances of each and how they work together to create robust and effective models. Regularization is used to prevent overfitting and to help the model generalize well. Imagine a neural network memorizing every training example perfectly. While this might seem ideal, such a model would perform poorly on unseen data because it has memorized the training set rather than learned generalizable patterns.
Regularization can make the optimization landscape smoother, with fewer local min-
ima, making it easier for optimization algorithms to find the global minimum. Optimi-
zation algorithms can influence the effectiveness of regularization. For example,
using a learning rate that is too high can negate the benefits of regularization. There’s
no one-size-fits-all approach. The best combination of regularization and optimization
techniques depends on your specific problem, data, and computational resources. Ex-
perimentation is crucial. Let's have a look at the linear regression code given below:
import mglearn as ml
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from numpy import genfromtxt

dataset = genfromtxt('https://raw.githubusercontent.com/m-mehdi/tutorials/main/boston_housing.csv', delimiter=',')
X = dataset[:, :-1]
y = dataset[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25, random_state=0)
lr = LinearRegression().fit(X_train, y_train)
print(f"Linear Regression-Training set score: {lr.score(X_train, y_train):.2f}")
# The test-set score shown in the output below comes from a line cut at the page
# break; it would be:
print(f"Linear Regression-Test set score: {lr.score(X_test, y_test):.2f}")
==================================Output===================================
Linear Regression-Training set score: 0.95
Linear Regression-Test set score: 0.61
===========================================================================
Try different techniques and combinations to find what works best for your task.
Monitor your model’s performance on both training and validation data to avoid
overfitting. By understanding the roles of regularization and optimization, you can
make informed decisions to train deep learning models that are both accurate and
generalizable, performing well on unseen data and avoiding the pitfalls of overfitting.
Remember, the journey to optimal performance often involves iterative experimenta-
tion and careful tuning of these essential elements. Some of the commonly used opti-
mization algorithms are:
– Gradient descent (GD): A classic algorithm that iteratively updates weights in the direction that decreases the loss, taking small steps at each iteration. Gradient descent is an optimization algorithm; the gradient is the slope of the loss function, and the steeper the slope, the faster the model can learn. It is used to minimize the cost and increase the accuracy of the model. Think of climbing down a hill: instead of jumping down at once, we take small steps. Starting from a point 'a', we walk down slowly, repeatedly updating the position x until we reach the bottom, which corresponds to the lowest value of the cost. A tiny numeric sketch follows the list below.
– Momentum GD: Similar to GD but incorporates momentum, accumulating past
gradients to accelerate convergence toward the minimum.
– Adam: An adaptive learning rate optimization algorithm that automatically ad-
justs the learning rate for each weight based on its past updates, often leading to
faster convergence.
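Here is a tiny numeric sketch of plain gradient descent; the function, starting point, and learning rate are arbitrary illustrations.

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x = 10.0             # starting point ("top of the hill")
learning_rate = 0.1
for step in range(50):
    grad = 2 * (x - 3)
    x = x - learning_rate * grad  # take a small step against the gradient
print(x)  # converges close to the minimum at x = 3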
import mglearn as ml
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from numpy import genfromtxt

dataset = genfromtxt('https://raw.githubusercontent.com/m-mehdi/tutorials/main/boston_housing.csv', delimiter=',')
X = dataset[:, :-1]
y = dataset[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25, random_state=0)
from sklearn.linear_model import Ridge
# The Ridge fit and score printing were cut at the page break; with the default
# alpha they would look like this (the alpha value is an assumption):
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print(f"Ridge Regression-Training set score: {ridge.score(X_train, y_train):.2f}")
print(f"Ridge Regression-Test set score: {ridge.score(X_test, y_test):.2f}")
===================================Output==================================
Ridge Regression-Training set score: 0.90
Ridge Regression-Test set score: 0.76
===========================================================================
import mglearn as ml
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split
from numpy import genfromtxt

dataset = genfromtxt('https://raw.githubusercontent.com/m-mehdi/tutorials/main/boston_housing.csv', delimiter=',')
X = dataset[:, :-1]
y = dataset[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25, random_state=0)
# The Lasso fit and score printing were cut from the excerpt; with the default
# alpha they would look like this (the alpha value is an assumption):
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print(f"Lasso Regression-Training set score: {lasso.score(X_train, y_train):.2f}")
print(f"Lasso Regression-Test set score: {lasso.score(X_test, y_test):.2f}")
===============================Lasso Regression=============================
Lasso Regression-Training set score: 0.29
Lasso Regression-Test set score: 0.21
===========================================================================
7.15 Batch Normalization
The batch normalization is a technique that stabilizes training and speeds up conver-
gence, leading to better and faster-trained neural networks. BatchNorm introduces an
additional normalization step within each layer of your deep neural network. It nor-
malizes the activations (outputs) of each layer to have a mean of 0 and a standard
deviation of 1 across the current batch of samples. Let's have a look at Fig. 7.10.
[Fig. 7.10: Batch Norm, applied over the H, W dimensions of a batch.]

Each activation is normalized as $\hat{X}_i = (X_i - \text{Mean}_i)/\text{StdDev}_i$. More precisely, for a mini-batch $B = \{x_1, \dots, x_m\}$:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta = \mathrm{BN}_{\gamma,\beta}(x_i)$$
BatchNorm is a valuable tool for training deep neural networks, offering faster con-
vergence, improved performance, and reduced sensitivity to initialization. By under-
standing its core principles, benefits, and considerations, you can effectively leverage
BatchNorm to achieve better results in your deep learning projects. The working mechanism of BatchNorm is given below; a minimal Keras usage sketch follows the lists.
– Calculate mean and standard deviation: For each layer and each batch of data,
BatchNorm computes the mean and standard deviation of the activations across
that batch.
– Normalize activations: Each activation in the batch is then subtracted by the
mean and divided by the standard deviation.
– Scale and shift: To preserve information, the normalized activations are multi-
plied by learned scale and shift parameters (gamma and beta), allowing the net-
work to adapt to the normalization step.
1. Batch size: BatchNorm relies on accurate statistics within each batch, so the batch size should be large enough for the batch mean and variance to be reliable estimates.
2. Hyperparameter tuning: Tuning the learning rate and adjusting the scale and
shift parameters can be crucial for optimal performance.
3. Minibatch statistics: BatchNorm uses statistics from the current minibatch,
which can be an approximation of the population statistics. This might lead to
slight inconsistencies during training and inference.
4. Alternative normalization techniques: Other normalization techniques like
layer normalization and group normalization exist, offering different trade-offs
and potentially better performance in specific scenarios.
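A minimal Keras usage sketch is given below (layer sizes and input shape are placeholder assumptions); the BatchNormalization layer performs the normalize, scale, and shift steps described above.

from tensorflow import keras
from tensorflow.keras import layers

# Normalization is applied to the Dense layer's outputs before the nonlinearity
model = keras.Sequential([
    layers.Dense(128, input_shape=(20,)),
    layers.BatchNormalization(),  # normalize, then apply learned gamma and beta
    layers.Activation("relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")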
Exploding gradients are a related training hazard: as the gradient is propagated backward through the network, it can accumulate and become significantly larger at each layer. This can
happen due to factors like vanishing gradients in shallower layers, leading to ampli-
fied gradients in deeper layers. Gradient clipping sets a maximum threshold for the
magnitude of gradients. Any gradient value exceeding this threshold is “clipped”
down to the threshold, effectively preventing it from growing further.
Gradient clipping is a valuable tool for deep learning practitioners, providing a safety
net against exploding gradients and aiding in stable and efficient training. By under-
standing its principles, benefits, and considerations, you can effectively implement
gradient clipping to enhance the performance and robustness of your deep learning
models. The three types of gradient clipping approaches available are:
1. Global clipping: Clips all gradients to a single maximum value.
2. Layer-wise clipping: Clips gradients independently for each layer, allowing for
customization based on layer sensitivity.
3. Norm-based clipping: Clips gradients based on their norm (length), ensuring
they stay within a specific radius.
Let's dive deeper into gradient clipping with the following Python code (the clipping step itself is sketched right after the code).
import tensorflow as tf
from tensorflow.keras import Model, layers
import numpy as np
import tensorflow_datasets as tfds
# Hyperparameters
num_classes = 10  # total classes (0-9 digits).
num_features = 784  # data features (img shape: 28*28).
# Training Parameters
learning_rate = 0.001
training_steps = 1000
batch_size = 32
display_step = 100
# Network Parameters
# MNIST image shape is 28*28px; we will then handle 28 sequences of 28
# timesteps for every sample.
num_input = 28  # number of sequences.
timesteps = 28  # timesteps.
num_units = 32  # number of neurons for the LSTM layer.
print(tf.__version__)
import neptune
run = neptune.init_run(project='common/tf-keras-integration',
api_token='ANONYMOUS')
# Cross-Entropy Loss.
# Note that this will apply 'softmax' to the logits.
def cross_entropy_loss(x, y):
# Convert labels to int 64 for tf cross-entropy function.
y = tf.cast(y, tf.int64)
# Apply softmax to logits and compute cross-entropy.
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
logits=x)
# Average loss across the batch.
return tf.reduce_mean(loss)
# Accuracy metric.
def accuracy(y_pred, y_true):
# Predicted class is the index of the highest score in the prediction vector (i.e. argmax).
correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true,
tf.int64))
return tf.reduce_mean(tf.cast(correct_prediction, tf.float32), axis=-1)
# Adam optimizer.
optimizer = tf.optimizers.Adam(learning_rate)
# Compute loss.
loss = cross_entropy_loss(pred, y)
# Compute gradients.
gradients = tape.gradient(loss, trainable_variables)
# Run training for the given number of steps ('train_data', 'lstm_net', and
# 'run_optimization' are defined in parts of the example that are not shown here).
for step, (batch_x, batch_y) in enumerate(train_data.take(training_steps), 1):
    # Run the optimization to update W and b values.
    run_optimization(batch_x, batch_y)
    if step % display_step == 0:
        pred = lstm_net(batch_x, is_training=True)
        loss = cross_entropy_loss(pred, batch_y)
        acc = accuracy(pred, batch_y)
        run['monitoring/logs/loss'].log(loss)
        run['monitoring/logs/acc'].log(acc)
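The excerpt above computes gradients but does not show the clipping step itself. A minimal self-contained sketch of norm-based clipping is given below; the toy variable, data, and the 1.0 threshold are illustrative assumptions.

import tensorflow as tf

# Norm-based gradient clipping inside a custom training step
w = tf.Variable(2.0)
x, y = tf.constant(3.0), tf.constant(1.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

with tf.GradientTape() as tape:
    loss = tf.square(w * x - y)
grads = tape.gradient(loss, [w])
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)  # rescale if the norm exceeds 1.0
optimizer.apply_gradients(zip(clipped, [w]))

# Built-in alternative: let a Keras optimizer clip every update itself
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)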
Some of the best practices of gradient clipping approaches are given below:
– Clipping threshold: Choosing the right clipping threshold is crucial. A too-high
threshold might not be effective, while a too-low threshold might affect training
speed.
– Impact on performance: While gradient clipping can be beneficial, it might
slightly reduce the final accuracy of the model in some cases.
– Experimentation and evaluation: Experiment with different clipping approaches
and thresholds to find what works best for your specific model and task.
– Alternative techniques: Other techniques like gradient normalization and adap-
tive learning rates can also help address exploding gradients.
Summary
The ANN is inspired by the human brain, and neural networks are interconnected
layers of artificial neurons that process information. Each neuron performs a sim-
ple computation, and the connections between them determine the overall behavior
of the network. These networks learn by adjusting the connections between neurons
based on training data. The perceptrons and activation functions serve as fundamen-
tal elements in constructing neural networks, paving the way for complex learning
and decision-making capabilities. Let’s delve into their individual roles and how they
work together. Imagine a perceptron as a simple neuron-like structure that receives
input signals, processes them, and generates an output. It’s the building block of neu-
ral networks, responsible for basic computations and information flow. TensorFlow is
a powerful and versatile open-source software library, primarily used for developing
and deploying machine learning and deep learning models. It provides a flexible
and efficient platform for various tasks from image recognition and NLP to self-
driving cars and scientific computing. Keras is a high-level API for building and
training deep learning models, developed and maintained by Google. It sits on top
of powerful libraries like TensorFlow and Theano, providing a simpler and more
user-friendly interface for deep learning tasks. Here’s what you need to know about
Keras.
CNNs have emerged as champions in the field of image and video analysis. Their
unique architecture, inspired by the human visual system, allows them to excel at
tasks like image recognition, object detection, and video classification. Let’s delve into
the core concepts and capabilities of CNNs. CNNs play a crucial role in image process-
ing, excelling in various tasks due to their unique architecture inspired by the human
visual system. Let’s delve deeper into how CNNs contribute to image processing and
explore specific applications. In the fascinating world of image processing, CNN ar-
chitecture serves as the blueprint for CNNs, dictating the flow of information and en-
abling them to excel at tasks like image classification, object detection, and image
segmentation.
The pooling and dropout layers play distinct yet essential roles in boosting per-
formance and preventing overfitting. Here’s a breakdown of their individual function-
alities and how they work together. RNNs emerge as a powerful tool for processing
sequential data, where the order of elements matters. Unlike traditional neural net-
works that handle independent inputs, RNNs introduce a key difference compared to
feedforward neural networks: internal memory. This memory allows them to retain
information from previous inputs and use it to process the current input, enabling
them to capture dependencies and context within sequential data. The sequential
data emerges as a unique type of information where the order of elements is cru-
cial. Unlike independent data points, understanding sequential data requires consid-
ering the past and anticipating the future within its inherent structure. Imagine a
sentence, a song, a stock market timeline, or a video clip. These all represent sequen-
tial data, where each element (word, note, price point, frame) carries intrinsic mean-
ing and influences those that follow. Processing sequential data effectively requires
methods that capture these dependencies and context.
The LSTM and GRU networks stand out as powerful tools for handling sequen-
tial data, where order matters. Both are special types of RNNs, designed to overcome
a major limitation of traditional RNNs: the vanishing gradient problem. This problem
hinders their ability to learn long-term dependencies within sequences. Seq2seq models bridge the gap between different kinds of sequences: they are a powerful and versatile tool for tasks involving the transformation of one sequence
of data into another. From translating languages to generating captions for images,
these models excel at capturing the relationships and context within sequences, en-
abling them to perform various impressive tasks. Transfer learning builds on existing knowledge for faster progress: it is a powerful technique that allows you to leverage knowledge gained from one task to improve per-
formance on a related one. Imagine training a dog to fetch a specific toy; you
wouldn’t start from scratch each time it encounters a new object. Similarly, transfer
learning enables models to “remember” what they’ve learned previously and adapt it
to new situations, significantly accelerating the learning process and improving re-
sults. Fine-tuning involves modifying the weights of a pretrained model to adapt it
to a new task. Typically, the lower layers of the model, which capture general fea-
tures, are frozen, while the higher layers, responsible for more specific learning, are
fine-tuned on your own dataset.
The generative adversarial networks (GANs) stand out as a powerful and fasci-
nating technique for generating new data, like images, text, or music. Imagine creat-
ing realistic portraits of people who never existed, or composing music in the style of
your favorite artist – that’s the kind of magic GANs can achieve!
Vanilla GAN: The original GAN architecture, with separate generator and dis-
criminator networks.
Deep convolutional GAN (DCGAN): Leverages convolutional layers in both the
generator and discriminator, particularly effective for image generation.
Wasserstein GAN (WGAN): Improves training stability by using a different loss
function and gradient penalty.
Generative adversarial networks with gradient penalty (GAN-GP): Combines
aspects of DCGAN and WGAN for improved stability and performance.
StyleGAN: Utilizes style transfer techniques to generate highly diverse and realis-
tic images.
The regularization and optimization are two fundamental techniques that
work hand-in-hand to improve the performance and generalizability of your models.
The batch normalization (BatchNorm) emerges as a powerful technique that im-
proves the training speed and stability of neural networks. It acts like a magic in-
gredient, smoothing the training process and often leading to better performance.
The gradient clipping emerges as a crucial technique for preventing exploding gra-
dients, a phenomenon that can hinder the training process and lead to unstable or
even diverging models. Imagine training a dog: you wouldn’t pull too hard on the
leash, as it could make them resist or even run away. Similarly, gradient clipping
helps you “control” the learning process by setting reasonable limits on the changes
made to your model’s weights.
Exercise (MCQs)
1. In a convolutional neural network (CNN), what does the “pooling layer” do?
A) Normalizes activations in the previous layer
B) Reduces the dimensionality of the feature maps
C) Performs element-wise multiplication with a specific kernel
D) Learns nonlinear relationships between features
2. Which of the following is NOT a valid activation function for a neural net-
work?
A) Sigmoid B) ReLU C) Softmax D) Linear
3. What is the main difference between gradient descent and Adam, two popu-
lar optimization algorithms for neural networks?
A) Adam uses adaptive learning rates, while gradient descent has a fixed rate.
B) Adam is faster for large datasets, while gradient descent is better for small
datasets.
C) Adam requires less tuning of hyperparameters compared to gradient descent.
D) Adam is more prone to overfitting than gradient descent.
5. What are the main challenges associated with training generative adversar-
ial networks (GANs)?
A) Selecting the right architecture for both the generator and discriminator
B) Ensuring stable training and avoiding mode collapse
C) Evaluating the quality of generated data effectively
D) All of the above
C) Early stopping
D) Data augmentation with label smoothing
7. How can you interpret the weights of a trained neural network to under-
stand what it has learned?
A) By directly analyzing the weight values
B) Using visualization techniques like saliency maps
C) By performing feature inversion to reconstruct input data
D) All of the above
9. What are the ethical considerations involved in using deep learning models,
especially those trained on large datasets?
A) Bias and fairness in decision-making
B) Explainability and interpretability of model predictions
C) Privacy concerns and data security
D) All of the above
10. What are the latest advancements and research directions in the field of
deep learning?
A) Explainable AI (XAI) for interpretable models
B) Continual learning for adapting to new data streams
C) Transformers for natural language processing and beyond
D) Neuromorphic computing for more efficient hardware
Answer Key
Answers
1. Neural networks
2. Perceptrons
3. Pooling layer
4. BatchNorm
5. Maximum threshold
Chapter 8
Specialized Applications and Case Studies
8.1 Introduction to Natural Language Processing (NLP)
Natural language processing (NLP) is a field of artificial intelligence (AI) that deals with the interaction between computers and human language. Its goal is to enable computers to understand, interpret, and manipulate natural language in a way that is similar to how humans do. Its primary goal is to enable computers to understand, interpret, and generate human language in a manner that is both meaningful and useful. This includes written text, spoken language, and even sign language. NLP techniques often involve a combination of statistical models, machine learning algorithms, and linguistic rules to process and understand human language effectively. Recent advancements in deep learning, particularly with models like Transformers, have led to significant improvements in various NLP tasks, enabling more accurate and versatile language processing capabilities. Let’s write some basic Python code that uses the NLTK library together with a few of its packages. You can install NLTK with the pip install nltk command in a terminal and then execute the following code:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')  # download the tokenizer models (needed once)
text = """Natural Language Processing (NLP) is a field of artificial intelligence (AI) that deals
with the interaction between computers and human language. Its goal is to enable computers
to understand, interpret, and manipulate natural language in a way that is similar to how
humans do. Its primary goal is to enable computers to understand, interpret, and generate
human language in a manner that is both meaningful and useful. This includes written text,
spoken language, and even sign language."""
sentences = sent_tokenize(text)  # split the text into sentences
print(sentences)
words = word_tokenize(text)      # split the text into word tokens
print(words)
===============================Output======================================
['Natural Language Processing (NLP) is a field of artificial intelligence (AI) that deals
\nwith the interaction between computers and human language.', 'Its goal is to enable
computers \nto understand, interpret, and manipulate natural language in a way that
is similar to how \nhumans do.', 'Its primary goal is to enable computers to under-
stand, interpret, and generate \nhuman language in a manner that is both meaningful
and useful.', 'This includes written text, \nspoken language, and even sign language.']
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'artificial', 'intelli-
gence', '(', 'AI', ')', 'that', 'deals', 'with', 'the', 'interaction', 'between', 'computers', 'and',
'human', 'language', '.', 'Its', 'goal', 'is', 'to', 'enable', 'computers', 'to', 'understand', ',', 'in-
terpret', ',', 'and', 'manipulate', 'natural', 'language', 'in', 'a', 'way', 'that', 'is', 'similar',
'to', 'how', 'humans', 'do', '.', 'Its', 'primary', 'goal', 'is', 'to', 'enable', 'computers', 'to', 'un-
derstand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', 'in', 'a', 'manner',
'that', 'is', 'both', 'meaningful', 'and', 'useful', '.', 'This', 'includes', 'written', 'text', ',', 'spo-
ken', 'language', ',', 'and', 'even', 'sign', 'language', '.']
===========================================================================
The above code tokenizes the text into sentences and words. NLP is a rapidly growing field with the potential to revolutionize the way we interact with computers and information. NLP techniques can analyze the structure and semantics of language to understand the intended meaning behind words and sentences. NLP can be used to create text, translate languages, and even write different kinds of creative content. NLP can be used in fields such as the following:
1. Machine translation: Automatically translating text from one language to another. This includes tasks like language translation, language detection, and multilingual text analysis. Word embeddings can improve the quality of machine translation systems by encoding source and target language words.
2. Chatbots and virtual assistants: Understand and respond to user queries in a
natural way.
3. Sentiment analysis: Determining the sentiment or opinion expressed in a piece
of text, whether it’s positive, negative, or neutral. This is useful for analyzing cus-
tomer feedback, social media sentiment, and product reviews. Word embeddings can enhance the performance of sentiment analysis models by representing words in a continuous vector space, capturing subtle semantic nuances.
4. Text summarization and classification: Summarizing and classifying large amounts of text data into a concise and informative format. Summarization generates concise summaries of longer texts while preserving the key information and main points. This is useful for quickly extracting important information from large documents or articles.
5. Automatic language identification and understanding: Identifying the language a piece of text is written in, and understanding the meaning and intent behind user queries or commands. The latter involves tasks like intent recognition, slot filling, and dialogue management in conversational systems.
6. Speech recognition and text-to-speech: Converting speech to text and vice versa. Speech recognition, which converts spoken language into text, is the technology behind virtual assistants like Siri, Alexa, and Google Assistant.
7. Named entity recognition (NER): Identifying and classifying named entities
mentioned in text into predefined categories such as names of persons, organiza-
tions, and locations.
8. Text generation: Generating human-like text based on given input or prompts.
This can be used for various applications such as chatbots, content generation,
and dialogue systems.
9. Question answering: Automatically answering questions posed in natural lan-
guage based on a given context or knowledge base. This includes tasks like read-
ing comprehension and FAQ systems.
10. Text mining: Extracting useful insights and patterns from large volumes of text
data. This includes techniques such as text clustering, topic modeling, and trend
analysis.
11. Part-of-speech (POS) tagging: Assigning a grammatical category (noun, verb, adjective, and so on) to each token. It helps improve the performance of sequence labeling tasks, for example by incorporating word embeddings as features in machine learning models. Let’s have a look at the following code:
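This is a minimal sketch that assumes the words list produced by the earlier word_tokenize call; the averaged_perceptron_tagger resource must be downloaded once:

import nltk
nltk.download('averaged_perceptron_tagger')  # POS tagger model, needed once
pos_tags = nltk.pos_tag(words)  # tag the word tokens produced earlier
print(pos_tags)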
===================================Output==================================
[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is',
'VBZ'), ('a', 'DT'), ('field', 'NN'), ('of', 'IN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('(', '('), ('AI',
'NNP'), (')', ')'), ('that', 'IN'), ('deals', 'NNS'), ('with', 'IN'), ('the', 'DT'), ('interaction', 'NN'), ('be-
tween', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('language', 'NN'), ('.', '.'), ('Its',
'PRP$'), ('goal', 'NN'), ('is', 'VBZ'), ('to', 'TO'), ('enable', 'JJ'), ('computers', 'NNS'), ('to', 'TO'),
('understand', 'VB'), (',', ','), ('interpret', 'VB'), (',', ','), ('and', 'CC'), ('manipulate', 'VB'), ('natu-
ral', 'JJ'), ('language', 'NN'), ('in', 'IN'), ('a', 'DT'), ('way', 'NN'), ('that', 'WDT'), ('is', 'VBZ'), ('sim-
ilar', 'JJ'), ('to', 'TO'), ('how', 'WRB'), ('humans', 'NNS'), ('do', 'VBP'), ('.', '.'), ('Its', 'PRP$'),
('primary', 'JJ'), ('goal', 'NN'), ('is', 'VBZ'), ('to', 'TO'), ('enable', 'JJ'), ('computers', 'NNS'), ('to',
'TO'), ('understand', 'VB'), (',', ','), ('interpret', 'VB'), (',', ','), ('and', 'CC'), ('generate', 'VB'),
('human', 'JJ'), ('language', 'NN'), ('in', 'IN'), ('a', 'DT'), ('manner', 'NN'), ('that', 'WDT'), ('is',
'VBZ'), ('both', 'DT'), ('meaningful', 'JJ'), ('and', 'CC'), ('useful', 'JJ'), ('.', '.'), ('This', 'DT'), ('in-
cludes', 'VBZ'), ('written', 'VBN'), ('text', 'NN'), (',', ','), ('spoken', 'JJ'), ('language', 'NN'), (',', ','),
('and', 'CC'), ('even', 'RB'), ('sign', 'JJ'), ('language', 'NN'), ('.', '.')]
===========================================================================
12. Semantic similarity: Computing the similarity between words or phrases based
on the similarity of their embeddings. This is useful for tasks such as information
retrieval, recommendation systems, and question answering.
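As a brief, hedged sketch of semantic similarity with word embeddings (assuming gensim 4.x is installed; the tiny corpus below is illustrative only, so the similarity scores are not meaningful):

from gensim.models import Word2Vec

# Tiny toy corpus of pre-tokenized sentences (illustrative only)
sentences = [
    ["nlp", "helps", "computers", "understand", "language"],
    ["word", "embeddings", "capture", "semantic", "meaning"],
    ["computers", "process", "language", "with", "embeddings"],
]

# Train a small Word2Vec model; vector_size is the embedding dimension
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Cosine similarity between two words from the corpus
print(model.wv.similarity("language", "embeddings"))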
NLP uses various machine learning techniques, such as deep learning, to analyze and
process language data. These methods are used to analyze the patterns and relation-
ships between different elements of language. NLP incorporates knowledge of gram-
mar, syntax, semantics, and pragmatics to understand the nuances of human language.
NLP can help break down language barriers and facilitate communication between
people and machines. NLP can automate tasks that currently require human interven-
tion such as analyzing customer feedback or translating documents. NLP can help or-
ganizations make better decisions by providing insights from large amounts of textual
data. Like every coin, however, NLP has two sides: alongside its many positives, it still faces a few challenges. Some of them are given below:
1. Language ambiguity: Human language is often ambiguous and can be inter-
preted in different ways.
2. Context dependency: The meaning of a word or phrase can depend on the con-
text in which it is used.
3. Constant evolution of language: Language is constantly evolving, which can
pose challenges for NLP systems to keep up.
8.1.1 Tokenization
Tokenization is a fundamental task in NLP that involves breaking down a text into smaller units called tokens. These tokens could be individual words, subwords, characters, sentences, or even phrases, depending on the specific requirements of the task at hand and the chosen technique. The process of tokenization plays a crucial role in many NLP tasks because it forms the basis for further analysis and processing.
Computers cannot directly understand the meaning of continuous text. Tokenization
transforms it into a format that machines can process and analyze. Tokenization lays
the groundwork for various NLP tasks like sentiment analysis, machine translation,
and text classification. Tokenization also comes with some challenges, some of which are given below:
– Handling contractions and special characters: NLP systems need to decide
whether to split contractions (e.g., “don’t”) and how to handle special characters
like emojis or symbols.
– Language-specific challenges: Different languages have varying sentence struc-
tures and punctuation marks, requiring specific tokenization rules for each
language.
– Context-dependent words: Words like “to” or “a” can have different meanings
depending on the context, posing a challenge for tokenization algorithms.
By breaking down complex text structures, tokenization allows NLP models to focus
on the essential information and perform these tasks more efficiently and accurately.
Tokenization is typically one of the initial steps in the NLP pipeline, and it serves as
the foundation for subsequent tasks such as part-of-speech tagging, named entity rec-
ognition, syntactic parsing, and semantic analysis. The choice of tokenization strategy
depends on the specific requirements of the NLP task, the language being processed,
and the characteristics of the text data. We can say that tokenization is a crucial first
step in NLP tasks, preparing textual data for further processing and analysis by ma-
chines. We can categorize tokenization based upon its application and usage. Some of
them are given below:
1. Word tokenization: This splits the text into individual words, typically at whitespace and punctuation boundaries.
2. Sentence tokenization: This splits the text at the sentence level. For example, the paragraph “NLP is a fascinating field.
It involves the interaction between computers and humans through natural lan-
guage.” would be tokenized into [“NLP is a fascinating field.”, “It involves the in-
teraction between computers and humans through natural language.”]. This
divides the text into individual sentences at full stops, exclamation marks, or
question marks.
3. Subword tokenization: Subword tokenization breaks down words into smaller
meaningful units called subwords or morphemes. This is particularly useful for
handling out-of-vocabulary words and dealing with languages with complex mor-
phology. Techniques like byte pair encoding and WordPiece are commonly used
for subword tokenization.
4. Character tokenization: In character tokenization, each character in the text be-
comes a separate token. This approach is useful when analyzing text at a very
fine-grained level or when dealing with languages with complex scripts. This
breaks down the text into individual characters, which is useful for certain tasks
like spelling correction or character-level language models.
5. Phrasal tokenization: Phrasal tokenization involves grouping consecutive words
into phrases or chunks based on predefined rules or patterns. This can be useful
for tasks like named entity recognition or chunking.
6. N-gram tokenization: This creates sequences of n consecutive words, useful for
tasks like language modeling and machine translation.
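As a small, hedged illustration of item 6, n-grams can be produced from word tokens with NLTK’s ngrams utility (the sample sentence is illustrative only):

from nltk import word_tokenize
from nltk.util import ngrams

tokens = word_tokenize("NLP is a fascinating field")  # requires the punkt resource
bigrams = list(ngrams(tokens, 2))  # n = 2 gives consecutive word pairs
print(bigrams)  # [('NLP', 'is'), ('is', 'a'), ('a', 'fascinating'), ('fascinating', 'field')]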
Word embeddings are a type of word representation in NLP that aims to capture the
semantic meaning of words in a continuous vector space. Traditional methods of repre-
senting words, such as one-hot encoding or bag-of-words, lack the ability to capture se-
mantic relationships between words and often result in high-dimensional and sparse
representations. Word embeddings address these limitations by representing words as
dense vectors in a continuous vector space, where similar words are mapped to nearby
points. These embeddings are learned from large corpora of text using unsupervised or
semi-supervised techniques, such as neural network models. Word embeddings are a
powerful technique for representing words as numerical vectors. These vectors capture
the semantic meaning and relationships between words, allowing machines to under-
stand language nuances beyond just the individual words themselves. Word embed-
dings are a fundamental building block for many NLP tasks. By encoding semantic
meaning and relationships, they empower machines to understand and process lan-
guage in a way that is closer to how humans do. Two popular algorithms for generating word embeddings are:
1. Word2Vec: Word2Vec is a popular algorithm that learns word embeddings by an-
alyzing the co-occurrence of words in a sentence or surrounding context. It was developed by researchers at Google and learns the embeddings with a shallow neural network.
2. Text classification: Categorize text data into specific groups based on the
meaning encoded in the word embeddings.
3. Question answering: Answer questions accurately by understanding the
context and relationships between words in the query and the text data.
fewer parameters and are computationally more efficient, making them suitable
for applications with limited computational resources.
4. Transformer models: Transformers are a recent advancement in sequence
modeling that have gained widespread popularity in NLP. They rely on self-
attention mechanisms to capture global dependencies between input and output
sequences. Transformers consist of an encoder-decoder architecture, where the
encoder processes the input sequence and the decoder generates the output se-
quence. Models like BERT (bidirectional encoder representations from transform-
ers) and GPT (generative pretrained transformer) have achieved state-of-the-art
performance on various NLP tasks by leveraging transformer architectures.
These models use an attention mechanism to focus on the most relevant parts of
the sequence when processing each element, overcoming limitations of RNNs in
handling long sequences.
5. Convolutional neural networks (CNNs): While primarily used for image proc-
essing, CNNs can also be applied to sequence modeling tasks in NLP. CNNs oper-
ate on fixed-size input windows and learn hierarchical feature representations
by applying convolutional filters across the input sequence. They are particularly
effective for tasks like text classification and sentiment analysis.
Overall, we can say that sequence modeling plays a vital role in modern NLP. By
considering the order and context of data points, these models enable machines to
understand complex relationships and perform various NLP tasks with greater accu-
racy and effectiveness.
8.2 Time-Series Forecasting
Time-series forecasting in NLP involves predicting future values or trends in text data based on historical patterns and sequences. While traditional time-series forecasting methods are usually applied to numerical data, many of their techniques can be adapted for forecasting tasks in NLP. Although NLP and time-series forecasting are distinct fields, they can be combined in certain applications to enhance the prediction of future events based on textual data. By leveraging word embeddings, language models, neural network architectures, and attention mechanisms, it is possible to build accurate forecasting models for various NLP tasks. While still an evolving area, this combination holds promise for extracting valuable insights from textual data. It can be used in the following applications:
1. Word embeddings for time series: In NLP, words or tokens can be treated as
time-series data, especially in tasks like sentiment analysis or topic modeling
over time. Word embeddings, such as Word2Vec or GloVe, can capture seman-
tic relationships between words in the context of time. By analyzing changes in
word embeddings over time, it’s possible to forecast future trends in language
usage or sentiment.
2. Language models: Large pretrained language models like GPT or BERT have
been used for time-series forecasting in NLP. By fine-tuning these models on his-
torical text data, they can generate predictions about future text sequences. For
example, they can predict the next word in a sentence or generate entire para-
graphs of text based on past patterns.
3. Temporal convolutional networks (TCNs): TCNs are neural network architec-
tures designed for sequential data processing, including time-series data. They use
convolutional layers with causal padding to capture temporal dependencies in the
input sequence. TCNs have been applied to text data for tasks like language model-
ing and text generation, making them suitable for time-series forecasting in NLP.
4. Recurrent neural networks (RNNs) and long short-term memory (LSTM) net-
works: RNNs and LSTMs are commonly used for sequential data processing in-
cluding time-series forecasting. In NLP, these architectures can be adapted to
model the temporal dynamics of text data and make predictions about future se-
quences. For example, they can be trained to predict the next word in a sentence
or the sentiment of future text.
5. Attention mechanisms: Attention mechanisms, commonly used in transformer
architectures like BERT and GPT, can be leveraged for time-series forecasting in
NLP. These mechanisms allow the model to focus on relevant parts of the input
sequence, which is useful for capturing temporal patterns in text data. By attend-
ing to historical text sequences, the model can make predictions about future
trends or language usage.
There is no doubt that this is a promising approach, but there are still some challenges in model training, such as:
Data integration: Integrating textual data with numerical time-series data can be
challenging due to differences in format and structure.
Model complexity: Combining NLP and time-series forecasting techniques can lead
to more complex models requiring larger datasets and computational resources.
Data quality and bias: The quality and potential biases within textual data sources
can impact the accuracy and reliability of the overall forecasts.
ARIMA (autoregressive integrated moving average) models combine three components:
Autoregressive (AR): This component takes into account the impact of past values of the time series on the forecast. It considers how many past values (called lags) are statistically significant in influencing the current value. In other words, the AR component represents the relationship between the current value of the series and its past values, modeling the dependency of the current observation on its lagged (past) values. The “p” parameter determines the number of lagged observations included in the model, where p, d, and q stand for:
– p: Number of autoregressive terms (lag order)
– d: Degree of differencing needed to achieve stationarity
– q: Number of moving average terms
Choosing the appropriate p, d, and q values is crucial for accurate forecasts. Various
statistical tests and information criteria are used to identify the best fitting model.
Integrated (I): The I component represents the differencing of the time series to make it
stationary. Stationarity is a key assumption in ARIMA modeling, as it ensures that the
statistical properties of the series remain constant over time. The “d” parameter deter-
mines the order of differencing required to achieve stationarity. This component deals
with nonstationary time series data where the mean, variance, or seasonality changes
over time. Differencing is applied to the data to achieve stationarity, making it statistically
stable for analysis and forecasting.
Moving average (MA): The MA component represents the dependency between the
current observation and a linear combination of past error terms (residuals). It models
the noise or random fluctuations in the time series. The “q” parameter determines the
number of lagged residuals included in the model. This component considers the aver-
age of past errors (the difference between predicted and actual values) to improve the
forecast by accounting for random fluctuations in the data.
ARIMA models are typically denoted by the notation ARIMA(p, d, q), where “p” represents
the AR order, “d” represents the differencing order, and “q” represents the MA order.
While ARIMA models are primarily used for numerical data, they can be applied to
certain aspects of text data analysis, particularly when dealing with time-series trends
in textual information. For example, ARIMA models could be used to forecast the fre-
quency of certain keywords or phrases in a text corpus over time. This could be useful
for tasks such as analyzing trends in social media conversations, monitoring changes in
public opinion, or forecasting demand for specific products or services based on textual
data. ARIMA models can also be used for predicting future sales figures, forecasting stock prices, estimating customer demand, and analyzing economic trends.
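As a hedged sketch of fitting such a model with statsmodels (the synthetic keyword-frequency series below is purely illustrative):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly keyword-frequency series (illustrative data only)
dates = pd.date_range("2022-01-01", periods=36, freq="MS")
counts = pd.Series(50 + 2 * np.arange(36) + np.random.normal(0, 5, 36), index=dates)

# Fit an ARIMA(p=1, d=1, q=1) model and forecast the next 6 periods
model = ARIMA(counts, order=(1, 1, 1))
fitted = model.fit()
print(fitted.forecast(steps=6))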
However, it’s important to note that ARIMA models may not be directly applicable
to all aspects of text data analysis, especially when dealing with unstructured textual
information. In such cases, other techniques such as NLP and machine learning may
be more appropriate for extracting insights and making predictions from text data.
ARIMA models are a valuable tool for time series forecasting, offering a robust
and interpretable approach. However, it’s important to be aware of their limitations
and consider alternative methods for nonstationary or complex data or for long-term
forecasting needs. The ARIMA algorithm is not well suited to some situations, such as:
1. Assumption of stationarity: ARIMA models require the data to be stationary,
which may not be the case for all real-world time series data.
2. Limited handling of nonlinear relationships: ARIMA models primarily focus
on linear relationships between past and future values and may not capture com-
plex nonlinear patterns effectively.
3. Challenges with long-term forecasting: The accuracy of ARIMA models gener-
ally deteriorates with longer forecasting horizons.
Prophet and neural networks are two different approaches to time-series forecasting, each with its own strengths and weaknesses. Prophet provides a simple yet powerful framework for forecasting series with strong seasonal patterns and special events, while neural networks offer flexibility and scalability for modeling complex dependencies in sequential data. The choice between these approaches depends on factors such as the nature of the data, the presence of seasonal patterns, and the computational resources available for model training and deployment.
Prophet: Prophet is relatively easy to use and requires minimal data preprocessing, making it accessible to users with varying levels of expertise. It provides a powerful yet user-friendly interface for time-series forecasting, making it suitable for both beginners and experienced practitioners.
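A minimal, hedged sketch of this workflow (assuming the prophet package is installed; the toy daily series is illustrative only):

import pandas as pd
from prophet import Prophet  # older releases expose this as fbprophet

# Prophet expects a DataFrame with columns 'ds' (dates) and 'y' (values)
df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=90, freq="D"),
    "y": range(90),  # illustrative values only
})

m = Prophet()  # default trend and seasonality components
m.fit(df)

future = m.make_future_dataframe(periods=30)  # extend 30 days into the future
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())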
Neural network: Neural networks, particularly recurrent neural networks (RNNs) and
their variants like long short-term memory (LSTM) networks, are a class of deep learn-
ing models capable of learning complex patterns and dependencies in sequential data.
When applied to time-series forecasting, neural networks offer several advantages:
1. Ability to capture nonlinear relationships and complex patterns in the data.
2. Flexibility in modeling various types of time-series data, including both univari-
ate and multivariate series.
3. Scalability to handle large-scale datasets and high-dimensional input features.
4. Capability to automatically extract relevant features from raw data, reducing the
need for manual feature engineering.
However, training neural networks for time-series forecasting often requires a consid-
erable amount of data and computational resources. Additionally, neural networks can
be prone to overfitting, especially when dealing with small or noisy datasets. Proper
model tuning and regularization techniques are essential to achieve good performance
and generalization on unseen data. Tab. 8.1 shows the pros and cons of Prophet.

Pros:
– Simple and user-friendly: Easy to use and understand, requiring minimal data preprocessing and code.
– Interpretable model: Provides insights into the factors influencing the forecast such as trend, seasonality, and holidays.
– Handles seasonality and holidays: Can automatically capture and model seasonal patterns and holiday effects.

Cons:
– Limited flexibility: Not as flexible as neural networks in capturing complex nonlinear relationships in data.
– May struggle with nonstationary data: May not perform well with data that exhibits significant trends or changes in variance over time.
– Limited feature engineering: Offers limited options for incorporating additional features beyond the provided model components.
Tab. 8.2 shows some of the pros and cons of neural networks for forecasting.

Pros:
– High flexibility: Capable of capturing complex nonlinear relationships in data, making them suitable for diverse forecasting tasks.
– Can handle nonstationary data: Able to learn from various types of data, including nonstationary data.
– Incorporation of additional features: Can be combined with other features beyond the time series data to improve accuracy.

Cons:
– Complexity and difficulty of use: Can be complex to set up, requiring more expertise in data science and machine learning.
– Interpretability: Can be difficult to interpret the inner workings and reasoning behind the model’s predictions.
– Computational cost: Training neural networks can be computationally expensive and resource-intensive.
– Data requirements: Often require larger amounts of data to achieve optimal performance.
8.3 Recommender Systems
Recommender systems are a type of information filtering system that aim to predict
user preferences or interests and recommend items (such as products, movies, music,
and articles) that are likely to be of interest to them. These systems play a crucial role
in various online platforms and services, helping users discover relevant content and
improving user engagement and satisfaction. NLP offers a powerful toolkit for en-
hancing recommender systems by extracting meaningful insights from textual data,
leading to more accurate, personalized, and insightful recommendations for users.
Recommender systems are widely used in e-commerce platforms, streaming services,
social media, news websites, and other online platforms to personalize user experien-
ces, increase user engagement, and drive business revenue. The choice of a specific
recommender system depends on factors such as the characteristics of the data, the
available features, the scalability requirements, and the desired level of recommenda-
tion accuracy. There are several types of recommender systems, each employing different algorithms and techniques.
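Collaborative filtering, discussed further in the chapter summary, is one widely used approach; here is a minimal, hedged sketch of user-based collaborative filtering with a toy rating matrix (all values illustrative):

import numpy as np

# Toy user-item rating matrix (rows = users, columns = items; 0 means not rated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Predict user 0's rating for item 2 from the users who rated that item,
# weighted by their similarity to user 0
target_user, target_item = 0, 2
sims = np.array([cosine_sim(ratings[target_user], ratings[u]) for u in range(len(ratings))])
rated = ratings[:, target_item] > 0
prediction = sims[rated] @ ratings[rated, target_item] / (sims[rated].sum() + 1e-9)
print("Predicted rating:", round(float(prediction), 2))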
=========================Output====================================
(S
Natural/JJ
Language/NNP
Processing/NNP
(/(
(ORGANIZATION NLP/NNP)
)/)
is/VBZ
a/DT
field/NN
of/IN
artificial/JJ
intelligence/NN
(/(
AI/NNP
)/)
that/IN
deals/NNS
with/IN
the/DT
interaction/NN
between/IN
computers/NNS
and/CC
human/JJ
language/NN
./.
Its/PRP$
goal/NN
is/VBZ
to/TO
enable/JJ
computers/NNS
to/TO
understand/VB
, /,
interpret/VB
, /,
and/CC
manipulate/VB
natural/JJ
language/NN
in/IN
a/DT
way/NN
that/WDT
is/VBZ
similar/JJ
to/TO
how/WRB
humans/NNS
do/VBP
./.
Its/PRP$
primary/JJ
goal/NN
is/VBZ
to/TO
enable/JJ
computers/NNS
to/TO
understand/VB
, /,
interpret/VB
, /,
and/CC
generate/VB
human/JJ
language/NN
in/IN
a/DT
manner/NN
that/WDT
is/VBZ
both/DT
meaningful/JJ
and/CC
useful/JJ
./.
This/DT
includes/VBZ
written/VBN
text/NN
, /,
spoken/JJ
language/NN
, /,
and/CC
even/RB
sign/JJ
language/NN
./.)
===================================================================
NLP is very good in terms of performance and accuracy, but it still has some challenges; the benefits and challenges are summarized in Table 8.3.

Benefits:
– Improved accuracy and personalization: By incorporating nuanced insights from text data, NLP can enhance the accuracy and personalization of recommendations.
– Handling diverse textual data: NLP allows systems to understand various forms of user input, including reviews, social media posts, or search queries.
– Discovery of hidden patterns: NLP techniques can unveil hidden patterns within text data, leading to unexpected and valuable recommendations for users.

Challenges:
– Data quality and bias: The quality and potential biases within textual data can impact the accuracy and fairness of recommendations.
– Computational cost: NLP techniques can be computationally expensive, requiring powerful hardware and efficient algorithms.
– Semantic ambiguity: Language can be ambiguous, and NLP models might misinterpret the meaning or intent of user-generated text.
Computer vision, a field of AI, focuses on enabling computers to interpret and understand visual information from the real world. It has numerous applications across various domains, revolutionizing industries and enhancing human capabilities. Here are some of the most common applications of computer vision:
1. Object detection and recognition: Computer vision systems can detect and rec-
ognize objects within images or videos. This capability is used in various applica-
tions such as autonomous vehicles, surveillance systems, and augmented reality
(AR). Object detection is also essential in retail for inventory management and in
healthcare for identifying anatomical structures in medical imaging. Computer vi-
sion algorithms can detect and track objects in images and videos. This technology
is used in various applications including security surveillance, traffic monitoring,
and robotics.
2. Image classification: Image classification involves categorizing images into pre-
defined classes or categories. This application is widely used in content modera-
tion, where images are classified as safe or unsafe for certain audiences. Image
classification is also employed in agriculture for identifying crop diseases, in
manufacturing for quality control, and in e-commerce for visual search.
Object detection and segmentation are two important tasks in computer vision, both
involving the identification and localization of objects within images or videos. Object
detection and segmentation are fundamental tasks in computer vision with numerous
applications including autonomous driving, surveillance, medical imaging, and AR.
These tasks enable machines to understand and interact with visual data, paving the
way for a wide range of intelligent applications and services.
– Object detection: Object detection is the task of locating and classifying multiple
objects within an image or video frame. It involves identifying the presence of
objects in an image and determining their respective classes or categories. Object
detection is typically performed using bounding boxes to outline the regions
where objects are located. Common techniques for object detection include:
– Single shot multibox detector (SSD): SSD is a popular real-time object de-
tection method that predicts bounding boxes and class probabilities directly
from feature maps at multiple scales. It achieves high speed and accuracy by
efficiently processing images with a single feedforward pass through a CNN.
– Faster R-CNN: Faster R-CNN is a two-stage object detection framework that
uses a region proposal network to generate candidate object regions, followed
by a detection network to refine the proposals and classify objects. It achieves
high accuracy by jointly optimizing region proposals and object detection (a short usage sketch follows after this list).
– YOLO (You Only Look Once): YOLO is another real-time object detection
method that divides the input image into a grid of cells and predicts bounding
boxes and class probabilities for each grid cell. It processes the entire image
in a single forward pass through a CNN, making it faster than two-stage meth-
ods like Faster R-CNN.
– Object segmentation: Object segmentation is the task of segmenting or partition-
ing an image into multiple regions, each corresponding to a distinct object or ob-
ject instance. Unlike object detection, which identifies the presence of objects and
their bounding boxes, object segmentation provides pixel-level masks for each ob-
ject in the image. Common techniques for object segmentation include:
– Mask R-CNN: Mask R-CNN extends faster R-CNN by adding a branch for predict-
ing segmentation masks alongside bounding boxes and class probabilities. It
generates pixel-wise masks for each object instance in the image, enabling pre-
cise segmentation of objects with complex shapes and overlapping instances.
– U-Net: U-Net is a fully convolutional network (FCN) architecture designed for
biomedical image segmentation but widely used in other domains as well. It
consists of an encoder-decoder structure with skip connections that preserve
spatial information at different scales. U-Net is known for its effectiveness in
segmenting objects from limited training data.
– Semantic segmentation: Semantic segmentation assigns a class label to each
pixel in the image, without distinguishing between different instances of the
same class. It provides a dense pixel-wise classification of the entire image,
allowing for scene understanding and pixel-level analysis. Techniques such
as FCNs and DeepLab are commonly used for semantic segmentation tasks.
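As a hedged sketch (not from the original text), a pretrained Faster R-CNN detector can be loaded and run with torchvision roughly as follows; the dummy image tensor is a placeholder for a real photo:

import torch
import torchvision

# Load a Faster R-CNN detector pretrained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Dummy RGB image tensor (3 x H x W, values in [0, 1]); replace with a real image
image = torch.rand(3, 480, 640)

with torch.no_grad():
    predictions = model([image])  # one dict per input image

# Each prediction holds bounding boxes, class labels, and confidence scores
print(predictions[0]["boxes"].shape, predictions[0]["scores"][:5])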
Let’s have a look at the code below:
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')  # WordNet corpus, needed once

syn = wordnet.synsets('dog')[0]     # first synset for 'dog'
syn = wordnet.synsets('poodle')[0]  # first synset for 'poodle'

# Get an antonym of 'good' by scanning its synsets and their lemmas
synsets = wordnet.synsets('good')
antonym = None
for s in synsets:
    for lemma in s.lemmas():
        if lemma.antonyms():
            antonym = lemma.antonyms()[0].name()
            break
    if antonym:
        break
if antonym:
    print("Antonym of 'good': ", antonym)
else:
    print("No antonym found for 'good'")
===============================Output==============================
Hyponyms of 'dog': ['basenji', 'corgi', 'cur', 'dalmatian', 'Great_Pyrenees', 'grif-
fon', 'hunting_dog', 'lapdog', 'Leonberg', 'Mexican_hairless', 'Newfoundland',
'pooch', 'poodle', 'pug', 'puppy', 'spitz', 'toy_dog', 'working_dog']
Hypernyms of 'dog': ['dog']
Antonym of 'good': evil
===================================================================
8.5 Reinforcement Learning
Reinforcement learning (RL) is built around the following key components:
1. Agent: The learning entity that interacts with the environment and makes deci-
sions. The agent is the entity that interacts with the environment. It observes the
state of the environment, selects actions, and receives rewards or penalties based
on its actions. The agent’s goal is to learn a policy – a mapping from states to ac-
tions – that maximizes cumulative rewards over time.
2. Environment: The system or world the agent interacts with, providing feedback
through rewards and penalties. The environment represents the external system
or process with which the agent interacts. It is defined by a set of states, actions,
and transition dynamics. The environment also provides feedback to the agent in
the form of rewards or penalties based on its actions.
3. Action: The choices the agent can make within the environment. An action is a
decision or choice made by the agent that affects the state of the environment.
Actions can be discrete (e.g., selecting from a finite set of options) or continuous
(e.g., specifying a value within a continuous range).
4. Reward: The feedback signal the environment provides to the agent, indicating
the goodness or badness of its actions. A reward is a scalar feedback signal pro-
vided by the environment to the agent after each action. It indicates the immedi-
ate desirability or quality of the action taken by the agent. The agent’s goal is to
learn a policy that maximizes the cumulative reward over time.
5. Policy: The agent’s strategy for choosing actions is based on the current state of
the environment. A policy is a mapping from states to actions that defines the
agent’s behavior. It specifies the action the agent should take in each state to max-
imize expected cumulative rewards. Policies can be deterministic or stochastic,
depending on whether they directly specify actions or provide a probability dis-
tribution over actions.
6. State: A state represents the current situation or configuration of the environment.
It contains all the relevant information needed for the agent to make decisions.
States can be discrete or continuous, depending on the nature of the environment.
7. Value functions: The value function estimates the expected cumulative reward
that an agent can achieve from a given state or state-action pair. It quantifies the
desirability of being in a particular state or taking a particular action and is used
to guide the agent’s decision-making process.
8. Exploration and exploitation: RL involves a trade-off between exploration (try-
ing out new actions to discover potentially better strategies) and exploitation (se-
lecting actions that are known to yield high rewards based on current knowledge).
Balancing exploration and exploitation is crucial for effective learning.
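Before the full training listing below, here is a minimal, hedged sketch of the agent-environment loop with a random policy on CartPole (using the classic Gym API; newer Gymnasium releases return (observation, info) from reset and five values from step):

import gym

env = gym.make("CartPole-v1")
observation = env.reset()

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random action: no learning yet
    observation, reward, done, info = env.step(action)
    total_reward += reward  # accumulate the episode return

print("Episode return:", total_reward)
env.close()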
1. Trial and error: Through interacting with the environment and receiving re-
wards, the agent learns by trial and error. It gradually improves its policy by se-
lecting actions that lead to higher rewards over time.
import gym
import math
import random
from collections import namedtuple
import matplotlib
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
env = gym.make("CartPole-v1")
# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
from IPython import display
plt.ion()
# if GPU is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Transition = namedtuple('Transition',
('state', 'action', 'next_state', 'reward'))
class ReplayMemory(object):
self.memory.append(Transition(*args))
def __len__(self):
return len(self.memory)
class DQN(nn.Module):
steps_done = 0
def select_action(state):
global steps_done
sample = random.random()
eps_threshold = EPS_END + (EPS_START - EPS_END) * \
math.exp(-1. * steps_done / EPS_DECAY)
steps_done += 1
if sample > eps_threshold:
with torch.no_grad():
# t.max(1) will return the largest column value of each row.
# second column on max result is index of where max element was
# found, so we pick action with the larger expected reward.
return policy_net(state).max(1).indices.view(1, 1)
else:
return torch.tensor([[env.action_space.sample()]], device=device,
dtype=torch.long)
episode_durations = []
def plot_durations(show_result=False):
plt.figure(1)
durations_t = torch.tensor(episode_durations, dtype=torch.float)
if show_result:
plt.title('Result')
else:
plt.clf()
plt.title('Training. . .')
plt.xlabel('Episode')
plt.ylabel('Duration')
plt.plot(durations_t.numpy())
def optimize_model():
if len(memory) < BATCH_SIZE:
return
transitions = memory.sample(BATCH_SIZE)
# Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for a
# detailed explanation). This converts a batch-array of Transitions
# to a Transition of batch-arrays.
batch = Transition(*zip(*transitions))
if torch.cuda.is_available():
num_episodes = 600
else:
num_episodes = 50
if terminated:
next_state = None
else:
next_state = torch.tensor(observation, dtype=torch.float32,
device=device).unsqueeze(0)
if done:
episode_durations.append(t + 1)
plot_durations()
break
print('Complete')
plot_durations(show_result=True)
plt.ioff()
plt.show()
Exploration vs. exploitation dilemma: The agent needs to balance exploring new
actions to discover potential rewards with exploiting already learned successful
actions.
High computational cost: Training RL agents can be computationally expensive, es-
pecially for complex environments.
Data scarcity: Learning effectively in RL often requires a large amount of data col-
lected through interaction with the environment.
8.8 Q-learning
Q-learning is a popular RL algorithm used for learning optimal policies in Markov decision processes, even when the agent has no explicit model of the environment. It is a model-free, value-based algorithm that learns an action-value
function (Q-function) to determine the quality of taking a particular action in a given
state. Q-learning is a specific model-free RL algorithm used to train an agent to make
optimal decisions in an environment. It belongs to a family of algorithms known as
value-based methods that learn by estimating the long-term value of taking specific
actions in different states. Over time, as the agent explores the environment and re-
ceives feedback, the Q-values converge to the optimal action-values, indicating the ex-
pected cumulative rewards of taking each action in each state. The agent then follows
the optimal policy by selecting actions with the highest Q-values in each state.
Q-learning is particularly well-suited for discrete and deterministic environments
with finite state and action spaces. However, it can also be extended to handle contin-
uous and stochastic environments through function approximation methods and ex-
perience replay techniques. Despite its simplicity, Q-learning has been successfully
applied in various domains including robotics, game playing, and autonomous sys-
tems. Overall, Q-learning is a fundamental and versatile algorithm in the field of RL.
Its simplicity and off-policy learning capabilities make it a popular choice for various
applications. However, it’s crucial to address the challenges of exploration, conver-
gence, and dimensionality when implementing Q-learning in complex tasks. Some of the terminology used in Q-learning:
1. Q-value (Q(s, a)): This represents the estimated future reward an agent expects
to receive by taking action “a” in state “s.”
2. State (s): The current situation or configuration the agent is in.
3. Action (a): The possible choices the agent can make in a given state.
4. Reward (r): The feedback signal the environment provides after the agent takes
an action.
Initialization: Initialize the Q-function, Q(s,a), where s is the state and a is the action,
to arbitrary values. All Q-values are initially set to a chosen value (e.g., 0).
Exploration vs. exploitation: At each time step, the agent selects an action based on its
current policy. The policy can be greedy (selecting the action with the highest Q-value)
or exploratory (selecting a random action with some probability).
Action selection: Given the current state s, the agent selects an action a based on the
policy.
Interaction: The agent takes an action (a) in the current state (s).
Observation and reward: The agent observes the next state s′ and the reward r received as a result of taking action a in state s. The agent receives a reward (r) and observes the next state (s′).
Update Q-value: The agent updates the Q-value for the current state-action pair (s, a)
using the Bellman equation:
Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]

where
– α is the learning rate, controlling the step size of the updates. It controls the weight given to the new information (learning from the current experience).
– γ is the discount factor, determining the importance of future rewards (balancing immediate and long-term benefits).
– r + γ max_{a′} Q(s′, a′) is the target value, representing the expected cumulative reward of taking action a in state s and then following the optimal policy thereafter.
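A minimal, hedged sketch of this update rule in action on a hypothetical toy environment (the states, transitions, and rewards below are purely illustrative):

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))  # initialize all Q-values to 0
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate

def step(state, action):
    # Toy deterministic environment: action 1 moves right, action 0 stays; goal is the last state
    next_state = min(state + action, n_states - 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(500):
    state = 0
    for _ in range(20):
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward = step(state, action)
        # Q-learning update with the Bellman target
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)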
Benefits:
– Model-free: Doesn’t require a detailed model of the environment, making it applicable to various situations.
– Off-policy learning: Can learn from data collected using different policies, allowing for efficient exploration of the environment.
– Simple and efficient: Easy to understand and implement, making it a popular choice for RL applications.

Challenges:
– Exploration vs. exploitation: Balancing exploration of new actions with exploiting known good actions remains crucial.
– Convergence: In complex environments, convergence to an optimal policy can be slow or even impossible.
– Curse of dimensionality: With large state and action spaces, the number of Q-values to learn can become very high, making learning inefficient.
8.9 Deep Q Networks
Deep Q networks (DQN) combine Q-learning with deep neural networks that approximate the Q-function. Key elements of DQN include:
– Experience replay: Transitions are sampled from the replay memory and used to update the Q-network parameters. Experience re-
play helps decorrelate training samples and break temporal correlations in the
data. Stores past experiences (state, action, reward, next state) in a replay mem-
ory. This allows the network to learn from a diverse set of experiences and over-
come the limitations of learning from consecutive experiences alone.
– Target network: To stabilize training, DQN introduces a separate target network
that is periodically updated to approximate the target Q-values. The target network
is a copy of the Q-network with frozen parameters and is used to compute the tar-
get Q-values for updating the Q-network. Periodically updating the target network
helps prevent oscillations and divergence during training. The target network is periodically updated with the weights of the main Q-network, which helps stabilize the learning process by reducing the correlation between the target values and the action selection in the Q-learning update.
– Loss function: DQN uses the mean squared error (MSE) loss between the pre-
dicted Q-values and the target Q-values to update the Q-network parameters. The
target Q-values are computed using the Bellman equation with the target network.
– ε-Greedy exploration: To balance exploration and exploitation, DQN employs ε-
greedy exploration, where the agent selects a random action with probability ε
and selects the action with the highest Q-value with probability 1 – ε. The value of
ε is annealed over time to gradually shift from exploration to exploitation.
– Training procedure: During training, the agent interacts with the environment,
collects experiences, and updates the Q-network parameters using stochastic gra-
dient descent based on the sampled minibatches of transitions. The target net-
work is periodically updated to track the changes in the Q-network.
Benefits:
– Scalability: Handles large state and action spaces effectively due to function approximation with neural networks.

Challenges:
– Complexity: Designing and training deep neural networks requires more expertise and computational resources.
Policy gradient methods are a powerful class of RL algorithms that directly optimize the policy of an agent to maximize its long-term reward in an environment. Unlike value-based methods like Q-learning, which estimate the value of states and state-action pairs, policy gradient methods directly parameterize a policy – a mapping from states to actions – and update its parameters to maximize the expected cumulative reward. Policy gradient methods offer several advantages, including the ability
to learn stochastic policies and handle continuous action spaces. Some terminology used in policy gradient methods:
1. Policy (π): Represents the probability distribution over possible actions the agent
can take in a given state.
2. State (s): The current situation or configuration the agent is in.
3. Action (a): The possible choices the agent can make in a given state.
4. Reward (r): The feedback signal the environment provides after the agent takes
an action.
They have been successfully applied to a wide range of RL tasks including robotics,
game playing, and NLP. Examples of policy gradient methods include REINFORCE,
actor-critic methods, and proximal policy optimization (PPO). The policy gradient
methods offer a powerful approach to RL, enabling agents to learn effective policies
for complex tasks. However, addressing the challenges of variance, sample efficiency,
and hyperparameter tuning is critical for successful application. Here are some steps
that show how policy gradient method works:
Step 1. Policy parameterization: The agent interacts with the environment following
its current policy and collects data (state-action-reward tuples). In policy gradient
methods, the policy is typically parameterized by a neural network or another func-
tion approximator. The parameters of the policy network are denoted by θ, and the
policy itself is denoted by πθ (a∣s), representing the probability of taking action a in
state s given the parameters θ.
Step 2. Objective function: The goal is to maximize the expected cumulative discounted reward

J(θ) = E_{π_θ}[ Σ_{t=0}^{T} γ^t r_t ]

where r_t is the reward received at time step t, T is the time horizon, and γ is the discount factor that determines the importance of future rewards.
Step 3. Gradient ascent: Policy gradient methods use gradient ascent to update the
policy parameters θ in the direction of the gradient of the objective function J(θ). The
gradient of J(θ) with respect to the policy parameters is given by
∇_θ J(θ) = E_{π_θ}[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) · G_t ]
where G_t is the return from time step t onward (the reward-to-go); when a baseline is subtracted from it, it is called the advantage.
Step 4. Policy update: The policy parameters are updated using stochastic gradient
ascent:
θ ← θ + α ∇_θ J(θ)
Step 5. Reward-to-go: In practice, policy gradient methods often use the reward-to-go
formulation, where the gradient is scaled by the return from the current time step
onward rather than the total return. This helps reduce variance in the gradient esti-
mates and improves training stability.
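A small, hedged sketch of the reward-to-go computation (the reward sequence is illustrative only):

import numpy as np

def reward_to_go(rewards, gamma=0.99):
    # G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(reward_to_go([0.0, 0.0, 1.0]))  # [0.9801 0.99 1.0]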
The benefits and challenges of policy gradient methods are given below:

Benefits:
– Direct policy optimization: They directly optimize the policy, which can be more efficient than learning state-action values, especially in large state spaces.
– Policy interpretability: In some cases, the learned policy can be interpreted, providing insights into the agent’s decision-making process.
– Versatility: They can be applied to various tasks and environments, including continuous action spaces.

Challenges:
– High variance: Estimating the policy gradient can be noisy and lead to unstable learning, requiring careful implementation and techniques like variance reduction.
– Sample efficiency: They can be sample-inefficient, meaning they may require a large amount of data to learn effectively.
– Hyperparameter tuning: Tuning hyperparameters of the learning algorithm and policy network is crucial for achieving good performance.
3. Proximal policy optimization: This advanced method addresses issues like the
policy gradient vanishing problem and ensures that the updated policy remains
close to the original one, leading to more stable learning.
import gym
env = gym.make('CartPole-v1')
env.observation_space
env.action_space
import numpy as np
class LogisticPolicy:
self.θ = θ
self.α = α
self.γ = γ
return 1 / (1 + np.exp(-y))
y = x @ self.θ
prob0 = self.logistic(y)
probs = self.probs(x)
action = np.random.choice([0, 1], p=probs)
return action, probs[action]
y = x @ self.θ
grad_log_p0 = x - x * self.logistic(y)
grad_log_p1 = - x * self.logistic(y)
discounted_rewards = np.zeros(len(rewards))
cumulative_rewards = 0
for i in reversed(range(0, len(rewards))):
cumulative_rewards = cumulative_rewards * self.γ + rewards[i]
discounted_rewards[i] = cumulative_rewards
return discounted_rewards
observations = []
actions = []
rewards = []
probs = []
done = False
observations.append(observation)
total_reward += reward
rewards.append(reward)
actions.append(action)
probs.append(prob)
# update policy
policy.update(rewards, observations, actions)
print("EP: " + str(i) + " Score: " + str(total_reward) + " ",
end="\r", flush=False)
# for reproducibility
GLOBAL_SEED = 0
np.random.seed(GLOBAL_SEED)
plt.plot(episode_rewards)
Summary
Natural language processing (NLP) enables computers to interact with us more naturally and perform tasks involving human lan-
guage. Its primary goal is to enable computers to understand, interpret, and generate
human language in a way that is both meaningful and contextually relevant. Tokeni-
zation and word embeddings are fundamental concepts in NLP that work together
to enable machines to understand and process human language. Breaking down text
into smaller units such as words or sentences is the process of tokenization. In NLP,
sequence modeling plays a crucial role in tasks that involve analyzing and processing
sequential data such as text, speech, and even protein sequences in bioinformatics.
Unlike traditional machine learning methods that treat data points as independent, se-
quence models consider the order and context of elements within the sequence.
ARIMA models are a class of statistical models commonly used for time series analysis
and forecasting. They can handle data that exhibits nonstationarity, meaning the statistical properties of the data (such as mean and variance) change over time, by differencing the series until it becomes stationary, that is, until its statistical properties (mean, variance, and autocorrelation) remain constant over time. Prophet is an open-source forecasting tool developed by Facebook’s
Core Data Science team. It is designed to handle time series data with strong seasonal
effects and multiple seasonality. Prophet uses an additive model where different com-
ponents of the time series (trend, seasonality, and holiday effects) are modeled sepa-
rately and combined to make predictions. Both Prophet and neural networks are
powerful tools for time series forecasting, but they differ in their approach, strengths,
and weaknesses.
Collaborative filtering (CF) is a technique used in recommender systems to predict
the preferences of a user based on the preferences of similar users or items. It’s a pow-
erful tool for personalizing recommendations across various domains like suggesting
products, movies, music, or even news articles. The underlying idea is to leverage the
collective wisdom of a group of users to infer preferences for individual users.
Exercise (MCQs)
2. What is the process of converting words into their base form called?
A) Tokenization B) Lemmatization C) Stemming D) Normalization
7. What type of model uses past observations to predict future values without
explicitly identifying trends or seasonality?
A) ARIMA model
B) Exponential smoothing model
C) Linear regression model
D) Naïve forecast model
8. The mean squared error (MSE) is a commonly used metric to evaluate the
performance of a time series forecast. A lower MSE indicates:
A) A higher deviation between predicted and actual values.
B) A better fit between predicted and actual values.
C) No relationship between predicted and actual values.
D) The forecast is always accurate.
10. Which of the following is NOT a common challenge in time series forecasting?
A) Missing data points
B) Stationarity of the data
C) Identifying the relevant factors influencing the series
D) Overfitting the model to the training data
Answers Key
1. b) Image recognition
2. c) Stemming
3. b) The
4. c) Represent text as a frequency distribution of words
5. c) Recurrent neural network (RNN)
6. c) Cyclicity (not a universal component, only present in some time series)
7. d) Naïve forecast model (assumes future values are equal to the last observed
value)
8. b) A better fit between predicted and actual values (lower error means predic-
tions are closer to actual values)
9. d) Predicting daily website traffic with a significant weekly pattern (ARIMA mod-
els can capture seasonality)
10. d) Overfitting the model to the training data (a challenge in all machine learning
tasks, not specific to time series forecasting)
11. d) Demographic filtering (not a common type)
12. c) Similarities between users based on their past interactions
13. c) Both a and b (cold start affects both new users and new items)
14. b) Collaborative filtering only (used to reduce dimensionality in user-item matrix)
15. d) Eliminating the need for human intervention (not a complete replacement, hu-
mans still play a role in system design and optimization)
Descriptive Questions
1. Explain how image segmentation is used in medical imaging analysis. What are some specific applications and the potential benefits for patients and healthcare professionals?
2. Explain the concept of ambiguity in natural language and discuss the challenges
it poses for NLP tasks like machine translation and sentiment analysis. How do
NLP algorithms handle these challenges?
3. Compare and contrast two different approaches to text summarization: abstrac-
tive summarization and extractive summarization. Discuss the advantages and
disadvantages of each approach.
4. Explain the key components of a reinforcement learning system: agent, environ-
ment, states, actions, and rewards. How do these components interact with each
other in the learning process?
5. Compare and contrast two different reinforcement learning algorithms such as Q-
learning and DQN. Discuss their strengths and weaknesses in different scenarios.
6. Reinforcement learning is often used in situations where the environment is partially
observable or dynamic. How do RL algorithms handle these complexities? What are
some challenges and potential solutions for learning in such environments?
7. Explain the steps involved in the time-series forecasting process, starting with
data collection and preprocessing to model selection, evaluation, and interpreta-
tion. What are some important considerations and challenges at each step?
Answers
1. word embeddings
2. breaking down
3. Sentence tokenization
4. GloVe
5. RNNs
Index
append( ) 43
Arrays 90
Association 154
Apriori algorithm 158
Accuracy 167
AdaBoost 303
Anomaly Detection 347
Activation Function 363
Autoregressive Integrated Moving Average (ARIMA) 434

Broadcasting 103
Box Plot 135
Bias-Variance 159
Binary Classification 210
Bagging and Boosting 298
Batch Normalization 413

Conda 20
Continuous Integration 23
Comments 37
Control Structures 60
Conditional Statements 60
Classes 77
Concatenation 99
Contour Plot 139
Classification 153
Clustering 154, 156, 261, 351
Cross-Validation 175
Computational Graph 367
Convolutional Neural Network (CNN) 374
Computer Vision 443

Eigenvectors 275
Eigenvalues 275
Ensemble Methods 297
Eager Execution 368
Expert Systems 3

Features 31
Functions 63
Files 68
Flask 76
Feature Engineering 123
FP-growth algorithm 158
Feature Extraction 407

Gaussian Naive Bayes 285
Gradient Boosting 303
Generative Adversarial Networks (GANs) 408
Gradient Clipping 414
GitHub 15

Hyperparameter 255
Hierarchical inheritance 82
Hybrid inheritance 82
Heatmap 137

Indentation 36
Inheritance 80
Imbalanced Data 143
Information Gain 230
Isolation Forest 352
insert( ) 43
Immutable 48
Python 30
Python Libraries 12, 17
PyTorch 19
Python Environment 15
Pip 19
Python Syntax 36
Primitive 39
pop( ) 44, 55
Packages 67
Polymorphism 80
Pandas 18, 109
PrefixSpan algorithm 158
Polynomial Regression 155, 203
Principal Component Analysis 270
Perceptron 362
Prophet 436
Policy Gradient Methods 459

Q-learning 455

Resampling 144
Regression 153

Tuples 39
Transposition 99
Transformation 120
TensorFlow 18, 367, 373
Transfer Learning 400
Tokenization 428
Time-Series Forecasting 433

Unsupervised Learning 153
Unpacking 48

Virtual Environments 16
Variables 38
Vectorization 103
Violin Plot 136

Whitespace 37
Word Embeddings 429

XGBoost 316, 319