40 Algorithms Every Data Scientist Should Know Jurgen Weichenberger Huw Kwon
Jürgen Weichenberger
Huw Kwon
www.bpbonline.com
ISBN: 978-93-55519-832
All Rights Reserved. No part of this publication may be reproduced, distributed or transmitted in
any form or by any means or stored in a database or retrieval system, without the prior written
permission of the publisher with the exception to the program listings which may be entered,
stored and executed in a computer system, but they can not be reproduced by the means of
publication, photocopy, recording, or by any electronic and mechanical means.
All trademarks referred to in the book are acknowledged as properties of their respective
owners but BPB Publications cannot guarantee the accuracy of this information.
Dedicated to
My beloved wife: Li and
My Daughter Sophia
– Jürgen
Dr. Zachary Elewitz is a data scientist with over a decade of experience, currently serving
as the Head of AI at Fortune Brands Innovations. He holds two AI-related patents, sits on
Texas A&M Commerceʼs Venture College Board, and serves in several AI-related groups,
including the National Institute of Standards and Technology Generative AI Public
Working Group. In his spare time, he is pursuing a Master's in Viking studies and enjoys
snowboarding, indoor bouldering, and playing the guitar.
Acknowledgements
Jürgen: I want to express my deepest gratitude to my family and friends for their
unwavering support and encouragement throughout this book’s writing, especially
my wife Li and my daughter Sophia.
I am also grateful to BPB Publications for their guidance and expertise in bringing
this book to fruition. It was a long journey of revising this book, with valuable
participation and collaboration of reviewers, technical experts, and editors.
Finally, I would like to thank all the readers who have taken an interest in this
book and for their support in making it a reality. Your encouragement has been
invaluable.
Huw: Your guidance and support have been invaluable on my journey to solve real-
world problems with Data and AI. Each of you has been statistically significant in
this pursuit of mine, and for that, I am deeply grateful.
Preface
Building Artificial Intelligence and Machine Learning solutions is a complex task that
requires a comprehensive understanding of the latest technologies and algorithms
available to us. Artificial Intelligence has become an increasingly powerful tool over the
last couple of years, and the number of algorithms available to us has exploded.
This book is designed to provide a comprehensive guide through the world of Artificial
Intelligence algorithms and to be a practical, hands-on resource for new and
experienced data scientists alike. It covers a wide range of topics, including
the basic definition of Artificial Intelligence and Machine Learning, basic data concepts,
and basic and advanced algorithms for supervised, unsupervised, semi-supervised, and
reinforcement learning.
Throughout the book, you will learn about the key features of every algorithm, their
mathematical foundation, and how to use them to build Artificial Intelligence solutions
that are efficient, reliable, and easy to maintain. You will also learn about best practices and
design patterns for Artificial Intelligence solutions and will be provided with numerous
practical examples to help you understand the algorithms.
This book is intended for new data scientists who want to learn which algorithms are
available and how to build Artificial Intelligence solutions with them. It is also helpful for
experienced data scientists who want to expand their knowledge of these algorithms and
improve their skills in building robust and reliable Artificial Intelligence solutions.
With this book, you will gain the knowledge and skills to become a proficient data scientist
and be able to build Artificial Intelligence solutions. We hope you will find this book
informative and helpful.
Chapter 2: Typical Data Structures – An AI/ML algorithm can neither be trained nor run
inference without being fed data in the right structure. The process of preparing the
data is known as feature engineering and requires the right school of thought.
Chapter 4: Basic Supervised Learning Algorithms – Chapters 4-11 will cover the 40
algorithms which comprise the core of the book.
Chapter 10: Basic Semi-Supervised Learning Algorithms – This chapter covers basic semi-
supervised learning algorithms, including Self-training, where a model iteratively adds
high-confidence predictions from unlabeled data to its training set. Co-training involves
multiple models training on different data views and refining each other’s predictions.
Multi-view Learning enhances learning by using various data representations to ensure
prediction agreement. Expectation-Maximization (EM) estimates parameters and missing
labels in probabilistic models. Finally, Graph-based Methods propagate labels from labeled
to unlabeled data using the data’s structure, with techniques like Label Propagation and
Manifold Regularization.
NLP involves developing algorithms and computational models that can analyze,
interpret, and generate human language, including tasks such as language translation,
sentiment analysis, text summarization, speech recognition, and language generation.
The goal of NLP is to enable computers to understand and respond to natural language
in the same way that humans do, allowing for more natural and intuitive communication
between humans and machines. NLP has applications in a wide range of fields, including
machine translation, information retrieval, customer service, healthcare, and education.
Chapter 13: Computer Vision – Computer vision is a field of artificial intelligence and
computer science that focuses on enabling computers to interpret and understand the
visual world around them, similar to how humans perceive and process visual information.
Computer vision involves developing algorithms and computational models that can
analyze and interpret images and videos. This includes tasks such as object detection,
image classification, facial recognition, scene understanding, and image segmentation.
The field of computer vision has made significant progress in recent years, with the
development of deep learning algorithms and convolutional neural networks, which have
led to breakthroughs in tasks such as image recognition and object detection.
Chapter 15: Outlook into the Future: Quantum Machine Learning – Quantum machine
learning is an emerging field that combines quantum computing and machine learning.
Quantum computing uses the principles of quantum mechanics to perform certain
computations much faster than classical computing. Machine learning, on the other hand,
involves developing algorithms that can learn patterns and insights from data.
Quantum machine learning aims to leverage the power of quantum computing to develop
more efficient algorithms for machine learning tasks, such as classification, clustering, and
regression. These algorithms could potentially provide significant speedup and better
accuracy compared to classical machine learning algorithms.
Quantum machine learning has the potential to revolutionize many industries, including
finance, healthcare, and cybersecurity, by providing faster and more accurate predictions
and insights from large datasets.
https://rebrand.ly/8qenuj3
The code bundle for the book is also hosted on GitHub at
https://github.com/bpbpublications/40-Algorithms-Every-Data-Scientist-Should-Know.
In case there’s an update to the code, it will be updated on the existing GitHub repository.
We have code bundles from our rich catalogue of books and videos available at
https://github.com/bpbpublications. Check them out!
Errata
We take immense pride in our work at BPB Publications and follow best practices to ensure
the accuracy of our content and to provide an engaging reading experience to our
subscribers. Our readers are our mirrors, and we use their inputs to reflect and improve
upon human errors, if any, that may have occurred during the publishing processes
involved. To let us maintain the quality and help us reach out to any readers who might be
having difficulties due to any unforeseen errors, please write to us at:
[email protected]
Your support, suggestions, and feedback are highly appreciated by the BPB Publications
family.
Did you know that BPB offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version at www.bpbonline.com
and, as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at:
[email protected] for more details.
Piracy
If you come across any illegal copies of our works in any form on the internet,
we would be grateful if you would provide us with the location address or
website name. Please contact us at [email protected] with a link to
the material.
Reviews
Please leave a review. Once you have read and used this book, why not leave
a review on the site that you purchased it from? Potential readers can then see
and use your unbiased opinion to make purchase decisions. We at BPB can
understand what you think about our products, and our authors can see your
feedback on their book. Thank you!
https://discord.bpbonline.com
Table of Contents
1. Fundamentals
Introduction
Structure
Objectives
Fundamentals of AI and ML
Defining AI and ML
Artificial Intelligence
Machine learning
History of AI and ML
Classic examples of AI and ML
AI and ML algorithms
Examples of AI and ML algorithms
Structure of a typical AI and ML algorithm
Conclusion
Points to remember
Queues
Trees
Knowledge graph
Conclusion
Points to remember
Exercises
Exercise 1: Data preparation for sentiment analysis
Exercise 2: Data preparation for image classification
Exercises
Linear regression exercise: Predict house prices
Logistic regression exercise: Predicting customer churn
Decision tree exercise: Diagnosing plant diseases
Random forest exercise: Predicting wine quality
SVM exercise: Classifying handwritten digits
Shrinkage
Stopping criteria
Advantages and disadvantages of GBM algorithms
Advantages of GBM algorithms
Disadvantages of GBM algorithms
Real-world applications for GBM algorithms
Real-world GBM coding example
XGBoost
Mathematical foundation
Objective function
Regularization
Taylor expansion for approximation
Optimal leaf weights
Pruning
Handling missing values
Column block and parallelization
Advantages and disadvantages of XGBoost algorithms
Advantages of XGBoost algorithms
Disadvantages of XGBoost algorithms
Real-world applications for XGBoost algorithms
Real-world XGBoost coding example
Conclusion
Points to remember
Exercises and solutions
Naive Bayes exercise: Classifying email messages as spam or not spam
k-NN exercise: Classifying types of flowers based on measurements
Neural Network exercise: Handwritten digit classification with the MNIST dataset
GBM exercise: Predicting house prices with the Boston Housing dataset
XGBoost exercise: Predicting diabetes outcomes with the Pima Indians Diabetes dataset
Conclusion
Points to remember
Exercises and solutions
DBSCAN exercise: Clustering geographical data
GMM exercise: Clustering customer spending data
Autoencoder exercise: Image denoising
Anomaly detection exercise: Detecting fraudulent transactions
LDA exercise: Topic modelling on news articles
Intuition
Risks
Advantages and disadvantages
Real-world applications
Real-world coding example
Scenario
Co-training
Mathematical foundation
Assumptions
The Algorithm
Mathematical justification
Advantages and disadvantages
Real-world applications
Real-world coding example
Multi-view learning
Coding example using co-training
Mathematical foundation
Co-training
Multiple kernel learning
Canonical correlation analysis based methods
Shared and individual feature learning
Joint and individual feature learning
Advantages and disadvantages
Real-world applications
Real-world coding example
Scenario
Expectation-Maximization
Mathematical foundation
Advantages and Disadvantages
Real-world applications
Index
Chapter 1
Fundamentals
Introduction
This chapter covers the fundamentals of artificial intelligence (AI)
and machine learning (ML). We will start by laying out the fundamentals and their
definitions to create a common understanding of the field. We will then define the two
fields and examine their impact on the world inside and outside of AI.
We will also introduce the critical concepts and the kinds of industry problems that can
be solved with AI and ML. We will close the chapter with simple examples that
distinguish an AI/ML application from an AI/ML algorithm.
Structure
The chapter covers the following topics:
• Fundamentals of AI and ML
• Defining AI and ML
• History of AI and ML
o Classic examples of AI and ML
• AI and ML algorithms
o Examples of AI and ML algorithms
o Structure of a typical AI and ML algorithm
Objectives
By the end of this chapter, you will gain a comprehensive understanding of AI and ML as
general concepts and their underlying fundamentals. Additionally, you will learn about
the origins of AI and ML and be exposed to some basic examples. Furthermore, you will
grasp the concept of basic data structures associated with these fields.
Fundamentals of AI and ML
The fundamentals of AI and ML encompass a wide range of concepts and techniques.
Here are some key fundamentals of AI and ML:
• Data: High-quality data is essential for AI and ML. It serves as the foundation for
training and evaluating models. Understanding the data, its quality, structure, and
representation is crucial for successful AI and ML applications.
• Algorithms: Algorithms are mathematical and computational procedures used
to solve specific problems or perform tasks. In AI and ML, algorithms are used
to train models, make predictions, and make decisions based on data. Examples
include decision trees, neural networks, Support Vector Machines (SVM), and
clustering algorithms.
• Feature engineering: Feature engineering involves selecting, transforming, and
creating relevant features from raw data to improve the performance of ML
models. This process helps extract meaningful information and patterns from the
data, making it easier for models to learn and make accurate predictions.
• Model training: Model training is the process of feeding labeled data into an
algorithm or model to learn patterns and relationships. During training, the model
adjusts its internal parameters to minimize the difference between predicted and
actual outputs. This process often involves optimization techniques, such as
gradient descent, to find the best parameter values.
• Model evaluation: Evaluating the performance of ML models is crucial to ensure
their effectiveness and generalization. Various metrics, such as accuracy, precision,
recall, and F1-score, are used to assess the modelʼs predictive capabilities. Cross-
validation techniques, such as k-fold cross-validation, help estimate the modelʼs
performance on unseen data.
• Generalization and overfitting: Generalization refers to a modelʼs ability to
perform well on unseen data. Overfitting occurs when a model becomes overly
complex and performs well on the training data but fails to generalize to new data.
Techniques such as regularization and early stopping are employed to prevent
overfitting and promote better generalization.
• Model deployment: Deploying ML models involves making them available for
use in real-world applications. This includes optimizing the model for efficiency,
scalability, and compatibility with the target environment. The model deployment
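The model training and model evaluation points above can be illustrated with a minimal sketch in plain Python. It is not taken from the book's code bundle: the function names `fit_line_gradient_descent` and `precision_recall_f1`, the learning rate, the step count, and the toy data are all assumptions chosen for illustration. The first function adjusts two parameters by gradient descent to reduce the mean squared difference between predicted and actual outputs; the second computes precision, recall, and F1-score for binary labels.

```python
def fit_line_gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by minimising mean squared error with gradient descent.

    lr and steps are illustrative defaults, not tuned values.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Difference between predicted and actual outputs for every sample.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient to reduce the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1-score for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy training data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = fit_line_gradient_descent(xs, ys)
print("learned parameters:", round(w, 3), round(b, 3))

# Toy classification results (hypothetical labels and predictions).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print("precision, recall, F1:", precision, recall, f1)
```

In practice, a library such as scikit-learn would normally supply both the optimizer and the metrics; the hand-written versions are shown only to make the two ideas concrete.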
Defining AI and ML
Let us now discuss AI and ML in detail.
Artificial Intelligence
Artificial Intelligence is a broad field in computer science that focuses on the creation
of systems capable of performing tasks that would typically require human intelligence.
This includes tasks like understanding natural language, recognizing patterns, solving
problems, learning from experience, and making decisions.
Examples of AI in use today by data scientists include:
• Natural Language Processing (NLP): NLP algorithms are used to create systems
like Siri, Google Assistant, and ChatGPT that
can understand and generate human language.
• Computer vision: Algorithms in this domain are designed to interpret and
understand the visual world. For instance, Facebook uses computer vision AI to
recognize and tag faces in images.
• Recommendation systems: Websites like Amazon and Netflix use AI to recommend
products or movies based on a userʼs past behavior and the behavior of similar
users.
• Predictive analytics: Many industries use AI to predict future outcomes, like
predicting stock prices in finance or disease outbreaks in healthcare.
• Autonomous vehicles: Companies like Tesla use AI to enable cars to navigate and
understand the world around them.
These examples only scratch the surface of AIʼs potential. Its reach is continually expanding,
making it a crucial tool in a modern data scientistʼs arsenal.
Machine learning
Machine learning is a subset of AI that gives computers the ability to learn from data and
make decisions or predictions without being explicitly programmed to do so. This process
involves the development of algorithms that can process large amounts of data, learn
patterns within that data, and use this learned information to predict future outcomes or
behavior. This learning is accomplished by improving the performance of the system over
time as it is exposed to more data.
There are three main types of ML: supervised learning, unsupervised learning, and
reinforcement learning. Each is discussed below:
• Supervised learning involves training a model on a labeled dataset, that is,
a dataset where the outcome or target variable is known. The model learns the
relationship between the features and the target and can then predict the outcome
for new, unseen data. For example, a bank might use supervised learning to
predict whether a loan applicant will default based on their previous loan history
and financial profile.
• Unsupervised learning involves training a model on an unlabeled dataset, that
is, a dataset where the outcome or target variable is not known. The goal is to
discover hidden patterns or intrinsic structures within the data. Common uses
include clustering and dimensionality reduction. For example, a retail company
might use unsupervised learning to segment its customers into different groups
based on their buying behavior.
• Reinforcement learning involves training a model to make a series of decisions
by rewarding or punishing the model (the agent) based on the actions it takes in an
environment to reach a goal. The model learns to perform actions that maximize
some reward over time. This is often used in robotics, gaming, and navigation.
For example, reinforcement learning has been used to train AI to play and win
complex games like Go and Chess.
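The difference between the first two learning types can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not an implementation from the book: the function names, the loan features (a made-up income score and debt ratio), and the spending figures are hypothetical. The supervised part uses a nearest-neighbor rule on labeled examples; the unsupervised part uses a simple one-dimensional k-means loop on unlabeled values.

```python
import math


def nearest_neighbor_predict(train_X, train_y, x):
    """Supervised: return the label of the closest labeled training point."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]


def nearest_center(value, centers):
    """Index of the cluster center closest to value."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


def k_means_1d(values, initial_centers, iters=10):
    """Unsupervised: group unlabeled numbers around k centers (Lloyd's algorithm)."""
    centers = list(initial_centers)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            clusters[nearest_center(v, centers)].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    labels = [nearest_center(v, centers) for v in values]
    return centers, labels


# Supervised: hypothetical applicants as (income score, debt ratio) with known outcomes.
train_X = [(1.0, 0.1), (0.9, 0.2), (0.2, 0.9), (0.1, 0.8)]
train_y = ["repaid", "repaid", "default", "default"]
print(nearest_neighbor_predict(train_X, train_y, (0.95, 0.15)))

# Unsupervised: hypothetical monthly spend per customer, with no labels at all.
spend = [10, 12, 11, 95, 100, 105]
centers, labels = k_means_1d(spend, initial_centers=[min(spend), max(spend)])
print(centers, labels)
```

The supervised function needs `train_y` to make a prediction, while the clustering function receives only raw values and discovers the two spending groups on its own.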
Modern data scientists need to understand these concepts and techniques to build and
deploy effective ML models. Moreover, they often need to use different types of machine
learning in concert, depending on the task at hand. They should also be aware of newer
trends in ML, such as deep learning, transfer learning, and active learning, which have led
to significant advancements in fields like computer vision, natural language processing,
and recommender systems.
History of AI and ML
The development of AI and ML has been an incremental journey spanning several decades.
The evolution of these fields has been influenced by various domains like mathematics,
statistics, computer science, cognitive psychology, and neuroscience, as discussed in the
following points:
• The 1950s - Birth of AI and ML: The birth of AI as a distinct field happened
during a summer conference at Dartmouth College in 1956, which was attended
by pioneers like John McCarthy, Marvin Minsky, Allen Newell, and Herbert Simon.
Here, they proposed that ‘every aspect of learning or any other feature of intelligence
can in principle be so precisely described that a machine can be made to simulate it.’
Even before this, in 1950, Alan Turing introduced the concept of machine intelligence
with the Turing Test, a measure of a machineʼs ability to exhibit intelligent behavior
equivalent to, or indistinguishable from, that of a human.
In 1959, Arthur Samuel developed a program that could play checkers and learn
from its mistakes, marking one of the first self-learning programs and a seminal
moment in ML.
• The 1960s - Growth and consolidation: In the 1960s, AI research focused on
problem-solving and symbolic methods. AI programs like DENDRAL and ELIZA
were developed during this time.
ML saw a significant development in 1967 with the creation of the Nearest
Neighbor algorithm, which laid the groundwork for basic pattern recognition.
• The 1970s - AI Winter and rule-based systems: The mid-1970s marked the
beginning of the first AI Winter, a period of disappointment resulting from the
overhyping of AI capabilities and subsequent cuts in funding. The focus shifted
towards expert systems – rule-based systems that tried to mimic the decision-
making of human experts.
• The 1980s – Revival and ML expansion: In the 1980s, AI saw a revival with the
rise of ML. The popularization of the backpropagation algorithm enabled more
efficient training of neural networks and led to significant
progress in ML.
• The 1990s – AI and ML Maturity: The 1990s saw ML mature into a field of its own,
with the growth of decision tree algorithms, SVM, reinforcement learning, and Bayesian
networks. AI and ML began to be used in practical applications, from data mining
to industrial robotics.
• The 2000s - The data boom and rise of deep learning: The explosion of data in the
2000s, due to the rise of the internet and, later, social media, alongside advancements
in computational power and storage, created the perfect conditions for AI and ML
to flourish. Deep learning, a subset of ML, started to become feasible, driven by the
development of new neural network architectures.
• The 2010s - AI and ML breakthroughs: This decade witnessed rapid progress
in AI and ML. The development of advanced neural network architectures, like
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks
(RNNs), led to breakthroughs in image and speech recognition and natural
language processing.