ML All Units Mca 3rd Semester Anna University


VASAVI VIDYA TRUST GROUP OF INSTITUTIONS, SALEM – 103.

COURSE: II MCA
SUBJECT: MACHINE LEARNING CODE: MC- 4301

SYLLABUS
UNIT- I
INTRODUCTION
Human Learning - Types – Machine Learning - Types - Problems not to be solved - Applications –
Languages / Tools– Issues. Preparing to Model: Introduction - Machine Learning Activities – Types of
data - Exploring structure of data - Data Quality and remediation - Data Pre-Processing.
UNIT - II
MODEL EVALUATION AND FEATURE ENGINEERING
Model Selection - Training Model - Model Representation and Interpretability – Evaluating Performance of
a Model - Improving Performance of a Model - Feature Engineering: Feature Transformation - Feature
Subset Selection.
UNIT - III
BAYESIAN LEARNING
Basic Probability Notation - Inference - Independence - Bayes’ Rule. Bayesian Learning: Maximum
Likelihood and Least Squared error hypothesis - Maximum Likelihood hypotheses for predicting
probabilities - Minimum description Length principle - Bayes optimal classifier - Naive Bayes classifier -
Bayesian Belief networks - EM algorithm.
UNIT – IV
PARAMETRIC MACHINE LEARNING
Logistic Regression: Classification and representation – Cost function – Gradient descent – Advanced
optimization – Regularization - Solving the problems on overfitting. Perceptron – Neural Networks – Multi
– Class Classification - Backpropagation – Non-linearity with activation functions (Tanh, Sigmoid, Relu,
PRelu) - Dropout as regularization.
UNIT - V
NON PARAMETRIC MACHINE LEARNING
k- Nearest Neighbors- Decision Trees – Branching – Greedy Algorithm - Multiple Branches – Continuous
attributes – Pruning. Random Forests: ensemble learning. Boosting – Adaboost algorithm. Support Vector
Machines – Large Margin Intuition – Loss Function - Hinge Loss – SVM Kernels.
REFERENCES
 Ethem Alpaydin, "Introduction to Machine Learning, 3e (Adaptive Computation and Machine Learning Series)", Third Edition, MIT Press, 2014.
 Tom M. Mitchell, "Machine Learning", India Edition, 1st Edition, McGraw-Hill Education Private Limited, 2013.
 Saikat Dutt, Subramanian Chandramouli and Amit Kumar Das, "Machine Learning", 1st Edition, Pearson Education, 2019.
 Christopher M. Bishop, "Pattern Recognition and Machine Learning", Revised Edition, Springer, 2016.
 Aurelien Geron, "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow", 2nd Edition, O'Reilly, 2019.
 Stephen Marsland, "Machine Learning – An Algorithmic Perspective", Second Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014.
Unit : I
Human Learning - Types – Machine Learning - Types - Problems not to be solved - Applications –
Languages / Tools– Issues. Preparing to Model: Introduction - Machine Learning Activities –
Types of data - Exploring structure of data - Data Quality and remediation - Data Pre-Processing.

INTRODUCTION
 Human Learning is the process of obtaining new understanding, knowledge, behaviors, skills, values,
attitudes, and preferences.
 The ability to learn, which is possessed by humans, is called human learning.
 Human Learning Systems is an approach which holds the complexity of the real world, and enables
us to work effectively in that complexity.
 Machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of
a machine to imitate intelligent human behavior.
 Machine Learning is the field of study that gives computers the capability to learn without being
explicitly programmed.
 ML is one of the most exciting technologies that one would have ever come across.
 As is clear from the name, it gives the computer something that makes it more similar to humans: the ability
to learn.
 Machine learning is actively being used today, perhaps in many more places than one would expect.
TYPES OF HUMAN LEARNING:
 Classical Conditioning
 Observational Learning
 Operant Conditioning.
Classical Conditioning:
 Classical conditioning is a type of
learning that happens
unconsciously.
 When you learn through classical
conditioning, an automatic
conditioned response is paired
with a specific stimulus.
 This creates a behavior.
Observational Learning:
 Observational learning is the process of
learning by watching the behaviors of
others.
 The targeted behavior is watched,
memorized, and then mimicked.
 Also known as shaping and modeling,
observational learning is most common in
children as they imitate behaviors of
adults.

Operant Conditioning
 Operant conditioning
sometimes referred to as
instrumental conditioning.
 It is a method of learning that
uses rewards and punishment
to modify behavior.
 Through operant conditioning,
behavior that is rewarded is
likely to be repeated, and
behavior that is punished will
rarely occur.

Types of Machine Learning


 It is classified into,
 Supervised Learning
 Unsupervised Learning
 Reinforcement Learning.

Supervised Learning
 Supervised learning is the type of learning in which machines are trained using well "labelled"
training data, and on the basis of that data, machines predict the output.
 The labelled data means some input data is already tagged with the correct output.
 In supervised learning, the training data provided to the machines work as the supervisor that teaches
the machines to predict the output correctly.
 It applies the same concept as a student learning under the supervision of a teacher.
 Supervised learning is a process of providing input data as well as correct output data to the machine
learning model.
 The aim of a supervised learning algorithm is to find a mapping function to map the input variable(x)
with the output variable(y).
Unsupervised Learning

 Unsupervised learning is a type of machine learning in which models are trained using unlabeled
dataset and are allowed to act on that data without any supervision.
Reinforcement Learning

 Reinforcement Learning is a feedback-based Machine learning technique in which an agent learns to


behave in an environment by performing the actions and seeing the results of actions.
 For each good action, the agent gets positive feedback, and for each bad action, the agent gets
negative feedback or penalty.
 In Reinforcement Learning, the agent learns automatically from feedback without any labeled data,
unlike supervised learning.
 Since there is no labeled data, the agent is bound to learn by its experience only.
 RL solves a specific type of problem where decision making is sequential and the goal is long-term,
such as game-playing, robotics, etc.
TYPES OF MACHINE LEARNING
 Machine learning is a subset of AI, which enables the machine to automatically learn from data,
improve performance from past experiences, and make predictions.
 Machine learning contains a set of algorithms that work on a huge amount of data. Data is fed to
these algorithms to train them, and on the basis of training, they build the model & perform a specific
task.
 These ML algorithms help to solve different business problems like Regression, Classification,
Forecasting, Clustering, and Associations, etc.
 Machine Learning is mainly divided into four types:
 Supervised Machine Learning
 Unsupervised Machine Learning
 Semi-Supervised Machine Learning
 Reinforcement Learning

Supervised Machine Learning


 Supervised machine learning is based on supervision.
 In the supervised learning technique, we train the machines using the "labeled" dataset, and based on
the training, the machine predicts the output.
 The labeled data specifies that some of the inputs are already mapped to the output and then we ask
the machine to predict the output using the test dataset.
 For Example ,
 In supervised learning we provide the training to the machine to understand the images, such as
the shape & size of the tail of cat and dog, Shape of eyes, colour, height (dogs are taller, cats are
smaller), etc.
 After completion of training, we input the picture of a cat and ask the machine to identify the
object and predict the output.
 Now, the machine is well trained, so it will check all the features of the object, such as height,
shape, colour, eyes, ears, tail, etc., and find that it's a cat.
 Finally it will put it in the Cat category.
 This is the process of how the machine identifies the objects in Supervised Learning.
 The main goal of the supervised learning technique is to map the input variable(x) with the output
variable(y).
 Some real-world applications of supervised learning are Risk Assessment, Fraud Detection, Spam
filtering, etc.
 Categories of Supervised Machine Learning
 Supervised machine learning can be classified into two types of problems,
1) Classification
2) Regression
1) Classification
 Classification algorithms are used to solve the classification problems in which the output
variable is categorical, such as "Yes" or No, Male or Female, Red or Blue, etc.
 The classification algorithms predict the categories present in the dataset.
 Some real-world examples of classification algorithms are Spam Detection, Email filtering, etc.
 Some popular classification algorithms are given below:
a) Random Forest Algorithm
b) Decision Tree Algorithm
c) Logistic Regression Algorithm
d) Support Vector Machine Algorithm
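 For illustration, a minimal Python sketch of a classification model (assuming scikit-learn is installed; the tiny height/weight cat-vs-dog data below is made up purely for illustration):

from sklearn.linear_model import LogisticRegression

# Labeled training data: each row is [height_cm, weight_kg]; label 0 = cat, 1 = dog
X_train = [[25, 4], [30, 5], [60, 20], [70, 25]]
y_train = [0, 0, 1, 1]

model = LogisticRegression()
model.fit(X_train, y_train)        # learn the mapping from input (x) to label (y)
print(model.predict([[28, 4]]))    # small animal, expected to be classified as 0 (cat)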

2) Regression
 Regression algorithms are used to solve regression problems in which there is a linear
relationship between input and output variables.
 These are used to predict continuous output variables, such as market trends, weather prediction,
etc.
 Some popular Regression algorithms are given below:
a) Simple Linear Regression Algorithm
b) Multivariate Regression Algorithm
c) Decision Tree Algorithm
d) Lasso Regression
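 Similarly, a minimal regression sketch (assuming scikit-learn; the experience/salary numbers are illustrative only):

from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]     # input variable, e.g. years of experience
y = [30, 35, 40, 45]         # continuous output, e.g. salary in thousands

reg = LinearRegression().fit(X, y)
print(reg.predict([[5]]))    # predicts a continuous value (about 50)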
 Advantages:
 Since supervised learning works with a labeled dataset, we can have an exact idea about the
classes of objects.
 These algorithms are helpful in predicting the output on the basis of prior experience.
 Disadvantages:
 These algorithms are not able to solve complex tasks.
 It may predict the wrong output if the test data is different from the training data.
 It requires lots of computational time to train the algorithm.
APPLICATIONS OF SUPERVISED LEARNING
 Image Segmentation:
 Supervised Learning algorithms are used in image segmentation.
 In this process, image classification is performed on different image data with pre-defined labels.
 Medical Diagnosis:
 Supervised algorithms are also used in the medical field for diagnosis purposes.
 It is done by using medical images and past labelled data with labels for disease conditions.
 With such a process, the machine can identify a disease for the new patients.
 Fraud Detection :
 Supervised Learning classification algorithms are used for identifying fraud transactions, fraud
customers, etc.
 It is done by using historic data to identify the patterns that can lead to possible fraud.
 Spam detection
 In spam detection & filtering, classification algorithms are used.
 These algorithms classify an email as spam or not spam.
 The spam emails are sent to the spam folder.
 Speech Recognition
 Supervised learning algorithms are also used in speech recognition.
 The algorithm is trained with voice data, and various identifications can be done using the same,
such as voice-activated passwords, voice commands, etc.
Unsupervised Machine Learning
 Unsupervised learning is different from the supervised learning technique;
 In unsupervised machine learning, the machine is trained using the unlabeled dataset, and the
machine predicts the output without any supervision.
 In unsupervised learning, the models are trained with the data that is neither classified nor labelled,
and the model acts on that data without any supervision.
 The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset
according to the similarities, patterns, and differences.
 Machines are instructed to find the hidden patterns from the input dataset.
 The machine will discover its patterns and differences, such as color difference, shape difference,
and predict the output when it is tested with the test dataset.
 Categories of Unsupervised Machine Learning
 Unsupervised Learning can be further classified into two types:
Clustering
Association
1) Clustering
 The clustering technique is used when we want to find the inherent groups from the data.
 It is a way to group the objects into a cluster such that the objects with the most similarities
remain in one group and have fewer or no similarities with the objects of other groups.
 An example of the clustering algorithm is grouping the customers by their purchasing behaviour.
 Some of the popular clustering algorithms are given below:
a) K-Means Clustering algorithm
b) Mean-shift algorithm
c) DBSCAN Algorithm
d) Principal Component Analysis
e) Independent Component Analysis
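 As an illustration of clustering customers by their purchasing behaviour, a minimal K-Means sketch (assuming scikit-learn; the two-feature customer data is made up):

from sklearn.cluster import KMeans

# Each row: [annual_spend, visits_per_month] for one customer (hypothetical values)
X = [[200, 2], [220, 3], [1500, 12], [1600, 15], [800, 7]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)    # cluster index assigned to each customer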
2) Association
 Association rule learning is an unsupervised learning technique, which finds interesting relations
among variables within a large dataset.
 The main aim of this learning algorithm is to find the dependency of one data item on another
data item and map those variables accordingly so that it can generate maximum profit.
 This algorithm is mainly applied in Market Basket analysis, Web usage mining, continuous
production, etc.
 Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, and FP-
growth algorithm.
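 To show the idea behind association rules (a hand-rolled sketch, not the full Apriori algorithm), the following computes the support and confidence of one candidate rule on a made-up set of transactions:

# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Candidate rule: {bread} -> {milk}
antecedent, consequent = {"bread"}, {"milk"}
confidence = support(antecedent | consequent) / support(antecedent)
print(support(antecedent | consequent), confidence)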
ADVANTAGES AND DISADVANTAGES OF UNSUPERVISED LEARNING:-
Advantages:
 These algorithms can be used for complicated tasks compared to the supervised ones because these
algorithms work on the unlabeled dataset.
 Unsupervised algorithms are preferable for various tasks, as getting an unlabeled dataset is easier than
getting a labelled dataset.
Disadvantages:
 The output of an unsupervised algorithm can be less accurate as the dataset is not labelled, and
algorithms are not trained with the exact output in prior.
 Working with Unsupervised learning is more difficult as it works with the unlabelled dataset that
does not map with the output.
Applications of Unsupervised Learning
 Network Analysis:
 Unsupervised learning is used for identifying plagiarism and copyright in document network
analysis of text data for scholarly articles.
 Recommendation Systems:
 Recommendation systems widely use unsupervised learning techniques for building
recommendation applications for different web applications and e-commerce websites.
 Anomaly Detection:
 Anomaly detection is a popular application of unsupervised learning, which can identify unusual
data points within the dataset.
 It is used to discover fraudulent transactions.
 Singular Value Decomposition:
 Singular Value Decomposition or SVD is used to extract particular information from the
database.
 For example, extracting information of each user located at a particular location.
Semi-Supervised Learning
 Semi-Supervised learning is a type of Machine Learning algorithm that lies between Supervised and
Unsupervised machine learning.
 It represents the intermediate ground between Supervised (With Labelled training data) and
Unsupervised learning (with no labelled training data) algorithms and uses the combination of
labelled and unlabeled datasets during the training period.
 Although Semi-supervised learning is the middle ground between supervised and unsupervised
learning and operates on the data that consists of a few labels, it mostly consists of unlabeled data.
 As labels are costly, organizations may have only a few labels for corporate purposes.
 It is completely different from supervised and unsupervised learning as they are based on the
presence & absence of labels.
 To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the
concept of Semi-supervised learning is introduced.
 The main aim of semi-supervised learning is to effectively use all the available data, rather than only
labelled data like in supervised learning.
 Initially, similar data is clustered along with an unsupervised learning algorithm, and further, it helps
to label the unlabeled data into labelled data.
 This is because labelled data is comparatively more expensive than unlabeled data.
 For Example.
 Supervised learning is where a student is under the supervision of an instructor at home and
college.
 Further, if that student is self-analyzing the same concept without any help from the instructor, it
comes under unsupervised learning.
 Under semi-supervised learning, the student has to revise himself after analyzing the same
concept under the guidance of an instructor at college.
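 A minimal semi-supervised sketch using scikit-learn's LabelPropagation (an assumed setup; unlabeled samples are marked with -1 and the algorithm infers their labels from the labelled ones):

from sklearn.semi_supervised import LabelPropagation

X = [[1.0], [1.2], [5.0], [5.2], [1.1], [5.1]]
y = [0, 0, 1, 1, -1, -1]          # the last two samples are unlabeled

model = LabelPropagation().fit(X, y)
print(model.transduction_)        # labels inferred for every sample, including the unlabeled ones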
Advantages and Disadvantages of Semi-Supervised Learning
Advantages:
 It is simple and easy to understand the algorithm.
 It is highly efficient.
 It is used to solve drawbacks of Supervised and Unsupervised Learning algorithms.
Disadvantages:
 Iterations results may not be stable.
 We cannot apply these algorithms to Network-Level Data.
 Accuracy is Low.
Reinforcement Learning
 Reinforcement Learning works on a feedback-based process, in which an agent automatically explores its
surroundings by taking actions and learning from experiences, and finally improves its performance.
 The Agent gets rewarded for each good action and gets punished for each bad action.
 The Goal of Reinforcement Learning is to maximize the rewards.
 In Reinforcement Learning, there is No Labeled Data like Supervised Learning.
 The Agents learn from their experiences only.
 The Reinforcement Learning process is similar to human learning.
For example,
A child learns various things through experiences in his day-to-day life.
 An example of Reinforcement Learning is playing a game, where the game is the environment.
 The agent's movements are scored against the goals.
 The agent receives feedback in terms of punishment and rewards.
Categories of Reinforcement Learning
 Two Types of Reinforcement learning are,
1) Positive Reinforcement Learning:
 Positive Reinforcement Learning specifies increasing the tendency that the required behaviour
would occur again by adding something.
 It enhances the strength of the behaviour of the agent and positively impacts it.
2) Negative Reinforcement Learning:
 Negative Reinforcement Learning works exactly Opposite to the Positive Reinforcement
Learning.
 It increases the tendency that the specific behaviour would occur again by avoiding the negative
condition.
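 A toy sketch of reinforcement learning (tabular Q-learning) on a made-up 5-state corridor, where reaching the right end gives positive feedback (+1) and falling back to the left end gives negative feedback (-1); all numbers here are illustrative assumptions:

import random

n_states, actions = 5, [0, 1]              # action 0 = move left, 1 = move right
Q = [[0.0, 0.0] for _ in range(n_states)]  # Q-table: expected reward per (state, action)
alpha, gamma, epsilon = 0.5, 0.9, 0.1      # learning rate, discount factor, exploration rate

for episode in range(200):
    state = 2                              # start in the middle of the corridor
    done = False
    while not done:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = Q[state].index(max(Q[state]))
        next_state = state + 1 if action == 1 else state - 1
        if next_state == n_states - 1:
            reward, done = 1.0, True       # positive reinforcement (reward)
        elif next_state == 0:
            reward, done = -1.0, True      # negative feedback (penalty)
        else:
            reward = 0.0
        # Q-learning update rule
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)    # after training, "right" should score higher than "left" in the middle states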
Real-world Use cases of Reinforcement Learning
 Video Games:
 Reinforcement Learning algorithms are very popular in gaming applications.
 It is used to gain super-human performance.
 Resource Management:
 "Resource Management with Deep Reinforcement Learning" shows how to use Reinforcement
Learning in computers to automatically learn to schedule resources across waiting jobs in
order to minimize average job slowdown.
 Robotics:
 Reinforcement Learning is widely used in Robotics applications.
 Robots are used in the industrial and manufacturing area.
 Robots are made more powerful with Reinforcement Learning.
 There are different industries that have their vision of building Intelligent Robots using AI and
Machine Learning Technology.
 Text Mining
 Text Mining is one of the great applications of NLP.
 Natural Language Processing (NLP) is an upcoming technology that derives various forms of AI,
by creating a faultless as well as interactive interface between humans and machines with the
help of Reinforcement Learning.
Advantages and Disadvantages of Reinforcement Learning
 Advantages
 It helps in solving complex real-world problems which are difficult to be solved by general
techniques.
 The learning model of RL is similar to the learning of human beings; hence most accurate results
can be found.
 Helps in achieving long term results.
 Disadvantage
 Reinforcement Learning algorithms are not preferred for simple problems.
 Reinforcement Learning algorithms require huge data and computations.
 Too much reinforcement learning can lead to an overload of states which can weaken the results.
Problems Not to be Solved by Machine Learning

1) Reasoning Power
 Reasoning power is an area where Machine Learning has not been successful.
 The machine's reasoning power is quite different when compared with human reasoning.
 Algorithms available today are mainly oriented towards specific use-cases and are narrowed
down when it comes to applicability.
 They cannot think as to why a particular method is happening that way or inspect their own
outcomes.
 For instance, if an image recognition algorithm identifies apples and oranges in a given scenario,
it cannot say if the apple (or orange) has gone bad or not, or why is that fruit an apple or orange.
 Mathematically, all of this learning process can be explained by us, but from an algorithmic
perspective, the natural property cannot be told by the algorithms or even us.
 In other words, Machine Learning algorithms require the ability to reason beyond their future
application.
2) Contextual Limitation
 NLP algorithms can understand text and speech information.
 NLP may learn letters, words, sentences or even the syntax, but where it falls short is the
situation and background of the language.
 Algorithms do not understand the context of the language used.
 So, ML does not have an overall idea of the situation.
 It is limited to mnemonic understanding rather than thinking about what is actually going on.
3) Scalability
 With ML implementations being organized on a significant scale, everything depends on data as well as its
scalability.
 Data is growing at a huge rate and in many forms, which largely affects the scalability of an ML
project.
 Algorithms cannot do much about this unless they are constantly updated to handle new changes in
the data.
 This is where ML regularly requires human involvement in terms of scalability, and the issue mostly remains
unsolved.
 In addition, growing data has to be dealt with in the right way if shared on an ML platform, which again
needs examination through knowledge and intuition that current ML lacks.
4) Regulatory Restriction For Data in ML
 ML usually needs considerable amounts (actually, massive) of data in stages such as Training,
Cross-Validation etc.
 The data includes private as well as general information, which adds more complication.
 Most tech companies hold privatized data, and these data are the ones which are actually
useful for ML applications.
 Sometimes this leads to failure or the risk of wrong usage of data, especially in critical areas such
as medical research, health insurance, etc.
 Even when anonymised (not related / identified) at times, the data has the possibility of being weak and
without protection.
 Hence, this is the reason regulatory rules apply heavily when it comes to using private
data.
5) Internal Working Of Deep Learning
 Nowadays, the Deep Learning (DL) powers applications such as voice recognition, image
recognition and so on through artificial neural networks.
 But, the internal working of DL is still unknown and yet to be solved.
 Advanced DL algorithms still confuse researchers in terms of its working and efficiency.
 Millions of neurons that form the neural networks in DL increase abstraction at every level,
which cannot be fully understood.
 This is why deep learning is dubbed a 'black box', since its internal working is unknown.
APPLICATIONS OF MACHINE LEARNING
 Machine learning is growing very
rapidly day by day.
 We are using machine learning in our
daily life even without knowing it such as
Google Maps, Google assistant, Alexa, etc.
 Real-World Applications of Machine
Learning are,
1) Image Recognition:
 Image recognition is one of the most
common applications of machine
learning.
 It is used to identify objects, persons, places, digital images, etc.
 The popular use case of image recognition and face detection is, Automatic friend tagging
suggestion:
 Facebook provides us a feature of auto friend tagging suggestion.
 Whenever we upload a photo with our Facebook friends, we automatically get a tagging
suggestion with their names, and the technology behind this is machine learning's face
detection and recognition algorithm.
 It is based on the Facebook project named "DeepFace," which is responsible for face
recognition and person identification in the picture.
2) Speech Recognition
 While using Google, we get an option of "Search by voice," it comes under speech recognition, and
it's a popular application of machine learning.
 Speech recognition is a process of converting voice instructions into text, and it is also known as
"Speech to text", or "Computer speech recognition."
 At present, machine learning algorithms are widely used by various applications of speech
recognition.
 Google Assistant and Alexa are using speech recognition technology to follow voice
instructions.
3) Traffic Prediction:
 If we want to visit a new place, we take help of Google Maps, which shows us the Correct Path with
the Shortest Route and Predicts the Traffic Conditions.
 It predicts the traffic conditions such as whether Traffic is Cleared, Slow-Moving, or Heavily
Congested with the help of two ways:
 Real-time location of the vehicle from the Google Maps app and sensors
 Average time taken on past days at the same time.
 Everyone who is using Google Map is helping this app to make it better.
 It takes information from the user and sends it back to its database to improve the performance.
4) Product Recommendations:
 Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendation to the user.
 Whenever we search for some product on Amazon, we start getting advertisements for the
same product while surfing the internet on the same browser, and this is because of machine learning.
 Google understands the user interest using various machine learning algorithms and suggests the
product as per customer interest.
 Similarly, when we use Netflix, we find some recommendations for entertainment series, movies,
etc., and this is also done with the help of machine learning.
5) Self-Driving Cars:
 One of the most exciting applications of machine learning is Self-Driving Cars.
 Machine learning plays a significant role in self-driving cars.
 Tesla, the most popular car manufacturing company, is working on self-driving cars.
 It is using an unsupervised learning method to train the car models to detect people and objects while
driving.
6) Email Spam and Malware Filtering:
 Whenever we receive a new email, it is filtered automatically as important, normal, and spam.
 We always receive an important mail in our inbox with the important symbol and spam emails in our
spam box, and the technology behind this is Machine learning.
 Below are some spam filters used by Gmail:
 Content Filter
 Header filter
 General blacklists filter
 Rules-based filters
 Permission filters
 Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naive Bayes
classifier are used for Email Spam Filtering and Malware Detection.
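 A minimal sketch of spam classification with a Naive Bayes classifier (assuming scikit-learn; the tiny e-mail corpus below is invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting at 10 am tomorrow",
          "free lottery win click here", "project report attached"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(emails)              # bag-of-words features

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["claim your free prize"])))    # likely predicted as 1 (spam)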
7) Virtual Personal Assistant:
 We have various virtual personal assistants such as Google Assistant, Alexa, Cortana and Siri.
 As the name suggests, they help us in finding the information using our voice instruction.
 These assistants can help us in various ways just by our voice instructions such as Play music, call
someone, Open an email, Scheduling an appointment, etc.
 These virtual assistants use machine learning algorithms as an important part.
 These assistants record our voice instructions, send them to the server in the cloud, decode them using
ML algorithms, and act accordingly.
8) Online Fraud Detection:
 Machine learning is making our online transactions safe and secure by detecting fraudulent transactions.
 Whenever we perform an online transaction, there may be various ways that a fraudulent
transaction can take place, such as fake accounts, fake IDs, and stealing money in the middle of a
transaction.
 So to detect this, Feed Forward Neural network helps us by checking whether it is a genuine
transaction or a fraud transaction.
 For each genuine transaction, the output is converted into some hash values, and these values
become the input for the next round.
 For each genuine transaction, there is a specific pattern which gets changed for a fraud transaction;
hence, it detects the fraud and makes our online transactions more secure.
9) Stock Market Trading:
 Machine learning is widely used in stock market trading.
 In the stock market, there is always a risk of ups and downs in shares, so for this, machine
learning's long short-term memory (LSTM) neural network is used for the prediction of stock market
trends.
10) Medical Diagnosis:
 In medical science, machine learning is used for disease diagnosis.
 With this, medical technology is growing very fast and able to build 3D models that can predict the
exact position of lesions in the brain.
 It helps in finding brain tumors and other brain-related diseases easily.
11) Automatic Language Translation:
 Nowadays, if we visit a new place and we are not aware of the language then it is not a problem at
all, as for this also machine learning helps us by converting the text into our known languages.
 Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural
machine translation model that translates the text into our familiar language, and this is called Automatic
Translation.
 The technology behind the automatic translation is a sequence to sequence learning algorithm, which
is used with image recognition and translates the text from one language to another language.
LANGUAGES USED IN ML
 Artificial Intelligence is a very
important technology to develop and
build new computer programs and
systems, which can be used to
simulate various intelligence
processes like learning, reasoning,
etc.
 Some of them are Python, R, Lisp, Java, C++, Julia, and Prolog.
1) Python
 Python is one of the most powerful and easy programming languages,
first released in 1991.
 Most of the developers choose Python as their favourite programming language for
developing machine learning solutions.
 Python is a user friendly language.
 Python is a platform-independent language and also provides an extensive framework for Deep
Learning, Machine Learning, and Artificial Intelligence.
 Python is also a portable language as it is used on various platforms such as Linux, Windows, Mac
OS, and UNIX.
 Features
 It is easier to learn than other programming languages.
 It is also a Dynamically-Typed Language.
 Python is an Object-Oriented Language.
 It provides extensive community support and a framework for ML and DL.
 It is Open-Source software.
 Large standard sets of libraries.
 Interpreted language.
 Python is an ideal programming language for Machine Learning, Natural Language
Processing (NLP), neural networks, etc.
2) Java
 Java is the most widely used programming language by all developers
and programmers to develop machine learning solutions.
 Java is also a platform-independent language as it can also be easily
implemented on various platforms.
 Java is an object-oriented and scalable programming language.
 Java allows virtual machine technology that helps to create a single version of the app and provides
support to your business.
 The best thing about Java is once it is written and compiled on one platform, then you do not need to
compile it again and again.
 This is known as the WORA (Write Once, Run Anywhere) principle.
 Features of Java
 Portability
 Cross-platform.
 Easy to learn and use.
 Easy-to-code Algorithms.
 Built-in garbage collector.
 Standard Widget Toolkit.
 Simplified work with large-scale projects.
 Better user interaction.
 Easy to debug.
3) Prolog
 Prolog is one of the oldest programming languages used for Machine Learning solutions.
 Prolog stands for "Programming in Logic".
 It was developed by the French scientist Alain Colmerauer in the early 1970s.
 Prolog is a declarative language rather than imperative.
 Features of Prolog
 Supports basic mechanisms such as
Pattern Matching,
Tree-based data structuring, and automatic backtracking.
4) Lisp
 Lisp is widely used for scientific research in the fields of natural languages, theorem proofs, and to
solve artificial intelligence problems.
 Lisp was originally created as a practical mathematical notation for programs.
 Lisp programming language is the Second Oldest Language after
FORTRAN; it is still being used because of its Crucial Features.
 LISP programming was invented by John McCarthy.
 LISP is one of the most efficient programming languages for solving specific problems.
 It is mainly used for machine learning and logic problems.
 It has also influenced the creation of other programming languages for AI, like R and Julia.
 It is highly flexible, but it is not user friendly and also lacks well-known libraries, familiar syntax, etc.
 For this reason, it is not preferred by programmers.
 Features of LISP
 The program can be easily modified, similar to data.
 Make use of recursion for control structure rather than iteration.
 Garbage collection is performed automatically.
 We can easily execute data structures as programs.
 An object can be created dynamically.
5) R
 R is one of the great languages for statistical processing in programming.
 R supports free, open-source programming language for data analysis
purposes.
 It may not be the perfect language for AI, but it provides great performance while dealing with large
numbers.
 R contains several packages that are specially designed for AI, which are:
 Gmodels:
This package provides different tools for the Model Fitting Task.
 TM :
It is a great framework that is used for text mining applications.
 RODBC:
It is an ODBC interface.
 OneR:
This package is used to implement the One Rule Machine Learning classification
algorithm.
 Features of R programming
 R is an open-source programming language, which is free of cost, and also you can add packages
for other functionalities.
 R provides strong & interactive graphics capability to users.
 It enables you to perform complex statistical calculations.
 It is widely used in machine learning and AI due to its high-performance capabilities.
6) Julia
 Julia is one of the newer languages on the list and was created to focus on performance computing in
scientific and technical fields.
 Julia includes several features that directly apply to AI programming.
 Julia is a comparatively new language, which is mainly suited for
numerical analysis and computational science.
 Features of Julia
 Common numeric data types.
 Arbitrary precision values.
 Robust mathematical functions.
 Built-in package manager.
 Dynamic type system.
 Ability to work for both parallel and distributed computing.
 Macros and Meta programming capabilities.
 Support for multiple dispatches.
 Support for C functions.
7) C++
 The C++ language has been around for a long time, but it is still a top
and popular programming language among developers.
 It provides better handling for AI models while developing.
 Although C++ may not be the first choice of developers for AI
programming, various machine learning and deep learning libraries are written in the C++ language.
 Features of C++
 C++ is one of the fastest languages, and it can be used in statistical techniques.
 It can be used with ML algorithms for fast execution.
 Most of the libraries and packages available for Machine learning and AI are written in C++.
 It is a user friendly and simple language.
MACHINE LEARNING TOOLS
 Machine learning is one of the most revolutionary technologies that are making lives simpler.
 It is a subfield of Artificial Intelligence, which analyses the data, builds the model, and makes
predictions.
 There are different tools, software, and platform available for machine learning and also new
software and tools are evolving day by day.
 Among the many machine learning tools, choosing the best tool for your model is a challenging task.
 If you choose the right tool for your model, you can make it faster and more efficient.

1) TensorFlow
 Tensor Flow is one of the most popular open-source libraries used to train and
build both machine learning and deep learning models.
 It is developed by Google Brain Team.
 It is much popular among machine learning enthusiasts, and they use it for
building different ML applications.
 It offers a powerful library, tools, and resources for numerical computation, specifically for large
scale machine learning and deep learning projects.
 It enables data scientists/ML developers to build and organize machine learning applications
efficiently.
 Features:
 Tensor Flow enables us to build and train our ML models easily.
 It also enables you to run the existing models using the TensorFlow.js
 It provides multiple abstraction levels that allow the user to select the correct resource as per the
requirement.
 It helps in building a neural network.
 This is open-source software and highly flexible.
 It also enables the developers to perform numerical computations using data flow graphs.
 It runs on GPUs and CPUs, and also on various mobile computing platforms.
 GPUs can process many pieces of data simultaneously, making them useful for machine learning,
video editing, and gaming applications.
 It enables us to easily organize and train the model in the cloud.
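 A minimal sketch of building and training a small neural network with TensorFlow's Keras API (assuming TensorFlow 2.x is installed; the XOR-style toy data is illustrative only):

import tensorflow as tf

X = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]
y = [0, 1, 1, 0]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=100, verbose=0)     # train on the tiny dataset
print(model.predict(X))                    # probabilities that should approach the XOR labels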
2) PyTorch
 PyTorch is an open-source machine learning framework, which is
based on the Torch library.
 This framework is free and open-source, developed by FAIR (Facebook's AI Research lab).
 It is one of the popular ML frameworks, which can be used for various applications, including
computer vision and natural language processing.
 PyTorch has Python and C++ interfaces; however, the Python interface is more interactive.
 Different deep learning software is built on top of PyTorch, such as PyTorch Lightning, Hugging
Face's Transformers (which provides state-of-the-art models), Tesla Autopilot, etc.
 It supports GPU.
 Features:
 It is more suitable for deep learning research with good speed and flexibility.
 It can also be used on cloud platforms.
 It includes tutorial courses, various tools, and libraries.
 It also provides a dynamic computational graph that makes this library more popular.
 It allows changing the network behaviour randomly without any delay.
 It is freely available.
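 A minimal PyTorch sketch (assuming the torch package is installed; the y = 2x toy data is made up) showing the typical define/forward/backward/update loop:

import torch
import torch.nn as nn

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

model = nn.Linear(1, 1)                            # a single linear layer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for _ in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)                    # forward pass and loss
    loss.backward()                                # backpropagation
    optimizer.step()                               # update the weights

print(model(torch.tensor([[5.0]])))                # should be close to 10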
3) Google Cloud ML Engine
 It is a hosted platform where ML developers and data scientists build and run optimum quality
machine learning models.
 It provides a managed service that allows developers to easily create ML
models with any type of data and of any size.
 Features:
 Provides machine learning model training, building, deep learning and
predictive modeling.
 The Two Services, namely, Prediction and Training, can be used independently or combinedly.
 It can be used by enterprises, i.e., for identifying clouds in a satellite image, responding faster to
emails of customers.
 It can be widely used to train a complex model.
4) Amazon Machine Learning (AML)
 Amazon Machine Learning (AML) is a cloud-based and robust machine
learning software application, which is widely used for building machine
learning models and making predictions.
 It integrates data from multiple sources, including Amazon Redshift, Amazon S3, or
RDS.
 Features
 Enables the users to identify the patterns, build mathematical models, and
make predictions.
 It provides support for three types of models,
Multi-Class Classification
Binary Classification
Regression.
 It permits users to import the model into or export the model out from Amazon Machine
Learning.
 It also provides core concepts of machine learning, including ML models, Data sources,
Evaluations, Real-time predictions and Batch predictions.
 It enables the user to retrieve predictions with the help of batch APIs for bulk requests or real-
time APIs for individual requests.
5) Google ML kit for Mobile
 For Mobile app developers, Google brings ML Kit, which is packaged
with the expertise of machine learning and technology to create more
robust, optimized, and personalized apps.
 This tools kit can be used for face detection, text recognition, landmark
detection, image labeling, and barcode scanning applications.
 Features:
 The ML kit is optimized for mobile.
 It provides easy-to-use APIs that enables powerful use cases in your mobile apps.
 It includes Vision API and Natural Language APIS to detect faces, text, and objects, and identify
different languages & provide reply suggestions.

ISSUES IN MACHINE LEARNING

 Some common issues in Machine Learning are,


1) Inadequate Training Data
 The major issue that comes while using machine
learning algorithms is the lack of quality as well as
quantity of data [inadequate (insufficient) data, noisy (meaningless) data, and unclean (no
information) data].
 For Example, a simple task requires thousands of sample data, and an advanced task such as speech
or image recognition needs millions of sample data examples.
 Further, Data Quality is also important for the algorithms to work perfectly, but the absence of data
quality is also found in Machine Learning applications.
 Data Quality can be affected by some factors as follows:
 Noisy Data
It is responsible for an inaccurate prediction that affects the decision as well as accuracy
in classification tasks.
 Incorrect Data
It is also responsible for faulty programming and results obtained in machine learning
models.
Hence, incorrect data may affect the accuracy of the results also.
 Generalizing of Output Data
Generalizing output data becomes complex, which results in comparatively poor future
actions.

2) Poor Quality of Data


 Noisy data, incomplete data, inaccurate data, and unclean data lead to less accuracy in classification
and low-quality results.
 Hence, data quality can also be considered as a major common problem while processing machine
learning algorithms.
3) Non-Representative Training Data
 A machine learning model is said to be ideal if it predicts well for generalized cases and provides
accurate decisions.
 If there is less training data, then there will be a sampling noise in the model, called the Non-
Representative Training Set (won't be accurate in predictions).
 Hence, we should use representative data in training to protect against bias and make
accurate predictions without any flaw.
4) OVERFITTING AND UNDERFITTING
Overfitting:
 Overfitting is one of the most common issues faced by Machine Learning engineers and data
scientists. Whenever a machine learning model is trained with a huge amount of data, it starts
capturing noise and inaccurate data into the training data set.
 It negatively affects the performance of the model.
 Methods to Reduce Overfitting :
 Increase training data in a dataset.
 Reduce model complexity by simplifying the model by selecting one with less parameters
 Early stopping during the training phase.
 Reduce the noise.
 Reduce the number of attributes in training data.
Underfitting:
 Underfitting is just the opposite of overfitting.
 Whenever a machine learning model is trained with a smaller amount of data, it
provides incomplete and inaccurate results and destroys the accuracy of the machine learning model.

 Methods to Reduce Underfitting:


 Increase model complexity
 Remove noise from the data
 Trained on increased and better features
 Reduce the constraints
 Increase the number of training iterations (epochs) to get better results.
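 A small sketch (assuming scikit-learn and numpy) comparing an underfit, a well-fit, and an overfit polynomial model by their training and test errors on noisy synthetic data:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)      # noisy target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):          # degree 1 underfits, 4 fits well, 15 tends to overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),   # training error
          mean_squared_error(y_test, model.predict(X_test)))     # test error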
5) Monitoring and Maintenance
 Regular monitoring and maintenance become compulsory when the model gives different results for different actions.
 To overcome this, the data must be updated regularly, along with editing of code as well as resources.
6) Getting Bad Recommendations
 A machine learning model operates under a specific context, which can result in bad recommendations
and concept drift in the model.
 For example, at a specific time a customer is looking for some gadgets, but the customer's requirements
change over time while the machine learning model keeps showing the same recommendations,
even though the customer's expectations have changed. This incident is called Data Drift.
 This is overcome by regularly updating and monitoring data according to the expectations.
7) Lack of Skilled Resources
 The absence of skilled resources in the form of Manpower is also an issue.
 Hence, we need manpower having in-depth knowledge of mathematics, science, and technologies for
developing and managing scientific substances for machine learning.
8) Customer Segmentation
 Customer segmentation is also an important issue while developing a machine learning algorithm.
 It is difficult to identify the customers who act on the recommendations shown by the model and those who don't
even check them.
 Hence, an algorithm is necessary to recognize the customer behavior and trigger a relevant
recommendation for the user based on past experience.
9) Process Complexity of Machine Learning
 The machine learning process is very complex, which is also another major issue, faced by Machine
Learning Engineers and Data Scientists.
 Machine Learning and Artificial Intelligence are very new technologies that are still in an
experimental phase and continuously changing over time.
 The majority of work involves hit-and-trial experiments; hence the probability of error is higher than
expected.
 Further, it also includes analyzing the data, removing data bias, training data, applying complex
mathematical calculations, etc., making the procedure more complicated and quite tedious.
10) Data Bias
 Data bias is also found to be a big challenge in Machine Learning.
 These errors exist when certain elements of the dataset are heavily weighted or given more
importance than others.
 Biased data leads to inaccurate results, skewed outcomes, and other analytical errors.
 Methods to remove Data Bias:
 Research more for customer segmentation.
 Be aware of your general use cases and potential outliers.
 Combine inputs from multiple sources to ensure data variety.
 Include bias testing in the development process.
 Analyze data regularly and keep tracking errors to resolve them easily.
 Use multi-pass annotation such as sentiment analysis [to make positive to negative], content
moderation [monitoring and following rules], and goal recognition.
11) Lack of Explainability
 The outputs cannot be easily understood, as the model is programmed in specific ways to deliver for certain
conditions. Hence, a lack of explainability is also found in machine learning algorithms, which
reduces the credibility of the algorithms.
12) Slow Implementations and Results
 Slow programming, excessive requirements and overloaded data take more time to provide accurate
results than expected.
 This needs continuous maintenance and monitoring of the model for delivering accurate results.
13) Irrelevant Features
 Machine learning models are designed to give the best possible outcome, but if we feed garbage data as
input, then the result will also be garbage.
 Hence, we should use relevant features in our training sample.
 A machine learning model is said to be good if the training data has a good set of features and few to no
irrelevant features.
PREPARING TO MODEL
Introduction:
 A Machine Learning Model is a file that has been trained to recognize certain types of patterns.
 We train a model over a set of data and provide it with an algorithm that it can use to reason over and
learn from those data.
 Once you have trained the model, you can use it to reason over data that it hasn't seen before, and
make predictions about those data.
 For Example,
 We build an application that can recognize a user's
emotions based on their facial expressions.
 You can train a model by providing it with images of
faces that are each tagged with a certain emotion, and
then you can use that model in an application that can recognize any user's emotion. [Emoji]
MACHINE LEARNING ACTIVITIES
 The first step of the complete process is to understand the
problem, because a good result depends on a better
understanding of the problem and knowing its purpose.
 To solve a problem, we create a machine learning system
called "Model", and this model is created by providing
"Training".
 To train a model, we need data; hence, life cycle starts by
collecting data.
1) Gathering Data:
 Data gathering is the first step in preparing a machine learning model.
 The goal of this step is to identify and obtain all data related to the problem.
 To identify the different data sources, as data can be collected from various sources such
as Files, Database, Internet, Mobile Devices …
 The quantity and quality of the collected data will determine the efficiency of the output.
 The more will be the data, the more accurate will be the prediction.
 This step includes:
 Identify Various Data Sources.
 Collect data {Set of data, also called as a Dataset}
 Integrate the data obtained from different sources
2) Data Preparation
 After collecting the data, data preparation is a step to place our data into a suitable place and prepare
it to use in our machine learning training.
 It is divided into two processes:
Data Exploration:
 It is used to understand or examine the nature of data.
 It is used to understand the characteristics [Correlations, general trends, outliers…], format,
and quality of data.
 A better understanding of data leads to an effective outcome.
Data Pre-Processing:
 The data is pre-processed so that it is ready for analysis.
3) Data Wrangling
 Data wrangling is the process of cleaning and converting raw data into a useable format.
 It is the process of cleaning the data, selecting the variable to use, and transforming the data in a
proper format to make it more suitable for analysis in the next step.
 Cleaning of data is required to address quality issues, including Missing Values, Duplicate data,
Invalid data…
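 A minimal data-wrangling sketch with pandas (assumed installed), handling a missing value, a duplicate row and an out-of-range entry in a made-up dataset:

import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 32, 32, 41],            # None marks a missing value
    "salary": [30000, 42000, 35000, 35000, None],
})

df = df.drop_duplicates()                                    # remove the duplicate row
df = df[df["age"].between(0, 120)]                           # keep valid ages (drops the missing one)
df["salary"] = df["salary"].fillna(df["salary"].median())    # fill the missing salary
print(df)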
4) Data Analysis
 The cleaned and prepared data is passed on to this stage.
 The aim is to build a machine learning model to analyze the data using various analytical techniques
and review the outcome.
 Selection of analytical techniques
 Building models
 Review the result
 This step determines the type of problem, such as Classification, Regression, Cluster
analysis, Association, etc.; the model is then built using the prepared data and evaluated.
5) Train Model
 Training the model is done to improve its performance for a better outcome of the problem.
 Datasets are used to train the model using various machine learning algorithms.
 Training a model is required to understand the various patterns, rules, and, features.
6) Test Model
 The model is tested for accuracy by providing a test dataset to it.
 Testing the model determines the percentage accuracy of the model as per the requirement of project
or problem.
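 A minimal sketch of the train and test steps (assuming scikit-learn): split a dataset, train a model on the training part, and measure the percentage accuracy on the held-out test part:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)       # train step
accuracy = accuracy_score(y_test, model.predict(X_test))     # test step
print(f"Test accuracy: {accuracy:.2%}")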
7) Deployment
 If the above-prepared model produces an accurate result as per our requirement with acceptable
speed, then we deploy the model in the real system.
TYPES OF DATA
 Data is defined as a set of facts
and figures which can be used
to serve a specific purpose.
 Data can be used as a survey or
an analysis.
 Data in a systematic and
organized form is referred to as
information.

1) Qualitative [or] Categorical Data
 Qualitative data, also known as the categorical data, describes the data that fits into the categories.
 Qualitative data are not numerical.
 The categorical information involves categorical variables that describe the features such as a
person’s gender, home town etc.
 Categorical measures are defined in terms of natural language specifications, but not in terms of
numbers.
 Sometimes categorical data can hold numerical values (quantitative value), but those values do not
have a mathematical sense.
 For Example,
 The Categorical Data are Birthdates, Favourite Sport, and School Postcode.
 The birthdates and school postcodes hold quantitative values, but they do not give numerical
meaning.
a) Nominal Data
 Nominal data is one of the types of qualitative information which helps to label the variables
without providing the numerical value.
 Nominal Data is also called the Nominal Scale.
 It cannot be ordered and measured. But sometimes, the data can be qualitative and quantitative.
 For Example,
Nominal data are Letters, Symbols, Words, and Gender etc.
 The nominal data are examined using the grouping method.
 In this method, the data are grouped into categories, and then the frequency or the percentage of
the data can be calculated.
 These data are represented using the Pie Charts.
b) Ordinal Data
 Ordinal data / variable are a type of data that follows a natural order.
 Unlike nominal data, ordinal data has a defined order, but the difference between the data values is not determined.
 This variable is mostly found in surveys, finance, economics, questionnaires, and so on.
 The ordinal data is represented using a Bar Chart.
 The information may be expressed using tables in which each row in the table shows the distinct
category.
2) Quantitative [or] Numerical Data
 Quantitative data is also known as Numerical Data which represents the numerical value (i.e., how
much, how often, how many).
 Numerical data gives information about the quantities of a specific thing.
 For Example,
Numerical Data are Height, Length, Size, Weight, and so on.
 It is classified into two types based on the data sets.
Discrete Data
Continuous Data.
a) Discrete Data
 Discrete data can take only discrete values.
 Discrete information contains only a finite number of possible values.
 Those values cannot be subdivided meaningfully.
 Here, things can be counted in whole numbers.
 For Example:
Number of students in the class
b) Continuous Data
 Continuous data is data that can be calculated.
 It has an infinite number of probable values that can be selected within a given specific range.
 For Example:
Temperature Range
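 A small pandas sketch (assumed installed) distinguishing qualitative (nominal, ordinal) and quantitative (discrete, continuous) columns in a made-up table:

import pandas as pd

df = pd.DataFrame({
    "gender":   ["M", "F", "F"],             # nominal (categorical, no order)
    "grade":    ["low", "high", "medium"],   # ordinal (categorical with a natural order)
    "children": [2, 0, 1],                   # discrete numerical (countable)
    "height":   [172.5, 160.2, 181.0],       # continuous numerical (measurable)
})

df["grade"] = pd.Categorical(df["grade"], categories=["low", "medium", "high"], ordered=True)
print(df.dtypes)    # object / category for qualitative columns, int64 / float64 for quantitative ones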
STRUCTURE OF DATA
 The data structure is defined as the basic building block of computer programming that helps us to
organize, manage and store data for efficient search and retrieval.
 In other words, the data structure is the collection of data type 'values' which are stored and
organized in such a way that it allows for efficient access and modification.
TYPES OF DATA STRUCTURE
 The data structure is the ordered sequence of data, and it tells the compiler how a programmer is
using the data such as Integer, String, Boolean, etc.
 There are two different types of data structures:
1. Linear
2. Non-Linear Data Structures.

Linear Data structure:


 The linear data structure is a special type of data structure that helps to organize and manage data in
a specific order where the elements are attached adjacently.
 There are mainly 4 types of linear data structure as follows:
1) Array:
 An array is one of the most basic and common data structures used in Machine Learning.
 It is also used to solve complex mathematical problems.
 For example, it is used to convert the column of a data frame into a list format in pre-processing analysis,
or to hold a list of tokenized words to begin clustering topics.
 An array contains index numbers to represent an element starting from 0.
 The lowest index is arr[0] and corresponds to the first element.
2) Stacks:
 Stacks are based on the concept of LIFO (Last in First out) .
 It is used for binary classification in deep learning.
 Stacks are easy to learn and implement in ML models.
 Stacks allow addition and removal of elements only at the top of the stack.
3) Linked List:
 A linked list is the type of
collection of data elements that
consist of a value and pointer
that point to the next node in the
list.
 In a linked list, insertion and deletion are constant time operations and are very efficient, but
accessing a value is slow and often requires scanning.
 Unlike a dynamic array, where shifting of elements at the head, middle or
tail position is relatively cost consuming, a linked list does not require such shifting.
4) Queue:
 A Queue is defined as the "FIFO" (first in, first out).
 For Example,
It is useful to predict a queuing scenario in real-time programs, such as people waiting in line
to withdraw cash in the bank.
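A minimal Python sketch of these linear structures (the values used below are only illustrative, not from the text):

from collections import deque

# Array: indexed elements, starting from index 0
arr = [10, 20, 30, 40]
print(arr[0])            # first element, arr[0]

# Stack: LIFO - addition and removal happen only at the top
stack = []
stack.append('a')        # push
stack.append('b')
print(stack.pop())       # 'b' comes out first (Last In, First Out)

# Queue: FIFO - the first element added is the first removed
queue = deque()
queue.append('customer1')
queue.append('customer2')
print(queue.popleft())   # 'customer1' is served first (First In, First Out)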
Non-Linear Data Structures
 In a non-linear data structure, elements are not arranged in any sequence.
 All the elements are arranged and linked with each other in a hierarchical manner, where one element
can be linked with one or more elements.
1) Graphs
 A graph data structure is also very useful in machine learning for link prediction.
 A graph consists of nodes connected by directed or undirected edges, i.e. ordered or unordered pairs
of nodes.
2) Trees - Binary Tree:
 In a binary tree, each node has two pointers to subsequent (child) nodes instead of just one.
 In a binary tree, the value of the left child node is always less than the value of the parent node,
while the value of the right child node is always greater than the parent node.
 Because the data is kept sorted in this way, insertion and deletion operations are easy and efficient.
DATA QUALITY AND REMEDIATION


DATA QUALITY
 Data Quality refers to the development and implementation of activities that apply Quality
Management Techniques to data in order to ensure the data is fit to serve the specific needs of an
organization in a particular context.
 The Elements of Data Quality are ,
1) Consistency
2) Accuracy
3) Completeness
4) Auditability
5) Validity
6) Uniqueness
7) Timeliness
1) Consistency
 Data has No Inconsistency in your databases.
 This means that if two values are examined from separate data sets, they will match or align.
 For Example,
The budget amount for a specific department needs to be consistent across the organization.
 Examples of Consistency Metrics:
Range
Variance
Standard Deviation
2) Accuracy
 Data is Error-Free and Exact.
 Accuracy is when a measured value matches the Actual (True) value and it contains No
Mistakes, such as Outdated Information, Redundancies, and Errors.
 The main aim is to increase the accuracy of your data continually, even the datasets grow in size.
 Examples of Accuracy Metrics:
Error Ratio
Deviation
3) Completeness
 Data Records are “full” and contain enough information.
 Tracking this data quality metric involves finding any fields that contain Missing or Incomplete
Values.
 All data entries must be complete in order to compose a High Quality Data Set.
 Examples of Completeness Metrics:
Percentage of data records that contain all needed information
4) Auditability
 Data is accessible and changes are traceable.
 Data Quality determines tracking the percentage of fields where you cannot determine what and
when edits were made, and by whom.
 Examples of Auditability Metrics:
Percentage of Gaps in the Data Set
Percentage of Altered Data
Percentage of Disassociated Data
Percentage of Untraceable Data
5) Validity
 Data Validity exists in the same and correct format everywhere they appear and it is also called
as Data Integrity.
 Examples of Validity Metrics:
Percentage of data records where all values are in the required format.
6) Uniqueness
 Data will be recorded NO more than once.
 Tracking this metric helps organization to identify and avoid double data entry.
 Examples of Uniqueness Metrics:
Number or percentage of repeated values.
7) Timeliness
 Data is available and up to date when it is needed.
 It's important to collect data in a timely manner in order to effectively track changes.
 Examples of Timeliness Metrics:
Time Variance
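A minimal pandas sketch of how a few of these quality metrics could be computed; the small DataFrame is a made-up example, not data from the text:

import pandas as pd

# Hypothetical dataset used only to illustrate the metrics above
df = pd.DataFrame({
    "budget": [1000.0, 1000.0, None, 950.0],
    "department": ["Sales", "Sales", "HR", "HR"],
})

# Completeness: percentage of records that contain all needed information
completeness = df.dropna().shape[0] / df.shape[0] * 100

# Uniqueness: number of repeated (duplicate) records
duplicates = df.duplicated().sum()

# Consistency-style metrics: range and standard deviation of a numeric field
budget_range = df["budget"].max() - df["budget"].min()
budget_std = df["budget"].std()

print(completeness, duplicates, budget_range, budget_std)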
DATA REMEDIATION
 It is the process of Cleansing, Organizing and Migrating Data so that it's properly protected and
best serves to its determined purpose.
 The key word “Remediation” derives from the word “Remedy,” which is to correct a mistake.
 Data Remediation is a process that involves Replacing, Modifying, Cleansing or Deleting any
“Dirty” Data.
Data Remediation - Terminology
1) Data Migration
 The process of moving data between two or more systems, data formats or servers.
2) Data Discovery
 A manual or automated process of searching for patterns in data sets to identify structured and
unstructured data in an organization’s systems.
3) ROT
 An acronym that stands for Redundant, Obsolete [Outdated] and Trivial data.
 ROT data has passed its recommended retention period and is no longer useful to
an organization.
4) Dark Data
 Any information that businesses collect, process and store, but do not use for other purposes.
 For Example,
It includes Customer Call Records, Raw Survey Data or Email Correspondences.
 Storing and securing these types of data results in more expense, and sometimes even greater risk,
than value.
5) Dirty Data
 Data that damages the integrity of the organization’s complete dataset.
 This can include data that is unnecessarily duplicated, outdated, incomplete or inaccurate.
6) Data Overload
 This is when an organization has acquired too much data, including Low-Quality or Dark Data.
 In Data overloading, it is difficult to Identify, Classify and Remediation of Data.
7) Data Cleansing
 Transforming data in its native state to a predefined standardized format.
8) Data Governance [Rules]
 Management of the Availability, Usability, Integrity and Security of the data stored within an
organization.
Stages of Data Remediation
 Data remediation is a process that involves several stages to effectively resolve unclean data.
1) Assessment
 Before refining data, first completely understand it.
 Identifying the quantity and type of data involved is key to a successful data remediation.
2) Organizing and Segmentation
 Organizing and creating segments based on the information’s purpose is critical during the data
remediation process.
 Accessibility is a big factor to consider when it comes to segmenting data.
 Data that needs to be easily accessed by team members for day-to-day tasks, and then there’s
data that needs to have higher security measures for legal or regulatory purposes.
 For Sensitive Data that has greater privacy requirements, organizations will probably want to
separate that data and store it on another platform with advanced security features.
3) Indexation and Classification
 Once your data is segmented, you can move on to indexing and classification.
 In this step, organizations will focus on segments containing Non-ROT data and classify the level
of sensitivity of this remaining data.
 Regulated data like Personally Identifiable Information (PII), Personal Health Information (PHI)
and Financial Information will need to be classified with the company’s terminology for the
highest degree of sensitivity.
 “Restricted data” is a common sensitive data classification depending on its level of sensitivity.
4) Migrating
 When the goal is to consolidate data into a new, cleansed storage environment, migration is an
essential step in the data remediation process.
5) Data Cleansing
 The final cleansing task applied to the data depends on which segmentation group it falls under and
its classification; it includes slicing, analyzing, placing, and preventing execution, in order to clean
up the data.

DATA PRE-PROCESSING

 Data Preprocessing is a process of preparing the raw data and making it suitable for a machine
learning model.
 It is the first and crucial step while creating a machine learning model.
 When creating a machine learning project, operation with data, it is mandatory to clean [noisy,
missing values and unusable format] and put it in a proper format.
 Data preprocessing is required tasks for cleaning the data and making it suitable for a machine
learning model which also increases the accuracy and efficiency of a machine learning model.
 It involves below steps:
1) Getting the Dataset
2) Importing Libraries
3) Importing Datasets
4) Finding Missing Data
5) Encoding Categorical Data
6) Splitting Dataset into Training and Test Set
7) Feature Scaling
1) Get the Dataset
 To create a machine learning model, the collected data for a particular problem in a proper
format is known as the Dataset.
 For Example,
Each dataset is different from another dataset.
To use the dataset in our code, we usually put it into a CSV file.
Sometimes, we need to use an HTML or xlsx file.
Dataset may be of different formats for different purposes, such as, if we want to create a
machine learning model for business purpose, then dataset will be different with the dataset
required for a patient.
2) Importing Libraries
 To perform data preprocessing using Python, we need to import some predefined Python
libraries.
 These libraries are used to perform some specific jobs.
 There are three specific libraries that we will use for data preprocessing, which are:
Numpy:
 It includes mathematical operations; it is the fundamental package for scientific calculation in Python.
Matplotlib:
 It is a Python 2D plotting library, used to plot any type of chart.
Pandas:
 It is used for importing and managing the datasets.
 It is an open-source data manipulation and analysis library.
3) Importing the Datasets
 To import the datasets which we have collected for our machine learning project.
 But before importing a dataset, we have to set the current directory as a working directory.
 The following steps are used to set a working directory in Spyder IDE.
Save your Python file in the directory that contains the dataset.
Go to File explorer option in Spyder IDE, and select the required directory.
Click on F5 button or run option to execute the file.
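A minimal sketch of steps 2 and 3 in Python, assuming a hypothetical file name data.csv and a label stored in the last column:

# Importing the three standard pre-processing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset (the file name 'data.csv' is only an example)
dataset = pd.read_csv('data.csv')

# Separating features (independent variables) and the label (dependent variable),
# assuming the label is stored in the last column
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values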
4) Handling Missing data:
 The next step of data preprocessing is to handle missing data in the datasets.
 If our dataset contains some missing data, then it may create a huge problem for our machine
learning model.
 Two Different Ways to handle Missing Data:
A) By Deleting the Particular Row:
This stage commonly deals with null values.
We can delete the specific row or column that consists of null values, but it is not an efficient way,
and removing data may lead to loss of information which will not give the accurate output.
B) By Calculating the Mean:
To solve the above problem, calculate the mean of that column or row which contains any
missing value and put it in the place of the missing value.
This strategy is useful for the features which have numeric data such as age, salary, year, etc.
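A minimal sketch of mean imputation with scikit-learn's SimpleImputer; the small matrix below is only illustrative:

import numpy as np
from sklearn.impute import SimpleImputer

# Small illustrative matrix with a missing age and a missing salary
x = np.array([[25.0, 50000.0],
              [np.nan, 60000.0],
              [35.0, np.nan]])

# Replace each missing value by the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x = imputer.fit_transform(x)
print(x)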
5) Encoding Categorical Data:
 In a dataset, categorical data is data which has some categories, such as Country and Purchased.
 A machine learning model works on mathematics and numbers, so it is difficult to build a model
with categorical (text) values.
 To solve this, it is necessary to encode these categorical variables into numbers.
6) Splitting the Dataset into the Training set and Test set
 In Machine Learning Data Preprocessing, we divide our dataset into a training set and test set.
 Training dataset is completely different test dataset.
 If we train our model very well and its training accuracy is also very high, but we provide a new
dataset to it, then it will decrease the performance.
 We want a model that performs well with the training set and also with the test dataset.
 Training Set:
A subset of the dataset used to train the machine learning model; the output of the model is
already known for these examples.
 Test Set:
A subset of the dataset used to test the machine learning model; using the test set, the model
predicts the output.
 During the first stage, split the arrays of the dataset into random train and test subsets.
 In the second stage further classified into,
x_train : Features for the Training Data
x_test : Features for Testing Data
y_train : Dependent Variable (labels) for the Training Data
y_test : Dependent Variable (labels) for the Testing Data
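A minimal sketch of such a split using scikit-learn's train_test_split, on a tiny made-up dataset:

import numpy as np
from sklearn.model_selection import train_test_split

# Tiny illustrative dataset: 10 examples, 2 features, binary labels
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold back 20% of the rows as the unseen test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)
print(x_train.shape, x_test.shape)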
7) Feature Scaling
 Feature scaling is the final step of data preprocessing in machine learning.
 It is a technique to standardize the independent variables of the dataset to a specific range.
 In feature scaling, we put our variables in the same range and on the same scale so that no
variable dominates the others.
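A minimal sketch of standardization with scikit-learn's StandardScaler; the age/salary values are only illustrative:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and salary are on very different scales
x = np.array([[25, 50000], [35, 60000], [45, 80000]], dtype=float)

# Standardization: rescale each column to zero mean and unit variance
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
print(x_scaled)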

Unit : I completed
UNIT - II
MODEL EVALUATION AND FEATURE ENGINEERING
Model Selection - Training Model - Model Representation and Interpretability – Evaluating Performance of
a Model - Improving Performance of a Model - Feature Engineering: Feature Transformation - Feature
Subset Selection.

UNIT - II
MODEL EVALUATION AND FEATURE ENGINEERING
MODEL SELECTION
 Model Selection in Machine Learning is the process of choosing the best suited model for a
particular problem.
 Selecting a Model depends on various factors such as the dataset, task, nature of the model etc.
 In Machine Learning, Different types of model are available.
 Some of them are, Logistic Regression , K – Means Clustering , Neural Network , Support Vector
Machine , Hierarchical Clustering , Decision Tree , Random Forest …
 Logistic Regression – Supervised Learning
Logistic Regression is used for predicting the categorical dependent variable using a given set of
independent variables.
Logistic Regression predicts the output of a
categorical dependent variable.
Categorical or Discrete value can be either
Yes or No, 0 or 1, true or False, etc. but
instead of giving the exact value as 0 and
1, it gives the probabilistic values which lie
between 0 and 1.
 K – Means Clustering – Unsupervised
Learning
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way
that each data point belongs to only one group, with similar properties grouped together.
 Neural Network – Deep Learning
A neural network is a method in machine
learning that teaches computers to process data
in a way that is inspired by the human brain.
It is a type of machine learning process, called
Deep Learning that uses interconnected nodes
or neurons in a layered structure that resembles
the human brain.
 Model Selection in Machine Learning is based on:
 Data
 Task
1) Based on Types of Data
 If we have Different Types of Data, choose a Specific Model Based on Data that we have.
CNN Model :
a) Used for processing data that has a grid pattern, such as images
b) Used for image and video data
c) Comes under the Deep Learning family of models
d) Predicts the content of an image, e.g. DOG
RNN Model:
a) Recurrent Neural Network
b) Used to recognize patterns in sequences of data, such as text, handwriting, the speech
recognition [spoken word], and numerical time series data.
SVM, Logistic Regression, Decision Tree Model:
a) Used for Numerical Data
2) Based on Task – we need to carry out
Classification Task : SVM
: Logistic Regression
: Decision Tree Model
Regression Task : Linear Regression
: Random Forest
: Polynomial Regression
Clustering Task : K-Mean Clustering
: Hierarchical Clustering

TRAINING MODEL
 Training a model simply means learning (determining) good values for all the weights and the bias
from labeled examples.
 In supervised learning, a machine learning algorithm builds a model by examining many examples
and attempting to find a model that minimizes loss; this process is called Empirical Risk Minimization.
 How to Train For a Model
 Training Machine Learning Models starts
after the Machine learning model was built.
 It is trained in order to get the appropriate
results.
 To train a machine learning model, one
needs a huge amount of pre-processed data.
 The Pre-Processed data means data in structured form with reduced null values, etc.
 The best model to choose depends on the associated attributes, the volume of the available dataset,
the number of features, complexity, etc.
 However, in practice, it is recommended that we always start with the simplest model that can be
applied to the particular problem and then gradually enhance the complexity & test the accuracy
with the help of parameter tuning and cross-validation.
MODEL REPRESENTATION AND INTERPRETABILITY
MODEL REPRESENTATION
 Models are representations of a selected part or aspect (often referred to as the target system, parent
system, original, or prototype) of the external world.
 This selected part is the model's target system.
 Model representation of the data provides a useful viewpoint into the data's key qualities.
 In order to train a model, we choose the best set of features to represent the data.
 How to Represent a Model (Conversion of Raw Data into a Feature Vector)
 Feature Engineering means transforming raw data into a feature vector.
 Stage: 1- Mapping Raw Data to Features


In this stage, feature engineering maps raw data to ML features.
The left side shows raw data from an input data source; the right side illustrates a feature vector,
which is the set of floating-point values comprising the examples in your data set.
Many machine learning models must represent the features as real-numbered vectors
since the feature values must be multiplied by the model weights.
 Stage: 2 - Mapping numeric values
In this stage, Mapping Integer Values to Floating-Point Values.
Integer and floating-point data don't need a special encoding because they can be multiplied
directly by a numeric weight.
So converting the raw integer value 6 to the feature value 6.0 is trivial.
 Stage: 3 - Mapping Categorical Values
In this stage, convert strings to numeric values.
For Example,
There might be a feature called Street Name with options that include:
1. Charleston Road
2. North Shoreline Boulevard
3. Shorebird Way
4. Rengstorff Avenue
Since models cannot multiply strings by the learned weights, feature engineering to
convert strings to numeric values.

Now the mapping our street names to numbers:


1. Map Charleston Road : 0
2. Map North Shoreline Boulevard : 1
3. Map Shorebird Way : 2
4. Map Rengstorff Avenue : 3
5. Map everything else (OOV) : 4
In this stage, we build a categorical feature in our model that represents values as follows:
 For values that apply to the example, set the corresponding vector elements to 1.
 Set all other elements to 0.
 The length of this vector is equal to the number of elements in the vocabulary.
 This representation is called a one-hot encoding when a single value is 1, and a
multi-hot encoding when multiple values are 1.
 Final Stage: Mapping street address via One-Hot Encoding.
In this stage, the model uses only the weight for Shorebird Way, by creating a Boolean
variable that is 1 for Shorebird Way.
For example, if a house is at the corner of two streets, then two binary values are set to 1,
and the model uses both their respective weights.
One-hot encoding also extends to numeric data that you do not want to multiply directly
by a weight, such as a Postal Code.
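A minimal sketch of one-hot encoding the hypothetical street_name feature with pandas:

import pandas as pd

# Hypothetical raw data with a categorical street_name feature
raw = pd.DataFrame({"street_name": ["Charleston Road", "Shorebird Way",
                                    "Shorebird Way", "Rengstorff Avenue"]})

# One-hot encoding: one Boolean column per vocabulary entry,
# with a 1 only in the column that applies to each example
one_hot = pd.get_dummies(raw, columns=["street_name"])
print(one_hot)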
INTERPRETABILITY
 Interpretability is the ability to determine cause and effect from a machine learning model: humans
can capture relevant knowledge from a model concerning relationships either contained in the data or
learned by the model.
 If something is interpretable, it is possible to find its meaning, or to find a particular meaning in it.
 Some commonly used Interpretable models are ,
 Linear Regression,
 Logistic Regression
 Decision Tree
 The higher the interpretability of an ML model, the
easier it is to understand the model's
predictions.
Interpretability Evaluation Methods
 There are currently three major
ways to evaluate interpretability
methods:
 Application-Grounded
 Human-Grounded
 Functionally Grounded
1) Application-Grounded
 The evaluation requires a human to perform experiments within a real-life application.
 For Example,
To evaluate an interpretation on diagnosing a certain disease, the best way is for the
doctor to perform diagnoses.
2) Human-Grounded
 The evaluation is about conducting Simpler Human-Subject Experiments.
 For Example,
Humans are presented with pairs of explanations, and must choose the one that they find
to be of higher quality.
3) Functionally Grounded
 The evaluation requires no human experiments.
 Instead, it uses a proxy for explanation quality.
 This method is much less costly than the previous two.
 The challenge, of course, is to determine what proxies to use.
 For Example,
Decision Trees have been considered interpretable in many situations, but additional
research is required.

EVALUATING PERFORMANCE OF A MODEL


 Model Evaluation is the process of using different evaluation metrics to understand a machine
learning model's performance, as well as
its strengths and weaknesses.
 Model Evaluation is important to
evaluate the efficiency of a model during
initial research phases, and it also plays
a role in model monitoring.
 Model Evaluation Techniques:
 There are two methods of evaluating
models in machine learning.
 Hold Out
 Cross Validation.
 To Avoid Overfitting, both methods use a test set (not seen by the model) to evaluate model
performance.
a) Hold Out Method:
 The purpose of holdout evaluation is to test a model on different data than it was trained on.
 This provides an unbiased (fair) estimate of learning performance.
 In this method, the dataset is randomly divided into three subsets:
1) Training Set, is a subset of the dataset used to build predictive [a commonly used
statistical technique to predict future behavior] models.
2) Validation Set, is a subset of the dataset used to evaluate the performance of the model
built in the training phase.
3) Test Set (Or) Unseen Data, is a subset of the dataset used to evaluate the future
performance of the model
 Advantages:
Speed
Simplicity
Flexibility
b) Cross Validation
 It is a technique that involves partitioning the original observation dataset into a training set,
used to train the model, and an independent set used to evaluate the analysis.
 In k-fold cross validation, the dataset is partitioned into 'k' equal-sized subsamples called Folds.
 A value of 5 to 10 folds is usually preferred for the test and validation sets.
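A minimal sketch of k-fold cross-validation with scikit-learn (the iris data and logistic regression are arbitrary illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

x, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the data is split into 5 folds, and each fold
# serves once as the validation set while the rest trains the model
scores = cross_val_score(LogisticRegression(max_iter=1000), x, y, cv=5)
print(scores, scores.mean())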
IMPROVING PERFORMANCE OF A MODEL
 Improving the accuracy of a machine learning model is a skill that can only improve with practice to
get accurate and most reliable results.
 There are various methods you can improve your
machine learning model performance.
 Some of them are,
 Choosing the Right Algorithms
 Use the Right Quantity of Data
 Quality of Training Data Sets
 Supervised or Unsupervised ML
 Model Validation and Testing
1) Choosing the Right Algorithms
 Algorithms are the key factor used to train the ML models.
 The data feed into this that helps the model to learn from and predict with accurate results.
 Hence, choosing the right algorithm is important to ensure the performance of your machine
learning model
2) Use the Right Quantity of Data
 The important factor while developing a machine learning model is to choose the right quantity
of data sets.
 A huge quantity of datasets is required for algorithms.
 Depending on the complexities of problem, data size evaluation and rule are the leading factors
determine the quantity and types of training data sets that help in improving the performance of
the model.
3) Quality of Training Data Sets
 The quality of machine learning training data set is another key factor while developing an ML
model.
 If the quality of machine learning training data sets is not good or accurate your model will never
give accurate results, affecting the overall performance of the model not suitable to use in real-
life.
 Standard quality-assurance methods and in-depth quality assessment methods are the leading two
popular methods which ensure the right quality of training data sets to improve the performance
of your ML model.
4) Supervised or Un-Supervised ML
 Supervised, unsupervised and reinforcement learning differ in how they use data: a supervised
algorithm has a target/outcome variable (dependent variable) which is to be predicted from a given
set of predictors (independent variables).
 In unsupervised machine learning, a model is not given any target or outcome variable to
predict/estimate.
 For supervised ML, labeled or annotated data is required, while for unsupervised ML the
approach is different.
5) Model Validation and Testing
 Building a machine learning model is not enough to get the right predictions.
 We have to check its accuracy and validate it to ensure that we get precise results.
 Validating the model will improve the performance of the ML model.
FEATURE ENGINEERING
 Feature Engineering is the Pre-Processing step of machine learning, which is used to transform raw
data into features that can be used for creating a predictive model using Machine learning or
statistical Modeling. [or]
 Feature Engineering is the pre-processing step of machine learning, which extracts features from raw
data.
 All machine learning algorithms take input data to generate the output.
 The Input data remains in a tabular form consisting of rows (instances or observations) and columns
(variables or attributes), and these attributes are often known as Features.
 A Feature is an attribute that impacts a problem or is useful for the problem.
 Feature Engineering in machine learning aims to improve the performance of models.
 Feature Engineering process selects the most useful predictor variables for the model.
 Automated Feature Engineering is also used in different machine learning software that helps in
automatically extracting features from raw data.
 Feature Engineering in ML contains mainly Four Processes:
 Feature Creation
 Feature Transformations
 Feature Extraction
 Feature Selection
Need for Feature Engineering in Machine Learning
 In Machine Learning, the performance of the model depends on Data Pre-Processing and Data
Handling.
 Feature engineering in machine learning improves the model's performance.
 The needs for feature engineering are,
1) Better features mean flexibility.
 In machine learning, we always try to choose the optimal model to get good results.
 The flexibility in features will enable you to select the less complex models.
 The less complex models are faster to run, easier to understand and maintain, which is always
attractive.
2) Better Features mean Simpler Models.
 Sometimes after choosing the wrong model, we can get better predictions, and this is because of
better features.
 After feature engineering, it is not necessary to do hard for picking the right model with the most
optimized parameters.
 If we have good features, we can better represent the complete data and use it to best characterize
the given problem.
3) Better Features mean Better Results.
 In machine learning, the output we get depends heavily on the data we provide.
 To obtain better results, we must use better features.
FEATURE TRANSFORMATIONS
 Feature Transformation is a mathematical transformation in which we apply a mathematical formula
to a particular column (feature) and transform the values, which are useful for our further analysis.
 It is a technique by which we can boost our model performance.
 Feature Transformation:
 Transformation of data to improve the accuracy of the algorithm.
 Feature Selection:
 Removing Unnecessary Features.
 The transformation step of feature engineering involves adjusting the predictor variable to improve
the accuracy and performance of the model.
 For Example,
 It ensures that the model is flexible to take input of the variety of data; it ensures that all the
variables are on the same scale, making the model easier to understand.
 It improves the model's accuracy and ensures that all the features are within the acceptable range
to avoid any computational error.
Feature Transformations Techniques
 Feature transformation techniques are classified into three groups:
 Function Transformation
 Power Transformation
 Quantile Transformation
 These techniques are further classified into the following specific transformations:
 Log Transformation
 Reciprocal Transformation
 Square Transformation
 Square Root Transformation
 Custom Transformation
 Power Transformations
1) Log Transformation
 The Log Transform is one of the most
popular Transformation techniques.
 Log transformation is a data transformation method in which each variable of x will be replaced
by log(x) with base 10, base 2, or natural log.
 It is primarily used to convert a Skewed Distribution to a Normal Distribution / Less-Skewed
Distribution. { This transformation is mostly applied to Right-Skewed data}
 In this transform, we take the log of the values in a column and use these values as the column
instead.
 For Example,
log(10) = 1
log(100) = 2
log(10000) = 4
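A minimal numpy sketch of a base-10 log transform on an illustrative right-skewed column:

import numpy as np

# Right-skewed values: a few very large observations dominate
values = np.array([1, 10, 100, 1000, 10000], dtype=float)

# The log transform (base 10) compresses the large values,
# making the distribution closer to normal
log_values = np.log10(values)
print(log_values)   # [0. 1. 2. 3. 4.]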
2) Reciprocal Transformation
 A transformation of raw data that involves by replacing the original data units with their
Reciprocals and analyzing the modified data.
 It can be used with Non-Zero Data and is commonly used when distributions have Skewness or
Clear Outliers.
 In this transformation, x will replace by the inverse of x (1/x).
 This transformation reverses the order among values of the same sign, so large values become
smaller and vice-versa.
 The reciprocal transformation will give little effect on the shape of the distribution.
3) Square Transformation
 In a square transformation, each value x in the column is replaced by its square, x².
 This transformation mostly applies to left-skewed data.
4) Square Root Transformation:
 A procedure for converting a set of data in which each value, xi, is replaced by its square root.
 The main advantage of Square Root Transformation is, it can be applied to zero values.
 This transformation is defined only for positive numbers.
 This can be used for reducing the skewness of Right-Skewed Data.
 It is weaker than the Log Transformation.
 A square root transformation has the effect of making the data Less Skew and making the
variation more uniform.
5) Custom Transformation:
 A Function Transformer forwards its X (and optionally y) an argument to a user-defined function
or function object and returns this function's result.
 The resulting transformer will not be pickleable if lambda is used as the function.
 This is useful for stateless transformations such as taking the log of frequencies, doing custom
scaling, etc.
FEATURE SUBSET SELECTION
 Feature Selection is a way of selecting the subset of the most relevant features from the original
features set by removing the redundant, irrelevant, or noisy features.
 Feature subset selection is the process of identifying and removing as much of the irrelevant and
redundant information as possible.
 This reduces the dimensionality of the data and allows learning algorithms to operate faster and more
effectively.
 It is a process of automatically or manually selecting the subset of most appropriate and relevant
features to be used in model building.
 Feature Selection is performed by either including the important features or excluding the irrelevant
features in the dataset without changing them.
 Machine Learning works on the concept of "Garbage In Garbage Out".
 Advantage:
 It helps in avoiding the curse of dimensionality.
Number of samples needed to estimate the level of accuracy grows with respect to the
number of input variables.
 It helps in the simplification of the model so that it can be easily interpreted by the researchers.
 It reduces the training time.
The dataset consists of noisy data, irrelevant data, and some part of useful data.
The huge amount of data also slows down the training process of the model, and with noise
and irrelevant data, the model may not predict and perform well.
 It reduces overfitting hence enhance the generalization.
Feature Selection Techniques
 There are mainly two types of Feature Selection techniques
1) Supervised Feature Selection Technique
 Supervised Feature Selection Techniques consider the target variable and can be used for the
labelled dataset.
2) Un-Supervised Feature Selection Technique
 Un-Supervised Feature Selection Techniques ignore the target variable and can be used for the
unlabelled dataset.
 There are mainly three techniques under Supervised Feature Selection:
a) Wrapper Methods
 In wrapper methodology, selection of features is done by considering it as a search problem, in
which different combinations are made, evaluated, and compared with other combinations.
 It trains the algorithm by using the subset of features
iteratively.
 On the basis of the output of the model, features are
added or subtracted, and with this feature set, the
model has trained again.

 Some techniques of wrapper methods are:


 Forward Selection
Forward selection is an iterative process, which begins with an empty set of features.
After each iteration, it keeps adding on a feature and evaluates the performance to check
whether it is improving the performance or not.
The process continues until the addition of a new variable/feature does not improve the
performance of the model.
 Backward Elimination
This technique begins the process by considering all the features and removes the least
significant feature.
This elimination process continues until removing the features does not improve the
performance of the model.
Backward Elimination is also an iterative approach.
It is the opposite of forward selection.
 Exhaustive Feature Selection
Exhaustive feature selection evaluates each feature set by brute force.
It means the method tries every possible combination of features and returns the
best performing feature set.
 Recursive Feature Elimination
Recursive feature elimination is a recursive greedy optimization approach, where features
are selected by recursively taking a smaller and smaller subset of features.
An estimator is trained on each set of features, and the importance of each feature is
determined using the coef_ attribute or the feature_importances_ attribute
(a minimal sketch of a wrapper-style selector follows below).
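A minimal sketch of a wrapper-style selector using scikit-learn's RFE (the estimator and feature count are arbitrary choices for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

x, y = load_breast_cancer(return_X_y=True)

# Recursively drop the weakest features (ranked via the estimator's coef_)
# until only 5 remain
selector = RFE(estimator=LogisticRegression(max_iter=5000),
               n_features_to_select=5)
selector.fit(x, y)
print(selector.support_)   # Boolean mask of the selected features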
b) Filter Methods
 In Filter Method, features are selected on the basis of
statistics measures. This method does not depend on the
learning algorithm and chooses the features as a pre-
processing step.
 The filter method filters out the irrelevant feature and
redundant columns from the model by using different
metrics through ranking.
 The advantage of using filter methods is that it needs low
computational time and does not overfit the data.
 Some common techniques of Filter methods are as follows:
Information Gain
Chi-square Test
Fisher's Score
Missing Value Ratio
1) Information Gain:
 Information gain determines the reduction in entropy while transforming the dataset.
 It can be used as a feature selection technique by calculating the information gain of each
variable with respect to the target variable.
2) Chi-square Test:
 Chi-square test is a technique to determine the relationship between the categorical
variables.
 The chi-square value is calculated between each feature and the target variable, and the
desired number of features with the best chi-square value is selected.
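A minimal sketch of chi-square based selection with scikit-learn's SelectKBest (the iris data and k=2 are arbitrary illustrative choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

x, y = load_iris(return_X_y=True)   # all feature values are non-negative

# Keep the 2 features with the highest chi-square score
# with respect to the target variable
selector = SelectKBest(score_func=chi2, k=2)
x_new = selector.fit_transform(x, y)
print(selector.scores_)
print(x_new.shape)   # (150, 2)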
3) Fisher's Score:
 Fisher's score is one of the popular supervised techniques of features selection.
 It returns the rank of the variable on the fisher's criteria in descending order.
 We can select the variables with a large fisher's score.
4) Missing Value Ratio:
 The missing value ratio can be used for evaluating each feature against a threshold value.
 The formula for obtaining the missing value ratio is the number of missing values in each column
divided by the total number of observations.
 Variables having a ratio greater than the threshold value can be dropped.
c) Embedded Methods
 Embedded methods combined the advantages of both filter and wrapper methods by considering
the interaction of features along with low computational
cost.
 These are fast processing methods similar to the filter
method but more accurate than the filter method.
 These methods are also iterative, which evaluates each
iteration, and optimally finds the most important features
that contribute the most to training in a particular
iteration.
 Some techniques of embedded methods are:
 Regularization-
Regularization adds a penalty term to different
parameters of the machine learning model for avoiding overfitting in the model.
This penalty term is added to the coefficients; hence it shrinks some coefficients to
zero.
Those features with zero coefficients can be removed from the dataset.
The types of regularization techniques are L1 Regularization (Lasso Regularization)
or Elastic Nets (L1 and L2 regularization).
 Random Forest Importance
Random Forest is such a tree-based method, which is a type of bagging algorithm that
aggregates a different number of decision trees.
It automatically ranks the nodes by their performance or decrease in the impurity over
all the trees.
Nodes are arranged as per the impurity values, and thus it allows pruning of the trees
below a specific node.
The remaining nodes create a subset of the most important features.

UNIT: II COMPLETED
UNIT: III- BAYESIAN LEARNING

Basic Probability Notation - Inference - Independence - Bayes' Rule - Bayesian Learning: Maximum
Likelihood and Least Squared Error Hypothesis - Maximum Likelihood Hypotheses for Predicting
Probabilities - Minimum Description Length Principle - Bayes Optimal Classifier - Naive Bayes
Classifier - Bayesian Belief Networks - EM Algorithm.

Introduction

UNIT: III

BASIC PROBABILITY NOTATION

 Probability can be defined as the chance that an uncertain event will occur.
 It is the numerical measure of the likelihood that an event will occur.
 The value of a probability always remains between 0 and 1, which represent the ideal certainties:
0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.
 The probability of a sure or certain event is 1.
 The probability of an impossible event is 0.

TYPES OF PROBABILITY

Conditional Probability

 The conditional probability is the probability of one event given the occurrence of another
event, often described in terms of events A and B from two dependent random variables, e.g. X
and Y.
 If A and B are dependent events, the conditional probability of A given B means the probability
of occurrence of A when the event B has already happened. It is denoted by P(A|B) and is
defined by
P(A|B) = P(A ∩ B) / P(B), if P(B) ≠ 0
 The conditional probability of B given A means the probability of occurrence of B when the
event A has already happened. It is denoted by P(B|A) and is defined by
P(B|A) = P(A ∩ B) / P(A), if P(A) ≠ 0

Joint Probability Distribution

 Joint probability is a statistical measure that calculates two events occurring together and at the
same point in time. These two events are usually coined event A and event B, and can formally
be written as: P(A and B).
 The joint probability distribution (or "joint" for short), which completely specifies an agent's
probability assignments to all propositions in the domain (both simple and complex).
 Example:
 A, B =Two different events that intersect P (A and B)
 The joint probability of A and B, {P (A B )= P (A⋂B) }

AXIOMS OF PROBABILITY

 Three axioms of probability,


 Axiom 1: The probability of an event is a real number greater than or equal to 0.
 Axiom 2: The probability that at least one of all the possible outcomes of a process
(such as rolling a die) will occur is 1.
 Axiom 3: If two events A and B are mutually exclusive, then the probability of
either A or B occurring is the probability of A occurring plus the probability of B
occurring.

Example:

 For any event A, P (A) ≥ 0.


 When S is the sample space of an experiment, the set of all possible outcomes, P(S) = 1.
 If A and B are mutually exclusive outcomes, P(A ∪ B) = P(A) + P(B).
BAYESIAN INFERENCE

 Bayesian Inference is a method of statistical inference in which Bayes' Theorem is used to
update the probability for a hypothesis as more evidence or information becomes available.
 Bayesian Inference is an important technique in statistics, and especially in mathematical
statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence
of data.
 Bayesian Inference has found application in a wide range of activities, including science,
engineering, philosophy, medicine, sport, carpooling, and law.
 In the philosophy of decision theory, Bayesian inference is closely related to subjective
probability, often called "Bayesian Probability".
 Bayesian inference derives the posterior probability from two components: a Prior
Probability and a "Likelihood Function" derived from a statistical model for the observed data.

 Bayesian Inference computes the Posterior Probability according to Bayes' Theorem:
P(hypothesis | data) = P(data | hypothesis) P(hypothesis) / P(data)

INDEPENDENCE

 Independence is a fundamental notion in probability theory, as in statistics and the theory of
stochastic processes.
 Two events are independent, statistically independent, or stochastically independent if the
occurrence of one does not affect the probability of occurrence of the other (equivalently, does
not affect the odds).
 Similarly, two random variables are independent if the realization of one does not affect the
probability distribution of the other.
 When dealing with collections of more than two events, a weak and a strong notion of
independence need to be distinguished.
 Events are called Pair Wise Independent if any two events in the collection are independent of
each other, while saying that the events are mutually independent (or collectively independent)
intuitively means that each event is independent of any combination of other events in the
collection.
 The name "Mutual Independence" (or "Collective Independence") seems the outcome of a
pedagogical choice, merely to distinguish the stronger notion from "Pair Wise Independence",
which is a weaker notion.
 In the advanced literature of probability theory, statistics, and stochastic processes, the stronger
notion is simply named Independence with no modifier.
 It is stronger since mutual independence implies pair wise independence, but not the other way around.

BAYES' THEOREM:

 Bayes' Theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which determines
the probability of an event with uncertain knowledge.
 In probability theory, it relates the conditional probability and marginal probabilities of two random
events.
 Bayes' Theorem was named after the British mathematician Thomas Bayes. Bayesian inference is an
application of Bayes' theorem, which is fundamental to Bayesian statistics.
 It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).
 Bayes' Theorem allows updating the probability prediction of an event by observing new information
of the real world.

 Example: If cancer corresponds to one's age, then by using Bayes' Theorem we can determine the
probability of cancer more accurately with the help of age.
 Bayes' Theorem can be derived using the product rule and the conditional probability of event A with
known event B:
 From the product rule we can write: P(A ⋀ B) = P(A|B) P(B)
 Similarly, the probability of event B with known event A: P(A ⋀ B) = P(B|A) P(A)
 Equating the right-hand sides of both equations, we get:
P(A|B) = P(B|A) P(A) / P(B)   ... (a)

 The above equation (a) is called Bayes' Rule or Bayes' Theorem.
 This equation is the basis of most modern AI systems for probabilistic inference.
 It shows the simple relationship between joint and conditional probabilities.
 Here, P(A|B) is known as the Posterior, which we need to calculate; it is read as the probability of
hypothesis A given that evidence B has occurred.
 P(B|A) is called the Likelihood, in which we consider that the hypothesis is true and then calculate
the probability of the evidence.
 P(A) is called the Prior Probability, the probability of the hypothesis before considering the evidence.
 P(B) is called the Marginal Probability, the pure probability of the evidence.
 In equation (a), in general, we can write P(B) = Σi P(Ai) P(B|Ai), hence Bayes' rule can be written as:
P(Ai|B) = P(Ai) P(B|Ai) / Σi P(Ai) P(B|Ai)
Where A1, A2, A3, ..., An is a set of mutually exclusive and exhaustive events.

APPLYING BAYES' RULE:

 Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A).
 This is very useful in cases where we have a good probability of these three terms and
want to determine the fourth one.
 Suppose we want to perceive the effect of some unknown cause, and want to compute
that cause; then Bayes' rule becomes: P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
Example-1:

 Question: From a standard deck of playing cards, a single card is drawn. The probability that
the card is a king is 4/52; calculate the posterior probability P(King|Face), i.e. the probability
that a drawn face card is a king.
 Given Data:
 P(King): probability that the card is a king = 4/52 = 1/13
 P(Face): probability that a card is a face card = 12/52 = 3/13
 P(Face|King): probability of a face card when we assume it is a king = 1
 Result:
P(King|Face) = P(Face|King) * P(King) / P(Face) = 1 * (1/13) / (3/13) = 1/3
BAYESIAN LEARNING

Introduction

 Bayesian Learning views the problem of constructing hypotheses from data as a sub
problem of the more fundamental problem of making predictions.
 The idea is to use hypotheses as intermediaries between data and predictions.
 First, the probability of each hypothesis is estimated, given the data.
 Predictions are then made from the hypotheses, using the posterior probabilities of the
hypotheses to weight the predictions.
 For Example,
 Consider the problem of predicting tomorrow's weather.
 Suppose the available experts are divided into two camps: some suggest model A, and
some suggest model B. The Bayesian method, rather than choosing between A and B, gives
some weight to each based on their likelihood.
 The likelihood will depend on how much the known data support each of the two models.
MAXIMUM LIKELIHOOD HYPOTHESIS

Introduction

 Science involves the creation of hypothesis (or theories), and the testing of those theories by
comparing their predictions with experimental observations.
 In many cases, the conclusions of an experiment are obvious – the theory is supported or disproven.
 The uncertainties data determines whether a hypothesis is right or wrong – only how likely it
is to be right: the “Probability of the Hypothesis”.
 In order to do this, our hypothesis must be detailed enough for us to work out how likely we
would have been to get the results we observe, assuming that the hypothesis is true.

Define Maximum Likelihood Estimation

 Let X1, X2, X3, .............. Xn be independent and identically distributed random sample drawn from a
population having the probability density function f(X, ), then the joint density of X1, X2, X3,…
……..Xn is given by
L(θ) = f(X1, θ) · f(X2, θ) · f(X3, θ) · ... · f(Xn, θ) = ∏ (i = 1 to n) f(Xi, θ)
This product is known as the likelihood function, and it is denoted by 'L'.

Maximum Likelihood Estimation (MLE)

 Maximum Likelihood Estimation (MLE), the frequentist view, and Bayesian Estimation, the
Bayesian view, are perhaps the two most widely used methods for parameter estimation - the
process by which, given some data, we are able to estimate the model that produced that data.
 The point in the parameter space that maximizes the likelihood function is called the Maximum
Likelihood Estimate.
 MLE is a special case of Maximum a Posteriori Estimation (MAP) that assumes a uniform prior
distribution of the parameters.
Least Squared Error Hypothesis

 The Least Squares Method is a statistical procedure to find the best fit for a set of data points by

minimizing the sum of the offsets or residuals of points from the plotted curve.

 Least Squares Estimates are calculated by fitting a regression line to the points from a data set
that has the minimal sum of the squared deviations (least square error).
 Least Squares Regression is used to predict the behavior of dependent variables.
 Least-Squared Error Hypotheses are generated by adding random noise to the true target
value, where this random noise is drawn independently for each example from a Normal
distribution with zero mean.
 Maximizing the Likelihood Function determines the parameters that are most likely to
produce the observed data.

Maximum Likelihood Hypotheses For Predicting Probabilities

 Maximum Likelihood Estimation is a probabilistic framework for solving the problem of


density estimation.
 It involves maximizing a likelihood function in order to find the probability distribution and
parameters that best explain the observed data.
 The general approach for using MLE is a Set the parameters of our model to values which
maximize the likelihood of the parameters given the data.
 Common modeling problems involve show to estimate a joint probability distribution for a data set.
 How do you choose the probability distribution function?
 How do you choose the parameters for the probability distribution function?
 There are many techniques for solving this problem, although two common approaches are:
 Maximum a Posteriori (MAP) , a Bayesian method
 Maximum Likelihood Estimation (MLE), frequentist method.
 Note: MAP gives you the value which maximizes the posterior probability P (θ|D).

Maximum Likelihood Estimation

 Maximum likelihood estimation involves defining a likelihood function for calculating the
conditional probability of observing the data sample given a probability distribution and
distribution parameters. This approach can be used to search a space of possible distributions
and parameters.
 Given data the maximum likelihood estimate (MLE) for the parameter p is the value of p that
maximizes the likelihood P(data | p).
 MLE is the technique which helps us in determining the parameters of the distribution that best
describe the given data.
 These values are a good representation of the given data but may not best describe the
population. We can use MLE in order to get stronger parameter estimates.
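A minimal sketch of MLE for a Bernoulli parameter (the coin-flip sample and the grid search are purely illustrative):

import numpy as np

# Observed coin flips (1 = heads); this sample is made up for illustration
data = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Log-likelihood of the data for a candidate parameter p = P(heads)
def log_likelihood(p, data):
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

# Search a grid of candidate parameters and keep the one that
# maximizes the likelihood of the observed data
candidates = np.linspace(0.01, 0.99, 99)
mle = max(candidates, key=lambda p: log_likelihood(p, data))
print(mle)   # equals the sample proportion of heads, 7/10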

Minimum Description Length Principle

 The Minimum Description Length (MDL) Principle is a powerful method of inductive
inference, the basis of statistical modeling, pattern recognition and machine learning.
inference, thebasis of statistical modeling, pattern
Recognition and machine learning.
 It holds that the best explanation,
given a limited set of observed
data, is the one that permits the
greatest compression of the data.
 Minimum description length principle: For
transferring the information from one end to another end you require a minimum number of bits
 The minimum description length (MDL) criterion in machine learning says that the best
description of the data is given by the model which compresses it the best.

Bayes Optimal Classifier

 The Bayes Optimal Classifier is a


probabilistic model that makes the most
likely prediction for a new example.
 Bayes Optimal Classifier is a
probabilistic modelthat finds the most
probable prediction using the training
data and space of hypotheses to make a
prediction for a new data instance.

Naive Bayes Classifiers

 Naive Bayes classifiers are a collection of classification


algorithms based on Bayes' Theorem.
 It is not a single algorithm but a family of algorithms where all of
them share a common principle, i.e. every pair of features being classified is independent of each
other.
 Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which help in building fast machine learning models that can make quick predictions.
 It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
 A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a
class is unrelated to the presence (or absence) of any other feature, given the class variable.
Assumption:
 The fundamental Naive Bayes assumption is
that each feature makes an:
1) Independent
2) Equal
contribution to the outcome.
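A minimal sketch of a Gaussian Naive Bayes classifier with scikit-learn (the iris data and split are arbitrary illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

# Gaussian Naive Bayes: treats the features as independent given the class
model = GaussianNB()
model.fit(x_train, y_train)
print(model.score(x_test, y_test))   # classification accuracy on unseen data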

BAYESIAN BELIEF NETWORK

 Bayesian belief network is key computer technology for dealing with probabilistic events and to
solve a problem which has uncertainty.
 "A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a Directed Acyclic Graph."
 It is also called a Bayes Network, Belief Network, Decision Network, or Bayesian Model.
 Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
 Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks including
prediction, anomaly detection, diagnostics, automated insight, reasoning, time series prediction,
and decision making under uncertainty.
 Bayesian Network can be used for building models from data and experts opinions, and it
consists of two parts:
1) Directed Acyclic Graph
2) Table of Conditional Probabilities.
 The generalized form of Bayesian network that represents and
solve decision problems under uncertain knowledge is known as
an Influence diagram.
 A Bayesian network graph is made up of nodes
and Arcs (directed links), where:
 Each node corresponds to the random variables, and a variable can be continuous or discrete.
 Arc or directed arrows represent the causal relationship or conditional probabilities between random
variables. These directed links or arrows connect the pair of nodes in the graph.
 These links represent that one node directly influence the other node, and if there is no directed link
that means that nodes are independent with each other.

 In the above diagram, A, B, C, and D are random variables represented by the


nodes of the network graph.
 If we are considering node B, which is connected with node A by a directed arrow,
then node A is called the parent of Node B.
 Node C is independent of node A.
 The Bayesian network has mainly two components:
1) Causal Component
2) Actual numbers
 Each node in the Bayesian network has condition probability distribution P(Xi |Parent(Xi)
), which determines the effect of the parent on that node.

EM – ALGORITHM

 The EM algorithm is an iterative approach those cycles between two modes.


 The first mode attempts to estimate the missing or latent variables, called the estimation-step or E-
step.
 The second mode attempts to optimize the parameters of the model to best explain the data,
called the maximization-step or M-step.

Algorithm:

 Given a set of incomplete data, consider a set of starting parameters.


 Expectation step (E – step): Using the observed available data of the dataset, estimate (guess)
the values of the missing data.
 Maximization step (M – step): Complete data generated after the expectation (E) step is used in
order to update the parameters.
 Repeat step 2 and step 3 until convergence.

How it works:

 Initially, a set of initial values of the


parameters are considered.

 A set of incomplete observed data is givento the


system with the assumption that the observed data comes from a specific model.

 The next step is known as “Expectation” – step or E-step.

 In this step, we use the observed data in order to estimate or guess the values of
the missing or incomplete data.

 It is basically used to update thevariables.

 The next step is known as “Maximization”-step or M-step. In this step, we use the complete
data generated in the preceding “Expectation” – step in order to update the values of the
parameters. It is basically used to update the hypothesis.
 Now, in the fourth step, it is checked whether the values are converging or not, if yes,
then stop, otherwise repeat step-2 and step-3, i.e. the “Expectation” step and the “Maximization”
step, until convergence occurs.
Flow chart for EM algorithm

Usage of EM Algorithm

 It can be used to fill the missing data in a sample.


 It can be used as the basis of unsupervised learning of clusters.
 It can be used for the purpose of estimating the parameters of Hidden Markov Model (HMM).
 It can be used for discovering the values of latent variables.

Advantages of EM Algorithm

 It is always guaranteed that likelihood will increase with each iteration.


 The E-step and M-step are often pretty easy for many problems in terms of implementation.
 Solutions to the M-steps often exist in the closed form.

Disadvantages of EM ALGORITHM

 It has slow convergence.


 It makes convergence to the local optima only.
 It requires both the probabilities, forward and backward (numerical optimization requires only
forward probability).
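As an aside, scikit-learn's GaussianMixture fits a mixture model with the EM algorithm internally; a minimal sketch on synthetic data:

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data drawn from two overlapping clusters (latent groups)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200),
                       rng.normal(5, 1, 200)]).reshape(-1, 1)

# GaussianMixture runs EM: the E-step assigns soft cluster responsibilities,
# the M-step re-estimates the means and variances from the completed data
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_.ravel())      # approximately 0 and 5
print(gmm.converged_)          # True once the log-likelihood stops improving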

UNIT : III COMPLETED


UNIT: IV - PARAMETRIC MACHINE LEARNING

Logistic Regression: Classification and Representation – Cost function – Gradient Descent –


Advanced optimization – Regularization - Solving the problems on over fitting. Perceptron – Neural
Networks – Multi –Class Classification - Back Propagation – Non - Linearity with Activation
Functions (Tanh, Sigmoid, Relu, PRelu) - Dropout as Regularization.

UNIT: IV

PARAMETRIC MACHINE LEARNING

Introduction
 Assumptions can greatly simplify the learning process, but can also limit what can be learned.
 Algorithms that simplify the function to a known form are called parametric machine learning
algorithms.
 Examples for the Parametric Machine Learning Algorithms are ,
1) Logistic Regression
2) Linear Discriminant Analysis
3) Perceptron
4) Naive Bayes
5) Simple Neural Networks

Benefits of Parametric Machine Learning Algorithms:

1) Simpler: These methods are easier to understand and interpret results.


2) Speed: Parametric models are very fast to learn from data.
3) Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.

LOGISTIC REGRESSION

 Logistic regression is a supervised learning classification


algorithm used to predict the probability of a target
variable.
 Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.

 In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which predicts two maximum values (0 or 1).
 The curve from the logistic function indicates the likelihood of something, such as whether the cells are cancerous or not, or whether a mouse is overweight or not based on its weight, etc.

 Mathematically, a logistic regression model predicts P(Y=1) as a function of X.


 For example, if we are modeling people's sex as male or female from their height, then the probability of male given a person's height, or more formally:

P (Sex = Male | Height)

 Another way, we are modeling the probability that an input (X) belongs to the default class (Y=1), which we can write formally as:

P(X) = P( Y=1 | X)

 It is one of the simplest ML algorithms that can be used for various classification problems such as spam detection, diabetes prediction, cancer detection, etc.

Types of Logistic Regression

 Generally, Logistic Regression means Binary Logistic Regression having binary target variables, but there can be two more categories of target variables that can be predicted by it.
 Based on those numbers of categories, it is classified into 3 types,

1) Binary or Binomial
 In such a kind of classification, a dependent variable will have only two possible types either 1
or 0.
 For Example, these variables may represent success or failure, Yes or No, Win or Loss, Pass or
Fail.
2) Multinomial
 In this type, the dependent variable can have 3 or more possible un-ordered types, or types having no quantitative significance.
 For example, it may represent "Type A", "Type B" or "Type C", or "Cat", "Dog" or "Sheep".
3) Ordinal
 The dependent variable can have 3 or more possible ordered types, or types having a quantitative significance.
 For example, these variables may represent "Poor", "Good", "Very Good" or "Excellent", and each category can have a score like 0, 1, 2, 3, or "Low", "Medium", "High".

Logistic Regression Assumptions

 Before diving into the implementation of logistic regression, we must be aware of the following assumptions:
 In binary logistic regression, the target variables must always be binary, and the desired outcome is represented by the factor level 1.
 The independent variables should not have multicollinearity, which means the independent variables must be independent of each other.
 We should choose a large sample size for logistic regression.
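As a minimal sketch (not part of the original notes), logistic regression can be fitted with scikit-learn; the breast cancer toy dataset is used here only as an assumed example of a binary classification problem.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)   # fits the S-shaped logistic function
model.fit(X_train, y_train)

print(model.predict_proba(X_test[:3]))      # probabilities between 0 and 1
print(model.score(X_test, y_test))          # classification accuracy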

CLASSIFICATION AND REPRESENTATION

 In Machine Learning and Statistics, Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
 In Machine Learning, Classification refers to a predictive modeling problem where a class label is predicted for a given example of input data.
 For example, a classification task is to classify emails as "Spam" or "Not Spam."

TYPES OF CLASSIFICATION

1) Binary Classification
2) Multi-Class Classification
3) Multi-Label Classification
4) Imbalanced Classification

Binary Classification:-

 It is used when there are only two distinct classes and the data we want to classify belongs exclusively to one of those classes, e.g. to classify a post about a given product as positive or negative.
 Binary classification is the task of classifying the elements of a set into two groups on the basis of a classification rule.
 Typically, binary classification tasks involve one class that is the normal state and another class that is the abnormal state.
 The class for the normal state is assigned the class label 0, and the class with the abnormal state is assigned the class label 1.
 For example
 “Not spam” is the normal state and “spam” is the abnormal state.
 “Cancer not detected” is the normal state and “cancer detected” is the abnormal state.
 Popular algorithms that can be used for binary classification include:
 Logistic Regression
 k-Nearest Neighbors
 Decision Trees
 Support Vector Machine
 Naive Bayes

Multi-Class Classification

 It is used when there are three or more classes and the data we want to classify belongs exclusively to one of those classes.
 For example, to classify a set of images of fruits which may be oranges, apples, or pears.
 Multi-class classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
 Many algorithms used for binary classification can be used for multi-class classification. Popular algorithms that can be used for multi-class classification include:
 K-Nearest Neighbors.
 Decision Trees.
 Naive Bayes.
 Random Forest.
 Gradient Boosting.
 Alternatively, heuristic methods can be used to split a multi-class classification problem into multiple binary classification datasets and train a binary classification model for each.
 Two examples of these heuristic methods include:
1) One-vs-Rest (OvR)

2) One-vs-One (OvO)
One-vs-Rest (OvR)

 One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a heuristic method for using binary classification algorithms for multi-class classification.
 For example, given a multi-class classification problem with examples for each class 'red', 'blue' and 'green', this could be divided into three binary classification datasets as follows:
Binary Classification Problem 1: red vs [blue, green]
Binary Classification Problem 2: blue vs [red, green]
Binary Classification Problem 3: green vs [red, blue]

One-vs-One (OvO)

 Like one-vs-rest, one-vs-one splits a multi-class classification dataset into binary classification problems. Unlike one-vs-rest, which splits it into one binary dataset for each class, the one-vs-one approach splits the dataset into one dataset for each class versus every other class.
 For example, consider a multi-class classification problem with three classes: 'red', 'blue' and 'green'. This could be divided into 3 binary classification datasets as follows:
Binary Classification Problem 1: red vs. blue
Binary Classification Problem 2: red vs. green
Binary Classification Problem 3: blue vs. green
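The sketch below (not from the original notes) shows how scikit-learn wraps a binary classifier for multi-class data using the OvR and OvO strategies; the iris dataset is assumed only as a convenient three-class example.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                       # three classes

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)        # one binary problem per class
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)         # one binary problem per pair of classes

print(len(ovr.estimators_))   # 3 binary classifiers for 3 classes
print(len(ovo.estimators_))   # 3 pairwise classifiers for 3 classes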

Multi-Label Classification

 It is used when there are two or more classes and the data we want to classify may belong to none of the classes or all of them at the same time.
 These tasks are referred to as Multiple Label Classification or Multi-Label Classification for short.
 In Multi-Label Classification, zero or more labels are required as output for each input sample, and the outputs are required simultaneously.
 Multi-Label Classification refers to those classification tasks that have two or more class labels, where one or more class labels may be predicted.
Imbalanced Classification

 A classification predictive modeling problem where the distribution of examples across the classes is not equal, i.e. is unequally distributed.
 Imbalanced classification refers to a classification predictive modeling problem where the number of examples in the training dataset for each class label is not balanced.
 The class distribution is not equal or close to equal.
 For example, the collected measurements of flowers are 80 examples of a first flower species and 20 examples of a second flower species, and only these examples make up our training dataset.
Problems with an Imbalanced Dataset:
1) Algorithms may become biased towards the majority class and thus tend to predict the output as the majority class.
2) Minority class observations look like noise to the model and are ignored by the model.
3) An imbalanced dataset gives a misleading accuracy score.

COST FUNCTION

 A Cost Function is a mechanism utilized in supervised machine learning; the cost function returns the error between predicted outcomes compared with the actual outcomes.
 The loss function is defined as the error for one sample, whereas the cost function is the average loss across a number of samples in a given dataset.
 Loss functions measure how far an estimated value is from its true value.
 A loss function maps decisions to their associated costs.
 Loss functions are not fixed; they change depending on the task in hand and the goal to be met.

Types of the Cost Function

 The loss function is used to minimize the error in the algorithm while computing the errors in the given datasets.
 It is used to quantify how good or bad the model is performing.
 It is divided into two categories
1) Regression Loss
2) Classification Loss.
a) Binary Classification Cost Functions
b) Multi-Class Classification Cost Functions

REGRESSION LOSS FUNCTION

 A Loss Function is a measure of how good a prediction model does in terms of being able to predict the expected outcome.
 Regression models deal with predicting a continuous value, for example the salary of an employee, the price of a car, a loan prediction, etc.
 A cost function used in a Regression Problem is called a "Regression Cost Function".
 The calculation is based on distance-based error.

Types of Regression Loss Function

1) Mean Square Error


 The MSE loss function is defined as the average of squared differences between the actual and the predicted value.
 The corresponding cost function is the mean of these squared errors (MSE).
 The MSE loss function penalizes the model for making large errors by squaring them, and this property makes the MSE cost function less robust to outliers.
 Therefore, it should not be used if the data is prone to many outliers.
 For example, if the true target value is 100 and the predicted values range between -10,000 and 10,000, the MSE loss (Y-axis) reaches its minimum value at prediction (X-axis) = 100.
 The range is 0 to ∞.
 MSE = (Sum of Squared Errors) / n.
2) Mean Absolute Error (MAE)
 It is another loss function used for regression models.
 MAE is defined as the average of absolute differences between the target and the predicted values.
 It measures the average magnitude of errors in a set of predictions, without considering their directions.
 A related measure is the Mean Bias Error (MBE), which is the mean of the signed residuals/errors.
 The range is also 0 to ∞.
 MAE = (Sum of Absolute Errors) / n.


 The corresponding cost function is the mean of these absolute errors (MAE).
 The MAE loss function is more robust to outliers compared to the MSE loss function. Therefore, it should be used if the data is prone to many outliers.
 It is also known as L1 Loss.
 It is robust to outliers, thus it will give better results even when our dataset has noise or outliers.
3) Huber Loss / Smooth Mean Absolute Error
 The Huber loss function is defined as a combination of the MSE and MAE loss functions: it behaves like MAE when δ (delta) is small and approaches MSE as δ becomes large.
 It is Mean Absolute Error that becomes quadratic when the error is small.
 How small the error must be for it to become quadratic is controlled by a hyper-parameter, δ (delta), which can be tuned.
 The choice of the delta value is critical because it determines what you're willing to consider as an outlier.
 The Huber loss function can be less sensitive to outliers compared to the MSE loss function, depending upon the hyper-parameter value.
 It is used if the data is prone to outliers, and we might need to tune the hyper-parameter delta, which is an iterative process.
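The following is a small illustrative NumPy sketch (not from the original notes) computing MSE, MAE and the Huber loss by hand on hypothetical values, so the three formulas above can be compared directly.

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
err = y_true - y_pred

mse = np.mean(err ** 2)                    # penalizes large errors heavily
mae = np.mean(np.abs(err))                 # more robust to outliers

delta = 1.0                                # hyper-parameter controlling the MSE/MAE switch
huber = np.mean(np.where(np.abs(err) <= delta,
                         0.5 * err ** 2,                       # quadratic when the error is small
                         delta * (np.abs(err) - 0.5 * delta))) # linear when the error is large
print(mse, mae, huber)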

CLASSIFICATION LOSS FUNCTION

 Loss functions for classification.


 Classification problems involve predicting a discrete class output.
 It involves dividing the dataset into different and unique classes based on different parameters so that a new and unseen record can be put into one of the classes.

Types of Classification Loss Function

1) Cross Entropy Loss.


 It is also known as Negative Log Likelihood.
 It is the commonly used loss function for classification.
 Cross-Entropy Loss increases as the predicted probability diverges from the actual label.

2) Hinge Loss.
 It is also known as Multi-class SVM Loss.
 In simple terms, the score of the correct category should be greater than the scores of the incorrect categories by some safety margin (usually one).
 Hinge loss is applied for maximum-margin classification, most notably for support vector machines.
 It is a convex function used in convex optimizers.

GRADIENT DESCENT

 Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable


function.
 Gradient descent is an optimization algorithm used to find the values of parameters (coefficients) of a function (f) that minimizes a cost function (cost).
 Gradient descent is used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.
 The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent.

Types of Gradient Descent

1) Batch Gradient Descent:


 Batch gradient descent is used to calculate or process the derivative from all training data for each iteration before calculating an update.
 When the number of training examples is large, the batch gradient descent calculation will be very expensive. Hence batch gradient descent is not preferred.
 Instead, we will prefer to use Stochastic Gradient Descent or Mini-Batch Gradient Descent.
2) Stochastic Gradient Descent:
 Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or sub-differentiable).
 Stochastic gradient descent refers to calculating the derivative from each training data instance and calculating the update immediately.
 The parameters are updated even after one iteration, in which only a single example has been processed.
 It is quite a bit faster than batch gradient descent.
 When the number of training examples is large, it still processes only one example at a time, which can be additional overhead for the system as the number of iterations will be quite large.

3) Mini Batch Gradient Descent:


 This is a type of gradient descent which works faster than both Batch Gradient Descent and Stochastic Gradient Descent.
 When the number of training examples is large, it is processed in batches of b training examples in one go.
 So, it works for larger training examples with a lesser number of iterations.
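The following is a minimal illustrative Python sketch (not from the original notes) of batch gradient descent fitting a simple linear model y = w·x + b; the toy data and learning rate are assumed for demonstration.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0]); y = np.array([3.0, 5.0, 7.0, 9.0])   # toy data: y = 2x + 1
w, b, lr = 0.0, 0.0, 0.01                                                # initial parameters, learning rate

for _ in range(2000):
    y_hat = w * X + b
    # gradients of the MSE cost with respect to w and b, computed over ALL training data (batch)
    dw = (2 / len(X)) * np.sum((y_hat - y) * X)
    db = (2 / len(X)) * np.sum(y_hat - y)
    # step in the opposite direction of the gradient (steepest descent)
    w -= lr * dw
    b -= lr * db

print(w, b)   # approaches w = 2, b = 1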

ADVANCED OPTIMIZATION – REGULARIZATION

 Regularization is the process of adding information in order to solve an ill-posed problem [a problem with more than one solution] or to prevent overfitting.
 Regularization can be applied to objective functions in ill-posed optimization problems.
 Regularization focuses on reducing the test or generalization error without affecting the initial training error.
 Optimization techniques help in better convergence of a neural network by optimizing the gradient of the error function.
 In regularization theory, this is undertaken by adding a second term to the optimization problem being solved when fitting parameter values to data.
 Originally, the optimization problem was given in terms of minimizing some loss function:

β̂ = argmin_β L(β, X, Y)

 Regularization theory adds a second term to this optimization problem, which we will term R:

β̂ = argmin_β L(β, X, Y) + λ R(β)

Note: R is a function only of the parameters of the model.

 Two common regularization functions are L1 and L2 regularization:

R_L1 = Σ_{i=0..m} |βi|        R_L2 = Σ_{i=0..m} βi²

 L1 regularization gives the sum of the absolute values of the parameters.
 L2 gives the sum of their squares.
 Because we minimize, the output of the regularization function acts as a penalty on the optimization problem.
 In both cases, larger (in absolute size) parameters will result in a higher penalty.
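As a minimal sketch (not from the original notes), the L2 and L1 penalties correspond to Ridge and Lasso regression in scikit-learn, where the alpha parameter plays the role of λ; the synthetic dataset is assumed only for illustration.

from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)    # penalty = lambda * sum of squared coefficients (L2)
lasso = Lasso(alpha=1.0).fit(X, y)    # penalty = lambda * sum of absolute coefficients (L1)

print(ridge.coef_)
print(lasso.coef_)                    # L1 tends to drive some coefficients exactly to zero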
SOLVING THE PROBLEMS ON OVERFITTING

 Overfitting means that the noise [irrelevant or meaningless data] or random variations in the training data is picked up and learned as concepts by the model.
 Overfitting is a modeling error which occurs when a function is too closely fit to a limited set of data points.
 Underfitting refers to a model that can neither model the training data nor generalize to new data.
 Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.
 Underfitting occurs when the model or the algorithm does not fit the data well enough.
 This problem can be addressed by pruning a tree after it has
learned in order to remove some of the detail it has picked up.
 Note: Pruning is a technique that reduces the size of decision trees by removing sections of the tree
that are non-critical and redundant.

 Underfitting occurs if the model or algorithm shows low variance but high bias.
 Overfitting occurs if the model or algorithm shows high variance but low bias.
 Overfitting arises when a model tries to fit the training data so well that it cannot generalize to new observations.
 Underfitting models do not generalize well to both training and test data sets.

For Example: Tomato Problem

 Model A: Red, circle, green star shape on top, a few water droplets.
 Model B: Red, circle.
 The problem with model A is that not all tomatoes have water droplets on them.
 This model is too specific and likely to pick only wet tomatoes.
 It does not generalize well to all tomatoes.
 It will look for water droplets, so it cannot predict dry tomatoes in an image. It is overfitting.
 On the other hand, model B thinks everything that is red and has a circle shape is a tomato, which is not true. This model is too general, not able to detect critical features of tomatoes. It is underfitting.

PERCEPTRON

 A perceptron works by taking in some numerical inputs along with what is known as weights and a
bias.
 It multiplies these inputs by the respective weights (this is known as the weighted sum).
 These products are then added together along with the bias.
 The activation function takes the weighted sum and the bias as inputs and returns a final output.
 A perceptron consists of four parts:
1) Input values
2) Weights and a bias,
3) A weighted sum,
4) Activation function.

 Assume we have a single neuron and three inputs x1, x2, x3 multiplied by the weights w1, w2, and w3 respectively.
 This function is called the weighted sum because it is the sum of the products of the weights and inputs.
 Here the output falls in the range 0 to 1.
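The following is a minimal illustrative Python sketch of a single perceptron (weighted sum plus bias, passed through a step activation); the input and weight values are hypothetical, not from the notes.

import numpy as np

def perceptron(x, w, b):
    weighted_sum = np.dot(x, w) + b        # sum of inputs * weights, plus the bias
    return 1 if weighted_sum >= 0 else 0   # step activation function: output is 0 or 1

x = np.array([2.0, 3.0, 1.0])              # inputs x1, x2, x3 (hypothetical)
w = np.array([0.4, -0.2, 0.1])             # weights w1, w2, w3 (hypothetical)
print(perceptron(x, w, b=-0.1))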

NEURAL NETWORKS

 Neural networks are artificial systems that were inspired by biological neural networks.
 These systems learn to perform tasks by being exposed to various datasets and examples without any task-specific rules.
 The idea is that the system generates identifying characteristics from the data it has been passed, without being programmed with a pre-programmed understanding of these datasets.
 Neural networks are based on computational models for threshold logic.
 Threshold logic is a combination of algorithms and mathematics.
 Neural networks are based either on the study of the brain or on the application of neural networks to artificial intelligence.
 The work has led to improvements in finite automata theory.

Supervised vs Unsupervised Learning:

 Neural networks learn via supervised learning; supervised machine learning involves an input variable x and an output variable y.
 The algorithm learns from a training dataset. With each correct answer, the algorithm iteratively makes predictions on the data.
 The learning stops when the algorithm reaches an acceptable level of performance.
 Unsupervised machine learning has input data X and no corresponding output variables.
 The goal is to model the underlying structure of the data for understanding more about the data.
 The keywords for supervised machine learning are classification and regression.
 For unsupervised machine learning, the keywords are clustering and association.

Types of Neural Networks

 There are Seven Types of Neural Networks that can be used.


 The first is the multilayer perceptron, which has three or more layers and uses a nonlinear activation function.
 The second is the convolutional neural network, which uses a variation of the multilayer perceptron.
 The third is the recursive neural network, which uses weights to make structured predictions.
 The fourth is the recurrent neural network, which makes connections between the neurons in a directed cycle. The long short-term memory (LSTM) network uses the recurrent neural network architecture with gating units.
 The final two are sequence-to-sequence modules, which use two recurrent networks, and shallow neural networks, which produce a vector space from an amount of text. These neural networks are applications of the basic neural network demonstrated below.
 For example, the neural network will work with three vectors: a vector of attributes X, a vector of classes Y, and a vector of weights W.
 The code will use 100 iterations to fit the attributes to the classes.
 The predictions are generated, weighed, and then outputted after iterating through the vector of weights W.
 The neural network handles back propagation.

Input:

X = {2.6, 3.1, 3.0, 3.4, 2.1, 2.5, 2.6, 1.3, 4.9, 0.1, 0.3, 2.3}
Y = {1, 1, 1}
W = {0.3, 0.4, 0.6}

Output:

0.990628
0.984596
0.994117
Multi-layer Neural Networks

 A Multi-Layer Perceptron (MLP) or Multi-Layer Neural Network contains one or more hidden layers (apart from one input and one output layer).
 While a single layer perceptron can only learn linear functions, a multi-layer perceptron can also learn non-linear functions.

MULTI- CLASS CLASSIFICATION

 Multiclass classification means a classification problem where the task is to classify between more than two classes.
 Multiclass classification is a popular problem in supervised machine learning.
 It is a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears.
 Multi-class classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
 An imbalanced dataset refers to a problem with classification problems where the classes are not represented equally.
 In a multiclass classification, we train a classifier using our training data, and use this classifier for classifying new examples.
 Load the dataset from its source and also split the dataset into "training" and "test" data.
 In machine learning, multiclass or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary classification).
 Popular algorithms that can be used for Multi-Class Classification include:

1) k-Nearest Neighbors.
2) Decision Trees.
3) Naive Bayes.
4) Random Forest.
5) Gradient Boosting.

BACK PROPAGATION

 Back Propagation is a common method for training a neural network.


 Back Propagation is a method to calculate the gradient of the loss function with respect to the weights in an artificial neural network.
 Back Propagation is also known as "Backward Propagation of Errors."
 It is a commonly used algorithm that optimizes the performance of the network by adjusting the weights, which allows you to reduce error rates and make the model reliable.

How Back Propagation Works: Simple Algorithm

1) Inputs X, arrive through the pre-connected path


2) Input is modeled using real weights W. The weights are usually randomly selected.

3) Calculate the output for every neuron, from the input layer, through the hidden layers, to the output layer.
4) Calculate the error in the outputs
[Error = Actual Output – Desired Output]
5) Travel back from the output layer to the hidden layer to adjust the weights such that the error is decreased.
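The following is a minimal illustrative Python sketch of these steps for a tiny two-layer network with sigmoid units, trained on toy XOR-like data; it is an assumption-laden example (network size, learning rate and seed are arbitrary), not the procedure from the notes, and it typically approaches the desired outputs.

import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)            # desired outputs (XOR)

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)              # hidden layer weights, randomly selected
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)              # output layer weights

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                               # forward pass: input -> hidden
    out = sigmoid(h @ W2 + b2)                             # forward pass: hidden -> output
    error = out - y                                        # actual output - desired output
    d_out = error * out * (1 - out)                        # backward pass: output layer gradient
    d_h = (d_out @ W2.T) * h * (1 - h)                     # backward pass: hidden layer gradient
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)   # adjust weights to decrease the error
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2))                                        # typically approaches [0, 1, 1, 0]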

Types of Back Propagation Networks

 It is classified into two types are:


1) Static Back-propagation
2) Recurrent Backpropagation

Static Back-Propagation:

 It is one kind of backpropagation network which produces a mapping of a static input for static
output.
 It is useful to solve static classification issues like optical character recognition.
Recurrent Back Propagation:

 Recurrent backpropagation is fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.
 The main difference between the two methods is that the mapping is rapid in static back-propagation while it is non-static in recurrent backpropagation.

Advantages of Back Propagation

 Back Propagation is fast, simple and easy to program.
 It has no parameters to tune apart from the number of inputs.
 It is a flexible method as it does not require prior knowledge about the network.
 It is a standard method that generally works well.
 It does not need any special mention of the features of the function to be learned.

Disadvantages of Back Propagation

 The actual performance of backpropagation on a specific problem is dependent on the input data.
 Backpropagation can be quite sensitive to noisy data

ACTIVATION FUNCTION

 Activation functions are mathematical equations that determine the output of a neural network.
 The activation function is a non-linear transformation that we do over the input before sending it to the next layer of neurons or finalizing it as output.
 An activation function is a very important feature of an artificial neural network; it basically decides whether the neuron should be activated or not.
 In a neural network, numeric data points, called inputs, are fed into the neurons in the input layer.
 Each neuron has a weight, and multiplying the input number with the weight gives the output of the neuron, which is transferred to the next layer.
 In artificial neural networks, the activation function defines the output of a node given an input or set of inputs.
 Modern neural networks use a technique called back propagation to train the model, which places an increased computational strain on the activation function and its derivative function.
 The activation function is a mathematical "gate" in between the input feeding the current neuron and its output going to the next layer.
 It can be as simple as a step function that turns the neuron output on and off, depending on a rule or threshold.
 It can be a transformation that maps the input signals into output signals that are needed for the neural network to function.
 Neural networks use non-linear activation functions, which can help the network learn complex data, compute and learn almost any function representing a question, and provide accurate predictions.

Two Types of Activation Functions

1) Linear Activation Function


2) Non-Linear Activation Functions
 A neural network without any activation function in any of its layers is called a linear neural network.
 A neural network which has activation functions like ReLU, sigmoid or tanh in any of its layers, or even in more than one layer, is called a non-linear neural network.
 A linear equation can be defined as an equation having a maximum degree of one.
 A nonlinear equation can be defined as an equation having a maximum degree of 2 or more.
 A linear equation forms a straight line on the graph.
 A nonlinear equation forms a curve on the graph.

NON-LINEAR ACTIVATION FUNCTIONS

 Non-linear functions address the problems of a linear activation function.
 They allow a back propagation derivative function which is related to the inputs.
 They allow "stacking" of multiple layers of neurons to create a Deep Neural Network.
 Modern neural network models use non-linear activation functions.
 They allow the model to create complex mappings between the network's inputs and outputs, which are essential for learning and modeling complex data, such as images, video, audio, and data sets which are non-linear or have high dimensionality.
Types of Non-Linear Activation Function

1) Sigmoid or Logistic Activation Function [S-shape Function]


2) Tanh or Hyperbolic Tangent Activation Function
3) ReLU (Rectified Linear Unit) Activation Function
4) PReLU (Parametric ReLU) Activation Function
5) LReLU(Leaky ReLU )Activation Function
6) Softmax Activation Function

Sigmoid Activation Function

 The Sigmoid Activation Function is also called the Logistic Function.
 The Sigmoid function curve looks like an S-shape.
 It is non-linear and easy to work with when constructing a neural network model.
 It is the same function used in the logistic regression classification algorithm.
 The function takes any real value as input and outputs values in the range 0 to 1.
 It is especially used for models where we have to predict the probability as an output.
 The function is differentiable. That means we can find the slope of the sigmoid curve at any point.
 The function is monotonic, but the function's derivative is not.
 Derivative or Differential: the change in the y-axis with respect to the change in the x-axis; it is also known as the slope.
 Monotonic Function: a function which is either entirely non-increasing or non-decreasing.
Hyperbolic Tangent Function

 Tanh is very similar to the Sigmoid function.
 Its output range is (-1, +1); it is zero-centered, which makes optimization easier.
 It looks like a shifted and vertically scaled form of a sigmoid function, having a range between -1 and 1.
 Tanh works better than the sigmoid function because, with values between -1 and +1, the mean of the activations that come out of the hidden layer is closer to zero.
 The advantage over the sigmoid function is that its derivative is steeper, which means it can get more value (larger gradients) during learning.
 It is more efficient because it has a wider range for faster learning and grading.

ReLU (Rectified Linear Unit) Function

 The Rectified Linear Activation Function, or ReLU.
 ReLU's output is not a straight line; it bends at the x-axis.
 It is a piecewise linear function that will output the input directly if it is positive; otherwise, it will output zero.
 It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.
 The ReLU is the most used activation function in the world.
 It is used in almost all convolutional neural networks and deep learning models.
 The ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is above or equal to zero.
 Range: 0 to Infinity.
 The function and its derivative are both monotonic.
 But the issue is that all the negative values become zero immediately, which decreases the ability of the model to fit or train from the data properly.
 Any negative input given to the ReLU activation function turns the value into zero immediately in the graph, which in turn affects the resulting graph by not mapping the negative values appropriately.

Parametric ReLU

 Parameterized ReLU, or Parametric ReLU, is a variant of ReLU. It is similar to Leaky ReLU [the Leaky ReLU function is an improved version of the ReLU activation function], with a slight change in dealing with negative input values: the slope of the negative part of the function is adaptively learned during the training phase.
 If the learned slope is zero, PReLU becomes ReLU; the positive part is linear in all cases.
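The following is a small illustrative NumPy sketch (not from the notes) of the non-linear activation functions discussed above; the fixed slope a in prelu stands in for the parameter that PReLU would learn during training.

import numpy as np

def sigmoid(z):            # S-shaped, output in (0, 1)
    return 1 / (1 + np.exp(-z))

def tanh(z):               # zero-centered, output in (-1, 1)
    return np.tanh(z)

def relu(z):               # 0 for negative inputs, identity for positive inputs
    return np.maximum(0, z)

def prelu(z, a=0.1):       # negative part scaled by slope a (learned in real PReLU; fixed here)
    return np.where(z >= 0, z, a * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z), prelu(z), sep="\n")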
DROPOUT AS REGULARIZATION

 Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on the training data.
 It prevents overfitting by ensuring that no units are co-dependent.
 It is a very efficient way of performing model averaging with neural networks.
 The term "Dropout" refers to dropping out units (both hidden and visible) in a neural network.
 Regularization is a technique which makes slight modifications to the learning algorithm such that the model generalizes better.
 This in turn improves the model's performance on the unseen data as well.
 Dropout works by randomly setting the outgoing edges of hidden units to zero.
 The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training time, which helps prevent overfitting.
 Note: the Dropout layer only applies when training is set to True, such that no values are dropped during inference. When using model.fit, training will be set to True automatically.
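The following is a minimal illustrative NumPy sketch (not from the notes) of dropout as a random mask applied only during training; the "inverted dropout" scaling shown is one common convention, assumed here for simplicity.

import numpy as np

def dropout(activations, rate=0.5, training=True):
    if not training:                        # no units are dropped at inference time
        return activations
    mask = np.random.rand(*activations.shape) >= rate   # randomly keep units with probability 1 - rate
    return activations * mask / (1 - rate)  # rescale so the expected activation stays the same

h = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout(h, rate=0.5, training=True))
print(dropout(h, training=False))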

Unit: IV Completed
UNIT: V - NON PARAMETRIC MACHINE LEARNING

SYLLABUS

k- Nearest Neighbors - Decision Trees – Branching – Greedy Algorithm - Multiple Branches –


Continuous attributes – Pruning. Random Forests: ensemble learning. Boosting – Ada boost algorithm.
Support Vector Machines – Large Margin Intuition – Loss Function - Hinge Loss –SVM Kernels.

UNIT: V

NON - PARAMETRIC MACHINE LEARNING

 Non-parametric methods seek to best fit the training data in constructing the mapping function.
 They have the ability to generalize to unseen data.
 They are able to fit a large number of functional forms from the training data.
 They do not make strong assumptions about the form of the mapping function [they make predictions based on the training patterns for a new data instance].

Popular Nonparametric Machine Learning Algorithms are:

1) k-Nearest Neighbors
2) Decision Trees like CART…
3) Support Vector Machines

Benefits / Advantages of Non-parametric Machine Learning Algorithms:

1) Flexibility: Capable of fitting a large number of functional forms.


2) Power: No assumptions (or weak assumptions) about the underlying function.
3) Performance: Can result in higher performance models for prediction.

Limitations of Non-parametric Machine Learning Algorithms:

1) More Data: Require a lot more training data to estimate the mapping function.
2) Slower: A lot slower to train as they often have far more parameters to train.
3) Overfitting: More of a risk to overfit the training data, and it is harder to explain why specific predictions are made.

K-NEAREST NEIGHBORS ALGORITHM

 K-Nearest Neighbors is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
 The K-NN algorithm assumes the similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
 The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears it can be easily classified into a well-suited category by using the K-NN algorithm.
 The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used for Classification problems.
 K-NN is a Non-Parametric Algorithm, which means it does not make any assumption on the underlying data.
 It is also called a Lazy Learner Algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
 The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into a category that is most similar to the new data.
 Example:
 We have an image of a creature that looks similar to a cat and a dog.
 We want to know whether it is a cat or a dog.
 Using the KNN algorithm, the identification works on a similarity measure.
 Our KNN model finds the features of the new data set that are similar to the cat and dog images.
 Based on the most similar features, it will put the creature in either the cat or the dog category.

How does K-NN work?

 The K-NN working can be explained on the basis of the below algorithm:
 Step-1: Select the number K of the neighbors
 Step-2: Calculate the Euclidean distance of K number of neighbors
 Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
 Step-4: Among these k neighbors, count the number of the data points in each category.
 Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.
 Step-6: Our model is ready.
 Suppose that by calculating the Euclidean distance we got the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B.
 Since the majority of the nearest neighbors are from category A, this new data point must belong to category A.
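As a minimal sketch (not from the notes), K-NN classification can be run with scikit-learn using Euclidean distance and K = 5; the iris dataset is assumed only as a convenient example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # Step 1: select the number K of neighbors
knn.fit(X_train, y_train)                   # lazy learner: essentially stores the training data
print(knn.predict(X_test[:5]))              # Steps 2-5: distances, nearest neighbors, majority vote
print(knn.score(X_test, y_test))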

DECISION TREES

 Decision Tree is a SUPERVISED LEARNING TECHNIQUE that can be used for both Classification and Regression problems, but mostly it is preferred for solving Classification problems.
 It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
 In a Decision tree, there are two nodes, which are the Decision Node and the Leaf Node.
 Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do not contain any further branches.
 The decisions or the tests are performed on the basis of features of the given dataset.
 It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further branches and constructs a tree-like structure.
 In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.
 A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into sub-trees.

Decision Tree Terminologies

1) Root Node: The root node is from where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
2) Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after getting a leaf node.
3) Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
4) Branch/Sub Tree: A tree formed by splitting the tree.
5) Pruning: Pruning is the process of removing the unwanted branches from the tree.
6) Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child nodes.

How does the Decision Tree algorithm Work?

Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain possible values for the best attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; call the final node a leaf node.

Example:

 A candidate who has a job offer wants to decide whether he should accept the offer or not.
 The decision tree starts with the root node (the Salary attribute, chosen by ASM).
 The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels.
 The next decision node further gets split into one decision node (Cab facility) and one leaf node.
 Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer).
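The following is a minimal sketch (not from the notes) of a CART decision tree classifier with scikit-learn; max_depth is used here as a simple way to limit the tree, and the iris dataset is assumed only for illustration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)  # CART-style splits
tree.fit(X, y)

print(export_text(tree))     # root node, decision nodes and leaf nodes printed as text
print(tree.predict(X[:3]))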

BRANCHING

 A branching algorithm consists of multiple branching rules, and in each node of the search tree one of these rules is selected to be applied based on various criteria.
 The approach is best known for giving fast exact algorithms for the 3-Satisfiability problem.
 In logic and computer science, the Boolean satisfiability problem (sometimes called the propositional satisfiability problem and abbreviated SATISFIABILITY, SAT or B-SAT) is the problem of determining if there exists an interpretation that satisfies a given Boolean formula.
 In contrast, a formula such as "a AND NOT a" is unsatisfiable.

To Solve Boolean Satisfiability Problem

 Satisfiable: If the Boolean variables can be assigned values such that the formula turns out to be TRUE, then we say that the formula is satisfiable.
 Unsatisfiable: If it is not possible to assign such values, then we say that the formula is unsatisfiable.

GREEDY ALGORITHM

 Greedy algorithms work by recursively constructing a set of objects from the smallest possible constituent parts.
 Recursion is an approach to problem solving in which the solution to a particular problem depends on solutions to smaller instances of the same problem.
 Greedy is an algorithmic model that builds up a solution piece by piece, always choosing the next piece that offers the most clear and immediate benefit.
 Problems where choosing the locally optimal option also leads to the global solution are the best fit for Greedy.
 The advantage of using a greedy algorithm is that solutions to smaller instances of the problem can be straightforward and easy to understand.
 The disadvantage is that it is entirely possible that the most optimal short-term solutions may lead to the worst possible long-term outcome.
 Greedy algorithms can be used for optimization purposes, or for finding solutions close to optimal in the case of NP-Hard problems [Non-Deterministic Polynomial Time hard, meaning that provably solving such a problem in polynomial time would also solve thousands of open problems that have been open for decades].

MULTIPLE BRANCHES

6 MAJOR BRANCHES OF ARTIFICIAL INTELLIGENCE (AI)

1) Machine learning
 Machine Learning is the technique that gives computers the potential to learn without being programmed, and it is classified into,
a) Supervised Learning:
b) Unsupervised Learning:
c) Reinforcement Learning:
2) Neural Network

 A neural network replicates the human brain: the human brain comprises an enormous number of neurons, and coding brain neurons into a system or a machine is what the neural network does.

3) Robotics

 Robotics is an interdisciplinary field of science and engineering incorporating mechanical engineering, electrical engineering, computer science, and many others.

4) Expert Systems

 An expert system refers to a computer system that mimics the decision-making intelligence of a human expert.
 The key features of expert systems include being extremely responsive, reliable, understandable and high performing.

5) Fuzzy Logic

 Fuzzy logic is a technique that represents and modifies uncertain information by measuring the degree to which the hypothesis is correct.
6) Natural Language Processing

 NLP is a method that deals in searching, analyzing, understanding and deriving information from the text form of data.

CONTINUOUS ATTRIBUTE

 An attribute is a data field that represents the characteristics or features of a data object.
 For a customer object, attributes can be customer Id, address, etc.
 A set of attributes used to describe a given object is known as an Attribute Vector or Feature Vector.
 Continuous Attribute:
 It is a quantitative type.
 The data can have an infinite number of states; continuous data is of float type.
 There can be many values between 2 and 3.

PRUNING

 Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical and redundant for classifying instances.
 A tree that is too large risks overfitting the training data and poorly generalizing to new samples.
 In the context of code, pruning is the activity of removing unnecessary and unreachable code so as to make the code more readable and easily maintainable.
 The process of adjusting a Decision Tree to minimize "misclassification error" is called pruning.
 Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers.
 The pruned trees are smaller and less complex.
 Types of Pruning:-

1) Pre-Pruning: The tree is pruned by halting its construction early.
2) Post-Pruning: This approach removes a sub-tree from a fully grown tree.

RANDOM FORESTS: Ensemble Learning

 Random forest is a supervised ensemble learning algorithm that is used for both classification and regression problems.
 It operates by constructing a large number of decision trees at training time.
 For classification tasks, the output of the random forest is the class selected by most of the trees.
 It utilizes ensemble learning, which is a technique that combines many classifiers to provide solutions to complex problems.
 Ensemble learning refers to algorithms that combine the predictions from two or more models.
 Ensemble methods combine several decision trees to produce better predictive performance than utilizing a single decision tree.
 A random forest algorithm consists of many decision trees.
 The main principle behind the ensemble model is that a group of Weak Learners come together to form a Strong Learner.
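As a minimal sketch (not from the notes), a random forest can be built with scikit-learn; the breast cancer toy dataset and 100 trees are assumed only for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # ensemble of 100 decision trees
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # predictions follow the class chosen by most trees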

BOOSTING

 Boosting can be referred to as a set of algorithms whose primary function is to convert weak learners to strong learners.
 They have become mainstream in the Data Science industry because they have been around in the machine learning community for years.
 Boosting is an ensemble modeling technique which attempts to build a strong classifier from a number of weak classifiers. It is done by building a model using weak models in series.
How does the Boosting Algorithm Work?
 The basic principle behind the working of the boosting algorithm is to generate multiple weak learners and combine their predictions to form one strong rule.
 After multiple iterations, the weak learners are combined to form a strong learner that will predict a more accurate outcome.

Three Types Of Boosting Algorithms are:

1) AdaBoost (Adaptive Boosting) algorithm.


2) Gradient Boosting algorithm.
3) XG Boost algorithm.

Basic Ensemble Learning

1) Bagging
2) Boosting

Bagging

 Bootstrap Aggregating, also called Bagging.


 It is a machine learning Ensemble Meta-Algorithm.
 It is designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression.
 It also reduces variance and helps to avoid overfitting.

ADABOOST ALGORITHM

 The AdaBoost algorithm, short for Adaptive Boosting, is a Boosting technique used as an Ensemble Method in Machine Learning.
 It is called Adaptive Boosting as the weights are re-assigned to each instance, with higher weights assigned to incorrectly classified instances.
 AdaBoost learns from the mistakes by increasing the weight of misclassified data points.

How AdaBoost Algorithm Works:

 It fits a sequence of weak learners on differently weighted training data.
 It starts by predicting the original data set and gives equal weight to each observation.
Step 0: Initialize the weights of the data points.
Step 1: Train a decision tree.
Step 2: Calculate the weighted error rate of the decision tree. The higher the weight, the more the corresponding error will be weighted.
Step 3: Calculate this decision tree's weight in the ensemble.
Step 4: Update the weights of wrongly classified points.
Step 5: Repeat from Step 1 (until the number of trees we set to train is reached).
Step 6: Make the final prediction.
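As a minimal sketch (not from the notes), AdaBoost can be run with scikit-learn, whose default weak learner is a depth-1 decision tree (a stump); the dataset and number of estimators are assumed only for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=50, random_state=0)  # 50 weak learners trained in sequence
ada.fit(X_train, y_train)          # misclassified points get higher weights in each round
print(ada.score(X_test, y_test))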

GRADIENT BOOSTING

 Gradient Boosting is a popular boosting algorithm in the field of machine learning.
 In gradient boosting, each predictor corrects its predecessor's error.
 In contrast to AdaBoost, the weights of the training instances are not tweaked; instead, each predictor is trained using the residual errors of its predecessor as labels.
 A technique called Gradient Boosted Trees uses CART (Classification and Regression Trees) as its base learner.
 The gradient boosting algorithm is one of the most powerful algorithms.
 The gradient boosting algorithm can be used for predicting not only a continuous target variable (as a Regressor) but also a categorical target variable (as a Classifier).
 It relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error.
 If a small change in the prediction for a case causes no change in error, then the next target outcome of the case is zero.

How Gradient Boosting Algorithm Works:

Step 1: Train a decision tree.
Step 2: Apply the decision tree just trained to predict.
Step 3: Calculate the residual of this decision tree; save the residual errors as the new y.
Step 4: Repeat from Step 1 (until the number of trees we set to train is reached).
Step 5: Make the final prediction.
 Gradient Boosting makes a new prediction by simply adding up the predictions of all trees.
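As a minimal sketch (not from the notes), gradient boosted trees can be run with scikit-learn, where each new tree fits the residual errors of the trees before it; the diabetes dataset and hyper-parameters are assumed only for illustration.

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=3)
gbr.fit(X_train, y_train)          # each tree corrects its predecessors; predictions are added up
print(gbr.score(X_test, y_test))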

SUPPORT VECTOR MACHINE

 Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges.
 It is mostly used in classification problems.
 The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane.
 SVM chooses the extreme points/vectors that help in creating the hyperplane.
 These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.

Types of SVM

1) Linear SVM:
 Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
2) Non-Linear SVM:
 Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
How does it work?

 Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2.
 We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue.
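As a minimal sketch of this idea (not from the notes), scikit-learn's SVC can fit both a linear and a non-linear SVM to two-feature data; the synthetic "moons" dataset is assumed only to stand in for the (x1, x2) example above.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # two features, two classes

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)            # single straight-line boundary
nonlinear_svm = SVC(kernel="rbf", C=1.0).fit(X, y)            # curved boundary via the RBF kernel

print(linear_svm.score(X, y), nonlinear_svm.score(X, y))
print(len(nonlinear_svm.support_vectors_))                    # the support vectors define the hyperplane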

LARGE MARGIN INTUITION

 An intuition for large-margin classification: insisting on a large margin reduces the capacity of the model; the range of angles at which the fat decision surface can be placed is smaller than for a decision hyperplane.
 The SVM classifier creates a maximum-margin hyperplane that lies in a transformed input space and splits the example classes, while maximizing the distance to the nearest cleanly split examples.
 SVM methods have been used as powerful tools in solving classification problems in a wide range of application fields.
 Sometimes people refer to SVMs as large margin classifiers. We will consider what that means and what an SVM hypothesis looks like.
 The SVM cost function contains two cost terms, cost1 (used when y = 1) and cost0 (used when y = 0). What does it take to make these terms small?
 If y = 1, cost1(z) = 0 only when z >= 1. If y = 0, cost0(z) = 0 only when z <= -1.

LOSS FUNCTION

 A loss function is for a single training example.


 It is also sometimes called an error function.
 A cost function, on the other hand, is the average loss over the entire training dataset.
 The optimization strategies aim at minimizing the cost function.

HINGE LOSS

 Hinge loss is a commonly used penalty.
 If there are no violations, there is no hinge loss.
 If there are violations, the hinge loss is proportional to the distance of the violation.
 Hinge loss is primarily used with Support Vector Machine (SVM) classifiers with class labels -1 and 1.
 So make sure you change the label of the 'Malignant' class in the dataset from 0 to -1.
 Hinge loss not only penalizes the wrong predictions, but also the right predictions that are not confident.
 Hinge loss simplifies the mathematics for SVM while still maximizing the margin (as compared to Log-Loss).
 It is used when we want to make real-time decisions without a laser-sharp focus on accuracy.
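The following is a small illustrative NumPy sketch (not from the notes) of the hinge loss for labels in {-1, +1}: confident correct predictions incur zero loss, while wrong or unconfident predictions are penalized in proportion to the violation.

import numpy as np

def hinge_loss(y_true, scores):
    # y_true in {-1, +1}; scores are raw classifier outputs (signed distances from the boundary)
    return np.mean(np.maximum(0, 1 - y_true * scores))

y_true = np.array([1, -1, 1, -1])
scores = np.array([2.0, -0.4, 0.3, 1.5])   # the last two are penalized: unconfident / wrong
print(hinge_loss(y_true, scores))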

SVM KERNELS

 SVM algorithms use a set of mathematical functions that are defined as the kernel.
 The function of the kernel is to take data as input and transform it into the required form.
 Different SVM algorithms use different types of kernel functions.
 These functions can be of different types, for example linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid.
 Kernel functions can be introduced for sequence data, graphs, text, images, as well as vectors.
 The most used type of kernel function is RBF, because it has a localized and finite response along the entire x-axis.
 The kernel functions return the inner product between two points in a suitable feature space, thus defining a notion of similarity, with little computational cost even in very high-dimensional spaces.
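As a minimal sketch (not from the notes), the different kernel functions can be compared in scikit-learn's SVC; the iris dataset and 5-fold cross-validation are assumed only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale")            # same classifier, different kernel function
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(kernel, round(score, 3))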

UNIT : V COMPLETED

You might also like