Module 2 - AIML Part 1
Bengaluru
Module 2
Supervised machine learning algorithms
CONTENT
• Introduction to the Machine Learning (ML) Framework
• Types of ML,
• Types of variables/features used in ML algorithms,
• One-hot encoding,
• Simple Linear Regression,
• Multiple Linear Regression,
• Evaluation metrics for regression model
Artificial Intelligence
2
CONTENT
• Classification models
• Decision Tree algorithms using Entropy and Gini Index as measures
of node impurity,
• Model evaluation metrics for classification algorithms,
• Multi-class classification
• Class Imbalance problem.
• Naïve Bayes Classifiers
• Naive Bayes model for sentiment classification
What is Machine Learning
Definition
A machine learning system learns from historical data, builds a prediction model, and predicts the output for any new data it receives.
The accuracy of the predicted output depends in part on the amount of data: a larger, representative dataset generally helps to build a model that predicts the output more accurately.
Importance of machine learning
Finding hidden patterns and extracting useful information from data
Solving complex problems and decision making in many fields
(applications)
Feature of ML: Data-Driven Technology
It is similar to data mining, as it also deals with large amounts of data.
It uses data to detect patterns in a given dataset.
It learns from past data and improves automatically.
Applications of machine learning
Lifecycle of machine learning
Gathering Data
The goal of this step is to identify the different data sources, as data can be collected from sources such as files, databases, or the internet.
The quantity and quality of the collected data determine the accuracy of the predictions.
This step includes the below tasks:
Identify various data sources
Collect data
Integrate the data obtained from different sources. The resulting coherent set of data is called a dataset.
Lifecycle of machine learning
Data preparation: This step can be further divided into two processes:
Data exploration: Understand the characteristics, format, and quality of the data, and find correlations, general trends, and outliers.
Data pre-processing: Clean the data to address quality issues such as missing values, duplicate data, invalid data, and noise, which can be handled using filtering techniques.
Data Wrangling
Reorganizing, mapping, and transforming raw, unstructured data into a usable format.
This step involves data aggregation and data visualization.
Lifecycle of machine learning
Data Analysis
The aim of this step is to build a machine learning model to analyze the data and
review the outcome.
Train Model
Datasets are used to train the model with various machine learning algorithms so that it learns the patterns, rules, and features in the data.
Test Model
The accuracy of the model is tested against the requirements of the project or problem.
Deployment
Once the model performs acceptably on the available data, it is deployed. This is similar to writing the final report for a project.
Difference between AI & ML
Artificial Intelligence: Artificial intelligence is a technology which enables a machine to simulate human behavior.
Machine learning: Machine learning is a subset of AI which allows a machine to automatically learn from past data without programming explicitly.
Artificial Intelligence: AI is working to create an intelligent system which can perform various complex tasks.
Machine learning: Machine learning is working to create machines that can perform only those specific tasks for which they are trained.
Artificial Intelligence: The main applications of AI are Siri, Expert System, Online game playing, intelligent humanoid robot, etc.
Machine learning: The main applications of machine learning are Online recommender system and Google search algorithms.
Machine learning - dataset
Contd…
o Validation Dataset
It is used to check that an increase in accuracy on the training dataset also holds when the model is tested on data that was not used in training.
If the accuracy on the training dataset increases while the accuracy on the validation dataset decreases, the model has high variance, i.e., it is overfitting.
o Test Dataset
When we repeatedly change the model based on its output on the validation set, we unintentionally let the model peek into the validation set, and as a result the model may overfit the validation set as well.
To overcome this issue, a test dataset is held out and used only to test the final model in order to confirm its accuracy.
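As a rough illustration of the three datasets described above, the sketch below shuffles a list of rows and cuts it into training, validation, and test parts. The function name `split_dataset`, the 60/20/20 proportions, and the seed are assumptions chosen for the example, not values given in the slides.

```python
import random

def split_dataset(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into training, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                # touched only once, for the final check
    val = shuffled[n_test:n_test + n_val]   # used while tuning the model
    train = shuffled[n_test + n_val:]       # used to fit the model
    return train, val, test

# Hypothetical dataset of ten rows (here just the numbers 0..9).
train, val, test = split_dataset(range(10))
print(len(train), len(val), len(test))  # 6 2 2
```

Keeping the test part out of every tuning decision is what lets it confirm the final accuracy.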
Contd…
How to get the datasets / Popular sources for ML dataset
Kaggle Dataset
UCI Machine Learning Repository
Datasets via AWS
Google's Dataset Search Engine
Microsoft Dataset
Awesome Public Dataset Collection
Government Datasets
Computer Vision Datasets
Scikit-learn dataset
Machine learning - data preprocessing
Definition: Data pre-processing is a process of preparing the raw data and
making it suitable for a machine learning model.
Significance
Real-world data contains noise and missing values, and may be in a format that cannot be used directly by machine learning models.
Data pre-processing is the required task of cleaning the data and making it suitable for a machine learning model.
Steps
Getting the dataset
The data is usually stored in a CSV ("Comma-Separated Values") file, a file format for saving tabular data such as spreadsheets. It is useful for large datasets, and such files can be read directly in programs.
CONTD…
Importing libraries
o NumPy: used for numerical and mathematical operations in the code.
o Matplotlib: used to plot charts in Python.
o Pandas: used for importing and managing the datasets. It is an open-source data manipulation and analysis library.
Importing datasets
read_csv() function: used to read a CSV file.
To separate the matrix of features (independent variables) from the dependent variable in the dataset, the iloc[] indexer is used to extract the required rows and columns.
To extract the dependent variable, we again use the pandas iloc[] indexer.
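The import and extraction steps above might look like the following minimal sketch. The column names and values are made up for illustration, and an in-memory string stands in for a CSV file on disk; it also assumes the dependent variable is the last column.

```python
import io
import pandas as pd

# Hypothetical data; io.StringIO stands in for a path such as "data.csv".
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
"""
dataset = pd.read_csv(io.StringIO(csv_text))

# All rows, every column except the last: the matrix of features.
X = dataset.iloc[:, :-1].to_numpy()
# All rows, only the last column: the dependent variable.
y = dataset.iloc[:, -1].to_numpy()

print(X.shape)   # (3, 3)
print(list(y))   # ['No', 'Yes', 'No']
```

`iloc` selects by integer position, so the same two lines work for any dataset whose target is stored in the final column.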
Contd…
Encoding Categorical Data
o The LabelEncoder class from scikit-learn's preprocessing module is used to encode categorical values as integers.
o Categorical variables usually have strings for their values. Many machine learning algorithms do not support string values for the input variables. Therefore, we need to replace these string values with numbers. This process is called categorical variable encoding.
o Types of encoding:
One-hot encoding
Dummy encoding
One-hot encoding
Contd…
For example, suppose we have a categorical variable Color with three categories: “Red”, “Green” and “Blue”. To encode this variable using one-hot encoding, we need three dummy variables. A dummy (binary) variable takes the value 0 or 1 to indicate the exclusion or inclusion of a category.
• In one-hot encoding,
“Red” color is encoded as [1 0 0] vector of size 3.
“Green” color is encoded as [0 1 0] vector of size 3.
“Blue” color is encoded as [0 0 1] vector of size 3.
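The Color example can be reproduced with a few lines of plain Python. The helper name `one_hot` is hypothetical; it is only a sketch of the idea, and libraries such as pandas (`pandas.get_dummies`) offer ready-made equivalents.

```python
def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == category else 0 for category in categories]

categories = ["Red", "Green", "Blue"]
for color in categories:
    print(color, one_hot(color, categories))
# Red [1, 0, 0]
# Green [0, 1, 0]
# Blue [0, 0, 1]
```

The order of `categories` fixes which position belongs to which color, so the same list must be used for training data and for new data.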
Contd…
Dummy encoding
Dummy encoding also uses dummy (binary) variables. Instead of creating as many dummy variables as there are categories (k) in the variable, dummy encoding uses k-1 dummy variables; the omitted category is represented by all zeros.
To encode the same Color variable with three categories using dummy encoding, we need only two dummy variables.
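A minimal sketch of dummy encoding for the same Color variable follows. Which category is dropped is a free choice; treating the first one ("Red") as the all-zero baseline is an assumption made for this example, as is the helper name `dummy_encode`.

```python
def dummy_encode(value, categories):
    """Encode with k-1 binary variables; the first category is the all-zero baseline."""
    return [1 if value == category else 0 for category in categories[1:]]

categories = ["Red", "Green", "Blue"]
for color in categories:
    print(color, dummy_encode(color, categories))
# Red [0, 0]
# Green [1, 0]
# Blue [0, 1]
```

Dropping one column loses no information, because the baseline category is implied whenever all remaining dummy variables are 0.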
Contd…
Splitting dataset into training, validation and test set
Feature scaling
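Feature scaling is the final step of data pre-processing in machine learning. It puts the features on a comparable scale so that no feature dominates merely because of its units.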
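Two common ways to scale features are standardization and min-max normalization. The slides do not say which one is intended, so the sketch below shows both on a small made-up matrix (age and salary columns are assumptions for illustration).

```python
import numpy as np

# Hypothetical feature matrix: one row per sample, columns = (age, salary).
X = np.array([[25.0, 40000.0],
              [35.0, 60000.0],
              [45.0, 80000.0]])

# Standardization: each column gets zero mean and unit standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: each column is rescaled to the range [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_minmax[:, 0].tolist())  # [0.0, 0.5, 1.0]
```

In practice the means, standard deviations, minima, and maxima are usually computed on the training set only and then reused for the validation and test sets.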
Feature selection techniques in Machine Learning
Feature selection
• A feature is an attribute that has an impact on a problem or is useful for the
problem, and choosing the important features for the model is known as feature
selection.
• Definition: Feature selection is a way of selecting the subset of the most relevant
features from the original features set by removing the redundant, irrelevant, or
noisy features.
• Significance of Feature Selection:
It helps in avoiding the curse of dimensionality.
It simplifies the model so that it can be interpreted more easily by the researchers.
It reduces the training time.
It reduces overfitting and hence improves generalization.
feature selection techniques
Supervised feature selection techniques
• Wrapper Methods
In wrapper methodology, feature selection is treated as a search problem, in which different combinations of features are formed, evaluated, and compared with other combinations.
The algorithm is trained iteratively on a subset of the features.
Based on the output of the model, features are added or removed, and the model is trained again with the new feature set.
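One possible wrapper strategy is greedy forward selection, sketched below. It is only one of several search strategies, and the details are assumptions for the example: a least-squares linear model serves as the learner, validation mean squared error is the score, and the data is synthetic, with only features 0 and 2 actually influencing the target.

```python
import numpy as np

def validation_mse(X_train, y_train, X_val, y_val, columns):
    """Fit least squares on the chosen columns and return the validation MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, columns]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, columns]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ coef - y_val) ** 2))

def forward_selection(X_train, y_train, X_val, y_val):
    """Greedily add the feature that most reduces the validation error."""
    selected, remaining = [], list(range(X_train.shape[1]))
    # Baseline score: always predict the training mean.
    best_score = float(np.mean((y_val - y_train.mean()) ** 2))
    while remaining:
        # Retrain once per candidate feature and compare the results.
        scores = {j: validation_mse(X_train, y_train, X_val, y_val, selected + [j])
                  for j in remaining}
        best_j = min(scores, key=scores.get)
        if scores[best_j] >= best_score:
            break  # no candidate improves the model any further
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected

# Synthetic data: only features 0 and 2 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
selected = forward_selection(X[:150], y[:150], X[150:], y[150:])
print(selected[:2])  # [0, 2]
```

Because the model is retrained for every candidate subset, wrapper methods tend to cost much more computation than filter methods.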
Contd…
• Filter Methods
In the filter method, features are selected on the basis of statistical measures.
This method does not depend on the learning algorithm and chooses the features as a pre-processing step.
The filter method filters out irrelevant features and redundant columns by ranking them with different metrics.
The advantage of filter methods is that they need little computation time and are less prone to overfitting the data.
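A simple filter of this kind ranks features by a statistic computed without training any model. The sketch below uses the absolute Pearson correlation with the target as that statistic; this metric, the helper name `rank_by_correlation`, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def rank_by_correlation(X, y):
    """Return feature indices sorted by |Pearson correlation| with y, highest first."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda j: scores[j], reverse=True)

# Synthetic data in which only feature 1 is related to the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ranking = rank_by_correlation(X, y)
print(ranking[0])  # 1
```

Features at the bottom of the ranking are candidates for removal before any learning algorithm is run.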
Contd…
• Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features while keeping the computational cost low. They are fast, like filter methods, but generally more accurate.
These methods are also iterative: they evaluate each training iteration and find the features that contribute the most to training in that iteration.
Feature engineering for machine learning
• Feature engineering is the pre-processing step of machine learning, which
extracts features from raw data.
• Feature engineering in ML contains mainly four processes:
Feature Creation: finding the most useful variables to be used in a predictive model.
Transformations: adjusting the predictor variables to improve the accuracy and performance of the model.
Feature Extraction: an automated feature engineering process that generates new variables by extracting them from the raw data.
Feature Selection: selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features.
Feature engineering techniques for ML
• Imputation: handles missing or irregular values within the dataset, for example by filling them with the mean or median.
• Handling Outliers: Standard deviation can be used to identify the outliers. The Z-score can also be used to detect outliers.
• Log Transform: helps in handling skewed data; after the transformation the distribution is closer to normal.
• Binning: groups continuous values into intervals, which smooths noisy data.
• Feature Split: the process of splitting a feature into two or more parts to make new features.
• One hot encoding: a technique that converts categorical data into a form that machine learning algorithms can use, which helps them make good predictions.
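Several of the techniques listed above fit in a few lines of NumPy. The values, the median as fill value, the 3-standard-deviation threshold, and the bin edges are all assumptions picked for this sketch rather than settings taken from the slides.

```python
import numpy as np

# Hypothetical measurements with one missing value and one extreme value.
values = np.array([12.0, 15.0, np.nan, 14.0, 13.0, 16.0,
                   15.0, 14.0, 13.0, 12.0, 15.0, 200.0])

# Imputation: replace missing entries with the median of the observed ones.
filled = np.where(np.isnan(values), np.nanmedian(values), values)

# Outlier detection: flag points more than 3 standard deviations from the mean.
z_scores = (filled - filled.mean()) / filled.std()
outliers = filled[np.abs(z_scores) > 3]

# Log transform: compresses a right-skewed range; log1p also handles zeros.
logged = np.log1p(filled)

# Binning: replace each value with the index of the interval it falls into
# (below 13 -> 0, from 13 up to but not including 15 -> 1, 15 and above -> 2).
bins = np.digitize(filled, [13.0, 15.0])

print(outliers.tolist())  # [200.0]
```

The order matters here: the missing value is filled first so that the mean and standard deviation used for the Z-scores are defined.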
Machine learning - types
Supervised learning
• Supervised learning is the type of machine learning in which machines are trained using well-labelled training data and, on the basis of that data, predict the output. Labelled data means that the input data is already tagged with the correct output.
Types of Supervised learning
Regression Algorithms
• Regression algorithms are used when the output variable is continuous and there is a relationship between the input variable and the output variable. Example: Weather forecasting, Market Trends, etc.
• Regression algorithms under supervised learning: Linear Regression, Non-Linear Regression, Polynomial Regression, Ridge Regression and Lasso Regression.
Classification Algorithms
• Classification algorithms are used when the output variable is categorical, which means it takes one of a set of classes such as Yes-No, Male-Female, True-False, etc. Example: Spam Filtering.
• Classification algorithms under supervised learning: Random Forest, Decision Trees, Logistic Regression, Support Vector Machines
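To make the regression case concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares on labelled data. The function name and the data points are hypothetical; the points were chosen to lie exactly on y = 2x + 1 so that the result is easy to check by hand.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Labelled training data: each input x is paired with its known output y.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)  # 2.0 1.0

# Predict the output for a new, unseen input.
prediction = slope * 5.0 + intercept  # 11.0
```

A classifier follows the same train-then-predict pattern, except that the predicted output is a class label instead of a number.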
Important Terminologies
Dependent Variable: The main factor in regression analysis that we want to predict or understand is called the dependent variable. It is also called the target variable.
Independent Variable: The factors which affect the dependent variable, or which are used to predict its values, are called independent variables, also called predictors.
Outliers: An outlier is an observation with either a very low or a very high value in comparison to the other observed values. An outlier may distort the result, so it should be detected and handled.
Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity.
Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is called overfitting.
Underfitting: If our algorithm does not perform well even with the training dataset, the problem is called underfitting.
Unsupervised Machine Learning
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
Unsupervised Machine Learning - Types
Types:
Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group.
Association: Association rule learning is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset.
Unsupervised learning algorithms: K-means clustering, Hierarchical clustering, Anomaly detection, Neural Networks, Principal Component Analysis, Apriori algorithm
Advantage of Unsupervised Learning: It is often preferable because unlabeled data is easier to obtain than labeled data.
Disadvantages of Unsupervised Learning: The result might be less accurate, as the input data is not labeled and the algorithms do not know the exact output in advance.
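As an illustration of clustering on unlabeled data, the sketch below runs K-means on one-dimensional points. The helper name, the points, the starting centroids, and the fixed number of iterations are assumptions for the example; real implementations usually choose starting centroids randomly and stop once the assignments no longer change.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Unlabeled points that visibly form two groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

No labels are used anywhere: the groups emerge only from the distances between the points.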
Semi-Supervised Learning
• Semi-Supervised learning is a type of Machine Learning algorithm that lies
between Supervised and Unsupervised machine learning.
• The main aim of semi-supervised learning is to effectively use all the available
data, rather than only labelled data like in supervised learning.
• Advantages: It can be efficient, and it addresses some drawbacks of supervised and unsupervised learning algorithms.
• Disadvantages
Iteration results may not be stable.
We cannot apply these algorithms to network-level data.
Accuracy can be low.
Reinforcement Learning
• Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by trial and error, taking actions, learning from experience, and improving its performance.
• The agent is rewarded for each good action and punished for each bad action; hence the goal of a reinforcement learning agent is to maximize the rewards.
• In reinforcement learning, there is no labelled data as in supervised learning, and agents learn from their experiences only.
• Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.
• A reinforcement learning problem can be formalized using a Markov Decision Process (MDP).
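The feedback loop described above can be sketched with tabular Q-learning, one common reinforcement learning algorithm (the slides do not name a specific one). Everything in the example is an assumption for illustration: a corridor of four states, a reward of 1 for reaching the last state, and the chosen learning rate, discount factor, and exploration rate.

```python
import random

N_STATES, GOAL = 4, 3   # states 0..3; reaching state 3 ends the episode
ACTIONS = (-1, +1)      # action 0 moves left, action 1 moves right

def step(state, action):
    """Environment: return (next_state, reward) for a move in the corridor."""
    next_state = min(max(state + action, 0), GOAL)
    return next_state, (1.0 if next_state == GOAL else 0.0)

def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action] value estimates
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            # Trial and error: explore sometimes, otherwise use the best estimate.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward = step(state, ACTIONS[a])
            # Move the estimate toward reward + discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q = q_learning()
policy = ["left" if left > right else "right" for left, right in q[:GOAL]]
print(policy)  # ['right', 'right', 'right']
```

The agent is never told that moving right is correct; it discovers this because only that direction eventually leads to a reward.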
Reinforcement Learning
• Categories of Reinforcement Learning
Positive Reinforcement Learning: increases the tendency that the required behavior will occur again by adding something (a reward).
Negative Reinforcement Learning: increases the tendency that the specific behavior will occur again by avoiding a negative condition.
• Applications: Robotics, Text Mining, Resource Management, Video Games.
• Advantages
The learning model of RL is similar to the way human beings learn, so it can achieve very accurate results.
It helps in achieving long-term results.
• Disadvantages
RL algorithms require large amounts of data and computation.
Too much reinforcement learning can lead to an overload of states, which can weaken the results.