Machine Learning
05.10.2023
Dr.M.Prabhavathy
Assistant Professor
Department of AI and DS
Outline
What is Machine Learning?
• Let’s say you want to solve Character Recognition
• Hard way: Understand handwriting/characters
• Latin
• Devanagari
• Symbols: http://detexify.kirelabs.org/classify.html
• Lazy way: Throw data!
Example: Netflix Challenge
• Goal: Predict how a viewer will rate a movie
• 10% improvement = 1 million dollars
Comparison
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
What is Machine Learning?
• If you are a Scientist
Data → Machine Learning → Understanding
What is Machine Learning?
• In basic terms, ML is the process of training a piece of software, called a model,
to make useful predictions using a data set.
• This predictive model can then serve up predictions about previously unseen
data. We use these predictions to take action in a product.
• For example, the system predicts that a user will like a certain video, so the
system recommends that video to the user.
Historical data → Machine Learning algorithm (training)
Why Machine Learning?
Engineering Better Computing Systems
• Develop systems
• too difficult/expensive to construct manually
• because they require specific detailed skills/knowledge
• knowledge engineering bottleneck
• Develop systems
• that adapt and customize themselves to individual users.
• Personalized news or mail filter
• Personalized tutoring
• Algorithms
• Many basic effective and efficient algorithms available.
• Data
• Large amounts of on-line data available.
• Computing
• Large amounts of computational resources available.
Where does ML fit in?
[Figure: A.I., Machine Learning, and Deep Learning]
LET’S TEACH MACHINES HOW TO LEARN FROM DATA
Growth of Machine Learning
• Machine learning is the preferred approach to
• Speech recognition, Natural language processing
• Computer vision
• Medical outcomes analysis
• Robot control
• Computational biology
• This trend is accelerating
• Improved machine learning algorithms
• Improved data capture, networking, faster computers
• Software too complex to write by hand
• New sensors / IO devices
• Demand for self-customization to user, environment
• It turns out to be difficult to extract knowledge from human experts, which led to the failure of
expert systems in the 1980s.
TYPES OF DATA for Machines
LET’S CLASSIFY DATASETS
Datasets
• A dataset is a collection of data arranged in some order.
• A dataset can contain anything from a simple array to a database table.
• A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable and each row corresponds to one record of the dataset.
• The most widely supported file type for a tabular dataset is the comma-separated values (CSV) file.

Country   Age   Salary   Purchased
India     38    48000    No
France    43    45000    Yes
Germany   30    54000    No
France    48    65000    No
Germany   40             Yes
India     35    58000    Yes
Need of Datasets
• To work with machine learning projects, we need a huge amount of
data, because, without the data, one cannot train ML/AI models.
Collecting and preparing the dataset is one of the most crucial parts
while creating an ML/AI project.
• In building ML applications, datasets are divided into two parts:
Training dataset
Test Dataset
TYPES OF MACHINE LEARNING
LET’S CLASSIFY DIFFERENT TYPES OF MACHINE LEARNING
Machine Learning algorithms
• Supervised Learning: develop a predictive model based on both input and output data
– Classification
– Regression
• Unsupervised Learning
SUPERVISED LEARNING
Supervised learning is when you have input variables (X), called features, and an output variable (Y), called the label or class, and you use an algorithm to learn the
mapping function from the input to the output.
Y = f(X)
The goal is to approximate the mapping function so well that, when you
have new input data (X), you can predict the output variable (Y) for
that data.
Steps Involved in Supervised Learning
• First, determine the type of training dataset.
• Collect the labelled training data.
• Split the data into a training dataset, a validation dataset, and a test dataset.
• Determine the input features of the training dataset, which should carry
enough information for the model to predict the output accurately.
• Choose a suitable algorithm for the model, such as a support vector
machine or a decision tree.
• Execute the algorithm on the training dataset. Sometimes a validation set,
held out from the training data, is needed to tune control parameters.
• Evaluate the accuracy of the model on the test set. If the model
predicts the correct outputs for the test set, the model is accurate.
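The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy data is made up, the helper names (`split_dataset`, `predict_1nn`, `accuracy`) are hypothetical, and a 1-nearest-neighbour classifier stands in for the support vector machine or decision tree mentioned in the list.

```python
# Toy labelled data (made up for illustration):
# features = (age, salary in thousands), label = purchased or not
labelled = [
    ((25, 30), "No"), ((27, 32), "No"), ((30, 35), "No"), ((32, 38), "No"),
    ((45, 70), "Yes"), ((48, 75), "Yes"), ((50, 80), "Yes"), ((52, 85), "Yes"),
]

def split_dataset(rows, test_every=4):
    """Hold out every `test_every`-th row as test data; the rest is training data."""
    train = [r for i, r in enumerate(rows) if (i + 1) % test_every != 0]
    test = [r for i, r in enumerate(rows) if (i + 1) % test_every == 0]
    return train, test

def predict_1nn(train, x):
    """Return the label of the training example closest to x."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = min(train, key=lambda row: squared_distance(row[0], x))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(1 for x, y in test if predict_1nn(train, x) == y)
    return correct / len(test)

train_set, test_set = split_dataset(labelled)
test_accuracy = accuracy(train_set, test_set)
print(len(train_set), len(test_set), test_accuracy)
```

The test rows are never used to make the predictions, which is what makes the accuracy figure a fair estimate for unseen data.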
Supervised learning - Regression
• Regression predictive modeling is the task of approximating a
mapping function from input variables to a continuous output
variable.
• A continuous output variable is a real value, such as an integer or
floating-point value. These are often quantities, such as amounts
and sizes.
• Examples of regression problems:
• What is the price of a house?
• What is the height of a student?
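A regression model of the kind described above can be sketched with a one-variable least-squares line fit. The house sizes and prices below are made up for illustration, and the function name `fit_line` is hypothetical; real problems usually involve more features and noisy data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data: house size (square metres) -> price (thousands)
sizes = [50, 60, 80, 100]
prices = [110, 130, 170, 210]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 90 + intercept   # continuous output for an unseen house
print(slope, intercept, predicted_price)
```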
Supervised learning - Classification
• Classification predictive modeling is the task of approximating a
mapping function from input variables to discrete output variables.
• The output variables are often called labels or categories. The
mapping function predicts the class or category for a given
observation.
• Examples of classification problems:
• Is the person a boy or a girl?
• Is the email spam or not spam?
• Is she happy or not?
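As a rough sketch of classification, the spam example can be modelled with a nearest-centroid classifier: each class is summarised by the average of its training points, and a new email gets the label of the closest average. The two features, the data, and the names `fit_centroids` and `classify` are all assumptions made for this illustration.

```python
from collections import defaultdict

# Made-up training emails: (number of links, number of exclamation marks) -> label
training = [
    ((8, 6), "spam"), ((7, 9), "spam"), ((9, 6), "spam"),
    ((1, 0), "not spam"), ((0, 1), "not spam"), ((2, 2), "not spam"),
]

def fit_centroids(rows):
    """Compute the mean feature vector (centroid) of each class."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (f1, f2), label in rows:
        sums[label][0] += f1
        sums[label][1] += f2
        counts[label] += 1
    return {label: (s[0] / counts[label], s[1] / counts[label])
            for label, s in sums.items()}

def classify(centroids, x):
    """Return the discrete label whose centroid is closest to x."""
    def squared_distance(c):
        return (c[0] - x[0]) ** 2 + (c[1] - x[1]) ** 2
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

centroids = fit_centroids(training)
print(classify(centroids, (6, 5)), classify(centroids, (1, 2)))
```

Unlike regression, the output here is one of a fixed set of categories rather than a number.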
UNSUPERVISED LEARNING
AI AND MACHINE LEARNING USE CASES
• Unsupervised learning is when you have only input data (X) and no corresponding output variables.
• The goal of unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
• Algorithms are left to their own devices to discover and present the interesting structure in the data.
Clustering:
Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or
no similarities with the objects of another group.
Cluster analysis finds the commonalities between
the data objects and categorizes them according to the
presence or absence of those commonalities.
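One widely used clustering method is k-means, which the slides do not name; the sketch below uses it only as an example of grouping similar objects. The points and the starting centroids are made up, and the starting centroids are passed in explicitly so the run is deterministic.

```python
def kmeans(points, centroids, max_iter=100):
    """Basic k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # New centroid = mean of the cluster; keep the old one if the cluster is empty
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:   # assignments have stopped changing
            break
        centroids = new_centroids
    return centroids, clusters

# Made-up 2-D points forming two visible groups; no labels are provided
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(final_centroids)
print(clusters)
```

No output labels are used anywhere: the groups come only from the distances between the input points.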
Association:
An association rule is used for finding
relationships between variables in a large
database.
It determines the sets of items that occur
together in the dataset. Association rules can make
a marketing strategy more effective.
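Association rules are commonly scored with two standard measures, support and confidence, which the slides do not define. The sketch below computes both for a rule "bread → milk" over a handful of made-up shopping baskets; the function names are hypothetical.

```python
# Made-up transactions: each set is one customer's basket
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def count_containing(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count_containing(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    both = count_containing(antecedent | consequent, baskets)
    return both / count_containing(antecedent, baskets)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A rule with high support and high confidence suggests items that are frequently bought together, which is the kind of pattern a marketing team can act on.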
Reinforcement learning
Reinforcement learning is quite different from the other two types of machine learning. In reinforcement learning, there is no labelled training data. The algorithm works on a rewards-based system. Reinforcement learning involves an autonomous agent that observes the environment and then selects an action that will lead to rewards. This helps the algorithm improve in the long run on its own. A common example of the reinforcement learning approach is training an agent to play a game.
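The reward-driven loop described above can be illustrated with tabular Q-learning, one common reinforcement learning algorithm that the slides do not name. The environment is a made-up five-cell corridor in which the agent earns a reward only when it reaches the last cell; all names and parameter values are assumptions for this sketch.

```python
import random

N_STATES = 5                 # cells 0..4 in a corridor; cell 4 is the goal
ACTIONS = ["left", "right"]

def step(state, action):
    """Environment: move one cell; reward 1.0 only when the goal is reached."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values from rewards alone, exploring with random actions."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]    # q[state][action index]
    for _ in range(episodes):
        state = 0
        for _ in range(1000):                    # safety cap on episode length
            a = rng.randrange(2)                 # the agent selects an action
            next_state, reward = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward reward + discounted best future value
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if state == N_STATES - 1:            # goal reached, episode ends
                break
    return q

q_table = train()
# Greedy policy: in each non-goal cell, pick the action with the highest learned value
policy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
print(policy)
```

The agent is never told that "right" is correct; it discovers this from the rewards it collects.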
Data Preprocessing
Data preprocessing is a necessary step before building a model with these features.
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
Data integration
Integration of multiple databases, data cubes, files, or notes
Data transformation
Normalization (scaling to a specific range)
Aggregation
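Normalization to a specific range can be sketched as min-max scaling. The salary values are taken from the dataset table shown earlier in these slides; the function name `min_max_scale` and the default range of 0 to 1 are assumptions for this example.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # all values equal: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

salaries = [48000, 45000, 54000, 65000, 58000]
scaled = min_max_scale(salaries)
print(scaled)
```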
Data reduction
• Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization: a form of data reduction that is particularly important for numerical data
• Data aggregation, dimensionality reduction, data compression, generalization
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– history or changes of the data not registered
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (assuming the task is classification—not
effective in certain cases)
• Use the most probable value to fill in the missing value: inference-based such as
regression, Bayesian formula, decision tree
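Both options can be sketched on the dataset table shown earlier, where one Germany row has no salary. Filling with the column mean is used here as a simple stand-in for the inference-based methods listed above (regression, Bayesian formula, decision tree); the function names are hypothetical.

```python
rows = [
    {"Country": "India",   "Age": 38, "Salary": 48000, "Purchased": "No"},
    {"Country": "France",  "Age": 43, "Salary": 45000, "Purchased": "Yes"},
    {"Country": "Germany", "Age": 30, "Salary": 54000, "Purchased": "No"},
    {"Country": "France",  "Age": 48, "Salary": 65000, "Purchased": "No"},
    {"Country": "Germany", "Age": 40, "Salary": None,  "Purchased": "Yes"},
    {"Country": "India",   "Age": 35, "Salary": 58000, "Purchased": "Yes"},
]

def drop_incomplete(records):
    """Option 1: ignore any tuple that has a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def fill_with_mean(records, column):
    """Option 2 (simplified): replace missing values with the mean of the known ones."""
    known = [r[column] for r in records if r[column] is not None]
    mean = sum(known) / len(known)
    return [dict(r, **{column: mean}) if r[column] is None else dict(r)
            for r in records]

complete_rows = drop_incomplete(rows)
filled_rows = fill_with_mean(rows, "Salary")
print(len(complete_rows), filled_rows[4]["Salary"])
```

Dropping rows loses the other values in the tuple, while filling keeps the row at the cost of introducing an estimated value.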
Noisy Data
• Q: What is noise?
• A: Random error in a measured variable.
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
– used also for discretization (discussed later)
• Clustering
– detect and remove outliers
• Semi-automated method: combined computer and human inspection
– detect suspicious values and check manually
• Regression
– smooth by fitting the data into regression functions
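The binning method can be sketched as follows: sort the values, cut them into equi-depth bins, and replace each value with its bin mean. The price list is made up for illustration, and the helper names are hypothetical. The sketch assumes the number of values divides evenly by the number of bins.

```python
def equi_depth_bins(values, n_bins):
    """Sort the data and partition it into bins holding the same number of values."""
    ordered = sorted(values)
    size = len(ordered) // n_bins      # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

prices = [21, 4, 28, 8, 34, 9, 25, 15, 26, 21, 29, 24]
bins = equi_depth_bins(prices, 3)
smoothed = smooth_by_means(bins)
print(bins)
print(smoothed)
```

Smoothing by bin median or bin boundaries follows the same pattern with a different replacement value per bin.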
Learning Algorithms
Anatomy of ML Algorithms
Anatomy of Learning Algorithms
Feature Engineering
• One Hot Encoding
• Binning
• Normalization
• Regularization
• Standardization
Three Sets: Training, Validation set, Test Set
Underfitting and Overfitting
Model Performance Assessment
• Confusion Matrix
• Precision/Recall
• Accuracy
• ROC
Hyperparameter Tuning
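The model performance measures named in the list above can be computed from the four cells of a binary confusion matrix. The labels and predictions below are made up, and the function name `confusion_counts` is hypothetical; the formulas are the standard definitions of precision, recall, and accuracy.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Made-up true labels and model predictions
y_true = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
y_pred = ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="Yes")
precision = tp / (tp + fp)                 # of predicted positives, how many were right
recall = tp / (tp + fn)                    # of actual positives, how many were found
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, precision, recall, accuracy)
```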
Feature Engineering
Feature engineering is the pre-processing step of machine learning that
extracts features from raw data.
It helps to represent the underlying problem to predictive models in a
better way, which, as a result, improves the accuracy of the model on
unseen data.
Source: Python ML
One Hot Encoding
• One-hot encoding is one of the most common encoding methods in machine learning.
• The method spreads the values in a column across multiple flag columns and assigns 0 or 1 to them. These binary values express the relationship between the grouped and encoded columns.
• The method changes categorical data, which is challenging for algorithms to understand, into a numerical format and makes it possible to group the categorical data without losing any information.
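A minimal sketch of one-hot encoding, applied to the Country column from the dataset table shown earlier. The function name `one_hot_encode` is hypothetical, and ordering the flag columns alphabetically is a choice made here for reproducibility.

```python
def one_hot_encode(values):
    """Return the flag column names and one 0/1 row per input value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

countries = ["India", "France", "Germany", "France", "Germany", "India"]
columns, flags = one_hot_encode(countries)
print(columns)
for country, row in zip(countries, flags):
    print(country, row)
```

Each row contains exactly one 1, so the original category can always be recovered from the flags.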
Binning
Machine Learning Process
Machine Learning process
Data preparation (Data Processing; Feature Extraction & Engineering; Feature Scaling & Selection) → Machine learning algorithm (Training) → Validation → Hyper-parameter tuning → Deployment and monitoring
Stage 1. Train a Model with Example Datasets (Training)
[Figure labels: Cat, Dog, Car, Fruit; OUTPUT]
An ML model is a mathematical function.
Stage 2. Predict with the Trained Model
Recap: Machine Learning Lifecycle
1. Define ML use cases: Define specific ML use cases for the project.
2. Data Exploration: Perform exploratory data analysis to understand the data.
3. Select Algorithm: Choose the right ML algorithm for the task.
4. Data Pipeline & feature engineering: Create the right features from raw data for the ML task.
5. Build ML Model: Develop the first iteration of the ML model.
6. Iterate ML Model: Refine the ML model to improve it.
7. Present Results: Present results of the model in a way that demonstrates its value to stakeholders.
8. Plan for Deployment
9. Operationalize Model
10. Monitor Model
Happy Learning!