
Machine Learning

The document provides an overview of machine learning, including its definition, types, and the importance of data preprocessing. It discusses various machine learning algorithms such as supervised, unsupervised, and reinforcement learning, along with key concepts like feature engineering and model evaluation. Additionally, it highlights the significance of quality data and the challenges of overfitting and underfitting in model performance.

Uploaded by

A.S. ROHIT

Basics of Machine Learning

05.10.2023

Dr.M.Prabhavathy
Assistant Professor
Department of AI and DS
Outline

• Basics of Machine Learning


• Types of Machine Learning
• Data Preprocessing in ML
• ML Anatomy – Key concepts
• Machine Learning Process
Quotes……

“A breakthrough in machine learning would be worth ten Microsofts.”
- Bill Gates, Former Chairman, Microsoft

“Machine learning is the next Internet”
- Tony Tether, Director, DARPA

“Machine learning is today’s discontinuity”
- Jerry Yang, CEO, Yahoo
What is Machine Learning?
• Let’s say you want to solve Character Recognition
• Hard way: Understand handwriting/characters
• Latin
• Devanagari
• Symbols: http://detexify.kirelabs.org/classify.html
• Lazy way: Throw data!
Example: Netflix Challenge
• Goal: Predict how a viewer will rate a movie
• 10% improvement = 1 million dollars
• Essence of Machine Learning:
• A pattern exists
• We cannot pin it down mathematically
• We have data on it
Comparison
• Traditional Programming: Data + Program → Computer → Output
• Machine Learning: Data + Output → Computer → Program
What is Machine Learning?
• Learning is “the acquisition of knowledge or skills through experience, study, or by being taught.”
What is Machine Learning?
• If you are a Scientist
• Data → Machine Learning → Understanding
• If you are an Engineer / Entrepreneur
• Get lots of data
• Machine Learning
• ???
• Profit!
What is Machine Learning?
• In basic terms, ML is the process of training a piece of software, called a model,
to make useful predictions using a data set.
• This predictive model can then serve up predictions about previously unseen
data. We use these predictions to take action in a product.
• For example, the system predicts that a user will like a certain video, so the
system recommends that video to the user.

[Figure: Historical data → Machine Learning algorithm (training) → Learned model; New data (input) → Learned model → prediction of future attributes]
Why Machine Learning?
Engineering Better Computing Systems
• Develop systems
• too difficult/expensive to construct manually
• because they require specific detailed skills/knowledge
• knowledge engineering bottleneck

• Develop systems
• that adapt and customize themselves to individual users.
• Personalized news or mail filter
• Personalized tutoring

• Discover new knowledge from large databases


• Medical text mining (e.g. migraines to calcium channel blockers to magnesium)
• data mining
Why Machine Learning?
The Time is Ripe

• Algorithms
• Many basic effective and efficient algorithms available.
• Data
• Large amounts of on-line data available.
• Computing
• Large amounts of computational resources available.
Where does ML fit in?

Machine Learning, A.I. & Deep Learning
LET'S TEACH MACHINES HOW TO LEARN FROM DATA
Growth of Machine Learning
• Machine learning is the preferred approach to
• Speech recognition, Natural language processing
• Computer vision
• Medical outcomes analysis
• Robot control
• Computational biology
• This trend is accelerating
• Improved machine learning algorithms
• Improved data capture, networking, faster computers
• Software too complex to write by hand
• New sensors / IO devices
• Demand for self-customization to user, environment
• It turns out to be difficult to extract knowledge from human experts: the failure of expert systems in the 1980s.
TYPES OF DATA for Machines
LET'S CLASSIFY DATASETS
Datasets
• A dataset is a collection of data in which the data is arranged in some order.
• A dataset can contain anything from a series or an array to a database table.
• A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable, and each row corresponds to one record of the dataset.
• The most widely supported file type for a tabular dataset is the "Comma Separated Values" file, or CSV.

Country   Age   Salary      Purchased
India     38    48000       No
France    43    45000       Yes
Germany   30    54000       No
France    48    65000       No
Germany   40    (missing)   Yes
India     35    58000       Yes
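As a rough illustration of a tabular dataset stored as CSV, the sketch below parses the example table with Python's standard `csv` module. The CSV text is typed in by hand from the table, and the helper name `load_rows` is an assumption made for this example; the empty field stands for the salary that is missing in the table.

```python
import csv
import io

# The example table as CSV text; the empty field is the missing salary.
CSV_TEXT = """Country,Age,Salary,Purchased
India,38,48000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per row (record)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["Age"] = int(row["Age"])
        # Keep a missing salary as None rather than guessing a value.
        row["Salary"] = int(row["Salary"]) if row["Salary"] else None
        rows.append(row)
    return rows

rows = load_rows(CSV_TEXT)
missing = [r["Country"] for r in rows if r["Salary"] is None]
print(len(rows), "rows; missing salary for:", missing)
```

Each column becomes a key (a variable) and each row becomes one dictionary (a record), which mirrors the column/row description above.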
Need of Datasets
• To work with machine learning projects, we need a huge amount of
data, because, without the data, one cannot train ML/AI models.
Collecting and preparing the dataset is one of the most crucial parts
while creating an ML/AI project.
• In building ML applications, datasets are divided into two parts:
• Training dataset
• Test dataset
TYPES OF MACHINE LEARNING
LET'S CLASSIFY DIFFERENT TYPES OF MACHINE LEARNING
Machine Learning algorithms
• Supervised Learning: develop a predictive model based on both input and output data
• Classification
• Regression
• Unsupervised Learning: group and interpret data based only on input data
• Clustering
SUPERVISED LEARNING
AI AND MACHINE LEARNING USE CASES
Supervised learning is when you have input variables (X), called features, and an output variable (Y), called the label or class, and you use an algorithm to learn the mapping function from the input to the output.

Y = f(X)
The goal is to approximate the mapping function so well that when you
have new input data (X) that you can predict the output variables (Y) for
that data.
Steps Involved in Supervised Learning
• First, determine the type of training dataset.
• Collect the labelled training data.
• Split the dataset into a training dataset, a validation dataset, and a test dataset.
• Determine the input features of the training dataset, which should carry enough information for the model to predict the output accurately.
• Choose a suitable algorithm for the model, such as a support vector machine or a decision tree.
• Execute the algorithm on the training dataset. Sometimes a validation set, held out from the training data, is needed to tune the control parameters.
• Evaluate the accuracy of the model on the test set. If the model predicts the correct outputs, the model is accurate.
Supervised learning - Regression
• Regression predictive modeling is the task of approximating a mapping function from input variables to a continuous output variable.
• A continuous output variable is a real-value, such as an integer or
floating point value. These are often quantities, such as amounts
and sizes.
• Examples of regression problems:
• What is the price of the houses?
• What is the height of the students?
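A minimal sketch of a regression model, assuming ordinary least squares on a single input variable (the slides do not name a specific algorithm). The house areas and prices are made-up numbers, and `fit_line` is a hypothetical helper written for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: house area in square metres -> price in thousands.
areas = [50, 70, 90, 110]
prices = [150, 210, 270, 330]

slope, intercept = fit_line(areas, prices)
predicted = slope * 100 + intercept   # continuous prediction for an unseen area
print(slope, intercept, predicted)
```

The learned pair (slope, intercept) plays the role of the mapping function f in Y = f(X); the output is a real number rather than a category.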

Supervised learning - Classification
• Classification predictive modeling is the task of approximating a mapping function from input variables to discrete output variables.
• The output variables are often called labels or categories. The mapping function predicts the class or category for a given observation.
• Examples of classification problems:
• Is the person a boy or a girl?
• Is the email spam or not spam?
• Is she happy or not?
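A small sketch of classification, assuming a 1-nearest-neighbour rule (one simple classifier among many; the slides do not prescribe it). The two email features and all data points are invented for illustration.

```python
import math

def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training point closest to x (1-NN)."""
    distances = [math.dist(point, x) for point in train_X]
    return train_y[distances.index(min(distances))]

# Invented features per email: (number of links, mentions of "free").
train_X = [(0, 0), (1, 0), (7, 4), (9, 6)]
train_y = ["not spam", "not spam", "spam", "spam"]

label_a = nearest_neighbor_predict(train_X, train_y, (8, 5))
label_b = nearest_neighbor_predict(train_X, train_y, (1, 1))
print(label_a, "|", label_b)
```

Unlike regression, the prediction is one of a fixed set of discrete labels.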
UNSUPERVISED LEARNING
AI AND MACHINE LEARNING USE CASES
• Unsupervised learning is when you only have input data (X) and no corresponding output variables.
• The goal of unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
• Algorithms are left to their own devices to discover and present the interesting structure in the data.
Clustering:
• Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group.
• Cluster analysis finds the commonalities between the data objects and categorizes them according to the presence or absence of those commonalities.
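One common clustering algorithm is k-means; the slides do not name it, so treat the sketch below as one possible illustration. It runs on one-dimensional made-up ages with hand-picked starting centers, and `kmeans_1d` is a hypothetical helper.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

ages = [21, 23, 25, 48, 50, 52]
centers, clusters = kmeans_1d(ages, centers=[20, 30])
print(centers, clusters)
```

No labels are given; the two groups emerge only from how close the values are to each other.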

Association:
• An association rule is used for finding relationships between variables in a large database.
• It determines the sets of items that occur together in the dataset. Association rules can make a marketing strategy more effective.
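Association rules are usually scored with support and confidence. These two measures are standard but are not defined in the slides, so the sketch below is an assumption about how "items that occur together" might be quantified; the shopping baskets are made up.

```python
# Made-up shopping baskets (one set of items per transaction).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

bread_milk_support = support({"bread", "milk"})
bread_to_milk_confidence = confidence({"bread"}, {"milk"})
print(bread_milk_support, bread_to_milk_confidence)
```

A rule such as bread -> milk with high support and confidence is the kind of pattern that could inform product placement or promotions.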
Reinforcement learning
Reinforcement learning is quite different from the other two types of machine learning. In reinforcement learning, there is no labelled training data; the algorithm works on a rewards-based system. Reinforcement learning involves an autonomous agent that observes the environment and then selects actions that lead to rewards. This feedback helps the algorithm improve on its own in the long run. A common example of the reinforcement learning approach is an agent learning to play a game.
Data Preprocessing
Data preprocessing is a necessary step before building a model with these features.
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
• A multi-dimensional measure of data quality:
– A well-accepted multi-dimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility
– Broad categories: intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, files, or notes
• Data transformation
– Normalization (scaling to a specific range)
– Aggregation
• Data reduction
– Obtains a reduced representation in volume but produces the same or similar analytical results
– Data discretization: of particular importance, especially for numerical data
– Data aggregation, dimensionality reduction, data compression, generalization
Missing Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– history or changes of the data not registered
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective in certain cases
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples of the same class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-based, such as regression, Bayesian formula, or decision tree
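The two mean-based strategies can be sketched as follows. The salary and "Purchased" values are taken from the example dataset shown earlier, with `None` marking the missing salary; the function names are hypothetical.

```python
from statistics import mean

# (salary, purchased) pairs; None marks the missing salary.
records = [(48000, "No"), (45000, "Yes"), (54000, "No"),
           (65000, "No"), (None, "Yes"), (58000, "Yes")]

def impute_global_mean(rows):
    """Fill missing values with the mean of all observed values."""
    fill = mean(v for v, _ in rows if v is not None)
    return [(fill if v is None else v, label) for v, label in rows]

def impute_class_mean(rows):
    """Fill missing values with the mean of observed values in the same class."""
    by_class = {}
    for v, label in rows:
        if v is not None:
            by_class.setdefault(label, []).append(v)
    return [(mean(by_class[label]) if v is None else v, label)
            for v, label in rows]

global_filled = impute_global_mean(records)[4]
class_filled = impute_class_mean(records)[4]
print(global_filled, class_filled)
```

The class-wise mean uses only the "Yes" rows, so the filled value reflects the group the record belongs to rather than the whole table.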
Noisy Data
• Q: What is noise?
• A: Random error in a measured variable.
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems that require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
– used also for discretization (discussed later)
• Clustering
– detect and remove outliers
• Semi-automated method: combined computer and human inspection
– detect suspicious values and check manually
• Regression
– smooth by fitting the data into regression functions
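A short sketch of the binning method with smoothing by bin means, assuming equal-depth bins of a fixed size. The price list is illustrative data and `smooth_by_bin_means` is a hypothetical helper.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into equal-depth bins, and replace
    every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
smoothed = smooth_by_bin_means(prices, bin_size=4)
print(smoothed)
```

Smoothing by bin median or bin boundaries would follow the same structure, with only the replacement value changed.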
Anatomy of ML Algorithms
Learning Algorithms
Anatomy of Learning Algorithms
Feature Engineering
• One Hot Encoding
• Binning
• Normalization
• Regularization
• Standardization
Three Sets: Training, Validation set, Test Set
Underfitting and Overfitting
Model Performance Assessment
• Confusion Matrix
• Precision/Recall
• Accuracy
• ROC
Hyperparameter Tuning
Feature Engineering
Feature engineering is the pre-processing step of machine learning that extracts features from raw data.
It helps to represent the underlying problem to predictive models in a better way, which, as a result, improves the accuracy of the model on unseen data.

Source: Python ML
One Hot Encoding
• One-hot encoding is one of the most common encoding methods in machine learning.
• The method spreads the values in a column across multiple flag columns and assigns 0 or 1 to them. These binary values express the relationship between the grouped and encoded columns.
• The method changes categorical data, which is challenging for algorithms to understand, into a numerical format, and makes it possible to group the categorical data without losing any information.
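A minimal sketch of one-hot encoding in plain Python, assuming the flag columns are ordered alphabetically by category. The country values are illustrative and `one_hot` is a hypothetical helper.

```python
def one_hot(values):
    """Return (categories, rows); each row has a 1 in the column of its category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

countries = ["India", "France", "Germany", "France"]
categories, encoded = one_hot(countries)
print(categories)
for country, row in zip(countries, encoded):
    print(country, row)
```

Every row contains exactly one 1, so the original category can always be recovered from the flags.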
Binning
Binning is a way to convert continuous numerical variables into discrete variables by categorizing them on the basis of the range of values of the column in which they fall.
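As a sketch of binning used for feature engineering, the code below maps ages to labelled ranges. The cut points and labels are arbitrary choices for this example, and `bin_value` is a hypothetical helper built on the standard `bisect` module.

```python
from bisect import bisect_right

def bin_value(value, edges, labels):
    """Map a number to the label of the range it falls in.
    edges must be sorted; len(labels) must be len(edges) + 1."""
    return labels[bisect_right(edges, value)]

edges = [30, 40]                          # arbitrary cut points
labels = ["under 30", "30-39", "40+"]
ages = [38, 43, 30, 48, 40, 35]
binned = [bin_value(a, edges, labels) for a in ages]
print(binned)
```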
Normalization
Normalization is a scaling technique in Machine Learning applied
during data preparation to change the values of numeric columns
in the dataset to use a common scale.
It is not necessary for all datasets in a model. It is required only
when features of machine learning models have different ranges.
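One common form of normalization is min-max scaling to the range [0, 1]; the slide does not fix a particular formula, so this choice is an assumption. The ages are illustrative data.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; there is no range to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [38, 43, 30, 48, 40, 35]
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

After scaling, a column such as age and a column such as salary would both lie between 0 and 1, so neither dominates purely because of its units.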
Regularization
Sometimes a machine learning model performs well with the training data but does not perform well with the test data. This means the model has also learned the noise in the training data and is not able to predict the output for unseen data; such a model is called overfitted. This problem can be dealt with using a regularization technique.
Regularization is a technique to prevent the model from overfitting by adding extra information to it.
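One way to read "adding extra information" is adding a penalty on large weights to the cost. The sketch below assumes an L2 (ridge) penalty on a one-parameter line through the origin, which has a closed-form solution; the data and the penalty value are made up.

```python
def fit_slope(xs, ys, l2_penalty=0.0):
    """Minimise sum((y - w*x)^2) + l2_penalty * w^2 for the model y = w * x.
    Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + l2_penalty)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2_penalty)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_plain = fit_slope(xs, ys)                    # no penalty
w_ridge = fit_slope(xs, ys, l2_penalty=14.0)   # penalty shrinks the slope toward 0
print(w_plain, w_ridge)
```

The penalty deliberately trades a slightly worse fit on the training data for a simpler model, which is the mechanism that counters overfitting.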
Standardization
• Standardization is an important technique that is mostly performed as a pre-processing step before many machine learning models, to standardize the range of features of the input data set.
• Data standardization is the process of rescaling the attributes so that they have a mean of 0 and a variance of 1.
• The ultimate goal of standardization is to bring all the features to a common scale without distorting the differences in the range of the values.
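A minimal sketch of standardization as z-scores, assuming the population standard deviation is used. The salary values are illustrative.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and variance 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)   # population standard deviation
    return [(v - mu) / sigma for v in values]

salaries = [48000, 45000, 54000, 65000, 58000]
z_scores = standardize(salaries)
print(z_scores)
```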
Train Test & Validation
• Training Set: It is the set of data that is used to train and make the model learn the hidden features/patterns in the data.
• Validation Set: The validation set is a set of data, separate from the training set, that is used to validate our model performance during training.
• Test Set: The test set is a separate set of data used to test the model after completing the training.
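A sketch of how the three sets might be produced from one dataset. The 70/15/15 proportions and the fixed random seed are assumptions chosen for this example, not values from the slides.

```python
import random

def split_dataset(rows, train=0.7, validation=0.15, seed=0):
    """Shuffle rows and cut them into training, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before cutting avoids putting, say, all rows of one class into a single set when the file happens to be sorted.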
Epoch in ML
• An epoch is one complete pass of the training algorithm over all the training data. In one epoch, every training example is used exactly once.
• Batch size is the number of training examples in a single batch.
• An iteration is one update of the model using one batch.

Mathematically, we can understand it as follows:

Total number of training examples = 3000
Batch size = 500
Iterations per epoch = total number of training examples / batch size = 3000 / 500 = 6
So 1 epoch = 6 iterations
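The same arithmetic as code, with one added assumption: when the number of examples is not an exact multiple of the batch size, the last, smaller batch still counts as an iteration, hence the ceiling.

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of batches needed to use every training example once."""
    return math.ceil(num_examples / batch_size)

exact = iterations_per_epoch(3000, 500)     # the case worked out above
uneven = iterations_per_epoch(3100, 500)    # six full batches plus one partial batch
print(exact, uneven)
```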
Underfitting and Overfitting
• A dataset may contain impurities such as noisy data, outliers, missing data, or imbalanced data. Due to these impurities, different problems occur that affect the accuracy and the performance of the model.
• Overfitting and underfitting are the two main problems that cause poor performance in a machine learning model.
Overfitting and Underfitting
• Noise: Noise is meaningless or irrelevant data present in the dataset. It affects
the performance of the model if it is not removed.
• Bias: Bias is a prediction error that is introduced in the model by oversimplifying the machine learning algorithm. It shows up as a systematic difference between the predicted values and the actual values.
• Variance: Variance is how much the model's predictions change with the particular training data. When variance is high, the model performs well with the training dataset but does not perform well with the test dataset.
• Generalization: It shows how well a model is trained to predict unseen data.
Model Performance Assessment

True positives (TP): Predicted positive and are actually positive.


False positives (FP): Predicted positive and are actually negative.
True negatives (TN): Predicted negative and are actually negative.
False negatives (FN): Predicted negative and are actually positive.
Confusion Matrix
It’s just a representation of the above parameters in a matrix format.
Accuracy
• The most commonly used metric to judge a model: the fraction of all predictions that are correct. It is not always a clear indicator of performance, for example when the classes are imbalanced.
Precision
• Percentage of actual positive instances out of the total predicted positive instances. Here the denominator is everything the model predicted as positive in the given dataset. It answers ‘how often is the model right when it says an instance is positive?’.
Recall/Sensitivity
• Percentage of correctly predicted positive instances out of the total actual positive instances. It answers ‘how many of the actual positives did the model find, and how many did it miss?’.
Specificity
• Percentage of correctly predicted negative instances out of the total actual negative instances. For example, of all the healthy patients who do not have cancer, how many were told they don’t have cancer.
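The metrics above can be computed directly from the four confusion-matrix counts. The labels below are made up (1 = positive, 0 = negative), and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count (TP, FP, TN, FN) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(actual, predicted)

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # also called sensitivity
specificity = tn / (tn + fp)
print(tp, fp, tn, fn)
print(accuracy, precision, recall, specificity)
```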
ROC curve
ROC stands for receiver operating characteristic. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) for various threshold values.
It is useful for comparing different predictors.
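A sketch of how the points of an ROC curve are obtained: for each threshold, every score at or above it counts as a positive prediction, and TPR and FPR are recomputed. The scores, labels, and thresholds are made up.

```python
def roc_points(actual, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold.
    A score >= threshold is treated as a positive prediction."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for a, s in zip(actual, scores) if s >= t and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= t and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

actual = [1, 1, 0, 1, 0, 0]                  # 1 = positive, 0 = negative
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]      # model confidence for "positive"
points = roc_points(actual, scores, thresholds=[1.0, 0.75, 0.5, 0.0])
print(points)
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1); a predictor whose curve stays closer to the top-left corner separates the classes better.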
Cost function
• Cost function is an important parameter that determines how well a machine
learning model performs for a given dataset. It calculates the difference between
the expected value and predicted value and represents it as a single real number.
• Cost function is a measure of how wrong the model is in estimating the relationship between X (input) and Y (output). The cost function is also referred to as the loss function.
• Gradient Descent is an optimization algorithm which is used for optimizing the
cost function or error in the model.
• Gradient descent is an iterative process where the model gradually converges
towards a minimum value, and if the model iterates further than this point, it
produces little or zero changes in the loss. This point is known as convergence,
and at this point, the error is least, and the cost function is optimized.
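A minimal sketch of gradient descent on a mean-squared-error cost for the one-parameter model y = w * x. The learning rate, step count, and data are assumptions picked so that the example converges.

```python
def mse_cost(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, learning_rate=0.05, steps=200):
    """Repeatedly move w against the gradient of the MSE cost."""
    w = 0.0
    for _ in range(steps):
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient
    return w

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]     # generated by y = 2 * x

w = gradient_descent(xs, ys)
print(w, mse_cost(w, xs, ys))
```

Near convergence the gradient is close to zero, so further iterations change w and the cost very little.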
Hyperparameter
• Model parameters are configuration variables that are internal to the
model, and a model learns them on its own.
• Hyperparameters are those parameters that are explicitly defined by the
user to control the learning process.
• These are usually defined manually by the machine learning engineer.
• One cannot know the exact best value for hyperparameters for the
given problem. The best value can be determined either by the rule of
thumb or by trial and error.
• Hyperparameters are essential for optimizing the model.
Cross Validation
• Cross-validation is a technique for validating model performance by training the model on a subset of the input data and testing it on a previously unseen subset of the input data.
• In machine learning, there is always a need to test the stability of the model. A model cannot be judged only on the training dataset it was fitted to. For this purpose, reserve a sample of the dataset that is not part of the training dataset, and test the model on that sample before deployment. This complete process comes under cross-validation.
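A common variant is k-fold cross-validation, in which every sample is held out exactly once. The slides describe the idea only in general terms, so the k-fold scheme and the helper `k_fold_indices` below are an illustrative assumption.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(6, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

In practice a model would be trained on each `train` list and scored on the matching `test` list, and the k scores averaged.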
Machine Learning Process
Machine Learning process
[Figure: Dataset / Data Collection → Data preparation (Data Processing, Feature Extraction & Engineering, Feature Scaling & Selection) → Modeling (Machine learning algorithm, Training) → Validation → Hyperparameter tuning → Deployment and monitoring. Repeat till satisfactory model performance.]
Stage 1. Train a Model with Example Datasets (Training)
[Figure: labelled examples (Cat, Dog, Car, Fruit) are fed to the ML model, which produces the OUTPUT. The ML model is a mathematical function.]
Stage 2. Predict with the Trained Model
Recap: Machine Learning Lifecycle
1. Define ML use cases: Define specific ML use cases for the project.
2. Data Exploration: Perform exploratory data analysis to understand the data.
3. Select Algorithm: Choose the right ML algorithm for the task.
4. Data Pipeline & feature engineering: Create the right features from raw data for the ML task.
5. Build ML Model: Develop the first iteration of the ML model.
6. Iterate ML Model: Refine the ML model to improve it.
7. Present Results: Present results of the model in a way that demonstrates its value to stakeholders.
8. Plan for Deployment
9. Operationalize Model
10. Monitor Model
Happy Learning!
