Introduction To ML
BENGALURU
INTRODUCTION TO
MACHINE LEARNING
PRESENTED BY:
AARAN DLIMA
INTRODUCTION:
Human beings learn from their experiences; we have a natural capacity to learn.
A machine can also learn: as it gains more data, it can improve its performance.
HOW DOES ML WORK?
ML builds prediction models: it learns from data and predicts the output for new data when it receives it.
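As a minimal sketch of this learn-and-predict idea (the data and the choice of LinearRegression below are illustrative assumptions, not from the slides):

# A minimal sketch of the fit-and-predict workflow described above.
# The data and the LinearRegression model are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])   # training inputs (past data)
y = np.array([2, 4, 6, 8])           # known outputs (labels)

model = LinearRegression()
model.fit(X, y)                      # the model learns from the data

print(model.predict(np.array([[5]])))  # predicts the output for new data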
CLASSIFICATION OF ML
SUPERVISED LEARNING
UNSUPERVISED LEARNING
REINFORCEMENT LEARNING
SUPERVISED LEARNING
REMEMBER: Labeled data is used here, so the expected result (label) for each training example is known in advance.
Speech recognition
Prediction
Recommendation
APPLICATIONS OF ML
Voice Assistant
Fraud Detection
Language Translation
Stock market prediction
TYPES OF DATA
Numerical data: such as house prices, temperature, etc.
01 SEARCH THE DATASET
Each dataset is different from the other.
Search for a dataset according to the needs of your problem statement.
The data we make use of is usually in CSV, text file, or Excel file format.

02 IMPORTING LIBRARIES
Import some predefined Python libraries.
A few libraries we usually come across are numpy, pandas, matplotlib, seaborn, and scikit-learn.

03 IMPORT DATASETS
After importing the libraries, we need to import the data that we have collected.
Here we make use of functions such as read_csv, read_excel, and so on. (Learn the different ways of importing datasets.)
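A small illustration of steps 02 and 03 (the file name data.csv is a hypothetical placeholder, not from the slides):

# Step 02: import some commonly used libraries
import numpy as np
import pandas as pd

# Step 03: import the collected dataset
# "data.csv" is a hypothetical placeholder file name
df = pd.read_csv("data.csv")   # for Excel files: pd.read_excel("data.xlsx")
print(df.head())               # quick look at the first rows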
STEPS IN DATA
PREPROCESSING
Once the first three steps are done, explore the data. Know what your dependent and independent variables are.
04 DATA CLEANING
Check for missing values.
How do you deal with missing values?
Deleting
Substituting values with the mean, median, or mode
Check for column names to be renamed, and so on.

05 ENCODING
This step is basically for treating categorical variables.
A few encoding techniques are Label Encoder and One-Hot Encoder, with the help of the scikit-learn package.

06 SPLITTING
We split the dataset into X_train, X_test, Y_train, Y_test (see the code sketch below).
Here you need to decide your test size (usually 20% to 30%).
Random state keyword.
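A minimal sketch of steps 04-06, assuming a pandas DataFrame with a numeric column "area", a categorical column "city", and a target "price" (all names and values here are hypothetical):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical data with a missing value and a categorical column
df = pd.DataFrame({
    "area": [1000, 1500, None, 2000],
    "city": ["A", "B", "A", "C"],
    "price": [50, 75, 60, 100],
})

# 04 Data cleaning: substitute the missing value with the mean
df["area"] = df["area"].fillna(df["area"].mean())

# 05 Encoding: treat the categorical variable
df["city"] = LabelEncoder().fit_transform(df["city"])

# 06 Splitting: X/y split, then train/test split (20% test, fixed random state)
X = df[["area", "city"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)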
Dummy Variables:
Dummy variables are variables that take the value 0 or 1. A value of 1 indicates the presence of that category for a given row, and the rest of the dummy variables become 0. With dummy encoding, we end up with a number of columns equal to the number of categories (see the sketch below).
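A quick sketch of dummy encoding with pandas (the "city" column is a hypothetical example):

import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"city": ["A", "B", "A", "C"]})

# One dummy (0/1) column per category: city_A, city_B, city_C
dummies = pd.get_dummies(df["city"], prefix="city", dtype=int)
print(dummies)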
STEPS IN DATA PREPROCESSING:
7. Feature Scaling
We put our variables on the same range and the same scale so that no variable dominates the others (see the sketch below).
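A minimal sketch of feature scaling with scikit-learn's StandardScaler (the values are illustrative; MinMaxScaler is a common alternative):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values)
X = np.array([[1000.0, 1], [1500.0, 3], [2000.0, 2]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has mean 0 and unit variance
print(X_scaled)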
Problem Definition
Data Collection
Data Preparation
Model Building
Model Evaluation
Model Deployment
OVERFITTING & UNDERFITTING
These are the two main problems that we encounter in machine learning, and they degrade the performance of machine learning models.
Bias: Bias is a prediction error introduced into the model by oversimplifying the machine learning algorithm; it is the difference between the predicted values and the actual values.
Variance: If the machine learning model performs well on the training dataset but does not perform well on the test dataset, the model has high variance.
OVERFITTING:
Occurs when our ML model tries to cover all the data points, or more data points than required, in the given dataset.
This means that the more we train our model, the greater the chance of producing an overfitted model.
In the linear regression output graph shown on the slide, the model tries to cover all the data points present in the scatter plot.
The goal of the regression model is to find the best-fit line, but here we have not obtained a good general fit, so the model will generate prediction errors.
HOW TO AVOID OVERFITTING:
Cross Validation
Removing Features
Early Stopping
Regularization
Ensembling
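Two of these techniques, cross-validation and regularization, are easy to sketch with scikit-learn (the data below is randomly generated for illustration only):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Illustrative random regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=100)

# Regularization: Ridge adds an L2 penalty that discourages overly complex fits
model = Ridge(alpha=1.0)

# Cross-validation: evaluate on several train/validation splits instead of just one
scores = cross_val_score(model, X, y, cv=5)
print("R2 per fold:", scores)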
UNDERFITTING:
Underfitting occurs when our machine learning model is not able to
capture the underlying trend of the data.
As a result, it may fail to find the best fit of the dominant trend in the
data.
The model is not able to learn enough from the training data, which reduces accuracy and produces unreliable predictions.
In the plot shown on the slide, the model is unable to capture the data points.
Reducible errors: These errors can be reduced to improve the model accuracy.
Irreducible errors: These errors will always be present in the model, regardless of which algorithm is used. They are caused by unknown variables whose influence on the output cannot be removed.
BIAS
Low Bias: A low bias model will make fewer assumptions
about the form of the target function.
A model that shows high variance learns too much from the training dataset and performs well on it, but does not generalize well to unseen data (it gives good results only on the training dataset).
If the model is very simple with fewer parameters, it may have low
variance and high bias.
For an accurate prediction of the model, algorithms need a low variance and low bias. But this is not possible
because bias and variance are related to each other:
If we decrease the variance, it will increase the bias.
If we decrease the bias, it will increase the variance.
Hence, the Bias-Variance trade-off is about finding the sweet spot that balances the bias and variance errors.
CONFUSION MATRIX
The confusion matrix is a matrix used to determine the
performance of the classification models for a given
set of test data.
For a classifier with 2 prediction classes, the matrix is a 2x2 table; for 3 classes, it is a 3x3 table, and so on.
The matrix is divided into two dimensions, that are predicted values and actual
values along with the total number of predictions.
Predicted values are those values, which are predicted by the model, and actual
values are the true values for the given observations.
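A small sketch of a 2x2 confusion matrix computed with scikit-learn (the actual and predicted labels below are made up for illustration):

from sklearn.metrics import confusion_matrix

# Hypothetical actual (true) and predicted labels for a binary classifier
y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows = actual values, columns = predicted values:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_actual, y_predicted))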
NEED FOR CONFUSION MATRIX
It not only tells us the errors made by the classifier, but also the type of error: type-I (false positive) or type-II (false negative).
With the help of the confusion matrix, we can calculate the different parameters
for the model, such as accuracy, precision, etc.
PERFORMANCE METRICS FOR CLASSIFICATION
Confusion Matrix
Accuracy
Precision
Recall
F Score
ACCURACY
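For reference (the slide does not state it explicitly), accuracy is standardly defined as the fraction of correct predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)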
When to Use Accuracy?
It is good to use the Accuracy metric when the target variable classes in data are
approximately balanced.
For example, if 60% of the images in a fruit dataset are of Apple and 40% are of Mango, the classes are fairly balanced, so accuracy is a reasonable measure of how often the model predicts the correct fruit.
PERFORMANCE METRICS FOR CLASSIFICATION
PRECISION
Precision is the ratio of correctly classified positive samples (True Positives) to the total number of samples classified as positive (whether classified correctly or incorrectly).
Precision helps us gauge how reliable the machine learning model is when it classifies a sample as positive.
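In terms of the confusion matrix, the standard formula (not shown on the slide) is:

Precision = TP / (TP + FP)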
RECALL
Recall is similar to the Precision metric, but it measures the ratio of correctly classified positive samples (True Positives) to the total number of actual positive samples.
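In terms of the confusion matrix, the standard formula (not shown on the slide) is:

Recall = TP / (TP + FN)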
From the definitions above, recall describes the performance of a classifier with respect to false negatives, whereas precision describes its performance with respect to false positives.
So, if we want to minimize false negatives, Recall should be as close to 100% as possible, and if we want to minimize false positives, Precision should be as close to 100% as possible.
F SCORE
The F1 Score is calculated with the help of Precision and Recall: it is the harmonic mean of the two, assigning equal weight to each of them.
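Written out (the slide describes it in words only), the standard formula is:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)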
When to use F-Score?
As the F-score makes use of both precision and recall, it should be used when both of them matter for evaluation but one (precision or recall) is slightly more important to consider than the other.
For example, when false negatives are comparatively more important than false positives, or vice versa.
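A short sketch computing these classification metrics with scikit-learn (the labels are the same made-up example used for the confusion matrix above):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical actual and predicted labels for a binary classifier
y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_actual, y_predicted))
print("Precision:", precision_score(y_actual, y_predicted))
print("Recall   :", recall_score(y_actual, y_predicted))
print("F1 Score :", f1_score(y_actual, y_predicted))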
PERFORMANCE METRICS FOR CLASSIFICATION
It is one of the popular and important metrics for evaluating the performance of the
classification model.
AUC - ROC (Area Under the Receiver Operating Characteristic Curve)
The ROC curve is a graph that shows the performance of a classification model at different threshold levels. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR):
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
To calculate the value at every point on a ROC curve, we could evaluate a logistic regression model many times with different classification thresholds, but this would not be very efficient. Instead, an efficient aggregate method is used, which is known as AUC.
AUC: AREA UNDER THE ROC CURVE
AUC calculates the performance across all the thresholds and provides an aggregate measure.
This means a model whose predictions are 100% wrong has an AUC of 0.0, whereas a model whose predictions are 100% correct has an AUC of 1.0.
AUC: AREA UNDER THE ROC CURVE
When to Use AUC?
AUC should be used to measure how well the predictions are
ranked rather than their absolute values.
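A minimal sketch of computing ROC-AUC from predicted probabilities with scikit-learn (the labels and scores below are made up for illustration):

from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_actual = [1, 0, 1, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

fpr, tpr, thresholds = roc_curve(y_actual, y_scores)   # points of the ROC curve
print("AUC:", roc_auc_score(y_actual, y_scores))       # aggregate measure across all thresholds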
PERFORMANCE METRICS FOR REGRESSION
We cannot use the Accuracy metric (explained above) to evaluate a regression model; instead, the performance of a regression model is reported as errors in its predictions.
Mean Absolute Error
Mean Squared Error
R2 Score
Adjusted R2
PERFORMANCE METRICS FOR REGRESSION
MEAN ABSOLUTE ERROR
Mean Absolute Error measures the absolute difference between the actual and predicted values, where absolute means taking the difference as positive.
Let's take the example of Linear Regression, where the model draws a best-fit line between the dependent and independent variables. To measure the MAE, we calculate the difference between the actual and predicted value for each observation, take its absolute value, and then average these absolute errors over the complete dataset.
MAE = (1/N) * Σ |Y - Y'|  (Y is the actual value, Y' is the predicted value, and N is the total number of observations.)
PERFORMANCE METRICS FOR REGRESSION
MEAN SQUARED ERROR
Mean Squared Error measures the average of the squared differences between the actual and predicted values. MSE is usually positive and non-zero.
Because the differences are squared, even small errors are penalized, which can lead to an over-estimation of how bad the model is.
MSE = (1/N) * Σ (Y - Y')^2  (Y is the actual value, Y' is the predicted value, and N is the total number of observations.)
PERFORMANCE METRICS FOR REGRESSION
R2 SCORE
R squared error, also known as the Coefficient of Determination, is another popular metric used for regression model evaluation.
It determines the goodness of fit.
The R squared score will always be less than or equal to 1, regardless of whether the values are large or small.
PERFORMANCE METRICS FOR REGRESSION
ADJUSTED R2
Adjusted R squared, as the name suggests, is an improved version of the R squared error: it adjusts the score for the number of independent variables in the model.
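A short sketch computing these regression metrics with scikit-learn (the actual and predicted values are made up for illustration; adjusted R2 has no built-in helper here, so it is computed from R2):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual (Y) and predicted (Y') values
y_actual    = [3.0, 5.0, 7.5, 10.0]
y_predicted = [2.5, 5.5, 7.0, 9.0]

mae = mean_absolute_error(y_actual, y_predicted)
mse = mean_squared_error(y_actual, y_predicted)
r2  = r2_score(y_actual, y_predicted)

# Adjusted R2: n = number of observations, p = number of independent variables (assumed 1 here)
n, p = len(y_actual), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("MAE:", mae, "MSE:", mse, "R2:", r2, "Adjusted R2:", adj_r2)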