Machine Learning
Machine learning is one of the most talked-about technologies today, and it is growing rapidly. We use machine learning in daily life, often without knowing it; Google Maps, Google Assistant, and Alexa are among the most widely used real-world applications of machine learning.
Training data: Training data is the data you feed to a machine learning model so that it can analyse it and discover patterns and dependencies. A training set has three main characteristics:
Size. The training set normally contains more data than the testing set. The more data you feed to the machine, the better the resulting model. Once a machine learning algorithm is provided with data from your records, it learns patterns from it and builds a model for decision-making.
Label. A label is the value we try to predict (the response variable). For example, if we want to forecast whether a patient will be diagnosed with cancer based on their symptoms, the response variable is Yes/No for the cancer diagnosis. Training data can be labelled or unlabelled, and both types are used in machine learning for different cases.
Case details. Algorithms make decisions based on the information you give them. You need to make sure the data is relevant and covers a variety of cases with different outcomes. For instance, if you need a model that can score potential borrowers, you need to include in the training set the information you normally collect about a potential client during the application process (a small illustrative sketch follows this list):
Name and contact details, location;
Demographics, social and behavioural characteristics;
Source of origin (Meta Ads, website landing page, third party, etc.)
Factors connected to the behaviour/activity on websites, conversions, time spent on
a website, number of clicks, and more.
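Below is a rough, purely hypothetical sketch of what such a labelled training set could look like in Python using pandas; none of the column names or values come from a real application.

import pandas as pd

# Hypothetical labelled training set for the borrower-scoring example.
training_data = pd.DataFrame({
    "location":       ["Pune", "Mumbai", "Delhi", "Nagpur"],
    "source":         ["Meta Ads", "landing page", "third party", "Meta Ads"],
    "time_on_site_s": [120, 35, 410, 60],           # behavioural feature
    "clicks":         [14, 3, 27, 5],               # behavioural feature
    "repaid_loan":    ["Yes", "No", "Yes", "No"],   # the label (response variable)
})

features = training_data.drop(columns=["repaid_loan"])  # what the model learns from
labels = training_data["repaid_loan"]                   # what the model tries to predict
print(features)
print(labels)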
Testing Data:
After the machine learning model is built, you need to check its work. The AI
platform uses testing data to evaluate the performance of your model and adjust
or optimize it for better forecasts. The testing set should have the following
characteristics:
Unseen. You cannot reuse the same information that was in the training set.
Large. The dataset should be large enough for the machine to make meaningful predictions.
Representative. The data should be representative of the actual data the model will encounter.
Luckily, you don't need to collect new data and compare predictions with actual data manually. The AI can split the existing data into two parts, put the testing set aside while training, and then run tests comparing predictions with actual results all by itself. Data science offers different options for splitting data, but the most common proportions are 70/30, 80/20, and 90/10. With a large dataset at hand, we can then check whether the resulting model makes good predictions or not.
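As a concrete sketch of such a split, the snippet below uses scikit-learn's train_test_split to hold out 30% of a synthetic dataset as the unseen testing set (a 70/30 split); the data and model are illustrative only.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))              # 1000 synthetic examples, 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # a simple synthetic label

# Put 30% of the data aside as the unseen testing set (70/30 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print("accuracy on unseen test data:", model.score(X_test, y_test))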
The comparison of predictions with actual results is usually summarised by four outcomes (the confusion matrix). To make them simple to understand, consider the following definitions (a small counting example follows the list):
Wolf is the positive class
No wolf is the negative class
True Positive (TP): is the result that we get if we correctly predict the
positive class
False Positive (FP): is the outcome that we get if we predict a negative class
as a positive class
True Negative (TN): is the result that we get if we correctly predict the
negative class
False Negative (FN): is the outcome that we get if we predict a positive class
as a negative class
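The short sketch below counts these four outcomes for a handful of made-up wolf/no-wolf predictions using scikit-learn's confusion_matrix; the actual and predicted values are invented for illustration.

from sklearn.metrics import confusion_matrix

# 1 = wolf (positive class), 0 = no wolf (negative class)
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
# TP: predicted wolf and there was a wolf
# FP: predicted wolf but there was no wolf
# TN: predicted no wolf and there was no wolf
# FN: predicted no wolf but there was a wolf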
4. Cross-validation:
Cross-validation is a technique for evaluating ML models by training several ML
models on subsets of the available input data and evaluating them on the
complementary subset of the data. Use cross-validation to detect overfitting, i.e., failing
to generalize a pattern.
In Amazon ML, you can use the k-fold cross-validation method to perform cross-
validation. In k-fold cross-validation, you split the input data into k subsets of data
(also known as folds). You train an ML model on all but one (k-1) of the subsets, and
then evaluate the model on the subset that was not used for training. This process
is repeated k times, with a different subset reserved for evaluation (and excluded
from training) each time.
The following diagram shows an example of the training subsets and complementary
evaluation subsets generated for each of the four models that are created and trained
during a 4-fold cross-validation. Model one uses the first 25 percent of data for
evaluation, and the remaining 75 percent for training. Model two uses the second
subset of 25 percent (25 percent to 50 percent) for evaluation, and the remaining
three subsets of the data for training, and so on.
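A minimal sketch of that 4-fold split, using scikit-learn's KFold on a toy array of 20 rows (standing in for any dataset), is shown below; the printout mirrors the description above.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(20, 1)         # 20 toy examples
kf = KFold(n_splits=4, shuffle=False)    # 4 folds, taken in order

for i, (train_idx, eval_idx) in enumerate(kf.split(X), start=1):
    print(f"model {i}: evaluate on rows {eval_idx.min()}-{eval_idx.max()}, "
          f"train on the remaining {len(train_idx)} rows")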
Cross-validation is a technique in which we train our model using a subset of the dataset and then evaluate it using the complementary subset of the dataset. The three steps involved in cross-validation are as follows:
Reserve some portion of the sample dataset.
Train the model using the rest of the dataset.
Test the model using the reserved portion of the dataset.
Validation: In this method, we perform training on 50% of the given dataset and use the remaining 50% for testing. The major drawback is that, because we train on only 50% of the dataset, the other 50% may contain important information that the model never sees during training, i.e., higher bias.
LOOCV (Leave One Out Cross Validation): In this method, we train on the whole dataset except for a single data point, test on that one point, and iterate over every data point. It has advantages as well as disadvantages. An advantage is that we make use of all data points, so the bias is low. The major drawback is higher variance in the test results, since each test is made against a single data point; if that point is an outlier, the variance can be large. Another drawback is the execution time, as the procedure iterates as many times as there are data points.
K-Fold Cross Validation: In this method, we split the dataset into k subsets (known as folds), train on k-1 of the subsets, and leave one subset out for evaluating the trained model. We iterate k times, with a different subset reserved for testing each time. A comparative sketch of these three methods follows.
Dimensionality Reduction: An intuitive example is e-mail classification, where we need to decide whether an incoming e-mail is spam or not. This can involve a large number of features, such as whether or not the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, etc. However, some of these features may overlap. Similarly, a classification problem that relies on both humidity and rainfall can be collapsed into just one underlying feature, since the two are correlated to a high degree. Hence, we can reduce the number of features in such problems. A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple 2-dimensional space and a 1-D problem to a simple line. The concept can be illustrated by splitting a 3-D feature space into two 2-D feature spaces and, if the features turn out to be correlated, reducing the number of features even further.
Feature extraction: This reduces data in a high-dimensional space to a lower-dimensional space, i.e., a space with a smaller number of dimensions.
Methods of Dimensionality Reduction: The various methods used for dimensionality reduction include the following (a short PCA sketch follows the list):
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
Generalized Discriminant Analysis (GDA)
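As a small example of feature extraction, the sketch below uses PCA from scikit-learn to project a synthetic 3-dimensional feature space onto 2 principal components; humidity and rainfall are made strongly correlated as in the example above, and temperature is an assumed third feature added only for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
humidity = rng.normal(60, 10, size=200)                 # feature 1
rainfall = 0.8 * humidity + rng.normal(0, 2, size=200)  # strongly correlated with humidity
temperature = rng.normal(25, 5, size=200)               # assumed third feature
X = np.column_stack([humidity, rainfall, temperature])  # shape (200, 3)

pca = PCA(n_components=2)                               # keep 2 principal components
X_reduced = pca.fit_transform(X)                        # shape (200, 2)
print("explained variance ratio:", pca.explained_variance_ratio_)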