Unit-1 MLA
Machine Learning
Machine learning is a growing technology that enables computers to
learn automatically from past data. It uses various algorithms to build
mathematical models and make predictions using historical data or
information. Currently, it is used for tasks such as image recognition,
speech recognition, email filtering, Facebook auto-tagging,
recommender systems, and many more. At a broad level, machine
learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
Supervised learning is a type of machine learning method in which we
provide sample labeled data to the machine learning system in order
to train it; on that basis, it predicts the output. Supervised learning can
be further grouped into two categories of algorithms, with a short
sketch after the list:
o Classification
o Regression
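To make this concrete, here is a minimal supervised-learning sketch in Python, assuming scikit-learn is available; the iris dataset and the logistic-regression model are illustrative choices, not ones prescribed by this unit:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Labeled sample data: features X and known outputs y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on the labeled data, then predict outputs for unseen inputs.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.predict(X_test[:5]))
```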
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns
without any supervision.
The machine is trained with a set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data
without any supervision. The goal of unsupervised learning is to
restructure the input data into new features or groups of objects with
similar patterns, and to find useful insights in a huge amount of data.
It can be further classified into two categories of algorithms, with a
clustering sketch after the list:
o Clustering
o Association
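A minimal clustering sketch, again assuming scikit-learn; the synthetic two-blob data and the choice of KMeans are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: no categories are given to the algorithm.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# KMeans groups the points into clusters of objects with similar patterns.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)
```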
3) Reinforcement Learning
Reinforcement learning is a feedback-based learning method in which
a learning agent gets a reward for each right action and a penalty for
each wrong action. The agent learns automatically from this feedback
and improves its performance. In reinforcement learning, the agent
interacts with the environment and explores it. The goal of the agent
is to collect the most reward points and thereby improve its
performance.
A robotic dog that automatically learns the movement of its limbs is
an example of reinforcement learning; a toy sketch of the
reward-and-penalty loop is given below.
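The following toy sketch illustrates the reward-and-penalty idea with a two-action agent; the action names, the hidden reward probabilities, and the simple explore-then-exploit rule are all hypothetical choices for illustration:

```python
import random

# A toy reward-based agent: two actions, one of which pays off more often.
# The agent tracks total reward per action and gradually prefers the better one.
rewards = {"left": 0.3, "right": 0.7}   # hidden probabilities of a reward
totals = {"left": 0.0, "right": 0.0}
counts = {"left": 0, "right": 0}

random.seed(0)
for step in range(1000):
    # Explore occasionally, otherwise exploit the best-known action.
    if random.random() < 0.1 or step < 10:
        action = random.choice(["left", "right"])
    else:
        action = max(totals, key=lambda a: totals[a] / max(counts[a], 1))
    reward = 1 if random.random() < rewards[action] else -1  # reward or penalty
    totals[action] += reward
    counts[action] += 1

print(counts)  # the agent ends up choosing "right" far more often
```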
o The period from 1974 to 1980 was a tough time for AI and ML
researchers; this duration is known as the AI winter.
o During this period, machine translation failed, people lost interest
in AI, and government funding for research was reduced.
Applications of Machine Learning
1. Image Recognition:
Image recognition is one of the most common applications of machine
learning. It is used to identify objects, persons, places, etc. in digital
images. A popular use case of image recognition and face detection is
the automatic friend-tagging suggestion on Facebook.
2. Speech Recognition
While using Google, we get a "Search by voice" option; this comes
under speech recognition, and it is a popular application of machine
learning.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which
shows us the correct path with the shortest route and predicts the
traffic conditions. It predicts traffic using two kinds of information:
o Real-time location of the vehicle, from the Google Maps app and
sensors
o Average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make the app better. It
takes information from the user and sends it back to its database to
improve performance.
4. Product recommendations:
Machine learning is widely used by e-commerce and entertainment
companies such as Amazon and Netflix for product recommendations.
Whenever we search for a product on Amazon, we start getting
advertisements for the same product while surfing the internet in the
same browser, and this is because of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is
self-driving cars, in which machine learning plays a significant role.
Tesla, the most popular car manufacturing company, is working on
self-driving cars. It uses an unsupervised learning method to train the
car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as
important, normal, or spam, and machine learning is behind this
filtering. Some of the spam filters used are:
o Content filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
7. Virtual Personal Assistant:
Virtual personal assistants such as Google Assistant, Alexa, and Siri
help us find information using our voice instructions. These assistants
record our voice instructions, send them to a server in the cloud,
decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning makes our online transactions safe by detecting
fraud. For each genuine transaction, the output is converted into hash
values, and these values become the input for the next round. Genuine
transactions follow a specific pattern, which changes for a fraudulent
transaction; the system detects this change and makes our online
transactions more secure. A rough sketch of this hash-chaining idea is
given below.
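A rough illustration of the idea that each transaction can be reduced to a hash value that feeds the next round, using Python's standard hashlib; the transaction fields and the chaining scheme are hypothetical, not a real fraud-detection system:

```python
import hashlib

def chained_hash(previous_hash: str, transaction: dict) -> str:
    # Combine the previous round's hash with the current transaction fields.
    payload = previous_hash + "|" + repr(sorted(transaction.items()))
    return hashlib.sha256(payload.encode()).hexdigest()

h = "0" * 64  # starting value for the first round
for tx in [{"amount": 120, "to": "A"}, {"amount": 75, "to": "B"}]:
    h = chained_hash(h, tx)
    print(h[:16])  # a tampered transaction would change every later hash
```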
Machine Learning Life Cycle
The machine learning life cycle involves seven major steps, which are
given below:
o Gathering data
o Data preparation
o Data wrangling
o Data analysis
o Train the model
o Test the model
o Deployment
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The
goal of this step is to identify the data requirements of the problem
and obtain the data. In this step, we need to identify the different data
sources, as data can be collected from various sources such as files,
databases, the internet, or mobile devices. It is one of the most
important steps of the life cycle: the quantity and quality of the
collected data determine the accuracy of the output, and the more
data we have, the more accurate the prediction will be.
2. Data preparation
After collecting the data, we need to prepare it for the further steps.
Data preparation is the step where we put our data into a suitable
place and prepare it for use in machine learning training.
In this step, we first put all the data together and then randomize its
ordering. Data preparation has two parts:
o Data exploration:
It is used to understand the nature of the data we have to work
with. We need to understand the characteristics, format, and
quality of the data. A better understanding of the data leads to
an effective outcome; here we look for correlations, general
trends, and outliers.
o Data pre-processing:
The next part is pre-processing the data for analysis.
3. Data Wrangling
Data wrangling is the process of cleaning raw data and converting it
into a usable format. It involves cleaning the data, selecting the
variables to use, and transforming the data into a proper format to
make it more suitable for analysis in the next step. It is one of the
most important steps of the complete process, because cleaning is
required to address quality issues such as the following (a small
sketch follows the list):
o Missing values
o Duplicate data
o Invalid data
o Noise
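A small data-wrangling sketch, assuming pandas; the column names and the exact cleaning rules are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, 25, np.nan, 40, 200],  # a missing value and an invalid entry
    "city": ["Pune", "Pune", "Delhi", "Delhi", "Agra"],
})

df = df.drop_duplicates()                         # duplicate data
df["age"] = df["age"].fillna(df["age"].median())  # missing values
df = df[df["age"].between(0, 120)]                # invalid data / noise
print(df)
```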
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step.
This step involves selecting analytical techniques, building models,
and reviewing the result. Hence, in this step, we take the data and use
machine learning algorithms to build the model.
5. Train Model
Now the next step is to train the model. In this step, we train our model
to improve its performance and obtain a better outcome for the
problem. We use datasets to train the model with various machine
learning algorithms. Training is required so that the model can learn
the various patterns, rules, and features.
6. Test Model
Once our machine learning model has been trained on a given dataset,
we test it. In this step, we check the accuracy of our model by
providing a test dataset to it. A combined train-and-test sketch is
given below.
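A combined train-and-test sketch, assuming scikit-learn; the breast-cancer dataset and the decision-tree model are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train step: fit the model on the training dataset.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Test step: check accuracy on the held-out test dataset.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```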
7. Deployment
The last step of the machine learning life cycle is deployment, where
we deploy the model in a real-world system.
Artificial Intelligence
Artificial intelligence is a field of computer science that makes
computer systems able to mimic human intelligence. It is made up of
the two words "artificial" and "intelligence", which together mean
"human-made thinking power." Hence we can define it as a branch of
computer science by which we can create intelligent machines that
can behave like humans, think like humans, and make decisions.
Based on capabilities, AI can be divided into three types:
o Weak AI
o General AI
o Strong AI
Currently, we are working with weak AI and general AI. The future of AI
is strong AI, which is said to be more intelligent than humans.
Machine learning
Machine learning is about extracting knowledge from data. It can be
defined as a subfield of artificial intelligence that enables machines
to learn from past data or experiences without being explicitly
programmed. Machine learning works with three types of methods:
o Supervised learning
o Reinforcement learning
o Unsupervised learning
What is Hypothesis?
A hypothesis is defined as a supposition or proposed explanation
based on insufficient evidence or assumptions. It is just a guess based
on some known facts, but it has not yet been proven. A good
hypothesis is testable, so that it turns out either true or false.
For example, suppose a scientist claims that UV rays are harmful to
the eyes, and we further assume that they may cause blindness. This
may or may not turn out to be true. Assumptions of this kind are called
hypotheses.
There are some common methods given to find out the possible
hypothesis from the Hypothesis space, where hypothesis space is
represented by uppercase-h (H) and hypothesis by lowercase-h
(h). These are defined as follows:
Hypothesis (h):
It is defined as the approximate function that best describes the target
in supervised machine learning algorithms. It is primarily based on the
data as well as the bias and restrictions applied to the data. For
example, a linear hypothesis has the form
y = mx + b
where
y: range (output)
x: domain (input)
m: slope
b: intercept (constant)
If we divide this coordinate plane in such a way that it helps to predict
the output for given test data, each such division corresponds to one
hypothesis. Depending on the data, algorithm, and constraints, the
coordinate plane can be divided in many different ways.
Hypothesis space (H) is the set of all legal possible ways to divide the
coordinate plane so that inputs are best mapped to the proper
outputs. A small worked sketch of picking a hypothesis from this space
is given below.
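A small worked sketch of choosing a hypothesis h(x) = mx + b from the hypothesis space of all lines, assuming NumPy; the toy data and the least-squares fit are illustrative:

```python
import numpy as np

# Toy data generated from y = 2x + 1 with some noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2 * x + 1 + rng.normal(0, 0.5, x.size)

# The hypothesis space H is all lines h(x) = m*x + b;
# least squares picks the single hypothesis h that best fits the data.
m, b = np.polyfit(x, y, 1)
print(f"h(x) = {m:.2f}x + {b:.2f}")  # should be close to 2x + 1
```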
Hypothesis in Statistics
Similar to the hypothesis in machine learning, it is also considered an
assumption about the output. However, it is falsifiable, which means
it can be rejected in the presence of sufficient evidence.
What is Bias?
In general, a machine learning model analyses the data, finds patterns
in it, and makes predictions. While training, the model learns these
patterns in the dataset and applies them to test data for prediction.
While making predictions, a difference occurs between the values
predicted by the model and the actual or expected values; this
difference is known as bias error, or error due to bias. Bias can be
seen as the inability of a machine learning algorithm such as linear
regression to capture the true relationship between the data points.
Every algorithm begins with some amount of bias, because bias arises
from assumptions in the model that make the target function simpler
to learn. A model has either:
o Low bias: A low-bias model makes fewer assumptions about the
form of the target function.
o High bias: A high-bias model makes more assumptions and
becomes unable to capture the important features of the
dataset. A high-bias model also cannot perform well on new
data.
Generally, a linear algorithm has high bias, as this makes it learn fast.
The simpler the algorithm, the more bias it is likely to introduce,
whereas a nonlinear algorithm often has low bias.
What is Variance?
Variance tells how much the model's predictions would change if it
were trained on a different training dataset. A model that shows high
variance learns a lot and performs well on the training dataset, but
does not generalize well to an unseen dataset. As a result, such a
model gives good results on the training dataset but shows high error
rates on the test dataset.
Since, with high variance, the model learns too much from the dataset,
it leads to overfitting. A model with high variance has the following
problem:
o Low training error, but a test error that is much higher than the
training error.
Bias-Variance Trade-Off
While building a machine learning model, it is really important to take
care of bias and variance in order to avoid overfitting and underfitting.
If the model is very simple, with few parameters, it may have low
variance and high bias; whereas if the model has a large number of
parameters, it will have high variance and low bias. So it is necessary
to strike a balance between the two, and this balance between the
bias error and variance error is known as the bias-variance trade-off.
A small sketch contrasting a high-bias and a high-variance model is
given below.
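A small sketch contrasting a high-bias and a high-variance model, assuming scikit-learn; the sine-shaped toy data and the polynomial degrees 1 and 15 are illustrative choices:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 3, 60))[:, None]
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.2, 60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Degree 1 is too simple (high bias); degree 15 is too flexible (high variance).
for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),
          mean_squared_error(y_test, model.predict(X_test)))
```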
Cross-Validation
Cross-validation is a technique for validating model efficiency: the
model is trained on a subset of the input data and evaluated on a
subset it has not seen before. Some common methods are described
below.
Validation Set Approach
In this approach, we divide the input dataset into two halves, using
50% for training and the remaining 50% for validation. But it has one
big disadvantage: since we use only 50% of the dataset to train our
model, the model may miss important information in the data. It also
tends to give an underfitted model.
Leave-P-Out Cross-Validation
In this approach, p data points are left out of the training data. This
means that if there are n data points in the original input dataset, then
n-p data points are used as the training set and the p left-out data
points as the validation set. This complete process is repeated for all
possible combinations, and the average error is calculated to judge
the effectiveness of the model.
Leave-One-Out Cross-Validation
This method is the special case of leave-p-out with p = 1: each
iteration trains on n-1 data points and tests on the single remaining
point. It has the following features (a sketch follows the list):
o In this approach, the bias is minimal, as all the data points are used.
o The process is executed n times; hence, the execution time is high.
o This approach leads to high variation in testing the effectiveness of
the model, as we iteratively check against a single data point.
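A sketch of the p = 1 case using scikit-learn's LeaveOneOut; the iris dataset and the k-nearest-neighbours model are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One data point is held out per iteration; the process runs n times.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())  # n iterations and the average score
```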
K-Fold Cross-Validation
The K-fold cross-validation approach divides the input dataset into K
groups of samples of equal size, called folds. For each learning set,
the prediction function uses K-1 folds for training, and the remaining
fold is used as the test set. This approach is a very popular CV
approach because it is easy to understand and the output is less
biased than with other methods. A sketch is given below.
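A K-fold sketch, assuming scikit-learn; K = 5 and the dataset/model are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# 5 folds: each fold serves once as the test set while the other 4 train.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())
```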
Holdout Method
This method is the simplest cross-validation technique of all. In this
method, we remove a subset of the training data and use it to get
prediction results from a model trained on the remaining part of the
dataset. The error that occurs in this process tells how well our model
will perform on an unknown dataset. Although this approach is simple
to perform, it still suffers from high variance, and it sometimes
produces misleading results. A sketch is given below.
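A holdout sketch, assuming scikit-learn; the 70/30 split and the dataset/model are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data; train on the rest and score on the held-out subset.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_hold, y_hold))
```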
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which
are given below:
o Under ideal conditions, it provides the optimum output; but with
inconsistent data, it may produce drastically misleading results. This
is one of the big disadvantages of cross-validation, as there is no
certainty about the type of data in machine learning.
o In predictive modeling, the data evolves over time, which can create
differences between the training and validation sets. For example, if
we create a model to predict stock market values and train it on the
previous 5 years of stock values, the realistic values for the next 5
years may be drastically different, so it is difficult to expect correct
output in such situations.
Applications of Cross-Validation
o This technique can be used to compare the performance of different
predictive modeling methods.
o It has great scope in the medical research field.
o It can also be used for meta-analysis, as it is already being used by
data scientists in the field of medical statistics.