
Sem:-VII Sub:- ML

SRES’s
SHREE RAMCHANDRA COLLEGE OF ENGINEERING
Lonikand, Pune – 412216
Ref. №: SRCOE/COMP /2024-25/ Date:

UNIT I
Introduction to Machine Learning

1. Compare machine learning vs Artificial Intelligence. Apr 2022 [5]
2. Describe parametric and non-parametric machine learning models. Apr 2022 [5]
3. Explain various data formats that conform ML elements. Apr 2022, Apr 2023 [5]
4. Explain supervised, unsupervised and semi-supervised learning. Apr 2022 [7]
5. Describe various statistical learning approaches. Apr 2022 [8], Apr 2023 [5]
6. Compare Machine Learning with traditional programming. Discuss types of Machine Learning with suitable examples. Apr 2023 [5]
7. What is Machine Learning? Explain applications of Machine Learning in data science. Apr 2023 [5]
8. Explain Geometric Model and Probabilistic Model with suitable examples. Apr 2023 [5]
9. How does a machine learning model work? Explain the various steps involved. Apr 2023 [5]


1. Compare machine learning vs Artificial Intelligence. Apr 2022 [5]


Ans:-
Artificial Intelligence (AI) vs. Machine Learning (ML):

1. Artificial intelligence is a technology which enables a machine to simulate human behavior, whereas machine learning is a subset of AI which allows a machine to learn automatically from past data without being explicitly programmed.
2. The goal of AI is to make a smart computer system that solves complex problems like humans do; the goal of ML is to allow machines to learn from data so that they can give accurate output.
3. In AI, we make intelligent systems to perform any task like a human; in ML, we teach machines with data to perform a particular task and give an accurate result.
4. Machine learning and deep learning are the two main subsets of AI; deep learning is the main subset of machine learning.
5. AI has a very wide scope; machine learning has a limited scope.
6. AI aims to create an intelligent system that can perform various complex tasks; ML aims to create machines that can perform only the specific tasks for which they are trained.
7. An AI system is concerned with maximizing the chances of success; machine learning is mainly concerned with accuracy and patterns.
8. The main applications of AI are Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc.; the main applications of ML are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.
9. On the basis of capabilities, AI can be divided into three types: Weak AI, General AI, and Strong AI; machine learning can be divided into three main types: supervised learning, unsupervised learning, and reinforcement learning.
10. AI includes learning, reasoning, and self-correction; ML includes learning and self-correction when introduced to new data.
11. AI deals with structured, semi-structured, and unstructured data; machine learning deals with structured and semi-structured data.

2. Describe parametric and Non-parametric machine learning models.


Apr 2022 [5]
Ans:-
i) Parametric Model:
Assumptions can greatly simplify the learning process, but can also limit what can be
learned. Algorithms that simplify the function to a known form are called parametric
machine learning algorithms.
A learning model that summarizes data with a set of parameters of fixed size
(independent of the number of training examples) is called a parametric model.
No matter how much data you throw at a parametric model, it won't change its mind
about how many parameters it needs.

The algorithms involve two steps:

a) Select a form for the function.
b) Learn the coefficients for the function from the training data.

An easy-to-understand functional form for the mapping function is a line, as is used in
linear regression: y = b0 + b1*x1 + b2*x2,
where b0, b1 and b2 are the coefficients of the line that control the intercept and slope,
and x1 and x2 are two input variables.
Assuming the functional form of a line greatly simplifies the learning process.
Now, all we need to do is estimate the coefficients of the line equation and we have a
predictive model for the problem.
Often the assumed functional form is a linear combination of the input variables and as
such parametric machine learning algorithms are often also called “linear
machine learning algorithms“.

The problem is that the actual unknown underlying function may not be a linear function
like a line. It could be almost a line and require some minor transformation of the input
data to work right. Or it could be nothing like a line, in which case the assumption is
wrong and the approach will produce poor results.
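
As an illustration (a minimal sketch, not part of the original answer, using invented data), the following scikit-learn snippet fits the fixed-form line above and recovers the learned coefficients:

# A minimal sketch with invented data: fit the fixed-form line
# y = b0 + b1*x1 + b2*x2 and inspect the learned coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3]])  # columns are x1 and x2
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]               # generated with b0=1, b1=2, b2=3

model = LinearRegression().fit(X, y)
print(model.intercept_)  # ~1.0 (b0)
print(model.coef_)       # ~[2.0, 3.0] (b1, b2)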


Some more examples of parametric machine learning algorithms include:


a) Logistic Regression
b) Linear Discriminant Analysis
c) Perceptron
d) Naive Bayes
e) Simple Neural Networks
Benefits:
1. Simpler: These methods are easier to understand, and the results are easier to interpret.
2. Speed: Parametric models are very fast to learn from data.
3. Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.

Limitations:
1. Constrained: By choosing a functional form, these methods are highly constrained to the specified form.
2. Limited Complexity: The methods are more suited to simpler problems.
3. Poor Fit: In practice the methods are unlikely to match the underlying mapping function.

ii) Non Parametric Model:


Algorithms that do not make strong assumptions about the form of the mapping function
are called nonparametric machine learning algorithms.
By not making assumptions, they are free to learn any functional form from the training
data.
Nonparametric methods are good when you have a lot of data and no prior knowledge,
and when you don't want to worry too much about choosing just the right features.
Nonparametric methods seek to best fit the training data in constructing the mapping
function, whilst maintaining some ability to generalize to unseen data. As such, they are
able to fit a large number of functional forms.
An easy-to-understand nonparametric model is the k-nearest neighbors algorithm, which
makes predictions based on the k most similar training patterns for a new data instance.
The method does not assume anything about the form of the mapping function other than
that patterns that are close are likely to have a similar output variable.
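
To make this concrete, here is a minimal k-nearest-neighbors sketch (illustrative, with invented points, not from the original notes):

# A minimal sketch with invented data: k-NN stores the training patterns
# and predicts from the k closest ones; no functional form is assumed.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]]
y_train = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[5.0, 5.0]]))  # -> [1]: the nearest patterns belong to class 1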

Some more examples of popular nonparametric machine learning algorithms are:


1. k-Nearest Neighbors
2. Decision Trees such as CART and C4.5
3. Support Vector Machines

Benefits of Nonparametric Machine Learning Algorithms:


1. Flexibility: Capable of fitting a large number of functional forms.
2. Power: No assumptions (or weak assumptions) about the underlying function.
3. Performance: Can result in higher-performance models for prediction.


Limitations of Nonparametric Machine Learning Algorithms:


1. More Data: Require a lot more training data to estimate the mapping function.
2. Slower: A lot slower to train, as they often have far more parameters to train.
3. Overfitting: More risk of overfitting the training data, and it is harder to explain why
specific predictions are made.

3. Explain various Data formats that conform ML elements. Apr 2022 [5]
Ans:-
Data Formats in Machine Learning

Each data format represents how the input data is represented in memory.
This is important as each machine learning application performs well for a particular data
format and worse for others.
Interchanging between various data formats and choosing the correct format is a major
optimization technique.

There are four types of data formats:


NHWC
NCHW
NCDHW
NDHWC
Each letter in the formats denotes a particular aspect/dimension of the data:
N: Batch size: the number of images passed together as a group for inference.
C: Channel: the number of data components that make up a data point of the input data.
It is 3 for opaque (e.g., RGB) images and 4 for transparent (e.g., RGBA) images.
H: Height: the measurement along the y-axis of the input data.
W: Width: the measurement along the x-axis of the input data.
D: Depth: the depth of the input data.

NHWC:-
NHWC denotes (Batch size, Height, Width, Channel). This means there is a 4D array
where the first dimension is the batch size, the second the height, and so on. This 4D
array is laid out in memory in row-major order. Hence, you can visualize the memory
layout to imagine which operations will access consecutive memory (fast) or memory
separated by other data (slow).

NCHW:-
NCHW denotes (Batch size, Channel, Height, Width). This means there is a 4D array
where the first dimension is the batch size, the second the channel, and so on. This 4D
array is laid out in memory in row-major order.

NCDHW:-
NCDHW denotes (Batch size, Channel, Depth, Height, Width). This means there is a 5D
array where the first dimension is the batch size, the second the channel, and so on. This
5D array is laid out in memory in row-major order.


NDHWC:-
NDHWC denotes (Batch size, Depth, Height, Width, Channel). This means there is a 5D
array where the first dimension is the batch size, the second the depth, and so on. This 5D
array is laid out in memory in row-major order.
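
As a small illustration (assumed shapes, not from the original answer), the NumPy snippet below shows the same batch of images in NHWC and NCHW layouts; a transpose reorders the dimensions:

# A minimal sketch: one batch of 8 RGB images, 224x224, in NHWC layout,
# converted to NCHW by reordering the axes.
import numpy as np

batch_nhwc = np.zeros((8, 224, 224, 3))        # (N, H, W, C)
batch_nchw = batch_nhwc.transpose(0, 3, 1, 2)  # reorder to (N, C, H, W)
print(batch_nhwc.shape, batch_nchw.shape)      # (8, 224, 224, 3) (8, 3, 224, 224)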

4. Explain supervised, unsupervised and semi supervised learning. Apr


2022 [7]
Ans:-
1) Supervised Machine Learning:-
Supervised learning is one of the most basic types of machine learning.
In this type, the machine learning algorithm is trained on labeled data. Even though the
data needs to be labeled accurately for this method to work, supervised learning is
extremely powerful when used in the right circumstances.

In supervised learning, the ML algorithm is given a small training dataset to work with.
This training dataset is a smaller part of the bigger dataset and serves to give the
algorithm a basic idea of the problem, solution, and data points to be dealt with. The
training dataset is also very similar to the final dataset in its characteristics and provides
the algorithm with the labeled parameters required for the problem. The algorithm then
finds relationships between the parameters given, essentially establishing a cause-and-
effect relationship between the variables in the dataset. At the end of the training, the
algorithm has an idea of how the data works and of the relationship between the input
and the output.
Supervised learning is commonly used in real-world applications, such as face and
speech recognition, product or movie recommendations, and sales forecasting.
Supervised learning deals with learning a function from available training data. Here, a
learning algorithm analyzes the training data and produces a derived function that can be
used for mapping new examples.
Supervised learning can be further classified into two types - Regression and Classification.
Regression
Regression trains on and predicts a continuous-valued response, for example predicting
real estate prices. When the output Y is discrete-valued, it is classification, and when Y is
continuous, it is regression.
Discrete variables are numeric variables that have a countable number of values between
any two values. A discrete variable is always numeric. For example, the number of customer
complaints or the number of flaws or defects.


Continuous variables are numeric variables that have an infinite number of values between
any two values. A continuous variable can be numeric or date/time. For example, the length
of a part or the date and time a payment is received.
Regression algorithms are used if there is a relationship between the input variable and
the output variable. They are used for the prediction of continuous variables, such as
weather forecasting, market trends, etc. Common regression algorithms include:
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression
6. Logistic Regression

Classification
Classification attempts to find the appropriate class label, such as analyzing
positive/negative sentiment, male and female persons, benign and malignant tumors, secure
and unsecure loans etc.
Classification algorithms are used when the output variable is categorical, for example
when there are two classes such as Yes-No, Male-Female, True-False, etc. Common
classification algorithms include the following (a minimal example follows this list):
1. Decision Trees
2. Random Forest
3. Support vector Machines
4. Neural network
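
A minimal supervised-learning sketch (invented data and feature names, added for illustration): a classifier is trained on labeled examples and then predicts the label of an unseen input.

# A minimal sketch with invented data: features are [age, income],
# labels say whether the customer bought the product.
from sklearn.tree import DecisionTreeClassifier

X_train = [[25, 40000], [35, 60000], [45, 80000], [20, 20000]]
y_train = ["no", "yes", "yes", "no"]

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[40, 70000]]))  # predicted label for a new, unseen customer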

2) Unsupervised Machine Learning

In unsupervised learning, the algorithm is trained on unlabeled data and must discover
hidden structures within it on its own. The creation of these hidden structures is what
makes unsupervised learning algorithms versatile. Instead of a defined and set problem
statement, unsupervised learning algorithms can adapt to the data by dynamically
changing hidden structures.
Unsupervised learning is used to detect anomalies, outliers, such as fraud or defective
equipment, or to group customers with similar behaviours for a sales campaign. It is the
opposite of supervised learning. There is no labeled data here. When learning data contains
only some indications without any description or labels, it is up to the coder or to the
algorithm to find the structure of the underlying data, to discover hidden patterns, or to
determine how to describe the data. This kind of learning data is called unlabeled data.
Suppose that we have a number of data points, and we want to classify them into several
groups. We may not exactly know what the criteria of classification would be.
So, an unsupervised learning algorithm tries to classify the given dataset into a certain
number of groups in an optimum way.
Unsupervised learning algorithms are extremely powerful tools for analyzing data and for
identifying patterns and trends.
They are most commonly used for clustering similar input into logical groups.
Unsupervised learning has two types: Clustering and Association.
Clustering:
Clustering is a method of grouping objects into clusters such that the objects with the
most similarities remain in one group and have few or no similarities with the objects of
another group (a clustering sketch follows the algorithm list below).

Shree Ramchandra college of engineering Page 7


Sem:-VII Sub:- ML

Cluster analysis finds the commonalities between the data objects and categorizes them as
per the presence and absence of those commonalities.
Association:
An association rule is an unsupervised learning method which is used for finding
relationships between variables in a large database. It determines the set of items that
occur together in the dataset. Association rules make marketing strategies more effective;
for example, people who buy item X (say, bread) also tend to purchase item Y
(butter/jam). A typical example of an association rule is Market Basket Analysis.
The list of some popular unsupervised learning algorithms:
1. K-means clustering
2. KNN (K-Nearest Neighbors)
3. Hierarchical clustering
4. Anomaly detection
5. Neural Networks
6. Principal Component Analysis
7. Independent Component Analysis
8. Apriori algorithm
9. Singular value decomposition
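
As an illustration of clustering (a minimal sketch with invented points, not part of the original answer), k-means groups unlabeled data purely by similarity:

# A minimal sketch: k-means receives no labels and discovers two groups.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # centroid of each discovered group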

3) Semi Supervised Machine Learning:


The most basic disadvantage of any supervised learning algorithm is that the dataset
has to be hand-labeled, either by a Machine Learning Engineer or a Data Scientist. This is
a very costly process, especially when dealing with large volumes of data. The most basic
disadvantage of any unsupervised learning is that its application spectrum is limited.
To counter these disadvantages, the concept of Semi-Supervised Learning was introduced.
It is partly supervised and partly unsupervised.
If some learning samples are labeled but some others are not, then it is semi-supervised
learning.
It makes use of a small amount of labeled data together with a large amount of unlabeled
data during training.

Semi-supervised learning is applied in cases where it is expensive to acquire a fully
labeled dataset and more practical to label a small subset. An analogy: supervised
learning is where a student is under the supervision of a teacher at both home and school;
unsupervised learning is where a student has to figure out a concept himself; and semi-
supervised learning is where a teacher teaches a few concepts in class and gives questions
as homework which are based on similar concepts.
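
A minimal semi-supervised sketch (invented data; scikit-learn's LabelPropagation is used here as one possible technique, not necessarily the one intended by the notes): points marked -1 are unlabeled, and their labels are inferred from the few labeled ones.

# A minimal sketch: -1 marks unlabeled samples; the algorithm propagates
# the known labels to them through similarity in feature space.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.1], [5.0], [5.1], [1.05], [4.95]])
y = np.array([0, 0, 1, 1, -1, -1])  # two labeled classes, two unlabeled points

model = LabelPropagation().fit(X, y)
print(model.transduction_)  # inferred labels for all points, e.g. [0 0 1 1 0 1]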

5. Describe various statistical learning approaches. Apr 2022 [8]


Ans:-
Statistical Learning:-
Statistics is a collection of tools that you can use to get answers to important questions
about data. You can use descriptive statistical methods to transform raw observations into
information that you can understand and share. You can use inferential statistical methods
to reason from small samples of data to whole domains. Statistical learning theory is a
framework for machine learning that draws from statistics and functional analysis. It deals
with finding a predictive function based on the data presented. The main idea in statistical
learning theory is to build a model that can draw conclusions from data and make predictions.

Statistical Learning Approaches:-


1) Statistics in Data Preparation:-
Statistical methods are required in the preparation of train and test data for your machine
learning model. This includes techniques for:
1. Outlier detection.
2. Missing value imputation.
3. Data sampling.
4. Data scaling.
5. Variable encoding, and much more.
A basic understanding of data distributions, descriptive statistics, and data visualization
is required to help you identify the methods to choose when performing these tasks.

2) Statistics in Model Evaluation

Statistical methods are required when evaluating the skill of a machine learning model
on data not seen during training.
This includes techniques for:
1. Data sampling
2. Data resampling
3. Experimental design
Resampling techniques such as k-fold cross-validation (sketched below) are often well
understood by machine learning practitioners, but the rationale for why they are required
is not.
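
A minimal k-fold cross-validation sketch (synthetic data, added for illustration):

# A minimal sketch: 5-fold cross-validation estimates model skill from
# five different train/test splits instead of a single split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # average skill and its variability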

3) Statistics in Model Selection:-


Statistical methods are required when selecting a final model or model configuration to
use for a predictive modeling problem.
These include techniques for:
1. Checking for a significant difference between results.
2. Quantifying the size of the difference between results.
This might include the use of statistical hypothesis tests.


4) Statistics in Model Presentation


Statistical methods are required when presenting the skill of a final model to
stakeholders. This includes techniques for:
1. Summarizing the expected skill of the model on average.
2. Quantifying the expected variability of the skill of the model in practice.
This might include estimation statistics such as confidence intervals.
5) Statistics in Prediction
Statistical methods are required when making a prediction with a finalized model on
new data.
This includes techniques for:
1. Quantifying the expected variability for the prediction.
This might include estimation statistics such as prediction intervals.

6) Problem Framing:
Requires the use of exploratory data analysis and data mining.

7) Data Cleaning:
Requires the use of outlier detection, imputation, and more.

8) Data Selection:
Requires the use of data sampling and feature selection methods.

9) Model Configuration:
Requires the use of statistical hypothesis tests and estimation statistics.
6. Compare Machine Learning with traditional programming. Discuss
types of Machine Learning with suitable examples. Apr 2023 [5]
Ans:-
Comparison of Machine Learning with Traditional Programming
For any solution, the first task is the creation of the most suitable algorithm and writing
the code.

Traditional programming is a manual process, meaning a person (a programmer) creates
the program and has to manually formulate or code the rules. We have the input data, and
the programmer codes a program that uses that data and runs on a computer to produce
the desired output. In Machine Learning, on the other hand, the input data and the desired
output are fed to an algorithm to create a program.
In traditional programming one has to manually formulate/code rules, while in Machine
Learning the algorithms automatically formulate the rules from the data, which is very
powerful. If traditional programming is automation, then machine learning is automating
the process of automation.


Thereafter, it is mandatory to set the input parameters and, if the implemented algorithm
is correct, it will produce the expected result.
However, when we need to predict something, we need an algorithm with a variety of
input parameters. To solve the same problem using ML methods, data engineers use a
totally different procedure. Instead of developing an algorithm on their own, they collect
an array of historical data that will be used for semi-automatic model building.
After assembling a satisfactory set of data, the data engineer loads it into already tailored
ML algorithms. The result is a model that can predict a new result, receiving new data as
input. A distinctive feature of ML is that there is no need to build the model by hand;
this complicated yet meaningful responsibility is executed by the ML algorithms.
Another significant difference between ML and traditional programming is the number
of input parameters that the model is capable of processing. For an accurate prediction,
you may have to use thousands of parameters with high accuracy, as every bit will affect
the final result. A human being a priori cannot build an algorithm that uses all of those
details in a reasonable way. A simple contrast is sketched below.
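
As a hedged illustration (a hypothetical toy task with invented data and thresholds, not from the original answer): the first function encodes a hand-written rule, while the second learns a rule from labeled examples.

# Traditional programming: the rule is coded by hand.
def rule_based(num_links):
    return "spam" if num_links > 5 else "not spam"

# Machine learning: the rule is inferred from labeled data.
from sklearn.linear_model import LogisticRegression

X = [[0], [1], [8], [12]]                    # feature: links per message
y = ["not spam", "not spam", "spam", "spam"]
learned = LogisticRegression().fit(X, y)

print(rule_based(9), learned.predict([[9]])[0])  # both classify a new message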
Refer to Question no. 4 for a discussion of the types of Machine Learning with suitable examples.
7. What is Machine Learning? Explain applications of Machine Learning
in data science. Apr. 2023 [5]
Ans:-
Machine Learning:
Machine learning is a branch of artificial intelligence that enables algorithms to
uncover hidden patterns within datasets, allowing them to make predictions on new, similar
data without explicit programming for each task. Traditional machine learning combines
data with statistical tools to predict outputs, yielding actionable insights. This technology
finds applications in diverse fields such as image and speech recognition, natural language
processing, recommendation systems, fraud detection, portfolio optimization, and
automating tasks.
For instance, recommender systems use historical data to personalize suggestions. Netflix,
for example, employs collaborative and content-based filtering to recommend movies and
TV shows based on user viewing history, ratings, and genre preferences. Reinforcement
learning further enhances these systems by enabling agents to make decisions based on
environmental feedback, continually refining recommendations.


Machine learning's impact extends to autonomous vehicles, drones, and robots, enhancing
their adaptability in dynamic environments. This approach marks a breakthrough where
machines learn from data examples to generate accurate outcomes, and it is closely
intertwined with data mining and data science.

Machine Learning Applications

With an understanding of the common machine learning uses, let’s explore some
examples of the popular applications in the market that rely heavily on machine learning.

1. Social Media (Facebook)


Automatic friend tagging suggestions on Facebook are one of the best machine-learning
applications. Facebook automatically locates a face that matches its database using face
detection and image recognition, and then advises us to tag that individual, using
DeepFace (a project of Facebook's Deep Learning division).

2. Transportation (Uber)
Uber is a customized cab application that relies on machine learning to automatically
locate a rider and offer options to travel home, to work, or to any other regular location
based on the rider's history and patterns. Moreover, the app uses ML algorithms to make
precise predictions of the Estimated Time of Arrival (ETA) to a particular destination by
analyzing traffic conditions.
3. Language Translation (Google Translate)
To break all language barriers and make traveling to foreign countries easy, Google
Translate employs Google Neural Machine Translation (GNMT), which relies on Natural
Language Processing (NLP) to translate words across thousands of languages and
dictionaries. It also makes use of POS Tagging, Named Entity Recognition (NER), and
Chunking to maintain the words' tonality.

4. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is
used to identify objects, persons, places, digital images, etc. A popular use case of image
recognition and face detection is automatic friend tagging suggestions:
Facebook provides us a feature of auto friend tagging suggestions. Whenever we upload
a photo with our Facebook friends, we automatically get a tagging suggestion with their
names, and the technology behind this is machine learning's face detection and
recognition algorithm.
It is based on the Facebook project named "DeepFace," which is responsible for face
recognition and person identification in pictures.

5. Speech Recognition:
While using Google, we get the option of "Search by voice"; this comes under speech
recognition, and it is a popular application of machine learning.


Speech recognition is a process of converting voice instructions into text, and it is also
known as "speech to text" or "computer speech recognition." At present, machine
learning algorithms are widely used in various applications of speech recognition.
Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow
voice instructions.

6. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the
correct path with the shortest route and predicts the traffic conditions. It predicts whether
traffic is cleared, slow-moving, or heavily congested with the help of two things:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make this app better. It takes information
from the user and sends it back to its database to improve the performance.

7. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies
such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we
search for some product on Amazon, we start getting advertisements for the same
product while surfing the internet in the same browser, and this is because of machine
learning. Google understands the user's interests using various machine learning
algorithms and suggests products as per customer interest.
Similarly, when we use Netflix, we find recommendations for entertainment series,
movies, etc., and this is also done with the help of machine learning.

8. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine
learning plays a significant role in self-driving cars. Tesla, the most popular car
manufacturing company, is working on self-driving cars. It uses machine learning
methods to train the car models to detect people and objects while driving.

9. Email Spam and Malware Filtering:

Whenever we receive a new email, it is filtered automatically as important, normal, or
spam. We always receive important mail in our inbox with the important symbol and
spam emails in our spam box, and the technology behind this is machine learning. Below
are some spam filters used by Gmail:
o Content filter
o Header filter
o General blacklists filter
o Rules-based filters


o Permission filters

Some machine learning algorithms, such as Multi-Layer Perceptron, Decision Tree,
and the Naïve Bayes classifier, are used for email spam filtering and malware detection.

8. Explain Geometric Model and Probabilistic Model with suitable


examples. Apr. 2023 [5]
Ans:
1.Geometric Model:
In Geometric models, features could be described as points in two dimensions (x- and y-
axis) or a three-dimensional space (x, y, and z).
Even when features are not intrinsically geometric, they could be modeled in a geometric
manner (for example, temperature as a function of time can be modeled in two axes).
In geometric models, there are two ways we could impose similarity.
We could use geometric concepts like lines or planes to segment (classify) the instance
space; these are called Linear models. Alternatively, we can use the geometric notion of
distance to represent similarity: if two points are close together, they have similar values
for their features and thus can be classed as similar. We call such models Distance-based
models.
Linear Model
Linear models are relatively simple. In this case, the function is
represented as a linear combination of its inputs.
In the simplest case where f(x) represents a straight line, we have an equation of the form f
(x) = mx + c where c represents the intercept and m represents the slope.

Linear models are parametric, which means that they have a fixed form with a small
number of numeric parameters that need to be learned from data.
For example, in f (x) = mx + c, m and c are the parameters that we are trying to learn from
the data.
This technique is different from tree or rule models, where the structure of the model (e.g.,
which features to use in the tree, and where) is not fixed in advance.
Linear models are stable, i.e., small variations in the training data have only a limited impact
on the learned model.

In contrast, tree models tend to vary more with the training data, as the choice of a different
split at the root of the tree typically means that the rest of the tree is different as well.
As a result of having relatively few parameters, Linear models have low variance and high
bias. This implies that Linear models are less likely to overfit the training data than some
other models. However, they are more likely to underfit.
For example, if we want to learn the boundaries between countries based on labeled data,
then linear models are not likely to give a good approximation.

Distance Model
Distance-based models are the second class of Geometric models.
Like Linear models, distance-based models are based on the geometry of data.
As the name implies, distance-based models work on the concept of distance.
In the context of Machine learning, the concept of distance is not based on merely the
physical distance between two points.
Instead, we could think of the distance between two points as depending on the mode of
transport between those points.
Travelling between two cities by plane covers less distance physically than by train,
because the plane is unrestricted.
Similarly, in chess, the concept of distance depends on the piece used – for example, a
Bishop can move diagonally.
Thus, depending on the entity and the mode of travel, the concept of distance can be
experienced differently.
The distance metrics commonly used are Euclidean, Minkowski, Manhattan, and
Mahalanobis. Distance is applied through the concept of neighbors and exemplars.
Neighbors are points in proximity with respect to the distance measure expressed through
exemplars.

Exemplars are either centroids that find a centre of mass according to a chosen distance
metric or medoids that find the most centrally located data point.
The most commonly used centroid is the arithmetic mean, which minimizes squared
Euclidean distance to all other points.
The algorithms under the Geometric Model: KNN, Linear Regression, SVM, Logistic
Regression, etc. A small sketch of the distance notions used by these models follows.
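
The sketch below (invented points, not part of the original answer) computes the Euclidean and Manhattan distances mentioned above, and the arithmetic-mean centroid:

# A minimal sketch of the distance notions above.
import numpy as np

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
euclidean = np.sqrt(((a - b) ** 2).sum())  # straight-line distance: 5.0
manhattan = np.abs(a - b).sum()            # "grid" distance: 7.0

points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
centroid = points.mean(axis=0)  # exemplar minimizing squared Euclidean distance
print(euclidean, manhattan, centroid)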

2. Probabilistic Models
Another family of machine learning algorithms is the probabilistic models.
The k-nearest neighbour algorithm uses the idea of distance (e.g., Euclidean distance) to
classify entities, and logical models use a logical expression to partition the instance
space. Probabilistic models, by contrast, use the idea of probability to classify new
entities.
Probabilistic models see features and target variables as random variables.


The process of modeling represents and manipulates the level of uncertainty with respect to
these variables.
There are two types of probabilistic models: Predictive and Generative.
Predictive probability models use the idea of a conditional probability distribution
P(Y | X), from which Y can be predicted from X.
Generative models estimate the joint distribution P(Y, X). Once we know the joint
distribution for the generative models, we can derive any conditional or marginal
distribution involving the same variables.
Thus, the generative model is capable of creating new data points and their labels, knowing
the joint probability distribution.
The joint distribution looks for a relationship between two variables.
Once this relationship is inferred, it is possible to infer new data points.
The algorithms under Probabilistic Models: Naïve Bayes, Gaussian Process Regression, etc.
Naïve Bayes is an example of a probabilistic classifier.
The goal of any probabilistic classifier is, given a set of features (x_0 through x_n) and a
set of classes (c_0 through c_k), to determine the probability of the features occurring in
each class, and to return the most likely class.
Therefore, for each class, we need to calculate P(c_i | x_0, ..., x_n).
We can do this using Bayes' rule:
P(c_i | x_0, ..., x_n) = P(x_0, ..., x_n | c_i) * P(c_i) / P(x_0, ..., x_n)
The Naïve Bayes algorithm is based on the idea of conditional probability.
Conditional probability is based on finding the probability that something will happen, given
that something else has already happened.
The task of the algorithm then is to look at the evidence and to determine the likelihood of a
specific class and assign a label accordingly to each entity.
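
A minimal Naïve Bayes sketch (invented data, added for illustration): the classifier estimates per-class probabilities for a new point and returns the most likely class.

# A minimal sketch: Gaussian Naive Bayes applies Bayes' rule under the
# "naive" assumption that features are independent given the class.
from sklearn.naive_bayes import GaussianNB

X_train = [[1.0, 2.0], [1.2, 1.8], [4.0, 4.2], [4.1, 3.9]]
y_train = [0, 0, 1, 1]

nb = GaussianNB().fit(X_train, y_train)
print(nb.predict_proba([[4.0, 4.0]]))  # P(c_i | features) for each class
print(nb.predict([[4.0, 4.0]]))        # the most likely class: [1]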

9. How machine learning model works? Explain various steps involved.


Apr. 2023 [5]
Ans:-
Machine learning model:-
Machine learning is a method of data analysis that automates analytical model
building. In simple terms, machine learning is "making a machine learn". Machine
learning is a new field that combines many traditional disciplines.
The workflow provides a systematic way to proceed with building the machine learning
model. It automates the process of machine learning, and following the pipeline makes
the process of making ML models systematic and easy.
The Machine Learning pipeline starts with data collection and integration. After data is
collected, analysis and visualization of the data are done. Further, the most crucial step,
feature selection and engineering, is performed; then the model is trained. After that,
model evaluation is done and our model becomes ready for prediction!


Various steps in the machine learning model:-

The ML pipeline consists of the following stages: data collection and integration;
exploratory data analysis and visualization; feature selection and engineering; model
training; model evaluation; and prediction.

1. Data Collection and Integration:

The first step of the ML pipeline involves the collection of data and the integration of data.
The collected data acts as an input to the model (data preparation phase).
Inputs are called features.
Data collected in the case of our considered example involves a lot of data. The collected
data should answer the following questions: What is the past customer history? What
were the past orders? Is the customer a prime member of our bookstore? Does the
customer own a Kindle? Has the customer made any previous complaints? What was the
highest number of complaints?
The more data there is, the better our model becomes.
Once the data is collected we need to integrate and prepare the data.
Integration of data means placing all related data together.
Then the data preparation phase starts, in which we manually and critically explore the data.
The data preparation phase tells the developer whether the data matches expectations. Is
there enough information to make an accurate prediction? Is the data consistent?

2. Exploratory Data Analysis and Visualisation:

Once the data is prepared, the developer needs to visualize it to have a better
understanding of the relationships within the dataset.
When we get to see the data, we can notice unseen patterns that we may not have noticed
in the first phase.
Visualization helps developers easily identify missing data and outliers.
Data visualization can be done by plotting histograms, scatter plots, etc.
After visualization, the data is analyzed so that the developer can decide which ML
technique to use.
In the considered example case, unsupervised learning may be used to analyze customer
purchasing habits.

3. Feature Selection and Engineering:

Feature selection means selecting the features the developer wants to use within the
model.
Features should be selected so that a minimum correlation exists between them and a
maximum correlation exists between the selected features and the output.
Feature engineering is the process of manipulating the original data into new, potentially
more useful features.
In simple words, feature engineering is converting raw data into useful data, or getting
the maximum out of the original data.
Feature engineering is arguably the most crucial and time-consuming step of the ML
pipeline.
Feature selection and engineering answers the question: are these features going to make
any sense in our prediction? It deals with the accuracy and precision of the data.

4. Model Training:
After the first three steps are done completely, we enter the model training phase.
It is officially the first step in which the developer gets to train the model on the basis of
the data.
To train the model, the data is split into three parts: training data, validation data, and
test data.
Around 70%-80% of the data goes into the training data set, which is used in training the
model.
Validation data is also known as the development set or dev set and is used to avoid
overfitting or underfitting situations, i.e., it enables hyperparameter tuning.
Hyperparameter tuning is a technique used to combat overfitting and underfitting.
Validation data is used during model evaluation.
Around 10%-15% of the data is used as validation data.
The remaining 10%-15% of the data goes into the test data set, which is used for testing
after the model preparation.
It is crucial to randomize the data sets while splitting the data to get an accurate model.
Data can be randomized using scikit-learn in Python, as sketched below.
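
A minimal sketch of this split (synthetic data; the exact proportions are illustrative): scikit-learn's train_test_split shuffles the data, and two successive calls produce roughly 70% train, 15% validation, and 15% test.

# A minimal sketch: shuffle-and-split into train / validation / test sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 140 30 30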

5. Model Evaluation:
After the model training, the validation (development) data is used to evaluate the model.
To get the most accurate predictions, test data may be used for further model evaluation.
A confusion matrix is created after model evaluation to calculate accuracy and precision
numerically (a small sketch follows). After model evaluation, our model enters the final
stage, which is prediction.
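
A minimal sketch of the numeric evaluation described above (invented labels and predictions): the confusion matrix plus accuracy and precision.

# A minimal sketch: compare true labels with a model's predictions.
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted
print(accuracy_score(y_true, y_pred))    # 0.75
print(precision_score(y_true, y_pred))   # 0.75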

6. Prediction:
In the prediction phase, the developer deploys the model.
After model deployment, it becomes ready to make predictions.
Predictions are made on training data and test data to have a better understanding of the
built model.
The deployment of the model isn't a one-time exercise. As more and more data gets
generated, the model is trained on the new data, evaluated again, and deployed again. The
model training, model evaluation, and prediction phases cycle into one another.
