MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY
NH-58, Delhi-Roorkee Highway, Baghpat Road, Meerut – 250 005 U.P.
ODD Semester 2024-25
Course/Branch : B.Tech-CSE Semester :V
Subject : Machine Learning Techniques Subject Code : BCS055
Q1: Explain briefly “History of Machine Learning”.
Early history of machine learning:
In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts wrote a paper about neurons, and
how they work. They created a model of neurons using an electrical circuit, and thus the neural network was
created.
In 1952, Arthur Samuel created the first computer program which could learn as it ran. Frank Rosenblatt
designed the first artificial neural network in 1958, called the Perceptron. Its main goal was pattern and
shape recognition.
In 1959, Bernard Widrow and Marcian Hoff created two neural network models. The first was called
ADALINE, and it could detect binary patterns. For example, in a stream of bits, it could predict what the next
one would be. The second was called MADALINE, and it could eliminate echo on phone lines.
1980s and 1990s:
In 1982, John Hopfield suggested creating a network which had bidirectional lines, similar to how neurons
actually work. Use of back propagation in neural networks came in 1986, when researchers from the
Stanford psychology department decided to extend an algorithm created by Widrow and Hoff in 1962.
This allowed multiple layers to be used in a neural network, creating what are known as 'slow learners',
which learn over a long period of time.
In 1997, the IBM computer Deep Blue, which was a chess-playing computer, beat the world chess champion.
In 1998, research at AT&T Bell Laboratories on digit recognition resulted in good accuracy in detecting
handwritten postcodes from the US Postal Service.
21st Century:
Since the start of the 21st century, many businesses have realised that machine learning can increase their
computational and analytical potential. This is why they are investing more heavily in it, in order to stay ahead of the
competition.
Q2: Write down the differences between Machine Learning and Data Science.
Data science
1. Data science is a concept used to tackle big data and includes data cleansing, preparation, and analysis.
2. It includes various data operations.
3. Data science works by sourcing, cleaning, and processing data to extract meaning out of it for analytical
purposes.
4. SAS, Tableau, Apache Spark, and MATLAB are the tools used in data science.
5. Data science deals with structured and unstructured data.
6. Fraud detection and healthcare analysis are examples of data science.
Machine learning
1. Machine learning is defined as the practice of using algorithms to use data, learn from it and then forecast
future trends for that topic.
2. It is a subset of Artificial Intelligence.
3. Machine learning uses efficient programs that can learn from data without being explicitly programmed.
4. Amazon Lex, IBM Watson Studio, Microsoft Azure ML Studio are the tools used in ML.
5. Machine learning uses statistical models.
6. Recommendation systems (such as Spotify's) and facial recognition are examples of machine learning.
Q3: Describe how to design a learning system with examples?
Steps used to design a learning system are:
Specify the learning task.
Choose a suitable set of training data to serve as the training experience.
Divide the training data into groups or classes and label accordingly.
Determine the type of knowledge representation to be learned from the training experience.
Choose a learner classifier that can generate general hypotheses from the training data.
Apply the learner classifier to test data.
Compare the performance of the system with that of an expert human.
Well defined learning problem:
A computer program is said to learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Three features in learning problems:
1. The class of tasks (T)
2. The measure of performance to be improved (P)
3. The source of experience (E)
For example:
1. A checkers learning problem:
a. Task (T): Playing checkers.
b. Performance measure (P): Percent of games won against opponents.
c. Training experience (E): Playing practice games against itself.
2. A handwriting recognition learning problem:
a. Task (T): Recognizing and classifying handwritten words within images.
b. Performance measure (P): Percent of words correctly classified
c. Training experience (E): A database of handwritten words with given classifications.
3. A robot driving learning problem:
a. Task (T): Driving on public four-lane highways using vision sensors.
b. Performance measure (P): Average distance travelled before an error (as judged by human overseer).
c. Training experience (E): A sequence of images and steering commands recorded while observing a
human driver.
Q4: Explain the concept of Machine Learning. Define the term learning. What are the types of Learning?
Concept of Machine Learning
Machine learning is an application of Artificial Intelligence (AI) that provides systems the ability to
automatically learn and improve from experience without being explicitly programmed.
Machine learning focuses on the development of computer programs that can access data.
The primary aim is to allow the computers to learn automatically without human intervention or
assistance and adjust actions accordingly.
Machine learning enables analysis of massive quantities of data.
It generally delivers faster and more accurate results in order to identify profitable opportunities or
dangerous risks.
Combining machine learning with AI and cognitive technologies can make it even more effective in
processing large volumes of information.
1. Learning refers to the change in a subject's behaviour in a given situation brought about by repeated experiences in
that situation, provided that the behaviour change cannot be explained on the basis of native response
tendencies, maturation, or temporary states of the subject.
2. Learning agent can be thought of as containing a performance element that decides what actions to take and a
learning element that modifies the performance element so that it makes better decisions.
3. The design of a learning element is affected by three major issues:
a. Components of the performance element.
b. Feedback available to learn these components.
c. Representation of the components.
The important components of learning are:
1. Acquisition of new knowledge:
a. One component of learning is the acquisition of new knowledge.
b. Simple data acquisition is easy for computers, even though it is difficult for people.
2. Problem solving:
The other component of learning is the problem solving required both to integrate new knowledge presented to the
system and to deduce new information when the required facts have not been presented.
Types of Learning are:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Q5: Compare classification and clustering in machine learning along with suitable real life applications.
Clustering
1. Clustering analyses data objects without known class label.
2. There is no prior knowledge of the attributes of the data to form clusters.
3. It is done by grouping only the input data because output is not predefined.
4. The number of clusters is not known before clustering. These are identified after the completion of clustering.
5. It is considered as unsupervised learning because there is no prior knowledge of the class labels.
Classification
1. In classification, data are grouped by analysing the data objects whose class label is known.
2. There is some prior knowledge of the attributes of each classification.
3. It is done by classifying output based on the values of the input data.
4. The number of classes is known before classification as there is a predefined output based on the input data.
5. It is considered as the supervised learning because class labels are known before.
Q6: What is a “Well-Posed Learning Problem”?
Well-Posed Learning Problem – A computer program is said to learn from experience E with respect to some task
T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Any problem can be classified as a well-posed learning problem if it has three traits –
Task
Performance Measure
Experience
Certain examples that efficiently define the well-posed learning problem are –
1. To better filter emails as spam or not
Task – Classifying emails as spam or not.
Performance Measure – The fraction of emails accurately classified as spam or not spam.
Experience – Observing you label emails as spam or not spam.
2. A checkers learning problem
Task – Playing checkers game.
Performance Measure – Percent of games won against opponents.
Experience – Playing practice games against itself.
3. Handwriting Recognition Problem
Task – Recognizing handwritten words within images.
Performance Measure – percent of words accurately classified.
Experience – A database of handwritten words with given classifications.
4. A Robot Driving Problem
Task – Driving on public four-lane highways using vision sensors.
Performance Measure – Average distance travelled before an error.
Experience – A sequence of images and steering commands recorded while observing a human driver.
5. Fruit Prediction Problem
Task – Recognizing and classifying different fruits.
Performance Measure – Fraction of fruit varieties correctly predicted.
Experience – Training the machine on a large dataset of fruit images.
6. Face Recognition Problem
Task – Recognizing different types of faces.
Performance Measure – Fraction of faces correctly recognized.
Experience – Training the machine on a large dataset of different face images.
7. Automatic Translation of documents
Task – Translating a document from one language to another.
Performance Measure – Fraction of documents translated correctly and efficiently.
Experience – Training the machine on a large dataset of documents in different languages.
Q7: Explain reinforcement learning with a suitable example.
1. Reinforcement learning is the study of how artificial systems can learn to optimize their behaviour in
the face of rewards and punishments.
2. Reinforcement learning algorithms have been developed that are closely related to methods of dynamic
programming which is a general approach to optimal control.
3. Reinforcement learning phenomena have been observed in psychological studies of animal behaviour,
and in neurobiological investigations of neuromodulation and addiction.
4. The task of reinforcement learning is to use observed rewards to learn an optimal policy for the
environment.
5. An optimal policy is a policy that maximizes the expected total reward.
6. Without some feedback about what is good and what is bad, the agent will have no grounds for deciding
which move to make.
7. The agent needs to know that something good has happened when it wins and that something bad has
happened when it loses.
8. This kind of feedback is called a reward or reinforcement.
9. Reinforcement learning is very valuable in the field of robotics, where the tasks to be performed are
frequently complex enough to defy encoding as programs and no training data is available.
10. The robot's task consists of finding out, through trial and error (or success), which actions are good in a
certain situation and which are not.
11. In many cases humans learn in a very similar way.
12. For example, when a child learns to walk, this usually happens without instruction, rather simply through
reinforcement.
13. Successful attempts at walking are rewarded by forward progress, and unsuccessful attempts are
penalized by often painful falls.
14. Positive and negative reinforcement are also important factors in successful learning in school and in
many sports.
15. In many complex domains, reinforcement learning is the only feasible way to train a program to perform
at high levels.
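The following is a minimal, illustrative Python sketch of tabular Q-learning on a toy five-state corridor environment; the environment, reward scheme and parameter values are assumptions chosen purely for demonstration and are not taken from the notes above.

# Minimal tabular Q-learning sketch on a hypothetical 5-state "corridor" task.
import random

N_STATES = 5          # states 0..4; reaching state 4 gives a reward of 1
ACTIONS = [0, 1]      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-table: Q[state][action]

def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection: explore sometimes, exploit otherwise
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q towards reward + discounted best future value
        best_next = max(Q[next_state])
        Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])
        state = next_state

print("Learned Q-table:", Q)  # right-moving actions should end up with higher values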
Q8: Differentiate data science and machine learning.
Data science
1. Data science is a concept used to tackle big data and includes data cleansing, preparation, and analysis.
2. It includes various data operations.
3. Data science works by sourcing, cleaning, and processing data to extract meaning out of it for analytical
purposes.
4. SAS, Tableau, Apache, Spark, MATLAB are the tools used in data science.
5. Data science deals with structured and unstructured data.
6. Fraud detection and healthcare analysis are examples of data science.
Machine Learning
1. Machine learning is defined as the practice of using algorithms to use data, learn from it and then
forecast future trends for that topic.
2. It is a subset of Artificial Intelligence.
3. Machine learning uses efficient programs that can learn from data without being explicitly programmed.
4. Amazon Lex, IBM Watson Studio, Microsoft Azure ML Studio are the tools used in ML.
5. Machine learning uses statistical models.
6. Recommendation systems (such as Spotify's) and facial recognition are examples of machine learning.
Q9: Explain the issues related with machine learning.
Issues related with machine learning are:
1. Data quality:
a. It is essential to have good quality data to produce quality ML algorithms and models.
b. To get high-quality data, we must implement data evaluation, integration, exploration, and
governance techniques prior to developing ML models.
c. Accuracy of ML is driven by the quality of the data.
2. Transparency:
a. It is difficult to make definitive statements on how well a model is going to generalize in new
environments.
3. Manpower:
a. Manpower means having enough data and being able to use it without introducing bias into the model.
b. There should be enough skill sets in the organization for software development and data collection.
4. Other:
a. The most common issue with ML is people using it where it does not belong.
b. Every time there is some new innovation in ML, we see overzealous engineers trying to use it where it's
not really necessary.
c. This used to happen a lot with deep learning and neural networks.
d. Traceability and reproduction of results are two main issues.
Q10: Discuss Supervised and Unsupervised Learning.
Supervised Learning:
1. Supervised learning is also known as associative learning, in which the network is trained by providing it
with input and matching output patterns.
2. Supervised training requires the pairing of each input vector with a target vector representing the desired
output.
3. The input vector together with the corresponding target vector is called training pair.
4. During the training session an input vector is applied to the network, and it results in an output vector.
5. This response is compared with the target response.
6. If the actual response differs from the target response, the network will generate an error signal.
7. This error signal is then used to calculate the adjustment that should be made in the synaptic weights
so that the actual output matches the target output.
8. The error minimization in this kind of training requires supervisor or teacher.
9. These input-output pairs can be provided by an external teacher, or by the system which contains the
neural network (self-supervised).
10. Supervised training methods are used to perform non-linear mapping in pattern classification networks,
pattern association networks and multilayer neural networks.
11. Supervised learning generates a global model that maps input objects to desired outputs.
12. In some cases, the map is implemented as a set of local models such as in case-based reasoning or the
nearest neighbour algorithm.
13. In order to solve a problem of supervised learning, the following steps are considered (a minimal code sketch of these steps follows below):
i. Determine the type of training examples.
ii. Gather a training set.
iii. Determine the input feature representation of the learned function.
iv. Determine the structure of the learned function and the corresponding learning algorithm.
v. Complete the design.
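A minimal sketch of these steps using scikit-learn (assumed to be installed); the dataset and the k-nearest-neighbour model are illustrative choices, not prescribed by these notes.

# Hypothetical end-to-end sketch of the supervised-learning steps above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)              # steps i-ii: labelled training examples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)        # gather training / test sets

model = KNeighborsClassifier(n_neighbors=3)     # step iv: learned function / algorithm
model.fit(X_train, y_train)                     # learn from input-output (training) pairs

predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))  # compare with target outputs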
Unsupervised Learning:
1. It is learning in which an output unit is trained to respond to clusters of patterns within the input.
2. Unsupervised training is employed in self-organizing neural networks.
3. This training does not require a teacher.
4. In this method of training, the input vectors of similar types are grouped without the use of training data to
specify how a typical member of each group looks or to which group a member belongs.
5. During training the neural network receives input patterns and organizes these patterns into categories.
6. When a new input pattern is applied, the neural network provides an output response indicating the class to
which the input pattern belongs.
7. If a class cannot be found for the input pattern, a new class is generated.
8. Though unsupervised learning does not require a teacher, it requires certain guidelines to form groups.
9. Grouping can be done based on colour, shape and any other property of the object.
10. It is a method of machine learning where a model is fit to observations.
11. It is distinguished from supervised learning by the fact that there is no a priori output.
12. In this, a data set of input objects is gathered.
13. It treats input objects as a set of random variables. It can be used in conjunction with Bayesian inference
to produce conditional probabilities.
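Below is a small illustrative sketch of unsupervised learning in Python: grouping unlabelled points with k-means. scikit-learn and NumPy are assumed to be available, and the synthetic data is invented for demonstration only.

# Unlabelled points grouped into clusters -- no target vector is supplied anywhere.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two unlabelled "blobs" of two-dimensional points.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels for first 5 points:", kmeans.labels_[:5])
print("Cluster centres:\n", kmeans.cluster_centers_)

# A new pattern is simply assigned to the nearest existing cluster.
print("New point [4.8, 5.1] belongs to cluster:", kmeans.predict([[4.8, 5.1]])[0])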
Q11: Write short note on “Well defined Learning System” with examples.
Designing a learning system in machine learning requires careful consideration of several key factors,
including the type of data being used, the desired outcome, and the available resources. In this article, we
will explore the key steps involved in designing a learning system in machine learning and discuss some best
practices to keep in mind.
o The first step in designing a learning system in machine learning is to identify the type of data that
will be used. This can include structured data, such as numerical and categorical data, as well as
unstructured data, such as text and images. The type of data will determine the type of machine
learning algorithms that can be used and the preprocessing steps required.
o Once the data has been identified, the next step is to determine the desired outcome of the learning
system. This can include classifying data, making predictions, or identifying patterns in the data. The
desired outcome will determine the type of machine learning algorithm that should be used, as well
as the evaluation metrics that will be used to measure the performance of the learning system.
o Next, the resources available for the learning system must be considered. This includes the amount
of data available, the computational power available, and the amount of time available to train the
model. These resources will determine the complexity of the machine learning algorithm that can be
used and the amount of data that can be used for training.
o Once the data, desired outcome, and resources have been identified, it is time to select a machine-
learning algorithm and begin the training process. Decision trees, SVMs, and neural networks are
examples of common algorithms. It is crucial to assess the effectiveness of the learning system using
the right assessment measures, such as recall, accuracy, and precision.
o After the learning system is trained, it is important to fine-tune the model by adjusting the
parameters and hyper parameters. This can be done using techniques such as cross-validation and
grid search. The final model should be tested on a hold-out test set to evaluate its performance on
unseen data.
When constructing a machine learning system, there are some other recommended practices to bear in mind
in addition to these essential processes. A crucial factor to take into account is making sure that the training
data are indicative of the data that will be encountered in the actual world. To do this, the data may be
divided into training, validation, and test sets.
Another best practice is to use appropriate regularization techniques to prevent overfitting. This can include
techniques such as L1 and L2 regularization and dropout. It is also important to use feature scaling and
normalization to ensure that the data is in a format that is suitable for the machine learning algorithm being
used.
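As a hedged sketch of these practices (feature scaling, L2 regularization, and hyperparameter tuning with cross-validated grid search), the following snippet uses scikit-learn, which is assumed to be available; the parameter grid is an arbitrary illustration rather than a recommendation.

# Scaling + L2-regularized model + cross-validated grid search, then hold-out test.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),                                 # feature scaling
    ("clf", LogisticRegression(penalty="l2", max_iter=5000)),    # L2 regularization
])
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}                      # regularization strength

search = GridSearchCV(pipeline, param_grid, cv=5)                # cross-validation + grid search
search.fit(X_train, y_train)
print("Best C:", search.best_params_)
print("Hold-out test accuracy:", search.score(X_test, y_test))   # evaluation on unseen data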
Following are the qualities that you need to keep in mind while designing a learning system:
Reliability
The system must be capable of carrying out the proper task at the appropriate degree of performance in a
given setting. Testing the dependability of ML systems that learn from data is challenging because a system's
failure need not result in an error; instead, it could simply produce garbage results, meaning that some results
were produced even though the system had not been trained with the corresponding ground truth.
When a typical system fails, you receive an error message, such as "The crew is addressing a technical issue
and will return soon."
When a machine learning (ML) system fails, it usually does so silently. For instance, when
translating from English to Hindi or vice versa, even if the model has not seen all of the words, it may
nevertheless produce a translation, and that translation may be illogical.
Scalability
There should be practical methods for coping with the system's expansion as it changes (in terms of data
amount, traffic volume, or complexity). Because certain essential applications might lose millions of dollars
or their credibility with just one hour of outage or failure, there should be an automated provision to grow
computing and storage capacity.
For instance, if a feature on an e-commerce website fails to function as planned on a busy day, it might result
in a loss of millions of dollars in sales.
Maintainability
The performance of the model may fluctuate as a result of changes in data distribution over time. In the ML
system, there should be a provision to first determine whether there is any model drift or data drift, and once
the major drift is noticed, how to re-train/re-fresh and enable new ML models without interfering with the
ML system's present functioning.
Adaptability
The availability of fresh data with increased features or changes in business objectives, such as conversion
rate vs. customer engagement time for e-commerce, are the other changes that occur most frequently in
machine learning (ML) systems. As a result, the system has to be adaptable to fast upgrades without causing
any service disruptions.
Data
1. For example, human age and height have expected value ranges; they cannot take implausible values such as
an age of 150+ or a height of 10 feet. Feature expectations are recorded in a schema - the ranges of the
feature values are carefully captured to avoid any unanticipated value, which could result in a garbage
answer.
2. All features are advantageous; features introduced to the system should be valuable in some way,
such as being a predictor or an identifier, as each feature has a handling cost.
3. No feature should cost more than it is worth; each new feature should be evaluated in terms of cost
vs. benefits in order to eliminate those that would be difficult to implement or manage.
4. The data pipeline has the necessary privacy protections in place; for instance, personally identifiable
information (PII) should be managed carefully because any leaking of sensitive information may
have legal repercussions.
5. If any new external component has an influence on the system, it will be easier to introduce new
features to boost system performance.
6. All input feature code, including one-hot encoding/binning features and the handling of unseen
levels in one-hot encoded features, must be checked in order to avoid any intermediate values from
departing from the desired range.
Model
1. Model specifications are evaluated and submitted; for quicker re-training, correct versioning of the
model learning code is required.
2. Correlation between offline and online metrics: Model metrics (log loss, MAPE, MSE) should be
strongly associated with the application's goal, such as revenue/cost/time.
3. Hyperparameters like learning rates, the number of layers, the size of the layers, the maximum depth,
and regularisation coefficients must be modified for the use case because the selection of
hyperparameter values can significantly affect the accuracy of predictions.
4. To support the most recent model in production, it is important to understand how frequently to
retrain models depending on changes in data distribution. The impact of model staleness should be
known.
5. Simple linear models with high-level characteristics are a good starting point for functional testing
and doing cost-benefit analyses when compared to more complex models. However, a simpler model
is not always better.
6. Model performance must be assessed using adequately representative data to ensure that model
quality is satisfactory on significant data slices.
7. The model's features should be tested for inclusion: each feature should be thoroughly
examined against its predictive importance since, in some applications, specific features may slant
outcomes in favour of particular categories, which matters for reasons of fairness.
Infrastructure
1. The results of repeated training on the same set of data should be similar models. Generally
speaking, depending on how precise the system or infrastructure is, there could be some differences.
However, there shouldn't be a significant difference.
2. The accuracy of model algorithms and model API services must be unit-tested using random inputs
to identify any errors in the code or response.
3. The whole ML pipeline must be tested for proper operation, including the assembly of training data,
feature creation, model training, model verification, and deployment to a serving system.
4. Prior to actually fulfilling real requests, a model must first be trained, at which point an
offline/online system must evaluate it to ensure that it is of appropriate quality.
5. Considering the behaviour of ML systems, whose performance strongly depends on the non-stationary
quality/distribution of input data, there should be a well-designed fallback mechanism if something
goes wrong with the ML answer, and serving models should be able to be rolled back.
Q12: Describe well defined Learning Problem role’s in Machine Learning.
Well defined learning problems role’s in machine learning:
1. Learning to recognize spoken words:
a) Successful speech recognition systems employ machine learning in some form.
b) For example, the SPHINX system learns speaker-specific strategies for recognizing the primitive sounds
(phonemes) and words from the observed speech signal.
c) Neural network learning methods and methods for learning hidden Markov models are effective for
automatically customizing to individual speakers, vocabularies, microphone characteristics, background
noise, etc.
2. Learning to drive an autonomous vehicle:
a) Machine learning methods have been used to train computer controlled vehicles to steer correctly when
driving on a variety of road types.
b) For example, the ALVINN system has used its learned strategies to drive unassisted at 70 miles per hour
for 90 miles on public highways among other cars.
3. Learning to classify new astronomical structures:
a) Machine learning methods have been applied to a variety of large databases to learn general regularities
implicit in the data.
b) For example, decision tree learning algorithms have been used by NASA to learn how to classify
celestial objects from the second Palomar Observatory Sky Survey.
c) This system is used to automatically classify all objects in the Sky Survey, which consists of three
terabytes of image data.
4. Learning to play world class backgammon:
a) The most successful computer programs for playing games such as backgammon are based on machine
learning algorithms.
b) For example, the world's top computer program for backgammon, TD-GAMMON learned its strategy by
playing over one million practice games against itself.
Q13: What are the Advantages, Disadvantages and Applications of Machine Learning?
Advantages of machine learning are:
1. Easily identifies trends and patterns:
a. Machine learning can review large volumes of data and discover specific trends and patterns that would
not be apparent to humans.
b. For an e-commerce website like Flipkart, it serves to understand the browsing behaviours and purchase
histories of its users to help cater to the right products, deals, and reminders relevant to them.
c. It uses the results to reveal relevant advertisements to them.
2. No human intervention needed (automation):
Machine learning does not require constant manual effort, i.e., no human intervention is needed once the model is set up.
3. Continuous improvement:
a. As ML algorithms gain experience, they keep improving in accuracy and efficiency.
b. As the amount of data keeps growing, algorithms learn to make accurate predictions faster.
4. Handling multi-dimensional and multi-variety data:
a. Machine learning algorithms are good at handling data that are multi-dimensional and multi-variety, and
they can do this in dynamic or uncertain environments.
Disadvantages of machine learning are:
1. Data acquisition:
a. Machine learning requires massive data sets to train on, and these should be inclusive/unbiased, and of
good quality
2. Time and resources:
a. ML needs enough time to let the algorithms learn and develop enough to fulfill their purpose with a
considerable amount of accuracy and relevancy.
b. It also needs massive resources to function.
3. Interpretation of results:
a. To accurately interpret the results generated by the algorithms, we must carefully choose the algorithms for
our purpose.
4. High error-susceptibility:
a. Machine learning is autonomous but highly susceptible to errors.
b. It takes time to recognize the source of the issue, and even longer to correct it.
Following are the applications of machine learning:
1. Image recognition:
a. Image recognition is the process of identifying and detecting an object or a feature in a digital image or
video.
b. This is used in many applications like systems for factory automation, toll booth monitoring, and
security surveillance.
2. Speech recognition:
a. Speech Recognition (SR) is the translation of spoken words into text.
b. It is also known as Automatic Speech Recognition (ASR), computer speech recognition, or Speech To
Text (STT).
c. In speech recognition, a software application recognizes spoken words.
3. Medical diagnosis:
a. ML provides methods, techniques, and tools that can help in solving diagnostic and prognostic
problems in a variety of medical domains.
b. It is being used for the analysis of the importance of clinical parameters and their combinations for
prognosis.
4. Statistical arbitrage:
a. In finance, statistical arbitrage refers to automated trading strategies that are typically short-term and
involve a large number of securities.
b. In such strategies, the user tries to implement a trading algorithm for a set of securities on the basis of
quantities such as historical correlations and general economic variables.
5. Learning associations:
Learning association is the process of discovering relations between variables in large databases.
6. Extraction:
a. Information Extraction (IE) is another application of machine learning.
b. It is the process of extracting structured information from unstructured data.
Q14: Write short note on “Artificial Neural Networks”.
1. Artificial Neural Networks (ANN), or neural networks, are computational algorithms intended to simulate
the behaviour of biological systems composed of neurons.
2. ANNs are computational models inspired by an animal's central nervous systems.
3. It is capable of machine learning as well as pattern recognition.
4. A neural network is an oriented graph. It consists of nodes, which in the biological analogy represent neurons,
connected by arcs.
5. The arcs correspond to dendrites and synapses. Each arc is associated with a weight at each node.
6. A neural network is a machine learning algorithm based on the model of a human neuron. The human brain
consists of millions of neurons.
7. Neurons send and process signals in the form of electrical and chemical signals.
8. These neurons are connected by a special structure known as synapses. Synapses allow neurons to pass
signals.
9. An Artificial Neural Network is an information processing technique. It works like the way human brain
processes information.
10. ANN includes a large number of connected processing units that work together to process information. They
also generate meaningful results from it.
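A minimal Python sketch of how a single artificial neuron combines weighted inputs (the weights on the arcs) and applies an activation function; the numbers used are arbitrary illustrations, not taken from these notes.

# One artificial neuron: weighted sum of inputs plus bias, passed through a sigmoid.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs  = np.array([0.5, 0.8, 0.2])      # signals arriving on incoming arcs
weights = np.array([0.4, -0.6, 0.9])     # one weight per arc (synapse strength)
bias    = 0.1

activation = sigmoid(np.dot(weights, inputs) + bias)  # weighted sum -> activation
print("Neuron output:", activation)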
Q15: Write short note on “Clustering” with its applications.
1. Clustering is the division of data into groups of similar objects.
2. Each group or cluster consists of objects that are similar among themselves and dissimilar to objects of other
groups.
3. A cluster is a collection of data objects that are similar to one another within the same cluster and
dissimilar to the objects in other clusters.
4. Clusters may be described as connected regions of a multidimensional space containing a relatively high density
of points, separated from each other by regions containing a relatively low density of points.
5. From the machine learning perspective, clustering can be viewed as unsupervised learning of concepts.
6. Clustering analyses data objects without help of known class label.
7. In clustering, the class labels are not present in the training data simply because they are not known in advance for the
data objects being clustered.
8. Hence, it is the type of unsupervised learning.
9. For this reason, clustering is a form of learning by observation rather than learning by examples.
10. There are certain situations where clustering is useful. These include:
a. The collection and classification of training data can be costly and time consuming, so it is often
difficult to collect a fully labelled training set; a large number of training samples may remain unlabelled. In such cases it is
useful to train a supervised classifier with a small portion of labelled training data and then use clustering
procedures to tune the classifier based on the large, unclassified dataset.
b. For data mining, it can be useful to search for grouping among the data and then recognize the cluster.
c. The properties of feature vectors can change over time. Then supervised classification is not reasonable,
because the test feature vectors may have completely different properties.
d. The clustering can be useful when it is required to search for good parametric families for the class
conditional densities, in case of supervised classification.
Following are the applications of clustering:
1. Data reduction:
a. In many cases, the amount of available data is very large and its processing becomes complicated.
b. Cluster analysis can be used to group the data into a number of clusters and then process each cluster as a
single entity.
c. In this way, data compression is achieved.
2. Hypothesis generation:
a. In this case, cluster analysis is applied to a data set to infer hypotheses concerning the nature of
the data.
b. Clustering is used here to suggest hypotheses that must then be verified using other data sets.
3. Hypothesis testing: In this context, cluster analysis is used for the verification of the validity of a specific
hypothesis.
4. Prediction based on groups:
a. In this case, cluster analysis is applied to the available data set and then the resulting clusters are
characterized based on the characteristics of the patterns by which they are formed.
b. In this sequence, if an unknown pattern is given, we can determine the cluster to which it is more likely
to belong and characterize it based on the characterization of the respective cluster.
Q16: Differentiate between Clustering and Classification.
Clustering
1. Clustering analyses data objects without known class label.
2. There is no prior knowledge of the attributes of the data to form clusters.
3. It is done by grouping only the input data because output is not predefined.
4. The number of clusters is not known before clustering. These are identified after the completion of clustering.
5. It is considered as unsupervised learning because there is no prior knowledge of the class labels.
Classification
1. In classification, data are grouped by analysing the data objects whose class label is known.
2. There is some prior knowledge of the attributes of each classification.
3. It is done by classifying output based on the values of the input data.
4. The number of classes is known before classification as there is a predefined output based on the input data.
5. It is considered as the supervised learning because class labels are known before.
Q17: What are the various Clustering Techniques?
1. Clustering techniques are used for combining observed examples into clusters or groups which satisfy
two following main criteria:
a. Each group or cluster is homogeneous i.e., examples belong to the same group are similar to each other.
b. Each group or cluster should be different from other clusters i.e., examples that belong to one cluster
should be different from the examples of the other clusters.
2. Depending on the clustering techniques, clusters can be expressed in different ways :
a. Identified clusters may be exclusive, so that any example belongs to only one cluster.
b. They may be overlapping i.e., an example may belong to several clusters.
c. They may be probabilistic i.e., an example belongs to each cluster with a certain probability.
d. Clusters might have hierarchical structure.
Major classifications of clustering techniques are:
Clustering
1. Hierarchical
a. Divisive
b. Agglomerative
2. Partitional
a. Centroid
b. Model Based
c. Graph Theoretic
d. Spectral
Once a criterion function has been selected, clustering becomes a well-defined problem in discrete
optimization: we find those partitions of the set of samples that extremize the criterion function.
Since the sample set is finite, there are only a finite number of possible partitions, so in principle
the clustering problem can always be solved by exhaustive enumeration.
1. Hierarchical clustering:
a. This method works by grouping data objects into a tree of clusters.
b. This method can be further classified depending on whether the hierarchical decomposition is formed in
bottom up (merging) or top down (splitting) fashion.
Following are the two types of hierarchical clustering:
a. Agglomerative hierarchical clustering: This bottom up strategy starts by placing each object in its own
cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single
cluster.
b. Divisive hierarchical clustering: This top-down strategy does the reverse of the agglomerative strategy by
starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces until each
object forms a cluster on its own.
2. Partitional clustering:
a. This method first creates an initial set of partitions, where each partition represents a cluster.
b. The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based
on distance, so that the objects within a cluster are similar whereas the objects of different clusters are
dissimilar.
Following are the types of partitioning methods:
a. Centroid-based clustering:
1. This method takes an input parameter (the number of clusters) and partitions a set of objects into that many
clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
2. Cluster similarity is measured in terms of the mean value of the objects in the cluster, which can be
viewed as the cluster’s centroid or center of gravity.
b. Model-based clustering:
1. This method hypothesizes a model for each of the clusters and finds the best fit of the data to that model.
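The following illustrative sketch contrasts agglomerative (hierarchical, bottom-up) clustering with k-means (centroid-based partitional) clustering on the same synthetic data; scikit-learn is assumed to be available and the data is invented for demonstration.

# Hierarchical (agglomerative) vs centroid-based (k-means) clustering on toy data.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.4, (30, 2)),
               rng.normal(3, 0.4, (30, 2)),
               rng.normal(6, 0.4, (30, 2))])

# Bottom-up (agglomerative): start with single-point clusters and keep merging.
agg = AgglomerativeClustering(n_clusters=3).fit(X)

# Centroid-based partitional: assign points to the nearest of k centroids.
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

print("Agglomerative labels:", agg.labels_[:10])
print("K-means labels:      ", km.labels_[:10])
print("K-means centroids:\n", km.cluster_centers_)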
Q18: Explain Decision Trees with advantages and Disadvantages.
Advantages of decision tree method are:
1. Decision trees are able to generate understandable rules.
2. Decision trees perform classification without requiring much computation.
3. Decision trees are able to handle both continuous and categorical variables.
4. Decision trees provide a clear indication for the fields that are important for prediction or classification.
Disadvantages of decision tree method are:
1. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a
continuous attribute.
2. Decision trees are prone to errors in classification problems with many classes and a relatively small number
of training examples.
3. Decision trees are computationally expensive to train. At each node, each candidate splitting field must be
sorted before its best split can be found.
4. In decision tree algorithms, combinations of fields are used and a search must be made for optimal
combining weights. Pruning algorithms can also be expensive since many candidate sub-trees must be
formed and compared.
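A small hedged sketch illustrating decision trees in practice: training a tree and printing its learned, human-readable rules (scikit-learn assumed to be available; the dataset is chosen for illustration only).

# Train a shallow decision tree and print its if/then rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Decision trees generate understandable rules:
print(export_text(tree, feature_names=["sepal len", "sepal wid",
                                       "petal len", "petal wid"]))
print("Training accuracy:", tree.score(X, y))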
Q19: Write short note on “Support Vector Machine”.
1. Support Vector Machine (SVM) is a machine learning algorithm that analyses data for classification and
regression analysis.
2. SVM is a supervised learning method that looks at data and sorts it into one of two categories.
3. An SVM outputs a map of the sorted data with the margins between the two as far apart as possible.
4. Applications of SVM:
a. Text and hypertext classification
b. Image classification
c. Recognizing handwritten characters
d. Biological sciences, including protein classification
Following are the types of support vector machine:
1. Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into
two classes by using a single straight line, then such data is termed linearly separable data, and the
classifier used is called a Linear SVM classifier.
2. Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot
be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.
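A hedged sketch of the two types above using scikit-learn (assumed available); the circular dataset is an assumption chosen so that only the non-linear (RBF) kernel can separate the classes well.

# Linear vs non-linear SVM on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)   # straight-line boundary
rbf_svm    = SVC(kernel="rbf").fit(X_train, y_train)      # non-linear boundary

print("Linear SVM accuracy:      ", linear_svm.score(X_test, y_test))
print("Non-linear (RBF) accuracy:", rbf_svm.score(X_test, y_test))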
Q20: What are the classes of problems in Machine Learning?
Common classes of problems in machine learning:
1. Classification:
a. In classification, data is labelled, i.e., it is assigned a class, for example, spam/non-spam or fraud/non-
fraud.
b. The decision being modelled is to assign labels to new unlabelled pieces of data.
c. This can be thought of as a discrimination problem, modelling the differences or similarities between
groups.
2. Regression:
a. In regression, data is labelled with a real value rather than a class label.
b. The decision being modelled is what value to predict for new, unseen data.
3. Clustering:
a. In clustering data is not labelled, but can be divided into groups based on similarity and other measures
of natural structure in the data.
b. For example, organising pictures by faces without names, where the human user has to assign names to
groups, like iPhoto on the Mac.
4. Rule extraction:
a. In rule extraction, data is used as the basis for the extraction of propositional rules.
b. These rules discover statistically supportable relationships between attributes in the data.
=====================
Course/Branch : B. Tech-CSE Semester :V
Subject : Machine Learning Techniques Subject Code : BCS055
Q1: Explain Linear, Polynomial and Gaussian Kernel (Radial Basis Function) in detail.
Polynomial Kernel
1. The polynomial kernel is a kernel function used with Support Vector Machines (SVMs) and other kernelized
models that represent the similarity of vectors (training samples) in a feature space over polynomials of the
original variables, allowing learning of non-linear models.
2. The polynomial kernel function is given by the equation:
K(a, b) = (a · b + r)^d
where a and b are two different data points that we need to classify,
r determines the coefficients of the polynomial, and
d determines the degree of the polynomial.
We perform the dot products of the data points, which gives us the high dimensional coordinates for the data.
When d = 1, the polynomial kernel computes the relationship between each pair of observations in 1-
Dimension and these relationships help to find the support vector classifier.
When d = 2, the polynomial kernel computes the 2-Dimensional relationship between each pair of observations
which help to find the support vector classifier.
Gaussian Kernel (Radial Basis Function).
1. RBF kernel is a function whose value depends on the distance from the origin or from some point.
2. The Gaussian kernel is of the following form:
K(X1, X2) = exp(-γ ||X1 - X2||^2)
where ||X1 - X2|| is the Euclidean distance between X1 and X2.
Using this distance in the original space, we calculate the dot product (similarity) of X1 and X2.
3. Following are the parameters used in Gaussian Kernel:
a. C: Inverse of the strength of regularization.
Behaviour: as the value of C increases, the model overfits;
as the value of C decreases, the model underfits.
b. γ: Gamma (used only for the RBF kernel).
Behaviour: as the value of γ increases, the model overfits;
as the value of γ decreases, the model underfits.
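A minimal numeric sketch of the two kernel formulas above, computed directly with NumPy; the example vectors and the values of r, d and γ are arbitrary assumptions for illustration.

# Compute the polynomial and RBF kernel values for two small example vectors.
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.5])

# Polynomial kernel: K(a, b) = (a . b + r)^d
r, d = 1.0, 2
poly = (np.dot(a, b) + r) ** d

# Gaussian / RBF kernel: K(a, b) = exp(-gamma * ||a - b||^2)
gamma = 0.5
rbf = np.exp(-gamma * np.sum((a - b) ** 2))

print("Polynomial kernel value:", poly)
print("RBF kernel value:       ", rbf)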
Q2: Differentiate between Linear Regression and Logistic Regression.
Linear regression
1. Linear regression is a supervised regression model.
2. In Linear regression, we predict a continuous numeric value.
3. No activation function is used.
4. No threshold value is needed.
5. It is based on the least square estimation.
6. Linear regression is used to estimate the dependent variable in case of a change in independent
variables.
7. Linear regression assumes the normal or Gaussian distribution of the dependent variable.
Logistics regression
1. Logistic regression is a supervised classification model.
2. In Logistic regression, we predict the value by 1 or 0.
3. Activation function is used to convert a linear regression equation to the logistic regression equation.
4. A threshold value is required to map the predicted probability to one of the two classes.
5. The dependent variable consists of only two categories.
6. Logistic regression is used to calculate the probability of an event.
7. Logistic regression assumes the binomial distribution of the dependent variable.
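A hedged sketch contrasting the two models on invented toy data (scikit-learn assumed): linear regression returns a continuous value, while logistic regression returns a 0/1 class together with a probability.

# Linear regression -> continuous prediction; logistic regression -> class + probability.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6]])      # hours studied (single feature)
marks = np.array([35, 45, 50, 62, 70, 80])            # continuous target
passed = np.array([0, 0, 0, 1, 1, 1])                 # binary target

lin = LinearRegression().fit(hours, marks)
log = LogisticRegression().fit(hours, passed)

print("Predicted marks for 4.5 h:", lin.predict([[4.5]])[0])       # real number
print("Predicted pass/fail for 4.5 h:", log.predict([[4.5]])[0])   # 0 or 1
print("P(pass) for 4.5 h:", log.predict_proba([[4.5]])[0, 1])      # probability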
Q3: What are the types of Logistic Regression? Explain.
Logistic regression can be divided into the following types:
1. Binary (Binomial) Regression:
a. In this classification, a dependent variable will have only two possible types either 1 or 0.
b. For example, these variables may represent success or failure, yes or no, win or loss etc.
2. Multinomial Regression:
a. In this classification, dependent variable can have three or more possible unordered types or the types
having no quantitative significance.
b. For example, these variables may represent “Type A” or “Type B” or “Type C”
3. Ordinal Regression:
a. In this classification, dependent variable can have three or more possible ordered types or the types
having a quantitative significance.
b. For example, these variables may represent “poor” or “good”, “very good”, “Excellent” and each
category can have the scores like 0, 1, 2, and 3.
Q4: Describe briefly Linear Regression and Logistic Regression.
Linear Regression
Linear regression is a supervised machine learning algorithm where the predicted output is continuous and
has a constant slope.
1. It is used to predict values within a continuous range, (for example: sales, price) rather than trying to
classify into categories (for example: cat, dog).
2. Following are the types of linear regression:
Simple regression:
Simple linear regression uses the traditional slope-intercept form to produce a prediction:
y = mx + b
where m and b are the learned variables (slope and intercept),
x represents our input data and y represents our prediction.
Multivariable Regression:
1. A multi-variable linear equation is given below, where w1, w2, w3 represent the coefficients, or weights:
f(x, y, z) = w1·x + w2·y + w3·z
2. The variables x, y, z represent the attributes, or distinct pieces of information that, we have about
each observation.
3. For sales predictions, these attributes might include a company’s advertising spend on radio, TV,
and newspapers.
Sales = w1 Radio + w2 TV + w3 Newspapers
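As a hedged sketch of the multi-variable sales equation above, the following fits the weights w1, w2, w3 with ordinary least squares using scikit-learn (assumed available); the advertising figures are invented purely to illustrate how the weights are estimated from data.

# Estimate w1, w2, w3 in: Sales = w1*Radio + w2*TV + w3*Newspapers (+ intercept).
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: radio, TV, newspaper spend; rows: observations (made-up numbers).
X = np.array([[10, 50, 5],
              [20, 60, 10],
              [15, 80, 8],
              [30, 90, 12],
              [25, 70, 6]], dtype=float)
sales = np.array([22.0, 30.0, 32.0, 42.0, 34.0])

model = LinearRegression().fit(X, sales)
print("Estimated weights (w1, w2, w3):", model.coef_)
print("Intercept:", model.intercept_)
print("Predicted sales for spend [18, 75, 7]:", model.predict([[18, 75, 7]])[0])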
Logistic Regression
1. Logistic regression is a supervised learning classification algorithm used to predict the probability of a target
variable.
2. The nature of target or dependent variable is dichotomous, which means there would be only two possible
classes.
3. The dependent variable is binary in nature having data coded as either 1 (stands for success/yes) or 0 (stands
for failure/no).
4. A logistic regression model predicts P(Y = 1) as a function of X. It is one of the simplest ML algorithms that
can be used for various classification problems such as spam detection, diabetes prediction, cancer detection
etc.
Q5: What is the assumption in Naïve Bayesian Algorithm that makes it different from Bayesian
Theorem?
Naive Bayes classifier is a type of probabilistic classifier that makes predictions based on Bayes’ theorem with
the “naive” assumption of feature independence. It’s called “naive” because it assumes that the presence of a
particular feature in a class is independent of the presence of other features, given the class label. Despite this
simplifying assumption, Naive Bayes classifiers are powerful and widely used in various machine learning
tasks, particularly in text classification and sentiment analysis.
The Naive Bayes classifier calculates the probability of a data point belonging to each class based on the
observed features. It then predicts the class with the highest probability as the final prediction.
Bayes: This refers to Bayes’ theorem, a theorem in probability theory that helps to calculate the conditional
probability of an event (the class label) happening, given another event (the features) has already occurred.
Why Naive? This assumption of independence is often unrealistic in real-world data. For instance, in spam
filtering, the presence of the word “free” might make the presence of “discount” more likely. However, despite
this simplification, Naive Bayes often performs well — that’s why it’s considered “naive” because it makes an
assumption that might not always hold true.
“Naive Bayes” essentially means a classification algorithm that uses Bayes’ theorem but makes a simplifying
assumption about the independence of features.
Bayes’ Theorem
Bayes’ Theorem is a fundamental principle in probability theory that provides a way to update our beliefs or
the probability of an event based on new evidence or information.
Bayes’ Theorem helps us calculate conditional probabilities. Here, Conditional probability is the probability of
an event occurring given that another event has already occurred. We write Bayes' Theorem in mathematics
like this:
P(A|B) = [P(B|A) · P(A)] / P(B)
Where,
P (A|B): conditional probability of event A occurring given that event B has occurred.
P (B|A): the probability of event B occurring given that event A has already occurred.
P (A): prior probability of event A, before considering any new evidence.
P (B): probability of event B.
Given a feature vector X = (x1, x2, …, xn) and a class variable y, Bayes' Theorem states that:
P(y | X) = [P(X | y) · P(y)] / P(X)
We’re interested in calculating the posterior probability P(y | X) from the likelihood P(X | y) and prior
probabilities P(y),P(X).
Using the chain rule, the likelihood P(X | y) can be decomposed as:
P(X | y) = P(x1 | x2, …, xn, y) · P(x2 | x3, …, xn, y) · … · P(xn | y)
but because of the naive conditional independence assumption, the features are assumed independent of each
other given the class.
Thus, by conditional independence, we have:
P(X | y) = P(x1 | y) · P(x2 | y) · … · P(xn | y)
And as the denominator P(X) remains constant for all class values, the posterior probability can then be written as:
P(y | X) ∝ P(y) · P(x1 | y) · P(x2 | y) · … · P(xn | y)
The Naive Bayes classifier combines this model with a decision rule. One common rule is to pick the
hypothesis that’s most probable; this is known as the maximum a posteriori or MAP decision rule.
Assumption of Naive Bayes
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. In particular:
1. Feature independence: The features of the data are conditionally independent of each other, given the
class label.
2. Continuous features are normally distributed: If a feature is continuous, then it is assumed to be
normally distributed within each class.
3. Discrete features have multinomial distributions: If a feature is discrete, then it is assumed to have a
multinomial distribution within each class.
4. Features are equally important: All features are assumed to contribute equally to the prediction of the
class label.
5. No missing data: The data should not contain any missing values.
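An illustrative sketch of a Gaussian Naive Bayes classifier, which encodes the assumptions above (conditional feature independence and normally distributed continuous features); scikit-learn is assumed to be available and the dataset is chosen only for demonstration.

# Gaussian Naive Bayes: fit, score, and inspect posterior probabilities (MAP rule).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))

# MAP decision rule: the class with the highest posterior probability wins.
print("Posterior probabilities of first test sample:", nb.predict_proba(X_test[:1]))
print("Predicted class:", nb.predict(X_test[:1]))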
Q6: Discuss the various properties and issues of SVM.
Following are the properties of SVM:
1. Flexibility in choosing a similarity (kernel) function.
2. Sparseness of the solution when dealing with large data sets: only the support vectors are used to specify the
separating hyperplane.
3. Ability to handle large feature spaces: the complexity does not depend on the dimensionality of the feature
space.
4. Overfitting can be controlled by the soft margin approach.
5. Training reduces to a simple convex optimization problem which is guaranteed to converge to a single global
solution.
Issues of SVM
1. SVM does not give the best performance for handling text structures as compared to other algorithms that are
used in handling text data. This leads to loss of sequential information and thereby, leading to worse
performance.
2. SVM cannot return the probabilistic confidence value that is similar to logistic regression. This does not
provide much explanation as the confidence of prediction is important in several applications.
3. The choice of the kernel is perhaps the biggest limitation of the support vector machine. Considering so
many kernels present, it becomes difficult to choose the right one for the data.
Q7: Why SVM is an example of a large margin classifier? Discuss the different kernel functions used in
SVM.
The SVM tries to separate the data with the largest margin possible; for this reason the SVM is sometimes
called a large margin classifier. A plain large margin classifier is not very robust to outliers, and, to be fair,
SVMs are a bit more sophisticated and robust than this simple concept alone.
SVM algorithm is used for solving classification problems in machine learning.
One reasonable choice for the best hyperplane in a Support Vector Machine (SVM) is the one that maximizes
the separation margin between the two classes. The maximum-margin hyperplane, also referred to as the hard
margin, is selected based on maximizing the distance between the hyperplane and the nearest data point on
each side.
So we choose the hyperplane whose distance from it to the nearest data point on each side is maximized. If
such a hyperplane exists it is known as the maximum-margin hyperplane/hard margin. So from the above
figure, we choose L2, the separating line with the largest margin. Let us consider a scenario like the one shown below.
Here we have one blue ball inside the boundary of the red balls. So how does SVM classify the data? The blue ball
inside the boundary of the red ones is an outlier of the blue class. The SVM algorithm has the ability to ignore
the outlier and find the best hyperplane that maximizes the margin; in this sense SVM is robust to outliers.
For this type of data, the SVM finds the maximum margin as done with the previous data sets, but in addition it
adds a penalty each time a point crosses the margin. The margins in such cases are called soft margins. When a
soft margin is used, the SVM tries to minimize (1/margin) + λ(∑ penalties). Hinge loss is a commonly used penalty:
if there are no violations there is no hinge loss, and if there are violations the hinge loss is proportional to
the distance of the violation.
Say our data is as shown in the figure above. SVM solves this by creating a new variable using a kernel: for a
point xi on the line, we create a new variable yi as a function of its distance from the origin o. If we plot
this, we get something like the figure shown below.
Types of Kernel in SVM
Here are some common types of kernels in support vector machine algorithms:
1. Linear Kernel
The linear kernel is the simplest and is used when the data is linearly separable.
It calculates the dot product between the feature vectors.
2. Polynomial Kernel
The polynomial kernel is effective for non-linear data.
It computes the similarity between two vectors in terms of the polynomial of the original variables.
3. Radial Basis Function (RBF) Kernel
The RBF kernel is a common type of Kernel in SVM for handling non-linear decision boundaries.
It maps the data into an infinite-dimensional space.
4. Sigmoid Kernel
The sigmoid SVM kernel types can be used as an alternative to the RBF kernel.
It is based on the hyperbolic tangent function and is suitable for neural networks and other non-linear
classifiers.
5. Custom Kernels
In addition to the standard kernels mentioned above, SVMs allow the use of custom kernels tailored to
specific problems.
Custom kernels can be designed based on domain knowledge or problem-specific requirements.
Several factors should be considered when choosing a kernel:
1. Nature of the Data: If the data can be easily split into groups with a straight line, a linear kernel can be
used. But if the data is mixed up and needs more complex boundaries to separate it, kernels like RBF or
polynomial should be preferred.
2. Computational Complexity: Using a linear kernel is faster and uses fewer resources than non-linear
kernels like RBF. So, consider how much computing power you have and how big your application
needs to be.
3. Model Interpretability: With linear kernels it is easier to understand how the model decides between
classes because the decision boundaries are simple. With non-linear kernels the boundaries can get
complicated, which makes it harder to figure out how the model works.
4. Hyperparameter Tuning: Every type of kernel has its own special settings, called hyperparameters, that
need to be adjusted to make the model work at its best. Try out different combinations of these
settings using cross-validation to find the one that works best (see the sketch after this list).
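The following is a minimal scikit-learn sketch of this kernel-selection process. The make_moons dataset and all parameter values are assumptions chosen only to illustrate comparing kernels with cross-validation, not a prescribed setup.

# Minimal sketch (assumed setup): comparing SVM kernels with 5-fold cross-validation
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)   # non-linearly separable toy data

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{kernel:8s} mean CV accuracy = {scores.mean():.3f}")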
Q8: Explain the EM algorithm with the necessary steps.
1. The Expectation-Maximization (EM) algorithm is an iterative way to find maximum-likelihood estimates for
model parameters when the data is incomplete or has missing data points or has some hidden variables.
2. EM chooses random values for the missing data points and estimates a new set of data.
3. These new values are then used recursively to produce better estimates, by filling in the missing points, until
the values stabilize (converge).
4. These are the two basic steps of the EM algorithm:
a. Expectation Step (E-step):
1. Initialize µk, ∑k and πk by random values, or by K-means clustering results, or by hierarchical clustering
results.
2. Then, for those given parameter values, estimate the values of the latent variables (i.e., the responsibilities γk).
b. Maximization Step (M-step): Update the values of the parameters (i.e., µk, ∑k and πk) using the maximum-likelihood method.
The overall procedure is (a small sketch using a Gaussian mixture follows these steps):
1. Initialize the means µk, the covariance matrices ∑k and the mixing coefficients πk by random values (or other values).
2. Compute the responsibilities γk for all k (E-step).
3. Re-estimate all the parameters using the current γk values (M-step).
4. Compute the log-likelihood function.
5. Check the convergence criterion.
6. If the log-likelihood value converges to some value (or if all the parameters converge to some values)
then stop, else return to Step 2.
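A minimal sketch of these steps using scikit-learn's GaussianMixture, which runs EM internally. The two-cluster synthetic data and all settings are assumptions for illustration only.

# Minimal sketch (assumed setup): fitting a Gaussian mixture with EM; mu_k, Sigma_k and
# pi_k correspond to means_, covariances_ and weights_ of the fitted model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),        # cluster 1
               rng.normal(5.0, 1.5, (100, 2))])       # cluster 2

gmm = GaussianMixture(n_components=2, init_params="kmeans", max_iter=100, tol=1e-3)
gmm.fit(X)                                   # alternates E-steps and M-steps until convergence

print("pi_k  :", gmm.weights_)               # mixing coefficients
print("mu_k  :", gmm.means_)                 # component means
print("gamma :", gmm.predict_proba(X[:3]))   # responsibilities for the first 3 points
print("bound :", gmm.lower_bound_)           # lower bound on the log-likelihood at convergence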
Q9: Write short note on “Bayesian Belief Networks”.
1. A Bayesian network is a directed acyclic graph in which each node is annotated with quantitative probability
information.
2. The full specification is as follows:
a. A set of random variables makes up the nodes of the network; variables may be discrete or continuous.
b. A set of directed links or arrows connects pairs of nodes. If there is an arrow from node x to node y, x is
said to be a parent of y.
c. Each node x has a conditional probability distribution P(x | Parents(x)) that quantifies the effect of the
parents on the node.
d. The graph has no directed cycles (and hence is a directed acyclic graph or DAG).
3. A Bayesian network provides a complete description of the domain. Every entry in the full joint probability
distribution can be calculated from the information in the network
4. Bayesian networks provide a concise way to represent conditional independence relationships in the domain.
5. A Bayesian network is often exponentially smaller than the full joint distribution.
For example:
1. Suppose we want to determine the possibility of grass getting wet or dry due to the occurrence of different
seasons.
2. The weather has three states: Sunny, Cloudy, and Rainy. There are two possibilities for the grass: Wet or
Dry.
3. The sprinkler can be on or off. If it is rainy, the grass gets wet but if it is sunny, we can make grass wet by
pouring water from a sprinkler.
4. Suppose that the grass is wet. This could be due to one of two reasons: firstly, it is raining; secondly,
the sprinklers are turned on.
5. Using Bayes’ rule, we can deduce the most contributing factor towards the wet grass (a small enumeration
sketch is given below).
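A minimal sketch of this reasoning in plain Python. The conditional probability values are made up, and the three weather states are collapsed to rain / no rain purely for brevity.

# Minimal sketch with assumed (made-up) probabilities for the sprinkler example.
# Joint factorisation: P(R, S, G) = P(R) * P(S) * P(G | R, S)
from itertools import product

P_rain = {"rain": 0.3, "no_rain": 0.7}
P_sprinkler = {"on": 0.4, "off": 0.6}
P_wet_given = {                               # P(Grass = wet | Rain, Sprinkler)
    ("rain", "on"): 0.99, ("rain", "off"): 0.9,
    ("no_rain", "on"): 0.8, ("no_rain", "off"): 0.05,
}

def joint(r, s, g):
    p_wet = P_wet_given[(r, s)]
    return P_rain[r] * P_sprinkler[s] * (p_wet if g == "wet" else 1 - p_wet)

# Evidence: the grass is wet. Compare the two explanations using Bayes' rule.
p_wet = sum(joint(r, s, "wet") for r, s in product(P_rain, P_sprinkler))
p_rain_given_wet = sum(joint("rain", s, "wet") for s in P_sprinkler) / p_wet
p_sprk_given_wet = sum(joint(r, "on", "wet") for r in P_rain) / p_wet
print(f"P(Rain | wet)      = {p_rain_given_wet:.3f}")
print(f"P(Sprinkler | wet) = {p_sprk_given_wet:.3f}")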
Bayesian network possesses the following merits in uncertainty knowledge representation:
1. A Bayesian network can conveniently handle incomplete data.
2. A Bayesian network can learn the causal relations of variables. In data analysis, causal relations are helpful
for understanding field knowledge; they can also lead to precise predictions even under much interference.
3. The combination of Bayesian networks and Bayesian statistics can take full advantage of field knowledge and
information from data.
4. The combination of Bayesian networks with other models can effectively avoid the over-fitting problem.
Q10: What is Bayesian Learning? Explain how the decision error for Bayesian Classification is minimized.
1. Bayesian classifier can be made optimal by minimizing the classification error probability.
2. In Fig. 2.7.1, it is observed that when the threshold is moved away from x, the corresponding shaded area
under the curves always increases.
3. Hence, we have to decrease this shaded area to minimize the error.
4. Let R1, be the region of the feature space for ω1, and R2, be the corresponding region for ω2.
5. Then an error occurs if x ∈ R1 although it belongs to ω2, or if x ∈ R2 although it belongs to ω1, i.e.,
Pe = P(x ∈ R2, ω1) + P(x ∈ R1, ω2)   ...(2.7.1)
6. Pe can be written as:
Pe = P(x ∈ R2 | ω1) P(ω1) + P(x ∈ R1 | ω2) P(ω2)
   = P(ω1) ∫R2 p(x | ω1) dx + P(ω2) ∫R1 p(x | ω2) dx   ...(2.7.2)
7. Using Bayes’ rule, this becomes:
Pe = ∫R2 P(ω1 | x) p(x) dx + ∫R1 P(ω2 | x) p(x) dx   ...(2.7.3)
8. The error will be minimized if the partitioning regions R1 and R2 of the feature space are chosen so that:
R1 : P(ω1 | x) > P(ω2 | x)
R2 : P(ω2 | x) > P(ω1 | x)   ...(2.7.4)
9. Since the union of the regions R1, R2 covers all the space, we have:
∫R1 P(ω1 | x) p(x) dx + ∫R2 P(ω1 | x) p(x) dx = P(ω1)   ...(2.7.5)
10. Combining equations (2.7.3) and (2.7.5), we get:
Pe = P(ω1) − ∫R1 [P(ω1 | x) − P(ω2 | x)] p(x) dx   ...(2.7.6)
11. Thus the probability of error is minimized if R1 is the region of space in which P(ω1 | x) > P(ω2 | x);
R2 then becomes the region where the reverse is true.
12. In a classification task with M classes ω1, ω2, ..., ωM, an unknown pattern, represented by the feature
vector x, is assigned to class ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i (a small sketch of this decision rule for two
classes is given below).
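As a small illustration of this minimum-error rule, the sketch below assumes two Gaussian class-conditional densities and made-up priors, and assigns each point to the class with the larger unnormalised posterior.

# Minimal sketch (assumed densities): assign x to the class with the larger P(w_i | x),
# which is proportional to p(x | w_i) * P(w_i).
from scipy.stats import norm

prior = {"w1": 0.6, "w2": 0.4}                        # assumed priors
lik = {"w1": norm(loc=0.0, scale=1.0),                # assumed class-conditional densities
       "w2": norm(loc=2.0, scale=1.0)}

def decide(x):
    scores = {w: lik[w].pdf(x) * prior[w] for w in prior}   # unnormalised posteriors
    return max(scores, key=scores.get)

for x in [-1.0, 0.5, 1.2, 3.0]:
    print(x, "->", decide(x))     # the decision threshold lies where the two weighted curves cross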
Q11: Define Bayes Classifier. Explain how Classification is done using Bayes Classifier.
1. A Bayes classifier is a simple probabilistic classifier based on applying Bayes theorem (from Bayesian
statistics) with strong (Naive) independence assumptions.
2. A Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is
unrelated to the presence (or absence) of any other feature.
3. Depending on the precise nature of the probability model, Naive Bayes classifiers can be trained very
efficiently in a supervised learning.
4. In many practical applications, parameter estimation for Naive Bayes models uses the method of
maximum likelihood; in other words, one can work with the Naive Bayes model without believing in
Bayesian probability or using any Bayesian methods.
5. An advantage of the Naive Bayes classifier is that it requires a small amount of training data to
estimate the parameters (means and variances of the variables) necessary for classification.
6. The perceptron bears a certain relationship to a classical pattern classifier known as the Bayes
classifier.
7. When the environment is Gaussian, the Bayes classifier reduces to a linear classifier.
In the Bayes classifier, or Bayes hypothesis testing procedure, we minimize the average risk, denoted
by R. For a two-class problem, represented by classes C1 and C2, the average risk is defined as:
R = c11 p1 ∫H1 pX(x | C1) dx + c22 p2 ∫H2 pX(x | C2) dx
  + c21 p1 ∫H2 pX(x | C1) dx + c12 p2 ∫H1 pX(x | C2) dx
Where the various terms are defined as follows:
pi = prior probability that the observation vector x is drawn from subspace Hi (i.e., corresponds to class Ci), with i = 1, 2 and p1 + p2 = 1
cij = cost of deciding in favour of class Ci, represented by subspace Hi, when class Cj is true, with i, j = 1, 2
pX(x | Ci) = conditional probability density function of the random vector X, given that the observation vector is drawn from class Ci
8. Fig. 2.9.1(a) depicts a block diagram representation of the Bayes classifier.
The important points in this block diagram are two fold:
a. The data processing involved in designing the Bayes classifier is confined entirely to the computation of
the likelihood ratio Λ(x).
b. This computation is completely invariant to the values assigned to the prior probabilities and costs
involved in the decision-making process. These quantities merely affect the value of the threshold ξ.
c. From a computational point of view, we find it more convenient to work with the logarithm of the
likelihood ratio rather than the likelihood ratio itself.
Fig. 2.9.1. Two equivalent implementations of the Bayes classifier:
(a) Likelihood ratio test: the input vector x is passed to a likelihood ratio computer, which produces Λ(x); a
comparator then assigns x to class C1 if Λ(x) > ξ, otherwise it assigns x to class C2.
(b) Log-likelihood ratio test: the comparator assigns x to class C1 if log Λ(x) > log ξ, otherwise to class C2.
Q12: Discuss Bayes Classifier using some example in detail.
The definition of the Bayes classifier given in the previous answer applies here; its use for classification is illustrated below.
For example:
1. Let D be a training set of features and their associated class labels. Each feature is represented by an n-
dimensional attribute vector X = (x1, x2, ..., xn), depicting n measurements made on the feature from n
attributes A1, A2, ..., An respectively.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a feature X, the classifier will predict that X
belongs to the class having the highest posterior probability conditioned on X. That is, the classifier
predicts that X belongs to class Ci if and only if P(Ci | X) > P(Cj | X) for 1 <= j <= m, j ≠ i.
Thus, we maximize P(Ci | X). The class Ci for which P(Ci | X) is maximized is called the maximum
posterior hypothesis. By Bayes’ theorem,
P(Ci | X) = [P(X | Ci) P(Ci)] / P(X)
3. As P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally likely, i.e.,
P(C1) = P(C2) = ... = P(Cm), and therefore only P(X | Ci) is maximized. Otherwise P(X | Ci) P(Ci) is maximized.
4. i. Given data sets with many attributes, the computation of P(X | Ci) will be extremely expensive.
ii. To reduce computation in evaluating P( X | Ci ), the assumption of class conditional independence is
made.
iii. This presumes that the values of the attributes are conditionally independent of one another, given
the class label of the feature.
Thus, P(X | Ci) = ∏k=1..n P(xk | Ci)
= P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)
iv. The probabilities P(x1 | Ci), P(x2 | Ci), ..., P(xn | Ci) are easily estimated from the training features.
Here xk refers to the value of attribute Ak for the feature; for each attribute, it is checked whether the
attribute is categorical or continuous valued.
v. For example, to compute P(X | Ci) we consider,
a. If Ak is categorical, then P(xk | Ci) is the number of features of class Ci in D having the value xk
for Ak, divided by |Ci,D|, the number of features of class Ci in D.
b. If Ak is continuous valued, then the continuous valued attribute is typically assumed to have a
Gaussian distribution with mean µ and standard deviation σ, defined by:
g(x, µ, σ) = (1 / (√(2π) σ)) · e^(−(x − µ)² / (2σ²))
So that, P(xk | Ci) = g(xk, µCi, σCi)
vi. There is a need to compute the mean μ and the standard deviation σ of the value of attribute Ak for
training set of class Ci. These values are used to estimate P ( Xk | Ci).
vii. For example, let X = (35, Rs. 40,000) where A1 and A2 are the attributes age and income,
respectively. Let the class label attribute be buys-computer.
viii. The associated class label for X is yes (i.e., buys-computer = yes). Let’s suppose that age has not
been discretized and therefore exists as a continuous valued attribute.
ix. Suppose that from the training set, we find that customers in D who buy a computer are 38 ± 12
years of age. In other words, for attribute age and this class, we have µ = 38 and σ = 12.
5. In order to predict the class label of X, P(X | Ci) P(Ci) is evaluated for each class Ci. The classifier
predicts that the class label of X is the class Ci if and only if
P(X | Ci) P(Ci) > P(X | Cj) P(Cj) for 1 <= j <= m, j ≠ i
The predicted class label is the class Ci for which P(X | Ci) P(Ci) is the maximum (a small sketch of this
procedure with a library implementation is given below).
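A minimal library-based sketch of the same procedure using scikit-learn's GaussianNB, which estimates a per-class mean and variance for each continuous attribute. The age/income rows and labels below are invented for illustration; income is written in thousands of rupees.

# Minimal sketch (assumed data): Gaussian Naive Bayes on two continuous attributes.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# columns: age, income (in thousands of rupees); labels: 1 = buys computer, 0 = does not
X = np.array([[25, 30], [32, 45], [38, 40], [45, 60], [50, 20], [23, 18], [40, 52], [29, 25]])
y = np.array([1, 1, 1, 1, 0, 0, 1, 0])

clf = GaussianNB()
clf.fit(X, y)                        # estimates mu and sigma^2 per class for each attribute

print(clf.theta_)                    # class-wise means (mu) of each attribute
print(clf.predict([[35, 40]]))       # predicted class label for X = (35, Rs. 40,000)
print(clf.predict_proba([[35, 40]])) # posterior probabilities P(Ci | X)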
Q13: Explain Naïve Bayes Classifier.
1. Naive Bayes model is the most common Bayesian network model used in machine learning.
2. Here, the class variable C is the root which is to be predicted and the attribute variables Xi are the leaves.
3. The model is Naive because it assumes that the attributes are conditionally independent of each other,
given the class.
4. Assuming Boolean variables, the parameters are:
θ = P(C = true), θi1 = P(Xi = true | C = true), θi2 = P(Xi = true | C = false)
5. Naive Bayes models can be viewed as Bayesian networks in which each Xi has C as the sole parent and
C has no parents.
6. A Naive Bayes model with Gaussian P(Xi | C) is equivalent to a mixture of Gaussians with diagonal
covariance matrices.
7. While mixtures of Gaussians are used for density estimation in continuous domains, Naive Bayes
models are used in discrete and mixed domains.
8. Naive Bayes models allow for very efficient inference of marginal and conditional distributions.
9. Naive Bayes learning has no difficulty with noisy data and can give more appropriate probabilistic
predictions.
Q14: Describe the Usage, Advantages and Disadvantages of EM Algorithm.
Usage of EM algorithm:
1. It can be used to fill the missing data in a sample.
2. It can be used as the basis of unsupervised learning of clusters.
3. It can be used for the purpose of estimating the parameters of Hidden Markov Model (HMM).
4. It can be used for discovering the values of latent variables.
Advantages of EM algorithm are:
1. It is always guaranteed that likelihood will increase with each iteration.
2. The E-step and M-step are often pretty easy for many problems in terms of implementation.
3. Solutions to the M-step often exist in closed form.
Disadvantages of EM algorithm are:
1. It has slow convergence.
2. It makes convergence to the local optima only.
3. It requires both the probabilities, forward and backward (numerical optimization requires only forward probability).
Q15: How is the Bayesian Network powerful representation for uncertainty knowledge? Explain with example.
A Bayesian network is a directed acyclic graph in which each node is annotated with quantitative probability
information.
The full specification is as follows:
i. A set of random variables makes up the nodes of the network; variables may be discrete or continuous.
ii. A set of directed links or arrows connects pairs of nodes. If there is an arrow from node x to node y, x is
said to be a parent of y.
iii. Each node xi has a conditional probability distribution P(xi | Parents(xi)) that quantifies the effect of the
parents on the node.
parents on the node.
iv. The graph has no directed cycles (and hence is a directed acyclic graph or DAG).
3. A Bayesian network provides a complete description of the domain. Every entry in the full joint probability
distribution can be calculated from the information in the network.
4. Bayesian networks provide a concise way to represent conditional independence relationships in the domain.
5. A Bayesian network is often exponentially smaller than the full joint distribution.
For example:
1. Suppose we want to determine the possibility of grass getting wet or dry due to the occurrence of
different seasons.
2. The weather has three states: Sunny, Cloudy, and Rainy. There are two possibilities for the grass: Wet
or Dry.
3. The sprinkler can be on or off. If it is rainy, the grass gets wet but if it is sunny, we can make grass wet
by pouring water from a sprinkler.
4. Suppose that the grass is wet. This could be due to one of two reasons: firstly, it is raining; secondly,
the sprinklers are turned on.
5. Using Bayes’ rule, we can deduce the most contributing factor towards the wet grass.
Bayesian network possesses the following merits in uncertainty knowledge representation:
1. A Bayesian network can conveniently handle incomplete data.
2. A Bayesian network can learn the causal relations of variables. In data analysis, causal relations are helpful
for understanding field knowledge; they can also lead to precise predictions even under much
interference.
3. The combination of Bayesian networks and Bayesian statistics can take full advantage of field
knowledge and information from data.
4. The combination of Bayesian networks with other models can effectively avoid the over-fitting problem.
Q16: Explain the role Prior Probability and Posterior Probability in Bayesian Classification.
Role of prior probability:
1. The prior probability is used to compute the probability of the event before the collection of new data.
2. It is used to capture our assumptions / domain knowledge and is independent of the data.
3. It is the unconditional probability that is assigned before any relevant evidence is taken into account.
Role of posterior probability:
1. Posterior probability is used to compute the probability of an event after collection of data.
2. It is used to capture both the assumptions / domain knowledge and the pattern in observed data.
3. It is the conditional probability that is assigned after the relevant evidence or background is taken into
account.
Q17: Explain the types and properties of Support Vector Machine.
Following are the types of support vector machine:
1. Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes
by using a single straight line, then such data is termed linearly separable data, and the classifier used is
called a Linear SVM classifier.
2. Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified
by using a straight line, then such data is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.
Following are the properties of SVM:
1. Flexibility in choosing a similarity (kernel) function.
2. Sparseness of the solution when dealing with large data sets: only the support vectors are used to specify the
separating hyperplane.
3. Ability to handle large feature spaces: the complexity does not depend on the dimensionality of the feature
space.
4. Overfitting can be controlled by the soft margin approach.
5. Training involves a simple convex optimization problem which is guaranteed to converge to a single global solution.
Q18: What are the parameters used in Support Vector Classifier?
Parameters used in support vector classifier are:
1. Kernel:
a. Kernel is selected based on the type of data and also the type of transformation.
b. By default, the kernel is Radial Basis Function Kernel (RBF).
2. Gamma:
a. This parameter decides how far the influence of a single training example reaches during
transformation, which in turn affects how tightly the decision boundaries end up surrounding points in
the input space.
b. If there is a small value of gamma, points farther apart are considered similar.
c. So, more points are grouped together and have smoother decision boundaries (may be less accurate).
d. Larger values of gamma cause points to be closer together (may cause overfitting).
3. The 'C' parameter:
a. This parameter controls the amount of regularization applied to the data.
b. Large values of C mean low regularization, which causes the training data to be fitted very well (may
cause overfitting).
c. Lower values of C mean higher regularization, which causes the model to be more tolerant of errors
(may lead to lower accuracy). A small sketch of the effect of C and gamma is given below.
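The sketch below illustrates, on an assumed noisy synthetic dataset, how extreme C and gamma values push an RBF-kernel SVC towards overfitting; all parameter values are assumptions for illustration.

# Minimal sketch (assumed setup): the effect of C and gamma on train vs. test accuracy.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for C, gamma in [(0.1, 0.1), (1, 1), (1000, 10)]:       # from strong to weak regularization
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
    print(f"C={C:<6} gamma={gamma:<4} "
          f"train acc={clf.score(X_tr, y_tr):.2f}  test acc={clf.score(X_te, y_te):.2f}")
# Large C together with large gamma typically pushes the training accuracy towards 1.0
# while the test accuracy drops, i.e. overfitting.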
Q19: What are the Advantages and Disadvantages of Support Vector Machines?
Advantages of SVM are:
1. Guaranteed optimality: Owing to the nature of Convex Optimization, the solution will always be global
minimum, not a local minimum.
2. The abundance of implementations: We can access it conveniently.
3. SVM can be used for linearly separable as well as non-linearly separable data. Linearly separable data is
handled with a hard margin, whereas non-linearly separable data requires a soft margin.
4. SVMs can be extended to semi-supervised learning, i.e., to areas where the data is partly labeled and partly
unlabeled. This only requires adding a condition to the minimization problem, which is known as the
transductive SVM.
5. Feature mapping used to be quite a load on the computational complexity of the overall training of the model.
However, with the help of the kernel trick, SVM can carry out the feature mapping implicitly using a
simple dot product.
Disadvantages of SVM:
1. SVM does not give the best performance for handling text structures as compared to other algorithms used for
text data; the loss of sequential information leads to worse performance.
2. SVM does not return a probabilistic confidence value in the way logistic regression does, which limits its
usefulness in applications where the confidence of a prediction is important.
3. The choice of the kernel is perhaps the biggest limitation of the support vector machine. With so many
kernels available, it becomes difficult to choose the right one for the data.
Q20: Write a short Note on Hyperplane (Decision Surface).
1. A hyperplane in an n-dimensional Euclidean space is a flat, n-1 dimensional subset of that space that divides
the space into two disconnected parts.
2. For example, let us take a line as a one-dimensional Euclidean space.
3. Now pick a point on the line; this point divides the line into two parts.
4. The line has 1 dimension, while the point has 0 dimensions. So a point is a hyperplane of the line.
5. For two dimensions, we saw that the separating line was the hyperplane.
6. Similarly, for three dimensions, a plane with two dimensions divides the 3D space into two parts and thus acts as
a hyperplane.
7. Thus, for a space of n dimensions we have a hyperplane of n−1 dimensions separating it into two parts.
=====================
MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY
NH-58, Delhi-Roorkee Highway, Baghpat Road, Meerut – 250 005 U.P.
ODD Semester 2024-25
Course/Branch : B.Tech-CSE Semester :V
Subject : Machine Learning Techniques Subject Code : BCS055
Q1: Explain ID3 Algorithm.
ID3 (Iterative Dichotomiser 3)
1. ID3 is an algorithm used to generate a decision tree from a dataset.
2. To construct a decision tree, ID3 uses a top-down, greedy search through the given sets,
where each attribute at every tree node is tested to select the attribute that is best for
classification of a given set.
3. Therefore, the attribute with the highest information gain can be selected as the test
attribute of the current node.
4. In this algorithm, small decision trees are preferred over the larger ones. It is a heuristic
algorithm because it does not construct the smallest tree.
5. For building a decision tree model, ID3 only accepts categorical attributes. Accurate results
are not given by ID3 when there is noise and when it is serially implemented.
6. Therefore data is preprocessed before constructing a decision tree.
7. For constructing a decision tree, information gain is calculated for each and every attribute,
and the attribute with the highest information gain becomes the root node. The remaining possible
values are denoted by arcs.
8. All the possible outcome instances are examined to check whether they belong to the same
class or not. For instances of the same class, a single name is used to denote the class;
otherwise the instances are classified on the basis of the splitting attribute.
Procedure for ID3 Algorithm:
ID3 (Examples, Target Attribute, Attributes)
1. Create a Root node for the tree.
2. If all Examples are positive, return the single-node tree root, with label = +
3. If all Examples are negative, return the single-node tree root, with label = -
4. If Attributes is empty, return the single-node tree root, with label = most common value of
target attribute in examples.
5. Otherwise begin
a. A ← the attribute from Attributes that best classifies Examples
b. The decision attribute for Root ← A
c. For each possible value, Vi of A,
i. Add a new tree branch below root, corresponding to the test A = Vi
ii. Let Example Vi be the subset of Examples that have value Vi for A
iii. If Example Vi is empty
a. Then below this new branch add a leaf node with label = most common value
of Target Attribute in Examples
b. Else below this new branch add the sub-tree ID3(Example Vi,
Target Attribute, Attributes − {A})
6. End
7. Return root.
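A minimal sketch of entropy-based tree growing in Python. Note that scikit-learn implements an optimised CART-style learner (here with criterion='entropy'), not ID3 itself, and expects numeric inputs, so the categorical attributes are one-hot encoded; the small play-tennis-style table is invented for illustration.

# Minimal sketch: growing a tree with the entropy (information gain) criterion.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast", "Sunny", "Rain"],
    "Windy":   ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes"],
    "Play":    ["No", "No", "Yes", "Yes", "No", "Yes", "Yes", "No"],
})
X = pd.get_dummies(data[["Outlook", "Windy"]])   # one-hot encode the categorical attributes
y = data["Play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # text rendering of the learned tree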
Q2: What is the limitation of Decision Tree?
Decision Tree models are sophisticated analytical models that are simple to comprehend,
visualize, execute, and score, with minimum data pre-processing required. These are supervised
learning systems in which input is constantly split into distinct groups based on specified factors.
They also have limitations which we are going to discuss; when there are few decisions and
consequences in the tree, decision trees are generally simple to understand. Typical examples
include the inability to measure attribute values, the high cost and complexity of such measures
and the lack of availability of all attributes at the same time.
Limitations of Decision tree
1. Not good for Regression
Logistic regression is a statistical analysis approach that uses independent features to try to predict
precise probability outcomes. On high-dimensional datasets, this may cause the model to be over-
fit on the training set, overstating the accuracy of predictions on the training set, and so preventing
the model from accurately predicting results on the test set.
This is most common when the model is trained on a small amount of training data with a large
number of features. Regularization strategies should be considered on high-dimensional datasets
to minimize over-fitting (but this makes the model complex). The model may be under-fit on the
training data if the regularization variables are too high.
Complex correlations are difficult to capture with logistic regression. This approach is readily
outperformed by more powerful and complicated algorithms such as Neural Networks.
Because logistic regression (see the figure above) has a linear decision surface, it cannot tackle
nonlinear issues. In real-world circumstances, linearly separable data is uncommon. As a result,
non-linear features must be transformed, which can be done by increasing the number of features
such that the data can be separated linearly in higher dimensions.
2. Overfitting Problem
Overly complicated trees can be created by decision-tree learners, which do not generalize the
input well. This is referred to as overfitting. Some of the important techniques to avoid such
problems are –
Pruning
Establishing the minimum amount of samples required at a leaf node
Setting the maximum depth of the tree
If we continue to develop the tree, each row of the input data table may be seen as the final rule.
On the training data, the model will perform admirably, but it will fail to validate on the test data.
Overfitting occurs when the tree reaches a particular level of complexity. Overfitting is quite
likely to occur in a really large tree.
The decision tree learner makes an effort to avoid overfitting. Trees are nearly always stopped before reaching
the depth at which each leaf node would include observations from only one class or only one observation point.
There are several methods for determining when to stop growing the tree.
If a leaf node is a pure node at any point during the growth process, no additional downstream
trees will grow from that node. Other leaf nodes can be used to continue growing the tree.
Growth is also stopped when the decrease in tree impurity is relatively slight: when the impurity lowers by a very
small amount, say 0.001 or less, this user input parameter causes the tree to be terminated.
When there are only a few observations remaining on the leaf node. This ensures that the tree is
terminated when the node’s reliability for further splitting is questioned due to the limited sample
size. According to the Central Limit Theorem, a big sample consists of around 30 observations
when they are mutually independent. This can serve as a general guide, but because we typically
work with multi-dimensional observations that may be associated, this user input parameter
should be higher than 30, say 50 or 100 or more.
3. Expensive
The cost of creating a decision tree is high since each node requires field sorting. In other
algorithms, a mixture of several fields is used at the same time, resulting in even higher expenses.
Pruning methods are also expensive due to the large number of candidate subtrees that must be
produced and compared.
4. Independency between samples
Each training example must be completely independent of the other samples in the dataset. If they
are related in some manner, the model will try to give those specific training instances more
weight. As a result, no matched data or repeated measurements should be used as training data.
5. Unstable
Because slight changes in the data can result in an entirely different tree being constructed,
decision trees can be unstable. The use of decision trees within an ensemble helps to solve this
difficulty.
6. Greedy Approach
To form a binary tree, the input space must be partitioned correctly. The greedy algorithm used
for this is recursive binary splitting. It is a numerical procedure that entails the alignment of
various values. Data will be split according to the first best split, and only that path will be used to
split the data. However, various pathways of the split could be more instructive; thus, that split
may not be the best.
7. Predictions Are Not Smooth or Continuous
As shown in the diagram below, decision tree forecasts are neither smooth nor continuous but
piecewise constant approximations.
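As a small illustration of the pruning-related controls mentioned above (maximum tree depth, minimum samples per leaf), the following sketch compares an unrestricted tree with a pre-pruned one; the dataset and parameter values are assumptions for illustration only.

# Minimal sketch (assumed setup): limiting tree depth and leaf size to reduce overfitting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)            # grown to purity
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20,
                                random_state=0).fit(X_tr, y_tr)          # pre-pruned tree

for name, m in [("full tree", full), ("pre-pruned", pruned)]:
    print(f"{name:10s} train={m.score(X_tr, y_tr):.3f}  test={m.score(X_te, y_te):.3f}")
# The unrestricted tree usually fits the training set almost perfectly but
# generalises worse than the shallower, pre-pruned tree.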
Q3: Discuss why we use SVM Kernels and in which scenario which SVM kernel is used?
Kernel Method in SVMs
Support Vector Machines (SVMs) use kernel methods to transform the input data into a higher-dimensional
feature space, which makes it simpler to distinguish between classes or generate predictions. Kernel
approaches in SVMs work on the fundamental principle of implicitly mapping input data into a higher-
dimensional feature space without directly computing the coordinates of the data points in that space.
The kernel function in SVMs is essential in determining the decision boundary that divides the various
classes. In order to calculate the degree of similarity between any two points in the feature space, the kernel
function computes their dot product.
The most commonly used kernel function in SVMs is the Gaussian or radial basis function (RBF) kernel. The
RBF kernel maps the input data into an infinite-dimensional feature space using a Gaussian function. This
kernel function is popular because it can capture complex nonlinear relationships in the data.
Other types of kernel functions that can be used in SVMs include the polynomial kernel, the sigmoid kernel,
and the Laplacian kernel. The choice of kernel function depends on the specific problem and the
characteristics of the data.
Basically, kernel methods in SVMs are a powerful technique for solving classification and regression
problems, and they are widely used in machine learning because they can handle complex data structures and
are robust to noise and outliers.
Characteristics of Kernel Function
Kernel functions used in machine learning, including in SVMs (Support Vector Machines), have several
important characteristics, including:
Mercer's condition: A kernel function must satisfy Mercer's condition to be valid. This condition
ensures that the kernel matrix formed from any set of inputs is positive semi-definite, i.e., its
eigenvalues are all greater than or equal to zero.
Positive definiteness: A kernel function is positive definite if it is always greater than zero except for
when the inputs are equal to each other.
Non-negativity: A kernel function is non-negative, meaning that it produces non-negative values for
all inputs.
Symmetry: A kernel function is symmetric, meaning that it produces the same value regardless of the
order in which the inputs are given.
Reproducing property: A kernel function satisfies the reproducing property if it can be used to
reconstruct the input data in the feature space.
Smoothness: A kernel function is said to be smooth if it produces a smooth transformation of the
input data into the feature space.
Complexity: The complexity of a kernel function is an important consideration, as more complex
kernel functions may lead to overfitting and reduced generalization performance.
Basically, the choice of kernel function depends on the specific problem and the characteristics of the data,
and selecting an appropriate kernel function can significantly impact the performance of machine learning
algorithms.
Major Kernel Function in Support Vector Machine
In Support Vector Machines (SVMs), there are several types of kernel functions that can be used to map the
input data into a higher-dimensional feature space. The choice of kernel function depends on the specific
problem and the characteristics of the data.
Here are some most commonly used kernel functions in SVMs:
Linear Kernel
A linear kernel is a type of kernel function used in machine learning, including in SVMs (Support Vector
Machines). It is the simplest and most commonly used kernel function, and it defines the dot product between
the input vectors in the original feature space.
The linear kernel can be defined as:
K(x, y) = x .y
Where x and y are the input feature vectors. The dot product of the input vectors is a measure of their
similarity or distance in the original feature space.
When using a linear kernel in an SVM, the decision boundary is a linear hyperplane that separates the
different classes in the feature space. This linear boundary can be useful when the data is already separable by
a linear decision boundary or when dealing with high-dimensional data, where the use of more complex
kernel functions may lead to overfitting.
Polynomial Kernel
A particular kind of kernel function utilised in machine learning, such as in SVMs, is a polynomial kernel
(Support Vector Machines). It is a nonlinear kernel function that employs polynomial functions to transfer the
input data into a higher-dimensional feature space.
The polynomial kernel can be defined as:
K(x, y) = (x · y + c)^d
Where x and y are the input feature vectors, c is a constant term, and d is the degree of the polynomial. The
constant term is added to the dot product of the input vectors, and the result is raised to the degree of the polynomial.
The decision boundary of an SVM with a polynomial kernel might capture more intricate correlations
between the input characteristics because it is a nonlinear hyperplane.
The degree of nonlinearity in the decision boundary is determined by the degree of the polynomial.
The polynomial kernel has the benefit of being able to detect both linear and nonlinear correlations in the
data. It can be difficult to select the proper degree of the polynomial, though, as a larger degree can result in
overfitting while a lower degree cannot adequately represent the underlying relationships in the data.
In general, the polynomial kernel is an effective tool for converting the input data into a higher-dimensional
feature space in order to capture nonlinear correlations between the input characteristics.
Gaussian (RBF) Kernel
The Gaussian kernel, also known as the radial basis function (RBF) kernel, is a popular kernel function used
in machine learning, particularly in SVMs (Support Vector Machines). It is a nonlinear kernel function that
maps the input data into a higher-dimensional feature space using a Gaussian function.
The Gaussian kernel can be defined as:
K(x, y) = exp(-gamma * ||x - y||^2)
Where x and y are the input feature vectors, gamma is a parameter that controls the width of the Gaussian
function, and ||x - y||^2 is the squared Euclidean distance between the input vectors.
When using a Gaussian kernel in an SVM, the decision boundary is a nonlinear hyperplane that can capture
complex nonlinear relationships between the input features. The width of the Gaussian function, controlled by
the gamma parameter, determines the degree of nonlinearity in the decision boundary.
One advantage of the Gaussian kernel is its ability to capture complex relationships in the data without the
need for explicit feature engineering. However, the choice of the gamma parameter can be challenging, as a
smaller value may result in underfitting, while a larger value may result in overfitting.
Laplace Kernel
The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is a type of kernel function
used in machine learning, including in SVMs (Support Vector Machines). It is a non-parametric kernel that
can be used to measure the similarity or distance between two input feature vectors.
The Laplacian kernel can be defined as:
K(x, y) = exp(-gamma * ||x - y||)
Where x and y are the input feature vectors, gamma is a parameter that controls the width of the Laplacian
function, and ||x - y|| is the L1 norm or Manhattan distance between the input vectors.
When using a Laplacian kernel in an SVM, the decision boundary is a nonlinear hyperplane that can capture
complex relationships between the input features. The width of the Laplacian function, controlled by the
gamma parameter, determines the degree of nonlinearity in the decision boundary.
One advantage of the Laplacian kernel is its robustness to outliers, as it places less weight on large distances
between the input vectors than the Gaussian kernel. However, like the Gaussian kernel, choosing the correct
value of the gamma parameter can be challenging.
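To illustrate the custom-kernel option discussed above, the sketch below passes a Laplacian kernel to scikit-learn's SVC as a callable that returns the Gram matrix; the gamma value and the dataset are assumptions chosen only for illustration.

# Minimal sketch: a custom (Laplacian) kernel supplied to SVC as a callable.
# The callable must return the Gram matrix of shape (n_samples_X, n_samples_Y).
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

gamma = 0.5   # assumed value; would normally be tuned by cross-validation

def laplacian_kernel(X, Y):
    # K(x, y) = exp(-gamma * ||x - y||_1), using the Manhattan (L1) distance
    return np.exp(-gamma * cdist(X, Y, metric="cityblock"))

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
clf = SVC(kernel=laplacian_kernel)
print(cross_val_score(clf, X, y, cv=5).mean())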
Q4: Discuss the various issues of Decision tree.
Issues related to the applications of decision trees are:
1. Missing data:
a. When values have gone unrecorded, or they might be too expensive to obtain.
b. Two problems arise:
i. How to classify an object when one of the test attribute values is missing.
ii. How to modify the information gain formula when examples have unknown values for an
attribute.
2. Multi-valued attributes:
a. When an attribute has many possible values, the information gain measure gives an inappropriate
indication of the attribute's usefulness.
b. In the extreme case, we could use an attribute that has a different value for every example.
c. Then each subset of examples would be a singleton with a unique classification, so the information
gain measure would have its highest value for this attribute, even though the attribute could be irrelevant or
useless.
d. One solution is to use the gain ratio.
3. Continuous and integer valued input attributes:
a. Height and weight have an infinite set of possible values.
b. Rather than generating infinitely many branches, decision tree learning algorithms find the split point
that gives the highest information gain.
c. Efficient dynamic programming methods exist for finding good split points, but it is still the most
expensive part of real-world decision tree learning applications.
4. Continuous-valued output attributes:
a. If we are trying to predict a numerical value, such as the price of a work of art, rather than discrete
classifications, then we need a regression tree.
b. Such a tree has a linear function of some subset of numerical attributes, rather than a single value at
each leaf.
c. The learning algorithm must decide when to stop splitting and begin applying linear regression using
the remaining attributes.
Q5: Explain instance based learning with representation?
1. Instance-Based Learning (IBL) is an extension of nearest neighbour or K-NN classification algorithms.
2. IBL algorithms do not maintain a set of abstractions of model created from the instances.
3. The K-NN algorithms have large space requirement.
4. They also extend it with a significance test to work with noisy instances, since a lot of real-life datasets have
noisy training instances and K-NN algorithms do not work well with noise.
5. Instance-based learning is based on the memorization of the dataset.
6. The number of parameters is unbounded and grows with the size of the data.
7. The classification is obtained through memorized examples.
8. The cost of the learning process is 0, all the cost is in the computation of the prediction.
9. This kind of learning is also known as lazy learning.
Following are the instance based learning representation:
Instance-based representation (1):
1. The simplest form of learning is plain memorization.
2. This is a completely different way of representing the knowledge extracted from a set of instances.
Just store the instances themselves and operate by relating new instances whose class is unknown to
existing ones whose class is known.
3. Instead of creating rules, work directly from the examples themselves.
Instance-based representation (2):
1. Instance-based learning is lazy, deferring the real work as long as possible.
2. In instance-based learning, each new instance is compared with existing ones using a distance metric,
and the closest existing instance is used to assign the class to the new one. This is also called the
nearest-neighbour classification method.
3. Sometimes more than one nearest neighbour is used, and the majority class of the closest k-nearest
neighbours is assigned to the new instance. This is termed the k-nearest neighbour method.
Instance-based representation (3):
1. When computing the distance between two examples, the standard Euclidean distance may be used for numeric attributes.
2. For nominal attributes, a distance of 0 is assigned if the values are identical; otherwise the distance is 1.
3. Some attributes will be more important than others, so some kind of attribute weighting is needed. Getting
suitable attribute weights from the training set is a key problem.
4. It may not be necessary, or desirable, to store all the training instances.
Instance-based representation (4):
1. Generally some regions of attribute space are more stable with regard to class than others, and just a
few examples are needed inside stable regions.
2. An apparent drawback to instance-based representation is that they do not make explicit the structures
that are learned.
Q6: How Locally Weighted Regression is different from Radial Basis function networks?
1. Model-based methods, such as neural networks and the mixture of Gaussians, use the data to build a
parameterized model.
2. After training, the model is used for predictions and the data are generally discarded.
3. In contrast, memory-based methods are non-parametric approaches that explicitly retain the training data,
and use it each time a prediction needs to be made.
4. Locally Weighted Regression (LWR) is a memory-based method that performs a regression around a point
of interest using only training data that are local to that point.
5. LWR has been shown to be suitable for real-time control, for example by constructing an LWR-based system
that learned a difficult juggling task.
6. The LOESS (Locally Estimated Scatterplot Smoothing) model performs a linear regression on points in the
data set, weighted by a kernel centered at x.
7. The kernel shape is a design parameter: the original LOESS model uses a tricubic kernel, while a Gaussian
kernel is also commonly used:
hi(x) = h(x − xi) = exp(−k(x − xi)²)
Where k is a smoothing parameter.
8. For brevity, we drop the argument x from hi(x) and define n = ∑i hi. The estimated (weighted) means and
covariances can then be written as weighted sample statistics, for example:
µx = (∑i hi xi) / n,  µy = (∑i hi yi) / n,
σx² = ∑i hi (xi − µx)² / n,  σxy = ∑i hi (xi − µx)(yi − µy) / n
Radial Basis Function (RBF):
1. A Radial Basis Function (RBF) is a function that assigns a real value to each input from its domain (it is a
real-valued function), and the value produced by the RBF is always an absolute value, i.e., it is a measure of
distance and cannot be negative.
2. Euclidean distance (the straight-line distance) between two points in Euclidean space is used.
3. Radial basis functions are used to approximate functions, just as neural networks act as function
approximators.
4. The following sum represents a radial basis function network:
y(x) = ∑i ωi φ(||x − xi||)
5. The radial basis functions act as activation functions.
6. The approximant y(x) is differentiable with respect to the weights which are learned using iterative update
methods common among neural networks.
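A minimal numpy sketch of locally weighted regression as described above: for each query point, the training points are weighted with a Gaussian kernel and a local linear model is fitted by weighted least squares. The data and the smoothing parameter k are invented for illustration.

# Minimal sketch (assumed data): locally weighted linear regression around a query point.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 60)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)     # noisy target values
k = 2.0                                                # smoothing parameter (assumed)

def lwr_predict(x0):
    w = np.exp(-k * (x0 - x) ** 2)                     # kernel weights of the training points
    X = np.column_stack([np.ones_like(x), x])          # local linear model: y = b0 + b1 * x
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted least squares
    return beta[0] + beta[1] * x0

print([round(lwr_predict(q), 3) for q in (2.0, 5.0, 8.0)])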
Q7: Explain KNN Algorithm with suitable example.
1. The KNN classification algorithm is used to decide which class a new instance should belong to.
2. When K = 1, we have the nearest neighbour algorithm.
3. KNN classification is incremental.
4. KNN classification does not have a training phase, all instances are stored. Training uses indexing to find
neighbours quickly.
5. During testing, KNN classification algorithm has to find K-nearest neighbours of a new instance. This is
time consuming if we do exhaustive comparison.
6. K-nearest neighbours use the local neighborhood to obtain a prediction.
Algorithm: Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array of data points array. This means each element of this array represents
a tuple (x, y).
2. For i = 0 to m − 1:
Calculate the Euclidean distance d(arr[i], p).
3. Make set S of K smallest distances obtained. Each of these distances corresponds to an already classified
data point.
4. Return the majority label among S (a Python sketch of this procedure is given below).
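A minimal Python sketch of the brute-force procedure above; the training points and labels are invented for illustration.

# Minimal sketch: brute-force K-nearest-neighbour classification.
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, p, K=3):
    # Step 2: Euclidean distance from the unknown point p to every stored sample
    dists = np.linalg.norm(train_X - p, axis=1)
    # Step 3: indices of the K smallest distances
    nearest = np.argsort(dists)[:K]
    # Step 4: majority label among the K neighbours
    return Counter(train_y[nearest]).most_common(1)[0][0]

train_X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 7], [6, 7]])
train_y = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(train_X, train_y, np.array([2, 2]), K=3))   # expected: "A"
print(knn_predict(train_X, train_y, np.array([6, 5]), K=3))   # expected: "B"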
Q8: Differentiate between Lazy and Eager Learning.
Timing of Model Building: Lazy Learning builds the model during prediction; Eager Learning builds the model before prediction.
Data Dependency: Lazy Learning relies heavily on the training data during prediction; Eager Learning is less dependent on the training data during prediction.
Computational Efficiency: Lazy Learning is faster during training but slower during prediction due to real-time model building; Eager Learning is slower during training but faster during prediction due to the pre-built model.
Example: Lazy Learning: k-Nearest Neighbours (KNN); Eager Learning: Decision Trees, Support Vector Machines (SVM), Neural Networks.
Memory Usage: Lazy Learning uses less memory during training but more during prediction; Eager Learning uses more memory during training but less during prediction.
Q9: Illustrate the operation of the ID3 training example. Consider the information gain as attribute
measure.
ID3 (Iterative Dichotomiser 3)
1. ID3 is an algorithm used to generate a decision tree from a dataset.
2. To construct a decision tree, ID3 uses a top-down, greedy search through the given sets,
where each attribute at every tree node is tested to select the attribute that is best for
classification of a given set.
3. Therefore, the attribute with the highest information gain can be selected as the test
attribute of the current node.
4. In this algorithm, small decision trees are preferred over the larger ones. It is a heuristic
algorithm because it does not construct the smallest tree.
5. For building a decision tree model, ID3 only accepts categorical attributes. Accurate results
are not given by ID3 when there is noise and when it is serially implemented.
6. Therefore data is preprocessed before constructing a decision tree.
7. For constructing a decision tree, information gain is calculated for each and every attribute,
and the attribute with the highest information gain becomes the root node. The remaining possible
values are denoted by arcs.
8. All the possible outcome instances are examined to check whether they belong to the same
class or not. For instances of the same class, a single name is used to denote the class;
otherwise the instances are classified on the basis of the splitting attribute.
Procedure for ID3 Algorithm:
ID3 (Examples, Target Attribute, Attributes)
1. Create a Root node for the tree.
2. If all Examples are positive, return the single-node tree Root, with label = +
3. If all Examples are negative, return the single-node tree Root, with label = −
4. If Attributes is empty, return the single-node tree Root, with label = most common value of
the target attribute in Examples.
5. Otherwise begin
a. A ← the attribute from Attributes that best classifies Examples
b. The decision attribute for Root ← A
c. For each possible value Vi of A,
i. Add a new tree branch below Root, corresponding to the test A = Vi
ii. Let Example Vi be the subset of Examples that have value Vi for A
iii. If Example Vi is empty
a. Then below this new branch add a leaf node with label = most common value
of Target Attribute in Examples
b. Else below this new branch add the sub-tree ID3(Example Vi,
Target Attribute, Attributes − {A})
6. End
7. Return root.
=====================
Q10: What are the steps used for making Decision Tree?
Steps used for making decision tree are:
1. Get list of rows (dataset) which are taken into consideration for making decision tree (recursively at each
node).
2. Calculate uncertainty of our dataset or Gini impurity or how much our data is mixed up etc.
3. Generate list of all question which needs to be asked at that node
4. Partition rows into True rows and False rows based on each question asked.
5. Calculate information gain based on Gini impurity and partition of data from previous step.
6. Update highest information gain based on each question asked.
7. Update question based on information gain (higher information gain).
8. Divide the node on question. Repeat again from step 1 until we get pure node (leaf nodes).
Q11: Explain Attribute Selection Measures used in Decision Tree.
1. Entropy:
i. Entropy is a measure of uncertainty associated with a random variable.
ii. The entropy increases with the increase in uncertainty or randomness and decreases with a decrease in
uncertainty or randomness.
iii. The value of entropy ranges from 0-1.
Entropy(D) = − ∑i=1..m pi log2(pi)
Where pi is the non-zero probability that an arbitrary tuple in D belongs to class Ci, and is estimated by |Ci,D| / |D|.
iv. A log function of base 2 is used because the entropy is encoded in bits 0 and 1.
2. Information gain:
i. ID3 uses information gain as its attribute selection measure.
ii. Information gain is the difference between the original information requirement (i.e., based on just the proportion
of classes) and the new requirement (i.e., obtained after partitioning on A).
Gain(D, A) = Entropy(D) − ∑j=1..v (|Dj| / |D|) · Entropy(Dj)
Where,
D: A given data partition,
A: Attribute,
v: the number of distinct values of attribute A on which we partition the tuples in D
iii. D is split into v partitions or subsets (D1, D2, ..., Dv), where Dj contains those tuples in D that have outcome aj of A.
iv. The attribute that has the highest information gain is chosen.
3. Gain ratio:
i. The information gain measure is biased towards tests with many outcomes.
ii. That is, it prefers to select attributes having a large number of values.
iii. For example, partitioning on an attribute that acts as a unique identifier gives a pure partition for each
tuple, so the information gain by partitioning is maximal, but such a partitioning is useless for classification.
iv. Gain ratio is an attribute selection measure that is an extension of information gain.
v. It differs from information gain, which measures the information with respect to a classification acquired
based on the same partitioning.
vi. Gain ratio applies a kind of normalization to information gain using a split information value, defined as:
SplitInfoA(D) = − ∑j=1..v (|Dj| / |D|) log2(|Dj| / |D|)
The gain ratio is then GainRatio(A) = Gain(A) / SplitInfoA(D), and the attribute with the maximum gain
ratio is selected as the splitting attribute.
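As a hypothetical illustration: if attribute A splits 14 tuples into partitions of sizes 4, 6 and 4, then
SplitInfoA(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) ≈ 1.557;
if Gain(A) = 0.029, the gain ratio is 0.029 / 1.557 ≈ 0.019.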
Q12: Explain Inductive Bias with Inductive System.
Inductive bias:
1. Inductive bias refers to the restrictions that are imposed by the assumptions made in the learning
method.
2. For example, assuming that the solution to the problem of road safety can be expressed as a
conjunction of a set of eight concepts.
3. This does not allow for more complex expressions that cannot be expressed as a conjunction.
4. This inductive bias means that there are some potential solutions that we cannot explore, because they are
not contained within the version space we examine.
5. In order to have an unbiased learner, the version space would have to contain every hypothesis that could
possibly be expressed.
6. The solution that the learner produced could never be more general than the complete set of
training data.
7. In other words, it would be able to classify data that it had previously seen (as the rote learner
could) but would be unable to generalize in order to classify new, unseen data.
8. The inductive bias of the candidate elimination algorithm is that it is only able to classify a new
piece of data if all the hypotheses contained within its version space give data the same
classification.
9. Hence, the inductive bias does impose a limitation on the learning method.
Inductive system:
[Block diagram] Inductive system: the training examples and a new instance are input to the candidate
elimination algorithm, which uses hypothesis space H to output a classification of the new instance (or
"do not know").
Q13: Explain Inductive Learning Algorithm. Which learning algorithms are used with inductive bias?
Inductive learning algorithm
Step 1: Divide the table T containing m examples into n sub-tables (t1, t2, ...tn). One table for each possible
value of the class attributes (repeat steps 2-8 for each sub-table).
Step 2: Initialize the attribute combination count j = 1.
Step 3: For the sub-table on which work is going on, divide the attribute list into distinct combinations, each
combination with j distinct attributes.
Step 4: For each combination of attributes, count the number of occurrences of attribute values that appear
under the same combination of attributes in unmarked rows of the sub-table under consideration and,
at the same time, do not appear under the same combination of attributes of the other sub-tables.
Call the first combination with the maximum number of occurrences the max-combination MAX.
Step 5: If MAX == null, increase j by 1 and go to Step 3.
Step 6: Mark all rows of the sub-table being worked on, in which the values of MAX appear, as classified.
Step 7: Add a rule (IF attribute = "XYZ" THEN decision is YES/NO) to R (the rule set); its left-hand side
has the attribute names of MAX with their values separated by AND, and its right-hand side
contains the decision attribute value associated with the sub-table.
Step 8: If all rows are marked as classified, then move on to process another sub-table and go to Step 2, else,
go to Step 4. If no sub-tables are available, exit with the set of rules obtained till then.
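A rough Python sketch of the ILA steps above; the row layout (dicts with a "class" decision attribute), the rule string format and the termination guard are illustrative assumptions:

from itertools import combinations

def ila(rows, attributes, target="class"):
    rules = []
    # Step 1: one sub-table per class value.
    for cls in sorted({r[target] for r in rows}):
        sub = [dict(r, _marked=False) for r in rows if r[target] == cls]
        others = [r for r in rows if r[target] != cls]
        j = 1                                            # Step 2
        while any(not r["_marked"] for r in sub) and j <= len(attributes):
            best_combo, best_vals, best_count = None, None, 0
            # Step 3: all attribute combinations with j distinct attributes.
            for combo in combinations(attributes, j):
                # Step 4: value combinations unique to this sub-table.
                counts = {}
                for r in sub:
                    if r["_marked"]:
                        continue
                    vals = tuple(r[a] for a in combo)
                    if any(tuple(o[a] for a in combo) == vals for o in others):
                        continue
                    counts[vals] = counts.get(vals, 0) + 1
                for vals, c in counts.items():
                    if c > best_count:
                        best_combo, best_vals, best_count = combo, vals, c
            if best_combo is None:                       # Step 5: MAX is null
                j += 1
                continue
            # Step 6: mark matching rows; Step 7: add the rule.
            for r in sub:
                if tuple(r[a] for a in best_combo) == best_vals:
                    r["_marked"] = True
            cond = " AND ".join(f'{a} = "{v}"' for a, v in zip(best_combo, best_vals))
            rules.append(f"IF {cond} THEN {target} = {cls}")
    return rules

rows = [{"size": "big", "colour": "red", "class": "yes"},
        {"size": "small", "colour": "red", "class": "no"}]
print(ila(rows, ["size", "colour"]))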
Learning algorithms used with inductive bias are:
1. Rote-Learner:
i. Learning corresponds to storing each observed training example in memory.
ii. Subsequent instances are classified by looking them up in memory.
iii. If the instance is found in memory, the stored classification is returned.
iv. Otherwise, the system refuses to classify the new instance.
v. Inductive bias: There is no inductive bias.
2. Candidate-Elimination:
a. New instances are classified only in the case where all members of the current version space agree on
the classification.
b. Otherwise, the system refuses to classify the new instance.
c. Inductive bias: The target concept can be represented in its hypothesis space.
3. FIND-S:
a. This algorithm finds the most specific hypothesis consistent with the training examples.
b. It then uses this hypothesis to classify all subsequent instances.
c. Inductive bias: The target concept can be represented in its hypothesis space, and all instances are
negative instances unless the opposite is entailed by its other knowledge.
Q14: What are the Performance Dimensions used for Instance based learning system?
Performance dimension used for instance-based learning algorithm are:
1. Generality:
a. This is the class of concepts that can be described by the algorithm's representation.
b. IBL algorithms can PAC-learn any concept whose boundary is a union of a finite number of closed
hyper-curves of finite size.
2. Accuracy: This concept describes the accuracy of classification.
3. Learning rate:
a. This is the speed at which classification accuracy increases during training.
b. It is a more useful indicator of the performance of the learning algorithm than accuracy for finite-
sized training sets.
4. Incorporation costs:
a. These are incurred while updating the concept descriptions with a single training instance.
b. They include classification costs.
5. Storage requirement: This is the size of the concept description for IBL algorithms, which is defined as
the number of saved instances used for classification decisions.
Q15: Explain the Functions, Advantages and Disadvantages of Instance Based Learning.
Functions of instance-based learning are:
1. Similarity function:
i. This computes the similarity between a training instance i and the instances in the concept
description.
ii. Similarities are numeric-valued.
2. Classification function:
a. This receives the similarity function's results and the classification performance records of the
instances in the concept description.
b. It yields a classification for i.
3. Concept description updater:
a. This maintains records on classification performance and decides which instances to include in the
concept description.
b. Inputs include i, the similarity results, the classification results, and a current concept description. It
yields the modified concept description.
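A toy Python sketch of these three components; the case structure, the Euclidean similarity and the "store on misclassification" update policy are assumptions for illustration:

import math

def similarity(a, b):
    # Numeric-valued similarity: negative Euclidean distance (larger = more similar).
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a["features"], b["features"])))

def classify(instance, concept_description):
    # Classification function: label of the most similar stored instance.
    best = max(concept_description, key=lambda c: similarity(instance, c))
    return best["label"]

def update(instance, prediction, concept_description):
    # Concept description updater: store the instance if it was misclassified.
    if prediction != instance["label"]:
        concept_description.append(instance)
    return concept_description

cd = [{"features": (0.0, 0.0), "label": "A"}, {"features": (1.0, 1.0), "label": "B"}]
new = {"features": (0.9, 0.8), "label": "A"}
pred = classify(new, cd)
cd = update(new, pred, cd)
print(pred, len(cd))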
Advantages of instance-based learning:
1. Learning is trivial.
2. Works efficiently.
3. Noise resistant.
4. Rich representation, arbitrary decision surfaces.
5. Easy to understand.
Disadvantages of instance-based learning:
1. Need lots of data.
2. Computational cost is high.
3. Restricted to x ∈ Rn.
4. Implicit weights of attributes (need normalization).
5. Needs large space for storage, i.e., requires large memory.
6. Expensive application time.
Q16: Explain Locally Weighted Regression.
1. Model-based methods, such as neural networks and the mixture of Gaussians, use the data to build a
parameterized model.
2. After training, the model is used for predictions and the data are generally discarded.
3. In contrast, memory-based methods are non-parametric approaches that explicitly retain the training data,
and use it each time a prediction needs to be made.
4. Locally Weighted Regression (LWR) is a memory-based method that performs a regression around a point
using only training data that are local to that point.
5. LWR has been shown to be suitable for real-time control, for example by constructing an LWR-based
system that learned a difficult juggling task.
6. The LOESS (Locally Estimated Scatterplot Smoothing) model performs a linear regression on points in the
data set, weighted by a kernel centered at x.
7. The kernel shape is a design parameter; the original LOESS model uses a tricubic kernel, but a Gaussian
kernel of the following form is a common alternative:
hi(x) = h(x − xi) = exp(−k (x − xi)²)
where k is a smoothing parameter that determines the width of the kernel.
8. For brevity, we drop the argument x from hi(x) and define n = ∑i hi.
We can then write the estimated (kernel-weighted) means and covariances as:
μx = (∑i hi xi) / n,   μy = (∑i hi yi) / n,
σx² = (∑i hi (xi − μx)²) / n,   σxy = (∑i hi (xi − μx)(yi − μy)) / n
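A minimal locally weighted regression sketch for a single query point, assuming a Gaussian kernel and a closed-form weighted least-squares fit (variable names and the toy data are illustrative):

import numpy as np

def lwr_predict(x_query, X, y, k=1.0):
    # Kernel weights h_i = exp(-k * (x_query - x_i)^2), local to the query point.
    h = np.exp(-k * (X - x_query) ** 2)
    W = np.diag(h)
    A = np.column_stack([np.ones_like(X), X])          # design matrix [1, x]
    # Weighted least squares: beta = (A^T W A)^(-1) A^T W y
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return beta[0] + beta[1] * x_query

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
print(lwr_predict(2.5, X, y, k=2.0))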
Q17: Explain the Architecture of Radial Basis Function Network.
1. Radial Basis Function (RBF) networks have three layers: an input layer, a hidden layer with a non-linear
RBF activation function and a linear output layer.
2. The input can be modeled as a vector of real numbers x ∈ Rn.
3. The output of the network is then a scalar function of the input vector,
Φ : Rn → R, and is given by
Φ(x) = ∑i=1..N ai ρ(‖x − ci‖)
where N is the number of neurons in the hidden layer, ci is the center vector for neuron i, and ai is the
weight of neuron i in the linear output neuron.
4. Functions that depend only on the distance from a center vector are radially symmetric about that vector.
5. In the basic form all inputs are connected to each hidden neuron.
6. The radial basis function is taken to be Gaussian:
ρ(‖x − ci‖) = exp(−β ‖x − ci‖²)
7. The Gaussian basis functions are local to the center vector in the sense that
lim‖x‖→∞ ρ(‖x − ci‖) = 0
i.e., changing parameters of one neuron has only a small effect for input values that are far away from
the center of that neuron.
8. Given certain mild conditions on the shape of the activation function, RBF networks are universal
approximators on a compact subset of Rn.
9. This means that an RBF network with enough hidden neurons can approximate any continuous function on
a closed, bounded set with arbitrary precision.
10. The parameters ai, ci, and β are determined in a manner that optimizes the fit between Φ and the data.
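A small NumPy sketch of the RBF forward pass described above; the centres, weights and β used here are arbitrary illustrative values, not learned parameters:

import numpy as np

def rbf_forward(x, centers, weights, beta=1.0):
    # phi(x) = sum_i a_i * exp(-beta * ||x - c_i||^2)
    dists_sq = np.sum((centers - x) ** 2, axis=1)
    activations = np.exp(-beta * dists_sq)
    return weights @ activations

centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])   # hidden-layer centre vectors c_i
weights = np.array([0.5, -1.0, 2.0])                        # output weights a_i
x = np.array([0.5, 0.5])
print(rbf_forward(x, centers, weights, beta=2.0))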
Q18: What are the Functions, Advantages and Disadvantages of Case Based Learning System?
Functions of case-based learning algorithm are
1. Pre-processor: This prepares the input for processing (for example, normalizing the range of numeric-
valued features to ensure that they are treated with equal importance by the similarity function, formatting the
raw input into a set of cases).
2. Similarity:
a. This function assesses the similarities of a given case with the previously stored cases in the
concept description.
b. Assessment may involve explicit encoding and/or dynamic computation.
c. CBL similarity functions find a compromise along the continuum between these extremes.
3. Prediction: This function inputs the similarity assessments and generates a prediction for the value of the
given case's goal feature (i.e. a classification when it is symbolic-valued).
4. Memory updating: This updates the stored case-base, such as by modifying or abstracting previously
stored cases, forgetting cases presumed to be noisy, or updating a feature's relevance weight setting.
Advantages of case-based learning algorithm:
1. Case-Based Learning (CBL) algorithms contain an input as a sequence of training cases and an output
concept description, which can be used to generate predictions of goal feature values for subsequently
presented cases.
2. The primary component of the concept description is case-base, but almost all CBL algorithms maintain
additional related information for the purpose of generating accurate predictions (for example, settings for
feature weights).
3. Current CBL algorithms assume that cases are described using a feature value representation, where features
are either predictor or goal features.
4. CBL algorithms are distinguished by their processing behavior.
Disadvantages of case-based learning algorithm:
1. They are computationally expensive because they save and compute similarities to all training cases.
2. They are intolerant of noise and irrelevant features.
3. They are sensitive to the choice of the algorithm's similarity function.
4. There is no simple way they can process symbolic valued feature values.
Q19: Describe Case Based Learning Cycle with Limitations, Benefits and Applications.
Case-based learning algorithm processing stages are:
1. Case retrieval: After the problem situation has been assessed, the best matching case is searched in the
case-base and an approximate solution is retrieved.
2. Case adaptation: The retrieved solution is adapted to fit better in the new problem.
3. Solution evaluation:
a. The adapted solution can be evaluated either before the solution is applied to the problem or after the
solution has been applied.
b. In any case, if the accomplished result is not satisfactory, the retrieved solution must be adapted again or
more cases should be retrieved.
4. Case-base updating: If the solution was verified as correct, the new case may be added to the case base.
The stages of the CBL working cycle are:
1. Retrieve the most similar case.
2. Reuse the case to attempt to solve the current problem.
3. Revise the proposed solution if necessary.
4. Retain the new solution as part of a new case.
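A toy Python sketch of the retrieve/reuse/revise/retain cycle; the numeric case structure, the distance-based retrieval and the scaling-based adaptation are assumptions, not a real CBR library:

def retrieve(problem, case_base):
    # Return the stored case whose problem is most similar (smallest distance).
    return min(case_base, key=lambda c: abs(c["problem"] - problem))

def reuse(case, problem):
    # Adapt the retrieved solution to the new problem (here: simple scaling).
    return case["solution"] * problem / case["problem"]

def revise(solution, evaluate):
    # Evaluate the proposed solution; reject it if it is unsatisfactory.
    return solution if evaluate(solution) else None

def retain(problem, solution, case_base):
    # Store the verified solution as a new case.
    case_base.append({"problem": problem, "solution": solution})

case_base = [{"problem": 2.0, "solution": 10.0}, {"problem": 5.0, "solution": 26.0}]
new_problem = 4.0
case = retrieve(new_problem, case_base)
proposed = reuse(case, new_problem)
verified = revise(proposed, evaluate=lambda s: s > 0)
if verified is not None:
    retain(new_problem, verified, case_base)
print(proposed, len(case_base))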
The benefits of CBL as a lazy problem-solving method are:
1. Ease of knowledge elicitation:
a. Lazy methods can utilize easily available case or problem instances instead of rules that are difficult to
extract.
b. So, classical knowledge engineering is replaced by case acquisition and structuring.
2. Absence of problem-solving bias:
a. Cases can be used for multiple problem-solving purposes, because they are stored in a raw form.
b. This is in contrast to eager methods, which can be used merely for the purpose for which the knowledge
has already been compiled.
3. Incremental learning:
a. A CBL system can be put into operation with a minimal set of solved cases furnishing the case base.
b. The case base is then filled with new cases over time, increasing the system's problem-solving ability.
c. Besides augmentation of the case base, new indexes and cluster categories can be created and the
existing ones can be changed.
d. Eager methods, in contrast, require a special training period whenever knowledge extraction
(knowledge generalization) is performed.
e. Hence, dynamic on-line adaptation in a non-rigid environment is possible.
4. Suitability for complex and not-fully formalized solution spaces:
a. CBL systems can be applied to an incomplete model of the problem domain; implementation involves
both identifying relevant case features and furnishing a (possibly partial) case base with proper cases.
b. Lazy approaches are more appropriate for complex solution spaces than eager approaches, which replace
the presented data with abstractions obtained by generalization.
5. Suitability for sequential problem solving:
a. Sequential tasks, like those encountered in reinforcement learning problems, benefit from the storage of
history in the form of a sequence of states or procedures.
b. Such storage is facilitated by lazy approaches.
6. Ease of explanation:
a. The results of a CBL system can be justified based upon the similarity of the current problem to the
retrieved case.
b. Because CBL results are easily traceable to precedent cases, it is also easier to analyze failures of the system.
7. Ease of maintenance: This is particularly due to the fact that CBL systems can adapt to many changes in
the problem domain and the relevant environment merely by acquiring new cases.
Limitations of CBL are:
1. Handling large case bases:
a. High memory/ storage requirements and time-consuming retrieval accompany CBL systems utilizing
large case bases.
b. Although the order of both is linear with the number of cases, these problems usually lead to increased
construction costs and reduced system performance.
c. These problems are less significant as the hardware components become faster and cheaper.
2. Dynamic problem domains:
a. CBL systems may have difficulties in handling dynamic problem domains, where they may be unable
to follow a shift in the way problems are solved, since they are strongly biased towards what has
already worked.
b. This may result in an outdated case base.
3. Handling noisy data:
a. Parts of the problem situation may be irrelevant to the problem itself.
b. If such noise in a problem situation presented to a CBL system is not successfully assessed and filtered
out, the same problem may be stored unnecessarily numerous times in the case base, because the noise
makes the situations appear different.
c. In turn this implies inefficient storage and retrieval of cases.
4. Fully automatic operation:
a. In a CBL system, the problem domain is not fully covered.
b. Hence, some problem situations can occur for which the system has no solution.
c. In such situations, CBL systems expect input from the user.
Q20: What are the Advantages and Disadvantages of KNN Algorithm?
Advantages of KNN algorithm
1. No training period:
a. KNN is called a lazy learner (instance-based learning).
b. It does not learn anything in the training period. It does not derive any discriminative function from
the training data.
c. In other words, there is no training period for it. It stores the training dataset and learns from it only at
the time of making real-time predictions.
d. This makes the training phase of KNN much faster than that of algorithms which require explicit
training, for example, SVM, linear regression, etc.
2. Since the KNN algorithm requires no training before making predictions, new data can be added seamlessly
which will not impact the accuracy of the algorithm.
3. KNN is very easy to implement. There are only two parameters required to implement KNN i.e., the value
of K and the distance function (for example, Euclidean).
Disadvantages of KNN:
1. Does not work well with large datasets: in large datasets, the cost of calculating the distance between the
new point and each existing point is huge, which degrades the performance of the algorithm.
2. Does not work well with high dimensions: the KNN algorithm does not work well with high-dimensional
data because, with a large number of dimensions, it becomes difficult for the algorithm to calculate the
distance in each dimension.
3. Need feature scaling: We need to do feature scaling (standardization and normalization) before applying
KNN algorithm to any dataset. If we do not do so, KNN may generate wrong predictions.
4. Sensitive to noisy data, missing values and outliers: KNN is sensitive to noise in the dataset. We need to
manually impute missing values and remove outliers.
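A minimal KNN classifier sketch (Euclidean distance, majority vote) illustrating the lazy, prediction-time computation described above; the toy data and the choice of k are arbitrary illustrative assumptions:

import math
from collections import Counter

def knn_predict(query, X, y, k=3):
    # Lazy learning: all distance computation happens at prediction time.
    dists = [(math.dist(query, x), label) for x, label in zip(X, y)]
    nearest = sorted(dists, key=lambda d: d[0])[:k]
    # Majority vote among the k nearest neighbours.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

X = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.0), (4.2, 3.9)]
y = ["A", "A", "B", "B"]
print(knn_predict((1.1, 1.0), X, y, k=3))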
=====================
MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY
NH-58, Delhi-Roorkee Highway, Baghpat Road, Meerut – 250 005 U.P.
ODD Semester 2024-25
Course/Branch : B.Tech-CSE Semester :V
Subject : Machine Learning Techniques Subject Code : BCS055
Q1: Explain different layers of CNN (Convolutional network) with suitable examples.
Q2: What is Self-Organizing Map (SOM)? Explain the stages and steps in SOM Algorithm.
Q3: Explain Gradient Descent and delta rule.
Q4: What are Neural Networks? What are the types of Neural Networks?
Q5: Discuss the benefits of Artificial Neural Networks.
Q6: Explain Back propagation Algorithm.
Q7: Write a short note on Unsupervised Learning.
Q8: Discuss the role of Activation function in neural networks. Also discuss various types of activation
functions with formulas and diagrams.
Q9: Describe Artificial Neural Networks (ANN) with different Layers and its characteristics.
Q10: What are the Advantages and Disadvantages of ANN? Explain the application areas of ANN?
Q11: Explain the Architecture and different types of Neuron.
Q12: Explain different types of Gradient Descent with advantages and disadvantages.
Q13: Explain generalized Delta Learning Rule.
Q14: Explain Perceptron with single Flow Graph.
Q15: State and Prove Perceptron Convergence Theorem.
Q16: Explain Multilayer Perceptron with its Architecture and Characteristics.
Q17: Discuss selection of various parameters in Back propagation Neural Network (BPN) and its effects.
Q18: Describe the Architecture, Limitations, Advantages and Disadvantages of Deep Learning with various
Applications.
Q19: Explain 1D and 2D Convolutional Neural Network.
Q20: Describe Diabetic Retinopathy on the basis of Deep Learning.
=====================
MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY
NH-58, Delhi-Roorkee Highway, Baghpat Road, Meerut – 250 005 U.P.
ODD Semester 2024-25
Course/Branch : B. Tech -CSE Semester :V
Subject : Machine Learning Techniques Subject Code : BCS055
Q1: Explain Genetic Algorithm with flow chart.
Q2: What is Reinforcement learning? Describe briefly Reinforcement learning.
Q3: Explain Markov Decision Process.
Q4: Explain GA (Genetic algorithm) cycle of reproduction?
Q5: What are advantages and disadvantages of Genetic algorithm?
Q6: Differentiate between Q Learning and Machine Learning.
Q7: Explain various types of reinforcement learning techniques with suitable example.
Q8: Differentiate between Reinforcement and Supervised Learning.
Q9: What are the different types and elements of Reinforcement Learning?
Q10: Describe briefly different learning task used in Machine Learning
Q11: Explain approaches used to implement Reinforcement Learning Algorithm.
Q12: Describe Learning Models, challenges and applications of Reinforcement Learning.
Q13: Describe Q-Learning Algorithm Process and steps involved in Deep Q-Learning Network.
Q14: Explain different phases of Genetic Algorithm with advantages and disadvantages.
Q15: Write Short notes on Procedures and Representations of Genetic Algorithm.
Q16: Explain different types of Encoding and benefits of Genetic Algorithm.
Q17: Explain different methods of selection in Genetic Algorithm in order to select a population for
next generation.
Q18:
Q19:
Q20:
=====================