ML Mid 1 Scheme

2 Q. Define Supervised Learning. Explain the different types of supervised learning and its applications. [10 Marks]
Supervised Learning is a type of machine learning where a model is trained on labeled data, meaning that each training example is paired with an output label. The goal is for the model to learn the relationship between the input features and the corresponding labels so that it can make accurate predictions on new, unseen data. [2 Marks]
Supervised machine learning can be classified into two types of problems, which are given
below:
Classification
Regression
Classification:
Classification is a type of supervised learning where a target feature is predicted for test data based on the information given by training data. The target categorical feature is known as the class.
Classification algorithms are used to solve classification problems in which the output variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. [3 Marks]
Some real-world examples of classification algorithms are spam detection, email filtering, etc.
Some popular classification algorithms are given below:
Random Forest Algorithm
Decision Tree Algorithm
Logistic Regression Algorithm
Support Vector Machine Algorithm
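A minimal sketch of such a classification workflow is given below using scikit-learn; the Iris data and the choice of logistic regression are illustrative assumptions, not part of the scheme.

```python
# Illustrative classification sketch (dataset and model are assumed choices).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                # features and categorical class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000)          # one of the algorithms listed above
clf.fit(X_train, y_train)                        # learn the feature-to-class mapping
print(accuracy_score(y_test, clf.predict(X_test)))  # accuracy on unseen data
```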
Regression:
Regression algorithms are used to solve regression problems in which there is a relationship between the input and output variables. These are used to predict continuous output variables, such as market trends, weather forecasts, etc. [3 Marks]
Moreover, it is a type of supervised learning that learns from labelled data sets to predict a continuous output for new data.
Some popular regression algorithms are given below:
o Linear Regression Algorithm
o Multivariate Regression Algorithm
(Note: logistic regression, despite its name, is a classification algorithm and is already listed above.)
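As a minimal sketch (the numeric data points below are made up for this example), a linear regression model predicting a continuous output might look like this:

```python
# Illustrative regression sketch; the data points are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # single input feature
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])            # continuous output variable

reg = LinearRegression().fit(X, y)                 # fit the input-output relationship
print(reg.predict([[6.0]]))                        # predict a continuous value
```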
Applications of Supervised Learning [2 Marks]
Some common applications of Supervised Learning are given below:
Image Segmentation - Supervised Learning algorithms are used in image
segmentation. In this process, image classification is performed on different image
data with pre-defined labels.
Medical Diagnosis - Supervised algorithms are also used in the medical field for diagnosis purposes. This is done by using medical images and past data labelled with disease conditions. With such a process, the machine can identify a disease for new patients.
Fraud Detection - Supervised Learning classification algorithms are used for identifying fraudulent transactions, fraudulent customers, etc. This is done by using historic data to identify the patterns that can lead to possible fraud.
Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent to
the spam folder.
Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can
be done using the same, such as voice-activated passwords, voice commands, etc.
3 Q. Examine whether missing values in features will impact the learning activity. If so, how can they be addressed?
Yes, missing values in features can indeed have a significant impact on the learning activity, especially in machine learning models.
In a data set, one or more data elements may have missing values in multiple records. This can be caused by omission on the part of the person who is collecting the sample data. [2 Marks]
There are multiple strategies to handle missing values of data elements. Some of those strategies are discussed below. [8 Marks]
1. Eliminate records having a missing value of data elements:
In case the proportion of data elements having missing values is within a tolerable limit, a simple but effective approach is to remove the records having such data elements. This is possible if the amount of data left after removing the affected records is sizeable.
For example, in the case of the Auto MPG data set, the value of the attribute 'horsepower' is missing in only 6 out of 398 records. If we get rid of those 6 records, we will still have 392 records, which is definitely a substantial number. So we can very well eliminate the records and keep working with the remaining data set.
However, this will not be possible if the proportion of records having data elements with missing values is really high, as that will reduce the power of the model.
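A short pandas sketch of this elimination strategy follows; the file name and the use of '?' as the missing-value marker are assumptions about how the Auto MPG data is stored.

```python
# Sketch of eliminating records with missing values (file name is assumed).
import pandas as pd

df = pd.read_csv("auto-mpg.csv", na_values="?")   # '?' assumed to mark missing values
print(df["horsepower"].isna().sum())              # e.g. 6 of 398 records missing
df_clean = df.dropna(subset=["horsepower"])       # drop only the affected records
print(len(df_clean))                              # 392 records remain
```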
2. Imputing missing values:
Imputation is a method to assign a value to the data elements having missing values.
The mean, median, or mode is the value most frequently assigned.
For quantitative attributes, all missing values are assigned with the mean, median, or
mode of the remaining values under the same attribute.
For qualitative attributes, all missing values are assigned by the mode of all
remaining values of the same attribute.
For example, in the context of the attribute 'horsepower' of the Auto MPG data set, since the attribute is quantitative, we take the mean or median of the remaining data element values and assign that to all data elements having a missing value. So we may take the mean, which is 104.47, and assign it to all six missing data elements.
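A corresponding imputation sketch, under the same assumptions about the file as above:

```python
# Sketch of imputing missing values with the mean (quantitative attribute).
import pandas as pd

df = pd.read_csv("auto-mpg.csv", na_values="?")
mean_hp = df["horsepower"].mean()                    # ~104.47 per the text
df["horsepower"] = df["horsepower"].fillna(mean_hp)  # assign the mean to all gaps
# For a qualitative attribute, the mode would be used instead, e.g.:
# df["origin"] = df["origin"].fillna(df["origin"].mode()[0])
```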
3. Estimate missing values:
If there are data points similar to the ones with missing attribute values, then the attribute values from those similar data points can be used in place of the missing values.
For finding similar data points or observations, a distance function can be used.
For example, let's assume that the weight of a Russian student having age 12 years and height 5 ft. is missing. Then the weight of any other Russian student having age close to 12 years and height close to 5 ft. can be assigned.
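scikit-learn's KNNImputer implements exactly this idea, filling a missing value from the most similar records found with a distance function; the toy numbers below are invented to mirror the student example.

```python
# Sketch of estimating a missing value from similar data points (invented numbers).
import numpy as np
from sklearn.impute import KNNImputer

# Columns: age (years), height (ft), weight (kg); the first weight is missing.
X = np.array([[12.0, 5.0, np.nan],
              [12.0, 5.1, 41.0],
              [11.5, 4.9, 39.0],
              [16.0, 5.8, 60.0]])

imputer = KNNImputer(n_neighbors=2)   # use the 2 most similar students
print(imputer.fit_transform(X))       # missing weight estimated as ~40.0
```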
4 Q. Explain the box plot and apply the concept of a box plot to the dataset 100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110 to identify any outliers.
Box Plot: It is a type of chart that depicts a group of numerical data through their quartiles. It
is a simple way to visualize the shape of our data. It makes comparing characteristics of data
between categories very easy.
Components of a box plot
A box plot gives a five-number summary of a set of data, which is: [4 Marks]
Minimum – It is the minimum value in the dataset excluding the outliers.
First Quartile (Q1) – 25% of the data lies below the First (lower) Quartile.
Median (Q2) – It is the mid-point of the dataset. Half of the values lie below it and
half above.
Third Quartile (Q3) – 75% of the data lies below the Third (Upper) Quartile.
Maximum – It is the maximum value in the dataset excluding the outliers.
The area inside the box (50% of the data) is known as the Inter Quartile
Range. The IQR is calculated as –
IQR = Q3-Q1
Outliers are the data points below and above the lower and upper limit. The lower
and upper limit is calculated as –
Lower Limit = Q1 - 1.5*IQR
Upper Limit = Q3 + 1.5*IQR
The values below and above these limits are considered outliers, and the minimum and maximum values are then calculated from the points which lie within the lower and upper limits.
How to create a box plot
Let us take a sample data to understand how to create a box plot.
Here are the runs scored by a cricket team in a league of 12 matches –
100,120,110,150,110,140,130,170,120,220,140,110.
To draw a box plot for the given data first we need to arrange the data in ascending
order and then find the minimum, first quartile, median, third quartile and the
maximum.
Ascending Order -
100,110,110,110,120,120,130,140,140,150,170,220
Median (Q2) = (120+130)/2 = 125, since there is an even number of values.
To find the First Quartile we take the first six values and find their median.
Q1 = (110+110)/2 = 110
For the Third Quartile, we take the next six and find their median.
Q3 = (140+150)/2 = 145
Note: If the total number of values is odd, then we exclude the median while calculating Q1 and Q3. Here, since there were two central values, we included them. [6 Marks]
Now, we need to calculate the Inter Quartile Range.
IQR = Q3-Q1 = 145-110 = 35
We can now calculate the Upper and Lower Limits to find the minimum and
maximum values and also the outliers if any.
Lower Limit = Q1-1.5*IQR = 110-1.5*35 = 57.5
Upper Limit = Q3+1.5*IQR = 145+1.5*35 = 197.5
So the minimum and maximum values within the range [57.5, 197.5] for our given data are:
Minimum = 100
Maximum = 170
The outlier, which lies outside this range, is:
Outlier = 220
Now we have all the information, so we can draw the box plot. [Box plot diagram]
We can see from the diagram that the median is not exactly at the centre of the box and one whisker is longer than the other. We also have one outlier.
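The worked example can be checked with a short NumPy sketch; note that np.percentile must be told to use the "midpoint" method (NumPy 1.22+; older versions call this argument interpolation) so that it matches the median-of-halves quartiles used above.

```python
# Verification sketch for the worked box plot example above.
import numpy as np

runs = np.array([100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110])
q1 = np.percentile(runs, 25, method="midpoint")   # 110.0
q3 = np.percentile(runs, 75, method="midpoint")   # 145.0
iqr = q3 - q1                                     # 35.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # 57.5 and 197.5
print(runs[(runs < lower) | (runs > upper)])      # [220] -> the outlier
```

The plot itself could then be drawn with matplotlib's plt.boxplot(runs), whose default whiskers also use the 1.5 * IQR rule.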
5 Q. Explain overfitting, underfitting and the bias-variance trade-off in the context of model fitting.
Overfitting:
Overfitting occurs when our machine learning model tries to cover more than the required data points present in the given dataset.
Because of this, the model starts caching noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model.
The overfitted model has low bias and high variance.
The chances of overfitting increase the more we train our model. Overfitting is the main problem that occurs in supervised learning.
Example: the concept of overfitting can be understood from a graph of linear regression output in which the fitted curve chases every individual training point. [3 Marks]
Underfitting:
Underfitting is the opposite case: the model is too simple to capture the underlying trend, so it is unable to capture the data points present in the plot.
The underfitted model has high bias and low variance.
How to avoid underfitting:
o By increasing the training time of the model.
o By increasing the number of features.
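A small sketch (entirely synthetic data; the polynomial degrees are chosen for illustration) that contrasts an underfit and an overfit model:

```python
# Sketch contrasting underfitting and overfitting via model complexity.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)  # noisy signal

for degree in (1, 4, 15):   # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    print(degree, round(model.score(X, y), 3))  # training R^2 rises with complexity
```

The degree-15 model scores highest on the training data precisely because it has cached the noise; on unseen data its error would typically be the worst of the three.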
Bias-variance trade-off:
In supervised learning, the class value assigned by the learning model built based on the training data may differ from the actual class value. This error in learning can be of two types: [3 Marks]
o errors due to 'bias'
o errors due to 'variance'
Errors due to ‘Bias’
Errors due to bias arise due to underfitting of the model. Parametric models generally
have high bias making them easier to understand/interpret and faster to learn. These
algorithms have a poor performance on data sets, which are complex in nature and do
not align with the simplifying assumptions made by the algorithm.
Underfitting results in high bias.
Errors due to ‘Variance’
Errors due to variance occur from differences in the training data sets used to train the model. In the case of overfitting, since the model closely matches the training data, even a small difference in the training data gets magnified in the model.
So, the problems in training a model can happen because either (a) the model is too simple and hence fails to interpret the data properly, or (b) the model is extremely complex and magnifies even small differences in the training data.
As is quite understandable:
o Increasing the bias will decrease the variance, and
o Increasing the variance will decrease the bias
A common way to visualise the trade-off is a bulls-eye diagram: the centre of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions get worse and worse.
Why is there a bias-variance trade-off?
If our model is too simple and has very few parameters, then it may have high bias and low variance. On the other hand, if our model has a large number of parameters, then it is going to have high variance and low bias. So we need to find the right balance without overfitting or underfitting the data.
This trade-off in complexity is why there is a trade-off between bias and variance. An algorithm can't be more complex and less complex at the same time. [1 Mark]
To build a good model, we need to find a good balance between bias and variance
such that it minimizes the total error.
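The total error referred to here is commonly written as the standard decomposition (stated here for completeness):

Total Error = Bias^2 + Variance + Irreducible Error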
The balance between bias and variance can be adjusted in specific algorithms by modifying parameters, as seen in the following examples (see the sketch after this list):
o For k-nearest neighbors, a low bias and high variance can be corrected by increasing the value of k, which increases the bias and decreases the variance.
o For support vector machines, a low bias and high variance can be altered by decreasing the C (regularisation) parameter, which increases the bias but decreases the variance.
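A quick sketch of the k-nearest-neighbors example (the Iris data and the values of k are assumed for illustration); the spread of the cross-validation scores gives a rough feel for the variance:

```python
# Sketch: increasing k in k-NN raises bias and lowers variance.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 5, 25):   # small k: low bias / high variance; large k: the reverse
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, round(scores.mean(), 3), round(scores.std(), 3))
```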
2. Wrapper approach:
In the wrapper approach, identification of the best feature subset is done using the induction algorithm as a black box.
The feature selection algorithm searches for a good feature subset using the induction algorithm itself as a part of the evaluation function.
Since for every candidate subset the learning model is trained and the result is evaluated by running the learning algorithm, the wrapper approach is computationally very expensive. However, the performance is generally superior compared to the filter approach. [6 Marks]
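A wrapper-approach sketch using scikit-learn's SequentialFeatureSelector; the choice of logistic regression as the induction algorithm and the Iris data are assumptions for illustration.

```python
# Sketch of the wrapper approach: candidate subsets are scored by actually
# training and evaluating the induction algorithm (here logistic regression).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=2, cv=5)
selector.fit(X, y)              # trains the model on many candidate subsets
print(selector.get_support())   # boolean mask of the selected features
```

The repeated training inside fit() is what makes the wrapper approach computationally expensive, as noted above.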
3. Hybrid approach:
The hybrid approach takes advantage of both the filter and wrapper approaches.
A typical hybrid algorithm makes use of both the statistical tests used in the filter approach, to decide the best subsets for a given cardinality, and a learning algorithm to select the final best subset among the best subsets across different cardinalities.
4. Embedded approach:
The embedded approach is quite similar to the wrapper approach as it also uses an induction algorithm to evaluate the generated feature subsets.
However, the difference is that it performs feature selection and classification simultaneously.
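An embedded-approach sketch: L1-regularised logistic regression (an assumed example of such an algorithm) drives some coefficients to exactly zero while it trains, so selection and classification happen in a single run.

```python
# Sketch of the embedded approach: feature selection happens during fitting.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=5000)
clf.fit(X, y)                                    # selection + classification together
print((clf.coef_ != 0).sum(), "features kept")   # zero-weight features are dropped
```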
ANS: Machine Learning (ML) is a branch of artificial intelligence (AI) that enables systems to learn
from data, identify patterns, and make decisions with minimal human intervention.