ML Unit 2
Introduction….
● This structured representation of raw input data as a meaningful pattern is called
a model.
● The process of assigning and fitting a specific model to a data set is called model
training. Once the model is trained, the raw input data is summarized into an
abstracted form.
● Generalization is a term used to describe a model’s ability to react to new data.
That is, after being trained on a training set, a model can digest new data and make
accurate predictions.
● If a model has been trained too well on the training data, it will be unable to
generalize: it makes accurate predictions for the training data but inaccurate
predictions when given new data. This is called overfitting.
● Underfitting happens when a model has not been trained enough on the data; it is
not capable of making accurate predictions, even on the training data.
● If the outcome is systematically incorrect, the learning is said to have a bias.
Categories of Machine Learning Approaches
● Three broad categories of machine learning approaches used for resolving different types
of problems
1. Supervised
1. Classification
2. Regression
2. Unsupervised
1. Clustering
2. Association analysis
3. Reinforcement
1. Active
2. Passive
● For each of the cases, the model that has to be created/trained is different.
● Multiple factors play a role when we select a model for solving a machine
learning problem.
Predictive models
Classification models: Models used for the prediction of target features with
categorical values are known as classification models.
• Some of the popular classification models include
1. k-Nearest Neighbor (KNN),
2. Naive Bayes, and
3. Decision Tree.
Regression models: Models used for the prediction of the numerical value of the target
feature of a data instance are known as regression models.
• Some of the popular regression models include
1. Linear Regression and
2. Logistic Regression
Descriptive models
• Models for unsupervised learning or descriptive models are used to describe a data set or
gain insight from a data set.
• There is no target feature or single feature of interest in case of unsupervised learning.
• Based on the value of all features, interesting patterns or insights are derived about the
data set.
• Descriptive models which group together similar data instances, i.e. data instances
having similar values of the different features, are called clustering models.
• Examples of clustering include
1. Customer grouping or segmentation based on social, demographic, ethnic,
etc. factors
2. Grouping of music based on different aspects like genre, language, time
period, etc.
• The most popular model for clustering is k-Means.
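A minimal k-Means sketch, assuming scikit-learn and a synthetic data set (not from the original notes):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])      # cluster assignment of the first 10 instances
print(kmeans.cluster_centers_)  # coordinates of the 3 cluster centres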
Market Basket Analysis
• Descriptive models related to pattern discovery are used for market basket analysis
of transactional data.
• In market basket analysis, based on the purchase patterns available in the
transactional data, the possibility of purchasing one product given the purchase of
another product is determined.
• For example, transactional data may reveal a pattern that generally a customer who
purchases milk also purchases biscuit at the same time.
• This can be useful for targeted promotions or in-store setup.
• Promotions related to biscuits can be sent to customers of milk products, or vice
versa.
• Also, in the store, products related to milk can be placed close to biscuits.
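As a toy illustration (the transactions below are made up), the confidence of the rule "milk → biscuit" can be computed directly from transactional data:

# hypothetical transactional data: each set is one customer's basket
transactions = [
    {"milk", "biscuit", "bread"},
    {"milk", "biscuit"},
    {"milk", "eggs"},
    {"biscuit", "butter"},
]

milk_baskets = [t for t in transactions if "milk" in t]
both = [t for t in milk_baskets if "biscuit" in t]

# confidence of the rule milk -> biscuit:
# the fraction of milk purchases that also contain biscuits
print(len(both) / len(milk_baskets))  # 2/3, i.e. about 0.67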
Training a Model (for Supervised Learning)
• Holdout method
• Cross-validation methods
• Bootstrap sampling
• Lazy vs. Eager learner
Holdout method
• In supervised learning, a model is trained using the labelled input data.
• The test data may not be available immediately; also, the label value of the test
data is not known.
• That is the reason why a part of the input data is held back (the holdout) for
evaluation of the model.
• This subset of the input data is used as the test data for evaluating the
performance of the trained model.
• In general, 70% to 80% of the labelled input data is used for model training.
• Once the model is trained using the training data, the labels of the test data are
predicted using the model's target function.
• Then the predicted value is compared with the actual value of the label.
• The validation data is used for measuring the model performance. It is used in
iterations, to refine the model in each iteration.
• If the volume of input data is huge, then stratified random sampling is employed
for test data selection:
• the whole data is broken into several homogeneous groups
• a random sample is selected from each group.
• This ensures that the generated random partitions have equal proportions of each
class.
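A minimal sketch of a stratified holdout split, assuming scikit-learn and a synthetic labelled data set:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,    # hold back 30% of the labelled data as test data
    stratify=y,       # keep equal class proportions in both partitions
    random_state=42,  # make the random split reproducible
)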
• The issues in the random sampling approach, in the holdout method:
1. With smaller data sets, it is difficult to divide the data of some of the classes
proportionally amongst training and test data sets.
2. A repeated holdout is sometimes used to ensure the randomness of the composed data
sets.
• Several random holdouts are used to measure the model performance.
• In the end, the average of all performances is taken.
• As multiple holdouts have been drawn, the training and test data (and validation data)
contain representative data from all classes and resemble the original input data closely.
• This process of repeated holdout is the basis of the k-fold cross-validation technique.
k-fold cross-validation
• In k-fold cross-validation, the data set is divided into k completely distinct or
non-overlapping random partitions called folds.
• The value of 'k' in k-fold cross-validation can
be set to any number.
• There are two approaches which are extremely popular:
• 1. 10-fold cross-validation (10-fold CV)
• 2. Leave-one-out cross-validation (LOOCV)
10-fold cross-validation
• 10-fold cross-validation is by far the most popular approach.
• For each of the 10 folds, each comprising approximately 10% of the data, one of
the folds is used as the test data for validating the performance of a model trained
on the remaining 9 folds (or 90% of the data).
• This is repeated 10 times, once for each of the 10 folds being used as the test data
and the remaining folds as the training data.
• The average performance across all folds is reported.
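A minimal 10-fold cross-validation sketch, assuming scikit-learn and synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)

# cv=10: each of the 10 folds serves once as the test data
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10)
print(scores.mean())  # average performance across all 10 folds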
• In the figure, each circle represents a record in the input data set, while the
different colours indicate the different classes that the records belong to.
• The entire data set is broken into 'k' folds, out of which one fold is selected in
each iteration as the test data set.
• The fold selected as the test data set in each of the 'k' iterations is different.
• The contiguous circles represented as folds do not mean that they are subsequent
records in the data set; the records in a fold are drawn by using a random sampling
technique.
Lazy learning
• Lazy learning completely skips the abstraction and generalization processes; in
other words, a lazy learner doesn't 'learn' anything.
• It uses the training data as-is and uses that knowledge to classify the unlabelled
test data.
• it is also known as rote learning (i.e. memorization technique based on repetition).
• Due to its heavy dependency on the given training data instances, it is also known
as instance-based learning or non-parametric learning.
• Lazy learners take very little time in training because not much training
actually happens.
• However, they take a long time in classification because, for each record of the
test data, a comparison-based assignment of the label against the stored training
instances happens.
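k-Nearest Neighbour is the classic lazy learner. A minimal sketch, assuming scikit-learn and a synthetic data set:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=150, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)         # 'training' merely stores the instances
print(knn.score(X_test, y_test))  # the comparison work happens here, at prediction time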
Model Representation and Interpretability
● The main goal of each machine learning model is to generalize well.
● Generalization defines the ability of an ML model to provide a suitable output by
adapting to a given set of unknown inputs.
● It means that after being trained on the dataset, the model can produce reliable and
accurate output.
● Underfitting and overfitting are the two conditions that need to be checked to judge
the performance of the model and whether the model is generalizing well or not.
● Bias: the difference that occurs between the prediction values made by the model and
the actual/expected values is known as bias error, or error due to bias
(it corresponds to the error rate on the training data).
If the error rate has a high value, we call it high bias.
If the error rate has a low value, we call it low bias.
● Variance: the difference between the error rate of the training data and that of the
testing data is called variance.
If the difference of errors is high, it is called high variance.
If the difference of errors is low, it is called low variance.
Overfitting
• Overfitting occurs when our machine learning model tries to fit more than the
required patterns present in the given dataset.
• Because of this, the model starts capturing noise and inaccurate values present
in the dataset, and all these factors reduce the efficiency and accuracy of the
model.
• The overfitted model has low bias and high variance.
• Overfitting is the main problem that occurs in supervised learning.
• How to avoid overfitting in a model:
1. Early stopping of training
2. Using re-sampling techniques like cross-validation
3. Holding back a validation data set
4. Removing features
Underfitting & Overfitting
Underfitting
● Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data
● In the case of underfitting, the model is not able to learn enough from the training
data, and hence it reduces the accuracy and produces unreliable predictions.
● An underfitted model has high bias and low variance.
● How to avoid underfitting:
1. Increasing the training time of a model
2. Increasing the number of features
Bias – variance trade-off
• In supervised learning, the class value assigned by the learning model built
based on the training data may differ from the actual class value.
• This error in learning can be of two types:
1. errors due to 'bias' and
2. error due to 'variance’.
• Errors due to bias arise due to underfitting of the model. Underfitting
results in high bias.
• Errors due to variance occur from differences in the training data sets used to
train the model.
• In case of overfitting, the model matches the training data so closely that even a
small difference in the training data gets magnified in the model.
Model Accuracy
• Model accuracy is given by the total number of correct classifications (either True
Positive or True Negative) divided by the total number of classifications done.
• In the context of the above confusion matrix, the total count of TPs = 85, count of
FPs = 4, count of FNs = 2, and count of TNs = 9.
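In formula form (following the word definition above), with a worked value for these counts:

Accuracy = (TP + TN) / (TP + FP + FN + TN) = (85 + 9) / (85 + 4 + 2 + 9) = 94/100 = 0.94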
Error Rate
• The percentage of misclassifications is indicated using the error rate, which is
measured as:
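A reconstruction of the formula, using the same counts as above:

Error rate = (FP + FN) / (TP + FP + FN + TN) = (4 + 2) / 100 = 0.06 = 1 − Accuracy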
Kappa value(k)
• The kappa value of a model indicates the model accuracy adjusted for the possibility
of a correct prediction occurring merely by chance.
• It is calculated using the formula below
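The standard Cohen's kappa formula (a reconstruction consistent with the description above), with a worked value for the counts given earlier (TP = 85, FP = 4, FN = 2, TN = 9):

κ = (P(a) − P(pr)) / (1 − P(pr))

where P(a) is the observed accuracy and P(pr) is the probability of agreement by chance. Here P(a) = 0.94 and

P(pr) = ((TP+FN)/100 × (TP+FP)/100) + ((FP+TN)/100 × (FN+TN)/100)
      = 0.87 × 0.89 + 0.13 × 0.11 = 0.7886

so κ = (0.94 − 0.7886) / (1 − 0.7886) ≈ 0.72.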
Sensitivity
• The sensitivity of a model measures the proportion of TP examples or positive
cases which were correctly classified.
• It is measured as
• In the context of the above confusion matrix for the cricket match win prediction
problem:
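Using the standard definition and the counts given earlier:

Sensitivity = TP / (TP + FN) = 85 / (85 + 2) ≈ 0.977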
Specificity
• Specificity of a model measures the proportion of negative examples which have
been correctly classified.
• In the context of the above confusion matrix for the cricket match win prediction
problem:
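Using the standard definition and the counts given earlier:

Specificity = TN / (TN + FP) = 9 / (9 + 4) ≈ 0.692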
• Visualization is an easier and more effective way to understand model performance:
• 1. Receiver Operating Characteristic (ROC) curves
• 2. Area Under Curve (AUC)
• It also helps in comparing the efficiency of two models.
Receiver operating characteristic (ROC) curves
• Receiver Operating Characteristic (ROC) curve helps in visualizing the
performance of a classification model.
• It shows the efficiency of a model in the detection of true positives while avoiding
the occurrence of false positives.
• To refresh our memory, true positives are the cases where the model has correctly
classified data instances as the class of interest.
• On the other hand, FPs are those cases where the model incorrectly classified data
instances as the class of interest.
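A minimal sketch of computing ROC data, assuming scikit-learn and synthetic data (illustrative only):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # probability of the class of interest

fpr, tpr, thresholds = roc_curve(y_test, scores)  # FP rate vs TP rate at each threshold
print(roc_auc_score(y_test, scores))              # Area Under the ROC Curve (AUC)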
Supervised Learning – Regression
• A regression model predicts the numerical value of the target feature as ŷ = f(X);
for each data instance, the residual is the difference between the actual value yᵢ
and the predicted value ŷᵢ.
• Sum of Squared Errors (SSE) (of prediction) = sum of the squared residuals
= Σᵢ (yᵢ − ŷᵢ)², summed over all n data instances.
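A small numerical sketch (the data values are made up) showing SSE computed from residuals:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # hypothetical single-feature data
y = np.array([2.1, 3.9, 6.2, 7.8])          # actual target values

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)  # actual minus predicted values
sse = np.sum(residuals ** 2)      # sum of the squared residuals
print(sse)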
Internal Evaluation – Silhouette Coefficient
● The silhouette coefficient, which is one of the most popular internal evaluation
methods, uses distance (Euclidean or Manhattan distances are most commonly used)
between data elements as a similarity measure.
● The value of silhouette width ranges between
-1 and +1, with a high value indicating high
intra-cluster homogeneity and inter-cluster
heterogeneity.
• For a data set clustered into 'k' clusters, silhouette
width is calculated as:
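A standard form of the silhouette width for a single data instance i, consistent with the description above:

s(i) = (b(i) − a(i)) / max{a(i), b(i)}

where a(i) is the average distance from i to the other instances in its own cluster, and b(i) is the average distance from i to the instances of the nearest neighbouring cluster; the silhouette width of the clustering is the average of s(i) over all instances. A minimal sketch with scikit-learn (synthetic data):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
print(silhouette_score(X, labels))  # mean silhouette width across all instances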
External Evaluation
• In this approach, the class label is known for the data set subjected to
clustering.
• However, the known class labels are not a part of the data used in clustering.
• The clustering algorithm is assessed based on how close the results are to those
known class labels.
• For example, purity is one of the most popular measures of cluster algorithms;
it evaluates the extent to which clusters contain a single class.
• For a data set having 'n' data instances and 'c' known class labels, which generates
'k' clusters, purity is measured as:
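A standard form of the purity measure, consistent with this description:

Purity = (1/n) × Σᵢ maxⱼ |Cᵢ ∩ Lⱼ|

where the sum runs over the k clusters, Cᵢ is the set of instances placed in cluster i, and Lⱼ is the set of instances carrying class label j (j = 1, ..., c).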
IMPROVING PERFORMANCE OF MODEL
• Model selection is done based on several aspects:
1. Type of learning the task in hand, i.e. supervised or
unsupervised
2. Type of the data, i.e. categorical or numeric
3. Sometimes on the problem domain
4. Above all, experience in working with different models to
solve problems of diverse domains
• This approach of combining different models with diverse strengths is known as an
ensemble (see figure).
ENSEMBLE……….
● Alternatively, the same training data may be used, but the models combined are
quite varied, e.g. SVM, neural network, kNN, etc.
● The outputs from the different models are combined using a combination
function. A very simple combination strategy, in the case of a prediction task
using an ensemble, is majority voting across the different models combined.
● For example, if 3 out of 5 models predict 'win' and 2 predict 'loss', then the final
outcome of the ensemble using majority vote would be a 'win'.
● The ensemble models are:
1. Bagging or bootstrap aggregating
2. Boosting
3. Random Forest
Bagging or Bootstrap aggregating
● Bagging uses bootstrap sampling method to generate multiple
training data sets.
● These training data sets are used to generate (or train) a set of models
using the same learning algorithm.
● Then the outcomes of the models are combined by majority voting
(classification) or by average (regression).
● Bagging is a very simple ensemble technique which can perform
really well for unstable learners like a decision tree, in which a slight
change in data can impact the outcome of a model significantly.
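A minimal bagging sketch, assuming scikit-learn (the estimator parameter name assumes scikit-learn 1.2 or later) and synthetic data; BaggingClassifier draws bootstrap samples and aggregates the trees' votes:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # an unstable base learner
    n_estimators=25,                     # 25 models, each trained on a bootstrap sample
    bootstrap=True,                      # sample the training sets with replacement
    random_state=42,
).fit(X, y)

print(bag.score(X, y))  # classification outcome combined by majority voting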
Boosting & Random Forest
● Just like bagging, boosting is another key ensemble based technique.
● The weaker learning models are trained on resampled data and the outcomes are
combined using a weighted voting approach based on the performance of different
models.
● Adaptive boosting or AdaBoost is a special variant of boosting algorithm.
● It is based on the idea of generating weak learners sequentially, with each new
learner focusing on the training examples that the previous learners misclassified.
● Random forest is another ensemble-based technique. It is an ensemble of decision
trees hence the name random forest to indicate a forest of decision trees.
● Random Forest is a powerful ensemble learning technique that leverages the
strength of decision trees while addressing their limitations such as overfitting.
● By introducing randomness in feature selection and data sampling, Random
Forest builds a diverse set of decision trees and combines their predictions to
make robust and accurate predictions for classification and regression tasks.
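A minimal Random Forest sketch with scikit-learn (synthetic data); randomness enters both through bootstrap sampling of the rows and through random feature subsets at each split:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # a forest of 100 decision trees
    max_features="sqrt",  # consider a random subset of features at each split
    random_state=42,
).fit(X, y)

print(forest.score(X, y))  # combined (majority-vote) prediction accuracy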
Basics of Feature Engineering
● A feature is an attribute of a data set that is used in a machine learning process.
● The features in a data set are also called its dimensions. So, a data set having ‘n’
features is called an n-dimensional data set.
● A model for predicting the risk of cardiac disease may have features such as the
following: Age, Gender, Weight, Whether the person smokes, etc.
● Features in machine learning are very important, because the quality of the features
in the dataset has a major impact on the quality of the insights you will get while
using the dataset for machine learning.
Feature extraction
● In feature extraction, new features are created from a combination of original features.
Some of the commonly used operators for combining the original features include
1. For Boolean features: Conjunctions, Disjunctions, Negation, etc.
2. For nominal features: Cartesian product, M of N, etc.
3. For numerical features: Min, Max, Addition, Subtraction, Multiplication, Division,
Average, Equivalence, Inequality, etc.
● Let's take an example. Say we have a data set with a feature set
F = (F1, F2, ..., Fn).
● After feature extraction using a mapping function f, we will have a set of features
F' = (F'1, F'2, ..., F'm) such that F'i = f(Fi) and m < n.
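A toy illustration (the column names are made up) of extracting a new numerical feature by combining two original features with a division:

import pandas as pd

df = pd.DataFrame({
    "weight_kg": [70, 85, 60],
    "height_m": [1.75, 1.80, 1.65],
})

# extracted feature: body-mass index, a combination of two original features
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
print(df)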
Feature extraction algorithms- PCA
• The most popular feature extraction algorithms used in machine learning are
1. Principal Component Analysis(PCA)
2. Singular value decomposition(SVD)
3. Linear Discriminant Analysis(LDA)
• Principal Component Analysis (PCA) is an unsupervised learning algorithm that is
used for dimensionality reduction in machine learning.
• It is a statistical process that converts the observations of correlated features into a set
of linearly uncorrelated features with the help of orthogonal transformation.
• These new transformed features are called the Principal Components.
• It is one of the popular tools that is used for exploratory data analysis and predictive
modeling.
• It is a technique to draw strong patterns from the given dataset by reducing the
number of dimensions while retaining as much of the variance as possible.
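A minimal PCA sketch with scikit-learn, using the built-in Iris data (four correlated numerical features) purely as an illustration:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 4 correlated numerical features
pca = PCA(n_components=2)             # keep the first 2 principal components
X_reduced = pca.fit_transform(X)      # orthogonal transformation of the observations

print(pca.explained_variance_ratio_)  # share of variance retained by each component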