ML Unit 2

Machine learning

Uploaded by

suhanisweety448

UNIT-II

Modeling and Evaluation & Basics of Feature Engineering

Introduction:
● The basic learning process, irrespective of the fact that the learner is a human or a
machine, can be divided into three parts:
1. Data Input
2. Abstraction
3. Generalization
● Given current trends in information technology, a large volume of
heterogeneous data is produced across the world through social media
sites such as Facebook, Instagram, Google Plus, etc.
● The generated data is of little use and makes no sense until it is categorized.
● In the learning process, abstraction is a significant step because it represents raw
input data in a summarized and structured format.

Introduction….
● This structured representation of raw input data as a meaningful pattern is called
a model.
● The process of selecting a model and fitting it to a data set is called model
training. Once the model is trained, the raw input data is summarized into an
abstracted form.
● Generalization describes a model's ability to react to new data. That is, after
being trained on a training set, a model can digest new data and make accurate
predictions.
● If a model has been fitted too closely to the training data, it will be unable to
generalize. It makes inaccurate predictions on new data even though it makes
accurate predictions on the training data. This is called overfitting.
● Underfitting happens when a model has not learned enough from the data. It
cannot make accurate predictions, even on the training data.
● If the outcome is systematically incorrect, the learning is said to have a bias.
Categories of Machine Learning Approaches
● Three broad categories of machine learning approaches are used for resolving different
types of problems:
1. Supervised
1. Classification
2. Regression
2. Unsupervised
1. Clustering
2. Association analysis
3. Reinforcement
1. Active
2. Passive
● For each of the cases, the model that has to be created/trained is different.
● Multiple factors play a role when we try to select the model for solving a machine
learning problem

Categories of Machine Learning Approaches


● There is no one model that works best for every machine learning problem.
● Any learning model tries to simulate some real-world aspect. We first need to
understand the data characteristics, combine this understanding with the problem we are
trying to solve, and then decide which model to select.
● Three types of problems
1. Predicting class values
2. Predicting numerical values
3. Predicting grouping of data
● The problem may be the prediction of a class value, e.g. whether the next day will
be sunny or rainy, or whether a tumor is mild or serious.
● It may be the prediction of a numerical value, e.g. what the price of a house should be
in the next quarter, or the expected growth of a certain IT stock in the next 7 days.
● It may be the prediction of a grouping of data, e.g. finding customer segments that
are using a certain product.
Categories of Machine Learning…
● Machine learning algorithms are broadly of two types:
1. models for supervised learning, which primarily focus on solving predictive problems
2. models for unsupervised learning, which solve descriptive problems.
● Predictive model:
● Models for supervised learning, or predictive models, try to predict a certain value using
the values in an input data set.
● The predictive models have a clear focus on what they want to learn and how they want
to learn.
● Predictive models, in turn, may need to predict the value of a category or class to which a
data instance belongs.
● Below are some examples:
1. Predicting win/loss in a cricket match
2. Predicting whether a transaction is fraud
3. Predicting whether a customer may move to another product

Predictive models
Classification models: models used to predict a target feature with a categorical value
are known as classification models.
• Some of the popular classification models include
1. k-Nearest Neighbor (KNN),
2. Naive Bayes, and
3. Decision Tree.

Regression models: models used to predict the numerical value of the target
feature of a data instance are known as regression models.
• Popular regression models:
1. Linear Regression and
2. Logistic Regression (despite its name, it predicts a class probability and is typically
used for classification)
Descriptive models
• Models for unsupervised learning or descriptive models are used to describe a data set or
gain insight from a data set.
• There is no target feature or single feature of interest in case of unsupervised learning.
• Based on the value of all features, interesting patterns or insights are derived about the
data set.
• Descriptive models which group together similar data instances, i.e. data instances having
a similar value of the different features are called clustering models.
• Examples of clustering include
1. Customer grouping or segmentation based on social, demographic, ethnic,
etc. factors
2. Grouping of music based on different aspects like genre, language, time-
period, etc.
• The most popular model for clustering is k-Means.

Descriptive models
Market Basket Analysis
• Descriptive models related to pattern discovery are used for market basket analysis
of transactional data.
• In market basket analysis, based on the purchase pattern available in the
transactional data, the possibility of purchasing one product based on the purchase
of another product is determined.
• For example, transactional data may reveal a pattern that a customer who
purchases milk generally also purchases biscuits at the same time.
• This can be useful for targeted promotions or in-store set up.
• Promotions related to biscuits can be sent to customers of milk products or vice
versa.
• Also, in the store, products related to milk can be placed close to biscuits.
Training a Model(for Supervised Learning)
• Holdout method
• Cross-validation methods
• Bootstrap sampling
• Lazy vs. Eager learner

Holdout method
• In supervised learning, a model is trained using the
labelled input data.
• The test data may not be available immediately, and
the label value of the test data is not known.
• For this reason, a part of the labelled input data is held
back (holdout) for evaluation of the model.
• This subset of the input data is used as the test data for
evaluating the performance of a trained model.
• In general, 70% to 80% of the labelled input data is
used for model training, and the rest is held back as test data.
Holdout method
• Once the model is trained using the training data, the labels of the test data are
predicted using the model's target function.
• The predicted value is then compared with the actual value of the label.
• Validation data may also be held back for measuring the model performance. It is
used in iterations to refine the model in each iteration.
• If the volume of input data is huge, stratified random sampling is employed for
test data selection:
• the whole data is broken into several homogeneous groups (strata)
• a random sample is selected from each group.
• This ensures that the generated random partitions have equal proportions of each
class.
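
The stratified holdout described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `stratified_holdout`, the 30% test fraction, and the toy win/loss labels are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) so that every class is split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)          # one homogeneous group per class
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                 # random sample within each group
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical data set: 80 'win' records and 20 'loss' records.
labels = ["win"] * 80 + ["loss"] * 20
train_idx, test_idx = stratified_holdout(labels, test_fraction=0.3)
```

Both partitions keep the original 80:20 class ratio, which plain random sampling does not guarantee.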

Holdout method
• Issues with the random sampling approach in the holdout method:
1. With smaller data sets, it is difficult to divide the data of some of the classes
proportionally amongst the training and test data sets.
2. A repeated holdout is therefore sometimes used to ensure the randomness of the
composed data sets.
• Several random holdouts are used to measure the model performance.
• In the end, the average of all performances is taken.
• As multiple holdouts have been drawn, the training and test data (and validation data) are
more likely to contain representative data from all classes and resemble the original input
data closely.
• This process of repeated holdout is the basis of the k-fold cross-validation technique.
k-fold cross-validation
• In k-fold cross-validation, the data set is divided into k-completely distinct or non-
overlapping random partitions called folds.
• The value of 'k' in k-fold cross-validation can be set to any number.
• Two approaches are extremely popular:
1. 10-fold cross-validation (10-fold CV)
2. Leave-one-out cross-validation (LOOCV)

10-fold cross-validation
• 10-fold cross-validation is by far the most popular approach.
• The data is divided into 10 folds, each comprising approximately 10% of the data.
One fold is used as the test data for validating the performance of a model trained on
the remaining 9 folds (90% of the data).
• This is repeated 10 times, with each of the 10 folds used once as the test data
and the remaining folds as the training data.
• The average performance across all folds is reported.
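
The fold rotation described above can be sketched as follows. The helper names `k_fold_indices` and `cross_validate`, and the `evaluate` callback (which stands in for "train on these records, score on those"), are hypothetical names introduced only for this illustration.

```python
import random

def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k non-overlapping random folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)           # records are assigned to folds at random
    folds = [indices[i::k] for i in range(k)]      # k disjoint folds of near-equal size
    for i in range(k):
        test_idx = folds[i]                        # a different fold is the test set each time
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

def cross_validate(n_records, k, evaluate):
    """Average the score returned by evaluate(train_idx, test_idx) over all k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_records, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(100, k=10))
loocv_splits = list(k_fold_indices(5, k=5))        # k = number of records gives leave-one-out
```

Setting `k` equal to the number of records turns the same routine into leave-one-out cross-validation, at the cost of one training run per record.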
10-fold cross-validation
• In the figure for this slide, each circle represents a record in the input data
set, and the different colors indicate the different
classes that the records belong to.
• The entire data set is broken into 'k' folds, out of which
one fold is selected in each iteration as the test data set.
• The fold selected as the test data set is different in each of the
'k' iterations.
• The contiguous circles drawn as a fold are not necessarily
consecutive records in the data set.
• The records in a fold are drawn using a random sampling
technique.

Leave-one-out cross-validation (LOOCV)


• Leave-one-out cross-validation (LOOCV) is an extreme case of k-fold
cross-validation using one record or data instance at a time as the test
data.
• This is done to maximize the amount of data used to train the model.
• The number of iterations for which it has to be run is equal to the
total number of records in the input data set.
• It is therefore computationally very expensive and not used much in practice.
Bootstrap Sampling

• It is a popular way to identify training and test data sets from the
input data set.
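
A minimal sketch of bootstrap sampling follows. It assumes the usual convention, which the slide does not spell out, that training records are drawn with replacement and that the records never drawn (the "out-of-bag" records) serve as test data. The function name `bootstrap_split` is hypothetical.

```python
import random

def bootstrap_split(n_records, seed=0):
    """Draw n_records training indices WITH replacement; unpicked records form the test set."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n_records) for _ in range(n_records)]  # duplicates are allowed
    chosen = set(train_idx)
    test_idx = [i for i in range(n_records) if i not in chosen]       # out-of-bag records
    return train_idx, test_idx

train_idx, test_idx = bootstrap_split(100)
```

Unlike a cross-validation fold, the training sample here can contain the same record more than once.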
Cross Validation Vs Bootstrapping
Eager learning
• Eager learning follows the general principles of machine learning: it tries
to construct a generalized, input-independent target function during the model
training phase.
• It uses abstraction and generalization and comes up with a trained model at the
end of the learning phase.
• Hence, when the test data comes in for classification, the eager learner is ready
with the model and doesn't need to refer back to the training data.
• Eager learners take more time in the learning phase than the lazy learners.
• Some of the algorithms which adopt eager learning approach include
Decision Tree, Support Vector Machine, Neural Network, etc.

Lazy learning
• Lazy learning completely skips the abstraction and generalization processes;
in other words, a lazy learner doesn't 'learn' anything.
• It uses the training data as is, and uses that knowledge to classify the unlabelled
test data.
• It is also known as rote learning (i.e. memorization technique based on repetition).
• Due to its heavy dependency on the given training data instances, it is also known
as instance-based learning or non-parametric learning.
• Lazy learners take very little time in training because not much training
actually happens.
• However, classification takes a long time because, for each record of test data,
a comparison-based assignment of the label happens.
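
k-Nearest Neighbour is a common example of a lazy learner, and a short sketch shows why classification is the expensive step: there is no training phase, and every prediction compares the query against all stored records. The function name `knn_predict` and the toy 2-D points are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority label among its k nearest training records."""
    # No model is built: every stored record is compared with the query at prediction time.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labelled training data (two numeric features per record).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["loss", "loss", "loss", "win", "win", "win"]

prediction = knn_predict(train_X, train_y, (4.9, 5.0), k=3)
```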
Model Representation and Interpretability
● The main goal of each machine learning model is to generalize well.
● Generalization is the ability of an ML model to provide a suitable output for
a previously unseen set of inputs.
● It means that, after being trained on the dataset, the model can produce reliable and
accurate output.
● Underfitting and overfitting are the two conditions to check to determine whether
the model is performing and generalizing well.
● Bias: the difference between the values predicted by the model and the actual/expected
values is known as bias error, or error due to bias
(here, the error rate on the training data).
If the error rate is high, we call it high bias.
If the error rate is low, we call it low bias.
● Variance: the difference between the error rate on the training data and on the testing
data is called variance.
If the difference of errors is high, it is called high variance.
If the difference of errors is low, it is called low variance.

Overfitting
• Overfitting occurs when our machine learning model tries to cover more
than the required data points present in the given dataset.
• Because of this, the model starts capturing noise and inaccurate values present
in the dataset, and all these factors reduce the efficiency and accuracy of the
model.
• The overfitted model has low bias and high variance.
• Overfitting is the main problem that occurs in supervised learning.
• How to avoid overfitting in a model:
1. Stopping the training early
2. Using re-sampling techniques like cross-validation
3. Holding back a validation data set
4. Removing features
Underfitting & Overfitting

Underfitting
● Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data.
● In the case of underfitting, the model is not able to learn enough from the training
data, and hence it has reduced accuracy and produces unreliable predictions.
● An underfitted model has high bias and low variance.
● How to avoid underfitting:
1. Increasing the training time of a model
2. Increasing the number of features
Bias – variance trade-off
• In supervised learning, the class value assigned by the learning model built
based on the training data may differ from the actual class value.
• This error in learning can be of two types:
1. errors due to 'bias' and
2. error due to 'variance’.
• Errors due to bias arise from underfitting of the model. Underfitting
results in high bias.
• Errors due to variance arise from differences in the training data sets used to
train the model.
• In case of overfitting, the model matches the training data so closely that even a
small difference in training data gets magnified in the model.

Bias – variance trade-off


● Increasing the bias will decrease the variance.
Ex: Linear regression, Logistic regression
● Increasing the variance will decrease the bias.
Ex: Decision tree, KNN, SVM
● In the target diagram for this slide, the center of the target
is a model that perfectly predicts correct values.
Why is Bias Variance Tradeoff?
• If our model is too simple and has very few parameters then it may have high bias
and low variance
• If our model has large number of parameters then it’s going to have high variance
and low bias.
• The best solution is to have a model with low bias as well as low variance.
• To build a good model, we need to find a good balance between bias and variance
such that it minimizes the total error.
• The goal of supervised machine learning is to achieve a balance between bias and
variance.
• For example, in the supervised algorithm k-Nearest Neighbors or KNN, the user
configurable parameter 'k' can be used to trade off bias against variance.
• When the value of 'k' is increased, the model averages over more neighbours and
becomes simpler, so bias increases.
• When the value of 'k' is decreased, the model follows the training data more closely,
so variance increases.

Evaluating Performance of a Model


● Supervised Learning-Classification
● Supervised Learning-Regression
● Unsupervised Learning-Clustering
Supervised Learning-Classification:
1. Accuracy
2. Error Rate
3. Kappa value(k)
4. Sensitivity
5. Specificity
6. Precision
7. Recall
8. F-Measure
9. Receiver Operating Characteristic (ROC) Curves
10. Area Under Curve (AUC)
Supervised Learning-Classification
• The task of a classification model is to assign a class label to the target feature
based on the values of the predictor features.
• For example, in the problem of predicting the win/loss of a cricket match, the classifier
will assign a class value win/loss to the target feature based on the values of other features
like
• Whether the team won the toss,
• Number of spinners in the team,
• Number of wins the team had in the tournament, etc.
• To evaluate the performance of the model, the number of correct classifications or
predictions made by the model has to be recorded.
• Based on the number of correct and incorrect classifications or predictions made by a
model, the accuracy of the model is calculated.
• If the model has classified correctly 99 out of 100 times, the model accuracy is said
to be 99%.

Details of Model Classification


• There are four possibilities with regard to the cricket match win/loss prediction:
1. the model predicted win and the team won (True Positive)
2. the model predicted win and the team lost (False Positive)
3. the model predicted loss and the team won (False Negative)
4. the model predicted loss and the team lost (True Negative)
Confusion Matrix
• A matrix containing correct and incorrect predictions in the form of
TPs, FPs, FNs and TNs is known as a confusion matrix.
• The win/loss prediction of a cricket match has two classes of interest:
win and loss.
• For that reason it will generate a 2 x 2 confusion matrix.
• For a classification problem involving three classes, the confusion
matrix would be 3 x 3, etc.
• Assume the confusion matrix of the win/loss prediction of cricket
match problem to be as below:

                 Actual win   Actual loss
Predicted win    85 (TP)      4 (FP)
Predicted loss   2 (FN)       9 (TN)

Model Accuracy
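• Model accuracy is given by the total number of correct classifications (either True
Positive or True Negative) divided by the total number of classifications done:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

• In context of the above confusion matrix, total count of TPs = 85, count of FPs =
4, count of FNs = 2 and count of TNs = 9. Hence,

$$\text{Accuracy} = \frac{85 + 9}{85 + 4 + 2 + 9} = \frac{94}{100} = 94\%$$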
Error Rate
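• The percentage of misclassifications is indicated using error rate, which
is measured as

$$\text{Error rate} = \frac{FP + FN}{TP + FP + FN + TN}$$

• In context of the above confusion matrix (TP = 85, FP = 4, FN = 2, TN = 9),

$$\text{Error rate} = \frac{4 + 2}{85 + 4 + 2 + 9} = \frac{6}{100} = 6\% = 1 - \text{Accuracy}$$
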
Kappa value(k)
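• The Kappa value of a model indicates the model accuracy adjusted for the agreement
expected by chance.
• It is calculated using the formula below

$$\kappa = \frac{P(a) - P(pr)}{1 - P(pr)}$$

where P(a) is the proportion of observed agreement between actual and predicted classes,
and P(pr) is the proportion of agreement expected by chance.
Kappa value(k)…
● In context of the above confusion matrix, total count of TPs = 85, count of
FPs = 4, count of FNs = 2 and count of TNs = 9. Hence,

$$P(a) = \frac{TP + TN}{TP + FP + FN + TN} = \frac{94}{100} = 0.94$$

$$P(pr) = \frac{TP + FP}{100} \times \frac{TP + FN}{100} + \frac{FN + TN}{100} \times \frac{FP + TN}{100} = 0.89 \times 0.87 + 0.11 \times 0.13 = 0.7886$$

$$\kappa = \frac{0.94 - 0.7886}{1 - 0.7886} \approx 0.7162$$

• Kappa value can be 1 at the maximum, which represents perfect agreement
between the model's predictions and the actual values.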

Sensitivity
• The sensitivity of a model measures the proportion of positive cases which
were correctly classified (the TPs).
• It is measured as

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

• In the context of the above confusion matrix for the cricket match win prediction
problem (TP = 85, FN = 2), Sensitivity = 85 / 87 ≈ 97.7%.
Specificity
• Specificity of a model measures the proportion of negative examples which have
been correctly classified:

$$\text{Specificity} = \frac{TN}{TN + FP}$$

• In the context of the above confusion matrix for the cricket match win prediction
problem (TN = 9, FP = 4), Specificity = 9 / 13 ≈ 69.2%.

• A higher value of specificity indicates better model performance.

• There are two other performance measures of a supervised learning model which
are similar to sensitivity and specificity. These are precision and recall.

Precision and Recall


• Precision gives the proportion of positive predictions which are truly positive:

$$\text{Precision} = \frac{TP}{TP + FP}$$

• Recall gives the proportion of TP cases over all actually positive cases, i.e. the
proportion of correct predictions of positives to the total number of positives:

$$\text{Recall} = \frac{TP}{TP + FN}$$

• In the case of win/loss prediction of cricket, recall indicates
what proportion of the total wins were predicted correctly.
F-Measure
• F-measure is another measure of model performance which combines
precision and recall. It takes the harmonic mean of precision and recall,
calculated as

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

• F-score is a combination of multiple measures into one.
• Because it is a single number, the F-score can be used to compare the performance of
different models.
• However, one assumption the calculation is based on is that precision and recall
have equal weight, which may not always be true in reality.
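
The classification measures in this section can all be computed from the four confusion-matrix counts. The sketch below uses the counts quoted for the cricket example (TP = 85, FP = 4, FN = 2, TN = 9). The formulas are the standard textbook definitions, with kappa taken to be Cohen's kappa; both are assumptions here, because the slide's own formula images did not survive.

```python
# Confusion-matrix counts from the cricket win/loss example.
tp, fp, fn, tn = 85, 4, 2, 9
total = tp + fp + fn + tn

accuracy = (tp + tn) / total              # correct predictions / all predictions
error_rate = (fp + fn) / total            # misclassifications / all predictions
sensitivity = tp / (tp + fn)              # positives correctly classified
specificity = tn / (tn + fp)              # negatives correctly classified
precision = tp / (tp + fp)                # positive predictions that are truly positive
recall = tp / (tp + fn)                   # same quantity as sensitivity
f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean

# Cohen's kappa: observed agreement adjusted for agreement expected by chance.
p_observed = accuracy
p_expected = ((tp + fp) / total) * ((tp + fn) / total) \
           + ((fn + tn) / total) * ((fp + tn) / total)
kappa = (p_observed - p_expected) / (1 - p_expected)

for name, value in [("accuracy", accuracy), ("error rate", error_rate),
                    ("sensitivity", sensitivity), ("specificity", specificity),
                    ("precision", precision), ("F-measure", f_measure),
                    ("kappa", kappa)]:
    print(f"{name}: {value:.4f}")
```

The gap between 94% accuracy and the much lower specificity shows why accuracy alone can mislead when one class dominates.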

• Visualization is an easier and more effective way to understand the model performance.
It also helps in comparing the efficiency of two models.
• 1. Receiver operating characteristic (ROC) curves
• 2. Area Under Curve (AUC)
Receiver operating characteristic (ROC) curves
• Receiver Operating Characteristic (ROC) curve helps in visualizing the
performance of a classification model.
• It shows the efficiency of a model in the detection of true positives while avoiding
the occurrence of false positives.
• To refresh our memory, true positives (TPs) are those cases where the model has
correctly classified data instances as the class of interest.
• On the other hand, false positives (FPs) are those cases where the model incorrectly
classified data instances as the class of interest.

Area Under Curve (AUC)


• The area under curve (AUC) value is the area of the two-dimensional space under the ROC curve
extending from (0, 0) to (1, 1),
• where each point on the curve gives the true positive and false positive rates at a specific
classification threshold.
• This value gives an indication of the predictive quality of a model.
• AUC value ranges from 0 to 1, with an AUC of 0.5 or less indicating that the classifier has no
predictive ability.
• In the figure for this slide, the AUC of classifier 1 is more than the AUC of classifier 2. Hence
the inference that classifier 1 is better than classifier 2.
• A quick indicative interpretation of the predictive values from 0.5 to 1.0 is given below:
• 0.5 to 0.6 Almost no predictive ability
• 0.6 to 0.7 Weak predictive ability
• 0.7 to 0.8 Fair predictive ability
• 0.8 to 0.9 Good predictive ability
• 0.9 to 1.0 Excellent predictive ability
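
As a rough sketch of how an ROC curve and its AUC can be obtained, the code below sweeps the classification threshold over the predicted scores and integrates the curve with the trapezoidal rule. The function names, the toy labels and scores, and the assumption that both classes are present are illustrative choices, not part of the original slides.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate), one per threshold.

    labels: 1 for the class of interest, 0 otherwise (both classes must be present).
    scores: model scores, higher meaning more likely to be the class of interest.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lower the threshold past every record tied at this score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
area = auc(points)
```

For this toy data the area is 8/9, because 8 of the 9 positive/negative pairs are ranked in the right order.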
Unsupervised Learning - Clustering
• A clustering algorithm is successful if the clusters identified using the
algorithm are able to achieve the right results in the overall problem domain.
• For example, if clustering is applied for identifying customer segments for a
marketing campaign of a new product launch,
• the clustering can be considered successful only if the marketing campaign
ends with a success, i.e. it is able to create the right brand recognition resulting
in steady revenue from new product sales.
• Two challenges of clustering:
1. It is generally not known how many clusters can be formulated from a
particular data set. It is completely open-ended in most cases and provided as a
user input to a clustering algorithm.
2. Even if the number of clusters is given, the same number of clusters can be

Supervised Learning-Regression

• A regression model which ensures that the difference between predicted and
actual values is low can be considered a good model.

Regression - R-squared Measure
• R-squared is a good measure to evaluate the model fitness.
• The R-squared value lies between 0 and 1 (0% to 100%), with a larger value
representing a better fit.
• It is calculated as:

$$R^2 = \frac{SST - SSE}{SST}$$

• Sum of Squares Total (SST) = sum of the squared differences of each observation from
the overall mean = $\sum_{i=1}^{n} (y_i - \bar{y})^2$, where $\bar{y}$ is the mean.

• Sum of Squared Errors (SSE) (of prediction) = sum of the squared residuals
= $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.
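
A small worked example may help here. It assumes the standard definition R-squared = (SST - SSE) / SST, with SST measured around the mean of the actual values and SSE taken from the residuals. The function name and the four data points are made up for illustration.

```python
def r_squared(actual, predicted):
    """R-squared = (SST - SSE) / SST."""
    mean_actual = sum(actual) / len(actual)
    sst = sum((y - mean_actual) ** 2 for y in actual)               # spread around the mean
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))  # squared residuals
    return (sst - sse) / sst

actual = [3.0, 5.0, 7.0, 9.0]        # mean is 6.0, so SST = 9 + 1 + 1 + 9 = 20
predicted = [2.5, 5.5, 7.0, 9.0]     # SSE = 0.25 + 0.25 = 0.5
score = r_squared(actual, predicted)  # (20 - 0.5) / 20 = 0.975
```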

Internal Evaluation – Silhouette Coefficient
● The silhouette coefficient, which is one of the most popular internal evaluation
methods, uses distance (Euclidean or Manhattan distances most commonly used)
between data elements as a similarity measure.
● The value of silhouette width ranges between
-1 and +1, with a high value indicating high
intra-cluster homogeneity and inter-cluster
heterogeneity.
• For a data set clustered into 'k' clusters, silhouette
width is calculated as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

• a(i) is the average distance between the i-th
data instance and all other data instances
belonging to the same cluster, and
• b(i) is the lowest average distance between
the i-th data instance and the data instances of any other cluster.
Silhouette width calculation
• Let's calculate the distance of an arbitrary data element 'i' in cluster 1 to each of the
data elements of another cluster, say cluster 4, and take the average of all
those distances.
• Hence,

$$b_{14}(\text{average}) = \frac{1}{n} \sum_{j=1}^{n} d(i, j)$$

• where n is the total number of elements in cluster 4 and d(i, j) is the distance between
element 'i' and the j-th element of cluster 4.
• In the same way, we can calculate the values of b12 (average) and b13 (average).
• b(i) is the minimum of all these values.
• Hence, b(i) = minimum [b12 (average), b13 (average), b14 (average)]
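
The calculation above can be sketched for a single data element. The code assumes Euclidean distance and the usual definition s(i) = (b(i) - a(i)) / max(a(i), b(i)). The function name and the sample points are hypothetical.

```python
import math

def silhouette_width(point, own_cluster, other_clusters):
    """Silhouette width of one data element.

    own_cluster: the other members of the element's cluster (excluding the element itself).
    other_clusters: a list of clusters, each a list of points.
    """
    # a(i): average distance to the other members of the same cluster.
    a = sum(math.dist(point, p) for p in own_cluster) / len(own_cluster)
    # b(i): lowest average distance to the members of any other cluster.
    b = min(
        sum(math.dist(point, p) for p in cluster) / len(cluster)
        for cluster in other_clusters
    )
    return (b - a) / max(a, b)

point = (0.0, 0.0)
own_cluster = [(0.0, 1.0), (1.0, 0.0)]                            # a(i) = 1.0
other_clusters = [[(3.0, 4.0)], [(6.0, 8.0), (0.0, 10.0)]]        # averages 5.0 and 10.0
width = silhouette_width(point, own_cluster, other_clusters)      # (5 - 1) / 5 = 0.8
```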

External Evaluation
• In this approach, the class label is known for the data set subjected to
clustering.
• However, the known class labels are not a part of the data used in clustering.
• The cluster algorithm is assessed based on how close its results are
to those known class labels.
• For example, purity, one of the most popular measures of cluster
algorithms, evaluates the extent to which clusters contain a single class.
• For a data set having 'n' data instances and 'c' known class labels which generates 'k'
clusters, purity is measured as:

$$\text{Purity} = \frac{1}{n} \sum_{j=1}^{k} \max_{1 \le i \le c} \lvert C_j \cap L_i \rvert$$

where $C_j$ is the set of instances in cluster j and $L_i$ is the set of instances with class label i.
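
Purity can be sketched as follows, assuming its usual definition: for each cluster, count the instances of its most frequent known class, add these counts up, and divide by the total number of instances 'n'. The function name and the toy cluster assignments are made up for the example.

```python
from collections import Counter

def purity(cluster_ids, class_labels):
    """Fraction of instances that belong to the majority class of their cluster."""
    clusters = {}
    for cluster_id, label in zip(cluster_ids, class_labels):
        clusters.setdefault(cluster_id, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(class_labels)

# Hypothetical result: 10 instances, 3 clusters, 3 known classes.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
class_labels = ["a", "a", "b", "b", "b", "b", "c", "c", "c", "a"]
score = purity(cluster_ids, class_labels)   # (2 + 3 + 2) / 10 = 0.7
```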
IMPROVING PERFORMANCE OF MODEL
• Model selection is done based on several aspects:
1. Type of learning task in hand, i.e. supervised or
unsupervised
2. Type of the data, i.e. categorical or numeric
3. Sometimes the problem domain
4. Above all, experience in working with different models to
solve problems of diverse domains

Tuning model parameter


• Model parameter tuning, the process of adjusting the model fitting
options, is an effective way to improve model performance.
• Most machine learning models have at least one parameter which can
be tuned.
• In the classification model k-Nearest Neighbour (KNN), the model can be
tuned by using different values of 'k', the number of nearest neighbours to be
considered.
• In a neural network model, the number of hidden layers can be
adjusted to tune the performance.
• As an alternative to improving the performance of one model,
several models may be combined together.
• The models in such a combination are complementary to each other, i.e.
one model may learn one type of data set well while struggling with another
type of data set.
ENSEMBLE

This approach of combining different models with diverse strengths is known as
ensemble (figure).
ENSEMBLE……….
● Alternatively, the same training data may be used but the models combined are
quite varied, e.g. SVM, neural network, kNN, etc.
● The outputs from the different models are combined using a combination
function. A very simple strategy for a prediction task is majority voting among
the different models combined.
● For example, if 3 out of 5 classifiers predict 'win' and 2 predict 'loss', then the final
outcome of the ensemble using majority vote would be a 'win'.
● Common ensemble methods are:
1. Bagging or bootstrap aggregating
2. Boosting
3. Random Forest
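
The majority-voting combination function mentioned above is simple enough to show directly. This sketch assumes each model's prediction is already available as a label. The function name `majority_vote` is illustrative.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]

# Five hypothetical classifiers: 3 predict 'win', 2 predict 'loss'.
votes = ["win", "loss", "win", "win", "loss"]
final_outcome = majority_vote(votes)
```

For regression ensembles, the same role is played by averaging the predicted values instead of counting votes.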
Bagging or Bootstrap aggregating
● Bagging uses bootstrap sampling method to generate multiple
training data sets.
● These training data sets are used to generate (or train) a set of models
using the same learning algorithm.
● Then the outcomes of the models are combined by majority voting
(classification) or by average (regression).
● Bagging is a very simple ensemble technique which can perform
really well for unstable learners like a decision tree, in which a slight
change in data can impact the outcome of a model significantly.
Boosting & Random Forest
● Just like bagging, boosting is another key ensemble based technique.
● Weak learning models are trained on resampled data and the outcomes are
combined using a weighted voting approach based on the performance of the different
models.
● Adaptive boosting or AdaBoost is a special variant of the boosting algorithm.
● It is based on the idea of generating weak learners and learning slowly.
● Random forest is another ensemble-based technique. It is an ensemble of decision
trees hence the name random forest to indicate a forest of decision trees.
● Random Forest is a powerful ensemble learning technique that leverages the
strength of decision trees while addressing their limitations such as overfitting.
● By introducing randomness in feature selection and data sampling, Random
Forest builds a diverse set of decision trees and combines their predictions to
make robust and accurate predictions for classification and regression tasks.
Basics of Feature Engineering
● A feature is an attribute of a data set that is used in a machine learning process.
● The features in a data set are also called its dimensions. So, a data set having ‘n’
features is called an n-dimensional data set.
● A model for predicting the risk of cardiac disease may have features such as the
following: Age, Gender, Weight, Whether the person smokes, etc.
● Features are very important in machine learning, because the quality of the features
in the dataset has a major impact on the quality of the insights you will get while
using the dataset for machine learning.

What is feature engineering?


● Feature engineering refers to the process of translating a data set into features
such that these features are able to represent the data set more effectively and
result in a better learning performance.
● Feature engineering is an important pre-processing step for machine learning.
● It has two major elements:
1. Feature transformation
2. Feature subset selection

● Feature transformation: transforms the data, structured or unstructured, into a
new set of features which can represent the underlying problem which machine
learning is trying to solve.
● There are two variants of feature transformation:
1. feature construction
2. feature extraction
Feature transformation & Feature subset selection
● Feature construction process discovers missing information about the
relationships between features and augments the feature space by creating
additional features.
● Hence, if there are ‘n’ features or dimensions in a data set, after feature
construction ‘m’ more features or dimensions may get added.
● So at the end, the data set will become ‘n + m’ dimensional.
● Feature extraction is the process of extracting or creating a new set of features
from the original set of features using some functional mapping.
● Feature subset selection: no new feature is generated. The objective of feature
selection is to derive a subset of features from the full feature set which is most
meaningful in the context of a specific machine learning problem.

Feature transformation- Feature construction


• Feature transformation is a mathematical transformation in which we apply a
mathematical formula to a particular column (feature) and transform the values,
which are useful for our further analysis.
• It creates new features from existing features that may help improve the model
performance
• Feature construction involves transforming a given set of input features to generate
a new set of more powerful features.
Feature construction
● There are certain situations where feature construction is an essential activity
before we can start with the machine learning task. These situations are
1. when features have categorical values and machine learning needs numeric
value inputs
2. when features have numeric (continuous) values and need to be converted to
ordinal values
3. when text-specific feature construction needs to be done
● Encoding categorical (nominal) variables:
● Let's take the example of a data set on athletes, which has the features age, city of
origin, parents athlete (whether a parent was an athlete) and chance of win.
● The feature chance of win is a class variable while the others are predictor
variables.

Encoding categorical (nominal) variables


Encoding categorical (ordinal) variables
● Let's take an example of a student data set which has three features: science marks,
maths marks and grade.
● Here, the grade is an ordinal variable with values A, B, C, and D.
● To transform this variable to a numeric variable, we can create a feature num_grade
mapping a numeric value against each ordinal value. In the context of the current
example, grades A, B, C, and D are mapped to values 1, 2, 3, and 4.
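
Both kinds of encoding can be sketched in a few lines. The ordinal mapping follows the num_grade example above. For the nominal case, the sketch assumes one dummy (0/1) feature per category, often called one-hot encoding. The city names and the `is_` prefix are placeholders, not values from the original data set.

```python
def one_hot(values):
    """Encode a nominal feature as one 0/1 dummy feature per distinct category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Ordinal variable: the order A > B > C > D is preserved by the numeric codes.
grade_to_num = {"A": 1, "B": 2, "C": 3, "D": 4}
grades = ["B", "A", "D", "C"]
num_grade = [grade_to_num[g] for g in grades]

# Nominal variable: no order exists, so each category gets its own dummy feature.
cities = ["City A", "City B", "City A", "City C"]   # placeholder city names
city_features = one_hot(cities)
```

Mapping a nominal feature such as city to 1, 2, 3 would wrongly suggest an order, which is why it gets dummy features instead.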

Transforming numeric features to categorical features


● For example, we may want to treat the real estate price prediction problem, which is a
regression problem, as a real estate price category prediction, which is a classification
problem.
● In that case, we can 'bin' the numerical data into multiple categories based on the data
range.
● In the context of the real estate price prediction example, the original data set has a
numerical feature apartment_price. It can be transformed to a categorical variable
price-grade.
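
A minimal sketch of binning follows. The category names and the two price boundaries are invented for illustration, since the original table of ranges did not survive.

```python
def price_grade(apartment_price, boundaries=(100_000, 200_000)):
    """Map a numeric price to a category using hypothetical range boundaries."""
    low, high = boundaries
    if apartment_price < low:
        return "Low"
    if apartment_price < high:
        return "Medium"
    return "High"

prices = [85_000, 150_000, 320_000]
grades = [price_grade(p) for p in prices]
```

In practice the boundaries would be chosen from the data range, for example from quantiles of the observed prices.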
Numeric(continuous) features to categorical features

Text-specific feature construction


• Today, text is one of the most predominant media of communication. If we think
about social networks like Facebook and Twitter, emails, or WhatsApp, text plays a major
role in the flow of information.
• Hence, text mining is an important area of research. However, making sense of text data,
due to the inherent unstructured nature of the data, is not so straightforward.
• All machine learning models need numerical data as input.
• So the text data in the data sets needs to be transformed into numerical features.
• The process of converting text data into a numerical representation is known as
vectorization.
• In this process, word occurrences in all documents belonging to the text data or corpus
are consolidated in the form of a bag-of-words.
• There are three major steps that are followed:
1. tokenize
2. count
3. normalize
Text-specific feature construction…
● To tokenize text data, blank spaces and punctuation are used as
delimiters to separate out the words, or tokens.
● Then the number of occurrences of each token is counted for each document.
● Lastly, tokens are weighted with reducing importance when they occur in the
majority of the documents.
● A matrix is then formed with each token representing a column and each
document of the corpus representing a row.
● Each cell contains the count of occurrences of the token in a specific document.
● This matrix is known as a document-term matrix.
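
The tokenize, count and normalize steps can be sketched with the standard library alone. The weighting used here, log(N / document frequency) + 1, is one common inverse-document-frequency variant chosen as an assumption, since the slides do not name a formula. The function names and the two sample sentences are also illustrative.

```python
import math
import re
from collections import Counter

def document_term_matrix(documents):
    """Tokenize on spaces/punctuation and count token occurrences per document."""
    tokenized = [re.findall(r"[a-z0-9']+", doc.lower()) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    counts = [Counter(doc) for doc in tokenized]
    # One row per document, one column per token.
    matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, matrix

def idf_weights(vocabulary, matrix):
    """Weight per token that shrinks as the token appears in more documents."""
    n_docs = len(matrix)
    weights = []
    for col in range(len(vocabulary)):
        doc_freq = sum(1 for row in matrix if row[col] > 0)
        weights.append(math.log(n_docs / doc_freq) + 1.0)
    return weights

docs = ["The cat sat.", "The dog sat, the dog barked!"]
vocab, dtm = document_term_matrix(docs)
weights = idf_weights(vocab, dtm)
```

Multiplying each count by its token's weight gives the normalized matrix, in which a token that occurs in every document (here "the") counts for less than a rare one.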

Feature extraction
● In feature extraction, new features are created from a combination of original features.
Some of the commonly used operators for combining the original features include
1. For Boolean features: Conjunctions, Disjunctions, Negation, etc.
2. For nominal features: Cartesian product, M of N, etc.
3. For numerical features: Min, Max, Addition, Subtraction, Multiplication, Division,
Average, Equivalence, Inequality, etc.
● Let's take an example. Say we have a
data set with a feature set $F = (F_1, F_2, \ldots, F_n)$.
● After feature extraction using a mapping
function $f$, we will have a set of features
$F' = (F'_1, F'_2, \ldots, F'_m)$ such that
$F'_i = f(F_i)$ and $m < n$.
Feature extraction algorithms- PCA
• The most popular feature extraction algorithms used in machine learning are
1. Principal Component Analysis (PCA)
2. Singular Value Decomposition (SVD)
3. Linear Discriminant Analysis (LDA)
• Principal Component Analysis (PCA) is an unsupervised learning algorithm that is
used for dimensionality reduction in machine learning.
• It is a statistical process that converts the observations of correlated features into a set
of linearly uncorrelated features with the help of an orthogonal transformation.
• These new transformed features are called the Principal Components.
• It is one of the popular tools used for exploratory data analysis and predictive
modeling.
• It is a technique to draw strong patterns from the given dataset by reducing the
number of dimensions while retaining as much of the variance as possible.
