Feature Engineering and Pre-Processing in Python
Feature Engineering
Training Data
• Training data is the labeled data provided to the algorithm to learn from; it assists in learning and forming a predictive hypothesis for future data.
Test Data
• Data provided to test a hypothesis created via prior learning is known as test data.
• Typically, 20% of the labeled data is reserved for testing.
Validation data
• It is a dataset used to retest the hypothesis (in case the algorithm got overfitted to even the test
data due to multiple attempts at testing).
Feature Selection
• You need not use every feature at your disposal for creating an algorithm. You can assist your algorithm by feeding in only those features that are really important.
• This becomes even more important when the number of features is very large.
Reasons to use feature selection are:
• It enables the machine learning algorithm to train faster.
• It reduces the complexity of a model and makes it easier to interpret.
• It improves the accuracy of a model if the right subset is chosen.
• It reduces overfitting.
Feature Selection Methods
• There are three types of feature selection methods
1. Filter Methods
2. Wrapper Methods
3. Embedded methods
The correlation measure to use depends on the type of the feature data and of the response data, as summarized below:
• Continuous feature, continuous response: Pearson’s Correlation
• Continuous feature, categorical response: LDA
• Categorical feature, continuous response: ANOVA
• Categorical feature, categorical response: Chi-Square
Brief explanation of correlation coefficients
• Pearson’s Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson’s correlation is given as ρ(X, Y) = cov(X, Y) / (σX σY).
• LDA: Linear discriminant analysis is used to find a linear combination of features that
characterizes or separates two or more classes (or levels) of a categorical variable.
• ANOVA: ANOVA stands for Analysis of variance. It is similar to LDA except for the fact that
it is operated using one or more categorical independent features and one continuous
dependent feature. It provides a statistical test of whether the means of several groups are
equal or not.
• Chi-Square: It is a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.
Correlation Statistics
• The scikit-learn library provides an implementation of most of the useful statistical measures.
• For example:
• Pearson’s Correlation Coefficient: f_regression()
• ANOVA: f_classif()
• Chi-Squared: chi2()
• Mutual Information: mutual_info_classif() and mutual_info_regression()
• Also, the SciPy library provides an implementation of many more statistics, such as Kendall’s tau (kendalltau) and Spearman’s rank correlation (spearmanr).
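Below is a minimal sketch (not taken from the original material) of how these SciPy statistics can be computed on two toy numeric arrays; pearsonr, spearmanr and kendalltau each return the statistic together with a p-value.

from scipy.stats import pearsonr, spearmanr, kendalltau

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

corr, p_corr = pearsonr(x, y)   # linear correlation
rho, p_rho = spearmanr(x, y)    # Spearman's rank correlation
tau, p_tau = kendalltau(x, y)   # Kendall's tau
print(corr, rho, tau)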
Selection Method
• The scikit-learn library also provides many different filtering methods once statistics have been calculated
for each input variable with the target.
• Two of the more popular methods include:
• Select the top k variables: SelectKBest
• Select the top percentile variables: SelectPercentile
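As a minimal sketch (the synthetic dataset and the f_classif scoring function are assumptions, not part of the original slides), both classes can be used as follows:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

X, y = make_classification(n_samples=100, n_features=20, random_state=1)

# keep the 10 best-scoring variables
X_top_k = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)
# keep the top 50% of variables by score
X_top_pct = SelectPercentile(score_func=f_classif, percentile=50).fit_transform(X, y)

print(X_top_k.shape, X_top_pct.shape)   # (100, 10) (100, 10)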
Transform Variables
Consider transforming the variables in order to apply different statistical methods.
• For example, you can transform a categorical variable to ordinal, even if it is not, and see if any interesting results come out.
• You can also make a numerical variable discrete (e.g. by binning it) and then try categorical-based measures; a sketch is given after this list.
• Some statistical measures assume properties of the variables; Pearson’s, for example, assumes a Gaussian probability distribution of the observations and a linear relationship. You can transform the data to meet the expectations of the test, or try the test regardless of its expectations, and compare the results.
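The sketch below (toy data assumed) shows one such transformation: discretizing a numerical variable into bins with scikit-learn's KBinsDiscretizer so that categorical-based measures can then be tried on it.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[1.2], [3.4], [2.2], [7.8], [5.5], [9.1]])

# replace each value with the index of the bin it falls into (0, 1 or 2)
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)
print(X_binned.ravel())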
Regression Feature Selection:
(Numerical Input, Numerical Output)
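A minimal sketch of this case is given below; the synthetic dataset from make_regression and the choice of k=10 are assumptions for illustration, with f_regression as the correlation-based scoring function.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# generate a regression dataset with 100 numerical features
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, random_state=1)

# define and apply feature selection: keep the 10 best-scoring features
fs = SelectKBest(score_func=f_regression, k=10)
X_selected = fs.fit_transform(X, y)

print(X.shape, X_selected.shape)   # (1000, 100) (1000, 10)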
• Running the example first creates the regression dataset, then defines the feature selection and applies the feature selection procedure to the dataset, returning a subset of the selected input features.
Classification Feature Selection:
(Numerical Input, Categorical Output)
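A minimal sketch of this case is given below; the synthetic dataset from make_classification and the choice of k=5 are assumptions, with the ANOVA F-test (f_classif) as the scoring function.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# generate a classification dataset with 20 numerical features
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=1)

# define and apply feature selection: keep the 5 best-scoring features
fs = SelectKBest(score_func=f_classif, k=5)
X_selected = fs.fit_transform(X, y)

print(X.shape, X_selected.shape)   # (1000, 20) (1000, 5)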
Running the example first creates the classification dataset, then defines the feature selection and applies the feature
selection procedure to the dataset, returning a subset of the selected input features.
Classification Feature Selection:
(Categorical Input, Categorical Output)
• The two most commonly used feature selection methods for categorical input data
when the target variable is also categorical (e.g. classification predictive modeling)
are the chi-squared statistic and the mutual information statistic.
• Depending upon the chosen statistic, the feature selection is defined as follows:
SelectKBest(score_func=chi2, k=4)
Or
SelectKBest(score_func=mutual_info_classif, k=4)
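The sketch below (the toy categorical data is an assumption for illustration) shows the chi-squared variant end to end; the inputs are ordinal-encoded first because chi2 expects non-negative numeric values, and mutual_info_classif can be swapped in as the score_func in exactly the same way.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2

data = pd.DataFrame({
    'colour': ['red', 'blue', 'red', 'green', 'blue', 'green'],
    'size':   ['S', 'M', 'L', 'S', 'L', 'M'],
    'shape':  ['round', 'square', 'round', 'round', 'square', 'square'],
    'label':  ['yes', 'no', 'yes', 'no', 'no', 'yes'],
})

X = OrdinalEncoder().fit_transform(data[['colour', 'size', 'shape']])
y = LabelEncoder().fit_transform(data['label'])

fs = SelectKBest(score_func=chi2, k=2)   # keep the 2 best-scoring categorical features
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)                  # (6, 2)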
Wrapper Methods
• In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from the subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive.
Wrapper Methods
• Forward Selection: Forward selection is an iterative method in which we start with no features in the model. In each iteration, we add the feature that best improves the model, until adding a new variable no longer improves the performance of the model.
• Backward Elimination: In backward elimination, we start with all the features and remove the least significant feature at each iteration, keeping the removal if it improves the performance of the model. We repeat this until no improvement is observed on removing a feature. (A sketch of forward and backward selection is given after this list.)
• Recursive Feature Elimination: It is a greedy optimization algorithm which aims to find the best-performing feature subset. It repeatedly creates models and sets aside the best- or worst-performing feature at each iteration. It constructs the next model with the remaining features until all the features are exhausted, and then ranks the features based on the order of their elimination.
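A minimal sketch of forward and backward selection, using scikit-learn's SequentialFeatureSelector with a logistic-regression estimator on a synthetic dataset (both choices are assumptions for illustration):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=1)
estimator = LogisticRegression(max_iter=1000)

# add features one at a time (forward) or remove them one at a time (backward)
forward = SequentialFeatureSelector(estimator, n_features_to_select=3, direction='forward').fit(X, y)
backward = SequentialFeatureSelector(estimator, n_features_to_select=3, direction='backward').fit(X, y)

print(forward.get_support())    # mask of features chosen by forward selection
print(backward.get_support())   # mask of features kept by backward elimination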
Recursive Feature Elimination example code
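A minimal sketch of RFE, assuming a synthetic classification dataset and a logistic-regression estimator:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=1)

estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=4)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 = selected; larger ranks were eliminated earlier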
Embedded Methods
• Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection methods.
Embedded Methods
• Some of the most popular examples of these methods are LASSO and RIDGE
regression which have inbuilt penalization functions to reduce overfitting.
• Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients; this can shrink some coefficients exactly to zero, effectively removing those features.
• Ridge regression performs L2 regularization, which adds a penalty equal to the square of the magnitude of the coefficients.
• Other examples of embedded methods are Regularized trees, Memetic algorithm,
Random multinomial logit.
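A minimal sketch of embedded selection with LASSO, assuming a synthetic regression dataset: features whose L1-penalized coefficients shrink to zero are dropped by SelectFromModel.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=1)

lasso = Lasso(alpha=1.0)
selector = SelectFromModel(estimator=lasso).fit(X, y)

print(selector.get_support())        # mask of features kept (non-zero coefficients)
print(selector.transform(X).shape)   # reduced feature matrix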
Filter Vs Wrapper methods
The main differences between the filter and wrapper methods for feature selection are:
• Filter methods measure the relevance of features by their correlation with the dependent variable, while wrapper methods measure the usefulness of a subset of features by actually training a model on it.
• Filter methods are much faster than wrapper methods, as they do not involve training models; wrapper methods, by contrast, are computationally very expensive.
• Filter methods use statistical measures to evaluate a subset of features, while wrapper methods use cross-validation.
• Filter methods may fail to find the best subset of features on many occasions, whereas wrapper methods, which search over subsets directly, are more likely to find the best-performing subset.
• Using the subset of features from wrapper methods makes the model more prone to overfitting than using the subset of features from filter methods.
Feature Scaling
• To standardize the jth feature, you subtract the sample mean μj from every training sample and divide it by its standard deviation σj, as given below:
  xj(new) = (xj − μj) / σj
Here, xj is a vector consisting of the jth feature values of all n training samples.
• Given below is a sample NumPy sketch that uses NumPy’s mean and standard deviation functions to standardize the features of a sample data set X (x0, x1, ...):
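A possible reconstruction of that sketch (the toy matrix X is an assumption) is:

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# subtract each column's mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std)   # each column now has zero mean and unit variance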
• In most cases, normalization refers to the rescaling of data features between 0 and
1, which is a special case of Min-Max scaling.
• Normalization: Example
In the given equation, subtract the minimum value of each feature from each feature instance and divide by the spread between the maximum and the minimum:
  xnorm = (x − xmin) / (xmax − xmin)
In effect, it measures the relative percentage of the distance of each instance from the minimum value for that feature.
The ML library scikit-learn has a MinMaxScaler class for normalization, as in the sketch below.
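A minimal sketch (toy data assumed) of using MinMaxScaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = MinMaxScaler()            # rescales each feature to the [0, 1] range
X_norm = scaler.fit_transform(X)
print(X_norm)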
Difference between Standardization and Normalization
• Steps 1 and 2 (subtracting the mean from the data) reduce the mean of the data to zero, and steps 3 and 4 (dividing each coordinate by its standard deviation) rescale each coordinate to have unit variance. This ensures that different attributes are treated on the same scale.
• For instance, if x1 was the max speed in mph (taking values in the high tens or low hundreds) and x2 was the number of seats (taking values 2-4), then this renormalization rescales the attributes to make them more comparable to each other.
Principal Component Analysis (PCA)(Contd.)
• How do you find the axis of variation u on which most of the data lies?
• When you project the data onto the axis of a unit vector u, you would like to preserve as much of this variance as possible, i.e. choose u so that the variance of the projections is maximized (which means most of the data is covered).
• Intuitively, the data starts off with some amount of variance (information).
• The figure shows this normalized data.
• Let’s project data onto different u axes as shown in the charts given on the left.
• Dots represent the projection of data points on this line.
• In figure A, projected data has a large amount of variance, and the points are far
from zero.
• In figure B, projected data has a low amount of variance, and the points are closer to zero. Hence, figure A is a better choice for projecting the data.
• The length of the projection of x onto a unit vector u is given by xᵀu; this is also the distance of the projection of x from the origin.
• Hence, to maximize the variance of the projections, you choose the unit-length u that maximizes
  (1/m) Σᵢ (x(i)ᵀ u)² = uᵀ Σ u,  where Σ = (1/m) Σᵢ x(i) x(i)ᵀ and m is the number of training samples.
• Maximizing this subject to ‖u‖ = 1 gives the principal eigenvector of Σ, which is also known as the covariance matrix of the data (assuming that it has zero mean).
• Generally, if you need to project the data onto a k-dimensional subspace (k < n), you choose u1, u2, ..., uk to be the top k eigenvectors of Σ.
• All the ui now form a new orthogonal basis for the data.
• Then, to represent x(i) in this new basis, you compute the corresponding vector
  y(i) = (u1ᵀ x(i), u2ᵀ x(i), ..., ukᵀ x(i))ᵀ.
• The vector y(i) is a lower, k-dimensional approximation of x(i). This is known as dimensionality reduction.
• The vectors u1,u2...uk are called the first k principal components of the data.
Applications of PCA
• Noise Reduction
• PCA can eliminate noise or noncritical aspects of the data set to reduce complexity.
Also, during image processing or comparison, image compression can be done
with PCA, eliminating the noise such as lighting variations in face images.
• Compression
• It is used to map high dimensional data to lower dimensions. For example, instead
of having to deal with multiple car types (dimensions), we can cluster them into
fewer types.
• Preprocess
• It reduces data dimensions before running a supervised learning program and
saves on computations as well as reduces overfitting.
PCA: 3D to 2D Conversion
• After PCA on 3-D data, one finds only two dimensions being important (red and green in the figure), which carry most of the variance. The blue dimension has limited variance and hence is eliminated.
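A minimal sketch of this 3-D to 2-D reduction with scikit-learn (the random data is an assumption for illustration):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X_3d = rng.rand(100, 3)                # 100 samples with 3 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_3d)         # project onto the top 2 principal components

print(X_2d.shape)                      # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance carried by each component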
Key Takeaways
We need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as a
working directory. To set a working directory in Spyder IDE, we need to follow the
below steps:
• Save your Python file in the directory which contains the dataset.
• Go to the File explorer option in Spyder IDE, and select the required directory.
• Press F5 or click the Run option to execute the file.
read_csv() function:
• Now, to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a csv file and perform various operations on it. Using this function, we can read a csv file locally as well as through a URL.
• We can use the read_csv function as below:
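A minimal sketch, assuming the dataset is stored in a file named 'Data.csv' (a placeholder name; substitute your own file) in the working directory:

import pandas as pd

data_set = pd.read_csv('Data.csv')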
• Here, data_set is the name of the variable that stores our dataset, and inside the function we have passed the name of our dataset file.
Extracting Dependent and Independent variables
• In machine learning, it is important to distinguish the matrix of features (independent variables) from the dependent variable in the dataset.
• In our dataset, there are three independent variables, Country, Age, and Salary, and one dependent variable, Purchased.
Extracting independent variable:
• To extract the independent variables, we will use the iloc[] method of the Pandas library. It is used to extract the required rows and columns from the dataset.
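A sketch of the extraction described next, continuing the running example (data_set loaded with read_csv above):

x = data_set.iloc[:, :-1].values   # all rows, all columns except the last one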
In the above code, the first colon (:) is used to take all the rows, and the second colon (:) is for all the columns. Here we have used :-1 because we don't want to take the last column, as it contains the dependent variable. By doing this, we will get the matrix of features.
Output
• By executing the above code, we will get the matrix of features as output.
• As we can see in the output, only the three independent variables (Country, Age, and Salary) remain.
Extracting Dependent and Independent variables
• Extracting dependent variable:
To extract dependent variables, again, we will use Pandas .iloc[] method.
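A sketch, continuing the same running example:

y = data_set.iloc[:, -1].values   # all rows, last column only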
Here we have taken all the rows with the last column only; executing the above code gives the array of dependent-variable values.
Handling Missing data:
The next step of data preprocessing is to handle missing data in the datasets. If our
dataset contains some missing data, then it may create a huge problem for our
machine learning model. Hence it is necessary to handle missing values present in
the dataset.
Ways to handle missing data:
• There are mainly two ways to handle missing data, which are:
• By deleting the particular row: The first way is commonly used to deal with null values. In this way, we just delete the specific row or column which consists of null values. But this way is not very efficient, and removing data may lead to a loss of information that reduces accuracy.
• By calculating the mean: In this way, we calculate the mean of the column or row which contains a missing value and put it in place of the missing value. This strategy is useful for features which have numeric data, such as age, salary, year, etc. Here, we will use this approach.
Python code
• To handle missing values, we will use the Scikit-learn library in our code, which contains various tools for building machine learning models. Here we will use the SimpleImputer class of the sklearn.impute module (it replaces the older Imputer class from sklearn.preprocessing). Below is a sketch of the code for it:
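A possible sketch, continuing the running example and assuming the numeric columns (Age and Salary) are columns 1 and 2 of the feature matrix x:

import numpy as np
from sklearn.impute import SimpleImputer

# replace each missing value (NaN) with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])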
Encoding Categorical data:
• Categorical data is data which has some categories; in our dataset, there are two categorical variables, Country and Purchased.
• A machine learning model works entirely on mathematics and numbers, so it is necessary to encode these categorical variables into numbers.
For Country variable:
• Firstly, we will encode the Country variable into numeric labels. To do this, we will use the LabelEncoder() class from the preprocessing library.
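A sketch, continuing the running example and assuming Country is column 0 of x:

from sklearn.preprocessing import LabelEncoder

label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])   # country names become integer labels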
Explanation
• Dummy variables are variables which take the values 0 or 1. A value of 1 indicates the presence of that category in a particular column, while the remaining columns contain 0. With dummy encoding, we will have a number of columns equal to the number of categories.
• In our dataset, we have 3 categories, so it will produce three columns having 0 and 1 values. For dummy encoding, we will use the OneHotEncoder class of the preprocessing library.
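A sketch using the current scikit-learn API: OneHotEncoder no longer accepts a categorical_features argument, so a ColumnTransformer is used to one-hot encode column 0 (Country) and pass the remaining columns through unchanged.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[('country', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0)   # force a dense array as output
x = ct.fit_transform(x)   # dummy columns for Country come first, then Age and Salary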
As we can see in the output, the Country variable is encoded into 0 and 1 values spread across three dummy columns.
Resulting Datasets
• For Purchased Variable:
For the second categorical variable, we will only use the labelencoder object of the LabelEncoder class. Here we are not using the OneHotEncoder class because the Purchased variable has only two categories, yes and no, which are automatically encoded into 0 and 1.
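A sketch, continuing the running example:

from sklearn.preprocessing import LabelEncoder

label_encoder_y = LabelEncoder()
y = label_encoder_y.fit_transform(y)   # 'no'/'yes' become 0/1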
Splitting the Dataset into the Training set and Test set
• In machine learning data preprocessing, we divide our dataset into a training set
and test set. This is one of the crucial steps of data preprocessing as by doing this,
we can enhance the performance of our machine learning model.
• If we train our machine learning model on one dataset and then test it on a completely different dataset, the model will have difficulty understanding the correlations in the data.
• If we train our model very well and its training accuracy is very high, but we then provide a new dataset to it, its performance will decrease. So we always try to build a machine learning model which performs well on the training set and also on the test dataset. Here, we can define these datasets as:
• Training Set: A subset of dataset to train the machine learning model, and we
already know the output.
• Test set: A subset of dataset to test the machine learning model, and by using the
test set, model predicts the output.
For splitting the dataset, we can use the lines of code sketched below:
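A sketch of the split, continuing the running example; the 80/20 ratio and the fixed random_state are the usual choices and are assumptions here:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)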
For feature scaling, we will import the StandardScaler class from the sklearn.preprocessing library (the code sketch is given after the explanation below).
• Feature Scaling: Standardization
• Let us understand Standardization technique below.
• Standardization is a popular feature scaling method, which gives data the property
of a standard normal distribution (also known as Gaussian distribution).
• All features are standardized on the normal distribution (a mathematical model).
• The mean of each feature is centered at zero, and the feature column has a
standard deviation of one.
• Now, we will create an object of the StandardScaler class for the independent variables (features), and then fit and transform the training dataset.
For the test dataset, we will directly apply the transform() function instead of fit_transform(), because the scaler has already been fitted on the training set.
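A sketch of these steps, continuing the running example:

from sklearn.preprocessing import StandardScaler

st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)   # fit on the training features and scale them
x_test = st_x.transform(x_test)         # reuse the fitted scaler on the test features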
Output
By executing the above lines of code, we will get the scaled values of x_train and x_test (the standardized training and test feature matrices).