L2 - SLM Notes (Pre-Processing)
TEXTBOOKS/LEARNING RESOURCES:
a) Masashi Sugiyama, Introduction to Statistical Machine Learning (1st ed.), Morgan Kaufmann, 2017. ISBN 978-0128021217.
b) T. M. Mitchell, Machine Learning (1st ed.), McGraw Hill, 2017. ISBN 978-1259096952.
ML Life Cycle
The machine learning life cycle involves seven major steps:
1. Gathering Data
2. Data Preparation
3. Data Wrangling
4. Data Analysis
5. Train Model
6. Test Model
7. Deployment
The most important thing is to understand the problem and to know its purpose. To solve a problem, we create a machine learning system called a "model", and this model is created by providing "training data". Therefore, the life cycle starts by collecting data.
1. Gathering Data
Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify and obtain all the data related to the problem. Here we need to identify the different data sources, as data can be collected from sources such as files, databases, the internet, or mobile devices.
It is one of the most important steps of the life cycle: the quantity and quality of the collected data determine the efficiency of the output. The more data we have, the more accurate the prediction will be.
2. Data Preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is the step where we put our data into a suitable place and prepare it for use in machine learning training. It involves:
• Data exploration: used to understand the nature of the data we have to work with. We need to understand the characteristics, format, and quality of the data; a better understanding of the data leads to an effective outcome. Here we find correlations, general trends, and outliers.
• Data pre-processing: the next step is pre-processing the data for analysis, including operations such as data conversion and data scaling.
3. Data Wrangling
Raw data commonly contains issues such as:
• Missing values
• Duplicate data
• Invalid data
• Noise
So we use various filtering techniques to clean the data. It is mandatory to detect and remove these issues, because they can negatively affect the quality of the outcome.
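The cleaning steps listed above can be sketched with pandas. The tiny DataFrame below is made up for illustration; it shows one duplicate row, one missing value, and one invalid (negative) age being handled in turn.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset exhibiting the common wrangling issues above.
df = pd.DataFrame({
    "age": [25, 25, np.nan, 40, -5],          # missing and invalid values
    "salary": [50000, 50000, 60000, 80000, 70000],
})

df = df.drop_duplicates()                      # remove duplicate rows
df = df[df["age"].isna() | (df["age"] > 0)]    # drop invalid data (negative age)
df["age"] = df["age"].fillna(df["age"].mean()) # fill missing values with the mean
print(df)
```

After these three operations the frame holds three clean rows; noise handling (e.g. smoothing or outlier filtering) would follow the same pattern.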
4. Data Analysis
The cleaned and prepared data is now passed on to the analysis step, which involves:
• Selection of analytical techniques
• Building models
• Reviewing the result
The aim of this step is to build a machine learning model that analyzes the data using various analytical techniques, and to review the outcome. It starts with determining the type of problem, where we select a machine learning technique such as classification, regression, cluster analysis, or association; we then build the model using the prepared data and evaluate it.
7. Deployment
If the prepared model produces an accurate result as per our requirement, with acceptable speed, then we deploy the model in the real system.
Numerical data is any data where the data points are exact numbers; statisticians also call it quantitative data. Numerical data can be continuous or discrete: continuous data can assume any value within a range, whereas discrete data takes only distinct values.
Categorical data represents characteristics, such as a hockey player's position, team, or hometown. Categorical data can take numerical values: for example, we might use 1 for the color red and 2 for blue. But these numbers have no mathematical meaning; we cannot add them together or take their average. In the context of supervised classification, categorical data would be the class label. It could also be something like whether a person is a man or a woman, or whether a property is residential or commercial.
Time series data is a sequence of numbers collected at regular intervals over some period of time. Time series data has a temporal value attached to it, such as a date or a timestamp, so you can look for trends over time. For example, we might measure the average number of home sales over many years. The difference between time series data and plain numerical data is that rather than a collection of numerical values with no time ordering, time series data has an implied ordering: there is a first data point collected and a last data point collected.
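A minimal sketch of this in pandas (the monthly home-sales numbers are invented for illustration): attaching a datetime index to the values gives them the implied temporal ordering just described.

```python
import pandas as pd

# Hypothetical monthly home-sales counts; the datetime index supplies
# the implied temporal ordering of a time series.
dates = pd.to_datetime(["2024-01-31", "2024-02-29", "2024-03-31", "2024-04-30"])
sales = pd.Series([120, 135, 128, 150], index=dates)

print(sales.index[0], sales.iloc[0])    # first data point collected
print(sales.index[-1], sales.iloc[-1])  # last data point collected
```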
Text data is basically just words. Often the first thing done with text is to turn it into numbers, using a representation such as the bag-of-words formulation.
Feature / Variable
A feature (or variable) is an individual measurable property of the data, e.g. red color, round shape, or plain texture patterns.
Need of Feature Variables
Features are the basic building blocks of datasets. The quality of the features in your dataset has a major impact on the quality of the insights you will gain when you use that dataset for machine learning.
Additionally, different business problems within the same industry do not necessarily require the same features, which is why it is important to have a strong understanding of the business goals of your data science project.
You can improve the quality of your dataset's features with processes like feature selection and feature engineering. If these techniques are done well, the resulting optimal dataset will contain all of the essential features that might have bearing on your specific business problem, leading to the best possible model outcomes and the most beneficial insights.
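Feature selection can be sketched with scikit-learn's SelectKBest; the iris dataset and the choice of k=2 here are illustrative, not part of the notes.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score against the labels.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X.shape, "->", X_new.shape)  # (150, 4) -> (150, 2)
```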
Dr. Tej Bahadur Chandra, October 30, 2024
Data Pre-processing
Data pre-processing is the process of preparing the raw data and making it suitable for a machine learning model.
Overview of Operators in Python
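As a brief illustration of the operator families available in Python (the values 7 and 3 are arbitrary):

```python
# A few of Python's operator families at a glance.
a, b = 7, 3
print(a + b, a - b, a * b)            # arithmetic
print(a / b, a // b, a % b, a ** b)   # true division, floor division, modulo, power
print(a > b, a == b)                  # comparison
print(a > 0 and b > 0)                # logical
```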
Overview of Arrays in Python
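In practice, arrays in Python are usually NumPy arrays; a small sketch (the numbers are arbitrary) of the shape, dtype, and element-wise behavior that matter for pre-processing:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])  # a 2-D array (matrix)
print(arr.shape, arr.dtype)             # dimensions and element type
print(arr * 2)                          # element-wise arithmetic
print(arr.sum(axis=0))                  # column sums
```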
1. Prerequisite
Download the latest version of Python (3.10.7): https://www.python.org/downloads/
2. Importing Libraries
• NumPy: used for any type of mathematical operation in the code. It is the fundamental package for scientific computation in Python, and it supports large multidimensional arrays and matrices.
• Matplotlib: a Python 2D plotting library; with it we import the sub-library pyplot. It is used to plot any type of chart in Python.
• Pandas: an open-source data manipulation and analysis library used for importing and managing datasets.
3. Loading the Dataset
import pandas as pd
data_set = pd.read_csv('DatasetName.csv')
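A self-contained sketch of this step (the file name DatasetName.csv and its columns are invented for illustration; in practice you would point read_csv at your own file). After loading, the independent variables X and the dependent variable y are typically separated with iloc:

```python
import pandas as pd

# Write a tiny CSV so this example runs on its own; the file name and
# columns are hypothetical stand-ins for your real dataset.
with open("DatasetName.csv", "w") as f:
    f.write("Country,Age,Salary,Purchased\n")
    f.write("France,44,72000,No\n")
    f.write("Spain,27,48000,Yes\n")

data_set = pd.read_csv("DatasetName.csv")
X = data_set.iloc[:, :-1].values  # independent variables (all but last column)
y = data_set.iloc[:, -1].values   # dependent variable (last column)
print(X)
print(y)
```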
To handle missing values, we use the Scikit-learn library, which contains various utilities for building machine learning models. Here we use the SimpleImputer class of the sklearn.impute module (older versions of scikit-learn provided an Imputer class in sklearn.preprocessing, which has since been removed).
Handling Missing Values
• By deleting the particular row: commonly used to deal with null values; we simply delete the specific row or column that contains null values. This approach is not very efficient, as removing data may lead to a loss of information and therefore less accurate output.
• By calculating the mean: we calculate the mean of the column (or row) that contains the missing value and put it in place of the missing value. This strategy is useful for features with numeric data, such as age, salary, or year. Here, we will use this approach.
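The mean strategy can be sketched with scikit-learn's SimpleImputer; the small feature matrix below (age and salary columns) is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with missing entries marked as np.nan.
X = np.array([[44.0, 72000.0],
              [27.0, np.nan],      # missing salary
              [np.nan, 54000.0]])  # missing age

# Replace each nan with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)  # nan in column 0 -> 35.5, nan in column 1 -> 63000.0
```

SimpleImputer also supports "median" and "most_frequent" strategies, which are more robust when a column has outliers or is categorical.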
Encoding Categorical Data
Categorical data is data which has some categories, such as Cow, Dog, Cat, Success, Fail, etc. Since machine learning models work on numbers, such categories must be encoded into numeric form.
Splitting the Dataset into Train & Test Sets
In machine learning data preprocessing, we divide our dataset into a training set and a test set. This is one of the crucial steps of data preprocessing, as doing it properly lets us measure and improve the performance of our machine learning model.
• Training set: a subset of the dataset used to train the machine learning model; for it, we already know the output.
• Test set: a subset of the dataset used to test the machine learning model; using the test set, the model predicts the output.
So we always try to build a machine learning model that performs well on both the training set and the test set. In scikit-learn, the split is performed with the train_test_split function.
Splitting parameters:
• test_size: a float between 0.0 and 1.0 representing the proportion of the dataset placed in the test split. Its default value is None.
• train_size: a float between 0.0 and 1.0 representing the proportion of the dataset placed in the train split. Its default value is None.
• random_state: controls the shuffling applied to the data before the split; it acts as a seed, making the split reproducible.
• shuffle: whether to shuffle the data before splitting. Its default value is True.
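The parameters above can be seen together in a short sketch (the 10-sample dataset is invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)                 # one label per sample

# 80/20 split; random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Because train_size defaults to None, it is inferred as the complement of test_size.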
Feature Scaling / Normalization
In feature scaling, we put our variables into the same range and on the same scale so that no single variable dominates the others. For example, raw salary values (tens of thousands) would otherwise dominate raw age values (tens). A common technique is min-max normalization, which rescales each feature to the [0, 1] range.
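A sketch of two common scalers on a made-up age/salary matrix: min-max normalization maps each column to [0, 1], while standardization (an alternative not detailed above) maps each column to mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical age and salary columns on very different scales.
X = np.array([[25.0, 50000.0],
              [35.0, 60000.0],
              [45.0, 90000.0]])

X_minmax = MinMaxScaler().fit_transform(X)  # each column rescaled to [0, 1]
print(X_minmax)

X_std = StandardScaler().fit_transform(X)   # each column to mean 0, unit variance
print(X_std.mean(axis=0))
```

In practice the scaler is fit on the training set only and then applied to the test set, so no information from the test data leaks into training.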
Further Readings:
https://www.turing.com/kb/how-and-where-to-apply-feature-scaling-in-python