L2 - SLM Notes (Pre-Processing)

Statistical Machine Learning

TEXTBOOKS/LEARNING RESOURCES:
a) Masashi Sugiyama, Introduction to Statistical Machine Learning (1st ed.), Morgan Kaufmann, 2017. ISBN 978-0128021217.
b) T. M. Mitchell, Machine Learning (1st ed.), McGraw Hill, 2017. ISBN 978-1259096952.

REFERENCE BOOKS/LEARNING RESOURCES:
a) Richard Golden, Statistical Machine Learning: A Unified Framework (1st ed.), unknown, 2020.
Dr. Tej Bahadur Chandra October 30, 2024 1
Lecture - 2

• Machine Learning Life Cycle
• Artificial Intelligence (AI) vs. Machine Learning
• Data and Types of Data


Machine Learning Life Cycle

The ML life cycle involves seven major steps:

1. Gathering Data
2. Data preparation
3. Data Wrangling
4. Analyse Data
5. Train the model
6. Test the model
7. Deployment


Machine Learning Life Cycle

The most important thing is to understand the problem and its purpose.

To solve a problem, we create a machine learning system called a "model", and this model is created by providing "training data".

Therefore, the life cycle starts by collecting data.


Machine Learning Life Cycle

Step 1: Gathering Data

• Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify and obtain all the data related to the problem.
• In this step, we identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices.
• It is one of the most important steps of the life cycle. The quantity and quality of the collected data determine how good the output can be. In general, the more good-quality data we have, the more accurate the predictions tend to be.

This step includes the following tasks:
• Identify various data sources
• Collect data
• Integrate the data obtained from different sources


Machine Learning Life Cycle

Step 2: Data preparation

• After collecting the data, we need to prepare it for the further steps. Data preparation is the step where we put our data into a suitable place and prepare it for use in machine learning training.
• This step can be further divided into two processes:
  • Data exploration: It is used to understand the nature of the data we have to work with. We need to understand the characteristics, format, and quality of the data. A better understanding of the data leads to a more effective outcome. Here we look for correlations, general trends, and outliers.
  • Data pre-processing: The next step is to pre-process the data for analysis (for example, data conversion and data scaling).
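The data exploration step described above can be sketched with pandas. This is only an illustration: the toy DataFrame, its column names, and the 1.5 * IQR outlier rule are assumptions chosen for the example, not part of the lecture.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "salary": [30000, 42000, 61000, 65000, 50000, 250000],
})

# Characteristics and format: column types and summary statistics.
print(df.dtypes)
print(df.describe())

# Correlations between the numeric columns.
correlations = df.corr()
print(correlations)

# Outliers: flag salaries further than 1.5 * IQR from the quartiles
# (a common rule of thumb, assumed here for illustration).
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]
print(outliers)
```

With this toy data, only the 250000 salary falls outside the IQR fences, so it is the single row reported as an outlier.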


Machine Learning Life Cycle

Step 3: Data Wrangling

In real-world applications, collected data may have various issues, including:
• Missing values
• Duplicate data
• Invalid data
• Noise

So, we use various filtering techniques to clean the data. These issues must be detected and removed because they can negatively affect the quality of the outcome.
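A minimal cleaning sketch for the first three issues listed above, using pandas. The DataFrame, the names in it, and the "plausible age" range of 0 to 120 are hypothetical choices for this example; handling noise usually needs problem-specific filtering and is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data containing duplicate, missing, and invalid entries.
raw = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena", "John"],
    "age": [29.0, 41.0, 41.0, np.nan, -5.0],  # NaN = missing, -5 = invalid
})

cleaned = raw.drop_duplicates()                    # remove duplicate rows
cleaned = cleaned.dropna(subset=["age"])           # remove rows with a missing age
cleaned = cleaned[cleaned["age"].between(0, 120)]  # keep plausible ages (assumed range)
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Deleting rows is the simplest option; when data is scarce, filling in missing values (see the section on handling missing data) may be preferable.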


Machine Learning Life Cycle

Step 4: Analyse Data

Now the cleaned and prepared data is passed on to the analysis step. This step involves:
• Selecting analytical techniques
• Building models
• Reviewing the result

The aim of this step is to build a machine learning model that analyzes the data using various analytical techniques, and to review the outcome.

It starts with determining the type of problem, where we select a machine learning technique such as classification, regression, cluster analysis, or association. We then build the model using the prepared data and evaluate it.


Machine Learning Life Cycle

Steps 5 and 6: Train and Test the model

We use datasets to train the model using various machine learning algorithms. Training is required so that the model can learn the various patterns, rules, and features in the data.

Once the model has been trained, we test it. In this step, we check the accuracy of the model by providing a test dataset to it and calculating the percentage accuracy, as required by the project or problem.
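The train-then-test procedure above might look like the following sketch with Scikit-learn. The Iris dataset and the decision tree classifier are stand-ins picked for the example; the lecture does not prescribe a dataset or an algorithm.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Example data: 150 flower samples with 4 features and 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train: the model learns patterns from the training set only.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Test: predict on unseen data and compute the percentage accuracy.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy * 100:.1f}%")
```

Accuracy is one possible metric; whether it is the right one depends on the requirements of the project or problem.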


Machine Learning Life Cycle

Step 7: Deployment

The last step of the machine learning life cycle is deployment, where we deploy the model in a real-world system.

If the prepared model produces accurate results that meet our requirements at an acceptable speed, we deploy it in the real system.

Before deploying the project, however, we check whether the model improves its performance using the available data. The deployment phase is similar to making the final report for a project.


Artificial Intelligence (AI) vs. Machine Learning

• ML is a subset of AI.
• AI is the broader concept of creating intelligent machines that can simulate human thinking capability and behaviour.
• Machine learning is an application or subset of AI that allows machines to learn from data without being explicitly programmed.


Data and Types of Data

Types of data:
• Numerical, e.g. 1, 2, 3 or 2.45, 3.687
• Categorical, e.g. Cold, Hot, Humid, Raining, Low
• Time series, e.g. 9AM-$300; 10AM-$200; 11AM-$400
• Text, e.g. Dog, Cow, Horse


Data and Types of Data

• Numerical data is any data where the data points are exact numbers. Statisticians may also call it quantitative data. Numerical data can be continuous or discrete: continuous data can assume any value within a range, whereas discrete data takes distinct values.

• Categorical data represents characteristics, such as a hockey player's position, team, or hometown. Categorical data can take numerical values; for example, we might use 1 for the color red and 2 for blue. But these numbers have no mathematical meaning: we cannot add them together or take their average. In supervised classification, the class label is categorical data, for example whether a person is a man or a woman, or whether a property is residential or commercial.

• Time series data is a sequence of numbers collected at regular intervals over some period of time. Each value has a temporal value attached to it, such as a date or a timestamp, so that you can look for trends over time. For example, we might measure the average number of home sales over many years. Time series data differs from plain numerical data in that its values have an implied ordering: there is a first data point collected and a last data point collected.

• Text data is basically just words. Often the first thing you do with text is turn it into numbers using a function such as the bag-of-words formulation.
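To make the bag-of-words idea concrete, here is a minimal hand-written sketch. The two example sentences are invented, and splitting on whitespace is a simplifying assumption; libraries such as Scikit-learn provide more complete implementations.

```python
from collections import Counter

documents = ["the dog chased the cow", "the horse saw the dog"]

# Vocabulary: every distinct word across all documents, in sorted order.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# One count vector per document: how often each vocabulary word occurs in it.
vectors = []
for doc in documents:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['chased', 'cow', 'dog', 'horse', 'saw', 'the']
print(vectors)     # [[1, 1, 1, 0, 0, 2], [0, 0, 1, 1, 1, 2]]
```

Each document becomes a numeric vector of the same length, which is the form a machine learning model can work with. Word order is discarded, which is the main limitation of this representation.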


Feature Variable

A feature is a measurable property of the object you are trying to analyze. In datasets, features appear as columns.
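As a small illustration of features appearing as columns, consider the following pandas sketch. The fruit dataset and its column names are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical fruit dataset: each row is one object, each column one feature.
fruits = pd.DataFrame({
    "color": ["red", "red", "yellow"],
    "shape": ["round", "round", "long"],
    "weight_g": [150, 8, 120],
    "label": ["apple", "cherry", "banana"],  # target to predict, not a feature
})

# Every column except the target is a feature variable.
feature_columns = [col for col in fruits.columns if col != "label"]
print(feature_columns)
print(fruits[feature_columns])
```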


Feature Variable

Examples of features for different objects:
• Red color, round shape, some texture patterns
• Red color, round shape, plain texture patterns
• Height, color, patterns, ears, face


Feature Variable

Need of Feature Variables

Features are the basic building blocks of datasets. The quality of the features in your dataset has a major impact on the quality of the insights you will gain when you use that dataset for machine learning.

Additionally, different business problems within the same industry do not necessarily require the same features, which is why it is important to have a strong understanding of the business goals of your data science project.

You can improve the quality of your dataset's features with processes like feature selection and feature engineering. If these techniques are done well, the resulting optimal dataset will contain all of the essential features that might have bearing on your specific business problem, leading to the best possible model outcomes and the most beneficial insights.

Data Pre-processing

Data pre-processing is the process of preparing raw data and making it suitable for a machine learning model.


Data Pre-processing

Overview of Operators in Python


Data Pre-processing

Overview of Arrays in Python


Data Pre-processing

1. Prerequisite
• Download the latest version of Python (3.10.7): https://www.python.org/downloads/
• Download and install Spyder (IDE): https://www.spyder-ide.org/

2. Importing Libraries
• NumPy: It is used for including any type of mathematical operation in the code. It is the fundamental package for scientific computing in Python and supports large, multidimensional arrays and matrices.
• Matplotlib: It is a Python 2D plotting library. From it we import the sub-library pyplot, which is used to plot any type of chart in Python.
• Pandas: It is an open-source data manipulation and analysis library used for importing and managing datasets.


Data Pre-processing

3. Loading the Dataset
• >> data_set = pd.read_csv('DatasetName.csv')
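A runnable sketch of the loading step is shown here. Because 'DatasetName.csv' is a placeholder, the example reads a small CSV from an in-memory string instead; the column names and values are invented for illustration and are not the course dataset.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file so the example runs without a file on disk.
csv_text = """Country,Age,Salary,Purchased
India,34,52000,Yes
France,,48000,No
Germany,45,,Yes
"""

data_set = pd.read_csv(io.StringIO(csv_text))

print(data_set.head())        # first rows of the dataset
print(data_set.shape)         # (number of rows, number of columns)
print(data_set.isna().sum())  # missing values per column
```

With a real file, the call would take the file path in place of the `io.StringIO` object. Empty CSV fields are read as NaN, which is what the missing-value handling step then works on.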


Data Pre-processing

Handling Missing Data

To handle missing values, we use the Scikit-learn library, which contains various modules for building machine learning models. Here we use the SimpleImputer class of the sklearn.impute module (older Scikit-learn versions provided an Imputer class in sklearn.preprocessing, which has since been removed).

There are mainly two ways to handle missing data:

• By deleting the particular row: This is the most common way to deal with null values. We simply delete the specific row or column that contains null values. This approach is not very efficient, however, because removing data may lead to a loss of information and therefore to less accurate output.

• By calculating the mean: We calculate the mean of the column or row that contains a missing value and put it in place of the missing value. This strategy is useful for features that hold numeric data, such as age, salary, or year. Here, we will use this approach.
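Both strategies above can be sketched as follows. The small age/salary array is made up for the example; `SimpleImputer` with `strategy="mean"` is the Scikit-learn class that performs mean imputation.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical [age, salary] rows; np.nan marks the missing entries.
X = np.array([
    [25.0, 30000.0],
    [np.nan, 50000.0],
    [35.0, np.nan],
    [45.0, 70000.0],
])

# Option 1: delete every row that contains a missing value.
X_dropped = X[~np.isnan(X).any(axis=1)]

# Option 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_dropped)  # only the two complete rows remain
print(X_imputed)  # missing age -> 35.0, missing salary -> 50000.0
```

Deleting rows halves this toy dataset, while mean imputation keeps all four rows, which illustrates why the second approach is often preferred for numeric features.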


Data Pre-processing

Encoding Categorical Data

• Categorical data is data that falls into categories, such as Cow, Dog, Cat or Success, Fail.
• Machine learning models work entirely on mathematics and numbers, so a categorical variable in the dataset may cause trouble while building the model. It is therefore necessary to encode categorical variables into numbers.
• To encode categorical data, we use the LabelEncoder() class from the sklearn.preprocessing library.
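A short sketch of label encoding with the class named above. The list of animal labels is an invented example.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Cow", "Dog", "Cat", "Dog", "Cow"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Classes are sorted alphabetically and numbered from 0.
print(list(encoder.classes_))                    # ['Cat', 'Cow', 'Dog']
print(list(encoded))                             # [1, 2, 0, 2, 1]
print(list(encoder.inverse_transform(encoded)))  # back to the original labels
```

Note that the resulting integers imply an order (Cat < Cow < Dog) that the categories do not really have. Scikit-learn documents `LabelEncoder` as intended for target labels; for input features, one-hot encoding is a common alternative.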


Data Pre-processing

Splitting the Dataset into Training and Test Sets

• In machine learning data preprocessing, we divide our dataset into a training set and a test set. This is one of the crucial steps of data preprocessing because it lets us measure how the model performs on data it has not seen during training, which in turn helps us improve it.
• We always try to build a machine learning model that performs well on the training set and also on the test set. We can define these datasets as:
  • Training set: a subset of the dataset used to train the machine learning model, for which we already know the output.
  • Test set: a subset of the dataset used to test the machine learning model; the model predicts the output for the test set.


Data Pre-processing

Splitting the Dataset into Training and Test Sets

train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True)

Parameters:
• *arrays: inputs such as lists, arrays, data frames, or matrices.
• test_size: a float between 0.0 and 1.0 that represents the proportion of the dataset to put in the test set. Its default value is None; if train_size is also None, 0.25 is used.
• train_size: a float between 0.0 and 1.0 that represents the proportion of the dataset to put in the training set. Its default value is None, in which case it is the complement of test_size.
• random_state: controls the shuffling applied to the data before the split. It acts as a seed, so the same value reproduces the same split.
• shuffle: whether to shuffle the data before splitting. Its default value is True.
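The parameters above can be tried out on a tiny synthetic dataset, as in this sketch. The arrays are generated with NumPy purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)                 # 10 targets

# 20% of the samples go to the test set, the remaining 80% to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, shuffle=True
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)

# Using the same random_state reproduces exactly the same split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)
print(np.array_equal(X_test, X_test_again))  # True
```

Fixing `random_state` is useful while developing, because results stay comparable from run to run.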


Data Pre-processing

Feature Scaling or Normalization

• Feature scaling is the final step of data preprocessing in machine learning. It is a technique to standardize the independent variables of the dataset within a specific range.
• In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates another.
• For example, in a dataset with age and salary columns, the much larger salary values will dominate the age values, which can produce an incorrect result.


Data Pre-processing

Feature Scaling or Normalization

There are two ways to perform feature scaling in machine learning:
• Standardization
• Normalization (min-max)
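The two scaling methods can be compared side by side in the sketch below. It uses the standard definitions, standardization as x' = (x - mean) / standard deviation and min-max normalization as x' = (x - min) / (max - min); the age/salary numbers are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical [age, salary] rows.
X_train = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0]])
X_test = np.array([[25.0, 30000.0]])

# Standardization: x' = (x - mean) / standard deviation
standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)  # learn mean and std from training data
X_test_std = standard.transform(X_test)        # reuse them on the test data

# Normalization (min-max): x' = (x - min) / (max - min)
min_max = MinMaxScaler(feature_range=(0, 1))
X_train_mm = min_max.fit_transform(X_train)    # learn min and max from training data
X_test_mm = min_max.transform(X_test)          # reuse them on the test data

print(X_train_std)
print(X_train_mm)  # [[0. 0.] [0.5 0.5] [1. 1.]]
print(X_test_mm)   # [[0.25 0.25]]
```

In both cases the scaler is fitted on the training set only and then applied unchanged to the test set, so that no information from the test data leaks into pre-processing. After scaling, age and salary lie on comparable scales.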


Data pre-processing Python Code

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset (raw string so the backslash in the Windows path is kept literally)
dataset = pd.read_csv(r'D:\Data.csv')
ds = dataset.iloc[:, :].values

# Taking care of missing data: replace NaN in columns 1 and 2 with the column mean
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # try help(SimpleImputer)
imputer.fit(ds[:, 1:3])
ds[:, 1:3] = imputer.transform(ds[:, 1:3])
print(ds)

# Encoding categorical data; Encoding the Dependent Variable
from sklearn.preprocessing import LabelEncoder
lbl_encdr = LabelEncoder()
ds[:, 3] = lbl_encdr.fit_transform(ds[:, 3])
print(ds)

# Splitting the dataset into the Training set and Test set
# X = dataset.iloc[:, :-1].values
# y = dataset.iloc[:, -1].values
X = ds[:, :-1]  # all columns except the last one are the features
Y = ds[:, -1]   # the last column is the dependent variable
print(X)
print(Y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
print(X_train)
print(X_test)
print(y_train)
print(y_test)

# Feature Scaling: fit the scaler on the training set only, then reuse it on the test set
# (columns 1:3 are assumed to be the numeric features that were imputed above)
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
X_train[:, 1:3] = min_max_scaler.fit_transform(X_train[:, 1:3])
X_test[:, 1:3] = min_max_scaler.transform(X_test[:, 1:3])

# Alternative to min-max scaling: standardisation (use one or the other, not both)
# standardisation = preprocessing.StandardScaler()
# X_train[:, 1:3] = standardisation.fit_transform(X_train[:, 1:3])
# X_test[:, 1:3] = standardisation.transform(X_test[:, 1:3])

print(X_train)
print(X_test)

Further Readings:

• Machine Learning A-Z, https://www.superdatascience.com/pages/machine-learning
• Operators in Python, https://www.youtube.com/watch?v=Pm9FOpOwhlA
• Arrays in Python, https://www.youtube.com/watch?v=phRshQSU-xA&t=201s
• Google Colab, https://colab.research.google.com
• How and Where to Apply Feature Scaling in Python?, https://www.turing.com/kb/how-and-where-to-apply-feature-scaling-in-python
