SE/SS ZG568
Swarna Chaudhary
Assistant Professor, BITS CSIS
[email protected]
BITS Pilani, Pilani Campus
Course Plan
M1 Introduction to Machine Learning
M5 Classification Models I
M6 Classification Models II
M7 Unsupervised Learning
M8 Neural Networks
M9 Deep Networks
Frame the Problem: Look at the Big Picture
• Define the objectives in business terms
• How will your solution be used?
• What are the current solutions (if any)?
• How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
• How should performance be measured?
• Is the performance measure aligned with the business objective?
• What are comparable problems? Can you reuse experience or tools?
• Is human expertise available?
• How would you solve the problem?
• List the assumptions you have made.
• What would be the minimum performance needed to reach the business objective?
ML Pipeline
[Pipeline diagram: ETL Pipeline → ML Pipeline → CI/CD → Deployed Application]
• MAE is preferred when the data contain many outliers; it corresponds to the L1 norm, aka the Manhattan distance/norm (RMSE corresponds to the L2, Euclidean, norm).
• General form: the ℓk norm of a vector v with n elements is ||v||_k = (|v1|^k + … + |vn|^k)^(1/k); k = 1 gives the MAE-style norm and k = 2 the RMSE-style norm. The higher k is, the more the measure focuses on large errors.
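A minimal sketch of this difference, using scikit-learn metrics on made-up predictions where one target value acts as an outlier:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0, 100.0])  # last value is an outlier
y_pred = np.array([2.5, 5.0, 4.0, 8.0, 10.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")  # the outlier inflates RMSE far more than MAE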
Python Packages for ML
• Scientific Computing:
• Pandas
• Numpy
• Scipy
• Data Visualization
• Matplotlib
• Seaborn
• Algorithmic Libraries
• Scikit-Learn (sklearn)
• Keras
• TensorFlow
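A quick sketch of the conventional import aliases for these packages:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression  # one example of a sklearn estimator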
Key Steps in a Machine Learning Project
Importing and Exporting Data
To read an entire CSV file:
# Importing libraries
import pandas as pd
import numpy as np

# Read the CSV into a DataFrame (path is a placeholder for the file location)
df = pd.read_csv(path)
Exporting a pandas DataFrame to CSV
• To save the modified dataset:
path = "C:/Windows/…/property.csv"
df.to_csv(path)
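A quick check, reusing the same placeholder path: write without the row index, then read the file back.
df.to_csv(path, index=False)   # index=False avoids saving the row index as an extra column
df_check = pd.read_csv(path)   # read the file back to verify the export
print(df_check.head())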
Different file formats
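As a sketch, pandas pairs a read function with a matching write method for each common format (file names here are placeholders):
df = pd.read_csv("data.csv")      # CSV; save with df.to_csv("data.csv")
df = pd.read_json("data.json")    # JSON; save with df.to_json("data.json")
df = pd.read_excel("data.xlsx")   # Excel; save with df.to_excel("data.xlsx")
# SQL: pd.read_sql(query, conn) paired with df.to_sql("table_name", conn)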
What is Data?
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• aka variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
• aka record, point, case, sample, entity, or instance
• Attribute values are numbers or symbols assigned to an attribute for a particular object
Discrete and Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• Examples: zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables
• Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight
• Practically, real values can only be measured and represented using a finite number of digits
• Continuous attributes are typically represented as floating-point variables
Types and Properties of Attributes
• Nominal
• ID numbers, eye color, zip codes
• Ordinal
• rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
• Interval
• calendar dates, temperatures in Celsius or Fahrenheit
• Ratio
• temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race)
Properties
• Distinctness: =, ≠
• Order: <, >
• Differences are meaningful: +, −
• Ratios are meaningful: *, /
(Nominal attributes have distinctness; ordinal adds order; interval adds meaningful differences; ratio adds meaningful ratios.)
Categorization of Attributes
To check data types in Python
• df.dtypes
• df.info()
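A small sketch on a made-up frame, showing how the attribute types above surface as pandas dtypes:
df = pd.DataFrame({"zip": [60601, 10001],           # discrete, stored as int64
                   "height": [1.72, 1.80],          # continuous, stored as float64
                   "eye_color": ["brown", "blue"]}) # nominal, stored as object
print(df.dtypes)
df.info()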
To check data distribution
• df.describe() returns a statistical summary (count, mean, std, min, quartiles, max) of the numeric columns
Types of Data Sets
[Figure: examples of data set types — tree/graph data, sequence data, and network data]
Data Types
• Relational/Object
• Transactional Data
• Document Data
• Web & Social Network Data
• Spatial Data
• Time Series
[Figure: attribute taxonomy — discrete vs. continuous numeric, symmetric/asymmetric binary, ordinal]
Key Steps in a Machine Learning Project
Data Preprocessing
• Data Cleaning
• Aggregation
• Sampling
• Discretization and Binarization
• Attribute Transformation
• Dimensionality Reduction
• Feature subset selection
• Feature creation
Data Quality
[Figure: dimensions of data quality — correct, complete, consistent, interpretable, trustable, usable on demand]
Detecting Missing Values
dataframe.isnull()
dataframe.isnull().sum()
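A short sketch on a made-up frame with gaps:
df = pd.DataFrame({"Age": [34, 33, None], "Profession": ["MGR", None, "DOC"]})
print(df.isnull())        # Boolean mask marking each missing cell
print(df.isnull().sum())  # count of missing values per column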
How to Handle Missing Data?
• Drop the missing values
• Drop the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
• Drop the variable
Handling Missing Values
dataframe.dropna():
# axis=0 drops rows; here, rows with a missing value in the given column
df.dropna(subset=["COLNAME"], axis=0, inplace=True)
Handling Missing Values: Imputation Methods
dataframe.replace(missing_value, new_value):
mean = df["COLNAME"].mean()
df["COLNAME"] = df["COLNAME"].replace(np.nan, mean)  # assign back; replace returns a new Series
Handling Missing Values: Housing Example (from the book)
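A minimal sketch in the spirit of the book's housing example, which imputes numeric attributes with scikit-learn's SimpleImputer (the DataFrame name housing_num is an assumption):
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")   # the median is robust to outliers
X = imputer.fit_transform(housing_num)       # learns one median per column, then fills the gaps
# imputer.statistics_ holds the learned medians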
Any problems with the Data?
Name           Age  DateOfFirstBuy  Profession  DateOfBirth
Bill Gates     34   15-Jan-2015     MGR         Feb 24, 1981
John           33   27-Jan-2015                 Mar 27, 1982
William Gates  34   15-Jan-2015     MGR         Feb 24, 1981
Kennedy        32   30-Jan-2015     DOC         Nov 25, 1982
Problems: a duplicate record (Bill Gates / William Gates share the same age, purchase date, profession, and birth date), a missing Profession value (John), and inconsistent date formats between DateOfFirstBuy and DateOfBirth.
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
• Other data problems which require data cleaning
– duplicate records
– inconsistent data
How to Handle Noisy Data?
• Binning (also used for discretization)
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
– Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it; they perform local smoothing.
Binning Methods for Data Smoothing
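A worked sketch with illustrative values (a classic textbook-style example; the data are assumed):
import numpy as np

data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted
bins = data.reshape(3, 3)                            # equal-frequency bins of size 3

# Smoothing by bin means: each value is replaced by its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3)           # -> 9, 9, 9, 22, 22, 22, 29, 29, 29

# Smoothing by bin boundaries: each value moves to the nearer bin edge
lo, hi = bins[:, [0]], bins[:, [-1]]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi) # -> [4 4 15], [21 21 24], [25 25 34]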
• Descriptive Statistics
df["sq_ft"].describe()
Q1=dataframe.quantile(0.25)
Q3=dataframe.quantile(0.75)
IQR = Q3-Q1
print(IQR)
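A common follow-up (one convention, not the only one) flags values outside the 1.5 × IQR fences as outliers; a sketch using the sq_ft column from the slide above:
q1 = df["sq_ft"].quantile(0.25)
q3 = df["sq_ft"].quantile(0.75)
iqr = q3 - q1
mask = (df["sq_ft"] < q1 - 1.5 * iqr) | (df["sq_ft"] > q3 + 1.5 * iqr)
print(df[mask])   # rows flagged as potential outliers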
Data Quality Assurance – Remove Duplicates
dataframe.drop_duplicates()
Unnecessary Data
• Type 1: Uninformative/Repetitive
• the feature is uninformative because too many rows hold the same value
• How to find? Create a list of features with a high percentage of the same value (see the sketch after the example below)
• Type 2: Irrelevant
• the feature is not related to the problem
• How to find? Domain knowledge; correlation analysis
• Type 3: Duplicate
• two or more features with the same values for each observation
Example:
df.drop(["Sno"], inplace=True, axis=1)
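For Type 1, a sketch of the "high percentage of the same value" check (the 0.95 threshold is an arbitrary choice):
threshold = 0.95
uninformative = [col for col in df.columns
                 if df[col].value_counts(normalize=True, dropna=False).iloc[0] > threshold]
print(uninformative)   # features whose most frequent value dominates the column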
Data Aggregation
Collected data consist of the sales per quarter for the years 2008 to 2010.
The data of interest are the annual sales (totals per year).
The data can be aggregated so that the resulting data set is smaller in volume, without loss of the information necessary for the analysis task.
Aggregation
• Purpose
• Data reduction
• Change of scale
Aggregation: pivot with a multi-index
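A sketch with made-up quarterly figures: pivot_table rolls the sales up to annual totals, and a list of index columns gives the multi-index view.
sales = pd.DataFrame({
    "year":    [2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 312, 440, 385, 610],
})
annual = sales.pivot_table(index="year", values="sales", aggfunc="sum")               # one row per year
by_qtr = sales.pivot_table(index=["year", "quarter"], values="sales", aggfunc="sum")  # multi-index
print(annual)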
Sampling
Primary technique used for data reduction
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is typically used in machine learning because processing the entire set of data of interest may be too expensive or time consuming.
• Key Principle
• Using a sample will work almost as well as using the entire data set, if the sample is representative
• A sample is representative if it has approximately the same properties as the original dataset
• Techniques:
• Simple Random Sampling
• Stratified Sampling
Sampling
Syntax:
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
Parameters:
n: int, number of random rows to return.
frac: float, fraction of rows to return (frac * len(df)); cannot be used together with n.
replace: bool, sample with replacement if True.
random_state: int or numpy.random.RandomState, optional; a fixed integer returns the same rows on every call.
axis: 0 or 'index' to sample rows, 1 or 'columns' to sample columns.
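A usage sketch (the "label" column used for stratification is an assumption):
rows = df.sample(n=5, random_state=42)        # five rows, reproducible across runs
tenth = df.sample(frac=0.1, random_state=42)  # a 10% simple random sample
# One common stratified-sampling pattern: sample the same fraction within each group
strata = df.groupby("label", group_keys=False).apply(lambda g: g.sample(frac=0.1, random_state=42))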
Binarization
Maps a continuous or categorical attribute into one or more binary variables
• Often convert a continuous attribute to a categorical attribute, and then convert the categorical attribute to a set of binary attributes
• Examples: eye color, and height measured as {low, medium, high}
One Hot Encoding
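A sketch of one-hot encoding with pandas (the column is made up):
df = pd.DataFrame({"eye_color": ["brown", "blue", "green", "blue"]})
print(pd.get_dummies(df, columns=["eye_color"]))
# -> binary columns eye_color_blue, eye_color_brown, eye_color_green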
Attribute Transformation
Maps the entire set of values of a given attribute to a new set of replacement values
• Each original value can be identified with one of the new values
• Simple functions: x^k, log(x), e^x, |x|
• Normalization
• Refers to various techniques to adjust for differences among attributes in terms of frequency of occurrence, mean, variance, range
• Take out unwanted, common signal, e.g., seasonality
• In statistics, standardization refers to subtracting the mean and dividing by the standard deviation
Normalization
• Transforming the data to fall within a smaller or common range such as [-1, 1] or [0.0, 1.0]
• Normalizing the data attempts to give all attributes an equal weight
• It also helps prevent attributes with larger values from outweighing attributes with smaller values
• Various methods
• Min-max normalization
• Decimal scaling
• Z-score
• Min-max normalization: v' = (v − min_A) / (max_A − min_A) × (new_max − new_min) + new_min
• Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) ≈ 0.709.
• Z-score normalization (μ: mean, σ: standard deviation): v' = (v − μ) / σ
Normalization
• Decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
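A sketch of the three methods (the "income" column name is an assumption):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

minmax = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(df[["income"]])  # min-max to [0, 1]
zscore = StandardScaler().fit_transform(df[["income"]])                        # z-score
# Decimal scaling by hand (assumes the max is not an exact power of 10):
j = int(np.ceil(np.log10(df["income"].abs().max())))
decimal_scaled = df["income"] / 10 ** j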
Curse of Dimensionality
• When dimensionality increases, data become increasingly sparse in the space they occupy
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
Dimensionality Reduction
• Avoid the curse of dimensionality
• Reduce the amount of time and memory required by ML algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
• Popular Techniques
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
PCA
• Finds a projection that captures the largest amount of variation in the data
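A minimal PCA sketch with scikit-learn, assuming X is a (samples × features) numeric array:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # project onto the top-2 directions of maximum variance
print(pca.explained_variance_ratio_)    # share of total variance each component captures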
Feature Subset Selection
Another way to reduce the dimensionality of data
• Redundant features
• Duplicate much or all of the information contained in one or more other attributes
• Example: the purchase price of a product and the amount of sales tax paid
• Irrelevant features
• Contain no information that is useful for the ML task at hand
• Example: students' IDs are often irrelevant to the task of predicting students' GPA
• Many techniques have been developed, especially for classification
Feature Creation
Create new attributes that can capture the important information in a data set (see the sketch below)
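A sketch in the spirit of the housing example (the column names are assumptions): a ratio of raw attributes often carries more signal than either attribute alone.
df["rooms_per_household"] = df["total_rooms"] / df["households"]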
Next: Model Evaluation and Model Validation