
CS 2 3 4 Aml

This document outlines the course plan and modules for an Applied Machine Learning course. The 10 modules cover topics like the machine learning pipeline, linear models, classification models, neural networks, and more. Module 2-3 focus on the end-to-end machine learning pipeline, including framing problems, getting data, preprocessing, visualization, feature engineering, model building and evaluation. Key steps in a machine learning project and Python packages for machine learning are also discussed.

Uploaded by

shruti katare
Copyright
© All Rights Reserved

Applied Machine Learning

SE/SS ZG568

Swarna Chaudhary
Assistant Professor – BITS CSIS
[email protected]
BITS Pilani, Pilani Campus
Course Plan
M1 Introduction to Machine Learning

M2-M3 End-to-end Machine Learning Pipeline

M4 Linear Prediction Models

M5 Classification Models I

M6 Classification Models II

M7 Unsupervised Learning

M8 Neural Networks

M9 Deep Networks

M10 FAccT Machine Learning

BITS Pilani, Pilani Campus


End-to-end Machine Learning Pipeline
Module Learning Objectives
• Get a fair idea of the components of a Machine Learning Pipeline
• Identify & implement the use-case-specific preprocessing requirements
• Understand the application of various visualization techniques


Key Steps in a Machine Learning Project
• Framing a machine learning problem
• Get the data
• Data Preprocessing
• Data Visualization and Analysis
• Feature Engineering
• Model Building and Evaluation
• Fine-tune the model
• Present the solution
• Launch, monitor, and maintain

Frame the Problem: Look at the Big Picture
• Define the objectives in business terms
• How will your solution be used?
• What are the current solutions (if any)?
• How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
• How should performance be measured?
• Is the performance measure aligned with the business objective?
• What are comparable problems? Can you reuse experience or tools?
• Is human expertise available?
• How would you solve the problem?
• List the assumptions you have made.
• What would be the minimum performance needed to reach the business objective?

ML Pipeline
[Figure: ETL Pipeline, ML Pipeline, CI/CD, Deployed Application]
Source credit: https://docs.microsoft.com/


An Example Problem Statement
• Build a model of housing prices using the census data.
• Data attributes <population, median income, median housing price, …> and so on for each district
• Districts are the smallest geographical unit (population ~600 – 3000)
• The model should predict the median housing price in any district, given all the other metrics.
• Goodness of the model is determined by how close the model output is to the actual price for unseen district data
Framing the Problem
Understand the Business Objective and Context
• What is the expected usage and benefit?
• It impacts the choice of algorithms, goodness measure, and effort in lifecycle management of the model
• What is the baseline method and its performance?
Choice of Model
• Supervised or unsupervised? Prediction or classification? Online vs. batch? Instance-based or model-based?
• Analyze the dataset
• Each instance comes with the expected output, i.e., the district's median housing price: supervised
• Goal is to predict a real-valued price based on multiple variables like population, income, etc.: regression
• Output is based on input data at rest, not on rapidly changing data, and the dataset is small enough to fit in memory: batch
• So, it's a supervised univariate batch regression problem
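Choice of Performance Metrics
A typical choice for regression: Root Mean Square Error (RMSE)

$$\mathrm{RMSE}(X, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2}$$

h is the model, X is the training dataset, m is the number of instances, $x^{(i)}$ is the i-th instance, and $y^{(i)}$ is the actual price for the i-th instance.

Mean Absolute Error (MAE)

$$\mathrm{MAE}(X, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(x^{(i)}) - y^{(i)} \right|$$

• MAE is preferred when the data contains many outliers, because it does not square the errors. It corresponds to the L1 norm, aka Manhattan distance/norm; RMSE corresponds to the L2 (Euclidean) norm.

General Form

$$\lVert v \rVert_k = \left( |v_1|^k + |v_2|^k + \dots + |v_n|^k \right)^{1/k}$$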
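
As a rough illustration of the two metrics, the sketch below computes RMSE and MAE with NumPy. The prices are made-up numbers, not taken from the housing data, and the helper names `rmse` and `mae` are chosen here for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

# Hypothetical median prices (in thousands) and model predictions
y_true = [200.0, 150.0, 320.0, 410.0]
y_pred = [210.0, 140.0, 300.0, 450.0]

print("RMSE:", rmse(y_true, y_pred))
print("MAE: ", mae(y_true, y_pred))
```

The single large error (40) pulls RMSE further above MAE than the small errors do, which is why MAE is the more robust choice when outliers are common.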
Python Packages for ML
• Scientific Computing:
• pandas
• NumPy
• SciPy
• Data Visualizations
• Matplotlib
• Seaborn
• Algorithmic Libraries
• scikit-learn (sklearn)
• Keras
• TensorFlow
Importing and Exporting Data
To read an entire CSV file
# Importing libraries
import pandas as pd
import numpy as np

# Read csv file into a pandas dataframe
df = pd.read_csv("property_data.csv")

To read the first three rows of one column
print(df[0:3]['PID'])

To read certain columns
print(df.loc[:,['PID','ST_NUM']])

To read certain rows and certain columns
print(df.loc[[1,3],['PID','ST_NUM']])

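
The selections above can be tried without a file on disk. The sketch below is only an illustration: the CSV text is a made-up stand-in for `property_data.csv`, and the `ST_NAME` column is an assumption about that file.

```python
import io
import pandas as pd

# Hypothetical contents standing in for property_data.csv
csv_text = """PID,ST_NUM,ST_NAME
100001,104,PUTNAM
100002,197,LEXINGTON
100003,,LEXINGTON
100004,201,BERKELEY
"""
df = pd.read_csv(io.StringIO(csv_text))

# First three rows of one column
first_three_pids = df["PID"].iloc[0:3]

# All rows, selected columns
two_columns = df.loc[:, ["PID", "ST_NUM"]]

# Rows labelled 1 and 3, selected columns
rows_1_and_3 = df.loc[[1, 3], ["PID", "ST_NUM"]]

# Export to CSV text and read it back; index=False avoids writing
# the row index as an extra unnamed column
buffer = io.StringIO()
df.to_csv(buffer, index=False)
df_back = pd.read_csv(io.StringIO(buffer.getvalue()))

print(rows_1_and_3)
```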
Printing the dataframe
• df prints the entire dataframe
• df.head(n) shows the first n rows of the dataframe (5 by default)
• df.tail(n) shows the last n rows of the dataframe (5 by default)

Exporting a pandas dataframe to CSV
• To save the modified dataset

path = "C:/Windows/…/property.csv"
df.to_csv(path)

Different file formats

What is Data?
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• aka variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
• An object is aka record, point, case, sample, entity, or instance
• Attribute values are numbers or symbols assigned to an attribute for a particular object
[Figure labels: Attributes, Objects]
Discrete and Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• Examples: zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight.
• Practically, real values can only be measured and represented using a finite number of digits.
• Continuous attributes are typically represented as floating-point variables.
Types and properties of Attributes
• Nominal
• ID numbers, eye color, zip codes
• Ordinal
• rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
• Interval
• calendar dates, temperatures in Celsius or Fahrenheit
• Ratio
• temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race)

Properties
• Distinctness: = ≠
• Order: < >
• Differences are meaningful: + -
• Ratios are meaningful: * /
• Nominal attributes have only distinctness; ordinal adds order; interval adds meaningful differences; ratio has all four properties

Categorization of Attributes
Data – Types of Attributes

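To check data types in Python
• df.dtypes
• df.info() also lists the dtype of every column, together with its non-null count

To change the datatype
• df["COLNAME"] = df["COLNAME"].astype("float64")

To check data distribution
• df.describe() returns a statistical summary (count, mean, std, min, quartiles, max) of the numeric columns
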
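
A small sketch of checking and changing data types, then summarising the distribution. The column names and values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical data: SQ_FT was read in as text instead of numbers
df = pd.DataFrame({
    "PID": [100001, 100002, 100003],
    "SQ_FT": ["1000", "850", "1200"],
    "OWN_OCCUPIED": ["Y", "N", "Y"],
})

# Inspect the type of every column
print(df.dtypes)

# Convert the text column to integers so it can be used numerically
df["SQ_FT"] = df["SQ_FT"].astype("int64")

# describe() summarises numeric columns only by default
summary = df.describe()
print(summary)
```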
Types of Data sets
[Figure: types of data sets: Relational/Object, Spatial Data, Multimedia, Mobile Data, Tree/Graphs, Sequence Networks, Time Series/Sequence Data, Web & Social Network Data]
Data Types
• Relational/Object
• Transactional Data
• Document Data
• Web & Social Network Data
• Spatial Data
• Time Series
[Figure labels: Discrete Numeric, Continuous Numeric, Symmetric Binary, Asymmetric Binary, Ordinal, Categorical Nominal]




Get the data
• List the data you need and how much you need
• Find where you can get the data
• Check how much space it will take
• Check legal obligations, if necessary
• Get access authorizations
• Create a workspace (with enough storage space)
• Get the data
• Convert the data to a format you can easily manipulate
• Check the size and type of data
• Sample a test set, put it aside, and never look at it

Data Preprocessing
• Data Cleaning
• Aggregation
• Sampling
• Discretization and Binarization
• Attribute Transformation
• Dimensionality Reduction
• Feature subset selection
• Feature creation
Data Quality
[Figure: dimensions of data quality: Correct, Consistent, Interpretable, Trustable, Usable on Demand, Complete]
Detecting Missing Values
dataframe.isnull()          # Boolean mask: True where a value is missing
dataframe.isnull().sum()    # number of missing values per column

How to Handle Missing Data?
• Drop the missing values
• Drop the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
• Drop the variable
• Replace the missing values
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
o a global constant: e.g., "unknown", a new class?!
o the attribute mean
o the attribute mean for all samples belonging to the same class: smarter
o the most probable value: inference-based, such as a Bayesian formula or decision tree

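
To see why the class-wise mean is the "smarter" option, the sketch below fills the same gaps with the overall attribute mean and with the mean of each class. The `label` and `income` columns and their values are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two classes with very different income levels
df = pd.DataFrame({
    "label":  ["A", "A", "A", "B", "B", "B"],
    "income": [10.0, np.nan, 14.0, 50.0, 54.0, np.nan],
})

# Option 1: fill with the attribute mean over all samples
global_mean = df["income"].mean()
filled_global = df["income"].fillna(global_mean)

# Option 2: fill with the attribute mean of the sample's own class
class_mean = df.groupby("label")["income"].transform("mean")
filled_by_class = df["income"].fillna(class_mean)

print(filled_global.tolist())
print(filled_by_class.tolist())
```

The global mean lands between the two classes and fits neither, while the class-wise fill stays close to the other values in the same class.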

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Handling Missing Values: Imputation Methods

Handling Missing Values
dataframe.dropna():
# axis=0: drop the rows that have a missing value in the listed columns
df.dropna(subset=["COLNAME"], axis=0, inplace=True)

# axis=1: drop the columns that contain a missing value
# (with axis=1, subset would have to list row labels, not column names)
df.dropna(axis=1, inplace=True)

Handling Missing Values: Imputation Methods
dataframe.replace(missing_value, new_value):
mean = df["COLNAME"].mean()
# replace() returns a new Series, so assign the result back
df["COLNAME"] = df["COLNAME"].replace(np.nan, mean)

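
The detection, dropping, and replacement steps can be combined as in the following sketch. The dataframe is a small made-up example; `fillna` is used here as an equivalent of replacing `np.nan` with the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical property data with gaps
df = pd.DataFrame({
    "PID": [1, 2, 3, 4],
    "SQ_FT": [1000.0, np.nan, 850.0, 700.0],
    "NUM_BATH": [1.0, 1.5, np.nan, np.nan],
})

# Detect: number of missing values per column
missing_per_column = df.isnull().sum()

# Drop rows that lack a value in one specific column
rows_with_sq_ft = df.dropna(subset=["SQ_FT"], axis=0)

# Drop every column that contains at least one missing value
complete_columns = df.dropna(axis=1)

# Replace: fill the gaps of a column with that column's mean
df["NUM_BATH"] = df["NUM_BATH"].fillna(df["NUM_BATH"].mean())

print(missing_per_column)
print(df)
```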
Handling Missing Values: Housing Example (Book)

Any problems with the Data?
Name           Age  DateOfFirstBuy  Profession  DateOfBirth
Bill Gates     34   15-Jan-2015     MGR         Feb 24, 1981
John           33   27-Jan-2015                 Mar 27, 1982
William Gates  34   15-Jan-2015     MGR         Feb 24, 1981
Kennedy        32   30-Jan-2015     DOC         Nov 25,1982

09/03/2022
Problems with the data above:
1) Missing values in the Profession column
2) DateOfFirstBuy and DateOfBirth use different formats and need standardization
3) Row 1 and Row 3 are potentially duplicate records
4) Both Age and DateOfBirth are stored, although Age is a derived attribute
5) Inconsistent name format, with missing first or last names

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
• Other data problems which require data cleaning
– duplicate records
– inconsistent data

How to Handle Noisy Data?
• Binning (also used for discretization)
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
– Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it, i.e. they perform local smoothing.

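
A minimal sketch of smoothing by bin means and by bin boundaries with NumPy. The twelve values are illustrative, and the code assumes the number of values divides evenly into the bins.

```python
import numpy as np

# Illustrative data, sorted first as the method requires
values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float))

# Equal-frequency partitioning: 3 bins of 4 values each
n_bins = 3
bins = values.reshape(n_bins, -1)

# Smoothing by bin means: every value becomes the mean of its bin
by_means = np.repeat(bins.mean(axis=1, keepdims=True), bins.shape[1], axis=1)

# Smoothing by bin boundaries: every value becomes the closer of the
# bin's minimum and maximum (ties go to the minimum)
low, high = bins[:, [0]], bins[:, [-1]]
by_boundaries = np.where(bins - low <= high - bins, low, high)

print(by_means)
print(by_boundaries)
```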


Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B – A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky

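
In pandas, the two partitioning schemes correspond to `pd.cut` (equal width) and `pd.qcut` (equal depth). The sketch below applies both to the same illustrative values so the bin counts can be compared.

```python
import pandas as pd

# Illustrative values of a single numeric attribute
values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of width (34 - 4) / 3 = 10
equal_width = pd.cut(values, bins=3)

# Equal-depth: 3 intervals holding roughly the same number of samples
equal_depth = pd.qcut(values, q=3)

width_counts = equal_width.value_counts(sort=False)
depth_counts = equal_depth.value_counts(sort=False)

print(width_counts)
print(depth_counts)
```

The equal-width bins end up with uneven counts, while the equal-depth bins each hold four values.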
Binning Methods for Data Smoothing



Handling Noise in Python
• Use a histogram (numeric features) to detect noise/outliers
df['sq_ft'].hist(bins=100)

• Use a boxplot (numeric features) to detect noise/outliers
df.boxplot(column=['sq_ft'])

• Descriptive statistics
df['sq_ft'].describe()

• Bar chart (categorical features)
df['ST_NAME'].value_counts().plot.bar()



Handling Outliers
dataframe.describe()

# Interquartile range of one numeric column
Q1 = dataframe["COLNAME"].quantile(0.25)
Q3 = dataframe["COLNAME"].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

import seaborn as sns
sns.boxplot(x=dataframe["COLNAME"])

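
One common way to turn the IQR into an outlier filter is the 1.5 × IQR rule, which is also what a boxplot's whiskers usually show. The rule and the sample values below are assumptions for illustration; the slides only compute the IQR itself.

```python
import pandas as pd

# Hypothetical numeric column with one suspicious value
col = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102], name="COLNAME")

q1 = col.quantile(0.25)
q3 = col.quantile(0.75)
iqr = q3 - q1

# Assumed rule: values beyond 1.5 * IQR from the quartiles are outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = col[(col < lower) | (col > upper)]
cleaned = col[col.between(lower, upper)]

print("IQR:", iqr, "fences:", lower, upper)
print("outliers:", outliers.tolist())
```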
Data Quality Assurance – Remove Duplicates
dataframe.drop_duplicates()

Unnecessary data
• Type 1: Uninformative/Repetitive
• A feature is uninformative when too many of its rows have the same value
• How to find?
• Create a list of features with a high percentage of the same value
• Type 2: Irrelevant
• Feature is not related to the problem
• How to find?
• Domain Knowledge
• Correlation Analysis
• Type 3: Duplicate
• Two or more features with the same values for each observation

Example
df.drop(["Sno"], inplace=True, axis=1)

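
A sketch of how Type 1 and Type 3 features might be found automatically. The dataframe, the 95% threshold, and the column names other than `Sno` are hypothetical.

```python
import pandas as pd

# Hypothetical data: a serial number, a constant column, and a copied column
df = pd.DataFrame({
    "Sno": [1, 2, 3, 4, 5],
    "country": ["IN", "IN", "IN", "IN", "IN"],
    "price": [10, 20, 20, 30, 40],
    "price_copy": [10, 20, 20, 30, 40],
})

# Type 1: share of the most frequent value in each feature
threshold = 0.95  # assumed cut-off for "too many rows with the same value"
top_share = {
    col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
    for col in df.columns
}
uninformative = [col for col, share in top_share.items() if share >= threshold]

# Type 3: features whose values repeat an earlier feature exactly
cols = list(df.columns)
duplicate_features = [
    b for i, a in enumerate(cols) for b in cols[i + 1:] if df[a].equals(df[b])
]

# Drop both groups, plus the serial number as in the example above
cleaned = df.drop(columns=uninformative + duplicate_features + ["Sno"])
print(uninformative, duplicate_features, list(cleaned.columns))
```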
Data Aggregation
Collected data consist of the sales per quarter, for the years 2008 to 2010
Data of interest is the annual sales (total per year)
Data can be aggregated such that the resulting data set is smaller in volume, without loss of information necessary for the analysis task

Aggregation        Description
count()            Total number of items
first(), last()    First and last item
mean(), median()   Mean and median
min(), max()       Minimum and maximum
std(), var()       Standard deviation and variance
mad()              Mean absolute deviation (removed in pandas 2.0)
prod()             Product of all items
sum()              Sum of all items

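
The quarterly-to-annual example can be sketched with `groupby`. The sales figures are invented for illustration.

```python
import pandas as pd

# Hypothetical quarterly sales for 2008 to 2010
sales = pd.DataFrame({
    "year": [2008] * 4 + [2009] * 4 + [2010] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "sales": [224, 408, 350, 586, 300, 420, 380, 600, 310, 450, 400, 650],
})

# Aggregate 12 quarterly rows into 3 annual totals
annual = sales.groupby("year")["sales"].sum()

# Several aggregations from the table can be requested at once
summary = sales.groupby("year")["sales"].agg(["count", "mean", "min", "max"])

print(annual)
print(summary)
```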
Aggregation
• Purpose
• Data reduction
• Change of scale
Aggregation: pivot with a multi-index

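
A possible reading of "pivot with a multi-index" is `pd.pivot_table` with two index columns, sketched below on made-up regional sales. The column names are assumptions.

```python
import pandas as pd

# Hypothetical sales per region, year, and quarter
sales = pd.DataFrame({
    "region": ["North", "North", "North", "North", "South", "South", "South", "South"],
    "year": [2008, 2008, 2009, 2009, 2008, 2008, 2009, 2009],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "sales": [10, 20, 30, 40, 5, 15, 25, 35],
})

# Rows are indexed by (region, year); one column per quarter
table = pd.pivot_table(
    sales, values="sales", index=["region", "year"], columns="quarter", aggfunc="sum"
)

print(table)
print(table.loc[("North", 2009), "Q2"])
```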
Sampling
Primary technique used for data reduction
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is typically used in machine learning because processing the entire set of data of interest may be too expensive or time consuming.
• Key Principle
• Using a sample will work almost as well as using the entire data set, if the sample is representative
• A sample is representative if it has approximately the same properties as the original dataset
• Techniques:
• Simple Random Sampling
• Stratified Sampling
Sampling
Syntax:
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

Parameters:
n: int, number of rows to return. Cannot be used with frac.
frac: float, fraction of rows to return (frac * length of the dataframe). Cannot be used with n.
replace: bool, sample with replacement if True.
random_state: int or numpy.random.RandomState, optional. If set to a particular integer, the same rows are returned on every call.
axis: 0 or 'index' for rows and 1 or 'columns' for columns.

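
The two techniques named earlier can be sketched as follows. The data is made up, and the stratified variant relies on `DataFrameGroupBy.sample`, which assumes pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 10% "rare", 90% "common"
df = pd.DataFrame({
    "value": range(100),
    "label": ["rare"] * 10 + ["common"] * 90,
})

# Simple random sampling: 20 rows, reproducible through random_state
simple = df.sample(n=20, random_state=42)

# Stratified sampling: 20% of every class, so class proportions are preserved
stratified = df.groupby("label").sample(frac=0.2, random_state=42)

print(simple["label"].value_counts())
print(stratified["label"].value_counts())
```

A simple random sample may over- or under-represent the rare class, whereas the stratified sample keeps the 10/90 split by construction.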
Binarization
Maps a continuous or categorical attribute into one or more binary variables
• Often convert a continuous attribute to a categorical attribute and then convert a categorical attribute to a set of binary attributes
• Examples: eye color and height measured as {low, medium, high}
One Hot Encoding

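
Both examples can be sketched with pandas: `pd.cut` turns the continuous height into {low, medium, high}, and `pd.get_dummies` one-hot encodes the categorical columns. The height thresholds and data values are assumptions.

```python
import pandas as pd

# Hypothetical people with a categorical and a continuous attribute
df = pd.DataFrame({
    "eye_color": ["brown", "blue", "green", "brown"],
    "height_cm": [150, 165, 180, 192],
})

# Continuous -> categorical, using assumed thresholds of 160 cm and 180 cm
df["height_level"] = pd.cut(
    df["height_cm"], bins=[0, 160, 180, 250], labels=["low", "medium", "high"]
)

# Categorical -> one binary column per category (one-hot encoding)
encoded = pd.get_dummies(df[["eye_color", "height_level"]], dtype=int)

print(encoded)
```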
Attribute Transformation
Maps the entire set of values of a given attribute to a new set of replacement values
• Each original value can be identified with one of the new values
• Simple functions: x^k, log(x), e^x, |x|
• Normalization
• Refers to various techniques to adjust to differences among attributes in terms of frequency of occurrence, mean, variance, range
• Take out unwanted, common signal, e.g., seasonality
• In statistics, standardization refers to subtracting off the means and dividing by the standard deviation
Normalization
• Transforming the data to fall within a smaller or common range such as [-1, 1] or [0.0, 1.0]
• Normalizing the data attempts to give all attributes an equal weight
• It also helps prevent attributes with larger values from outweighing attributes with smaller values.
• Various methods
• Min-max normalization
• Decimal scaling
• Z-score



Normalization
• Min-max normalization: to [new_minA, new_maxA]

$$v' = \frac{v - \min_A}{\max_A - \min_A} (\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$

• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 - 12,000) / (98,000 - 12,000) × (1.0 - 0.0) + 0.0 ≈ 0.709
• Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

• Ex. Let μ = 54,000, σ = 16,000. Then $73,000 is mapped to (73,000 - 54,000) / 16,000 ≈ 1.19
• Normalization by decimal scaling

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that max(|v'|) < 1

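
The three normalization methods can be sketched as small functions. The income figures follow the example on the slide; the decimal-scaling input and the restriction to non-negative j are assumptions of this sketch.

```python
import numpy as np

def min_max(v, v_min, v_max, new_min=0.0, new_max=1.0):
    """Rescale v from [v_min, v_max] to [new_min, new_max]."""
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Number of standard deviations that v lies from the mean."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by the smallest power of ten (j >= 0) that brings max |v'| below 1."""
    values = np.asarray(values, dtype=float)
    max_abs = np.max(np.abs(values))
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return values / 10 ** j, j

print(min_max(73_000, 12_000, 98_000))
print(z_score(73_000, 54_000, 16_000))
print(decimal_scaling([-986, 917]))  # hypothetical attribute values
```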
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful

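
One way to observe distances becoming less meaningful is to measure how much the farthest point differs from the nearest one as dimensions are added. The sketch below uses uniformly random points; the function name, the sample size, and the contrast measure are choices made for this illustration.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Relative gap between the farthest and nearest point from a random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# The contrast shrinks as the number of dimensions grows
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(n_dims), 3))
```

When the contrast approaches zero, the nearest and farthest neighbours are almost equally far away, so distance-based notions such as density lose their discriminating power.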
Dimensionality Reduction
• Avoid curse of dimensionality
• Reduce amount of time and memory required by ML algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
• Popular Techniques
• Principal Components Analysis (PCA)
• Singular Value Decomposition
• Find a projection that captures the largest amount of variation in data
[Figure: 2-D data on axes x1 and x2 with the projection direction e]
PCA

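
A minimal sketch of PCA computed through the singular value decomposition of the centered data, using NumPy only. The two correlated features are synthetic and generated just for this example.

```python
import numpy as np

# Synthetic 2-D data: x2 is roughly 2 * x1 plus a little noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

# PCA works on centered data
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by captured variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = S ** 2 / np.sum(S ** 2)
first_component = Vt[0]

# Project onto the first principal component: 2 features -> 1 feature
X_reduced = X_centered @ Vt[:1].T

print("explained variance ratio:", explained_variance_ratio)
print("first component:", first_component)
```

Because the two features are almost perfectly correlated, a single component captures nearly all of the variation, which is the projection the slide describes.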
Feature Subset Selection
Another way to reduce dimensionality of data
• Redundant features
• Duplicate much or all of the information contained in one or more other attributes
• Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
• Contain no information that is useful for the ML task at hand
• Example: students' ID is often irrelevant to the task of predicting students' GPA
• Many techniques developed, especially for classification

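
Redundant features such as price and sales tax can often be spotted through their correlation. The sketch below is one simple filter among many; the data, the 8% tax rate, and the 0.95 threshold are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical purchases: sales tax is a fixed 8% of the price
df = pd.DataFrame({
    "price": [100.0, 250.0, 80.0, 400.0, 150.0],
    "quantity": [3, 1, 5, 2, 4],
})
df.insert(1, "sales_tax", df["price"] * 0.08)

# Absolute pairwise correlations, keeping only the upper triangle
# so that each pair of features is looked at once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A feature is flagged when it is highly correlated with an earlier one
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=redundant)

print(redundant, list(reduced.columns))
```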
Feature Creation
Create new attributes that can capture the important information in a data set more efficiently than the original attributes
• Three general methodologies
• Feature extraction
• Example: extracting edges from images
• Feature construction
• Example: dividing mass by volume to get density
• Mapping data to new space
• Example: Fourier and wavelet analysis
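
Two of the methodologies can be sketched briefly: constructing density from mass and volume, and mapping a signal to the frequency domain with a Fourier transform. All values, column names, and the 5 Hz test signal are invented for illustration.

```python
import numpy as np
import pandas as pd

# Feature construction: density = mass / volume (hypothetical objects)
objects = pd.DataFrame({
    "mass_g": [193.0, 27.0, 78.7],
    "volume_cm3": [10.0, 10.0, 10.0],
})
objects["density_g_cm3"] = objects["mass_g"] / objects["volume_cm3"]

# Mapping data to a new space: a 5 Hz sine sampled at 100 Hz for 1 second
sample_rate = 100
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 5 * t)

# In the frequency domain the signal is described by a single peak
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
dominant_frequency = freqs[np.argmax(spectrum)]

print(objects)
print("dominant frequency (Hz):", dominant_frequency)
```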
Next Week Plan
• Visualization
• Model Evaluation
• Model Validation