SE/SS ZG568
Swarna Chaudhary
Assistant Professor, BITS CSIS
[email protected]
BITS Pilani, Pilani Campus
Course Plan
M1 Introduction to Machine Learning
M5 Classification Models I
M6 Classification Models II
M7 Unsupervised Learning
M8 Neural Networks
M9 Deep Networks
Frame the Problem: Look at the Big Picture
• Define the objectives in business terms
• How will your solution be used?
• What are the current solutions (if any)?
• How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
• How should performance be measured?
• Is the performance measure aligned with the business objective?
• What are comparable problems? Can you reuse experience or tools?
• Is human expertise available?
• How would you solve the problem?
• List the assumptions you have made.
• What would be the minimum performance needed to reach the business objective?
ML Pipeline
[Pipeline diagram: ETL Pipeline → ML Pipeline → CI/CD → Deployed Application]
• MAE is preferred when the data contain many outliers; it corresponds to the L1 norm, aka the Manhattan distance/norm (RMSE corresponds to the L2, Euclidean, norm).
• General form: the ℓk norm of a vector v with n elements is ||v||_k = (|v1|^k + … + |vn|^k)^(1/k); k = 1 gives the MAE-style norm and k = 2 the RMSE-style norm. The higher k is, the more the measure focuses on large errors.
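A minimal sketch of this difference, using scikit-learn metrics on made-up predictions where one target value acts as an outlier:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0, 100.0])  # last value is an outlier
y_pred = np.array([2.5, 5.0, 4.0, 8.0, 10.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")  # the outlier inflates RMSE far more than MAE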
Python Packages for ML
• Scientific Computing:
• Pandas
• Numpy
• Scipy
• Data Visualization
• Matplotlib
• Seaborn
• Algorithmic Libraries
• Scikit-Learn (sklearn)
• Keras
• TensorFlow
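A quick sketch of the conventional import aliases for these packages:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression  # one example of a sklearn estimator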
Key Steps in a Machine Learning Project
Importing and Exporting Data
To read an entire CSV file:
# Importing libraries
import pandas as pd
import numpy as np

# Read the CSV into a DataFrame (path is a placeholder for the file location)
df = pd.read_csv(path)
Exporting a pandas DataFrame to CSV
• To save the modified dataset:
path = "C:/Windows/…/property.csv"
df.to_csv(path)
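A quick check, reusing the same placeholder path: write without the row index, then read the file back.
df.to_csv(path, index=False)   # index=False avoids saving the row index as an extra column
df_check = pd.read_csv(path)   # read the file back to verify the export
print(df_check.head())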
Different file formats
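As a sketch, pandas pairs a read function with a matching write method for each common format (file names here are placeholders):
df = pd.read_csv("data.csv")      # CSV; save with df.to_csv("data.csv")
df = pd.read_json("data.json")    # JSON; save with df.to_json("data.json")
df = pd.read_excel("data.xlsx")   # Excel; save with df.to_excel("data.xlsx")
# SQL: pd.read_sql(query, conn) paired with df.to_sql("table_name", conn)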
What is Data?
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• aka variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
• aka record, point, case, sample, entity, or instance
• Attribute values are numbers or symbols assigned to an attribute for a particular object
Discrete and Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• Examples: zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables
• Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• Examples: temperature, height, or weight
• Practically, real values can only be measured and represented using a finite number of digits
• Continuous attributes are typically represented as floating-point variables
Types and Properties of Attributes
• Nominal
• ID numbers, eye color, zip codes
• Ordinal
• rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}
• Interval
• calendar dates, temperatures in Celsius or Fahrenheit
• Ratio
• temperature in Kelvin, length, counts, elapsed time (e.g., time to run a race)
Properties
• Distinctness: =, ≠
• Order: <, >
• Differences are meaningful: +, −
• Ratios are meaningful: *, /
(Nominal attributes have distinctness; ordinal adds order; interval adds meaningful differences; ratio adds meaningful ratios.)
Categorization of Attributes
To check data types in Python
• df.dtypes
• df.info()
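A small sketch on a made-up frame, showing how the attribute types above surface as pandas dtypes:
df = pd.DataFrame({"zip": [60601, 10001],           # discrete, stored as int64
                   "height": [1.72, 1.80],          # continuous, stored as float64
                   "eye_color": ["brown", "blue"]}) # nominal, stored as object
print(df.dtypes)
df.info()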
To check data distribution
• df.describe() returns a statistical summary (count, mean, std, min, quartiles, max) of the numeric columns
Types of Data Sets
[Figure: examples of data set types — tree/graph data, sequence data, and network data]
Data Types
• Relational/Object
• Transactional Data
• Document Data
• Web & Social Network Data
• Spatial Data
• Time Series
[Figure: attribute taxonomy — discrete vs. continuous numeric, symmetric/asymmetric binary, ordinal]
Key Steps in a Machine Learning Project
Data Preprocessing
• Data Cleaning
• Aggregation
• Sampling
• Discretization and Binarization
• Attribute Transformation
• Dimensionality Reduction
• Feature subset selection
• Feature creation
Data Quality
[Figure: dimensions of data quality — correct, complete, consistent, interpretable, trustable, usable on demand]
Detecting Missing Values
dataframe.isnull()
dataframe.isnull().sum()
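A short sketch on a made-up frame with gaps:
df = pd.DataFrame({"Age": [34, 33, None], "Profession": ["MGR", None, "DOC"]})
print(df.isnull())        # Boolean mask marking each missing cell
print(df.isnull().sum())  # count of missing values per column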
How to Handle Missing Data?
• Drop the missing values
• Drop the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
• Drop the variable
Handling Missing Values
dataframe.dropna():
# axis=0 drops rows; here, rows with a missing value in the given column
df.dropna(subset=["COLNAME"], axis=0, inplace=True)
Handling Missing Values: Imputation Methods
dataframe.replace(missing_value, new_value):
mean = df["COLNAME"].mean()
df["COLNAME"] = df["COLNAME"].replace(np.nan, mean)  # assign back; replace returns a new Series
Handling Missing Values: Housing Example (from the book)
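A minimal sketch in the spirit of the book's housing example, which imputes numeric attributes with scikit-learn's SimpleImputer (the DataFrame name housing_num is an assumption):
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")   # the median is robust to outliers
X = imputer.fit_transform(housing_num)       # learns one median per column, then fills the gaps
# imputer.statistics_ holds the learned medians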
Any problems with the Data?
Name           Age  DateOfFirstBuy  Profession  DateOfBirth
Bill Gates     34   15-Jan-2015     MGR         Feb 24, 1981
John           33   27-Jan-2015                 Mar 27, 1982
William Gates  34   15-Jan-2015     MGR         Feb 24, 1981
Kennedy        32   30-Jan-2015     DOC         Nov 25, 1982
Problems: a duplicate record (Bill Gates / William Gates share the same age, purchase date, profession, and birth date), a missing Profession value (John), and inconsistent date formats between DateOfFirstBuy and DateOfBirth.
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
• Other data problems which require data cleaning
– duplicate records
– inconsistent data
How to Handle Noisy Data?
• Binning (also used for discretization)
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
– Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it; they perform local smoothing.
Binning Methods for Data Smoothing
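A worked sketch with illustrative values (a classic textbook-style example; the data are assumed):
import numpy as np

data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted
bins = data.reshape(3, 3)                            # equal-frequency bins of size 3

# Smoothing by bin means: each value is replaced by its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3)           # -> 9, 9, 9, 22, 22, 22, 29, 29, 29

# Smoothing by bin boundaries: each value moves to the nearer bin edge
lo, hi = bins[:, [0]], bins[:, [-1]]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi) # -> [4 4 15], [21 21 24], [25 25 34]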
• Descriptive Statistics
df["sq_ft"].describe()
Q1=dataframe.quantile(0.25)
Q3=dataframe.quantile(0.75)
IQR = Q3-Q1
print(IQR)
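A common follow-up (one convention, not the only one) flags values outside the 1.5 × IQR fences as outliers; a sketch using the sq_ft column from the slide above:
q1 = df["sq_ft"].quantile(0.25)
q3 = df["sq_ft"].quantile(0.75)
iqr = q3 - q1
mask = (df["sq_ft"] < q1 - 1.5 * iqr) | (df["sq_ft"] > q3 + 1.5 * iqr)
print(df[mask])   # rows flagged as potential outliers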
Data Quality Assurance – Remove Duplicates
dataframe.drop_duplicates()
Unnecessary Data
• Type 1: Uninformative/Repetitive
• the feature is uninformative because too many rows hold the same value
• How to find? Create a list of features with a high percentage of the same value (see the sketch after the example below)
• Type 2: Irrelevant
• the feature is not related to the problem
• How to find? Domain knowledge; correlation analysis
• Type 3: Duplicate
• two or more features with the same values for each observation
Example:
df.drop(["Sno"], inplace=True, axis=1)
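For Type 1, a sketch of the "high percentage of the same value" check (the 0.95 threshold is an arbitrary choice):
threshold = 0.95
uninformative = [col for col in df.columns
                 if df[col].value_counts(normalize=True, dropna=False).iloc[0] > threshold]
print(uninformative)   # features whose most frequent value dominates the column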
Data Aggregation
Collected data consist of the sales per quarter for the years 2008 to 2010.
The data of interest are the annual sales (totals per year).
The data can be aggregated so that the resulting data set is smaller in volume, without loss of the information necessary for the analysis task.
Aggregation
• Purpose
• Data reduction
• Change of scale
Aggregation: pivot with a multi-index
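A sketch with made-up quarterly figures: pivot_table rolls the sales up to annual totals, and a list of index columns gives the multi-index view.
sales = pd.DataFrame({
    "year":    [2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 312, 440, 385, 610],
})
annual = sales.pivot_table(index="year", values="sales", aggfunc="sum")               # one row per year
by_qtr = sales.pivot_table(index=["year", "quarter"], values="sales", aggfunc="sum")  # multi-index
print(annual)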
Sampling
Primary technique used for data reduction
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is typically used in machine learning because processing the entire set of data of interest may be too expensive or time consuming.
• Key Principle
• Using a sample will work almost as well as using the entire data set, if the sample is representative
• A sample is representative if it has approximately the same properties as the original dataset
• Techniques:
• Simple Random Sampling
• Stratified Sampling
Sampling
Syntax:
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
Parameters:
n: int, number of random rows to return.
frac: float, fraction of rows to return (frac * len(df)); cannot be used together with n.
replace: bool, sample with replacement if True.
random_state: int or numpy.random.RandomState, optional; a fixed integer returns the same rows on every call.
axis: 0 or 'index' to sample rows, 1 or 'columns' to sample columns.
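A usage sketch (the "label" column used for stratification is an assumption):
rows = df.sample(n=5, random_state=42)        # five rows, reproducible across runs
tenth = df.sample(frac=0.1, random_state=42)  # a 10% simple random sample
# One common stratified-sampling pattern: sample the same fraction within each group
strata = df.groupby("label", group_keys=False).apply(lambda g: g.sample(frac=0.1, random_state=42))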
Binarization
Maps a continuous or categorical attribute into one or more binary variables
• Often convert a continuous attribute to a categorical attribute, and then convert the categorical attribute to a set of binary attributes
• Examples: eye color, and height measured as {low, medium, high}
One Hot Encoding
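A sketch of one-hot encoding with pandas (the column is made up):
df = pd.DataFrame({"eye_color": ["brown", "blue", "green", "blue"]})
print(pd.get_dummies(df, columns=["eye_color"]))
# -> binary columns eye_color_blue, eye_color_brown, eye_color_green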
Attribute Transformation
Maps the entire set of values of a given attribute to a new set of replacement values
• Each original value can be identified with one of the new values
• Simple functions: x^k, log(x), e^x, |x|
• Normalization
• Refers to various techniques to adjust for differences among attributes in terms of frequency of occurrence, mean, variance, range
• Take out unwanted, common signal, e.g., seasonality
• In statistics, standardization refers to subtracting the mean and dividing by the standard deviation
Normalization
• Transforming the data to fall within a smaller or common range such as [-1, 1] or [0.0, 1.0]
• Normalizing the data attempts to give all attributes an equal weight
• It also helps prevent attributes with larger values from outweighing attributes with smaller values
• Various methods
• Min-max normalization
• Decimal scaling
• Z-score
• Min-max normalization: v' = (v − min_A) / (max_A − min_A) × (new_max − new_min) + new_min
• Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) ≈ 0.709.
• Z-score normalization (μ: mean, σ: standard deviation): v' = (v − μ) / σ
Normalization
• Decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
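A sketch of the three methods (the "income" column name is an assumption):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

minmax = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(df[["income"]])  # min-max to [0, 1]
zscore = StandardScaler().fit_transform(df[["income"]])                        # z-score
# Decimal scaling by hand (assumes the max is not an exact power of 10):
j = int(np.ceil(np.log10(df["income"].abs().max())))
decimal_scaled = df["income"] / 10 ** j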
Curse of Dimensionality
• When dimensionality increases, data become increasingly sparse in the space they occupy
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
Dimensionality Reduction
• Avoid the curse of dimensionality
• Reduce the amount of time and memory required by ML algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
• Popular Techniques
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
PCA
• Finds a projection that captures the largest amount of variation in the data
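A minimal PCA sketch with scikit-learn, assuming X is a (samples × features) numeric array:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # project onto the top-2 directions of maximum variance
print(pca.explained_variance_ratio_)    # share of total variance each component captures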
Feature Subset Selection
Another way to reduce the dimensionality of data
• Redundant features
• Duplicate much or all of the information contained in one or more other attributes
• Example: the purchase price of a product and the amount of sales tax paid
• Irrelevant features
• Contain no information that is useful for the ML task at hand
• Example: students' IDs are often irrelevant to the task of predicting students' GPA
• Many techniques have been developed, especially for classification
Feature Creation
Create new attributes that can capture the important information in a data set (see the sketch below)
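A sketch in the spirit of the housing example (the column names are assumptions): a ratio of raw attributes often carries more signal than either attribute alone.
df["rooms_per_household"] = df["total_rooms"] / df["households"]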
Next: Model Evaluation and Model Validation