Preprocessing ch.1

This document introduces data preprocessing for machine learning in Python. It explains why preprocessing matters: it transforms data into a format suitable for modeling, which can improve model performance and produce more reliable results. It then demonstrates common techniques: exploring data with pandas, removing missing data with methods such as dropna(), converting column data types with astype(), and splitting data into training and test sets with train_test_split(), including stratified sampling, so that performance can be evaluated on holdout data.

Uploaded by oml78531

Introduction to preprocessing
PREPROCESSING FOR MACHINE LEARNING IN PYTHON

James Chapman
Curriculum Manager, DataCamp
What is data preprocessing?
After exploratory data analysis and data cleaning
Preparing data for modeling

Example: transforming categorical features into numerical features (dummy variables)
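
The dummy-variable example above could look roughly like the sketch below. The `trails` DataFrame and its `difficulty` values are hypothetical toy data, not taken from the course datasets, and `pd.get_dummies` is only one of several ways to encode categories.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature with three categories.
trails = pd.DataFrame({"difficulty": ["easy", "hard", "easy", "moderate"]})

# Create one indicator column per category.
# dtype=int requests 0/1 values instead of booleans.
dummies = pd.get_dummies(trails["difficulty"], dtype=int)
print(dummies)
```

Each row has exactly one indicator set to 1, so the categorical column is now represented by numerical features a model can consume.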

Why preprocess?

Transform dataset so it's suitable for modeling

Improve model performance

Generate more reliable results

Recap: exploring data with pandas
import pandas as pd
hiking = pd.read_json("hiking.json")
print(hiking.head())

Prop_ID Name ... lat lon

0 B057 Salt Marsh Nature Trail ... NaN NaN
1 B073 Lullwater ... NaN NaN
2 B073 Midwood ... NaN NaN
3 B073 Peninsula ... NaN NaN
4 B073 Waterfall ... NaN NaN

Recap: exploring data with pandas
print(hiking.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 11 columns):
# Column Non-Null Count Dtype
-- ------ -------------- -----
0 Prop_ID 33 non-null object
1 Name 33 non-null object
2 Location 33 non-null object
3 Park_Name 33 non-null object
4 Length 29 non-null object
5 Difficulty 27 non-null object
6 Other_Details 31 non-null object
7 Accessible 33 non-null object
8 Limited_Access 33 non-null object
9 lat 0 non-null float64
10 lon 0 non-null float64
dtypes: float64(2), object(9)
memory usage: 3.0+ KB

Recap: exploring data with pandas
print(wine.describe())

Type Alcohol ... Alcalinity of ash

count 178.000000 178.000000 ... 178.000000
mean 1.938202 13.000618 ... 19.494944
std 0.775035 0.811827 ... 3.339564
min 1.000000 11.030000 ... 10.600000
25% 1.000000 12.362500 ... 17.200000
50% 2.000000 13.050000 ... 19.500000
75% 3.000000 13.677500 ... 21.500000
max 3.000000 14.830000 ... 30.000000

Removing missing data
print(df)

     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0

print(df.dropna())

     A    B    C
1  4.0  7.0  3.0
4  5.0  9.0  7.0
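
To reproduce the output above, the example DataFrame can be rebuilt by hand. This is a minimal sketch: the construction of `df` is an assumption based on the printed values, since the slides do not show how it was created.

```python
import numpy as np
import pandas as pd

# Rebuild the example DataFrame from the values printed above (assumed construction).
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# dropna() with no arguments drops every row that contains at least one NaN.
complete_rows = df.dropna()
print(complete_rows)

# dropna() returns a new DataFrame; the original is left unchanged.
print(df.shape)
```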

Removing missing data
print(df)

     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0

print(df.drop([1, 2, 3]))

     A    B    C
0  1.0  NaN  2.0
4  5.0  9.0  7.0

Removing missing data
print(df)

     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0

print(df.drop("A", axis=1))

     B    C
0  NaN  2.0
1  7.0  3.0
2  NaN  NaN
3  7.0  NaN
4  9.0  7.0
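
The two uses of drop() shown above, by row label and by column name, can be compared side by side in a short sketch. The construction of `df` is an assumption based on the printed values; the `columns=` keyword is an equivalent spelling of `axis=1` that some readers find clearer.

```python
import numpy as np
import pandas as pd

# Example DataFrame rebuilt from the printed values (assumed construction).
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# Drop rows by index label: rows 1, 2 and 3 are removed, rows 0 and 4 remain.
without_rows = df.drop([1, 2, 3])
print(without_rows)

# Drop a column by name; columns="A" is equivalent to drop("A", axis=1).
without_a = df.drop(columns="A")
print(without_a)
```

Note that drop() removes whatever labels it is given, regardless of whether those rows or columns contain missing values.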

Removing missing data
print(df)

     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0

print(df.isna().sum())

A    1
B    2
C    2
dtype: int64

print(df.dropna(subset=["B"]))

     A    B    C
1  4.0  7.0  3.0
3  NaN  7.0  NaN
4  5.0  9.0  7.0
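
Counting missing values per column and then dropping rows based on a single column can be sketched as follows. The construction of `df` is again an assumption based on the printed values.

```python
import numpy as np
import pandas as pd

# Example DataFrame rebuilt from the printed values (assumed construction).
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# isna() marks each missing cell as True; summing counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# subset=["B"] only checks column B, so row 3 survives despite NaNs in A and C.
rows_with_b = df.dropna(subset=["B"])
print(rows_with_b)
```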

Removing missing data
print(df)

     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0

print(df.dropna(thresh=2))

     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
4  5.0  9.0  7.0
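
The thresh argument is easy to misread: it sets the minimum number of non-missing values a row needs in order to be kept. The sketch below checks that reading against an explicit count; the construction of `df` is an assumption based on the printed values.

```python
import numpy as np
import pandas as pd

# Example DataFrame rebuilt from the printed values (assumed construction).
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# Keep rows that have at least 2 non-missing values.
kept = df.dropna(thresh=2)
print(kept)

# The same selection written out explicitly: count non-missing values per row.
non_missing_per_row = df.notna().sum(axis=1)
kept_explicit = df[non_missing_per_row >= 2]
```

Rows 2 and 3 each have only one non-missing value, which is why they are dropped.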

Working With Data Types

Why are types important?
print(volunteer.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
 #   Column          Non-Null Count  Dtype
 --  ------          --------------  -----
 0   opportunity_id  665 non-null    int64
 1   content_id      665 non-null    int64
 2   vol_requests    665 non-null    int64
 3   event_time      665 non-null    int64
 4   title           665 non-null    object
 ..  ...             ...             ...
 34  NTA             0 non-null      float64
dtypes: float64(13), int64(8), object(14)
memory usage: 182.0+ KB

object : string/mixed types
int64 : integer
float64 : float
datetime64 : dates and times
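
A small DataFrame containing one column of each kind makes the dtype list above concrete. This is a sketch with hypothetical column names and values, not data from the volunteer dataset. Depending on the pandas version, string columns may be reported as `object` or as a dedicated string dtype.

```python
import pandas as pd

# Hypothetical sample with one column per dtype discussed above.
sample = pd.DataFrame({
    "title": ["Clean-up", "Tutoring"],    # strings (object in many pandas versions)
    "vol_requests": [50, 2],              # integers -> int64
    "hours": [1.5, 3.0],                  # floats -> float64
    "start": pd.to_datetime(["2011-01-05", "2011-02-01"]),  # dates -> datetime64
})

# dtypes lists the type pandas inferred for each column.
print(sample.dtypes)
```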

Converting column types
print(df)

   A        B    C
0  1   string  1.0
1  2  string2  2.0
2  3  string3  3.0

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
 --  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      object
 2   C       3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes

Converting column types
print(df)

   A        B    C
0  1   string  1.0
1  2  string2  2.0
2  3  string3  3.0

df["C"] = df["C"].astype("float")
print(df.dtypes)

A      int64
B     object
C    float64
dtype: object
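
The conversion above can be reproduced with a hand-built DataFrame. The sketch assumes that column C holds numbers stored as strings, which would explain its original `object` dtype; the slides do not show how `df` was created. The `messy` Series and the use of `pd.to_numeric` are an illustrative addition, not part of the slides.

```python
import pandas as pd

# Assumed construction: column C contains numeric values stored as strings.
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": ["string", "string2", "string3"],
    "C": ["1.0", "2.0", "3.0"],
})

# astype("float") parses each string and stores the column as float64.
df["C"] = df["C"].astype("float")
print(df.dtypes)

# astype() raises an error if any value cannot be parsed. For messier columns,
# pd.to_numeric with errors="coerce" turns unparseable values into NaN instead.
messy = pd.Series(["1.0", "n/a", "3.0"])  # hypothetical example values
cleaned = pd.to_numeric(messy, errors="coerce")
print(cleaned)
```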

Training and test sets

Why split?

1. Reduce overfitting

2. Evaluate performance on a holdout set

Splitting up your dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train y_train
0 1.0 n
1 4.0 n
...
5 5.0 n
6 6.0 n

X_test y_test
0 9.0 y
1 1.0 n
2 4.0 n
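
A self-contained version of the split might look like the sketch below. The `X` and `y` used here are hypothetical toy data with 10 rows, chosen only so the resulting sizes are easy to verify; they are not the data shown on the slide.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, one feature, a binary label.
X = pd.DataFrame({"feature": [float(i) for i in range(10)]})
y = pd.DataFrame({"labels": ["n"] * 7 + ["y"] * 3})

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With 10 rows, 8 go to the training set and 2 to the test set.
print(X_train.shape, X_test.shape)
```

Features and labels are shuffled together, so each row of `X_train` keeps the same index as its label in `y_train`.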

Stratified sampling

Dataset of 100 samples: 80 class 1 and 20 class 2

Training set of 75 samples: 60 class 1 and 15 class 2

Test set of 25 samples: 20 class 1 and 5 class 2

Stratified sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

y["labels"].value_counts()

class1 80
class2 20
Name: labels, dtype: int64

Stratified sampling
y_train["labels"].value_counts()

class1    60
class2    15
Name: labels, dtype: int64

y_test["labels"].value_counts()

class1    20
class2     5
Name: labels, dtype: int64
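
The class counts above can be checked end to end with synthetic data. In this sketch `X` and `y` are hypothetical stand-ins with the same 80/20 class balance, and the label column is passed to `stratify` directly, which is one way to specify it. No `test_size` is given, so scikit-learn's default of 25% applies.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with the same class balance: 80 of class1, 20 of class2.
X = pd.DataFrame({"feature": range(100)})
y = pd.DataFrame({"labels": ["class1"] * 80 + ["class2"] * 20})

# stratify keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y["labels"], random_state=42
)

train_counts = y_train["labels"].value_counts()
test_counts = y_test["labels"].value_counts()
print(train_counts)
print(test_counts)
```

Both splits keep the original 4:1 ratio between class1 and class2, which a purely random split does not guarantee.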
