DWM Exp 8

The document outlines a data cleaning technique focused on data transformation, detailing methods such as rescaling, binarizing, and standardizing data using Python and scikit-learn. It provides code examples for each method, demonstrating how to preprocess raw data into a clean dataset suitable for analysis. The conclusion emphasizes the importance of these techniques in data mining applications.

Uploaded by

giteanuja09


Experiment 8

Title: Implement data cleaning technique: Data transformation

CO: Use Data Mining tools for various applications.

Theory:
Data preprocessing is a technique used to convert raw data into a clean
data set. Data gathered from different sources arrives in a raw format that is
not suitable for analysis, so it must be transformed before it can be used.

1. Rescale Data
• When our data comprises attributes with varying scales, many machine
learning algorithms can benefit from rescaling the attributes so that they all
share the same scale.
• This is useful for optimization algorithms used at the core of machine learning
algorithms, such as gradient descent.
• It is also useful for algorithms that weight inputs, such as regression and neural
networks, and for algorithms that use distance measures, such as K-Nearest Neighbors.
• We can rescale the data with scikit-learn using the MinMaxScaler class.
# Python code to rescale data (between 0 and 1)
import numpy
import pandas
from sklearn.preprocessing import MinMaxScaler

# Pima Indians diabetes dataset (CSV without a header row).
# NOTE: if this URL is no longer served, point `url` at a local copy of the file.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]

# fit the scaler on X and rescale every column to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize transformed data (first five rows)
numpy.set_printoptions(precision=3)
print(rescaledX[0:5, :])

Output
[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]
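
To make the rescaling step concrete, the following is a minimal sketch of the min-max formula, x' = (x - min) / (max - min), applied column by column, which is what MinMaxScaler computes for feature_range=(0, 1). The array `sample` is made-up illustration data (not taken from the Pima dataset), `min_max_rescale` is a hypothetical helper name, and mapping a constant column to 0 is an assumption made here for safety.

```python
import numpy as np

def min_max_rescale(X):
    """Rescale each column of X to the range [0, 1] (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    span = col_max - col_min
    # Assumption: a constant column has zero span; divide by 1 so it maps to 0.
    span[span == 0] = 1.0
    return (X - col_min) / span

# Made-up data: two attributes on very different scales.
sample = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [5.0, 1000.0]])

rescaled = min_max_rescale(sample)
print(rescaled)
```

After rescaling, both columns span the same 0 to 1 range even though the raw attributes differed by two orders of magnitude.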

2. Binarize Data (Make Binary)

• We can transform our data using a binary threshold. All values above the threshold
are marked 1 and all values equal to or below it are marked 0.
• This is called binarizing or thresholding the data. It can be useful when you
have probabilities that you want to turn into crisp values. It is also useful in feature
engineering, when you want to add new features that indicate something meaningful.
• We can create new binary attributes in Python using scikit-learn with
the Binarizer class.
# Python code for binarization
import numpy
import pandas
from sklearn.preprocessing import Binarizer

# Pima Indians diabetes dataset (CSV without a header row).
# NOTE: if this URL is no longer served, point `url` at a local copy of the file.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]

# values greater than 0.0 become 1; values equal to or below 0.0 become 0
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)

# summarize transformed data (first five rows)
numpy.set_printoptions(precision=3)
print(binaryX[0:5, :])

Output
[[ 1. 1. 1. 1. 0. 1. 1. 1.]
[ 1. 1. 1. 1. 0. 1. 1. 1.]
[ 1. 1. 1. 0. 0. 1. 1. 1.]
[ 1. 1. 1. 1. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1. 1. 1. 1.]]
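
The thresholding rule can also be sketched by hand, here for the case mentioned above of turning probabilities into crisp values. This is only an illustration of the rule (strictly above the threshold gives 1, otherwise 0); `binarize` is a hypothetical helper name, `probs` is made-up data, and the threshold of 0.5 is an assumed choice.

```python
import numpy as np

def binarize(X, threshold=0.0):
    """Return 1.0 where X is strictly above threshold, else 0.0 (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    return (X > threshold).astype(float)

# Made-up probabilities; 0.50 sits exactly on the threshold.
probs = np.array([[0.10, 0.50, 0.51],
                  [0.90, 0.49, 0.00]])

crisp = binarize(probs, threshold=0.5)
print(crisp)
```

Note that the value exactly equal to the threshold (0.50) is mapped to 0, matching the "equal to or below" rule stated in the bullets.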

3. Standardize Data
• Standardization is a useful technique to transform attributes with a Gaussian
distribution and differing means and standard deviations to a standard Gaussian
distribution with a mean of 0 and a standard deviation of 1.
• We can standardize data using scikit-learn with the StandardScaler class.
# Python code to standardize data (0 mean, 1 stdev)
import numpy
import pandas
from sklearn.preprocessing import StandardScaler

# Pima Indians diabetes dataset (CSV without a header row).
# NOTE: if this URL is no longer served, point `url` at a local copy of the file.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]

# learn each column's mean and standard deviation, then standardize
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data (first five rows)
numpy.set_printoptions(precision=3)
print(rescaledX[0:5, :])

Output
[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426]
[-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191]
[ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106]
[-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042]
[-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]
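
As a rough check on what standardization does, the sketch below applies z = (x - mean) / std to each column by hand and confirms that the result has mean 0 and standard deviation 1. It uses the population standard deviation (ddof=0), which is what StandardScaler uses. The array `sample` is made-up data, `standardize` is a hypothetical helper name, and leaving a constant column at 0 is an assumption made here.

```python
import numpy as np

def standardize(X):
    """Shift each column to mean 0 and scale it to std 1 (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    # Assumption: a constant column has zero std; divide by 1 so it stays at 0.
    std[std == 0] = 1.0
    return (X - mean) / std

# Made-up data: two attributes with different means and spreads.
sample = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

z = standardize(sample)
print(z)
print("column means:", z.mean(axis=0))
print("column stds: ", z.std(axis=0))
```

Unlike min-max rescaling, the standardized values are not bounded, which is why the output above contains values such as 5.485.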

Conclusion: We implemented the data cleaning technique of data transformation by rescaling, binarizing, and standardizing the data.

Assessment Scheme:

Process Related (15 Marks) | Product Related (10 Marks) | Total (25 Marks) | Sign of Teacher
