Module 2

Preparing Data for Analysis

Isabelle Bichindaritz, SUNY Oswego

A piece of information is some knowledge: data processed so that it modifies the state of "uncertainty" about a system.

[Diagram: Data → Processing → Information]
Datasets and Files
• Example file client.csv (delimiter ", "):

clientNo, fName, lName, address, telNo
1, Lisa, Smith, mountain, 1439
2, John, Mack, city, 5634
3, Mary, Lewis, river, 9045
4, Mark, Trump, plain, 2710
5, Leslie, Clinton, village, 3592
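
A file like client.csv can be parsed with Python's standard csv module. The sketch below is only an illustration: it embeds the rows from the slide in a string instead of reading the file from disk, and the helper name read_clients is hypothetical. The skipinitialspace option discards the space that follows each comma.

```python
import csv
import io

# Contents of client.csv as shown on the slide (comma followed by a space).
CLIENT_CSV = """clientNo, fName, lName, address, telNo
1, Lisa, Smith, mountain, 1439
2, John, Mack, city, 5634
3, Mary, Lewis, river, 9045
4, Mark, Trump, plain, 2710
5, Leslie, Clinton, village, 3592
"""


def read_clients(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


clients = read_clients(CLIENT_CSV)
for client in clients:
    print(client["clientNo"], client["fName"], client["lName"])
```

Every field is read as text; numeric columns such as clientNo would still need an explicit conversion before analysis.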

Data Sources
• Open data sources (public):
  - List in resources
  - The Cancer Genome Atlas (TCGA)
  - Alzheimer's Disease Neuroimaging Initiative (ADNI)
  - Health and Retirement Study (HRS)
  - UK Biobank
  - Millennium Cohort Study
  - CALIBER (EHR and admin data)
  - UCI Machine Learning Repository
  - …

Data Sources
• Heterogeneous types of Big Data to analyze separately or together:
  - Numeric, nominal.
  - Text.
  - Image.
  - Video.
  - Sound.
  - Social media.
  - Web.
  - Time series.
  - Signal.
  - …

Importance of Data Preprocessing
• Data as found in public datasets and other types of datasets is imperfect, "dirty".
• Data quality is essential to get good analytics results: Garbage In, Garbage Out (GIGO).
• The format of the data often has to be changed: for example, the analytical method used may require nominal data while your data is numeric.

Importance of Data Preprocessing
• Types of data characteristics to fix:
  - Missing values
  - Noisy data
  - Incorrect data type
  - Incomplete data

Importance of Data Preprocessing
• The goal is to improve the quality of data to ensure that the measurements provided are as
  - Accurate
  - Precise
  - Complete
  - Interpretable
  - Correct
  as possible.


Data Preprocessing Tasks
• Data cleaning
  - Dealing with missing values
  - Dealing with erroneous data and outliers
• Data transformation
  - Changing data types (discretization)
  - Changing range of data values (normalization)
  - Adding variables
• Data reduction
  - Feature selection
  - Sampling

Data Preprocessing Tasks

[Figure: data preprocessing tasks, from Han and Kamber (2014)]

Missing Values
• Missing values can appear in a dataset in several forms, for example:
  - A blank
  - A '.'
  - An 'n/a'
  - A '?'
• There are several strategies to deal with them, for example applying some kind of filter.
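
Before any strategy can be applied, the different markers have to be recognized and mapped to a single representation. The following is a minimal Python sketch, assuming the four markers listed on the slide and using None as the unified missing value; the names MISSING_MARKERS, is_missing and to_number_or_none, and the sample column, are hypothetical.

```python
# Markers that denote a missing value: blank, '.', 'n/a', '?'.
MISSING_MARKERS = {"", ".", "n/a", "?"}


def is_missing(cell):
    """Return True if a raw text cell is one of the missing-value markers."""
    return cell.strip().lower() in MISSING_MARKERS


def to_number_or_none(cell):
    """Convert a raw cell to float, or to None when it is missing."""
    return None if is_missing(cell) else float(cell)


# Hypothetical raw column as read from a text file.
raw_ages = ["34", "", "51", "n/a", "?", "28", "."]
ages = [to_number_or_none(cell) for cell in raw_ages]
print(ages)  # [34.0, None, 51.0, None, None, 28.0, None]
```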

Replacing Missing Values
• Delete the entire row (whether this is acceptable depends on how many rows you have).
• Replace by a fixed value ('unknown').
• Replace by a statistic associated with a particular column or a particular group: mean, median, mode.
• Replace values based on nearest neighbors.
• Replace values based on likelihood.

Replacing Missing Values
• Handling missing values
  - Data deletion
      - Delete entire row
      - Delete entire column
  - Data imputation
      - Impute a constant value
      - Impute with mean …
      - Impute from neighbors
      - Impute based on a model
      - Impute randomly
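
The simplest of these options can be sketched in a few lines of Python. This is an illustration on a single hypothetical column in which None marks a missing value; the helper names drop_missing and impute are assumptions, and the neighbor-based and model-based options are not shown.

```python
from statistics import mean, median, mode


def drop_missing(values):
    """Data deletion: keep only the observed values."""
    return [v for v in values if v is not None]


def impute(values, fill):
    """Data imputation: replace every missing value by `fill`."""
    return [fill if v is None else v for v in values]


# Hypothetical age column; None marks a missing value.
ages = [34, None, 51, None, 28, 34]
observed = drop_missing(ages)

by_constant = impute(ages, -1)               # fixed value
by_mean = impute(ages, mean(observed))       # column mean
by_median = impute(ages, median(observed))   # column median
by_mode = impute(ages, mode(observed))       # most frequent value

print(by_mean)
```

The statistics are computed on the observed values only, so the missing entries do not influence the value used to replace them.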
Outline
• Introduction to module
• Locating and downloading datasets
• Datasets and files
• Data sources
• Data preprocessing
• Importance of data preprocessing
• Data preprocessing tasks
• Missing values
• Replacing missing values
• Normalizing and discretizing data
• Data normalization
• Discretization
• Data reduction
• Feature selection
• Data sampling
• Introduction to R language
• Principles of R
• Working with R
• Data preprocessing with R
Normalizing and Discretizing Data
• To handle noise, the data can be transformed globally to:
  - Reduce the grain of the data (discretize) from fine grain to coarser grain, for example from numeric to nominal.
  - Change the scale or range of the data (normalize).
• It might also be necessary to discretize in order to apply certain data analytics methods, because some prediction methods require a nominal target attribute.

Data Normalization
• Some data analytics methods do not behave well on data of mixed scale (Ex: age and income have widely different ranges).
• For example, it is common to scale all data into the range [-1, 1] or [0, 1].

Data Normalization
• Generally, data are scaled into a smaller range.
• Methods include:
  - Min-max normalization
  - Z-score normalization
  - Decimal scaling

Data Normalization
• Min-max normalization transforms data from range [m, M] into range [m’, M’] using the formula

val’ = (val - m) / (M - m) * (M’ - m’) + m’

• Example: normalizing age values in [0, 150] into [0, 1]
age = 50 → 0.33 (intuitively)
check: val’ = (50 - 0) / (150 - 0) * (1 - 0) + 0 = 50 / 150 = 1/3 ≈ 0.33
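
The formula translates directly into code. A minimal Python sketch follows; the function name min_max_normalize and its parameter names are hypothetical, with new_m and new_M standing for m’ and M’.

```python
def min_max_normalize(val, m, M, new_m=0.0, new_M=1.0):
    """Map val from the range [m, M] into the range [new_m, new_M]."""
    if M == m:
        raise ValueError("the original range [m, M] must not be empty")
    return (val - m) / (M - m) * (new_M - new_m) + new_m


# Worked example from the slide: age 50 in [0, 150] mapped into [0, 1].
age_scaled = min_max_normalize(50, 0, 150)
print(round(age_scaled, 2))  # 0.33

# The same value mapped into [-1, 1] instead.
age_symmetric = min_max_normalize(50, 0, 150, -1.0, 1.0)
print(round(age_symmetric, 2))  # -0.33
```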

Data Normalization
• Z-score normalization

val’ = (val - mean) / std

• Ex: normalizing age values in [0, 150], where the mean age in the population is 36.8 and the standard deviation is 12
age = 50 → val’ = (50 - 36.8) / 12 = 1.1
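
As a quick check of the arithmetic, here is a minimal Python sketch. The function names are hypothetical, and z_score_all assumes the population standard deviation (statistics.pstdev) and a column that is not constant; a sample standard deviation could be used instead.

```python
from statistics import mean, pstdev


def z_score(val, mu, sigma):
    """Number of standard deviations between val and the mean."""
    return (val - mu) / sigma


# Worked example from the slide: mean 36.8, standard deviation 12.
print(round(z_score(50, 36.8, 12), 2))  # 1.1


def z_score_all(values):
    """Standardize a whole column using its own mean and deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [z_score(v, mu, sigma) for v in values]


standardized = z_score_all([20, 30, 40])
```

After standardization the column has mean 0 and standard deviation 1, whatever its original unit.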

Data Normalization
• The normal (distribution) curve (μ: mean, σ: standard deviation)
  - From μ-σ to μ+σ: contains about 68% of the measurements
  - From μ-2σ to μ+2σ: contains about 95% of them
  - From μ-3σ to μ+3σ: contains about 99.7% of them
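
These percentages can be recomputed from the error function: for a normally distributed variable, the probability of falling within k standard deviations of the mean is erf(k / √2). A short Python sketch (the function name coverage_within is hypothetical):

```python
import math


def coverage_within(k):
    """Probability that a normal variable lies within k sigma of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage_within(k) * 100:.1f}%")
```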

Data Normalization
• Decimal scaling

val’ = val / 10^n

where n is the smallest integer such that the largest |val’| is less than 1.

This formula transforms the values into the interval [-1, 1] if there are negative values, and into [0, 1] otherwise.

• Ex: normalizing age values in [0, 150]
we want the highest age to be less than 1 after scaling, therefore we divide by 1,000 = 10^3
age = 50 → val’ = 50 / 10^3 = 0.05
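
The exponent n can be found by increasing it until the largest absolute value drops below 1. A minimal Python sketch, with hypothetical function names and the three ages used only as sample data:

```python
def decimal_scaling_exponent(values):
    """Smallest n such that max(|v|) / 10**n is less than 1."""
    largest = max(abs(v) for v in values)
    n = 0
    while largest / 10 ** n >= 1:
        n += 1
    return n


def decimal_scale(values):
    """Divide every value by 10**n, with n chosen as above."""
    n = decimal_scaling_exponent(values)
    return [v / 10 ** n for v in values]


ages = [0, 50, 150]
print(decimal_scaling_exponent(ages))  # 3
print(decimal_scale(ages))             # [0.0, 0.05, 0.15]
```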

Data Normalization
• Comparison between the methods
  - Decimal scaling only divides the values by a power of 10, so it preserves the shape of the data distribution. It acts similarly to image resizing in photo editing software (shrink / magnify).
  - Z-score normalization is the most used because the result has mean 0 and standard deviation 1, which is advantageous with certain statistical methods. It is a linear rescaling, so it does not by itself make the data normally distributed, and unlike the other two methods it does not bound the values to a fixed range.
  - Min-max normalization can accommodate any new range we want, not only [0, 1] or [-1, 1].

Discretization
• Discretization transforms data from numeric into nominal data type.
• Effects of discretization:
  - Smooths data.
  - Reduces noise.
  - Reduces data size.
  - Enables specific methods using nominal data.

Discretization
• Discretization methods
  - Manual methods:
      - Distribution analysis.
  - Automatic methods:
      - Binning.
          - Equal-width binning
          - Equal-depth binning
      - Regression analysis.
      - Cluster analysis.
      - Natural partitioning.

Discretization
• Equal-width binning
  - Given a range of values [min, max], we divide it into intervals of approximately the same width; either we set the width arbitrarily to w, or we set the desired number of bins to n, in which case w is calculated as:

w = (max - min) / n

  - Ex: if the range is [0, 100] and we want 4 bins, each bin will have a width of
(100 - 0) / 4 = 25
and, for integer values, the bins will be: [0, 24], [25, 49], [50, 74], [75, 100].
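
A minimal Python sketch of equal-width binning for this example. The function name equal_width_bin is hypothetical, and the sketch assumes that the maximum value is placed in the last bin so that [75, 100] is closed on the right.

```python
def equal_width_bin(value, lo, hi, n_bins):
    """Return the 0-based index of the equal-width bin containing value."""
    width = (hi - lo) / n_bins
    index = int((value - lo) // width)
    # The maximum itself would land in bin n_bins; fold it into the last bin.
    return min(index, n_bins - 1)


labels = ["[0, 24]", "[25, 49]", "[50, 74]", "[75, 100]"]
for age in (0, 24, 25, 74, 75, 100):
    print(age, "->", labels[equal_width_bin(age, 0, 100, 4)])
```

Replacing each number by its bin label is exactly the numeric-to-nominal transformation that discretization refers to.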

Discretization
• Equal-depth binning
  - Given a range of values [min, max], we place approximately the same number of instances in each bin. Dividing the total number of samples nb by the desired number of samples in each bin (depth) d gives the number of bins n:

n = nb / d

  - Ex: the range is [0, 100] and there are 100 samples with distinct integer values (every value from 0 to 100 except one, for example 99). If we want 20 samples in each bin, the number of bins will be:
100 / 20 = 5
and the bins will be: [0, 19], [20, 39], [40, 59], [60, 79], [80, 100].
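
The example can be reproduced by sorting the values and cutting the sorted list into chunks of d values. A minimal Python sketch (the function name equal_depth_bins is hypothetical):

```python
def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


# 100 distinct integer values in [0, 100]: everything except 99.
values = [v for v in range(101) if v != 99]
bins = equal_depth_bins(values, 20)

ranges = [(b[0], b[-1]) for b in bins]
print(len(bins))  # 5
print(ranges)     # [(0, 19), (20, 39), (40, 59), (60, 79), (80, 100)]
```

The last bin is wider than the others because it has to reach 100 to collect its 20 values, which shows that equal-depth bins do not have equal widths in general.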

Discretization
• Advantages of each method:
  - Equal-width binning is simpler, but it is very sensitive to outliers in the data.
  - Equal-depth binning scales well because it follows the distribution of the data, but the bin boundaries may be more difficult to interpret.
• Smoothing of data can be accomplished by replacing the values in a bin by a statistic such as the average (numeric data), median (numeric data), or mode (categorical data).
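
Smoothing by bin averages can be sketched as follows in Python. The data values and the function name smooth_by_bin_means are hypothetical, and the sketch assumes equal-depth bins built from the sorted values.

```python
from statistics import mean


def smooth_by_bin_means(values, depth):
    """Replace each value by the mean of its equal-depth bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Hypothetical measurements, three values per bin.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(data, 3)
print(smoothed)
```

Each group of three values collapses to a single representative (9, 22 and 29 here), which removes small fluctuations inside a bin.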

Data Reduction
• Feature selection.
• Sampling.
• Data compression.
• Data aggregation.
• etc.

Feature Selection
• Feature selection is a form of dimensionality reduction.
• A feature is also called a variable (or a column).
• It is very important in biomedical data because the number of available features is often large, which leads to the curse of dimensionality (Ex: number of gene expressions).
• It will be studied in a future module.

Data Sampling
• The sample needs to be representative.
• Main methods:
  - Simple random sampling with replacement.
  - Simple random sampling without replacement.
  - Stratified sampling.

Data Sampling

[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR, from Han and Kamber (2014)]
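
The three sampling methods can be sketched with Python's standard random module. This is an illustration only: the function names, the seed, the age-group labels, and the 10% sampling fraction are hypothetical.

```python
import random
from collections import defaultdict


def srswor(rows, k, rng):
    """Simple random sample without replacement: no row is drawn twice."""
    return rng.sample(rows, k)


def srswr(rows, k, rng):
    """Simple random sample with replacement: a row may be drawn again."""
    return rng.choices(rows, k=k)


def stratified_sample(rows, key, fraction, rng):
    """Draw the same fraction, without replacement, from every stratum."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical raw data: 30 "young" rows and 70 "senior" rows.
rows = [(i, "young" if i < 30 else "senior") for i in range(100)]

without_replacement = srswor(rows, 10, rng)
with_replacement = srswr(rows, 10, rng)
stratified = stratified_sample(rows, key=lambda row: row[1], fraction=0.1, rng=rng)
print(len(without_replacement), len(with_replacement), len(stratified))
```

Stratified sampling keeps the 30/70 proportion of the two groups in the sample, which is one way to make the sample representative.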
Introduction to R Language
• R is an open source programming environment for computation and graphics, used for statistical analysis and data science applications.
• R comprises a set of functions for statistical analysis and graphics, a programming language, a run-time interpreter, a debugger, numerous add-on packages, and script files.
• Packages provide added functionality and make the language extensible, since any researcher can contribute a package to R.
• In terms of programming language, R's syntax is close to that of S, while its underlying semantics are derived from Scheme.
• Developed originally by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand, it is now maintained by the "R core group" (http://www.R-project.org).

Principles of R
• Installation. Installers can be downloaded from a mirror listed on https://cran.r-project.org/mirrors.html. Installation starts automatically when you double-click the installer. It requires about 50 MB of disk space.
• Configuration. The user should select a working directory, which will contain all files read or written by R, and a memory size. Both can be set from the desktop shortcut to R, through its properties:
  - Change the working directory under Start in (Windows). This is the directory from which R will read, or into which R will write, by default.
  - Change the memory size by adding at the end of Target the number of GB wanted: --max-mem-size=3G. The memory limit varies depending on the available memory and the operating system.
• Some useful tools: Rgui, RStudio, Notepad++.
Principles of R
• Running. Double-clicking the desktop shortcut, or selecting R from the Start menu, opens the R window.
• Documentation. Important documents for getting started with R are the following:
  - An FAQ for R for Windows is available from http://cran.r-project.org/bin/windows/base/rw-FAQ.html.
  - An FAQ for R is available from http://cran.r-project.org/doc/FAQ/R-FAQ.html.
  - The user guide of R is entitled "Using R for Data Analysis and Graphics" and is available from: http://cran.r-project.org/doc/contrib/usingR.pdf.
  - Other documents are available from the documentation section of http://cran.fhcrc.org/.
  - Online documentation is available from R itself through help(name) or ?name.

Principles of R
• Important commands. Some important commands include:

ls() to list the objects in memory.
rm(list = ls()) to empty the memory.
rm(object) to remove an object from memory.
q() to quit.
summary(object) to display summary characteristics of an object.
class(object) to display the class (type) of an object.

Principles of R
• Packages are libraries of functions to use in addition to the standard functions. They need to be loaded explicitly. There are two types of packages:
  - Standard packages, which can be loaded from the Packages menu by choosing Load package in the graphical user interface (GUI).
  - Packages to install from a local zip file, which can be installed from the Packages menu by choosing Install package(s) from local zip files…, which offers to load a zipped package from the working directory.
• Packages can also be installed with install.packages().
• Once packages are installed, they can be loaded with library().

• Bioconductor (http://bioconductor.org) for bioinformatics packages.
• Anaconda (https://www.continuum.io/downloads) for data science, which includes Python, R, and Scala with their most popular packages (including Bioconductor).

Working with R

• Anaconda includes the Jupyter notebook, in which R can be run.
• In this course, the Jupyter notebook is provided from a link in the menu so that no local R installation is required.

• Watch the video
• Start Jupyter notebook from the provided link
