Data Science Essentials: Missing and Repeated Values
Data Cleansing
Usually, data requires some cleansing before it can be useful in creating predictive models.
Missing Values
Missing values are often indicated by blank (or null) cells, but they can be more difficult to spot. Some
systems automatically replace missing values with NA, and missing numeric values are sometimes
replaced with the text string NaN (not a number). Additionally, some systems insert a placeholder value
such as 0 or 9999 to represent a missing value, which can be a hard-to-find source of inaccuracies in a
model. For example, the table above includes a vanilla sales figure of 0. It is possible that no vanilla
cupcakes were sold on that date, but based on the relatively consistent values for vanilla sales in the
other rows, it is also possible that the actual sales figure is unknown and 0 has been inserted as a
placeholder. Determining whether a value is truly missing, and deciding what to do about it, depends on
your familiarity with the data and the goals of your analysis.
You can detect and treat missing values with the Clean Missing Data module in Azure ML, or you can
write custom R, Python, or SQL code. For more information about the Clean Missing Data module, see
https://msdn.microsoft.com/en-us/library/azure/dn906028.aspx.
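As an illustration, the following Python sketch uses pandas to detect and treat missing values. The file
name, column name, and placeholder values are hypothetical, chosen to match the cupcake example
above; they are not part of the original data.

import pandas as pd
import numpy as np

# Hypothetical file and column names, for illustration only
sales = pd.read_csv('daily_sales.csv')

# Convert placeholder values such as 0 or 9999 to true missing values
sales['vanilla'] = sales['vanilla'].replace({0: np.nan, 9999: np.nan})

# Count missing values in each column to gauge the extent of the problem
print(sales.isnull().sum())

# Option 1: remove rows that contain missing values
cleaned = sales.dropna()

# Option 2: impute missing values with the column mean
sales['vanilla'] = sales['vanilla'].fillna(sales['vanilla'].mean())

Whether to drop or impute depends, as noted above, on how much data you can afford to lose and on
whether a mean value is a plausible substitute for the true, unknown value.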
Duplicate Rows
The cupcake sales data described previously also includes two rows containing exactly the same values
for all columns. Because the dataset appears to represent daily sales totals, it is likely that no two rows
should share the same date; and since the other column values are identical, it is logical to assume that
these rows are duplicates and that one of them should be removed from the dataset.
However, handling duplicates may not always be this straightforward. In this example, we can assume
that each row is uniquely identified by a key column (date), but what should be done about a case
where two rows contain the same date but have different values in the other columns? You could opt to
keep the first or last row that occurs, or you could try to merge the duplicate rows by assigning an
average value for the numeric columns that share the same key.
Additionally, some datasets do not contain key columns that uniquely identify each row. In this case,
you may find multiple rows that contain the same values for every column and assume they are
duplicates – but be careful, it could also be that the rows represent different data points that happen to
exhibit exactly the same features. As with handling missing values, how you identify and treat duplicates
depends on your knowledge of the data and the use to which you intend to put it.
You can detect and remove duplicate rows with the Remove Duplicate Rows module in Azure ML, or
you can write custom R, Python, or SQL code. For more information about the Remove Duplicate Rows
module, see https://msdn.microsoft.com/en-us/library/azure/dn905805.aspx.
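The following Python sketch uses pandas to apply each of the strategies described above. The file and
column names are hypothetical, continuing the cupcake example.

import pandas as pd

sales = pd.read_csv('daily_sales.csv')  # hypothetical file name

# Remove rows that are exact duplicates across every column
sales = sales.drop_duplicates()

# If the date column is a unique key, keep only the first row per date
sales = sales.drop_duplicates(subset='date', keep='first')

# Alternatively, merge rows that share a key by averaging numeric columns
merged = sales.groupby('date', as_index=False).mean(numeric_only=True)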
Feature Engineering
Sometimes you may need to create calculated values in your dataset that don’t exist in the source data.
This may be to combine multiple dimensions in the data into a single composite feature, or it may be to
apply a logarithmic or other transformation to a numeric value in order to create a more distinct fit
between features and the labels that you want to use them to predict.
The generation of custom calculated columns is generally referred to as feature engineering, and is a
common technique in data modeling. You can use the Apply Math Operation Azure Machine Learning
module to calculate a new column, or you can use a custom R, Python, or SQL script. For more
information about the Apply Math Operation module, see https://msdn.microsoft.com/en-
us/library/azure/dn905975.aspx.
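For example, the following Python sketch uses pandas and NumPy to engineer a composite feature and
a log-transformed feature. The file and column names are hypothetical.

import pandas as pd
import numpy as np

sales = pd.read_csv('daily_sales.csv')  # hypothetical file name

# Composite feature: combine two dimensions into a single total
sales['total_sales'] = sales['vanilla'] + sales['chocolate']

# Log transformation: log1p computes log(1 + x), so zero values are safe
sales['log_total'] = np.log1p(sales['total_sales'])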
Outliers
The first step in identifying outliers is often to visualize the relationships between important features
and labels as a scatterplot, and to look for points that fall outside the apparent pattern in the data, as
shown here:
You should typically spend time comparing various plots to confirm that outliers are genuine, and then
decide how to treat them. Treatments include deleting the rows containing outliers, or replacing outlier
values with the maximum or minimum value (depending on whether they fall above or below the
normal range) or with the mean value of the column.
You can treat outliers with the Clip Values module, or you can write custom R, Python, or SQL scripts to
handle outliers. For more information about the Clip Values module, see
https://msdn.microsoft.com/en-us/library/azure/dn905918.aspx.
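As a sketch of these treatments in Python with pandas, the following assumes (as one common rule of
thumb, not taken from the original data) that values more than three standard deviations from the
mean are outliers; the file and column names are hypothetical.

import pandas as pd

sales = pd.read_csv('daily_sales.csv')  # hypothetical file name

# Assume values beyond three standard deviations of the mean are outliers
mean, std = sales['vanilla'].mean(), sales['vanilla'].std()
lower, upper = mean - 3 * std, mean + 3 * std

# Option 1: delete rows containing outliers
cleaned = sales[sales['vanilla'].between(lower, upper)]

# Option 2: clip outliers to the thresholds, similar in spirit to the
# Clip Values module
sales['vanilla'] = sales['vanilla'].clip(lower=lower, upper=upper)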
Scaling Data
Most datasets used in data science contain multiple numeric features, and these features are often
expressed in different units of measurement. For example, the following scatterplot shows the
relationship between the engine size and weight of automobiles.
When creating a model, the important thing is the relative relationship between the numeric features –
not necessarily the absolute values. In this case, the absolute numeric values are on completely different
scales; weights range between 1000 and 3500, and engine size ranges from 80 to 185. When you take
into consideration that there may be many more numeric features, each on their own scale of values, it
can become very difficult to compare multiple features in a model.
To address this problem, you can scale (or normalize) the numeric features in your dataset so that the
data values are within a consistent scale, without losing the relative relationships between the features.
For example, the following scatterplot shows the relationship between engine size and weight with the
values normalized to a common scale.
Note that the numerical values for the features are now on the same scale, but the relative
relationships between them are still apparent.
You can scale data using custom R or Python code, or you can use the Normalize Data module in Azure
Machine Learning. For more information about the Normalize Data module, see
https://msdn.microsoft.com/en-us/library/azure/dn905838.aspx.
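For instance, the following Python sketch uses scikit-learn's MinMaxScaler to rescale both features to
the range 0 to 1. The file and column names are hypothetical, chosen to match the automobile example
above.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

autos = pd.read_csv('autos.csv')  # hypothetical file name

# Rescale each feature to the range [0, 1] while preserving the
# relative relationships between the values
scaler = MinMaxScaler()
autos[['engine_size', 'weight']] = scaler.fit_transform(
    autos[['engine_size', 'weight']])

Min-max scaling is only one approach; z-score scaling (subtracting the mean and dividing by the
standard deviation) is a common alternative when extreme values would otherwise compress the scale.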
Note: You should generally normalize data after you have removed rows containing missing values,
duplicates, and outliers. This ensures that the scale is not skewed by extreme values that will not be
used by the model.