Descriptive Statistics Applications: Pavan Kumar A

This document discusses descriptive statistics and their applications in data cleaning. It notes that data wrangling transforms raw data into consistent data that can be analyzed, and that data scientists spend 80% of their data analysis time cleaning data. It then outlines the causes of poor data quality and the problems associated with dirty data. The document describes data cleaning as having two steps: detection and correction of errors. It discusses using summary, tabular and graphical descriptive statistics, such as the minimum, maximum, mean and standard deviation, to detect errors, for example by looking for outliers in histograms and scatter plots. Frequency analysis and logic checks are also discussed as ways to locate dirty data. Methods proposed for error correction include categorizing values and setting outliers to missing or to the mean.


DESCRIPTIVE STATISTICS:

APPLICATIONS

Pavan Kumar A
INTRODUCTION TO DATA CLEANING
• Data wrangling is the process of transforming raw data into consistent data that can be analyzed.
• Data cleaning is one of the primary pain points of data science.
• Data scientists spend 80% of their data analysis time cleaning data.[1]

[1] http://www.crowdflower.com/blog/2014/01/data-cleaning-with-crowdflower-the-80-percent-solution-for-data-scientists

Source: https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
RAW DATA
• Raw data can be hard to understand, even for those with advanced technical skills.
• To make this data easily understandable and user-friendly, it must be pre-processed and prepared for actual analysis.
• Causes of poor data quality
    • Data entry errors
    • False values for variables
    • Heaping data
    • Application errors or coding errors
    • Incomplete or outdated data
    • Differences in data representation among data sources
• Problems associated with dirty data
    • Invalid reports resulting in wrong interpretation

STEPS: DATA CLEANING
• Data cleaning is basically done in two steps: DETECTION and CORRECTION.
• Common problems to detect include the following:
    • Missing data coded as "999"
    • 'Not applicable' or 'blank' coded as "0"
    • Duplicated records
    • COLUMN SHIFT: data for one variable was entered under the adjacent column
    • Values that fail logic checks
• Support from a domain expert is also needed for data cleaning.
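A minimal sketch of two of the detection items above, sentinel codes and duplicated records. The records, the field layout and the 999 coding scheme are hypothetical and chosen only for illustration. Whether a "0" means a true zero or 'not applicable' usually cannot be decided by code alone, which is where the domain expert comes in.

```python
# Hypothetical survey records: (id, age, number_of_children).
# Assumption: 999 is the code used for missing data in every non-id field.
records = [
    (1, 34, 2),
    (2, 999, 0),
    (3, 41, 3),
    (3, 41, 3),    # the same record entered twice
    (4, 29, 999),
]
MISSING_CODE = 999


def recode_missing(row):
    """Replace the sentinel code with None in every field except the id."""
    return (row[0],) + tuple(None if v == MISSING_CODE else v for v in row[1:])


seen = set()
cleaned = []
for row in records:
    row = recode_missing(row)
    if row in seen:        # exact duplicate of an earlier record
        continue
    seen.add(row)
    cleaned.append(row)

for row in cleaned:
    print(row)
```

Recoding before deduplicating means two copies of a record still match each other when both contain the missing-data code.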
ERROR DETECTION
• Most errors can be detected using descriptive statistics.
• Descriptive statistics are of three types:
    • Summary statistics
    • Tabular statistics
    • Graphical statistics
• Summary statistics
    • Min and max
    • Mean
    • Median
    • Variance
    • SD (standard deviation)
ERROR DETECTION
Descriptive Statistics: Summary Analysis
• Look at the minimum and maximum values (the range).
• Check how likely each value is, in terms of the range or its z-score.
• Look at the mean, median and standard deviation.
• Example 1:

Source: http://www.tulane.edu/~panda2/Analysis2/datclean/stats_with_errors.html

    • ACPRVF: females' low arm circumference in cm (age < 5 yrs)
    • ACPRVM: males' low arm circumference in cm (age < 5 yrs)
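The summary checks above can be sketched as follows. The values are hypothetical arm circumferences in cm, not the data from Example 1, and the z-score cut-off of 2.5 is an assumption. In a sample of n values no z-score can exceed (n - 1) / sqrt(n), about 2.85 for n = 10, so the usual cut-off of 3 would never fire here.

```python
import statistics

# Hypothetical arm circumferences in cm; 145.0 mimics a misplaced decimal point.
values = [13.5, 14.0, 14.2, 13.8, 15.1, 14.6, 145.0, 13.9, 14.4, 14.8]

summary = {
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "variance": statistics.variance(values),  # sample variance
    "sd": statistics.stdev(values),           # sample standard deviation
}
for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")

# z-score: distance from the mean in units of the standard deviation.
Z_CUTOFF = 2.5  # assumed threshold, see the note above
flagged = [
    v for v in values
    if abs(v - summary["mean"]) / summary["sd"] > Z_CUTOFF
]
print("flagged:", flagged)
```

A mean far above the median, together with a maximum far from the bulk of the data, points to the same suspect value as the z-score does.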
ERROR DETECTION
• Descriptive Statistics: Graphical Analysis (Histogram)

Source: http://www.tulane.edu/~panda2/Analysis2/datclean/stats_with_errors.html
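As a rough illustration of what a histogram reveals, the sketch below bins hypothetical arm circumference values and prints a text histogram. The data and the bin width of 10 cm are assumptions. An isolated bar far from the rest marks values worth checking.

```python
from collections import Counter

# Hypothetical arm circumferences in cm; 145.0 mimics a data entry error.
values = [13.5, 14.0, 14.2, 13.8, 15.1, 14.6, 145.0, 13.9, 14.4, 14.8]

BIN_WIDTH = 10  # cm, assumed for illustration

# Map each value to the lower edge of its bin and count values per bin.
bins = Counter(int(v // BIN_WIDTH) * BIN_WIDTH for v in values)

for start in sorted(bins):
    label = f"{start}-{start + BIN_WIDTH}"
    print(f"{label:>8} | {'#' * bins[start]}")
```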
ERROR DETECTION
• Descriptive Statistics: Graphical Analysis (Scatter Plot)
• Some errors appear only when two variables are compared.
    • Outliers are one thing to look for.

Source: http://www.tulane.edu/~panda2/Analysis2/datclean/stats_with_errors.html
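One way to look for the two-variable outliers that a scatter plot shows is to fit a straight line and flag the points that lie far from it. This is a sketch of that idea, not the method used in the cited source. The (age, height) pairs are hypothetical and the cut-off of two residual standard deviations is an assumption. In the planted error, the age and the height are each plausible on their own and only the combination is suspicious.

```python
import statistics

# Hypothetical (age in months, height in cm) pairs; the last pair is a planted error.
pairs = [(6, 66), (12, 75), (18, 81), (24, 87),
         (36, 95), (48, 102), (54, 106), (6, 110)]

xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)

# Ordinary least-squares line: y = intercept + slope * x.
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual: vertical distance of each point from the fitted line.
residuals = [y - (intercept + slope * x) for x, y in pairs]
resid_sd = statistics.stdev(residuals)

THRESHOLD = 2.0  # assumed cut-off, in residual standard deviations
suspects = [p for p, r in zip(pairs, residuals) if abs(r) > THRESHOLD * resid_sd]
print("suspect pairs:", suspects)
```

Because the erroneous point also pulls the fitted line toward itself, a flagged pair should be treated as a prompt to check the record rather than as proof of an error.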
ERROR DETECTION
• Descriptive Statistics: Tabular Analysis (Frequency)
• Frequencies help to locate the 'dirty' data (unequal distribution) in the entered variables.
• Example 2: Baby ages
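A frequency table of this kind can be sketched with a counter. The ages below are hypothetical and are not the data from Example 2, and the valid range of 0 to 59 months is an assumption. The table shows heaping on round ages such as 12 and 24 months and one value that cannot be a baby's age.

```python
from collections import Counter

# Hypothetical baby ages in months, as entered.
ages_months = [3, 12, 12, 7, 24, 12, 24, 9, 12, 99, 24, 5]

freq = Counter(ages_months)

print("age  count")
for age, count in sorted(freq.items()):
    print(f"{age:>3}  {count}")

# Assumed valid range for this hypothetical variable: 0 to 59 months.
out_of_range = sorted(age for age in freq if not 0 <= age <= 59)
print("out of range:", out_of_range)
```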
ERROR DETECTION
• Logic Checks
• We can often detect errors in data simply by checking whether the responses are logical.
• Examples
    • We would expect responses to total 100%, not 110%.
    • A driving license issued to someone in the age group <18.
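Both examples above can be written as simple rules. The response shares, the license records and the minimum age of 18 are hypothetical values chosen to mirror the two examples.

```python
# Check 1: percentages of a single-choice question should total 100.
# Hypothetical response shares in percent.
responses = {"yes": 55.0, "no": 40.0, "undecided": 15.0}
total = sum(responses.values())
percent_ok = abs(total - 100.0) < 0.01
print(f"total = {total}%, consistent: {percent_ok}")

# Check 2: nobody below the minimum age should hold a driving license.
# Hypothetical (license id, holder age) records and assumed minimum age.
licenses = [("A01", 34), ("A02", 16), ("A03", 52)]
MIN_LICENSE_AGE = 18
underage = [license_id for license_id, age in licenses if age < MIN_LICENSE_AGE]
print("licenses held by under-age persons:", underage)
```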

ERROR CORRECTION
1. Categorize the values into ranges such as <=60% and >60% to 100%, and assign the values 0 and 1 respectively. (This eliminates the unexpected ranges.)
2. Set outliers to "missing" if there are very few errors.

3. Best way: set outliers to the MEAN (for multiple-variable analysis) when the data values are normally distributed.
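The three corrections can be sketched side by side. The scores are hypothetical percentages. Treating anything outside 0 to 100 as an error, and leaving such values uncategorized in option 1, are assumptions made for this example.

```python
import statistics

# Hypothetical percentage scores; 110.0 lies outside the valid 0-100 range.
scores = [45.0, 72.0, 60.0, 88.0, 110.0, 30.0]


def is_valid(value):
    """Assumed validity rule: a percentage must lie between 0 and 100."""
    return 0 <= value <= 100


# Option 1: categorize valid values (0 for <=60, 1 for >60 up to 100).
# Invalid values get no category (None), which is an assumption.
categories = [
    (0 if v <= 60 else 1) if is_valid(v) else None
    for v in scores
]

# Option 2: set erroneous values to missing.
as_missing = [v if is_valid(v) else None for v in scores]

# Option 3: replace erroneous values with the mean of the valid values.
mean_valid = statistics.mean(v for v in scores if is_valid(v))
as_mean = [v if is_valid(v) else mean_valid for v in scores]

print(categories)
print(as_missing)
print(as_mean)
```

Replacing values with the mean keeps every record usable in a multiple-variable analysis but reduces the variance of the variable, so it suits cases with only a few errors.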
THANK YOU !!!!
