
Descriptive Statistics Applications: Pavan Kumar A

This document discusses descriptive statistics and their applications in data cleaning. It notes that data wrangling transforms raw data into consistent data that can be analyzed, and that data scientists spend 80% of their time cleaning data. It then outlines sources of poor data quality and problems with dirty data. The document describes data cleaning as having two steps: detection and correction of errors. It discusses using summary, tabular and graphical descriptive statistics like minimum, maximum, mean and standard deviation to detect errors in the data through techniques like looking for outliers in histograms and scatter plots. Frequency analysis and logic checks are also discussed as ways to locate dirty data. Methods proposed for error correction include categorizing values, setting outliers to missing or mean values.

Uploaded by

naresh darapu

DESCRIPTIVE STATISTICS:

APPLICATIONS

Pavan Kumar A
INTRODUCTION TO DATA CLEANING
 Data Wrangling is the process of
transforming raw data into consistent
data that can be analyzed.
 Data cleaning is one of the primary pain
points of data science.
 Data scientists reportedly spend about 80% of their
analysis time cleaning data.[1]

1. http://www.crowdflower.com/blog/2014/01/data-cleaning-with-crowdflower-the-80-percent-solution-for-data-scientists

Source: https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
RAW DATA
 Raw data can be hard to understand, even for those with advanced technical
skills.
 To make this data understandable and usable, it must be
pre-processed and prepared for analysis.
 Causes of poor data quality

 Data entry errors


 False values for variables
 Heaping data
 Application errors or Coding errors
 Incomplete or outdated data
 Differences in data representation among data sources
 Problems associated with dirty data

 Invalid reports resulting in wrong interpretation


STEPS: DATA CLEANING
 Data cleaning is done in two steps: detection and correction.
 Common errors to detect include the following:
 Missing data coded as "999"
 'Not applicable' or 'blank' coded as "0"
 Duplicated records
 Column shift: data for one variable was entered under the adjacent column
 Logical inconsistencies, found through logic checks
 Support from a domain expert is also needed for data cleaning.
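Two of the items above, sentinel codes and duplicated records, can be detected mechanically. The following is a minimal sketch, assuming a hypothetical record layout of (id, age, weight) and assuming that 999 always means "missing" in this data set; neither assumption comes from the slides beyond the "999" code itself.

```python
# Hypothetical survey records: (id, age_years, weight_kg).
# 999 is assumed to be the sentinel code for "missing".
MISSING_CODE = 999

records = [
    (1, 34, 70.5),
    (2, 999, 82.0),   # age was not recorded
    (3, 29, 999),     # weight was not recorded
    (2, 999, 82.0),   # exact duplicate of record 2
    (4, 41, 65.2),
]


def recode_missing(row, code=MISSING_CODE):
    """Replace the sentinel code with None in every field except the id."""
    return (row[0],) + tuple(None if v == code else v for v in row[1:])


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the input order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


cleaned = drop_duplicates([recode_missing(r) for r in records])
for row in cleaned:
    print(row)
```

Recoding to None before computing any statistic matters: a 999 left in an age column would otherwise be averaged in as if it were a real age.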
ERROR DETECTION
 Most errors can be detected using descriptive statistics.
 Descriptive statistics are of three types:

 Summary Statistics
 Tabular Statistics
 Graphical Statistics
 Summary Statistics

 Min and Max


 Mean
 Median
 Variance
 SD (Standard Deviation)
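As an illustration, the summary statistics listed above can be computed with Python's standard `statistics` module. The values are made-up measurements, not data from the slides, and the sketch uses the sample (n - 1) versions of variance and standard deviation, which is an assumption about what the deck intends.

```python
import statistics

# Hypothetical measurements in cm (illustrative only).
values = [12.5, 13.1, 14.0, 13.6, 12.9, 14.4, 13.3]

summary = {
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "variance": statistics.variance(values),  # sample variance (n - 1)
    "sd": statistics.stdev(values),           # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.3f}")
```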
ERROR DETECTION
Descriptive Statistics : Summary Analysis
 Look at the minimum and maximum values (the range) of each variable.
 Judge how plausible each value is from its position in the range or its z-score.
 Look at the mean, median and standard deviation.
 Example 1:

Source: http://www.tulane.edu/~panda2/Analysis2/datclean/stats_with_errors.html

 ACPRVF: females' low arm circumference in cm (age < 5 yrs)
 ACPRVM: males' low arm circumference in cm (age < 5 yrs)
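The z-score check mentioned above can be sketched as follows. The arm circumference values are hypothetical (they are not the Tulane example data), and the cutoff of 3 standard deviations is a common rule of thumb assumed here rather than a value given in the slides.

```python
import statistics

# Hypothetical arm circumferences in cm; 150.0 mimics a misplaced decimal point.
values = [14.2, 15.1, 13.8, 16.0, 14.9, 15.5, 14.0, 15.8, 13.5, 16.2,
          14.7, 15.3, 14.4, 15.0, 13.9, 16.1, 14.6, 15.2, 14.8, 15.4, 150.0]

Z_CUTOFF = 3.0  # assumed rule-of-thumb threshold

mean = statistics.mean(values)
sd = statistics.stdev(values)

# z-score: how many standard deviations a value lies from the mean.
z_scores = [(v - mean) / sd for v in values]
flagged = [v for v, z in zip(values, z_scores) if abs(z) > Z_CUTOFF]

print(f"min={min(values)}, max={max(values)}, mean={mean:.2f}, sd={sd:.2f}")
print("suspect values:", flagged)
```

Note that a single extreme value inflates both the mean and the standard deviation, so in small samples a z-score cutoff can miss errors; checking the minimum and maximum alongside it is a useful safeguard.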
ERROR DETECTION
 Descriptive Statistics : Graphical Analysis (Histogram)

Source: http://www.tulane.edu/~panda2/Analysis2/datclean/stats_with_errors.html
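The original histogram image is not reproduced here. As a rough stand-in, the sketch below bins hypothetical values into fixed-width intervals and prints a text histogram; the data and the 5 cm bin width are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical arm circumferences in cm; 41.5 is a suspicious entry.
values = [13.5, 13.8, 14.0, 14.2, 14.4, 14.6, 14.7, 14.8, 14.9, 15.0,
          15.1, 15.2, 15.3, 15.4, 15.5, 15.8, 16.0, 16.1, 16.2, 41.5]

BIN_WIDTH = 5.0  # assumed bin width


def bin_start(value, width=BIN_WIDTH):
    """Return the lower edge of the bin that contains value."""
    return (value // width) * width


counts = Counter(bin_start(v) for v in values)

# Print every bin between the lowest and highest occupied one, so that
# empty bins (the gap before an isolated bar) stay visible.
start = min(counts)
while start <= max(counts):
    n = counts.get(start, 0)
    print(f"{start:5.1f}-{start + BIN_WIDTH:5.1f} | {'#' * n}")
    start += BIN_WIDTH
```

A bar separated from the main body of the distribution by several empty bins is the textual equivalent of the isolated bar one would look for in a plotted histogram.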
ERROR DETECTION
 Descriptive Statistics : Graphical Analysis (Scatter Plot)
 Some errors appear only when two variables are compared with each other.

 Outliers in the scatter plot are one such case to look for.

Source: http://www.tulane.edu/~panda2/Analysis2/datclean/stats_with_errors.html
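One numeric counterpart to inspecting a scatter plot is to fit a straight line and look for points with unusually large residuals. The sketch below is an assumption-laden illustration, not the method used in the slides: the height and weight pairs are hypothetical, the line is an ordinary least-squares fit, and residuals are screened with a median-based modified z-score using the commonly cited cutoff of 3.5.

```python
import statistics

# Hypothetical (height_cm, weight_kg) pairs. In the last pair each value is
# within the range of its own variable, but the combination is implausible.
pairs = [(50, 3.5), (60, 6.0), (70, 8.5), (80, 11.0),
         (90, 13.0), (100, 15.5), (55, 14.0)]

xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
mean_x = statistics.mean(xs)
mean_y = statistics.mean(ys)

# Ordinary least-squares line: y = intercept + slope * x.
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [y - (intercept + slope * x) for x, y in pairs]

# Modified z-score of each residual, based on the median and the
# median absolute deviation (MAD), which are less distorted by outliers.
med = statistics.median(residuals)
mad = statistics.median([abs(r - med) for r in residuals])
CUTOFF = 3.5  # assumed rule-of-thumb threshold

suspects = [p for p, r in zip(pairs, residuals)
            if 0.6745 * abs(r - med) / mad > CUTOFF]
print("suspect pairs:", suspects)
```

Neither a minimum/maximum check nor a single-variable histogram would catch the flagged pair, which is why the two-variable view is needed.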
ERROR DETECTION
 Descriptive Statistics : Tabular Analysis (Frequency)
 Frequency tables help locate 'dirty' data, such as unexpected values or an
uneven distribution, among the entered variables.
 Example 2: Baby ages
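The table for this example is not reproduced here. As a hedged stand-in, the sketch below builds a frequency table for hypothetical baby ages in months and lists the values that fall outside an assumed valid range of 0 to 23 months; both the data and the range are illustrative assumptions.

```python
from collections import Counter

# Hypothetical baby ages in months; -1 and 36 should not occur.
ages_months = [1, 2, 2, 3, 3, 3, 4, 5, 6, 6, 7, 9, 11,
               12, 12, 12, 12, 12, 12, 36, -1]

VALID_AGES = range(0, 24)  # assumed valid range: 0 to 23 months

freq = Counter(ages_months)

# Frequency table, one row per distinct value.
for age in sorted(freq):
    print(f"{age:>4} | {freq[age]}")

unexpected = sorted(age for age in freq if age not in VALID_AGES)
print("values outside the valid range:", unexpected)
```

The table also makes uneven distributions visible: here age 12 occurs six times while its neighbours occur once or not at all, which may indicate that ages were rounded to a whole year.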
ERROR DETECTION
 Logic Checks
 We can often detect errors in data simply by checking whether the responses are logical.
 Examples
 Response percentages should total 100%, not 110%.

 A driving license should not be recorded for anyone under 18.
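Logic checks like the two examples above can be written as named rules and applied to every record. This is a minimal sketch: the record fields (`age`, `has_license`, `pct_yes`, `pct_no`) are hypothetical names, and the minimum licensing age of 18 is taken from the slide's example.

```python
# Hypothetical records; field names are illustrative assumptions.
records = [
    {"id": 1, "age": 34, "has_license": True,  "pct_yes": 60, "pct_no": 40},
    {"id": 2, "age": 16, "has_license": True,  "pct_yes": 55, "pct_no": 45},
    {"id": 3, "age": 45, "has_license": False, "pct_yes": 70, "pct_no": 40},
    {"id": 4, "age": 17, "has_license": False, "pct_yes": 50, "pct_no": 50},
]

MIN_LICENSE_AGE = 18  # from the slide's example

# Each rule is (description, predicate); the predicate returns True
# when the record is logically consistent.
RULES = [
    ("percentages sum to 100",
     lambda r: r["pct_yes"] + r["pct_no"] == 100),
    ("license holder is at least 18",
     lambda r: not r["has_license"] or r["age"] >= MIN_LICENSE_AGE),
]

violations = [(rec["id"], name)
              for rec in records
              for name, is_ok in RULES
              if not is_ok(rec)]

for rec_id, name in violations:
    print(f"record {rec_id} fails check: {name}")
```

Keeping the rules in a list makes it easy for a domain expert to review them and add new ones without touching the checking loop.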


ERROR CORRECTION
1. Categorize the values into ranges, such as <=60% and >60% up to 100%, and
assign them the values 0 and 1 respectively. (This eliminates unexpected ranges.)
2. Set outliers to "missing" if there are only a few errors.

3. Preferred approach: set outliers to the mean (for multivariable analysis)
when the data values are approximately normally distributed.
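The three correction options can be sketched as small functions. Several details are assumptions not stated in the slides: values at or below the 60% cutoff map to 0, outliers are identified by a hypothetical plausible range of 5 to 30 cm, and the replacement mean is computed from the non-outlier values only.

```python
import statistics


def categorize(pct, cutoff=60):
    """Option 1: map a percentage to 0 (<= cutoff) or 1 (> cutoff)."""
    return 0 if pct <= cutoff else 1


def is_outlier(value, low=5.0, high=30.0):
    """Hypothetical rule: values outside [low, high] cm count as outliers."""
    return not (low <= value <= high)


def outliers_to_missing(values):
    """Option 2: replace outliers with None (missing)."""
    return [None if is_outlier(v) else v for v in values]


def outliers_to_mean(values):
    """Option 3: replace outliers with the mean of the remaining values."""
    mean = statistics.mean(v for v in values if not is_outlier(v))
    return [mean if is_outlier(v) else v for v in values]


measurements = [14.0, 15.0, 16.0, 150.0]
print([categorize(p) for p in (45, 60, 61, 100)])
print(outliers_to_missing(measurements))
print(outliers_to_mean(measurements))
```

Setting a value to missing keeps the record honest about what is unknown, while mean replacement keeps the record usable in analyses that cannot handle gaps; which trade-off is acceptable depends on the analysis.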
THANK YOU !!!!
