BDA -Statistical Inference, Exploratory Data Analysis, and the Analytics Process



Data Analytics

2 Data Preprocessing, Statistical Inference, EDA, and the Analytics Process

Compiled & Edited by

Babar Yaqoob Khan
Visiting Lecturer – Data Science
Department of Information Technology
University of the Punjab – Gujranwala Campus
WHAT IS IN IT FOR YOU?
 Data
 Definition, Types (Structured Data, Semi-Structured Data, Un-Structured Data), Sources, Qualities & Importance
 The information processing cycle
 Data Preprocessing (Sampling, Cleansing, Aggregation, Dimensionality Reduction, Feature Subset Selection,
Feature Creation, Integration, Discretization and Binarization, and Transformation)

 Statistical Inference
 Definition & Objectives, Sampling, Statistical experiment and Probability
 Exploratory Data Analysis (EDA)
 Definition & Objectives, EDA Process and Example
 The Data Analytics Process
 Definition & Objectives, the Process diagram
 Data Analytical Life Cycle
 Discovery, Data preparation, Model planning, Model building, Model Evaluation, Communicate results, Operationalize
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 2
Data – Definition, Types, Sources, Qualities & Importance
 Data
 Facts and figures in raw or unorganized form (such as letters, numbers, or symbols) that
refer to, or represent, conditions, ideas, or objects.

 Different types or formats of data:

• Numbers, Characters or Strings, Time and Date
• Pictures/Images, Graphs, and Maps
• Documents, E-mails, Tweets, and Newsfeeds etc.
• Audio and Video streams
• Formats: XML, CSV, TSV, SQL, JSON, Text etc.
• Records: user-level data, timestamped event data

 Data can be stored in files, data repositories or in databases

Types of Data Measurements

 Nominal scale

 Business activities – sale & purchase of products

 Manufacturing process – production and assembling of products

 Transportation – transportation of people and products from place to place

 Sensing & monitoring – data from sensors (in space, oceans, etc.) and CCTV cameras

 Human interaction – emails, audio, video and textual communication

 … … ….

Data Pre-processing

Data Science/Analytics Process

[Figure: Data Science/Analytics Process diagram. Node labels: Reality; Instrument Data Sources; Business Problem; Raw Data Collection; Data Pre-Processing; Clean Dataset; Exploratory Data Analysis; Data Processing; Visualization/Communicate Results; Make Decisions; Data Product (Decision Support, Business Intelligence, Recommender Systems, Business Forecasting (Prediction)).]
Population vs. Sample

 Population (N)
 Includes all of the elements from a set of data e.g.,
• The entire US population i.e., 341.96 million (341,963,408) or
• The entire Pakistan population i.e., 252.36 million (252,363,571)
• The entire world population i.e., 8.2 billion
• Set of objects, such as tweets or photographs

 Sample (n)
 Consists of one or more observations drawn from the population

n<N

Sampling & Types
 Sampling
• Technique mainly employed for data selection from population
• Often used both for preliminary investigation and the final data analysis

 Sampling Types
 Simple Random Sampling
• Equal probability of selecting any item

 Stratified Sampling
• Split the data into partitions and draw random samples from each partition

 Systematic Sampling
• Select every nth item from a list
• For instance, from a list of 1,000 people, choosing every 10th person gives a sample of 100

 Cluster Sampling
• The population is divided into clusters, usually based on geographical areas or natural groupings
• A few clusters are randomly selected, and all members within those clusters are surveyed
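
The four sampling types above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 1,000 people, the `region` field used as stratum and cluster, and all function names are hypothetical and do not come from the slides.

```python
import random

def simple_random_sample(population, n, rng):
    """Simple random sampling: every item is equally likely to be picked."""
    return rng.sample(population, n)

def stratified_sample(population, key, fraction, rng):
    """Stratified sampling: partition by key, then sample within each partition."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def systematic_sample(population, step, start=0):
    """Systematic sampling: take every step-th item, starting at index start."""
    return population[start::step]

def cluster_sample(population, key, n_clusters, rng):
    """Cluster sampling: pick whole clusters at random and keep all their members."""
    clusters = sorted({key(item) for item in population})
    chosen = set(rng.sample(clusters, n_clusters))
    return [item for item in population if key(item) in chosen]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 people spread evenly over 4 regions
people = [{"id": i, "region": "ABCD"[i % 4]} for i in range(1000)]

srs = simple_random_sample(people, 100, rng)
strat = stratified_sample(people, lambda p: p["region"], 0.1, rng)
syst = systematic_sample(people, 10)
clus = cluster_sample(people, lambda p: p["region"], 2, rng)

print(len(srs), len(strat), len(syst), len(clus))  # 100 100 100 500
```

Note that the cluster sample is larger than the others because every member of each chosen cluster is kept.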

Sample Size

Ideal Ratio: 70:30

Why Data Pre-processing?

 Data in the real world is dirty

 GIGO (garbage in, garbage out) – good data is a prerequisite for producing effective models of any type

 Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., occupation=“ ”
 Noisy: containing errors or outliers
• e.g., Salary=“-10”
 Inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1, 2, 3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
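
As a rough illustration of the three kinds of dirty data, the sketch below flags incomplete, noisy, and inconsistent records. The records, the field names, and the reference date `TODAY` are all assumptions made up for this example.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative only
records = [
    {"occupation": "engineer", "salary": 5200, "age": 28, "birthday": date(1997, 7, 3)},
    {"occupation": "",         "salary": 4100, "age": 35, "birthday": date(1990, 1, 15)},
    {"occupation": "teacher",  "salary": -10,  "age": 41, "birthday": date(1984, 5, 9)},
    {"occupation": "clerk",    "salary": 3000, "age": 42, "birthday": date(1997, 7, 3)},
]
TODAY = date(2025, 9, 1)  # assumed reference date for computing ages

def age_on(birthday, today):
    """Age in whole years on the given day."""
    years = today.year - birthday.year
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years

def quality_issues(rec):
    """Return a list of data-quality problems found in one record."""
    issues = []
    if not rec["occupation"].strip():
        issues.append("incomplete: occupation missing")
    if rec["salary"] < 0:
        issues.append("noisy: negative salary")
    if age_on(rec["birthday"], TODAY) != rec["age"]:
        issues.append("inconsistent: age does not match birthday")
    return issues

report = {i: quality_issues(r) for i, r in enumerate(records)}
for i, issues in report.items():
    print(i, issues or "clean")
```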

Why is Data Dirty?
Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected and when it is
analysed.
– Human/hardware/software problems

Noisy data (incorrect values) may come from

– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission

Inconsistent data may come from

– Different data sources
– Functional dependency violation (e.g., modify some linked data)
Data Preprocessing – Major Tasks

 Data Cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
 Data Integration
 Integration of multiple databases or files
 Data Transformation
 Normalization and aggregation
 Data Reduction
 Obtains reduced representation in volume but produces the same or similar analytical
results
 Data Discretization & Binarization
 Part of data reduction but with particular importance for numerical data

Forms of Data Preprocessing

Data Preprocessing – Cleaning

 Importance
 Garbage in Garbage out Principle (GIGO)

 Data Cleaning Tasks

• Fill in missing values

• Identify outliers and manage noisy data

• Correct inconsistent data

• Resolve redundancy caused by data integration

Data Preprocessing – Cleaning

 Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data

 Missing data may be due to

• equipment malfunction
• data inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not considered important at the time of entry
• history or changes of the data not registered
 Missing data may need to be inferred
Data Preprocessing – Cleaning

 How to Handle Missing Data?

 Ignore the tuple: usually done when the class label is missing (assuming the task is
classification); not effective when the percentage of missing values per attribute
varies considerably.

 Fill in the missing value manually: tedious and often infeasible

 Fill it in automatically with

 a global constant: e.g., “unknown” (which in effect creates a new class)

 the attribute mean for all data points belonging to the same class: smarter

 the most probable value: inference-based, such as a Bayesian formula or a decision tree
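
Two of the automatic strategies, filling with a global constant and filling with the class mean, might look like the sketch below. The rows, the `segment` class field, and the function names are hypothetical. A class with no observed values at all would need extra handling that is left out here.

```python
from statistics import mean

# Hypothetical sales records; None marks a missing income value
rows = [
    {"segment": "retail",    "income": 30000},
    {"segment": "retail",    "income": None},
    {"segment": "retail",    "income": 34000},
    {"segment": "corporate", "income": 90000},
    {"segment": "corporate", "income": None},
    {"segment": "corporate", "income": 110000},
]

def fill_with_constant(rows, field, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [{**r, field: constant if r[field] is None else r[field]} for r in rows]

def fill_with_class_mean(rows, field, class_field):
    """Replace a missing value with the mean of its own class."""
    observed = {}
    for r in rows:
        if r[field] is not None:
            observed.setdefault(r[class_field], []).append(r[field])
    class_means = {c: mean(values) for c, values in observed.items()}
    return [
        {**r, field: class_means[r[class_field]] if r[field] is None else r[field]}
        for r in rows
    ]

by_constant = fill_with_constant(rows, "income")
by_class_mean = fill_with_class_mean(rows, "income", "segment")
print(by_constant[1]["income"], by_class_mean[1]["income"], by_class_mean[4]["income"])
```

The class-mean fill keeps retail and corporate incomes apart, which is why the slide calls it the smarter option.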
Data Preprocessing – Cleaning

 How to Handle Noisy Data ?

 Binning
 First sort data and partition into (equal-frequency) bins
 Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries,
etc.

 Regression
 Smooth by fitting the data into regression functions

 Clustering
 Detect and remove outliers

 Combined Computer and Human Inspection

 Detect suspicious values and check by human (e.g., deal with possible outliers)
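
Smoothing by binning can be sketched as follows. The price list is illustrative sample data, the helper names are hypothetical, and the binning step assumes the number of values divides evenly by the number of bins.

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Sort the data and cut it into bins holding the same number of values.
    Assumes len(values) is divisible by n_bins."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's two boundary values."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
bins = equal_frequency_bins(prices, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```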
Data Preprocessing – Cleaning

 Simple Discretization Methods: Binning

 Equal-width (distance) partitioning

 Divides the range into N intervals of equal size: uniform grid
 If A and B are the lowest and highest values of the attribute, the width of intervals will be:
W = (B – A)/N.
 The most straightforward, but outliers may dominate presentation
 Skewed data is not handled well

 Equal-depth (frequency) partitioning

 Divides the range into N intervals each containing approximately same number of data
points
 Good data scaling
 Managing categorical attributes can be tricky
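
The difference between the two partitioning methods shows up clearly when the data contains an outlier. The sketch below uses made-up values and hypothetical function names, and it assumes the values are not all identical (otherwise the equal-width interval would be zero).

```python
def equal_width_bins(values, n_bins):
    """Equal-width partitioning: N intervals of width W = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins  # assumes b > a
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        index = min(int((v - a) / width), n_bins - 1)  # the maximum goes in the last bin
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Equal-depth partitioning: each interval holds about the same number of points."""
    data = sorted(values)
    n = len(data)
    return [data[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

values = [5, 7, 8, 9, 10, 11, 12, 13, 100]  # 100 is an outlier

by_width = equal_width_bins(values, 3)
by_depth = equal_depth_bins(values, 3)

print(by_width)  # [[5, 7, 8, 9, 10, 11, 12, 13], [], [100]]
print(by_depth)  # [[5, 7, 8], [9, 10, 11], [12, 13, 100]]
```

With equal-width bins the single outlier stretches the range so that almost all values fall into the first interval, while equal-depth bins stay balanced.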
Data Preprocessing – Cleaning

[Figures not reproduced: illustrations of Binning, Regression, and Clustering]

Data Preprocessing – Integration

Data integration:

 Combines data from multiple sources into a coherent store

 Schema integration: e.g., A.cust-id ≡ B.cust-#

 Integrate metadata from different sources

 Entity identification problem:

 Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton

 Detecting and resolving data value conflicts

 For the same real world entity, attribute values from different sources are different

 Possible reasons: different representations, different scales, e.g., metric vs. British units
Data Preprocessing – Data Integration

Handling Redundancy in Data Integration

 Redundant data often occur when multiple databases are integrated

 Object identification: The same attribute or object may have different names in
different databases
 Derivable data: One attribute may be a “derived” attribute in another table, e.g.,
annual revenue (from monthly income data)

 Redundant attributes can often be detected by correlation analysis

 Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
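
For numerical attributes, one common form of correlation analysis is the Pearson correlation coefficient. The slides do not say which measure is intended, so treat this as an assumption. In the sketch below, the attribute values are made up and `annual_revenue` is deliberately derived from `monthly_income` to mimic a redundant attribute.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical attributes from two integrated tables
monthly_income = [10, 12, 9, 15, 11]
annual_revenue = [12 * m for m in monthly_income]  # derived attribute
store_id = [1, 4, 5, 3, 2]                         # unrelated attribute

r_derived = pearson(monthly_income, annual_revenue)
r_unrelated = pearson(monthly_income, store_id)

print(round(r_derived, 3), round(r_unrelated, 3))
```

A coefficient close to +1 or -1 suggests that one of the two attributes may be redundant. Where to draw the line is a judgment call.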

Data Preprocessing – Data Integration

Correlation Analysis (Numerical Data)

Data Preprocessing - Data Integration

Correlation Analysis (Categorical Data)

Correlation Analysis (Categorical Data)

                            Play Chess   Not Play Chess   Sum (row)

Like Science Fiction           250            200            450
Not Like Science Fiction        50           1000           1050
Sum (col.)                     300           1200           1500

Probability to play chess: P(chess) = 300/1500 = 0.2

Probability to like science fiction: P(SciFi) = 450/1500 = 0.3
If science fiction and chess playing are independent attributes, then the
probability to like SciFi AND play chess is

P(SciFi, chess) = P(SciFi) · P(chess) = 0.06

That means, we expect 0.06 · 1500 = 90 such cases (if they are independent)
Correlation Analysis (Categorical Data)

                            Play Chess   Not Play Chess   Sum (row)

Like Science Fiction         250 (90)         200            450
Not Like Science Fiction        50           1000           1050
Sum (col.)                     300           1200           1500

χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on
the data distribution in the two categories)

The observed count (250) is far above the expected count (90), which shows that
like_science_fiction and play_chess are correlated in the group!
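
The calculation itself is not reproduced in these slides. A sketch using the standard chi-square statistic, the sum over all cells of (observed – expected)² / expected, is given below. Only the four observed counts come from the table; the dictionary keys and variable names are hypothetical.

```python
# Observed counts from the contingency table
observed = {
    ("like_scifi", "chess"): 250,
    ("like_scifi", "no_chess"): 200,
    ("not_like_scifi", "chess"): 50,
    ("not_like_scifi", "no_chess"): 1000,
}
rows = ["like_scifi", "not_like_scifi"]
cols = ["chess", "no_chess"]

total = sum(observed.values())
row_sum = {r: sum(observed[r, c] for c in cols) for r in rows}
col_sum = {c: sum(observed[r, c] for r in rows) for c in cols}

# Expected count under independence: row total * column total / grand total
expected = {(r, c): row_sum[r] * col_sum[c] / total for r in rows for c in cols}

chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

print(expected["like_scifi", "chess"])  # 90.0
print(round(chi2, 2))                   # 507.94
```

A 2 x 2 table has one degree of freedom, for which the critical value at the 0.001 significance level is about 10.83. The computed statistic is far larger, so the independence hypothesis is rejected.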

Data Preprocessing – Reduction, Discretization
 Data Reduction (Dimensionality Reduction)
• Obtains reduced representation in volume but produces the same or similar analytical
results
 Feature Subset Selection / Principal Component Analysis (PCA)
 Singular Value Decomposition (SVD)

 Data Discretization

• Part of data reduction but with particular importance for numerical data
• Also called “binning”

Data Preprocessing – Transformation
 Data Preprocessing – Transformation
• Maps entire set of values of an attribute to a new set of values
• Data standardization and normalization (by clustering and binning)

 Smoothing: remove noise from data

 Aggregation: summarization
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling

 Attribute/feature construction
 New attributes constructed from the given ones
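
The three normalization methods named above can be sketched as follows. The input values and function names are made up for illustration. The z-score here uses the population standard deviation, which is one of several possible conventions.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescale linearly into [new_min, new_max]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Decimal scaling: divide by the smallest power of 10 that brings
    every absolute value below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [200, 300, 400, 600, 1000]  # illustrative values

print(min_max(incomes))          # [0.0, 0.125, 0.25, 0.5, 1.0]
print([round(z, 2) for z in z_score(incomes)])
print(decimal_scaling(incomes))  # every value divided by 10**4
```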
Data Preprocessing – Feature Creation
 Feature Creation
• Original attributes not always best representation of information
• Creates new features which are more efficient/focused

 Methodologies
 Feature Extraction – Domain Specific
• Derived features
 Feature Construction
• Combine multiple features to construct new feature(s)
 Mapping Data to New Space
• Fourier Transform - what frequencies are present in your signal
• Wavelet Transform - what frequencies are present and where (or at what scale)

Data Preprocessing – Discretization & Binarization

 Discretization & Binarization

• Convert the data into discrete form, then binarize it to accommodate certain machine
learning algorithms/models
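
One possible reading of this two-step conversion is: map a numeric attribute to labelled intervals, then encode each label as a 0/1 vector (one-hot encoding). The cut points, labels, and function names below are assumptions chosen for the example.

```python
AGE_EDGES = [18, 40, 65]  # hypothetical cut points (upper bounds, exclusive)
AGE_LABELS = ["child", "young_adult", "middle_aged", "senior"]

def discretize(value, edges, labels):
    """Map a number to the label of the first interval whose upper bound exceeds it."""
    for upper, label in zip(edges, labels):
        if value < upper:
            return label
    return labels[-1]  # at or above the last edge

def binarize(label, labels):
    """One-hot encode a label: a 1 in its own position, 0 elsewhere."""
    return [1 if label == candidate else 0 for candidate in labels]

ages = [7, 25, 52, 80]
discrete = [discretize(a, AGE_EDGES, AGE_LABELS) for a in ages]
binary = [binarize(d, AGE_LABELS) for d in discrete]

print(discrete)  # ['child', 'young_adult', 'middle_aged', 'senior']
print(binary)    # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```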

Statistical Inference

Statistical Inference – Population and Sample
 Population
 Includes all of the elements from a set of data e.g.,
• The entire Pakistani population i.e., 247 million or
• The entire world population i.e., 8 billion
• Set of objects, such as tweets or photographs
 Sample
 It consists of one or more items drawn from the population
 For example, 1000 Pakistanis selected from all provinces of Pakistan
 Size of the sample (n) is always less than size of the population (N): n < N
 Sample may not be totally representative of the population

Statistical Inference
 Statistical inference
 It is the process of estimating the parameters of a population using random sampling.
 Inference also tests the reliability of the estimates by calculating their uncertainty.

 Purpose and benefits

 Enables us to understand the population without studying all of its items.
 Minimizes the cost of understanding the population.
 It remains the only possible option when the whole population is not accessible.
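
A minimal sketch of this idea: estimate a population mean from a random sample and attach an uncertainty to it. The synthetic population, the sample size, and the use of a normal-approximation 95% interval are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100,000 incomes (normally we could not see all of it)
population = [rng.gauss(50_000, 12_000) for _ in range(100_000)]

# Draw a simple random sample of n = 1,000 items
sample = rng.sample(population, 1_000)

estimate = mean(sample)                         # point estimate of the population mean
std_error = stdev(sample) / sqrt(len(sample))   # uncertainty of that estimate
interval = (estimate - 1.96 * std_error,        # approximate 95% confidence interval
            estimate + 1.96 * std_error)

true_mean = mean(population)  # available here only because the population is synthetic
print(f"estimate = {estimate:.0f}, 95% interval = ({interval[0]:.0f}, {interval[1]:.0f})")
print(f"true mean = {true_mean:.0f}")
```

Studying 1,000 items instead of 100,000 gives an estimate together with a statement of how far off it is likely to be.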

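Statistical Inference

[Figure: a Sample is drawn from the Population by Sampling; the population Parameters are obtained from the sample by Estimation; together these steps make up Statistical Inference.]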
Statistical Experiment
 A statistical experiment has three properties:
 The experiment can have more than one possible outcome
 Each possible outcome can be specified in advance
 The outcome of the experiment depends on chance

 For instance, tossing a coin is a statistical experiment

 Outcomes are:
• More than one: Head or Tail
• Specified in advance: { Head, Tail }
• Dependent on chance: unknown in advance until the coin is tossed (50% chance of Head and 50% chance of Tail)

Variables or Parameters
 Variable or Parameter
 It represents the value of an attribute of an item in the population, e.g., the name or color of an item.
 A random variable can take on any of the specified values (domain).
 A random variable takes a value after a statistical experiment.

[Figure: when x = 7 is assigned directly, x is a variable; when x = 7 results from a statistical experiment, x is a random variable.]

Probability
 Probability
 It is a measure of the likelihood that an event will happen.
 A quantitative measure that always takes a value between 0 and 1.
 Example:
 Tossing a coin is a statistical experiment.
 It can result in two outcomes: heads (H) or tails (T)
 The chance of getting a ‘head’ is its probability:
Probability = Favorable outcomes / Possible outcomes
P(H) = 1/2 = 0.5
P(T) = 1/2 = 0.5
Probability Distribution
 Probability distribution

 It links each outcome of a statistical experiment with its probability of occurrence

 For instance, you toss a coin two times
 Possible outcomes = {HH, HT, TH, TT}

 Let X = number of Heads

 Possible values of X = { 0, 1, 2 }
• P(X = 0) = 1/4 = 0.25, no Heads = { TT }
• P(X = 1) = 2/4 = 0.50, one Head = { HT, TH }
• P(X = 2) = 1/4 = 0.25, two Heads = { HH }
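
The same distribution can be obtained by enumerating the outcomes, assuming a fair coin so that all four outcomes are equally likely. The variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing a fair coin twice
outcomes = ["".join(tosses) for tosses in product("HT", repeat=2)]

# X = number of Heads in each outcome
heads = Counter(outcome.count("H") for outcome in outcomes)

# Probability distribution: each value of X linked to its probability
distribution = {x: Fraction(n, len(outcomes)) for x, n in sorted(heads.items())}

print(outcomes)  # ['HH', 'HT', 'TH', 'TT']
for x, p in distribution.items():
    print(f"P(X = {x}) = {p} = {float(p):.2f}")
```

The probabilities sum to 1, as they must for any probability distribution.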

Data Modeling
 Modeling
 A model is a representation of a real object or situation. It presents a simplified version of something.
 It is an artificial construction to understand and represent the nature of real things.
• A model leaves out unnecessary detail.
 Humans try to understand the world around them using different models.
• Architects capture designs in 3-D prints to construct structures
• Biologists capture connections between amino acids to understand protein-protein interactions
• Statisticians and Data Scientists capture randomness to comprehend data-generating processes
 Data Modeling
 Data modeling is the analysis of data objects and their relationships to other data objects.
 The model helps us in defining and analyzing data requirements needed to support the business
processes in an organization.

Data Modeling
 Building a Model
 Do some Exploratory Data Analysis (EDA) and discover the relationship among the data.
 Try to describe the relationship using a mathematical formula.
 Model Fitting
 Balanced Fitting
• When the model fits both the training and the testing data well
 Underfitting
• When the model is unable to fit even the training data
 Overfitting
• When the model fits the training data well but the testing data poorly
 Noise (undesired data) and high variability (inconsistency) in the data cause overfitting
 To reduce overfitting, remove noise (data cleaning) and add more training data.
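
The three fitting situations can be made concrete by comparing training and testing error. Everything in this sketch is an assumption chosen for illustration: the synthetic data (y = 2x + 1 plus noise), the three models, and the use of mean squared error (MSE) as the measure of fit.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is repeatable

def make_data(n):
    """Hypothetical data: y = 2x + 1 plus Gaussian noise."""
    return [(x, 2 * x + 1 + rng.gauss(0, 2))
            for x in (rng.uniform(0, 10) for _ in range(n))]

train, test = make_data(200), make_data(200)

def mse(model, data):
    """Mean squared error of a model on a dataset."""
    return mean((model(x) - y) ** 2 for x, y in data)

# Least-squares line fitted on the training data only
x_bar = mean(x for x, _ in train)
y_bar = mean(y for _, y in train)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in train)
         / sum((x - x_bar) ** 2 for x, _ in train))
intercept = y_bar - slope * x_bar

def constant_model(x):
    """Underfitting: ignores x and always predicts the training mean."""
    return y_bar

def linear_model(x):
    """Balanced fit: matches the true form of the relationship."""
    return slope * x + intercept

def memorizing_model(x):
    """Overfitting: repeats the y of the nearest training point, noise included."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

models = [("constant", constant_model), ("linear", linear_model),
          ("memorizing", memorizing_model)]
results = {name: (mse(model, train), mse(model, test)) for name, model in models}

for name, (train_error, test_error) in results.items():
    print(f"{name:<11} train MSE = {train_error:6.2f}   test MSE = {test_error:6.2f}")
```

The constant model is poor on both datasets. The memorizing model is perfect on the training data but worse than the line on the testing data, because it has learned the noise along with the signal.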

Fitting a Data Model

[Figure: two fitted curves, captioned “Too simple to explain the variation in data” and “Too good to be true: forced-fitting”.]

Exploratory Data Analysis
What is EDA?
 Exploratory Data Analysis (EDA)
 In statistics, EDA is used to analyze datasets and summarize their main characteristics.

 EDA often employs visual methods to see what the data can tell us beyond the formal modeling
or hypothesis testing task.

 It is an effort to understand the process that generates the data under observation.
• ‘Exploration’ means your understanding of the problem changes as you go along.
• Plots, graphs, and summary statistics are the basic tools of EDA.
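
A first pass at EDA might compute summary statistics, flag outliers, and draw a rough plot. The readings below are made up, the function names are hypothetical, and the 1.5 × IQR rule used for outliers is one common convention rather than something prescribed by these slides.

```python
from collections import Counter
from statistics import mean, median, quantiles

def summarize(values):
    """Summary statistics plus outliers by the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": min(values), "q1": q1, "median": median(values),
        "q3": q3, "max": max(values), "mean": mean(values),
        "outliers": [v for v in values if v < low or v > high],
    }

def text_histogram(values, bin_width):
    """A crude plot: one line of '#' marks per occupied bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return [f"{start:>3}-{start + bin_width - 1:<3} {'#' * counts[start]}"
            for start in sorted(counts)]

readings = [12, 15, 14, 10, 13, 16, 14, 15, 11, 95]  # illustrative values

summary = summarize(readings)
print(summary)
print("\n".join(text_histogram(readings, 10)))
```

Here the mean (21.5) sits well above the median (14), a quick hint that an extreme value is present. The outlier check and the histogram both point to 95.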

EDA – Why we do it?
 EDA helps us to:
 Understand the data and its value in business

• Discover patterns in data

• Spot anomalies (outliers) in data
• Verify existing assumptions about data
• Make comparisons between the data distributions.
• Find suitable data formats

 Improve accuracy of the data-products.

 Assure verification of the data-products.

The Data Science Process
 The Data Science Process
 Say we have raw data about some real-world activity
 We want to process these data for better analysis
 Processing gives us a Clean Dataset to analyze
 We do some EDA with the clean dataset
 EDA leads us towards a Data Model and an Algorithm
 We get the results after using the model, and interpret, visualize, or report them
 The results are used in decision making or as input for a ‘Data Product’.
 Such products include:
• Recommender system
• Business forecasting system
• Spam classifier

Data Science Process

[Figure: Data Science Process diagram. Node labels: Reality; Instrument Data Sources; Business Problem; Raw Data Collection; Data Pre-Processing; Clean Dataset; Exploratory Data Analysis; Data Processing; Visualization/Communicate Results; Make Decisions; Data Product (Decision Support, Business Intelligence, Recommender Systems, Business Forecasting (Prediction)).]

The Big Data Approach
 Big Data
 Big data is a field dedicated to the analysis, processing, and storage of large collections of data
that frequently originate from various sources.
 It is used when traditional data analysis, processing, and storage technologies and techniques
are insufficient.
 Big Data Characteristics
 Volume
 Velocity
 Variety
 Veracity
 Value

Big Data – Analytics Types
 Descriptive Analysis (What happened)
 It is done to answer questions about events that have already occurred.

 Diagnostic Analysis (Why did it happen)

 It is used to determine the cause of a phenomenon that occurred in the past using questions that
focus on the reason behind the event.

 Predictive Analysis (What will happen)

 It is an attempt to determine the outcome of an event that may occur in the future.

 Prescriptive Analysis (How can we make it happen)

 Prescriptive analytics builds upon the results of predictive analytics to prescribe the actions
that should be taken to improve the business.

Big Data – Analytics Types

[Figure: the four analytics types, numbered 1 Descriptive, 2 Diagnostic, 3 Predictive, 4 Prescriptive.]

Data Analytics Life Cycle – Phase 1
 Phase 1: Learning the business domain and problem discovery
 Understand the business process
• Study similar past projects
• Identify available resources – people, required skills, technology, time, and data.
• Have the right mix of domain experts, customers, analytic talent, and project management.
 Identify key stakeholders
• Understand their interests in the project
• Propose and discuss more than one solution to the problem
 Discover the problem to be solved
• Write the problem statement and its justification.
• Refine the problem statement after discussion with the major stakeholders
• Establish the criteria for success and failure of the proposed solution

Data Analytics Life Cycle – Key Roles

Data Analytics Life Cycle – Phase 2
 Phase 2: Data preparation
 Define the steps to explore and preprocess data before modeling and analysis.
 Prepare the analytics sandbox (setup for the experiments)
 Perform the Extract Transform Load (ETL) process (or ELT).  ETLT = ETL + ELT
 Understand the target data
 Data cleaning – data normalization and transformation
• For better understanding, utilize maximum of the available data
• Survey and visualize the test dataset
• Carry out this highly labor-intensive activity carefully
 Data accessing strategies:
• Download snapshot of the production data
• Use the API facility, if available

Phase 2 – Sample Dataset Inventory

Phase 2 – Common tools for data preparation
 Phase 2: Data preparation tools
 Hadoop
• It can perform massively parallel loading and analysis of large datasets.
• Used for web traffic parsing, GPS location analytics, genomic analysis, and combining
massive unstructured data feeds from multiple sources.
 Alpine Miner
• Provides a graphical user interface (GUI) for data manipulation and analysis
 OpenRefine (formerly Google Refine)
• A powerful tool for working with large and unstructured datasets. It is a popular GUI-based tool
for performing data transformations.
 Data Wrangler (Stanford University)
• An interactive tool for data cleaning and transformation on a given dataset.

Data Analytics Life Cycle – Phase 3
 Phase 3: Planning the data model
 Data exploration and variable selection
• Perform Exploratory Data Analysis, if required.
• Explore associations & relationships among data
• Identify key performance indicators (KPIs)
 Selecting suitable data analytical method or model
• Keep in mind requirements of the business
• Consider the type and format of data attributes
• Consult the domain experts and follow the best practices

Phase 3 – Selecting appropriate data analytical model

Data Analytics Life Cycle – Phase 3
 Phase 3: Common tools for the model planning phase
 R - Analytical Software Package
• It has the data modeling capabilities and good environment for building interpretive models
• R has ability to interface with databases via an ODBC connection and execute statistical tests
and analyses against Big Data via an open source connection.
• R contains nearly 5,000 packages for data analysis and graphical representation.
 SQL Analysis services
• It can perform in-database analytics of common data mining functions, involved aggregations,
and basic predictive models.
 SAS/ACCESS
• Provides integration between SAS and the analytics sandbox via multiple data connectors such
as ODBC, JDBC, and OLE DB.
• Offers connectivity to relational databases (such as Oracle or Teradata), data warehousing
applications (e.g., Greenplum or Aster), and enterprise applications such as SAP and Salesforce.

Data Analytics Life Cycle – Phase 4
 Phase 4: Model building
 Develop datasets for testing, training, and production purposes.
 Assess validity of the model and its results on small scale
• Verify the results of the model with domain experts
 Evaluate the required hardware support to execute the model
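
Developing separate training and testing datasets usually starts with a random split. The sketch below assumes a 70:30 train:test ratio, in line with the 70:30 ratio quoted earlier in these slides (which is interpreted here, as an assumption, as a train:test split). The function name and the placeholder rows are hypothetical.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows, then cut it into training and testing parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 100 prepared records
train_rows, test_rows = train_test_split(rows)

print(len(train_rows), len(test_rows))  # 70 30
```

Shuffling before cutting matters when the source data is ordered, for example by date or by customer. Without it, the two parts may differ systematically.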

Data Analytics Life Cycle – Phase 4
 Phase 4: Common tools for the model building phase
 SAS Enterprise Miner
• Allows users to run predictive and descriptive models based on large volumes of data from
across the enterprise.
• It is built for enterprise-level computing and analytics by interoperating with large data stores.
 SPSS Modeler (IBM SPSS Modeler)
• Offers methods to explore and analyze data through a GUI.
 MATLAB
• Provides a high-level language for performing a variety of data analytics and exploration.
 Statistica and Mathematica
• Popular and well-regarded data mining and analytics tools.

Data Analytics Life Cycle – Phase 4
 Phase 4: Free or Open Source tools for the model building phase
 WEKA
• A free data mining software package with an analytic workbench. The functions created in
WEKA can be executed within Java code.
 Python
• It is a programming language that provides toolkits for machine learning and analysis, such as
scikit-learn, numpy, scipy, pandas, and related data visualization using matplotlib.
 R and PL/R
• R was described earlier in the model planning phase, and PL/R is a procedural language for
PostgreSQL with R. Using this approach means that R commands can be executed in-database.
 Octave
• A programming language for computational modeling that offers some of the functionality of MATLAB.
• Being freely available, Octave is used in major universities when teaching machine learning.

Data Analytics Life Cycle – Phase 5
 Phase 5: Communicate the results
 Collaborate with the major stakeholders and evaluate the results
• Identify key findings, quantify their business value.
• The deliverable of this phase will be decisive for the outside stakeholders and sponsors
• Summarize the findings and convey to the stakeholders.
• Make recommendations for future work or improvements to existing processes
 Accept failure of an analytical project
• A true failure means failure of data to accept or reject the hypothesis stated in phase-1.
• Analyst should be rigorous enough with the data to determine whether it will prove or disprove
the hypotheses

Data Analytics Life Cycle – Phase 6
 Phase 6: Operationalize
 Communicate the benefits of the project more broadly
• If required, run a pilot project before implementing the models in a production environment.
• Learn from the deployment and make any needed adjustments.
 Properly document and deliver the final reports, briefings, code, and technical documents.
• Consult the documentation of similar past projects, if available.
• Follow the documentation standards to increase its effectiveness.

Content Review
 Data
 Definition, Importance, Characteristics, Sources and Types
 Structured Data, Semi-Structured Data, Un-Structured Data
 The information processing cycle
 Data Preprocessing (Integration, Cleansing, Reduction, and Transformation)
 Statistical Inference
 Definition & Objectives, Sampling, Statistical experiment and Probability
 Exploratory Data Analysis (EDA)
 Definition & Objectives, EDA Process and Example
 The Data Science Process
 Definition & Objectives, the Process diagram
 Data Analytical Life Cycle
 Discovery, Data preparation, Model planning, Model building, Communicate results, Operationalize

Questions ?
Comments !
Suggestions !!

Farewell to the Day

Statistical Inference, Exploratory Data Analysis, and the Data Science Process 80

