BDA -Statistical Inference, Exploratory Data Analysis, and the Analytics Process
Analytics
2 Data Preprocessing,
Statistical Inference, EDA, and
the Analytics Process
Statistical Inference
Definition & Objectives, Sampling, Statistical experiment and Probability
Exploratory Data Analysis (EDA)
Definition & Objectives, EDA Process and Example
The Data Analytics Process
Definition & Objectives, the Process diagram
Data Analytical Life Cycle
Discovery, Data preparation, Model planning, Model building, Model evaluation, Communicate results, Operationalize
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 2
Data – Definition, Types, Sources, Qualities & Importance
Data
Facts and figures in raw or unorganized form (such as letters, numbers, or symbols) that refer to, or represent, conditions, ideas, or objects.
Types of Data Measurements
Nominal scale
Ordinal scale
Interval scale
Quantitative
Ratio scale
Discrete Continuous
Types of Data Measurements: Examples
Nominal:
ID numbers, Names of people, Gender, Blood type, Eye colour, Political Party
Categorical:
Fruits, vegetables, juices, zip codes, sales
Ordinal:
Rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
Interval:
Calendar dates, temperatures in Celsius or Fahrenheit, GRE and IQ scores
Ratio:
Mass, length, counts, money
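The practical difference between the scales is which operations give meaningful answers. The sketch below is illustrative only; the sample values and variable names are assumptions, not data from the slides.

```python
from statistics import mean, mode

# Nominal: only equality and counting are meaningful, so the mode is a valid summary.
blood_types = ["A", "O", "B", "O", "AB", "O"]
most_common_blood_type = mode(blood_types)

# Ordinal: categories have an order, so sorting and the median are meaningful.
height_order = {"short": 0, "medium": 1, "tall": 2}
heights = ["tall", "short", "medium", "medium", "tall"]
ranked = sorted(heights, key=height_order.get)
median_height = ranked[len(ranked) // 2]

# Interval: differences are meaningful, ratios are not (0 degrees Celsius is not "no temperature").
temps_celsius = [10.0, 20.0, 30.0]
temp_difference = temps_celsius[1] - temps_celsius[0]
mean_temp = mean(temps_celsius)

# Ratio: a true zero exists, so ratios are meaningful as well.
masses_kg = [2.0, 4.0]
mass_ratio = masses_kg[1] / masses_kg[0]

print(most_common_blood_type, median_height, temp_difference, mass_ratio)
```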
Data – Definition, Types, Sources, Qualities & Importance
Data Sources
Sensing & monitoring – data from sensors (in space, oceans, etc.) and CCTV cameras
… … ….
Data Pre-processing
Data Science/Analytics Process
[Figure: process diagram. Boxes shown: Reality; Business Problem; Instrument Data Sources; Raw Data Collection; Data Pre-Processing; Clean Dataset; Exploratory Data Analysis; Data Processing; Decision Support; Business Intelligence; Recommender Systems; Business Forecasting (Prediction); Data Product; Visualization/Communicate Results; Make Decisions]
Population vs. Sample
Population (N)
Includes all of the elements from a set of data e.g.,
• The entire US population i.e., 341.97 million (341,963,408) or
• The entire Pakistan population i.e., 252.37 million (252,363,571)
• The entire world population i.e., 8.2 billion
• Set of objects, such as tweets or photographs
Sample (n)
Consists of one or more observations drawn from the population
n<N
Sampling & Types
Sampling
• Technique mainly employed for data selection from population
• Often used both for preliminary investigation and the final data analysis
Sampling Types
Simple Random Sampling
• Equal probability of selecting any item
Stratified Sampling
• Split the data into partitions and draw random samples from each partition
Systematic Sampling
• Select every nth item from a list.
For instance, from a list of 1,000 people, choosing every 10th person gives a sample of 100
Cluster Sampling
• The population is divided into clusters, usually based on geographical areas or natural groupings. A few clusters are randomly selected, and all members within those clusters are surveyed
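The four sampling types can be compared side by side on a small synthetic population. This is a minimal sketch: the population, the `region` attribute, the 10% sampling fraction, and the cluster size of 50 are all assumptions chosen for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = [
    {"id": i, "region": "north" if i % 4 == 0 else "south"} for i in range(1000)
]

# Simple random sampling: every item has the same chance of selection.
simple = rng.sample(population, 100)

# Stratified sampling: partition by region, then draw 10% at random from each partition.
strata = {}
for person in population:
    strata.setdefault(person["region"], []).append(person)
stratified = []
for members in strata.values():
    stratified.extend(rng.sample(members, len(members) // 10))

# Systematic sampling: every 10th item, starting from a random offset.
start = rng.randrange(10)
systematic = population[start::10]

# Cluster sampling: split into 20 clusters of 50, pick 2 clusters, keep all their members.
clusters = [population[i:i + 50] for i in range(0, len(population), 50)]
chosen_clusters = rng.sample(clusters, 2)
cluster_sample = [person for cluster in chosen_clusters for person in cluster]

print(len(simple), len(stratified), len(systematic), len(cluster_sample))
```

Note that only the stratified sample is guaranteed to preserve the 1:3 north/south split of this population; the other three preserve it only on average.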
Sample Size
Why Data Pre-processing?
Why is Data Dirty?
Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected and when it is
analysed.
– Human/hardware/software problems
Data Cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
Data Integration
Integration of multiple databases or files
Data Transformation
Normalization and aggregation
Data Reduction
Obtains reduced representation in volume but produces the same or similar analytical
results
Data Discretization & Binarization
Part of data reduction but with particular importance for numerical data
Forms of Data Preprocessing
Data Preprocessing – Cleaning
Importance
Garbage in Garbage out Principle (GIGO)
Data Preprocessing – Cleaning
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in with the attribute mean for all data points belonging to the same class: smarter
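Filling a missing value with the mean of its own class can be sketched as follows. The records, the `segment` class label, and the income figures are hypothetical.

```python
from statistics import mean

# Hypothetical sales records; None marks a missing income value.
records = [
    {"segment": "retail", "income": 40.0},
    {"segment": "retail", "income": None},
    {"segment": "retail", "income": 60.0},
    {"segment": "corporate", "income": 100.0},
    {"segment": "corporate", "income": None},
]

# Mean of the observed incomes within each class (segment).
observed = {}
for record in records:
    if record["income"] is not None:
        observed.setdefault(record["segment"], []).append(record["income"])
class_mean = {segment: mean(values) for segment, values in observed.items()}

# Fill each missing income with the mean of the record's own class.
filled = [
    {**r, "income": class_mean[r["segment"]] if r["income"] is None else r["income"]}
    for r in records
]

print(filled)
```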
Regression
Smooth by fitting the data into regression functions
Clustering
Detect and remove outliers
Binning
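One common form of binning is smoothing by bin means over equal-depth bins: sort the values, split them into bins of equal size, and replace each value with the mean of its bin. The values and the bin size below are assumptions for illustration.

```python
from statistics import mean

values = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 4  # equal-depth bins: each bin holds the same number of values

smoothed = []
for i in range(0, len(values), bin_size):
    bin_values = values[i:i + bin_size]
    # Every value in the bin is replaced by the bin mean.
    smoothed.extend([mean(bin_values)] * len(bin_values))

print(smoothed)
```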
Data Preprocessing – Cleaning
Regression
Data Preprocessing – Cleaning
Clustering
Data Preprocessing – Integration
Data integration:
Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
For the same real world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units
Data Preprocessing – Data Integration
Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Categorical Data)
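The body of this slide did not survive extraction. A widely used test for association between two categorical attributes is Pearson's chi-square statistic, which compares observed counts with the counts expected under independence. The sketch below assumes that is the intended technique; the 2x2 contingency table is hypothetical.

```python
# Hypothetical 2x2 contingency table of observed counts
# (rows: attribute A yes/no, columns: attribute B yes/no).
observed = [
    [250, 200],
    [50, 1000],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # Expected count if the two attributes were independent.
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (count - expected) ** 2 / expected

print(round(chi_square, 2))
```

A large statistic relative to the chi-square critical value for (rows - 1) x (columns - 1) degrees of freedom suggests the attributes are not independent.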
Data Preprocessing – Reduction, Discretization
Data Reduction (Dimensionality Reduction)
• Obtains reduced representation in volume but produces the same or similar analytical
results
Feature Subset Selection / Principal Component Analysis (PCA)
Singular Value Decomposition (SVD)
Data Preprocessing – Transformation
Data Preprocessing – Transformation
• Maps entire set of values of an attribute to a new set of values
• Data standardization and normalization (by clustering and binning)
Attribute/feature construction
New attributes constructed from the given ones
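Two standard ways to map an attribute's values onto a new set of values are min-max normalization and z-score standardization. This is a minimal sketch; the income values are hypothetical, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev

incomes = [12000.0, 16000.0, 54000.0, 73600.0, 98000.0]  # hypothetical attribute values

# Min-max normalization: map [min, max] linearly onto [0, 1].
lowest, highest = min(incomes), max(incomes)
min_max = [(value - lowest) / (highest - lowest) for value in incomes]

# Z-score standardization: subtract the mean, divide by the standard deviation.
mu, sigma = mean(incomes), pstdev(incomes)
z_scores = [(value - mu) / sigma for value in incomes]

print([round(v, 3) for v in min_max])
print([round(v, 3) for v in z_scores])
```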
Data Preprocessing – Feature Creation
Feature Creation
• Original attributes not always best representation of information
• Creates new features which are more efficient/focused
Methodologies
Feature Extraction – Domain Specific
• Derived features
Feature Construction
• Combine multiple features to construct new feature(s)
Mapping Data to New Space
• Fourier Transform - what frequencies are present in your signal
• Wavelet Transform - what frequencies are present and where (or at what scale)
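To make "what frequencies are present in your signal" concrete, the sketch below applies a naive discrete Fourier transform to a synthetic sine wave and reads off the dominant frequency. The signal, the sample count, and the helper name `dft` are assumptions; a real project would normally use an FFT library instead of this O(n^2) loop.

```python
import cmath
import math

n = 64  # number of samples in the observation window
# Synthetic signal: a sine wave completing 5 cycles within the window.
signal = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]


def dft(x):
    """Naive O(n^2) discrete Fourier transform (illustrative helper)."""
    size = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / size) for t in range(size))
        for k in range(size)
    ]


spectrum = dft(signal)
# Only the first half is inspected: for a real-valued signal the second half mirrors it.
magnitudes = [abs(c) for c in spectrum[: n // 2]]
dominant_frequency = max(range(n // 2), key=magnitudes.__getitem__)

print(dominant_frequency)
```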
Data Preprocessing – Discretization & Binarization
Statistical Inference
Statistical Inference – Population and Sample
Population
Includes all of the elements from a set of data e.g.,
• The entire Pakistani population i.e., 247 million or
• The entire world population i.e., 8 billion
• Set of objects, such as tweets or photographs
Sample (Sample < Population)
It consists of one or more items drawn from the population
For example, 1000 Pakistanis selected from all provinces of Pakistan
n < N
The size of the sample (n) is always less than the size of the population (N)
Sample may not be totally representative of the population
Statistical Inference
Statistical inference
It is the process of estimating the parameters of a population using random sampling.
Inference also tests the reliability of the estimates by quantifying their uncertainty.
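A minimal sketch of this idea: estimate a population mean from a random sample and attach an approximate 95% confidence interval. The synthetic population, the sample size of 400, and the normal critical value 1.96 are assumptions for illustration.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(0)
# Hypothetical population; in practice the full population is usually unobservable.
population = [rng.gauss(170.0, 10.0) for _ in range(100_000)]
true_mean = mean(population)  # the parameter we pretend not to know

sample = rng.sample(population, 400)
estimate = mean(sample)  # point estimate of the population mean
standard_error = stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval using the normal critical value 1.96.
ci_low = estimate - 1.96 * standard_error
ci_high = estimate + 1.96 * standard_error

print(f"estimate = {estimate:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

Across repeated samples, an interval built this way would be expected to cover the true mean about 95% of the time; any single interval may miss it.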
Statistical Inference
[Figure: diagram linking Population, Sampling, Sample, Estimation, Parameters, and Statistical Inference]
Statistical Experiment
A statistical experiment has three properties:
The experiment can have more than one possible outcome
Each possible outcome can be specified in advance
The outcome of the experiment depends on chance
Variables or Parameters
Variable or Parameter
It represents the value of an attribute of an item in the population, e.g., the name or color of an item.
A random variable can take on any of the specified values (domain).
A random variable takes a value after a statistical experiment.
[Figure: a statistical experiment assigns the random variable x a value, e.g., x = 7]
Probability
Probability
It is a measure of the likelihood that an event occurs.
It is a quantitative measure that always takes a value between 0 and 1.
Example:
Tossing a coin is a statistical experiment.
It can result in two outcomes: heads (H) or tails (T)
The chance of getting a head is its probability:
Probability = Favorable outcomes / Possible outcomes
P(H) = 1/2 = 0.5
P(T) = 1/2 = 0.5
Probability Distribution
Probability distribution
X = number of heads in two coin tosses
Possible outcomes = { 0, 1, 2 }
• P(X = 0) = 1/4 = 0.25 No Heads = { TT }
• P(X = 1) = 2/4 = 0.50 One Head = { HT, TH }
• P(X = 2) = 1/4 = 0.25 Two Heads = { HH }
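The distribution above can be reproduced by enumerating the equally likely outcomes of two coin tosses and counting heads in each. The variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two tosses: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# X = number of heads in each outcome.
heads_count = Counter(outcome.count("H") for outcome in outcomes)

# P(X = x) = favorable outcomes / possible outcomes.
distribution = {
    x: Fraction(count, len(outcomes)) for x, count in sorted(heads_count.items())
}

for x, probability in distribution.items():
    print(f"P(X = {x}) = {probability} = {float(probability):.2f}")
```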
Data Modeling
Modeling
A model is a representation of a real object or situation. It presents a simplified version of something.
It is an artificial construction to understand and represent the nature of real things.
• A model leaves out unnecessary detail.
Humans try to understand the world around them using different models.
• Architects use 3-D models to design structures
• Biologists capture connections between amino acids to understand protein-protein interactions
• Statisticians and Data Scientists capture randomness to comprehend data-generating processes
Data Modeling
Data modeling is the analysis of data objects and their relationships to other data objects.
The model helps us in defining and analyzing data requirements needed to support the business
processes in an organization.
Data Modeling
Building a Model
Do some Exploratory Data Analysis (EDA) and discover the relationship among the data.
Try to describe the relationship using a mathematical formula.
Model Fitting
Model Fitting (Balanced Fitting)
• When the model fits both the training and the testing data well
Underfitting
• When the model is unable to fit even the training data
Overfitting
• When the model fits the training data well but the testing data poorly
Noise (undesired data) and high variability (inconsistency) in the data cause overfitting
To reduce overfitting, remove noise (data cleaning) and add more training data.
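The three fitting regimes can be illustrated with three deliberately simple models on noisy data. This is a sketch under assumptions: the relationship y = 2x + 1, the noise level, and the three model choices (a constant, a least-squares line, and a nearest-point memorizer) are all hypothetical stand-ins.

```python
import random
from statistics import mean

rng = random.Random(1)


def make_data(n):
    """Noisy samples of the hypothetical relationship y = 2x + 1."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 2)) for x in xs]


train, test = make_data(30), make_data(30)


def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return mean((model(x) - y) ** 2 for x, y in data)


x_bar = mean(x for x, _ in train)
y_bar = mean(y for _, y in train)
slope = sum((x - x_bar) * (y - y_bar) for x, y in train) / sum(
    (x - x_bar) ** 2 for x, _ in train
)


def constant(x):
    # Underfitting candidate: ignores x and always predicts the training mean.
    return y_bar


def linear(x):
    # Balanced candidate: least-squares line through the training data.
    return y_bar + slope * (x - x_bar)


def memorizer(x):
    # Overfitting candidate: returns the y of the nearest training point.
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


for name, model in [("constant", constant), ("linear", linear), ("memorizer", memorizer)]:
    print(f"{name:10s} train MSE = {mse(model, train):6.2f}   test MSE = {mse(model, test):6.2f}")
```

The memorizer reaches zero training error because it reproduces the noise in the training set; its test error is typically worse than the line's, which is the overfitting pattern described above.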
Fitting a Data Model
Exploratory Data Analysis
What is EDA?
Exploratory Data Analysis (EDA)
In statistics, EDA is used to analyze datasets to summarize their main characteristics.
EDA often employs visual methods to see what the data can tell us beyond the formal modeling or hypothesis testing task.
It is an effort to understand the process that generates the data under observation.
• ‘Exploration’ means your understanding of the problem is changing as you go ahead.
• Plots, graphs and summary statistics are basic tools of the EDA.
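A first EDA pass often combines summary statistics with a quick plot. The sketch below uses only the standard library and a text histogram in place of a plotting package; the response-time values are hypothetical and include one deliberate outlier.

```python
from statistics import mean, median, quantiles, stdev

response_times = [12, 15, 11, 14, 13, 95, 16, 12, 14, 13]  # hypothetical, one outlier

summary = {
    "count": len(response_times),
    "min": min(response_times),
    "max": max(response_times),
    "mean": mean(response_times),
    "median": median(response_times),
    "stdev": stdev(response_times),
}
q1, q2, q3 = quantiles(response_times, n=4)  # quartiles

for name, value in summary.items():
    print(f"{name:>6}: {value}")
print(f"quartiles: {q1}, {q2}, {q3}")

# A crude text histogram: one '#' per observation in each bin of width 10.
histogram = {}
for value in response_times:
    bucket = (value // 10) * 10
    histogram[bucket] = histogram.get(bucket, 0) + 1
for bucket in sorted(histogram):
    print(f"{bucket:3d}-{bucket + 9:3d} | {'#' * histogram[bucket]}")
```

The gap between the mean and the median, and the isolated bar in the histogram, both point at the outlier before any formal modeling is done.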
EDA – Why we do it?
EDA helps us to:
Understand the data and its value in business
The Data Science Process
The Data Science Process
Say, we have data (raw data) on these things
We want to process these data for better analysis
Processing would give us a Clean Dataset to analyze
We’ll be doing some EDA with the clean dataset
EDA will lead us towards a Data Model and an Algorithm
We get the results after using the model and interpret, visualize, or report them
The results are used in decision making or as input for a ‘Data Product’.
Examples of such products:
• Recommender system
• Business forecasting system
• Spam classifier
Data Science Process
[Figure: process diagram. Boxes shown: Reality; Business Problem; Instrument Data Sources; Raw Data Collection; Data Pre-Processing; Clean Dataset; Exploratory Data Analysis; Data Processing; Decision Support; Business Intelligence; Recommender Systems; Business Forecasting (Prediction); Data Product; Visualization/Communicate Results; Make Decisions]
The Big Data Approach
Big Data
Big data is a field dedicated to the analysis, processing, and storage of large collections of data that frequently originate from various sources.
It is used when traditional data analysis, processing, and storage technologies and techniques are insufficient.
Big Data Characteristics
Volume
Velocity
Variety
Veracity
Value
Big Data – Analytics Types
Descriptive Analysis (What happened)
It is done to answer questions about events that have already occurred.
Big Data – Analytics Types
1 Descriptive
2 Diagnostic
3 Predictive
4 Prescriptive
Data Analytics Life Cycle – Phase 1
Phase 1: Learning the business domain and problem discovery
Understand the business process
• Study the similar past projects
• Identify available resources – people, required skills, technology, time, and data.
• Have the right mix of domain experts, customers, analytic talent, and project management.
Identifying key stakeholders
• Understand their interests in the project
• Propose and discuss more than one solution to the problem
Discover the problem to be solved
• Write the problem statement and its justification.
• Refine the problem statement after discussion with the major stakeholders
• Establish the criteria for success and failure of the proposed solution
Data Analytics Life Cycle – Key Roles
Data Analytics Life Cycle – Phase 2
Phase 2: Data preparation
Define the steps to explore and preprocess data before its modeling and analysis.
Prepare the analytics sandbox (setup for the experiments)
Perform the Extract, Transform, Load (ETL) process (or ELT). ETLT = ETL + ELT
Understand the target data
Data cleaning – data normalization and transformation
• For better understanding, use as much of the available data as possible
• Survey and visualize the test dataset
• Complete this highly labor-intensive activity carefully
Data accessing strategies:
• Download snapshot of the production data
• Use the API facility, if available
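The ETL step can be sketched end to end with the standard library. Everything here is an assumed stand-in: an in-memory CSV plays the role of a production snapshot, an in-memory SQLite database plays the role of the analytics sandbox, and the table and column names are hypothetical.

```python
import csv
import io
import sqlite3

# Extract: read raw rows (an in-memory CSV stands in for a production snapshot).
raw_csv = io.StringIO("customer,income\nalice, 52000 \nbob,\ncarol,61000\n")
rows = list(csv.DictReader(raw_csv))

# Transform: trim whitespace, convert types, and drop rows with no income.
clean = [
    (row["customer"].strip(), float(row["income"]))
    for row in rows
    if row["income"].strip()
]

# Load: write the cleaned rows into the sandbox database.
sandbox = sqlite3.connect(":memory:")
sandbox.execute("CREATE TABLE customers (name TEXT, income REAL)")
sandbox.executemany("INSERT INTO customers VALUES (?, ?)", clean)
sandbox.commit()

loaded = sandbox.execute("SELECT name, income FROM customers ORDER BY name").fetchall()
print(loaded)
```

In an ELT variant the raw rows would be loaded first and the cleaning would run inside the database instead.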
Phase 2 – Sample Dataset Inventory
Phase 2 – Common tools for data preparation
Phase 2: Data preparation tools
Hadoop
• It can perform massively parallel loading and analysis of large datasets.
• Used for web traffic parsing, GPS location analytics, genomic analysis, and combining of
massive unstructured data feeds from multiple sources.
Alpine Miner
• Provides a graphical user interface (GUI) for data manipulation and analysis
OpenRefine (formerly Google Refine)
• A powerful tool for working with large and unstructured datasets. It is a popular GUI-based tool for performing data transformations.
Data Wrangler (Stanford University)
• An interactive tool for data cleaning and transformation on a given dataset.
Data Analytics Life Cycle – Phase 3
Phase 3: Planning the data model
Data exploration and variable selection
• Perform Exploratory Data Analysis, if required.
• Explore associations & relationships among data
• Identify key performance indicators (KPIs)
Selecting suitable data analytical method or model
• Keep in mind requirements of the business
• Consider the type and format of data attributes
• Consult the domain experts and follow the best practices
Phase 3 – Selecting appropriate data analytical model
Data Analytics Life Cycle – Phase 3
Phase 3: Common tools for the model planning phase
R - Analytical Software Package
• It has the data modeling capabilities and good environment for building interpretive models
• R has ability to interface with databases via an ODBC connection and execute statistical tests
and analyses against Big Data via an open source connection.
• R contains nearly 5,000 packages for data analysis and graphical representation.
SQL Analysis services
• It can perform in-database analytics of common data mining functions, involved aggregations,
and basic predictive models.
SAS/ACCESS
• Provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB. Connectivity to relational databases (such as Oracle or Teradata) and data warehousing applications (e.g., Greenplum or Aster)
• Connectivity to enterprise applications such as SAP and Salesforce.
Data Analytics Life Cycle – Phase 4
Phase 4: Model building
Develop datasets for testing, training, and production purposes.
Assess the validity of the model and its results on a small scale
• Verify the model's results with domain experts
Evaluate the required hardware support to execute the model
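Developing separate training and testing datasets usually starts with a random split. This is a minimal sketch; the stand-in dataset and the 80/20 ratio are assumptions, not values from the slides.

```python
import random

rng = random.Random(7)  # fixed seed so the split is repeatable
dataset = list(range(100))  # stand-in for 100 labeled records

# Shuffle a copy so the original order is preserved.
shuffled = dataset[:]
rng.shuffle(shuffled)

split = int(0.8 * len(shuffled))  # assumed 80/20 split
training_set, testing_set = shuffled[:split], shuffled[split:]

print(len(training_set), len(testing_set))
```

Keeping the two sets disjoint is what makes the test error an honest estimate of how the model behaves on unseen data.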
Data Analytics Life Cycle – Phase 4
Phase 4: Common tools for the model building phase
SAS Enterprise Miner
• Allows users to run predictive and descriptive models based on large volumes of data from
across the enterprise.
• It is built for enterprise-level computing and analytics by interoperating with large data stores.
SPSS Modeler (IBM SPSS Modeler)
• Offers methods to explore and analyze data through a GUI.
MATLAB
• Provides a high-level language for performing a variety of data analytics and exploration.
Statistica and Mathematica
• Popular and well-regarded data mining and analytics tools.
Data Analytics Life Cycle – Phase 4
Phase 4: Free or Open Source tools for the model building phase
WEKA
• A free data mining software package with an analytic workbench. The functions created in
WEKA can be executed within Java code.
Python
• It is a programming language that provides toolkits for machine learning and analysis, such as
scikit-learn, numpy, scipy, pandas, and related data visualization using matplotlib.
R and PL/R
• R was described earlier in the model planning phase, and PL/R is a procedural language for PostgreSQL with R. Using this approach means that R commands can be executed in the database.
Octave
• A programming language for computational modeling that offers some of the functionality of MATLAB.
• Being freely available, Octave is used in major universities when teaching machine learning.
Data Analytics Life Cycle – Phase 5
Phase 5: Communicate the results
Collaborate with the major stakeholders and evaluate the results
• Identify key findings, quantify their business value.
• The deliverable of this phase will be decisive for the outside stakeholders and sponsors
• Summarize the findings and convey to the stakeholders.
• Make recommendations for future work or improvements to existing processes
Accept failure of an analytical project
• A true failure means the data are insufficient to accept or reject the hypotheses stated in Phase 1.
• The analyst should be rigorous enough with the data to determine whether it will prove or disprove the hypotheses
Data Analytics Life Cycle – Phase 6
Phase 6: Operationalize
Communicate the benefits of the project more broadly
• If required, run a pilot project before implementing the models in a production environment.
• Learn from the deployment and make any needed adjustments.
Properly document and deliver the final reports, briefings, code, and technical documents.
• Consult documentation of the similar past projects, if available.
• Follow the documentation standards to increase its effectiveness.
Content Review
Data
Definition, Importance, Characteristics, Sources and Types
Structured Data, Semi-Structured Data, Unstructured Data
The information processing cycle
Data Preprocessing (Integration, Cleansing, Reduction, and Transformation)
Statistical Inference
Definition & Objectives, Sampling, Statistical experiment and Probability
Exploratory Data Analysis (EDA)
Definition & Objectives, EDA Process and Example
The Data Science Process
Definition & Objectives, the Process diagram
Data Analytical Life Cycle
Discovery, Data preparation, Model planning, Model building, Communicate results, Operationalize
Questions ?
Comments !
Suggestions !!