BDA -Statistical Inference, Exploratory Data Analysis, and the Analytics Process


Data

Analytics

2 Data Preprocessing,
Statistical Inference, EDA, and
the Analytics Process

Compiled & Edited by


Babar Yaqoob Khan
Visiting Lecturer – Data Science
Department of Information Technology
University of the Punjab – Gujranwala Campus
WHAT IS IN IT FOR YOU?
 Data
 Definition, Types (Structured Data, Semi-Structured Data, Un-Structured Data), Sources, Qualities & Importance
 The information processing cycle
 Data Preprocessing (Sampling, Cleansing, Aggregation, Dimensionality Reduction, Feature Subset Selection,
Feature Creation, Integration, Discretization and Binarization, and Transformation)

 Statistical Inference
 Definition & Objectives, Sampling, Statistical experiment and Probability
 Exploratory Data Analysis (EDA)
 Definition & Objectives, EDA Process and Example
 The Data Analytics Process
 Definition & Objectives, the Process diagram
 Data Analytical Life Cycle
 Discovery, Data preparation , Model planning , Model building , Model Evaluation, Communicate results,
Operationalize
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 2
Data – Definition, Types, Sources, Qualities & Importance
 Data
 The facts and figures in raw or unorganized form (such as alphabets, numbers, or symbols) that
refer to, or represent, conditions, ideas, or objects.

 Different types or formats of data:


• Numbers, Characters or Strings, Time and Date
• Pictures/Images, Graphs, and Maps
• Documents, E-mails, Tweets, and Newsfeeds etc.
• Audio and Video streams
• Formats: XML, CSV, TSV, SQL, JSON, Text etc.
• Records: user-level data, timestamped event data

 Data can be stored in files, data repositories or in databases

Types of Data Measurements

 Qualitative
 Nominal scale
 Categorical scale
 Ordinal scale

 Quantitative (discrete or continuous)
 Interval scale
 Ratio scale

Information content increases from the nominal scale to the ratio scale.

Types of Data Measurements: Examples
Nominal:
ID numbers, Names of people, Gender, Blood type, Eye colour, Political Party
Categorical:
Fruits, vegetables, juices, zip codes, sales

Ordinal:
Rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium,
short}

Interval:
Calendar dates, temperatures in Celsius or Fahrenheit, GRE and IQ scores

Ratio:
Mass, length, counts, money
Data – Definition, Types, Sources, Qualities & Importance
 Data Sources

 Business activities – sale & purchase of products

 Manufacturing process – production and assembling of products

 Transportation – transportation of people and products from place to place

 Sensing & monitoring – data from sensors (in space and oceans etc. ) and CCTV cameras

 Human interaction – emails, audio, video and textual communication

 … … ….

Data Pre-processing

Data Science/Analytics Process
[Figure: process flow diagram. Stages shown: Reality, Business Problem, Instrument Data Sources, Raw Data Collection, Data Pre-Processing, Clean Dataset, Exploratory Data Analysis, Data Processing (Decision Support, Business Intelligence, Recommender Systems, Business Forecasting/Prediction), Visualization / Communicate Results, Make Decisions, Data Product.]
Population vs. Sample

 Population (N)
 Includes all of the elements from a set of data e.g.,
• The entire US population i.e., 341.97 million (341,963,408) or
• The entire Pakistan population i.e., 252.37 million (252,363,571)
• The entire world population i.e., 8.2 billion
• Set of objects, such as tweets or photographs

 Sample (n)
 Consists of one or more observations drawn from the population

n<N

Sampling & Types
 Sampling
• Technique mainly employed for data selection from population
• Often used both for preliminary investigation and the final data analysis

 Sampling Types
 Simple Random Sampling
• Equal probability of selecting any item

 Stratified Sampling
• Split the data into partitions and draw random samples from each partition

 Systematic Sampling
• Select every nth item from a list.
For instance, if you have a list of 1,000 people and you choose every 10th person, you get a sample of 100 people

 Cluster Sampling
• The population is divided into clusters, usually based on geographical areas or natural groupings. A few clusters are randomly selected, and all members within those clusters are surveyed
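
The four sampling types can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the population of 1,000 people, the `region` attribute, and all function names are assumptions made for the example.

```python
import random

def simple_random_sample(items, n, rng):
    """Every item has the same chance of being chosen."""
    return rng.sample(items, n)

def stratified_sample(items, key, fraction, rng):
    """Partition by key(item), then sample the same fraction from each partition."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def systematic_sample(items, step, start=0):
    """Take every step-th item from the list, beginning at index start."""
    return items[start::step]

def cluster_sample(items, key, n_clusters, rng):
    """Randomly pick whole clusters and keep all of their members."""
    clusters = sorted({key(item) for item in items})
    chosen = set(rng.sample(clusters, n_clusters))
    return [item for item in items if key(item) in chosen]

rng = random.Random(42)
# Hypothetical population: 1,000 people spread evenly over 4 regions.
people = [{"id": i, "region": "NSEW"[i % 4]} for i in range(1000)]
region = lambda p: p["region"]

srs = simple_random_sample(people, 100, rng)
strat = stratified_sample(people, region, 0.1, rng)
syst = systematic_sample(people, 10)
clus = cluster_sample(people, region, 2, rng)
print(len(srs), len(strat), len(syst), len(clus))
```

Note that cluster sampling keeps every member of the chosen clusters, so its sample size depends on the cluster sizes rather than on a target n.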

Sample Size

Ideal Ratio: 70:30

Why Data Pre-processing?

 Data in the real world is dirty


 GIGO - good data is a prerequisite for producing effective models of any type

Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation=“ ”
Noisy: containing errors or outliers
e.g., Salary=“-10”
Inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records

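
As a rough sketch, the three kinds of dirty data can be flagged with simple rule checks. The record layout, the function name `quality_issues`, the reference date, and the reading of “03/07/1997” as 3 July 1997 are all assumptions for illustration.

```python
from datetime import date

def quality_issues(record, today):
    """Return a list of data-quality problems found in one record."""
    issues = []
    if not record.get("occupation", "").strip():
        issues.append("incomplete: occupation is empty")
    if record["salary"] < 0:
        issues.append("noisy: negative salary")
    born = record["birthday"]
    # Age implied by the birthday on the reference date.
    implied = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    if implied != record["age"]:
        issues.append("inconsistent: age does not match birthday")
    return issues

# Hypothetical record combining the three examples above
# (birthday assumed to mean 3 July 1997).
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
for issue in quality_issues(record, today=date(2025, 1, 1)):
    print(issue)
```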
Why is Data Dirty?
Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected and when it is
analysed.
– Human/hardware/software problems

Noisy data (incorrect values) may come from


– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission

Inconsistent data may come from


– Different data sources
– Functional dependency violation (e.g., modify some linked data)
Data Preprocessing – Major Tasks

 Data Cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
 Data Integration
 Integration of multiple databases or files
 Data Transformation
 Normalization and aggregation
 Data Reduction
 Obtains reduced representation in volume but produces the same or similar analytical
results
 Data Discretization & Binarization
 Part of data reduction but with particular importance for numerical data

Forms of Data Preprocessing

Data Preprocessing – Cleaning

 Importance
 Garbage in Garbage out Principle (GIGO)

 Data Cleaning Tasks

• Fill in missing values

• Identify outliers and manage noisy data

• Correct inconsistent data

• Resolve redundancy caused by data integration

Data Preprocessing – Cleaning

 Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data

 Missing data may be due to


• equipment malfunction
• data that was inconsistent with other recorded data and was therefore deleted
• data not entered due to misunderstanding
• data not considered important at the time of entry
• history or changes of the data not being recorded
 Missing data may need to be inferred
Data Preprocessing – Cleaning

 How to Handle Missing Data?

 Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.

 Fill in the missing value manually: tedious and often infeasible

 Fill it in automatically with

 a global constant: e.g., “unknown” (which may act as a new class)

 the attribute mean for all data points belonging to the same class: smarter

 the most probable value: inference-based, such as a Bayesian formula or a decision tree
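
Two of the automatic strategies above, filling with a global constant and filling with the class-wise attribute mean, might look like this. The sample rows and the names `fill_with_constant` and `fill_with_class_mean` are hypothetical; missing values are assumed to be stored as `None`, and every class is assumed to have at least one recorded value.

```python
from collections import defaultdict
from statistics import mean

def fill_with_constant(rows, attr, constant="unknown"):
    """Replace missing values (None) of attr with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean of the same class."""
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    means = {cls: mean(vals) for cls, vals in by_class.items()}
    return [{**r, attr: means[r[label]] if r[attr] is None else r[attr]} for r in rows]

# Hypothetical sales data: income is missing for two customers.
rows = [
    {"income": 30, "buys": "no"},
    {"income": None, "buys": "no"},
    {"income": 50, "buys": "no"},
    {"income": 80, "buys": "yes"},
    {"income": None, "buys": "yes"},
    {"income": 100, "buys": "yes"},
]
filled = fill_with_class_mean(rows, "income", "buys")
print([r["income"] for r in filled])
```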
Data Preprocessing – Cleaning

 How to Handle Noisy Data?

 Binning
 First sort the data and partition it into (equal-frequency) bins
 Then smooth by bin means, bin medians, bin boundaries, etc.

 Regression
 Smooth by fitting the data to regression functions

 Clustering
 Detect and remove outliers

 Combined Computer and Human Inspection
 Detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)
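
A small sketch of smoothing by binning, assuming equal-frequency bins of three values each. The price list and the function names are illustrative choices, not part of the slides.

```python
from statistics import mean

def equal_frequency_bins(values, size):
    """Sort the values and cut them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value with the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```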
Data Preprocessing – Cleaning

 Simple Discretization Methods: Binning

 Equal-width (distance) partitioning
 Divides the range into N intervals of equal size: uniform grid
 If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B – A)/N.
 The most straightforward, but outliers may dominate the presentation
 Skewed data is not handled well

 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing approximately the same number of data points
 Good data scaling
 Managing categorical attributes can be tricky
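
The difference between the two partitioning schemes shows up clearly on data with an outlier. The sketch below uses a hypothetical salary list and assumes the attribute has at least two distinct values, so that the width W is non-zero.

```python
def equal_width_edges(values, n_bins):
    """Bin edges for n_bins intervals of width W = (B - A) / n_bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + i * width for i in range(n_bins + 1)]

def assign_equal_width(values, n_bins):
    """Index of the equal-width bin that each value falls into."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    # The maximum value is placed in the last bin rather than a new one.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

def assign_equal_depth(values, n_bins):
    """Index of the equal-frequency bin, based on each value's rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

salaries = [20, 22, 25, 27, 30, 31, 35, 40, 200]  # hypothetical, one outlier
print(equal_width_edges(salaries, 3))
print(assign_equal_width(salaries, 3))
print(assign_equal_depth(salaries, 3))
```

With equal-width bins the single outlier stretches the range, so eight of the nine values land in the first bin; equal-depth bins hold three values each.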
Data Preprocessing – Integration

Data integration:

 Combines data from multiple sources into a coherent store

 Schema integration: e.g., A.cust-id ≡ B.cust-#

 Integrate metadata from different sources

 Entity identification problem:

 Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton

 Detecting and resolving data value conflicts

 For the same real world entity, attribute values from different sources are different

 Possible reasons: different representations, different scales, e.g., metric vs. British units
Data Preprocessing – Data Integration

Handling Redundancy in Data Integration

 Redundant data often occur when multiple databases are integrated
 Object identification: The same attribute or object may have different names in different databases
 Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue (from monthly income data)

 Redundant attributes can often be detected by correlation analysis

 Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Data Preprocessing – Data Integration

Correlation Analysis (Numerical Data)
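
For numerical attributes, one common measure is Pearson's correlation coefficient; it is assumed here, since the text of this slide does not name a formula. The sketch below computes it for a hypothetical derived attribute (annual revenue as 12 times monthly income), which is the kind of redundancy the previous slide describes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical attributes: annual revenue is derived from monthly income.
monthly = [10, 12, 15, 20, 30]
annual = [12 * m for m in monthly]
print(round(pearson_r(monthly, annual), 6))
```

A coefficient close to +1 or -1 suggests that one of the two attributes may be redundant; a value near 0 suggests no linear relationship.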

Correlation Analysis (Categorical Data)

                           Play Chess   Not Play Chess   Sum (row)
Like Science Fiction          250            200            450
Not Like Science Fiction       50           1000           1050
Sum (col.)                    300           1200           1500

Probability to play chess: P(chess) = 300/1500 = 0.2
Probability to like science fiction: P(SciFi) = 450/1500 = 0.3
If science fiction and chess playing are independent attributes, then the probability to like SciFi AND play chess is
P(SciFi, chess) = P(SciFi) · P(chess) = 0.3 · 0.2 = 0.06
That means we expect 0.06 · 1500 = 90 such cases (if they are independent)
Correlation Analysis (Categorical Data)

                           Play Chess   Not Play Chess   Sum (row)
Like Science Fiction        250 (90)         200            450
Not Like Science Fiction       50           1000           1050
Sum (col.)                    300           1200           1500

χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

The observed count (250) is far above the expected count (90), which shows that like_science_fiction and play_chess are correlated in the group!

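
The chi-square calculation for the table above can be reproduced with a short script. It assumes the standard Pearson chi-square statistic, the sum of (observed - expected)² / expected over all cells; the function name `chi_square` is an illustrative choice.

```python
def chi_square(table):
    """Pearson chi-square statistic and expected counts for a contingency table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    # Expected count of a cell under independence: row sum * column sum / total.
    expected = [[r * c / total for c in col_sums] for r in row_sums]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return stat, expected

# Observed counts from the table above (rows: like / not like science fiction).
observed = [[250, 200], [50, 1000]]
stat, expected = chi_square(observed)
print(expected)
print(round(stat, 2))
```

For a 2 x 2 table there is one degree of freedom, and the statistic here is far above the usual critical values (about 10.83 at the 0.001 significance level), so independence is rejected.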
Data Preprocessing – Reduction, Discretization
 Data Reduction (Dimensionality Reduction)
• Obtains reduced representation in volume but produces the same or similar analytical
results
 Feature Subset Selection / Principal Component Analysis (PCA)
 Singular Value Decomposition (SVD)

 Data Discretization (Dimensionality Reduction)


• Part of data reduction but with particular importance for numerical data
• Also called “binning”

Data Preprocessing – Transformation
 Data Preprocessing – Transformation
• Maps entire set of values of an attribute to a new set of values
• Data standardization and normalization (by clustering and binning)

 Smoothing: remove noise from data


 Aggregation: summarization
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling

 Attribute/feature construction
 New attributes constructed from the given ones
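
The three normalization methods listed above can be sketched as follows. The income values are hypothetical, the population standard deviation is used for the z-score (a sample standard deviation would also be a reasonable choice), and the data are assumed not to be all equal.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale the values into [new_min, new_max]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre on the mean and divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |v| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [200, 400, 600, 800, 1000]  # hypothetical
print(min_max(incomes))
print(z_score(incomes))
print(decimal_scaling(incomes))
```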
Data Preprocessing – Feature Creation
 Feature Creation
• Original attributes not always best representation of information
• Creates new features which are more efficient/focused

 Methodologies
 Features Extraction – Domain Specific
• Derived features
 Feature Construction
• Combine multiple features to construct new feature(s)
 Mapping Data to New Space
• Fourier Transform - what frequencies are present in your signal
• Wavelet Transform - what frequencies are present and where (or at what scale)
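
To illustrate "what frequencies are present in your signal", here is a naive discrete Fourier transform written directly from its definition. The test signal (3 sine cycles over 32 samples) is an assumption for the example; a real project would normally use an FFT routine from a numerical library instead.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitude of each frequency bin of the discrete Fourier transform."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]

# Hypothetical signal: 3 full sine cycles over 32 samples.
n = 32
signal = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
magnitudes = dft_magnitudes(signal)
# Only the first half of the bins is distinct for a real-valued signal.
dominant = max(range(n // 2), key=lambda k: magnitudes[k])
print(dominant)
```

The index of the largest magnitude recovers the number of cycles in the signal, which is the new feature extracted from the raw samples.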

Data Preprocessing – Discretization & Binarization

 Discretization & Binarization


• Converts the data into discrete form and then binarizes it to suit certain machine learning algorithms/models
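
A minimal sketch of the two steps, assuming a height attribute, hypothetical cut points of 160 cm and 180 cm, and one-hot encoding as the binarization scheme:

```python
def discretize(value, cut_points, labels):
    """Map a number to the label of the interval it falls into."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]

def one_hot(label, labels):
    """Binarize a discrete label: one 0/1 attribute per possible label."""
    return [1 if label == candidate else 0 for candidate in labels]

labels = ["short", "medium", "tall"]
cut_points = [160, 180]  # hypothetical thresholds in cm
for height in [150, 172, 191]:
    label = discretize(height, cut_points, labels)
    print(height, label, one_hot(label, labels))
```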

Statistical Inference

Statistical Inference – Population and Sample
 Population
 Includes all of the elements from a set of data, e.g.,
• The entire Pakistani population, i.e., 247 million, or
• The entire world population, i.e., 8 billion
• A set of objects, such as tweets or photographs
 Sample (Sample < Population)
 Consists of one or more items drawn from the population
 For example, 1,000 Pakistanis selected from all provinces of Pakistan
 The size of the sample (n) is always less than the size of the population (N): n < N
 A sample may not be fully representative of the population

Statistical Inference
 Statistical inference
 It is the process of estimating the parameters of a population using random sampling.
 The inference also tests the reliability of the estimates with calculated uncertainty.

 Purpose and benefits
 Enables us to understand the population without studying all of its items.
 Minimizes the cost of understanding the population.
 It remains the only possible option when the whole population is not accessible.

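
As a sketch of the idea, the snippet below estimates a population mean from a random sample and attaches an approximate 95% interval to it. The simulated population, its size, and the normal-approximation factor 1.96 are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)
# Hypothetical population of 100,000 values; in practice it is not fully observable.
population = [rng.gauss(170, 10) for _ in range(100_000)]

sample = rng.sample(population, 1000)          # simple random sample, n < N
estimate = mean(sample)                        # point estimate of the population mean
std_error = stdev(sample) / sqrt(len(sample))  # uncertainty of the estimate
low, high = estimate - 1.96 * std_error, estimate + 1.96 * std_error

print(f"estimate = {estimate:.2f}, 95% interval = ({low:.2f}, {high:.2f})")
print(f"true mean = {mean(population):.2f}")
```

Only 1,000 of the 100,000 items are examined, which is the cost saving the slide refers to.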
Statistical Inference

[Figure: a sample is drawn from the population by sampling; the parameters of the population are then estimated from the sample (statistical inference).]
Statistical Experiment
 A statistical experiment has three properties:
 The experiment can have more than one possible outcome
 Each possible outcome can be specified in advance
 The outcome of the experiment depends on chance

 For instance, tossing a coin is a statistical experiment
 Outcomes are:
• More than one: Head or Tail
• Specified in advance: { Head, Tail }
• Dependent on chance: unknown in advance until the coin is tossed (a 50% chance of Head and a 50% chance of Tail)

Variables or Parameters
 Variable or Parameter
 It represents the value of an attribute of an item in the population, e.g., the name or color of an item.
 A random variable can take on any of the specified values (its domain).
 A random variable takes a value after a statistical experiment.

[Figure: a variable x is assigned its value directly (x = 7); a random variable x gets its value (x = 7) as the outcome of a statistical experiment.]

Probability
 Probability
 It is the measure of the likelihood of an event happening.
 A quantitative measure that always takes a value between 0 and 1.
 Example:
 Tossing a coin is a statistical experiment.
 It can result in two outcomes: heads (H) or tails (T)
 The chance of getting a ‘head’ is its probability:
Probability = Favorable outcomes / Possible outcomes
P(H) = 1/2 = 0.5
P(T) = 1/2 = 0.5
Probability Distribution
 Probability distribution
 It links each outcome of a statistical experiment with its probability of occurrence
 For instance, you toss a coin two times
 Possible outcomes = { HH, HT, TH, TT }
 Let X = number of Heads
 Possible values of X = { 0, 1, 2 }
• P(X = 0) = 1/4 = 0.25 No Heads = { TT }
• P(X = 1) = 2/4 = 0.50 One Head = { HT, TH }
• P(X = 2) = 1/4 = 0.25 Two Heads = { HH }

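
The distribution of X can also be derived by enumerating the outcomes, assuming a fair coin so that all four outcomes are equally likely:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing a fair coin twice.
outcomes = ["".join(p) for p in product("HT", repeat=2)]
counts = Counter(outcome.count("H") for outcome in outcomes)
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(f"P(X = {x}) = {p} = {float(p):.2f}")
```

The probabilities sum to 1, as they must for any probability distribution. Changing `repeat=2` to another number of tosses gives the distribution for that experiment.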
Data Modeling
 Modeling
 A model is a representation of a real object or situation. It presents a simplified version of something.
 It is an artificial construction to understand and represent the nature of real things.
• A model leaves out unnecessary detail.
 Humans try to understand the world around them using different models.
• Architects use 3-D models to design and construct structures
• Biologists capture the connections between amino acids to understand protein-protein interactions
• Statisticians and data scientists capture randomness to comprehend data-generating processes
 Data Modeling
 Data modeling is the analysis of data objects and their relationships to other data objects.
 The model helps us define and analyze the data requirements needed to support the business processes in an organization.

Data Modeling
 Building a Model
 Do some Exploratory Data Analysis (EDA) and discover the relationships among the data.
 Try to describe the relationships using a mathematical formula.
 Model Fitting
 Balanced fitting
• When the model fits both the training and the testing data well
 Underfitting
• When the model is unable to fit even the training data
 Overfitting
• When the model fits the training data well but the testing data poorly
 Noise (undesired data) and high variability (inconsistency) in the data cause overfitting
 To reduce it, remove noise (data cleaning) and add more training data to train the model.

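
The three fitting situations can be observed by comparing training and testing error. The sketch below is only an illustration under stated assumptions: the data follow a hypothetical process y = 2x + noise, a constant predictor stands in for an underfitting model, a least-squares line for a balanced one, and a nearest-point memorizer for an overfitting one.

```python
import random
from statistics import mean

def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return mean((model(x) - y) ** 2 for x, y in data)

def fit_constant(train):
    """Underfits: ignores x and always predicts the mean of y."""
    avg = mean(y for _, y in train)
    return lambda x: avg

def fit_line(train):
    """Least-squares straight line."""
    mx, my = mean(x for x, _ in train), mean(y for _, y in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
    return lambda x: my + slope * (x - mx)

def fit_memorizer(train):
    """Overfits: returns the y of the nearest training point, noise included."""
    return lambda x: min(train, key=lambda point: abs(point[0] - x))[1]

rng = random.Random(1)
# Hypothetical data-generating process: y = 2x + noise.
data = [(x, 2 * x + rng.gauss(0, 3)) for x in (rng.uniform(0, 10) for _ in range(200))]
train, test = data[:140], data[140:]  # 70:30 split

for name, fit in [("constant", fit_constant), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    print(f"{name:10s} train MSE = {mse(model, train):6.2f}   test MSE = {mse(model, test):6.2f}")
```

The memorizer reproduces the training data exactly, so its training error is zero; whether a model generalizes is judged from the testing error instead.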
Fitting a Data Model

[Figure: two fitted curves. Underfitting: too simple to explain the variation in the data. Overfitting: too good to be true (forced fitting).]

Exploratory Data Analysis
What is EDA?
 Exploratory Data Analysis (EDA)
 In statistics, EDA is used to analyze datasets to summarize their main characteristics.

 EDA often employs visual methods to see what the data can tell us beyond the formal modeling or hypothesis testing task.

 It is an effort to understand the process that generates the data under observation.
• ‘Exploration’ means your understanding of the problem changes as you go along.
• Plots, graphs, and summary statistics are the basic tools of EDA.
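
A first EDA pass over a single numeric attribute might combine summary statistics, an outlier check, and a crude plot. The age values, the 1.5 x IQR outlier rule, and the text histogram are assumptions chosen to keep the sketch dependency-free; plotting libraries are the usual choice in practice.

```python
from statistics import mean, quantiles

def summarize(values):
    """Summary statistics plus outliers by the 1.5 * IQR rule."""
    q1, q2, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3,
            "max": max(values), "mean": mean(values), "outliers": outliers}

def text_histogram(values, n_bins=5):
    """Print a crude histogram with one row per equal-width bin."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - low) / width), n_bins - 1)] += 1
    for i, count in enumerate(counts):
        print(f"{low + i * width:7.1f} | {'#' * count}")

ages = [23, 25, 26, 27, 29, 30, 31, 31, 33, 35, 36, 38, 95]  # hypothetical
print(summarize(ages))
text_histogram(ages)
```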

EDA – Why do we do it?
 EDA helps us to:
 Understand the data and its value in business
• Discover patterns in data
• Spot anomalies (outliers) in data
• Verify existing assumptions about data
• Make comparisons between data distributions
• Find suitable data formats
 Improve the accuracy of the data products
 Support verification of the data products

The Data Science Process
 The Data Science Process
 Say we have raw data about the things we are interested in
 We want to process these data for better analysis
 Processing gives us a clean dataset to analyze
 We do some EDA with the clean dataset
 EDA leads us towards a data model and an algorithm
 We get results from the model and interpret, visualize, or report them
 The results are used in decision making or as input for a ‘Data Product’
 Such products include:
• Recommender system
• Business forecasting system
• Spam classifier

Data Science Process
[Figure: process flow diagram. Stages shown: Reality, Business Problem, Instrument Data Sources, Raw Data Collection, Data Pre-Processing, Clean Dataset, Exploratory Data Analysis, Data Processing (Decision Support, Business Intelligence, Recommender Systems, Business Forecasting/Prediction), Visualization / Communicate Results, Make Decisions, Data Product.]
The Big Data Approach
 Big Data
 Big data is a field dedicated to the analysis, processing, and storage of large collections of data that frequently originate from various sources.
 It is used when traditional data analysis, processing, and storage technologies and techniques are insufficient.
 Big Data Characteristics
 Volume
 Velocity
 Variety
 Veracity
 Value

Big Data – Analytics Types
 Descriptive Analysis (What happened)
 It answers questions about events that have already occurred.

 Diagnostic Analysis (Why did it happen)
 It determines the cause of a phenomenon that occurred in the past, using questions that focus on the reason behind the event.

 Predictive Analysis (What will happen)
 It attempts to determine the outcome of an event that may occur in the future.

 Prescriptive Analysis (How can we make it happen)
 It builds upon the results of predictive analytics to prescribe the actions that should be taken to improve the business.

Big Data – Analytics Types

[Figure: the four analytics types, numbered 1 Descriptive, 2 Diagnostic, 3 Predictive, 4 Prescriptive.]

Data Analytics Life Cycle – Phase 1
 Phase 1: Learning the business domain and problem discovery
 Understand the business process
• Study similar past projects
• Identify available resources: people, required skills, technology, time, and data.
• Have the right mix of domain experts, customers, analytic talent, and project management.
 Identify key stakeholders
• Understand their interests in the project
• Propose and discuss more than one solution to the problem
 Discover the problem to be solved
• Write the problem statement and its justification.
• Refine the problem statement after discussion with the major stakeholders
• Establish the criteria for success and failure of the proposed solution

Data Analytics Life Cycle – Key Roles

Data Analytics Life Cycle – Phase 2
 Phase 2: Data preparation
 Define the steps to explore and preprocess data before its modeling and analysis.
 Prepare the analytics sandbox (setup for the experiments)
 Perform the Extract Transform Load (ETL) process (or ELT).  ETLT = ETL + ELT
 Understand the target data
 Data cleaning – data normalization and transformation
• For better understanding, utilize maximum of the available data
• Survey and visualize the test dataset
• Carefully complete the highly labor-intensive activity
 Data accessing strategies:
• Download snapshot of the production data
• Use the API facility, if available

Phase 2 – Sample Dataset Inventory

Phase 2 – Common tools for data preparation
 Phase 2: Data preparation tools
 Hadoop
• It can perform massively parallel loading and analysis of large datasets.
• Used for web traffic parsing, GPS location analytics, genomic analysis, and combining massive unstructured data feeds from multiple sources.
 Alpine Miner
• Provides a graphical user interface (GUI) for data manipulation and analysis
 OpenRefine (formerly Google Refine)
• A powerful tool for working with large and unstructured datasets. It is a popular GUI-based tool for performing data transformations.
 Data Wrangler (Stanford University)
• An interactive tool for data cleaning and transformation on a given dataset.

Data Analytics Life Cycle – Phase 3
 Phase 3: Planning the data model
 Data exploration and variable selection
• Perform Exploratory Data Analysis, if required.
• Explore associations & relationships among data
• Identify key performance indicators (KPIs)
 Selecting suitable data analytical method or model
• Keep in mind requirements of the business
• Consider the type and format of data attributes
• Consult the domain experts and follow the best practices

Phase 3 – Selecting appropriate data analytical model

Data Analytics Life Cycle – Phase 3
 Phase 3: Common tools for the model planning phase
 R - Analytical Software Package
• It has data modeling capabilities and a good environment for building interpretive models
• R can interface with databases via an ODBC connection and execute statistical tests and analyses against Big Data via an open source connection.
• R contains nearly 5,000 packages for data analysis and graphical representation.
 SQL Analysis Services
• It can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models.
 SAS/ACCESS
• Provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB.
• Provides connectivity to relational databases (such as Oracle or Teradata), data warehousing applications (e.g., Greenplum or Aster), and enterprise applications such as SAP and Salesforce.

Data Analytics Life Cycle – Phase 4
 Phase 4: Model building
 Develop datasets for testing, training, and production purposes.
 Assess the validity of the model and its results on a small scale
• Verify the model's results with domain experts
 Evaluate the hardware support required to execute the model

Data Analytics Life Cycle – Phase 4
 Phase 4: Common tools for the model building phase
 SAS Enterprise Miner
• Allows users to run predictive and descriptive models based on large volumes of data from across the enterprise.
• It is built for enterprise-level computing and analytics by interoperating with large data stores.
 SPSS Modeler (IBM SPSS Modeler)
• Offers methods to explore and analyze data through a GUI.
 MATLAB
• Provides a high-level language for performing a variety of data analytics and exploration.
 Statistica and Mathematica
• Popular and well-regarded data mining and analytics tools.

Data Analytics Life Cycle – Phase 4
 Phase 4: Free or Open Source tools for the model building phase
 WEKA
• A free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.
 Python
• It is a programming language that provides toolkits for machine learning and analysis, such as scikit-learn, NumPy, SciPy, and pandas, and related data visualization using matplotlib.
 R and PL/R
• R was described earlier in the model planning phase, and PL/R is a procedural language for PostgreSQL with R. Using this approach means that R commands can be executed in the database.
 Octave
• A programming language for computational modeling that has some of the functionality of MATLAB.
• Being freely available, Octave is used in major universities when teaching machine learning.

Data Analytics Life Cycle – Phase 5
 Phase 5: Communicate the results
 Collaborate with the major stakeholders and evaluate the results
• Identify key findings and quantify their business value.
• The deliverable of this phase will be decisive for the outside stakeholders and sponsors
• Summarize the findings and convey them to the stakeholders.
• Make recommendations for future work or improvements to existing processes
 Accept failure of an analytical project
• A true failure means the data fail to either accept or reject the hypothesis stated in Phase 1.
• The analyst should be rigorous enough with the data to determine whether it will prove or disprove the hypotheses

Data Analytics Life Cycle – Phase 6
 Phase 6: Operationalize
 Communicate the benefits of the project more broadly
• If required, run a pilot project before implementing the models in a production environment.
• Learn from the deployment and make any needed adjustments.
 Properly document and deliver the final reports, briefings, code, and technical documents.
• Consult documentation of the similar past projects, if available.
• Follow the documentation standards to increase its effectiveness.

Content Review
 Data
 Definition, Importance, Characteristics, Sources and Types
 Structured Data, Semi-Structured Data, Un-Structured Data
 The information processing cycle
 Data Preprocessing (Integration, Cleansing, Reduction, and Transformation)
 Statistical Inference
 Definition & Objectives, Sampling, Statistical experiment and Probability
 Exploratory Data Analysis (EDA)
 Definition & Objectives, EDA Process and Example
 The Data Science Process
 Definition & Objectives, the Process diagram
 Data Analytical Life Cycle
 Discovery, Data preparation, Model planning, Model building, Communicate results, Operationalize

You are Welcome!

Questions ?
Comments !
Suggestions !!

Farewell to the Day