
UNIT-II

Data Analytics
INTRODUCTION TO ANALYTICS
• Analytics is often used to discover, interpret, and
communicate meaningful patterns in data. In business,
healthcare, sports, and many other fields, analytics helps
to inform decision-making and improve efficiency,
effectiveness, and profitability.
• Data Analytics refers to the techniques used to analyze data
in order to enhance productivity and business gain. Data is
extracted from various sources, then cleaned and categorized
to analyze different behavioral patterns. The techniques and
tools used vary according to the organization or individual.
Types of analytics
1. Descriptive Analytics (“What has happened?”)
(Data aggregation, summary, data mining)
2. Predictive Analytics (“What might happen?”)
(Regression, LSE, MLE)
3. Prescriptive Analytics (“What should we do?”)
(Optimization, Recommendation)
Data Analytics vs. Data Analysis

Form:
• Data analytics is the 'general' form of analytics, used in businesses to make data-driven decisions.
• Data analysis is a specialized form of data analytics, used in businesses to analyze data and draw insights from it.

Structure:
• Data analytics consists of data collection and inspection in general, and it has one or more users.
• Data analysis consists of defining the data, then investigating, cleaning, and transforming it to produce a meaningful outcome.

Tools:
• Data analytics: there are many analytics tools in the market, but mainly R, Tableau Public, Python, SAS, Apache Spark, and Excel are used.
• Data analysis: tools such as OpenRefine, KNIME, RapidMiner, Google Fusion Tables, Tableau Public, NodeXL, and WolframAlpha are used.

Sequence:
• The data analytics life cycle consists of Business Case Evaluation, Data Identification, Data Acquisition & Filtering, Data Extraction, Data Validation & Cleansing, Data Aggregation & Representation, Data Analysis, Data Visualization, and Utilization of Analysis Results.
• The sequence followed in data analysis is data gathering, data scrubbing, analysis of the data, and precise interpretation of the data so that you understand what your data is telling you.

Usage:
• Data analytics, in general, can be used to find hidden patterns, unknown correlations, customer preferences, market trends, and other information that helps make better-informed business decisions.
• Data analysis can be used in various ways: one can perform descriptive, exploratory, inferential, or predictive analysis and take useful insights from the data.

Example:
• Say you have 1 GB of customer purchase data for the past year and want to predict your customers' next possible purchases; you would use data analytics for that.
• Suppose you have the same 1 GB of customer purchase data for the past year and you are trying to find out what has happened so far; in data analysis we look into the past.
Why is Data Analytics important?
As an enormous amount of data gets generated, the need to extract
useful insights is a must for a business enterprise. Data Analytics has
a key role in improving your business.

Here are four main factors which signify the need for Data Analytics:
• Gather Hidden Insights – Hidden insights are gathered from data
and then analyzed with respect to business requirements.
• Generate Reports – Reports are generated from the data and
passed on to the respective teams and individuals who take
further action to grow the business.
• Perform Market Analysis – Market analysis can be performed to
understand the strengths and weaknesses of competitors.
• Improve Business Requirements – Analysis of data helps improve
the business according to customer requirements and experience.
Tools in Data Analytics
With the increasing demand for Data Analytics in the market, many
tools have emerged with various functionalities for this purpose.
Ranging from open-source packages to user-friendly commercial
products, the top tools in the data analytics market are as follows.
• R programming – This is a leading analytics tool used for
statistics and data modeling. R compiles and runs on various
platforms such as UNIX, Windows, and macOS. It also provides
tools to automatically install packages as per user requirements.
• Python – Python is an open-source, object-oriented
programming language that is easy to read, write, and
maintain. It provides various machine learning and visualization
libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas,
and Keras. It can also connect to almost any data platform, such
as a SQL Server database, a MongoDB database, or JSON files
(a minimal pandas sketch appears after this list of tools).
• Tableau Public – This is free software that connects
to any data source, such as Excel or a corporate data
warehouse. It then creates visualizations, maps, and
dashboards with real-time updates on the web.
• QlikView – This tool offers in-memory data
processing, with results delivered to end users
quickly. It also offers data association and data
visualization, with data compressed to almost
10% of its original size.
• SAS – A programming language and environment for
data manipulation and analytics, this tool is easily
accessible and can analyze data from different
sources.
Tools in Data Analytics ...
• Microsoft Excel – This is one of the most widely used
tools for data analytics. Mostly used for clients' internal
data, it can summarize data and preview the summaries
with pivot tables.
• RapidMiner – A powerful, integrated platform that can
connect to many data source types, such as Access, Excel,
Microsoft SQL Server, Teradata, Oracle, and Sybase. This tool is
mostly used for predictive analytics tasks such as data mining,
text analytics, and machine learning.
• KNIME – Konstanz Information Miner (KNIME) is an open-
source data analytics platform, which allows you to analyze
and model data. With the benefit of visual programming,
KNIME provides a platform for reporting and integration
through its modular data pipeline concept.
• OpenRefine – Also known as Google Refine, this data
cleaning software helps you clean up data for
analysis. It is used for cleaning messy data,
transforming data, and parsing data from
websites.
• Apache Spark – One of the largest large-scale data
processing engines, this tool executes applications in
Hadoop clusters up to 100 times faster in memory and 10
times faster on disk. It is also popular for data
pipelines and machine learning model development.
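As a quick illustration of the kind of descriptive analytics these tools perform, here is a minimal Python/pandas sketch. It assumes a hypothetical file named sales.csv with columns region, product, and revenue; the file name and column names are placeholders, not tied to any specific tool above.

import pandas as pd

# Load the raw data (hypothetical file and column names)
df = pd.read_csv("sales.csv")

# Descriptive analytics: summarize revenue by region
summary = df.groupby("region")["revenue"].agg(["count", "mean", "sum"])
print(summary)

# A simple pivot table, similar to what Excel users build interactively
pivot = df.pivot_table(values="revenue", index="region",
                       columns="product", aggfunc="sum", fill_value=0)
print(pivot)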
Data Workflows
Reporting vs. Analytics:
Reporting is the presentation of the results of data analysis,
while analytics is the process (or system) involved in analyzing
data to obtain a desired output.
Various steps involved in Analytics:
• Define your Objective
• Understand Your Data Source
• Prepare Your Data
• Analyze Data
• Report on Results
Step 1 - Define Your Objective
Ask the following questions:
• What are you trying to achieve?
• What could the result look like?
Step 2 - Understand Your Data Source
Ask the following questions:
• What information do I need?
• Can I get the data myself, or do I need to ask an IT
resource?
Step 3 - Prepare Your Data
Ask the following questions:
• Does the data need to be cleansed?
• Does the data need to be normalized?
Step 4 - Analyze Data
Ask the following questions:
• What tests can I run on the data?
• Is help available to understand results?
Step 5 - Report Results
Ask the following questions:
• Will management understand the results?
• Can you represent the results visually?
Various Analytics techniques are:
• Data Preparation
• Reporting, Dashboards & Visualization
• Segmentation
• Forecasting
• Descriptive Modelling
• Predictive Modelling
Application of Modeling in Business
• A statistical model embodies a set of assumptions
concerning the generation of the observed data, and
similar data from a larger population.
• A model represents, often in considerably idealized form,
the data-generating process. Signal processing is an
enabling technology that encompasses the fundamental
theory, applications, algorithms, and implementations of
processing or transferring information contained in many
different physical, symbolic, or abstract formats broadly
designated as signals. It uses mathematical, statistical,
computational, heuristic, and linguistic representations,
formalisms, and techniques for representation, modelling,
analysis, synthesis, discovery, recovery, sensing,
acquisition, extraction, learning, security, or forensics.
Application of Modeling in Business...

• In manufacturing, statistical models are used to
define warranty policies, solve various conveyor-related
issues, perform Statistical Process Control, etc.
• BAs often need to analyse data as part of making
data modeling decisions, which means that data
modeling can include some amount of data analysis.
A lot can be accomplished with very basic technical
skills, such as the ability to run simple database
queries. This is why you may see a technical skill like
SQL in a business analyst job description.
Application of Modeling in Business...
• Many BAs succeed without knowing these more
technical skills; instead, they rely on their ability to
collaborate with technical professionals and other
knowledgeable stakeholders to ensure the data is
understood well enough to make the right modelling
decisions.
• The non-technical BA can also evaluate sample data,
interview stakeholders to discover possible data-related
issues, review current state database models, and
analyse exception reports.
• While data analysis skills are valuable for the business
analyst, they are not essential. However, data modelling
falls squarely within the business analyst’s domain.
Databases & Types of data and variables:
A database is an organized collection of data, typically stored and
accessed electronically. Databases allow data to be stored in a
structured manner, ensuring easy access, manipulation, and
retrieval of the data for various applications. Databases can handle
different types of data and serve a variety of purposes based on
their design and structure.
• Key Components of Databases:
• Data: The actual information stored within the database, such as
numbers, text, or multimedia.
• Database Management System (DBMS): Software that enables
users to interact with the database, manage data, and ensure
data integrity and security (e.g., MySQL, PostgreSQL, MongoDB).
• Query Language: Languages like SQL (Structured Query
Language) that allow users to retrieve, insert, update, and delete
data from databases.
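To make the role of a query language concrete, here is a minimal sketch using Python's built-in sqlite3 module as the DBMS; the table and column names are illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

cur.execute("INSERT INTO customers (name) VALUES (?)", ("Asha",))          # insert
cur.execute("UPDATE customers SET name = ? WHERE id = ?", ("Asha R", 1))   # update
rows = cur.execute("SELECT id, name FROM customers").fetchall()            # retrieve
cur.execute("DELETE FROM customers WHERE id = ?", (1,))                    # delete
conn.commit()
print(rows)   # [(1, 'Asha R')]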
Types of Databases
Databases are generally categorized based on the way
they organize and store data. The most common types:
1. Relational Databases (RDBMS)
2. NoSQL Databases
3. NewSQL Databases
4. Cloud Databases
5. In-Memory Databases
6. Object-Oriented Databases
7. Distributed Databases
8. Time-Series Databases
9. Columnar Databases
Data Dictionary
A data dictionary, or metadata repository, as defined in the IBM
Dictionary of Computing, is a "centralized repository of information
about data such as meaning, relationships to other data, origin,
usage, and format”.
Data can be categorized on various parameters, such as type, category, and usage:
• By type, data is Numeric or Character; numeric data can be further divided into Discrete and Continuous.
• By category, data can be divided into Nominal and Ordinal.
• Based on usage, data is divided into Quantitative and Qualitative.
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Important Characteristics of Data

– Dimensionality (number of attributes)
  • High-dimensional data brings a number of challenges
– Sparsity
  • Only presence counts
– Resolution
  • Patterns depend on the scale
– Size
  • The type of analysis may depend on the size of the data
Record Data
• Data that consists of a collection of records, each
of which consists of a fixed set of attributes
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Data Matrix
• If data objects have the same fixed set of numeric attributes,
then the data objects can be thought of as points in a multi-
dimensional space, where each dimension represents a
distinct attribute

• Such a data set can be represented by an m-by-n matrix, where
there are m rows, one for each object, and n columns, one for
each attribute

Projection of x Load   Projection of y Load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1
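The same two objects can be held as a data matrix in code. The sketch below uses NumPy and simply re-enters the values from the table above.

import numpy as np

# m = 2 objects (rows), n = 5 attributes (columns)
data_matrix = np.array([
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
])
print(data_matrix.shape)         # (2, 5)
print(data_matrix.mean(axis=0))  # per-attribute (column) means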
Document Data
• Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the
corresponding term occurs in the document.

             team   coach   play   ball   score   game   win   lost   timeout   season
Document 1    3      0       5      0      2       6      0     2      0         2
Document 2    0      7       0      2      1       0      0     3      0         0
Document 3    0      1       0      0      1       2      2     0      3         0
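A term vector can be built by counting word occurrences. The sketch below uses two made-up one-line documents; the vocabulary and counts are illustrative only, not taken from the table above.

from collections import Counter

docs = ["team play team win", "coach timeout coach"]

# Vocabulary = sorted set of all terms across the documents
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a vector of term counts over the vocabulary
term_vectors = [[Counter(doc.split())[term] for term in vocabulary] for doc in docs]

print(vocabulary)     # ['coach', 'play', 'team', 'timeout', 'win']
print(term_vectors)   # [[0, 1, 2, 0, 1], [2, 0, 0, 1, 0]]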
Transaction Data
• A special type of record data, where
– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip
constitute a transaction, while the individual products that
were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
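Because each transaction is a set of items, transaction data is naturally represented as a list of sets. The sketch below re-enters the table above and adds, purely for illustration, a support calculation for one item.

# Transactions from the table above, each represented as a set of items
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

# Support of an item = fraction of transactions that contain it
support_milk = sum("Milk" in t for t in transactions) / len(transactions)
print(support_milk)   # 0.8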
Graph Data
• Examples: Generic graph, a molecule, and webpages

[Figures: a generic graph with weighted edges, the benzene molecule C6H6, and linked web pages]
Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data

• Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Attribute is also known as variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
  – Object is also known as record, point, case, sample, entity, or instance
(In the Tid / Refund / Marital Status / Taxable Income / Cheat table shown earlier, each column is an attribute and each row is an object.)
Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute for a particular object
• Distinction between attributes and attribute values
  – The same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    • Example: attribute values for ID and age are integers
    • But the properties of the attribute values can be different
Measurement of Length
• The way you measure an attribute may not match the attribute's
properties.

[Figure: the same line segments measured on two different scales; one scale preserves only the ordering property of length, while the other preserves both the ordering and additivity properties of length.]
Types of Attributes
• There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
• Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following
properties/operations it possesses:
  – Distinctness: =, ≠
  – Order: <, >
  – Meaningful differences: +, -
  – Meaningful ratios: *, /

• Based on these properties:
  – Nominal attribute: distinctness
  – Ordinal attribute: distinctness & order
  – Interval attribute: distinctness, order & meaningful differences
  – Ratio attribute: all 4 properties/operations
Difference Between Ratio and Interval
• Is it physically meaningful to say that a
temperature of 10° is twice that of 5° on
  – the Celsius scale?
  – the Fahrenheit scale?
  – the Kelvin scale?

• Consider measuring the height above average


– If Bill’s height is three inches above average and
Bob’s height is six inches above average, then would
we say that Bob is twice as tall as Bill?
– Is this situation analogous to that of temperature?
Attribute Type / Transformation / Comments

Categorical (Qualitative) attributes:
• Nominal – Transformation: any permutation of values. Comment: if all employee ID numbers were reassigned, would it make any difference?
• Ordinal – Transformation: an order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. Comment: an attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative) attributes:
• Interval – Transformation: new_value = a * old_value + b, where a and b are constants. Comment: the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
• Ratio – Transformation: new_value = a * old_value. Comment: length can be measured in meters or feet.

This categorization of attributes is due to S. S. Stevens
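The interval and ratio rows of the table can be read as simple functions. The sketch below shows a Celsius-to-Fahrenheit conversion (interval: a*x + b) and a meters-to-feet conversion (ratio: a*x); the conversion constants are standard values, not taken from the slides.

def celsius_to_fahrenheit(c):
    # Interval transformation: new_value = a * old_value + b
    return 9.0 / 5.0 * c + 32.0

def meters_to_feet(m):
    # Ratio transformation: new_value = a * old_value
    return 3.28084 * m

print(celsius_to_fahrenheit(100.0))  # 212.0
print(meters_to_feet(2.0))           # 6.56168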


Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as
floating-point variables.
Asymmetric Attributes
• Only presence (a non-zero attribute value) is regarded as important
  – Words present in documents
  – Items present in customer transactions
• If we met a friend in the grocery store, would we ever say the following?
  "I see our purchases are very similar since we didn't buy most of the same things."
• We need two asymmetric binary attributes to represent one ordinary binary attribute
  – Association analysis uses asymmetric attributes
• Asymmetric attributes typically arise from objects that are sets
Critiques

• Incomplete
  – Asymmetric binary
  – Cyclical
  – Multivariate
  – Partially ordered
  – Partial membership
  – Relationships between the data
• Real data is approximate and noisy
  – This can complicate recognition of the proper attribute type
  – Treating one attribute type as another may be approximately correct
Critiques …
• Not a good guide for statistical analysis
– May unnecessarily restrict operations and
results
• Statistical analysis is often approximate
• Thus, for example, using interval analysis for
ordinal values may be justified
– Transformations are common but don’t
preserve scales
• Can transform data to a new scale with better
statistical properties
• Many statistical analyses depend only on the
distribution
More Complicated Examples

• ID numbers
  – Nominal, ordinal, or interval?
• Number of cylinders in an automobile engine
  – Nominal, ordinal, or ratio?
• Biased scale
  – Interval or ratio?
Key Messages for Attribute Types
• The types of operations you choose should be “meaningful” for the
type of data you have
  – Distinctness, order, meaningful intervals, and meaningful ratios
    are only four properties of data
  – The data type you see (often numbers or strings) may not
    capture all the properties or may suggest properties that are not
    there
  – Analysis may depend on these other properties of the data
    • Many statistical analyses depend only on the distribution
  – Many times what is meaningful is measured by statistical
    significance
  – But in the end, what is meaningful is measured by the domain

Data Modeling Techniques
Data modeling techniques are essential for
structuring and organizing data in a way that
supports data analysis, storage, and retrieval.
These techniques are used to create conceptual
representations of data, which helps in designing
databases and data systems, ensuring that the
data structure meets the needs of the business or
analytical processes.
Various data modeling techniques:
1. Conceptual Data Modeling
2. Logical Data Modeling
3. Physical Data Modeling
4. Hierarchical Data Modeling
5. Network Data Modeling
6. Relational Data Modeling
7. Dimensional Data Modeling (Star and Snowflake
Schema)
8. Object-Oriented Data Modeling
9. Entity-Relationship (ER) Model
10. NoSQL Data Modeling
1. Conceptual Data Modeling
• Purpose: This is the high-level representation of
organizational data and focuses on defining business
entities, their attributes, and relationships. It doesn’t
consider how the data will be physically stored or managed.
• Key Components: Entities (e.g., customers, products),
relationships between entities, and attributes (e.g.,
customer name, product ID).
• Tools: Entity-Relationship Diagrams (ERDs) are commonly
used in conceptual data models to represent entities and
their relationships.
• Use Cases: Typically used in the initial stages of database
design to define the scope and structure of the data from a
business perspective.
2. Logical Data Modeling
• Purpose: This step is more detailed than conceptual modeling
and represents the structure of the data within a database
without considering how it will be physically implemented. It
focuses on organizing data into tables, columns, and keys.
• Key Components:
• Entities: Represented as tables.
• Attributes: Represented as columns.
• Primary and Foreign Keys: Define relationships between tables.
• Normalization: Applied to eliminate redundancy and ensure data
integrity.
• Tools: ERD tools, such as Lucidchart, ER/Studio, or Microsoft
Visio.
• Use Cases: Logical data models serve as blueprints for database
design, ensuring that data is organized and relationships
between tables are correctly structured.
3. Physical Data Modeling
• Purpose: This focuses on how data will actually be stored in the
system. It defines the storage mechanisms, database objects
(tables, indexes, views), and physical storage structures (e.g., disk
partitions).
• Key Components:
• Tables: Actual database tables.
• Indexes: For improving query performance.
• Constraints: Like unique, not null, or foreign key constraints to
enforce data integrity.
• Storage Parameters: Defines how the data will be stored physically
(disk, memory).
• Tools: Database design tools like SQL Server Management Studio
(SSMS), Oracle SQL Developer, or MySQL Workbench.
• Use Cases: Used to build and implement the actual database
system, ensuring optimal performance, storage, and query
execution.
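A physical model is ultimately expressed as concrete database objects. The sketch below creates a table, constraints, and an index in SQLite via Python; the table and column names are illustrative assumptions, not a prescribed design.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Physical objects: a table with constraints ...
cur.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        amount      REAL    NOT NULL CHECK (amount >= 0)
    )
""")

# ... and an index to speed up queries on customer_id
cur.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
conn.commit()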
4. Hierarchical Data Modeling
• Purpose: Organizes data in a tree-like structure with
a parent-child relationship. Each child node has a
single parent, but parents can have multiple
children.
• Key Components: Nodes represent entities, and
edges represent relationships. This model is
effective for hierarchical data, like organization
charts or file systems.
• Examples: XML, Windows Registry.
• Use Cases: Used in systems where data has a natural
hierarchy (e.g., organizational charts, taxonomies).
5. Network Data Modeling
• Purpose: Expands on hierarchical models by
allowing entities to have multiple parent nodes,
representing a more complex set of relationships
(many-to-many).
• Key Components: This model uses graphs with
nodes (entities) and edges (relationships).
• Examples: CODASYL, Graph databases like Neo4j.
• Use Cases: Often used in networking or
telecommunications, and in cases where many-
to-many relationships need to be represented.
6. Relational Data Modeling
• Purpose: Organizes data in a set of tables with defined
relationships between them. This is the most widely
used model in database systems today.
• Key Components:
• Tables: Represent entities.
• Rows: Represent records.
• Columns: Represent attributes.
• Primary and Foreign Keys: Ensure data integrity and
define relationships between tables.
• Examples: MySQL, PostgreSQL, Oracle Database.
• Use Cases: Used for traditional database management
systems (DBMS), where data needs to be efficiently
stored, retrieved, and manipulated.
7. Dimensional Data Modeling (Star and
Snowflake Schema)
• Purpose: Primarily used for designing data warehouses and
analytical databases. It focuses on optimizing data for read-heavy
queries and reporting, rather than for transaction processing.
• Key Components:
• Fact Tables: Contain numerical data for analysis (e.g., sales revenue).
• Dimension Tables: Contain descriptive data related to the facts (e.g.,
product details, customer information).
• Star Schema: The simplest form, where dimension tables directly
connect to a central fact table.
• Snowflake Schema: A more normalized version of the star schema,
where dimension tables are further broken down into related sub-tables.
• Tools: Informatica, Pentaho, Talend.
• Use Cases: Commonly used in business intelligence (BI) and data
warehouses for aggregating and analyzing large datasets.
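In a star schema, analytical queries join a central fact table to its dimension tables. The sketch below mimics this with two small pandas DataFrames; the tables and figures are invented purely for illustration.

import pandas as pd

# Dimension table: descriptive data about products (illustrative)
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "category":   ["Snacks", "Drinks"],
})

# Fact table: numeric measures keyed by dimension ids (illustrative)
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "revenue":    [120.0, 80.0, 50.0],
})

# Typical star-schema query: join fact to dimension, then aggregate
report = (fact_sales.merge(dim_product, on="product_id")
          .groupby("category")["revenue"].sum())
print(report)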
8. Object-Oriented Data Modeling
• Purpose: Represents data as objects (similar to
programming languages like Java or C++), where both
data and the operations that can be performed on the
data are bundled together.
• Key Components: Classes (blueprints for objects),
objects (instances of classes), attributes, and methods.
• Examples: Object-Relational Mapping (ORM) tools like
Hibernate, Entity Framework.
• Use Cases: Often used in systems where the data
model is closely aligned with object-oriented
programming paradigms, such as modern web
applications or systems with complex business rules.
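The sketch below shows the object-oriented idea in miniature: a class bundles attributes with the operations that act on them. The class and its fields are invented for illustration, not part of any ORM tool named above.

class Customer:
    """Data (attributes) and behaviour (methods) bundled together."""

    def __init__(self, customer_id, name, balance=0.0):
        self.customer_id = customer_id
        self.name = name
        self.balance = balance

    def deposit(self, amount):
        # Operation acting on the object's own data
        self.balance += amount
        return self.balance

c = Customer(1, "Asha")
print(c.deposit(100.0))   # 100.0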
9. Entity-Relationship (ER) Model
• Purpose: One of the foundational techniques for data
modeling. ER modeling is used to represent data
entities, their attributes, and relationships in a visual
format.
• Key Components:
• Entities: Represent tables or objects.
• Attributes: Represent fields or properties of entities.
• Relationships: Define how entities interact with one
another (e.g., one-to-one, one-to-many).
• Tools: ERwin, Visual Paradigm, Lucidchart.
• Use Cases: Often used in early database design and
conceptual modeling to map out complex
relationships between different data objects.
10. NoSQL Data Modeling
• Purpose: NoSQL models are designed for handling large,
unstructured, or semi-structured data. These models are
used when scalability and flexibility are prioritized over strict
schema design.
• Key Components:
• Document-based: Stores data as documents (e.g., JSON, BSON).
Examples: MongoDB.
• Key-Value: Data is stored as key-value pairs. Examples: Redis,
DynamoDB.
• Column-family: Optimized for read and write operations in large-
scale data systems. Examples: Cassandra, HBase.
• Graph-based: Organizes data as a graph to model relationships.
Examples: Neo4j.
• Use Cases: Used in big data applications, real-time analytics,
and systems where the data structure is constantly evolving.
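A document-oriented record is essentially a nested JSON structure. The sketch below builds one such document in plain Python; the fields are illustrative, and no particular NoSQL product's API is shown.

import json

# A semi-structured "document": fields can vary from record to record
order_document = {
    "order_id": 1001,
    "customer": {"name": "Asha", "city": "Pune"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}
print(json.dumps(order_document, indent=2))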
MISSING IMPUTATIONS
• There are many ways to approach missing data. The most
common is to ignore it. But making no choice means that
your statistical software is choosing for you.
• Most of the time, your software is choosing listwise
deletion. Listwise deletion may or may not be a bad
choice, depending on why and how much data are missing.
• Another common approach among those who are paying
attention is imputation. Imputation simply means replacing
the missing values with an estimate, then analyzing the full
data set as if the imputed values were actual observed
values.
How do you choose that estimate? The
following are common methods:
• Mean imputation
• Substitution
• Hot deck imputation
• Cold deck imputation
• Regression imputation
• Stochastic regression imputation
• Interpolation and extrapolation
• Single or Multiple Imputation
Mean imputation
• Simply calculate the mean of the observed values for
that variable for all individuals who are non-missing.
• It has the advantage of keeping the same mean and the
same sample size, but many, many disadvantages.
Pretty much every method listed below is better than
mean imputation.
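As a concrete sketch of mean imputation (and its main drawback), the example below fills missing values of one variable with the observed mean using pandas; the data is invented.

import pandas as pd

age = pd.Series([23.0, None, 31.0, None, 26.0])

# Mean imputation: replace missing values with the mean of the observed ones
age_imputed = age.fillna(age.mean())
print(age_imputed.tolist())   # [23.0, 26.67, 31.0, 26.67, 26.0] (approximately)

# The drawback: variance shrinks because identical values are inserted
print(age.var(), age_imputed.var())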

Substitution
• Impute the value from a new individual who was not
selected to be in the sample.
• In other words, go find a new subject and use their
value instead.
Hot deck imputation
• A randomly chosen value from an individual in the
sample who has similar values on other variables.
• In other words, find all the sample subjects who are
similar on other variables, then randomly choose one
of their values on the missing variable.
• One advantage is you are constrained to only possible
values. In other words, if Age in your study is
restricted to being between 5 and 10, you will always
get a value between 5 and 10 this way.
• Another is the random component, which adds in
some variability. This is important for accurate
standard errors.
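A minimal sketch of the hot deck idea: for a record with a missing value, randomly borrow the value from a "similar" donor record. Here similarity is just matching on one other variable; the data and the similarity rule are invented for illustration.

import random

records = [
    {"group": "A", "age": 7},
    {"group": "A", "age": 9},
    {"group": "B", "age": 6},
    {"group": "A", "age": None},   # the missing value to impute
]

target = records[3]
donors = [r for r in records
          if r["age"] is not None and r["group"] == target["group"]]
target["age"] = random.choice(donors)["age"]   # randomly chosen donor value
print(target)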
Cold deck imputation
• A systematically chosen value from an
individual who has similar values on other
variables.
• This is similar to Hot Deck in most ways, but
removes the random variation. So for
example, you may always choose the third
individual in the same experimental condition
and block.
Regression imputation
• The predicted value obtained by regressing
the missing variable on other variables.
• So instead of just taking the mean, you’re
taking the predicted value, based on other
variables. This preserves relationships among
variables involved in the imputation model,
but not variability around predicted values.
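A minimal sketch of regression imputation: fit a simple regression of the variable with missing values on another variable, then use the predicted values where data are missing. NumPy's polyfit stands in for a full regression model; the data is invented.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # fully observed predictor
y = np.array([2.1, np.nan, 6.2, np.nan, 9.9])  # variable with missing values

observed = ~np.isnan(y)
slope, intercept = np.polyfit(x[observed], y[observed], 1)  # fit on observed pairs

# Replace each missing value with its predicted value from the regression
y_imputed = np.where(np.isnan(y), slope * x + intercept, y)
print(y_imputed)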
Stochastic regression imputation
• The predicted value from a regression plus a
random residual value.
• This has all the advantages of regression
imputation but adds in the advantages of the
random component.
• Most multiple imputation is based on some
form of stochastic regression imputation.
Interpolation and extrapolation
• An estimated value from other observations from
the same individual. It usually only works in
longitudinal data.
• Use caution, though. Interpolation, for example,
might make more sense for a variable like height in
children, one that can't go back down over time.
Extrapolation means you're estimating beyond the
actual range of the data, and that requires making
more assumptions than you should.
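For longitudinal data, interpolation can be done directly on an ordered series. The sketch below uses pandas' interpolate() on an invented height series; extrapolation beyond the observed range is deliberately not shown.

import pandas as pd

# Height of one child measured over time (invented values, one missing)
height = pd.Series([110.0, None, 116.0, 119.0])

# Linear interpolation fills the gap using the neighbouring observations
print(height.interpolate().tolist())   # [110.0, 113.0, 116.0, 119.0]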
Single or Multiple Imputation?
• There are two types of imputation–single or multiple.
Usually when people talk about imputation, they
mean single.
• Single refers to the fact that you come up with a
single estimate of the missing value, using one of the
seven methods listed above.
• It’s popular because it is conceptually simple and
because the resulting sample has the same number
of observations as the full data set.
• Single imputation looks very tempting when listwise
deletion eliminates a large portion of the data set.
But it has limitations.
• Some imputation methods result in biased parameter
estimates, such as means, correlations, and regression
coefficients, unless the data are Missing Completely at
Random (MCAR). The bias is often worse than with
listwise deletion, the default in most software.
• The extent of the bias depends on many factors,
including the imputation method, the missing data
mechanism, the proportion of the data that is missing,
and the information available in the data set.
• Moreover, all single imputation methods
underestimate standard errors.
• Since the imputed observations are themselves
estimates, their values have corresponding random
error. But when you put in that estimate as a data
point, your software doesn’t know that. So it
overlooks the extra source of error, resulting in too-
small standard errors and too-small p-values.
• And although imputation is conceptually simple, it
is difficult to do well in practice. So it’s not ideal but
might suffice in certain situations.
• So multiple imputation comes up with multiple
estimates. Two of the methods listed above work
as the imputation method in multiple imputation–
hot deck and stochastic regression.
• Because these two methods have a random
component, the multiple estimates are slightly
different. This re-introduces some variation that your
software can incorporate in order to give your model
accurate estimates of standard error.
• Multiple imputation was a huge breakthrough in
statistics about 20 years ago. It solves a lot of
problems with missing data (though, unfortunately
not all) and if done well, leads to unbiased parameter
estimates and accurate standard errors.
Need for Business Modeling
5 Ways Data Analytics is Transforming Business
Models
1. Strategic Analytics
2. Platform Analytics
3. Enterprise Information Management (EIM)
4. Business Model Transformation
5. Making Data-centric Business
1. Strategic Analytics
Strategic analytics is detailed, data-driven analysis of your entire
system to help you determine what’s driving customer and market
behavior.
The key to strategic analytics is doing it in the right order:
Step 1 — Competitive Advantage Analytics to identify your
capability strengths and weaknesses
Step 2 — Enterprise Analytics to get diagnostics at the enterprise,
business unit and business process levels
Step 3 — Human Capital Analytics for diagnostics at the individual
level to get actionable insights
The data should answer critical questions like:
• What are the key decisions that drive the most value for us?
• What new data is available that hasn’t been mined yet?
• What new analytics techniques haven’t been fully explored?
2. Platform Analytics
• This helps you fuse analytics into your decision-making to
improve core operations. It can help your company harness the
power of data to identify new opportunities.
The important questions to ask include:
• How can we integrate analytics into everyday processes?
• Which processes will benefit from automatic, repeatable, real-
time analysis?
• Could our back-end system benefit from big data analytics?
Platform analytics must include more than a stack of technologies.
As it’s available via many formats and channels, it can be used to
check the pulse of your organization.
It will help you incorporate data analysis into key decisions across all
departments, including sales, marketing, the supply chain, customer
service, customer experience, and other core business functions.
3. Enterprise Information Management (EIM)
• Almost 80% of vital business information is stored in
unmanaged repositories. With strategic and platform
analytics already in place, EIM helps you take advantage of
social, mobile, analytics and cloud technologies (SMAC) to
improve the way data is managed and used across the
company.
• By building agile data management operations with tools for
information creation, capture, distribution and consumption,
EIM will help you:
– Streamline your business practices
– Enhance collaboration efforts
– Boost employee productivity in and out of the office
• When defining your EIM strategy, identify the business
requirements, key issues and opportunities for initiating EIM.
Also, identify potential programs and projects whose success
depends on EIM.
4. Business Model Transformation
• Companies that embrace big data analytics and transform their
business models in parallel will create new opportunities for revenue
streams, customers, products and services.
• From forecasting demand and sourcing materials to accounting and
the recruitment and training of staff, every aspect of your business
can be reinvented.
Needed changes include:
• Having a big data strategy and vision that identifies and capitalizes on
new opportunities
• Fostering a culture of innovation and experimentation with data
• Understanding how to leverage new skills and technologies, and
managing the impact they have on how information is accessed and
safeguarded
• Building trust with consumers who hold vital data
• Creating partnerships both within and outside your core industry
5. Making Data-centric Business
• Do you generate a large volume of data? Could that data benefit
other organizations, both inside and outside your industry?
• In a data-centric business, data isn't just an asset, it's currency.
It's the source of your core competitiveness, and it's worth its
weight in gold.
There are three main categories of data analytics:
• Insight: Includes mining, cleansing, clustering and segmenting
data to understand customers and their networks, influence and
product insights
• Optimization: Analyzing business functions, processes and
models
• Innovation: Exploring new, disruptive business models to further
the evolution and growth of your customer base.
