Module I (Introduction: Data Analytics Life Cycle), Part II


Data Analytics Lifecycle



• Big Data analysis differs from traditional data analysis primarily due to the volume, velocity, and variety characteristics of the data being processed.

• To address the distinct requirements for performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved with acquiring, processing, analyzing, and repurposing data (adapting it for use for a different purpose).
Data Analytics Lifecycle (cont..)
From a Big Data adoption and planning perspective, it is important that in
addition to the lifecycle, consideration be made for issues of training,
education, tooling and staffing of a data analytics team.
Key Roles for a Successful Analytics Project
• Business User – understands the domain area
• Project Sponsor – provides requirements
• Project Manager – ensures meeting objectives
• Business Intelligence Analyst – provides business domain
expertise based on deep understanding of the data
• Database Administrator (DBA) – creates DB environment
• Data Engineer – provides technical skills, assists data
management and extraction, supports analytic sandbox
• Data Scientist – provides analytic techniques and modeling
Data Analytics Lifecycle (cont..)

• The Data Analytics Lifecycle is designed for Big Data problems and data science projects.
• The cycle is iterative to represent a real project.
• Work can return to earlier phases as new information is uncovered.
Data Analytics Lifecycle-Abstract view
Phase 1: Discovery

• The data science team must learn and investigate the problem,
• develop context and understanding, and
• learn about the data sources needed and available for the project.
• In addition, the team formulates initial hypotheses that can later be tested with data.
Phase 1: Discovery (cont.)
• The team should perform five main activities during this step of the discovery
phase:
• Identify data sources: Make a list of data sources the team may need to test
the initial hypotheses outlined in this phase.
– Make an inventory of the datasets currently available and those that
can be purchased or otherwise acquired for the tests the team wants
to perform.
• Capture aggregate data sources: This is for previewing the data and providing
high-level understanding.
– It enables the team to gain a quick overview of the data and
perform further exploration on specific areas.
• Review the raw data: Begin understanding the interdependencies among the
data attributes.
– Become familiar with the content of the data, its quality, and its
limitations.
Phase 1: Discovery (cont.)
• Evaluate the data structures and tools needed: The data type and structure dictate which tools the team can use to analyze the data.

• Scope the sort of data infrastructure needed for this type of problem: In addition to the tools needed, the data influences the kind of infrastructure that is required, such as disk storage and network capacity.

• Unlike many traditional stage-gate processes, in which the team can advance only when specific criteria are met, the Data Analytics Lifecycle is intended to accommodate more ambiguity.

• For each phase of the process, it is recommended to pass certain checkpoints as a way of gauging whether the team is ready to move to the next phase of the Data Analytics Lifecycle.
Phase 2: Data preparation

• This phase includes steps to explore, preprocess, and condition data prior to modeling and analysis.
Phase 2: Data preparation (cont.)
• It requires the presence of an analytic sandbox (workspace), in which the team can work with data and perform analytics for the duration of the project.

– The team needs to execute Extract, Load, and Transform (ELT) or Extract, Transform, and Load (ETL) to get data into the sandbox.

– In ETL, users perform processes to extract data from a datastore, perform data transformations, and load the data back into the datastore.

– ELT and ETL are sometimes abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it.
Phase 2: Data preparation
Rules for Analytics Sandbox

• When developing the analytic sandbox, collect all kinds of data there, as team members need access to high volumes and varieties of data for a Big Data analytics project.

• This can include everything from summary-level aggregated data and structured data to raw data feeds and unstructured text data from call logs or web logs, depending on the kind of analysis the team plans to undertake.

• A good rule is to plan for the sandbox to be at least 5 to 10 times the size of the original datasets, partly because copies of the data may be created to serve as specific tables or data stores for specific kinds of analysis in the project.
Phase 2: Data preparation
Performing ETLT
• As part of the ETLT step, it is advisable to make an inventory of the data and compare the data currently available with the datasets the team needs.

• Performing this sort of gap analysis provides a framework for understanding which datasets the team can take advantage of today and where the team needs to initiate projects for data collection or access to new datasets that are currently unavailable.

• A component of this sub-phase involves extracting data from the available sources and determining data connections for raw data, online transaction processing (OLTP) databases, online analytical processing (OLAP), or other data feeds.

• Data conditioning refers to the process of cleaning data, normalizing datasets, and performing transformations on the data.
Common Tools for the Data Preparation Phase

Several tools are commonly used for this phase:

• Hadoop can perform massively parallel ingest and custom analysis for web traffic analysis, GPS location analytics, and combining massive unstructured data feeds from multiple sources.

• Alpine Miner provides a graphical user interface (GUI) for creating analytic workflows, including data manipulations and a series of analytic events such as staged data-mining techniques (for example, first select the top 100 customers, and then run descriptive statistics and clustering).

• OpenRefine (formerly called Google Refine) is "a free, open source, powerful tool for working with messy data." It is a GUI-based tool for performing data transformations, and it is one of the most robust free tools currently available.

• Similar to OpenRefine, Data Wrangler is an interactive tool for data cleaning and transformation. Wrangler was developed at Stanford University and can be used to perform many transformations on a given dataset.
Phase 3: Model Planning
• Phase 3 is model planning, where the team determines the
methods, techniques, and workflow it intends to follow for
the subsequent model building phase.

– The team explores the data to learn about the


relationships between variables and subsequently
selects key variables and the most suitable models.
– During this phase that the team refers to the
hypotheses developed in Phase 1, when they first
became acquainted with the data and understanding
the business problems or domain area.
Common Tools for the Model Planning Phase
Here are several of the more common ones:

• R has a complete set of modeling capabilities and provides a good environment for building interpretive models with high-quality code. In addition, it has the ability to interface with databases via an ODBC connection and execute statistical tests.
• SQL Analysis Services can perform in-database analytics of common data mining functions, involving aggregations and basic predictive models.
• SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB. SAS itself is generally used on file extracts, but with SAS/ACCESS, users can connect to relational databases (such as Oracle or Teradata).
Phase 4: Model Building
• In this phase, the data science team develops datasets for training, testing, and production purposes. These datasets enable the data scientist to develop the analytical model and train it ("training data"), while holding aside some of the data ("hold-out data" or "test data") for testing the model.
– In addition, in this phase the team builds and executes models based on the work done in the model planning phase.
– The team also considers whether its existing tools will be sufficient for running the models, or whether it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).
• Free or open source tools: R and PL/R, Octave, WEKA, Python
• Commercial tools: MATLAB, STATISTICA
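As a minimal illustration of the training/hold-out split described above, the sketch below assumes scikit-learn and uses synthetic data; the model choice and split ratio are arbitrary.

```python
# Sketch: split data into training and hold-out portions, train, then evaluate.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))                # 500 records, 4 features (synthetic)
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy target variable

# Hold out 30% of the data as test ("hold-out") data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression().fit(X_train, y_train)   # train on the training data
print("hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```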
Phase 5: Communicate Results
• In Phase 5, after executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
• The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account warnings, assumptions, and any limitations of the results.
• The team should identify key findings, quantify the business value,
and develop a narrative to summarize and convey findings to
stakeholders.
Phase 6: Operationalize
• In the final phase 6, Operationalize, the team communicates the
benefits of the project more broadly and sets up a pilot project to
deploy the work in a controlled way before broadening the work to
a full enterprise or ecosystem of users.

• This approach enables the team to learn about the performance and
related constraints of the model in a production environment on a small
scale and make adjustments before a full deployment.

– The team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
Key outputs for each of the main stakeholders
• The following are the key outputs for each of the main stakeholders of an analytics project and what they usually expect at the conclusion of a project.
• Business User typically tries to determine the benefits and implications of the
findings to the business.
• Project Sponsor typically asks questions related to the business impact of the
project, the risks and return on investment (ROI), and the way the project can be
evangelized within the organization (and beyond).
• Project Manager needs to determine if the project was completed on time and
within budget and how well the goals were met.
• Business Intelligence Analyst needs to know if the reports and dashboards he
manages will be impacted and need to change.
• Data Engineer and Database Administrator (DBA) typically need to share their code
from the analytics project and create a technical document on how to implement it.
• Data Scientist needs to share the code and explain the model to her peers,
managers, and other stakeholders.
Data Analytics Lifecycle
Overview
The Big Data analytics lifecycle can be divided into the following nine stages:
1. Business Case Evaluation
2. Data Identification
3. Data Acquisition & Filtering
4. Data Extraction
5. Data Validation & Cleansing
6. Data Aggregation & Representation
7. Data Analysis
8. Data Visualization
9. Utilization of Analysis Results
1. Business Case Evaluation
• Before any Big Data project can be started, it needs to be
clear what the business objectives and results of the data
analysis should be.
• This initial phase focuses on understanding the project
objectives and requirements from a business perspective,
and then converting this knowledge into a data mining
problem definition.
• A preliminary plan is designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation (DMN) standard, can be used.
• Once an overall business problem is defined, the problem is
converted into an analytical problem.
2. Data Identification
• The Data Identification stage determines the origin of data.
Before data can be analysed, it is important to know what
the sources of the data will be.
• Especially if data is procured from external suppliers, it is
necessary to clearly identify what the original source of the
data is and how reliable (frequently referred to as the
veracity of the data) the dataset is.
• The second stage of the Big Data Lifecycle is very important,
because if the input data is unreliable, the output data will
also definitely be unreliable.
• Identifying a wider variety of data sources may increase
the probability of finding hidden patterns and correlations.
3. Data Acquisition and Filtering
• The Data Acquisition and Filtering Phase builds upon the previous
stage of the Big Data Lifecycle.
• In this stage, the data is gathered from different sources, both from
within the company and outside of the company.
• After the acquisition, a first step of filtering is conducted to filter out
corrupt data.
• Additionally, data that is not necessary for the analysis will be filtered
out as well.
• The filtering step is applied to each data source individually, before the data is aggregated into the data warehouse.
• In many cases, especially where external, unstructured data is
concerned, some or most of the acquired data may be irrelevant
(noise) and can be discarded as part of the filtering process.
3. Data Acquisition and Filtering (cont.)
• Data classified as "corrupt" can include records with missing or nonsensical values or invalid data types. Data that is filtered out for one analysis may still be valuable for a different type of analysis.

• Metadata can be added via automation to data from both internal and external data sources to improve classification and querying.

 Examples of appended metadata include dataset size and structure, source information, date and time of creation or collection, and language-specific information.
4. Data Extraction
 Some of the data identified in the two previous stages may be
incompatible with the Big Data tool that will perform the actual
analysis.
 In order to deal with this problem, the Data Extraction stage is
dedicated to extracting different data formats from data sets
(e.g. the data source) and transforming these into a format the
Big Data tool is able to process and analyse.
 The complexity of the transformation, and the extent to which it is necessary to transform the data, depend greatly on the Big Data tool that has been selected.
 The Data Extraction lifecycle stage is dedicated to extracting
disparate data and transforming it into a format that the
underlying Big Data solution can use for the purpose of the data
analysis.
4. Data Extraction (cont.)
(A) Illustrates the extraction of comments and a user ID embedded within an XML document, without the need for further transformation.
(B) Demonstrates the extraction of the latitude and longitude coordinates of a user from a single JSON field.
5. Data Validation and Cleansing
 Data that is invalid leads to invalid results. In order to ensure
only the appropriate data is analysed, the Data Validation and
Cleansing stage of the Big Data Lifecycle is required.
 During this stage, data is validated against a set of
predetermined conditions and rules in order to ensure the data
is not corrupt.
 An example of a validation rule would be to exclude all persons
that are older than 100 years old, since it is very unlikely that
data about these persons would be correct due to physical
constraints.
 The Data Validation and Cleansing stage is dedicated to
establishing often complex validation rules and removing any
known invalid data.
5. Data Validation and Cleansing (cont.)
 For example, as illustrated in Fig. 1, the first value in Dataset B is
validated against its corresponding value in Dataset A.
 The second value in Dataset B is not validated against its
corresponding value in Dataset A. If a value is missing, it is inserted
from Dataset A.

 Data validation can be used to examine interconnected datasets in order to fill in missing valid data (see the sketch below).
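A small sketch of such a validation rule and cross-dataset fill, assuming pandas; Dataset A and Dataset B here are toy stand-ins for the datasets in the figure.

```python
# Validate against a rule (age <= 100) and fill a missing value in B from A.
import numpy as np
import pandas as pd

dataset_a = pd.DataFrame({"id": [1, 2, 3], "age": [34, 101, 58]})
dataset_b = pd.DataFrame({"id": [1, 2, 3], "age": [34, 101, np.nan]})

# Validation rule: exclude persons older than 100 years
dataset_a = dataset_a[dataset_a["age"] <= 100]
dataset_b = dataset_b[(dataset_b["age"] <= 100) | dataset_b["age"].isna()].copy()

# Cleansing: fill the missing value in Dataset B from the corresponding record in Dataset A
merged = dataset_b.merge(dataset_a, on="id", how="left", suffixes=("", "_a"))
dataset_b["age"] = merged["age"].fillna(merged["age_a"]).to_numpy()
print(dataset_b)
```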
Data Preprocessing?
• Data in the real world is dirty:
– Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– Incorrect/erroneous: the collection instrument may be faulty, or a mandatory personal-information field may contain wrong data
– Noisy: containing errors or outliers
– Inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
Major Tasks in Data Preprocessing
1. Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies (redundancy)

2. Data integration
– Integration of multiple databases, data cubes, files, or notes

3. Data reduction
– Obtains reduced representation in volume but produces the same or similar
analytical results
– Data aggregation, dimensionality reduction, data compression,
generalization

4. Data transformation
– Normalization (scaling to a specific range)
– Aggregation
Forms of data preprocessing
1. Data Cleaning

• Data Cleaning Tasks
1.1 Fill in missing values
1.2 Identify outliers and smooth out noisy data
1.3 Correct inconsistent data
1.1 Missing Data
• Data may be unavailable or missing because:
 Many tuples have no recorded value for several attributes, such as customer income in sales data
 Information is not collected (e.g., people decline to give their age and weight)
 Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
 Equipment malfunctioned
 Data was inconsistent with other recorded data and was thus deleted
 Data was not entered due to misunderstanding
 Certain data was not considered important at the time of entry
 The history or changes of the data were not registered
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (assuming
the task is classification—not effective in certain cases)
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a
new class?! simple but not foolproof.
• Use the central tendency (mean/median) to fill in the missing value.
Normal->Mean, Skewed->Median
• Use the most probable value to fill in the missing value: inference-based methods such as regression, the Bayesian formula, or decision trees
Note: filled-in values can bias the data.
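A minimal sketch of these strategies using pandas; the dataframe and column names are hypothetical.

```python
# Handling missing values: drop tuples without a class label, fill numeric gaps.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, np.nan, 62_000, 58_000, np.nan],   # skewed attribute
    "age":    [34, 29, np.nan, 41, 37],                   # roughly normal attribute
    "label":  ["yes", "no", "yes", np.nan, "no"],         # class label
})

df_labelled = df.dropna(subset=["label"])                  # ignore tuples missing the class label
df["income"] = df["income"].fillna(df["income"].median())  # skewed -> fill with the median
df["age"]    = df["age"].fillna(df["age"].mean())          # normal -> fill with the mean
df["label"]  = df["label"].fillna("unknown")               # global constant as a fallback
```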
Types of Missing Values
Why is missing data a problem?
Answer: It creates bias in the data, because we do not know whether the data is missing randomly, was simply missed out, or was omitted intentionally.
*Biased data reduces predictive power and trustworthiness.

• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)
• Structurally Missing Data (SMD)
Missing Completely at Random (MCAR) (Types of Missing Values…)
Assumption: If a person has missing data, then it is completely unrelated to the other information in the data. The missingness of the variable is completely unsystematic.
– Missingness of a value is independent of the attributes
– Fill in values based on the attribute
– The analysis may be unbiased overall
– You cannot predict the missing value from the remaining known variables; the fact that it is missing is independent of the remaining variables.

Example: when we take a random sample of a population, where each member has the same chance of being included in the sample.

When data is missing completely at random, we can undertake analyses using only the observations that have complete data (provided we have enough of such observations).
Missing at Random (MAR) (Types of Missing Values…)
– Missingness is related to other observed variables
– Fill in values based on other values
– Almost always produces some bias in the analysis
An example of MAR is when we take a sample from a population, where the probability of being included depends on some known property.
1. A simple predictive model is that income can be predicted based on gender and age. Looking at the table, we note that our missing value is for a female aged 30 or more, and the observations say the other females aged 30 or more have a High income. As a result, we can predict that the missing value should be High.
2. For example, imagine a sensor that misses a particular minute's measurement but captures that data the minute before and the minute following. The missing value can roughly be interpolated from the remaining values to a reasonable degree of accuracy (see the sketch below).

There is a systematic relationship between the propensity of missing values and the observed data. All that is required is a probabilistic relationship.
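A small sketch of the sensor example, assuming pandas; the timestamps and readings are illustrative.

```python
# Interpolating a missing sensor reading from its neighbouring minutes.
import numpy as np
import pandas as pd

readings = pd.Series(
    [21.4, 21.6, np.nan, 22.0],
    index=pd.date_range("2024-01-01 10:00", periods=4, freq="min"),
)
filled = readings.interpolate(method="time")  # estimate the missing minute from its neighbours
print(filled)
```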
Missing not at Random (MNAR) - Nonignorable
Types of Missing Values…
– Missingness is related to unobserved measurements and they are
not random
– The missing values are related to the values of that variable itself,
even after controlling for other variables.
MNAR means that the probability of being missing varies for reasons that are unknown to us.
Example 1: when smoking status is not recorded for patients admitted as an emergency, the omission is intentional rather than random, and such patients are more likely to have worse outcomes from surgery.
Example 2: perhaps people in a certain age/income bracket refuse to answer how many vehicles or houses they own.

Strategies to handle MNAR are to find more data about the causes of the missingness, or to perform what-if analyses to see how sensitive the results are under various scenarios.
Structurally Missing Data
A survey that asks for income from employment would have missing values
for those who do not have a job.

1.2 Identify outliers and smooth out noisy data

• Noise
Random error or variance in a measured variable.
Or simply meaningless data that can’t be interpreted by machines.
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in the naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
1.2.1 Simple Discretization Method: Binning
• Data smoothing refers to a statistical approach for eliminating noise and outliers from datasets to make the patterns more noticeable.
• Binning (or bucketing) is used to smooth the data. It smooths a sorted data value by consulting its "neighborhood," i.e., the values around it.
• The sorted values are distributed into a number of "buckets," or bins, of equal width or equal frequency.
• Bin size: the number of bins or buckets, often taken as the square root of the number of data points.
• Bin width/depth: the number of objects/elements in a single bin.

• There are two methods of dividing data into bins:
– Equal Width Binning
– Equal Frequency Binning
1.2.1 Binning Methods
For the sorted data set: 0, 5, 14, 15, 17, 18, 22, 25, 27
Number of bins = 3 (9 data points, 3 × 3 = 9)

Equal Width Binning
• Bin ranges are [min, min + w], (min + w, min + 2w], …, where w = (max − min) / (number of bins) = (27 − 0) / 3 = 9
• Bin 1 (0 to 9): 0, 5
• Bin 2 (9+ to 18): 14, 15, 17, 18
• Bin 3 (18+ to 27): 22, 25, 27

Equal Frequency Binning
• Each bin gets an equal number of elements (equal depth), i.e., 9 / 3 = 3 per bin
• Bin 1: 0, 5, 14
• Bin 2: 15, 17, 18
• Bin 3: 22, 25, 27
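The worked example above can be reproduced with a short plain-Python sketch; this is an illustration, not a prescribed tool.

```python
# Equal-width and equal-frequency binning of the example data.
data = [0, 5, 14, 15, 17, 18, 22, 25, 27]
n_bins = 3

# Equal-width binning: w = (max - min) / no. of bins = (27 - 0) / 3 = 9
w = (max(data) - min(data)) / n_bins
edges = [min(data) + w * (i + 1) for i in range(n_bins)]        # [9, 18, 27]
width_bins = [[] for _ in range(n_bins)]
for x in data:
    idx = next(i for i, edge in enumerate(edges) if x <= edge)  # right-closed bins
    width_bins[idx].append(x)
print(width_bins)   # [[0, 5], [14, 15, 17, 18], [22, 25, 27]]

# Equal-frequency binning: 9 values / 3 bins = 3 values per bin
depth = len(data) // n_bins
freq_bins = [data[i:i + depth] for i in range(0, len(data), depth)]
print(freq_bins)    # [[0, 5, 14], [15, 17, 18], [22, 25, 27]]
```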
Types of Smoothing in Equal Frequency Bins and Equal Width Bins

• Smoothing by Mean
• Smoothing by Median
• Smoothing by Boundaries
Smoothing the data by Equal Frequency Bins contd..
1. Smoothing by BIN MEANS: Find the mean values of
each bin and Replace all with mean values
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
2. Smoothing by BIN MEDIANS: Find the median
values of each bin and Replace all with the median
- Bin 1: 8.5, 8.5, 8.5, 8.5
- Bin 2: 22.5, 22.5, 22.5, 22.5
- Bin 3: 28.5, 28.5, 28.5, 28.5

3. Smoothing by BIN BOUNDARIES: the minimum and maximum values of each bin become the bin boundaries, and every other element is replaced by the closest boundary value
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Smoothing the data by Equal Width Bins contd..
1. Smoothing by BIN MEANS: find the mean value of each bin and replace every value with that mean
- Bin 1: 7, 7, 7
- Bin 2: 20.25, 20.25, 20.25, 20.25
- Bin 3: 28.4, 28.4, 28.4, 28.4, 28.4
2. Smoothing by BIN MEDIANS: find the median value of each bin and replace every value with that median
- Bin 1: 8, 8, 8
- Bin 2: 21, 21, 21, 21
- Bin 3: 28, 28, 28, 28, 28
3. Smoothing by BIN BOUNDARIES: the minimum and maximum values of each bin become the bin boundaries, and every other element is replaced by the closest boundary value
- Bin 1: 4, 9, 9
- Bin 2: 15, 24, 24, 24 (14+ to 24)
- Bin 3: 25, 25, 25, 25, 34 (24+ to 34)
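For illustration, the sketch below applies the three smoothing rules to the equal-frequency bins from the earlier worked example (0, 5, 14 | 15, 17, 18 | 22, 25, 27); it is plain Python mirroring the rules described above.

```python
# Smoothing by bin means, medians, and boundaries.
bins = [[0, 5, 14], [15, 17, 18], [22, 25, 27]]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

def smooth_by_medians(bins):
    return [[sorted(b)[len(b) // 2]] * len(b) for b in bins]   # odd-length bins

def smooth_by_boundaries(bins):
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if x - lo <= hi - x else hi for x in b])  # closest boundary wins
    return out

print(smooth_by_means(bins))       # [[6.33, 6.33, 6.33], [16.67, ...], [24.67, ...]]
print(smooth_by_medians(bins))     # [[5, 5, 5], [17, 17, 17], [25, 25, 25]]
print(smooth_by_boundaries(bins))  # [[0, 0, 14], [15, 18, 18], [22, 27, 27]]
```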
1.2.2 Cluster Analysis
Clustering is the task of dividing the data points into a number of groups (without prior knowledge of class labels) such that data points in the same group are more similar to each other than to data points in other groups.
1.2.3 Regression
Regression is a method for determining the relationship between a dependent variable and one or more independent variables. It smooths by fitting the data to regression functions.

• Linear regression (finds the best line to fit two variables)
• Multiple linear regression (more than two variables; the data are fit to a multidimensional surface)
(Figure: a regression line y = x + 1 fitted through the data points, with the value Y1' predicted for input X1.)
2. Data Integration and Transformation
Data Integration
2. Handling Redundant Data in Data Integration

r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}

where \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are their standard deviations, n is the number of tuples, and Cov(A, B) is the covariance of A and B. (With population standard deviations the denominator uses n in place of n − 1; both conventions give the same value of r.)
2. Handling Redundant Data in Data Integration contd..
Correlation Analysis (Numeric Data) contd...

• Note that −1 ≤ rA,B ≤ +1. If rA,B is greater than 0, then A and B are
positively correlated, meaning that the values of A increase as the
values of B increase.
• The higher the value, the stronger the correlation (i.e., the more
each attribute implies the other). Hence, a higher value may
indicate that A (or B) may be removed as a redundancy.
• If the resulting value is equal to 0, then A and B are independent
and there is no correlation between them.
• If the resulting value is less than 0, then A and B are negatively
correlated, where the values of one attribute increase as the
values of the other attribute decrease.
• Scatter plots can also be used to view correlations between
attributes.
2. Handling Redundant Data in Data Integration contd..
Correlation Analysis (Numeric Data)
Example:
Consider the following stock prices (per time period):
AllElectronics (A): 6, 5, 4, 3, 2
HighTech (B): 20, 10, 14, 5, 5

Find out whether the AllElectronics and HighTech prices are correlated or not.
2. Handling Redundant Data in Data Integration contd..
Correlation Analysis (Numeric Data)

Step 1: Calculate the mean of the attributes AllElectronics (A) and HighTech (B).
Mean(A) = (6 + 5 + 4 + 3 + 2) / 5 = 4
Mean(B) = (20 + 10 + 14 + 5 + 5) / 5 = 10.8
Step 2: Calculate the standard deviations of attributes A and B, i.e., 1.414 and 5.706 respectively (population standard deviations).
Step 3: Calculate the correlation value using the given equation: r ≈ 0.87.
Step 4: The calculated value is greater than 0, hence the two attributes are positively correlated.
Correlation coefficient

a. Positive Correlation:
Correlation in the same direction is called positive correlation: if one variable increases, the other also increases, and if one variable decreases, the other also decreases. For example, the length of an iron bar will increase as the temperature increases.

b. Negative Correlation:
Correlation in the opposite direction is called negative correlation: as one variable increases, the other decreases, and vice versa. For example, the volume of a gas will decrease as the pressure increases, or the demand for a particular commodity increases as its price decreases.
2. Handling Redundant Data in Data Integration contd..
Correlation coefficient contd..

c. No Correlation or Zero Correlation:
If there is no relationship between the two variables, such that the value of one variable changes while the other variable remains constant, it is called no or zero correlation.
2. Data Integration and Transformation

Strategies for data transformation include:

• Smoothing: remove noise from the data (binning, clustering, regression)
• Aggregation: a summarization process is applied, e.g., data cube construction
• Generalization/concept hierarchy climbing: attributes can be generalized to higher-level concepts
• Normalization: attribute values are scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction: new attributes are constructed from the given ones to help the mining process
• Data aggregation is any process in which data is brought together and
conveyed in a summary form. It is typically used prior to the performance
of a statistical analysis.
• Combining two or more attributes (or objects) into a single attribute (or
object).
• Data aggregation generally works on the big data or data marts that do
not provide enough information value as a whole.
Aggregation with mathematical functions:
• Sum -Adds together all the specified data to get a total.
• Average -Computes the average value of the specific data.
• Max -Displays the highest value for each category.
• Min -Displays the lowest value for each category.
• Count -Counts the total number of data entries for each category.
Data can also be aggregated by date, allowing trends to be shown over a
period of years, quarters, months, etc.
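A small pandas sketch of these aggregation functions; the sales data and column names are hypothetical.

```python
# Aggregation with mathematical functions and by date.
import pandas as pd

sales = pd.DataFrame({
    "date":     pd.to_datetime(["2022-01-15", "2022-04-20", "2023-07-05", "2023-10-30"]),
    "category": ["A", "A", "B", "B"],
    "amount":   [100, 150, 200, 250],
})

# Sum, average, max, min and count of the amount for each category
print(sales.groupby("category")["amount"].agg(["sum", "mean", "max", "min", "count"]))

# Aggregation by date: total sales per year, showing a trend over time
print(sales.groupby(sales["date"].dt.year)["amount"].sum())
```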
Data normalization makes data easier to classify and understand. It is used
to scale the data of an attribute so that it falls in a smaller range
Need for Normalization
• Normalization is generally required when there are multiple attributes whose values are on different scales; otherwise this may lead to poor models when performing data mining operations.
• Without it, an equally important attribute (on a lower scale) may be diluted in effectiveness because another attribute has values on a larger scale.
• Heterogeneous data with different units usually needs to be normalized; if the data share the same unit and the same order of magnitude, normalization might not be necessary.
• Unless normalized during pre-processing, variables with disparate ranges or varying precision acquire different driving values.
Example: (charts comparing the raw data with the normalized data)

Methods of Data Normalization:
a. Decimal Scaling
b. Min-Max Normalization
c. z-Score Normalization (zero-mean normalization)
Example (min-max normalization):
Input: 10, 15, 50, 60, normalized to the range 0 to 1.
Here min = 10, max = 60, new_min = 0, new_max = 1.
Output: 0, 0.1, 0.8, 1
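The same min-max example in a few lines of Python:

```python
# Min-max normalization of the example input to the range [0, 1].
values = [10, 15, 50, 60]
lo, hi = min(values), max(values)
new_min, new_max = 0.0, 1.0
normalized = [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]
print(normalized)   # [0.0, 0.1, 0.8, 1.0]
```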
Example (z-score normalization):
Input: 10, 15, 50, 60

mean: \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = 33.75

standard deviation: SD = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n-1}} \approx 24.96

Each value is normalized as z_i = (x_i - \bar{x}) / SD.
Output: −0.9515, −0.7512, 0.6510, 1.0517
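And the z-score example, using the sample standard deviation as above:

```python
# z-score (zero-mean) normalization of the same input.
import statistics

values = [10, 15, 50, 60]
mean = statistics.mean(values)   # 33.75
sd = statistics.stdev(values)    # ~24.96 (divides by n - 1)
z = [round((v - mean) / sd, 2) for v in values]
print(z)   # [-0.95, -0.75, 0.65, 1.05]
```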
3. Data Reduction

• Problem:
Data Warehouse may store terabytes of data:
Complex data analysis/mining may take a very
long time to run on the complete data set

• Solution?
– Data reduction…
3. Data Reduction contd...
A. Data Cube Aggregation
It is a process in which information is gathered and expressed in
a summary form

For example, suppose we have a company's sales per quarter for the years 2018 to 2022. If the problem is to get the annual sales per year, the sales per quarter are aggregated for each year. In this way, aggregation provides the required data, which is much smaller in size, and we thereby achieve data reduction without losing any relevant information.
A. Data Cube Aggregation contd..
• Data cubes store multidimensional aggregated information.
• Data cubes provide fast access to pre-computed, summarized data,
thereby benefiting online analytical processing as well as data mining.
3. Data Reduction contd...
B. Attribute subset Selection
B. Attribute subset Selection contd..
How can we find a 'good' subset of the original attributes?
• For n attributes, there are 2^n possible subsets.
• An exhaustive (brute-force) search for the optimal subset of attributes can be prohibitively expensive, especially as n increases.
• Thus heuristic (greedy) methods that explore a reduced search space are commonly used for attribute subset selection.

Heuristic methods:
i.Step-wise forward selection
ii.Step-wise backward elimination
iii.Combining forward selection and backward elimination
iv.Decision-tree induction
B. Attribute subset Selection contd..
i. Stepwise Forward Selection:
• The procedure starts with an empty set of attributes as the reduced
set.
• First, the best single attribute is picked.
• Next, at each subsequent iteration or step, the best of the remaining original attributes is added to the set (see the sketch below).
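A sketch of greedy stepwise forward selection, assuming scikit-learn and using cross-validated accuracy as the "best attribute" criterion on synthetic data; the stopping rule and model are illustrative choices.

```python
# Greedy forward selection: add the attribute that most improves CV accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0

while remaining:
    scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:        # stop when no remaining attribute improves the score
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("selected attribute indices:", selected)
```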
B. Attribute subset Selection contd..
ii. Stepwise Backward Elimination:
• The procedure starts with the full set of attributes.
• At each step, it removes the worst attribute remaining in the set.

iii. Combining forward selection and backward elimination
• The stepwise forward selection and backward elimination methods can be combined.
• At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
B. Attribute subset Selection contd..
iv. Decision Tree Induction

• Decision tree induction (a classification algorithm) constructs a flowchart-like tree structure from the given data, where each internal node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.

• At each node, the algorithm chooses the "best" attribute to partition the data into individual classes.

• All attributes that do not appear in the tree are assumed to be irrelevant.
B. Attribute subset Selection contd..
iv. Decision Tree Induction contd...
• Nonleaf nodes: tests
• Branches: outcomes of tests
• Leaf nodes: class prediction

Initial attribute set: {A1, A2, A3, A4, A5, A6}
Reduced attribute set: {A1, A4, A6}
3. Data Reduction contd...
C. Dimensionality Reduction
• Dimensionality: the number of features or attributes of the dataset.
• Model: the dataset is input to the training phase; the training phase studies the features and outputs a model that is able to recognize or interpret similar kinds of objects.

The following are different models developed from datasets with different numbers of features but aiming for the same goal:
(Figure: models M1 to M7 trained with 2, 4, 7, 10, 15, 100, and 150 features respectively; model accuracy increases with additional features up to a threshold, after which it decreases.)
C. Dimensionality Reduction contd..
Example:
Suppose a cricket ball is given to the training phase, and it studies the object with 4 features: Shape (sphere), Eatable (no), Play (yes), Color (red).

Object: cricket ball | Shape: sphere | Eatable: no | Play: yes | Color: red

Here the color dimension is an irrelevant dimension and plays the role of the curse of dimensionality for the model identifying the ball.
The threshold dimension for the above model is 3.
C. Dimensionality Reduction contd..
• Dimensionality reduction is a method of converting high-dimensional variables into lower-dimensional variables while preserving the essential information in the variables.
• It represents the original data in a compressed or reduced form by applying data encoding or transformation.

• In the process of compression, the resultant data can be:
Lossless: if the original data can be reconstructed from the compressed data without any loss of information.
Lossy: if we can construct only an approximation of the original data.

Popular methods of lossy dimensionality reduction are:
i. Discrete Wavelet Transform (DWT) (a sparse matrix is created)
ii. Principal Component Analysis (PCA) (combines the essence of the attributes by creating an alternative, smaller set of variables)
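A brief PCA sketch as a lossy reduction, assuming scikit-learn; the data is synthetic and the number of components is arbitrary.

```python
# PCA: compress 10 attributes into 3 derived variables, then reconstruct approximately.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # 200 records with 10 attributes

pca = PCA(n_components=3)                    # smaller set of derived variables
X_reduced = pca.fit_transform(X)             # compressed representation
X_approx = pca.inverse_transform(X_reduced)  # only an approximation of the original (lossy)

print(X_reduced.shape)                       # (200, 3)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```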
6. Data Aggregation and Representation
 Data may be spread across multiple datasets, requiring that the datasets be joined together to conduct the actual analysis.
 In order to ensure only the correct data will be analysed in the next stage, it might be necessary to integrate multiple datasets.
 The Data Aggregation and Representation stage is dedicated to integrating multiple datasets to arrive at a unified view.
 Additionally, data aggregation will greatly speed up the analysis process of the Big Data tool, because the tool will not be required to join different tables from different datasets.
6. Data Aggregation and Representation (cont.)
 Performing this stage can become complicated because of differences in:
• Data Structure – Although the data format may be the same, the data
model may be different.
• Semantics – A value that is labeled differently in two different datasets
may mean the same thing, for example “surname” and “last name.”
 Whether data aggregation is required or not, it is important to understand
that the same data can be stored in many different forms.

A simple example of
data aggregation
where two datasets are
aggregated together
using the Id field.
7. Data Analysis
 The Data Analysis stage of the Big Data Lifecycle stage is dedicated to
carrying out the actual analysis task.
 It runs the code or algorithm that makes the calculations that will lead to
the actual result.
 Data Analysis can be simple or really complex, depending on the required
analysis type.
 In this stage the ‘actual value’ of the Big Data project will be generated. If
all previous stages have been executed carefully, the results will be factual
and correct.
 Depending on the type of analytic result required, this stage can be as
simple as querying a dataset to compute an aggregation for comparison.
 On the other hand, it can be as challenging as combining data mining and
complex statistical analysis techniques to discover patterns and anomalies
or to generate a statistical or mathematical model to depict relationships
between variables.
7. Data Analysis (cont.)
 Data analysis can be classified as confirmatory analysis or exploratory analysis, the latter of which is linked to data mining.

 Confirmatory data analysis is a deductive approach where the cause of the phenomenon being investigated is proposed beforehand. The proposed cause or assumption is called a hypothesis.

 Exploratory data analysis is an inductive approach that is closely associated with data mining. No hypotheses or predetermined assumptions are generated; instead, the data is explored through analysis to develop an understanding of the cause of the phenomenon.
8. Data Visualization
 The ability to analyse massive amounts of data and find useful insights is one thing; communicating the results in a way that everybody can understand is something completely different.
 The Data visualization stage is dedicated to using data visualization
techniques and tools to graphically communicate the analysis results
for effective interpretation by business users. Frequently this requires
plotting data points in charts, graphs or maps.
 The results of completing the Data Visualization stage provide users
with the ability to perform visual analysis, allowing for the discovery
of answers to questions that users have not yet even formulated.
8. Data Visualization (cont.)
9. Utilization of Analysis Results
 After the data analysis has been performed and the results have been presented, the final step of the Big Data Lifecycle is to use the results in practice.
 The Utilization of Analysis Results stage is dedicated to determining how and where the processed data can be further utilized to leverage the results of the Big Data project.
 Depending on the nature of the analysis problems being addressed, it is
possible for the analysis results to produce “models” that encapsulate
new insights and understandings about the nature of the patterns and
relationships that exist within the data that was analyzed.
 A model may look like a mathematical equation or a set of rules. Models
can be used to improve business process logic and application system
logic, and they can form the basis of a new system or software program.

K – Fold Method
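A minimal sketch of k-fold cross-validation (k = 5), assuming scikit-learn; the dataset and model are illustrative.

```python
# k-fold cross-validation: each record is used for testing exactly once across the folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print("fold accuracies:", np.round(scores, 3), "mean:", scores.mean().round(3))
```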

