Module I (Introduction: Data Analytics Life Cycle) Part II
• The data science team must learn and investigate the problem.
• Learn about the data sources needed and available for the project.
• In addition, the team formulates initial hypotheses that can later be tested with data.
Phase 1: Discovery (cont.)
• The team should perform five main activities during this step of the discovery
phase:
• Identify data sources: Make a list of data sources the team may need to test
the initial hypotheses outlined in this phase.
– Make an inventory of the datasets currently available and those that
can be purchased or otherwise acquired for the tests the team wants
to perform.
• Capture aggregate data sources: This is for previewing the data and providing
high-level understanding.
– It enables the team to gain a quick overview of the data and
perform further exploration on specific areas.
• Review the raw data: Begin understanding the interdependencies among the
data attributes.
– Become familiar with the content of the data, its quality, and its
limitations.
Phase 1: Discovery (cont.)
• Evaluate the data structures and tools needed: The data type and
structure dictate which tools the team can use to analyze the data.
– ELT and ETL are sometimes abbreviated as ETLT. Data should be
transformed in the ETLT process so the team can work with it and
analyze it.
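As a rough illustration of the transform step in such an ETLT-style flow, the sketch below uses pandas on a hypothetical raw extract; the column names, parsing rules, and output file are assumptions made for illustration, not part of the methodology above.

```python
import pandas as pd

# Hypothetical raw extract loaded into the sandbox (column names are assumed).
raw = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "signup_date": ["2023-01-05", "2023-02-17", "2023-03-02"],
    "revenue": ["1,200", "850", "2,430"],
})

# Transform so the team can work with the data: parse dates and numeric fields.
clean = raw.assign(
    signup_date=pd.to_datetime(raw["signup_date"]),
    revenue=raw["revenue"].str.replace(",", "").astype(float),
)

# "Load" the transformed data back for analysis (a CSV stands in for the sandbox).
clean.to_csv("sandbox_customers.csv", index=False)
print(clean.dtypes)
```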
Phase 2: Data preparation
Rules for Analytics Sandbox
• When developing the analytic sandbox, collect all kinds of data there,
as team members need access to high volumes and varieties of data
for a Big Data analytics project.
Alpine Miner provides a graphical user interface (GUI) for creating analytic
workflows, including data manipulations and a series of analytic events such
as staged data-mining techniques (for example, first select the top 100
customers, and then run descriptive statistics and clustering).
OpenRefine (formerly called Google Refine) is “a free, open source, powerful tool
for working with messy data.” It is a GUI-based tool for performing data
transformations, and it is one of the most robust free tools currently available.
Similar to OpenRefine, Data Wrangler is an interactive tool for data cleaning and
transformation. Wrangler was developed at Stanford University and can be
used to perform many transformations on a given dataset.
Phase 3: Model Planning
• Phase 3 is model planning, where the team determines the
methods, techniques, and workflow it intends to follow for
the subsequent model building phase.
– The team also considers whether its existing tools will be sufficient for
running the models, or if it will need a more robust environment for
executing models and workflows (for example, fast hardware and
parallel processing, if applicable).
• Free or open source tools: R and PL/R, Octave, WEKA, Python
• Commercial tools: MATLAB, STATISTICA.
Phase 5: Communicate Results
• In Phase 5, after executing the model, the team needs to compare the
outcomes of the modeling to the criteria established for success and
failure.
• The team considers how best to articulate the findings and outcomes
to the various team members and stakeholders, taking into account
warnings, assumptions, and any limitations of the results.
• The team should identify key findings, quantify the business value,
and develop a narrative to summarize and convey findings to
stakeholders.
Phase 6: Operationalize
• In the final phase, Phase 6 (Operationalize), the team communicates the
benefits of the project more broadly and sets up a pilot project to
deploy the work in a controlled way before broadening the work to
a full enterprise or ecosystem of users.
• This approach enables the team to learn about the performance and
related constraints of the model in a production environment on a small
scale and make adjustments before a full deployment.
(B)
5. Data Validation and Cleansing
Data that is invalid leads to invalid results. In order to ensure
only the appropriate data is analysed, the Data Validation and
Cleansing stage of the Big Data Lifecycle is required.
During this stage, data is validated against a set of
predetermined conditions and rules in order to ensure the data
is not corrupt.
An example of a validation rule would be to exclude all persons
that are older than 100 years old, since it is very unlikely that
data about these persons would be correct due to physical
constraints.
The Data Validation and Cleansing stage is dedicated to
establishing often complex validation rules and removing any
known invalid data.
5. Data Validation and Cleansing (cont.)
For example, as illustrated in Fig. 1, the first value in Dataset B is
validated against its corresponding value in Dataset A.
The second value in Dataset B is not validated against its
corresponding value in Dataset A. If a value is missing, it is inserted
from Dataset A.
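A minimal sketch of this kind of validation and cleansing, assuming two small hypothetical pandas DataFrames keyed by an Id column; the age rule mirrors the earlier example of excluding ages above 100.

```python
import pandas as pd

# Hypothetical datasets; Dataset A is treated as the reference for validation.
dataset_a = pd.DataFrame({"Id": [1, 2, 3], "Age": [34, 51, 29]})
dataset_b = pd.DataFrame({"Id": [1, 2, 3], "Age": [34, None, 132]})

merged = dataset_b.merge(dataset_a, on="Id", suffixes=("_b", "_a"))

# Validation rule: exclude ages above 100, which are very unlikely to be correct.
merged.loc[merged["Age_b"] > 100, "Age_b"] = None

# If a value is missing (or was invalidated), insert it from Dataset A.
merged["Age"] = merged["Age_b"].fillna(merged["Age_a"])

cleansed = merged[["Id", "Age"]]
print(cleansed)
```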
Forms of data preprocessing
1. Data cleaning
2. Data integration
– Integration of multiple databases, data cubes, or files
3. Data reduction
– Obtains a reduced representation in volume but produces the same or similar
analytical results
– Data aggregation, dimensionality reduction, data compression,
generalization
4. Data transformation
– Normalization (scaling to a specific range)
– Aggregation
1. Data Cleaning
Types of Missing Values: Missing Completely at Random (MCAR)
Example: when we take a random sample of a population, where each member has the
same chance of being included in the sample.
When data is missing completely at random, we can undertake analyses
using only observations that have complete data (provided we have enough of such
observations).
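A complete-case analysis under MCAR can be sketched as simply dropping the incomplete observations; the DataFrame below is hypothetical.

```python
import pandas as pd
import numpy as np

# Hypothetical data with a few missing entries.
df = pd.DataFrame({"age": [25, 31, np.nan, 40], "income": [30, np.nan, 45, 52]})

# Under MCAR, analysing only observations with complete data does not bias the
# results, provided enough complete rows remain.
complete_cases = df.dropna()
print(complete_cases.describe())
```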
Types of Missing Values: Missing at Random (MAR)
– Missingness is related to other variables.
– Fill in values based on other values.
– Almost always produces a bias in the analysis.
An example of MAR is when we take a sample from a population where the probability of being
included depends on some known property.
1. A simple predictive model is that income can be
predicted based on gender and age. Looking at the
table, we note that our missing value is for a
female aged 30 or more, and the observations show
that the other females aged 30 or more have a High
income. As a result, we can predict that the
missing value should be High.
2. For example, imagine a sensor that misses a
particular minute’s measurement but captures that
data the minute before and the minute following.
The missing value can be roughly interpolated
from the remaining values to a reasonable degree
of accuracy.
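The two ideas above, predicting a missing value from related attributes and interpolating a missed sensor reading from its neighbours, can be sketched as follows; the data, column names, and grouping used here are hypothetical.

```python
import pandas as pd
import numpy as np

# 1. Group-based prediction: fill a missing income with the most common income
#    among observations that share the same gender and age group.
people = pd.DataFrame({
    "gender": ["F", "F", "F", "M"],
    "age_group": ["30+", "30+", "30+", "30+"],
    "income": ["High", "High", None, "Low"],
})
mode_by_group = people.groupby(["gender", "age_group"])["income"].transform(
    lambda s: s.mode().iloc[0] if not s.mode().empty else None
)
people["income"] = people["income"].fillna(mode_by_group)

# 2. Interpolation: a sensor missed one minute's reading; estimate it from the
#    readings immediately before and after.
readings = pd.Series(
    [20.1, np.nan, 20.5],
    index=pd.date_range("2024-01-01 10:00", periods=3, freq="min"),
)
readings = readings.interpolate(method="time")

print(people, readings, sep="\n\n")
```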
Types of Missing Values: Missing Not at Random (MNAR)
Strategies to handle MNAR are to find more data about the causes of the
missingness, or to perform what-if analyses to see how sensitive the results are under
various scenarios.
Structurally Missing Data
A survey that asks for income from employment would have missing values
for those who do not have a job.
• Noise
Random error or variance in a measured variable.
Or simply meaningless data that can’t be interpreted by machines.
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in the naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
How to Handle Noisy Data?
1.2.1. Simple Discretization Method: Binning
• Data smoothing refers to a statistical approach of eliminating noise
and outliers from datasets to make the patterns more noticeable.
• Binning or bucketing is used to smooth the data. It smooths a
sorted data value by consulting its “neighborhood”, i.e., the values
around it.
• The sorted values are distributed into a number of “buckets,” or
bins, of equal width or equal frequency.
• Bin size: the number of bins or buckets ≈ square root of the number of data
points.
• Bin width/depth: the range of values a bin covers (width) or the number of
objects/elements in a single bin (depth).
• Smoothing by Mean
• Smoothing by Median
• Smoothing by Boundaries
Smoothing the data by Equal Frequency Bins contd..
1. Smoothing by BIN MEANS: find the mean value of
each bin and replace all values in the bin with that mean
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
2. Smoothing by BIN MEDIANS: find the median
value of each bin and replace all values in the bin with that median
- Bin 1: 8.5, 8.5, 8.5, 8.5
- Bin 2: 22.5, 22.5, 22.5, 22.5
- Bin 3: 28.5, 28.5, 28.5, 28.5
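A minimal sketch of equal-frequency binning with smoothing by bin means, medians, and boundaries; the twelve sorted values below are hypothetical stand-ins, since the slide's original data points are not shown here.

```python
import numpy as np

# Hypothetical sorted data: 12 values split into 3 equal-frequency bins of depth 4.
data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(data, 3)

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed_by_mean = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin medians: every value is replaced by the bin median.
smoothed_by_median = np.concatenate([np.full(len(b), np.median(b)) for b in bins])

# Smoothing by bin boundaries: each value moves to the closer of min/max of its bin.
smoothed_by_boundaries = np.concatenate([
    np.where(np.abs(b - b.min()) <= np.abs(b - b.max()), b.min(), b.max())
    for b in bins
])

print(smoothed_by_mean)
print(smoothed_by_median)
print(smoothed_by_boundaries)
```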
[Figure: a regression line y = x + 1 fitted to the data, mapping x1 to the fitted value y1′ (original value y1)]
2. Data Integration and Transformation
Data Integration
2. Handling Redundant Data in Data Integration
r_A,B = Σ_{i=1..n} (a_i − Ā)(b_i − B̄) / ((n − 1) · σ_A · σ_B)
where n is the number of tuples, Ā and B̄ are the respective mean values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
2. Handling Redundant Data in Data Integration contd..
Correlation Analysis (Numeric Data) contd...
• Note that −1 ≤ rA,B ≤ +1. If rA,B is greater than 0, then A and B are
positively correlated, meaning that the values of A increase as the
values of B increase.
• The higher the value, the stronger the correlation (i.e., the more
each attribute implies the other). Hence, a higher value may
indicate that A (or B) may be removed as a redundancy.
• If the resulting value is equal to 0, then A and B are independent
and there is no correlation between them.
• If the resulting value is less than 0, then A and B are negatively
correlated, where the values of one attribute increase as the
values of the other attribute decrease.
• Scatter plots can also be used to view correlations between
attributes.
2. Handling Redundant Data in Data Integration contd..
Correlation Analysis (Numeric Data)
Example:
Consider the stock prices listed in the table below.
a. Positive Correlation:
Correlation in the same direction is called positive correlation: if one variable
increases, the other also increases, and if one decreases, the other also decreases. For
example, the length of an iron bar will increase as the temperature increases.
b. Negative Correlation:
Correlation in the opposite direction is called negative correlation: as one variable
increases, the other decreases, and vice versa. For example, the volume of a gas will decrease as the
pressure increases, or the demand for a particular commodity increases as the price of that commodity
decreases.
2. Handling Redundant Data in Data Integration contd..
Correlation coefficient contd..
mean: x̄ = (1/n) · Σ_{i=1..n} x_i = 33.75
standard deviation: SD_x = √( Σ_{i=1..n} (X_i − X̄)² / (n − 1) )
Output: 0.9515, 0.7512, 0.6510, 1.0517
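A minimal sketch of computing the mean, standard deviation, and correlation coefficient r_A,B with NumPy; the two attribute vectors are hypothetical and do not reproduce the output values above.

```python
import numpy as np

# Hypothetical numeric attributes A and B.
a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

n = len(a)
mean_a, mean_b = a.mean(), b.mean()
sd_a = np.sqrt(((a - mean_a) ** 2).sum() / (n - 1))
sd_b = np.sqrt(((b - mean_b) ** 2).sum() / (n - 1))

# Pearson correlation coefficient r_A,B; always between -1 and +1.
r_ab = ((a - mean_a) * (b - mean_b)).sum() / ((n - 1) * sd_a * sd_b)
print(round(r_ab, 4))
```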
3. Data Reduction
• Problem:
Data Warehouse may store terabytes of data:
Complex data analysis/mining may take a very
long time to run on the complete data set
• Solution?
– Data reduction…
3. Data Reduction contd...
A. Data Cube Aggregation
It is a process in which information is gathered and expressed in
a summary form
For example, consider the data of one company’s sales per quarter for the years 2018 to 2022.
If the problem is to get the annual sales per year, then it is required to aggregate the sales
per quarter for each year. In this way, aggregation provides you with the required data, which is
much smaller in size, and thereby we achieve data reduction even without losing any data.
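A minimal sketch of this quarter-to-year aggregation with pandas; the sales figures below are hypothetical.

```python
import pandas as pd

# Hypothetical quarterly sales for one company (values are illustrative only).
sales = pd.DataFrame({
    "year": [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales": [200, 220, 210, 250, 230, 240, 225, 260],
})

# Aggregate quarterly sales into annual sales: a much smaller representation
# that still answers the "annual sales per year" question.
annual = sales.groupby("year", as_index=False)["sales"].sum()
print(annual)
```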
A. Data Cube Aggregation contd..
• Data cubes store multidimensional aggregated information.
• Data cubes provide fast access to pre-computed, summarized data,
thereby benefiting online analytical processing as well as data mining.
3. Data Reduction contd...
B. Attribute subset Selection
B. Attribute subset Selection contd..
How can we find a ‘good’ subset of the original attributes?
• For n attributes, there are 2^n possible subsets.
• An exhaustive search for the optimal subset of attributes can be expensive,
especially as n increases (brute-force method).
• Thus heuristic methods (greedy methods) that explore a reduced search
space are commonly used for attribute subset selection.
Heuristic methods:
i. Step-wise forward selection
ii. Step-wise backward elimination
iii. Combining forward selection and backward elimination
iv. Decision-tree induction
B. Attribute subset Selection contd..
i. Stepwise Forward Selection:
• The procedure starts with an empty set of attributes as the reduced
set.
• First: The best single-feature is picked.
• Next: At each subsequent iteration or step, the best of the remaining
original attributes is added to the set.
B. Attribute subset Selection contd..
ii. Stepwise Backward Elimination:
• The procedure starts with the full set of attributes.
• At each step, it removes the worst attribute remaining in the set.
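A minimal sketch of greedy step-wise forward selection (in contrast to the exhaustive 2^n search); the dataset is synthetic and the scoring model (logistic regression with cross-validation) is an assumption chosen for illustration, not a method prescribed by the slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic dataset with a few informative and a few irrelevant features.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)

def forward_selection(X, y, k):
    """Greedily add the attribute that most improves cross-validated accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        scores = {
            f: cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [f]], y, cv=5).mean()
            for f in remaining
        }
        best = max(scores, key=scores.get)  # best remaining attribute this step
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_selection(X, y, k=3))
```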
Following are different models developed from datasets with different numbers of features, all aiming for the same goal:
Model:            M1   M2   M3   M4   M5   M6    M7
No. of features:   2    4    7   10   15  100   150
Up to a threshold number of features, the accuracy of the model increases; beyond that threshold, the accuracy decreases.
C. Dimensionality Reduction contd..
Example:
Suppose a cricket ball is given to the training phase, and the model studies the object with 4 features:
Shape (sphere), Eatable (No), Play (Yes), Color (red)
Here the Color dimension is the irrelevant dimension and plays the role of the curse of dimensionality
for the model identifying the ball.
The threshold dimension for the above model is 3.
C. Dimensionality Reduction contd..
• Dimensionality reduction is a method of converting high-dimensional
variables into lower-dimensional variables without changing the
specific information of the variables.
• It represents the original data in a compressed or reduced form by
applying data encoding or transformation.
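One standard data-encoding technique for this is principal component analysis (PCA); the slides do not prescribe a particular method, so the sketch below is only an illustration on synthetic data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic high-dimensional data: 100 samples, 10 strongly correlated variables.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = np.hstack([base,
               base @ rng.normal(size=(3, 7)) + 0.01 * rng.normal(size=(100, 7))])

# Encode the 10 original variables into 3 components that retain most of the
# information, giving a compressed (reduced) representation of the data.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.round(3))
```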
A simple example of
data aggregation
where two datasets are
aggregated together
using the Id field.
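A minimal sketch of aggregating two hypothetical datasets on their Id field with pandas.

```python
import pandas as pd

# Hypothetical datasets sharing an Id field.
customers = pd.DataFrame({"Id": [1, 2, 3], "Name": ["Ann", "Bob", "Cara"]})
orders = pd.DataFrame({"Id": [1, 2, 3], "Total": [120.0, 80.5, 310.0]})

# Aggregate (join) the two datasets together using the Id field.
aggregated = customers.merge(orders, on="Id")
print(aggregated)
```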
7. Data Analysis
The Data Analysis stage of the Big Data Lifecycle stage is dedicated to
carrying out the actual analysis task.
It runs the code or algorithm that makes the calculations that will lead to
the actual result.
Data Analysis can be simple or really complex, depending on the required
analysis type.
In this stage the ‘actual value’ of the Big Data project will be generated. If
all previous stages have been executed carefully, the results will be factual
and correct.
Depending on the type of analytic result required, this stage can be as
simple as querying a dataset to compute an aggregation for comparison.
On the other hand, it can be as challenging as combining data mining and
complex statistical analysis techniques to discover patterns and anomalies
or to generate a statistical or mathematical model to depict relationships
between variables.
7. Data Analysis (cont.)
Data analysis can be classified as confirmatory analysis or exploratory analysis, the
latter of which is linked to data mining, as shown in the figure.