
Sanjivani Rural Education Society's
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC 'A' Grade Accredited, ISO 9001:2015 Certified

Department of Computer Engineering (NBA Accredited)

Subject: Business Intelligence
Unit-IV: Data Pre-processing and Outliers

Prof. S. A. Shivarkar, Assistant Professor
E-mail: [email protected]
Contact No: 8275032712
Course Contents
• Data Analytics life cycle: Discovery, Data preparation, Preprocessing requirements, data cleaning, data integration, data reduction, data transformation, Data discretization and concept hierarchy generation
• Model Planning, Model building, Communicating Results & Findings, Operationalizing
• Introduction to OLAP
• Real-world Applications, types of outliers, outlier challenges, Outlier detection Methods, Proximity-Based Outlier analysis, Clustering-Based Outlier analysis
Data Analytics
• Data analytics is the science of analyzing raw data to draw conclusions from that information.
• Data analytics helps a business optimize its performance, operate more efficiently, maximize profit, or make more strategically guided decisions.


Data Analytics Pipeline
(Figure: the data analytics pipeline.)

Data Analytics Life Cycle
• Data is precious in today's digital environment.
• It goes through several life stages, including creation, testing, processing, consumption, and reuse.
• These stages are mapped out in the data analytics Life Cycle for professionals working on data analytics initiatives.
• Each stage has its own significance and characteristics.


Importance of Data Analytics Life Cycle
• The Data Analytics Life Cycle covers the process of generating, collecting, processing, using, and analyzing data to achieve corporate objectives.
• It provides a systematic method for managing data to convert it into information that can be used to achieve organizational and project goals.
• The process gives guidance and strategies for extracting information from data and moving forward on the proper path to achieve corporate objectives.


Data Analytics Life Cycle
(Figure: the six phases of the data analytics Life Cycle.)


Data Analytics Life Cycle
Phase 1: Data Discovery and Formation
• The purpose of this initial phase is to conduct evaluations and assessments to develop a fundamental hypothesis for resolving any business problems and issues.
• Identify the objective of your data and how to achieve it by the end of the data analytics Life Cycle.
• The first stage entails mapping out the potential use and demand of data, such as where the data is coming from, the story you want your data to portray, and how your business benefits from the incoming information.
• As a data analyst, you will need to explore case studies using similar data analytics and, most crucially, examine current company trends. Then you must evaluate all in-house infrastructure and resources, as well as time and technological needs, in order to match the previously acquired data.
Data Analytics Life Cycle
Phase 1: Data Discovery and Formation (Key Takeaways)
• The data science team investigates and learns about the challenge.
• Create context and understanding.
• Learn about the data sources that will be required and available for the project.
• The team develops preliminary hypotheses that can later be tested with data.


Data Analytics Life Cycle
Phase 2: Data Preparation and Processing
• The data preparation and processing phase involves gathering, processing, and purifying the collected data.
• One of the most important aspects of this step is ensuring that the data you require is available for processing.
• The following techniques are used to acquire data:
  - Data Acquisition: accumulating data from external sources.
  - Data Entry: creating new data points within the organization using digital technologies or manual data input procedures.
  - Signal Reception: gathering information from digital devices such as Internet of Things devices and control systems.


Data Analytics Life Cycle
Phase 3: Design a Model
• After you've defined your business goals and gathered a large amount of data (formatted, unformatted, or semi-formatted), it's time to create a model that uses the data to achieve the goal. This stage of the data analytics process is called model planning.
• There are numerous methods for loading data into the system and starting to analyze it (see the sketch below):
  - ETL (Extract, Transform, and Load) transforms the data using a set of business rules before loading it into a system.
  - ELT (Extract, Load, and Transform) loads raw data into the sandbox before transforming it.
  - ETLT (Extract, Transform, Load, Transform) combines two layers of transformation.
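As a rough, hedged illustration of the ETL-versus-ELT distinction above, the Python sketch below uses pandas and an in-memory SQLite database as the "system"; the sales.csv file, the table names, and the drop-negative-price rule are invented for the example and are not part of the course material.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")   # stand-in for the warehouse / analytical sandbox

# --- ETL: apply business rules first, then load the cleaned result ---
raw = pd.read_csv("sales.csv")                      # Extract (hypothetical source file)
clean = raw.dropna(subset=["price"])                # Transform: business rules
clean = clean[clean["price"] >= 0]
clean.to_sql("sales_clean", conn, index=False)      # Load

# --- ELT: load the raw data as-is, then transform it inside the sandbox ---
raw.to_sql("sales_raw", conn, index=False)          # Load first
transformed = pd.read_sql(
    "SELECT * FROM sales_raw WHERE price IS NOT NULL AND price >= 0",  # Transform
    conn,
)
print(len(clean), len(transformed))   # both paths should agree
```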
Data Analytics Life Cycle
Phase 4: Model Building
• This stage of the data analytics Life Cycle involves creating data sets for testing, training, and production.
• The data analytics professionals carefully develop and operate the model they designed in the previous stage.
• They use tools and methods like decision trees, regression techniques (such as logistic regression), and neural networks to create and run the model. The experts also run the model through a trial run to see if it matches the datasets.


Data Analytics Life Cycle
Phase 4: Model Building (Key Takeaways)
• The team creates datasets for use in testing, training, and production.
• The team also examines whether its present tools will serve for running the models or whether a more robust environment is required for model execution.
• R and PL/R, Octave, and WEKA are examples of free or open-source tools. (A small model-building sketch follows.)
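As a minimal sketch of the model-building trial run described above, assuming scikit-learn and a synthetic dataset in place of the real prepared data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared analytical dataset
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)

# Create training and test sets (a production set would be held out separately)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Build and trial-run one of the model types mentioned above (logistic regression)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("trial-run accuracy:", accuracy_score(y_test, model.predict(X_test)))
```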


Data Analytics Life Cycle
Phase 5: Result Communication and Publication
• Recall the objective you set for your company in phase 1. Now is the time to see if the tests you ran in the previous phase matched those criteria.
• The communication process begins with cooperation with key stakeholders to decide whether the project's outcomes are successful or not.
• The project team is responsible for identifying the major conclusions of the analysis, calculating the business value associated with the outcome, and creating a narrative to summarize and communicate the results to stakeholders.


Data Analytics Life Cycle
Phase 6: Measuring Effectiveness
• As your data analytics Life Cycle comes to an end, the final stage is to offer stakeholders a complete report that includes important results, coding, briefings, and technical papers/documents.
• Furthermore, to assess the effectiveness of the study, the data is moved from the sandbox to a live environment and observed to see if the results match the desired business aim.
• If the findings meet the objectives, the reports and outcomes are finalized. However, if the conclusion differs from the purpose stated in phase 1, you can go back to any of the previous phases of the data analytics Life Cycle to adjust your input and get a different result.


How to Prepare Data for Business Intelligence and Analytics
• Preparing data for Business Intelligence (BI) can be a very tedious and time-consuming process.
• You want the data to turn into the best reports for analysis, but the raw data needs a lot of processing and handling before you can even approach those results.
• In addition, it's essential to make sure data is collected and shared across the whole organization. Gartner calls this the "democratization of analytics."
• So, how can you make this happen?
• The main challenge is planning and building the combination of several data source tables into quality reports that support the business and answer multiple questions on the same topic.


What are we looking for?
• Let's start with the end result: the report or dashboard you desire to see.
• We're looking for useful business information to gain insights and drive the business forward.
• In general, business analytics and reporting are enablers for business owners and leaders at all levels to change and direct their product. However, getting these results isn't necessarily a straightforward endeavor:
• The data may contain lots of anomalies and duplication, requiring redundancy removal, normalization across the different data sources, and handling of varying granularity.
What are we looking for?
• It's also a challenge to get the data ready for everyone: not just the business owner and developers, but also the CMO and key decision-makers.
• Plus, infrastructure and tool challenges might actually slow down access to your data and limit your ability to offload mundane tasks that take time and hamper focus.
• So, we'll drill into the process step by step to show you what you need to do to get a report that's right for as many people as possible in the business, not just the business owner.




Data Preparation: A Step-by-Step Example
To reach our desired results, there are several steps to take to go from raw data to useful analytics:
1. Collect and load data
2. Transform data to be BI ready
3. Test the system with manual queries
4. Build the reports


Data Preparation: A Step-by-Step Example
1. Collect and load data
• Collecting data entails uploading the data into a data warehouse like Redshift, so that you can leverage its relational database features and capabilities.
2. Transform data to be BI ready
• The best way to start this step is by investigating the loaded raw data using manual queries.
• You can then evaluate the quality of the data and decide which tables are not relevant or need to be changed. Then, plan and decide on the right transformations accordingly.
Data Preparation: A Step-by-Step Example
3. Test the system with manual queries
• Try to get the same result using different manual queries.
• In this step, you can also manually count the result and compare it to the result obtained from the transformation.
4. Build the reports
• Create end-user reports and charts with the right granularity and resolution, like DAU per device, country, etc.


Data Quality
• Data has quality if it satisfies the requirements of its intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.
• Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses.


Factors affecting Data Quality
Accuracy
• There are many possible reasons for inaccurate data (having
incorrect attribute values).
• The data collection instruments used may be faulty.
• There may have been human or computer errors occurring at
data entry.
• Users may purposely submit incorrect data values for
mandatory fields when they do not wish to submit personal
information



Factors affecting Data Quality
Incomplete Data
• Incomplete data can occur for a number of reasons. Attributes
of interest may not always be available.
• Relevant data may not be recorded due to a
misunderstanding, or because of equipment malfunctions.
• Missing data, particularly for tuples with missing values for
some attributes, may need to be inferred.



Factors affecting Data Quality
Timeliness
• The fact that the month-end data is not updated in a timely
fashion has a negative impact on the data quality.
Believability
• It reflects how much the data are trusted by users.
Interpretability
• It reflects how easily the data are understood.


Data Pre-Processing
• There are a number of data preprocessing techniques.
• Data cleaning can be applied to remove noise and correct inconsistencies in the data.
• Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.
• Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance.
• Data transformations, such as normalization, may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.
• These techniques are not mutually exclusive; they may work together.
• For example, data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format.
Major tasks in Data Pre-Processing
Major steps involved in data preprocessing:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation (including normalization, discretization, and concept hierarchy generation)


Major tasks in Data Pre-Processing

Data Cleaning
• Data cleaning routines work to “clean” the data by filling
in missing values, smoothing noisy data, identifying or
removing outliers, and resolving inconsistencies.
• If users believe the data are dirty, they are unlikely to
trust the results of any data mining that has been
applied to it.
• Furthermore, dirty data can cause confusion for the
mining procedure, resulting in unreliable output.



Major tasks in Data Pre-Processing

Data Integration
• This would involve integrating multiple databases, data
cubes, or files, that is, data integration.
• Yet some attributes representing a given concept may
have different names in different databases, causing
inconsistencies and redundancies.
• Having a large amount of redundant data may slow
down or confuse the knowledge discovery process.
Clearly, in addition to data cleaning, steps must be taken
to help avoid redundancies during data integration.
• Typically, data cleaning and data integration are
performed as a preprocessing step when preparing the
data for a data warehouse.
Major tasks in Data Pre-Processing
Data Reduction
• Data reduction obtains a reduced representation of the data
set that is much smaller in volume, yet produces the same
(or almost the same) analytical results.
• Data reduction strategies include dimensionality reduction
and numerosity reduction.
• In dimensionality reduction, data encoding schemes are
applied so as to obtain a reduced or “compressed”
representation of the original data.
• Examples include data compression techniques (such as wavelet
transforms and principal components analysis) as well as
attribute subset selection (e.g., removing irrelevant attributes),
and attribute construction (e.g., where a small set of more useful
attributes is derived from the original set).
Major tasks in Data Pre-Processing

Data Reduction
• In numerosity reduction, the data are replaced by
alternative, smaller representations using parametric
models (such as regression or log-linear models) or
nonparametric models (such as with histograms, clusters,
sampling, or data aggregation).

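As a small, hedged sketch of both reduction strategies (assuming scikit-learn and pandas; the 10-attribute synthetic table, the choice of 3 principal components, and the 10% sample are arbitrary for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(1000, 10)),
                    columns=[f"attr{i}" for i in range(10)])

# Dimensionality reduction: a "compressed" representation with 3 components
reduced = PCA(n_components=3).fit_transform(data)
print(reduced.shape)              # (1000, 3)

# Numerosity reduction: replace the data with a 10% random sample
sample = data.sample(frac=0.1, random_state=0)
print(len(sample))                # 100
```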


Major tasks in Data Pre-Processing

Data Normalization
• Getting back to your data, you have decided, say, that you
would like to use a distance-based mining algorithm for
your analysis, such as neural networks, nearest-neighbor
classifiers, or clustering.
• Such methods provide better results if the data to be
analyzed have been normalized, that is, scaled to a smaller
range such as [0.0, 1.0].
• Your customer data, for example, contain the attributes age and
annual salary. The annual salary attribute usually takes much
larger values than age. Therefore, if the attributes are left
unnormalized, the distance measurements taken on annual
salary will generally outweigh distance measurements taken on
age.
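A minimal pandas sketch of min-max normalization for the age and annual salary example above; the four customer rows are invented.

```python
import pandas as pd

customers = pd.DataFrame({
    "age":           [23, 35, 47, 61],
    "annual_salary": [28000, 54000, 73000, 91000],
})

# Min-max normalization: scale each attribute to [0.0, 1.0] so that
# annual_salary no longer dominates distance-based computations.
normalized = (customers - customers.min()) / (customers.max() - customers.min())
print(normalized)
```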
Major tasks in Data Pre-Processing

Discretization and concept hierarchy generation


• It can also be useful, where raw data values for attributes
are replaced by ranges or higher conceptual levels. For
example, raw values for age may be replaced by higher-level
concepts, such as youth, adult, or senior.
• Discretization and concept hierarchy generation are
powerful tools for data mining in that they allow the mining
of data at multiple levels of abstraction.
• Normalization, data discretization, and concept hierarchy
generation are forms of data transformation.

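One possible sketch of the age example with pandas; the cut points (up to 24 = youth, 25 to 59 = adult, 60 and over = senior) are assumptions made only for this illustration.

```python
import pandas as pd

ages = pd.Series([15, 22, 34, 45, 58, 67, 71])

# Replace raw age values with higher-level concepts (a simple concept hierarchy)
labels = pd.cut(ages, bins=[0, 24, 59, 120], labels=["youth", "adult", "senior"])
print(labels.tolist())   # ['youth', 'youth', 'adult', 'adult', 'adult', 'senior', 'senior']
```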




Data Cleaning
• Real-world data tend to be incomplete, noisy, and inconsistent.
• Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
1. Missing Values
2. Noisy Data


Missing Values
You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods:
1. Ignore the tuple
• This is usually done when the class label is missing (assuming the mining task involves classification).
• This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
• By ignoring the tuple, we do not make use of the remaining attribute values in the tuple. Such data could have been useful to the task at hand.
Missing Values
2. Fill in the missing value manually
• In general, this approach is time consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value
• Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞.
• If missing values are replaced by, say, "Unknown," then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common: that of "Unknown." Hence, although this method is simple, it is not foolproof.


Missing Values
4. Use a measure of central tendency for the attribute (such as the mean or median) to fill in the missing value
• The measures of central tendency indicate the "middle" value of a data distribution. For normal (symmetric) data distributions the mean can be used, while skewed data distributions should employ the median.
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value.
A short sketch of several of these options follows.
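A minimal pandas sketch of options 1, 3, 4, and 5 above, using an invented income column and credit-risk class; option 6 would typically use a predictive model such as a decision tree instead.

```python
import pandas as pd

df = pd.DataFrame({
    "income":      [56000, None, 61000, None, 40000, None],
    "credit_risk": ["low", "low", "high", "high", "low", "high"],
})

dropped   = df.dropna(subset=["income"])                # 1. ignore the tuple
constant  = df["income"].fillna(float("-inf"))          # 3. global constant (here -inf)
mean_fill = df["income"].fillna(df["income"].mean())    # 4. central tendency (mean)

# 5. class-wise fill: use the average income of tuples in the same credit_risk class
class_fill = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(mean_fill.tolist())
print(class_fill.tolist())
```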


Noise
• Noise is a random error or variance in a measured variable.
• Noisy data are data with a large amount of additional meaningless information.
• Smoothing is designed to detect trends in the presence of noisy data when the shape of the trend is unknown. The name comes from the assumption that the trend is smooth, as in a smooth surface; in contrast, the noise, or deviation from the trend, is unpredictably wobbly.


Smoothing
• The smoothing name comes from the fact that to accomplish this feat, we assume that the trend is smooth, as in a smooth surface.
• In contrast, the noise, or deviation from the trend, is unpredictably wobbly.
Smoothing
1. Binning
• Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it.
• The sorted values are distributed into a number of "buckets," or bins.
• Because binning methods consult the neighborhood of values, they perform local smoothing. The figure illustrates some binning techniques.
• In this example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.


Smoothing: Binning
(Figure: binning methods for data smoothing.)


Smoothing
2. Regression
• Data smoothing can also be done by conforming data values to a function, a technique known as regression.
• Linear regression involves finding the "best" line to fit two attributes (or variables), so that one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.


Smoothing
3. Outlier Analysis
• Outliers may be detected by clustering, for example, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers.


Data Integration
• Data mining often requires data integration, the merging of data from multiple data stores.
• Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set.
• This can help improve the accuracy and speed of the subsequent mining process.
• The semantic heterogeneity and structure of data pose great challenges in data integration.
• How can we match schema and objects from different sources? This is the essence of the entity identification problem.
Data Integration
The Entity Identification Problem
• It is likely that your data analysis task will involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing.
• These sources may include multiple databases, data cubes, or flat files.
• There are a number of issues to consider during data integration. Schema integration and object matching can be tricky.
• How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem.


Data Integration
Redundancy and Correlation Analysis
• Redundancy is another important issue in data integration.
• An attribute (such as annual revenue, for instance) may be redundant if it can be "derived" from another attribute or set of attributes.
• Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. (A small sketch follows.)
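A brief sketch of correlation-based redundancy detection with pandas; the columns are fabricated so that annual_revenue is exactly derivable from monthly_revenue.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
monthly = rng.uniform(10_000, 50_000, size=200)
df = pd.DataFrame({
    "monthly_revenue": monthly,
    "annual_revenue":  monthly * 12,                    # derivable, hence redundant
    "num_employees":   rng.integers(5, 500, size=200),
})

# Pearson correlation between numeric attributes; values near +1 or -1 suggest
# that one attribute strongly implies (and may duplicate) the other.
print(df.corr(numeric_only=True).round(2))
```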


Data Integration
Tuple Duplication
• In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level (e.g., where there are two or more identical tuples for a given unique data entry case).
• The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy.
• Inconsistencies often arise between various duplicates, due to inaccurate data entry or updating some but not all occurrences of the data.
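A quick pandas sketch of tuple-level duplicate detection; the small customer table is invented.

```python
import pandas as pd

customers = pd.DataFrame({
    "cust_id": [101, 102, 102, 103],
    "name":    ["Bill", "Ann", "Ann", "Raj"],
    "city":    ["Pune", "Nashik", "Nashik", "Mumbai"],
})

print(customers.duplicated().sum())          # number of fully identical tuples (1)
deduplicated = customers.drop_duplicates()   # keep the first occurrence of each tuple
print(deduplicated)
```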


Unit IV Data Pre-processing and Outliers

Data Analytics
Data analytics is the science of analyzing raw data to draw conclusions from that information.
Data analytics helps a business optimize its performance, operate more efficiently, maximize
profit, or make more strategically guided decisions.

Data Analytics Life Cycle


Data is precious in today’s digital environment. It goes through several life stages, including
creation, testing, processing, consumption, and reuse. These stages are mapped out in the data
analytics Life Cycle for professionals working on data analytics initiatives. Each stage has its
significance and characteristics.

Importance of Data Analytics Life Cycle


The Data Analytics Life Cycle covers the process of generating, collecting, processing, using, and
analyzing data to achieve corporate objectives. It provides a systematic method for managing data
to convert it into information that can be used to achieve organizational and project goals. The
process gives guidance and strategies for extracting information from data and moving forward on
the proper path to achieve corporate objectives.
Data professionals use the circular nature of the Life Cycle to move forward or backward with data
analytics. Based on new information, they can decide whether to continue with their current
research or abandon it and redo the entire analysis. Throughout the process, they are guided by the
Data Analytics Life Cycle.

Phases of Data Analytics Life Cycle


Phase 1: Data Discovery and Formation
Everything starts with a purpose in mind. In this phase, you will identify the objective of your data
and how to achieve it by the end of the data analytics Life Cycle. The purpose of this initial phase is
to conduct evaluations and assessments to develop a fundamental hypothesis for resolving any
business problems and issues.
The first stage entails mapping out the potential use and demand of data, such as where the data is
coming from, the story you want your data to portray, and how your business benefits from the
incoming information.
As a data analyst, you will need to explore case studies using similar data analytics and, most
crucially, examine current company trends. Then you must evaluate all in-house infrastructure and
resources, as well as time and technological needs, in order to match the previously acquired data.
Following the completion of the evaluations, the team closes this stage with hypotheses that will
be tested using data later on. This is the first and most critical step in the big data analytics Life
Cycle.
Key takeaways:
• The data science team investigates and learns about the challenge.
• Create context and understanding.
• Learn about the data sources that will be required and available for the project.
• The team develops preliminary hypotheses that can later be tested with data.

Phase 2: Data Preparation and Processing


The data preparation and processing phase involves gathering, processing, and purifying the
collected data. One of the most important aspects of this step is ensuring that the data you require
is available for processing.
The following techniques are used to acquire data:
• Data Acquisition: Accumulate data from external sources.
• Data Entry: Within the organization, creating new data points utilizing digital technologies or manual data input procedures.
• Signal Reception: Information gathering from digital devices such as the Internet of Things and control systems.
An analytical sandbox is required during the data preparation phase of the data analytics Life
Cycle. This is a scalable platform used to process data by data analysts and data scientists. The
analytical sandbox contains data that has been executed, loaded, and altered.
There is no defined order in which this phase of the analytical cycle must take place; it may be
repeated at a later time as necessary.
Phase 3: Design a Model
After you’ve defined your business goals and gathered a large amount of data (formatted,
unformatted, or semi-formatted), it’s time to create a model that uses the data to achieve the goal.
Model planning is the name given to this stage of the data analytics process.
There are numerous methods for loading data into the system and starting to analyze it:
• ETL (Extract, Transform, and Load) converts the information before loading it into a system using a set of business rules.
• ELT (Extract, Load, and Transform) loads raw data into the sandbox before transforming it.
• ETLT (Extract, Transform, Load, Transform) is a combination of two layers of transformation.
This step also involves teamwork to identify the approaches, techniques, and workflow to be used
in the succeeding phase to develop the model. The process of developing a model begins with
finding the relationship between data points in order to choose the essential variables and,
subsequently, create a suitable model.
Phase 4: Model Building
This stage of the data analytics Life Cycle involves creating data sets for testing, training, and
production. The data analytics professionals develop and operate the model they designed in the
previous stage with proper effort.
They use tools and methods like decision trees, regression techniques (such as logistic regression),
and neural networks to create and run the model. The experts also run the model through a trial
run to see if it matches the datasets.
It assists them in determining whether the tools they now have will be enough to execute the model
or if a more robust system is required for it to function successfully.
Key Takeaways:
• The team creates datasets for use in testing, training, and production.
• The team also examines if its present tools will serve for running the models or if a more robust environment is required for model execution.
• R and PL/R, Octave, and WEKA are examples of free or open-source tools.
Phase 5: Result Communication and Publication
Recall the objective you set for your company in phase 1. Now is the time to see if the tests you
ran in the previous phase matched those criteria.
The communication process begins with cooperation with key stakeholders to decide whether the
project’s outcomes are successful or not.
The project team is responsible for identifying the major conclusions of the analysis, calculating
the business value associated with the outcome, and creating a narrative to summarize and
communicate the results to stakeholders.
Phase 6: Measuring Effectiveness
As your data analytics Life Cycle comes to an end, the final stage is to offer stakeholders a
complete report that includes important results, coding, briefings, and technical papers/documents.
Furthermore, to assess the effectiveness of the study, the data is transported from the sandbox to a
live environment and observed to see if the results match the desired business aim.
If the findings meet the objectives, the reports and outcomes are finalized. However, if the
conclusion differs from the purpose stated in phase 1, you can go back to any of the previous phases
of the data analytics Life Cycle to adjust your input and get a different result.

How to Prepare Data for Business Intelligence and Analytics?


Preparing data for Business Intelligence (BI) can be a very tedious and time consuming process.
You want the data to turn into the best reports for analysis. But, the raw data needs lots of
processing and handling before you can even approach the results. In addition, it’s essential to
make sure data is collected and shared across the whole organization. Gartner calls this the
“democratization of analytics.”
So, how can you make this happen? The main challenge is planning and building the combination
of several data source tables into quality reports that support the business and answer multiple
questions for the same topic. For example, reporting on your online service’s “daily active users”
and also answering questions about the region source, the devices they use, etc. - ensuring that the
right data resolution and granularity is available for your multiple reporting requirements.
In this article, we provide practical inputs and insights for data architects and engineers to help in
the crucial step of BI data preparation. First, we give some how-to information and then some
ideas to streamline the process.
What are we looking for?
Let’s start with the end results - the report or dashboard you desire to see. We’re looking for useful
business information to gain insights and drive the business forward.
Here's an example of a report for daily active users (DAU) by country:
(Figure: DAU by country report.)
This report is a collection of powerful data that enables an overview of daily active users by their
country. It helps product owners and business managers to focus their efforts on the regional
trends.
In general, business analytics and reporting are enablers for business owners and leaders at all
levels to change and direct their product. However, getting these results isn’t necessarily a
straightforward endeavor:
The data may contain lots of anomalies and duplication, requiring redundancy removal,
normalization across the different data sources, and handling of varying granularity.
It’s also a challenge to get the data ready for everyone - not just the business owner and developers,
but also for the CMO and key decision-makers.
Plus, infrastructure and tool challenges might actually slow down access to your data, and limit
your ability to offload mundane tasks that take time and hamper focus.
So, we’ll drill into the process step by step to show you what you need to do to get a report that’s
right for as many people as possible in the business - not just the business owner.

A step-by-step example for data preparation


To reach our desired results, there are several steps to take to go from raw data to useful analytics:
1. Collect and load data
2. Transform data to be BI ready
3. Test system with manual queries
4. Build the reports
An important note: In this step-by-step example, we don’t follow the traditional ETL sequence,
but the more modern ELT approach. First, we load (L) the data into a Panoply data warehouse and
only then run the transformations (T) to prepare the data for BI. In the specific case of this example,
the tables are ready for loading from CSV files, so the extract (E) step is not shown.
1. Collect and load data
Collecting data entails uploading the data into the data warehouse like Redshift, so that you can
leverage its relational database features and capabilities.
There are several ways to collect data and insert it into Redshift. Let’s see how to do this to create
reports like the one shown in the previous section. In this case, the tables are in 3 CSV files:
sessions, users and devices.
As shown below, we use Panoply.io to choose and quickly upload the CSV files to Redshift. You
can also load the collected data into Redshift manually using the COPY command, but this will require
a few more steps.

To create the report, we uploaded 2 additional CSV files in a similar manner. Note that you can
upload data from many disparate data sources using AWS Redshift.
2. Transform data to be BI ready
The best way to start this step is by investigation using manual queries on the loaded raw data.
You can then evaluate the quality of the data and decide which tables are not relevant or need to
be changed. Then, plan and decide on the right transformations accordingly.
If we continue with our example, here’s how we use Panoply.io to calculate DAU, DAU by country
and DAU by country and device type (based on the operating system). Note that although each of
these queries is correct for its own calculation, the data may need to be normalized to allow
provision of the right results in other queries.
The query outputs show users counted by time, users counted by country, and users counted by country and the device operating system.

Once all our data has been loaded into the data warehouse, we have the flexibility to continue with
our investigation on the raw data. We now consider different dimensions such as by comparing
DAU to DAU by country and OS, using the following transformation queries:
For simple DAU: SELECT date(“time”), SUM(dau) FROM dau GROUP BY date(“time”)
For DAU from the country & OS: SELECT date(“time”), SUM(dau) FROM dau_country_os
GROUP BY date(“time”)
As you can see in the below query outputs, the results in the right table are greater than the results
in the left (that is due to cases when a user has a few devices, so each device might be counted
multiple times, although it’s the same unique user).

To transform the data to be BI ready, using multiple data sources and structures is the best method.
A rule of thumb is that it's not just one table, since you need to continuously find a good combination
of information involving several tables.
So, taking our example to the next step: In our case we want one transformation that will allow us
the flexibility to answer many questions (DAU, DAU by country and DAU by country and OS).
However, as we saw in the comparison table above, we can’t calculate the DAU using the more
granular transformation (DAU by country and OS), since we won’t be able to answer all three
questions with that transformation only. In order to solve this issue, we will need to use a higher
resolution transformation that’s grouped by day, userid, country and OS:
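The article's actual SQL for that transformation is not reproduced in this copy; as a hedged stand-in, the pandas sketch below builds the same kind of day/userid/country/OS grouping from an invented sessions table and then answers all three questions from it.

```python
import pandas as pd

# Invented stand-in for the raw session data loaded into the warehouse
sessions = pd.DataFrame({
    "time":    pd.to_datetime(["2024-01-01 09:00", "2024-01-01 13:00",
                               "2024-01-01 15:00", "2024-01-02 10:00"]),
    "userid":  [1, 1, 2, 1],
    "country": ["IN", "IN", "US", "IN"],
    "os":      ["android", "ios", "android", "android"],
})
sessions["day"] = sessions["time"].dt.date

# Higher-resolution base table: one row per day/userid/country/OS
base = sessions.drop_duplicates(["day", "userid", "country", "os"])

dau               = base.groupby("day")["userid"].nunique()
dau_by_country    = base.groupby(["day", "country"])["userid"].nunique()
dau_by_country_os = base.groupby(["day", "country", "os"])["userid"].nunique()

# Note: summing dau_by_country_os over-counts users with several devices,
# which is exactly the discrepancy discussed above.
print(dau, dau_by_country, dau_by_country_os, sep="\n\n")
```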
3. Test the transformation with manual queries
As shown above, try getting the same result using different manual queries. In this step, you can
also pull the results data into a spreadsheet (a sample of the data should be enough), or even
manually count the result and compare it to the result obtained from the transformation.
4. Build the reports
Create end user reports and charts with the right granularity and resolution, like DAU per device,
country, etc.
Here are the results from our example: charts of DAU over time, DAU by device, and DAU by country.

Data Quality
Data has quality if it satisfies the requirements of its intended use. There are many factors
comprising data quality. These include: accuracy, completeness, consistency, timeliness,
believability, and interpretability.
Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world
databases and data warehouses.

Factors affecting data quality


Accuracy
There are many possible reasons for inaccurate data (having incorrect attribute values). The data
collection instruments used may be faulty. There may have been human or computer errors
occurring at data entry. Users may purposely submit incorrect data values for mandatory fields
when they do not wish to submit personal information, e.g., by choosing the default value ‘January
1’ displayed for birthday. (This is known as disguised missing data.) Errors in data transmission
can also occur. There may be technology limitations, such as limited buffer size for coordinating
synchronized data transfer and consumption. Incorrect data may also result from inconsistencies
in naming conventions or data codes used, or inconsistent formats for input fields, such as date.
Duplicate tuples also require data cleaning.
Incomplete Data
Incomplete data can occur for a number of reasons. Attributes of interest may not always be
available, such as customer information for sales transaction data. Other data may not be included
simply because they were not considered important at the time of entry. Relevant data may not be
recorded due to a misunderstanding, or because of equipment malfunctions. Data that were
inconsistent with other recorded data may have been deleted. Furthermore, the recording of the
history or modifications to the data may have been overlooked. Missing data, particularly for tuples
with missing values for some attributes, may need to be inferred.
Recall that data quality depends on the intended use of the data. Two different users may
have very different assessments of the quality of a given database. For example, a marketing
analyst may need to access the database mentioned above for a list of customer addresses. Some
of the addresses are outdated or incorrect, yet overall, 80% of the addresses are accurate. The
marketing analyst considers this to be a large customer database for target marketing purposes and
is pleased with the accuracy of the database, although, as sales manager, you found the data
inaccurate.
Timeliness
Timeliness also affects data quality. Suppose that you are overseeing the distribution of monthly
sales bonuses to the top sales representatives at AllElectronics. Several sales representatives,
however, fail to submit their sales records on time at the end of the month. There are also a number
of corrections and adjustments that flow in after the month’s end. For a period of time following
each month, the data stored in the database is incomplete. However, once all of the data is received,
it is correct. The fact that the month-end data is not updated in a timely fashion has a negative
impact on the data quality.
Believability reflects how much the data are trusted by users.
Interpretability reflects how easily the data are understood.
Data preprocessing
There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise
and correct inconsistencies in the data. Data integration merges data from multiple sources into a
coherent data store, such as a data warehouse. Data reduction can reduce the data size by
aggregating, eliminating redundant features, or clustering, for instance. Data transformations, such
as normalization, may be applied, where data are scaled to fall within a smaller range like 0.0 to
1.0. This can improve the accuracy and efficiency of mining algorithms involving distance
measurements. These techniques are not mutually exclusive; they may work together. For
example, data cleaning can involve transformations to correct wrong data, such as by transforming
all entries for a date field to a common format.

Major tasks in data preprocessing


The major steps involved in data preprocessing are data cleaning, data integration, data reduction,
and data transformation.
1. Data cleaning
Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty,
they are unlikely to trust the results of any data mining that has been applied to it. Furthermore,
dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although
most mining routines have some procedures for dealing with incomplete or noisy data, they are
not always robust. Instead, they may concentrate on avoiding overfitting the data to the function
being modeled. Therefore, a useful preprocessing step is to run your data through some data
cleaning routines.
2. Data Integration
Getting back to your task at AllElectronics, suppose that you would like to include data from
multiple sources in your analysis. This would involve integrating multiple databases, data cubes,
or files, that is, data integration. Yet some attributes representing a given concept may have
different names in different databases, causing inconsistencies and redundancies. For example, the
attribute for customer identification may be referred to as customer_id in one data store and cust_id
in another. Naming inconsistencies may also occur for attribute values. For example, the same first
name could be registered as “Bill” in one database, but “William” in another, and “B.” in the third.
Furthermore, you suspect that some attributes may be inferred from others (e.g., annual revenue).
Having a large amount of redundant data may slow down or confuse the knowledge discovery
process. Clearly, in addition to data cleaning, steps must be taken to help avoid redundancies
during data integration. Typically, data cleaning and data integration are performed as a
preprocessing step when preparing the data for a data warehouse. Additional data cleaning
can be performed to detect and remove redundancies that may have resulted from data integration.
3. Data reduction
“Hmmm,” you wonder, as you consider your data even further. “The data set I have selected for
analysis is HUGE, which is sure to slow down the mining process. Is there a way I can reduce the
size of my data set without jeopardizing the data mining results?” Data reduction obtains a reduced
representation of the data set that is much smaller in volume, yet produces the same (or almost the
same) analytical results.
Data reduction strategies include dimensionality reduction and numerosity reduction. In
dimensionality reduction, data encoding schemes are applied so as to obtain a reduced or
“compressed” representation of the original data. Examples include data compression techniques
(such as wavelet transforms and principal components analysis) as well as attribute subset selection
(e.g., removing irrelevant attributes), and attribute construction (e.g., where a small set of more
useful attributes is derived from the original set).
In numerosity reduction, the data are replaced by alternative, smaller representations using
parametric models (such as regression or log-linear models) or nonparametric models (such as
with histograms, clusters, sampling, or data aggregation).
4. Normalization
Getting back to your data, you have decided, say, that you would like to use a distance-based
mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or
clustering. Such methods provide better results if the data to be analyzed have been normalized,
that is, scaled to a smaller range such as [0.0, 1.0]. Your customer data, for example, contain the
attributes age and annual salary. The annual salary attribute usually takes much larger values than
age. Therefore, if the attributes are left unnormalized, the distance measurements taken on annual
salary will generally outweigh distance measurements taken on age.
5. Discretization and concept hierarchy generation
It can also be useful, where raw data values for attributes are replaced by ranges or higher
conceptual levels. For example, raw values for age may be replaced by higher-level concepts, such
as youth, adult, or senior. Discretization and concept hierarchy generation are powerful tools for
data mining in that they allow the mining of data at multiple levels of abstraction.
Normalization, data discretization, and concept hierarchy generation are forms of data
transformation. You soon realize such data transformation operations are additional data
preprocessing procedures that would contribute toward the success of the mining process.

In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing
techniques can improve the quality of the data, thereby helping to improve the accuracy and
efficiency of the subsequent mining process. Data preprocessing is an important step in the
knowledge discovery process, because quality decisions must be based on quality data. Detecting
data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs
for decision making.
Data cleaning techniques
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing)
routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.

1. Missing Values
2. Noisy Data

How can missing values be handled?


Missing Values
Imagine that you need to analyze AllElectronics sales and customer data. You note that many
tuples have no recorded value for several attributes, such as customer income. How can you go
about filling in the missing values for this attribute? Let’s look at the following methods:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification). This method is not very effective, unless the tuple contains several
attributes with missing values. It is especially poor when the percentage of missing values per
attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes
values in the tuple. Such data could have been useful to the task at hand.
2. Fill in the missing value manually: In general, this approach is time consuming and may not
be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the
same constant, such as a label like “Unknown” or −∞. If missing values are replaced by, say,
“Unknown,” then the mining program may mistakenly think that they form an interesting concept,
since they all have a value in common—that of “Unknown.” Hence, although this method is
simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (such as the mean or median) to fill in
the missing value: The measures of central tendency indicate the “middle” value of a data
distribution. For normal (symmetric) data distributions, the mean can be used, while skewed data
distribution should employ the median. For example, suppose that the data distribution regarding
the income of AllElectronics customers is symmetric and that the average income is $56,000. Use
this value to replace the missing value for income.
5. Use the attribute mean or median for all samples belonging to the same class as the given
tuple: For example, if classifying customers according to credit risk, we may replace the missing
value with the average income value for customers in the same credit risk category as that of the
given tuple. If the data distribution for a given class is skewed, the median value is a better choice.
6. Use the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using a Bayesian formalism, or decision tree induction. For
example, using the other customer attributes in your data set, you may construct a decision tree to
predict the missing values for income.

Noise
Noise is a random error or variance in a measured variable.
The following are data smoothing techniques:
1. Binning:
Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values
around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing. Figure illustrates some
binning techniques. In this example, the data for price are first sorted and then partitioned into
equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means,
each value in a bin is replaced by the mean value of the bin.
For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in
this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which
each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced
by the closest boundary value. In general, the larger the width, the greater the effect of the
smoothing. Alternatively, bins may be equal-width, where the interval range of values in each bin
is constant. Binning is also used as a discretization technique.
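Below is a small illustrative sketch of smoothing by bin means with equal-frequency bins of size 3; the first three prices (4, 8, 15) follow the example above, while the remaining values are made up for the demonstration.

import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted
bin_size = 3

smoothed = prices.astype(float).copy()
for start in range(0, len(prices), bin_size):
    b = prices[start:start + bin_size]
    smoothed[start:start + bin_size] = b.mean()  # smoothing by bin means

print(smoothed)
# Bin 1 mean is 9.0, so 4, 8 and 15 are each replaced by 9, as in the example.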
2. Regression
Data smoothing can also be done by conforming data values to a function, a technique known as
regression. Linear regression involves finding the “best” line to fit two attributes (or variables), so
that one attribute can be used to predict the other. Multiple linear regression is an extension of
linear regression, where more than two attributes are involved and the data are fit to a
multidimensional surface.
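A minimal sketch of regression-based smoothing, assuming only NumPy; the x and y values below are purely illustrative.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])  # noisy, roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares "best" line
y_smoothed = slope * x + intercept          # replace noisy values with fitted ones

print(y_smoothed)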
3. Outlier analysis
Outliers may be detected by clustering, for example, where similar values are organized into
groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered
outliers. The figure illustrates a 2-D plot of customer data with respect to customer locations in a city,
showing three data clusters. Each cluster centroid is marked with a “+”, representing the average
point in space for that cluster. Outliers may be detected as values that fall outside of the sets of
clusters.
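A rough sketch of clustering-based outlier detection, assuming scikit-learn's KMeans is available; the synthetic 2-D points loosely mimic the customer-location scenario, and the "mean plus three standard deviations" cut-off is just one possible threshold, not a fixed rule.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic 2-D clusters of "customer locations" plus two stray points.
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(50, 2)),
    [[10, -3], [8, 12]],  # stray points
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
# Distance of each point to the centroid of its assigned cluster.
dist_to_centroid = np.linalg.norm(data - km.cluster_centers_[km.labels_], axis=1)

threshold = dist_to_centroid.mean() + 3 * dist_to_centroid.std()
outliers = data[dist_to_centroid > threshold]
print(outliers)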
Data Integration
Data mining often requires data integration—the merging of data from multiple data stores.
Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting
data set. This can help improve the accuracy and speed of the subsequent mining process. The
semantic heterogeneity and structure of data pose great challenges in data integration. How can
we match schema and objects from different sources? This is the essence of the entity identification
problem.

The Entity Identification Problem


It is likely that your data analysis task will involve data integration, which combines data
from multiple sources into a coherent data store, as in data warehousing. These sources may
include multiple databases, data cubes, or flat files. There are a number of issues to consider during
data integration. Schema integration and object matching can be tricky. How can equivalent real-
world entities from multiple data sources be matched up? This is referred to as the entity
identification problem. For example, how can the data analyst or the computer be sure that
customer id in one database and cust_number in another refer to the same attribute? Examples of
metadata for each attribute include the name, meaning, data type, and range of values permitted
for the attribute, and null rules for handling blank, zero, or null values. Such metadata can be used
to help avoid errors in schema integration. The metadata may also be used to help transform the
data (e.g., where data codes for pay type in one database may be “H” and “S”, and 1 and 2 in
another). Hence, this step also relates to data cleaning, as described earlier.
When matching attributes from one database to another during integration, special attention
must be paid to the structure of the data. This is to ensure that any attribute functional dependencies
and referential constraints in the source system match those in the target system. For example, in
one system, a discount may be applied to the order, whereas in another system it is applied to each
individual line item within the order. If this is not caught before integration, items in the target
system may be improperly discounted.
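A small pandas sketch of handling the entity identification problem: here the attribute customer_id in one source and cust_number in another are assumed (from metadata) to refer to the same entity, so one is renamed and its codes harmonized before merging. The table contents are hypothetical.

import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2], "name": ["Asha", "Ravi"]})
billing = pd.DataFrame({"cust_number": [1, 2], "pay_type": ["H", "S"]})

# Metadata says cust_number and customer_id describe the same attribute.
billing = billing.rename(columns={"cust_number": "customer_id"})
# Metadata can also drive code harmonization, e.g. "H"/"S" vs 1/2 for pay type.
billing["pay_type"] = billing["pay_type"].map({"H": 1, "S": 2})

integrated = crm.merge(billing, on="customer_id", how="left")
print(integrated)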

Redundancy and Correlation Analysis


Redundancy is another important issue in data integration. An attribute (such as annual revenue)
may be redundant if it can be “derived” from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data
set.
Some redundancies can be detected by correlation analysis. Given two attributes, such
analysis can measure how strongly one attribute implies the other, based on the available data. For
nominal data, we use the χ 2 (chi-square) test. For numeric attributes, we can use the correlation
coefficient and covariance, both of which assess how one attribute’s values vary with those of
another.
χ 2 Correlation Test for Nominal Data
For nominal data, a correlation relationship between two
attributes, A and B, can be discovered by a χ 2 (chi-square) test. Suppose A has c distinct values,
namely a1, a2, . . . ac. B has r distinct values, namely b1, b2, . . . br. The data tuples described by
A and B can be shown as a contingency table, with the c values of A making up the columns and
the r values of B making up the rows. Let (Ai , Bj ) denote the joint event that attribute A takes on
value ai and attribute B takes on value bj , that is, where (A = ai , B = bj ). Each and every possible
(Ai , Bj ) joint event has its own cell (or slot) in the table. The χ 2 value (also known as the Pearson
χ 2 statistic) is computed as:

χ 2 = Σ (i = 1 to c) Σ (j = 1 to r) (oij − eij)^2 / eij        ... (1)

where oij is the observed frequency (i.e., actual count) of the joint event (Ai , Bj ) and eij is the
expected frequency of (Ai , Bj ), which can be computed as:

eij = (count(A = ai) × count(B = bj)) / n

where n is the number of data tuples, count(A = ai) is the number of tuples having value ai for A,
and count(B = bj) is the number of tuples having value bj for B. The sum in Equation (1) is
computed over all of the r×c cells. Note that the cells that contribute the most to the χ 2 value are
those whose actual count is very different from that expected.
The χ 2 statistic tests the hypothesis that A and B are independent, that is, there is no
correlation between them. The test is based on a significance level, with (r − 1) × (c − 1) degrees
of freedom. We will illustrate the use of this statistic in an example below. If the hypothesis can
be rejected, then we say that A and B are statistically correlated.
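A hedged sketch of the χ 2 computation from a contingency table using NumPy and SciPy; the observed counts below are illustrative only.

import numpy as np
from scipy.stats import chi2

observed = np.array([       # rows = values of B, columns = values of A
    [250, 200],
    [50, 1000],
])

n = observed.sum()
# Expected frequencies eij = count(A = ai) * count(B = bj) / n
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
chi_square = ((observed - expected) ** 2 / expected).sum()

dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi_square, dof)  # small p-value -> reject independence
print(chi_square, p_value)
# scipy.stats.chi2_contingency(observed, correction=False) computes the same statistic.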

Tuple Duplication
In addition to detecting redundancies between attributes, duplication should also be detected at the
tuple level (e.g., where there are two or more identical tuples for a given unique data entry case).
The use of denormalized tables (often done to improve performance by avoiding joins) is another
source of data redundancy. Inconsistencies often arise between various duplicates, due to
inaccurate data entry or updating some but not all of the occurrences of the data. For example, if a
purchase order database contains attributes for the purchaser’s name and address instead of a key
to this information in a purchaser database, discrepancies can occur, such as the same purchaser’s
name appearing with different addresses within the purchase order database.
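A small pandas sketch of tuple-level duplicate detection; the purchase-order columns and values are hypothetical.

import pandas as pd

orders = pd.DataFrame({
    "purchaser_name": ["A. Shah", "A. Shah", "B. Kulkarni"],
    "address": ["12 MG Road", "12 MG Rd", "5 FC Road"],
    "amount": [120.0, 120.0, 300.0],
})

# Exact duplicates (every column identical) - none here, because the same
# address was typed differently in the two rows for A. Shah.
print(orders[orders.duplicated(keep=False)])

# Near-duplicates on a chosen key, where inconsistent addresses may hide
# the same purchaser.
print(orders[orders.duplicated(subset=["purchaser_name"], keep=False)])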

Detection and Resolution of Data Value Conflicts


Data integration also involves the detection and resolution of data value conflicts. For example,
for the same real-world entity, attribute values from different sources may differ. This may be due
to differences in representation, scaling, or encoding. For instance, a weight attribute may be stored
in metric units in one system and British imperial units in another. For a hotel chain, the price of
rooms in different cities may involve not only different currencies but also different services (such
as free breakfast) and taxes. When exchanging information between schools, each school may have
its own curriculum and grading scheme. One university may adopt a quarter system, offer three
courses on database systems, and assign grades from A+ to F, whereas another may adopt a
semester system, offer two courses on databases, and assign grades from 1 to 10. It is difficult to
work out precise course-to-grade transformation rules between the two universities, making
information exchange difficult.
Attributes may also differ on the level of abstraction, where an attribute in one system is recorded
at, say, a lower level of abstraction than the “same” attribute in another. For example, the total
sales in one database may refer to one branch of All Electronics, while an attribute of the same
name in another database may refer to the total sales for All Electronics stores in a given region.
The topic of discrepancy detection is discussed further as part of data cleaning, described earlier.
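A minimal sketch of resolving one kind of value conflict, converting weights recorded in pounds in one source to kilograms before merging; the source and column names are assumptions made for the example.

import pandas as pd

source_a = pd.DataFrame({"item": ["X1", "X2"], "weight_kg": [2.0, 5.5]})
source_b = pd.DataFrame({"item": ["X3"], "weight_lb": [11.0]})

LB_TO_KG = 0.45359237
# Convert the imperial-unit source to the metric unit used by the target.
source_b = source_b.assign(weight_kg=source_b["weight_lb"] * LB_TO_KG).drop(columns="weight_lb")

integrated = pd.concat([source_a, source_b], ignore_index=True)
print(integrated)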

Multidimensional Data Model and Data Cubes
Data warehouses and OLAP tools are based on a multidimensional data model. This model views
data in the form of a data cube. A data cube allows data to be modeled and viewed in multiple
dimensions. It is defined by dimensions and facts. In general terms, dimensions are the perspectives or
entities with respect to which an organization wants to keep records. For example, AllElectronics may
create a sales data warehouse in order to keep records of the store’s sales with respect to the dimensions
time, item, branch, and location. These dimensions allow the store to keep track of things like
monthly sales of items and the branches and locations at which the items were sold. Each
dimension may have a table associated with it, called a dimension table, which further describes
the dimension. For example, a dimension table for item may contain the attributes item name,
brand, and type. Dimension tables can be specified by users or experts, or automatically generated
and adjusted based on data distributions.
A multidimensional data model is typically organized around a central theme, such as sales. This
theme is represented by a fact table. Facts are numeric measures. Think of them as the quantities
by which we want to analyze relationships between dimensions. Examples of facts for a sales data
warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and
amount budgeted. The fact table contains the names of the facts, or measures, as well as keys to
each of the related dimension tables. You will soon get a clearer picture of how this works when
we look at multidimensional schemas.
What is OLAP?
Online Analytical Processing (OLAP) is a category of software that allows users to analyze
information from multiple database systems at the same time. It is a technology that enables
analysts to extract and view business data from different points of view.
Analysts frequently need to group, aggregate and join data. These OLAP operations in data mining
are resource intensive. With OLAP data can be pre-calculated and pre-aggregated, making analysis
faster.
OLAP databases are divided into one or more cubes. The cubes are designed in such a way that
creating and viewing reports becomes easy.
OLAP cube:
At the core of the OLAP concept, is an OLAP Cube. The OLAP cube is a data structure optimized
for very quick data analysis.

The OLAP Cube consists of numeric facts called measures which are categorized by dimensions.
OLAP Cube is also called the hypercube.
Usually, data operations and analysis are performed using a simple spreadsheet, where data
values are arranged in row and column format. This is ideal for two-dimensional data. However,
OLAP deals with multidimensional data, usually obtained from different and unrelated
sources. Using a spreadsheet is therefore not an optimal option. The cube can store and analyze
multidimensional data in a logical and orderly manner.
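As a rough illustration, the cube can be thought of as a pre-aggregated, multidimensional view of a fact table; the sketch below builds one in pandas from a small, made-up sales fact table with the dimensions quarter, item and location and the measure units_sold.

import pandas as pd

fact_sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "item":     ["TV", "Phone", "TV", "TV", "Phone", "Phone"],
    "location": ["Pune", "Pune", "Mumbai", "Mumbai", "Pune", "Mumbai"],
    "units_sold": [100, 250, 80, 120, 300, 220],
})

# Pre-aggregate the measure over every combination of the three dimensions.
cube = fact_sales.groupby(["quarter", "item", "location"])["units_sold"].sum()
print(cube)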
How does it work?
A data warehouse would extract information from multiple data sources and formats like text files,
Excel sheets, multimedia files, etc.
The extracted data is cleaned and transformed. Data is loaded into an OLAP server (or OLAP
cube), where information is pre-calculated for further analysis.
Basic analytical operations of OLAP

The five main analytical OLAP operations are:


Roll-up
Drill-down
Slice
Dice
Pivot (rotate)
1. Roll-up
Roll-up is also known as “consolidation” or “aggregation.” The roll-up operation can be
performed in two ways:
• Reducing dimensions
• Climbing up a concept hierarchy. A concept hierarchy is a system of grouping things based on
their order or level.
Consider the following diagram:
• In this example, the cities New Jersey and Los Angeles are rolled up into the country USA.
• The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively. They become
2000 after roll-up.
• In this aggregation process, data in the location hierarchy moves up from city to country.
• In the roll-up process at least one dimension needs to be removed. In this example, the Cities
dimension is removed.
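A short pandas sketch of the same roll-up, aggregating the city level up to the country level; the figures follow the example above.

import pandas as pd

sales = pd.DataFrame({
    "country": ["USA", "USA"],
    "city": ["New Jersey", "Los Angeles"],
    "sales": [440, 1560],
})

# Climbing up the location hierarchy removes the city dimension.
rolled_up = sales.groupby("country")["sales"].sum()
print(rolled_up)  # USA -> 2000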

2. Drill-down
In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up process. It can
be done via:
• Moving down the concept hierarchy
• Increasing a dimension

Consider the diagram above:
• Quarter Q1 is drilled down to the months January, February, and March. The corresponding
sales are also registered.
• In this example, the month dimension is added.
3. Slice
• Here, one dimension is selected, and a new sub-cube is created.
• The following diagram explains how the slice operation is performed:
• The Time dimension is sliced with Q1 as the filter.
• A new cube is created altogether.

4. Dice
This operation is similar to a slice. The difference is that in dice you select two or more
dimensions, which results in the creation of a sub-cube.
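A minimal pandas sketch of slice and dice; the small sales cube is rebuilt here with made-up figures so the snippet stands alone.

import pandas as pd

fact = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "item":     ["TV", "Phone", "TV", "Phone"],
    "location": ["Pune", "Mumbai", "Pune", "Mumbai"],
    "units_sold": [100, 250, 120, 300],
})
cube = fact.groupby(["quarter", "item", "location"])["units_sold"].sum()

# Slice: fix a single dimension (quarter = Q1) to obtain a sub-cube.
q1_slice = cube.xs("Q1", level="quarter")

# Dice: select on two or more dimensions at once.
mask = (
    (cube.index.get_level_values("item") == "TV")
    & cube.index.get_level_values("location").isin(["Pune", "Mumbai"])
)
diced = cube[mask]

print(q1_slice)
print(diced)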
5. Pivot
In pivot, you rotate the data axes to provide an alternative presentation of the data.
In the following example, the pivot is based on item types.
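Since the original diagram is not reproduced here, the following pandas sketch shows the same idea: the axes are rotated so that item types become columns. The figures are illustrative.

import pandas as pd

fact = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "item":    ["TV", "Phone", "TV", "Phone"],
    "units_sold": [100, 250, 120, 300],
})

# Rotate so that item types form the columns and quarters form the rows.
pivoted = fact.pivot_table(index="quarter", columns="item",
                           values="units_sold", aggfunc="sum")
print(pivoted)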
