BI Unit 4
Prof. S.A.Shivarkar
Assistant Professor
E-mail :
[email protected]
Contact No: 8275032712
Course Contents
• Data Analytics life cycle: Discovery, Data preparation
• Preprocessing requirements: data cleaning, data integration, data reduction, data transformation, data discretization and concept hierarchy generation
• Model planning, Model building, Communicating results and findings, Operationalizing
• Introduction to OLAP
• Real-world applications, types of outliers, outlier challenges, outlier detection methods, Proximity-based outlier analysis, Clustering-based outlier analysis
Data analytics
Data analytics is the science of analyzing raw data to draw conclusions from that information.
Data analytics helps a business optimize its performance, perform more efficiently, maximize profit, or make more strategically guided decisions.
How to Prepare Data for Business
Intelligence and Analytics
Preparing data for Business Intelligence (BI) can be a tedious and time-consuming process.
The goal is to turn the raw data into reports that are well suited for analysis.
Data Quality
Data has quality if it satisfies the requirements of its intended use. Many factors contribute to data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.
Inaccurate, incomplete, and inconsistent data are
commonplace properties of large real-world
databases and data warehouses.
Data Pre-Processing
There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data reduction can reduce the data size by, for instance, aggregating, eliminating redundant features, or clustering. Data transformations, such as normalization, may be applied, where data are scaled to fall within a smaller range such as 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.
These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as transforming all entries for a date field to a common format.
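For instance, here is a minimal sketch (with invented date strings, and pandas assumed to be available) of bringing a date field to one common format:

# Minimal sketch: normalizing mixed date formats in a field to a single common format.
# The example values are invented for illustration; pandas is assumed to be available.
import pandas as pd

raw_dates = ["05-01-2024", "2024/01/05", "Jan 5, 2024"]          # same date, three formats
cleaned = [pd.to_datetime(d, dayfirst=True).strftime("%Y-%m-%d") for d in raw_dates]
print(cleaned)   # ['2024-01-05', '2024-01-05', '2024-01-05']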
Major tasks in Data Pre-Processing
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
Data Cleaning
• Data cleaning routines work to “clean” the data by filling
in missing values, smoothing noisy data, identifying or
removing outliers, and resolving inconsistencies.
• If users believe the data are dirty, they are unlikely to
trust the results of any data mining that has been
applied to it.
• Furthermore, dirty data can cause confusion for the
mining procedure, resulting in unreliable output.
Data Integration
• Analysis often requires integrating multiple databases, data cubes, or files, that is, data integration.
• Yet some attributes representing a given concept may
have different names in different databases, causing
inconsistencies and redundancies.
• Having a large amount of redundant data may slow
down or confuse the knowledge discovery process.
Clearly, in addition to data cleaning, steps must be taken
to help avoid redundancies during data integration.
• Typically, data cleaning and data integration are
performed as a preprocessing step when preparing the
data for a data warehouse.
Major tasks in Data Pre-Processing
Data Reduction
• Data reduction obtains a reduced representation of the data
set that is much smaller in volume, yet produces the same
(or almost the same) analytical results.
• Data reduction strategies include dimensionality reduction
and numerosity reduction.
• In dimensionality reduction, data encoding schemes are
applied so as to obtain a reduced or “compressed”
representation of the original data.
• Examples include data compression techniques (such as wavelet
transforms and principal components analysis) as well as
attribute subset selection (e.g., removing irrelevant attributes),
and attribute construction (e.g., where a small set of more useful
attributes is derived from the original set).
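As a concrete illustration of dimensionality reduction, here is a minimal sketch using principal components analysis with scikit-learn (assumed available); the toy data and the choice of three components are illustrative, not from the notes.

# Minimal sketch: dimensionality reduction with PCA on toy numeric data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))            # 100 tuples described by 10 numeric attributes

pca = PCA(n_components=3)                 # keep a "compressed" 3-dimensional representation
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (100, 3)
print(pca.explained_variance_ratio_)      # fraction of variance retained per component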
• In numerosity reduction, the data are replaced by
alternative, smaller representations using parametric
models (such as regression or log-linear models) or
nonparametric models (such as with histograms, clusters,
sampling, or data aggregation).
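The following minimal sketch illustrates two of the nonparametric ideas mentioned above, sampling and histograms, using NumPy; the toy attribute and the bucket count are assumptions made for illustration.

# Minimal sketch: numerosity reduction by random sampling and by a histogram summary.
import numpy as np

rng = np.random.default_rng(1)
prices = rng.normal(loc=50, scale=10, size=10_000)       # toy attribute with 10,000 values

sample = rng.choice(prices, size=500, replace=False)     # keep a simple random sample
counts, bin_edges = np.histogram(prices, bins=20)        # or summarize with 20 bucket counts

print(sample.shape)     # (500,)
print(counts.sum())     # 10000 values represented by just 20 counts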
Data Normalization
• Suppose you have decided to use a distance-based mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or clustering.
• Such methods provide better results if the data to be
analyzed have been normalized, that is, scaled to a smaller
range such as [0.0, 1.0].
• Your customer data, for example, contain the attributes age and
annual salary. The annual salary attribute usually takes much
larger values than age. Therefore, if the attributes are left
unnormalized, the distance measurements taken on annual
salary will generally outweigh distance measurements taken on
age.
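A minimal sketch of min-max normalization for the age and annual salary example, assuming both attributes are scaled to [0.0, 1.0]; the sample values are invented for illustration.

# Minimal sketch: min-max normalization so both attributes lie in [0.0, 1.0].
import numpy as np

age = np.array([23, 35, 45, 52, 61], dtype=float)                         # toy values
salary = np.array([30_000, 48_000, 75_000, 90_000, 120_000], dtype=float)

def min_max(v):
    # Scale linearly so the minimum maps to 0.0 and the maximum maps to 1.0.
    return (v - v.min()) / (v.max() - v.min())

print(min_max(age))       # after scaling, age and salary contribute comparably
print(min_max(salary))    # to distance measurements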
Data Cleaning
Noise
1. Binning
Binning methods smooth a sorted data value by consulting
its “neighborhood,” that is, the values around it.
The sorted values are distributed into a number of
“buckets,” or bins.
Because binning methods consult the neighborhood of
values, they perform local smoothing. Figure illustrates
some binning techniques.
In this example, the data for price are first sorted and then
partitioned into equal-frequency bins of size 3 (i.e., each bin
contains three values). In smoothing by bin means, each
value in a bin is replaced by the mean value of the bin.
2. Regression
Data smoothing can also be done by conforming data values
to a function, a technique known as regression.
Linear regression involves finding the “best” line to fit two
attributes (or variables), so that one attribute can be used
to predict the other.
Multiple linear regression is an extension of linear
regression, where more than two attributes are involved
and the data are fit to a multidimensional surface.
3. Outlier Analysis
Outliers may be detected by clustering, for example, where
similar values are organized into groups, or “clusters.”
Intuitively, values that fall outside of the set of clusters may
be considered outliers.
Data Integration
• Tuple Duplication
In addition to detecting redundancies between attributes,
duplication should also be detected at the tuple level (e.g.,
where there are two or more identical tuples for a given
unique data entry case).
The use of denormalized tables (often done to improve
performance by avoiding joins) is another source of data
redundancy.
Inconsistencies often arise between various duplicates, due to
inaccurate data entry or updating some but not all of the
occurrences of the data.
Data Analytics
Data analytics is the science of analyzing raw data to draw conclusions from that information.
Data analytics helps a business optimize its performance, perform more efficiently, maximize profit, or make more strategically guided decisions.
Consider, for example, a report that gives an overview of daily active users (DAU) by country. Such a report helps product owners and business managers focus their efforts on regional trends.
In general, business analytics and reporting are enablers for business owners and leaders at all
levels to change and direct their product. However, getting these results isn’t necessarily a
straightforward endeavor:
The data may contain many anomalies and duplicates, requiring redundancy removal, normalization across the different data sources, and handling of varying granularity.
It’s also a challenge to get the data ready for everyone - not just the business owner and developers,
but also for the CMO and key decision-makers.
Plus, infrastructure and tool challenges might actually slow down access to your data, and limit
your ability to offload mundane tasks that take time and hamper focus.
So, we’ll drill into the process step by step to show you what you need to do to get a report that’s
right for as many people as possible in the business - not just the business owner.
1. Load the raw data
To create the report, we uploaded 2 additional CSV files in a similar manner. Note that you can upload data from many disparate data sources using AWS Redshift.
2. Transform data to be BI ready
The best way to start this step is by investigation using manual queries on the loaded raw data.
You can then evaluate the quality of the data and decide which tables are not relevant or need to
be changed. Then, plan and decide on the right transformations accordingly.
If we continue with our example, here’s how we use Panoply.io to calculate DAU, DAU by country
and DAU by country and device type (based on the operating system). Note that although each of
these queries is correct for its own calculation, the data may need to be normalized to allow
provision of the right results in other queries.
(Example query results: users counted by time, and users counted by country.)
The best way to transform the data to be BI ready is to work with multiple data sources and structures. As a rule of thumb, it is rarely just one table: you continuously need to find good combinations of information involving several tables.
So, taking our example to the next step: In our case we want one transformation that will allow us
the flexibility to answer many questions (DAU, DAU by country and DAU by country and OS).
However, as we saw in the comparison table above, we can’t calculate the DAU using the more
granular transformation (DAU by country and OS), since we won’t be able to answer all three
questions with that transformation only. In order to solve this issue, we will need to use a higher
resolution transformation that’s grouped by day, userid, country and OS:
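The notes perform these transformations with SQL in Panoply.io; purely as an illustration, here is a rough pandas sketch of the same idea with assumed column names (day, userid, country, os): keep one base table at day × userid × country × OS granularity and answer all three DAU questions from it.

# Illustrative sketch only: a base table grouped by day/userid/country/os can answer
# DAU, DAU by country, and DAU by country and OS. Column names are assumptions.
import pandas as pd

events = pd.DataFrame({
    "day":     ["2024-01-01"] * 4 + ["2024-01-02"] * 3,
    "userid":  [1, 1, 2, 3, 1, 2, 2],
    "country": ["IN", "IN", "US", "US", "IN", "US", "US"],
    "os":      ["android", "ios", "ios", "android", "android", "ios", "android"],
})

# The "higher resolution" transformation: one row per day/userid/country/os combination.
base = events.drop_duplicates(["day", "userid", "country", "os"])

dau            = base.groupby("day")["userid"].nunique()
dau_by_country = base.groupby(["day", "country"])["userid"].nunique()
dau_by_ctry_os = base.groupby(["day", "country", "os"])["userid"].nunique()

print(dau)    # distinct users per day; the other two views roll up from the same base table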
3. Test the transformation with manual queries
As shown above, try getting the same result using different manual queries. In this step, you can
also pull the results data into a spreadsheet (a sample of the data should be enough), or even
manually count the result and compare it to the result obtained from the transformation.
4. Build the reports
Create end user reports and charts with the right granularity and resolution, like DAU per device,
country, etc.
Here are the results from our example: charts of DAU over time, DAU by device, and DAU by country.
Data Quality
Data has quality if it satisfies the requirements of its intended use. Many factors contribute to data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.
Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world
databases and data warehouses.
In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing
techniques can improve the quality of the data, thereby helping to improve the accuracy and
efficiency of the subsequent mining process. Data preprocessing is an important step in the
knowledge discovery process, because quality decisions must be based on quality data. Detecting
data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs
for decision making.
Data cleaning techniques
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing)
routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
1. Missing Values
2. Noisy Data
Noise
Noise is a random error or variance in a measured variable.
The following are data smoothing techniques:
1. Binning:
Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values
around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing. Figure illustrates some
binning techniques. In this example, the data for price are first sorted and then partitioned into
equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin means,
each value in a bin is replaced by the mean value of the bin.
For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in
this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which
each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced
by the closest boundary value. In general, the larger the width, the greater the effect of the
smoothing. Alternatively, bins may be equal-width, where the interval range of values in each bin
is constant. Binning is also used as a discretization technique.
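A minimal sketch of smoothing by bin means and by bin boundaries with equal-frequency bins of size 3; Bin 1 (4, 8, 15) matches the example above, while the remaining price values are only illustrative.

# Minimal sketch: equal-frequency binning (bins of size 3), then smoothing.
import numpy as np

price = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)   # sorted values
bins = price.reshape(-1, 3)                                          # three bins of three values

by_means = np.repeat(bins.mean(axis=1), 3)                           # each value -> its bin mean
lo = bins.min(axis=1, keepdims=True)                                 # bin boundaries
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()         # each value -> closer boundary

print(by_means)    # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_bounds)   # [ 4.  4. 15. 21. 21. 24. 25. 25. 34.]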
2. Regression
Data smoothing can also be done by conforming data values to a function, a technique known as
regression. Linear regression involves finding the “best” line to fit two attributes (or variables), so
that one attribute can be used to predict the other. Multiple linear regression is an extension of
linear regression, where more than two attributes are involved and the data are fit to a
multidimensional surface.
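A minimal sketch of regression-based smoothing, fitting the "best" straight line with NumPy's polyfit; the age and income values are synthetic stand-ins for two generic attributes.

# Minimal sketch: smooth one attribute by fitting a line against another.
import numpy as np

rng = np.random.default_rng(2)
age = np.linspace(20, 60, 50)
income = 1000 * age + rng.normal(scale=8000, size=age.size)     # noisy toy relationship

slope, intercept = np.polyfit(age, income, deg=1)               # best-fit straight line
smoothed_income = slope * age + intercept                       # values conform to the line

print(round(slope), round(intercept))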
3. Outlier analysis
Outliers may be detected by clustering, for example, where similar values are organized into
groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered
outliers. Figure illustrates 2-D plot of customer data with respect to customer locations in a city,
showing three data clusters. Each cluster centroid is marked with a “+”, representing the average
point in space for that cluster. Outliers may be detected as values that fall outside of the sets of
clusters.
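As a hedged sketch of the clustering-based idea, the code below uses k-means from scikit-learn and flags points that lie unusually far from their cluster centroid as candidate outliers; the data, the number of clusters, and the distance cutoff are all assumptions.

# Minimal sketch: clustering-based outlier detection via distance to the assigned centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Three compact clusters of 2-D "customer locations" plus a couple of stray points.
clusters = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
strays = np.array([[9.0, 0.0], [7.0, 9.0]])
X = np.vstack([clusters, strays])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)   # distance to own centroid

threshold = np.percentile(dist, 98)          # simple cutoff, chosen only for illustration
print(X[dist > threshold])                   # candidate outliers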
Data Integration
Data mining often requires data integration—the merging of data from multiple data stores.
Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting
data set. This can help improve the accuracy and speed of the subsequent mining process. The
semantic heterogeneity and structure of data pose great challenges in data integration. How can
we match schema and objects from different sources? This is the essence of the entity identification
problem.
Redundancy is another important issue in data integration. Some redundancies between attributes can be detected by correlation analysis. For nominal data, the χ² (chi-square) test can be used. Suppose attribute A has r distinct values a_1, a_2, ..., a_r and attribute B has c distinct values b_1, b_2, ..., b_c. The χ² value is computed as

χ² = Σ_{i=1..r} Σ_{j=1..c} (o_ij − e_ij)² / e_ij        (1)

where o_ij is the observed frequency (actual count) of the joint event (A = a_i, B = b_j) and e_ij is the expected frequency, given by

e_ij = count(A = a_i) × count(B = b_j) / n

where n is the number of data tuples, count(A = a_i) is the number of tuples having value a_i for A, and count(B = b_j) is the number of tuples having value b_j for B. The sum in Equation (1) is computed over all of the r × c cells. Note that the cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
The χ² statistic tests the hypothesis that A and B are independent, that is, that there is no correlation between them. The test is based on a significance level, with (r − 1) × (c − 1) degrees of freedom. If the hypothesis can be rejected, then we say that A and B are statistically correlated.
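As a quick, hedged illustration of the χ² test, the sketch below uses SciPy's chi2_contingency on a small contingency table of toy counts; a large χ² with a small p-value lets us reject the independence hypothesis.

# Minimal sketch: chi-square test of independence for two nominal attributes (SciPy assumed).
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts o_ij for a toy 2x2 table (e.g., gender vs. preferred reading material).
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(chi2, dof, p_value)   # a large chi2 / tiny p-value suggests A and B are correlated
print(expected)             # expected counts e_ij under the independence hypothesis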
Tuple Duplication
In addition to detecting redundancies between attributes, duplication should also be detected at the
tuple level (e.g., where there are two or more identical tuples for a given unique data entry case).
The use of denormalized tables (often done to improve performance by avoiding joins) is another
source of data redundancy. Inconsistencies often arise between various duplicates, due to
inaccurate data entry or updating some but not all of the occurrences of the data. For example, if a
purchase order database contains attributes for the purchaser’s name and address instead of a key
to this information in a purchaser database, discrepancies can occur, such as the same purchaser’s
name appearing with different addresses within the purchase order database.
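A minimal pandas sketch of spotting duplicate tuples and the kind of inconsistency described above (the same purchaser listed with different addresses); the table and column names are assumed.

# Minimal sketch: tuple-level duplicate detection and a simple consistency check.
import pandas as pd

orders = pd.DataFrame({
    "purchaser": ["Acme", "Acme", "Globex", "Acme"],
    "address":   ["12 Main St", "12 Main St", "9 Elm Rd", "99 Side Ave"],
    "amount":    [100, 100, 250, 75],
})

exact_duplicates = orders[orders.duplicated()]     # identical tuples repeated in the table
deduplicated = orders.drop_duplicates()

# The same purchaser appearing with more than one address hints at inconsistent duplicates.
addresses_per_purchaser = deduplicated.groupby("purchaser")["address"].nunique()
print(exact_duplicates)
print(addresses_per_purchaser[addresses_per_purchaser > 1])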
Multidimensional Data Model and Data Cubes
Data warehouses and OLAP tools are based on a multidimensional data model. This model views
data in the form of a data cube. A data cube allows data to be modeled and viewed in multiple
dimensions. It is defined by dimensions and facts. In general terms, dimensions are the perspectives or
entities with respect to which an organization wants to keep records. For example, AllElectronics may
create a sales data warehouse in order to keep records of the store’s sales with respect to the dimensions
time, item, branch, and location. These dimensions allow the store to keep track of things like
monthly sales of items and the branches and locations at which the items were sold. Each
dimension may have a table associated with it, called a dimension table, which further describes
the dimension. For example, a dimension table for item may contain the attributes item name,
brand, and type. Dimension tables can be specified by users or experts, or automatically generated
and adjusted based on data distributions.
A multidimensional data model is typically organized around a central theme, such as sales. This
theme is represented by a fact table. Facts are numeric measures. Think of them as the quantities
by which we want to analyze relationships between dimensions. Examples of facts for a sales data
warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and
amount budgeted. The fact table contains the names of the facts, or measures, as well as keys to
each of the related dimension tables. You will soon get a clearer picture of how this works when
we look at multidimensional schemas.
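To make the fact table / dimension table idea concrete, here is a small sketch with a toy fact table keyed to an item dimension table, loosely following the AllElectronics example; the exact schema and values are assumptions.

# Minimal sketch: a star-schema style join between a fact table and a dimension table.
import pandas as pd

dim_item = pd.DataFrame({
    "item_key":  [1, 2],
    "item_name": ["TV", "Phone"],
    "brand":     ["BrandA", "BrandB"],
})

fact_sales = pd.DataFrame({
    "time_key":     ["2024-Q1", "2024-Q1", "2024-Q2"],
    "item_key":     [1, 2, 1],
    "dollars_sold": [12000, 8000, 15000],    # numeric measures (facts)
    "units_sold":   [10, 20, 12],
})

# Keys in the fact table point at the dimension tables that describe each dimension.
report = (fact_sales.merge(dim_item, on="item_key")
                    .groupby(["time_key", "item_name"])["dollars_sold"].sum())
print(report)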
What is OLAP?
Online Analytical Processing (OLAP) is a category of software that allows users to analyze
information from multiple database systems at the same time. It is a technology that enables
analysts to extract and view business data from different points of view.
Analysts frequently need to group, aggregate and join data. These OLAP operations in data mining
are resource intensive. With OLAP data can be pre-calculated and pre-aggregated, making analysis
faster.
OLAP databases are divided into one or more cubes. The cubes are designed so that creating and viewing reports becomes easy.
OLAP cube:
At the core of the OLAP concept, is an OLAP Cube. The OLAP cube is a data structure optimized
for very quick data analysis.
The OLAP Cube consists of numeric facts called measures which are categorized by dimensions.
OLAP Cube is also called the hypercube.
Usually, data operations and analysis are performed using a simple spreadsheet, where data values are arranged in row and column format. This is ideal for two-dimensional data. However, OLAP deals with multidimensional data, usually obtained from different and unrelated sources, so a spreadsheet is not an optimal option. The cube can store and analyze multidimensional data in a logical and orderly manner.
How does it work?
A data warehouse extracts information from multiple data sources and formats, such as text files, Excel sheets, and multimedia files.
The extracted data is cleaned and transformed. Data is loaded into an OLAP server (or OLAP
cube) where information is pre-calculated in advance for further analysis.
Basic analytical operations of OLAP
1. Roll-up
Roll-up (also known as consolidation or aggregation) summarizes the data, for example by climbing up a concept hierarchy for a dimension or by reducing the number of dimensions.
2. Drill-down
In drill-down, data is fragmented into smaller parts. It is the opposite of roll-up. It can be done via:
Moving down the concept hierarchy
Increasing (adding) a dimension
3. Slice
In slice, one particular dimension is selected, creating a sub-cube from the original cube.
4. Dice
This operation is similar to slice. The difference is that in dice you select two or more dimensions, which results in the creation of a sub-cube.
5. Pivot
In pivot, you rotate the data axes to provide an alternative presentation of the data. For example, a pivot can re-orient the view so that it is based on item types.
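To tie the operations together, here is a hedged pandas sketch that imitates roll-up, drill-down, slice, dice, and pivot on a toy sales cube; the dimension names and values are invented for illustration.

# Minimal sketch: OLAP-style operations on a small sales table using pandas.
import pandas as pd

sales = pd.DataFrame({
    "quarter":      ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "location":     ["Pune", "Pune", "Mumbai", "Pune", "Mumbai", "Mumbai"],
    "item":         ["TV", "Phone", "TV", "Phone", "TV", "Phone"],
    "dollars_sold": [100, 80, 120, 90, 150, 60],
})

rollup    = sales.groupby("quarter")["dollars_sold"].sum()                        # roll-up
drilldown = sales.groupby(["quarter", "location", "item"])["dollars_sold"].sum()  # drill-down
slice_q1  = sales[sales["quarter"] == "Q1"]                                       # slice: fix one dimension
dice      = sales[(sales["quarter"] == "Q1") & (sales["item"] == "TV")]           # dice: fix two dimensions

# Pivot: rotate the axes so the presentation is based on item types.
pivot = sales.pivot_table(index="location", columns="item",
                          values="dollars_sold", aggfunc="sum")
print(pivot)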