Unit I
DATA WAREHOUSE
Introduction:
Data warehouses generalize and consolidate data in multidimensional space. The construction of data
warehouses involves data cleaning, data integration and data transformation and can be viewed as an
important preprocessing step for data mining. Moreover, data warehouses provide on-line analytical
processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which
facilitates effective data generalization and data mining. Many other data mining functions, such as
association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance
interactive mining of knowledge at multiple levels of abstraction.
Hence, the data warehouse has become an increasingly important platform for data analysis and on-line
analytical processing and will provide an effective platform for data mining. Therefore, data warehousing
and OLAP form an essential step in the knowledge discovery process. Data warehousing provides
architectures and tools for business executives to systematically organize, understand, and use their data to
make strategic decisions. Data warehouse systems are valuable tools in today’s competitive, fast-evolving
world. In the last several years, many firms have spent millions of dollars in building enterprise-wide data
warehouses. Many people feel that with competition mounting in every industry, data warehousing is the
latest must-have marketing weapon—a way to retain customers by learning more about their needs. “Then,
what exactly is a data warehouse?” Data warehouses have been defined in many ways, making it difficult
to formulate a rigorous definition. Loosely speaking, a data warehouse refers to a database that is
maintained separately from an organization’s operational databases. Data warehouse systems allow for the
integration of a variety of application systems. They support information processing by providing a solid
platform of consolidated historical data for analysis. According to William H. Inmon, a leading architect in
the construction of data warehouse systems, “A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of management’s decision making process”. This
short, but comprehensive definition presents the major features of a data warehouse. The four keywords,
subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from other data
repository systems, such as relational database systems, transaction processing systems, and file systems.
Let’s take a closer look at each of these key features.
Subject-oriented:
A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Rather
than concentrating on the day-to-day operations and transaction processing of an organization, a data
warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses
typically provide a simple and concise view around particular subject issues by excluding data that are not
useful in the decision support process.
Integrated:
A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational
databases, flat files, and on-line transaction records. Data cleaning and data integration techniques are
applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
Time-variant:
Data are stored to provide information from a historical perspective (e.g., the past 5–10 years). Every key
structure in the data warehouse contains, either implicitly or explicitly, an element of time.
Nonvolatile:
A data warehouse is always a physically separate store of data transformed from the application data found
in the operational environment. Due to this separation, a data warehouse does not require transaction
processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data
accessing: initial loading of data and access of data. In sum, a data warehouse is a semantically consistent
data store that serves as a physical implementation of a decision support data model and stores the
information on which an enterprise needs to make strategic decisions. A data warehouse is also often viewed
as architecture, constructed by integrating data from multiple heterogeneous sources to support structured
and/or ad hoc queries, analytical reporting, and decision making.
DATA MINING
Introduction:
Data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a
misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than
rock or sand mining. Thus, data mining should have been more appropriately named “knowledge mining
from data,” which is unfortunately somewhat long. “Knowledge mining,” a shorter term, may not reflect the
emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the
process that finds a small set of precious nuggets from a great deal of raw material. Thus, such a misnomer
that carries both “data” and “mining” became a popular choice. Many other terms carry a similar or slightly
different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern
analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from
Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge
discovery. Knowledge discovery consists of an iterative sequence of steps (listed under “Steps in KDD” later in this unit).
Steps 1 to 4 are different forms of data preprocessing, where the data are prepared for mining. The data
mining step may interact with the user or a knowledge base. The interesting patterns are presented to the
user and may be stored as new knowledge in the knowledge base. Note that according to this view, data
mining is only one step in the entire process, albeit an essential one because it uncovers hidden patterns for
evaluation.
We agree that data mining is a step in the knowledge discovery process. However, in industry, in media,
and in the database research milieu, the term data mining is becoming more popular than the longer term
of knowledge discovery from data. Therefore, in this book, we choose to use the term data mining. We
adopt a broad view of data mining functionality: data mining is the process of discovering interesting
knowledge from large amounts of data stored in databases, data warehouses, or other information
repositories
➢ PATTERN EVALUATION:
The pattern evaluation module measures the interestingness of discovered patterns, typically by
applying a threshold value.
➢ KNOWLEDGE BASE:
The knowledge base supports the whole data mining process. It may be used to guide the search or
to evaluate the interestingness of the resulting patterns. The knowledge base may also contain user beliefs
and data from user experience that can be useful in the data mining process. The data mining engine may
take inputs from the knowledge base to make the results more accurate and reliable, and the pattern
evaluation module interacts with the knowledge base on a regular basis to get inputs and to update it.
➢ DEFINITION OF OLAP:
OLAP is an Online Analytical Processing system. An OLAP database stores historical data that has
been loaded from OLTP systems. It allows a user to view different summaries of multidimensional data. Using
OLAP, you can extract information from a large database and analyse it for decision making.
OLAP also allows a user to execute complex queries to extract multidimensional data. Because the user
uses an OLAP system only to retrieve data from a large database for analysis, a query that fails midway does
not harm data integrity; the user can simply fire the query again and extract the data for analysis.
Transactions in OLAP are long and hence take comparatively more time for processing and require
large space. Transactions in OLAP are less frequent than in OLTP. The tables in an OLAP
database may not be normalized. Examples of OLAP use are viewing a financial report, budgeting,
marketing management, sales reporting, etc.
Example:
Any type of Data warehouse system is an OLAP system. Uses of OLAP are as follows:
• Spotify analyses the songs played by users to come up with a personalized homepage of their songs and
playlists.
• Netflix movie recommendation system.
• OLAP creates a single platform for all types of business analytical needs, including planning,
budgeting, forecasting, and analysis.
• The main benefit of OLAP is the consistency of information and calculations.
• Easily apply security restrictions on users and objects to comply with regulations and protect
sensitive data.
DRAWBACKS OF OLAP:
• Implementation and maintenance depend on IT professionals because traditional OLAP tools
require a complicated modeling procedure.
• OLAP tools need cooperation between people of various departments to be effective, which might
not always be possible.
DEFINITION OF OLTP:
OLTP is an Online Transaction Processing system. The main focus of an OLTP system is to record
current updates, insertions, and deletions during transactions. OLTP queries
are simpler and shorter and hence require less processing time and less space.
An OLTP database is updated frequently. It may happen that a transaction in OLTP fails midway, which
may affect data integrity, so the system has to take special care of data integrity. An OLTP database has
normalized tables (3NF).
The best example of an OLTP system is an ATM, in which short transactions modify the status of
an account. OLTP systems become the source of data for OLAP.
Example: -
Uses of OLTP are as follows:
• ATM center is an OLTP application.
• OLTP handles the ACID properties during data transaction via the application.
• It is also used for online banking, online airline ticket booking, sending a text message, and adding a book
to a shopping cart.
DRAWBACKS OF OLTP:
• If an OLTP system faces a hardware failure, online transactions are severely affected.
• OLTP systems allow multiple users to access and change the same data at the same time, which
can sometimes lead to conflicts or inconsistencies.
▪ OLAP is a category of software that allows users to analyse information from multiple database systems
at the same time.
▪ It is a technology that enables analysts to extract and view business data from different points of view.
▪ OLAP databases are divided into one or more cubes. The cubes are designed in such a way that creating
and viewing reports become easy.
OLAP OPERATIONS:
The basic analytical operations of OLAP are:
1. ROLL-UP
2. DRILL-DOWN
3. SLICE
4. DICE
5. PIVOT (ROTATE)
✓ ROLL-UP: Roll-up is also known as “consolidation” or “aggregation”. The roll-up operation can
be performed in two ways:
• Reducing dimensions.
• Climbing up a concept hierarchy, i.e., a system of grouping items based on their order or level.
✓ DRILL-DOWN: In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up
process.
• It can be done via:
i. Moving down the concept hierarchy.
ii. Increasing a dimension.
✓ SLICE: In slice, we select a single value for one dimension, which results in a new sub-cube with one
fewer dimension.
✓ DICE: This operation is similar to a slice. The difference is that in dice we select two or more
dimensions, which results in the creation of a sub-cube.
✓ PIVOT(ROTATE): In pivot, we rotate the data axes to provide a substitute presentation of data.
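The following minimal Python sketch (using pandas on a small made-up sales table; the column names and figures are assumptions, not taken from the text) shows how roll-up, drill-down, slice, dice, and pivot can be expressed on tabular data:

# A minimal sketch of OLAP-style operations using pandas; the sales table,
# its column names, and its figures are made up for illustration.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "item":    ["TV", "TV", "Phone", "Phone", "TV", "TV", "Phone", "Phone"],
    "sales":   [100, 120, 80, 90, 110, 130, 85, 95],
})

# Roll-up: climb the time hierarchy (quarter -> year) by aggregating.
rollup = sales.groupby(["year", "city"])["sales"].sum()

# Drill-down: move back to a finer granularity (year -> quarter).
drilldown = sales.groupby(["year", "quarter", "city"])["sales"].sum()

# Slice: fix a single dimension (year = 2023) to obtain a sub-cube.
slice_2023 = sales[sales["year"] == 2023]

# Dice: fix two or more dimensions to obtain a smaller sub-cube.
dice = sales[(sales["year"] == 2023) & (sales["city"] == "Delhi")]

# Pivot (rotate): swap the axes of the presentation.
pivot = sales.pivot_table(values="sales", index="city", columns="year", aggfunc="sum")

print(rollup, drilldown, slice_2023, dice, pivot, sep="\n\n")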
DATA CUBE:
A data cube refers to a three-dimensional (3D) (or higher) range of values that are generally used to explain
the time sequence of an image's data. It is a data abstraction to evaluate aggregated data from a variety of
viewpoints. It is also useful for imaging spectroscopy as a spectrally-resolved image is depicted as a 3-D
volume.
A data cube can also be described as the multidimensional extensions of two-dimensional tables. It can be
viewed as a collection of identical 2-D tables stacked upon one another. Data cubes are used to represent
data that is too complex to be described by a table of columns and rows. As such, data cubes can go far
beyond 3-D to include many more dimensions.
A data cube is generally used to easily interpret data. It is especially useful when representing data together
with dimensions as certain measures of business requirements. Every dimension of a cube represents a certain
characteristic of the data, for example, daily, monthly, or yearly sales. The data included inside a data
cube makes it possible to analyze almost all the figures for virtually any or all customers, sales agents,
products, and much more. Thus, a data cube can help to establish trends and analyze performance.
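As a rough sketch of this abstraction (the dimensions and numbers below are made up), a data cube can be represented in Python as a multidimensional array whose axes are the dimensions and whose cells hold a measure such as sales; summing along an axis corresponds to summarizing over that dimension:

# A data cube as a 3-D NumPy array; dimensions and values are made up.
import numpy as np

# Dimensions: time (4 quarters) x city (2) x item (3); each cell holds sales.
rng = np.random.default_rng(0)
cube = rng.integers(50, 150, size=(4, 2, 3))

per_quarter_city = cube.sum(axis=2)    # summarize over the item dimension
per_city = cube.sum(axis=(0, 2))       # total sales per city (over time and item)

print(cube.shape, per_quarter_city.shape, per_city.shape)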
Extracting or mining knowledge from large amounts of data is called data mining.
Data mining is the process of discovering patterns in large data sets involving methods at the
intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary
subfield of computer science and statistics with an overall goal to extract information (with intelligent
methods) from a data set and transform the information into a comprehensible structure for further use.
Data mining is the analysis step of the "knowledge discovery in databases" process or KDD.
Steps in KDD:
1. Data Cleaning − In this step, noise and inconsistent data are removed.
2. Data Integration − In this step, multiple data sources are combined.
3. Data Selection − In this step, data relevant to the analysis task are retrieved from the database.
4. Data Transformation − In this step, data is transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations.
5. Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
6. Pattern Evaluation − In this step, data patterns are evaluated.
7. Knowledge Presentation − In this step, knowledge is represented.
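As a rough, simplified sketch of how these steps might be strung together in code (the file names, column names, and parameters below are illustrative assumptions, not part of the text):

# A hypothetical, simplified KDD pipeline; file and column names are assumptions.
import pandas as pd
from sklearn.cluster import KMeans

# 1-2. Data cleaning and integration: merge two sources, drop duplicates,
#      and fill missing numeric values with column means.
customers = pd.read_csv("customers.csv")        # hypothetical source 1
transactions = pd.read_csv("transactions.csv")  # hypothetical source 2
data = customers.merge(transactions, on="customer_id").drop_duplicates()
data = data.fillna(data.mean(numeric_only=True))

# 3. Data selection: keep only attributes relevant to the analysis task.
selected = data[["age", "income", "annual_spend"]]

# 4. Data transformation: min-max normalization to the range [0, 1].
transformed = (selected - selected.min()) / (selected.max() - selected.min())

# 5. Data mining: apply an intelligent method, here k-means clustering.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(transformed)

# 6-7. Pattern evaluation and knowledge presentation: summarize each cluster.
summary = selected.assign(cluster=model.labels_).groupby("cluster").mean()
print(summary)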
MAJOR ISSUES IN DATA MINING:
Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available in one place; it needs to be integrated from various heterogeneous data sources. These factors
also create some issues. The major issues are discussed below.
PERFORMANCE ISSUES:
• Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be efficient and
scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data
into partitions, which are processed in parallel, and the results from the partitions
are then merged. Incremental algorithms incorporate database updates without mining the data again from
scratch.
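As a toy illustration of the partition-and-merge idea (not a specific algorithm from the text; the transaction data are made up), the sketch below counts item frequencies over data partitions in parallel and then merges the partial results:

# Partition-and-merge sketch: count item frequencies over partitions in
# parallel, then merge the partial counts. The transaction data are made up.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_items(partition):
    # "Mine" one partition: here, simply count item occurrences.
    return Counter(partition)

def parallel_count(items, n_partitions=4):
    size = max(1, len(items) // n_partitions)
    partitions = [items[i:i + size] for i in range(0, len(items), size)]
    total = Counter()
    with ProcessPoolExecutor() as pool:
        for partial in pool.map(count_items, partitions):
            total.update(partial)          # merge results from each partition
    return total

if __name__ == "__main__":
    data = ["milk", "bread", "milk", "beer", "bread", "milk", "eggs"] * 1000
    print(parallel_count(data).most_common(3))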
DATA CLEANING:
Introduction:
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines
attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies
in the data. In this section, you will study basic methods for data cleaning.
Missing Values:
Imagine that you need to analyze All Electronics sales and customer data. You note that many tuples have
no recorded value for several attributes, such as customer income. How can you go about filling in the
missing values for this attribute? Let’s look at the following methods:
Use the attribute mean for all samples belonging to the same class as the given tuple:
For example, if classifying customers according to credit risk, replace the missing value with the average
income value for customers in the same credit risk category as that of the given tuple.
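A minimal pandas sketch of this method (the column names and figures are made up) replaces each missing income with the mean income of customers in the same credit-risk class:

# Fill missing income values with the mean income of the same credit-risk class.
# Column names and figures are illustrative assumptions.
import pandas as pd

customers = pd.DataFrame({
    "credit_risk": ["low", "low", "low", "high", "high", "high"],
    "income":      [52000, 48000, None, 23000, None, 27000],
})

customers["income"] = customers.groupby("credit_risk")["income"].transform(
    lambda col: col.fillna(col.mean())
)
print(customers)   # the two missing incomes become 50000 and 25000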
NOISY DATA:
Introduction: Noise is a random error or variance in a measured variable. Given a numerical attribute
such as, say, price, how can we “smooth” out the data to remove the noise?
Let’s look at the following data smoothing techniques, using this sample data:
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins: Bin 1: 4, 8, 15; Bin 2: 21, 21, 24; Bin 3: 25, 28, 34
Smoothing by bin means: Bin 1: 9, 9, 9; Bin 2: 22, 22, 22; Bin 3: 29, 29, 29
Smoothing by bin boundaries: Bin 1: 4, 4, 15; Bin 2: 21, 21, 24; Bin 3: 25, 25, 34
Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing.
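A small Python sketch of equal-frequency binning with smoothing by bin means, using the sorted price data above:

# Equal-frequency binning and smoothing by bin means on the price data above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
n_bins = 3
bin_size = len(prices) // n_bins

bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]
smoothed = []
for b in bins:
    mean = round(sum(b) / len(b))       # each value is replaced by its bin mean
    smoothed.extend([mean] * len(b))

print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [9, 9, 9, 22, 22, 22, 29, 29, 29]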
Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be
used to predict the other. Multiple linear regression is an extension of linear regression, where more
than two attributes are involved and the data are fit to a multidimensional surface.
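A brief scikit-learn sketch (with made-up values) of smoothing one attribute by fitting a linear regression against another:

# Smooth a noisy attribute by fitting a linear regression on another attribute.
# The values are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

area = np.array([50, 60, 70, 80, 90, 100]).reshape(-1, 1)   # predictor attribute
price = np.array([110, 118, 135, 160, 151, 180])            # noisy attribute

model = LinearRegression().fit(area, price)
smoothed_price = model.predict(area)    # values on the fitted "best" line
print(np.round(smoothed_price, 1))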
Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or
“clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers.
Many methods for data smoothing are also methods for data reduction involving discretization. For
example, the binning techniques described above reduce the number of distinct values per attribute. This
acts as a form of data reduction for logic-based data mining methods, such as decision tree induction,
which repeatedly make value comparisons on sorted data. Concept hierarchies are a form of data
discretization that can also be used for data smoothing. A concept hierarchy for price, for example, may
map real price values into inexpensive, moderately priced, and expensive, thereby reducing the number of
data values to be handled by the mining process. Some methods of classification, such as neural networks,
have built-in data smoothing mechanisms.
DATA INTEGRATION:
It is likely that your data analysis task will involve data integration, which combines data from multiple
sources into a coherent data store, as in data warehousing. These sources may include multiple databases,
data cubes, or flat files.
Issues in Data Integration:
There are a number of issues to consider during data integration: schema integration, redundancy, and
detection and resolution of data value conflicts. These are briefly explained below.
1. Schema Integration:
• Integrate metadata from different sources.
• Real-world entities from multiple data sources must be matched; this is referred to as the entity
identification problem.
For example, how can the data analyst or the computer be sure that customer id in one database and
customer number in another refer to the same attribute?
2. Redundancy:
• An attribute may be redundant if it can be derived from another attribute or set of
attributes.
• Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis.
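A short pandas sketch (with made-up attributes) of detecting a redundant attribute via correlation analysis:

# Detect redundancy between numeric attributes using correlation analysis.
# The attributes and values are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "annual_revenue":  [100, 150, 200, 250, 300],
    "monthly_revenue": [8.3, 12.5, 16.7, 20.8, 25.0],   # derivable from annual_revenue
    "num_employees":   [5, 9, 7, 20, 14],
})

print(df.corr())
# A correlation close to +1 or -1 (here annual_revenue vs. monthly_revenue)
# suggests that one of the two attributes is redundant and may be removed.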
DATA REDUCTION:
Data reduction techniques obtain a reduced representation of the data set that is much smaller in volume,
yet produces nearly the same analytical results. Common strategies are described below.
1. Dimension Reduction:
When we come across attributes that are weakly relevant or irrelevant to the analysis, we keep only the
attributes required for our analysis (attribute subset selection). This reduces the data size, as it eliminates
outdated or redundant features. For example, stepwise forward selection starts with an empty attribute set
and adds the best of the remaining attributes at each step:
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
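A minimal greedy forward-selection sketch that mirrors the Step-1 to Step-3 pattern above (the data set and scoring function are illustrative assumptions):

# Toy stepwise forward attribute selection; data and scoring are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                           # candidate attributes X1..X5
y = 3 * X[:, 0] + 2 * X[:, 1] + X[:, 4] + rng.normal(size=100)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):                                      # keep the 3 best attributes
    scores = {j: cross_val_score(LinearRegression(),
                                 X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    best = max(scores, key=scores.get)                  # best remaining attribute
    selected.append(best)
    remaining.remove(best)
    print("Step:", ["X%d" % (j + 1) for j in selected])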
2. Data Compression:
The data compression technique reduces the size of the files using different encoding mechanisms
(Huffman Encoding & run-length Encoding). We can divide it into two types based on their compression
techniques.
• Lossless Compression
Encoding techniques (such as run-length encoding) allow a simple and minimal reduction of data size. Lossless
data compression uses algorithms to restore the precise original data from the compressed data.
• Lossy Compression
Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples
of this kind of compression. For example, the JPEG image format uses lossy compression, but the result is
visually equivalent to the original image. In lossy compression, the decompressed data may differ from the
original data but are still useful enough to retrieve information from them.
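A tiny run-length encoding sketch (one simple lossless scheme; the input string is made up):

# Run-length encoding and decoding: a simple lossless compression scheme.
from itertools import groupby

def rle_encode(text):
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    return "".join(ch * count for ch, count in pairs)

data = "AAAABBBCCD"
encoded = rle_encode(data)           # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
assert rle_decode(encoded) == data   # lossless: the original is restored exactly
print(encoded)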
3. Numerosity Reduction:
In this technique the actual data are replaced with a mathematical model or a smaller representation of
the data, so that only the model parameters need to be stored (parametric methods), or with
non-parametric methods such as clustering, histograms, and sampling.
Concept Hierarchies:
It reduces the data size by collecting and then replacing low-level concepts (such as a numeric value of 43 for age)
with high-level concepts (categorical values such as middle-aged or senior).
For numeric data following techniques can be followed:
• Binning
Binning is the process of changing numerical variables into categorical counterparts. The number of
categorical counterparts depends on the number of bins specified by the user.
• Histogram analysis
Like binning, histogram analysis partitions the values of an attribute X into disjoint
ranges called buckets (or brackets). There are several partitioning rules:
1. Equal-frequency partitioning: the values are partitioned so that each partition contains approximately
the same number of data points.
2. Equal-width partitioning: the values are partitioned into intervals of a fixed width based on the number
of bins, e.g., ranges of 0–20.
3. Clustering: similar values are grouped together.
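A short pandas sketch contrasting equal-width and equal-frequency partitioning (reusing the price values from the binning example):

# Equal-width vs. equal-frequency partitioning of a numeric attribute.
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

equal_width = pd.cut(prices, bins=3)   # 3 intervals of equal width
equal_freq = pd.qcut(prices, q=3)      # 3 intervals with roughly equal counts

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())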
DATA TRANSFORMATION:
The data are transformed into forms appropriate for mining. Data transformation involves the following steps:
1. Smoothing:
It is a process used to remove noise from the dataset using some algorithm. It allows for highlighting
important features present in the dataset and helps in predicting patterns. When collecting data, it can be
manipulated to eliminate or reduce any variance or other form of noise.
The concept behind data smoothing is that it can identify simple changes to help predict different
trends and patterns. This helps analysts or traders who need to look at a lot of data, which can often
be difficult to digest, to find patterns they would not see otherwise.
2. Aggregation:
Data collection or aggregation is the method of storing and presenting data in a summary format. The data may
be obtained from multiple data sources to integrate these data sources into a data analysis description. This is a
crucial step since the accuracy of data analysis insights is highly dependent on the quantity and quality of the
data used. Gathering accurate data of high quality and a large enough quantity is necessary to produce relevant
results.
The collection of data is useful for everything from decisions concerning financing or business strategy of the
product, pricing, operations, and marketing strategies.
For example, sales data may be aggregated to compute monthly and annual total amounts.
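A small pandas sketch of aggregating daily sales (made-up figures) into monthly and annual totals:

# Aggregate daily sales into monthly and annual totals; the figures are made up.
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=120, freq="D"),
    "sales": range(120),
})

monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
annual = daily.groupby(daily["date"].dt.to_period("Y"))["sales"].sum()
print(monthly)
print(annual)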
3. Discretization:
It is a process of transforming continuous data into a set of small intervals. Most data mining activities in the real
world involve continuous attributes, yet many existing data mining frameworks are unable to handle these
attributes.
Also, even if a data mining task can manage a continuous attribute, its efficiency can be significantly improved by
replacing the continuous attribute with its discretized values.
For example, numeric ranges such as (1–10, 11–20, ...), or age mapped to (young, middle-aged, senior).
4. Attribute Construction:
New attributes are constructed from the given set of attributes and added to assist the mining process. This
simplifies the original data and makes mining more efficient.
5. Generalization:
It converts low-level data attributes to high-level data attributes using a concept hierarchy. For example, age
initially in numerical form (22, 25) is converted into a categorical value (young, old).
For example, Categorical attributes, such as house addresses, may be generalized to higher-level definitions,
such as town or country.
6. Normalization:
Data normalization involves converting all data variables into a given range.
Techniques that are used for normalization are:
• Min-Max Normalization:
• This transforms the original data linearly.
• Suppose that min_P is the minimum and max_P is the maximum value of an attribute P. Min-max
normalization maps a value v of P to v' in the new range [new_min, new_max] as:
v' = ((v − min_P) / (max_P − min_P)) × (new_max − new_min) + new_min
Solved example:
Suppose the minimum and maximum values for an attribute profit (P) are Rs. 10,000 and Rs. 100,000. We
want to map the profit into the range [0, 1]. Using min-max normalization, the value Rs. 20,000 for
attribute profit is mapped to:
(20,000 − 10,000) / (100,000 − 10,000) × (1 − 0) + 0 = 10,000 / 90,000 ≈ 0.111
• Z-Score Normalization:
• In z-score normalization (zero-mean normalization), a value v of an attribute P is normalized to
v' = (v − mean_P) / std_P, where mean_P and std_P are the mean and standard deviation of P.
For example:
Let the mean of attribute P be 60,000 and its standard deviation be 10,000. Using z-score
normalization, a value of 85,000 for P is transformed to:
(85,000 − 60,000) / 10,000 = 2.5
• Normalization by Decimal Scaling:
• This technique normalizes by moving the decimal point of the values of attribute P: v' = v / 10^j, where j
is the smallest integer such that max(|v'|) < 1.
For example:
• Suppose the values of an attribute P vary from −99 to 99.
• The maximum absolute value of P is 99.
• To normalize, we divide each value by 100 (i.e., j = 2, the number of digits in the largest absolute value),
so that values such as 98 and 97 become 0.98, 0.97, and so on.
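A short Python sketch applying the three normalization techniques to sample values (the profit figures reuse the ranges from the examples above):

# Min-max, z-score, and decimal-scaling normalization on sample values.
import numpy as np

profit = np.array([10_000, 20_000, 55_000, 100_000], dtype=float)

# Min-max normalization to [0, 1]: (v - min) / (max - min)
min_max = (profit - profit.min()) / (profit.max() - profit.min())

# Z-score normalization: (v - mean) / standard deviation
z_score = (profit - profit.mean()) / profit.std()

# Decimal scaling: divide by 10^j so that the largest absolute value is below 1.
j = len(str(int(abs(profit).max())))
decimal_scaled = profit / (10 ** j)

print(min_max)          # 20,000 maps to about 0.111
print(z_score)
print(decimal_scaled)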