Chapter 1: Data Mining and Data Warehouse
DATA MINING
SCHOOL OF INFORMATICS
DEPARTMENT OF INFORMATION TECHNOLOGY
Compiled by Aklilu E.
Brief Description of Data Mining
Data mining refers to extracting, or “mining”, knowledge from large amounts of data.
There are many other terms related to data mining, such as knowledge mining, knowledge
extraction, data/pattern analysis, data archaeology, and data dredging.
Many people treat data mining as a synonym for another popularly used term, Knowledge
Discovery in Databases, or KDD.
Data mining has attracted a great deal of attention in the information industry in recent years
because of the wide availability of huge amounts of data and the imminent need to turn such
data into useful information and knowledge.
Data Mining
Major Issues in Data Mining
►Mining different kinds of knowledge in databases:-
Different users may be interested in different kinds of knowledge.
Therefore, data mining needs to cover a broad range of knowledge discovery tasks.
►Data mining query languages and ad hoc data mining:-
A data mining query language, which allows the user to describe ad hoc mining tasks, should be
integrated with a data warehouse query language and optimized for efficient and flexible data
mining.
►Efficiency and scalability of data mining algorithms:- To extract information effectively
from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
Data Warehousing, Data Mining and Database Technology
►A data warehouse is a repository of data organized by subject to support decision makers in
the organization.
►The concept was intended to provide an architectural model for the flow of data from operational
systems to the decision support environment.
►A common source of data for a data warehouse is the company's operational databases.
►Data warehousing is the process of centralizing or aggregating data from multiple sources
into one common repository.
►Data warehousing supports business analysis and decision making by creating an enterprise-wide
integrated database of summarized, historical information.
►Data mining, the extraction of hidden predictive information from large databases, is a
powerful new technology with great potential to help companies focus on the most important
information in their data warehouses.
►Data mining tools predict future trends and behaviors, allowing businesses to make proactive,
knowledge-driven decisions.
►The patterns that data mining uncovers can often provide meaningful insight to whoever is
interested in the data.
Evolution of Database Technology
Process
► The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:-
1) Selection
2) Pre-processing
3) Transformation
4) Data mining
5) Interpretation/Evaluation
Steps in KDD
► Data Cleaning:- to remove noise or irrelevant data
► Data Selection:- where data relevant to the analysis task are retrieved from the database
► Data Transformation:- where data are transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations
► Data Mining:- an essential process where intelligent methods are applied in order to extract data patterns
► Pattern Evaluation:- to identify the truly interesting patterns representing knowledge based on some
interestingness measures.
► Knowledge Presentation:- where visualization and knowledge representation techniques are used to
present the mined knowledge to the user.
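The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, the region names, the averaging "mining" step, and the threshold are all hypothetical and are not part of the course material.

```python
from statistics import mean

# Hypothetical raw records: (customer, region, amount); None marks a noisy value.
raw = [
    ("ann", "north", 120.0),
    ("bob", "north", None),
    ("cat", "south", 80.0),
    ("dan", "south", 100.0),
    ("eve", "west", 40.0),
]

def clean(records):
    """Data cleaning: drop records with missing amounts."""
    return [r for r in records if r[2] is not None]

def select(records, regions):
    """Data selection: keep only the regions relevant to the task."""
    return [r for r in records if r[1] in regions]

def transform(records):
    """Data transformation: consolidate amounts per region."""
    grouped = {}
    for _, region, amount in records:
        grouped.setdefault(region, []).append(amount)
    return grouped

def mine(grouped):
    """Data mining (toy stand-in): compute the average amount per region."""
    return {region: mean(values) for region, values in grouped.items()}

def evaluate(patterns, threshold):
    """Pattern evaluation: keep patterns that meet an interestingness threshold."""
    return {k: v for k, v in patterns.items() if v >= threshold}

def present(patterns):
    """Knowledge presentation: format patterns as readable lines."""
    return [f"{region}: average {value:.1f}" for region, value in sorted(patterns.items())]

patterns = evaluate(mine(transform(select(clean(raw), {"north", "south"}))), threshold=100.0)
report = present(patterns)
```

Each function corresponds to one KDD step, so a step can be replaced (for example, a real mining algorithm in place of the average) without touching the others.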
Data mining vs Statistics
►Data mining is the process of applying computational and statistical methods with the intention of
uncovering hidden patterns in large data sets.
►It bridges the gap from applied statistics and artificial intelligence (which usually provide the
mathematical background) to database management by exploiting the way data is stored and indexed in
databases to execute the actual learning and discovery algorithms more efficiently, allowing such
methods to be applied to ever-larger data sets.
►Statistics is a component of data mining that provides the tools and analytical techniques for dealing
with large amounts of data.
►It is the science of learning from data and includes everything from collecting and organizing to
analyzing and presenting data.
Data Mining
Data mining functionality
►Data mining functionalities are used to specify the kinds of patterns to be found in data mining
tasks.
►In general, data mining tasks can be classified into two categories: predictive and descriptive.
►Predictive Mining tasks perform inference on the current data in order to make predictions.
Classification:
►Classification constructs a model from a training set and the values (class labels) of a
classifying attribute, and then uses the model to classify new data.
►Classification can be defined as the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the model to predict
the class of objects whose class label is unknown.
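As a minimal sketch of classification, the example below uses a 1-nearest-neighbor rule, which is only one of many possible classifiers. The training data (age, income) and the labels are hypothetical.

```python
import math

# Hypothetical training set: (age, income in thousands) -> class label.
training = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
]

def classify(point, training_set):
    """Return the label of the nearest training example (1-nearest neighbor)."""
    nearest = min(training_set, key=lambda example: math.dist(example[0], point))
    return nearest[1]

# Objects whose class label is unknown.
label_young = classify((28, 32), training)
label_older = classify((48, 85), training)
```

Here the "model" is simply the stored training set plus a distance function. Other classifiers, such as decision trees, build a more compact model from the same kind of labeled data.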
Prediction:
►Finding missing or unavailable data values, rather than class labels, is referred to as
prediction.
►Although prediction may refer to both data value prediction and class label prediction, it is
usually confined to data value prediction and thus is distinct from classification.
►Prediction also encompasses the identification of distribution trends based on the available
data.
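Data value prediction can be illustrated with a least-squares line fit. This is a sketch only: linear regression is one possible technique, and the experience and salary figures are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4]
salary = [30, 35, 40, 45]

slope, intercept = fit_line(years, salary)
predicted = slope * 5 + intercept  # predict the unavailable value for 5 years
```

Unlike classification, the output is a numeric value rather than a class label.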
►Descriptive mining tasks characterize the general properties of the data in the database.
Clustering analysis
►The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as
a class of objects.
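The grouping principle can be sketched with a tiny k-means on one-dimensional values. K-means is just one clustering method, and the ages and starting centers below are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means on numbers: assign to the nearest center, then re-average."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

ages = [18, 20, 22, 60, 62, 65]
centers, clusters = kmeans_1d(ages, centers=[18.0, 65.0])
```

Values within each resulting cluster are close to one another (high intraclass similarity), while the two clusters are far apart (low interclass similarity).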
Association:
►It is the discovery of association rules showing attribute-value conditions that occur frequently
together in a given set of data.
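Association rules are usually scored by support and confidence. The sketch below computes both for a rule "bread => milk". The transactions are hypothetical, and no rule-mining algorithm (such as Apriori) is implemented here.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

A rule is typically reported only if both values exceed user-chosen minimum thresholds.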
What is Data Warehouse?
►A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in
support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example,
"sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and
source B may have different ways of identifying a product, but in a data warehouse there will be only
a single way of identifying a product.
Time-Variant: A data warehouse keeps historical data, so its records are associated with a point or
period in time.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
Data Warehouse Features
►It is separate from the Operational Database.
Data Warehouse vs. Operational DBMS
►A data warehouse is a repository for structured, filtered data that has already been processed
for a specific purpose.
►A data warehouse collects data from multiple sources, transforms it using an ETL (extract,
transform, load) process, and then loads it into the warehouse for business use.
►An operational database, in contrast, is used to maintain online transactions and record
integrity in multi-access environments.
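The ETL process mentioned above can be sketched in a few lines. Everything here is an assumption for illustration: the two sources, their field names, and the in-memory list that stands in for the warehouse.

```python
# Extract: two hypothetical sources that identify products differently.
source_a = [{"sku": "A-100", "qty": 2}, {"sku": "A-200", "qty": 1}]
source_b = [{"product_code": 100, "units": 5}]

def transform(rows_a, rows_b):
    """Transform: map both sources onto one product key and one quantity column."""
    unified = []
    for row in rows_a:
        product_id = int(row["sku"].split("-")[1])
        unified.append({"product_id": product_id, "quantity": row["qty"]})
    for row in rows_b:
        unified.append({"product_id": row["product_code"], "quantity": row["units"]})
    return unified

def load(rows, warehouse):
    """Load: append the unified rows to the (in-memory) warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(source_a, source_b), [])
total_for_100 = sum(r["quantity"] for r in warehouse if r["product_id"] == 100)
```

After loading, product 100 has a single identifier regardless of which source it came from, so the analysis query does not need to know about the source formats.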
No. | Key Feature | Data Warehouse | Operational Database
OLTP vs. OLAP
►OLTP stands for Online Transaction Processing. It is an approach typically used to facilitate
and manage day-to-day business applications.
►OLTP technology is used to perform updates on operational or transactional systems (e.g., point
of sale systems).
►OLAP stands for Online Analytical Processing. It is an approach for answering multidimensional
queries.
►OLAP was conceived for Management Information Systems and Decision Support Systems.
►OLAP technology is used to perform complex analysis of the data in a data warehouse.
Conceptual Modeling of Data Warehouses
►The most popular data model for data warehouses is the multidimensional model, which can be
arranged as a star schema, a snowflake schema, or a fact constellation schema.
Star Schema
►A star schema is the traditional form of a multidimensional data model, in which data are
organized into facts and dimensions.
►A fact is an event that is counted or measured, such as a sale or a purchase.
►A dimension contains reference information about the fact, such as the item, time, or location.
►The star schema is a modeling paradigm in which the data warehouse contains
(1) a large central table (fact table), and
(2) a set of smaller attendant tables (dimension tables), one for each dimension.
►The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around
the central fact table.
Figure: star schema of sales.
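A star schema for sales can be sketched with Python's built-in sqlite3 module. The table names, columns, and rows below are hypothetical and chosen only to show one fact table surrounded by dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one per dimension.
    CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
    -- Fact table: a key for each dimension plus the measures.
    CREATE TABLE fact_sales (
        item_key INTEGER REFERENCES dim_item(item_key),
        location_key INTEGER REFERENCES dim_location(location_key),
        units_sold INTEGER,
        dollars_sold REAL
    );
    INSERT INTO dim_item VALUES (1, 'laptop'), (2, 'phone');
    INSERT INTO dim_location VALUES (1, 'Addis Ababa', 'Ethiopia'), (2, 'Nairobi', 'Kenya');
    INSERT INTO fact_sales VALUES (1, 1, 3, 3000.0), (2, 1, 5, 2500.0), (1, 2, 2, 2000.0);
""")

# A typical star join: aggregate a measure grouped by a dimension attribute.
sales_by_country = conn.execute("""
    SELECT l.country, SUM(f.dollars_sold)
    FROM fact_sales AS f
    JOIN dim_location AS l ON l.location_key = f.location_key
    GROUP BY l.country
    ORDER BY l.country
""").fetchall()
```

Every analytical query follows the same pattern: join the fact table to the dimensions of interest, then group by dimension attributes.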
Snowflake Schema
►A snowflake schema is a multidimensional model whose ER diagram resembles a snowflake.
►The snowflake schema is a variant of the star schema model, where some dimension tables are
normalized, thereby further splitting the data into additional tables. For example, Country can be
normalized into its own table to reduce the redundancy that is present in a star schema.
►The major difference between the snowflake and star schema models is that the dimension tables of the
snowflake model may be kept in normalized form. Such tables are easy to maintain and save storage
space, because a dimension table can become extremely large when the dimensional structure is included as
columns.
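The normalization step can be illustrated by splitting the country attribute out of a location dimension. This is a sketch with hypothetical rows and key names, using plain Python dictionaries in place of database tables.

```python
# Star-style dimension: the country name is repeated on every row.
star_location = [
    {"location_key": 1, "city": "Addis Ababa", "country": "Ethiopia"},
    {"location_key": 2, "city": "Adama", "country": "Ethiopia"},
    {"location_key": 3, "city": "Nairobi", "country": "Kenya"},
]

def snowflake(rows):
    """Split the country attribute into its own table, referenced by key."""
    country_keys = {}
    locations = []
    for row in rows:
        # Reuse the existing key for a known country, otherwise assign the next one.
        key = country_keys.setdefault(row["country"], len(country_keys) + 1)
        locations.append(
            {"location_key": row["location_key"], "city": row["city"], "country_key": key}
        )
    countries = [{"country_key": k, "country": name} for name, k in country_keys.items()]
    return locations, countries

locations, countries = snowflake(star_location)
```

Each country name is now stored once. The cost is an extra join whenever a query needs the country of a location.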
Fact Constellation Schema
►In a fact constellation schema, multiple fact tables share dimension tables. Sophisticated
applications may require this arrangement.
►This kind of schema can be viewed as a collection of stars, and hence is also called a galaxy schema.
►A fact constellation schema is shown in the figure.
►The shipping table has five dimensions, or keys: item key, time key, shipper key, from location,
and to location, and two measures: dollars cost and units shipped.
►A fact constellation schema allows dimension tables to be shared between fact tables.
►For example, the dimension tables for time, item, and location are shared between both the
sales and shipping fact tables.
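The sharing of a dimension table between two fact tables can be sketched as follows. The rows are hypothetical, and only the item dimension is shown to keep the example short.

```python
# Shared dimension table (hypothetical rows): item_key -> item name.
dim_item = {1: "laptop", 2: "phone"}

# Two fact tables that both reference the shared item dimension.
fact_sales = [{"item_key": 1, "units_sold": 3}, {"item_key": 2, "units_sold": 5}]
fact_shipping = [{"item_key": 1, "units_shipped": 2}, {"item_key": 1, "units_shipped": 1}]

def total_by_item(fact_rows, measure):
    """Aggregate one measure of a fact table by item name."""
    totals = {}
    for row in fact_rows:
        name = dim_item[row["item_key"]]
        totals[name] = totals.get(name, 0) + row[measure]
    return totals

sold = total_by_item(fact_sales, "units_sold")
shipped = total_by_item(fact_shipping, "units_shipped")
```

Because both fact tables use the same item keys, sales and shipping figures for an item can be compared directly.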
Measures in Data warehouses
Measures: Their Categorization and Computation:
►A data cube measure is a numerical function that can be evaluated at each point in the data cube space.
►A measure value is computed for a given point by aggregating the data corresponding to the respective
dimension-value pairs defining the given point.
►Measures can be organized into three categories based on the kind of aggregate functions used.
1) Distributive,
2) Algebraic,
3) Holistic
Distributive:-
►An aggregate function is distributive if it can be computed in a distributed manner, as follows.
►Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n
aggregate values.
►If the result derived by applying the function to the n aggregate values is the same as that derived by
applying the function to the entire data set (without partitioning), the function can be computed in a
distributed manner.
►For example, sum(), min(), and max() are distributive aggregate functions. Distributive measures can be
computed efficiently because the computation can be partitioned in this way.
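The definition can be checked directly on a small example. The data values and the choice of three partitions are arbitrary.

```python
data = [4, 8, 15, 16, 23, 42]
partitions = [data[0:2], data[2:4], data[4:6]]  # n = 3 partitions

# Apply the function to each partition, then to the n partial results.
sum_of_partials = sum(sum(p) for p in partitions)
max_of_partials = max(max(p) for p in partitions)
min_of_partials = min(min(p) for p in partitions)
# count() is distributive too, but partial counts are combined with sum().
count_of_partials = sum(len(p) for p in partitions)
```

In each case the combined partial results equal the result of applying the function to the whole data set.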
Algebraic:-
►An aggregate function is algebraic if it can be computed by an algebraic function with M arguments
(where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
►For example, avg() can be computed as sum()/count(), where both sum() and count() are distributive.
Holistic:-
►An aggregate function is holistic if there is no constant bound on the storage size needed to describe
a sub-aggregate. Common examples include median() and mode().
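The difference between algebraic and holistic measures shows up when the data are partitioned. In this sketch the data values and the partitioning are arbitrary.

```python
from statistics import median

data = [1, 2, 3, 4, 100]
partitions = [data[:2], data[2:]]

# Algebraic: avg() needs only two distributive sub-aggregates per partition (M = 2).
partials = [(sum(p), len(p)) for p in partitions]
avg = sum(s for s, _ in partials) / sum(c for _, c in partials)

# Holistic: combining per-partition medians does not give the true median here.
median_of_medians = median(median(p) for p in partitions)
true_median = median(data)
```

The average is recovered exactly from a fixed number of values per partition, whereas the exact median in general requires access to all of the data.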
Concept Hierarchy and Data Warehouses
►A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level,
more general concepts. Concept hierarchies allow data to be handled at varying levels of abstraction.
OLAP Operations on Multidimensional Data
►Roll-up:- The roll-up operation performs aggregation on a data cube, either by climbing up a concept
hierarchy for a dimension or by dimension reduction. For example, a location hierarchy can be defined as
the total order street < city < province or state < country.
►Drill-down:- Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed
data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or
introducing additional dimensions. For example, a drill-down can be performed on the central cube by
stepping down a concept hierarchy for time defined as day < month < quarter < year, descending from the
level of quarter to the more detailed level of month.
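Roll-up and drill-down can both be sketched as re-aggregation at a different level of a concept hierarchy. The sales rows, city names, and unit counts below are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts recorded at the city and month level.
sales = [
    {"city": "Toronto", "country": "Canada", "month": "Jan", "quarter": "Q1", "units": 10},
    {"city": "Vancouver", "country": "Canada", "month": "Feb", "quarter": "Q1", "units": 20},
    {"city": "Chicago", "country": "USA", "month": "Apr", "quarter": "Q2", "units": 5},
    {"city": "Chicago", "country": "USA", "month": "May", "quarter": "Q2", "units": 7},
]

def aggregate(rows, *levels):
    """Sum units grouped by the chosen hierarchy levels."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[level] for level in levels)] += row["units"]
    return dict(totals)

by_city_quarter = aggregate(sales, "city", "quarter")
rolled_up = aggregate(sales, "country", "quarter")   # roll-up: city -> country
drilled_down = aggregate(sales, "country", "month")  # drill-down: quarter -> month
```

Dropping a level name from the call altogether, for example `aggregate(sales, "quarter")`, corresponds to roll-up by dimension reduction.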
►Slice and dice:- The slice operation performs a selection on one dimension of the given cube, resulting
in a subcube.
►For example, a slice operation can select the sales data from the central cube for the dimension time
using the criterion time = "Q2".
►The dice operation defines a subcube by performing a selection on two or more dimensions.
►Pivot (rotate):- Pivot is a visualization operation that rotates the data axes in view in order to provide
an alternative presentation of the data.
►For example, a pivot operation can rotate the item and location axes in a 2-D slice.
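Slice, dice, and pivot can be sketched on a cube stored as a list of cells. The cell values and dimension members are hypothetical, and the helper names (`slice_`, `dice`, `pivot`) are chosen for this illustration only.

```python
cube = [
    {"time": "Q1", "item": "phone", "location": "Toronto", "units": 10},
    {"time": "Q2", "item": "phone", "location": "Toronto", "units": 12},
    {"time": "Q2", "item": "laptop", "location": "Chicago", "units": 7},
    {"time": "Q3", "item": "laptop", "location": "Chicago", "units": 9},
]

def dice(rows, **allowed):
    """Select rows whose value on each named dimension is in the allowed set."""
    return [r for r in rows if all(r[d] in values for d, values in allowed.items())]

def slice_(rows, dimension, value):
    """A slice is a selection on a single dimension."""
    return dice(rows, **{dimension: {value}})

def pivot(rows, row_dim, col_dim):
    """Arrange a 2-D view as {row value: {column value: units}}."""
    table = {}
    for r in rows:
        table.setdefault(r[row_dim], {})[r[col_dim]] = r["units"]
    return table

q2 = slice_(cube, "time", "Q2")
diced = dice(cube, time={"Q1", "Q2"}, location={"Toronto"})
by_item = pivot(q2, "item", "location")
by_location = pivot(q2, "location", "item")  # same data, axes rotated
```

The two pivoted views contain the same cells. Only the roles of the item and location axes are swapped.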
Typical OLAP Operations
► Relational OLAP (ROLAP)
Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP
middleware to support missing pieces
Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and
additional tools and services
Offers greater scalability
►Hybrid OLAP (HOLAP)
User flexibility, e.g., low level: relational, high-level: array
►In the star-net model, radial lines emanate from a central point.
►A star-net model is shown below:-
►Drill-down: For example, drilling down on the Customer dimension moves from customer group to customer name.
►Generalization: Moving up the hierarchy. For example, generalizing the time dimension from day to year.
►Specialization: Moving down the hierarchy. For example, moving from the year of the time dimension to the day.
Data Warehouse Design Process
►A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both.
1. The top-down approach starts with the overall design and planning. It is useful in cases
where the technology is mature and well known, and where the business problems that must
be solved are clear and well understood.
2. The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organization to move
forward at considerably less expense and to evaluate the benefits of the technology before
making significant commitments.
3. In the combined approach, an organization can exploit the planned and strategic nature of the
top-down approach while retaining the rapid implementation and opportunistic application of the
bottom-up approach.
Data Warehouse Models
►From the architecture point of view, there are three data warehouse models: the enterprise warehouse,
the data mart, and the virtual warehouse.
Enterprise Warehouse
► An enterprise warehouse collects all of the information about subjects spanning the entire organization.
► It provides corporate-wide data integration, usually from one or more operational systems or external
information providers, and is cross-functional in scope.
►It typically contains detailed data as well as summarized data, and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond.
►An enterprise data warehouse may be implemented on traditional mainframes, computer super servers,
or parallel architecture platforms.
►It requires extensive business modeling and may take years to design and build.
Data Mart
►A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The
scope is confined to specific selected subjects.
►For example, a marketing data mart may confine its subjects to customer, item, and sales. The data
contained in data marts tend to be summarized.
►The implementation cycle of a data mart is more likely to be measured in weeks rather than months or
years. However, it may involve complex integration in the long run if its design and planning were not
enterprise-wide.
►Depending on the source of data, data marts can be categorized as independent or dependent.
►Independent data marts are sourced from data captured from one or more operational systems or external
information providers, or from data generated locally within a particular department or geographic area.
►Dependent data marts are sourced directly from enterprise data warehouses.
Virtual Warehouse
►A virtual warehouse is a set of views over operational databases.
►For efficient query processing, only some of the possible summary views may be materialized.
►A virtual warehouse is easy to build but requires excess capacity on operational database
servers.
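A virtual warehouse can be sketched as a SQL view defined over an operational table, here using Python's built-in sqlite3 module. The `orders` table, the `sales_by_region` view, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'north', 100.0), (2, 'north', 50.0), (3, 'south', 70.0);
    -- The virtual warehouse: a summary view defined over the operational data.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
""")

query = "SELECT region, total FROM sales_by_region ORDER BY region"
before = conn.execute(query).fetchall()
# A new transaction is visible through the view at once, since no data is copied.
conn.execute("INSERT INTO orders VALUES (4, 'south', 30.0)")
after = conn.execute(query).fetchall()
```

Because the view is recomputed from the operational table on every query, analysis competes with transaction processing for the same server capacity. This is why such a warehouse is easy to build but requires excess capacity on the operational servers.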
OLAP functionalities on Data warehouses
►Other OLAP operations may include ranking the items, computing average, finding out depreciation,
converting the currency or performing some statistical operations.
Functions of OLAP
►OLAP has models for prediction and for the analysis of trends and patterns, as well as for statistical analysis.
►OLAP has analytical capabilities for calculations, ratios, and variance derivation across dimensions.
►OLAP operations support business analysis and the handling of huge amounts of data.
April 5, 2025 Data Mining