Chapter 1: Data Mining and Data Warehouse

Data mining involves extracting knowledge from large datasets and is often synonymous with Knowledge Discovery in Databases (KDD). Key issues in data mining include the need for interactive mining, handling noisy data, and ensuring efficient algorithms for scalability. The document also discusses the relationship between data warehousing and data mining, outlining the processes involved in KDD and the functionalities of data mining such as predictive and descriptive tasks.


CHAPTER 1:

DATA MINING

SCHOOL OF INFORMATICS
DEPARTMENT OF INFORMATION TECHNOLOGY
Compiled by Aklilu E.
Brief Description of Data Mining
Data Mining refers to extracting or mining knowledge from large amounts of data.

There are many other terms related to data mining, such as knowledge mining, knowledge
extraction, data/pattern analysis, data archaeology, and data dredging.

Many people treat data mining as a synonym for another popularly used term, Knowledge
Discovery in Databases, or KDD.

The major reason that data mining has attracted a great deal of attention in the information industry
in recent years is the wide availability of huge amounts of data and the imminent need for turning
such data into useful information and knowledge.

Data Mining
Major Issues in Data Mining
►Mining different kinds of knowledge in databases:-
Different users may be interested in different kinds of knowledge.

Therefore, data mining needs to cover a broad range of knowledge discovery tasks.

►Interactive mining of knowledge at multiple levels of abstraction:-
The data mining process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on returned results.
► Incorporation of background knowledge:-
Background knowledge can be used to guide the discovery process and to express the discovered
patterns. It allows patterns to be expressed not only in concise terms but also at multiple levels
of abstraction.

Cont…;
►Data mining query languages and ad hoc data mining:-
Data Mining Query language that allows the user to describe ad hoc mining tasks, should be
integrated with a data warehouse query language and optimized for efficient and flexible data
mining.

►Presentation and visualization of data mining results:-
Once the patterns are discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable by the users.

►Handling noisy or incomplete data:-
Data cleaning methods are required that can handle noise and incomplete objects while mining
data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
Cont…;
►Pattern evaluation:- It refers to the interestingness of the discovered patterns. Patterns that
merely represent common knowledge or lack novelty are not interesting; mining should focus on
truly interesting patterns.

►Efficiency and scalability of data mining algorithms:- In order to effectively extract information
from huge amounts of data in databases, data mining algorithms must be efficient and scalable.

► Parallel, distributed, and incremental mining algorithms:-


The factors such as huge size of databases, wide distribution of data, and complexity of data mining
methods motivate the development of parallel and distributed data mining algorithms.
These algorithms divide the data into partitions, which are then processed in parallel; the results
from the partitions are merged.
Incremental algorithms update the mined knowledge as the database changes, without mining the
entire data again from scratch.

Data Warehousing, Data Mining and Database Technology
►Data Warehousing is the repository of data that are organized by subject to support decision makers in
the organizations.

►The concept was intended to provide an architectural model for the flow of data from operational
systems to the decision support environment.

►A common source of data for a data warehouse is the operational database of the company.

►Data warehousing can be said to be the process of centralizing or aggregating data from multiple sources
into one common repository.

►The Data Warehousing supports business analysis and decision making by creating an enterprise wide
integrated database of summarized, historical information.

Cont…;
►Data mining, the extraction of hidden predictive information from large databases, is a
powerful new technology with great potential to help companies focus on the most important
information in their data warehouses.

►Data mining tools predict future trends and behaviors, allowing businesses to make proactive,
knowledge-driven decisions.

►Data mining is the process of finding patterns in a given data set.

►These patterns can often provide meaningful and insightful data to whoever is interested in that
data.

Evolution of Database Technology

Data Mining Process
► The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:-
1) Selection

2) Pre-processing

3) Transformation

4) Data mining

5) Interpretation/Evaluation

Steps in KDD

Knowledge discovery as a process is depicted in the following figure and consists of an iterative
sequence of the following steps:-

Steps in KDD
► Data Cleaning:- to remove noise or irrelevant data

► Data Integration:- where multiple data sources may be combined

► Data Selection:- where data relevant to the analysis task are retrieved from the database

► Data Transformation:- where data are transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations

► Data Mining:- an essential process where intelligent methods are applied in order to extract data patterns

► Pattern Evaluation:- to identify the truly interesting patterns representing knowledge based on some
interestingness measures.

► Knowledge Presentation:- where visualization and knowledge representation techniques are used to
present the mined knowledge to the user.
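The KDD steps above can be sketched end-to-end over a small in-memory data set. This is a minimal
illustration only; the record fields, the threshold, and the toy data are all hypothetical.

```python
# A toy KDD pipeline over in-memory records (all names are illustrative).
raw = [
    {"store": "A", "item": "tea",    "qty": 5},
    {"store": "A", "item": "tea",    "qty": None},   # noisy/incomplete record
    {"store": "B", "item": "coffee", "qty": 3},
    {"store": "B", "item": "tea",    "qty": 2},
]

# 1) Data cleaning: drop records with missing values.
cleaned = [r for r in raw if r["qty"] is not None]

# 2) Data selection: keep only the attributes relevant to the task.
selected = [{"item": r["item"], "qty": r["qty"]} for r in cleaned]

# 3) Data transformation: consolidate by aggregation (total qty per item).
totals = {}
for r in selected:
    totals[r["item"]] = totals.get(r["item"], 0) + r["qty"]

# 4) "Data mining": keep items whose total sales meet a support threshold.
patterns = {item: qty for item, qty in totals.items() if qty >= 5}

# 5) Pattern evaluation / knowledge presentation.
for item, qty in sorted(patterns.items()):
    print(f"frequent item: {item} (total qty = {qty})")
```

In a real system each step would be a separate, configurable stage; the point here is only the
ordering of cleaning, selection, transformation, mining, and presentation.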

Data mining vs Statistics
►Data Mining is the process of applying these methods with the intention of uncovering hidden patterns
in large data sets.

►It bridges the gap from applied statistics and artificial intelligence (which usually provide the
mathematical background) to database management by exploiting the way data is stored and indexed in
databases to execute the actual learning and discovery algorithms more efficiently, allowing such
methods to be applied to ever-larger data sets.

►Statistics is a component of data mining that provides the tools and analytics techniques for dealing
with large amounts of data.

►It is the science of learning from data and includes everything from collecting and organizing to
analyzing and presenting data.

Data mining functionality
►Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks.

►In general, data mining tasks can be classified into two categories: predictive and descriptive.

Cont…;
►Predictive Mining tasks perform inference on the current data in order to make predictions.

Classification:

►It predicts categorical class labels

►It classifies data (constructs a model) based on the training set and the values (class labels)
in a classifying attribute and uses it in classifying new data

►Classification can be defined as the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the model to predict
the class of objects whose class label is unknown.
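As a concrete sketch of classification, a tiny 1-nearest-neighbour classifier can predict the class
label of a new object from a labelled training set. The feature vectors and class labels below are
made up for illustration.

```python
# Minimal classification sketch: 1-nearest-neighbour over a toy training set.

def classify(point, training):
    """Predict the class label of `point` from its closest training example."""
    def dist2(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(training, key=lambda ex: dist2(point, ex[0]))
    return nearest[1]

# Training set: (feature vector, class label) pairs.
training = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((8.0, 9.0), "high"),
    ((9.1, 8.5), "high"),
]

print(classify((1.1, 0.9), training))  # a point near the "low" examples
print(classify((8.5, 9.2), training))  # a point near the "high" examples
```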

Cont…;
Prediction:

►Finding missing or unavailable data values rather than class labels is referred to as
prediction.

►Although prediction may refer to both data value prediction and class label prediction, it is
usually confined to data value prediction and is thus distinct from classification.

►Prediction also encompasses the identification of distribution trends based on the available
data.
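Data value prediction can be sketched with a simple least-squares line fit, which predicts a missing
numeric value rather than a class label. The data points are illustrative.

```python
# Minimal numeric prediction sketch: fit y = a*x + b by least squares,
# then predict an unavailable y value at a new x.

def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]          # perfectly linear toy data: y = 2x
a, b = fit_line(xs, ys)
print(a * 5 + b)           # predicted y at x = 5
```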

Cont…;
►Descriptive mining tasks characterize the general properties of the data in the database.

Clustering analysis

►Clustering analyses data objects without consulting a known class label.

►The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as
a class of objects.
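Clustering can be sketched with a tiny k-means loop that groups values around two centres without
consulting any class labels. The data and starting centres are illustrative; a real implementation
would also guard against empty clusters.

```python
# Minimal clustering sketch: k-means with k = 2 in one dimension.

def kmeans_1d(data, c1, c2, iters=10):
    """Alternate assignment and centre update; return the sorted centres."""
    for _ in range(iters):
        # Assign each point to its nearest centre (maximizing intraclass similarity).
        g1 = [x for x in data if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in data if abs(x - c1) > abs(x - c2)]
        # Recompute each centre as the mean of its group.
        c1 = sum(g1) / len(g1)
        c2 = sum(g2) / len(g2)
    return sorted([c1, c2])

data = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]   # two obvious groups
print(kmeans_1d(data, c1=1.0, c2=11.0))
```

Each final group can then be treated as a discovered class of objects, as the text describes.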

Association:

►It is the discovery of association rules showing attribute-value conditions that occur frequently
together in a given set of data.
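The core of association discovery, counting how often attribute values occur together, can be
sketched as support counting over a set of toy transactions. The item names and the minimum support
threshold are illustrative.

```python
# Minimal association sketch: support counting for item pairs.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Count how many transactions contain each pair of items.
pair_support = {}
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_support[pair] = pair_support.get(pair, 0) + 1

# Keep pairs meeting a minimum support of 2 transactions.
frequent = {p: s for p, s in pair_support.items() if s >= 2}
print(frequent)
```

Frequent pairs like these are the raw material from which association rules (e.g., bread ⇒ milk)
are derived.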

What is Data Warehouse?
►A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in
support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example,
"sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. E.g.; source A and source B
may have different ways of identifying a product, but in a data warehouse, there will be only a single
way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.

Data Warehouse Features
►It is separate from the Operational Database.

►Integrates data from heterogeneous systems.

►Stores HUGE amount of data, more historical than current data.

►Does not require data to be highly accurate.

►Queries are generally complex.

Data Warehouse vs. Operational DBMS
►A data warehouse is a repository for structured, filtered data that has already been processed
for a specific purpose.

►A data warehouse collects data from multiple sources, transforms the data using the ETL process, and
then loads it into the warehouse for business purposes.

►Operational databases are those databases where data changes frequently.

►They are mainly designed for high volumes of data transactions.

►They are the source database for the data warehouse.

►They are used for maintaining online transactions and record integrity in multiple-access
environments.

Cont…;
No  Key Feature     Data Warehouse                                 Operational Database

1   Basic           A repository for structured, filtered data     Databases where data changes
                    that has already been processed for a          frequently.
                    specific purpose.

2   Data structure  Denormalized schema                            Normalized schema

3   Performance     Fast for analytics queries                     Slow for analytics queries

4   Types of Data   Focuses on historical data                     Focuses on current transactional
                                                                   data

5   Use Case        Used for OLAP                                  Used for OLTP

OLTP vs. OLAP
►OLTP stands for OnLine Transaction Processing and is a data modeling approach typically used
to facilitate and manage usual business applications.

►Most of the applications you see and use are OLTP based.

►OLTP technology is used to perform updates on operational or transactional systems (e.g., point
of sale systems).

►OLAP stands for OnLine Analytic Processing and is an approach to answer multi-dimensional
queries.

►OLAP was conceived for Management Information Systems and Decision Support Systems.

►OLAP technology is used to perform complex analysis of the data in a data warehouse.

Conceptual Modeling of Data Warehouses
►The most popular data model for a data warehouse is the multidimensional model, which can be
arranged as a star schema, a snowflake schema, or a fact constellation schema.

Star Schema

►A star schema is the traditional way of organizing a multi-dimensional data model, in which data
are organized into facts and dimensions.
►A fact is an event that is counted or measured, such as a sales or purchase transaction.
►A dimension contains reference information about the fact on different subjects.

Cont…;
►The star schema is a modeling paradigm in which the data warehouse contains

(1) a large central table (fact table), and

(2) a set of smaller attendant tables (dimension tables), one for each dimension.

►The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around
the central fact table.

Item, Time, Location, and Branch are the four dimensions of sales.
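The star schema above can be sketched with plain dictionaries: a central fact table holds foreign
keys into the four dimension tables plus the measures. All table contents below are hypothetical.

```python
# Star schema sketch: one fact table keyed into four dimension tables.

item_dim     = {1: {"name": "tea"},   2: {"name": "coffee"}}
time_dim     = {1: {"quarter": "Q1"}, 2: {"quarter": "Q2"}}
location_dim = {1: {"city": "Addis Ababa"}}
branch_dim   = {1: {"branch": "Main"}}

# Fact table: foreign keys into each dimension plus the measure.
sales_fact = [
    {"item_key": 1, "time_key": 1, "loc_key": 1, "branch_key": 1, "dollars_sold": 120.0},
    {"item_key": 2, "time_key": 2, "loc_key": 1, "branch_key": 1, "dollars_sold": 300.0},
]

# A "query" joins a fact row to its dimensions through the keys.
row = sales_fact[0]
print(item_dim[row["item_key"]]["name"],
      time_dim[row["time_key"]]["quarter"],
      row["dollars_sold"])
```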

Snowflake Schema
►A snowflake schema is a multidimensional model whose ER diagram resembles a snowflake.

►A Snowflake Schema is an extension of a Star Schema that adds further dimension tables.

►The dimension tables are normalized, which splits records into additional tables. E.g., Country is
normalized out to reduce the redundancy present in a star schema.

►The snowflake schema is a variant of the star schema model, where some dimension tables are
normalized, thereby further splitting the data into additional tables.

►The resulting schema graph forms a shape similar to a snowflake.

►The major difference between the snowflake and star schema models is that the dimension tables of the
snowflake model may be kept in normalized form. Such tables are easy to maintain and save storage
space, because a dimension table can grow extremely large when the entire dimensional structure is
included as columns.
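The normalization step can be sketched by snowflaking a location dimension: the star version repeats
the country in every row, while the snowflake version stores it once in a separate table. The table
contents below are hypothetical.

```python
# Snowflaking a dimension: split the denormalized location table in two.

# Star schema: one denormalized dimension table (country repeated per city).
location_star = {
    1: {"city": "Addis Ababa", "country": "Ethiopia"},
    2: {"city": "Hawassa",     "country": "Ethiopia"},
}

# Snowflake schema: the city table references a separate country table.
country_dim = {10: {"country": "Ethiopia"}}
location_snow = {
    1: {"city": "Addis Ababa", "country_key": 10},
    2: {"city": "Hawassa",     "country_key": 10},
}

# The same lookup now needs one extra join, but "Ethiopia" is stored once.
loc = location_snow[2]
print(loc["city"], country_dim[loc["country_key"]]["country"])
```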
Fact Constellation Schema
►Fact Constellation Schema shares the fact table of multiple dimensions.

►It is also termed as Galaxy Schema.

►The schema is considered as a group of stars, which is the reason for the name Galaxy Schema.

►Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables.

►This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema
or a fact constellation.

Cont…;
►A fact constellation schema is shown in Figure.

►This schema specifies two fact tables, sales and shipping.

►The sales table definition is identical to that of the star schema.

►The shipping table has five dimensions, or keys: item key, time key, shipper key, from location,
and to location, and two measures: dollars cost and units shipped.

►A fact constellation schema allows dimension tables to be shared between fact tables.

►For example, the dimensions tables for time, item, and location are shared between both the
sales and shipping fact tables.

Measures in Data warehouses
Measures: Their Categorization and Computation:

►A data cube measure is a numerical function that can be evaluated at each point in the data cube space.

►A measure value is computed for a given point by aggregating the data corresponding to the respective
dimension-value pairs defining the given point.

►Measure: a function evaluated on aggregated data corresponding to given dimension-value pairs.

►Measures can be organized into three categories based on the kind of aggregate functions used.

1) Distributive,

2) Algebraic,

3) Holistic

Cont…;
Distributive:-

►An aggregate function is distributive if it can be computed in a distributed manner.

►Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n
aggregate values.

►If the result derived by applying the function to the n aggregate values is the same as that derived by
applying the function to the entire data set (without partitioning), the function can be computed in a
distributed manner.

►A measure is distributive if it is obtained by applying a distributive aggregate function.

►Distributive measures can be computed efficiently because they can be computed in a distributive
manner. For example, sum(), count(), min(), and max() are distributive aggregate functions.
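The distributive property can be checked directly: applying sum() to per-partition sums gives the
same result as applying sum() to the whole data set. The data values are arbitrary.

```python
# Distributive measure sketch: sum() computed per partition, then combined.
data = [4, 8, 15, 16, 23, 42]

partitions = [data[:3], data[3:]]            # partition the data into n = 2 sets
partial_sums = [sum(p) for p in partitions]  # apply the function to each partition
combined = sum(partial_sums)                 # apply the function to the n aggregates

# Same result as applying sum() to the entire, unpartitioned data set.
print(combined, sum(data), combined == sum(data))
```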

Cont…;
Algebraic:-

►An aggregate function is algebraic if it can be computed by an algebraic function with M arguments
(where M is a bounded positive integer), each of which is obtained by applying a distributive
aggregate function.

►E.g., avg() = sum()/count(), min_N(), standard_deviation()

Holistic:-

►An aggregate function is holistic if it is not algebraic, i.e., there is no constant bound on the
storage size needed to describe a sub-aggregate.

►E.g., median(), mode(), rank().
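The contrast between algebraic and holistic measures can be sketched as follows: avg() is rebuilt
exactly from the distributive sub-aggregates sum() and count() of each partition, while median()
needs the full sorted data. The values are arbitrary.

```python
# Algebraic vs. holistic measures over a partitioned data set.
data = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = [data[:4], data[4:]]

# Algebraic: avg() = sum()/count(), with M = 2 distributive sub-aggregates.
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)
avg = total / count

# Holistic: median() has no constant-size sub-aggregate; the median of the
# partition medians is not the overall median in general, so the full data
# must be gathered and sorted.
ordered = sorted(data)
median = (ordered[3] + ordered[4]) / 2   # even length: mean of the middle two

print(avg, median)
```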

Concept Hierarchy and Data Warehouses
►A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level,
more general concepts. Concept hierarchies allow data to be handled at varying levels of abstraction.
OLAP Operations on Multidimensional Data
►Roll-up:- The roll-up operation performs aggregation on a data cube, either by climbing up a concept
hierarchy for a dimension or by dimension reduction. For example, a location hierarchy can be defined
as the total order street < city < province_or_state < country.

►Drill-down:- Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed
data. Drill-down can be realized either by stepping down a concept hierarchy for a dimension or by
introducing additional dimensions. For example, with a time hierarchy defined as day < month <
quarter < year, drill-down on the central cube descends the hierarchy from the level of quarter to the
more detailed level of month.
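Roll-up along a concept hierarchy can be sketched by mapping each month to its quarter and
aggregating. The month-to-quarter mapping shown is partial and the sales figures are illustrative.

```python
# Roll-up sketch: aggregate sales from the month level to the quarter level
# along the time hierarchy day < month < quarter < year.

month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

monthly_sales = {"Jan": 100, "Feb": 150, "Mar": 120, "Apr": 200}

quarterly_sales = {}
for month, amount in monthly_sales.items():
    q = month_to_quarter[month]                      # climb the hierarchy
    quarterly_sales[q] = quarterly_sales.get(q, 0) + amount

print(quarterly_sales)   # month-level values rolled up to quarter level
```

Drill-down is the inverse direction: it requires the detailed month-level data, which is why cubes
keep (or can recompute) the finer granularity.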
Cont…;
►Slice and dice:- The slice operation performs a selection on one dimension of the given cube, resulting
in a sub-cube.

►An example is a slice operation where the sales data are selected from the central cube for the
dimension time using the criterion time = "Q2".

►The dice operation defines a sub cube by performing a selection on two or more dimensions.

►Pivot (rotate):- Pivot is a visualization operation that rotates the data axes in view in order to
provide an alternative presentation of the data.

►An example is a pivot operation where the item and location axes in a 2-D slice are rotated.
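Slice and dice can be sketched over a list of cube cells: slice selects on one dimension, dice on
two or more. The cell values are illustrative.

```python
# Slice and dice sketch over toy cube cells.
cells = [
    {"time": "Q1", "item": "tea",    "location": "Addis Ababa", "sales": 100},
    {"time": "Q2", "item": "tea",    "location": "Addis Ababa", "sales": 150},
    {"time": "Q2", "item": "coffee", "location": "Hawassa",     "sales": 200},
]

# Slice: selection on ONE dimension (time = "Q2") -> a sub-cube.
slice_q2 = [c for c in cells if c["time"] == "Q2"]

# Dice: selection on TWO OR MORE dimensions (time = "Q2" AND item = "tea").
dice = [c for c in cells if c["time"] == "Q2" and c["item"] == "tea"]

print(len(slice_q2), len(dice))
```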

OLAP Server Architectures
► Relational OLAP (ROLAP)
Use a relational or extended-relational DBMS to store and manage warehouse data, with OLAP
middleware to support missing pieces
Include optimization of DBMS backend, implementation of aggregation navigation logic, and
additional tools and services
Greater scalability

► Multidimensional OLAP (MOLAP)


Array-based multidimensional storage engine (sparse matrix techniques)

Fast indexing to pre-computed summarized data

Cont…;
►Hybrid OLAP (HOLAP)
 Offers user flexibility, e.g., low-level detail data stored relationally, high-level aggregations in arrays

►Specialized SQL servers


Specialized support for SQL queries over star/snowflake schemas

Star-Net Query Model

►A star-net model is used to query a multidimensional database.

►In this model, there exists a central point from which radial lines emanate.

►The radial lines represent the concept hierarchies of the dimensions.

►An abstraction level within a hierarchy is called a footprint.

Cont…;
►A star-net model is shown as below:-

►We can apply different OLAP operations such as:

►Roll-up: Roll-up is applied to the location dimension, from street up to country.

►Drill-down: Drill-down is applied to the customer dimension, from a group of customers down to the
name of an individual customer.

►Generalization: Moving up the hierarchy, e.g., generalizing the day level of the time dimension to year.

►Specialization: Moving down the hierarchy, e.g., moving from the year level of the time dimension to day.

Data Warehouse Design Process
►A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both.

1. The top-down approach starts with the overall design and planning. It is useful in cases
where the technology is mature and well known, and where the business problems that must
be solved are clear and well understood.

2. The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organization to move
forward at considerably less expense and to evaluate the benefits of the technology before
making significant commitments.

Cont…;
3. In the combined approach, an organization can exploit the planned and strategic nature of the
top-down approach while retaining the rapid implementation and opportunistic application of the
bottom-up approach.

Data Warehouse Models
►From the architecture point of view, there are three data warehouse models:-

Enterprise Warehouse

► An enterprise warehouse collects all of the information about subjects spanning the entire organization.

► It provides corporate-wide data integration, usually from one or more operational systems or external
information providers, and is cross-functional in scope.

►It typically contains detailed data as well as summarized data, and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond.

►An enterprise data warehouse may be implemented on traditional mainframes, computer super servers,
or parallel architecture platforms.

►It requires extensive business modeling and may take years to design and build.
Data Mart
►A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The
scope is confined to specific selected subjects.

►For example, a marketing data mart may confine its subjects to customer, item, and sales. The data
contained in data marts tend to be summarized.

►The implementation cycle of a data mart is more likely to be measured in weeks rather than months or
years. However, it may involve complex integration in the long run if its design and planning were not
enterprise-wide.

►Depending on the source of data, data marts can be categorized as independent or dependent.
Independent data marts are sourced from data captured from one or more operational systems or external
information providers, or from data generated locally within a particular department or geographic area.

►Dependent data marts are sourced directly from enterprise data warehouses.
Virtual Warehouse
►A virtual warehouse is a set of views over operational databases.

►For efficient query processing, only some of the possible summary views may be materialized.

►A virtual warehouse is easy to build but requires excess capacity on operational database
servers.

OLAP functionalities on Data warehouses
►Other OLAP operations may include ranking the items, computing average, finding out depreciation,
converting the currency or performing some statistical operations.

Functions of OLAP

►OLAP operations can perform summarization, aggregation at different levels of granularity.

►It can generate hierarchies at the dimension level

►OLAP has models for prediction and analysis of trends and patterns, as well as statistical analysis.

►OLAP has analytical capabilities for calculation, ratios and variance derivation across dimensions.

►OLAP operations are done for business operations and handling a huge amount of data.

April 5, 2025 Data Mining
