
Data Mining and

Data Warehousing
B.E. Computer
8th Semester
Unit 1 : Introduction to Data
Mining and Data Warehousing
What is Data?
• A representation of facts, concepts, or
instructions in a formal manner suitable
for communication, interpretation, or
processing by human beings or by
computers.
Figure: The data hierarchy, from Data at the base up through Information and Knowledge to Wisdom at the top.
Review of basic concepts of data warehousing and data mining

• The explosive growth of data: from terabytes to petabytes
• Data accumulate and double roughly every 9 months
• High dimensionality of data
• High complexity of data
• New and sophisticated applications
• There is a big gap between stored data and knowledge, and the transition will not occur automatically.
• Manual data analysis is not new, but it has become a bottleneck.
• Fast-developing computer science and engineering generates new demands.
Very Large Databases

• Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes

• Petabytes -- 10^15 bytes: Geographic Information Systems

• Exabytes -- 10^18 bytes: National Medical Records

• Zettabytes -- 10^21 bytes: Weather images

• Yottabytes -- 10^24 bytes: Intelligence Agency Videos


Data explosion problem
Automated data collection tools and mature
database technology lead to tremendous
amounts of data accumulated and/or to be
analyzed in databases, data warehouses, and
other information repositories

We are drowning in data, but starving for knowledge!

Solution:
“Necessity is the mother of invention”—Data
Warehousing and Data Mining
What is Data Mining?

The art/science of extracting non-trivial, implicit, previously unknown, valuable, and potentially useful information from a large database.
Data mining is
•"Extraction of interesting information or patterns
from data in large databases is known as data mining."
•A hot buzzword for a class of techniques that find
patterns in data
•A user-centric, interactive process which leverages
analysis technologies and computing power
•A group of techniques that find relationships that
have not previously been discovered
•A relatively easy task that requires knowledge of the
business problem/subject matter expertise
• Data mining is a logical process that is used to search through large amounts of data in order to find useful information.
• The goal of this technique is to find patterns that were previously unknown.
• Once these patterns are found, they can be used to make decisions for the development of a business.
• Three steps involved are:
– Exploration
– Pattern identification
– Deployment
• Exploration: In this first step, data is cleaned and transformed into a suitable form, and the important variables and the nature of the data are determined based on the problem.
• Pattern identification: Once the data has been explored, refined, and defined for the specific variables, the second step is pattern identification: identify and choose the patterns that make the best predictions.
• Deployment: The chosen patterns are deployed to produce the desired outcome.
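
As a rough illustration only, the sketch below walks through the three steps in Python with pandas and scikit-learn (both assumed to be installed); the file name and column names are hypothetical.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Exploration: clean the data and determine the important variables.
df = pd.read_csv("customers.csv")                      # hypothetical input file
df = df.dropna(subset=["age", "income", "churned"])    # drop incomplete records
X, y = df[["age", "income"]], df["churned"]

# 2. Pattern identification: fit a candidate model and keep the best predictor.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# 3. Deployment: apply the discovered pattern to new data to support decisions.
new_customers = pd.DataFrame({"age": [34, 58], "income": [42000, 87000]})
print(model.predict(new_customers))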
• Simply, data mining refers to extracting or "mining"
knowledge from large amounts of data.
• "Mining" is a vivid term characterizing the process that
finds a small set of precious nuggets from a great deal of
raw material.
Data mining is not
• Brute-force crunching of bulk data
• “Blind” application of algorithms
• Going to find relationships where none exist
• Presenting data in different ways
• A difficult to understand technology requiring
an advanced degree in computer science
• Searching a phone number in a phone book
• Searching a keyword on Google
• Generating histograms of salaries for different
age groups
• Issuing SQL query to a database and reading
the reply
Data mining is not
• A cybernetic magic that will turn your data into gold; rather, it is a process and result of knowledge production, knowledge discovery and knowledge management.
• Finished once the patterns are found.
• Queries to the database: queries alone are not data mining.
Why use Data Mining today?
Because it can improve customer service, better target marketing
campaigns, identify high-risk clients, and improve production
processes.

Data mining has been used to:


• Identify unexpected shopping patterns in supermarkets.
• Optimize website profitability by making appropriate offers to
each visitor.
• Predict customer response rates in marketing campaigns.
• Define new customer groups for marketing purposes.
• Predict customer defections: which customers are likely to switch
to an alternative supplier in the near future.
• Distinguish between profitable and unprofitable customers.
 Data analysis and decision support
◦ Market analysis and management
 Target marketing, customer relationship management (CRM), market
basket analysis, cross selling, market segmentation
◦ Risk analysis and management
 Forecasting, customer retention, improved underwriting, quality control,
competitive analysis
◦ Fraud detection and detection of unusual patterns (outliers)
 telecommunications, financial, insurance industries
 Other Applications
◦ Text mining (news group, email, documents) and Web mining
◦ Stream data mining
◦ Bioinformatics and bio-data analysis
• Product development - biotechnology, pharmaceutical industry
• Entertainment - digital convergence, sports
• Diagnosis and monitoring - medical, aerospace, automotive.
APPLICATIONS OF DATA MINING
• Data mining has many and varied fields of application some of
which are listed below:
• Sales/Marketing
– Identify buying patterns from customers
– Find associations among customer demographic characteristics
– Predict response to mailing campaigns
– Market basket analysis.
• Banking
– Credit card fraud detection
– Identify 'loyal' customers
– Predict customers likely to change their credit card affiliation
– Determine credit card spending by customer groups
– Find hidden correlations between different financial indicators
– Identify stock trading rules from historical market data
• Insurance and Health Care
– Claims analysis i.e., which medical procedures are
claimed together
– Predict which customers will buy new policies
– Identify behavior patterns of risky customers
– Identify fraudulent behavior
• Transportation
– Determine the distribution schedules among outlets
– Analyze loading patterns
• Medicine
– Characterize patient behavior to predict office visits
– Identify successful medical therapies for different
illnesses
DISADVANTAGES OF DATA MINING
• Privacy issues
• Concerns about personal privacy have been increasing enormously in recent years, especially as the internet booms with social networks, e-commerce, forums, blogs, etc.
• Because of privacy issues, people are afraid that their personal information will be collected and used in unethical ways, potentially causing them a lot of trouble.
• Security issues
• Businesses hold information about their employees and customers, including social security numbers, birthdays, payroll data, etc.
• There have been many cases in which hackers accessed and stole large amounts of customer data from big corporations such as Ford Motor Credit Company and Sony. With so much personal and financial information available, credit card theft and identity theft have become big problems.
• Misuse of information / inaccurate information
• Information collected through data mining that is intended for marketing or other ethical purposes can be misused.
• This information may be exploited by unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of people.
Functions of Data mining
• Data mining has five main functions:
• Classification: infers the defining characteristics of a certain
group (such as customers who have been lost to competitors).
• Clustering: identifies groups of items that share a particular
characteristic. (Clustering differs from classification in that no
predefining characteristic is given in classification.)
• Association: identifies relationships between events that occur
at one time (such as the contents of a shopping basket).
• Sequencing: similar to association, except that the relationship exists over a period of time (such as repeat visits to a supermarket or use of a financial planning product).
• Forecasting: estimates future values based on patterns within
large sets of data (such as demand forecasting).
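
As a concrete illustration of just the association function, here is a small, self-contained Python sketch that counts item pairs appearing together in invented shopping baskets; the other four functions would use different algorithms (classifiers, clustering, sequence and forecasting models).

from itertools import combinations
from collections import Counter

# Invented shopping baskets (market basket data).
baskets = [
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"milk", "diapers", "beer"},
    {"bread", "butter"},
]

# Association: count how often each pair of items occurs in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Report pairs whose support meets a minimum threshold.
min_count = 2
for pair, count in pair_counts.items():
    if count >= min_count:
        print(pair, "support =", count / len(baskets))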
Data Mining: Confluence of Multiple Disciplines

Figure: Data mining draws on database systems, statistics, machine learning, visualization, algorithms, and other disciplines.
Knowledge Discovery in Databases (KDD)
Process
• Many people treat data mining as a synonym for
another popularly used term, Knowledge
Discovery in Database, or KDD.
• Alternatively, others view data mining as simply
an essential step in the process of knowledge
discovery.
• Data mining, or knowledge discovery in databases
(KDD) as it is also known, is the nontrivial
extraction of implicit, previously unknown, and
potentially useful information from data.
Some Alternative names to data mining are:
– Knowledge discovery (mining) in databases
(KDD)
– Knowledge extraction
– Data/pattern analysis
– Data archeology
– Data Dredging
– Information Harvesting
– Business intelligence, etc.
Figure: Data mining as a step in the process of knowledge discovery.
THE PROCESS OF KNOWLEDGE DISCOVERY

• The main steps of the knowledge discovery process are:
– Identify business Problem
– Data mining
– Action
– Evaluation and measurement
– Deployment and integration into businesses
processes.
• Data cleaning (to remove noise or irrelevant data):
– Data cleaning is the process of ensuring that, for data
mining purposes, the data is uniform in terms of key and
attributes usage.
– Data cleaning is separate from data enrichment and data
transformation because data cleaning attempts to correct
misused or incorrect attributes in existing data.
– An important element in a cleaning operation is the de-
duplication of records.
• Data integration (where multiple data sources may be
combined)
• Data selection: There are two parts to selecting data for data
mining:
– locating data
– identifying data
• Data enrichment:
– Data enrichment is the process of adding new attributes,
such as calculated fields or data from external sources, to
existing data.
– Most references on data mining tend to combine this step
with data transformation.
– Data transformation involves the manipulation of data,
but data enrichment involves adding information to
existing data.
– This can include combining internal data with external
data, obtained from either different departments or
companies or vendors that sell standardized industry-
relevant data.
• Data transformation:
– Data transformation, in terms of data mining, is the process of
changing the form or structure of existing attributes.
– Data transformation is separate from data cleansing and data
enrichment for data mining purposes because it does not
correct existing attribute data or add new attributes, but
instead grooms existing attributes for data mining purposes.
• Data mining:
– The data mining step may interact with the user or a
knowledge base.
– The interesting patterns are presented to the user, and may be
stored as new knowledge in the knowledge base.
– Data mining is the process of discovering interesting
knowledge from large amounts of data stored either in
databases, data warehouses, or other information
repositories.
• Pattern evaluation:
• Pattern evaluation is used to identify the truly interesting
patterns representing knowledge based on some
interestingness measures.
• Knowledge presentation (where visualization and knowledge
representation techniques are used to present the mined
knowledge to the user).
• Visualization techniques are a very useful method of
discovering patterns in datasets, and may be used at the
beginning of a data mining process to get a rough feeling of
the quality of the data set and where patterns are to be
found.
• Scatter diagrams can be used to identify interesting subsets of
the data sets so that we can focus on the rest of the data
mining process.
• The steps involved in data mining when viewed as a process of
knowledge discovery are as follows:
• Data cleaning, a process that removes or transforms noise and
inconsistent data.
• Data integration, where multiple data sources may be combined.
• Data selection, where data relevant to the analysis task are
retrieved from the database.
• Data transformation, where data are transformed or consolidated
into forms appropriate for mining.
• Data mining, an essential process where intelligent and efficient
methods are applied in order to extract patterns.
• Pattern evaluation, a process that identifies the truly interesting
patterns representing knowledge, based on some interestingness
measures.
• Knowledge presentation, where visualization and knowledge
representation techniques are used to present the mined
knowledge to the user.
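
The sketch below strings these steps together on two hypothetical source files using pandas; it only shows where each step would sit in code, not a production pipeline, and all names are assumptions.

import pandas as pd

# Data integration: combine two hypothetical sources.
sales = pd.read_csv("sales.csv")           # hypothetical operational extract
regions = pd.read_csv("regions.csv")       # hypothetical reference data
data = sales.merge(regions, on="store_id", how="left")

# Data cleaning: remove noise and inconsistent records.
data = data.drop_duplicates()
data = data[data["amount"] > 0]

# Data selection: keep only attributes relevant to the analysis task.
data = data[["region", "product", "amount"]]

# Data transformation: consolidate into a form appropriate for mining.
summary = data.groupby(["region", "product"], as_index=False)["amount"].sum()

# Data mining and pattern evaluation: flag unusually high region/product totals.
threshold = summary["amount"].mean() + 2 * summary["amount"].std()
interesting = summary[summary["amount"] > threshold]

# Knowledge presentation: present the mined patterns to the user.
print(interesting)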
WHAT ARE THE ISSUES IN DATA MINING?

• Security and social issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making.
• User interface issues: The knowledge discovered by
data mining tools is useful as long as it is
interesting, and above all understandable by the
user. The major issues related to user interfaces and
visualization are "screen real-estate", information
rendering, and interaction.
• Mining methodology issues: These issues are
relevant to the data mining approaches applied
and their limitations.
• Performance issues: These concern the scalability and efficiency of data mining methods when processing considerably large data.
• Data source issues: We are storing different types
of data in a variety of repositories. It is difficult to
expect a data mining system to effectively and
efficiently achieve good mining results on all
kinds of data and sources.
Thank you !!!
Unit 2 : Introduction to Data
Warehousing
Data Warehouse
What is Data Warehouse?
• According to W. H. Inmon, a data warehouse is a
subject-oriented, integrated, time-variant,
nonvolatile collection of data in support of
management decisions.
• “A data warehouse is a copy of transaction data
specifically structured for querying and reporting” –
Ralph Kimball
• Data Warehousing is the process of building a data
warehouse for an organization.
• Data Warehousing is a process of transforming data
into information and making it available to users in a
timely enough manner to make a difference
Defining features:
• Four keywords:
– subject-oriented,
– integrated,
– time-variant,
– nonvolatile
• These keywords distinguish data warehouses
from other data repository systems, such as
relational database systems, transaction
processing systems and file systems.
Subject Oriented
• Focus is on Subject Areas
rather than Applications
• Organized around major
subjects, such as customer,
product, sales.
• Provide a simple and
concise view around
particular subject issues by
excluding data that are not
useful in the decision
support process.
Integrated
•Constructed by
integrating multiple,
heterogeneous data
sources
•Integration tasks handle naming conventions and physical attributes of data
•Must be made
consistent.
Time Variant

• Only accurate and valid at some point in time or over some time interval.
• The time horizon for the data warehouse is significantly longer
than that of operational systems.
Operational database provides current value data.
Data warehouse data provide information from a historical
perspective (e.g., past 5-10 years)
Non Volatile

• The data warehouse is relatively static in nature.
• It is not updated in real time; data in the data warehouse is loaded and refreshed from operational systems and is not updated by end users.
Data warehousing helps business managers to :
– Extract data from various source systems on
different platforms
– Transform huge data volumes into meaningful
information
– Analyze integrated data across multiple business
dimensions
– Provide access of the analyzed information to the
business users anytime anywhere
• Data warehouse contains five types of data
– Older detail data,
– Current detail data,
– Lightly summarized data,
– Highly summarized data, and
– Meta data.
• Goals of Data Warehousing
– To help reporting as well as analysis
– Maintain organization's historical information
– Be an adaptive and resilient source of information
– Be the foundation for decision-making.
Need of Data Warehouse
• Business user: Business users require data warehouse to view
summarized data from past. Since these people are non-
technical, the data may be presented to them in a very simple
form.
• Store historical data: Data warehouse is required to store the
time variable data from past to be used for various purposes.
• Make Strategic decisions: Some strategies may be depending
upon the data in data warehouse.
• For data consistency and quality: the warehouse brings uniformity and consistency to data drawn from many sources.
• Quick response time: the data warehouse has to be ready for fairly unexpected loads and types of queries, which demands a high degree of flexibility and a quick response time.
BENEFITS OF IMPLEMENTING A DATA WAREHOUSE
• To provide a single version of truth about enterprise
information.
• To speed up ad hoc reports and queries that involves
aggregations across many attributes which are
resource intensive.
• To provide a system in which managers that do not
have a strong technical background are able to run
complex queries.
• To provide a database that stores relatively clean
data.
• To provide a database that stores historical data that
may have been deleted from the OLTP systems.
BENEFITS OF IMPLEMENTING A DATA WAREHOUSE
• Improve data quality by providing consistent codes
and descriptions, flagging or even fixing bad data.
• Provide the organization's information consistently.
• Restructure the data so that it delivers excellent
query performance, even for complex analytic
queries, without impacting the operational systems.
• Add value to operational business applications,
notably customer relationship manager (CRM)
systems.
• Data warehouse helps to increase productivity and
decrease computing costs.
• The benefits of the data warehouse can be sub-divided as
– Tangible benefits
– Intangible benefits
• Tangible Benefits
– Cost of product comes down.
– Better decisions in terms of cost and quality are taken.
– Data warehouses have led to enhanced asset and liability
management since it provides clear picture of enterprise wide
purchasing and inventory patterns.
• Intangible Benefits
– Improved productivity.
– Enhanced customer relations.
– Data warehouses enable re-engineering of business processes
by providing useful insights into the work processes.
Benefits on successful implementation
of Data Warehousing
• Queries do not impact Operational systems
• Provides quick response to queries for reporting
• Enables Subject Area Orientation
• Integrates data from multiple, diverse sources
• Enables multiple interpretations of same data by
different users or groups
• Provides thorough analysis of data over a period of time
• Accuracy of Operational systems can be checked
• Provides analysis capabilities to decision makers
• Increase customer profitability
• Cost-effective decision making
• Manage customer and business partner relationships
• Manage risk, assets and liabilities
• Integrate inventory, operations and manufacturing
• Reduction in time to locate, access, and analyze information (link multiple locations and geographies)
• Identify developing trends and reduce time to market
• Strategic advantage over competitors
• Potential high returns on investment
• Competitive advantage
• Increased productivity of corporate decision-
makers
• Provide reliable, High performance access
• Consistent view of Data: Same query, same
data. All users should be warned if data load
has not come in.
• Quality of data is a driver for business re-
engineering.
Usage of Data Warehouse
• The traditional role of a data warehouse is to collect and
organize historical business data so it can be analyzed to
assist management in making business decisions.
• It puts information technology to work to help the knowledge worker make faster and better decisions.
• Used to manage and control business
• Data is historical or point-in-time
• Optimized for inquiry rather than update
• Use of the system is loosely defined and can be ad-hoc
• Used by managers and end-users to understand the
business and make judgements
Advantages of Data Warehousing
• Potential high Return on Investment (RoI)
• Competitive Advantage
• Increased productivity of corporate Decision Makers
• Standardizes data across an organization
• Smarter decisions for companies – moves towards
fact-based decisions.
• Reduces costs- drops products that are not doing
well
• Increases revenue – works on high selling products.
Problems in Data Warehousing
• Underestimation of resources for data loading
• Hidden problems with source systems
• Required data not captured
• Increased end-user demands
• Data homogenization
• High demand for resources
• Data ownership
• High maintenance
• Long duration projects
• Complexity of integration
OLTP (Database) vs. Data Warehouse
• Online Transaction Processing (OLTP) systems are tuned
for known transactions and workloads while workload is
not known a priori in a data warehouse
• OLTP applications normally automate clerical data
processing tasks of an organization, like data entry and
enquiry, transaction handling, etc. (access, read, update)
• Special data organization, access methods and
implementation methods are needed to support data
warehouse queries (typically multidimensional queries)
– e.g., average amount spent on phone calls between 9AM-5PM
in Kathmandu during the month of March, 2012
OLTP vs. Data Warehouse (OLTP characteristic listed first, data warehouse characteristic second):
• Application oriented vs. Subject oriented
• Used to run business vs. Used to analyze business
• Detailed data vs. Summarized and refined data
• Current, up-to-date data vs. Snapshot data
• Isolated data vs. Integrated data
• Repetitive access vs. Ad-hoc access
• Clerical user vs. Knowledge user (manager)
• Performance sensitive vs. Performance relaxed
• Few records accessed at a time (tens) vs. Large volumes accessed at a time (millions)
• Read/update access vs. Mostly read (batch update)
• No data redundancy vs. Redundancy present
• Database size 100 MB - 100 GB vs. Database size 100 GB - a few terabytes
• Transaction throughput is the performance metric vs. Query throughput is the performance metric
• Thousands of users vs. Hundreds of users
• Managed in entirety vs. Managed by subsets
Difference between Operational System and Data Warehouse (operational system characteristic listed first, data warehouse characteristic second):
• Holds current data vs. Holds historic data
• Data is dynamic vs. Data is largely static
• Read/write accesses vs. Read-only accesses
• Repetitive processing vs. Ad hoc complex queries
• Transaction driven vs. Analysis driven
• Application oriented vs. Subject oriented
• Used by clerical staff for day-to-day operations vs. Used by top managers for analysis
• Normalized data model (ER model) vs. De-normalized data model (dimensional model)
• Must be optimized for writes and small queries vs. Must be optimized for queries involving a large portion of the warehouse
Data Warehouse Applications
• Financial services
• Banking services
• Consumer goods Industry
• Retail sectors
• Controlled manufacturing
• Transportation Industry
• Telephone Industry
• Services Sector
Data Warehouse Applications
• The Retailers
• Manufacturing and Distribution Industry
• Insurance
• Hospitality Industry
• Healthcare
• Government and Education
• Biological data analysis
• Logistic and inventory management
• Trend analysis
• Agriculture
Warehouse Products
• Computer Associates -- CA-Ingres
• Hewlett-Packard -- Allbase/SQL
• Informix -- Informix, Informix XPS
• Microsoft -- SQL Server
• Oracle -- Oracle7, Oracle Parallel Server
• Red Brick -- Red Brick Warehouse
• SAS Institute -- SAS
• Software AG -- ADABAS
• Sybase -- SQL Server, IQ, MPP
Thank you !!!
Unit 3 : Data Warehouse
Logical and Physical Design
Data Warehouse Logical Design
• Logical design is the phase of a database
design concerned with identifying the
relationships among the data elements.
• The logical design should result in
– A set of entities and attributes
corresponding to fact tables and dimension
tables.
– A model of operational data from your
source into subject-oriented information in
your target data warehouse schema.
• The steps for logical data model are
indicated below:
– Identify all entities.
– Identify primary keys for all entities.
– Find the relationships between
different entities.
– Find all attributes for each entity.
– Resolve all many-to-many entity relationships.
– Normalize if required.
Figure: Sample E-R Diagram
Multi-dimensional data model
• The multidimensional data model is an integral part of
On-Line Analytical Processing, or OLAP.
• Because OLAP is on-line, it must provide answers quickly.
• Because OLAP is also analytic, the queries are complex.
The multidimensional data model is designed to solve
complex queries in real time.
• The multidimensional data model is composed of logical
cubes, measures, dimensions, hierarchies, levels, and
attributes.
• A data warehouse is based on a multidimensional data
model which views data in the form of a data cube
• The lattice of cuboids forms a data cube.
Data Cube
• A data cube provides a multidimensional view
of data and allows the pre-computation and
fast accessing of summarized data.
• A data cube allows data to be modeled and
viewed in multiple dimensions. It is defined by
dimensions and facts.
• A data cube is a three- (or higher) dimensional
array of value.
Example of 2D view of sales
Example of 3D view of sales
Example of 3D data cube
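
Following the examples above, here is a small sketch of how such multidimensional views can be computed from flat sales records with pandas; the figures are invented.

import pandas as pd

# Invented sales records with three dimensions (time, item, location) and one measure.
sales = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["TV", "PC", "TV", "PC", "TV", "PC"],
    "location": ["Kathmandu", "Kathmandu", "Kathmandu", "Pokhara", "Pokhara", "Pokhara"],
    "amount":   [400, 300, 250, 500, 150, 350],
})

# A 2-D view of the cube: totals by quarter and item, summed over all locations.
view_2d = sales.pivot_table(index="quarter", columns="item",
                            values="amount", aggfunc="sum", margins=True)
print(view_2d)

# A 3-D view: one cuboid of the lattice, grouped by all three dimensions.
view_3d = sales.groupby(["quarter", "item", "location"])["amount"].sum()
print(view_3d)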
Dimensions and Measures
• Every dimensional model (DM) is composed of
one table with a composite primary key, called
the fact table, and a set of smaller tables
called dimension tables.
• Each dimension table has a simple (non-
composite) primary key that corresponds
exactly to one of the components of the
composite key in the fact table.
• In general terms, dimensions are the
perspectives or entities with respect to
which an organization wants to keep
records.
• For example, AllElectronics may create a
sales data warehouse in order to keep
records of the store’s sales with respect
to the dimensions time, item, branch,
and location.
• Each dimension may have a table
associated with it, called a dimension
table, which further describes the
dimension.
• For example, a dimension table for item
may contain the attributes item name,
brand, and type. Dimension tables can
be specified by users or experts, or
automatically generated and adjusted
based on data distributions.
Fact Table
• A multidimensional data model is
typically organized around a central
theme, like sales, for instance.
• This theme is represented by a fact table.
Facts are numerical measures.
• Think of them as the quantities by which
we want to analyze relationships
between dimensions.
• The fact table contains the names of the
facts, or measures, as well as keys to each
of the related dimension tables.
• A fact table's primary key is composed of two or more foreign keys, and the table usually also contains numeric data.
• Because it always contains at least two foreign keys, a fact table always expresses a many-to-many relationship.
• Fact tables contain business event details
for summarization.
• The logical model for a fact table contains
a foreign key column for the primary keys
of each dimension.
• The combination of these foreign keys
defines the primary key for the fact table.
• Fact tables are often very large,
containing hundreds of millions of rows
and consuming hundreds of gigabytes or
multiple terabytes of storage.
Dimension table
• A dimension table stores data about the ways
in which the data in the fact table can be
analyzed.
• Contrary to fact tables, dimension tables
contain descriptive attributes (or fields) that
are typically textual fields.
• A dimension table may be used in multiple
places if the data warehouse contains multiple
fact tables or contributes data to data marts.
For example, a time dimension often contains the hierarchy elements:
(all time), Year, Quarter, Month, Day, or (all time), Year Quarter, Week,
Day.
A Data Warehouse Schema
• The schema is a logical description of the entire
database.
• The schema includes the name and description of
records of all record types including all associated
data-items and aggregates.
• The most popular data model for a data
warehouse is a multidimensional model.
• Such a model can exist in the following forms
– a star schema
– a snowflake schema
– a fact constellation schema.
Star Schema
• The most popular schema design for data
warehouses is the Star Schema.
• Each dimension is stored in a dimension table
and each entry is given its own unique
identifier.
• The dimension tables are related to one or
more fact tables.
• The fact table contains a composite key made
up of the identifiers (primary keys) from
dimension tables.
Star Schema
• The fact table also contains facts about the
given combination of dimensions. For
example, a combination of store_key,
date_key and product_key giving the amount
of a certain product sold on a given day at a
given store.
• Fact table has foreign keys to all dimension
tables in a star schema. In the example, there
are three foreign keys (date key, product key,
and store key).
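
A minimal sketch of this sales star schema, created as SQLite tables from Python; the table and column names mirror the example above but are otherwise assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: simple (non-composite) primary keys.
CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- Fact table: a composite key made up of foreign keys to the dimensions, plus measures.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    amount_sold REAL,
    units_sold  INTEGER,
    PRIMARY KEY (date_key, product_key, store_key)
);
""")

# A typical star query: join the fact table to its dimensions and aggregate.
rows = conn.execute("""
    SELECT d.month, p.brand, SUM(f.amount_sold)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.month, p.brand
""").fetchall()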
Star Schema
• Fact tables are normalized, whereas dimension
tables are not.
• Fact tables are very large as compared to
dimension tables.
• Other names: star-join schema, data cube, data list, grid file and multi-dimension schema.
• The facts in a star schema are of the following
three types
– Fully-additive
– Semi-additive
– Non-additive
Advantages of Star Schema
• A star schema is very simple from the users‘
point of view.
• Queries are never complex because the only
joins and conditions involve a fact table and a
single level of dimension tables, without the
indirect dependencies to other tables.
• Provide a direct mapping between the
business entities being analyzed by end users
and the schema design.
• Provide highly optimized performance for
typical star queries.
Advantages of Star Schema
• Are widely supported by a large number of
business intelligence tools.
• Star schemas are used for both simple data
marts and very large data warehouses.
• Star Schema is very easy to understand, even
for non-technical business manager.
• Star Schema provides better performance and
smaller query times
• Star Schema is easily extensible and will
handle future changes easily
Snowflake Schema
• A schema is called a snowflake schema if
one or more dimension tables do not join
directly to the fact table but must join
through other dimension tables.
• It is a variant of star schema model.
• It has a single, large and central fact table
and one or more tables for each
dimension.
Snowflake Schema
• The major difference between the snowflake
and star schema models is that the dimension
tables of the snowflake model may be kept in
normalized form to reduce redundancies.
• Normalization of dimension tables
• Each hierarchical level has its own table
• easy to maintain and less memory space is
required
• a lot of joins can be required if they involve
attributes in secondary dimension tables
Difference between Star Schema and Snow-
flake Schema
• A star schema is a multi-dimensional model in which each disjoint dimension is represented in a single table.
• A snowflake schema is a normalized multi-dimensional model in which each disjoint dimension is represented in multiple tables.
• Star schema can become a snow-flake
• Both star and snowflake schemas are dimensional models;
the difference is in their physical implementations.
• Snowflake schemas support ease of dimension
maintenance because they are more normalized.
• Star schemas are easier for direct user access and often
support simpler and more efficient queries.
Fact Constellation Schema
• Sophisticated applications may require multiple fact
tables to share dimension tables.
• This kind of schema can be viewed as a collection of
stars, and hence, galaxy schema or a fact constellation.
• In a fact constellation, there are multiple fact tables.
• In the Fact Constellations, aggregate tables are created
separately from the detail, therefore, it is impossible to
pick up other queries from different tables.
• Fact Constellation is a good alternative to the Star, but
when dimensions have very high cardinality, the sub-
selects in the dimension tables can be a source of delay.
Starflake Schema
• A starflake schema is a combination of a star
schema and a snowflake schema.
• Starflake schemas are snowflake schemas
where only some of the dimension tables have
been de-normalized.
Materialized View
• Materialized views are query results that have
been stored in advance so long-running
calculations are not necessary when you
actually execute your SQL statements.
• Materialized views can be best explained by
Multidimensional lattice.
• It is useful to materialize a view when:
– It directly solves a frequent query
– It reduces the costs of some queries
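
SQLite, used here from Python for illustration, has no native materialized views, so the sketch below simply stores a query result in a summary table and reads from it later; warehouse databases such as Oracle provide this directly with CREATE MATERIALIZED VIEW.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
INSERT INTO sales VALUES ('East', 'TV', 400), ('East', 'PC', 300), ('West', 'TV', 250);

-- "Materialize" a frequent aggregate query by storing its result in advance.
CREATE TABLE mv_sales_by_region AS
    SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region;
""")

# Later queries read the precomputed summary instead of rescanning the detail rows.
print(conn.execute("SELECT * FROM mv_sales_by_region").fetchall())

# When the base data changes, the materialized result must be refreshed.
conn.executescript("""
INSERT INTO sales VALUES ('West', 'PC', 500);
DELETE FROM mv_sales_by_region;
INSERT INTO mv_sales_by_region SELECT region, SUM(amount) FROM sales GROUP BY region;
""")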
Physical Design
Physical design is the phase of a database design
following the logical design that identifies the
actual database tables and index structures used to
implement the logical design.

In the physical design, you look at the most effective way of storing and retrieving the objects, as well as handling them from a transportation and backup/recovery perspective.
Physical design decisions are mainly driven by query performance and database maintenance aspects.

During the logical design phase, you defined a model for your data warehouse consisting of entities, attributes, and relationships. The entities are linked together using relationships. Attributes are used to describe the entities. The unique identifier (UID) distinguishes between one instance of an entity and another.
Figure: Logical Design Compared with Physical Design
During the physical design process, you translate
the expected schemas into actual database
structures.

At this time, you have to map:


• Entities to tables
• Relationships to foreign key constraints
• Attributes to columns
• Primary unique identifiers to primary key
constraints
• Unique identifiers to unique key constraints
Physical Data Model

Features of physical data model include:


 Specification of all tables and columns.
 Specification of Foreign keys.
 De-normalization may be performed if
necessary.
 At this level, specification of logical data
model is realized in the database.
The steps for physical data model design
involves:
 Conversion of entities into tables,
 Conversion of relationships into
foreign keys, Conversion of attributes
into columns, and
 Changes to the physical data model
based on the physical constraints.
Figure: Logical model and physical model
Physical Design Structures
Once you have converted your logical design to a physical one,
you will need to create some or all of the following structures:
 Tablespaces
 Tables and Partitioned Tables
 Views
 Integrity Constraints
 Dimensions

Additionally, the following structures may be created for performance improvement:
 Indexes and Partitioned Indexes
 Materialized Views
Tablespaces
• A tablespace consists of one or more datafiles,
which are physical structures within the
operating system you are using.
• A datafile is associated with only one
tablespace.
• From a design perspective, tablespaces are
containers for physical design structures.
Tables and Partitioned Tables
• Tables are the basic unit of data storage. They are
the container for the expected amount of raw data
in your data warehouse.
• Using partitioned tables instead of non-partitioned
ones addresses the key problem of supporting very
large data volumes by allowing you to divide them
into smaller and more manageable pieces.
• Partitioning large tables improves performance
because each partitioned piece is more
manageable.
Views
• A view is a tailored presentation of the data
contained in one or more tables or other
views.
• A view takes the output of a query and treats
it as a table.
• Views do not require any space in the
database.
Integrity Constraints
• Integrity constraints are used to enforce
business rules associated with your database
and to prevent having invalid information in
the tables.
• In data warehousing environments,
constraints are only used for query rewrite.
• NOT NULL constraints are particularly
common in data warehouses.
Indexes and Partitioned Indexes
• Indexes are optional structures associated
with tables.
• Indexes are just like tables in that you can
partition them (but the partitioning strategy is
not dependent upon the table structure)
• Partitioning indexes makes it easier to manage
the data warehouse during refresh and
improves query performance.
Materialized Views

• Materialized views are query results that have been stored in advance so long-running calculations are not necessary when you actually execute your SQL statements.
• From a physical design point of view,
materialized views resemble tables or
partitioned tables and behave like indexes in
that they are used transparently and improve
performance.
Hardware and I/O Consideration
• I/O performance should always be a key
consideration for data warehouse designers and
administrators.
• The typical workload in a data warehouse is
especially I/O intensive, with operations such as
large data loads and index builds, creation of
materialized views, and queries over large
volumes of data.
• The underlying I/O system for a data warehouse
should be designed to meet these heavy
requirements.
• In fact, one of the leading causes of performance
issues in a data warehouse is poor I/O
configuration.
• Database administrators who have previously
managed other systems will likely need to pay
more careful attention to the I/O configuration
for a data warehouse than they may have
previously done for other environments.
• The I/O configuration used by a data warehouse
will depend on the characteristics of the specific
storage and server capabilities
There are following five high-level guidelines
for data-warehouse I/O configurations:
 Configure I/O for Bandwidth not
Capacity
 Stripe Far and Wide
 Use Redundancy
 Test the I/O System Before Building the
Database
 Plan for Growth
Parallelism
• Parallelism is the idea of breaking down a task so that,
instead of one process doing all of the work in a query,
many processes do part of the work at the same time.
• Parallel execution is sometimes called parallelism.
• Parallel execution dramatically reduces response time
for data-intensive operations on large databases
typically associated with decision support systems
(DSS) and data warehouses.
• An example of this is when four processes handle four
different quarters in a year instead of one process
handling all four quarters by itself.
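
A toy version of that quarters example using Python's multiprocessing module: four worker processes each aggregate one quarter of an invented record set instead of one process scanning everything.

from multiprocessing import Pool

# Invented (quarter, amount) records standing in for a large fact table.
SALES = [("Q1", 100), ("Q2", 150), ("Q1", 200), ("Q3", 50),
         ("Q4", 300), ("Q2", 75), ("Q3", 125), ("Q4", 80)]

def total_for_quarter(quarter):
    # Each worker process handles only its own partition of the data.
    return quarter, sum(amount for q, amount in SALES if q == quarter)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(total_for_quarter, ["Q1", "Q2", "Q3", "Q4"])
    print(dict(results))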
Parallelism improves processing for:
 Queries requiring large table scans,
joins, or partitioned index scans
 Creation of large indexes
 Creation of large tables (including
materialized views)
 Bulk inserts, updates, merges, and
deletes
Parallelism benefits systems with all of the
following characteristics:
 Symmetric multiprocessors (SMPs), clusters,
or massively parallel systems
 Sufficient I/O bandwidth
 Underutilized or intermittently used CPUs
(for example, systems where CPU usage is
typically less than 30%)
 Sufficient memory to support additional
memory-intensive processes, such as sorts,
hashing, and I/O buffers
Indexes
Indexes are optional structures associated with tables and
clusters.
Indexes are structures actually stored in the database, which
users create, alter, and drop using SQL statements.
You can create indexes on one or more columns of a table to
speed SQL statement execution on that table.
In a query-centric system like the data warehouse
environment, the need to process queries faster dominates.
Among the various methods to improve performance,
indexing ranks very high.
Indexes are typically used to speed up the retrieval of
records in response to search conditions.
Indexes can be unique or non-unique.
Unique indexes guarantee that no two rows of a
table have duplicate values in the key column (or
columns).
Non-unique indexes do not impose this restriction
on the column values.
Index structures applied in warehouses are:
 Inverted lists
 Bitmap indexes
 Join indexes
 Text indexes
 B-Tree Index
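
A small sketch using SQLite from Python: a B-tree index (the default index type in most relational databases) is created on a search column, plus a unique index; bitmap and join indexes are warehouse-specific features of products such as Oracle and are not shown here.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, email TEXT, city TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(i, f"user{i}@example.com", "Kathmandu" if i % 2 else "Pokhara")
                  for i in range(10000)])

# A non-unique B-tree index on the column used in search conditions.
conn.execute("CREATE INDEX idx_customer_city ON customer (city)")

# A unique index guarantees no two rows have duplicate values in the key column.
conn.execute("CREATE UNIQUE INDEX idx_customer_email ON customer (email)")

# The optimizer can now satisfy this query via the index instead of a full table scan.
print(conn.execute("SELECT COUNT(*) FROM customer WHERE city = 'Kathmandu'").fetchone())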
Thank you !!!
Unit 4 : Data Warehousing
Technologies and Implementation
Introduction
• Incomplete, noisy, and inconsistent data are common place
properties of large real-world databases and data warehouses.
• Incomplete data can occur for a number of reasons.
– Attributes of interest may not always be available, such as customer
information for sales transaction data.
– Other data may not be included simply because it was not
considered important at the time of entry.
– Relevant data may not be recorded due to a misunderstanding, or
because of equipment malfunctions.
– Data that were inconsistent with other recorded data may have been
deleted.
– Furthermore, the recording of the history or modifications to the
data may have been overlooked.
– Missing data, particularly for tuples with missing values for some
attributes, may need to be inferred.
Extract, Transform and Load (ETL)
Definition
Three separate functions combined into one development tool:

1. Extract - Reads data from a specified source and extracts a desired subset of data.

2. Transform - Uses rules or lookup tables, or creates combinations with other data, to convert source data to the desired state.

3. Load - Writes the resulting data to a target database.
• Furthermore, these tools and utilities include the
following functions:
• Data extraction, which typically gathers data from
multiple, heterogeneous, and external sources
• Data cleaning, which detects errors in the data and
rectifies them when possible
• Data transformation, which converts data from
legacy or host format to warehouse format
• Load, which sorts, summarizes, consolidates,
computes views, checks integrity, and builds indices
and partitions
• Refresh, which propagates the updates from the
data sources to the warehouse
ETL Overview
• ETL, short for extract, transform, and load, refers to the database functions that are combined into one tool.
• ETL is used to migrate data from one
database to another, to form data marts and
data warehouses and also to convert
databases from one format or type to
another.
• To get data out of the source and load it into
the data warehouse – simply a process of
copying data from one database to other
ETL Overview
• Data is extracted from an OLTP
database, transformed to match the
data warehouse schema and loaded
into the data warehouse database
• Many data warehouses also incorporate
data from non-OLTP systems such as
text files, legacy systems, and
spreadsheets; such data also requires
extraction, transformation, and loading.
The ETL Cycle
• EXTRACT: the process of reading data from different sources.
• TRANSFORM: the process of transforming the extracted data from its original state into a consistent state so that it can be placed into another database.
• LOAD: the process of writing the data into the target source.

Figure: The ETL cycle. Data from web data, MIS systems (Acct, HR), legacy systems, archived data, and other indigenous applications (COBOL, VB, C++, Java) is extracted into temporary data storage, transformed and cleansed, and loaded into the data warehouse for OLAP.
ETL Processing
ETL consists of independent yet interrelated steps.
It is important to look at the big picture.
Data acquisition time may include: extracts from source systems, data movement, data cleansing, data transformation, data loading, index maintenance, statistics collection, and backup.
Back-up is a major task; it is a Data Warehouse, not a cube.
 ETL is often a complex combination of process and
technology that consumes a significant portion of the data
warehouse development efforts and requires the skills of
business analysts, database designers, and application
developers
 It is not a one time event as new data is added to the Data
Warehouse periodically – i.e. monthly, daily, hourly
 Because ETL is an integral, ongoing, and recurring part of a data warehouse, it should be:
 Automated
 Well documented
 Easily changeable
 When defining ETL for a data warehouse, it is important to
think of ETL as a process, not a physical implementation
Extraction, Transformation, and Loading
(ETL) Processes

Data Extraction
Data Cleansing
Data Transformation
Data Loading
Data Refreshing
Data Extraction
Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse.

Static extract = capturing a snapshot of the source data at a point in time.
Incremental extract = capturing changes that have occurred since the last static extract.
Data is extracted from heterogeneous data
sources
Each data source has its distinct set of
characteristics that need to be managed and
integrated into the ETL system in order to
effectively extract data.
ETL process needs to effectively integrate
systems that have different:
DBMS
Operating Systems
Hardware
Communication protocols
Need to have a logical data map before the physical data can be transformed.

The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system, usually presented in a table or spreadsheet.
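
An illustrative sketch of static versus incremental extraction against a hypothetical source table that carries a last_updated timestamp; a real extractor would also follow the logical data map and the source system's own change-capture mechanism.

import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_updated TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 99.0, "2024-01-01T10:00:00"), (2, 25.0, "2024-01-03T09:30:00")])

def static_extract():
    # Static extract: a full snapshot of the chosen subset at a point in time.
    return source.execute("SELECT id, amount, last_updated FROM orders").fetchall()

def incremental_extract(last_run):
    # Incremental extract: only the rows changed since the last extract.
    return source.execute(
        "SELECT id, amount, last_updated FROM orders WHERE last_updated > ?",
        (last_run,)).fetchall()

print(static_extract())
print(incremental_extract("2024-01-02T00:00:00"))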
Some ETL Tools
Tool - Vendor
Oracle Warehouse Builder (OWB) - Oracle
Data Integrator (BODI) - Business Objects
IBM Information Server (Ascential) - IBM
SAS Data Integration Studio - SAS Institute
PowerCenter - Informatica
Oracle Data Integrator (Sunopsis) - Oracle
Data Migrator - Information Builders
Integration Services - Microsoft
Talend Open Studio - Talend
DataFlow - Group 1 Software (Sagent)
Data Integrator - Pervasive
Transformation Server - DataMirror
Transformation Manager - ETL Solutions Ltd.
Data Manager - Cognos
DT/Studio - Embarcadero Technologies
ETL4ALL - IKAN
DB2 Warehouse Edition - IBM
Jitterbit - Jitterbit
Pentaho Data Integration - Pentaho
Data Cleansing
Data Cleansing
• There are many possible reasons for noisy data
(having incorrect attribute values).
• The data collection instruments used may be faulty.
• There may have been human or computer errors
occurring at data entry.
• Errors in data transmission can also occur. There
may be technology limitations, such as limited
buffer size for coordinating synchronized data
transfer and consumption.
• Incorrect data may also result from inconsistencies
in naming conventions or data codes used, or
inconsistent formats for input fields, such as date.
• Duplicate tuples also require data cleaning.
• Data cleaning (or data cleansing) routines
work to “clean” the data by filling in missing
values, smoothing noisy data, identifying or
removing outliers, and resolving
inconsistencies.
• If users believe the data are dirty, they are
unlikely to trust the results of any data mining
that has been applied to it.
• Furthermore, dirty data can cause confusion
for the mining procedure, resulting in
unreliable output.
• Although most mining routines have some
procedures for dealing with incomplete or
noisy data, they are not always robust.
• Instead, they may concentrate on avoiding
overfitting the data to the function being
modeled.
• Therefore, a useful preprocessing step is to
run your data through some data cleaning
routines.
• More data and multiple sources could mean
more errors in the data and harder to trace
such errors which results in incorrect analysis.
• So there are enormous problem, as most data
is dirty.
• Data Warehouse is NOT just about arranging
data, but should be clean for overall health of
organization. “We drink clean water”!
• Sometime called as Data Scrubbing or
Cleaning.
• ETL software contains rudimentary data
cleansing capabilities
• Specialized data cleansing software is often
used. Leading data cleansing vendors include
Vality (Integrity), Harte-Hanks (Trillium), and
Firstlogic (i.d.Centric)
Why Cleansing?
– Data warehouse contains data that is analyzed
for business decisions
– Source systems contain “dirty data” that must
be cleansed.
– More data and multiple sources could mean
more errors in the data and harder to trace
such errors
– Results in incorrect analysis
– Enormous problem, as most data is dirty.
(GIGO)
• The cleaning step is one of the most important as it
ensures the quality of the data in the data warehouse.
• Cleaning should perform basic data unification rules,
such as:
– Making identifiers unique (sex categories
Male/Female/Unknown, M/F/null, Man/Woman/Not
Available are translated to standard Male/Female/Unknown)
– Convert null values into standardized Not Available/Not
Provided value
– Convert phone numbers, ZIP codes to a standardized form
– Validate address fields, convert them into proper naming,
e.g. Street/St/St./Str./Str
– Validate address fields against each other (State/Country,
City/State, City/ZIP code, City/Street).
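
A sketch of a few of these unification rules in plain Python; the mapping tables and field names are illustrative only.

import re

GENDER_MAP = {"m": "Male", "male": "Male", "man": "Male",
              "f": "Female", "female": "Female", "woman": "Female"}

def clean_record(record):
    cleaned = dict(record)
    # Make identifiers uniform: translate assorted sex codes to Male/Female/Unknown.
    raw_sex = (record.get("sex") or "").strip().lower()
    cleaned["sex"] = GENDER_MAP.get(raw_sex, "Unknown")
    # Convert null/empty values into a standardized "Not Available" value.
    for field in ("city", "street"):
        if not record.get(field):
            cleaned[field] = "Not Available"
    # Convert phone numbers to a standardized digits-only form.
    cleaned["phone"] = re.sub(r"\D", "", record.get("phone") or "")
    return cleaned

print(clean_record({"sex": "M", "city": None, "street": "Str. 5", "phone": "(01) 442-1234"}))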
Reasons for “Dirty” Data
Dummy Values
Absence of Data
Multipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys,
Non-Unique Identifiers
Data Integration Problems
Examples:
Dummy Data Problem:
A clerk enters 999-99-9999 as a SSN
rather than asking the customer for
theirs.
Reused Primary Keys:
A branch bank is closed. Several years
later, a new branch is opened, and the
old identifier is used again.
Inconsistent Data Representations
Same data, different representation
Date value representations
Examples:
- 970314
- 1997-03-14
- 03/14/1997
- 14-MAR-1997
- March 14 1997
- 2450521.5 (Julian date format)

Gender value representations
Examples:
- Male/Female
- M/F
- 0/1
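
A sketch that brings several of the date representations above into one standard form; the Julian-date case is left out, and the list of accepted formats is an assumption.

from datetime import datetime

# Candidate input formats for the same calendar date, tried in order.
FORMATS = ["%y%m%d", "%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%B %d %Y"]

def standardize_date(value):
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unrecognized representations for manual review

for raw in ["970314", "1997-03-14", "03/14/1997", "14-MAR-1997", "March 14 1997"]:
    print(raw, "->", standardize_date(raw))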
Scrub/Cleanse: uses pattern recognition and AI techniques to upgrade data quality.

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies.
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data.
Data Transformation
Data Transformation
It is the main step where the ETL
adds value.
Actually changes data and provides
guidance whether data can be used
for its intended purposes.
Performed in staging area.
• It applies a set of rules to transform the data
from the source to the target.
• This includes converting any measured data to
the same dimension (i.e. conformed
dimension) using the same units so that they
can later be joined.
• The transformation step also requires joining
data from several sources, generating
aggregates, generating surrogate keys, sorting,
deriving new calculated values, and applying
advanced validation rules.
Transform = convert data from the format of the operational system to the format of the data warehouse.

Record-level: Selection (data partitioning), Joining (data combining), Aggregation (data summarization).
Field-level: single-field (from one field to one field), multi-field (from many fields to one, or one field to many).
Basic Tasks
1. Selection
2. Splitting/Joining
3. Conversion
4. Summarization
5. Enrichment
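
A pandas sketch touching each of these basic tasks (selection, splitting, conversion, summarization, plus surrogate-key enrichment); the column names and exchange rate are invented.

import pandas as pd

orders = pd.DataFrame({
    "order_id":   [1, 2, 3],
    "customer":   ["Ana Rai", "Bikash KC", "Anita Shrestha"],
    "amount_usd": [10.0, 20.0, 5.0],
    "order_date": ["2024-01-05", "2024-01-06", "2024-02-01"],
})

# Selection (record-level): keep only the rows relevant to the target table.
orders = orders[orders["amount_usd"] > 0].copy()

# Splitting (field-level): one source field becomes several target fields.
orders[["first_name", "last_name"]] = orders["customer"].str.split(" ", n=1, expand=True)

# Conversion: bring measures and dates into standard units and types.
orders["amount_npr"] = orders["amount_usd"] * 133.0          # assumed exchange rate
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Enrichment: add a surrogate key independent of the operational key.
orders["order_key"] = range(1, len(orders) + 1)

# Summarization: pre-compute monthly totals for loading.
monthly = orders.groupby(orders["order_date"].dt.to_period("M"))["amount_npr"].sum()
print(monthly)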
Data Loading
Data Loading
Most loads involve only change data rather
than a bulk reloading of all of the data in the
warehouse.
Data are physically moved to the data
warehouse
The loading takes place within a “load
window”
The trend is to near real time updates of the
data warehouse as the warehouse is
increasingly used for operational applications
Load/Index = place transformed data into the warehouse and create indexes.

Refresh mode: bulk rewriting of target data at periodic intervals.
Update mode: only changes in source data are written to the data warehouse.
The loading process can be broken down
into 2 different types:
– Initial Load
– Continuous Load (loading over time)
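
A sketch contrasting an initial bulk load with a continuous, change-only load into a SQLite target; the rows are invented and the upsert syntax requires SQLite 3.24 or later.

import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE dw_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

# Initial load: bulk-write the full transformed data set into the warehouse.
initial_batch = [(1, 100.0), (2, 250.0), (3, 80.0)]
warehouse.executemany("INSERT INTO dw_sales VALUES (?, ?)", initial_batch)

# Continuous load (update mode): apply only changes captured since the last load.
changes = [(2, 275.0), (4, 60.0)]        # one updated row, one new row
warehouse.executemany(
    "INSERT INTO dw_sales VALUES (?, ?) "
    "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount",
    changes)

print(warehouse.execute("SELECT * FROM dw_sales ORDER BY sale_id").fetchall())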
Data Refreshing
Data Refreshing

Data Refresh
Propagate updates from sources to the warehouse.
Issues:
– when to refresh
– how to refresh -- refresh techniques
Set by administrator depending on user
needs and traffic
When to Refresh?
periodically (e.g., every night, every week) or
after significant events
on every update: not warranted unless
warehouse data require current data (up to the
minute stock quotes)
refresh policy set by administrator based on
user needs and traffic
possibly different policies for different sources
Refresh Techniques
Full Extract from base tables
• read entire source table: too expensive
• maybe the only choice for legacy systems
ETL vs. ELT
ETL: Extract, Transform, Load in which data
transformation takes place on a separate
transformation server.
ELT: Extract, Load, Transform in which data
transformation takes place on the data
warehouse server.
Thank you !!!
Chapter 5 :
Data Warehouse to Data
Mining
Data Warehouse Architecture
Data Warehouse Architecture
Operational Data Sources: It may include:
• Network databases.
• Departmental file systems and RDBMSs.
• Private workstations and servers.
• External systems (Internet, commercially available
databases).

Operational Data Store (ODS):


• It is a repository of current and integrated operational
data used for analysis.
• Often structured and supplied with data in same way as
DW.
• May act simply as staging area for data to be
moved into the warehouse.
• Provides users with the ease of use of a
relational database while remaining distant
from decision support functions of the DW.
Load Manager:
• Operations include:
– perform all the operations necessary to support
the extract and load process.
– fast load the extracted data into a temporary data
store
– perform simple transformations into a structure
similar to the one in the data warehouse.
Warehouse Manager (Data Manager):
• Operations performed include:
– Analysis of data to ensure consistency.
– Transformation/merging of source data from temp
storage into DW
– Creation of indexes.
– Backing-up and archiving data.
Query Manager (Manages User Queries):
• Operations include:
– directing queries to the appropriate tables and
– scheduling the execution of queries.
• In some cases, the query manager also generates query
profiles to allow the warehouse manager to determine
which indexes and aggregations are appropriate.
Meta Data: This area of the DW stores all the
meta-data (data about data) definitions used by
all the processes in the warehouse.
• Used for a variety of purposes:
– Extraction and loading processes
– Warehouse management process
– Query management process
• End-user access tools use meta-data to
understand how to build a query.
• Most vendor tools for copy management and
end-user data access use their own versions of
meta-data.
Lightly and Highly Summarized Data:
• It stores all the pre-defined lightly and
highly aggregated data generated by
the warehouse manager.
• The purpose of summary info is to
speed up the performance of queries.
• Removes the requirement to
continually perform summary
operations (such as sort or group by)
in answering user queries.
Archive/Backup Data:
• It stores detailed and summarized data
for the purposes of archiving and backup.
• May be necessary to backup online
summary data if this data is kept beyond
the retention period for detailed data.
• The data is transferred to storage
archives such as magnetic tape or optical
disk.
End-User Access Tools:
• The principal purpose of data warehousing is to
provide information to business users for
strategic decision-making.
• Users interact with the warehouse using end-
user access tools.
• There are three main groups of access tools:
1. Data reporting, query tools
2. Online analytical processing (OLAP) tools
3. Data mining tools
Design of a Data Warehouse:
Three Data Warehouse Models
• Enterprise warehouse
  – collects all of the information about subjects spanning the entire organization
  – top-down approach
  – the W. Inmon methodology
• Data Mart
  – a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
  – Independent vs. dependent (directly from warehouse) data mart
  – bottom-up approach
  – the R. Kimball methodology
• Virtual warehouse
  – A set of views over operational databases
  – Only some of the possible summary views may be materialized
The Data Mart Strategy
• The most common approach
• Begins with a single mart; architected marts are added over time for more subject areas
• Relatively inexpensive and easy to implement
• Can be used as a proof of concept for data warehousing
• Can postpone difficult decisions and activities
• Requires an overall integration plan
• The key is to have an overall plan, processes, and technologies for integrating the different marts.
• The marts may be logically rather than physically separate.
• Data Mart is a subset of the information content of a
data warehouse that is stored in its own database.
• Data mart may or may not be sourced from an
enterprise data warehouse i.e. it could have been
directly populated from source data.
• Data mart can improve query performance simply by
reducing the volume of data that needs to be scanned to
satisfy the query.
• Data marts are created along functional level to reduce
the likelihood of queries requiring data outside the mart.
• Data marts may help in multiple queries or tools to
access data by creating their own internal database
structures.
• E.g.: Departmental Store, Banking System.
Enterprise Warehouse Strategy
• A comprehensive warehouse is built initially
• An initial dependent data mart is built using a subset of the data in the warehouse
• Additional data marts are built using subsets of the data in the warehouse
• Like all complex projects, it is expensive, time consuming, and prone to failure
• When successful, it results in an integrated, scalable warehouse
• Even with the enterprise-wide strategy, the warehouse is developed in phases and each phase should be designed to deliver business value.
Data Warehouse Development:
A Recommended Approach
Figure: A recommended approach for data warehouse development. Define a high-level corporate data model, refine the model into distributed data marts, and refine it further into a multi-tier data warehouse in which data marts feed an enterprise data warehouse.
Data mining Architecture:
Major Components:
Database, data warehouse, World Wide Web,
or other information repository:
• This is one or a set of databases, data
warehouses, spreadsheets, or other kinds of
information repositories.
• Data cleaning and data integration techniques
may be performed on the data.
Database or data warehouse server:
• The database or data warehouse server is
responsible for fetching the relevant data,
based on the user’s data mining request.
• Knowledge base:
• This is the domain knowledge that is used to guide
the search or evaluate the interestingness of
resulting patterns.
• Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into
different levels of abstraction.
• Data mining engine:
• This is essential to the data mining system and
ideally consists of a set of functional modules for
tasks such as characterization, association and
correlation analysis, classification, prediction,
cluster analysis, outlier analysis, and evolution
analysis.
• Pattern evaluation module:
• This component typically employs
interestingness measures and interacts with
the data mining modules so as to focus the
search toward interesting patterns.
• User interface:
• This module communicates between users and
the data mining system, allowing the user to
interact with the system by specifying a data
mining query or task, providing information to
help focus the search, and performing
exploratory data mining based on the
intermediate data mining results.
OLAP (Online Analytical Processing)
Architecture
• The term OLAP or online analytical
processing was introduced in a paper
entitled "Providing Analytical processing
to User Analysts," by Dr. E. R Codd, the
acknowledged "father" of the relational
database model.
• OLAP provides you with a very good view of
what is happening, but can not predict
what will happen in the future or why it is
happening.
• OLAP is a term used to describe the
analysis of complex data from the data
warehouse.
• OLAP is an advanced data analysis
environment that supports decision
making, business modeling, and operations
research activities.
• Can easily answer ‘who?’ and ‘what?’ questions,
however, ability to answer ‘what if?’ and ‘why?’
type questions distinguishes OLAP from general-
purpose query tools.
• Enables users to gain a deeper understanding and
knowledge about various aspects of their
corporate data through fast, consistent,
interactive access to a wide variety of possible
views of the data.
• Allows users to view corporate data in such a way
that it is a better model of the true
dimensionality of the enterprise.
– OLAP is a category of applications/technology
for Collecting, managing, processing, and
presenting multidimensional data for analysis
and management purposes.
OLAP is FASMI:
• Fast
• Analysis
• Shared
• Multidimensional
• Information
Main characteristics of OLAP are:
• Multidimensional conceptual view
• Multi user support
• Accessibility
• Storing
• Uniform reporting performance
• Facilitate interactive query and complex analysis
for the users.
• Provides ability to perform intricate calculations
and comparisons.
• Presents results in a number of meaningful ways,
including charts and graphs.
Comparing OLAP and Data Mining
Examples of OLAP Applications in
Various Functional Areas
OLAP Benefits
• Increased productivity of end-users.
• Retention of organizational control over
the integrity of corporate data.
• Reduced query drag and network traffic on
OLTP systems or on the data warehouse.
• Improved potential revenue and
profitability.
Strengths of OLAP
• It is a powerful visualization paradigm
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful to find some clusters and
outliers
• Many vendors offer OLAP tools
OLAP for Decision Support
• Goal of OLAP is to support ad-hoc querying for the
business analyst
• Business analysts are familiar with spreadsheets
• Extend spreadsheet analysis model to work with
warehouse data
– Large data set
– Semantically enriched to understand business terms (e.g.,
time, geography)
– Combined with reporting features
• Multidimensional view of data is the foundation of
OLAP
OLAP Architecture
• The three-tier architecture includes:
– Bottom Tier (Data warehouse server)
– Middle Tier (OLAP Server)
– Top Tier (Front end tools)
• Bottom Tier
• The bottom tier is a warehouse database server that is
almost always a relational database system.
• Data is fed using backend tools and utilities.
• Data from operational database and external sources are
extracted using application programs interfaces known as
gateways.
• Examples of gateways include ODBC (Open database
connection) and OLE-DB (Open linking and embedding for
databases), by Microsoft, and JDBC (Java Database
Connection).
• Middle Tier
• The middle tier is an OLAP server that is typically
implemented using either:
– A Relational OLAP(ROLAP) model, that is, an extended
relational DBMS that maps operations on
multidimensional data to standard relational operations;
or
– A Multidimensional OLAP (MOLAP) model, that is, a
special-purpose server that directly implements
multidimensional data and operations.
• Top Tier
• The top tier is a front-end client layer, which contains
query and reporting tools, analysis tools, and/or data
mining tools (e.g. trend analysis, prediction and so
on).
OLAP Architecture (continued)
• Designed to use both operational and data
warehouse data
• Defined as an “advanced data analysis
environment that supports decision
making, business modeling, and an
operation’s research activities”
• In most implementations, data warehouse
and OLAP are interrelated and
complementary environments
OLTP (Online Transaction Processing)
• OLTP is characterized by a large number of
short online transactions (INSERT, UPDATE and
DELETE).
• OLTP is used to carry out day to day business
functions such as ERP (Enterprise Resource
Planning) and CRM (Customer Relationship
Management).
• OLTP system solved a critical business
problem of automating daily business
functions and running real time report and
analysis.
OLTP Vs OLAP
Characteristic | OLTP | OLAP
Source of data | Operational database | Data warehouse (from various databases)
Purpose of data | Control and run fundamental business tasks | Planning, problem solving and decision support
Queries | Simple queries | Complex queries and algorithms
Processing speed | Typically very fast | Depends on data size, techniques and algorithms
Space requirements | Can be relatively small | Larger, due to aggregated databases
Database design | Highly normalized with many tables; ER based, application-oriented | Typically de-normalized with fewer tables; star or snowflake schema, subject-oriented
Characteristics | Operational processing | Information processing
Orientation | Transaction | Analysis
Nature of user | Operations workers (e.g. clerk, DBA, database professionals) | Knowledge workers, decision makers (e.g. manager, executive, analyst)
Functions | Day-to-day operations | Long-term informational requirements, decision support
Data | Current, detailed, relational; guaranteed up to date | Historical, summarized, multidimensional; accuracy maintained over time
Unit of work | Short, simple transaction | Complex query
Focus | Data in | Information out
Number of users | Thousands | Hundreds
DB size | 100 MB to GB | 100 GB to TB
View | Detailed, flat relational | Summarized, multidimensional
Access | Read/write | Mostly read
Updates | All the time | Usually not allowed
On-Line Analytical Mining (OLAM)
• On-line analytical mining (OLAM) (also called
OLAP mining) integrates on-line analytical
processing (OLAP) with data mining and mining
knowledge in multidimensional databases.
• OLAM is particularly important for the following
reasons:
� High quality of data in data warehouses
� Available information processing infrastructure
surrounding data warehouses
� OLAP-based exploratory data analysis
� On-line selection of data mining functions
Figure: An integrated OLAM and OLAP architecture.
• An OLAM server performs analytical mining
in data cubes in a similar manner as an
OLAP server performs on-line analytical
processing.
• An integrated OLAM and OLAP architecture
is shown in figure above, where the OLAM
and OLAP servers both accept user on-line
queries (or commands) via a graphical user
interface API and work with the data cube
in the data analysis via a cube API.
• A metadata directory is used to guide the
access of the data cube.
• The data cube can be constructed by
accessing and/or integrating multiple
databases via an MDDB API and/or by
filtering a data warehouse via a database
API that may support OLE DB or ODBC
connections.
Server Options
• Single processor (Uniprocessor)
• Symmetric multiprocessor (SMP)
• Massively parallel processor (MPP)
OLAP Server Options/Categories of
OLAP Tools
• OLAP tools are categorized according to the
architecture of the underlying database.
• The main categories of OLAP tools include:
  – Relational OLAP (ROLAP)
  – Multi-dimensional OLAP (MOLAP or MD-OLAP)
  – Desktop OLAP (DOLAP)
  – Hybrid OLAP (HOLAP)
Relational OLAP (ROLAP)
• Relational OLAP (ROLAP) implementations are
similar in functionality to MOLAP.
• However, these systems use an underlying
RDBMS, rather than a specialized MDDB.
• This gives them better scalability since they are
able to handle larger volumes of data than the
MOLAP architectures.
• Also, ROLAP implementations typically have
better drill-through because the detail data
resides on the same database as the
multidimensional data .
Relational OLAP (ROLAP)
• The ROLAP environment is typically based on the
use of a data structure known as a star or
snowflake schema.
• Analogous to a virtual MDDB, a star or snowflake
schema is a way of representing multidimensional
data in a two-dimensional RDBMS.
• The data modeler builds a fact table, which is
linked to multiple dimension tables.
• The dimension tables consist almost entirely of
keys, such as location, time, and product, which
point back to the detail records stored in the fact
table.
•This type of data structure requires a great deal of
initial planning and set up, and suffers from some of the
same operational and flexibility concerns of MDDBs.
•Additionally, since the data structures are relational,
SQL must be used to access the detail records.
•Therefore, the ROLAP engine must perform additional
work to do comparisons, such as comparing the current
quarter with this quarter last year.
•Again, IT must be heavily involved in defining,
implementing, and maintaining the database.
•Furthermore, the ROLAP architecture often restricts
the user from performing OLAP operations in a mobile
environment
• Relational Online Analytical Processing (ROLAP):
  – OLAP functionality using a relational database and familiar query tools to store and analyze multidimensional data
• Adds the following extensions to a traditional RDBMS:
  – Multidimensional data schema support within the RDBMS
  – Data access language and query performance optimized for multidimensional data
  – Support for very large databases
• Tunes a relational DBMS to support star schemas
Figure: Typical ROLAP stack: frontend tool, multidimensional interface, ROLAP engine with meta data, generating SQL against the relational DB.
• ROLAP is a fastest growing style of OLAP
technology.
• Supports RDBMS products using a metadata
layer - avoids need to create a static multi-
dimensional data structure - facilitates the
creation of multiple multi-dimensional views of
the two-dimensional relation.
• To improve performance, some products use
SQL engines to support complexity of multi-
dimensional analysis, while others recommend,
or require, the use of highly denormalized
database designs such as the star schema.
Typical Architecture for ROLAP Tools
• With ROLAP data remains in the original
relational tables, a separate set of relational
tables is used to store and reference aggregation
data. ROLAP is ideal for large databases or legacy
data that is infrequently queried.
• ROLAP Products:
– IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
• ROLAP Tools
– ORACLE 8i
– ORACLE Reports; ORACLE Discoverer
– ORACLE Warehouse Builder
– Arbors Software’s Essbase
Advantages of ROLAP
• Define complex, multi-dimensional data with
simple model
• Reduces the number of joins a query has to
process
• Allows the data warehouse to evolve with relatively low maintenance
• HOWEVER! Star schema and relational DBMS are
not the magic solution
– Query optimization is still problematic
Features of ROLAP:
• Ask any question (not limited to the contents
of the cube)
• Ability to drill down

Downsides of ROLAP:
• Slow Response
• Some limitations on scalability
Multi-Dimensional OLAP (MOLAP)
• The first generation of server-based
multidimensional OLAP (MOLAP) solutions use
multidimensional databases (MDDBs).
• The main advantage of an MDDB over an RDBMS
is that an MDDB can provide information quickly
since it is calculated and stored at the
appropriate hierarchy level in advance.
• However, this limits the flexibility of the MDDB
since the dimensions and aggregations are
predefined.
Multi-Dimensional OLAP (MOLAP)
• If a business analyst wants to examine a
dimension that is not defined in the MDDB, a
developer needs to define the dimension in
the database and modify the routines used to
locate and reformat the source data before
an operator can load the dimension data.
• Another important operational consideration
is that the data in the MDDB must be
periodically updated to remain current.
• This update process needs to be scheduled
and managed. In addition, the updates need
to go through a data cleansing and validation
process to ensure data consistency.
• Finally, an administrator needs to allocate
time for creating indexes and aggregations, a
task that can consume considerable time
once the raw data has been loaded.
• These requirements also apply if the
company is building a data warehouse that is
acting as a source for the MDDB.
• Organizations typically need to invest
significant resources in implementing MDDB
systems and monitoring their daily operations.
• This complexity adds to implementation delays
and costs, and requires significant IT
involvement.
• This also results in the analyst, who is typically
a business user, having a greater dependency
on IT.
• Thus, one of the key benefits of this OLAP
technology — the ability to analyze
information without the use of IT professionals
— may be significantly diminished.
• Uses specialized data structures and multi-dimensional Database Management Systems (MD-DBMSs) to organize, navigate, and analyze data.
• Uses a specialized DBMS with a model such as the “data cube.”
• Data is typically aggregated and stored according to predicted usage to enhance query performance.
Figure: Typical MOLAP stack: front-end tool accessing a multidimensional database directly.
Typical Architecture for MOLAP Tools
• Traditionally, require a tight coupling with the
application layer and presentation layer.
• Recent trends segregate the OLAP from the data
structures through the use of published application
programming interfaces (APIs).
• MOLAP Products
– Pilot, Arbor Essbase, Gentia
• MOLAP Tools
– ORACLE Express Server
– ORACLE Express Clients (C/S and Web)
– MicroStrategy’s DSS server
– Platinum Technologies’ Plantinum InfoBeacon
• Use array technology and efficient storage techniques
that minimize the disk space requirements through
sparse data management.
• Provides excellent performance when data is used as
designed, and the focus is on data for a specific
decision-support application.
• Features:
Very fast response
Ability to quickly write data into the cube
• Downsides:
Limited Scalability
Inability to contain detailed data
Load time
Desktop OLAP (or Client OLAP)
• The desktop OLAP market resulted from the need for
users to run business queries using relatively small data
sets extracted from production systems.
• Most desktop OLAP systems were developed as
extensions of production system report writers, while
others were developed in the early days of client/server
computing to take advantage of the power of the
emerging (at that time) PC desktop.
• Desktop OLAP systems are popular and typically require
relatively little IT investment to implement. They also
provide highly mobile OLAP operations for users who
may work remotely or travel extensively.
• However, most are limited to a single user and lack the
ability to manage large data sets.
Client OLAP: stores data in the form of cubes/micro-cubes on the desktop/client machine.
• Proprietary data structure on the client
• Data stored as a file
• Mostly RAM-based architectures
• Suited to mobile users
• Ease of installation and use
• Limitations: data volume; no multiuser capabilities
Products: Brio.Enterprise, BusinessObjects, Cognos PowerPlay
Hybrid OLAP (HOLAP)
• Hybrid OLAP (HOLAP) combines ROLAP and
MOLAP storage.
• It tries to take advantage of the strengths of
each of the other two architectures, while
minimizing their weaknesses.
• Some vendors provide the ability to access
relational databases directly from an MDDB,
giving rise to the concept of hybrid OLAP
environments.
• This implements the concept of "drill through,"
which automatically generates SQL to retrieve
detail data records for further analysis.
Hybrid OLAP (HOLAP)
• This gives end users the perception they are
drilling past the multidimensional database into
the source database.
• The hybrid OLAP system combines the
performance and functionality of the MDDB with
the ability to access detail data, which provides
greater value to some categories of users.
• However, these implementations are typically
supported by a single vendor’s databases and are
fairly complex to deploy and maintain.
• Additionally, they are typically somewhat
restrictive in terms of their mobility.
• Can use data from either a RDBMS directly or a
multi-dimension server.
• Equal treatment of MD and Relational Data
• Storage type at the discretion of the administrator
• Cube Partitioning
Figure: A HOLAP system: shared meta data over both multidimensional storage and relational storage.
• HOLAP combines elements from MOLAP and ROLAP.
• HOLAP keeps the original data in relational tables
but stores aggregations in a multidimensional
format.
• Combines MOLAP & ROLAP
• Utilizes both pre-calculated cubes & relational data
sources
• HOLAP Tools
– ORACLE 8i
– ORACLE Express Serve
– ORACLE Relational Access Manager
– ORACLE Express Clients (C/S and Web)

• HOLAP Products:
–Oracle Express
–Seagate Holos
–Speedware Media/M
–Microsoft OLAP Services
HOLAP Features:
• For summary type info – cube, (Faster
response)
• Ability to drill down – relational data
sources (drill through detail to underlying
data)
• Source of data transparent to end-user
OLAP Products
Category | Candidate Products | Vendor
ROLAP | Microstrategy | Microstrategy
ROLAP | Business Objects | Business Objects
ROLAP | Crystal Holos (ROLAP mode) | Business Objects
ROLAP | Essbase | Hyperion
ROLAP | Microsoft Analysis Services | Microsoft
ROLAP | Oracle Express (ROLAP mode) | Oracle
ROLAP | Oracle Discoverer | Oracle
MOLAP | Crystal Holos | Business Objects
MOLAP | Essbase | Hyperion
MOLAP | Microsoft Analysis Services | Microsoft
MOLAP | Oracle Express | Oracle
MOLAP | Cognos Powerplay | Cognos
HOLAP | Hyperion Essbase + Intelligence | Hyperion
HOLAP | Cognos Powerplay + Impromptu | Cognos
HOLAP | Business Objects + Crystal Holos | Business Objects
Typical OLAP Operations
Roll up (drill-up) or Aggregation: summarize data
– by climbing up hierarchy or by dimension reduction
– Data is summarized with increasing generalization
– dimension reduction: e.g., total sales by city
– summarization over aggregate hierarchy: e.g., total sales by city and year -> total sales by region and
by year
Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or detailed data, or introducing new dimensions
– going from summary to more detailed views
– Increasing levels of detail are revealed
Slice and Dice:
– project and select
– Performing projection operations on the dimensions.
Pivot (rotate):
– reorient the cube, visualization, 3D to series of 2D planes.
– Cross tabulation is performed
Other operations:
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
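The operations above can be illustrated with a small, hypothetical "sales" data set in Python using pandas; the column names and figures are invented, and groupby/pivot_table stand in for the cube operators of a real OLAP server.

# Illustrative sketch of roll-up, drill-down, slice, dice and pivot on a tiny
# in-memory sales table. All data and names are made up.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["Northeast", "Northeast", "Southeast", "Southeast"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["car", "car", "truck", "car"],
    "dollars": [120, 150, 90, 110],
})

# Drill-down view: totals at the (region, quarter) level
detail = sales.groupby(["region", "quarter"])["dollars"].sum()

# Roll-up: climb the hierarchy by dropping the quarter dimension
rollup = sales.groupby("region")["dollars"].sum()

# Slice: select a single value on one dimension (quarter = Q1)
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions (quarter = Q1 and item = car)
dice = sales[(sales["quarter"] == "Q1") & (sales["item"] == "car")]

# Pivot: cross-tabulate region against quarter
pivot = sales.pivot_table(values="dollars", index="region",
                          columns="quarter", aggfunc="sum")

print(detail, rollup, slice_q1, dice, pivot, sep="\n\n")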
Table: A 3-D view of sales data according to the
dimensions time, item, and location. The measure
displayed is dollars sold (in thousands).
Figure: A 3-D data cube representation of the data in the table above,
according to the dimensions time, item, and location. The measure
displayed is dollars sold (in thousands).
Figure: Examples of typical
OLAP operations on
multidimensional data
Roll-up and Drill-down
The roll-up operation performs aggregation on a
data cube, either by climbing up a concept hierarchy
for a dimension or by dimension reduction such
that one or more dimensions are removed from the
given cube.
Drill-down is the reverse of roll-up. It navigates
from less detailed data to more detailed data. Drill-
down can be realized by either stepping down a
concept hierarchy for a dimension or introducing
additional dimensions.
Drill Down Example: sample OLAP drill-down online report.
Figure: Quarterly Auto Sales Summary.
Example of drill-down: a summary report of units sold and revenue by region (Northeast, Southeast, Central, Northwest, Southwest) is drilled down into a more detailed report of units sold and revenue by region and state (e.g. Northeast: Maine, New York, Massachusetts; Southeast: Florida, Georgia, Virginia).
Example of roll-up: the reverse navigation. The quarterly auto sales report by region and state (e.g. Northeast: Maine, New York, Massachusetts; Southeast: Florida, Georgia, Virginia) is rolled up into a summary of units sold and revenue by region only (Northeast, Southeast, Central, Northwest, Southwest).
Slice and Dice
The slice operation performs a selection on one
dimension of the given cube, resulting in a sub
cube.
The dice operation defines a sub cube by performing a selection on two or more dimensions.
Figure: Slicing a data cube
Slicing (above) and dicing (below) a cube
Rotation (Pivot Table)
Figure: Pivoting a cube: the Product, Region and Time dimensions are reoriented to give an alternative two-dimensional presentation of the data.
Example of Rotation (Pivot Table)
Data Mining Tools
Thank you !!!
Unit 6 : Data Mining
Approaches and Methods
Types of Data Mining Models
Predictive Model
(a)Classification -Data is mapped into predefined
groups or classes. Also termed as supervised learning
as classes are established prior to examination of
data.
(b) Regression- Mapping of data item into known
type of functions. These may be linear, logistic
functions etc.
(c) Time Series Analysis- Value of an attribute are
examined at evenly spaced times, as it varies with
time.
(d) Prediction- It means foretelling future data states
based on past and current data.
Types of Data Mining Models
Descriptive Models
(a) Clustering- It is referred as unsupervised learning or
segmentation/partitioning. In clustering groups are not
pre-defined.
(b) Summarization- Data is mapped into subsets with
simple descriptions. Also termed as Characterization
or generalization.
(c) Sequence Discovery- Sequential analysis or
sequence discovery utilized to find out sequential
patterns in data. Similar to association but relationship
is based on time.
(d) Association Rules- A model which identifies specific
types of data associations.
Descriptive vs. Predictive Data Mining
Descriptive Mining:
It describes concepts or task-relevant data sets in concise,
summarative, informative, discriminative forms.
Predictive Mining:
It is based on data and analysis, constructs models for the database,
and predicts the trend and properties of unknown data.
Supervised and Unsupervised learning

Supervised learning:
– The network answer to each input pattern is
directly compared with the desired answer
and a feedback is given to the network to
correct possible errors
Unsupervised learning:
– The target answer is unknown. The network
groups the input patterns of the training sets
into clusters, based on correlation and
similarities.
Supervised
• Bayesian Modeling Type and number of
• Decision Trees classes are known in
advance
• Neural Networks

Unsupervised
Type and number of
• One-way Clustering classes are NOT
• Two-way Clustering known in advance
Classification and Prediction
Classification and prediction are two forms of data analysis that
can be used to extract models describing important data classes
or to predict future data trends. Such analysis can help provide
us with a better understanding of the data at large.
Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.
Prediction
Prediction is viewed as the construction and use of a model
to assess the class of an unlabeled sample or to assess the
value ranges of an attribute that a given sample is likely to
have.
It is a statement or claim that a particular event will occur in the future in more certain terms than a forecast. It is similar to classification: it constructs a model to predict unknown or missing values. Prediction is the most prevalent grade-level expectation on reasoning in state mathematics standards.
Generally it predicts a continuous value rather
than categorical label. Numeric prediction predicts
the continuous value. The most widely used
approach for numeric prediction is regression.
Regression analysis is used to model the
relationship between one or more independent or
predictor variables and a dependent or response
variable. In the context of Data Mining, predictor
variables are attributes of interest describing the
tuple.
Linear Regression
Regression is a statistical methodology developed by Sir Francis Galton (1822-1911). Straight-line regression analysis involves a response variable y and a single predictor variable x.

The simplest form of regression is

y = a + bx

where y is the response variable, x is the single predictor variable, and y is a linear function of x. a and b are regression coefficients.

As the regression coefficients are also considered weights, we may write the above equation as:

y = w0 + w1x

These coefficients are solved by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
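A minimal sketch of the least-squares computation, using made-up (x, y) pairs; it estimates w1 as the covariance of x and y divided by the variance of x, and w0 as y_bar - w1 * x_bar.

# Straight-line regression y = w0 + w1*x fitted by least squares.
# The data points are illustrative only.
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]        # predictor values
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]  # response values

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar

print(f"y = {w0:.2f} + {w1:.2f} x")
# Predict the response for a new predictor value, e.g. x = 10
print("prediction for x = 10:", round(w0 + w1 * 10, 2))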
Linear Regression
Classification is the process of finding a model (or
function) that describes and distinguishes data
classes or concepts, for the purpose of being able
to use the model to predict the class of objects
whose class label is unknown. The derived model is
based on the analysis of a set of training data (i.e.,
data objects whose class label is known).

“How is the derived model presented?” The derived


model may be represented in various forms, such
as classification (IF-THEN) rules, decision trees,
mathematical formulae, or neural networks.
Figure : A classification model can be represented in various forms, such
as (a) IF-THEN rules, (b) a decision tree, or a (c) neural network.
Classification : Example of Grading
IF-THEN rules for the grading decision tree:
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.
(The tree tests x against 90, 80, 70 and 60 in turn; each right branch assigns A, B, C or D, and the final left branch assigns F.)
Figure: Learning
Here, the class label attribute is loan decision, and the
learned model or classifier is represented in the form of
classification rules.
Examples of Classification Algorithms
• Decision Trees
• Neural Networks
• Bayesian Networks
Decision Trees
A decision tree is a predictive model that as its name
implies can be viewed as a tree. Specifically each branch
of the tree is a classification question and the leaves are
partitions of data set with their classification.
A decision tree makes a prediction on the basis of a series of decisions. Decision trees are built on
historical data and are a part of the supervised learning.
The machine learning technique for inducting a decision
tree from data is called decision tree learning.
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distribution
Decision Tree Example
Decision Tree: Example
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
Decision tree learned from the data:
Outlook = Sunny -> test Humidity: High -> No, Normal -> Yes
Outlook = Overcast -> Yes
Outlook = Rain -> test Wind: Strong -> No, Weak -> Yes

Attributes = {Outlook, Temperature, Humidity, Wind}; Play Tennis = {yes, no}
Decision Tree Learning Algorithm - ID3
ID3 (Iterative Dichotomiser) is a simple decision
tree learning algorithm developed by Ross Quinlan
(1983). ID3 follow non-backtracking approach in
which decision trees are constructed in a top-
down recursive “divide and conquer” manner to
test each attribute at every tree node. This
approach starts with a training set of tuples and
their associated class labels. Training set is
recursively partitioned into smaller subsets as the
tree is being built.
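A small sketch of the attribute-selection step that ID3 repeats at each node: compute the entropy-based information gain of every candidate attribute and split on the best one. The toy rows below are a reduced version of the play-tennis table above; the function and column names are illustrative, not part of any standard library.

# Entropy and information gain, the core of ID3's attribute selection.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, target="PlayTennis"):
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

rows = [
    {"Outlook": "Sunny",    "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Wind": "Strong", "PlayTennis": "No"},
]

for attr in ("Outlook", "Wind"):
    print(attr, round(information_gain(rows, attr), 3))
# ID3 would split on the attribute with the largest gain, then recurse on each
# partition until the tuples in a node all share one class label.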
Pros and Cons of Decision Tree
Pros
– no distributional assumptions
– can handle real and nominal inputs
– speed and scalability
– robustness to outliers and missing values
– interpretability
– compactness of classification rules
– They are easy to use.
– Generated rules are easy to understand .
– Amenable to scaling and the database size.

Cons
– several tuning parameters to set with little guidance
– decision boundary is non-continuous
– Cannot handle continuous data.
– Incapable of handling many problems which cannot be divided into attribute
domains.
– Can lead to over-fitting as the trees are constructed from training data.
Neural Networks
Neural Network is a set of connected INPUT/OUTPUT
UNITS, where each connection has a WEIGHT
associated with it. It is a case of SUPERVISED,
INDUCTIVE or CLASSIFICATION learning.

A Neural Network learns by adjusting the weights so as to be able to correctly classify the training data and
hence, after testing phase, to classify unknown data.
Neural Network needs long time for training. Neural
Network has a high tolerance to noisy and
incomplete data.
Similarity with Biological Network
• The fundamental processing element of a neural network is a neuron
• A human brain has 100 billion neurons
• An ant brain has 250,000 neurons
A Neuron (= a Perceptron)
Figure: A neuron. The inputs x0, x1, ..., xn with weights w0, w1, ..., wn feed a weighted sum; a bias -mu_k is added and an activation function f produces the output y.

For example:  y = sign( sum_{i=0..n} wi * xi - mu_k )

The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping.
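A minimal numeric sketch of the neuron above, assuming a sign activation; the input vector, weights, and threshold are invented for illustration.

# Single neuron (perceptron) output: sign of the weighted sum minus a threshold.
def perceptron(x, w, theta):
    s = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return 1 if s >= 0 else -1

x = [0.7, 0.2, 1.0]       # input vector (made up)
w = [0.4, -0.3, 0.6]      # weight vector (made up)
theta = 0.5               # threshold, mu_k in the figure
print(perceptron(x, w, theta))   # 0.28 - 0.06 + 0.6 - 0.5 = 0.32 >= 0, so output 1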
Perceptron
Figure: A single-layer perceptron with inputs x1, ..., xn, weights w_ik, and outputs y1, y2, ..., yp. Each output unit computes

yi(t+1) = f( sum_{k=0..n} w_ik * xk(t) - theta_i ),   i = 1, 2, ..., p
Multi-Layer Perceptron
Figure: A multi-layer perceptron with an input layer, hidden layer and output layer. For a unit j, the quantities used during training are:

Net input:                I_j = sum_i( w_ij * O_i ) + theta_j
Output (sigmoid):         O_j = 1 / (1 + e^(-I_j))
Error at an output unit:  Err_j = O_j (1 - O_j) (T_j - O_j)
Error at a hidden unit:   Err_j = O_j (1 - O_j) * sum_k( Err_k * w_jk )
Weight update:            w_ij = w_ij + (l) Err_j O_i
Bias update:              theta_j = theta_j + (l) Err_j

where T_j is the true (target) output, O_i is the output of unit i in the previous layer, l is the learning rate, and the input vector is x_i.
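The update rules above can be turned into a short, illustrative training step for a tiny fully connected network. This is a sketch, not a production implementation: the 2-3-1 topology, learning rate, and training example are arbitrary choices made for the demonstration.

# One-hidden-layer network trained with the backpropagation updates above.
import math, random

random.seed(1)
n_in, n_hid, n_out = 2, 3, 1
l = 0.5  # learning rate
w_ih = [[random.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_in)]
w_ho = [[random.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hid)]
th_h = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]
th_o = [random.uniform(-0.5, 0.5) for _ in range(n_out)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x):
    o_h = [sigmoid(sum(x[i] * w_ih[i][j] for i in range(n_in)) + th_h[j])
           for j in range(n_hid)]
    o_o = [sigmoid(sum(o_h[j] * w_ho[j][k] for j in range(n_hid)) + th_o[k])
           for k in range(n_out)]
    return o_h, o_o

def train_one(x, target):
    o_h, o_o = forward(x)
    # errors: output units use (T - O), hidden units backpropagate weighted errors
    err_o = [o_o[k] * (1 - o_o[k]) * (target[k] - o_o[k]) for k in range(n_out)]
    err_h = [o_h[j] * (1 - o_h[j]) * sum(err_o[k] * w_ho[j][k] for k in range(n_out))
             for j in range(n_hid)]
    # weight and bias updates
    for j in range(n_hid):
        for k in range(n_out):
            w_ho[j][k] += l * err_o[k] * o_h[j]
    for i in range(n_in):
        for j in range(n_hid):
            w_ih[i][j] += l * err_h[j] * x[i]
    for k in range(n_out):
        th_o[k] += l * err_o[k]
    for j in range(n_hid):
        th_h[j] += l * err_h[j]

x, target = [1.0, 0.0], [1.0]           # a single made-up training tuple
print("output before training:", round(forward(x)[1][0], 3))
for _ in range(200):
    train_one(x, target)
print("output after 200 updates:", round(forward(x)[1][0], 3))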
Advantages of Neural Network
 prediction accuracy is generally high
 robust, works when training examples contain errors
 output may be discrete, real-valued, or a vector of
several discrete or real-valued attributes
 fast evaluation of the learned target function
 High tolerance to noisy data
 Ability to classify untrained patterns
 Well-suited for continuous-valued inputs and outputs
 Successful on a wide array of real-world data
 Algorithms are inherently parallel
 Techniques have recently been developed for the
extraction of rules from trained neural networks
Disadvantages of Neural Network
long training time
difficult to understand the learned function
(weights)
not easy to incorporate domain knowledge
Require a number of parameters typically best
determined empirically, e.g., the network
topology or ``structure."
Poor interpretability: Difficult to interpret the
symbolic meaning behind the learned weights
and of ``hidden units" in the network
Association Rule
 Proposed by Agrawal et al in 1993.
 It is an important data mining model studied extensively by the database and
data mining community.
 Assume all data are categorical.
 No good algorithm for numeric data.
 Initially used for Market Basket Analysis to find how items purchased by
customers are related.
 Given a set of records each of which contain some number of items from a
given collection;
– Produce dependency rules which will predict occurrence of an item based on occurrences of
other items.
TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Applications:
Basket data analysis, cross-marketing, catalog design, loss-
leader analysis, clustering, classification, etc.
E.g., 98% of people who purchase tires and auto accessories
also get automotive services done

Concepts:
An item: an item/article in a basket
I: the set of all items sold in the store
A transaction: items purchased in a basket; it may have TID (transaction ID)
A transactional dataset: A set of transactions
The model: rules
A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
An association rule is an implication of the form:
X ⇒ Y, where X, Y ⊂ I, and X ∩ Y = ∅
An itemset is a set of items.
E.g., X = {milk, bread, cereal} is an itemset.
A k-itemset is an itemset with k items.
E.g., {milk, bread, cereal} is a 3-itemset
Rule Strength Measures
Support:
The rule holds with support sup in T (the transaction
data set) if sup% of transactions contain X ∪ Y.
sup = Pr(X ∪ Y)
Confidence:
The rule holds in T with confidence conf if conf% of
transactions that contain X also contain Y.
conf = Pr(Y | X)
An association rule is a pattern that states that when X occurs, Y occurs with a certain probability.
Support and Confidence
support of X in D is count(X)/|D|
For an association rule X ⇒ Y, we can calculate
support (X ⇒ Y) = support (X ∪ Y)
confidence (X ⇒ Y) = support (X ∪ Y) / support (X)
Relate Support (S) and Confidence (C) to Joint
and Conditional probabilities
There could be exponentially many A-rules
Interesting association rules are (for now) those
whose S and C are greater than minSup and
minConf (some thresholds set by data miners)
Support and Confidence
Support count:
The support count of an itemset X, denoted by
X.count, in a data set T is the number of
transactions in T that contain X. Assume T has n
transactions. Then,
support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count
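These definitions translate directly into code. The sketch below applies the formulas to a small toy transaction set (the same four transactions used in the following example), so the printed values can be checked by hand; the function names are my own.

# Support and confidence from support counts over a toy transaction set.
transactions = [
    {"A", "B", "C"},   # TID 10
    {"A", "C"},        # TID 20
    {"A", "D"},        # TID 30
    {"B", "E", "F"},   # TID 40
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y):
    return support_count(X | Y) / len(transactions)

def confidence(X, Y):
    return support_count(X | Y) / support_count(X)

print("support(A => C)    =", support({"A"}, {"C"}))      # 0.5
print("confidence(A => C) =", confidence({"A"}, {"C"}))   # 2/3
print("confidence(C => A) =", confidence({"C"}, {"A"}))   # 1.0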
Basic Concepts: Association Rules
Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

• Itemset X = {x1, …, xk}
• Find all the rules X ⇒ Y with minimum confidence and support
  – support, s: probability that a transaction contains X ∪ Y
  – confidence, c: conditional probability that a transaction having X also contains Y

Let minimum support be 50% and minimum confidence be 50%; we have
A ⇒ C (50%, 66.7%)
C ⇒ A (50%, 100%)
Example
Data set D:
TID | Itemset
T100 | 1, 3, 4
T200 | 2, 3, 5
T300 | 1, 2, 3, 5
T400 | 2, 5

Count, Support, Confidence (|D| = 4):
count(1, 3) = 2
support(1 ⇒ 3) = 0.5
support(3 ⇒ 2) = 0.5
confidence(3 ⇒ 2) = 0.67
Mining Association Rules: Example
Transaction-id | Items bought          (Min. support 50%, Min. confidence 50%)
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

Frequent pattern | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

For rule A ⇒ C:
support = support({A} ∪ {C}) = 50%
confidence = support({A} ∪ {C}) / support({A}) = 66.6%

The Apriori principle:
Any subset of a frequent itemset must be frequent.
Example of Association Rule
Examples:
bread ⇒ peanut-butter
beer ⇒ bread
Frequent itemsets: Items that frequently appear together
I = {bread, peanut-butter}
I = {beer, bread}

Support count (σ): Frequency of occurrence of an itemset
σ({bread, peanut-butter}) = 3
σ({beer, bread}) = 1

Support: Fraction of transactions that contain an itemset
s({bread, peanut-butter}) = 3/5
s({beer, bread}) = 1/5

Frequent itemset: An itemset whose support is greater than or equal to a minimum support threshold (minsup)
What’s an Interesting Rule?
An association rule is an implication of two itemsets: X ⇒ Y

Many measures of interest exist. The two most used are:

Support (s): The occurring frequency of the rule, i.e., the number of transactions that contain both X and Y.

Confidence (c): The strength of the association, i.e., a measure of how often items in Y appear in transactions that contain X.
The Apriori Algorithm—An Example
Database TDB (Min_sup = 2):
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan, C1 (1-candidates): {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

2nd scan, C2 (2-candidates): {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 (frequent 2-itemsets): {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

3rd scan, C3 (3-candidates): {B,C,E}
L3 (frequent 3-itemsets): {B,C,E}:2

Rules with support >= 50% and confidence 100%:
A ⇒ C,  B ⇒ E,  BC ⇒ E,  CE ⇒ B,  BE ⇒ C
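A compact, illustrative implementation of the level-wise search shown above; it reproduces the L1, L2 and L3 itemsets for the same TDB with a minimum support count of 2. The function names are my own, and rule generation from the frequent itemsets is omitted.

# Apriori sketch: candidate generation (join + prune) and support counting.
from itertools import combinations

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def frequent_itemsets(transactions, min_sup):
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]      # C1
    all_frequent = {}
    k = 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}   # Lk
        all_frequent.update(frequent)
        # join step: unions of frequent k-itemsets that yield (k+1)-itemsets;
        # prune step: every k-subset of a candidate must itself be frequent
        candidates = set()
        for a, b in combinations(frequent, 2):
            c = a | b
            if len(c) == k + 1 and all(frozenset(s) in frequent
                                       for s in combinations(c, k)):
                candidates.add(c)
        level = sorted(candidates, key=sorted)
        k += 1
    return all_frequent

for itemset, count in sorted(frequent_itemsets(tdb, min_sup).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)
# Prints the frequent itemsets {A},{B},{C},{E},{A,C},{B,C},{B,E},{C,E},{B,C,E}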
Clustering and Cluster Analysis
A cluster is a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

Clustering is “the process of organizing objects into groups whose members are similar in some way”.

“Cluster Analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual.”
- B. S. Everitt (1998), “The Cambridge Dictionary of Statistics”
Applications of Cluster Analysis
 Pattern Recognition
 Spatial Data Analysis
 Create thematic maps in GIS by clustering feature spaces
 Detect spatial clusters or for other spatial mining tasks
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access
patterns
Applications of Cluster Analysis
 Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
 City-planning: Identifying groups of houses according to their
house type, value, and geographical location
 Earth-quake studies: Observed earth quake epicenters should
be clustered along continent faults
Objectives of Cluster Analysis
Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated
to) the objects in other groups
The two objectives compete: intra-cluster distances are minimized, while inter-cluster distances are maximized.
Types of Clusterings
• Partitioning Clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
  – Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
  – Typical methods: k-means, k-medoids, CLARA (Clustering LARge Applications)

• Hierarchical Clustering
  – A set of nested clusters organized as a hierarchical tree
  – Create a hierarchical decomposition of the set of data (or objects) using some criterion
  – Typical methods: DiAna (Divisive Analysis), AgNes (Agglomerative Nesting), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), ROCK (RObust Clustering using linKs), CHAMELEON

• Density-based Clustering
  – Based on connectivity and density functions
  – Typical methods: DBSCAN (Density Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), DenClue (DENsity-based CLUstEring)

• Grid-based Clustering
  – Based on a multiple-level granularity structure
  – Typical methods: STING (STatistical INformation Grid), WaveCluster, CLIQUE (Clustering In QUEst)

• Model-based Clustering
  – A model is hypothesized for each of the clusters and the method tries to find the best fit of the data to the given model
  – Typical methods: EM (Expectation Maximization), SOM (Self-Organizing Map), COBWEB

• Frequent pattern-based Clustering
  – Based on the analysis of frequent patterns
  – Typical methods: pCluster

• User-guided or constraint-based Clustering
  – Clustering by considering user-specified or application-specific constraints
  – Typical methods: COD, constrained clustering
Partitioning Clustering
Figure: Original points and a partitional clustering of them.
Hierarchical Clustering
Figure: Two hierarchical clusterings of the points p1, p2, p3, p4, shown as nested groupings and as the corresponding dendrograms (Dendrogram 1 and Dendrogram 2).
Dendrogram 2
Strengths of Hierarchical Clustering
Do not have to assume any particular number of
clusters
– Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level

They may correspond to meaningful taxonomies
– Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)
K-means Algorithm
 Partitioning clustering approach
 Each cluster is associated with a centroid (center point or
mean point)
 Each point is assigned to the cluster with the closest centroid
 Number of clusters, K, must be specified
The basic algorithm is very simple:
The k-means partitioning algorithm.
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s
center is represented by the mean value of the objects in the cluster.

Input:
k: the number of clusters,
D: a data set containing n objects.

Output: A set of k clusters.

Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each
cluster;
(5) until no change;
Figure: Clustering of a set of objects based on the k-means method. (The mean
of each cluster is marked by a “+”.)
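A direct, minimal Python rendering of steps (1)-(5) for 2-D points; the sample points, k value, and random seed are arbitrary, and Euclidean distance is used as the similarity measure (math.dist requires Python 3.8+).

# Minimal k-means loop: assign each point to the closest center, recompute
# the cluster means, repeat until the assignments stop changing.
import math, random

def kmeans(points, k, max_iter=100):
    random.seed(42)
    centers = random.sample(points, k)            # (1) arbitrary initial centers
    clusters = []
    for _ in range(max_iter):
        # (3) assign each object to the cluster with the closest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # (4) update the cluster means (keep old center if a cluster is empty)
        new_centers = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                # (5) until no change
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(points, k=2)
print("centers:", centers)
print("clusters:", clusters)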
Example
Figure: Clustering with k-means, K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the cluster whose center is most similar; update the cluster means; then reassign objects and update the means again, repeating until the assignments no longer change.
K-means Clustering – Details
 Initial centroids are often chosen randomly.
 Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the
cluster.
 ‘Closeness’ is measured mostly by Euclidean distance (the typical choice), cosine similarity, correlation, etc.
 K-means will converge for common similarity measures
mentioned above.
 Most of the convergence happens in the first few iterations.
 Often the stopping condition is changed to ‘Until
relatively few points change clusters’
 Complexity is O( n * K * I * d )
n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Issues and Limitations for K-means
How to choose initial centers?
How to choose K?
How to handle Outliers?
Clusters different in
Shape
Density
Size
Assumes clusters are spherical in vector space
Sensitive to coordinate changes
K-means Algorithm
Pros
Simple
Fast for low dimensional data
It can find pure sub clusters if large number of clusters is specified

Cons
K-Means cannot handle non-globular data of different sizes and densities
K-Means will not identify outliers
K-Means is restricted to data which has the notion of a center (centroid)
Applicable only when mean is defined, then what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
Outliers
What are outliers?
The set of objects are considerably dissimilar from the remainder
of the data
Example: Sports: Michael Jordon, Randy Orton, Sachin Tendulkar ...

Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis

Outlier detection and analysis are very useful for fraud detection,
etc. and can be performed by statistical, distance-based or
deviation-based approaches
How to handle Outliers?
 The k-means algorithm is sensitive to outliers !
– Since an object with an extremely large value may substantially distort the
distribution of the data.
K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used; a medoid is the most centrally located object in a cluster.
Figure: The same points clustered around cluster means versus around medoids.
Example:
Use in finding Fraudulent usage of credit cards. Outlier Analysis
may uncover Fraudulent usage of credit cards by detecting
purchases of extremely large amounts for a given account number
in comparison to regular charges incurred by the same account.
Outlier values may also be detected with respect to the location
and type of purchase or the purchase frequency.
A Typical K-Medoids Algorithm (PAM)
Figure: The k-medoids algorithm (PAM) with K = 2. Arbitrarily choose k objects as the initial medoids and assign each remaining object to the nearest medoid (total cost = 20). Then repeat: randomly select a non-medoid object O_random, compute the total cost of swapping a medoid with O_random (e.g. total cost = 26), and perform the swap only if the quality is improved; stop when there is no change.
Example of K-medoids
Given the two medoids that are initially chosen are A
and B. Based on the following table and randomly
placing items when distances are identical to the two
medoids, we obtain the clusters {A, C, D} and {B, E}. The
three non-medoids {C, D, E} are examined to see which
should be used to replace A or B. We have six costs to
determine: TC_AC (the cost change by replacing medoid A with medoid C), TC_AD, TC_AE, TC_BC, TC_BD and TC_BE.

TC_AC = C_AAC + C_BAC + C_CAC + C_DAC + C_EAC = 1 + 0 - 2 - 1 + 0 = -2
where C_AAC is the cost change of object A after replacing medoid A with medoid C.
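The bookkeeping behind such cost comparisons can be sketched as follows: the total cost of a medoid set is the sum of each object's distance to its nearest medoid, and TC_AC is the change in that total when medoid A is swapped for C. The distance table below is invented purely for illustration; it is not the data of the example above.

# Evaluating one PAM swap: compare the total cost before and after replacing
# medoid A with the non-medoid C. All pairwise distances are made up.
dist = {
    ("A", "B"): 4, ("A", "C"): 2, ("A", "D"): 3, ("A", "E"): 6,
    ("B", "C"): 5, ("B", "D"): 7, ("B", "E"): 1,
    ("C", "D"): 2, ("C", "E"): 4, ("D", "E"): 5,
}
items = ["A", "B", "C", "D", "E"]

def d(x, y):
    if x == y:
        return 0
    return dist.get((x, y), dist.get((y, x)))

def total_cost(medoids):
    # each object is charged the distance to its nearest medoid
    return sum(min(d(o, m) for m in medoids) for o in items)

cost_AB = total_cost(["A", "B"])
cost_CB = total_cost(["C", "B"])
print("TC_AC (cost change of swapping A for C):", cost_CB - cost_AB)
# PAM performs the swap only if this change is negative, i.e. quality improves.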
Comparison between K-means and K-
medoids
The k-medoids method is more robust than k-
means in the presence of noise and outliers
because a medoid is less influenced by outliers
or other extreme values than a mean. However,
its processing is more costly than the k-means
method. Both methods require the user to
specify k, the number of clusters.
Thank you !!!
Unit 7 : Mining Complex Types
of Data
Introduction
Mining complex types of data includes:
• Multimedia data mining
• Text data mining
• Web data mining
  – Web content mining
  – Web usage mining
  – Web structure mining
Multimedia Data Mining
• Multimedia Data Mining is a subfield of data mining
that deals with an extraction of implicit knowledge,
multimedia data relationships, or other patterns not
explicitly stored in multimedia databases
• Multimedia Data Types
– any type of information medium that can be
represented, processed, stored and transmitted
over network in digital form
– Multi-lingual text, numeric, images, video, audio,
graphical, temporal, relational, and categorical
data.
– Relation with conventional data mining term
Generalizing Multimedia Data
 Image data:
– Extracted by aggregation and/or approximation
– Size, color, shape, texture, orientation, and relative positions
and structures of the contained objects or regions in the image
 Music data:
– Summarize its melody: based on the approximate patterns
that repeatedly occur in the segment
– Summarized its style: based on its tone, tempo, or the major
musical instruments played
 Video:
– provide news video annotation and indexing
– traffic monitoring system
Multidimensional Analysis of Multimedia Data
 Multimedia Data Cube
– Design and construction similar to that of traditional data
cubes from relational data
– Contain additional dimensions and measures for multimedia
information, such as color, texture, and shape
 The database does not store images but their descriptors.
– Feature descriptor: a set of vectors for each visual
characteristic
• Color vector: contains the color histogram
• MFC (Most Frequent Color) vector: five color centroids
• MFO (Most Frequent Orientation) vector: five edge
orientation centroids
– Layout descriptor: contains a color layout vector and an edge
layout vector
Mining Multimedia Databases
Refining or combining searches:
• Search for “blue sky” (top layout grid is blue)
• Search for “airplane in blue sky” (top layout grid is blue and keyword = “airplane”)
• Search for “blue sky and green meadows” (top layout grid is blue and bottom is green)
Text Mining
• Text mining is the procedure of synthesizing
information, by analyzing relations, patterns, and
rules among textual data. These procedures
contains text summarization, text categorization,
and text clustering.
1. Text summarization is the procedure of automatically extracting partial content that reflects the whole content of a text.
2. Text categorization is the procedure of assigning a
category to the text among categories predefined
by users
3. Text clustering is the procedure of segmenting
texts into several clusters, depending on the
substantial relevance.
• Text mining is well motivated, due to the
fact that much of the world’s data can be
found in free text form (newspaper articles,
emails, literature, etc.).
• There is a lot of information available to
mine.
• While mining free text has the same goals
as data mining, in general, extracting useful
knowledge/stats/trends), text mining must
overcome a major difficulty – there is no
explicit structure.
• Machines can reason with relational data
well since schemas are explicitly
available.
• Free text, however, encodes all semantic
information within natural language.
• Our text mining algorithms, then, must
make some sense out of this natural
language representation.
• Humans are great at doing this, but this
has proved to be a problem for
machines.
Application of Text Mining
Text mining system provides a competitive edge for a company to
process and take advantage of a large quantity of textual
information. The potential applications are countless. We highlight
a few below.
 Customer profile analysis, e.g., mining incoming emails for
customers' complaint and feedback.
 Patent analysis, e.g., analyzing patent databases for major
technology players, trends, and opportunities.
 Information dissemination, e.g., organizing and summarizing
trade news and reports for personalized information services.
 Company resource planning, e.g., mining a company's reports
and correspondences for activities, status, and problems
reported.
Text Mining vs. Data Mining
Aspect | Data Mining | Text Mining
Data object | Numerical & categorical data | Textual data
Data structure | Structured | Unstructured & semi-structured
Data representation | Straightforward | Complex
Space dimension | < tens of thousands | > tens of thousands
Methods | Data analysis, machine learning, statistics, neural networks | Data mining, information retrieval, NLP, ...
Maturity | Broad implementation since 1994 | Broad implementation starting 2000
Market | 10^5 analysts at large and mid-size companies | 10^8 analysts, corporate workers and individual users
Mining World Wide Web (WWW)
 The term Web Mining was coined by
Oren Etzioni (1996) to denote the use
of data mining techniques to
automatically discover Web documents
and services, extract information from
Web resources, and uncover general
patterns on the Web.
 The World Wide Web is a rich,
enormous knowledge base that can be
useful to many applications.
Mining World Wide Web (WWW)
 The WWW is huge, widely distributed, global
information service centre for news,
advertisements, consumer information,
financial management, education,
government, e-commerce, hyperlink
information, access and usage information.
 The Web’s large size and its unstructured and
dynamic content, as well as its multilingual
nature make extracting useful knowledge
from it a challenging research problem.
Why Mining the World-Wide Web
Figure: Internet growth (number of hosts, Sep-69 to Sep-99).
 Growing and changing very rapidly
 Broad diversity of user communities
 Only a small portion of the information on the Web is truly relevant or useful
– 99% of the Web information is useless to 99% of Web users
– How can we find high-quality Web pages on a specified topic?
Web Search Engines
 Index-based: search the Web, index Web pages,
and build and store huge keyword-based indices
 Help locate sets of Web pages containing
certain keywords
Deficiencies
– A topic of any breadth may easily contain
hundreds of thousands of documents
– Many documents that are highly relevant to a
topic may not contain keywords defining them
(polysemy)
Web Mining: A More Challenging Task
 Searches for
– Web access patterns
– Web structures
– Regularity and dynamics of Web contents
Problems
– The “abundance” problem
– Limited coverage of the Web: hidden Web sources,
majority of data in DBMS
– Limited query interface based on keyword-oriented
search
– Limited customization to individual users
Web Mining Taxonomy
Web Mining
 Web Content Mining
  – Web Page Content Mining
  – Search Result Mining
 Web Structure Mining
 Web Usage Mining
  – General Access Pattern Tracking
  – Customized Usage Tracking
Web Mining Taxonomy
• Web Mining research can be classified into
three categories:
• Web content mining refers to the discovery of
useful information from Web contents,
including text, images, audio, video, etc.
• Web structure mining studies the model
underlying the link structures of the Web. It has
been used for search engine result ranking and
other Web applications.
• Web usage mining focuses on using data mining
techniques to analyze search logs to find
interesting patterns. One of the main
applications of Web usage mining is its use to
learn user profiles.
Mining the World-Wide Web
Web Content Mining: Web Page Content Mining
 Web page summarization
 WebLog (Lakshmanan et al., 1996), WebOQL (Mendelzon et al., 1998), ...:
  Web structuring query languages; can identify information within given web pages
 Ahoy! (Etzioni et al., 1997): uses heuristics to distinguish personal home pages from other web pages
 ShopBot (Etzioni et al., 1997): looks for product prices within web pages
Mining the World-Wide Web
Web Content Mining: Search Result Mining
 Search engine result summarization
 Clustering search results (Leouski and Croft, 1996; Zamir and Etzioni, 1997):
  categorizes documents using phrases in titles and snippets
Mining the World-Wide Web
Web Structure Mining
 Using links
  – PageRank (Brin et al., 1998)
  – CLEVER (Chakrabarti et al., 1998)
  Use interconnections between web pages to give weight to pages.
 Using generalization
  – MLDB (1994), VWV (1998)
  Uses a multi-level database representation of the Web; counters (popularity) and link lists are used for capturing structure.
Mining the World-Wide Web
Web Usage Mining: General Access Pattern Tracking
 Web Log Mining (Zaïane, Xin and Han, 1998)
  Uses KDD techniques to understand general access patterns and trends.
  Can shed light on better structure and grouping of resource providers.
Mining the World-Wide Web
Web Usage Mining: Customized Usage Tracking
 Adaptive Sites (Perkowitz and Etzioni, 1997)
  Analyzes access patterns of each user at a time.
  Web site restructures itself automatically by learning from user access patterns.
Web Usage Mining
 Web servers, Web proxies, and client applications can
quite easily capture Web Usage data.
– Web server log: Every visit to the pages, what and when
files have been requested, the IP address of the request,
the error code, the number of bytes sent to user, and the
type of browser used.
 By analyzing the Web usage data, web mining systems can
discover useful knowledge about a system’s usage
characteristics and the users’ interests which has various
applications:
– Personalization and Collaboration in Web-based systems
– Marketing
– Web site design and evaluation
– Decision support
 Mining Web log records to discover user access
patterns of Web pages
Applications
– Target potential customers for electronic commerce
– Enhance the quality and delivery of Internet
information services to the end user
– Improve Web server system performance
– Identify potential prime advertisement locations
 Web logs provide rich information about Web
dynamics
– Typical Web log entry includes the URL requested,
the IP address from which the request originated,
and a timestamp
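As a rough illustration, the sketch below parses one server log line in the Apache combined format into those fields (IP address, timestamp, requested URL, status code, bytes sent, referrer, browser) before any mining step; the regular expression and the sample line are assumptions, not a prescribed format.

```python
# Parse one line of an Apache-style combined log into the fields that Web
# usage mining works with: IP, timestamp, requested URL, status, bytes, agent.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

sample = ('192.0.2.10 - - [10/Mar/2024:13:55:36 +0000] '
          '"GET /products/index.html HTTP/1.1" 200 2326 '
          '"http://example.com/start.html" "Mozilla/5.0"')

match = LOG_PATTERN.match(sample)
if match:
    entry = match.groupdict()
    print(entry["ip"], entry["url"], entry["status"], entry["agent"])
```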
Why Web Usage Mining?
Explosive growth of E-commerce
– Provides a cost-efficient way of doing business
– Amazon.com: “online Wal-Mart”
Hidden Useful information
– Visitors’ profiles can be discovered
– Measuring online marketing efforts, launching
marketing campaigns, etc.
• One of the major goals of Web usage mining is to
reveal interesting trends and patterns which can
often provide important knowledge about the users
of a system.
 Many Web applications aim to provide personalized
information and services to users. Web usage data
provide an excellent way to learn about users’ interest.
– WebWatcher (Armstrong et al., 1995)
– Letizia (Lieberman, 1995)
 Web usage mining on Web logs can help identify users
who have accessed similar Web pages. The patterns
that emerge can be very useful in collaborative Web
searching and filtering.
– Amazon.com uses collaborative filtering to
recommend books to potential customers based on
the preferences of other customers having similar
interests or purchasing histories.
– Huang et al. (2002) used Hopfield Net to model
user interests and product profiles in an online
bookstore in Taiwan.
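A toy sketch of the collaborative-filtering idea mentioned above: recommend items liked by the user whose rating vector is most similar (cosine similarity) to the target user's. The rating matrix, user names, and similarity choice are illustrative assumptions, not Amazon's actual algorithm.

```python
# User-based collaborative filtering on a tiny user x item rating matrix.
import numpy as np

users = ["alice", "bob", "carol"]
items = ["book_a", "book_b", "book_c", "book_d"]
# Rows: users, columns: items, 0 = not rated.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 5, 4],
], dtype=float)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def recommend(user_idx, top_n=1):
    # Find the most similar other user and suggest items they rated highly
    # that the target user has not rated yet.
    sims = [(cosine(ratings[user_idx], ratings[j]), j)
            for j in range(len(users)) if j != user_idx]
    _, nearest = max(sims)
    unseen = [i for i in range(len(items)) if ratings[user_idx, i] == 0]
    unseen.sort(key=lambda i: ratings[nearest, i], reverse=True)
    return [items[i] for i in unseen[:top_n]]

print(recommend(1))  # items bob has not rated, ranked by his nearest neighbour's ratings
```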
How to perform Web Usage
Mining
 Obtain web traffic data from
– Web server log files
– Corporate relational databases
– Registration forms
 Apply data mining techniques and other Web mining
techniques
 Two categories:
– Pattern Discovery Tools
– Pattern Analysis Tools
Web Content Mining
• Web Content Mining is the process of extracting
useful information from the contents of Web
documents.
• Content data corresponds to the collection of facts a
Web page was designed to convey to the users.
• May consist of text, images, audio, video, or
structured records such as lists and tables.
• Web content has been the most widely researched.
• Issues addressed in text mining are, topic discovery,
extracting association patterns, clustering of web
documents and classification of Web Pages.
Web Content Mining
 Text Mining for Web Documents
  – Text mining for Web documents can be considered a sub-field of Web content mining.
  – Information extraction techniques have been applied to Web HTML documents.
  – Text clustering algorithms also have been applied to Web applications, as sketched below.
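A minimal sketch of such text clustering, assuming scikit-learn is available: a handful of invented page texts are turned into TF-IDF vectors and grouped with k-means.

```python
# Clustering a few (hypothetical) web-page texts with TF-IDF + k-means,
# one common way text clustering is applied in Web content mining.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

pages = [
    "laptop prices and product reviews for online shopping",
    "compare smartphone prices before you buy online",
    "university course notes on data mining and machine learning",
    "lecture slides about clustering and classification algorithms",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(pages)          # sparse TF-IDF matrix

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for page, label in zip(pages, kmeans.labels_):
    print(label, page[:50])
```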
Web Structure Mining
 Web link structure has been widely used to infer
important web pages information.
 Web structure mining has been largely influenced by
research in
– Social network analysis
– Citation analysis (bibliometrics).
in-links: the hyperlinks pointing to a page
out-links: the hyperlinks found in a page.
Usually, the larger the number of in-links, the better a page
is.
 By analyzing the pages containing a URL, we can also
obtain
– Anchor text: how other Web page authors annotate a page
and can be useful in predicting the content of the target page.
Web structure mining algorithms:
– The PageRank algorithm is computed by
weighting each in-link to a page
proportionally to the quality of the page
containing the in-link.
• The qualities of these referring pages also are determined
by PageRank.
– Kleinberg (1998) proposed the HITS
(Hyperlink-Induced Topic Search) algorithm,
which is similar to PageRank.
 Authority pages: high-quality pages related to a particular search
query.
 Hub pages: pages provide pointers to other authority pages.
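A compact sketch of the PageRank computation described above: each page's score is divided among its out-links and accumulated through its in-links, iterated with a damping factor until it stabilizes. The four-page link graph is invented for illustration; real implementations also handle dangling pages and run at Web scale.

```python
# Power-iteration PageRank on a small made-up link graph.
import numpy as np

# out_links[page] = pages it links to
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = sorted(out_links)
n = len(pages)
idx = {p: i for i, p in enumerate(pages)}

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for src, targets in out_links.items():
    for dst in targets:
        M[idx[dst], idx[src]] = 1.0 / len(targets)

d = 0.85                      # damping factor
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M @ rank

for p in pages:
    print(p, round(rank[idx[p]], 3))   # C (most in-links) ends up highest, then A
```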
Conclusion
 Multimedia data mining needs content-based retrieval and similarity search integrated with mining methods.
 Text mining goes beyond keyword-based and similarity-based information retrieval and discovers knowledge from semi-structured data using methods like keyword-based association and document classification.
 Web mining includes mining Web link structures to identify authoritative Web pages, the automatic classification of Web documents, building a multilayered Web information base, and Web log mining.
Thank you !!!
Chapter 8
Application and Trends in
Data Warehousing and
Data Mining
Data Mining Systems Products and Research
Prototypes

As a young discipline, data mining has a


relatively short history and is constantly
evolving: new data mining systems appear on
the market every year; new functions,
features, and visualization tools are added to
existing systems on a constant basis; and
efforts toward the standardization of data
mining language have only just begun.
How to Choose a Data Mining System?

Commercial data mining systems have little in common


– Different data mining functionality or methodology
– May even work with completely different kinds of data sets
Need multiple dimensional view in selection
Data types: relational, transactional, text, time
sequence, spatial?
System issues
– running on only one or on several operating systems?
– a client/server architecture?
– Provide Web-based interfaces and allow XML data as input
and/or output?
 Data sources
– ASCII text files, multiple relational data sources
– support ODBC connections (OLE DB, JDBC)?
 Data mining functions and methodologies
– One vs. multiple data mining functions
– One vs. variety of methods per function
• More data mining functions and methods per function
provide the user with greater flexibility and analysis power
 Coupling with Database and/or data warehouse systems
– Four forms of coupling: no coupling, loose coupling, semitight
coupling, and tight coupling
• Ideally, a data mining system should be tightly coupled with
a database system
Scalability
– Row (or database size) scalability
– Column (or dimension) scalability
– Curse of dimensionality: it is much more challenging
to make a system column scalable than row scalable
Visualization tools
– “A picture is worth a thousand words”
– Visualization categories: data visualization, mining
result visualization, mining process visualization, and
visual data mining
Data mining query language and graphical user
interface
– Easy-to-use and high-quality graphical user interface
– Essential for user-guided, highly interactive data
mining
Examples of Data Mining Systems
Microsoft SQL Server 2005
– Integrate DB and OLAP with mining
– Support OLEDB for DM standard

IBM Intelligent Miner


– Intelligent Miner is an IBM data-mining product
– A wide range of data mining algorithms
– Scalable mining algorithms
– Toolkits: neural network algorithms, statistical methods, data
preparation, and data visualization tools
– Tight integration with IBM's DB2 relational database system

SAS Enterprise Miner


– SAS Institute Inc. developed Enterprise Miner
– A variety of statistical analysis tools
– Data warehouse tools and multiple data mining algorithms
Enterprise Miner Capabilities
SGI MineSet
– Silicon Graphics Inc. (SGI) developed MineSet
– Multiple data mining algorithms and advanced statistics
– Advanced visualization tools

DBMiner
– DBMiner Technology Inc developed DBMiner.
– It provides multiple data mining algorithms including discovery-
driven OLAP analysis, association, classification, and clustering

SPSS Clementine
– Integral Solutions Ltd. (ISL) developed Clementine
– Clementine has been acquired by SPSS Inc.
– An integrated data mining development environment for end-
users and developers
– Multiple data mining algorithms and visualization tools including
rule induction, neural nets, classification, and visualization tools
SPSS Clementine
Theoretical Foundations of Data Mining
 Data reduction
– The basis of data mining is to reduce the data
representation
– Trades accuracy for speed in response
 Data compression
– The basis of data mining is to compress the given data
by encoding in terms of bits, association rules,
decision trees, clusters, etc.
 Pattern discovery
– The basis of data mining is to discover patterns
occurring in the database, such as associations,
classification models, sequential patterns, etc.
Probability theory
– The basis of data mining is to discover joint probability
distributions of random variables
Microeconomic view
– A view of utility: the task of data mining is finding
patterns that are interesting only to the extent that
they can be used in the decision-making process of
some enterprise
Inductive databases
– Data mining is the problem of performing inductive
logic on databases,
– The task is to query the data and the theory (i.e.,
patterns) of the database
– Popular among many researchers in database systems
Statistical Data Mining
 There are many well-established statistical techniques
for data analysis, particularly for numeric data
– applied extensively to data from scientific
experiments and data from economics and the
social sciences

Regression
 predict the value of a response
(dependent) variable from one or more
predictor (independent) variables where
the variables are numeric
 forms of regression: linear, multiple,
weighted, polynomial, nonparametric,
and robust
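A minimal sketch of linear regression on synthetic data, fitting the intercept and slope by ordinary least squares with NumPy; the data-generating coefficients are arbitrary.

```python
# Ordinary least-squares linear regression on a small synthetic data set.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)             # predictor (independent) variable
y = 2.5 * x + 4.0 + rng.normal(0, 1, 50)    # response with noise

# Fit y = b0 + b1*x via least squares (np.polyfit would also do).
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"intercept={b0:.2f}, slope={b1:.2f}")   # close to 4.0 and 2.5
```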
Generalized linear models
– allow a categorical response variable
(or some transformation of it) to be
related to a set of predictor variables
– similar to the modeling of a numeric
response variable using linear
regression
– include logistic regression and Poisson
regression
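A short sketch of one such generalized linear model, logistic regression, fit with scikit-learn on synthetic data; the two-predictor setup is an assumption for illustration.

```python
# Logistic regression: a generalized linear model for a binary (categorical)
# response, here fit with scikit-learn on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                     # two numeric predictors
# True rule: class 1 when a linear combination of the predictors is positive.
y = (1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print(model.coef_, model.intercept_)
print(model.predict_proba(X[:3]))                 # class probabilities
```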

Mixed-effect models
 For analyzing grouped data, i.e. data that can be classified
according to one or more grouping variables
 Typically describe relationships between a response variable and
some covariates in data grouped according to one or more factors
Regression trees
– Binary trees used for classification
and prediction
– Similar to decision trees: tests are
performed at the internal nodes
– In a regression tree the mean of the
objective attribute is computed and
used as the predicted value

Analysis of variance
– Analyze experimental data for two
or more populations described by a
numeric response variable and one
or more categorical variables
(factors)
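To make the regression-tree item above concrete, here is a small sketch using scikit-learn's DecisionTreeRegressor on synthetic data; the printed tree shows the split tests at internal nodes and the mean objective value predicted at each leaf.

```python
# Regression tree: internal nodes test predictor values, each leaf predicts
# the mean of the objective attribute for the training cases reaching it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(100, 1))
y = np.where(X[:, 0] < 5, 10.0, 25.0) + rng.normal(0, 1, 100)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x"]))   # shows split points and leaf means
```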
www.spss.com/datamine/factor.htm

Factor analysis
– determine which variables are
combined to generate a given
factor
– e.g., for many psychiatric data, one
can indirectly measure other
quantities (such as test scores)
that reflect the factor of interest

Discriminant analysis
– predict a categorical response
variable, commonly used in social
science
– Attempts to determine several
discriminant functions (linear
combinations of the independent
variables) that discriminate among
the groups defined by the
response variable
Time series:
Many methods such as auto regression, ARIMA (Autoregressive integrated
moving-average modeling), long memory time-series modeling

Quality control:
Displays group summary charts

Survival analysis
Predicts the probability
that a patient undergoing a
medical treatment would
survive at least to time t
(life span prediction)
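As a concrete illustration of the time-series methods listed above, the sketch below fits a simple ARIMA model to a synthetic AR(1) series, assuming statsmodels is available; the chosen order (1, 0, 0) matches the way the data were generated.

```python
# Fitting a simple ARIMA model to a synthetic AR(1) series with statsmodels.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
n = 200
series = np.zeros(n)
for t in range(1, n):                        # AR(1): y_t = 0.7*y_{t-1} + noise
    series[t] = 0.7 * series[t - 1] + rng.normal()

model = ARIMA(series, order=(1, 0, 0)).fit()   # (p, d, q) = (1, 0, 0)
print(model.params)                            # estimated AR coefficient near 0.7
print(model.forecast(steps=5))                 # forecast the next five values
```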
Visual and Audio Data Mining
Visualization: use of computer graphics to create visual images
which aid in the understanding of complex, often massive
representations of data

Visual Data Mining: the process of discovering implicit but


useful knowledge from large data sets using visualization
techniques

Visual data mining draws on computer graphics, multimedia systems, pattern recognition, high-performance computing, and human-computer interfaces.
Purpose of Visualization
– Gain insight into an information space by
mapping data onto graphical primitives
– Provide qualitative overview of large data sets
– Search for patterns, trends, structure,
irregularities, relationships among data.
– Help find interesting regions and suitable
parameters for further quantitative analysis.
– Provide a visual proof of computer
representations derived
Integration of visualization and data mining
– data visualization
– data mining result visualization
– data mining process visualization
– interactive visual data mining
Data visualization
– Data in a database or data warehouse can be
viewed
• at different levels of granularity or abstraction
• as different combinations of attributes or
dimensions
– Data can be presented in various visual forms
Boxplots from Statsoft: Multiple Variable
Combinations
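A small sketch of such a display, assuming matplotlib is available: boxplots of one numeric attribute broken down by combinations of two categorical attributes (the group names and values are invented).

```python
# Data visualization: boxplots of one attribute for several combinations of
# two categorical attributes, in the spirit of the slide above.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
groups = {
    "region=N, type=A": rng.normal(50, 5, 100),
    "region=N, type=B": rng.normal(60, 8, 100),
    "region=S, type=A": rng.normal(45, 4, 100),
    "region=S, type=B": rng.normal(55, 10, 100),
}

plt.boxplot(list(groups.values()))
plt.xticks(range(1, len(groups) + 1), list(groups.keys()), rotation=20)
plt.ylabel("sales")
plt.tight_layout()
plt.show()
```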
Data Mining Result Visualization

 Presentation of the results or knowledge obtained from


data mining in visual forms
 Examples
– Scatter plots and boxplots (obtained from descriptive data
mining)
– Decision trees
– Association rules
– Clusters
– Outliers
– Generalized rules
Visualization of Data Mining Results in SAS
Enterprise Miner: Scatter Plots
Visualization of Association Rules in
SGI/MineSet 3.0
Visualization of a Decision Tree in SGI/MineSet 3.0
Visualization of Cluster Grouping in IBM Intelligent
Miner
Data Mining Process Visualization

Presentation of the various processes of data mining in


visual forms so that users can see
– Data extraction process
– Where the data is extracted
– How the data is cleaned, integrated, preprocessed, and
mined
– Method selected for data mining
– Where the results are stored
– How they may be viewed
Visualization of Data Mining Processes by
Clementine
 See your solution discovery process clearly
 Understand variations with visualized data
Interactive Visual Data Mining
Using visualization tools in the data mining
process to help users make smart data mining
decisions
Example
– Display the data distribution in a set of attributes using
colored sectors or columns (depending on whether the
whole space is represented by either a circle or a set of
columns)
– Use the display to decide which sector should first be
selected for classification and where a good split point
for this sector may be
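A rough sketch of the kind of display described in the example, using matplotlib on synthetic data: the class distribution is stacked across value bins of one attribute so a user can eyeball which attribute and split point look promising.

```python
# Class distribution across bins of one attribute -- the kind of display an
# analyst can inspect to pick an attribute and split point interactively.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
attribute = np.concatenate([rng.normal(3, 1, 300), rng.normal(7, 1, 300)])
labels = np.array([0] * 300 + [1] * 300)          # two classes

bins = np.linspace(attribute.min(), attribute.max(), 11)
which_bin = np.digitize(attribute, bins)
counts0 = [np.sum((which_bin == b) & (labels == 0)) for b in range(1, 12)]
counts1 = [np.sum((which_bin == b) & (labels == 1)) for b in range(1, 12)]

x = np.arange(len(counts0))
plt.bar(x, counts0, label="class 0")
plt.bar(x, counts1, bottom=counts0, label="class 1")
plt.xlabel("attribute value bins")
plt.ylabel("count")
plt.legend()
plt.show()
```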
Interactive Visual Mining by Perception-Based
Classification (PBC)
Audio Data Mining
Uses audio signals to indicate the patterns of data or the
features of data mining results
An interesting alternative to visual mining
This is the inverse of mining audio (such as music) databases,
where the task is to find patterns in the audio data itself
Visual data mining may disclose interesting patterns using
graphical displays, but requires users to concentrate on
watching patterns
Instead, transform patterns into sound and music and listen
to pitches, rhythms, tune, and melody in order to identify
anything interesting or unusual
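A playful sketch of the sonification idea: map a numeric series onto pitches so that an outlier is heard as an unusually high note. The pitch range and data are arbitrary assumptions, and actually rendering audio is left out; only the mapping is shown.

```python
# Map data values onto musical pitches (sonification): larger values become
# higher notes, so outliers stand out as unusually high or low tones.
import numpy as np

values = np.array([10, 12, 11, 13, 12, 48, 11, 12])    # 48 is an outlier
low_midi, high_midi = 48, 84                            # roughly C3..C6

# Linearly rescale the data range onto the MIDI note range.
scaled = (values - values.min()) / (values.max() - values.min())
midi_notes = np.round(low_midi + scaled * (high_midi - low_midi)).astype(int)

# Convert MIDI note numbers to frequencies in Hz (A4 = MIDI 69 = 440 Hz).
freqs = 440.0 * 2 ** ((midi_notes - 69) / 12)
for v, n, f in zip(values, midi_notes, freqs):
    print(f"value={v:>3}  midi={n}  freq={f:6.1f} Hz")
```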
Data Mining and Collaborative Filtering
Social Impact of Data Mining
Is Data Mining a Hype or Will It Be Persistent?
Data mining is a technology
Technological life cycle
– Innovators
– Early adopters
– Chasm
– Early majority
– Late majority
– Laggards
Life Cycle of Technology Adoption

Data mining is at Chasm!?


– Existing data mining systems are too generic
– Need business-specific data mining solutions and
smooth integration of business logic with data
mining functions
Social Impacts: Threat to Privacy
 Is data mining a threat to privacy and data security?
– “Big Brother”, “Big Banker”, and “Big Business” are carefully
watching you
– Profiling information is collected every time
• You use your credit card, debit card, supermarket loyalty
card, or frequent flyer card, or apply for any of the above
• You surf the Web, reply to an Internet newsgroup,
subscribe to a magazine, rent a video, join a club, fill out
a contest entry form,
• You pay for prescription drugs, or present your medical
care number when visiting the doctor
– While collection of personal data may be beneficial for
companies and consumers, there is also potential for misuse
Protect Privacy and Data Security
 Fair information practices
– International guidelines for data privacy protection
– Cover aspects relating to data collection, purpose, use,
quality, openness, individual participation, and accountability
– Purpose specification and use limitation
– Openness: Individuals have the right to know what
information is collected about them, who has access to the
data, and how the data are being used
 Develop and use data security-enhancing techniques
– Blind signatures
– Biometric encryption
– Anonymous databases
Trends in Data Mining
Application exploration
– development of application-specific data mining systems
– Invisible data mining (mining as built-in function)
Scalable data mining methods
– Constraint-based mining: use of constraints to guide data
mining systems in their search for interesting patterns
Integration of data mining with database systems,
data warehouse systems, and Web database systems
Invisible data mining
Standardization of data mining language
– A standard will facilitate systematic development,
improve interoperability, and promote the education
and use of data mining systems in industry and society
Visual data mining
New methods for mining complex types of data
– More research is required towards the integration of
data mining methods with existing data analysis
techniques for the complex types of data
Web mining
Privacy protection and information security in
data mining