Data Mining e Resources
Introduction:
There is a huge amount of data available in the Information Industry. This data is of
no use until it is converted into useful information. It is necessary to analyze this huge
amount of data and extract useful information from it.
Extraction of information is not the only process we need to perform; data mining also
involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data
Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, we
would be able to use this information in many applications such as Fraud Detection, Market
Analysis, Production Control, Science Exploration, etc.
Data Mining:
Data Mining is defined as extracting information from huge sets of data. In other words, we
can say that data mining is the procedure of mining knowledge from data. It is the process of
discovering patterns in large data sets involving methods at the intersection of machine
learning, statistics, and database systems.
Why Data Mining?
The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from business management, production control, research projects, and market analysis to engineering design and science exploration.
Data mining can be viewed as a result of the natural evolution of information technology. An evolutionary path has been witnessed in the database industry in the development of the following functionalities:
● Data Collection
● Database Creation
● Data Management (Storage, retrieval, transaction processing)
● Data Analysis and Understanding
Why Data Mining is important?
● Large amounts of current and historical data are being stored.
● As databases grow larger, decision-making directly from the raw data is not possible. We need knowledge derived from the stored data.
● Data Sources
o Health-related services, e.g., medical analysis.
o Commercial, e.g., marketing and sales.
o Financial.
o Scientific, e.g., NASA, Genome
▪ DOD and Intelligence
● Desired analysis
o Support for planning (historical supply and demand trends)
o Yield management (e.g., scanning airline seat reservations)
o System performance (detecting abnormal behaviour in a system)
o Mature database analysis (cleaning up the data sources)
● Market Analysis
● Fraud Detection
● Customer Retention
● Production Control
● Science Exploration
Apart from these, data mining can also be used in the areas of production control,
customer retention, science exploration, sports, astrology, and Internet Web Surf-Aid.
All data mining queries use the Data Mining Extensions (DMX) language. DMX can
be used to create models for all kinds of machine learning tasks, including classification, risk
analysis, generation of recommendations, and linear regression. We can also write DMX
queries to get information about the patterns and statistics that were generated when we
processed the model.
We can write our own DMX, or we can build basic DMX using a tool such as the Prediction Query Builder and then modify it. Both SQL Server Management Studio and
Visual Studio with Analysis Services projects provide tools that help us to build DMX
prediction queries. This topic describes how to create and execute data mining queries using
these tools.
Prediction Query Builder is included in the Mining Model Prediction tab of Data
Mining Designer, which is available in both SQL Server Management Studio and Visual
Studio with Analysis Services projects.
When we use the query builder, we select a mining model, add new case data, and add prediction functions. We can then switch to the text editor to modify the query manually, or switch to the Results pane to view the results of the query.
Query Editor
The Query Editor in SQL Server Management Studio also lets you build and run DMX queries. You can connect to an instance of Analysis Services, and then select a database, mining structure columns, and a mining model. The Metadata Explorer contains a list of prediction functions that you can browse.
DMX Templates
SQL Server Management Studio provides interactive DMX query templates that you
can use to build DMX queries. If you do not see the list of templates, click View on the
toolbar, and select Template Explorer. To see all Analysis Services templates, including
templates for DMX, MDX, and XMLA, click the cube icon.
To build a query using a template, we can drag the template into an open query
window, or we can double-click the template to open a new connection and a new query
pane.
Machine Learning:
On the other hand, machine learning is the process of discovering algorithms that improve through experience derived from data. It is the design, study, and development of algorithms that permit machines to learn without human intervention. It is a tool to make machines smarter, eliminating the human element (but not eliminating humans themselves; that would be wrong).
Machine learning can look at patterns and learn from them to adapt behaviour for
future incidents, while data mining is typically used as an information source for machine
learning to pull from.
With the enormous amount of data stored in files, databases, and other repositories, it
is increasingly important, if not necessary, to develop powerful means for analysis and
perhaps interpretation of such data and for the extraction of interesting knowledge that could
help in decision-making.
The Knowledge Discovery in Databases (KDD) process comprises a few steps leading from raw data collections to some form of new knowledge.
The iterative process consists of the following steps:
1. Data Cleaning: The phase in which noisy data and irrelevant data are removed from the collection.
2. Data Integration: At this stage, multiple data sources, often heterogeneous, may be combined in a common source.
3. Data Selection: At this step, the data relevant to the analysis is decided on and retrieved from the data collection.
4. Data Transformation: Also known as data consolidation; at this step the selected data is transformed into forms appropriate for the mining procedure.
5. Data Mining: The crucial step in which clever techniques are applied to extract potentially useful patterns.
Data Mining Techniques:
1. Classification:
This analysis is used to retrieve important and relevant information about data and metadata. This data mining method helps classify data into different classes.
For example, if we are evaluating data on individual customers’ financial backgrounds and purchase histories, we might be able to classify them as “low,” “medium,” or “high” credit risks. We could then use these classifications to learn even more about those customers.
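As a minimal illustrative sketch (not part of the original text), the snippet below trains a small decision tree on made-up customer attributes and assigns a “low”/“medium”/“high” credit-risk class to a new customer; the column meanings and values are assumptions chosen only for demonstration.

```python
# A minimal sketch with made-up attributes: classify customers into credit-risk
# classes with a small decision tree (scikit-learn assumed to be installed).
from sklearn.tree import DecisionTreeClassifier

# Hypothetical columns: [annual_income_k, past_purchases, late_payments]
X = [[25, 2, 4], [40, 5, 2], [60, 12, 1], [90, 20, 0], [30, 1, 5], [75, 15, 0]]
y = ["high", "medium", "low", "low", "high", "low"]  # credit-risk class labels

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

# Classify a new, unseen customer.
print(model.predict([[55, 8, 1]]))  # predicted credit-risk class
```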
2. Association Rules:
Association is related to tracking patterns, but is more specific to dependently linked variables. In this case, we look for specific events or attributes that are highly correlated with another event or attribute. This data mining technique helps to find the association between two or more items. It discovers hidden patterns in the data set.
For example, we might notice that when our customers buy a specific item, they also often buy a second, related item. This is usually what’s used to populate “people also bought” sections of online stores.
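The idea behind “people also bought” can be sketched with plain pair-counting. The hedged example below uses an invented set of shopping baskets and simply counts how often two items appear together, reporting their support; a production system would use a dedicated algorithm such as Apriori.

```python
# A minimal sketch on invented shopping baskets: count how often item pairs are
# bought together and report their support (fraction of baskets containing the pair).
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "milk"},
    {"butter", "milk", "bread"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))
```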
3. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in transaction data over a certain period. One of the most basic techniques in data mining is learning to recognize patterns in our data sets. This is usually recognition of some aberration in our data happening at regular intervals, or an ebb and flow of a certain variable over time.
For example, we might see that sales of a certain product spike just before the holidays, or notice that warmer weather drives more people to our website.
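A very small, hypothetical illustration of spotting such a recurring pattern: the pandas snippet below totals invented sales amounts by calendar month, which makes a pre-holiday (December) spike visible.

```python
# A minimal sketch with hypothetical sales: total the amounts by calendar month
# so a recurring pre-holiday (December) spike becomes visible.
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-10-05", "2023-11-20", "2023-12-10",
                            "2023-12-22", "2024-01-15", "2024-12-18"]),
    "amount": [120, 340, 960, 1100, 150, 1040],
})

monthly = sales.groupby(sales["date"].dt.month)["amount"].sum()
print(monthly)          # December clearly dominates
print(monthly.idxmax()) # month with the recurring spike -> 12
```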
4. Outlier Detection:
This data mining technique refers to the observation of data items in the dataset which do not match an expected pattern or expected behaviour. This technique can be used in a variety of domains, such as intrusion detection, fraud or fault detection, etc. Outlier detection is also called outlier analysis or outlier mining.
For example, if our purchasers are almost exclusively male, but during one strange week
in July, there’s a huge spike in female purchasers, we’ll want to investigate the spike and see
what drove it, so we can either replicate it or better understand our audience in the process.
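One simple way to operationalize this, sketched below with fabricated weekly counts, is a z-score check: any week whose value lies more than two standard deviations from the mean is flagged. The threshold and the data are assumptions for illustration only.

```python
# A minimal sketch with fabricated weekly counts: flag weeks whose value lies more
# than two standard deviations from the mean (a simple z-score outlier check).
import numpy as np

weekly_female_purchases = np.array([12, 15, 11, 14, 13, 210, 12, 16])

mean = weekly_female_purchases.mean()
std = weekly_female_purchases.std()
z_scores = (weekly_female_purchases - mean) / std

outlier_weeks = np.where(np.abs(z_scores) > 2)[0]
print(outlier_weeks, weekly_female_purchases[outlier_weeks])  # the unusual spike
```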
5. Clustering:
Clustering analysis is a data mining technique to identify data that are similar to each other. This process helps to understand the differences and similarities between the data. It is very similar to classification, but involves grouping chunks of data together based on their similarities.
For example, we might choose to cluster different demographics of our audience into
different packets based on how much disposable income they have, or how often they tend to
shop at your store.
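As a hedged sketch of the idea, the snippet below groups a handful of invented customer records by disposable income and visit frequency using k-means from scikit-learn; the column meanings and the choice of three clusters are assumptions.

```python
# A minimal sketch with invented demographics: group customers by disposable
# income and visit frequency using k-means, then inspect labels and centroids.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical columns: [disposable_income_k, visits_per_month]
customers = np.array([[15, 1], [18, 2], [52, 6], [55, 5], [90, 12], [95, 10]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the average point ("centroid") of each cluster
```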
6. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship between variables. It is used primarily as a form of planning and modeling, to identify the likelihood of a specific variable given the presence of other variables.
For example, we could use it to project a certain price, based on other factors like
availability, consumer demand, and competition. More specifically, regression’s main focus
is to help you uncover the exact relationship between two (or more) variables in a given data
set.
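A minimal sketch of this with fabricated numbers: fit a linear model that relates price to availability and a demand index, then project a price for a new combination of those variables. The column names and figures are assumptions.

```python
# A minimal sketch with fabricated numbers: fit a linear model relating price to
# availability and a demand index, then project a price for a new scenario.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical columns: [units_available, demand_index]
X = np.array([[100, 30], [80, 45], [60, 60], [40, 75], [20, 90]])
price = np.array([9.5, 11.0, 12.8, 14.6, 16.9])

reg = LinearRegression().fit(X, price)
print(reg.coef_, reg.intercept_)  # estimated relationship between the variables
print(reg.predict([[50, 70]]))    # projected price under new conditions
```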
7. Prediction:
Prediction is one of the most valuable data mining techniques, since it is used to project the types of data you will see in the future. In many cases, just recognizing and understanding historical trends is enough to chart a somewhat accurate prediction of what will happen in the future. Prediction uses a combination of the other data mining techniques, such as trends, sequential patterns, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.
For example, you might review consumers’ credit histories and past purchases to predict
whether they’ll be a credit risk in the future.
Unit – II
Data models
Multidimensional Data model – Data cube – Dimension Modeling – OLAP operation –
Meta Data – Types of Meta Data.
2. Data Models
The dimensional model was developed for implementing data in the warehouse and data marts. The multidimensional data model provides both a mechanism to store data and a way for business analysis.
A multidimensional data model stores data in the form of a data cube. Mostly, data warehousing supports two- or three-dimensional cubes. A data cube allows data to be viewed in multiple dimensions. Dimensions are entities with respect to which an organization wants to keep records. For example, in a store sales record, dimensions allow the store to keep track of things like monthly sales of items and the branches and locations.
Figure 2.1.1: Multidimensional Model
The multidimensional data model is designed to solve complex queries in real time. The
multidimensional data model is composed of logical cubes, measures, dimensions,
hierarchies, levels, and attributes. The simplicity of the model is inherent because it defines
objects that represent real-world business entities.
• Cubes:
Logical cubes provide a means of organizing measures that have the same shape, that is,
they have the exact same dimensions. Measures in the same cube have the same relationships
to other logical objects and can easily be analyzed and displayed together.
• Measures:
Measures populate the cells of a logical cube with the facts collected about business
operations. Measures are organized by dimensions, which typically include a Time
dimension. Measures are static and consistent while analysts are using them to inform their
decisions. They are updated in a batch window at regular intervals: weekly, daily, or
periodically throughout the day. Many applications refresh their data by adding periods to the
time dimension of a measure, and may also roll off an equal number of the oldest time
periods. Each update provides a fixed historical record of a particular business activity for
that interval. Other applications do a full rebuild of their data rather than performing
incremental updates.
• Dimensions:
Dimensions contain a set of unique values that identify and categorize data. They form
the edges of a logical cube, and thus of the measures within the cube. Because measures are
typically multidimensional, a single value in a measure must be qualified by a member of
each dimension to be meaningful. For example, the Sales measure has four dimensions:
Time, Customer, Product, and Channel. A particular Sales value (43,613.50) only has
meaning when it is qualified by a specific time period (Feb-01), a customer (Warren
Systems), a product (Portable PCs), and a channel (Catalog).
• Dimension Attributes:
Attributes provide additional descriptive information about the members of a dimension, such as human-readable names or labels, and are used for display, selection, and grouping.
• Levels:
Level represents a position in the hierarchy. Each level above the base (or most detailed)
level contains aggregate values for the levels below it. The members at different levels have a
one-to-many parent-child relation. For example, Q1-02 and Q2-02 are the children of 2002,
thus 2002 is the parent of Q1-02 and Q2-02.
• Hierarchies:
A hierarchy is a way to organize data at different levels of aggregation. Analysts use dimension hierarchies to recognize trends at one level, drill down to lower levels to identify reasons for these trends, and roll up to higher levels to see what effect these trends have on a larger sector of the business.
• Star Schema
The star schema is a modelling paradigm in which the data warehouse contains (1) a large central table (fact table), and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
An example of a star schema for All Electronics sales is shown in Figure 2.1.2. Sales are considered along four dimensions, namely time, item, branch, and location. The schema contains a central fact table for sales which contains keys to each of the four dimensions, along with two measures: dollars sold and units sold. Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes.
For example, the location dimension table contains the attribute set {location key, street, city, province or state, country}. This constraint may introduce some redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. Entries for such cities in the location dimension table will create redundancy among the attributes province or state and country, i.e., (..., Vancouver, British Columbia, Canada) and (..., Victoria, British Columbia, Canada). Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order).
Figure 2.1.2: Star schema of a data warehouse for sales.
The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake. The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form. Such tables are easy to maintain and save storage space, because a dimension table can become extremely large when the dimensional structure is included as columns.
Since much of this space is redundant data, creating a normalized structure will reduce the overall space requirement. However, the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query. Consequently, system performance may be adversely impacted. Performance benchmarking can be used to determine what is best for your design.
An example of a snowflake schema for All Electronics sales is given in Figure 2.1.3. Here, the sales fact table is identical to that of the star schema in Figure 2.1.2. The main difference between the two schemas is in the definition of dimension tables. The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example, the item dimension table now contains the attributes supplier key, type, brand, item name, and item key, where supplier key is linked to the supplier dimension table, containing supplier type and supplier key information. Similarly, the single dimension table for location in the star schema can be normalized into two tables: new location and city. The location key of the new location table now links to the city dimension.
Notice that further normalization can be performed on province or state and country in the snowflake schema shown in Figure 2.1.3 when desirable. A compromise between the star schema and the snowflake schema is to adopt a mixed schema where only the very large dimension tables are normalized. Normalizing large dimension tables saves storage space, while keeping small dimension tables unnormalized may reduce the cost and performance degradation due to joins on multiple dimension tables. Doing both may lead to an overall performance gain. However, careful performance tuning could be required to determine which dimension tables should be normalized and split into multiple tables.
3 Fact Constellations:
Sophisticated applications may require multiple fact tables to share dimension tables. This
kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or
a fact constellation.
An example of a fact constellation schema is shown in Figure 2.1.4. This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema (Figure 2.1.2). The shipping table has five dimensions, or keys: time key, item key, shipper key, from location, and to location, and two measures: dollars cost and units shipped. A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimension tables for time, item, and location are shared between both the sales and shipping fact tables.
2.2.1. Introduction
Figure 2.1.4: Fact constellation schema of a data warehouse for sales and shipping.
A data cube can also be described as the multidimensional extensions of two dimensional
tables. It can be viewed as a collection of identical 2-D tables stacked upon one another. Data
cubes are used to represent data that is too complex to be described by a table of columns and
rows. As such, data cubes can go far beyond 3-D to include many more dimensions.
Definition
A data cube is a three-dimensional (or higher) range of values that is generally used to explain the time sequence of an image's data. It is a data abstraction used to evaluate aggregated data from a variety of viewpoints. It is also useful for imaging spectroscopy, as a spectrally-resolved image is depicted as a 3-D volume.
A data cube is generally used to easily interpret data. It is especially useful when representing data together with dimensions as certain measures of business requirements. Every dimension of a cube represents a certain characteristic of the database, for example daily, monthly or yearly sales. As Figure 2.1.5 shows, the data included inside a data cube makes it possible to analyze almost all the figures for virtually any or all customers, sales agents, products, and much more. Thus, a data cube can help to establish trends and analyze performance.
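As a small illustration (not from the text), a data cube over product, location and quarter can be approximated in memory with a pandas pivot table; the records below are invented.

```python
# A minimal sketch with toy records: approximate a small data cube of total sales
# by product, location and quarter using a pandas pivot table.
import pandas as pd

sales = pd.DataFrame({
    "product":  ["PC", "PC", "Phone", "Phone", "PC", "Phone"],
    "location": ["Delhi", "Mumbai", "Delhi", "Mumbai", "Delhi", "Delhi"],
    "quarter":  ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "amount":   [1200, 900, 450, 520, 1300, 610],
})

cube = sales.pivot_table(values="amount",
                         index=["product", "location"],
                         columns="quarter",
                         aggfunc="sum", fill_value=0)
print(cube)  # each cell aggregates one (product, location, quarter) combination
```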
1. Multidimensional OLAP (MOLAP)
Most OLAP products are developed based on a structure where the cube is patterned as a multidimensional array. These multidimensional OLAP (MOLAP) products usually offer improved performance when compared to other approaches, mainly because they can index directly into the structure of the data cube to gather subsets of data. When the number of dimensions is greater, the cube becomes sparser. That means that several cells that represent particular attribute combinations will not contain any aggregated data.
This in turn boosts the storage requirements, which may reach undesirable levels at times,
making the MOLAP solution untenable for huge data sets with many dimensions.
Compression techniques might help; however, their use can damage the natural indexing of
MOLAP. Figure 2.2.2 shows multidimensional model.
Figure 2.2.2: Multidimensional model
2. Relational OLAP (ROLAP)
Relational OLAP makes use of the relational database model. The ROLAP data cube is implemented as a collection of relational tables (approximately twice as many as the number of dimensions) instead of a multidimensional array. Each of these tables, known as a cuboid, signifies a particular view. Figure 2.2.3 shows the relational model.
The term data cube is applied in contexts where these arrays are massively larger than the
hosting computer's main memory; examples include multi-terabyte/petabyte data warehouses
and time series of image data. The data cube is used to represent data (sometimes called
facts) along some measure of interest.
A cube data source is a data source in which hierarchies and aggregations have been
created by the cube's designer in advance.
Cubes are very powerful and can return information very quickly, often much more
quickly than a relational data source. However, the reason for a cube's speed is that all its
aggregations and hierarchies are pre-built. These definitions remain static until the cube is
rebuilt. Thus, cube data sources are not as flexible as relational data sources if the types of
questions you need to ask were not anticipated by the original designer, or if they change
after the cube was built.
Examples of cube data sources include:
• Oracle Essbase
• Teradata OLAP
When working with a cube data source, you can create calculated members using MDX
formulas instead of creating Tableau formulas. MDX, which stands for Multidimensional
Expressions, is a query language for OLAP databases. With MDX calculated members, you
can create more complex calculations and reference both measures and dimensions.
A calculated member can be either a calculated measure, which is a new field in the data
source just like a calculated field, or a calculated dimension member, which is a new member
within an existing hierarchy.
For example, if a dimension Product has three members (Soda, Coffee, and Crackers), you can define a new calculated member Beverages that sums the Soda and Coffee members. When you then place the Products dimension on the Rows shelf it displays four rows: Soda, Coffee, Crackers, and Beverages.
You can define a calculated dimension member by selecting Calculated Members from
the Data pane menu. In the Calculated Members dialog box that opens, you can create, delete,
and edit calculated members as shown Figure 2.2.4
2. Type a Name for the new calculated member in the Member Definition area of the dialog
box.
3. Specify the Parent member for the new calculated member. All Member is selected by
default. However, you can choose Selected Member to browse the hierarchy and select a
specific parent member.
Note: Specifying a parent member is not available if you are connected to Oracle Essbase.
4. Give the new member a solve order. Sometimes a single cell in your data source can be defined by more than one calculated member; the solve order determines which calculation takes precedence.
5. If you are connected to a Microsoft Analysis Services data source, the calculation editor contains a Run before SSAS check box. Choose this option to execute the Tableau calculation before any Microsoft Analysis Services calculations. For more details, see the documentation on connecting to Microsoft Analysis Services data sources.
6. Type or paste an MDX expression into the large white text box.
2.3.1. Introduction
The new member displays in the Data pane either in the Measures area, if you chose
[Measures] as the parent member, or in the Dimensions area under the specified parent
member. You can use the new member just like any other field in the view.
Data cube aggregation is any process in which information is gathered and expressed in a
summary form, for purposes such as statistical analysis. A common aggregation purpose is to
get more information about particular groups based on specific variables such as age,
profession, or income. The information about such groups can then be used for Web site
personalization to choose content and advertising likely to appeal to an individual belonging
to one or more groups for which data has been collected. For example, a site that sells music
CDs might advertise certain CDs based on the age of the user and the data aggregate for their
age group. Online analytic processing (OLAP) is a simple type of data aggregation in which
the marketer uses an online reporting mechanism to process the information.
Data cube aggregation can be user-based: personal data aggregation services offer the
user a single point for collection of their personal information from other Web sites. The
customer uses a single master personal identification number (PIN) to give them access to
their various accounts (such as those for financial institutions, airlines, book and music clubs,
and so on). Performing this type of data aggregation is sometimes referred to as "screen
scraping."
• Definition
Dimensional data modelling comprises one or more dimension tables and fact tables. Good examples of dimensions are location, product, time, promotion, organization, etc.
The simplicity of a dimensional model is inherent because it defines objects that represent
real-world business entities. Analysts know which business measures they are interested in
examining, which dimensions and attributes make the data meaningful, and how the
dimensions of their business are organized into levels and hierarchies.
✓ Measures. Measures store quantifiable business data (such as sales, expenses, and
inventory). Measures are also called "facts". Measures are organized by one or more
dimensions and may be stored or calculated at query time.
✓ Stored Measures. Stored measures are loaded and stored at the leaf level. Commonly,
there is also a percentage of summary data that is stored. Summary data that is not stored is
dynamically aggregated when queried.
✓ Calculated Measures. Calculated measures are measures whose values are calculated
dynamically at query time. Only the calculation rules are stored in the database. Common
calculations include measures such as ratios, differences, totals and moving averages.
Calculations do not require disk storage space, and they do not extend the processing time
required for data maintenance.
• Fact
Facts are the measurements/metrics from your business process. For a Sales business process, a measurement would be the quarterly sales number.
• Dimension
Dimensions provide the context surrounding a business process event. In simple terms, they give the who, what, and where of a fact. In the Sales business process, for the fact quarterly sales number, the dimensions would be:
✓ Where – Location
• Attributes
The Attributes are the various characteristics of the dimension. In the Location
dimension, the attributes can be
✓ State
✓ Country
Attributes are used to search, filter, or classify facts. Dimension tables contain attributes.
• Fact Table
A fact table is the primary table in a dimensional model. It contains:
✓ Measurements/facts
✓ Foreign keys to dimension tables
A dimension table contains the dimensions of a fact. Dimension tables are joined to the fact table via a foreign key. Dimension tables are de-normalized tables. The dimension attributes are the various columns in a dimension table.
A dimension offers descriptive characteristics of the facts with the help of its attributes. There is no set limit on the number of dimensions. A dimension can also contain one or more hierarchical relationships.
The accuracy in creating your dimensional model determines the success of your data warehouse implementation. Here are the steps to create a dimension model:
1. Identify the Business Process
2. Identify the Grain (level of detail)
3. Identify Dimensions
4. Identify Facts
5. Build the Schema
Identify the actual business process that the data warehouse should cover. This could be Marketing, Sales, HR, etc., as per the data analysis needs of the organization. The selection of the business process also depends on the quality of data available for that process. It is the most important step of the data modeling process, and a failure here would have cascading and irreparable defects.
To describe the business process, you can use plain text or use basic Business Process Modelling Notation (BPMN) or Unified Modeling Language (UML).
The Grain describes the level of detail for the business problem/solution. It is the process of identifying the lowest level of information for any table in your data warehouse. If a table contains sales data for every day, then it has daily granularity. If a table contains total sales data for each month, then it has monthly granularity.
• Example of Grain
✓ The CEO at an MNC wants to find the sales for specific products in different locations on a daily basis.
✓ So, the grain is "product sale information by location by the day."
Figure 2.3.3: Step of the Dimension Model
Dimensions are nouns like date, store, inventory, etc. These dimensions are where all the
data should be stored. For example, the date dimension may contain data like a year, month
and weekday.
• Example of Dimensions:
The CEO at an MNC wants to find the sales for specific products in different locations on
a daily basis.
✓ Attributes: For Product: Product key (Foreign Key), Name, Type, Specifications
This step is co-associated with the business users of the system because this is where they
get access to data stored in the data warehouse. Most of the fact table rows are numerical
values like price or cost per unit, etc.
• Example of Facts:
The CEO at an MNC wants to find the sales for specific products in different
locations on a daily basis. The fact here is Sum of Sales by product by location by time.
5. Build Schema
In this step, you implement the Dimension Model. A schema is nothing but the database
structure (arrangement of tables). There are two popular schemas
• Star Schema
The star schema architecture is easy to design. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are dimension tables. The fact table in a star schema is in third normal form, whereas the dimension tables are de-normalized.
• Snowflake Schema
The snowflake schema is an extension of the star schema. In a snowflake schema, each dimension is normalized and connected to additional dimension tables.
Dimensional Data Modeling is one of the data modeling techniques used in data
warehouse design.
✓ Build dimensional models around business processes.
✓ Ensure that every fact table has an associated date dimension table.
✓ Ensure that all facts in a single fact table are at the same grain or level of detail.
✓ It's essential to store report labels and filter domain values in dimension tables.
✓ It allows introducing an entirely new dimension without major disruptions to the fact table.
✓ Dimensional models also store data in such a fashion that it is easier to retrieve the information once the data is stored in the database.
✓ The dimensional model is easily understood by the business. This model is based on business terms, so the business knows what each fact, dimension, or attribute means.
✓ Dimensional models are de-normalized and optimized for fast data querying. Many relational database platforms recognize this model and optimize query execution plans to aid in performance.
✓ The dimensional model also helps to boost query performance. It is more denormalized and therefore optimized for querying.
Query performance. Dimensional models are more denormalized and optimized for data
querying, while normalized models seek to eliminate data redundancies and are optimized for
transaction loading and updating. The predictable framework of a dimensional model allows
the database to make strong assumptions about the data which may have a positive impact on
performance. Each dimension is an equivalent entry point into the fact table, and this
symmetrical structure allows effective handling of complex queries. Query optimization for
star-joined databases is simple, predictable, and controllable.
Extensibility. Dimensional models are scalable and easily accommodate unexpected new
data. Existing tables can be changed in place either by simply adding new data rows into the
table or executing SQL alter table commands. No queries or applications that sit on top of the
data warehouse need to be reprogrammed to accommodate changes. Old queries and
applications continue to run without yielding different results. But in normalized models each
modification should be considered carefully, because of the complex dependencies between
database tables.
Dimensional modeling gets its name from the business dimensions we need to
incorporate into the logical data model. It is a logical design technique to structure the
business dimensions and the metrics that are analyzed along these dimensions.
This modeling technique is intuitive for that purpose. The model has also proved to
provide high performance for queries and analysis. The multidimensional information
package diagram we have discussed is the foundation for the dimensional model.
Therefore, the dimensional model consists of the specific data structures needed to
represent the business dimensions. These data structures also contain the metrics or facts.
In Chapter 5, we discussed information package diagrams in sufficient detail. We
specifically looked at an information package diagram for automaker sales. Please go back
and review Figure 5-5 in that chapter. What do you see? In the bottom section of the diagram,
you observe the list of measurements or metrics that the automaker wants to use for analysis.
Next, look at the column headings.
These are the business dimensions along which the automaker wants to analyze the
measurements or metrics. Under each column heading you see the dimension hierarchies and
categories within that business dimension. What you see under each column heading are the
attributes relating to that business dimension.
Reviewing the information package diagram for automaker sales, we notice three
types of data entities: (1) measurements or metrics, (2) business dimensions, and (3)
attributes for each business dimension. So when we put together the dimensional model to
represent the information contained in the automaker sales information package, we need to
come up with data structures to represent these three types of data entities.
Let us discuss how we can do this. First, let us work with the measurements or
metrics seen at the bottom of the information package diagram. These are the facts for
analysis. In the automaker sales diagram, the facts are as follows:
Actual sale price, MSRP sale price, Options price, Full price, Dealer add-ons, Dealer credits, Dealer invoice, Amount of down payment, Manufacturer proceeds, Amount financed.
Each of these data items is a measurement or fact. Actual sale price is a fact about what
the actual price was for the sale. Full price is a fact about what the full price was relating to
the sale. As we review each of these factual items, we find that we can group all of these into
a single data structure. In relational database terminology, you may call the data structure a
relational table. So the metrics or facts from the information package diagram will form the
fact table. For the automaker sales analysis this fact table would be the automaker sales fact
table. Look at Figure 10-2 showing how the fact table is formed. The fact table gets its name
from the subject for analysis; in this case, it is automaker sales. Each fact item or
measurement goes into the fact table as an attribute for automaker sales. We have determined
one of the data structures to be included in the dimensional model for automaker sales and
derived the fact table from the information package diagram. Let us now move on to the other
sections of the information package diagram, taking the business dimensions one by one.
Look at the product business dimension in Figure 5-5. The product business dimension is
used when we want to analyze the facts by products. Sometimes our analysis could be a
breakdown by individual models. Another analysis could be at a higher level by product
lines. Yet another analysis could be at even a higher level by product categories. The list of
data items relating to the product dimension are as follows:
Model name, Model year, Package styling, Product line, Product category, Exterior colour, Interior colour, First model year.
What can we do with all these data items in our dimensional model? All of these relate to the product in some way. We can, therefore, group all of these
data items in one data structure or one relational table. We can call this table the product
dimension table. The data items in the above list would all be attributes in this table. Looking
further into the information package diagram, we note the other business dimensions shown
as column headings. In the case of the automaker sales information package diagram, these
other business dimensions are dealer, customer demographics, payment method, and time.
Just as we formed the product dimension table, we can form the remaining dimension tables
of dealer, customer demographics, payment method, and time. The data items shown within
each column would then be the attributes for each corresponding dimension table. Figure 10-
3 puts all of this together. It shows how the various dimension tables are formed from the
information package diagram. Look at the figure closely and see how each dimension table is
formed.

So far we have formed the fact table and the dimension tables. How should these tables be arranged in the dimensional model? What are the relationships and how should we mark the relationships in the model? The dimensional model should primarily facilitate queries and analyses. What would be the types of queries and analyses? These would be queries and analyses where the metrics inside the fact table are analyzed across one or more dimensions using the dimension table attributes.

Let us examine a typical query against the automaker sales data. How much sales proceeds did the Jeep Cherokee, Year 2000 Model with standard options, generate in January 2000 at Big Sam Auto dealership for buyers who own their homes and who took 3-year leases, financed by Daimler-Chrysler Financing? We are analyzing actual sale price, MSRP sale price, and full price. We are analyzing these facts along attributes in the various dimension tables. The attributes in the dimension tables act as constraints and filters.
Figure 2.3.4: Formation of the automaker dimension tables.
OLAP is a category of software that allows users to analyze information from multiple
database systems at the same time. It is a technology that enables analysts to extract and view
business data from different points of view. OLAP stands for Online Analytical Processing.
Analysts frequently need to group, aggregate and join data. These operations in relational
databases are resource intensive. With OLAP data can be pre-calculated and pre-aggregated,
making analysis faster.
OLAP databases are divided into one or more cubes. The cubes are designed in such a
way that creating and viewing reports become easy.
The OLAP cube is a data structure optimized for very quick data analysis (Figure 2.4.1). The OLAP cube consists of numeric facts, called measures, which are categorized by dimensions. The OLAP cube is also called a hypercube.
Figure 2.4.1: OLAP cube
Usually, data operations and analysis are performed using a simple spreadsheet, where data values are arranged in row and column format. This is ideal for two-dimensional data. However, OLAP contains multidimensional data, usually obtained from different and unrelated sources. Using a spreadsheet is not an optimal option. A cube can store and analyze multidimensional data in a logical and orderly manner.
A data warehouse would extract information from multiple data sources and formats like text files, Excel sheets, multimedia files, etc. The extracted data is cleaned and transformed, and then loaded into an OLAP server (or OLAP cube) where information is pre-calculated in advance for further analysis.
The basic analytical operations of OLAP are:
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot (rotate)
1. Roll-up
▪ In this example, the cities New Jersey and Los Angeles are rolled up into the country USA.
▪ The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively. They become 2000 after roll-up.
▪ In this aggregation process, the location hierarchy moves up from city to country.
▪ In the roll-up process at least one or more dimensions need to be removed. In this example, the Quarter dimension is removed.
2. Drill-down
In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up process. It can be done via:
▪ Moving down the concept hierarchy
▪ Increasing a dimension
▪ Quarter Q1 is drilled down to the months January, February, and March. The corresponding sales are also registered.
▪ In this example, the Month dimension is added.
3. Slice:
Here, one dimension is selected, and a new sub-cube is created. Figure 2.4.4 shows how the slice operation is performed:
Figure 2.4.4: Slice Operation
4. Dice
This operation is similar to a slice. The difference in dice is you select 2 or more
dimensions that result in the creation of a sub-cube.
Figure 2.4.5: Dice Operation
5. Pivot
In Pivot, you rotate the data axes to provide an alternative presentation of the data. In the accompanying figure, the pivot is based on item types.
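The four operations can be sketched on a tiny, invented fact table with pandas; the snippet below mirrors the roll-up example above (New Jersey and Los Angeles rolling up to USA) and shows slice, dice and pivot as simple filtering and pivoting steps.

```python
# A minimal sketch with invented figures: the four OLAP operations expressed as
# pandas group-by, filtering and pivoting on a tiny in-memory fact table.
import pandas as pd

facts = pd.DataFrame({
    "city":    ["New Jersey", "Los Angeles", "New Jersey", "Los Angeles"],
    "country": ["USA", "USA", "USA", "USA"],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "item":    ["Mobile", "Mobile", "Modem", "Modem"],
    "sales":   [440, 1560, 300, 700],
})

# Roll-up: climb the location hierarchy from city to country (440 + 1560 = 2000 for Q1).
print(facts.groupby(["country", "quarter"])["sales"].sum())

# Slice: fix one dimension (quarter = Q1) to obtain a sub-cube.
print(facts[facts["quarter"] == "Q1"])

# Dice: restrict two or more dimensions at once.
print(facts[(facts["quarter"] == "Q1") & (facts["item"] == "Mobile")])

# Pivot: rotate the axes so items become rows and quarters become columns.
print(facts.pivot_table(values="sales", index="item", columns="quarter", aggfunc="sum"))
```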
2.5. Meta Data
Metadata is simply defined as data about data. The data that is used to represent other
data is known as metadata.
For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is summarized data that leads us to detailed data. In terms of a data warehouse, we can define metadata as follows.
• Metadata acts as a directory. This directory helps the decision support system to locate the
contents of a data warehouse.
Metadata can be broadly categorized into three categories as shown in Figure 2.5.1:
• Business Metadata − It has the data ownership information, business definition, and
changing policies.
• Technical Metadata − It includes database system names, table and column names and
sizes, data types and allowed values. Technical metadata also includes structural information
such as primary and foreign key attributes and indices.
• Operational Metadata − It includes currency of data and data lineage. Currency of data
means whether the data is active, archived, or purged. Lineage of data means the history of
data migrated and transformation applied on it.
Metadata has a very important role in a data warehouse. The role of metadata in a
warehouse is different from the warehouse data. The various roles of metadata are explained
below (figure 2.5.2).
• Metadata acts as a directory.
• This directory helps the decision support system to locate the contents of the data
warehouse.
• Metadata helps in decision support system for mapping of data when data is transformed
from operational environment to data warehouse environment.
• Metadata helps in summarization between current detailed data and highly summarized
data.
• Metadata also helps in summarization between lightly detailed data and highly summarized
data.
2.5.3. Metadata Repository
Metadata repository is an integral part of a data warehouse system. It has the following
metadata;
• Definition of data warehouse − It includes the description of structure of data warehouse.
The description is defined by schema, view, hierarchies, derived data definitions, and data
mart locations and contents.
• Business metadata − It contains the data ownership information, business definition, and changing policies.
• Operational Metadata − It includes currency of data and data lineage. Currency of data
means whether the data is active, archived, or purged. Lineage of data means the history of
data migrated and transformation applied on it.
• Data for mapping from the operational environment to the data warehouse − It includes the source databases and their contents, data extraction, data partitioning, data cleaning, transformation rules, and data refresh and purging rules.
• Algorithms for summarization − It includes dimension algorithms, data on granularity,
aggregation, summarizing, etc.
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, you keep information about the logical data structures, the files and addresses, the indexes, and so on. The data dictionary contains data about the data in the database.
Think of metadata as the Yellow Pages of your town. In almost the same manner, the metadata component serves as a directory of the contents of your data warehouse.
• Why is metadata especially important in a data warehouse?
1. First, it acts as the glue that connects all parts of the data warehouse.
2. Next, it provides information about the contents and structures to the developers.
3. Finally, it opens the door to the end-users and makes the contents recognizable in their own terms.
2.6. Types of Metadata
1. Operational Metadata
2. Extraction and Transformation Metadata
3. End-User Metadata
Data for the data warehouse comes from several operational systems of the enterprise.
These source systems contain different data structures. The data elements selected for the
data warehouse have various field lengths and data types. In selecting data from the source
systems for the data warehouse, you split records, combine parts of records from different
source files, and deal with multiple coding schemes and field lengths. When you deliver
information to the end-users, you must be able to tie that back to the original source data sets.
Operational metadata contain all of this information about the operational data sources.
Extraction and transformation metadata contain data about the extraction of data from the
source systems, namely, the extraction frequencies, extraction methods, and business rules
for the data extraction. Also, this category of metadata contains information about all the data
transformations that take place in the data staging area.
The end-user metadata is the navigational map of the data warehouse. It enables the end-users to find information from the data warehouse. The end-user metadata allows the end-users to use their own business terminology and look for information in those ways in which they normally think of the business.
There are only three main types of metadata, but it's important to understand each type and how they function to make your assets more easily discoverable. The three types are described below.
a) Structural Metadata
Let's start with the basics. Structural metadata is data that indicates how a digital asset is
organized, such as how pages in a book are organized to form chapters, or the notes that
make up a notebook in Evernote or OneNote. Structural metadata also indicates whether a
particular asset is part of a single collection or multiple collections and facilitates the
navigation and presentation of information in an electronic resource. Examples include:
• Page numbers
• Sections
• Chapters
• Indexes
• Table of contents
Beyond basic organization, structural metadata is the key to documenting the relationship
between two assets. For example, it’s used to indicate that a specific stock photo was used in
a particular sales brochure, or that one asset is a raw, unedited version of another.
b) Administrative Metadata
Administrative metadata relates to the technical source of a digital asset. It includes data
such as the file type, as well as when and how the asset was created. This is also the type of
metadata that relates to usage rights and intellectual property, providing information such as
the owner of an asset, where and how it can be used, and the duration a digital asset can be
used for those allowable purposes under the current license.
• Technical Metadata – Information necessary for decoding and rendering files
c) Descriptive Metadata
Descriptive metadata is essential for discovering and identifying assets. Why? It’s
information that describes the asset, such as the asset’s title, author, and relevant keywords.
Descriptive metadata is what allows you to locate a book in a particular genre published after
2016, for instance, as a book’s metadata would include both genre and publication date. In
fact, the ISBN system is a good example of an early effort to use metadata to centralize
information and make it easier to locate resources (in this case, books in a traditional library).
Essentially, descriptive metadata includes any information describing the asset that can be used for later identification and discovery.
Descriptive metadata can be the most robust of all the types of metadata, simply because
there are many ways to describe an asset. When implementing a DAM solution, standardizing
the specific attributes used to describe your assets and how they’re documented is the key to
streamlined discoverability.
Unit- III
Real-world data tend to be incomplete and noisy. Data cleaning routines fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Imagine that you need to analyze All Electronics sales and customer data. Note that many tuples have no recorded values for several attributes, such as customer income. To fill in the missing values for this attribute, consider the following methods.
(i) Ignore the tuple: This is usually done when the class label is missing. This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
(ii) Fill in the missing value manually: This approach is time consuming and may not be feasible given a large data set with many missing values.
(iii) Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "unknown" or −∞. If missing values are replaced by, say, "unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "unknown".
(iv) Use the attribute mean to fill in the missing value: For example, suppose that the average income of All Electronics customers is $56,000. Use this value to replace the missing value for income.
(v) Use the attribute mean for all samples belonging to the same class as the given tuple: If classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
(vi) Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
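Methods (iv) and (v) can be sketched with pandas on fabricated customer records; the column names and values below are assumptions for illustration only.

```python
# A minimal sketch with fabricated customer records: method (iv) fills missing
# income with the overall mean; method (v) uses the mean of the same credit-risk class.
import pandas as pd

customers = pd.DataFrame({
    "income":      [56000, None, 48000, None, 75000],
    "credit_risk": ["low", "low", "high", "high", "low"],
})

# (iv) Replace missing values with the attribute mean.
overall = customers["income"].fillna(customers["income"].mean())

# (v) Replace missing values with the mean income of the same credit-risk class.
by_class = customers.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(overall.tolist())
print(by_class.tolist())
```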
1. Binning:
Binning methods smooth a sorted data value by consulting its 'neighbourhood', that is, the values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighbourhood of values, they perform local smoothing.
The data for price are first sorted and then partitioned into equal-frequency bins of size 3. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.
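A minimal sketch of these smoothing variants on illustrative price values partitioned into equal-frequency bins of size 3; for the first bin {4, 8, 15}, smoothing by bin means yields 9, 9, 9 and smoothing by bin boundaries yields 4, 4, 15.

```python
# A minimal sketch on illustrative price values: equal-frequency bins of size 3,
# then smoothing by bin means, bin medians and bin boundaries.
import statistics

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value in a bin becomes the bin mean.
by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

# Smoothing by bin medians: every value in a bin becomes the bin median.
by_medians = [[statistics.median(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of the bin min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)    # first bin becomes [9.0, 9.0, 9.0]
print(by_medians)  # first bin becomes [8, 8, 8]
print(by_bounds)   # first bin becomes [4, 4, 15]
```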
2. Regression:
Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the best line to fit two attributes, so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression where more than two attributes are involved and the data are fit to a multidimensional surface.
Figure: A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster centroid is marked with a "+", representing the average point in space for that cluster. Outliers may be detected as values that fall outside of the set of clusters.
3. Clustering:
Outliers may be detected by clustering, where similar values are organized into groups, or clusters. Intuitively, values that fall outside of the set of clusters may be considered outliers.
There may also be inconsistencies in the data, for example due to errors in data entry or data integration. So how can we proceed with discrepancy detection? As a starting point, use any knowledge you may already have regarding properties of the data. Such knowledge, or "data about data", is referred to as metadata.
As a data analyst, you should be on the lookout for inconsistent use of codes and inconsistent data representations, such as "2004/12/25" and "25/12/2004" for dates. Field overloading is another source of errors that typically results when developers squeeze new attribute definitions into unused portions of already-defined attributes.
The data should also be examined regarding unique rules, consecutive rules, and null rules. A unique rule says that each value of the given attribute must be different from all other values for that attribute.
A consecutive rule says that there can be no missing values between the lowest and highest values for the attribute and that all values must also be unique. A null rule specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition, and how such values should be handled.
Reasons for missing values may include: (i) the person originally asked to provide a value for the attribute refuses, or finds that the information requested is not applicable; (ii) the data entry person does not know the correct value.
The null rule should specify how to record the null condition, for example, to store zero for numerical attributes, a blank for character attributes, or any other convention that may be in use.
There are a number of commercial tools that can aid in the step of discrepancy detection. Data scrubbing tools use simple domain knowledge to detect errors and make corrections in the data; these tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources.
Data auditing tools find discrepancies by analyzing the data to discover rules and
relationships, and detecting data that violate such conditions. They are variants of data
mining tools.
Commercial tools can assist in the data transformation step. Data migration tools allow simple transformations to be specified, such as to replace the string "gender" by "sex". ETL (extraction/transformation/loading) tools allow users to specify transformations through a graphical user interface.
Data integration is the merging of data from multiple data sources. The data may also need to be transformed into forms appropriate for mining. The data analysis task will often involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files.
Schema integration and object matching can be tricky: how can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem.
Example: how can the data analyst or the computer be sure that customer-id in one database and customer-number in another refer to the same attribute?
Metadata for each attribute includes the name, meaning, data type, and range of values permitted for the attribute, as well as null rules for handling blank, zero, or null values.
A second important issue in data integration is redundancy: an attribute may be redundant if it can be derived from another attribute or set of attributes. Some redundancies can be detected by correlation analysis. For numerical attributes, the correlation between two attributes, A and B, can be evaluated by computing the correlation coefficient

r_{A,B} = \frac{\sum_{i=1}^{N}(a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B} = \frac{\sum_{i=1}^{N}(a_i b_i) - N \bar{A} \bar{B}}{N \sigma_A \sigma_B}    (2.8)

where N is the number of tuples, a_i and b_i are the respective values of A and B in tuple i, \bar{A} and \bar{B} are the respective mean values of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum(a_i b_i) is the sum of the AB cross-product. Note that -1 ≤ r_{A,B} ≤ +1. If r_{A,B} is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase.
The higher the value, the stronger the correlation. Hence a high value may indicate that A (or B) may be removed as a redundancy. If the resulting value is equal to 0, then A and B are independent and there is no correlation between them.
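The following is a minimal sketch of how Eq. (2.8) might be computed for two numeric attributes; the sample values are illustrative, not taken from the text.

```python
# Minimal sketch: Pearson correlation coefficient r(A, B), computed as in Eq. (2.8).
# The attribute values below are illustrative.
def correlation(a, b):
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # population standard deviations
    std_a = (sum((x - mean_a) ** 2 for x in a) / n) ** 0.5
    std_b = (sum((y - mean_b) ** 2 for y in b) / n) ** 0.5
    cross = sum(x * y for x, y in zip(a, b))   # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / (n * std_a * std_b)

a = [2.0, 4.0, 6.0, 8.0, 10.0]
b = [1.1, 2.2, 2.9, 4.1, 5.0]
print(round(correlation(a, b), 3))   # close to +1: A and B are strongly positively correlated
```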
For nominal (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a chi-square test:

\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}    (2.9)
where o_ij is the observed frequency of the joint event (A_i, B_j) and e_ij is the expected frequency of (A_i, B_j), which can be computed as
e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}
where N is the number of data tuples, count(A = a_i) is the number of tuples having value a_i for A, and count(B = b_j) is the number of tuples having value b_j for B. The sum in Eq. (2.9) is computed over all of the r × c cells. Note that the cells that contribute the most to the χ² value are those whose actual count is very different from that expected.
Table 2.2 shows a 2×2 contingency table for an example asking whether gender and preferred reading are correlated. The χ² statistic tests the hypothesis that A and B are independent. The test is based on a significance level, with (r − 1)(c − 1) degrees of freedom.
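Below is a minimal sketch of the χ² computation on a hypothetical 2×2 contingency table (the counts are illustrative, not those of Table 2.2):

```python
# Sketch of the chi-square statistic (Eq. 2.9) for two nominal attributes,
# using an illustrative 2x2 table of observed counts
# (rows = gender, columns = preferred reading).
observed = [[250, 200],     # male:   fiction, non-fiction
            [50, 1000]]     # female: fiction, non-fiction

n = sum(sum(row) for row in observed)              # total number of tuples
row_totals = [sum(row) for row in observed]
col_totals = [sum(observed[i][j] for i in range(len(observed)))
              for j in range(len(observed[0]))]

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o_ij in enumerate(row):
        e_ij = row_totals[i] * col_totals[j] / n   # expected frequency
        chi2 += (o_ij - e_ij) ** 2 / e_ij

print(round(chi2, 1))   # a large value suggests the two attributes are correlated
```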
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:
Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
Aggregation:
Ex: the daily sales data may be aggregated so as to compute monthly and annual total amounts.
Generalization of the data, where low-level or "primitive" data are replaced by higher-level concepts through the use of concept hierarchies. For example, a categorical attribute such as street can be generalized to a higher-level concept like city or country.
Normalization, where the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
Attribute construction, where new attributes are constructed and added from the given set of attributes to help the mining process.
Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A.
Min-max normalization maps a value, v, of A to v' in the range [new_min_A, new_max_A] by computing

v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.
Example: z-score normalization. Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225.
V’ = V/ 10 j
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
Mining on the reduced data set should be more efficient, yet produce the same analytical results. Strategies for data reduction include the following:
(i) Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
(ii) Attribute subset selection, where irrelevant or redundant attributes or dimensions are detected and removed.
(iii) Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
(iv) Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations, such as parametric models, or nonparametric methods such as clustering, sampling, and the use of histograms.
(v) Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies.
Imagine that you have collected the data for your analysis. These data consist of the AllElectronics sales per quarter for the years 2002 to 2004. You are, however, interested in the annual sales rather than the total per quarter. Thus the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
Each cell of a data cube holds an aggregate data value, corresponding to a data point in multidimensional space. The cube created at the lowest level of abstraction is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest, such as sales or customer.
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes. The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the classes is as close as possible to the original distribution obtained using all attributes.
Greedy (heuristic) methods for attribute subset selection include the following (a sketch of forward selection is given after this list):
(i) Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set; at each subsequent step, the best of the remaining attributes is added to the set.
(ii) Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
(iii) Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined, so that at each step the procedure selects the best attribute and removes the worst from among the remaining attributes.
(iv) Decision tree induction: Decision tree algorithms, such as ID3, C4.5, and CART, were originally intended for classification. Decision tree induction constructs a flowchart-like structure; attributes that do not appear in the tree are assumed to be irrelevant.
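The sketch below illustrates stepwise forward selection under the assumption of a hypothetical scoring function `score` that evaluates an attribute subset (for example, the accuracy of a classifier trained on it); the attribute names and the toy scorer are illustrative.

```python
# Minimal sketch of stepwise forward selection with a hypothetical scoring function.
def forward_selection(attributes, score, max_attrs=None):
    selected = []                      # start with an empty reduced set
    remaining = list(attributes)
    while remaining and (max_attrs is None or len(selected) < max_attrs):
        # pick the attribute that improves the subset score the most
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break                      # no attribute improves the subset any further
        selected.append(best)
        remaining.remove(best)
    return selected

# toy scoring function: prefers 'age' and 'income', ignores the rest
toy_score = lambda subset: len({'age', 'income'} & set(subset))
print(forward_selection(['age', 'income', 'student', 'credit'], toy_score))
```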
Dimensionality Reduction:
If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.
Wavelet Transformation:
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X', of wavelet coefficients.
The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes.
Popular wavelet families include the Haar and Daubechies wavelets; the number next to a wavelet name is the number of vanishing moments of the wavelet, a set of mathematical relationships that the coefficients must satisfy, related to the number of coefficients. The general procedure for applying the DWT is as follows:
(i) The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L ≥ n).
(ii) Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.
(iii) The two functions are applied to pairs of data points in X, that is, to all pairs of measurements (x_{2i}, x_{2i+1}). This results in two sets of data of length L/2.
(iv) The two functions are recursively applied to the sets of data obtained in the previous loop, until the resulting data sets obtained are of length 2.
(v) Selected values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data. A sketch of this pairwise averaging and differencing for the Haar wavelet is given below.
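The following is a minimal, unnormalized sketch of the Haar version of this procedure (simple averages and differences), assuming the input length is already a power of 2; the input vector is illustrative.

```python
# Sketch of the Haar DWT applied recursively: pairwise averages (smoothing)
# and pairwise differences (detail), repeated on the shorter smoothed data.
def haar_dwt(x):
    coeffs = []
    data = list(x)
    while len(data) > 1:
        smooth = [(data[2 * i] + data[2 * i + 1]) / 2 for i in range(len(data) // 2)]
        detail = [(data[2 * i] - data[2 * i + 1]) / 2 for i in range(len(data) // 2)]
        coeffs = detail + coeffs   # keep this level's detail coefficients
        data = smooth              # recurse on the smoothed (shorter) data
    return data + coeffs           # overall average followed by detail coefficients

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))   # [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
```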
Principal components analysis (PCA) searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n.
The basic procedure is as follows:
(i) The input data are normalized, so that each attribute falls within the same range. This step helps ensure that attributes with larger domains will not dominate attributes with smaller domains.
(ii) PCA computes k orthonormal vectors that provide a basis for the normalized input data. These are unit vectors that each point in a direction perpendicular to the others. These vectors are referred to as the principal components. The input data are a linear combination of the principal components.
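A minimal PCA sketch with NumPy is given below; it centers the data, derives orthonormal components from the covariance matrix, and projects onto the top k components. The small data matrix and the choice k = 1 are illustrative.

```python
# Minimal PCA sketch: normalize (center), compute orthonormal principal
# components, and project the data onto the top k of them.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

X_centered = X - X.mean(axis=0)              # step (i): center each attribute
cov = np.cov(X_centered, rowvar=False)       # covariance between attributes
eigvals, eigvecs = np.linalg.eigh(cov)       # step (ii): orthonormal eigenvectors

order = np.argsort(eigvals)[::-1]            # sort components by decreasing variance
components = eigvecs[:, order]

k = 1
X_reduced = X_centered @ components[:, :k]   # project onto the top k principal components
print(X_reduced.round(3))
```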
Log-linear models, which estimate discrete multidimensional probability distributions, are an example of a parametric method. Nonparametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
Regression and log-linear models can be used to approximate the given data. In linear regression, the data are modeled to fit a straight line:
y = w x + b,
where the variance of y is assumed to be constant. In the context of data mining, x and y are numerical database attributes. The coefficients, w and b, specify the slope of the line and the y-intercept, respectively.
Regression and log linear models can both be used on sparse data, although their
application may be limited. While both methods can handle skewed data, regression does
exceptionally well.
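As a small illustration, the line y = w·x + b can be fitted by least squares; the x and y values below are illustrative.

```python
# Minimal sketch of fitting y = w*x + b by least squares with NumPy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

w, b = np.polyfit(x, y, deg=1)      # slope and y-intercept of the best-fit line
print(round(w, 3), round(b, 3))     # w close to 2, b close to 0
print(round(w * 6.0 + b, 2))        # use the model to estimate y at x = 6
```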
Histograms:
Histograms use binning to approximate data distributions and are a popular form of data reduction. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
Example (histogram): The following data are a list of prices of commonly sold items at AllElectronics. The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 28, 28, 30, 30, 30.
To further reduce the data, it is common to have each bucket denote a continuous range of values for the given attribute.
Equal width: In an equal-width histogram, the width of each bucket range is uniform; for example, an equal-width histogram for price could aggregate values so that each bucket has a uniform width of $10.
Equal frequency: In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant.
V-Optimal: If we consider all of the possible histograms for a given number of buckets, the V-Optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where bucket weight is equal to the number of values in the bucket.
MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent values; bucket boundaries are established between the pairs having the largest differences.
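The sketch below contrasts equal-width and equal-frequency bucketing on the sorted price list above; the bucket ranges and the choice of three buckets are illustrative.

```python
# Sketch of equal-width vs. equal-frequency bucketing for the sorted price list.
prices = [1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,
          18,18,18,18,18,18,18,18,20,20,20,20,20,20,21,21,21,21,25,25,
          25,25,28,28,30,30,30]

# equal-width histogram: each bucket covers a $10-wide range (1-10, 11-20, 21-30)
width = 10
equal_width = {}
for p in prices:
    low = (p - 1) // width * width + 1
    key = (low, low + width - 1)
    equal_width[key] = equal_width.get(key, 0) + 1
print(equal_width)

# equal-frequency histogram: each bucket holds roughly the same number of values
buckets = 3
size = len(prices) // buckets
equal_freq = [prices[i * size:(i + 1) * size] for i in range(buckets - 1)]
equal_freq.append(prices[(buckets - 1) * size:])   # last bucket takes the remainder
print([(b[0], b[-1], len(b)) for b in equal_freq])
```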
Clustering:
Clustering techniques consider data tuples as objects. They partition the objects into groups, or clusters, so that objects within a cluster are "similar" to one another and dissimilar to objects in other clusters.
Similarity is commonly defined in terms of how "close" the objects are in space, based on a distance function. The quality of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster.
In data reduction, the cluster representations of the data are used to replace the actual data. The effectiveness of this technique depends on the nature of the data. It is much more effective for data that can be organized into distinct clusters than for smeared data.
Sampling:
Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample of the data. Suppose that a large data set, D, contains N tuples. The most common ways to sample D for data reduction are as follows.
Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again.
Cluster sample:
If the tuples in D are grouped into M mutually disjoint "clusters," then an SRS of s clusters can be obtained, where s < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster.
Stratified sample:
If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum.
An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample, s, as opposed to N, the data set size. Sampling complexity is therefore potentially sublinear to the size of the data, whereas other data reduction techniques can require at least one complete pass through D.
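The sketch below illustrates the sampling schemes just described using Python's `random` module; the data set of 100 tuple ids, the page size, and the two strata are all illustrative assumptions.

```python
# Sketch of SRSWOR, SRSWR, cluster, and stratified sampling on an
# illustrative data set of tuple ids.
import random

D = list(range(1, 101))   # a data set of N = 100 "tuples"
s = 10

srswor = random.sample(D, s)                      # without replacement
srswr = [random.choice(D) for _ in range(s)]      # with replacement (tuples may repeat)

# cluster sample: treat each "page" of 20 tuples as a cluster and draw 2 clusters
pages = [D[i:i + 20] for i in range(0, len(D), 20)]
cluster_sample = [t for page in random.sample(pages, 2) for t in page]

# stratified sample: draw proportionally from two hypothetical strata
strata = {"young": D[:70], "senior": D[70:]}
stratified = [t for group in strata.values()
              for t in random.sample(group, max(1, s * len(group) // len(D)))]

print(len(srswor), len(srswr), len(cluster_sample), len(stratified))
```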
A DMQL can provide the ability to support ad hoc and interactive data mining by providing a standardized language like SQL.
* To achieve an effect similar to that which SQL has on relational databases.
* in relevance to att_or_dim_list
* order by order_list
* group by grouping_list
* having condition
Example:
This example shows how to use DMQL to specify the task-relevant data, the
mining of associations between items frequently purchased at AB Company by Sri Lankan
customers, with respect to customer income and age. In addition, the user specifies that the
data are to be grouped by date. The data are retrieved from a relational database.
Characterization:
This specifies that characteristic descriptions are to be mined. The analyze clause, when used for characterization, specifies aggregate measures, such as count, sum, or count% (percentage count, i.e., the percentage of tuples in the relevant data set with the specified characteristics). These measures are to be computed for each data characteristic found.
* A user can indicate which concept hierarchy is to be used with the statement
use hierarchy (hierarchy_name) for (attribute_or_dimension)
Otherwise, a default hierarchy per attribute or dimension is used.
Example of a set-grouping hierarchy:
level2: {40, ..., 59} < level1: middle_aged
The user can help control the number of uninteresting patterns returned by the data mining system by specifying measures of pattern interestingness and their corresponding thresholds. Interestingness measures and thresholds can be specified by the user with the statement
with (interest_measure_name) threshold = (threshold_value)
Example:
How can users specify the forms of presentation and visualization to be used in displaying the discovered patterns? Patterns may be displayed in one or more forms, including rules, tables, crosstabs, pie or bar charts, decision trees, cubes, curves, or surfaces. We define the DMQL display statement for this purpose:
display as (result_form)
Generalization can be performed by rolling up the concept hierarchy of an attribute or dimension (replacing lower-level concept values by higher-level values). Dropping attributes or dimensions can also perform generalization.
The user can alternately view the patterns at different levels of abstraction with the use of the following DMQL syntax:
| add (attribute_or_dimension)
| drop (attribute_or_dimension)
In the above discussion, we presented DMQL syntax for specifying data mining queries in terms of the five data mining primitives. For a given query, these primitives define the task-relevant data, the kind of knowledge to be mined, the concept hierarchies and interestingness measures to be used, and the representation forms for pattern visualization. Here we put these components together. Let's look at an example of the full specification of a DMQL query:
in relevance to C.age, I.type, I.place_made
Unit- III
Data cleaning
Data integration
Data transformation
Data reduction
Binning
Clustering
Regression
Smoothing
Aggregation
Generalization
Normalization
Attribute construction
Z-score normalization
Dimensional reduction
Numerosity reduction
Parametric:
Regression model
Non-parametric:
Sampling
Histogram
Clustering
Binning
Histogram analysis
Cluster analysis
Concept hierarchies
Interesting measures
Data cleaning means removing the inconsistent data or noise and collecting necessary
information of a collection of interrelated data.
Dimensionality reduction
Numerosity reduction
The design of an effective data mining query language requires a deep understanding of the power, limitations, and underlying mechanisms of the various kinds of data mining tasks.
13. List the five primitives for specifying a data mining task.
It is a process that abstracts a large set of task-relevant data in a database from relatively low conceptual levels to higher conceptual levels. There are two approaches for generalization: the data cube (OLAP) approach and the attribute-oriented induction approach.
Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
It specifies clauses and syntax for performing different types of data mining tasks, for example, data classification and mining of association rules. It also uses SQL-like syntax to mine databases.
A DMQL can provide the ability to support ad hoc and interactive data mining by providing a standardized language like SQL.
1. Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases.
Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Users or domain experts can set such thresholds.
Support(A ⇒ B) = P(A ∪ B)
Confidence(A ⇒ B) = P(B | A)
22. How are association rules mined from large databases?
i) Correlation analysis
ii) Mining max patterns
The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules; the name of the algorithm is based on the fact that it uses prior knowledge of frequent itemset properties.
For each frequent itemset l, generate all nonempty subsets of l. For every nonempty subset s of l, output the rule "s ⇒ (l − s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold. A sketch of this rule-generation step follows.
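The sketch below generates rules from a single frequent itemset under the assumption of a precomputed table of support counts; the itemset names and counts are illustrative.

```python
# Sketch of rule generation from a frequent itemset l: for every nonempty
# proper subset s of l, output s => (l - s) when
# support_count(l) / support_count(s) >= min_conf.
from itertools import combinations

support_counts = {
    frozenset(['I1']): 6, frozenset(['I2']): 7, frozenset(['I5']): 2,
    frozenset(['I1', 'I2']): 4, frozenset(['I1', 'I5']): 2,
    frozenset(['I2', 'I5']): 2, frozenset(['I1', 'I2', 'I5']): 2,
}

def generate_rules(l, min_conf):
    l = frozenset(l)
    for size in range(1, len(l)):
        for s in map(frozenset, combinations(l, size)):
            conf = support_counts[l] / support_counts[s]
            if conf >= min_conf:
                print(f"{set(s)} => {set(l - s)}  (confidence {conf:.0%})")

generate_rules(['I1', 'I2', 'I5'], min_conf=0.7)
```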
Transaction reduction
Partitioning
Sampling
27. What factors affect the performance of the Apriori candidate generation technique?
It needs to repeatedly scan the database and check a large set of candidates by pattern matching.
28. Describe the method of generating frequent itemsets without candidate generation.
Steps:
Compress the database representing frequent items into a frequent-pattern tree, or FP-tree. Divide the compressed database into a set of conditional databases and mine each conditional database separately.
Mining is performed under the guidance of various kinds of constraints provided by the user. The constraints include the following: knowledge type constraints, data constraints, dimension/level constraints, interestingness constraints, and rule constraints.
A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in a tree is the root node.
3. Smoothing by bin medians can be employed, in which each bin value is replaced by the bin ______________.
4. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the ___________.
(a) equal width (b) range (c) binning (d) bin boundaries
5. _________ is an extension of linear regression, where more than two attributes are
involved and the data are fit to a multidimensional surface.
6. A ____________ says that each value of the given attribute must be different from all
other values for that attribute.
(a) process (b) unique rule (c) consecutive (d) null value.
7. A ______ specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition and how such values should be handled.
Ans: (c) null value
8. Data _____________ tools use simple domain knowledge to detect errors and make
correction in the data.
9. Data _______ tools find discrepancies by analyzing the data to discover rules and
relationship and detecting data that violate such condition.
Ans: c) Knowledge
(a) Correlation analysis (b) Correlation coefficient (c) Entity identification (d)
Normalization
13. An important issue in data integration is the detection and resolution of _____________.
14. In ____________the data are transformed (or) consolidated into forms appropriate for
mining.
15. Normalization is where the attribute data are scaled so as to fall within a small specified range, such as ____________.
(a)-1.0 to 1.0 (b) 1.0 to -1.0 (c)-1.0 to 0.0 (d) 0.0 to -1.0
16. Data cube__________, where aggregation operations are applied to the data in the
construction of a data cube.
18. These methods are typically ____________ in that, while searching through attribute space, they always make what looks to be the best choice at the time.
Ans: (a) Greedy
19. A ___________ for an attribute, A, partitions the data distribution of A into disjoint subsets.
20. If we consider each child of a parent node as a bucket, then an index tree can be considered as a ____________.
Unit- IV
Classifications:
Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large.
Ex:
We can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict expenditures. Many classification methods have been proposed by researchers in machine learning, pattern recognition, and statistics, such as decision tree classifiers, Bayesian classifiers, Bayesian belief networks, and rule-based classifiers.
What is classification?
Classification works as follows:
1. In the first step, a classifier is built describing a predetermined set of data classes. This is the learning step, where a classification algorithm builds the classifier by analyzing, or "learning from," a training set made up of database tuples and their associated class labels.
The individual tuples making up the training set are referred to as training tuples and are selected from the database under analysis. Data tuples can also be referred to as samples, examples, instances, data points, or objects.
Because the class label of each training tuple is provided, this step is also known as supervised learning.
1. Training data are analyzed by a classification algorithm.
2. The class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules.
3. Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
In another example, training data are analyzed by a classification algorithm; here, the class label attribute is tenured, and the learned model or classifier is represented in the form of classification rules.
Because the class label of each training tuple is provided, this step is also known as
supervised learning.
Step:2
In the second step, the model is used for classification. First, the predictive accuracy of the classifier is estimated. A test set is used, made up of test tuples and their associated class labels. They are independent of the training tuples, meaning that they were not used to construct the classifier.
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node in a tree is the root node.
Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. Some decision tree algorithms produce only binary trees, whereas others can produce nonbinary trees.
Most algorithms for decision tree induction follow a top-down approach, which starts with a training set of tuples and their associated class labels.
We describe a basic algorithm for learning decision trees. During tree construction, attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes.
Algorithm:
Input: a data partition, D, which is a set of training tuples and their associated class labels; attribute_list, the set of candidate attributes; and Attribute_selection_method, a procedure to determine the splitting criterion.
Output:
A decision tree.
Method:
1. Create a node N;
2. If the tuples in D are all of the same class, C, then
3. Return N as a leaf node labeled with the class C;
4. If attribute_list is empty then
5. Return N as a leaf node labeled with the majority class in D; // majority voting
6. Apply Attribute_selection_method(D, attribute_list) to find the "best" splitting criterion;
7. Label node N with the splitting criterion;
8. If the splitting attribute is discrete-valued and multiway splits are allowed then // not restricted to binary trees
9. attribute_list ← attribute_list − splitting_attribute; // remove the splitting attribute
10. For each outcome j of the splitting criterion // partition the tuples and grow subtrees for each partition
11. Let Dj be the set of data tuples in D satisfying outcome j; // a partition
12. If Dj is empty then
13. Attach a leaf labeled with the majority class in D to node N;
14. Else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
15. Return N;
If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class.
The algorithm calls Attribute_selection_method to determine the splitting criterion. The splitting criterion tells us which attribute to test at node N by determining the "best" way to separate or partition the tuples in D into individual classes.
The splitting criterion also tells us which branches to grow from node N with respect to the outcomes of the chosen test.
The node N is labeled with the splitting criterion, which serves as a test at the node. A branch is grown from node N for each of the outcomes of the splitting criterion, and the tuples in D are partitioned accordingly (steps 10 to 11). There are three possible scenarios. Let A be the splitting attribute. A has v distinct values {a1, a2, ..., av} based on the training data.
1. A is discrete-valued:
In this case, the outcomes of the test at node N correspond directly to the known values of A. A branch is created for each known value, aj, of A and labeled with that value. Partition Dj is the subset of class-labeled tuples in D having value aj of A.
2. A is continuous-valued:
In this case, the test at node N has two possible outcomes, corresponding to the conditions A ≤ split_point and A > split_point, where split_point is the split point returned by Attribute_selection_method as part of the splitting criterion. The tuples are partitioned such that D1 holds the subset of class-labeled tuples in D for which A ≤ split_point, while D2 holds the rest.
3. A is discrete-valued and a binary tree must be produced:
The test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A, returned by Attribute_selection_method as part of the splitting criterion. It is a subset of the known values of A. Two branches are grown from N. By convention, the left branch out of N is labeled yes, so that D1 corresponds to the subset of class-labeled tuples in D that satisfy the test. The right branch out of N is labeled no, so that D2 corresponds to the subset of class-labeled tuples from D that do not satisfy the test.
The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting partition, Dj.
Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split. An attribute selection measure provides a ranking for each attribute describing the given training tuples. The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples.
If the splitting attribute is continuous-valued or if we are restricted to binary trees, then, respectively, either a split point or a splitting subset must also be determined as part of the splitting criterion.
The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly.
Information Gain:
Information gain is an attribute selection measure. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. The expected information needed to classify a tuple in D is given by
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
where p_i is the probability that an arbitrary tuple in D belongs to class Ci, estimated by |C_{i,D}|/|D|. A log function to the base 2 is used because the information is encoded in bits.
Info(D) is just the average amount of information needed to identify the class label of a tuple in D. Note that, at this point, the information we have is based solely on the proportions of tuples of each class.
Info(D) is also known as the entropy of D. The information needed (after using A to split D into v partitions) to classify D is given by
Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
The term |D_j|/|D| acts as the weight of the jth partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A. The smaller the expected information (still) required, the greater the purity of the partitions.
That is, Gain(A) = Info(D) − Info_A(D). The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.
Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940

Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

The term \frac{5}{14} I(2,3) means that "age <= 30" has 5 out of the 14 samples, with 2 yes's and 3 no's.
Hence, Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits. Similarly, the gains of the remaining attributes can be computed, and the attribute with the highest information gain is selected as the splitting attribute. A small sketch of this computation follows.
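The sketch below reproduces the entropy and information-gain arithmetic of the worked example (9 "yes" and 5 "no" tuples; age partitions of 5, 4, and 5 tuples with the class counts given in the text):

```python
# Sketch of the Info(D), Info_age(D), and Gain(age) computation.
from math import log2

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_d = info([9, 5])                                            # Info(D)
info_age = (5/14) * info([2, 3]) + (4/14) * info([4, 0]) + (5/14) * info([3, 2])
gain_age = info_d - info_age                                     # Gain(age)

print(f"{info_d:.3f} {info_age:.3f} {gain_age:.3f}")             # 0.940 0.694 0.246
```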
For a continuous-valued attribute, the midpoint between each pair of adjacent values is considered as a possible split point. The point with the minimum expected information requirement for A is selected as the split point for A.
Split point: D1 is the set of tuples in D satisfying A ≤ split_point, and D2 is the set of tuples in D satisfying A > split_point.
* The information gain measure is biased toward attributes with a large number of values. The gain ratio overcomes this bias by normalizing the gain with a "split information" value,
SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)
For example, for a partition of the 14 tuples into groups of 4, 6, and 4 tuples on income,
SplitInfo_{income}(D) = -\frac{4}{14}\log_2\left(\frac{4}{14}\right) - \frac{6}{14}\log_2\left(\frac{6}{14}\right) - \frac{4}{14}\log_2\left(\frac{4}{14}\right) = 1.557
The attribute with the maximum gain ratio is selected as the splitting attribute.
BAYES CLASSIFICATION:
Bayesian classification are statistical classifier. They can predict class membership
probabilities such as the probability that a given tuple belong to a partition class. Bayesian
classification is based on bayes theorem.
BAYES THEOREM:
Bayes' theorem is named after Thomas Bayes, a Nonconformist English clergyman who did early work in probability and decision theory during the 18th century.
Bayes theorem:
P(H|X) = \frac{P(X|H) P(H)}{P(X)}

Classification is to determine P(H|X), the posterior probability, that is, the probability that the hypothesis H holds given the observed data sample X.
P(X|H) (the likelihood) is the probability of observing the sample X, given that the hypothesis holds.
Eg: given that X will buy a computer, the probability that X is 31-40 years old with medium income.
The theorem can be viewed as posterior = likelihood × prior / evidence.
Predict that X belongs to class Ci if the probability P(Ci|X) is the highest among all P(Ck|X) for the k classes. A limitation is that it requires initial knowledge of many probabilities, involving significant computational cost.
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naive Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem,
P(Ci|X) = \frac{P(X|Ci) P(Ci)}{P(X)}
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). To reduce computation, the naive assumption of class-conditional independence is made, so that P(X|Ci) is estimated as the product of the individual attribute probabilities P(xk|Ci). A small sketch of this computation is given below.
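The sketch below scores P(Ci) multiplied by the product of P(xk|Ci), with the probabilities estimated from counts in a tiny illustrative training set (the attributes, values, and class labels are assumptions for the example; in practice, zero counts would be smoothed, e.g., with a Laplacian correction).

```python
# Minimal naive Bayes sketch for categorical attributes.
from collections import Counter, defaultdict

train = [
    ({"age": "youth", "income": "high"},    "no"),
    ({"age": "youth", "income": "medium"},  "no"),
    ({"age": "middle", "income": "high"},   "yes"),
    ({"age": "senior", "income": "medium"}, "yes"),
    ({"age": "senior", "income": "low"},    "yes"),
    ({"age": "middle", "income": "low"},    "yes"),
]

class_counts = Counter(label for _, label in train)
value_counts = defaultdict(Counter)          # (class, attribute) -> counts of values
for x, label in train:
    for attr, value in x.items():
        value_counts[(label, attr)][value] += 1

def predict(x):
    best_class, best_score = None, -1.0
    for c, c_count in class_counts.items():
        score = c_count / len(train)                      # prior P(Ci)
        for attr, value in x.items():                     # likelihoods P(xk|Ci)
            score *= value_counts[(c, attr)][value] / c_count
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict({"age": "senior", "income": "high"}))       # -> "yes"
```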
Advantages:
It is easy to implement.
Limitations:
Bayesian belief networks:
Concept :
The naive Bayesian classifier makes the assumption of class-conditional independence, that is, given the class label of a tuple, the values of the attributes are assumed to be conditionally independent of one another. This simplifies computation. Bayesian belief networks relax this assumption by specifying joint conditional probability distributions over subsets of the variables.
Each node in the directed acyclic graph represents a random variable. The variables may be discrete- or continuous-valued. They may correspond to actual attributes given in the data or to "hidden variables" believed to form a relationship.
Each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is a parent (or immediate predecessor) of Z, and Z is a descendant of Y. Each variable is conditionally independent of its nondescendants in the graph, given its parents.
Unit-V
Cluster Analysis
Introduction
➢ Clustering is the process of grouping a set of data objects into multiple groups.
➢ Cluster analysis or clustering is the process of partitioning a set of data objects (or
observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to
one another, yet dissimilar to objects in other clusters. The set of clusters resulting from a
cluster analysis can be referred to as a clustering.
Cluster analysis has been widely used in many applications such as,
➢ business intelligence,
➢ Web search,
2. Ability to deal with different types of attributes – Clustering algorithms should work not
only for numeric data, but also for other data types.
5. Ability to deal with noisy data – Outliers, missing, unknown, and erroneous data detected by a clustering algorithm may lead to clusters of poor quality, so algorithms should be robust to noise.
6. Insensitivity to the order of input records – Clustering algorithms should produce the same results even if the order of the input records is changed.
7. High dimensionality – Data in high dimensional space can be sparse and highly skewed,
hence it is challenging for a clustering algorithm to cluster data objects in high dimensional
space.
9. Interpretability and usability – Clustering results should be interpretable,
comprehensible and usable. So we should study how an application goal may influence the
selection of clustering methods.
Numerical Data
• There are a number of methods for computing similarity between these data.
Binary Data
• A simple method involves counting how many attribute values of 2 objects are
different amongst n attributes &using this as an indication of distance.
Qualitative Nominal Data
• This is similar to binary data but may take more than 2 values; it has no natural order.
Qualitative Ordinal Data
• This is similar to qualitative nominal data except that the data have an order associated with them.
➢ Partitioning Methods
➢ Hierarchical Methods
➢ Density-Based Methods
➢ Grid-Based Methods
➢ Model-Based Methods
Partitioning Methods
The simplest and most fundamental version of cluster analysis is partitioning, which
organizes the objects of a set into several exclusive groups or clusters. To keep the problem
specification concise, we can assume that the number of clusters is given as background
knowledge. This parameter is the starting point for partitioning methods.
Formally, given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters in terms of the data set attributes. Commonly used partitioning methods are k-means and k-medoids.
A centroid-based technique represents each cluster by a centroid, ci, which can be defined in various ways, such as by the mean or medoid of the objects (or points) assigned to the cluster. The difference between an object p ∈ Ci and ci, the representative of the cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean distance between two points x and y. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared error between all objects in Ci and the centroid ci, defined as
E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, c_i)^2
The k-means algorithm defines the centroid of a cluster as the mean value
of the points within the cluster. It proceeds as follows. First, it randomly selects k of the
objects in D, each of which initially represents a cluster mean or center. For each of the
remaining objects, an object is assigned to the cluster to which it is the most similar, based on
the Euclidean distance between the object and the cluster mean. The k-means algorithm then
iteratively improves the within-cluster variation. For each cluster, it computes the new mean
using the objects assigned to the cluster in the previous iteration. All the objects are then
reassigned using the updated means as the new cluster centers. The iterations continue until
the assignment is stable, that is, the clusters formed in the current round are the same as those
formed in the previous round.
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is
represented by the mean value of the objects in the cluster.
Input: k, the number of clusters, and D, a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for each cluster;
(5) until no change;
A minimal sketch of this procedure is given below.
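The sketch assumes 2-D points; the data and the choice k = 2 are illustrative.

```python
# Minimal k-means sketch for 2-D points, following steps (1)-(5) above.
import random

def kmeans(points, k, iterations=100):
    centers = random.sample(points, k)                 # (1) arbitrarily choose k objects
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):                        # (2) repeat
        clusters = [[] for _ in range(k)]
        for p in points:                               # (3) assign each object to the closest mean
            i = min(range(k), key=lambda j: (p[0] - centers[j][0]) ** 2 +
                                            (p[1] - centers[j][1]) ** 2)
            clusters[i].append(p)
        new_centers = [                                # (4) update the cluster means
            (sum(q[0] for q in c) / len(c), sum(q[1] for q in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:                     # (5) until no change
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(points, k=2)
print(centers)
```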
o Decision trees.
o Neural networks.
o The winner is the neuron whose weight vector is closest to the instance
currently presented.
o The winner and its neighbours learn by having their weights adjusted.
The SOM algorithm is successfully used for vector quantization and speech recognition.
COBWEB:
This algorithm assumes that all attributes are independent (an often too
naive assumption). Its aim is to achieve high predictability of nominal variable values, given
a cluster. This algorithm is not suitable for clustering large database data (Fisher, 1987).
CLASSIT:
➢ An outlier is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism.
➢ Outliers can be classified into global outliers, contextual (conditional) outliers, and collective outliers.
Global outlier — an object that significantly deviates from the rest of the data set.
Supervised Methods- Supervised methods model data normality and abnormality. Domain
experts examine and label a sample of the underlying data. Outlier detection can then be
modeled as a classification problem.
ii) Low recall: This is due to the inability to index all the information available on the web, because some of the relevant pages are not properly indexed.
• While interacting with the web, individuals have their own preferences
for the style of the
• Within this problem, there are sub-problems such as→ problems related to effective web-
site design and management
Other related techniques from different research areas, such as DB(database), IR(information
retrieval) & NLP(natural language processing), can also be used.
• This is the process of extracting useful information from the contents of web-documents.
• More and more government information is gradually being placed on the web in recent years.
• We have
• Some of the web-data are hidden-data, and some are generated dynamically as a result of
queries and reside in the DBMSs.
• The web-content consists of different types of data such as text, image, audio, video as well
as
hyperlinks.
• Much of the web-data is unstructured, free text-data. As a result, text-mining techniques can
be directly employed for web-mining.
• Issues addressed in text mining are, topic discovery, extracting association patterns,
clustering of web documents and classification of Web Pages.
• Research activities on this topic have drawn heavily on techniques developed in other
disciplines such as IR (Information Retrieval) and NLP (Natural Language Processing).
• This deals with studying the data generated by the web-surfer's sessions (or behaviours).
On the other hand, web-usage mining extracts the secondary-data derived from the
interactions of the users with the web.
• This can be used to analyze the web-logs to understand access-patterns and trends.
• This can shed better light on the structure & grouping of resource providers.
• Based on user access-patterns, following things can be dynamically customized for each
user over time:
→ information displayed
→ depth of site-structure
→ format of resources
1) The first approach maps the usage-data of the web-server into relational-tables before a
traditional data-mining technique is applied.
2) The second approach uses the log-data directly by utilizing special pre-processing
techniques.
• Web-Structure mining is the process of discovering structure information from the web.
• This type of mining can be performed either at the (intra-page) document level or at the
(inter- page) hyperlink level.
• This can be used to generate information such as the similarity & relationship between
different web-sites.
PageRank
• The key idea is that a page has a high rank if it is pointed to by many highly ranked pages
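A minimal power-iteration sketch of this idea follows; the link graph and the damping factor of 0.85 (a commonly used value) are assumptions for the example, not taken from the text.

```python
# Minimal PageRank sketch by power iteration on a tiny hypothetical link graph.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # rank flowing into p from every page q that links to p
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / len(pages) + damping * incoming
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(sorted(ranks.items(), key=lambda kv: -kv[1]))  # C ranks highest: many pages point to it
```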
• For determining the collection of similar pages, we need to define the similarity measure
between the pages. There are 2 basic similarity functions:
1) Co-citation: For a pair of nodes p and q, the co-citation is the number of nodes that point
to both p and q.
2) Bibliographic coupling: For a pair of nodes p and q, the bibliographic coupling is equal to
the number of nodes that have links from both p and q.
• This can be used to measure the relative standing or importance of individuals in a network.
• The basic idea is that if a web-page points a link to another web-page, then the former is, in some sense, endorsing the importance of the latter.
• Links in the network may have different weights, corresponding to the strength of
endorsement.
• This refers to the extraction of knowledge, spatial relationships, or other interesting patterns
not explicitly stored in spatial-databases.
• Consider a map of the city of Mysore containing various natural and man-made geographic
features, and clusters of points (where each point marks the location of a particular house).
• The houses might be important because of their size, or their current market value.
• Clustering algorithms can be used to assign each point to exactly one cluster, with the
number of clusters being defined by the user.
• For ex, "the land-value of cluster of residential area around ‘Mysore Palace’ is high".
• This problem is not so simple because there may be a large number of features to consider.
o → discriminant rules
o → association rules
• The key idea of a density based cluster is that for each point of a cluster, its epsilon
neighbourhood has to contain at least a minimum number of points.
• First, any other symmetric & reflexive neighbourhood relationship can be used instead of an
epsilon neighbourhood. It may be more appropriate to use topological relations such as
intersects, meets or above/below to group spatially extended objects.
A time-series database is also a sequence database. However, a sequence database is any
database that consists of sequences of ordered events, with or without concrete notions of
time.
For example, Web page traversal sequences and customer shopping transaction sequences are
sequence data, but they may not be time-series data.
1. What is clustering?
Clustering is the process of grouping a set of data objects into multiple groups.
1. business intelligence,
3. Web search,
1. Scalability.
Numerical Data
Binary Data
1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
4. Grid-Based Methods
5. Model-Based Methods
• Page Rank is a metric for ranking hypertext documents based on their quality.
• The key idea is that a page has a high rank if it is pointed to by many highly ranked pages.