Unit-1 Lecture Notes
Data Warehouse:
A data warehouse is a relational database designed for analytical needs rather than transaction processing. It functions on the basis of OLAP (Online Analytical Processing). It is a central location where consolidated data from multiple sources (databases) are stored.
Fig: Data warehousing Process
Data warehouse Architecture:
A data warehouse often adopts a three-tier architecture. The bottom tier is a warehouse database server, which is almost always a relational database system. The middle tier is an OLAP server, typically implemented using either (a) a relational OLAP (ROLAP) model, an extended relational DBMS that maps operations on multidimensional data to standard relational operations, or (b) a multidimensional OLAP (MOLAP) model, a special-purpose server that directly implements multidimensional data and operations.
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and data mining tools (e.g., trend analysis, prediction).
Fig: Multidimensional Representation
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in the table.
In this 2D representation, the sales for Delhi are shown for the time dimension (organized in quarters) and the item dimension (classified according to the types of items sold).
The fact, or measure, displayed is rupees_sold (in thousands).
Now suppose we want to view the sales data with a third dimension. For example, suppose the data are viewed according to time and item, as well as location, for the cities Chennai, Kolkata, Mumbai, and Delhi. These 3D data are shown in the table, represented as a series of 2D tables.
Conceptually, the same data may also be represented in the form of a 3D data cube, as shown in the figure.
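As an illustration, the cube can be modelled directly as a 3D array. The sketch below is a minimal example using numpy with invented figures and hypothetical dimension members; it only shows how each 2D table corresponds to one slice of the cube along the location axis.

```python
import numpy as np

# A 3D sales cube: axes are (location, time, item). Figures are invented.
locations = ["Chennai", "Kolkata", "Mumbai", "Delhi"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
items = ["Keyboard", "Mobile", "Modem"]

rng = np.random.default_rng(0)
# cube[l, t, i] = rupees sold (in thousands)
cube = rng.integers(100, 1000, size=(len(locations), len(quarters), len(items)))

# One 2D table of the series = one slice along the location axis.
delhi_2d = cube[locations.index("Delhi")]  # shape: (4 quarters, 3 item types)
print(delhi_2d)
```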
What is Schema?
Schema is a logical description of the entire database.
It includes the name and description of records of all record types, including all associated data items and aggregates.
Much like a database, a data warehouse also requires a schema to be maintained.
A database uses the relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema.
Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a set of dimension tables
Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection
of stars, therefore called galaxy schema or fact constellation
Star Schema:
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions.
A fact is an event that is counted or measured, such as a sale or log in. A dimension
includes reference data about the fact, such as date, item, or customer.
A star schema is a relational schema whose design represents a multidimensional data model.
The star schema is the most basic data warehouse schema. It is known as a star schema because the entity-relationship diagram of this schema resembles a star, with points diverging from a central table.
The center of the schema consists of a large fact table, and the points of the star are the
dimension tables.
Each dimension in a star schema is represented with only one dimension table.
This dimension table contains the set of attributes.
The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities cause data redundancy along the attributes province_or_state and country.
The dimension tables are not normalized. For instance, in the above figure, Country_ID does not have a Country lookup table as an OLTP design would have.
The schema is widely supported by BI Tools.
Advantages:
(i) Simplest and Easiest
(ii) It optimizes navigation through the database
(iii) Most suitable for Query Processing
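To make the structure concrete, here is a minimal sketch of the sales star schema using Python's built-in sqlite3 module. The table and column names follow the example above; the data types and the sample query are illustrative assumptions, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One dimension table per dimension (not normalized further), plus a central
# fact table holding foreign keys to each dimension and the two measures.
cur.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, province_or_state TEXT, country TEXT);

CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")

# A typical star-join query: total sales per city.
cur.execute("""
    SELECT l.city, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.city
""")
print(cur.fetchall())  # empty here, since no rows were inserted
```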
Snowflake Schema:
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike the star schema, the dimension tables in a snowflake schema are normalized.
For example, the item dimension table in the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.
Note: Due to normalization in the snowflake schema, the redundancy is reduced; therefore, it becomes easier to maintain and saves storage space.
Advantages:
(i) Less redundancy, due to normalized dimension tables.
(ii) Dimension tables are easier to update.
Disadvantages:
It is a more complex schema when compared to the star schema.
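A minimal sketch of the normalization step, again using sqlite3: the supplier attributes move out of the item dimension into their own table, reached through supplier_key. The column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Snowflake normalization: supplier attributes leave the item dimension
# and get their own table, linked by supplier_key.
cur.executescript("""
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_dim (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    brand        TEXT,
    type         TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
);
""")
# Reading supplier_type now costs an extra join compared to the star schema:
# the usual trade-off of less redundancy against more joins.
```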
Fact Constellation Schema:
A fact constellation schema is shown in the figure below.
This schema defines two fact tables, sales and shipping. Sales are treated along four dimensions, namely time, item, branch, and location.
The schema contains a fact table for sales that includes keys to each of the four
dimensions, along with two measures: Rupee_sold and units_sold.
The shipping table has five dimensions, or keys: item_key, time_key, shipper_key,
from_location, and to_location, and two measures: Rupee_cost and units_shipped.
It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
Disadvantages:
(i) Complex due to multiple fact tables
(ii) It is difficult to manage
(iii) Dimension Tables are very large.
OLAP OPERATIONS:
In the multidimensional model, the records are organized into various dimensions, and
each dimension includes multiple levels of abstraction described by concept hierarchies.
This organization provides users with the flexibility to view data from various perspectives.
A number of OLAP data cube operations exist to demonstrate these different views, allowing interactive querying and searching of the records at hand. Hence, OLAP supports a user-friendly environment for interactive data analysis.
Consider the OLAP operations to be performed on a multidimensional data cube for the sales of a shop. The cube contains three dimensions: location, time, and item, where location is aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types.
OLAP provides five different operations:
(i) Roll-up
(ii) Drill-down
(iii) Slice
(iv) Dice
(v) Pivot
Roll-up:
The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Roll-up is like zooming out on the data cube.
It is also known as the drill-up or aggregation operation.
The figure shows the result of a roll-up operation performed on the dimension location. The hierarchy for location is defined as the order street < city < province_or_state < country. The roll-up operation aggregates the data by ascending the location hierarchy from the level of city to the level of country.
When a roll-up is performed by dimension reduction, one or more dimensions are removed from the cube.
For example, consider a sales data cube having two dimensions, location and time. Roll-up may be performed by removing the time dimension, resulting in an aggregation of the total sales by location, rather than by location and by time.
Fig: Roll-up operation on Data Cube
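A small pandas sketch of both roll-up variants, on invented sales figures: climbing the location hierarchy from city to country, and dropping the time dimension entirely.

```python
import pandas as pd

# Invented sales records; columns mirror the cube's dimensions.
sales = pd.DataFrame({
    "city":        ["Delhi", "Mumbai", "Toronto", "Vancouver"],
    "country":     ["India", "India", "Canada", "Canada"],
    "quarter":     ["Q1", "Q1", "Q1", "Q1"],
    "rupees_sold": [605, 580, 395, 420],
})

# Roll-up by climbing the location hierarchy: city -> country.
by_country = sales.groupby(["country", "quarter"], as_index=False)["rupees_sold"].sum()

# Roll-up by dimension reduction: remove the time dimension entirely.
by_location_only = sales.groupby("country", as_index=False)["rupees_sold"].sum()
print(by_country, by_location_only, sep="\n")
```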
Drill-Down
The drill-down operation is the reverse of roll-up.
It is also called the roll-down operation.
Drill-down is like zooming in on the data cube.
It navigates from less detailed data to more detailed data. Drill-down can be performed by either stepping down a concept hierarchy for a dimension or adding additional dimensions.
The figure shows a drill-down operation performed on the dimension time by stepping down a concept hierarchy defined as day < month < quarter < year.
Drill-down descends the time hierarchy from the level of quarter to the more detailed level of month.
Because a drill-down adds more details to the given data, it can also be performed by
adding a new dimension to a cube.
Fig: Drill-down operation
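A matching pandas sketch of drill-down, on invented monthly figures: the time dimension descends from quarter to the more detailed month level.

```python
import pandas as pd

# Invented figures at the month level; drill-down steps down the time
# hierarchy from quarter to month, exposing more detail.
monthly = pd.DataFrame({
    "quarter":     ["Q1", "Q1", "Q1", "Q2"],
    "month":       ["Jan", "Feb", "Mar", "Apr"],
    "rupees_sold": [200, 190, 215, 230],
})
drilled = monthly.groupby(["quarter", "month"], as_index=False)["rupees_sold"].sum()
print(drilled)
```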
Slice:
A slice is a subset of the cube corresponding to a single value for one or more members of a dimension.
The slice operation produces a new sub-cube by performing a selection on one particular dimension of a given cube.
For example, a slice operation is executed when the customer wants a selection on one dimension of a three-dimensional cube, resulting in a two-dimensional slice.
Here, slice is performed on the dimension time using the criterion time = "Q1", forming a new sub-cube.
Fig: Slice operation
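A minimal pandas sketch of the slice, using invented figures: fixing time = "Q1" leaves a 2D sub-cube over city and item.

```python
import pandas as pd

cube = pd.DataFrame({
    "time":        ["Q1", "Q1", "Q2", "Q2"],
    "city":        ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "item":        ["Mobile", "Modem", "Mobile", "Modem"],
    "rupees_sold": [605, 80, 680, 95],
})

# Slice: select a single value on one dimension (time = "Q1"),
# leaving a 2D sub-cube over (city, item).
q1 = cube[cube["time"] == "Q1"].pivot(index="city", columns="item", values="rupees_sold")
print(q1)
```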
Dice:
The dice operation defines a sub-cube by performing a selection on two or more dimensions.
Fig: Dice operation
The dice operation on the cube is based on the following selection criteria, which involve three dimensions (a pandas sketch of the same selection follows the criteria):
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item =" Mobile" or "Modem")
Pivot:
The pivot operation is also called a rotation.
Pivot is a visualization operation that rotates the data axes in view to provide an alternative presentation of the data.
It may involve swapping the rows and columns, or moving one of the row dimensions into the column dimensions.
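A small pandas sketch of pivoting: the same invented figures presented with rows and columns swapped.

```python
import pandas as pd

table = pd.DataFrame({
    "city":        ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "item":        ["Mobile", "Modem", "Mobile", "Modem"],
    "rupees_sold": [605, 80, 580, 95],
})

# One presentation: items as rows, cities as columns.
view = table.pivot(index="item", columns="city", values="rupees_sold")

# Pivot (rotate): swap the row and column axes for the alternative view.
rotated = view.T
print(view, rotated, sep="\n\n")
```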
Parallel DBMS Vendors:
A DBMS (Database Management System) is software that controls the organization, storage, retrieval, security, and integrity of data in a database.
The major DBMS vendors are Oracle, IBM, Microsoft and Sybase (see Oracle Database, DB2, SQL Server
and ASE).
Product | Vendor | Type | Category
Interbase | Inprise (Borland) | R | Open Source
MySQL | Freeware | R | Open Source
NonStop SQL | Tandem | R | Enterprise
Pervasive.SQL 2000 (Btrieve) | Pervasive Software | R | Embedded
Pervasive.SQL Workgroup | Pervasive Software | R | Enterprise (Windows 32)
Progress | Progress Software | R | Mobile/Embedded
Quadbase SQL Server | Quadbase Systems, Inc. | Relational | Enterprise
R:Base | R:Base Technologies | Relational | Enterprise
Rdb | Oracle | R | Enterprise
Red Brick | Informix (Red Brick) | R | Enterprise (Data Warehousing)
SQL Server | Microsoft | R | Enterprise
SQLBase | Centura Software | R | Mobile/Embedded
SUPRA | Cincom | R | Enterprise
Teradata | NCR | R | VLDB (Data Warehousing)
YARD-SQL | YARD Software Ltd. | R | Enterprise
TimesTen | TimesTen Performance Software | R | In-Memory
Adabas | Software AG | XR | Enterprise
Model 204 | Computer Corporation of America | XR | VLDB
UniData | Informix (Ardent) | XR | Enterprise
UniVerse | Informix (Ardent) | XR | Enterprise
Caché | InterSystems | OR | Enterprise
Cloudscape | Informix | OR | Mobile/Embedded
DB2 | IBM | OR | Enterprise/VLDB
Informix Dynamic Server 2000 | Informix | OR | Enterprise
Informix Extended Parallel Server | Informix | OR | VLDB (Data Warehousing)
Oracle Lite | Oracle | OR | Mobile
Oracle 8i | Oracle | OR | Enterprise
PointBase Embedded | PointBase | OR | Embedded
PointBase Mobile | PointBase | OR | Mobile
PointBase Network Server | PointBase | OR | Enterprise
PostgreSQL | Freeware | OR | Open Source
UniSQL | Cincom | OR | Enterprise
Jasmine ii | Computer Associates | OO | Enterprise
Object Store | Exceleron | OO | Enterprise
Objectivity DB | Objectivity | OO | VLDB (Scientific)
POET Object Server Suite | Poet Software | OO | Enterprise
Versant | Versant Corporation | OO | Enterprise
Raima Database Manager | Centura Software | RN | Mobile/Embedded
Velocis | Centura Software | RN | Enterprise/Embedded
Db.linux | Centura Software | RNH | Open Source/Mobile/Embedded
Db.star | Centura Software | RNH | Open Source/Mobile/Embedded
Types of Data Warehouse:
There are three main types of DWH. Each has its specific role in data management operations.
1. Enterprise Data Warehouse (EDW)
It is a centralized warehouse that provides decision support services across the entire organization.
2. Operational Data Store (ODS)
It is a repository of current operational data, refreshed frequently, that is often used as a staging area for the EDW.
3. Data Mart
It is a subset of a DWH that supports a particular department, region, or business unit. Consider
this: You have multiple departments, including sales, marketing, product development, etc.
Each department will have a central repository where it stores data. This repository is called a
data mart. Data from each data mart is loaded into the ODS on a daily/weekly (or as configured) basis. The ODS acts as a staging area for data integration; it then sends the data to the EDW, where it is stored and used for BI purposes.
DATAWAREHOUSE COMPONENTS:
The data warehouse is based on an RDBMS server which is a central information repository that
is surrounded by some key components to make the entire environment functional,
manageable, and accessible.
There are mainly five components of Data Warehouse:
1. Data Warehouse Database
2. Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)
3. Metadata
4. Query Tools
5. Data Marts
Data Warehouse Database: The central database is the foundation of the data warehousing environment. It is implemented on RDBMS technology, although this kind of implementation is constrained by the fact that the traditional RDBMS system is optimized for transactional database processing and not for data warehousing. For instance, ad-hoc queries, multi-table joins, and aggregates are resource intensive and slow down performance.
Hence, alternative approaches to the database are used, as listed below:
In a data warehouse, relational databases are deployed in parallel to allow for scalability.
Parallel relational databases also allow shared memory or shared nothing model on various
multiprocessor configurations or massively parallel processors.
New index structures are used to bypass relational table scans and improve speed.
Use of multidimensional databases (MDDBs) to overcome any limitations placed because of the relational data model. Example: Essbase from Oracle.
METADATA:
The name metadata suggests some high-level technological concept; however, it is quite simple. Metadata is data about data that defines the data warehouse. It is used for building, maintaining, and managing the data warehouse.
In the Data Warehouse Architecture, meta-data plays an important role as it specifies the
source, usage, values, and features of data warehouse data. It also defines how data can be
changed and processed. It is closely connected to the data warehouse.
QUERY TOOLS:
One of the primary objectives of data warehousing is to provide information to businesses to make strategic decisions. Query tools allow users to interact with the data warehouse system.
These tools fall into four different categories:
Query and reporting tools
Application Development tools
Data mining tools
OLAP tools
Characteristics of OLAP:
The main characteristics of OLAP are as follows:
Multidimensional conceptual view: OLAP systems let business users have a dimensional and logical view of the data in the data warehouse. It helps in carrying out slice and dice operations.
Multi-User Support: Since OLAP systems are shared, the OLAP operations should provide normal database operations, including retrieval, update, concurrency control, integrity, and security.
Accessibility: OLAP acts as a mediator between data warehouses and front-end tools. The OLAP operations should sit between data sources (e.g., data warehouses) and an OLAP front-end.
Storing OLAP results: OLAP results are kept separate from data sources.
Uniform reporting performance: Increasing the number of dimensions or the database size should not significantly degrade the reporting performance of the OLAP system.
OLAP provides for distinguishing between zero values and missing values so that aggregates are computed correctly.
An OLAP system should ignore all missing values and compute correct aggregate values.
OLAP facilitates interactive querying and complex analysis for users.
OLAP allows users to drill down for greater detail or roll up for aggregations of metrics along a single business dimension or across multiple dimensions.
OLAP provides the ability to perform intricate calculations and comparisons.
OLAP presents results in a number of meaningful ways, including charts and graphs.
OLAP Types:
The three types of OLAP servers are:
1. Relational OLAP (ROLAP)
2. Multidimensional OLAP (MOLAP)
3. Hybrid OLAP (HOLAP)
Advantages of ROLAP:
1. ROLAP can handle large amounts of data.
2. Can be used with data warehouses and OLTP systems.
Disadvantages of ROLAP:
1. Limited by SQL functionalities.
2. Hard to maintain aggregate tables.
Advantages of MOLAP
1. Optimal for slice and dice operations.
2. Performs better than ROLAP when data is dense.
3. Can perform complex calculations.
Disadvantages of MOLAP
1. Difficult to change dimension without re-aggregation.
2. MOLAP can handle only a limited amount of data.
Advantages of HOLAP
1. HOLAP provides the advantages of both MOLAP and ROLAP.
2. Provide fast access at all levels of aggregation.
Disadvantages of HOLAP
1. HOLAP architecture is very complex because it supports both MOLAP and ROLAP servers.
Data Mining:
The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The
information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis, to engineering design and science
exploration.
Fig: The evolution of database technology (beginning with data collection and database creation, 1960s and earlier: primitive file processing)
What is data mining?
Data mining refers to extracting or "mining" knowledge from large amounts of data. There are many other terms related to data mining, such as knowledge mining, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, "Knowledge Discovery in Databases", or KDD.
Data mining is the process of discovering interesting knowledge from large amounts of data
stored either in databases, data warehouses, or other information repositories. Based on this
view, the architecture of a typical data mining system may have the following major
components:
1. Database, data warehouse, or other information repository: one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories on which data cleaning and integration may be performed.
2. Database or data warehouse server: responsible for fetching the relevant data, based on the user's data mining request.
3. Knowledge base: the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
4. Data mining engine: a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution analysis.
5. Pattern evaluation module: interacts with the data mining modules so as to focus the search toward interesting patterns.
6. Graphical user interface: allows the user to interact with the data mining system.
Flat files: Flat files are actually the most common data source for
data mining algorithms, especially at the research level. Flat files are
simple data files in text or binary format with a structure known by
the data mining algorithm to be applied. The data in these files can
be transactions, time-series data, scientific measurements, etc.
Data warehouses
Transactional databases
SALES
Trans-ID | List of item_IDs
T100     | I1, I3, I8
...      | ...
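A minimal sketch of how a data mining program might load such transactions from a flat text file; the file layout (one transaction per line, items comma-separated, as in the SALES table above) is an assumed convention for illustration.

```python
# Parsing transactions from a flat text file whose lines follow the SALES
# table above, e.g. "T100 I1,I3,I8". The layout is an assumed convention.
def load_transactions(path):
    transactions = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            trans_id, item_list = line.split()
            transactions[trans_id] = item_list.split(",")
    return transactions

# transactions = load_transactions("sales.txt")
# -> {"T100": ["I1", "I3", "I8"], ...}
```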
Data mining functionalities/Data mining tasks: what kinds of patterns can be mined?
Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be classified into two categories:
• Descriptive
• Predictive
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Describe data mining functionalities, and the kinds of patterns they can discover (or)
Define each of the following data mining functionalities: characterization, discrimination,
association and correlation analysis, classification, prediction, clustering, and evolution analysis.
Give examples of each data mining functionality, using a real-life database that you are familiar
with.
Concept/Class Description: Characterization and Discrimination
Data can be associated with classes or concepts. Concept/class description describes a given set of data in a concise and summary manner, presenting interesting general properties of the data. These descriptions can be derived via:
1. data characterization, by summarizing the data of the class under study (often
called the target class)
2. data discrimination, by comparison of the target class with one or a set of
comparative classes
3. both data characterization and discrimination
Data characterization
Example:
A data mining system should be able to produce a description summarizing the characteristics of
a student who has obtained more than 75% in every semester; the result could be a general
profile of the student.
Data Discrimination is a comparison of the general features of target class data objects with the
general features of objects from one or a set of contrasting classes.
Example
The general features of students with high GPAs may be compared with the general features of students with low GPAs. The resulting description could be a general comparative profile of the students, such as: 75% of the students with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs are not.
The output of data characterization can be presented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including
crosstabs. The resulting descriptions can also be presented as generalized relations, or in rule
form called characteristic rules.
Discrimination descriptions expressed in rule form are referred to as discriminant rules.
Association Analysis:
It is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. For example, a data mining system may find association rules like
major(X, "computing science") ⇒ owns(X, "personal computer")
Consider a grocery store retailer trying to decide whether to put bread on sale. To help determine the impact of this decision, the retailer generates association rules that show what other products are frequently purchased with bread. He finds that 60% of the time that bread is sold, so are pretzels, and that 70% of the time jelly is also sold. Based on these facts, he tries to capitalize on the association between bread, pretzels, and jelly by placing some pretzels and jelly at the end of the aisle where the bread is placed. In addition, he decides not to place either of these items on sale at the same time.
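The percentages in the example are, in association-rule terms, confidences. Below is a minimal sketch of computing support and confidence over invented market baskets:

```python
# Support and confidence over invented market baskets. For the rule
# bread => jelly, confidence = support(bread and jelly) / support(bread).
baskets = [
    {"bread", "jelly", "pretzels"},
    {"bread", "jelly"},
    {"bread", "pretzels"},
    {"milk", "jelly"},
]

def support(itemset, baskets):
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    return support(lhs | rhs, baskets) / support(lhs, baskets)

print(confidence({"bread"}, {"jelly"}, baskets))  # 2 of 3 bread baskets -> ~0.67
```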
Classification:
Typical applications include:
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Classification can be defined as the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the model to predict
the class of objects whose class label is unknown. The derived model is based on the analysis of
a set of training data (i.e., data objects whose class label is known).
Example:
An airport security screening station is used to determine if passengers are potential terrorists or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance between eyes, size and shape of mouth, head, etc.) is identified. This pattern is compared to entries in a database to see if it matches any patterns that are associated with known offenders.
The derived model may be represented in various forms, such as the following (a small decision-tree sketch appears after the list):
1) IF-THEN rules
2) Decision trees
3) Neural networks
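A minimal sketch of learning a classification model, using scikit-learn's decision tree on invented training data (income and years of credit history mapped to approve/reject labels):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: (income, years of credit history) -> decision.
X_train = [[25_000, 1], [48_000, 5], [90_000, 12], [30_000, 2]]
y_train = ["reject", "approve", "approve", "reject"]

# Fit a decision tree on objects whose class labels are known...
model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(export_text(model, feature_names=["income", "credit_years"]))

# ...then predict the class label of an object whose label is unknown.
print(model.predict([[55_000, 4]]))
```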
Prediction:
Finding some missing or unavailable data values, rather than class labels, is referred to as prediction.
Although prediction may refer to both data value prediction and class label prediction, it is
usually confined to data value prediction and thus is distinct from classification. Prediction also
encompasses the identification of distribution trends based on the available data.
Example:
Predicting flooding is a difficult problem. One approach uses monitors placed at various points in the river. These monitors collect data relevant to flood prediction: water level, rain amount, time, humidity, etc. The water levels at a potential flooding point in the river can be predicted based on the data collected by the sensors upriver from this point. The prediction must be made with respect to the time the data were collected.
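A minimal sketch of numeric prediction in the spirit of the flood example, fitting a scikit-learn linear regression on invented sensor readings:

```python
from sklearn.linear_model import LinearRegression

# Invented sensor readings: (rain_mm, upstream_level_m) -> downstream level.
X_train = [[0.0, 1.2], [5.0, 1.8], [12.0, 2.9], [20.0, 4.1]]
y_train = [1.1, 1.9, 3.2, 4.6]

model = LinearRegression().fit(X_train, y_train)

# Predict the missing numeric value for a new reading.
print(model.predict([[8.0, 2.2]]))
```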
Classification differs from prediction in that the former constructs a set of models (or functions) that describe and distinguish data classes or concepts, whereas the latter predicts some missing or unavailable, and often numerical, data values. Their similarity is that they are both tools for prediction: classification is used for predicting the class label of data objects, and prediction is typically used for predicting missing numerical data values.
Clustering Analysis:
The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
Each cluster that is formed can be viewed as a class of objects.
Clustering can also facilitate taxonomy formation, that is, the organization of observations into a
hierarchy of classes that group similar events together as shown below:
Example:
A certain national department store chain creates special catalogs targeted to various
demographic groups based on attributes such as income, location and physical characteristics of
potential customers (age, height, weight, etc). To determine the target mailings of the various
catalogs and to assist in the creation of new, more specific catalogs, the company performs a
clustering of potential customers based on the determined attribute values. The results of the clustering exercise are then used by management to create special catalogs and distribute them to the correct target population based on the cluster for that catalog.
In general, in classification you have a set of predefined classes and want to know
which class a new object belongs to.
Clustering tries to group a set of objects and find whether there is some
relationship between the objects.
In the context of machine learning, classification is supervised learning and
clustering is unsupervised learning.
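A minimal sketch of clustering with scikit-learn's k-means on invented customer attributes; no class labels are supplied, the grouping itself is discovered:

```python
from sklearn.cluster import KMeans

# Invented customer attributes: (age, income). No class labels are given;
# the grouping itself is what the algorithm discovers.
customers = [[22, 18_000], [25, 21_000], [47, 85_000], [52, 90_000], [33, 40_000]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # cluster assignment for each customer
```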
1.4.5 Outlier analysis: A database may contain data objects that do not comply with the general model of the data. These data objects are outliers. In other words, the data objects which do not fall within any cluster are called outlier data objects. Noisy or exceptional data are also called outlier data. The analysis of outlier data is referred to as outlier mining.
Example
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges incurred
by the same account. Outlier values may also be detected with respect to the location and type of
purchase, or the purchase frequency.
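A minimal sketch of a simple outlier test on invented charge amounts, flagging values far from the mean by z-score (the threshold is a heuristic choice, and z-scores are only one of many outlier tests):

```python
import numpy as np

# Invented charges on one account; the last one is far from the regular range.
charges = np.array([35, 42, 28, 51, 39, 2450])

# z-score rule: flag values more than k standard deviations from the mean.
# k = 2 here because the sample is tiny; 3 is a common choice on larger data.
z = (charges - charges.mean()) / charges.std()
print(charges[np.abs(z) > 2])  # -> [2450]
```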
1.4.6 Data evolution analysis describes and models regularities or trends for objects whose
behavior changes over time.
Example:
The results data of the last several years of a college would give an idea of the quality of the graduates produced by it.
Correlation analysis
Correlation analysis is a technique used to measure the association between two variables. A correlation coefficient (r) is a statistic used for measuring the strength of a supposed linear association between two variables. Correlations range from -1.0 to +1.0 in value.
A correlation coefficient of 1.0 indicates a perfect positive relationship in which high values of
one variable are related perfectly to high values in the other variable, and conversely, low values
on one variable are perfectly related to low values on the other variable.
A correlation coefficient of 0.0 indicates no relationship between the two variables. That is, one
cannot use the scores on one variable to tell anything about the scores on the second variable.
A correlation coefficient of -1.0 indicates a perfect negative relationship in which high values of
one variable are related perfectly to low values in the other variables, and conversely, low values
in one variable are perfectly related to high values on the other variable.
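A minimal sketch of computing r with numpy on invented paired observations:

```python
import numpy as np

# Invented paired observations; y rises with x, so r should be close to +1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))
```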
A pattern is interesting if it is: (1) easily understood by humans; (2) valid on new or test data with some degree of certainty; (3) potentially useful; and (4) novel.
A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern represents knowledge.
Active learning is a machine learning approach that lets users play an active role in the learning
process. An active learning approach can ask a user (e.g., a domain expert) to label an example,
which may be from a set of unlabelled examples or synthesized by the learning program. The
goal is to optimize the model quality by actively acquiring knowledge from human users, given a
constraint on how many examples they can be asked to label.
(3) Database Systems and Data Warehouses Database systems research focuses on the
creation, maintenance, and use of databases for organizations and end-users. Particularly,
database systems researchers have established highly recognized principles in data models, query
languages, query processing and optimization methods, data storage, and indexing and accessing
methods. Database systems are often well known for their high scalability in processing very
large, relatively structured data sets.
Many data mining tasks need to handle large data sets or even real-time, fast streaming data.
Therefore, data mining can make good use of scalable database technologies to achieve high
efficiency and scalability on large data sets. Moreover, data mining tasks can be used to extend the
capability of existing database systems to satisfy advanced users’ sophisticated data analysis
requirements. Recent database systems have built systematic data analysis capabilities on
database data using data warehousing and data mining facilities. A data warehouse integrates
data originating from multiple sources and various timeframes. It consolidates data in
multidimensional space to form partially materialized data cubes. The data cube model not only
facilitates OLAP in multidimensional databases but also promotes multidimensional data mining.
(4) Information Retrieval: Information retrieval (IR) is the science of searching for documents
or information in documents. Documents can be text or multimedia, and may reside on the Web.
The differences between traditional information retrieval and database systems are twofold:
Information retrieval assumes that (1) the data under search are unstructured; and (2) the queries
are formed mainly by keywords, which do not have complex structures (unlike SQL queries in
database systems). The typical approaches in information retrieval adopt probabilistic models.
For example, a text document can be regarded as a bag of words, that is, a multiset of words
appearing in the document. The document’s language model is the probability density function
that generates the bag of words in the document. The similarity between two documents can be
measured by the similarity between their corresponding language models. Furthermore, a topic in
a set of text documents can be modelled as a probability distribution over the vocabulary, which
is called a topic model. A text document, which may involve one or multiple topics, can be
regarded as a mixture of multiple topic models.
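A minimal sketch of the bag-of-words view: two invented documents compared via the cosine similarity of their term-count vectors, one simple similarity choice alongside the language-model approach described above.

```python
from collections import Counter
import math

def bag_of_words(text):
    # A document reduced to a multiset (bag) of its words.
    return Counter(text.lower().split())

def cosine_similarity(doc1, doc2):
    a, b = bag_of_words(doc1), bag_of_words(doc2)
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("data mining on text data", "mining text documents"))
```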