DATA WAREHOUSE AND DATA MINING

UNIT -1
PART-A
1. How is a data warehouse different from a database? Identify the similarity.
A data warehouse is a repository of multiple heterogeneous data sources, organized
under a unified schema at a single site in order to facilitate management decision-making. A
relational database is a collection of tables, each of which is assigned a unique name. Each
table consists of a set of attributes (columns or fields) and usually stores a large set of tuples
(records or rows). Each tuple in a relational table represents an object identified by a unique
key and described by a set of attribute values. Both are used to store and manipulate data.
2. Differentiate metadata and data mart.

METADATA:
➢ Data about data.
➢ Contains the location and description of data warehouse system components: names, definitions, structure, etc.
➢ It is used for maintaining, managing and using the data warehouse.

DATA MART:
➢ A departmental subset of the warehouse that focuses on selected subjects.
➢ A data mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in the company.
➢ Data marts are used for rapid delivery of enhanced decision support functionality to end users.

3. Analyze why one of the biggest challenges when designing a data warehouse is the
data placement and distribution strategy.
One of the biggest challenges when designing a data warehouse is the data placement
and distribution strategy. Data volumes continue to grow. Therefore, it becomes
necessary to know how the data should be divided across multiple servers and which users
should get access to which types of data. The data can be distributed based on the subject
area, location (geographical region), or time (current, month, year).
4. How would you evaluate the goals of data mining?

➢ Identifying high-value customers based on recent purchase data
➢ Building a model using available customer data to predict the likelihood of churn for each customer
➢ Assigning each customer a rank based on both churn propensity and customer value (see the sketch below)
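
A minimal Python sketch of the ranking goal above, combining an assumed churn propensity with customer value; all names and figures are hypothetical:

```python
# Hypothetical customer records: churn propensity from a model, value from purchase data.
customers = [
    {"id": "C1", "churn_propensity": 0.80, "value": 1200.0},
    {"id": "C2", "churn_propensity": 0.10, "value": 5000.0},
    {"id": "C3", "churn_propensity": 0.65, "value": 3100.0},
]

# Score each customer as (propensity to churn) x (customer value), then rank descending.
for c in customers:
    c["risk_score"] = c["churn_propensity"] * c["value"]

ranked = sorted(customers, key=lambda c: c["risk_score"], reverse=True)
for rank, c in enumerate(ranked, start=1):
    print(rank, c["id"], round(c["risk_score"], 1))
```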

5. List the two ways the parallel execution of the tasks within SQL statements can be
done.
The parallel execution of tasks within a SQL statement (intra-query parallelism) can be done in either of two ways:
➢ Horizontal parallelism: the database is partitioned across multiple disks, and a single task is performed concurrently on different processors against different sets of data.
➢ Vertical parallelism: parallelism occurs among different tasks; all query components such as scan, join and sort are executed in parallel in a pipelined fashion, where the output of one task becomes the input to another.
6. What elements would you use to relate the design of data warehouse?
➢ Quality Screens.
➢ External Parameters File / Table.
➢ Team and Its responsibilities.
➢ Up to date data connectors to external sources.
➢ Consistent architecture between environments (development / UAT (user acceptance testing) / production)
➢ Repository of DDLs and other script files (.SQL, Bash / PowerShell)
➢ Testing processes – unit tests, integration tests, regression tests
➢ Audit tables, monitoring and alerting of audit tables
➢ Known and described data lineage
7. Define Data mart
A data mart is an inexpensive alternative to the data warehouse and is based on a single
subject area. A data mart is used in the following situations:
➢ Extremely urgent user requirement.
➢ The absence of a budget for a full-scale data warehouse strategy.
➢ The decentralization of business needs.

8. Define star schema


The multidimensional view of data that is expressed using relational database
semantics is provided by the database schema design called the star schema. The basic premise
of the star schema is that information can be classified into two groups:
➢ Facts
➢ Dimension
Star schema has one large central table (fact table) and a set of smaller tables
(dimensions) arranged in a radial pattern around the central table.
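
As a minimal sketch (not taken from the text), such a star schema might be declared as follows through Python's sqlite3 module; the table and column names are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One central fact table surrounded by dimension tables in a radial (star) arrangement.
con.executescript("""
CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);

CREATE TABLE sales_fact (                  -- central fact table
    date_key    INTEGER REFERENCES date_dim(date_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    units_sold  INTEGER,                   -- measures
    sales_amt   REAL
);
""")
con.close()
```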

9. What is Data warehousing? Explain the benefits of Data warehousing.


DATA WAREHOUSING:
Data warehousing is an architectural construct of information systems that provides
users with current and historical decision support information that is hard to access or present
in traditional operational data stores.

BENEFITS:
➢ Data warehouses are designed to perform well with aggregate queries running on
large amounts of data.
➢ Data warehousing is an efficient way to manage and report on data that is from a
variety of sources, non-uniform and scattered throughout a company.
➢ Data warehousing is an efficient way to manage demand for lots of information from
lots of users.
➢ Data warehousing provides the capability to analyze large amounts of historical data
for nuggets of wisdom that can provide an organization with competitive advantage.

10. Why is data transformation essential in the process of knowledge discovery? Describe it.
Data transformation is essential in the process of knowledge discovery because the
main objective of the knowledge discovery in database process is to extract information from
data in the context of large databases. Data transformation is where data are transformed or
consolidated into forms appropriate for mining by performing summary or aggregation
operations, for instance.

11. Describe the alternate technologies used to improve the performance in the data
warehouse environment
Beyond conventional relational technology, performance in the data warehouse environment can be improved through parallel relational database technology (mapping the warehouse to a multiprocessor architecture), innovative indexing techniques such as bitmapped and STAR indexes, column local storage, and DBMS support for complex data types.

12. Distinguish STAR join and STAR index.


STAR JOIN:
A STAR join is a high-speed, single-pass, parallelizable multi-table join, and Red Brick's
RDBMS can join more than two tables in a single operation.
STAR INDEX:
Red Brick's RDBMS supports the creation of specialized indexes called STAR
indexes. A STAR index is created on one or more foreign key columns of a fact table.

13. Analyse the types of data mart.


TYPES OF DATA MART:
➢ Dependent
➢ Independent
➢ Hybrid

14. Formulate what is data discretization.


Data discretization converts a large number of data values into a smaller number of values,
so that data evaluation and data management become much easier. In other words, it is simply
defined as a process of converting continuous data attribute values into a finite set of
intervals and associating with each interval some specific data value.
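
A minimal Python sketch of one common discretization method, equal-width binning; the attribute values and the number of intervals are illustrative:

```python
def equal_width_bins(values, k):
    """Discretize continuous values into k equal-width intervals, returning bin labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        # Map each value to an interval index 0..k-1 (the maximum value falls in the last bin).
        idx = min(int((v - lo) / width), k - 1)
        labels.append(f"[{lo + idx * width:.1f}, {lo + (idx + 1) * width:.1f})")
    return labels

ages = [13, 22, 25, 37, 41, 45, 52, 70]
print(equal_width_bins(ages, 3))
```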

15. Point out the major differences between the star schema and the snowflake schema
In the star schema, each dimension is represented by a single, denormalized dimension table,
which gives simpler queries but some redundancy. In the snowflake schema, the dimension
tables may be kept in normalized form to reduce redundancies. Such tables are easy to
maintain and save storage space, but the additional joins can reduce query performance.

16. Point out the features of Metadata repository in data warehousing


➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business oriented key words
➢ It should act as a launch platform for end user to access data and analysis tools
➢ It should support the sharing of info

17. Define Metadata repository


Meta data helps the users to understand content and find the data. Meta data are stored
in a separate data store which is known as informational directory or Meta data repository
which helps to integrate, maintain and view the contents of the data warehouse.

18. Discuss metadata with an example.


It is data about data. It is used for maintaining, managing and using the data
warehouse. For example, author, date created, date modified and file size are examples of
very basic document file metadata. Having the ability to search for a particular element (or
elements) of that metadata makes it much easier for someone to locate a specific document.

19. Illustrate the benefits of metadata repository.


A metadata repository supports enterprise-wide data governance, data quality and
master data management (including master data and reference data), and integrates this wealth
of information with metadata collected across the organization to provide decision support
for data structures, even though it only reflects the structures consumed from various
systems.

20. Design the data warehouse architecture.

PART – B

1. What is a data warehouse? Give the steps for design and construction of data
warehouses and explain with a three-tier architecture diagram.

DATA WAREHOUSE:
A data warehouse is a repository of multiple heterogeneous data sources organized
under a unified schema at a single site to facilitate management decision making. (or)A data
warehouse is a subject-oriented, time-variant and non-volatile collection of data in support of
management’s decision-making process.

CONSTRUCTION OF DATA WAREHOUSE:


There are two reasons why organizations consider data warehousing a critical need. In
other words, there are two factors that drive you to build and use data warehouse. They are:
➢ Business factors:
• Business users want to make decisions quickly and correctly using all available data.
➢ Technological factors:
• To address the incompatibility of operational data stores.
• IT infrastructure is changing rapidly. Its capacity is increasing and its cost is decreasing, so building a data warehouse is easy.

There are several things to be considered while building a successful data warehouse
Business considerations:
Organizations interested in development of a data warehouse can choose one of the
following two approaches:
Top - Down Approach (Suggested by Bill Inmon)
Bottom - Up Approach (Suggested by Ralph Kimball)
Top - Down Approach
In the top down approach suggested by Bill Inmon, we build a centralized repository
to house corporate wide business data. This repository is called Enterprise Data Warehouse
(EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy.
The central repository for corporate wide data helps us maintain one version of truth of the
data. The data in the EDW is stored at the most detail level. The reason to build the EDW on
the most detail level is to leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.

The disadvantages of storing data at the detail level are


1. The complexity of design increases with increasing level of detail.
2. It takes large amount of space to store data at detail level, hence increased cost.

Once the EDW is implemented we start building subject-area-specific data marts which
contain data in a denormalized form, also called a star schema. The data in the marts are
usually summarized based on the end users' analytical requirements. The reason to
denormalize the data in the mart is to provide faster access to the data for the end users'
analytics. If we were to query a normalized schema for the same analytics, we
would end up with complex multi-level joins that would be much slower compared to
queries on the denormalized schema.
We should implement the top-down approach when
1. The business has complete clarity on all or multiple subject areas' data warehouse
requirements.
2. The business is ready to invest considerable time and money.

The advantage of using the Top Down approach is that we build a centralized repository to
cater for one version of truth for business data. This is very important for the data to be
reliable, consistent across subject areas and for reconciliation in case of data related
contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented followed by building
the data marts before which they can access their reports.

Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to
build a data warehouse. Here we build the data marts separately at different points of time as
and when the specific subject area requirements are clear. The data marts are integrated or
combined together to form a data warehouse. Separate data marts are combined through the
use of conformed dimensions and conformed facts. A conformed dimension and a conformed
fact is one that can be shared across data marts.
A Conformed dimension has consistent dimension keys, consistent attribute names
and consistent values across separate data marts. The conformed dimension means exact
same thing with every fact table it is joined. A Conformed fact has the same definition of
measures, same dimensions joined to it and at the same granularity across data marts.
The bottom up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when the requirements are clear. We don’t have to wait for
knowing the overall requirements of the warehouse. We should implement the bottom up
approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity to only one
data mart.

The advantage of using the Bottom Up approach is that it does not require high initial costs
and has a faster implementation time; hence the business can start using the marts much
earlier as compared to the top-down approach.

The disadvantages of using the Bottom Up approach are that it stores data in the denormalized
format, hence there would be high space usage for detailed data, and that we have a tendency
of not keeping detailed data in this approach, hence losing out on the advantage of having
detail data, i.e. flexibility to easily cater to future requirements. The bottom up approach is
more realistic but the complexity of the integration may become a serious obstacle.

DESIGN OF A DATA WAREHOUSE:


The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models

2. Diagrammatically illustrate and discuss the following pre-processing techniques:


(i) Data cleaning
Data cleaning means removing the inconsistent data or noise and collecting necessary
information of a collection of interrelated data.
(ii) Data Integration
Data integration involves combining data from several disparate sources, which are
stored using various technologies and provide a unified view of the data. Data integration
becomes increasingly important in cases of merging systems of two companies or
consolidating applications within one company to provide a unified view of the company's
data assets.

(iii) Data transformation


In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Data transformation can involve the following:
Smoothing, Aggregation, Generalization, Normalization, Attribute construction
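
As one concrete illustration of the normalization operation listed above, a minimal Python sketch of min-max normalization (the data values are hypothetical):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

incomes = [12000, 35000, 54000, 98000]
print(min_max_normalize(incomes))   # e.g. 12000 -> 0.0, 98000 -> 1.0
```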

(iv) Data reduction


In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or “compressed” representation of the original data. If the original data can
be reconstructed from the compressed data without any loss of information, the data
reduction is called lossless.

3.(i) Draw the data warehouse architecture and explain its components

Overall Architecture
• The data warehouse architecture is based on the database management system server.
• The central information repository is surrounded by a number of key components.
• The data warehouse is an environment, not a product; it is based on a relational
database management system that functions as the central repository for informational
data.
• The data entered into the data warehouse is transformed into an integrated structure and
format. The transformation process involves conversion, summarization, filtering and
condensation.
• The data warehouse must be capable of holding and managing large volumes of data
as well as different data structures over time.

Key components
✓ Data sourcing, cleanup, transformation, and migration tools
✓ Metadata repository
✓ Warehouse/database technology
✓ Data marts
✓ Data query, reporting, analysis, and mining tools
✓ Data warehouse administration and management
✓ Information delivery system

Data sourcing, cleanup, transformation, and migration tools:


➢ They perform conversions, summarization, key changes, structural changes and condensation.
➢ The data transformation is required so that the data can be used by decision support tools.
➢ The transformation produces programs and control statements.
➢ They move the data into the data warehouse from multiple operational systems.

The Functionalities of these tools are listed below:


▪ To remove unwanted data from operational db
▪ Converting to common data names and attributes
▪ Calculating summaries and derived data
▪ Establishing defaults for missing data
▪ Accommodating source data definition changes

Metadata repository:
➢ It is data about data. It is used for maintaining, managing and using the data
warehouse.
It is classified into two:

Technical Meta data:


It contains information about data warehouse data used by warehouse designer,
administrator to carry out development and management tasks.
It includes,
▪ Info about data stores.
▪ Transformation descriptions. That is mapping methods from operational db to
warehouse db.
▪ Warehouse Object and data structure definitions for target data
▪ The rules used to perform clean up, and data enhancement
▪ Data mapping operations
▪ Access authorization, backup history, archive history, info delivery history,
data acquisition history, data access etc.,
Business Meta data:
It contains information that describes, in business terms, the data stored in the data warehouse for end users.
It includes,
▪ Subject areas, and info object type including queries, reports, images, video, audio
clips etc.
▪ Internet home pages
▪ Info related to info delivery system
▪ Data warehouse operational info such as ownerships, audit trails etc. ,

Meta data helps the users to understand content and find the data. Meta data are stored in
a separate data store which is known as informational directory or Meta data repository
which helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business oriented key words
➢ It should act as a launch platform for end user to access data and analysis tools
➢ It should support the sharing of info

Warehouse/database technology
Data warehouse database
This is the central part of the data warehousing environment. This is implemented
based on RDBMS technology.

Data marts
A data mart is an inexpensive alternative to the data warehouse and is based on a single
subject area. It is used in the following situations:
➢ Extremely urgent user requirement
➢ The absence of a budget for a full scale data warehouse strategy
➢ The decentralization of business needs

Data query, reporting, analysis, and mining tools


Its purpose is to provide info to business users for decision making. There are five
categories:
➢ Data query and reporting tools
➢ Application development tools
➢ Executive info system tools (EIS)
➢ OLAP tools
➢ Data mining tools

1. Query and reporting tools:


Used to generate query and report. There are two types of reporting tools.
They are:
▪ Production reporting tools are used to generate regular operational reports
▪ Desktop report writers are inexpensive desktop tools designed for end users.
2. Managed Query tools:
Used to generate SQL queries. It uses Meta layer software in between users and
databases which offers a point-and-click creation of SQL statements.
3. Application development tools:
This is a graphical data access environment which integrates OLAP tools with
data warehouse and can be used to access all db systems.
4. OLAP Tools:
Are used to analyze the data in multidimensional and complex views.
5. Data mining tools:
Are used to discover knowledge from the data warehouse data.

Data warehouse administration and management:


The management of data warehouse includes,
➢ Security and priority management
➢ Monitoring updates from multiple sources
➢ Data quality checks
➢ Managing and updating meta data
➢ Auditing and reporting data warehouse usage and status
➢ Purging data
➢ Replicating, sub setting and distributing data
➢ Backup and recovery
➢ Data warehouse storage management which includes capacity planning, hierarchical
storage management and purging of aged data etc.,

Information delivery system:


➢ It is used to enable the process of subscribing for data warehouse info.
➢ Delivery to one or more destinations according to specified scheduling algorithm

(ii) Explain the different types of OLAP tools.


The different types of OLAP tools are:
➢ MOLAP
➢ ROLAP
➢ HOLAP

1.MOLAP:
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats; that is, data is stored in array-based structures.
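
A minimal Python sketch, not tied to any particular MOLAP product, of the array-based idea: cell values are addressed directly by their dimension coordinates, so slicing and rolling up are simple lookups and sums (dimensions and figures are hypothetical):

```python
# A tiny 3-dimensional cube: (product, region, month) -> pre-aggregated sales.
cube = {
    ("laptop", "north", "2024-01"): 120,
    ("laptop", "south", "2024-01"): 95,
    ("phone",  "north", "2024-01"): 210,
    ("phone",  "north", "2024-02"): 190,
}

# "Slice" on region = "north": fix one dimension, keep the others.
north_slice = {k: v for k, v in cube.items() if k[1] == "north"}

# Roll up over months for each product within the slice.
rollup = {}
for (product, _, _), sales in north_slice.items():
    rollup[product] = rollup.get(product, 0) + sales
print(rollup)   # {'laptop': 120, 'phone': 400}
```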

Advantages:
✓ Excellent performance: MOLAP cubes are built for fast data retrieval, and are
optimal for slicing and dicing operations.
✓ Can perform complex calculations: All calculations have been pre-generated when
the cube is created. Hence, complex calculations are not only doable, but they return
quickly.

Disadvantages:
✓ Limited in the amount of data it can handle: Because all calculations are performed
when the cube is built, it is not possible to include a large amount of data in the cube
itself. This is not to say that the data in the cube cannot be derived from a large
amount of data. Indeed, this is possible. But in this case, only summary-level
information will be included in the cube itself.
✓ Requires additional investment: Cube technology is often proprietary and may not
already exist in the organization. Therefore, to adopt MOLAP technology, chances
are additional investments in human and capital resources are needed.

Examples:
Hyperion Essbase, Fusion (Information Builders)

ROLAP:
This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each
action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
Data is stored in relational tables.
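
A minimal sketch of this idea in Python: each slice or dice selection is translated into a predicate in the WHERE clause of a generated SQL statement. The table and column names are hypothetical, and the snippet only builds the SQL string:

```python
def rolap_query(measures, table, filters):
    """Translate slice/dice selections into a SQL statement with WHERE conditions."""
    select_list = ", ".join(measures)
    where = " AND ".join(f"{col} = '{val}'" for col, val in filters.items())
    return f"SELECT {select_list} FROM {table} WHERE {where};"

# Slicing on year and dicing on region both become predicates in the WHERE clause.
print(rolap_query(["SUM(sales_amt)"], "sales_fact",
                  {"year": "2024", "region": "north"}))
# SELECT SUM(sales_amt) FROM sales_fact WHERE year = '2024' AND region = 'north';
```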

Advantages:
✓ Can handle large amounts of data: The data size limitation of ROLAP technology is
the limitation on data size of the underlying relational database. In other words,
ROLAP itself places no limitation on data amount.
✓ Can leverage functionalities inherent in the relational database: Often, relational
database already comes with a host of functionalities. ROLAP technologies, since
they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:
✓ Performance can be slow: Because each ROLAP report is essentially a SQL query (or
multiple SQL queries) in the relational database, the query time can be long if the
underlying data size is large.
✓ Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL statements do
not fit all needs (for example, it is difficult to perform complex calculations using
SQL), ROLAP technologies are therefore traditionally limited by what SQL can do.
ROLAP vendors have mitigated this risk by building into the tool out-of the- box
complex functions as well as the ability to allow users to define their own functions.

Examples:
Micro-strategy Intelligence Server, Meta Cube (Informix/IBM)

HOLAP (MQE: Managed Query Environment)


HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP.
For summary-type information, HOLAP leverages cube technology for faster performance. It
stores only the indexes and aggregations in the multidimensional form while the rest of the
data is stored in the relational database.
Examples:
Power Play (Cognos), Brio, Microsoft Analysis Services, Oracle Advanced Analytic
Services

4. (i) Describe in detail about Mapping the Data warehouse to a multiprocessor


architecture

The functions of a data warehouse are based on relational database technology, which is
implemented in a parallel manner. There are two advantages of having parallel relational
database technology for a data warehouse:

Linear Speed up:
Refers to the ability to increase the number of processors in order to reduce response time.
Linear Scale up:
Refers to the ability to provide the same performance on the same requests as the database
size increases.

Types of parallelism
There are two types of parallelism:
➢ Inter query Parallelism:
In which different server threads or processes handle multiple requests
at the same time.
➢ Intra query Parallelism:
This form of parallelism decomposes the serial SQL query into lower-level
operations such as scan, join, sort etc. Then these lower-level operations are executed
concurrently in parallel.
Intra query parallelism can be done in either of two ways:
• Horizontal parallelism:
which means that the data base is partitioned across multiple disks and
parallel processing occurs within a specific task that is performed concurrently
on different processors against different set of data.
• Vertical parallelism:
This occurs among different tasks. All query components such as scan,
join, sort etc are executed in parallel in a pipelined fashion. In other words, an
output from one task becomes an input into another task.
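
A minimal Python sketch of the pipelined idea using generators: the scan, filter and sort stages are chained so that the output of one becomes the input of the next. In a real DBMS these stages would run concurrently on different processors, which plain generators do not do; the data is hypothetical:

```python
def scan(rows):
    """Stage 1: scan the base table, yielding one row at a time."""
    for row in rows:
        yield row

def filter_stage(rows, predicate):
    """Stage 2: pass through only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def sort_stage(rows, key):
    """Stage 3: a blocking operator that consumes its whole input before emitting."""
    return sorted(rows, key=key)

table = [{"id": 3, "amt": 90}, {"id": 1, "amt": 40}, {"id": 2, "amt": 70}]
pipeline = sort_stage(filter_stage(scan(table), lambda r: r["amt"] > 50),
                      key=lambda r: r["id"])
print(pipeline)   # rows with amt > 50, ordered by id
```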

Data partitioning:
Data partitioning is the key component for effective parallel execution of database
operations. Partitioning can be done randomly or intelligently.

➢ Random partitioning
It includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which
each record is placed on the next disk assigned to the database.

➢ Intelligent partitioning
It assumes that DBMS knows where a specific record is located and does
not waste time searching for it across all disks. The various intelligent partitioning
include:
• Hash partitioning:
A hash algorithm is used to calculate the partition number based on
the value of the partitioning key for each row (see the sketch after this list).
• Key range partitioning:
Rows are placed and located in the partitions according to the value
of the partitioning key. That is all the rows with the key value from A to K
are in partition 1, L to T are in partition 2 and so on.
• Schema partitioning:
An entire table is placed on one disk; another table is placed on a
different disk, etc. This is useful for small reference tables.
• User-defined partitioning:
It allows a table to be partitioned on the basis of a user-defined
expression.
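
A minimal Python sketch of hash partitioning and key-range partitioning, as referenced in the list above; the partition counts and key ranges are illustrative:

```python
import zlib

def hash_partition(key, num_partitions):
    """Compute a partition number from the partitioning key (hash partitioning)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def key_range_partition(key):
    """Place rows by key value: A-K -> partition 1, L-T -> partition 2, else partition 3."""
    first = key[0].upper()
    if "A" <= first <= "K":
        return 1
    if "L" <= first <= "T":
        return 2
    return 3

for customer in ["Anand", "Meena", "Zara"]:
    print(customer, hash_partition(customer, 4), key_range_partition(customer))
```
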
(ii) Describe in detail on data warehouse Metadata
METADATA:
➢ It is data about data. It is used for maintaining, managing and using the data
warehouse.
➢ It is classified into two:
✓ Technical Meta data
✓ Business Meta data

Technical Meta data:


It contains information about data warehouse data used by warehouse designer,
administrator to carry out development and management tasks.
It includes,
▪ Info about data stores.
▪ Transformation descriptions. That is mapping methods from operational db to
warehouse db.
▪ Warehouse Object and data structure definitions for target data
▪ The rules used to perform clean up, and data enhancement
▪ Data mapping operations
▪ Access authorization, backup history, archive history, info delivery history,
data acquisition history, data access etc.,
Business Meta data:
It contains information that describes, in business terms, the data stored in the data warehouse for end users.
It includes,
▪ Subject areas, and info object type including queries, reports, images, video, audio
clips etc.
▪ Internet home pages
▪ Info related to info delivery system
▪ Data warehouse operational info such as ownerships, audit trails etc. ,

Meta data helps the users to understand content and find the data. Meta data are stored in
a separate data store which is known as informational directory or Meta data repository
which helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business oriented key words
➢ It should act as a launch platform for end user to access data and analysis tools
➢ It should support the sharing of info

5.(i) Explain the steps in building a data warehouse.


There are two reasons why organizations consider data warehousing a critical need. In
other words, there are two factors that drive you to build and use data warehouse. They are:
➢ Business factors:
• Business users want to make decision quickly and correctly using all
available data.
➢ Technological factors:
• To address the incompatibility of operational data stores
• IT infrastructure is changing rapidly. Its capacity is increasing and cost
is decreasing so that building a data warehouse is easy.

There are several things to be considered while building a successful data warehouse
Business considerations:
Organizations interested in development of a data warehouse can choose one of the
following two approaches:
➢ Top - Down Approach (Suggested by Bill Inmon)
➢ Bottom - Up Approach (Suggested by Ralph Kimball)

Top - Down Approach


In the top down approach suggested by Bill Inmon, we build a centralized repository
to house corporate wide business data. This repository is called Enterprise Data Warehouse
(EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy.
The central repository for corporate wide data helps us maintain one version of truth of the
data. The data in the EDW is stored at the most detail level.

The reason to build the EDW on the most detail level is to leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.

The disadvantages of storing data at the detail level are


1. The complexity of design increases with increasing level of detail.
2. It takes large amount of space to store data at detail level, hence increased cost.
Once the EDW is implemented we start building subject-area-specific data marts
which contain data in a denormalized form, also called a star schema. The data in the marts are
usually summarized based on the end users' analytical requirements. The reason to
denormalize the data in the mart is to provide faster access to the data for the end users'
analytics. If we were to query a normalized schema for the same analytics, we would
end up with complex multi-level joins that would be much slower compared to queries
on the denormalized schema.

We should implement the top-down approach when


1. The business has complete clarity on all or multiple subject areas data warehouse
requirements.
2. The business is ready to invest considerable time and money.

The advantage of using the Top Down approach is that we build a centralized
repository to cater for one version of truth for business data. This is very important for the
data to be reliable, consistent across subject areas and for reconciliation in case of data related
contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time and
initial investment. The business has to wait for the EDW to be implemented followed by
building the data marts before which they can access their reports.

Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to
build a data warehouse. Here we build the data marts separately at different points of time as
and when the specific subject area requirements are clear. The data marts are integrated or
combined together to form a data warehouse. Separate data marts are combined through the
use of conformed dimensions and conformed facts. A conformed dimension and a conformed
fact is one that can be shared across data marts.
A Conformed dimension has consistent dimension keys, consistent attribute names
and consistent values across separate data marts. The conformed dimension means exact
same thing with every fact table it is joined.
A Conformed fact has the same definition of measures, same dimensions joined to it
and at the same granularity across data marts.
The bottom up approach helps us incrementally build the warehouse by developing
and integrating data marts as and when the requirements are clear. We don’t have to wait for
knowing the overall requirements of the warehouse.

We should implement the bottom up approach when


1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity to only one
data mart.

The advantage of using the Bottom Up approach is that it does not require high initial
costs and has a faster implementation time; hence the business can start using the marts
much earlier as compared to the top-down approach.
The disadvantages of using the Bottom Up approach are that it stores data in the de
normalized format; hence there would be high space usage for detailed data. We have a
tendency of not keeping detailed data in this approach hence losing out on advantage of
having detail data i.e. flexibility to easily cater to future requirements. Bottom up approach is
more realistic but the complexity of the integration may become a serious obstacle.

(ii) Analyze the information needed to support DBMS schemas for Decision support.

The basic concepts of dimensional modelling are:


➢ Facts
➢ dimensions and
➢ measures.
A fact is a collection of related data items, consisting of measures and context data. It
typically represents business items or business transactions. A dimension is a collection of
data that describe one business dimension. Dimensions determine the contextual background
for the facts; they are the parameters over which we want to perform OLAP. A measure is a
numeric attribute of a fact, representing the performance or behavior of the business relative
to the dimensions. Considering Relational context, there are three basic schemas that are used
in dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema:
The multidimensional view of data that is expressed using relational database
semantics is provided by the database schema design called the star schema. The basic premise
of the star schema is that information can be classified into two groups:
➢ Facts
➢ Dimension
Star schema has one large central table (fact table) and a set of smaller tables
(dimensions) arranged in a radial pattern around the central table.
1. Fact Tables:
A fact table is a table that contains summarized numerical and historical data (facts)
and a multipart index composed of foreign keys from the primary keys of related dimension
tables. A fact table typically has two types of columns: foreign keys to dimension tables and
measures (the columns that contain numeric facts). A fact table can contain facts at the detail
or aggregated level.
2. Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter,
year), Region dimension (profit by country, state, city), Product dimension (profit for
product1, product2). A dimension is a structure usually composed of one or more
hierarchies that categorizes data. If a dimension has no hierarchies and levels, it is
called a flat dimension or list. The primary keys of each of the dimension tables are part of
the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional value. They are normally
descriptive, textual values. Dimension tables are generally smaller in size than fact tables.
Typical fact tables store data about sales, while dimension tables store data about geographic
regions (markets, cities), clients, products, times and channels.
3. Measures:
Measures are numeric data based on columns in a fact table. They are the primary
data which end users are interested in. E.g. a sales fact table may contain a profit measure
which represents profit on each sale.

The main characteristics of the star schema:

➢ Simple structure -> easy-to-understand schema
➢ Great query effectiveness -> small number of tables to join (a query sketch follows below)
➢ Relatively long time for loading data into dimension tables -> denormalization and
data redundancy mean the tables can become large.
➢ The most commonly used schema in data warehouse implementations -> widely supported
by a large number of business intelligence tools
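
A minimal sketch of the kind of query a star schema is optimized for: the fact table joined to a small number of dimension tables and aggregated. Only the SQL text is shown, and the table and column names are hypothetical:

```python
# The fact table is joined to each dimension through its foreign key, then grouped
# by dimensional attributes -- a typical star join.
star_join_sql = """
SELECT d.year, p.category, SUM(f.sales_amt) AS total_sales
FROM   sales_fact f
JOIN   date_dim    d ON f.date_key    = d.date_key
JOIN   product_dim p ON f.product_key = p.product_key
GROUP BY d.year, p.category;
"""
print(star_join_sql)
```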

Snowflake schema:
It is the result of decomposing one or more of the dimensions. The many-to-one
relationships among sets of attributes of a dimension can be separated into new dimension
tables, forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical
structure of dimensions very well.

Fact constellation schema:


For each star schema it is possible to construct a fact constellation schema (for example
by splitting the original star schema into more star schemas, each of which describes facts at
another level of the dimension hierarchies). The fact constellation architecture contains multiple
fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design because
many variants for particular kinds of aggregation must be considered and selected. Moreover,
dimension tables are still large.

6.(i) Discuss in detail about access tools types?


ACCESS TOOLS:
Data warehouse implementation relies on selecting suitable data access tools. The best
way to choose a tool is based on the type of data that can be selected using the tool and the
kind of access it permits for a particular user.
The following lists the various types of data that can be accessed:
➢ Simple tabular form data
➢ Ranking data
➢ Multivariable data
➢ Time series data
➢ Graphing, charting and pivoting data
➢ Complex textual search data
➢ Statistical analysis data
➢ Data for testing of hypothesis, trends and patterns
➢ Predefined repeatable queries
➢ Ad hoc user specified queries
➢ Reporting and analysis data
➢ Complex queries with multiple joins, multi-level sub queries and sophisticated search
criteria

There are five categories:


➢ Data query and reporting tools
➢ Application development tools
➢ Executive info system tools (EIS)
➢ OLAP tools
➢ Data mining tools

Data query and reporting tools:


Query and reporting tools are used to generate query and report. There are two types
of reporting tools. They are:
• Production reporting tools are used to generate regular operational reports
• Desktop report writers are inexpensive desktop tools designed for end users.
Managed Query tools:
Used to generate SQL queries. It uses Meta layer software in between users
and databases which offers a point-and-click creation of SQL statements. This tool is a
preferred choice of users to perform segment identification, demographic analysis,
territory management and preparation of customer mailing lists etc.
Application development tools:
This is a graphical data access environment which integrates OLAP tools with data
warehouse and can be used to access all db systems

OLAP Tools:
Are used to analyze the data in multidimensional and complex views. To enable
multidimensional properties, they use MDDBs and MRDBs, where MDDB refers to a
multidimensional database and MRDB refers to a multirelational database.

Data mining tools:


Are used to discover knowledge from the data warehouse data; they can also be used for
data visualization and data correction purposes.

(ii) Describe the overall architecture of data warehouse?


Overall Architecture
• The data warehouse architecture is based on the database management system server.
• The central information repository is surrounded by a number of key components.
• The data warehouse is an environment, not a product; it is based on a relational
database management system that functions as the central repository for informational
data.
• The data entered into the data warehouse is transformed into an integrated structure and
format. The transformation process involves conversion, summarization, filtering and
condensation.
• The data warehouse must be capable of holding and managing large volumes of data
as well as different data structures over time.

Key components
✓ Data sourcing, cleanup, transformation, and migration tools
✓ Metadata repository
✓ Warehouse/database technology
✓ Data marts
✓ Data query, reporting, analysis, and mining tools
✓ Data warehouse administration and management
✓ Information delivery system

Data sourcing, cleanup, transformation, and migration tools:


➢ They perform conversions, summarization, key changes, structural changes and condensation.
➢ The data transformation is required so that the data can be used by decision support tools.
➢ The transformation produces programs and control statements.
➢ They move the data into the data warehouse from multiple operational systems.

The Functionalities of these tools are listed below:


▪ To remove unwanted data from operational db
▪ Converting to common data names and attributes
▪ Calculating summaries and derived data
▪ Establishing defaults for missing data
▪ Accommodating source data definition changes

Metadata repository:
➢ It is data about data. It is used for maintaining, managing and using the data
warehouse.
It is classified into two:

Technical Meta data:


It contains information about data warehouse data used by warehouse designer,
administrator to carry out development and management tasks.
It includes,
▪ Info about data stores.
▪ Transformation descriptions. That is mapping methods from operational db to
warehouse db.
▪ Warehouse Object and data structure definitions for target data
▪ The rules used to perform clean up, and data enhancement
▪ Data mapping operations
▪ Access authorization, backup history, archive history, info delivery history,
data acquisition history, data access etc.,
Business Meta data:
It contains information that describes, in business terms, the data stored in the data warehouse for end users.
It includes,
▪ Subject areas, and info object type including queries, reports, images, video, audio
clips etc.
▪ Internet home pages
▪ Info related to info delivery system
▪ Data warehouse operational info such as ownerships, audit trails etc. ,

Meta data helps the users to understand content and find the data. Meta data are stored in
a separate data store which is known as informational directory or Meta data repository
which helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business oriented key words
➢ It should act as a launch platform for end user to access data and analysis tools
➢ It should support the sharing of info

Warehouse/database technology
Data warehouse database
This is the central part of the data warehousing environment. This is implemented
based on RDBMS technology.

Data marts
A data mart is an inexpensive alternative to the data warehouse and is based on a single
subject area. It is used in the following situations:
➢ Extremely urgent user requirement
➢ The absence of a budget for a full scale data warehouse strategy
➢ The decentralization of business needs

Data query, reporting, analysis, and mining tools


Its purpose is to provide info to business users for decision making. There are five
categories:
➢ Data query and reporting tools
➢ Application development tools
➢ Executive info system tools (EIS)
➢ OLAP tools
➢ Data mining tools

1. Query and reporting tools:


Used to generate query and report. There are two types of reporting tools.
They are:
▪ Production reporting tools are used to generate regular operational reports
▪ Desktop report writers are inexpensive desktop tools designed for end users.
2. Managed Query tools:
Used to generate SQL queries. It uses Meta layer software in between users and
databases which offers a point-and-click creation of SQL statements.
3. Application development tools:
This is a graphical data access environment which integrates OLAP tools with
data warehouse and can be used to access all db systems.
4. OLAP Tools:
Are used to analyze the data in multidimensional and complex views.
5. Data mining tools:
Are used to discover knowledge from the data warehouse data.

Data warehouse administration and management:


The management of data warehouse includes,
➢ Security and priority management
➢ Monitoring updates from multiple sources
➢ Data quality checks
➢ Managing and updating meta data
➢ Auditing and reporting data warehouse usage and status
➢ Purging data
➢ Replicating, sub setting and distributing data
➢ Backup and recovery
➢ Data warehouse storage management which includes capacity planning, hierarchical
storage management and purging of aged data etc.,

Information delivery system:


➢ It is used to enable the process of subscribing for data warehouse info.
➢ Delivery to one or more destinations according to specified scheduling algorithm

7.(i) Discuss the different types of data repositories on which mining can be performed?
A data repository, often called a data archive or library, is a generic term that
refers to a segmented data set used for reporting or analysis. It is a large database
infrastructure that gathers, manages, and stores varying data sets for analysis, distribution,
and reporting.
Some common types of data repositories include:
➢ Data Warehouse
➢ Data Lake
➢ Data Mart
➢ Metadata Repository
➢ Data Cube

Data Warehouse

A data warehouse is a large data repository that brings together data from several
sources or business segments. The stored data is generally used for reporting and analysis to
help users make critical business decisions. In a broader perspective, a data warehouse offers
a consolidated view of either a physical or logical data repository gathered from numerous
systems. The main objective of a data warehouse is to establish a connection between data
from current systems. For example, product catalogue data stored in one system and
procurement orders for a client stored in another one.

Data Lake

A data lake is a unified data repository that allows you to store structured, semi-
structured, and unstructured enterprise data at any scale. Data can be in raw form and used for
different tasks like reporting, visualizations, advanced analytics, and machine learning.

Data Mart:

A data mart is a subject-oriented data repository that’s often a segregated section of a


data warehouse. It holds a subset of data usually aligned with a specific business department,
such as marketing, finance, or support. Due to its smaller size, a data mart can fast-track
business procedures as one can easily access relevant data within days instead of months. As
it only includes the data relevant to a specific area, a data mart is an economical way to
acquire actionable insights swiftly.

Metadata Repositories:

Metadata incorporates information about the structures that include the actual data.
Metadata repositories contain information about the data model that store and share this data.
They describe where the source of data is, how it was collected, and what it signifies. It may
define the arrangement of any data or subject deposited in any format. For businesses,
metadata repositories are essential in helping people understand administrative changes, as
they contain detailed information about the data.

Data Cubes:

Data cubes are multidimensional lists of data (usually 3 or more dimensions)
stored as a table. They are used to describe the time sequence of the data and help
assess gathered data from a range of standpoints. Each dimension of a data cube signifies
specific characteristics of the database, such as day-to-day, monthly or annual sales. The data
contained within a data cube allows you to analyze all the information for almost any or all
clients, sales representatives, products, and more. Consequently, a data cube can help you
identify trends and scrutinize business performance.

(ii) Differentiate tangible and intangible benefits of data warehouse

TANGIBLE BENEFITS:
➢ Improvement in product inventory.
➢ Decrement in production cost.
➢ Improvement in selection of target markets.
➢ Enhancement in asset and liability management.
➢ Tangible benefits can be measured in financial terms.
➢ The tangible benefits of a process are unlikely to fluctuate.
➢ Tangible benefits can often be estimated before certain actions are taken.

INTANGIBLE BENEFITS:
➢ Improvement in productivity by keeping all data in a single location and eliminating rekeying of data.
➢ Reduced redundant processing and enhanced customer relations.
➢ Increased customer satisfaction.
➢ Greater compliance.
➢ Intangible benefits cannot be quantified directly in economic terms, but still have a very significant business impact.
➢ Intangible benefits can increase or decrease over time.
➢ Intangible benefits are virtually impossible to estimate beforehand.

8.(i) Describe in detail about data extraction

Data extraction is the process of collecting or retrieving disparate types of data from a
variety of sources, many of which may be poorly organized or completely unstructured. Data
extraction makes it possible to consolidate, process, and refine data so that it can be stored in
a centralized location in order to be transformed. These locations may be on-site, cloud-
based, or a hybrid of the two. Data extraction is the first step in both ETL (extract, transform,
load) and ELT (extract, load, transform) processes. ETL/ELT are themselves part of a
complete data integration strategy. Proper attention must be paid to data extraction, which
represents a success factor for a data warehouse architecture. When implementing a data
warehouse, the following selection criteria, which affect the ability to transform,
consolidate, integrate and repair the data, should be considered:
➢ Timeliness of data delivery to the warehouse
➢ The tool must have the ability to identify the particular data and that can be
read by conversion tool
➢ The tool must support flat files and indexed files, since much corporate data is still
stored in these formats
➢ The tool must have the capability to merge data from multiple data stores
➢ The tool should have specification interface to indicate the data to be extracted
➢ The tool should have the ability to read data from data dictionary
➢ The code generated by the tool should be completely maintainable
➢ The tool should permit the user to extract the required data
➢ The tool must have the facility to perform data type and character set
translation
➢ The tool must have the capability to create summarization, aggregation and
derivation of records
➢ The data warehouse database system must be able to perform loading data
directly from these tools

(ii) Describe in detail about transformation tools


They perform conversions, summarization, key changes, structural changes and
condensation. The data transformation is required so that the information can be used by
decision support tools. The transformation produces programs, control statements, JCL code,
COBOL code, UNIX scripts, SQL DDL code etc., to move the data into the data warehouse
from multiple operational systems (a minimal sketch of a few such transformations follows
the list below).
The functionalities of these tools are listed below:
➢ To remove unwanted data from operational db
➢ Converting to common data names and attributes
➢ Calculating summaries and derived data
➢ Establishing defaults for missing data
➢ Accommodating source data definition changes
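
A minimal Python sketch of a few of these functionalities (converting to common data names, establishing defaults for missing data, and calculating a derived summary); the field names and values are hypothetical:

```python
# Source records from an operational system, with inconsistent names and missing values.
source_rows = [
    {"cust_nm": "Asha", "REGION": "north", "amount": 250.0},
    {"cust_nm": "Ravi", "REGION": None,    "amount": 125.5},
]

COLUMN_MAP = {"cust_nm": "customer_name", "REGION": "region", "amount": "sales_amt"}
DEFAULTS   = {"region": "UNKNOWN"}

def transform(row):
    # Convert to common data names, then establish defaults for missing data.
    out = {COLUMN_MAP[k]: v for k, v in row.items()}
    for col, default in DEFAULTS.items():
        if out.get(col) is None:
            out[col] = default
    return out

clean = [transform(r) for r in source_rows]
total_sales = sum(r["sales_amt"] for r in clean)   # a derived summary value
print(clean, total_sales)
```
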
Issues to be considered while data sourcing, clean up, extract and transformation:
✓ Database heterogeneity:
It refers to the different nature of the DBMSs: they may use different data models,
different access languages, different data navigation methods, operations,
concurrency, integrity and recovery processes, etc.
✓ Data heterogeneity:
It refers to the different ways the data is defined and used in different models.
Some vendors involved in the development of such tools:
Prism Solutions, Evolutionary Technology Inc., Vality, Praxis and Carleton.

9.(i) Suppose that a data warehouse consists of four dimensions customer,
product, salesperson and sales time, and the three measure sales Amt (in
rupees), VAT (in rupees) and payment type (in rupees). Draw the different
classes of schemas that are popularly used for modelling data warehouses
and explain it
(ii) How would you explain Metadata implementation with examples?
10.(i) Describe in detail about
(i)Bitmapped indexing
The new approach to increasing the performance of a relational DBMS is to use
innovative indexing techniques to provide direct access to data. SYBASE IQ uses a
bit-mapped index structure; the data is stored in the SYBASE DBMS.
SYBASE IQ is based on indexing technology; it is a standalone database.
Overview: It is a separate SQL database. Data is loaded into SYBASE IQ very much as
into any relational DBMS; once loaded, SYBASE IQ converts all data into a series of bitmaps
which are then highly compressed and stored on disk.
Data cardinality:
➢ Bitmap indexes are used to optimize queries against low-cardinality data, that is,
data in which the total number of potential values is relatively low. Example: state
code has a cardinality of 50 potential values, and gender has a cardinality of only 2
(male or female).
➢ For low-cardinality data, each distinct value has its own bitmap index consisting of a
bit for every row in the table; if the bit for a given row is "on", the value exists in that
record. For a table of 10,000 rows, the bitmap index representation of "gender" = "M"
is a 10,000-bit-long vector which has its bits turned on (value of 1) for every record
that satisfies the condition (see the sketch below).
➢ Bitmap indexes are unsuitable for high-cardinality data.
➢ Another solution is to use the traditional B-tree index structure. B-tree indexes can often
grow to large sizes as the data volumes and the number of indexes grow.
➢ B-tree indexes can significantly improve the performance.
➢ SYBASE IQ uses a technique called Bit-Wise (a Sybase trademark) technology to build
bitmap indexes for high-cardinality data, which is limited to about 250 distinct values.
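
A minimal Python sketch of the bitmap idea described above, using plain integers as bit vectors over a hypothetical gender column (this is an illustration, not SYBASE IQ's actual implementation):

```python
# One bit vector per distinct value; bit i is 1 when row i holds that value.
genders = ["M", "F", "M", "M", "F"]

bitmaps = {}
for row_id, value in enumerate(genders):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)

def rows_matching(bitmap, num_rows):
    """Decode a bitmap back into the list of matching row ids."""
    return [i for i in range(num_rows) if bitmap & (1 << i)]

# Answering gender = 'M' is a lookup; combining predicates is a bitwise AND/OR.
print(rows_matching(bitmaps["M"], len(genders)))   # [0, 2, 3]
print(bin(bitmaps["M"]), bin(bitmaps["F"]))        # the raw bit vectors
```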

Index types:
The first release of SYBASE IQ provides five indexing techniques. Most users apply two
indexes to every column: the default index, called the projection index, and either a low- or
high-cardinality index. For low-cardinality data SYBASE IQ provides:
➢ Low fast index: it is optimized for queries involving scalar functions like SUM,
AVERAGE and COUNT.
➢ Low disk index: it is optimized for disk space utilization at the cost of being more
CPU-intensive.
Performance:
SYBASE IQ technology achieves very good performance on ad hoc queries for
several reasons:
➢ Bitwise technology: this allows various data types in queries and supports fast
data aggregation and grouping.
➢ Compression: SYBASE IQ uses sophisticated algorithms to compress data into
bitmaps.
➢ Optimized memory-based processing: SYBASE IQ caches data columns in memory
according to the nature of users' queries; this speeds up processing.
➢ Column-wise processing: SYBASE IQ scans columns, not rows, which reduces the
amount of data the engine has to search.
➢ Low overhead: as an engine optimized for decision support, SYBASE IQ does not
carry the overhead associated with an OLTP-designed RDBMS.
➢ Large block I/O: the block size of SYBASE IQ can be tuned from 512 bytes to 64
Kbytes, so the system can read as much information as necessary in a single I/O.
➢ Operating system-level parallelism: SYBASE IQ breaks low-level operations like
sorts, bitmap manipulation, loads and I/O into non-blocking operations.
➢ Projection and ad hoc join capabilities: SYBASE IQ allows users to take advantage of
known join relationships between tables by defining them in advance and building
indexes between tables.
Shortcomings of indexing:
The issues the user should be aware of when choosing to use SYBASE IQ include:
➢ No updates: SYBASE IQ does not support updates; users have to update the source
database and then load the updated data into SYBASE IQ on a periodic basis.
➢ Lack of core RDBMS features: it does not support all the robust features of SYBASE
SQL Server, such as backup and recovery.
➢ Less advantage for planned queries: SYBASE IQ loses much of its advantage when
queries are preplanned.
➢ High memory usage: memory is traded for the expensive I/O operations.
Column local storage:
➢ It is another approach to improve query performance in the data warehousing
environment.
➢ For example, Thinking Machines Corporation has developed an innovative data layout
solution that improves RDBMS query performance many times. Implemented in its
CM_SQL RDBMS product, this approach is based on storing data column-wise, as
opposed to the traditional row-wise approach.
➢ The row-wise approach works well for an OLTP environment in which a typical
transaction accesses a record at a time. However, in data warehousing the goal is to
retrieve multiple values of several columns.
➢ For example, if the problem is to calculate the average, minimum and maximum salary,
the column-wise storage of the salary field requires the DBMS to read only the salary
column, not every full record.
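A minimal Python sketch of the difference (the employee data is invented): the same aggregate over a row-wise layout touches every whole record, while the column-wise layout touches only the salary column.

    # Row-wise vs column-wise storage for an aggregate query (invented data).
    rows = [  # row-wise layout: one record per employee (OLTP-friendly)
        {"emp_id": 1, "name": "A", "dept": "Sales", "salary": 40000},
        {"emp_id": 2, "name": "B", "dept": "HR",    "salary": 52000},
        {"emp_id": 3, "name": "C", "dept": "Sales", "salary": 61000},
    ]

    columns = {  # column-wise layout: one array per attribute (warehouse-friendly)
        "emp_id": [1, 2, 3],
        "name":   ["A", "B", "C"],
        "dept":   ["Sales", "HR", "Sales"],
        "salary": [40000, 52000, 61000],
    }

    # Row-wise: every whole record is touched just to reach the salary field.
    avg_row_wise = sum(r["salary"] for r in rows) / len(rows)

    # Column-wise: only the salary column is read; other attributes are never touched.
    salary = columns["salary"]
    avg_col_wise = sum(salary) / len(salary)

    print(avg_row_wise, avg_col_wise, min(salary), max(salary))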
Complex data types:
➢ The DBMS architecture for data warehousing has traditionally been limited to
alphanumeric data types, but data management increasingly needs to support complex
data types, including text, image, full-motion video and sound.
➢ These large data objects are called binary large objects (BLOBs), and what is required by
business is much more than just storage:
➢ The ability to retrieve a complex data type like an image by its content, the ability
to compare the content of one image to another in order to make rapid business
decisions, and the ability to express all of this in a single SQL statement.
➢ The modern data warehouse DBMS has to be able to efficiently store, access and
manipulate complex data. The DBMS has to be able to define not only new data
structures but also new functions that manipulate them, and often new access methods,
to provide fast access to the data.
➢ An example of the advantage of handling complex data types is an insurance company that
wants to predict its financial exposure during a catastrophe such as a flood; doing so
requires support for complex data alongside the traditional data.

(ii) STARjoin and index.


A STAR join is a high-speed, single-pass, parallelizable multi-table join, and Red Brick’s
RDBMS can join more than two tables in a single operation. A star schema has one
“central” table whose primary key is compound, i.e., consisting of multiple attributes. Each
one of these attributes is a foreign key to one of the remaining tables. Such a foreign key
dependency exists for each one of these tables, while there are no other foreign keys
anywhere in the schema. Most data warehouses that represent the multidimensional
conceptual data model in a relational fashion [1,2] store their primary data as well as the
data cubes derived from it in star schemas. The “central” table and the remaining tables of
the definition above correspond, respectively, to the fact table and the dimension tables that
are typically found in data warehouses.
Red Brick’s RDBMS supports the creation of specialized indexes called STAR
indexes. A STAR index is created on one or more foreign key columns of a fact table. It is a
collection of join indices, one for every foreign key join in a star or snowflake schema. A
common structure for a data warehouse is a fact table consisting of several dimension fields
and several measure fields. To reduce storage costs, the fact table is often normalized into
a star or a snowflake schema. Since most queries reference both the (normalized) fact tables
and the dimension tables, creating a star index can be an effective way to accelerate data
warehouse queries.
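The following small Python/pandas sketch (not Red Brick's engine; the tables are invented) shows the access pattern a STAR index is built to accelerate: a fact table joined to its dimension tables through its foreign keys and then aggregated.

    # A star join expressed with pandas: fact table + two dimension tables.
    import pandas as pd

    sales = pd.DataFrame({            # fact table: foreign keys + measures
        "time_key": [1, 1, 2],
        "item_key": [10, 11, 10],
        "units_sold": [5, 2, 7],
        "dollars_sold": [50.0, 40.0, 70.0],
    })
    time_dim = pd.DataFrame({"time_key": [1, 2], "month": ["Jan", "Feb"]})
    item_dim = pd.DataFrame({"item_key": [10, 11], "item_name": ["Pen", "Book"]})

    # Join the fact table to each dimension on its foreign key, then aggregate.
    star = sales.merge(time_dim, on="time_key").merge(item_dim, on="item_key")
    report = star.groupby(["month", "item_name"])["dollars_sold"].sum()
    print(report)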

11.(i) What is data Pre-processing? Explain the various data pre-processing techniques.

Data pre-processing describes any type of processing performed on raw data to
prepare it for another processing procedure. Commonly used as a preliminary data mining
practice, data pre-processing transforms the data into a format that will be more easily and
effectively processed for the purpose of the user. Data in the real world is dirty: it can be
incomplete, noisy and inconsistent. Such data needs to be pre-processed in order to help
improve the quality of the data, and the quality of the mining results.
➢ If there is no quality data, then there are no quality mining results; quality decisions are
always based on quality data.
➢ If there is much irrelevant and redundant information present, or noisy and unreliable
data, then knowledge discovery during the training phase is more difficult.
Techniques:
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction

1.Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some values are missing in the data. It can be handled
in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the
missing values manually, by attribute mean or the most probable
value.

(b). Noisy Data:


Noisy data is a meaningless data that can’t be interpreted by machines. It can
be generated due to faulty data collection, data entry errors etc. It can be handled in
following ways :
1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size and then various methods
are performed to complete the task. Each segment is handled
separately. One can replace all data in a segment by its mean, or
boundary values can be used to complete the task (see the sketch
after this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may then
be detected as values that fall outside the clusters.
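A small Python sketch of the binning method with invented values, using equal-size bins smoothed by bin means and by bin boundaries:

    # Smoothing noisy data by binning (equal-size bins), with made-up values.
    data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
    bin_size = 3
    bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

    # Smoothing by bin means: every value in a bin becomes the bin's mean.
    by_means = [[sum(b) / len(b)] * len(b) for b in bins]

    # Smoothing by bin boundaries: every value moves to the closer boundary.
    by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                     for b in bins]

    print(bins)           # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
    print(by_means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
    print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]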

2.Data Integration:
Data integration involves combining data from several disparate sources, which are
stored using various technologies and provide a unified view of the data. Data integration
becomes increasingly important in cases of merging systems of two companies or
consolidating applications within one company to provide a unified view of the company's
data assets.

3.Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for
mining process. This involves following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0
to 1.0)

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.

4. Concept Hierarchy Generation:


Here attributes are converted from lower level to higher level in hierarchy. For
Example-The attribute “city” can be converted to “country”.
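A short Python sketch of two of these transformation steps on an invented salary column, min-max normalization followed by discretization into conceptual levels (the cut-offs 0.33 and 0.66 are arbitrary choices for illustration):

    # Min-max normalization to [0.0, 1.0], then discretization into levels.
    salary = [30000, 45000, 60000, 90000, 120000]

    lo, hi = min(salary), max(salary)
    normalized = [(v - lo) / (hi - lo) for v in salary]   # scaled to 0.0 .. 1.0

    def to_level(x):
        """Replace a raw numeric value by an interval (conceptual) label."""
        if x < 0.33:
            return "low"
        elif x < 0.66:
            return "medium"
        return "high"

    levels = [to_level(x) for x in normalized]
    print(normalized)  # [0.0, 0.166..., 0.333..., 0.666..., 1.0]
    print(levels)      # ['low', 'low', 'medium', 'high', 'high']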

4. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. While
working with a huge volume of data, analysis becomes harder. To get rid of this
problem, we use the data reduction technique. It aims to increase storage efficiency and
reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.

2. Attribute Subset Selection:


The highly relevant attributes should be used; the rest can be discarded. For
performing attribute selection, one can use the level of significance and p-value of
the attribute. The attribute having a p-value greater than the significance level can be
discarded.

3. Numerosity Reduction:
This enables to store the model of data instead of whole data, for example:
Regression Models.

4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the
compressed data, such a reduction is called lossless; otherwise it is called lossy.
Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
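A minimal sketch of the PCA route in Python, assuming NumPy and scikit-learn are available (the small three-attribute data set is invented):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.array([
        [2.5, 2.4, 0.5],
        [0.5, 0.7, 1.9],
        [2.2, 2.9, 0.4],
        [1.9, 2.2, 0.8],
        [3.1, 3.0, 0.2],
    ])

    # Project the 3 original attributes onto the 2 principal components that
    # retain most of the variance -> a reduced representation of the data.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (5, 2)
    print(pca.explained_variance_ratio_)  # share of variance kept per component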

(ii) Explain the basic methods for data cleaning.


Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
Various methods for handling this problem:

The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple:
This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective
unless the tuple contains several attributes with missing values. It is especially
poor when the percentage of missing values per attribute varies considerably.

(b) Manually filling in the missing value:


In general, this approach is time-consuming and may not be a reasonable task
for large data sets with many missing values, especially when the value to be filled in
is not easily determined.

(c) Using a global constant to fill in the missing value:


Replace all missing attribute values by the same constant, such as a label like
“Unknown,” or −∞. If missing values are replaced by, say, “Unknown,” then the
mining program may mistakenly think that they form an interesting concept, since
they all have a value in common — that of “Unknown.” Hence, although this
method is simple, it is not recommended.
(d) Using the attribute mean for quantitative (numeric) values or attribute mode
for categorical (nominal) values, for all samples belonging to the same class
as the given tuple:
For example, if classifying customers according to credit risk, replace the
missing value with the average income value for customers in the same credit risk
category as that of the given tuple.
(e) Using the most probable value to fill in the missing value:
This may be determined with regression, inference-based tools using Bayesian
formalism, or decision tree induction. For example, using the other customer
attributes in your data set, you may construct a decision tree to predict the missing
values for income.
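A small pandas sketch of methods (d) and (e) on an invented customer table; for (e) a simple column mode stands in for a full regression or decision-tree model.

    import pandas as pd

    df = pd.DataFrame({
        "credit_risk": ["low", "low", "high", "high", "low", "high"],
        "income":      [52000, 48000, 25000, None, 50000, 27000],
        "region":      ["north", "south", None, "south", "south", "north"],
    })

    # (d) class-conditional attribute mean: the missing income in the "high"
    #     class is replaced by the mean income of the other "high" customers.
    df["income"] = df["income"].fillna(
        df.groupby("credit_risk")["income"].transform("mean"))

    # (e)-style simple stand-in: most probable value, here the column mode.
    df["region"] = df["region"].fillna(df["region"].mode()[0])

    print(df)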

12. Describe with diagrammatic representation the relationship between operational


data, a data warehouse and data marts.

13. (i) Demonstrate in detail about Data marts


Data marts are departmental subsets that focus on selected subjects. They are independent and used by a
dedicated user group. They are used for rapid delivery of enhanced decision support
functionality to end users.

Data mart is used in the following situation:


➢ Extremely urgent user requirement
➢ The absence of a budget for a full scale data warehouse strategy
➢ The decentralization of business needs
➢ The attraction of easy-to-use tools and mid-sized projects

Data mart presents two problems:


1. Scalability: A small data mart can grow quickly in multiple dimensions, so while
designing it the organization has to pay more attention to system scalability,
consistency and manageability issues.
2. Data integration

(ii) Demonstrate data warehouse administration and management

Data warehouse admin and management


The management of data warehouse includes,
➢ Security and priority management
➢ Monitoring updates from multiple sources
➢ Data quality checks
➢ Managing and updating meta data
➢ Auditing and reporting data warehouse usage and status
➢ Purging data
➢ Replicating, sub setting and distributing data
➢ Backup and recovery
➢ Data warehouse storage management which includes capacity planning, hierarchical
storage management and purging of aged data etc.,

14. (i) Generalize the potential performance problems with star schema.
Potential performance problem with star schemas

1.Indexing
➢ Indexing improves the performance of the star schema design.
➢ A table in the star schema design contains the entire hierarchy of attributes (in the PERIOD
dimension this hierarchy could be day->week->month->quarter->year). One approach
is to create a multipart key of day, week, month, quarter, year, but it presents some
problems in the star schema model because it requires the dimension to be kept in a
normalized form.
Problems:
1. It requires multiple metadata definitions.
2. Since the fact table must carry all key components as part of its primary key,
addition or deletion of levels requires physical modification of the affected table.
3. Carrying all the segments of the compound dimensional key in the fact table
increases the size of the index, thus impacting both performance and scalability.
Solutions:
1. One alternative to the compound key is to concatenate the keys into a single key for
the attributes (day, week, month, quarter, year); this solves the first two problems
above.
2. The index size remains a problem; the best approach is to drop the use of meaningful
keys in favour of an artificial, generated key which is the smallest possible key
that will ensure the uniqueness of each record (a small sketch of this follows).
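A tiny Python sketch of the second solution (the PERIOD values are invented): each compound natural key is mapped once to a small generated surrogate key, and only that key is carried in the fact table.

    # Replace a compound, meaningful dimension key with a generated surrogate key.
    periods = [
        ("2023", "Q1", "Jan", "W02", "2023-01-09"),
        ("2023", "Q1", "Jan", "W03", "2023-01-16"),
        ("2023", "Q2", "Apr", "W15", "2023-04-10"),
    ]

    surrogate_of = {}   # compound natural key -> generated integer key
    dimension_rows = []
    for natural_key in periods:
        if natural_key not in surrogate_of:
            surrogate_of[natural_key] = len(surrogate_of) + 1   # next small integer
            dimension_rows.append((surrogate_of[natural_key],) + natural_key)

    # The fact table now carries only the small surrogate key, not the whole
    # (year, quarter, month, week, day) compound key.
    fact_row = (surrogate_of[("2023", "Q1", "Jan", "W03", "2023-01-16")], 125.0)
    print(dimension_rows)
    print(fact_row)   # (2, 125.0)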

2.Level indicator
Problems:
1. Another potential problem with the star schema design arises in navigating the
dimensions successfully: the dimension table design includes a level of hierarchy
indicator for every record.
2. Every query that retrieves detail records from a table that stores details and
aggregates must use this indicator as an additional constraint to obtain a correct
result.
Solutions:
1. The best alternative to using the level indicator is the snowflake schema.
2. The snowflake schema contains separate fact tables for each level of aggregation,
so it is impossible to make the mistake of selecting detail records along with
aggregates. The snowflake schema is, however, even more complicated than a star schema.

(ii) Design and discuss about the star and snowflake schema models of a Data
warehouse.
STAR SCHEMA:

The multidimensional view of data that is expressed using relational database
semantics is provided by the database schema design called the star schema. The basic premise of the star
schema is that information can be classified into two groups:
➢ Facts
➢ Dimension
Star schema has one large central table (fact table) and a set of smaller tables
(dimensions) arranged in a radial pattern around the central table.
1. Fact Tables:
A fact table is a table that contains summarized numerical and historical data (facts)
and a multipart index composed of foreign keys from the primary keys of related dimension
tables. A fact table typically has two types of columns: foreign keys to dimension tables and
measures, which contain the numeric facts. A fact table can contain fact data at the detail or
aggregated level.
2. Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter,
year), Region dimension (profit by country, state, city), Product dimension (profit for
product1, product2). A dimension is a structure usually composed of one or more
hierarchies that categorize data. If a dimension has no hierarchies and levels, it is
called a flat dimension or list. The primary keys of each of the dimension tables are part of
the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional value. They are normally
descriptive, textual values. Dimension tables are generally smaller in size than the fact table.
Typical fact tables store data about sales, while dimension tables store data about geographic
regions (markets, cities), clients, products, times, and channels.
3. Measures:
Measures are numeric data based on columns in a fact table. They are the primary
data which end users are interested in. E.g. a sales fact table may contain a profit measure
which represents profit on each sale.

The main characteristics of star schema:


➢ Simple structure -> easy to understand schema
➢ Great query effectiveness -> small number of tables to join
➢ Relatively long time for loading data into dimension tables -> de-normalization and
redundant data mean that the size of the table can be large.
➢ The most commonly used schema in data warehouse implementations -> widely supported
by a large number of business intelligence tools

Snowflake schema:
It is the result of decomposing one or more of the dimensions. The many-to-one
relationships among sets of attributes of a dimension can be separated into new dimension tables,
forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical
structure of dimensions very well.
PART – C

1. Explain mapping data warehouse with multiprocessor architecture with the concept of
parallelism and data partitioning

The functions of a data warehouse are based on relational database technology, which
is implemented in a parallel manner. There are two advantages of
having parallel relational database technology for a data warehouse:

Linear Speed up:
refers to the ability to increase the number of processors to reduce response time
Linear Scale up:
refers to the ability to provide the same performance on the same request as the database
size increases

MAPPING DATA WAREHOUSE WITH MULTIPROCESSOR ARCHITECTURE WITH


THE CONCEPT OF PARALLELISM:
Types of parallelism
There are two types of parallelism:
➢ Inter query Parallelism:
In which different server threads or processes handle multiple requests
at the same time.
➢ Intra query Parallelism:
This form of parallelism decomposes the serial SQL query into lower-level
operations such as scan, join, sort etc. Then these lower-level operations are executed
concurrently in parallel.
Intra query parallelism can be done in either of two ways:
• Horizontal parallelism:
which means that the database is partitioned across multiple disks and
parallel processing occurs within a specific task that is performed concurrently
on different processors against different sets of data (see the sketch after this list).
• Vertical parallelism:
This occurs among different tasks. All query components such as scan,
join, sort etc are executed in parallel in a pipelined fashion. In other words, an
output from one task becomes an input into another task.
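A minimal Python sketch of the horizontal form (the data, the filter and the partition count are invented): the same low-level scan runs concurrently against different sets of data and the partial results are combined.

    # Horizontal parallelism: split the data into partitions and run the same
    # scan/aggregate task on each partition concurrently.
    from multiprocessing import Pool

    def scan_partition(partition):
        """The low-level operation run against one partition's set of data."""
        return sum(row for row in partition if row > 100)   # filter + aggregate

    if __name__ == "__main__":
        data = list(range(1, 1001))
        partitions = [data[i::4] for i in range(4)]          # 4 disk-like partitions

        with Pool(processes=4) as pool:
            partial_sums = pool.map(scan_partition, partitions)   # concurrent scans

        print(sum(partial_sums))   # combine partial results into the final answer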

MAPPING DATA WAREHOUSE WITH MULTIPROCESSOR ARCHITECTURE WITH


THE CONCEPT OF DATA PARTITIONING:

Data partitioning:
Data partitioning is the key component for effective parallel execution of data base
operations. Partition can be done randomly or intelligently.

➢ Random partitioning
It includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which
each record is placed on the next disk assigned to the database.

➢ Intelligent partitioning
It assumes that DBMS knows where a specific record is located and does
not waste time searching for it across all disks. The various intelligent partitioning
include:
• Hash partitioning:
A hash algorithm is used to calculate the partition number based on
the value of the partitioning key for each row
• Key range partitioning:
Rows are placed and located in the partitions according to the value
of the partitioning key. That is all the rows with the key value from A to K
are in partition 1, L to T are in partition 2 and so on.
• Schema partitioning:
An entire table is placed on one disk; another table is placed on a
different disk, etc. This is useful for small reference tables.
• User defined partitioning:
It allows a table to be partitioned on the basis of a user defined
expression. (A small sketch of hash and key-range partitioning follows this list.)
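A compact Python sketch of how the hash and key-range schemes decide where a row goes (the partition counts and ranges mirror the description above and are otherwise invented):

    from zlib import crc32

    def hash_partition(key, n_partitions=4):
        """Hash partitioning: a hash of the partitioning key picks the partition."""
        return crc32(str(key).encode()) % n_partitions

    def key_range_partition(name):
        """Key-range partitioning: names A-K go to partition 1, L-T to 2, U-Z to 3."""
        first = name[0].upper()
        if "A" <= first <= "K":
            return 1
        if "L" <= first <= "T":
            return 2
        return 3

    rows = ["Adams", "Lopez", "Young", "Kumar", "Singh"]
    print([(r, hash_partition(r)) for r in rows])
    print([(r, key_range_partition(r)) for r in rows])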
2. Design a star schema, snowflake schema and fact constellation schema for the data
warehouse that consists of the following four dimensions (Time, Item, Branch and
Location). Include the appropriate measures required for the schema.

STAR SCHEMA:

• In a Star schema, there is only one fact table and multiple dimension tables.
• In a Star schema, each dimension is represented by one dimension table.
• Dimension tables are not normalized in a Star schema.
• Each dimension table is joined to a key in the fact table.

There is a fact table at the center. It contains the keys to each of four dimensions. The
fact table also contains the attributes, namely dollars sold and units sold.

SNOWFLAKE SCHEMA:
Some dimension tables in the Snowflake schema are normalized. The normalization
splits up the data into additional tables. Unlike in the Star schema, the dimension tables in a
snowflake schema are normalized. Due to the normalization in the Snowflake schema, the
redundancy is reduced and therefore it becomes easy to maintain, and storage space is saved.

FACT CONSTELLATION SCHEMA:

A fact constellation has multiple fact tables. It is also known as a Galaxy Schema. The
sales fact table is the same as that in the Star Schema. The shipping fact table has five
dimensions, namely item_key, time_key, shipper_key, from_location, to_location. The
shipping fact table also contains two measures, namely dollars sold and units sold. It is also
possible to share dimension tables between fact tables.
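A pandas sketch of the relational shape of these schemas (all data values are invented; a fact constellation would simply add a second fact table, such as shipping, sharing the same dimension tables):

    import pandas as pd

    time_dim = pd.DataFrame({"time_key": [1, 2], "day": [5, 6],
                             "month": ["Jan", "Jan"], "year": [2023, 2023]})
    item_dim = pd.DataFrame({"item_key": [10, 11], "item_name": ["Pen", "Book"],
                             "brand": ["X", "Y"], "type": ["stationery", "print"]})
    branch_dim = pd.DataFrame({"branch_key": [100], "branch_name": ["B1"],
                               "branch_type": ["retail"]})
    location_dim = pd.DataFrame({"location_key": [1000], "street": ["Main St"],
                                 "city": ["Chennai"], "state": ["TN"],
                                 "country": ["India"]})

    # Fact table: one foreign key per dimension plus the two measures.
    sales_fact = pd.DataFrame({
        "time_key": [1, 2], "item_key": [10, 11],
        "branch_key": [100, 100], "location_key": [1000, 1000],
        "dollars_sold": [50.0, 120.0], "units_sold": [5, 3],
    })

    # Snowflaking the Location dimension: city/state/country split into their
    # own table (in a full design a city_key would link the two tables).
    city_dim = location_dim[["city", "state", "country"]].drop_duplicates()

    report = (sales_fact.merge(time_dim, on="time_key")
                        .merge(item_dim, on="item_key")
                        .groupby(["month", "item_name"])["dollars_sold"].sum())
    print(report)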

3.(i) Generalize why we need data pre-processing step in data warehousing


Data pre-processing refers to the set of techniques implemented on the
databases to remove noisy, missing, and inconsistent data. Different Data pre-processing
techniques involved in data mining are data cleaning, data integration, data reduction,
and data transformation.

The need for data pre-processing arises from the fact that real-world data and, many
times, the data in the database are often incomplete and inconsistent, which may result in
improper and inaccurate data mining results. Thus, to improve the quality of the data on
which the observation and analysis are to be done, it is treated with these four steps of
data pre-processing. The more the data is improved, the more accurate the observation
and prediction will be.

Techniques:
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction

1.Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.

(a). Missing Data:


This situation arises when some data is missing in the data. It can be handled
in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the
missing values manually, by attribute mean or the most probable
value.

(b). Noisy Data:


Noisy data is a meaningless data that can’t be interpreted by machines. It can
be generated due to faulty data collection, data entry errors etc. It can be handled in
following ways :
1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size and then various methods
are performed to complete the task. Each segment is handled
separately. One can replace all data in a segment by its mean, or
boundary values can be used to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may then
be detected as values that fall outside the clusters.

2.Data Integration:
Data integration involves combining data from several disparate sources, which are
stored using various technologies and provide a unified view of the data. Data integration
becomes increasingly important in cases of merging systems of two companies or
consolidating applications within one company to provide a unified view of the company's
data assets.

3.Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for
mining process. This involves following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0
to 1.0)

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.

3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from lower level to higher level in hierarchy. For
Example-The attribute “city” can be converted to “country”.

4. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. While
working with a huge volume of data, analysis becomes harder. To get rid of this
problem, we use the data reduction technique. It aims to increase storage efficiency and
reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.

2. Attribute Subset Selection:


The highly relevant attributes should be used; the rest can be discarded. For
performing attribute selection, one can use the level of significance and p-value of
the attribute. The attribute having a p-value greater than the significance level can be
discarded.

3. Numerosity Reduction:
This enables to store the model of data instead of whole data, for example:
Regression Models.

4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the
compressed data, such a reduction is called lossless; otherwise it is called lossy.
Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).

(ii) Explain the various methods of data cleaning and data reduction technique

Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.

(a). Missing Data:


This situation arises when some data is missing in the data. It can be handled
in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the
missing values manually, by attribute mean or the most probable
value.
(b). Noisy Data:
Noisy data is a meaningless data that can’t be interpreted by machines. It can
be generated due to faulty data collection, data entry errors etc. It can be handled in
following ways :
1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size and then various methods
are performed to complete the task. Each segment is handled
separately. One can replace all data in a segment by its mean, or
boundary values can be used to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may then
be detected as values that fall outside the clusters.

Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. While
working with a huge volume of data, analysis becomes harder. To get rid of this
problem, we use the data reduction technique. It aims to increase storage efficiency and
reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.

2. Attribute Subset Selection:


The highly relevant attributes should be used; the rest can be discarded. For
performing attribute selection, one can use the level of significance and p-value of
the attribute. The attribute having a p-value greater than the significance level can be
discarded.

3. Numerosity Reduction:
This enables to store the model of data instead of whole data, for example:
Regression Models.

4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the
compressed data, such a reduction is called lossless; otherwise it is called lossy.
Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).

4.(i) Compare the similarities and differences between the database and data warehouse

DIFFERENCE:

Parameter-wise comparison of a database and a data warehouse:

➢ Purpose: A database is designed to record; a data warehouse is designed to analyze.
➢ Processing method: The database uses Online Transactional Processing (OLTP); the data
warehouse uses Online Analytical Processing (OLAP).
➢ Storage limit: A database is generally limited to a single application; a data warehouse
stores data from any number of applications.
➢ Availability: Database data is available in real time; data warehouse data is refreshed
from source systems as and when needed.
➢ Data type: Data stored in the database is up to date; the data warehouse stores current
and historical data, which may not be up to date.
➢ Focus: The focus of the database is mainly on transactions with the help of queries; the
data warehouse focuses on analysis of data collected from different sources and on
generating reports.
➢ Data duplication: In an OLTP database the data is normalized and there is no duplication
of data, in order to optimize processing and efficiency; in an OLAP database the data is
organized to facilitate analysis and reporting, usually denormalized and stored in fewer
tables with a simple structure.
➢ Optimization: The database is optimized for read-write operations through single-point
transactions, and most OLTP queries respond in less than a second; data warehouses are
optimized for retrieval of large data sets and aggregation, as they are designed for
handling broad analytical queries.
➢ Query type: Simple transaction queries are used in a database; complex queries are used
in a data warehouse for analysis purposes.
➢ Data summary: Detailed data is stored in a database; a data warehouse stores highly
summarized data.

SIMILARITY:

➢ Both the database and the data warehouse are used for storing data. These are data
storage systems.
➢ Generally, the data warehouse bottom tier is a relational database system.
Databases are also relational database system. Relational DB systems consist
of rows and columns and a large amount of data.
➢ The DW and databases support multi-user access. A single instance of
database and data warehouse can be accessed by many users at a time.
➢ Both DW and database require queries for accessing the data. The Data
warehouse can be accessed using complex queries while OLTP database can
be accessed by simpler queries.
➢ The database and data warehouse servers can be present on the company
premise or on the cloud.
➢ A data warehouse is also a database.

(ii) Explain what is data visualization. How it helps in data warehousing


✓ Data visualization is the graphical representation of information and data.
✓ Data visualization converts large and small data sets into visuals, which are easy for
humans to understand and process.
✓ By using visual elements like charts, graph, and maps, data visualization tools provide
an accessible way to see and understand trends, outliers, and patterns in data.
✓ In the world of Big Data, data visualization tools and technologies are essential to
analyse massive amounts of information and make data-driven decisions.
✓ Data Visualization is used to communicate information clearly and efficiently to users
by the usage of information graphics such as tables and charts.
✓ It helps users in analysing a large amount of data in a simpler way.
✓ It makes complex data more accessible, understandable, and usable.
✓ Tables are used where users need to see the pattern of a specific parameter, while charts
are used to show patterns or relationships in the data for one or more parameters.
✓ The combination of multiple visualizations and bits of information are still referred to
as Infographics.
✓ Data visualizations are used to discover unknown facts and trends
✓ A pie chart is a great way to show parts-of-a-whole. And maps are the best way to share
geographical data visually.
✓ Effective data visualization is created where communication, data science, and design
collide. Data visualization done right turns key insights from complicated data sets into
something meaningful and natural.
✓ American statistician and Yale professor Edward Tufte believes useful data
visualizations consist of complex ideas communicated with clarity, precision, and
efficiency.
✓ Data visualization is important because of the processing of information in human
brains. Using graphs and charts to visualize a large amount of the complex data sets is
more comfortable in comparison to studying the spreadsheet and reports.
✓ Data visualization is an easy and quick way to convey concepts universally. You can
experiment with a different outline by making a slight adjustment.
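As a tiny illustration (the figures are invented and matplotlib is assumed to be available), a few lines are enough to turn a summary pulled from a warehouse into a chart:

    import matplotlib.pyplot as plt

    regions = ["North", "South", "East", "West"]
    profit = [120, 95, 60, 140]   # e.g. profit by region pulled from a data mart

    plt.bar(regions, profit)              # bar chart: compare categories at a glance
    plt.title("Profit by region")
    plt.ylabel("Profit (in thousands)")
    plt.savefig("profit_by_region.png")   # or plt.show() when working interactively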

USE OF DATA VISUALIZATION IN DATA WAREHOUSING:


✓ Data visualization helps communicate information more rapidly due to the fact that the
human brain can process and understand a picture faster than it can process and
understand a series of numbers that have to be compared and contrasted.

✓ To make easier in understand and remember.


✓ To discover unknown facts, outliers, and trends.
✓ To visualize relationships and patterns quickly.
✓ To ask a better question and make better decisions.
✓ To perform competitive analysis.
✓ To improve insights
DATA WAREHOUSE AND DATA MINING
UNIT -1
PART-A
1.How is data ware house different from a database? Identify the similarity.
Data warehouse is a repository of multiple heterogenous data sources, organized
under a unified schema at a single site in order to facilitate management decision-making. A
relational database’s is a collection of tables, each of which is assigned a unique name. Each
table consists of a set of attributes (columns or fields) and usually stores a large set of tuples
(records or rows). Each tuple in a relational table represents an object identified by a unique
key and described by a set of attribute values. Both are used to store and manipulate the data.
2. Differentiate metadata and data mart.

META DATA DATA MART

Data about data. Departmental subsets that focus on selected


subjects.

Containing location and description of A data mart is a segment of a data


warehouse system components: names, warehouse that can provide data for
definition, structure… reporting and analysis on a section, unit,
department or operation in the company

It is used for maintaining, managing and They are used for rapid delivery of
using the data warehouse enhanced decision support functionality to
end users.

3. Analyze why one of the biggest challenges when designing a data ware house is the
data placement and distribution strategy.
One of the biggest challenges when designing a data warehouse is the data placement
and distribution strategy. Data volumes continue to grow in nature. Therefore, it becomes
necessary to know how the data should be divided across multiple servers and which users
should get access to which types of data. The data can be distributed based on the subject
area, location (geographical region), or time (current, month, year).
4. How would you evaluate the goals of data mining?

➢ Identifying high-value customers based on recent purchase data


➢ Building a model using available customer data to predict the likelihood of churn for
each customer
➢ Assigning each customer rank based on both churn propensity and customer value

5. List the two ways the parallel execution of the tasks within SQL statements can be
done.
6. What elements would you use to relate the design of data warehouse?
➢ Quality Screens.
➢ External Parameters File / Table.
➢ Team and Its responsibilities.
➢ Up to date data connectors to external sources.
➢ Consistent architecture between environments (development / uat (user – acceptance –
testing / production)
➢ Repository of DDL’s and other script files (.SQL, Bash / Powershell)
➢ Testing processes – unit tests, integration tests, regression tests
➢ Audit tables, monitoring and alerting of audit tables
➢ Known and described data lineage
7. Define Data mart
It is inexpensive tool and alternative to the data ware house. it based on the subject
area Data mart is used in the following situation:
➢ Extremely urgent user requirement.
➢ The absence of a budget for a full-scale data warehouse strategy.
➢ The decentralization of business needs.

8. Define star schema


The multidimensional view of data that is expressed using relational database
semantics is provided by the database schema design called the star schema. The basic premise of the star
schema is that information can be classified into two groups:
➢ Facts
➢ Dimension
Star schema has one large central table (fact table) and a set of smaller tables
(dimensions) arranged in a radial pattern around the central table.

9. What is Data warehousing? Explain the benefits of Data warehousing.


DATA WAREHOUSING:
Data Warehousing is an architectural construct of information systems that provides
users with current and historical decision support information that is hard to access or present
in traditional operational data stores

BENEFITS:
➢ Data warehouses are designed to perform well with aggregate queries running on
large amounts of data.
➢ Data warehousing is an efficient way to manage and report on data that is from a
variety of sources, non-uniform and scattered throughout a company.
➢ Data warehousing is an efficient way to manage demand for lots of information from
lots of users.
➢ Data warehousing provides the capability to analyze large amounts of historical data
for nuggets of wisdom that can provide an organization with competitive advantage.

10. Why data transformation is essential in the process of Knowledge discovery?


Describe it.
Data transformation is essential in the process of knowledge discovery because the
main objective of the knowledge discovery in database process is to extract information from
data in the context of large databases. Data transformation is where data are transformed or
consolidated into forms appropriate for mining by performing summary or aggregation
operations, for instance.

11. Describe the alternate technologies used to improve the performance in data
warehouse environment

12. Distinguish STAR join and STAR index.


STAR JOIN:
A STAR join is a high-speed, single-pass, parallelizable multi-table join, and Red Brick’s
RDBMS can join more than two tables in a single operation.
STAR INDEX:
Red Brick’s RDBMS supports the creation of specialized indexes called STAR
indexes. A STAR index is created on one or more foreign key columns of a fact table.

13. Analyse the types of data mart.


TYPES OF DATA MART:
➢ Dependent
➢ Independent
➢ Hybrid

14. Formulate what is data discretization.


Data discretization converts a large number of data values into a smaller number of values, so that
data evaluation and data management become very easy. In other words, it is simply
defined as a process of converting continuous data attribute values into a finite set of
intervals and associating with each interval some specific data value.

15. Point out the major differences between the star schema and the snowflake schema
The dimension table of the snowflake schema model may be kept in normalized
form to reduce redundancies. Such a table is easy to maintain and saves storage space.

16. Point out the features of Metadata repository in data warehousing


➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business oriented key words
➢ It should act as a launch platform for end user to access data and analysis tools
➢ It should support the sharing of info

17. Define Metadata repository


Meta data helps the users to understand content and find the data. Meta data are stored
in a separate data store which is known as informational directory or Meta data repository
which helps to integrate, maintain and view the contents of the data warehouse.

18. Discuss metadata with an example.


It is data about data. It is used for maintaining, managing and using the data
warehouse. For example, author, date created, date modified and file size are examples of
very basic document file metadata. Having the ability to search for a particular element (or
elements) of that metadata makes it much easier for someone to locate a specific document.

19. Illustrate the benefits of metadata repository.


A metadata repository supports enterprise-wide data governance, data quality and
master data management (including master data and reference data) and integrates this wealth
of information with metadata collected across the organization to provide decision support
for data structures, even though it only reflects the structures consumed from various
systems.

20. Design the data warehouse architecture.

PART – B

1.What is data warehouse? Give the Steps for design and construction of Data
Warehouses and explain with three tier architecture diagram.

DATA WAREHOUSE:
A data warehouse is a repository of multiple heterogeneous data sources organized
under a unified schema at a single site to facilitate management decision making. (Or) A data
warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of
management’s decision-making process.

CONSTRUCTION OF DATA WAREHOUSE:


There are two reasons why organizations consider data warehousing a critical need; in
other words, two factors drive you to build and use a data warehouse. They are:
Business factors:
Business users want to make decisions quickly and correctly using all available data.
Technological factors:
To address the incompatibility of operational data stores.
IT infrastructure is changing rapidly: its capacity is increasing and its cost is
decreasing, so that building a data warehouse is easy.

There are several things to be considered while building a successful data warehouse
Business considerations:
Organizations interested in development of a data warehouse can choose one of the
following two approaches:
Top - Down Approach (Suggested by Bill Inmon)
Bottom - Up Approach (Suggested by Ralph Kimball)
Top - Down Approach
In the top down approach suggested by Bill Inmon, we build a centralized repository
to house corporate wide business data. This repository is called Enterprise Data Warehouse
(EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy.
The central repository for corporate wide data helps us maintain one version of truth of the
data. The data in the EDW is stored at the most detail level. The reason to build the EDW on
the most detail level is to leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.

The disadvantages of storing data at the detail level are


1. The complexity of design increases with increasing level of detail.
2. It takes large amount of space to store data at detail level, hence increased cost.

Once the EDW is implemented we start building subject area specific data marts which
contain data in a de normalized form also called star schema. The data in the marts are
usually summarized based on the end users analytical requirements. The reason to de
normalize the data in the mart is to provide faster access to the data for the end users
analytics. If we were to have queried a normalized schema for the same analytics, we
would end up in a complex multiple level joins that would be much slower as compared to
the one on the de normalized schema.
We should implement the top-down approach when
1. The business has complete clarity on all or multiple subject areas' data warehouse
requirements.
2. The business is ready to invest considerable time and money.

The advantage of using the Top Down approach is that we build a centralized repository to
cater for one version of truth for business data. This is very important for the data to be
reliable, consistent across subject areas and for reconciliation in case of data related
contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented followed by building
the data marts before which they can access their reports.

Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to
build a data warehouse. Here we build the data marts separately at different points of time as
and when the specific subject area requirements are clear. The data marts are integrated or
combined together to form a data warehouse. Separate data marts are combined through the
use of conformed dimensions and conformed facts. A conformed dimension and a conformed
fact is one that can be shared across data marts.
A Conformed dimension has consistent dimension keys, consistent attribute names
and consistent values across separate data marts. The conformed dimension means exact
same thing with every fact table it is joined. A Conformed fact has the same definition of
measures, same dimensions joined to it and at the same granularity across data marts.
The bottom up approach helps us incrementally build the warehouse by developing and
integrating data marts as and when the requirements are clear. We don’t have to wait for
knowing the overall requirements of the warehouse. We should implement the bottom up
approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity to only one
data mart.

The advantage of using the Bottom Up approach is that they do not require high initial costs
and have a faster implementation time; hence the business can start using the marts much
earlier as compared to the top-down approach.

The disadvantage of using the Bottom Up approach is that it stores data in the de-normalized
format, hence there would be high space usage for detailed data. We have a tendency of not
keeping detailed data in this approach, hence losing out on the advantage of having detail data,
i.e. the flexibility to easily cater to future requirements. The bottom up approach is more
realistic, but the complexity of the integration may become a serious obstacle.

DESIGN OF A DATA WAREHOUSE:


The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre calculations in the fact table
6. Rounding out the dimension table
7. Choosing the duration of the db
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models

2. Diagrammatically illustrate and discuss the following pre-processing techniques:


(i) Data cleaning
Data cleaning means removing the inconsistent data or noise and collecting necessary
information of a collection of interrelated data.
(ii) Data Integration
Data integration involves combining data from several disparate sources, which are
stored using various technologies and provide a unified view of the data. Data integration
becomes increasingly important in cases of merging systems of two companies or
consolidating applications within one company to provide a unified view of the company's
data assets.

(iii) Data transformation


In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Data transformation can involve the following:
Smoothing, Aggregation, Generalization, Normalization, Attribute construction

(iv) Data reduction


In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or “compressed” representation of the original data. If the original data can
be reconstructed from the compressed data without any loss of information, the data
reduction is called lossless.

3.(i) Draw the data warehouse architecture and explain its components

Overall Architecture
• The data warehouse architecture is based on the data base management system server.
• The central information repository is surrounded by number of key components
• Data warehouse is an environment, not a product. It is based on a relational
database management system that functions as the central repository for informational
data.
• The data entered into the data warehouse is transformed into an integrated structure and
format. The transformation process involves conversion, summarization, filtering and
condensation.
• The data warehouse must be capable of holding and managing large volumes of data
as well as different data structures over time.

Key components
✓ Data sourcing, cleanup, transformation, and migration tools
✓ Metadata repository
✓ Warehouse/database technology
✓ Data marts
✓ Data query, reporting, analysis, and mining tools
✓ Data warehouse administration and management
✓ Information delivery system

Data sourcing, cleanup, transformation, and migration tools:


➢ They perform conversions, summarization, key changes and structural changes.
➢ The data transformation is required so that the data can be used by decision support tools.
➢ The transformation produces programs and control statements.
➢ They move the data into the data warehouse from multiple operational systems.

The Functionalities of these tools are listed below:


▪ To remove unwanted data from operational db
▪ Converting to common data names and attributes
▪ Calculating summaries and derived data
▪ Establishing defaults for missing data
▪ Accommodating source data definition changes

Metadata repository:
➢ It is data about data. It is used for maintaining, managing and using the data
warehouse.
It is classified into two:

Technical Meta data:


It contains information about data warehouse data used by warehouse designer,
administrator to carry out development and management tasks.
It includes,
▪ Info about data stores.
▪ Transformation descriptions. That is mapping methods from operational db to
warehouse db.
▪ Warehouse Object and data structure definitions for target data
▪ The rules used to perform clean up, and data enhancement
▪ Data mapping operations
▪ Access authorization, backup history, archive history, info delivery history,
data acquisition history, data access etc.,
Business Meta data:
It contains info that gives info stored in data warehouse to users.
It includes,
▪ Subject areas, and info object type including queries, reports, images, video, audio
clips etc.
▪ Internet home pages
▪ Info related to info delivery system
▪ Data warehouse operational info such as ownerships, audit trails etc. ,

Meta data helps the users to understand content and find the data. Meta data are stored in
a separate data store which is known as informational directory or Meta data repository
which helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business oriented key words
➢ It should act as a launch platform for end user to access data and analysis tools
➢ It should support the sharing of info

Warehouse/database technology
Data ware house database
This is the central part of the data ware housing environment. This is implemented
based on RDBMS technology.

Data marts
It is inexpensive tool and alternative to the data ware house. it based on the subject
area Data mart is used in the following situation:
➢ Extremely urgent user requirement
➢ The absence of a budget for a full scale data warehouse strategy
➢ The decentralization of business needs

Data query, reporting, analysis, and mining tools


Its purpose is to provide info to business users for decision making. There are five
categories:
➢ Data query and reporting tools
➢ Application development tools
➢ Executive info system tools (EIS)
➢ OLAP tools
➢ Data mining tools

1. Query and reporting tools:
Used to generate queries and reports. There are two types of reporting tools.
They are:
▪ Production reporting tools, used to generate regular operational reports.
▪ Desktop report writers, inexpensive desktop tools designed for end users.
2. Managed query tools:
Used to generate SQL queries. They use a meta layer of software in between users and
databases which offers point-and-click creation of SQL statements.
3. Application development tools:
This is a graphical data access environment which integrates OLAP tools with the
data warehouse and can be used to access all db systems.
4. OLAP tools:
Are used to analyze the data in multidimensional and complex views.
5. Data mining tools:
Are used to discover knowledge from the data warehouse data.

Data ware house administration and management:


The management of data warehouse includes,
➢ Security and priority management
➢ Monitoring updates from multiple sources
➢ Data quality checks
➢ Managing and updating meta data
➢ Auditing and reporting data warehouse usage and status
➢ Purging data
➢ Replicating, sub setting and distributing data
➢ Backup and recovery
➢ Data warehouse storage management which includes capacity planning, hierarchical
storage management and purging of aged data etc.,

Information delivery system:


➢ It is used to enable the process of subscribing for data warehouse info.
➢ Delivery to one or more destinations according to specified scheduling algorithm

(ii) Explain the different types of OLAP tools.


The different types of OLAP tools are:
➢ MOLAP
➢ ROLAP
➢ HOLAP

1.MOLAP:
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats. That is, data stored in array-based structures.

Advantages:
✓ Excellent performance: MOLAP cubes are built for fast data retrieval, and are
optimal for slicing and dicing operations.
✓ Can perform complex calculations: All calculations have been pre-generated when
the cube is created. Hence, complex calculations are not only doable, but they return
quickly.

Disadvantages:
✓ Limited in the amount of data it can handle: Because all calculations are performed
when the cube is built, it is not possible to include a large amount of data in the cube
itself. This is not to say that the data in the cube cannot be derived from a large
amount of data. Indeed, this is possible. But in this case, only summary-level
information will be included in the cube itself.
✓ Requires additional investment: Cube technology are often proprietary and do not
already exist in the organization. Therefore, to adopt MOLAP technology, chances
are additional investments in human and capital resources are needed.

Examples:
Hyperion Essbase, Fusion (Information Builders)

ROLAP:
This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each
action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
Data stored in relational tables

Advantages:
✓ Can handle large amounts of data: The data size limitation of ROLAP technology is
the limitation on data size of the underlying relational database. In other words,
ROLAP itself places no limitation on data amount.
✓ Can leverage functionalities inherent in the relational database: Often, relational
database already comes with a host of functionalities. ROLAP technologies, since
they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages:
✓ Performance can be slow: Because each ROLAP report is essentially a SQL query (or
multiple SQL queries) in the relational database, the query time can be long if the
underlying data size is large.
✓ Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL statements do
not fit all needs (for example, it is difficult to perform complex calculations using
SQL), ROLAP technologies are therefore traditionally limited by what SQL can do.
ROLAP vendors have mitigated this risk by building into the tool out-of the- box
complex functions as well as the ability to allow users to define their own functions.

Examples:
Micro-strategy Intelligence Server, Meta Cube (Informix/IBM)
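To make the ROLAP point concrete, here is a small pandas sketch (invented rows, not any vendor's tool): a slice is literally a WHERE-style filter, and a dice is a pivot over two dimensions.

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["North", "North", "South", "South"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "product": ["Pen", "Pen", "Book", "Pen"],
        "profit":  [100, 120, 80, 90],
    })

    # Dice: aggregate profit by region x quarter (the grid an OLAP tool displays).
    cube_view = sales.pivot_table(index="region", columns="quarter",
                                  values="profit", aggfunc="sum")

    # Slice: fix one dimension member, equivalent to adding a WHERE clause.
    q1_slice = sales[sales["quarter"] == "Q1"]

    print(cube_view)
    print(q1_slice)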

HOLAP (MQE: Managed Query Environment)


HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP.
For summary-type information, HOLAP leverages cube technology for faster performance. It
stores only the indexes and aggregations in the multidimensional form while the rest of the
data is stored in the relational database.
Examples:
Power Play (Cognos), Brio, Microsoft Analysis Services, Oracle Advanced Analytic
Services

4. (i) Describe in detail about Mapping the Data warehouse to a multiprocessor


architecture

The functions of data warehouse are based on the relational data base technology. The
relation data base technology is implemented in parallel manner. There are two advantages of
having parallel relational data base technology for data warehouse:

Linear Speed up:
refers to the ability to increase the number of processors in order to reduce the response time.
Linear Scale up:
refers to the ability to provide the same performance on the same requests as the database
size increases.

Types of parallelism
There are two types of parallelism:
➢ Inter query Parallelism:
In which different server threads or processes handle multiple requests
at the same time.
➢ Intra query Parallelism:
This form of parallelism decomposes the serial SQL query into lower-level
operations such as scan, join, sort etc. Then these lower-level operations are executed
concurrently in parallel.
Intra query parallelism can be done in either of two ways:
• Horizontal parallelism:
which means that the database is partitioned across multiple disks, and
parallel processing occurs within a specific task that is performed concurrently
on different processors against different sets of data (see the sketch after this list).
• Vertical parallelism:
This occurs among different tasks. All query components such as scan,
join, sort etc are executed in parallel in a pipelined fashion. In other words, an
output from one task becomes an input into another task.
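The sketch below illustrates the spirit of horizontal parallelism only (a parallel RDBMS does this inside the engine): the data is split into partitions, the same scan task runs concurrently on each partition via a process pool, and the partial results are combined at the end.

    # A minimal sketch of horizontal parallelism: the same scan runs concurrently
    # against different partitions of the data.
    from multiprocessing import Pool

    def scan_partition(rows):
        # Lower-level operation executed in parallel on one partition:
        # scan the rows and sum the 'amount' field for region 'North'.
        return sum(r["amount"] for r in rows if r["region"] == "North")

    if __name__ == "__main__":
        data = [{"region": "North" if i % 2 else "South", "amount": i} for i in range(1000)]
        partitions = [data[i::4] for i in range(4)]       # 4 partitions for 4 workers
        with Pool(4) as pool:
            partials = pool.map(scan_partition, partitions)
        print(sum(partials))                              # combine the partial results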

Data partitioning:
Data partitioning is the key component for effective parallel execution of data base
operations. Partition can be done randomly or intelligently.

➢ Random partitioning
It includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which
each record is placed on the next disk assigned to the database.

➢ Intelligent partitioning
It assumes that the DBMS knows where a specific record is located and does
not waste time searching for it across all disks. The various intelligent partitioning
schemes include:
• Hash partitioning:
A hash algorithm is used to calculate the partition number based on
the value of the partitioning key for each row (see the sketch after this list).
• Key range partitioning:
Rows are placed and located in the partitions according to the value
of the partitioning key. That is, all the rows with key values from A to K
are in partition 1, L to T are in partition 2, and so on.
• Schema partitioning:
An entire table is placed on one disk; another table is placed on a
different disk, etc. This is useful for small reference tables.
• User-defined partitioning:
It allows a table to be partitioned on the basis of a user-defined
expression.
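The following sketch shows, under simplified assumptions (Python's built-in hash and alphabetic key ranges stand in for the DBMS's own functions), how a partition number could be derived for hash partitioning and key-range partitioning:

    # A minimal sketch of hash and key-range partitioning (illustrative only).

    def hash_partition(key, num_partitions):
        # A hash of the partitioning key decides which partition a row goes to.
        return hash(key) % num_partitions

    def key_range_partition(key):
        # Rows with keys A-K go to partition 1, L-T to partition 2, the rest to partition 3.
        first = key[0].upper()
        if "A" <= first <= "K":
            return 1
        if "L" <= first <= "T":
            return 2
        return 3

    for customer in ["Anderson", "Miller", "Walker", "Kumar"]:
        print(customer, hash_partition(customer, 4), key_range_partition(customer))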
(ii) Describe in detail on data warehouse Metadata
METADATA:
➢ It is data about data. It is used for maintaining, managing and using the data
warehouse.
➢ It is classified into two:
✓ Technical Meta data
✓ Business Meta data

Technical Meta data:


It contains information about data warehouse data used by warehouse designer,
administrator to carry out development and management tasks.
It includes,
▪ Info about data stores.
▪ Transformation descriptions. That is mapping methods from operational db to
warehouse db.
▪ Warehouse Object and data structure definitions for target data
▪ The rules used to perform clean up, and data enhancement
▪ Data mapping operations
▪ Access authorization, backup history, archive history, info delivery history,
data acquisition history, data access etc.,
Business Meta data:
It contains information that gives users a business-oriented view of the data stored in the data warehouse.
It includes,
▪ Subject areas, and info object type including queries, reports, images, video, audio
clips etc.
▪ Internet home pages
▪ Info related to info delivery system
▪ Data warehouse operational info such as ownerships, audit trails etc. ,

Meta data helps the users to understand content and find the data. Meta data are stored in
a separate data store which is known as informational directory or Meta data repository
which helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business oriented key words
➢ It should act as a launch platform for end user to access data and analysis tools
➢ It should support the sharing of info

5.(i) Explain the steps in building a data warehouse.


There are two reasons why organizations consider data warehousing a critical need; in
other words, there are two factors that drive an organization to build and use a data warehouse. They are:
➢ Business factors:
• Business users want to make decision quickly and correctly using all
available data.
➢ Technological factors:
• To address the incompatibility of operational data stores
• IT infrastructure is changing rapidly. Its capacity is increasing and cost
is decreasing so that building a data warehouse is easy.

There are several things to be considered while building a successful data warehouse
Business considerations:
Organizations interested in development of a data warehouse can choose one of the
following two approaches:
➢ Top - Down Approach (Suggested by Bill Inmon)
➢ Bottom - Up Approach (Suggested by Ralph Kimball)

Top - Down Approach


In the top down approach suggested by Bill Inmon, we build a centralized repository
to house corporate wide business data. This repository is called Enterprise Data Warehouse
(EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy.
The central repository for corporate wide data helps us maintain one version of truth of the
data. The data in the EDW is stored at the most detail level.

The reason to build the EDW on the most detail level is to leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.

The disadvantages of storing data at the detail level are


1. The complexity of design increases with increasing level of detail.
2. It takes large amount of space to store data at detail level, hence increased cost.
Once the EDW is implemented, we start building subject-area-specific data marts,
which contain data in a denormalized form, also called a star schema. The data in the marts are
usually summarized based on the end users' analytical requirements. The reason to
denormalize the data in the mart is to provide faster access to the data for the end users'
analytics. If we were to query a normalized schema for the same analytics, we would
end up with complex multi-level joins that would be much slower compared to queries
on the denormalized schema.

We should implement the top-down approach when


1. The business has complete clarity on all or multiple subject areas data warehouse
requirements.
2. The business is ready to invest considerable time and money.

The advantage of using the Top Down approach is that we build a centralized
repository to cater for one version of truth for business data. This is very important for the
data to be reliable, consistent across subject areas and for reconciliation in case of data related
contention between subject areas.
The disadvantage of using the Top Down approach is that it requires more time and
initial investment. The business has to wait for the EDW to be implemented followed by
building the data marts before which they can access their reports.

Bottom Up Approach
The bottom up approach suggested by Ralph Kimball is an incremental approach to
build a data warehouse. Here we build the data marts separately at different points of time as
and when the specific subject area requirements are clear. The data marts are integrated or
combined together to form a data warehouse. Separate data marts are combined through the
use of conformed dimensions and conformed facts. A conformed dimension and a conformed
fact is one that can be shared across data marts.
A Conformed dimension has consistent dimension keys, consistent attribute names
and consistent values across separate data marts. The conformed dimension means exact
same thing with every fact table it is joined.
A Conformed fact has the same definition of measures, same dimensions joined to it
and at the same granularity across data marts.
The bottom-up approach helps us incrementally build the warehouse by developing
and integrating data marts as and when the requirements are clear. We do not have to
wait until the overall requirements of the warehouse are known.

We should implement the bottom up approach when


1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity to only one
data mart.

The advantage of using the bottom-up approach is that it does not require high initial
costs and has a faster implementation time; hence the business can start using the marts
much earlier as compared to the top-down approach.
The disadvantage of using the bottom-up approach is that it stores data in a
denormalized format, hence high space usage for detailed data. There is also a
tendency not to keep detailed data in this approach, losing the advantage of
having detail data, i.e. the flexibility to easily cater to future requirements. The bottom-up approach is
more realistic, but the complexity of the integration may become a serious obstacle.

(ii) Analyze the information needed to support DBMS schemas for Decision support.

The basic concepts of dimensional modelling are:


➢ Facts
➢ dimensions and
➢ measures.
A fact is a collection of related data items, consisting of measures and context data. It
typically represents business items or business transactions. A dimension is a collection of
data that describe one business dimension. Dimensions determine the contextual background
for the facts; they are the parameters over which we want to perform OLAP. A measure is a
numeric attribute of a fact, representing the performance or behavior of the business relative
to the dimensions. Considering Relational context, there are three basic schemas that are used
in dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema:
The multidimensional view of data that is expressed using relational database
semantics is provided by the database schema design called the star schema. The basic premise of the
star schema is that information can be classified into two groups:
➢ Facts
➢ Dimensions
A star schema has one large central table (the fact table) and a set of smaller tables
(the dimensions) arranged in a radial pattern around the central table.
1. Fact Tables:
A fact table is a table that contains summarized numerical and historical data (facts)
and a multipart index composed of foreign keys from the primary keys of related dimension
tables. A fact table typically has two types of columns: foreign keys to dimension tables and
measures, i.e. columns that contain numeric facts. A fact table can contain facts at a detail or
aggregated level.
2. Dimension Tables:
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter,
year), Region dimension (profit by country, state, city), Product dimension (profit for
product1, product2). A dimension is a structure usually composed of one or more
hierarchies that categorizes data. If a dimension has no hierarchies and levels, it is
called a flat dimension or list. The primary keys of each of the dimension tables are part of
the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional value. They are normally
descriptive, textual values. Dimension tables are generally smaller in size than fact tables.
Typical fact tables store data about sales, while dimension tables store data about geographic
regions (markets, cities), clients, products, times and channels.
3. Measures:
Measures are numeric data based on columns in a fact table. They are the primary
data which end users are interested in. E.g. a sales fact table may contain a profit measure
which represents the profit on each sale (a small worked sketch follows).
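To make the fact/dimension/measure vocabulary concrete, here is a small sketch (the column names are invented for illustration) that builds a tiny fact table and one dimension table as pandas DataFrames and views the profit measure by the Region dimension:

    # A minimal sketch of a fact table, a dimension table and a measure.
    import pandas as pd

    # Dimension table: descriptive, textual attributes keyed by region_key.
    region_dim = pd.DataFrame({
        "region_key": [1, 2],
        "country": ["India", "India"],
        "state": ["TN", "KA"],
    })

    # Fact table: foreign keys to the dimensions plus numeric measures.
    sales_fact = pd.DataFrame({
        "region_key": [1, 1, 2],
        "time_key": [202301, 202302, 202301],
        "profit": [120.0, 80.0, 200.0],      # the measure
    })

    # Viewing the profit measure by the Region dimension = join + aggregate.
    report = sales_fact.merge(region_dim, on="region_key").groupby("state")["profit"].sum()
    print(report)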

The main characteristics of star schema:


➢ Simple structure -> easy to understand schema
➢ Great query effectiveness -> small number of tables to join
➢ Relatively long loading time for dimension tables -> de-normalization and data
redundancy mean that the size of the tables can be large.
➢ The most commonly used in the data warehouse implementations -> widely supported
by a large number of business intelligence tools

Snowflake schema:
It is the result of decomposing one or more of the dimensions. The many-to-one
relationships among sets of attributes of a dimension can be separated into new dimension tables,
forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical
structure of dimensions very well.

Fact constellation schema:


For each star schema it is possible to construct a fact constellation schema (for example,
by splitting the original star schema into several star schemas, each of which describes facts at
another level of the dimension hierarchies). The fact constellation architecture contains multiple
fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design, because
many variants for particular kinds of aggregation must be considered and selected. Moreover,
dimension tables are still large.

6.(i) Discuss in detail about access tools types?


ACCESS TOOLS:
Data warehouse implementation relies on selecting suitable data access tools. The best
way to choose a tool is based on the type of data that can be selected using it and the kind of
access it permits for a particular user.
The following lists the various types of data that can be accessed:
➢ Simple tabular form data
➢ Ranking data
➢ Multivariable data
➢ Time series data
➢ Graphing, charting and pivoting data
➢ Complex textual search data
➢ Statistical analysis data
➢ Data for testing of hypothesis, trends and patterns
➢ Predefined repeatable queries
➢ Ad hoc user specified queries
➢ Reporting and analysis data
➢ Complex queries with multiple joins, multi-level sub queries and sophisticated search
criteria

There are five categories:


➢ Data query and reporting tools
➢ Application development tools
➢ Executive info system tools (EIS)
➢ OLAP tools
➢ Data mining tools

Data query and reporting tools:


Query and reporting tools are used to generate query and report. There are two types
of reporting tools. They are:
• Production reporting tool used to generate regular operational reports
• Desktop report writer are inexpensive desktop tools designed for end users.
Managed Query tools:
These are used to generate SQL queries. They use meta-layer software between users
and databases, which offers point-and-click creation of SQL statements. This tool is a
preferred choice of users to perform segment identification, demographic analysis,
territory management and preparation of customer mailing lists, etc.
Application development tools:
This is a graphical data access environment which integrates OLAP tools with data
warehouse and can be used to access all db systems

OLAP Tools:
These are used to analyze the data in multidimensional and complex views. To enable
multidimensional properties they use MDDB and MRDB, where MDDB refers to a multi-
dimensional database and MRDB refers to a multirelational database.

Data mining tools:


are used to discover knowledge from the data warehouse data also can be used for
data visualization and data correction purposes.

(ii) Describe the overall architecture of data warehouse?


Overall Architecture
• The data warehouse architecture is based on the data base management system server.
• The central information repository is surrounded by number of key components
• Data warehouse is an environment, not a product which is based on relational
database management system that functions as the central repository for informational
data.
• The data entered into the data warehouse transformed into an integrated structure and
format. The transformation process involves conversion, summarization, filtering and
condensation.
• The data warehouse must be capable of holding and managing large volumes of data
as well as different data structures over time.

Key components
✓ Data sourcing, cleanup, transformation, and migration tools
✓ Metadata repository
✓ Warehouse/database technology
✓ Data marts
✓ Data query, reporting, analysis, and mining tools
✓ Data warehouse administration and management
✓ Information delivery system

Data sourcing, cleanup, transformation, and migration tools:


➢ They perform conversions, summarization, key changes, structural changes
➢ The data transformation is required to use by decision support tools.
➢ The transformation produces programs, control statements.
➢ It moves the data into data warehouse from multiple operational systems.

The Functionalities of these tools are listed below:


▪ To remove unwanted data from operational db
▪ Converting to common data names and attributes
▪ Calculating summaries and derived data
▪ Establishing defaults for missing data
▪ Accommodating source data definition changes

Metadata repository:
➢ It is data about data. It is used for maintaining, managing and using the data
warehouse.
It is classified into two:

Technical Meta data:


It contains information about data warehouse data used by warehouse designer,
administrator to carry out development and management tasks.
It includes,
▪ Info about data stores.
▪ Transformation descriptions. That is mapping methods from operational db to
warehouse db.
▪ Warehouse Object and data structure definitions for target data
▪ The rules used to perform clean up, and data enhancement
▪ Data mapping operations
▪ Access authorization, backup history, archive history, info delivery history,
data acquisition history, data access etc.,
Business Meta data:
It contains information that gives users a business-oriented view of the data stored in the data warehouse.
It includes,
▪ Subject areas, and info object type including queries, reports, images, video, audio
clips etc.
▪ Internet home pages
▪ Info related to info delivery system
▪ Data warehouse operational info such as ownerships, audit trails etc. ,

Meta data helps the users to understand content and find the data. Meta data are stored in
a separate data store which is known as informational directory or Meta data repository
which helps to integrate, maintain and view the contents of the data warehouse.
The following lists the characteristics of info directory/ Meta data:
➢ It is the gateway to the data warehouse environment
➢ It supports easy distribution and replication of content for high performance and
availability
➢ It should be searchable by business oriented key words
➢ It should act as a launch platform for end user to access data and analysis tools
➢ It should support the sharing of info

Warehouse/database technology
Data ware house database
This is the central part of the data ware housing environment. This is implemented
based on RDBMS technology.

Data marts
It is inexpensive tool and alternative to the data ware house. it based on the subject
area Data mart is used in the following situation:
➢ Extremely urgent user requirement
➢ The absence of a budget for a full scale data warehouse strategy
➢ The decentralization of business needs

Data query, reporting, analysis, and mining tools


Its purpose is to provide info to business users for decision making. There are five
categories:
➢ Data query and reporting tools
➢ Application development tools
➢ Executive info system tools (EIS)
➢ OLAP tools
➢ Data mining tools

1. Query and reporting tools:


Used to generate queries and reports. There are two types of reporting tools.
They are:
▪ Production reporting tools, used to generate regular operational reports.
▪ Desktop report writers, which are inexpensive desktop tools designed for end users.
2. Managed Query tools:
Used to generate SQL queries. They use meta-layer software between users and
databases, which offers point-and-click creation of SQL statements.
3. Application development tools:
This is a graphical data access environment which integrates OLAP tools with
the data warehouse and can be used to access all db systems.
4. OLAP Tools:
Are used to analyze the data in multidimensional and complex views.
5. Data mining tools:
Are used to discover knowledge from the data warehouse data.

Data ware house administration and management:


The management of data warehouse includes,
➢ Security and priority management
➢ Monitoring updates from multiple sources
➢ Data quality checks
➢ Managing and updating meta data
➢ Auditing and reporting data warehouse usage and status
➢ Purging data
➢ Replicating, sub setting and distributing data
➢ Backup and recovery
➢ Data warehouse storage management which includes capacity planning, hierarchical
storage management and purging of aged data etc.,

Information delivery system:


➢ It enables users to subscribe to data warehouse information.
➢ It delivers the information to one or more destinations according to a specified scheduling algorithm.

7.(i) Discuss the different types of data repositories on which mining can be performed?
A data repository, often called a data archive or library, is a generic terminology that
refers to a segmented data set used for reporting or analysis. It’s a huge database
infrastructure that gathers, manages, and stores varying data sets for analysis, distribution,
and reporting.
Some common types of data repositories include:
➢ Data Warehouse
➢ Data Lake
➢ Data Mart
➢ Metadata Repository
➢ Data Cube

Data Warehouse

A data warehouse is a large data repository that brings together data from several
sources or business segments. The stored data is generally used for reporting and analysis to
help users make critical business decisions. In a broader perspective, a data warehouse offers
a consolidated view of either a physical or logical data repository gathered from numerous
systems. The main objective of a data warehouse is to establish a connection between data
from current systems. For example, product catalogue data stored in one system and
procurement orders for a client stored in another one.

Data Lake

A data lake is a unified data repository that allows you to store structured, semi-
structured, and unstructured enterprise data at any scale. Data can be in raw form and used for
different tasks like reporting, visualizations, advanced analytics, and machine learning.

Data Mart:

A data mart is a subject-oriented data repository that’s often a segregated section of a


data warehouse. It holds a subset of data usually aligned with a specific business department,
such as marketing, finance, or support. Due to its smaller size, a data mart can fast-track
business procedures as one can easily access relevant data within days instead of months. As
it only includes the data relevant to a specific area, a data mart is an economical way to
acquire actionable insights swiftly.

Metadata Repositories:

Metadata incorporates information about the structures that include the actual data.
Metadata repositories contain information about the data model that store and share this data.
They describe where the source of data is, how it was collected, and what it signifies. It may
define the arrangement of any data or subject deposited in any format. For businesses,
metadata repositories are essential in helping people understand administrative changes, as
they contain detailed information about the data.

Data Cubes:

Data cubes are multidimensional collections of data (usually three or more dimensions)
stored as a table. They are used to describe the time sequence of an image's data and help
assess gathered data from a range of standpoints. Each dimension of a data cube signifies
specific characteristics of the database such as day-to-day, monthly or annual sales. The data
contained within a data cube allows you to analyze all the information for almost any or all
clients, sales representatives, products, and more. Consequently, a data cube can help you
identify trends and scrutinize business performance.

(ii) Differentiate tangible and intangible benefits of data warehouse

TANGIBLE BENEFITS | INTANGIBLE BENEFITS

Improvement in product inventory. | Improvement in productivity by keeping all data in a single location and eliminating rekeying of data.

Decrement in production cost. | Reduced redundant processing; enhanced customer relations.

Improvement in selection of target markets. | Increased customer satisfaction.

Enhancement in asset and liability management. | Greater compliance.

Can be measured in financial terms. | Cannot be quantified directly in economic terms, but still have a very significant business impact.

The tangible benefits of a process are unlikely to fluctuate. | Can increase or decrease over time.

Tangible benefits can often be estimated before certain actions are taken. | Intangible benefits are virtually impossible to estimate beforehand.

8.(i) Describe in detail about data extraction

Data extraction is the process of collecting or retrieving disparate types of data from a
variety of sources, many of which may be poorly organized or completely unstructured. Data
extraction makes it possible to consolidate, process, and refine data so that it can be stored in
a centralized location in order to be transformed. These locations may be on-site, cloud-
based, or a hybrid of the two. Data extraction is the first step in both ETL (extract, transform,
load) and ELT (extract, load, transform) processes. ETL/ELT are themselves part of a
complete data integration strategy. Proper attention must be paid to data extraction, which
represents a success factor for a data warehouse architecture. When implementing a data
warehouse, the following selection criteria, which affect the ability to transform,
consolidate, integrate and repair the data, should be considered:
➢ Timeliness of data delivery to the warehouse
➢ The tool must have the ability to identify the particular data and that can be
read by conversion tool
➢ The tool must support flat files, indexed files since corporate data is still in
this type
➢ The tool must have the capability to merge data from multiple data stores
➢ The tool should have specification interface to indicate the data to be extracted
➢ The tool should have the ability to read data from data dictionary
➢ The code generated by the tool should be completely maintainable
➢ The tool should permit the user to extract the required data
➢ The tool must have the facility to perform data type and character set
translation
➢ The tool must have the capability to create summarization, aggregation and
derivation of records
➢ The data warehouse database system must be able to perform loading data
directly from these tools

(ii) Describe in detail about transformation tools


They perform conversions, summarization, key changes, structural changes and
condensation. The data transformation is required so that the information can be used by
decision support tools. The transformation produces programs, control statements, JCL code,
COBOL code, UNIX scripts, SQL DDL code, etc., to move the data into the data warehouse
from multiple operational systems.
The functionalities of these tools are listed below (a small sketch follows the list):
➢ To remove unwanted data from operational db
➢ Converting to common data names and attributes
➢ Calculating summaries and derived data
➢ Establishing defaults for missing data
➢ Accommodating source data definition changes
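A small, hypothetical illustration of three of these functions (common attribute names, defaults for missing data, derived summaries) is given below with pandas; real transformation tools generate this kind of logic as programs or scripts rather than hand-written code.

    # A minimal sketch of typical transformation-tool functions (made-up source data).
    import pandas as pd

    # Data extracted from a hypothetical operational source with source-specific names.
    orders = pd.DataFrame({
        "CUST_NO": [101, 102, 103],
        "ORD_AMT": [250.0, None, 400.0],
        "REGN": ["N", "S", None],
    })

    # Convert to common data names and attributes.
    orders = orders.rename(columns={"CUST_NO": "customer_id",
                                    "ORD_AMT": "order_amount",
                                    "REGN": "region"})

    # Establish defaults for missing data.
    orders["order_amount"] = orders["order_amount"].fillna(0.0)
    orders["region"] = orders["region"].fillna("Unknown")

    # Calculate summaries / derived data for the warehouse load.
    print(orders.groupby("region")["order_amount"].sum().reset_index())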
Issues to be considered during data sourcing, clean-up, extraction and transformation:
✓ Database heterogeneity:
It refers to the different nature of the DBMSs involved: they may use different data models,
different access languages, different data navigation methods,
operations, concurrency, integrity and recovery processes, etc.
✓ Data heterogeneity:
It refers to the different ways the data is defined and used in different models.
Some vendors involved in the development of such tools are:
Prism Solutions, Evolutionary Technology Inc., Vality, Praxis and Carleton.
9.(i) Suppose that a data warehouse consists of four dimensions customer,
product, salesperson and sales time, and the three measure sales Amt (in
rupees), VAT (in rupees) and payment type (in rupees). Draw the different
classes of schemas that are popularly used for modelling data warehouses
and explain it
(ii) How would you explain Metadata implementation with examples?
10.(i) Describe in detail about
(i)Bitmapped indexing
The new approach to increasing the performance of a relational DBMS is to use
innovative indexing techniques to provide direct access to data. SYBASE IQ uses a bit-
mapped index structure. The data is stored in the SYBASE DBMS.
SYBASE IQ: it is based on indexing technology and is a stand-alone database.
Overview: It is a separate SQL database. Data is loaded into SYBASE IQ very much as
into any relational DBMS; once loaded, SYBASE IQ converts all data into a series of bitmaps,
which are then highly compressed for storage on disk.
Data cardinality:
➢ Bitmap indexes are used to optimize queries against low-cardinality data,
that is, data in which the total number of potential values is relatively low. Example: state
code data has a cardinality of 50 potential values, and gender has a cardinality of only 2
(male or female).
➢ For low-cardinality data, each distinct value has its own bitmap index consisting of a bit
for every row in the table; if the bit for a given row is "on", the value exists in that record.
The bitmap index representation is a bit-long vector (one bit per row, e.g. a 10,000-bit vector
for a 10,000-row table) which has its bits turned on (value of 1) for every record that satisfies
the condition "gender" = "M" (a toy sketch of this idea follows this list).
➢ Bitmap indexes are unsuitable for high-cardinality data.
➢ One alternative is to use a traditional B-tree index structure. B-tree indexes can
significantly improve performance, but they can often grow to large sizes as the data volumes
and the number of indexes grow.
➢ SYBASE IQ uses a technique called Bitwise (a Sybase trademark) technology to build
bitmap indexes for high-cardinality data, which is limited to about 250 distinct values
for high-cardinality data.
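A toy version of the idea (not SYBASE IQ's actual on-disk format; the table and columns are invented) is sketched below: each distinct value of a low-cardinality column gets a bit vector with one bit per row, and a query predicate is answered by bitwise operations on those vectors.

    # A minimal sketch of a bitmap index for low-cardinality columns.
    rows = [
        {"name": "A", "gender": "M", "state": "TN"},
        {"name": "B", "gender": "F", "state": "TN"},
        {"name": "C", "gender": "M", "state": "KA"},
        {"name": "D", "gender": "F", "state": "KA"},
    ]

    def build_bitmap_index(rows, column):
        # One bit vector (stored here as a Python int) per distinct value;
        # bit i is set to 1 when row i holds that value.
        index = {}
        for i, row in enumerate(rows):
            index[row[column]] = index.get(row[column], 0) | (1 << i)
        return index

    gender_idx = build_bitmap_index(rows, "gender")
    state_idx = build_bitmap_index(rows, "state")

    # WHERE gender = 'M' AND state = 'TN'  ->  a bitwise AND of two bitmaps.
    match = gender_idx["M"] & state_idx["TN"]
    print([rows[i]["name"] for i in range(len(rows)) if match >> i & 1])   # ['A']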

Index types:
The first release of SYBASE IQ provides five indexing techniques. Most users apply two indexes
to every column: the default index, called the projection index, and either a low- or high-
cardinality index. For low-cardinality data SYBASE IQ provides:
➢ Low Fast index: it is optimized for queries involving scalar functions like SUM,
AVERAGE and COUNT.
➢ Low Disk index: it is optimized for disk space utilization at the cost of being more
CPU-intensive.
Performance:
SYBASE IQ technology achieves very good performance on ad hoc queries for
several reasons:
➢ Bitwise technology: this allows various data types in queries and supports fast
data aggregation and grouping.
➢ Compression: SYBASE IQ uses sophisticated algorithms to compress data into bit
maps.
➢ Optimized memory-based processing: SYBASE IQ caches data columns in memory
according to the nature of users' queries, which speeds up processing.
➢ Column-wise processing: SYBASE IQ scans columns, not rows; this reduces the amount
of data the engine has to search.
➢ Low overhead: as an engine optimized for decision support, SYBASE IQ does not carry
the overhead associated with an OLTP-designed RDBMS.
➢ Large block I/O: the block size of SYBASE IQ can be tuned from 512 bytes to 64 Kbytes,
so the system can read as much information as necessary in a single I/O.
➢ Operating system-level parallelism: SYBASE IQ breaks low-level operations like sorts,
bitmap manipulation, loads and I/O into non-blocking operations.
➢ Projection and ad hoc join capabilities: SYBASE IQ allows users to take advantage of
known join relationships between tables by defining them in advance and
building indexes between tables.
Shortcomings of indexing:
The points a user should be aware of when choosing to use SYBASE IQ include:
➢ No updates: SYBASE IQ does not support updates; users have to update
the source database and then load the updated data into SYBASE IQ on a periodic
basis.
➢ Lack of core RDBMS features:
It does not support all the robust features of SYBASE SQL Server, such as backup and
recovery.
➢ Less advantage for planned queries:
SYBASE IQ shows less of an advantage when running preplanned queries.
➢ High memory usage: memory access is used in place of expensive I/O operations.
Column local storage:
➢ This is another approach to improving query performance in the data warehousing
environment.
➢ For example, Thinking Machines Corporation has developed an innovative data layout
solution that improves RDBMS query performance many times. Implemented in its
CM_SQL RDBMS product, this approach is based on storing data column-wise, as
opposed to the traditional row-wise approach.
➢ The row-wise approach works well for an OLTP environment, in
which a typical transaction accesses one record at a time. However, in data warehousing
the goal is to retrieve multiple values of several columns.
➢ For example, if the problem is to calculate the average, minimum and maximum salary,
the column-wise storage of the salary field requires the DBMS to read only that one
column, rather than every record (see the sketch after this list).
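The difference between the two layouts can be sketched in a few lines (an illustration of the idea only, not of CM_SQL itself): with row-wise storage every record must be touched to aggregate one field, whereas with column-wise storage only the salary column is read.

    # A minimal sketch of row-wise versus column-wise storage.

    # Row-wise layout: one record at a time (suits OLTP-style access).
    row_store = [
        {"emp_id": 1, "name": "Ann", "salary": 50000},
        {"emp_id": 2, "name": "Bob", "salary": 60000},
        {"emp_id": 3, "name": "Cid", "salary": 70000},
    ]
    avg_row_wise = sum(r["salary"] for r in row_store) / len(row_store)   # touches every record

    # Column-wise layout: each column stored contiguously (suits analytical queries).
    column_store = {
        "emp_id": [1, 2, 3],
        "name": ["Ann", "Bob", "Cid"],
        "salary": [50000, 60000, 70000],
    }
    salaries = column_store["salary"]                  # only one column is read
    print(avg_row_wise, min(salaries), max(salaries), sum(salaries) / len(salaries))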
Complex data types:
➢ The best DBMS architecture for data warehousing has so far been limited to traditional
alphanumeric data types. But data management increasingly needs to support complex data
types, including text, images, full-motion video and sound.
➢ Large data objects are called binary large objects. What is required by business is much
more than just storage:
➢ The ability to retrieve a complex data type like an image by its content, the ability
to compare the content of one image to another in order to make a rapid business
decision, and the ability to express all of this in a single SQL statement.
➢ The modern data warehouse DBMS has to be able to efficiently store, access and
manipulate complex data. The DBMS has to be able to define not only new data
structures but also new functions that manipulate them, and often new access methods,
to provide fast and often new access to the data.
➢ An example of the advantage of handling complex data types is an insurance company
that wants to predict its financial exposure during a catastrophe such as a flood, which
requires support for complex data.

(ii) STARjoin and index.


A STAR join is a high-speed, single-pass, parallelizable multi-table join, and Red
Brick's RDBMS can join more than two tables in a single operation. A star schema has one
"central" table whose primary key is compound, i.e., consisting of multiple attributes. Each
one of these attributes is a foreign key to one of the remaining tables. Such a foreign key
dependency exists for each one of these tables, while there are no other foreign keys
anywhere in the schema. Most data warehouses that represent the multidimensional
conceptual data model in a relational fashion [1,2] store their primary data as well as the
data cubes derived from it in star schemas. The "central" table and the remaining tables of
the definition above correspond, respectively, to the fact table and the dimension tables that
are typically found in data warehouses.
Red Brick's RDBMS supports the creation of specialized indexes called STAR
indexes. A STAR index is created on one or more foreign key columns of a fact table. A star index is a
collection of join indices, one for every foreign key join in a star or snowflake schema.
common structure for a data warehouse is a fact table consisting of several dimension fields
and several measure fields. To reduce storage costs, the fact table is often normalized into
a star or a snowflake schema. Since most queries reference both the (normalized) fact tables
and the dimension tables, creating a star index can be an effective way to accelerate data
warehouse queries.

11.(i) What is data Pre-processing? Explain the various data pre-processing techniques.

Data pre-processing describes any type of processing performed on raw data to
prepare it for another processing procedure. Commonly used as a preliminary data mining
practice, data pre-processing transforms the data into a format that will be more easily and
effectively processed for the purpose of the user. Data in
the real world is dirty: it can be incomplete, noisy and inconsistent. These data need
to be pre-processed in order to help improve the quality of the data, and the quality of the mining
results.
➢ If there is no quality data, then there are no quality mining results. Quality decisions are
always based on quality data.
➢ If there is much irrelevant and redundant information present, or noisy and unreliable
data, then knowledge discovery during the training phase is more difficult.
Techniques:
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction

1.Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some data is missing in the data. It can be handled
in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the
missing values manually, by attribute mean or the most probable
value.

(b). Noisy Data:


Noisy data is a meaningless data that can’t be interpreted by machines. It can
be generated due to faulty data collection, data entry errors etc. It can be handled in
following ways :
1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size and then various methods
are performed to complete the task. Each segment is handled
separately. One can replace all data in a segment by its mean, or
boundary values can be used to complete the task (see the sketch after this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).
3. Clustering:
This approach groups the similar data in a cluster. The outliers may
be undetected or it will fall outside the clusters.
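A small sketch of smoothing by bin means, assuming equal-size bins of sorted values, follows; smoothing by bin boundaries would instead replace each value by the nearer bin edge.

    # A minimal sketch of the binning method: smoothing sorted data by bin means.
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted

    def smooth_by_bin_means(values, bin_size):
        smoothed = []
        for start in range(0, len(values), bin_size):
            bin_values = values[start:start + bin_size]
            mean = sum(bin_values) / len(bin_values)
            smoothed.extend([round(mean, 2)] * len(bin_values))   # each value replaced by its bin mean
        return smoothed

    print(smooth_by_bin_means(prices, 3))
    # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]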

2.Data Integration:
Data integration involves combining data from several disparate sources, which are
stored using various technologies and provide a unified view of the data. Data integration
becomes increasingly important in cases of merging systems of two companies or
consolidating applications within one company to provide a unified view of the company's
data assets.

3.Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for
mining process. This involves following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0
to 1.0); see the sketch after this list.

2. Attribute Selection:


In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.

4. Concept Hierarchy Generation:


Here attributes are converted from lower level to higher level in hierarchy. For
Example-The attribute “city” can be converted to “country”.
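Here is a quick sketch of min-max normalization to the range [0.0, 1.0] (one common choice; z-score scaling is another):

    # A minimal sketch of min-max normalization of a numeric attribute to [0.0, 1.0].
    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        old_min, old_max = min(values), max(values)
        span = old_max - old_min
        return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

    incomes = [12000, 35000, 58000, 98000]
    print([round(v, 3) for v in min_max_normalize(incomes)])
    # [0.0, 0.267, 0.535, 1.0]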

4. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. While
working with a huge volume of data, analysis becomes harder. In order to deal
with this, we use data reduction techniques. They aim to increase storage efficiency and
reduce data storage and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data for the construction of the data cube
(a sketch follows this list).

2. Attribute Subset Selection:


Only the highly relevant attributes should be used; the rest can be discarded. For
performing attribute selection, one can use the level of significance and p-value of
the attribute. An attribute having a p-value greater than the significance level can be
discarded.

3. Numerosity Reduction:
This enables to store the model of data instead of whole data, for example:
Regression Models.

4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If, after reconstruction from the compressed data, the original data can be
retrieved, such reduction is called lossless reduction; otherwise it is called lossy
reduction. Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
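As a small sketch of data cube aggregation (the column names are invented), quarterly sales can be rolled up to yearly totals so that the reduced, aggregated data is stored instead of every detail record:

    # A minimal sketch of data cube aggregation: rolling quarterly sales up to years.
    import pandas as pd

    quarterly = pd.DataFrame({
        "year":    [2022, 2022, 2022, 2022, 2023, 2023],
        "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2"],
        "branch":  ["A", "A", "A", "A", "A", "A"],
        "sales":   [100, 120, 90, 150, 110, 130],
    })

    # The aggregated cell (year, branch) replaces the detailed quarterly rows.
    yearly = quarterly.groupby(["year", "branch"], as_index=False)["sales"].sum()
    print(yearly)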

(ii) Explain the basic methods for data cleaning.


Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
Various methods for handling this problem:

The various methods for handling the problem of missing values in data tuples include (a sketch of two of them follows the list):
(a) Ignoring the tuple:
This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective
unless the tuple contains several attributes with missing values. It is especially
poor when the percentage of missing values per attribute varies considerably.

(b) Manually filling in the missing value:


In general, this approach is time-consuming and may not be a reasonable task
for large data sets with many missing values, especially when the value to be filled in
is not easily determined.

(c) Using a global constant to fill in the missing value:


Replace all missing attribute values by the same constant, such as a label like
“Unknown,” or −∞. If missing values are replaced by, say, “Unknown,” then the
mining program may mistakenly think that they form an interesting concept, since
they all have a value in common — that of “Unknown.” Hence, although this
method is simple, it is not recommended.
(d) Using the attribute mean for quantitative (numeric) values or attribute mode
for categorical (nominal) values, for all samples belonging to the same class
as the given tuple:
For example, if classifying customers according to credit risk, replace the
missing value with the average income value for customers in the same credit risk
category as that of the given tuple.
(e) Using the most probable value to fill in the missing value:
This may be determined with regression, inference-based tools using Bayesian
formalism, or decision tree induction. For example, using the other customer
attributes in your data set, you may construct a decision tree to predict the missing
values for income.
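A brief pandas sketch of two of these strategies, on made-up customer data, is shown below: a global constant for a categorical attribute, and the attribute mean computed per class (credit risk category) for a numeric one.

    # A minimal sketch of two missing-value strategies on made-up customer data.
    import pandas as pd

    customers = pd.DataFrame({
        "credit_risk": ["low", "low", "high", "high"],
        "income":      [52000.0, None, 31000.0, None],
        "occupation":  ["engineer", None, "clerk", "clerk"],
    })

    # (c) A global constant for a categorical attribute.
    customers["occupation"] = customers["occupation"].fillna("Unknown")

    # (d) The attribute mean computed per class (here, per credit_risk category).
    customers["income"] = customers.groupby("credit_risk")["income"].transform(
        lambda s: s.fillna(s.mean())
    )
    print(customers)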

12. Describe with diagrammatic representation the relationship between operational


data, a data warehouse and data marts.

13. (i) Demonstrate in detail about Data marts


Data marts are departmental subsets that focus on selected subjects. They are independent and
used by a dedicated user group. They are used for rapid delivery of enhanced decision support
functionality to end users.

Data mart is used in the following situation:


➢ Extremely urgent user requirement
➢ The absence of a budget for a full scale data warehouse strategy
➢ The decentralization of business needs
➢ The attraction of easy to use tools and mind sized project

Data mart presents two problems:


1. Scalability: A small data mart can grow quickly in multiple dimensions, so while
designing it, the organization has to pay more attention to system scalability,
consistency and manageability issues.
2. Data integration

(ii) Demonstrate data warehouse administration and management

Data warehouse admin and management


The management of data warehouse includes,
➢ Security and priority management
➢ Monitoring updates from multiple sources
➢ Data quality checks
➢ Managing and updating meta data
➢ Auditing and reporting data warehouse usage and status
➢ Purging data
➢ Replicating, sub setting and distributing data
➢ Backup and recovery
➢ Data warehouse storage management which includes capacity planning, hierarchical
storage management and purging of aged data etc.,

14. (i) Generalize the potential performance problems with star schema.
Potential performance problem with star schemas

1. Indexing
➢ Indexing improves the performance of the star schema design.
➢ The tables in a star schema design contain the entire hierarchy of attributes (for the PERIOD
dimension this hierarchy could be day->week->month->quarter->year). One approach
is to create a multipart key of day, week, month, quarter, year. This presents some
problems in the star schema model because the keys should be normalized:
Problems:
1. It requires multiple metadata definitions.
2. Since the fact table must carry all key components as part of its primary key,
addition or deletion of levels requires physical modification of the affected table.
3. Carrying all the segments of the compound dimensional key in the fact table
increases the size of the index, thus impacting both performance and scalability.
Solutions:
1. One alternative to the compound key is to concatenate the keys into a single key for
the attributes (day, week, month, quarter, year); this solves the first two problems
above.
2. The index size remains a problem. The best approach is to drop the use of meaningful
keys in favour of an artificial, generated key which is the smallest possible key
that will ensure the uniqueness of each record.

2. Level indicator
Problems:
1. Another potential problem with the star schema design is the level indicator needed
in order to navigate the dimensions successfully.
2. The dimension table design includes a level-of-hierarchy indicator for every
record.
3. Every query that retrieves detail records from a table that stores both details and
aggregates must use this indicator as an additional constraint to obtain a correct
result.
Solutions:
1. The best alternative to using the level indicator is the snowflake schema.
2. The snowflake schema contains separate fact tables for each level of aggregation,
so it is impossible to mistakenly select product detail. The snowflake
schema is, however, even more complicated than a star schema.

(ii) Design and discuss about the star and snowflake schema models of a Data
warehouse.
STAR SCHEMA:

The multidimensional view of data that is expressed using relational database
semantics is provided by the database schema design called the star schema. The basic premise of
the star schema is that information can be classified into two groups:
➢ Facts
➢ Dimensions
A star schema has one large central table (the fact table) and a set of smaller tables
(the dimensions) arranged in a radial pattern around the central table.
1. Fact Tables:
A fact table is a table that contains summarized numerical and historical data (facts)
and a multipart index composed of foreign keys from the primary keys of related dimension
tables. A fact table typically has two types of columns: foreign keys to dimension tables and
measures, i.e. columns that contain numeric facts. A fact table can contain facts at a detail or
aggregated level.
2. Dimension Tables:
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter,
year), Region dimension (profit by country, state, city), Product dimension (profit for
product1, product2). A dimension is a structure usually composed of one or more
hierarchies that categorizes data. If a dimension has no hierarchies and levels, it is
called a flat dimension or list. The primary keys of each of the dimension tables are part of
the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional value. They are normally
descriptive, textual values. Dimension tables are generally smaller in size than fact tables.
Typical fact tables store data about sales, while dimension tables store data about geographic
regions (markets, cities), clients, products, times and channels.
3. Measures:
Measures are numeric data based on columns in a fact table. They are the primary
data which end users are interested in. E.g. a sales fact table may contain a profit measure
which represents the profit on each sale.

The main characteristics of star schema:


➢ Simple structure -> easy to understand schema
➢ Great query effectiveness -> small number of tables to join
➢ Relatively long loading time for dimension tables -> de-normalization and data
redundancy mean that the size of the tables can be large.
➢ The most commonly used in the data warehouse implementations -> widely supported
by a large number of business intelligence tools

Snowflake schema:
It is the result of decomposing one or more of the dimensions. The many-to-one
relationships among sets of attributes of a dimension can be separated into new dimension tables,
forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical
structure of dimensions very well.
PART – C

1. Explain mapping data warehouse with multiprocessor architecture with the concept of
parallelism and data partitioning

The functions of data warehouse are based on the relational data base technology. The
relation data base technology is implemented in parallel manner. There are two advantages of
having parallel relational data base technology for data warehouse:

Linear Speed up:
refers to the ability to increase the number of processors in order to reduce the response time.
Linear Scale up:
refers to the ability to provide the same performance on the same requests as the database
size increases.

MAPPING DATA WAREHOUSE WITH MULTIPROCESSOR ARCHITECTURE WITH


THE CONCEPT OF PARALLELISM:
Types of parallelism
There are two types of parallelism:
➢ Inter query Parallelism:
In which different server threads or processes handle multiple requests
at the same time.
➢ Intra query Parallelism:
This form of parallelism decomposes the serial SQL query into lower-level
operations such as scan, join, sort etc. Then these lower-level operations are executed
concurrently in parallel.
Intra query parallelism can be done in either of two ways:
• Horizontal parallelism:
which means that the data base is partitioned across multiple disks and
parallel processing occurs within a specific task that is performed concurrently
on different processors against different set of data.
• Vertical parallelism:
This occurs among different tasks. All query components such as scan,
join, sort etc are executed in parallel in a pipelined fashion. In other words, an
output from one task becomes an input into another task.

MAPPING DATA WAREHOUSE WITH MULTIPROCESSOR ARCHITECTURE WITH


THE CONCEPT OF DATA PARTITIONING:

Data partitioning:
Data partitioning is the key component for effective parallel execution of data base
operations. Partition can be done randomly or intelligently.

➢ Random partitioning
It includes random data striping across multiple disks on a single server.
Another option for random partitioning is round-robin partitioning, in which
each record is placed on the next disk assigned to the database.

➢ Intelligent partitioning
It assumes that DBMS knows where a specific record is located and does
not waste time searching for it across all disks. The various intelligent partitioning
include:
• Hash partitioning:
A hash algorithm is used to calculate the partition number based on
the value of the partitioning key for each row
• Key range partitioning:
Rows are placed and located in the partitions according to the value
of the partitioning key. That is all the rows with the key value from A to K
are in partition 1, L to T are in partition 2 and so on.
• Schema partitioning:
An entire table is placed on one disk; another table is placed on a
different disk, etc. This is useful for small reference tables.
• User-defined partitioning:
It allows a table to be partitioned on the basis of a user-defined
expression.
2.Design a star-schema , snow-flake schema and fact- constellation schema for the data
warehouse that consists of the following four dimensions( Time , Item, Branch And
Location ). Include the appropriate measures required for the schema.

STAR SCHEMA:

• In a Star schema, there is only one fact table and multiple dimension tables.
• In a Star schema, each dimension is represented by one dimension table.
• Dimension tables are not normalized in a Star schema.
• Each dimension table is joined to a key in the fact table.

There is a fact table at the center. It contains the keys to each of four dimensions. The
fact table also contains the attributes, namely dollars sold and units sold.
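A compact sketch of this star schema as SQL DDL, executed through Python's sqlite3 module, is given below; the dimension keys and the measures dollars_sold and units_sold match what is stated above, while the remaining column names are illustrative assumptions.

    # A minimal sketch of the star schema for the (time, item, branch, location) warehouse.
    import sqlite3

    ddl = """
    CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
    CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

    -- Central fact table: a foreign key to every dimension plus the measures.
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        item_key     INTEGER REFERENCES item_dim(item_key),
        branch_key   INTEGER REFERENCES branch_dim(branch_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        dollars_sold REAL,
        units_sold   INTEGER
    );
    """

    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)
    print([name for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])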

SNOWFLAKE SCHEMA:
Some dimension tables in the Snowflake schema are normalized. The normalization
splits up the data into additional tables. Unlike in the Star schema, the dimension tables in a
snowflake schema are normalized. Due to the normalization in the Snowflake schema, the
redundancy is reduced; therefore, it becomes easier to maintain and saves storage space.

FACT CONSTELLATION SCHEMA:

A fact constellation has multiple fact tables. It is also known as a Galaxy Schema. The
sales fact table is the same as that in the Star Schema. The shipping fact table has five
dimensions, namely item_key, time_key, shipper_key, from_location, to_location. The
shipping fact table also contains two measures, namely dollars sold and units sold. It is also
possible to share dimension tables between fact tables.

3.(i) Generalize why we need data pre-processing step in data warehousing


Data pre-processing refers to the set of techniques implemented on the
databases to remove noisy, missing, and inconsistent data. Different Data pre-processing
techniques involved in data mining are data cleaning, data integration, data reduction,
and data transformation.

The need for data pre-processing arises from the fact that real-world data, and many
times the data in the database, is often incomplete and inconsistent, which may result in
improper and inaccurate data mining results. Thus, to improve the quality of the data on
which the observation and analysis are to be done, it is treated with these four steps of
data pre-processing. The more the data is improved, the more accurate the observations
and predictions will be.

Techniques:
1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction

1.Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.

(a). Missing Data:


This situation arises when some data is missing in the data. It can be handled
in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the
missing values manually, by attribute mean or the most probable
value.

(b). Noisy Data:


Noisy data is a meaningless data that can’t be interpreted by machines. It can
be generated due to faulty data collection, data entry errors etc. It can be handled in
following ways :
1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size and then various methods
are performed to complete the task. Each segment is handled
separately. One can replace all data in a segment by its mean, or
boundary values can be used to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).
3. Clustering:
This approach groups the similar data in a cluster. The outliers may
be undetected or it will fall outside the clusters.

2.Data Integration:
Data integration involves combining data from several disparate sources, which are
stored using various technologies and provide a unified view of the data. Data integration
becomes increasingly important in cases of merging systems of two companies or
consolidating applications within one company to provide a unified view of the company's
data assets.

3.Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for
mining process. This involves following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0
to 1.0)

2. Attribute Selection:


In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.

15. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
16. Concept Hierarchy Generation:
Here attributes are converted from lower level to higher level in hierarchy. For
Example-The attribute “city” can be converted to “country”.

4. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder as
the volume of data grows. Data reduction techniques are used to deal with this: they aim to
increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data for the construction of the data cube.

2. Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be discarded. For
performing attribute selection, one can use the level of significance and p-value of
the attribute: an attribute having a p-value greater than the significance level can be
discarded.

3. Numerosity Reduction:
This enables us to store a model of the data instead of the whole data, for example:
regression models.

4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the compressed
data, the reduction is called lossless; otherwise it is called lossy reduction. Two
effective methods of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
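
As an illustration of dimensionality reduction, here is a minimal sketch using scikit-learn's PCA on a small, made-up numeric dataset (the array values and the choice of two components are assumptions for demonstration only):

import numpy as np
from sklearn.decomposition import PCA

# toy data: 6 records described by 4 numeric attributes (hypothetical values)
X = np.array([[2.5, 2.4, 1.2, 0.5],
              [0.5, 0.7, 0.3, 0.1],
              [2.2, 2.9, 1.1, 0.6],
              [1.9, 2.2, 0.9, 0.4],
              [3.1, 3.0, 1.4, 0.7],
              [2.3, 2.7, 1.0, 0.5]])

pca = PCA(n_components=2)            # keep only 2 principal components
X_reduced = pca.fit_transform(X)     # the 6 x 4 matrix becomes 6 x 2

print(X_reduced.shape)                   # (6, 2)
print(pca.explained_variance_ratio_)     # share of variance retained by each component

The reduced representation keeps most of the variance of the original attributes while storing far fewer dimensions, which is the goal of this reduction step.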

(ii) Explain the various methods of data cleaning and data reduction technique

Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.

(a). Missing Data:


This situation arises when some values are missing in the data. It can be handled
in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.

2. Fill in the missing values:
There are various ways to do this task. You can choose to fill the
missing values manually, by the attribute mean, or with the most probable
value.
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can
be generated due to faulty data collection, data entry errors, etc. It can be handled in the
following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size and then various methods
are performed to complete the task. Each segment is handled
separately. One can replace all data in a segment by its mean, or
boundary values can be used to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. The outliers may
go undetected, or they will fall outside the clusters.

Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder as
the volume of data grows. Data reduction techniques are used to deal with this: they aim to
increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data for the construction of the data cube.

2. Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be discarded. For
performing attribute selection, one can use the level of significance and p-value of
the attribute: an attribute having a p-value greater than the significance level can be
discarded.

3. Numerosity Reduction:
This enables us to store a model of the data instead of the whole data, for example:
regression models.

4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the compressed
data, the reduction is called lossless; otherwise it is called lossy reduction. Two
effective methods of dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).

4.(i) Compare the similarities and differences between the database and data warehouse

DIFFERENCE:

• Purpose: a database is designed to record; a data warehouse is designed to analyze.
• Processing method: a database uses Online Transaction Processing (OLTP); a data
warehouse uses Online Analytical Processing (OLAP).
• Storage limit: a database is generally limited to a single application; a data warehouse
stores data from any number of applications.
• Availability: in a database, data is available in real time; in a data warehouse, data is
refreshed from source systems as and when needed.
• Data type: data stored in a database is up to date; a data warehouse stores current and
historical data and may not be up to date.
• Focus: the focus of a database is mainly on transactions, accessed with the help of
queries; a data warehouse supports analysis of data collected from different sources
and generates reports.
• Data duplication: in an OLTP database the data is normalized and there is no
duplication, in order to optimize processing and improve efficiency; in an OLAP
database the data is organized to facilitate analysis and reporting, so it is usually
denormalized and stored in fewer tables with a simple structure.
• Optimization: a database is optimized for read-write operations through single-point
transactions, and most OLTP queries respond in less than a second; a data warehouse
is optimized for retrieval and aggregation of large data sets, as it is designed to handle
broad analytical queries.
• Query type: a database uses simple transaction queries; a data warehouse uses complex
queries for analysis purposes.
• Data summary: detailed data is stored in a database; a data warehouse stores highly
summarized data.

SIMILARITY:

➢ Both the database and the data warehouse are used for storing data. They are data
storage systems.
➢ Generally, the bottom tier of a data warehouse is a relational database system, and
operational databases are also relational database systems, consisting of rows and
columns holding large amounts of data.
➢ Both the DW and databases support multi-user access. A single instance of a
database or data warehouse can be accessed by many users at a time.
➢ Both DW and database require queries for accessing the data. The Data
warehouse can be accessed using complex queries while OLTP database can
be accessed by simpler queries.
➢ The database and data warehouse servers can be present on the company
premise or on the cloud.
➢ A data warehouse is also a database.

(ii) Explain what is data visualization. How it helps in data warehousing


✓ Data visualization is the graphical representation of information and data.
✓ Data visualization converts large and small data sets into visuals, which are easy for
humans to understand and process.
✓ By using visual elements like charts, graphs, and maps, data visualization tools provide
an accessible way to see and understand trends, outliers, and patterns in data.
✓ In the world of Big Data, data visualization tools and technologies are essential to
analyse massive amounts of information and make data-driven decisions.
✓ Data visualization is used to communicate information clearly and efficiently to users
through information graphics such as tables and charts.
✓ It helps users analyse a large amount of data in a simpler way.
✓ It makes complex data more accessible, understandable, and usable.
✓ Tables are used where users need to see the values of a specific parameter, while charts
are used to show patterns or relationships in the data for one or more parameters.
✓ A combination of multiple visualizations and bits of information is referred to as an
infographic.
✓ Data visualizations are used to discover unknown facts and trends.
✓ A pie chart is a great way to show parts of a whole, and maps are the best way to share
geographical data visually.
✓ Effective data visualizations are created where communication, data science, and design
collide. Done well, they turn complicated data sets into meaningful and intuitive insights.
✓ American statistician and Yale professor Edward Tufte believes useful data
visualizations consist of complex ideas communicated with clarity, precision, and
efficiency.
✓ Data visualization is important because of the way the human brain processes
information: using graphs and charts to visualize large and complex data sets is easier
than studying spreadsheets and reports.
✓ Data visualization is an easy and quick way to convey concepts universally, and you can
experiment with different views by making slight adjustments.
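
As a rough illustration of how such visuals can be produced programmatically, here is a minimal matplotlib sketch; the product and quarterly sales figures are invented purely for demonstration:

import matplotlib.pyplot as plt

# hypothetical quarterly sales figures for one product line
quarters = ["Q1", "Q2", "Q3", "Q4"]
sales = [120, 150, 90, 180]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(quarters, sales)            # bar chart: compare values across quarters
ax1.set_title("Sales by quarter")
ax2.pie(sales, labels=quarters)     # pie chart: parts-of-a-whole view
ax2.set_title("Share of annual sales")
plt.tight_layout()
plt.show()

In a data warehouse setting the same idea applies, only the numbers would come from an OLAP query instead of hard-coded lists.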

USE OF DATA VISUALIZATION IN DATA WAREHOUSING:


✓ Data visualization helps communicate information more rapidly due to the fact that the
human brain can process and understand a picture faster than it can process and
understand a series of numbers that have to be compared and contrasted.

✓ To make data easier to understand and remember.
✓ To discover unknown facts, outliers, and trends.
✓ To visualize relationships and patterns quickly.
✓ To ask better questions and make better decisions.
✓ To perform competitive analysis.
✓ To improve insights.
UNIT – 3
DATA MINING
1.Define Data mining. List out the steps in data mining?
DATA MINING:
Data mining refers to extracting or mining knowledge from large amounts of data. The
term is actually a misnomer; data mining would have been more appropriately named
knowledge mining, which emphasizes mining knowledge from large amounts of data.

STEPS:
➢ business understanding
➢ data understanding
➢ data preparation
➢ modelling
➢ evaluation
➢ deployment.

2. List the steps involved in the process of KDD. How does it relate to data mining?

STEPS:

➢ Data cleaning
➢ Data integration
➢ Data selection
➢ Data transformation
➢ Data mining
➢ Pattern evaluation
➢ Knowledge presentation
KDD refers to the overall process of discovering useful knowledge from data, and data
mining refers to a particular step in this process. Data mining is the application of specific
algorithms for extracting patterns from data.

3. List the ways in which interesting patterns should be mined.


1. Tracking patterns.
2. Classification.
3. Association.
4. Outlier detection.
5. Clustering.
6. Regression.
7. Prediction.
4. Compare drill down with roll up approach.

DRILL DOWN:
• Drill-down refers to the process of viewing data at a level of increased detail.
• It is performed by stepping down a concept hierarchy for a dimension.
• It can also be performed by introducing a new dimension.

ROLL UP:
• Roll-up refers to the process of viewing data with decreasing detail.
• It is performed by climbing up a concept hierarchy for a dimension.
• It can also be performed by dimension reduction.

5. Describe what are the other kinds of data in data mining.


• Flat Files.
• Relational Databases.
• Data Warehouse.
• Transactional Databases.
• Multimedia Databases.
• Spatial Databases.
• Time Series Databases.
• World Wide Web(WWW)

6. How would you illustrate Handling outlier or incomplete data?


The data stored in a database may reflect outliers: noise, exceptional cases, or incomplete
data objects. These objects may confuse the analysis process, causing overfitting of the data to
the knowledge model constructed. As a result, the accuracy of the discovered patterns can be
poor. Data cleaning methods and data analysis methods which can handle outliers are required.
While most methods discard outlier data, such data may be of interest in itself, such as in fraud
detection for finding unusual usage of telecommunication services or credit cards. This form
of data analysis is known as outlier mining.
7.Analyze data characterization related to data discrimination
Data Characterization is a summarization of the general characteristics or features of
a target class of data.
Data Discrimination refers to the mapping or classification of a class with some
predefined group or class.
The output of data characterization can be presented in various forms. Examples include
pie charts, bar charts, curves, multidimensional data cubes, and multidimensional tables,
including crosstabs. The resulting descriptions can also be presented as generalized relations,
or in rule form called characteristic rules. Discrimination descriptions expressed in rule form
are referred to as discriminant rules.
8.Define association and correlations
ASSOCIATIONS:
It is the discovery of association rules showing attribute-value conditions that occur
frequently together in a given set of data.
CORRELATIONS:
It is used to study the closeness of the relationship between two or more variables i.e.
the degree to which the variables are associated with each other. Suppose in a manufacturing
firm, they want the relation between – Demand & supply of commodities.

9.List the five primitives for specification a data mining task.


THE FIVE PRIMITIVES FOR SPECIFICATION A DATA MINING TASK :
✓ Task-relevant data
✓ Knowledge type to be mined
✓ Background knowledge
✓ Pattern interestingness measure
✓ Visualization of discovered patterns

10. Evaluate the major tasks of data pre-processing.


THE MAJOR TASKS OF DATA PRE-PROCESSING:
✓ Data cleaning
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
✓ Data integration
- Integration of multiple databases, data cubes, or files
✓ Data transformation
- Normalization and aggregation
✓ Data reduction
- Obtains reduced representation in volume but produces the same or similar analytical
results
✓ Data discretization
- Part of data reduction but with particular importance, especially for numerical data

11. Are all patterns generated are interesting and useful? Give reasons to justify.
Typically not. Only a small fraction of the patterns potentially generated would actually
be of interest to any given user. A pattern is interesting if it is
(1) easily understood by humans,
(2) valid on new or test data with some degree of certainty,
(3) potentially useful, and
(4) novel.

12. Classify different types of reductions.


TYPES OF REDUCTIONS:
• Data Cube Aggregation
• Dimension reduction
• Data Compression
• Numerosity Reduction
• Discretization & Concept Hierarchy Operation
13. Distinguish between data cleaning and noisy data.
Data cleaning is the pre-processing step that handles missing values and removes noisy,
incorrect, and inconsistent data from a dataset. Noisy data is meaningless data containing
random errors (for example, from faulty data collection or data entry mistakes); it is one of the
problems that data cleaning addresses, using methods such as binning, regression, and clustering.

14. Explain the principle elements of missing values in data cleaning.


THE PRINCIPLE ELEMENTS OF MISSING VALUES IN DATA CLEANING:
✓ Ignore the tuple
✓ Fill in the missing value manually
✓ Use a global constant to fill in the missing value
✓ Use the attribute mean to fill in the missing value
✓ Use the attribute mean for all samples belonging to the same class as the given tuple
✓ Use the most probable value to fill in the missing value

15. Discuss the roles of noisy data in data pre-processing.


Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated due to faulty data collection, data entry errors, etc. It can be handled in the following
ways:
1. Binning Method
2. Regression
3. Clustering

16. Consider that the minimum and maximum values for the attribute “salary” are 12,000
and 98,000 respectively and the mapping range of salary is [0.0 ,1.0]. Find the
transformation for the salary 73,600 using min-max normalization.

Min-max normalization: v′ = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.716.
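
The same computation can be checked with a few lines of Python (a small sketch; the function name is mine, the values come from the question):

def min_max_normalize(v, old_min, old_max, new_min=0.0, new_max=1.0):
    # v' = (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max_normalize(73600, 12000, 98000), 3))   # prints 0.716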

17. Show how the attribute selection set is important in data reduction.
Attribute subset selection is a technique used for data reduction in the data mining
process. Data reduction reduces the size of the data so that it can be used for analysis purposes
more efficiently. The data set may have a large number of attributes, but some of those
attributes can be irrelevant or redundant; removing them yields a smaller data set without
losing information that matters for mining.

18. Consider the following set of data X = {15,27,62,35,39,50,44,44,22,98}. Do pre-
processing using smoothing by bin means and bin boundaries to smooth the data, using a
bin of depth 3. Evaluate it.

Step 1: Sort the data.


(15,22,27,35,39,44,44,50,62,98)

Step 2 : Partition the data into equi-depth bins of depth 3.


Bin 1: 15,22,27
Bin 2: 35,39,44
Bin 3: 44,50,62
Bin 4: 98
The data cannot be divided into equal-depth bins of depth 3, since the ten values leave the
last bin (98) with only a single element, so smoothing by bin means cannot be applied
uniformly to all bins.

19. Formulate why do we need data transformation. Mention the ways by which data can
be transformed.
Data transformation in data mining is done for combining unstructured data with
structured data to analyze it later. It is also important when the data is transferred to a new
cloud data warehouse. When the data is homogeneous and well-structured, it is easier to
analyze and look for patterns

THE WAYS BY WHICH DATA CAN BE TRANSFORMED:


• Data Smoothing.
• Data Aggregation.
• Discretization.
• Generalization.
• Attribute construction.
• Normalization.

20. Define an efficient procedure for cleaning the noisy data.


• You can ignore the tuple. This is done when the class label is missing. This method is not
very effective unless the tuple contains several attributes with missing values.
• You can fill in the missing value manually. This approach is effective on a small data set
with some missing values.
• You can replace all missing attribute values with a global constant, such as a label like
“Unknown” or minus infinity.
• You can use the attribute mean to fill in the missing value. For example, if the average
customer income is 25,000, you can use this value to replace a missing value for
income.
• You can use the most probable value to fill in the missing value.
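
A minimal pandas sketch of a few of these strategies; the column names and values are hypothetical:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [25000, np.nan, 31000, np.nan, 28000],
                   "city":   ["Delhi", None, "Chennai", "Delhi", None]})

df_dropped = df.dropna()                                   # ignore tuples with missing values
df["income"] = df["income"].fillna(df["income"].mean())    # fill with the attribute mean
df["city"] = df["city"].fillna("Unknown")                  # fill with a global constant
print(df)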

PART – B
1.(i) Demonstrate in detail about data mining steps in the process of knowledge discovery?

Data mining is an essential step in the process of knowledge discovery. Knowledge discovery
as a process, depicted in the figure below, consists of an iterative sequence of the following steps:

3.1.1 Data cleaning:


- to remove noise and inconsistent data
3.1.2 Data integration:
- where multiple data sources may be combined
3.1.3 Data selection:
- where data relevant to the analysis task are retrieved from the database
3.1.4 Data transformation:
- where data are transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations, for instance
3.1.5 Data mining:
- an essential process where intelligent methods are applied in order to extract data
Patterns
3.1.6 Pattern evaluation:
- to identify the truly interesting patterns representing knowledge based on
some interestingness measures
3.1.7 Knowledge presentation:
- where visualization and knowledge representation techniques are used
to present the mined knowledge to the user.

Data mining as a process of knowledge discovery.

(ii) List the application area of data mining?


THE LIST OF AREAS WHERE DATA MINING IS WIDELY USED :

• Financial Data Analysis


• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection

2. Explain in detail about the Evolution of Database Technology.


Evolution of Database Technology

Data mining primitives.

A data mining query is defined in terms of the following primitives


➢ Task-relevant data
➢ The kinds of Knowledge to be mined
➢ Background Knowledge
➢ Interestingness Measures
➢ Presentation and Visualisation of discovered patterns

1.Task-relevant data:
This is the database portion to be investigated. For example, suppose that you are a manager
of All Electronics in charge of sales in the United States and Canada. In particular, you would
like to study the buying trends of customers in Canada. Rather than mining on the entire
database, you can specify that only the data relevant to this task be retrieved; the attributes
involved are referred to as relevant attributes.

2. The kinds of knowledge to be mined:


This specifies the data mining functions to be performed, such as characterization,
discrimination, association, classification, clustering, or evolution analysis. For instance, if
studying the buying habits of customers in Canada, you may choose to mine associations
between customer profiles and the items that these customers like to buy

3.Background knowledge:
Users can specify background knowledge, or knowledge about the domain to be mined.
This knowledge is useful for guiding the knowledge discovery process, and for evaluating the
patterns found. There are several kinds of background knowledge.

4.Interestingness measures:
These functions are used to separate uninteresting patterns from knowledge. They may
be used to guide the mining process, or after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness measures.

5.Presentation and visualization of discovered patterns:


This refers to the form in which discovered patterns are to be displayed. Users can
choose from different forms for knowledge presentation, such as rules, tables, charts, graphs,
decision trees, and cubes.
3(i). What is data? How different type of data and attributes can be designed?
DATA:
➢ Collection of data objects and their attributes
✓ An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
➢ A collection of attributes describe an object
– Object is also known as record, point, case, sample, entity, or instance

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values

• Example: Attribute values for ID and age are integers


• But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum value

Types of Attributes
There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10),
grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio
•Examples: temperature in Kelvin, length, time, counts

(ii) Design and discuss in detail about Primitives for specifying a data mining task

Data Mining Primitives:

1.Task-relevant data:
This is the database portion to be investigated. For example, suppose that you are a manager
of All Electronics in charge of sales in the United States and Canada. In particular, you would
like to study the buying trends of customers in Canada. Rather than mining on the entire
database, you can specify that only the data relevant to this task be retrieved; the attributes
involved are referred to as relevant attributes.

2. The kinds of knowledge to be mined:


This specifies the data mining functions to be performed, such as characterization,
discrimination, association, classification, clustering, or evolution analysis. For instance, if
studying the buying habits of customers in Canada, you may choose to mine associations
between customer profiles and the items that these customers like to buy

3.Background knowledge:
Users can specify background knowledge, or knowledge about the domain to be mined.
This knowledge is useful for guiding the knowledge discovery process, and for evaluating the
patterns found. There are several kinds of background knowledge.

4.Interestingness measures:
These functions are used to separate uninteresting patterns from knowledge. They may
be used to guide the mining process, or after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness measures.
5.Presentation and visualization of discovered patterns:
This refers to the form in which discovered patterns are to be displayed. Users can
choose from different forms for knowledge presentation, such as rules, tables, charts, graphs,
decision trees, and cubes.

4(i). Discuss whether or not each of the following activities is a data mining task.
1. Credit card fraud detection using transaction records.
2. Dividing the customers of a company according to their gender.
3. Computing the total sales of a company
4. Predicting the future stock price of a company using historical records.
5.Monitoring seismic waves for earthquake activities.

1. Credit card fraud detection using transaction records.


Yes. Predict fraudulent cases in credit card transactions.

2. Dividing the customers of a company according to their gender.


No. This is a simple database query that separates customers by the value of an existing
attribute, not data mining. (Predicting some future attribute of a new customer, such as
profitability, would be data mining.)

3. Computing the total sales of a company


No. Again, this is simple accounting.

4. Predicting the future stock price of a company using historical records.


Yes. We would attempt to create a model that can predict the continuous value of the
stock price. This is an example of the area of data mining known as predictive modelling. We
could use regression for this modelling, although researchers in many fields have developed a
wide variety of techniques for predicting time series.

5.Monitoring seismic waves for earthquake activities.


Yes. In this case, we would build a model of different types of seismic wave behavior
associated with earthquake activities and raise an alarm when one of these different types of
seismic activity was observed. This is an example of the area of data mining known as
classification.

(ii) Discuss on descriptive and predictive data mining tasks with illustrations
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions. In
some cases, users may have no idea of which kinds of patterns in their data may be interesting,
and hence may like to search for several different kinds of patterns in parallel. Thus it is
important to have a data mining system that can mine multiple kinds of patterns to
accommodate different user expectations or applications. Furthermore, data mining systems
should be able to discover patterns at various granularities. To encourage interactive and
exploratory mining, users should be able to easily “play” with the output patterns, such as by
mouse clicking. Operations that can be specified by simple mouse clicks include adding or
dropping a dimension (or an attribute), swapping rows and columns (pivoting, or axis rotation),
changing dimension representations (e.g., from a 3-D cube to a sequence of 2-D cross
tabulations, or crosstabs), or using OLAP roll-up or drill-down operations along dimensions.
Such operations allow data patterns to be expressed from different angles of view and at
multiple levels of abstraction. Data mining systems should also allow users to specify hints to
guide or focus the search for interesting patterns. Since some patterns may not hold for all of
the data in the database, a measure of certainty or “trustworthiness” is usually associated with
each discovered pattern.

5(i). State and Explain the various classification of data mining systems with example.
Classification of data mining systems:
There are many data mining systems available or being developed. Some are
specialized systems dedicated to a given data source or are confined to limited data mining
functionalities, while others are more versatile and comprehensive. Data mining systems can be
categorized according to various criteria; among others are the following classifications:
➢ Classification according to the type of data source mined
➢ Classification according to the data model drawn on
➢ Classification according to the kind of knowledge discovered
➢ Classification according to mining techniques used

Classification according to the type of data source mined:


This classification categorizes data mining systems according to the type of data handled
such as spatial data, multimedia data, time-series data, text data, World Wide Web, etc.

Classification according to the data model drawn on:


This classification categorizes data mining systems based on the data model involved such
as relational database, object-oriented database, data warehouse, transactional, etc.

Classification according to the kind of knowledge discovered:


This classification categorizes data mining systems based on the kind of knowledge
discovered or data mining functionalities, such as characterization, discrimination, association,
classification, clustering, etc. Some systems tend to be comprehensive systems offering several
data mining functionalities together.

Classification according to mining techniques used:


Data mining systems employ and provide different techniques. This classification
categorizes data mining systems according to the data analysis approach used such as machine
learning, neural networks, genetic algorithms, statistics, visualization, database oriented or data
warehouse-oriented, etc. The classification can also take into account the degree of user
interaction involved in the data mining process such as query-driven systems, interactive
exploratory systems, or autonomous systems. A comprehensive system would provide a wide
variety of data mining techniques to fit different situations and options, and offer different
degrees of user interaction.

(ii) Explain the various data mining functionalities in detail.


DATA MINING FUNCTIONALITIES:
Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks. Data mining tasks can be classified into two categories: descriptive and
predictive. Descriptive mining tasks characterize the general properties of the data in the
database. Predictive mining tasks perform inference on the current data in order to make
predictions.

1.Concept/Class Description: Characterization and Discrimination


Data can be associated with classes or concepts. For example, in the All Electronics
store, classes of items for sale include computers and printers, and concepts of customers
include big Spenders and budget Spenders. It can be useful to describe individual classes and
concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a
concept are called class/concept descriptions. These descriptions can be derived via
(1) data characterization, by summarizing the data of the class under study (often called
the target class) in general terms.
(2) data discrimination, by comparison of the target class with one or a set of
comparative classes (often called the contrasting classes), or
(3) both data characterization and discrimination.

Data characterization is a summarization of the general characteristics or features of


a target class of data. The data corresponding to the user-specified class are typically collected
by a database query. The output of data characterization can be presented in various forms.
Examples include pie charts, bar charts, curves, multidimensional data cubes, and
multidimensional tables, including crosstabs.
Data discrimination is a comparison of the general features of target class data objects
with the general features of objects from one or a set of contrasting classes. The target and
contrasting classes can be specified by the user, and the corresponding data objects retrieved
through database queries.
“How are discrimination descriptions output?”
Discrimination descriptions expressed in rule form are referred to as discriminant rules.

2. Mining Frequent Patterns, Associations, and Correlations


Frequent patterns, as the name suggests, are patterns that occur frequently in data.
There are many kinds of frequent patterns, including item sets, subsequences, and
substructures.
A frequent item set typically refers to a set of items that frequently appear together in
a transactional data set, such as Computer and Software. A frequently occurring subsequence,
such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and
then a memory card, is a (frequent) sequential pattern.
Example:
Association analysis. Suppose, as a marketing manager of All Electronics, you would
like to determine which items are frequently purchased together within the same transactions.
An example of such a rule, mined from the All Electronics transactional database, is
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
where X is a variable representing a customer. A confidence, or certainty, of 50% means that
if a customer buys a computer, there is a 50% chance that she will buy software as well. A 1%
support means that 1% of all of the transactions under analysis showed that computer and
software were purchased together. This association rule involves a single attribute or predicate
(i.e., buys) that repeats. Association rules that contain a single predicate are referred to as single
dimensional association rules. Dropping the predicate notation, the above rule can be written
simply as “computer ⇒ software [1%, 50%]”.
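
A small sketch of how support and confidence for such a rule can be computed from a toy list of transactions (the transactions below are invented for illustration):

# rule considered: buys(computer) => buys(software)
transactions = [{"computer", "software"}, {"computer"}, {"printer"},
                {"computer", "software", "printer"}, {"software"}]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
with_computer = sum(1 for t in transactions if "computer" in t)

support = both / n                  # fraction of all transactions containing both items
confidence = both / with_computer   # fraction of computer buyers who also buy software
print(support, confidence)          # 0.4 and 0.666... for this toy data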
➢ Classification and Prediction
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the model to predict
the class of objects whose class label is unknown. The derived model is based on the analysis
of a set of training data (i.e., data objects whose class label is known).

A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute
value, each branch represents an outcome of the test, and tree leaves represent classes or class
distributions. Decision trees can easily be converted to classification rules
A neural network, when used for classification, is typically a collection of neuron-like
processing units with weighted connections between the units. There are many other methods
for constructing classification models, such as naïve Bayesian classification, support vector
machines, and k-nearest neighbor classification. Whereas classification predicts categorical
(discrete, unordered) labels, prediction models continuous-valued functions. That is, it is used
to predict missing or unavailable numerical data values rather than class labels. Although the
term prediction may refer to both numeric prediction and class label prediction, it is usually
confined to numeric prediction.
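
A minimal scikit-learn sketch of building such a classification model from labeled training data; the features, labels, and the choice of a decision tree are assumptions made only for illustration:

from sklearn.tree import DecisionTreeClassifier

# training data: [age, income] tuples with a known class label (hypothetical values)
X_train = [[25, 30000], [40, 80000], [35, 60000], [22, 20000], [50, 90000]]
y_train = ["budgetSpender", "bigSpender", "bigSpender", "budgetSpender", "bigSpender"]

model = DecisionTreeClassifier(max_depth=3)   # learn a flow-chart-like tree from the data
model.fit(X_train, y_train)

# use the model to predict the class of an object whose class label is unknown
print(model.predict([[30, 55000]]))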
➢ Cluster Analysis
Classification and prediction analyze class-labeled data objects, whereas
clustering analyzes data objects without consulting a known class label.
➢ Outlier Analysis
A database may contain data objects that do not comply with the general behavior
or model of the data. These data objects are outliers. Most data mining methods discard
outliers as noise or exceptions. However, in some applications such as fraud detection, the
rare events can be more interesting than the more regularly occurring ones. The analysis of
outlier data is referred to as outlier mining.
➢ Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects
whose behavior changes over time. Although this may include characterization,
discrimination, association and correlation analysis, classification, prediction, or clustering
of time related data, distinct features of such an analysis include time-series data analysis,
Sequence or periodicity pattern matching, and similarity-based data analysis.

6. Suppose that the data for analysis include the attribute age. The age values for the
data tuples are 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35,
36, 40, 45, 46, 52, 70.

(i).use smoothing by bin means to smooth the above data using a bin depth
of 3. Illustrate your steps.

Step 1: Sort the data. (This step is not required here as the data are already sorted.)
Step 2: Partition the data into equidepth bins of depth 3.
Bin 1: 13, 15, 16 Bin 2: 16, 19, 20
Bin 3: 20, 21, 22 Bin 4: 22, 25, 25
Bin 5: 25, 25, 30 Bin 6: 33, 33, 35
Bin 7: 35, 35, 35 Bin 8: 36, 40, 45
Bin 9: 46, 52, 70

Step 3: Calculate the arithmetic mean of each bin.


Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 14 2/3, 14 2/3, 14 2/3 Bin 2: 18 1/3, 18 1/3, 18 1/3
Bin 3: 21, 21, 21 Bin 4: 24, 24, 24
Bin 5: 26 2/3, 26 2/3, 26 2/3 Bin 6: 33 2/3, 33 2/3, 33 2/3
Bin 7: 35, 35, 35 Bin 8: 40 1/3, 40 1/3, 40 1/3
Bin 9: 56, 56, 56

This method smooths a sorted data value by consulting its “neighborhood”. It
performs local smoothing.
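
The same smoothing can be sketched in a few lines of Python (a rough illustration of equi-depth binning with bin means; the data list is the one from the question):

data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
depth = 3

data.sort()   # already sorted here; kept for completeness
for i in range(0, len(data), depth):
    bin_values = data[i:i + depth]
    mean = round(sum(bin_values) / len(bin_values), 2)
    # every value in the bin is replaced by the bin mean
    print(bin_values, "->", [mean] * len(bin_values))

Running this reproduces the bin means above, e.g. the first bin 13, 15, 16 becomes 14.67, 14.67, 14.67.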

(ii) Classify the various methods for data smoothing.


Other methods that can be used for data smoothing include alternate forms of binning
such as smoothing by bin medians or smoothing by bin boundaries. Alternatively, equi-width
bins can be used to implement any of the forms of binning, where the interval range of values
in each bin is constant. Methods other than binning include using regression techniques to
smooth the data by fitting it to a function such as through linear or multiple regression. Also,
classification techniques can be used to implement concept hierarchies that can smooth the data
by rolling-up lower level concepts to higher-level concepts.

7. Sketch the various phases of data mining and explain the different steps involved in pre-
processing with their significance before mining, Give an example for each process.

The different stages of Data Mining are broadly classified as follows:-


✓ Data Cleaning
✓ Integration of Data
✓ Selection of Data
✓ Data Transformation
✓ Data mining
Data Cleaning:
Data Cleaning is an important step in the Data Mining process. It involves the filtering
of the unwanted data components and keeping the relevant ones. All the relevant data is
eventually combined together to gain a clear picture.
Integration of Data:
This happens to be the second important step in the process. All the data are collected
from various zones. All the data are then stored in different formats such as Texts, Images,
documents.
Selection of Data:
The third step in the data mining process talks about the data selection. It helps you to
select the data as per your requirement for analysis.
Data Transformation:
This stage comes before the Data Mining stage. Here the Data is transformed from one
form to another form.
Data Mining:
This is the final stage which involves the evaluation of patterns from the data obtained
and representation of knowledge.
DATA PRE-PROCESSING STEPS:

1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
Data cleaning:

Data cleaning is the process of removing incorrect, incomplete, and inaccurate
data from the datasets; it also replaces missing values. There are several techniques used in
data cleaning.

Handling missing values:

• Standard values like “Not Available” or “NA” can be used to replace the
missing values.
• Missing values can also be filled manually but it is not recommended when that
dataset is big.
• The attribute’s mean value can be used to replace the missing value when the
data is normally distributed; in the case of a non-normal distribution, the median
value of the attribute can be used instead.
• When using regression or decision tree algorithms, the missing value can be
replaced by the most probable value.

Noisy:
Noisy data generally means data containing random errors or unnecessary data points.
Here are some of the methods to handle noisy data.

• Binning:

This method is used to smooth or handle noisy data. First, the data is sorted,
and then the sorted values are separated and stored in the form of bins.
There are three methods for smoothing data in the bins. Smoothing by bin
means: the values in the bin are replaced by the mean value of the bin.
Smoothing by bin median: the values in the bin are replaced by the median
value. Smoothing by bin boundaries: the minimum and maximum values of
the bin are taken as boundaries, and each value is replaced by the closest
boundary value.

• Regression:

This is used to smooth the data and helps to handle data when
unnecessary data points are present. For analysis purposes, regression helps to decide
which variable is suitable for our analysis.

• Clustering:

This is used for finding the outliers and also in grouping the data.
Clustering is generally used in unsupervised learning.
Data integration:

The process of combining multiple sources into a single dataset. The Data integration
process is one of the main components in data management. There are some problems to be
considered during data integration.

• Schema integration:

Integrates metadata(a set of data that describes other data) from different
sources.

• Entity identification problem:

Identifying entities from multiple databases. For example, the system or the user
should know that student_id in one database and student_name in another database belong
to the same entity.

• Detecting and resolving data value concepts:

The data taken from different databases while merging may differ. Like the
attribute values from one database may differ from another database. For example, the
date format may differ like “MM/DD/YYYY” or “DD/MM/YYYY”.

Data reduction:

This process helps in the reduction of the volume of the data which makes the analysis
easier yet produces the same or almost the same result. This reduction also helps to reduce
storage space. There are some of the techniques in data reduction are Dimensionality reduction,
Numerosity reduction, Data compression.

• Dimensionality reduction:

This process is necessary for real-world applications as the data size is big. In
this process, the reduction of random variables or attributes is done so that the
dimensionality of the data set can be reduced. Combining and merging the attributes of
the data without losing its original characteristics. This also helps in the reduction of
storage space and computation time is reduced. When the data is highly dimensional
the problem called “Curse of Dimensionality” occurs.

• Numerosity Reduction:

In this method, the representation of the data is made smaller by reducing the
volume. There will not be any loss of data in this reduction.

• Data compression:

The compressed form of data is called data compression. This compression can
be lossless or lossy. When there is no loss of information during compression it is called
lossless compression. Whereas lossy compression reduces information but it removes
only the unnecessary information.
Data Transformation:

The change made in the format or the structure of the data is called data transformation.
This step can be simple or complex based on the requirements. There are some methods in data
transformation.

• Smoothing:

With the help of algorithms, we can remove noise from the dataset, which helps in
identifying the important features of the dataset. By smoothing we can detect even a small
change that helps in prediction.

• Aggregation:

In this method, the data is stored and presented in the form of a summary. Data
integrated from multiple sources is summarized for data analysis. This
is an important step since the accuracy of the data depends on the quantity and
quality of the data. When the quality and the quantity of the data are good, the results
are more relevant.

• Discretization:

The continuous data here is split into intervals. Discretization reduces the data
size. For example, rather than specifying the class time, we can set an interval like (3
pm-5 pm, 6 pm-8 pm).

• Normalization:

It is the method of scaling the data so that it can be represented in a smaller


range. Example ranging from -1.0 to 1.0.
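
A brief pandas sketch of the discretization step from the list above (the class-time values, bin edges, and interval labels are chosen only for illustration):

import pandas as pd

# class start times on a 24-hour clock (hypothetical values)
class_times = pd.Series([15.5, 16.25, 17.0, 18.5, 19.75])

# split the continuous values into two intervals, reducing them to coarse labels
intervals = pd.cut(class_times, bins=[15, 17, 20], labels=["3 pm-5 pm", "5 pm-8 pm"])
print(intervals)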

8.Describe in detail about the issues of data mining.

Major issues in data mining:


Major issues in data mining concern mining methodology, user interaction,
performance, and diverse data types.

1.Mining methodology and user-interaction issues:


Mining different kinds of knowledge in databases: Since different users can be
interested in different kinds of knowledge, data mining should cover a wide spectrum of data
analysis and knowledge discovery tasks, including data characterization, discrimination,
association, classification, clustering, trend and deviation analysis, and similarity analysis.
These tasks may use the same database in different ways and require the development of
numerous data mining techniques.
➢ Interactive mining of knowledge at multiple levels of abstraction:
Since it is difficult to know exactly what can be discovered within a database,
the data mining process should be interactive.
➢ Incorporation of background knowledge:
Background knowledge, or information regarding the domain under study, may
be used to guide the discovery patterns. Domain knowledge related to databases, such
as integrity constraints and deduction rules, can help focus and speed up a data mining
process, or judge the interestingness of discovered patterns.
➢ Data mining query languages and ad-hoc data mining:
Just as relational query languages (such as SQL) allow users to pose ad hoc
queries for data retrieval, data mining query languages are required to allow users to
pose ad hoc data mining tasks.
➢ Presentation and visualization of data mining results:
Discovered knowledge should be expressed in high-level languages, visual
representations, so that the knowledge can be easily understood and directly usable by
humans
➢ Handling outlier or incomplete data:
The data stored in a database may reflect outliers: noise, exceptional cases, or
incomplete data objects. These objects may confuse the analysis process, causing over
fitting of the data to the knowledge model constructed. As a result, the accuracy of the
discovered patterns can be poor. Data cleaning methods and data analysis methods
which can handle outliers are required.
➢ Pattern evaluation: refers to interestingness of pattern:
A data mining system can uncover thousands of patterns. Many of the patterns
discovered may be uninteresting to the given user, representing common knowledge or
lacking novelty. Several challenges remain regarding the development of techniques to
assess the interestingness of discovered patterns.

2. Performance issues:
These include efficiency, scalability, and parallelization of data mining algorithms.

➢ Efficiency and scalability of data mining algorithms:


To effectively extract information from a huge amount of data in databases, data
mining algorithms must be efficient and scalable.
➢ Parallel, distributed, and incremental updating algorithms:
Such algorithms divide the data into partitions, which are processed in parallel.
The results from the partitions are then merged.

3. Issues relating to the diversity of database types:


➢ Handling of relational and complex types of data:
Since relational databases and data warehouses are widely used, the
development of efficient and effective data mining systems for such data is important.
➢ Mining information from heterogeneous databases and global information
systems:
Local and wide-area computer networks (such as the Internet) connect many
sources of data, forming huge, distributed, and heterogeneous databases. The discovery
of knowledge from different sources of structured, semi-structured, or unstructured data
with diverse data semantics poses great challenges to data mining.

9.Describe in detail about data reduction in data pre-processing


DATA REDUCTION:
Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume, yet closely maintains the integrity of the original data. That
is, mining on the reduced data set should be more efficient yet produce the same (or almost the
same) analytical results.
Strategies for data reduction include the following.
➢ Data cube aggregation:
where aggregation operations are applied to the data in the construction of a
data cube.
➢ Dimension reduction:
where irrelevant, weakly relevant or redundant attributes or dimensions may be
detected and removed.
➢ Data compression:
where encoding mechanisms are used to reduce the data set size.
➢ Numerosity reduction:
where the data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need store only the model parameters
instead of the actual data), or nonparametric methods such as clustering, sampling, and
the use of histograms.
➢ Discretization and concept hierarchy generation:
where raw data values for attributes are replaced by ranges or higher conceptual
levels. Concept hierarchies allow the mining of data at multiple levels of abstraction,
and are a powerful tool for data mining.

Data Cube Aggregation:


• The lowest level of a data cube
– the aggregated data for an individual entity of interest
– e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
– Further reduce the size of data to deal with
• Reference appropriate levels
– Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered
– using the data cube, when possible

Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
– Select a minimum set of features such that the probability distribution of different
classes given the values for those features is as close as possible to the original distribution
given the values of all features.
– reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand

Heuristic methods:

1. Step-wise forward selection:


The procedure starts with an empty set of attributes. The best of the
original attributes is determined and added to the set. At each subsequent
iteration or step, the best of the remaining original attributes is added to the set.
2. Step-wise backward elimination:
The procedure starts with the full set of attributes. At each step, it
removes the worst attribute remaining in the set.
3. Combination forward selection and backward elimination:
The step-wise forward selection and backward elimination methods can
be combined, where at each step one selects the best attribute and removes the
worst attribute.

4. Decision tree induction:


Decision tree algorithms, such as ID3 and C4.5, were originally intended
for classification. Decision tree induction constructs a flow-chart-like structure
where each internal (non-leaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node denotes a
class prediction. At each node, the algorithm chooses the “best" attribute to
partition the data into individual classes.

Data compression:
In data compression, data encoding or transformations are applied so as to obtain a
reduced or ”compressed" representation of the original data. If the original data can be
reconstructed from the compressed data without any loss of information, the data compression
technique used is called lossless. If, instead, we can reconstruct only an approximation of the
original data, then the data compression technique is called lossy. The two popular and
effective methods of lossy data compression: wavelet transforms, and principal components
analysis.

Numerosity reduction :
Regression and log-linear models :
Regression and log-linear models can be used to approximate the given data. In linear
regression, the data are modeled to fit a straight line. For example, a random variable, Y (called
a response variable), can be modeled as a linear function of another random variable, X (called
a predictor variable), with the equation Y = α + βX, where the variance of Y is assumed to be
constant. The coefficients α and β can be solved for by the method of least squares, which
minimizes the error between the actual line separating the data and the estimate of the line.

Multiple regression is an extension of linear regression allowing a response variable


Y to be modeled as a linear function of a multidimensional feature vector.
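
A minimal sketch of fitting such a linear model with numpy's least-squares polynomial fit, so that only two coefficients need to be stored instead of all the raw points (the x and y values are invented):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # predictor variable X (hypothetical)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # response variable Y (hypothetical)

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares fit of y = slope * x + intercept
print(slope, intercept)                     # these two numbers stand in for the raw data
print(slope * 6.0 + intercept)              # estimate Y for an unseen X value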

Log-linear models approximate discrete multidimensional probability distributions.


The method can be used to estimate the probability of each cell in a base cuboid for a set of
discretized attributes, based on the smaller cuboids making up the data cube lattice

Histograms :
A histogram for an attribute A partitions the data distribution of A into disjoint subsets,
or buckets. The buckets are displayed on a horizontal axis, while the height (and area) of a
bucket typically reflects the average frequency of the values represented by the bucket.

Equi-width:
In an equi-width histogram, the width of each bucket range is constant (such as a
width of $10 per bucket).
Equi-depth (or equi-height):
In an equi-depth histogram, the buckets are created so that, roughly, the frequency of
each bucket is constant (that is, each bucket contains roughly the same number of contiguous
data samples).
V-Optimal:
If we consider all of the possible histograms for a given number of buckets, the V-
optimal histogram is the one with the least variance. Histogram variance is a weighted sum of
the original values that each bucket represents, where bucket weight is equal to the number of
values in the bucket.
MaxDiff:
In a MaxDiff histogram, we consider the difference between each pair of adjacent
values. A bucket boundary is established between each pair for the β−1 pairs having the largest
differences, where β is user-specified.
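
A short numpy sketch of building an equi-width histogram for a price attribute (the price values are hypothetical):

import numpy as np

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

counts, edges = np.histogram(prices, bins=3)   # 3 buckets of equal width
print(edges)    # bucket boundaries, e.g. [ 4. 14. 24. 34.]
print(counts)   # number of values falling in each bucket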
10. Describe in detail about various data transformation techniques
Data transformation:
In data transformation, the data are transformed or consolidated into forms appropriate
for mining. Data transformation can involve the following:

➢ Normalization:
where the attribute data are scaled so as to fall within a small specified range, such
as -1.0 to 1.0, or 0 to 1.0. There are three main methods for data normalization : min-max
normalization, z-score normalization, and normalization by decimal scaling.
(i). Min-max normalization
performs a linear transformation on the original data. Suppose that minA and
maxA are the minimum and maximum values of an attribute A. Min-max normalization
maps a value v of A to v′ in the range [new_minA, new_maxA] by computing
v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA.
(ii). z-score normalization (or zero-mean normalization)
the values for an attribute A are normalized based on the mean and standard
deviation of A. A value v of A is normalized to v′ by computing
v′ = (v − meanA) / std_devA,
where meanA and std_devA are the mean and standard deviation, respectively, of attribute A. This
method of normalization is useful when the actual minimum and maximum of attribute
A are unknown, or when there are outliers which dominate the min-max normalization.

(iii). Normalization by decimal scaling
normalizes by moving the decimal point of values of attribute A. The number
of decimal points moved depends on the maximum absolute value of A. A value v of
A is normalized to v′ by computing v′ = v / 10^j, where j is the smallest integer such that
max(|v′|) < 1.
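
A small numpy sketch showing z-score normalization and decimal scaling side by side (the salary values are illustrative):

import numpy as np

salary = np.array([12000.0, 35000.0, 73600.0, 98000.0])

# z-score normalization: subtract the mean, divide by the standard deviation
z = (salary - salary.mean()) / salary.std()

# decimal scaling: divide by 10^j, where j is the smallest integer with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(salary).max() + 1)))
decimal_scaled = salary / (10 ** j)

print(z)               # values centred on 0 with unit standard deviation
print(decimal_scaled)  # all values now lie strictly between -1 and 1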

➢ Smoothing:
which works to remove noise from the data. Such techniques include binning,
clustering, and regression.

(i). Binning methods:


Binning methods smooth a sorted data value by consulting the “neighborhood”,
or values around it. The sorted values are distributed into a number of “buckets”, or bins.
Because binning methods consult the neighborhood of values, they perform local
smoothing. The example below illustrates some binning techniques.

In this example, the data for price are first sorted and partitioned into equi-depth bins
(of depth 3). In smoothing by bin means, each value in a bin is replaced by the mean value of
the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be
employed, in which each bin value is replaced by the bin median. In smoothing by bin
boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.

(i).Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
(ii).Partition into (equi-depth) bins of depth 3:
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34
(iii).Smoothing by bin means:
- Bin 1: 9, 9, 9,
- Bin 2: 22, 22, 22
- Bin 3: 29, 29, 29
(iv).Smoothing by bin boundaries:
• Bin 1: 4, 4, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 25, 34

(ii). Clustering:
Outliers may be detected by clustering, where similar values are organized into
groups or “clusters”. Intuitively, values which fall outside of the set of clusters may be
considered outliers.

Outliers may be detected by clustering analysis.
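
A rough sketch of this idea using scikit-learn's KMeans on one-dimensional toy data; treating any value that ends up in a tiny cluster of its own as a potential outlier is a crude heuristic chosen here only for illustration:

import numpy as np
from sklearn.cluster import KMeans

# toy values; 200 lies far away from the two natural groups
values = np.array([21, 22, 23, 24, 52, 53, 54, 200], dtype=float).reshape(-1, 1)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)

# a value whose cluster contains only itself falls outside the main clusters
sizes = np.bincount(km.labels_)
print(values.ravel()[sizes[km.labels_] == 1])   # expected: [200.]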

➢ Aggregation:
where summary or aggregation operations are applied to the data. For example,
the daily sales data may be aggregated so as to compute monthly and annual total
amounts.
➢ Generalization of the data:
where low level or 'primitive' (raw) data are replaced by higher level concepts
through the use of concept hierarchies. For example, categorical attributes, like street,
can be generalized to higher level concepts, like city or county.
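
Relating to the aggregation item in the list above, here is a brief pandas sketch that rolls daily sales up into monthly totals (the dates and figures are made up):

import pandas as pd

daily = pd.DataFrame({
    "date":  pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-18"]),
    "sales": [200, 350, 150, 400],
})

# aggregate daily records into monthly totals
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)   # one summarized row per month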

11. List and explain the primitives for specifying a data mining task.

Five primitives for specifying a data mining task:


➢ Task-relevant data
➢ Knowledge type to be mined
➢ Background Knowledge
➢ Pattern Interestingness measure
➢ Visualization of discovered patterns

Task-relevant data:
This primitive specifies the data upon which mining is to be performed. It involves
specifying the database and tables or data warehouse containing the relevant data, conditions
for selecting the relevant data, the relevant attributes or dimensions for exploration, and
instructions regarding the
ordering or grouping of the data retrieved.

Knowledge type to be mined:


This primitive specifies the specific data mining function to be performed, such as
characterization, discrimination, association, classification, clustering, or evolution analysis.
As well, the user can be more specific and provide pattern templates that all discovered patterns
must match. These templates or meta patterns (also called meta rules or meta queries), can be
used to guide the discovery process.

Background knowledge:
This primitive allows users to specify knowledge they have about the domain to be
mined. Such knowledge can be used to guide the knowledge discovery process and evaluate
the patterns that are found. Of the several kinds of background knowledge, this chapter focuses
on concept hierarchies.

Pattern interestingness measure:


This primitive allows users to specify functions that are used to separate uninteresting
patterns from knowledge and may be used to guide the mining process, as well as to evaluate
the discovered patterns. This allows the user to confine the number of uninteresting patterns
returned by the process, as a data mining process may generate a large number of patterns.
Interestingness measures can be specified for such pattern characteristics as simplicity,
certainty, utility and novelty.

Visualization of discovered patterns:


This primitive refers to the form in which discovered patterns are to be displayed. In
order for data mining to be effective in conveying knowledge to users, data mining systems
should be able to display the discovered patterns in multiple forms such as rules, tables, cross
tabs (cross-tabulations), pie or bar charts, decision trees, cubes or other visual representations.

12(i). How will you handle missing values in a dataset before the mining process?

• Ignore the records with missing values.


• Replace them with a global constant (e.g., “?”).
• Fill in missing values manually based on your domain knowledge.
• Replace them with the variable mean (if numerical) or the most frequent value (if
categorical).
• Use modeling techniques such as nearest neighbors, Bayes’ rule, decision tree, or EM
algorithm.
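
A minimal Python sketch of two of these strategies, mean imputation for a numerical attribute and mode imputation for a categorical one; the sample lists are invented for the example.

```python
# Mean imputation (numeric) and mode imputation (categorical) for missing values.
from statistics import mean, mode

ages   = [23, 45, None, 31, None, 50]            # numerical attribute with gaps
cities = ["Paris", None, "Lyon", "Paris", None]  # categorical attribute with gaps

age_mean  = mean(v for v in ages if v is not None)
city_mode = mode(v for v in cities if v is not None)

ages_filled   = [age_mean if v is None else v for v in ages]
cities_filled = [city_mode if v is None else v for v in cities]

print(ages_filled)    # [23, 45, 37.25, 31, 37.25, 50]
print(cities_filled)  # ['Paris', 'Paris', 'Lyon', 'Paris', 'Paris']
```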

(ii) Give the architecture of a typical data mining system

Architecture of a typical data mining system:

The architecture of a typical data mining system may have the following major
components

➢ Database, data warehouse, or other information repository:


This is one or a set of databases, data warehouses, spread sheets, or other kinds of
information repositories. Data cleaning and data integration techniques may be performed
on the data.

➢ Database or data warehouse server:


The database or data warehouse server is responsible for fetching the relevant
data, based on the user's data mining request.

➢ Knowledge base:
This is the domain knowledge that is used to guide the search, or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern's interestingness
based on its unexpectedness, may also be included.

➢ Data mining engine:


This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as characterization, association analysis,
classification, evolution and deviation analysis.

➢ Pattern evaluation module:


This component typically employs interestingness measures and interacts with
the data mining modules so as to focus the search towards interesting patterns. It may
access interestingness thresholds stored in the knowledge base. Alternatively, the
pattern evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used.

➢ Graphical user interface:


This module communicates between users and the data mining system, allowing
the user to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based on
the intermediate data mining results.
Architecture of a typical data mining system

13(i). Explain how integration is done with a database or data warehouse system.

DB and DW systems, possible integration schemes include no coupling, loose


coupling, semi-tight coupling, and tight coupling. We examine each of these schemes, as
follows:

1.No coupling: No coupling means that a DM system will not utilize any function of a
DB or DW system. It may fetch data from a particular source (such as a file system), process
data using some data mining algorithms, and then store the mining results in another file.

2.Loose coupling: Loose coupling means that a DM system will use some facilities of
a DB or DW system, fetching data from a data repository managed by these systems,
performing data mining, and then storing the mining results either in a file or in a designated
place in a database or data Warehouse. Loose coupling is better than no coupling because it
can fetch any portion of data stored in databases or data warehouses by using query processing,
indexing, and other system facilities. However, many loosely coupled mining systems are main
memory-based. Because mining does not explore data structures and query optimization
methods provided by DB or DW systems, it is difficult for loose coupling to achieve high
scalability and good performance with large data sets.

3.Semitight coupling: Semi-tight coupling means that besides linking a DM system to


a DB/DW system, efficient implementations of a few essential data mining primitives
(identified by the analysis of frequently encountered data mining functions) can be provided in
the DB/DW system. These primitives can include sorting, indexing, aggregation, histogram
analysis, multi-way join, and precomputation of some essential statistical measures, such as
sum, count, max, min, and standard deviation.

4.Tight coupling: Tight coupling means that a DM system is smoothly integrated into
the DB/DW system. The data mining subsystem is treated as one functional component of
information system. Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods of a DB or DW
system

(ii) Consider the following data for the attribute AGE:4,8,21,5,21,24,34,28,25. Perform
smoothing by bin means and bin boundaries using a bin depth of 3

Given:
AGE: 4,8,21,5,21,24,34,28,25

STEP 1: Sort the age.


4,5,8,21,21,24,25,28,34
STEP 2: Partition into equi-depth bins (of depth 3):
• Bin 1: 4 5 8
• Bin 2: 21 21 24
• Bin 3: 25 28 34
STEP 3: Smoothing by bin means:
• Bin 1: 5.67 5.67 5.67 (the mean of 4, 5, and 8 is 17/3 ≈ 5.67)
• Bin 2: 22 22 22
• Bin 3: 29 29 29
STEP 4: Smoothing by bin boundaries:
• Bin 1: 4 4 8
• Bin 2: 21 21 24
• Bin 3: 25 25 34

14. Analyze Using Equi-depth binning method, partition the data given below into 4 bins
and perform smoothing according to the following methods.(8)
1. Smoothing by bin means
2. Smoothing by bin median
3. Smoothing by bin boundaries
24,25,26,27,28,56,67,70,70,75,78,89,89,90,91,94,95,96,100,102,103,107,109,112.

Given:
Data (already sorted): 24, 25, 26, 27, 28, 56, 67, 70, 70, 75, 78, 89, 89, 90, 91, 94, 95, 96, 100, 102, 103, 107, 109, 112

Partition into equi-depth bins (of depth 4):
• Bin 1: 24 25 26 27
• Bin 2: 28 56 67 70
• Bin 3: 70 75 78 89
• Bin 4: 89 90 91 94
• Bin 5: 95 96 100 102
• Bin 6: 103 107 109 112

1. Smoothing by bin means:
• Bin 1: 25.5 25.5 25.5 25.5
• Bin 2: 55.25 55.25 55.25 55.25
• Bin 3: 78 78 78 78
• Bin 4: 91 91 91 91
• Bin 5: 98.25 98.25 98.25 98.25
• Bin 6: 107.75 107.75 107.75 107.75
2.Smoothing by bin median:
• Bin 1: 25.5 25.5 25.5 25.5
• Bin 2: 61.5 61.5 61.5 61.5
• Bin 3: 76.5 76.5 76.5 76.5
• Bin 4: 90.5 90.5 90.5 90.5
• Bin 5: 98 98 98 98
• Bin 6: 108 108 108 108

3.Smoothing by bin boundaries:


• Bin 1: 24 24 27 27
• Bin 2: 28 70 70 70
• Bin 3: 70 70 70 89
• Bin 4: 89 89 89 94
• Bin 5: 95 95 102 102
• Bin 6: 103 103 112 112

(ii) What motivated data mining? Why is it important?


Data mining has attracted a great deal of attention in the information industry and in
society as a whole in recent years, due to the wide availability of huge amounts of data and the
imminent need for turning such data into information and knowledge. The information and
knowledge gained can be used for applications ranging from market analysis, fraud detection,
and customer retention, to production control and science exploration.

Data mining can be viewed as a result of the natural evolution of information


technology. The database system industry has witnessed an evolutionary path in the
development of the following functionalities: data collection and database
creation, data management (including data storage and retrieval, and database transaction
processing), and advanced data analysis (involving data warehousing and data mining).
Since the 1960s, database and information technology have been evolving
systematically from primitive file processing systems to sophisticated and powerful database
systems. The research and development in database system since the 1970s has progressed
from early hierarchical and network database systems to the development of relational database
systems.

Database technology since the 1980s has been characterized by the popular adoption of
relational technology and an upsurge of research and development activities on new and
powerful database systems. The steady and amazing progress of computer hardware
technology in the past three decades has led to large supplies of powerful and affordable
computers, data collection equipment, and storage media. This technology provides a great
boost to the database information industry and makes a huge number of databases and
repositories available for transaction management, information retrieval, and data analysis. The
abundance of data, coupled with the need for powerful data analysis tools, has been described
as a data-rich but information-poor situation. The fast-growing, tremendous amounts of data
collected and stored in large and numerous data repositories have far exceeded the human
ability for comprehension without powerful tools. As a result, data collected in large data
repositories become “data tombs” – data archives that are seldom visited.

Data mining tools perform data analysis and may uncover important data patterns,
contributing greatly to business strategies, knowledge bases, and scientific and medical
research. The widening gap between data and information calls for the systematic development
of data mining tools that will turn data tombs into “golden nuggets” of knowledge.

PART – C

1.Describe the Major issues in data warehousing and data mining.

THE MAJOR ISSUES IN DATA WAREHOUSING AND DATA MINING:


Data mining is not an easy task, as the algorithms used can get very complex and data
is not always available at one place. It needs to be integrated from various heterogeneous data
sources. These factors also create some issues. Here we will discuss the major
issues regarding −

• Mining Methodology and User Interaction


• Performance Issues
• Diverse Data Types Issues

Mining methodology and user interaction issues refer to the following kinds of issues:

• Mining different kinds of knowledge in databases − Different users may be


interested in different kinds of knowledge. Therefore it is necessary for data mining
to cover a broad range of knowledge discovery task.
• Interactive mining of knowledge at multiple levels of abstraction − The data
mining process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.
• Incorporation of background knowledge − To guide discovery process and to
express the discovered patterns, the background knowledge can be used.
Background knowledge may be used to express the discovered patterns not only in
concise terms but at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − Data Mining Query
language that allows the user to describe ad hoc mining tasks, should be integrated
with a data warehouse query language and optimized for efficient and flexible data
mining.
• Presentation and visualization of data mining results − Once the patterns are
discovered it needs to be expressed in high level languages, and visual
representations. These representations should be easily understandable.
• Handling noisy or incomplete data − The data cleaning methods are required to
handle the noise and incomplete objects while mining the data regularities. If the
data cleaning methods are not there then the accuracy of the discovered patterns will
be poor.
• Pattern evaluation − Many of the patterns discovered may be uninteresting because they
either represent common knowledge or lack novelty, so measures are needed to assess the
interestingness of discovered patterns.
There can be performance-related issues, such as the following:

• Efficiency and scalability of data mining algorithms − In order to effectively


extract the information from huge amount of data in databases, data mining
algorithm must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − The factors such as
huge size of databases, wide distribution of data, and complexity of data mining
methods motivate the development of parallel and distributed data mining
algorithms. These algorithms divide the data into partitions which is further
processed in a parallel fashion. Then the results from the partitions is merged. The
incremental algorithms, update databases without mining the data again from
scratch.

Diverse Data Types Issues

• Handling of relational and complex types of data − The database may contain
complex data objects, multimedia data objects, spatial data, temporal data etc. It is
not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information
systems − The data is available at different data sources on LAN or WAN. These
data source may be structured, semi structured or unstructured. Therefore mining the
knowledge from them adds challenges to data mining.
2.(i) What is interestingness of a pattern?
An interesting pattern represents knowledge. Several objective measures of pattern
interestingness exist. These are based on the structure of discovered patterns and the statistics
underlying them. An objective measure for association rules of the form X ⇒ Y is rule support,
representing the percentage of transactions from a transaction database that the given rule
satisfies.
This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction
contains both X and Y, that is, the union of itemsets X and Y. Another objective measure for
association rules is confidence, which assesses the degree of certainty of the detected
association. This is taken to be the conditional probability P(Y | X), that is, the probability that
a transaction containing X also contains Y. More formally, support and confidence are defined
as
support(X ⇒ Y) = P(X ∪ Y)          confidence(X ⇒ Y) = P(Y | X)
In general, each interestingness measure is associated with a threshold, which may be
controlled by the user. For example, rules that do not satisfy a confidence threshold of, say,
50% can be considered uninteresting. Rules below the threshold likely reflect noise,
exceptions, or minority cases and are probably of less value.
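
To make the two measures concrete, the following Python sketch computes support and confidence for a candidate rule over a tiny, made-up transaction list; the items and the rule X ⇒ Y are purely illustrative.

```python
# Support and confidence of a rule X => Y over a small transaction database.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "eggs"},
]

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    # conditional probability P(Y | X) = support(X u Y) / support(X)
    return support(x | y) / support(x)

X, Y = {"milk"}, {"bread"}
print(support(X | Y))    # P(X u Y) = 3/5 = 0.6
print(confidence(X, Y))  # P(Y | X) = 0.6 / 0.8 = 0.75
```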

(ii) Summarize the integration of data mining system with a data warehouse?

THE INTEGRATION OF DATA MINING SYSTEM WITH A DATA WAREHOUSE:


DB and DW systems, possible integration schemes include no coupling, loose
coupling, semi-tight coupling, and tight coupling. We examine each of these schemes, as
follows:
1.No coupling:
No coupling means that a DM system will not utilize any function of a DB or
DW system. It may fetch data from a particular source (such as a file system), process data
using some data mining algorithms, and then store the mining results in another file.
2.Loose coupling:
Loose coupling means that a DM system will use some facilities of a DB or DW system,
fetching data from a data repository managed by these systems, performing data mining, and
then storing the mining results either in a file or in a designated place in a database or data
Warehouse. Loose coupling is better than no coupling because it can fetch any portion of data
stored in databases or data warehouses by using query processing, indexing, and other system
facilities.
However, many loosely coupled mining systems are main memory-based. Because
mining does not explore data structures and query optimization methods provided by DB or
DW systems, it is difficult for loose coupling to achieve high scalability and good performance
with large data sets.
3.Semitight coupling:
Semitight coupling means that besides linking a DM system to a DB/DW system,
efficient implementations of a few essential data mining primitives (identified by the analysis
of frequently encountered data mining functions) can be provided in the DB/DW system. These
primitives can include sorting, indexing, aggregation, histogram analysis, multi way join, and
precomputation of some essential statistical measures, such as sum, count, max, min, and standard
deviation.
4.Tight coupling:
Tight coupling means that a DM system is smoothly integrated into the DB/DW system.
The data mining subsystem is treated as one functional component of information system. Data
mining queries and functions are optimized based on mining query analysis, data structures,
indexing schemes, and query processing methods of a DB or DW system.

3.List the major data pre-processing techniques and explain in detail with examples?
Data cleaning:

Data cleaning is the process to remove incorrect data, incomplete data and inaccurate
data from the datasets, and it also replaces the missing values. There are some techniques in
data cleaning

Handling missing values:

• Standard values like “Not Available” or “NA” can be used to replace the
missing values.
• Missing values can also be filled manually but it is not recommended when that
dataset is big.
• The attribute’s mean value can be used to replace the missing value when the
data is normally distributed, whereas in the case of a non-normal distribution the
median value of the attribute can be used.
• While using regression or decision tree algorithms, the missing value can be
replaced by the most probable value.

Noisy:
Noisy generally means random error or containing unnecessary data points.
Here are some of the methods to handle noisy data.

• Binning:

This method is used to smooth or handle noisy data. First, the data is sorted,
and then the sorted values are separated and stored in the form of bins.
There are three methods for smoothing data in a bin. Smoothing by bin
mean: the values in the bin are replaced by the mean value of the bin.
Smoothing by bin median: the values in the bin are replaced by the median
value. Smoothing by bin boundary: the minimum and maximum values of the
bin are taken as the bin boundaries, and each value is replaced by the closest
boundary value.

• Regression:

This is used to smooth the data and will help to handle data when
unnecessary data is present. For the analysis, purpose regression helps to decide
the variable which is suitable for our analysis.

• Clustering:

This is used for finding the outliers and also in grouping the data.
Clustering is generally used in unsupervised learning.

Data transformation:
In data transformation, the data are transformed or consolidated into forms appropriate
for mining. Data transformation can involve the following:

➢ Normalization:
where the attribute data are scaled so as to fall within a small specified range, such
as -1.0 to 1.0, or 0 to 1.0. There are three main methods for data normalization : min-max
normalization, z-score normalization, and normalization by decimal scaling.
(i).Min-max normalization
performs a linear transformation on the original data. Suppose that minA and
maxA are the minimum and maximum values of an attribute A. Min-max normalization
maps a value v of A to v' in the range [new_minA, new_maxA] by computing

v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA.

(ii).z-score normalization (or zero-mean normalization)


the values for an attribute A are normalized based on the mean and standard
deviation of A. A value v of A is normalized to v' by computing

v' = (v − meanA) / std_devA,

where meanA and std_devA are the mean and standard deviation, respectively, of attribute A. This
method of normalization is useful when the actual minimum and maximum of attribute
A are unknown, or when there are outliers which dominate the min-max normalization.

(iii). Normalization by decimal scaling


normalizes by moving the decimal point of values of attribute A. The number
of decimal places moved depends on the maximum absolute value of A. A value v of
A is normalized to v' by computing

v' = v / 10^j,

where j is the smallest integer such that max(|v'|) < 1.

➢ Smoothing:
which works to remove noise from the data. Such techniques include binning,
clustering, and regression.

(i). Binning methods:


Binning methods smooth a sorted data value by consulting the "neighborhood",
or values around it. The sorted values are distributed into a number of 'buckets', or bins.
Because binning methods consult the neighborhood of values, they perform local
smoothing. The following example illustrates some binning techniques.

In this example, the data for price are first sorted and partitioned into equi-depth bins
(of depth 3). In smoothing by bin means, each value in a bin is replaced by the mean value of
the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be
employed, in which each bin value is replaced by the bin median. In smoothing by bin
boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value.

(i).Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
(ii). Partition into equi-depth bins (of depth 3):
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34
(iii).Smoothing by bin means:
• Bin 1: 9, 9, 9
• Bin 2: 22, 22, 22
• Bin 3: 29, 29, 29
(iv).Smoothing by bin boundaries:
• Bin 1: 4, 4, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 25, 34

(ii). Clustering:
Outliers may be detected by clustering, where similar values are organized into
groups or “clusters”. Intuitively, values which fall outside of the set of clusters may be
considered outliers.
Outliers may be detected by clustering analysis.

➢ Aggregation:
where summary or aggregation operations are applied to the data. For example,
the daily sales data may be aggregated so as to compute monthly and annual total
amounts.
➢ Generalization of the data:
where low level or 'primitive' (raw) data are replaced by higher level concepts
through the use of concept hierarchies. For example, categorical attributes, like street,
can be generalized to higher level concepts, like city or county.

DATA REDUCTION:
Data reduction techniques can be applied to obtain a reduced representation of the data
set that is much smaller in volume, yet closely maintains the integrity of the original data. That
is, mining on the reduced data set should be more efficient yet produce the same (or almost the
same) analytical results.
Strategies for data reduction include the following.
➢ Data cube aggregation:
where aggregation operations are applied to the data in the construction of a
data cube.
➢ Dimension reduction:
where irrelevant, weakly relevant or redundant attributes or dimensions may be
detected and removed.
➢ Data compression:
where encoding mechanisms are used to reduce the data set size.
➢ Numerosity reduction:
where the data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need store only the model parameters
instead of the actual data), or nonparametric methods such as clustering, sampling, and
the use of histograms.
➢ Discretization and concept hierarchy generation:
where raw data values for attributes are replaced by ranges or higher conceptual
levels. Concept hierarchies allow the mining of data at multiple levels of abstraction,
and are a powerful tool for data mining.

Data Cube Aggregation:


• The lowest level of a data cube
– the aggregated data for an individual entity of interest
– e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
– Further reduce the size of data to deal with
• Reference appropriate levels
– Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered
– using the data cube, when possible

Dimensionality Reduction:
Feature selection (i.e., attribute subset selection):
– Select a minimum set of features such that the probability distribution of different
classes given the values for those features is as close as possible to the original distribution
given the values of all features.
– reduces the number of attributes appearing in the patterns, making them easier to understand
Heuristic methods:

1. Step-wise forward selection (see the sketch after this list):


The procedure starts with an empty set of attributes. The best of the
original attributes is determined and added to the set. At each subsequent
iteration or step, the best of the remaining original attributes is added to the set.
2. Step-wise backward elimination:
The procedure starts with the full set of attributes. At each step, it
removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination:
The step-wise forward selection and backward elimination methods can
be combined, where at each step one selects the best attribute and removes the
worst attribute.

4. Decision tree induction:


Decision tree algorithms, such as ID3 and C4.5, were originally intended
for classification. Decision tree induction constructs a flow-chart-like structure
where each internal (non-leaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node denotes a
class prediction. At each node, the algorithm chooses the “best" attribute to
partition the data into individual classes.
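
As an illustration of the forward-selection idea in item 1 above, here is a short Python sketch. The scoring function used, the sum of absolute correlations of the chosen attributes with the class label, is only a stand-in for whatever class-separation measure a real system would use, and the toy data set is an assumption.

```python
# Greedy step-wise forward selection over a dictionary of attribute columns.
from statistics import mean, pstdev

def pearson(xs, ys):
    mx, my, sx, sy = mean(xs), mean(ys), pstdev(xs), pstdev(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

def subset_score(data, label, subset):
    # crude relevance score: sum of |correlation with the class label|
    return sum(abs(pearson(data[a], label)) for a in subset)

def forward_selection(data, label, k):
    selected, remaining = [], set(data)
    while remaining and len(selected) < k:
        # add the attribute that improves the score the most
        best = max(remaining, key=lambda a: subset_score(data, label, selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

data = {
    "age":    [25, 32, 47, 51, 62],
    "income": [30, 42, 50, 65, 80],
    "noise":  [ 5,  1,  9,  2,  7],
}
label = [0, 0, 1, 1, 1]
print(forward_selection(data, label, k=2))   # ['age', 'income'] are chosen before 'noise'
```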

Data compression:
In data compression, data encoding or transformations are applied so as to obtain a
reduced or "compressed" representation of the original data. If the original data can be
reconstructed from the compressed data without any loss of information, the data compression
technique used is called lossless. If, instead, we can reconstruct only an approximation of the
original data, then the data compression technique is called lossy. The two popular and
effective methods of lossy data compression: wavelet transforms, and principal components
analysis.

Numerosity reduction :
Regression and log-linear models :
Regression and log-linear models can be used to approximate the given data. In linear
regression, the data are modeled to fit a straight line. For example, a random variable, Y (called
a response variable), can be modeled as a linear function of another random variable, X (called
a predictor variable), with the equation Y = α + βX, where the variance of Y is assumed to be constant.
The coefficients α and β can be solved for by the method of least squares, which minimizes the error
between the actual line separating the data and the estimate of the line.

Multiple regression is an extension of linear regression allowing a response variable


Y to be modeled as a linear function of a multidimensional feature vector.

Log-linear models approximate discrete multidimensional probability distributions.


The method can be used to estimate the probability of each cell in a base cuboid for a set of
discretized attributes, based on the smaller cuboids making up the data cube lattice
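
A compact Python sketch of the least-squares fit for the linear-regression case above; the (x, y) sample points are invented for illustration, and the point is that the raw pairs can be replaced by just the two model parameters.

```python
# Numerosity reduction by linear regression: store only the intercept a and slope b.
from statistics import mean

def least_squares(xs, ys):
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b   # model: Y = a + b * X

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]
a, b = least_squares(xs, ys)
print(round(a, 2), round(b, 2))   # roughly a ≈ 0.09, b ≈ 1.99
```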

Histograms :
A histogram for an attribute A partitions the data distribution of A into disjoint subsets,
or buckets. The buckets are displayed on a horizontal axis, while the height (and area) of a
bucket typically reflects the average frequency of the values represented by the bucket.

Equi-width:
In an equi-width histogram, the width of each bucket range is uniform (such as a
width of $10 for each price bucket).
Equi-depth (or equi-height):
In an equi-depth histogram, the buckets are created so that, roughly, the frequency of
each bucket is constant (that is, each bucket contains roughly the same number of contiguous
data samples).
V-Optimal:
If we consider all of the possible histograms for a given number of buckets, the V-
optimal histogram is the one with the least variance. Histogram variance is a weighted sum of
the original values that each bucket represents, where bucket weight is equal to the number of
values in the bucket.
MaxDiff:
In a MaxDiff histogram, we consider the difference between each pair of adjacent
values. A bucket boundary is established between each pair for the β−1 pairs having the largest
differences, where β, the number of buckets, is user-specified.
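
The difference between equi-width and equi-depth buckets can be seen in a short Python sketch; the bucket-construction functions and the price list are assumptions for illustration only.

```python
# Equi-width buckets have equal value ranges; equi-depth buckets have equal counts.

def equi_width_buckets(values, n_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)
        buckets[idx].append(v)
    return buckets

def equi_depth_buckets(values, n_buckets):
    data = sorted(values)
    depth = len(data) // n_buckets
    return [data[i:i + depth] for i in range(0, len(data), depth)]

prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 18, 20, 21, 21, 25, 28]
print(equi_width_buckets(prices, 3))   # roughly equal ranges, unequal counts
print(equi_depth_buckets(prices, 4))   # equal counts (4 values per bucket)
```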

4(i). Generalize in detail how data mining system are classified


DATA MINING:
Data mining refers to extracting or mining knowledge from large amounts of data. The
term is actually a misnomer. Thus, data mining should have been more appropriately named
knowledge mining, which emphasizes mining knowledge from large amounts of data.

Classification of data mining systems:


There are many data mining systems available or being developed. Some are
specialized systems dedicated to a given data source or are confined to limited data mining
functionalities, other are more versatile and comprehensive. Data mining is an interdisciplinary
field, the confluence of a set of disciplines including database systems, statistics, machine
learning, visualization, and information science. Moreover, depending on the data mining
approach used, techniques from other disciplines may be applied, such as neural networks,
fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high
performance computing. Depending on the kinds of data to be mined or on the given data
mining application, the data mining system may also integrate techniques from spatial data
analysis, information retrieval, pattern recognition, image analysis, signal processing,
computer graphics, Web technology, economics, or psychology. Because of the diversity of
disciplines contributing to data mining, data mining research is expected to generate a large
variety of data mining systems. Therefore, it is necessary to provide a clear classification of
data mining systems. Such a classification may help potential users distinguish data mining
systems and identify those that best match their needs. Data mining systems can be categorized
according to various criteria, as follows: classification according to the kinds of databases
mined. A data mining system can be classified according to the kinds of databases mined.
Database systems themselves can be classified according to different criteria (such as data
models, or the types of data or applications involved), each of which may require its own data
mining technique. Data mining systems can therefore be classified accordingly.
For instance, if classifying according to data models, we may have a relational,
transactional, object-oriented, object-relational, or data warehouse mining system. If
classifying according to the special types of data handled, we may have a spatial, time-series,
text, or multimedia data mining system, or a World-Wide Web mining system. Other system
types include heterogeneous data mining systems, and legacy data mining systems.

Data mining systems can be categorized according to various criteria; among other
classifications are the following:
➢ Classification according to the type of data source mined
➢ Classification according to the data model drawn on
➢ Classification according to the kind of knowledge discovered
➢ Classification according to mining techniques used

(ii) Discuss each classification with an example.


Classification according to the type of data source mined:
This classification categorizes data mining systems according to the type of data handled
such as spatial data, multimedia data, time-series data, text data, World Wide Web, etc.

Classification according to the data model drawn on:


This classification categorizes data mining systems based on the data model involved such
as relational database, object-oriented database, data warehouse, transactional, etc.

Classification according to the kind of knowledge discovered:


This classification categorizes data mining systems based on the kind of knowledge
discovered or data mining functionalities, such as characterization, discrimination, association,
classification, clustering, etc. Some systems tend to be comprehensive systems offering several
data mining functionalities together.

Classification according to mining techniques used:


Data mining systems employ and provide different techniques. This classification
categorizes data mining systems according to the data analysis approach used such as machine
learning, neural networks, genetic algorithms, statistics, visualization, database oriented or data
warehouse-oriented, etc. The classification can also take into account the degree of user
interaction involved in the data mining process such as query-driven systems, interactive
exploratory systems, or autonomous systems. A comprehensive system would provide a wide
variety of data mining techniques to fit different situations and options, and offer different
degrees of user interaction.
UNIT – 4
ASSOCIATION RULE MINING AND CLASSIFICATION
PART – A
1.Define correlation and market basket analysis.
CORRELATION:
It is used to study the closeness of the relationship between two or more variables i.e.
the degree to which the variables are associated with each other.
MARKET BASKET ANALYSIS:
Market basket analysis is a data mining technique used by retailers to increase sales
by better understanding customer purchasing patterns. It involves analyzing large data sets,
such as purchase history, to reveal product groupings, as well as products that are likely to be
purchased together.
2. Formulate the principle frequent itemset and closed itemset.
FREQUENT ITEMSET:
The frequent-itemsets problem is that of finding sets of items that appear in (are
related to) many of the same baskets.
CLOSED ITEMSET:
A closed frequent itemset is an itemset that is both closed and frequent, i.e., its support is
greater than or equal to min_sup. An itemset is closed in a data set if there exists no proper
superset that has the same support count as the original itemset.
3. How would you explain the principle of Apriori algorithm? How can the efficiency of an Apriori
algorithm be improved?
Apriori is an algorithm for frequent item set mining and association rule learning over
relational databases. It proceeds by identifying the frequent individual items in the database
and extending them to larger and larger item sets as long as those item sets appear sufficiently
often in the database.
The efficiency of an Apriori algorithm can be improved by:
1) using new database mapping way to avoid scanning the database repeatedly
2) further pruning frequent itemsets and candidate itemsets in order to improve
joining efficiency
3) using overlap strategy to count support to achieve high efficiency.

4. Define Data pruning. State the need for pruning phase in decision tree construction.
Pruning means to change the model by deleting the child nodes of a branch node. The
pruned node is regarded as a leaf node. Leaf nodes cannot be pruned. A decision tree consists
of a root node, several branch nodes, and several leaf nodes. The root node represents the top
of the tree.

Decision tree generation consists of two phases


– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers

5. Compare the advantages of FP growth algorithm over apriori algorithm.

➢ Pattern Generation:
FP growth generates pattern by constructing a FP tree whereas Apriori
generates pattern by pairing the items into singletons, pairs and triplets.
➢ Candidate Generation:
There is no candidate generation in FP growth whereas Apriori uses
candidate generation
➢ Process:
The process is faster as compared to Apriori. The runtime of process increases
linearly with increase in number of itemsets. But in Apriori the process is
comparatively slower than FP Growth, the runtime increases exponentially with
increase in number of itemsets
➢ Memory Usage:
A compact version of database is saved in FP growth. In Apriori algorithm,
the candidates combinations are saved in memory

6. Explain how will you generate association rules from frequent itemsets?
Association rule mining finds all sets of items (itemsets) that have support greater than the
minimum support and then uses those large (frequent) itemsets to generate the desired rules that
have confidence greater than the minimum confidence.

7. What is naïve Bayesian classification? How does it differ from Bayesian


classification?

Naïve assumption: attribute independence


o P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of samples having value xi as i-th
attribute in class C
If i-th attribute is continuous:
P(xi|C) is estimated thru a Gaussian density function
Computationally easy in both cases
DIFFERENCE:
The distinction between Bayesian classification and naïve Bayes is that naïve Bayes assumes
conditional independence, whereas the general Bayesian approach does not. This means the
relationships among all input features are assumed to be independent. This may not be a
realistic assumption, but it is why the algorithm is called “naive”.

8. Discuss association rule mining .


ASSOCIATION RULE MINING:
Finding frequent patterns, associations, correlations, or causal structures among sets
of items or objects in transaction databases, relational databases, and other information
repositories.
9. Describe the uses correlation?
Correlation is used to describe the linear relationship between two continuous
variables (e.g., height and weight). In general, correlation tends to be used when there is no
identified response variable. It measures the strength (qualitatively) and direction of the linear
relationship between two or more variables.

10. Discuss the features of Decision tree induction.


A decision tree is a flowchart-like tree structure, where
➢ Each internal node denotes a test on an attribute.
➢ Each branch represents an outcome of the test.
➢ Each leaf node holds a class label.
➢ The topmost node in a tree is the root node.
11. How would you evaluate accuracy of a classifier?
You simply measure the number of correct
decisions your classifier makes, divide by the total number of test examples, and the result is
the accuracy of your classifier.

12. List the two interesting measures of an association rule.


➢ Support
The support supp(X) of an itemset X is defined as the proportion of
transactions in the data set which contain the itemset.
➢ Confidence
The confidence of a rule is defined conf(X => Y) = supp(X U Y)/supp(X)

13. Define Back propagation.


Backpropagation (backward propagation) is an important mathematical tool for
improving the accuracy of predictions in data mining and machine learning. Essentially,
backpropagation is an algorithm used to calculate derivatives quickly.

14. Illustrate support vector machine with example.


➢ A new classification method for both linear and nonlinear data.
➢ It uses a nonlinear mapping to transform the original training data into a higher
dimension.
➢ With the new dimension, it searches for the linear optimal separating hyper plane (i.e.,
"decision boundary").
➢ With an appropriate non linear mapping to a sufficiently high dimension, data from
two classes can always be separated by a hyper plane.
➢ SVM finds this hyper plane using support vectors ("essential" training tuples) and
margins (defined by the support vectors).
15. How would you show your understanding about rule based classification?
The term rule-based classification can be used to refer to any classification scheme
that makes use of IF-THEN rules for class prediction. ... They are also used in the class
prediction algorithm to give a ranking to the rules, which will then be utilized to predict the
class of new cases.

16. Discuss why pruning is needed in decision tree.


Pruning a decision tree helps to prevent overfitting the training data so that our model
generalizes well to unseen data. Pruning a decision tree means to remove a subtree that is
redundant and not a useful split and replace it with a leaf node.

17. What inference can you formulate with Bayes theorem?


Bayes' Theorem describes the probability of an event based on prior knowledge
of conditions that might be related to the event. In other words, Bayes' Theorem is an
extension of conditional probability. Bayes' Theorem involves two types of probabilities: the prior
probability [P(H)] and the posterior probability [P(H|X)].

18. Demonstrate the Bayes classification methods.


➢ Bayesian classifiers are statistical classifiers.
➢ They can predict class membership probabilities, such as the probability that a given
tuple belongs to a particular class.
➢ Bayesian classification is based on Bayes’ theorem.

19. Define Lazy learners with an example.


A lazy method may consider the query instance xq when deciding how to generalize
beyond the training data D. A lazy method effectively uses a richer hypothesis space, since it
uses many local linear functions to form its implicit global approximation to the target
function. k-nearest-neighbor classification and case-based reasoning are typical examples.

20. What are eager learners?


Eager learners, when given a set of training tuples, will construct a generalization
(i.e., classification) model before receiving new (e.g., test) tuples to classify.
Example:
Decision tree induction, Bayesian classification, rule-based classification,
classification by back propagation and so on.
PART – B

1(i). Compare Classification and Prediction.


CLASSIFICATION vs. PREDICTION

• Classification is the method of identifying to which group (class) a new observation
belongs, on the basis of a training data set containing observations whose group membership
is known, whereas prediction is the method of estimating missing or unavailable numerical
data values for a new observation.

• A classifier is built to detect explicit (categorical) class labels, whereas a predictor is built
to predict a continuous-valued function or ordered value.

• In classification, accuracy depends on detecting the class label correctly, whereas in
prediction, accuracy depends on how well a given predictor can guess the value of the
predicted attribute for new data.

• In classification, the model can be called the classifier; in prediction, the model can be
called the predictor.

(ii) Explain the issues regarding classification and prediction


➢ Data cleaning:
This refers to the preprocessing of data in order to remove or reduce noise (by
applying smoothing techniques, for example) and the treatment of missing values (e.g., by
replacing a missing value with the most commonly occurring value for that attribute, or with
the most probable value based on statistics).
➢ Relevance analysis:
Many of the attributes in the data may be redundant. Correlation analysis can be used
to identify whether any two given attributes are statistically related. For example, a strong
correlation between attributes A1 and A2 would suggest that one of the two could be removed
from further analysis. A database may also contain irrelevant attributes.
➢ Data transformation and reduction: The data may be transformed by normalization,
particularly when neural networks or methods involving distance measurements are used in
the learning step. Normalization involves scaling all values for a given attribute so that they
fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.

(iii) Write and explain the algorithm for mining frequent item sets without candidate generation

In many cases, the Apriori candidate generate-and-test method significantly reduces the size of
candidate sets, leading to good performance gain. However, it can still suffer from the cost of
generating a huge number of candidate sets and of repeatedly scanning the database to check them.
An interesting method that avoids this is called frequent-pattern growth, or simply FP-growth,
which adopts a divide-and-conquer strategy as follows. First, it compresses the database representing
frequent items into a frequent-pattern tree, or FP-tree, which retains the itemset association
information. It then divides the compressed database into a set of conditional databases (a special
kind of projected database), each associated with one frequent item or "pattern fragment", and
mines each such database separately.

Example:
FP-growth (finding frequent itemsets without candidate generation). We re-examine the mining of
transaction database, D, of Table 5.1 in Example 5.3 using the frequent pattern growth approach.
2(i) How would you summarize in detail about mining methods?
The method that mines the complete set of frequent item sets with candidate
generation.

A-priori property &The A-priori Algorithm.


A-priori property:
➢ All non-empty subsets of a frequent item set must also be frequent.
If an item set I does not satisfy the minimum support threshold, min_sup, then I
is not frequent, i.e., support(I) < min_sup.
If an item A is added to the item set I, then the resulting item set (I ∪ A) cannot
occur more frequently than I.
➢ Monotonic functions are functions that move in only one direction.
➢ This property is called anti-monotonic.
➢ If a set cannot pass a test , all its supersets will fail the same test as well.
➢ This property is monotonic in failing the test.

The A-priori Algorithm:


• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent
k-itemset.
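
A minimal Python sketch of the join and prune steps (often called apriori_gen); the join shown here is a simplified version that pairs any two (k-1)-itemsets whose union has size k, and the item names in the L2 set are invented for the example.

```python
# Candidate generation for Apriori: join L_{k-1} with itself, then prune.
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """prev_frequent: set of frozensets of size k-1; returns candidate k-itemsets."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k:          # join step (simplified pairing)
                candidates.add(union)
    # prune step: every (k-1)-subset of a candidate must itself be frequent
    return {c for c in candidates
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))}

L2 = {frozenset(p) for p in [("milk", "bread"), ("milk", "butter"),
                             ("bread", "butter"), ("bread", "eggs")]}
C3 = apriori_gen(L2, 3)
print(C3)   # only {'milk', 'bread', 'butter'} survives the prune step
```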
The method that mines the complete set of frequent item sets without generation.
• Compress a large database into a compact, Frequent-Pattern tree(FP-tree)structure
– highly condensed, but complete for frequent pattern mining
– avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller
ones
– Avoid candidate generation: sub-database test only!
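
To illustrate the compression step, here is a small Python sketch that builds an FP-tree from a handful of made-up transactions; mining the conditional pattern bases is not shown, and the class and function names are assumptions for this example.

```python
# Building an FP-tree: frequent items only, ordered by descending support count.
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_sup):
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_sup}
    root = Node(None, None)
    for t in transactions:
        # keep only frequent items, ordered by descending support count
        items = sorted((i for i in t if i in frequent), key=lambda i: -counts[i])
        node = root
        for item in items:
            child = node.children.setdefault(item, Node(item, node))
            child.count += 1
            node = child
    return root

def show(node, indent=0):
    if node.item is not None:
        print("  " * indent + f"{node.item}:{node.count}")
    for child in node.children.values():
        show(child, indent + 1)

transactions = [["milk", "bread", "butter"], ["milk", "bread"],
                ["bread", "eggs"], ["milk", "bread", "eggs"]]
show(build_fp_tree(transactions, min_sup=2))   # prints the shared-prefix tree
```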

(ii) Summarize in detail about various kinds of association rules.

Mining Various Kinds of Association Rules


1. Mining Multilevel Association Rules
For many applications, it is difficult to find strong associations among data items at low
or primitive levels of abstraction due to the sparsity of data at those levels. Strong associations
discovered at high levels of abstraction may represent common sense knowledge.
Therefore, data mining systems should provide capabilities for mining association rules
at multiple levels of abstraction, with sufficient flexibility for easy traversal among different
abstraction spaces.
Example:
Mining multilevel association rules. Suppose we are given the task-relevant set of
transactional data for sales in an AllElectronics store, showing the items purchased
for each transaction.
A concept hierarchy for the items defines a sequence of mappings from a set of
low-level concepts to higher-level, more general concepts.
Data can be generalized by replacing low-level concepts within the data by their higher-level
concepts, or ancestors, from a concept hierarchy.
Association rules generated from mining data at multiple levels of abstraction are called
multiple-level or multilevel association rules. Multilevel association rules can be mined
efficiently using concept hierarchies under a support-confidence framework. In general, a top-
down strategy is employed, For each level, any algorithm for discovering frequent itemsets
may be used, such as Apriori or its variations.
2. Mining Multidimensional Association Rules from Relational Databases and Data
Warehouses:
We have studied association rules that imply a single predicate, that is, the
predicate buys. For instance, in mining our AllElectronics database, we may discover a
Boolean association rule of the form

buys(X, "digital camera") ⇒ buys(X, "HP printer")

Following the terminology used in multidimensional databases, we refer to each distinct


predicate in a rule as a dimension. Hence, we can refer to Rule above as a single dimensional
or intra dimensional association rule because it contains a single distinct predicate
(e.g., buys)with multiple occurrences (i.e., the predicate occurs more than once within the
rule). As we have seen in the previous sections of this chapter, such rules are commonly mined
from transactional data.
Considering each database attribute or warehouse dimension as a predicate, we can
therefore mine association rules containing multiple predicates, such as

age(X, "20...29") ∧ occupation(X, "student") ⇒ buys(X, "laptop")

Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules. Rule above contains three predicates (age, occupation,
and buys), each of which occurs only once in the rule. Hence, we say that it has no repeated
predicates. Multidimensional association rules with no repeated predicates are called inter
dimensional association rules. We can also mine multidimensional association rules with
repeated predicates, which contain multiple occurrences of some predicates. These rules are
called hybrid-dimensional association rules. An example of such a rule is the following, where
the predicate buys is repeated:

age(X, "20...29") ∧ buys(X, "laptop") ⇒ buys(X, "HP printer")

Note that database attributes can be categorical or quantitative. Categorical attributes


have a finite number of possible values, with no ordering among the values (e.g., occupation,
brand, color). Categorical attributes are also called nominal attributes, because their values
are "names of things." Quantitative attributes are numeric and have an implicit ordering
among values (e.g., age, income, price). Techniques for mining multidimensional association
rules can be categorized into two basic approaches regarding the treatment of quantitative
attributes.

3. Describe in detail about constraint and correlation based association mining


CONSTRAINT BASED ASSOCIATION MINING:
A data mining process may uncover thousands of rules from a given set of data, most of which end up
being unrelated or uninteresting to the users. Often, users have a good sense of which "direction" of mining
may lead to interesting patterns and the "form" of the patterns or rules they would like to find. Thus, a good
heuristic is to have the users specify such intuition or expectations as constraints to confine the search space.
This strategy is known as constraint-based mining.
The constraints can include the following:

1) Metarule-Guided Mining of Association Rules:


“How are metarules useful?” Metarules allow users to specify the syntactic form of rules that they are
interested in mining. The rule forms can be used as constraints to help improve the efficiency of the mining
process. Metarules may be based on the analyst’s experience, expectations, or intuition regarding the data or
may be automatically generated based on the database schema.

Metarule-guided mining:-
Suppose that as a market analyst for AllElectronics, you have access to the data describing customers
(such as customer age, address, and credit rating) as well as the list of customer transactions. You are interested
in finding associations between customer traits and the items that customers buy. However, rather than finding
all of the association rules reflecting these relationships, you are particularly interested only in determining
which pairs of customer traits promote the sale of office software. A metarule can be used to specify this
information describing the form of rules you are interested in finding. An example of such a metarule is

P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "office software")

where P1 and P2 are predicate variables that are instantiated to attributes from the given database during
the mining process, X is a variable representing a customer, and Y and W take on values of the attributes
assigned to P1 and P2, respectively. Typically, a user will specify a list of attributes to be considered for
instantiation with P1 and P2. Otherwise, a default set may be used.

2) Constraint Pushing: Mining Guided by Rule Constraints:


Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant
initiation of variables, and aggregate functions. Users typically employ their knowledge of the application or
data to specify rule constraints for the mining task. These rule constraints may be used together with, or as an
alternative to, metarule-guided mining. In this section, we examine rule constraints as to how they can be used
to make the mining process more efficient. Let’s study an example where rule constraints are used to mine
hybrid-dimensional association rules. Our association mining query is to “Find the sales of which cheap items
(where the sum of the prices is less than $100) may promote the sales of which expensive items (where the
minimum price is $500) of the same group for Chicago customers in 2004.” This can be expressed in the
DMQL data mining query language as follows,

4(i). Develop an algorithm for classification using decision trees. Illustrate the algorithm with a relevant
example.

Decision tree induction is the learning of decision trees from class-labeled training tuples. A
decision tree is a flowchart-like tree structure, where
➢ Each internal node denotes a test on an attribute.
➢ Each branch represents an outcome of the test.
➢ Each leaf node holds a class label.
➢ The topmost node in a tree is the root node.

The construction of decision tree classifiers does not require any domain knowledge or parameter
setting, and is therefore appropriate for exploratory knowledge discovery.
Decision trees can handle high-dimensional data.
Their representation of acquired knowledge in tree form is intuitive and generally easy to
assimilate by humans.
The learning and classification steps of decision tree induction are simple and fast. In general,
decision tree classifiers have good accuracy.
Decision tree induction algorithms have been used for classification in many application areas,
such as medicine, manufacturing and production, financial analysis, astronomy, and molecular
biology.

Algorithm For Decision Tree Induction:
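
A compact ID3-style sketch of decision tree induction, assuming an information-gain splitting criterion and a tiny invented training set; it illustrates the recursive attribute-selection idea rather than reproducing the textbook's exact pseudocode.

```python
# Recursive decision tree induction: pick the attribute with the highest
# information gain at each node, partition the tuples, and recurse.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    base, remainder = entropy(labels), 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1 or not attrs:            # pure node or no attributes left
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[(best, value)] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [a for a in attrs if a != best])
    return tree

rows = [{"outlook": "sunny", "windy": False}, {"outlook": "sunny", "windy": True},
        {"outlook": "rain",  "windy": False}, {"outlook": "overcast", "windy": True}]
labels = ["no", "no", "yes", "yes"]
print(build_tree(rows, labels, ["outlook", "windy"]))   # splits on 'outlook'
```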

(ii) What approach would you use to apply decision tree induction?

5. What is Classification? What are the features of Bayesian classification? Explain in


detail with an example

CLASSIFICATION:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
BAYESIAN CLASSIFICATION:

➢ The classification problem may be formalized using a-posteriori probabilities:


➢ P(C|X) = prob. that the sample tuple X=<x1,…,xk> is of class C.
➢ E.g. P(class=N | outlook=sunny,windy=true,…)
➢ Idea: assign to sample X the class label C such that P(C|X) is maximal

“What are Bayesian classifiers?” Bayesian classifiers are statistical classifiers. They
can predict class membership probabilities, such as the probability that a given tuple belongs
to a particular class. Bayesian classification is based on Bayes’ theorem; a simple Bayesian
classifier is known as the naïve Bayesian classifier. Bayesian classifiers have also exhibited high
accuracy and speed when applied to large databases.

NEED:

➢ Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
➢ Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct. Prior knowledge can be combined with
observed data.
➢ Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
➢ Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured.

Bayesian Theorem:
Given training data D, the posteriori probability of a hypothesis h, P(h|D), follows the
Bayes theorem:

P(h|D) = P(D|h) · P(h) / P(D)

MAP (maximum a posteriori) hypothesis:

h_MAP = arg max_{h in H} P(h|D) = arg max_{h in H} P(D|h) · P(h)

Practical difficulty: requires initial knowledge of many probabilities, and significant
computational cost.

Naïve Bayes Classifier (I):


A simplified assumption: attributes are conditionally independent:

P(x1, …, xk | C) = P(x1|C) · … · P(xk|C)

This greatly reduces the computation cost, since only the class distribution needs to be counted.

Naive Bayesian Classifier (II)


Given a training set, we can compute the probabilities
Estimating a-posteriori probabilities
✓ Bayes theorem:
❖ P(C|X) = P(X|C)·P(C) / P(X)
✓ P(X) is constant for all classes
✓ P(C) = relative freq of class C samples
✓ C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum
✓ Problem: computing P(X|C) is unfeasible!
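A minimal sketch of the naïve Bayesian classifier under the conditional-independence assumption above, assuming small categorical training data; the helper names and toy data are illustrative, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(C) and P(x_i | C) from categorical training tuples."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)           # (class, attr index) -> value counts
    for row, c in zip(rows, labels):
        for i, value in enumerate(row):
            cond_counts[(c, i)][value] += 1
    return class_counts, cond_counts

def classify(x, class_counts, cond_counts):
    """Pick the class C maximizing P(C) * prod_i P(x_i | C)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total                              # P(C)
        for i, value in enumerate(x):
            score *= cond_counts[(c, i)][value] / n_c    # P(x_i | C)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy weather data: (outlook, windy) -> play
rows = [("sunny", "true"), ("sunny", "false"), ("rain", "true"), ("overcast", "false")]
labels = ["no", "yes", "no", "yes"]
model = train_naive_bayes(rows, labels)
print(classify(("sunny", "true"), *model))   # "no" on this toy data
```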

6(i). Giving a concrete example, explain a method that performs frequent itemset mining by using the prior
knowledge of frequent itemset properties.
(ii) Discuss in detail the constraint based association mining.

A data mining process may uncover thousands of rules from a given set of data, most of which
end up being unrelated or uninteresting to the users. Often, users have a good sense of which “direction” of
mining may lead to interesting patterns and the “form” of the patterns or rules they would like to find. Thus, a
good heuristic is to have the users specify such intuition or expectations as constraints to confine the search
space. This strategy is known as constraint-based mining.
The constraints can include the following:

1) Metarule-Guided Mining of Association Rules:


“How are metarules useful?” Metarules allow users to specify the syntactic form of rules that they are
interested in mining. The rule forms can be used as constraints to help improve the efficiency of the mining
process. Metarules may be based on the analyst’s experience, expectations, or intuition regarding the data or
may be automatically generated based on the database schema.

Metarule-guided mining:-
Suppose that as a market analyst for AllElectronics, you have access to the data describing customers
(such as customer age, address, and credit rating) as well as the list of customer transactions. You are interested
in finding associations between customer traits and the items that customers buy. However, rather than finding
all of the association rules reflecting these relationships, you are particularly interested only in determining
which pairs of customer traits promote the sale of office software. A metarule can be used to specify this
information describing the form of rules you are interested in finding. An example of such a metarule is

P1(X, Y) ∧ P2(X, W) ⇒ buys(X, “office software”)

where P1 and P2 are predicate variables that are instantiated to attributes from the given database during
the mining process, X is a variable representing a customer, and Y and W take on values of the attributes
assigned to P1 and P2, respectively. Typically, a user will specify a list of attributes to be considered for
instantiation with P1 and P2. Otherwise, a default set may be used.

2) Constraint Pushing: Mining Guided by Rule Constraints:


Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant
initiation of variables, and aggregate functions. Users typically employ their knowledge of the application or
data to specify rule constraints for the mining task. These rule constraints may be used together with, or as an
alternative to, metarule-guided mining. In this section, we examine rule constraints as to how they can be used
to make the mining process more efficient. Let’s study an example where rule constraints are used to mine
hybrid-dimensional association rules. Our association mining query is to “Find the sales of which cheap items
(where the sum of the prices is less than $100) may promote the sales of which expensive items (where the
minimum price is $500) of the same group for Chicago customers in 2004.” This can be expressed in the
DMQL data mining query language as follows,

7(i). Examine in detail about Lazy learners with examples


A lazy method may consider the query instance xq when deciding how to generalize
beyond the training data D. A lazy method effectively uses a richer hypothesis space since it
uses many local linear functions to form its implicit global approximation to the target
function. A lazy learner first stores the training dataset and waits until it receives the test
dataset. In the lazy learner case, classification is done on the basis of the most related data stored
in the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
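A minimal sketch of the k-NN idea just named (assuming numeric attributes; the helper name knn_classify and the toy data are illustrative): no model is built in advance, and all work happens when a query arrives, which is exactly the lazy-learning behaviour described above.

```python
import math
from collections import Counter

def knn_classify(query, train_points, train_labels, k=3):
    """Lazy classification: the training set is simply scanned at query time."""
    dists = sorted(
        (math.dist(query, p), label) for p, label in zip(train_points, train_labels)
    )
    k_nearest = [label for _, label in dists[:k]]
    return Counter(k_nearest).most_common(1)[0][0]   # majority vote among neighbours

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.9)]
labels = ["A", "A", "B", "B"]
print(knn_classify((1.1, 1.0), points, labels, k=3))  # "A"
```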

(ii) Describe about the process of multi-layer feed-forward neural network classification
using back propagation learning?

A Multilayer Feed-Forward Neural Network:


➢ The backpropagation algorithm performs learning on a multilayer feed- forward
neural network.
➢ It iteratively learns a set of weights for prediction of the class label of tuples.
➢ A multilayer feed-forward neural network consists of an input layer, one or more
hidden layers, and an output layer.
Example:

➢ The inputs to the network correspond to the attributes measured for each training
tuple. The inputs are fed simultaneously into the units making up the input layer.
These inputs pass through the input layer and are then weighted and fed
simultaneously to a second layer known as a hidden layer.
➢ The outputs of the hidden layer units can be input to another hidden layer, and so on.
The number of hidden layers is arbitrary.
➢ The weighted outputs of the last hidden layer are input to units making up the output
layer, which emits the network’s prediction for given tuples

Classification by Backpropagation:
➢ Backpropagation is a neural network learning algorithm.
➢ A neural network is a set of connected input/output units in which each connection
has a weight associated with it.
➢ During the learning phase, the network learns by adjusting the weights so as to be able
to predict the correct class label of the input tuples.
➢ Neural network learning is also referred to as connectionist learning due to the
connections between units.
➢ Neural networks involve long training times and are therefore more suitable for
applications where this is feasible.
➢ Backpropagation learns by iteratively processing a data set of training tuples,
comparing the network’s prediction for each tuple with the actual known target value.
➢ The target value may be the known class label of the training tuple (for classification
problems) or a continuous value (for prediction).
➢ For each training tuple, the weights are modified so as to minimize the mean squared
error between the network’s prediction and the actual target value. These
modifications are made in the “backwards” direction, that is, from the output layer,
through each hidden layer down to the first hidden layer; hence the name
backpropagation.
➢ Although it is not guaranteed, in general the weights will eventually converge, and the
learning process stops.
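A minimal NumPy sketch of the cycle described above (forward pass, error at the output layer, propagation of the error backwards, weight updates); the single hidden layer, sigmoid units, learning rate and XOR-style toy data are illustrative choices, not the textbook's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # toy targets (XOR)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one hidden layer of 4 units, one output unit
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5

for epoch in range(5000):
    # forward pass: input layer -> hidden layer -> output layer
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # backward pass: errors flow from the output layer back to the hidden layer
    err_out = (out - y) * out * (1 - out)          # error term at output units
    err_hid = (err_out @ W2.T) * h * (1 - h)       # error term at hidden units

    # weight updates in the "backwards" direction
    W2 -= lr * h.T @ err_out
    b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_hid
    b1 -= lr * err_hid.sum(axis=0)

print(np.round(out, 2))   # predictions move toward the targets as the weights converge
```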
8(i) Describe in detail about frequent pattern classification.

Frequent pattern mining can be classified in various ways, based on the following criteria:

1.Based on the completeness of patterns to be mined:


We can mine the complete set of frequent itemsets, the closed frequent itemsets, and
the maximal frequent itemsets, given a minimum support threshold.
We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match
frequent itemsets, top-k frequent itemsets and so on.

2. Based on the levels of abstraction involved in the rule set:


Some methods for association rule mining can find rules at differing levels of
abstraction.
For example
Suppose that a set of association rules mined includes the following rules, where X is a
variable representing a customer:
buys(X, “computer”) => buys(X, “HP printer”) (1)
buys(X, “laptop computer”) => buys(X, “HP printer”) (2)
In rules (1) and (2), the items bought are referenced at different levels of abstraction
(e.g., “computer” is a higher-level abstraction of “laptop computer”).

3. Based on the number of data dimensions involved in the rule:


If the items or attributes in an association rule reference only one dimension, then it is
a single-dimensional association rule:
buys(X, “computer”) => buys(X, “antivirus software”)
If a rule references two or more dimensions, such as the dimensions age, income, and
buys, then it is a multidimensional association rule. The following rule is an example of a
multidimensional rule:
age(X, “30…39”) ∧ income(X, “42K…48K”) => buys(X, “high resolution TV”)

4. Based on the types of values handled in the rule:


If a rule involves associations between the presence or absence of items, it is a
Boolean association rule. If a rule describes associations between quantitative items or
attributes, then it is a quantitative association rule.

5. Based on the kinds of rules to be mined:


Frequent pattern analysis can generate various kinds of rules and other interesting
relationships. Association rule mining can generate a large number of rules, many of which
are redundant or do not indicate a correlation relationship among itemsets. The discovered
associations can be further analyzed to uncover statistical correlations, leading to correlation
rules.

6. Based on the kinds of patterns to be mined:


Many kinds of frequent patterns can be mined from different kinds of data sets.
Sequential pattern mining searches for frequent sub sequences in a sequence data set, where a
sequence records an ordering of events.
(ii) Write an algorithm for FP-Tree Construction and discuss how frequent itemsets are
generated from FP-Tree

• The first step is to scan the database to find the occurrences of the itemsets in the
database. This step is the same as the first step of Apriori. The count of 1-itemsets in the
database is called support count or frequency of 1-itemset.
• The second step is to construct the FP tree. For this, create the root of the tree. The root
is represented by null.
• The next step is to scan the database again and examine the transactions. Examine the
first transaction and find out the itemset in it. The itemset with the max count is taken at
the top, the next itemset with lower count and so on. It means that the branch of the
tree is constructed with transaction itemsets in descending order of count.
• The next transaction in the database is examined. The itemsets are ordered in
descending order of count. If any itemset of this transaction is already present in
another branch (for example in the 1st transaction), then this transaction branch would
share a common prefix to the root. This means that the common itemset is linked to the
new node of another itemset in this transaction.
• Also, the count of the itemset is incremented as it occurs in the transactions. Both the
common node and new node count is increased by 1 as they are created and linked
according to transactions.
• The next step is to mine the created FP Tree. For this, the lowest node is examined first
along with the links of the lowest nodes. The lowest node represents the frequency
pattern length 1. From this, traverse the path in the FP Tree. This path or paths are
called a conditional pattern base. Conditional pattern base is a sub-database consisting
of prefix paths in the FP tree occurring with the lowest node (suffix).
• Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.
• Frequent Patterns are generated from the Conditional FP Tree.

frequent itemsets are generated from FP-Tree:


1. Create the root node (null).
2. Scan the database, get the frequent itemsets of length 1, and sort these 1-itemsets
in decreasing support count.
3. Read one transaction at a time and sort the items in the transaction according to
the order from the last step.
4. For each transaction, insert the items into the FP-Tree from the root node and
increment the occurrence count at every inserted node.
5. Create a new child node if a leaf node is reached before the insertion completes.
6. If a new child node is created, link it from the last node consisting of the same
item.
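A minimal sketch of the construction steps above, assuming a small in-memory transaction list and an illustrative min_support; it builds only the FP-tree and its header table (node links) and does not perform the full FP-growth mining.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_support):
    # Step 1: scan once to get the support count of each 1-itemset
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)                 # the root is labelled null
    header = {}                               # item -> list of nodes (node links)

    # Step 2: scan again, inserting each transaction in descending support order
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            if item in node.children:         # shared prefix: just increment the count
                node = node.children[item]
                node.count += 1
            else:                             # new branch for this itemset
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
                node = child
    return root, header

transactions = [["milk", "bread", "beer"], ["milk", "bread"], ["bread", "beer"], ["milk"]]
root, header = build_fp_tree(transactions, min_support=2)
for item, nodes in header.items():
    print(item, [n.count for n in nodes])
```

Mining would then proceed, as described earlier, by following the node links of each item upward to collect its conditional pattern base.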
9. Consider a home finance loan to predict the housing loan payment. Design a general hierarchical
structure and analyze the factors using rule discovery techniques to accurately predict the number of loan
payments in a given quarter/year. The loan is availed for a period of 20 to 25 years, but the average life span
of the loan is only 7 to 10 years due to prepayment. Make necessary assumptions: maintenance records of
the customer details and details of the prevailing interest rates, borrower characteristics and account data;
fine-tune loan prepayment factors such as interest rates and fees in order to maximize the profits of the company.
Elaborately discuss the association rule mining issues. Also examine multilevel association rules
and identify any relationships to the above application.

10. Generalize the Bayes theorem of posterior probability and explain the working of a
Bayesian classifier with an example.
Bayes’ Theorem:
➢ Let X be a data tuple. In Bayesian terms, X is considered “evidence” and it is
described by measurements made on a set of n attributes.
➢ Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
➢ For classification problems, we want to determine P(H|X), the probability that the
hypothesis H holds given the “evidence” or observed data tuple X.
➢ P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes’ theorem is useful in that it provides a way of calculating the posterior
probability, P(H|X), from P(H), P(X|H), and P(X).

Estimating a-posteriori probabilities


Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum
• Problem: computing P(X|C) is unfeasible!

Bayesian Classification:
➢ Bayesian classifiers are statistical classifiers.
➢ They can predict class membership probabilities, such as the probability that a given
tuple belongs to a particular class.
➢ Bayesian classification is based on Bayes’ theorem
➢ Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
➢ Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct. Prior knowledge can be combined with
observed data.
➢ Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
➢ Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured

11. Explain and Apply the Apriori algorithm for discovering frequent item sets of the table.

12. (i).Define classification? With an example explain how support vector machines can be
used for classification

CLASSIFICATION:
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use the model to
predict the class of objects whose class label is unknown. The derived model is based on the
analysis of a set of training data (i.e., data objects whose class label is known).

A bank loans officer needs analysis of her data in order to learn which loan applicants
are “safe” and which are “risky” for the bank. A marketing manager at All Electronics
needs data analysis to help guess whether a customer with a given profile will buy a new
computer. A medical researcher wants to analyze breast cancer data in order to predict which
one of three specific treatments a patient should receive. In each of these examples, the data
analysis task is classification, where a model or classifier is constructed to predict categorical
labels, such as “safe” or “risky” for the loan application data; “yes” or “no” for the
marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data. These
categories can be represented by discrete values, where the ordering among values has no
meaning. For example, the values 1, 2, and 3 may be used to represent
treatments A, B, and C, where there is no ordering implied among this group of treatment
regimes.

Suppose that the marketing manager would like to predict how much a given
customer will spend during a sale at All Electronics. This data analysis task is an example of
numeric prediction, where the model constructed predicts a continuous-valued function, or
ordered value, as opposed to a categorical label. This model is a predictor.

“How does classification work?” Data classification is a two-step process, as shown for the loan
application data of Figure 6.1. (The data are simplified for illustrative purposes; in reality, we
may expect many more attributes to be considered.) In the first step, a classifier is built
describing a predetermined set of data classes or concepts. This is the learning step (or
training phase), where a classification algorithm builds the classifier by analyzing or
“learning from” a training set made up of database tuples and their associated class labels.
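For the support-vector part of the question, a hedged minimal sketch using scikit-learn's SVC (assuming scikit-learn is available; the tiny loan-style dataset below is made up): a linear SVM learns a maximum-margin hyperplane separating the two classes and then classifies unseen tuples.

```python
from sklearn.svm import SVC

# toy loan-application style data: (income in $1000s, years employed) -> safe / risky
X = [[25, 1], [30, 2], [28, 1], [60, 8], [75, 10], [90, 12]]
y = ["risky", "risky", "risky", "safe", "safe", "safe"]

clf = SVC(kernel="linear")        # linear maximum-margin classifier
clf.fit(X, y)                     # learning step on the labelled training set
print(clf.predict([[50, 5]]))     # class label predicted for an unseen applicant
```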

(ii) What are the prediction techniques supported by a data mining systems?

Prediction is similar to classification


➢ First, construct a model
➢ Second, use model to predict unknown value
Major method for prediction is regression
Linear and multiple regression
Non-linear regression
Prediction is different from classification
o Classification refers to predict categorical class label
o Prediction models continuous-valued functions
Predictive modeling:
Predict data values or construct generalized linear models based on the database data.
One can only predict value ranges or category distributions
Method outline:
➢ Minimal generalization
➢ Attribute relevance analysis
➢ Generalized linear model construction
➢ Prediction
Determine the major factors which influence the prediction
o Data relevance analysis: uncertainty measurement, entropy analysis, expert
judgement, etc.

Multi-level prediction: drill-down and roll-up analysis

Prediction: Categorical Data:
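To make the regression-based numeric prediction listed above concrete, a minimal sketch with NumPy least squares on made-up data (the variable names and numbers are illustrative only):

```python
import numpy as np

# made-up training data: years of experience -> salary (a continuous target value)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30, 35, 42, 48, 55], dtype=float)

# fit y = w0 + w1*x by ordinary least squares (linear regression)
A = np.column_stack([np.ones_like(x), x])
(w0, w1), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"predicted value for x=6: {w0 + w1 * 6:.1f}")
```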

13(i) Write Bayes theorem


Let X be a data tuple. In Bayesian terms, X is considered “evidence.” As
usual, it is described by measurements made on a set of n attributes. Let H be
some hypothesis, such as that the data tuple X belongs to a specified class C. For
classification problems, we want to determine P(H|X), the probability that the
hypothesis H holds given the “evidence” or observed data tuple X. In other
words, we are looking for the probability that tuple X belongs to class C, given
that we know the attribute description of X.

“How are these probabilities estimated?” P(H), P(X|H), and P(X) may be
estimated from the given data, as we shall see below. Bayes’ theorem is useful
in that it provides a way of calculating the posterior probability, P(H|X),
from P(H), P(X|H), and P(X). Bayes’ theorem is

P(H|X) = P(X|H)·P(H) / P(X)
(ii) Explain how the Bayesian Belief Networks are trained to perform classification

A belief network is defined by two components: a directed acyclic graph and


a set of conditional probability tables (Figure 6.11). Each node in the directed
acyclic graph represents a random variable. The variables may be discrete or
continuous-valued. They may correspond to actual attributes given in the data or
to ―hidden variables‖ believed to form a relationship (e.g., in the case of medical
data, a hidden variable may indicate a syndrome, representing a number of
symptoms that, together, characterize a specific disease). Each arc represents a
probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is
a parent or immediate predecessor of Z, and Z is a descendant of Y. Each
variable is conditionally independent of its non descendants in the graph, given
its parents.

A belief network has one conditional probability table (CPT) for each variable.
The CPT for a variable Y specifies the conditional distribution P(Y | Parents(Y)),
where Parents(Y) are the parents of Y. Figure (b) shows a CPT for the
variable LungCancer. The conditional probability for each known value
of LungCancer is given for each possible combination of values of its parents.
For instance, the upper leftmost and bottom rightmost entries of that table give,
respectively, the probability of LungCancer when both parents take the value “yes”
and when both take the value “no.”
Let X = (x1, …, xn) be a data tuple described by the variables or attributes Y1,
…, Yn, respectively. Recall that each variable is conditionally independent of
its nondescendants in the network graph, given its parents. This allows the
network to provide a complete representation of the existing joint probability
distribution with the following equation:

P(x1, …, xn) = Π(i = 1..n) P(xi | Parents(Yi))

14. Describe in detail about the following Classification methods.


(i). Bayesian classification

BAYESIAN CLASSIFICATION:

➢ The classification problem may be formalized using a-posteriori probabilities:


➢ P(C|X) = prob. that the sample tuple X=<x1,…,xk> is of class C.
➢ E.g. P(class=N | outlook=sunny,windy=true,…)
➢ Idea: assign to sample X the class label C such that P(C|X) is maximal

“What are Bayesian classifiers?” Bayesian classifiers are statistical classifiers. They
can predict class membership probabilities, such as the probability that a given tuple belongs
to a particular class. Bayesian classification is based on Bayes’ theorem; a simple Bayesian
classifier is known as the naïve Bayesian classifier. Bayesian classifiers have also exhibited high
accuracy and speed when applied to large databases.
➢ Bayesian classifiers are statistical classifiers.
➢ They can predict class membership probabilities, such as the probability that a given
tuple belongs to a particular class.
➢ Bayesian classification is based on Bayes’ theorem
➢ Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most
practical approaches to certain types of learning problems
➢ Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct. Prior knowledge can be combined with
observed data.
➢ Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
➢ Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured
(ii) Classification by Back propagation
Classification by Backpropagation
Backpropagation: A neural network learning algorithm
➢ Started by psychologists and neurobiologists to develop and test computational
analogues of neurons
➢ A neural network: A set of connected input/output units where each connection has
a weight associated with it.
➢ During the learning phase, the network learns by adjusting the weights so as to be
able to predict the correct class label of the input tuples
➢ Also referred to as connectionist learning due to the connections between units

Neural Network as a Classifier


Weakness:
o Long training time
o Require a number of parameters typically best determined empirically, e.g., the
network topology or ``structure."
o Poor interpretability: Difficult to interpret the symbolic meaning behind the learned
weights and of ``hidden units" in the network
Strength:
o High tolerance to noisy data
o Ability to classify untrained patterns
o Well-suited for continuous-valued inputs and outputs o Successful on a wide array
of real-world data
o Algorithms are inherently parallel
o Techniques have recently been developed for the extraction of rules from trained
neural networks
A Neuron (= a Perceptron)
A Multi-Layer Feed-Forward Neural Network

o The inputs to the network correspond to the attributes measured for each training
tuple.
o Inputs are fed simultaneously into the units making up the input layer.
o They are then weighted and fed simultaneously to a hidden layer.
o The number of hidden layers is arbitrary, although usually only one.
o The weighted outputs of the last hidden layer are input to units making up the output
layer, which emits the network's prediction.
o The network is feed-forward in that none of the weights cycles back to an input unit
or to an output unit of a previous layer.
o From a statistical point of view, networks perform nonlinear regression: given
enough hidden units and enough training samples, they can closely approximate any function.
PART – C
1. Find all frequent item sets for the given training set using Apriori and FP growth respectively. Compare
the efficiency of the two mining processes
2. Generalize and Discuss about constraint based association rule mining with examples and
state how association mining to correlation analysis is dealt with.
constraint based association rule mining:
A data mining process may uncover thousands of rules from a given set of data, most of which
end up being unrelated or uninteresting to the users. Often, users have a good sense of which

“direction” of mining may lead to interesting patterns and the “form” of the patterns or rules
they would like to find. Thus, a good heuristic is to have the users specify such intuition or
expectations as constraints to confine the search space. This strategy is known as constraint-
based mining. The constraints can include the following:

1. Metarule-Guided Mining of Association Rules

“How are metarules useful?” Metarules allow users to specify the syntactic form of rules
that they are interested in mining. The rule forms can be used as constraints to help improve
the efficiency of the mining process. Metarules may be based on the analyst’s experience,
expectations, or intuition regarding the data or may be automatically generated based on the
database schema.

Metarule-guided mining:- Suppose that as a market analyst for AllElectronics, you have
access to the data describing customers (such as customer age, address, and credit rating) as
well as the list of customer transactions. You are interested in finding associations between
customer traits and the items that customers buy. However, rather than finding all of the
association rules reflecting these relationships, you are particularly interested only in
determining which pairs of customer traits promote the sale of office software. A metarule
can be used to specify this information describing the form of rules you are interested in
finding. An example of such a metarule is

P1(X, Y) ∧ P2(X, W) ⇒ buys(X, “office software”)

where P1 and P2 are predicate variables that are instantiated to attributes from the given
database during the mining process, X is a variable representing a customer, and Y and W take
on values of the attributes assigned to P1 and P2, respectively. Typically, a user will specify
a list of attributes to be considered for instantiation with P1 and P2. Otherwise, a default set
may be used.

2. Constraint Pushing: Mining Guided by Rule Constraints

Rule constraints specify expected set/subset relationships of the variables in the mined rules,
constant initiation of variables, and aggregate functions. Users typically employ their
knowledge of the application or data to specify rule constraints for the mining task. These rule
constraints may be used together with, or as an alternative to, metarule-guided mining. In this
section, we examine rule constraints as to how they can be used to make the mining process
more efficient. Let’s study an example where rule constraints are used to mine hybrid-
dimensional association rules.

Our association mining query is to “Find the sales of which cheap items (where the sum of
the prices is less than $100) may promote the sales of which expensive items (where the
minimum price is $500) of the same group for Chicago customers in 2004.” This can be
expressed in the DMQL data mining query language as follows,

Association Mining to Correlation Analysis

Most association rule mining algorithms employ a support-confidence framework. Often,


many interesting rules can be found using low support thresholds. Although minimum support
and confidence thresholds help weed out or exclude the exploration of a good number of
uninteresting rules, many rules so generated are still not interesting to the users

1)Strong Rules Are Not Necessarily Interesting: An Example

Whether or not a rule is interesting can be assessed either subjectively or objectively.


Ultimately, only the user can judge if a given rule is interesting, and this judgment, being
subjective, may differ from one user to another. However, objective interestingness measures,
based on the statistics “behind” the data, can be used as one step toward the goal of weeding
out uninteresting rules from presentation to the user.
The support and confidence measures are insufficient at filtering out uninteresting association
rules. To tackle this weakness, a correlation measure can be used to augment the support-
confidence framework for association rules. This leads to correlation rules of the form

A ⇒ B [support, confidence, correlation]

That is, a correlation rule is measured not only by its support and confidence but also by the
correlation between itemsets A and B. There are many different correlation measures from
which to choose. In this section, we study various correlation measures to determine which
would be good for mining large data sets.
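One commonly used correlation measure is lift, lift(A, B) = P(A ∪ B) / (P(A)·P(B)), where P(A ∪ B) denotes the probability that a transaction contains all items of both A and B; a value near 1 suggests independence, above 1 positive correlation, and below 1 negative correlation. A minimal sketch, assuming transactions represented as Python sets (the toy data are made up):

```python
def lift(transactions, A, B):
    """lift(A, B) = P(A and B together) / (P(A) * P(B)) over a list of transaction sets."""
    n = len(transactions)
    p_a = sum(A <= t for t in transactions) / n        # support of A
    p_b = sum(B <= t for t in transactions) / n        # support of B
    p_ab = sum((A | B) <= t for t in transactions) / n # support of A and B together
    return p_ab / (p_a * p_b)

transactions = [{"computer", "game"}, {"computer"}, {"game"},
                {"computer", "game"}, {"game"}]
print(round(lift(transactions, {"computer"}, {"game"}), 2))   # below 1: negative correlation
```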

3. Discuss the single dimensional Boolean association rule mining for transaction database. Evaluate the
below transaction database

Let minimum support 50% and minimum confidence 50%


We have A=>C (50% , 66.6%)
C=>A (50%, 100 %)

4. Construct the decision tree for the following training dataset using decision tree algorithm.
PART A

1.Identify what changes would you make to solve the problem in cluster analysis.

Answer: The k-means clustering algorithm can be significantly improved by using a better
initialization technique and by repeating (re-starting) the algorithm. When the data has
overlapping clusters, the k-means iterations can further improve the results produced by the
initialization technique.

2.Define K-means partitioning.

Answer: k-means clustering is a method of vector quantization, originally from signal


processing, that aims to partition n observations into k clusters in which each observation
belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a
prototype of the cluster.

3. List the major clustering methods.

Answer: Clustering Methods

• Partitioning Method.

• Hierarchical Method.

• Density-based Method.

• Grid-Based Method.

• Model-Based Method.

• Constraint-based Method.

4.Explain why a cluster has to be evaluated.?

* Typical objective functions in clustering formalize the goal of attaining


high intra-cluster similarity (documents within a cluster are similar) and low inter-cluster similarity
(documents from different clusters are dissimilar). This is an internal criterion for the quality of a
clustering.

5.Illustrate the intrinsic methods in cluster analysis.?

* In general, intrinsic methods evaluate a clustering by examining how well the clusters
are separated and how compact the clusters are. Many intrinsic methods take advantage of a
similarity metric between objects in the data set.

6. How do you explain the similarity in clustering?

* Similarity is an amount that reflects the strength of relationship between two data
items, it represents how similar 2 data patterns are. Clustering is done based on a similarity measure
to group similar data objects together
7. Define what is meant by K nearest neighbor algorithm.
K-Nearest Neighbors is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning domain and finds intense application
in pattern recognition, data mining and intrusion detection.

8. Illustrate some applications of clustering.


Clustering has a large no. of applications spread across various domains. Some of the
most popular applications of clustering are:

• Recommendation engines
• Market segmentation
• Social network analysis
• Search result grouping
• Medical imaging
• Image segmentation

9. What services are provided by grid-based clustering?

Grid-based clustering quantizes the object space into a finite number of cells that form a grid
structure and performs all clustering operations on this grid. Its main advantage is fast processing
time, which is typically independent of the number of data objects and depends only on the number
of cells in each dimension (typical methods: STING, WaveCluster, CLIQUE).

13. Distinguish between density-based clustering and grid-based clustering.

Density-based methods (e.g., DBSCAN, OPTICS, DenClue) grow clusters from dense regions of
objects and can discover clusters of arbitrary shape while handling noise, whereas grid-based
methods (e.g., STING, WaveCluster, CLIQUE) quantize the object space into cells and cluster on
the grid, giving fast processing time that depends on the number of cells rather than on the
number of objects.

14. Define outlier. How will you determine outliers in the data?

An outlier is an observation that lies an abnormal distance from other values in a random sample
from a population. Outliers can be determined by examining the data for unusual observations
that are far removed from the mass of the data, for example with a box plot of the ordered
observations, a scatter plot, or a z-score threshold.

15. Discuss the challenges of outlier detection.

Low data quality and the presence of noise bring a huge challenge to outlier detection. Moreover,
noise and missing data may “hide” outliers and reduce the effectiveness of outlier detection: an
outlier may appear as a noise point, and an outlier detection method may mistakenly identify a
noise point as an outlier.

16) Distinguish between Classification and clustering.

• Type – Classification: used for supervised learning; Clustering: used for unsupervised learning.
• Basic – Classification: the process of classifying the input instances based on their corresponding
class labels; Clustering: grouping the instances based on their similarity, without the help of
class labels.
• Need – Classification: since it has labels, a training and testing dataset is needed to verify the
model created; Clustering: there is no need of a training and testing dataset.

17) Evaluate what information is used by outlier detection method.


Detection methods using relevant subspaces exploit local information, which can be
represented as relevant features, to identify outliers. A typical example of this kind is
SOD (subspace outlier detection), where for each object its correlation dataset with
shared nearest neighbors is explored.
18)Give the methods of clustering high dimensional data.
• Subspace clustering.
• Projected clustering.
• Projection-based clustering.
• Hybrid approaches.
• Correlation clustering.

19) List out the difference between characterization and clustering.

• Basic – Classification: this model function classifies the data into one of numerous already
defined definite classes; Clustering: this function maps the data into one of multiple clusters,
where the arrangement of data items relies on the similarities between them.
• Involved in – Classification: supervised learning; Clustering: unsupervised learning.
• Training sample – Classification: labeled data is provided; Clustering: unlabeled data is provided.

20) Explain the typical phases of outlier detection methods.

1. Statistical Methods
Simply starting with visual analysis of the univariate data by using box plots, scatter plots,
whisker plots, etc., can help in finding the extreme values in the data.

2. Proximity Methods
Proximity-based methods deploy clustering techniques to identify the clusters in the data
and find out the centroid of each cluster.

3. Projection Methods
Projection methods utilize techniques such as PCA to model the data into a
lower-dimensional subspace using linear correlations.
PART B

1. (i).Analyze the Requirements of clustering in Data Mining(8).

ii)Analyse the desirable properties of Clustering algorithm.(5)

Answer:

(i)Clustering is the process of making a group of abstract objects into classes of similar objects.

Points to Remember

• A cluster of data objects can be treated as one group.


• While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.
• The main advantage of clustering over classification is that, it is adaptable to changes and
helps single out useful features that distinguish different groups.

Requirements of Clustering in Data Mining

The following points throw light on why clustering is required in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large
databases.
• Ability to deal with different kinds of attributes − Algorithms should be capable to
be applied on any kind of data such as interval-based (numerical) data, categorical,
and binary data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. They should not be bounded to only
distance measures that tend to find spherical clusters of small size.
• High dimensionality − The clustering algorithm should not only be able to handle
low-dimensional data but also the high dimensional space.
• Ability to deal with noisy data − Databases contain noisy, missing or erroneous
data. Some algorithms are sensitive to such data and may lead to poor quality
clusters.
• Interpretability − The clustering results should be interpretable, comprehensible, and
usable.

ii) Analyse the desirable properties of Clustering algorithm. (5)

The desirable properties of a clustering algorithm largely mirror the requirements above:
scalability, the ability to deal with different types of attributes and with noisy data, discovery of
clusters of arbitrary shape, the ability to handle high-dimensional data, minimal requirements for
domain knowledge to determine input parameters, insensitivity to the order of input records, and
interpretability and usability of the results.

2(i).Describe in detail about categorization of major clustering methods(8)

CLUSTER ANALYSIS

Cluster is a group of objects that belong to the same class.


In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in
another cluster.

Points to Remember

• A cluster of data objects can be treated as one group.


• While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and
helps single out useful features that distinguish different groups.

Requirements of Clustering in Data Mining

Here are the typical requirements of clustering in data mining:

• Scalability - We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kind of attributes - Algorithms should be capable to be applied
on any kind of data such as interval based (numerical) data, categorical, binary data.
• Discovery of clusters with arbitrary shape - The clustering algorithm should be capable
of detecting clusters of arbitrary shape. It should not be bounded to only distance measures
that tend to find spherical clusters of small size.
• High dimensionality - The clustering algorithm should not only be able to handle low-
dimensional data but also the high dimensional space.
• Ability to deal with noisy data - Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
• Interpretability - The clustering results should be interpretable, comprehensible and usable.

Clustering Methods

The clustering methods can be classified into following categories:

• Kmeans
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
1. K-means
Given k, the k-means algorithm is implemented in four steps:

1.Partition objects into k nonempty subsets

2.Compute seed points as the centroids of the clusters of the current partition (the centroid is
the center, i.e., mean point, of the cluster)

3.Assign each object to the cluster with the nearest seed point

4.Go back to Step 2, stop when no more new assignment


2. Partitioning Method
Suppose we are given a database of n objects; the partitioning method constructs k partitions of the
data. Each partition will represent a cluster and k ≤ n. It means that it will classify the data into k
groups, which satisfy the following requirements:

• Each group contains at least one object.


• Each object must belong to exactly one group.

Typical methods:

K-means, k-medoids, CLARANS

3. Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects:

• Agglomerative Approach
• Divisive Approach

4.Density-based Method
Clustering based on density (local cluster criterion), such as density-connected points

Major features:

• Discover clusters of arbitrary shape


• Handle noise
• One scan
• Need density parameters as termination condition

Two parameters:

• Eps: Maximum radius of the neighbourhood


• MinPts: Minimum number of points in an Eps-neighbourhood of that point

Typical methods: DBSACN, OPTICS, DenClue

(ii).List out the General applications of Clustering. (5)

Applications of Clustering
1. It is the backbone of search engine algorithms – where objects that are similar to each other must
be presented together and dissimilar objects should be ignored. Also, it is required to fetch
objects that are closely related to a search term, if not completely related.
2. A similar application of text clustering like search engine can be seen in academics where
clustering can help in the associative analysis of various documents – which can be in-turn used in
– plagiarism, copyright infringement, patent analysis etc.
3. Used in image segmentation in bioinformatics where clustering algorithms have proven their
worth in detecting cancerous cells from various medical imagery – eliminating the prevalent
human errors and other bias.
4. Netflix has used clustering in implementing movie recommendations for its users.
5. News summarization can be performed using Cluster analysis where articles can be divided into a
group of related topics.

3.What is clustering? Describe in detail about the features of K-means partitioning method. (13)
Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same groups are more similar to other data points in the
same group than those in other groups. In simple words, the aim is to segregate groups
with similar traits and assign them into clusters.

Kmeans algorithm is an iterative algorithm that tries to partition the dataset


into K pre-defined distinct non-overlapping subgroups (clusters) where each
data point belongs to only one group. It tries to make the intra-cluster data
points as similar as possible while also keeping the clusters as different (far) as
possible. It assigns data points to a cluster such that the sum of the squared
distance between the data points and the cluster’s centroid (arithmetic mean of
all the data points that belong to that cluster) is at the minimum. The less
variation we have within clusters, the more homogeneous (similar) the data
points are within the same cluster.

The way kmeans algorithm works is as follows:

1. Specify number of clusters K.

2. Initialize centroids by first shuffling the dataset and then randomly


selecting K data points for the centroids without replacement.

3. Keep iterating until there is no change to the centroids. i.e assignment of


data points to clusters isn’t changing.

• Compute the sum of the squared distance between data points and all
centroids.

• Assign each data point to the closest cluster (centroid).


• Compute the centroids for the clusters by taking the average of the all data
points that belong to each cluster.

The approach kmeans follows to solve the problem is called Expectation-


Maximization. The E-step is assigning the data points to the closest cluster.
The M-step is computing the centroid of each cluster. Below is a break down of
how we can solve it mathematically (feel free to skip it).

The objective function is:

J = Σ_i Σ_k wik · ||xi − μk||²

where wik = 1 for data point xi if it belongs to cluster k; otherwise, wik = 0. Also,
μk is the centroid of xi’s cluster.

It’s a minimization problem of two parts. We first minimize J w.r.t. wik and
treat μk fixed. Then we minimize J w.r.t. μk and treat wik fixed. Technically
speaking, we differentiate J w.r.t. wik first and update cluster assignments (E-
step). Then we differentiate J w.r.t. μk and recompute the centroids after the
cluster assignments from the previous step (M-step). Therefore, the E-step is:

wik = 1 if k = argmin_j ||xi − μj||², and wik = 0 otherwise

In other words, assign the data point xi to the closest cluster as judged by its sum
of squared distance from the cluster’s centroid.

And the M-step is:

μk = ( Σ_i wik · xi ) / ( Σ_i wik )

which translates to recomputing the centroid of each cluster to reflect the new
assignments.
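A minimal NumPy sketch of exactly these two alternating steps (the E-step assigns each point to its nearest centroid, the M-step recomputes each centroid as the mean of its assigned points); the function name, toy data and seed are illustrative and not part of the original text.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # random initial centroids
    for _ in range(n_iter):
        # E-step: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):               # converged
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids, sep="\n")
```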

kmeans algorithm is very popular and used in a variety of applications such as


market segmentation, document clustering, image segmentation and image
compression, etc. The goal usually when we undergo a cluster analysis is either:

1. Get a meaningful intuition of the structure of the data we’re dealing with.

2. Cluster-then-predict where different models will be built for different


subgroups if we believe there is a wide variation in the behaviors of different
subgroups. An example of that is clustering patients into different
subgroups and build a model for each subgroup to predict the probability of
the risk of having heart attack.
Drawbacks
The k-means algorithm is good at capturing the structure of the data if the clusters have a
spherical-like shape. It always tries to construct a nice spherical shape around
the centroid. That means that the moment the clusters have complicated geometric
shapes, k-means does a poor job of clustering the data. We'll illustrate three
cases where k-means will not perform well.

5.What is grid based clustering? With an example explain an algorithm for grid

based clustering.

Grid-Based Clustering

Grid-Based Clustering method uses a multi-resolution grid data structure.

STING - A Statistical Information Grid Approach

In this method, the spatial area is divided into rectangular cells.

There are several levels of cells corresponding to different levels of resolution

For each cell, the high level is partitioned into several smaller cells in the next lower level.

The statistical info of each cell is calculated and stored beforehand and is used to answer queries.

The parameters of higher-level cells can be easily calculated from parameters of lower-level cell

Count, mean, s, min, max

Type of distribution—normal, uniform, etc.

Then, using a top-down approach, spatial data queries are answered as follows.

Start from a pre-selected layer (typically one with a small number of cells):

1.For each cell in the current level compute the confidence interval.

2.Now remove the irrelevant cells from further consideration.

3.When finishing examining the current layer, proceed to the next lower level.

4.Repeat this process until the bottom layer is reached.
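A minimal sketch, assuming NumPy and made-up 2-D points, of the grid idea behind STING: the space is partitioned into rectangular cells and a per-cell statistical parameter (here only the count) is precomputed in one scan; the multi-level hierarchy and full query answering are not shown.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.uniform(0, 10, size=(200, 2))        # illustrative 2-D spatial data

n_cells = 4                                       # a 4 x 4 grid at this resolution level
edges = np.linspace(0, 10, n_cells + 1)
counts = np.zeros((n_cells, n_cells), dtype=int)

# one scan of the data: precompute the per-cell statistical parameter (the count)
ix = np.clip(np.digitize(points[:, 0], edges) - 1, 0, n_cells - 1)
iy = np.clip(np.digitize(points[:, 1], edges) - 1, 0, n_cells - 1)
for i, j in zip(ix, iy):
    counts[i, j] += 1

print("counts per cell:\n", counts)
# cells whose count exceeds a threshold are kept as relevant for the next lower level
print("relevant cells:", np.argwhere(counts > 15).tolist())
```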

WaveCluster

It was proposed by Sheikholeslami, Chatterjee, and Zhang (VLDB’98).

It is a multi-resolution clustering approach which applies wavelet transform to the feature space

A wavelet transform is a signal processing technique that decomposes a signal into different
frequency sub-band.

It can be both grid-based and density-based method.

Input parameters:

No of grid cells for each dimension

The wavelet, and the no of applications of wavelet transform

CLIQUE - Clustering In QUEst

It was proposed by Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98).

It is based on automatically identifying the subspaces of high dimensional data space that allow
better clustering than original space.

CLIQUE can be considered as both density-based and grid-based:

1.It partitions each dimension into the same number of equal-length intervals.

2.It partitions an m-dimensional data space into non-overlapping rectangular units.

3.A unit is dense if the fraction of the total data points contained in the unit exceeds the input model
parameter.

4.A cluster is a maximal set of connected dense units within a subspace.

6) (i)Demonstrate in detail about model based clustering methods.


The traditional clustering methods, such as hierarchical clustering and k-means clustering,
are heuristic and are not based on formal models. Furthermore, the k-means algorithm is
commonly randomly initialized, so different runs of k-means will often yield different
results. Additionally, k-means requires the user to specify the optimal number of clusters.
An alternative is model-based clustering, which considers the data as coming from a
distribution that is a mixture of two or more clusters (Fraley and Raftery 2002; Fraley et al.
2012). Unlike k-means, model-based clustering uses a soft assignment, where each
data point has a probability of belonging to each cluster.

In model-based clustering, the data is considered as coming from a mixture of densities.

Each component (i.e. cluster) k is modeled by the normal or Gaussian distribution, which is
characterized by the parameters:
▪ μk: mean vector,
▪ Σk: covariance matrix,
▪ an associated probability in the mixture. Each point has a probability of belonging to each
cluster.
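As a concrete illustration (assuming scikit-learn is available and using made-up data), GaussianMixture fits such a mixture of Gaussians, and predict_proba returns the soft assignment described above, i.e. each point's probability of belonging to each component.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two illustrative Gaussian blobs
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.means_)                    # estimated mean vector of each component
print(gm.predict_proba(X[:3]))      # soft cluster membership of the first three points
```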

(ii).Illustrate the topic on (6)


1.CLIQUE:
Clique clustering is the problem of partitioning the vertices of a graph into disjoint clusters,
where each cluster forms a clique in the graph, while optimizing some objective
function. The goal is to maintain a clustering with an objective value close to the optimal
solution.

2. DBSCAN
DBSCAN stands for density-based spatial clustering of applications with noise. It is able to
find arbitrary shaped clusters and clusters with noise (i.e. outliers). The main idea behind
DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.
DBSCAN is a clustering method that is used in machine learning to separate clusters of high
density from clusters of low density.
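A minimal sketch of DBSCAN with scikit-learn (assuming it is available); eps and min_samples are its two parameters, and a label of -1 marks points treated as noise/outliers. The tiny dataset is made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two dense groups plus one isolated point that should be labelled as noise
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
              [4.5, 0.0]])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)          # e.g. [0 0 0 1 1 1 -1]: two clusters and one noise point
```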

7 (i).Demonstrate on clustering high dimensional data. (6)

Clustering high dimensional data


Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to
many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas
such as medicine, where DNA microarray technology can produce a large number of measurements
at once, and the clustering of text documents, where, if a word-frequency vector is used, the number
of dimensions equals the size of the dictionary.

Approaches:

1. Subspace clustering
Subspace clustering is an unsupervised learning problem that aims at
grouping data points into multiple clusters so that the data points in a single
cluster lie approximately on a low-dimensional linear subspace. Subspace
clustering is an extension of feature selection: just as with feature selection,
subspace clustering requires a search method and evaluation criteria, but in
addition subspace clustering limits the scope of the evaluation criteria. The
subspace clustering algorithm localizes the search for relevant dimensions
and allows clusters that exist in multiple overlapping subspaces to be found.
Subspace clustering was originally proposed to solve very specific computer
vision problems having a union-of-subspaces structure in the data, but it has
gained increasing attention in the statistics and machine learning community.

2. Projected clustering

• Projected clustering is a typical dimension-reduction subspace
clustering method. That is, instead of initiating from single-dimensional
spaces, it proceeds by identifying an initial approximation of the clusters
in the high-dimensional attribute space.
• Each dimension is then allocated a weight for each cluster, and the
updated weights are used in the next iteration to regenerate the clusters.
This leads to the inspection of dense regions in all subspaces of some
desired dimensionality.
• It avoids the production of a huge number of overlapped clusters in lower
dimensionality.
• Projected clustering finds the best set of medoids by a hill-climbing
technique, generalized to deal with projected clustering.
• It uses a distance measure called Manhattan segmental distance.
• The algorithm is composed of three phases: initialization, iteration, and cluster
refinement.

3. Correlation clustering
Correlation clustering is performed on databases and other large data
sources to group together similar datasets, while also alerting the user to
dissimilar datasets. This can be done perfectly in some graphs, while
others will experience errors because it will be difficult to differentiate
similar from dissimilar data. In the case of the latter, correlation clustering
will help reduce error automatically. This is often used for data mining, or
to search unwieldy data for similarities. Dissimilar data are commonly
deleted, or placed into a separate cluster.
ii).Consider five points { X1, X2,X3, X4, X5} with the following coordinates as a two
dimensional sample for clustering: X1 = (0,2.5); X2 = (0,0); X3= (1.5,0); X4 = (5,0); X5 = (5,2)
Illustrate the K-means partitioning algorithm using the above data set. (7)
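As a hedged illustration (not a worked-by-hand solution), the five given points can be clustered with scikit-learn's KMeans, assuming scikit-learn is available and choosing k = 2, since the question does not fix k:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 2.5], [0, 0], [1.5, 0], [5, 0], [5, 2]])   # X1..X5 from the question

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index assigned to X1..X5
print(km.cluster_centers_)  # final centroids (cluster means)
```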
8. i)How would you discuss the outlier analysis in detail ? (7)

Answer:

(i) Outlier analysis is the process of identifying outliers, or abnormal observations, in


a dataset. Also known as outlier detection, it’s an important step in data analysis,
as it removes erroneous or inaccurate observations which might otherwise skew
conclusions.

There are a wide range of techniques and tools used in outlier analysis.

Sorting: For an amateur data analyst, sorting is by far the easiest technique for
outlier analysis. The premise is simple: load your dataset into any kind of data
manipulation tool (such as a spreadsheet), and sort the values by their magnitude.

Graphing
An equally forgiving tool for outlier analysis is graphing. Once again, the premise is
straightforward: plot all of the data points on a graph, and see which points stand
out from the rest. The advantage of using a graphing approach over a sorting
approach is that it visualizes the magnitude of the data points, which makes it
much easier to spot outliers.

Z-score
A more statistical technique that can be used to identify outliers is the Z-score. The
Z-score measures how far a data point is from the average, as measured in
standard deviations. By calculating the Z-score for each data point, it’s easy to see
which data points are placed far from the average.
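A minimal sketch of the Z-score technique just described, on made-up data; a threshold of 2 standard deviations is used here only because the sample is tiny (3 is the more common choice).

```python
import numpy as np

data = np.array([10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 25.0])   # 25.0 is the suspicious value

z = (data - data.mean()) / data.std()   # how far each value is from the mean, in std devs
outliers = data[np.abs(z) > 2]          # small threshold because the sample is tiny
print(list(zip(data, np.round(z, 2))))
print("flagged as outliers:", outliers)
```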

ii)Discuss in detail about the various detection techniques in outlier. (6)

1. Statistical Methods

Simply starting with visual analysis of the Univariate data by using Boxplots, Scatter plots,

Whisker plots, etc., can help in finding the extreme values in the data. Assuming a normal

distribution, calculate the z-score, which measures how many standard deviations (σ) a data

point is from the sample’s mean. Because we know from the Empirical Rule that
68% of the data falls within one standard deviation, 95% within two standard

deviations, and 99.7% within three standard deviations from the mean, we can identify data

points that are more than three standard deviations away from the mean as outliers.

2. Proximity Methods

Proximity-based methods deploy clustering techniques to identify the clusters in the data

and find out the centroid of each cluster. They assume that an object is an outlier if the

nearest neighbors of the object are far away in feature space; that is, the proximity of the

object to its neighbors significantly deviates from the proximity of most of the other objects

to their neighbors in the same data set.

3. Projection Methods

Projection methods utilize techniques such as the PCA to model the data into a lower-

dimensional subspace using linear correlations. Post that, the distance of each data point to

a plane that fits the sub-space is calculated. This distance can be used then to find the

outliers. Projection methods are simple and easy to apply and can highlight irrelevant values.

9. (i).Explain in detail about data mining applications (5)?

Data Mining Applications

Here is the list of areas where data mining is widely used −

• Financial Data Analysis


• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection
Financial Data Analysis
The financial data in banking and financial industry is generally reliable and of high quality which
facilitates systematic data analysis and data mining
Retail Industry
Data Mining has its great application in Retail Industry because it collects large amount of data from
on sales, customer purchasing history, goods transportation, consumption and services.
Telecommunication Industry
Today the telecommunication industry is one of the most emerging industries providing
various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web
data transmission, etc.
Biological Data Analysis
In recent times, we have seen a tremendous growth in the field of biology such as genomics,
proteomics, functional Genomics and biomedical research. Biological data mining is a very
important part of Bioinformatics

(ii). With relevant examples summarize in detail about constraint based cluster analysis. (8)


10. Design statistical approaches in outlier detection with neat design and with examples. (13)

12.(i). Discuss in detail about the different types of data in cluster analysis. (5)

(I)

Types of data structures in cluster analysis are

1.Data Matrix (or object by variable structure)

2.Dissimilarity Matrix (or object by object structure)

Data Matrix

This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, race, and so on. The structure is in the form of a relational table, or an n-by-p matrix (n objects x p variables).
The data matrix is often called a two-mode matrix, since its rows and columns represent different entities (objects and variables).

Dissimilarity Matrix

This stores a collection of proximities that are available for all pairs of the n objects. It is often represented by an n-by-n table, where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a non-negative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, the matrix is symmetric with zeros on the diagonal, so only the lower (or upper) triangle needs to be stored.

This is also called a one-mode matrix, since its rows and columns both represent the same entity (objects).
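As a small worked example, the sketch below builds the dissimilarity matrix (one-mode) from a 4 x 2 data matrix (two-mode) using Euclidean distance; the variable names and the tiny data set are illustrative.

    import numpy as np

    # data matrix: 4 objects (rows) described by 2 variables (columns)
    X = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [4.0, 5.0],
                  [5.0, 4.0]])

    # dissimilarity matrix: d(i, j) = Euclidean distance between objects i and j
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    print(np.round(D, 2))   # symmetric with zeros on the diagonal: d(i, j) = d(j, i), d(i, i) = 0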
(ii). Discuss the following clustering algorithms using examples. (8)

1. K-means: K-means partitions n objects into k clusters. It starts with k initial cluster centers (centroids), assigns each object to the cluster whose centroid is nearest, recomputes each centroid as the mean of the objects assigned to it, and repeats until the assignments no longer change. For example, customer records described by age and annual spend can be grouped into k = 3 segments for targeted marketing. K-means is efficient, but it is sensitive to outliers because a single extreme value can pull a cluster mean away from the bulk of its cluster. A minimal sketch is given after this list.

2. K-medoids: Instead of the mean, K-medoids (for example, the PAM algorithm) uses an actual object, the medoid, as the representative of each cluster, chosen so that the total dissimilarity between the medoid and the other objects in its cluster is as small as possible. A non-medoid object replaces a current medoid only if the swap reduces the total cost. Because the representatives are real objects rather than means, K-medoids is more robust to outliers than K-means, though it is computationally more expensive. In the customer-segmentation example, a few customers with extremely high spend would distort a K-means centroid but would have far less effect on a medoid.
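The following is a minimal Lloyd's-algorithm K-means sketch in Python/NumPy, intended only to illustrate the assignment and update steps described above; the function name kmeans, the random initialisation, and the toy data are assumptions of the example, not a reference implementation.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Minimal K-means: returns (centroids, labels)."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # k initial centroids drawn from the data
        for _ in range(n_iter):
            # assignment step: each object joins the cluster of its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
            labels = dists.argmin(axis=1)
            # update step: each centroid moves to the mean of the objects assigned to it
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels

    X = np.array([[1, 1], [1.5, 2], [1, 1.5], [8, 8], [8.5, 8], [9, 8.5]])
    centroids, labels = kmeans(X, k=2)
    print(labels)        # the two well-separated groups receive two different cluster labels

K-medoids differs only in the update step: instead of the mean, it chooses the actual object in the cluster that minimises the total dissimilarity to the other members, which is why it tolerates outliers better.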
13. Describe the applications and trends in data mining in detail.

Data Mining Applications:

Financial Data Analysis


The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some typical cases are as follows −
• Design and construction of data warehouses for multidimensional data analysis and
data mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
Retail Industry
Data Mining has great application in the Retail Industry because it collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and popularity of the web.
Telecommunication Industry
Data mining in the telecommunication industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving the quality of service. Here is a list of examples for which data mining improves telecommunication services −
• Multidimensional Analysis of Telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential patterns analysis.
• Mobile Telecommunication services.
• Use of visualization tools in telecommunication data analysis.

Trends in Data Mining

Data mining concepts are still evolving and here are the latest trends that we get to see in
this field −
• Application Exploration.
• Scalable and interactive data mining methods.
• Integration of data mining with database systems, data warehouse systems and web
database systems.
• Standardization of data mining query language.
• Visual data mining.
• New methods for mining complex types of data.
• Biological data mining.
• Data mining and software engineering.
• Web mining.
• Distributed data mining.
• Real time data mining.
• Multi database data mining.
• Privacy protection and information security in data mining.

14. Why is outlier mining important? Briefly describe the different approaches behind statistical-based outlier detection, distance-based outlier detection, and deviation-based outlier detection.
Outliers are an integral part of data analysis. An outlier can be defined as an observation point that lies at an abnormal distance from the other observations.
Outliers are important because they may indicate an error in the experiment or a genuinely unusual event. Outlier detection is used extensively in areas such as detecting fraud, identifying potential new trends in the market, and others.
Usually, outliers are confused with noise. However, outliers differ from noise in the following sense:
1. Noise is a random error, whereas an outlier is an observation point that is situated away from the other observations.
2. Noise should be removed for better outlier detection.

Statistical Distribution-Based Outlier Detection:


The statistical distribution-based approach to outlier detection assumes a distribution or
probability model for the given data set (e.g., a normal or Poisson distribution) and then
identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers.
A statistical discordancy test examines two hypotheses:
▪ A working hypothesis
▪ An alternative hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,

H : oi ∈ F, where i = 1, 2, ..., n

The hypothesis is retained if there is no statistically significant evidence supporting its


rejection. A discordancy test verifies whether an object, oi, is significantly large (or small) in
relation to the distribution F. Different test statistics have been proposed for use as a
discordancy test, depending on the available knowledge of the data. Assuming that some
statistic, T, has been chosen for discordancy testing, and the value of the statistic for object
oi is vi, then the distribution of T is constructed. Significance probability, SP(vi)=Prob(T > vi),
is evaluated. If SP(vi) is sufficiently small, then oi is discordant and the working hypothesis is
rejected.
An alternative hypothesis, H, which states that oi comes from another distribution model, G,
is adopted. The result is very much dependent on which model F is chosen because oi may
be an outlier under one model and a perfectly valid value under another. The alternative
distribution is very important in determining the power of the test, that is, the probability
that the working hypothesis is rejected when oi is really an outlier. There are different kinds
of alternative distributions.
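As a hedged numerical illustration of a discordancy test, the sketch below takes T to be the absolute z-score, estimates the parameters of F from the sample, and computes the significance probability SP(v) = Prob(T > v) with SciPy's normal survival function; the function name discordancy_test and the 5% significance level are assumptions of the example.

    import numpy as np
    from scipy import stats

    def discordancy_test(values, candidate, alpha=0.05):
        """Two-sided discordancy test of `candidate` under a normal working hypothesis F."""
        values = np.asarray(values, dtype=float)
        mu, sigma = values.mean(), values.std(ddof=1)   # estimated parameters of F
        v = abs(candidate - mu) / sigma                 # observed value of the test statistic T
        sp = 2 * stats.norm.sf(v)                       # significance probability SP(v) = Prob(T > v)
        return sp, sp < alpha                           # reject H (declare the object discordant) if SP is small

    sample = np.append(np.random.default_rng(1).normal(100, 10, 500), [160.0])
    print(discordancy_test(sample, 160.0))              # very small SP -> the working hypothesis is rejected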

Distance-Based Outlier Detection:


The notion of distance-based outliers was introduced to counter the main limitations imposed by statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with parameters pct and dmin, that is, a DB(pct, dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o. In other words, rather than relying on statistical tests, we can think of distance-based outliers as those objects that do not have enough neighbors, where neighbors are defined based on distance from the given object. In comparison with statistical-based methods, distance-based outlier detection generalizes the ideas behind discordancy testing for various standard distributions. Distance-based outlier detection avoids the excessive computation that can be associated with fitting the observed distribution into some standard distribution and in selecting discordancy tests.

For many discordancy tests, it can be shown that if an object, o, is an outlier according to the given test, then o is also a DB(pct, dmin)-outlier for some suitably defined pct and dmin. For example, if objects that lie three or more standard deviations from the mean are considered to be outliers, assuming a normal distribution, then this definition can be generalized by a DB(0.9988, 0.13σ)-outlier. Several efficient algorithms for mining distance-based outliers have been developed.
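A direct, brute-force reading of the DB(pct, dmin) definition is sketched below: an object is reported as an outlier when at least a fraction pct of the other objects lie farther than dmin from it. The function name db_outliers and the parameter values are illustrative; efficient index-based or cell-based algorithms are used for large data sets.

    import numpy as np

    def db_outliers(X, pct, dmin):
        """Indices of DB(pct, dmin)-outliers in X (brute-force distance computation)."""
        X = np.asarray(X, dtype=float)
        n = len(X)
        diff = X[:, None, :] - X[None, :, :]
        dists = np.sqrt((diff ** 2).sum(axis=-1))
        far_fraction = (dists > dmin).sum(axis=1) / (n - 1)   # fraction of *other* objects farther than dmin
        return np.where(far_fraction >= pct)[0]

    X = np.array([[0, 0], [0.5, 0.5], [1, 0], [0, 1], [0.5, 1], [10, 10]])
    print(db_outliers(X, pct=0.95, dmin=3.0))   # -> [5]: every other object is farther than 3 from [10, 10]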
Deviation-Based Outlier Detection:
Deviation-based outlier detection does not use statistical tests or distance-based measures
to identify exceptional objects. Instead, it identifies outliers by examining the main
characteristics of objects in a group. Objects that "deviate" from this description are
considered outliers. Hence, in this approach the term deviations is typically used to refer to
outliers. In this section, we study two techniques for deviation-based outlier detection. The
first sequentially compares objects in a set, while the second employs an OLAP data cube
approach.
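As a simplified sketch of the sequential-comparison idea (not the full sequential-exception algorithm), the code below uses the variance of a set as its dissimilarity measure and scores each object by how much the variance drops when that object is removed; the object causing the largest drop deviates most from the group's main characteristics. The function name deviation_scores and the toy data are illustrative assumptions.

    import numpy as np

    def deviation_scores(values):
        """Variance reduction achieved by removing each object from the set."""
        values = np.asarray(values, dtype=float)
        total_var = values.var()                    # dissimilarity of the full set
        drops = []
        for i in range(len(values)):
            rest = np.delete(values, i)             # the set without object i
            drops.append(total_var - rest.var())    # how much removing i smooths the set
        return np.array(drops)

    data = [12, 11, 13, 12, 11, 12, 40]             # 40 deviates from the rest of the group
    print(deviation_scores(data).argmax())          # -> 6: removing 40 reduces the variance the most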
