Data Warehousing and Data Mining Answer Key - Anna University (16M & 2M With Answers)
in
ENGINEERING COLLEGES
2017 – 18 ODD SEMESTER
Regulation: 2013
Prepared by
Sl. No. | Name of the Faculty | Designation | Affiliating College
1 | Dr. A. Anitha | PROF/IT | FXEC
Classification of Data Mining Systems – Data Mining Task Primitives – Integration of a Data
Mining System with a Data Warehouse – Issues – Data Preprocessing.
9
UNIT IV
ASSOCIATION RULE MINING AND CLASSIFICATION
Mining Frequent Patterns, Associations and Correlations – Mining Methods – Mining various
Kinds of Association Rules – Correlation Analysis – Constraint Based Association Mining –
9
TEXT BOOKS:
1. Alex Berson and Stephen J. Smith, “Data Warehousing, Data Mining and OLAP”, Tata McGraw-Hill.
REFERENCES:
1. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, “Introduction to Data Mining”, Pearson Education, 2007.
2. K.P. Soman, Shyam Diwakar and V. Ajay, “Insight into Data Mining Theory and Practice”, Eastern
Economy Edition, Prentice Hall of India, 2006.
3. G. K. Gupta, “Introduction to Data Mining with Case Studies”, Eastern Economy Edition, Prentice
Hall of India, 2006.
4. Daniel T.Larose, “Data Mining Methods and Models”, Wiley-Interscience, 2006.
TABLE OF CONTENTS
S.No | Title | Page No.
A | Aim and Objective of the Subject | 3
B | Detailed lesson plan | 4
UNIT I – DATA WAREHOUSING
1 | Part-A | 8
2 | Part-B | 14
3 | Star schema and snow-flake schema | 14
4 | Three tier Architecture of Data warehouse | 17
5 | Mapping data warehouse to multiprocessor architecture | 19
6 | Components of Data warehouse | 23
7 | Building a data warehouse | 25
8 | Meta data | 27
UNIT II – BUSINESS ANALYSIS
9 | Part-A | 29
10 | Part-B | 35
11 | Cognos Impromptu | 35
12 | OLAP Operations | 37
13 | MOLAP vs ROLAP | 39
14 | OLAP Tools | 40
15 | Data Models | 42
UNIT III – DATA MINING
16 | Part-A | 47
17 | Part-B | 52
18 | Task Primitives | 53
19 | Concept Hierarchy | 54
20 | Classification of data mining | 56
21 | Knowledge discovery process | 57
22 | Data cleaning methods | 58
AIM
OBJECTIVES:
OUTCOMES
Upon completion of the course, the student should be able to:
Apply data mining techniques and methods to large data sets.
Use data mining tools.
Compare and contrast the various classifiers.
Instruction Schedule
S.No | Week No | Topics | No of Hrs | Book No | Page No.
UNIT – I DATA WAREHOUSING (8)
1 | | Data warehousing Components | 2 | T1 | 113-127
14 | IV | Multidimensional versus Multirelational OLAP – Categories of Tools | 1 | T1 | 251-256
15 | | OLAP Tools and the Internet | 1 | T1 | 262-265
Remarks:
UNIT III – DATA MINING (9)
16 | V | Introduction | 1 | T2 | 1-9
17 | | Data – Types of Data | 1 | T2 | 9-21
18 | | Data Mining Functionalities | 1 | T2 | 21-27
34 | VIII | Associative Classification – Lazy Learners | 1 | T2 | 344-351
35 | | Other Classification Methods – Prediction | 1 | T2 | 351-359
Remarks:
Total Hours: 48
UNIT I
DATA WAREHOUSING
PART A
1) List the characteristics of Data warehouse
Subject Oriented
Integrated
Nonvolatile
Time Variant
Some data is de-normalized for simplification and to improve
performance
Large amounts of historical data are used
Queries often retrieve large amounts of data
Both planned and ad hoc queries are common
The data load is controlled
Data partitioning is a key requirement for effective parallel execution of database operations. It spreads data from database tables across multiple disks so that I/O operations such as read and write can be performed in parallel.
Random partitioning
Intelligent partitioning
A data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
The process of extracting data from source systems and bringing it into
the data warehouse is commonly called ETL, which stands for extraction,
transformation, and loading
A data mart is the access layer of the data warehouse environment that
is used to get data out to the users. The data mart is a subset of the data
warehouse that is usually oriented to a specific business line or team. Data
marts are small slices of the data warehouse.
10) How is a data warehouse different from a database? How are they similar?
(Nov/Dec 2007, Nov/Dec 2010,Apr/May 2017)
Data warehouse is a repository of multiple heterogeneous data sources,
organized under a unified schema at a single site in order to facilitate
management decision-making.
A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
Each tuple in a relational table represents an object identified by a
unique key and described by a set of attribute values. Both are used to store and
manipulate the data.
PART B
1) Explain star schema and snow flake schema with example and discuss their
performance problems May’15, Dec’13, May’11/ Explain about
multidimensional Schema with example Dec’15, Dec’14,Dec ‘16
Star schema
A large central table (fact table) containing the bulk of the data, with no redundancy.
A set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes. For example, the location dimension table contains the attribute set {location key, street, city, province or state, country}. This constraint may introduce some redundancy.
Snowflake schema
The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional
tables. The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that
the dimension tables of the snowflake model may be kept in normalized form to
reduce redundancies. Such a table is easy to maintain and saves storage space.
However, this saving of space is negligible in comparison to the typical
magnitude of the fact table. Furthermore, the snowflake structure can reduce the
effectiveness of browsing, since more joins will be needed to execute a query.
Consequently, the system performance may be adversely impacted. Hence, although
the snowflake schema reduces redundancy, it is not as popular as the star schema in
data warehouse design.
In the snowflake schema, some dimension tables of the star schema are normalized into additional tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key and supplier type information. Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city. The city key in the new location table links to the city dimension. Notice that further normalization can be performed on province or state and country.
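To make the join behaviour concrete, here is a minimal Python/pandas sketch (the tables and values are hypothetical, not taken from the text): the same query needs one join against the denormalized location dimension of a star schema, but an extra join once that dimension is snowflaked into location and city tables.

```python
import pandas as pd

# Hypothetical star-schema tables: one fact table, one denormalized location dimension.
sales_fact = pd.DataFrame({
    "location_key": [1, 1, 2],
    "item_key":     [10, 11, 10],
    "units_sold":   [5, 2, 7],
})
location_dim = pd.DataFrame({            # star: city and country stored together
    "location_key": [1, 2],
    "city":         ["Chennai", "Mumbai"],
    "country":      ["India", "India"],
})

# Star schema: a single join answers "units sold per city".
star = sales_fact.merge(location_dim, on="location_key")
print(star.groupby("city")["units_sold"].sum())

# Snowflake schema: the location dimension is normalized into location and city tables,
# so the same query needs one extra join.
location_sf = pd.DataFrame({"location_key": [1, 2], "city_key": [100, 200]})
city_dim = pd.DataFrame({"city_key": [100, 200],
                         "city": ["Chennai", "Mumbai"],
                         "country": ["India", "India"]})
snow = sales_fact.merge(location_sf, on="location_key").merge(city_dim, on="city_key")
print(snow.groupby("city")["units_sold"].sum())
```

The extra join in the snowflake case is exactly why the text above notes that the snowflake structure can reduce the effectiveness of browsing.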
The middle tier is an OLAP server that is typically implemented using either
(1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that
maps operations on multidimensional data to standard relational operations; or (2) a
multidimensional OLAP (MOLAP) model, that is, a special-purpose server that
directly implements multidimensional data and operations.
The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so
on).
From the architecture point of view, there are three data warehouse models:
enterprise warehouse,
data mart,
Virtual warehouse.
Types of Parallelism
Parallel execution of tasks within SQL statements can be done in either of two ways.
Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.
Vertical parallelism: occurs among different tasks – all component query operations are executed in parallel, in a pipelined fashion. In other words, an output from one task becomes an input into another task as soon as records become available.
Data Partitioning
Data partitioning is a key requirement for effective parallel execution of database operations. It spreads data from database tables across multiple disks so that I/O operations such as read and write can be performed in parallel.
Random partitioning includes random data striping across multiple disks on a single server. In round-robin partitioning, each new record is placed on the next disk assigned to the database.
Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks.
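As a rough illustration, the following Python sketch (hypothetical records and disk count, not from the text) contrasts round-robin placement, where the record's content plays no role, with hash-based "intelligent" placement, where the target disk can later be recomputed from the key instead of searching all disks.

```python
# Illustrative sketch of round-robin vs hash (intelligent) partitioning across disks.
records = [{"id": i, "value": i * 10} for i in range(8)]
num_disks = 3

# Round-robin: placement ignores the record's content, so lookups must scan every disk.
round_robin = {d: [] for d in range(num_disks)}
for i, rec in enumerate(records):
    round_robin[i % num_disks].append(rec["id"])

# Hash partitioning: hash(key) % num_disks can be recomputed later,
# so the DBMS knows which disk holds a given record.
hashed = {d: [] for d in range(num_disks)}
for rec in records:
    hashed[hash(rec["id"]) % num_disks].append(rec["id"])

print(round_robin)
print(hashed)
```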
Shared-disk Architecture
It implements the concept of shared ownership of the entire data base between
RDBMS servers, each of which is running on a node of distributed memory system.
Each RDBMS server can read, write, update and delete records from the same shared
data base, which would require the system to implement a form of distributed lock
manager (DLM).
S C
Pinning: In the worst-case scenario, if all nodes are reading and updating the same data, the RDBMS and its DLM will have to spend a lot of resources synchronizing multiple buffer pools. This problem is called pinning.
Data skew: uneven distribution of data. Shared-disk architectures can reduce performance bottlenecks resulting from data skew.
Shared-Nothing Architecture
The data is partitioned across many disks, and the DBMS is "partitioned" across multiple co-servers, each of which resides on an individual node of the parallel system and has ownership of its own disk and thus its own database partition.
It offers near-linear scalability. Its requirements include:
Support for function shipping
Parallel join strategies
Support for data repartitioning
Query compilation
Support for data base transactions
Support for the single system image of the data base environment.
Combined Architecture
Interserver parallelism of the distributed memory architecture means that each query is parallelized across multiple servers, while intraserver parallelism of the shared memory architecture means that a query is parallelized within the server.
Typically, the source data for the warehouse is coming from the operational
applications. As the data enters the warehouse, it is cleaned up and transformed
into an integrated structure and format. The transformation process may involve
conversion, summarization, filtering and condensation of data.
Because the data contains a historical component, the warehouse must be
capable of holding and managing large volumes of data as well as different data
structures for the same database over time.
Seven major components of data warehousing
1. Data Warehouse Database
2. Sourcing, Acquisition, Cleanup and Transformation Tools
3. Meta data
4. Access Tools
5. Data Marts
6. Data Warehouse Administration and Management
7. Information Delivery System
A data warehouse combines data from multiple, often heterogeneous, sources into a query database. The main factors include:
Heterogeneity of data sources, which affects data conversion, quality and timeliness
Use of historical data, which implies that data may be "old"
Tendency of the database to grow very large
Data Content: Typically a data warehouse may contain detailed data, but the data is
cleaned up and transformed to fit the warehouse model, and certain transactional
attributes of the data are filtered out. The content and the structure of the data
warehouses are reflected in its data model. The data model is a template for how
information will be organized within the integrated data warehouse framework.
Meta data: Defines the contents and location of the data in the warehouse,
relationship between the operational databases and the data warehouse, and the
business view of the warehouse data that are accessible by end-user tools. The
warehouse design should prevent any direct access to the warehouse data if it does not
use meta data definitions to gain the access.
Data distribution: As the data volumes continue to grow, the data base size may
rapidly outgrow a single server. Therefore, it becomes necessary to know how the data
should be divided across multiple servers. The data placement and distribution design
should consider several options, including data distribution by subject area, location, or time.
Tools: Data warehouse designers have to be careful not to sacrifice the overall design to fit a specific tool. Selected tools must be compatible with the given data warehousing environment and with each other.
Performance consideration: Rapid query processing is a highly desired feature that
should be designed into the data warehouse.
Nine decisions in the design of a data warehouse:
i. Choosing the subject matter
ii. Deciding what a fact table represents
iii. Identifying and conforming the decisions
iv. Choosing the facts
v. Storing pre calculations in the fact table
vi. Rounding out the dimension table
vii. Choosing the duration of the database
viii. The need to track slowly changing dimensions
ix. Deciding the query priorities and the query modes
Technical Considerations: A number of technical issues are to be considered when designing and implementing a data warehouse environment. These issues include:
The hardware platform that would house the data warehouse
The database management system that supports the warehouse database
The communication infrastructure that connects the warehouse, data marts, operational systems, and end users
The hardware platform and software to support the meta data repository
The systems management framework that enables centralized management and administration of the entire environment.
Implementation Considerations: A data warehouse cannot simply be bought and installed – its implementation requires the integration of many products within a data warehouse.
1. Access tools
6) a.)What is Meta data? Classify Meta data and explain the same
Meta data is data about data that describes the data warehouse. It is used for building,
maintaining, managing and using the data warehouse.
Meta data can be classified into:
Technical Meta data, which contains information about warehouse data for use by
warehouse designers and administrators when carrying out warehouse development
and management tasks.
Business Meta data, which contains information that gives users an easy-to-
understand perspective of the information stored in the data warehouse.
Equally important, Meta data provides interactive access to users to help
understand content and find data. One of the issues dealing with Meta data relates to
the fact that many data extraction tool capabilities to gather Meta data remain fairly
immature. Therefore, there is often the need to create a Meta data interface for users,
which may involve some duplication of effort.
Meta data management is provided via a Meta data repository and
accompanying software. Meta data repository management software, which typically
runs on a workstation, can be used to map the source data to the target database;
generate code for data transformations; integrate and transform the data; and control
moving data to the warehouse.
As user's interactions with the data warehouse increase, their approaches to
reviewing the results of their requests for information can be expected to evolve from
relatively simple manual analysis for trends and exceptions to agent-driven initiation
of the analysis based on user-defined thresholds.
The definition of these thresholds, configuration parameters for the software
agents using them, and the information directory indicating where the appropriate
sources for the information can be found are all stored in the Meta data repository as
well.
Accommodating source data definition changes
The data sourcing, cleanup, extract, transformation and migration tools have to deal
with some significant issues including:
Database heterogeneity. DBMSs are very different in data models, data access
language, data navigation, operations, concurrency, integrity, recovery etc.
Data heterogeneity. This is the difference in the way data is defined and used in
different models - homonyms, synonyms, unit compatibility (U.S. vs metric),
different attributes for the same entity and different ways of modeling the same fact.
These tools can save a considerable amount of time and effort. However, significant
shortcomings do exist. For example, many available tools are generally useful for
simpler data extracts. Frequently, customized extract routines need to be developed
for the more complicated data extraction procedures
UNIT II
PART A
A data cube is a set of data that is usually constructed from a subset of a data
warehouse and is organized and summarized into a multidimensional structure defined
by a set of dimensions and measures.
A data cube for the highest level of abstraction is the apex cuboid.
A data cube for the lowest level of abstraction is the base cuboid.
MOLAP vs ROLAP
MOLAP (multidimensional OLAP) tools utilize a pre-calculated data set; ROLAP (relational OLAP) tools do not use a pre-calculated data set.
MOLAP tools feature very fast response and the ability to quickly write back data into the data set; ROLAP tools feature the ability to ask any question (you are not limited to the contents of a cube) and the ability to drill down to the lowest level of detail in the database.
The most common examples of MOLAP tools are Hyperion (Arbor) Essbase and Oracle (IRI) Express; the most common examples of ROLAP tools are MicroStrategy and Sterling (Information Advantage).
OLTP vs OLAP
Source of data – OLTP: Operational data; OLTPs are the original source of the data. OLAP: Consolidation data; OLAP data comes from the various OLTP databases.
Purpose of data – OLTP: To control and run fundamental business tasks. OLAP: To help with planning, problem solving, and decision support.
What the data reveals – OLTP: A snapshot of ongoing business processes. OLAP: Multi-dimensional views of various kinds of business activities.
Inserts and updates – OLTP: Short and fast inserts and updates initiated by end users. OLAP: Periodic long-running batch jobs refresh the data.
A multidimensional database (MDB) is a type of database that is optimized for data warehouse and online analytical processing (OLAP) applications. Multidimensional databases are frequently created using input from existing relational databases.
11.) List OLAP guidelines. .(Nov/Dec 2016)
Multidimensional conceptual view
Transparency
Accessibility
Consistent reporting performance
Client/server architecture
Generic Dimensionality
Dynamic sparse matrix handling
Multi-user support
Unrestricted cross-dimensional operations
Intuitive data manipulation
Flexible reporting
Unlimited Dimensions and aggregation levels
The most comprehensive developments in computing have been the Internet and data warehousing, so the integration of these two technologies is a necessity. The advantages of using the Web for access are inevitable (Reference 3). These advantages are:
The Web allows users to store and manage data and applications on
servers that can be managed, maintained and updated centrally.
PART B
Catalogs
Impromptu stores metadata in subject-related folders. This metadata is called a catalog. The catalog does not contain any data. It just contains information about connecting to the database and the fields that will be accessible for reports – that is, what will be used to develop a query for a report. The metadata set is stored in a file.
A catalog contains:
Folders—meaningful groups of information representing columns from one or more
tables
Columns—individual data elements that can appear in one or more folders
Calculations—expressions used to compute required values from existing data
Conditions—used to filter information so that only a certain type of information is
displayed
Prompts—pre-defined selection criteria prompts that users can include in reports they
create
Other components, such as metadata, a logical database name, join information, and
user classes
Catalog can be used to
view, run, and print reports
export reports to other applications
disconnect from and connect to the database
create reports
change the contents of the catalog
add user classes
2) List and explain typical OLAP operations for multidimensional data with
suitable examples and diagrammatic illustrations. Dec’15, May’15, Dec’14,
Dec’13, May’13
Roll-up: The roll-up operation (also called the drill-up operation by some vendors)
performs aggregation on a data cube, either by climbing up a concept hierarchy for a
dimension or by dimension reduction.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions.
Slice and dice: The slice operation performs a selection on one dimension of the
given cube, resulting in a sub-cube.
Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data.
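The following Python/pandas sketch (with made-up sales data) mimics these operations on a small fact table; it is only an analogy for the cube operations, not an OLAP engine.

```python
import pandas as pd

# Hypothetical sales records with time, location and item dimensions.
df = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "city":    ["Chennai", "Mumbai", "Chennai", "Mumbai", "Chennai", "Mumbai"],
    "item":    ["PC", "PC", "Phone", "PC", "Phone", "Phone"],
    "sales":   [100, 80, 60, 90, 40, 70],
})

# Roll-up: aggregate away the item dimension (dimension reduction).
rollup = df.groupby(["quarter", "city"])["sales"].sum()

# Drill-down is the reverse: bring the item dimension back for more detail.
drilldown = df.groupby(["quarter", "city", "item"])["sales"].sum()

# Slice: select on a single dimension (quarter = Q1), giving a sub-cube.
slice_q1 = df[df["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = df[(df["quarter"] == "Q1") & (df["city"] == "Chennai")]

# Pivot (rotate): swap the row and column axes of the presentation.
pivot = df.pivot_table(index="city", columns="quarter", values="sales", aggfunc="sum")
print(rollup, drilldown, slice_q1, dice, pivot, sep="\n\n")
```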
OLAP OPERATIONS ON MULTI DIMENSIONAL DATA
Relational (OLTP) modeling vs Multidimensional modeling
Relational databases are optimized for On-Line Transaction Processing (OLTP). OLTP needs the ability to efficiently update data; this is provided in a normalized database that has each value stored only once. Multidimensional models are optimized for On-Line Analytical Processing (OLAP). OLAP needs the ability to retrieve data efficiently; efficient data retrieval requires a minimum number of joins, and this is provided by the simple structure of relationships in a multidimensional model, where each dimension table is only a single join away from the fact table.
6. Tables are units of relational data storage; cubes are units of multi-dimensional data storage.
7. Table fields of a particular data type store the actual data; dimensions and measures store the actual data.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are
optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when
the cube is created. Hence, complex calculations are not only doable, but they return
quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed
when the cube is built, it is not possible to include a large amount of data in the cube
itself. This is not to say that the data in the cube cannot be derived from a large
amount of data. Indeed, this is possible. But in this case, only summary-level
information will be included in the cube itself.
Requires additional investment: Cube technology is often proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology, chances are
additional investments in human and capital resources are needed.
Examples: Hyperion Essbase, Fusion (Information Builders)
ROLAP
This methodology relies on manipulating the data stored in the relational
database to give the appearance of traditional OLAP's slicing and dicing functionality.
In essence, each action of slicing and dicing is equivalent to adding a "WHERE"
clause in the SQL statement. Data stored in relational tables
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is
the limitation on data size of the underlying relational database. In other words,
ROLAP itself places no limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, relational
database already comes with a host of functionalities. ROLAP technologies, since
they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or
multiple SQL queries) in the relational database, the query time can be long if the
underlying data size is large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL statements do
not fit all needs (for example, it is difficult to perform complex calculations using
SQL), ROLAP technologies are therefore traditionally limited by what SQL can do.
ROLAP vendors have mitigated this risk by building into the tool out-of-the-box
complex functions as well as the ability to allow users to define their own functions.
Examples: MicroStrategy Intelligence Server, MetaCube (Informix/IBM)
5) Explain the data model suitable for Data warehouse with example. May’14
The three levels of data modeling,
Conceptual data model,
Logical data model,
Physical data model
Conceptual data model
A conceptual data model identifies the highest-level relationships between the different entities. Features of a conceptual data model include:
Includes the important entities and the relationships among them.
No attribute is specified.
No primary key is specified.
The figure below is an example of a conceptual data model.
Logical data model
A logical data model describes the data in as much detail as possible, without regard
to how they will be physical implemented in the database. Features of a logical data
model include:
Includes all entities and relationships among them.
All attributes for each entity are specified.
The primary key for each entity is specified.
Foreign keys (keys identifying the relationship between different entities) are
specified.
Normalization occurs at this level.
The steps for designing the logical data model are as follows:
2. Find the relationships between different entities.
3. Find all attributes for each entity.
4. Resolve many-to-many relationships.
5. Normalization.
The figure below is an example of a logical data model.
Comparing the logical data model shown above with the conceptual data
model diagram, we see the main differences between the two:
In a logical data model, primary keys are present, whereas in a conceptual data
model, no primary key is present.
In a logical data model, all attributes are specified within an entity. No attributes are specified in a conceptual data model.
Relationships between entities are specified using primary keys and foreign
keys in a logical data model. In a conceptual data model, the relationships are
simply stated, not specified, so we simply know that two entities are related,
but we do not specify what attributes are used for this relationship.
Physical data model
A physical data model represents how the model will be built in the database. A physical
database model shows all table structures, including column name, column data type,
column constraints, primary key, foreign key, and relationships between tables.
Comparing the physical data model shown above with the logical data model diagram,
we see the main differences between the two:
Entity names are now table names.
Attributes are now column names.
Data type for each column is specified. Data types can be different depending
on the actual database being used.
Feature | Conceptual | Logical | Physical
Entity Names | ✓ | ✓ |
Entity Relationships | ✓ | ✓ |
Attributes | | ✓ |
Primary Keys | | ✓ | ✓
Foreign Keys | | ✓ | ✓
Table Names | | | ✓
Column Names | | | ✓
We can see that the complexity increases from conceptual to logical to physical. This is why we always first start with the conceptual data model (so we understand at a high level what the different entities in our data are and how they relate to one another), then move on to the logical data model (so we understand the details of our data without worrying about how they will actually be implemented), and finally the physical data model (so we know exactly how to implement our data model in the database of choice). In a data warehousing project, sometimes the conceptual data model and the logical data model are considered as a single deliverable.
UNIT III
DATA MINING
Introduction – Data – Types of Data – Data Mining Functionalities – Interestingness of Patterns – Classification of Data Mining Systems – Data Mining Task Primitives – Integration of a Data Mining System with a Data Warehouse – Issues – Data Preprocessing.
PART A
A pattern is interesting if it is valid on new or test data with some degree of certainty, and potentially useful. Measures of pattern interestingness, either objective or subjective, can be used to guide the discovery process.
Pattern evaluation is to identify the truly interesting patterns representing knowledge, based on some interestingness measures.
Noise is a random error or variance in a measured variable. Data smoothing techniques are used for removing such noisy data.
4) What are the types of data? Dec’14
Data Type | Supported Content Types
Text | Cyclical, Discrete, Discretized, Key Sequence, Ordered, Sequence
Long | Continuous, Cyclical, Discrete, Discretized, Key, Key Sequence, Key Time, Ordered, Sequence, Time Classified
Boolean | Cyclical, Discrete, Ordered
Double | Continuous, Cyclical, Discrete, Discretized, Key, Key Sequence, Key Time, Ordered, Sequence, Time Classified
Date | Continuous, Cyclical, Discrete, Discretized, Key, Key Sequence, Key Time, Ordered
A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file systems. The heterogeneous databases in a legacy database may be connected by intra- or inter-computer networks.
Advantages:
Potential high returns on investment
Competitive advantage
Increased productivity of corporate decision-makers
More cost-effective decision-making
Better enterprise intelligence
Disadvantages:
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
High maintenance
Long-duration projects
Complexity of integration
PART B
1) List and discuss about the primitives involved in a data mining task. Dec’15,
May’15, Jun’14, Dec’13, May’13.,May ‘17
Primitives for specifying a data mining task
Task-relevant data
Database or data warehouse name
Database tables or data warehouse cubes
Conditions for data selection
Relevant attributes or dimensions
Data grouping criteria
Knowledge type to be mined
Characterization
Discrimination
Association/correlation
Classification/prediction
Clustering
Background knowledge
Concept hierarchies
User beliefs about relationships in the data
Pattern interestingness measures
Simplicity Certainty (e.g., confidence)
Utility (e.g., support) Novelty
Visualization of discovered patterns
Rules, tables, reports, charts, graphs, decision trees, and cubes
Drill-down and roll-up
A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of data mining task
primitives. These primitives allow the user to interactively communicate with the data
mining system during discovery in order to direct the mining process, or examine the
findings from different angles or depths.
2) What is concept hierarchy and data discretization? Explain how they are
useful for data mining. Dec’15 ,Dec ‘16Apr/May 2017
Concept hierarchies are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction.
Data discretization techniques can be used to reduce the number of values for a
given continuous attribute by dividing the range of the attribute into intervals.
Interval labels can then be used to replace actual data values. Replacing numerous
values of a continuous attribute by a small number of interval labels thereby reduces
and simplifies the original data. This leads to a concise, easy-to-use, knowledge-level
representation of mining results
Discretization techniques can be categorized based on how the discretization is
performed, such as whether it uses class information or which direction it proceeds
(i.e., top-down vs. bottom-up).
If the discretization process uses class information, then we say it is supervised
discretization. Otherwise, it is unsupervised. If the process starts by first finding one
or a few points (called split points or cut points) to split the entire attribute range, and
then repeats this recursively on the resulting intervals, it is called top-down
discretization or splitting. This contrasts with bottom-up discretization or merging,
which starts by considering all of the continuous values as potential split-points,
D
the attribute. Concept hierarchies can be used to reduce the data by collecting and
replacing low-level concepts (such as numerical values for the attribute age) with
A
higher-level concepts(such as youth, middle-aged, or senior).Al though detail is lost
C
S
by such data generalization, the generalized data may be more meaningful and easier
to interpret.
This contributes to a consistent representation of data mining results among multiple mining tasks, which is a common requirement. In addition, mining on a reduced data set requires fewer input/output operations and is more efficient than mining on a larger, ungeneralized data set. Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining as a preprocessing step, rather than during mining.
3) State and explain the various classification of data mining systems with
example. Dec’14, Dec’13, Dec’11, May’11
There are many data mining systems available or being developed. Some are
specialized systems dedicated to a given data source or are confined to limited data
mining functionalities, other are more versatile and comprehensive.
The data mining system can be classified according to the following criteria:
Database Technology Statistics
Machine Learning
Information Science
Visualization
Classification according to the type of data source mined: this classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, World Wide Web, etc.
Classification according to the data model drawn on: this classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional, etc.
Classification according to the kind of knowledge discovered: this classification
categorizes data mining systems based on the kind of knowledge discovered or data
mining functionalities, such as characterization, discrimination, association,
classification, clustering, etc. Some systems tend to be comprehensive systems
offering several data mining functionalities together.
Classification according to mining techniques used: Data mining systems employ
and provide different techniques. This classification categorizes data mining systems
according to the data analysis approach used such as machine learning, neural
networks, genetic algorithms, statistics, visualization, database oriented or data
warehouse-oriented, etc. The classification can also take into account the degree of
user interaction involved in the data mining process such as query-driven systems,
interactive exploratory systems, or autonomous systems.
present the mined knowledge to the user.
The various methods for handling the problem of missing values in data tuples
include:
Ignoring the tuple: This is usually done when the class label is missing (assuming
the mining task involves classification or description). This method is not very
effective unless the tuple contains several attributes with missing values. It is
especially poor when the percentage of missing values per attribute varies
considerably.
Filling in the missing value manually: In general, this approach is time-consuming and may not be a reasonable task for large data sets with many missing values, especially when the value to be filled in is not easily determined.
Using the attribute mean for quantitative (numeric) values or attribute mode for
categorical (nominal) values, for all samples belonging to the same class as the
given tuple:
For example, if classifying customers according to credit risk, replace the missing
value with the average income value for customers in the same credit risk category as
that of the given tuple.
Using the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using Bayesian formalism, or
decision tree induction.
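A short pandas sketch (the customer table and credit-risk classes are invented for illustration) of three of the strategies above: ignoring the tuple, filling with the attribute mean, and filling with the mean of the same class as the given tuple.

```python
import pandas as pd

# Hypothetical customer table with missing income values.
df = pd.DataFrame({
    "risk":   ["low", "low", "high", "high", "low"],
    "income": [50000, None, 20000, None, 60000],
})

# Ignoring the tuple: drop rows with a missing value.
dropped = df.dropna()

# Fill with the overall attribute mean.
overall = df.assign(income=df["income"].fillna(df["income"].mean()))

# Fill with the mean of the same class (credit-risk category) as the given tuple.
by_class = df.assign(
    income=df.groupby("risk")["income"].transform(lambda s: s.fillna(s.mean()))
)
print(by_class)
```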
Binning methods: Binning methods smooth a sorted data value by consulting the
neighborhood", or values around it. The sorted values are distributed into a number of
'buckets', or bins. Because binning methods consult the neighborhood of values, they
perform local smoothing.
In this technique:
Smoothing by bin means: Each value in the bin is replaced by the mean value of the
bin.
Smoothing by bin medians: Each value in the bin is replaced by the bin median.
Smoothing by bin boundaries: The min and max values of a bin are identified as the bin
boundaries. Each bin value is replaced by the closest boundary value.
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins (depth of 4, since each bin contains four values):
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, 9, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value.
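A small Python sketch that reproduces the partitioning and smoothing above (rounding the bin means to whole dollars is an assumption made to match the values shown):

```python
# Equi-depth binning and three smoothing variants on the price data above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

means      = [[round(sum(b) / len(b)) for _ in b] for b in bins]
medians    = [[sorted(b)[len(b) // 2] for _ in b] for b in bins]  # upper median for even-sized bins
boundaries = [[min(b) if abs(v - min(b)) <= abs(v - max(b)) else max(b) for v in b] for b in bins]

print(means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```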
Suppose that the data for analysis include the attribute age. The age values for
the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25,
25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3.
Illustrate your steps.
The following steps are required to smooth the above data using smoothing by bin
means with a bin depth of 3.
Step 1: Sort the data. (This step is not required here as the data are already sorted.)
- A database or data warehouse may store terabytes of data.
- Complex data analysis or mining will take a long time to run on the complete data set.
- Data reduction obtains a reduced representation of the complete data set.
- It produces the same (or almost the same) mining / analytical results as the original.
- The goal of attribute subset selection is to find the minimum set of Attributes
such that the resulting probability distribution of data classes is as close as
possible to the original distribution obtained using all Attributes.
- This will help to reduce the number of patterns produced and those
patterns will be easy to understand
Heuristic Methods: (Due to exponential number of attribute choices)
- Step wise forward selection
- Step wise backward elimination
- Combining forward selection and backward elimination
- Decision Tree induction - Class 1 - A1, A5, A6; Class 2 - A2, A3, A4
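A rough Python sketch of the stepwise forward selection heuristic listed above, using scikit-learn; the data set (iris), the classifier and the stopping rule are illustrative assumptions, not part of the text.

```python
# Greedy (stepwise) forward selection: add the attribute that most improves
# cross-validated accuracy, stopping when no candidate helps any more.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected = []

while remaining:
    scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    if selected and scores[best] <= cross_val_score(
            DecisionTreeClassifier(random_state=0), X[:, selected], y, cv=5).mean():
        break  # adding the best remaining attribute no longer improves the score
    selected.append(best)
    remaining.remove(best)

print("Selected attribute indices:", selected)
```

Stepwise backward elimination works the same way in reverse, starting from all attributes and repeatedly removing the least useful one.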
6) Describe in detail data mining functionalities and the different kinds of patterns
can be mined. (May ’17)
Data Mining Functionalities
Data mining is the process of discovering interesting knowledge from large amounts of
data stored in databases, data warehouses or other information repositories.
Specifications or queries are used to focus the search for interesting patterns. Each discovered pattern is measured for its "trustworthiness" based on the data in the database.
1) Characterization:
o Concept description / Class description
o Data is associated with a concept or class
o Eg. Classes of items for sale – (i) Computers (ii) Printers
o Eg. Concepts of customers – (i) Big Spenders (ii) Budget Spenders
o Class / Concept descriptions can be delivered via:
(1) Data Characterization
(2) Data Discrimination
(3) Data Characterization and Data Discrimination
Data Characterization
Summarization based on simple statistical measures
Data summarization along a dimension – user controlled – OLAP
rollup
Attribute oriented induction – without user interaction.
Output of data characterization can be represented in different forms:
Pie Charts, Bar Charts, Curves
Multidimensional Data cubes, Multidimensional tables
Generalized relations – in rule form – called "Characteristic Rules"
Eg. "Find a summarization of the characteristics of customers who spend more than Rs. 50000 in shop S1 in a year"
Result = "Customers are 40–50 years old, employed and have a high credit rating"
Users can drill down on any dimension – Eg. "Occupation of customers"
Data Discrimination
Comparison of general features of target class data objects with
the general features of objects from one or a set of contrasting
classes.
Target and Contrasting classes are specified by the users
Data objects retrieved through database queries
Eg. "Users want to compare the general features of S/W products whose sales increased by 10% in the last year with those whose sales decreased by 30% during the same period."
The output of data discrimination is the same as the output of data characterization. The rule form is called "Discriminant Rules".
Eg. Compare two groups of customers.
Group1 – Shops frequently – at least 2 times a month
Group 2 – Shops rarely – less than 3 times a year
Result = "80% of frequent shopping customers are between 20-40 years old & have university education." & "60% of infrequent shopping customers are seniors or youths with no university degree."
Users can drill down on income level dimension for better
discriminative features between the two classes of
customers.
2) Mining Frequent Patterns, Associations and Correlations:
Frequent Patterns – Patterns that occur frequently in data.
o Many kinds of Frequent Patterns exist:
(1) Itemsets (2) Subsequences (3) Structured patterns
Frequent Itemsets:
Set of items that frequently appear together in a database.
Eg. Bread & Jam
Frequent Subsequences: (Advanced)
Frequent sequential patterns
Eg. Purchase PC -> Purchase Digital Camera -> Memory Card
Frequent Structured Patterns: (Advanced)
Structural forms that occur frequently
Structural forms – Graphs, Trees, Lattices
Result = Discovery of interesting associations and correlations within data.
o Eg. Association Analysis:
Example 1: "Find which items are frequently purchased together in the same transactions."
Buys (X, "Computer") => Buys (X, "Software") [Support = 1%, Confidence = 50%]
Such association rules are called "Single-Dimensional Association Rules".
Example 2:
Age (X, "20…29") ^ Income (X, "20K…29K") => Buys (X, "CD Player") [Support = 2%, Confidence = 60%]
Association Rule = "2% of the total customers in the database are between 20-29 years of age, with income Rs.20000 to Rs.29000, and have purchased a CD player." & "There is a 60% probability that a customer in this age group and income group will purchase a CD player."
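A tiny Python sketch (with invented transactions) showing how the support and confidence of a rule such as buys(X, "Computer") => buys(X, "Software") are computed:

```python
# Support = fraction of all transactions containing both items;
# confidence = of the transactions containing the antecedent, the fraction that
# also contain the consequent.
transactions = [
    {"Computer", "Software", "Printer"},
    {"Computer", "Software"},
    {"Computer"},
    {"Printer"},
    {"Software"},
]

both     = sum(1 for t in transactions if {"Computer", "Software"} <= t)
computer = sum(1 for t in transactions if "Computer" in t)

support    = both / len(transactions)
confidence = both / computer
print(f"support = {support:.0%}, confidence = {confidence:.0%}")  # support = 40%, confidence = 67%
```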
3) Classification and Prediction:
Decision Trees
Each Node = Test on an attribute value
Each Branch = Outcome of the test
Tree Leaves = Classes or class distributions
Decision trees can be converted into classification rules
Neural Networks
Collection of neuron-like processing units + weighted connections between the units.
Other methods of Classification
i.
Naïve Bayesian Classification
ii. Support Vector Machines
iii. K-nearest neighbor Classification
Classification is used to predict missing or unavailable numeric data values =>
Prediction.
Regression Analysis:- is a statistical methodology used for
numeric prediction.
Prediction also includes distribution trends based on the available
data.
Classification and Prediction may be preceded by Relevance Analysis.
Relevance Analysis:- attempts to identify attributes that do not
contribute to the classification or prediction process which can
be excluded.
Example – Classification and Prediction:
1) IF-THEN rules – Classification Model:
2) A Decision Tree – Classification Model:
3) A Neural Network Classification Model:
4) Cluster Analysis:
Analyzes data objects without consulting a known class label.
The objects are clustered or grouped based on the principles of "maximizing the intra-class similarity" and "minimizing the inter-class similarity".
Objects within a cluster have high similarity compared to objects in other clusters.
Each cluster formed is a class of objects.
From this class of objects, rules can be derived.
Clustering allows "Taxonomy Formation" – a hierarchy of classes that groups similar events together.
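A minimal clustering sketch with scikit-learn's k-means (the customer measurements are invented); the algorithm groups the objects purely from their features, without any class labels:

```python
from sklearn.cluster import KMeans
import numpy as np

# Hypothetical customers described by (age, annual spend in thousands).
X = np.array([[22, 15], [25, 18], [24, 16],    # one natural group
              [48, 60], [52, 65], [50, 58]])   # another natural group

# Objects in the same cluster receive the same label, e.g. [0 0 0 1 1 1].
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```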
5) Outlier Analysis:
Data that do not comply with the general behavior of the data are called Outliers.
o Most data mining methods discard outliers as noise or exceptions.
o In some applications, like fraud detection, rare events can be more interesting than regular ones.
o Analysis of such outliers is called Outlier Analysis / Outlier Mining.
o Outliers are detected using:
i. Statistical Methods
ii. Distance Measures
iii. Deviation-Based Methods
iv. Difference in characteristics of an object in a group
o Example – Outlier Analysis:
Fraudulent usage of credit cards can be detected by identifying purchases of extremely large amounts for a given credit card account compared to its general charges incurred. The same applies to the type of purchase, place of purchase, and frequency of purchase.
6) Evolution Analysis:
Describes the trends of data whose behavior changes over time.
This step includes:
i. Characterization & Discrimination
ii. Association & Correlation Analysis
iii. Classification & Prediction
iv. Clustering of time-related data
v. Time-series data analysis
vi. Sequence or periodicity pattern matching
vii. Similarity-based data analysis
o Example – Evolution Analysis:
i. Stock exchange data for the past several years is available.
ii. You want to invest in TATA Steel Corp.
A data mining study / evolution analysis on previous stock exchange data can help in the prediction of future trends in stock exchange prices. This will help in decision making for stock investment.
UNIT IV
PART A
Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related.
2) What is tree pruning? Dec’14, May’13
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods identify and remove branches that reflect noise or outliers.
Tree pruning approaches: • Pre-pruning • Post-pruning
3) What are eager learners & lazy learners? Give example. Dec’15, Dec’14,May ‘17
Lazy learning (e.g., instance-based learning):
Simply stores training data (or only minor processing) and waits until it is given a
test tuple
less time in training but more time in predicting
Lazy method effectively uses a richer hypothesis space since it uses many local
linear functions to form an implicit global approximation to the target function
Eager learning:
Given a set of training tuples, constructs a classification model before receiving
new data to classify
Eager learning commits to a single hypothesis that covers the entire instance space.
4) How does prediction differ from classification in data mining? May’14, May ‘17
A classification problem could be seen as a predictor of classes.
Predicted values are usually continuous whereas classifications are discrete.
Predictions are often (but not always) about the future whereas classifications are about the present.
Classification is more concerned with the input than the output.
Prediction: It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used for identification of distribution trends based on available data.
Classification: It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are well known.
Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly; the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new data.
5) What is decision tree? Mention two phases in decision tree induction.( Nov/Dec
2016)
A decision tree is a graph that uses a branching method to illustrate every possible outcome of a decision. Decision tree software is used in data mining to simplify complex strategic challenges and evaluate the cost-effectiveness of research and business decisions.
The two phases are decision tree construction and tree pruning.
6) Distinguish between classification and clustering. May’15
Clustering and classification can seem similar because both data mining algorithms
divide the data set into subsets, but they are two different learning techniques, used in
data mining for the purpose of getting reliable information from a collection of raw data.
Clustering vs Classification
Definition – Clustering: an unsupervised learning technique used to group similar instances on the basis of features. Classification: a supervised learning technique used to assign predefined tags to instances on the basis of features.
In clustering, the data set is split into subsets with similar features; in classification, new data is classified according to the observations of the training set.
Labels – Clustering: there are no labels in clustering. Classification: there are labels for some points.
separating hyperplane (that is, a "decision boundary" separating the tuples of one class from another).
Market basket analysis analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets". The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers.
For instance, if customers are buying milk, how likely are they to also buy bread
(and what kind of bread) on the same trip to the supermarket? Such information
can lead to increased sales by helping retailers do selective marketing and plan
their shelf space.
PART B
1) Apriori algorithm for discovering frequent item set. Dec’15, may’15, Dec’14,
May’14, Dec’13, May’13, Dec’11, May’11
Discuss the single dimensional Boolean association rule mining for transaction
database. May ‘17
The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation
Apriori property: if an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent; that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either; that is, P(I ∪ A) < min_sup.
Example of Apriori algorithm
For simplicity:
M = Mango
O = Onion
and so on…
Original table:
Transaction ID | Items Bought
T1 | {M, O, N, K, E, Y}
T2 | {D, O, N, K, E, Y}
T3 | {M, A, K, E}
T4 | {M, U, C, K, Y}
T5 | {C, O, O, K, I, E}
Step 1: Count the number of transactions in which each item occurs. Note 'O = Onion' is bought 4 times in total, but it occurs in just 3 transactions.
Item | No. of transactions
M | 3
O | 3
N | 2
K | 5
E | 4
Y | 3
D | 1
A | 1
U | 1
C | 2
I | 1
Step 2: Now remember we said the item is said frequently bought if it is bought at least 3
times. So in this step we remove all the items that are bought less than 3 times from the
above table and we are left with
Item Number of transactions
M 3
O 3
K 5
E 4
Y 3
These are the single items that are bought frequently. Now let's say we want to find pairs of items that are bought frequently. We continue from the above table (the table in Step 2).
Step 3: We start making pairs from the first item, like MO, MK, ME, MY, and then we start with the second item, like OK, OE, OY. We did not do OM because we already did MO when we were making pairs with M, and buying a Mango and Onion together is the same as buying Onion and Mango together. After making all the pairs we get:
Item pairs
MO
MK
ME
MY
OK
OE
OY
KE
KY
EY
Step 4: Now we count how many times each pair is bought together. For example, M and O is bought together only once, in {M, O, N, K, E, Y}, while M and K is bought together 3 times, in {M, O, N, K, E, Y}, {M, A, K, E} and {M, U, C, K, Y}.
After doing that for all the pairs we get:
Item Pairs | Number of transactions
MO | 1
MK | 3
ME | 2
MY | 2
OK | 3
OE | 3
OY | 2
KE | 4
KY | 3
EY | 2
Step 5: Golden rule to the rescue. Remove all the item pairs with number of transactions
less than three and we are left with
Item Pairs Number of transactions
MK 3
OK 3
OE 3
KE 4
KY 3
Step 6: To make the set of three items we need one more rule (it's termed self-join). It simply means: from the item pairs in the above table, we find two pairs with the same first alphabet, so we get
· OK and OE, this gives OKE
· KE and KY, this gives KEY
Then we find how many times O,K,E are bought together in the original table and same
for K,E,Y and we get the following table
Item Set | Number of transactions
OKE | 3
KEY | 2
While we are on this, suppose you have sets of 3 items say ABC, ABD, ACD, ACE,
BCD and you want to generate item sets of 4 items you look for two sets having the same
first two alphabets.
· ABC and ABD -> ABCD
· ACD and ACE -> ACDE
And so on … In general you have to look for sets having just the last alphabet/item
different.
Step 7: So we again apply the golden rule, that is, the item set must be bought together at least 3 times, which leaves us with just OKE, since KEY is bought together just two times.
Thus the set of three items that are bought together most frequently are O,K,E.
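The counting in the steps above can be sketched in a few lines of Python. This is not a full Apriori implementation, and its candidate 3-itemsets come from a self-join on sorted pairs (so the intermediate candidates differ slightly from the alphabet grouping used above), but the frequent itemsets it reports match the worked example.

```python
from itertools import combinations

transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
min_count = 3  # an itemset is frequent if it appears in at least 3 transactions

def frequent(candidates):
    # Count each candidate itemset and keep those meeting the minimum support count.
    counts = {c: sum(1 for t in transactions if set(c) <= t) for c in candidates}
    return {c for c, n in counts.items() if n >= min_count}

items = sorted({i for t in transactions for i in t})
L1 = frequent([(i,) for i in items])                      # frequent 1-itemsets
L2 = frequent([p for p in combinations(sorted(x for (x,) in L1), 2)])  # frequent pairs
# Self-join: candidate 3-itemsets from pairs that share their first item.
L3 = frequent([tuple(sorted(set(a) | set(b))) for a in L2 for b in L2
               if a < b and a[0] == b[0]])
print(L1, L2, L3, sep="\n")   # L3 contains only ('E', 'K', 'O'), i.e. O, K, E
```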
Bayesian classifiers exhibit high accuracy and speed when applied to large databases. One type of Bayesian classification is Naïve Bayesian classification. This method has performance comparable to decision tree induction. It is based on the assumption that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called "Class Conditional Independence". Another type of Bayesian classification is Bayesian belief networks. Bayesian classifiers have the minimum error rate when compared to all other classifiers.
Bayes Theorem:
Then, we have to determine P(H/X) = Probability that the hypothesis H holds for the data
sample X
Ex: Probability that any given data sample is an apple regardless of its colour and shape.
P(X) is constant for all classes. Hence only P(X/Ci) &P(Ci) need to be maximized.
Here P(Ci) = si/s; Where si = Samples in class Ci; s = Total number of samples in all
classes.
To evaluate P(X/Ci) we use the naïve assumption of "Class Conditional Independence".
Hence P(X/Ci) = ∏k P(xk/Ci),
Where P(xk/Ci) = sik/si; Where sik = No. Of samples in Class Ci that has value = xk for
Ak;
si = No. Of samples in class Ci.
Evaluate P(X/Ci) P(Ci) for each class Ci.
X is assigned to the class Ci for which P(X/Ci) P(Ci) is maximum.
Example – Predicting a class label using Naïve Bayesian Classification:
We need to maximize P(X/Ci) P(Ci) for i = 1,2; So, Compute P(C1) & P(C2)
- P(buys_computer = yes) = 9/14 = 0.643 P(buys_computer = no) = 5/14 =
0.357
Next, Compute P(X/C1) & P(X/C2)
P(age = "<30" / buys_computer = yes) = 2/9 = 0.222
P(income = medium / buys_computer = yes) = 4/9 = 0.444
P(student = yes / buys_computer = yes) = 6/9 = 0.667
P(credit-rating = fair / buys_computer = yes) = 6/9 = 0.667
P(age = "<30" / buys_computer = no) = 3/5 = 0.600
P(income = medium / buys_computer = no) = 2/5 = 0.400
P(student = yes / buys_computer = no) = 1/5 = 0.200
P(credit-rating = fair / buys_computer = no) = 2/5 = 0.400
Hence P(X/buys_computer = yes) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044
P(X/buys_computer = no) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019
Finally, P(X/buys_computer = yes) P(buys_computer = yes) = 0.044 * 0.643 = 0.028
P(X/buys_computer = no) P(buys_computer = no) = 0.019 * 0.357 = 0.007
Hence the Naïve Bayesian classifier predicts buys_computer = yes for the sample X.
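A short Python sketch that replays the arithmetic above using Bayes' theorem, P(Ci|X) ∝ P(X|Ci)·P(Ci), under the class-conditional independence assumption (the probability values are the ones computed in the example):

```python
# Tuple X = {age < 30, income = medium, student = yes, credit_rating = fair}.
priors = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],  # P(age<30|yes), P(income=med|yes), P(student|yes), P(credit=fair|yes)
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}

scores = {}
for c in priors:
    p = priors[c]
    for lk in likelihoods[c]:
        p *= lk                            # P(X|Ci) * P(Ci) under independence
    scores[c] = p

print(scores)                              # {'yes': ~0.028, 'no': ~0.007}
print("predicted class:", max(scores, key=scores.get))   # yes
```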
they determine how the tuples at a given node are to be split. The attribute selection
measure provides a ranking for each attribute describing the given training tuples. The
attribute having the best score for the measure is chosen as the splitting attribute for the
given tuples.
If the splitting attribute is continuous-valued or if we are restricted to binary trees
then, respectively, either a split point or a splitting subset must also be determined as part
of the splitting criterion.
The tree node created for partition D is labeled with the splitting criterion,
branches are grown for each outcome of the criterion, and the tuples are partitioned
accordingly. This section describes three popular attribute selection measures—
information gain, gain ratio, and Gini index.
Let D, the data partition, be a training set of class-labeled tuples. Suppose the class label
attribute has m distinct values defining m distinct classes, Ci (for i = 1, ..., m). Let Ci,D
be the set of tuples of class Ci in D. Let |D| and |Ci,D| denote the number of tuples in D
and Ci,D, respectively.
Information gain
The expected information needed to classify a tuple in D is given by
Info(D) = − Σ (i = 1 to m) pi log2(pi)
where pi is the probability that an arbitrary tuple in D belongs to class Ci and is
estimated by |Ci,D| / |D|. A log function to the base 2 is used, because the information is
encoded in bits. Info(D) is just the average amount of information needed to identify the
class label of a tuple in D. Note that, at this point, the information we have is based solely
on the proportions of tuples of each class. Info(D) is also known as the entropy of D.
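For illustration only, the entropy and information-gain formulas can be coded directly; the tiny data set and all names below are hypothetical:

# Illustrative sketch: computing Info(D) (entropy) and the information gain of
# a candidate splitting attribute, following the formula above.
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the classes present in D."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - sum(|D_j|/|D| * Info(D_j)) over the partitions induced by A."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

# Toy data: one attribute column ("age band") and a class label per tuple.
rows = [("<30",), ("<30",), ("31..40",), (">40",), (">40",)]
labels = ["no", "no", "yes", "yes", "no"]
print(info(labels), info_gain(rows, labels, 0))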
4) Explain as to how neural networks are used for classification of data. Dec’13
The backpropagation algorithm performs learning on a multilayer feed-forward
neural network. It iteratively learns a set of weights for prediction of the class label of
tuples. A multilayer feed-forward neural network consists of an input layer, one or more
hidden layers, and an output layer. An example of a multilayer feed-forward network is
shown in Figure 1.
Each layer is made up of units. The inputs to the network correspond to the
attributes measured for each training tuple. The inputs are fed simultaneously into the
units making up the input layer. These inputs pass through the input layer and are then
weighted and fed simultaneously to a second layer of ―neuronlike‖ units, known as a
hidden layer. The outputs of the hidden layer units can be input to another hidden layer,
and so on.
The number of hidden layers is arbitrary, although in practice, usually only one is
used. The weighted outputs of the last hidden layer are input to units making up the
output layer, which emits the network's prediction for given tuples.
The units in the input layer are called input units. The units in the hidden layers
and output layer are sometimes referred to as neurodes, due to their symbolic biological
basis, or as output units. The multilayer neural network shown in Figure 1 has one hidden
layer and one layer of output units; therefore, we say that it is a two-layer neural network. (The input
layer is not counted because it serves only to pass the input values to the next layer.)
Similarly, a network containing two hidden layers is called a three-layer neural
network, and so on. The network is feed-forward in that none of the weights cycles
back to an input unit or to an output unit of a previous layer. It is fully connected in
that each unit provides input to each unit in the next forward layer.
Input:
D, a data set consisting of the training tuples and their associated target values;
l, the learning rate;
network, a multilayer feed-forward network.
Output: A trained neural network.
Method:
(1) Initialize all weights and biases in network;
(2) while terminating condition is not satisfied {
(3)   for each training tuple X in D {
(4)     // Propagate the inputs forward:
(5)     for each input layer unit j {
(6)       Oj = Ij; } // the output of an input unit is its actual input value
(7)     for each hidden or output layer unit j {
(8)       Ij = Σi wij Oi + θj; // compute the net input of unit j with respect to the previous layer, i
(9)       Oj = 1 / (1 + e^(−Ij)); } // compute the output of each unit j
(10)    // Backpropagate the errors:
(11)    for each unit j in the output layer
(12)      Errj = Oj (1 − Oj)(Tj − Oj); // compute the error
(13)    for each unit j in the hidden layers, from the last to the first hidden layer
(14)      Errj = Oj (1 − Oj) Σk Errk wjk; // compute the error with respect to the next higher layer, k
(15)    for each weight wij in network {
(16)      Δwij = (l) Errj Oi; // weight increment
(17)      wij = wij + Δwij; } // weight update
(18)    for each bias θj in network {
(19)      Δθj = (l) Errj; // bias increment
(20)      θj = θj + Δθj; } } } // bias update
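A minimal NumPy sketch of one forward-and-backward pass, assuming sigmoid units and a made-up 2-3-1 network, may help relate the algorithm steps to actual computations (this is our own illustration, not part of the notes):

# Minimal sketch: one gradient step of backpropagation for a tiny 2-3-1
# feed-forward network with sigmoid units.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output weights and biases
l = 0.5                                         # learning rate

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.0, 1.0]])   # one training tuple
T = np.array([[1.0]])        # its target value

# Propagate the inputs forward
O1 = sigmoid(X @ W1 + b1)    # hidden layer outputs
O2 = sigmoid(O1 @ W2 + b2)   # output layer outputs

# Backpropagate the errors
Err2 = O2 * (1 - O2) * (T - O2)          # output units, step (12)
Err1 = O1 * (1 - O1) * (Err2 @ W2.T)     # hidden units, step (14)

# Update weights and biases following the update rules in steps (15)-(20)
W2 += l * O1.T @ Err2;  b2 += l * Err2.sum(axis=0)
W1 += l * X.T  @ Err1;  b1 += l * Err1.sum(axis=0)
print("output for this tuple:", O2)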
5) Explain how support vector machines (SVM) can be used for classification.
May’15, Dec’11
Support Vector Machines, a promising new method for the classification of
both linear and nonlinear data. In a nutshell, a support vector machine (or SVM) is an
algorithm that works as follows. It uses a nonlinear mapping to trans-form the original
training data into a higher dimension. Within this new dimension, it searches for the
linear optimal separating hyperplane (that is, a ―decision boundary‖ separating the
tuples of one class from another).
The SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by
the support vectors).
Getting to an informal definition of margin, we can say that the shortest distance from a
hyperplane to one side of its margin is equal to the shortest distance from the
hyperplane to the other side of its margin, where the "sides" of the margin are
parallel to the hyperplane. When dealing with the MMH (maximum marginal hyperplane), this distance is, in fact, the
shortest distance from the MMH to the closest training tuple of either class.
Once the data have been transformed into the new higher-dimensional space, the second
step searches for a linear separating hyperplane in the new space.
We again end up with a quadratic optimization problem that can be solved
using the linear SVM formulation. The maximal margin hyperplane found in the
new space corresponds to a nonlinear separating hypersurface in the original space.
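As an illustration only (the notes do not prescribe any particular library), a nonlinear SVM with a kernel can be trained on synthetic data as follows; scikit-learn, the data set, and all parameter choices here are assumptions of ours:

# Hedged sketch: training a kernel SVM classifier on toy 2-D data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Two ring-shaped classes that are not linearly separable in the original space
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.normal(1.0, 0.1, 100), rng.normal(3.0, 0.1, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="rbf", C=1.0)   # nonlinear mapping via the RBF kernel
clf.fit(X, y)
print("number of support vectors:", len(clf.support_vectors_))
print("prediction for a point near the inner ring:", clf.predict([[1.0, 0.0]]))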
A data mining process may uncover thousands of rules from a given set of data, most of
which end up being unrelated or uninteresting to the users. Often, users have a good
sense of which directionof mining may lead to interesting patterns and the form of the
patterns or rules they would like to find. Thus, a good heuristic is to have the users
specify such intuition or expectations as constraints to confine the search space. This
strategy is known as constraint-based mining. The constraints can include the following:
· Knowledge type constraints: specify the type of knowledge to be mined, such as association or correlation rules.
· Data constraints: specify the set of task-relevant data.
· Dimension/level constraints: specify the desired dimensions (attributes) of the data, or levels of the concept hierarchies, to be used in mining.
· Interestingness constraints: specify thresholds on statistical measures of rule interestingness, such as support and confidence.
· Rule constraints: specify the form of rules to be mined, for example as metarules (rule templates).
―How are metarules useful?‖ Metarules allow users to specify the syntactic form of
rules that they are interested in mining. The rule forms can be used as constraints to help
improve the efficiency of the mining process. Metarules may be based on the analyst‘s
experience, expectations, or intuition regarding the data or may be automatically
generated based on the database schema.
Metarule-guided mining:- Suppose that as a market analyst for AllElectronics, you have
access to the data describing customers (such as customer age, address, and credit rating)
as well as the list of customer transactions. You are interested in finding associations
between customer traits and the items that customers buy. However, rather than finding
all of the association rules reflecting these relationships, you are particularly interested
only in determining which pairs of customer traits promote the sale of office software. A metarule can be used to specify this
information describing the form of rules you are interested in finding. An example of
such a metarule is
P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "office software")
where P1 and P2 are predicate variables that are instantiated to attributes from the given
database during the mining process, X is a variable representing a customer, and Y and
W take on values of the attributes assigned to P1 and P2, respectively. Typically, a user
will specify a list of attributes to be considered for instantiation with P1 and P2.
Otherwise, a default set may be used.
Rule constraints specify expected set/subset relationships of the variables in the mined
rules, constant initialization of variables, and aggregate functions. Users typically employ
their knowledge of the application or data to specify rule constraints for the mining task.
These rule constraints may be used together with, or as an alternative to, metarule-
guided mining. In this section, we examine rule constraints as to how they can be used to
make the mining process more efficient. Let‘s study an example where rule constraints
are used to mine hybrid-dimensional association rules.
Our association mining query is to ―Find the sales of which cheap items (where the sum
of the prices is less than $100) may promote the sales of which expensive items (where
the minimum price is $500) of the same group for Chicago customers in 2004.‖ This can
be expressed in the DMQL data mining query language as follows,
UNIT V
PART A
1) Define Outlier & its applications. Dec’15, May’15, Dec’14, Dec’13, May’13,
Dec’11
A database may contain data objects that do not comply with the general behavior
or model of the data. These data objects are outliers. Most data mining methods discard
outliers as noise or exceptions. Applications – credit card fraud detection, crime
detection, medical diagnosis, etc.
Divisive: This is a "top down" approach: all observations start in one
cluster, and splits are performed recursively as one moves down the hierarchy.
4) Define Euclidean distance & Manhattan distance. May'15, May'11
Euclidean distance between two objects i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp):
d(i, j) = sqrt( (xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)² )
Manhattan (city block) distance:
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
5) Define CLARANS.
CLARANS (Clustering Large Applications based on RANdomized Search): to improve
the quality of CLARA we go for CLARANS. It draws a sample with some randomness in
each step of the search.
It overcomes the problem of scalability that K-Medoids suffers from.
database and shrinks them toward the center of the cluster by a specified fraction.
Obviously better in runtime but lacking in precision.
8) State the role of clustering. Dec’11 (Nov/Dec 2016)
Clustering is a process of grouping the physical or conceptual data object into
clusters.
Density-Based Spatial Clustering of Applications with Noise is called DBSCAN.
DBSCAN is a density-based clustering method that converts high-density regions of
objects into clusters with arbitrary shapes and sizes. DBSCAN defines a cluster as a
maximal set of density-connected points.
12) Give an example of outlier analysis for the library management system. (Nov/Dec
2016)
In library management systems, among set of books accessed frequently, if one
book is found to be accessed never, that is considered as an outlier. Likewise, a missing
book can also be considered as an outlier.
13) Give the reason on why clustering is needed in data mining?(Nov/Dec 2016)
Clustering is needed to identify set of similar data objects. The objects are similar
in terms of multiple dimensions. By considering only similar data objects for further
mining improves the interestingness of retrieved knowledge and accuracy
PART B
The most well-known and commonly used partitioning methods are k-means, k-medoids,
and their variations.
K means Clustering
Clustering is the process of partitioning a group of data points into a small number
of clusters. For instance, the items in a supermarket are clustered in categories (butter,
cheese and milk are grouped in dairy products). Of course this is a qualitative kind of
partitioning. A quantitative approach would be to measure certain features of the
products, say percentage of milk and others, and products with high percentage of milk
would be grouped together.
In general, we have n data points xi,i=1...nthat have to be partitioned in k clusters.
The goal is to assign a cluster to each data point. K-means is a clustering method that
aims to find the positions μi,i=1...k of the clusters that minimize the distance from the
data points to the cluster. K-means clustering solves
argmin_c Σ_{i=1..k} Σ_{x ∈ ci} d(x, μi) = argmin_c Σ_{i=1..k} Σ_{x ∈ ci} ||x − μi||²
where ci is the set of points that belong to cluster i. K-means clustering uses the square
of the Euclidean distance, d(x, μi) = ||x − μi||². This problem is not trivial (in fact
it is NP-hard), so the k-means algorithm can only hope to find a local optimum and may
get stuck in a solution that is not the global minimum.
K-means algorithm
The Lloyd's algorithm, mostly known as k-means algorithm, is used to solve the k-
means clustering problem and works as follows. First, decide the number of clusters k.
Then:
1. Choose k initial cluster centers (for example, k of the data points picked at random).
2. Assign each data point to its nearest cluster center.
3. Recompute each cluster center as the mean of the points assigned to it.
4. Repeat steps 2-3 until convergence (the cluster assignments no longer change).
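A compact sketch of Lloyd's algorithm in Python/NumPy is given below; it is our own illustration, and the sample coordinates are merely chosen to be consistent with the centroids in the worked example that follows (the original data table is not reproduced in these notes):

# Sketch of Lloyd's algorithm (k-means). Data, k and the random initialization
# are illustrative choices.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: pick k points
    for _ in range(n_iter):
        # step 2: assign every point to the nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # step 4: convergence
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])              # 7 individuals (A, B values)
labels, centroids = kmeans(X, k=2)
print(labels, centroids)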
This data set is to be grouped into two clusters. As a first step in finding a sensible
initial partition, let the A & B values of the two individuals furthest apart (using the
Euclidean distance measure), define the initial cluster means, giving:
            Individual    Mean Vector (centroid)
Group 1     1             (1.0, 1.0)
Group 2     4             (5.0, 7.0)
The remaining individuals are now examined in sequence and allocated to the
cluster to which they are closest, in terms of Euclidean distance to the cluster mean. The
mean vector is recalculated each time a new member is added. This leads to the following
series of steps:
                 Cluster 1                              Cluster 2
Step    Individual      Mean Vector (centroid)    Individual      Mean Vector (centroid)
1       1               (1.0, 1.0)                4               (5.0, 7.0)
2       1, 2            (1.2, 1.5)                4               (5.0, 7.0)
3       1, 2, 3         (1.8, 2.3)                4               (5.0, 7.0)
4       1, 2, 3         (1.8, 2.3)                4, 5            (4.2, 6.0)
5       1, 2, 3         (1.8, 2.3)                4, 5, 6         (4.3, 5.7)
6       1, 2, 3         (1.8, 2.3)                4, 5, 6, 7      (4.1, 5.4)
Now the initial partition has changed, and the two clusters at this stage have the
following characteristics:
            Individual      Mean Vector (centroid)
Cluster 1   1, 2, 3         (1.8, 2.3)
Cluster 2   4, 5, 6, 7      (4.1, 5.4)
But we cannot yet be sure that each individual has been assigned to the right
cluster. So, we compare each individual‘s distance to its own cluster mean and to
that of the opposite cluster.
And we find:
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than its
own (Cluster 1). In other words, each individual's distance to its own cluster mean
should be smaller than the distance to the other cluster's mean (which is not the case with
individual 3). Thus, individual 3 is relocated to Cluster 2, resulting in the new partition:

            Individual      Mean Vector (centroid)
Cluster 1   1, 2            (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7   (3.9, 5.1)
The iterative relocation would now continue from this new partition until no more
relocations occur. However, in this example each individual is now nearer its own cluster
mean than that of the other cluster and the iteration stops, choosing the latest partitioning
as the final cluster solution.
Also, it is possible that the k-means algorithm won't converge to a final solution. In this
case it would be a good idea to stop the algorithm after a pre-chosen maximum number
of iterations.
The advantages of hierarchical clustering come at the cost of lower efficiency.
Agglomerative and divisive hierarchical clustering
Hierarchical clustering methods can be further classified as agglomerative (bottom-up
merging) or divisive (top-down splitting).
A divisive method subdivides the cluster into smaller and smaller pieces, until each object forms a
cluster on its own or until it satisfies certain termination conditions, such as a desired
number of clusters being obtained or the diameter of each cluster being within a certain
threshold.
An agglomerative method, such as AGNES, starts by placing each object in its own cluster and then
merges these atomic clusters into larger and larger clusters according to the
similarity of the closest pair of data points belonging to different clusters. The cluster
merging process repeats until all of the objects are eventually merged to form one cluster.
In DIANA, all of the objects are used to form one initial cluster. The cluster is
split according to some principle, such as the maximum Euclidean distance between the
closest neighboring objects in the cluster. The cluster splitting process repeats until,
eventually, each new cluster contains only a single object.
In either agglomerative or divisive hierarchical clustering, the user can specify the
desired number of clusters as a termination condition.
A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering. It shows how objects are grouped together step by step. Figure 2
shows a dendrogram for the five objects presented in Figure 1, where l = 0 shows the five
objects as singleton clusters at level 0. At l = 1, objects a and b are grouped together to
form the first cluster, and they stay together at all subsequent levels. We can also use a
vertical axis to show the similarity scale between clusters. For example, when the
similarity of two groups of objects, {a, b} and {c, d, e}, is roughly 0.16, they are merged
together to form a single cluster.
Minimum distance: dmin(Ci, Cj) = min { |p − p′| : p ∈ Ci, p′ ∈ Cj }
Maximum distance: dmax(Ci, Cj) = max { |p − p′| : p ∈ Ci, p′ ∈ Cj }
Mean distance: dmean(Ci, Cj) = |mi − mj|, where mi and mj are the means of clusters Ci and Cj
When an algorithm uses the minimum distance, dmin(Ci, Cj), to measure the
distance between clusters, it is sometimes called a nearest-neighbor clustering
algorithm. More-over, if the clustering process is terminated when the distance between
nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm. If
we view the data points as nodes of a graph, with edges forming a path between the nodes
in a cluster, then the merging of two clusters, Ci and Cj, corresponds to adding an edge
between
the nearest pair of nodes in Ci and Cj. Because edges linking clusters always go between
distinct clusters, the resulting graph will generate a tree. Thus, an agglomerative
hierarchical clustering algorithm that uses the minimum distance measure is also called a
minimal spanning tree algorithm.
Hierarchical clustering dendrogram -- Figure 2
When an algorithm uses the maximum distance, dmax(Ci, Cj), to measure the
distance between clusters, it is sometimes called a farthest-neighbor clustering
algorithm. If the clustering process is terminated when the maximum distance between
nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm.
By viewing data points as nodes of a graph, with edges linking nodes, we can think of
each cluster as a complete subgraph, that is, with edges connecting all of the nodes in the
clusters. The distance between two clusters is determined by the most distant nodes in the
two clusters. Farthest-neighbor algorithms tend to minimize the increase in the diameter of
the clusters at each iteration. If the true clusters are rather compact
and approximately equal in size, the method will produce high-quality clusters.
Otherwise, the clusters produced can be meaningless.
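For illustration (not part of the original notes), SciPy's hierarchical clustering routines can compute single-linkage (nearest-neighbor) and complete-linkage (farthest-neighbor) clusterings directly; the five 2-D points below are invented stand-ins for objects a to e:

# Hedged sketch: agglomerative clustering with single and complete linkage.
# Assumes SciPy is installed; the data points are made up.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1],                 # a, b (close together)
              [3.0, 3.0], [3.1, 3.2], [3.3, 2.9]])    # c, d, e

Z_single = linkage(X, method="single")     # nearest-neighbor / minimum distance
Z_complete = linkage(X, method="complete") # farthest-neighbor / maximum distance

# Cut each dendrogram into two clusters and compare the assignments
print(fcluster(Z_single, t=2, criterion="maxclust"))
print(fcluster(Z_complete, t=2, criterion="maxclust"))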
Density-Based Methods
To discover clusters with arbitrary shape, density-based clustering methods have
been developed. These typically regard clusters as dense regions of objects in the data
space that are separated by regions of low density (representing noise). DBSCAN grows
clusters according to a density-based connectivity analysis. OPTICS extends DBSCAN to
produce a cluster ordering obtained from a wide range of parameter settings. DENCLUE
clusters objects based on a set of density distribution functions.
DBSCAN: A Density-Based Clustering Method Based on Connected Regions
with Sufficiently High Density
The neighborhood within a radius ε of a given object is called the ε-neighborhood of the
object.
An object is a core object if its ε-neighborhood contains at least a minimum number, MinPts, of objects.
An object p is density-reachable from object q with respect to ε and MinPts in a set of
objects, D, if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that pi+1 is
directly density-reachable from pi with respect to ε and MinPts, for 1 ≤ i ≤ n, pi ∈ D.
An object p is density-connected to object q with respect to ε and MinPts in a set of
objects, D, if there is an object o ∈ D such that both p and q are density-reachable from o
with respect to ε and MinPts.
Density reachability is the transitive closure of direct density reachability, and this
relationship is asymmetric. Only core objects are mutually density reachable. Density
connectivity, however, is a symmetric relation.
Of the labeled points, m, p, o, and r are core objects because each is in an ε-neighborhood
containing at least three points.
A density-based cluster is a set of density-connected objects that is maximal with
respect to density-reachability. Every object not contained in any cluster is considered to
be noise.
"How does DBSCAN find clusters?" DBSCAN searches for clusters by checking the ε-
neighborhood of each point in the database. If the ε-neighborhood of a point p contains
more than MinPts points, a new cluster with p as a core object is created.
iteratively collects directly density-reachable objects from these core objects, which may
involve the merge of a few density-reachable clusters. The process terminates when no
new point can be added to any cluster.
If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where
n is the number of database objects. Otherwise, it is O(n²). With appropriate settings of
the user-defined parameters ε and MinPts, the algorithm is effective at finding arbitrarily
shaped clusters.
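A minimal sketch of the cluster-expansion idea is shown below; it is our own unoptimized illustration, and eps, min_pts and the data are arbitrary choices:

# Minimal DBSCAN sketch: grow clusters from core objects by collecting
# density-reachable points.
import numpy as np

def dbscan(X, eps=0.5, min_pts=3):
    n = len(X)
    labels = np.full(n, -1)          # -1 means "noise / unassigned"
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def region_query(i):
        """Indices of all points inside the eps-neighborhood of point i."""
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = list(region_query(i))
        if len(neighbors) < min_pts:     # not a core object
            continue
        labels[i] = cluster_id           # start a new cluster from core object i
        while neighbors:
            j = neighbors.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(j)
                if len(j_neighbors) >= min_pts:      # j is also a core object
                    neighbors.extend(j_neighbors)
            if labels[j] == -1:
                labels[j] = cluster_id               # j is density-reachable from i
        cluster_id += 1
    return labels

X = np.vstack([np.random.default_rng(1).normal(0, 0.2, (20, 2)),
               np.random.default_rng(2).normal(3, 0.2, (20, 2))])
print(dbscan(X, eps=0.5, min_pts=3))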
There exist data objects that do not comply with the general behavior or model of the
data. Such data objects, which are grossly different from or inconsistent with the
remaining set of data, are called outliers. Many data mining algorithms try to minimize
the influence of outliers or eliminate them all together. This, however, could result in the
loss of important hidden information because one person‘s noise could be another
person‘s signal. In other words, the outliers may be of particular interest, such as in the
case of fraud detection, where outliers may indicate fraudulent activity.
Thus, outlier detection and analysis is an interesting data mining task, referred to as
outlier mining. It can be used in fraud detection, customized marketing, medical analysis
and similar applications.
4) Write short notes on Spatial data mining. May’15
Spatial Data Mining refers to the extraction of knowledge, spatial relationships or
other interesting patterns not explicitly stored in spatial databases. A spatial database
stores a large amount of space-related data, such as maps, preprocessed remote sensing or
medical imaging data, and VLSI chip layout data. Statistical spatial data analysis has
been a popular approach to analyzing spatial data and exploring geographic information.
The term ‗geostatistics‘ is often associated with continuous geographic space, whereas
the term ‗Spatial statistics‘ is often associated with discrete space.
Geo marketing
Remote sensing
Image database exploration
Medical imaging
Navigation
Traffic control
Environmental studies
Spatial Data Cube:
A non-spatial dimension is a dimension containing only non-spatial data, whose primitive-level data
and whose generalizations are non-spatial. A spatial-to-nonspatial dimension is a dimension
whose primitive-level data are spatial but whose generalization, starting at a certain high
level, becomes non-spatial. A spatial-to-spatial dimension is a dimension whose
primitive level and all of its high-level generalized data are spatial. Measures of a spatial
data cube may likewise be numerical or spatial.
Mining Spatial Association Rules:
For mining spatial associations related to the spatial predicate close_to, collect the
candidates that pass the minimum support threshold by applying certain rough spatial
evaluation algorithms, and then evaluate the relaxed spatial predicate g_close_to
(generalized close to), which covers a broader context that includes close_to, touch and
intersect.
Spatial Clustering methods:
Spatial data clustering identifies clusters, or densely populated regions, according to
some distance measurement in a large, multi dimensional data set.
Spatial Classification and Spatial Trend Analysis:
Spatial Classification analyzes spatial objects to derive classification schemes in
relevance to certain spatial properties. Example: Classify regions in a province into rich
Vs poor according to the average family income. Trend analysis detects changes with
time, such as the changes of temporal patterns in time-series data. Spatial trend analysis
replaces time with space and studies the trend of non-spatial or spatial data changing with
space. Example: Observe the trend of changes of the climate or vegetation with the
increasing distance from an ocean. Regression and correlation analysis methods are often
applied by utilization of spatial data structures and spatial access methods.
Mining Raster Databases:
Spatial database systems usually handle vector data that consists of points, lines,
polygons (regions) and their compositions, such as networks or partitions. Huge amounts
of space-related data are in digital raster forms such as satellite images, remote sensing
data and computer tomography.
Information Retrieval:
A field developed in parallel with database systems.
Information is organized into a large number of documents
Information retrieval problem:
locating relevant documents based on user input, such as keywords or example
documents.
Typical IR Systems:
Online library catalogs
Online document management systems
Information Retrieval Vs Database Systems:
Some DB problems are not present in IR, eg., update, transaction management,
complex objects.
Some IR problems are not addressed well in DBMS, eg., unstructured documents,
approximate search using keywords and relevance.
Precision: the percentage of retrieved documents that are in fact relevant to the query.
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: the percentage of documents that are relevant to the query and were in fact
retrieved.
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
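A tiny illustration (ours) of both measures, with hypothetical document ids:

# Computing precision and recall from two sets of document ids.
relevant = {"d1", "d2", "d3", "d5"}
retrieved = {"d2", "d3", "d4"}

hits = relevant & retrieved
precision = len(hits) / len(retrieved)   # fraction of retrieved docs that are relevant
recall = len(hits) / len(relevant)       # fraction of relevant docs that were retrieved
print(precision, recall)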
A document is represented by a string, which can be identified by a set of keywords.
Queries may use expressions of keywords.
Eg. "car and repair shop", "tea or coffee", "DBMS but not Oracle".
Queries and retrieval should consider synonyms, eg. Repair and maintenance.
Latent Semantic Indexing:
- Use a singular value decomposition (SVD) technique to reduce the size of the frequency table.
- Retain the K most significant rows of the frequency table.
Method:
- Create a term frequency matrix, freq-matrix.
- SVD construction: compute the singular value decomposition of freq-matrix by splitting it
into 3 matrices, U, S, V.
Vector Identification:
- For each document d, replace its original document vector by a new one that excludes the
eliminated terms.
Index Creation:
- Store the set of all vectors, indexed by one of a number of techniques (such as TV-
tree)
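A rough NumPy sketch of this latent semantic indexing idea, with a made-up term frequency matrix, might look as follows (illustrative only):

# Reduce a term-document frequency matrix with SVD and keep K components.
import numpy as np

# rows = terms, columns = documents (term frequencies)
freq_matrix = np.array([[2, 0, 1],
                        [1, 1, 0],
                        [0, 3, 1],
                        [0, 1, 2]], dtype=float)

U, S, Vt = np.linalg.svd(freq_matrix, full_matrices=False)

K = 2                                   # keep the K most significant components
U_k, S_k, Vt_k = U[:, :K], S[:K], Vt[:K, :]

# Each document is now represented by a K-dimensional vector instead of the
# original term vector; these reduced vectors would be stored in the index.
doc_vectors = (np.diag(S_k) @ Vt_k).T
print(doc_vectors)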
Other Text Retrieval Indexing Techniques:
Inverted Index:
Maintains two hash- or B+-tree-indexed tables.
Document table: a set of document records <doc_id, postings_list>
Term table: a set of term records <term, postings_list>
Answer query: find all docs associated with one or a set of terms.
Advantage: easy to implement.
Disadvantage: does not handle synonymy and polysemy well, and posting lists could be too
long (storage could be very large).
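For illustration only, a dictionary-based inverted index and a conjunctive keyword query can be sketched as follows; the documents are invented:

# Map each term to the set of document ids that contain it, then answer a
# query that asks for documents containing all given terms.
from collections import defaultdict

docs = {
    "d1": "data mining extracts knowledge from data",
    "d2": "a data warehouse stores integrated data",
    "d3": "mining association rules from transaction data",
}

inverted = defaultdict(set)          # term -> postings (set of doc ids)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

def answer(terms):
    """Docs associated with all of the given terms."""
    postings = [inverted[t] for t in terms]
    return set.intersection(*postings) if postings else set()

print(answer(["data", "mining"]))    # -> {'d1', 'd3'}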
Signature File:
Associate a signature with each document
A signature is a representation of an ordered list of terms that describe the document.
Order is obtained by frequency analysis, stemming and stop lists.
Types of Text Data Mining:
Keyword –based association analysis.
Keyword-based association analysis collects sets of keywords or terms that occur frequently
together and then finds the association or correlation relationships among them.
First preprocess the text data by parsing, stemming, removing stop words, etc. Then
invoke association mining algorithms:
Consider each document as a transaction.
View a set of keywords in the document as a set of items in the transaction.
Term-level Association Mining:
No need for human effort in tagging documents
The number of meaningless results and the execution time is greatly reduced.
Automatic Document Classification:
Motivation
Automatic Classification for the tremendous number of on-line text documents.
A Classification Problem:
Training set: Human experts generate a training data set. Classification: The
computer system discovers the classification rules.
Apply term association mining method to discover sets of associated terms. Use the terms
to maximally distinguish one class of documents from others. Derive a set of association
rules associated with each document class.
Order the classification rules based on their occurrence frequency and discriminative
power.
Used the rules to classify new documents.
6) a.) Explain the types of data in cluster analysis in detail with example Dec’14
May’17 (16)
The different types of data used for cluster analysis are:
interval-scaled,
binary,
nominal,
ordinal, and
ratio-scaled variables (as well as variables of mixed types).
Clustering may also help in the identification of areas of similar land use in an
earth observation database and in the identification of groups of houses in a city
according to house type, value,and geographic location, as well as the identification of
groups of automobile insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection,Applications of outlier detection
include the detection of credit card fraud and the monitoring of criminal activities in
electronic commerce
What is Cluster Analysis?
The process of grouping a set of physical objects into classes of similar objects is
called clustering.
Typical requirements of clustering in data mining:
1. Scalability – clustering algorithms should work for huge databases.
2. Ability to deal with different types of attributes – clustering algorithms should
work not only for numeric data, but also for other data types.
3. Discovery of clusters with arbitrary shape – clustering algorithms (based on
distance measures) should work for clusters of any shape.
4. Minimal requirements for domain knowledge to determine input parameters –
clustering results are sensitive to the input parameters of a clustering algorithm
(example – number of desired clusters). Determining the value of these parameters is
difficult and requires some domain knowledge.
5. Ability to deal with noisy data – outlier, missing, unknown and erroneous data
detected by a clustering algorithm may lead to clusters of poor quality.
2. Data Matrix: (object-by-variable structure)
Represents n objects (such as persons) with p variables or attributes (such as
age, height, weight, gender, race and so on). The structure is in the form of a
relational table, or an n x p matrix, called a "two mode" matrix.
Dissimilarity Matrix: (object-by-object structure)
Stores a collection of proximities that are available for all pairs of n objects. It is
represented by an n x n matrix, called a "one mode" matrix,
where d(i, j) is the dissimilarity between the objects i and j; d(i, j) = d(j, i) and
d(i, i) = 0.
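A short illustration (ours) of both structures, with made-up values:

# Build a data matrix and the corresponding dissimilarity matrix using
# Euclidean distance.
import numpy as np

# Data matrix: n objects x p variables ("two mode")
data = np.array([[25, 55.0],     # object 1: age, weight
                 [30, 70.0],     # object 2
                 [45, 80.0]])    # object 3

n = len(data)
# Dissimilarity matrix: n x n ("one mode"), d(i, j) = d(j, i), d(i, i) = 0
dissim = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        dissim[i, j] = np.linalg.norm(data[i] - data[j])

print(dissim)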
b.) What are the social impacts of data mining? May'15 (8)
Data Mining can offer the individual many benefits by improving customer
service and satisfaction, and lifestyle in general. However, it also has serious
implications regarding one's right to privacy and data security.
Chasm
Early Majority
Late Majority
Laggards
Data Mining can also have multiple personal uses, such as:
Identifying patterns in medical applications
Choosing the best companies based on customer service
Classifying email messages, etc.
Is Data Mining a threat to Privacy and Data Security? With more and more
information accessible in electronic forms and available on the web, and with
increasingly powerful data mining tools being developed and put into use, there is
increasing concern that data mining may pose a threat to our privacy and data security.
Data Privacy: In 1980, the Organisation for Economic Co-operation and
Development (OECD) established a set of international guidelines, referred to as fair
information practices. These guidelines aim to protect privacy and data accuracy.
Data Security: Many data security enhancing techniques have been developed to
help protect data. Databases can employ a multilevel security model to classify and
restrict data according to various security levels with users permitted access to only their
authorized level.
Some of the data security techniques are:
Encryption techniques
Intrusion detection
Secure multiparty computation
Data obscuration
7. Describe the applications and trends in data mining in detail (Dec ’16) (16)
Data mining is applicable to any company looking to leverage a large data warehouse to better manage
its customer relationships. Two critical factors for success with data mining are: a
large, well-integrated data warehouse and a well-defined understanding of the business
process within which data mining is to be applied (such as customer prospecting,
retention, campaign management, and so on).
A pharmaceutical company can analyze its recent sales force activity and their
results to improve targeting of high-value physicians and determine which marketing
activities will have the greatest impact in the next few months. The data needs to include
competitor market activity as well as information about the local health care systems.
The results can be distributed to the sales force via a wide-area network that enables the
representatives to review the recommendations from the perspective of the key attributes
in the decision process. The ongoing, dynamic analysis of the data warehouse allows
best practices from throughout the organization to be applied in specific sales situations.
A credit card company can leverage its vast warehouse of customer transaction
data to identify customers most likely to be interested in a new credit product. Using a
small test mailing, the attributes of customers with an affinity for the product can be
identified. Recent projects have indicated more than a 20-fold decrease in costs for
targeted mailing campaigns over conventional approaches.
A diversified transportation company with a large direct sales force can apply data
mining to identify the best prospects for its services. Using data mining to analyze its
own customer experience, this company can build a unique segmentation identifying the
attributes of high-value prospects. Applying this segmentation to a general business
database such as those provided by Dun & Bradstreet can yield a prioritized list of
prospects by region.
A large consumer package goods company can apply data mining to improve its
sales process to retailers. Data from consumer panels, shipments, and competitor activity
can be applied to understand the reasons for brand and store switching. Through this
analysis, the manufacturer can select promotional strategies that best reach their target
customer segments.
Each of these examples has a clear common ground. They leverage the
knowledge about customers implicit in a data warehouse to reduce costs and improve the
value of customer relationships. These organizations can now focus their efforts on the
most important (profitable) customers and prospects, and design targeted marketing
strategies to best reach them. Data mining has a number of applications.
The first is called market segmentation. With market segmentation, you will be able to
find behaviors that are common among your customers. You can look for patterns
among customers that seem to purchase the same products at the same time. Another
application of data mining is called customer churn. Customer churn will allow you to
estimate which customers are the most likely to stop purchasing your products or
services and go to one of your competitors.
For example, by using data mining a retail store may be able to determine which
products are stolen the most. By finding out which products are stolen the most, steps
can be taken to protect those products and detect those who are stealing them. While
direct mail marketing is an older technique that has been used for many years,
companies who combine it with data mining can experience fantastic results. For
example, you can use data mining to find out which customers will respond favorably to
a direct mail marketing strategy. You can also use data mining to determine the
effectiveness of interactive marketing. Some of your customers will be more likely to
purchase your products online than offline, and you must identify them.
While many businesses use data mining to help increase their profits, many of
them don't realize that it can be used to create new businesses and industries. One
industry that can be created by data mining is the automatic prediction of both behaviors
and trends. Imagine for a moment that you were the owner of a fashion company, and
you were able to precisely predict the next big fashion trend based on the behavior and
shopping patterns of your customers? It is easy to see that you could become very
wealthy within a short period of time. You would have an advantage over your
competitors. Instead of simply guessing what the next big trend will be, you will
determine it based on statistics, patterns, and logic.
Another example of automatic prediction is to use data mining to look at your past
marketing strategies. Which one worked the best? Why did it work the best? Who were
the customers that responded most favorably to it? Data mining will allow you to answer
these questions, and once you have the answers, you will be able to avoid making any
mistakes that you made in your previous marketing campaign. Data mining can allow
you to become better at what you do. It is also a powerful tool for those who deal with
finances. A financial institution such as a bank can predict the number of defaults that
will occur among their customers within a given period of time, and they can also
predict the amount of fraud that will occur as well.
Data mining has already been applied in many areas, such as analyzing
financial transactions to discover illegal activities or analyzing genome sequences. From
this perspective, it was just a matter of time for the discipline to reach the important area
of computer security.
Applications of Data Mining in Computer Security presents a collection of
research efforts on the use of data mining in computer security.
Data mining has been loosely defined as the process of extracting information
from large amounts of data. In the context of security, the information we are seeking is
the knowledge of whether a security breach has been experienced, and if the answer is
yes, who is the perpetrator. This information could be collected in the context of
discovering intrusions that aim to breach the privacy of services, data in a computer
system or alternatively, in the context of discovering evidence left in a computer system
as part of criminal activity.
Sixth/Seventh Semester
Information Technology
(Regulations 2013)
1. What are the nine decisions in the design of data warehouse? [Pg No.26]
2. Define Star schema. [Pg No.10]
3. List OLAP guidelines. [Pg. No 33]
4. Comment on OLAP tools and the Internet. [Pg. No 34]
5. Give an example of outlier analysis for the library management system. [Pg.No.89]
6. What are the different steps in Data Transformation?[Pg.No.12]
7. Elucidate two phase involved in decision tree induction?[Pg No 74]
8. List the methods to improve Apriori‘s efficiency.[Pg.No 77]
9. State the role of Cluster analysis [Pg.No.100]
10. Give the reason on why clustering is needed in data mining?[Pg.No 100]
12. (a) Discuss different tool categories in data warehouse business analysis
[Pg. No 28] (16)
Or
(b) (i) Summarize the major differences between OLTP and OLAP system
[Pg.No 32] (8)
(ii) Describe about Cognus Impromptu. [Pg.No 35] (8)
14. (a) Find all frequent item sets for the given training set using Apriori and FP
growth, respectively. Compare the efficiency of the two mining processes.
[Pg.No 78] (10+6)
TID     items_bought
T100    (M, O, N, K, E, Y)
T200    (D, O, N, K, E, Y)
T300    (M, A, K, E)
T400    (M, U, C, K, Y)
T500    (C, O, O, K, I, E)
Or
(b) Explain Naive Bayesian classifications with algorithm and sample example.
[Pg.No 83] (16)
15. (a) Describe the applications and trends in data mining in detail.
[Pg. No 127] (16)
Or
(b) Describe different partition methods in cluster analysis.
[Pg. No 101] (16)
Sixth/Seventh Semester
Information Technology
(Regulations 2013)
1. How is a data warehouse different from a database? How are they similar? [Pg.No 13]
2. What is data discretization?[Pg.No 55]
3. List the distinct feature of OLTP and OLAP.[Pg.No 32]
4. What is multidimensional data model? Give Example. [Pg.No 32]
5. Why we need data transformation? Mention the ways by which data can be
transformed.[Pg.No 12]
6. List the five primitives for specification of a data mining task.[Pg.No 52]
7. How do you evaluate accuracy of a classifier?[Pg No 73]
8. What is lazy learner? Give an example. [Pg No 72]
9. What is meant by K-Nearest Neighbor Algorithm?[Pg.No 98]
10. List the some applications of data mining.[Pg.No 96]
(b) Discuss Data Extraction, Clean up and transformation tools with meta data
management. [Pg.No. 28] (16)
12. (a)Explain different categories of OLAP tools with diagram.[Pg. No 40] (16)
Or
(b)(i) Summarize multi dimensional data model[Pg. No 39] (8)
(ii) Discuss about Cognus Impromptu. [Pg.No 35] (8)
13. (a)Why do we need to preprocess data? What are the different forms of
preprocessing? [Pg.No 57] (16)
Or
(b) Describe in detail data mining functionalities and the different kinds of patterns
can be mined. [Pg. No 64] (16)
14. (a) Discuss the single dimensional Boolean association rule mining for transaction
database.[Pg.No78] (16)
Or
(b) Discuss about constraint based association rule mining with examples and state
how association mining to correlation analysis is dealt with. [Pg.No.93] (16)
15. (a) Describe different types of data in Cluster Analysis. [Pg.No 121] (16)
Or
(b) Describe different Hierarchical methods in Cluster Analysis. (16)