
In Search of Another Topper

(Computer)

Scanned by CamScanner
Topper's Solutions

Semester - 8

Data Warehousing & Mining

Syllabus

1. Introduction to Data Warehousing (Page 01): Introduction of Data Warehousing; Features of Data Warehousing; Classification of Metadata; Information Flow Mechanism; Role of Metadata; Data Warehouse Architecture; Data Warehouse and Data Mart; Data Warehousing Design Strategies.

2. Dimensional Modeling (Page 10): Data Warehouse Modeling Vs Operational Database Modeling; Dimensional Model Vs ER Model; Features of a Good Dimensional Model; The Star Schema; How Does a Query Execute? The Fact Table; Updates To Dimension Tables: Slowly Changing Dimensions, Type 1 Changes, Type 2 Changes, Type 3 Changes, Large Dimension Tables, Rapidly Changing or Large Slowly Changing Dimensions, Junk Dimensions; Keys in the Data Warehouse Schema: Primary Keys, Surrogate Keys & Foreign Keys; Aggregate Tables; Fact Constellation Schema or Families of Star.

3. ETL Process (Page 16): Challenges in ETL Functions; Data Extraction; Identification of Data Sources; Extracting Data: Immediate Data Extraction, Deferred Data Extraction; Data Transformation: Tasks Involved in Data Transformation; Data Loading: Techniques of Data Loading, Loading the Fact Tables and Dimension Tables; Data Quality; Issues in Data Cleansing.

4. Online Analytical Processing (OLAP) (Page 21): Multidimensional Analysis; Hypercubes; OLAP Operations in Multidimensional Data Model; OLAP Models: MOLAP, ROLAP, HOLAP, DOLAP.

5. Introduction to Data Mining (Page 29): What is Data Mining; Knowledge Discovery in Databases; Data to be Mined; Related Concepts; Data Mining Technique, Application and Issues in Data Mining.

6. Data Exploration (Page 33): Types of Attributes; Statistical Description of Data; Data Visualization.


7. Data Preprocessing (Page 33): Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute Subset Selection, Histograms, Clustering and Sampling; Data Transformation & Data Discretization: Normalization, Binning, Histogram Analysis and Concept Hierarchy Generation.

8. Classification (Page 36):
8.1 Classification Methods: Decision Tree Induction: Attribute Selection Measures.
8.2 Prediction: Structure of Regression Models; Simple Linear Regression.
8.3 Accuracy and Error Measures: Holdout, Random Sampling, Cross Validation, Bootstrap.

9. Clustering: What is Clustering? Types of Data; Partitioning Methods (K-Means, K-Medoids); Hierarchical Methods (Agglomerative, Divisive, BIRCH); Density-Based Methods (DBSCAN, OPTICS).

10. Mining Frequent Pattern and Association Rule (Page 52): Market Basket Analysis; Frequent Itemsets, Closed Itemsets, and Association Rules; Frequent Pattern Mining; Efficient and Scalable Frequent Itemset Mining Methods; The Apriori Algorithm; Improving the Efficiency of Apriori; A Pattern Growth Approach for Mining Frequent Itemsets; Mining Closed and Maximal Patterns; Introduction to Mining Multilevel Association Rules and Multidimensional Association Rules; Evaluation Measures; Introduction to Constraint-Based Association Mining.

Marks Distribution

Chapter Name | May 2016 | Dec 2016
1. Introduction to Data Warehousing | 15 | 15
2. Dimensional Modeling | 15 | 15
3. ETL Process | 10 | 15
4. Online Analytical Processing (OLAP) | 20 | 10
5. Introduction to Data Mining | 10 | 15
6. Data Exploration | — | —
7. Data Preprocessing | 10 | —
8. Classification | 15 | 25
9. Clustering | 10 | 10
10. Mining Frequent Pattern and Association Rule | 20 | 10
Miscellaneous | — | 10
Repeated Questions | — | 20

*** Note: If you need some additional questions which are not included, then mail the questions to ToppersSolutions.com or WhatsApp them on +917507531198.

If possible, we will provide a softcopy for the same.


CHAPTER - 1 : INTRODUCTION TO DATA WAREHOUSING

Q1] ILLUSTRATE THE ARCHITECTURE OF A TYPICAL DW SYSTEM. DIFFERENTIATE DW AND DATA MART.

Q2] DIFFERENTIATE DATA WAREHOUSE VS DATA MART.

ANS: [Q1 | 10M-MAY16] & [Q2 | 5M-DEC16]

DATA WAREHOUSE:

1. Data Warehouse is constructed by integrating data from multiple heterogeneous sources.
2. It is an integrated, subject-oriented, time-variant and non-volatile collection of data.
3. It was defined by Bill Inmon in 1990.
4. Data Warehouse is a system used for reporting and data analysis.
5. It is considered as a core component of business intelligence.

ARCHITECTURE OF TYPICAL DATA WAREHOUSE:

Figure 1.1 shows Typical Data Warehouse Architecture.

[Figure: Data Sources (Operational Source, External Data) → Load Manager → Warehouse (Detailed Data, Summarized Information, Metadata, Archive/Backup) → Query Manager → End User Access Tools (Analysis, Reporting, Mining)]

Figure 1.1: Typical Data Warehouse Architecture.

Data Warehouse Architecture consists of the following components:

I) Operational Source:

> Operational Source is a data source consisting of Operational Data and External Data.
> Data can come from Relational DBMS like Oracle, Informix.

II) Load Manager:

> Load Manager performs all the operations required to extract the data from one or more sources and load it into the data warehouse.

III) Warehouse Manager:

> Warehouse Manager is responsible for the warehouse management process.
> Operations performed by the warehouse manager include de-normalization of data.

IV) Query Manager:

> Query Manager performs all the operations related to the management of user queries.
> Its operation is determined by facilities provided by the end users access tools and database.

V) Detailed Data:

> This area of the warehouse stores all the detailed data.

VI) Summarized Data:

> This area of the warehouse stores the aggregated data.
> These aggregations are generated by the warehouse manager.

VII) Archive and Backup Data:

> The Detailed and Summarized Data are stored for the purpose of archiving and backup.
> The data is transferred to storage archives such as magnetic tapes or optical disks.

VIII) Metadata:

> It is used for the extraction and loading process, the warehouse management process and the query management process.

IX) End User Access Tools:

> The users interact with the data warehouse using the end user access tools.


DIFFERENTIATE BETWEEN DATA WAREHOUSE AND DATA MART:

Table 1.1 shows the difference between Data Warehouse and Data Mart.

Table 1.1: Difference between Data Warehouse and Data Mart.

Parameters | Data Warehouse | Data Mart
Scope | Enterprise Level. | Department Level.
Approach | Top-Down Approach is used. | Bottom-Up Approach is used.
Centralized & Planned | Yes. | No.
Size | 100 GB to 1 TB. | < 100 GB.
Initial Effort, Cost, Risk | Higher. | Lower.
Data Sources Used | Many Data Sources are required. | Few Data Sources are required.
Nature | Highly Flexible. | It is restrictive.
Implementation Time Required | Implementation takes months to years. | Implementation is done usually in months.
Subjects | Multiple Subjects. | Single Subject.
Data Available | Data is historical, detailed and summarized. | Data consists of some history, detailed and summarized.

Q3] WHAT IS MEANT BY METADATA IN THE CONTEXT OF A DATA WAREHOUSE? EXPLAIN THE DIFFERENT TYPES OF METADATA STORED IN A DATA WAREHOUSE. ILLUSTRATE WITH A SUITABLE EXAMPLE.

ANS: [10M-DEC16]

METADATA:

1. Metadata is simply defined as Data about Data.


2. The data that is used to represent other data is known as metadata.
3. For example, the index of a book serves as a metadata for the contents in the book.
4. In other words, we can say that metadata is the summarized data that leads us to detailed data.
5. In terms of data warehouse, we can define metadata as follows:
a. Meta Data is the road-map to a data warehouse.
b. Meta Data in a data warehouse defines the warehouse objects.
c. Metadata acts as a directory. This directory helps the decision support system to locate
the contents of a data warehouse.

ROLES OF METADATA:

The following figure 1.2 shows the roles of metadata. Roles of metadata include:
a. It is used in query tools.
b. It is used in extraction and cleansing tools.
c. It is used in reporting tools.
d. It is used in transformation tools.
e. It plays an important role in data loading functions.

[Figure: Metadata at the center, connected to Source Systems, Data Transformation Tools, Data Load Function, Data Warehouse, Query Tools, Data Mining Tools, Applications and OLAP Tools]

Figure 1.2: Roles of Metadata.
TYPES OF METADATA:

Metadata in a data warehouse fall into three major categories, as shown in figure 1.3:
1. Operational Metadata.
2. Extraction and Transformation Metadata.
3. End-User Metadata.

Figure 1.3: Types of Metadata.

I) Operational Metadata:

> The data elements selected for the data warehouse come from various operational systems, and so they have various data structures.
> The information about the operational data sources used in the data warehouse is given by Operational Metadata.

II) Extraction and Transformation Metadata:

> Extraction and transformation metadata contains the information about extraction of data from heterogeneous source systems.
> It also contains the information about data transformation in the data staging area.

III) End-User Metadata:

> The end-user metadata is the navigational map of the data warehouse.
> It enables the end-users to find information from the data warehouse.
> The end-user metadata allows the end-users to use their own business terminology and look for information in those ways in which they normally think of the business.

EXAMPLE OF METADATA:

Topper's Solutions Customer Sales Data Warehouse.

Entity Name: Customer.


Alias Name: Account, Client.
Definitions: A Person that purchases the solutions.
Source Systems: Online Sales.
Responsible User: Sagar Narkar.
Data Quality Reviewed: 07-Mar-2017.
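A metadata record like the one above can be represented as a simple structure that a decision-support tool can search like a directory. The sketch below is illustrative only: the field names follow the example above, not any particular metadata repository schema.

```python
# Metadata record for the "Customer" entity, mirroring the example above.
customer_metadata = {
    "entity_name": "Customer",
    "alias_names": ["Account", "Client"],
    "definition": "A person that purchases the solutions.",
    "source_systems": ["Online Sales"],
    "responsible_user": "Sagar Narkar",
    "data_quality_reviewed": "2017-03-07",
}

def find_entity(catalog, name):
    """Metadata acts as a directory: locate entities by name or alias."""
    return [m for m in catalog
            if m["entity_name"] == name or name in m["alias_names"]]

print(find_entity([customer_metadata], "Client")[0]["entity_name"])  # Customer
```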

Q4] OPERATIONAL VS. DECISIONAL SUPPORT SYSTEM.

ANS: [5M-MAY16]

COMPARISON BETWEEN OPERATIONAL SYSTEM AND DECISIONAL SUPPORT SYSTEM:

Table 1.2 shows the difference between Operational System and Decisional Support System.

Table 1.2: Comparison between Operational System and Decisional Support System.

Operational System | Decisional Support System

It is Application Oriented. | It is Subject Oriented.
It uses Detailed Data. | It uses Summarized Data.
It contains isolated data. | It contains integrated data.
It is used to run business. | It is used to analyze business.
It is performance sensitive. | It is not performance sensitive.

It has repetitive access. | It has ad hoc access.
There is no data redundancy. | Data redundancy is present.
It has up-to-date data. | It has snapshot data.
Database size is 100 MB - 100 GB. | Database size is 100 GB - few TB.

*** EXTRA QUESTIONS ***

Q1] EXPLAIN TOP DOWN AND BOTTOM UP APPROACHES OF DATA WAREHOUSE DESIGN.

ANS:

TOP DOWN APPROACH:

[Figure: Other Operational DBs → Extract/Transform/Load/Refresh → Staging Area → Data Warehouse → Data Marts]

Figure 1.4: Top Down Approach.

1. Figure 1.4 shows the Top Down Approach for Data Warehouse.
2. In this approach, the data flow begins with data extraction from the operational data sources.
3. This data is then loaded into the staging area.
4. It is then transferred to the Operational Data Store (ODS).
5. Sometimes the ODS step is skipped, if it is a replication of the operational databases.
6. Data is also loaded into the data warehouse in a parallel process to avoid extracting it from the ODS.
7. Then the data marts are loaded with the data.
8. And finally the OLAP environment is made available to the users.

Advantages:
> The data is centralized.
> Results are obtained quickly.


Disadvantages:
> Time consuming process.
> Failure risk is very high.

BOTTOM UP APPROACH:

[Figure: Other Operational DBs → Extract/Transform/Load/Refresh → Staging Area → Data Marts → Data Warehouse]

Figure 1.5: Bottom Up Approach.

1. Figure 1.5 shows the Bottom Up Approach for Data Warehouse.
2. The data flow begins with extraction of data from operational databases into the staging area.
3. Data is then loaded into the Operational Data Store (ODS).
4. The data in ODS is appended to or replaced by the fresh data being loaded.
5. Once the ODS is refreshed, the current data is once again extracted into the staging area.
6. It is then processed to fit into the data mart structure.
7. The data from the data mart is then extracted to the staging area, aggregated, summarized and so on.
8. It is then loaded into the data warehouse.
9. Finally it is made available to the end user for analysis.

Advantages:
> Data Marts can be delivered more quickly.
> Risk of failure is low.

Disadvantages:
> Redundancy of data in data marts.
> It preserves inconsistent and incompatible data.

Q2] EXPLAIN MULTI-TIER DATA WAREHOUSE ARCHITECTURE.

ANS:

DATA WAREHOUSE:

1. Data Warehouse is constructed by integrating data from multiple heterogeneous sources.
2. It is an integrated, subject-oriented, time-variant and non-volatile collection of data.
3. It was defined by Bill Inmon in 1990.
4. Data Warehouse is a system used for reporting and data analysis.
5. It is considered a core component of business intelligence.

MULTI-TIER ARCHITECTURE OF DATA WAREHOUSE:

Figure 1.6 shows Multi-Tier Architecture of Data Warehouse.

[Figure: Bottom tier — Operational DBs and external sources with Extract/Transform/Load/Refresh into the Data Warehouse and Data Marts (with Metadata, Monitor & Integrator); Middle tier — OLAP Server; Top tier — front end tools for Query/Reports, Analysis and Data Mining]

Figure 1.6: Multi-Tier Data Warehouse Architecture.

Multi-Tier Data Warehouse Architecture consists of following components:

I) Bottom Tier:

> Bottom Tier usually consists of Data Sources and Data Storage.
> It is the warehouse database server. For Example: RDBMS.
> In Bottom Tier, using application program interfaces, data is extracted from operational and external sources.
> Application program interfaces known as gateways (like ODBC, OLE-DB and JDBC) are supported by the underlying DBMS.

II) Middle Tier:

> Middle Tier usually consists of OLAP Engine.
> OLAP Engine is either implemented using Relational OLAP (ROLAP) or Multidimensional OLAP (MOLAP).

III) Top Tier:

> Top Tier includes front end tools.


> Front end tools includes query and reporting tools, analysis tools and data mining tools.
> There are three data warehouse models present.
> Enterprise Warehouse: The information of the entire organization is collected related to various
subjects in enterprise warehouse.
> Data Mart: It is a subset of warehouse that is useful to a specific group of users.
> Virtual Warehouse: It is a set of views over operational databases.

CHAPTER - 2: DIMENSIONAL MODELING

Q1] FACTLESS FACT TABLE.

ANS:

FACT TABLE:

1. Fact Table is a collection of facts and measures.
2. It is located at the center of a star schema, surrounded by dimension tables.

FACTLESS FACT TABLE:

1. A Factless fact table is a fact table that does not contain any facts.
2. It captures events that happen only at information level.
3. A Factless fact table captures the many-to-many relationships between dimensions.
4. Factless fact tables are used for tracking a process or collecting stats.

TYPES OF FACTLESS FACT TABLE:

As shown in figure 2.1, there are two types of factless fact tables: those that describe events, and those that describe conditions:
1. Event Tracking Tables.
2. Coverage Tables.

Figure 2.1: Types of Factless Fact Table.

I) Event Tracking Tables:

> Event Tracking Tables are used to track the event of interest.
> Many event-tracking tables in dimensional data warehouses turn out to be factless.

II) Coverage Tables:

> Coverage Tables were defined by Ralph Kimball.
> They are used to support negative analysis reports.

EXAMPLE OF FACTLESS FACT TABLE:

> Tracking student attendance.
> List of people for the web click.

Q2] UPDATES TO DIMENSION TABLES.

ANS: [5M-MAY16]

DIMENSION TABLE:

1. A dimension table is a table in a star schema of a data warehouse.


2. A dimension table stores attributes, or dimensions, that describe the objects in a fact table.

UPDATES TO DIMENSION TABLES:

1. Over time, every day as more and more sales take place, more and more rows get added to the fact table.
2. Now consider the dimension tables. Compared to the fact table, the dimension tables are more stable and less volatile.
3. Dimension tables change due to changes in the attributes themselves, but not because of an increase in the number of rows.
4. Types of changes that affect dimension tables are as follows:

I) Slowly Changing Dimensions:

> Dimensions are generally constant over time, but if not constant, they may change slowly.
> Example: The Customer ID of a record remains the same, but the marital status or location of the customer may change over time.
> There are three different types:
■ Type 1 Change: It is related to correction of errors in source systems and changes are not preserved.
■ Type 2 Change: It is related to the true changes in source systems and changes are preserved.
■ Type 3 Change: It is related to tentative changes in the source systems and changes are preserved.
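The Type 1 and Type 2 behaviors above can be sketched in a few lines of Python. This is a minimal illustration, not a production ETL routine; the column names (`customer_id`, `city`, `surrogate_key`, ...) and sample values are assumptions for the sketch.

```python
import copy

def apply_type1(dim_row, changes):
    """Type 1: overwrite the attribute in place -- history is NOT preserved
    (used for correcting errors from the source system)."""
    dim_row.update(changes)
    return dim_row

def apply_type2(dim_table, natural_key, changes, effective_date, next_surrogate):
    """Type 2: mark the current row as expired and insert a new row with a
    fresh surrogate key -- history IS preserved (true change in the source)."""
    current = next(r for r in dim_table
                   if r["customer_id"] == natural_key and r["current"])
    current["current"] = False
    new_row = copy.deepcopy(current)
    new_row.update(changes, surrogate_key=next_surrogate,
                   effective_date=effective_date, current=True)
    dim_table.append(new_row)
    return dim_table

# A customer moves from Mumbai to Pune: a true change, handled as Type 2.
customers = [{"surrogate_key": 1, "customer_id": "C100", "name": "Asha",
              "city": "Mumbai", "effective_date": "2015-01-01", "current": True}]
apply_type2(customers, "C100", {"city": "Pune"}, "2017-03-07", next_surrogate=2)
print([(r["surrogate_key"], r["city"], r["current"]) for r in customers])
# [(1, 'Mumbai', False), (2, 'Pune', True)]
```

Note how Type 2 keeps both rows, so old fact rows still join to the Mumbai version via surrogate key 1, while new facts join to surrogate key 2.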

II) Large Dimension Tables:

> Large Dimension tables are very deep and wide.
> Deep means it has a large number of rows.
> Wide means it may have many attributes or columns.
> To handle a large dimension table, we can divide the large dimension into some mini dimensions based on the interest.

III) Rapidly Changing Dimensions:

> If the dimension table changes rapidly, then break the dimension table into one or more smaller dimension tables.
> Move the rapidly changing attributes into another dimension table, and keep the original dimension table with the slowly changing attributes.

IV) Junk Dimensions:

> Some textual data or flags cannot be the significant fields in major dimensions of source legacy systems.
> Such flags and textual attributes can be grouped into a single junk dimension that keeps all the text values.
> It is useful to fire the queries based on these flags and texts.

Q3] A SUPER MARKET CHAIN HAS THE FOLLOWING DIMENSIONS NAMELY PRODUCT, STORE, TIME AND PROMOTION. DESIGN A STAR SCHEMA FOR THE SAME AND CALCULATE THE MAXIMUM NUMBER OF FACT TABLE RECORDS, GIVEN:
i. Time period - 5 Years.
ii. Stores - 300.
iii. Daily sales per store - 4000.
iv. Promotion - 1.

ANS: [10M-MAY16]

1. Star Schema is the most popular schema design for a Data Warehouse.
2. It is called a star schema because the diagram resembles a star, with points radiating from a
center.
3. The center of the star consists of fact table and the points of the star are the dimension tables.
4. Usually the fact tables in a star schema are in third normal form (3NF) whereas dimensional
tables are de-normalized.

STAR SCHEMA FOR SUPER MARKET CHAIN:

Figure 2.2 shows the Star Schema for Super Market Chain.

Fact Table: Sales.

Dimension Tables: Product, Store, Time and Promotion.

Scanned by CamScanner
[Figure: Star schema — Sales facts table (Product Key, Time Key, Store Key, Promotion Key, Unit Sales) at the center, with dimension tables Product (Product Key, Product Description, Product CategoryID, Product SubcategoryID, Brand Name), Store (Store Key, Store Name, Address, City, State, ZIP, RegionID), Time (Time Key, Date, MonthID, QuarterID, YearID, Holiday Flag) and Promotion (Promotion Key, Promotion Name, Promotion Type, Promotion Cost, Start Date, End Date, Responsible Manager)]

Figure 2.2: Sales Promotion Star Schema.

Maximum No. of fact table records:

Time Period = 5 Years x 365 Days = 1825 days.
No. of Stores = 300.
Each Store's daily sales = 4000.
Promotion = 1.

Maximum No. of Fact Table Records = Time Period x No. of Stores x Daily Sales x Promotion
= 1825 x 300 x 4000 x 1
= 2,190,000,000
≈ 2 Billion records.
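The arithmetic above can be checked in a couple of lines of Python:

```python
# Sizing the fact table: one row per (day, store, sale, promotion) combination.
days = 5 * 365                 # 5-year time dimension = 1825 day keys
stores = 300
daily_sales_per_store = 4000   # sale rows per store per day
promotions = 1

max_fact_rows = days * stores * daily_sales_per_store * promotions
print(max_fact_rows)  # 2190000000  (about 2.19 billion rows)
```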


Q4] CONSIDER FOLLOWING DIMENSIONS FOR A HYPERMARKET CHAIN: PRODUCT, STORE, TIME AND PROMOTION. WITH RESPECT TO THE ABOVE, ANSWER THE FOLLOWING QUESTIONS, CLEARLY STATING ANY ASSUMPTIONS: DESIGN A STAR SCHEMA. CAN THE STAR SCHEMA BE CONVERTED TO A SNOWFLAKE SCHEMA? JUSTIFY YOUR ANSWER AND DRAW THE SNOWFLAKE SCHEMA FOR THE DATA WAREHOUSE (DIMENSION TABLE(S), THEIR ATTRIBUTES AND MEASURES).

ANS: [10M-MAY16]

STAR SCHEMA:

1. Star Schema is the most popular schema design for a Data Warehouse.
2. It is called a star schema because the diagram resembles a star, with points radiating from a center.
3. The center of the star consists of the fact table and the points of the star are the dimension tables.
4. Usually the fact tables in a star schema are in third normal form (3NF) whereas dimensional tables are de-normalized.

STAR SCHEMA FOR HYPER MARKET CHAIN:

Figure 2.3 shows the Star Schema for Hyper Market Chain.

Fact Table: Sales.

Dimension Tables: Product, Store, Time and Promotion.

[Figure: Star schema — Sales facts table (Product Key, Time Key, Store Key, Promotion Key, Unit Sales) with dimension tables Product (Product Key, Product Description, Product CategoryID, Product SubcategoryID, Brand Name), Store (Store Key, Store Name, Address, City, State, ZIP, RegionID), Time (Time Key, Date, MonthID, QuarterID, YearID, Holiday Flag) and Promotion (Promotion Key, Promotion Name, Promotion Type, Promotion Cost, Start Date, End Date, Responsible Manager)]

Figure 2.3: Star Schema for Hyper Market Chain.


Whether the star schema can be converted to snowflake schema?

Yes, the above star schema can be converted to a snowflake schema by considering the following assumptions:
1. Product can be classified into category and subcategory.
2. Store belongs to a region, and a region dimension is not added in the star schema.
3. Time Dimension can be further divided into Month, Quarter and Year.
4. Promotion can be further classified into types.

SNOWFLAKE SCHEMA:

1. The snowflake schema is an extension of the star schema, where each point of the star explodes
into more points.
2. In a star schema, each dimension is represented by a single dimensional table.
3. Whereas in a snowflake schema, that dimensional table is normalized into multiple lookup
tables, each representing a level in the dimensional hierarchy.

[Figure: Snowflake schema — the Sales facts table as in the star schema, with Product normalized into Subcategory (SubcategoryID, Subcategory Name) and Category, Store normalized into Region (RegionID, Region Name), Time normalized into Month, Quarter and Year, and Promotion normalized into Promotion Type (Promotion Type, Promotion Duration)]
CHAPTER - 3: ETL PROCESS

Q1] DESCRIBE THE STEPS OF THE ETL PROCESS.

ANS:

ETL:

1. ETL stands for Extract, Transform and Load.
2. It is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.

ETL PROCESS:

[Figure 3.1: ETL Process — source systems (Oracle, SQL Server, Teradata, Flat File) → Extraction → Transformation → Loading → Data Warehouse]

ETL process involves the following tasks:

I) Extracting the data from different sources:

> In this step, data is extracted from the source system.
> Data is also made accessible for further processing.
> The data is extracted in such a way that it does not negatively affect the source system.
> Most data projects consolidate data from different source systems.
> Each separate source uses a different format.
> Common data-source formats include RDBMS, XML and flat files (like CSV, JSON).
Scanned by CamScanner

II) Transforming the data:

> In this step, a series of rules or functions is applied to the extracted data to convert it into a standard format.
> Transformation involves:
■ Standardizing (e.g. converting all dates to the format yyyy-mm-dd).
■ Cleaning (e.g. "Male" to "M" and "Female" to "F" etc.).
■ Enriching (e.g. splitting Full Name into First Name, Middle Name, Last Name).
> In some cases data does not need any transformations, and here the data is said to be "rich data" or "direct move" or "pass through" data.
III) Loading:

> This is the final step in the ETL process.
> In this step, the extracted and transformed data is loaded into the target database.
> In order to make data loading efficient, it is necessary to index the database and disable constraints before loading the data.
> All the three steps in the ETL process can be run in parallel.
> Data extraction takes time, and so the second step, the transformation process, is executed simultaneously.
> This prepares data for the third step of loading.
> As soon as some data is ready, it is loaded without waiting for completion of the previous steps.
Q2] IN WHAT WAY IS THE ETL CYCLE USED IN A TYPICAL DATA WAREHOUSE? EXPLAIN WITH A SUITABLE INSTANCE.

ANS:

ETL:

1. ETL stands for Extract, Transform and Load.
2. It is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.

ETL CYCLE:

The ETL cycle consists of the following steps of execution:

1. Initiation of cycle.
2. Building reference data.
3. Extracting data from different sources.
4. Validation of data.
5. Transforming data.
6. Staging of data.
7. Generation of audit reports.
8. Publishing data.
9. Archiving.
10. Cleanup.

ETL PROCESS:

Refer Q1.

USES OF ETL CYCLE IN TYPICAL DATA WAREHOUSE:

1. Data warehousing brings data from different sources onto a single platform and into a common format; the ETL cycle makes this consolidation possible.
2. ETL is required in taking management decisions.
3. It is used in designing strategies and future plans.


Q3] DATA QUALITY.

ANS: [5M-DEC16]

DATA QUALITY:

1. Data quality refers to how well the data represents the real-world values it describes and how fit it is for its intended use.
2. Some common data quality problems are listed as follows:
a. Dummy Values.
b. Absence of Data.
c. Non-Unique Identifiers.
d. Cryptic Data.

DATA QUALITY CYCLE:

The Data
Quality Cycle

Figure 3.2: Data Quality Cycle.

Components of Data Quality Cycle includes:

I) Data Discovery: It is the process of finding, gathering, organizing and reporting metadata about

data.

II) Data Profiling: It is the process of analyzing data in detail, comparing the data to its metadata,

calculating data statistics and reporting the measures of quality for the data.

III) Data Quality Rules: Based on the business requirements for each Data Quality measure, the data quality rules are made.

IV) Data Quality Monitoring: It is the process of monitoring of Data Quality, based on the results of executing the Data Quality rules.

V) Data Quality Reporting: Dashboards and scorecards are used to report Data Quality measures.

VI) Data Remediation: It is the ongoing correction of Data Quality exceptions and issues as they are reported.

CHAPTER - 4 : ONLINE ANALYTICAL PROCESSING (OLAP)

Q1] DISCUSS VARIOUS OLAP MODELS.

ANS: [10M-MAY16]

OLAP:

1. OLAP Stands for Online Analytical Processing.


2. It is based on the multidimensional data model.
3. OLAP was defined by OLAP Council.
4. It allows managers, and analysts to get an insight of the information through fast, consistent, and
interactive access to information.

OLAP MODELS:

OLAP

MOLAP ROLAP HOLAP DOLAP

Figure 4.1: OLAP Models.

I) MOLAP:

> MOLAP Stands for Multi-dimensional OLAP.
> In MOLAP, data is stored in a multidimensional cube.
> It uses array-based multidimensional storage engines.
> The storage is not in the relational database, but in proprietary formats.
> Figure 4.2 shows MOLAP Process.

[Figure: Desktop Client (Presentation Layer) ← proprietary data language → MOLAP Engine on an MDBMS Server (Application Layer), which creates and stores summary data cubes in an MDDB fed from the Data Warehouse on an RDBMS Server (Data Layer)]

Figure 4.2: MOLAP Process.

Scanned by CamScanner
Advantages:
> It can perform complex calculations.
> It has excellent performance.

Disadvantages:
> It can handle limited amounts of data.
> It requires additional investment.

II) ROLAP:

> ROLAP Stands for Relational OLAP.
> ROLAP uses relational or extended relational DBMS.
> ROLAP servers are placed between the relational back-end server and client front-end tools.
> Figure 4.3 shows ROLAP Process.

Advantages:
> It has higher scalability.
> It can handle large amounts of data.

Disadvantages:
> Performance is slow.
> Limited SQL functionality.

[Figure: Desktop Client (Presentation Layer) → Analytical Server (Application Layer) creating data cubes dynamically → complex SQL against the Data Warehouse on an RDBMS Server (Data Layer)]

Figure 4.3: ROLAP Process.

III) HOLAP:

> HOLAP Stands for Hybrid OLAP.


> Hybrid OLAP is a combination of both ROLAP and MOLAP.
> It offers higher scalability of ROLAP and faster computation of MOLAP.
> HOLAP servers allow storing large volumes of detailed information.

IV) DOLAP:

> DOLAP Stands for Desktop OLAP.
> It is a variation of ROLAP.
> DOLAP requires only DOLAP software to be present on the machine.
> It offers portability to the users.

Q2] INDEXING OLAP DATA.

ANS: [5M-DEC16]

1. Indexing is used to quickly locate data without having to search every row in a database.
2. Indexing provides the basis for both rapid random lookups and efficient access of ordered records.
3. Indexing OLAP Data includes Bitmap Index and Join Indices.

I) Bitmap Index:

> Bitmap Index is an index on a particular column.
> Each value in the column has a bit vector.
> The length of each bit vector is the number of records in the base table.
> The i-th bit is set if the i-th row of the base table has the value for the indexed column.
> It is not suitable for high cardinality domains.
> Example of Bitmap Index is shown in figure 4.4.
Example of Bitmap Index is shown in figure 4.4.

Base Table Index on Region Index on Type


Customer Region Type ReclD Asia Europe America ReclD Retail Dealer
Cl Asia Retail 1 1 0 0 1 1 0
C2 Europe Dealer 2 0 1 0 2 0 1
C3 Asia Dealer 3 1 0 0 3 0 1
C4 America Retail 4 0 0 1 4 1 0
C5 Europe Dealer 5 0 1 0 5 0 1

Figure 4.4: Example of Bitmap Index.
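The table in figure 4.4 can be reproduced with a small pure-Python sketch. It builds one bit vector per distinct column value and then ANDs vectors to answer a two-column query without scanning the rows:

```python
def bitmap_index(rows, column):
    """Build one bit vector per distinct value of `column`.
    Bit i is set when row i holds that value."""
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[column], [0] * len(rows))[i] = 1
    return index

# Base table from figure 4.4.
customers = [
    {"customer": "C1", "region": "Asia",    "type": "Retail"},
    {"customer": "C2", "region": "Europe",  "type": "Dealer"},
    {"customer": "C3", "region": "Asia",    "type": "Dealer"},
    {"customer": "C4", "region": "America", "type": "Retail"},
    {"customer": "C5", "region": "Europe",  "type": "Dealer"},
]

region_idx = bitmap_index(customers, "region")
type_idx = bitmap_index(customers, "type")
print(region_idx["Asia"])   # [1, 0, 1, 0, 0]
print(type_idx["Dealer"])   # [0, 1, 1, 0, 1]

# AND the bit vectors to find "Dealers in Europe" without a table scan:
europe_dealers = [a & b for a, b in zip(region_idx["Europe"], type_idx["Dealer"])]
print(europe_dealers)       # [0, 1, 0, 0, 1]  -> rows C2 and C5
```

The AND/OR of bit vectors is why bitmap indexes suit low-cardinality OLAP columns: each extra predicate costs one cheap bitwise pass.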


II) Join Indices:

> Traditional indices map the values to a list of record ids.
> But Join Indices map the values of the dimensions of a star schema to rows in the fact table.
> A Join Index is: JI(R-id, S-id), where table R (R-id, ...) joins with table S (S-id, ...).
> Join indices can span multiple dimensions.
> Figure 4.5 shows the Example of Join Indices.
> Fact Table: Sales and Dimension Tables: Location and Item.

[Figure: join index on the Location dimension — e.g. the value "Mumbai" maps to the matching Sales fact-table tuple IDs]

Figure 4.5: Example of Join Indices.
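A join index in the spirit of figure 4.5 can be sketched as a mapping from each dimension value to the fact-table tuple IDs that join with it. The tuple IDs and values below are illustrative stand-ins, not the exact contents of the figure:

```python
def join_index(fact_rows, dim_column):
    """Map each dimension value to the fact-table row ids that contain it,
    so a star query can fetch matching fact tuples without a full join."""
    ji = {}
    for rid, row in fact_rows.items():
        ji.setdefault(row[dim_column], []).append(rid)
    return ji

# Toy Sales fact table keyed by tuple id (values are illustrative).
sales = {
    "T97":  {"location": "Mumbai", "item": "Sony-TV"},
    "T238": {"location": "Mumbai", "item": "Sony-TV"},
    "T559": {"location": "Delhi",  "item": "Sony-TV"},
    "T710": {"location": "Mumbai", "item": "LG-TV"},
}

loc_ji = join_index(sales, "location")
print(loc_ji["Mumbai"])  # ['T97', 'T238', 'T710']
```

Building one such index per dimension (location, item, ...) is what lets join indices "span multiple dimensions": intersecting the id lists answers multi-dimension star queries directly.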

Q3] WE WOULD LIKE TO VIEW SALES DATA OF A COMPANY WITH RESPECT TO THREE DIMENSIONS NAMELY LOCATION, ITEM AND TIME. REPRESENT IT IN THE FORM OF A 3-D DATA CUBE FOR THE ABOVE AND PERFORM ROLL UP, DRILL DOWN, SLICE AND DICE OLAP OPERATIONS ON THE ABOVE DATA CUBE AND ILLUSTRATE.

ANS: [10M-MAY16]

OLAP OPERATIONS:

1. OLAP operations are used to analyze data in multi-dimensional databases.
2. Since OLAP servers are based on a multidimensional view of data, OLAP operations are performed on multidimensional data.

List of OLAP operations:

I) Roll-up:

> Roll-up performs aggregation on a data cube in any of the following ways:
■ By climbing up a concept hierarchy for a dimension.
■ By dimension reduction.
> The following figure 4.6 illustrates how roll-up works.
> Initially the concept hierarchy was "street < city < province < country".
> On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
> The data is grouped into countries rather than cities.
> When roll-up is performed, one or more dimensions from the data cube are removed.
Scanned by CamScanner
[Figure: roll-up on location — city-level cells (Toronto, Vancouver, ...) across items (Mobile, Modem, Phone, Security) and quarters Q1-Q4 are aggregated into country-level totals, e.g. 1000 Mobile units in Q1 for Canada]

Figure 4.6: Roll-up Operation.

II) Drill-down:

> Drill-down is the reverse operation of roll-up.


> It is performed by either of the following ways:
■ By stepping down a concept hierarchy for a dimension.
■ By introducing a new dimension.
> The following figure 4.7 illustrates how drill-down works:
> Initially the concept hierarchy was "day < month < quarter < year".
> On drilling down, the time dimension is descended from the level of quarter to the level of month.
> When drill-down is performed, one or more dimensions from the data cube are added.
> It navigates the data from less detailed data to highly detailed data.

[Figure 4.7: Drill-down Operation — the time dimension descends from quarters (Q1-Q4) to months (January-December) for cities such as New York, Chicago and Toronto, with item types Mobile, Modem, Phone, Security.]

III) Slice:

> The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
> Consider the following figure 4.8 that shows how slice works.
> Here, slice is performed for the dimension "time" using the criterion time = "Q1".
> It forms a new sub-cube by selecting one dimension.

Chicago
New York
Toronto

(Quarter)
Time
Mobile M o d e m Phone Security
item(types)

slice
for time
='Q1'

g Chicago
~ g New York
3=
.2 5. Toronto
Vancouver 325 400

Mobile Modem Phone Security


item(types)

Figure 4.8: Slice Operation.


IV) Dice:

> Dice selects two or more dimensions from a given cube and provides a new sub-cube.
> Consider the following figure 4.9 that shows the dice operation.

To,O
/ O/ 395~~7
'? Vancouver/ /
8 S

605
(Quarter}
Time

Mobile Modem
item (types)

z
Chicago w~
New York - 7-
Toronto
Vancouver
QI
S Q2
f J 03
- 2 Q4

Mobile Modem Phone Security

Figure 4.9: Dice Operation.

> The dice operation on the cube is based on the following selection criteria, which involve three dimensions:
■ Location = "Toronto" or "Vancouver"
■ Time = "Q1" or "Q2"
■ Item = "Mobile" or "Modem"

V) Pivot:

> The pivot operation is also known as rotation.
> In this, the item and location axes in the 2-D slice are rotated.

[Figure 4.10 pivots a 2-D slice so that the item axis (Mobile, Modem, Phone, Security) and the location axis (Chicago, New York, Toronto, Vancouver) are interchanged; e.g. Vancouver's values 605, 825, 14, 400 move from a row to a column.]
Figure 4.10: Pivot Operation.
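The operations above can be sketched on a tiny in-memory cube. The cities, quarters and sales figures below are illustrative assumptions, not values taken from the figures:

```python
# Toy 3-D cube: (location, quarter, item) -> sales. Illustration data only.
cube = {
    ("Toronto", "Q1", "Mobile"): 380, ("Toronto", "Q2", "Mobile"): 420,
    ("Vancouver", "Q1", "Mobile"): 325, ("Vancouver", "Q1", "Modem"): 400,
    ("Vancouver", "Q2", "Modem"): 210,
}

city_to_country = {"Toronto": "Canada", "Vancouver": "Canada"}

def roll_up(cube):
    """Roll up the location dimension from city to country (concept hierarchy)."""
    out = {}
    for (city, q, item), v in cube.items():
        key = (city_to_country[city], q, item)
        out[key] = out.get(key, 0) + v
    return out

def slice_(cube, quarter):
    """Slice: fix one dimension (time = quarter), giving a 2-D sub-cube."""
    return {(c, i): v for (c, q, i), v in cube.items() if q == quarter}

def dice(cube, cities, quarters, items):
    """Dice: select sub-ranges on two or more dimensions."""
    return {k: v for k, v in cube.items()
            if k[0] in cities and k[1] in quarters and k[2] in items}

rolled = roll_up(cube)          # e.g. ("Canada", "Q1", "Mobile") -> 380 + 325
q1 = slice_(cube, "Q1")         # 2-D view for time = "Q1"
sub = dice(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"}, {"Mobile", "Modem"})
```

Roll-up aggregates away detail, slice fixes one dimension, and dice keeps sub-ranges of several dimensions, exactly as in the figures above.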

Q4] DIFFERENTIATE OLTP VS OLAP.

ANS: [5M-DEC16]

Table 4.1 shows the comparison between OLTP and OLAP.

Table 4.1: Comparison between OLTP and OLAP.

Parameters      | OLTP                           | OLAP
Full Form       | Online Transaction Processing. | Online Analytical Processing.
Oriented        | Transaction Oriented.          | Subject Oriented.
Data Redundancy | Low.                           | High.
Granularity     | Few levels of granularity.     | Multiple levels of granularity.
Users           | Many users.                    | Few users.
Database Size   | MB to GB.                      | 100 GB to TB.
Flexibility     | Low flexibility.               | High flexibility.
Access          | Read and write.                | Mostly read.
Function        | Day-to-day operations.         | Decision support and analysis.


CHAPTER - 5: INTRODUCTION TO DATA MINING

Q1] DESCRIBE THE VARIOUS FUNCTIONALITIES OF DATA MINING AS A STEP IN THE PROCESS OF KNOWLEDGE DISCOVERY.

[10M-DEC16]

ANS:

DATA MINING:

1. Data Mining is defined as the process of discovering interesting patterns and knowledge from large amounts of data.
2. It is a non-trivial process.

KDD:

1. KDD stands for Knowledge Discovery in Database.
2. KDD is the process of discovering knowledge in data.
3. The main goal is to extract knowledge from large databases.
4. KDD includes a wide variety of application domains, which include Artificial Intelligence, Pattern Recognition, Machine Learning, Statistics and Data Visualization.
Figure 5.1 shows the KDD Process.
[Figure 5.1: the KDD pipeline — Selection → Preprocessing → Transformation → Data Mining → Interpretation/Evaluation, taking raw data through target data, preprocessed data, transformed data and patterns to knowledge.]
Figure 5.1: KDD Process.


List of steps involved in the knowledge discovery process:

I) Data Cleaning:

> In this step, noise and inconsistent data are removed.

II) Data Integration:

In this step, multiple data sources are combined.


III) Data Selection:

> In this step, data relevant to the analysis task are retrieved from the database.

IV) Data Transformation:

> In this step, data is transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.

V) Data Mining:

> In this step, intelligent methods are applied in order to extract data patterns.

VI) Pattern Evaluation:

> In this step, data patterns are evaluated.
> It is used to identify the truly interesting patterns representing knowledge, based on interestingness measures.
VII) Knowledge P r e s e n t a t i o n :

> In this step, knowledge is represented.

> Visualization and knowledge representation techniques are used to present the mined knowledge to users.

Q2] DISCUSS:

1. THE STEPS IN KDD PROCESS.
2. THE ARCHITECTURE OF A TYPICAL DM SYSTEM.

ANS: [10M-MAY16]
DATA MINING:

Refer Q I .

KDD PROCESS:

Refer Q I .

ARCHITECTURE OF A TYPICAL DM SYSTEM:


Figure 5.2 shows the architecture of a typical data mining system.

I) Database or data warehouse server:

> It fetches the data as per the user's requirement, which is needed for the data mining task.

II) Knowledge base:

> This is used to guide the search, and gives the interesting and hidden patterns from data.

III) Data mining engine:

> It performs the data mining tasks such as characterization, association, classification, cluster analysis etc.

IV) Pattern evaluation module:

> It is integrated with the mining module and it directs the search towards interesting patterns.

V) Graphical user interface:

> This module is used to communicate between the user and the data mining system.
> It allows users to browse database or data warehouse schemas.

[Figure 5.2: layered architecture — a Graphical User Interface on top of Pattern Evaluation and the Data Mining Engine (guided by a Knowledge Base), over a Database/Data Warehouse Server fed by data cleansing, filtering and data integration of the underlying database and data warehouse.]
Figure 5.2: Architecture of Typical Data Mining System.

Q3] APPLICATION OF DATA MINING TO FINANCIAL ANALYSIS.

ANS: [5M]

DATA MINING:

Refer Q1.

APPLICATION OF DATA MINING TO FINANCIAL ANALYSIS:

1. Financial data collected in the banking and financial industry is generally reliable and of high quality, so it facilitates systematic data analysis and data mining.
2. Some of the typical cases are as follows:
   a. Design and construction of data warehouses for multidimensional data analysis.
   b. Loan payment prediction and customer credit policy analysis.
   c. Classification and clustering of customers for targeted marketing.
   d. Detection of money laundering and other financial crimes.
3. Other application domains of data mining include the Retail Industry, the Telecommunication Industry, Biological Data Analysis and other Scientific Applications.


CHAPTER - 6: DATA EXPLORATION

No questions were asked on this chapter in Mumbai University papers.

CHAPTER - 7: DATA PREPROCESSING

Ql ] DISCUSS DIFFERENT STEPS INVOLVED IN DATA PREPROCESSING. [10M-MAY16]

ANS:

DATA PREPROCESSING:

1. Real-world data is often incomplete, inconsistent, and is likely to contain many errors.
2. Data preprocessing is a proven method of resolving such issues.
3. Data preprocessing prepares raw data for further processing.

STEPS INVOLVED IN DATA PREPROCESSING:

I) Data Cleaning:

> Data Cleaning is also known as Scrubbing.


> It is a technique that is applied to remove the noisy data and correct the inconsistencies in the data.
> It involves filling missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
> Steps in data cleansing:
■ Parsing: Parsing is the process in which individual data elements are located and identified

in the source systems and then these elements are isolated in the target files.

■ Correcting: In this step, using data algorithm the individual data elements are corrected.

■ Standardizing: In standardizing process, conversion routines are used to transform data into

a consistent format using both standard and custom business rules.

■ Matching: Matching process involves eliminating duplications by searching and matching

records.

■ Consolidating: Consolidating process involves merging the records into one representation

by analyzing and identifying relationship between matched records.

> Figure 7.1 shows an example of the data cleaning process.

[Figure 7.1: Data Cleaning Process — dirty data passes through the cleaning process to become clean data.]
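As a small sketch of the filling-missing-values and noise-removal steps above, the data and the "3× the median" outlier rule below are illustrative assumptions:

```python
import statistics

def clean(values):
    """Fill missing values (None) with the median, then drop gross outliers
    (here: anything more than 3x the median -- an illustrative rule)."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    filled = [med if v is None else v for v in values]
    return [v for v in filled if v <= 3 * med]

dirty = [10, 12, None, 11, 500, 13]   # None = missing value, 500 = noise
cleaned = clean(dirty)                # [10, 12, 12, 11, 13]
```

A production cleansing step would also parse, standardize and de-duplicate records, as described in the list above.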

II) Data Integration:

> Data integration combines data from multiple sources into a coherent data store and provides a unified view of these data.
> Sources may include multiple databases, data cubes or flat files.
> Data integration removes duplicate and redundant data.
> Figure 7.2 shows example of data integration process.

[Figure 7.2: multiple data sources combined into integrated data.]

Figure 7.2: Data Integration Process.
III) Data Transformation:

> In this step, data is transformed into forms appropriate for mining.
> Data transformation involves:
■ Aggregation.
■ Generalization.
■ Normalization.
> Figure 7.3 shows an example of the data transformation process (e.g. the values -3, 30, 120, 42, 100 rescaled to a common range).

Figure 7.3: Data Transformation Process.
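Normalization, one of the transformation tasks listed above, can be sketched with min-max scaling (the input values are the illustrative ones from the figure):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale values into [new_min, new_max].
    Assumes the values are not all identical (max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

scaled = min_max([-3, 30, 120, 42, 100])   # minimum maps to 0.0, maximum to 1.0
```

After scaling, attributes measured on very different ranges contribute comparably to mining algorithms that use distances.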
IV) Data Reduction:

> Data reduction techniques obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
> Data reduction strategies include:
■ Data cube aggregation: construction of a data cube.
■ Attribute subset selection: removing irrelevant or redundant attributes or dimensions.
■ Dimensionality reduction: encoding mechanisms are used to reduce the data set size.
■ Numerosity reduction: in this process, the data are replaced by alternative, smaller data representations such as parametric models and non-parametric models like clustering.
> Figure 7.4 shows a data reduction process example.

[Figure 7.4: Data Reduction Process — a large table (attributes A1…A125, many tuples) is reduced to a smaller one (attributes A1…A75, fewer tuples).]
Figure 7.4: Data Reduction Process.

Data Discretization:

In Data Discretization, the range of a continuous attribute is divided into intervals.


By discretization the size of the data is reduced.
In this process, the data is prepared for further analysis.
Discretization process is applied recursively on an attribute.
Three types of attributes:
■ Nominal: Values from an unordered set.
■ Ordinal: Values from an ordered set.
■ Continuous: Real numbers.
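Equal-width binning is one simple way to discretize a continuous attribute into intervals as described above; the age values below are made-up illustration data:

```python
def equal_width_bins(values, k):
    """Discretize a continuous attribute into k equal-width intervals,
    returning the bin index (0..k-1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    out = []
    for v in values:
        idx = int((v - lo) / width)
        out.append(min(idx, k - 1))   # clamp the maximum into the last bin
    return out

ages = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_width_bins(ages, 3)      # [0, 0, 1, 1, 1, 2, 2, 2, 2]
```

Each continuous value is replaced by its interval label, shrinking the attribute's domain and preparing it for further analysis.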

CHAPTER - 8: CLASSIFICATION

Q1 1 DECISION TREE BASED CLASSIFICATION APPROACH.

ANS: [5M-DEC16]

DECISION TREE BASED CLASSIFICATION:

1- It is one of the most important classification and prediction method in data mining.
2. A decision tree represents rules.
3. Rules are easy to understand and can be directly used in SQL to retrieve records from a database.
4. A decision tree classifier has tree type structure.
5. It has leaf nodes and decision nodes.
6. A leaf node is the last node of each branch and indicates value of target attribute.
7. A decision node is the node of tree which has leaf node or sub-tree.
8. Figure 8.1 shows the representation of decision tree for tennis play.
9. As shown in figure 8.1, Humidity, Outlook and Wind is Attribute.
10. High, Normal, Strong, Weak, Sunny, Rain and Overcast is Value.
1 1. Yes and No is classification.

[Figure 8.1: Decision tree — Outlook is the root attribute with values Sunny, Overcast and Rain; the Sunny branch tests Humidity (High → No, Normal → Yes), Overcast → Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes).]
Figure 8.1: Decision tree for tennis play.

Q2] METRICS FOR EVALUATING CLASSIFIER PERFORMANCE.

ANS: [5M-MAY16]
METRICS FOR EVALUATING CLASSIFIER PERFORMANCE-

1. Sensitivity: Sensitivity is defined as the True Positive recognition rate, which is the proportion of positive tuples that are correctly identified.

Sensitivity = TP/P
2. Specificity: Specificity is defined as the True Negative recognition rate, which is the proportion of negative tuples that are correctly identified.

Specificity = TN/N

Scanned by CamScanner
3. Accuracy: It is the percentage of test set tuples that are correctly classified.

Accuracy = (TP + TN) / (P + N)

4. Error Rate: It is the percentage of errors made over the whole set of instances used.

Error Rate = 1 - Accuracy

5. Precision: It is the percentage of tuples classified as positive that are actually positive. It is a measure of exactness.

Precision = |TP| / (|TP| + |FP|)

6. Recall: It is the percentage of positive tuples which the classifier labelled as positive. It is a measure of completeness.

Recall = |TP| / (|TP| + |FN|)

7. F-Measure: It is the harmonic mean of precision and recall.

F = (2 × Precision × Recall) / (Precision + Recall)

Note:
TP: Class members which are classified as class members.
TN: Class non-members which are classified as class non-members.
FP: Class non-members which are classified as class members.
FN: Class members which are classified as class non-members.
P: Number of positive tuples.
N: Number of negative tuples.
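All of the metrics above follow directly from the four confusion-matrix counts; the counts in the example call are made-up illustration values:

```python
def classifier_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics above from confusion-matrix counts."""
    p, n = tp + fn, tn + fp            # actual positives / actual negatives
    sensitivity = tp / p               # recall / true positive rate
    specificity = tn / n               # true negative rate
    accuracy = (tp + tn) / (p + n)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "error_rate": 1 - accuracy,
            "precision": precision, "recall": sensitivity, "f1": f1}

m = classifier_metrics(tp=90, tn=80, fp=20, fn=10)
```

With these counts, accuracy is 170/200 = 0.85 and the F-measure is the harmonic mean of precision 90/110 and recall 90/100.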

Q3] WHY IS NAIVE BAYESIAN CLASSIFICATION CALLED "NAIVE"? BRIEFLY OUTLINE THE MAJOR IDEAS OF NAIVE BAYESIAN CLASSIFICATION.

ANS: [10M-DEC16]

Naive Bayesian Classification is based on Bayes Theorem.


Bayesian classifiers are the statistical classifiers.
Naive Bayesian Classification is referred as Naive because it makes the assumption that each of
its inputs are independent of each other, an assumption which rarely holds true.
For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter.
A Naive Bayes Classifier considers all of these properties to independently contribute to the
probability that this fruit is an apple.

Scanned by CamScanner
BAYES THEOREM:

1. It is also known as Bayes Rule.
2. It is used to find the conditional probability of a hypothesis, which we get after any additional information is obtained.
3. Bayes theorem is given by:

P (H | X) = P (X | H) P (H) / P (X)

where P (H | X) is the posterior probability of hypothesis H conditioned on data X, P (X | H) is the posterior probability of X conditioned on H, P (H) is the a priori probability of H and P (X) is the a priori probability of X.
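The formula can be checked numerically; the sensitivity, false-positive rate and 1% prior below are illustrative assumptions, not values from the text:

```python
# Bayes rule: P(H|X) = P(X|H) * P(H) / P(X).
# Illustrative numbers: a test that is 99% sensitive with a 5% false-positive
# rate, applied to a hypothesis H with a 1% prior probability.
p_h = 0.01                 # P(H): a priori probability of H
p_x_given_h = 0.99         # P(X|H): probability of evidence X if H holds
p_x_given_not_h = 0.05     # P(X|~H): false positive rate

# Total probability: P(X) = P(X|H)P(H) + P(X|~H)P(~H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)
p_h_given_x = p_x_given_h * p_h / p_x    # posterior P(H|X)
```

Despite the strong evidence, the posterior is only about 1/6, because the prior P(H) is so small — exactly the "additional information" effect described above.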

Q4] DEFINE LINEAR, NON-LINEAR AND MULTIPLE REGRESSIONS. PLAN A REGRESSION MODEL FOR DISEASE DEVELOPMENT WITH RESPECT TO CHANGE IN WEATHER PARAMETERS.

[10M-DEC16]
ANS:

REGRESSION:

1. Regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.
2. Thus regression is very useful in estimating and predicting the average value of one variable for a given value of the other variable.

TYPES OF REGRESSION:

I) Linear Regression:

> If the regression curve is a straight line, then there is a linear regression between the two variables.
> The relationship between the dependent and independent variable is described by a straight line, and there is only one independent variable.

Y = α + βX

Here Y is the dependent variable, X is the independent variable, and α, β are the parameters.

II) Non-Linear Regression:

> If the curve of regression is not a straight line, then it is called non-linear regression.
> Non-linear regression tries to find the mathematical relationship between the variables; if it gives a curved line, then it is a non-linear regression.
> It is also known as Curvilinear Regression.

III) Multiple Regression:

> Multiple regression is given by the following formula:

Y = α + β₁X₁ + β₂X₂ + …

> Multiple regression includes more than one predictor variable.
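The parameters α and β of Y = α + βX can be estimated by ordinary least squares; the data points below are an illustrative assumption, chosen to lie exactly on a line:

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + b*X (one independent variable)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Illustrative data lying exactly on Y = 2 + 3X
a, b = fit_line([0, 1, 2, 3], [2, 5, 8, 11])
```

For a disease-development model, Y would be a disease measure and X (or X₁, X₂, … in multiple regression) weather parameters such as temperature and humidity.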

Q5] A SIMPLE EXAMPLE FROM THE STOCK MARKET INVOLVING ONLY DISCRETE RANGES HAS PROFIT AS CATEGORICAL ATTRIBUTE, WITH VALUES {UP, DOWN}, AND THE TRAINING DATA SET IS GIVEN BELOW.

Age Competition Type Profit

Old Yes Software Down


Old No Software Down
Old No Hardware Down
Mid Yes Software Down
Mid Yes Hardware Down
Mid No Hardware Up
Mid No Software Up
New Yes Software Up
New No Hardware Up

New No Software Up

Apply decision tree algorithm and show the generated rules.

[10M-MAY16]
ANS:

DECISION TREE BASED CLASSIFICATION:

Refer Q1.

** Note: Even if the sum is asked, write a short theory explaining the content.

DECISION TREE FOR THE ABOVE EXAMPLE:

Figure 8.2 shows the decision tree for the stock market case.

[Figure 8.2: Decision tree — Age is the root attribute (New → Up, Old → Down, Mid → test Competition; Competition = No → Up, Competition = Yes → Down).]

RULES:
1. IF Age = New THEN Profit = Up.
2. IF Age = Mid AND Competition = No THEN Profit = Up.
3. IF Age = Mid AND Competition = Yes THEN Profit = Down.
4. IF Age = Old THEN Profit = Down.
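The generated rules can be applied back to the training table to check that they classify every row correctly (the Type column is ignored, since the tree never tests it):

```python
def predict(age, competition):
    """Apply the rules generated from the decision tree for the stock example."""
    if age == "New":
        return "Up"
    if age == "Old":
        return "Down"
    # age == "Mid": the outcome depends on Competition
    return "Up" if competition == "No" else "Down"

# Training data from the question: (Age, Competition, Profit)
data = [
    ("Old", "Yes", "Down"), ("Old", "No", "Down"), ("Old", "No", "Down"),
    ("Mid", "Yes", "Down"), ("Mid", "Yes", "Down"), ("Mid", "No", "Up"),
    ("Mid", "No", "Up"), ("New", "Yes", "Up"), ("New", "No", "Up"),
    ("New", "No", "Up"),
]
correct = sum(predict(a, c) == p for a, c, p in data)   # all 10 rows match
```

All ten tuples are classified correctly, confirming the tree fits the training set.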

*** EXTRA QUESTIONS ***

qi ] What is classification? What are the issues in classification?


ANS:
CLASSIFICATION:

1. Classification is the form of data analysis.


2. Classification constructs classification model based on training data set.
3. Using this model i t classifies the new data.
4. Classification models predict categorical class labels.
5. For example, we can build a classification model to categorize bank loan applications as either safe or risky.

CLASSIFICATION PROCESS:

Classification is a two-step process:

Scanned by CamScanner
I) Model Construction:

> This step is the learning step.
> In this step, the classification algorithms build the classifier.
> The classifier is built from the training set, made up of database tuples and their associated class labels.
> Each tuple in the training set is assumed to belong to a predefined group, referred to as a category or class.
> Figure 8.3 shows an example of model construction.

[Figure 8.3: Model construction — a training table (Name, Rank, Years, Tenured) with rows such as Sagar/Developer and Snehal/Technical Support is fed to a classification algorithm, which produces the classifier rule IF Rank = 'Developer' OR Years > 2 THEN Tenured = 'Yes'.]
Figure 8.3: Example of model construction.

II] Model Usage:

> In this step, the classifier is used for classification.

> The classification rules can be applied to the new data tuples if the accuracy is considered
acceptable.
> Figure 8.4 shows example of model usage.

[Figure 8.4: Model usage — the classifier is applied to testing data (Name, Rank, Years, Tenured) and then to unseen data; e.g. (Sagar, Developer, 3) is classified as Tenured = Yes.]
Figure 8.4: Example of model usage.

ISSUES IN CLASSIFICATION:

The major issue is preparing the data for classification. Preparing the data involves the following activities:

1. Data Cleaning: Data cleaning involves the removal of noise and the treatment of missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
2. Relevance Analysis: Correlation analysis is used to know whether any two given attributes are related.
3. Data Transformation and Reduction: The data can be transformed by any of the following methods:
   ■ Normalization: The data is transformed by scaling attribute values so that they fall within a small specified range.
   ■ Generalization: The data can also be transformed by generalizing it to the higher concept.

Q2] EXPLAIN ID3 ALGORITHM.

ANS:

ID3:

1. ID3 stands for Iterative Dichotomiser 3.
2. It is an algorithm to build a decision tree.
3. It was developed by J. Ross Quinlan.
4. It adopts a greedy approach.
5. In this algorithm, there is no backtracking.
6. ID3 trees are constructed in a top-down recursive divide-and-conquer manner.
ID3 ALGORITHM:

ID3 (Examples, Target_attribute, Attributes)
- Create a Root node for the tree.
- If all Examples are positive, return the single-node tree Root with label = +.
- If all Examples are negative, return the single-node tree Root with label = -.
- If Attributes is empty, return the single-node tree Root with label = most common value of Target_attribute in Examples.
- Otherwise:
  o A <- the Attribute that best classifies Examples.
  o The decision attribute for Root <- A.
  o For each possible value Vi of A:
    o Add a new tree branch below Root, corresponding to the test A = Vi.
    o Let Examples_Vi be the subset of Examples that have value Vi for A.
    o If Examples_Vi is empty, then below this new branch add a leaf node with label = most common value of Target_attribute in Examples.
    o Else, below this new branch add the sub-tree ID3 (Examples_Vi, Target_attribute, Attributes - {A}).
- End.
- Return Root.

Advantages:

> ID3 builds a short tree.
> ID3 builds the tree fast.

Disadvantages:

> It performs poorly with many classes and small data.
> It is computationally expensive to train.
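The "attribute that best classifies Examples" in the pseudocode is usually the one with the highest information gain; a minimal sketch (the four toy rows are an illustrative assumption in the spirit of the stock-market example):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Information gain of splitting `rows` (a list of dicts) on `attr`."""
    base = entropy([r[target] for r in rows])
    n = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return base - remainder

# Toy data: "age" perfectly separates the classes, so its gain is maximal.
rows = [
    {"age": "old", "profit": "down"}, {"age": "old", "profit": "down"},
    {"age": "new", "profit": "up"}, {"age": "new", "profit": "up"},
]
gain = info_gain(rows, "age", "profit")
```

ID3 computes this gain for every candidate attribute and splits on the largest one at each node.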

CHAPTER - 9: CLUSTERING

Q1] WHAT IS K-MEANS CLUSTERING ALGORITHM? APPLY K-MEANS ALGORITHM WITH TWO CLUSTERS ON THE DATA SET = {1, 2, 6, 7, 8, 10, 15, 17, 20}. [10M-MAY16]

ANS:

CLUSTERING:

1. Clustering is an unsupervised learning technique.
2. It is a data mining technique used to place data elements into related groups without advance knowledge of the group definitions.
3. These groups are called clusters.
4. Clustering algorithms are used in Marketing, Biology, Insurance etc.

K-MEANS CLUSTERING ALGORITHM:

1. K-Means clustering is one of the partitioning methods.
2. K-Means clustering aims to partition n observations into K clusters, in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
3. This results in a partitioning of the data space.
4. K is a positive integer number.

K-Means Clustering Process:

[Figure 9.1: K-Means flowchart — start → choose the number of clusters K → pick centroids → compute the distance of each object to the centroids → group objects by nearest centroid → recompute centroids → repeat until no object moves between groups → end.]
Figure 9.1 shows the flowchart for K-Means clustering.

Process:

1. Define K centroids for K clusters; they are generally chosen far away from each other.
2. Group the elements into clusters: each element joins the cluster whose centroid is nearest to it.
3. After this first step, recalculate the centroid of each cluster based on the elements of that cluster.
4. Follow the same method and group the elements again.
5. In every step, the centroids change and elements move from one cluster to another.
6. Repeat the same process until no element moves from one cluster to another.

EXAMPLE:

Given:

Data Set = {1, 2, 6, 7, 8, 10, 15, 17, 20}
No. of clusters K = 2

Solution:

Step-1: (Define K centroids)

Consider initial two centroids for the two clusters: C1 = 6 and C2 = 15.

Step-2: (Assign data to the nearest centroid)

K1 = {1, 2, 6, 7, 8, 10}
K2 = {15, 17, 20}

Step-3: (Calculate the mean)

K1 = {1, 2, 6, 7, 8, 10}: C1 = Mean = 34/6 = 5.67
K2 = {15, 17, 20}: C2 = Mean = 52/3 = 17.33

Step-4: (Reassign)

K1 = {1, 2, 6, 7, 8, 10}
K2 = {15, 17, 20}

As no element moves from one cluster to another, the final answer is K1 = {1, 2, 6, 7, 8, 10} and K2 = {15, 17, 20}.
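The worked example above can be reproduced with a small 1-D K-Means implementation, using the same initial centroids 6 and 15:

```python
def kmeans_1d(points, centroids, iters=10):
    """1-D K-Means: assign each point to the nearest centroid, then
    recompute each centroid as the mean of its cluster. Assumes no
    cluster ever becomes empty (true for this data)."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        centroids = [sum(c) / len(c) for c in clusters]
    return clusters, centroids

data = [1, 2, 6, 7, 8, 10, 15, 17, 20]
clusters, centroids = kmeans_1d(data, [6, 15])   # initial centroids from the example
```

The loop converges after the first pass: the clusters {1, 2, 6, 7, 8, 10} and {15, 17, 20} no longer change, and the centroids settle at 34/6 ≈ 5.67 and 52/3 ≈ 17.33.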

Q2] WHAT IS CLUSTERING TECHNIQUES? DISCUSS THE AGGLOMERATIVE ALGORITHM WITH THE FOLLOWING DATA AND PLOT A DENDROGRAM USING SINGLE LINK APPROACH. THE TABLE BELOW COMPRISES SAMPLE DATA ITEMS INDICATING THE DISTANCE BETWEEN THE ELEMENTS.

Item | E | A | C | B | D
E    | 0 | 1 | 2 | 2 | 3
A    | 1 | 0 | 2 | 5 | 3
C    | 2 | 2 | 0 | 1 | 6
B    | 2 | 5 | 1 | 0 | 3
D    | 3 | 3 | 6 | 3 | 0

[10M-DEC16]
ANS:
3.
CLUSTERING:

1. Clustering is an unsupervised learning technique.
2. It is a data mining technique used to place data elements into related groups without advance knowledge of the group definitions.
3. These groups are called clusters.
4. Clustering algorithms are used in Marketing, Biology, Insurance etc.

CLUSTERING TECHNIQUES:

Clustering techniques can be classified into the following categories:

I) Partitioning Method:

> In the partitioning based approach, various partitions are created.
> Each partition represents a cluster.

II) Hierarchical Method:

> This method creates a hierarchical decomposition of the given set of data objects.
> It can be agglomerative (bottom-up) or divisive (top-down).

III) Density-based Method:

> This method keeps growing a cluster as long as the density in the neighborhood exceeds some threshold.
> It groups the various dense-region objects together.

IV) Grid-based Method:

> In this method, the object space is quantized into a finite number of cells that form a grid structure.

V) Model-based Method:

> In this method, a model is hypothesized for each cluster to find the best fit of the data for a given model.
> This method uses a density function to locate the clusters.

VI) Constraint-based Method:

> In this method, the clustering is performed by the incorporation of constraints.
> The constraints can be user-oriented or application-oriented.
AGGLOMERATIVE ALGORITHM:

1. The agglomerative algorithm is used in hierarchical clustering.
2. It is also known as AGNES (Agglomerative Nesting).
3. This approach is also known as the bottom-up approach.
4. In this, we start with each object forming a separate group.
5. It keeps on merging the objects or groups that are close to one another.
6. It keeps doing so until all of the groups are merged into one, or until the termination condition holds.
7. A hierarchical agglomerative clustering is typically visualized as a dendrogram, as shown in figure 9.2.
8. A dendrogram is a tree-like structure used to illustrate the hierarchical clustering technique.

EXAMPLE:

Given:

Initial distance matrix (lower triangle):

Item | E | A | C | B | D
E    | 0 |   |   |   |
A    | 1 | 0 |   |   |
C    | 2 | 2 | 0 |   |
B    | 2 | 5 | 1 | 0 |
D    | 3 | 3 | 6 | 3 | 0
Step - 1:

From the above distance matrix, E and A have the minimum distance (1), so we combine them together to form cluster (E, A).

Dist ((E, A), C) = MIN (Dist (E, C), Dist (A, C)) = MIN (2, 2) = 2
Dist ((E, A), B) = MIN (Dist (E, B), Dist (A, B)) = MIN (2, 5) = 2
Dist ((E, A), D) = MIN (Dist (E, D), Dist (A, D)) = MIN (3, 3) = 3

Updated distance matrix:

E,A C B D
Item
E, A 0
C 2 0

B 2 1 0

D 3 6 3 0

Step - 2:

Consider the distance matrix obtained in step 1. Since the (B, C) distance is minimum (1), we combine B and C.

Dist ((B, C), (E, A)) = MIN (Dist (E, B), Dist (E, C), Dist (A, B), Dist (A, C)) = MIN (2, 2, 5, 2) = 2
Dist ((B, C), D) = MIN (Dist (B, D), Dist (C, D)) = MIN (3, 6) = 3

Updated distance matrix:

Item | E, A | B, C | D
E, A | 0    |      |
B, C | 2    | 0    |
D    | 3    | 3    | 0

Step - 3:

Now (E, A) and (B, C) have the minimum distance (2), so we combine them into (E, A, B, C).

Dist ((E, A, B, C), D) = MIN (3, 3) = 3

Step - 4:

Finally we combine D with (E, A, B, C), giving a single cluster.

Final Dendrogram:

[Figure 9.2: Dendrogram — E and A merge at distance 1, B and C merge at distance 1, (E, A) and (B, C) merge at distance 2, and D joins at distance 3.]
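The single-link merging procedure can be written directly from the distance table in the question, and reproduces the merge order of the dendrogram:

```python
from itertools import combinations

def single_link(items, dist):
    """Agglomerative (AGNES) clustering with the single-link distance:
    repeatedly merge the two closest clusters, recording each merge."""
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda p: min(dist[x][y] for x in p[0] for y in p[1]))
        d = min(dist[x][y] for x in a for y in b)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a | b), d))
    return merges

# Distance matrix from the question.
D = {
    "E": {"E": 0, "A": 1, "C": 2, "B": 2, "D": 3},
    "A": {"E": 1, "A": 0, "C": 2, "B": 5, "D": 3},
    "C": {"E": 2, "A": 2, "C": 0, "B": 1, "D": 6},
    "B": {"E": 2, "A": 5, "C": 1, "B": 0, "D": 3},
    "D": {"E": 3, "A": 3, "C": 6, "B": 3, "D": 0},
}
merges = single_link(["E", "A", "C", "B", "D"], D)
```

The recorded merges — (E, A) at 1, (B, C) at 1, (E, A, B, C) at 2, and D at 3 — are exactly the levels of the dendrogram above.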
*** EXTRA QUESTIONS ***

Q1] GIVE FIVE EXAMPLES OF APPLICATIONS THAT CAN USE CLUSTERING. DESCRIBE ONE CLUSTERING ALGORITHM WITH THE HELP OF AN EXAMPLE.

ANS:

CLUSTERING:

Refer Q1 from University Questions (Clustering Part).

APPLICATIONS OF CLUSTERING:

I) Clustering is used in many marketing applications such as market research, pattern recognition, data analysis, and image processing.
II) Clustering can also be used in classifying plants and animals into different classes based on their features.
III) In marketing, clustering helps to discover distinct groups of customers for targeted marketing.
IV) On the web, clustering is used for classifying documents and blog data.
V) In earthquake studies, clustering is used to identify dangerous zones based on observed earthquake epicenters.

CLUSTERING ALGORITHM:

Refer Q1 from University Questions (K-Means Clustering Algorithm).

Q2] EXPLAIN DIVISIVE HIERARCHICAL CLUSTERING.

ANS:

DIVISIVE HIERARCHICAL CLUSTERING:

1. The divisive algorithm is used in hierarchical clustering.
2. It is also known as DIANA (Divisive Analysis).
3. This approach is also known as the top-down approach.
4. In this, we start with all the objects in the same cluster.
5. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster or the termination condition holds.
6. Once a merging or splitting is done, it can never be undone.
AGGLOMERATIVE V/S DIVISIVE HIERARCHICAL CLUSTERING:

[Figure: on objects {1, 2, 3, 4, 5}, agglomerative (AGNES) proceeds bottom-up — merging to {1, 2}, {4, 5}, {3, 4, 5} and finally {1, 2, 3, 4, 5} — while divisive (DIANA) runs the same steps in reverse, top-down.]
CHAPTER - 10: MINING FREQUENT PATTERNS

Q1] FP TREE.

ANS:

FP TREE:

1. FP Tree stands for Frequent Pattern Tree.
2. It is a compact data structure which consists of one root labelled as "null" and a set of item-prefix subtrees as its children.
3. Each node in the item-prefix subtree consists of three fields:
   a. Item-name.
   b. Count.
   c. Node-link.
4. An FP tree stores the information needed for frequent pattern mining of a database.
5. But due to frequent pattern sharing, the size of the tree is usually much smaller than its original database.
6. Figure 10.1 shows the example of an FP Tree.

[Figure 10.1: Example of an FP Tree — a header table (Item_ID, support count, node-link: I2:7, I1:6, I3:6, …) points into a prefix tree rooted at null.]

Advantages:

> It compresses the data set.

Disadvantages:

> The FP Tree may not fit in main memory.
> It is expensive to build.

Q2] EXPLAIN MULTILEVEL & MULTIDIMENSIONAL ASSOCIATION RULE.

ANS: [5M-MAY16]

MULTILEVEL ASSOCIATION RULE:

> Rules which combine association with a hierarchy of concepts are called Multilevel Association Rules.
> In multilevel association rule mining, rules are mined at multiple levels of abstraction of the concept hierarchy.
> An item can be generalized or specialized as per the hierarchy of the concept.
> Figure 10.2 shows an example concept hierarchy, with branches EE, Computer, IT, Mechanical and leaf items such as Microprocessor, HMI, DBMS, SPCC.

Figure 10.2: Example of multilevel association rule.

MULTIDIMENSIONAL ASSOCIATION RULE:

> Rules which combine association with multiple dimensions are called Multidimensional Association Rules.
> In this, a rule contains two or more dimensions or predicates.
> There are two types: inter-dimension association rules and hybrid-dimension association rules.
■ Inter-dimension association rules: This rule does not have any repeated predicate. For example:
  Gender (X, "Male") ∧ Salary (X, "High") → Buys (X, "Computer")
■ Hybrid-dimension association rules: This rule has multiple occurrences of the same predicate, i.e. buys. For example:
  Gender (X, "Male") ∧ Buys (X, "Computer") → Buys (X, "DVD")

Q3] DISCUSS ASSOCIATION RULE MINING AND THE APRIORI ALGORITHM. APPLY AR MINING TO THE FOLLOWING DATA SET TO FIND ALL FREQUENT ITEMSETS AND ASSOCIATION RULES.

Minimum Support Count = 2
Minimum Confidence = 70%

Transaction_ID | Items
100 | 1, 2, 5
200 | 2, 4
300 | 2, 3
400 | 1, 2, 4
500 | 1, 3
600 | 1, 3
700 | 1, 3, 2, 5
800 | 1, 3
900 | 1, 2, 3

[10M-MAY16]
ANS:
ASSOCIATION RULE MINING:

1. Association rule mining is an important data mining process.
2. It finds interesting associations and relationships among large sets of data items; association rules govern the associations.
3. Association rule mining is a procedure which is meant to find frequent patterns, correlations and associations from data sets found in transactional databases, relational databases and other forms of data repositories.
4. Association rule mining has two types: multilevel association rules and multidimensional association rules. Refer Q2.

APRIORI ALGORITHM:

1. The Apriori algorithm is one of the classic algorithms for mining frequent itemsets.
2. Apriori uses a level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets.

Advantages:
> Easy to implement.

Disadvantages:
> Performance is low when many candidates are generated.
> It requires many database scans.
Given:
Minimum Support Count = 2
Minimum Confidence = 70%

Transaction_ID | Items
100 | 1, 2, 5
200 | 2, 4
300 | 2, 3
400 | 1, 2, 4
500 | 1, 3
600 | 1, 3
700 | 1, 3, 2, 5
800 | 1, 3
900 | 1, 2, 3

Solution:

Step-1: Scan the transaction database to get the support count of each 1-itemset.

Candidate List C1 = {1, 2, 3, 4, 5}

Itemset | Support Count
1 | 7
2 | 6
3 | 6
4 | 2
5 | 2

All itemsets satisfy the minimum support count 2, so L1 = {1, 2, 3, 4, 5}.

Step-2: Now generate candidate C2 from L1 and find the support count for each 2-itemset.

Itemset | Support Count
1,2 | 4
1,3 | 5
1,4 | 1
1,5 | 2
2,3 | 3
2,4 | 2
2,5 | 2
3,4 | 0
3,5 | 1
4,5 | 0

Now we compare candidate C2 with the minimum support count and prune those itemsets which do not satisfy the minimum support count.

Itemset | Support Count
1,2 | 4
1,3 | 5
1,5 | 2
2,3 | 3
2,4 | 2
2,5 | 2
Step-3: Generate candidate C3 from L2 and find the support count of each 3-itemset.

Itemset | Support Count
1,2,3 | 2
1,2,5 | 2
1,3,5 | 1
2,3,4 | 0
2,3,5 | 1
2,4,5 | 0

Now we compare candidate C3 with the minimum support count and prune those itemsets which do not satisfy it:

Itemset | Support Count
1,2,3 | 2
1,2,5 | 2

The frequent itemsets are {1, 2, 3} and {1, 2, 5}.

Step-4: Let us consider the frequent itemset {1, 2, 5} for rule generation.

Following are the association rules that can be generated, shown below with the support and confidence.

Association Rule | Support | Confidence | Confidence %
1 ∧ 2 => 5 | 2 | 2/4 | 50
1 ∧ 5 => 2 | 2 | 2/2 | 100
2 ∧ 5 => 1 | 2 | 2/2 | 100
1 => 2 ∧ 5 | 2 | 2/7 | 29
2 => 1 ∧ 5 | 2 | 2/6 | 33
5 => 1 ∧ 2 | 2 | 2/2 | 100

The minimum confidence threshold is 70%, so the following rules are considered as output, as they are strong rules.

Rules | Confidence
1 ∧ 5 => 2 | 100 %
2 ∧ 5 => 1 | 100 %
5 => 1 ∧ 2 | 100 %
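The level-wise search just performed by hand can be sketched as a compact Apriori implementation, run on the same nine transactions:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise Apriori: return every itemset (as a frozenset) whose
    support count is at least min_count, mapped to its count."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]        # candidate 1-itemsets
    while level:
        counts = {c: sum(c <= t for t in transactions) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # candidate (k+1)-itemsets: unions of surviving k-itemsets
        keys = list(survivors)
        k = len(keys[0]) + 1 if keys else 0
        level = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k})
    return frequent

T = [frozenset(t) for t in
     [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
      {1, 3}, {1, 2, 3, 5}, {1, 3}, {1, 2, 3}]]
freq = apriori(T, min_count=2)
```

With `min_count=2`, the largest frequent itemsets found are {1, 2, 3} and {1, 2, 5}, matching the hand computation; rule confidence then follows from the stored counts, e.g. conf(1 ∧ 5 ⇒ 2) = count({1, 2, 5}) / count({1, 5}).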

Q4] APPLY THE APRIORI ALGORITHM ON THE FOLLOWING TRANSACTION DATABASE WITH MIN-SUPPORT = 60% AND MIN-CONFIDENCE = 80%, AND FIND ALL FREQUENT ITEMSETS AND ASSOCIATION RULES.

TID | Items Bought

T-1000 M, O,N,K,E,Y

T-1001 D, O, N, K, E, Y

T-1002 M, A, K, E

T-1003 M, U, C, K, Y

T-1004 C, O, O, K, E

[10M-DEC16]

ANS:

Solution:

Step-1: Scan the transaction database to get the support count of each item.

Candidate List C1 = {A, C, D, E, K, M, N, O, U, Y}

Itemset | Support Count
A | 1
C | 2
D | 1
E | 4
K | 5
M | 3
N | 2
O | 4
U | 1
Y | 3

Minimum support = 60% of 5 transactions = support count 3.

Items satisfying the minimum support count form L1:

Itemset | Support Count
E | 4
K | 5
M | 3
O | 4
Y | 3

Step-2: Now generate candidate C2 from L1 and find the support count for each 2-itemset.

Itemset | Support Count
E,K | 4
E,M | 2
E,O | 3
E,Y | 2
K,M | 3
K,O | 3
K,Y | 3
M,O | 1
M,Y | 2
O,Y | 2
Now we compare candidate C2 with the minimum support count (3) and prune those itemsets which do not satisfy it.

Itemset | Support Count
E,K | 4
E,O | 3
K,M | 3
K,O | 3
K,Y | 3

Step-3: Generate candidate C3 from L2 and find the support counts.

Itemset | Support Count
E,K,M | 2
E,K,O | 3
E,K,Y | 2
E,O,Y | 2
K,M,O | 1
K,M,Y | 1

Now we compare candidate C3 with the minimum support count (3) and prune the itemsets which do not satisfy it. Only one 3-itemset survives:

Itemset | Support Count
E,K,O | 3

The frequent itemset is {E, K, O}.

Following are the association rules that can be generated from {E, K, O}, with support and confidence:

Association Rule | Support | Confidence | Confidence %
E ∧ K => O | 3 | 3/4 | 75
E ∧ O => K | 3 | 3/3 | 100
K ∧ O => E | 3 | 3/3 | 100
E => K ∧ O | 3 | 3/4 | 75
K => E ∧ O | 3 | 3/5 | 60
O => E ∧ K | 3 | 3/4 | 75

The minimum confidence threshold is 80%, so the following rules are considered as output, as they are strong rules.


Rules Confidence
E A 0 => K 100 %
K A 0 => E 100 %

CHAPTER - MISCELLANEOUS

Q1] DISCUSS HOW COMPUTATIONS CAN BE PERFORMED EFFICIENTLY ON DATA CUBES.

[10M-DEC16]
ANS:

DATA CUBE COMPUTATION:

1. Data cube computation is an essential task in data warehouse implementation.
2. The precomputation of all or part of a data cube can greatly reduce the response time and enhance the performance of online analytical processing.
3. However, such computation is challenging because it may require substantial computational time and storage space.

DATA CUBE COMPUTATION METHODS:

The Multi-way Array Aggregation method computes a full data cube by using a

> It is array based bottom up algorithm.


> It is a typical MOLAP approach that uses direct array addressing.
It uses multi-dimensional chunks.
Figure 1 shows Multi-way Array Aggregation exploration for a 3-D data cube computation.

bc

ABC

ex]
Figure 1: Multi-way Array Aggr e g a t i o n

Limitations:

II) BUC (Bottom-Up Computation):

> BUC computes the cube starting from the apex cuboid and expanding downward toward the base cuboid (e.g. through AB, AC, BC down to ABC).
> Unlike multi-way array aggregation, it allows Apriori-style pruning when computing iceberg cubes.

III) Star Cubing:

> Star-Cubing combines the strengths of Multi-way Array Aggregation and BUC.
> It integrates top-down and bottom-up cube computation.
> It operates from a data structure called a star-tree.

[Figure: Star-Cubing processing tree with shared dimensions (e.g. C/C, D/D, AC/AC, BC/BC) and pruned subtrees.]

Advantages:
> Reduces the computation of shared dimensions by pruning.
> Reduces memory requirements.
Q1] (a) … 300 stores reporting daily sales … [10]
Ans: [Chapter - …]
(b) Discuss:
i. The steps in KDD Process.
ii. The architecture of a typical DM system. [10]
Ans: [Chapter - 5]

Q2] (a) … [10]
Ans: [Chapter - 4]
(b) A simple example from the stock market involving only discrete ranges has profit as
categorical attribute, with values {Up, Down} and the training data set is given below. [10]

Age Competition Type Profit


Old Yes Software Down
Old No Software Down

Old No Hardware Down

Mid Yes Software Down

Mid Yes Hardware Down

No Hardware Up
Mid
No Software Up
Mid
Yes Software Up
New
Hardware Up
New No
Software Up
New No

Apply decision tree algorithm and show the generated rules.

Ans: [Chapter - 8]

Q3] (a) Illustrate the architecture of a typical DW system. Differentiate DW and Data Mart. [10]
Ans: [Chapter - 1]

Topper's Solutions
Semester - 8
Question Papers
(b) Discuss different steps involved in Data Preprocessing. [10]

Ans: [Chapter - 7]
Q4] (a) Discuss various OLAP Models. [10]
fb) eX K" clustering algorithm? Apply K-Means Al)
with two clusters. Data Set = { 1, 2, 6, 7, 8, 1 0, 1 5, 1 7, 20}
[10]

Ans: [Chapter -9]
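A minimal worked sketch for this question, assuming the minimum and maximum values (1 and 20) as the initial centroids since the question does not specify them:

```python
def k_means_1d(points, centroids, max_iter=100):
    """Minimal 1-D K-Means: assign each point to its nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:   # converged: assignments stopped changing
            break
        centroids = new_centroids
    return clusters, centroids

data = [1, 2, 6, 7, 8, 10, 15, 17, 20]
clusters, centroids = k_means_1d(data, [1.0, 20.0])
print(clusters)   # [[1, 2, 6, 7, 8, 10], [15, 17, 20]]
print(centroids)  # final means ~5.67 and ~17.33
```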


Q5] (a) Describe the steps of ETL Process. [10]

Ans: [Chapter - 3]

(b) Find all frequent itemsets and association rules for the following transactions using the Apriori Algorithm. [10]

Minimum Support Count = 2


Minimum Confidence = 70%
Transaction ID   Items
100              1, 2, 5
200              2, 4
300              2, 3
400              1, 2, 4
500              1, 3
600              1, 3
700              1, 3, 2, 5
800              1, 3
900              1, 2, 3
Ans: [Chapter - 10]
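A compact Apriori sketch over the transactions above (item numbers as in the table; minimum support count = 2):

```python
from itertools import combinations

transactions = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
                {1, 3}, {1, 3, 2, 5}, {1, 3}, {1, 2, 3}]
MIN_SUPPORT = 2

def apriori(transactions, min_support):
    """Level-wise search: count candidates, keep those meeting min support,
    then join survivors and prune candidates with an infrequent subset."""
    items = {i for t in transactions for i in t}
    frequent = {}
    level = [frozenset([i]) for i in sorted(items)]
    k = 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join step, then Apriori pruning: every k-subset must be frequent
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k + 1}
        level = [c for c in candidates
                 if all(frozenset(s) in survivors for s in combinations(c, k))]
        k += 1
    return frequent

freq = apriori(transactions, MIN_SUPPORT)
print(freq[frozenset({1, 2, 5})])  # support count 2 (transactions 100 and 700)
```

Association rules then follow by checking confidence against each frequent itemset's subsets; e.g. {2, 5} → {1} has confidence support({1, 2, 5}) / support({2, 5}) = 2/2 = 100%, which satisfies the 70% threshold.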

Q6] Write short notes on any four of the following: [20]

(a) Updates to Dimension tables.
Ans: [Chapter - 2]
(b) Metrics for Evaluating Classifier Performance.

Ans: [Chapter - 8 ]
(c) FP Tree.
Ans: [Chapter - 10]
(d) Multilevel & Multidimensional Association Rule.
Ans: [Chapter - 10]
(e) Operational Vs. Decisional Support System.
Ans: [Chapter - 1]


Q1] (a) Explain multiple regression. Plan a regression model for Disease development with change in weather parameters. [10]
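A hedged sketch of such a regression model, fitted by ordinary least squares on hypothetical (synthetic, noise-free) weather readings; the variable names and coefficients below are illustrative only, not real epidemiological values:

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y,
    solved with Gaussian elimination -- fine for a handful of predictors."""
    X = [[1.0] + list(row) for row in X]          # prepend intercept column
    n, p = len(X), len(X[0])
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)] for i in range(p)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    for i in range(p):                            # forward elimination with pivoting
        pivot = max(range(i, p), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        b[i], b[pivot] = b[pivot], b[i]
        for r in range(i + 1, p):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * c for a, c in zip(A[r], A[i])]
            b[r] -= f * b[i]
    beta = [0.0] * p
    for i in reversed(range(p)):                  # back substitution
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]
    return beta

# Hypothetical observations: (temperature in deg C, humidity %) -> disease index
temp_hum = [(20, 60), (25, 70), (30, 80), (22, 65), (28, 75), (35, 90)]
index = [2 + 0.5 * t + 0.3 * h for t, h in temp_hum]   # synthetic, noise-free target
beta = ols(temp_hum, index)
print([round(c, 3) for c in beta])  # recovers [2.0, 0.5, 0.3]
```

The fitted model is index = b0 + b1·temperature + b2·humidity; with real data the residuals would not vanish and the coefficients would come with standard errors.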

(b) What is meant by metadata in the context of a Data Warehouse? Explain the different types of metadata stored in a data warehouse. Illustrate with a suitable example. [10]
Ans: [Chapter - 1]
(c) Describe the various functionalities of Data Mining as a step in the process of knowledge discovery. [10]

Ans: [Chapter - 5]

Q3] [a] In what way can the ETL cycle be used in a typical data warehouse? Explain with a suitable instance. [10]
Ans: [Chapter - 3]
[b] What are Clustering Techniques? Discuss the Agglomerative algorithm with the following data and plot a Dendrogram using the single link approach. The table below comprises sample data items indicating the distance between the elements. [10]
Item E A C B D
E 0 1 2 2 3
A 1 0 2 5 3
C 2 2 0 1 6

B 2 5 1 0 3

D 3 3 6 3 0

Ans: [Chapter - 9]
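The merge sequence for the dendrogram can be checked with the following single-link sketch over the distance table above: E and A merge at height 1, C and B merge at height 1, the two pairs merge at height 2, and D joins last at height 3.

```python
ITEMS = ['E', 'A', 'C', 'B', 'D']
D = {('E','A'): 1, ('E','C'): 2, ('E','B'): 2, ('E','D'): 3,
     ('A','C'): 2, ('A','B'): 5, ('A','D'): 3,
     ('C','B'): 1, ('C','D'): 6, ('B','D'): 3}

def dist(x, y):
    return D.get((x, y), D.get((y, x)))

def single_link(items):
    """Agglomerative clustering with single (minimum) linkage.
    Returns the merge sequence used to draw the dendrogram."""
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        # Single link: cluster distance = minimum pairwise item distance
        d, a, b = min(((dist(x, y), a, b)
                       for i, a in enumerate(clusters)
                       for b in clusters[i + 1:]
                       for x in a for y in b),
                      key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a | b), d))
    return merges

merges = single_link(ITEMS)
for members, height in merges:
    print(members, height)
```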
Q4] (a) Discuss how computations can be performed efficiently on data cubes. [10]
Ans: [Chapter - Miscellaneous]
[b] A database has five transactions. Let min-support = 60% and min-confidence = 80%. Find all frequent itemsets using the Apriori Algorithm. TID is the transaction ID. [10]

TID      Items Bought
T-1000   M, O, N, K, E, Y


T-1001   D, O, N, K, E, Y

T-1002 M, A, K, E

T-1003 M, U, C, K, Y

T-1004   C, O, O, K, E

Ans:[Chapter - 10]

Q5] [a] Differentiate [10]


i. OLTP Vs. OLAP.
ii. Data Warehouse Vs. Data Mart.
Ans:[Chapter - 4]
(b] Why is Naive Bayesian Classification called "naive"? Briefly outline the major ideas of naive Bayesian Classification. [10]
Ans: [Chapter - 8]
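The "naive" independence assumption can be seen directly in code: each class is scored as its prior times a product of per-feature likelihoods, as if the features were conditionally independent given the class. The sketch below reuses the stock-market training table from the earlier decision-tree question; note that a zero frequency (no 'New' row labelled Down) collapses the Down score to zero, which is why Laplace smoothing is used in practice.

```python
from collections import Counter

# Stock-market training data: (Age, Competition, Type) -> Profit
train = [('Old','Yes','Software','Down'), ('Old','No','Software','Down'),
         ('Old','No','Hardware','Down'), ('Mid','Yes','Software','Down'),
         ('Mid','Yes','Hardware','Down'), ('Mid','No','Hardware','Up'),
         ('Mid','No','Software','Up'),   ('New','Yes','Software','Up'),
         ('New','No','Hardware','Up'),   ('New','No','Software','Up')]

def naive_bayes(sample, data):
    """Score each class C as P(C) * product over features i of P(x_i | C).
    The product form IS the 'naive' conditional-independence assumption."""
    labels = [r[-1] for r in data]
    scores = {}
    for c, prior in Counter(labels).items():
        p = prior / len(data)
        rows = [r for r in data if r[-1] == c]
        for i, value in enumerate(sample):
            p *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[c] = p
    return scores

scores = naive_bayes(('New', 'Yes', 'Hardware'), train)
print({c: round(s, 3) for c, s in scores.items()})
# {'Down': 0.0, 'Up': 0.024} -> classified as Up
```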

Q6] Write short notes on any four of the following: [20]


(a) Application of Data Mining to Financial Analysis.
Ans: [Chapter - 5]
(b) Factless Fact Table.
Ans: [Chapter - 2]
(c) Indexing OLAP Data.
Ans: [Chapter - 4]
(d) Data Quality.
Ans: [Chapter - 3]
(e) Decision Tree based Classification Approach.
Ans:[Chapter - 8]

Other Subjects

Final Year Projects are also Available @ Topper's Solutions
