Toppers Solution
&m~£ ( Computer]
Scanned by CamScanner
Topper's Solutions
Semester -
in)n
Data Warehousing &. 3
Page No.
Syllabus
# Chapters
01
Introduction to
Data ,?==£=====
Warehousing
of Data Warehousing; Features o classification of
information F!oW Mechanism; Ro e data
Metadata; Data Warehouse Arch.te Wareh ousing
Architecture; Data Warehouse and Data
Design Strategies.
Dimensional Modeling: Data Warehouse Modeling Vs Operational Database Modeling; Dimensional Model Vs ER Model; Features of a Good Dimensional Model; The Star Schema; How Does a Query Execute? The
Online Analytical Processing (OLAP): Multidimensional Analysis; Hypercubes; OLAP Operations in Multidimensional Data Model; OLAP Models: MOLAP, ROLAP, HOLAP, DOLAP;
8.1 ,ass,
‘ication methods: 36
1. Decision Tree Induction: Attribute Selection Measures,
8.3
Holdout, Random Sampling, Cross Validation, Bootstrap;
8.4
Marks Distribution
15 15
Introduction to Data Warehousing.
15 15
Dimensional Modeling.
10 15
ETL Process.
20 10
Online Analytical Processing [OLAP].
Data Exploration.
Data Preprocessing. 10
8. Classification. 15 25
Clustering. 10 10
Miscellaneous.
10
Repeated Questions
20
CHAPTER - 1: INTRODUCTION TO DATA WAREHOUSING
DATA WAREHOUSE:
5. It is considered as a core component of business intelligence.
[Figure: data warehouse components, showing the Operational Source, the Load and Query stages, Detailed data, Archive/Backup and Analysis.]
I) Operational Source:
An Operational Source is a data source that consists of Operational Data and External Data.
> Data can come from Relational DBMS like Oracle, Informix.
De-normalization.
no
V] Detailed Data:
The Detailed and Summarized Data are stored for the purpose of archiving and backup.
The data is transferred to storage archives such as magnetic tapes or optical disks.
VIII) Metadata:
Table 1.1 shows the difference between Data Warehouse and Data Mart.
Parameter          | Data Warehouse                               | Data Mart
Data Sources Used  | Many Data Sources are required.              | Few Data Sources are required.
Data Available     | Data is historical, detailed and summarized. | Data consists of some history, detailed and summarized.
Q3] WHAT IS MEANT BY METADATA IN THE CONTEXT OF A DATA WAREHOUSE? EXPLAIN THE DIFFERENT TYPES OF METADATA STORED IN A DATA WAREHOUSE. ILLUSTRATE WITH A SUITABLE EXAMPLE.
ANS: [10M-DEC16]
METADATA:
EXAMPLE
[Figure: metadata serving the Data Warehouse, Query Tools, OLAP Tools, Applications, Extraction Tools and Source Systems.]
Figure 1.2: Roles of Metadata.
TYPES OF METADATA:
Metadata in a data warehouse falls into three major categories, as shown in figure 1.3.
Types of Metadata
> Information about the operational data sources of the data warehouse is given by Operational Metadata.
> The end-user metadata is the navigational map of the data warehouse.
> It enables the end-users to find information from the data warehouse.
> The end-user metadata allows the end-users to use their own business terminology and look for
information in those ways in which they normally think of the business.
EXAMPLE OF METADATA:
ANS: [5M-MAY16]
Table 1.2 shows the difference between Operational System and Decisional Support System.
Table 1.2: Comparison between Operational System and Decisional Support System.
Operational System:
> There is no data redundancy.
> It has repetitive access.
> It has up to date data.
> Database size is 100 MB - 100 GB.
Decisional Support System:
> It has ad hoc access.
> It has snapshot data.
> Database size is 100 GB - few TB.
QU
ANS:
[Figure: operational data sources pass through Extract, Transform, Load and Refresh into the Data Warehouse, which feeds the Data Marts.]
Figure 1.4: Top Down Approach.
Figure 1.4 shows the Top Down Approach for Data Warehouse.
In this approach, the data flow begins with data extraction from the operational data sources.
Disadvantages:
> Time consuming process.
> Failure risk is very high.
BOTTOM UP APPROACH:
[Figure: Extract, Transform, Load and Refresh stages feeding the Data Warehouse in the Bottom Up Approach.]
3. The data flow begins with extraction of data from operational databases into the staging area.
4. Data is then loaded into Operational Data Store (ODS).
5. The data in ODS is appended to or replaced by the fresh data being loaded.
6. Once the ODS is refreshed the current data is once again extracted into the staging area.
7. It is then processed to fit into the data mart structure.
8. The data from the data mart is then extracted to the staging area, aggregated, summarized and so on.
9. It is then loaded into the data warehouse.
10. Finally it is made available to the end user for analysis.
Advantages:
> Data Marts can be delivered more quickly.
> Risk of failure is low.
Disadvantages:
> Redundancy of data in data mart.
> It preserves inconsistent and incompatible data.
rehouse architecture
ANS:
DATA WAREHOUSE:
,|atile collection o f data.
1. Data Warehouse is constructed by integ
2.
It was defined by Bill Inmon in 1990.
Data Warehouse is a system used fol i
It is considered a core component of business intelligence.
[Figure: data warehouse architecture. Data Sources (Operational DBs; Extract, Transform, Load, Refresh), Data Storage (Data Warehouse, Metadata, Monitor & Integrator), OLAP Engine (OLAP Server), Front End Tools (Analysis, Query, Reports, Data Mining).]
I) Bottom Tier:
> Bottom Tier usually consists of Data Sources and Data Storage.
> It is the warehouse database server. For example: RDBMS.
> In Bottom Tier, using application program interface, data is extracted from
CHAPTER - 2: DIMENSIONAL MODELING
Q2] I
ANS:
ANS: 1. ;
FACT TABLE:
> It captures events that happen only at information level.
> A Factless fact table captures the many-to-many relationships between dimensions.
> Factless fact tables are used for tracking a process or collecting stats.
Q2] UPDATES TO DIMENSION TABLES.
ANS: [5M-MAY16]
DIMENSION TABLE:
1. Over the time, every day as more and more sales take place, more and more rows get added to
the fact table.
3. Now consider the dimension tables. Compared to the fact table, the dimension tables are more
stable and less volatile.
4. Dimension table changes due to change in attributes themselves but not because of increase in
number of rows.
5. Types of changes that affect dimension tables are as follows:
> Dimensions are generally constant over time; when they are not constant, they change slowly.
> Example: The Customer ID of a record remains the same, but the marital status or location of the customer may change over time.
> There are three different types:
■ Type 1 Change: It is related to correction of errors in source systems and changes are
not preserved.
■ Type 2 Change: It is related to the true changes in source systems and changes are
preserved.
■ Type 3 Change: It is related to tentative changes in the source systems and changes are
preserved.
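The three change types above can be sketched in code. This is a minimal illustration only: the column names (surrogate_key, customer_id, is_current, start_date, end_date, previous_<attribute>) and the sample customer are assumptions made for the example, not part of the original answer.

```python
from copy import deepcopy

def apply_type1(rows, key, attr, value):
    """Type 1: overwrite the attribute in place; the old value is not preserved."""
    for row in rows:
        if row["customer_id"] == key:
            row[attr] = value
    return rows

def apply_type2(rows, key, attr, value, effective_date):
    """Type 2: close the current row and add a new row, so history is preserved."""
    current = next(r for r in rows if r["customer_id"] == key and r["is_current"])
    current["is_current"] = False
    current["end_date"] = effective_date
    new_row = dict(current, **{attr: value, "is_current": True,
                               "start_date": effective_date, "end_date": None})
    new_row["surrogate_key"] = max(r["surrogate_key"] for r in rows) + 1
    rows.append(new_row)
    return rows

def apply_type3(rows, key, attr, value):
    """Type 3: keep the old value in an extra 'previous_<attr>' column."""
    for row in rows:
        if row["customer_id"] == key:
            row["previous_" + attr] = row[attr]
            row[attr] = value
    return rows

# Hypothetical customer dimension with a single row.
base = [{"surrogate_key": 1, "customer_id": "C1", "marital_status": "Single",
         "is_current": True, "start_date": "2015-01-01", "end_date": None}]

t1 = apply_type1(deepcopy(base), "C1", "marital_status", "Married")
t2 = apply_type2(deepcopy(base), "C1", "marital_status", "Married", "2016-05-01")
t3 = apply_type3(deepcopy(base), "C1", "marital_status", "Married")
```

Type 1 keeps one row, Type 2 grows the table by one row per change, and Type 3 keeps one row but adds a column.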
> Move the rapidly changing attributes into another dimension table, leaving the original table with slowly changing attributes.
ii.
Time period - 5 Years.
Product -
[10M-MAY16]
ANS:
STAR SCHEMA:
1. Star Schema is the most popular schema design for a Data Warehouse.
2. It is called a star schema because the diagram resembles a star, with points radiating from a
center.
3. The center of the star consists of fact table and the points of the star are the dimension tables.
4. Usually the fact tables in a star schema are in third normal form (3NF) whereas dimensional
tables are de-normalized.
Figure 2.2 shows the Star Schema for Super Market Chain
Time Period = 5 Years x 365 Days = 1825
Promotion = 1
Maximum No. of Fact Table Records = Time Period x No. of Stores x Daily Sale x Promotion
= 1825 x 300 x 4000 x 1
= 2,190,000,000
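The arithmetic above can be checked with a few lines of code. This is only a sketch; the function name and the 365-day year are assumptions chosen to match the figures used in the calculation.

```python
def max_fact_rows(years, stores, daily_sales, promotions, days_per_year=365):
    """Upper bound on fact table rows: one row per day, store, sale and promotion."""
    return years * days_per_year * stores * daily_sales * promotions

rows = max_fact_rows(years=5, stores=300, daily_sales=4000, promotions=1)
print(f"{rows:,}")
```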
Q4] CONSIDER FOLLOWING DIMENSIONS FOR A HYPERMARKET CHAIN: PRODUCT, STORE, TIME AND PROMOTION. ANSWER THE FOLLOWING QUESTIONS. CLEARLY STATE ANY ASSUMPTIONS. DESIGN A STAR SCHEMA. WHETHER THE STAR SCHEMA CAN BE CONVERTED TO SNOWFLAKE SCHEMA? JUSTIFY YOUR ANSWER AND DRAW SNOWFLAKE SCHEMA FOR THE DATA WAREHOUSE (DIMENSION TABLE(S), THEIR ATTRIBUTES AND MEASURES).
[10M-MAY16]
ANS:
STAR SCHEMA:
1. Star Schema is the most popular schema design for a Data Warehouse.
2. It is called a star schema because the diagram resembles a star, with points radiating from a center.
3. The center of the star consists of fact table and the points of the star are the dimension tables.
4. Usually the fact tables in a star schema are in third normal form (3NF) whereas dimensional tables are de-normalized.
Yes the above star schema can be converted to snowflake schema by considering the following
assumptions:
1. Product can be classified into category and subcategory.
2. Store belongs to a region, and a region dimension is not added in star schema.
3. Time Dimensions can be further divided into Month, Quarter and Year.
4. Promotion can be further classified into types.
SNOWFLAKE SCHEMA:
1. The snowflake schema is an extension of the star schema, where each point of the star explodes
into more points.
2. In a star schema, each dimension is represented by a single dimensional table.
3. Whereas in a snowflake schema, that dimensional table is normalized into multiple lookup
tables, each representing a level in the dimensional hierarchy.
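The normalization step described above can be sketched as follows. This is a minimal illustration: the function name and the sample store rows are assumptions, and only one level of the hierarchy (Store to Region) is split out.

```python
def snowflake(dimension_rows, parent_key, parent_attrs):
    """Split a de-normalized dimension into a slimmer table plus a lookup table."""
    lookup, slim = {}, []
    for row in dimension_rows:
        # One lookup row per distinct parent key (for example, per region).
        lookup[row[parent_key]] = {a: row[a] for a in (parent_key, *parent_attrs)}
        # The dimension row keeps only the foreign key to the lookup table.
        slim.append({k: v for k, v in row.items() if k not in parent_attrs})
    return slim, list(lookup.values())

# Hypothetical de-normalized Store dimension, as it would appear in a star schema.
store_dim = [
    {"store_key": 1, "store_name": "S1", "region_id": 10, "region_name": "West"},
    {"store_key": 2, "store_name": "S2", "region_id": 10, "region_name": "West"},
    {"store_key": 3, "store_name": "S3", "region_id": 20, "region_name": "South"},
]

store, region = snowflake(store_dim, "region_id", ["region_name"])
```

The region name is now stored once per region instead of once per store, which is the redundancy a snowflake schema removes at the cost of an extra join.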
[Figure: snowflake schema for the hypermarket chain, with the following tables and attributes.]
Product: Product Key, Product Description, Product Category ID, Product Subcategory ID, Brand Name
Subcategory: Subcategory ID, Subcategory Name
Store: Store Key, Store Name, Address, City, State, ZIP, Region ID
Region: Region ID, Region Name
Promotion: Promotion Key, Promotion Name, Promotion Type, Promotion Cost, Start Date, End Date, Responsible Manager
Promotion Type: Promotion Type, Promotion Duration
Time: Time Key, Date, Month ID
Month: Month ID, Month Name, Quarter Name
Year: Year ID, Year Name
CHAPTER - 3: ETL PROCESS
Q1] DESCRIBE THE STEPS IN ETL PROCESS.
ANS:
ETL:
> ETL stands for Extract, Transform and Load.
> It is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.
ETL PROCESS:
[Figure: ETL process. Data from sources such as Oracle, SQL Server, Teradata and flat files is extracted, transformed and loaded.]
> In this step, data is extracted from the source system.
> Data is also made accessible for further processing.
> The extract step should be designed in a way that it does not negatively affect the source system.
> Most data projects consolidate data from different source systems.
> Each separate source uses a different format.
transformation.
format.
standards.
For examp
-mm-dd.
- Enriching (e.g. Full Name to First Name, Middle Name, Last Name).
In some cases data does not need any transformation; such data is said to be "rich data", "direct move" or "pass through" data.
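Two of the transformations mentioned above, standardizing a date format and splitting a full name, can be sketched as below. This is an illustrative sketch only: the field names, the dd/mm/yyyy source format and the sample record are assumptions.

```python
from datetime import datetime

def standardize_date(text, source_format="%d/%m/%Y"):
    """Convert a date string from the source format to yyyy-mm-dd."""
    return datetime.strptime(text, source_format).strftime("%Y-%m-%d")

def split_full_name(full_name):
    """Split a full name into first, middle and last name parts."""
    parts = full_name.split()
    return {
        "first_name": parts[0],
        "middle_name": " ".join(parts[1:-1]),
        "last_name": parts[-1] if len(parts) > 1 else "",
    }

def transform(record):
    """Apply both transformations to one source record."""
    out = split_full_name(record["full_name"])
    out["joined_on"] = standardize_date(record["joined_on"])
    return out

# Hypothetical source record.
clean = transform({"full_name": "Asha Ravi Kumar", "joined_on": "25/12/2016"})
```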
III) Loading:
Q2] IN WHAT WAY ETL CYCLE CAN BE USED IN TYPICAL DATA WAREHOUSE, EXPLAIN WITH SUITABLE INSTANCE.
ANS:
ETL:
> ETL stands for Extract, Transform and Load.
> It is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.
ETL CYCLE:
1. Initiation of cycle.
2. Building reference data.
3. Extracting data from different sources.
4. Validation of data.
5. Transforming data.
6. Staging of data.
7. Generation of audit reports.
8. Publishing data.
9. Archiving.
10. Cleanup.
ETL PROCESS:
Refer Q1.
ANS:
DATA QUALITY:
as a
1. :nt the value of itself.
a. Dummy Values.
b. Absence of Data.
c. Non-Unique Identifiers.
d. Cryptic Data.
The Data Quality Cycle
I) Data Discovery: It is the process of finding, gathering, organizing and reporting metadata about
data.
II) Data Profiling: It is the process of analyzing data in detail, comparing the data to its metadata,
calculating data statistics and reporting the measures of quality for the data.
III) Data Quality Rules: Based on the business requirements for each Data Quality measure, the data
V) Data Quality Reporting: Dashboards and scorecards are used to report Data Quality measures.
VI) Data Remediation: It is the ongoing correction of Data Quality exceptions and issues as they are reported.
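A data profiling pass that counts three of the quality problems listed earlier (dummy values, absence of data and non-unique identifiers) can be sketched as follows. The set of dummy values, the field names and the sample records are assumptions made for the illustration.

```python
from collections import Counter

# Hypothetical placeholder values that a source system might use as dummies.
DUMMY_VALUES = {"N/A", "999-99-9999", "XXXX"}

def profile(records, id_field):
    """Count missing values, dummy values and duplicated identifiers."""
    report = {"missing": Counter(), "dummy": Counter(), "duplicate_ids": []}
    for rec in records:
        for field, value in rec.items():
            if value is None or value == "":
                report["missing"][field] += 1
            elif value in DUMMY_VALUES:
                report["dummy"][field] += 1
    id_counts = Counter(rec[id_field] for rec in records)
    report["duplicate_ids"] = sorted(i for i, n in id_counts.items() if n > 1)
    return report

records = [
    {"cust_id": 1, "phone": "N/A", "city": "Mumbai"},
    {"cust_id": 2, "phone": "", "city": None},
    {"cust_id": 2, "phone": "555-0100", "city": "Pune"},
]
report = profile(records, "cust_id")
```

Counts like these are the kind of measure that a Data Quality dashboard or scorecard would report.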
CHAPTER - 4 : ONLINE ANALYTICAL PROCESSING (OLAP)
ANS: [10M-MAY16]
OLAP:
OLAP MODELS:
I) MOLAP:
[Figure: MOLAP architecture. MDDB, MOLAP Engine with proprietary data language, Application Layer, MDBMS Server.]
Advantages:
> It can perform complex calculations.
> It has excellent performance.
Disadvantages:
> It can handle limited amount of data.
> It requires additional investment.
II) ROLAP:
> ROLAP stands for Relational OLAP.
> ROLAP uses relational or extended relational DBMS.
> ROLAP servers are placed between relational back-end server and client front-end tools.
> Figure 4.3 shows ROLAP Process.
Advantages:
> It has higher scalability.
> It can handle large amount of data.
Disadvantages:
> Performance is slow.
> Limited SQL functionality.
> Complex SQL.
III) HOLAP:
DLQLAI1;
[5M-DEC16]
Indexing is used to quickly locate data without having to search every row in a database.
Indexing provides the basis for both rapid random lookups and efficient access of ordered records.
Indexing OLAP Data includes Bitmap Index and Join Indices.
Bitmap Index:
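A bitmap index keeps one bit vector per distinct value of a column, with bit i set when row i holds that value. The sketch below assumes a small hypothetical table with Location and Item columns; the function names are also assumptions.

```python
def build_bitmap_index(rows, column):
    """Return {value: bit vector}, with one bit per row of the table."""
    index = {}
    for position, row in enumerate(rows):
        index.setdefault(row[column], [0] * len(rows))[position] = 1
    return index

def bitmap_and(a, b):
    """Combine two bit vectors; a 1 marks rows that satisfy both conditions."""
    return [x & y for x, y in zip(a, b)]

# Hypothetical base table.
rows = [
    {"location": "Mumbai", "item": "Phone"},
    {"location": "Pune", "item": "Phone"},
    {"location": "Mumbai", "item": "Modem"},
    {"location": "Pune", "item": "Modem"},
]
by_location = build_bitmap_index(rows, "location")
by_item = build_bitmap_index(rows, "item")

# Rows where location = Mumbai AND item = Phone, found without scanning the rows.
match = bitmap_and(by_location["Mumbai"], by_item["Phone"])
```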
[10M-MAY16]
ANS:
OLAP OPERATIONS:
1. OLAP Operations are performed on multi-dimensional databases.
2. Since OLAP servers are based on a multidimensional view of data, OLAP operations are performed on multidimensional data.
I) Roll-up:
[Figure: roll-up on location (from cities to countries), with time in quarters Q1 to Q4.]
II) Drill-down:
[Figure: drill-down on time (from quarters to months), for the cities Chicago, New York, Toronto and Vancouver.]
III) Slice:
> The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
> Consider the following figure 4.8 that shows how slice works.
> Here slice is performed for the dimension "time" using the criterion time = "Q1".
> It will form a new sub-cube by selecting one or more dimensions.
[Figure 4.8: slice for time = "Q1" on a cube with dimensions location (Chicago, New York, Toronto, Vancouver), time (quarter) and item (Mobile, Modem, Phone, Security).]
IV) Dice:
> Dice selects two or more dimensions from a given cube and provides a new sub-cube.
> Consider the following figure 4.9 that shows the dice operation.
[Figure 4.9: dice operation on a cube with dimensions location, time (quarter) and item (types).]
The dice operation on the cube based on the following selection criteria involves three dimensions.
■ Location = "Toronto" or "Vancouver"
■ Time = "Q1" or "Q2"
■ Item = "Mobile" or "Modem"
V) Pivot:
> The pivot operation is also known as rotation.
> In this the item and location axes in 2-D slice are rotated.
[Figure: pivot of a 2-D slice. The Vancouver row (Mobile 605, Modem 825, Phone 14, Security 400) becomes a column after rotation.]
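The roll-up, slice, dice and pivot operations can be sketched on a tiny cube stored as a dictionary keyed by (location, quarter, item). The Vancouver Q1 values are taken from the figure; every other cell and the city-to-country mapping are made-up values for the illustration.

```python
from collections import defaultdict

COUNTRY = {"Chicago": "USA", "New York": "USA", "Toronto": "Canada", "Vancouver": "Canada"}

cube = {
    ("Vancouver", "Q1", "Mobile"): 605, ("Vancouver", "Q1", "Modem"): 825,
    ("Vancouver", "Q1", "Phone"): 14, ("Vancouver", "Q1", "Security"): 400,
    ("Toronto", "Q1", "Mobile"): 100,   # hypothetical
    ("Toronto", "Q2", "Modem"): 200,    # hypothetical
    ("Chicago", "Q3", "Phone"): 300,    # hypothetical
}

def roll_up(cube, mapping):
    """Roll-up on location: aggregate cities into countries."""
    out = defaultdict(int)
    for (loc, quarter, item), value in cube.items():
        out[(mapping[loc], quarter, item)] += value
    return dict(out)

def slice_time(cube, quarter):
    """Slice: fix the time dimension, leaving a 2-D (location, item) table."""
    return {(loc, item): v for (loc, q, item), v in cube.items() if q == quarter}

def dice(cube, locations, quarters, items):
    """Dice: keep the cells matching a selection on all three dimensions."""
    return {k: v for k, v in cube.items()
            if k[0] in locations and k[1] in quarters and k[2] in items}

def pivot(table):
    """Pivot: rotate the two axes of a 2-D table."""
    return {(b, a): v for (a, b), v in table.items()}

by_country = roll_up(cube, COUNTRY)
q1 = slice_time(cube, "Q1")
sub_cube = dice(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"}, {"Mobile", "Modem"})
rotated = pivot(q1)
```

Drill-down is the reverse of roll-up and needs the finer-grained data (for example monthly values), so it is not derivable from this quarterly cube.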
ANS: [5M-DEC16]
0&P
Data Redundancy
1 0 MB to GB 100 GB toTB.
High Flexibility.
Access
Mostly Read.
Function
DATA MINING:
Data Mining is defined as tl
It is a non-trivial process.
The main goal is to extract knowledge from large database.
KDD includes a wide variety of application domains, including Artificial Intelligence, Pattern Recognition, Machine Learning, Statistics and Data Visualization.
Figure 5.1 shows the KDD Process.
[Figure 5.1: KDD process. Data, Selection, Target Data, Preprocessing, Preprocessed Data, Transformation, Transformed Data, Data Mining, Interpretation/Evaluation, Knowledge.]
Data Cleaning:
Data Integration:
III) Data Selection:
> In this step, data relevant to the analysis task are retrieved from the database.
IV) Data Transformation:
V) Data Mining:
> In this step, intelligent methods are applied in order to extract data patterns.
VI) Pattern Evaluation:
> In this step, data patterns are evaluated.
VII) Knowledge Presentation:
> In this step, knowledge is presented to users.
Q2] DISCUSS:
ANS:
[10M-MAY16]
DATA MINING:
Refer Q1.
KDD PROCESS:
Refer Q1.
I)
analysis etc.
It is integrated with the mining module and helps to focus the search on only the interesting patterns.
This module is used to communicate between user and the data mining system.
[Figure: architecture of a data mining system. Database and Data Warehouse; Data Cleansing, Data Integration and Filtering; Data Mining Engine; Pattern Evaluation; Knowledge Base.]
APPLICATION OF DATA MINING TO FINANCIAL ANALYSIS.
ANS:
DATA MINING:
> The data in the financial industry is generally reliable and of high quality, so it facilitates data analysis and data mining.
> Some of the typical cases are as follows:
■ Targeted marketing.
■ Biological Data Analysis.
■ Other Scientific Applications.
ANS:
DATA PREPROCESSING:
I) Data Cleaning:
> It involves filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
> Steps in data cleansing:
■ Parsing: Parsing is the process in which individual data elements are located and identified
in the source systems and then these elements are isolated in the target files.
■ Correcting: In this step, using data algorithm the individual data elements are corrected.
■ Standardizing: In standardizing process, conversion routines are used to transform data into
records.
■ Consolidating: Consolidating process involves merging the records into one representation
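Three of the cleaning tasks named under Data Cleaning (filling missing values, smoothing noisy data and removing outliers) can be sketched as below. The specific techniques chosen here (mean imputation, smoothing by bin means and a z-score cut-off) are common options picked for the illustration, not methods stated in the original answer, and the sample values are made up.

```python
from statistics import mean, stdev

def fill_missing(values):
    """Replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    m = mean(known)
    return [m if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_values = ordered[i:i + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

def remove_outliers(values, z=1.5):
    """Drop values lying more than z sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if s == 0 or abs(v - m) <= z * s]

filled = fill_missing([10, None, 20])
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24], bin_size=3)
kept = remove_outliers([10, 12, 11, 13, 12, 100])
```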
> Figure 7.1 shows example of data cleaning process.
[Figure 7.1: Data Cleaning Process (dirty data is cleaned into clean data).]
II) Data Integration:
[Figure: several data sources are combined into Integrated Data.]
Figure 7.2: Data Integration Process.
III) Data Transformation:
Data transformation involves:
■ Aggregation.
■ Generalization.
■ Normalization.
Figure 7.3 shows the example of data transformation process.
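Normalization, one of the transformations listed above, is often done by min-max scaling. The sketch below assumes that technique and a target range of 0 to 1; the original answer does not say which normalization method is intended.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no range to scale.
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

scaled = min_max_normalize([10, 20, 30])
```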
Figure 7.3: Data Transformation Process.
IV) Data Reduction:
[Figure: data reduction. A table with attributes A1 to A125 is reduced to a smaller table with attributes A1 to A75 and fewer tuples.]
Data Discretization:
CHAPTER - 8: CLASSIFICATION
ANS: [5M-DEC16]
1. It is one of the most important classification and prediction methods in data mining.
2. A decision tree represents rules.
3. Rules are easy to understand and can be directly used in SQL to retrieve the records from the database.
4. A decision tree classifier has a tree type structure.
5. It has leaf nodes and decision nodes.
6. A leaf node is the last node of each branch and indicates the value of the target attribute.
7. A decision node is a node of the tree which has a leaf node or sub-tree.
8. Figure 8.1 shows the representation of a decision tree for playing tennis.
9. As shown in figure 8.1, Humidity, Outlook and Wind are attributes.
10. High, Normal, Strong, Weak, Sunny, Rain and Overcast are values.
11. Yes and No are classifications.
[Figure 8.1: decision tree for playing tennis, with Outlook at the root and Yes/No leaves.]
ANS: [5M-MAY16]
METRICS FOR EVALUATING CLASSIFIER PERFORMANCE:
1. Sensitivity: Sensitivity is defined as the True Positive recognition rate, which is the proportion of positive tuples that are correctly identified.
Sensitivity = TP/P
2. Specificity: Specificity is defined as the True Negative recognition rate, which is the proportion of negative tuples that are correctly identified.
Specificity = TN/N
3. Accuracy: It is the percentage of test set tuples that are correctly classified.
4. Error Rate: It is the percentage of errors made over the whole set of instances used.
5. Precision: It is the percentage of tuples classified as positive that are actually positive.
Precision = |TP| / (|TP| + |FP|)
6. Recall: It is the percentage of positive tuples which the classifier labelled as positive. It is a measure of completeness.
Recall = |TP| / (|TP| + |FN|)
Note:
TP: Class Members which are classified as class members.
TN: Class Non-Members which are classified as class non-members.
FP: Class Non-Members which are classified as class members.
FN: Class Members which are classified as class non-members.
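The metrics above can be computed directly from the four counts. The sketch below uses P = TP + FN and N = TN + FP; the function name and the sample counts are made up for the illustration.

```python
def classifier_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from the confusion matrix counts."""
    p = tp + fn          # all actual class members
    n = tn + fp          # all actual class non-members
    total = p + n
    return {
        "sensitivity": tp / p,
        "specificity": tn / n,
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts from a test set of 100 tuples.
m = classifier_metrics(tp=30, tn=40, fp=20, fn=10)
```

Recall and sensitivity are the same quantity, and accuracy plus error rate always equals 1.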
in)
REGRESSION:
Thus regression is very useful in estimating and predicting the average value of one variable for a given value of the other variable.
TYPES OF REGRESSION:
I) Linear Regression:
> If the regression curve is a straight line then there is a linear regression between two variables.
> The relationship between dependent and independent variable is described by a straight line and it has only one independent variable.
Y = α + βX
> Here Y is the dependent variable, X is the independent variable, and α, β are parameters.
II) Non-Linear Regression:
> If the regression curve is not a straight line then it is called non-linear regression.
> Regression tries to find the mathematical relationship between variables; if it gives a curved line then it is a non-linear regression.
> It is also known as Curvilinear Regression.
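For the linear case, the two parameters of the straight line can be estimated from data. The sketch below uses ordinary least squares, which is one standard way to fit the line and is an assumption here, since the original answer does not name an estimation method. The sample points are made up.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for the line Y = alpha + beta * X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical points lying exactly on Y = 1 + 2X.
alpha, beta = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = alpha + beta * 5
```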
15) A SIMPLE EXAMPLE FROM THE STOCK MARKET INVOLVING ONLY DISCRETE RANGES HAS
PROFIT AS CATEGORICAL ATTRIBUTE, WITH VALUES {UP, DOWN} AND THE TRAINING DATA
New No Software Up
Refer Q1.
Figure 8.2 shows the decision tree for stock market case.
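[Figure 8.2: decision tree for the stock market case. The root node is Age, with leaves Up and Down; one Age branch tests Contest (No leads to Up, Yes leads to Down).]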
RULES:
1. IF Age = New THEN Profit = Up.
2. IF Age = Mid and Contest = No THEN Profit = Up.
3. IF Age = Mid and Contest = Yes THEN Profit = Down.
4. IF Age = Old THEN Profit = Down.
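The rules read off the decision tree can be written as a small function. This sketch assumes the Age values are New, Mid and Old (the usual form of this exercise) and uses the attribute name Contest as it appears in the rules; the function name is hypothetical.

```python
def predict_profit(age, contest):
    """Apply the IF-THEN rules derived from the decision tree."""
    if age == "New":
        return "Up"
    if age == "Mid":
        return "Up" if contest == "No" else "Down"
    # Remaining case: the oldest age group.
    return "Down"

predictions = [
    predict_profit("New", "Yes"),
    predict_profit("Mid", "No"),
    predict_profit("Mid", "Yes"),
    predict_profit("Old", "No"),
]
```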
CLASSIFICATION PROCESS:
I) Model Construction:
> Each tuple that constitutes the training set is referred to as a category or class.
> Figure 8.3 shows example of model construction.
[Figure 8.3: model construction. Training data with columns Name, Rank, Years and Tenured is given to a classification algorithm, which produces a classifier in the form of a rule on Rank and Years that predicts Tenured.]
> The classification rules can be applied to the new data tuples if the accuracy is considered
acceptable.
> Figure 8.4 shows example of model usage.
[Figure 8.4: model usage. The classifier is checked against testing data (Name, Rank, Years, Tenured) and then applied to unseen data (Sagar, Developer, 3) to predict Tenured.]
Preparing the data for classification: Preparing the data involves the following activities:
> Relevance Analysis: Correlation analysis is used to know whether any two given attributes are related.
> Data Transformation and Reduction: The data can be transformed by any of the following methods.
ID3:
o A <- the Attribute that best classifies Examples.
o The decision attribute for Root <- A.
o For each possible value Vi of A:
    o Add a new tree branch below Root, corresponding to the test A = Vi.
    o Let Examples Vi be the subset of Examples that have value Vi for A.
    o If Examples Vi is empty
        Then below this new branch add a leaf node with label = most common value of Target_attribute in Examples.
        Else below this new branch add the sub-tree ID3 (Examples Vi, Target_attribute, Attributes - {A}).
End.
Return Root.
Advantages:
Disadvantages:
ANS:
1. Clustering is unsupervised learning.
2. Clustering algorithms are used to place data elements into related groups.
K-Means Clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
This results in a partitioning of the data space.
K is positive integer number.
[Figure 9.1: flowchart for K-Means. Start; choose the number of clusters K; compute the centroids; compute the distance of objects to the centroids; group objects by minimum distance; if no object moves group then End, otherwise repeat.]
Figure 9.1 shows the flowchart for K-Means.
EXAMPLE:
Given:
No. of clusters = 2
Solution:
Step-1: (Define K Centroid)
K1 = {1, 2, 6, 7, 8, 10}
K2 = {15, 17, 20}
Step-3: (Calculate Mean)
No. of clusters = 2
Step-4: (Reassign)
K1 = {1, 2, 6, 7, 8, 10}
K2 = {15, 17, 20}
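The assign, recompute and reassign loop can be sketched for one-dimensional data as below. The data set is assumed to be the union of the two clusters shown in the solution, and the initial centroids (1 and 20) are an assumption, since the original starting values are not legible.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Repeat assignment and mean update until the centroids stop changing."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

points = [1, 2, 6, 7, 8, 10, 15, 17, 20]
clusters, centroids = k_means_1d(points, centroids=[1, 20])
```

With these starting centroids the second pass produces the same assignment as the first, so the loop stops with the two clusters shown in the solution.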
Q2] WHAT IS CLUSTERING TECHNIQUES? DISCUSS THE AGGLOMERATIVE ALGORITHM WITH THE FOLLOWING DATA AND PLOT A DENDROGRAM USING SINGLE LINK APPROACH. THE TABLE BELOW COMPRISES SAMPLE DATA ITEMS INDICATING THE DISTANCE BETWEEN THE ELEMENTS.
CLUSTERING TECHNIQUES:
8.
Clustering Techniques can be classified into the following categories:
I) Partitioning Method:
II) Hierarchical Method:
Constraint-based Method:
> In this method, the clustering is performed by the incorporation of constraints.
> Constraints can be user-oriented or application-oriented.
AGGLOMERATIVE ALGORITHM:
1. Agglomerative Algorithm is used in Hierarchical Clustering.
2. It is also known as AGNES (Agglomerative Nesting).
EXAMPLE:
Given:
Item  E  A  C  B  D
E     0
A     1  0
C     2  2  0
B     2  5  1  0
D     3  3  6  3  0
Step - 1:
Consider the given distance matrix. Since E, A distance is minimum, we combine E and A.
Dist ((E, A), C) = MIN (Dist (E, C), Dist (A, C))
= MIN (2, 2) = 2
Dist ((E, A), B) = MIN (Dist (E, B), Dist (A, B))
= MIN (2, 5) = 2
Dist ((E, A), D) = MIN (Dist (E, D), Dist (A, D))
= MIN (3, 3) = 3
Distance Matrix:
Item  E,A  C  B  D
E,A   0
C     2    0
B     2    1  0
D     3    6  3  0
Step - 2:
Consider the distance matrix obtained in step 1. Since B, C distance is minimum, we combine B and C.
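Dist ((E, A), (B, C)) = MIN (Dist (E, B), Dist (E, C), Dist (A, B), Dist (A, C))
= MIN (2, 2, 5, 2) = 2
Dist ((B, C), D) = MIN (Dist (B, D), Dist (C, D))
= MIN (3, 6) = 3
Distance Matrix:
Item  E,A  B,C  D
E,A   0
B,C   2    0
D     3    3    0
Step - 3:
Consider the distance matrix obtained in step 2. Since (E, A), (B, C) distance is minimum, we combine (E, A) and (B, C).
Dist ((E, A, B, C), D) = MIN (Dist ((E, A), D), Dist ((B, C), D))
= MIN (3, 3) = 3
Distance Matrix:
Item     E,A,B,C  D
E,A,B,C  0
D        3        0
Step - 4:
Finally we combine (E, A, B, C) and D.
Final Dendrogram:
[Figure: dendrogram. E and A merge at distance 1, B and C merge at distance 1, (E, A) and (B, C) merge at distance 2, and D joins at distance 3.]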
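The single link steps can be reproduced with a short sketch. The distance matrix is the one given in the question; the function name, the tuple representation of clusters and the tie-breaking rule (first minimal pair in listing order) are assumptions.

```python
from itertools import combinations

def single_link(items, dist):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    The distance between two clusters is the minimum distance between
    any pair of their members (single link).
    """
    clusters = [(i,) for i in items]
    merges = []

    def d(a, b):
        return min(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: d(*pair))
        merges.append((a, b, d(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

items = ["E", "A", "C", "B", "D"]
matrix = [
    [0, 1, 2, 2, 3],
    [1, 0, 2, 5, 3],
    [2, 2, 0, 1, 6],
    [2, 5, 1, 0, 3],
    [3, 3, 6, 3, 0],
]
dist = {frozenset((items[i], items[j])): matrix[i][j]
        for i in range(len(items)) for j in range(i + 1, len(items))}

merges = single_link(items, dist)
for a, b, distance in merges:
    print(a, b, distance)
```

The merge distances give the heights at which the branches of the dendrogram join.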
*** EXTRA QUESTIONS ***
ANS:
CLUSTERING:
APPLICATIONS OF CLUSTERING:
I) Clustering is used in many applications such as market research, pattern recognition, data analysis, and image processing.
II) Clustering can also be used in classifying plants and animals into different classes based on their features.
VIII) Clustering is used to identify dangerous zones based on earthquake epicenters.
CLUSTERING ALGORITHM:
> The divisive approach is also known as the top-down approach.
> In this approach, we start with all of the objects in the same cluster.
> In each iteration, a cluster is split up into smaller clusters until each object is in one cluster or the termination condition holds.
> Once a merging or splitting is done, it can never be undone.
[Figure: Agglomerative (AGNES) merges objects 1, 2, 3, 4, 5 step by step into one cluster; Divisive (DIANA) splits in the reverse direction.]
Q1] FP TREE.
ANS:
FP TREE:
1. FP Tree stands for Frequent Pattern Tree.
2. It is a tree structure which consists of one root labelled as "null" and a set of item-prefix sub-trees.
6. But due to frequent pattern sharing, the size of the tree is usually much smaller than its original database.
7. Figure 10.1 shows the example of an FP Tree.
[Figure 10.1: example of an FP Tree, with a header table listing Item ID and Support Count, and tree nodes labelled item:count.]
Advantages:
> It compresses the data set.
>0;
'Topper's Sofutio ns
h
wh.ch a„pl.«d „ h„ Wir
P
An item can be generalized o r specialized as n
:em.
patterns J P association rule.
Topper's Solutions
an its
°rigina)
EE Computer IT Mechanical
> Rules which combine association with multiple dimensions are called Multidimensional Association Rules.
> In this, a rule contains two or more dimensions or predicates.
> There are two types: inter dimension association rules and hybrid dimension association rules.
■ Inter dimension association rules: This rule does not have any repeated predicate. For example:
Gender (X, "Male") ^ Salary (X, "High") => Buys (X, "Computer")
■ Hybrid dimension association rules: This rule has many occurrences of the same predicate, i.e. buys.
Q3] DISCUSS ASSOCIATION RULE MINING AND APRIORI ALGORITHM. APPLY AR MINING TO FIND ALL FREQUENT ITEM SETS AND ASSOCIATION RULES FOR THE FOLLOWING DATASET:
Minimum Support Count = 2
Minimum Confidence = 70%
Transaction_ID  Items
100             1, 2, 5
200             2, 4
300             2, 3
400             1, 2, 4
500             1, 3
600             1, 3
700             1, 3, 2, 5
800             1, 3
900             1, 2, 3
ANS: [10M-MAY16]
ASSOCIATION RULE MINING:
1. Association rule mining is a procedure which is meant to find frequent patterns, correlations and associations in data sets found in various kinds of databases such as relational databases, transactional databases, and other forms of data repositories.
2. Association Rule Mining is of two types: Multilevel association rule and Multidimensional association rule.
Refer Q2.
APRIORI ALGORITHM:
V Softsfionj
,1prior' Algorlilnu nnnlyw n lhl(11 Tohhw', ms
VG TO FIND
"""...... ........ <..............
(SET: 1(,. Items can occur
1 Implement.
nUaiitiMics:
Performance is low.
It requires man)' database scans.
Given-’
Minimum Support Count = 2
Minimum Confidence = 7 0 %
TransaclIonjD Items
100 1.2, 5
200 2,4
MAY16]
300 2,3
400 1,2,4
500 1,3
600 1,3
700 1, 3, 2, 5
□ns and
800 1,3
900 1, 2, 3
)f data
Solution:
isional
2
5
;
W'
Scanned by CamScanner
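Scan the transaction database to find the support count of each item.
Itemset  Support Count
1        7
2        6
3        6
4        2
5        2
Now generate Candidate C2 from L1 and find the support count for each itemset.
Itemset  Support Count
1, 2     4
1, 3     5
1, 4     1
1, 5     2
2, 3     3
2, 4     2
2, 5     2
3, 4     0
3, 5     1
4, 5     0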
Now we compare Candidate C2 generated in step 3 with the minimum support count and prune those itemsets which do not satisfy the minimum support count.
Itemset  Support Count
1, 2     4
1, 3     5
1, 5     2
2, 3     3
2, 4     2
2, 5     2
Now generate Candidate C3 from L2 and keep the itemsets which satisfy the minimum support count.
Itemset  Support Count
1, 2, 3  2
1, 2, 5  2
Following are the association rules that can be generated, shown below with the support and confidence.
Association Rule  Support  Confidence  Confidence %
1 ^ 2 => 5        2        2/4         50
1 ^ 5 => 2        2        2/2         100
2 ^ 5 => 1        2        2/2         100
1 => 2 ^ 5        2        2/7         29
2 => 1 ^ 5        2        2/6         33
5 => 1 ^ 2        2        2/2         100
Minimum Confidence threshold is 70 %. So the following rules are considered as output, as they are strong rules.
Rules        Confidence
1 ^ 5 => 2   100 %
2 ^ 5 => 1   100 %
5 => 1 ^ 2   100 %
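The level-wise search and the rule generation can be sketched as below, using the nine transactions of this question. The function names are hypothetical, and the sketch favours readability over efficiency (it rescans the transactions for every candidate).

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {frequent itemset: support count} using level-wise search."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, k = {}, 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

def strong_rules(frequent, itemset, min_conf):
    """Rules lhs => rhs from one frequent itemset with confidence >= min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = frequent[itemset] / frequent[lhs]
            if conf >= min_conf:
                rules.append((tuple(sorted(lhs)), tuple(sorted(itemset - lhs)), conf))
    return rules

transactions = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
                {1, 3}, {1, 2, 3, 5}, {1, 3}, {1, 2, 3}]
frequent = apriori(transactions, min_support_count=2)
rules = strong_rules(frequent, {1, 2, 5}, min_conf=0.7)
```

Calling strong_rules with the itemset {1, 2, 3} would list the rules from the other frequent 3-itemset in the same way.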
T-1000  M, O, N, K, E, Y
T-1001  D, O, N, K, E, Y
T-1002  M, A, K, E
T-1003  M, U, C, K, Y
T-1004  C, O, O, K, E
[10M-DEC16]
ANS:
Solution:
Step 1:
Scan the transaction database.
Itemset  Support Count
A        1
C        2
D        1
E        4
K        5
M        3
N        2
O        4
U        1
Y        3
Minimum support = 60 % of 5 transactions = 3. The items which satisfy the minimum support count are:
Itemset  Support Count
E        4
K        5
M        3
O        4
Y        3
Now generate Candidate C2 from L1 and find the support count for each itemset.
Itemset  Support Count
E, K     4
E, M     2
E, O     3
E, Y     2
K, M     3
K, O     3
K, Y     3
M, O     1
M, Y     2
O, Y     2
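Now we compare Candidate C2 with the minimum support count and prune those itemsets which do not satisfy the minimum support count (i.e. 60 %).
Itemset  Support Count
E, K     4
E, O     3
K, M     3
K, O     3
K, Y     3
Now generate Candidate C3 and find the support count for each itemset. Itemsets which do not satisfy the minimum support count (i.e. 60 %) are pruned.
Itemset  Support Count
E, K, M  2
E, K, O  3
E, K, Y  2
E, O, Y  2
Association Rule  Support  Confidence  Confidence %
E ^ K => O        3        3/4         75
E ^ O => K        3        3/3         100
K ^ O => E        3        3/3         100
E => K ^ O        3        3/4         75
K => E ^ O        3        3/5         60
O => E ^ K        3        3/4         75
Minimum Confidence threshold is 80 %. So the following rules are considered as output, as they are strong rules.
Rules        Confidence
E ^ O => K   100 %
K ^ O => E   100 %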
ons
*SIQ N ANALY
T8EW M T ,L,AW
« SCHOOLS » 0 ZAT “ '™»««X r”“
ITALS in T||E w OFFICE, corporatesin
MUNICiPAL OFFICE, YoUR ANALYs,s ’D AND OTHER INFORMATION OF THE
0UIDEL1NES C NSIST 0F
' ° all NECESSARY INTERFACE
ANS:
However,
and storage space.
The Multi-way Array Aggregation method computes a full data cube by using a
bc
ABC
ex]
Figure 1: Multi-way Array Aggregation
Limitations:
Scanned by CamScanner
II)
>
>
AC BC
AB
Q2]
ABC
Star Cubing:
I pruned: I
I pruned: J
! W
c/c l)/l)
dC/AC
«C/BC lUvii
HCD
*ncr>
Adv
antage:
Rcduceth
' compute10, tensions.
_ _ _ _____ '« a ,1(,
CniOtyreciuire
me n t s .
Scanned by CamScanner
pro1'
i.
ii.
AnS: [Chapter
st
ore daily)
(b) Discuss:
. The steps in KDD Process.
[10]
Ans:[Chapter - 5 ]
Ans:[Chapter - 4]
(b) A simple example from the stock market involving only discrete ranges has profit as
categorical attribute, with values {Up, Down} and the training data set is given below. [10]
Mid  No   Hardware  Up
Mid  No   Software  Up
New  Yes  Software  Up
New  No   Hardware  Up
New  No   Software  Up
-J vuiuiJOfy
Semester - 8
Question "Papers
[10]
[10] Q2]
200 2,4
300 2, 3
400 1, 2, 4
500 1,3
600 1,3
700 1, 3, 2, 5
800 1,3
900 1,2,3
Ans: [Chapter - 10]
Ans: [Chapter - 2]
(b) Metrics for Evaluating Classifier Performance.
Ans: [Chapter - 8 ]
(c) FP Tree.
Ans: [Chapter - 10]
(d) Multilevel & Multidimensional Association Rule.
Ans: [Chapter - 10]
(e) Operational Vs. Decisional Support System.
Ans: [Chapter - 1]
(a) What is meant by metadata in the context of a Data warehouse? Explain the different types of
Meta data stored in a data warehouse. Illustrate with a suitable example. [10]
Ans: [Chapter - 1]
(b) Describe the various functionalities of Data Mining as a step in the process of knowledge
discovery. [10]
Ans: [Chapter - 5]
Q3] [a] In what way ETL cycle can be used in typical data warehouse, explain with suitable instance. [10]
Ans: [Chapter - 3]
[b] What is Clustering Techniques? Discuss the Agglomerative algorithm with the following data
and plot a Dendrogram using single link approach. The table below comprises sample data items
indicting the distance between the elements. [10]
Item E A C B D
E 0 1 2 2 3
A 1 0 2 5 3
C 2 2 0 1 6
B 2 5 1 0 3
D 3 3 6 3 0
Ans: [Chapter - 9]
[10]
(a) Discuss how computations can be performed efficiently on data cubes.
Ans: [Chapter - Miscellaneous]
[b] A database has five transactions. Let min-support = 60% and min-confidence = 80%. Find all frequent itemsets using Apriori Algorithm. T_ID is the transaction ID. [10]
T_ID    Items Bought
T-1000  M, O, N, K, E, Y
T-1001  D, O, N, K, E, Y
T-1002  M, A, K, E
T-1003  M, U, C, K, Y
T-1004  C, O, O, K, E
Ans:[Chapter - 10]
1
Ans: [Chapter - 2]
I iiWWO
(c) Indexing OLAP Data.
Ans: [Chapter - 4]
(d) Data Quality.
Ans:[Chapter - 3]
(e) Decision Tree based Classification Approach.
Ans:[Chapter - 8]