Fundamentals of Data Science-1

The document outlines a comprehensive examination on Data Science fundamentals, covering topics such as the KDD process, data warehousing components, data mining applications, and various classification and clustering methods. It includes detailed questions on algorithms like Apriori and FP-Growth, as well as decision trees and clustering techniques. Additionally, it addresses basic concepts in data mining, including DBMS vs. Data Mining, data normalization, and methods for estimating the number of clusters.


SCHEME: NEP

COMPUTER APPLICATIONS
Fundamentals of Data Science – 51953

PART – A

Answer all questions. Each question carries ten marks.

1. a) Explain the KDD process in detail. [6]


The KDD (Knowledge Discovery in Databases) process consists of the following steps:
• Data Cleaning: remove noise and inconsistent data
• Data Integration: combine data from multiple sources
• Data Selection: retrieve the data relevant to the analysis task
• Data Transformation: consolidate data into forms appropriate for mining (e.g. by aggregation or normalization)
• Data Mining: apply intelligent methods to extract patterns
• Pattern Evaluation: identify the truly interesting patterns representing knowledge
• Knowledge Representation: present the mined knowledge to the user (e.g. through visualization)

b) Explain the Applications of Data Mining. [4]


• Business and Marketing
• Banking and Finance
• Healthcare
• Retail
• Education
• Telecommunications
• Manufacturing
• E-commerce
OR

c) Explain the components of 3-tier Data Warehousing with a neat diagram. [6]

• Bottom tier: the data warehouse database server, which stores the cleaned and integrated data
• Middle tier: the OLAP server (ROLAP or MOLAP), which provides a multidimensional view of the data
• Top tier: the front-end client layer with query, reporting, analysis, and data mining tools
d) Explain the issues and challenges in data mining. [4]
• Data Quality Issues
• Handling Large and Complex Data
• Data Privacy and Security
• Integration of Data from Multiple Sources
• Scalability and Performance
• Interpretation of Results
• Dynamic and Evolving Data
• Lack of Skilled Personnel
2. a) Explain the various components of Data Warehousing. [5]
• Data Source
• Data Staging (ETL - Extract, Transform, Load)
• Data Storage (Data Warehouse Repository)
• Metadata
• Data Marts
• OLAP Engine (Online Analytical Processing)
• Front-End Tools (Reporting and Data Mining Tools)
• Data Warehouse Management and Monitoring Tools

b) Mention different OLAP operations. Explain any one OLAP operation in detail. [5]
• Roll Up
• Drill Down
• Slice and Dice
• Pivot

Roll Up: performs aggregation on a data cube either by climbing up a concept hierarchy for a dimension (for example, from city to country) or by dimension reduction, so the data is summarized at a coarser level. A pandas sketch of a roll-up is shown below.
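
A minimal illustration of a roll-up using pandas (the sales data is illustrative, not from the original answer):

```python
# Hedged sketch: a roll-up in pandas, aggregating sales from the
# city level up to the country level (data is illustrative).
import pandas as pd

sales = pd.DataFrame({
    "country": ["India", "India", "USA", "USA"],
    "city":    ["Delhi", "Mumbai", "NYC", "LA"],
    "amount":  [100, 200, 300, 400],
})

city_level = sales.groupby(["country", "city"])["amount"].sum()
country_level = sales.groupby("country")["amount"].sum()  # rolled up one level
print(country_level)
```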
OR

c) Explain in detail Data integration and data reduction. [10]

• Data Integration
Data integration is the process of combining data from multiple
heterogeneous sources into a unified and consistent view.
Techniques:
o Schema Integration
o Data Cleaning
o Data Transformation
o Entity Resolution
• Data Reduction
Data reduction refers to the process of reducing the volume of data
while maintaining its integrity and analytical value (a dimensionality-reduction sketch follows the list of techniques below).
Techniques:
o Dimensionality Reduction
o Numerosity Reduction
o Data Compression
o Data Aggregation
o Sampling
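
As an illustration of dimensionality reduction, a minimal sketch using scikit-learn's PCA on the bundled Iris data (the dataset and component count are illustrative assumptions):

```python
# Hedged sketch: dimensionality reduction with PCA, projecting the
# 4-feature Iris data down to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)   # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)    # variance retained per component
```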
3. a) Explain support and confidence in association rule mining with an example. [6]

Association Rule:
Association rule mining is used to discover interesting relationships
(associations) among items in large datasets, commonly applied in market
basket analysis.

Support:
Support is the proportion of transactions in the dataset that contain a
specific itemset.
Formula:
Support(X) = (Number of transactions containing X) / (Total number of
transactions)

Confidence:
Confidence measures how often itemset Y appears in transactions that contain itemset X, i.e. the likelihood that Y occurs when X occurs.
Formula:
Confidence(X => Y) = (Number of transactions containing both X and Y) / (Number of transactions containing X)
Equivalently, Confidence(X => Y) = Support_count(X ∪ Y) / Support_count(X)
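
The question asks for an example; here is a minimal worked sketch in Python (the transaction data is illustrative), computing support and confidence for the rule {bread} => {butter}:

```python
# Worked example: support and confidence for the rule {bread} => {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
count_bread = sum(1 for t in transactions if "bread" in t)                      # 4
count_bread_butter = sum(1 for t in transactions if {"bread", "butter"} <= t)   # 3

support = count_bread_butter / n               # 3/5 = 0.6
confidence = count_bread_butter / count_bread  # 3/4 = 0.75

print(f"Support({{bread, butter}}) = {support}")
print(f"Confidence(bread => butter) = {confidence}")
```

Here Support({bread, butter}) = 3/5 = 0.6 and Confidence(bread => butter) = 3/4 = 0.75, i.e. 75% of the transactions that contain bread also contain butter.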

b) Write a note on frequent pattern growth for mining. [4]

FP-Growth is an algorithm for mining frequent patterns that uses a divide-and-conquer approach. It was proposed by Han, Pei, and Yin in 2000. Unlike Apriori, it mines frequent itemsets without candidate generation by compressing the database into a compact FP-tree.
Working of FP-Growth (a usage sketch follows the steps):
• Scan the database to find frequent 1-itemsets and their support counts
• Sort the items in each transaction in descending order of support
• Construct the FP-tree by inserting the sorted transactions
• Generate frequent itemsets by mining conditional FP-trees
• Generate association rules from the frequent itemsets
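
A minimal usage sketch with the mlxtend library's FP-Growth implementation (assumptions: mlxtend is installed, e.g. via pip install mlxtend, and the tiny dataset is illustrative):

```python
# Hedged sketch: mining frequent itemsets with FP-Growth via mlxtend.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "milk"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# Frequent itemsets with support >= 50%
print(fpgrowth(df, min_support=0.5, use_colnames=True))
```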
OR

c) Explain the Apriori algorithm with an example. [10]

Apriori is an important algorithm proposed by R. Agrawal and R. Srikant in 1994. It uses frequent itemsets to generate association rules, and it is based on the Apriori property: every subset of a frequent itemset must also be frequent.
It repeats two steps until no new frequent itemsets are found (a worked sketch follows):
1. Join Step: generate candidate k-itemsets (Ck) by joining the frequent (k-1)-itemsets (Lk-1) with themselves.
2. Prune Step: remove candidates that contain an infrequent subset or whose support count does not meet the minimum support count threshold; the survivors form Lk.
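
A minimal pure-Python sketch of the join and prune steps on a toy transaction list (the data and threshold are illustrative):

```python
# Minimal Apriori sketch: repeated join and prune steps over a toy
# transaction list until no candidate survives the support threshold.
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
min_support = 3  # minimum support count threshold (illustrative)

def support_count(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets
items = {i for t in transactions for i in t}
Lk = {frozenset([i]) for i in items if support_count(frozenset([i])) >= min_support}

k = 2
while Lk:
    print(f"L{k - 1}:", [set(s) for s in sorted(Lk, key=sorted)])
    # Join step: candidate k-itemsets from unions of frequent (k-1)-itemsets
    Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step: drop candidates with an infrequent (k-1)-subset,
    # then keep only those meeting the minimum support count
    Ck = {c for c in Ck if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    Lk = {c for c in Ck if support_count(c) >= min_support}
    k += 1
```

With min_support = 3, the sketch finds L1 = {A}, {B}, {C} and L2 = {A, B}, {A, C}, {B, C}; the candidate {A, B, C} is generated in the join step but discarded because its support count is only 2.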

4. a) What is a decision tree? Explain how classification is done using decision tree induction. [10]

A Decision Tree is a tree-like model used for classification and regression


tasks. It breaks down a dataset into smaller subsets while an associated
decision tree is incrementally developed. It is one of the most widely used
and easy-to-understand algorithms in data mining and machine learning.

Structure of a Decision Tree:


• Root Node
• Internal Nodes
• Leaf Nodes
• Branches

Decision Tree Induction


Decision Tree Induction is the process of building a decision tree from a training dataset. Classification is done as follows (a scikit-learn sketch follows the steps):

• Select the Best Attribute (Splitting Criterion), using measures such as:
o Information Gain
o Gini Index
o Gain Ratio
• Create a Decision Node for the chosen attribute
• Split the Dataset on that attribute's values
• Repeat Recursively on each subset
• Assign Class Labels at the leaf nodes

A new instance is then classified by traversing the tree from the root, following the branches that match its attribute values, until a leaf node's class label is reached.
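
A hedged sketch using scikit-learn's DecisionTreeClassifier on the bundled Iris dataset (the dataset and hyperparameters are illustrative choices, not part of the original answer):

```python
# Hedged sketch: decision tree induction and classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" selects splits by information gain; "gini" is the default
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print("Predicted class:", tree.predict(X_test[:1]))  # traverse root -> leaf
```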
OR

b) Explain the basic concepts of classification. [5]


• Training Dataset: labelled records used to build the model
• Classifier (Model): the learned mapping from attribute values to class labels
• Class Label: the category or class to be predicted
• Prediction: assigning a class label to a new, unseen instance
• Evaluation: measuring the model's performance (e.g. accuracy) on a test set

c) What is a rule-based classifier? Explain. [5]

A Rule-Based Classifier is a classification technique that uses a set of IF-THEN rules to make classification decisions. These rules are derived from the training data and are used to classify new instances.

Structure of Rule:
• A rule is usually written in the form:
IF (condition) THEN (class label)
Example:
IF (Outlook = Sunny) AND (Humidity = High) THEN Play = No
Rule Generation:
• Rules are generated from training data using algorithms like
RIPPER, Decision Trees (converted to rules), or Apriori-based rule
learning.
Rule Matching:
• When a new instance is to be classified, the classifier checks which rule(s) match the instance (see the sketch below).
• If multiple rules match, conflict-resolution techniques such as rule ordering, confidence ranking, or majority voting are used.
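
A minimal sketch of ordered IF-THEN rules with a default class (the rules and attribute names are illustrative, loosely following the weather example above):

```python
# Minimal rule-based classifier sketch: IF-THEN rules checked in order,
# with a default class when no rule fires (rules are illustrative).
rules = [
    (lambda x: x["Outlook"] == "Sunny" and x["Humidity"] == "High", "No"),
    (lambda x: x["Outlook"] == "Overcast", "Yes"),
    (lambda x: x["Outlook"] == "Rain" and x["Wind"] == "Strong", "No"),
]

def classify(instance, default="Yes"):
    for condition, label in rules:
        if condition(instance):   # first matching rule wins (rule ordering)
            return label
    return default                # default rule when nothing matches

print(classify({"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # No
```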

5. a) Explain the hierarchical method of clustering. [6]

A hierarchical clustering method works by grouping data objects into a hierarchy or 'tree' of clusters, which helps in summarizing the data. Hierarchical methods are either agglomerative (bottom-up: start with each object in its own cluster and merge) or divisive (top-down: start with one cluster and split). A SciPy sketch follows the list of algorithms.
Algorithms:
• AGNES (AGglomerative NESting)
• DIANA (DIvisive ANAlysis)
• BIRCH
• CHAMELEON
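
A hedged sketch of agglomerative (AGNES-style) clustering using SciPy (the points and the choice of average linkage are illustrative):

```python
# Hedged sketch: agglomerative hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

Z = linkage(X, method="average")                  # merge history (dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)
```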

b) Write the algorithm for K-Means clustering. [4]

K-Means is an unsupervised clustering algorithm that groups similar data points into K distinct clusters based on feature similarity.
Algorithm:
1. Choose K and select K initial centroids (e.g. K randomly chosen data points).
2. Assign each data point to the cluster whose centroid is nearest.
3. Recompute each centroid as the mean of the points assigned to its cluster.
4. Repeat steps 2 and 3 until the assignments (or centroids) no longer change.
A NumPy sketch of these steps is shown below.
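
A minimal NumPy sketch of the steps above (the random data and K = 2 are illustrative; empty clusters are not handled):

```python
# Minimal K-Means sketch following the algorithm steps above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 2))
K = 2

centroids = X[rng.choice(len(X), K, replace=False)]   # step 1: initial centroids
for _ in range(100):
    # Step 2: assign each point to the nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recompute centroids as cluster means
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):          # step 4: convergence
        break
    centroids = new_centroids

print(labels)
```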

OR

c) Explain density-based methods and grid-based methods. [10]

Density-Based Methods
Density-based methods form clusters based on the density of data points in the data space. A cluster is a dense region of points separated by regions of lower point density (noise or outliers). A DBSCAN sketch appears at the end of this answer.
Algorithms:
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• DENCLUE (DENsity-based CLUstEring): models clusters using density (kernel density) functions

Grid-Based Methods
Grid-based methods divide the data space into a finite number of cells (a grid structure) and then perform clustering on the grid cells instead of on the individual data points.
Algorithms:
• Statistical Information Grid (STING)
• CLIQUE (CLustering In QUEst)
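
A hedged DBSCAN sketch with scikit-learn (the points, eps, and min_samples values are illustrative):

```python
# Hedged sketch: density-based clustering with scikit-learn's DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [1.0, 1.2],
              [8.0, 8.0], [8.1, 8.1], [25.0, 25.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)   # -1 marks noise/outlier points
```

The two dense groups of points form two clusters, while the isolated point at (25, 25) is labelled -1 (noise).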

PART – B

Answer any five questions. Each question carries two marks.

6.
a) Differentiate between DBMS and Data Mining.

DBMS (Database Management System) is a software system that


manages and stores data, providing features like data modeling, storage,
retrieval, and security.
Data Mining, on the other hand, is the process of discovering patterns,
relationships, and insights from large datasets using various statistical and
mathematical techniques.

b) What is Data Cube Aggregation?


Data Cube Aggregation is a technique used in Online Analytical
Processing (OLAP) to pre-compute and store aggregated data in a
multidimensional array, known as a data cube.
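
A minimal sketch of cube-style aggregation using a pandas pivot table (the sales records are illustrative; margins=True adds the pre-computed totals):

```python
# Hedged sketch: a tiny "data cube" aggregation with pandas.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 150, 200, 250],
})

# Pre-computed aggregates over the (region, quarter) dimensions
cube = sales.pivot_table(values="amount", index="region",
                         columns="quarter", aggfunc="sum", margins=True)
print(cube)
```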

c) What is Attribute Subset Selection?


Attribute Subset Selection is a dimensionality reduction technique used to
select a subset of relevant attributes or features from a larger set of
attributes.
d) What is Bayes' theorem?
Bayes’ Theorem is a mathematical formula used to determine the
probability of a hypothesis based on prior knowledge or evidence. It is
widely used in probabilistic classification, such as the Naive Bayes
Classifier.
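Formula:
P(H | X) = [P(X | H) × P(H)] / P(X)
where P(H | X) is the posterior probability of hypothesis H given evidence X, P(X | H) is the likelihood, P(H) is the prior probability of H, and P(X) is the probability of the evidence.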

e) What is Data Normalization?


Data Normalization is a technique used to rescale numeric attribute values to a common range, typically [0, 1], so that attributes with larger scales do not dominate attributes with smaller scales.
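For example, min-max normalization rescales a value v of attribute A as:
v' = (v - min(A)) / (max(A) - min(A))
which maps min(A) to 0 and max(A) to 1.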

f) What is K-Nearest Neighbour Classifier?


The K-Nearest Neighbour (KNN) Classifier is a supervised learning algorithm that classifies a new instance based on the majority vote of its k nearest neighbours in the feature space.
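
A minimal sketch with scikit-learn's KNeighborsClassifier (the Iris data and k = 3 are illustrative):

```python
# Hedged sketch: K-Nearest Neighbour classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 neighbours, majority vote
knn.fit(X, y)
print(knn.predict(X[:1]))                   # class voted by the 3 neighbours
```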

g) Limitation of Partitioning Methods of Clustering.


• Requires pre-defined number of clusters (K)
• Assumes spherical or convex clusters
• Sensitive to initial centroid selection
• Affected by outliers and noise

h) Methods for Estimating Number of Clusters.


Some common methods for estimating the number of clusters include:
1. Elbow Method: plot the within-cluster sum of squares against K and choose the K at the 'elbow' of the curve (see the sketch below).
2. Silhouette Method: choose the K that maximizes the average silhouette coefficient.
3. Gap Statistic: compare the within-cluster dispersion to that expected under a null reference distribution.
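
A minimal elbow-method sketch with scikit-learn's KMeans (the random data and range of K are illustrative; in practice one would plot the curve and look for the bend):

```python
# Hedged sketch: the elbow method via within-cluster sum of squares (inertia).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 2))

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # inertia drops as k grows; pick k at the "elbow"
```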
