Fundamentals of Data Science-1
COMPUTER APPLICATIONS
Fundamentals of Data Science- 51953
PART – A
• Top tier
• Middle tier
• Bottom tier
d) Explain the issues and challenges in data mining. [4]
• Data Quality Issues
• Handling Large and Complex Data
• Data Privacy and Security
• Integration of Data from Multiple Sources
• Scalability and Performance
• Interpretation of Results
• Dynamic and Evolving Data
• Lack of Skilled Personnel
2. a) Explain the various components of Data Warehousing. [5]
• Data Source
• Data Staging (ETL - Extract, Transform, Load)
• Data Storage (Data Warehouse Repository)
• Metadata
• Data Marts
• OLAP Engine (Online Analytical Processing)
• Front-End Tools (Reporting and Data Mining Tools)
• Data Warehouse Management and Monitoring Tools
• Data Integration
Data integration is the process of combining data from multiple
heterogeneous sources into a unified and consistent view.
Techniques:
o Schema Integration
o Data Cleaning
o Data Transformation
o Entity Resolution
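The four techniques above can be illustrated with a minimal sketch. The two sources, their field names, and the choice of the normalised email as the matching key are all assumptions made for illustration; real entity resolution usually needs fuzzier matching.

```python
# Two hypothetical sources describing customers with different schemas.
crm_rows = [
    {"cust_name": "Asha Rao ", "email": "Asha.Rao@Example.com", "city": "Pune"},
    {"cust_name": "Vikram Shah", "email": "vikram@example.com", "city": "Delhi"},
]
billing_rows = [
    {"name": "asha rao", "mail": "asha.rao@example.com", "balance": "1500"},
]

def from_crm(row):
    # Schema integration: map source-specific field names to one schema.
    return {"name": row["cust_name"], "email": row["email"], "city": row["city"]}

def from_billing(row):
    # Data transformation: the balance arrives as text, store it as a number.
    return {"name": row["name"], "email": row["mail"], "balance": float(row["balance"])}

def clean(record):
    # Data cleaning: trim whitespace and normalise letter case.
    record = dict(record)
    record["name"] = record["name"].strip().title()
    record["email"] = record["email"].strip().lower()
    return record

def integrate(records):
    # Entity resolution (assumed rule): same normalised email means same entity.
    merged = {}
    for record in map(clean, records):
        merged.setdefault(record["email"], {}).update(record)
    return list(merged.values())

unified = integrate(
    [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
)
for row in unified:
    print(row)
```

The three input rows collapse into two unified records, because the CRM and billing entries for the same customer share an email once it is cleaned.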
• Data Reduction
Data reduction refers to the process of reducing the volume of data
while maintaining its integrity and analytical value.
Techniques:
o Dimensionality Reduction
o Numerosity Reduction
o Data Compression
o Data Aggregation
o Sampling
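As a rough illustration of two of these techniques, the sketch below applies data aggregation and simple random sampling to a small table. The sales figures and the seed are hypothetical values chosen only for the example.

```python
import random
from collections import defaultdict

# Hypothetical daily sales records: (month, amount).
daily_sales = [
    ("2024-01", 120), ("2024-01", 80), ("2024-01", 100),
    ("2024-02", 90), ("2024-02", 110),
    ("2024-03", 200),
]

# Data aggregation: six daily rows are summarised as three monthly totals.
monthly_totals = defaultdict(int)
for month, amount in daily_sales:
    monthly_totals[month] += amount

# Sampling: keep a simple random sample (without replacement) of half the rows.
rng = random.Random(42)  # fixed seed (an arbitrary choice) for reproducibility
sample = rng.sample(daily_sales, k=len(daily_sales) // 2)

print(dict(monthly_totals))
print(len(sample), "of", len(daily_sales), "rows kept")
```

Aggregation keeps the totals exact while discarding day-level detail; sampling keeps full detail for a subset of rows, so any statistic computed from it is an estimate.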
3. a) Explain support and confidence in association rule mining with an
example. [6]
Association Rule:
Association rule mining is used to discover interesting relationships
(associations) among items in large datasets, commonly applied in market
basket analysis.
Support:
Support is the proportion of transactions in the dataset that contain a
specific itemset.
Formula:
Support(X) = (Number of transactions containing X) / (Total number of
transactions)
Confidence:
Confidence of a rule X => Y is the proportion of transactions containing X
that also contain Y, i.e. an estimate of how likely Y is to appear when X
appears.
Formula:
Confidence(X => Y) = (Number of transactions containing both X and Y) /
(Number of transactions containing X)
Equivalently: Confidence(X => Y) = Support_count(X ∪ Y) / Support_count(X)
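A small worked example may help here. The five transactions below are hypothetical market-basket data, and the function names are illustrative; the computation follows the support and confidence formulas given above.

```python
# Hypothetical market-basket data; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "jam"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(x, y, transactions):
    """Support_count(X ∪ Y) / Support_count(X) for the rule X => Y."""
    return support_count(x | y, transactions) / support_count(x, transactions)

x, y = {"bread"}, {"milk"}
print(support(x | y, transactions))    # 3 of 5 transactions -> 0.6
print(confidence(x, y, transactions))  # 3 of the 4 bread transactions -> 0.75
```

For the rule {bread} => {milk}, bread and milk occur together in 3 of the 5 transactions (support 60%), and 3 of the 4 transactions containing bread also contain milk (confidence 75%).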
Structure of Rule:
• A rule is usually written in the form:
IF (condition) THEN (class label)
Example:
IF (Outlook = Sunny) AND (Humidity = High) THEN Play = No
Rule Generation:
• Rules are generated from training data using algorithms like
RIPPER, Decision Trees (converted to rules), or Apriori-based rule
learning.
Rule Matching:
• When a new instance is to be classified, the classifier checks which
rule(s) match the instance.
• If multiple rules match, techniques like confidence ranking or
majority voting are used.
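Rule matching with confidence ranking can be sketched as follows. Only the first rule comes from the example above; the other rules, all confidence values, and the default class are assumptions added for illustration.

```python
# Each rule: (conditions, class label for Play, confidence).
# The confidence values are hypothetical, not learned from data.
rules = [
    ({"Outlook": "Sunny", "Humidity": "High"}, "No", 0.90),
    ({"Outlook": "Sunny"}, "Yes", 0.60),
    ({"Outlook": "Overcast"}, "Yes", 0.95),
]

def matches(conditions, instance):
    """A rule matches when every attribute test in its IF part holds."""
    return all(instance.get(attr) == value for attr, value in conditions.items())

def classify(instance, rules, default="Yes"):
    """Return the label of the highest-confidence matching rule."""
    matching = [rule for rule in rules if matches(rule[0], instance)]
    if not matching:
        return default  # assumed fallback when no rule fires
    best = max(matching, key=lambda rule: rule[2])
    return best[1]

print(classify({"Outlook": "Sunny", "Humidity": "High"}, rules))  # No
print(classify({"Outlook": "Rain", "Humidity": "Normal"}, rules))  # Yes (default)
```

The first instance matches two rules with conflicting labels; confidence ranking picks the more specific rule because it has the higher confidence. Majority voting would instead count the labels of all matching rules.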
OR
Density-Based Methods
Density-based methods form clusters based on the density of data points
in the data space. A cluster is a dense region of points that is separated
by areas of lower point density (noise or outliers).
Algorithms:
• DBSCAN (Density-Based Spatial Clustering of Applications with
Noise): forms clusters from connected regions of high density
• DENCLUE (DENsity-based CLUstEring)
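The density idea can be made concrete with a minimal DBSCAN-style sketch. It uses the two usual DBSCAN parameters, a neighbourhood radius (eps) and a minimum neighbour count (min_pts); the sample points and parameter values are hypothetical, and the brute-force neighbour search is for clarity rather than efficiency.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [
    (1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8),
    (8.0, 8.0), (8.2, 8.1), (7.9, 8.3), (8.1, 7.8),
    (4.5, 15.0),
]
print(dbscan(points, eps=1.0, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point has fewer than min_pts neighbours and is not reachable from any core point, so it stays labelled as noise (-1).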
Grid-Based Methods
Grid-based methods divide the data space into a finite number of cells (grid
structure), then perform clustering on the grid instead of the data points.
Algorithms:
• STING (STatistical INformation Grid)
• CLIQUE (CLustering In QUEst)
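The general grid-based idea can be sketched as below. This is a simplified illustration, not an implementation of STING or CLIQUE: points are binned into square cells, cells holding at least min_points points are treated as dense, and adjacent dense cells are merged. The cell size, threshold, and sample points are hypothetical.

```python
from collections import defaultdict, deque

def grid_cluster(points, cell_size, min_points):
    # 1. Assign each 2-D point to a grid cell.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // cell_size), int(y // cell_size))].append(idx)

    # 2. Keep only dense cells.
    dense = {c for c, members in cells.items() if len(members) >= min_points}

    # 3. Merge neighbouring dense cells (8-connectivity) into clusters.
    labels = [-1] * len(points)  # -1 marks points in sparse cells
    seen = set()
    cluster = 0
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for idx in cells[(cx, cy)]:
                labels[idx] = cluster
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        cluster += 1
    return labels

# Hypothetical data: two groups of points and one isolated point.
points = [
    (0.2, 0.3), (0.7, 0.6), (1.1, 0.4), (1.6, 0.9),
    (5.2, 5.1), (5.6, 5.8),
    (9.5, 0.5),
]
print(grid_cluster(points, cell_size=1.0, min_points=2))  # [0, 0, 0, 0, 1, 1, -1]
```

After binning, the merging step works on cells rather than on individual points, so its cost depends on the number of occupied cells and not on the number of data points.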
PART – B
6. a) Differentiate between DBMS and Data Mining.