DWDM SR2
Module 1
1. What is data preprocessing? (Ans: The transformations applied to the
identified data before feeding it into the mining algorithm.)
2. _______ predicts future trends and behaviors, allowing business managers to
make proactive, knowledge-driven decisions. (Ans: Data mining)
3. Records cannot be updated in ____________ (Ans: Data Warehouse)
4. Define data scrubbing. (Ans: a process to upgrade the quality of data before
it is moved into a data warehouse)
5. Star schema follows which type of relationship? (Ans: One-to-many)
6. The algorithm which uses the concept of a tree built over the data to find
associations of items in data mining is known as _____________ (Ans: FP-Growth,
the frequent pattern tree algorithm)
7. Mention the data mining algorithm which is used by Google Search to rank
web pages in their search engine results. (Ans: PageRank Algorithm)
8. Identify the statistical measure used to quantify the direction and strength of the
relationship between two continuous variables. (Ans: The Pearson correlation
coefficient)
9. What does a correlation coefficient value of 0 indicate about the relationship
between two variables X and Y? (Ans: There is no linear relationship between
the two variables X and Y.)
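A minimal Python sketch of computing the Pearson coefficient (the paired data here is made up for illustration):

    import numpy as np

    # Hypothetical paired observations of two continuous variables X and Y.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Pearson r = cov(X, Y) / (std(X) * std(Y)); np.corrcoef returns the
    # 2x2 correlation matrix, so take the off-diagonal entry.
    r = np.corrcoef(x, y)[0, 1]
    print(r)  # close to +1: strong positive linear relationship; 0 would mean none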
10. What is the primary goal of Sequential Pattern Mining in data analysis? (Ans:
To identify patterns in data where the events occurred in a sequence.)
11. Which is the more complex technique: Sequential Pattern Mining or
Association Rule Mining? Why? (Ans: Sequential Pattern Mining,
because it deals with sequences of items where the order and timing of
events matter, adding an additional layer of complexity beyond the simple
co-occurrence of items in transactions.)
12. Discuss the "gap constraint" in Sequential Pattern Mining. (Ans: It specifies the
allowed number of items or time intervals between elements in a sequential
pattern, controlling how far apart elements can be in the sequence to still
count towards the pattern.)
13. A ____________________ defines the multidimensional model of the data
warehouse. (Ans: Data cube)
Module 3
1. Discuss the purpose of time series forecasting. (Ans: To analyze historical
data patterns to predict future values or trends in the data series.)
Module 4
1. What is the primary focus of data stream mining?
Module 5
1. What is web mining and how does it differ from traditional data mining?
Answer: Web mining applies data mining techniques to discover patterns from
web data. Unlike traditional data mining, which typically works on structured,
well-defined datasets, web mining must deal with the semi-structured and
unstructured data of the web. Its three main categories are web content mining,
web structure mining, and web usage mining: web content mining deals with
extracting information from web pages, web structure mining analyzes the link
structure of the web, and web usage mining focuses on analyzing user
interaction data.
3. What is mining web link structure, and why is it important?
Answer: Mining web link structure involves analyzing the relationships
between web pages through hyperlinks. It is important because it helps
understand the organization and hierarchy of information on the web,
improves search engine ranking algorithms, and assists in detecting spam and
fraudulent websites.
4. What techniques are used in mining web link structure?
Answer: Techniques used in mining web link structure include link analysis
algorithms such as PageRank and HITS (Hypertext Induced Topic Selection),
graph-based algorithms, and network analysis methods to analyze the
topology of the web graph.
5. How does mining multimedia data differ from mining textual data on the
web?
Answer: Multimedia mining works on images, audio, and video rather than plain
text, so it relies on feature extraction (for example colour, texture, or audio
features) and similarity-based techniques instead of purely term-based text
analysis.
7. What is distributed web mining, and how can it be beneficial?
Answer: Distributed web mining involves distributing the mining tasks across
multiple nodes or machines in a network. It can be beneficial for handling
large volumes of web data, improving scalability and efficiency, and facilitating
collaborative mining efforts among multiple organizations or researchers.
8. What are some techniques used in distributed web mining?
Module 6
1. What is graph mining, and how does it differ from traditional data
mining?
Answer: Graph mining discovers patterns in graph-structured data, where
entities are nodes and relationships are edges. Unlike traditional data mining,
which typically operates on flat, tabular records, graph mining must account
for the topology and connectivity of the data.
2. How can graph mining and social network analysis help understand
online communities?
Answer: Graph mining and social network analysis can help understand
online communities by identifying influential users, detecting community
structures, analyzing information flow within communities, predicting
user behavior and preferences, and detecting anomalies or suspicious
activities.
9. What are some popular tools and software libraries for graph mining
and social network analysis?
Answer: Popular tools and software libraries for graph mining and social
network analysis include NetworkX (Python library), Gephi, igraph (R
package), Cytoscape, SNAP (Stanford Network Analysis Platform), and
GraphX (Apache Spark library).
10. How can graph mining and social network analysis identify key
influencers in a network?
Answer: Graph mining and social network analysis can identify key
influencers in a network by analyzing centrality measures such as
degree centrality (number of connections), betweenness centrality
(importance of a node in connecting other nodes), and eigenvector
centrality (importance of a node based on its connections to other
important nodes). These metrics help identify nodes that play crucial
roles in information diffusion and network dynamics.
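A short NetworkX sketch of these centrality measures (the edge list is an illustrative toy graph):

    import networkx as nx

    # Toy social graph: nodes are users, edges are connections.
    G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")])

    print(nx.degree_centrality(G))       # number of connections, normalized
    print(nx.betweenness_centrality(G))  # how often a node bridges other nodes
    print(nx.eigenvector_centrality(G))  # influence via connections to influential nodes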
5 marks
Module 1
1. Define data warehouse. Discuss the basic characteristics of a data
warehouse. 1+4
2. Differentiate operational database systems and a data warehouse.
3. Differentiate between data warehouse and data mart.
4. Define data cube. Write down the procedure for converting tables and
spreadsheets to data cubes.
1+4
5. Distinguish between OLTP system and OLAP system.
6. Discuss the various phases of knowledge discovery from databases (KDD).
7. Explain predictive and descriptive data mining.
8. Define association rules in the context of data mining. Describe the terms
support and confidence with the help of suitable examples.
9. Write down some advantages and disadvantages of the FP-Tree algorithm.
10. Discuss the constraints used in constraint-based association rule mining.
Module 2
1. Describe the decision tree classifier algorithm and how it works to make
predictions.
2. Write a short note on the confusion matrix.
3. Define conditional probability. Discuss Bayes' theorem.
4. Define precision, recall, and F1 score.
5. Name two ASMs (attribute selection measures) used in decision trees. Discuss
one of them.
6. Define Gini impurity. Compare entropy and Gini impurity.
Module 3
1. Explain the concept of similarity search in time series analysis
Module 4
1. What is sequential pattern mining in data streams?
Module 5
1. What are the basic steps to mine web page layout structure?
a. Web Page Retrieval: Retrieve the HTML source code of the web page
of interest. This can be done using web scraping techniques or by
accessing the page directly via its URL.
b. HTML Parsing: Parse the HTML source code to extract the structural
elements of the web page, such as tags, attributes, and content. Use
HTML parsing libraries like BeautifulSoup (Python) or jsoup (Java) for
this purpose.
d. Analyze CSS Styles: Analyze the CSS styles associated with each
layout element to understand their visual properties, such as position,
size, margin, padding, and display properties. Extract relevant CSS
properties using CSS parsing libraries or regular expressions.
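A minimal sketch of steps a and b in Python, using the requests and BeautifulSoup libraries (the URL is a placeholder):

    import requests
    from bs4 import BeautifulSoup

    # Step a: retrieve the HTML source of the page of interest (placeholder URL).
    html = requests.get("https://example.com").text

    # Step b: parse the HTML and enumerate its structural elements.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):   # True matches every tag in the document
        print(tag.name, tag.attrs)    # tag name plus its attributes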
2. What are the basic steps in distributed web mining?
a. Data Partitioning: The first step is to partition the web data into smaller
subsets or chunks that can be distributed across multiple nodes in the
network. This can be done based on various criteria such as URL
ranges, domain names, or geographical regions.
b. Node Selection: Determine the nodes or machines in the distributed
network that will participate in the mining process. These nodes can be
physical machines or virtual machines connected over a network.
Module 6
1. What is the class imbalance problem? How is it handled?
The class imbalance problem refers to the situation in which the distribution
of classes in a dataset is skewed, with one class (the minority class)
significantly underrepresented compared to the other class or classes (the
majority). This imbalance can cause challenges for predictive modeling and
classification algorithms, particularly those that assume a balanced class
distribution.
Handling: common approaches include resampling the data (oversampling the
minority class, for example with SMOTE, or undersampling the majority class),
assigning class weights so that minority-class errors are penalized more
heavily, and evaluating with metrics such as precision, recall, and F1 score
rather than plain accuracy.
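A minimal scikit-learn sketch of the class-weighting approach mentioned above (the dataset is synthetic):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic dataset with a roughly 9:1 class imbalance.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # class_weight='balanced' reweights classes inversely to their frequency,
    # so misclassifying the minority class costs more during training.
    clf = LogisticRegression(class_weight="balanced").fit(X, y)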
15 marks
Module 1
1. Compare 2-tier, 3-tier, and 4-tier data warehouse architectures with proper
diagrams. Discuss the tangible and intangible benefits of a data warehouse.
10+5
2. Define schema. Discuss the various schemas used in a data warehouse with
proper diagrams. List the advantages and disadvantages of each schema.
2+7+6
3. Discuss the typical OLAP operations with an example. Write in brief on
various kinds of OLAP servers/models.
9+6
4. Explain the architecture of a data mining system. Discuss the types of
knowledge discovered during data mining. Define outlier mining.
7+6+2
5. Write down the steps of the Apriori algorithm. For the given transaction
dataset, generate rules using the Apriori algorithm. Consider the values
SUPPORT = 50% and CONFIDENCE = 75%.
Transaction ID   Items Purchased
1                Bread, Cheese, Egg, Juice
2                Bread, Cheese, Juice
3                Bread, Milk, Yogurt
4                Bread, Juice, Milk
5                Cheese, Juice, Milk
TID   Items
1     A, B
2     C
3     A, C, D
4     B, C
5     A, D
6     B, D
7     A, B, C
8     A, C, D
9     C
10    B, C, D
● Single items: A, B, C, D
● 2-item sequences: AB, AC, AD, BC, BD, CD
● 3-item sequence: ACD
6+9
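A minimal sketch of the support-counting core of Apriori on the first dataset (brute-force candidate enumeration stands in for the level-wise pruning; rule generation is omitted):

    from itertools import combinations

    transactions = [
        {"Bread", "Cheese", "Egg", "Juice"},
        {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk", "Yogurt"},
        {"Bread", "Juice", "Milk"},
        {"Cheese", "Juice", "Milk"},
    ]
    min_support = 0.5  # SUPPORT = 50%

    def support(itemset):
        # fraction of transactions containing every item of the itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    items = sorted(set().union(*transactions))
    for k in (1, 2):  # candidate 1- and 2-itemsets
        for cand in combinations(items, k):
            s = support(set(cand))
            if s >= min_support:
                print(set(cand), s)  # e.g. {'Bread', 'Juice'} 0.6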
Module 2
1. Discuss the K-means clustering algorithm.
Consider the 5 data points shown below:
p1:(1,2,3) p2:(0,1,2) p3:(3,0,5) p4:(4,1,3) p5:(5,0,1)
Apply the K-means clustering algorithm to group these data points into 2
clusters. Consider the initial centroids C1:(1,0,0) and C2:(0,1,1).
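A NumPy sketch of the assignment/update loop under these initial centroids (Euclidean distance assumed):

    import numpy as np

    pts = np.array([[1, 2, 3], [0, 1, 2], [3, 0, 5], [4, 1, 3], [5, 0, 1]], float)
    cents = np.array([[1, 0, 0], [0, 1, 1]], float)  # initial C1, C2

    for _ in range(10):  # iterate until the centroids stop moving
        # assignment step: nearest centroid for each point
        d = np.linalg.norm(pts[:, None, :] - cents[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        # (assumes no cluster becomes empty, which holds for this data)
        new = np.array([pts[labels == k].mean(axis=0) for k in (0, 1)])
        if np.allclose(new, cents):
            break
        cents = new
    print(labels)  # 0 -> cluster C1, 1 -> cluster C2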
2. Discuss the naive Bayes classification algorithm and use it to predict the
species for a test instance from the training data below.
No.   Color   Legs   Height   Smelly   Species
2     Green   2      Tall     No       M
5     Green   2      Short    No       H
6     White   2      Tall     No       H
7     White   2      Tall     No       H
Ans: Species H
3. Discuss the KNN algorithm. Compute the class label for the test instance
t1 = (3, 7) using KNN (k = 3). How do you find the K value in k-nearest
neighbours?
Training Instance   X1   X2   Output
I1                  7    7    0
I2                  7    4    0
I3                  3    4    1
I4                  1    4    1
Ans: For K = 1 the output is 1; for K = 2 the output is 1; for K = 3 the output
is also 1 (the three nearest neighbours of t1 are I3, I4, and I1 with labels
1, 1, 0, so the majority class is 1).
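A short sketch verifying these labels with Euclidean distances and a majority vote:

    import numpy as np

    train = np.array([[7, 7], [7, 4], [3, 4], [1, 4]], float)  # I1..I4
    labels = np.array([0, 0, 1, 1])
    t1 = np.array([3, 7], float)

    d = np.linalg.norm(train - t1, axis=1)  # distances: 4.0, 5.0, 3.0, 3.61
    order = d.argsort()                     # nearest first: I3, I4, I1, I2
    for k in (1, 2, 3):
        votes = labels[order[:k]]
        print(k, np.bincount(votes).argmax())  # majority class among k nearest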
4. Estimate the entropy of this collection of training examples with respect to the
target classification. Compute the information gain of a1 and a2 relative to these
training examples. Draw a decision tree for the given dataset.
Instance   Classification   a1   a2
1          +                T    T
2          +                T    T
3          -                T    F
4          +                F    F
5          -                F    T
6          -                F    T
Compare entropy and Gini impurity. What is entropy? Write the mathematical
formula for entropy. 8+2+2+3
Ans:
Entropy(S) = -Σ p_i log2(p_i), summed over the class proportions p_i.
Whole set S (3 positive, 3 negative): Entropy(S) = 1.0.
Attribute a1, values (T, F): Entropy(S_T) = 0.9183 and Entropy(S_F) = 0.9183,
so Information Gain(S, a1) = 1.0 - 0.9183 = 0.0817.
Attribute a2, values (T, F): both partitions have entropy 1.0, so
Information Gain(S, a2) = 0.
a1 has the maximum information gain, so a1 will be the root node of the decision tree.
Follow the rule accordingly to draw the complete decision tree.
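A minimal Python sketch that reproduces these numbers:

    import math

    def entropy(labels):
        n = len(labels)
        probs = [labels.count(v) / n for v in set(labels)]
        return -sum(p * math.log2(p) for p in probs if p > 0)

    cls = ["+", "+", "-", "+", "-", "-"]
    a1  = ["T", "T", "T", "F", "F", "F"]
    a2  = ["T", "T", "F", "F", "T", "T"]

    def info_gain(attr):
        n = len(cls)
        remainder = sum(
            attr.count(v) / n * entropy([c for c, a in zip(cls, attr) if a == v])
            for v in set(attr))
        return entropy(cls) - remainder

    print(entropy(cls))   # 1.0
    print(info_gain(a1))  # 0.0817 -> a1 chosen as root
    print(info_gain(a2))  # 0.0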
5. State and explain Agglomerative Hierarchical clustering algorithm. Cluster the one
dimensional dataset X=[3, 7, 10, 17, 18, 20] into 3 clusters using average linkage
agglomerative hierarchical clustering. State the differences between partitional and
hierarchical clustering. State the limitations of the k-means algorithm.
4+6+3+2
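A SciPy sketch of the numeric part (fcluster cuts the average-linkage dendrogram into 3 clusters):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([3, 7, 10, 17, 18, 20], dtype=float).reshape(-1, 1)
    Z = linkage(X, method="average")               # average-linkage merge history
    print(fcluster(Z, t=3, criterion="maxclust"))  # one cluster label per point,
                                                   # grouping {3}, {7, 10}, {17, 18, 20}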
6. State Hunt's algorithm for decision tree based classifiers. What are the minimum
and maximum values of the GINI index for splitting an attribute? What will be the
value of the GINI index for the following multiway split of the 'CarType' attribute?
Consider the two-way split of the attribute 'CarType'. The distribution of samples
in the two classes 'C1' and 'C2' is as follows:
Ans: The information gain of the 1st split is 0.08 and of the 2nd split is 0.061;
since 0.08 > 0.061, the 1st split is best as it provides the maximum information gain.
4+2+3+6
8+7
Ans: Resultant Clusters: {2,3}, {17,19,22,28}, {10}, {90}.
Module 3
1. Explain the concept of seasonality in time series data and its
significance in time series mining. Give two common methods used for
analyzing time series data, with brief explanations. Explain how anomaly
detection in time series data typically works.
8+5+2
Module 4
1. Explain the concept of data stream mining, highlighting its challenges,
techniques, and applications.
2. Explain the concept of graph mining and its significance in data
analysis. Discuss the challenges associated with graph mining and the
techniques employed to address them. Additionally, provide examples
of applications where graph mining is utilized and highlight its impact.
5+5+5
Module 5
1. a. What is distributed web mining? How is it performed?
b. Write down the basic steps for mining web link structure. (3+5+7)
Ans: Mining web link structure involves analyzing the relationships between
web pages through hyperlinks. This process is essential for understanding the
organization, connectivity, and topology of the World Wide Web. Here are the
basic steps involved in mining web link structure:
1. Web Crawling: The first step is to crawl the web to collect a large dataset of
web pages. Web crawlers, also known as spiders or bots, systematically
traverse the web by following hyperlinks from one page to another. They
collect metadata about each page, including its URL, title, content, and
outbound links.
2. Link Extraction: Extract the hyperlinks embedded within each web page's HTML
content. This involves parsing the HTML source code to identify anchor tags
(<a>) and extracting the href attribute, which contains the URL of the linked
page. Each hyperlink represents a connection or relationship between the
current page and the linked page.
3. Build Graph Representation: Represent the web link structure as a graph, where
nodes represent web pages and edges represent hyperlinks between pages.
This graph is known as the web graph or hyperlink graph. Each node
corresponds to a unique URL, and edges indicate the existence of hyperlinks
between pages.
4. Analyze Graph Properties: Analyze the properties of the web graph to
understand its structure and characteristics. Common graph metrics include
node degree (number of incoming and outgoing links), node centrality
(importance of a page within the graph), graph density (proportion of actual
links to potential links), and graph clustering (identification of densely
connected subgraphs).
5. Identify Important Pages: Identify important or influential pages within the web
graph using centrality measures such as PageRank or HITS (Hypertext
Induced Topic Selection). PageRank assigns a numerical score to each page
based on the number and quality of inbound links, while HITS distinguishes
between authority pages (pages with many incoming links) and hub pages
(pages with many outgoing links).
6. Detect Communities or Clusters: Detect communities or clusters of related
pages within the web graph using graph clustering algorithms such as
modularity optimization or community detection algorithms. These algorithms
identify groups of pages that are densely interconnected and share similar
link patterns, indicating thematic coherence or topical similarity.
7. Visualize Graph Structure: Visualize the structure of the web graph using graph
visualization techniques. Graph visualization tools such as Gephi, Cytoscape,
or NetworkX can generate visual representations of the web link structure,
allowing for intuitive exploration and analysis of the graph's topology and
connectivity.
8. Monitor Changes Over Time: Continuously monitor and update the web link
structure to capture changes and evolution in the web graph over time. Web
pages may be added, removed, or modified, leading to updates in the
hyperlink relationships and graph structure. Regularly re-crawling and
re-analyzing the web graph helps maintain an accurate representation of the
web link structure.
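A small NetworkX sketch of steps 3 and 5 on a toy hyperlink graph:

    import networkx as nx

    # Step 3: build a directed web graph; edge u -> v means page u links to page v.
    G = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")])

    # Step 5: rank pages by inbound-link quantity and quality.
    print(nx.pagerank(G))
    hubs, authorities = nx.hits(G)
    print(hubs, authorities)  # HITS hub and authority scores per page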
Module 6
1. a. What is the concept of centrality in social network analysis, and what
is its significance in graph mining?
b. What are the key parameters that influence the distributed warehouse
environment? (7+8)
Recent trends in distributed warehousing have been influenced by advancements in
technology, changing business needs, and evolving consumer expectations. Here are
some notable trends in distributed warehousing:
1. Cloud-Based Solutions: There is a growing adoption of cloud-based distributed
warehousing solutions. Cloud platforms offer scalability, flexibility, and cost-
effectiveness, allowing businesses to easily scale their warehouse
infrastructure based on demand. Providers like Amazon Web Services (AWS),
Google Cloud Platform (GCP), and Microsoft Azure offer cloud-based data
warehousing services such as Amazon Redshift, Google BigQuery, and Azure
Synapse Analytics.
2. Edge Computing Integration: With the proliferation of Internet of Things (IoT)
devices and the need for real-time data processing, there's an increasing trend
towards integrating edge computing with distributed warehousing. Edge
computing brings data processing closer to the data source, reducing latency
and enabling faster decision-making. Distributed warehouses are being
deployed at the edge to handle data processing and analytics in distributed
environments.
3. Hybrid and Multi-Cloud Deployments: Many organizations are adopting hybrid
and multi-cloud strategies for distributed warehousing. They leverage a
combination of on-premises infrastructure, private cloud, and public cloud
services to optimize performance, cost, and data governance. This approach
allows businesses to choose the best-fit environment for different workloads
while maintaining flexibility and avoiding vendor lock-in.
4. Containerization and Orchestration: Containerization technologies such as
Docker and Kubernetes are increasingly used for deploying distributed
warehousing solutions. Containers provide lightweight, portable, and isolated
runtime environments, making it easier to package and deploy warehouse
components across distributed infrastructure. Container orchestration
platforms automate the deployment, scaling, and management of
containerized warehouse applications, ensuring reliability and scalability.
5. Data Mesh Architecture: The concept of data mesh has emerged as a new
architectural approach for distributed data management, including
warehousing. In a data mesh architecture, data is treated as a product, and
cross-functional data teams are responsible for managing and owning
domain-specific data products. This decentralized approach allows for
greater agility, scalability, and autonomy in data management across
distributed environments.
6. Data Democratization and Self-Service Analytics: There's a growing emphasis
on data democratization and self-service analytics in distributed warehousing.
Modern warehouses provide easy access to data for business users and
analysts, empowering them to explore and analyze data independently
without relying on IT support. Self-service analytics tools and platforms
enable users to query, visualize, and derive insights from distributed data
sources in real-time.
7. Security and Compliance Enhancements: With the increasing volume and
complexity of data in distributed environments, security and compliance have
become top priorities for organizations. Distributed warehousing solutions are
incorporating advanced security features such as encryption, access control,
and audit trails to protect sensitive data across distributed infrastructure.
Compliance frameworks and regulations such as GDPR, CCPA, and HIPAA are
driving the adoption of robust security measures in distributed warehousing.