Short Notes on Data Mining & Warehousing

1. Definition of Data Mining: Data mining is the process of discovering patterns and knowledge from large
data sets using methods at the intersection of machine learning, statistics, and database
systems.

2. Evolution of Database Technology:

● 1960s: Development of hierarchical and network DBMS.


● 1970s: Introduction of relational models and DBMSs.
● 1980s: Emergence of object-oriented and extended-relational systems.
● 1990s: Popularity of data warehousing and data mining.
● 2000s: Advancements in stream data management and mining.

3. Data Mining Process (Knowledge Discovery in Databases - KDD): Involves data cleaning,
data integration, data selection, data transformation, data mining, pattern evaluation, and
knowledge presentation.

4. Data Mining Tasks:

● Classification: Assigns items to predefined categories.


● Clustering: Groups similar items without predefined categories.
● Association Rule Learning: Discovers interesting relations between variables.
● Regression: Finds a function that models the data with the least error.

5. Major Challenges in Data Mining:

● Handling high dimensionality and scalability.


● Ensuring data mining methods are efficient and results are understandable.
● Addressing privacy, security, and ethical issues.

Hard Questions with Detailed Answers

1. What is the difference between data mining and data warehousing?

● Answer: Data mining involves extracting patterns from data sets, utilizing various
algorithms and techniques. Data warehousing, on the other hand, refers to the process
of constructing and using a data warehouse, a centralized repository of integrated data
from multiple sources, designed to support decision-making tasks.

2. Describe the CRISP-DM process in data mining.

● Answer: CRISP-DM stands for Cross-Industry Standard Process for Data Mining, which
includes six phases: Business Understanding, Data Understanding, Data Preparation,
Modeling, Evaluation, and Deployment. Each phase serves to understand the objectives,
prepare the data, model it, evaluate the results, and deploy the findings to achieve the
business objectives.

3. How does a decision tree algorithm work in classifying data?

● Answer: A decision tree splits the data into subsets based on attribute value tests. It is
constructed top-down from a root node and involves partitioning the data into subsets
that contain instances with similar values (homogeneous). Decision trees handle both
categorical and numerical data.
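
A minimal sketch of this idea using scikit-learn; the library choice, feature names, and toy data are illustrative assumptions, not part of the notes.

# Illustrative sketch: training a small decision tree classifier (toy data).
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [age, income_level (0=low, 1=medium, 2=high)]; label 1 = buys, 0 = does not buy.
X = [[25, 0], [32, 1], [47, 2], [51, 2], [23, 0], [40, 1]]
y = [0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2)  # built top-down via attribute-value tests
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income_level"]))  # the learned attribute tests
print(tree.predict([[30, 2]]))                                   # classify a new, unseen instance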

4. What are the main types of data mining systems, and how do they differ?

● Answer: Data mining systems can be categorized into relational, web mining, text
mining, and multimedia data mining systems. Each type addresses data in different
formats and structures, utilizing specialized processes and algorithms to extract relevant
knowledge.

5. Explain the concept of support and confidence in association rule mining.

● Answer: Support measures how often a rule applies to a given data set, while
confidence measures the reliability of the inference made by the rule. For example, the
rule {diapers} → {beer} might have a support of 0.5% (it applies to 0.5% of transactions)
and a confidence of 75% (75% of the transactions containing diapers also contain beer).
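
A small calculation sketch of the two measures; the five transactions below are invented purely for illustration.

# Hypothetical market-basket transactions used only to illustrate support and confidence.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"diapers", "beer"} <= t)  # transactions with X and Y together
x_only = sum(1 for t in transactions if "diapers" in t)          # transactions with X

support = both / n           # fraction of all transactions containing diapers AND beer
confidence = both / x_only   # of the transactions with diapers, fraction also containing beer
print(f"support={support:.2f}, confidence={confidence:.2f}")     # 0.40 and 0.67 for this toy data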

6. What challenges does big data present to data mining?

● Answer: Big data brings challenges such as managing large volumes and a wide variety of
data, ensuring data quality and veracity, and coping with the velocity and complexity of
data processing and storage.

7. How is machine learning utilized within data mining?

● Answer: Machine learning provides the algorithms and statistical models for data mining
to conduct classification, prediction, regression, and clustering, by learning from and
making decisions based on the input data.

8. What role does OLAP play in data warehousing?

● Answer: OLAP (Online Analytical Processing) allows for the quick querying of
multidimensional data within a data warehouse, facilitating complex calculations, trend
analysis, and data summarization, which are crucial for business intelligence.

9. Describe the importance of data cleaning in data mining.

● Answer: Data cleaning involves removing or correcting data anomalies and inconsistencies, such as missing values, outliers, or incorrect data. This step is critical to improve the accuracy and efficiency of the subsequent data mining processes.

10. Explain the significance of pattern evaluation in the KDD process.

● Answer: Pattern evaluation identifies the truly interesting patterns representing knowledge from a large set of potentially discoverable patterns. This step uses interestingness measures and thresholds to select patterns that are potentially useful to the user.

11. What is the difference between descriptive and predictive data mining?

● Answer: Descriptive data mining focuses on finding human-interpretable patterns describing the data. In contrast, predictive data mining uses some variables to predict unknown or future values of other variables.

12. How do outlier detection and anomaly detection differ in data mining?

● Answer: Outlier detection focuses on finding data points that deviate significantly from
the majority of data, while anomaly detection seeks to identify patterns in data that do
not conform to expected behaviour.

13. What are the ethical concerns associated with data mining?

● Answer: Ethical concerns include privacy invasion, misuse of information, and the
potential discrimination against groups or individuals based on data-derived insights,
requiring careful consideration of data use policies.

14. How can data mining influence decision-making in healthcare?

● Answer: Data mining can help predict disease trends, analyze patient data for better
diagnoses, personalise treatment plans, and improve healthcare outcomes by extracting
valuable insights from large health information datasets.

15. What is the impact of data mining on e-commerce?

● Answer: In e-commerce, data mining is used for customer segmentation, targeted marketing, recommendation systems, and customer sentiment analysis, enhancing customer experiences and increasing sales efficiency.

Short Notes on Data Preprocessing from "ICS 2408_Lecture 2"

1. Importance of Data Preprocessing: Data preprocessing is critical because raw data is often
incomplete, noisy, and inconsistent. Quality mining results depend on quality data, which
necessitates thorough cleaning, integration, and transformation.
2. Data Quality Measures:

● Accuracy: Correctness of the data.


● Completeness: Availability of data.
● Consistency: Uniformity of the data across the dataset.
● Timeliness: How current the data is.
● Believability: Trustworthiness of the data.
● Interpretability: Ease of understanding the data.

3. Major Tasks in Data Preprocessing:

● Data Cleaning: Handling missing values, smoothing noisy data, identifying or removing
outliers, and resolving inconsistencies.
● Data Integration: Combining data from multiple sources and resolving data conflicts.
● Data Transformation: Normalization and aggregation to bring data into a form suitable
for mining.
● Data Reduction: Reducing data volume while maintaining data integrity for analysis.
● Data Discretization: Transforming continuous data into categorical data.

4. Handling Missing Data: Methods include ignoring tuples, filling in missing values manually,
using a global constant, attribute mean, or the most probable value estimated through
regression or decision trees.
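
A brief pandas sketch of two of these options; pandas and the toy column names are assumptions added for illustration.

import pandas as pd

# Toy data with missing values, for illustration only.
df = pd.DataFrame({"age": [25, None, 47, 51], "city": ["Nairobi", "Mombasa", None, "Nairobi"]})

df["age"] = df["age"].fillna(df["age"].mean())   # fill a numeric attribute with its mean
df["city"] = df["city"].fillna("Unknown")        # fill a categorical attribute with a global constant
# df = df.dropna()                               # alternative: ignore (drop) incomplete tuples
print(df)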

5. Handling Noisy Data: Techniques include binning, clustering, and regression to smooth
data, identify outliers, and correct inconsistencies.

15 Hard Questions with Detailed Answers

1. Why is data preprocessing considered a crucial step in the data mining process?

● Answer: Data preprocessing is crucial because it directly affects the success of any
data mining project. Without quality data preprocessing, the data mining results may be
misleading or fail to uncover the true patterns and insights due to underlying noise and
inconsistencies in the data.

2. What is data noise, and how can it be handled during the preprocessing stage?

● Answer: Data noise refers to random errors or variances in a dataset. Handling noise
typically involves smoothing techniques such as binning, where data are sorted and
placed into bins; data within each bin can then be replaced by its mean, median, or
boundary, or by using clustering and regression methods to stabilize the data.

3. Explain the concept of data integration in the context of data preprocessing.

● Answer: Data integration involves combining data from multiple sources, ensuring that
the integrated data is consistent and adequately aligned. This process may address
issues like schema integration, entity identification problems, and inconsistencies among
data sources.

4. Discuss the binning method for data smoothing and give an example.

● Answer: Binning involves sorting data and then grouping them into bins. For example,
ages in a dataset could be binned into “0-20”, “21-40”, “41-60”, etc. Within each bin,
ages can be smoothed by replacing them with the bin mean or median, reducing the
impact of minor errors or variations.
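
A sketch of smoothing by bin means with pandas; the bin edges and ages are invented for this example.

import pandas as pd

ages = pd.Series([4, 18, 19, 25, 37, 42, 58, 60])
bins = pd.cut(ages, bins=[0, 20, 40, 60], labels=["0-20", "21-40", "41-60"])

# Replace each age by the mean of its bin ("smoothing by bin means").
smoothed = ages.groupby(bins).transform("mean")
print(pd.DataFrame({"age": ages, "bin": bins, "smoothed": smoothed}))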

5. What is the role of normalization in data preprocessing, and what are some common
methods?

● Answer: Normalization adjusts the scale of data attributes, allowing them to contribute
equally during analytic processing. Common methods include min-max normalization,
z-score normalization, and decimal scaling.
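
A minimal sketch of the three methods mentioned; the values are invented and pandas is an assumed tool.

import pandas as pd

x = pd.Series([200, 300, 400, 600, 1000])

min_max = (x - x.min()) / (x.max() - x.min())             # rescales values into [0, 1]
z_score = (x - x.mean()) / x.std()                        # centers on 0 with unit (sample) standard deviation
decimal_scaled = x / 10 ** len(str(int(x.abs().max())))   # decimal scaling: shift the decimal point so |v| < 1

print(pd.DataFrame({"x": x, "min_max": min_max, "z_score": z_score, "decimal": decimal_scaled}))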

6. Define data reduction and describe why it is significant in data mining.

● Answer: Data reduction techniques aim to reduce the volume of data, simplifying the
patterns in the data while retaining its integrity for analysis. It is significant because it
helps in speeding up the data mining process and reducing storage requirements
without losing key information.

7. How does data discretization improve the performance of data mining algorithms?

● Answer: Discretization converts continuous data into categorical data, which can
simplify and speed up computations in data mining algorithms, particularly those that are
not well-suited for continuous data, such as certain classification algorithms.

8. Explain the importance of dealing with missing data and describe a method to handle
it.

● Answer: Missing data can lead to biased or invalid mining results. One method to
handle missing data is to replace missing values with the overall attribute mean or
median, providing a simple form of imputation that can sometimes yield reasonable
approximations.

9. What challenges arise from data integration, and how can they be addressed?

● Answer: Challenges in data integration include handling schema mismatches, data conflicts, and duplicates. These can be addressed by schema mapping, data transformation, and entity resolution techniques to ensure that data from different sources can be used together accurately.

10. Describe how clustering can be used to handle noisy data.


● Answer: Clustering groups similar data items together, which can help in identifying and
treating outliers – points that do not belong to any cluster may be considered noise or
exceptions.

11. What is the significance of attribute construction in data transformation?

● Answer: Attribute construction involves creating new attributes from existing ones to
provide more useful features for data mining, potentially enhancing the predictive power
of mining algorithms.

12. Discuss how sampling is used in data reduction. Provide an example of a sampling
technique.

● Answer: Sampling involves selecting a representative subset of the data to reduce the
size of the data that needs to be analyzed. An example is stratified sampling, where the
data is divided into homogeneous subgroups before sampling, ensuring that each group
is adequately represented.
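
A short sketch of stratified sampling using scikit-learn's train_test_split; the library, toy records, and class proportions are assumptions for illustration.

from sklearn.model_selection import train_test_split

# Toy records and class labels; the minority class makes up 25% of the data.
records = list(range(100))
labels = [0] * 75 + [1] * 25

# stratify=labels keeps the 75/25 class proportions inside the 20% sample.
_, sample, _, sample_labels = train_test_split(records, labels, test_size=0.2, stratify=labels, random_state=0)
print(len(sample), sum(sample_labels))   # 20 records, 5 of which belong to class 1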

13. Explain the concept of histograms in data reduction.

● Answer: Histograms are used to approximate the distribution of data by dividing data
into bins and counting the number of cases that fall into each bin. This helps in
understanding the data distribution and identifying patterns such as skewness or
outliers.

14. What role does data cleaning play in improving the believability and accuracy of
data?

● Answer: Data cleaning improves data believability and accuracy by addressing issues
like missing values, inconsistencies, and errors in the data, which enhances the reliability
of the data mining results derived from the cleaned data.

15. Describe the impact of data preprocessing on the outcome of a data mining process.

● Answer: Effective data preprocessing directly influences the accuracy, efficiency, and
possible insights derived from a data mining process. By ensuring that the data is clean,
consistent, and appropriately formatted, data preprocessing increases the likelihood of
discovering valid, reliable patterns.

Short Notes on Data Warehousing and OLAP from "ICS 2408_Lecture 3 and
4"
1. Data Warehouse Basics:

● Definition: A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. It supports management's decision-making processes by providing a consolidated platform for data analysis.
● Characteristics:
○ Subject-Oriented: Organized around key subjects such as customers, products,
and sales.
○ Integrated: Consolidates data from various sources, ensuring consistency in
naming conventions, formats, and encodings.
○ Time-Variant: Maintains historical data to provide a time frame for analysis,
distinguishing it from operational systems that focus on current data.
○ Non-Volatile: Data in a data warehouse is not updated in real-time but loaded
from operational systems.

2. Data Warehousing vs. Heterogeneous DBMS:

● Data warehousing uses an update-driven approach where data is collected, cleaned, transformed, and stored in advance, unlike traditional heterogeneous database systems that use a query-driven approach.

3. OLAP (Online Analytical Processing):

● Supports complex queries and analyses, such as multidimensional queries (e.g., slicing
and dicing, drilling up and down), which are crucial for decision-making processes.

4. Architecture of Data Warehouses:

● Two-Layer Architecture: Separates physically available sources and data warehouses into a source layer, data staging, and a data warehouse layer, enhancing structured analysis and decision-making capabilities.
● Three-Layer Architecture: Includes a reconciled data layer that integrates and
cleanses operational data, ensuring the data are consistent, correct, current, and
detailed.

5. Schema Designs in Data Warehousing:

● Star Schema: A simple structure with a central fact table connected to dimension tables.
● Snowflake Schema: A more complex form of the star schema where some dimension
tables are normalized.
● Fact Constellation Schema: Multiple fact tables share dimension tables, suitable for
complex enterprises with varied analytical needs.

6. Data Warehouse Back-End Tools and Utilities:


● Tools for data extraction, cleaning, transformation, loading (ETL), and refreshing are
critical for maintaining the data warehouse.

7. OLAP Operations:

● Roll-Up: Increases the level of aggregation.


● Drill-Down: Decreases the level of aggregation or adds detail.
● Slice and Dice: Performs selection and projection.
● Pivot: Reorients the multidimensional view of data.
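
A rough pandas analogy of these operations on a tiny sales table; the column names and figures are invented, and a real OLAP engine would operate on a multidimensional cube rather than a DataFrame.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023, 2023],
    "region":  ["East", "West", "East", "West", "East"],
    "product": ["TV", "TV", "Radio", "TV", "Radio"],
    "amount":  [100, 150, 80, 200, 120],
})

roll_up = sales.groupby("year")["amount"].sum()                  # roll-up: aggregate away region/product
drill_down = sales.groupby(["year", "region"])["amount"].sum()   # drill-down: add the region detail back
slice_2023 = sales[sales["year"] == 2023]                        # slice: fix one dimension to a single value
pivot = sales.pivot_table(index="region", columns="year", values="amount", aggfunc="sum")  # pivot/reorient
print(roll_up, drill_down, slice_2023, pivot, sep="\n\n")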

15 Hard Questions with Detailed Answers

1. What distinguishes a data warehouse from traditional database systems?

● Answer: Unlike traditional databases designed for day-to-day operations (OLTP), data
warehouses are designed to facilitate efficient querying and reporting of historical data
for analytical purposes (OLAP).

2. How does a data warehouse improve decision-making capabilities?

● Answer: It integrates data from multiple sources into a single comprehensive database,
allowing for richer insights through historical data analysis and complex querying, which
supports better strategic decision-making.

3. What is the significance of having a non-volatile data storage in a data warehouse?

● Answer: Non-volatility ensures that once data is entered into the warehouse, it remains
stable and is not changed by further operations, providing a reliable basis for analysis.

4. Describe the process of data transformation in data warehousing.

● Answer: Data transformation involves cleaning, mapping, and transforming data from
various source formats into the warehouse schema. It's crucial for ensuring data quality
and compatibility.

5. What challenges are associated with data integration in a data warehouse?

● Answer: Data integration involves resolving issues related to schema inconsistency, duplication, and conflicts from different data sources, requiring robust ETL processes and tools.

6. Explain the concept of multidimensional data models in data warehousing.

● Answer: Multidimensional data models allow data to be viewed in multiple dimensions, facilitating complex analyses and queries that mimic real-world decision-making scenarios.

7. What are the benefits of using a star schema in data warehousing?

● Answer: Star schemas simplify queries, improve performance, and enhance user
understanding by clearly separating measures (fact table) and contexts (dimension
tables).

8. How do OLAP operations enhance data analysis?

● Answer: OLAP operations like slicing, dicing, drilling, and pivoting allow analysts to view
data from different perspectives and granularities, leading to deeper insights.

9. What role does metadata play in a data warehouse?

● Answer: Metadata in a data warehouse defines the warehouse objects, data structure,
rules, and processing, facilitating the management and retrieval of data.

10. Discuss the impact of having a three-layer architecture in data warehousing.

● Answer: A three-layer architecture adds a reconciled data layer between the operational sources and the data warehouse layer, promoting data accuracy and operational efficiency by separating the stages of data preparation.

11. Why is an independent data mart considered beneficial in certain scenarios?

● Answer: Independent data marts can be tailored to specific departmental needs, providing faster and more relevant insights without the overhead of a full-scale data warehouse.

12. What considerations are important when designing a data warehouse schema?

● Answer: Important considerations include ensuring alignment with business processes, simplicity for user understanding, and scalability for future expansion.

13. How does a snowflake schema differ from a star schema, and what are its
advantages?

● Answer: The snowflake schema is a normalized version of the star schema, which
reduces data redundancy and improves data integrity at the cost of query complexity.

14. What are the practical challenges of implementing OLAP in real-world applications?

● Answer: Challenges include managing large data volumes, ensuring fast query
performance, and integrating OLAP with existing IT infrastructure.

15. Explain the "slice and dice" operation in the context of OLAP.
● Answer: "Slice and dice" refers to the ability to focus on specific slices of the data cube
or to cut the data across different dimensions, enabling detailed analysis and
comparison within a multidimensional space.

Short Notes on Association Mining from "ICS 2408_Lecture 5"

1. Association Mining Basics:

● Definition: Association mining identifies frequent patterns, associations, correlations, or causal structures among sets of items or objects within databases.
● Motivation: Helps discover regularities within data such as which products are often purchased together or subsequent purchases after an initial item is bought.

2. Importance of Association Mining:

● Applications: Includes market basket analysis, cross-marketing, catalog design, sales campaign analysis, and more.
● Foundation for Advanced Analysis: Enables tasks like associative classification,
cluster analysis, and supports advanced data mining operations like sequential pattern
mining, spatial and multimedia associations.

3. Market Basket Analysis (MBA):

● Purpose: Understands customer purchase behavior by analyzing items frequently bought together.
● Insight Goals: Helps in store layout planning, promotions, and improving customer satisfaction through better product placements.

4. Transactional Data Characteristics:

● Items and Baskets: Discusses the concept of items (individual products) and baskets
(collections of items bought in a single transaction).

5. Itemsets and Association Rules:

● Itemsets: Defined as a collection of one or more items.


● Association Rules: Implications of the form X ⇒ Y, where X and Y are disjoint itemsets,
suggesting when X is bought, Y is also likely bought.

6. Rule Measures: Support and Confidence:

● Support: Probability that a transaction contains both X and Y.


● Confidence: Conditional probability that a transaction containing X also contains Y.
7. Algorithms for Mining Association Rules:

● Apriori Algorithm: Uses a frequent itemset approach to generate association rules.


● FP-Growth Algorithm: Efficiently mines the complete set of frequent itemsets without
candidate generation using a pattern growth approach.

15 Hard Questions with Detailed Answers

1. What is the goal of association mining in data analytics?

● Answer: The primary goal is to find interesting patterns, associations, correlations, and
frequent itemsets among large sets of data in transaction databases and other
information repositories.

2. How does the market basket analysis apply association rule mining?

● Answer: MBA uses association rule mining to find sets of products that frequently
co-occur in transactions, aiding in understanding consumer buying behavior and
enabling effective marketing strategies.

3. What is the difference between support and confidence in association rule mining?

● Answer: Support measures the frequency of occurrence of an itemset in a dataset, while confidence measures the likelihood of occurrence of the consequent in a rule given the antecedent.

4. Explain the Apriori Algorithm and its significance in mining association rules.

● Answer: The Apriori Algorithm identifies frequent individual items in the dataset and extends them to larger and larger itemsets as long as those itemsets appear sufficiently often in the database. It is significant for its use of the Apriori (downward-closure) property, which improves the efficiency of identifying frequent itemsets by pruning candidates whose subsets are infrequent.
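
A compact sketch of the level-wise idea behind Apriori; this is pure Python with toy transactions and an absolute support count, and it omits the subset-pruning optimizations of the full algorithm.

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
min_support = 3  # absolute count of transactions

def frequent_itemsets(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    frequent, k = {}, 1
    current = [frozenset([i]) for i in items]
    while current:
        # Count the support of each candidate and keep only the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Generate (k+1)-item candidates by joining frequent k-itemsets.
        k += 1
        current = list({a | b for a in level for b in level if len(a | b) == k})
    return frequent

print(frequent_itemsets(transactions, min_support))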

5. Describe the FP-Growth Algorithm and how it differs from Apriori.

● Answer: FP-Growth reduces the need for candidate generation and multiple database
scans, which are necessary in the Apriori algorithm, by using a tree structure to store
compressed, crucial information about frequent patterns.

6. What challenges are faced in mining frequent itemsets?

● Answer: Challenges include managing large data volumes, ensuring the efficiency of
the mining process, reducing the number of scans of the database, and handling the
exponential growth of candidate itemsets.

7. What are the uses of association rules in real-world applications?


● Answer: They are used in retail for shelf placement and promotions, in banking for
detecting patterns in transactions, and in e-commerce for product recommendations.

8. How do confidence and lift differ as measures in association mining?

● Answer: Confidence measures the reliability of the inference made by a rule, whereas
lift compares the observed support to that expected if X and Y were independent. A lift
value greater than 1 indicates a positive association.
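
A tiny numeric check of that relationship; all probabilities below are invented for the sketch.

# Toy probabilities, invented for illustration.
support_x, support_y, support_xy = 0.6, 0.6, 0.4   # P(X), P(Y), P(X and Y)
confidence = support_xy / support_x                 # P(Y | X)
lift = confidence / support_y                       # = support_xy / (support_x * support_y)
print(round(confidence, 2), round(lift, 2))         # 0.67 and 1.11: lift > 1, so a positive association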

9. What role does data preprocessing play in association rule mining?

● Answer: Data preprocessing involves cleaning data, handling missing values, and
preparing data in suitable formats, which is crucial for the accuracy and efficiency of
mining association rules.

10. What is meant by 'constraint-based association mining'?

● Answer: It refers to the application of constraints on the mining process to focus the
generation of itemsets and rules on the most potentially valuable ones, enhancing the
efficiency and relevance of the mining process.

11. Discuss the concept of 'interest' in association rules.

● Answer: Interest measures whether the occurrence of the antecedent and the
consequent in a rule is statistically independent. An interest value different from 0
indicates a potentially useful association.

12. How does association mining help in decision-making processes?

● Answer: It provides insights based on historical data patterns, aiding decision-makers in understanding trends, customer behaviors, and potential strategies for improving business outcomes.

13. What are the typical steps involved in the association rule mining process?

● Answer: The process includes setting up the problem, identifying relevant data,
preprocessing data, finding frequent itemsets, generating rules from these itemsets, and
evaluating these rules.

14. How can association rules be applied in e-commerce?

● Answer: They can be used to recommend products to users based on the analysis of
frequently bought items together, enhancing the shopping experience and potentially
increasing sales.
15. What is the role of the minimum support threshold in mining association rules?

● Answer: It determines the minimum frequency at which itemsets are considered 'frequent' for further analysis. Setting it too low may lead to too many results, while too high a threshold may miss interesting associations.

Short Notes on Classification and Prediction from "ICS 2408_Lecture 6"

1. Overview of Classification and Prediction:

● Classification: Assigns items to predefined categories based on a model derived from historical data.
● Prediction: Deals with the estimation of future values based on observed data patterns, primarily focusing on continuous or ordered values.

2. Definition and Process:

● Model Construction: Involves training a classifier using a training dataset where each
record is associated with a predefined class label. The model can be in the form of rules,
decision trees, or mathematical equations.
● Model Usage: The constructed model is used to predict class labels for new, unseen
instances. Model accuracy is validated using a test dataset.

3. Types of Learning:

● Supervised Learning: The training data includes input-output pairs. The system learns
to predict the output from the input.
● Unsupervised Learning: The system tries to learn patterns from the input data without
any explicit input-output pairs.

4. Methods of Classification:

● Decision Tree Induction: Uses a tree-like graph of decisions and their possible
consequences.
● Rule-Based Classification: Uses a set of if-then rules for classification.
● Instance-Based Methods: Includes k-nearest neighbor which classifies new cases
based on their distance to cases in the training set.

5. Prediction Techniques:
● Regression Analysis: Used for predicting a continuous variable.
● Time Series Prediction: Analyzing sequenced data points to predict future values.

6. Data Preparation for Classification and Prediction:

● Data Cleaning: Removing noise and handling missing data.


● Relevance Analysis: Identifying relevant predictors.
● Data Transformation and Reduction: Normalizing and generalizing data to improve the
classifier's performance.

7. Evaluation of Classification Methods:

● Accuracy: Correctness of the classifier in predicting class labels.


● Speed: Time taken to build the model and use it for classifying new data.
● Robustness: Ability to handle noise and outliers in the data.
● Scalability: Capability to efficiently process large databases.

15 Hard Questions with Detailed Answers

1. What distinguishes classification from prediction?

● Answer: Classification is about identifying the categorical class labels of instances, whereas prediction involves forecasting continuous or ordered values.

2. How does a decision tree work in classification?

● Answer: A decision tree represents decisions and their possible consequences, including class labels as leaf nodes. It splits data by features that result in the most significant information gain at each node.

3. What is the k-nearest neighbor algorithm and how is it used in classification?

● Answer: The k-nearest neighbor algorithm classifies a new instance by a majority vote
of its k nearest neighbors, based on a distance metric like Euclidean distance.
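
A minimal scikit-learn sketch of this; the library, the choice of k, and the toy points are assumptions for illustration.

from sklearn.neighbors import KNeighborsClassifier

# Two-dimensional toy points with class labels 0 and 1.
X = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")  # majority vote of the 3 nearest points
knn.fit(X, y)
print(knn.predict([[2, 2], [6, 7]]))   # expected: [0 1]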

4. Explain the concept of overfitting in decision tree learning.

● Answer: Overfitting occurs when a model is too complex, capturing noise in the data
rather than the actual pattern, thus performing poorly on unseen data.

5. Discuss the role of data preparation in the success of classification models.

● Answer: Proper data preparation, such as cleaning and normalization, is crucial as it directly affects the quality and effectiveness of the learning process and outcomes.

6. What is naive Bayes classification?


● Answer: Naive Bayes classifiers assume that the effect of a variable’s value on a given
class is independent of the values of other variables. This assumption simplifies
computation, and it performs well in many scenarios.

7. How do you measure the performance of a classification model?

● Answer: Performance can be measured using accuracy, precision, recall, F1-score, and
by analyzing the confusion matrix.
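
A quick sketch using scikit-learn's metrics; the predicted labels are invented for illustration.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))   # rows = actual class, columns = predicted class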

8. What are the advantages and disadvantages of using a decision tree?

● Answer: Advantages include simplicity of understanding and interpreting, and dealing well with noisy or incomplete data. Disadvantages include overfitting, especially with many classes or complex trees.

9. What is the Gini index used for in decision trees?

● Answer: The Gini index measures the impurity of a dataset; lower values indicate that a
node contains predominantly instances of a single class. It is used to choose the best
split at each node in the tree.
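
A tiny sketch of the Gini computation for a single node; pure Python, with label counts invented for illustration.

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p * p for p in proportions)

print(gini(["yes"] * 5 + ["no"] * 5))   # 0.5  -> maximally mixed two-class node
print(gini(["yes"] * 9 + ["no"] * 1))   # 0.18 -> nearly pure node, preferred for a split
print(gini(["yes"] * 10))               # 0.0  -> pure node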

10. Explain the difference between attribute selection measure in decision trees and
splitting attribute in k-NN.

● Answer: In decision trees, attribute selection measures like information gain or Gini
index help determine which feature splits the data best at each node. In k-NN, all
attributes are used to calculate distance, with no specific feature being chosen for
splitting.

11. What challenges are associated with k-NN classification?

● Answer: Challenges include determining the optimal number of neighbors, handling high-dimensional data, and the computational cost of finding the nearest neighbors in large datasets.

12. How does the naive Bayes classifier handle continuous data?

● Answer: It typically assumes that continuous attributes are distributed according to a Gaussian distribution and estimates the parameters of the distribution from the training data.

13. Describe the process of pruning in decision tree induction.

● Answer: Pruning reduces the size of decision trees by removing parts of the tree that do
not provide power to classify instances. This helps to improve model generalizability and
reduces overfitting.
14. What is lazy learning in the context of instance-based methods?

● Answer: Lazy learning refers to methods that generalize beyond the training data only
at query time, thus deferring the decision about how to generalize until each new
instance needs to be classified.

15. How does model evaluation differ between classification and prediction tasks?

● Answer: For classification, the evaluation focuses on how accurately the model
classifies new instances. For prediction tasks, it typically revolves around how closely
the model’s predictions match the actual values of the attributes being predicted.

Short Notes on Clustering from "ICS 2408_Lecture 7"

1. Definition and Importance of Cluster Analysis:

● Cluster Analysis: Identifies groups (clusters) of similar items within a dataset where
members of a cluster are more similar to each other than to those in other clusters.
● Importance: Helps in pattern recognition, spatial data analysis, image processing, and
provides insight into the data distribution, which is useful for various applications
including market research and customer segmentation.

2. Types of Clustering Methods:

● Partitioning Methods: Divide the dataset into a set of k clusters. Example: k-means,
k-medoids.
● Hierarchical Methods: Build a hierarchy of clusters using either agglomerative
(bottom-up) or divisive (top-down) approaches.
● Density-Based Methods: Form clusters based on the area of density in the data space.
Example: DBSCAN, OPTICS.
● Grid-Based Methods: Quantize the space into a finite number of cells that form a grid
structure and then do clustering on the grid. Example: STING, CLIQUE.
● Model-Based Clustering: Assume a model for each cluster and find the best fit of the
data to the given model. Example: Gaussian mixtures, EM algorithm.

3. Applications of Clustering:

● Marketing: Helps in discovering customer groups for targeted marketing.


● City Planning: Useful for identifying areas of similar land use.
● Insurance: Can be used to find groups of customers with similar policies or claims.
4. Good Clustering:

● A good clustering method will produce high-quality clusters with:


○ High intra-cluster similarity (items within a cluster are very similar).
○ Low inter-cluster similarity (items from different clusters are very dissimilar).

5. Requirements for Effective Clustering in Data Mining:

● Scalability: The algorithm should handle large datasets efficiently.


● Ability to deal with different types of attributes: Algorithms should handle numerical,
categorical, and ordinal data.
● Discovery of clusters with arbitrary shape: The method should not be biased towards
any specific shape of clusters.
● Minimal requirements for domain knowledge: Should not require much input about
data from the user.

6. Evaluating the Quality of Clustering:

● Involves using metrics like similarity functions that measure how well the clusters are
formed. Typical measures include Euclidean distance for interval-scaled variables.

15 Hard Questions with Detailed Answers

1. What is cluster analysis and why is it important in data mining?

● Answer: Cluster analysis is the task of grouping a set of objects in such a way that
objects in the same group are more similar to each other than to those in other groups.
It's important because it helps reveal natural structures within data and can guide
decision making and prediction.

2. Describe the partitioning clustering method and give an example.

● Answer: Partitioning methods divide data into non-overlapping subsets or clusters without a hierarchical structure. For example, the k-means algorithm organizes data into k groups based on distance to the nearest centroid.
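
A minimal k-means sketch with scikit-learn; the library and the toy points are assumptions for illustration.

from sklearn.cluster import KMeans

# Two visually separated toy groups of 2-D points.
X = [[1, 1], [1, 2], [2, 2], [8, 8], [9, 8], [8, 9]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each point, e.g. [0 0 0 1 1 1]
print(km.cluster_centers_)  # the two centroids the points were assigned to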

3. Explain the difference between hierarchical and density-based clustering methods.

● Answer: Hierarchical clustering creates a tree of clusters and does not require the
number of clusters to be specified a priori, while density-based clustering defines
clusters as areas of higher density than the remainder of the data set and can ignore
noise.

4. How does the DBSCAN algorithm work?


● Answer: DBSCAN groups together points that are close to each other based on a
distance measurement and a minimum number of points. It defines clusters as areas of
high density separated by areas of low density.

5. What is a grid-based clustering method and its advantages?

● Answer: Grid-based methods quantize the space into a finite number of cells that form a
grid and perform all operations on the grid structure. This method is fast and
independent of the number of data objects; it only depends on the number of cells in
each dimension in the quantized space.

6. Discuss the model-based clustering method and its applications.

● Answer: Model-based clustering methods hypothesize a model for each of the clusters
and try to find the best fit of the model to each cluster. These methods are often used in
complex data sets like those involving human genes where the data can exhibit very
diverse patterns.

7. What are the main challenges in clustering high-dimensional data?

● Answer: High-dimensional spaces pose the challenge of the "curse of dimensionality", where the volume of the space increases so fast that the available data become sparse, making it difficult to identify meaningful clusters.

8. Define the term 'outlier' in the context of clustering.

● Answer: An outlier is a data point that is considerably dissimilar from the rest of the
data, representing noise or an anomalous event in the dataset.

9. How can clustering be used in image processing applications?

● Answer: In image processing, clustering can be used to segment different parts of the
image, identifying regions that share similar colors or intensities, which is crucial in
object recognition and compression techniques.

10. What does it mean for clusters to have high intra-class similarity and low inter-class
similarity?

● Answer: High intra-class similarity means that objects within the same cluster are very
similar to each other, enhancing the homogeneity of the cluster. Low inter-class similarity
means that objects in different clusters are distinctly different, enhancing the distinction
between clusters.

11. What are the typical applications of clustering in economic science?


● Answer: In economic science, clustering can be used for market segmentation,
identifying groups of customers with similar buying habits or preferences, which can
guide pricing strategies and product development.

12. Explain how the quality of clustering can be measured.

● Answer: The quality of clustering can often be measured by internal indices such as
cohesion (how closely related objects within the same cluster are) and separation (how
distinct or well-separated a cluster is from other clusters).

13. What are some considerations when choosing a clustering algorithm?

● Answer: Considerations include the type and scale of the data, the desired number of
clusters, the shape and size of the clusters, the domain of application, and
computational resources.

14. Discuss the importance of scalability in clustering algorithms.

● Answer: Scalability is crucial for clustering algorithms to handle large datasets efficiently
without compromising the quality of the clustering results, especially in environments
where data volumes continuously grow.

15. What is the significance of noise handling in clustering algorithms like DBSCAN?

● Answer: Effective noise handling allows clustering algorithms to focus on the most
relevant patterns in the data without being misled by anomalies or irrelevant data points,
thus improving the accuracy and usefulness of the clustering outcomes.

Short Notes on Applications and Trends in Data Mining from "ICS 2408_Lecture 8"

1. Overview of Data Mining Applications:

● Interdisciplinary Impact: Data mining is a multidisciplinary field that impacts various domains including finance, retail, telecommunication, and biology.

2. Financial Data Analysis:

● Usage in Finance: Data mining helps analyze financial data from banks and institutions
for multidimensional analysis, loan payment prediction, and credit policy analysis.
● Money Laundering Detection: Integrates data from various databases to detect
financial crimes using tools like data visualization and linkage analysis.
3. Retail Industry Applications:

● Customer Insight: Data mining in retail helps understand customer buying behaviors
and shopping patterns, improves customer service, retention, satisfaction, and optimizes
goods distribution.

4. Telecommunications Industry:

● Pattern Identification: Data mining aids in identifying telecommunication patterns, catching fraudulent activities, and improving resource utilization and service quality.

5. Biomedical Data Analysis:

● DNA and Gene Analysis: Data mining supports the integration of heterogeneous
genomic databases and aids in identifying gene sequence patterns linked to various
diseases.

6. Key Data Mining Systems Considerations:

● System Selection: Factors in choosing a data mining system include data types
handled, system scalability, coupling with database systems, and the inclusion of
visualization tools.
● Privacy Concerns: Addresses the potential threats to privacy and data security posed
by data mining, promoting fair information practices and the development of
security-enhancing techniques.

7. Trends in Data Mining:

● Invisible Data Mining: Integration of data mining as a built-in function of other systems.
● Scalability and Standardization: Focus on developing scalable mining methods and
standardizing data mining languages to enhance interoperability.
● Advanced Applications: Expansion into complex data types and new methods for
mining, including web mining and privacy protection measures.

15 Hard Questions with Detailed Answers

1. How does data mining assist in financial data analysis?

● Answer: It helps in designing data warehouses for multidimensional data analysis, predicting loan payments, analyzing credit policies, and detecting financial fraud.

2. What are the main applications of data mining in the retail industry?

● Answer: Data mining helps in identifying customer buying behaviors, improving customer service, enhancing customer retention, and optimizing inventory and distribution policies.
3. Describe the role of data mining in the telecommunications industry.

● Answer: It is used to understand business operations, detect fraudulent activities, improve resource usage, and enhance service quality through pattern analysis.

4. Explain how data mining contributes to biomedical data analysis, specifically in genetics.

● Answer: Data mining assists in the semantic integration of genomic databases, identifying disease-related gene sequences, and understanding the genetic basis of diseases.

5. What considerations should be taken into account when choosing a data mining
system?

● Answer: Considerations include data types supported, scalability, system architecture (client/server, web-based), integration with existing databases or data warehouses, and the availability of visualization tools.

6. How does data mining pose a threat to privacy and data security?

● Answer: Data mining can lead to the invasion of privacy through extensive profiling and
data analysis, potentially exposing sensitive personal information.

7. What are some key trends in the field of data mining going forward?

● Answer: Trends include the development of invisible data mining techniques, scalable
methods, integration with database systems, standardization of mining languages, and
enhancements in privacy protection.

8. How can the retail industry benefit from customer segmentation using data mining?

● Answer: Customer segmentation allows retailers to develop targeted marketing strategies, tailor product offerings, and optimize store layouts to improve customer experiences and increase sales.

9. What methodologies are used in financial data mining to detect money laundering?

● Answer: Methodologies include data integration from multiple databases, application of classification and clustering tools, outlier analysis, and sequential pattern analysis.

10. Discuss the impact of standardization in data mining technologies.

● Answer: Standardization facilitates systematic development, improves system interoperability, and promotes broader adoption and education about data mining technologies in industry and society.
11. What role does visual data mining play in enhancing data analysis?

● Answer: Visual data mining uses computer graphics to create visual representations of
complex data sets, helping to improve understanding and facilitating more effective data
exploration.

12. How does data mining aid in fraud detection in telecommunications?

● Answer: It analyzes calling patterns, durations, and other multidimensional data to


identify discrepancies and unusual activities indicative of fraudulent behavior.

13. What challenges are associated with mining high-dimensional data?

● Answer: High-dimensional data can lead to the "curse of dimensionality," where increased dimensions make data sparse, complicating the process of finding meaningful patterns.

14. Explain the significance of coupling data mining systems with databases or data
warehouses.

● Answer: Tight coupling enhances the efficiency of data mining processes by facilitating
direct access to cleaned and processed data stored in databases or data warehouses.

15. How can data mining be used to enhance customer service in the retail industry?

● Answer: By analyzing customer interaction data and feedback, retailers can identify
areas for improvement in service offerings, personalize customer interactions, and
optimize support resources.
1. What is Data Mining?

Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns.

2. What are the different tasks of Data Mining?

The following activities are carried out during data mining:

Classification

Clustering

Association Rule Discovery

Sequential Pattern Discovery

Regression

Deviation Detection

3. Discuss the Life cycle of Data Mining projects?

The life cycle of Data mining projects:

Business understanding: Understanding project objectives from a business perspective; data mining problem definition.
Data understanding: Initial data collection and understanding it.

Data preparation: Constructing the final data set from raw data.

Modelling: Select and apply data modelling techniques.

Evaluation: Evaluate model, decide on further deployment.

Deployment: Create a report, carry out actions based on new insights.


4. Explain the process of KDD?

Data mining is often treated as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Others view data mining as simply an essential step in the process of knowledge discovery, in which intelligent methods are applied in order to extract data patterns.

Knowledge discovery from data consists of the following steps:

Data cleaning (to remove noise or irrelevant data).

Data integration (where multiple data sources may be combined).

Data selection (where data relevant to the analysis task are retrieved from the database).

Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation functions, for example).

Data mining (an important process where intelligent methods are applied in order to extract data
patterns).

Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures).

Knowledge presentation (where knowledge representation and visualization techniques are used to present the mined knowledge to the user).

5. What is Classification?

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. Classification can be used for predicting the class label of data items. However, in many applications, one may wish to predict some missing or unavailable data values rather than class labels.

6. Explain Evolution and deviation analysis?

Data evolution analysis describes and models regularities or trends for objects whose behavior varies over time. Although this may involve discrimination, association, classification, characterization, or clustering of time-related data, distinct features of such an analysis involve time-series data analysis, periodicity pattern matching, and similarity-based data analysis.

In the analysis of time-related data, it is often required not only to model the general
evolutionary trend of the data but also to identify data deviations that occur over time.
Deviations are differences between measured values and corresponding references such as
previous values or normative values. A data mining system performing deviation analysis, upon the detection of a set of deviations, may do the following: describe the characteristics of the deviations, try to describe the reason behind them, and suggest actions to bring the deviated values back to their expected values.

7. What is Prediction?

Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled object, or to estimate the value or value ranges of an attribute that a given object is likely to have. In this interpretation, classification and regression are the two major types of prediction problems, where classification is used to predict discrete or nominal values, while regression is used to predict continuous or ordered values.

8. Explain the Decision Tree Classifier?

A Decision tree is a flow chart-like tree structure, where each internal node (non-leaf node)
denotes a test on an attribute, each branch represents an outcome of the test and each leaf
node (or terminal node) holds a class label. The topmost node of a tree is the root node.

A decision tree is a classification scheme that generates a tree and a set of rules, representing the model of different classes, from a given data set. The set of records available for developing classification methods is generally divided into two disjoint subsets, namely a training set and a test set. The former is used for deriving the classifier, while the latter is used to measure the accuracy of the classifier. The accuracy of the classifier is determined by the percentage of the test examples that are correctly classified.

In the decision tree classifier, we categorize the attributes of the records into two different types.
Attributes whose domain is numerical are called the numerical attributes and the attributes
whose domain is not numerical are called categorical attributes. There is one distinguished
attribute called a class label. The goal of classification is to build a concise model that can be
used to predict the class of the records whose class label is unknown. Decision trees can simply
be converted to classification rules.
(Figure: Decision Tree Classifier)

9. What are the advantages of a decision tree classifier?

Decision trees are able to produce understandable rules.

They are able to handle both numerical and categorical attributes.

They are easy to understand.

Once a decision tree model has been built, classifying a test record is extremely fast.

Decision tree representation is rich enough to represent any discrete-value classifier.

Decision trees can handle datasets that may have errors.

Decision trees can handle datasets that may have missing values.

10. Explain Bayesian classification in Data Mining?

A Bayesian classifier is a statistical classifier. Bayesian classifiers can predict class membership probabilities, for instance, the probability that a given sample belongs to a particular class. Bayesian classification is based on Bayes' theorem. A simple Bayesian classifier, known as the naive Bayesian classifier, has been found to be comparable in performance with decision tree and neural network classifiers. Bayesian classifiers have also displayed high accuracy and speed when applied to large databases.
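
A brief naive Bayes sketch using scikit-learn's GaussianNB; the library and the toy numeric data are assumptions, not part of the notes.

from sklearn.naive_bayes import GaussianNB

# Toy numeric features with two classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 3.5], [3.2, 3.4], [2.9, 3.6]]
y = [0, 0, 0, 1, 1, 1]

nb = GaussianNB().fit(X, y)
print(nb.predict([[1.1, 2.0]]))        # predicted class label
print(nb.predict_proba([[1.1, 2.0]]))  # class membership probabilities, as Bayesian classifiers provide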

11. Why Fuzzy logic is an important area for Data Mining?

Rule-based systems for classification have the disadvantage that they involve sharp cutoffs for continuous attributes. Fuzzy logic is useful for data mining systems performing classification. It provides the benefit of working at a high level of abstraction. In general, the use of fuzzy logic in rule-based systems involves the following:

Attribute values are changed to fuzzy values.

For a given new sample, more than one fuzzy rule may apply. Every applicable rule contributes a vote for membership in the categories. Typically, the truth values for each predicted category are summed.

The sums obtained above are combined into a value that is returned by the system. This process may be done by weighting each category by its truth sum and multiplying by the mean truth value of each category. The calculations involved may be more complex, depending on the complexity of the fuzzy membership graphs.
12. What are Neural networks?

A neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input samples. Neural network learning is also referred to as connectionist learning due to the connections between units. Neural networks involve long training times and are therefore more appropriate for applications where this is feasible. They require a number of parameters that are typically best determined empirically, such as the network topology or "structure". Neural networks have been criticized for their poor interpretability, since it is difficult for humans to interpret the symbolic meaning behind the learned weights. These features initially made neural networks less desirable for data mining.

The advantages of neural networks, however, include their high tolerance to noisy data as well as their ability to classify patterns on which they have not been trained. In addition, several algorithms have recently been developed for the extraction of rules from trained neural networks. These factors contribute to the usefulness of neural networks for classification in data mining. The most popular neural network algorithm is the backpropagation algorithm, proposed in the 1980s.

13. How Backpropagation Network Works?

A backpropagation network learns by iteratively processing a set of training samples, comparing the network's estimate for each sample with the actual known class label. For each training sample, weights are modified to minimize the mean squared error between the network's prediction and the actual class. These changes are made in the "backward" direction, i.e., from the output layer, through each hidden layer down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops.
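
A very small numeric sketch of this training loop, with one hidden layer, sigmoid units, and a mean-squared-error update; the network size, learning rate, and XOR data are illustrative assumptions, not an example from the notes.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy inputs (XOR)
y = np.array([[0], [1], [1], [0]], dtype=float)               # known class labels

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def add_bias(a):
    return np.hstack([a, np.ones((a.shape[0], 1))])           # append a constant bias input

W1 = rng.normal(size=(3, 4))   # 2 inputs + bias -> 4 hidden units
W2 = rng.normal(size=(5, 1))   # 4 hidden units + bias -> 1 output unit

for _ in range(10000):
    h = sigmoid(add_bias(X) @ W1)                    # forward pass: hidden layer
    out = sigmoid(add_bias(h) @ W2)                  # forward pass: the network's estimate
    err_out = (out - y) * out * (1 - out)            # output-layer error term (MSE derivative x sigmoid')
    err_hid = (err_out @ W2[:-1].T) * h * (1 - h)    # propagate the error backward to the hidden layer
    W2 -= 0.5 * add_bias(h).T @ err_out              # weight updates, made in the backward direction
    W1 -= 0.5 * add_bias(X).T @ err_hid
print(out.round(2))   # typically approaches the targets [0, 1, 1, 0] after training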

14. What is a Genetic Algorithm?

The genetic algorithm is part of evolutionary computing, which is a rapidly growing area of artificial intelligence. The genetic algorithm is inspired by Darwin's theory of evolution: the solution to a problem solved by a genetic algorithm is evolved rather than computed directly. In a genetic algorithm, a population of strings (called chromosomes, or the genotype of the genome), which encode candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem, is evolved toward better solutions. Traditionally, solutions are represented in the form of binary strings composed of 0s and 1s, though other encoding schemes can also be applied.
15. What is Classification Accuracy?

Classification accuracy or accuracy of the classifier is determined by the percentage of the test
data set examples that are correctly classified. The classification accuracy of a classification
tree = (1 – Generalization error).

16. Define Clustering in Data Mining?

Clustering is the task of dividing the population or data points into a number of groups such that
data points in the same groups are more similar to other data points in the same group and
dissimilar to the data points in other groups. It is basically a collection of objects on the basis of
similarity and dissimilarity between them.

17. Write a difference between classification and clustering?[IMP]

Parameters: Classification vs. Clustering

Type: Classification is used for supervised learning; clustering is used for unsupervised learning.

Basic: Classification is the process of classifying input instances based on their corresponding class labels; clustering groups instances based on their similarity, without the help of class labels.

Need: Classification uses labels, so training and testing data sets are needed to verify the model created; clustering needs no training and testing data set.

Complexity: Classification is more complex as compared to clustering; clustering is less complex as compared to classification.

Example Algorithms: Classification: logistic regression, naive Bayes classifier, support vector machines, etc. Clustering: k-means clustering, fuzzy c-means clustering, Gaussian (EM) clustering, etc.

18. What is Supervised and Unsupervised Learning?[TCS interview question]

Supervised learning, as the name indicates, has the presence of a supervisor acting as a teacher.
Basically, supervised learning is when we teach or train the machine using data that is well
labeled, meaning some data is already tagged with the correct answer. After that, the machine is
provided with a new set of examples (data) so that the supervised learning algorithm analyses
the training data (the set of training examples) and produces a correct outcome from the labeled
data.

Unsupervised learning is the training of a machine using information that is neither classified nor
labeled, allowing the algorithm to act on that information without guidance. Here the task of the
machine is to group unsorted information according to similarities, patterns, and differences
without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training will be given to the
machine. Therefore, the machine is restricted to finding the hidden structure in unlabeled data by
itself.

19. Name areas of applications of data mining?

Data Mining Applications for Finance

Healthcare

Intelligence

Telecommunication

Energy

Retail

E-commerce

Supermarkets

Crime Agencies

Businesses Benefit from data mining

20. What are the issues in data mining?

A number of issues that need to be addressed by any serious data mining package

Uncertainty Handling

Dealing with Missing Values

Dealing with Noisy data

Efficiency of algorithms

Constraining Knowledge Discovered to only Useful

Incorporating Domain Knowledge

Size and Complexity of Data

Data Selection

Understandability of Discovered Knowledge: Consistency between Data and Discovered Knowledge.
21. Give an introduction to data mining query language?

DMQL, or Data Mining Query Language, was proposed by Han, Fu, Wang, et al. This language
works on the DBMiner data mining system. DMQL queries are based on SQL (Structured Query
Language). We can use this language for databases and for data warehouses as well. This query
language supports ad hoc and interactive data mining.

22. Differentiate Between Data Mining And Data Warehousing?

Data Mining: It is the process of finding patterns and correlations within large data sets to
identify relationships between data. Data mining tools allow a business organization to predict
customer behavior. Data mining tools are used to build risk models and detect fraud. Data
mining is used in market analysis and management, fraud detection, corporate analysis, and
risk management.

It is a technology that aggregates structured data from one or more sources so that it can be
compared and analyzed, rather than used for transaction processing.

Data Warehouse: A data warehouse is designed to support the management decision-making


process by providing a platform for data cleaning, data integration, and data consolidation. A
data warehouse contains subject-oriented, integrated, time-variant, and non-volatile data.

A data warehouse consolidates data from many sources while ensuring data quality, consistency,
and accuracy. A data warehouse improves system performance by separating analytical
processing from transactional databases. Data flows into a data warehouse from the various
databases. A data warehouse works by organizing data into a schema that describes the layout
and type of data. Query tools analyze the data tables using the schema.

23.What is Data Purging?

The term purging can be defined as erasing or removing. In the context of data mining, data
purging is the process of permanently removing unnecessary data from the database and
cleaning the data to maintain its integrity.

24. What Are Cubes?

A data cube stores data in a summarized form, which helps in faster analysis of the data. The
data is stored in such a way that it allows easy reporting. For example, using a data cube, a user
may want to analyze the weekly and monthly performance of an employee; here, month and
week could be considered the dimensions of the cube.
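
As a rough illustration, a pandas pivot table can play the role of a small data cube; the employee/month/week columns and values below are hypothetical.

```python
# A data cube is essentially data pre-aggregated along chosen dimensions. Here a
# pandas pivot table stands in for one; the columns and figures are made up.
import pandas as pd

sales = pd.DataFrame({
    "employee": ["A", "A", "B", "B"],
    "month":    ["Jan", "Jan", "Jan", "Feb"],
    "week":     [1, 2, 1, 1],
    "units":    [10, 7, 12, 9],
})

# Dimensions = (employee, month, week); measure = total units sold.
cube = sales.pivot_table(index="employee", columns=["month", "week"],
                         values="units", aggfunc="sum", fill_value=0)
print(cube)
```
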
25.What are the differences between OLAP And OLTP?[IMP]

OLAP (Online Analytical Processing) vs. OLTP (Online Transaction Processing):

Data: OLAP consists of historical data drawn from various databases; OLTP consists only of application-oriented, day-to-day operational (current) data.

Orientation: OLAP is subject-oriented and is used for data mining, analytics, decision making, etc.; OLTP is application-oriented and is used for routine business tasks.

Usage: OLAP data is used in planning, problem-solving, and decision-making; OLTP data is used to perform day-to-day fundamental operations.

View: OLAP provides a multi-dimensional view of different business tasks; OLTP reveals a snapshot of present business tasks.

Volume: OLAP stores a large amount of data, typically in TB or PB; OLTP data is relatively small (MB, GB), since historical data is archived.

Speed: OLAP queries are relatively slow because the amount of data involved is large, and queries may take hours; OLTP is very fast, as queries operate on only about 5% of the data.

Backup: OLAP only needs backup from time to time; in OLTP, the backup and recovery process is maintained religiously.

Users: OLAP data is generally managed by the CEO, MD, or GM; OLTP data is managed by clerks and managers.

Operations: OLAP involves mostly read and only rarely write operations; OLTP involves both read and write operations.

26. Explain Association Algorithm In Data Mining?

Association analysis is the discovery of association rules showing attribute-value conditions that
occur frequently together in a given set of data. Association analysis is widely used for market
basket or transaction data analysis. Association rule mining is a significant and exceptionally
active area of data mining research. One method of association-based classification, called
associative classification, consists of two steps. In the first step, association rules are generated
using a modified version of the standard association rule mining algorithm known as Apriori. The
second step constructs a classifier based on the association rules discovered.
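
A minimal sketch of computing support and confidence for one candidate rule over a toy transaction list; the items and the rule {milk} -> {bread} are made up for illustration, and this is not the full Apriori algorithm.

```python
# Support and confidence for the candidate rule {milk} -> {bread} over toy transactions.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "eggs"},
]

antecedent, consequent = {"milk"}, {"bread"}
n = len(transactions)

both = sum(1 for t in transactions if antecedent | consequent <= t)  # milk AND bread
ante = sum(1 for t in transactions if antecedent <= t)               # milk at all

support = both / n        # fraction of all transactions containing both items
confidence = both / ante  # of the transactions with milk, how many also have bread
print(support, confidence)   # 0.5 and 0.666...
```
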

27. Explain how to work with data mining algorithms included in SQL server data mining?

SQL Server Data Mining offers Data Mining Add-ins for Office 2007 that allow discovering
patterns and relationships in the information, which helps in improved analysis. The add-in called
the Data Mining Client for Excel is used to first prepare the information and then create, manage,
and analyze models and their results.
28. Explain Over-fitting?

The concept of over-fitting is very important in data mining. It refers to the situation in which the
induction algorithm generates a classifier that perfectly fits the training data but has lost the
capability of generalizing to instances not presented during training. In other words, instead of
learning, the classifier just memorizes the training instances. In decision trees, over-fitting
usually occurs when the tree has too many nodes relative to the amount of training data
available. As the number of nodes increases, the training error usually decreases, while at some
point the generalization error becomes worse. Over-fitting can lead to difficulties when there is
noise in the training data or when the number of training examples is too small; in such cases
the error of the fully built tree on the training data is zero, while the true error is likely to be bigger.

There are many disadvantages of an over-fitted decision tree:

Over-fitted models give inaccurate predictions on unseen data.

Over-fitted decision trees require more space and more computational resources.

They require the collection of unnecessary features.

29. Define Tree Pruning?

When a decision tree is built, many of the branches will reflect anomalies in the training data
due to noise or outliers. Tree pruning methods address this problem of over-fitting the data, so
tree pruning is a technique that reduces the over-fitting problem. Such methods typically use
statistical measures to remove the least reliable branches, generally resulting in faster
classification and an improvement in the ability of the tree to correctly classify independent test
data. The pruning phase eliminates some of the lower branches and nodes to improve
performance, and processing the pruned tree also improves understandability.
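
As a hedged example of post-pruning in practice, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter of its decision trees; the dataset and the alpha value below are illustrative, not tuned.

```python
# Post-pruning sketch with scikit-learn's cost-complexity pruning (illustrative values).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

# The pruned tree is smaller and often generalizes better to unseen test data.
print(full_tree.get_n_leaves(), full_tree.score(X_test, y_test))
print(pruned_tree.get_n_leaves(), pruned_tree.score(X_test, y_test))
```
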

30. What is a Sting?

Statistical Information Grid is called STING; it is a grid-based multi-resolution clustering strategy.
In the STING strategy, all of the objects are contained in rectangular cells; these cells are kept at
several levels of resolution, and these levels are organized in a hierarchical structure.
31. Define Chameleon Method?

Chameleon is another hierarchical clustering technique that uses dynamic modeling. Chameleon
was introduced to overcome the drawbacks of the CURE clustering technique. In this technique,
two clusters are merged if the interconnectivity between the two clusters is greater than the
interconnectivity between the objects inside each individual cluster.

32. Explain the Issues regarding Classification And Prediction?

Preparing the data for classification and prediction:

Data cleaning

Relevance analysis

Data transformation

Comparing classification methods

Predictive accuracy

Speed

Robustness

Scalability

Interpretability

33.Explain the use of data mining queries or why data mining queries are more helpful?

Data mining queries are primarily applied to a trained model to produce one or more predictions
for new data, and they also allow us to supply input values. A query can retrieve information
effectively if a particular pattern is defined correctly. It retrieves the statistics learned from the
training data and the specific rules describing the typical cases that represent a pattern in the
model. It helps in extracting regression formulas and other calculations, and it also recovers the
details about the individual cases used in the model, including information that was not used in
the analysis. It keeps the model up to date by adding new data, performing the task again, and
cross-verifying the results.
34. What is a machine learning-based approach to data mining?

This is one of the higher-level data mining interview questions asked in interviews. Machine
learning is widely utilized in data mining because it covers automatic, programmed processing
procedures based on logical or binary operations. Machine learning generally follows principles
that allow us to deal with more general data types, including cases in which the number and type
of attributes may vary. Machine learning is one of the popular techniques used for data mining,
and in artificial intelligence as well.

35.What is the K-means algorithm?

K-means clustering is the simplest unsupervised learning algorithm for solving clustering
problems. The K-means algorithm partitions n observations into k clusters, where each
observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
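
A minimal sketch using scikit-learn; the choice of k = 3 and the synthetic blob data are assumptions for illustration.

```python
# Minimal k-means sketch with scikit-learn (k and the toy data are illustrative).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)     # the k means (prototypes) of the clusters
print(km.labels_[:10])         # cluster assignment of the first 10 observations
```
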

36. What are precision and recall?[IMP]

Precision is one of the most commonly used evaluation metrics for classification. It is the fraction
of instances predicted as positive that are actually positive: Precision = (True positive)/(True
positive + False positive). Its range is from 0 to 1, where 1 represents 100%.

Recall can be defined as the fraction of actual positives that the model labels as positive (true
positives). Recall and the true positive rate are identical. Here’s the
formula for it:

Recall = (True positive)/(True positive + False negative)
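
A small sketch computing both metrics from made-up confusion-matrix counts:

```python
# Precision and recall from true/false positive and false negative counts
# (the counts are invented for illustration).
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)     # 0.8: of everything predicted positive, 80% were right
recall = tp / (tp + fn)        # 0.666...: of all actual positives, two thirds were found
print(precision, recall)
```
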

37. What are the ideal situations in which t-test or z-test can be used?

As a standard practice, a t-test is used when the sample size is under 30, and a z-test is
generally considered when the sample size exceeds 30.

38. What is the simple difference between standardized and unstandardized coefficients?

Standardized coefficients are interpreted in terms of their standard deviation units, while
unstandardized coefficients are measured in the actual units of the values present in the dataset.

39. How are outliers detected?

Numerous approaches can be used for detecting outliers (anomalies), but the two most
commonly used techniques are the following:

Standard deviation method: Here, a value is considered an outlier if it is lower or higher than
three standard deviations from the mean value.

Box plot method: Here, a value is considered an outlier if it is more than 1.5 times the
interquartile range (IQR) below the first quartile or above the third quartile.
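
Both rules, sketched with NumPy on a toy one-dimensional sample (on very small samples like this one the two rules may disagree):

```python
# Standard-deviation and IQR outlier rules on a toy sample.
import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 14, 95])   # 95 looks like an outlier

# Standard deviation rule: flag values more than three standard deviations from the mean.
std_outliers = x[np.abs(x - x.mean()) > 3 * x.std()]

# Box plot (IQR) rule: flag values beyond 1.5 * IQR below Q1 or above Q3.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print(std_outliers)   # on this tiny sample the 3-sigma rule may flag nothing
print(iqr_outliers)   # the IQR rule flags 95
```
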
40. Why is KNN preferred when determining missing numbers in data?

K-Nearest Neighbour (KNN) is preferred here because of the fact that KNN can easily
approximate the value to be determined based on the values closest to it.

The k-nearest neighbor (K-NN) classifier is considered an example-based classifier, which means
that the training documents are used for comparison rather than an explicit class representation,
such as the class profiles used by other classifiers. As such, there is no real training phase.
When a new document has to be classified, the k most similar documents (neighbors) are found,
and if a large enough proportion of them are assigned to a particular class, the new document is
also assigned to that class; otherwise it is not. Additionally, finding the nearest neighbors can be
sped up using appropriate indexing strategies.
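
As a hedged sketch, scikit-learn's KNNImputer fills a missing number with an average of the values from the k nearest rows; k = 2 and the tiny matrix are illustrative.

```python
# Filling a missing value with scikit-learn's KNNImputer (toy data, k = 2).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],    # missing value to be approximated
              [3.0, 6.0],
              [4.0, 8.0]])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))   # the NaN is replaced by the mean of its neighbours' values
```
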

41. Explain Prepruning and Post pruning approach in Classification?

Prepruning: In the prepruning approach, a tree is “pruned” by halting its construction early (e.g.,
by deciding not to further split or partition the subset of training samples at a given node). Upon
halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset
samples, or the probability distribution of those samples. When constructing a tree, measures
such as statistical significance, information gain, etc., can be used to assess the goodness of a
split. If partitioning the samples at a node would result in a split that falls below a pre-specified
threshold, then further partitioning of the given subset is halted. There are problems, however, in
choosing a proper threshold. High thresholds could result in oversimplified trees, while low
thresholds could result in very little simplification.

Postpruning: The postpruning approach removes branches from a “fully grown” tree. A tree
node is pruned by removing its branches. The cost complexity pruning algorithm is an example
of the post pruning approach. The pruned node becomes a leaf and is labeled by the most
frequent class among its former branches. For every non-leaf node in the tree, the algorithm
calculates the expected error rate that would occur if the subtree at that node were pruned.
Next, the expected error rate if the node were not pruned is calculated using the error rates for
each branch, combined by weighting according to the proportion of observations along each
branch. If pruning the node leads to a greater expected error rate, then the subtree is kept.
Otherwise, it is pruned. After generating a set of progressively pruned trees, an
independent test set is used to estimate the accuracy of each tree. The decision tree that
minimizes the expected error rate is preferred.
42. How can one handle suspicious or missing data in a dataset while performing the
analysis?

If there are any inconsistencies or uncertainties in the data set, a user can proceed with any of
the following approaches:

Create a validation report with insights regarding the data in question.

Escalate the issue to an experienced data analyst to take a look at it and make the call.

Replace the invalid information with corresponding valid and up-to-date data.

Use multiple methodologies together to discover missing values, using approximation estimates
if necessary.

43.What is the simple difference between Principal Component Analysis (PCA) and
Factor Analysis (FA)?

Among numerous differences, the significant difference between PCA and FA is that factor
analysis is used to model and explain the shared variance (covariance) among the observed
variables, while the point of PCA is to explain as much of the total variance as possible using a
smaller set of components.

44. What is the difference between Data Mining and Data Analysis?

Purpose: Data mining is used to recognize patterns in stored data; data analysis is used to organize and arrange raw information in a meaningful way.

Input: Mining is performed on clean and well-documented data; data analysis includes data cleaning, so the information is not usually available in a well-documented format.

Results: Results extracted from data mining are difficult to interpret; results extracted from data analysis are not difficult to interpret.

45. What is the difference between Data Mining and Data Profiling?

Data Mining: Data mining refers to the analysis of data for the discovery of relations that have
not been found before. It mainly focuses on the detection of unusual records, dependencies, and
cluster analysis.

Data Profiling: Data profiling can be described as the process of analyzing individual attributes
of data. It mostly focuses on providing useful information about data attributes, for example,
data type, frequency, and so on.

46. What are the important steps in the data validation process?

As the name suggests, data validation is the process of validating data. This step mainly has two
methods associated with it: data screening and data verification.

Data Screening: Different kinds of algorithms are used in this step to screen the entire data set
and find any inaccurate values.

Data Verification: Every suspected value is evaluated against various use cases, and then a
final decision is taken on whether the value must be included in the data or not.

47. What is the difference between univariate, bivariate, and multivariate analysis?

The main differences between univariate, bivariate, and multivariate analysis are as follows:

Univariate: A statistical analysis that involves only one variable at a given instance of time.

Bivariate: This analysis is used to examine the relationship or difference between two variables
at a time.

Multivariate: The analysis of more than two variables is known as multivariate analysis. It is used
to understand the effect of multiple factors on the responses.

48. What is the difference between variance and covariance?

Variance and covariance are two mathematical terms frequently used in statistics. Variance
measures how spread out numbers are from their mean. Covariance refers to how two random
variables change together; it is essentially used to compute the correlation between variables.
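
A short NumPy sketch on two made-up series:

```python
# Sample variance of one variable and sample covariance of two variables.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 9.0])

print(np.var(x, ddof=1))        # sample variance: spread of x around its mean
print(np.cov(x, y)[0, 1])       # sample covariance: how x and y vary together
```
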

49. What are different types of Hypothesis Testing?

The various kinds of hypothesis testing are as per the following:

T-test: A t-test is used when the standard deviation is unknown and the sample size is relatively
small.

Chi-Square Test for Independence: These tests are used to find the significance of the
association between categorical variables in the population sample.

Analysis of Variance (ANOVA): This type of hypothesis testing is used to analyze differences
between the means of different groups. It is used similarly to a t-test but for more than two
groups.

Welch’s t-test: This test is used to test for equality of means between two samples whose
variances may differ.
50. Why should we use data warehousing and how can you extract data for analysis?

Data warehousing is a key technology on the way to establishing business intelligence. A data
warehouse is a collection of data extracted from the operational or transactional systems in a
business, transformed to clean up any inconsistencies in identification coding and definition, and
then arranged to support rapid reporting and analysis.

Here are some of the benefits of a data warehouse:

It is separate from the operational database.

Integrates data from heterogeneous systems.

Stores a huge amount of data, more historical than current data.

Does not require data to be highly accurate.

Bonus Interview Questions & Answers

1. What is Visualization?

Visualization is for the depiction of data and to gain intuition about the data being observed. It
assists the analysts in selecting display formats, viewer perspectives, and data representation
schema.

2. Give some data mining tools?

DBMiner

GeoMiner

Multimedia miner

WeblogMiner
3. What are the most significant advantages of Data Mining?

There are many advantages of Data Mining. Some of them are listed below:

Data Mining is used to polish the raw data and enables us to explore, identify, and understand
the patterns hidden within the data.

It automates finding predictive information in large databases, thereby helping to identify the
previously hidden patterns promptly.

It assists faster and better decision-making, which later helps businesses take necessary
actions to increase revenue and lower operational costs.

It is also used to help data screening and validating to understand where it is coming from.

Using the Data Mining techniques, the experts can manage applications in various areas such
as Market Analysis, Production Control, Sports, Fraud Detection, Astrology, etc.

The shopping websites use Data Mining to define a shopping pattern and design or select the
products for better revenue generation.

Data Mining also helps in data optimization.

Data Mining can also be used to determine hidden profitability.

4. What are ‘Training set’ and ‘Test set’?

In various areas of information science like machine learning, a set of data is used to discover
the potentially predictive relationship known as ‘Training Set’. The training set is an example
given to the learner, while the Test set is used to test the accuracy of the hypotheses generated
by the learner, and it is the set of examples held back from the learner. The training set is
distinct from the Test set.

5. Explain what is the function of ‘Unsupervised Learning?

Find clusters of the data

Find low-dimensional representations of the data

Find interesting directions in data

Interesting coordinates and correlations

Find novel observations/ database cleaning


6. In what areas Pattern Recognition is used?

Pattern Recognition can be used in

Computer Vision

Speech Recognition

Data Mining

Statistics

Information Retrieval

Bio-Informatics

7. What is ensemble learning?

To solve a particular computational problem, multiple models, such as classifiers or experts, are
strategically generated and combined; this process is known as ensemble learning. Ensemble
learning is used when we build component classifiers that are accurate and as independent of
each other as possible. This learning is used to improve classification, prediction of data, and
function approximation.

8. What is the general principle of an ensemble method and what is bagging and boosting in the
ensemble method?

The general principle of an ensemble method is to combine the predictions of several models
built with a given learning algorithm in order to improve robustness over a single model. Bagging
is an ensemble method for improving unstable estimation or classification schemes, while
boosting methods are applied sequentially to reduce the bias of the combined model. Bagging
reduces error mainly by reducing the variance term, whereas boosting primarily reduces bias
(and can also reduce variance).
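
A hedged sketch comparing a bagged and a boosted ensemble in scikit-learn; the dataset and hyperparameters are illustrative, not tuned.

```python
# Bagging vs. boosting around decision-tree base learners (illustrative settings).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)   # boosts decision stumps

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())
```
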

9. What are the components of relational evaluation techniques?

The important components of relational evaluation techniques are

Data Acquisition

Ground Truth Acquisition

Cross-Validation Technique
Query Type

Scoring Metric

Significance Test

10. What are the different methods for Sequential Supervised Learning?

The different methods to solve Sequential Supervised Learning problems are

Sliding-window methods

Recurrent sliding windows

Hidden Markov models

Maximum entropy Markov models

Conditional random fields

Graph transformer networks

11. What is a Random Forest?

Random forest is a machine learning method that helps you to perform all types of regression
and classification tasks. It is also used for treating missing values and outlier values.

12. What is reinforcement learning?

Reinforcement learning is a learning mechanism for mapping situations to actions so as to
maximize a numerical reward signal. In this method, a learner is not told which action to take but
instead must discover which actions offer the maximum reward. This method is based on a
reward/penalty mechanism.

13. Is it possible to capture the correlation between continuous and categorical variables?
Yes, we can use the analysis of covariance (ANCOVA) technique to capture the association
between continuous and categorical variables.

14. What is Visualization?

Visualization is for the depiction of information and to acquire knowledge about the information
being observed. It helps the experts in choosing format designs, viewer perspectives, and
information representation patterns.

15. Name some best tools which can be used for data analysis.

The most common useful tools for data analysis are:

Google Search Operators

KNIME

Tableau

Solver

RapidMiner

Io

NodeXL

16. Describe the structure of Artificial Neural Networks?

An artificial neural network (ANN), also referred to simply as a “Neural Network” (NN), is a
computational model inspired by biological neural networks. Its structure consists of an
interconnected collection of artificial neurons. An artificial neural network is an adaptive system
that changes its structure based on the information that flows through the network during a
learning phase. The ANN relies on the principle of learning by example. There are, however, two
classical types of neural networks, the perceptron and the multilayer perceptron. Here we are
going to focus on the perceptron algorithm.
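
As a hedged sketch of the perceptron rule mentioned above, the loop below learns the logical AND function; the learning rate, epoch count, and toy data are arbitrary illustrative choices.

```python
# A tiny perceptron training loop on the linearly separable AND function.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                      # AND of the two inputs

w = np.zeros(2)
b = 0.0
lr = 0.1

for _ in range(20):                              # a few passes over the data suffice here
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        # Perceptron rule: nudge the weights by the error times the input.
        w += lr * (target - pred) * xi
        b += lr * (target - pred)

print([1 if xi @ w + b > 0 else 0 for xi in X])  # should reproduce the AND labels
```
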
