
SECTION D

Q.5 What is data discretization? Discuss the issues to be considered in data mining.
Data discretization is a method of converting a large number of continuous data values into a smaller number of intervals so that the evaluation and management of the data become easier. In other words, data discretization converts the values of a continuous attribute into a finite set of intervals with minimal information loss.
Issues to be considered in data mining include the quality of the data (accuracy, completeness, consistency), the preprocessing required before mining, and the choice of suitable algorithms for tasks such as classification, clustering, and association rule mining.
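As an illustration of discretization, here is a minimal sketch in Python using pandas; the 'ages' values, the number of bins, and the interval labels are assumptions made only for this example.

import pandas as pd

# Hypothetical continuous attribute: customer ages
ages = pd.Series([22, 25, 31, 38, 45, 52, 58, 63, 70])

# Equal-width discretization into 3 intervals (labels are illustrative)
equal_width = pd.cut(ages, bins=3, labels=["young", "middle_aged", "senior"])

# Equal-frequency (quantile) discretization into 3 intervals
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(equal_width.value_counts())
print(equal_freq.value_counts())

Both calls replace each continuous value with the interval it falls into, which is exactly the reduction in distinct values that discretization aims for.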

Q4 Illustrate data preprocessing in detail.


Data preprocessing is the process of transforming raw data into an understandable format. It is an important step in data mining because we cannot work directly with raw data; the quality of the data should be checked before applying machine learning or data mining algorithms.
Preprocessing is mainly concerned with checking data quality, which can be assessed on the following criteria (a short pandas sketch follows the list):

● Accuracy: To check whether the data entered is correct or not.


● Completeness: To check whether all required data is recorded and available.
● Consistency: To check whether the same data stored in different places matches.
● Timeliness: To check whether the data is kept up to date.
● Believability: To check whether the data is trustworthy.
● Interpretability: To check whether the data is easy to understand.
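The checks above can be carried out programmatically. Below is a small, hedged sketch using pandas; the DataFrame, its column names, and the plausibility threshold for age are all assumptions made for illustration.

import pandas as pd

# Hypothetical raw data with typical quality problems
df = pd.DataFrame({
    "age": [25, None, 42, 130],             # a missing value and an implausible outlier
    "country": ["IN", "India", "IN", "US"],  # inconsistent coding of the same country
})

print(df.isnull().sum())                        # completeness: missing values per column
print(df[(df["age"] < 0) | (df["age"] > 110)])  # accuracy: flag implausible ages
print(df["country"].unique())                   # consistency: inspect conflicting codes
df["country"] = df["country"].replace({"India": "IN"})  # harmonize the codes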
Q3 Explain with example the various steps in decision tree induction.
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.
A typical decision tree for the concept buy_computer indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class.

The benefits of having a decision tree are as follows −


● It does not require any domain knowledge.
● It is easy to comprehend.
● The learning and classification steps of a decision tree are simple and fast.
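The induction itself can be sketched as follows in Python with scikit-learn; the attribute encoding, the tiny training set, and the feature names are assumptions made only for illustration, not part of the question.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy buy_computer-style data: [age, income, student, credit_rating], label = buys_computer
# Categorical attributes are integer-encoded here purely for brevity.
X = [[0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0],
     [2, 0, 1, 0], [2, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]

tree = DecisionTreeClassifier(criterion="entropy")  # information-gain-style splitting
tree.fit(X, y)

# Each internal node tests an attribute; each leaf holds a class label
print(export_text(tree, feature_names=["age", "income", "student", "credit_rating"]))
print(tree.predict([[1, 1, 1, 0]]))  # classify a new, unseen customer

The printed tree makes the root node, branches, and leaf class labels described above directly visible.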

Q2 Briefly outline the key features of various clustering methods with relevant
examples.

The various types of clustering are:


1. Connectivity-based Clustering (Hierarchical clustering)
2. Centroid-based Clustering (Partitioning methods)
3. Density-based Clustering
4. Distribution-based Clustering (Model-based methods)
5. Fuzzy Clustering
6. Constraint-based Clustering (Supervised clustering)

1. Connectivity-Based Clustering (Hierarchical Clustering)


Hierarchical clustering is an unsupervised machine learning method that builds a hierarchy of clusters, either top-down (divisive, starting from one all-inclusive cluster that is split) or bottom-up (agglomerative, starting from single-point clusters that are merged step by step).

2. Centroid Based Clustering


Centroid-based clustering is considered one of the simplest clustering approaches, yet it is an effective way of creating clusters and assigning data points to them; k-means is the best-known example.

3. Density-based Clustering


If one looks at the previous two methods, one observes that both hierarchical and centroid-based algorithms depend on a distance (similarity/proximity) metric. Density-based methods instead grow clusters from regions where data points are densely packed and treat isolated points in sparse regions as noise; DBSCAN is a well-known example.

4. Distribution-Based Clustering (Model-Based Methods)
Until now, the clustering techniques we have seen are based on either proximity (similarity/distance) or composition (density). There is a family of clustering algorithms that takes a different quantity into consideration: probability. Each cluster is modelled as a probability distribution, as in Gaussian mixture models.

5. Fuzzy Clustering
The general idea of clustering revolves around assigning data points to mutually exclusive clusters: a data point resides uniquely inside one cluster and cannot belong to more than one. Fuzzy clustering relaxes this assumption and allows each data point to belong to several clusters with a degree of membership; fuzzy c-means is a common example.

6. Constraint-based (Supervised Clustering)


The clustering process, in general, is based on the approach that the data can be divided into an optimal number of "unknown" groups; the underlying stages of all clustering algorithms find those hidden patterns and similarities without any intervention or predefined conditions. Constraint-based (supervised) clustering, in contrast, guides the process with user- or application-specified constraints, for example must-link/cannot-link pairs of points or a required cluster size. A short comparative sketch of three of the methods above follows.
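To make the contrast concrete, here is a small hedged sketch comparing a centroid-based method (k-means), a connectivity-based method (agglomerative clustering), and a density-based method (DBSCAN) on synthetic data; every parameter value is an illustrative assumption.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Synthetic 2-D data with three natural groups
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)  # centroid-based
hierarchical = AgglomerativeClustering(n_clusters=3).fit(X)       # connectivity-based
dbscan = DBSCAN(eps=1.5, min_samples=5).fit(X)                    # density-based

print(kmeans.labels_[:10])
print(hierarchical.labels_[:10])
print(dbscan.labels_[:10])   # -1 marks points treated as noise

The three methods assign labels in different ways: k-means needs the number of clusters up front, the hierarchical method can be cut at any level of the hierarchy, and DBSCAN discovers the number of clusters from the density of the data.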
SECTION C
1. DIFFERENCE BETWEEN CLASSIFICATION & PREDICTION.
● Classification is the method of identifying to which group a new observation belongs, on the basis of a training data set containing observations whose group membership is already known.
● Prediction is the method of estimating missing or unavailable numerical values for a new observation.
● A classifier is built to predict categorical (class) labels.
● A predictor is built to predict a continuous-valued function, i.e., an ordered or numeric value.
● In classification, accuracy depends on detecting the class label correctly.
● In prediction, accuracy depends on how well a given predictor can estimate the value of the predicted attribute for new data.
● In classification, the model can be called the classifier.
● In prediction, the model can be called the predictor (a short sketch contrasting the two follows).
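A minimal sketch of the contrast in Python with scikit-learn: the classifier outputs a class label, while the predictor (a regressor) outputs a numeric estimate. The data is synthetic and purely illustrative.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1], [2], [3], [4], [5], [6]]

# Classification: the target is a categorical class label
clf = DecisionTreeClassifier().fit(X, ["low", "low", "low", "high", "high", "high"])
print(clf.predict([[2.5]]))   # -> a class label such as 'low'

# Prediction (numeric estimation): the target is a continuous value
reg = DecisionTreeRegressor().fit(X, [10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
print(reg.predict([[2.5]]))   # -> a numeric estimate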

2. (Co1)Explain the steps in the knowledge discovery database.

Some people don't differentiate data mining from knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. The steps involved in the knowledge discovery process are listed below (a small end-to-end sketch in Python follows the list) −

● Data Cleaning − In this step, noise and inconsistent data are removed.
● Data Integration − In this step, multiple data sources are combined.
● Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
● Data Transformation − In this step, data is transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations.
● Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
● Pattern Evaluation − In this step, data patterns are evaluated.
● Knowledge Presentation − In this step, knowledge is represented.
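A hedged end-to-end sketch of these steps on a toy sales table using pandas and scikit-learn; every table, column name, and parameter here is an assumption made only to illustrate the flow, not a prescribed implementation.

import pandas as pd
from sklearn.cluster import KMeans

# Data cleaning: drop rows with missing values
sales = pd.DataFrame({"customer": ["a", "b", "b", "c", None],
                      "amount": [120.0, 80.0, None, 300.0, 50.0]})
sales = sales.dropna()

# Data integration: combine a second (hypothetical) source
regions = pd.DataFrame({"customer": ["a", "b", "c"], "region": ["N", "S", "N"]})
data = sales.merge(regions, on="customer")

# Data selection and transformation: aggregate to one summary row per customer
per_customer = data.groupby("customer")["amount"].sum().to_frame()

# Data mining: apply an intelligent method (here, clustering) to extract patterns
per_customer["cluster"] = KMeans(n_clusters=2, n_init=10,
                                 random_state=0).fit_predict(per_customer[["amount"]])

# Pattern evaluation / knowledge presentation: inspect and report the result
print(per_customer)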
3. (CO5)State Bayes theorem and discuss how Bayesian classifiers work.

Bayes' theorem describes the probability of occurrence of an event given a related condition, i.e., it is a statement about conditional probability. Bayes' theorem is also known as the formula for the probability of "causes".
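Stated as a formula in standard notation, where A is the hypothesis (cause) and B the observed evidence:

\[
P(A \mid B) \;=\; \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0.
\]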

The Naive Bayes classifier works on the principle of conditional probability as given by Bayes' theorem, under the simplifying ("naive") assumption that the attributes are conditionally independent given the class. When working with probabilities we usually denote a probability as P; for example, when tossing two fair coins, the probability of getting two heads is P(two heads) = 1/4.
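A hedged sketch of a Naive Bayes classifier in Python using scikit-learn; the tiny integer-encoded data set and its attributes are assumptions made only for illustration.

from sklearn.naive_bayes import CategoricalNB

# Toy integer-encoded data: [outlook, temperature], label = play
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["no", "no", "yes", "yes", "yes", "no"]

nb = CategoricalNB()
nb.fit(X, y)

# For a new observation, the class with the highest posterior P(class | attributes) is chosen
print(nb.predict([[1, 0]]))
print(nb.predict_proba([[1, 0]]))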

4. (CO4)How will you handle missing values in the dataset before the mining
process? Explain.

Common ways of handling missing values in the data set before mining (approaches 2-4 are illustrated in the sketch after this list):

1. Ignore the data row.
2. Use a global constant to fill in for missing values.
3. Use the attribute mean.
4. Use the attribute mean for all samples belonging to the same class.
5. Use a data mining algorithm to predict the most probable value.
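A hedged illustration of approaches 2-4 with pandas; the DataFrame, the 'income' attribute, and the 'class' column are assumptions made for illustration.

import pandas as pd

df = pd.DataFrame({"class": ["A", "A", "B", "B", "B"],
                   "income": [30000.0, None, 52000.0, None, 48000.0]})

# 2. Fill with a global constant
df["income_const"] = df["income"].fillna(-1)

# 3. Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 4. Fill with the mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)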

5. (CO1)Outline the characteristics of data warehouses and define metadata.


Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata.
Metadata is the roadmap to a data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory.

Data warehouses are characterized by being:
1. Subject-oriented: A data warehouse typically provides information on a topic (such
as a sales inventory or supply chain) rather than company operations.
2. Time-variant: Time variant keys (e.g., for the date, month, time) are typically present.
3. Integrated: A data warehouse combines data from various sources. These may
include a cloud, relational databases, flat files, structured and semi-structured data,
metadata, and master data. The sources are combined in a manner that’s
consistent, relatable, and ideally certifiable, providing a business with confidence in
the data’s quality.
4. Persistent and non-volatile: Prior data isn’t deleted when new data is added.
Historical data is preserved for comparisons, trends, and analytics.

Section – B (Very Short answers)

1. (CO1) Define Data Mining.

Data mining is the process of analyzing a large batch of information to discern trends and
patterns. Data mining can be used by corporations for everything from learning about what
customers are interested in or want to buy to fraud detection and spam filtering.
OR
It is the process of finding patterns and correlations within large data sets to identify
relationships between data. Data mining tools allow a business organization to predict
customer behavior. Data mining tools are used to build risk models and detect fraud.

2. (CO2) State the different layers of the data warehouse.

Data Source Layer


The Data Source Layer is where data from the source systems first arrives before being sent on to the other layers for the desired operations.

Data Staging Layer


Step #1: Data Extraction
Step #2: Landing Database
Step #3: Staging Area
Step #4: ETL

Data Storage Layer


The processed data is stored in the Data Warehouse.

Data Presentation Layer


This Layer is where the users get to interact with the data stored in the data warehouse.

3. (CO3) What is the need for data preprocessing?

Data Preprocessing is required because:


Real world data are generally:
Incomplete: Missing attribute values, missing certain attributes of importance, or having only
aggregate data
Noisy: Containing errors or outliers
Inconsistent: Containing discrepancies in codes or names

4. (CO4) State the important terms used in association rule mining.

1. Support: Support is the fraction of transactions in which an item or itemset appears, out of the total number of transactions.
2. Confidence: Confidence is the conditional probability of occurrence of the consequent (then) given the occurrence of the antecedent (if).
3. Lift: Lift is the ratio of the rule's confidence to the support of the consequent. It tells how much more likely an item is to be purchased when another item is purchased (these measures are computed in the sketch below).
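A small hedged sketch computing these measures for the rule {bread} → {butter} over a handful of made-up transactions; plain Python is used, and the items and numbers are purely illustrative.

# Toy market-basket transactions (illustrative only)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "milk"},
]
n = len(transactions)

support_bread = sum("bread" in t for t in transactions) / n
support_butter = sum("butter" in t for t in transactions) / n
support_both = sum({"bread", "butter"} <= t for t in transactions) / n

confidence = support_both / support_bread   # P(butter | bread)
lift = confidence / support_butter          # > 1 suggests a positive association

print(support_both, confidence, lift)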

5. (CO5) Define classification model.

A classification model tries to draw conclusions from the input values given for training and then predicts the class labels/categories for new data. A feature is an individual measurable property of the phenomenon being observed.

Data preprocessing is a data mining technique that is used to transform raw data into a useful and efficient format.
