Data Mining
Q.5 What is data discretization? Discuss the issues to be considered in data mining.
Data discretization is the process of converting a large number of continuous data values into a smaller set of discrete values so that the evaluation and management of the data become easier. In other words, data discretization converts the values of a continuous attribute into a finite set of intervals with minimal loss of information.
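As a minimal sketch of interval-based discretization (assuming pandas is available; the column name, bin edges, and labels are illustrative):

```python
import pandas as pd

# Sample continuous data: ages (illustrative values)
df = pd.DataFrame({"age": [3, 17, 25, 42, 58, 71]})

# Discretize the continuous 'age' attribute into labeled intervals
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 60, 120],
    labels=["child", "young", "middle-aged", "senior"],
)
print(df)
```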
Issues:
● Mining methodology and user interaction − handling different kinds of knowledge, mining at multiple levels of abstraction, incorporating background knowledge, and presenting the discovered patterns in usable forms.
● Performance − the efficiency and scalability of mining algorithms, including parallel, distributed, and incremental mining for very large databases.
● Diversity of data types − handling relational, complex, and heterogeneous data drawn from multiple sources.
● Data quality, privacy, and security − mining noisy or incomplete data without compromising the confidentiality of the data.
Q2 Briefly outline the key features of various clustering methods with relevant
examples.
1. Centroid-Based (Partitioning) Clustering
Data points are grouped around a fixed number of cluster centers, and each point is assigned to its nearest center; K-Means is the classic example.
2. Connectivity-Based (Hierarchical) Clustering
Clusters are built by repeatedly merging (or splitting) groups according to the distance between points, producing a tree of clusters; agglomerative clustering is a typical example.
3. Density-Based Clustering
Clusters are formed as dense regions of points separated by sparser regions, which allows arbitrarily shaped clusters and natural outlier detection; DBSCAN is a well-known example.
4. Distribution-Based Clustering
The clustering techniques described so far are based on either proximity (similarity/distance) or composition (density). Distribution-based algorithms take a different quantity into consideration − probability: each cluster is modeled as a statistical distribution, and a point is assigned to the cluster most likely to have generated it. Gaussian Mixture Models (GMMs) are the standard example.
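A minimal sketch of distribution-based clustering with a Gaussian mixture (assuming scikit-learn and NumPy are available; the synthetic data are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two illustrative Gaussian blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=5.0, scale=1.5, size=(100, 2)),
])

# Fit a 2-component Gaussian mixture, then assign each point
# to its most probable component
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)        # hard assignments
probs = gmm.predict_proba(X)   # soft (probabilistic) assignments
print(probs[:3])
```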
5. Fuzzy Clustering
The general idea of clustering revolves around assigning data points to mutually exclusive clusters: a data point resides uniquely inside one cluster and cannot belong to more than one. Fuzzy clustering relaxes this restriction − each point receives a degree of membership in every cluster (with memberships summing to 1), so a point can partially belong to several clusters at once; fuzzy c-means is the standard example.
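A minimal from-scratch sketch of the fuzzy c-means update rules (assuming NumPy; the sample points, fuzzifier m, and iteration count are illustrative):

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means sketch: returns cluster centers and a
    membership matrix U, where U[j, i] is point j's degree of
    belonging to cluster i."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)  # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        # Weighted cluster centers
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distance of every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        # Membership update: inverse-distance weighting, exponent 2/(m-1)
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
centers, U = fuzzy_c_means(X, c=2)
print(U.round(2))  # each row sums to 1: partial membership in both clusters
```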
Some people do not differentiate data mining from knowledge discovery, while others view data mining as one essential step in the knowledge discovery process. Here is the list of steps involved in the knowledge discovery process (a short pandas sketch of the early steps follows the list) −
● Data Cleaning − In this step, noise and inconsistent data are removed.
● Data Integration − In this step, multiple data sources are combined.
● Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
● Data Transformation − In this step, data is transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations.
● Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
● Pattern Evaluation − In this step, data patterns are evaluated.
● Knowledge Presentation − In this step, knowledge is represented.
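As a minimal sketch of the cleaning, integration, selection, and transformation steps (assuming pandas; the table and column names are hypothetical):

```python
import pandas as pd

# Two illustrative source tables (assumed data)
sales = pd.DataFrame({"cust_id": [1, 1, 2, 2, 2],
                      "amount": [10.0, None, 5.0, 5.0, 7.0]})
customers = pd.DataFrame({"cust_id": [1, 2], "region": ["east", "west"]})

# Data cleaning: remove missing and duplicate records
sales = sales.dropna().drop_duplicates()

# Data integration: combine the two sources
data = sales.merge(customers, on="cust_id")

# Data selection: keep only the attributes relevant to the task
data = data[["region", "amount"]]

# Data transformation: aggregate into a form suitable for mining
summary = data.groupby("region")["amount"].sum()
print(summary)
```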
3. (CO5) State Bayes' theorem and discuss how Bayesian classifiers work.
Bayes' theorem describes the probability of occurrence of an event given that a related condition is known to hold; that is, it is a statement of conditional probability. Bayes' theorem is also known as the formula for the probability of “causes”. For events A and B with P(B) > 0:
P(A | B) = P(B | A) · P(A) / P(B)
The Naive Bayes classifier works on the principle of conditional probability, as given by Bayes' theorem, together with the “naive” assumption that the features are independent of one another given the class. For a class C and feature vector x, the classifier computes the posterior P(C | x) ∝ P(x | C) · P(C) and predicts the class with the highest posterior. When calculating probabilities we usually denote probability as P; for example, when tossing two fair coins, the probability of getting two heads is P(two heads) = 1/4.
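A minimal sketch of a Naive Bayes classifier in practice (assuming scikit-learn; the Iris dataset and split ratio are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit class priors P(C) and per-class feature likelihoods P(x_i | C),
# then predict the class with the highest posterior P(C | x)
clf = GaussianNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))
```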
4. (CO4)How will you handle missing values in the dataset before the mining
process? Explain.
Data mining is the process of finding patterns and correlations within large data sets to identify relationships between data; corporations use it for everything from predicting customer behavior to building risk models and detecting fraud and spam. Before the mining process, however, missing values in the dataset must be handled, because most mining algorithms cannot work with incomplete records. Common strategies are:
1. Ignore the tuple − drop the record entirely; practical only when few records are affected.
2. Fill in the missing value manually − feasible only for small datasets.
3. Use a global constant − replace every missing value with a label such as “Unknown”.
4. Use a measure of central tendency − replace the missing value with the attribute's mean or median (possibly computed within the same class).
5. Use the most probable value − predict the missing value with a regression, decision-tree, or Bayesian model built from the other attributes.
A small sketch of the first and fourth strategies follows the list.
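A minimal sketch of dropping versus mean imputation (assuming pandas and NumPy; the columns and values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47, 51, np.nan],
                   "income": [50_000, 62_000, np.nan, 58_000, 61_000]})

# Option 1: drop rows with any missing value (safe when few rows are affected)
dropped = df.dropna()

# Option 2: impute with a per-column statistic such as the mean
imputed = df.fillna(df.mean(numeric_only=True))
print(imputed)
```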
1. Support: Support is the fraction of all transactions in which an itemset appears: support(A) = (number of transactions containing A) / (total number of transactions).
2. Confidence: Confidence is the conditional probability of occurrence of the consequent (then) given the occurrence of the antecedent (if): confidence(A → B) = support(A ∪ B) / support(A).
3. Lift: Lift is the ratio of the rule's confidence to the support of the consequent: lift(A → B) = confidence(A → B) / support(B). A lift greater than 1 means the second item is more likely to be purchased after the first (see the sketch after this list).
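A minimal sketch computing the three measures on a hypothetical market-basket dataset:

```python
# Transactions from a hypothetical market-basket dataset
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / n

# Rule: {bread} -> {milk}
sup_rule = support({"bread", "milk"})   # 2/4 = 0.5
conf = sup_rule / support({"bread"})    # 0.5 / 0.75 ≈ 0.667
lift = conf / support({"milk"})         # 0.667 / 0.75 ≈ 0.889 (< 1)
print(sup_rule, conf, lift)
```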
A classification model tries to draw conclusions from the input values given for training and uses them to predict the class labels/categories of new data. Feature: a feature is an individual measurable property of the phenomenon being observed.
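A minimal sketch of a classification model predicting labels for new data from measurable features (assuming scikit-learn; the feature values and labels are hypothetical):

```python
from sklearn.tree import DecisionTreeClassifier

# Each row is a sample; each column is a measurable feature
# (here: hypothetical petal length and width)
X_train = [[1.4, 0.2], [4.7, 1.4], [1.3, 0.2], [4.5, 1.5]]
y_train = ["setosa", "versicolor", "setosa", "versicolor"]

model = DecisionTreeClassifier().fit(X_train, y_train)
print(model.predict([[1.5, 0.3]]))  # predicts a class label for new data
```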
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
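As a minimal sketch of one common preprocessing transformation, normalization (assuming scikit-learn; the values are illustrative):

```python
from sklearn.preprocessing import MinMaxScaler

# Raw attribute values on very different scales
X = [[1.0, 20_000.0], [2.0, 35_000.0], [3.0, 50_000.0]]

# Normalization rescales each attribute to the [0, 1] range
scaler = MinMaxScaler()
print(scaler.fit_transform(X))
```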