DMjoy

Q.1) How is Knowledge Discovery done from a Database?

Ans: Knowledge Discovery in Databases (KDD) is the process of extracting useful, previously unknown, and potentially valuable information from large datasets. KDD is iterative and usually requires several passes through the following steps to extract accurate knowledge from the data (a minimal code sketch follows the list):
1. Data Cleaning: Removal of noisy and irrelevant data from the collection. This includes handling missing values, smoothing noisy data (noise being a random or variance error), and using data discrepancy detection and data transformation tools.
2. Data Integration: Combining heterogeneous data from multiple sources into a common store (a data warehouse). It uses data migration tools, data synchronization tools, and the ETL (Extract-Transform-Load) process.
3. Data Selection: Deciding which data are relevant to the analysis and retrieving them from the data collection. Methods such as neural networks, decision trees, naive Bayes, clustering, and regression may then be applied to the selected data.
4. Data Transformation: Transforming the data into the form required by the mining procedure. This is a two-step process: data mapping (assigning elements from the source to the destination to capture the transformations) and code generation (creating the actual transformation program).
5. Data Mining: Applying techniques to extract potentially useful patterns. This step transforms the task-relevant data into patterns, and the purpose of the model is decided (e.g., classification or characterization).
6. Pattern Evaluation: Identifying the truly interesting patterns that represent knowledge, based on given interestingness measures. An interestingness score is computed for each pattern, and summarization and visualization are used to make the results understandable to the user.
7. Knowledge Representation: Presenting the results in a form that is meaningful and can be used to make decisions.
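
The steps above can be pictured as one small pipeline. The sketch below uses pandas and scikit-learn; the tables, column names, and the choice of a decision tree are assumptions made purely for illustration, not part of any fixed KDD recipe.

```python
# A minimal, illustrative KDD pipeline (pandas + scikit-learn).
# The tables, columns, and the model choice are made up for this sketch.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# 1-2. Cleaning and integration: combine two hypothetical sources, drop missing rows.
customers = pd.DataFrame({"customer_id": range(8),
                          "age": [25, 32, 47, 51, 23, 36, None, 44],
                          "income": [30, 45, 80, 90, 28, 55, 60, 75]})
transactions = pd.DataFrame({"customer_id": range(8),
                             "num_purchases": [2, 5, 9, 11, 1, 6, 4, 8],
                             "churned": [1, 1, 0, 0, 1, 0, 1, 0]})
data = customers.merge(transactions, on="customer_id").dropna()

# 3. Selection: keep only the attributes relevant to the analysis task.
selected = data[["age", "income", "num_purchases", "churned"]]

# 4. Transformation: scale the numeric features into a mining-friendly form.
X = StandardScaler().fit_transform(selected.drop(columns="churned"))
y = selected["churned"]

# 5. Mining: learn a classification model (one possible mining technique).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 6. Pattern evaluation: score how well the discovered patterns generalize.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 7. Knowledge representation: present the learned rules in readable form.
print(export_text(model, feature_names=["age", "income", "num_purchases"]))
```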

Q.2) What kinds of data can be mined?

Ans:

1. Relational Databases: A database system consisting of a set of interrelated data, called a database, and a set of software programs to manage and access the data.
2. Transactional Databases: A database consisting of a file in which each record represents a transaction.
3. Object-Relational Databases: Databases built on an object-relational data model.
4. Temporal Databases: Databases that store relational data containing time-related attributes.
5. Sequence Databases: Databases that store sequences of ordered events, with or without a concrete notion of time.
6. Time-Series Databases: Databases that store sequences of values or events obtained over repeated measurements of time (e.g., hourly, daily, weekly).
7. Multimedia Data: Image, video, and audio data, as well as website hyperlinks and linkages.
8. Web Data: Web mining is used to discover important patterns and knowledge from the Web.
9. Text Data: Text mining, a subfield of data mining that draws on machine learning, natural language processing, and statistics, is used to extract knowledge from text.
Q.3) What do you understand by central tendency of data?

Ans:

In data mining, the central tendency of a dataset is a measure that identifies the "center" or typical value of the dataset. There are three common measures of central tendency:
1. Mean: The sum of all values divided by the total number of values. It can be calculated for both ungrouped and grouped data. For ungrouped data, it is the sum of all observations divided by the total number of observations. For grouped data, it is the sum of the products of the observations and their corresponding frequencies divided by the sum of all frequencies.
2. Median: The middle number in an ordered dataset. If the dataset has an even number of values, the median is the average of the two middle numbers.
3. Mode: The most frequent value in the dataset. A dataset may have one mode, more than one mode, or no mode at all.
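
As a quick, self-contained illustration, the sketch below computes all three measures on a small made-up sample using Python's statistics module.

```python
# Mean, median, and mode of a small made-up sample.
import statistics

values = [4, 8, 8, 5, 3, 8, 6, 5]

print("mean:",   statistics.mean(values))    # 47 / 8 = 5.875
print("median:", statistics.median(values))  # average of the two middle values: (5 + 6) / 2 = 5.5
print("mode:",   statistics.mode(values))    # most frequent value: 8
```

For grouped data the same idea applies with frequencies: mean = sum(f_i * x_i) / sum(f_i).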

Q.4) Data Characterization and Data Discrimination.

Ans:

Data characterization and data discrimination are two data analysis techniques used for class/concept description.

Data characterization is the process of summarizing the data of the class under study, called the target class.

Example - A data mining query for characterization. Suppose that a user wants to describe the general characteristics of graduate students in the Big University database, given the attributes name, gender, major, birth place, birth date, residence, phone# (telephone number), and gpa (grade point average).

Data discrimination is the process of comparing the data of the target class with the data of one or more other classes, called the contrasting classes, and finding the features that distinguish them.

Cluster analysis is a popular data discretization method. Data discretization and concept hierarchy generation are also forms of data reduction: the raw data are replaced by a smaller number of interval or concept labels, which simplifies the original data and makes the mining more efficient.

Example - A clustering algorithm can be applied to discretize a numeric attribute, A, by partitioning the values of A into clusters or groups. Clustering takes the distribution of A into consideration, as well as the closeness of data points, and is therefore able to produce high-quality discretization results.
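
A minimal sketch of this clustering-based discretization, assuming scikit-learn's KMeans and a made-up numeric attribute (the values and the choice of three clusters are purely illustrative):

```python
# Discretize a numeric attribute A by clustering its values (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical numeric attribute A (e.g., ages).
A = np.array([18, 21, 22, 25, 34, 36, 38, 41, 60, 62, 65, 70]).reshape(-1, 1)

# Partition the values of A into 3 clusters; each cluster becomes one interval/concept label.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(A)

for value, label in zip(A.ravel(), kmeans.labels_):
    print(f"A = {value:3d}  ->  label {label}")
```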

Q.5) What do you mean by Supervised and Unsupervised Learning?

Ans: Supervised learning -> is basically a synonym for classification. The supervision in the learning comes from the labeled examples in the training data set. For example, in the postal code recognition problem, a set of handwritten postal code images and their corresponding machine-readable translations are used as the training examples, which supervise the learning of the classification model.

Unsupervised learning -> is essentially a synonym for clustering. The learning process is unsupervised since the input examples are not class labeled. Typically, we may use clustering to discover classes within the data. For example, an unsupervised learning method can take, as input, a set of images of handwritten digits. Suppose that it finds 10 clusters of data. These clusters may correspond to the 10 distinct digits of 0 to 9, respectively. However, since the training data are not labeled, the learned model cannot tell us the semantic meaning of the clusters found.
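
The handwritten-digit example can be sketched in code. Below, scikit-learn's small built-in digits dataset stands in for the postal-code images; logistic regression (supervised) and k-means (unsupervised) are assumed model choices for illustration.

```python
# Supervised (classification) vs. unsupervised (clustering) learning on digit images.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_digits(return_X_y=True)   # 8x8 images of handwritten digits, with labels

# Supervised: the labels y supervise the training of a classifier.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Unsupervised: the labels are ignored; the algorithm only groups similar images.
# It may find 10 clusters, but it cannot say which digit each cluster represents.
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
print("cluster of the first image:", clusters[0])
```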

Q6. What is data warehousing?

Ans: A data warehouse integrates data originating from multiple sources and various timeframes. It consolidates data in multidimensional space to form partially materialized data cubes. A Data Warehouse (DW) is a relational database designed for query and analysis rather than transaction processing. It is a centralized data repository that can be queried for business benefit. It includes historical data derived from transaction data from single and multiple sources. A data warehouse provides integrated, enterprise-wide, historical data and focuses on supporting decision-makers in data modeling and analysis.
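
As a toy illustration of the multidimensional, analysis-oriented view a warehouse serves, the sketch below aggregates made-up sales records along two dimensions with pandas (the records and column names are assumptions, not a real warehouse schema):

```python
# A tiny "data cube"-style summary over warehouse-like sales records.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "quarter": ["Q1",   "Q2",   "Q1",   "Q2",   "Q1",   "Q2"],
    "amount":  [100,    120,     90,    200,    150,    180],
})

# Roll up the amount measure along the region x quarter dimensions.
cube = sales.pivot_table(values="amount", index="region",
                         columns="quarter", aggfunc="sum")
print(cube)
```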

Q.7) Types of machine learning?

Ans:
1. Supervised Machine Learning: The model is trained on a "labelled dataset", i.e., a dataset containing both input and output parameters. In supervised learning, algorithms learn to map inputs to the correct outputs. It has two main categories:
o Classification: Classification algorithms are used when the output variable is categorical.
o Regression: Regression algorithms are used when the output is a real or continuous value (see the sketch after this list).
2. Unsupervised Machine Learning: The model is trained on an unlabelled dataset and must find patterns within the data on its own.
3. Semi-Supervised Machine Learning: Falls between supervised and unsupervised learning; it uses both labelled and unlabelled data for training.
4. Reinforcement Learning: Learning to take suitable actions so as to maximize reward in a particular situation.
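
A brief sketch of the regression category, using scikit-learn's LinearRegression on made-up data (the feature, target, and values are assumptions for illustration):

```python
# Regression: predicting a continuous value from a labelled dataset.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: house size (sq. m) -> price (thousands).
sizes  = np.array([[50], [60], [80], [100], [120]])
prices = np.array([150, 180, 240, 300, 360])

model = LinearRegression().fit(sizes, prices)
print("predicted price for 90 sq. m:", model.predict([[90]])[0])
```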
Q.8 Major issues in data mining?

Ans:

Data mining is a complex process and it faces several issues. Here are the major ones:
1. Mining Methodology and User Interaction Issues:
o Mining different kinds of knowledge in databases: Different users may be interested in different kinds of knowledge, so data mining must cover a broad range of knowledge discovery tasks.
o Interactive mining of knowledge at multiple levels of abstraction: The data mining process needs to be interactive because this allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.

2. Data Security & Privacy: Ensuring that sensitive information is not mined or misused can pose challenges.
3. Efficiency and scalability of data mining algorithms: To effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
4. Parallel, distributed, and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms.
Q.9 What do you mean by nominal and ordinal attributes?

Ans:

Nominal and ordinal attributes are two types of categorical data used in statistics and data analysis. They differ in their level of measurement and in the types of operations that can be performed on them.

Nominal Attributes:

 Nominal data, a kind of categorical data, are used for labeling variables that have no quantitative value.
 The name "nominal" comes from the Latin word "nomen", which means "name".
 Nominal data are items that are distinguished by a simple naming system.
 They carry no numeric value; profession is one example.
 The values grouped into these categories have no meaningful order.
 For example, gender and occupation are nominal attributes.

Ordinal Attributes:

 Ordinal data are categorical data with an order (or rank) among the values.
 Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude between successive values is not known.
 The order of the values shows what is more important, but not how much more important it is.

In summary, the main difference between nominal and ordinal data lies in whether there is an order or rank to the categories: nominal categories have no inherent order, while ordinal categories do have a clear ordering.
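
A short sketch of how the two attribute types can be represented in code, using pandas categoricals on made-up values (the specific categories are assumptions for illustration):

```python
# Nominal vs. ordinal attributes represented with pandas categoricals.
import pandas as pd

# Nominal: categories with no meaningful order (e.g., occupation).
occupation = pd.Categorical(["nurse", "teacher", "engineer", "nurse"], ordered=False)

# Ordinal: categories with a meaningful order but unknown spacing (e.g., shirt size).
size = pd.Categorical(["small", "large", "medium", "small"],
                      categories=["small", "medium", "large"], ordered=True)

print(occupation.categories)         # no order is implied among these labels
print(size.min(), "<", size.max())   # order comparisons are meaningful only for ordinal data
```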
Q. What is the five-number summary and what is it used for?

Ans:

The five-number summary is a method used in descriptive statistics that summarizes a set of data. It consists of the following five numbers:

1. Minimum: The smallest number in the dataset.
2. First Quartile (Q1): The middle number between the minimum and the median. This is also known as the lower quartile.
3. Median (Q2): The middle value of the dataset. If the dataset has an even number of observations, the median is the average of the two middle numbers.
4. Third Quartile (Q3): The middle value between the median and the maximum. This is also known as the upper quartile.
5. Maximum: The largest number in the dataset.

The five-number summary is useful because it provides a concise summary of the distribution of the data in several ways:

 It tells us where the middle value is located, using the median.
 It tells us how spread out the data are, using the first and third quartiles.
 It tells us the range of the data, using the minimum and maximum.

These five statistics provide similar information to other summary statistics while having some advantages over them: they are less sensitive to skewed distributions and outliers, making them more robust than measures such as the mean and standard deviation. This makes them extremely helpful for analyses performed when you are just starting to understand your data. They are also valid for both continuous and ordinal data, giving you greater flexibility.
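
A small sketch computing the five-number summary with NumPy on a made-up dataset:

```python
# Five-number summary of a made-up dataset using NumPy percentiles.
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

minimum = data.min()
q1, median, q3 = np.percentile(data, [25, 50, 75])
maximum = data.max()

print("min:", minimum, " Q1:", q1, " median:", median, " Q3:", q3, " max:", maximum)
```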
