Viva Data Mining Lab

Good

Uploaded by

Amit Gaurav
Copyright
© © All Rights Reserved

1. What is Data Mining?

Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data
mining is the science, art, and technology of exploring large and complex bodies of data in order to
discover useful patterns.

2. What are the different tasks of Data Mining?

The following activities are carried out during data mining:

• Classification

• Clustering

• Association Rule Discovery

• Sequential Pattern Discovery

• Regression

• Deviation Detection

3. Discuss the Life cycle of Data Mining projects?

The life cycle of Data mining projects:

Business understanding: Understanding the project objectives from a business perspective and defining
the data mining problem.

Data understanding: Collecting the initial data and becoming familiar with it.

Data preparation: Constructing the final data set from the raw data.

Modeling: Selecting and applying data modeling techniques.

Evaluation: Evaluating the model and deciding on further deployment.

Deployment: Creating a report and carrying out actions based on the new insights.

4. Explain the process of KDD?

Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from
Data, or KDD. Others view data mining as simply an essential step in the process of knowledge discovery,
in which intelligent methods are applied in order to extract data patterns.

Knowledge discovery from data consists of the following steps:

• Data cleaning (to remove noise or irrelevant data).

• Data integration (where multiple data sources may be combined).


• Data selection (where data relevant to the analysis task are retrieved from the database).

• Data transformation (where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations, for example).

• Data mining (an essential process where intelligent methods are applied in order to extract
data patterns).

• Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some
interestingness measures).

• Knowledge presentation (where knowledge representation and visualization techniques are used
to present the mined knowledge to the user).
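
The KDD steps above can be sketched as a small pipeline. This is only an illustrative sketch: the store records, the function names, and the 0.6 support threshold are all assumptions made for the example, and the "mining" step is reduced to counting how often each item occurs.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "item": "bread", "amount": 2.5},
    {"customer": "c2", "item": "milk", "amount": None},  # noisy record
]
store_b = [
    {"customer": "c1", "item": "milk", "amount": 1.2},
    {"customer": "c3", "item": "bread", "amount": 2.5},
    {"customer": "c3", "item": "eggs", "amount": 3.0},
]

def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: aggregate items into one basket per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Data mining (simplified): count the baskets that contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support=0.6):
    """Pattern evaluation: keep items whose support meets the threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

records = select(integrate(clean(store_a), clean(store_b)), ["customer", "item"])
baskets = transform(records)
patterns = evaluate(mine(baskets), len(baskets))

# Knowledge presentation: report the surviving patterns to the user.
for item, support in sorted(patterns.items()):
    print(f"{item}: support {support:.2f}")
```

Each function corresponds to one step in the list, so the order of the calls mirrors the order of the KDD process.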

5. What is Classification?

Classification is the process of finding a set of models (or functions) that describe and distinguish data
classes or concepts, for the purpose of being able to use the model to predict the class of objects whose
class label is unknown. Classification can be used for predicting the class label of data items. However, in
many applications, one may want to predict some missing or unavailable data values rather than class
labels.

7. What is Prediction?

Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled
object, or to estimate the value or value ranges of an attribute that a given object is likely to have. In this
interpretation, classification and regression are the two major types of prediction problems, where
classification is used to predict discrete or nominal values, while regression is used to predict continuous
or ordered values.
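
The contrast between the two kinds of prediction can be sketched in a few lines. The data (hours studied, exam scores, pass/fail labels) and the function names are assumptions for illustration; the regression is an ordinary least-squares line and the classifier is a simple nearest-class-mean rule, chosen only because both are short.

```python
def fit_line(xs, ys):
    """Regression: least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def fit_class_means(xs, labels):
    """Classification: remember the mean attribute value of each class."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def classify(class_means, x):
    """Assign x to the class whose mean is nearest."""
    return min(class_means, key=lambda label: abs(class_means[label] - x))

hours = [1, 2, 3, 4]                      # hypothetical attribute
scores = [35, 45, 55, 65]                 # continuous target -> regression
results = ["fail", "fail", "pass", "pass"]  # nominal target -> classification

slope, intercept = fit_line(hours, scores)
predicted_score = slope * 5 + intercept   # a continuous value
class_means = fit_class_means(hours, results)
predicted_class = classify(class_means, 3.2)  # a discrete label
```

The regression model returns a number on a continuous scale, while the classifier can only return one of the labels it saw during training.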

8. Explain the Decision Tree Classifier?

A Decision tree is a flow chart-like tree structure, where each internal node (non-leaf node) denotes a
test on an attribute, each branch represents an outcome of the test and each leaf node (or terminal
node) holds a class label. The topmost node of a tree is the root node.

A Decision tree is a classification scheme that generates a tree and a set of rules, representing the model
of different classes, from a given data set. The set of records available for developing classification
methods is generally divided into two disjoint subsets, namely a training set and a test set. The former is
used for deriving the classifier, while the latter is used to measure the accuracy of the classifier. The
accuracy of the classifier is determined by the percentage of the test examples that are correctly
classified.

In the decision tree classifier, we categorize the attributes of the records into two different types.
Attributes whose domain is numerical are called numerical attributes, and attributes whose
domain is not numerical are called categorical attributes. There is one distinguished attribute called the
class label. The goal of classification is to build a concise model that can be used to predict the class of
the records whose class label is unknown. Decision trees can easily be converted to classification rules.
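
The structure described above can be sketched with a small hand-built tree. The "outlook/humidity" attributes, the class labels, and the test records are hypothetical; the tree is written by hand rather than induced from training data, so this only illustrates classification, accuracy measurement, and conversion to rules.

```python
# Internal nodes test one attribute; leaves (plain strings) hold a class label.
tree = {
    "attribute": "outlook",          # root node
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": "no",
    },
}

def classify(node, record):
    """Follow the branch matching the record until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

def to_rules(node, conditions=()):
    """Convert every root-to-leaf path into an IF-THEN rule."""
    if not isinstance(node, dict):
        body = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules += to_rules(child, conditions + ((node["attribute"], value),))
    return rules

# Test set: records with known labels, kept apart from any training data.
test_set = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "sunny", "humidity": "normal"}, "yes"),
    ({"outlook": "overcast", "humidity": "high"}, "yes"),
    ({"outlook": "rainy", "humidity": "normal"}, "yes"),  # the tree misclassifies this one
]
correct = sum(classify(tree, record) == label for record, label in test_set)
accuracy = 100.0 * correct / len(test_set)  # percentage of correctly classified test examples
rules = to_rules(tree)
```

Each rule is simply the conjunction of the tests along one path, which is why a tree with few leaves yields a short, readable rule set.
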
9. What are the advantages of a decision tree classifier?

• Decision trees are able to produce understandable rules.

• They are able to handle both numerical and categorical attributes.

• They are easy to understand.

• Once a decision tree model has been built, classifying a test record is extremely fast.

• The decision tree representation is rich enough to represent any discrete-value classifier.

• Decision trees can handle datasets that may have errors.

• Decision trees can handle datasets that may have missing values.

• They do not require any prior assumptions about the distribution of the data. Decision trees are
self-explanatory, and when compact they are also easy to follow. That is to say, if the decision
tree has a reasonable number of leaves, it can be grasped by non-professional users. Furthermore,
since decision trees can be converted to a set of rules, this sort of representation is considered
comprehensible.

12. What are Neural networks?

A neural network is a set of connected input/output units where each connection has a weight
associated with it. During the learning phase, the network learns by adjusting the weights so as to be
able to predict the correct class label of the input samples. Neural network learning is also referred to as
connectionist learning due to the connections between units. Neural networks involve long training
times and are therefore more appropriate for applications where this is feasible. They require a number
of parameters that are typically best determined empirically, such as the network topology or
"structure". Neural networks have been criticized for their poor interpretability, since it is difficult for
humans to interpret the symbolic meaning behind the learned weights. These features initially made
neural networks less desirable for data mining.

The advantages of neural networks, however, include their high tolerance to noisy data as well as their
ability to classify patterns on which they have not been trained. In addition, several algorithms have
recently been developed for the extraction of rules from trained neural networks. These factors contribute
to the usefulness of neural networks for classification in data mining. The most popular neural network
algorithm is the backpropagation algorithm, proposed in the 1980s.
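
The idea of learning by adjusting connection weights can be sketched with a single sigmoid unit. This is deliberately not full backpropagation (there is no hidden layer to propagate errors through); it is a one-unit gradient update on a hypothetical logical-AND data set, and the learning rate and epoch count are assumed values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_unit(samples, epochs=2000, rate=0.5):
    """Adjust the weights and bias a little after every training sample."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            error = target - output
            # Move each weight in the direction that reduces the error.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

def predict(weights, bias, inputs):
    """Return class 1 if the unit's output is at least 0.5, else class 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if sigmoid(activation) >= 0.5 else 0

# Hypothetical training set: the logical AND of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_unit(samples)
predictions = [predict(weights, bias, inputs) for inputs, _ in samples]
```

The learned weights are just numbers with no obvious symbolic meaning, which illustrates the interpretability criticism mentioned above.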

16. Define Clustering in Data Mining?

Clustering is the task of dividing the population or data points into a number of groups such that data
points in the same group are more similar to each other than to the data points in other groups. It is
basically a grouping of objects on the basis of the similarity and dissimilarity between them.

17. What is the difference between classification and clustering? [IMP]


Type:
• Classification: Used for supervised learning.
• Clustering: Used for unsupervised learning.

Basic:
• Classification: Process of classifying the input instances based on their corresponding class labels.
• Clustering: Grouping the instances based on their similarity without the help of class labels.

Need:
• Classification: It has labels, so there is a need for a training and testing data set for verifying the
model created.
• Clustering: There is no need for a training and testing data set.

Complexity:
• Classification: More complex as compared to clustering.
• Clustering: Less complex as compared to classification.

Example algorithms:
• Classification: Logistic regression, Naive Bayes classifier, Support vector machines, etc.
• Clustering: k-means clustering algorithm, Fuzzy c-means clustering algorithm, Gaussian (EM)
clustering algorithm, etc.

18. What is Supervised and Unsupervised Learning?[TCS interview question]

Supervised learning, as the name indicates, involves a supervisor acting as a teacher. In supervised
learning we teach or train the machine using data that is well labeled, which means that each training
example is already tagged with the correct answer. The supervised learning algorithm analyses the
training data (the set of training examples) and builds a model; when the machine is then given a new
set of examples, it uses that model to predict their outcomes.

Unsupervised learning is the training of a machine using information that is neither classified nor
labeled, allowing the algorithm to act on that information without guidance. Here the task of the
machine is to group unsorted information according to similarities, patterns, and differences without any
prior training on labeled data.

Unlike supervised learning, no teacher is provided, which means the machine is given no labeled examples.
The machine is therefore left to find the hidden structure in unlabeled data by itself.

19. Name areas of applications of data mining?

• Finance

• Healthcare

• Intelligence

• Telecommunication

• Energy

• Retail

• E-commerce

• Supermarkets

• Crime agencies

• Business in general

22. Differentiate Between Data Mining And Data Warehousing?


Data Mining: It is the process of finding patterns and correlations within large data sets to identify
relationships between data. Data mining tools allow a business organization to predict customer
behavior, build risk models, and detect fraud. Data mining is used in market analysis and management,
fraud detection, corporate analysis, and risk management.

Data Warehouse: It is a technology that aggregates structured data from one or more sources so that it
can be compared and analyzed, rather than used for transaction processing. A data warehouse is
designed to support the management decision-making process by providing a platform for data cleaning,
data integration, and data consolidation. A data warehouse contains subject-oriented, integrated,
time-variant, and non-volatile data.

A data warehouse consolidates data from many sources while ensuring data quality, consistency, and
accuracy. A data warehouse improves system performance by separating analytics processing from
transactional databases. Data flows into a data warehouse from the various databases. A data
warehouse works by organizing data into a schema that describes the layout and type of data. Query
tools analyze the data tables using the schema.

23. What is Data Purging?

The term purging means to erase or remove. In the context of data mining, data purging is the
process of permanently removing unnecessary data from the database and cleaning the remaining data
to maintain its integrity.

24. What Are Cubes?

A data cube stores data in a summarized form, which helps in a faster analysis of the data. The data is
stored in such a way that it allows easy reporting. For example, using a data cube, a user may want to
analyze the weekly and monthly performance of an employee. Here, month and week could be considered
as the dimensions of the cube.
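
The employee example above can be sketched as a tiny in-memory cube. The fact records, the dimension names, and the `roll_up` helper are hypothetical, and a real data cube would normally precompute and store these summaries rather than recompute them on each request.

```python
from collections import defaultdict

# Hypothetical fact records: (employee, month, week, units_sold).
facts = [
    ("asha", "Jan", 1, 10),
    ("asha", "Jan", 2, 12),
    ("asha", "Feb", 5, 7),
    ("ravi", "Jan", 1, 4),
    ("ravi", "Feb", 6, 9),
]
DIMENSIONS = ("employee", "month", "week")

def roll_up(facts, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in positions)] += row[-1]
    return dict(totals)

monthly = roll_up(facts, ("employee", "month"))  # performance per employee per month
weekly = roll_up(facts, ("employee", "week"))    # performance per employee per week
overall = roll_up(facts, ())                     # a single grand total
```

Each call produces one summarized view of the same facts, which is the kind of aggregate a cube keeps ready for fast reporting.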

25. What are the differences between OLAP and OLTP? [IMP]

OLAP stands for Online Analytical Processing and OLTP for Online Transaction Processing.

Data:
• OLAP: Consists of historical data from various databases.
• OLTP: Consists only of application-oriented, day-to-day operational current data.

Orientation:
• OLAP: It is subject-oriented. Used for data mining, analytics, decision making, etc.
• OLTP: It is application-oriented. Used for business tasks.

Use of the data:
• OLAP: The data is used in planning, problem-solving, and decision-making.
• OLTP: The data is used to perform day-to-day fundamental operations.

View:
• OLAP: It provides a multi-dimensional view of different business tasks.
• OLTP: It reveals a snapshot of present business tasks.

Size:
• OLAP: A large amount of data is stored, typically in TB or PB.
• OLTP: The size of the data is relatively small, as the historical data is archived. For example, MB or GB.

Speed:
• OLAP: Relatively slow, as the amount of data involved is large. Queries may take hours.
• OLTP: Very fast, as the queries operate on 5% of the data.

Backup:
• OLAP: It only needs backup from time to time as compared to OLTP.
• OLTP: The backup and recovery process is maintained religiously.

Users:
• OLAP: This data is generally managed by the CEO, MD, GM.
• OLTP: This data is managed by clerks and managers.

Operations:
• OLAP: Only read and rarely write operations.
• OLTP: Both read and write operations.

26. Explain Association Algorithm In Data Mining?

Association analysis is the discovery of association rules showing attribute-value conditions that occur
frequently together in a given set of data. Association analysis is widely used for market basket or
transaction data analysis. Association rule mining is a significant and exceptionally active area of data
mining research. One method of association-based classification, called associative classification,
consists of two steps. In the first step, association rules are generated using a modified version of
the standard association rule mining algorithm known as Apriori. The second step constructs a classifier
based on the association rules discovered.
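
Frequent itemsets and rules for market basket data can be sketched as follows. This is a simplified Apriori-style search, not the modified version used in associative classification: it grows itemsets level by level but omits the usual subset-pruning step, and the transactions and thresholds are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical market basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(min_support):
    """Level-wise search: frequent k-itemsets are joined into (k+1)-candidates."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(level)
    while level:
        size = len(level[0]) + 1
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        level = [c for c in candidates if support(c) >= min_support]
        frequent += level
    return frequent

def association_rules(min_support, min_confidence):
    """Rules 'antecedent -> item' with confidence = support(both) / support(antecedent)."""
    found = []
    for itemset in frequent_itemsets(min_support):
        for item in itemset:
            antecedent = itemset - {item}
            if antecedent:
                confidence = support(itemset) / support(antecedent)
                if confidence >= min_confidence:
                    found.append((set(antecedent), item, confidence))
    return found

found = association_rules(min_support=0.5, min_confidence=0.7)
```

With these transactions, every basket that contains butter also contains bread, so the only rule that passes both thresholds is butter -> bread.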

27. Explain how to work with data mining algorithms included in SQL server data mining?

SQL Server data mining offers Data Mining Add-ins for Office 2007 that permit finding the patterns and
relationships in the data. This helps in an improved analysis. The add-in called Data Mining
Client for Excel is used to prepare the data, create and manage models, and analyze the results.

28. Explain Over-fitting?


The concept of over-fitting is very important in data mining. It refers to the situation in which the
induction algorithm generates a classifier that perfectly fits the training data but has lost the capability of
generalizing to instances not presented during training. In other words, instead of learning, the classifier
just memorizes the training instances. In decision trees, over-fitting usually occurs when the tree has
too many nodes relative to the amount of training data available. As the number of nodes increases, the
training error usually decreases, while at some point the generalization error becomes worse.
Over-fitting can lead to difficulties when there is noise in the training data or when the number of
training instances is too small. In such cases, the error of the fully built tree is zero, while the true error
is likely to be bigger.

There are many disadvantages of an over-fitted decision tree:

• Over-fitted models generalize poorly, so their predictions on new data are often incorrect.

• Over-fitted decision trees require more space and more computational resources.

• They require the collection of unnecessary features.
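
The gap between training error and true error can be sketched with two toy models. Both are assumptions made for illustration: a "memorizer" that copies the label of the nearest training instance stands in for a fully built tree, and a single-threshold stump stands in for a small tree. The data is hypothetical and contains one deliberately noisy label.

```python
# Hypothetical 1-D data as (x, label); the true concept is "label 1 for large x".
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0), (6, 1), (7, 1), (8, 1), (9, 1)]  # (3, 1) is noise
test = [(0.5, 0), (2.8, 0), (3.2, 0), (6.5, 1), (8.5, 1)]

def memorizer(x):
    """Over-fitted model: copy the label of the closest training instance."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def fit_stump(data):
    """Simple model: pick the threshold t (predict 1 if x >= t) with the fewest training errors."""
    def errors(t):
        return sum((x >= t) != bool(label) for x, label in data)
    return min((x for x, _ in data), key=errors)

threshold = fit_stump(train)

def stump(x):
    return int(x >= threshold)

def error_rate(model, data):
    """Fraction of instances the model labels incorrectly."""
    return sum(model(x) != label for x, label in data) / len(data)

memorizer_train, memorizer_test = error_rate(memorizer, train), error_rate(memorizer, test)
stump_train, stump_test = error_rate(stump, train), error_rate(stump, test)
```

The memorizer reaches zero training error by reproducing the noisy label, and then repeats that mistake on nearby test points; the stump accepts one training error and generalizes better on this data.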

32. Explain the Issues regarding Classification And Prediction?

Preparing the data for classification and prediction:

• Data cleaning

• Relevance analysis

• Data transformation

Comparing classification methods:

• Predictive accuracy

• Speed

• Robustness

• Scalability

• Interpretability

34. What is a machine learning-based approach to data mining?

Machine learning is widely used in data mining because it covers automatic computing procedures
based on logical or binary operations. Machine learning approaches can generally handle more general
types of data, including cases in which the type and number of attributes may vary. Machine learning
is one of the most popular techniques used in data mining and in artificial intelligence as well.

35. What is the K-means algorithm?

K-means clustering algorithm: It is one of the simplest unsupervised learning algorithms for solving
clustering problems. The K-means algorithm partitions n observations into k clusters, where each
observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster.
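
The assign-then-update loop behind K-means can be sketched in plain Python. The points and the starting centroids are assumptions chosen for illustration; real implementations usually pick the initial centroids randomly or with a seeding heuristic.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest mean and recomputing the means."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each centroid becomes the mean of its cluster (kept as-is if empty).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged: the means stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D observations forming two visible groups, with k = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
```

The returned centroids are the cluster means, that is, the prototypes referred to in the definition above.
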
40. Why is KNN preferred when determining missing values in data?

K-Nearest Neighbour (KNN) is preferred here because it can easily approximate the
value to be determined from the values of the records closest to it.
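
Filling a missing value from its nearest neighbours can be sketched as follows. The age/income rows, the choice of k = 2, and averaging the neighbours' values are assumptions for illustration; the sketch also assumes that only the target column can be missing.

```python
def distance(row, other, skip):
    """Euclidean distance over all columns except the one being imputed."""
    return sum((a - b) ** 2
               for i, (a, b) in enumerate(zip(row, other)) if i != skip) ** 0.5

def knn_impute(rows, target, k=2):
    """Replace a missing value in column `target` by the mean of its k nearest complete rows."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(list(row))
            continue
        neighbors = sorted(complete, key=lambda other: distance(row, other, target))[:k]
        estimate = sum(n[target] for n in neighbors) / len(neighbors)
        filled.append([estimate if i == target else v for i, v in enumerate(row)])
    return filled

# Hypothetical records as [age, income]; the last income is missing.
rows = [
    [25, 30000.0],
    [27, 32000.0],
    [45, 60000.0],
    [47, 64000.0],
    [26, None],
]
filled = knn_impute(rows, target=1)
```

The record with age 26 is closest to the records with ages 25 and 27, so its income is estimated from those two rather than from the whole column.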

The k-nearest neighbor (K-NN) classifier is considered an example-based classifier, which
means that the training documents themselves are used for comparison instead of an explicit class
representation, such as the class profiles used by other classifiers. As such, there is no real training
phase. Once a new document has to be classified, the k most similar documents (neighbors) are found,
and if a large enough proportion of them are assigned to a particular class, the new document is also
assigned to this class, otherwise not. Additionally, finding the nearest neighbors can be sped up using
traditional indexing methods.
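
The voting rule described above can be sketched as follows. The two-number "document" vectors, the class names, and the `min_share` threshold are hypothetical; real document classifiers would use term-weight vectors and a similarity measure such as cosine similarity rather than the squared Euclidean distance used here.

```python
from collections import Counter

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_classify(training, query, k=3, min_share=0.6):
    """Assign the majority class of the k nearest neighbors if its share is large enough."""
    neighbors = sorted(training, key=lambda item: squared_distance(item[0], query))[:k]
    label, votes = Counter(lbl for _, lbl in neighbors).most_common(1)[0]
    return label if votes / len(neighbors) >= min_share else None

# Hypothetical training documents as (feature vector, class label).
training = [
    ((1, 1), "sports"),
    ((1, 2), "sports"),
    ((2, 1), "sports"),
    ((8, 8), "politics"),
    ((9, 8), "politics"),
]

clear_case = knn_classify(training, (2, 2))                   # all 3 neighbors agree
close_case = knn_classify(training, (6, 6))                   # 2 of 3 neighbors agree
strict_case = knn_classify(training, (6, 6), min_share=0.7)   # 2 of 3 is not enough
```

There is no training step: the training examples are stored as they are and only consulted when a query arrives.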
