Viva Data Mining Lab
Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns.
• Classification
• Clustering
• Regression
• Deviation Detection
Business understanding: understanding the project's objectives from a business perspective and converting this knowledge into a data mining problem definition.
Data preparation: constructing the final data set from the raw data.
Many treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Others view data mining as simply an essential step in the process of knowledge discovery, in which intelligent methods are applied in order to extract data patterns.
• Data transformation (where data are transformed or consolidated into forms appropriate for mining, for example by performing summary or aggregation operations).
• Data mining (an important process where intelligent methods are applied in order to extract
data patterns).
• Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures).
• Knowledge presentation (where knowledge representation and visualization techniques are used
to present the mined knowledge to the user).
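The data transformation step above (consolidating raw records through summary or aggregation) can be sketched in plain Python; the sales records below are invented purely for illustration:

```python
from collections import defaultdict

# Hypothetical raw transaction records: (region, sale amount)
raw = [("north", 120.0), ("south", 80.0), ("north", 60.0), ("south", 40.0)]

# Consolidate into a summary form suitable for mining: total sales per region
totals = defaultdict(float)
for region, amount in raw:
    totals[region] += amount

print(dict(totals))  # {'north': 180.0, 'south': 120.0}
```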
5. What is Classification?
Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. Classification can be used for predicting the class label of data items. However, in many applications, one may wish to estimate some missing or unavailable data values rather than class labels.
7. What is Prediction?
Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled object, or to estimate the value or value ranges of an attribute that a given object is likely to have. In this interpretation, classification and regression are the two major types of prediction problems, where classification is used to predict discrete or nominal values, while regression is used to predict continuous or ordered values.
A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node of the tree is the root node.
A Decision tree is a classification scheme that generates a tree and a set of rules, representing the model
of different classes, from a given data set. The set of records available for developing classification
methods is generally divided into two disjoint subsets namely a training set and a test set. The former is
used for deriving the classifier while the latter is used to measure the accuracy of the classifier. The
accuracy of the classifier is determined by the percentage of the test examples that are correctly
classified.
In the decision tree classifier, we categorize the attributes of the records into two different types.
Attributes whose domain is numerical are called the numerical attributes and the attributes whose
domain is not numerical are called categorical attributes. There is one distinguished attribute called a
class label. The goal of classification is to build a concise model that can be used to predict the class of
the records whose class label is unknown. Decision trees can easily be converted to classification rules.
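The train/test procedure described above can be sketched with the simplest possible tree, a one-level "decision stump" on a single numerical attribute; the records below are invented for illustration:

```python
def fit_stump(train):
    """Learn a threshold on one numeric attribute that best separates two classes."""
    best = None
    for t in sorted({x for x, _ in train}):
        # Predict class 1 when x >= t, class 0 otherwise
        correct = sum((1 if x >= t else 0) == y for x, y in train)
        if best is None or correct > best[1]:
            best = (t, correct)
    return best[0]

def accuracy(threshold, records):
    """Fraction of records whose predicted class matches the true label."""
    return sum((1 if x >= threshold else 0) == y for x, y in records) / len(records)

# Disjoint training and test sets of (attribute value, class label) records
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]
test = [(1.5, 0), (3.5, 1)]

threshold = fit_stump(train)      # derive the classifier from the training set
print(threshold)                  # 3.0
print(accuracy(threshold, test))  # 1.0
```

Once the threshold is learned, classifying a test record is a single comparison, which illustrates why prediction with a built tree is so fast.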
9. What are the advantages of a decision tree classifier?
• Once a decision tree model has been built, classifying a test record is extremely fast.
• The decision tree representation is rich enough to represent any discrete-value classifier.
• Decision trees can handle datasets that may have errors.
• Decision trees can handle datasets that may have missing values.
• They do not require any prior assumptions. Decision trees are self-explanatory and, when compact, they are also easy to follow. That is to say, if the decision tree has a reasonable number of leaves it can be grasped by non-professional users. Furthermore, since decision trees can be converted to a set of rules, this sort of representation is considered comprehensible.
A neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input samples. Neural network learning is also referred to as connectionist learning due to the connections between units. Neural networks involve long training times and are therefore more appropriate for applications where this is feasible. They require a number of parameters that are typically best determined empirically, such as the network topology or “structure”. Neural networks have been criticized for their poor interpretability, since it is difficult for humans to interpret the symbolic meaning behind the learned weights. These features initially made neural networks less desirable for data mining.
The advantages of neural networks, however, include their high tolerance to noisy data as well as their ability to classify patterns on which they have not been trained. In addition, several algorithms have recently been developed for the extraction of rules from trained neural networks. These factors contribute to the usefulness of neural networks for classification in data mining. The most popular neural network algorithm is the backpropagation algorithm, proposed in the 1980s.
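A full multi-layer network is beyond a few lines, but the weight-adjustment idea described above can be sketched with a single sigmoid unit trained by gradient descent; the OR-gate data set here is a toy example, not from the source:

```python
import math

# A single sigmoid unit trained by gradient descent, the simplest case of the
# weight-update rule used in backpropagation.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # OR gate
w = [0.0, 0.0]  # connection weights
b = 0.0         # bias
lr = 1.0        # learning rate

for _ in range(2000):
    for (x1, x2), y in data:
        out = sigmoid(w[0] * x1 + w[1] * x2 + b)
        err = out - y  # gradient of the logistic loss w.r.t. the net input
        # Adjust each weight against the error gradient
        w[0] -= lr * err * x1
        w[1] -= lr * err * x2
        b -= lr * err

preds = [round(sigmoid(w[0] * x1 + w[1] * x2 + b)) for (x1, x2), _ in data]
print(preds)  # [0, 1, 1, 1]
```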
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to the data points in other groups. It is basically a grouping of objects on the basis of the similarity and dissimilarity between them.
Type: classification is used for supervised learning, while clustering is used for unsupervised learning.
Supervised learning, as the name indicates, has the presence of a supervisor acting as a teacher. Basically, supervised learning is when we teach or train the machine using data that is well labeled, which means some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (set of training examples) and produces a correct outcome from labeled data.
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training will be given to the machine. Therefore, the machine must find the hidden structure in unlabeled data by itself.
• Healthcare
• Intelligence
• Telecommunication
• Energy
• Retail
• E-commerce
• Supermarkets
• Crime Agencies
Businesses in all these sectors benefit from data mining.
It is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed for decision support, rather than for transaction processing.
Data Warehouse: A data warehouse is designed to support the management decision-making process by
providing a platform for data cleaning, data integration, and data consolidation. A data warehouse
contains subject-oriented, integrated, time-variant, and non-volatile data.
A data warehouse consolidates data from many sources while ensuring data quality, consistency, and accuracy. A data warehouse improves system performance by separating analytics processing from transactional databases. Data flows into a data warehouse from the various databases. A data warehouse works by organizing data into a schema that describes the layout and type of data. Query tools then analyze the data tables using this schema.
The term purging can be defined as erasing or removing. In the context of data mining, data purging is the process of permanently removing unnecessary data from the database and cleaning the data to maintain its integrity.
A data cube stores data in a summarized version which helps in faster analysis of data. The data is stored in such a way that it allows easy reporting. For example, using a data cube a user may want to analyze the weekly and monthly performance of an employee. Here, month and week could be considered as the dimensions of the cube.
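The month/week example above can be sketched as pre-aggregated lookup tables in Python; this is a toy stand-in for a real cube, and the sales figures are invented:

```python
from collections import defaultdict

# Hypothetical fact records: (employee, month, week, sales)
facts = [
    ("alice", "Jan", 1, 10), ("alice", "Jan", 2, 15),
    ("alice", "Feb", 1, 20), ("bob", "Jan", 1, 5),
]

# Pre-aggregate along the cube's dimensions so queries become fast lookups
by_month = defaultdict(int)
by_week = defaultdict(int)
for emp, month, week, sales in facts:
    by_month[(emp, month)] += sales        # roll up to the month level
    by_week[(emp, month, week)] += sales   # finer-grained week level

print(by_month[("alice", "Jan")])    # 25
print(by_week[("alice", "Jan", 2)])  # 15
```

Storing the summaries up front is what makes analysis against a cube faster than recomputing totals from the raw records each time.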
• OLAP data is used in planning, problem-solving, and decision-making; OLTP data is used to perform day-to-day fundamental operations.
• OLAP is relatively slow as the amount of data involved is large, and queries may take hours; OLTP is very fast as its queries operate on only about 5% of the data.
• OLAP only needs backup from time to time as compared to OLTP, where the backup and recovery process is maintained religiously.
• OLAP data is generally managed by the CEO, MD, or GM; OLTP data is managed by clerks and managers.
• OLAP involves mostly read and only rarely write operations; OLTP involves both read and write operations.
26. Explain Association Algorithm In Data Mining?
Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association analysis is widely used for market basket or transaction data analysis. Association rule mining is a significant and exceptionally active area of data mining research. One method of association-based classification, called associative classification, consists of two steps. In the first step, association rules are generated using a modified version of the standard association rule mining algorithm known as Apriori. The second step constructs a classifier based on the association rules discovered.
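The first step (finding frequent itemsets, Apriori-style) can be sketched in plain Python. This is a simplified level-wise search on an invented toy basket data set, not the full Apriori algorithm with candidate pruning:

```python
# Toy market-basket transactions (invented for illustration)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
min_support = 2  # minimum absolute support (number of transactions)

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# Level-wise search: frequent 1-itemsets, then join them into 2-itemsets, ...
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
all_frequent = list(frequent)
k = 2
while frequent:
    # Candidate k-itemsets formed by joining frequent (k-1)-itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent += frequent
    k += 1

for s in sorted(sorted(i) for i in all_frequent):
    print(s)
```

Association rules such as "bread => milk" are then read off the frequent itemsets by checking each rule's confidence.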
27. Explain how to work with data mining algorithms included in SQL server data mining?
SQL Server data mining offers Data Mining Add-ins for Office 2007 that permit finding the patterns and relationships in the information, which helps in improved analysis. The Add-in called Data Mining Client for Excel is used to prepare the data, then create, manage, and analyze models and their results.
• Over-fitted decision trees require more space and more computational resources.
• Data cleaning
• Relevance analysis
• Data transformation
• Predictive accuracy
• Speed
• Robustness
• Scalability
• Interpretability
This is one of the high-level Data Mining interview questions asked in an interview. Machine learning is widely utilized in data mining since it covers automatic programmed processing systems based on logical or binary operations. Machine learning for the most part follows rules that permit us to manage more general information types, including cases where the type and number of attributes may differ. Machine learning is one of the famous procedures utilized for data mining, and in artificial intelligence as well.
K-means clustering algorithm – It is the simplest unsupervised learning algorithm that solves clustering problems. The K-means algorithm partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
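The assign/update loop behind K-means can be sketched in a few lines of Python; the one-dimensional points below are made up for illustration:

```python
# A bare-bones K-means sketch on 1-D points (toy data, invented for illustration)
def kmeans(points, k, iters=20):
    centroids = points[:k]  # naive initialization: the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point goes to its nearest centroid
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Update step: each centroid moves to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.25, 0.75, 9.0, 9.25, 8.75]
print(sorted(kmeans(points, 2)))  # [1.0, 9.0]
```

Real implementations add smarter initialization and a convergence check instead of a fixed iteration count.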
40. Why is KNN preferred when determining missing numbers in data?
K-Nearest Neighbour (KNN) is preferred here because of the fact that KNN can easily approximate the
value to be determined based on the values closest to it.
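A minimal sketch of that idea, assuming purely numeric rows and an invented toy data set: the missing entry is approximated by averaging the same column over the k rows nearest on the known columns:

```python
# Estimate a missing value from the k nearest rows (Euclidean distance on the
# known attributes); the data below are invented for illustration
def knn_impute(rows, target_row, missing_col, k=2):
    known_cols = [c for c in range(len(target_row)) if c != missing_col]
    def dist(r):
        # Distance to the target row, using only the known columns
        return sum((r[c] - target_row[c]) ** 2 for c in known_cols) ** 0.5
    neighbors = sorted(rows, key=dist)[:k]
    # Approximate the missing value by the neighbors' average in that column
    return sum(r[missing_col] for r in neighbors) / k

rows = [(1.0, 10.0), (1.2, 12.0), (8.0, 80.0)]
print(knn_impute(rows, (1.1, None), missing_col=1, k=2))  # 11.0
```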
The k-nearest neighbor (K-NN) classifier is considered an example-based classifier, which means that the training documents are used for comparison rather than an exact class representation, like the class profiles used by other classifiers. As such, there is no real training phase. When a new document has to be classified, the k most similar documents (neighbors) are found, and if a large enough proportion of them is allotted to a particular class, the new document is also assigned to that class; otherwise it is not. Additionally, finding the nearest neighbors can be sped up using traditional indexing strategies.