DWH Unit 4

Data Warehouse and Mining notes Unit 4

Classification

Classification is the process of assigning a record to one of a set of predefined classes. A simple example is predicting whether or not it is raining: the answer can only be yes or no, so there is a fixed number of choices. When there are more than two classes to choose from, the task is called multiclass classification.

For example, a bank needs to analyze whether giving a loan to a particular customer is risky or not. Based on observable data for many past borrowers, a classification model can be built that forecasts credit risk. The data could track job records, homeownership or leasing, years of residency, number and type of deposits, historical credit ranking, and so on. The target would be the credit ranking, the predictors would be the other attributes, and each borrower would contribute one record. In this example, the model is constructed to predict a categorical label: risky or safe.

How Does Classification Work?


The functioning of classification has been illustrated above with the bank loan application. There are two stages in a data classification system: building the classifier (model creation) and applying the classifier for classification.

1. Building the classifier (model creation): This stage is the learning process. A classification algorithm constructs the classifier from a training set composed of database records and their corresponding class labels. Each record that makes up the training set belongs to a predefined category or class; such records may also be referred to as samples, objects, or data points.
2. Applying the classifier for classification: The classifier is used for classification at this stage. Test data are used here to estimate the accuracy of the classification rules. If the accuracy is deemed sufficient, the rules can be applied to new data records.
3. Data classification process: The data classification process can be broken into five steps:
o Define the goals, strategy, workflows, and architecture of data classification.
o Identify the confidential details that are stored.
o Apply labels to the data (data labelling).
o Use the results to improve protection and compliance.
o Because data is complex, treat classification as a continuous process.
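The two stages above (building a classifier from a training set, then applying it to held-out test records to estimate accuracy) can be sketched in pure Python. The loan features and the 1-nearest-neighbour rule below are illustrative assumptions, not a method prescribed by these notes:

```python
# Stage 1: build a classifier from labelled training records.
# Each record: (years_of_residency, historical_credit_rank) -> class label.
training_set = [
    ((2, 3), "risky"), ((1, 2), "risky"),
    ((8, 9), "safe"),  ((10, 8), "safe"),
]

def classify(record, training_set):
    """Stage 2: label a new record by copying its nearest training record's class."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest_record, nearest_label = min(training_set,
                                        key=lambda pair: dist(record, pair[0]))
    return nearest_label

# Estimate accuracy on held-out test records before trusting the classifier.
test_set = [((9, 9), "safe"), ((1, 1), "risky")]
correct = sum(classify(record, training_set) == label for record, label in test_set)
accuracy = correct / len(test_set)
print("test accuracy:", accuracy)
```

If the estimated accuracy is acceptable, the same `classify` function would then be applied to new, unlabelled records.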
Different types of classification:

o Classification based on the mined databases
o Classification based on the type of mined knowledge
o Classification based on statistics
o Classification based on machine learning
o Classification based on visualization
o Classification based on information science
o Classification based on utilized techniques
o Classification based on adapted applications
Difference between Classification and Clustering

o Classification is a supervised learning approach in which a specific label is provided to the machine to classify new observations; the machine needs proper training and testing for label verification. Clustering is an unsupervised learning approach in which grouping is done on the basis of similarities.
o Classification uses a training dataset; clustering does not use a training dataset.
o Classification uses algorithms to categorize new data according to the observations of the training set; clustering uses statistical concepts to divide the data set into subsets with similar features.
o In classification, there are labels for the training data; in clustering, there are no labels for the training data.
o The objective of classification is to find which of a set of predefined classes a new object belongs to; the objective of clustering is to group a set of objects and find whether there is any relationship between them.
o Classification is more complex than clustering; clustering is less complex than classification.
Decision tree
A decision tree is a diagram representation of the possible solutions to a decision. It shows the different outcomes that follow from a set of decisions, and it is a widely used decision-making tool for analysis and planning. The diagram starts with a box (the root), which branches off into several solutions; that is why it is called a decision tree.

Decision trees are helpful for a variety of reasons. Not only are they easy-to-understand diagrams that help you 'see' your thoughts, they also provide a framework for evaluating all possible alternatives.

Important Terms of Decision Tree in Data Mining


A decision tree is a non-parametric supervised learning algorithm that is used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes. A decision tree starts with a root node, which has no incoming branches. The outgoing branches from the root node feed into the internal nodes, also known as decision nodes. Based on the available features, both node types conduct evaluations that form homogeneous subsets, which are denoted by leaf nodes, or terminal nodes. The leaf nodes represent all the possible outcomes within the dataset.

• Root Node: This is the first node where the splitting takes place.

• Leaf Node: This is the node after which there is no more branching.

• Decision Node: The node formed after splitting data from a previous node is known as a decision node.

• Branch: A subsection of the tree containing information about the outcome of a split at a decision node.

• Pruning: Removing a decision node's sub-nodes to deal with an outlier or noisy data is called pruning. It can be thought of as the opposite of splitting.
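A minimal (hypothetical) data structure makes the terms above concrete: internal decision nodes hold a question, leaf nodes hold an outcome, and branches are the links between parent and child nodes:

```python
class Node:
    """A decision-tree node: internal nodes hold a question, leaves hold an outcome."""
    def __init__(self, question=None, branches=None, outcome=None):
        self.question = question        # splitting test; None for leaf nodes
        self.branches = branches or {}  # answer -> child Node (empty for leaves)
        self.outcome = outcome          # class label; None for internal nodes

    def is_leaf(self):
        # A leaf node is one after which there is no more branching.
        return not self.branches

# The root node is where the first split takes place.
root = Node(question="raining?", branches={
    "yes": Node(outcome="stay home"),
    "no":  Node(outcome="go out"),
})
print(root.branches["yes"].is_leaf())
```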


Example 1: The Structure of Decision Tree
Let’s explain the decision tree structure with a simple example.

Each decision tree has three key parts:

▪ a root node

▪ leaf nodes, and

▪ branches.

No matter what type of decision tree it is, it starts with a specific decision. This decision is depicted with a box, the root node.

Root and decision nodes hold questions or criteria you have to answer. Commonly, nodes appear as squares or circles: squares depict decisions, while circles represent uncertain outcomes.

Branches are the lines that connect nodes, indicating the flow from question to answer.

Each node normally has two or more branches extending from it. If a leaf node gives the solution to the decision, the line is left empty.
Steps for Creating Decision Trees:
1. Write the main decision.
Begin the decision tree by drawing a box (the root node) at one edge of your paper, and write the main decision in the box.

2. Draw the lines


Draw a line leading out from the box for each possible solution or action. Make at least two, but preferably no more than four, lines. Keep the lines as far apart as possible so you can enlarge the tree later.

3. Illustrate the outcomes of the solution at the end of each line.


4. Continue adding boxes and lines.
Continue until there are no more problems and every line ends in either an uncertain outcome or a blank ending.

5. Finish the tree.


The boxes that represent uncertain outcomes remain as they are.

Advantages of Decision Trees:


▪ They are very easy to understand and interpret.

▪ The data for decision trees require minimal preparation.

▪ They force you to consider many possible outcomes of a decision.

▪ They can easily be used with many other decision tools.

▪ They help you make the best decisions and best guesses on the basis of the information you have.

▪ They help you see the difference between controlled and uncontrolled events.

▪ They help you estimate the likely results of one decision against another.

Disadvantages:

▪ Sometimes decision trees can become too complex.

▪ The outcomes of decisions may be based mainly on your expectations. This can lead to unrealistic decision trees.

▪ The diagrams can narrow your focus to a few critical decisions and objectives, at the risk of overlooking other options.
Application of Decision Tree in Data Mining
A decision tree has a flowchart-like architecture built into the algorithm. It essentially follows an "If X then Y else Z" pattern at each split. This kind of pattern mirrors human intuition in a programmatic form, so decision trees can be used extensively in various categorization problems.

• Wherever the objective function can be analyzed in terms of simple decision rules.

• When there are numerous courses of action available.

• Outlier analysis.

• Understanding which features are significant for the entire dataset, and "mining" the few important features from a list of hundreds in big data.

• Selecting the best flight to travel to a destination.

• Decision-making based on different circumstantial situations.

• Churn analysis.

• Sentiment analysis.
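The "If X then Y else Z" split pattern is just nested conditionals. As a sketch, a hand-written churn-analysis tree (the features and thresholds are invented for illustration) might look like:

```python
def predict_churn(customer):
    """Hand-written decision tree following the 'If X then Y else Z' split pattern.
    Features and thresholds are hypothetical, chosen only to illustrate the shape."""
    if customer["months_inactive"] > 3:        # first split at the root node
        if customer["complaints"] > 1:         # second split at a decision node
            return "churn"                     # leaf node
        else:
            return "at risk"                   # leaf node
    else:
        return "retain"                        # leaf node

print(predict_churn({"months_inactive": 5, "complaints": 2}))
```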
Why are decision trees useful?
They enable us to analyze the possible consequences of a decision thoroughly.

They provide a framework for measuring the values of outcomes and the probability of achieving them.

They help us make the best decisions based on existing data and best speculations.

In other words, a decision tree is a hierarchical tree structure that can be used to split an extensive collection of records into smaller sets of classes by applying a sequence of simple decision rules. A decision tree model comprises a set of rules for partitioning a large heterogeneous population into smaller, more homogeneous, mutually exclusive classes. The predictor attributes can be variables of any type, from nominal, ordinal, and binary to quantitative values; the classes, in contrast, must be of a qualitative type, such as categorical, ordinal, or binary. In brief, given data attributes together with their classes, a decision tree produces a set of rules that can be used to identify the class. One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is called the tree, and each segment is called a node. With each progressive division, the members of the resulting sets become more and more similar to each other. Hence, the algorithm used to build a decision tree is referred to as recursive partitioning, and is well known as CART (Classification and Regression Trees).
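A minimal sketch of CART-style recursive partitioning, assuming toy numeric data, binary splits, and Gini impurity as the split criterion (one common choice):

```python
# Recursive partitioning sketch: repeatedly choose the split that most
# reduces Gini impurity, until each segment (node) is pure.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """Find the (attribute, threshold) split that most reduces Gini impurity."""
    best = None
    best_impurity = gini([r[-1] for r in rows])
    for attr in range(len(rows[0]) - 1):
        for threshold in {r[attr] for r in rows}:
            left = [r for r in rows if r[attr] <= threshold]
            right = [r for r in rows if r[attr] > threshold]
            if not left or not right:
                continue
            weighted = (len(left) * gini([r[-1] for r in left]) +
                        len(right) * gini([r[-1] for r in right])) / len(rows)
            if weighted < best_impurity:
                best_impurity, best = weighted, (attr, threshold, left, right)
    return best

def build_tree(rows):
    """Each progressive division makes the subsets more homogeneous."""
    split = best_split(rows)
    if split is None:                 # pure node: emit a leaf with its class label
        return rows[0][-1]
    attr, threshold, left, right = split
    return (attr, threshold, build_tree(left), build_tree(right))

def predict(tree, row):
    while not isinstance(tree, str):  # descend until a leaf (a class label) is reached
        attr, threshold, left, right = tree
        tree = left if row[attr] <= threshold else right
    return tree

# Hypothetical loan data: (income, credit_rank, class)
rows = [(20, 2, "risky"), (25, 3, "risky"), (60, 8, "safe"), (80, 9, "safe")]
tree = build_tree(rows)
print(predict(tree, (70, 7)))
```

One rule is applied after another as the loop in `predict` descends the tree, mirroring the hierarchy of segments described above.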

Estimation
Estimation is the process of finding an estimate, or approximation: a value that can be used for some purpose even if the input data are incomplete, uncertain, or unstable.
Estimation determines how much money, effort, resources, and time it will take to build a specific system or product. Estimation is based on −

• Past Data / Past Experience
• Available Documents / Knowledge
• Assumptions
• Identified Risks
The four basic steps in Software Project Estimation are −

• Estimate the size of the development product.
• Estimate the effort in person-months or person-hours.
• Estimate the schedule in calendar months.
• Estimate the project cost in an agreed currency.
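These four steps can be illustrated with the basic COCOMO model in organic mode (one widely known estimation model; the size figure and the cost rate below are invented for illustration):

```python
# Basic COCOMO, organic mode: derive effort, schedule, and cost from size.
size_kloc = 32                              # step 1: estimated size in KLOC (assumed)

effort_pm = 2.4 * size_kloc ** 1.05         # step 2: effort in person-months
schedule_months = 2.5 * effort_pm ** 0.38   # step 3: schedule in calendar months
cost = effort_pm * 8000                     # step 4: cost at an assumed rate per person-month

print(round(effort_pm, 1), "person-months,", round(schedule_months, 1), "months")
```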
Clustering
Clustering is the process of grouping a set of abstract objects into classes of similar objects.
Points to Remember
• A cluster of data objects can be treated as one group.
• In cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis


• Clustering analysis is broadly used in many applications such as market research, pattern recognition,
data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base and characterize those groups based on purchasing patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar
functionalities and gain insight into structures inherent to populations.
• Clustering also helps in identification of areas of similar land use in an earth observation database. It also
helps in the identification of groups of houses in a city according to house type, value, and geographic
location.
• Clustering also helps in classifying documents on the web for information discovery.
• Clustering is also used in outlier detection applications such as detection of credit card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to
observe characteristics of each cluster.

Requirements of Clustering in Data Mining


The following points throw light on why clustering is required in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape; it should not be limited to distance measures that tend to find only small spherical clusters.
• High dimensionality − The clustering algorithm should not only be able to handle low-dimensional data
but also the high dimensional space.
• Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some algorithms
are sensitive to such data and may lead to poor quality clusters.
• Interpretability − The clustering results should be interpretable, comprehensible, and usable.

Clustering Methods
Clustering methods can be classified into the following categories −

• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
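As an example of the partitioning method, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on invented one-dimensional data:

```python
# Minimal k-means (a partitioning method): assign each point to its nearest
# centroid, then recompute centroids, repeating for a fixed number of rounds.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:                       # assignment step
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]   # update step
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]        # two obvious groups
centroids, clusters = kmeans(points, centroids=[0.0, 10.0])
print(sorted(round(c, 2) for c in centroids))
```

On this toy data the algorithm converges quickly to one centroid near each group, illustrating how a partitioning method divides the data set into subsets with similar features.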
Data Visualization
Data visualization is the graphical representation of data points and information, making it quick and easy for users to understand. A good data visualization has a clear meaning and purpose and is easy to interpret without requiring extra context. Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data using visual elements such as charts, graphs, and maps.

Categories of Data Visualization
Data visualization is critical to market research, where both numerical and categorical data can be visualized; this increases the impact of insights and reduces the risk of analysis paralysis. Data visualization is categorized as follows:

1. Numerical Data:
Numerical data is also known as quantitative data. It is any data that represents an amount, such as the height, weight, or age of a person. Numerical data visualization is the easiest way to visualize data and is generally used to help others digest large data sets and raw numbers in a form that is easier to act on. Numerical data falls into two categories:
• Continuous Data – Can take any value within a range (Example: height measurements).
• Discrete Data – Not continuous; takes separate, countable values (Example: the number of cars or children a household has).
The visualization techniques used for numerical data are charts and numerical values. Examples are pie charts, bar charts, averages, and scorecards.
2. Categorical Data:
Categorical data is also known as qualitative data. It is any data that represents groups; it consists of categorical variables used to represent characteristics such as a person's ranking or a person's gender. Categorical data visualization is all about depicting key themes, establishing connections, and lending context. Categorical data is classified into three categories:
• Binary Data – Classification into one of two positions (Example: agree or disagree).
• Nominal Data – Classification based on attributes with no inherent order (Example: male or female).
• Ordinal Data – Classification based on the ordering of information (Example: timelines or processes).
The visualization techniques used for categorical data are graphics, diagrams, and flowcharts. Examples are word clouds, sentiment mapping, and Venn diagrams.
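The numerical/categorical split above can be sketched as a small helper that suggests a visualization family for a column of data; the heuristics and the suggested chart names are invented for illustration:

```python
def visualization_type(column):
    """Suggest a visualization family based on the data category of a column.
    The rules here are simplified, hypothetical heuristics."""
    if all(isinstance(v, (int, float)) for v in column):
        # Numerical (quantitative) data: charts and numeric summaries.
        kind = "discrete" if all(isinstance(v, int) for v in column) else "continuous"
        return f"numerical/{kind}: bar chart or average"
    # Categorical (qualitative) data: diagrams such as word clouds.
    if set(column) <= {"yes", "no", "agree", "disagree"}:
        return "categorical/binary: two-option chart"
    return "categorical: word cloud or diagram"

print(visualization_type([1.72, 1.80, 1.65]))   # heights: continuous numerical
print(visualization_type([2, 0, 3]))            # counts: discrete numerical
print(visualization_type(["yes", "no", "yes"])) # binary categorical
```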
