DWH Unit 4
Classification is the process of assigning a record to one of a set of predefined classes. One simple example of classification is checking whether it is raining or not: the answer can only be yes or no, so there is a fixed number of choices. Sometimes there are more than two classes to choose from; that is called multiclass classification.
A bank needs to analyze whether giving a loan to a particular customer is risky or not. For example, based on observable data for multiple loan borrowers, a classification model may be built that forecasts credit risk. The data could track job records, homeownership or leasing, years of residency, number and type of deposits, historical credit ranking, and so on. The target would be the credit ranking, the predictors would be the other attributes, and the data would contain one case for each customer. In this example, a model is constructed to find the categorical label; the labels are risky and safe.
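To make the loan example concrete, here is one possible way to represent such borrower records in Python. The field names and values are invented purely for illustration and are not taken from any real data set.

# A minimal sketch of the borrower records described above. The field names
# (job_years, owns_home, ...) and the values are illustrative assumptions only;
# the categorical label for each record is either "risky" or "safe".
borrowers = [
    {"job_years": 7, "owns_home": True,  "years_at_address": 5, "deposits": 2, "label": "safe"},
    {"job_years": 1, "owns_home": False, "years_at_address": 1, "deposits": 0, "label": "risky"},
    {"job_years": 4, "owns_home": False, "years_at_address": 3, "deposits": 1, "label": "safe"},
]

# The predictors are all columns except the label; the target is the label itself.
X = [[r["job_years"], int(r["owns_home"]), r["years_at_address"], r["deposits"]] for r in borrowers]
y = [r["label"] for r in borrowers]
print(X)
print(y)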
1. Developing the classifier (model creation): This is the learning stage or learning process. The classification algorithm constructs the classifier in this stage. The classifier is built from a training set composed of database records and their corresponding class names. Each record that makes up the training set belongs to a predefined category or class. We may also refer to these records as samples, objects, or data points.
2. Applying the classifier for classification: The classifier is used for classification at this stage. The test data are used here to estimate the accuracy of the classification rules. If the accuracy is deemed sufficient, the rules can be applied to new data records. (A minimal sketch of these two stages follows this list.)
3. Data Classification Process: The data classification process can be divided into five steps:
o Define the goals, strategy, workflows, and architecture of data classification.
o Identify the confidential or sensitive data that we store.
o Apply labels by tagging the data.
o Use the results to improve security and compliance.
o Because data changes continuously, classification is an ongoing process.
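The sketch below illustrates the two model-building stages described above: learning a classifier from a labelled training set, then estimating its accuracy on held-out test data. It assumes the scikit-learn library is available, and the synthetic data simply stands in for labelled customer records.

# A minimal sketch of the two stages above, assuming scikit-learn is available.
# Stage 1 builds the classifier from a labelled training set; stage 2 estimates
# its accuracy on held-out test records before the rules are applied to new data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in for labelled records (predictor columns plus a class name).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

classifier = GaussianNB().fit(X_train, y_train)                # stage 1: learning
accuracy = accuracy_score(y_test, classifier.predict(X_test))  # stage 2: evaluation
print(f"Test accuracy: {accuracy:.2f}")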
Difference between Classification and Clustering:
• Classification uses algorithms to categorize new data according to the observations in the training set, whereas clustering uses statistical concepts to divide the data set into subsets with similar features.
• In classification, there are labels for the training data; in clustering, there are no labels for the training data.
• The objective of classification is to find which of a set of predefined classes a new object belongs to; the objective of clustering is to group a set of objects and find whether there is any relationship between them.
Decision trees are helpful for a variety of reasons. Not only are they easy-to-understand diagrams that help you ‘see’ your thoughts, but they also provide a framework for evaluating all possible alternatives.
• Root Node: This is the first node where the splitting takes place.
• Leaf Node: This is the node after which there is no more branching.
• Decision Node: The node formed after splitting data from a previous node is known as a decision node.
• Branch: A subsection of the tree containing information about the outcome of the split at the decision node.
• Pruning: Removing a decision node’s sub-nodes to deal with an outlier or noisy data is called pruning; it is the opposite of splitting.
Regardless of its type, every decision tree is made up of:
▪ a root node
▪ branches
▪ leaf nodes
No matter what type of decision tree it is, it starts with a specific decision. This decision is depicted with a box, the root node.
Root and leaf nodes hold questions or some criteria you have to answer. Commonly, nodes appear as squares or circles. Squares depict decisions, while circles represent uncertain outcomes.
Branches are the lines that connect nodes, indicating the flow from question to answer.
Each node normally carries two or more nodes extending from it. If the leaf node results in the solution to the
decision, the line is left empty.
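To tie the terminology above together, here is a hypothetical sketch of a decision tree as a simple data structure; the class and field names are invented for illustration only.

# A hypothetical sketch of the terminology above: a node either asks a question
# (root or decision node) and has branches to child nodes, or it holds a final
# answer (leaf node). Class and field names are illustrative assumptions.
class Node:
    def __init__(self, question=None, branches=None, answer=None):
        self.question = question        # present on the root and decision nodes
        self.branches = branches or {}  # branch label -> child Node
        self.answer = answer            # present on leaf nodes only

    def is_leaf(self):
        return not self.branches

# A root node with two branches, each ending in a leaf node.
root = Node(question="Is it raining?",
            branches={"yes": Node(answer="Take an umbrella"),
                      "no": Node(answer="Leave the umbrella at home")})
print(root.question, "->", root.branches["yes"].answer)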
Steps for Creating Decision Trees:
1. Write the main decision.
Begin the decision tree by drawing a box (the root node) on one edge of your paper and write the main decision inside it.
Advantages:
▪ Helps you to make the best decisions and best guesses on the basis of the information you have.
▪ Helps you to see the difference between controlled and uncontrolled events.
▪ Helps you to estimate the likely results of one decision against another.
Disadvantages:
▪ The outcomes of decisions may be based mainly on your expectations. This can lead to unrealistic decision trees.
▪ The diagrams can narrow your focus to critical decisions and objectives.
Application of Decision Tree in Data Mining
A decision tree has a flowchart-like architecture built into its algorithm: it essentially follows an “If X then Y else Z” pattern whenever a split is made. Because this pattern mirrors human intuition, decision trees are used extensively for:
• Problems where the objective function is directly related to analyzing the data.
• Outlier analysis.
• Understanding the significant set of features for the entire dataset and “mining” the few that matter most.
• Churn analysis.
• Sentiment analysis.
Why are decision trees useful?
They enable us to analyze the possible consequences of a decision thoroughly.
They provide us with a framework to measure the values of outcomes and the probability of achieving them.
They help us to make the best decisions based on existing data and best guesses.
In other words, we can say that a decision tree is a hierarchical tree structure that can be used to split an extensive collection of records into smaller sets of classes by applying a sequence of simple decision rules. A decision tree model comprises a set of rules for partitioning a large heterogeneous population into smaller, more homogeneous, mutually exclusive classes. The attributes of the classes can be any type of variable, whether nominal, ordinal, binary, or quantitative; in contrast, the classes must be of a qualitative type, such as categorical, ordinal, or binary. In brief, given data consisting of attributes together with their classes, a decision tree creates a set of rules that can be used to identify the class. One rule is applied after another, resulting in a hierarchy of segments within segments. The hierarchy is known as the tree, and each segment is called a node. With each successive division, the members of the resulting sets become more and more similar to each other. Hence, the algorithm used to build a decision tree is referred to as recursive partitioning. The algorithm is known as CART (Classification and Regression Trees).
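The following is a minimal sketch of recursive partitioning, assuming the scikit-learn library is available: its DecisionTreeClassifier implements a CART-style algorithm, and export_text prints the resulting hierarchy of “if X then Y else Z” rules. The built-in iris data set is used only because it is readily available.

# A minimal sketch of CART-style recursive partitioning, assuming scikit-learn
# is available. The printed rules show the hierarchy of segments (nodes)
# produced by applying one simple decision rule after another.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # kept shallow for readability
tree.fit(data.data, data.target)

# Each split divides a segment into smaller, more homogeneous sub-segments.
print(export_text(tree, feature_names=list(data.feature_names)))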
Estimation is the process of finding an estimate, or approximation, which is a value that can be used for some purpose even if the input data are incomplete, uncertain, or unstable.
Estimation determines how much money, effort, resources, and time it will take to build a specific system or product. Estimation is typically based on past data and experience, available documents and knowledge, assumptions, and identified risks.
Clustering Methods
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
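As an illustration of the partitioning approach listed above, here is a minimal k-means sketch, assuming scikit-learn is available. The points are invented, and no class labels are supplied, which is what distinguishes clustering from classification.

# A minimal sketch of the partitioning method (k-means), assuming scikit-learn
# is available. No labels are given; the algorithm divides the points into
# subsets (clusters) whose members share similar features.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # centroid of each partition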
Data Visualization
Data visualization is a set of data points and information represented graphically to make it easy and quick for users to understand. A data visualization is good if it has a clear meaning and purpose and is easy to interpret without requiring extra context. Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data by using visual elements such as charts, graphs, and maps.
1. Numerical Data:
Numerical data is also known as quantitative data. It is any data that represents an amount, such as the height, weight, or age of a person. Numerical data is the easiest type of data to visualize, and visualizing it helps others digest large data sets and raw numbers in a way that makes them easier to act on. Numerical data is categorized into two categories:
• Continuous Data – Data that can take any value within a range (Example: height measurements).
• Discrete Data – Data that is not continuous and takes only countable values (Example: the number of cars or children a household has).
The visualization techniques used to represent numerical data are charts and numerical values. Examples are pie charts, bar charts, averages, scorecards, etc.
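As a minimal sketch of numerical data visualization, assuming the matplotlib library is available; the height values are invented purely to show a bar chart of a quantitative variable.

# A minimal sketch of numerical (quantitative) data visualization, assuming
# matplotlib is available. The heights are invented example values.
import matplotlib.pyplot as plt

people = ["A", "B", "C", "D"]
heights_cm = [172, 165, 180, 158]   # continuous numerical data

plt.bar(people, heights_cm)
plt.ylabel("Height (cm)")
plt.title("Bar chart of a numerical variable")
plt.show()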
2. Categorical Data:
Categorical data is also known as qualitative data. It is any data that represents groups. It consists of categorical variables that are used to represent characteristics such as a person’s ranking or gender. Categorical data visualization is all about depicting key themes, establishing connections, and lending context. Categorical data is classified into three categories:
• Binary Data – Classification with exactly two possible values (Example: agree or disagree).
• Nominal Data – Classification based on attributes with no inherent order (Example: male or female).
• Ordinal Data – Classification based on the ordering of the information (Example: a timeline or the stages of a process).
The visualization techniques used to represent categorical data are graphics, diagrams, and flowcharts. Examples are word clouds, sentiment mapping, Venn diagrams, etc.
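As a minimal sketch of categorical data visualization, assuming matplotlib is available; the survey responses are invented binary (agree/disagree) data, and a pie chart shows the share of each category.

# A minimal sketch of categorical (qualitative) data visualization, assuming
# matplotlib is available. The responses are invented binary data.
from collections import Counter
import matplotlib.pyplot as plt

responses = ["agree", "disagree", "agree", "agree", "disagree", "agree"]
counts = Counter(responses)

plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.0f%%")
plt.title("Share of each response category")
plt.show()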