Business Data Mining Week 11
CART Algorithm
Classification and Regression Trees (CART) is a decision tree algorithm that is used for both
classification and regression tasks. It is a supervised learning algorithm that learns from
labelled data to predict unseen data.
Tree structure: CART builds a tree-like structure consisting of nodes and branches. The
nodes represent different decision points, and the branches represent the possible
outcomes of those decisions. The leaf nodes in the tree contain a predicted class label or
value for the target variable.
Splitting criteria: CART uses a greedy approach to split the data at each node. It
evaluates all possible splits and selects the one that best reduces the impurity of the
resulting subsets. For classification tasks, CART uses Gini impurity as the splitting
criterion; the lower the Gini impurity, the purer the subset. For regression tasks, CART
uses the reduction in variance (equivalently, in the sum of squared residuals) as the
splitting criterion; the larger the reduction, the better the split fits the data.
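As a rough illustration (not part of the original notes), both criteria can be computed by hand; the helper functions and the sample values below are assumptions used only for demonstration:
import numpy as np
def gini_impurity(labels):
    # Gini impurity of a node: 1 minus the sum of squared class probabilities.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
def variance_reduction(parent, left, right):
    # Regression criterion: how much a candidate split reduces the weighted variance of the target.
    n = len(parent)
    weighted = len(left) / n * np.var(left) + len(right) / n * np.var(right)
    return np.var(parent) - weighted
print(gini_impurity(["yes", "yes", "no", "no"]))          # 0.5 (maximally impure binary node)
print(gini_impurity(["yes", "yes", "yes", "yes"]))        # 0.0 (pure node)
print(variance_reduction([1, 2, 8, 9], [1, 2], [8, 9]))   # large reduction -> good regression split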
Pruning: To prevent overfitting, pruning removes nodes that contribute little to model
accuracy. Cost complexity pruning and information gain pruning are two popular
techniques. Cost complexity pruning scores each subtree by its training error plus a
penalty proportional to its number of leaves, and collapses subtrees whose removal does
not appreciably worsen that score. Information gain pruning calculates the information
gain of each node's split and removes nodes whose gain is low.
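A minimal sketch of cost complexity pruning with scikit-learn's DecisionTreeClassifier is shown below; the dataset and the ccp_alpha value are illustrative assumptions rather than part of the notes:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Load a built-in dataset and hold out a test set.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# Fully grown tree vs. a tree pruned with a cost-complexity penalty (alpha).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)
print("unpruned leaves:", full_tree.get_n_leaves(), "test accuracy:", round(full_tree.score(X_te, y_te), 3))
print("pruned leaves:", pruned_tree.get_n_leaves(), "test accuracy:", round(pruned_tree.score(X_te, y_te), 3))
The pruned tree is usually much smaller and often generalizes at least as well as the fully grown one.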
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does this by
searching for the split that produces the most homogeneous sub-nodes, using the Gini index criterion.
Gini index/Gini impurity
The Gini index is the metric CART uses for classification tasks. It is computed from the sum of
squared class probabilities and measures how likely a randomly chosen element is to be wrongly
classified if it were labelled at random according to the class distribution in the node; it is a
variation of the Gini coefficient. In CART it is applied to categorical splits whose outcomes are
either "success" or "failure", and hence it conducts binary splitting only.
The degree of the Gini index varies from 0 to 1,
where 0 depicts that all the elements belong to a single class, i.e. the node is pure;
values near 1 signify that the elements are spread randomly across many classes; and
a value of 0.5 denotes that the elements are evenly distributed between two classes (the maximum
impurity for a binary problem).
Mathematically, we can write Gini impurity as Gini = 1 - Σ (p_i)^2, where p_i is the proportion of
samples belonging to class i in the node. For example, a node containing 4 apples and 2 grapes has
Gini = 1 - (4/6)^2 - (2/6)^2 ≈ 0.44.
The example below uses scikit-learn's DecisionTreeClassifier (a CART implementation) on a toy fruit dataset; the target labels and the new instance were not shown in the original notes and are assumed here so that they are consistent with the printed output below.
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
# Toy training data: each row is [color, size] of a fruit.
features = [
    ["red", "large"],
    ["green", "small"],
    ["red", "small"],
    ["yellow", "large"],
    ["green", "large"],
    ["orange", "large"],
]
# Assumed target labels, one per feature row (not given in the original notes).
target_variable = ["apple", "grape", "cherry", "banana", "pear", "orange"]
# Fit a single LabelEncoder over every categorical value (features and labels).
flattened_features = [value for row in features for value in row]
le = LabelEncoder()
le.fit(flattened_features + target_variable)
encoded_features = [le.transform(row) for row in features]
encoded_target = le.transform(target_variable)
# Train the CART classifier; DecisionTreeClassifier uses Gini impurity by default.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(encoded_features, encoded_target)
# Classify a new, unseen instance (assumed to be a red, large fruit).
new_instance = ["red", "large"]
encoded_new_instance = le.transform(new_instance)
predicted_fruit_type = clf.predict([encoded_new_instance])
decoded_predicted_fruit_type = le.inverse_transform(predicted_fruit_type)
print("Predicted fruit type:", decoded_predicted_fruit_type[0])
Output:
Predicted fruit type: apple
Advantages of CART
Results are simple to interpret.
Classification and regression trees are nonparametric and can model nonlinear relationships.
Classification and regression trees implicitly perform feature selection.
Outliers have little effect on the resulting tree.
CART requires relatively little data preparation and produces easy-to-understand models.
Limitations of CART
Overfitting.
High variance (and correspondingly low bias).
The tree structure may be unstable: small changes in the training data can produce a very different tree.
Applications in Business:
1. Customer Segmentation: By classifying customers into distinct groups based on
behaviors and demographics, businesses can tailor marketing strategies effectively.
2. Credit Risk Assessment: Financial institutions use CART to classify loan applicants into
risk categories, aiding in decision-making for loan approvals.
3. Churn Prediction: Companies predict which customers are likely to leave, allowing
proactive retention efforts.
Advantages:
Interpretability: Decision trees are easy to understand and interpret, making it simpler to
explain decisions to stakeholders.
Versatility: Can handle both categorical and numerical data.
Limitations:
Overfitting: Trees can become overly complex and fit the noise in the training data.
Pruning and cross-validation are necessary to mitigate this.
Bias: split criteria can be biased toward attributes with many distinct values.
******************
As we are all aware, technology is growing day by day and a large amount of data is produced
every second, so analyzing data has become very important: it helps us in fraud detection,
identifying spam e-mail, and much more. Data mining exists to help us find hidden patterns and
discover knowledge in large datasets. Artificial Neural Networks (ANNs) assimilate data in the
way the human brain processes information: in the brain, neurons process information in the
form of electrical signals.
In the same way, an ANN receives information through several processors that operate in
parallel and are arranged in tiers. The raw data is received by the first tier and processed
through interconnected nodes, each with its own rules and accumulated knowledge. Each
processor passes its output on to the next tier, so every successive tier receives the output
of its predecessor and the raw data is not reprocessed at every stage. Neural networks modify
themselves as they learn from additional information, and each link between nodes is associated
with a weight. Inputs arriving over links with higher weights are given preference: the higher
the weight of a unit, the more influence it has on another. Adjusting these weights to reduce
prediction error is done with a gradient descent algorithm.
Neural Network:
A neural network is an information-processing paradigm inspired by the human nervous system.
Just as the human nervous system has biological neurons, neural networks have artificial
neurons, which are mathematical functions modeled on biological neurons. The human brain is
estimated to have around 10 billion neurons, each connected on average to 10,000 other neurons,
and each neuron receives signals through synapses that control the effect of the signal on the
neuron.
=> The function of the summing junction of an artificial neuron is to collect the weighted
inputs and sum them up:
Yin = X1*W1 + X2*W2 + … + Xn*Wn
=> The output of the summing junction may sometimes be equal to zero; to prevent such a
situation, a bias of fixed value B0 is added to it:
Yin = X1*W1 + X2*W2 + … + Xn*Wn + B0
=> The output Y of a neuron largely depends on its Activation Function (also known as
transfer function).
=> Different types of activation functions are in use, such as:
1. Identity Function
2. Binary Step Function With Threshold
3. Bipolar Step Function With Threshold
4. Binary Sigmoid Function
5. Bipolar Sigmoid Function
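To make the pieces above concrete, here is a minimal sketch of a single artificial neuron that combines the summing junction, the bias B0, and the binary sigmoid activation from the list; the input values, weights, and bias are made-up numbers used only for illustration:
import numpy as np
def binary_sigmoid(y_in):
    # Binary sigmoid activation: squashes the net input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-y_in))
def artificial_neuron(x, w, b0):
    # Summing junction: Yin = X1*W1 + ... + Xn*Wn + B0
    y_in = np.dot(x, w) + b0
    # Activation (transfer) function applied to the net input.
    return binary_sigmoid(y_in)
x = np.array([0.5, 1.0, -0.5])   # inputs X1..X3 (illustrative)
w = np.array([0.4, 0.3, 0.9])    # weights W1..W3 (illustrative)
b0 = 0.1                         # bias B0 (illustrative)
print(artificial_neuron(x, w, b0))  # a single output between 0 and 1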
Self-Organizing Neural Network: A Self-Organizing Neural Network (SONN) is a type of
artificial neural network that is trained using competitive learning rather than the
error-correction learning (e.g., backpropagation with gradient descent) used by other
artificial neural networks. A SONN is an unsupervised learning model, also termed a
Self-Organizing Feature Map or Kohonen Map. It is used to produce a low-dimensional
(typically two-dimensional) representation of a higher-dimensional data set while
preserving the topological structure of the data.
ANNs have the ability to learn and model non-linear relationships. Unlike other prediction
techniques, they do not impose restrictions on input variables.
Here’s how industries and organizations apply neural networks to gain an advantage:
1. Forecasting of Data
Traditional forecasting models are limited in the data they can handle, and forecasting
problems are complex. Applied correctly, ANNs can forecast without such limitations, because
their modeling ability captures relationships and extracts features that are not immediately
visible.
This helps users make more informed decisions. ANNs can carry out business tasks with
structured data, ranging from tracking and documenting real-time communications to finding
new leads or potential customers.
As a matter of fact, until recently decision-makers relied on data extracted from organized
data sets. Even though such data are easier to analyze, they do not offer the deeper insight
that unstructured data does.
Neural networks can also provide insight into the ‘why’ of a particular customer’s behavior.
Let’s take a look at real-life examples of artificial neural networks’ applications in Data
Mining:
1. Healthcare
A neural network analyzed 100,000 records of patients in the Intensive Care Unit (ICU) and
learned to apply that experience to recommend the ideal course of treatment. 99% of these
recommendations matched, and sometimes improved on, a doctor’s decision.
2. Social Media
LinkedIn, a business- and employment-oriented website, uses neural networks to pick up spam
or abusive content. LinkedIn also uses them to understand the kinds of content shared, so it
can build better recommendation and search features for its members.
Overview
Artificial Neural Networks (ANNs) are computational models inspired by the human brain.
They consist of interconnected layers of nodes (neurons) where each connection has a weight.
ANNs are capable of learning complex patterns through training processes involving
backpropagation and gradient descent.
Learning Process:
1. Initialization: Randomly initialize weights and biases.
2. Forward Propagation: Calculate the output of the network.
3. Loss Calculation: Compute the error using a loss function (e.g., mean squared error for
regression, cross-entropy for classification).
4. Backpropagation: Calculate the gradient of the loss function with respect to each weight
and bias.
5. Gradient Descent: Update the weights and biases to minimize the error.
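The five steps above can be sketched end to end for a tiny one-hidden-layer network; the synthetic data, layer sizes, learning rate, and number of epochs are illustrative assumptions, and mean squared error is used as the loss:
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # 100 samples, 3 features (synthetic)
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)   # toy binary target
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
# 1. Initialization: random weights, zero biases.
W1, b1 = rng.normal(scale=0.5, size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros((1, 1))
lr = 0.5
for epoch in range(500):
    # 2. Forward propagation.
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # 3. Loss calculation (mean squared error).
    loss = np.mean((y_hat - y) ** 2)
    # 4. Backpropagation: gradients of the loss w.r.t. each weight and bias.
    d_out = 2 * (y_hat - y) / len(X) * y_hat * (1 - y_hat)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0, keepdims=True)
    d_hid = d_out @ W2.T * h * (1 - h)
    dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0, keepdims=True)
    # 5. Gradient descent: update weights and biases to reduce the loss.
    W1, b1 = W1 - lr * dW1, b1 - lr * db1
    W2, b2 = W2 - lr * dW2, b2 - lr * db2
print("final training loss:", round(float(loss), 4))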
Applications in Business:
1. Sales Forecasting: ANNs can model complex relationships between sales data and various
factors such as seasonality, promotions, and economic indicators to predict future sales.
2. Fraud Detection: In financial services, ANNs can learn to identify patterns indicative of
fraudulent activity.
3. Customer Sentiment Analysis: Analyzing customer reviews and feedback to gauge
sentiment and improve customer service.
Advantages:
Accuracy: ANNs can model complex, non-linear relationships and often achieve high
accuracy.
Flexibility: Can be applied to a wide range of problems from image recognition to natural
language processing.
Limitations:
Interpretability: ANNs are often considered "black boxes" due to their complexity,
making it hard to interpret the decision process.
Data Requirement: Require large amounts of data to train effectively.
Computationally Intensive: Training ANNs can be resource-intensive, requiring
significant computational power.
Effective Application in Business Data Mining
1. Combining Techniques:
o Hybrid Models: Using CART for initial feature selection and understanding, followed by
ANNs for complex pattern recognition (a sketch of this idea follows after this list).
o Ensemble Methods: Combining multiple CART models or ANNs to improve robustness
and accuracy.
2. Real-World Implementation:
o Customer Insights: Businesses can use CART to segment customers initially and then
apply ANNs to predict customer behavior within each segment.
o Operational Efficiency: Predictive maintenance using ANNs can forecast equipment
failures, allowing businesses to schedule timely maintenance and avoid downtime.
3. Addressing Limitations:
o Overfitting in CART: Apply pruning techniques and cross-validation to create more
generalizable models.
o Interpretability of ANNs: Use techniques like SHAP (SHapley Additive exPlanations) or
LIME (Local Interpretable Model-agnostic Explanations) to interpret ANN predictions.
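As a sketch of the hybrid-model idea above (CART for feature selection, then an ANN for prediction), the example below uses scikit-learn on synthetic data; the dataset, the number of selected features, and the network size are assumptions made only for illustration:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
# Synthetic business-style dataset: 20 features, only a few of them informative.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# Step 1: fit a (lightly pruned) CART model and keep its most important features.
cart = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)
top = cart.feature_importances_.argsort()[::-1][:5]
# Step 2: train a small neural network on the CART-selected features only.
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
ann.fit(X_tr[:, top], y_tr)
print("hybrid test accuracy:", round(ann.score(X_te[:, top], y_te), 3))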
********************************
Conclusion
Both CART and ANNs are potent tools in business data mining, each with unique strengths.
CART's interpretability and simplicity make it ideal for initial data exploration and feature
selection. In contrast, ANNs excel in capturing complex patterns and making accurate
predictions, albeit with higher computational demands and lower interpretability. By
leveraging these techniques effectively, businesses can gain valuable insights, optimize
operations, and make data-driven decisions.