Business Analytics
DATA PREPROCESSING
Data preprocessing involves cleaning, transforming, and structuring data before analysis or modelling.
• Handling Missing Data: Impute missing values (e.g., mean imputation, forward-fill)
or remove rows/columns with excessive missing data.
• Encoding Categorical Data: Convert categorical variables into numerical forms (e.g.,
one-hot encoding or label encoding).
• Outlier Detection: Remove or handle extreme values that might skew results.
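A minimal pandas sketch of these three steps, on a small hypothetical dataset (the column names and values are illustrative only):
```python
import pandas as pd

# Hypothetical dataset with a missing value and an extreme outlier
df = pd.DataFrame({"age": [25, 30, None, 28, 120],
                   "city": ["NY", "LA", "NY", None, "LA"]})

# Mean imputation for a numerical column
df["age"] = df["age"].fillna(df["age"].mean())

# Forward-fill for a categorical column
df["city"] = df["city"].ffill()

# Flag outliers with the IQR rule: keep values in [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```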
VARIABLE CONVERSION
• Numerical to Categorical: Bin continuous variables into categories (e.g., age into age
groups).
• Datetime to Useful Features: Extract components like day, month, year, or calculate
time intervals.
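A short pandas sketch of both conversions, using hypothetical `age` and `signup` columns:
```python
import pandas as pd

df = pd.DataFrame({"age": [23, 45, 67],
                   "signup": pd.to_datetime(["2023-01-15", "2023-06-01", "2024-02-10"])})

# Numerical to categorical: bin ages into labelled groups
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 100],
                         labels=["young", "middle", "senior"])

# Datetime to useful features: extract a component and compute an interval
df["signup_month"] = df["signup"].dt.month
df["days_since_signup"] = (pd.Timestamp("2024-06-01") - df["signup"]).dt.days
```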
Example tools:
• Label Encoding: Replace categories with a unique integer (e.g., "Male"=0,
"Female"=1).
• One-Hot Encoding: Create binary variables for each category.
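Both encodings in a minimal pandas sketch (the mapping dictionary is one common way to label-encode; scikit-learn's LabelEncoder is another):
```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Female"]})

# Label encoding: map each category to a unique integer
df["gender_label"] = df["gender"].map({"Male": 0, "Female": 1})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["gender"], prefix="gender")
```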
PERFORMANCE METRICS
Performance metrics assess how well a model performs. Different metrics are used
depending on the task (regression, classification, etc.):
Classification Models:
• Accuracy: (TP+TN)/(TP+TN+FP+FN)
• ROC-AUC: Measures the trade-off between the True Positive Rate and False Positive
Rate.
Regression Models:
• Mean Squared Error (MSE): Average of the squared differences between predictions
and actual values.
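A minimal scikit-learn sketch of these metrics on toy labels (the numbers are illustrative only; ROC-AUC needs predicted probabilities, not hard labels):
```python
from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error

# Classification: accuracy from labels, ROC-AUC from predicted probabilities
y_true, y_pred, y_prob = [1, 0, 1, 1], [1, 0, 0, 1], [0.9, 0.2, 0.4, 0.8]
print(accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN) = 0.75
print(roc_auc_score(y_true, y_prob))

# Regression: MSE between predictions and actual values
print(mean_squared_error([3.0, 2.5, 4.0], [2.8, 2.7, 3.5]))
```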
These steps are critical in preparing datasets and improving model performance, ensuring
accuracy and reliability in analysis.
DATA REDUCTION
Data reduction involves transforming data into a smaller volume while maintaining its
integrity. Common techniques include:
1. Dimensionality Reduction:
Reducing the number of features while preserving most of the information, e.g., with
Principal Component Analysis (PCA).
2. Feature Selection:
Selecting a subset of relevant features based on statistical tests, model performance, or
domain knowledge.
3. Aggregation:
Summarizing data (e.g., calculating averages) to reduce the size of datasets, especially in
time series data.
4. Sampling:
Randomly selecting a representative subset of data for analysis, reducing the dataset's size.
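A minimal pandas sketch of aggregation and sampling on a hypothetical sales table:
```python
import pandas as pd

df = pd.DataFrame({"store": ["A", "A", "B", "B"], "sales": [100, 120, 90, 95]})

# Aggregation: summarize per group to shrink the dataset
avg_sales = df.groupby("store")["sales"].mean()

# Sampling: keep a random 50% subset (fixed seed for reproducibility)
subset = df.sample(frac=0.5, random_state=42)
```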
MISSING DATA
Missing data can arise from various sources, such as data entry errors or non-responses.
Handling missing data is crucial for analysis:
Imputation: Replacing missing values using methods like mean/mode/median imputation,
interpolation, or predictive modelling.
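A short scikit-learn sketch of imputation, replacing missing entries in a toy array with the column median:
```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Replace missing values with the median of each column
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)
```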
OVERLAPPING DATA
Overlapping data occurs when datasets share similar or identical records, leading to
redundancy. Deduplication (removing or merging the repeated records) resolves this.
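A minimal pandas sketch of deduplication on a hypothetical table:
```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["Ann", "Bob", "Bob", "Cy"]})

# Drop exact duplicate rows, keeping the first occurrence
deduped = df.drop_duplicates()

# Or deduplicate on a key column when records overlap only partially
deduped_by_id = df.drop_duplicates(subset="id", keep="first")
```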
OVERFITTING
Overfitting happens when a model learns the training data too well, including noise and
outliers, leading to poor generalization on new data. Techniques to mitigate overfitting
include cross-validation, regularization, early stopping, and pruning (for decision trees).
OUTLIERS
Outliers are data points that deviate significantly from other observations, which can skew
results. They can be detected with statistical rules (e.g., z-scores or the interquartile range)
and then removed, capped, or transformed.
TYPES OF DATA
Data is broadly numerical (continuous or discrete) or categorical (nominal or ordinal); the
type determines which preprocessing, encoding, and modelling techniques apply.
DATA PARTITIONING
Data partitioning involves dividing a dataset into subsets for various purposes:
1. Training and Testing: In machine learning, datasets are typically split into a training set
(for model training) and a test set (for model evaluation). A common split ratio is 70:30 or
80:20.
2. Cross-Validation: Further partitions the training set into several smaller sets to validate
the model's performance. Common methods include k-fold cross-validation, where the
dataset is divided into k subsets.
3. Stratified Sampling: Ensures that each partition reflects the overall distribution of the
dataset, especially useful in imbalanced datasets.
4. Temporal Partitioning: Dividing time series data based on time, ensuring that the training
set only contains data before a certain point in time, and the testing set contains data after.
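A scikit-learn sketch covering a stratified hold-out split, k-fold cross-validation, and a simple temporal cutoff, on hypothetical data:
```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X, y = np.arange(20).reshape(10, 2), np.array([0, 1] * 5)

# 1./3. Hold-out split (70:30), stratified to preserve class balance
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# 2. k-fold cross-validation on the training portion
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X_tr):
    pass  # fit on X_tr[train_idx], validate on X_tr[val_idx]

# 4. Temporal partitioning: train strictly before a cutoff index
cutoff = 7
X_train_time, X_test_time = X[:cutoff], X[cutoff:]
```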
MULTIDIMENSIONAL VISUALIZATION
1. Scatter Plots: Used to visualize relationships between two variables. For more dimensions,
color, size, or shape can represent additional variables.
2. 3D Plots: Extend scatter plots into three dimensions, allowing visualization of three
variables simultaneously.
3. Heat Maps: Represent data values through color gradients in a two-dimensional space,
useful for visualizing correlations and densities.
4. Parallel Coordinates: Used for visualizing high-dimensional datasets by representing each
dimension as a vertical axis and plotting lines for each data point.
5. Radial or Spider Charts: Show multivariate data in a circular layout, where each axis
represents a variable.
6. t-SNE (t-Distributed Stochastic Neighbor Embedding): A dimensionality reduction
technique often used for visualizing high-dimensional data in two or three dimensions.
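A minimal matplotlib sketch of technique 1, using colour and marker size to encode a third and fourth variable on random hypothetical data:
```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(size=50)
group = rng.integers(0, 3, size=50)   # third variable -> colour
size = rng.uniform(20, 200, size=50)  # fourth variable -> marker size

plt.scatter(x, y, c=group, s=size, alpha=0.6)
plt.xlabel("x"); plt.ylabel("y")
plt.show()
```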
PRINCIPAL COMPONENT ANALYSIS (PCA)
PCA is a dimensionality reduction technique that projects data onto a smaller set of
orthogonal components chosen to capture as much variance as possible.
Applications: PCA is widely used in exploratory data analysis, noise reduction, data
visualization, and preprocessing for machine learning algorithms.
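A minimal scikit-learn sketch, projecting random hypothetical data onto two components:
```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)  # 100 samples, 10 features

# Project onto the first two principal components for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```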
CLASSIFICATION
Classification is a supervised learning task that involves predicting the categorical class labels
of new instances based on past observations.
1. Logistic Regression: Models the probability of a binary outcome based on one or more
predictor variables.
2. Decision Trees: A tree-like model that makes decisions based on feature values, splitting
the data at each node to improve purity.
3. Support Vector Machines (SVM): Finds the hyperplane that best separates different
classes in high-dimensional space.
4. K-Nearest Neighbors (KNN): Classifies instances based on the majority class of their
nearest neighbors in the feature space.
5. Random Forest: An ensemble method that uses multiple decision trees to improve
accuracy and reduce overfitting.
6. Neural Networks: Computational models that mimic the human brain, capable of
capturing complex patterns in data.
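A short scikit-learn sketch fitting two of the classifiers above on the iris dataset, as one way to compare them:
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, model.score(X_te, y_te))  # test accuracy
```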
MISCLASSIFICATION
Misclassification occurs when a classification model incorrectly predicts the class label for a
given instance. This can happen due to various factors, including model bias, data quality,
and the complexity of the underlying patterns.
Types of Misclassification:
1. False Positive (Type I Error): Predicting a positive class when the actual class is negative
(e.g., classifying a benign tumor as malignant).
2. False Negative (Type II Error): Predicting a negative class when the actual class is positive
(e.g., classifying a malignant tumor as benign).
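A minimal scikit-learn sketch counting both error types from a confusion matrix on toy labels:
```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

# Rows = actual, columns = predicted: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"False positives (Type I): {fp}, false negatives (Type II): {fn}")
```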
K-NEAREST NEIGHBORS (k-NN)
k-NN classifies a new instance by the majority class of its k nearest training instances.
Key Characteristics:
• Instance-based Learning: k-NN does not explicitly learn a model but instead stores
training instances for reference during classification.
• Distance Metric: The algorithm typically uses Euclidean distance to measure the
similarity between instances, but other distance metrics (like Manhattan or
Minkowski) can also be used.
Steps:
1. Choose the number of neighbors (k): The user selects how many neighbors to
consider for classification.
2. Calculate distances: For a new instance, calculate its distance to all training
instances.
3. Select neighbors: Identify the k closest neighbors based on the calculated distances.
4. Vote for class: The predicted class is determined by a majority vote among the k
neighbors (for classification) or by averaging their values (for regression).
Advantages:
• Simple to implement, requires no explicit training phase, and adapts naturally to
multi-class problems.
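A from-scratch sketch of the four steps above on hypothetical 2-D points:
```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: Euclidean distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k closest neighbours
    nearest = np.argsort(dists)[:k]
    # Step 4: majority vote among the neighbours' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # -> "A"
```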
NAÏVE BAYES
Naive Bayes is a family of probabilistic algorithms based on Bayes’ theorem, used primarily
for classification tasks. The "naive" aspect refers to the assumption that features are
independent given the class label.
Key Characteristics:
• Probabilistic Model: It calculates the posterior probability of each class given the
input features and selects the class with the highest probability.
• Independence Assumption: Assumes that all features contribute independently to
the probability of a class.
Types:
1. Gaussian Naive Bayes: Assumes that features follow a Gaussian (normal)
distribution.
2. Multinomial Naive Bayes: Suitable for discrete data, especially for text classification.
3. Bernoulli Naive Bayes: Suitable for binary/boolean features.
Steps:
1. Compute the prior probability of each class from the training data.
2. Estimate the likelihood of each feature value given each class.
3. Apply Bayes' theorem, P(class | features) ∝ P(class) × Π P(featureᵢ | class), and
predict the class with the highest posterior.
Advantages:
• Fast to train and predict, works well even with small training sets, and scales to
high-dimensional data such as text.
Disadvantages:
• The independence assumption may not hold in practice, leading to less accurate
predictions.
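A minimal scikit-learn sketch of Gaussian Naive Bayes (type 1 above) on the iris dataset:
```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Gaussian Naive Bayes: fits a normal distribution per feature per class
model = GaussianNB()
model.fit(X, y)
print(model.predict(X[:3]))        # most probable class per sample
print(model.predict_proba(X[:3]))  # posterior probability of each class
```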
PRUNING
Pruning reduces the size of a decision tree by removing branches that add little predictive
power.
Types of Pruning:
1. Pre-pruning (Early Stopping): Stops the growth of the tree before it fully develops, based
on criteria like minimum sample size or maximum depth.
2. Post-pruning: Involves allowing the tree to grow fully and then removing nodes that do
not provide significant predictive power.
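A short scikit-learn sketch of both styles: pre-pruning via growth limits, and post-pruning via minimal cost-complexity pruning (sklearn's ccp_alpha, applied when the tree is fit; the 0.02 value is an illustrative choice):
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: cap depth and minimum samples per split while growing
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=10).fit(X, y)

# Post-pruning: grow fully, then prune with cost-complexity parameter ccp_alpha
post = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)
print(pre.get_depth(), post.get_depth())
```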
Benefits: Reduces overfitting, simplifies the model, and improves generalization to new data.
CLASSIFICATION TREES
Key Characteristics:
• Splitting Criterion: Nodes are split based on metrics like Gini impurity, entropy
(information gain), or classification error.
• Binary Splits: Typically, splits are binary (two branches), though multi-way splits can
occur.
Steps: Choose the split that most improves node purity, recursively split the child nodes
until a stopping criterion is met, and assign each leaf the majority class of its samples.
Applications: Used in various domains for tasks like fraud detection, medical diagnosis, and
customer segmentation.
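A small sketch of the splitting criteria named above, computing Gini impurity and entropy for a toy set of labels:
```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum of p * log2(p) over class proportions p
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]))  # 0.5, 1.0
```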
REGRESSION TREES
Regression trees are similar to classification trees but are used for predicting continuous
outcomes instead of categorical ones.
Key Characteristics:
• Splitting Criterion: The splits are made based on minimizing the variance or mean
squared error (MSE) of the target variable within each node.
• Leaf Nodes: Each leaf node represents the predicted value for the target variable.
Steps: Choose splits that most reduce the variance (or MSE) of the target within the
resulting nodes, split recursively until a stopping criterion is met, and predict with the
mean target value of each leaf.
Applications: Commonly used in predicting prices, sales forecasting, and other scenarios
where the target variable is continuous.
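A minimal scikit-learn sketch on hypothetical price data (the size/price values are illustrative only):
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: predict price from size
X = np.array([[50], [60], [80], [100], [120]])
y = np.array([150.0, 180.0, 240.0, 310.0, 360.0])

# Splits minimize squared error; each leaf predicts the mean of its samples
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[90]]))
```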
ARTIFICIAL NEURAL NETWORKS (ANN)
Overview: Artificial Neural Networks are computational models inspired by the biological
neural networks that constitute animal brains. They are used in various tasks, including
classification, regression, and pattern recognition.
Key Components:
1. Neurons: Basic units of the network that receive input, process it, and produce output.
Each neuron has an activation function that determines its output based on the input.
2. Layers:
• Input Layer: The first layer that receives the input data.
• Hidden Layers: Intermediate layers that process inputs from the previous layer. There
can be one or more hidden layers.
• Output Layer: The final layer that produces the output of the network.
3. Weights and Biases: Each connection between neurons has a weight that adjusts during
training. Biases are additional parameters added to the neuron output to improve the
model’s flexibility.
Training Process:
1. Forward Propagation: Input data is passed through the network, and the output is
computed.
2. Loss Calculation: The difference between the predicted output and the actual output
is calculated using a loss function (e.g., mean squared error for regression or cross-
entropy for classification).
3. Backpropagation: The error is propagated back through the network to update the
weights and biases using optimization algorithms (e.g., Stochastic Gradient Descent,
Adam).
➢ Activation Functions: Functions that introduce non-linearity into the model, allowing
it to learn complex patterns. Common activation functions include sigmoid, tanh, and ReLU.
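A minimal numpy sketch of one training loop: forward propagation, MSE loss, and backpropagation, using plain full-batch gradient descent rather than SGD or Adam, on hypothetical data:
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                  # 8 samples, 2 features
y = (X[:, :1] + X[:, 1:] > 0).astype(float)  # toy binary target, shape (8, 1)

# One hidden layer, sigmoid activations throughout
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(500):
    # 1. Forward propagation
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 2. Loss calculation (mean squared error)
    loss = np.mean((out - y) ** 2)
    # 3. Backpropagation: chain rule from the loss back to each weight
    d_out = 2 * (out - y) / len(X) * out * (1 - out)
    d_h = d_out @ W2.T * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;   b1 -= 0.5 * d_h.sum(axis=0)
print(loss)
```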
Applications: Image and speech recognition, natural language processing, time-series
forecasting, and other tasks involving complex patterns.
DISCRIMINANT ANALYSIS
Discriminant Analysis is a statistical technique used for classifying a set of observations into
predefined classes. The goal is to find a combination of predictor variables that best
separates the classes.
Types:
1. Linear Discriminant Analysis (LDA): Assumes all classes share a common covariance
matrix, producing linear decision boundaries.
2. Quadratic Discriminant Analysis (QDA): Allows each class its own covariance matrix,
producing quadratic boundaries.
Steps:
1. Compute the means and covariances: Calculate the mean vectors and covariance
matrices for each class.
2. Calculate the discriminant functions: Formulate functions based on the linear
combinations of the input features that separate the classes.
3. Classification: For a new observation, calculate its discriminant function values for each
class and assign it to the class with the highest value.
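A minimal scikit-learn sketch of these steps using LDA on the iris dataset:
```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Fits class means and a shared covariance, then classifies each
# observation by its highest discriminant score
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict(X[:3]))
```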
Applications: Credit scoring, medical diagnosis, and other classification problems with
reasonably well-separated classes.
Both Artificial Neural Networks and Discriminant Analysis are essential techniques in
machine learning and statistics, each with unique strengths and applications. ANNs excel in
complex and high-dimensional datasets, while discriminant analysis is valuable for problems
with clear class separations and underlying statistical assumptions.