Business Analytics

The document outlines essential data preparation techniques, including handling missing data, normalization, encoding categorical data, and feature engineering, which are crucial for effective analysis and modelling. It also discusses performance metrics for classification and regression models, data reduction techniques, and methods to address issues like overfitting and outliers. Additionally, it covers various classification algorithms, including decision trees and neural networks, and emphasizes the importance of data partitioning and visualization in data analysis.

DATA PREPARATION

Data preparation involves cleaning, transforming, and structuring data before analysis or modelling.

• Handling Missing Data: Impute missing values (e.g., mean imputation, forward-fill)
or remove rows/columns with excessive missing data.

• Normalization: Scale numeric values to a standard range (e.g., min-max scaling or z-score normalization).

• Encoding Categorical Data: Convert categorical variables into numerical forms (e.g.,
one-hot encoding or label encoding).

• Feature Selection/Engineering: Identify important features or create new ones that improve model performance.

• Outlier Detection: Remove or handle extreme values that might skew results.

VARIABLE CONVERSION

• Numerical to Categorical: Bin continuous variables into categories (e.g., age into age
groups).

• Categorical to Numerical: Convert string-based categories into numerical formats (e.g., converting "High/Medium/Low" into 3/2/1).

• Datetime to Useful Features: Extract components like day, month, year, or calculate
time intervals.

• Log Transformation: Apply a log transform to reduce skewness in data with exponential growth (e.g., income, population).

Example tools:
• Label Encoding: Replace categories with a unique integer (e.g., "Male"=0,
"Female"=1).
• One-Hot Encoding: Create binary variables for each category.
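
For illustration, a minimal pandas sketch of these conversions, assuming a small hypothetical table (the column names age, signup_date, income, gender, and priority are invented for the example):

import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "age": [23, 45, 31, 67],
    "signup_date": pd.to_datetime(["2023-01-05", "2023-03-20", "2023-07-11", "2023-11-02"]),
    "income": [32000, 58000, 44000, 120000],
    "gender": ["Male", "Female", "Female", "Male"],
    "priority": ["Low", "High", "Medium", "High"],
})

# Numerical to categorical: bin age into groups
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"])

# Datetime to useful features: extract month and year
df["signup_month"] = df["signup_date"].dt.month
df["signup_year"] = df["signup_date"].dt.year

# Log transformation to reduce skewness
df["log_income"] = np.log1p(df["income"])

# Label encoding: map ordered categories to integers
df["priority_code"] = df["priority"].map({"Low": 1, "Medium": 2, "High": 3})

# One-hot encoding: binary indicator columns for each category
df = pd.get_dummies(df, columns=["gender"])

print(df.head())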

PERFORMANCE METRICS

Performance metrics assess how well a model performs. Depending on the task (regression, classification, etc.), different metrics are used:

Classification Models:

• Accuracy: (TP+TN)/(TP+TN+FP+FN)

• Precision: TP / (TP + FP)

• Recall: TP / (TP + FN)

• F1-Score: Harmonic mean of precision and recall.

• ROC-AUC: Measures the trade-off between the True Positive Rate and False Positive
Rate.
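
As a quick sketch, these classification metrics can be computed with scikit-learn; the labels and probabilities below are hypothetical:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical true labels, predicted labels, and predicted probabilities
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("Precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("F1-score :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))    # uses scores/probabilities, not hard labels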

Regression Models:

• Mean Absolute Error (MAE): Average of the absolute differences between predictions and actual values.

• Mean Squared Error (MSE): Average of the squared differences between predictions
and actual values.

• R-squared (R²): Proportion of variance explained by the model.


• Root Mean Squared Error (RMSE): Square root of MSE, often used to interpret errors
on the same scale as the data.
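
A minimal scikit-learn sketch of the regression metrics, using hypothetical actual and predicted values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = [3.0, 5.5, 2.1, 7.8, 4.4]
y_pred = [2.8, 5.9, 2.5, 7.1, 4.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)              # same scale as the target variable
r2 = r2_score(y_true, y_pred)    # proportion of variance explained

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")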

These steps are critical in preparing datasets and improving model performance, ensuring
accuracy and reliability in analysis.

DATA REDUCTION TECHNIQUES

Data reduction involves transforming data into a smaller volume while maintaining its
integrity. Common techniques include:

1. Dimensionality Reduction:

• Principal Component Analysis (PCA): Reduces dimensions by transforming to a new set of variables (principal components) that capture the most variance.
• t-Distributed Stochastic Neighbor Embedding (t-SNE): Focuses on maintaining local
structures while reducing dimensions.

2. Feature Selection:
Selecting a subset of relevant features based on statistical tests, model performance, or
domain knowledge.

3. Aggregation:
Summarizing data (e.g., calculating averages) to reduce the size of datasets, especially in
time series data.

4. Sampling:
Randomly selecting a representative subset of data for analysis, reducing the dataset's size.
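
For aggregation and sampling, a short pandas sketch using a hypothetical daily sales table generated for the example:

import numpy as np
import pandas as pd

# Hypothetical daily sales data for one year
rng = pd.date_range("2024-01-01", periods=365, freq="D")
sales = pd.DataFrame({"date": rng,
                      "revenue": np.random.default_rng(0).normal(1000, 100, len(rng))})

# Aggregation: collapse daily records to monthly averages
monthly = sales.groupby(sales["date"].dt.to_period("M"))["revenue"].mean()

# Sampling: keep a random 10% of the rows
sample = sales.sample(frac=0.1, random_state=42)

print(monthly.head())
print(len(sales), "rows reduced to", len(sample), "rows")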

MISSING DATA
Missing data can arise from various sources, such as data entry errors or non-responses.
Handling missing data is crucial for analysis:
• Imputation: Replacing missing values using methods like mean/mode/median imputation, interpolation, or predictive modelling.
• Deletion: Removing records with missing values (listwise or pairwise deletion).
• Use of Models: Some algorithms can handle missing values intrinsically, such as tree-based methods.
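
A minimal pandas sketch of imputation and deletion, assuming a small hypothetical table with invented column names:

import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Pune"],
    "sales": [200, 220, np.nan, 260, 240],
})

# Imputation: mean for the numeric column, mode for the categorical column
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward-fill, useful for ordered or time-series data
df["sales"] = df["sales"].ffill()

# Deletion: drop any rows that still contain missing values (listwise deletion)
df_clean = df.dropna()
print(df_clean)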

OVERLAPPING DATA

Overlapping data occurs when datasets share similar or identical records, leading to
redundancy:

• Deduplication: Identifying and removing duplicate records.
• Integration Techniques: Merging datasets carefully to avoid overlap and ensure data quality.

OVERFITTING

Overfitting happens when a model learns the training data too well, including noise and
outliers, leading to poor generalization on new data. Techniques to mitigate overfitting
include:

• Regularization: Applying penalties (L1 or L2 regularization) to the loss function to discourage complexity.
• Cross-Validation: Using techniques like k-fold cross-validation to validate the model on different subsets of the data.
• Pruning: For decision trees, removing branches that have little importance.

OUTLIERS

Outliers are data points that deviate significantly from other observations, which can skew
results:

• Identification: Use statistical methods (Z-scores, IQR) to identify outliers.
• Handling: Options include removing, transforming, or capping outliers based on their impact on analysis.
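
A short pandas sketch of both identification methods and capping, using a small hypothetical series with one extreme value:

import numpy as np
import pandas as pd

# Hypothetical data with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 95, 11, 12])

# Z-score method: flag points far from the mean (a threshold of 2 or 3 is common)
z = (s - s.mean()) / s.std()
z_outliers = s[np.abs(z) > 2]

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Handling: cap extreme values at the IQR fences instead of dropping them
capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(z_outliers.tolist(), iqr_outliers.tolist())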
DATA NORMALIZATION
Data normalization scales numerical data to a common range without distorting differences
in the ranges of values:

• Min-Max Scaling: Rescales features to a range of [0, 1].
• Z-score Normalization: Centers the data around the mean with a standard deviation of 1.
• Robust Scaling: Uses median and IQR, making it less sensitive to outliers.
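
A minimal scikit-learn sketch comparing the three scalers on a hypothetical single feature that contains an outlier:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical feature with an outlier in the last row
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print(MinMaxScaler().fit_transform(X).ravel())   # rescales to [0, 1]
print(StandardScaler().fit_transform(X).ravel()) # z-scores: mean 0, std 1
print(RobustScaler().fit_transform(X).ravel())   # uses median and IQR, less outlier-sensitive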

TYPES OF DATA

1. Quantitative Data: Numerical data that can be measured. It includes:

• Discrete Data: Countable values (e.g., number of students).
• Continuous Data: Measurable values that can take any value within a range (e.g., height, temperature).
2. Qualitative Data: Descriptive data that can be categorized based on traits or
characteristics. It includes:

• Nominal Data: Unordered categories (e.g., colors, types of animals).
• Ordinal Data: Ordered categories (e.g., rankings, satisfaction levels).
3. Time Series Data: Data points collected or recorded at specific time intervals (e.g., stock
prices over time).
4. Spatial Data: Data related to geographical locations, including coordinates, maps, and
patterns (e.g., GPS data).
5. Text Data: Unstructured data in the form of text (e.g., social media posts, emails).

DATA PARTITIONING
Data partitioning involves dividing a dataset into subsets for various purposes:

1. Training and Testing: In machine learning, datasets are typically split into a training set
(for model training) and a test set (for model evaluation). A common split ratio is 70:30 or
80:20.
2. Cross-Validation: Further partitions the training set into several smaller sets to validate
the model's performance. Common methods include k-fold cross-validation, where the
dataset is divided into k subsets.
3. Stratified Sampling: Ensures that each partition reflects the overall distribution of the
dataset, especially useful in imbalanced datasets.
4. Temporal Partitioning: Dividing time series data based on time, ensuring that the training
set only contains data before a certain point in time, and the testing set contains data after.
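
A minimal scikit-learn sketch of a stratified 80:20 split followed by 5-fold cross-validation, using the bundled iris dataset for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold

X, y = load_iris(return_X_y=True)

# 80:20 train/test split, stratified so class proportions are preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold cross-validation indices on the training set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train), start=1):
    print(f"Fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")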

MULTIDIMENSIONAL VISUALIZATION

Multidimensional visualization techniques are used to represent complex datasets with multiple variables:

1. Scatter Plots: Used to visualize relationships between two variables. For more dimensions,
color, size, or shape can represent additional variables.
2. 3D Plots: Extend scatter plots into three dimensions, allowing visualization of three
variables simultaneously.
3. Heat Maps: Represent data values through color gradients in a two-dimensional space,
useful for visualizing correlations and densities.
4. Parallel Coordinates: Used for visualizing high-dimensional datasets by representing each
dimension as a vertical axis and plotting lines for each data point.
5. Radial or Spider Charts: Show multivariate data in a circular layout, where each axis
represents a variable.
6. T-SNE (t-Distributed Stochastic Neighbor Embedding): A dimensionality reduction
technique often used for visualizing high-dimensional data in two or three dimensions.

PRINCIPAL COMPONENT ANALYSIS (PCA)

PCA is a dimensionality reduction technique used to reduce the complexity of a dataset while preserving as much variance as possible. It transforms the data into a new coordinate system where the greatest variance lies along the first coordinates (principal components).
Key Steps:
1. Standardization: Scale the data to have a mean of zero and a standard deviation of one,
especially if the variables are on different scales.
2. Covariance Matrix Calculation: Compute the covariance matrix to understand how
variables relate to each other.
3. Eigenvalue and Eigenvector Calculation: Determine the eigenvalues and eigenvectors of
the covariance matrix. Eigenvalues indicate the amount of variance captured by each
principal component, and eigenvectors define the direction of these components.
4. Selecting Principal Components: Choose the top k eigenvalues and their corresponding
eigenvectors based on the desired variance explained (often using a scree plot to visualize).
5. Transforming the Data: Project the original data onto the selected principal components
to obtain a lower-dimensional representation.

Applications: PCA is widely used in exploratory data analysis, noise reduction, data
visualization, and preprocessing for machine learning algorithms.
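
A minimal scikit-learn sketch of the steps above: PCA handles the covariance, eigenvector, and projection steps internally once the data has been standardized (the bundled iris dataset is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Step 1: standardize so each feature has mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

# Steps 2-5: PCA computes the covariance structure, its eigenvectors, and projects the data
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)   # 4 features -> 2 principal components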

CLASSIFICATION
Classification is a supervised learning task that involves predicting the categorical class labels
of new instances based on past observations.

Common Classification Algorithms:

1. Logistic Regression: Models the probability of a binary outcome based on one or more
predictor variables.
2. Decision Trees: A tree-like model that makes decisions based on feature values, splitting
the data at each node to improve purity.
3. Support Vector Machines (SVM): Finds the hyperplane that best separates different
classes in high-dimensional space.
4. K-Nearest Neighbors (KNN): Classifies instances based on the majority class of their
nearest neighbors in the feature space.
5. Random Forest: An ensemble method that uses multiple decision trees to improve
accuracy and reduce overfitting.
6. Neural Networks: Computational models that mimic the human brain, capable of
capturing complex patterns in data.
MISCLASSIFICATION
Misclassification occurs when a classification model incorrectly predicts the class label for a
given instance. This can happen due to various factors, including model bias, data quality,
and the complexity of the underlying patterns.

Types of Misclassification:

1. False Positive (Type I Error): Predicting a positive class when the actual class is negative
(e.g., classifying a benign tumor as malignant).
2. False Negative (Type II Error): Predicting a negative class when the actual class is positive
(e.g., classifying a malignant tumor as benign).

• Evaluation Metrics: To assess the performance of a classification model and quantify misclassification, several metrics are used:
• Accuracy: The ratio of correctly predicted instances to the total instances.
• Precision: The ratio of true positives to the sum of true positives and false positives
(indicates the quality of positive predictions).
• Recall (Sensitivity): The ratio of true positives to the sum of true positives and false
negatives (indicates the ability to capture positive instances).
• F1 Score: The harmonic mean of precision and recall, useful for balancing both
metrics.
• Confusion Matrix: A table summarizing the actual vs. predicted classifications,
allowing for detailed analysis of misclassifications.
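
A short scikit-learn sketch that extracts the four confusion matrix cells from hypothetical labels and prints a per-class summary:

from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp} (Type I)  FN={fn} (Type II)  TN={tn}")

# Precision, recall, and F1 per class in one summary
print(classification_report(y_true, y_pred))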

k-NEAREST NEIGHBOR (k-NN)


k-NN is a simple, non-parametric classification (and sometimes regression) algorithm that
classifies instances based on the majority class among their k-nearest neighbors in the
feature space.

Key Characteristics:

• Instance-based Learning: k-NN does not explicitly learn a model but instead stores
training instances for reference during classification.
• Distance Metric: The algorithm typically uses Euclidean distance to measure the
similarity between instances, but other distance metrics (like Manhattan or
Minkowski) can also be used.
Steps:
1. Choose the number of neighbors (k): The user selects how many neighbors to
consider for classification.
2. Calculate distances: For a new instance, calculate its distance to all training
instances.
3. Select neighbors: Identify the k closest neighbors based on the calculated distances.
4. Vote for class: The predicted class is determined by a majority vote among the k
neighbors (for classification) or by averaging their values (for regression).

Advantages:

• Simple and easy to implement.
• Naturally handles multi-class classification.
Disadvantages:

• Sensitive to irrelevant features and the choice of k.
• Computationally expensive for large datasets, as it requires calculating distances to all training instances.
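
A minimal scikit-learn sketch of k-NN with k=5; the features are scaled first because the algorithm is distance-based (the bundled iris dataset is used for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scaling matters because k-NN relies on distances; k=5 neighbors, Euclidean distance by default
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))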

NAÏVE BAYES
Naive Bayes is a family of probabilistic algorithms based on Bayes’ theorem, used primarily
for classification tasks. The "naive" aspect refers to the assumption that features are
independent given the class label.

Key Characteristics:

• Probabilistic Model: It calculates the posterior probability of each class given the
input features and selects the class with the highest probability.
• Independence Assumption: Assumes that all features contribute independently to
the probability of a class.

Types:
1. Gaussian Naive Bayes: Assumes that features follow a Gaussian (normal)
distribution.
2. Multinomial Naive Bayes: Suitable for discrete data, especially for text classification.
3. Bernoulli Naive Bayes: Suitable for binary/boolean features.
Steps:

1. Calculate the prior probabilities for each class.
2. For each feature, calculate the likelihood of the feature values given each class.
3. Apply Bayes' theorem to compute the posterior probability for each class.
4. Choose the class with the highest posterior probability.

Advantages:

• Fast and efficient, particularly with large datasets.
• Works well with high-dimensional data (e.g., text classification).

Disadvantages:

• The independence assumption may not hold in practice, leading to less accurate
predictions.
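
A minimal scikit-learn sketch of Gaussian Naive Bayes, using the bundled iris dataset for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Gaussian Naive Bayes: assumes each feature is normally distributed within a class
nb = GaussianNB()
nb.fit(X_train, y_train)

print("Test accuracy:", nb.score(X_test, y_test))
print("Posterior probabilities for the first test row:", nb.predict_proba(X_test[:1]).round(3))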

PRUNING IN DECISION TREES


Pruning is a technique used to reduce the size of decision trees and prevent overfitting,
enhancing model generalization.

Types of Pruning:

1. Pre-pruning (Early Stopping): Stops the growth of the tree before it fully develops, based
on criteria like minimum sample size or maximum depth.

2. Post-pruning: Involves allowing the tree to grow fully and then removing nodes that do
not provide significant predictive power.

Benefits:

• Reduces the complexity of the model.
• Improves accuracy on unseen data by preventing overfitting.
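
A short scikit-learn sketch contrasting pre-pruning (depth and leaf-size limits) with post-pruning (cost-complexity pruning via ccp_alpha), using the bundled breast cancer dataset for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: limit depth and minimum samples per leaf while growing the tree
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0)

# Post-pruning: grow fully, then prune with the cost-complexity parameter ccp_alpha
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)

for name, tree in [("pre-pruned", pre), ("post-pruned", post)]:
    tree.fit(X_train, y_train)
    print(name, "test accuracy:", round(tree.score(X_test, y_test), 3),
          "leaves:", tree.get_n_leaves())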
CLASSIFICATION TREES
Classification trees are a type of decision tree used for predicting categorical outcomes. The
tree structure consists of nodes representing features, branches representing decision rules,
and leaves representing class labels.

Key Characteristics:

• Splitting Criterion: Nodes are split based on metrics like Gini impurity, entropy
(information gain), or classification error.
• Binary Splits: Typically, splits are binary (two branches), though multi-way splits can
occur.

Steps:

1. Start with the entire dataset at the root.
2. Choose the best feature to split on based on a splitting criterion.
3. Create branches for each possible value of the feature.
4. Repeat the process recursively for each branch until stopping criteria are met (e.g., maximum depth, minimum samples).

Applications: Used in various domains for tasks like fraud detection, medical diagnosis, and
customer segmentation.

REGRESSION TREES

Regression trees are similar to classification trees but are used for predicting continuous
outcomes instead of categorical ones.

Key Characteristics:

• Splitting Criterion: The splits are made based on minimizing the variance or mean
squared error (MSE) of the target variable within each node.
• Leaf Nodes: Each leaf node represents the predicted value for the target variable.
Steps:

1. Start with the entire dataset at the root.
2. Choose the best feature to split on based on the criterion that minimizes variance.
3. Create branches for each possible value of the feature.
4. Continue recursively until a stopping condition is reached.

Applications: Commonly used in predicting prices, sales forecasting, and other scenarios
where the target variable is continuous.
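
A minimal scikit-learn sketch of a regression tree on the bundled diabetes dataset, used here purely for illustration:

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Splits minimize the MSE of the target within each node; leaves predict the mean target value
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X_train, y_train)

pred = reg.predict(X_test)
print("Test MSE:", round(mean_squared_error(y_test, pred), 2))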

ARTIFICIAL NEURAL NETWORKS (ANNs)

Overview: Artificial Neural Networks are computational models inspired by the biological
neural networks that constitute animal brains. They are used in various tasks, including
classification, regression, and pattern recognition.

Key Components:

1. Neurons: Basic units of the network that receive input, process it, and produce output.
Each neuron has an activation function that determines its output based on the input.

2. Layers:

• Input Layer: The first layer that receives the input data.
• Hidden Layers: Intermediate layers that process inputs from the previous layer. There
can be one or more hidden layers.
• Output Layer: The final layer that produces the output of the network.

3. Weights and Biases: Each connection between neurons has a weight that adjusts during
training. Biases are additional parameters added to the neuron output to improve the
model’s flexibility.
Training Process:

1. Forward Propagation: Input data is passed through the network, and the output is
computed.

2. Loss Calculation: The difference between the predicted output and the actual output
is calculated using a loss function (e.g., mean squared error for regression or cross-
entropy for classification).

3. Backpropagation: The error is propagated back through the network to update the
weights and biases using optimization algorithms (e.g., Stochastic Gradient Descent,
Adam).

➢ Activation Functions: Functions that introduce non-linearity into the model, allowing
it to learn complex patterns. Common activation functions include:

• Sigmoid: Outputs values between 0 and 1.
• ReLU (Rectified Linear Unit): Outputs the input directly if positive; otherwise, it outputs zero.
• Tanh: Outputs values between -1 and 1.
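
A minimal scikit-learn sketch of a small feed-forward network (MLPClassifier handles forward propagation, the cross-entropy loss, and backpropagation internally); the bundled iris dataset and the chosen layer size are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# One hidden layer with 16 neurons, ReLU activation, trained with the Adam optimizer
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), activation="relu", solver="adam",
                  max_iter=1000, random_state=0),
)
ann.fit(X_train, y_train)
print("Test accuracy:", ann.score(X_test, y_test))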

Applications:

• Image recognition (e.g., convolutional neural networks).
• Natural language processing (e.g., recurrent neural networks).
• Time series forecasting and other regression tasks.

DISCRIMINANT ANALYSIS

Discriminant Analysis is a statistical technique used for classifying a set of observations into
predefined classes. The goal is to find a combination of predictor variables that best
separates the classes.
Types:

1. Linear Discriminant Analysis (LDA):
Assumes that the predictor variables are normally distributed and have the same covariance matrix across classes. Finds a linear combination of features that maximizes the ratio of between-class variance to within-class variance, allowing for better class separation. Suitable for two or more classes.

2. Quadratic Discriminant Analysis (QDA):
Similar to LDA but does not assume equal covariance matrices for all classes, allowing for a quadratic decision boundary. Better suited for cases where class distributions differ significantly.

Steps:

1. Compute the means and covariances: Calculate the mean vectors and covariance
matrices for each class.
2. Calculate the discriminant functions: Formulate functions based on the linear
combinations of the input features that separate the classes.
3. Classification: For a new observation, calculate its discriminant function values for each
class and assign it to the class with the highest value.
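
A short scikit-learn sketch comparing LDA and QDA with cross-validation; the bundled iris dataset is used only for illustration:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# LDA assumes a shared covariance matrix (linear boundary); QDA fits one per class (quadratic boundary)
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("QDA", QuadraticDiscriminantAnalysis())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean CV accuracy:", round(scores.mean(), 3))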

Applications:

• Face recognition and biometric identification.
• Medical diagnosis.
• Marketing analytics (e.g., customer segmentation).

Both Artificial Neural Networks and Discriminant Analysis are essential techniques in
machine learning and statistics, each with unique strengths and applications. ANNs excel in
complex and high-dimensional datasets, while discriminant analysis is valuable for problems
with clear class separations and underlying statistical assumptions.
