Data Analytics Unit4 FullNotes
2. Segmentation
Segmentation is the process of dividing a dataset into smaller, meaningful subgroups based on similarities in attributes
or behavior.
Types of Segmentation:
- Demographic: Age, income, gender
- Geographic: Region, city, country
- Behavioral: Purchase habits, product usage
- Psychographic: Lifestyle, interests
Segmentation Techniques:
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
- Self-Organizing Maps (SOM)
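As an illustration of the first technique, here is a minimal pure-Python sketch of K-Means (Lloyd's algorithm). The customer data, the function name kmeans, and its parameters are hypothetical and chosen only for this example; a real project would more likely use a library implementation.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-Means sketch: returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster where it was
        if new_centroids == centroids:  # no movement means convergence
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical customers: (age, annual spend in thousands)
customers = [(22, 1.0), (25, 1.5), (24, 1.2), (58, 9.0), (61, 8.5), (60, 9.4)]
labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary; what matters is which customers share a label. In practice the attributes are usually scaled first so that no single attribute dominates the distance.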
Applications:
- Marketing: Targeting specific customer groups
- Healthcare: Grouping patients by conditions
3. Decision Trees
Decision Trees are flowchart-like structures used for classification and regression tasks.
Types:
- Classification Tree: Output is categorical
- Regression Tree: Output is numerical
Structure:
- Nodes: Attribute tests
- Branches: Outcomes of tests
- Leaves: Final decisions or class labels
Splitting Criteria:
- Gini Index, Entropy/Information Gain for classification
- Variance reduction for regression
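The splitting criteria above can be computed directly from the labels at a node. The following is a small sketch using only the standard library; the function names and the yes/no example labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, children):
    """Parent variance minus the size-weighted variance of the child nodes."""
    n = len(parent)
    return variance(parent) - sum(len(ch) / n * variance(ch) for ch in children)

# Hypothetical node with 5 "yes" and 5 "no" labels, split into two children.
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
print(gini(parent), entropy(parent))
print(information_gain(parent, children))
```

A perfectly mixed two-class node has Gini 0.5 and entropy 1 bit; a pure node scores 0 on both. The split with the largest gain (or largest variance reduction for regression) is preferred.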
Process:
1. Choose the best splitting attribute
2. Partition the data accordingly
3. Recursively build subtrees
4. Stop when a node is pure (all records share one class) or a depth limit is reached
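The four steps above can be sketched as a short recursive function. This is a simplified illustration, not a full algorithm such as CART or C4.5: it assumes numeric attributes, binary splits of the form "value <= threshold", and the Gini index as the criterion. The data and all names are hypothetical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (attribute index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:  # only accept splits that reduce impurity
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    majority = Counter(labels).most_common(1)[0][0]
    # Step 4: stop when the node is pure or the depth limit is reached.
    if len(set(labels)) == 1 or depth >= max_depth:
        return majority
    split = best_split(rows, labels)  # Step 1: choose the best splitting attribute
    if split is None:
        return majority
    f, t = split
    # Step 2: partition the data accordingly.
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    # Step 3: recursively build the subtrees.
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the tests from the root until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

# Hypothetical records: (age, income in thousands) -> bought the product?
rows = [(25, 30), (30, 40), (45, 80), (50, 90), (35, 85), (28, 35)]
labels = ["no", "no", "yes", "yes", "yes", "no"]
tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (40, 70)))
```

The max_depth parameter is the depth limit from step 4; lowering it yields a smaller tree at the cost of a coarser fit.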
Challenges:
- Overfitting: Very deep trees memorize noise in the training data
- Pruning: The usual remedy, which simplifies the tree by removing branches that add little predictive value
4. Overfitting and Pruning
Overfitting occurs when a model learns the training data too closely, including noise and anomalies, leading to poor
generalization.
Symptoms:
- High training accuracy but low test accuracy
- Complex and deep tree structure
Causes:
- Too many attributes
- Lack of pruning
- Small datasets
Types of Pruning:
- Pre-Pruning: Stops tree growth early (e.g., max depth, min samples)
- Post-Pruning: Removes unnecessary branches after full tree is built
Benefits:
- Reduces overfitting
- Improves prediction on unseen data
- Enhances interpretability
5. Forecast Accuracy Metrics
Forecast accuracy metrics evaluate how close predictions are to actual values.
Common Metrics:
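- MAE (Mean Absolute Error): Average of absolute errors; expressed in the units of the data
- MSE (Mean Squared Error): Average of squared errors; penalizes large errors more heavily
- RMSE (Root Mean Squared Error): Square root of MSE; returns the error to the units of the data
- MAPE (Mean Absolute Percentage Error): Average absolute error as a percentage of the actual value; undefined when an actual value is zero
- sMAPE (Symmetric MAPE): Variant of MAPE that divides each error by the average of the absolute actual and forecast values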
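A minimal sketch of these metrics in plain Python follows. The actual and forecast values are made up for illustration. Several sMAPE definitions are in use; this one assumes the form that divides each absolute error by the mean of |actual| and |forecast|.

```python
from math import sqrt

def mae(actual, forecast):
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mse(actual, forecast):
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    return sqrt(mse(actual, forecast))

def mape(actual, forecast):
    # Fails with ZeroDivisionError if any actual value is 0.
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

def smape(actual, forecast):
    # Assumed variant: denominator is the mean of |actual| and |forecast|.
    return 100 * sum(
        2 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast)
    ) / len(actual)

# Hypothetical actual values and forecasts.
actual = [100, 200, 300, 400]
forecast = [110, 190, 330, 380]
print("MAE  ", mae(actual, forecast))
print("MSE  ", mse(actual, forecast))
print("RMSE ", rmse(actual, forecast))
print("MAPE ", mape(actual, forecast))
print("sMAPE", smape(actual, forecast))
```

For this data the absolute errors are 10, 10, 30 and 20, so MAE is 17.5 and MSE is 375. RMSE is larger than MAE because the single error of 30 is weighted more heavily once squared.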
Applications:
6. STL Decomposition
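STL (Seasonal and Trend decomposition using Loess) breaks a time series into three components:
- Trend: Long-term direction of the series
- Seasonal: Pattern that repeats over a fixed period
- Residual: What remains after trend and seasonality are removed
STL uses LOESS (a form of local regression) for smoothing, and the smoothness of the trend and seasonal components can be tuned.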
Advantages:
- Works with any seasonal period, not only monthly or quarterly data
- Robust to outliers when its robust fitting option is used
- Allows component-wise analysis
Steps:
1. Input time series
2. Apply LOESS smoothing to extract the trend and seasonal components
3. Subtract both from the original series to get the residual
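The three steps can be illustrated with a simplified stand-in. The sketch below is NOT STL: it replaces LOESS with a centered moving average (classical additive decomposition) so that it runs with the standard library alone. The series and the function name are hypothetical. For real STL, a library implementation such as the STL class in statsmodels would be the usual route.

```python
def decompose_additive(series, period):
    """Classical additive decomposition with a centered moving average (not LOESS)."""
    n = len(series)
    half = period // 2
    trend = [None] * n  # undefined near the ends, where the window does not fit
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        if period % 2 == 0:
            # Even period: give the two end points half weight each.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[i] = total / period
    # Seasonal component: average detrended value at each position in the cycle.
    seasonal_means = []
    for pos in range(period):
        vals = [series[i] - trend[i] for i in range(n)
                if trend[i] is not None and i % period == pos]
        seasonal_means.append(sum(vals) / len(vals))
    offset = sum(seasonal_means) / period  # center the seasonal pattern on zero
    seasonal = [seasonal_means[i % period] - offset for i in range(n)]
    # Residual: original minus trend minus seasonal.
    residual = [series[i] - trend[i] - seasonal[i] if trend[i] is not None else None
                for i in range(n)]
    return trend, seasonal, residual

# Hypothetical quarterly series: linear trend plus a repeating pattern, no noise.
pattern = [2, -1, -2, 1]
series = [10 + i + pattern[i % 4] for i in range(12)]
trend, seasonal, residual = decompose_additive(series, period=4)
print(trend)
print(seasonal[:4])
print(residual)
```

Because the example series contains no noise, the residual is zero wherever the trend is defined. On real data the residual holds whatever the trend and seasonal components do not explain. The series needs to cover at least two full periods for every position in the cycle to have an estimate.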
Applications:
- Retail: Understand sales trends
- Finance: Analyze stock patterns
- Weather: Seasonal forecasting
STL is often used to preprocess a time series, for example by removing the seasonal component, before applying models like ARIMA.