Module 5
Algorithms
Gahan A V
Assistant Professor
Department of Electronics and Communication Engineering
Bangalore Institute of Technology
• The California Housing Prices dataset from the StatLib repository is often used in machine learning examples.
• It is based on the 1990 California census data and contains information about various features of homes, such as the
number of rooms, the median income of people in the area, and the geographical location of the house, among others.
In the context of this example, the dataset has been modified for teaching purposes:
1. A categorical attribute has been added to introduce diversity in the data types (e.g., a feature with specific categories like "urban," "suburban," etc.).
2. A few features have been removed to simplify the dataset, making it easier for learners to focus on the core concepts of machine learning and data analysis.
Though the dataset is based on data from 1990, it is often treated as if it represents recent data for practical use, especially
when demonstrating machine learning algorithms and techniques.
• Welcome to Machine Learning Housing Corporation! The first task you are asked to perform is to
build a model of housing prices in California using the California census data.
• This data has metrics such as the population, median income, median housing price, and so on for
each block group in California.
• Block groups are the smallest geographical unit for which the US Census Bureau publishes sample
data (a block group typically has a population of 600 to 3,000 people). We will just call them
“districts” for short.
• Your model should learn from this data and be able to predict the median housing price in any
district, given all the other metrics.
To build a model predicting the median housing price in any district of California based on various metrics like population,
median income, and others, follow these steps:
1. Data Preprocessing:
 • Clean the data: remove or impute missing values.
 • Normalize numerical features: rescale features like population, median income, etc., for better model performance.
 • Convert categorical variables (if any) into numerical representations (e.g., one-hot encoding).
2. Feature Selection:
 • Identify which features (metrics) are most relevant for predicting the housing price. You can use techniques like correlation analysis or feature importance from tree-based models (a rough sketch follows this list).
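A rough sketch of the feature-selection step above, assuming the column names of the California housing CSV (e.g., median_house_value); it checks correlations with the target and feature importances from a random forest:

```python
# Illustrative feature-selection checks (file name and column names are
# assumptions based on the California housing example).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

housing = pd.read_csv("housing.csv")                  # hypothetical local copy of the dataset
housing_num = housing.select_dtypes(include="number")

# 1) Correlation of every numeric attribute with the target
corr = housing_num.corr()["median_house_value"].sort_values(ascending=False)
print(corr)

# 2) Feature importances from a tree-based model
X = housing_num.drop(columns=["median_house_value"]).fillna(0)  # quick NaN fix for the sketch
y = housing_num["median_house_value"]
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
for name, score in sorted(zip(X.columns, forest.feature_importances_),
                          key=lambda t: t[1], reverse=True):
    print(f"{name}: {score:.3f}")
```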
Here is a breakdown of how the problem is framed and the key elements involved in designing the system:
1. Business Objective:
 • The goal is not just to build a model, but to predict a district’s median housing price, which will be fed to another Machine Learning system that decides whether an area is worth investing in. This will directly impact revenue.
2. Current Solution:
 • The current method of estimating housing prices is manual and unreliable, with estimates sometimes being off by more than 20%. The company wants to automate and improve this process by using a model that predicts prices based on other district data.
Pipelines
•Definition: A data pipeline is a sequence of components that process and transform data step by step.
•Common in ML: Frequently used in Machine Learning due to the need for extensive data manipulation and transformation.
•Independent Components: Each component processes data independently and passes the result to the next stage via a
shared data store.
•Modular Design: Components are self-contained, making the system easier to understand and maintain.
•Team Collaboration: Different teams can focus on specific components without affecting others.
•Resilience: If one component fails, downstream components can temporarily operate using the last available data.
•Potential Issue: Without monitoring, failures may go unnoticed, leading to stale data and degraded system performance.
•Importance of Monitoring: Proper monitoring ensures timely detection and resolution of issues in the pipeline.
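In Scikit-Learn, small transformation pipelines can be chained in exactly this modular way. The sketch below is only illustrative; the column names (median_income, total_rooms, total_bedrooms, ocean_proximity) are assumptions taken from the housing example:

```python
# A minimal preprocessing pipeline sketch using Scikit-Learn.
# Column names are assumed from the California housing example.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing numeric values
    ("scaler", StandardScaler()),                   # standardize numeric features
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["median_income", "total_rooms", "total_bedrooms"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),  # encode the categorical attribute
])

# housing_prepared = full_pipeline.fit_transform(housing)  # housing: a pandas DataFrame
```

Each named step is a self-contained component, so one team can change the imputer or the encoder without touching the rest of the pipeline.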
•Verify Assumptions: Check the assumptions made during the project to avoid mistakes.
•Example: If the downstream system needs categories (e.g., "cheap," "medium"), the task is classification, not regression.
•Impact: Incorrect assumptions can lead to wasted effort on the wrong model type.
•Action: Confirm with relevant teams about requirements before proceeding.
•Outcome: Once assumptions are verified (e.g., actual prices needed), proceed with the appropriate model (regression).
3. Access Credentials:
- Normally, to access data from databases or data stores, you would need credentials and authorization. In this case, the process is simplified: you just need to download and extract a single CSV file (a minimal sketch follows).
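A minimal sketch of that download-and-extract step; the URL below is a placeholder, so substitute the actual location of the archive:

```python
# Minimal sketch: download and extract a single CSV from a tarball.
# DOWNLOAD_URL is a hypothetical placeholder, not the real data location.
import os
import tarfile
import urllib.request
import pandas as pd

DOWNLOAD_URL = "https://example.com/datasets/housing.tgz"  # hypothetical URL
DATA_DIR = "datasets/housing"

def fetch_housing_data(url=DOWNLOAD_URL, data_dir=DATA_DIR):
    os.makedirs(data_dir, exist_ok=True)
    tgz_path = os.path.join(data_dir, "housing.tgz")
    urllib.request.urlretrieve(url, tgz_path)      # download the archive
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=data_dir)      # extract housing.csv

def load_housing_data(data_dir=DATA_DIR):
    return pd.read_csv(os.path.join(data_dir, "housing.csv"))
```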
Data Cleaning
Data Cleaning for Missing Values:
1. Why Data Cleaning is Important:
•Most machine learning algorithms cannot handle missing data, so cleaning the data is essential to ensure your model works
effectively. For example, if the total_bedrooms attribute has missing values, this will need to be addressed.
2. Handling Missing Values (a minimal imputation sketch follows this list):
•Remove Rows: One option is to remove rows containing missing values. However, this can result in losing valuable data, so
it’s often not the best choice unless the missing data is very sparse.
•Impute Missing Values: A more common approach is to impute the missing values by filling them with a suitable value,
such as:
•The mean, median, or mode of the column.
•A specific value based on domain knowledge (e.g., 0 or unknown).
•For more sophisticated imputation, you can use machine learning models to predict missing values based on other
features.
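A minimal imputation sketch using Scikit-Learn's SimpleImputer; the file path and the total_bedrooms attribute are assumptions from the housing example:

```python
# Sketch: filling missing numeric values (e.g., total_bedrooms) with the column median.
import pandas as pd
from sklearn.impute import SimpleImputer

housing = pd.read_csv("datasets/housing/housing.csv")   # hypothetical path
housing_num = housing.select_dtypes(include="number")   # the imputer needs numeric data

imputer = SimpleImputer(strategy="median")
housing_filled = pd.DataFrame(imputer.fit_transform(housing_num),
                              columns=housing_num.columns,
                              index=housing_num.index)

# Alternatives to imputing:
# housing.dropna(subset=["total_bedrooms"])   # option 1: drop rows with missing values
# housing.drop(columns=["total_bedrooms"])    # option 2: drop the whole attribute
```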
Feature Scaling
Feature Scaling:
1. Why Feature Scaling is Important:
 • Machine learning algorithms often perform poorly when the numerical attributes have very different scales. For example, in the housing data:
   • Total rooms can range from 6 to 39,320.
   • Median incomes range from 0 to 15.
 • Models that are sensitive to the scale of input features (like gradient descent-based models, KNN, SVM, etc.) may struggle to learn efficiently when attributes are not scaled.
2. Scaling Techniques (a short sketch follows this list):
 • Min-Max Scaling (Normalization): the values of each feature are rescaled so that they fall within a specific range, usually between 0 and 1.
 • Standardization (or Z-score normalization): subtract the mean of each feature and divide by its standard deviation. This results in a distribution with a mean of 0 and a variance of 1.
By applying feature scaling, you ensure that all input features contribute equally to the model, improving the performance and convergence speed of many machine learning algorithms.
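A short sketch of both techniques using Scikit-Learn's MinMaxScaler and StandardScaler on toy data:

```python
# Sketch of the two scaling techniques described above.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[6.0, 0.5],
              [39320.0, 15.0],
              [2000.0, 3.2]])   # toy data: [total_rooms, median_income]

# Min-max scaling: (x - min) / (max - min), mapped to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std, giving mean 0 and variance 1
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```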
3. Overfitting Check:
 • One important consideration is overfitting, where the model performs very well on the training set but poorly on unseen data. Overfitting occurs when the model becomes too complex and starts capturing noise in the data rather than general patterns.
 • To detect overfitting, it’s important to compare the performance of the model on the training set with its performance on the test set (which is reserved for model evaluation only).
4. Tune the Model (if necessary):
 • If the model’s performance on the training set is not satisfactory, you may try various strategies to improve it, such as adding polynomial features, applying regularization (like Ridge or Lasso regression), or using different models like decision trees or ensemble methods.
 • If the model is overfitting (i.e., high performance on training data but poor performance on test data), you may need to adjust model complexity or apply regularization techniques to prevent overfitting.
5. Iterative Process:
 • This process is often iterative. After evaluating the model, you might discover new insights that can help refine your feature engineering, data preparation, or model selection.
 • Depending on the results from the training-set evaluation, you may adjust the data pipeline, model, or hyperparameters, and retrain the model to improve performance.
By following this approach, you'll be able to build a strong foundation for training and evaluating a machine learning model on your training data. A short sketch of the overfitting check and regularization step follows.
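The sketch below illustrates the overfitting check and the regularization step on synthetic data; comparing plain LinearRegression against Ridge is only an example, not the model used in the housing project:

```python
# Sketch: detect overfitting by comparing training and test error,
# then try Ridge regularization. Synthetic data stands in for the prepared training set.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in (LinearRegression(), Ridge(alpha=1.0)):
    model.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    # A training error much lower than the test error suggests overfitting.
    print(type(model).__name__, round(train_rmse, 2), round(test_rmse, 2))
```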
Grid Search
Manual Tuning:
1. This method involves experimenting with different values for hyperparameters manually, such as the learning rate, number of trees in a forest, or depth of a decision tree.
2. Though effective, this method can be time-consuming and inefficient, especially with complex models and large datasets, as it requires manually testing multiple combinations of hyperparameters.
Grid Search:
1. A more systematic and efficient approach to fine-tuning is Grid Search. GridSearchCV in Scikit-Learn
automates the process of testing multiple combinations of hyperparameters to find the best set.
2. How Grid Search Works: You define a grid of hyperparameter values you want to test, and
GridSearchCV will exhaustively try every combination, using cross-validation to evaluate the performance
of each combination.
3. Cross-validation helps in assessing how well the model generalizes by splitting the data into multiple
training and validation sets.
4. Example: For a model like RandomForestRegressor, you can specify a grid with values for the number of
trees, maximum depth, and other hyperparameters. The model will then train using each combination and
evaluate the performance using cross-validation.
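A minimal GridSearchCV sketch for a RandomForestRegressor, matching item 4 above; the parameter values in the grid (and the synthetic data) are illustrative assumptions:

```python
# Sketch: exhaustive grid search with cross-validation over a RandomForestRegressor.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=42)

param_grid = [
    {"n_estimators": [30, 100], "max_features": [2, 4, 6]},
    {"bootstrap": [False], "n_estimators": [30], "max_features": [2, 4]},
]

grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid,
                           cv=5,                              # 5-fold cross-validation
                           scoring="neg_mean_squared_error")
grid_search.fit(X, y)

print(grid_search.best_params_)      # best hyperparameter combination found
print(grid_search.best_estimator_)   # the refitted best model
```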
Randomized Search:
1. RandomizedSearchCV is another technique where instead of testing all combinations of hyperparameters, it randomly
samples a subset of the possible combinations. This can be faster and more efficient than grid search, especially when the
search space is large.
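A comparable sketch with RandomizedSearchCV, which samples a fixed number of combinations instead of trying them all; the distributions below are illustrative:

```python
# Sketch: randomized hyperparameter search over the same kind of model.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import make_regression
from scipy.stats import randint

X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=42)

param_distributions = {
    "n_estimators": randint(10, 200),   # sample integers in [10, 200)
    "max_features": randint(1, 8),
}

rnd_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                param_distributions,
                                n_iter=10,       # only 10 random combinations are tried
                                cv=5,
                                scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(X, y)
print(rnd_search.best_params_)
```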
Performance Measures
1. Cross-Validation: Evaluates model performance by splitting data into multiple folds for training/testing, providing a more reliable estimate of generalization ability.
2. Accuracy: Proportion of correct predictions out of all predictions, but may not be suitable for imbalanced datasets.
3. Confusion Matrix: Displays TP, FP, TN, and FN to evaluate how well the classifier distinguishes between classes.
4. Precision: Measures the accuracy of positive predictions (TP / (TP + FP)).
5. Recall: Measures how well the model captures actual positives (TP / (TP + FN)).
6. F1-Score: Harmonic mean of precision and recall, balancing both metrics.
7. ROC Curve and AUC: The ROC curve plots the tradeoff between recall (true positive rate) and false positive rate, with AUC summarizing overall performance.
These measures help assess a classifier’s performance, especially in situations like imbalanced classes; a short sketch follows.
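A short sketch computing these measures with Scikit-Learn on a toy set of labels and scores:

```python
# Sketch: the performance measures above on a toy binary classification task.
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true   = [0, 0, 1, 1, 1, 0, 1, 0]                    # actual labels
y_pred   = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard predictions
y_scores = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]    # predicted scores/probabilities

print(confusion_matrix(y_true, y_pred))                 # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_scores))
```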
Multiclass Classification
Error Analysis
Multilabel Classification:
•Definition: In multilabel classification, each instance can belong to multiple classes simultaneously. This is
different from multiclass classification, where each instance can belong to only one class out of many.
•Example: For image classification, a picture of a cat sitting on a couch could be labeled as both "cat" and
"furniture".
•Approach:
• You can train one classifier per label (binary classification per label).
• Each classifier predicts whether the label should be assigned or not. This method is often referred to as the
One-vs-Rest strategy.
• Metrics like Hamming Loss, Subset Accuracy, and F1 Score are commonly used to evaluate multilabel
classifiers.
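A minimal multilabel sketch: two binary labels are attached to each digit image and a KNeighborsClassifier (which supports multilabel targets natively) is trained on both at once. The "large digit" and "odd digit" labels are made up for illustration:

```python
# Sketch: multilabel classification with one binary target per label.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

digits = load_digits()
X, y = digits.data, digits.target

y_large = (y >= 7)                       # label 1: is the digit 7, 8, or 9?
y_odd = (y % 2 == 1)                     # label 2: is the digit odd?
y_multilabel = np.c_[y_large, y_odd]     # shape (n_samples, 2)

knn = KNeighborsClassifier()
knn.fit(X, y_multilabel)                 # both labels are learned at once

print(knn.predict(X[:3]))                # e.g., [[False  True] ...]
print(f1_score(y_multilabel, knn.predict(X), average="macro"))   # macro-averaged F1
```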
Multioutput Classification:
•Definition: In multioutput classification, each instance has multiple outputs (each output being a label or a prediction). It
can be viewed as a special case of multilabel classification where each output corresponds to a different label, but the
outputs are generally related to each other.
•Example: Predicting multiple attributes of a house, such as the house price, number of rooms, and size at the same time, or
in an image, predicting multiple features such as the type of objects (e.g., "cat", "dog") and their locations (bounding
boxes).
•Approach:
• One common method is to train a separate classifier for each output or to use multioutput models that can predict
multiple outputs at once.
• In regression tasks, multioutput regression can be used where the model predicts several continuous values.
Both types of problems are handled by adapting algorithms designed for single-label problems to predict multiple labels or
outputs.
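A minimal multioutput sketch, loosely following a classic "noise removal" setup: each of the 64 output pixels of a digit image is its own classification target with values 0 to 16. The noise level and model choice are illustrative assumptions:

```python
# Sketch: multioutput classification, where each output (each pixel of a
# cleaned-up digit image) can take several values.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X = digits.data                                  # pixel intensities 0..16

rng = np.random.default_rng(42)
X_noisy = X + rng.integers(0, 5, X.shape)        # noisy input images
y_clean = X.astype(int)                          # one label per pixel (64 outputs)

knn = KNeighborsClassifier()
knn.fit(X_noisy, y_clean)                        # all 64 outputs predicted at once

cleaned = knn.predict(X_noisy[:1])               # "denoised" version of the first image
print(cleaned.reshape(8, 8))
```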