Intelligent Systems and Machine Learning

Algorithms

Gahan A V
Assistant Professor
Department of Electronics and Communication Engineering
Bangalore Institute of Technology

Module 5


• The California Housing Prices dataset from the StatLib repository is often used in machine learning examples.

• It is based on the 1990 California census data and contains information about various features of homes, such as the
number of rooms, the median income of people in the area, and the geographical location of the house, among others.

In the context of this example, the dataset has been modified for teaching purposes:

1. A categorical attribute has been added to introduce diversity in the data types (in the commonly used version of this
dataset, `ocean_proximity`, with categories such as "INLAND" and "NEAR BAY").
2. A few features have been removed to simplify the dataset, making it easier for learners to focus on the core concepts of
machine learning and data analysis.

Although the dataset is based on data from 1990, the examples treat it as if it were recent data, which is sufficient for
demonstrating machine learning algorithms and techniques.

Look at the Big Picture

• Welcome to Machine Learning Housing Corporation! The first task you are asked to perform is to
build a model of housing prices in California using the California census data.

• This data has metrics such as the population, median income, median housing price, and so on for
each block group in California.

• Block groups are the smallest geographical unit for which the US Census Bureau publishes sample
data (a block group typically has a population of 600 to 3,000 people). We will just call them
“districts” for short.

• Your model should learn from this data and be able to predict the median housing price in any
district, given all the other metrics.

To build a model predicting the median housing price in any district of California based on various metrics like population,
median income, and others, follow these steps:
1. Data Preprocessing:
1. Clean the data: Remove or impute missing values.
2. Scale numerical features: Rescale features like population, median income, etc., since many models perform better when
features share a similar scale.
3. Convert categorical variables (if any) into numerical representations (e.g., one-hot encoding).
2. Feature Selection:
1. Identify which features (metrics) are most relevant for predicting the housing price. You can use techniques like
correlation analysis or feature importance from tree-based models.

3. Model Selection:
1. Start with a linear regression model if the relationship between features and housing price seems linear.
2. If there are non-linear relationships, try more flexible models like decision trees, random forests, or gradient
boosting.
4. Model Training:
1. Split the data into training and testing sets (e.g., 80% for training, 20% for testing). Fit any preprocessing
(imputation, scaling) on the training set only, so that no information from the test set leaks into training.
2. Train your model on the training set and measure its error with metrics like Mean Absolute Error (MAE), Root Mean
Squared Error (RMSE), or R-squared.
5. Model Evaluation:
1. Use cross-validation on the training set to estimate how well the model generalizes to unseen data, and adjust
hyperparameters based on those estimates.
2. Evaluate on the test set only once, at the end, so that the final performance estimate stays unbiased.
6. Prediction:
1. Once trained, the model should be able to predict the median housing price for any district based on the input
metrics.


Frame the Problem


The following is a breakdown of how the problem is framed and the key elements involved in designing the system:
1. Business Objective:
1. The goal is not just to build a model, but to predict a district’s median housing price, which will be used by another
Machine Learning system to decide investment worthiness in specific areas. This will directly impact revenue.
2. Current Solution:
1. The current method of estimating housing prices is manual and unreliable, with estimates sometimes being off by more
than 20%. The company wants to automate and improve this process using a model to predict prices based on other district
data.

3. Key Questions for Framing the Problem:
1. Supervised or Unsupervised: This is a supervised learning task because the model is provided with labeled training data
(district features and the actual median housing prices).
2. Type of Task: This is a regression task since you are predicting a continuous value (the housing price).
3. Regression Type: It's a multiple regression problem because multiple features (e.g., district population, median income)
are used to predict the target value.
4. Learning Method: Given that there is no continuous stream of data and the dataset is small enough to fit in memory,
batch learning is the most appropriate approach.
4. Model Considerations:
1. The model will use features such as district population and median income to predict the housing price.
2. The output is a single predicted value (univariate regression), as you are predicting just one value (the housing price) per
district.
By considering these elements, the system design will focus on creating a supervised learning regression model using batch
learning techniques to predict housing prices effectively.


Pipelines

•Definition: A data pipeline is a sequence of components that process and transform data step by step.
•Common in ML: Frequently used in Machine Learning due to the need for extensive data manipulation and transformation.
•Independent Components: Each component processes data independently and passes the result to the next stage via a
shared data store.
•Modular Design: Components are self-contained, making the system easier to understand and maintain.
•Team Collaboration: Different teams can focus on specific components without affecting others.
•Resilience: If one component fails, downstream components can temporarily operate using the last available data.
•Potential Issue: Without monitoring, failures may go unnoticed, leading to stale data and degraded system performance.
•Importance of Monitoring: Proper monitoring ensures timely detection and resolution of issues in the pipeline.


Select a Performance Measure
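
For a regression problem such as this one, two measures in common use are the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE), both of which are named elsewhere in this module. As a sketch, with notation that is an assumption here ($m$ instances, $\mathbf{x}^{(i)}$ the feature vector of the $i$-th district, $y^{(i)}$ its actual median housing price, and $h$ the model's prediction function):

```latex
\mathrm{RMSE}(h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^{2}}
\qquad
\mathrm{MAE}(h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(\mathbf{x}^{(i)}) - y^{(i)}\right|
```

Because RMSE squares each error before averaging, it penalizes large errors more heavily than MAE does, so MAE may be preferable when the data contains many outlier districts.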


Check the Assumptions

•Verify Assumptions: Check the assumptions made during the project to avoid mistakes.
•Example: If the downstream system needs categories (e.g., "cheap," "medium"), the task is classification, not regression.
•Impact: Incorrect assumptions can lead to wasted effort on the wrong model type.
•Action: Confirm with relevant teams about requirements before proceeding.
•Outcome: Once assumptions are verified (e.g., actual prices needed), proceed with the appropriate model (regression).


Steps to Get the Data:

1. Download the Data:
- In typical environments, data is stored across multiple tables or files in databases or other data stores.
- For this project, the data is provided in a simpler form: a single compressed file, `housing.tgz`.

2. Extract the Data:
- The compressed file contains a CSV file named `housing.csv`, which includes all the data you need for analysis.

3. Access Credentials:
- Normally, to access data from databases or data stores, you would need credentials and authorization.

4. Familiarize with Data Schema:
- Once the data is downloaded, it's important to understand its structure (schema) to know how to properly use the data for
your model.

In this case, the process is simplified as you just need to download and extract a single compressed file containing one CSV file.
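
The extract-and-load steps above can be sketched with the Python standard library alone. This is a minimal sketch, not the course's own loader: the function names `extract_archive` and `load_csv_rows` are hypothetical, and the archive is assumed to have been downloaded already.

```python
import csv
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract a gzip-compressed tar archive into target_dir and return that directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(path=target)
    return target


def load_csv_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))
```

A call such as `extract_archive("housing.tgz", "datasets/housing")` followed by `load_csv_rows("datasets/housing/housing.csv")` would then return one dict per district (the directory name is only an example). Every value is read as a string, so numeric columns still need converting, and archives should only be extracted when they come from a trusted source.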


Create a Test Set

Creating a Test Set:

1. Why Set Aside Data Early:
1. It may seem premature to reserve part of the data for testing at this stage, when you’ve only glanced at the data.
However, this is an important step for getting an honest estimate of how well the model generalizes.
2. Risk of Overfitting:
1. Your brain is very good at detecting patterns. If you look at the test set, you may notice a seemingly interesting pattern
that leads you to select a particular kind of model. That choice is then fitted to the test data, so the model can look good
on the test set and still perform poorly on genuinely unseen data.
3. Data Snooping Bias:
1. If you examine the test set before finalizing the model, you risk introducing data snooping bias. This happens when the
test data influences the model selection, leading to an overly optimistic estimate of the model’s performance.
4. Best Practice:
1. To avoid this bias, it’s critical to create a test set before you start exploring the data or experimenting with models.
This ensures that your evaluation of the model's performance is honest and unbiased.
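
As a minimal sketch of setting a test set aside, the function below shuffles the row indices with a fixed seed and holds out a fraction of them. The name `split_train_test`, the 20% ratio, and the seed are illustrative assumptions rather than values prescribed by these notes.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle row indices with a fixed seed and split the rows into train and test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)   # fixed seed: the same split on every run
    test_size = int(len(rows) * test_ratio)
    test_indices = indices[:test_size]
    train_indices = indices[test_size:]
    train = [rows[i] for i in train_indices]
    test = [rows[i] for i in test_indices]
    return train, test
```

Fixing the seed matters: if the split changed on every run, the model would eventually see the whole dataset across runs. Scikit-Learn's `train_test_split(data, test_size=0.2, random_state=42)` does the same job.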


Discover and Visualize the Data to Gain Insights

Steps to Discover and Visualize the Data:


1.Ensure the Test Set is Set Aside:
1. Before exploring the data, confirm that the test set has already been separated and you're only working with the
training set. This ensures that you don’t inadvertently introduce bias from the test data.
2.Sample for Large Datasets (Optional):
1. If your dataset is large, consider sampling a smaller exploration set for easier manipulation and faster analysis.
However, if the dataset is small (like in this case), you can work directly with the entire training set.
3.Create a Copy of the Data:
1. Make a copy of the dataset for exploration. This allows you to perform various manipulations and visualizations
without altering the original training data.
4.In-depth Exploration:
1. Now, explore the data in more depth by visualizing distributions, relationships between features, and identifying
patterns or anomalies. This helps you understand the dataset better, uncover insights, and identify potential issues
before model building.
By exploring the data thoroughly, you'll be better prepared to make informed decisions about data preprocessing, feature
selection, and model choice.

Experimenting with Attribute Combinations

Experimenting with Attribute Combinations:


1.Gain Insights from Previous Exploration:
•After exploring the data, you may have identified some quirks or patterns that could affect model performance. For
instance, some attributes might need cleaning, while others might exhibit strong correlations with the target variable.
2.Addressing Skewed Distributions:
•If certain features have tail-heavy distributions (e.g., skewed data), you might want to transform them (such as taking the
logarithm) to make the distribution more normal and suitable for machine learning algorithms.
3.Creating New Attributes:
•Before preparing the data for training, you may want to create new attributes by combining existing ones. This can provide
more meaningful information for the model. Some useful combinations include:
•Rooms per Household: Instead of just the total number of rooms, dividing by the number of households can provide
a more relevant feature.
•Bedrooms per Room: The total number of bedrooms alone is less useful, but dividing by the total number of rooms
shows what share of a district's rooms are bedrooms.
•Population per Household: This could help capture how crowded or spacious a district is, which might correlate with
housing prices.


4. Example of Creating New Features:
•If the dataset includes attributes like the total number of rooms, households, and bedrooms, you can create new features like:
•rooms_per_household = total_rooms / households
•bedrooms_per_room = total_bedrooms / total_rooms
•population_per_household = population / households

By experimenting with these combinations, you can often discover new insights and improve the predictive power of your model.
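
The three ratios above, together with a log transform for a tail-heavy feature, can be sketched with pandas. The three rows of numbers are made up for illustration; only the column names come from the text, and `log_population` is a hypothetical name for the transformed column.

```python
import numpy as np
import pandas as pd

# Three illustrative districts; the column names follow the attribute names used above.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, 1106.0, 190.0],
    "population": [322.0, 2401.0, 496.0],
    "households": [126.0, 1138.0, 177.0],
})

# Ratio features built from existing columns.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

# Log transform for a tail-heavy feature; log1p computes log(1 + x), so zeros are safe.
housing["log_population"] = np.log1p(housing["population"])
```

Whether a new feature helps can then be checked by looking at its correlation with the target, or by comparing cross-validated error with and without it.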


Prepare the Data for Machine Learning Algorithms

Preparing the Data for Machine Learning Algorithms:


1.Automate Data Preprocessing:
1. Rather than manually preparing the data each time, it’s better to write functions for common data preparation tasks. This
makes your process more efficient and repeatable.
2.Benefits of Writing Functions:
1. Reproducibility: With functions, you can easily apply the same transformations to new datasets, ensuring consistency.
2. Reusable Code: Over time, you will create a library of useful transformation functions that can be applied to future
projects.
3. Integration with Live Systems: These functions can be used in your production system to process new data before
feeding it into your algorithms.
4. Flexibility for Experimentation: Functions allow you to quickly test different transformations on the data and evaluate
which ones yield the best results.


3. Data Preparation Steps:


1. Typical tasks might include handling missing data, encoding categorical variables, scaling or
normalizing numerical features, and creating new features from existing ones.
2. For example, you could write functions to:
1. Fill missing values or drop rows with missing data.
2. Convert categorical variables into numerical formats using techniques like one-hot encoding or
label encoding.
3. Scale features (e.g., using Min-Max scaling or Standardization).
By preparing reusable functions, you ensure that your data preprocessing is efficient, consistent, and adaptable
to various datasets and project needs.


Data Cleaning
Data Cleaning for Missing Values:
1.Why Data Cleaning is Important:
•Most machine learning algorithms cannot handle missing data, so cleaning the data is essential to ensure your model works
effectively. For example, if the total_bedrooms attribute has missing values, this will need to be addressed.
2.Handling Missing Values:
•Remove Rows: One option is to remove rows containing missing values. However, this discards the other attributes of those
districts, so it’s usually reasonable only when few rows are affected.
•Impute Missing Values: A more common approach is to impute the missing values by filling them with a suitable value,
such as:
•The mean, median, or mode of the column, computed on the training set and reused for the test set and for new data.
•A specific value based on domain knowledge (e.g., 0 or unknown).
•For more sophisticated imputation, you can use machine learning models to predict missing values based on other
features.
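
Median imputation can be sketched with Scikit-Learn's `SimpleImputer`. The small arrays are made-up stand-ins for two numeric columns (think of `total_bedrooms` and `households`), not values from the housing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric columns; np.nan marks the missing values.
X_train = np.array([[129.0, 126.0],
                    [np.nan, 1138.0],
                    [190.0, 177.0],
                    [235.0, 219.0]])
X_new = np.array([[np.nan, 300.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                       # learns one median per column from the training data
X_train_filled = imputer.transform(X_train)
X_new_filled = imputer.transform(X_new)    # new data is filled with the training medians
```

The learned medians are stored in `imputer.statistics_`. Fitting on the training set only and reusing those values for new data keeps the preprocessing consistent between training and production.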


Handling Text and Categorical Attributes

Handling Text and Categorical Attributes:


1.Why It’s Important:
1. Machine learning algorithms typically cannot work directly with text or categorical data, so they must be converted
into numerical representations before feeding them into the model.
2.Types of Categorical Data:
1. Nominal: Categories with no particular order (e.g., city names, product types).
2. Ordinal: Categories with a meaningful order (e.g., education level: high school < college < graduate).
3.Methods for Handling Categorical Data:
1. Label Encoding:
1. For ordinal categorical variables, label encoding assigns an integer to each category so that the integers follow
the categories' natural order.
2. Example: If you have an education column with values ["High School", "College", "Graduate"], you could
convert them to [0, 1, 2].
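
The two kinds of categorical data can be encoded as in the sketch below, which uses Scikit-Learn's `OrdinalEncoder` for the ordinal education example and `OneHotEncoder` for a nominal attribute. The `area` values are hypothetical and exist only to show an attribute with no natural order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal attribute: pass the category order explicitly so the integers follow it.
education = np.array([["College"], ["High School"], ["Graduate"], ["College"]])
ordinal = OrdinalEncoder(categories=[["High School", "College", "Graduate"]])
education_codes = ordinal.fit_transform(education)

# Nominal attribute: one binary column per category, so no order is implied.
area = np.array([["urban"], ["suburban"], ["rural"], ["urban"]])
one_hot = OneHotEncoder()
area_one_hot = one_hot.fit_transform(area).toarray()   # dense array, for easy inspection
```

Without the explicit `categories` argument the encoder would order the categories alphabetically, which would not match the education levels. Scikit-Learn's `LabelEncoder` is meant for encoding target labels rather than input features.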


Feature Scaling

Feature Scaling:
1.Why Feature Scaling is Important:
1. Machine learning algorithms often perform poorly when the numerical attributes have very different scales. For
example, in housing data:
1. Total rooms can range from 6 to 39,320.
2. Median incomes range from 0 to 15.
2. Models that are sensitive to the scale of input features (like gradient descent-based models, KNN, SVM, etc.) may
struggle to learn efficiently when attributes are not scaled.
2.Scaling Techniques:
1. Min-Max Scaling (Normalization):
1. The values of each feature are shifted and rescaled so that they fall within a specific range, usually between 0 and 1.
2. Standardization:
1. Standardization (or Z-score normalization) involves subtracting the mean of each feature and dividing by its standard
deviation. The resulting values have a mean of 0 and a variance of 1; unlike Min-Max scaling, they are not bounded to a
fixed range.


Which Scaling Method to Use?


1. Min-Max Scaling: Use when you need a bounded range (e.g., for algorithms like neural networks, KNN). It is sensitive to
outliers, since a single extreme value compresses all the other values into a narrow part of the range.
2. Standardization: A common default for models that are sensitive to feature scale, such as linear and logistic regression
trained with gradient descent or regularization, and SVMs; it is much less affected by outliers. Tree-based models (like
Decision Trees, Random Forests, etc.) split on one feature at a time, so they are largely insensitive to feature scaling and
usually do not need it.
Target Values:
1. It’s generally not necessary to scale target values (e.g., housing prices).

By applying feature scaling, you put all input features on a comparable scale, which improves the
performance and convergence speed of many machine learning algorithms.
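
Both scaling formulas can be written out directly in NumPy. The sketch below uses made-up values for two features on very different scales, loosely modeled on the total rooms and median income ranges mentioned earlier.

```python
import numpy as np

# Made-up values: column 0 resembles total rooms, column 1 resembles median income.
X = np.array([[6.0, 0.5],
              [2000.0, 3.0],
              [39320.0, 15.0]])

# Min-Max scaling: (x - min) / (max - min), computed per column.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Standardization: (x - mean) / standard deviation, computed per column.
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)
X_standardized = (X - col_mean) / col_std
```

In practice Scikit-Learn's `MinMaxScaler` and `StandardScaler` implement these two transformations. As with imputation, the minimum, maximum, mean, and standard deviation should be learned from the training set only and then applied unchanged to the test set and to new data.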


Select and Train a Model


Select and Train a Model:
Now that you have prepared your data, the next step is to choose and train a machine learning model. Here's the process in theory:
1.Choose the Right Model:
1. Consider the Problem Type: Based on the nature of your task, you need to select an appropriate model:
1. Regression (for predicting a continuous value, like the median housing price in this case).
2. Classification (if you're predicting categories).
2. For regression tasks, common models include:
1. Linear Regression (simple, interpretable, suitable for linear relationships).
2. Decision Trees (capture non-linear patterns).
3. Random Forest or Gradient Boosting (ensemble methods that provide strong predictive power).
3. Evaluate Model Complexity: Simple models like linear regression are easy to interpret but may not perform well on
complex data, while more complex models like Random Forests or Gradient Boosting may offer better accuracy at the cost
of interpretability.
2.Train the Model:
1. Data Split: Ensure that the data is divided into a training set (to train the model) and a test set (to evaluate the model).
This allows you to assess how well the model generalizes to new data.
2. Model Fitting: Training a model involves fitting it to the training data, meaning the model learns the relationships
between the features (input variables) and the target (output variable).
3. Tune Hyperparameters: Different models have hyperparameters that can affect their performance. For instance, decision
trees have hyperparameters like maximum depth and minimum number of samples per split, which influence the model's complexity.
3. Evaluate the Model:
1. Use a Test Set: After training, evaluate the model's performance on the test set, which has not been seen during training.
This helps you understand how well the model will perform on new, unseen data.
2. Performance Metrics:
1. For regression models, common metrics include Root Mean Square Error (RMSE) or Mean Absolute Error (MAE).
These metrics help assess how close the model’s predictions are to the actual values.
2. For classification models, you would use metrics such as accuracy, precision, recall, and F1-score to understand how
well the model classifies instances.
4.Model Tuning:
1. Hyperparameter Tuning: To improve the model's performance, you can tune its hyperparameters (settings that control
the training process). This step can significantly boost accuracy.
1. Grid Search: Involves trying every possible combination of hyperparameters within a defined search space.
2. Randomized Search: Randomly samples hyperparameter combinations, making it faster than grid search but still
effective.
2. Tuning helps you find the optimal settings for your model to perform at its best.
5.Deploy the Model:
1. Once the model is trained and tuned, it can be deployed into a production environment where it can make real-time
predictions on new data.
2. Regular updates may be required to ensure the model remains accurate over time as the data changes.
By following these steps, you can ensure that you choose a suitable model, train it effectively, evaluate its performance, and
deploy it successfully for real-world tasks.
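
Choosing between candidate models can be sketched by comparing their cross-validated RMSE. The data below is a synthetic stand-in with a deliberately non-linear target, not the housing data, and the tree depth is an arbitrary illustrative setting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a prepared training set:
# the target depends non-linearly on the first feature.
rng = np.random.default_rng(42)
X = rng.uniform(-3.0, 3.0, size=(300, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=42),
}

cv_rmse = {}
for name, model in candidates.items():
    # Scikit-Learn scorers follow "higher is better", so RMSE is returned negated.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()
    print(f"{name}: mean cross-validated RMSE = {cv_rmse[name]:.3f}")
```

On this data the linear model cannot represent the squared term, so the tree should come out ahead. On other data the ranking can easily reverse, which is why the comparison is made with cross-validation rather than assumed.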


Training and Evaluating on the Training Set


Training and Evaluating on the Training Set:
After completing the previous steps of data preparation, including cleaning, transforming, and scaling, training and
evaluating a model on the training set becomes relatively straightforward. Here’s the theoretical approach:
1.Model Training:
1. Choose a Model: In this case, a Linear Regression model is selected, which is appropriate for regression tasks
where the goal is to predict a continuous target variable (e.g., housing prices).
2. Fit the Model: During training, the model is "fitted" to the training data. This means that the Linear Regression
algorithm will attempt to learn the best linear relationship between the input features (such as population, number of
rooms, etc.) and the target variable (the median housing price).
3. The model learns by finding the coefficients (weights) that minimize the error in predictions, typically using
methods like Ordinary Least Squares (OLS).
2.Evaluation on Training Set:
1. Predicting on Training Data: Once the model is trained, it is used to predict housing prices using the same training
data it was trained on.
2. Performance Metrics:
1. For regression tasks, the most common metric to evaluate model performance is Root Mean Square Error
(RMSE). It is the square root of the average squared difference between the predicted and actual values, so
larger errors carry more weight.
2. You can also use Mean Absolute Error (MAE), which measures the average absolute difference between
predicted and actual values. MAE is less sensitive to outliers than RMSE, so it can give a clearer picture of the
typical error when the data contains many outliers.

3. Overfitting Check:
1. One important consideration is overfitting, where the model performs very well on the training set but poorly on unseen
data. Overfitting occurs when the model becomes too complex and starts capturing noise in the data rather than general
patterns.
2. To detect overfitting, compare the model's performance on the training set with its performance on held-out data that
was not used for training, such as a validation set or cross-validation folds. The test set stays reserved for the final evaluation.
4. Tune the Model (if necessary):
1. If the model’s performance on the training set itself is not satisfactory, the model is underfitting. You can try a more
powerful model (such as decision trees or ensemble methods), add better features (such as polynomial features), or reduce
any constraints on the model.
2. If the model is overfitting (i.e., high performance on training data but poor performance on held-out data), you may need
to reduce model complexity, apply regularization (like Ridge or Lasso regression), or gather more training data.
5.Iterative Process:
1. This process is often iterative. After evaluating the model, you might discover new insights that can help refine your
feature engineering, data preparation, or model selection.
2. Depending on the results from the training set evaluation, you may adjust the data pipeline, model, or hyperparameters,
and retrain the model to improve performance.
By following this approach, you'll be able to build a strong foundation for training and evaluating a machine learning model on
your training data.
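
The train-then-evaluate loop described above can be sketched as follows. The features and prices are synthetic stand-ins generated from a known linear relationship, and the helper names `rmse` and `mae` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for prepared features and prices (not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 10.0 + rng.normal(scale=0.5, size=200)

# Hold out a validation split so training error can be compared with held-out error.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)   # Ordinary Least Squares fit


def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return float(np.mean(np.abs(y_true - y_pred)))


train_rmse = rmse(y_train, model.predict(X_train))
val_rmse = rmse(y_val, model.predict(X_val))
train_mae = mae(y_train, model.predict(X_train))

# A validation error far above the training error would suggest overfitting.
print(f"train RMSE={train_rmse:.3f}, train MAE={train_mae:.3f}, validation RMSE={val_rmse:.3f}")
```

Here the two RMSE values should be close, because a linear model matches how the data was generated. A large training error would instead point to underfitting, and a large gap between the two to overfitting.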

Fine-Tune Your Model

Grid Search

Fine-Tune Your Model:


Fine-tuning a machine learning model is crucial to ensure it performs optimally. One effective way to fine-tune is by adjusting the
hyperparameters of the model, which control the learning process. These hyperparameters are set before training and can
significantly impact the model’s performance.

Methods for Fine-Tuning:

1.Manual Hyperparameter Tuning:

1. This method involves experimenting with different values for hyperparameters manually, such as the learning rate, number
of trees in a forest, or depth of a decision tree.
2. Though effective, this method can be time-consuming and inefficient, especially with complex models and large datasets, as
it requires manually testing multiple combinations of hyperparameters.


Grid Search:
1. A more systematic and efficient approach to fine-tuning is Grid Search. GridSearchCV in Scikit-Learn
automates the process of testing multiple combinations of hyperparameters to find the best set.
2. How Grid Search Works: You define a grid of hyperparameter values you want to test, and
GridSearchCV will exhaustively try every combination, using cross-validation to evaluate the performance
of each combination.
3. Cross-validation helps in assessing how well the model generalizes by splitting the data into multiple
training and validation sets.
4. Example: For a model like RandomForestRegressor, you can specify a grid with values for the number of
trees, maximum depth, and other hyperparameters. The model will then train using each combination and
evaluate the performance using cross-validation.


1. Advantages of Grid Search:
1. Exhaustive Search: Tests every combination of the hyperparameter values listed in the grid, so you are
guaranteed to find the best-performing combination within that grid.
2. Automation: Saves time and effort compared to manual tuning by automating the process of
hyperparameter search.
2. Disadvantages:
1. Computationally Expensive: The number of combinations is the product of the number of values tried for
each hyperparameter, and each combination is trained once per cross-validation fold, so the process slows
down quickly, especially for large models and datasets.
2. Limited Search Space: The grid only explores the predefined values, and you might miss better
values outside the grid. For larger spaces, Randomized Search can be an alternative.
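
A grid search over a `RandomForestRegressor` can be sketched as follows. The data is synthetic, and the grid values are illustrative assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a prepared training set (not the housing data).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(120, 3))
y = 4.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(scale=0.1, size=120)

# The grid values are illustrative, not recommended defaults.
param_grid = {
    "n_estimators": [10, 30],
    "max_depth": [3, None],
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid_search.fit(X, y)   # 2 x 2 combinations x 3 folds = 12 fits, plus one final refit

print(grid_search.best_params_)
print(-grid_search.best_score_)   # mean cross-validated RMSE of the best combination
```

The fit count in the comment shows why cost grows quickly: adding a third hyperparameter with three values would triple the number of fits. After fitting, `grid_search.best_estimator_` holds the model retrained on all the data with the best combination.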


Randomized Search:
1. RandomizedSearchCV is another technique where instead of testing all combinations of hyperparameters, it randomly
samples a subset of the possible combinations. This can be faster and more efficient than grid search, especially when the
search space is large.
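
A randomized search can be sketched in the same way, with distributions in place of fixed value lists. The data is synthetic, and the ranges and the number of sampled combinations are illustrative assumptions.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for a prepared training set (not the housing data).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(120, 3))
y = 4.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(scale=0.1, size=120)

# Distributions to sample from; randint draws integers in [low, high).
param_distributions = {
    "n_estimators": randint(low=5, high=50),
    "max_depth": randint(low=2, high=10),
}

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=5,            # number of sampled combinations, fixed in advance
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,     # makes the sampled combinations reproducible
)
random_search.fit(X, y)

print(random_search.best_params_)
```

The `n_iter` argument gives direct control over the computing budget, independent of how large the search space is. That is the main practical difference from a grid.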

Hyperparameter Tuning in Practice:


1. Choosing Hyperparameters: Start by selecting hyperparameters that have the greatest impact on model performance. For
instance, in decision trees, key hyperparameters could include the maximum depth and the minimum number of samples
required to split a node.
2. Evaluating Performance: During fine-tuning, it's important to use cross-validation to estimate how well the model will
perform on unseen data. Use appropriate metrics, such as RMSE for regression tasks, to evaluate performance.

Summary of Fine-Tuning Techniques:


•Grid Search: Exhaustive search over a predefined set of hyperparameters using cross-validation.
•Manual Tuning: Time-consuming but effective if you have domain knowledge.
•Randomized Search: Faster alternative to grid search by randomly sampling combinations of hyperparameters.
By fine-tuning hyperparameters using techniques like Grid Search, you can ensure that the machine learning model performs at its
best, ultimately improving its ability to generalize and make accurate predictions on new data.


Launch, Monitor, and Maintain Your System:


1.Launch the System:
•Integrate the model with production data sources.
•Run tests to ensure data processing and model predictions are correct.
2.Monitoring:
•Continuously track model performance (accuracy, error rates).
•Set up alerts for performance drops or anomalies.
•Monitor both model performance and data input quality.
3.Maintain the Model:
•Retrain the model periodically on fresh data so that its performance does not degrade as the data drifts.
•Involve human evaluation (domain experts or crowdsourcing) for validation.
•Update the model when necessary based on performance feedback.
4.Data Quality Monitoring:
•Track data quality to detect issues early (faulty sensors, incorrect data).
•Set up alerts for anomalies in the incoming data to prevent incorrect predictions.
5.Continuous Improvement:
•Regularly update the model, retrain, or add new features.
•Scale the system as needed to handle larger data volumes or improve inference speed.
This ensures long-term effectiveness, performance, and relevance of the system.

MNIST Dataset - Detailed Points:
1.Content:
•Contains 70,000 grayscale images, each representing a single handwritten digit (0–9).
•Each image has a resolution of 28x28 pixels.
2.Data Split:
•60,000 images are used for training the model.
•10,000 images are set aside for testing the trained model.
3.Labels:
•Every image is labeled with the digit (0-9) it represents, making it a supervised learning task.
4.Purpose:
•MNIST serves as a benchmark for evaluating image classification algorithms.
•Commonly used for testing new machine learning models, especially in image recognition tasks.
5.Usage:
•Widely used in educational contexts for learning about image classification.
•A go-to dataset for understanding and implementing basic machine learning techniques.
6.Origin:
•Handwritten by a combination of high school students and employees of the U.S. Census Bureau.
•Its simplicity and variety make it a useful starting point for machine learning practitioners.
7.Preprocessing:
•Each 28x28 pixel image is flattened into a 1D vector with 784 features, where each feature represents a pixel’s intensity
(value between 0 and 255).
•Pixel values are generally normalized to the 0-1 range (by dividing by 255) before being fed to a model.
8.Reputation:
•Often referred to as the "Hello World" of machine learning, because it is one of the first datasets that new learners
and researchers experiment with.
9.Availability:
•MNIST is widely available through machine learning libraries like Scikit-Learn, TensorFlow, and Keras.
•The dataset is easy to download and use for anyone getting started with machine learning.
10.Significance:
•Despite its simplicity, MNIST remains a valuable tool for comparing the performance of different machine
learning algorithms.
•It helps researchers and developers to identify the strengths and weaknesses of models before they move on to
more complex tasks.
11.Challenges:
•While MNIST is well-suited for initial experiments, its small, centered, clean grayscale digits are far simpler than most real-world images.
•Modern models reach over 99% test accuracy on MNIST, so it no longer separates strong models well; harder datasets (like Fashion MNIST and CIFAR-10) are often used to compare higher-performing models.
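The preprocessing and data split described in the points above can be sketched with NumPy. To stay self-contained and offline, this sketch uses a small synthetic stand-in (700 random images split 600/100) rather than the real 70,000-image dataset split 60,000/10,000; the array names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Scaled-down synthetic stand-in for MNIST: 700 random 28x28 images with
# integer pixel intensities in 0..255, plus random digit labels 0..9.
images = rng.integers(0, 256, size=(700, 28, 28), dtype=np.uint8)
labels = rng.integers(0, 10, size=700)

# Flatten each 28x28 image into a vector of 784 features,
# then scale the intensities from 0..255 to the 0..1 range.
X = images.reshape(len(images), -1).astype(np.float32) / 255.0

# Keep the first part for training and the rest for testing
# (the real dataset uses 60,000 / 10,000 in the same way).
split = 600
X_train, X_test = X[:split], X[split:]
y_train, y_test = labels[:split], labels[split:]

print(X_train.shape, X_test.shape)
```

With the real dataset, only the loading step would change; the flattening and scaling steps stay the same.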

Training a Binary Classifier
Training a Binary Classifier for Digit 5: Key Steps
1.Simplify the Problem:
•Instead of recognizing all digits, focus on identifying just one digit (e.g., 5) as part of a binary classification task.
•For this, create target vectors where the label is True for all instances of the digit 5 and False for all other digits.
2.Choose the Classifier:
•Use Stochastic Gradient Descent (SGD) for training. SGD processes training instances one at a time, which makes it efficient on large datasets and well suited to online learning.
•Scikit-Learn's SGDClassifier implements this approach. Because it relies on randomness during training, fix its random seed (the random_state parameter) to get reproducible results.
3.Train the Model:
•The model is trained using the training set, where the features (input data) are paired with the binary labels (whether the digit
is 5 or not).
4.Make Predictions:
•After training, the model predicts whether an image represents the digit 5, returning True if it is predicted to be a 5 and False otherwise.
5.Evaluate the Model:
•Evaluate the model's performance on the test set to see how well it identifies the digit 5. Performance can be assessed with metrics such as accuracy, precision, recall, and F1 score.
By following these steps, you can build a classifier that focuses on detecting a specific digit (e.g., 5) in an image, simplifying the
overall task to binary classification.
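The five steps above can be sketched with Scikit-Learn as follows. One assumption to note: so that the example runs offline, it uses Scikit-Learn's small bundled digits dataset (8x8 images, loaded with `load_digits`) as a stand-in for MNIST. The variable names and the 80/20 split are illustrative choices, not taken from the slides.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Stand-in for MNIST: 8x8 digit images, already flattened to 64 features.
digits = load_digits()
X = digits.data / 16.0  # pixel values in this dataset range from 0 to 16
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 1: binary targets, True for every 5 and False for all other digits.
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

# Steps 2 and 3: choose the classifier and train it.
# random_state is fixed because SGD relies on randomness during training.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

# Step 4: predictions are True (a 5) or False (not a 5).
predictions = sgd_clf.predict(X_test)

# Step 5: a first, rough evaluation using accuracy on the test set.
accuracy = (predictions == y_test_5).mean()
print(f"Test accuracy: {accuracy:.3f}")
```

Accuracy alone is a weak measure here because only about one image in ten is a 5, which is why precision, recall, and F1 score are also used.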

Performance Measures
1.Cross-Validation: Evaluates model performance by splitting data into multiple folds for training/testing,
providing a more reliable estimate of generalization ability.
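2.Accuracy: Proportion of correct predictions out of all predictions. It can be misleading on imbalanced datasets: only about 10% of MNIST images are 5s, so a classifier that always predicts "not 5" is already about 90% accurate.
3.Confusion Matrix: Counts the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), showing how well the classifier distinguishes between classes.
4.Precision: Measures the accuracy of positive predictions (TP / (TP + FP)).
5.Recall: Measures how well the model captures actual positives (TP / (TP + FN)).
6.F1-Score: Harmonic mean of precision and recall (2 × Precision × Recall / (Precision + Recall)); it is high only when both are high.
7.ROC Curve and AUC: The ROC curve plots the true positive rate (recall) against the false positive rate at different decision thresholds; the area under the curve (AUC) summarizes it in one number, 1.0 for a perfect classifier and 0.5 for a purely random one.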
These measures help assess a classifier’s performance, especially in situations like imbalanced classes.
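As a worked example of the confusion-matrix-based measures, the following plain-Python sketch computes accuracy, precision, recall, and F1 for two small hand-made label lists. The helper names `confusion_counts` and `scores` and the toy data are assumptions for illustration; returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels given as 0/1 values."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def scores(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the four counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
result = scores(y_true, y_pred)  # TP=2, FP=1, TN=5, FN=2
print(result)

# Imbalanced example: a classifier that always predicts the negative class.
y_rare = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = scores(y_rare, [0] * 10)
print(always_negative)
```

In the second example the accuracy is 0.9 even though the classifier never finds a positive instance (recall 0.0), which illustrates why accuracy is a poor guide on imbalanced classes.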


Multiclass Classification


Error Analysis


Multilabel Classification:
•Definition: In multilabel classification, each instance can belong to multiple classes simultaneously. This is
different from multiclass classification, where each instance can belong to only one class out of many.
•Example: For image classification, a picture of a cat sitting on a couch could be labeled as both "cat" and
"furniture".
•Approach:
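• You can train one binary classifier per label.
• Each classifier predicts whether its label applies to the instance. This approach is commonly called binary relevance; it resembles the One-vs-Rest strategy from multiclass classification, except that several labels can be assigned to the same instance.
• Metrics like Hamming Loss (the fraction of individual labels predicted wrongly), Subset Accuracy (the fraction of instances whose whole label set is predicted exactly), and F1 Score are commonly used to evaluate multilabel classifiers.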
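Two of the multilabel metrics mentioned above can be computed directly with NumPy. The label matrices below are made-up toy data (rows are instances, columns are labels), used here only to show how the numbers arise.

```python
import numpy as np

# Each row is one instance; each column is one label (1 = label applies).
# Hypothetical labels for illustration: "cat", "dog", "furniture".
Y_true = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])
Y_pred = np.array([
    [1, 0, 1],   # all labels correct
    [0, 1, 1],   # one wrong label
    [1, 0, 0],   # one wrong label
    [0, 0, 1],   # all labels correct
])

# Hamming loss: fraction of individual label decisions that are wrong.
hamming_loss = (Y_true != Y_pred).mean()

# Subset accuracy: fraction of instances whose full label set is exactly right.
subset_accuracy = (Y_true == Y_pred).all(axis=1).mean()

print(f"Hamming loss:    {hamming_loss:.3f}")
print(f"Subset accuracy: {subset_accuracy:.3f}")
```

Here 2 of the 12 label decisions are wrong, giving a Hamming loss of about 0.167, while only 2 of the 4 instances are predicted exactly, giving a subset accuracy of 0.5. Subset accuracy is the stricter of the two measures.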


Multioutput Classification:
•Definition: In multioutput classification (also called multioutput-multiclass classification), each instance has several outputs, and each output can take more than two values. It generalizes multilabel classification, in which every output is binary.
•Example: Predicting several attributes of a house at the same time, such as its price, number of rooms, and size, or, for an image, predicting both the types of objects it contains (e.g., "cat", "dog") and their locations (bounding boxes). When the outputs are continuous values, such as a price or box coordinates, the task is multioutput regression rather than classification.
•Approach:
• Train a separate model for each output, or use a multioutput model that predicts all outputs at once.
• When the outputs are continuous, use multioutput regression, where the model predicts several continuous values.
Both multilabel and multioutput problems are handled by adapting algorithms designed for single-label problems to predict multiple labels or outputs.
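A minimal multioutput sketch, assuming Scikit-Learn's `KNeighborsClassifier`, which accepts a target array with one column per output. The two outputs used here (the digit itself and its remainder modulo 3) are invented purely for illustration, and the bundled 8x8 digits dataset stands in for MNIST so that the example runs offline.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X, y = digits.data, digits.target

# Two illustrative outputs per image (an assumption made for this sketch):
#   column 0: the digit itself (10 classes)
#   column 1: the digit's remainder modulo 3 (3 classes)
Y = np.column_stack([y, y % 3])

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# One model predicts both outputs at once.
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, Y_train)

Y_pred = knn_clf.predict(X_test)  # shape: (n_test_samples, 2)
per_output_accuracy = (Y_pred == Y_test).mean(axis=0)
print("Accuracy per output:", per_output_accuracy)
```

For estimators that do not accept a multi-column target, training one model per output column is the usual alternative.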
