Module 5
Algorithms
Gahan A V
Assistant Professor
Department of Electronics and Communication Engineering
Bangalore Institute of Technology
• The California Housing Prices dataset from the StatLib repository is often used in machine learning examples.
• It is based on the 1990 California census data and contains information about various features of homes, such as the
number of rooms, the median income of people in the area, and the geographical location of the house, among others.
In the context of this example, the dataset has been modified for teaching purposes:
1. A categorical attribute has been added to introduce diversity in the data types (e.g., a feature with specific categories like "urban," "suburban," etc.).
2. A few features have been removed to simplify the dataset, making it easier for learners to focus on the core concepts of machine learning and data analysis.
Though the dataset is based on data from 1990, it is often treated as if it represents recent data for practical use, especially
when demonstrating machine learning algorithms and techniques.
• Welcome to Machine Learning Housing Corporation! The first task you are asked to perform is to
build a model of housing prices in California using the California census data.
• This data has metrics such as the population, median income, median housing price, and so on for
each block group in California.
• Block groups are the smallest geographical unit for which the US Census Bureau publishes sample
data (a block group typically has a population of 600 to 3,000 people). We will just call them
“districts” for short.
• Your model should learn from this data and be able to predict the median housing price in any
district, given all the other metrics.
To build a model predicting the median housing price in any district of California based on various metrics like population,
median income, and others, follow these steps:
1. Data Preprocessing:
 • Clean the data: remove or impute missing values.
 • Normalize numerical features: rescale features like population, median income, etc., for better model performance.
 • Convert categorical variables (if any) into numerical representations (e.g., one-hot encoding).
2. Feature Selection:
 • Identify which features (metrics) are most relevant for predicting the housing price. You can use techniques like correlation analysis or feature importance from tree-based models (a rough sketch follows this list).
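A rough sketch of the feature-selection step above, assuming the column names of the California housing CSV (e.g., median_house_value); it checks correlations with the target and feature importances from a random forest:

```python
# Illustrative feature-selection checks (file name and column names are
# assumptions based on the California housing example).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

housing = pd.read_csv("housing.csv")                  # hypothetical local copy of the dataset
housing_num = housing.select_dtypes(include="number")

# 1) Correlation of every numeric attribute with the target
corr = housing_num.corr()["median_house_value"].sort_values(ascending=False)
print(corr)

# 2) Feature importances from a tree-based model
X = housing_num.drop(columns=["median_house_value"]).fillna(0)  # quick NaN fix for the sketch
y = housing_num["median_house_value"]
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
for name, score in sorted(zip(X.columns, forest.feature_importances_),
                          key=lambda t: t[1], reverse=True):
    print(f"{name}: {score:.3f}")
```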
Here is a breakdown of how the problem is framed and the key elements involved in designing the system:
1. Business Objective:
 • The goal is not just to build a model, but to predict a district’s median housing price, which will be fed to another Machine Learning system that decides whether an area is worth investing in. This will directly impact revenue.
2. Current Solution:
 • The current method of estimating housing prices is manual and unreliable, with estimates sometimes being off by more than 20%. The company wants to automate and improve this process by using a model that predicts prices based on other district data.
Pipelines
•Definition: A data pipeline is a sequence of components that process and transform data step by step.
•Common in ML: Frequently used in Machine Learning due to the need for extensive data manipulation and transformation.
•Independent Components: Each component processes data independently and passes the result to the next stage via a
shared data store.
•Modular Design: Components are self-contained, making the system easier to understand and maintain.
•Team Collaboration: Different teams can focus on specific components without affecting others.
•Resilience: If one component fails, downstream components can temporarily operate using the last available data.
•Potential Issue: Without monitoring, failures may go unnoticed, leading to stale data and degraded system performance.
•Importance of Monitoring: Proper monitoring ensures timely detection and resolution of issues in the pipeline.
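In Scikit-Learn, small transformation pipelines can be chained in exactly this modular way. The sketch below is only illustrative; the column names (median_income, total_rooms, total_bedrooms, ocean_proximity) are assumptions taken from the housing example:

```python
# A minimal preprocessing pipeline sketch using Scikit-Learn.
# Column names are assumed from the California housing example.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing numeric values
    ("scaler", StandardScaler()),                   # standardize numeric features
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, ["median_income", "total_rooms", "total_bedrooms"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),  # encode the categorical attribute
])

# housing_prepared = full_pipeline.fit_transform(housing)  # housing: a pandas DataFrame
```

Each named step is a self-contained component, so one team can change the imputer or the encoder without touching the rest of the pipeline.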
•Verify Assumptions: Check the assumptions made during the project to avoid mistakes.
•Example: If the downstream system needs categories (e.g., "cheap," "medium"), the task is classification, not regression.
•Impact: Incorrect assumptions can lead to wasted effort on the wrong model type.
•Action: Confirm with relevant teams about requirements before proceeding.
•Outcome: Once assumptions are verified (e.g., actual prices needed), proceed with the appropriate model (regression).
3. Access Credentials:
- Normally, to access data from databases or data stores, you would need credentials and authorization. In this case, the process is simplified: you just need to download and extract a single CSV file (a minimal sketch follows).
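A minimal sketch of that download-and-extract step; the URL below is a placeholder, so substitute the actual location of the archive:

```python
# Minimal sketch: download and extract a single CSV from a tarball.
# DOWNLOAD_URL is a hypothetical placeholder, not the real data location.
import os
import tarfile
import urllib.request
import pandas as pd

DOWNLOAD_URL = "https://example.com/datasets/housing.tgz"  # hypothetical URL
DATA_DIR = "datasets/housing"

def fetch_housing_data(url=DOWNLOAD_URL, data_dir=DATA_DIR):
    os.makedirs(data_dir, exist_ok=True)
    tgz_path = os.path.join(data_dir, "housing.tgz")
    urllib.request.urlretrieve(url, tgz_path)      # download the archive
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=data_dir)      # extract housing.csv

def load_housing_data(data_dir=DATA_DIR):
    return pd.read_csv(os.path.join(data_dir, "housing.csv"))
```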
Data Cleaning
Data Cleaning for Missing Values:
1. Why Data Cleaning is Important:
•Most machine learning algorithms cannot handle missing data, so cleaning the data is essential to ensure your model works
effectively. For example, if the total_bedrooms attribute has missing values, this will need to be addressed.
2. Handling Missing Values (a minimal imputation sketch follows this list):
•Remove Rows: One option is to remove rows containing missing values. However, this can result in losing valuable data, so
it’s often not the best choice unless the missing data is very sparse.
•Impute Missing Values: A more common approach is to impute the missing values by filling them with a suitable value,
such as:
•The mean, median, or mode of the column.
•A specific value based on domain knowledge (e.g., 0 or unknown).
•For more sophisticated imputation, you can use machine learning models to predict missing values based on other
features.
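A minimal imputation sketch using Scikit-Learn's SimpleImputer; the file path and the total_bedrooms attribute are assumptions from the housing example:

```python
# Sketch: filling missing numeric values (e.g., total_bedrooms) with the column median.
import pandas as pd
from sklearn.impute import SimpleImputer

housing = pd.read_csv("datasets/housing/housing.csv")   # hypothetical path
housing_num = housing.select_dtypes(include="number")   # the imputer needs numeric data

imputer = SimpleImputer(strategy="median")
housing_filled = pd.DataFrame(imputer.fit_transform(housing_num),
                              columns=housing_num.columns,
                              index=housing_num.index)

# Alternatives to imputing:
# housing.dropna(subset=["total_bedrooms"])   # option 1: drop rows with missing values
# housing.drop(columns=["total_bedrooms"])    # option 2: drop the whole attribute
```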
Feature Scaling
Feature Scaling:
1. Why Feature Scaling is Important:
 • Machine learning algorithms often perform poorly when the numerical attributes have very different scales. For example, in the housing data:
   • Total rooms can range from 6 to 39,320.
   • Median incomes range from 0 to 15.
 • Models that are sensitive to the scale of input features (like gradient descent-based models, KNN, SVM, etc.) may struggle to learn efficiently when attributes are not scaled.
2. Scaling Techniques (a short sketch follows this list):
 • Min-Max Scaling (Normalization): the values of each feature are rescaled so that they fall within a specific range, usually between 0 and 1.
 • Standardization (or Z-score normalization): subtract the mean of each feature and divide by its standard deviation. This results in a distribution with a mean of 0 and a variance of 1.
By applying feature scaling, you ensure that all input features contribute equally to the model, improving the performance and convergence speed of many machine learning algorithms.
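A short sketch of both techniques using Scikit-Learn's MinMaxScaler and StandardScaler on toy data:

```python
# Sketch of the two scaling techniques described above.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[6.0, 0.5],
              [39320.0, 15.0],
              [2000.0, 3.2]])   # toy data: [total_rooms, median_income]

# Min-max scaling: (x - min) / (max - min), mapped to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / std, giving mean 0 and variance 1
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```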
3. Overfitting Check:
 • One important consideration is overfitting, where the model performs very well on the training set but poorly on unseen data. Overfitting occurs when the model becomes too complex and starts capturing noise in the data rather than general patterns.
 • To detect overfitting, it’s important to compare the performance of the model on the training set with its performance on the test set (which is reserved for model evaluation only).
4. Tune the Model (if necessary):
 • If the model’s performance on the training set is not satisfactory, you may try various strategies to improve it, such as adding polynomial features, applying regularization (like Ridge or Lasso regression), or using different models like decision trees or ensemble methods.
 • If the model is overfitting (i.e., high performance on training data but poor performance on test data), you may need to adjust model complexity or apply regularization techniques to prevent overfitting.
5. Iterative Process:
 • This process is often iterative. After evaluating the model, you might discover new insights that can help refine your feature engineering, data preparation, or model selection.
 • Depending on the results from the training-set evaluation, you may adjust the data pipeline, model, or hyperparameters, and retrain the model to improve performance.
By following this approach, you'll be able to build a strong foundation for training and evaluating a machine learning model on your training data. A short sketch of the overfitting check and regularization step follows.
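The sketch below illustrates the overfitting check and the regularization step on synthetic data; comparing plain LinearRegression against Ridge is only an example, not the model used in the housing project:

```python
# Sketch: detect overfitting by comparing training and test error,
# then try Ridge regularization. Synthetic data stands in for the prepared training set.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in (LinearRegression(), Ridge(alpha=1.0)):
    model.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    # A training error much lower than the test error suggests overfitting.
    print(type(model).__name__, round(train_rmse, 2), round(test_rmse, 2))
```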
Grid Search
Manual Tuning:
1. This method involves experimenting with different values for hyperparameters manually, such as the learning rate, number of trees in a forest, or depth of a decision tree.
2. Though effective, this method can be time-consuming and inefficient, especially with complex models and large datasets, as it requires manually testing multiple combinations of hyperparameters.
Grid Search:
1. A more systematic and efficient approach to fine-tuning is Grid Search. GridSearchCV in Scikit-Learn
automates the process of testing multiple combinations of hyperparameters to find the best set.
2. How Grid Search Works: You define a grid of hyperparameter values you want to test, and
GridSearchCV will exhaustively try every combination, using cross-validation to evaluate the performance
of each combination.
3. Cross-validation helps in assessing how well the model generalizes by splitting the data into multiple
training and validation sets.
4. Example: For a model like RandomForestRegressor, you can specify a grid with values for the number of
trees, maximum depth, and other hyperparameters. The model will then train using each combination and
evaluate the performance using cross-validation.
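A minimal GridSearchCV sketch for a RandomForestRegressor, matching item 4 above; the parameter values in the grid (and the synthetic data) are illustrative assumptions:

```python
# Sketch: exhaustive grid search with cross-validation over a RandomForestRegressor.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=42)

param_grid = [
    {"n_estimators": [30, 100], "max_features": [2, 4, 6]},
    {"bootstrap": [False], "n_estimators": [30], "max_features": [2, 4]},
]

grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid,
                           cv=5,                              # 5-fold cross-validation
                           scoring="neg_mean_squared_error")
grid_search.fit(X, y)

print(grid_search.best_params_)      # best hyperparameter combination found
print(grid_search.best_estimator_)   # the refitted best model
```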
Randomized Search:
1. RandomizedSearchCV is another technique where instead of testing all combinations of hyperparameters, it randomly
samples a subset of the possible combinations. This can be faster and more efficient than grid search, especially when the
search space is large.
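A comparable sketch with RandomizedSearchCV, which samples a fixed number of combinations instead of trying them all; the distributions below are illustrative:

```python
# Sketch: randomized hyperparameter search over the same kind of model.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import make_regression
from scipy.stats import randint

X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=42)

param_distributions = {
    "n_estimators": randint(10, 200),   # sample integers in [10, 200)
    "max_features": randint(1, 8),
}

rnd_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                param_distributions,
                                n_iter=10,       # only 10 random combinations are tried
                                cv=5,
                                scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(X, y)
print(rnd_search.best_params_)
```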
Performance Measures
1. Cross-Validation: Evaluates model performance by splitting data into multiple folds for training/testing, providing a more reliable estimate of generalization ability.
2. Accuracy: Proportion of correct predictions out of all predictions, but may not be suitable for imbalanced datasets.
3. Confusion Matrix: Displays TP, FP, TN, and FN to evaluate how well the classifier distinguishes between classes.
4. Precision: Measures the accuracy of positive predictions (TP / (TP + FP)).
5. Recall: Measures how well the model captures actual positives (TP / (TP + FN)).
6. F1-Score: Harmonic mean of precision and recall, balancing both metrics.
7. ROC Curve and AUC: The ROC curve plots the tradeoff between recall (true positive rate) and false positive rate, with AUC summarizing overall performance.
These measures help assess a classifier’s performance, especially in situations like imbalanced classes; a short sketch follows.
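A short sketch computing these measures with Scikit-Learn on a toy set of labels and scores:

```python
# Sketch: the performance measures above on a toy binary classification task.
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true   = [0, 0, 1, 1, 1, 0, 1, 0]                    # actual labels
y_pred   = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard predictions
y_scores = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]    # predicted scores/probabilities

print(confusion_matrix(y_true, y_pred))                 # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_scores))
```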
Multiclass Classification
Error Analysis
Multilabel Classification:
•Definition: In multilabel classification, each instance can belong to multiple classes simultaneously. This is
different from multiclass classification, where each instance can belong to only one class out of many.
•Example: For image classification, a picture of a cat sitting on a couch could be labeled as both "cat" and
"furniture".
•Approach:
• You can train one classifier per label (binary classification per label).
• Each classifier predicts whether the label should be assigned or not. This method is often referred to as the
One-vs-Rest strategy.
• Metrics like Hamming Loss, Subset Accuracy, and F1 Score are commonly used to evaluate multilabel
classifiers.
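A minimal multilabel sketch: two binary labels are attached to each digit image and a KNeighborsClassifier (which supports multilabel targets natively) is trained on both at once. The "large digit" and "odd digit" labels are made up for illustration:

```python
# Sketch: multilabel classification with one binary target per label.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

digits = load_digits()
X, y = digits.data, digits.target

y_large = (y >= 7)                       # label 1: is the digit 7, 8, or 9?
y_odd = (y % 2 == 1)                     # label 2: is the digit odd?
y_multilabel = np.c_[y_large, y_odd]     # shape (n_samples, 2)

knn = KNeighborsClassifier()
knn.fit(X, y_multilabel)                 # both labels are learned at once

print(knn.predict(X[:3]))                # e.g., [[False  True] ...]
print(f1_score(y_multilabel, knn.predict(X), average="macro"))   # macro-averaged F1
```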
Multioutput Classification:
•Definition: In multioutput classification, each instance has multiple outputs (each output being a label or a prediction). It
can be viewed as a special case of multilabel classification where each output corresponds to a different label, but the
outputs are generally related to each other.
•Example: Predicting multiple attributes of a house, such as the house price, number of rooms, and size at the same time, or
in an image, predicting multiple features such as the type of objects (e.g., "cat", "dog") and their locations (bounding
boxes).
•Approach:
• One common method is to train a separate classifier for each output or to use multioutput models that can predict
multiple outputs at once.
• In regression tasks, multioutput regression can be used where the model predicts several continuous values.
Both types of problems are handled by adapting algorithms designed for single-label problems to predict multiple labels or
outputs.
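A minimal multioutput sketch, loosely following a classic "noise removal" setup: each of the 64 output pixels of a digit image is its own classification target with values 0 to 16. The noise level and model choice are illustrative assumptions:

```python
# Sketch: multioutput classification, where each output (each pixel of a
# cleaned-up digit image) can take several values.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X = digits.data                                  # pixel intensities 0..16

rng = np.random.default_rng(42)
X_noisy = X + rng.integers(0, 5, X.shape)        # noisy input images
y_clean = X.astype(int)                          # one label per pixel (64 outputs)

knn = KNeighborsClassifier()
knn.fit(X_noisy, y_clean)                        # all 64 outputs predicted at once

cleaned = knn.predict(X_noisy[:1])               # "denoised" version of the first image
print(cleaned.reshape(8, 8))
```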