Week 3 A

INTRODUCTION TO MACHINE LEARNING

BY AATIQA BINT E GHAZALI


FALL 2024
Revision
 Q/A session from the previous lecture.
 Review the homework that was assigned.
Examples of learning tasks
 Supervised: classification, regression analysis
 Unsupervised: anomaly detection, dimensionality reduction
 Reinforcement: many robots use reinforcement learning algorithms to learn
how to walk; game-playing agents (e.g., for chess) are another example
Main Challenges of Machine Learning
 Insufficient Quantity of Training Data
 Non-representative Training Data
 Poor-quality data
 Irrelevant features
 Overfitting the training data
 Underfitting the training data
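Testing and Validating
 Testing a model for the first time by putting it into production is a bad
idea, because its errors are then found by real users.
 A better option is to split the data into two sets: the training set and
the test set.
 A common split is 80:20, i.e. 80% of the data for training and 20% for testing.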
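The 80:20 split can be sketched in a few lines of plain Python. This is only an illustration: the function name `train_test_split_simple`, the seed value, and the toy list of ten samples are assumptions made for the example, not part of the lecture.

```python
import random


def train_test_split_simple(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    shuffled = list(rows)                    # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)    # fixed seed gives a repeatable split
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: ten sample IDs standing in for ten rows of a dataset.
samples = list(range(10))
train_set, test_set = train_test_split_simple(samples)
print(len(train_set), len(test_set))
```

In practice, scikit-learn's `train_test_split` function (with `test_size=0.2`) does the same job; the hand-written version is shown only to make the idea visible.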
THE ML TOOLBOX
 Data
 Infrastructure
 Algorithms
 Visualizations
Machine Learning Pipeline
 A machine learning pipeline is a series of interconnected data processing
and modeling steps.
 It is designed to automate, standardize and streamline the process of
building, training, evaluating and deploying machine learning models.
Stages of a machine learning pipeline
1. Data collection
2. Data preprocessing
3. Feature engineering
4. Model selection
5. Model training
6. Model evaluation
7. Model deployment
8. Monitoring and maintenance
Data Collection
 New data is collected from various data sources, such as databases, APIs,
or files.
 The collected data is often raw and may require preprocessing to be useful.
 Common sources of data: Kaggle, UCI
Data preprocessing

 Involves cleaning, transforming and preparing input data for modeling.
 Common preprocessing steps include handling missing values, encoding
categorical variables, scaling numerical features and splitting the
data into training and testing sets.
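The four preprocessing steps listed above could look roughly like this in pandas. The table is a tiny made-up example with Titanic-style column names; the values, the median fill, the 0/1 encoding and min-max scaling are all choices made for illustration, and other strategies are equally valid.

```python
import pandas as pd

# Tiny made-up table (illustrative values only, not real passenger data).
df = pd.DataFrame({
    "Age": [22.0, None, 30.0, 40.0, 28.0],
    "Sex": ["male", "female", "female", "male", "female"],
    "Fare": [10.0, 70.0, 10.0, 50.0, 10.0],
    "Survived": [0, 1, 1, 0, 1],
})

# 1. Handle missing values: fill the missing age with the median age.
df["Age"] = df["Age"].fillna(df["Age"].median())

# 2. Encode a categorical variable as numbers.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# 3. Scale a numerical feature to the range [0, 1] (min-max scaling).
fare_min, fare_max = df["Fare"].min(), df["Fare"].max()
df["Fare"] = (df["Fare"] - fare_min) / (fare_max - fare_min)

# 4. Split into training and testing sets (80:20).
train_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(train_df.index)

print(df)
print(len(train_df), len(test_df))
```

On a real project the scaling parameters would normally be computed from the training set only and then applied to the test set; the order here follows the slide for simplicity.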
Feature engineering & Model selection
 Feature engineering
 Creating new features, or selecting relevant existing features, that can
improve the model's predictive power.
 This step often requires domain knowledge and creativity.
 Model selection
 Choosing the appropriate machine learning algorithm(s) based on the problem
type (e.g., classification, regression), data characteristics, and performance
requirements.
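As a small illustration of the feature engineering step, new columns can be derived from existing ones. The sketch below assumes Titanic-style columns `SibSp` (siblings/spouses aboard) and `Parch` (parents/children aboard); the derived names `FamilySize` and `IsAlone` and the sample values are hypothetical choices for this example.

```python
import pandas as pd

# Made-up rows using two Titanic-style columns.
df = pd.DataFrame({
    "SibSp": [1, 0, 0, 3],
    "Parch": [0, 0, 2, 1],
})

# New feature: family members aboard, counting the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Derived flag: 1 if the passenger is travelling alone, else 0.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

print(df)
```

Whether such a feature actually helps is something to check during model evaluation, not something to assume.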
Model training & Model evaluation
 Model training
 The selected model(s) are trained on the training dataset using the
chosen algorithm(s).
 This involves learning the underlying patterns and relationships within
the training data.
 Pre-trained models can also be used, rather than training a new model.
 Model evaluation
 The model's performance is assessed using a separate testing dataset or
through cross-validation.
 Common evaluation metrics depend on the specific problem but may
include accuracy, precision, recall, F1-score, mean squared error or
others.
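To make the evaluation metrics concrete, here is a minimal sketch that computes them by hand for a binary classifier, plus mean squared error for a regression case. The function names and the label lists are made up for the example.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)

mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])
print(mse)
```

Libraries such as scikit-learn provide these metrics ready-made; writing them out once helps show what each number means.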
Model deployment & Maintenance
 Model deployment
 Once a satisfactory model is developed and evaluated, it can be deployed to
a production environment where it can make predictions on new, unseen
data.
 Maintenance
 After deployment, it's important to continuously monitor the model's
performance and retrain it as needed to adapt to changing data patterns.
 This step ensures that the model remains accurate and reliable in a real-
world setting.
 Let's implement some of the pipeline stages discussed above.
Titanic Dataset Collection
 Kaggle hosts a wide range of datasets of various types.
 One of the most popular beginner datasets and competitions is the Titanic
dataset.
 This dataset is used to predict which passengers survived the Titanic.
 Download the dataset
 Upload it to Google Drive
 Explore it
 https://kaggle.com/c/titanic/data
Pandas and NumPy
 pandas and NumPy are two widely used Python libraries.
 pandas is a very popular library for working with data. DataFrames are at
the center of pandas. A DataFrame is structured like a table or spreadsheet.
The rows and the columns both have indexes, and you can perform
operations on rows or columns separately.
 NumPy is an open-source Python library that facilitates efficient numerical
operations on large quantities of data.
 pandas is built on top of NumPy.
 If either library is missing in your notebook (for example under Anaconda),
install both with !pip install numpy pandas
 pip is Python's package installer; it works in Colab, Anaconda and other
Python environments.
 After installation, import numpy and pandas into your code.
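A minimal sketch of importing both libraries and touching the ideas above: a NumPy array, a DataFrame with row and column indexes, and the link between the two. The names, ages and row labels are made up for the example.

```python
import numpy as np
import pandas as pd

# NumPy: efficient numerical operations on whole arrays at once.
ages = np.array([22, 38, 26, 35])
print(ages.mean())   # average of all values
print(ages + 1)      # adds 1 to every element

# pandas: a DataFrame is a table with a row index and column labels.
df = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Age": ages},
    index=["p1", "p2", "p3", "p4"],
)
print(df["Age"])      # select one column
print(df.loc["p2"])   # select one row by its index label

# pandas is built on NumPy: a column can be viewed as a NumPy array.
print(type(df["Age"].to_numpy()))
```

The aliases `np` and `pd` are conventions rather than requirements, but almost all example code uses them.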
Matplotlib and Seaborn
 Both libraries are used for visualization.
 Matplotlib is primarily used for basic chart plotting,
 while Seaborn offers many default themes and a wide variety of color schemes
for statistical visualization.
 Import these two libraries in Colab.
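A minimal sketch of importing the two libraries and drawing one chart with each. The small table is made up, and the `Agg` backend line is only there so the code also runs outside a notebook; in Colab it can be dropped and the figure displays inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Colab/Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values with Titanic-style column names.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22, 38, 26, 35, 54, 2],
})

sns.set_theme()  # apply Seaborn's default theme to all plots

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Matplotlib: a basic histogram of ages.
ax1.hist(df["Age"], bins=5)
ax1.set_title("Age (Matplotlib)")

# Seaborn: a statistical count plot of passengers per category.
sns.countplot(data=df, x="Sex", ax=ax2)
ax2.set_title("Passengers by sex (Seaborn)")

fig.tight_layout()
```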
Loading the dataset
 Read the CSV file using pandas.
 Display the dataset in the notebook.
 Read the first few rows of the data.
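The loading steps can be sketched as follows. So that the snippet runs without the download, it reads a three-row stand-in CSV from a string; the rows are illustrative, and the file name `train.csv` in the comment is an assumption about where the real file was saved.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: a few illustrative rows in CSV form.
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
"""

# Read the CSV into a DataFrame.
# With the real file this would be something like:
#   df = pd.read_csv("train.csv")   # assumed path; adjust to your own location
df = pd.read_csv(io.StringIO(csv_text))

print(df)          # display the whole dataset
print(df.head())   # first few rows (up to 5 by default)
print(df.shape)    # (number of rows, number of columns)
```

In a notebook, writing `df` or `df.head()` as the last line of a cell renders the table without needing `print`.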
