Lec 2
Lecture 2: AI course
02/22/2024
 Asking the right question
 "Bob spent years planning, executing, and optimizing how to conquer
   a hill. Unfortunately, it turned out to be the wrong hill."
 For example, let's say you want to create a pipeline for loan
   default prediction.
 Your question can be:
 For a given loan, will it default or not?
 When will the loan default?
 How much money will be received for a given loan?
 What will be the profit made on a given loan?
 What will be the profit made on a given loan without using
   disallowed input features?

 Collecting the right data for your pipeline.
 For the loan default example, some of the features that may or may
   not be relevant for making accurate predictions are:

   Feature Name     Description                                          Importance
   Name             Name of applicant                                    None
   Address          Home address                                         Low
   Annual Income    Applicant's income                                   High
   Debts            Money currently borrowed by the applicant            High
   Credit History   How well he/she has been returning borrowed money    Medium
   Job Type         Govt employee, businessman, contractor, etc.         Medium
   Tax Paid         Tax paid in the last financial year                  None/High


 Data transformation tier that processes the raw data; some of the
   transformations that need to be done are (a minimal pandas sketch
   follows this list):
 Data Cleansing
 Filtration
 Aggregation
 Augmentation
 Consolidation
 Storage
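
As a rough illustration only (the column names and values below are invented for the example, not taken from the lecture), a few of these transformations might look like this in pandas:

import pandas as pd

# Hypothetical raw loan records: a duplicate, a missing income, and an implausible debt value.
raw = pd.DataFrame({
    "applicant_id": [1, 2, 2, 3],
    "annual_income": [55000, None, 72000, 48000],
    "debts": [12000, 5000, 5000, -1],
})

# Data cleansing: drop duplicate rows and rows with missing income.
clean = raw.drop_duplicates().dropna(subset=["annual_income"])

# Filtration: keep only plausible (non-negative) debt values.
clean = clean[clean["debts"] >= 0].copy()

# Augmentation: derive a new feature from existing columns.
clean["debt_to_income"] = clean["debts"] / clean["annual_income"]

# Aggregation: summarize per applicant.
per_applicant = clean.groupby("applicant_id")["debt_to_income"].mean()

# Storage: persist the transformed data for the next tier of the pipeline.
clean.to_csv("loans_transformed.csv", index=False)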

 Before we can apply our model to new measurements, we need to
   know whether it actually works—that is, whether we should trust
   its predictions.
 We cannot use the data we used to build the model to evaluate it.
 This is because our model can always simply remember the whole
training set, and will therefore always predict the correct label for
any point in the training set
 To assess the model’s performance, we show it new data (data that
it hasn’t seen before) for which we have labels.
 This is usually done by splitting the labeled data we have collected
into two parts
 Training data
 Testing data
 and sometimes into three:
 Training data
 Validation data
 Testing data
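
As a minimal sketch (the toy data and the 60/20/20 proportions are illustrative assumptions), a three-way split can be obtained by calling scikit-learn's train_test_split twice:

import numpy as np
from sklearn.model_selection import train_test_split

# Toy labeled data, for illustration only: 50 points, 2 features each.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# First hold out a test set, then split the remainder into training and validation sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10, i.e. roughly 60/20/20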


 Once we split the data, it is time to run the training and test
   data through a series of candidate models and assess how accurate
   each one is.
 This is an iterative process: various algorithms might be tested
   until you have a model that sufficiently answers your question.

 After we train our model with various algorithms comes selecting
   the model that is optimal for the problem at hand.
 We don't always pick the best-performing model.
 An algorithm that performs well on the training data might not
   perform well in production because it might have overfitted the
   training data.
 At this point in time, model selection is more of an art than a
   science.
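
One common way to compare candidate models without trusting their training-set performance alone is cross-validation; the sketch below is an assumption about how this could be done with scikit-learn, not the lecture's prescribed method:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate by its mean accuracy over 5 cross-validation folds.
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())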


 Once a model is chosen and finalized, it is ready to be used to
   make predictions.
 It is typically exposed via an API and embedded in decision-making
   frameworks as part of an analytics solution.
 Some questions to consider in the deployment selection (a minimal
   API sketch follows these questions):
 Does the system need to make predictions in real time (if yes, how
   fast: in milliseconds, seconds, minutes, hours)?
 How often do the models need to be updated?
 What amount of volume or traffic is expected?
 What is the size of the datasets?
 Are there regulations, policies, and other constraints that need to
   be followed?
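
Purely as an illustration (the Flask framework, the joblib-serialized model file, and the /predict route are assumptions, not part of the lecture), exposing a trained model via a small HTTP API might look like this:

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumes the trained model was saved earlier with joblib.dump(model, "model.joblib").
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [[5.1, 3.5, 1.4, 0.2]]}.
    payload = request.get_json()
    prediction = model.predict(payload["features"])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=5000)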

 Performance in the data science context does not mean how fast the
   model is running, but rather how accurate the predictions are.
 Data scientists monitoring machine learning models are primarily
   looking at a single metric: drift.
 Drift happens when the data is no longer a relevant or useful input
   to the model.
 Data can change and lose its predictive value.
 Data scientists and engineers must monitor the models constantly to
   make sure that the model features continue to be like the data
   points used during model training.
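
As one possible (assumed, not prescribed by the lecture) drift check, a two-sample Kolmogorov-Smirnov test can compare a feature's training distribution with recent production data:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Illustrative values for one feature: training distribution vs. recent production data.
train_income = rng.normal(loc=50_000, scale=10_000, size=1_000)
prod_income = rng.normal(loc=65_000, scale=10_000, size=1_000)  # the distribution has shifted

stat, p_value = ks_2samp(train_income, prod_income)
if p_value < 0.01:
    print(f"Possible drift in the income feature (KS statistic={stat:.3f}, p={p_value:.3g})")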


 Yield Estimation and Prediction using AI
 [Figure: numbered stages (1-9) of a yield estimation and prediction pipeline; diagram not reproduced]

 Let's assume that a hobby botanist is interested in distinguishing
   the species of some iris flowers that she has found. She has
   collected some measurements associated with each iris: the length
   and width of the petals and the length and width of the sepals,
   all measured in centimeters.


 She also has the measurements of some irises that have been
   previously identified by an expert botanist as belonging to the
   species setosa, versicolor, or virginica. For these measurements,
   she can be certain of which species each iris belongs to. Let's
   assume that these are the only species our hobby botanist will
   encounter in the wild.
 Problem Definition
 Our goal is to build a machine learning model that can learn from
   the measurements of these irises whose species is known, so that
   we can predict the species for a new iris.


 Data Ingestion
from sklearn.datasets import load_iris
iris_dataset = load_iris()
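
The object returned by load_iris is dictionary-like (a scikit-learn Bunch); as an optional check, its contents can be inspected directly:

# Keys include 'data', 'target', 'feature_names', 'target_names', and 'DESCR'.
print(iris_dataset.keys())
print(iris_dataset["feature_names"])
print(iris_dataset["data"].shape)    # (150, 4): 150 flowers, 4 measurements each
print(iris_dataset["target_names"])  # ['setosa' 'versicolor' 'virginica']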


Data Preparation
The iris data in sklearn is already clean
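
Even so, as an optional sanity check (not required for this dataset), the data can be wrapped in a pandas DataFrame to confirm there are no missing values, continuing from the load_iris snippet above:

import pandas as pd

iris_df = pd.DataFrame(iris_dataset["data"], columns=iris_dataset["feature_names"])

print(iris_df.isna().sum())   # zero missing values in every column
print(iris_df.describe())     # count, mean, std, min, max, and quartiles per feature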


Visualize Data
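
The lecture's figure is not reproduced here; as a sketch of one common way to visualize this data, a pair-wise scatter matrix colored by species can be drawn with pandas and matplotlib (continuing from the snippets above):

import matplotlib.pyplot as plt
import pandas as pd

iris_df = pd.DataFrame(iris_dataset["data"], columns=iris_dataset["feature_names"])

# Scatter plot of every pair of features, colored by the true species label.
pd.plotting.scatter_matrix(iris_df, c=iris_dataset["target"], figsize=(10, 10),
                           marker="o", hist_kwds={"bins": 20}, alpha=0.8)
plt.show()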


Data Segregation
 scikit-learn contains a function that shuffles the
dataset and splits it for you: the train_test_split
function
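
A minimal sketch (the default 75/25 split and random_state=0 are illustrative choices):

from sklearn.model_selection import train_test_split

# Shuffle and split: by default 75% of the rows become training data, 25% test data.
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)

print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)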


Model Training
 Let's use a k-nearest neighbors classifier.
 To make a prediction for a new data point, the algorithm finds the
   point in the training set that is closest to the new point.
 Then it assigns the label of this training point to the new data
   point.
 In k-nearest neighbors, instead of using only the closest neighbor
   to the new data point, we can consider any fixed number k of
   neighbors in the training set (for example, the closest three or
   five neighbors). Then, we can make a prediction using the majority
   class among these neighbors.


 Import Model
 Train the model
 Test the model (a sketch of these steps follows)
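
A minimal sketch of these three steps (n_neighbors=1 and the example measurement are illustrative choices), continuing from the train/test split above:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Import and instantiate the model: a 1-nearest-neighbor classifier.
knn = KNeighborsClassifier(n_neighbors=1)

# Train the model on the training split.
knn.fit(X_train, y_train)

# Test the model on a single new measurement (sepal/petal length and width, in cm).
X_new = np.array([[5.0, 2.9, 1.0, 0.2]])
prediction = knn.predict(X_new)
print(iris_dataset["target_names"][prediction])  # e.g. ['setosa']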


 Model Evaluation
 This is where the test set that we created earlier comes
in. This data was not used to build the model, but we
do know what the correct species is for each iris in the
test set.
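
A short sketch of measuring accuracy on the held-out test set (both lines compute the same number), continuing from the snippets above:

import numpy as np

# Predict the species of every iris in the test set and compare with the known labels.
y_pred = knn.predict(X_test)
print("Test set accuracy:", np.mean(y_pred == y_test))
print("Test set accuracy:", knn.score(X_test, y_test))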


 At the core of every AI system lies a fundamental truth: the
   quality and quantity of data it ingests are paramount to its
   effectiveness.
 Data is the lifeblood that fuels AI algorithms, allowing them to
   learn, adapt, and make decisions.


1. Relevance: Data should be relevant to the problem being addressed
by the machine learning model. Irrelevant data can introduce noise
and decrease the model's performance.
2. Quality: High-quality data is accurate, consistent, and
reliable. It should be free from errors, missing values, and
inconsistencies that could adversely affect model training and
performance.
3. Quantity: Sufficient data is needed to train machine learning
models effectively. While the amount of required data varies
depending on the complexity of the problem and the
algorithm used, having a large and diverse dataset generally
improves model performance.
4. Representativeness: Data should be representative of the
real-world scenarios the model will encounter. It should cover
the full range of possible inputs and outcomes to ensure the
model generalizes well to unseen data.


5. Variety: Diverse datasets encompassing different data types,
formats, and sources can provide richer insights and help the model
learn more robust representations. This includes structured data
(e.g., numerical data), unstructured data (e.g., text, images), and
semi-structured data (e.g., JSON).
6. Balance: Imbalanced datasets, where certain classes or categories
are overrepresented or underrepresented, can lead to biased
models. Balancing the dataset ensures that the model learns from
all classes equally, improving its ability to make accurate
predictions.
7. Temporal relevance: In many cases, the temporal aspect of data
is crucial. Time-series data, for example, often exhibits patterns and
trends over time that are essential for predictive modeling.
8. Ethical Considerations: Data should be collected and used
ethically, respecting privacy, consent, and fairness principles.
Biased or discriminatory data can lead to biased and unfair machine
learning models.
9. Accessibility: Data should be accessible and properly documented
to facilitate collaboration, reproducibility, and transparency in
machine learning research and development.


 https://www.kaggle.com/
 https://huggingface.co/
 https://datasetsearch.research.google.com/
 https://www.microsoft.com/en-us/research/project/microsoft-research-open-data/


 Select a problem to solve using AI in any area of your choice.
 Find and download a relevant dataset available online.
 Generate/collect datapoints in addition to the downloaded dataset.
 Write a single-page report on:
 Problem to solve
 Dataset details
 Number of datapoints
 List of features
 Source of dataset
 How you generated the datapoints
 Link to the downloaded dataset
 Email the report to me at [email protected]
