
NUS ACE SUMMER PROGRAMME

AI & MACHINE LEARNING

Manoranjan Dash
Professor and Dean
School of Computing and Data Science
FLAME University, Pune

Ex-Senior Scientist
Singapore Data Science Consortium
National University of Singapore
COURSE OUTLINE
• DAY 1
• Introduction to AI
• Introduction to Google Colab
• Introduction to Orange
• Preprocessing of Data using Python
• DAY 2
• Machine Learning Methods Using Orange
• DAY 3
• Machine Learning Methods Using Python
• DAY 4
• Machine Learning Case Studies and Application
• DAY 5
• LLM – Large Language Model
DAY 1
• Introduction to AI
• Introduction to Google Colab
• Introduction to Orange
• Preprocessing of Data using Python
Day 1: Introduction to AI
• Introduction to AI:
• A brief history of AI
• Divide the years since (around) 1950 into key periods that have shaped the development of
AI: for example, the 1990s-2000s saw the resurgence of ML and neural networks
• AI Applications: State of the Art
• Delve into advances in topics like healthcare, NLP and virtual assistants, finance and
fraud detection, computer vision, …
• AI, Machine Learning, and Deep Learning
• AI: problem solving techniques (search methods, heuristic, etc.)
• ML: supervised/unsupervised/semisupervised learning, reinforcement learning, feature
engineering
• DL: MLP, CNN, RNN, Generative Models
DEFINITION:
A branch of computer science that deals with
the creation and development of machines
capable of intelligent behaviour.

AI involves the simulation of human intelligence in machines,


enabling them to learn from experience, adapt to new
inputs, and perform tasks that typically require human
intelligence, such as problem-solving, understanding natural
language, and recognizing patterns.
TIME   DESCRIPTION
1950s  The field of AI was founded in the 1950s by researchers who believed that it was possible to create machines that could think like humans. One of the most important figures in this early period was Alan Turing.
1960s  In the 1960s, AI research made significant progress. One of the most important advances was the development of expert systems, which are computer programs that can solve problems in a specific domain, such as medicine or finance.
1970s  The 1970s saw a period of decline in AI research, known as the "AI Winter." This was due to a number of factors, including the failure of some early AI programs to live up to expectations, and the lack of a clear path forward for AI research.
1980s  AI research began to recover in the 1980s, thanks to the development of new techniques, such as neural networks and genetic algorithms.
1990s  The 1990s saw further progress in AI research, with the development of new applications, such as speech recognition and machine translation.
2000s  The 2000s saw the rise of deep learning, a new approach to AI that uses artificial neural networks to learn from data. Deep learning has led to significant advances in a wide range of AI applications, including image recognition, natural language processing, and robotics.
2010s  AI has been used in a wide range of applications, including self-driving cars, medical diagnosis, and financial trading.
2020s  The 2020s are shaping up to be a decade of even greater progress in AI. AI is being used to solve some of the world's most pressing problems, such as climate change and poverty.
Day 1: Introduction to AI
• Introduction to AI:
• A brief history of AI
• Divide the years since (around) 1950 into key periods that have shaped the development of
AI: for example, the 1990s-2000s saw the resurgence of ML and neural networks
• AI Applications: State of the Art
• Delve into advances in topics like healthcare, NLP and virtual assistants, finance and
fraud detection, computer vision, …
• AI, Machine Learning, and Deep Learning
• Define each, show their relationship, discuss each for several minutes by going into each
one’s sub-topics
• AI: problem solving techniques (search methods, heuristic, etc.)
• ML: supervised/unsupervised/semisupervised learning, reinforcement learning, feature
engineering
• DL: MLP, CNN, RNN, Generative Models
AI in Healthcare: State of The Art
Quiz
1. According to this video, why is AI so successful in detecting skin
cancer?
AI in NLP and Virtual Assistants: State of the Art
• NLP: Natural Language Processing
• NLP aims to enable computers to understand, interpret, and generate natural
language, allowing them to effectively communicate with humans in a
manner that is similar to how people communicate with each other.
• State of the art
• Chatbots
• Virtual Assistants
• Self-Driving Cars
• Healthcare
AI as Virtual Assistant

Virtual assistants: these are similar to chatbots, but they are more powerful and can perform a wider range
of tasks. They can be used to set alarms, make appointments, control smart home devices, and more. NLP is used to help
virtual assistants understand user requests and complete tasks.
Quiz
• What do you think helped the Google Assistant recognise that the restaurant
employee had misheard the date (Google Assistant: "… Wednesday, the 7th …" →
restaurant employee: "… for seven persons …")?
Day 1: Introduction to AI
• Introduction to AI:
• A brief review of AI History
• Divide the years since (around) 1950 into key periods that have shaped the development of AI:
for example, the 1990s-2000s saw the resurgence of ML and neural networks
• AI Applications: State of the Art
• Delve into advances in topics like healthcare, NLP and virtual assistants, finance and fraud
detection, computer vision, …
• Show a video or two
• AI, Machine Learning, and Deep Learning
• Define each, show their relationship, discuss each for several minutes by going into each
one’s sub-topics
• AI: problem solving techniques (search methods, heuristic, etc.)
• ML: supervised/unsupervised/semisupervised learning, reinforcement learning, feature
engineering
• DL: MLP, CNN, RNN, Generative Models
Search Methods
• These are systematic approaches to exploring a problem space and finding
the optimal solution. Here are some common types:
• Breadth-first search: Expands all possible states at the current level before moving
on to the next, guaranteeing an optimal solution but potentially inefficient for large
problems.
• Depth-first search: Goes down one path until it reaches a goal or dead end, then
backtracks and tries another path. Can be faster than breadth-first for some
problems, but may miss the optimal solution if trapped in a deep dead end.
• A* search: Combines the path cost accumulated so far with a heuristic function that estimates
the cost of reaching the goal from each state. Prioritizes states with the lowest combined
estimate, aiming for an efficient and optimal solution.
• Best-first (greedy) search: Similar to A*, but considers only the heuristic estimate to the goal
and ignores the cost already incurred. It can be fast, but does not guarantee an optimal solution;
useful in settings such as game playing where immediate advantages matter.
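The sketch below is a minimal, illustrative A* implementation; the toy graph, edge costs, and heuristic values are made up for this example. It shows how A* combines the cost accumulated so far (g) with a heuristic estimate of the remaining cost (h); setting h to zero gives uniform-cost search, and ignoring g gives greedy best-first search.

# Minimal A* sketch on a made-up weighted graph (illustrative only)
import heapq

graph = {                      # edge costs between nodes
    'A': {'B': 1, 'C': 4},
    'B': {'C': 2, 'D': 5},
    'C': {'D': 1},
    'D': {},
}
h = {'A': 3, 'B': 2, 'C': 1, 'D': 0}   # heuristic: estimated cost to reach goal 'D'

def a_star(start, goal):
    # priority queue ordered by f = g (cost so far) + h (estimated remaining cost)
    frontier = [(h[start], 0, start, [start])]
    visited = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if node in visited:
            continue
        visited.add(node)
        for nbr, cost in graph[node].items():
            if nbr not in visited:
                heapq.heappush(frontier, (g + cost + h[nbr], g + cost, nbr, path + [nbr]))
    return None, float('inf')

print(a_star('A', 'D'))   # expected: (['A', 'B', 'C', 'D'], 4)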
Day 1: Introduction to AI
• Introduction to AI:
• A brief review of AI History
• Divide the years since (around) 1950 into key periods that have shaped the development of AI:
for example, the 1990s-2000s saw the resurgence of ML and neural networks
• AI Applications: State of the Art
• Delve into advances in topics like healthcare, NLP and virtual assistants, finance and fraud
detection, computer vision, …
• Show a video or two
• AI, Machine Learning, and Deep Learning
• Define each, show their relationship, discuss each for several minutes by going into each
one’s sub-topics
• AI: problem solving techniques (search methods, heuristic, etc.)
• ML: supervised/unsupervised/semisupervised learning,
reinforcement learning, feature engineering
• DL: MLP, CNN, RNN, Generative Models
Types of Machine Learning
• Supervised Learning
• Unsupervised Learning
• Semi-Supervised Learning
• Reinforcement Learning
Day 1: Introduction to AI
• Introduction to AI: Give a brief definition of AI (5 min)
• A brief review of AI History
• 5 min
• Divide the years since (around) 1950 into key periods that have shaped the development of AI: for
example, the 1990s-2000s saw the resurgence of ML and neural networks
• AI Applications: State of the Art
• 30 min
• Delve into advances in topics like healthcare, NLP and virtual assistants, finance and fraud detection,
computer vision, …
• Show a video or two
• AI, Machine Learning, and Deep Learning
• 40 min
• Define each, show their relationship, discuss each for several minutes by going into each one’s sub-
topics
• AI: problem solving techniques (search methods, heuristic, etc.)
• ML: supervised/unsupervised/semisupervised learning, reinforcement learning, feature engineering
• DL: MLP, CNN, RNN, Generative Models
Concluding Remarks
• We discussed the history of AI, followed by several state-of-the-art AI
technologies in healthcare, NLP, etc.
• Then we briefly discussed some techniques of AI search, ML and DL
• Finally, we discussed some precautions and ethics issues in using AI
DAY 1
• Introduction to AI
• Introduction to Google Colab
• Introduction to Orange
• Preprocessing of Data using Python
Introduction to Google Colab
• Google Colab is a free, cloud-based Jupyter notebook environment
that allows you to run Python code in your browser. It is a great way
to learn Python, prototype new ideas, and collaborate with others.
Features of Google Colab
• Free and easy to use
• Runs in your browser
• Supports Python 3
• Access to Google Drive files
• Collaboration with others
• Pre-trained machine learning models
Getting Started with Google Colab
• To get started with Google Colab, create a Google account and sign in to
the Colaboratory website (https://colab.research.google.com/). Once you
are signed in, you can create a new notebook by clicking on the "New
Notebook" button.
Writing and Running Code in Google Colab
• Google Colab notebooks are similar to Jupyter notebooks. They
consist of cells that can contain text, code, or both. To run a code
cell, click on it and press Shift+Enter (or click the run button).
Saving and Sharing Your Work
• When you are finished working on a notebook, you can save it by
clicking on the "File" menu and selecting "Save". You can also share
your notebook with others by clicking on the "File" menu and
selecting "Share".
Google Colab is a powerful tool that can be used for a variety
of tasks. It is a great way to learn Python, prototype new ideas,
and collaborate with others. If you are interested in using
Google Colab, I encourage you to check out the documentation
and tutorials.
Hands-on
1. Write a simple Python script that takes user input (e.g., name, age)
and prints a personalized greeting along with the age in months.
2. Save the work in Google Drive and share it with others.
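One possible solution sketch for the first exercise (paste it into a Colab cell and run it; the prompts and wording are just an example):

# Simple greeting script: read a name and an age, print a greeting and the age in months
name = input("Enter your name: ")
age = int(input("Enter your age in years: "))
print(f"Hello, {name}! You are about {age * 12} months old.")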
DAY 1
• Introduction to AI
• Introduction to Google Colab
• Introduction to Orange
• Preprocessing of Data using Python
Introduction to Orange
• Why Orange?
• Setting up your system
• Creating Your First Workflow
Why Orange?
• Orange is a platform built for data mining and analysis on a GUI-based workflow.
This means that you do not have to know how to code to be able to work
with Orange and mine data, crunch numbers and derive insights.

• You can perform tasks ranging from basic visuals to data manipulations,
transformations, and data mining. It consolidates all the functions of the
entire process into a single workflow.

• The best part and the differentiator about Orange is that it has some
wonderful visuals. You can try silhouettes, heat-maps, geo-maps and all
sorts of visualizations available.
Setting up your System
• Orange comes built-in with the Anaconda tool if you’ve previously
installed it. If not, follow these steps to download Orange.

• Step 1: Go to Orange Data Mining and click on Download


Step 2: Install the platform and set the working directory for
Orange to store its files

This is what the start-up page of Orange looks like. You have options that allow you to create new projects,
open recent ones or view examples and get started.
• Before we delve into how Orange works, let’s define a few key terms
to help us in our understanding:
• A widget is the basic processing point of any data manipulation. It can do a
number of actions based on what you choose in your widget selector on the
left of the screen.
• A workflow is the sequence of steps or actions that you take in your platform
to accomplish a particular task.
• For now, click on “New” and let’s start building your first workflow.
Creating Your First Workflow
• This is the first step towards building a solution to any problem. We
need to first understand what steps we need to take in order to
achieve our final goal. After you clicked on “New” in the above step,
this is what you should have come up with.
This is your blank Workflow on Orange. Now, you’re ready to explore and solve any problem by dragging
any widget from the widget menu to your workflow.
Familiarising yourself with the basics
• Orange is a platform that can help us solve most problems in Data
Science today, covering topics that range from the most basic visualizations to
training models. You can even evaluate models and perform unsupervised
learning on datasets.
• Problem
• The problem we’re looking to solve in this tutorial is the practice problem
Loan Prediction that can be accessed via this link Loan Prediction
(analyticsvidhya.com) on Datahack
Importing the data files
- We begin with the first and most essential step in understanding our data and making predictions: importing our data
- Step 1: Click on the “Data” tab on the widget selector menu and drag the widget “CSV File Import” to our blank workflow.
Step 2: Double click the “File” widget and select the file you want to load into the workflow. Import Iris dataset.
Step 3: Click on ‘Data Table’ widget
Hands-on
1. Open a new workflow in Orange.
2. Get a file widget, get some data file, and connect a data table to
this file widget. Double click on each widget and verify the data
types of each attribute and data rows.
Preprocessing1
• It is a set of techniques to transform raw data into a form that
different machine learning models can use.
• Steps in Data Preprocessing
1. Checking and handling missing values
2. Handling categorical data (Label encoding, One-hot encoding)
3. Standardize/Normalize continuous data
4. PCA transformation
5. Data splitting

1 https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6
1. Checking and Handling Missing Values
• Check for missing values
• df.isnull().any().sum()
• Drop the rows where at least one element is missing.
• df.dropna()
• Drop the cols where at least one element is missing.
• df.dropna(axis='columns')
• Impute Missing Values
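A minimal sketch of these checks, assuming a small DataFrame shaped like the House data on the next slide (column names and values are illustrative):

# Checking and dropping missing values with pandas
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Hours':   [34, 37, 38],
    'Salary':  [12000, 49000, np.nan],   # one missing value
    'House':   ['No', 'Yes', 'No'],
})

print(df.isnull().any().sum())        # number of columns with at least one missing value -> 1
print(df.dropna())                    # drops the row with the missing Salary
print(df.dropna(axis='columns'))      # drops the Salary column instead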
1. Checking and Handling Missing Values
(contd.)
• Impute Missing Values
• Imputing refers to using a model to replace missing values
• There are many options
• A constant value that has meaning within the domain, such as 0, distinct from all other
values
• A value from another randomly selected record
• A mean, median or mode value for the column
• A value estimated by another predictive model
House Data
Country Hours Salary House
France 34 12000 No
Spain 37 49000 Yes
Germany 20 34000 No
Spain 58 41000 No
Germany 40 43333 Yes
France 45 28000 Yes
Spain 40 51000 No
France 28 89000 Yes
Germany 50 53000 No
France 47 33000 Yes
Germany 38 No
DROP ROWS and COLUMNS

- Dropping rows (df.dropna()) removes the last row, which has a missing Salary value.
- Dropping columns after that (df.dropna(axis='columns')) removes nothing, because no column still contains a missing value.
- But if we switch the order and drop columns first, the result is different: the Salary column is removed instead.
IMPUTE MISSING VALUES
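A minimal imputation sketch using scikit-learn's SimpleImputer (column names follow the House data above; values are illustrative). The missing Salary is replaced by the mean of the non-missing values in that column:

# Impute missing numeric values with the column mean
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'Hours': [34, 37, 38], 'Salary': [12000, 49000, np.nan]})
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')   # 'median' or 'most_frequent' also work
df[['Hours', 'Salary']] = imputer.fit_transform(df[['Hours', 'Salary']])
print(df)   # the missing Salary becomes 30500.0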
2. Handling Categorical Data (Label Encoding,
One-hot Encoding)
• ML models are based on numerical equations and calculation of
numerical variables
• Most of the time we have columns in our dataset that are non-numeric, such as
countries, names, cities and so on. In such cases we need to convert
those columns into numeric values that can be used for further processing.
• Label Encoding
• Replace the categorical values by numbers 0, 1, …, N_class – 1
• One-Hot Encoding
- Map categorical variables to integer values
- Express each integer value as a binary vector that is all zeros except at the index
of the integer, which is marked with a 1
LabelEncoder: Encode labels with value between 0 and N_classes - 1
S.N. Country Hours Salary House
0 France 34 12000 No
1 Spain 37 49000 Yes
2 Germany 20 34000 No
3 Spain 58 41000 No
4 Germany 40 43333 Yes
5 France 45 28000 Yes
6 Spain 40 51000 No
7 France 28 89000 Yes
8 Germany 50 53000 No
9 France 47 33000 Yes

# Let's create an object of the class LabelEncoder

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0]) # All rows and first column, i.e., the Country column

(1) fit_transform → fit to the data, then transform it.
(2) .values → converts the DataFrame to a NumPy array (note: the slicing X[:, 0] above requires a NumPy array)
(3) DataFrame.values → returns a NumPy representation of the DataFrame.
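A self-contained sketch of the same idea (the column values follow the House data above; only two columns are used to keep it short):

# Label-encode the Country column of a small DataFrame
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', 'Spain'],
                   'Hours':   [34, 37, 20, 58]})
X = df.values                                    # DataFrame -> NumPy array
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])  # France -> 0, Germany -> 1, Spain -> 2 (alphabetical)
print(X)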
Usual Steps in Data Transformation:
1. Import the data transformation package
2. Instantiate a model
3. Fit the model to the data
4. Transform the data using the model

Examples of data transformation packages:


LabelEncoder, OneHotEncoder, StandardScaler (standardization),
MinMaxScaler (normalization), PCA, etc.

One-Hot-Encoding: Representation of categorical
variables as binary vectors
- Map categorical variables to integer values
- Express each integer value as a binary vector
that is all zeros except at the index of the
integer, which is marked with a 1

- onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
- "categories='auto'" → suppresses a FutureWarning in the version of OneHotEncoder used here
- X1 = np.zeros((10, 5), dtype=int) → if you do not specify dtype=int,
it will assume float, which may make the output appear in exponential form
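A hedged, runnable sketch of one-hot encoding the Country column. Note that newer scikit-learn releases renamed the sparse argument to sparse_output, so adjust to your installed version:

# One-hot encode a single categorical column
import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([['France'], ['Spain'], ['Germany'], ['Spain']])
onehot_encoder = OneHotEncoder(sparse_output=False, categories='auto')   # older versions: sparse=False
encoded = onehot_encoder.fit_transform(countries)
print(onehot_encoder.categories_)   # [array(['France', 'Germany', 'Spain'], ...)]
print(encoded.astype(int))          # each row is a binary vector with a single 1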
FOOD for THOUGHT:
Which one to choose between LabelEncoder and
OneHotEncoder, and when?
FOOD for THOUGHT:
- If there are more than 2 labels, LabelEncoder output
may be misinterpreted as an ordered (ordinal) variable.
- Best Practice:
- Use OneHotEncoder for attributes, and
LabelEncoder for the class label
- But if a categorical attribute has so many values
that one-hot encoding would lead to memory problems,
we may have to use LabelEncoder instead
QUIZ 4:
LabelEncoder output starts from 0 or 1?
3. Standardize/Normalize the Data
• Normalization makes training less sensitive to the scale of features
• All values are rescaled to lie between 0 and 1 (usually): z = (x − min(x)) / (max(x) − min(x))
• It will improve analysis for many ML models
• Normalization keeps the variance of the features comparable, so the optimization
problem does not blow up and convergence remains feasible
• Standardization
• Values are centred around 0
• z-transform: z = (x − mean) / s.d.
• Mean of z: µ = 0
• S.D. of z: σ = 1
• Useful for comparing features that have different units or scales
• Standardizing tends to make the training process well behaved because the
numerical condition of the optimization problem is improved
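A short sketch contrasting the two rescalings on a toy column (values are illustrative):

# Min-max normalization vs. z-score standardization
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[20.0], [34.0], [40.0], [58.0]])

print(MinMaxScaler().fit_transform(x).ravel())    # values rescaled into [0, 1]
print(StandardScaler().fit_transform(x).ravel())  # mean 0, standard deviation 1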
NORMALIZATION
DATA: https://storage.googleapis.com/mledudatasets/california_housing_train.csv
(Histograms of a sample of features, before vs. after normalization)
- Not much difference in the shape of the histograms
- Each feature now lies within 0 and 1
- However, the features are on a more consistent scale
- This helps in modelling
STANDARDIZATION

An Explanation:
What will happen if "scaler.fit_transform(df)" is replaced by
scaler.fit(df)
df = scaler.transform(df)
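Nothing changes: fit() learns the column means and standard deviations, transform() applies them, and fit_transform() simply does both in one call. A small sketch on made-up data:

# fit_transform vs. fit followed by transform give identical results
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

a = StandardScaler().fit_transform(data)

scaler = StandardScaler()
scaler.fit(data)
b = scaler.transform(data)

print(np.allclose(a, b))   # True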
FOOD for THOUGHT

Standardization vs Normalization: which is better and when

FOOD for THOUGHT

Standardization vs Normalization: which is better and when

1. Normalization maps every data point into a fixed range between a lower
and an upper limit. Thus it does not handle outliers as well as
standardization does.

4. PCA (Principal Components Analysis)
• PCA is a statistical procedure that uses an orthogonal transformation
to convert a set of observations of possibly correlated variables
(entities each of which takes on various numerical values) into a set of
values of linearly uncorrelated variables called principal components
• PCA is used for dimensionality reduction
• It has two major applications
• Data visualization
• Speeding up ML algorithms
4. PCA for Data Visualization
• For a lot of ML applications it helps to be able to visualize your data
• Example
• Iris data
• Use PCA to reduce 4 dimensional data to 2 or 3 dimensions; and thus understand the
data better
Load Iris Dataset
Standardize the Data
PCA Projection to 2D

PCA and Keeping the Top 2 Principal Components


Visualize 2D Projection

What does this statement do?


Explained Variance

- The explained variance tells how much information (variance) can be attributed to each of the PCs
- This is important because in the process of converting 4 dimensions to 2, we lose some information
- By using the attribute explained_variance_ratio_, we see that the 1st PC contains 72.77% of the variance and the 2nd PC 23.03%
- So, together they contain 95.8% of the information
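A condensed sketch of the Iris walk-through above, using scikit-learn's bundled copy of the Iris dataset instead of a CSV download:

# Standardize Iris, project to 2 principal components, inspect explained variance
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                        # 150 samples, 4 features
X_std = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardize first

pca = PCA(n_components=2)                   # keep the top 2 principal components
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                           # (150, 2)
print(pca.explained_variance_ratio_)        # roughly [0.73, 0.23] -> about 96% of the variance retained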
4. PCA to Speed-up ML Algorithms
• One of the most important applications of PCA is to speed up
machine learning algorithms
• Dataset
• We use the MNIST database of handwritten digits, which has 784 features, a training set
of 60k examples, and a test set of 10k examples
- There is a problem executing it directly, so download the data from
https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat
- Go to this link and click on 'download'
- Keep it in the folder "~/scikit_learn_data/mldata" => for me it is: "C:\Users\dcsman\scikit_learn_data\mldata"
- After this you can run fetch_mldata()
- The downloaded images are contained in mnist.data and the labels in mnist.target
- 784 = 28 × 28 (each image is 28×28 pixels)

Split Data into Training and Test Sets


Standardize the Data
- PCA is affected by scale, so we need to scale the features before applying PCA
- Transform the data using standard normal transformation (mean = 0, variance = 1)

Import and Apply PCA


- Notice the code below has 0.95 as a parameter
- It means scikit-learn chooses the minimum number of PCs s.t. 95% of the variance is retained
- Fit PCA on the training set only
- Apply mapping (transform) to both the training set and the test set

EXPLANATION (StandardScaler):
Why is 'fit' done using train_img only, while 'transform'
is applied to both train_img and test_img?
Apply Logistic Regression to the Transformed Data
Step 1: Import the model you want to use
Step 2: Make an instance of the Model
Step 3: Train the model on the training data, store the information learned
Step 4: Predict the labels of new data (test images)

Let numpy figure out the dimension (no. of cols)


Measuring Model Performance
- While accuracy is not always the best metric for ML algos (precision, recall, F1 score, ROC Curve,
etc. can be better in some situations), it is used here for simplicity

Timing of Fitting Logistic Regression after PCA


- The table below shows how long it took to fit logistic regression after using PCA (retaining different
amounts of variance each time)

Time taken to fit logistic regression after PCA with different fractions of variance retained
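A hedged sketch of the whole MNIST pipeline above. fetch_mldata() has been removed from recent scikit-learn releases, so this version uses fetch_openml() instead (it downloads the dataset on first run, and the fit can take a few minutes):

# PCA (95% variance retained) + logistic regression on MNIST
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)   # ~60k train / 10k test

scaler = StandardScaler().fit(train_img)   # fit on the training set only
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)      # reuse the training-set statistics

pca = PCA(0.95).fit(train_img)             # keep enough PCs to retain 95% of the variance
train_img = pca.transform(train_img)
test_img = pca.transform(test_img)

clf = LogisticRegression(max_iter=1000)
clf.fit(train_img, train_lbl)
print(clf.score(test_img, test_lbl))       # test-set accuracy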
5. Data Splitting
• Train/Test Split
• Given a percentage split between training and test data, select randomly the
training and test sets
Train/Test Split
Load the data → split it into training and test sets
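A minimal sketch of a random 80/20 split using scikit-learn (Iris is used as the example dataset; random_state is fixed for reproducibility):

# Randomly split a dataset into training and test sets
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)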
Conclusion
• In this introduction to AI we covered topics such as
• A brief review of AI history
• AI Applications: State of the art
• AI, Machine Learning and Deep Learning
• Restrictions and Constraints
• Future of AI
• Introduction to Google Colab
• Introduction to Orange
