Day 1
Manoranjan Dash
Professor and Dean
School of Computing and Data Science
FLAME University, Pune
Ex-Senior Scientist
Singapore Data Science Consortium
National University of Singapore
COURSE OUTLINE
• DAY 1
• Introduction to AI
• Introduction to Google Colab
• Introduction to Orange
• Preprocessing of Data using Python
• DAY 2
• Machine Learning Methods Using Orange
• DAY 3
• Machine Learning Methods Using Python
• DAY 4
• Machine Learning Case Studies and Application
• DAY 5
• LLM – Large Language Model
DAY 1
• Introduction to AI
• Introduction to Google Colab
• Introduction to Orange
• Preprocessing of Data using Python
Day 1: Introduction to AI
• Introduction to AI:
• A brief history of AI
• Divide the years since (around) 1950 into key periods that have shaped the development
of AI; for example, the 1990s-2000s saw the resurgence of ML and NN
• AI Applications: State of the Art
• Delve into advances in topics like healthcare, NLP and virtual assistants, finance and
fraud detection, computer vision, …
• AI, Machine Learning, and Deep Learning
• AI: problem solving techniques (search methods, heuristic, etc.)
• ML: supervised/unsupervised/semisupervised learning, reinforcement learning, feature
engineering
• DL: MLP, CNN, RNN, Generative Models
DEFINITION:
A branch of computer science that deals with
the creation and development of machines
capable of intelligent behaviour.
Virtual assistants: Virtual assistants are similar to chatbots, but they are more powerful and can perform a wider range
of tasks. They can be used to set alarms, make appointments, control smart home devices, and more. NLP is used to help
virtual assistants understand user requests and complete tasks.
Quiz
• What do you think helped the Google Assistant understand that the restaurant
person had misheard the date (Google Assistant: “… Wednesday, the 7th…” →
Restaurant person: “… for seven persons…”)?
Day 1: Introduction to AI
• Introduction to AI:
• A brief review of AI History
• Divide the years since (around) 1950 into key periods that have shaped the development of AI;
for example, the 1990s-2000s saw the resurgence of ML and NN
• AI Applications: State of the Art
• Delve into advances in topics like healthcare, NLP and virtual assistants, finance and fraud
detection, computer vision, …
• Show a video or two
• AI, Machine Learning, and Deep Learning
• Define each, show their relationship, discuss each for several minutes by going into each
one’s sub-topics
• AI: problem solving techniques (search methods, heuristic, etc.)
• ML: supervised/unsupervised/semisupervised learning, reinforcement learning, feature
engineering
• DL: MLP, CNN, RNN, Generative Models
Search Methods
• These are systematic approaches to exploring a problem space and finding
the optimal solution. Here are some common types:
• Breadth-first search: Expands all states at the current depth before moving on to
the next, guaranteeing the shallowest solution (optimal when every step costs the
same) but potentially inefficient for large problems.
• Depth-first search: Follows one path until it reaches a goal or dead end, then
backtracks and tries another path. Can be faster than breadth-first for some
problems, but may miss the optimal solution if trapped in a deep dead end.
• A* search: Orders states by the cost incurred so far plus a heuristic estimate of
the remaining cost, f(n) = g(n) + h(n). With an admissible heuristic it finds an
optimal solution efficiently.
• Greedy best-first search: Similar to A* but ranks states by the heuristic estimate
alone, ignoring the cost already incurred. Fast and useful in games like chess
where immediate advantage matters, but not guaranteed to find the optimal solution.
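To make the idea concrete, here is a minimal Python sketch of breadth-first search, assuming a toy graph given as an adjacency dictionary (the graph, start and goal states are illustrative, not from the slides):

from collections import deque

def bfs(graph, start, goal):
    """Breadth-first search: explore states level by level and
    return the path from start to goal with the fewest steps."""
    frontier = deque([[start]])   # queue of partial paths, not just nodes
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(path + [neighbour])
    return None  # goal unreachable

# Toy problem space: each key lists the states reachable from it
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
}
print(bfs(graph, "A", "F"))  # ['A', 'B', 'D', 'F']

Because the queue is explored level by level, the first path that reaches the goal is guaranteed to use the fewest steps.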
Day 1: Introduction to AI
• Introduction to AI:
• A brief review of AI History
• Divide the years since (around) 1950 into key periods that have shaped the development of AI;
for example, the 1990s-2000s saw the resurgence of ML and NN
• AI Applications: State of the Art
• Delve into advances in topics like healthcare, NLP and virtual assistants, finance and fraud
detection, computer vision, …
• Show a video or two
• AI, Machine Learning, and Deep Learning
• Define each, show their relationship, discuss each for several minutes by going into each
one’s sub-topics
• AI: problem solving techniques (search methods, heuristic, etc.)
• ML: supervised/unsupervised/semisupervised learning,
reinforcement learning, feature engineering
• DL: MLP, CNN, RNN, Generative Models
Types of Machine Learning
• Supervised Learning
• Unsupervised Learning
• Semi-Supervised Learning
• Reinforcement Learning
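A minimal scikit-learn sketch contrasting the first two types, assuming the built-in Iris data (an illustrative choice, not prescribed by the slides):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the model learns from labelled examples (X, y)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Supervised accuracy on training data:", clf.score(X, y))

# Unsupervised: the model sees only X and looks for structure on its own
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments of first 5 samples:", km.labels_[:5])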
Day 1: Introduction to AI
• Introduction to AI: Give a brief definition of AI (5 min)
• A brief review of AI History
• 5 min
• Divide the years since (around) 1950 into key periods that have shaped the development of AI; for
example, the 1990s-2000s saw the resurgence of ML and NN
• AI Applications: State of the Art
• 30 min
• Delve into advances in topics like healthcare, NLP and virtual assistants, finance and fraud detection,
computer vision, …
• Show a video or two
• AI, Machine Learning, and Deep Learning
• 40 min
• Define each, show their relationship, and discuss each for several minutes by going into each
one’s sub-topics
• AI: problem solving techniques (search methods, heuristic, etc.)
• ML: supervised/unsupervised/semisupervised learning, reinforcement learning, feature engineering
• DL: MLP, CNN, RNN, Generative Models
Concluding Remarks
• We discussed the history of AI, followed by several state-of-the-art AI
technologies in healthcare, NLP, etc.
• Then we briefly discussed some techniques of AI search, ML and DL
• Finally, we discussed some precautions and ethics issues in using AI
DAY 1
• Introduction to AI
• Introduction to Google Colab
• Introduction to Orange
• Preprocessing of Data using Python
Introduction to Google Colab
• Google Colab is a free, cloud-based Jupyter notebook environment
that allows you to run Python code in your browser. It is a great way
to learn Python, prototype new ideas, and collaborate with others.
Features of Google Colab
• Free and easy to use
• Runs in your browser
• Supports Python 3
• Access to Google Drive files
• Collaboration with others
• Pre-trained machine learning models
Getting Started with Google Colab
• To get started with Google Colab, create a Google account and sign in to
the Colaboratory website (https://colab.research.google.com/). Once you
are signed in, you can create a new notebook by clicking on the "New
Notebook" button.
Writing and Running Code in Google Colab
• Google Colab notebooks are similar to Jupyter notebooks. They
consist of cells that can contain text, code, or both. To run a code
cell, click on it and press Shift+Enter (or click the run button next to the cell).
Saving and Sharing Your Work
• When you are finished working on a notebook, you can save it by
clicking on the "File" menu and selecting "Save" (Colab also saves
automatically to your Google Drive). To share your notebook with
others, click the "Share" button in the top-right corner.
Google Colab is a powerful tool that can be used for a variety
of tasks. It is a great way to learn Python, prototype new ideas,
and collaborate with others. If you are interested in using
Google Colab, I encourage you to check out the documentation
and tutorials.
Hands-on
1. Write a simple Python script that takes user input (e.g., name, age)
and prints a personalized greeting along with the age in months.
2. Save the work in Google Drive and share it with others.
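One possible solution sketch for exercise 1 (the prompts and variable names are my own):

# Ask the user for a name and an age, then print a personalized greeting
name = input("What is your name? ")
age = int(input("How old are you (in years)? "))

months = age * 12  # convert years to months
print(f"Hello, {name}! You are about {months} months old.")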
DAY 1
• Introduction to AI
• Introduction to Google Colab
• Introduction to Orange
• Preprocessing of Data using Python
Introduction to Orange
• Why Orange?
• Setting up your system
• Creating Your First Workflow
Why Orange?
• Orange is a platform built for data mining and analysis through a GUI-based
workflow. This means you do not have to know how to code to work with Orange:
you can mine data, crunch numbers and derive insights visually.
• You can perform tasks ranging from basic visuals to data manipulations,
transformations, and data mining. It consolidates all the functions of the
entire process into a single workflow.
• The best part, and Orange's key differentiator, is its wonderful visuals. You
can try silhouettes, heat maps, geo maps and all sorts of other
visualizations.
Setting up your System
• Orange comes bundled with Anaconda if you have installed that
previously. If not, follow these steps to download Orange.
This is what the start-up page of Orange looks like. You have options that allow you to create new projects,
open recent ones or view examples and get started.
• Before we delve into how Orange works, let’s define a few key terms
to help us in our understanding:
• A widget is the basic processing point of any data manipulation. It can do a
number of actions based on what you choose in your widget selector on the
left of the screen.
• A workflow is the sequence of steps or actions that you take in your platform
to accomplish a particular task.
• For now, click on “New” and let’s start building your first workflow.
Creating Your First Workflow
• This is the first step towards building a solution to any problem: we
first need to understand what steps to take to achieve our final goal.
After clicking “New” in the step above, you should see the screen below.
This is your blank Workflow on Orange. Now, you’re ready to explore and solve any problem by dragging
any widget from the widget menu to your workflow.
Familiarising yourself with the basics
• Orange is a platform that can help us solve most problems in Data
Science today, with tools ranging from the most basic visualizations to
model training. You can even evaluate models and perform unsupervised
learning on datasets.
• Problem
• The problem we’re looking to solve in this tutorial is the practice problem
Loan Prediction that can be accessed via this link Loan Prediction
(analyticsvidhya.com) on Datahack
Importing the data files
- We begin with the first and necessary step to understanding our data and making predictions: importing the data.
- Step 1: Click on the “Data” tab on the widget selector menu and drag the “File” widget (or “CSV File Import” for CSV files) to the blank workflow.
- Step 2: Double click the “File” widget and select the file you want to load into the workflow. Import the Iris dataset.
- Step 3: Connect a “Data Table” widget to the file widget and click on it.
Hands-on
1. Open a new workflow in Orange.
2. Get a file widget, get some data file, and connect a data table to
this file widget. Double click on each widget and verify the data
types of each attribute and data rows.
Preprocessing¹
• It is a set of techniques to transform raw data into a form that
different machine learning models can use.
• Steps in Data Preprocessing
1. Checking and handling missing values
2. Handling categorical data (Label encoding, One-hot encoding)
3. Standardize/Normalize continuous data
4. PCA transformation
5. Data splitting
¹ https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6
1. Checking and Handling Missing Values
• Check for missing values
• df.isnull().any().sum() counts the columns that contain at least one missing value (df.isnull().sum().sum() gives the total number of missing cells)
• Drop the rows where at least one element is missing.
• df.dropna()
• Drop the cols where at least one element is missing.
• df.dropna(axis='columns')
• Impute Missing Values
1. Checking and Handling Missing Values
(contd.)
• Impute Missing Values
• Imputing refers to using a model to replace missing values
• There are many options
• A constant value that has meaning within the domain, such as 0, distinct from all other
values
• A value from another randomly selected record
• A mean, median or mode value for the column
• A value estimated by another predictive model
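A small pandas/scikit-learn sketch of these options, assuming a toy frame modelled on the House Data table that follows (mean imputation shown; the values are illustrative):

import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative frame with one missing Salary, as in the House Data table
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Salary":  [12000, 49000, None],
})

print(df.isnull().sum())   # missing values per column
df_drop = df.dropna()      # drop rows with any missing value

# Mean imputation: replace the missing Salary with the column mean
imputer = SimpleImputer(strategy="mean")
df[["Salary"]] = imputer.fit_transform(df[["Salary"]])
print(df)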
House Data (note the missing Salary in the last row)
Country Hours Salary House
France 34 12000 No
Spain 37 49000 Yes
Germany 20 34000 No
Spain 58 41000 No
Germany 40 43333 Yes
France 45 28000 Yes
Spain 40 51000 No
France 28 89000 Yes
Germany 50 53000 No
France 47 33000 Yes
Germany 38 No
DROP ROWS and COLUMNS
One-Hot-Encoding: Representation of categorical
variables as binary vectors
- Map categorical variables to integer values
- Express each integer values as a binary vector
that is all zero values except at the index of the
integer, which is marked with a 1
- onehot_encoder = OneHotEncoder(sparse=False, categories='auto')
- categories='auto' → suppresses a FutureWarning raised by older
versions of OneHotEncoder
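A minimal sketch of one-hot encoding with scikit-learn (the country values are illustrative; note that scikit-learn >= 1.2 renames the sparse argument to sparse_output):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

# sparse_output=False returns a dense array; in scikit-learn < 1.2
# the same option is spelled sparse=False, as on the slide
encoder = OneHotEncoder(sparse_output=False, categories="auto")
onehot = encoder.fit_transform(countries)

print(encoder.categories_)  # integer index assigned to each category
print(onehot)
# France  -> [1. 0. 0.]
# Germany -> [0. 1. 0.]
# Spain   -> [0. 0. 1.]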
4. PCA (Principal Components Analysis)
• PCA is a statistical procedure that uses an orthogonal transformation
to convert a set of observations of possibly correlated variables
(entities each of which takes on various numerical values) into a set of
values of linearly uncorrelated variables called principal components
• PCA is used for dimensionality reduction
• It has two major applications
• Data visualization
• Speeding up ML algorithms
4. PCA for Data Visualization
• For a lot of ML applications it helps to be able to visualize your data
• Example
• Iris data
• Use PCA to reduce 4 dimensional data to 2 or 3 dimensions; and thus understand the
data better
Load Iris Dataset
Standardize the Data
PCA Projection to 2D
- The explained variance tells us how much information (variance) can be attributed to each of the PCs
- This is important because in converting 4 dimensions to 2, we lose some information
- Using the attribute explained_variance_ratio_, we see that the 1st PC carries 72.77% of the variance and the 2nd PC 23.03%
- So, together they retain 95.8% of the information
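A sketch of the three steps above (load, standardize, project to 2D), assuming scikit-learn's built-in Iris loader:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Step 1: load the Iris dataset (4 features per flower)
X, y = load_iris(return_X_y=True)

# Step 2: standardize so each feature has mean 0 and variance 1
X_std = StandardScaler().fit_transform(X)

# Step 3: project onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

# Fraction of the original variance each component retains
print(pca.explained_variance_ratio_)  # roughly [0.73, 0.23]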
4. PCA to Speed-up ML Algorithms
• One of the most important applications of PCA is to speed up
machine learning algorithms
• Dataset
• We use the MNIST database of handwritten digits, which has 784 features, a training set
of 60k examples, and a test set of 10k examples
- There may be a problem executing fetch_mldata() directly, so download the data from
https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat
- Go to this link, and click on ‘download’
- Keep it in the folder “~/scikit_learn_data/mldata” => for me it is: “C:\Users\dcsman\scikit_learn_data\mldata”
- After this you can run fetch_mldata()
- The downloaded images are contained in mnist.data and the labels in mnist.target
- 784 features = 28 x 28 pixels per image
- Note: recent versions of scikit-learn have removed fetch_mldata; use fetch_openml('mnist_784') instead
EXPLANATION (StandardScaler):
Why is ‘fit’ done using train_img only, while ‘transform’ is applied to both
train_img and test_img? Fitting on the training set alone prevents information
about the test set from leaking into the model; the same training-set mean and
variance are then used to scale both sets.
Apply Logistic Regression to the Transformed Data
Step 1: Import the model you want to use
Step 2: Make an instance of the Model
Step 3: Train the model on the training data, store the information learned
Step 4: Predict the labels of new data (test images)
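A sketch of the four steps, assuming fetch_openml (the current replacement for fetch_mldata) and a PCA retaining 95% of the variance; the split fractions and random seed are illustrative:

# Step 1: import the model you want to use
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

mnist = fetch_openml("mnist_784", version=1, as_frame=False)
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7, random_state=0)

# Fit the scaler on training data only, then transform both sets
scaler = StandardScaler().fit(train_img)
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)

# A float 0 < n_components < 1 keeps that fraction of the variance
pca = PCA(n_components=0.95).fit(train_img)
train_img = pca.transform(train_img)
test_img = pca.transform(test_img)

# Step 2: make an instance of the model
clf = LogisticRegression(max_iter=1000)
# Step 3: train the model on the training data
clf.fit(train_img, train_lbl)
# Step 4: predict the labels of new data (test images)
print(clf.predict(test_img[:5]), clf.score(test_img, test_lbl))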
Time taken to fit logistic regression after PCA with different fractions of variance retained
5. Data Splitting
• Train/Test Split
• Given a percentage split between training and test data, select randomly the
training and test sets
Train/Test Split
Load the data → split it into training and test sets
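A minimal sketch of a random train/test split with scikit-learn, assuming the Iris data and an 80/20 split (both illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% of the rows are sampled at random for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)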
Conclusion
• In this introduction to AI we covered topics such as
• A brief review of AI history
• AI Applications: State of the art
• AI, Machine Learning and Deep Learning
• Restrictions and Constraints
• Future of AI
• Introduction to Google Colab
• Introduction to Orange