Lecture 1 Introduction
CSC 484 / 584, DA 515
Introduction to Machine Learning
Fall 2024
Outline
Fundamental concepts:
What is Machine Learning? Why do we use Machine Learning?
Types of Learning
Supervised Learning
Unsupervised Learning
Recommendation
Reinforcement Learning
Design cycle
Data collection
Feature choice
Model choice
Training
Evaluation
Demo: SK-learn example, Linear Regression
The main goal for this course
Get a deeper view into the process.
the theory of machine learning
why, when, which, what…
Many people teach themselves machine learning and start
working in the area, only to realize later that they are
missing a lot of the foundations.
On the practical side of ML, the Internet is full of
resources for you to learn ML.
What is Machine Learning?
Definition 1: Machine Learning is the science (and art) of
programming computers so they can learn from data.
Definition 2: A computer program is said to learn from
experience E (or data D) with respect to some task T, and
some performance measure P, if its performance on T, as
measured by P, improves with experience E (or data D).
- For example: think about how a doctor diagnoses an
illness.
Why Use Machine Learning?
Consider how you would write a spam filter program
The traditional approach: you figure out the rules
What are the rules you can use?
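A minimal sketch of the traditional, rule-based approach. The phrase list and the threshold are hypothetical and chosen only for illustration; a real filter would need many more rules, and you would have to keep updating them by hand.

```python
# Hypothetical hand-written rules: phrases and threshold are illustrative only.
SPAM_PHRASES = ("free money", "act now", "winner", "click here")

def is_spam_by_rules(email_text: str, threshold: int = 2) -> bool:
    """Flag an email as spam if it contains at least `threshold` suspicious phrases."""
    text = email_text.lower()
    hits = sum(phrase in text for phrase in SPAM_PHRASES)
    return hits >= threshold

print(is_spam_by_rules("You are a WINNER! Click here for free money"))  # True
print(is_spam_by_rules("Meeting moved to 3pm, see agenda attached"))    # False
```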
Why Use Machine Learning?
Consider how you would write a spam filter program
Machine Learning approach: learning rules from data
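A small sketch of the machine learning approach, assuming scikit-learn is installed. The six emails and their labels are made up for illustration, and Naïve Bayes on word counts is just one possible model choice. Nobody writes the rules here; the model estimates them from the labeled examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled data (made up): 1 = spam, 0 = not spam.
emails = [
    "win free money now", "free prize click here", "claim your free money prize",
    "meeting agenda for monday", "lecture notes attached", "project meeting notes for monday",
]
labels = [1, 1, 1, 0, 0, 0]

# Turn each email into word counts, then fit a Naive Bayes classifier on them.
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(emails, labels)

new_predictions = spam_model.predict(["free money prize", "monday meeting notes"])
print(new_predictions)  # [1 0]
```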
Why Use Machine Learning?
Classification: Looking for a Function: y = f(X)
Simple regression
y = f(x) => y = ax + b
• The relation y = f(X) is very hard to figure out
⁻ Speech Recognition
y = f(audio signal) => “How are you”
⁻ Image Recognition
y = f(image) => “dog”
Machine Learning approach
Why Is Machine Learning Possible?
Mass Storage
More data available
Higher-performance computers and distributed computing
Larger memory in handling the data
Greater computational power for calculating and even
online learning
More and more Algorithms
Types of Learning
Supervised learning (Classification)
Training data includes desired outputs
Unsupervised learning (Clustering)
Training data does not include desired outputs
Recommendation (Collaborative Learning)
Semi-supervised learning
Training data includes a few desired outputs
Reinforcement learning
Rewards from a sequence of actions
Learning from delayed feedback by interacting with the
environment
Supervised Learning
In supervised learning, the training data you feed to the
algorithm includes the desired solutions, called labels.
Example: learn how to classify new emails
Supervised Learning Algorithms
k-Nearest Neighbors
Linear Regression/Generalized Regression
Logistic Regression
Support Vector Machines (SVMs)
Decision Trees
Random Forests
Neural Networks
Naïve Bayes
Unsupervised Learning
In unsupervised learning, the training data you feed to the
algorithm is unlabeled.
An unlabeled training set for unsupervised learning
Clustering
Visualization: https://projector.tensorflow.org/
Unsupervised Machine Learning Algorithms
Clustering
k-Means
Hierarchical Cluster Analysis (HCA)
Expectation Maximization
Visualization and dimensionality reduction
Principal Component Analysis (PCA)
Kernel PCA
Locally-Linear Embedding (LLE)
t-distributed Stochastic Neighbor Embedding (t-SNE)
Association rule learning (Skipped)
Apriori
Eclat
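Two of the algorithms listed above, sketched with scikit-learn on synthetic data (two made-up Gaussian blobs in 5 dimensions). k-Means groups the points without seeing any labels, and PCA projects them to 2-D so they could be plotted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic, unlabeled data: two well-separated blobs in 5 dimensions.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 5))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 5))
X = np.vstack([blob_a, blob_b])

# Clustering: assign each point to one of two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project 5-D points onto the top 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X)

print(cluster_ids[:5], cluster_ids[-5:])
print(X_2d.shape)  # (100, 2)
```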
Semi-supervised Learning
Some algorithms can deal with partially labeled training
data, usually a lot of unlabeled data and a little bit of
labeled data.
Example: Google Photos
When you upload all your family photos to the service, it automatically
recognizes that the same person A shows up in photos 1, 5, and
11, while another person B shows up in photos 2, 5, and 7. This is
the unsupervised part of the algorithm (clustering).
Then the system needs you to tell it who these people are. Just one
label per person, and it is able to name everyone in every photo,
which is useful for searching photos.
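The two steps of the photo example can be sketched as follows. This is not how Google Photos works internally; the 2-D "face embeddings", the names, and the photo indices are hypothetical. Step 1 clusters the unlabeled photos, and step 2 spreads one user-provided label per person over the whole cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D face embeddings: 20 photos of each of two people.
rng = np.random.default_rng(1)
person_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
person_b = rng.normal(loc=[4.0, 4.0], scale=0.3, size=(20, 2))
photos = np.vstack([person_a, person_b])

# Step 1 (unsupervised): group photos that look like the same person.
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(photos)

# Step 2 (a few labels): the user names one photo per person.
user_labels = {0: "Alice", 20: "Bob"}  # photo index -> name (made up)
cluster_to_name = {cluster_ids[i]: name for i, name in user_labels.items()}

# Every photo inherits the name attached to its cluster.
names = [cluster_to_name[c] for c in cluster_ids]
print(names[0], names[-1])  # Alice Bob
```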
Reinforcement Learning
The learning system, called an agent, can observe the
environment, select and perform actions, and get rewards
in return (or penalties in the form of negative rewards)
It must then learn by itself what the best strategy, called a policy, is.
A policy defines what action the agent should choose when it is in a
given situation.
Reinforcement Learning
Learning a policy: A sequence of outputs
No supervised output but delayed reward
Credit assignment problem
Game playing
Robot in a maze
Multiple agents, partial observability, ...
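A minimal sketch of learning a policy from delayed reward, using tabular Q-learning (one common RL algorithm, chosen here only as an illustration). The environment is a made-up 5-cell corridor: the agent starts at cell 0 and is rewarded only when it reaches cell 4. The learning rate, discount, and exploration values are assumptions.

```python
import numpy as np

N_STATES = 5          # corridor cells 0..4; reward only at the right end
ACTIONS = (-1, +1)    # action 0 = move left, action 1 = move right
rng = np.random.default_rng(0)
q_table = np.zeros((N_STATES, len(ACTIONS)))
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        if rng.random() < epsilon:
            action = int(rng.integers(len(ACTIONS)))          # explore
        else:
            best = np.flatnonzero(q_table[state] == q_table[state].max())
            action = int(rng.choice(best))                    # exploit, random tie-break
        next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0   # delayed reward
        # Q-learning update: move the estimate toward reward + discounted future value.
        target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state

# The learned policy: best action in each non-terminal state.
policy = np.argmax(q_table[:-1], axis=1)
print(policy)  # [1 1 1 1], i.e. always move right
```

The reward arrives only at the end, yet the earlier cells still learn to move right; this is the credit assignment problem being solved through the discounted update.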
Another criterion used to classify ML systems: Batch and Online Learning
- Whether or not they can learn incrementally
(1) Batch (offline) learning (2) Online (on-the-fly) learning
Example of online learning (such as stock price prediction)
A model is trained and launched into production, and then it keeps learning as
new data comes in
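A sketch of online learning with scikit-learn's `SGDRegressor.partial_fit`, which updates the model one mini-batch at a time. The data stream is simulated (y = 3x + 2 plus noise) and the learning-rate settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
online_model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)

# Simulated stream: each iteration is a new batch arriving in production.
for batch in range(200):
    X_batch = rng.uniform(-1, 1, size=(20, 1))
    y_batch = 3.0 * X_batch[:, 0] + 2.0 + rng.normal(scale=0.1, size=20)
    online_model.partial_fit(X_batch, y_batch)  # incremental update, no retraining from scratch

# The estimates should end up close to the true slope 3 and intercept 2.
print(online_model.coef_, online_model.intercept_)
```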
Instance-Based Versus Model-Based Learning
A third way to categorize Machine Learning systems
Whether they work by simply comparing new data points to known
data points, or instead detect patterns in the training data and build a
predictive model, much like scientists do
instance-based versus model-based learning
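The difference shows up clearly on a toy example (data made up so that y = 2x). k-Nearest Neighbors compares the new point to stored points, while linear regression builds a model of the pattern and can extrapolate it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X_known = np.array([[1.0], [2.0], [3.0], [4.0]])
y_known = np.array([2.0, 4.0, 6.0, 8.0])  # y = 2x
X_new = np.array([[10.0]])

# Instance-based: average the targets of the 2 closest known points (x=3 and x=4).
knn = KNeighborsRegressor(n_neighbors=2).fit(X_known, y_known)
knn_prediction = knn.predict(X_new)[0]

# Model-based: fit a line to the data, then evaluate it at the new point.
linear = LinearRegression().fit(X_known, y_known)
linear_prediction = linear.predict(X_new)[0]

print(knn_prediction)     # 7.0
print(linear_prediction)  # approximately 20
```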
Applications
Application: Character recognition
Automated mail sorting, processing bank checks
Scanner captures an image of the text
Image is converted into constituent characters
Different Algorithms
Application: Fingerprint recognition
Application: Image Segmentation
Application: Brain Tissue Segmentation
More Applications (Book, p. 5)
Analyzing images of products on a production line to automatically
classify them
Detecting tumors in brain scans
Automatically classifying news articles
Automatically flagging offensive comments on discussion forums
Summarizing long documents automatically
Creating a chatbot or a personal assistant
Forecasting your company’s revenue next year, based on many
performance metrics
Making your app react to voice commands
Detecting credit card fraud
Segmenting clients based on their purchases so that you can design
a different marketing strategy for each segment
Representing a complex, high-dimensional dataset in a clear and
insightful diagram
Recommending a product that a client may be interested in, based
ChatGPT (Nov. 2022)
ChatGPT is a chatbot launched by OpenAI
It is built on top of OpenAI's GPT-3 family of large language
models (now GPT-4o)
o supervised learning and
o reinforcement learning
o Similar work: Copilot (Microsoft), Gemini (Google), LLaMA (Meta),
etc.
o What can you do with LLMs:
- Demo: Copilot from Microsoft
A typical machine learning system
A machine learning system contains
A sensor
A preprocessing mechanism
A feature extraction mechanism (manual or automated)
A classification algorithm
A set of examples (training set) already classified or described
Common Machine Learning Algorithms
1. Supervised:
Regression: Linear Regression and more
Classification
Naïve Bayes Classifier Algorithm
Support Vector Machine Algorithm
Decision Trees
Nearest Neighbors
Logistic Regression
Artificial Neural Networks
Random Forests
2. Unsupervised:
K-Means Clustering Algorithm
Hierarchical Cluster Analysis (HCA)
PCA ...
I will try to explain these algorithms to you in this semester.
The design cycle
Data collection
Probably the most time-intensive component of a PR project
How many examples are enough?
Feature Selection/Engineering
Critical to the success of the PR problem
“Garbage in, garbage out”
Requires basic prior knowledge
Model choice
Statistical, neural and structural approaches
Parameter settings
Training
Given a feature set and a “blank” model, adapt the model to explain
the data
Supervised, unsupervised and reinforcement learning
Evaluation
How well does the trained model do?
Overfitting vs. generalization
Features
Features: These are measurable quantities obtained
from the patterns, and the classification task is based
on their respective values.
$x_1, \ldots, x_l$
Feature vectors: A number of features constitute
the feature vector
$\mathbf{x} = [x_1, \ldots, x_l]^T \in \mathbb{R}^l$
Feature vectors are treated as random vectors.
Features
The combination of d features is represented as a d-
dimensional column vector called a feature vector
The d-dimensional space defined by the feature vector is
called the feature space
Objects are represented as points in feature space. This
representation is called a scatter plot
Feature extraction
Task: to extract features which are good for classification.
Good features:
• Objects from the same class have similar feature values.
• Objects from different classes have different values.
“Good” features “Bad” features
More feature properties
Consider the following scenario:
Classification
A fish processing plant wants to automate the process of sorting
incoming fish according to species (salmon or sea bass)
The automation system consists of
a conveyor belt for incoming products
two conveyor belts for sorted products
a pick-and-place robotic arm
a vision system with an overhead CCD camera
a computer to analyze images and control the robot arm
From [Duda, Hart and Stork, 2001]
Improving the performance of our ML system
We first use “length” as a feature for classification
Determined to achieve a recognition rate of 95%, we try a number of
features such as width, area, position of the eyes w.r.t. mouth...
Finally we find a “good” feature: average intensity of the scales, and
we reach a recognition rate of 93%.
Improving the performance of our ML system
We combine “length” and “average intensity of the
scales” to improve class separability
We compute a linear discriminant function to separate
the two classes, and obtain a classification rate of 95.7%
Task: maximization of
classification accuracy.
Task: minimization of
classification error.
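A sketch of combining the two features with a linear discriminant, using scikit-learn's `LinearDiscriminantAnalysis`. The fish measurements are synthetic (the means and spreads are invented), so the accuracies will not match the 93% and 95.7% quoted on the slides; the point is only that two features separate the classes better than length alone.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic fish: columns are [length, average scale intensity] (made-up values).
rng = np.random.default_rng(0)
salmon = rng.normal(loc=[60.0, 3.0], scale=[8.0, 1.0], size=(200, 2))
sea_bass = rng.normal(loc=[75.0, 6.0], scale=[8.0, 1.0], size=(200, 2))
X = np.vstack([salmon, sea_bass])
y = np.array([0] * 200 + [1] * 200)  # 0 = salmon, 1 = sea bass

# One feature (length) versus both features.
length_only = LinearDiscriminantAnalysis().fit(X[:, [0]], y)
both_features = LinearDiscriminantAnalysis().fit(X, y)

acc_length = length_only.score(X[:, [0]], y)
acc_both = both_features.score(X, y)
print(f"length only: {acc_length:.3f}, length + intensity: {acc_both:.3f}")
```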
Cost versus Classification rate
Our linear classifier was designed to minimize the overall
misclassification rate
Is this the best objective function for our fish processing plant?
The cost of misclassifying salmon as sea bass is that the end customer will
occasionally find a tasty piece of salmon when he purchases sea bass
The cost of misclassifying sea bass as salmon is an end customer upset
when he finds a piece of sea bass purchased at the price of salmon
Intuitively, we could adjust the decision boundary to minimize this
cost function
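One way to make this adjustment concrete is to pick the label with the lower expected cost instead of the more probable label. The two cost values below are hypothetical; they only encode that selling sea bass as salmon is considered the worse mistake.

```python
# Hypothetical costs (arbitrary units).
COST_BASS_SOLD_AS_SALMON = 5.0   # upset customer
COST_SALMON_SOLD_AS_BASS = 1.0   # customer gets a pleasant surprise, plant loses a little

def decide(p_salmon: float) -> str:
    """Label a fish 'salmon' only if that choice has the lower expected cost."""
    cost_if_salmon = (1.0 - p_salmon) * COST_BASS_SOLD_AS_SALMON
    cost_if_bass = p_salmon * COST_SALMON_SOLD_AS_BASS
    return "salmon" if cost_if_salmon < cost_if_bass else "sea bass"

print(decide(0.7))  # sea bass
print(decide(0.9))  # salmon
```

With these costs the decision boundary moves from p = 0.5 to p = 5/6: the classifier must be quite sure before it calls a fish salmon.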
The issue of generalization
The recognition rate of our linear classifier (95.7%) met the design
specs, but we still think we can improve the performance of the
system
We then design an artificial neural network with five hidden layers, a
combination of logistic and hyperbolic tangent activation functions, train
it with the Levenberg-Marquardt algorithm and obtain an impressive
classification rate of 99.9975% with the following decision boundary
The issue of generalization
Satisfied with our classifier, we integrate the system and
deploy it to the fish processing plant
After a few days, the plant manager calls to complain
that the system is misclassifying an average of 25% of
the fish
What went wrong?
Overfitting and underfitting
Problem: Training vs Testing
underfitting good fit overfitting
Avoid overfitting/underfitting
Dataset splitting: split your data into two sets (training and testing)
Training error vs. testing error
Underfitting: the model is too simple, so even the training error is high.
Overfitting: the training error is low (i.e., your
model makes few mistakes on the training set),
but the generalization error (on the testing set) is
high.
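A sketch of how a train/test split exposes both problems, assuming scikit-learn and synthetic data (a noisy sine curve). Decision trees of different depths stand in for models that are too simple, reasonable, and too flexible; the specific depths are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: y = sin(x) + noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for depth in (1, 3, None):  # None = grow the tree until it memorizes the training set
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_mse = float(np.mean((tree.predict(X_train) - y_train) ** 2))
    test_mse = float(np.mean((tree.predict(X_test) - y_test) ** 2))
    results[depth] = (train_mse, test_mse)
    print(f"max_depth={depth}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The unlimited tree reaches zero training error, but its test error stays well above zero. That gap between the two errors is the signature of overfitting.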
Bias, Variance, Residual Error
Assume:
(1) There is an underlying model, and we use data to estimate it.
(2) The data is not accurate, so we allow some errors.
Total Error = ModelError (Bias) + SampleError (Variance) + Residual
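For squared-error loss, the decomposition above has a standard precise form. Assume the data follow $y = f(x) + \varepsilon$ with noise of mean 0 and variance $\sigma^2$, and let $\hat f$ be the model fitted on a random training sample (these symbols are introduced here and do not appear on the slides). Note that the bias enters squared:

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{Bias}^2\ \text{(model error)}}
  + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{Variance (sample error)}}
  + \underbrace{\sigma^2}_{\text{residual (irreducible) error}}
```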
DA 515 vs. DA 516 will cover:
DA 515:
Basic Knowledge
Traditional algorithms
SK-learn
DA 535: Deep Learning
Convolutional Neural Network
Deep neural Learning
Recurrent Neural Network
Transformer, GPT
Reinforcement Learning
Keras/Tensorflow
DEMO: Linear Regression
Example: Linear Regression
Linear Regression
Formula
Gradient descent
Sk-Learn
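A sketch of gradient descent for the simple model y = ax + b, minimizing the mean squared error. The data are synthetic (true slope 3, intercept 4), and the learning rate and step count are assumptions that happen to work for this data.

```python
import numpy as np

# Synthetic data: y = 3x + 4 + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = 4.0 + 3.0 * x + rng.normal(scale=0.2, size=100)

a, b = 0.0, 0.0        # slope and intercept, starting from zero
learning_rate = 0.1
for step in range(2000):
    error = (a * x + b) - y
    grad_a = 2.0 * np.mean(error * x)  # d(MSE)/da
    grad_b = 2.0 * np.mean(error)      # d(MSE)/db
    a -= learning_rate * grad_a        # step against the gradient
    b -= learning_rate * grad_b

print(a, b)  # close to the least-squares solution, near 3 and 4
```

For linear regression a closed-form solution also exists (for example via `np.polyfit(x, y, 1)`); gradient descent matters because the same loop carries over to models with no closed form.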
Regression-1: Evaluation
How to measure your model
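Common choices for measuring a regression model are the mean squared error (MSE), its square root (RMSE), and the R² score. A small sketch with scikit-learn, on four made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors: (0.25 + 0 + 0.25 + 0) / 4
rmse = np.sqrt(mse)                       # same units as y
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot = 1 - 0.5 / 20

print(mse)  # 0.125
print(r2)   # 0.975
```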
Regression-2:
find the optimum fitness function
Optimization problem
Optimization Methods: Gradient Descent, Stochastic …..
Hidden behind:
Gradient Descent (from Internet)
Demo: 1. A Linear Regression.ipynb
1. B Decision Tree.ipynb
Data preparation
X_train, y_train, X_test, y_test
SK-Learn Library
1. Model representation
lin_reg = LinearRegression()
2. Training (Optimization)
lin_reg.fit(X_train, y_train)
3. Testing
predictions = lin_reg.predict(X_test)
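The three steps above, put together as a self-contained script. This is a sketch, not the course notebook: the data are synthetic (y = 3x + 4 plus noise), and the split ratio and random seeds are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Data preparation: synthetic data, then a train/test split.
rng = np.random.default_rng(42)
X = rng.uniform(0, 2, size=(100, 1))
y = 4.0 + 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Model representation
lin_reg = LinearRegression()
# 2. Training (optimization)
lin_reg.fit(X_train, y_train)
# 3. Testing
predictions = lin_reg.predict(X_test)

test_mse = mean_squared_error(y_test, predictions)
print(lin_reg.coef_, lin_reg.intercept_, test_mse)
```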
Summary of Chapter 1
Data collection
Probably the most time-intensive component of a PR project
How many examples are enough?
Feature choice
Critical to the success of the PR problem
“Garbage in, garbage out”
Requires basic prior knowledge
Model choice
Statistical, neural and structural approaches
Parameter settings
Training
Given a feature set and a “blank” model, adapt the model to explain
the data
Supervised, unsupervised and reinforcement learning
Evaluation
How well does the trained model do?
Overfitting vs. generalization
Setup Python Environment
We will use Jupyter Notebook throughout
this class
Installation instructions are given separately.
Software
Learning Python
Google Developer Python Tutorial
https://developers.google.com/edu/python/
NumPy Tutorial
https://www.tutorialspoint.com/numpy/
Python tutorial
http://docs.python.org/tutorial/
Python quick reference
https://www.python.org/ftp/python/doc/quick-ref.1.3.html
Software
Python Library
Scikit-learn -- machine learning in Python
https://scikit-learn.org/stable/
TensorFlow -- open-source low-level machine learning
library
https://www.tensorflow.org/
https://www.tensorflow.org/tutorials
Keras -- Python deep learning library
https://keras.io/
Resources
Kaggle Competition
https://www.kaggle.com/
Web pages:
https://machinelearningmastery.com/
https://www.geeksforgeeks.org/machine-learning/
Ready-to-use Data Science code snippets created by
industry experts https://www.dezyre.com/project/recipe-list
There are many more resources:
GitHub/YouTube/Google
End of lecture
SK-learn overview
https://scikit-learn.org/stable/
Students:
Read Chapter 1
Next Lecture: Chapter 2 (ML pipeline)
Last Several Slides
For your reference
What is the difference?
Statistics
Data Mining
Pattern Recognition
Machine Learning
Artificial Intelligence (AI)
These aren't just buzzwords we use to sound cool. To the
uninitiated, all the terms tend to sound alike, and many of
them have been used more or less interchangeably in the
popular press. However, there are subtle differences.
What is the difference?
Statistics is just about the numbers, and quantifying the data.
(descriptive vs. inferential )
There are many tools for finding relevant properties of the data but
this is pretty close to pure mathematics.
Data Mining is about using Statistics as well as other programming
methods to find patterns hidden in the data so that you can explain some
phenomenon.
Data Mining builds intuition about what is really happening in some
data and still leans a little more toward math than programming, but uses
both.
(Statistical) Pattern Recognition has more to do with the task a
Machine Learning system is trying to accomplish. It is a branch of
machine learning which works by recognizing the patterns and
regularities in data.
What is the difference?
Machine Learning is an umbrella term that covers all
technologies in which a machine is able to “learn” on its
own, without having that knowledge explicitly
programmed into it.
Machine Learning is a form of Pattern Recognition.
Machine learning is basically the idea of training machines
to recognize patterns and apply them to practical problems.
Machine Learning uses Data Mining techniques and
other learning algorithms to build models of what is
happening behind some data so that it can predict future
outcomes.
What is the difference?
Artificial Intelligence uses models built by Machine
Learning and other ways to reason about the world and
give rise to intelligent behavior whether this is playing a
game or driving a robot/car.
Artificial Intelligence has some goal to achieve by
predicting how actions will affect the model of the world
and chooses the actions that will best achieve that goal.
Summary
Statistics quantifies numbers
Pattern Recognition finds patterns
Data Mining explains patterns
Machine Learning predicts with models
Artificial Intelligence behaves and reasons
Data Analytics vs Data Science:
1. Data analysts examine large data sets to identify trends,
develop charts, and create visual presentations to help
businesses make more strategic decisions.
2. Data scientists, on the other hand, design and construct new
processes for data modeling and production using prototypes,
algorithms, predictive models, and custom analysis.