
University of Gujrat
Department of Computer Science

Title: Machine Learning
Code: CS-----
Credit Hours: 3.0
Prerequisite:
Instructor:

Aims and Objectives
1. To provide a comprehensive introduction to machine learning methods
2. To build mathematical foundations of machine learning and provide an appreciation for its applications
3. To provide experience in the implementation and evaluation of machine learning algorithms
4. To develop research interest in the theory and application of machine learning

Course Description

Machine learning (ML) studies the design and development of algorithms that learn from data and improve their performance through experience. ML refers to a set of methods that help computers learn, optimize, and adapt on their own. ML has been employed to devise algorithms for diverse applications, including object detection and identification in computer vision, sentiment analysis of a speaker or writer, disease detection and therapy planning in healthcare, product recommendation in e-commerce, learning strategies for playing games, recommending movies to customers, speech recognition systems, and fraudulent transaction detection or loan application approval in the banking sector, to name a few.

This course provides a thorough introduction to the theoretical foundations and practical applications of ML. We will learn fundamental algorithms in supervised and unsupervised learning. We will not only learn how to use ML methods and algorithms but will also try to explain the underlying theory, building on mathematical foundations. While reviewing several problems and algorithms for classification, regression, clustering, and dimensionality reduction, we will focus on the core fundamentals that unify all the algorithms. The theory discussed in class will be tested in assignments, quizzes, and exams.
Recommended Text Books
• Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, Aurélien Géron (latest edition)
• Introduction to Machine Learning, Ethem Alpaydin (latest edition)
• Machine Learning: A Probabilistic Perspective, Kevin P. Murphy

Reference Books
• The Hundred-Page Machine Learning Book, Andriy Burkov, 2019

Assessment Criteria
• Sessional: 25% (Quizzes 12%, Assignments 4%, Project/Presentation 9%)
• Mid: 25%
• Final: 50%
Sixteen-week lecture plan
(Each entry below lists the week, its lecture numbers, the topics covered, and the recommended readings.)
Week 1 (Lectures 1–2): Course Overview
• What is ML? The Turing test, traditional CS vs. ML, history of ML, AI vs. ML
• Classification and regression with examples; training and testing
• Rules vs. patterns, deterministic vs. probabilistic, certainty vs. uncertainty
• Learning: supervised, unsupervised, semi-supervised
• Labeled data sources: expert annotators, crowd
• Search: a deterministic binary classifier
• Challenges and opportunities of ML:
  o Explainability
  o Fairness and societal biases
  o ML for Social Good, ML for Development (ML4D), Speech and Language Technologies for Development (SLT4D)
Recommended readings: Murphy Ch. 1 (1.1, 1.2, 1.4.2, 1.4.3, 1.4.9); Alpaydin Ch. 1

Week 2 (Lectures 3–4): Supervised Learning
Topics:
• Features, labels, training, testing, classification, regression
• Formalizing the supervised learning setup
• Feature spaces and feature vectors
  o Sparse and dense feature vectors, one-hot vectors
  o Bag-of-words features
• Label spaces
  o Label spaces for classification (binary and multiclass) and regression
• Hypothesis spaces
  o The No Free Lunch theorem
  o Choosing the hypothesis class 𝐻 and hypothesis ℎ ∈ 𝐻
  o Various algorithms for traversing hypothesis classes:
    ▪ Pick ℎ randomly
    ▪ Try every ℎ
    ▪ Just output the label of the training data (memorizer)
• Evaluating hypotheses: loss functions and goals of optimization
  o Zero-one loss
  o Squared loss
  o Absolute loss
• Loss reduction and generalization in learning
  o Memorizers
  o Smoothing and priors
  o Tradeoff between bias and variance
• Sampling from the distribution 𝑃(𝑋, 𝑌)
  o Representative datasets
  o Training, validation, and testing
• How to split the dataset 𝐷?
  o Time series data
  o Independent and Identically Distributed (IID) data
• The weak law of large numbers
• How to prevent overfitting to test data? Do's and don'ts
• Validation sets (dev sets) and cross validation (see the sketch below)
Recommended: goals of cross validation (model selection, training, performance estimation); types of cross validation and their pros and cons – exhaustive (leave-p-out, leave-one-out), non-exhaustive (k-fold, holdout, repeated random subsampling), nested (𝑘 ∗ 𝑙 fold, 𝑘-fold with validation and test sets), bootstrapping, stratified cross validation, time series cross validation (forward chaining – rolling origin)
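To make the cross-validation material concrete, here is a minimal sketch (assuming scikit-learn and a made-up synthetic dataset, not part of the original syllabus) comparing a single holdout estimate with a 5-fold cross-validation estimate of a classifier's accuracy.

```python
# Minimal sketch: holdout vs. k-fold cross-validation (assumes scikit-learn is installed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

# Toy dataset standing in for (X, y) drawn from P(X, Y)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# Holdout: a single train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

# k-fold: average accuracy over k disjoint validation folds
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
cv_acc = cross_val_score(model, X, y, cv=kfold)

print(f"holdout accuracy: {holdout_acc:.3f}")
print(f"5-fold accuracy:  {cv_acc.mean():.3f} ± {cv_acc.std():.3f}")
```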
Week 3 (Lectures 5–6): The K-Nearest Neighbor Classifier (an instance-based, lazy, discriminative, non-linear, non-parametric classifier)
• KNN – the basics
  o Nearest neighbor classification rule
  o KNN formal definition
  o KNN decision boundaries and Voronoi tessellations
  o Properties of KNN: non-parametric, used for classification and regression, instance-based, lazy
• KNN similarity/distance measures and constraints
  o Minkowski distances (Manhattan, Euclidean, and Chebyshev)
• The KNN algorithm and implementation (see the sketch below)
  o KNN regression and classification with examples
  o Space and time complexity
  o Bias/variance tradeoff as 𝐾 → 1 and 𝐾 → n
  o Tuning the hyperparameter 𝐾
  o KNN: the good, the bad, and the ugly
• KNN error bounds as 𝑛 → ∞
  o Bayes error
  o 1-NN error as 𝑛 → ∞
• KNN enhancements
  o Parzen windows and kernels
  o K-D trees
  o Inverted lists
  o Locality sensitive hashing
• The Curse of Dimensionality
  o Demonstration and examples
  o Challenges and opportunities
  o Lower dimensional subspaces and manifolds in higher dimensional ambient space
Recommended readings: Murphy 1.1, 1.2, 1.4.2, 1.4.3, 1.4.9; videos on different values of k; a video describing nearest neighbors; a nice explanation of nearest neighbors
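As a companion to the KNN lectures, here is a minimal NumPy sketch (an illustrative implementation with a made-up toy dataset, not the course's reference code) of the k-nearest-neighbor classification rule using Euclidean distance and a majority vote.

```python
# Minimal KNN classifier sketch: Euclidean distance + majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict a label for each test point from its k nearest training points."""
    preds = []
    for x in X_test:
        # Euclidean distances from x to every training point
        dists = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k closest neighbors
        nearest = np.argsort(dists)[:k]
        # Majority vote among the neighbors' labels
        preds.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(preds)

# Tiny made-up example
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
X_test  = np.array([[0.05, 0.1], [0.95, 1.0]])
print(knn_predict(X_train, y_train, X_test, k=3))  # expected: [0 1]
```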

Week 4 (Lectures 7–8): Evaluation Metrics
• Confusion matrix (contingency tables) – binary and multi-label
  o True and false positives and negatives
  o Type I and type II errors
• Performance metrics (see the sketch below)
  o Accuracy, sensitivity (recall, TPR), specificity (TNR), precision (positive predictive value), negative predictive value
  o False acceptance rate, false rejection rate
  o Examples – pros and cons
• The need for a combined measure
  o Types of averages: AM, GM, HM
  o F-β measure, F-1 measure
• Multiclass classification
  o Any-of (multi-label) classification
  o One-of (multinomial) classification
  o Micro and macro averaging
• Gold labels and annotation of data
  o Inter-annotator agreement
  o Cohen's kappa and Krippendorff's alpha
• Evaluation of classifiers, thresholds, comparing classifiers, imbalanced classes
  o Receiver Operating Characteristic (ROC) and Precision-Recall (P-R) curves
  o ROC Area Under the Curve (AUC)
  o Equal Error Rate (EER) and biometric systems
Recommended readings: SLP3 4.7–4.9
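To ground the metric definitions, here is a minimal sketch (illustrative only; the labels below are made up) that computes a binary confusion matrix and the derived metrics both by hand and with scikit-learn.

```python
# Minimal sketch: confusion matrix and derived metrics for a binary classifier.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])  # made-up gold labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])  # made-up predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)          # positive predictive value
recall    = tp / (tp + fn)          # sensitivity / TPR
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R

print(tn, fp, fn, tp)                      # 4 1 1 4
print(precision, recall, f1)               # 0.8 0.8 0.8
print(precision_score(y_true, y_pred),     # should match the hand computation
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```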
Week 5 (Lectures 9–10): Linear Regression
• Motivation for linearity
• Revision
  o Lines, planes, hyperplanes, and vectors
    ▪ Lines and planes: normal form and slope-intercept form
    ▪ Decision boundaries with perpendicular weight vectors
    ▪ Distance between a hyperplane and a point
  o The dot product
  o The geometric interpretation of absorbing the bias term
  o Visualizing n dimensions
• Linear regression
  o Intuition
  o Derivation and implementation
    ▪ Linear regression with one variable
    ▪ Cost function: square loss, mean square loss (MSE), 1/2 mean square loss
    ▪ Motivating gradient descent
• The gradient descent algorithm (see the sketch below)
  o Gradients
  o The step size α
  o Convex and non-convex cost functions – global and local optima
  o The batch gradient descent algorithm
  o Types of gradient descent: batch, minibatch, and stochastic
• Multivariate linear regression
  o Multivariate gradient descent
  o Vectorizing the notation
• Practical issues of linear regression
  o Feature scaling, local minima, ravines, saddle points, tracking progress in GD
  o Hyperparameters: learning rate
• Polynomial regression
Recommended readings: ESLII Ch. 3; Murphy 7–7.5.1, 7.5.4
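The following minimal NumPy sketch (illustrative, with made-up data rather than a course dataset) shows batch gradient descent minimizing the 1/2 mean-square-loss for univariate linear regression, with the bias absorbed into the weight vector.

```python
# Batch gradient descent for linear regression, minimizing J(w) = 1/(2m) * ||Xw - y||^2.
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 10, size=m)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=m)   # made-up data: y ≈ 3x + 2 + noise

X = np.column_stack([np.ones(m), x])           # absorb the bias: first column of ones
w = np.zeros(2)                                # [bias, slope]
alpha = 0.01                                   # step size (learning rate)

for _ in range(5000):
    grad = X.T @ (X @ w - y) / m               # gradient of the 1/2 MSE cost
    w -= alpha * grad                          # step downhill

print(w)   # should be close to [2.0, 3.0]
```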
Week 6 (Lectures 11–12): Bias and Variance, Regularization, and Logistic Regression

Linear Regression: Bias and Variance
• How to recognize high-variance/high-bias scenarios?
  o Underfitting and overfitting
  o How to reduce bias and variance?
    ▪ Cross validation
    ▪ Feature selection
• Manual feature selection
  o Scatter diagrams and plots
  o Eyeballing those correlations!
• Regularization
  o Motivation – the overfitting problem
  o L2 regularization or Ridge regression
  o L1 regularization or Lasso regression
    ▪ Automatic feature selection
• Comparison of Ridge and Lasso regression
• Elastic Net regression – intuition

Logistic Regression (a linear, discriminative, parametric classifier)
• Intuition and derivation
• Regression for classification
• "Squishing" between 0 and 1 using a non-linear activation function: the sigmoid
• A simple sentiment classifier
• Visualizing the logistic regression decision boundary
• Hyperplanes, linear and non-linear decision boundaries
• Cost function: derivation of the cross-entropy loss function (log loss) (see the sketch below)
• Learning algorithm: batch, stochastic, and mini-batch gradient descent
• Multiclass (multinomial) classification: one-vs-all (one-vs-rest), one-vs-one
• The softmax activation function and multivariate log loss
Recommended readings: Ben Taskar's notes on under- and overfitting and on bias/variance; MLaPP 1.4.7; Andrew Ng's lecture on ML debugging; notes by Scott Foreman-Roe; ESLII 2.9; Murphy 6.2.2; SLP3 Ch. 5; ESLII Ch. 4; Murphy 8, 13.3, 3.5.3; TM chapter on Naive Bayes and Logistic Regression; a nice blog post on Gradient Descent, Adagrad, and Newton's method
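As a hands-on complement, here is a minimal NumPy sketch (illustrative, with a made-up two-dimensional dataset) of logistic regression trained by batch gradient descent on the cross-entropy (log) loss.

```python
# Logistic regression via batch gradient descent on the cross-entropy loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Made-up data: two Gaussian blobs, one per class
X0 = rng.normal([-2, -2], 1.0, size=(100, 2))
X1 = rng.normal([+2, +2], 1.0, size=(100, 2))
X = np.hstack([np.ones((200, 1)), np.vstack([X0, X1])])   # absorb the bias term
y = np.concatenate([np.zeros(100), np.ones(100)])

w = np.zeros(3)
alpha = 0.1
for _ in range(2000):
    p = sigmoid(X @ w)                         # predicted P(y=1 | x)
    grad = X.T @ (p - y) / len(y)              # gradient of the average log loss
    w -= alpha * grad

p = np.clip(p, 1e-12, 1 - 1e-12)               # avoid log(0) in the report below
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
acc = np.mean((p >= 0.5) == y)
print(w, loss, acc)                            # accuracy should be near 1.0
```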
Week 7 (Lectures 13–14): The Perceptron and Support Vector Machines

The Perceptron (a discriminative, linear, parametric classifier)
• The McCulloch-Pitts neuron and its limitations
• The perceptron and its limitations
  o The Heaviside step function
  o Boolean functions: AND, OR and XOR!
  o One perceptron, two perceptrons, …
• Linear separability in low and high dimensional spaces
• From the step function to other activation functions
• The perceptron learning algorithm and its geometric interpretation (see the sketch below)
• Proof of convergence
  o Relation between margin and rate of convergence

Maximum Margin Classifiers: Support Vector Machines (SVMs) (a discriminative, linear, parametric classifier)
• Intuition and motivation
  o The perceptron and the optimal separating hyperplane
• Hard-margin linear support vector machines: derivation
  o Constrained optimization and Lagrange multipliers
• Soft-margin linear support vector machines: derivation
• Hinge loss
• Kernels and kernel SVMs
Recommended readings: the Perceptron Wikipedia page; Murphy 8.5.4; Murphy 14.5–14.5.2.2; Ben Taskar's notes on SVMs; the Kernel Cookbook by David Duvenaud; Laurent El Ghaoui's lecture on duality; "Idiot's guide to SVM"
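The perceptron learning rule is compact enough to show in full; the following NumPy sketch (illustrative, with made-up linearly separable data) updates the weight vector only on misclassified points and stops once an epoch passes with no mistakes.

```python
# The perceptron learning algorithm: w += y_i * x_i on each misclassified point.
import numpy as np

rng = np.random.default_rng(1)
# Made-up linearly separable data with labels in {-1, +1}
X_pos = rng.normal([+2, +2], 0.5, size=(50, 2))
X_neg = rng.normal([-2, -2], 0.5, size=(50, 2))
X = np.hstack([np.ones((100, 1)), np.vstack([X_pos, X_neg])])  # bias absorbed
y = np.concatenate([np.ones(50), -np.ones(50)])

w = np.zeros(3)
for epoch in range(100):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:       # misclassified (or on the boundary)
            w += yi * xi             # rotate the hyperplane toward the mistake
            mistakes += 1
    if mistakes == 0:                # converged: all points correctly classified
        break

print(w, "converged after", epoch + 1, "epochs")
```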

Week 8 (Lectures 15–16): Neural Networks
• The neuron and linear decision boundaries. Can we do better?
  o Review of logistic regression and gradient descent
• Changing the representation of the data
  o Kernels
  o Neural networks
• Non-linear activation
• The multi-layer perceptron
• Neural networks
• Deep learning
• Intuition: how do NNs work?
  o Non-linear and complex decision boundaries
  o Universal approximation and regression
• Layers: depth vs. width
• Gradient descent for NNs
  o Overfitting and the importance of Stochastic Gradient Descent (SGD)
• Formal notation for logistic regression and NNs
  o Vectorizing LR and NNs
  o Forward propagation (LR, NN)
    ▪ Simple, vectorized with a single instance
    ▪ Vectorized with m instances
• Activation functions, gradients, and pros and cons
  o Sigmoid
  o Tanh
  o ReLU and Leaky ReLU
• Backward propagation (LR, NN) (see the sketch below)
  o Simple, vectorized with a single instance and m instances
  o Training the NN
Recommended readings: SLP Ch. 7; ESLII Ch. 11
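To connect the forward- and backward-propagation notation to code, here is a minimal NumPy sketch (illustrative only) of a one-hidden-layer network trained on XOR with sigmoid activations and full-batch gradient descent on the cross-entropy loss.

```python
# A tiny 2-layer neural network (2-4-1) learning XOR with sigmoid activations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # m = 4 instances
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR labels

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros((1, 4))   # input -> hidden
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros((1, 1))   # hidden -> output
alpha = 1.0

for _ in range(5000):
    # Forward propagation (vectorized over the m = 4 instances)
    A1 = sigmoid(X @ W1 + b1)
    A2 = sigmoid(A1 @ W2 + b2)

    # Backward propagation of the average cross-entropy loss
    dZ2 = A2 - y                              # dL/dZ2 for sigmoid + log loss
    dW2, db2 = A1.T @ dZ2 / 4, dZ2.mean(0, keepdims=True)
    dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)        # chain rule through the hidden layer
    dW1, db1 = X.T @ dZ1 / 4, dZ1.mean(0, keepdims=True)

    W2 -= alpha * dW2; b2 -= alpha * db2
    W1 -= alpha * dW1; b1 -= alpha * db1

print(A2.round(2).ravel())   # should approach [0, 1, 1, 0] for most random initializations
```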
Week 9 (Lectures 17–18): Sequence Models
• The notion of sequence modeling tasks
  o Text and speech (one-hot representations)
  o Time series
  o Videos as sequences of images
• Types of sequence modeling problems and specifications
  o One-to-many – image captioning
  o Many-to-one – video classification, sentiment classification
  o Many-to-many (Seq2Seq) – machine translation, summarization, etc.
• Why do feedforward NNs perform poorly at sequence modeling tasks?
• Introducing RNNs (see the sketch below)
  o The idea of the hidden state
  o The overall architecture of RNNs
  o "Unrolling" a unit
  o Intuition of (truncated) BPTT
• Challenges of training RNNs, and how to deal with them
  o Exploding/vanishing gradients
  o Choice of activation functions
  o Gradient clipping
  o Changing the architecture (intro to LSTMs)
• Introducing LSTMs
  o Changes to the architecture
  o The idea of storing the "memory" in a cell
• Exploring machine translation
  o Setup as a Seq2Seq problem
  o Embeddings
  o Using the encoder-decoder framework
  o Information bottleneck: passing on only one hidden state
  o Improvements
    ▪ Passing all the hidden states
    ▪ Intuition of the attention mechanism
Recommended readings: Andrej Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks"; Jay Alammar's "Visualizing A Neural Machine Translation Model"; SLP3
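To make the idea of a hidden state concrete, here is a minimal NumPy sketch (illustrative only; the weights are random and untrained) of unrolling a vanilla RNN cell over a sequence, h_t = tanh(x_t W_xh + h_{t-1} W_hh + b).

```python
# Unrolling a vanilla RNN cell over a sequence: h_t = tanh(x_t @ W_xh + h_{t-1} @ W_hh + b).
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_hid = 5, 3, 4                      # sequence length, input dim, hidden dim

W_xh = rng.normal(0, 0.5, (d_in, d_hid))      # input-to-hidden weights (untrained)
W_hh = rng.normal(0, 0.5, (d_hid, d_hid))     # hidden-to-hidden (recurrent) weights
b    = np.zeros(d_hid)

xs = rng.normal(0, 1, (T, d_in))              # a made-up input sequence
h  = np.zeros(d_hid)                          # initial hidden state

for t in range(T):
    # The same cell (same weights) is applied at every time step;
    # h carries a summary of everything seen so far.
    h = np.tanh(xs[t] @ W_xh + h @ W_hh + b)
    print(f"t={t}, h={np.round(h, 3)}")
```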

Week 10 (Lectures 19–20): Attention Mechanism and Transformers
• The attention mechanism in machine translation
• Self-attention (see the sketch below)
  o Dot-product attention
  o The idea of contextualized word embeddings
• Introducing the Transformer and its contributions
  o Highly parallelizable
  o Contextualized embeddings vs. plain embeddings
  o Long-term dependencies
• Transformer architecture in a nutshell
  o Role of an encoder
  o Role of a decoder and masked attention
  o Queries, keys, values
• The Transformer in equations
  o Positional embeddings
  o Projections to Q, K, V
  o Self-attention as dot-product attention
  o Role of feedforward layers
  o Multi-headed self-attention: its role and equations
  o (Layer norms ignored for brevity)
• Case studies: BERT, T5, and GPT-3/LLMs
Recommended readings: Jay Alammar's "The Illustrated Transformer"; Andrej Karpathy's "Let's Build GPT"; SLP3
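The core of self-attention fits in a few lines; this NumPy sketch (illustrative, a single head with random untrained projections) computes scaled dot-product attention, softmax(QKᵀ/√d_k)V, for a short made-up sequence.

```python
# Single-head scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_model, d_k = 4, 8, 8                     # sequence length and dimensions

X  = rng.normal(0, 1, (T, d_model))           # made-up token embeddings
Wq = rng.normal(0, 0.3, (d_model, d_k))       # untrained projection matrices
Wk = rng.normal(0, 0.3, (d_model, d_k))
Wv = rng.normal(0, 0.3, (d_model, d_k))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d_k)               # how much each token attends to every other token
weights = softmax(scores, axis=-1)            # each row sums to 1
contextualized = weights @ V                  # each output mixes values from the whole sequence

print(np.round(weights, 2))                   # T x T attention matrix
print(contextualized.shape)                   # (4, 8)
```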
Week 11 (Lectures 21–22): Decision (Classification/Regression) Trees (discriminative, non-linear, parametric/non-parametric)
• Clustering using K-D trees and nearest neighbor methods
  o Why do we still need nearest neighbors?
• Revision of trees
  o Graph: nodes, edges, directed/undirected, paths, cyclic/acyclic
  o Tree: a rooted directed acyclic graph (rooted DAG)
    ▪ Parent, children, siblings, root, leaves, degree, height, arity, various relationships
    ▪ Forests
• From K-D trees to decision trees
• Decision tree examples: classification and regression
  o Categorical and real-valued attributes
• Bias and variance
  o Tree height
• Growing the tree automatically – decision tree training
  o The ID3 and CART algorithms
  o Splits and purity/impurity (see the sketch below)
    ▪ Gini and entropy
    ▪ Information gain and information gain ratio
    ▪ Multiclass
    ▪ Regression: variance
  o Real-valued attributes
• Strengths and weaknesses of CARTs
  o Automatic feature selection
  o Generalizability of splitting on attributes with a large set of values
  o Missing data
  o Axis-aligned splits
  o Overfitting
• Side notes on Huffman coding
• A deep dive into entropy
  o Quantifying expected surprise
    ▪ Events in isolation
    ▪ Probability distributions
  o Information, surprise, and uncertainty
  o Evaluating language models
    ▪ Comparing distributions using cross entropy
• A discussion on parametric/non-parametric models
Recommended readings: the Decision Tree Wikipedia page; Ben Taskar's old notes; Murphy 16.2; ESLII 8.7, Ch. 10, 15, 16
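Here is a minimal NumPy sketch (illustrative, with a made-up node of 9 positive and 5 negative examples) of the entropy and information-gain computation used to score candidate splits in ID3-style tree growing.

```python
# Entropy and information gain for a candidate split (made-up binary labels).
import numpy as np

def entropy(labels):
    """Shannon entropy H(Y) = -sum_c p_c * log2(p_c) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Parent node: 9 positives and 5 negatives
parent = np.array([1] * 9 + [0] * 5)

# A candidate split partitions the parent into two children
left  = np.array([1] * 6 + [0] * 1)   # 7 examples
right = np.array([1] * 3 + [0] * 4)   # 7 examples

# Information gain = parent entropy - weighted average of child entropies
weighted_child_entropy = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted_child_entropy

print(round(entropy(parent), 3))   # ~0.940
print(round(info_gain, 3))         # higher gain = purer children = better split
```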

Week 12 (Lectures 23–24): Ensemble Methods and the Bayes Theorem

Ensemble methods: Bagging and Random Forests
• Decomposition of generalization error
  o Bias/variance/noise
  o Detecting high-bias and high-variance regimes
• Variance reduction
  o The weak law of large numbers
  o 𝑀 datasets sampled from 𝑃
  o Bootstrapping
    ▪ Bootstrapped Aggregation (Bagging)
  o Are we even drawing from the same 𝑃?
  o Summary and advantages of bagging
• Random forests (see the sketch below)
  o Algorithm
  o Examples and benefits
    ▪ Out-of-the-box performance
    ▪ The hyperparameters 𝑚 and 𝑘
    ▪ No need for normalization and feature scaling
    ▪ Resilience to the curse of dimensionality
    ▪ Feature selection
    ▪ Missing data and clustering
  o Variants
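To show bagging's variance-reduction effect in practice, here is a minimal scikit-learn sketch (illustrative, on a made-up dataset) comparing a single decision tree against a bagged ensemble of trees and a random forest.

```python
# Single tree vs. bagging vs. random forest on a made-up classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)

models = {
    "single tree":   DecisionTreeClassifier(random_state=0),
    "bagged trees":  BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    # The ensembles typically average away much of the single tree's variance
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:14s} accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```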

Ensemble methods: Boosting, Gradient Boosted Trees, and AdaBoost
• Bias reduction
  o Intuition
  o Vectors
  o Gradient descent in function space
  o Generic boosting (AnyBoost)
    ▪ Algorithm and geometric interpretation
• Gradient boosted regression trees
  o Algorithm
  o Detailed walk-through
• AdaBoost
  o Setting
  o Odds ratio and log-odds
  o Step size proportional to error reduction
  o Instance weights proportional to (mis)classification and the "say" of the classifier
  o Algorithm and detailed walk-through
  o Properties and summary

Bayes Theorem
• Review of probability, joint and conditional probability, and derivation of the Bayes theorem
• Maximum a posteriori (MAP) and Maximum Likelihood Estimation (MLE)
  o Posterior, likelihood, prior, and evidence
  o Classification using MAP and MLE
• Example problems and solutions using the Bayes theorem
  o Binary and multiclass
  o The Monty Hall problem, medical testing (see the worked example below), language modeling
• Generative and discriminative classifiers
  o Solving SPAM vs. Not-SPAM
Recommended readings: ESLII 8.7, Ch. 10, 15, 16; ESLII 6.6.3; SLP Ch. 4; Murphy 2.2, 3.1–3.4
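The medical-testing example can be worked through in a few lines; this sketch (with made-up numbers for prevalence, sensitivity, and specificity) applies Bayes' theorem to compute the posterior probability of disease given a positive test.

```python
# Bayes' theorem on a made-up medical test: P(disease | positive test).
prior       = 0.01   # P(disease): assumed prevalence
sensitivity = 0.95   # P(positive | disease)
specificity = 0.90   # P(negative | no disease)

# Evidence: total probability of observing a positive test
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

# Posterior via Bayes' theorem: P(D | +) = P(+ | D) P(D) / P(+)
posterior = sensitivity * prior / p_positive
print(round(posterior, 3))   # ~0.088: still low, because the disease is rare
```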
Week 13 (Lectures 25–26): The Naïve Bayes Classifier (a linear, generative, parametric classifier)
• Derivation and implementation
  o Classification using the Bayes theorem
  o Learning by example: the SPAM vs. Not-SPAM problem
  o The "zeros" and how to get rid of them!
• Independence, mutual exclusion, and conditional independence
  o The challenge of "how much of the context to use?" – n-grams
  o Naïve assumptions: conditional independence and bag-of-words
• Data sparsity and out-of-vocabulary (OOV) items
  o Laplace add-1 smoothing (see the sketch below)
• Another example: sentiment analysis
• Text generation using Naïve Bayes
  o Infinite monkeys on typewriters
  o The Shannon visualization method for n-grams
  o Approximating Shakespeare and the Wall Street Journal
• Real-valued features: Gaussian Naïve Bayes
• Probability vs. likelihood
• A worked example
• The Naïve Bayes decision boundary
  o Under the naïve assumptions and in the general case
• Naïve Bayes: strengths and weaknesses
Recommended readings: Ben Taskar's notes on Naïve Bayes; TM chapter on Naive Bayes (Ch. 1–3); Xiaojin Zhu's notes on Multinomial Naïve Bayes; Manning's description of Multinomial Naïve Bayes
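As a complement to the SPAM example, here is a minimal scikit-learn sketch (illustrative, with a handful of made-up messages) of a multinomial Naïve Bayes text classifier using bag-of-words counts and Laplace add-1 smoothing.

```python
# Multinomial Naive Bayes with bag-of-words features and Laplace add-1 smoothing.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages (1 = spam, 0 = not spam)
messages = [
    "win a free prize now",
    "limited offer click to win money",
    "free money waiting for you",
    "meeting scheduled for tomorrow morning",
    "please review the attached report",
    "lunch with the project team today",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()                  # bag-of-words counts
X = vectorizer.fit_transform(messages)

clf = MultinomialNB(alpha=1.0)                  # alpha=1.0 is Laplace add-1 smoothing
clf.fit(X, labels)

test = ["claim your free prize", "see you at the meeting tomorrow"]
print(clf.predict(vectorizer.transform(test)))  # expected: [1 0]
```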
Week 14 (Lectures 27–28): Unsupervised Learning
• The unsupervised learning setup
• Use cases of unsupervised learning
  o Clustering
  o Anomaly detection
  o Feature selection and dimensionality reduction
• Types of clustering
  o Monothetic and polythetic
  o Hard and soft
  o Flat and hierarchical
• Clustering
  o K-D trees
    ▪ Monothetic, hard boundary, hierarchical, divisive (top-down)
    ▪ Motivation
    ▪ Algorithm
  o Vector quantization
    ▪ Motivation and method
    ▪ Codebook and distance metric
    ▪ Euclidean and Mahalanobis distances
  o K-means (see the sketch below)
    ▪ Polythetic, hard boundary, flat
    ▪ Lloyd/Forgy method
    ▪ Expectation Maximization (EM) and K-means
    ▪ The K-means objective
    ▪ Optimal number of clusters
    ▪ Categorical data and K-modes
    ▪ Vector quantization using K-means
  o Evaluating clustering
    ▪ Extrinsic and intrinsic evaluation
  o Gaussian Mixture Models
    ▪ Polythetic, soft boundary, flat, probabilistic
    ▪ K-means vs. GMMs
    ▪ EM for GMMs
    ▪ Mixture models in 1 dimension and n dimensions
    ▪ Likelihoods, cluster assignments, and cluster update rules
    ▪ The covariance matrix
    ▪ How many Gaussians?
  o Hierarchical clustering
    ▪ Recursive K-means
      ▪ Polythetic, hard boundary, hierarchical, top-down
    ▪ Agglomerative clustering
      ▪ Polythetic, hard boundary, hierarchical, bottom-up
    ▪ Examples
    ▪ Distances: single link, complete link, average link, centroids, Ward's method
• Dimensionality reduction
  o Feature selection vs. feature reduction
  o Motivation
    ▪ Visualization
    ▪ Redundant and correlated features
    ▪ Real vs. apparent dimensionality
    ▪ The curse of dimensionality
  o Principal Component Analysis (PCA)
    ▪ The dimension of greatest variability
    ▪ A side note on matrices, linear transformations, the determinant, eigenvalues and eigenvectors
    ▪ The eigenvectors of the covariance matrix
    ▪ How many dimensions?
    ▪ Strengths and weaknesses
  o Linear Discriminant Analysis (LDA)
    ▪ Supervised setup
    ▪ Discrimination vs. spread
• Anomaly detection
  o Anomalies
  o How do we define anomalies?
  o Why unsupervised?
  o Examples and challenges
  o Detection
    ▪ One-class classification
    ▪ Density estimation
    ▪ One feature and multiple features
    ▪ Algorithm
    ▪ Example
    ▪ Evaluation
    ▪ Unsupervised vs. supervised
• Big challenges and opportunities in AI and ML
  o The case for Explainable AI
  o The case for Fair AI
  o Societal biases
  o Imbalanced classification
  o Machine Learning for Development (ML4D)
Recommended readings: ESLII 8.7, Ch. 10, 15, 16
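To tie the clustering and dimensionality-reduction threads together, here is a minimal scikit-learn sketch (illustrative, on made-up Gaussian blobs) that projects the data onto two principal components and then clusters the projection with K-means.

```python
# PCA for dimensionality reduction followed by K-means clustering (made-up blobs).
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# 10-dimensional data with 3 true clusters
X, y_true = make_blobs(n_samples=600, n_features=10, centers=3, random_state=0)

# Project onto the two directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("variance explained:", round(float(pca.explained_variance_ratio_.sum()), 3))

# Lloyd's algorithm on the projected data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X_2d)

# Extrinsic evaluation against the (normally unknown) true labels
print("adjusted Rand index:", round(adjusted_rand_score(y_true, clusters), 3))
```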

Other topics – to be covered if we have time

Week 15: Bayes – Advanced Topics (supplementary)
• Hypothesis spaces
• The frequentist viewpoint
  o Intuition, derivation, pros and cons, extreme data
• The Bayesian viewpoint
  o Intuition, derivation, MAP, pros and cons
  o Conjugate priors: the Beta and Dirichlet distributions
• Comparison of MAP and MLE
  o Laplace smoothing
• The Bayes Optimal Classifier
Recommended readings: SLP Ch. 4; Murphy 2.2, 3.1–3.4; ESLII 6.6.3

Week 16: Graphical Sequence Processing Models
• Hidden Markov Models (HMMs)
• Maximum Entropy Markov Models (MEMMs)
• Undirected Graphical Models (Markov Random Fields)
• Conditional Random Fields (CRFs)
• Directed Graphical Models (Bayes Nets)
Recommended readings: SLP A; ESLII Ch. 17
