Fall2024 W4995 Lecture1

The document outlines the logistics and structure of the W4995 Applied Machine Learning course for Fall 2024, taught by Dr. Vijay Pappu. It includes details on grading, assignments, course materials, and topics covered throughout the semester, emphasizing practical applications of machine learning rather than theoretical foundations. Additionally, it discusses the importance of exploratory data analysis and visualization in understanding datasets.


W4995 Applied Machine Learning

Fall 2024
Lecture 1
Dr. Vijay Pappu
A little about me...

● B.Tech
● Ph.D.
● Data Scientist
● Sr. Manager, Applied Science
Logistics
● Course website:
○ [Section 032] https://courseworks2.columbia.edu/courses/203911
○ [Section 033] https://courseworks2.columbia.edu/courses/203915
○ Please post any questions in EdStem

● Course grading:
○ 5 programming assignments - 50%
○ 1 in-class midterm - 20%
○ 1 project - 30%

● Class attendance is optional

Logistics
● My information
○ Email: [email protected]
○ Office hours: By Appointment

● Course Assistants (CAs) & Office hours:

○ TBD
Course materials
● The slides will be made available on the course website prior to the in-class lecture
● The classes are recorded (everything)
○ I might repeat your questions so that the mic can pick them up.
● The class recordings will be available to students in the “Video Library” section of the course webpage.
Assignments
● Programming assignments should be submitted in Python
● We will use GitHub Classroom for assignment submissions
● 5 assignments in total (2 before midterm)
● Late submissions are not allowed and will result in no points
Project
● You can work in teams of 4-6 (ideal)
○ This will result in ~20-25 teams
● Project deliverables include report and/or functional code
● Project description and key milestones have been posted in the “Pages”
section of Courseworks.
Plagiarism
The use of words, phrases, or ideas that do not belong to the student,
without properly citing or acknowledging the source, is prohibited. This may
include, but is not limited to, copying computer code for the purposes of
completing assignments for submission.

Columbia University Plagiarism policy

● CAs check homework submissions for plagiarism

● Copied code could result in no points for all involved
● Leveraging code snippets from other sources (Stack Overflow, open source libraries) is allowed.
○ It is important to mention the source if it is a substantial amount of code.
Useful resources
● The course does not have one recommended book, but will draw on material from the following resources:
○ Introduction to Machine Learning with Python
○ Learning from Data (lectures available online)
○ Applied Predictive Modeling
○ Deep Learning (can refer to the first 5 chapters for basics)
○ Fundamentals of Data Visualization
Before we begin...

This course is not about…

Theoretical underpinnings of Machine Learning
Before we begin...

However, this course is about…

Applying Machine Learning to real-world applications
In today’s lecture, we will cover...
● Introduction
● Exploratory Data Analysis & Visualizations
Machine learning (ML) is the study of computer algorithms
that can improve automatically through experience and by the
use of data.

https://en.wikipedia.org/wiki/Machine_learning
I like this definition better…

Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so.

https://en.wikipedia.org/wiki/Machine_learning
Heuristics vs. ML system

● Calico - orange, black and/or grey markings over a white coat.
● Tortoiseshell - white, orange and/or beige over a black coat.
● Siamese - angular-looking with black and tan coloring.

[Figure: example image labeled "Calico"; panels: Heuristics, ML system]
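To make the "heuristics" side of this comparison concrete, here is a minimal sketch of hand-written rules that mirror the three coat descriptions above. The feature encoding (`base_coat`, `markings`, `angular`) and the function name are hypothetical and chosen only for illustration; an ML system would instead learn such a mapping from labeled examples.

```python
def classify_cat_heuristic(base_coat, markings, angular=False):
    """Hand-written rules mirroring the three coat descriptions.

    base_coat: str, dominant coat color (hypothetical feature)
    markings:  set of str, colors of the markings (hypothetical feature)
    angular:   bool, whether the cat looks angular (hypothetical feature)
    """
    if base_coat == "white" and markings and markings <= {"orange", "black", "grey"}:
        return "calico"
    if base_coat == "black" and markings and markings <= {"white", "orange", "beige"}:
        return "tortoiseshell"
    if angular and markings == {"black", "tan"}:
        return "siamese"
    return "unknown"  # rules are brittle: anything unanticipated falls through


print(classify_cat_heuristic("white", {"orange", "black"}))
print(classify_cat_heuristic("black", {"orange", "beige"}))
print(classify_cat_heuristic("cream", {"black", "tan"}, angular=True))
print(classify_cat_heuristic("grey", {"white"}))
```

The last call shows the typical weakness of heuristics: any input the rule author did not anticipate ends up as "unknown".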
The short answer is…
EVERYWHERE
(Obvious) Examples of Machine Learning
(Obvious) Examples of Machine Learning
(Not so obvious) Examples of Machine Learning

[Figure, left: Computer vision identifies crop health issues on a tray of arugula.]
[Figure, right: Example of the various recipes being tested for one crop over time.]

https://boweryfarming.com/artificial-intelligence/
(Not so obvious) Examples of Machine Learning

A protein’s function is determined by its 3D shape.

https://www.nature.com/articles/d41586-020-03348-4
Until the 1990s...
Recently...
One of the reasons...
Supervised Learning
● Supervised learning algorithms learn a function that maps inputs to an output
from a set of labeled training data.
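As a rough illustration of "learning a function from labeled training data", the sketch below uses a 1-nearest-neighbor rule, one of the simplest supervised methods. The data points, labels, and the name `predict_1nn` are made up for this example and are not from the course materials.

```python
import math

# Labeled training data: (input features, output label). Values are made up.
train = [
    ((1.0, 1.2), "small"),
    ((0.8, 1.0), "small"),
    ((4.0, 4.5), "large"),
    ((4.2, 3.9), "large"),
]


def predict_1nn(x, training_data):
    """Return the label of the training input closest to x (1-nearest neighbor)."""
    _, nearest_label = min(
        training_data, key=lambda pair: math.dist(x, pair[0])
    )
    return nearest_label


print(predict_1nn((1.1, 0.9), train))
print(predict_1nn((3.8, 4.1), train))
```

Here the "learned function" is simply the stored training set plus a distance rule; other supervised models compress the training data into parameters instead.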
Unsupervised Learning
● Unsupervised learning algorithms learn patterns from unlabeled data samples.
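A small sketch of learning patterns from unlabeled samples: a bare-bones k-means on one-dimensional data. The numbers, the starting centers, and the function name `kmeans_1d` are assumptions made for illustration only.

```python
from statistics import mean


def kmeans_1d(values, centers, n_iter=10):
    """Very small k-means for 1-D data: returns (centers, cluster labels)."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for each value.
        labels = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Update step: move each center to the mean of its assigned values
        # (a center with no assigned values stays where it is).
        centers = [
            mean(v for v, lab in zip(values, labels) if lab == k)
            if k in labels
            else centers[k]
            for k in range(len(centers))
        ]
    return centers, labels


# Unlabeled samples (made-up numbers) forming two obvious groups.
data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, labels = kmeans_1d(data, centers=[0.0, 5.0])
print(centers, labels)
```

No labels are provided anywhere; the grouping emerges only from the distances between samples.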
Reinforcement Learning
Deep Learning
● Deep learning is a class of ML algorithms that uses multiple layers to
progressively extract higher-level features/abstractions from raw inputs.
What about others?
● Active learning
● Self-supervised learning
● Transfer learning
● Generative AI
● …
Large Language Models (LLMs)
● Large Language Models (LLMs) are a subset of deep learning models trained on massive corpora of text data.
● LLMs perform extremely well on a wide range of natural language tasks:
○ Natural Language Understanding: excel at tasks like sentiment analysis, named entity recognition (NER), and question answering (Q&A)
○ Text Generation: generate human-like text for chatbots and other content generation tasks
○ Machine Translation: perform high-quality machine translation
○ Content Summarization: generate concise summaries of lengthy documents
● LLMs typically consist of billions of parameters and are trained using the Transformer architecture, which incorporates a self-attention mechanism to capture long-range dependencies.
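The self-attention mechanism mentioned above can be sketched in a few lines. This is a deliberately simplified version in plain Python: queries, keys, and values are all the raw token vectors, whereas real Transformers first apply learned projection matrices. The token vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with queries = keys = values = X.

    X is a list of token vectors. Learned query/key/value projections,
    multiple heads, and masking are omitted in this sketch.
    """
    d = len(X[0])
    output = []
    for q in X:
        # Similarity of this token to every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return output


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up 2-D token vectors
for row in self_attention(tokens):
    print([round(x, 3) for x in row])
```

Because every token attends to every other token directly, distance in the sequence does not weaken the connection, which is what lets the mechanism capture dependencies between far-apart positions.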

Timeline
Model complexity
Three reasons...
● Computational power [1]
● Big data
● Breakthrough in Deep Learning

[1] https://www.offgridweb.com/preparation/infographic-the-growth-of-computer-processing-power/
One example...
Computers have become powerful and accessible...
Data is publicly available…

https://datasetsearch.research.google.com/

https://www.kaggle.com/datasets
Access to ML is being democratized…
Ethics
Explainability
Python is the de facto language for ML

http://r4stats.com/articles/popularity/
Great suite of mature libraries for ML tasks
Course schedule
● Lecture 1: 1. Introduction; 2. Exploratory Data Analysis & Visualization
○ By the end of class: students will be familiar with basic data exploration
● Lecture 2: 1. Introduction to supervised learning; 2. Preprocessing
● Lecture 3: 1. Linear models for regression; 2. Linear models for classification; 3. Support Vector Machines (SVMs)
● Lecture 4: 1. Trees, Forests & Ensembles; 2. Gradient Boosting
○ By the end of class: students will be familiar with training & evaluation of linear and ensemble models
● Lecture 5: 1. Model evaluation; 2. Calibration; 3. Automatic machine learning
Course schedule
● Midterm
● Lecture 6: 1. Model Interpretation & Feature Selection; 2. Linear & non-linear dimensionality reduction; 3. Clustering & mixture models
● Lecture 7: 1. Learning with imbalanced data; 2. Learning with sparse data
● Lecture 8: 1. Deep Neural Networks (DNN); 2. Advanced Neural Networks
● Lecture 9: 1. Convolutional Neural Networks; 2. Recurrent Neural Networks
○ By the end of class: students will be familiar with applying neural networks to different tasks
Course schedule
● Lecture 10: 1. Working with text data; 2. Topic models for text data; 3. Word & document embeddings
○ By the end of class: students will be familiar with working with text data
● Lecture 11: 1. Content-based recommendations; 2. Collaborative filtering & matrix factorization; 3. Recommendations using DNNs
○ By the end of class: students will be familiar with training and evaluating recommender systems
● Lecture 12: 1. ML in production; 2. Course Recap
Questions?
Let’s take a 10 min break!
Exploratory Data Analysis
&
Visualization
Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

https://en.wikipedia.org/wiki/Exploratory_data_analysis
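A first step in summarizing "main characteristics" is often a handful of descriptive statistics per column. The sketch below uses only Python's standard library on a made-up list of ages; in practice a data-frame library would usually do this, but the idea is the same.

```python
import statistics

# Hypothetical numeric column (e.g. ages); the values are made up.
ages = [22, 25, 25, 31, 38, 41, 47, 52, 90]

summary = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),
    "quartiles": statistics.quantiles(ages, n=4),
}

for name, value in summary.items():
    print(f"{name:>9}: {value}")
```

In this toy data the mean (about 41.2) sits above the median (38), a quick hint that a large value such as 90 is pulling the distribution to the right and that a plot is worth a look.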
Why do we do EDA?
● Explore
● Inform
● Communicate
Data types
● Quantitative/numerical continuous - 1, 3.5, 100, 10^10, 3.14
● Quantitative/numerical discrete - 1, 2, 3, 4
● Qualitative/categorical unordered - cat, dog, whale
● Qualitative/categorical ordered - good, better, best
● Date or time - 09/15/2021, Jan 8th 2020 15:00:00
● Text - The quick brown fox jumps over the lazy dog
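One way to see why these distinctions matter is to give each kind of value an appropriate Python type. The class names `Animal` and `Rating` and the `record` dictionary are hypothetical, and the date is assumed to be in month/day/year format.

```python
from datetime import datetime
from enum import Enum, IntEnum


class Animal(Enum):          # categorical, unordered: no meaningful "<"
    CAT = "cat"
    DOG = "dog"
    WHALE = "whale"


class Rating(IntEnum):       # categorical, ordered: comparisons are meaningful
    GOOD = 1
    BETTER = 2
    BEST = 3


record = {
    "continuous": 3.14,                                   # float
    "discrete": 4,                                        # int
    "unordered": Animal.DOG,
    "ordered": Rating.BETTER,
    "date": datetime.strptime("09/15/2021", "%m/%d/%Y"),  # parsed, not a string
    "text": "The quick brown fox jumps over the lazy dog",
}

print(Rating.GOOD < Rating.BEST)
print(record["date"].year, record["date"].month)
```

Comparing two `Animal` members with `<` raises a `TypeError`, which matches the idea that unordered categories have no ranking.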
Data Visualization
Ugly, Bad & Wrong figures
● Ugly
○ A figure that has aesthetic problems but otherwise is clear and informative
● Bad
○ A figure that has problems related to perception; it may be unclear, confusing, overly
complicated, or deceiving
● Wrong
○ A figure that has problems related to mathematics; it is objectively incorrect
Ugly, Bad & Wrong figures
Aesthetics in data visualization
● Aesthetics refer to a quantifiable set of features that are mapped to the data in
a graphic.
● Aesthetics describe every aspect of a given graphical element.
● Some aesthetics like position, size, color and line width work for both
continuous & discrete data, while others (shape & line type) work for only
discrete data
Scales
● Scales are the mapping between data values and aesthetic values.

[Figure: diagram relating Data, Scales, and Aesthetics]
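A scale can be thought of as a function from data values to aesthetic values. The following is a minimal sketch of a linear scale; the helper name `linear_scale`, the data domain, and the output ranges are assumptions chosen for illustration.

```python
def linear_scale(domain, out_range):
    """Return a function mapping data values in `domain` to `out_range`."""
    d0, d1 = domain
    r0, r1 = out_range

    def scale(value):
        t = (value - d0) / (d1 - d0)      # relative position within the domain
        return r0 + t * (r1 - r0)         # same relative position in the output

    return scale


# Position scale: temperatures (made-up domain) mapped to pixel heights.
y_scale = linear_scale(domain=(0, 40), out_range=(0, 200))
# Size scale: the same data values mapped to marker sizes instead.
size_scale = linear_scale(domain=(0, 40), out_range=(2, 10))

print(y_scale(20), size_scale(20))
```

The same data value can feed several scales at once, each producing a value for a different aesthetic (position, size, color, and so on).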
A typical data visualization chart
● A typical data visualization chart uses three scales.

● Two position scales:

○ month (x-axis)
○ temperature (y-axis)
● One color scale:
○ location
A typical data visualization chart
● A typical data visualization chart uses three scales.

● Two position scales:

○ month (x-axis)
○ location (y-axis)
● One color scale:
○ temperature
An (a)typical data visualization chart
● This visualization chart uses five scales.

● Two position scales:

○ displacement (x-axis)
○ fuel efficiency (y-axis)
● One color scale:
○ power
● One shape scale:
○ cylinders
● One size scale:
○ weight
Position scale
Cartesian coordinate system
● The most widely used coordinate system for data visualization.
Valid data visualization charts
● All figures show the same information with different aspect ratios
Valid data visualization charts
● Since the same quantity (temperature) is plotted on both axes, the grid spacings should be the same.
Nonlinear axes
● The logarithmic scale is the most commonly used nonlinear scale.

Original values: [1, 3.16, 10, 31.6, 100]
log10 values: [0, 0.5, 1, 1.5, 2]
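The two rows of numbers above are related by a base-10 logarithm, which a short check confirms (rounded to one decimal place):

```python
import math

values = [1, 3.16, 10, 31.6, 100]
log_values = [round(math.log10(v), 1) for v in values]
print(log_values)

# Equal steps on the log axis correspond to equal multiplicative factors.
factors = [round(b / a, 2) for a, b in zip(values, values[1:])]
print(factors)
```

Each step of 0.5 on the log axis multiplies the original value by roughly 3.16, the square root of 10.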

Logarithmic scale - an example
● A logarithmic scale is typically useful for representing ratios.
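A quick way to see why: on a linear axis the ratios 0.5 and 2 sit at unequal distances from 1, while on a log axis "half" and "double" are mirror images. The ratio values below are arbitrary examples.

```python
import math

ratios = [0.25, 0.5, 1, 2, 4]   # e.g. "a quarter as many" up to "four times as many"
log_ratios = [math.log2(r) for r in ratios]
print(log_ratios)
```

After the transformation, a ratio and its reciprocal differ only in sign, so increases and decreases of the same relative size get the same visual weight.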
Color scale
Use-cases for color in data visualization
● Three fundamental use-cases for using color in data visualization:
○ Distinguish groups of data
○ Represent data values
○ Highlight
Color as a tool to distinguish

Qualitative color scale

Color to represent data values

Sequential color scale

Color to represent data values

Diverging color scale

Color as a tool to highlight

Accent color scale

The two neighboring states Louisiana and Texas experienced among the highest and lowest population growth from 2000 to 2010.
Visualization Collections
Visualizing data
● Typically, we would like to visualize the following kinds of data:
○ Amounts
○ Distributions
○ Proportions
○ X-Y relationships
○ Uncertainty
Visualizing amounts
Visualizing amounts - bar plots
Visualizing amounts - bar plots
● The bars should not be reordered (e.g., sorted by value) if they represent ordered categories
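One reading of this guideline is: sort bars by amount only when the categories have no inherent order. The sketch below encodes that reading; the function name `order_bars` and all numbers are made up for illustration.

```python
def order_bars(amounts, natural_order=None):
    """Return (category, amount) pairs in plotting order.

    amounts:       dict mapping category -> amount
    natural_order: list of categories if they are ordered, else None
    """
    if natural_order is not None:
        # Ordered categories (e.g. ratings, age groups): keep their inherent order.
        return [(c, amounts[c]) for c in natural_order]
    # Unordered categories: sorting by amount makes the plot easier to read.
    return sorted(amounts.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers for illustration.
by_animal = {"cat": 12, "dog": 30, "whale": 3}
by_rating = {"good": 40, "better": 25, "best": 35}

print(order_bars(by_animal))
print(order_bars(by_rating, natural_order=["good", "better", "best"]))
```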
Visualizing amounts - grouped & stacked bars
Visualizing amounts - dot plots
Visualizing distributions
Visualizing distributions - histograms
● When making histograms, always explore multiple bin widths

[Figure: four histograms of the same data with bin widths of 1, 3, 5, and 15 years]
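The effect of the bin width can be explored without any plotting by counting values per bin. This sketch uses the four bin widths named on the slide; the ages and the helper name `histogram` are made up.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; bins are [k*w, (k+1)*w), keyed by their left edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical ages in years (made-up data).
ages = [1, 2, 4, 18, 19, 21, 22, 22, 25, 29, 30, 31, 36, 45, 58, 63]

for width in (1, 3, 5, 15):
    print(f"bin width = {width:>2} years:", histogram(ages, width))
```

Very narrow bins mostly show individual observations, while very wide bins blur the shape of the distribution, which is why trying several widths is worthwhile.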

Visualizing distributions - kernel density plots
● Different kernels include Gaussian, Rectangular etc.
● Each kernel is parameterized by bandwidth

[Figure: four kernel density plots: Gaussian kernel with bandwidth = 0.5, 2, and 5; Rectangular kernel with bandwidth = 2]
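A kernel density estimate places one kernel on every observation and averages them. The sketch below implements that directly for the Gaussian and rectangular kernels and the bandwidths named on the slide. The sample data is made up, and treating the bandwidth of the rectangular kernel as its half-width is an assumption (some libraries define bandwidth as the kernel's standard deviation instead).

```python
import math


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def rectangular_kernel(u):
    return 0.5 if abs(u) <= 1 else 0.0


def kde(x, samples, bandwidth, kernel=gaussian_kernel):
    """Kernel density estimate at x: average of kernels centered on the samples."""
    n = len(samples)
    return sum(kernel((x - s) / bandwidth) for s in samples) / (n * bandwidth)


samples = [20, 22, 23, 30, 41, 45]   # made-up ages
for bw in (0.5, 2, 5):
    print("gaussian", bw, round(kde(25, samples, bw), 4))
print("rectangular", 2, round(kde(25, samples, 2, rectangular_kernel), 4))
```

A small bandwidth yields a spiky estimate that follows individual points; a large one smooths away real structure.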
Visualizing distributions - kernel density plots
● To visualize several distributions at once, kernel density plots work better than
histograms
Visualizing distributions - highly skewed distribution
Visualizing distributions - multiple distributions

[Figure panels: Boxplot, Violin plot]

Visualizing distributions - multiple distributions

Ridgeline plot
Visualizing proportions
Visualizing proportions - pie charts, stacked & side-by-side bars
● Pie charts help visually emphasize simple fractions, such as 1/2, 1/3, and 1/4.

[Figure panels: Pie chart, Stacked bar, Side-by-side bar]

Visualizing proportions - pie charts, stacked & side-by-side bars
● Side-by-side bars make it easy to visualize proportions that change over time.
Visualizing proportions - pie charts, stacked & side-by-side bars
● Stacked bars are preferred when there are only two quantities to compare
over time.
● Stacked density plots can be used to visualize how proportions change in
response to a continuous variable.

[Figure panels: Stacked bar plot, Stacked density plot]

Visualizing X-Y relationships
Visualizing X-Y relationships - scatterplots
Visualizing X-Y relationships - bubble plots
Visualizing X-Y relationships - scatterplot matrix
Visualizing X-Y relationships - correlation coefficient

[Equation figure: the correlation coefficient, with the sample means annotated]
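Assuming the slide refers to the usual Pearson correlation coefficient, it can be computed from the sample means as the sum of products of deviations divided by the square root of the product of the summed squared deviations. A minimal sketch with made-up data:

```python
import math
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)                               # sample means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))    # co-deviation
    sxx = sum((x - x_bar) ** 2 for x in xs)                         # spread of x
    syy = sum((y - y_bar) ** 2 for y in ys)                         # spread of y
    return sxy / math.sqrt(sxx * syy)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly increasing relationship
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly decreasing relationship
```

The result always lies between -1 and 1, and it measures only linear association: a strong curved relationship can still produce a value near 0.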
Visualizing X-Y relationships - correlograms

Correlations between mineral content obtained from 214 glass samples during forensic work
Visualizing uncertainty
Visualizing uncertainty - probability distribution

The blue party is predicted to win over the yellow party by ~1 percentage point with
a margin of error of 1.76 percentage points.
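The numbers in this caption can be turned into a rough probability of winning. The sketch assumes the margin of error is the half-width of a 95% interval around a normally distributed estimate; the caption does not state this, so treat the result as illustrative.

```python
from statistics import NormalDist

predicted_lead = 1.0      # percentage points, from the caption
margin_of_error = 1.76    # percentage points, from the caption

# Assumption: the margin of error is the half-width of a 95% interval,
# i.e. about 1.96 standard deviations of a normal distribution.
z_95 = NormalDist().inv_cdf(0.975)
sigma = margin_of_error / z_95

lead = NormalDist(mu=predicted_lead, sigma=sigma)
p_blue_wins = 1 - lead.cdf(0)   # probability that the true lead is positive
print(round(p_blue_wins, 2))
```

Under these assumptions the blue party wins with a probability of roughly 0.87: likely, but far from certain, because the interval around the predicted lead still includes zero.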
Visualizing uncertainty - population & sample
Visualizing uncertainty - confidence intervals

Ratings of Chocolate bars manufactured in Canada
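A confidence interval for a mean can be sketched as the sample mean plus or minus a multiple of its standard error. The ratings below are made up (they are not the data behind the figure), and the normal approximation is a simplification: for a sample this small a t-distribution would normally be used.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical chocolate-bar ratings (made-up sample, not the slide's data).
ratings = [3.0, 3.5, 3.25, 2.75, 3.5, 4.0, 3.0, 3.75, 3.25, 3.5]


def confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the mean of `sample`."""
    m = mean(sample)
    se = stdev(sample) / math.sqrt(len(sample))      # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)        # about 1.96 for 95%
    return m - z * se, m + z * se


for level in (0.80, 0.95, 0.99):
    low, high = confidence_interval(ratings, level)
    print(f"{level:.0%} CI: [{low:.2f}, {high:.2f}]")
```

Higher confidence levels give wider intervals, which is why plots of this kind sometimes draw several nested error bars for the same estimate.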

Visualizing uncertainty - comparing parameter estimates
Questions?
