
Fall2024 W4995 Lecture1

The document outlines the logistics and structure of the W4995 Applied Machine Learning course for Fall 2024, taught by Dr. Vijay Pappu. It includes details on grading, assignments, course materials, and topics covered throughout the semester, emphasizing practical applications of machine learning rather than theoretical foundations. Additionally, it discusses the importance of exploratory data analysis and visualization in understanding datasets.

Uploaded by

sejal.mittal99

W4995 Applied Machine Learning

Fall 2024
Lecture 1
Dr. Vijay Pappu
A little about me...

[Figure: B.Tech; Ph.D.; Data Scientist; Sr. Manager, Applied Science]
Logistics
● Course website:
○ [Section 032] https://courseworks2.columbia.edu/courses/203911
○ [Section 033] https://courseworks2.columbia.edu/courses/203915
○ Please post any questions in EdStem

● Course grading:
○ 5 programming assignments - 50%
○ 1 in-class midterm - 20%
○ 1 project - 30%

● Class attendance is optional


Logistics
● My information
○ Email: [email protected]
○ Office hours: By Appointment

● Course Assistants (CAs) & Office hours:


○ TBD
Course materials
● The slides will be made available on the course website prior to the in-class lecture
● The classes are recorded (everything)
○ I might repeat your questions so that the mic can pick them up.
● The class recordings will be available to students in the “Video Library” section of the course webpage.
Assignments
● Programming assignments should be submitted in Python
● We will use GitHub Classroom for assignment submissions
● 5 assignments in total (2 before midterm)
● Late submissions are not allowed and will result in no points
Project
● You can work in teams of 4-6 (ideal)
○ This will result in ~20-25 teams
● Project deliverables include report and/or functional code
● Project description and key milestones have been posted in the “Pages”
section of Courseworks.
Plagiarism
The use of words, phrases, or ideas that do not belong to the student,
without properly citing or acknowledging the source, is prohibited. This may
include, but is not limited to, copying computer code for the purposes of
completing assignments for submission.

Columbia University Plagiarism policy

● CAs check the homework for plagiarism
● Copied code could result in no points for all involved
● Leveraging code snippets from other sources (Stack Overflow, open-source libraries) is allowed.
○ It is important to mention the source if it is a substantial amount of code.
Useful resources
● The course does not have one recommended book, but will leverage material from the following resources:
○ Introduction to Machine Learning with Python
○ Learning from Data (lectures available online)
○ Applied Predictive Modeling
○ Deep Learning (can refer to first 5 chapters for basics)
○ Fundamentals of Data Visualization
Before we begin...

This course is not about…

Theoretical underpinnings of Machine Learning

However, this course is about…

Applying Machine Learning to real-world applications
In today’s lecture, we will cover...
● Introduction
● Exploratory Data Analysis & Visualizations
Machine learning (ML) is the study of computer algorithms
that can improve automatically through experience and by the
use of data.

https://en.wikipedia.org/wiki/Machine_learning
I like this definition better…

Machine learning involves computers discovering how they can
perform tasks without being explicitly programmed to do so.

https://en.wikipedia.org/wiki/Machine_learning
Heuristics vs. ML system

● Calico - orange, black and/or grey markings over a white coat.
● Tortoiseshell - white, orange and/or beige over a black coat.
● Siamese - angular-looking with black and tan coloring.

[Figure: Calico example, comparing Heuristics and ML system]
The short answer is…
EVERYWHERE
(Obvious) Examples of Machine Learning
(Not so obvious) Examples of Machine Learning

● Computer vision identifies crop health issues on a tray of arugula.
● Example of the various recipes being tested for one crop over time.

https://boweryfarming.com/artificial-intelligence/
(Not so obvious) Examples of Machine Learning

A protein’s function is determined by its 3D shape.

https://www.nature.com/articles/d41586-020-03348-4
Until the 1990s...
Recently...
One of the reasons...
Supervised Learning
● Supervised learning algorithms learn a function that maps inputs to an output
from a set of labeled training data.
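
As a minimal sketch of this definition, the snippet below learns a linear function from a handful of labeled examples using least squares. The data is made up for illustration (generated from y = 2x + 1), and plain NumPy is used here as an assumption to keep the example self-contained; it is one possible supervised learner, not a prescribed method.

```python
import numpy as np

# Labeled training data (made up for illustration): inputs X and known outputs y.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])  # generated from y = 2x + 1

# "Learning" here means estimating weights w so that [x, 1] @ w approximates y.
A = np.hstack([X_train, np.ones((len(X_train), 1))])  # append an intercept column
w, *_ = np.linalg.lstsq(A, y_train, rcond=None)


def predict(x_new):
    """Apply the learned function to inputs that were not in the training set."""
    x_new = np.asarray(x_new, dtype=float).reshape(-1, 1)
    return np.hstack([x_new, np.ones((len(x_new), 1))]) @ w


prediction = predict([5.0])  # the learned mapping applied to an unseen input
```

The key point is that the mapping is estimated from (input, label) pairs rather than written by hand.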
Unsupervised Learning
● Unsupervised learning algorithms learn patterns from unlabeled data samples.
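
To make "patterns from unlabeled data" concrete, here is a small k-means clustering sketch in NumPy. The points are invented for illustration, and k-means is only one example of an unsupervised method; note that no labels are passed in anywhere.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Basic k-means: alternate between assigning points and updating centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers


# Unlabeled data (made up): two visibly separate groups of 2D points.
points = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
labels, centers = kmeans(points, k=2)
```

The algorithm recovers the two groups from the geometry of the data alone; which group gets label 0 or 1 is arbitrary.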
Reinforcement Learning
Deep Learning
● Deep learning is a class of ML algorithms that uses multiple layers to
progressively extract higher-level features/abstractions from raw inputs.
What about others?
● Active learning
● Self-supervised learning
● Transfer learning
● Generative AI
● …
Large Language Models (LLMs)
● Large Language Models (LLMs) are a subset of deep learning models trained on massive corpora of text data.
● LLMs perform extremely well on a wide range of natural language tasks:
○ Natural Language Understanding: excel at tasks like sentiment analysis, named entity recognition (NER) & question answering (Q&A)
○ Text Generation: generate human-like text for chatbots and other content generation tasks
○ Machine Translation: perform high-quality machine translations
○ Content Summarization: generate concise summaries of lengthy documents
● LLMs typically consist of billions of parameters and are built on the Transformer architecture, which uses a self-attention mechanism to capture long-range dependencies.
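
The self-attention idea can be sketched in a few lines of NumPy. This is a single-head, scaled dot-product attention over a toy sequence with random weights; the sizes and names (`seq_len`, `d_model`, `W_q`, and so on) are assumptions for illustration, and real LLMs add multiple heads, masking, and many stacked layers.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (no masking)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Each row holds one token's similarity to every token in the sequence.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted mix of all value vectors.
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8  # toy sizes, chosen arbitrarily
X = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
```

Because every token attends to every other token in one step, distance within the sequence does not limit which dependencies can be captured.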


Timeline
Model complexity
Three reasons...

● Computational power [1]
● Big data
● Breakthrough in Deep Learning

[1] - https://www.offgridweb.com/preparation/infographic-the-growth-of-computer-processing-power/
One example...
Computers have become powerful and accessible...
Data is publicly available…

https://datasetsearch.research.google.com/

https://www.kaggle.com/datasets
Access to ML is being democratized…
Ethics
Explainability
Python is the de facto language for ML

http://r4stats.com/articles/popularity/
Great suite of mature libraries for ML tasks
Course schedule
● Lecture 1
○ Topics: (1) Introduction; (2) Exploratory Data Analysis & Visualization
○ By the end of class: students will be familiar with basic data exploration
● Lecture 2
○ Topics: (1) Introduction to supervised learning; (2) Preprocessing
● Lecture 3
○ Topics: (1) Linear models for regression; (2) Linear models for classification; (3) Support Vector Machines (SVMs)
● Lecture 4
○ Topics: (1) Trees, Forests & Ensembles; (2) Gradient Boosting
○ By the end of class: students will be familiar with training & evaluation of linear and ensemble models
● Lecture 5
○ Topics: (1) Model evaluation; (2) Calibration; (3) Automatic machine learning
Course schedule
● Midterm
● Lecture 6
○ Topics: (1) Model Interpretation & Feature Selection; (2) Linear & non-linear dimensionality reduction; (3) Clustering & mixture models
● Lecture 7
○ Topics: (1) Learning with imbalanced data; (2) Learning with sparse data
● Lecture 8
○ Topics: (1) Deep Neural Networks (DNN); (2) Advanced Neural Networks
● Lecture 9
○ Topics: (1) Convolutional Neural Networks; (2) Recurrent Neural Networks
○ By the end of class: students will be familiar with applying neural networks to different tasks
Course schedule
● Lecture 10
○ Topics: (1) Working with text data; (2) Topic models for text data; (3) Word & document embeddings
○ By the end of class: students will be familiar with working with text data
● Lecture 11
○ Topics: (1) Content-based recommendations; (2) Collaborative filtering & matrix factorization; (3) Recommendations using DNNs
○ By the end of class: students will be familiar with training and evaluating recommender systems
● Lecture 12
○ Topics: (1) ML in production; (2) Course Recap
Questions?
Let’s take a 10 min break!
Exploratory Data Analysis
&
Visualization
Exploratory Data Analysis (EDA) is an approach of analyzing
datasets to summarize their main characteristics, often using
statistical graphics and other data visualization methods.

https://en.wikipedia.org/wiki/Exploratory_data_analysis
Why do we do EDA?
● Explore
● Inform
● Communicate
Data types
● Quantitative/numerical continuous - 1, 3.5, 100, 10^10, 3.14
● Quantitative/numerical discrete - 1, 2, 3, 4
● Qualitative/categorical unordered - cat, dog, whale
● Qualitative/categorical ordered - good, better, best
● Date or time - 09/15/2021, Jan 8th 2020 15:00:00
● Text - The quick brown fox jumps over the lazy dog
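
A short sketch of why these distinctions matter in code: ordered categories need an explicit order, and dates typically arrive as text and must be parsed. The record and its field names below are hypothetical; only the example values come from the list above.

```python
from datetime import datetime

# One hypothetical record with a field for each data type listed above.
record = {
    "weight_kg": 3.5,         # quantitative, continuous
    "num_visits": 4,          # quantitative, discrete
    "species": "cat",         # categorical, unordered
    "rating": "better",       # categorical, ordered
    "seen_on": "09/15/2021",  # date, still stored as text at this point
    "note": "The quick brown fox jumps over the lazy dog",  # free text
}

# Ordered categories need their order spelled out; sorting the strings
# alphabetically would give best < better < good, which is wrong.
RATING_ORDER = ["good", "better", "best"]
rating_rank = RATING_ORDER.index(record["rating"])

# Dates must be parsed before any date arithmetic or time-based plotting.
seen_on = datetime.strptime(record["seen_on"], "%m/%d/%Y")
```

Getting the type right up front determines which summaries and plots make sense later, for example a mean for continuous values versus counts for categories.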
Data Visualization
Ugly, Bad & Wrong figures
● Ugly
○ A figure that has aesthetic problems but otherwise is clear and informative
● Bad
○ A figure that has problems related to perception; it may be unclear, confusing, overly
complicated, or deceiving
● Wrong
○ A figure that has problems related to mathematics; it is objectively incorrect
Ugly, Bad & Wrong figures
Aesthetics in data visualization
● Aesthetics refer to a quantifiable set of features that are mapped to the data in
a graphic.
● Aesthetics describe every aspect of a given graphical element.
● Some aesthetics like position, size, color and line width work for both
continuous & discrete data, while others (shape & line type) work for only
discrete data
Scales
● Scales are the mapping between data values and aesthetics values.

Data

Aesthetics

Scales
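
One way to think about a scale is as a plain function from a data value to an aesthetic value. The sketch below builds a linear position scale and a categorical color scale; the function names, pixel range, locations, and colors are all assumptions made for illustration rather than part of any plotting library.

```python
def linear_scale(domain, output_range):
    """Return a function mapping values in `domain` linearly onto `output_range`."""
    (d_min, d_max), (r_min, r_max) = domain, output_range

    def scale(value):
        fraction = (value - d_min) / (d_max - d_min)
        return r_min + fraction * (r_max - r_min)

    return scale


def categorical_scale(categories, aesthetic_values):
    """Return a function mapping each category to one aesthetic value."""
    lookup = dict(zip(categories, aesthetic_values))

    def scale(value):
        return lookup[value]

    return scale


# Position scale: temperatures from 0 to 100 mapped onto a 400-pixel y-axis.
y_scale = linear_scale(domain=(0, 100), output_range=(0, 400))
# Color scale: hypothetical locations mapped to hypothetical colors.
color_scale = categorical_scale(["Chicago", "Houston"], ["blue", "orange"])

y_pixel = y_scale(50)                # halfway up the axis
point_color = color_scale("Houston")
```

Plotting libraries build these mappings automatically, but every axis and legend in a chart is ultimately a scale of this kind.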
A typical data visualization chart
● A typical data visualization chart uses three scales.

● Two position scales:
○ month (x-axis)
○ temperature (y-axis)
● One color scale:
○ location
A typical data visualization chart
● A typical data visualization chart uses three scales.

● Two position scales:


○ month (x-axis)
○ location (y-axis)
● One color scale:
○ temperature
An (a)typical data visualization chart
● This visualization chart uses five scales.

● Two position scales:


○ displacement (x-axis)
○ fuel efficiency (y-axis)
● One color scale:
○ power
● One shape scale:
○ cylinders
● One size scale:
○ weight
Position scale
Cartesian coordinate system
● The most widely used coordinate system for data visualization.
Valid data visualization charts
● All figures show the same information with different aspect ratios
Valid data visualization charts
● Since the same quantity (temperature) is plotted on both axes, the grid
spacings should be the same.
Nonlinear axes
● Logarithmic scale is the most commonly used nonlinear scale.

Data values [1, 3.16, 10, 31.6, 100] correspond to log10 values [0, 0.5, 1, 1.5, 2].


Logarithmic scale - an example
● Logarithmic scale is typically useful to represent ratios.
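
A quick numerical check of why a log scale suits ratios, sketched with NumPy: each value in the sequence below is roughly 3.16 times the previous one (3.16 is approximately the square root of 10), and on a log10 axis those equal ratios become equal distances.

```python
import numpy as np

# Each value is about 3.16x the previous one (3.16 is roughly sqrt(10)).
values = np.array([1, 3.16, 10, 31.6, 100])

# Positions on a log10 axis.
positions = np.log10(values)

# Equal ratios in the data turn into (nearly) equal steps of 0.5 on the axis.
steps = np.diff(positions)

rounded_positions = np.round(positions, 2)
```

On a linear axis the same five values would sit at very uneven distances, with the first three crowded near zero.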
Color scale
Use-cases for color in data visualization
● Three fundamental use-cases for color in data visualization:
○ Distinguish groups of data
○ Represent data values
○ Highlight
Color as a tool to distinguish

Qualitative color scale


Color to represent data values

Sequential color scale


Color to represent data values

Diverging color scale


Color as a tool to highlight

Accent color scale

The two neighboring states Louisiana and Texas experienced among the highest and lowest
population growth from 2000 to 2010.
Visualization Collections
Visualizing data
● Typically, we would like to visualize the following kinds of data:
○ Amounts
○ Distributions
○ Proportions
○ X-Y relationships
○ Uncertainty
Visualizing amounts
Visualizing amounts - bar plots
Visualizing amounts - bar plots
● The bars should not be reordered (e.g., by height) if they represent ordered categories
Visualizing amounts - grouped & stacked bars
Visualizing amounts - dot plots
Visualizing distributions
Visualizing distributions - histograms
● When making histograms, always explore multiple bin widths

[Figure: the same histogram drawn with bin widths of 1, 3, 5, and 15 years]
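
Exploring several bin widths is easy to script. The sketch below bins the same sample with the four widths shown above using `numpy.histogram`; the ages are synthetic (drawn uniformly between 0 and 75) and only stand in for a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.uniform(0, 75, size=500)  # synthetic ages in years, for illustration

counts_by_width = {}
for width in (1, 3, 5, 15):
    # Bin edges at 0, width, 2*width, ... up to and including 75.
    edges = np.arange(0, 75 + width, width)
    counts, _ = np.histogram(ages, bins=edges)
    counts_by_width[width] = counts

# Narrow bins give many noisy bars; wide bins give few, smooth bars.
num_bins = {width: len(counts) for width, counts in counts_by_width.items()}
```

Every version contains the same 500 observations; only the level of detail changes, which is why a single bin width can be misleading.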


Visualizing distributions - kernel density plots
● Different kernels include Gaussian, Rectangular etc.
● Each kernel is parameterized by bandwidth

[Figure: kernel density estimates using a Gaussian kernel with bandwidth = 0.5, 2, and 5, and a rectangular kernel with bandwidth = 2]
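
A kernel density estimate places one kernel on each observation and averages them; the bandwidth sets how wide each kernel is. The sketch below implements this directly in NumPy for the Gaussian and rectangular kernels, on synthetic data (the sample and grid are assumptions for illustration).

```python
import numpy as np


def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)


def rectangular_kernel(u):
    # Constant density 0.5 on [-1, 1], zero elsewhere.
    return np.where(np.abs(u) <= 1, 0.5, 0.0)


def kde(samples, grid, bandwidth, kernel=gaussian_kernel):
    """Average of kernels centered on each sample, evaluated on `grid`."""
    u = (grid[:, None] - samples[None, :]) / bandwidth
    return kernel(u).mean(axis=1) / bandwidth


rng = np.random.default_rng(0)
samples = rng.normal(loc=30, scale=10, size=200)  # synthetic data
grid = np.linspace(-50, 110, 2001)
dx = grid[1] - grid[0]

density_narrow = kde(samples, grid, bandwidth=0.5)  # wiggly, follows each point
density_wide = kde(samples, grid, bandwidth=5)      # smooth, may hide detail
density_rect = kde(samples, grid, bandwidth=2, kernel=rectangular_kernel)

# Each estimate is a density, so it should integrate to roughly 1.
area_narrow = density_narrow.sum() * dx
area_wide = density_wide.sum() * dx
area_rect = density_rect.sum() * dx
```

As with histogram bin widths, it is worth trying several bandwidths before settling on one.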
Visualizing distributions - kernel density plots
● To visualize several distributions at once, kernel density plots work better than
histograms
Visualizing distributions - highly skewed distribution
Visualizing distributions - multiple distributions

Boxplot Violin plot


Visualizing distributions - multiple distributions

Ridgeline plot
Visualizing proportions
Visualizing proportions - pie charts, stacked & side-by-side bars
● Pie charts help visually emphasize simple fractions, such as ½, ⅓, ¼, etc.

Pie chart Stacked bar Side-by-side bar


Visualizing proportions - pie charts, stacked & side-by-side bars
● Side-by-side bars make it easy to visualize how proportions change over time.
Visualizing proportions - pie charts, stacked & side-by-side bars
● Stacked bars are preferred when there are only two quantities to compare
over time.
● Stacked density plots can be used to visualize how proportions change in
response to a continuous variable.

Stacked bar plot Stacked density plot


Visualizing X-Y relationships
Visualizing X-Y relationships - scatterplots
Visualizing X-Y relationships - bubble plots
Visualizing X-Y relationships - scatterplot matrix
Visualizing X-Y relationships - correlation coefficient

[Figure: formula for the correlation coefficient, with the sample means annotated]
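
For reference, the standard (Pearson) correlation coefficient can be computed from deviations about the sample means. The sketch below assumes that this is the coefficient meant on the slide, and the data is made up for illustration.

```python
import numpy as np


def pearson_r(x, y):
    """r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))"""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the sample mean of x
    dy = y - y.mean()  # deviations from the sample mean of y
    return (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())


# Made-up data with a positive but imperfect linear relationship.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)  # about 0.77
```

The value always lies between -1 and 1, and it only measures linear association, so a scatterplot should still be inspected alongside it.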
Visualizing X-Y relationships - correlograms

Correlations between mineral content obtained from 214 glass samples during forensic work
Visualizing uncertainty
Visualizing uncertainty - probability distribution

The blue party is predicted to win over the yellow party by ~1 percentage point with
a margin of error of 1.76 percentage points.
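
The statement above can be turned into a probability of winning under two assumptions that are not stated on the slide: that the margin of error is the half-width of a 95% interval, and that the predicted lead follows a normal distribution. A minimal sketch using Python's standard library:

```python
from statistics import NormalDist

predicted_lead = 1.0    # blue minus yellow, in percentage points
margin_of_error = 1.76  # assumed to be the half-width of a 95% interval

# Convert the margin of error to a standard error (z is about 1.96 for 95%).
z_95 = NormalDist().inv_cdf(0.975)
std_error = margin_of_error / z_95

# Distribution of plausible outcomes for the lead (normality is an assumption).
outcome = NormalDist(mu=predicted_lead, sigma=std_error)

# Blue wins whenever the lead is above zero.
p_blue_wins = 1 - outcome.cdf(0.0)
```

Under these assumptions the blue party is favored but far from certain to win, which a single point estimate of "+1 point" would not convey.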
Visualizing uncertainty - population & sample
Visualizing uncertainty - confidence intervals

Ratings of Chocolate bars manufactured in Canada
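
A confidence interval for a mean can be sketched as the sample mean plus or minus a multiple of the standard error. The ratings below are hypothetical, not the actual chocolate bar data, and the normal approximation is an assumption; for small samples a t-distribution would typically give slightly wider intervals.

```python
import numpy as np
from statistics import NormalDist


def mean_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std_error = values.std(ddof=1) / np.sqrt(len(values))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * std_error, mean, mean + z * std_error


# Hypothetical ratings, for illustration only.
ratings = [3.0, 3.5, 2.75, 3.25, 4.0, 3.5, 3.75, 3.0]

low_80, mean_rating, high_80 = mean_confidence_interval(ratings, level=0.80)
low_95, _, high_95 = mean_confidence_interval(ratings, level=0.95)
low_99, _, high_99 = mean_confidence_interval(ratings, level=0.99)
```

Higher confidence levels produce wider intervals around the same mean, which is why charts often draw several levels as nested error bars.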


Visualizing uncertainty - comparing parameter estimates
Questions?
