MAT8033 Lecture Slides
Course Overview
What You Will Learn in this Lecture
• Practical application of data science in general, and machine learning in particular.
[Image: Dall-E 2 output for the prompt “A teddy bear riding a skateboard in Times Square.”]
https://mlu-explain.github.io
Attending The Lecture?
• Explain concepts and provide real-life experiences beyond textbook exercises.
• In order to fully understand the concepts, you must solve exercises yourself.
• Lecture helps you to learn “how to think” about data-driven problem solving.
• Lecture helps to “connect the dots” between the different concepts.
Get the Most Out of Attending the Lecture
• The lecture is supposed to be an active, engaging interaction between the professor and the students.
• When in class, eliminate distractions and try to follow the discussion. Ask questions if things are not clear.
• Don’t waste time by sitting in classes without paying attention or distracting yourself with laptops.
(Preliminary) Course Outline
1. Introduction (What is machine learning, AI, …)
2. Data Visualization
min_θ  ∑_{i=1}^{n} ( y_i − f(x_i; θ) )² + λ ∑_{j=1}^{p} θ_j²
The first term fits the parameters most closely to the data and the second term regularizes the regression coefficients, i.e., it keeps the coefficients small.
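A minimal sketch of this trade-off, assuming a ridge-type (squared) penalty as described above; the toy data and the alpha values are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

# toy data: 100 samples, 5 features (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=100)

# larger alpha puts more weight on the regularization term,
# shrinking the fitted coefficients towards zero
for alpha in [0.01, 1.0, 100.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:6.2f} -> coefficients {np.round(coef, 3)}")
```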
Project Scope (50% of the grade)
• Details will be announced soon.
Feynman Technique
The Feynman Technique:
1. Pick a topic you want to understand and start studying it.
2. (Pretend to) teach the topic to a friend who is unfamiliar with the topic.
5. Reiterate
• Dr. Feynman was a remarkable educator and physicist.
• He received the Nobel Prize in 1965 for his work on QED.
• The Feynman Lectures on Physics serve as a great introduction to the concepts of physics.
• In the 2021 edition of QS World University Rankings, ETH Zurich was ranked 6th in the world.
• For student projects: reach out to me (if you are highly motivated, diligent and quantitative)
Thursdays, 17:00-19:00
Address: Southern University of Science and Technology, Innovation Park Building 6, Room 511-2
source: https://www.statista.com/statistics/871513/worldwide-data-created/
What is Big Data?
• Michelangelo was known for his test-and-learn approach. He started with an idea, tested it, changed it, and readily abandoned it for a better approach (but without overfitting!)
Simplicity is Key
• When predictive analytics are done right, the analyses are not a means to a predictive end;
rather, the desired predictions become a means to analytical insight and discovery.
• We do a better job of analyzing what we really need to analyze and predicting what we really
want to predict.
• Many companies invest in the big data hype, but fail to align their data science methods with their commercial goals.
• Beware: applying a model without understanding its implications and limits is useless. Drawing
erroneous conclusions can have drastic consequences.
• Blindly fitting a model to data will produce a number, not an insight!
What is Business Analytics?
• Business Analytics (BA) is the practice and art of bringing quantitative data to bear on
decision-making.
• includes a range of data analysis methods
• for many traditional firms, applications involve little more than counting, rule-checking, and
basic arithmetic
• Business Intelligence (BI) refers to data visualization and reporting for understanding “what happened and what is happening.”
• Use of charts, tables, and dashboards to display, examine, and explore data.
Big Data, Data Mining, AI,…?
• sometimes overlapping and inconsistent definitions between big data, data mining, machine
learning, AI etc.
• In the course, we will focus mostly on steps 4 and 5, since these are general tasks.
• In practice, steps 1-3 are often the most cumbersome, and typically very application specific
(requires domain expertise etc.)
SEMMA: Sample
• when doing machine learning, we always split data into a train and a test set
• this is crucial to avoid overfitting
• even in more traditional approaches to data analysis this is useful, as overfitting problems can sneak in
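A minimal sketch of such a split with scikit-learn; the 80/20 ratio and the toy arrays are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy feature matrix and target (illustrative only)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# hold out 20% of the datapoints; the test set is only used for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```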
SEMMA: Explore
• four types of data
[Table of data types; e.g., nominal: values are not ordered (nationality, color, etc.)]
• Another classification: structured vs. unstructured data. For now, we focus on structured data.
SEMMA: Explore
[Scatter plot with one point labeled “outlier”]
Are datapoints 1, 2, 3 outliers? 🤔
SEMMA: Modify
• Missing Values: If the number of datapoints with missing values is small, those datapoints might be omitted.
• Can replace the missing value with an imputed value, based on the other values for that variable across all
datapoints.
• In practice, this is a common and cumbersome issue!
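A small sketch of both options with pandas; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [3500, np.nan, 4200, 3900, np.nan],
                   "age":    [25, 31, np.nan, 45, 52]})

# Option 1: omit the datapoints (rows) that contain missing values
df_dropped = df.dropna()

# Option 2: impute, e.g. with the median of the observed values of each variable
df_imputed = df.fillna(df.median())
print(df_imputed)
```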
SEMMA: Modify
• Standardizing and Rescaling data: subtract the mean from each value and then divide by the standard
deviation (also called a z-score).
• Standardizing is one way to bring all variables to the same scale.
• Another popular approach is to normalize each variable to a [0, 1] scale.
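A sketch of both rescalings with NumPy; the sample values are made up:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z-score: subtract the mean, then divide by the standard deviation
z = (x - x.mean()) / x.std()

# min-max: rescale each value to the [0, 1] interval
x01 = (x - x.min()) / (x.max() - x.min())

print(np.round(z, 2))
print(np.round(x01, 2))
```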
SEMMA: Modify
• Some data covers various orders of magnitude (e.g. wealth, number of WeChat contacts, …).
• Taking the average and standard deviation can be misleading; they will be skewed towards the large values.
• Often, one first applies a log-transform (and then standardizes).
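A minimal sketch of this two-step transform; the wealth values are invented and the base-10 log is just one common choice:

```python
import numpy as np

# heavy-tailed toy data spanning several orders of magnitude
wealth = np.array([1e2, 5e2, 1e3, 2e3, 1e4, 5e4, 1e6])

log_wealth = np.log10(wealth)                              # log-transform first
z = (log_wealth - log_wealth.mean()) / log_wealth.std()    # then standardize

print(np.round(z, 2))
```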
SEMMA: Model
• The (supervised) modeling part boils down to
y = f(X; θ) + noise
where y is the target, X are the features, θ are the model parameters, and f is the machine learning function (“inductive bias”).
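As an illustrative sketch only (not the specific model used in the course), one simple choice of f is a linear function; the synthetic data below is generated exactly in the form y = f(X; θ) + noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data: y = X @ theta_true + noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))         # features
theta_true = np.array([2.0, -1.0])            # "true" model parameters
y = X @ theta_true + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)          # f is assumed linear here
print("estimated theta:", model.coef_)        # should be close to [2, -1]
```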
• A lot of time must be devoted to the cleaning, understanding and preparation of the data (SEM
in SEMMA). This part is much harder to automate and standardize. It often requires a lot of
problem specific thought and work.
Data Visualization
P(a ⩽ X ⩽ b) = ∫_a^b f(x) dx
Data Distributions
• (Continuous) quantitative data is often visualized via its distribution, also called the probability density function (pdf).
• The (local) maxima are called peaks or modes.
Data Distributions
There are different ways to plot the distribution of data:
• PDF = probability density function (~ number of data per unit interval)
• CDF = cumulative distribution function (~ number of data less than given value)
• SF = survival function (complement to CDF, 1-CDF, hence also called CCDF)
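A sketch of empirical versions of all three for a synthetic sample; the plotting choices (number of bins, log axes) are arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).lognormal(size=1000)   # synthetic sample

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(data, bins=50, density=True)               # empirical PDF
axes[0].set_title("PDF")

x = np.sort(data)
cdf = np.arange(1, len(x) + 1) / len(x)
axes[1].plot(x, cdf)                                    # empirical CDF
axes[1].set_title("CDF")

axes[2].loglog(x, 1 - cdf)                              # SF / CCDF = 1 - CDF
axes[2].set_title("SF (CCDF)")

plt.show()
```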
Quantiles
• The quantile function is just the inverse of the CDF. The q-quantile is the value below which a fraction q of all data lies.
• Special cases are the quartiles. We have
Q1 = 25% quantile
Q2 = 50% quantile = median
Q3 = 75% quantile
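A quick sketch with NumPy; the sample values are made up:

```python
import numpy as np

data = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21])

# q-quantiles for q = 0.25, 0.50, 0.75
q1, q2, q3 = np.quantile(data, [0.25, 0.50, 0.75])
print(q1, q2, q3)                  # Q2 equals np.median(data)
```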
Data Visualization
• ordinal data is often visualized via bar charts:
[Bar chart of counts per score]
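A minimal bar-chart sketch; the scores below are invented:

```python
from collections import Counter
import matplotlib.pyplot as plt

scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5]   # hypothetical ordinal scores
counts = Counter(scores)

plt.bar(list(counts.keys()), list(counts.values()))
plt.xlabel("score")
plt.ylabel("count")
plt.show()
```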
Boxplots
source: https://seaborn.pydata.org/generated/seaborn.violinplot.html
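A sketch along the lines of the linked seaborn documentation, using seaborn's bundled "tips" demo dataset (the column names come from that dataset):

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")   # bundled example dataset

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[0])      # boxplot
sns.violinplot(data=tips, x="day", y="total_bill", ax=axes[1])   # violin plot
plt.show()
```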
Heatmaps
• often easier to read than 3d-plots, and can also represent categorical data
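A minimal heatmap sketch with seaborn; the matrix, row labels, and column labels are invented:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# hypothetical matrix: rows = product category, columns = month
values = np.random.default_rng(0).integers(0, 100, size=(4, 6))

sns.heatmap(values,
            xticklabels=[f"month {m}" for m in range(1, 7)],
            yticklabels=["A", "B", "C", "D"],
            annot=True)
plt.show()
```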
The Importance of Error Bars
Student: “Great, I can see a trend in my data!” 😃
Professor: “Don’t forget to add error bars”
😳
Winston S. Churchill
“pull yourself up by your bootstraps”
Bootstrapping
• Bootstrapping is a straightforward way to get error bars.
• In bootstrapping, we generate N datasets from one dataset with K datapoints.
• For each bootstrapped dataset, we sample K times with replacement from the original dataset.
• Calculate the statistic on each of the N bootstrapped datasets, and use the standard deviation (or inter-quartile range) as the error bar. (Or just create violin plots of all samples.)
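A minimal sketch of this recipe for the mean of a sample; the dataset, K, and N are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(size=200)   # original dataset with K = 200 datapoints
N = 1000                           # number of bootstrapped datasets

# for each bootstrapped dataset: resample K times with replacement, compute the statistic
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(N)
])

print("mean of the data:", data.mean())
print("bootstrap error bar (std of the bootstrapped means):", boot_means.std())
```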
Visualizing Networked Data
• A network diagram consists of actors and relations between them.
• Nodes are the actors (e.g., users in a social network or products in a product network), and
represented by circles.
• Edges are the relations between nodes, and are represented by lines connecting nodes.
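A minimal sketch with networkx; the toy social network below is invented:

```python
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
# nodes are the actors, edges are the relations between them
G.add_edges_from([("Alice", "Bob"), ("Bob", "Carol"),
                  ("Carol", "Alice"), ("Carol", "Dave")])

nx.draw(G, with_labels=True, node_color="lightblue", node_size=1200)
plt.show()
```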