
MAT8033:

Course Overview
What You Will Learn in this Lecture
• Practical application of data science in general, and machine learning in particular.

• Extract insights from data.

• Get an overview of different concepts in machine learning and their practical applications, from basic data visualization up to the concepts behind (Chat-)GPT.

• Understand the limits and pitfalls of big data and machine learning.

Dall-E 2 prompt: “A teddy bear riding a skateboard in Times Square.”
Requirements
• Basic knowledge of probability theory and statistics is assumed.

• Basic knowledge of programming is helpful but not required.


Additional Reading
• Data science is best learned by doing.

• There are many useful tutorials and blogs online.

• You are encouraged to search the internet for useful resources.

• What matters is a general understanding of how to apply data science in practice.


Recommendations
• Excellent introduction to Machine-Learning:
Jake VanderPlas - Machine Learning with Scikit-Learn (I) - PyCon 2015
Recommendations
• Excellent visualizations

https://mlu-explain.github.io
Attending The Lecture?
• The lecture explains concepts and provides real-life experience beyond textbook exercises.
• In order to fully understand the concepts, you must solve exercises yourself.
• Lecture helps you to learn “how to think” about data-driven problem solving.
• Lecture helps to “connect the dots” between the different concepts.
Get the most out of attending
the Lecture
• The lecture is supposed to be an active, engaging
interaction between the professor and the students.
• When in class, eliminate distractions, and try to
follow the discussion. Ask questions if things are not
clear.
• Don’t waste time by sitting in classes without paying
attention or distracting yourself with laptops.
(Preliminary) Course Outline
1. Introduction (What is machine learning, AI, …)

2. Data Visualization

3. Linear regression & basic concepts of machine learning regressors

4. Logistic regression & basic concepts of machine learning classifiers

5. The Random Forest algorithm

6. The concept of overfitting

7. Common machine learning methods (k-nearest neighbors, neural nets, …)

8. Unsupervised machine learning

9. Large Language Models


Depending on your interests and existing knowledge, we can adjust the pace and topics.
Final Exam (50% of the grade)
• Exam Date: to be announced
• Written exam covers all topics contained in these slides.
• All questions are related to information from the slides and discussions during the lecture.
• Questions can typically be answered with 1-4 sentences per question.
• If you understand and can explain the concepts, you will do fine.
• Example:
Write down the two terms of the objective function for Ridge Regression and
explain the meaning of both terms.
Answer:
The objective is β̂₀, …, β̂ₚ = argmin [ ∑_{i=1}^{n} (yᵢ − ŷᵢ)² + λ ∑_{j=1}^{p} βⱼ² ].

The first term fits the parameters most closely to the data and the
second term regularizes the regression coefficients, that means it
keeps the coefficients small.
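A minimal sketch (beyond what the exam requires): in scikit-learn, this objective is implemented by Ridge, whose alpha parameter plays the role of λ.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data: n = 100 samples, p = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# alpha corresponds to lambda above: larger alpha shrinks the coefficients more.
print(Ridge(alpha=1.0).fit(X, y).coef_)
print(Ridge(alpha=100.0).fit(X, y).coef_)  # stronger regularization -> smaller betas
```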
Project Scope (50% of the grade)
• Details will be announced soon.
Feynman Technique
The Feynman Technique:

1. Pick a topic you want to understand and start studying it.

2. (Pretend to) teach the topic to a friend who is unfamiliar with the topic.

3. Go back to the resource material when you get stuck.

4. Simplify and organize.

5. Reiterate.

• Dr. Feynman was a remarkably amazing educator and physicist.
• He received the Nobel prize in 1965 for his work on QED.
• The Feynman Lectures on Physics serve as a great introduction to the concepts of physics.

Feynman diagram to visualize particle interactions:


Beyond The Lecture
• Science moves fast.
• Science moves faster and faster:
decreasing half-life of knowledge
• Data science moves especially fast
• stay competitive in the job-market:
- sign up to newsletters
- follow popular research blogs
- talk to friends/colleagues/experts and ask
questions
- stay curious!
Beyond The Lecture
• Sleep is one of the most important but least understood aspects of our life, wellness, and longevity.

“The best bridge between despair and hope is a good night's sleep.”
- E. Joseph Cossman
• Sleep enriches our ability to learn, memorize, and make
logical decisions.
• Sleep recalibrates our emotions, restocks our immune
system, fine-tunes our metabolism, and regulates our
appetite.
• Dreaming mollifies painful memories and creates a virtual
reality space in which the brain melds past and present
knowledge to inspire creativity.
• Important contributions to a healthy life:
• sleep enough (~8h per night)
• eat a healthy and balanced diet
• exercise regularly
Beyond The Lecture
• The Risks-X institute, led by Prof. Didier Sornette, was established in 2019 as a pilot cooperation project between ETH Zurich and SUSTech.

• In the 2021 edition of the QS World University Rankings, ETH Zurich was ranked 6th in the world.

• For student projects: reach out to me (if you are highly motivated, diligent, and quantitative).

https://en.wikipedia.org/wiki/Didier_Sornette https://de.wikipedia.org/wiki/ETH_Zürich https://er.ethz.ch/Risks-X.html


Assistance

1. Slides are sent in group chat and published on blackboard.

2. Ask questions before, during, and after class (critical questions and feedback are very welcome!)

3. Ask questions on WeChat

4. Ask questions via email ([email protected])

5. Visit us during office hour:

Thursdays, 17:00-19:00

Address: SUSTech Innovation Park, Building 6, Room 511-2 (南方科技大学创园6栋511-2)

Please write an email or text beforehand to confirm


Introduction
What is Big Data?
• The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly.
• By 2025, global data creation is projected to grow to more than 180 zettabytes.

Does this mean the


definition of Big Data changes
every year?

🤔
source: https://www.statista.com/statistics/871513/worldwide-data-created/
What is Big Data?

Big Data helps us to answer


Big Questions!

Chase Davis (Journalist)


https://source.opennews.org/articles/using-big-data-ask-big-questions/

• Big data means “big variety of data”


• small question:
How many people live in Shenzhen?
• big question:
How is the possibility of working from home going to affect people’s decisions to move to big cities?
• By cleverly merging and carefully analyzing different datasets, we can reach big insights.
Big Questions

• The availability of data from different sources


allows us to obtain better and better insights.
• A good understanding of statistics and data
science is required.
• Pitfalls are common (overfitting, small-sample bias, confounding biases, and so forth).
What is Data Mining?

• The process of extracting (‘mining’) insights from data


• Data mining refers to analytics methods that go beyond basic analysis such
as counts, averages, and other basic descriptive techniques.
• includes statistical and machine-learning methods that inform decision-making
• prediction is an important component
• examples:
• predict sales.
• determine if an X-ray contains a tumor
Data Mining
• There is no one-size-fits-all solution, each problem requires specific understanding.

• Michelangelo was known for his test-and-learn approach. He started with an idea, tested it, changed it, and readily abandoned it for a better approach (but without overfitting!)
Simplicity is Key

• Beware the organizational setting where analytics is a solution in search of a problem


(“putting the cart before the horse”).
• Successful use of analytics and data mining requires both an understanding of the business
context where value is to be captured, and an understanding of exactly what the data mining
methods do.
• As applications (e.g. in Python) become easier, one is more and more likely to fall into this trap.
Nowadays, you can fit a sophisticated ML model with just one line of code.
• “Simplicity is ultimate sophistication” - Michelangelo
Why Are There So Many Different Methods?

• Each method has advantages and disadvantages.


• The usefulness of a method can depend on factors such as
the size of the dataset, the types of patterns that exist in the
data, whether the data meet some underlying assumptions
of the method, how noisy the data are, and the particular
goal of the analysis.
• Experience is needed to decide which method to use.
Big data does not replace thinking!

• When predictive analytics are done right, the analyses are not a means to a predictive end;
rather, the desired predictions become a means to analytical insight and discovery.
• We do a better job of analyzing what we really need to analyze and predicting what we really
want to predict.
• Many companies invest in the big-data hype, but fail to align their data science methods with
their commercial goals.
• Beware: applying a model without understanding its implications and limits is useless. Drawing
erroneous conclusions can have drastic consequences.
• Blindly fitting a model to data will produce a number, not an insight!
What is Business Analytics?
• Business Analytics (BA) is the practice and art of bringing quantitative data to bear on
decision-making.
• includes a range of data analysis methods
• for many traditional firms, applications involve little more than counting, rule-checking, and
basic arithmetic
• Business Intelligence (BI), refers to data visualization and reporting for understanding “what
happened and what is happening.”
• Use of charts, tables, and dashboards to display, examine, and explore data.
Big Data, Data Mining, AI,…?

• sometimes overlapping and inconsistent definitions between big data, data mining, machine
learning, AI etc.

• This is just nomenclature, the concepts are what matters.


Data Markets & Ethics

• Data can be very valuable, since it allows us to generate (business) insights.


• More and more news stories appear concerning illegal, fraudulent, unsavory or just
controversial uses of data science.
• Be aware: (big) data is both an asset and a liability.
Data Analysis
Overview
The Steps in Data Mining
• typical steps during data analysis are abbreviated as SEMMA:
1. Sample: Take a sample from the dataset; partition into training, validation, and test datasets.

2. Explore: Examine the dataset statistically and graphically.

3. Modify: Transform the variables and impute missing values.

4. Model: Fit predictive models (e.g., regression tree, neural network).

5. Assess: Compare models using a validation dataset.

• In the course, we will focus mostly on steps 4 and 5, since these are general tasks.

• In practice, steps 1-3 are often the most cumbersome, and typically very application specific (they require domain expertise, etc.)
SEMMA: Sample
• when doing machine learning, we always split the data into training and test sets
• this aspect is crucial to avoid overfitting
• even in more traditional approaches to data analysis this is useful, as overfitting problems can sneak in
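A minimal sketch of such a split with scikit-learn (the toy X and y are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix (10 datapoints, 2 features)
y = np.arange(10)                 # toy target

# Hold out 30% of the data; the model never sees the test set during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```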
SEMMA: Explore
• four types of data:
• Nominal: values are not ordered (nationality, color, etc.)
• Ordinal: values are ordered or ranked (e.g. food spiciness: not spicy, mildly spicy, very spicy)
• Discrete: values are countable and discrete (e.g. number of heads in 10 coin tosses)
• Continuous: the variable can take any value in a certain interval (e.g. height of a person)

• Another classification: structured vs. unstructured data. For now, we focus on structured data.
SEMMA: Explore

Always visualize data first!!!

(see details in next section)


SEMMA: Modify
Outliers: Values that lie “far away” from the bulk of the data.
• The term far away is deliberately left vague because what is or is not called an outlier is problem specific.
• Some outliers are due to erroneous data collection (e.g. measurement error)
• Some outliers are system intrinsic and should not be ignored

Are datapoints 1, 2, 3 outliers? 🤔
SEMMA: Modify
• Missing Values: If the number of datapoints with missing values is small, those datapoints might be omitted.
• Can replace the missing value with an imputed value, based on the other values for that variable across all
datapoints.
• In practice, this is a common and cumbersome issue!
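A small sketch of both options with pandas (the toy table is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [170, np.nan, 165, 180],
                   "weight": [65, 70, np.nan, 85]})

# Option 1: omit datapoints with missing values (fine if they are few).
df_dropped = df.dropna()

# Option 2: impute each missing value with a per-column statistic, e.g. the median.
df_imputed = df.fillna(df.median())
```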
SEMMA: Modify
• Standardizing and Rescaling data: subtract the mean from each value and then divide by the standard
deviation (also called a z-score).
• Standardizing is one way to bring all variables to the same scale.
• Another popular approach is to rescale each variable to the [0, 1] interval (min-max normalization).
SEMMA: Modify
• Some data covers various orders of magnitude (e.g. wealth, number of WeChat contacts, …).
• Taking the average and standard deviation can be misleading, as they will be skewed towards the large values.
• Often, one first applies a log-transform (and then standardizes).
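A minimal sketch of these transforms in NumPy (the toy array is made up to span several orders of magnitude):

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

# z-score standardization: mean 0, standard deviation 1
z = (x - x.mean()) / x.std()

# min-max normalization to the [0, 1] interval
x01 = (x - x.min()) / (x.max() - x.min())

# for data spanning orders of magnitude: log-transform first, then standardize
logx = np.log10(x)
z_log = (logx - logx.mean()) / logx.std()
```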
SEMMA: Model
• The (supervised) modeling part boils down to
y = f(X; θ) + noise
where y is the target, X are the features, θ are the model parameters, and f is the machine learning function (“inductive bias”).

• Pick an algorithm that is suited to the problem
• supervised vs. unsupervised?
• categorical or numerical data?
• regression or classification?
• …
• Over the course, we will learn different methods
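A minimal end-to-end sketch, assuming a synthetic dataset and a random forest as the chosen f:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic supervised problem: y = f(X; theta) + noise
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Choosing the algorithm fixes the form of f; fitting estimates theta.
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data (assessment step)
```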
SEMMA: Assess
• How well will our prediction or classification model perform when we apply it to new data?
• To assure generalization, we use the concept of data partitioning and try to avoid overfitting.
SEMMA
• Given a clean dataset of normalized features, the modeling and analysis part (MA in SEMMA) is nowadays quite straightforward, due to the many useful libraries (e.g. scikit-learn, keras, …)

• A lot of time must be devoted to the cleaning, understanding and preparation of the data (SEM
in SEMMA). This part is much harder to automate and standardize. It often requires a lot of
problem specific thought and work.
Data Visualization
Data Visualization

• Always visualize data first!!!

• Different types of data require different types of visualization


Histograms
• quantitative data can be visualized by histograms:
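A minimal sketch with matplotlib, using synthetic “heights” for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(0).normal(loc=170, scale=10, size=1000)  # e.g. heights in cm

plt.hist(data, bins=30, edgecolor="black")
plt.xlabel("height [cm]")
plt.ylabel("count")
plt.show()
```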
Data Distributions
• (Continuous) quantitative data is often visualized via its distribution also called probability
density function (pdf). Formally, the pdf can be derived as the continuous limit of the
histogram.

P(a ⩽ X ⩽ b) = ∫_a^b f(x) dx
Data Distributions

• (Continuous) quantitative data is often visualized via its distribution also called probability
density function (pdf).
• The (local) maxima are called peaks or modes.
Data Distributions
There are different ways to plot the distribution of data:
• PDF = probability density function (~ number of data per unit interval)
• CDF = cumulative distribution function (~ number of data less than given value)
• SF = survival function (complement to CDF, 1-CDF, hence also called CCDF)
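A minimal sketch plotting all three views of the same synthetic dataset:

```python
import matplotlib.pyplot as plt
import numpy as np

data = np.sort(np.random.default_rng(0).exponential(scale=2.0, size=1000))

# Empirical CDF: fraction of datapoints less than or equal to each value.
cdf = np.arange(1, len(data) + 1) / len(data)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(data, bins=40, density=True)
axes[0].set_title("PDF (histogram)")
axes[1].plot(data, cdf)
axes[1].set_title("CDF")
axes[2].plot(data, 1 - cdf)
axes[2].set_title("SF = 1 - CDF (CCDF)")
plt.show()
```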
Quantiles
• The quantile function is just the inverse of the CDF. The q-quantile is the value below which a fraction q of all data lies.
• Special cases are the quartiles. We have
Q1 = 25% quantile
Q2 = 50% quantile = median
Q3 = 75% quantile
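A quick sketch with NumPy (synthetic data):

```python
import numpy as np

data = np.random.default_rng(0).normal(size=1000)

# Q1, Q2 (median), and Q3 are the 25%, 50%, and 75% quantiles.
q1, median, q3 = np.quantile(data, [0.25, 0.50, 0.75])
print(q1, median, q3)
```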
Data Visualization
• ordinal data is often visualized via bar charts:
Boxplots

interquartile range: IQR = Q3 − Q1
(the range in which the middle half of the data lies)

source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51; source: https://cds.climate.copernicus.eu/toolbox/doc/gallery/54_box_plots.html


Violinplots

• show the entire distribution, rather than just min/max/mean/median/IQR

source: https://seaborn.pydata.org/generated/seaborn.violinplot.html
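A minimal sketch of both plot types, assuming seaborn's bundled "tips" example dataset (downloaded on first use):

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")  # example dataset shipped with seaborn

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=tips, x="day", y="total_bill", ax=ax1)     # quartiles + outliers
sns.violinplot(data=tips, x="day", y="total_bill", ax=ax2)  # full distribution per group
plt.show()
```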
Heatmaps
• often easier to read than 3D plots, and can also represent categorical data
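A minimal sketch, assuming seaborn's bundled "flights" example dataset:

```python
import matplotlib.pyplot as plt
import seaborn as sns

flights = sns.load_dataset("flights")  # example dataset shipped with seaborn
table = flights.pivot(index="month", columns="year", values="passengers")

# Two categorical axes (month, year); color encodes the passenger count.
sns.heatmap(table, cmap="viridis")
plt.show()
```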
The Importance of Error Bars
Student: “Great, I can see a trend in my data!” 😃
Professor: “Don’t forget to add error bars”
It is very easy to get fooled or even fool yourself!

😳
The Importance of Error Bars

“I only believe in statistics that I manipulated myself”
- Winston S. Churchill
“pull yourself up by your bootstraps”

Bootstrapping
• Bootstrapping is a straightforward way to get error bars.
• In bootstrapping, we generate N datasets from one dataset with K datapoints.
• For each bootstrapped dataset, we sample K times with replacement from the original dataset.
• Calculate the statistic on each of the N bootstrapped datasets, and use the standard deviation (or interquartile range) as the error bar. (Or just create violin plots of all samples.)
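A minimal sketch of bootstrapped error bars for the median of a synthetic heavy-tailed dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=200)  # original dataset, K = 200 datapoints

N = 1000  # number of bootstrapped datasets
medians = np.empty(N)
for i in range(N):
    # Resample K datapoints with replacement and recompute the statistic.
    sample = rng.choice(data, size=len(data), replace=True)
    medians[i] = np.median(sample)

# Standard deviation across bootstrapped datasets serves as the error bar.
print(f"median = {np.median(data):.3f} +/- {medians.std():.3f}")
```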
Visualizing Networked Data
• A network diagram consists of actors and relations between them.
• Nodes are the actors (e.g., users in a social network or products in a product network), and
represented by circles.
• Edges are the relations between nodes, and are represented by lines connecting nodes.

• For example, in a social network such as WeChat, we can construct a list of users (nodes) and all the pairwise relations (edges) between users who have each other's contact.
• Many helpful libraries exist (e.g. networkx for Python).
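A minimal sketch with networkx; the user names and contact relations are hypothetical:

```python
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()
# Nodes are users, edges are mutual-contact relations.
G.add_edges_from([("Ana", "Bo"), ("Bo", "Chen"), ("Chen", "Ana"), ("Chen", "Dev")])

nx.draw(G, with_labels=True, node_color="lightblue")
plt.show()
```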
