
MAT8033:

Course Overview
What You Will Learn in this Lecture
• Practical application of data science in general, and machine learning in particular.

• Extract insights from data.

• Get an overview of different concepts in machine learning and their practical applications, from basic data visualization up to the concepts behind (Chat-)GPT.

• Understand the limits and pitfalls of big data and machine learning.

Dall-E 2 prompt: “A teddy bear riding a skateboard in Times Square.”
Requirements
• Basic knowledge of probability theory and statistics is assumed.

• Basic knowledge of programming is helpful but not required.


Additional Reading
• Data science is best learned by doing.

• There are many useful tutorials and blogs online.

• You are encouraged to search the internet for useful resources.

• What matters is a general understanding of how to apply data science in practice.


Recommendations
• Excellent introduction to Machine-Learning:
Jake VanderPlas - Machine Learning with Scikit-Learn (I) - PyCon 2015
Recommendations
• Excellent visualizations

https://mlu-explain.github.io
Attending The Lecture?
• The lecture explains concepts and provides real-life experience beyond textbook exercises.
• In order to fully understand the concepts, you must solve exercises yourself.
• Lecture helps you to learn “how to think” about data-driven problem solving.
• Lecture helps to “connect the dots” between the different concepts.
Get the most out of attending
the Lecture
• The lecture is supposed to be an active, engaging
interaction between the professor and the students.
• When in class, eliminate distractions, and try to
follow the discussion. Ask questions if things are not
clear.
• Don’t waste time by sitting in classes without paying
attention or distracting yourself with laptops.
(Preliminary) Course Outline
1. Introduction (What is machine learning, AI, …)

2. Data Visualization

3. Linear regression & basic concepts of machine learning regressors

4. Logistic regression & basic concepts of machine learning classifiers

5. The Random Forest algorithm

6. The concept of overfitting

7. Common machine learning methods (k-nearest neighbors, neural nets, …)

8. Unsupervised machine learning

9. Large Language Models


Depending on your interests and existing knowledge, we can adjust the pace and topics.
Final Exam (50% of the grade)
• Exam Date: to be announced
• Written exam covers all topics contained in these slides.
• All questions are related to information from the slides and discussions during the lecture.
• Questions can typically be answered with 1-4 sentences per question.
• If you understand and can explain the concepts, you will do fine.
• Example:
Write down the two terms of the objective function for Ridge Regression and
explain the meaning of both terms.
Answer:
The objective is β̂₀, …, β̂ₚ = argmin [ ∑_{i=1}^{n} (yᵢ − ŷᵢ)² + λ ∑_{j=1}^{p} βⱼ² ].

The first term fits the parameters most closely to the data and the
second term regularizes the regression coefficients, that means it
keeps the coefficients small.
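A minimal sketch (beyond what the exam requires): in scikit-learn, this objective is implemented by Ridge, whose alpha parameter plays the role of λ.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data: n = 100 samples, p = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# alpha corresponds to lambda above: larger alpha shrinks the coefficients more.
print(Ridge(alpha=1.0).fit(X, y).coef_)
print(Ridge(alpha=100.0).fit(X, y).coef_)  # stronger regularization -> smaller betas
```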
Project Scope (50% of the grade)
• Details will be announced soon.
Feynman Technique
The Feynman Technique:

1. Pick a topic you want to understand and start studying it.

2. (Pretend to) teach the topic to a friend who is unfamiliar with the topic.

3. Go back to the resource material when you get stuck.

4. Simplify and organize.

5. Reiterate.

• Dr. Feynman was a remarkably amazing educator and physicist.
• He received the Nobel prize in 1965 for his work on QED.
• The Feynman Lectures on Physics serve as a great introduction to the concepts of physics.

Feynman diagram to visualize particle interactions:


Beyond The Lecture
• Science moves fast.
• Science moves faster and faster:
decreasing half-life of knowledge
• Data science moves especially fast
• stay competitive in the job-market:
- sign up to newsletters
- follow popular research blogs
- talk to friends/colleagues/experts and ask
questions
- stay curious!
Beyond The Lecture
• Sleep is one of the most important but least understood aspects of our life, wellness, and longevity.

“The best bridge between despair and hope is a good night's sleep.”
- E. Joseph Cossman
• Sleep enriches our ability to learn, memorize, and make
logical decisions.
• Sleep recalibrates our emotions, restocks our immune
system, fine-tunes our metabolism, and regulates our
appetite.
• Dreaming mollifies painful memories and creates a virtual
reality space in which the brain melds past and present
knowledge to inspire creativity.
• Important contributions to a healthy life:
• sleep enough (~8h per night)
• eat a healthy and balanced diet
• exercise regularly
Beyond The Lecture
• The Risks-X institute, led by Prof. Didier Sornette, was established in 2019 as a pilot cooperation project between ETH Zurich and SUSTech.

• In the 2021 edition of the QS World University Rankings, ETH Zurich was ranked 6th in the world.

• For student projects: reach out to me (if you are highly motivated, diligent, and quantitative).

https://en.wikipedia.org/wiki/Didier_Sornette https://de.wikipedia.org/wiki/ETH_Zürich https://er.ethz.ch/Risks-X.html


Assistance

1. Slides are sent in group chat and published on blackboard.

2. Ask questions before, during, and after class (critical questions and feedback are very welcome!)

3. Ask questions on WeChat

4. Ask questions via email ([email protected])

5. Visit us during office hour:

Thursdays, 17:00-19:00

Address: SUSTech Innovation Park, Building 6, Room 511-2 (南方科技大学创园6栋511-2)

Please write an email or text beforehand to confirm


Introduction
What is Big Data?
• The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly.
• By 2025, global data creation is projected to grow to more than 180 zettabytes.

Does this mean the


definition of Big Data changes
every year?

🤔
source: https://www.statista.com/statistics/871513/worldwide-data-created/
What is Big Data?

Big Data helps us to answer


Big Questions!

Chase Davis (Journalist)


https://source.opennews.org/articles/using-big-data-ask-big-questions/

• Big data means “big variety of data”


• small question:
How many people live in Shenzhen?
• big question:
How is the possibility of working from home going to affect people’s decisions to move to big cities?
• By cleverly merging and carefully analyzing different datasets, we can reach big insights.
Big Questions

• The availability of data from different sources


allows us to obtain better and better insights.
• A good understanding of statistics and data
science is required.
• Pitfalls are common (overfitting, small-sample bias, confounding biases, and so forth).
What is Data Mining?

• The process of extracting (‘mining’) insights from data


• Data mining refers to analytics methods that go beyond basic analysis such
as counts, averages, and other basic descriptive techniques.
• includes statistical and machine-learning methods that inform decision-making
• prediction is an important component
• examples:
• predict sales.
• determine if an X-ray contains a tumor
Data Mining
• There is no one-size-fits-all solution, each problem requires specific understanding.

• Michelangelo was known for his test-and-learn approach. He started with an idea, tested it, changed it, and readily abandoned it for a better approach (but without overfitting!)
Simplicity is Key

• Beware the organizational setting where analytics is a solution in search of a problem


(“putting the cart before the horse”).
• Successful use of analytics and data mining requires both an understanding of the business
context where value is to be captured, and an understanding of exactly what the data mining
methods do.
• As applications (e.g. in Python) become easier, one is more and more likely to fall into this trap.
Nowadays, you can fit a sophisticated ML model with just one line of code.
• “Simplicity is ultimate sophistication” - Michelangelo
Why Are There So Many Different Methods?

• Each method has advantages and disadvantages.


• The usefulness of a method can depend on factors such as
the size of the dataset, the types of patterns that exist in the
data, whether the data meet some underlying assumptions
of the method, how noisy the data are, and the particular
goal of the analysis.
• Experience is needed to decide which method to use.
Big data does not replace thinking!

• When predictive analytics are done right, the analyses are not a means to a predictive end;
rather, the desired predictions become a means to analytical insight and discovery.
• We do a better job of analyzing what we really need to analyze and predicting what we really
want to predict.
• Many companies invest in the big-data hype, but fail to align their data science methods with
their commercial goals.
• Beware: applying a model without understanding its implications and limits is useless. Drawing
erroneous conclusions can have drastic consequences.
• Blindly fitting a model to data will produce a number, not an insight!
What is Business Analytics?
• Business Analytics (BA) is the practice and art of bringing quantitative data to bear on
decision-making.
• includes a range of data analysis methods
• for many traditional firms, applications involve little more than counting, rule-checking, and
basic arithmetic
• Business Intelligence (BI), refers to data visualization and reporting for understanding “what
happened and what is happening.”
• Use of charts, tables, and dashboards to display, examine, and explore data.
Big Data, Data Mining, AI,…?

• sometimes overlapping and inconsistent definitions between big data, data mining, machine
learning, AI etc.

• This is just nomenclature, the concepts are what matters.


Data Markets & Ethics

• Data can be very valuable, since it allows us to generate (business) insights.


• More and more news stories appear concerning illegal, fraudulent, unsavory or just
controversial uses of data science.
• Be aware: (big) data is both an asset and a liability.
Data Analysis
Overview
The Steps in Data Mining
• typical steps during data analysis are abbreviated as SEMMA:
1. Sample: Take a sample from the dataset; partition into training, validation, and test datasets.

2. Explore: Examine the dataset statistically and graphically.

3. Modify: Transform the variables and impute missing values.

4. Model: Fit predictive models (e.g., regression tree, neural network).

5. Assess: Compare models using a validation dataset.

• In the course, we will focus mostly on steps 4 and 5, since these are general tasks.

• In practice, steps 1-3 are often the most cumbersome, and typically very application specific (they require domain expertise, etc.)
SEMMA: Sample
• when doing machine learning, we always split the data into training and test sets
• this aspect is crucial to avoid overfitting
• even in more traditional approaches to data analysis this is useful, as overfitting problems can sneak in
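A minimal sketch of such a split with scikit-learn (the toy X and y are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix (10 datapoints, 2 features)
y = np.arange(10)                 # toy target

# Hold out 30% of the data; the model never sees the test set during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```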
SEMMA: Explore
• four types of data:
• Nominal: values are not ordered (nationality, color, etc.)
• Ordinal: values are ordered or ranked (e.g. food spiciness: not spicy, mildly spicy, very spicy)
• Discrete: values are countable and discrete (e.g. number of heads in 10 coin tosses)
• Continuous: the variable can take any value in a certain interval (e.g. height of a person)

• Another classification: structured vs. unstructured data. For now, we focus on structured data.
SEMMA: Explore

Always visualize data first!!!

(see details in next section)


SEMMA: Modify
Outliers: Values that lie “far away” from the bulk of the data.
• The term far away is deliberately left vague because what is or is not called an outlier is problem specific.
• Some outliers are due to erroneous data collection (e.g. measurement error)
• Some outliers are system intrinsic and should not be ignored

Are datapoints 1, 2, 3 outliers? 🤔
SEMMA: Modify
• Missing Values: If the number of datapoints with missing values is small, those datapoints might be omitted.
• Can replace the missing value with an imputed value, based on the other values for that variable across all
datapoints.
• In practice, this is a common and cumbersome issue!
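A small sketch of both options with pandas (the toy table is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [170, np.nan, 165, 180],
                   "weight": [65, 70, np.nan, 85]})

# Option 1: omit datapoints with missing values (fine if they are few).
df_dropped = df.dropna()

# Option 2: impute each missing value with a per-column statistic, e.g. the median.
df_imputed = df.fillna(df.median())
```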
SEMMA: Modify
• Standardizing and Rescaling data: subtract the mean from each value and then divide by the standard
deviation (also called a z-score).
• Standardizing is one way to bring all variables to the same scale.
• Another popular approach is to rescale each variable to the [0, 1] interval (min-max normalization).
SEMMA: Modify
• Some data covers various orders of magnitude (e.g. wealth, number of WeChat contacts, …).
• Taking the average and standard deviation can be misleading, as they will be skewed towards the large values.
• Often, one first applies a log-transform (and then standardizes).
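A minimal sketch of these transforms in NumPy (the toy array is made up to span several orders of magnitude):

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])

# z-score standardization: mean 0, standard deviation 1
z = (x - x.mean()) / x.std()

# min-max normalization to the [0, 1] interval
x01 = (x - x.min()) / (x.max() - x.min())

# for data spanning orders of magnitude: log-transform first, then standardize
logx = np.log10(x)
z_log = (logx - logx.mean()) / logx.std()
```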
SEMMA: Model
• The (supervised) modeling part boils down to
y = f(X; θ) + noise
where y is the target, X are the features, θ are the model parameters, and f is the machine learning function (“inductive bias”).

• Pick an algorithm that is suited to the problem
• supervised vs. unsupervised?
• categorical or numerical data?
• regression or classification?
• …
• Over the course, we will learn different methods
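A minimal end-to-end sketch, assuming a synthetic dataset and a random forest as the chosen f:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic supervised problem: y = f(X; theta) + noise
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Choosing the algorithm fixes the form of f; fitting estimates theta.
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data (assessment step)
```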
SEMMA: Assess
• How well will our prediction or classification model perform when we apply it to new data?
• To assure generalization, we use the concept of data partitioning and try to avoid overfitting.
SEMMA
• Given a clean dataset of normalized features, the modeling and analysis part (MA in SEMMA) is nowadays quite straightforward, due to the many useful libraries (e.g. scikit-learn, keras, …)

• A lot of time must be devoted to the cleaning, understanding and preparation of the data (SEM
in SEMMA). This part is much harder to automate and standardize. It often requires a lot of
problem specific thought and work.
Data Visualization
Data Visualization

• Always visualize data first!!!

• Different types of data require different types of visualization


Histograms
• quantitative data can be visualized by histograms:
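A minimal sketch with matplotlib, using synthetic “heights” for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(0).normal(loc=170, scale=10, size=1000)  # e.g. heights in cm

plt.hist(data, bins=30, edgecolor="black")
plt.xlabel("height [cm]")
plt.ylabel("count")
plt.show()
```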
Data Distributions
• (Continuous) quantitative data is often visualized via its distribution also called probability
density function (pdf). Formally, the pdf can be derived as the continuous limit of the
histogram.

P(a ⩽ X ⩽ b) = ∫_a^b f(x) dx
Data Distributions

• (Continuous) quantitative data is often visualized via its distribution also called probability
density function (pdf).
• The (local) maxima are called peaks or modes.
Data Distributions
There are different ways to plot the distribution of data:
• PDF = probability density function (~ number of data per unit interval)
• CDF = cumulative distribution function (~ number of data less than given value)
• SF = survival function (complement to CDF, 1-CDF, hence also called CCDF)
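A minimal sketch plotting all three views of the same synthetic dataset:

```python
import matplotlib.pyplot as plt
import numpy as np

data = np.sort(np.random.default_rng(0).exponential(scale=2.0, size=1000))

# Empirical CDF: fraction of datapoints less than or equal to each value.
cdf = np.arange(1, len(data) + 1) / len(data)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(data, bins=40, density=True)
axes[0].set_title("PDF (histogram)")
axes[1].plot(data, cdf)
axes[1].set_title("CDF")
axes[2].plot(data, 1 - cdf)
axes[2].set_title("SF = 1 - CDF (CCDF)")
plt.show()
```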
Quantiles
• The quantile function is just the inverse of the CDF. The q-quantile is the value below which a fraction q of all data lies.
• Special cases are the quartiles. We have
Q1 = 25% quantile
Q2 = 50% quantile = median
Q3 = 75% quantile
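A quick sketch with NumPy (synthetic data):

```python
import numpy as np

data = np.random.default_rng(0).normal(size=1000)

# Q1, Q2 (median), and Q3 are the 25%, 50%, and 75% quantiles.
q1, median, q3 = np.quantile(data, [0.25, 0.50, 0.75])
print(q1, median, q3)
```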
Data Visualization
• ordinal data is often visualized via bar charts:
Boxplots

interquartile range: IQR = Q3 − Q1
(the range in which the middle half of the data lies)

source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51; source: https://cds.climate.copernicus.eu/toolbox/doc/gallery/54_box_plots.html


Violinplots

• show the entire distribution, rather than just min/max/mean/median/IQR

source: https://seaborn.pydata.org/generated/seaborn.violinplot.html
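A minimal sketch of both plot types, assuming seaborn's bundled "tips" example dataset (downloaded on first use):

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")  # example dataset shipped with seaborn

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=tips, x="day", y="total_bill", ax=ax1)     # quartiles + outliers
sns.violinplot(data=tips, x="day", y="total_bill", ax=ax2)  # full distribution per group
plt.show()
```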
Heatmaps
• often easier to read than 3D plots, and can also represent categorical data
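A minimal sketch, assuming seaborn's bundled "flights" example dataset:

```python
import matplotlib.pyplot as plt
import seaborn as sns

flights = sns.load_dataset("flights")  # example dataset shipped with seaborn
table = flights.pivot(index="month", columns="year", values="passengers")

# Two categorical axes (month, year); color encodes the passenger count.
sns.heatmap(table, cmap="viridis")
plt.show()
```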
The Importance of Error Bars
Student: “Great, I can see a trend in my data!” 😃
Professor: “Don’t forget to add error bars”
It is very easy to get fooled or even fool yourself!

😳
The Importance of Error Bars

“I only believe in statistics that I manipulated myself”
- Winston S. Churchill
“pull yourself up by your bootstraps”

Bootstrapping
• Bootstrapping is a straightforward way to get error bars.
• In bootstrapping, we generate N datasets from one dataset with K datapoints.
• For each bootstrapped dataset, we sample K times with replacement from the original dataset.
• Calculate the statistic on each of the N bootstrapped datasets, and use the standard deviation (or interquartile range) as the error bar. (Or just create violin plots of all samples.)
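A minimal sketch of bootstrapped error bars for the median of a synthetic heavy-tailed dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=200)  # original dataset, K = 200 datapoints

N = 1000  # number of bootstrapped datasets
medians = np.empty(N)
for i in range(N):
    # Resample K datapoints with replacement and recompute the statistic.
    sample = rng.choice(data, size=len(data), replace=True)
    medians[i] = np.median(sample)

# Standard deviation across bootstrapped datasets serves as the error bar.
print(f"median = {np.median(data):.3f} +/- {medians.std():.3f}")
```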
Visualizing Networked Data
• A network diagram consists of actors and relations between them.
• Nodes are the actors (e.g., users in a social network or products in a product network), and
represented by circles.
• Edges are the relations between nodes, and are represented by lines connecting nodes.

• For example, in a social network such as WeChat, we can construct a list of users (nodes) and all the pairwise relations (edges) between users who have each other's contact.
• Many helpful libraries exist (e.g. networkx for Python).
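A minimal sketch with networkx; the user names and contact relations are hypothetical:

```python
import matplotlib.pyplot as plt
import networkx as nx

G = nx.Graph()
# Nodes are users, edges are mutual-contact relations.
G.add_edges_from([("Ana", "Bo"), ("Bo", "Chen"), ("Chen", "Ana"), ("Chen", "Dev")])

nx.draw(G, with_labels=True, node_color="lightblue")
plt.show()
```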
