MAT8033 Lecture Slides

The document outlines a course on data science and machine learning, emphasizing practical applications, data insights, and the importance of understanding concepts like big data and data mining. It includes recommendations for resources, a preliminary course outline, and highlights the significance of active participation in lectures. Additionally, it discusses the ethical considerations of data usage and the need for a solid understanding of statistics and analytics methods.

Uploaded by

chatpgtzhangyue

MAT8033:

Course Overview
What You Will Learn in this Lecture
• Practical application of data science in general, and machine learning in particular.
[Figure: image generated with the Dall-E 2 prompt “A teddy bear riding a skateboard in Times Square.”]

• Extract insights from data.

• Get an overview of different concepts in machine learning and its practical applications, from basic data visualization up to the concepts behind (Chat-)GPT.

• Understand the limits and pitfalls of big data and machine learning.
Requirements
• Basic knowledge in probability theory and statistics is assumed.

• Basic knowledge in programming is helpful but not required.


Additional Reading
• Data science is best learned by doing.

• There are many useful tutorials and blogs online.

• You are encouraged to search the internet for useful resources.

• What matters is a general understanding of how to apply data science in practice.


Recommendations
• Excellent introduction to Machine-Learning:
Jake VanderPlas - Machine Learning with Scikit-Learn (I) - PyCon 2015
Recommendations
• Excellent visualizations

https://mlu-explain.github.io
Attending The Lecture?
• The lecture explains concepts and provides real-life experience beyond textbook exercises.
• In order to fully understand the concepts, you must solve exercises yourself.
• Lecture helps you to learn “how to think” about data-driven problem solving.
• Lecture helps to “connect the dots” between the different concepts.
Get the Most out of Attending the Lecture
• The lecture is supposed to be an active, engaging
interaction between the professor and the students.
• When in class, eliminate distractions, and try to
follow the discussion. Ask questions if things are not
clear.
• Don’t waste time by sitting in classes without paying
attention or distracting yourself with laptops.
(Preliminary) Course Outline
1. Introduction (What is machine learning, AI, …)

2. Data Visualization

3. Linear regression & basic concepts of machine learning regressors

4. Logistic regression & basic concepts of machine learning classifiers

5. The Random Forest algorithm

6. The concept of overfitting

7. Common machine learning methods (k-nearest neighbors, neural nets, …)

8. Unsupervised machine learning

9. Large Language Models


Depending on your interests and existing knowledge, we can adjust the pace and topics.
Final Exam (50% of the grade)
• Exam Date: to be announced
• Written exam covers all topics contained in these slides.
• All questions are related to information from the slides and discussions during the lecture.
• Questions can typically be answered with 1-4 sentences per question.
• If you understand and can explain the concepts, you will do fine.
• Example:
Write down the two terms of the objective function for Ridge Regression and
explain the meaning of both terms.
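Answer:
The objective is
$$\hat{\beta}_0, \ldots, \hat{\beta}_p = \underset{\beta_0, \ldots, \beta_p}{\arg\min} \; \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.$$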

The first term fits the parameters most closely to the data and the
second term regularizes the regression coefficients, that means it
keeps the coefficients small.
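
The two terms of the Ridge objective can be made concrete with a small sketch. The following is a minimal pure-Python illustration for a single feature (p = 1), assuming the intercept β0 is not penalized. The function names `ridge_objective` and `fit_ridge_1d` and the toy data are hypothetical and not part of the lecture material.

```python
def ridge_objective(beta0, beta1, xs, ys, lam):
    """Return the two terms of the ridge objective separately."""
    # First term: squared residuals between observed and predicted values.
    fit_term = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    # Second term: penalty on the size of the slope (intercept not penalized).
    penalty_term = lam * beta1 ** 2
    return fit_term, penalty_term


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for one feature, unpenalized intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = s_xy / (s_xx + lam)  # lam = 0 recovers ordinary least squares
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


# Toy data (made up for illustration), roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

for lam in (0.0, 1.0, 10.0):
    b0, b1 = fit_ridge_1d(xs, ys, lam)
    fit_term, penalty_term = ridge_objective(b0, b1, xs, ys, lam)
    print(f"lambda={lam}: beta0={b0:.3f}, beta1={b1:.3f}, "
          f"fit={fit_term:.3f}, penalty={penalty_term:.3f}")
```

As λ grows, the slope is pulled toward zero and the fit term increases: the model trades a slightly worse fit for smaller coefficients.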
Project Scope (50% of the grade)
• Details will be announced soon.
Feynman Technique
The Feynman Technique:
1. Pick a topic you want to understand and start studying it.

2. (Pretend to) teach the topic to a friend who is unfamiliar with the topic.

3. Go back to the resource material when you get stuck.

4. Simplify and organize.

5. Reiterate.

• Dr. Feynman was a remarkable educator and physicist.
• He received the Nobel Prize in 1965 for his work on QED.
• The Feynman Lectures on Physics serve as a great introduction to the concepts of physics.

[Figure: Feynman diagram to visualize particle interactions]


Beyond The Lecture
• Science moves fast.
• Science moves faster and faster:
decreasing half-life of knowledge
• Data science moves especially fast
• Stay competitive in the job market:
- sign up to newsletters
- follow popular research blogs
- talk to friends/colleagues/experts and ask
questions
- stay curious!
Beyond The Lecture
“The best bridge between despair and hope is a good night's sleep.” - E. Joseph Cossman
• Sleep is one of the most important but least understood aspects of our life, wellness, and longevity.
• Sleep enriches our ability to learn, memorize, and make
logical decisions.
• Sleep recalibrates our emotions, restocks our immune
system, fine-tunes our metabolism, and regulates our
appetite.
• Dreaming mollifies painful memories and creates a virtual
reality space in which the brain melds past and present
knowledge to inspire creativity.
• Important contributions to a healthy life:
• sleep enough (~8h per night)
• eat a healthy and balanced diet
• exercise regularly
Beyond The Lecture
• The Risks-X institute, led by Prof. Didier Sornette, was established in
2019 as a pilot cooperation project between ETH Zurich and SUSTech.

• In the 2021 edition of QS World University Rankings, ETH Zurich was ranked 6th in the world.

• For student projects: reach out to me (if you are highly motivated, diligent and quantitative)

https://en.wikipedia.org/wiki/Didier_Sornette https://de.wikipedia.org/wiki/ETH_Zürich https://er.ethz.ch/Risks-X.html


Assistance

1. Slides are sent in the group chat and published on Blackboard.

2. Ask questions during, before and after class (critical questions and feedback are very welcome!)

3. Ask questions on WeChat

4. Ask questions via email ([email protected])

5. Visit us during office hours:

Thursdays 17:00-19:00

Address: Room 511-2, Building 6, Chuangyuan (创园), Southern University of Science and Technology (南方科技大学)

Please write an email or text beforehand to confirm


Introduction
What is Big Data?
• The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly.
• By 2025, global data creation is projected to grow to more than 180 zettabytes.

Does this mean the definition of Big Data changes every year? 🤔

source: https://www.statista.com/statistics/871513/worldwide-data-created/
What is Big Data?

“Big Data helps us to answer Big Questions!” - Chase Davis (Journalist)
https://source.opennews.org/articles/using-big-data-ask-big-questions/

• Big data means “big variety of data”


• small question:
How many people live in Shenzhen?
• big question:
How is the possibility to work from home going to affect people’s decision to move to big cities?
• By cleverly merging and carefully analyzing different datasets, we can reach big insights.
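
Merging datasets on a shared key can be sketched in a few lines. This is a minimal illustration using plain dictionaries; the dataset names, the city keys, the helper `inner_join`, and all numbers are made up for illustration and do not come from any real source.

```python
# Two toy datasets keyed by city name (all values are made up for illustration).
population_growth = {"city_a": 0.031, "city_b": 0.012, "city_c": -0.004}
remote_work_share = {"city_a": 0.10, "city_b": 0.25, "city_d": 0.40}


def inner_join(left, right):
    """Keep only keys present in both datasets and pair up their values."""
    return {key: (left[key], right[key]) for key in left if key in right}


merged = inner_join(population_growth, remote_work_share)
# Keys that appear in only one of the two datasets are dropped by the join.
unmatched = sorted(set(population_growth) ^ set(remote_work_share))

for city, (growth, remote) in sorted(merged.items()):
    print(f"{city}: growth={growth:+.1%}, remote-work share={remote:.0%}")
print("cities missing from one dataset:", unmatched)
```

Reporting the unmatched keys is part of the "careful analysis": rows that silently disappear in a merge can bias whatever conclusion is drawn from the combined data.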
Big Questions

• The availability of data from different sources allows us to obtain better and better insights.
• A good understanding of statistics and data
science is required.
• Pitfalls are common (overfitting, small sample
bias, confounding biases, and so forth).
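
One facet of the small-sample pitfall can be illustrated with a short simulation: estimates computed from few observations fluctuate far more than estimates from many. The sketch below is an illustration under stated assumptions (standard normal data, a fixed random seed, a hypothetical helper `sample_mean_spread`), not a method taken from the lecture.

```python
import random
import statistics


def sample_mean_spread(sample_size, repetitions, rng):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [
        statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(sample_size)])
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
spread_small = sample_mean_spread(sample_size=5, repetitions=200, rng=rng)
spread_large = sample_mean_spread(sample_size=500, repetitions=200, rng=rng)

print(f"spread of the mean with n=5:   {spread_small:.3f}")
print(f"spread of the mean with n=500: {spread_large:.3f}")
```

The true mean is 0 in both cases, yet a single small sample can land far from it. A conclusion drawn from one such sample may reflect noise rather than a real effect.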
What is Data Mining?

• The process of extracting (‘mining’) insights from data


• Data mining refers to analytics methods that go beyond basic analysis such
as counts, averages, and other basic descriptive techniques.
• includes statistical and machine-learning methods that inform decision-making
• prediction is an important component
• examples:
• predict sales.
• determine if an X-ray contains a tumor
Data Mining
• There is no one-size-fits-all solution, each problem requires specific understanding.

• Michelangelo was known for his test-and-learn approach. He started with an idea, tested it,
changed it, and readily abandoned it for a better approach (but without overfitting!)
Simplicity is Key

• Beware the organizational setting where analytics is a solution in search of a problem
(“putting the cart before the horse”).
• Successful use of analytics and data mining requires both an understanding of the business
context where value is to be captured, and an understanding of exactly what the data mining
methods do.
• As applications (e.g. in Python) become easier, one is more and more likely to fall into this trap.
Nowadays, you can fit a sophisticated ML model with just one line of code.
• “Simplicity is the ultimate sophistication” - commonly attributed to Leonardo da Vinci
Why Are There So Many Different Methods?

• Each method has advantages and disadvantages.


• The usefulness of a method can depend on factors such as
the size of the dataset, the types of patterns that exist in the
data, whether the data meet some underlying assumptions
of the method, how noisy the data are, and the particular
goal of the analysis.
• Experience is needed to decide which method to use.
Big data does not replace thinking!

• When predictive analytics are done right, the analyses are not a means to a predictive end;
rather, the desired predictions become a means to analytical insight and discovery.
• We do a better job of analyzing what we really need to analyze and predicting what we really
want to predict.
• Many companies invest in the big data hype, but fail to align their data science methods with
their commercial goals.
• Beware: applying a model without understanding its implications and limits is useless. Drawing
erroneous conclusions can have drastic consequences.
• Blindly fitting a model to data will produce a number, not an insight!
What is Business Analytics?
• Business Analytics (BA) is the practice and art of bringing quantitative data to bear on
decision-making.
• includes a range of data analysis methods
• for many traditional firms, applications involve little more than counting, rule-checking, and
basic arithmetic
• Business Intelligence (BI) refers to data visualization and reporting for understanding “what
happened and what is happening.”
• Use of charts, tables, and dashboards to display, examine, and explore data.
Big Data, Data Mining, AI,…?

• The definitions of big data, data mining, machine learning, AI, etc. sometimes overlap and
are used inconsistently.

• This is just nomenclature; the concepts are what matter.


Data Markets & Ethics

• Data can be very valuable, since it allows us to generate (business) insights.
• More and more news stories appear concerning illegal, fraudulent, unsavory or just
controversial uses of data science.
• Be aware: (big) data is both an asset and a liability.
