
Week 1 v1.32 (Hidden) - Introduction To Data Analytics

The document outlines the structure and content of the ECS784U/P Data Analytics module for Week 1, 2024, led by Dr. Anthony Constantinou. It details the timetable for lectures and labs, module assessment criteria, and the resources available for students, including coursework and recommended readings. The introduction to data analytics covers key concepts, definitions, and the importance of data in decision-making processes.

Uploaded by

Yen-Kai Cheng

ECS784U/P DATA ANALYTICS

(WEEK 1, 2024)
INTRODUCTION TO DATA ANALYTICS

DR ANTHONY CONSTANTINOU
SCHOOL OF ELECTRONIC ENGINEERING AND COMPUTER SCIENCE
1
WEEK 1 LECTURE OVERVIEW
Introduction to Data Analytics
▪ Timetable.
▪ Staff.
▪ Module overview, assessment and pass criteria.
▪ Books.
▪ Lab 1.
▪ Coursework 1 and 2.
▪ Introduction to Data Analytics.

2
TIMETABLE (LECTURES)

Lectures:
▪ When: Thursdays, 11:00 – 13:00.

▪ Where: People's Palace, Skeel Lecture Theatre.

▪ Weeks: 1-6 and 8-12 (11 lectures).


▪ No new module content will be taught in Week 7.
▪ Use Week 7 to digest the knowledge/skills you
have learnt in the first half of the term and catch
up on any learning gaps.
▪ Week 7 is not a ‘no work’ week.

3
TIMETABLE (LABS)
Labs:
▪ When: Wednesdays, 15:00 – 17:00.
▪ Where: TB ground floor G02 and first floor 101/102. Go to the ground
floor first, and move to the first floor only if the ground floor is full.
▪ Weeks: 1-6 and 9-11 (9 labs).
▪ Week 1: Preparation lab. Attend if you are new to programming.
▪ Weeks 2, 3, and 4: Main material – 1st half of module.
▪ Week 5: Case study demo.
▪ Week 6: Revision.
▪ Weeks 7, 8: no lab.
▪ Week 9: Demo on causal structure learning (this will be online).
▪ Weeks 10 and 11: Main material – 2nd half of the module.

4
TIMETABLE

5
LECTURER
▪ Module Organiser: Dr. Anthony Constantinou.
▪ I will be delivering the lectures.
▪ At QMUL since Oct 2009 - joined as a PhD student.
▪ Most of my time is spent doing research:
▪ Head of MInDS (Machine Intelligence and Decision Systems):
https://minds.qmul.ac.uk/
▪ Head of the Bayesian Artificial Intelligence research lab:
http://bayesian-ai.eecs.qmul.ac.uk/
▪ Personal research website: http://www.constantinou.info
▪ Research interests in causal machine learning and intelligent decision
making under uncertainty.
▪ i.e., learning models of cause-and-effect relationships, and simulating
actions/interventions within those models to determine the optimal
sequence of decisions.
▪ Contact:
▪ Any questions about the module should be posted on the forum.
▪ For personal questions, you can contact me via e-mail: [email protected]
7
TEACHING FELLOW
▪ Dr. Neville Kenneth Kitson, but I prefer to be
known as Ken.
▪ Physics B.Sc., then Ph.D. in Computer
Modelling of Air Pollution.
▪ 'Hands-on' I.T. career: first commercial, at
Reuters, then at a non-profit using mobile and
web technologies for health and election
projects in Africa and Asia.
▪ Big Data M.Sc. at QMUL, 2018-19:
▪ Project with Anthony, leading to a paper on learning to
model causes of diarrhoea from survey data.
▪ Studying for a 2nd Ph.D., supervised by Anthony:
▪ Practical limitations of causal machine
learning.
▪ Active learning, where algorithms ask
human experts for input on the most uncertain
cause-effect relationships.
8
LAB STAFF

▪ Mr Julien Guinot: PhD student in Self-Supervision for expert latent representations of music.
▪ Ms Hyunkyung Park: PhD student in fact verification in dialogue.
▪ Ms Yuhan Li: PhD student in mathematical and digital epidemiology and dynamics of networks.
▪ Mr Ishaku Anaobi: PhD student in Improving content moderation in decentralised online social networks.
▪ Ms Kasia Adamska: PhD student in Predicting Hit Songs: Multimodal and Data-Driven Approach.
▪ Ms Amani Abumansour: PhD student in Claim detection for Automated fact-checking systems.
▪ Ms Parvathy Subramanianprasad: PhD student in ML to speed up traditional EM simulations like design of coding metasurface array.
▪ Dr Ken Kitson: PhD student in active learning & causal discovery.
9
MODULE ASSESSMENT
▪ 100% Coursework Module (no exam!):

▪ Coursework 1, 60% weighting:


▪ Released in Week 2 with deadline Week 8.
▪ Focuses on classic machine learning methods using Python libraries.
▪ Done individually.
▪ Find/prepare your own data set.
▪ Do some basic coding, such as loading data and calling methods to analyse data.
▪ Write a 7-page (max) report that will include Introduction, Experiments, Results
and Conclusions.
▪ Coursework 2, 40% weighting:
▪ Released in Week 9 with deadline in Week 13, during exam period.
▪ Focuses on causal machine learning.
▪ Done individually.
▪ Find your own data (can reuse data set from coursework 1 if suitable).
▪ Involves no coding. You will be using an open-source Java UI research project.
▪ Answer a set of questions.
10
MODULE PASS CRITERIA
Level-7 module pass criteria apply (postgraduates).

▪ A minimum total mark of 50% is required to pass this module.

▪ Pass example 1: CW1 mark 30% and CW2 mark 80% (total 50%).
▪ Pass example 2: CW1 mark 40% and CW2 mark 65% (total 50%).

▪ Fail example 1: CW1 mark 15% and CW2 mark 100% (total 49%).
▪ Resit CW1.
▪ Fail example 2: CW1 mark 70% and CW2 mark 15% (total 48%).
▪ Resit CW2.
▪ Fail example 3: CW1 mark 45% and CW2 mark 45% (total 45%).
▪ Resit both CW1 & CW2.
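The totals in the examples above follow from the 60%/40% weighting stated on the Module Assessment slide. As a minimal sketch (the function names are hypothetical, and the resit rule is deliberately not modelled because the slides only give examples of it), the calculation could be written as:

```python
# Weights (in %) taken from the Module Assessment slide.
CW1_WEIGHT = 60
CW2_WEIGHT = 40
PASS_MARK = 50  # minimum total mark (in %) for a Level-7 pass


def total_mark(cw1, cw2):
    """Weighted total (in %) of the two coursework marks (each in %)."""
    return (CW1_WEIGHT * cw1 + CW2_WEIGHT * cw2) / 100


def has_passed(cw1, cw2):
    """True if the weighted total reaches the pass mark."""
    return total_mark(cw1, cw2) >= PASS_MARK


# The five (CW1, CW2) examples from the slide.
for cw1, cw2 in [(30, 80), (40, 65), (15, 100), (70, 15), (45, 45)]:
    print(cw1, cw2, total_mark(cw1, cw2), has_passed(cw1, cw2))
```

Running it reproduces the totals on the slide: 50, 50, 49, 48 and 45.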

11
BOOKS

▪ Almost all books/book chapters to be provided as PDF files on QMPlus.

▪ They are either freely available online, or permission granted by authors.

12
BOOK 1
▪ Available on QMPlus as PDF.
▪ Over 400 pages:
▪ Use Table of Contents to quickly refer
to Python concepts of your interest.
▪ Made available from Week 1.

13
BOOK 2
▪ Not available on QMPlus.
▪ Not a critical book for coursework.
▪ Useful if you would like to study the data
analytic and machine learning
techniques covered in the first half of
this course in more detail.
▪ Over 600 pages:
▪ Use Table of Contents to quickly refer
to relevant data analytic concepts.

14
CHAPTERS FROM BOOKS 3, 4, AND 5
▪ The relevant chapters based on the first two books shown
below will be made available on QMPlus in Week 6.

▪ The third book will be made available on QMPlus in Week 9.

15
LABS (WEEKS 1-6)
▪ There are different Integrated Development Environments (IDEs) and
editors that can be used to write and run Python code.
▪ Jupyter Notebook
▪ Spyder
▪ PyDev with Eclipse
▪ Vim
▪ TextMate
▪ Gedit
▪ IDLE
▪ PIDA (Linux)
▪ Notepad++ (Windows)

▪ On this module, we will be using the Jupyter Notebook available in Anaconda Navigator.

16
LABS (WEEKS 9-11)
▪ Based on causal machine learning
▪ Focus will be on the discovery of cause-and-effect
relationships from data.
▪ Different algorithms are publicly available in R, Python,
Java, MATLAB, and in various Bayesian network software.
▪ On this module, we will be using a Java NetBeans
research project developed in our research lab.
▪ Comes with a basic User Interface (UI).
▪ No need to write code.
▪ Provides access to a list of algorithms and methods
needed to complete Coursework 2.

17
WEEK 1 LAB OVERVIEW
Programming basics
▪ Anaconda Navigator and Jupyter Notebook.
▪ Already installed on the machines in the ITL (machines now moved to
Temporary Building (TB) so hoping for no issues with software).
▪ If you can bring your own laptop with you, great. The demonstrators could also
help you set up Python, Anaconda Navigator and Jupyter Notebook on your
own machine.
▪ Follow the instructions provided in the Week 1 Lab document.

18
WEEK 1 LAB OVERVIEW
Lab 1 is aimed at those who are new to Python
(attendance optional but recommended):
▪ Programming basics with Python.
▪ Data types.
▪ Operators.
▪ Conditional statements and loops.
▪ Creating arrays/lists etc.
▪ Dictionaries.
▪ Functions.
▪ A few exercises.
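To give a flavour of the topics listed above, here is a short illustrative sketch. It is not taken from the Week 1 Lab document; the variable and function names are made up, and the values simply reuse the lab weeks and coursework weightings given earlier in these slides.

```python
# Data types: a string, a float, a list and a dictionary.
module_code = "ECS784U/P"
pass_mark = 50.0
lab_weeks = [1, 2, 3, 4, 5, 6, 9, 10, 11]
weights = {"CW1": 60, "CW2": 40}


def has_lab(week):
    """Function: return True if there is a lab in the given week."""
    return week in lab_weeks


# Loop with a conditional statement: collect the weeks without a lab.
no_lab_weeks = []
for week in range(1, 13):
    if not has_lab(week):
        no_lab_weeks.append(week)

print(module_code, "has", len(lab_weeks), "labs")
print("Weeks without a lab:", no_lab_weeks)
print("Total weighting:", sum(weights.values()), "%")
```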

19
PYTHON INTERACTIVE SHELL
▪ Jupyter Notebook operates as an interactive shell.
▪ You can execute Python code one cell at a time (a cell can be as short as a single line).

20
WHY PYTHON? (Reading slide)
▪ Relatively easy to learn:
▪ Simple syntax and more intuitive code
▪ Can do more with fewer lines of code.

▪ Is now widely used by the scientific community and industry.

▪ Extensive ecosystem of rapidly maturing libraries for data science, both for data processing
and data visualisation.
▪ We will experiment with some of them:
▪ NumPy: started by Dr Travis Oliphant, BSc & MSc Brigham Young Uni, and PhD Mayo
Clinic. Worked as Assistant Prof at Brigham Young Uni and founded Anaconda.
▪ Pandas: started by Wes McKinney, BSc MIT, who later began a PhD at Duke Uni. The author of the Python
for Data Analysis book available on QM+. Worked for, and founded, different companies.
▪ Scikit-learn: started by Dr David Cournapeau, MSc Telecom Paristech, PhD Kyoto Uni.
Started working on this library as a Google Summer of Code project. Moved to, and works in, Tokyo.
▪ In 2010 the French Institute for Research in Computer Science and Automation (INRIA)
took leadership of the project and made the first public release on February the 1st, 2010.
▪ These libraries are open-source and many other community members have contributed to
their development.
▪ Adequate to high computational performance.

▪ It’s free!
21
NEW TO PROGRAMMING ?
▪ This course is NOT about programming.
▪ If you are new to programming, attend labs from Week 1
and practise the lab material.
▪ You are not expected to memorise all the methods we
cover in the labs!
▪ You can refer to some of those methods when analysing
data for your coursework 1.

22
10 MINUTES BREAK
23
WHAT IS DATA ANALYTICS?
[Figure: Data Analytics shown at the intersection of Machine Learning, Statistics, Databases, Data Mining, Data Science, Knowledge discovery and Artificial Intelligence.]

▪ Interdisciplinary subfield of computer science at the intersection of different fields.
24
WHAT IS DATA ANALYTICS? (Reading slide)
Try to distinguish these terms
Quick internet search says…
▪ Data Analytics:
▪ The process of inspecting, cleansing, transforming, and modelling data with the goal of
discovering useful information, informing conclusions, and supporting decision-making.
▪ Data Science:
▪ Interdisciplinary field that uses scientific methods, processes, algorithms and systems to
extract knowledge and insights from noisy, structured and unstructured data, and apply
knowledge and actionable insights from data across a broad range of application
domains. (Note: more focus on algorithms.)
▪ Data Mining and Knowledge Discovery:
▪ The process of extracting and discovering patterns in large data sets involving methods at
the intersection of machine learning, statistics, and database systems.
▪ Machine Learning:
▪ The study of computer algorithms that can improve automatically through experience and
by the use of data.
▪ Artificial Intelligence:
▪ Intelligence demonstrated by machines, as opposed to natural intelligence displayed by
animals including humans.
▪ Statistics:
▪ The discipline that concerns the collection, organisation, analysis, interpretation, and
presentation of data.

25
WHAT IS MACHINE LEARNING?
Machine Learning:
▪ The study of computer algorithms that can improve iteratively
through additional experience and/or data.

Data → Model → Knowledge discovery → Prediction/Decision Making
▪ Humans learn by observation, by making comparisons, by looking
for patterns of repeated behaviour or sequences of events, by
generalisation, and by trial and error.
▪ Humans also act by imagining. This is studied in the field of
counterfactual reasoning, which requires causal models (i.e., we
must make some assumptions about causality).
▪ In many ways, these are also the roots of machine learning since
many algorithms aim to mirror some of the above traits of human
learning.
26
GENERATIVE VS DISCRIMINATIVE
MACHINE LEARNING
Generative learning:
▪ Assumes data are generated by different probabilistic models (usually soft
boundaries), and tries to categorise data by answering the question of
“which of the sub-models is most likely to have generated a given data
point”. (Note: a generative model can also describe the population, i.e., the distribution of the data.)
▪ Aim is to maximise the likelihood that each data point belongs to its class.

Discriminative learning:
▪ Divides the data into classes using hard decision boundaries, usually
based on distance metrics.
▪ Aim is to minimise mistakes between boundaries.
[Figure: three panels comparing Generative, Discriminative (non-linear) and Discriminative (linear) learning.]
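The contrast between the two approaches can be sketched on a one-dimensional toy problem. This is an illustration under stated assumptions, not the method used in the module: the data values are made up, the generative side assumes one Gaussian per class with equal priors, and the discriminative side is simplified to a single threshold chosen to minimise training mistakes.

```python
import math
import statistics

# Hypothetical 1-D measurements for two classes.
class0 = [1.0, 1.5, 2.0, 2.5]
class1 = [4.0, 4.5, 5.0, 5.5]


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


# Generative: fit one probabilistic sub-model (a Gaussian) per class ...
params = {
    0: (statistics.mean(class0), statistics.pstdev(class0)),
    1: (statistics.mean(class1), statistics.pstdev(class1)),
}


def generative_predict(x):
    """... and ask which sub-model is most likely to have generated x."""
    return max(params, key=lambda label: gaussian_pdf(x, *params[label]))


# Discriminative: pick the hard boundary that makes the fewest mistakes.
def mistakes(threshold):
    """Number of training points on the wrong side of the threshold."""
    return sum(x >= threshold for x in class0) + sum(x < threshold for x in class1)


points = sorted(class0 + class1)
candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
best_threshold = min(candidates, key=mistakes)


def discriminative_predict(x):
    return 1 if x >= best_threshold else 0


print("Fitted (mean, std) per class:", params)
print("Best threshold:", best_threshold)
print(generative_predict(2.0), discriminative_predict(2.0))
```

Only the generative version keeps a description of how each class is distributed (the fitted means and standard deviations); the discriminative version keeps nothing but the boundary.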

27
ERROR: BIAS VS VARIANCE
Bias:
▪ The difference between the
expected (average) prediction and the
observed values, where the observed
values are the values we are
trying to predict.

Variance:
▪ How much the predictions vary for a
given data point.
▪ Note: variance can be reduced by using
more data and taking the average.
http://scott.fortmann-roe.com/docs/BiasVariance.html
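As a rough numerical illustration of the two error sources, the sketch below simulates repeated predictions of a single known value from two imaginary models. Everything here is an assumption made for the example: the true value, the noise levels, and the choice of normally distributed predictions.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_VALUE = 10.0       # hypothetical value we are trying to predict
N = 2000                # number of simulated predictions per model

# Model A: consistently off target (high bias), but stable (low variance).
model_a = [rng.gauss(12.0, 0.5) for _ in range(N)]
# Model B: on target on average (low bias), but erratic (high variance).
model_b = [rng.gauss(10.0, 3.0) for _ in range(N)]


def bias(predictions):
    """Average prediction minus the value we are trying to predict."""
    return statistics.mean(predictions) - TRUE_VALUE


def variance(predictions):
    """Spread of the predictions around their own average."""
    return statistics.pvariance(predictions)


# Averaging groups of 10 predictions from model B reduces its variance.
averaged_b = [statistics.mean(model_b[i:i + 10]) for i in range(0, N, 10)]

print("A: bias", round(bias(model_a), 2), "variance", round(variance(model_a), 2))
print("B: bias", round(bias(model_b), 2), "variance", round(variance(model_b), 2))
print("B averaged: variance", round(variance(averaged_b), 2))
```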

28
[Figure: sampling bias illustration, from https://sketchplanations.com/sampling-bias]
29
[Figure: confirmation bias illustration, from https://sketchplanations.com/confirmation-bias]
30
DATA SET

Is a data set alone useful?


▪ A data set consists of historical observations.
▪ Data alone is not very useful:
▪ It does not provide knowledge.
▪ It does not answer predictive questions.
▪ It does not tell us what actions to perform.
▪ In data analytics, we seek to extract useful information
from data to support decision making.
Data ——> Info

31
DATA SET
▪ Each column is typically called a variable or feature.

▪ Each row is typically called a data instance or data sample.
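As a minimal sketch of this terminology, the made-up health data set below is stored as a list of dictionaries: each dictionary is one data instance (sample), and each key is one variable (feature). The variable names and values are invented for illustration; in pandas the same structure would normally be held in a DataFrame.

```python
# Hypothetical data set: one dictionary per data instance (row).
data_set = [
    {"age": 34, "smoker": "no", "blood_pressure": "normal"},
    {"age": 58, "smoker": "yes", "blood_pressure": "high"},
    {"age": 45, "smoker": "no", "blood_pressure": "high"},
    {"age": 23, "smoker": "yes", "blood_pressure": "normal"},
]

# The variables (features) are the keys shared by every instance.
variables = list(data_set[0].keys())

# One variable across all instances, i.e. one column of the table.
ages = [instance["age"] for instance in data_set]

print("Variables:", variables)
print("Number of data instances:", len(data_set))
print("Values of 'age':", ages)
```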

32
SOURCES OF DATA
▪ Sensor data (from cars, doors, etc.).
▪ Image/Text/Sound.
▪ Surveys/Questionnaires.
▪ Focus groups/Interviews.
▪ Social media.
▪ Documents/Records.
Generally two types of data:
▪ Observational: represents data collected based on what we
see without taking action.
▪ Experimental: represents data collected based on actions
that we deliberately take to manipulate the outcome.
▪ e.g., randomised controlled trials when testing new treatments.
33
CHALLENGES OF DATA ANALYTICS
Data noise…
▪ Missing data values.
▪ Missing variables.
▪ Corrupted data.
▪ Incorrect data.
▪ Limited data.
▪ Big data.
▪ Different evaluation criteria and metrics.
▪ Different distributional assumptions (e.g., normality).
▪ Different dependency assumptions (e.g., linear/non-linear).
▪ Data variety (e.g., discrete, continuous, text, sound, image)
▪ Spurious correlations.
▪ Correlation is not causation (recall observational and experimental data).
34
CHALLENGES OF DATA ANALYTICS
Ever heard of spurious correlations?

35
CORRELATION IS NOT CAUSATION

▪ Can the number of shark attacks help us predict ice cream sales?

From newsmax.com
38
CORRELATION IS NOT CAUSATION

▪ Can we conclude from correlation that shark attacks cause ice cream sales to increase?

From newsmax.com
40
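One common explanation for a correlation like this is a shared cause. The simulation below is a sketch under invented assumptions: temperature drives both ice cream sales and shark attacks, and neither affects the other. All coefficients and noise levels are made up, and shark attacks are treated as a continuous quantity for simplicity.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def residuals(y, x):
    """What is left of y after removing its linear dependence on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


# Temperature is the common cause; the two effects never interact.
temperature = [rng.uniform(5, 35) for _ in range(1000)]
ice_cream_sales = [20 * t + rng.gauss(0, 50) for t in temperature]
shark_attacks = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

raw_corr = pearson(ice_cream_sales, shark_attacks)
adjusted_corr = pearson(
    residuals(ice_cream_sales, temperature),
    residuals(shark_attacks, temperature),
)

print("Correlation, sales vs attacks:", round(raw_corr, 2))
print("After accounting for temperature:", round(adjusted_corr, 2))
```

The raw correlation is strong, so shark attacks would help predict ice cream sales, yet it largely disappears once temperature is accounted for, which is why it cannot be read as one causing the other.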
TYPES OF DATA (Reading slide)
▪ Nominal discrete or categorical variables:
▪ Take finite number of states.
▪ Order of states is not important.
▪ Distance between states remains unmeasured.
▪ E.g., gender, colour, profession.
▪ Ordinal discrete or categorical variables:
▪ Same as above, but where the order of states matters.
▪ E.g., low/medium/high, distinction/merit/pass.

▪ Continuous variables:
▪ Have an infinite number of values between any two values.
▪ Can be represented by statistical distributions.
▪ E.g., age, time, temperature, profit.
▪ Note that a numerical or a continuous variable can generally be
converted into an ordinal categorical or a discrete variable.
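The conversion mentioned in the note above can be sketched as a simple binning step. The cut points and labels below are hypothetical; any ordered set of bands would work the same way.

```python
from bisect import bisect_right

# Hypothetical cut points (in years) and the ordered states they define.
BOUNDARIES = [18, 40, 65]
LABELS = ["child", "young adult", "middle-aged", "senior"]


def discretise(age):
    """Map a continuous age onto one of the ordinal states."""
    return LABELS[bisect_right(BOUNDARIES, age)]


ages = [3.5, 18, 39.9, 40, 70.2]
bands = [discretise(age) for age in ages]
print(bands)
# ['child', 'young adult', 'young adult', 'middle-aged', 'senior']
```

The order of the states is preserved, but the exact distances between values are lost once they are binned.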

Note that the above definitions may vary slightly depending on book/discipline.
42
TYPES OF DATA ANALYSIS
(focusing on healthcare for the examples)
▪ Descriptive or Analytical
▪ E.g., analysing health condition.

▪ Predictive
▪ Predicting health condition.

▪ Risk
▪ Assessing the risk of getting a particular disease over a period of time.

▪ Diagnostic or Inverse/Bayesian inference


▪ Determining the probability of having a particular disease given a positive test
result.
▪ Prescriptive, Intervention, Action, Decision Making
▪ Determining what treatment to perform given disease 𝐴.

▪ Counterfactual
▪ Given that a previous patient died from disease 𝐴, what would be the probability
of death had we also known information 𝐵 and on this basis taken action 𝐶?
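The diagnostic (inverse/Bayesian inference) question above is an application of Bayes' theorem: P(disease | positive) = P(positive | disease) P(disease) / P(positive). The sketch below works through it with invented numbers: a prevalence of 1%, a sensitivity of 90% and a false positive rate of 5% are assumptions for illustration only.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    # Total probability of a positive test, with or without the disease.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive


# Hypothetical inputs: 1% prevalence, 90% sensitivity, 5% false positives.
p = posterior_given_positive(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(round(p, 3))  # 0.154
```

Even with a fairly accurate test, the probability of disease after a positive result is only about 15% here, because the disease is rare in the first place.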
43
DISCIPLINES (Reading slide)
▪ Health data
▪ What is the diagnosis given the symptoms?
▪ What is the intervention given the diagnosis?

▪ Climate data
▪ What will the weather be tomorrow?
▪ What are the consequences of climate change?

▪ Sports
▪ What is the value of a player?
▪ Who is going to win the match?
▪ What payoff should a bookmaker offer for a given match event?

▪ Finance and economics


▪ How is inflation expected to change over the next few months?
▪ Are house prices going to increase?
▪ What profit should I expect over a certain period of time if I invest in the stock market?

▪ Marketing
▪ How much increase should I expect in sales if I decrease the sales price from X to Y?

▪ Politics
▪ Who is going to be the next prime minister?
▪ How many seats is a given Party expected to win?
44
DATA ANALYTICS IN PRACTICE
[Figure: workflow diagram. Identify objective → Collect data → Preprocess & clean data (reduce noise and complexity) → Exploratory data analysis (are data sufficient/useful? need more data?) → Learn model (optimise model for a particular task) → Apply model (use model in real-world) → Assess model (predictive and decision making accuracy; evaluation results?)]
45
DATA ANALYTICS IN PRACTICE (Reading slide)

▪ Typically involves:
▪ Inference considerations
▪ what is the objective?
▪ Data pre-processing
▪ Cleansing, transformation, feature selection, etc.
▪ Model building
▪ different assumptions, theorems, and algorithms to consider.
▪ Evaluation and optimisation on given data
▪ how do we judge the performance of the model?
▪ Application
▪ apply the learnt model to a real case and obtain results.
▪ Evaluation on real case
▪ How did the model perform in practice?
▪ Documentation
▪ preparation of a report covering all of the above.
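The steps above can be sketched end to end on a tiny made-up data set. This is a toy illustration, not the coursework workflow: the objective, the measurements, the train/test split and the very simple threshold model are all assumptions chosen to keep the example short.

```python
import statistics

# 1. Objective (hypothetical): predict whether a patient is high-risk (1)
#    or low-risk (0) from a single measurement.
raw = [(1.0, 0), (1.2, 0), (None, 0), (0.8, 0), (3.1, 1), (2.9, 1),
       (3.4, 1), (1.1, 0), (3.0, 1), (None, 1), (0.9, 0), (3.2, 1)]

# 2. Pre-processing: drop instances with a missing measurement.
clean = [(x, y) for x, y in raw if x is not None]

# 3. Hold back some instances for evaluation.
train, test = clean[:7], clean[7:]

# 4. Model building: place a threshold halfway between the class averages.
mean_low = statistics.mean(x for x, y in train if y == 0)
mean_high = statistics.mean(x for x, y in train if y == 1)
threshold = (mean_low + mean_high) / 2


def predict(x):
    """Predict high-risk (1) if the measurement reaches the threshold."""
    return 1 if x >= threshold else 0


# 5. Evaluation on given data: accuracy on the held-back instances.
accuracy = sum(predict(x) == y for x, y in test) / len(test)

# 6. Application: use the learnt model on a new case.
new_case = 2.5
print("Threshold:", round(threshold, 2))
print("Test accuracy:", accuracy)
print("Prediction for new case:", predict(new_case))
```

Evaluation on the real case and documentation are not shown, since they depend on what happens after the model is deployed.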
46
DOMAIN EXPERTS
Why are they generally needed?
▪ Domain expertise is the knowledge and understanding of a
particular field.
▪ Domain experts can:
▪ tell us if the data are missing any important variables.
▪ specify the project objectives as well as the questions that need to be
answered.
▪ specify a performance threshold that needs to be achieved for a
model to be useful.
▪ communicate the results effectively to decision makers.
▪ provide knowledge of factors that are important for prediction but
which historical data fail to capture.
▪ In most areas, ML/AI models are still used for decision support
rather than for decision making.
47
DATA ETHICS (Reading slide)

▪ Often we have to work with anonymised data for ethical and legal
reasons.

▪ The purpose is to eliminate or reduce discrimination.


▪ E.g., loan and insurance applications should not and often legally
cannot use gender, religion, or race in their assessment.

▪ However, anonymising data is difficult.


▪ Hidden information can be deduced from ‘anonymous’ data.
▪ Correlation can be used to re-identify anonymised data.
▪ E.g., postcode may correlate with ethnicity and religion.

▪ However, ethics depend on the domain.


▪ Gender and race might be useful to know in medicine.
▪ E.g., some treatment decisions may be based on gender.
48
