SEEM2460 Spring2023 Week1
Helen Meng
Professor, Department of Systems Engineering & Engineering Management
Director, Big Data Decision Analytics Research Centre
The Chinese University of Hong Kong
Spring 2023
Week 1
Housekeeping
• Walk through course overview
• CUHK Policy on Academic Integrity
Outline
• Our World of Data
• What is Data Science?
• Why is it important?
• Illustrative Applications
Library of Congress
Source: images.google.com
Library of Parliament: the main information repository and research resource for the
Parliament of Canada [Source: Getty Images; Reference: Forbes]
Abbey Library of Saint Gall, St. Gallen, Switzerland [Source: Getty images, Forbes]
Stuttgart City Library, Stuttgart, Germany [Sources: Getty Images, Forbes]
Knowledge is Power (Then)
Knowledge is Power (Now)
Source: https://s3.amazonaws.com/external_clips/attachments/39914/original/pmIHS_IoT2_Fig2.png?1420841911
Knowledge is Power (Now)
• E‐books / publications
• Clickstream
• Traffic flow
• Social networks
• Text
• Geographic
• Satellite imagery
• Sentiment
• Broadcast audio/video streams
Source: Domo
Image credits: Google, CUHK SEEM‐IBM Course 4581
Source: IBM Big Data and Analytics Hub, circa 2014
Structured and Unstructured Data
Source: prowebscraper.com
Previous prediction was 175 ZB
Source: IDC 2021
Source: IDC 2018
Where is the data?
Big Data – Value
• Volume: 12 terabytes of Tweets created daily
• Velocity: 5 million trade events per second
• Variety: 100s of video feeds from surveillance cameras
Source: IBM
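To get a feel for the scale of the IBM figures quoted above, the arithmetic below converts them into other units. It is only a back-of-the-envelope sketch: it assumes a decimal terabyte (10^12 bytes) and, for the trade events, that the quoted per-second rate were sustained for a full day, which the slide does not claim.

```python
# Back-of-the-envelope conversions for the Volume and Velocity figures.
# Assumption: 1 terabyte = 10**12 bytes (decimal convention).
SECONDS_PER_DAY = 24 * 60 * 60

tweet_bytes_per_day = 12 * 10**12       # Volume: 12 TB of Tweets per day
trade_events_per_second = 5_000_000     # Velocity: 5 million trade events per second

# Average ingest rate needed to keep up with the daily Tweet volume.
tweet_megabytes_per_second = tweet_bytes_per_day / SECONDS_PER_DAY / 10**6

# Events accumulated in one day IF the quoted rate held all day (assumption).
trade_events_per_day = trade_events_per_second * SECONDS_PER_DAY

print(f"Tweets: about {tweet_megabytes_per_second:.0f} MB/s on average")
print(f"Trades: {trade_events_per_day:,} events per day")
```

Under these assumptions the Tweet volume averages roughly 139 MB every second, which is why storage and processing have to be designed for continuous flow rather than occasional batches.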
Illustrative Example: Foot Traffic Analytics
Source: i3
Illustrative Example: Car Safety
This Course
Outline
• Our World of Data
• What is Data Science?
• Why is it important?
• Example Applications
What is Data Science?
• Definition: A set of fundamental principles that guide the
extraction of knowledge from data, involving processes and
techniques for understanding phenomena via the
(automated) analysis of data [Provost & Fawcett 2013]
• Emerged as an academic subject and an industrial field
• Goal: support data‐driven decision making (DDD)
Frequently co‐occurring terms: Data Science, Data Mining, Analytics, Knowledge Discovery
Source: [Provost & Fawcett 2013]
Data Science – An Interdisciplinary Field
• Multidisciplinary focus that requires teamwork
• Draws on machine learning, mathematics, software engineering,
statistics, visualization, domain expertise, and (intuition‐guided)
exploratory data analysis [O’Neil and Schutt 2013]
Source: http://www.ibm.com/developerworks/library/os-datascience/
Data Science – Asking the Question
• Lifecycle of a question
[Figure: question lifecycle, with the labels Question, Validation, "Not interesting", "Worth asking again?" and "Make it repeatable"]
Must pose the right question, at the right time, on the right data!
Courtesy: Dr. Hsiao‐Wuen Hon, Managing Director, Microsoft Research Asia
The Data Scientist’s Activities / Data Processing Life Cycle
[Figure: Data Capture → Data Selection & Cleaning → Preprocessed Data → Data Transformation → Transformed Data → Data Analytics → Patterns and Trends → Interpretation, Evaluation, Decisions; feedback step: Update System]
Adapted from [Baesens 2014]
Data Processing Life Cycle
Data Selection and Cleaning:
• Data points missing, duplicated, corrupted, invalid, inconsistent
• Obtain high‐quality data
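A minimal sketch of the selection and cleaning step, using only the Python standard library. The records, the field names and the valid age range (0 to 120) are hypothetical and chosen purely for illustration; real projects define their own validity rules.

```python
# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"id": 1, "age": 34, "city": "Hong Kong"},
    {"id": 2, "age": None, "city": "Kowloon"},    # missing age
    {"id": 1, "age": 34, "city": "Hong Kong"},    # exact duplicate of the first record
    {"id": 3, "age": -5, "city": "Sha Tin"},      # invalid age
    {"id": 4, "age": 51, "city": "Sha Tin"},
]

def clean(records, min_age=0, max_age=120):
    """Split records into kept rows and rejected rows with a reason."""
    seen = set()
    kept, rejected = [], []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the whole record
        if any(value is None for value in rec.values()):
            rejected.append((rec, "missing"))
        elif key in seen:
            rejected.append((rec, "duplicate"))
        elif not (min_age <= rec["age"] <= max_age):
            rejected.append((rec, "invalid"))
        else:
            seen.add(key)
            kept.append(rec)
    return kept, rejected

kept, rejected = clean(raw_records)
for rec, reason in rejected:
    print(f"dropped id={rec['id']}: {reason}")
```

Keeping the rejected rows together with a reason, rather than silently dropping them, makes it possible to audit how much data each rule removes.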
Data Processing Life Cycle
Data Transformation:
• Binning, alphanumeric to numeric coding, geographical aggregation, etc.
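The three transformations named above can be sketched as follows. The records, the age band edges, the grade coding and the district names are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical preprocessed records.
records = [
    {"age": 23, "grade": "B", "district": "Sha Tin", "spend": 120.0},
    {"age": 41, "grade": "A", "district": "Sha Tin", "spend": 80.0},
    {"age": 67, "grade": "C", "district": "Central", "spend": 200.0},
]

def bin_age(age):
    """Binning: map a numeric age onto a coarse band (band edges are assumed)."""
    if age < 30:
        return "under 30"
    if age < 60:
        return "30 to 59"
    return "60 and over"

# Alphanumeric to numeric coding: an assumed ordinal mapping for letter grades.
GRADE_CODE = {"A": 3, "B": 2, "C": 1}

transformed = [
    {
        "age_band": bin_age(r["age"]),
        "grade_code": GRADE_CODE[r["grade"]],
        "district": r["district"],
        "spend": r["spend"],
    }
    for r in records
]

# Geographical aggregation: total spend per district.
spend_by_district = defaultdict(float)
for r in transformed:
    spend_by_district[r["district"]] += r["spend"]

print(transformed)
print(dict(spend_by_district))
```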
Data Processing Life Cycle
Data Analytics:
• Churn prediction, fraud detection, customer segmentation, market basket analysis
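As one concrete instance of the analytics tasks listed above, the sketch below computes the two basic market basket measures, support and confidence, for pairs of items. The transactions are invented for illustration, and practical systems use dedicated algorithms (for example Apriori) rather than this brute-force pair count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

# How many baskets contain each item, and each unordered pair of items.
item_counts = Counter(item for basket in transactions for item in basket)
pair_counts = Counter(
    pair for basket in transactions for pair in combinations(sorted(basket), 2)
)

def support(pair):
    """Fraction of all transactions that contain both items of the pair."""
    return pair_counts[tuple(sorted(pair))] / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    return both / item_counts[antecedent]

print(support(("bread", "milk")))
print(confidence("butter", "bread"))
print(confidence("bread", "butter"))
```

Confidence is directional: in this toy data every basket with butter also has bread, but only two of the three bread baskets contain butter.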
Data Processing Life Cycle
• Iterative!
• Research question answered?
• Reformulate hypothesis?
Data Analytics Enables Understanding
Source: Cao, ACM Survey 2017
Value of Data
[Figure: value grows as data is refined (signal, data, information, knowledge, insights & action) and combined (personal, organization, community, world)]
Courtesy: Dr. Hsiao‐Wuen Hon, MSRA
More than anything, what data scientists do is make discoveries while swimming
in data. It’s their preferred method of navigating the world around them. At ease
in the digital realm, they are able to bring structure to large quantities of
formless data and make analysis possible. They identify rich data sources, join
them with other, potentially incomplete data sources, and clean the resulting set.
In a competitive landscape where challenges keep changing and data never stop
flowing, data scientists help decision makers shift from ad hoc analysis to an
ongoing conversation with data.
Data scientists realize that they face technical limitations, but they don’t allow
that to bog down their search for novel solutions. As they make discoveries, they
communicate what they’ve learned and suggest its implications for new business
directions. Often they are creative in displaying information visually and making
the patterns they find clear and compelling. They advise executives and product
managers on the implications of the data for products, processes, and decisions.
Given the nascent state of their trade, it often falls to data scientists to fashion
their own tools and even conduct academic‐style research. Yahoo, one of the
firms that employed a group of data scientists early on, was instrumental in
developing Hadoop. Facebook’s data team created the language Hive for
programming Hadoop projects. Many other data scientists, especially at data‐
driven companies such as Google, Amazon, Microsoft, Walmart, eBay, LinkedIn,
and Twitter, have added to and refined the tool kit.
What kind of person does all this? What abilities make a data scientist successful?
Think of him or her as a hybrid of data hacker, analyst, communicator, and
trusted adviser. The combination is extremely powerful—and rare.
Source: https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
Skill Sets of a Data Scientist
Video: https://www.youtube.com/watch?v=JSqP5PXWOUc
Outline
• Our World of Data
• What is Data Science?
• Why is it important?
• Example Applications
Example: Understanding the Customer
“How Companies Learn Your Secrets” by [NYT 2012]
Example: Recommendation Systems
“Amazon’s Recommendation Secret” [Fortune 2012]
Example: Google Flu Trends (GFT)
• 2009: New flu virus H1N1 discovered
• Spread quickly; public health agencies feared a terrible
pandemic, with no vaccine readily available
• US Centers for Disease Control and Prevention (CDC) asked doctors
to report new flu cases
• Picture of the pandemic could be 1‐2 weeks out of date
• People feel sick for days but wait before seeing a doctor
• Relaying information took time; CDC tabulated once per week
• A delay of 1‐2 weeks is too long for public health monitoring
• Need early detection and rapid response
Source: Wikipedia
GFT (2)
• Google detecting influenza epidemics
• [Ginsberg et al., Nature vol. 457, 2009]
• Worked with the 3B search queries received daily
• Took the 50M most common search terms and compared them
with CDC data on the spread of flu (2003‐2008)
• Narrowed these to roughly 40+ search terms, used by a
mathematical model to obtain a strong correlation with the
spread of flu over time and space
• Predict spread of flu in real time
• https://www.youtube.com/watch?v=6111nS66Dpk
• https://www.youtube.com/watch?app=desktop&v=lEDt89eQ64o
Source: Wikipedia
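The idea of narrowing many candidate search terms down to the few that track reported flu cases can be illustrated with a simple correlation screen. This is not the model published by Ginsberg et al.; it is a toy sketch in which the weekly case counts, the search terms and their query volumes are all made up, and terms are ranked by plain Pearson correlation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up weekly figures: reported flu cases and query volume per search term.
cdc_flu_cases = [10, 20, 40, 80, 60, 30]
query_volume = {
    "flu symptoms": [12, 25, 38, 85, 58, 33],
    "cough medicine": [8, 18, 45, 70, 65, 28],
    "football scores": [50, 48, 52, 49, 51, 50],
}

def select_terms(series_by_term, target, top_k):
    """Rank terms by correlation with the target series and keep the top_k."""
    ranked = sorted(
        series_by_term,
        key=lambda term: pearson(series_by_term[term], target),
        reverse=True,
    )
    return ranked[:top_k]

selected = select_terms(query_volume, cdc_flu_cases, top_k=2)
print(selected)
```

A screen like this only finds terms that moved together with flu cases in the historical data; it cannot tell whether the searches were actually about flu, which is worth keeping in mind when a model built this way is used for prediction.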
GFT (3)
• GFT became a web service from Google, providing influenza
activity estimates for more than 25 countries
• 2009 flu pandemic: GFT tracked flu information across the US
• Feb 2010: CDC identified flu cases spiking; GFT showed the
same spike 2 weeks earlier
• Later, researchers found that the model was
aggregating queries about different health conditions,
leading to incorrect predictions
• GFT stopped publishing estimates in 2015
Source: Wikipedia
Example: IBM Watson
Video: https://www.youtube.com/watch?v=P18EdAKuC1U&t=8s
• IBM looking for its next Grand Challenge
• Develop a machine that can pass the Turing Test, which measures
machine intelligence by having a system attempt to fool a
human into thinking they are conversing with another person
• Beating a human in Jeopardy
• Quiz master provides answers (or “clues”, with tricky wording)
• Contestants provide a question
• IBM developed DeepQA, a massively parallel software
architecture that analyzes natural language content, developed
by 20 researchers over 3 years
• Natural language processing and machine learning accessing
200 million pages of information
• Jeopardy Watson task: get a “clue”, understand it, find the
question; Watson beat the two strongest human contestants in 2011
Example: Language Modelling
GPT‐3: https://www.youtube.com/watch?v=8V20HkoiNtc
ChatGPT: https://www.youtube.com/watch?v=o5MutYFWsM8
Example: Graphics Generation
DALL‐E: https://openai.com/blog/dall-e/
Midjourney: https://www.youtube.com/watch?v=SVcsDDABEkM
References
• Loukides, M., What is Data Science?, O’Reilly Media, 2011.
• O’Neil, C. and Schutt, R., Doing Data Science, O’Reilly, 2013.
• Provost, F. and Fawcett, T., Data Science for Business, O’Reilly, 2013.
• Mayer‐Schönberger, V. and Cukier, K., Big Data: A Revolution That Will Transform How
We Live, Work and Think, John Murray (Publishers), 2013.
• Baesens, B., Analytics in a Big Data World, Wiley, 2014.
• Davenport, T., Big Data @ Work, Harvard Business Review Press, 2014 (also Webinar).
• Meng, H., “Hong Kong’s Potential in Big Data Analytics,” ComputerWorld, 2014.
• Meng, H., “How Hong Kong can Unleash the Power of Big Data,” EJ Insight, 2016.
(https://www.ejinsight.com/eji/article/id/1458571/20161220-how-hk-can-unleash-the-power-of-big-data)
• Henke, N., “The Age of Analytics: Competing in a Data‐Driven World,” McKinsey Report, 2016.
• Cao, L., “Data Science: A Comprehensive Overview,” ACM Computing Surveys, 2017.
• Kelleher, J. and Tierney, B., Data Science, MIT Press, 2018.
• Brown, T. et al., “Language Models are Few‐Shot Learners,” arXiv, 2020.
• DALL‐E, OpenAI, 2021.
• Midjourney, ChatGPT, 2022 (see slides).