
SEEM2460 Spring2023 Week1


SEEM2460 Introduction to Data Science

Helen Meng
Professor, Department of Systems Engineering & Engineering Management
Director, Big Data Decision Analytics Research Centre
The Chinese University of Hong Kong
Spring 2023

Week 1

1
Housekeeping
• Walk through course overview
• CUHK Policy on Academic Integrity

2
Outline
• Our World of Data
• What is Data Science?
• Why is it important?
• Illustrative Applications

3
Library of Congress

4
Source: images.google.com
Library of Parliament is the main information repository and research resource Parliament of ...
Canada [Source: Getty Images, Reference: Forbes]

5
Abbey Library of Saint Gall, St. Gallen, Switzerland [Source: Getty images, Forbes]

6
Stuttgart City Library, Stuttgart, Germany [Sources: Getty Images, Forbes]

7
Knowledge is Power (Then)

8
Knowledge is Power (Now)

Source: CUHK library webpage

9


DIKW Hierarchy
• Ackoff, R. L., "From Data to Wisdom", Journal of Applied Systems
Analysis, 1989.
• Professor Russell Lincoln Ackoff (1919–2009) was an American
organizational theorist, consultant and Anheuser‐Busch Professor
Emeritus of Management Science at the Wharton School, University
of Pennsylvania. Ackoff was a pioneer in the fields of operations
research, systems thinking and management science.
[Source: Wikipedia]

Source: https://s3.amazonaws.com/external_clips/attachments/39914/original/pmIHS_IoT2_Fig2.png?1420841911

10
Knowledge is Power (Now)
• E‐books / publications
• Clickstream
• Traffic flow
• Social networks
• Text
• Geographic
• Satellite imagery
• Sentiment
• Broadcast audio/video streams

11
Knowledge is Power (Now)
Data Data Data Data Data!

12
13
14
15
Source: Domo

16
Image credits: Google, CUHK SEEM‐IBM Course 4581

17
Source: IBM Big Data and Analytics Hub circa 2014

18
Structured and Unstructured Data

Source: prowebscraper.com

Previous prediction was 175ZB

19
Source: IDC 2021
20
Source: IDC 2018
Where is the data?

21
Big Data – Value
Volume
• 12 terabytes of Tweets created daily: analyze product sentiment
• 350 billion meter readings per annum: predict power consumption
Velocity
• 5 million trade events per second: identify potential fraud
• 500 million call detail records per day: prevent customer churn
Variety
• 100s of video feeds from surveillance cameras: monitor events of interest
• 80% of data growth is images, video, documents…: improve customer satisfaction

Source: IBM

22
Illustrative Example: Foot Traffic Analytics

Source: i3
23
Illustrative Example: Car Safety

24
This Course
• Case Studies & Applications: Social Network Analysis, Healthcare, Transportation, Operations Management, Finance
• Tools: Data Visualization, Large‐scale Data Processing
• Techniques and Methods: Machine Learning, Statistics, Data Mining, Pattern Recognition, Optimization

25
Outline
• Our World of Data
• What is Data Science?
• Why is it important?
• Example Applications

26
What is Data Science?
• Definition: A set of fundamental principles that guide the
extraction of knowledge from data, involving processes and
techniques for understanding phenomena via the
(automated) analysis of data [Provost & Fawcett 2013]
• Emerged as an academic subject and an industrial field
• Goal: support data‐driven decision making (DDD)

Frequently co‐occurring:
Data Science
Data mining
Analytics
Knowledge Discovery

27
Source: [Provost & Fawcett 2013]
Data Science – An Interdisciplinary Field
• Multi‐disciplinary focus, needs teamwork
• Machine learning, mathematics, software engineering,
statistics, visualization, domain expertise,
(intuition‐guided) exploratory data analysis [O’Neil and Schutt 2013]

28
Source: http://www.ibm.com/developerworks/library/os-datascience/
Data Science – Asking the Question
• Lifecycle of a question
[Diagram labels: Question; Validation; Not interesting; Worth asking again?; Make it repeatable; Ready to deploy solutions]

Must pose the right question at the right time and on the right data!
29
Courtesy: Dr. Hsiao‐Wuen Hon, Managing Director, Microsoft Research Asia
The Data Scientist’s Activities / Data Processing Life Cycle
[Diagram: Data Capture → Data → Data Selection & Cleaning → Preprocessed Data → Data Transformation → Transformed Data → Data Analytics → Patterns and Trends → Interpretation, Evaluation, Decisions → Update System]

30
Adapted from [Baesens 2014]
Data Processing Life Cycle
Data Capture:
• Gathering source data
• Collecting and accumulating data

31
Data Processing Life Cycle
Data Selection and Cleaning:
• Data points missing, duplicated, corrupted, invalid, inconsistent
• Obtain high‐quality data

32
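The data‐quality problems named on the Data Selection and Cleaning slide (missing, duplicated, corrupted, invalid records) can be made concrete with a small sketch. The field names `customer_id` and `age`, the plausible age range, and the sample records are all assumptions made for illustration; they do not come from the course material.

```python
import math

def clean_records(records, required=("customer_id", "age")):
    """Return records with missing, invalid, and duplicated rows removed.

    Hypothetical schema: each record is a dict with 'customer_id' and 'age'.
    """
    seen_ids = set()
    cleaned = []
    for rec in records:
        # Missing: a required field is absent or None.
        if any(rec.get(field) is None for field in required):
            continue
        age = rec["age"]
        # Corrupted or invalid: age must be a real number (not bool, not NaN)
        # inside an assumed plausible range of 0 to 120.
        if isinstance(age, bool) or not isinstance(age, (int, float)):
            continue
        if math.isnan(age) or not 0 <= age <= 120:
            continue
        # Duplicated: keep only the first record seen for each customer.
        if rec["customer_id"] in seen_ids:
            continue
        seen_ids.add(rec["customer_id"])
        cleaned.append(rec)
    return cleaned

raw = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": None},          # missing value
    {"customer_id": 1, "age": 34},            # duplicate of the first record
    {"customer_id": 3, "age": -5},            # invalid (out of range)
    {"customer_id": 4, "age": float("nan")},  # corrupted
    {"customer_id": 5, "age": 51},
]

cleaned = clean_records(raw)
print([rec["customer_id"] for rec in cleaned])
```

In practice the validity rules depend on the domain; the point is only that each problem type on the slide maps to an explicit, testable check.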
Data Processing Life Cycle
Data Transformation:
• Binning, alphanumeric to numeric coding, geographical aggregation, etc.

33
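The three transformations named on the Data Transformation slide can each be sketched in a few lines. The bin boundaries, category labels, district names and amounts below are hypothetical values chosen only to make the example run.

```python
from bisect import bisect_right
from collections import defaultdict

# Binning: map a numeric age onto an assumed set of age bands.
AGE_EDGES = [18, 35, 60]
AGE_LABELS = ["<18", "18-34", "35-59", "60+"]

def bin_age(age):
    """Return the label of the bin that contains `age`."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

# Alphanumeric to numeric coding: give each distinct label an integer code,
# numbered in order of first appearance.
def encode_categories(values):
    codes = {}
    encoded = [codes.setdefault(value, len(codes)) for value in values]
    return encoded, codes

# Geographical aggregation: roll district-level amounts up to regions.
def aggregate_by_region(rows, district_to_region):
    totals = defaultdict(float)
    for district, amount in rows:
        totals[district_to_region[district]] += amount
    return dict(totals)

ages = [17, 18, 35, 60]
tiers = ["gold", "silver", "gold", "bronze"]
sales = [("Sha Tin", 120.0), ("Tai Po", 80.0), ("Central", 200.0)]
regions = {
    "Sha Tin": "New Territories",
    "Tai Po": "New Territories",
    "Central": "Hong Kong Island",
}

print([bin_age(a) for a in ages])
print(encode_categories(tiers))
print(aggregate_by_region(sales, regions))
```

Integer codes of this kind impose an arbitrary order on the categories, so whether they are appropriate depends on the model that consumes them.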
Data Processing Life Cycle
Data Analytics:
• Churn prediction, fraud detection, customer segmentation, market basket analysis

34
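Of the analytics tasks listed on the Data Analytics slide, market basket analysis is the easiest to illustrate briefly. The sketch below counts item pairs and reports support and confidence for pairwise rules. The function name `pair_rules`, the support threshold and the toy baskets are assumptions for illustration, and the approach is deliberately simpler than a full frequent‐itemset algorithm.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return {(antecedent, consequent): (support, confidence)} for item pairs.

    support    = fraction of baskets containing both items
    confidence = fraction of baskets with the antecedent that also
                 contain the consequent
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore repeated items in a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue                          # pair is too rare to report
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules

baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

for (lhs, rhs), (support, confidence) in sorted(pair_rules(baskets).items()):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Here every basket that contains beer also contains diapers, so that rule has confidence 1.0 even though its support is only 0.5; the two measures answer different questions.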
Data Processing Life Cycle
Actionable Insights:
• Understanding, communication, decisions, recommendations

35
Data Processing Life Cycle
Update System:
• Actions and decisions may be incorporated into the system

36
Data Processing Life Cycle
• Iterative!
• Research question answered?
• Reformulate hypothesis?

37
Data Analytics Enables Understanding
Source: Cao, ACM Survey 2017

• Data provides observations
• Need tools to process data automatically and effectively
• Analytics uncovers knowledge through data modeling
• Helps us understand ourselves and the world
• Helps us improve ourselves and the world

38
Outline
• Our World of Data
• What is Data Science?
• Why is it important?
• Example Applications

39
Value of Data
[Diagram labels. VALUE. REFINE: SIGNAL, DATA, INFORMATION, KNOWLEDGE, INSIGHTS & ACTION. COMBINE: PERSONAL, ORGANIZATION, COMMUNITY, WORLD.]
Courtesy: Dr. Hsiao‐Wuen Hon, MSRA

40
Value of Data
[Diagram labels. VALUE. REFINE: SIGNAL, DATA, INFORMATION, KNOWLEDGE, Insights & Actions. Activities: Discover, Process, Integrate, Share, Publish, Recommend.]
Courtesy: Dr. Hsiao‐Wuen Hon, MSRA

41


Gartner’s 2014 Hype Cycle

42
43
44
More than anything, what data scientists do is make discoveries while swimming
in data. It’s their preferred method of navigating the world around them. At ease
in the digital realm, they are able to bring structure to large quantities of
formless data and make analysis possible. They identify rich data sources, join
them with other, potentially incomplete data sources, and clean the resulting set.
In a competitive landscape where challenges keep changing and data never stop
flowing, data scientists help decision makers shift from ad hoc analysis to an
ongoing conversation with data.
Data scientists realize that they face technical limitations, but they don’t allow
that to bog down their search for novel solutions. As they make discoveries, they
communicate what they’ve learned and suggest its implications for new business
directions. Often they are creative in displaying information visually and making
the patterns they find clear and compelling. They advise executives and product
managers on the implications of the data for products, processes, and decisions.
Given the nascent state of their trade, it often falls to data scientists to fashion
their own tools and even conduct academic‐style research. Yahoo, one of the
firms that employed a group of data scientists early on, was instrumental in
developing Hadoop. Facebook’s data team created the language Hive for
programming Hadoop projects. Many other data scientists, especially at data‐
driven companies such as Google, Amazon, Microsoft, Walmart, eBay, LinkedIn,
and Twitter, have added to and refined the tool kit.
What kind of person does all this? What abilities make a data scientist successful?
Think of him or her as a hybrid of data hacker, analyst, communicator, and
trusted adviser. The combination is extremely powerful—and rare.

45
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
Skill Sets of a Data Scientist

Source: Kelleher, J. and Tierney, B., What is Data Science, MIT Press 2018.

Video: https://www.youtube.com/watch?v=JSqP5PXWOUc
46
Outline
• Our World of Data
• What is Data Science?
• Why is it important?
• Example Applications

47
Example: Understanding the Customer
“How Companies Learn Your Secrets” by [NYT 2012]

48
49
50
Example: Recommendation Systems
“Amazon’s Recommendation Secret” [Fortune 2012]

51
Example: Google Flu Trends (GFT)
• 2009: New flu virus H1N1 discovered
• Spread quickly, public health agencies feared a terrible
pandemic, no vaccine readily available
• US Centers for Disease Control and Prevention (CDC) requested
that doctors inform it of new flu cases
• Picture of pandemic may be 1‐2 weeks out of date
• People feel sick for days but wait before seeing a doctor
• Relaying info took time, CDC tabulated once per week
• 1‐2 weeks delay is too long for public health monitoring
• Need early detection and rapid response

Source: Wikipedia

52
GFT (2)
• Google detecting influenza epidemics
• [Ginsberg et al., Nature vol. 457, 2009]
• Worked with 3B search queries received daily
• Took the 50M most common search terms and compared them
with CDC data on the spread of the flu (2003‐2008)
• Narrowed to roughly 40+ search terms used by a
mathematical model to obtain strong correlation with
spread of flu over time and space
• Predict spread of flu in real time
• https://www.youtube.com/watch?v=6111nS66Dpk
• https://www.youtube.com/watch?app=desktop&v=lEDt89eQ64o

Source: Wikipedia

53
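The term‐selection idea on the GFT (2) slide, comparing search‐term frequencies with CDC data and keeping the best‐correlated terms, can be sketched as a simple ranking by Pearson correlation. This is only an illustration of the idea, not the published model: the weekly figures, the search terms and the function names are invented for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def top_terms(term_series, reference, k):
    """Return the k search terms whose weekly frequency best tracks `reference`."""
    scored = sorted(
        ((pearson(series, reference), term) for term, series in term_series.items()),
        reverse=True,
    )
    return [term for _, term in scored[:k]]

# Hypothetical weekly flu-activity rates reported by a health agency.
reported_flu_rate = [1.0, 2.0, 4.0, 3.0, 1.5]

# Hypothetical weekly query counts for three search terms.
query_counts = {
    "flu symptoms": [10, 20, 40, 30, 15],
    "fever remedy": [5, 9, 14, 13, 7],
    "basketball scores": [30, 10, 20, 25, 28],
}

print(top_terms(query_counts, reported_flu_rate, k=2))
```

Ranking purely by correlation will also select terms that merely share a seasonal pattern with the flu, which is one way a model of this kind can be misled.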
GFT (3)
• GFT became a web service from Google, providing estimates of
influenza activity for more than 25 countries
• 2009 flu pandemic – GFT tracked flu information across US
• 2010 Feb CDC identified flu cases spiking, GFT showed
same spike 2 weeks prior
• Later, researchers uncovered that the model was
aggregating queries about different health conditions,
leading to incorrect predictions
• GFT stopped publishing estimates in 2015

Source: Wikipedia

54
Example: IBM Watson
Video: https://www.youtube.com/watch?v=P18EdAKuC1U&t=8s
• IBM looking for its next Grand Challenge
• Develop a machine that can pass the Turing Test, which measures
machine intelligence by having a system attempt to fool a
human into thinking they are conversing with another person
• Beating a human in Jeopardy
• Quiz master provides answers (or “clues”, tricky wordings)
• Contestants provide a question
• IBM developed DeepQA – a massively parallel software
architecture that analyzes natural language content, developed
by 20 researchers over 3 years
• Natural language processing and machine learning accessing
200 million pages of information
• Jeopardy Watson task – get a “clue”, understand it, find the
question; beat the two strongest human contestants in 2011

55
Example: Language Modelling
GPT‐3: https://www.youtube.com/watch?v=8V20HkoiNtc
ChatGPT: https://www.youtube.com/watch?v=o5MutYFWsM8

56
Example: Graphics Generation
DALL‐E: https://openai.com/blog/dall-e/
Midjourney: https://www.youtube.com/watch?v=SVcsDDABEkM

57
References
• Loukides, M., What is Data Science?, O’Reilly Media, 2011.
• O’Neil, C. and Schutt, R., Doing Data Science, O’Reilly, 2013.
• Provost, F. and Fawcett, T., Data Science for Business, O’Reilly, 2013.
• Mayer‐Schönberger, V. and Cukier, K., Big Data: A Revolution That Will Transform How
We Live, Work and Think, John Murray (Publishers), 2013.
• Baesens, B., Analytics in a Big Data World, Wiley, 2014.
• Davenport, T., Big Data @ Work, Harvard Business Review Press, 2014 (also Webinar).
• Meng, H., “Hong Kong’s Potential in Big Data Analytics,” ComputerWorld, 2014.
• Meng, H., “How Hong Kong can Unleash the Power of Big Data,” EJ Insight, 2016.
(https://www.ejinsight.com/eji/article/id/1458571/20161220-how-hk-can-unleash-the-power-of-big-data)
• Henke, N., “The Age of Analytics: Competing in a Data‐Driven World,” McKinsey Report,
2016.
• Cao, L., “Data Science: A Comprehensive Overview,” ACM Computing Surveys, 2017.
• Kelleher, J. and Tierney, B., Data Science, MIT Press, 2018.
• Brown, T. et al., “Language Models are Few‐Shot Learners,” arXiv, 2020.
• DALL‐E, OpenAI, 2021.
• Midjourney, ChatGPT, 2022 (see slides).

58
