06 02 Introduction To Data Science
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
WHAT IS DATA SCIENCE?
▪ Data science draws on theories and techniques from many fields and disciplines to investigate and analyze large amounts of data, helping decision makers in areas such as science, engineering, economics, politics, finance, and education.
▪ Computer Science
  ▪ Pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
▪ Mathematics
  ▪ Mathematical modeling
▪ Statistics
  ▪ Statistical and stochastic modeling, probability
▪ Other definitions focus more on technical skills alone.
WHY IS IT SEXY?
▪ Gartner’s 2014 Hype Cycle
DATA SCIENCE
DATA SCIENCE VS ANALYSIS VS SOFTWARE DELIVERY
CONTRAST: SCIENTIFIC COMPUTING
CONTRAST: MACHINE LEARNING
MACHINE LEARNING
SO WHAT IS MACHINE LEARNING?
▪ Automating automation
▪ Getting computers to program themselves
▪ Writing software is the bottleneck
▪ Let the data do the work instead!
WHAT IS MACHINE LEARNING?
▪ Optimize a performance criterion using example data or past experience.
▪ Role of Statistics: Inference from a sample
▪ Role of Computer Science: Efficient algorithms to
  ▪ Solve the optimization problem
  ▪ Represent and evaluate the model for inference
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
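The contrast between traditional programming and machine learning can be made concrete with a minimal sketch. The temperature-conversion task, the example values, and the names `celsius_to_fahrenheit`, `fit_line`, and `learned_program` are all assumptions chosen for illustration; they do not come from the slides. In the first half a person writes the program; in the second half the computer recovers an equivalent program from data and outputs by least squares.

```python
# Traditional programming: a person writes the rule (the program) by hand.
def celsius_to_fahrenheit(c):
    return 1.8 * c + 32.0


# Machine learning: the computer recovers the rule from data and outputs.
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical example data: inputs and the outputs observed for them.
data = [-10.0, 0.0, 15.0, 25.0, 40.0]
outputs = [celsius_to_fahrenheit(c) for c in data]

# The "program" that comes out of learning is just two numbers.
slope, intercept = fit_line(data, outputs)


def learned_program(c):
    return slope * c + intercept
```

Here the performance criterion being optimized is the squared error between the fitted line and the example outputs.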
WHY “LEARN” ?
▪ Machine learning is programming computers to optimize a performance criterion using example data or past experience.
▪ There is no need to “learn” to calculate payroll
▪ Learning is used when:
  ▪ Human expertise does not exist (navigating on Mars)
  ▪ Humans are unable to explain their expertise (speech recognition)
  ▪ Solution changes over time (routing on a computer network)
  ▪ Solution needs to be adapted to particular cases (user biometrics)
WHAT WE TALK ABOUT WHEN WE TALK ABOUT “LEARNING”
▪ Learning general models from data of particular examples
▪ Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
▪ Example in retail: From customer transactions to consumer behavior:
  People who bought “Da Vinci Code” also bought “The Five People You Meet in Heaven” (www.amazon.com)
SAMPLE APPLICATIONS
▪ Retail: Market basket analysis, Customer relationship management (CRM)
▪ Finance: Credit scoring, fraud detection
▪ Manufacturing: Optimization, troubleshooting
▪ Medicine: Medical diagnosis
▪ Telecommunications: Quality of service optimization
▪ Bioinformatics: Motifs, alignment
▪ Web mining: Search engines
▪ ...
APPLICATIONS
▪ Association
▪ Supervised Learning
  ▪ Classification
  ▪ Regression
▪ Unsupervised Learning
▪ Reinforcement Learning
LEARNING ASSOCIATIONS
▪ Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products or services.
Market-Basket transactions
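As a rough illustration of basket analysis, P(Y | X) can be estimated as the number of baskets containing both X and Y divided by the number of baskets containing X. The transactions below are made up for this sketch, and `conditional_probability` is a hypothetical helper name, not part of any library.

```python
def conditional_probability(transactions, x, y):
    """Estimate P(y | x) = count(baskets with x and y) / count(baskets with x)."""
    with_x = [basket for basket in transactions if x in basket]
    if not with_x:
        return None  # x was never bought, so the estimate is undefined
    with_both = [basket for basket in with_x if y in basket]
    return len(with_both) / len(with_x)


# Hypothetical market-basket transactions (one set of items per customer visit).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

# Four baskets contain diapers; three of those also contain beer.
p_beer_given_diapers = conditional_probability(transactions, "diapers", "beer")
```

With these five baskets the estimate is 3/4. On real data such estimates are only meaningful when X appears in enough baskets.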
CLASSIFICATION: APPLICATIONS
▪ Also known as pattern recognition
▪ Face recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style
▪ Character recognition: Different handwriting styles
▪ Speech recognition: Temporal dependency
  ▪ Use of a dictionary or the syntax of the language
▪ Sensor fusion: Combine multiple modalities, e.g., visual (lip image) and acoustic for speech
▪ Medical diagnosis: From symptoms to illnesses
▪ ...
FACE RECOGNITION
Training examples of a person
Test images
SUPERVISED LEARNING: USES
▪ Prediction of future cases: Use the rule to predict the output for future inputs
▪ Knowledge extraction: The rule is easy to understand
▪ Compression: The rule is simpler than the data it explains
▪ Outlier detection: Exceptions that are not covered by the rule, e.g., fraud
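Several of these uses can be seen in one small sketch: a single-threshold rule learned from labeled examples. The credit-scoring data, the income feature, and the names `learn_threshold` and `predict` are assumptions made up for illustration. The learned rule predicts future cases, is easy to read (knowledge extraction), and is a single number standing in for the whole training set (compression).

```python
def learn_threshold(values, labels):
    """Choose the threshold t minimizing training errors for the rule
    'predict True when value >= t'."""
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best_threshold, best_errors = None, None
    for t in candidates:
        errors = sum((v >= t) != label for v, label in zip(values, labels))
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


# Hypothetical training data: yearly income (thousands) and whether a loan was repaid.
incomes = [12, 18, 25, 31, 40, 52, 60]
repaid = [False, False, False, True, True, True, True]

threshold, training_errors = learn_threshold(incomes, repaid)


def predict(income):
    """Apply the learned rule to a future case."""
    return income >= threshold
```

A training example that the rule misclassifies would be an exception in the sense of outlier detection; this toy data set happens to contain none.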
UNSUPERVISED LEARNING
▪ Learning “what normally happens”
▪ No output
▪ Clustering: Grouping similar instances
▪ Example applications
  ▪ Customer segmentation in CRM
  ▪ Image compression: Color quantization
  ▪ Bioinformatics: Learning motifs
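Clustering for color quantization can be sketched with a small k-means loop. This is a simplified illustration that assumes grayscale pixel values (one number per pixel), hand-picked initial centers, and a fixed number of iterations; `kmeans_1d` and `quantize` are hypothetical names, and the pixel values are made up.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group 1-D values around k centers; there are no output labels."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers


def quantize(values, centers):
    """Replace every value by its nearest center."""
    return [min(centers, key=lambda c: abs(v - c)) for v in values]


# Hypothetical grayscale pixel intensities (0 = black, 255 = white).
pixels = [10, 12, 14, 200, 205, 210, 100, 98, 102]

palette = kmeans_1d(pixels, centers=[0, 128, 255])
compressed = quantize(pixels, palette)
```

After quantization the image needs only three distinct gray levels instead of nine, which is where the compression comes from.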
REINFORCEMENT LEARNING
▪ Learning a policy: A sequence of outputs
▪ No supervised output but delayed reward
▪ Credit assignment problem
▪ Game playing
▪ Robot in a maze
▪ Multiple agents, partial observability, ...
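The "robot in a maze" setting can be sketched with tabular Q-learning, one common reinforcement learning method (the slides do not name a specific algorithm). Everything concrete here is an assumption for illustration: the maze is a five-cell corridor, the only reward is 1 on reaching the last cell, the agent explores uniformly at random, and the learning rate, discount, and episode count are arbitrary choices.

```python
import random

N_STATES = 5                 # corridor cells 0..4 (hypothetical maze)
GOAL = N_STATES - 1          # the reward is only given on reaching this cell
ACTIONS = {"left": -1, "right": +1}


def step(state, action):
    """Move one cell, staying inside the corridor; reward 1 only at the goal."""
    next_state = min(max(state + ACTIONS[action], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def q_learning(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = rng.choice(list(ACTIONS))  # explore uniformly at random
            next_state, reward = step(state, action)
            # The discounted best next value passes the delayed reward backwards.
            best_next = max(q[next_state].values())
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = next_state
    return q


q = q_learning()

# The learned policy: the best action in each non-goal cell.
policy = [max(q[s], key=q[s].get) for s in range(GOAL)]
```

No cell is ever labeled with the correct move. The discount factor spreads the single delayed reward back over earlier cells, which is one simple answer to the credit assignment problem.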
DATA SCIENCE APPLICATIONS
DATA SCIENCE: CASE STUDY
CANCER RESEARCH
▪ Cancer is an incredibly complex disease; a single tumor can have more than 100 billion cells, and each cell can acquire mutations individually. The disease is always changing, evolving, and adapting.
▪ Employ the power of big data analytics and high-performance computing.
▪ Leverage sophisticated pattern recognition and machine learning algorithms to identify patterns that are potentially linked to cancer
▪ Requires processing and recognizing patterns in huge amounts of data
DATA SCIENCE: CASE STUDY
HEALTH CARE
▪ Stanford Medicine, Google team up to harness power of data science for health care
▪ Stanford Medicine will use the power, security and scale of Google Cloud Platform to support precision health and more efficient patient care.
▪ Analyzing genetic data
▪ Focusing on precision health
▪ Data as the engine that drives research
DATA SCIENCE: CASE STUDY
INTERNET OF THINGS (IOT)
DATA SCIENCE: CASE STUDY
CUSTOMER ANALYTICS
REAL LIFE EXAMPLES
▪ Companies learn your secrets, shopping patterns, and preferences
▪ For example, can we know if a woman is pregnant, even if she doesn’t want us to know? (Target case study)