06 02 Introduction To Data Science

This document provides an introduction to data science, including definitions of key concepts like big data, data types, machine learning, and data science. It discusses how much data exists in areas like Google, Facebook, and genome projects. It also outlines common machine learning applications like classification, clustering, and reinforcement learning. Finally, it provides examples of data science use cases in domains like cancer research, healthcare, customer analytics, and IoT.

Uploaded by

Ravi

INTRODUCTION TO DATA SCIENCE

Kamal Al Nasr, Matthew Hayes and Jean-Claude Pedjeu


Computer Science and Mathematical Sciences
College of Engineering
Tennessee State University
LEARNING
▪ Big Data
▪ Types of Data We Have
▪ What is Data Science?
▪ What Is Machine Learning?
▪ Use Cases of Machine Learning
▪ Use Cases of Data Science
DATA ALL AROUND
▪ Lots of data is being collected and warehoused
▪ Web data, e-commerce
▪ Financial transactions, bank/credit transactions
▪ Online trading and purchasing
▪ Social Network
HOW MUCH DATA DO WE HAVE?
▪ Google processes 20 PB a day (2008)
▪ Facebook has 60 TB of daily logs
▪ eBay has 6.5 PB of user data + 50 TB/day (5/2009)
▪ 1000 genomes project: 200 TB

▪ Cost of 1 TB of disk: $35
▪ Time to read 1 TB disk: about 3 hrs (at 100 MB/s)
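The read-time figure above can be checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a single disk read sequentially at a constant rate; the function name is illustrative only.

```python
# Back-of-the-envelope check, assuming decimal units and constant throughput.
TB = 10**12
MB = 10**6

def read_time_hours(size_bytes, throughput_bytes_per_s):
    """Hours needed to scan size_bytes sequentially at a fixed throughput."""
    return size_bytes / throughput_bytes_per_s / 3600

one_disk = read_time_hours(1 * TB, 100 * MB)           # 1 TB at 100 MB/s
google_day = read_time_hours(20_000 * TB, 100 * MB)    # 20 PB on one such disk

print(f"1 TB at 100 MB/s: {one_disk:.2f} hours")
print(f"20 PB at 100 MB/s: {google_day / 24 / 365:.1f} years")
```

The second line shows why volumes such as the 20 PB per day quoted for Google cannot be scanned on a single disk and call for parallel processing.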
BIG DATA
▪ Big Data is any data that is expensive to manage and hard to extract value from
▪ Volume: the size of the data
▪ Velocity: the latency of data processing relative to the growing demand for interactivity
▪ Variety and Complexity: the diversity of sources, formats, quality, and structures
TYPES OF DATA WE HAVE
▪ Relational Data (Tables/Transaction/Legacy Data)
▪ Text Data (Web)
▪ Semi-structured Data (XML)
▪ Graph Data
▪ Social Network, Semantic Web (RDF), …
▪ Streaming Data
▪ You can afford to scan the data once
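The single-scan constraint on streaming data can be illustrated with a one-pass statistic. The sketch below uses Welford's algorithm (a standard technique, not named in the slides) to keep a running mean and variance without storing the stream; the class name and the sample readings are hypothetical.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the statistics; the value is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two values.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")

stats = RunningStats()
for reading in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):  # hypothetical stream
    stats.update(reading)

print(stats.n, stats.mean, stats.variance)
```

Memory use stays constant no matter how long the stream runs, which is the property a one-scan setting needs.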
WHAT TO DO WITH THESE DATA?
▪ Aggregation and Statistics
▪ Data warehousing and OLAP
▪ Indexing, Searching, and Querying
▪ Keyword based search
▪ Pattern matching (XML/RDF)
▪ Knowledge discovery
▪ Data Mining
▪ Statistical Modeling
▪ Data-Driven
▪ Predictive Analytics
▪ Deep Learning
BIG DATA AND DATA SCIENCE
▪ “… the sexy job in the next 10 years will be statisticians,” Hal Varian, Google Chief Economist
▪ The U.S. will need 140,000-190,000 predictive analysts and 1.5 million managers/analysts by 2018 (McKinsey Global Institute, June 2011)
▪ New Data Science institutes being created or repurposed: NYU, Columbia, Washington, UCB, ...
▪ New degree programs, courses, boot camps:
▪ e.g., at Berkeley: Stats, I-School, CS, Astronomy, …
▪ One proposal (elsewhere) for an MS in “Big Data Science”
WHAT IS DATA SCIENCE?
▪ A field that manages and manipulates tremendous amounts of data, and extracts and interprets knowledge from it
▪ Data science (DS) is a multidisciplinary field of study whose goal is to address the challenges of big data
▪ Data science principles apply to all data, big and small

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
WHAT IS DATA SCIENCE?
▪ Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in many areas, such as science, engineering, economics, politics, finance, and education
▪ Computer Science: pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
▪ Mathematics: mathematical modeling
▪ Statistics: statistical and stochastic modeling, probability
▪ Other definitions focus more on technical skills alone.
WHY IS IT SEXY?
▪ Gartner’s 2014 Hype Cycle
DATA SCIENCE
DATA SCIENCE VS ANALYSIS VS SOFTWARE DELIVERY
CONTRAST: SCIENTIFIC COMPUTING
CONTRAST: MACHINE LEARNING
MACHINE LEARNING
SO WHAT IS MACHINE LEARNING?
▪ Automating automation
▪ Getting computers to program themselves
▪ Writing software is the bottleneck
▪ Let the data do the work instead!
WHAT IS MACHINE LEARNING?
▪ Optimize a performance criterion using example data or past experience.
▪ Role of statistics: inference from a sample
▪ Role of computer science: efficient algorithms to
▪ Solve the optimization problem
▪ Represent and evaluate the model for inference

Traditional Programming
Data + Program → Computer → Output

Machine Learning
Data + Output → Computer → Program
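The contrast between the two diagrams above can be sketched in a few lines. In the first function a human writes the rule. In the second, the rule is recovered from (data, output) pairs by ordinary least squares. The linear rule y = 2x + 1 and both function names are assumptions chosen only for illustration.

```python
def traditional_program(x):
    """Traditional programming: a human writes the rule by hand."""
    return 2 * x + 1

def learn_program(xs, ys):
    """Machine learning: fit y = a*x + b to examples by least squares
    and return the fitted rule as a callable 'program'."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

xs = [0, 1, 2, 3, 4]                        # data
ys = [traditional_program(x) for x in xs]   # output observed for that data
learned = learn_program(xs, ys)             # data + output -> program

print(learned(10), traditional_program(10))
```

On this noise-free data the learned rule reproduces the hand-written one. With real, noisy data it would only approximate it.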
WHY “LEARN” ?
▪ Machine learning is programming computers to optimize a performance criterion using example data or past experience.
▪ There is no need to “learn” to calculate payroll
▪ Learning is used when:
▪ Human expertise does not exist (navigating on Mars),
▪ Humans are unable to explain their expertise (speech recognition)
▪ Solution changes in time (routing on a computer network)
▪ Solution needs to be adapted to particular cases (user biometrics)

WHAT WE TALK ABOUT WHEN WE TALK ABOUT “LEARNING”
▪ Learning general models from data of particular examples
▪ Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
▪ Example in retail: from customer transactions to consumer behavior:
People who bought “Da Vinci Code” also bought “The Five People You Meet in Heaven” (www.amazon.com)

▪ Build a model that is a good and useful approximation to the data.

SAMPLE APPLICATIONS
▪ Retail: Market basket analysis, Customer relationship management (CRM)
▪ Finance: Credit scoring, fraud detection
▪ Manufacturing: Optimization, troubleshooting
▪ Medicine: Medical diagnosis
▪ Telecommunications: Quality of service optimization
▪ Bioinformatics: Motifs, alignment
▪ Web mining: Search engines
▪ ...

APPLICATIONS
▪ Association
▪ Supervised Learning
▪ Classification
▪ Regression
▪ Unsupervised Learning
▪ Reinforcement Learning

LEARNING ASSOCIATIONS
▪ Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products/services.

Example: P(chips | beer) = 0.7

Market-Basket transactions
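The conditional probability used in basket analysis can be estimated directly from transaction counts. The sketch below is a minimal illustration. The six baskets are hypothetical and do not come from the slides, so the resulting value differs from the 0.7 quoted above.

```python
transactions = [  # hypothetical baskets, for illustration only
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer", "diapers"},
    {"chips", "salsa"},
    {"beer", "chips", "diapers"},
    {"milk", "bread"},
]

def conditional_probability(transactions, x, y):
    """Estimate P(y | x) as count(baskets with x and y) / count(baskets with x)."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        raise ValueError(f"no transaction contains {x!r}")
    return sum(1 for t in with_x if y in t) / len(with_x)

p = conditional_probability(transactions, "beer", "chips")
print(f"P(chips | beer) = {p}")
```

Four baskets contain beer and three of those also contain chips, so the estimate here is 3/4.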
CLASSIFICATION: APPLICATIONS
▪ Also known as pattern recognition
▪ Face recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style
▪ Character recognition: Different handwriting styles.
▪ Speech recognition: Temporal dependency.
▪ Use of a dictionary or the syntax of the language.
▪ Sensor fusion: Combine multiple modalities, e.g., visual (lip image) and acoustic for speech
▪ Medical diagnosis: From symptoms to illnesses
▪ ...

FACE RECOGNITION
Training examples of a person

Test images

SUPERVISED LEARNING: USES
▪ Prediction of future cases: Use the rule to predict the output for future inputs
▪ Knowledge extraction: The rule is easy to understand
▪ Compression: The rule is simpler than the data it explains
▪ Outlier detection: Exceptions that are not covered by the rule, e.g., fraud
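The uses listed above can be seen together in a very small learned rule. The sketch below fits a single threshold (a one-feature "decision stump", chosen here only for simplicity) to hypothetical credit data. The rule predicts new cases, is easy to read, is shorter than the data, and the training cases it fails to cover are the exceptions.

```python
def fit_threshold(xs, labels):
    """Learn the rule 'predict True when x >= t', choosing the t
    (among observed values) that makes the fewest training errors."""
    best_t, best_errors = None, len(xs) + 1
    for t in sorted(set(xs)):
        errors = sum((x >= t) != label for x, label in zip(xs, labels))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

# Hypothetical training data: income in $1000s and whether credit was approved.
incomes = [18, 22, 25, 31, 40, 47, 52, 60, 75, 20]
approved = [False, False, False, True, True, True, True, True, True, True]

threshold = fit_threshold(incomes, approved)

def predict(income):
    """Prediction of future cases: apply the learned rule to a new input."""
    return income >= threshold

# Exceptions: training cases the rule does not cover.
exceptions = [(x, y) for x, y in zip(incomes, approved) if predict(x) != y]

print("rule: approve when income >=", threshold)
print("prediction for 35:", predict(35))
print("exceptions:", exceptions)
```

Ten labeled examples compress to a single number, and the one case the rule gets wrong (an approval at income 20) is the kind of exception an analyst might examine further.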

UNSUPERVISED LEARNING
▪ Learning “what normally happens”
▪ No output
▪ Clustering: Grouping similar instances
▪ Example applications
▪ Customer segmentation in CRM
▪ Image compression: Color quantization
▪ Bioinformatics: Learning motifs
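Clustering for color quantization can be sketched with plain k-means on gray levels. This is a minimal illustration under stated assumptions: the pixel values and the three starting centers are hypothetical, and a real image would use three color channels rather than one.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on scalars, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to its cluster mean (unchanged if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

gray_levels = [12, 15, 20, 22, 118, 125, 130, 240, 245, 250]  # hypothetical pixels
palette = kmeans_1d(gray_levels, centers=[0, 128, 255])

# Quantize: replace every pixel by its nearest palette entry.
quantized = [min(palette, key=lambda c: abs(v - c)) for v in gray_levels]
print(palette)
print(quantized)
```

No labels are supplied. The three palette entries emerge from the grouping of similar values, and the image can then be stored with three levels instead of ten.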

REINFORCEMENT LEARNING
▪ Learning a policy: A sequence of outputs
▪ No supervised output but delayed reward
▪ Credit assignment problem
▪ Game playing
▪ Robot in a maze
▪ Multiple agents, partial observability, ...
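Delayed reward and credit assignment can be illustrated with tabular Q-learning on a toy version of the robot-in-a-maze example. The sketch below is one possible illustration, not a method given in the slides. The "maze" is a hypothetical five-cell corridor with a reward only at the last cell, and the learning rate, discount factor, and episode count are assumed values.

```python
import random

N_STATES, GOAL = 5, 4        # corridor cells 0..4; reward only on reaching cell 4
ACTIONS = (-1, +1)           # step left, step right
ALPHA, GAMMA = 0.5, 0.9      # learning rate and discount factor (assumed values)

def step(state, action):
    """Move within the corridor; reaching GOAL pays 1, every other move pays 0."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

rng = random.Random(0)       # fixed seed so the run is repeatable
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(500):                      # episodes
    state = 0
    while state != GOAL:
        action = rng.choice(ACTIONS)      # explore at random (Q-learning is off-policy)
        nxt, reward = step(state, action)
        best_next = 0.0 if nxt == GOAL else max(q[(nxt, a)] for a in ACTIONS)
        # Credit assignment: pass the discounted future value back one step.
        q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
        state = nxt

# Learned policy: the greedy action in each non-goal cell.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(policy)
```

Only the final move is ever rewarded, yet the learned policy is to step right (+1) in every cell, because the update rule propagates the reward back to the earlier moves that led to it.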

DATA SCIENCE APPLICATIONS
DATA SCIENCE CASE STUDY: CANCER RESEARCH
▪ Cancer is an incredibly complex disease; a single tumor can have more than 100 billion cells, and each cell can acquire mutations individually. The disease is always changing, evolving, and adapting.
▪ Employ the power of big data analytics and high-performance computing.
▪ Leverage sophisticated pattern recognition and machine learning algorithms to identify patterns that are potentially linked to cancer
▪ Huge amount of data processing and recognition
DATA SCIENCE CASE STUDY: HEALTH CARE
▪ Stanford Medicine and Google team up to harness the power of data science for health care
▪ Stanford Medicine will use the power, security and scale of Google Cloud Platform to support precision health and more efficient patient care.
▪ Analyzing genetic data
▪ Focusing on precision health
▪ Data as the engine that drives research
DATA SCIENCE CASE STUDY: INTERNET OF THINGS (IOT)
DATA SCIENCE CASE STUDY: CUSTOMER ANALYTICS
REAL LIFE EXAMPLES
▪ Companies learn your secrets, shopping patterns, and preferences
▪ For example, can we know if a woman is pregnant, even if she doesn’t want us to know? (Target case study)
▪ Data science and elections (2008, 2012)
▪ 1 million people installed the Obama Facebook app that gave access to info on “friends”
DATA SCIENTISTS
▪ Data Scientist
▪ The Sexiest Job of the 21st Century
▪ They find stories, extract knowledge. They are not reporters
DATA SCIENTISTS
▪ Data scientists are the key to realizing the opportunities presented by big data. They bring structure to it, find compelling patterns in it, and advise executives on the implications for products, processes, and decisions
WHAT DO DATA SCIENTISTS DO?
▪ National Security
▪ Cyber Security
▪ Business Analytics
▪ Engineering
▪ Healthcare
▪ And more …
CONCENTRATION IN DATA SCIENCE

▪ Mathematics and Applied Mathematics
▪ Applied Statistics/Data Analysis
▪ Solid Programming Skills (R, Python, Julia, SQL)
▪ Data Mining
▪ Database Storage and Management
▪ Machine Learning and discovery
