Lecture 1 & 2

The document outlines a course on Data Science, covering topics such as the definition of data, big data, the data science life cycle, and various applications of data science. It emphasizes the importance of understanding data preparation, analysis techniques, and the use of tools like Python and machine learning algorithms. The course aims to equip students with the skills to analyze data and derive insights for real-world applications.

Uploaded by

l211803


CS4048 – Data Science

Prepared By: Maimoona Akram

Overview 2
Course Introduction
What is Data?
What is Big Data?
What is Data Science?
Data Science Life Cycle & Process
Applications of Data Science
Data Analytics
Types of Data Analytics
Course Introduction 3

Objective:
To reinforce the concepts developed in theory with
experiments on data preparation and data analysis using different
techniques in Python

Books and Resources:

E-books and course material will be posted on Google Classroom (coming soon).
Tentative Grading Policy 4

Class
Assignments 10%
Quizzes 10%
Midterm Exam 30%
Project + Final Exam 10% + 40%
General Guidelines 5

• Visit Google Classroom regularly for updates.

• No email submissions: use Google Classroom. Remember that emailing your task amounts to putting it in the trash yourself.
❑ Cheating will not be tolerated.
❑ There will be no retake of any evaluation.
Learning Outcomes 6
• Recognise the basics of Data Science; prepare and wrangle data for analysis
• Practice Exploratory Data Analysis (EDA) to investigate data with the help of statistical and graphical representations
• Identify and apply different machine learning algorithms to gain insight from the data
• Use feature selection methods to optimize the performance of the learning model
Tools and Software Packages 7
• WEKA
• KNIME
• ORANGE
• CBA
• MATLAB
• PYTHON
• R
• SPSS
• SAS
• Maptitude for the Web
• etc.

Language: Python
Tools: NumPy, Pandas, matplotlib, scikit-learn, etc.
What is Data? 8
Data is information in raw form: numbers, letters, and symbols that represent ideas, facts, categories, etc.

Good decisions require good information derived from raw facts.
How much data do we have? 9
• Lots of data is being collected and warehoused
• Web data, e-commerce
• Financial transactions, bank/credit transactions
• Online trading and purchasing
• Social Network
• Google processes 20 PB a day (2008)
• Facebook has 60 TB of daily logs
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• 1000 genomes project: 200 TB
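To get a feel for these volumes, the daily figures above can be converted to average rates per second. This is only a back-of-the-envelope sketch: it assumes decimal (SI) units, and the helper name `per_second` is made up for illustration.

```python
# Decimal (SI) byte units; binary units (TiB, PiB) would give slightly different numbers.
TB = 10**12
PB = 10**15
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(bytes_per_day: float) -> float:
    """Average bytes per second for a given daily volume."""
    return bytes_per_day / SECONDS_PER_DAY


# Daily volumes quoted on the slide.
daily_volumes = {
    "Google (2008)": 20 * PB,
    "Facebook daily logs": 60 * TB,
    "eBay (5/2009)": 50 * TB,
}

for name, volume in daily_volumes.items():
    print(f"{name}: {per_second(volume) / 10**9:.1f} GB/s on average")
```

Under these assumptions, 20 PB a day works out to roughly 230 GB every second, around the clock.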
How much data do we have? 10
Big Data 11

“Between the dawn of civilization and 2003, we only created five exabytes of information; now we’re creating that amount every two days.”

- Eric Schmidt, Google

Big Data 12
Big Data 13

“Big Data is crucial to the company’s very being. Facebook relies on a massive installation of Hadoop, a highly scalable open-source framework that uses clusters of low-cost servers to solve problems. Facebook even designs its hardware for this purpose.”

- Ken Rudin, Analytics Chief, Facebook

Big Data 14

The most abundant thing today is data. We have data about everything, and it is increasing manyfold every day!
“Big Data” Sources 15
It’s All Happening On-line
Every:
• Click
• Ad impression
• Billing event
• Fast forward, pause, …
• Server request
• Transaction
• Network message
• Fault
• …

User Generated (Web & Mobile)
Internet of Things / M2M
Health/Scientific Computing

Data Science for Engineers

Big Data – Five V’s of Data 16
• Volume (Amount of data):
• How much data is really relevant to the problem solution? Cost of processing?
• So, can you really afford to store and process all that data?

• Velocity (Speed of data generation and movement):

• Much data coming in at high speed
• Need for streaming versus block approach to data analysis
• So, how do we analyze data in-flight and combine it with data at-rest?
• Variety (Diversity of datatypes):
• A small fraction is in structured formats (relational, XML, etc.)
• A fair amount is semi-structured, such as web logs
• The rest of the data is unstructured text, photographs, etc.
• So, no single data model can currently handle the diversity
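The "streaming versus block" point under Velocity can be illustrated with a running mean. The following is a minimal sketch with made-up readings: the block version needs all values at once, while the streaming version updates its answer as each value arrives and keeps only two numbers in memory.

```python
from typing import Iterable, Iterator


def batch_mean(values: list[float]) -> float:
    """Block approach: requires the whole data set to be available."""
    return sum(values) / len(values)


def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Streaming approach: update the mean one reading at a time."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update, constant memory
        yield mean


readings = [4.0, 8.0, 6.0, 2.0]  # hypothetical sensor readings
running = list(streaming_mean(readings))

print(running)               # mean after each reading
print(batch_mean(readings))  # same final value, computed in one pass over stored data
```

Both approaches agree on the final value; the difference is that the streaming version has an answer available while the data is still in flight.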
Big Data – Five V’s of Data (Cont.) 17

• Veracity (Quality of data):

• Accuracy, Precision, Reliability, Integrity
• So, what is it that you don’t know you don’t know about the data?
• Value (worth):
• How much value is created for each unit of data (whatever it is)?
• So, what is the contribution of subsets of the data to the problem
solution?
What can you do with the data? 18

• Reporting
• Monitoring (fine-grained)
• Exploration
• Finding Patterns
• Root Cause Analysis
• Closed-loop Control
• Model construction
• Prediction
Data vs. Information 19
Data
• Raw facts
• Have not yet been processed to reveal their meaning to the end user
• Building blocks of information

Information
• Produced by processing raw data to reveal its meaning
• Requires context
• Bedrock of knowledge
• Should be accurate, relevant, and timely to enable good decision making
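As a small illustration of the distinction, the sketch below processes raw records into a summary that could support a decision. The purchase records and the `summarize` helper are hypothetical.

```python
from collections import defaultdict

# Raw facts: hypothetical purchase records (item, category, price).
raw_data = [
    ("milk", "grocery", 2.50),
    ("bread", "grocery", 1.50),
    ("pen", "stationery", 1.00),
    ("notebook", "stationery", 3.00),
    ("eggs", "grocery", 3.00),
]


def summarize(records):
    """Process raw records into total spend per category."""
    totals = defaultdict(float)
    for _item, category, price in records:
        totals[category] += price
    return dict(totals)


# Information: the same data, processed to reveal its meaning.
information = summarize(raw_data)
top_category = max(information, key=information.get)

print(information)
print("Highest spend:", top_category)
```

The individual records say little on their own; the per-category totals give them context.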
Data vs. Information (cont’d.) 20
Data, Information, and Beyond 21
Data, Information, and Beyond 22

• Information: Data that has been “cleaned” of errors and further processed in a way that makes it easier to measure, visualize and analyze for a specific purpose.

• Knowledge: “How” is the information derived from the collected data relevant to our goals? “How” are the pieces of this information connected to other pieces to add more meaning and value? And, maybe most importantly, “how” can we apply the information to achieve our goal?

• Wisdom: Answers questions such as ‘why do something’ and ‘what is best’. In other words, wisdom is knowledge applied in action.

• If data and information are like a look back to the past, knowledge and wisdom are associated with what we do now and what we want to achieve in the future.


How to get information and knowledge to build wisdom from the raw data?

DATA SCIENCE
Data Science and Big Data 24

• They are not the “same thing”

• Big data = crude oil
• Big data is about extracting “crude oil”, transporting it in
“mega tankers”, siphoning it through “pipelines”, and
storing it in “massive silos”
• Data science is about refining the “crude oil”
“Data Science” an Emerging Field 23

O’Reilly Radar report, 2011
What is Data Science? 26

• “Data science, also known as data-driven science, is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.”

• It’s an area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data
What is Data Science? 27

Data Science is the application of computational and statistical techniques to address or gain insight into some problem in the real world

Data Science = Statistics + data processing + machine learning + scientific inquiry + visualization + business analytics + big data + …
What is Data Science? 28
Data Science: A Visual Definition 29

• Data-driven science
• Interdisciplinary field
• Extract knowledge or insight from data in various forms
Data science workflow 30
Data Science vs Data Analysis vs Data Engineering 31

Data Science focuses on finding useful information by analyzing raw, unstructured data to drive a business toward higher profitability through useful predictions.

Data Analysis focuses on solving current business problems from available data by presenting information in an understandable way.

Data Engineering focuses on managing and organizing data, and on building and maintaining databases and data pipelines.
Data Scientist vs Data Analyst vs Data Engineer 32
Asking Good Questions 33

• Software developers are not encouraged to ask questions, but data scientists are:
• What exciting things might you be able to learn from a given data set?
• What things do you/your people really want to know?
• What data sets might get you there?

e.g., Baseball
• How best to measure an individual player’s skill, value or performance?
• What is the trajectory of players’ performance as they mature and age?
• To what extent does batting performance correlate with the position played?
Data Science Life Cycle 34
• Understanding of Problem: It all starts with understanding the problem at hand, the questions, and the answers we are trying to find from the dataset.
• Data Acquisition: Data acquisition, as the name suggests, is about retrieving the data, with the help of Data Engineers where required. It also consolidates all of the data required to answer the question or to solve the problem at hand.
• Data Wrangling (Preparation): Data wrangling is about using knowledge to preprocess data. It involves looking for missing values and asking business questions such as why they are missing. Furthermore, it uses knowledge to give the dataset a shape that is appropriate for visualizations and supports the coming steps in the life cycle.
• Data Exploration: Data exploration uses visualization and statistical measures to see whether the questions we asked in the beginning are being answered.
• Feature Engineering and Selection: It is a preprocessing step before modeling in both Machine Learning and Deep Learning. We will look into these fields in the coming sections. It has similar steps to Data Wrangling, apart from some algorithms for feature selection and transformation.
• Modeling: Modeling is the process that uncovers the meaning of the data. It is about capturing underlying trends and the data’s behavior to make the model, which can be used for predictive analytics as described in the previous section.
• Deployment: After we build the model, we deploy it in the most efficient and optimized manner so that people in the real world can use it. It can be deployed in mobile applications and web applications.
• Monitoring: After we have deployed the model, we will want to monitor it. Monitoring is about exposing the model to new data and tracking the number of requests that the model receives. It also involves making changes to the analysis and starting over if required.
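Several of these stages can be sketched end to end on a toy problem. The example below is only an illustration under stated assumptions: the (advertising spend, sales) pairs are invented, rows with missing values are simply dropped, and the "model" is a straight line fitted by ordinary least squares using the standard library rather than a machine learning package.

```python
from statistics import mean

# Data acquisition: hypothetical (advertising spend, sales) pairs; None marks a missing value.
raw = [(1.0, 3.0), (2.0, 5.0), (3.0, None), (4.0, 9.0), (None, 4.0), (5.0, 11.0)]

# Data wrangling: drop rows with missing values (one simple policy among many).
clean = [(x, y) for x, y in raw if x is not None and y is not None]
xs = [x for x, _ in clean]
ys = [y for _, y in clean]

# Data exploration: basic summary statistics.
x_bar, y_bar = mean(xs), mean(ys)

# Modeling: ordinary least squares for y = slope * x + intercept.
slope = sum((x - x_bar) * (y - y_bar) for x, y in clean) / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar


def predict(x: float) -> float:
    """Deployment stand-in: the function an application would call."""
    return slope * x + intercept


# Monitoring: track the absolute error on newly observed (hypothetical) data.
new_data = [(6.0, 13.5)]
errors = [abs(predict(x) - y) for x, y in new_data]

print(f"rows kept: {len(clean)} of {len(raw)}")
print(f"model: y = {slope:.2f} * x + {intercept:.2f}")
print(f"monitoring errors: {errors}")
```

If the monitored errors grow over time, that is the signal to revisit the earlier stages and start over, as the Monitoring step describes.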
Data Science Life Cycle 35
Applications of Data Science 37
Nate Silver 38

• American data scientist who analyzes elections and baseball.
- PECOTA: a system for forecasting the performance and career development of Major League Baseball players.
- 2012 U.S. Presidential election
• Correctly predicted the winners of all the states.
Walmart Store’s Sales Forecasting: 39

Commerce and retail use big data and data science to optimize business processes and to make profitable decisions.

• Tasks such as predicting sales, offering product recommendations to customers, and inventory management are handled with data science techniques
What can you do with the data? 40
Recommender systems

• Data Science also plays a key role in Information Retrieval.

• Whenever a user buys a product from a certain site like Amazon or Alibaba,
they start seeing some ads or items that are related to the product. In the
same way, a user may get recommendations while they purchase items.
• The user’s purchasing history is stored at the company’s server, and the
company is utilizing Data Science to predict which items the user is likely to
buy next.
• Facebook and LinkedIn use patterns of friendship relationships to suggest other
people you may know, or should know, with sometimes frightening accuracy
• Netflix’s AI considers customers’ viewing habits and hobbies to provide Netflix
recommendations.
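The idea of predicting the next purchase from stored purchase histories can be sketched with simple co-purchase counts ("people who bought X also bought Y"). This is a toy illustration with invented users and items; it is not how Amazon, Netflix, or any other company named above actually builds its recommendations.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, one set of items per user.
histories = {
    "user_a": {"laptop", "mouse", "laptop bag"},
    "user_b": {"laptop", "mouse"},
    "user_c": {"laptop", "laptop bag", "usb hub"},
    "user_d": {"phone", "phone case"},
}


def co_purchase_counts(baskets):
    """Count how often each ordered pair of items appears in the same history."""
    counts = Counter()
    for items in baskets:
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(user, histories, k=2):
    """Suggest up to k items that other users bought together with this user's items."""
    owned = histories[user]
    others = [items for name, items in histories.items() if name != user]
    counts = co_purchase_counts(others)
    scores = Counter()
    for (a, b), n in counts.items():
        if a in owned and b not in owned:
            scores[b] += n
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


print(recommend("user_b", histories))
```

A user whose items never appear alongside anything else in other histories gets an empty list, which is the usual "cold start" limitation of this kind of approach.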
Predictive medicine 41

• Healthcare professionals use many wearables to track the blood pressure, heart rate, and blood sugar level of patients.
• This data is generated and processed in real-time. Based on
the body’s current health data, the system can trigger an
alarm that indicates something bad is about to happen to
the patient.
• This was impossible in the past. This can save more lives.
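A real system would rely on a trained predictive model, but the real-time alarm idea can be sketched with a simple rule: raise an alarm when the moving average of recent readings crosses a limit. The readings, window size, and limit below are arbitrary illustrative values, not clinical thresholds.

```python
from collections import deque


def monitor(readings, window=3, limit=100.0):
    """Yield (index, average) whenever the moving average of the last
    `window` readings exceeds `limit`."""
    recent = deque(maxlen=window)  # keeps only the most recent readings
    for i, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            avg = sum(recent) / window
            if avg > limit:
                yield i, avg


heart_rate = [72, 75, 90, 110, 120, 118, 80, 76]  # hypothetical beats per minute
alarms = list(monitor(heart_rate))

for index, avg in alarms:
    print(f"alarm at reading {index}: moving average {avg:.1f}")
```

Averaging over a window instead of reacting to single readings is a design choice that trades a short delay for fewer false alarms from one-off spikes.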
Transportation 42

• Transportation companies such as Uber use data science for price optimization and to provide better experiences to their customers.
• Using powerful predictive tools, they predict the price based on parameters such as weather patterns, availability of transport, customers, etc.
Google data product: detect pattern 43

• During the Swine Flu epidemic of 2009, Google was able to track the progress of the epidemic by following searches for flu-related topics.
Traffic Prediction and Earthquake Warning 44

Crowdsourcing + physical modeling + sensing + data assimilation to produce:

From Alex Bayen, UCB
Data Analytics 45

The process of examining datasets to find trends and draw conclusions from the information contained in them
Types of Data Analytics 46

• Descriptive: A set of techniques for reviewing and examining the data set(s) to understand the data and describe what happened or is happening.

• Diagnostic: As a next step, a set of techniques to determine why it happened.
• It is useful for getting at the root of an issue
Types of Data Analytics (Cont.) 47

• Predictive: A set of techniques that analyse current and historical data to determine what might happen in the future.

• Prescriptive: A set of techniques for computationally developing and analyzing alternatives that can become courses of action, either tactical or strategic, and that suggest what should be done next.
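The four types can be lined up on one toy data set. The sketch below uses invented monthly sales figures and deliberately naive techniques (averages and a straight-line trend); the prescriptive step also ignores the cost of a promotion, which a real analysis would have to include.

```python
from statistics import mean

# Hypothetical monthly sales (units) and whether a promotion ran that month.
sales = [100, 104, 130, 108, 112, 140]
promo = [False, False, True, False, False, True]

# Descriptive: what happened?
average_sales = mean(sales)

# Diagnostic: why did it happen? Compare months with and without a promotion.
promo_avg = mean(s for s, p in zip(sales, promo) if p)
regular_avg = mean(s for s, p in zip(sales, promo) if not p)
promo_lift = promo_avg - regular_avg

# Predictive: what might happen? Extend the average growth of regular months by one step.
regular = [s for s, p in zip(sales, promo) if not p]
growth = (regular[-1] - regular[0]) / (len(regular) - 1)
forecast_regular = regular[-1] + growth

# Prescriptive: what should be done? Pick the alternative with the highest expected sales.
options = {
    "no promotion": forecast_regular,
    "run promotion": forecast_regular + promo_lift,
}
best_action = max(options, key=options.get)

print(f"descriptive: average sales {average_sales:.1f}")
print(f"diagnostic: promotion months sell {promo_lift} more units on average")
print(f"predictive: next regular month about {forecast_regular:.0f} units")
print(f"prescriptive: {best_action}")
```

Each step builds on the previous one: the diagnostic comparison feeds the prescriptive choice, and the forecast supplies its baseline.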
Types of Data Analytics (Cont.) 48
Case Study: Coca-Cola 49

• An interview with the company’s director of data strategy confirmed that the secret behind client retention for Coca-Cola is big data analytics.

• In 2017, the company launched a new flavour, Cherry Sprite. It was inspired by figures collected from self-service drinks fountains that let customers mix their own drinks. Coca-Cola discovered the most popular flavour combo and turned it into a beverage.
Topic labeling 50
Sentiment Analysis 51
Intent Detection 52
Healthcare 53

• Google's DeepMind AI system can identify over 50 eye diseases with 94% accuracy using thousands of anonymized eye scans. Additionally, Roche's Apollo platform leverages AI to create comprehensive patient profiles, enhancing personalized treatment plans and facilitating scientific collaboration.
Gaming 54
• The gaming industry benefits from data science through player
behavior analysis, personalized game recommendations, and
cheating prevention. Companies like Electronic Arts and Riot
Games use data to understand player engagement and maintain
fair play environments, enhancing overall gaming experiences.
