
Lecture 1 & 2

The document outlines a course on Data Science, covering topics such as the definition of data, big data, the data science life cycle, and various applications of data science. It emphasizes the importance of understanding data preparation, analysis techniques, and the use of tools like Python and machine learning algorithms. The course aims to equip students with the skills to analyze data and derive insights for real-world applications.

Uploaded by

l211803

CS4048 – Data Science

Prepared By: Maimoona Akram


Overview
Course Introduction
What is Data?
What is Big Data?
What is Data Science?
Data Science Life Cycle & Process
Applications of Data Science
Data Analytics
Types of Data Analytics
Course Introduction

Objective:
To reinforce the concepts developed in theory with
experiments on data preparation and data analysis using different
techniques in Python

Books and Resources:


E-books and course material will be posted on Google Classroom (coming soon)
Tentative Grading Policy

Class
Assignments 10%
Quizzes 10%
Midterm Exam 30%
Project + Final Exam 10% + 40%
General Guidelines

• Visit Google Classroom regularly for updates.
• No email submissions: submit all work through Google Classroom. Work sent by email will be discarded.
• Cheating will not be tolerated.
• There will be no retake of any evaluation.
Learning Outcomes
• Recognise the basics of Data Science; prepare and wrangle data for analysis
• Practice Exploratory Data Analysis (EDA) to investigate data with the help of statistical and graphical representations
• Identify and apply different machine learning algorithms to gain insight from the data
• Use feature selection methods to optimize the performance of the learning model
Tools and Software Packages
• WEKA
• KNIME
• ORANGE
• CBA
• MATLAB
• PYTHON
• R
• SPSS
• SAS
• Maptitude for the Web
• etc.
Language: Python
Tools: NumPy, Pandas, matplotlib, scikit-learn, etc.
What is Data?
Data is information in raw form: numbers, letters, and symbols that represent ideas, facts, categories, etc.

Good decisions require good information derived from raw facts
How much data do we have?
• Lots of data is being collected and warehoused
• Web data, e-commerce
• Financial transactions, bank/credit transactions
• Online trading and purchasing
• Social Network
• Google processes 20 PB a day (2008)
• Facebook has 60 TB of daily logs
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• 1000 genomes project: 200 TB
Big Data

“Between the dawn of civilization and 2003, we only created five exabytes of information; now we’re creating that amount every two days.”

- Eric Schmidt, Google



“Big Data is crucial to the company’s very being. Facebook relies on a massive installation of Hadoop, a highly scalable open-source framework that uses clusters of low-cost servers to solve problems. Facebook even designs its hardware for this purpose.”

- Ken Rudin, Analytics Chief, Facebook


The most abundant thing today is data. We have data about everything, and it is increasing manifold every day!
“Big Data” Sources
It’s all happening online. Every:
• Click
• Ad impression
• Billing event
• Fast forward, pause, ...
• Server request
• Transaction
• Network message
• Fault
Source categories: user generated (web & mobile), Internet of Things / M2M, health/scientific computing

Data Science for Engineers


Big Data – Five V’s of Data
• Volume (amount of data):
• How much data is really relevant to the problem solution? What is the cost of processing?
• So, can you really afford to store and process all that data?
• Velocity (speed of data generation and movement):
• Much data arrives at high speed
• Need for a streaming versus block approach to data analysis
• So, how do you analyze data in flight and combine it with data at rest?
• Variety (diversity of data types):
• A small fraction is in structured formats (relational, XML, etc.)
• A fair amount is semi-structured, such as web logs
• The rest of the data is unstructured: text, photographs, etc.
• So, no single data model can currently handle the diversity
Big Data – Five V’s of Data (Cont.)
• Veracity (quality of data):
• Accuracy, precision, reliability, integrity
• So, what is it that you don’t know you don’t know about the data?
• Value (worth):
• How much value is created for each unit of data (whatever it is)?
• So, what is the contribution of subsets of the data to the problem
solution?
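
The "streaming versus block" point under Velocity can be illustrated with a small sketch. It is not from the slides: the function names and the sensor readings are made up for illustration. The block version needs every value stored before it can compute a mean, while the streaming version updates its answer one reading at a time and keeps only a count and the current mean.

```python
from typing import Iterable, Iterator, Sequence


def batch_mean(values: Sequence[float]) -> float:
    """Block approach: every value must be stored before the mean is computed."""
    return sum(values) / len(values)


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Streaming approach: update the mean per reading, keeping only two numbers."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update, no history kept
        yield mean


# Hypothetical sensor readings, invented for this example.
readings = [12.0, 15.0, 11.0, 14.0]
streamed = list(running_mean(readings))

print(streamed[-1], batch_mean(readings))
```

Both approaches end at the same mean; the difference is that the streaming one has an answer available after every reading, which is what "analyzing data in flight" requires.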
What can you do with the data?

• Reporting
• Monitoring (fine-grained)
• Exploration
• Finding Patterns
• Root Cause Analysis
• Closed-loop Control
• Model construction
• Prediction
Data vs. Information
Data
• Raw facts
• Have not yet been processed to reveal their meaning to the end user
• Building blocks of information
Information
• Produced by processing raw data to reveal its meaning
• Requires context
• Bedrock of knowledge
• Should be accurate, relevant, and timely to enable good decision making
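
A minimal sketch of the distinction, using invented sales records (the product names, amounts, and the `summarize` helper are all hypothetical): the raw list is data, and the per-product totals produced by processing it are information that can support a decision.

```python
from collections import defaultdict

# Raw facts: (product, amount) pairs. These values are invented for illustration.
raw_sales = [
    ("tea", 3.0),
    ("coffee", 4.5),
    ("tea", 3.0),
    ("juice", 2.5),
    ("coffee", 4.5),
]


def summarize(sales):
    """Process raw records into revenue per product."""
    totals = defaultdict(float)
    for product, amount in sales:
        totals[product] += amount
    return dict(totals)


revenue_by_product = summarize(raw_sales)
best_seller = max(revenue_by_product, key=revenue_by_product.get)

print(revenue_by_product)
print("Highest revenue:", best_seller)
```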
Data, Information, and Beyond

• Information: Data that has been “cleaned” of errors and further processed in a way that makes it
easier to measure, visualize and analyze for a specific purpose

• Knowledge: “How” is the information, derived from the collected data, relevant to our goals? “How” are the pieces of this information connected to other pieces to add more meaning and value? And, maybe most importantly, “how” can we apply the information to achieve our goal?

• Wisdom: To reach wisdom, we must answer questions such as ‘why do something’ and ‘what is best’. In other words, wisdom is knowledge applied in action.

• If data and information are like a look back to the past, knowledge and wisdom are associated
with what we do now and what we want to achieve in the future.


How do we get information and knowledge to build wisdom from raw data?

DATA SCIENCE
Data Science and Big Data

• They are not the “same thing”


• Big data = crude oil
• Big data is about extracting “crude oil”, transporting it in
“mega tankers”, siphoning it through “pipelines”, and
storing it in “massive silos”
• Data science is about refining the “crude oil”
“Data Science”, an Emerging Field

O’Reilly Radar report, 2011
What is Data Science?

• “Data science, also known as data-driven science, is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.”

• It is an area that manages, manipulates, extracts, and interprets knowledge from tremendous amounts of data

Data Science is the application of computational and statistical techniques to address or gain insight into some problem in the real world

Data Science = Statistics + data processing + machine learning + scientific inquiry + visualization + business analytics + big data + …
Data Science: A Visual Definition

• Data-driven science
• Interdisciplinary field
• Extract knowledge or
insight from data in various
forms
Data Science Workflow
Data Science vs Data Analysis vs Data Engineering

Data Science focuses on finding useful information by analyzing raw, unstructured data to drive a business toward higher profitability through useful predictions

Data Analysis focuses on solving current business problems from available data by presenting information in an understandable way

Data Engineering focuses on managing and organizing data, and on building and maintaining databases and data pipelines
Data Scientist vs Data Analyst vs Data Engineer
Asking Good Questions

• Software developers are not encouraged to ask questions, but data scientists are:
• What exciting things might you be able to learn from a given data set?
• What things do you/your people really want to know?
• What data sets might get you there?

e.g., Baseball
• How to best measure an individual player’s skill, value or performance?
• What is the trajectory of a player’s performance as they mature and age?
• To what extent does batting performance correlate with the position played?
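
As a rough sketch of how the last question could be turned into a first computation, the snippet below groups batting averages by fielding position and compares the group means. The records are invented (not real player statistics) and `mean_by_group` is a hypothetical helper. A gap between group means is only a starting point; on its own it says nothing about why the groups differ.

```python
from collections import defaultdict
from statistics import mean

# (position, batting average) records. Invented values, not real statistics.
players = [
    ("catcher", 0.245),
    ("catcher", 0.255),
    ("shortstop", 0.270),
    ("shortstop", 0.280),
    ("first base", 0.290),
    ("first base", 0.310),
]


def mean_by_group(records):
    """Average the numeric value within each category."""
    groups = defaultdict(list)
    for position, batting_average in records:
        groups[position].append(batting_average)
    return {position: mean(values) for position, values in groups.items()}


average_by_position = mean_by_group(players)
for position, value in average_by_position.items():
    print(f"{position}: {value:.3f}")
```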
Data Science Life Cycle
• Understanding of Problem: It all starts with understanding the problem at hand, the questions, and the answers we are trying to find from the dataset.
• Data Acquisition: Data Acquisition, as the name suggests, is about retrieving the data, with the help of Data Engineers where required. It also consolidates all of the data required to answer the question or to solve the problem at hand.
• Data Wrangling (Preparation): Data wrangling is about using knowledge to preprocess data. It involves looking for missing values and asking business questions such as why they are missing. Furthermore, it uses knowledge to give the dataset a shape appropriate for visualizations and to support the coming steps in the life cycle.
• Data Exploration: Data Exploration uses visualization and statistical measures to see whether the questions we asked at the beginning are being answered.
• Feature Engineering and Selection: It is a preprocessing step before modeling in both Machine Learning and Deep Learning. We will look into these fields in the coming sections. It has similar steps to Data Wrangling, apart from some algorithms for feature selection and transformation.
• Modeling: Modeling is the process that uncovers the meaning of the data. It is about capturing underlying trends and the data’s behavior to build a model, which can be used for predictive analytics as described in the previous section.
• Deployment: After we build the model, we deploy it in the most efficient and optimized manner so that people can use it in the real world. It can be deployed in mobile applications and web applications.
• Monitoring: After we have deployed the model, we will want to monitor it. Monitoring is about familiarizing the model with new data and tracking the number of requests that the model receives. It also involves making changes to the analysis and starting over if required.
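
The middle stages of the life cycle can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the (advertising spend, sales) records are invented, missing values are simply dropped (in practice one would first ask why they are missing), and the model is an ordinary least-squares line rather than any specific algorithm from the course.

```python
from statistics import mean

# Data acquisition: hypothetical (advertising spend, sales) records.
# None marks a missing value.
raw = [(1.0, 3.0), (2.0, 5.0), (None, 4.0), (3.0, 7.0), (4.0, None), (5.0, 11.0)]

# Data wrangling: keep only complete rows (a simplifying assumption).
clean = [(x, y) for x, y in raw if x is not None and y is not None]
xs = [x for x, _ in clean]
ys = [y for _, y in clean]

# Data exploration: basic summary statistics.
x_bar, y_bar = mean(xs), mean(ys)

# Modeling: fit y = intercept + slope * x by least squares.
slope = sum((x - x_bar) * (y - y_bar) for x, y in clean) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar


def predict(x):
    """Use the fitted line to estimate sales for a new spend value."""
    return intercept + slope * x


print(f"rows kept: {len(clean)}, slope: {slope:.2f}, intercept: {intercept:.2f}")
print(f"predicted sales at spend 6.0: {predict(6.0):.2f}")
```

With pandas and scikit-learn, the tools named in the course, the same steps would be shorter, but the sequence of stages stays the same.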
Applications of Data Science
Nate Silver

• American data scientist who analyzes elections and baseball.
- PECOTA: a system for forecasting the performance and career development of Major League Baseball players.
- 2012 U.S. Presidential election
• Correctly predicted the winners of all the states.
Walmart Stores’ Sales Forecasting

Commerce and retail use big data and data science to optimize business processes and to make profitable decisions.

• Tasks such as predicting sales, offering product recommendations to customers, and managing inventory are handled with data science techniques
What can you do with the data? Recommender systems

• Data Science also plays a key role in Information Retrieval.
• Whenever a user buys a product from a site like Amazon or Alibaba, they start seeing ads or items that are related to the product. In the same way, a user may get recommendations while purchasing items.
• The user’s purchasing history is stored on the company’s servers, and the company uses Data Science to predict which items the user is likely to buy next.
• Facebook and LinkedIn use patterns of friendship relationships to suggest other people you may know, or should know, with sometimes frightening accuracy
• Netflix’s AI considers customers’ viewing habits and hobbies to provide recommendations.
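
One very simple way to get from purchase histories to "items related to the product" is to count how often two items are bought by the same user. The sketch below does only that. The users, items, and function names are invented, and it is not a description of how Amazon, Alibaba, or Netflix actually build their systems.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, invented for illustration.
purchase_history = {
    "user_a": {"laptop", "mouse", "keyboard"},
    "user_b": {"laptop", "mouse"},
    "user_c": {"laptop", "mouse", "monitor"},
    "user_d": {"phone", "charger"},
}


def build_cooccurrence(history):
    """Count, for every ordered item pair, how many users bought both items."""
    counts = Counter()
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(item, history, top_n=2):
    """Return the items most often bought together with `item`."""
    counts = build_cooccurrence(history)
    scores = {b: n for (a, b), n in counts.items() if a == item}
    # Highest count first; ties are broken alphabetically for a stable result.
    return sorted(scores, key=lambda b: (-scores[b], b))[:top_n]


print(recommend("laptop", purchase_history))
print(recommend("phone", purchase_history))
```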
Predictive medicine

• Healthcare professionals use many wearables to track the blood pressure, heart rate, and blood sugar level of patients.
• This data is generated and processed in real time. Based on the body’s current health data, the system can trigger an alarm that indicates something bad is about to happen to the patient.
• This was impossible in the past, and it can save more lives.
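
The alarm idea can be sketched as a check that runs on each incoming reading. This is a deliberately simple stand-in: a fixed-range rule rather than a predictive model, with threshold values and field names invented for illustration (they are not medical guidance).

```python
def out_of_range(reading, limits):
    """Return the names of vitals that fall outside their (low, high) range."""
    return [
        name
        for name, value in reading.items()
        if not (limits[name][0] <= value <= limits[name][1])
    ]


# Illustrative thresholds only, not medical guidance.
LIMITS = {
    "heart_rate": (50, 120),
    "systolic_bp": (90, 140),
    "blood_sugar": (70, 180),
}

# Hypothetical readings arriving one at a time from a wearable.
stream = [
    {"heart_rate": 72, "systolic_bp": 118, "blood_sugar": 95},
    {"heart_rate": 131, "systolic_bp": 150, "blood_sugar": 99},
]

alerts = [out_of_range(reading, LIMITS) for reading in stream]
for index, flagged in enumerate(alerts):
    if flagged:
        print(f"reading {index}: ALARM for {', '.join(flagged)}")
```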
Transportation

• Transportation companies such as Uber use data science for price optimization and to provide better experiences to their customers.
• Using powerful predictive tools, they predict the price based on parameters such as weather patterns, availability of transport, customers, etc.
Google data product: detecting patterns

• During the Swine Flu epidemic of 2009, Google was able to track the progress of the epidemic by following searches for flu-related topics.
Traffic Prediction and Earthquake Warning

Crowdsourcing + physical modeling + sensing + data assimilation to produce traffic predictions and earthquake warnings

From Alex Bayen, UCB
Data Analytics

The process of examining datasets to find trends and draw conclusions from the information they contain
Types of Data Analytics

• Descriptive: A set of techniques for reviewing and examining the data set(s) to understand and describe what happened or is happening.

• Diagnostic: As a next step, a set of techniques to determine why it happened.
• It is useful for getting at the root of an issue
Types of Data Analytics (Cont.)

• Predictive: A set of techniques that analyse current and historical data to determine what might happen in the future.

• Prescriptive: A set of techniques for computationally developing and analyzing alternatives that can become courses of action, either tactical or strategic, that suggest what should be done next.
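
The four types can be walked through on one tiny, invented sales series. The numbers, the naive forecast (last value plus the average monthly change), and the reorder rule are all assumptions chosen to keep the example short; they are a sketch of the idea, not recommended methods.

```python
from statistics import mean

monthly_sales = [100, 110, 120, 130]  # hypothetical units sold per month
stock_on_hand = 125                   # hypothetical current inventory

# Descriptive: what happened?
average_sales = mean(monthly_sales)

# Diagnostic (a first step): where did the change come from?
# Month-over-month differences show whether growth was steady or a one-off jump.
changes = [later - earlier for earlier, later in zip(monthly_sales, monthly_sales[1:])]

# Predictive: what might happen next? Naive forecast from the average change.
forecast = monthly_sales[-1] + mean(changes)

# Prescriptive: what should be done? Order enough stock to cover the forecast.
reorder_quantity = max(0, forecast - stock_on_hand)

print(f"average: {average_sales}, changes: {changes}")
print(f"forecast: {forecast}, reorder: {reorder_quantity}")
```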
Case Study: Coca-Cola

• An interview with the company’s director of data strategy confirmed that the secret behind client retention for Coca-Cola is big data analytics.

• In 2017, the company launched a new flavour, Cherry Sprite. It was inspired by figures collected from self-service drinks fountains that let customers mix their own drinks. Coca-Cola discovered the most popular flavour combo and turned it into a beverage.
Topic labeling
Sentiment Analysis
Intent Detection
Healthcare

• Google's DeepMind AI system can identify over 50 eye diseases with 94% accuracy using thousands of anonymized eye scans. Additionally, Roche's Apollo platform leverages AI to create comprehensive patient profiles, enhancing personalized treatment plans and facilitating scientific collaboration.
Gaming
• The gaming industry benefits from data science through player
behavior analysis, personalized game recommendations, and
cheating prevention. Companies like Electronic Arts and Riot
Games use data to understand player engagement and maintain
fair play environments, enhancing overall gaming experiences.
