
AACS1573
Introduction to Data Science
Chapter 2: Data Science Process
Last week…

1.5 Types of analytics
1.6 Analytics process model
1.7 Related data science software/tools
1.8 Data science applications

Recap: the analytic process model (what was Step 1?), some market players, and data science applications (shhh…assignment idea).
In this lesson, we will learn about…

2.1 Data Preparation
2.2 Data Exploration
2.3 Data Representation
2.4 Data Discovery
2.5 Learning from Data
2.6 Creating a Data Product
2.7 Insight, Deliverance and Visualization
You have learned about…

2.1 Data Preparation
2.2 Data Exploration
2.3 Data Representation
2.4 Data Discovery
2.5 Learning from Data
2.6 Creating a Data Product
2.7 Insight, Deliverance and Visualization
Coming next…
Chapter 3: Visualization and Descriptive Analytics
2.1 Data Preparation
next…
Data Preparation

● Reading the data
● Cleansing the data

Data Preparation
● This is the first step in turning the available data into a dataset
● i.e., a group of data points, usually normalized, that can be used with a data analysis model or a machine learning system (often without any additional preprocessing)
● Two parts: reading the data & cleansing the data

Reading the data

● Reading the data is relatively straightforward.
● However, when you are dealing with big data, you often need to employ a Hadoop Distributed File System (HDFS) to store the data for further analysis, and the data needs to be read using a MapReduce system.
Data Preparation
● However, you may need to supply the data in JSON or some other similar format.
● JSON value types: strings, numbers, objects, arrays, Booleans.
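
Reading big data in practice depends heavily on your stack. Below is a minimal sketch using PySpark (an assumption; the slides mention HDFS and MapReduce only generically), reading newline-delimited JSON from a hypothetical HDFS path:

```python
# Minimal sketch: reading JSON from HDFS with PySpark (assumed stack).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-data").getOrCreate()

# The HDFS path is hypothetical; spark.read.json infers the schema
# (strings, numbers, objects, arrays, Booleans) from the files.
df = spark.read.json("hdfs:///data/posts/*.json")

df.printSchema()   # inspect the inferred types
print(df.count())  # number of records read
```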
Data Preparation
● Also, if your data is in a completely custom form, you may need to write your own program(s) for accessing and restructuring it into a format that can be understood by the mappers and the reducers of your cluster.
● When reading a very large amount of data, it is wise to first do a sample run on a relatively small subset of your data to ensure that the resulting dataset will be usable and useful for the analysis you plan to perform.
● Some preliminary visualization of the resulting sample dataset would also be quite useful, as this will ensure that the dataset is structured correctly for the different analyses you will do in the later stages of the process.
Sampling
● The aim of sampling is to take a subset of past customer data and use that to build an analytical model (a minimal sketch follows below).

Question: Given the high performance of modern computers, why do we still need sampling when we could directly analyze the full dataset?
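
One common answer: even on fast hardware, a small sample keeps the build-test loop short. A minimal pandas sketch (an assumption; the file and column names are hypothetical):

```python
# Minimal sketch: sampling a dataset with pandas (assumed library).
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical file

# Work with a 1% random sample first; random_state makes the run repeatable.
sample = customers.sample(frac=0.01, random_state=42)

# Sanity-check that the sample resembles the full data before modelling.
print(customers["age"].mean(), sample["age"].mean())  # hypothetical column
```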
Data Preparation

Cleansing the data

● A very time-consuming part of data preparation that requires a level of understanding of the data.
● This step involves:
○ filling in missing values,
○ removing corrupt or problematic data,
○ normalizing the data in a way that makes sense for the analysis that ensues.
● To comprehend this point better, let us examine the rationale behind normalization and how distributions (mathematical models of the frequency of the values of a variable) come into play.
Data Preparation

● Although the most commonly used distribution is the normal distribution (N), there are several others that often come into play, such as:
○ uniform distribution (U)
○ Student's t-distribution (t)
○ Poisson distribution (P)
○ binomial distribution (B)

● Note: normalization applies only to numeric data.


Feature Scaling: Normalization

Given three points A = (60, 3), B = (40, 3) and C = (40, 4):

Distance AB before scaling = √((40 − 60)² + (3 − 3)²) = 20
Distance BC before scaling = √((40 − 40)² + (4 − 3)²) = 1

Distance AB after scaling = √((1.1 − 1.5)² + (1.18 − 1.18)²) = 0.4
Distance BC after scaling = √((1.1 − 1.1)² + (0.41 − 1.18)²) = 0.77

Before scaling, the wide-range feature makes AB look 20 times farther apart than BC; after scaling, the two distances are comparable (and BC is, in fact, the larger one).
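
The slide's scaled coordinates come from a scaling step it does not show; a self-contained way to reproduce the effect is to standardize the three points themselves with z-scores and compare distances. A minimal sketch, assuming NumPy and scikit-learn:

```python
# Minimal sketch: how feature scaling changes distances (assumed libraries).
import numpy as np
from sklearn.preprocessing import StandardScaler

points = np.array([[60.0, 3.0],   # A
                   [40.0, 3.0],   # B
                   [40.0, 4.0]])  # C

def dist(p, q):
    return np.linalg.norm(p - q)

# Before scaling, the wide-range feature dominates: AB = 20.0, BC = 1.0.
print(dist(points[0], points[1]), dist(points[1], points[2]))

# After z-score scaling both distances are ~2.12: neither feature dominates.
scaled = StandardScaler().fit_transform(points)
print(dist(scaled[0], scaled[1]), dist(scaled[1], scaled[2]))
```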


Data Preparation
● Normalizing your data will sometimes change the shape of its distribution, so it makes sense to first try out a few normalizing approaches before deciding on one.
● The most popular approaches are (see the sketch after this list):
○ Subtracting the mean and dividing by the standard deviation, (x − μ) / σ. This is particularly useful for data that follows a normal distribution; it usually yields values between −3 and 3, approximately.
○ Subtracting the mean and dividing by the range, (x − μ) / (max − min). This approach is a bit more generic; it usually yields values between −0.5 and 0.5, approximately.
○ Subtracting the minimum and dividing by the range, (x − min) / (max − min). This approach is very generic and always yields values between 0 and 1, inclusive.
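
A minimal NumPy sketch of the three formulas (the sample values are made up for illustration):

```python
# Minimal sketch: the three normalization approaches with plain NumPy.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy data

z_score   = (x - x.mean()) / x.std()              # (x - mu) / sigma
mean_norm = (x - x.mean()) / (x.max() - x.min())  # (x - mu) / (max - min)
min_max   = (x - x.min()) / (x.max() - x.min())   # (x - min) / (max - min)

print(z_score.round(2))    # roughly within [-3, 3]
print(mean_norm.round(2))  # roughly within [-0.5, 0.5]
print(min_max.round(2))    # always within [0, 1]
```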
Data Cleansing – Missing Values
Data Cleansing - Outliers
Data Preparation
● Normally, when dealing with big data, outliers shouldn't be an issue…
● BUT, depending on their values, extremely large or small outliers may affect the basic statistics of the dataset, especially if there are many of them (a minimal sketch of one way to flag them follows below).
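
A minimal sketch of one common convention for flagging such outliers, the 1.5 × IQR rule (the slides do not prescribe a method; the data below is made up):

```python
# Minimal sketch: flagging outliers with the 1.5 * IQR rule of thumb.
import numpy as np

values = np.array([12.0, 14.0, 13.0, 15.0, 14.0, 13.0, 980.0])  # toy data

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(values[(values < low) | (values > high)])  # -> [980.]
```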
Find the problems!
Strategies for a missing value: the mean value · the midpoint in the scale · a random number · remove the column (sketched below).
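
A minimal pandas sketch of those four options (column names and values are hypothetical):

```python
# Minimal sketch: four ways to handle missing values with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({"rating": [4, np.nan, 5, 3, np.nan],  # 1-5 scale
                   "notes":  [None] * 5})                # unusable column

filled_mean = df["rating"].fillna(df["rating"].mean())   # mean value
filled_mid  = df["rating"].fillna(3)                     # midpoint of the scale

rng = np.random.default_rng(0)                           # random number in range
random_vals = pd.Series(rng.integers(1, 6, len(df)), index=df.index)
filled_rand = df["rating"].fillna(random_vals)

dropped = df.drop(columns=["notes"])                     # remove the column
```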
Data Preparation
● When dealing with text data, which is often the case if you need to analyze logs or social media posts, a different type of cleansing is required.
● This involves one or more of the following (see the sketch after this list):
○ removing certain characters (e.g., special characters such as @, *, and punctuation marks)
○ making all words either uppercase or lowercase
○ removing certain words that convey little information (e.g., "a", "the", etc.)
○ removing extra or unnecessary spaces and line breaks
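
A minimal sketch of those steps using only Python's standard library (the stop-word list is a tiny illustrative sample; real lists are much longer):

```python
# Minimal sketch: basic text cleansing with the standard library.
import re

STOP_WORDS = {"a", "an", "the", "of", "and"}  # illustrative sample only

def clean(text: str) -> str:
    text = text.lower()                   # one consistent case
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation / special chars
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)                # also collapses extra whitespace

print(clean("The  cat @home is   on the mat!!!"))  # -> "cat home is on mat"
```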
Data Preparation
● All these data preparation steps (and other methods that may be relevant to your industry) will help you turn the data into a dataset.
● Make sure you keep a record of what you have done, though, in case you need to redo these steps or describe them in a report.
Data Preparation
(Example screenshot: a passage of text before and after cleansing; annotations point out "remove stop words" and "change to lower case".)
2.2 Data Exploration
next…
Data Exploration
● First, some exploration of the data is performed to figure out the potential information that could be hiding within it.
● There is a common misconception that the more data one has, the better the results of the analysis will be.
● It is very easy to fall victim to the illusion that a large dataset is all you need, but more often than not such a dataset will contain noise and several irrelevant attributes.
● All of these wrinkles will need to be ironed out in the stages that follow, starting with data exploration. (Remember: more data = more noise!)
2.3 Data Representation
next…
Data Representation

● Comes right after data exploration.
● According to the McGraw-Hill Dictionary of Scientific & Technical Terms, it is "the manner in which data is expressed symbolically by binary digits in a computer"; in other words, how data is stored in the computer.
● This basically involves assigning specific data structures to the variables involved and serves a dual purpose:
○ completing the transformation of the original (raw) data into a dataset
○ optimizing the memory and storage usage for the stages to follow.
Which one is better?
● 1, 2, 3 vs. 1.00000, 2.00000, 3.00000
● "TRUE" | "FALSE" vs. TRUE | FALSE
● 00101, 00110, 00111 vs. 101, 110, 111
● "I make it finally!" vs. "I MaKe iT FINALly!!!!"
Data Representation

● All this may seem very abstract to someone who has never dealt with data before, but it becomes very clear once you start working with R or any other statistical analysis package.
● Speaking of R, the data structure of a dataset in that programming platform is referred to as a data frame, and it is the most complete structure you can find, as it includes useful information about the data (e.g., names, modality, etc.).
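
The closest Python counterpart is the pandas DataFrame (an assumption; the slides only name R's data frame). A minimal sketch showing how the chosen representation affects types and memory:

```python
# Minimal sketch: representation choices in a pandas DataFrame.
import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3],
                   "flag":  ["TRUE", "FALSE", "TRUE"]})

df["count"] = df["count"].astype("int8")  # small ints don't need 64 bits
df["flag"]  = df["flag"] == "TRUE"        # real Booleans, not strings

print(df.dtypes)                   # int8, bool
print(df.memory_usage(deep=True))  # bytes per column after the change
```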
2.4 Data Discovery
next…
Data Discovery: the CORE of the data science process
● finding patterns in a dataset through hypothesis formulation and testing
● making use of several statistical methods to prove the significance of the relationships that the data scientist observes
● filtering out less robust relationships based on statistics
● throwing away the less meaningful relationships based on our judgment
Data Discovery

● Unfortunately, there is no foolproof methodology for data discovery, although there are several tools that can be used to make the whole process more manageable.
● How effective you are in this stage of the data science process will depend on your experience, your intuition and how much time you spend on it.
● Good knowledge of the various data analysis tools (especially machine learning techniques) can prove very useful here.
● In addition, experience with scientific research in data analysis will also prove to be priceless in this stage.
2.5 Learning from Data
next…
Learning from Data

● Learning from data is a crucial stage in the data science process and involves a lot of intelligent (and often creative) analysis of a dataset using statistical methods and machine learning systems.

Machine Learning:
● Supervised: helps a computer learn how to distinguish and predict new data points based on a training set.
● Unsupervised: enables the computer to learn on its own what the data structure can reveal about the data itself.

(A minimal sketch contrasting the two follows below.)
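
A minimal scikit-learn sketch of the two modes (an assumption; the slides do not name a library):

```python
# Minimal sketch: supervised vs. unsupervised learning with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised: learn from labelled examples, then predict new points.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict(X[:3]))  # predicted labels for three points

# Unsupervised: no labels; the algorithm exposes structure on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:3])      # cluster assignments it discovered
```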
Learning from Data

It may seem that using unsupervised and supervised learning guarantees a more or less automated way of learning from data.

However, without feedback from the user/programmer, this process is unlikely to yield any good results for the majority of cases. (This feedback can take the form of validation or corrections that provide more meaningful results.)

For example, artificial neural networks (ANNs), a very popular artificial intelligence tool that emulates the way the human brain works, are a great tool for supervised learning.
2.6 Creating a Data Product
next…
Creating a Data Product

● All of the aforementioned parts of the data science process are precursors to developing something concrete that can be used as a product of sorts.
● This part of the process is referred to as creating a data product, which influential data scientist Hilary Mason defined as "a product that is based on the combination of data and algorithms."
● So, a data product is not some kind of luxury that marketing people try to force us to buy.
● It is something the user cares about.
Creating a Data Product

● To create a data product, you need to understand the end users and become familiar with their expectations.
● You also need to exercise good judgment on the algorithms you will use and (particularly) on the form that the results will take.
● Graphs, particularly interactive ones, are a very useful form in which to present information if you want to promote it as a data product.
Creating a Data Product
(Example: a "BR Post Engagement" dashboard.)

So a data product is similar to having a data expert in your pocket who can afford to give you useful information at very low rates due to the economies of scale employed.
2.7 Insight, Deliverance & Visualization
next…
Insight, Deliverance and Visualization

Data science involves research into the data, the goal of which is to determine and understand more of what's happening below the surface: how the data products perform in terms of usefulness to the end users, maintainability, etc.

This often leads to new iterations of data discovery, data learning, etc., making data science an ongoing, evolving activity, oftentimes employing the agile framework frequently used in software development today.
Insight, Deliverance and Visualization

In this final stage of the data science process, the data scientist delivers the data product he has created and observes how it is received.

The user's feedback is crucial, as it will provide the information he needs to refine the analysis, upgrade it, and even redo it from scratch if necessary.

The data scientist may get ideas on how he can generate similar data products (or completely new ones) based on the users' newest requirements.
Insight, Deliverance and Visualization

● Visualization involves the graphical representation of data so that interesting and meaningful information can be obtained by the viewer.
● It is a way of summarizing the essence of the analyzed data graphically in a way that is intuitive, intelligent and oftentimes interactive (a minimal sketch follows below).
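
A minimal matplotlib sketch of the idea (an assumption; any plotting package would do, and the data is synthetic):

```python
# Minimal sketch: summarizing a variable graphically with matplotlib.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=50, scale=10, size=1000)  # synthetic data

plt.hist(data, bins=30, edgecolor="black")
plt.title("Distribution of a variable")
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()
```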
Insight, Deliverance and Visualization

● Being aware of what we don't know, and therefore able to handle the uncertainty of the data much better, means that you are more aware of the limitations of your models as well as of the value of the data.
● These graphs can bring about insight (which is the most valuable part of the data science process); this translates into deeper understanding and usually into some new hypotheses about the data.
Insight, Deliverance and Visualization

● It brings about the improvements you see in data products all over the world, the clever upgrades of certain data applications and, most importantly, the various innovations in the big data world.
● So this final stage of the data science process is NOT THE END, but rather the last part of a cycle that starts again and again, spiraling to greater heights of understanding, usefulness and evolution.
