
AACS1573
Introduction to Data Science
Chapter 2: Data Science Process
Last week…

1.5 Types of analytics
1.6 Analytics process model
1.7 Related data science software/tools
1.8 Data science applications

Recap: the analytic process model (what was Step 1?), some market players, and data science applications (shhh…assignment idea).
In this lesson, we will learn about…

2.1 Data Preparation
2.2 Data Exploration
2.3 Data Representation
2.4 Data Discovery
2.5 Learning from Data
2.6 Creating a Data Product
2.7 Insight, Deliverance and Visualization
You have learned about…

2.1 Data Preparation
2.2 Data Exploration
2.3 Data Representation
2.4 Data Discovery
2.5 Learning from Data
2.6 Creating a Data Product
2.7 Insight, Deliverance and Visualization
Coming next…
Chapter 3: Visualization and Descriptive Analytics
2.1 Data Preparation
next…
Data Preparation

● Reading the data
● Cleansing the data

Data Preparation
● This is the first step in turning the available data into a dataset
● i.e., a group of data points, usually normalized, that can be used with a data analysis model or a machine learning system (often without any additional preprocessing)
● Two parts: reading the data & cleansing the data

Reading the data

● Reading the data is relatively straightforward.
● However, when you are dealing with big data, you often need to employ a Hadoop Distributed File System (HDFS) to store the data for further analysis, and the data needs to be read using a MapReduce system.
Data Preparation
● However, you may need to supply the data in JSON or some other similar format.
● JSON value types: strings, numbers, objects, arrays, Booleans.
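
Reading big data in practice depends heavily on your stack. Below is a minimal sketch using PySpark (an assumption; the slides mention HDFS and MapReduce only generically), reading newline-delimited JSON from a hypothetical HDFS path:

```python
# Minimal sketch: reading JSON from HDFS with PySpark (assumed stack).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-data").getOrCreate()

# The HDFS path is hypothetical; spark.read.json infers the schema
# (strings, numbers, objects, arrays, Booleans) from the files.
df = spark.read.json("hdfs:///data/posts/*.json")

df.printSchema()   # inspect the inferred types
print(df.count())  # number of records read
```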
Data Preparation
● Also, if your data is in a completely custom form, you may need to write your own program(s) for accessing and restructuring it into a format that can be understood by the mappers and the reducers of your cluster.
● When reading a very large amount of data, it is wise to first do a sample run on a relatively small subset of your data to ensure that the resulting dataset will be usable and useful for the analysis you plan to perform.
● Some preliminary visualization of the resulting sample dataset would also be quite useful, as this will ensure that the dataset is structured correctly for the different analyses you will do in the later stages of the process.
Sampling
● The aim of sampling is to take a subset of past customer data and use that to build an analytical model (a minimal sketch follows below).

Question: Given the high performance of modern computers, why do we still need sampling when we could directly analyze the full dataset?
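
One common answer: even on fast hardware, a small sample keeps the build-test loop short. A minimal pandas sketch (an assumption; the file and column names are hypothetical):

```python
# Minimal sketch: sampling a dataset with pandas (assumed library).
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical file

# Work with a 1% random sample first; random_state makes the run repeatable.
sample = customers.sample(frac=0.01, random_state=42)

# Sanity-check that the sample resembles the full data before modelling.
print(customers["age"].mean(), sample["age"].mean())  # hypothetical column
```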
Data Preparation

Cleansing the data

● A very time-consuming part of data preparation that requires a level of understanding of the data.
● This step involves:
○ filling in missing values,
○ removing corrupt or problematic data,
○ normalizing the data in a way that makes sense for the analysis that ensues.
● To comprehend this point better, let us examine the rationale behind normalization and how distributions (mathematical models of the frequency of the values of a variable) come into play.
Data Preparation

● Although the most commonly used distribution is the normal distribution (N), there are several others that often come into play, such as:
○ uniform distribution (U)
○ Student's t-distribution (t)
○ Poisson distribution (P)
○ binomial distribution (B)

● Note: normalization applies only to numeric data.


Feature Scaling: Normalization

Given three points A = (60, 3), B = (40, 3) and C = (40, 4):

Distance AB before scaling = √((40 − 60)² + (3 − 3)²) = 20
Distance BC before scaling = √((40 − 40)² + (4 − 3)²) = 1

Distance AB after scaling = √((1.1 − 1.5)² + (1.18 − 1.18)²) = 0.4
Distance BC after scaling = √((1.1 − 1.1)² + (0.41 − 1.18)²) = 0.77

Before scaling, the wide-range feature makes AB look 20 times farther apart than BC; after scaling, the two distances are comparable (and BC is, in fact, the larger one).
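
The slide's scaled coordinates come from a scaling step it does not show; a self-contained way to reproduce the effect is to standardize the three points themselves with z-scores and compare distances. A minimal sketch, assuming NumPy and scikit-learn:

```python
# Minimal sketch: how feature scaling changes distances (assumed libraries).
import numpy as np
from sklearn.preprocessing import StandardScaler

points = np.array([[60.0, 3.0],   # A
                   [40.0, 3.0],   # B
                   [40.0, 4.0]])  # C

def dist(p, q):
    return np.linalg.norm(p - q)

# Before scaling, the wide-range feature dominates: AB = 20.0, BC = 1.0.
print(dist(points[0], points[1]), dist(points[1], points[2]))

# After z-score scaling both distances are ~2.12: neither feature dominates.
scaled = StandardScaler().fit_transform(points)
print(dist(scaled[0], scaled[1]), dist(scaled[1], scaled[2]))
```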


Data Preparation
● Normalizing your data will sometimes change the shape of its distribution, so it makes sense to first try out a few normalizing approaches before deciding on one.
● The most popular approaches are (see the sketch after this list):
○ Subtracting the mean and dividing by the standard deviation, (x − μ) / σ. This is particularly useful for data that follows a normal distribution; it usually yields values between −3 and 3, approximately.
○ Subtracting the mean and dividing by the range, (x − μ) / (max − min). This approach is a bit more generic; it usually yields values between −0.5 and 0.5, approximately.
○ Subtracting the minimum and dividing by the range, (x − min) / (max − min). This approach is very generic and always yields values between 0 and 1, inclusive.
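
A minimal NumPy sketch of the three formulas (the sample values are made up for illustration):

```python
# Minimal sketch: the three normalization approaches with plain NumPy.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy data

z_score   = (x - x.mean()) / x.std()              # (x - mu) / sigma
mean_norm = (x - x.mean()) / (x.max() - x.min())  # (x - mu) / (max - min)
min_max   = (x - x.min()) / (x.max() - x.min())   # (x - min) / (max - min)

print(z_score.round(2))    # roughly within [-3, 3]
print(mean_norm.round(2))  # roughly within [-0.5, 0.5]
print(min_max.round(2))    # always within [0, 1]
```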
Data Cleansing – Missing Values
Data Cleansing - Outliers
Data Preparation
● Normally, when dealing with big data, outliers shouldn't be an issue…
● BUT, depending on their values, extremely large or small outliers may affect the basic statistics of the dataset, especially if there are many of them (a minimal sketch of one way to flag them follows below).
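
A minimal sketch of one common convention for flagging such outliers, the 1.5 × IQR rule (the slides do not prescribe a method; the data below is made up):

```python
# Minimal sketch: flagging outliers with the 1.5 * IQR rule of thumb.
import numpy as np

values = np.array([12.0, 14.0, 13.0, 15.0, 14.0, 13.0, 980.0])  # toy data

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(values[(values < low) | (values > high)])  # -> [980.]
```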
Find the problems!
Strategies for a missing value: the mean value · the midpoint in the scale · a random number · remove the column (sketched below).
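
A minimal pandas sketch of those four options (column names and values are hypothetical):

```python
# Minimal sketch: four ways to handle missing values with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({"rating": [4, np.nan, 5, 3, np.nan],  # 1-5 scale
                   "notes":  [None] * 5})                # unusable column

filled_mean = df["rating"].fillna(df["rating"].mean())   # mean value
filled_mid  = df["rating"].fillna(3)                     # midpoint of the scale

rng = np.random.default_rng(0)                           # random number in range
random_vals = pd.Series(rng.integers(1, 6, len(df)), index=df.index)
filled_rand = df["rating"].fillna(random_vals)

dropped = df.drop(columns=["notes"])                     # remove the column
```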
Data Preparation
● When dealing with text data, which is often the case if you need to analyze logs or social media posts, a different type of cleansing is required.
● This involves one or more of the following (see the sketch after this list):
○ removing certain characters (e.g., special characters such as @, *, and punctuation marks)
○ making all words either uppercase or lowercase
○ removing certain words that convey little information (e.g., "a", "the", etc.)
○ removing extra or unnecessary spaces and line breaks
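
A minimal sketch of those steps using only Python's standard library (the stop-word list is a tiny illustrative sample; real lists are much longer):

```python
# Minimal sketch: basic text cleansing with the standard library.
import re

STOP_WORDS = {"a", "an", "the", "of", "and"}  # illustrative sample only

def clean(text: str) -> str:
    text = text.lower()                   # one consistent case
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation / special chars
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)                # also collapses extra whitespace

print(clean("The  cat @home is   on the mat!!!"))  # -> "cat home is on mat"
```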
Data Preparation
● All these data preparation steps (and other methods that may be relevant to your industry) will help you turn the data into a dataset.
● Make sure you keep a record of what you have done, though, in case you need to redo these steps or describe them in a report.
Data Preparation
(Example screenshot: a passage of text before and after cleansing; annotations point out "remove stop words" and "change to lower case".)
2.2 Data Exploration
next…
Data Exploration
● First, some exploration of the data is performed to figure out the potential information that could be hiding within it.
● There is a common misconception that the more data one has, the better the results of the analysis will be.
● It is very easy to fall victim to the illusion that a large dataset is all you need, but more often than not such a dataset will contain noise and several irrelevant attributes.
● All of these wrinkles will need to be ironed out in the stages that follow, starting with data exploration. (Remember: more data = more noise!)
2.3 Data Representation
next…
Data Representation

● Comes right after data exploration.
● According to the McGraw-Hill Dictionary of Scientific & Technical Terms, it is "the manner in which data is expressed symbolically by binary digits in a computer"; in other words, how data is stored in the computer.
● This basically involves assigning specific data structures to the variables involved and serves a dual purpose:
○ completing the transformation of the original (raw) data into a dataset
○ optimizing the memory and storage usage for the stages to follow.
Which one is better?
● 1, 2, 3 vs. 1.00000, 2.00000, 3.00000
● "TRUE" | "FALSE" vs. TRUE | FALSE
● 00101, 00110, 00111 vs. 101, 110, 111
● "I make it finally!" vs. "I MaKe iT FINALly!!!!"
Data Representation

● All this may seem very abstract to someone who has never dealt with data before, but it becomes very clear once you start working with R or any other statistical analysis package.
● Speaking of R, the data structure of a dataset in that programming platform is referred to as a data frame, and it is the most complete structure you can find, as it includes useful information about the data (e.g., names, modality, etc.).
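
The closest Python counterpart is the pandas DataFrame (an assumption; the slides only name R's data frame). A minimal sketch showing how the chosen representation affects types and memory:

```python
# Minimal sketch: representation choices in a pandas DataFrame.
import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3],
                   "flag":  ["TRUE", "FALSE", "TRUE"]})

df["count"] = df["count"].astype("int8")  # small ints don't need 64 bits
df["flag"]  = df["flag"] == "TRUE"        # real Booleans, not strings

print(df.dtypes)                   # int8, bool
print(df.memory_usage(deep=True))  # bytes per column after the change
```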
2.4 Data Discovery
next…
Data Discovery: the CORE of the data science process
● finding patterns in a dataset through hypothesis formulation and testing
● making use of several statistical methods to prove the significance of the relationships that the data scientist observes
● filtering out less robust relationships based on statistics
● throwing away the less meaningful relationships based on our judgment
Data Discovery

● Unfortunately, there is no foolproof methodology for data discovery, although there are several tools that can be used to make the whole process more manageable.
● How effective you are in this stage of the data science process will depend on your experience, your intuition and how much time you spend on it.
● Good knowledge of the various data analysis tools (especially machine learning techniques) can prove very useful here.
● In addition, experience with scientific research in data analysis will also prove to be priceless in this stage.
2.5 Learning from Data
next…
Learning from Data

● Learning from data is a crucial stage in the data science process and involves a lot of intelligent (and often creative) analysis of a dataset using statistical methods and machine learning systems.

Machine Learning:
● Supervised: helps a computer learn how to distinguish and predict new data points based on a training set.
● Unsupervised: enables the computer to learn on its own what the data structure can reveal about the data itself.

(A minimal sketch contrasting the two follows below.)
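
A minimal scikit-learn sketch of the two modes (an assumption; the slides do not name a library):

```python
# Minimal sketch: supervised vs. unsupervised learning with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised: learn from labelled examples, then predict new points.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict(X[:3]))  # predicted labels for three points

# Unsupervised: no labels; the algorithm exposes structure on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:3])      # cluster assignments it discovered
```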
Learning from Data

It may seem that using unsupervised and supervised learning guarantees a more or less automated way of learning from data.

However, without feedback from the user/programmer, this process is unlikely to yield any good results for the majority of cases. (This feedback can take the form of validation or corrections that provide more meaningful results.)

For example, artificial neural networks (ANNs), a very popular artificial intelligence tool that emulates the way the human brain works, are a great tool for supervised learning.
2.6 Creating a Data Product
next…
Creating a Data Product

● All of the aforementioned parts of the data science process are precursors to developing something concrete that can be used as a product of sorts.
● This part of the process is referred to as creating a data product, which influential data scientist Hilary Mason defined as "a product that is based on the combination of data and algorithms."
● So, a data product is not some kind of luxury that marketing people try to force us to buy.
● It is something the user cares about.
Creating a Data Product

● To create a data product, you need to understand the end users and become familiar with their expectations.
● You also need to exercise good judgment on the algorithms you will use and (particularly) on the form that the results will take.
● Graphs, particularly interactive ones, are a very useful form in which to present information if you want to promote it as a data product.
Creating a Data Product
(Example: a "BR Post Engagement" dashboard.)

So a data product is similar to having a data expert in your pocket who can afford to give you useful information at very low rates due to the economies of scale employed.
2.7 Insight, Deliverance & Visualization
next…
Insight, Deliverance and Visualization

Data science involves research into the data, the goal of which is to determine and understand more of what's happening below the surface: how the data products perform in terms of usefulness to the end users, maintainability, etc.

This often leads to new iterations of data discovery, data learning, etc., making data science an ongoing, evolving activity, oftentimes employing the agile framework frequently used in software development today.
Insight, Deliverance and Visualization

In this final stage of the data science process, the data scientist delivers the data product he has created and observes how it is received.

The user's feedback is crucial, as it will provide the information he needs to refine the analysis, upgrade it, and even redo it from scratch if necessary.

The data scientist may get ideas on how he can generate similar data products (or completely new ones) based on the users' newest requirements.
Insight, Deliverance and Visualization

● Visualization involves the graphical representation of data so that interesting and meaningful information can be obtained by the viewer.
● It is a way of summarizing the essence of the analyzed data graphically in a way that is intuitive, intelligent and oftentimes interactive (a minimal sketch follows below).
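
A minimal matplotlib sketch of the idea (an assumption; any plotting package would do, and the data is synthetic):

```python
# Minimal sketch: summarizing a variable graphically with matplotlib.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=50, scale=10, size=1000)  # synthetic data

plt.hist(data, bins=30, edgecolor="black")
plt.title("Distribution of a variable")
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()
```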
Insight, Deliverance and Visualization

● Being aware of what we don't know, and therefore able to handle the uncertainty of the data much better, means that you are more aware of the limitations of your models as well as of the value of the data.
● These graphs can bring about insight (which is the most valuable part of the data science process); this translates into deeper understanding and usually into some new hypotheses about the data.
Insight, Deliverance and Visualization

● It brings about the improvements you see in data products all over the world, the clever upgrades of certain data applications and, most importantly, the various innovations in the big data world.
● So this final stage of the data science process is NOT THE END, but rather the last part of a cycle that starts again and again, spiraling to greater heights of understanding, usefulness and evolution.
