Data Science 2
Introduction to Data Science
Chapter 2
Data Science Process
Last week…
Some market players
Data Science Applications (shhh…assignment idea)
In this lesson, we will learn about…
○ Subtracting the minimum and dividing by the range, (x-min) / (max-min). This
approach is very generic and always yields values between 0 and 1, inclusive.
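A minimal Python sketch of this rescaling, for illustration only. The function name `min_max_normalize`, the sample values, and the handling of a constant column (where max equals min and the formula would divide by zero) are assumptions, not part of the lesson.

```python
def min_max_normalize(values):
    """Rescale numbers to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant column has no range, so map every value to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / span for x in values]


ages = [20, 30, 40, 60]  # hypothetical sample values
normalized = min_max_normalize(ages)
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, which is why the result is bounded "inclusive".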
Data Cleansing – Missing Values
Data Cleansing – Outliers
Data Preparation
● Normally, when dealing with big data,
outliers shouldn't be an issue…
Data Preparation
● When dealing with text data, which is often the
case if you need to analyze logs or social media
posts, a different type of cleansing is required.
● This involves one or more of the following:
○ removing certain characters (e.g., special
characters such as @,*, and punctuation
marks)
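Removing special characters and punctuation could look roughly like this in Python. This is a sketch under assumptions: the function name `strip_special_characters`, the sample post, and the choice to keep only letters, digits, and spaces are illustrative, and a real project may need to keep some symbols (for example hashtags).

```python
import re


def strip_special_characters(text):
    """Remove everything except letters, digits, and whitespace."""
    # [^\w\s] matches punctuation and symbols such as @, *, !, #.
    # \w also counts "_" as a word character, so "_" is removed explicitly.
    cleaned = re.sub(r"[^\w\s]|_", "", text)
    # Collapse any runs of whitespace left behind and trim the ends.
    return re.sub(r"\s+", " ", cleaned).strip()


post = "@user: loving the *new* release!!! #datascience"  # hypothetical post
print(strip_special_characters(post))  # user loving the new release datascience
```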
Z: 1, 2, 3 → 1.00000, 2.00000, 3.00000
X: “TRUE” | “FALSE” → TRUE | FALSE
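The two conversions in this example (whole numbers stored as decimals, and the text values "TRUE"/"FALSE" stored as logical values) can be sketched in Python as follows. The helper names `to_float` and `to_bool` and the sample lists are assumptions made for illustration.

```python
def to_float(values):
    """Convert whole numbers to floating-point numbers."""
    return [float(v) for v in values]


def to_bool(values):
    """Convert the strings "TRUE"/"FALSE" to real boolean values."""
    mapping = {"TRUE": True, "FALSE": False}
    # Assumption: any other string is an error, so a KeyError is raised
    # rather than silently guessing a value.
    return [mapping[v.strip().upper()] for v in values]


z = to_float([1, 2, 3])
x = to_bool(["TRUE", "FALSE", "TRUE"])

print([f"{v:.5f}" for v in z])  # ['1.00000', '2.00000', '3.00000']
print(x)                        # [True, False, True]
```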
● All this may seem very abstract to someone who has never dealt
with data before, but it becomes very clear once you start working
with R or any other statistical analysis package.
● Speaking of R, the data structure of a dataset in that programming
platform is referred to as a data frame, and it is the most complete
structure you can find as it includes useful information about the
data (e.g. names, modality, etc.).
2.4 Data Discovery
next…
● the CORE of the data science process
● finding patterns in a dataset through hypothesis formulation and testing
● makes use of several statistical methods to prove the significance of the relationships that the data scientist observes
● throw away the less meaningful relationships based on our judgment
Machine Learning
● Unsupervised learning: enabling the computer to learn on its own what the data structure can reveal about the data itself
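To make the idea of letting the computer find structure on its own more concrete, here is a minimal sketch of one unsupervised technique, k-means clustering on one-dimensional data. The lesson does not name a specific algorithm, so the choice of k-means, the function name `k_means_1d`, the starting centers, and the sample values are all assumptions.

```python
def k_means_1d(points, k=2, iterations=10):
    """Group numbers into k clusters without using any labels."""
    ordered = sorted(points)
    # Assumption: start with k centers spread evenly across the sorted data.
    step = (len(ordered) - 1) / (k - 1) if k > 1 else 0
    centers = [ordered[round(i * step)] for i in range(k)]
    clusters = [[] for _ in range(k)]

    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


data = [1, 2, 3, 10, 11, 12]  # hypothetical measurements
centers, clusters = k_means_1d(data, k=2)
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1, 2, 3], [10, 11, 12]]
```

No one told the program which values belong together; the two groups come only from the structure of the data.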
Learning from Data
It may seem that combining unsupervised and supervised learning guarantees a
more or less automated way of learning from data.
So a data product is similar to having a data expert in your pocket who can afford to
give you useful information at very low rates due to the economies of scale employed.
2.7 Insight, Deliverance & Visualization
next…
Insight, Deliverance and Visualization
● Data science involves research into the data, the goal of which is to determine and understand more of what’s happening below the surface
● how the data products perform in terms of usefulness to the end users, maintainability, etc.
The data scientist may get ideas on how he can generate similar data
products (or completely new ones) based on the users' newest
requirements.
Insight, Deliverance and Visualization