Module 1 - PPT5 - Pre-Processing of Data
The Five Steps of Data Science
(Pre-Processing of Data)
Data Science vs Data Analytics
• Data science follows a structured, step-by-step
process that, when followed, preserves
the integrity of the results.
• On a simpler level, following a strict process
can make it much easier for amateur data
scientists to obtain results faster than if they
were exploring data with no clear vision.
The Five Steps of Data Science
• The five essential steps to perform data
science are as follows:
• 1. Asking an interesting question
• 2. Obtaining the data
• 3. Exploring the data
• 4. Modeling the data
• 5. Communicating and visualizing the results
1. Asking an interesting question
• Treat this step as you would treat a
brainstorming session.
• Start writing down questions regardless of
whether or not you think the data to answer
these questions even exists.
• Advantages:
• 1. You avoid biasing yourself before you even
start searching for data.
• 2. Obtaining data might involve searching in both
public and private locations and might not be very
straightforward, so a question should not be dropped
just because its data is not immediately at hand.
2. Obtaining the data
• Once you have selected the question you want
to focus on, it is time to search the world for the
data that might be able to answer that question.
• The data can come from a variety of
sources, so this step can be very creative!
3. Exploring the data
• In this step, we begin to break down the types
of data that we are dealing with, which is a
pivotal step in the process.
• Once this step is completed, the analyst
has generally spent several hours
learning about the domain, has used code or other
tools to manipulate and explore the data, and
has a very good sense of what the data might be
trying to tell them.
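A first pass at exploring data with code can be as small as counting what each column contains. The sketch below is illustrative only: the records, the column names, and the `profile` helper are all hypothetical and not part of the original material.

```python
from collections import Counter

# Hypothetical records standing in for a freshly obtained dataset.
rows = [
    {"city": "Austin", "stars": "4", "type": "review"},
    {"city": "Boston", "stars": "5", "type": "review"},
    {"city": "Austin", "stars": "2", "type": "review"},
]


def profile(records):
    """Return, per column, the number of distinct values and the most common one."""
    summary = {}
    for column in records[0]:
        counts = Counter(record[column] for record in records)
        value, freq = counts.most_common(1)[0]
        summary[column] = {"distinct": len(counts), "top": value, "top_count": freq}
    return summary


summary = profile(rows)
for column, stats in summary.items():
    print(column, stats)
```

A column with a single distinct value (here `type`) or with as many distinct values as rows is often worth a closer look before modeling.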
4. Modeling the data
• This step involves the use of statistical and
machine learning models.
• In this step, we are not only fitting and choosing
models, but also implementing mathematical
validation metrics in order to quantify the
models and their effectiveness.
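As a minimal sketch of fitting a model and then quantifying it with a validation metric, the example below fits a straight line by ordinary least squares and measures mean squared error on held-out points. The data points, the train/test split, and the choice of a linear model are assumptions made purely for illustration.

```python
# Hypothetical (x, y) observations, split into a training part and a held-out part.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
test = [(5.0, 10.1), (6.0, 11.9)]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared difference between observed and predicted y values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


slope, intercept = fit_line(train)
mse = mean_squared_error(test, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} held-out MSE={mse:.4f}")
```

Scoring on points the model did not see during fitting is what makes the metric a validation of the model rather than a description of the training data.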
5. Communicating and visualizing the results
• This is arguably the most important step.
• While it might seem obvious and simple, the
ability to present your results in a digestible
format is much more difficult than it seems.
• In the following slides, we will look at
examples of cases where results were
communicated poorly and cases where they were
displayed very well.
1. Explore the data
• Exploring the data involves the ability to recognize
the different types of data, transform data types, and use
code to systematically improve the quality of the
entire dataset to prepare it for the modeling
stage.
• The following three basic questions
form the guidelines that should be followed
when exploring a newly obtained set of data.
Basic questions for data exploration
1. Is the data organized or not?
o Is the data presented in a row/column structure?
o If the data is unorganized, transform it into a
row/column structure.
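Transforming unorganized data into a row/column structure can be sketched as follows. The free-form sentences, the regular expression, and the column names `user`, `business`, `stars`, and `date` are all hypothetical; real unorganized data usually needs a pattern tailored to its own format.

```python
import csv
import io
import re

# Hypothetical free-form lines; the sentence pattern is an assumption for illustration.
raw_lines = [
    "Alice rated Cafe Luna 4 stars on 2014-03-02",
    "Bob rated Taco Hut 2 stars on 2014-03-05",
]

pattern = re.compile(
    r"(?P<user>\w+) rated (?P<business>.+) (?P<stars>\d) stars "
    r"on (?P<date>\d{4}-\d{2}-\d{2})"
)

# Each matching line becomes one row; each named group becomes one column.
rows = []
for line in raw_lines:
    match = pattern.fullmatch(line)
    if match:
        rows.append(match.groupdict())

# Write the rows out as CSV text to show the resulting row/column layout.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user", "business", "stars", "date"])
writer.writeheader()
writer.writerows(rows)
table = buffer.getvalue()
print(table)
```

Lines that do not match the pattern are skipped here; in practice it is worth counting them, since a high skip rate suggests the pattern is wrong.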
• 2. date:
• The date at which the review was posted.
• Even though time is usually considered continuous, this column
would likely be considered discrete and at the ordinal level
because of the natural order that dates have.
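The natural order of dates can be made explicit in code by parsing them into date objects, after which sorting and picking a middle value are well defined. The date strings below are made up, and the assumption that they are stored in ISO format (YYYY-MM-DD) may not hold for a real dataset.

```python
from datetime import date

# Hypothetical posting dates stored as ISO-format strings.
raw_dates = ["2014-03-05", "2013-11-20", "2014-01-15"]

# Parsing into date objects makes the natural order explicit.
parsed = [date.fromisoformat(text) for text in raw_dates]
ordered = sorted(parsed)

earliest, latest = ordered[0], ordered[-1]
middle = ordered[len(ordered) // 2]  # middle element of an odd-length sorted list
print(earliest, middle, latest)
```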
What does each column represent?
• 3. review_id:
• This is a unique identifier for the review that the post represents.
• This would be at the nominal level because, again, there is no
natural order to this identifier.
• 4. stars:
• An ordered column that represents the final score the reviewer
gave the restaurant.
• The column is ordered and qualitative, so it is at the ordinal level.
What does each column represent?
• 5. text:
• This is likely the raw text that each reviewer wrote.
• As with most text, we place this at the nominal level.
• 6. type:
• In the first five rows, all we see is the word review.
• This might be a column that identifies that each row is a review,
implying that there might be another type of row other than a
review.
• We place this at the nominal level.
What does each column represent?
• 7. user_id:
• This is likely a unique identifier for the user who is writing the
review.
• Just like other unique IDs, we place this column at the nominal level.
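The level assigned to a column determines which summaries make sense for it. The sketch below uses the column names discussed above with made-up values; the `levels` mapping and the `summarize` helper are hypothetical names introduced only for this illustration.

```python
from statistics import median, mode

# Hypothetical rows using the column names discussed above; values are made up.
reviews = [
    {"review_id": "r1", "user_id": "u1", "stars": 4, "type": "review"},
    {"review_id": "r2", "user_id": "u2", "stars": 5, "type": "review"},
    {"review_id": "r3", "user_id": "u1", "stars": 2, "type": "review"},
]

# Level of measurement assigned to each column in the text.
levels = {
    "review_id": "nominal",
    "user_id": "nominal",
    "stars": "ordinal",
    "type": "nominal",
}


def summarize(column):
    """Pick a summary that is valid for the column's level of measurement."""
    values = [row[column] for row in reviews]
    if levels[column] == "ordinal":
        # Order is meaningful, so the median is a sensible center.
        return {"median": median(values)}
    # Nominal: only counting and the most frequent value make sense.
    return {"distinct": len(set(values)), "mode": mode(values)}


stars_summary = summarize("stars")
user_summary = summarize("user_id")

# A unique identifier should have as many distinct values as there are rows.
ids_unique = len({row["review_id"] for row in reviews}) == len(reviews)
print(stars_summary, user_summary, ids_unique)
```

Taking a mean of `user_id` or `review_id` would be meaningless, which is why the nominal branch restricts itself to counts and the mode.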