
Foundations of Data Science

Module 1
PPT5
The Five Steps of Data Science
(Pre-Processing of Data)
Data Science vs Data Analytics
• Data science follows a structured, step-by-step process that preserves the integrity of the results.
• On a simpler level, following a strict process makes it much easier for amateur data scientists to obtain results faster than if they were exploring the data with no clear vision.
The Five Steps of Data Science
• The five essential steps to perform data
science are as follows:
• 1. Asking an interesting question
• 2. Obtaining the data
• 3. Exploring the data
• 4. Modeling the data
• 5. Communicating and visualizing the results
1. Asking an interesting question
• I would treat this step as you would treat a
brainstorming session.
• Start writing down questions regardless of
whether or not you think the data to answer
these questions even exists.
• Advantages: 1. You avoid biasing yourself before you have even searched for the data.
• 2. Obtaining data might involve searching both public and private sources and, therefore, might not be very straightforward, so writing the questions first keeps the search from being limited too early.
2. Obtaining the data
• Once you have selected the question you want
to focus on, it is time to search the world for the
data that might be able to answer that question.
• Since the data can come from a variety of sources, this step can be very creative!
3. Exploring the data
• By this step, we begin to break down the types
of data that we are dealing with, which is a
pivotal step in the process.
• Once this step is completed, the analyst generally would have spent several hours learning about the domain, using code or other tools to manipulate and explore the data, and would have a very good sense of what the data might be trying to tell them.
4. Modeling the data
• This step involves the use of statistical and
machine learning models.
• In this step, we are not only fitting and choosing models, we are also implementing mathematical validation metrics in order to quantify the models and their effectiveness.
5. Communicating and visualizing the
results
• This is arguably the most important step.
• While it might seem obvious and simple, the ability to communicate your results in a digestible format is much more difficult than it seems.
• In the following slides, we will look at different
examples of cases when results were
communicated poorly and when they were
displayed very well.
1. Explore the data
• It involves the ability to recognize the different types of data, transform data types, and use code to systematically improve the quality of the entire dataset to prepare it for the modeling stage.
• Following are the five basic questions, which form the guidelines that should be followed when exploring a newly obtained set of data.
Basic questions for data exploration
1. Is the data organized or not?
o Is the data presented in a row/column structure?
o If we have unorganized data, we must transform it into a row/column structure (a small sketch follows after question 2).

2. What does each row represent?
o Once we have an answer to how the data is organized and are looking at a nice row/column based dataset, we should identify what each row actually represents.
o This step is usually very quick and can help put things in perspective.
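For question 1, a minimal sketch of turning unorganized records into a row/column structure with pandas; the nested records below are made up purely for illustration:

    import pandas as pd

    # Hypothetical unorganized data: a list of nested records, e.g. parsed JSON.
    raw_records = [
        {"business": {"id": "b1", "city": "Pune"}, "stars": 4},
        {"business": {"id": "b2", "city": "Mumbai"}, "stars": 5},
    ]

    # Flatten the nested records into a tabular (row/column) DataFrame.
    organized = pd.json_normalize(raw_records)
    print(organized.head())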
Basic questions for data exploration
3. What does each column represent?
o We should identify each column by its level of data and whether it is quantitative or qualitative, and so on.
o This categorization might change as our analysis progresses, but it is important to begin this step as early as possible.

4. Are there any missing data points?
o Data isn't perfect.
o Sometimes data is missing due to human or mechanical error.
o When this happens, we, as data scientists, must make decisions about how to deal with these discrepancies.
Basic questions for data exploration
5. Do we need to perform any transformations on the columns?
o Depending on what level/type of data each column is at, we may need to perform certain types of transformations.
o For example, in statistical modeling and machine learning, we would like each column to be numerical.
o Python or R is generally used to make any and all transformations; a small pandas sketch follows below.
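A minimal sketch of such column transformations in pandas; the two-row frame, its column names, and its values are made up purely for illustration:

    import pandas as pd

    # Made-up frame: a text date column and a numeric value stored as text.
    df = pd.DataFrame({
        "date": ["2012-01-01", "2012-02-15"],
        "stars": ["4", "5"],
    })

    # Transform the text dates into proper datetime values.
    df["date"] = pd.to_datetime(df["date"])

    # Transform the text numbers into a true numerical column.
    df["stars"] = pd.to_numeric(df["stars"])

    print(df.dtypes)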
Example: Yelp dataset
This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge, which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries.

All personally identifiable information has been removed.


Initial steps undertaken:
• Import the pandas package and nickname it as pd.
• Read in the .csv from the Web; call it yelp_raw_data.
• Look at the head of the data (just the first few rows).
Example:
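A minimal sketch of those initial steps in pandas; the file name and location passed to read_csv are assumptions, so substitute the actual URL or path of the Yelp CSV:

    import pandas as pd  # import the pandas package and nickname it as pd

    # Read the CSV into a DataFrame; the path below is only a placeholder.
    yelp_raw_data = pd.read_csv("yelp_raw_data.csv")

    # Look at the head of the data (just the first few rows).
    print(yelp_raw_data.head())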
Example: Yelp dataset
• Is the data organized or not?
• Since we have a nice row/column structure, we can conclude
that this data seems pretty organized.

• What does each row represent?


• Each row represents a user giving a review of the business.
• In Python, we can measure how big our dataset is by using the shape attribute.
• yelp_raw_data.shape
• # (10000, 10)
• That is, the dataset has 10,000 rows and 10 columns.
• In other words, 10,000 observations and 10 characteristics.
What does each column represent?
1. business_id:
• This is likely a unique identifier for the business the review is for.
• This would be at the nominal level because there is no natural
order to this identifier.

• 2. date:
• The date at which the review was posted.
• Even though time is usually considered continuous, this column
would likely be considered discrete and at the ordinal level
because of the natural order that dates have.
What does each column represent?
• 3. review_id:
• This is a unique identifier for the review that the post represents.
• This would be at the nominal level because, again, there is no
natural order to this identifier.

• 4. stars:
• An ordered column that represents the final score the reviewer gave the restaurant.
• This is ordered and qualitative, so it is at the ordinal level.
What does each column represent?
• 5. text:
• This is likely the raw text that each reviewer wrote.
• As with most text, we place this at the nominal level.

• 6. type:
• In the first five rows, all we see is the word review.
• This might be a column that identifies that each row is a review, implying that there might be another type of row other than a review.
• We place this at the nominal level.
What does each column represent?
• 7. user_id:
• This is likely a unique identifier for the user who is writing the
review.
• Just like the other unique IDs, we place this data at the nominal level.

• Q. Are there any missing data points?


• Perform an isnull operation.
• For example, if your DataFrame is called awesome_dataframe, then the Python command awesome_dataframe.isnull().sum() will show the number of missing values in each column (a small sketch follows below).
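A minimal sketch of that check, assuming the yelp_raw_data DataFrame from the earlier read_csv step is still in memory:

    # Count missing values per column; a column of zeros means no missing data.
    missing_per_column = yelp_raw_data.isnull().sum()
    print(missing_per_column)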
Q. Do we need to perform any
transformations on the columns?
• For example, will we need to change the scale of some of the quantitative data, or do we need to create dummy variables for the qualitative variables?
• As this dataset has only qualitative columns, we can focus only on transformations at the ordinal and nominal levels (a dummy-variable sketch follows below).
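A minimal sketch of creating dummy variables for a qualitative column, again assuming the yelp_raw_data DataFrame from earlier; the choice of the type column is only an illustration, since any nominal column could be encoded this way:

    import pandas as pd

    # Turn the nominal 'type' column into 0/1 indicator (dummy) columns
    # so that it can be used in statistical or machine learning models.
    yelp_with_dummies = pd.get_dummies(yelp_raw_data, columns=["type"])
    print(yelp_with_dummies.columns)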
