
Foundations of Data Science

Module 1
PPT5
The Five Steps of Data Science
(Pre-Processing of Data)
Data Science vs Data Analytics
• Data science follows a structured, step-by-step process that preserves the integrity of the results.
• On a simpler level, following a strict process makes it much easier for amateur data scientists to obtain results faster than if they were exploring the data with no clear vision.
The Five Steps of Data Science
• The five essential steps to perform data
science are as follows:
• 1. Asking an interesting question
• 2. Obtaining the data
• 3. Exploring the data
• 4. Modeling the data
• 5. Communicating and visualizing the results
1. Asking an interesting question
• I would treat this step as you would treat a
brainstorming session.
• Start writing down questions regardless of
whether or not you think the data to answer
these questions even exists.
• Advantages: 1. You avoid biasing yourself before you have even searched for the data.
• 2. Obtaining data might involve searching both public and private sources and, therefore, might not be very straightforward, so writing the questions first keeps the search from being limited too early.
2. Obtaining the data
• Once you have selected the question you want
to focus on, it is time to search the world for the
data that might be able to answer that question.
• Since the data can come from a variety of sources, this step can be very creative!
3. Exploring the data
• By this step, we begin to break down the types
of data that we are dealing with, which is a
pivotal step in the process.
• Once this step is completed, the analyst generally would have spent several hours learning about the domain, using code or other tools to manipulate and explore the data, and would have a very good sense of what the data might be trying to tell them.
4. Modeling the data
• This step involves the use of statistical and
machine learning models.
• In this step, we are not only fitting and choosing models, we are also implementing mathematical validation metrics in order to quantify the models and their effectiveness.
5. Communicating and visualizing the
results
• This is arguably the most important step.
• While it might seem obvious and simple, the ability to communicate your results in a digestible format is much more difficult than it seems.
• In the following slides, we will look at different
examples of cases when results were
communicated poorly and when they were
displayed very well.
1. Explore the data
• It involves the ability to recognize the different types of data, transform data types, and use code to systematically improve the quality of the entire dataset to prepare it for the modeling stage.
• Following are the five basic questions, which form the guidelines that should be followed when exploring a newly obtained set of data.
Basic questions for data exploration
1. Is the data organized or not?
o Is the data presented in a row/column structure?
o If we have unorganized data, we must transform it into a row/column structure (a small sketch follows after question 2).

2. What does each row represent?
o Once we have an answer to how the data is organized and are looking at a nice row/column based dataset, we should identify what each row actually represents.
o This step is usually very quick and can help put things in perspective.
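For question 1, a minimal sketch of turning unorganized records into a row/column structure with pandas; the nested records below are made up purely for illustration:

    import pandas as pd

    # Hypothetical unorganized data: a list of nested records, e.g. parsed JSON.
    raw_records = [
        {"business": {"id": "b1", "city": "Pune"}, "stars": 4},
        {"business": {"id": "b2", "city": "Mumbai"}, "stars": 5},
    ]

    # Flatten the nested records into a tabular (row/column) DataFrame.
    organized = pd.json_normalize(raw_records)
    print(organized.head())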
Basic questions for data exploration
3. What does each column represent?
o We should identify each column by its level of data and whether it is quantitative or qualitative, and so on.
o This categorization might change as our analysis progresses, but it is important to begin this step as early as possible.

4. Are there any missing data points?
o Data isn't perfect.
o Sometimes data is missing due to human or mechanical error.
o When this happens, we, as data scientists, must make decisions about how to deal with these discrepancies.
Basic questions for data exploration
5. Do we need to perform any transformations on the columns?
o Depending on what level/type of data each column is at, we may need to perform certain types of transformations.
o For example, in statistical modeling and machine learning, we would like each column to be numerical.
o Python or R is generally used to make any and all transformations; a small pandas sketch follows below.
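A minimal sketch of such column transformations in pandas; the two-row frame, its column names, and its values are made up purely for illustration:

    import pandas as pd

    # Made-up frame: a text date column and a numeric value stored as text.
    df = pd.DataFrame({
        "date": ["2012-01-01", "2012-02-15"],
        "stars": ["4", "5"],
    })

    # Transform the text dates into proper datetime values.
    df["date"] = pd.to_datetime(df["date"])

    # Transform the text numbers into a true numerical column.
    df["stars"] = pd.to_numeric(df["stars"])

    print(df.dtypes)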
Example: Yelp dataset
This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge, which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries.

All personally identifiable information has been removed.


Initial steps undertaken:
• Import the pandas package and nickname it as pd.
• Read in the .csv from the Web; call it yelp_raw_data.
• Look at the head of the data (just the first few rows).
Example:
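A minimal sketch of those initial steps in pandas; the file name and location passed to read_csv are assumptions, so substitute the actual URL or path of the Yelp CSV:

    import pandas as pd  # import the pandas package and nickname it as pd

    # Read the CSV into a DataFrame; the path below is only a placeholder.
    yelp_raw_data = pd.read_csv("yelp_raw_data.csv")

    # Look at the head of the data (just the first few rows).
    print(yelp_raw_data.head())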
Example: Yelp dataset
• Is the data organized or not?
• Since we have a nice row/column structure, we can conclude
that this data seems pretty organized.

• What does each row represent?


• Each row represents a user giving a review of the business.
• In Python, we can measure how big our dataset is by using the shape attribute.
• yelp_raw_data.shape
• # (10000, 10)
• That is, the dataset has 10,000 rows and 10 columns.
• In other words, 10,000 observations and 10 characteristics.
What does each column represent?
1. business_id:
• This is likely a unique identifier for the business the review is for.
• This would be at the nominal level because there is no natural
order to this identifier.

• 2. date:
• The date at which the review was posted.
• Even though time is usually considered continuous, this column
would likely be considered discrete and at the ordinal level
because of the natural order that dates have.
What does each column represent?
• 3. review_id:
• This is a unique identifier for the review that the post represents.
• This would be at the nominal level because, again, there is no
natural order to this identifier.

• 4. stars:
• An ordered column that represents the final score the reviewer gave the restaurant.
• This is ordered and qualitative, so it is at the ordinal level.
What does each column represent?
• 5. text:
• This is likely the raw text that each reviewer wrote.
• As with most text, we place this at the nominal level.

• 6. type:
• In the first five rows, all we see is the word review.
• This might be a column that identifies that each row is a review, implying that there might be another type of row other than a review.
• We place this at the nominal level.
What does each column represent?
• 7. user_id:
• This is likely a unique identifier for the user who is writing the
review.
• Just like the other unique IDs, we place this data at the nominal level.

• Q. Are there any missing data points?


• Perform an isnull operation.
• For example, if your DataFrame is called awesome_dataframe, then the Python command awesome_dataframe.isnull().sum() will show the number of missing values in each column (a small sketch follows below).
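A minimal sketch of that check, assuming the yelp_raw_data DataFrame from the earlier read_csv step is still in memory:

    # Count missing values per column; a column of zeros means no missing data.
    missing_per_column = yelp_raw_data.isnull().sum()
    print(missing_per_column)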
Q. Do we need to perform any
transformations on the columns?
• For example, will we need to change the scale of some of the quantitative data, or do we need to create dummy variables for the qualitative variables?
• As this dataset has only qualitative columns, we can focus only on transformations at the ordinal and nominal levels (a dummy-variable sketch follows below).
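A minimal sketch of creating dummy variables for a qualitative column, again assuming the yelp_raw_data DataFrame from earlier; the choice of the type column is only an illustration, since any nominal column could be encoded this way:

    import pandas as pd

    # Turn the nominal 'type' column into 0/1 indicator (dummy) columns
    # so that it can be used in statistical or machine learning models.
    yelp_with_dummies = pd.get_dummies(yelp_raw_data, columns=["type"])
    print(yelp_with_dummies.columns)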
