The Data Science Process
3.1. Cleansing data
REDUNDANT WHITESPACE
• Whitespace, though hard to detect, can cause significant errors in data processing, such as mismatches when joining on keys. In Python you can use the strip() method to remove leading and trailing whitespace.
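• For example, a minimal sketch of the idea (the key values are hypothetical):

    raw_key = "  BRA-001 "       # key polluted with leading/trailing whitespace
    clean_key = raw_key.strip()  # remove leading and trailing whitespace

    print(raw_key == "BRA-001")    # False: the stray spaces break the match
    print(clean_key == "BRA-001")  # True once the key is cleaned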
FIXING CAPITAL LETTER MISMATCHES
• Capital letter mismatches are common. Most programming languages
make a distinction between “Brazil” and “brazil”. In this case you can
solve the problem by applying a function that returns both strings in
lowercase, such as .lower() in Python.
• "Brazil".lower() == "brazil".lower() evaluates to True.
IMPOSSIBLE VALUES AND SANITY CHECKS
• Sanity checks are another valuable type of data check. Here you check
the value against physically or theoretically impossible values such as
people taller than 3 meters or someone with an age of 299 years. Sanity
checks can be directly expressed with rules:
• check = 0 <= age <= 120
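• A minimal sketch of applying such a rule in Python (the age values are hypothetical):

    ages = [25, 43, 299, 17, -3]  # 299 and -3 are physically impossible

    # Flag every observation that violates the sanity rule 0 <= age <= 120
    violations = [age for age in ages if not (0 <= age <= 120)]
    print(violations)  # [299, -3]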
OUTLIERS
• An outlier is an observation that seems to be distant from other
observations or, more specifically, one observation that follows a different
logic or generative process than the other observations. The easiest way
to find outliers is to use a plot or a table with the minimum and maximum
values.
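• A small sketch of the min/max table approach, assuming pandas and a hypothetical column of heights in meters:

    import pandas as pd

    heights = pd.Series([1.65, 1.80, 1.72, 4.50, 1.78], name="height_m")

    # A quick min/max summary often reveals suspicious observations at a glance
    print(heights.describe()[["min", "max"]])  # the max of 4.50 m stands out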
DEALING WITH MISSING VALUES
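• The slide gives no techniques here; common options (an assumption, not from the slide) are omitting the incomplete observations or imputing a value such as the mean. A minimal pandas sketch with a hypothetical price column:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"price": [10.0, np.nan, 12.5, np.nan, 11.0]})  # hypothetical data

    dropped = df.dropna()                               # option 1: omit incomplete rows
    imputed = df.fillna({"price": df["price"].mean()})  # option 2: impute the mean
    print(dropped, imputed, sep="\n")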
DEVIATIONS FROM A CODE BOOK
• You can detect errors in large data sets by using set operations to compare the data against a code book, which serves as metadata describing the data.
A code book includes details like the number of variables, observations,
and meanings of encoded values (e.g., “0” for negative, “5” for very
positive). By comparing data sets, you can identify values in the data that
don't match the code book, signaling errors. Using tables and difference
operators can help streamline this process, especially when working with
large amounts of data.
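• A minimal sketch of the set-difference idea, assuming a hypothetical column encoded on the code book's 0-5 scale:

    allowed_codes = {0, 1, 2, 3, 4, 5}  # valid values according to the code book
    observed = [0, 3, 5, 7, 2, 9, 1]    # hypothetical column from the data set

    # Values present in the data but absent from the code book signal errors
    unknown = set(observed) - allowed_codes
    print(unknown)  # {9, 7}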
DIFFERENT UNITS OF MEASUREMENT
• When integrating two data sets, it's crucial to account for differences in
units of measurement. For example, when analyzing global gasoline
prices, some data sets may report prices per gallon, while others use
prices per liter. In such cases, a simple unit conversion can resolve the problem.
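• A minimal sketch of such a conversion (the prices are hypothetical):

    LITERS_PER_US_GALLON = 3.785411784

    prices_per_gallon = [3.50, 3.80, 4.10]          # hypothetical prices in USD per gallon
    prices_per_liter = [p / LITERS_PER_US_GALLON    # convert to USD per liter
                        for p in prices_per_gallon]
    print([round(p, 3) for p in prices_per_liter])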
DIFFERENT LEVELS OF AGGREGATION
• Different levels of aggregation in data sets, like weekly data versus
work-week data, are similar to measurement differences. These
discrepancies are usually easy to spot and can be resolved by
summarizing or expanding the data. Cleaning data early is crucial before
combining information from various sources.
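• A small sketch of summarizing to a coarser level with pandas (the daily sales are hypothetical):

    import pandas as pd

    daily = pd.DataFrame(
        {"sales": range(14)},
        index=pd.date_range("2024-01-01", periods=14, freq="D"),
    )

    # Summarize daily observations into weekly totals
    weekly = daily.resample("W").sum()
    print(weekly)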
CORRECT ERRORS AS EARLY AS POSSIBLE
• Data errors should be fixed as early as possible in the collection process to
prevent costly mistakes and repeated corrections in multiple projects.
These errors can reveal issues like faulty business processes, defective
equipment, or software bugs. However, data scientists may not always
control data collection, so handling errors in code becomes necessary. It's
essential to keep a copy of original data to avoid losing valuable
information during cleaning. Combining data from different sources is often
more challenging than cleaning individual data sets.
3.2 Integrating data
• This section focuses on integrating data from various sources, which can
differ in size, type, and structure, such as databases, Excel files, and text
documents. For simplicity, the chapter concentrates on table-structured
data.
• There are two main ways to combine data:
1. Joining – Enriches data by merging information from one table with
another.
2. Appending/Stacking – Adds rows from one table to another.
• You can either create a new physical table or a virtual table (view). A view
saves disk space, as it doesn't store data separately.
JOINING TABLES
• Joining tables allows you to combine information from two tables to enrich individual
observations. For example, you can merge customer purchase data with their regional
information by using a common field, known as a key, such as a customer name or
Social Security number. Keys that uniquely identify records are called primary keys.
Joining tables is similar to using a lookup function in Excel. The number of rows in the
output depends on the type of join used, which will be explained later.
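• A minimal sketch of such a join with pandas, assuming hypothetical purchase and region tables that share a customer_id key:

    import pandas as pd

    purchases = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [50, 75, 20]})
    regions = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["North", "South", "East"]})

    # Enrich each purchase with the customer's region via the common key
    enriched = purchases.merge(regions, on="customer_id", how="left")
    print(enriched)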
Appending
• Appending or stacking tables involves adding the observations from one table to
another, resulting in a larger table. For example, appending January's data with
February's creates a combined table with observations from both months. This
operation is similar to the union operation in set theory, and in SQL, it's performed
using the UNION command. Other set operations, like set difference and intersection,
are also used in data science.
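• A minimal sketch of appending with pandas (the SQL equivalent is a UNION); the monthly tables are hypothetical:

    import pandas as pd

    january = pd.DataFrame({"month": ["Jan", "Jan"], "sales": [100, 150]})
    february = pd.DataFrame({"month": ["Feb", "Feb"], "sales": [120, 130]})

    # Stack February's observations under January's
    combined = pd.concat([january, february], ignore_index=True)
    print(combined)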
USING VIEWS TO SIMULATE DATA JOINS AND APPENDS
• To avoid data duplication, you can use views to virtually combine data.
Unlike creating a new physical table, which requires additional storage
space, a view acts as a virtual layer that combines data from multiple
tables without duplicating it. For example, sales data from different months
can be virtually combined into a yearly sales table. However, views have a
drawback: they recreate the join each time they are queried, consuming
more processing power than a pre-calculated table.
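• A minimal sketch using SQLite from Python, with hypothetical monthly sales tables combined into a virtual yearly table:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE sales_jan (product TEXT, amount REAL);
        CREATE TABLE sales_feb (product TEXT, amount REAL);
        INSERT INTO sales_jan VALUES ('A', 100), ('B', 150);
        INSERT INTO sales_feb VALUES ('A', 120), ('B', 130);

        -- The view stores no data; the UNION ALL is re-run on every query
        CREATE VIEW sales_year AS
            SELECT * FROM sales_jan
            UNION ALL
            SELECT * FROM sales_feb;
    """)
    print(con.execute("SELECT * FROM sales_year").fetchall())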
ENRICHING AGGREGATED MEASURES
• Data enrichment involves adding calculated information to a table, such as total sales
or the percentage of total stock sold in a specific region. This aggregated data provides
additional insights, enabling the calculation of each product's participation within its
category. While useful for data exploration, it's especially beneficial when creating data
models. Generally, models that use relative measures, like percentage sales, tend to
perform better than those using raw numbers.
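• A small sketch of adding such aggregated measures with pandas (the product table is hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        "category": ["fruit", "fruit", "dairy", "dairy"],
        "product":  ["apple", "pear", "milk", "cheese"],
        "sales":    [100, 300, 250, 250],
    })

    # Add the category total, then each product's share within its category
    df["category_sales"] = df.groupby("category")["sales"].transform("sum")
    df["share_of_category"] = df["sales"] / df["category_sales"]
    print(df)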
3.3 Transforming data
• Relationships between an input variable and an output variable aren't always linear. Taking the log of the independent variables can simplify the estimation problem dramatically: transforming the input this way often turns a nonlinear relationship into one that a linear model can fit.
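• A minimal sketch of the idea with NumPy, assuming hypothetical data generated as y = 2 + 3*log(x) plus noise:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(1, 100, 200)
    y = 2.0 + 3.0 * np.log(x) + rng.normal(scale=0.1, size=x.size)  # nonlinear in x

    # After taking log(x) the relationship is linear, so a least-squares line fits well
    slope, intercept = np.polyfit(np.log(x), y, deg=1)
    print(round(intercept, 2), round(slope, 2))  # close to 2.0 and 3.0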
REDUCING THE NUMBER OF VARIABLES
• Sometimes you have too many variables and need to reduce the number because they don't add new information to the model. Having too many variables makes the model difficult to handle, and certain techniques don't perform well when you overload them with too many input variables. The process of reducing the number of variables in the dataset is known as dimensionality reduction. Principal component analysis (PCA) is a commonly used dimensionality reduction technique.
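• A minimal sketch with scikit-learn (the library choice is an assumption), reducing hypothetical correlated variables to two principal components:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    base = rng.normal(size=(100, 2))
    # Five columns built from two underlying signals, so two components capture most variance
    X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1],
                         2 * base[:, 0], base[:, 1] - base[:, 0]])

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)                      # (100, 2)
    print(pca.explained_variance_ratio_.sum())  # close to 1.0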
TURNING VARIABLES INTO DUMMIES
Dummy variables convert categorical variables into binary indicators that can take values of 1 (true) or 0 (false). This technique creates separate columns for each category; for example, a "Weekdays" column can be transformed into individual columns for Monday through Sunday, where a 1 indicates the presence of that day and a 0 indicates its absence. This method is commonly used in modeling.
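A minimal sketch with pandas (the weekday column is hypothetical):

    import pandas as pd

    df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})

    # One binary indicator column per category
    dummies = pd.get_dummies(df["weekday"], prefix="day")
    print(dummies)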
Step 4: Exploratory data analysis
During exploratory data analysis (EDA), you thoroughly examine your data, primarily
using graphical techniques to visualize and understand the interactions between
variables. This phase emphasizes exploration, so it's important to remain open-minded and attentive. While the main goal is understanding the data, you may still discover previously overlooked anomalies that require corrective action.
Bar charts, line plots, and distribution plots such as the histogram and boxplot are some of the graphs used in exploratory analysis.
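A small sketch of two of these plots with pandas and Matplotlib (the data is hypothetical):

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    values = pd.Series(np.random.default_rng(0).normal(size=500), name="value")

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    values.plot.hist(ax=ax1, bins=30, title="Histogram")  # shape of the distribution
    values.plot.box(ax=ax2, title="Boxplot")              # median, quartiles, outliers
    plt.tight_layout()
    plt.show()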
Step 5: Build the models
In this phase, you have clean data and a clear understanding of it. Now, you're ready to build models to achieve specific goals like making better predictions, classifying objects, or understanding the system you're analyzing. This step is more focused compared to earlier exploration because you already know what you're trying to find and what you want to achieve.

Building a model is an iterative process, meaning you'll refine it over time. The process may vary depending on whether you use traditional statistics or modern machine learning. Most models follow these steps:
1. Choose a technique and select the variables to use.
2. Run the model.
3. Diagnose and compare models to find the best one.
5.1 Model and variable selection
• When building a model, you need to choose the right variables and a modeling
technique. Your earlier exploratory analysis should help you figure out which variables
will be useful. There are many modeling techniques, and picking the right one requires
good judgment. You should also consider factors like:
• Will the model be easy to implement in a production environment?
• How hard will it be to maintain, and how long will it stay relevant without changes?
• Does the model need to be easy to explain?
• Once you've thought about these things, you're ready to take action and start building
the model.
5.2 Model execution
• Once you've chosen a model, you'll need to implement it in code.
• Linear regression tries to fit a line while minimizing the distance to each point.
• Confusion matrix: shows how many cases were correctly and incorrectly classified by comparing the predictions with the real values.
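• A minimal sketch of both pieces with scikit-learn (the library choice and data are assumptions):

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1))

    # Linear regression: fit a line that minimizes the distance to each point
    y = 1.5 * X[:, 0] + rng.normal(scale=0.2, size=100)
    reg = LinearRegression().fit(X, y)
    print(reg.coef_, reg.intercept_)

    # Confusion matrix: compare predicted classes with the real values
    labels = (X[:, 0] > 0).astype(int)
    clf = LogisticRegression().fit(X, labels)
    print(confusion_matrix(labels, clf.predict(X)))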
5.3 Model diagnostics and model comparison
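• The slide gives no detail here; a common approach (an assumption, not from the slide) is to hold out part of the data and compare candidate models on an error measure such as the mean squared error. A minimal sketch:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

    # Hold out 30% of the data so models are judged on observations they never saw
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model_a = LinearRegression().fit(X_train[:, :1], y_train)  # uses only 1 variable
    model_b = LinearRegression().fit(X_train, y_train)         # uses all 3 variables

    print(mean_squared_error(y_test, model_a.predict(X_test[:, :1])))
    print(mean_squared_error(y_test, model_b.predict(X_test)))  # lower error is better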
Step 6: Presenting findings and building applications on top of them
• After building a well-performing model, it's important to present the findings
to stakeholders. This stage often requires automating model predictions or
creating tools to update reports and presentations. Automation helps avoid
repeating manual tasks. Finally, soft skills are crucial for effectively
communicating insights, as it's essential to ensure that stakeholders
understand and value your work.