Data Science Process Stages Lecture 2

Data Science involves manipulating data to uncover hidden patterns, with roles such as Data Scientists, Data Engineers, and Machine Learning Engineers contributing to the process. The Data Science Process Life Cycle includes steps like data collection, cleaning, exploratory analysis, model building, and deployment, all aimed at deriving insights and making predictions. Key components of Data Science include statistics, data engineering, and advanced computing, while challenges include data quality, bias, model interpretability, and ethical considerations.

Uploaded by

Saman

What is Data Science?

Data becomes valuable when we know how to manipulate it to uncover the hidden patterns in it.
The logic and process behind this manipulation is what is known as Data Science. The Data
Science process runs from formulating the problem statement and collecting data to extracting
the required results, and the professional who ensures that the whole process runs smoothly is
known as the Data Scientist. There are other job roles in this domain as well:
1. Data Engineers: They build and maintain data pipelines.
2. Data Analysts: They focus on interpreting data and generating reports.
3. Data Architects: They design data management systems.
4. Machine Learning Engineers: They develop and deploy predictive models.
5. Deep Learning Engineers: They create more advanced AI models to process complex data.
Data Science Process Life Cycle
Some steps are necessary for any task in the field of data science if we want to derive useful
results from the data at hand.
• Data Collection – After formulating the problem statement, the main task is to collect data
that can help us in our analysis and manipulation. Sometimes data is collected by conducting
a survey, and at other times it is collected by web scraping.
• Data Cleaning – Most real-world data is not structured and requires cleaning and
conversion into structured data before it can be used for any analysis or modeling.
• Exploratory Data Analysis – This is the step in which we try to find the hidden patterns in the
data at hand. We also analyze the different factors that affect the target variable and the
extent to which they do so. How the independent features are related to each other, and what
can be done to achieve the desired results, can be answered in this step as well. It also gives
us a direction in which to work when we start the modeling process.
• Model Building – Many machine learning algorithms and techniques have been developed that
can identify complex patterns in the data, a task that would be very tedious for a human to do
by hand.
• Model Deployment – After a model is developed and gives good results on the holdout or
real-world dataset, we deploy it and monitor its performance. This is the main part, where
what we have learned from the data is applied to real-world applications and use cases.
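The Data Cleaning step above can be illustrated with a minimal sketch. The records, field names, and cleaning rules below are hypothetical and chosen only for illustration; real projects usually need rules specific to the dataset.

```python
# Hypothetical raw survey records: inconsistent casing, stray whitespace,
# missing values, and a duplicate entry.
raw_records = [
    {"name": " Alice ", "age": "34", "city": "lahore"},
    {"name": "Bob", "age": "", "city": "Karachi"},
    {"name": "alice", "age": "34", "city": "Lahore"},
    {"name": "Carol", "age": "n/a", "city": " ISLAMABAD"},
]


def clean_record(record):
    """Return a normalized copy of one record; age becomes an int or None."""
    age_text = record["age"].strip()
    return {
        "name": record["name"].strip().title(),
        "age": int(age_text) if age_text.isdigit() else None,
        "city": record["city"].strip().title(),
    }


def clean_dataset(records):
    """Normalize every record and drop duplicates, keeping the first one seen."""
    seen = set()
    cleaned_rows = []
    for record in records:
        row = clean_record(record)
        key = (row["name"], row["age"], row["city"])
        if key not in seen:
            seen.add(key)
            cleaned_rows.append(row)
    return cleaned_rows


cleaned = clean_dataset(raw_records)
for row in cleaned:
    print(row)
```

Missing ages are kept as None rather than guessed, so a later step can decide whether to drop or impute them.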

[Figure: Data Science Process Life Cycle]

Key Components of Data Science Process
Data Science is a very vast field. To get the best out of the data at hand, one has to apply
multiple methodologies and use different tools, while making sure that the integrity of the data
remains intact throughout the process and keeping data privacy in mind. The main components of
Data Science are:
• Data Analysis – Deriving patterns from data does not always require advanced deep learning
or other complex methods. For this reason, before moving on to the modeling part, we first
perform an exploratory data analysis to get a basic idea of the data and the patterns in it.
This gives us a direction to work in if we later want to apply more complex analysis methods
to our data.
• Statistics – Many real-life datasets approximately follow a normal distribution. When we
already know that a particular dataset follows some known distribution, many of its properties
can be analyzed at once. Descriptive statistics, along with the correlation and covariance
between two features of the dataset, also help us understand how one factor is related to
another.
• Data Engineering – When we deal with a large amount of data, we have to make sure that the
data is kept safe from online threats and that it is easy to retrieve and modify. Data
Engineers play a crucial role in ensuring that the data is used efficiently.
• Advanced Computing
o Machine Learning – Machine Learning has opened new horizons, helping us build
advanced applications and methodologies so that machines become more efficient,
provide a personalized experience to each individual, and perform in an instant
tasks that earlier required heavy human labor and a great deal of time.
o Deep Learning – This is also a part of Artificial Intelligence and Machine Learning,
but it uses more advanced models than classical machine learning. High computing
power and huge corpora of data have led to the emergence of this field in data science.
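The covariance and correlation mentioned under Statistics can be computed in a few lines. The sketch below uses only the Python standard library; the two feature columns (`hours` and `scores`) are invented for illustration.

```python
import statistics

# Hypothetical feature columns: hours studied and exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]


def sample_covariance(xs, ys):
    """Sample covariance: average product of deviations, divided by n - 1."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


def pearson_correlation(xs, ys):
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


print("covariance:", sample_covariance(hours, scores))
print("correlation:", round(pearson_correlation(hours, scores), 4))
```

Covariance depends on the units of both features, while correlation is unit-free, which is why correlation is usually the easier of the two to compare across feature pairs.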

Knowledge and Skills for Data Science Professionals

Becoming proficient in Data Science requires a combination of skills, including:

• Statistics: Wikipedia defines it as the study of the collection, analysis, interpretation,
presentation, and organization of data. It should therefore be no surprise that data scientists
need to know statistics.
• Programming Language R/Python: Python and R are among the languages most widely used
by Data Scientists. The primary reason is the number of packages available for numeric and
scientific computing.
• Data Extraction, Transformation, and Loading: Suppose we have multiple data sources such as
a MySQL database, MongoDB, and Google Analytics. We have to extract data from these sources
and then transform it into a proper format or structure for querying and analysis. Finally,
we have to load the data into the Data Warehouse, where it will be analyzed. For people with
an ETL (Extract, Transform, and Load) background, Data Science can therefore be a good career
option.
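The Extract, Transform, and Load flow described above can be sketched end to end with the standard library. This is only an illustration under stated assumptions: an in-memory CSV string and a Python list stand in for the real sources, an in-memory SQLite database stands in for the Data Warehouse, and all table and column names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sources standing in for an operational database export
# and an analytics feed.
ORDERS_CSV = """order_id,customer,amount
1,alice,120.50
2,bob,80.00
3,alice,42.25
"""
WEB_EVENTS = [
    {"customer": "alice", "visits": 14},
    {"customer": "bob", "visits": 3},
]


def extract():
    """Read raw rows from both sources."""
    orders = list(csv.DictReader(io.StringIO(ORDERS_CSV)))
    return orders, WEB_EVENTS


def transform(orders, events):
    """Aggregate order totals per customer and attach visit counts."""
    totals = {}
    for order in orders:
        name = order["customer"]
        totals[name] = totals.get(name, 0.0) + float(order["amount"])
    visits = {event["customer"]: event["visits"] for event in events}
    return [(name, round(total, 2), visits.get(name, 0))
            for name, total in sorted(totals.items())]


def load(rows, connection):
    """Write the transformed rows into the warehouse table."""
    connection.execute(
        "CREATE TABLE customer_summary "
        "(customer TEXT PRIMARY KEY, total_spent REAL, visits INTEGER)"
    )
    connection.executemany("INSERT INTO customer_summary VALUES (?, ?, ?)", rows)
    connection.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(transform(*extract()), warehouse)
summary = warehouse.execute(
    "SELECT customer, total_spent, visits FROM customer_summary ORDER BY customer"
).fetchall()
print(summary)
```

Keeping the three stages as separate functions makes it possible to swap a source or the target store without touching the transformation logic.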
Steps for Data Science Processes:

Step 1: Define the Problem and Create a Project Charter

Clearly defining the research goals is the first step in the Data Science Process. A project
charter outlines the objectives, resources, deliverables, and timeline, ensuring that all
stakeholders are aligned.
Step 2: Retrieve Data
Data can be stored in databases, data warehouses, or data lakes within an organization. Accessing
this data often involves navigating company policies and requesting permissions.
Step 3: Data Cleansing, Integration, and Transformation
Data cleaning removes or corrects errors and inconsistencies and deals with outliers. Data
integration combines datasets from different sources, while data transformation prepares the
data for modeling by reshaping variables or creating new features.
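As a rough sketch of integration and transformation, the example below joins two small datasets on a shared key and then derives new features. The datasets, the `customer_id` key, and `REFERENCE_YEAR` are all hypothetical.

```python
# Hypothetical datasets from two sources, keyed by customer_id.
customers = [
    {"customer_id": 1, "signup_year": 2019},
    {"customer_id": 2, "signup_year": 2021},
    {"customer_id": 3, "signup_year": 2022},
]
purchases = [
    {"customer_id": 1, "amount": 200.0},
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 2, "amount": 50.0},
]
REFERENCE_YEAR = 2024  # assumed "current" year for the derived tenure feature


def integrate(customer_rows, purchase_rows):
    """Left-join purchases onto customers by customer_id."""
    amounts = {}
    for purchase in purchase_rows:
        amounts.setdefault(purchase["customer_id"], []).append(purchase["amount"])
    return [
        {**customer, "amounts": amounts.get(customer["customer_id"], [])}
        for customer in customer_rows
    ]


def transform(rows):
    """Derive model-ready features from the integrated rows."""
    return [
        {
            "customer_id": row["customer_id"],
            "tenure_years": REFERENCE_YEAR - row["signup_year"],
            "total_spent": sum(row["amounts"]),
            "has_purchased": int(bool(row["amounts"])),
        }
        for row in rows
    ]


features = transform(integrate(customers, purchases))
for feature_row in features:
    print(feature_row)
```

A left join is used so that customers without purchases are kept, with a zero total, instead of silently disappearing from the dataset.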
Step 4: Exploratory Data Analysis (EDA)
During EDA, various graphical techniques like scatter plots, histograms, and box plots are used to
visualize data and identify trends. This phase helps in selecting the right modeling techniques.
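The numbers behind a histogram and a box plot can be computed without any plotting library. The sketch below is one possible way to do it with the standard library; the `values` column is invented, and the 1.5 x IQR rule is a common box-plot convention for flagging outliers, not something prescribed by these notes.

```python
import statistics
from collections import Counter

# Hypothetical numeric column.
values = [12, 15, 15, 18, 21, 22, 22, 23, 25, 29, 31, 48]


def five_number_summary(data):
    """Minimum, quartiles, and maximum: the values a box plot draws."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": min(data), "q1": q1, "median": median, "q3": q3, "max": max(data)}


def histogram(data, bin_width):
    """Count how many values fall in each bin, keyed by the bin's lower edge."""
    counts = Counter((value // bin_width) * bin_width for value in data)
    return dict(sorted(counts.items()))


def iqr_outliers(data):
    """Values beyond 1.5 * IQR from the quartiles (box-plot whisker rule)."""
    summary = five_number_summary(data)
    iqr = summary["q3"] - summary["q1"]
    low = summary["q1"] - 1.5 * iqr
    high = summary["q3"] + 1.5 * iqr
    return [value for value in data if value < low or value > high]


print(five_number_summary(values))
for lower_edge, count in histogram(values, 10).items():
    print(f"{lower_edge:>3}-{lower_edge + 9:<3} {'#' * count}")
print("outliers:", iqr_outliers(values))
```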
Step 5: Build Models
In this step, machine learning or deep learning models are built to make predictions or
classifications based on the data. The choice of algorithm depends on the complexity of the
problem and the type of data.
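As one small instance of model building, the sketch below fits a straight line by ordinary least squares. Simple linear regression is chosen here only because it is short; the notes do not prescribe a particular algorithm, and the training data is hypothetical.

```python
import statistics

# Hypothetical training data: one input feature and a numeric target.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [3.1, 4.9, 7.2, 8.8, 11.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(x_train, y_train)
print("slope, intercept:", model)
print("prediction for x = 6:", predict(model, 6.0))
```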
Step 6: Present Findings and Deploy Models
Once the analysis is complete, results are presented to stakeholders. Models are deployed into
production systems to automate decision-making or support ongoing analysis.

Tools for Data Science Process

Over time, the tools for performing different Data Science tasks have evolved to a great extent.
Software such as Matlab and Power BI, and programming languages such as Python and R, provide
many utility features that help us complete even the most complex tasks efficiently and within
a very limited time.
Usage of Data Science Process
The Data Science Process is a systematic approach to solving data-related problems and consists of
the following steps:
1. Problem Definition: Clearly defining the problem and identifying the goal of the analysis.
2. Data Collection: Gathering and acquiring data from various sources, followed by data
cleaning and preparation.
3. Data Exploration: Exploring the data to gain insights and identify trends, patterns, and
relationships.
4. Data Modeling: Building mathematical models and algorithms to solve problems and make
predictions.
5. Evaluation: Evaluating the model’s performance and accuracy using appropriate metrics.
6. Deployment: Deploying the model in a production environment to make predictions or
automate decision-making processes.
7. Monitoring and Maintenance: Monitoring the model’s performance over time and making
updates as needed to improve accuracy.
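For the Evaluation step, "appropriate metrics" for a binary classifier often include accuracy, precision, recall, and F1. The sketch below computes them from scratch; the labels and predictions are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


def safe_divide(numerator, denominator):
    """Return 0.0 instead of raising when a metric is undefined."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels."""
    c = confusion_counts(y_true, y_pred)
    precision = safe_divide(c["tp"], c["tp"] + c["fp"])
    recall = safe_divide(c["tp"], c["tp"] + c["fn"])
    return {
        "accuracy": safe_divide(c["tp"] + c["tn"], len(y_true)),
        "precision": precision,
        "recall": recall,
        "f1": safe_divide(2 * precision * recall, precision + recall),
    }


# Hypothetical holdout labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can be misleading on imbalanced datasets, which is why precision and recall are reported alongside it.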
Challenges in the Data Science Process

1. Data Quality and Availability: Data quality can affect the accuracy of the models developed
and therefore, it is important to ensure that the data is accurate, complete, and consistent.
Data availability can also be an issue, as the data required for analysis may not be readily
available or accessible.
2. Bias in Data and Algorithms: Bias can exist in data due to sampling techniques, measurement
errors, or imbalanced datasets, which can affect the accuracy of models. Algorithms can also
perpetuate existing societal biases, leading to unfair or discriminatory outcomes.
3. Model Overfitting and Underfitting: Overfitting occurs when a model is too complex and fits
the training data too well, but fails to generalize to new data. On the other hand, underfitting
occurs when a model is too simple and is not able to capture the underlying relationships in the
data.
4. Model Interpretability: Complex models can be difficult to interpret and understand, making it
challenging to explain the model's decisions and predictions. This can be an issue when it comes
to making business decisions or gaining stakeholder buy-in.
5. Privacy and Ethical Considerations: Data science often involves the collection and analysis of
sensitive personal information, leading to privacy and ethical concerns. It is important to
consider privacy implications and ensure that data is used in a responsible and ethical manner.
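Overfitting and underfitting, from the list above, can be made concrete by comparing training error with holdout error. The sketch below is an illustration on invented data: the true relation is roughly y = 2x with noise in the training set, and three deliberately simple models are compared. The model choices (a constant mean, a least-squares line, and a nearest-neighbor "memorizer") are assumptions made for the example.

```python
import statistics

# Hypothetical data: training targets are noisy, holdout targets follow y = 2x.
train = [(0.0, 1.5), (2.0, 2.5), (4.0, 9.5), (6.0, 10.5), (8.0, 17.5)]
holdout = [(1.5, 3.0), (3.5, 7.0), (5.5, 11.0), (7.5, 15.0)]


def fit_mean(data):
    """Too simple: always predicts the average training target."""
    mean_y = statistics.fmean([y for _, y in data])
    return lambda x: mean_y


def fit_line(data):
    """Least-squares straight line."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(data):
    """Too flexible: returns the target of the nearest training point."""
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]


def mse(model, data):
    """Mean squared error of a model on a dataset."""
    return statistics.fmean([(model(x) - y) ** 2 for x, y in data])


results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    fitted = fit(train)
    results[name] = (mse(fitted, train), mse(fitted, holdout))
    print(f"{name:<10} train MSE = {results[name][0]:.2f}  holdout MSE = {results[name][1]:.2f}")
```

On this data the memorizer has zero training error but a larger holdout error than the line, which is the pattern associated with overfitting, while the constant mean does poorly on both sets, which is the pattern associated with underfitting.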
