Challenges and Scope of Data Science Project
Challenges of a data science project:
• Dirty data (36% reported)
• Lack of data science talent (30%)
• Company politics (27%)
• Lack of clear question (22%)
• Data inaccessible (22%)
• Results not used by decision makers (18%)
• Explaining data science to others (16%)
• Privacy issues (14%)
• Lack of domain expertise (14%)
• Organization too small to afford a data science team (13%)
• Some other challenges:
• 1. Misconception About the Role –
• In big corporations, a Data Scientist is often regarded as a jack of all trades who is assigned the tasks of getting the data, building the model, and making the right business decisions, which is a big ask for any individual. In a Data Science team, the role should be split among different specialists, such as Data Engineering, Data Visualization, Predictive Analytics, model building, and so on.
• The organization should be clear about its requirements and specify the tasks the Data Scientist needs to perform, without putting unrealistic expectations on the individual. Though a Data Scientist possesses most of the necessary skills, distributing the tasks ensures smooth operation of the business. Thus a clear description and communication of the role are necessary before anyone starts working as a Data Scientist in the company.
• 2. Understanding the Right Metric and the KPI –
• Due to the lack of understanding among most stakeholders about the role of a Data Scientist, they expect them to wave a magic wand and solve every business problem without a hitch, which is never the case. Every business should have the right metric in sync with its objectives: the metric provides the parameters for evaluating the performance of the predictive model, while the Key Performance Indicators (KPIs) point the business to the areas it needs to improve.
• A Data Scientist could build a model and get high accuracy, only to realize that the metric used doesn't help the business at all.
Every company has different parameters or metrics to measure its performance, and thus defining them with clarity before starting any Data Science work is key. The metrics and KPIs should be identified, laid out, and communicated to the Data Scientist, who would then work accordingly.
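As a sketch of why the metric choice matters, the toy example below (hypothetical numbers, plain Python) shows a naive model on imbalanced data scoring high accuracy while catching none of the positive cases, so accuracy alone would mislead the business:

```python
# Imbalanced data: a model that always predicts the majority class
# looks accurate but is useless on the cases the business cares about.

y_true = [0] * 95 + [1] * 5          # 95 negatives, 5 positives (e.g. fraud)
y_pred = [0] * 100                   # naive model: always predicts "no fraud"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)      # fraction of real positives caught

print(f"accuracy: {accuracy:.2f}")   # 0.95 -- looks great
print(f"recall:   {recall:.2f}")     # 0.00 -- catches no fraud at all
```

If the business KPI is "fraud caught", recall (or a similar metric) must be agreed upon before modeling starts, exactly as the text above suggests.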
• 3. Lack of Domain Knowledge –
• This challenge applies more to a beginner Data Scientist in an organization than to someone with years of experience working in the same organization. Someone who is just starting out, or a fresh graduate, has all the statistical skills and techniques to play with the data, but without the right domain understanding it is difficult to get the right results. A person with domain knowledge knows what works and what doesn't, which is not the case for a newbie.
• Though domain expertise doesn't come overnight and takes time spent working in a particular domain, one could, however, take up datasets across various domains and try to apply their Data Science skills to solve the problems. In doing so, the person would get accustomed to data across various domains and get an idea of the variables or features that are generally used.
• 4. Setting up the Data Pipeline –
• In the modern world, we don't deal with megabytes of data anymore; instead we deal with terabytes of unstructured data generated from a multitude of sources. This data is voluminous, and traditional systems are incapable of handling such quantities. Hence frameworks such as Hadoop and Spark came into the picture, which store data across parallel clusters and process it there.
• Thus for batch or real-time data processing, it is
necessary that the data pipeline is properly set up
beforehand to allow the continuous flow of data from
external sources to the big data ecosystem which
would then enable Data Scientists to use the data and
process it further.
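As an illustrative sketch only (hypothetical stage names and toy in-memory data; a real pipeline would use tools such as Hadoop, Spark, or a workflow scheduler), the extract-transform-load flow described above can be expressed as chained Python generators:

```python
# Minimal batch-pipeline sketch: extract -> transform -> load.

def extract(rows):
    """Pull raw records from an external source (here: an in-memory list)."""
    for row in rows:
        yield row

def transform(records):
    """Clean each record: strip whitespace, drop empty values."""
    for rec in records:
        cleaned = {k: v.strip() for k, v in rec.items() if v and v.strip()}
        if cleaned:
            yield cleaned

def load(records, sink):
    """Write cleaned records to a sink Data Scientists can query."""
    for rec in records:
        sink.append(rec)

raw = [{"user": " alice ", "event": "click"}, {"user": "", "event": "  "}]
warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)   # [{'user': 'alice', 'event': 'click'}]
```

Setting this flow up beforehand, whatever the actual tooling, is what lets data move continuously from external sources to the analytics environment.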
• 5. Getting the Right Data –
• Quality over quantity is the call of the hour in this case. A Data Scientist's role involves understanding the question asked and answering it by analyzing the data with the right tools and techniques. Once the requirement is clear, it's time to get the right data. There is no shortage of data in the present analytics ecosystem, but having plenty of data without much relevance would lead to a model that fails to solve the actual business problem.
• Thus, to build an accurate model that works well for the business, it is necessary to get the right data with the most meaningful features in the first instance. To overcome this data issue, the Data Scientist needs to communicate with the business to get enough data and then use domain understanding to get rid of irrelevant features. This is a backward elimination process, but one that comes in handy on most occasions.
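One simple way to sketch that elimination step (hypothetical data, correlation threshold, and column names; real projects would combine this with domain judgment) is to drop features that are weakly correlated with the target:

```python
# Relevance-based feature elimination: keep only features whose
# Pearson correlation with the target exceeds a chosen threshold.
import statistics

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

target = [10, 20, 30, 40, 50]
features = {
    "ad_spend":  [1, 2, 3, 4, 5],    # strongly related to the target
    "noise_col": [7, 1, 9, 2, 6],    # irrelevant feature
}

kept = {name: col for name, col in features.items()
        if abs(pearson(col, target)) >= 0.5}
print(sorted(kept))   # ['ad_spend']
```

Correlation is only a first filter; domain understanding, as the text stresses, decides which borderline features actually matter.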
• 6. Proper Data Processing –
• A Data Scientist spends most of their time pre-processing the data to make it ready for building a model. It is often a hectic task that includes cleaning the data, removing outliers, encoding the variables, and so on. Unlike in hackathons or boot camps, real-life data is generally pretty unclean and requires a lot of data wrangling using different techniques.
• The drawback is that a model built on dirty data behaves strangely when tested on unseen data. Suppose the data has a lot of outliers or noise that are not removed and you train a model on it; the model would learn by heart all the unnecessary patterns in the data, resulting in high variance. This high variance would cause the model to generalize poorly and perform badly on a new data set. No wonder Data Scientists spend eighty percent of their time just cleaning the data and making it ready.
• To overcome this pre-processing issue, a Data Scientist should put effort into identifying all possible anomalies that could be present in the data and come up with solutions to get rid of them. Once that is done, the model would be trained on clean data, allowing it to generalize well to the patterns and perform well on the test data.
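One such anomaly check, sketched in plain Python on illustrative numbers, is flagging outliers with the interquartile-range (IQR) rule before training:

```python
# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
import statistics

values = [10, 12, 11, 13, 12, 11, 10, 250]   # 250 is an obvious outlier

quartiles = statistics.quantiles(values, n=4)  # [Q1, median, Q3]
q1, q3 = quartiles[0], quartiles[2]
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

cleaned = [v for v in values if low <= v <= high]
print(cleaned)   # the 250 reading is removed; the rest survive
```

Removing such points before fitting is one concrete way to reduce the high-variance behavior described above.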
• 7. Choosing the Right Algorithm –
• This is a subjective challenge, as there is no single algorithm that works best on every dataset. If there is a linear relationship between the features and the target variable, one generally chooses linear models such as Linear Regression or Logistic Regression, while for non-linear relationships, tree-based models like Decision Tree, Random Forest, Gradient Boosting, etc., work better. Hence it is suggested to try different models on a dataset and evaluate them on the chosen metric. The one that minimizes the mean squared error, or has a greater area under the ROC curve, is eventually considered the go-to model. Moreover, ensemble models, i.e., combinations of different algorithms, generally provide better results.
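The try-several-and-compare approach can be sketched as follows (hypothetical toy data, plain Python): fit two candidate models and keep the one with the lower mean squared error:

```python
# Compare a mean-only baseline against a least-squares line by MSE.
import statistics

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]        # roughly linear in x

def mse(pred, actual):
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)

# Candidate 1: baseline that always predicts the mean of y.
mean_y = statistics.mean(y)
mse_baseline = mse([mean_y] * len(y), y)

# Candidate 2: least-squares line y = a*x + b.
mean_x = statistics.mean(x)
a = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
b = mean_y - a * mean_x
mse_linear = mse([a * xi + b for xi in x], y)

best = "linear" if mse_linear < mse_baseline else "baseline"
print(best)   # linear -- the linear model fits this data far better
```

The same pattern extends to tree-based and ensemble models: evaluate each on the agreed metric and promote the winner.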
• 8. Communication of the Results –
• Managers or stakeholders of a company are often unfamiliar with the tools and the inner workings of the models. They are required to make key business decisions based on the charts, graphs, or results communicated by a Data Scientist. Communicating the results in technical terms would not help much, as people at the helm would struggle to understand what is being said. Thus one should explain the findings in layman's terms and use the metrics and KPIs finalized at the start to present them. This would enable the business to evaluate its performance and conclude on what key grounds improvements have to be made for the growth of the business.
• 9. Data Security –
• Data Security is a major challenge in today's world. The plethora of interconnected data sources has made data susceptible to attacks from hackers. Thus Data Scientists struggle to get consent to use the data because of the uncertainty and vulnerability that cloud it. Following global data protection regulations is one way to ensure data security. Cloud platforms or additional security checks could also be implemented. Additionally, Machine Learning itself could be used to protect against cyber-crimes or fraudulent behavior.
Scope of the data science project
• Future Scope of Data Science
• Let's have a look at a few factors that point to data science's future, demonstrating compelling reasons why it is crucial to today's business needs.
• Companies’ Inability to handle data
• Data is regularly collected by businesses and companies from transactions and website interactions. Many companies face a common challenge: analyzing and categorizing the data that is collected and stored. A data scientist becomes the savior in a situation of mayhem like this. Companies can progress a lot with proper and efficient handling of data, which results in higher productivity.
• Revised Data Privacy Regulations
• Countries of the European Union witnessed the passing of the General Data Protection Regulation (GDPR) in May 2018. A similar data protection regulation, the California Consumer Privacy Act, takes effect in 2020. This will create co-dependency between companies and data scientists around the need to store data adequately and responsibly. These days, people are generally more cautious and alert about sharing data with businesses and giving up a certain amount of control to them, as there is rising awareness of data breaches and their malefic consequences. Companies can no longer afford to be careless and irresponsible about their data. The GDPR will ensure some amount of data privacy in the coming future.
• Data Science is constantly evolving
• Career areas that carry no growth potential run the risk of stagnating. This means that a field needs to constantly evolve and change for opportunities to arise and flourish in the industry. Data science is a broad career path that is undergoing such developments and thus promises abundant opportunities in the future. Data science job roles are likely to become more specific, which in turn will lead to specializations in the field. People inclined towards this stream can exploit their opportunities and pursue what suits them best through these specializations.
• Source: https://www.analyticsvidhya.com/blog/2016/02/complete-tutorial-learn-data-science-scratch/