IDS Sec-1 CS1-CS8 Merged Slides
MODULE #1: INTRODUCTION
Dr. Shreyas Rao
BITS Pilani
Profile of Instructor
Dr. Shreyas Rao
• 18+ Years of Experience in IT, Teaching and Research
• B.E from VTU, M.S in Software Systems from BITS (WILP) and PhD from MAHE
• Worked as Business Analyst and Team Lead at SLK Software Services for 7 years
• COE member in AI&ML and COE member in Data Science (Govt. Sponsored for 1.2 Cr)
INTRODUCTION TO DATA SCIENCE
Profile of Instructor
Dr. Shreyas Rao
Consultant:
• ISRO-SAC (Ahmedabad) funded research project titled “Ontology Enabled Disaster
Management Web Service using Data Integration” as Technical Consultant. Deployed in
ISRO.
• Designed and developed ‘Dhriti’, a mental health resource chatbot that caters to the mental
health needs of people during Covid, from the COE in AI&ML, SCEM. The bot, released in the
Dakshina Kannada region of Karnataka, answers user queries in English, Kannada and Hindi,
and is deployed on the Web and Facebook Messenger channels.
Profile of Instructor
Dr. Shreyas Rao
Collaboration with Dept. of Health Innovation, Kasturba Hospital, MAHE
• Telemedicine effectiveness during Covid Wave-I at Kasturba Hospital, Manipal (Statistical
Analysis)
• Study on psychological implications of COVID-19 on Nursing professionals (Statistical
Analysis)
• Covid prediction using Patient Discharge Data (Deep Learning)
Dept. of Psychology, Montfort College:
• AI enabled tool for juvenile self-transformation (Mental Health domain, Deep Learning &
NLP)
TABLE OF CONTENTS
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE VS. BUSINESS INTELLIGENCE
5 DATA SCIENTIST
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 DATA SCIENCE CHALLENGES
COURSE STRUCTURE
TEXT AND REFERENCE BOOKS
TEXT BOOKS
T1 Introducing Data Science by Cielen, Meysman and Ali
T2 Storytelling with Data: A Data Visualization Guide for Business Professionals,
by Cole Nussbaumer Knaflic; Wiley
T3 Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar
REFERENCE BOOKS
R1 The Art of Data Science by Roger D Peng and Elizabeth Matsui
R2 Ethics and Data Science by DJ Patil, Hilary Mason, Mike Loukides
R3 Python Data Science Handbook: Essential tools for working with data by Jake
VanderPlas
R4 KDD, SEMMA and CRISP-DM: A Parallel Overview, by Ana Azevedo and M.F.
Santos, IADS-DM, 2008
CANVAS
Most relevant and up-to-date info on:
• Course Handout
• Schedule for Webinar, Quiz, and Assignments [By 19-Nov-22]
• Lecture Slides
• Quiz
• Assignment
Evaluation
1. EC1- 30 marks
• Three quizzes (5 marks each) -10 marks (best 2 will be considered)
• One assignment - 20 marks
2. EC2 [Mid Term Exam] – 30 marks
3. EC3 [Comprehensive Exam] – 40 marks
WHAT IS SCIENCE?
Science is the systematic study of the structure and behavior of the world (phenomena)
through observation, experimentation and measurement.
Prefixes to ‘Science’
DATA SCIENCE
DATA SCIENCE – AN INTERDISCIPLINARY FIELD
DATA SCIENCE – MULTIPLE DISCIPLINES
WHY DATA SCIENCE?
WHY DATA SCIENCE?
• In India, the average salary of a data scientist as of January 2022 is Rs.10L/yr.
[Glassdoor, 2022].
• The rise of data science as a career choice in 2022 will also bring a rise in its
various job roles:
• Data Engineer
• Data Administrator
• Machine Learning Engineer
• Statistician
• Data and Analytics Manager
NEED FOR DATA SCIENCE – DIGITAL DATA DELUGE
https://www.retailtouchpoints.com/resources/digital-data-deluge-becomes-a-tsunami-due-to-covid-19
NEED FOR DATA SCIENCE
DATA SCIENCE, AI AND ML – CONVERGENCE
Artificial Intelligence
• AI involves making machines capable of mimicking human behavior, particularly
cognitive functions like facial recognition, automated driving, sorting mail based on
postal code.
Machine Learning
• Considered a sub-field of or one of the tools of AI.
• Involves providing machines with the capability of learning from experience.
• Experience for machines comes in the form of data.
Data Science
• Data science is the application of machine learning, artificial intelligence, and other
quantitative fields like statistics, visualization, and mathematics to uncover insights from
data and enable better decision making.
DATA SCIENCE, AI AND ML
https://www.sciencedirect.com/topics/physics-and-astronomy/artificial-intelligence
USE CASES OF DATA SCIENCE
DataFlair
DATA SCIENCE IN FACEBOOK
Social Analytics
Utilizes quantitative research to gain insights about the social interactions among
people.
Makes use of deep learning, facial recognition, and text analysis.
In facial recognition, it uses powerful neural networks to classify faces in the
photographs.
In text analysis, it uses “DeepText” to understand people’s interests and align
photographs with text.
It uses deep learning for targeted advertising.
Using the insights gained from data, it clusters users based on their preferences and
provides them with the advertisements that appeal to them.
DATA SCIENCE IN AMAZON
DATA SCIENCE IN AMAZON – CONTD.
DATA SCIENCE IN UBER
Improving Rider Experience
Uber maintains a large database of drivers, customers, and several other records.
It makes extensive use of Big Data and crowdsourcing to derive insights and provide the
best service to its customers.
Dynamic pricing
• Uses Big Data and data science to calculate fares based on specific parameters.
• Uber matches customer profile with the most suitable driver and charges them based on
the time it takes to cover the distance rather than the distance itself.
• The time of travel is calculated using algorithms that make use of data related to traffic
density and weather conditions.
• When demand (more riders) exceeds supply (fewer drivers), the price of the ride
goes up. [Rainy Season]
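The demand/supply logic above can be sketched as a toy function. This is an illustrative assumption only, not Uber's actual algorithm: the 0.5 slope, the 3.0 cap, and the per-minute rate are invented parameters.

```python
# Toy surge-pricing sketch (illustrative assumptions, not Uber's algorithm):
# the multiplier grows with the demand/supply ratio and is capped.

def surge_multiplier(riders: int, drivers: int, cap: float = 3.0) -> float:
    """Fare multiplier that rises when demand outstrips supply."""
    if drivers <= 0:
        return cap                      # no supply at all: maximum surge
    ratio = riders / drivers
    if ratio <= 1.0:
        return 1.0                      # supply meets demand: no surge
    return min(cap, 1.0 + 0.5 * (ratio - 1.0))

def fare(minutes: float, rate_per_min: float, riders: int, drivers: int) -> float:
    """Time-based fare (charged on travel time, per the slide) with surge."""
    return round(minutes * rate_per_min * surge_multiplier(riders, drivers), 2)

print(fare(30, 5.0, riders=100, drivers=100))   # balanced market
print(fare(30, 5.0, riders=300, drivers=100))   # rainy-day demand spike
```

Note the fare is computed from travel time, matching the slide's point that Uber charges for time rather than distance.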
DATA SCIENCE IN BANK OF AMERICA
Improving Customer Experience
Erica – a virtual financial assistant (BoA)
• Erica serves as a customer advisor to over 45 million users around the world.
• Erica makes use of Speech Recognition to take customer inputs.
Fraud detection
• Uses data science and predictive analytics to detect frauds in payments,
insurance, credit cards, and customer information.
Customer segmentation
• Segments customers into high-value and low-value segments.
• Data scientists make use of clustering, logistic regression, and decision trees to
help banks understand the Customer Lifetime Value (CLV) and group customers
into the appropriate segments.
• Customer segmentation helps in up-selling and cross-selling of products.
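As a minimal sketch of value-based segmentation, the toy code below estimates a crude CLV and splits customers on a threshold. The formula, names, and numbers are all invented for illustration; a real bank would cluster on many behavioral features, as described above.

```python
# Hypothetical value-based segmentation sketch: a crude CLV estimate and a
# threshold split into high-/low-value segments. Formula and data invented.

def clv(avg_monthly_revenue: float, expected_months: int,
        margin: float = 0.25) -> float:
    """Crude Customer Lifetime Value: revenue x lifetime x profit margin."""
    return avg_monthly_revenue * expected_months * margin

def segment(customers: dict, threshold: float) -> dict:
    """Label each customer 'high-value' or 'low-value' by CLV."""
    return {name: ("high-value" if value >= threshold else "low-value")
            for name, value in customers.items()}

values = {"alice": clv(500, 36), "bob": clv(80, 12)}   # 4500.0 vs 240.0
print(segment(values, threshold=1000))
```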
DATA SCIENCE IN AIRBNB
DATA SCIENCE IN SPOTIFY
DATA SCIENCE IN SPOTIFY – CONTD.
Spotify uses data science to gain insights about which universities had the highest
percentage of party playlists and which ones spent the most time on it.
“Spotify Insights” publishes information about ongoing trends in music.
Spotify’s Niland, an API based product, uses machine learning to provide better
searches and recommendations to its users.
Spotify analyzes listening habits of its users to predict the Grammy Award Winners.
DATA SCIENCE IN HEALTHCARE
Covid Patient Discharge Prediction (Dataset: 2nd Wave, April 2021 to June 2021)
Type of Project: Machine Learning
Dataset size: 1233 patients suffering from Covid
Variables:
X: Age, Gender, Co_morbid, Admit Date, Discharge date, days of stay,
covid_severity
Y: Discharge Type (Recovered, Expired)
Exploratory Data Analysis: Univariate, Bivariate, Multivariate
Models applied: Support Vector Machine, Naïve Bayes, Logistic Regression,
Decision Trees, KNN, ANN, Random Forest
Best Accuracy: Random Forest (92%)
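One of the models listed above, k-nearest neighbours, is simple enough to sketch from scratch. The records below are invented and only mimic the shape of the study's variables (age, days of stay, severity → discharge type); they are not the actual dataset, and real work would use scikit-learn on the full 1233 patients.

```python
# Minimal k-nearest-neighbours sketch on tiny synthetic records shaped like
# the study's variables. All data here is invented for illustration.

import math

def knn_predict(train, labels, x, k=3):
    """Classify x by majority vote among the k closest training rows."""
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    votes = [labels[i] for i in nearest[:k]]
    return max(set(votes), key=votes.count)

# [age, days_of_stay, covid_severity (0=mild .. 2=severe)]
X = [[25, 4, 0], [34, 6, 0], [70, 18, 2], [81, 21, 2], [60, 15, 1], [29, 5, 0]]
y = ["Recovered", "Recovered", "Expired", "Expired", "Recovered", "Recovered"]

print(knn_predict(X, y, [30, 5, 0]))   # young, short stay, mild case
print(knn_predict(X, y, [78, 20, 2]))  # elderly, long stay, severe case
```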
APPLICATIONS OF DATA SCIENCE
DataFlair
APPLICATIONS OF DATA SCIENCE
edureka.co
DATA SCIENCE VS. BUSINESS INTELLIGENCE
Business intelligence comprises the strategies and technologies used by enterprises for data analysis and
management of business information. One of the key BI components is the Data Warehouse.
DATA SCIENCE VS. BUSINESS INTELLIGENCE
DATA SCIENTIST VS. BI ANALYST
DATA SCIENCE VS. STATISTICS
• Statistics is the science of collecting, analyzing, presenting, and interpreting data. Its objective
is to draw conclusions about the population.
• The science of statistics enables Data Science.
• Data Science expands the application of statistics towards solving Big Data challenges.
• Data Science comprises the 4 As (data architecture, data acquisition, data analysis and data
archiving). The two types of statistics, ‘descriptive’ and ‘inferential’, are applied during the
‘Data Analysis’ phase of data science.
*Source - H. Hassani et al., “The science of statistics versus data science: What is the future?”, Technological
Forecasting and Social Change (Elsevier), Volume 173, 2021
DATA SCIENCE VS. STATISTICS
Aspect | Statistics | Data Science
Theoretical Origins | Mathematical biology and biometry | Statistics and Probability
Main Focus | Theoretical sophistication | Practical solutions to real problems
Main Approach | Methodology / model development and confirmation | Application of machine learning and data mining models
Focus of Model Building | Examination of correlations, causality between the variables | Hyperparameter optimization and feature selection
Interpretability vs Accuracy | High interpretability, low accuracy | High accuracy, low interpretability (XAI or Explainable AI)
DATA SCIENCE VS. STATISTICS
Aspect | Statistics | Data Science
Type of problem | Well structured [Survey – Likert scale data] | Semi-structured or unstructured
Source - H. Hassani et al., “The science of statistics versus data science: What is the future?”, Technological
Forecasting and Social Change (Elsevier), Volume 173, 2021
Data Mining vs Data Science
• Data Mining field started in 1989 as “Algorithms for Pattern Recognition”, later remodeled as a “Step in the KDD process”
• Data Mining is Goal-oriented and Process driven in nature!
• Understand the business goals first, then apply the DM process to arrive at a result!
• Process takes center stage!
• More of ‘mining’ the data to find insights using algorithms!
• The term Data Science was first coined in 1962, and was recast in 2007 as “Derive insights from big data for making
smarter decisions”
• Data Science is Data-oriented and Exploratory in nature!
• Data exploration may help define the business goals or insights and arrive at results!
• Data takes the center stage!
• More work in ‘exploring or searching’ data, than actual mining!
WHO IS A DATA SCIENTIST?
ROLE OF A DATA SCIENTIST
Learn how to draw insights out of data and communicate them effectively.
Data Science – Hierarchy of Needs
Differences between roles
Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at
scale. It lays the foundation for data analysis and is concerned with the security, reliability, fault tolerance,
scalability and efficiency of data processing systems.
SKILLS REQUIRED FOR A DATA SCIENTIST
A data scientist should be: Communicative, Qualitative, Curious, Technical, Creative, Skeptical
TOOLS AVAILABLE TO A DATA SCIENTIST
R, SQL, Python, Scala, SAS, Hadoop, Julia, Tableau, Weka
ALGORITHMS FOR A DATA SCIENTIST
Logistic Regression, Linear Regression, K-means Clustering, Decision Tree, SVM, ANN
SOFTWARE ENGINEERING
In general,
Software engineering is an engineering discipline that is concerned with all aspects of
software production.
Software includes computer programs, all associated documentation, and
configuration data that are needed for software to work correctly.
Waterfall model, Iterative models, Agile models
DATA SCIENCE PROCESS
SOFTWARE ENGINEERING VS. DATA SCIENCE
Software Engineering | Data Science
Concerned with creating useful applications | Involves collecting, analyzing and visualizing data
Software engineers use the SDLC process | Data scientists utilize the ETL (Extract, Transform, Load) process
Uses frameworks like Waterfall, Agile (Scrum, XP) | Methodologies like CRISP-DM, SMAM, SEMMA, Big Data Lifecycle, etc.
Software engineers use programming languages like C#, Java and web frameworks like Django, Flask | Data scientists use tools like Amazon S3, MongoDB, Hadoop, and MySQL
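The ETL process named above can be illustrated with a minimal, self-contained sketch: extract rows from CSV text, transform them (clean and derive a field), and load them into an in-memory SQLite table. All names and figures here are invented.

```python
# Minimal ETL sketch: Extract from CSV text, Transform (clean, cast, derive
# a fee column), Load into an in-memory SQLite table. Data is invented.

import csv, io, sqlite3

raw = "name,amount\nalice, 100 \nbob,250\n"

# Extract: parse the raw CSV into dict rows
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: strip whitespace, cast types, derive a 2% fee column
clean = [(r["name"].strip(), int(r["amount"]), int(r["amount"]) * 0.02)
         for r in rows]

# Load: insert into a relational table
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (name TEXT, amount INT, fee REAL)")
db.executemany("INSERT INTO payments VALUES (?, ?, ?)", clean)
print(db.execute("SELECT SUM(amount) FROM payments").fetchone()[0])
```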
DATA SCIENCE CHALLENGES
COGNITIVE BIAS
Cognitive biases are distortions of reality caused by the lens through which we
view the world. [Subjective vs objective view of reality]
Each of us sees things differently based on our preconceptions, past experiences,
cultural, environmental, and social factors. This doesn’t necessarily mean that the
way we think or feel about something is truly representative of reality.
References:
• https://data-flair.training/blogs/data-science-use-cases/
• https://www.northeastern.edu/graduate/blog/what-does-a-data-scientist-do/
• https://www.visual-paradigm.com/guide/software-development-process/what-is-a-software-process-model/
• https://www.sciencedirect.com/science/article/abs/pii/S0040162521005448
THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE #2: DATA ANALYTICS
IDS Course Team
BITS Pilani
TABLE OF CONTENTS
1 ANALYTICS
2 BIG DATA
3 DATA ANALYTICS
DEFINITION OF ANALYTICS – DICTIONARY
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
DEFINITION OF ANALYTICS – WEBSITES
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
DEFINING ANALYTICS
Analytics is the process of extracting and creating information from raw data by using
techniques such as:
• filtering, processing, categorizing, condensing and contextualizing the data.
Analytics is a broad term that encompasses the processes, technologies, frameworks
and algorithms to extract meaningful insights from data.
The information thus obtained is used to infer knowledge about the system, its users,
and its operations, to make the system smarter and more efficient.
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
GOALS OF DATA ANALYTICS
To predict something
• whether a transaction is a fraud or not [Banking]
• whether it will rain on a particular day [Weather Forecast]
• whether a tumor is benign or malignant [Cancer Prediction, Healthcare]
To find patterns in the data
• finding the top 10 coldest days in the year [Weather Forecast]
• which pages are visited the most on a particular website [Web Traffic Rank]
• finding the most searched celebrity in a particular year [Awards]
To find relationships in the data
• finding similar news articles [Bing, Google]
• finding similar patients in an electronic health record system [Healthcare]
• finding related products on an e-commerce website [Recommendation]
• finding correlation between news items and stock prices
* https://www.cnbc.com/2022/04/04/twitter-shares-soar-more-than-25percent-after-elon-musk-takes-9percent-stake-in-social-media-company.html
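As a small illustration of the "find patterns" goal (e.g., the top 10 coldest days of the year), the sketch below picks the N lowest-temperature records using only the standard library; the readings are made up.

```python
# Find-patterns sketch: top-N coldest days from (date, temperature) records,
# standard library only. The readings below are invented sample data.

import heapq

readings = [("2022-01-03", -2.5), ("2022-01-04", 1.0), ("2022-02-11", -6.0),
            ("2022-03-01", 4.2), ("2022-12-25", -4.1), ("2022-11-30", 0.3)]

def coldest_days(rows, n):
    """Return the n records with the lowest temperature, coldest first."""
    return heapq.nsmallest(n, rows, key=lambda r: r[1])

print(coldest_days(readings, 3))
```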
BIG DATA
BIG DATA – EXAMPLE
CHARACTERISTICS OF BIG DATA
The 5 V’s of Big Data: Volume, Velocity, Variety, Veracity, Value
CHARACTERISTICS OF BIG DATA
1 Volume
• Volume of data involved is so large that it is difficult to store, process and analyze data
on a single machine.
• Volumes of data generated by IT / IoT systems are growing exponentially.
• lowering costs of data storage and processing architectures [possible due to Cloud]
• need to extract valuable insights from the data to improve business processes, efficiency
and service to consumers.
2 Velocity
• Velocity of data refers to how fast the data is generated.
• High-velocity data causes the accumulated volume of data to become very large in a
short span of time.
• Need to consider parameters such as data provenance and accuracy
CHARACTERISTICS OF BIG DATA
3 Variety
• Variety refers to the forms / types of the data.
• Big data comes in different forms such as structured, unstructured or semi-structured,
including text data, image, audio, video and sensor data.
4 Veracity
• Veracity refers to how accurate the data is.
• To extract value from the data, the data needs to be cleaned to remove noise.
5 Value
• Value of data refers to the usefulness of data for the intended purpose.
• The value of the data is also related to the veracity or accuracy of the data.
• For some applications value also depends on how fast we are able to process the data.
[Static (Warehouse) vs Real Time (lecture)]
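A hedged sketch of the veracity point above: before data yields value, noisy readings (missing values, physically impossible outliers) must be cleaned. The sensor data and thresholds below are invented.

```python
# Veracity sketch: drop missing entries and impossible outliers from noisy
# temperature readings before computing anything from them. Data invented.

readings = [21.5, None, 22.0, -999.0, 21.8, 1000.0, 22.3]

def clean(values, lo=-50.0, hi=60.0):
    """Keep only present, physically plausible temperature readings."""
    return [v for v in values if v is not None and lo <= v <= hi]

good = clean(readings)
print(good)                                # [21.5, 22.0, 21.8, 22.3]
print(round(sum(good) / len(good), 2))     # mean of the cleaned readings
```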
DATA ANALYTICS
DESCRIPTIVE ANALYTICS
DESCRIPTIVE ANALYTICS EXAMPLE - I
DESCRIPTIVE ANALYTICS EXAMPLE - II
Paper - Healthcare Delivery through Telemedicine during the COVID-19 Pandemic: Case Study from a Tertiary Care Center in South India
https://pubmed.ncbi.nlm.nih.gov/33528313/
DESCRIPTIVE ANALYTICS
Techniques:
• Descriptive Statistics - histogram, correlation
• Data Visualization
• Exploratory Analysis [Seaborn Library in Python]
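The descriptive techniques above can be sketched with the standard library alone: summary statistics plus a hand-rolled Pearson correlation (in practice the pandas/Seaborn one-liners `df.describe()` and `df.corr()` would be used). The data below is invented.

```python
# Descriptive-statistics sketch: mean, median, and a hand-rolled Pearson
# correlation, standard library only. The two columns below are invented.

import statistics as st

ages   = [23, 35, 47, 52, 61]   # e.g., customer ages
visits = [2, 4, 5, 7, 8]        # e.g., weekly site visits

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = st.mean(x), st.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (st.pstdev(x) * st.pstdev(y) * len(x))

print(st.mean(ages), st.median(ages))
print(round(pearson(ages, visits), 3))   # strong positive correlation
```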
DIAGNOSTIC ANALYTICS
DIAGNOSTIC ANALYTICS EXAMPLE
What is the effect of global warming in the Southwest monsoon?
DIAGNOSTIC ANALYTICS
PREDICTIVE ANALYTICS
PREDICTIVE ANALYTICS EXAMPLE - I
PREDICTIVE ANALYTICS EXAMPLE - II
PREDICTIVE ANALYTICS
Techniques / Algorithms:
• Regression
• Classification
• ML algorithms like Linear regression, Logistic regression, SVM
• Deep Learning techniques
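The simplest algorithm on this list, linear regression, can be fitted with the closed-form least-squares solution. A minimal sketch on invented, exactly linear toy data:

```python
# Least-squares fit of y = slope*x + intercept, standard library only.
# The toy data below is invented and exactly linear for clarity.

def fit_line(x, y):
    """Return (slope, intercept) minimizing squared error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

hours = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]        # exactly linear: sales = 2*hours + 1
m, c = fit_line(hours, sales)

def predict(h):
    """Predict sales for a new value using the fitted line."""
    return m * h + c

print(m, c)          # 2.0 1.0
print(predict(6))    # 13.0
```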
PRESCRIPTIVE ANALYTICS
PRESCRIPTIVE ANALYTICS EXAMPLE - I
• Apollo Hospitals uses an AI tool to predict the risk of cardiovascular disease.
• The Apollo AI-powered “Cardiovascular Disease Risk” tool will help healthcare providers to predict
the risk of cardiac disease in their patients [Predictive Analytics]
• The prediction initiates intervention early enough to make a real difference. [Prescriptive]
• The cardiac risk scoring tool is remarkable for the speed in processing data and its accuracy at
predicting the probability of a patient developing coronary disease.
• Using the tool, physicians will be enabled to deliver proactive, pre-emptive and preventive care
for at-risk individuals, improving lives, while mitigating future pressure on healthcare systems.
https://www.apollohospitals.com/apollo-in-the-news/apollo-hospitals-has-launched-an-artificial-intelligence-tool-to-predict-the-risk-of-cardiovascular-disease/
PRESCRIPTIVE ANALYTICS EXAMPLE - II
How can we improve the crop production?
Types of Data Analytics
Types of Data Analytics – Exercise
Instagram Reels allows users to create fun videos and share with their contacts. Users can
record 15 second multi-clip videos with audio and effects. Some features include: exploring reels
based on subject; following, commenting and liking a reel; identifying trends to create new reels.
The reels are released in two versions – public (free for all), and premium (subscription basis).
Discuss the four analytical tasks that can be performed with respect to Instagram Reels
[Descriptive, Diagnostic, Predictive and Prescriptive].
Types of Data Analytics – Instagram Reels
Descriptive - How many followers do you have, how many views, comments, likes for your
video [free], audience breakdown by country, follower activity per hour [premium]
Diagnostic - Why your video’s engagement rate is low. [premium users]
Predictive - Trending topics for you to make video on - their approximate engagement rates
[premium]
Prescriptive - Tips to increase average watch time of your videos [premium]
COGNITIVE ANALYTICS
Cognitive Analytics – What I Don’t Know?
https://www.10xds.com/blog/cognitive-analytics-to-reinvent-business/
COGNITIVE ANALYTICS
• Next level of Analytics
• Human cognition is based on the context and reasoning.
• Cognitive systems mimic how humans reason and process.
• Cognitive systems analyze information and draw inferences using probability.
• They continuously learn from data and reprogram themselves.
• According to one source:
• ”The essential distinction between cognitive platforms and artificial
intelligence systems is that you want an AI to do something for you. A
cognitive platform is something you turn to for collaboration or for advice.”
COGNITIVE ANALYTICS
Benefits:
• Using Woebot led to significant reductions in anxiety and depression among people aged 18-28 years
old, compared to an information-only control group.
• 85% of participants used Woebot on a daily or almost daily basis.
DATA ANALYTICS – BASED ON DOMAIN
Types of analytics according to the domain
1 Marketing Analytics
2 Financial Analytics
3 Healthcare Analytics
4 Sports Analytics
5 HR Analytics
6 Customer Analytics
7 Web Analytics
8 Social Analytics
9 Political Analytics
Sports Analytics – Powerbat
Web Analytics – Google Analytics
DATA ANALYTICS – TYPE OF DATA
Geo Analytics – Location Intelligence
https://medium.com/loctruth/unlock-the-power-of-location-intelligence-c0cea20d5a06
DESCRIPTIVE ANALYTICS – EXAMPLE #1
Problem Statement:
“Market research team at Aqua Analytics Pvt. Ltd is assigned a task to identify the profile of a
typical customer for a digital fitness band that is offered by Titanic Corp. The market research
team decides to investigate whether there are differences across the usage patterns and
product lines.”
Data captured:
• Gender
• Age (in years)
• Education (in years)
• Relationship Status (Single or Partnered)
• Annual Household Income
• Average number of times the customer tracks activity each week
• Number of miles the customer expects to walk each week
• Self-rated fitness on a scale 1–5, where 1 is poor shape
DIAGNOSTIC ANALYTICS – EXAMPLE #1
Problem Statement:
“During the 1980s, General Electric was selling different products to its customers, such as
light bulbs, jet engines, windmills, and other related products. They also sold parts and
services separately: GE would sell you a product, you would use it until it needed repair,
either because of normal wear and tear or because it broke, and you would come back to
GE, which would then sell you the parts and services to fix it. GE’s model focused on how
much it was selling, in sales of operational equipment and in sales of parts and services,
and on what GE needed to do to drive up those sales.”
https://medium.com/parrotai/understand-data-analytics-framework-with-a-case-study-in-the-business-world-15bfb421028d
DIAGNOSTIC ANALYTICS – EXAMPLE #1
https://www.sganalytics.com/blog/change-management-analytics-adoption/
PREDICTIVE ANALYTICS – EXAMPLE
• Google launched Google Flu Trends (GFT), to collect predictive analytics regarding the
outbreaks of flu. It’s a great example of seeing big data analytics in action.
• So, did Google manage to predict influenza activity in real-time by aggregating search engine
queries with this big data and adopting predictive analytics?
• Even with a wealth of big data analytics on search queries, GFT overestimated the prevalence
of flu by over 50% in 2012-2013 and 2011-2012.
• They matched the search engine terms conducted by people in different regions of the world.
• And, when these queries were compared with traditional flu surveillance systems, Google found
that the predictive analytics of the flu season pointed towards a correlation with higher search
engine traffic for certain phrases.
PREDICTIVE ANALYTICS – EXAMPLE
https://www.slideshare.net/VasileiosLampos/usergenerated-content-collective-and-personalised-inference-tasks
PRESCRIPTIVE ANALYTICS
Whenever you go to Amazon, the site recommends dozens and dozens of products to
you. These are based not only on your previous shopping history (reactive), but also
based on what you’ve searched for online, what other people who’ve shopped for the
same things have purchased, and about a million other factors (proactive).
Amazon and other large retailers take descriptive, diagnostic, and predictive data
and run it through a prescriptive analytics system to find products that you
have a higher chance of buying.
Every bit of data is broken down and examined with the end goal of helping the
company suggest products you may not have even known you wanted.
https://accent-technologies.com/2020/06/18/examples-of-prescriptive-analytics/
HEALTHCARE ANALYTICS – CASE STUDY
Self study
https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/
https://www.youtube.com/watch?v=olpuyn6kemg
References:
Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
https://blog.hootsuite.com/tiktok-analytics/
THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE #3: DATA ANALYTICS - METHODOLOGIES
IDS Course Team
BITS Pilani
TABLE OF CONTENTS
1 DATA ANALYTICS
2 DATA ANALYTICS METHODOLOGIES
3 CRISP-DM
4 SEMMA
5 SMAM
6 BIG DATA LIFE-CYCLE
7 CHALLENGES IN DATA DRIVEN DECISION-MAKING
DATA ANALYTICS
DATA ANALYTICS METHODOLOGIES
NEED FOR A METHODOLOGY
DATA ANALYTICS METHODOLOGY
10 Questions the process aims to answer
Problem to Approach
1 What is the problem that you are trying to solve?
2 Are there available solutions to similar problems?
Working with Data
3 What data do you need to answer the question?
4 Where is the data coming from? Identify all Sources. How will you acquire it?
5 Is the data that you collected representative of the problem to be solved?
6 What additional work is required to manipulate and work with the data?
Delivering the Answer
7 In what way can the data be visualized to get to the answer that is required?
8 Does the model used really answer the initial question or does it need to be adjusted?
9 Can you put the model into practice?
10 Can you get constructive feedback into answering the question?
CRISP-DM
CRISP-DM Phases
Cross Industry Standard Process for Data Mining
Practitioners realized they needed a process that defines data mining steps applicable across any industry, such as Retail, E-Commerce and Healthcare.
Conceived by Daimler-Benz and Integral Solutions Ltd in 1996.
Six high-level phases.
CRISP-DM PHASES
CRISP-DM PHASES AND TASKS
WHY CRISP-DM?
Advantages and Disadvantages
Advantages:
• Clearly defined process (phases and tasks)
• Supports various data mining techniques
• Documented by several successful case studies following the approach
Disadvantages:
• Long and complicated process
• Blind hand-off from the Data Science team to IT, without envisioning the operationalization
• No real measure of ROI once all phases are completed
https://www.diva-portal.org/smash/get/diva2:1250897/FULLTEXT01.pdf
SEMMA
SEMMA
• SEMMA is a logical organization of the functional tool set of SAS Enterprise Miner
for carrying out the core tasks of data mining.
• Enterprise Miner is a Data Mining Software to create predictive and descriptive
models for large volumes of data.
• Enterprise Miner can be used as part of any iterative data mining methodology
adopted by the client. Naturally steps such as formulating a well defined business or
research problem and assembling quality representative data sources are critical to
the overall success of any data mining project.
• SEMMA is focused on the model development aspects of data mining.
• SEMMA overlaps with Data Preparation, Modelling and Evaluation phases of CRISP-DM
SEMMA STAGES
1. Sample
• Sampling the data by extracting a portion of a large data set big enough to contain the
significant information, yet small enough to manipulate quickly.
• Partitioning the data to create training and test samples.
• Identifying dependent and independent variables influencing the process.
2. Explore
• Exploration of the data by searching for unanticipated trends and anomalies in order to
gain understanding and ideas.
• Perform Univariate analysis (single variable) and multivariate analysis (relationships)
3. Modify
• Modification of the data by creating, selecting, and transforming the variables to focus
the model selection process.
SEMMA STAGES
4. Model
• Apply a variety of data mining techniques to produce a predictive model [ML, Deep Learning, Transfer Learning]
5. Assess
• Assessing the data by evaluating the usefulness and reliability of the findings from the data mining process, and estimating how well the model performs.
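The five SEMMA stages can be sketched end to end with scikit-learn. The synthetic dataset and the model choices below are assumptions for illustration; SAS Enterprise Miner is the tool SEMMA was designed around.

```python
# A minimal SEMMA walk-through on a synthetic dataset.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                      # Sample: a manageable extract
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # invented target variable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_tr.mean(axis=0).round(2))                  # Explore: univariate statistics

scaler = StandardScaler().fit(X_tr)                # Modify: transform the variables
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_tr_s, y_tr)                            # Model: fit a mining technique

acc = accuracy_score(y_te, model.predict(X_te_s))  # Assess: reliability of findings
print(f"accuracy = {acc:.2f}")
```

Each line maps onto one SEMMA stage; in a real project the Explore and Modify steps would be far more elaborate.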
Advantages and Disadvantages
Advantages:
• Focuses only on the modeling aspects of data mining
• Useful in most Machine Learning projects where data comes from a single data source
Ex: Pima Indian Diabetes Dataset [Predict Diabetes], Titanic Dataset [Predict Passenger Survival] from Kaggle
Disadvantages:
• Does not take into account the business understanding of a problem
• Disregards data collection and processing from different data sources
https://www.diva-portal.org/smash/get/diva2:1250897/FULLTEXT01.pdf
SEMMA – Case Study
Covid Patient Discharge Prediction (Dataset: 2nd Wave, April 2021 to June 2021)
Type of Project: Machine Learning
1. Sample: Dataset of 1233 patients suffering from Covid
2. Explore: Univariate (null values, mean, basic statistics), Bivariate (correlation – Pearson, chi-square)
3. Modify : PCA (Principal Component Analysis)
4. Model : Feature Engineering, Subset selection
Final Variables:
X: Age, Gender, Co_morbid, Admit Date, Discharge date, days of stay, covid_severity
Y: Discharge Type (Recovered, Expired)
Models applied: Support Vector Machine, Naïve Bayes, Logistic Regression, Decision Trees, KNN, ANN,
Random Forest
5. Assess : Best Accuracy: Random Forest (92%)
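The case-study pipeline can be re-created in miniature. The column names and generated records below are invented stand-ins for the actual patient data, shown only to make the Explore and Assess steps concrete.

```python
# Hypothetical re-creation of the case-study steps; not the real dataset.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 90, n),
    "covid_severity": rng.integers(1, 4, n),     # 1 = mild .. 3 = severe
    "days_of_stay": rng.integers(1, 30, n),
})

# Explore (bivariate): Pearson correlation between numeric predictors
print(df.corr(method="pearson").round(2))

# Y: discharge type, made to depend on age and severity for illustration
df["expired"] = ((df["age"] > 70) & (df["covid_severity"] == 3)).astype(int)

# Model + Assess: 5-fold cross-validated accuracy of a Random Forest
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, df[["age", "covid_severity", "days_of_stay"]],
                         df["expired"], cv=5)
print(f"mean accuracy = {scores.mean():.2f}")
```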
SMAM
SMAM (Standard Methodology for Analytics Models)
http://www.datascienceassn.org/content/standard-methodology-analytical-models
SMAM PHASES
Phase – Description
Use-case identification – Selection of the ideal approach from a list of candidates
Model requirements gathering – Understanding the conditions required for the model to function
Data preparation – Getting the data ready for the modeling
Modeling experiments – Scientific experimentation to solve the business question
Insight creation – Visualization and dash-boarding to provide insight
Proof of Value: ROI – Running the model in a small-scale setting to prove the value
Operationalization – Embedding the analytical model in operational systems
Model life-cycle – Governance around model lifetime and refresh
SMAM Phases
Phase I - Use Case Identification
• Brainstorming of Business / Management / SMEs (Domain) / IT (Data Scientist)
teams
• Discussion revolves around:
• Business Needs
• Expert inputs on the domain
• Data Availability
• Analytical Model Complexity – time and effort
• Outcome: Selected Use Case and roadmap for next phases
SMAM Phases
Phase II – Model Requirements Gathering
• Involved parties include Business / End-users / Data Scientists / IT
• Preparation of Model Requirement Document
• Business requirements
• IT requirements
• End user requirements
• Data requirements
• Analytical model requirements
SMAM Phases
Phase III – Data Preparation
• Involved parties include IT / Data Administrators / DBA / Data Modelers and Data
Scientists
• Discussion on:
• Data Access
• Data Location
• Data Understanding
• Data Validation
• Data format [prepared by DBAs and consumed by Data Scientist]
• The process is agile; the data scientist tries out various approaches on smaller sets
and then may ask IT / DBAs to perform the required transformations at scale.
SMAM Phases
Phase IV – Modeling Experiments
• Data Scientist:
• Creates testable hypothesis [Prediction of heart disease]
• Model features [Identify X and Y variables]
• Creates Analytical Model [Regression / Classification / Clustering]
• Evaluates the Analytical Model
[Metrics – Accuracy, Precision, Sensitivity, Specificity etc.]
SMAM Phases
Phase V – Insight Creation
• Data Scientist:
• Analytical reporting [Inference] and Operational reporting [Prediction]
• Visualization and Dashboards
• Provide business usable insights
SMAM Phases
Phase VI – Proof of Value: ROI
• Quality of the analytical model is observed [Ex: Accuracy of the model is >90%]
• Analytical model is applied to new data and outcomes are measured to verify if
financially viable [for small POC].
• If ROI is positive for POC:
• Set up full-scale experiment with control groups
• Measure the model effectiveness
• Compute ROI and success criteria
• Involve Finance department / IT / End-users and Data Scientists in this phase
SMAM Phases
Phase VII – Operationalization
• The Data Scientist works with the IT department to make the model experimentation
repeatable and to hand the model over
• IT prepares the operational environment
• Integration with existing / legacy applications
• Possible software development as Mobile / Web App for end-user usage
SMAM Phases
Phase VIII – Model Lifecycle
• Involves maintenance of the analytical model in view of changing customer needs
• Two types of model changes:
a. Model Refresh – Model is trained with more recent data, leaving the model
structurally untouched
b. Model Upgrade – Initiated by availability of new data sources and a
business request to improve model performance.
• This phase involves the operations team, IT team, Data Scientists, DBAs and end-users
BIG DATA ANALYTICS LIFE CYCLE
Stage II : Data Identification
• Identify the datasets required for the project and their sources
• Guideline: Identify as many sources as possible, which help gain insights
• Sources can be internal / external to the enterprise
• Internal – Data marts, Data warehouses or operational systems
• External – Data within Blogs, websites etc.
Stage III : Data Acquisition and Filtering
• Data is gathered from all sources identified in the previous phase
• Data filtering is performed to remove corrupt / noisy data
• Corrupt – records with missing / nonsensical values / invalid data
types
• Create metadata, helps in data provenance, accuracy and quality
• Dataset size & structure
• Source information
• Date and time of creation
• Language specific information
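A minimal sketch of Stage III, assuming simple claim records. The field names and the rules for what counts as corrupt are illustrative assumptions.

```python
# Stage III sketch: filter corrupt records, then attach provenance metadata.
from datetime import datetime, timezone

raw = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": None},    # missing value        -> corrupt
    {"id": 3, "amount": -50.0},   # nonsensical negative -> corrupt
    {"id": 4, "amount": "80"},    # invalid data type    -> corrupt
]

clean = [r for r in raw
         if isinstance(r["amount"], float) and r["amount"] >= 0]

metadata = {
    "source": "claims-system-export",                      # source information
    "record_count": len(clean),                            # dataset size
    "created_at": datetime.now(timezone.utc).isoformat(),  # creation timestamp
}
print(clean)     # only record 1 survives the filter
print(metadata)
```

Keeping the metadata alongside the filtered dataset is what later enables provenance, accuracy and quality checks.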
Stage IV : Data Extraction
• Extract disparate data and transform it into a format that the underlying Big Data
solution can use for the purpose of data analysis.
Stage V : Data Validation and Cleansing
• Big Data solutions may receive redundant data across sources
• Redundancy can be used to interconnect datasets and fill in missing values
• The first value in Dataset B is validated against its corresponding value in Dataset A.
• The second value in Dataset B is not validated against its corresponding value in Dataset A.
• If a value is missing, it is inserted from Dataset A.
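The Dataset A / Dataset B validation described above can be sketched as follows; the patient IDs and the `dob` field are invented for illustration.

```python
# Stage V sketch: use redundancy across two sources to validate and fill gaps.
dataset_a = {"P1": {"dob": "1980-04-02"}, "P2": {"dob": "1975-11-30"}}
dataset_b = {"P1": {"dob": "1980-04-02"}, "P2": {"dob": None}}

for pid, record in dataset_b.items():
    reference = dataset_a.get(pid, {})
    if record["dob"] is None and "dob" in reference:
        record["dob"] = reference["dob"]   # fill the missing value from Dataset A
    elif record["dob"] != reference.get("dob"):
        print(f"{pid}: value disagrees with Dataset A, flag for review")

print(dataset_b["P2"]["dob"])  # filled in from Dataset A: 1975-11-30
```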
Stage VI : Data Aggregation and Representation
• Integrating multiple datasets together to arrive at unified view
• Involves joining datasets based on common fields such as ID or Date
• Semantics standardization (Ex: Surname and Last name – Same value
labeled differently in different datasets)
• Represent using standard data format (row-oriented database)
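A small pandas sketch of Stage VI, assuming two invented datasets that label the same value differently (Surname vs. Last name) and share an ID field.

```python
# Stage VI sketch: standardize semantics, then join on the common fields.
import pandas as pd

claims = pd.DataFrame({"id": [1, 2], "Surname": ["Rao", "Shetty"]})
policies = pd.DataFrame({"id": [1, 2], "Last name": ["Rao", "Shetty"],
                         "premium": [5000, 7200]})

# Same value labeled differently in the two datasets -> unify the column name
policies = policies.rename(columns={"Last name": "Surname"})

# Join on the common fields to arrive at a unified, row-oriented view
unified = claims.merge(policies, on=["id", "Surname"], how="inner")
print(unified)
```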
Stage VII : Data Analysis
• Perform EDA (Exploratory Data Analysis)
• Apply Analytics: Descriptive, Diagnostic, Predictive or Prescriptive
Stage VIII : Data Visualization
• Use tools to graphically visualize and communicate the insights to business users
• Present Dashboards
• Excel, Tableau, Power BI etc.
Stage IX : Utilization of Analysis Results
• Determining how and where the processed analysis data can be leveraged
• Results can be:
• Fed as input to enterprise systems (Customer analysis result fed into
OTT platform to assist recommendation)
• Refine the business process (Ex: Consolidate transportation routes as
part of supply chain process)
• Generate alerts (Send notification to users via Email or SMS about
impending events)
CASE STUDY: Background
• Company X is an Insurance Company that deals with health and home insurance
• The company has a ‘Claim Management System’ which contains the claim data,
incident photographs and claim notes
• The company wants to invest in Big Data Analytics to “detect fraudulent claims in the
building sector”
• Let us see how the company uses the ‘Big Data Analytics’ Lifecycle to achieve the
objective of ‘detecting fraudulent claims in the building sector’
* Building Insurance is a type of Home insurance that covers the structure of the house from any kinds of danger or risks
Case Study: Detect Fraudulent Claims
• The machine learning model was incorporated into the existing claim
processing system to flag fraudulent claims.
When to use what Methodology?
• Good model development documentation and case studies; suitable for both data mining and data science projects – CRISP-DM
• Model development is the priority; usable for a POC / MVP, but no deployment clarity – SEMMA
• Need proof of ROI before investment; need clarity on the division of roles and responsibilities of team members in project execution – SMAM
• Multiple data sources (provenance, quality aspects of data); integrate the model with existing systems (operationalization) – Big Data life-cycle
• Find the methodology constraining – customize it; e.g. IBM / Netflix / Google customize the big data lifecycle and CRISP-DM in many projects
DATA DRIVEN DECISION-MAKING
https://unscrambl.com/blog/data-driven-companies-examples/
CHALLENGES IN DATA DRIVEN DECISION-MAKING
1. Discrimination
• Algorithmic discrimination can come from various sources.
• Data used to train algorithms may have biases that lead to discriminatory decisions.
• Discrimination may arise from the use of a particular algorithm.
• Algorithms can result in discrimination as a result of misuse of certain models in different
contexts.
• Biased data can be used both as evidence for the training of algorithms and as evidence
of their effectiveness.
CHALLENGES IN DATA DRIVEN DECISION-MAKING
1. Racism embedded in US healthcare
In October 2019, researchers found that an algorithm used on more than 200
million people in US hospitals to predict which patients would likely need extra
medical care heavily favoured white patients over black patients. While race
itself wasn’t a variable used in this algorithm, another variable highly
correlated to race was, which was healthcare cost history. The rationale was
that cost summarizes how many healthcare needs a particular person has.
For various reasons, black patients incurred lower healthcare costs than white
patients with the same conditions on average.
CHALLENGES IN DATA DRIVEN DECISION-MAKING
2. Amazon’s hiring algorithm
Amazon’s one of the largest tech giants in the world. And so, it’s no surprise
that they’re heavy users of machine learning and artificial intelligence. In
2015, Amazon realized that their algorithm used for hiring employees was
found to be biased against women. The reason for that was because the
algorithm was based on the number of resumes submitted over the past ten
years, and since most of the applicants were men, it was trained to favor men
over women.
CHALLENGES IN DATA DRIVEN DECISION-MAKING
2. Lack of transparency
• Transparency refers to the capacity to understand a computational model and therefore
contribute to the attribution of responsibility for consequences derived from its use.
• A model is transparent if a person can easily observe it and understand it.
• Three types of opacity (i.e. lack of transparency) in algorithmic decisions
• Intentional opacity – The objective of this type of opacity is to protect the algorithm
inventors’ intellectual property.
• Knowledge opacity – This type of opacity is due to the fact that most people lack the
technical skills to understand how algorithms and computational models are constructed.
• Intrinsic opacity – This type of opacity arises from the nature of certain computer learning
methods (e.g. deep learning models).
https://philpapers.org/rec/BURHTM
CHALLENGES IN DATA DRIVEN DECISION-MAKING
3. Violation of privacy
• Misuse of users' personal data, and data aggregation by entities such as data
brokers, may have direct implications for people's privacy. [Google faced a lawsuit
for privacy violation in 2020 – selling data to 3rd-party companies]
4. Digital literacy
• Devote resources to digital and computer literacy programs, from children to the elderly.
• This enables society to make informed decisions about technologies it would otherwise
not understand. [Cases of cyberbullying among the juvenile population]
5. Fuzzy responsibility
• As more and more decisions that affect millions of people are made automatically by
algorithms, we must be clear about who is responsible for the consequences of these
decisions. Transparency is often considered a fundamental factor in the clarity of
attribution of responsibility.
CHALLENGES IN DATA DRIVEN DECISION-MAKING
6. Lack of ethical frameworks
• Algorithmic data-based decision-making processes generate important ethical dilemmas
regarding what actions are appropriate in light of the inferences made by algorithms.
• It is therefore essential that decisions be made in accordance with a clearly defined and
accepted ethical framework.
• There is no single method for introducing ethical principles into algorithmic decision
processes.
On March 18, 2018, at around 10 p.m., Elaine Herzberg was wheeling her bicycle
across a street in Tempe, Arizona, when she was struck and killed by a self-driving
car. Although there was a human operator behind the wheel, an autonomous
system—artificial intelligence—was in full control.
CHALLENGES IN DATA DRIVEN DECISION-MAKING
7. Lack of diversity
• Data-based algorithms and artificial intelligence techniques for decision-making have
been developed by homogeneous groups of IT professionals.
• Ensure that teams are diverse in terms of areas of knowledge as well as demographic
factors [interdisciplinary – teaching medical doctors data science for self-computation]
REFERENCES
https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome
https://documentation.sas.com/?docsetId=emref&docsetTarget=n061bzurmej4j3n1jnj8bbjjm1a2.htm&docsetVersion=14.3&locale=en
http://jesshampton.com/2011/02/16/semma-and-crisp-dm-data-mining-methodologies/
https://www.kdnuggets.com/2015/08/new-standard-methodology-analytical-models.html
https://medium.com/illumination-curated/big-data-lifecycle-management-629dfe16b78d
https://www.esadeknowledge.com/view/7-challenges-and-opportunities-in-data-based-decision-making-193560
THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE #3: DATA SCIENCE PROCESS
IDS Course Team
BITS Pilani
TABLE OF CONTENTS
1 DATA SCIENCE PROCESS
2 CASE STUDY
DATA SCIENCE PROCESS
10 Questions the process aims to answer
• Problem to Approach
1 What is the problem that you are trying to solve?
2 How can you use data to answer the questions? CRISP-DM approach
• Working with Data
3 What data do you need to answer the question?
4 Where is the data coming from? Identify all Sources. How will you acquire it?
5 Is the data that you collected representative of the problem to be solved?
6 What additional work is required to manipulate and work with the data?
• Delivering the Answer
7 In what way can the data be visualized to get to the answer that is required?
8 Does the model used really answer the initial question or does it need to be adjusted?
9 Can you put the model into practice?
10 Can you get constructive feedback into answering the question?
Source: CognitiveClass
DATA SCIENCE PROCESS - IBM
[Diagram: IBM Foundational Methodology for Data Science – Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment → Feedback, an iterative cycle]
HOSPITAL READMISSIONS
Image Source: https://medium.com/nwamaka-imasogie/predicting-hospital-readmission-using-nlp-5f0fe6f1a705
HOSPITAL READMISSIONS - SCENARIO
• Hospital Readmission is a common problem in the healthcare sector, wherein a patient after
discharge gets re-admitted to the hospital because of the following reasons:
• Medication errors
• Medication noncompliance by the patient
• Fall injuries
• Lack of timely follow-up care
• Inadequate Nutrition
• Inadequate discussion on palliative care [relief from suffering]
• Infection
• Failure to identify post-acute care needs etc.
• Hospital readmissions may bring a bad name to the hospital / treating doctor / support staff,
and lead to increased length of stay and expenditure for the hospital and the patient.
• Hence, it is a critical issue that needs addressing.
HOSPITAL READMISSIONS - SCENARIO
There is a limited budget for providing healthcare to the public.
Hospital readmissions for re-occurring problems are considered a sign of failure in the
healthcare system.
There is a dire need to properly address the patient's condition prior to the initial
discharge.
American Healthcare Insurance Provider, Health care authorities in the region & IBM Data
Scientists:
• What is the best way to allocate these funds to maximize their use in providing
quality care?
FROM PROBLEM TO APPROACH
[Diagram: the IBM methodology cycle, repeated here to highlight the Business Understanding and Analytic Approach phases]
CASE STUDY - 1. BUSINESS UNDERSTANDING
Examining hospital readmissions [Insurance Company + Hospitals + Data Scientists]
• Use Case 1: It was found that approximately 30% of individuals who finish rehab
treatment would be readmitted to a rehab center within one year.
• 50% would be readmitted within five years.
• Use Case 2: After reviewing some records, it was found that patients with heart failure
were high on the list of readmission [more frequently]
2. ANALYTIC APPROACH (CONCEPT)
Available data
• Patient data, Readmissions data, CHF data, etc.
How can we use data to answer the questions?
Choose the analytic approach based on the type of question:
• Descriptive (current data) – What happened?
• Diagnostic (statistical analysis) – Why is this happening?
• Predictive (forecasting) – What will happen next? What if these trends continue?
• Prescriptive – How do we solve it?
ANALYTIC APPROACH - DECISION TREE (CONCEPT)
What is a Decision Tree?
1. An algorithm that represents a set of questions and decisions using a tree-like
structure.
2. It provides a procedure for deciding which questions to ask and when to ask them
in order to predict the value of an outcome.
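A minimal decision-tree sketch, assuming scikit-learn. The readmission records below are invented, but the example shows both the predicted outcome and its likelihood, which is what the case study relies on.

```python
# Tiny decision tree: predict readmission from invented patient features.
from sklearn.tree import DecisionTreeClassifier

# X: [age, prior_admissions]; y: readmitted within a year (1) or not (0)
X = [[45, 0], [70, 3], [62, 2], [30, 0], [80, 4], [55, 1]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

patient = [[65, 2]]
print(tree.predict(patient))        # the predicted outcome (class label)
print(tree.predict_proba(patient))  # the likelihood of each outcome
```

The probabilities come from the proportion of each class in the leaf the patient falls into, which matches the slide's description of how likelihood is derived.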
CASE STUDY - 2. ANALYTIC APPROACH
A decision tree classification model was used to identify the combination of conditions
leading to each patient's outcome.
Examining the variables in each of the nodes along each path to a leaf led to a
threshold value for splitting the tree, e.g. Age >= 60.
A decision tree classifier provides both the predicted outcome and the likelihood of
that outcome, based on the proportion of the dominant outcome (yes or no) in each group.
FROM DATA REQUIREMENTS TO DATA COLLECTION
[Diagram: the IBM methodology cycle, repeated here to highlight the Data Requirements and Data Collection phases]
CASE STUDY - 3. DATA REQUIREMENTS
Data requirements for the case study included selecting a suitable list of patients from
the health insurance provider's member base.
In order to put together patient clinical histories,
three criteria were identified for selecting the
patient cohort. [Complete medical history]
1 A patient must be admitted as an in-patient
within health insurance provider’s service area.
2 Patient’s primary diagnosis should be CHF for
one full year.
3 Prior to the primary admission for CHF, a patient
must have had at least 6 months of continuous
enrollment.
CASE STUDY - 3. DATA REQUIREMENTS
Defining the data
The content and format suitable for decision tree classifier needs to be defined.
Format
• Transactional format
• This model requires one record per patient.
• Columns of the record represent dependent and independent variables.
Content
• To model the readmission outcome, data should represent all aspects of the patient’s
clinical history.
• This includes:
• Authorizations
• Primary, secondary and tertiary diagnoses,
• Procedures, prescriptions and other services provided during hospitalization or visits by
patients / doctors.
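The reshaping from transactional records to the one-record-per-patient format a classifier expects can be sketched with pandas; the patient IDs and event types below are hypothetical.

```python
# Pivot transactional records into one row per patient, with event counts
# serving as the independent variables.
import pandas as pd

transactions = pd.DataFrame({
    "patient_id": [101, 101, 102, 101, 102],
    "event": ["diagnosis", "procedure", "diagnosis", "prescription", "procedure"],
})

# One record per patient: count each event type as a separate column
features = pd.crosstab(transactions["patient_id"],
                       transactions["event"]).reset_index()
print(features)
```

In the real case study this step produced thousands of derived variables per patient, but the mechanics are the same.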
CASE STUDY - 3. DATA REQUIREMENTS
A given patient can have thousands of records that represent all their attributes.
The data analytics specialists collected the transaction records from patient records
and created a set of new variables to represent that information.
It was a task for the data preparation phase, so it is important to anticipate the next
phases.
4. DATA COLLECTION (CONCEPT)
The collected data is explored using descriptive statistics and visualization to assess
its content and quality.
CASE STUDY - 4. DATA COLLECTION
This case study also required other data that were not available:
• Pharmaceutical records
• Information on drugs
This data source was not yet integrated with the rest of the data sources.
In such situations,
• It is okay to postpone decisions about unavailable data and to try to capture them later.
• This can happen even after obtaining intermediate results from predictive modeling.
• If the results indicate that drug information may be important for a good model, you will
spend time trying to get it.
However, it turned out that they could build a reasonably good model without this
information about drugs.
Next Phase – Data Understanding
Data Pre-processing and Merging Data
• Database administrators and programmers
often work together to extract data from
different sources and then combine them.
• Redundant data can be deleted and made
available to the next level of methodology – the
”Data Understanding” phase.
• At this stage, scientists and analysts can
discuss ways to better manage their data by
automating certain database processes to
facilitate data collection
Next, we move on to understanding the data
F R OM D AT A U N D E R S T A N D I N G TO D AT A P R E PA R AT I O N
[Methodology diagram: Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment → Feedback, with feedback loops back to earlier stages]
5. D AT A U N D E R S T A N D I N G ( C O N C E P T S )
This section of the methodology answers the question:
• Is the data you collected representative of the problem to be solved?
Descriptive statistics
• Univariate statistics
• Pairwise correlations
• Histograms
Assess data quality
• Missing values
• Invalid data
• Misleading data
From the data collected, we should understand the variables and their characteristics
using Exploratory Data Analysis and Descriptive Statistics.
Sometimes we may have to perform pre-processing operations on the data.
C A S E S T U D Y - 5. D AT A U N D E R S T A N D I N G
First, Univariate Statistics
• Basic statistics included univariate statistics for
each variable, such as:
• mean, median, minimum, maximum,
standard deviation, detect outliers
Second, Pairwise Correlations
• Pairwise correlations were used to determine
the degree of correlation between the
variables.
• Variables that are highly correlated are essentially redundant.
• In such cases, only one of the correlated variables needs to be kept for
the modeling.
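These two steps can be sketched with pandas; the data and column names below are hypothetical, not from the case study:

```python
import pandas as pd

# Hypothetical patient-level data; column names are illustrative only.
df = pd.DataFrame({
    "age":            [63, 71, 58, 80, 67, 74],
    "num_admissions": [1, 3, 1, 4, 2, 3],
    "num_er_visits":  [2, 6, 2, 8, 4, 6],   # tracks num_admissions exactly
    "length_of_stay": [4, 9, 3, 12, 6, 8],
})

# Univariate statistics: mean, std, min, max and quartiles per variable.
print(df.describe())

# Pairwise correlations: values near +/-1 flag near-redundant variables.
corr = df.corr()
print(corr)

# Collect highly correlated pairs (|r| > 0.9); keep only one of each pair.
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.9]
print("Highly correlated pairs:", high)
```

Here `num_er_visits` is an exact multiple of `num_admissions`, so the pair shows up with correlation 1.0 and one of the two can be dropped.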
C A S E S T U D Y - 5. D AT A U N D E R S T A N D I N G
Third, Histograms
• Third, the histograms of the variables were
examined to understand their distributions.
• Histograms are a good way to understand how
values or variables are distributed.
• They help to know what kind of data
preparation may be needed to make the
variable more useful in a model.
• For example:
• If a categorical variable contains too many
different values to be meaningful in a model,
the histogram can help decide how to
consolidate those values.
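The consolidation idea above can be sketched in pandas; the department values are hypothetical:

```python
import pandas as pd

# Hypothetical categorical variable with a long tail of rare values.
s = pd.Series(["cardiology"] * 40 + ["general"] * 35 + ["oncology"] * 5
              + ["dermatology"] * 2 + ["ent"] * 1, name="department")

# A histogram of a categorical variable is just its value counts.
counts = s.value_counts()
print(counts)

# Consolidate rare categories (under 5% of records) into "other",
# so the variable stays meaningful in a model.
rare = counts[counts / len(s) < 0.05].index
consolidated = s.where(~s.isin(rare), "other")
print(consolidated.value_counts())
```

The 5% cutoff is a judgment call; the histogram (value counts) is what tells you where to draw it.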
C A S E S T U D Y - 5. D AT A U N D E R S T A N D I N G
Looking at data quality
• Univariate statistics and histograms are also used to assess the quality of the data.
• On the basis of the data provided, some values can be recoded or deleted if necessary.
• E.g., if a particular variable has a lot of missing values, we may drop the variable from the
model.
• Sometimes a missing value means ”no” or ”0” (zero), or sometimes simply ”we do not
know”.
A variable may contain invalid or misleading
values.
• E.g., A numeric variable called ”age”
containing 0 to 100 and 999, where ”triple-9”
actually means ”missing”, will be treated as a
valid value unless we have corrected it.
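A minimal sketch of both corrections, using made-up records with a 999 "age" sentinel and a mostly-missing column:

```python
import numpy as np
import pandas as pd

# Hypothetical records where age uses 999 as a "missing" sentinel.
df = pd.DataFrame({
    "age":    [45, 67, 999, 52, 999, 71],
    "income": [None, 52000, None, None, 61000, None],  # mostly missing
})

# Recode the sentinel so 999 is not treated as a valid age.
df["age"] = df["age"].replace(999, np.nan)

# Drop variables with too many missing values (the 50% threshold here
# is an illustrative judgment call).
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.5].index)

print(df)
```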
6. D AT A P R E PA R AT I O N ( C O N C E P T )
In a way, data preparation is like removing dirt and washing vegetables.
Compared to data collection and understanding, data preparation is the most time
consuming phase – 70% to 90% of overall project time.
Automating data collection and preparation can reduce this to about 50%.
The data preparation phase of the methodology answers the question:
• What are the ways in which data is prepared?
• Address missing or invalid values
• Remove duplicates
• Format data properly
Transforming data
• Process of getting data into a state where it may be easier to work with.
Feature Engineering
C A S E S T U D Y - 6. D ATA P R E PA R AT I O N
Source: CognitiveClass
6. D AT A P R E PA R AT I O N ( C O N C E P T )
Feature Engineering
• The process of using domain knowledge of the data to create
features that make ML algorithms work.
• A feature is a characteristic of the data that might help in solving the
problem.
• Feature engineering is also part of data preparation.
• The features in the data are important to the predictive
models and influence the desired results.
C A S E S T U D Y - 6. D AT A P R E PA R AT I O N
Data Scientists need clarification on domain terms for data preparation
C A S E S T U D Y - 6. D AT A P R E PA R AT I O N
Aggregating data to patient level
• A given patient could have hundreds or even thousands of records, depending on their
clinical history.
• All the transactional records were aggregated to the patient level, yielding a single
record for each patient.
• This is required for the decision-tree classification method used for modeling.
• Many new columns were created representing the information in the transactions.
• E.g: Frequency and most recent visits to doctors, clinics and hospitals with diagnoses,
procedures, prescriptions, and so forth.
• Co-morbidities with CHF were also considered, such as:
• Diabetes, hypertension, and many other diseases and chronic conditions that could impact
the risk of re-admission for CHF.
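The patient-level aggregation described above can be sketched with a pandas `groupby`; the transaction records and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical transaction records: many rows per patient.
tx = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 3],
    "visit_date": pd.to_datetime(
        ["2020-01-05", "2020-02-10", "2020-03-01",
         "2020-01-20", "2020-02-25", "2020-03-15"]),
    "diagnosis":  ["CHF", "diabetes", "CHF", "CHF", "hypertension", "CHF"],
})

# Aggregate to one record per patient: visit frequency, most recent visit,
# and indicator columns for co-morbidities such as diabetes/hypertension.
patient = tx.groupby("patient_id").agg(
    n_visits=("visit_date", "count"),
    last_visit=("visit_date", "max"),
    has_diabetes=("diagnosis", lambda d: int("diabetes" in set(d))),
    has_hypertension=("diagnosis", lambda d: int("hypertension" in set(d))),
).reset_index()

print(patient)
```

Each aggregate becomes a new column in the single per-patient record that the decision-tree classifier expects.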
F R OM D AT A M O D E L I N G TO E VA L U AT I O N
[Methodology diagram: Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment → Feedback, with feedback loops back to earlier stages]
7. D AT A M O D E L I N G ( C O N C E P T )
In what way can the data be visualized to get to the answer that is required?
Modeling is based on the analytic approach.
Data modeling focuses on developing models that are either descriptive or predictive.
• Descriptive Models
• What happened?
• Use statistics.
• Predictive Models
• What will happen?
• Use machine learning.
• Try to generate yes/no type outcomes.
• A training set is used for developing the predictive model.
• Training set
• Contains historical data in which the outcomes are already known. [Labeled data]
• Acts like a gauge to determine if the model needs to be calibrated.
7. D AT A M O D E L I N G ( C O N C E P T )
The data scientist will try different algorithms to ensure that the variables in play are
actually required.
Success of data collection, preparation and modeling depends on the understanding of
the problem and the analytic approach being taken.
Like the quality of ingredients in cooking, the quality of data sets the stage for the
outcome.
• If data quality is bad, the outcome will be bad.
Constant refinement, adjustment, and tweaking within each step are essential to
ensure a solid outcome.
The end goal is to build a model that can answer the original question.
• Model evaluation, deployment, and feedback loops ensure that the model is relevant and
the question is really answered.
7. D AT A M O D E L I N G – Concept of Confusion Matrix
Since Data Modeling for the case study involves the concepts of ‘Confusion
Matrix’ and ‘ROC’, let us understand the concepts.
C A S E S T U D Y - 7. D AT A M O D E L I N G
A decision tree to predict CHF readmission is built.
In this first model, the default relative misclassification cost of 1-to-1 is used.
The overall accuracy in classifying the yes and
no outcomes was 85%.
This sounds good, but it represents only 45% of
the ”yes”.
• Meaning, when it’s actually YES, model
predicted YES only 45% of the time.
The question is:
• How could the accuracy of the model be improved in predicting the yes outcome?
C A S E S T U D Y - 7. D AT A M O D E L I N G
There are many aspects to model building – one of those is parameter tuning to
improve the model.
With a prepared training set, the first decision tree classification model for CHF
readmission can be built.
We are looking for patients with high-risk readmission, so the outcome of interest will
be CHF readmission equals ”yes”.
For decision tree classification, the best parameter to adjust is the
relative cost of misclassified yes and no outcomes.
C A S E S T U D Y - 7. D AT A M O D E L I N G
Type I Error or False positive
• When a true, non-readmission is misclassified, and action is taken to reduce that
patient’s risk, the cost of that error is the wasted intervention.
Type II Error or False negative
• When a true readmission is misclassified, and
no action is taken to reduce that risk.
• The cost of this error is the readmission and all
its attended costs, plus the trauma to the patient.
The costs of the two different kinds of misclassification errors can be quite different.
• Adjust the relative weights of misclassifying the yes and no outcomes.
For decision tree classification, the best parameter to adjust is the
relative cost of misclassified yes and no outcomes.
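As a sketch: scikit-learn's `DecisionTreeClassifier` does not take a cost matrix directly, but its `class_weight` parameter is a common stand-in for the relative cost of misclassifying the two outcomes. The data below is synthetic, not the case-study data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy stand-in for the prepared training set: 2 features, imbalanced labels.
X = rng.normal(size=(400, 2))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 1.0).astype(int)  # 1 = readmitted

# class_weight approximates the "relative cost of misclassified yes and no
# outcomes": here a "yes" (1) error costs 4x a "no" (0) error.
model = DecisionTreeClassifier(class_weight={0: 1, 1: 4}, max_depth=3,
                               random_state=0)
model.fit(X, y)

pred = model.predict(X)
sensitivity = ((pred == 1) & (y == 1)).sum() / (y == 1).sum()
specificity = ((pred == 0) & (y == 0)).sum() / (y == 0).sum()
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Raising the weight on the "yes" class pushes sensitivity up at the expense of specificity, which is exactly the trade-off the case study tunes.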
C A S E S T U D Y - 7. D AT A M O D E L I N G
For the second model, the relative cost was set at 9-to-1.
• This is the ratio of the cost of a false negative to that of a false positive.
• This is a very high ratio, but it gives more insight into the model’s behavior.
This time the model correctly classified 97% of
the YES, but at the expense of a very low
accuracy on the NO, with an overall accuracy of
only 49%.
This was clearly not a good model.
The problem with this outcome is the large number of false-positives.
• A true, non-readmission is misclassified as re-admission.
• This would recommend unnecessary and costly intervention for patients, who would not
have been re-admitted anyway.
C A S E S T U D Y - 7. D AT A M O D E L I N G
Try again to find a better balance between the yes and no accuracies.
For the third model, the relative cost was set at
4-to-1.
This time, the overall accuracy was 81%.
Yes accuracy was 68%. This is called sensitivity.
No accuracy was 85%. This is called specificity.
This is the optimum balance that can be obtained with a rather small training set.
• By adjusting the relative cost of misclassified yes and no outcomes parameter.
In medical diagnosis
• Test sensitivity is the ability of a test to correctly identify those with the disease (true
positive rate).
• Test specificity is the ability of the test to correctly identify those without the disease (true
negative rate).
C O N F U S I O N M AT R I X
Confusion matrix is a table that is often used to evaluate the performance of a
classification model (or ”classifier”).
It works on a set of test data for which the true values are known.
There are two possible predicted classes: ”YES” and ”NO”.
If we were predicting the presence of a disease, for example, ”yes” would mean they
have the disease, and ”no” would mean they don’t have the disease.
• The classifier made a total of 165 predictions, i.e., 165 patients were tested for the presence of that disease.
• Out of those 165 cases, the classifier predicted ”yes” 110 times, and ”no” 55 times.
• In reality, 105 patients in the sample have the disease, and 60 patients do not.

N = 165        Predicted: No    Predicted: Yes
Actual: No          50               10
Actual: Yes          5              100
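The counts in this example can be tallied directly from (actual, predicted) label pairs; a minimal sketch reconstructing the 165 cases:

```python
from collections import Counter

# Rebuild the 165-patient example as (actual, predicted) label pairs.
pairs = ([("no", "no")] * 50 + [("no", "yes")] * 10
         + [("yes", "no")] * 5 + [("yes", "yes")] * 100)

counts = Counter(pairs)
tn = counts[("no", "no")]    # true negatives
fp = counts[("no", "yes")]   # false positives
fn = counts[("yes", "no")]   # false negatives
tp = counts[("yes", "yes")]  # true positives

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
print(f"N={total} accuracy={accuracy:.3f}")  # N=165 accuracy=0.909
```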
C A S E S T U D Y - 8. E VA L U AT I O N
One way is to find the optimal model through a diagnostic measure based on tuning
one of the parameters in model building.
Specifically we’ll see how to tune the relative cost of misclassifying yes and no
outcomes.
Four models were built with four different relative
misclassification costs.
Each value of this model-building parameter
increases the true positive rate of the accuracy in
predicting yes, at the expense of lower accuracy
in predicting no, that is, an increasing
false-positive rate.
C A S E S T U D Y - 8. E VA L U AT I O N
Which model is best based on tuning this parameter?
Risk-reducing intervention – two scenarios
• It cannot be applied to all CHF patients, because many of them would not have been
readmitted anyway; doing so would not be cost-effective.
• The intervention itself would not be as effective in improving patient care if not enough
high-risk CHF patients are targeted.
How do we determine which model was optimal?
• This can be done with the help of an ROC curve (receiver operating characteristic curve).
ROC curve is a graph showing the performance of a classification model at all
classification thresholds.
ROC curve plots two parameters:
• True Positive Rate
• False Positive Rate
R E C E I V E R O P E R AT O R C H A R A C T E R I S T I C ( R O C ) C U RV E
ROC curves are used to show the connection/trade-off between clinical sensitivity and
specificity for every possible cut-off (threshold) for a test or a combination of tests.
The area under an ROC curve is a measure of the usefulness of a test in general.
• A greater area means a more useful test.
ROC curves are used in clinical biochemistry to choose the most appropriate cut-off for a
test.
The best cut-off has the highest true positive rate together with the lowest false
positive rate.
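The curve and its area can be sketched with scikit-learn; the labels and scores below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical test labels and model scores (probability of "yes").
y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.7, 0.8, 0.55, 0.9])

# One (FPR, TPR) point per threshold; the curve traces the trade-off.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)  # area under the curve

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f} FPR={f:.2f} TPR={t:.2f}")
print(f"AUC={auc:.3f}")
```

Sweeping the threshold from high to low moves along the curve from (0, 0) to (1, 1); a larger area means a more useful test.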
ROC curves were first employed in the study of discriminator systems for the detection of
radio signals in the presence of noise in the 1940s, following the attack on Pearl Harbor.
The initial research was motivated by the desire to determine how the US RADAR
”receiver operators” had missed the Japanese aircraft.
C A S E S T U D Y - 8. E VA L U AT I O N
We can see that model 3, with a relative misclassification cost of 4-to-1, is the best of the
four models.
F R OM D E P L O Y M E N T TO F EEDBAC K
[Methodology diagram: Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment → Feedback, with feedback loops back to earlier stages]
• What is deployment?
• The importance of stakeholder input.
• Considering the scale of deployment.
• The importance of incorporating feedback to refine the model.
• This process should be repeated as often as necessary.
9. D E P L O Y M E N T ( C O N C E P T )
• Making the model relevant and useful in addressing the initial question involves
getting the stakeholders familiar with the tool produced.
• Once the model is evaluated/approved by the stakeholders, it is deployed and put
to the ultimate test.
• The model may be rolled out to a limited group of users or in a test environment,
to build up confidence in applying the outcome for use across the board.
C A S E S T U D Y - 9. D E P L O Y M E N T
Teams involved: Business Team, Intervention Team / Program Director, Clinical Staff
C A S E S T U D Y - 9. D E P L O Y M E N T
Additional Requirements
• Processes for tracking and monitoring patients receiving the intervention would have to
be developed in collaboration with IT developers and database administrators, so that
the results could go through the feedback stage and the model could be refined over
time.
10. F E E D B A C K ( C O N C E P T )
C A S E S T U D Y - 10. F E E D B A C K
For ethical reasons, CHF patients would not be split into control and treatment
groups.
Instead, readmission rates would be compared before and after the implementation of
the model to measure its impact.
After the deployment and feedback stages, the impact of the intervention program on
re-admission rates would be reviewed after the first year of its implementation.
Then the model would be refined, based on all of the data compiled after model
implementation and the knowledge gained throughout these stages.
D AT A S C I E N C E P R O C E S S - S U M M A R Y
Learn the importance of
• Understanding the question
• Picking the most effective analytic approach
Learn to work with data (iterative stages)
• determine the data requirements
• collect the appropriate data
• understand the data
• prepare the data for modeling
Learn how to
• evaluate and deploy the model
• get feedback on it
• use the feedback constructively so as to improve the model
T HANK YOU
I NTRODUCTION TO DATA S CIENCE
M ODULE # 3 : DATA S CIENCE P ROCESS
IDS Course Team
BITS Pilani
TABLE OF C ONTENTS
1. Confusion Matrix
2. ROC
Confusion Matrix
• A Confusion matrix is a table that is often used to evaluate the performance of a
classification model (or “classifier”).
• A Confusion Matrix shows what the machine learning algorithm did right and what
the algorithm did wrong (misclassification).
• It works on a set of test data for which the true values are known. There are two
possible predicted classes: “YES” and “NO”.
Confusion Matrix
There are four quadrants in the confusion matrix, which are symbolized as below.
True Positive (TP) : The number of instances that were positive and correctly classified as positive.
False Positive (FP): The number of instances that were negative and incorrectly classified as positive.
This is also known as Type 1 Error.
False Negative (FN): The number of instances that were positive and incorrectly classified as negative.
It is also known as Type 2 Error.
True Negative (TN): The number of instances that were negative and correctly classified as negative.
Confusion Matrix
Which type of misclassification is more serious?? Type-I Error or Type-II Error?
Case I : Predicting whether a convict should be hanged or not? [Type I Error more Serious]
False Positive – Algorithm predicts that the convict has committed the crime, in reality, he is innocent.
Verdict: He will be hanged.
False Negative – Algorithm predicts that the convict is innocent, in reality, he has done the crime.
Verdict: He is released.
Case II : Predicting Smog in a region and alerting the public [Type II Error more Serious]
False Positive – Algorithm predicts smog, in reality, there is NO SMOG.
Verdict: People will take precaution unnecessarily.
False Negative – Algorithm predicts NO SMOG, in reality, there is SMOG.
Verdict: The high Smog may cause health issues in the people, since they have not taken precaution.
Confusion Matrix
Let us consider an example of a model predicting a tumour for a patient.

                Actual: Y    Actual: N
Predicted: Y       10           22
Predicted: N        8           60

Interpretation:
True Positive (TP): Model predicted ‘Tumour’ and the patient has a tumour.
False Positive (FP): Model predicted ‘Tumour’ but the patient has no tumour. This is also known as Type 1 Error.
False Negative (FN): Model predicted ‘No Tumour’ but the patient actually has a tumour. This is also known as Type 2 Error.
True Negative (TN): Model predicted ‘No Tumour’ and the patient has no tumour.

Discuss the repercussions of Type 1 and Type 2 errors w.r.t. the patient and the hospital.
Confusion Matrix
True Positive Rate (TPR): the fraction of positive examples predicted correctly by the classifier. This metric is also known as Recall, Sensitivity or Hit rate.
TPR = TP / (TP + FN)

False Negative Rate (FNR): the fraction of positive examples classified as negative by the classifier.
FNR = FN / (TP + FN)

False Positive Rate (FPR): the fraction of negative examples classified as positive by the classifier. This metric is also known as the False Alarm Rate.
FPR = FP / (FP + TN)

True Negative Rate (TNR): the fraction of negative examples classified correctly by the classifier. This metric is also known as Specificity.
TNR = TN / (TN + FP)
Confusion Matrix
Positive Predictive Value (PPV): the fraction of examples classified as positive that are really positive. It is also known as Precision.
PPV = TP / (TP + FP)

Accuracy: how often the classifier is correct.
Accuracy = (TP + TN) / Total

Error Rate (Misclassification Rate): how often the classifier is wrong.
Error Rate = (FP + FN) / Total

F1 Score (F1): combines Recall (r) and Precision (p), two widely used metrics employed in analysis where detection of one of the classes is considered more significant than the others.
F1 = 2TP / (2TP + FP + FN)
All Formulae
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
TNR = TN / (TN + FP)
FNR = FN / (TP + FN)
Precision = TP / (TP + FP)
F1 = 2TP / (2TP + FP + FN)
Accuracy = (TP + TN) / Total
Error Rate = (FP + FN) / Total
Case Study – CHF Prediction
Calculate the following metrics for the given confusion matrix:

                Actual: Y    Actual: N
Predicted: Y    100 (TP)     10 (FP)
Predicted: N      5 (FN)     50 (TN)

1. True Positive Rate (TPR) [Recall / Sensitivity]
2. False Positive Rate (FPR)
3. False Negative Rate (FNR)
4. True Negative Rate (TNR) [Specificity]
5. Precision
6. F1 Score
7. Accuracy
8. Error Rate or Misclassification Rate
Case Study – CHF Prediction
Formulae

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
FNR = FN / (TP + FN)
TNR = TN / (TN + FP)
Precision = TP / (TP + FP)
F1 = 2TP / (2TP + FP + FN)
Accuracy = (TP + TN) / Total
Error Rate = (FP + FN) / Total

Alternative formula for F1 calculation:
F1 = (2 × Precision × Recall) / (Precision + Recall)
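As a worked check (not part of the original slides), plugging the given matrix into the formulae above:

```python
# TP=100, FP=10, FN=5, TN=50 from the confusion matrix above.
TP, FP, FN, TN = 100, 10, 5, 50
total = TP + FP + FN + TN  # 165

tpr = TP / (TP + FN)            # recall / sensitivity
fpr = FP / (FP + TN)
fnr = FN / (TP + FN)
tnr = TN / (TN + FP)            # specificity
precision = TP / (TP + FP)
f1 = 2 * TP / (2 * TP + FP + FN)
accuracy = (TP + TN) / total
error_rate = (FP + FN) / total

for name, v in [("TPR", tpr), ("FPR", fpr), ("FNR", fnr), ("TNR", tnr),
                ("Precision", precision), ("F1", f1),
                ("Accuracy", accuracy), ("Error rate", error_rate)]:
    print(f"{name}: {v:.3f}")
```

Note that the two F1 formulae agree: 2·Precision·Recall / (Precision + Recall) gives the same value as 2TP / (2TP + FP + FN).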
ROC Curve
An ROC curve (receiver operating characteristic curve) is a
graph showing the performance of a classification model
at all classification thresholds.
It shows the trade-off between Sensitivity and Specificity
ROC curve plots two parameters:
• True Positive Rate
• False Positive Rate
https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
https://towardsdatascience.com/understanding-the-roc-curve-in-three-visual-steps-795b1399481c
• Altering the threshold to 0, 0.35, 0.5, 0.65 and 1 levels. Notice how the FPR and TPR changes accordingly
• Overall, we can see this is a trade-off. As we increase our threshold, we’ll be better at classifying negatives,
but this is at the expense of misclassifying more positives
Area under the ROC Curve (AUC)
• For NLP applications (like Chatbots), which use natural language, thresholds are generally set
lower (around 0.4) for healthcare, retail, educational bots.
URL - https://app.engati.com/static/standalone/bot.html?bot_key=889d005935e7437b
T HANK YOU
I NTRODUCTION TO DATA S CIENCE
M ODULE # 3 : DATA S CIENCE P ROPOSAL
IDS Course Team
BITS Pilani
TABLE OF C ONTENTS
1 D ATA S C I E N C E P R O P O S A L
WHAT IS DATA SCIENCE PROPOSAL
As a Data Scientist, there are occasions when proposals need to be written for data science projects.
At Microsoft:
A. Business-led Proposal
• Business teams come with requirements
• Ex: Product Engineering Team on how to prioritize
customer feedback screening
B. Data science-led Innovation
• From Data Science team
• Ex: How to maximize customer satisfaction for Azure
C. Data science-led Systemic Solutions
• What is the impact of ‘x’ on business
• Ex: ‘X’ can be marketing campaign, new service launch
https://medium.com/data-science-at-microsoft/managing-a-data-science-project-87945ff79483
QUESTIONNAIRE TO P R E PA R E P R O P O S A L
1. What is the business problem we are trying to solve?
2. Write an exact definition. Identify the type of the problem.
3. Are we addressing a specific problem or a problem specific to a team? Is it a
generic problem across all business? (help to create certain frameworks or
accelerators)
4. Who is the target audience?
5. How do you evaluate your solution outcome? Are there any evaluation
metrics available?
6. What are the acceptance criteria for the solution? (e.g., for a classification
task, accuracy should be above 65%)
QUESTIONNAIRE TO P R E PA R E P R O P O S A L
Business Understanding
• What is the business problem we are trying to solve?
• Write an exact definition.
• Is it a prediction problem?
→ e.g. predicting company’s profit in next quarter.
• Are we doing a segmentation?
→ e.g. a customer segmentation for targeted
marketing.
• Are we going to recommend something say a product to
the user?
• Is it anomaly detection or a fraud detection problem?
• Is it an optimization problem?
→ e.g. optimizing revenue of a company.
1. P R E D I C T I O N
• Classification
• Given a new individual observation, predicts which class it belongs to.
• e.g. whether a credit card customer will default or not given his data like credit card
balance, income etc.
• Covid Discharge Status, viz., (Recovered, Expired)
• Social media sentiment analysis to determine the emotion behind user-generated content
• Regression
• Given a new individual observation, estimates the value of a particular variable specific to
that individual.
• e.g. predicting the revenue for the next quarter
• Predicting the price of a house, given locality details
1. P R E D I C T I O N ... C O N T D ... Churn Analysis
Based on Historical data, predicting the Churn Analysis for next quarter
2. S E G M E N T AT I O N / C L U S T E R I N G
• Customer Profiling is an important aspect of Segmentation that attempts to characterize the
typical behavior of an individual or a group.
https://commence.com/blog/2020/06/16/customer-profiling-methods/
2. S E G M E N TAT I O N / C L U S T E R I N G
Clustering attempts to group individuals based on similarity.
• e.g. Segment the customers to High spenders and Low spenders based on their buying pattern and
other data.
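A minimal clustering sketch with scikit-learn, using synthetic customers (the spend figures and segment labels are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Hypothetical customers: columns are [monthly_spend, purchase_frequency].
low  = rng.normal([20, 2],   [5, 1],  size=(50, 2))
high = rng.normal([200, 15], [20, 3], size=(50, 2))
X = np.vstack([low, high])

# Group customers into two segments by similarity of buying pattern.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Label the cluster with the larger mean spend as "high spenders".
spend_by_cluster = [X[km.labels_ == c][:, 0].mean() for c in (0, 1)]
high_cluster = int(np.argmax(spend_by_cluster))
print(f"high-spender segment: cluster {high_cluster}, "
      f"mean spend {spend_by_cluster[high_cluster]:.1f}")
```

In practice the number of segments and the features used (spend, frequency, recency, etc.) are business decisions made during the analytic-approach stage.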
3. R E C O M M E N DAT I O N / S I M I L A R I T Y M AT C H I N G
• Similarity matching attempts to find similar individuals based on the data known
about them. This is useful in recommendation problem setting.
• e.g. Finding people similar to you who have purchased or liked similar products,
recommending a movie to a user based on his preferences and similar users’ interests.
• OTT platforms, E-Commerce platforms
4. Anomaly / F r a u d A n a l y t i c s
https://www.crisil.com/en/home/our-businesses/global-research-and-risk-solutions/our-offerings/non-financial-risk/financial-crime-
management/fraud-management/fraud-detection-and-analytics.html#
5. C A U S A L M O D E L L I N G / R O O T C A U S E A N A LY S I S
Causal modeling helps to understand the causal relationship between events, or which
events/actions influence others. [The ‘Why’ part of Diagnostic Analytics]
• What are the possible root causes for an anomaly detected?
• Whether the advertisements influenced consumer’s decision to purchase or not?
• What are the reasons for fraud in a bank?
• Lack of Training
• Competition to achieve incentives
• Overburdened Staff
• Low Compliance Level (not following RBI Guidelines)
6. MARKET BASKET ANALYSIS
Co-occurrence Grouping / Association Rule Discovery / Frequent Itemset Mining
• Find associations between entities based on the purchase transactions involving them.
• e.g. Which items are purchased together by consumers at a supermarket?
• May lead to upsell / cross-sell opportunities for customers
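The co-occurrence counting at the heart of market basket analysis can be sketched in a few lines; the transactions below are made up, and a full association-rule miner (e.g. Apriori) would additionally compute support and confidence.

```python
from itertools import combinations
from collections import Counter

# hypothetical supermarket transactions
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "butter", "beer"},
]

# count how often each unordered pair of items is bought together
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# the most frequent pair is a cross-sell candidate
print(pair_counts.most_common(1))  # → [(('bread', 'butter'), 3)]
```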
7. DATA REDUCTION
• Replace a large dataset with a smaller one that contains most of the important information in the
large dataset.
• Involves some loss of information.
• Which data reduction strategy to follow?
• Aggregation / Sampling / Dimensionality reduction
• Examples
• Aggregation - Massive insurance / patient data sets are aggregated into one row per
patient record in the hospital readmission case study.
• Sampling - A large time series of sensor data at one-second intervals may be reduced to hourly data or to a
smaller data set with only the changed values [Ex: Air pollutants data – calculate a pollution index based on the
concentration of chemicals]
• Dimensionality Reduction - ISRO weather data augmentation with semantic data project: followed
dimensionality reduction. [Only retained rainfall rate, humidity, latitude, longitude, wind direction,
atmospheric pressure & precipitation rate variables for analysis, out of 17 variables]
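The aggregation strategy above (per-second sensor readings reduced to one value per hour) can be sketched as follows; the pollutant readings are simulated, not real sensor data.

```python
from collections import defaultdict
from statistics import mean

# simulated (timestamp_in_seconds, pollutant_concentration) readings: two hours of per-second data
readings = [(t, 10 + (t % 7)) for t in range(0, 7200)]

# aggregate: bucket readings by hour, then keep one mean value per hour
hourly = defaultdict(list)
for t, value in readings:
    hourly[t // 3600].append(value)

reduced = {hour: round(mean(vals), 2) for hour, vals in hourly.items()}
print(reduced)  # 7200 rows reduced to 2 hourly summaries
```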
QUESTIONS TO BE ASKED BASED ON TASK
• Prediction
• Do we know what variable (target) is to be predicted?
• Is that target variable defined precisely?
• What values or ranges of values can this variable take? [Ordinal / Categorical]
• Will modelling this target variable address all the problems defined in the scope, or only a
sub-problem?
• Clustering
• Do we know the end objective? i.e. Is an EDA (Exploratory Data Analysis) path clearly
defined to see where our analysis is going?
SOLUTION APPROACH
• Is the proposed analytical solution formulated appropriately to solve the business problem, or is it
an approximation?
• Will the proposed solution address all the problems defined in the scope or only a sub-problem?
Ex: A study to understand employee satisfaction; does it address attrition?
• What will be the benefits of the proposed solution? Benefit vs. cost tradeoff.
Ex: Heart disease prediction model; deployed to all hospital centers?
• What are the specific end objectives to be met by the proposed solution?
• What are the anticipated outcomes of the proposed solution?
SOLUTION APPROACH
What are the deliverables? Data Science deliverables fall under 3 categories:
1. Analysis – A study using data to describe how a
product or program is working. Ex: Exploratory Data
Analysis, diagnosis to highlight a change in trend.
2. Experiment – A scientific study to test a hypothesis.
Ex: Spending more money on digital advertising leads
to increased sales.
Alternate hypothesis – “Mean sales increased after
spending more on advertising”
3. Model – A machine learning model trained on data to
predict an outcome. Ex: Churn prediction to alert the
company about at-risk customers.
https://medium.com/data-science-at-microsoft/managing-a-data-science-project-87945ff79483
DATA PREPARATION
• What are the important variables that you think we should collect?
• Are these variables readily available? Or is additional effort needed to
collect these variables?
• What are the types of data?
• e.g. Sensor data, ERP, e-commerce and SAP CRM data are structured (OLTP); social
networking data is unstructured.
• Where are the locations of data in the system?
• e.g. Product master and sales transaction data in an ERP SQL RDBMS database, OLAP
data in SQL Server for BI reporting, text data for customer reviews and sentiment from
tweets and FB posts etc. [Internal to the organization, or acquired from 3rd-party sources?]
• Where is the data coming from?
• e.g. data from sensors, sales data from ERP, online store
DATA PREPARATION ... CONTD ...
• Who are the current consumers of the data?
• e.g. Visualization tools, BI applications etc.
• What are the methods to acquire data?
• e.g. Sensor data are ingested into a data lake. ERP, e-commerce, and SAP CRM are inside the
organization’s data center, and proper access control needs to be granted to access the
data. Social networking data are retrieved from a streaming API as a nightly job and are
stored in a NoSQL database, etc.
• What are the integration points?
• e.g. The IT team needs to provide database access and build API services to
access certain data.
• Will it be practical to get all the relevant variables and load them into our workspace?
DATA PREPARATION ... CONTD ...
• What are the problems in acquiring the data?
• e.g. Sensor data are archived and deleted after 'x' days. A request needs to be raised to
store and archive the data, to make enough sample data available for analysis and
modelling.
• Social networking data may not be available for the longer term. All relevant data are
captured by existing systems, and requests need to be raised and approved for
accessing data from servers.
• For prediction problems, is a sufficient amount of labelled examples available? Or is there a
cost involved in getting these values?
• e.g. a field survey may be needed to collect responses from customers to see the
likelihood of joining a new plan.
• Are the training data drawn from a similar population to the one on which the model will be applied?
If not, are the selection biases noted? What are the plans to compensate?
https://machinelearningmastery.com/much-training-data-required-machine-learning/
MODELLING
• Is the choice of model appropriate for the business problem? Is it in line with our prior
knowledge of the problem?
• Classification, scoring, clustering, etc.
• Does the modelling technique meet all the other requirements (functional and
non-functional) of the problem?
• Should various modelling techniques be tried and compared using appropriate
evaluation metrics?
• Check the amount of data required, generalization performance (i.e. how the model would perform
on another sample), and learning time.
https://machinelearningmastery.com/much-training-data-required-machine-learning/
EVALUATION
• For a classification problem, is there a threshold defined? (e.g. different thresholds have
different implications in terms of benefits; raising the threshold to 0.70 can reduce the
false positives)
• For a regression problem, how will we evaluate the quality of prediction in the business
context?
• For a clustering problem, how will the clustering be interpreted in the context of the business
problem?
• How will we measure the business impact of the final model? How will we justify the project
expense against the benefits? [ROI]
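The effect of the classification threshold on false positives can be illustrated with a toy example; the model scores and labels below are made up.

```python
# hypothetical (model_score, true_label) pairs; label 1 = positive class, 0 = negative
scored = [(0.95, 1), (0.80, 1), (0.75, 0), (0.60, 0), (0.55, 1), (0.30, 0), (0.10, 0)]

def false_positives(scored, threshold):
    """Count negatives that the model would flag as positive at this threshold."""
    return sum(1 for score, label in scored if score >= threshold and label == 0)

# raising the threshold trades recall for fewer false positives
for threshold in (0.50, 0.70):
    print(f"threshold={threshold}: false positives={false_positives(scored, threshold)}")
# → threshold=0.5: 2 false positives; threshold=0.7: 1 false positive
```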
EXISTING SYSTEMS / REQUIREMENTS
• What are the existing/related systems within the capability that capture/use related
information? e.g. A prediction model is already being used for fraud analysis.
Can we reuse the same transaction dataset for providing recommendations?
• What are the gaps?
• Who are the stakeholders?
• Who will be affected by this implementation?
ASSUMPTIONS / DEPENDENCIES / CHALLENGES
• Note down the assumptions: things like availability of necessary data, access to the
infrastructure, licenses etc.
• Are any licenses/commercials needed in case of proprietary solutions?
• Note down the dependencies: things like dependency on setting up and accessing the
infrastructure/tools, on access rights etc.
• Are there any other dependencies?
• Do you see any other problems/challenges?
IMPLEMENTATION
Case Study – Bipolar Disorder
• Bipolar Disorder (BD) is a recurrent chronic disorder characterized by fluctuations in mood state and
energy, which affects over 1% of the world population.
• BD is a primary cause of disability, leading to functional and cognitive impairment, with
increased morbidity, especially death by suicide.
• Compared to a mentally stable individual, an individual suffering from Bipolar Disorder
experiences extreme mood fluctuations, classified into “manic episodes” and “depressive episodes”,
which typically last from days to months.
• While the manic episodes are characterized by racing thoughts, feelings of elation, extreme irritability,
etc., the depressive episodes are characterized by feelings of extreme sadness, restlessness, trouble
concentrating, insomnia, etc.
Case Study – Bipolar Disorder
The standard states of bipolar disorder are as follows:
• i) Bipolar I Disorder, ii) Bipolar II Disorder, iii) Cyclothymia, iv) Unspecified Bipolar.
• From a clinical viewpoint, Bipolar I is defined by the appearance of at least one manic episode. Patients may
experience hypomanic or major depressive episodes prior to or after the manic episodes.
• Bipolar II, Cyclothymia and Unspecified vary in episodes between hypomania and depression, with each cycle
lasting from weeks to months.
• Hypomania experiences: reduced need for sleep; spending recklessly, like buying a car you cannot afford; taking
chances you normally wouldn’t take because you “feel lucky”; talking so fast that it’s difficult for others to follow
what’s being said.
Case Study – Bipolar Disorder
Company X intends to develop a Smart
Healthcare System for monitoring Bipolar
Disorder.
As a Data Scientist working for the company, what
kind of questions would you ask in the Data
Collection / Preparation phases?
• What are the important variables that we need
to collect?
• What are the types of data?
• Locations of data in the system?
• Integration points?
• Problems in acquiring the data?
• Do we have sufficient labelled samples for
prediction?
• Are the training data drawn from a similar
population to the one on which the model will be
applied? If not, are the selection biases noted?
Case Study – Bipolar Disorder
The Data Scientist will consult the domain expert (psychiatrist or psychologist) to find which variables are
important to collect during the manic / hypomanic / depressive phases of the patient.
• What are the important variables that we need to collect, and where is the data located?
• Physiological data [from sensors]: heart rate, electrodermal activity (EDA), oxygen
saturation (SpO2), blood pressure etc.
• Behavioral data [from mobile app]: self-assessment questionnaire to capture daily
information regarding sleep quality [hourly scale], physical activity [-3 for inactive to +3 for
active], mood states using GAD [7-point Likert], HDRS [7-point Likert], YMRS [5-point Likert].
In addition, data on alcohol intake, stress levels, motivation levels, concentration levels,
menstrual cycle pattern, irritability levels, and insomnia levels. The treating doctors will be
asked to rate the patient’s progress using scales from much worse (-3) to much better (+3).
The behavioral data will be collected from a mobile app.
Case Study – Bipolar Disorder
• Integration points?
• The sensor data and behavioral data are integrated on a daily basis and presented to the ‘Health
Analytics Engine’ on the cloud to perform the analytics.
• Problems in acquiring the data?
• Bipolar patients must cooperate and provide the behavioral data truthfully on a periodic basis.
• Do we have sufficient labelled samples for prediction?
• Need to devise a strategy to collect the data samples for a period of 6 months, on a daily basis, to build
the labelled data.
• Are the training data drawn from a similar population to the one on which the model will be applied? If not, are the
selection biases noted?
• All the data is captured from patients suffering from Bipolar Disorder, albeit in different cycles [Type I,
Type II, Cyclothymia etc.].
Case Study – Bipolar Disorder
• What modelling techniques can be applied to predict the patient states?
• For a similar case in the US, Decision Tree, Random Forest, Support Vector Machine and Logistic
Regression models were applied, and the accuracy of Random Forest was the best.
• Outcome - Predict the patient states. [Multiclass classification – Bipolar Type I, Bipolar Type II,
Cyclothymia and Unspecified are the states]
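To make the multiclass setup concrete, here is a deliberately tiny nearest-centroid classifier; the features (sleep hours, activity score) and training points are invented for illustration, and this is not the Random Forest model cited above, just the simplest possible multiclass sketch.

```python
from statistics import mean

# hypothetical training data: (sleep_hours, activity_score) -> bipolar state label
train = {
    "Type I":      [(4, 3), (5, 3), (4, 2)],
    "Type II":     [(6, 1), (7, 1), (6, 0)],
    "Cyclothymia": [(8, -1), (9, -2), (8, -2)],
}

# one centroid (mean feature vector) per class
centroids = {label: tuple(mean(col) for col in zip(*points))
             for label, points in train.items()}

def predict(x):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    return min(centroids,
               key=lambda lbl: sum((a - b) ** 2 for a, b in zip(x, centroids[lbl])))

print(predict((5, 2)))  # → Type I (nearest centroid)
```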
A GUIDE TO DESIGNING A DATA SCIENCE PROJECT
Data Science for Business by Tom Fawcett and Foster Provost, O’Reilly
https://www.linkedin.com/pulse/
ask-questions-while-preparing-proposal-data-science-project-menon
http://www.acheronanalytics.com/acheron-blog/
a-guide-to-designing-a-data-science-project
THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE #5: DATA AND DATA QUALITY
IDS Course Team
BITS Pilani
TABLE OF CONTENTS
1 DATA
2 DATA-SETS
3 DATA RETRIEVAL
4 DATA PREPARATION
5 DATA EXPLORATION
6 DATA QUALITY
7 OUTLIERS
DATA
• Data is a collection of data objects and their
attributes.
• A collection of attributes describes an object.
• An object is also known as a record, observation,
case, sample or instance.
• An attribute is a property or characteristic of an
object.
• Examples: eye color of a person,
temperature
• An attribute is also known as a variable, field,
characteristic, or feature.
QUALITY OF DATA
DATA QUALITY ISSUES
PREPROCESSING ON DATA
ATTRIBUTE / FEATURE
PROPERTIES OF ATTRIBUTES
TYPES OF ATTRIBUTES
Data
• Numerical: Ratio, Interval
• Categorical: Ordinal, Nominal
TYPES OF ATTRIBUTES
Nominal: Distinctiveness
TYPES OF ATTRIBUTES
Ratio (Numerical): Income, Height, Weight, Annual Sales, Age
TYPES OF ATTRIBUTES EXAMPLE
ATTRIBUTES AND TRANSFORMATIONS
2. Continuous Attribute
• Measurable data.
• Temperature, height, age, weight
• Continuous attributes are typically represented as floating-point variables.
ATTRIBUTES BY THE NUMBER OF VALUES
• Discrete data is countable, while
continuous data is measurable.
• Discrete data contains distinct or
separate values.
• On the other hand, continuous data
includes any value within a range.
• Discrete data is graphically
represented by a bar graph, whereas a
histogram is used to represent
continuous data graphically.
TYPES OF ATTRIBUTES EXAMPLE
D AT A F O R M AT S
Record data
• Transaction or Market Basket data – set of items
• Data Matrix – record data with only numeric attributes.
• Sparse Data Matrix – binary asymmetric data. 0/1 entries.
• Document term matrix
Graph data
• Data with relationships among objects – Web pages
• Data with objects as graphs – LOD cloud
Ordered data
• Sequential data or temporal data – Record data + time.
• Sequence data – genome representation
• Time series data – temporal autocorrelation
• Spatial data – spatial autocorrelation
RECORD DATA EXAMPLE
• Record data: flat file (CSV), RDBMS
• Transaction data: banking, retail, e-commerce etc.
• Data matrix: SPSS data matrix
• Document-term matrix: frequency of terms that appear in documents, used in Information Retrieval
https://towardsdatascience.com/types-of-data-sets-in-data-science-data-mining-machine-learning-eb47c80af7a
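A document-term matrix as described above can be built directly from raw text; the three documents below are made up, and real IR systems would also normalize terms (stemming, stop-word removal, tf-idf weighting).

```python
from collections import Counter

# hypothetical document collection
docs = ["data science uses data", "graph data and web pages", "time series data"]

# vocabulary: one column per distinct term, sorted for a stable column order
vocab = sorted({term for doc in docs for term in doc.split()})

# document-term matrix: one row per document, cell = term frequency
dtm = [[Counter(doc.split())[term] for term in vocab] for doc in docs]

print(vocab)
print(dtm[0])  # → [0, 2, 0, 0, 1, 0, 0, 1, 0]  (term frequencies for the first document)
```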
GRAPH DATA EXAMPLE
ORDERED DATA EXAMPLE
TYPES OF DATA-SETS
1 Structured data
• Data containing a defined data type, format and structure.
• Example: transaction data, online analytical processing (OLAP) data cubes, traditional
RDBMS, CSV files and spreadsheets.
2 Semi-structured data
• Textual data files with a discernible pattern that enables parsing
• Example: XML data files, JSON data files
3 Quasi-structured data
• Textual data with erratic data formats that can be formatted with effort, tools and time
• Example: Web click-stream data [IP address, timestamp, geocodes etc.]
4 Unstructured data
• Data that has no inherent structure.
• Example: PDF, images, video, email
TYPES OF DATA-SETS
6 Graph-based or network data
• Data that can be shown in a graph. [Ex: Linked Open Data Cloud]
• A graph is a mathematical structure to model pair-wise relationships between objects.
• Graph or network data focuses on the relationship or adjacency of objects.
• Graph databases with specialized query languages such as SPARQL.
• Example: DBPedia data in RDF format [RDF dump or through an endpoint]
[https://dbpedia.org/sparql]
7 Streaming data
• The data flows into the system when an event happens, instead of being loaded into a
data store in a batch.
• Example: live sports or music events, stock market.
CHARACTERISTICS OF DATA-SETS
1 Dimensionality
• Number of attributes
• Curse of dimensionality – the difficulties associated with analyzing high-dimensional data
• Dimensionality reduction techniques [PCA, NMF, LDA etc.]
2 Sparsity
• For some data sets, such as those with asymmetric features, most attributes of an object have values of
0; in many cases, fewer than 1% of the entries are non-zero.
• An advantage, because usually only the non-zero values need to be stored and
manipulated.
3 Resolution
• The patterns in the data depend on the level of resolution.
• If the resolution is too fine, a pattern may not be visible or may be buried in noise; if the resolution is
too coarse, the pattern may disappear. [Ex: Air pollution: index (chemical pollutants) measured per
second / hour; per second – fine resolution; per hour – coarse resolution]
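The sparsity advantage noted above (store and manipulate only the non-zero values) can be sketched with a dictionary-based sparse vector; the numbers are arbitrary illustration data.

```python
# a sparse row stored densely vs. as {index: value} pairs
dense = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0]

# keep only the non-zero entries
sparse = {i: v for i, v in enumerate(dense) if v != 0}
print(sparse)  # → {2: 3, 6: 1}

# a dot product only needs to touch the stored (non-zero) entries
other = list(range(10))
dot = sum(v * other[i] for i, v in sparse.items())
print(dot)     # → 3*2 + 1*6 = 12
```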
Curse of Dimensionality
RETRIEVING DATA
Data Storage
• Database tables
• Text files
• Data marts
• Data warehouses
• Data lakes (raw data)
DATA PREPARATION
DATA CLEANSING
Focuses on removing errors in your data so that your data becomes a true and consistent
representation of the processes it originates from.
Two types of errors
• Interpretation / representation errors
• Age > 130
• Height of a person greater than 8 feet.
• Negative price.
• Inconsistencies between data sources or against your company’s standardized values.
• Female and F
• Feet and meters
• Dollars and Pounds
DATA CLEANSING
Errors from data entry
• Causes
• Typos
• Errors due to lack of concentration
• Machine or hardware failure
• Detection
• Frequency table [Frequency is the number of times a specific data value occurs in your
dataset.]
• Correction
• Simple assignment statements
• If-then-else rules
White-spaces and typos
• Remove leading and trailing white-spaces.
• Change the case of the alphabet from upper to lower. [Ex: SILK framework – semantic
matching]
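The detection-and-correction steps above (frequency table to spot suspicious values, then white-space stripping, lower-casing, and if-then-else rules) can be sketched as follows; the raw gender column is invented for illustration.

```python
from collections import Counter

# hypothetical raw gender column with typos, case and white-space issues
raw = ["Female", " female ", "F", "Male", "male", "M", "Femal"]

# detection: a frequency table exposes rare, suspicious values
print(Counter(v.strip().lower() for v in raw))

# correction: strip white-space, lower the case, then apply if-then-else rules
def clean(value):
    v = value.strip().lower()
    if v in ("f", "femal", "female"):
        return "female"
    elif v in ("m", "male"):
        return "male"
    return v  # leave unrecognized values for manual review

print([clean(v) for v in raw])  # → all values mapped to 'female' / 'male'
```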
COMBINING DATA
TRANSFORMING DATA
Applying a mathematical transformation to the input variable.
• For a relationship of the form y = ae^(bx), transforming y to log y makes the relationship
between x and log y linear: log y = log a + bx.
EXPLORATORY DATA ANALYSIS (EDA)
Use graphical techniques to gain an understanding of the data and the interactions
between variables.
• Look at what can be learned from the data.
• Statistical properties like the distribution of the data, correlation.
• Discover outliers.
EXPLORATORY DATA ANALYSIS (EDA)
• Boxplot – can show the maximum, minimum, median, and other characterizing
measures at the same time.
• Histogram – a variable is cut into discrete categories, and the number of
occurrences in each category is summed up and shown in the graph.
• Clustering and other modeling techniques can also be part of exploratory analysis.
Refer - https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Exploratory_data_Analysis.ipynb
BOX PLOT [WHISKER PLOT]
• A boxplot incorporates the five-number summary.
• The ends of the box are at the quartiles.
• The box length is the interquartile range.
• The median is marked by a line within the box.
• The whiskers outside the box extend to the minimum and maximum observations.
SCATTER PLOT
• Determine if there appears to be a relationship, pattern, or trend between two numeric
attributes.
• Provides a visualization of bivariate data to see clusters of points, outliers, and
correlation relationships.
SCATTER PLOT
Analysis
• More tips are given during dinner time
compared to lunch time.
• Positive correlation between total bill
amount and tip given, i.e., the higher the bill
amount, the higher the tip.
HeatMap
• Using visual cues in a heatmap.
• A heatmap is a way to visualize data in tabular format, where in place of the
numbers you leverage colored cells that convey the relative magnitude of the
numbers.
• Use color saturation to provide visual cues that quickly target potential
points of interest.
• Always include a legend as a subtitle on the heatmap, with colors
corresponding to the conditional formatting colors.
DATA QUALITY INDEX
https://www.deltapartnersgroup.com/
IMPACT OF MISSING DATA IN A DATASET
• Loss of data reduces statistical power and may introduce selection bias, which
may invalidate the study.
• Creates imbalance in the observations and can lead to invalid conclusions.
• Affects the performance of machine learning models.
MISSING DATA MECHANISMS
MISSING COMPLETELY AT RANDOM (MCAR)
• The probability of being missing is the same for all the observations.
• There is no relationship between the missing values and any other values in the dataset.
• Removing such missing values will not affect the inferences made.

Category      Product          Rating
Accessories   Helmets          90%
Accessories   Lights           90%
Accessories   Locks            90%
Accessories   Tires and Tubes  90%
Accessories   Bike Racks       NA
Accessories   Pumps            95%
Clothing      Jerseys          NA
Clothing      Caps             15%
Clothing      Tights           30%
Clothing      Bib-Shorts       36%
Clothing      Socks            48%
Components    Chains           75%
Components    Handlebars       35%
Components    Brakes           36%
Components    Brakes           38%
Components    Bottom Brackets  NA
MISSING AT RANDOM (MAR)
• The probability of missing values depends on available information,
i.e. it depends on other variables in the dataset.
NOT MISSING AT RANDOM (NMAR)
IMPUTATION TECHNIQUES
A. Categorical Variables
B. Numerical Variables
IMPUTATION – CATEGORICAL VARIABLES
IMPUTATION – NUMERICAL VARIABLES
MEAN / MEDIAN IMPUTATION
• Used when data is MCAR / MAR.
• Assumes that the feature follows a normal
distribution.
Advantages
• Easy to implement
• Faster way of obtaining a complete dataset
Disadvantages
• Mean imputation reduces the variance of the imputed
variables.
• Mean imputation does not preserve relationships
between variables, such as correlations.
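A minimal mean-imputation sketch on a made-up age column, which also demonstrates the variance-reduction disadvantage stated above: the imputed column's variance is lower than that of the observed values alone.

```python
from statistics import mean, pvariance

# hypothetical age column; None marks missing values
ages = [23, 25, None, 31, None, 40, 22, 35]

observed = [a for a in ages if a is not None]
fill = mean(observed)
imputed = [a if a is not None else fill for a in ages]

print(imputed)
# variance of the imputed column is lower than that of the observed values
print(round(pvariance(observed), 2), round(pvariance(imputed), 2))
```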
OUTLIERS
• An outlier is a data point that is significantly far away from most other data
points. For example, if everyone in your classroom is of average height with the
exception of two basketball players who are significantly taller than the rest of the
class, these two data points would be considered outliers.
• Data objects with behaviors that are very different from expectation are called
outliers or anomalies.
• Outliers can significantly skew the distribution of your data.
• Outliers can be identified using summary statistics and plots of the data.
• Algorithms like Linear Regression, K-Nearest Neighbors and AdaBoost are
sensitive to noise.
OUTLIER DETECTION USING NORMAL DISTRIBUTION
Outlier Detection Techniques
Outlier Type:
1. Univariate
• Box plot (IQR)
• Density plot (Standard Deviation)
• Z-score method
2. Multivariate
• DBSCAN (clustering algorithm)
• Local Outlier Factor (LOF) method
https://www.kaggle.com/code/rpsuraj/outlier-detection-techniques-simplified/notebook
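The univariate IQR rule listed above (flag values beyond 1.5 × IQR from the quartiles, the same fences a boxplot draws) can be sketched on a made-up sample:

```python
from statistics import quantiles

data = [12, 13, 14, 15, 15, 16, 17, 18, 45]  # 45 looks suspicious

q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the boxplot whisker fences

outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # → [45]
```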
Self Study and Revision
1. Discussion on Previous Year Question Paper
2. Self Study (Revision)*:
• PPTs for Sessions 1-8 (Lecture Notes)
• Case Study – Data Science Proposal Evaluation
• Case Study on Air Pollution
• Some Text Books and Reference Books
Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar (T3)
The Art of Data Science by Roger D Peng and Elizabeth Matsui (R1)
Introducing Data Science by Cielen, Meysman and Ali
https://www.deltapartnersgroup.com/
managing-data-quality-optimize-value-extraction
http://www.dataintegration.ninja/
relationship-between-data-quality-and-master-data-management/
THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE #4: DATA SCIENCE TEAMS
IDS Course Team
BITS Pilani
TABLE OF CONTENTS
1 DATA SCIENCE TEAMS
Roles in a Data Science Project
https://www.youtube.com/watch?v=m5hLUknIi5c
OR
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [1/6]
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [2/6]
ROLES IN DATA SCIENCE TEAM [3/6]
3 Business analyst
• A business analyst basically realizes a CAO’s functions, but at the operational level.
• This implies converting business expectations into data analysis.
• If your core data scientist lacks domain expertise, a business analyst bridges this gulf.
• Preferred skills: Data Visualization & Interpretation, Business Intelligence, SQL
4 Data scientist
• A data scientist is a person who solves business tasks using machine learning and data
mining techniques.
• The role can be narrowed down to data preparation and cleaning, with further model
training and evaluation.
• Preferred skills: R, SAS, Python, Matlab, SQL, NoSQL, Hive, Pig, Hadoop, Spark
ROLES IN DATA SCIENCE TEAM [4/6]
The job of a data scientist is often divided into two roles:
[4A] Machine Learning Engineer
• A machine learning engineer combines software engineering and modeling skills,
determining which model to use and what data should be used for each model.
• Probability and statistics are also their forte.
• Training, monitoring, and maintaining a model.
• Preferred skills: R, Python, Scala, Julia, PyTorch, TensorFlow
[4B] Data Journalist
• Data journalists help make sense of data output by putting it in the right context.
• Articulating business problems and shaping analytics results into compelling stories.
• Presenting the idea to stakeholders and representing the data team to those unfamiliar with
statistics.
• Preferred skills: SQL, Python, R, Scala, Carto, D3, QGIS, Tableau
R OLES IN D AT A S C I E N C E T EAM [5/6]
5 Data architect
• Working with Big Data.
• This role is critical to warehouse the data, define database architecture, centralize data,
and ensure integrity across different sources.
• Preferred skills: SQL, noSQL, XML, Hive, Pig, Hadoop, Spark
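Centralizing data and ensuring integrity across sources can be illustrated with Python's built-in sqlite3: one schema receives records from two hypothetical source systems, and a foreign key rejects inconsistent rows. Table and source names are invented:

```python
import sqlite3

# An in-memory "warehouse" that centralizes two hypothetical sources
# (a CRM export and a web-shop export) under one schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount REAL NOT NULL)""")

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 99.5), (11, 2, 20.0), (12, 1, 5.0)])

# Integrity across sources: an order for a non-existent customer is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (13, 999, 1.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Centralized data supports cross-source queries directly.
total_per_customer = dict(conn.execute(
    "SELECT name, SUM(amount) FROM orders "
    "JOIN customers ON customer_id = customers.id GROUP BY name"))
```

A real warehouse would use a dedicated engine (Hive, Spark SQL, etc., per the skills list), but the architect's concerns shown here — one schema, referential integrity, centralized querying — carry over.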
6 Data engineer
• Data engineers implement, test, and maintain infrastructural components that data
architects design. They build and maintain the data pipeline [Ex: ETL -> Analysis]
• Realistically, the role of an engineer and the role of an architect can be combined in one
person.
• Preferred skills: SQL, noSQL, Hive, Pig, Matlab, SAS, Python, Java, Ruby, C++, Perl
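The ETL → analysis pipeline a data engineer builds can be sketched as three small stages over an in-memory CSV "source" (field names and readings are invented for illustration):

```python
import csv
import io

# Pretend this CSV came from an upstream system; one reading is missing.
RAW_CSV = """sensor,reading
a,10
b,
a,14
b,8
"""

def extract(text):
    """Extract: parse the raw source into records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with empty readings and convert types."""
    return [{"sensor": r["sensor"], "reading": float(r["reading"])}
            for r in rows if r["reading"]]

def load(rows):
    """Load: group cleaned records into a store keyed by sensor."""
    store = {}
    for r in rows:
        store.setdefault(r["sensor"], []).append(r["reading"])
    return store

# ETL -> analysis, exactly the Ex: ETL -> Analysis flow from the slide.
store = load(transform(extract(RAW_CSV)))
averages = {k: sum(v) / len(v) for k, v in store.items()}
```

Production pipelines add scheduling, retries, and real storage backends, but keeping extract, transform, and load as separate, composable steps is the core of the engineer's job described above.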
R OLES IN D AT A S C I E N C E T EAM [6/6]
D AT A S C I E N T I S T
D AT A S C I E N T I S T R E Q U I R E M E N T S - I N D U S T R Y - W I S E
• Business
• Analysis of business data can inform decisions around efficiency, inventory,
production errors, customer loyalty and more.
• E-commerce
• Improve customer service, find trends and develop services or products.
• Finance
• Data on accounts, credit and debit transactions, and similar financial data; also
security and compliance, including fraud detection.
• Government
• Form decisions, support constituents and monitor overall satisfaction, security and
compliance.
D ATA S C I E N C E T EAM B UILDING
(WORKING WITH OTHER TEAMS)
• Get to know each other for better communication
• Foster team cohesion and teamwork
• Encourage collaboration to boost team productivity and performance
https://towardsdatascience.com/why-team-building-is-important-to-data-scientists-a8fa74dbc09b
O R G A N I S AT I O N OF D AT A S C I E N C E T EAM
[1] Decentralized
• Data scientists report into specific business units
(ex: Retail / BB/ Commercial Banking) or functional
units (ex: Marketing, Finance, HR) within a
company.
• Resources allocated only to projects within their
silos with no view of analytics activities or priorities
outside their function or business unit.
• Analytics are scattered across the
organization in different functions and
business units.
• Little to no coordination
• Drawback – leads to isolated teams
O R G A N I S AT I O N OF D AT A S C I E N C E T EAM
[2] Functional
• Resource allocation driven by a functional
agenda rather than an enterprise agenda.
• Analysts are located in the functions where
the most analytical activity takes place, but
may also provide services to rest of the
corporation. [HR Analytics]
• Little coordination
O R G A N I S AT I O N OF D AT A S C I E N C E T EAM
[3] Consulting
• Resources allocated based on availability
on a first-come first-served basis without
necessarily aligning to enterprise objectives
• Analysts work together in a central group
but act as internal consultants who charge
“clients” (business units) for their services
• No centralized coordination
Loosely coupled
O R G A N I S AT I O N OF D AT A S C I E N C E T EAM
[4] Centralized
• Data scientists are members of a core
group, reporting to a head of data science
or analytics.
• Stronger ownership and management of
resource allocation and project prioritization
within a central pool.
• Analysts reside in central group, where they
serve a variety of functions and business
units and work on diverse projects.
• Coordination by central analytic unit
• Challenge – Hard to assess and meet demands for incoming data science projects
(esp. in smaller teams).
Tightly coupled
O R G A N I S AT I O N OF D AT A S C I E N C E T EAM
[5] Center of Excellence
O R G A N I S AT I O N OF D AT A S C I E N C E T EAM
[6] Federated
• Same as the “Center of Excellence” model, with need-based operational
involvement to provide SME (subject-matter expert) support.
• A centralized group of advanced analysts is
strategically deployed to enterprise-wide
initiatives.
• Flexible model with right balance of
centralized and distributed coordination.
Common Difficulties
Common Difficulties
Challenge #1 – Managing Data Science Application Lifecycle
Tip: Treat the ML model lifecycle as a cyclical process. Data scientists should keep monitoring the
performance of live models and, to come full circle, return to the observation phase they started at.
[Patterns in the data can change, and without a cyclical approach, a model that works today might
not work in the future.]
Ex: Personalized Product Recommendations in E-Commerce may require inputs from newer sources.
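The cyclical monitoring step above can be sketched as a simple drift check: compare live data against the training distribution and flag the model for retraining when it shifts. The data, the mean-shift test, and the threshold are all illustrative, not a production drift detector:

```python
import statistics

# Training-time data vs. data the live model now sees (both made up).
training_data = [10.2, 9.8, 10.1, 10.0, 9.9]
live_data = [13.1, 12.8, 13.4, 12.9]       # the distribution has shifted

def needs_retraining(train, live, max_shift=1.0):
    """Flag drift when the live mean moves more than max_shift from training.

    Real systems use stronger tests (e.g. on full distributions), but the
    monitor -> detect drift -> retrain loop is the same.
    """
    return abs(statistics.mean(live) - statistics.mean(train)) > max_shift

retrain = needs_retraining(training_data, live_data)
```

When the flag fires, the team returns to the observation phase: gather fresh data (possibly from newer sources, as in the recommendation example), retrain, and redeploy.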
https://ortec.com/en/featured-insights/3-upcoming-challenges-your-data-science-team-will-face
Building an Analytics-Driven Organization, Accenture
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
T HANK YOU