
INTRODUCTION TO DATA SCIENCE

MODULE #1: INTRODUCTION
Dr. Shreyas Rao
BITS Pilani
Profile of Instructor
Dr. Shreyas Rao
• 18+ Years of Experience in IT, Teaching and Research

• Working as Associate Professor (Off Campus), Dept. of CSIS, BITS-Pilani, WILP

• B.E from VTU, M.S in Software Systems from BITS (WILP) and PhD from MAHE

• Worked as Business Analyst and Team Lead at SLK Software Services for 7 years

• Previously worked at Presidency University and at Sahyadri College, Mangaluru, as R&D Head, CSE

• COE member in AI&ML and COE member in Data Science (Govt.-sponsored, ₹1.2 Cr)

Profile of Instructor
Dr. Shreyas Rao
Consultant:
• ISRO-SAC (Ahmedabad) funded research project titled “Ontology Enabled Disaster
Management Web Service using Data Integration” as Technical Consultant. Deployed in
ISRO.
• Designed and developed ‘Dhriti’, a mental health resource chatbot that caters to the mental
health needs of people during Covid, from the COE in AI&ML, SCEM. The bot was released in the
Dakshina Kannada region of Karnataka and answers user queries in English, Kannada and
Hindi. Deployed on the Web and Facebook Messenger channels.

Profile of Instructor
Dr. Shreyas Rao
Collaboration with Dept. of Health Innovation, Kasturba Hospital, MAHE
• Telemedicine effectiveness during Covid Wave-I at Kasturba Hospital, Manipal (Statistical
Analysis)
• Study on psychological implications of COVID-19 on Nursing professionals (Statistical
Analysis)
• Covid prediction using Patient Discharge Data (Deep Learning)
Dept. of Psychology, Montfort College:
• AI enabled tool for juvenile self-transformation (Mental Health domain, Deep Learning &
NLP)

*Published papers can be viewed at https://scholar.google.com.tw/citations?user=MFNrrlcAAAAJ&hl=en&oi=ao


TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE VS. BUSINESS INTELLIGENCE
5 DATA SCIENTIST
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 DATA SCIENCE CHALLENGES

COURSE STRUCTURE

M1 Introduction to Data Science


M2 Data Analytics
M3 Data Science Process
M4 Data Science Teams
M5 Data and Data Models
M6 Data Wrangling and Feature Engineering
M7 Data Visualization
M8 Storytelling with Data
M9 Ethics for Data Science

TEXT AND REFERENCE BOOKS
TEXT BOOKS
T1 Introducing Data Science by Cielen, Meysman and Ali
T2 Storytelling with Data: A Data Visualization Guide for Business Professionals,
by Cole Nussbaumer Knaflic; Wiley
T3 Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar

REFERENCE BOOKS
R1 The Art of Data Science by Roger D Peng and Elizabeth Matsui
R2 Ethics and Data Science by DJ Patil, Hilary Mason, Mike Loukides
R3 Python Data Science Handbook: Essential tools for working with data by Jake
VanderPlas
R4 KDD, SEMMA and CRISP-DM: A Parallel Overview, by Ana Azevedo and M.F.
Santos, IADIS-DM, 2008
CANVAS
The most relevant and up-to-date info on:
• Course Handout
• Schedule for Webinar, Quiz, and Assignments [By 19-Nov-22]
• Lecture Slides
• Quiz
• Assignment

The video recording will be available in Microsoft Teams.

Evaluation

1. EC1- 30 marks
• Three quizzes (5 marks each) – 10 marks (best 2 will be considered)
• One assignment - 20 marks
2. EC2 [Mid Term Exam] – 30 marks
3. EC3 [Comprehensive Exam] – 40 marks

TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE VS. BUSINESS INTELLIGENCE
5 DATA SCIENTIST
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 DATA SCIENCE CHALLENGES

WHAT IS SCIENCE?

 Science is the systematic study of the structure and behavior of the world (phenomena)
through observation, experimentation and measurement.

Observe Experiment Measure

Prefixes to ‘Science’

Science: Observe, Experiment, Measure
Computer Science: Compute, Store, Visualize, Analyze, Automate
Biological Science: Research, Solve, Develop, Synthesize
Data Science: Capture, Prepare, Process, Analyze, Visualize, Predict, Uncover Insights, Enable Decision Making
DATA SCIENCE

Data Science is the "study of data".


Data Science is the art of uncovering insights and trends that are hidden in the
data.
Data Science helps translate data into a story. The storytelling helps in uncovering
insights. The insights help in making decisions or strategic choices.
Data Science is the process of using data to understand different things.
• Requires a major effort of preparing, cleaning, scrubbing, or standardizing the data.
• Algorithms are then applied to crunch pre-processed data.
• This process is iterative and requires analysts’ awareness of the best practices.
• The most important aspect of data science is interpreting the results of the analysis in
order to make decisions.

DATA SCIENCE – INTERDISCIPLINARY FIELD

DATA SCIENCE – MULTIPLE DISCIPLINES
WHY DATA SCIENCE?

• “Data Scientist: The Sexiest Job of the 21st Century” – Harvard Business Review (2012).

• “Data is the New Oil” – Tesco marketing mastermind Clive Humby (2006)
• Data Science is one of the fastest growing fields in the world.
• According to the U.S. Bureau of Labor Statistics, 11.5 million new data science jobs will be
created by the year 2026.
• Even with the COVID-19 situation and the shortage of talent, there is unlikely to be a
dip in data science as a career option.

WHY DATA SCIENCE?
• In India, the average salary of a data scientist as of January 2022 is Rs.10L/yr.
[Glassdoor, 2022].
• The growth of data science as a career choice in 2022 will also see a rise in its
various job roles:
• Data Engineer
• Data Administrator
• Machine Learning Engineer
• Statistician
• Data and Analytics Manager

NEED FOR DATA SCIENCE – DIGITAL DATA DELUGE

https://www.retailtouchpoints.com/resources/digital-data-deluge-becomes-a-tsunami-due-to-covid-19

NEED FOR DATA SCIENCE

Data deluge – resulting in tons of data.


Supportive technologies:
• Powerful algorithms to support computation
[Ex: Transformer models like BERT, GPT-3]
• Open source software and tools [Python]
• Computational speed, accuracy and cost [Cloud Computing – Azure, AWS]
• Data storage in terms of capacity and cost.

DATA SCIENCE, AI AND ML CONVERGENCE
Artificial Intelligence
• AI involves making machines capable of mimicking human behavior, particularly
cognitive functions like facial recognition, automated driving, sorting mail based on
postal code.
Machine Learning
• Considered a sub-field of or one of the tools of AI.
• Involves providing machines with the capability of learning from experience.
• Experience for machines comes in the form of data.
Data Science
• Data science is the application of machine learning, artificial intelligence, and other
quantitative fields like statistics, visualization, and mathematics to uncover insights from
data to enable better decision making.

DATA SCIENCE, AI AND ML

https://www.sciencedirect.com/topics/physics-and-astronomy/artificial-intelligence

TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE VS. BUSINESS INTELLIGENCE
5 DATA SCIENTIST
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 DATA SCIENCE CHALLENGES

USE CASES OF DATA SCIENCE

DataFlair
DATA SCIENCE IN FACEBOOK

Social Analytics
Utilizes quantitative research to gain insights about the social interactions among
people.
Makes use of deep learning, facial recognition, and text analysis.
In facial recognition, it uses powerful neural networks to classify faces in the
photographs.
In text analysis, it uses “DeepText” to understand people’s interests and align
photographs with texts.
It uses deep learning for targeted advertising.
Using the insights gained from data, it clusters users based on their preferences and
provides them with the advertisements that appeal to them.
DATA SCIENCE IN AMAZON

Improving E-Commerce Experience


Personalized recommendation
• Predictive analytics (a personalized recommender system) to increase customer
satisfaction.
• Purchase history of customers, other customer suggestions, and user ratings are
analyzed to recommend products. [Product recommendation]
Anticipatory shipping model [Inventory Update & Management]
• Predict the products that are most likely to be purchased by its users.
• Analyzes pattern of customer purchases and keeps products in the nearest warehouse
which the customers may utilize in the future. [Market Basket Analysis – Data Mining]

DATA SCIENCE IN AMAZON – CONTD.

Improving E-Commerce Experience


Price discounts
• Using parameters such as the user activity, order history, prices offered by the
competitors, product availability, etc., Amazon provides discounts on popular items and
earns profits on less popular items.
Fraud Detection
• Detect fraud sellers and fraudulent purchases.

DATA SCIENCE IN UBER
Improving Rider Experience
Uber maintains a large database of drivers, customers, and several other records.
It makes extensive use of Big Data and crowdsourcing to derive insights and provide the
best services to its customers.
Dynamic pricing
• Use of big Data and data science to calculate fares based on specific parameters.
• Uber matches customer profile with the most suitable driver and charges them based on
the time it takes to cover the distance rather than the distance itself.
• The time of travel is calculated using algorithms that make use of data related to traffic
density and weather conditions.
• When the demand is higher (more riders) than supply (less drivers), the price of the ride
goes up. [Rainy Season]

DATA SCIENCE IN BANK OF AMERICA
Improving Customer Experience
Erica – a virtual financial assistant (BoA)
• Erica serves as a customer advisor to over 45 million users around the world.
• Erica makes use of Speech Recognition to take customer inputs.
Fraud detection
• Uses data science and predictive analytics to detect frauds in payments,
insurance, credit cards, and customer information.
Customer segmentation
• Segment customers into high-value and low-value segments.
• Data scientists make use of clustering, logistic regression, and decision trees to
help banks understand Customer Lifetime Value (CLV) and group customers
into the appropriate segments.
• Customer segmentation helps in up-selling and cross-selling of products.
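
As a rough illustration of the clustering step, here is a minimal sketch assuming scikit-learn and hypothetical balance/transaction features (illustrative only, not the bank's actual pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features: [average balance, monthly transactions]
customers = np.array([
    [52000, 45], [1200, 6], [78000, 60], [900, 4],
    [64000, 52], [1500, 8], [700, 3], [83000, 70],
])

# Scale the features so that balance does not dominate the distance metric
X = StandardScaler().fit_transform(customers)

# Two clusters as a stand-in for high-value vs. low-value segments
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)  # cluster label per customer
```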

DATA SCIENCE IN AIRBNB

Improving Customer Experience


Providing better search results
• Uses big data of customer and host information, homestays and lodge records, and
website traffic.
• Uses data science to provide better search results to its customers and find compatible
hosts.
Detecting bounce rates
• Use of demographic analytics to analyze bounce rates from their websites.
Providing ideal lodgings and localities
• Uses knowledge graphs where the user’s preferences are matched with the various
parameters to provide ideal lodgings and localities.

DATA SCIENCE IN SPOTIFY

Improving Customer Experience and recommendation


Providing better music streaming experience
• Provide personalized music recommendations.
• Uses over 600 GBs of daily data generated by the users to build its algorithms to boost
user experience.
Improving experience for artists and managers
• Spotify for Artists application allows the artists and managers to analyze their streams,
fan approval and the hits they are generating through Spotify’s playlists.

DATA SCIENCE IN SPOTIFY – CONTD.

Spotify uses data science to gain insights about which universities had the highest
percentage of party playlists and which ones spent the most time on them.
“Spotify Insights” publishes information about the ongoing trends in music.
Spotify’s Niland, an API based product, uses machine learning to provide better
searches and recommendations to its users.
Spotify analyzes listening habits of its users to predict the Grammy Award Winners.

DATA SCIENCE IN HEALTHCARE
Covid Patient Discharge Prediction (Dataset: 2nd Wave, April 2021 to June 2021)
Type of Project: Machine Learning
Dataset size: 1233 patients suffering from Covid
Variables:
X: Age, Gender, Co_morbid, Admit Date, Discharge date, days of stay,
covid_severity
Y: Discharge Type (Recovered, Expired)
Exploratory Data Analysis: Univariate, Bivariate, Multivariate
Models applied: Support Vector Machine, Naïve Bayes, Logistic Regression,
Decision Trees, KNN, ANN, Random Forest
Best Accuracy: Random Forest (92%)
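
A minimal sketch of how such a classifier could be fit, assuming scikit-learn and a hypothetical CSV whose columns mirror the variables above (not the actual study code):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("covid_discharge.csv")  # hypothetical file
# One-hot encode categorical predictors such as Gender and covid_severity
X = pd.get_dummies(df[["Age", "Gender", "Co_morbid", "days_of_stay", "covid_severity"]])
y = df["Discharge_Type"]  # "Recovered" or "Expired"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # held-out accuracy
```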

APPLICATIONS OF DATA SCIENCE

DataFlair
APPLICATIONS OF DATA SCIENCE

edureka.co
TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE VS. BUSINESS INTELLIGENCE
5 DATA SCIENTIST
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 DATA SCIENCE CHALLENGES

DATA SCIENCE VS. BUSINESS INTELLIGENCE

Business intelligence comprises the strategies and technologies used by enterprises for the analysis and
management of business information. One of the key BI components is the Data Warehouse.

DATA SCIENCE VS. BUSINESS INTELLIGENCE

DATA SCIENTIST VS. BI ANALYST

DATA SCIENCE VS. STATISTICS
• Statistics is the science of collecting, analyzing, presenting, and interpreting data. The objective
is to draw conclusions about the population.
• The science of statistics enables Data Science.
• Data Science expands the application of statistics towards solving Big Data challenges.
• Data Science comprises the 4 As (data architecture, data acquisition, data analysis and data
archiving). The two types of statistics, namely ‘descriptive’ and ‘inferential’, are applied during the
‘Data Analysis’ phase in data science.

*Source - H. Hassani et al., “The science of statistics versus data science: What is the future?”, Technological
Forecasting and Social Change (Elsevier), Volume 173, 2021
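
As a small illustration of the two types of statistics in the data-analysis phase, here is a sketch on synthetic samples (assuming NumPy and SciPy are available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=70, scale=5, size=100)  # synthetic sample A
group_b = rng.normal(loc=72, scale=5, size=100)  # synthetic sample B

# Descriptive statistics: summarize the samples we actually have
print(group_a.mean(), group_a.std())

# Inferential statistics: draw a conclusion about the underlying populations
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)  # a small p-value suggests the population means differ
```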

DATA SCIENCE VS. STATISTICS
Theoretical Origins:
  Statistics – mathematical biology and biometry
  Data Science – statistics and probability
Main Focus:
  Statistics – theoretical sophistication
  Data Science – practical solutions to real problems
Main Approach:
  Statistics – methodology / model development and confirmation
  Data Science – application of machine learning and data mining models
Focus of Model Building:
  Statistics – examination of correlations and causality between the variables
  Data Science – hyperparameter optimization and feature selection
Interpretability vs. Accuracy:
  Statistics – high interpretability, low accuracy
  Data Science – high accuracy, low interpretability (XAI or Explainable AI)

DATA SCIENCE VS. STATISTICS
Type of problem:
  Statistics – well structured [Survey – Likert scale data]
  Data Science – semi-structured or unstructured
Size of dataset:
  Statistics – small, homogeneous
  Data Science – large, heterogeneous
End Goal:
  Statistics – data analysis
  Data Science – data analysis and prediction

Source - H. Hassani et al., “The science of statistics versus data science: What is the future?”, Technological
Forecasting and Social Change (Elsevier), Volume 173, 2021

Data Mining vs Data Science
• Data Mining field started in 1989 as “Algorithms for Pattern Recognition”, later remodeled as a “Step in the KDD process”
• Data Mining is Goal-oriented and Process driven in nature!
• Understand the business goals first, then apply the DM process to arrive at a result!
• Process takes center stage!
• More of ‘mining’ the data to find insights using algorithms!

• The term Data Science was first coined in 1962, but was remodeled in 2007 as “Derive insights from big data for making smarter
decisions”
• Data Science is Data-oriented and Exploratory in nature!
• Data exploration may help define the business goals or insights and arrive at results!
• Data takes the center stage!
• More work in ‘exploring or searching’ data, than actual mining!

TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE VS. BUSINESS INTELLIGENCE
5 DATA SCIENTIST
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 DATA SCIENCE CHALLENGES

WHO IS A DATA SCIENTIST?

A data scientist is someone who extracts insights from messy data.


A data scientist is responsible for guiding a data science project from start to finish.
Success in a data science project comes not just from any one tool, but from having
quantifiable goals, good methodology, cross-discipline interactions, and a repeatable
workflow.

ROLE OF A DATA SCIENTIST

Reframe business challenges as analytics challenges.


This is the skill to diagnose a problem, consider its core, and
determine which kinds of candidate analytical methods can be applied to solve it.
Design, implement and deploy statistical models and data mining techniques on
data. This activity is mainly the role of the data scientist, applying complex or advanced
analytical methods to a variety of business problems using data.
Develop insights that lead to actionable recommendations.

Learn how to draw insights out of data and communicate them effectively.

Data Science – Hierarchy of Needs

Differences between roles

Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at
scale. It lays the foundation for data analysis and is concerned with the security, reliability, fault tolerance, scalability and
efficiency of data processing systems.
SKILLS REQUIRED FOR A DATA SCIENTIST

A data scientist is: communicative, curious, creative, technical, qualitative, and skeptical.

TOOLS AVAILABLE TO A DATA SCIENTIST

Tools: R, Python, SQL, Scala, SAS, Hadoop, Julia, Tableau, Weka

ALGORITHMS FOR A DATA SCIENTIST

Algorithms: Linear Regression, Logistic Regression, K-means Clustering, PCA, Apriori, SVM, Decision Tree, ANN
TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE VS. BUSINESS INTELLIGENCE
5 DATA SCIENTIST
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 DATA SCIENCE CHALLENGES

SOFTWARE ENGINEERING
In general,
Software engineering is an engineering discipline that is concerned with all aspects of
software production.
Software includes computer programs, all associated documentation, and
configuration data that are needed for software to work correctly.
Waterfall model, Iterative models, Agile models

DATA SCIENCE PROCESS

SOFTWARE ENGINEERING VS. DATA SCIENCE

Scope: Software engineering is concerned with creating useful applications; data science involves collecting, analyzing and visualizing data.
Process: Software engineers use the SDLC process; data scientists use the ETL (Extract, Transform, Load) process.
Methodology: Software engineering uses frameworks like Waterfall and Agile (Scrum, XP); data science uses methodologies like CRISP-DM, SMAM, SEMMA, the Big Data Lifecycle, etc.
Tools: Software engineers use programming languages like C# and Java and web frameworks like Django and Flask; data scientists use tools like Amazon S3, MongoDB, Hadoop, and MySQL.
Skills: Software engineering skills focus on coding languages; data science skills include machine learning, statistics, and data visualization.
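
To make the ETL contrast concrete, here is a minimal sketch assuming pandas, SQLite, and hypothetical file/table names:

```python
import sqlite3
import pandas as pd

# Extract: read raw records from a source file (hypothetical path)
raw = pd.read_csv("sales_raw.csv")

# Transform: clean and reshape the data for analysis
raw["date"] = pd.to_datetime(raw["date"])
clean = raw.dropna(subset=["amount"])
daily = clean.groupby(clean["date"].dt.date)["amount"].sum().reset_index()

# Load: write the transformed data into an analytics database
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_sales", conn, if_exists="replace", index=False)
```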

TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE VS. BUSINESS INTELLIGENCE
5 DATA SCIENTIST
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 DATA SCIENCE CHALLENGES

DATA SCIENCE CHALLENGES

Data science challenges can be categorized as:


Data related
Organization related
Technology related
People related
Skill related

DATA SCIENCE CHALLENGES

Source – Business Broadway Survey 2018

COGNITIVE BIAS

Cognitive Biases are the distortions of reality because of the lens through which we
view the world. [Subjective vs Objective view of reality]
Each of us sees things differently based on our preconceptions, past experiences,
cultural, environmental, and social factors. This doesn’t necessarily mean that the
way we think or feel about something is truly representative of reality.

References:

• Introducing Data Science by Cielen, Meysman and Ali


• The Art of Data Science by Roger D Peng and Elizabeth Matsui

• https://data-flair.training/blogs/data-science-use-cases/
• https://www.northeastern.edu/graduate/blog/what-does-a-data-scientist-do/
• https://www.visual-paradigm.com/guide/software-development-process/what-is-a-software-process-model/
• https://www.sciencedirect.com/science/article/abs/pii/S0040162521005448
THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE #2: DATA ANALYTICS
IDS Course Team
BITS Pilani
TABLE OF CONTENTS

1 ANALYTICS
2 BIG DATA
3 DATA ANALYTICS
4 CASE STUDIES ON DATA ANALYTICS

DEFINITION OF ANALYTICS – DICTIONARY

OXFORD: Analytics is the systematic computational analysis of data or statistics.

CAMBRIDGE: Analytics is a process in which a computer examines information using
mathematical methods in order to find useful patterns.

DICTIONARY.COM: Analytics is the analysis of data, typically large sets of business data,
by the use of mathematics, statistics, and computer software.

Analytics is treated as both a noun and a verb.

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
DEFINITION OF ANALYTICS – WEBSITES

ORACLE: Analytics is the process of discovering, interpreting, and communicating
significant patterns in data and using tools to empower your entire
organization to ask any question of any data in any environment on any
device.

EDUREKA: Data Analytics refers to the techniques used to analyze data to enhance
productivity and business gain.

INFORMATICA: Data analytics is the pursuit of extracting meaning from raw data using
specialized computer systems.

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
DEFINING ANALYTICS

Analytics is the process of extracting and creating information from raw data by using
techniques such as:
• filtering, processing, categorizing, condensing and contextualizing the data.
Analytics is a broad term that encompasses the processes, technologies, frameworks
and algorithms to extract meaningful insights from data.
This information thus obtained is then used to infer knowledge about the system
and/or its users, and its operations to make the systems smarter and more efficient.

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
GOALS OF DATA ANALYTICS
To predict something
• whether a transaction is a fraud or not [Banking]
• whether it will rain on a particular day [Weather Forecast]
• whether a tumor is benign or malignant [Cancer Prediction, Healthcare]
To find patterns in the data
• finding the top 10 coldest days in the year [Weather Forecast]
• which pages are visited the most on a particular website [Web Traffic Rank]
• finding the most searched celebrity in a particular year [Awards]
To find relationships in the data
• finding similar news articles [Bing, Google]
• finding similar patients in an electronic health record system [Healthcare]
• finding related products on an e-commerce website [Recommendation]
• finding correlation between news items and stock prices
* https://www.cnbc.com/2022/04/04/twitter-shares-soar-more-than-25percent-after-elon-musk-takes-9percent-stake-in-social-media-company.html

TABLE OF CONTENTS

1 ANALYTICS
2 BIG DATA
3 DATA ANALYTICS
4 CASE STUDIES ON DATA ANALYTICS

BIG DATA

Big data is defined as collections of datasets whose volume, velocity or variety is so


large that it is difficult to store, manage, process and analyze the data using
traditional databases and data processing tools.
Big Data analytics deals with collection, storage, processing and analysis of this
massive scale data.
Specialized tools and frameworks are required for big data analysis when:
1 the volume of data involved is so large that it is difficult to store, process and analyze
data on a single machine
2 the velocity of data is very high and the data needs to be analyzed in real-time
3 there is variety of data involved, which can be structured, unstructured or
semi-structured, and is collected from multiple data sources
4 various types of analytics need to be performed to extract value from the data

BIG DATA – EXAMPLE

CHARACTERISTICS OF BIG DATA

Big Data 5 V’s: Volume, Velocity, Variety, Veracity, Value

CHARACTERISTICS OF BIG DATA

1 Volume
• Volume of data involved is so large that it is difficult to store, process and analyze data
on a single machine.
• Volumes of data generated by IT / IoT systems are growing exponentially.
• lowering costs of data storage and processing architectures [possible due to Cloud]
• need to extract valuable insights from the data to improve business processes, efficiency
and service to consumers.
2 Velocity
• Velocity of data refers to how fast the data is generated.
• High velocity of data results in the accumulated volume of data becoming very large in a
short span of time.
• Need to consider parameters such as data provenance and accuracy

CHARACTERISTICS OF BIG DATA

3 Variety
• Variety refers to the forms / types of the data.
• Big data comes in different forms such as structured, unstructured or semi-structured,
including text data, image, audio, video and sensor data.
4 Veracity
• Veracity refers to how accurate the data is.
• To extract value from the data, the data needs to be cleaned to remove noise.
5 Value
• Value of data refers to the usefulness of data for the intended purpose.
• The value of the data is also related to the veracity or accuracy of the data.
• For some applications value also depends on how fast we are able to process the data.
[Static (Warehouse) vs Real Time (lecture)]
TABLE OF CONTENTS

1 ANALYTICS
2 BIG DATA
3 DATA ANALYTICS
4 CASE STUDIES ON DATA ANALYTICS

DATA ANALYTICS

Data analytics is defined as a process of cleaning, transforming, and modeling data


to discover useful information for business decision-making.
4 different types of analytics
1 Descriptive Analytics
2 Diagnostic Analytics
3 Predictive Analytics
4 Prescriptive Analytics

DATA ANALYTICS

DESCRIPTIVE ANALYTICS

Answers the question of what happened.


Summarize past data usually in the form of dashboards.
Insights into the past.
Also known as statistical analysis.
Raw data from multiple data sources.

DESCRIPTIVE ANALYTICS EXAMPLE - I

DESCRIPTIVE ANALYTICS EXAMPLE - II

Paper - Healthcare Delivery through Telemedicine during the COVID-19 Pandemic: Case Study from a Tertiary Care Center in South India
https://pubmed.ncbi.nlm.nih.gov/33528313/

DESCRIPTIVE ANALYTICS

Techniques:
• Descriptive Statistics - histogram, correlation
• Data Visualization
• Exploratory Analysis [Seaborn Library in Python]
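
A minimal exploratory-analysis sketch of these techniques, using one of seaborn's bundled example datasets (assuming pandas, seaborn, and matplotlib are installed):

```python
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("tips")  # small example dataset bundled with seaborn

print(df.describe())               # descriptive statistics per numeric column
print(df.corr(numeric_only=True))  # pairwise correlations

sns.histplot(df["total_bill"])     # histogram of one variable's distribution
plt.show()
```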

DIAGNOSTIC ANALYTICS

• Answers the question of why something happened.


• Gives in-depth insights into data.
• Identify relationship between data and identify patterns of behavior.

• Diagnostic analytics is a form of data analytics that builds on descriptive analytics to


help you understand why something happened in the past.

• Often, diagnostic analysis is referred to as root cause analysis. It involves processes


such as data discovery, data mining, and drill down and drill through.

DIAGNOSTIC ANALYTICS EXAMPLE
What is the effect of global warming on the Southwest monsoon?

DIAGNOSTIC ANALYTICS

• Pattern recognition to identify patterns.


• Linear / Logistic regression to identify relationship.
• Neural Network
• Deep Learning techniques
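
As a small sketch of using regression to surface relationships (synthetic data, assuming scikit-learn; illustrative only), logistic-regression coefficients can point at which factors relate to an outcome:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Synthetic example: outcome driven mostly by the first factor
X = rng.normal(size=(200, 2))  # e.g., hypothetical price change, ticket count
y = (1.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
# A larger |coefficient| indicates a stronger association with the outcome,
# which helps answer the diagnostic question "why did this happen?"
print(model.coef_)
```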

PREDICTIVE ANALYTICS

• Answers the question of what is likely to happen.
• Predicts future trends.
• Being able to predict allows one to make better decisions.
• Analysis is based on machine learning or deep learning.
• Accuracy of the forecast or prediction depends highly on data quality and the stability
of the situation.

PREDICTIVE ANALYTICS EXAMPLE - I

PREDICTIVE ANALYTICS EXAMPLE - II

Covid Patient Discharge Prediction (Dataset: 2nd Wave, April 2021 to June 2021)


Type of Project: Machine Learning
Dataset size: 1233 patients suffering from Covid
Variables:
X: Age, Gender, Co_morbid, Admit Date, Discharge date, days of stay,
covid_severity
Y: Discharge Type (Recovered, Expired)
Exploratory Data Analysis: Univariate, Bivariate, Multivariate
Models applied: Support Vector Machine, Naïve Bayes, Logistic Regression,
Decision Trees, KNN, ANN, Random Forest
Best Accuracy: Random Forest (92%)

PREDICTIVE ANALYTICS

Techniques / Algorithms:
• Regression
• Classification
• ML algorithms like Linear regression, Logistic regression, SVM
• Deep Learning techniques

PRESCRIPTIVE ANALYTICS

Answers the question of what action should be taken.


Data-driven decision making and corrective actions
Prescribe what action to take to eliminate a future problem or take full advantage of a
promising trend.
Need historical internal data and external information like trends.
Analysis based on machine or deep learning, business rules.
Use of AI to improve decision making.

PRESCRIPTIVE ANALYTICS EXAMPLE - I
• Apollo Hospitals uses an AI tool to predict the risk of cardiovascular disease.
• The Apollo AI-powered “Cardiovascular Disease Risk” tool will help healthcare providers to predict
the risk of cardiac disease in their patients [Predictive Analytics]
• The prediction initiates intervention early enough to make a real difference. [Prescriptive]
• The cardiac risk scoring tool is remarkable for the speed in processing data and its accuracy at
predicting the probability of a patient developing coronary disease.
• Using the tool, physicians will be enabled to deliver proactive, pre-emptive and preventive care for
at-risk individuals, improving lives, while mitigating future pressure on healthcare systems.

https://www.apollohospitals.com/apollo-in-the-news/apollo-hospitals-has-launched-an-artificial-intelligence-tool-to-predict-the-risk-of-cardiovascular-disease/

PRESCRIPTIVE ANALYTICS EXAMPLE - II
How can we improve crop production?

Types of Data Analytics

Types of Data Analytics

Exercise

Instagram Reels allows users to create fun videos and share with their contacts. Users can
record 15 second multi-clip videos with audio and effects. Some features include: exploring reels
based on subject; following, commenting and liking a reel; identifying trends to create new reels.
The reels are released in two versions – public (free for all), and premium (subscription basis).

 Discuss the four analytical tasks that can be performed with respect to Instagram Reels.
[Descriptive, Diagnostic, Predictive and Prescriptive]

Types of Data Analytics

Instagram Reels

Descriptive - How many followers do you have, how many views, comments, likes for your
video [free], audience breakdown by country, follower activity per hour [premium]
Diagnostic - Why your video’s engagement rate is low. [premium users]
Predictive - Trending topics for you to make video on - their approximate engagement rates
[premium]
Prescriptive - Tips to increase average watch time of your videos [premium]

COGNITIVE ANALYTICS
Cognitive Analytics – What I Don’t Know?

https://www.10xds.com/blog/cognitive-analytics-to-reinvent-business/

COGNITIVE ANALYTICS
• Next level of Analytics
• Human cognition is based on the context and reasoning.
• Cognitive systems mimic how humans reason and process.
• Cognitive systems analyze information and draw inferences using probability.
• They continuously learn from data and reprogram themselves.
• According to one source:
• ”The essential distinction between cognitive platforms and artificial
intelligence systems is that you want an AI to do something for you. A
cognitive platform is something you turn to for collaboration or for advice.”

https://interestingengineering.com/cognitive-computing-more-human-than-artificial-intelligence

COGNITIVE ANALYTICS
• Involves Semantics, AI, Machine learning, Deep
Learning, Natural Language Processing, and Neural
Networks.
• Simulates human thought process to learn from the data
and extract the hidden patterns from data.
• Uses all types of data: audio, video, text, images in the
analytics process.
• Although this is the top tier of analytics maturity, Cognitive
Analytics can be used in the prior levels.
• According to Jean Francois Puget:
• ”It extends the analytics journey to areas that were
unreachable with more classical analytics techniques like
business intelligence, statistics, and operations research.”

COGNITIVE ANALYTICS

Example of Cognitive Analytics : Woebot Mental Health App

• Provides mental health support, using Cognitive Behavioral Therapy (CBT)


• NLP based self-learning App that advises / chats with users on mental health, developed by Stanford
University

Benefits:
• Using Woebot led to significant reductions in anxiety and depression among people aged 18-28 years
old, compared to an information-only control group.
• 85% of participants used Woebot on a daily or almost daily basis.

DATA ANALYTICS - BASED ON DOMAIN
Types of analytics according to the domain
1 Marketing Analytics
2 Financial Analytics
3 Healthcare Analytics
4 Sports Analytics
5 HR Analytics
6 Customer Analytics
7 Web Analytics
8 Social Analytics
9 Political Analytics

Sports Analytics - Powerbat

Web Analytics – Google Analytics

DATA ANALYTICS - TYPE OF DATA

Types of analytics according to the type of data


1 Text analytics
2 Real-time data analytics
3 Multimedia analytics
4 Geo analytics
5 Mobile analytics

Geo Analytics – Location Intelligence

https://medium.com/loctruth/unlock-the-power-of-location-intelligence-c0cea20d5a06

TABLE OF CONTENTS

1 ANALYTICS
2 BIG DATA
3 DATA ANALYTICS
4 CASE STUDIES ON DATA ANALYTICS

DESCRIPTIVE ANALYTICS – EXAMPLE #1

Problem Statement:
“Market research team at Aqua Analytics Pvt. Ltd is assigned a task to identify the profile of a
typical customer for a digital fitness band that is offered by Titanic Corp. The market research
team decides to investigate whether there are differences across the usage patterns and
product lines with respect to customer characteristics.”

Data captured:
• Gender
• Age (in years)
• Education (in years)
• Relationship status (Single or Partnered)
• Annual household income
• Average number of times the customer tracks activity each week
• Number of miles the customer expects to walk each week
• Self-rated fitness on a scale of 1–5, where 1 is poor shape and 5 is excellent shape
• Model of the product purchased – IQ75, MZ65, DX87

https://medium.com/@ashishpahwa7/first-case-study-in-descriptive-analytics-a744140c39a4
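
A minimal sketch of how such a profile could be computed, assuming pandas and hypothetical column names mirroring the captured data:

```python
import pandas as pd

df = pd.read_csv("fitness_band_customers.csv")  # hypothetical survey file

# Profile the typical customer for each product line (hypothetical columns)
profile = df.groupby("Model")[["Age", "Income", "Fitness", "Miles"]].mean()
print(profile)

# Compare weekly usage patterns across product lines
print(df.groupby("Model")["Usage_per_week"].describe())
```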


DESCRIPTIVE ANALYTICS – EXAMPLE #1

DIAGNOSTIC ANALYTICS – EXAMPLE #1

Problem Statement :
“During the 1980s, General Electric was selling different products to its customers, such as
light bulbs, jet engines, windmills, and other related products. They also sold parts and
services separately: GE would sell you a product, you would use it until it needed repair,
either because of normal wear and tear or because it broke, and you would come back to
GE, which would then sell you the parts and services to fix it. The model for GE focused on
how much GE was selling, in sales of operational equipment and in sales of parts and
services, and on what GE needed to do to drive up those sales.”

https://medium.com/parrotai/understand-data-analytics-framework-with-a-case-study-in-the-business-world-15bfb421028d

DIAGNOSTIC ANALYTICS – EXAMPLE #1

https://www.sganalytics.com/blog/change-management-analytics-adoption/

PREDICTIVE ANALYTICS – EXAMPLE

• Google launched Google Flu Trends (GFT), to collect predictive analytics regarding the
outbreaks of flu. It’s a great example of seeing big data analytics in action.
• So, did Google manage to predict influenza activity in real-time by aggregating search engine
queries with this big data and adopting predictive analytics?
• Even with a wealth of big data analytics on search queries, GFT overestimated the prevalence
of flu by over 50% in 2012-2013 and 2011-2012.
• They matched the search engine terms conducted by people in different regions of the world.
• And, when these queries were compared with traditional flu surveillance systems, Google found
that the predictive analytics of the flu season pointed towards a correlation with higher search
engine traffic for certain phrases.

PREDICTIVE ANALYTICS – EXAMPLE

https://www.slideshare.net/VasileiosLampos/usergenerated-content-collective-and-personalised-inference-tasks

PRESCRIPTIVE ANALYTICS

Whenever you go to Amazon, the site recommends dozens and dozens of products to
you. These are based not only on your previous shopping history (reactive), but also
based on what you’ve searched for online, what other people who’ve shopped for the
same things have purchased, and about a million other factors (proactive).
Amazon and other large retailers are taking descriptive, diagnostic, and predictive data
and then running it through a prescriptive analytics system to find products that you
have a higher chance of buying.
Every bit of data is broken down and examined with the end goal of helping the
company suggest products you may not have even known you wanted.

https://accent-technologies.com/2020/06/18/examples-of-prescriptive-analytics/

HEALTHCARE ANALYTICS – CASE STUDY

Self study
https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/
https://www.youtube.com/watch?v=olpuyn6kemg

References:

Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti

https://blog.hootsuite.com/tiktok-analytics/
THANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE #3: DATA ANALYTICS - METHODOLOGIES
IDS Course Team
BITS Pilani
TABLE OF CONTENTS

1 DATA ANALYTICS
2 DATA ANALYTICS METHODOLOGIES
3 CRISP-DM
4 SEMMA
5 SMAM
6 BIG DATA LIFE-CYCLE
7 CHALLENGES IN DATA DRIVEN DECISION-MAKING

DATA ANALYTICS

Data Analytics is defined as a process of


cleaning, transforming, and modeling data to
discover useful information for business
decision-making.

TABLE OF CONTENTS

1 DATA ANALYTICS
2 DATA ANALYTICS METHODOLOGIES
3 CRISP-DM
4 SEMMA
5 SMAM
6 BIG DATA LIFE-CYCLE
7 CHALLENGES IN DATA DRIVEN DECISION-MAKING

DATA ANALYTICS METHODOLOGIES

• Methodology is a set of guiding principles and processes used to plan,


manage, and execute projects.
• It helps data analysts to reduce risks, avoid duplication of efforts and to
ultimately increase the impact of the project.
Use standard methodology to ensure a good outcome.
1 CRISP-DM
2 SEMMA
3 SMAM
4 Big Data Life-cycle

NEED FOR A METHODOLOGY

• Framework for recording experience.


• Allows projects to be replicated
• Aid to project planning and management.
• “Comfort factor” for new adopters
• Demonstrates maturity of Data Mining
• Encourage best practices and help to obtain better results.

DATA ANALYTICS METHODOLOGY
10 Questions the process aims to answer
Problem to Approach
1 What is the problem that you are trying to solve?
2 Are there available solutions to similar problems?
Working with Data
3 What data do you need to answer the question?
4 Where is the data coming from? Identify all Sources. How will you acquire it?
5 Is the data that you collected representative of the problem to be solved?
6 What additional work is required to manipulate and work with the data?
Delivering the Answer
7 In what way can the data be visualized to get to the answer that is required?
8 Does the model used really answer the initial question or does it need to be adjusted?
9 Can you put the model into practice?
10 Can you get constructive feedback into answering the question?
TABLE OF CONTENTS

1 DATA ANALYTICS
2 DATA ANALYTICS METHODOLOGIES
3 CRISP-DM
4 SEMMA
5 SMAM
6 BIG DATA LIFE-CYCLE
7 CHALLENGES IN DATA DRIVEN DECISION-MAKING

CRISP-DM
CRISP-DM Phases
 Cross Industry Standard Process for Data
Mining
 People realized they needed a process to define
data mining steps applicable across any Industry
such as Retail, E-Commerce, Healthcare etc.
 Conceived by Daimler-Benz and Integral
Solutions Ltd in the year 1996

 6 high-level phases

 Iterative approach to the development of


analytical models.

CRISP-DM PHASES

1. Business understanding – What does the business need?


• Understand project objectives and requirements.
• Based on domain knowledge and business strategies.
2. Data understanding – What data do we have / need? Is it clean?
• Initial data collection and familiarization.
• Identify data quality issues.
• Identify initial obvious results.
3. Data preparation – How do we organize the data for modeling?
• Record and attribute selection.
• Data cleansing.

CRISP-DM PHASES

4. Modeling – What modeling techniques should we apply?


• Run the data mining tools.
5. Evaluation – Which model best meets the business objectives?
• Determine if results meet business objectives.
• Identify business issues that should have been addressed earlier.
6. Deployment – How do stakeholders access the results?
• Put the resulting models into practice.
• Set up for continuous mining of the data.

CRISP-DM PHASES AND TASKS

WHY CRISP-DM?

1. Reliable and Repeatable by people with little data mining skills.


2. Evergreen [Applicable for Data Mining, Data Scientists, Data Analyst titles]
3. Most other methodologies have evolved from CRISP-DM over time, so
understanding this is essential
4. Thorough [Interdisciplinary – Work with Managers, SMEs, Other teams]
5. Practical
• Concept easy to understand
• Always ties investigation with application [Always tied to Business Problem]
• Flexible
• Free

Advantages and Disadvantages
Advantages:
• Clearly defined process (phases and tasks).
• Supports various data mining techniques
• Has documentation of several successful case studies following the approach

Disadvantages:
• Long and Complicated process
• Blind hand-off to IT from the Data Science team without envisioning the operationalization
• No real measure of ROI, once all phases are completed

https://www.diva-portal.org/smash/get/diva2:1250897/FULLTEXT01.pdf

TABLE OF CONTENTS

1 DATA ANALYTICS
2 DATA ANALYTICS METHODOLOGIES
3 CRISP-DM
4 SEMMA
5 SMAM
6 BIG DATA LIFE-CYCLE
7 CHALLENGES IN DATA DRIVEN DECISION-MAKING

SEMMA

• SAS Institute developed SEMMA as the process


for data mining.
• 5 stages - Sample, Explore, Modify, Model,
Assess
• Used to solve a wide range of business
problems, including fraud identification,
customer retention and turnover, database
marketing, customer loyalty, bankruptcy
forecasting, market segmentation, as well as
risk, affinity, and portfolio analysis.

SEMMA

• SEMMA is a logical organization of the functional tool set of SAS Enterprise Miner
for carrying out the core tasks of data mining.
• Enterprise Miner is a Data Mining Software to create predictive and descriptive
models for large volumes of data.
• Enterprise Miner can be used as part of any iterative data mining methodology
adopted by the client. Naturally, steps such as formulating a well-defined business or
research problem and assembling quality representative data sources are critical to
the overall success of any data mining project.
• SEMMA is focused on the model development aspects of data mining.
• SEMMA overlaps with Data Preparation, Modelling and Evaluation phases of CRISP-DM

SEMMA STAGES
1. Sample
• Sampling the data by extracting a portion of a large data set big enough to contain the
significant information, yet small enough to manipulate quickly.
• Partitioning the data to create training and test samples.
• Identifying dependent and independent variables influencing the process.
2. Explore
• Exploration of the data by searching for unanticipated trends and anomalies in order to
gain understanding and ideas.
• Perform Univariate analysis (single variable) and multivariate analysis (relationships)
3. Modify
• Modification of the data by creating, selecting, and transforming the variables to focus
the model selection process.

SEMMA STAGES

4. Model
• Apply variety of data mining techniques to produce a projected model [ML, Deep Learning,
Transfer Learning]
5. Assess
• Assessing the data by evaluating the usefulness and reliability of the findings from the
data mining process and estimate how well it performs.

Advantages and Disadvantages
Advantages:
• Focus on only “Model aspects of Data Mining”
• Useful in most Machine Learning projects where data comes from a single data source
Ex: Pima Indian Diabetes Dataset [Predict Diabetes], Titanic Dataset [Predict
Passenger Survival] from Kaggle

Disadvantages:
• Does not take into account the business understanding of a problem
• Disregards Data Collection and Processing from different data sources

https://www.diva-portal.org/smash/get/diva2:1250897/FULLTEXT01.pdf

SEMMA – Case Study
Covid Patient Discharge Prediction (Dataset: 2nd Wave, April 2021 to June 2021)
Type of Project: Machine Learning
1. Sample : Dataset size: 1233 patients suffering from Covid
2. Explore: Univariate (null values, mean, basic statistics), Bivariate (correlation – Pearson, chi-square)
3. Modify : PCA (Principal Component Analysis)
4. Model : Feature Engineering, Subset selection
Final Variables:
X: Age, Gender, Co_morbid, Admit Date, Discharge date, days of stay, covid_severity
Y: Discharge Type (Recovered, Expired)
Models applied: Support Vector Machine, Naïve Bayes, Logistic Regression, Decision Trees, KNN, ANN,
Random Forest
5. Assess : Best Accuracy: Random Forest (92%)
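
A minimal end-to-end sketch of the five SEMMA stages on a dataset like the one above, assuming scikit-learn and hypothetical column names (not the actual study code):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("covid_discharge.csv")   # Sample: load the cohort (hypothetical file)
print(df.describe())                      # Explore: univariate statistics

X = pd.get_dummies(df.drop(columns=["Discharge_Type"]))
y = df["Discharge_Type"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pca = PCA(n_components=5).fit(X_train)    # Modify: reduce dimensionality
model = KNeighborsClassifier().fit(pca.transform(X_train), y_train)   # Model
print(accuracy_score(y_test, model.predict(pca.transform(X_test))))   # Assess
```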

TABLE OF CONTENTS

1 DATA ANALYTICS
2 DATA ANALYTICS METHODOLOGIES
3 CRISP-DM
4 SEMMA
5 SMAM
6 BIG DATA LIFE-CYCLE
7 CHALLENGES IN DATA DRIVEN DECISION-MAKING

SMAM
SMAM
(Standard Methodology for
Analytics Models)

http://www.datascienceassn.org/content/standard-methodology-analytical-models

SMAM PHASES
Use-case identification: Selection of the ideal approach from a list of candidates
Model requirements gathering: Understanding the conditions required for the model to function
Data preparation: Getting the data ready for the modeling
Modeling experiments: Scientific experimentation to solve the business question
Insight creation: Visualization and dash-boarding to provide insight
Proof of Value (ROI): Running the model in a small-scale setting to prove the value
Operationalization: Embedding the analytical model in operational systems
Model life-cycle: Governance around model lifetime and refresh
SMAM Phases
Phase I - Use Case Identification
• Brainstorming of Business / Management / SMEs (Domain) / IT (Data Scientist)
teams
• Discussion revolves around:
• Business Needs
• Expert inputs on the domain
• Data Availability
• Analytical Model Complexity – time and effort
• Outcome: Selected Use Case and roadmap for next phases

SMAM Phases
Phase II – Model Requirements Gathering
• Involved parties include Business / End-users / Data Scientists / IT
• Preparation of Model Requirement Document
• Business requirements
• IT requirements
• End user requirements
• Data requirements
• Analytical model requirements

SMAM Phases
Phase III – Data Preparation
• Involved parties include IT / Data Administrators / DBA / Data Modelers and Data
Scientists
• Discussion on:
• Data Access
• Data Location
• Data Understanding
• Data Validation
• Data format [prepared by DBAs and consumed by Data Scientist]
• The process is agile; the data scientist tries out various approaches on smaller data sets and may then ask IT / DBAs to perform the required transformations at scale.

SMAM Phases
Phase IV – Modeling Experiments
• Data Scientist:
• Creates testable hypothesis [Prediction of heart disease]
• Model features [Identify X and Y variables]
• Creates Analytical Model [Regression / Classification / Clustering]
• Evaluates the Analytical Model
[Metrics – Accuracy, Precision, Sensitivity, Specificity etc.]

SMAM Phases
Phase V – Insight Creation
• Data Scientist:
• Analytical reporting [Inference] and Operational reporting [Prediction]
• Visualization and Dashboards
• Provide business usable insights

SMAM Phases
Phase VI – Proof of Value: ROI
• Quality of the analytical model is observed [Ex: Accuracy of the model is >90%]
• Analytical model is applied to new data and outcomes are measured to verify if
financially viable [for small POC].
• If ROI is positive for POC:
• Set up full-scale experiment with control groups
• Measure the model effectiveness
• Compute ROI and success criteria
• Involve Finance department / IT / End-users and Data Scientists in this phase

SMAM Phases
Phase VII – Operationalization
• Data Scientist works with IT department to create repeatable experimentation of
the model; hand-over process of the model
• IT prepares the operational environment
• Integration with existing / legacy applications
• Possible software development as Mobile / Web App for end-user usage

SMAM Phases
Phase VIII – Model Lifecycle
• Involves maintenance of the analytical model in-view of changing customer needs
• Two types of model changes:
a. Model Refresh – Model is trained with more recent data, leaving the model
structurally untouched
b. Model Upgrade – Initiated by availability of new data sources and a
business request to improve model performance.
• Involved are operational team, IT team, Data Scientists, DBAs, end-users

TABLE OF CONTENTS

1 DATA ANALYTICS
2 DATA ANALYTICS METHODOLOGIES
3 CRISP-DM
4 SEMMA
5 SMAM
6 BIG DATA LIFE-CYCLE
7 CHALLENGES IN DATA DRIVEN DECISION-MAKING

BIG DATA ANALYTICS LIFECYCLE

• Big data differs from traditional data primarily due to volume, velocity, variety, veracity and value
• A step-by-step methodology is required to acquire, process, analyze and visualize the big data

Book - Big Data Fundamentals: Concepts, Drivers & Techniques


https://www.informit.com/articles/article.aspx?p=2473128&seqNum=11

BIG DATA ANALYTICS LIFECYCLE
Stage I : Business Case Evaluation
• Create a well-defined business case and get approval
• Identify KPIs that define the assessment criteria, to make business goals SMART
(specific, measurable, attainable, relevant, timely)
• Business case must qualify as a ‘big data’ problem – volume, velocity, variety,
veracity, value
• Outcome: Budget requirements, identify software (tools), hardware, training
requirements

BIG DATA ANALYTICS LIFECYCLE
Stage II : Data Identification
• Identify the datasets required for the project and their sources
• Guideline: Identify as many sources as possible, which help gain insights
• Sources can be internal / external to the enterprise
• Internal – Data marts, Data warehouses or operational systems
• External – Data within Blogs, websites etc.

BIG DATA ANALYTICS LIFECYCLE
Stage III : Data Acquisition and Filtering
• Data is gathered from all sources identified in the previous phase
• Data filtering is performed to remove corrupted / noise data
• Corrupt – records with missing / nonsensical values / invalid data
types
• Create metadata, helps in data provenance, accuracy and quality
• Dataset size & structure
• Source information
• Date and time of creation
• Language specific information

BIG DATA ANALYTICS LIFECYCLE
Stage IV : Data Extraction
• Extract disparate data and transform it into a format that the underlying Big Data solution can use for the purpose of data analysis.

[Figures: extraction of Latitude and Longitude from a JSON document; User Id and Comments extracted from an XML document]
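A small Python illustration of both extractions; the JSON/XML layouts and field names below are made up for the example:

import json
import xml.etree.ElementTree as ET

# Latitude and longitude from a JSON document (hypothetical layout)
json_doc = '{"location": {"latitude": 12.97, "longitude": 77.59}}'
loc = json.loads(json_doc)["location"]
print(loc["latitude"], loc["longitude"])

# User id and comment from an XML document (hypothetical layout)
xml_doc = "<post><user id='u42'/><comment>Claim settled quickly</comment></post>"
root = ET.fromstring(xml_doc)
print(root.find("user").get("id"), root.find("comment").text)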

BIG DATA ANALYTICS LIFECYCLE
Stage V : Data Validation and Cleansing
• Big data solutions may receive redundant data across sources
• This redundancy can be used to interconnect datasets and fill in missing values (see the sketch below)

• The first value in Dataset B is validated against its corresponding value in Dataset A.
• The second value in Dataset B is not validated against its corresponding value in Dataset A.
• If a value is missing, it is inserted from Dataset A.
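A pandas sketch of this validate-and-fill idea, on two tiny hypothetical datasets that share an id column:

import pandas as pd

a = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, 150.0, 200.0]})   # Dataset A
b = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, None, 210.0]})    # Dataset B

# Validate each value in B against its corresponding value in A
check = b.merge(a, on="id", suffixes=("_b", "_a"))
check["valid"] = check["amount_b"] == check["amount_a"]

# Insert missing values in B from the redundant copy in A
b["amount"] = b["amount"].fillna(b["id"].map(a.set_index("id")["amount"]))
print(check)
print(b)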

BIG DATA ANALYTICS LIFECYCLE
Stage VI : Data Aggregation and Representation
• Integrating multiple datasets together to arrive at a unified view
• Involves joining datasets based on common fields such as ID or Date
• Semantics standardization (Ex: Surname and Last name – Same value
labeled differently in different datasets)
• Represent using standard data format (row-oriented database)
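A pandas sketch of this stage, on two hypothetical datasets where the same attribute is labeled 'Surname' in one source and 'Last name' in the other:

import pandas as pd

claims = pd.DataFrame({"ID": [1, 2], "Surname": ["Rao", "Jain"], "claim_amount": [5000, 7500]})
policies = pd.DataFrame({"ID": [1, 2], "Last name": ["Rao", "Jain"], "policy_type": ["home", "health"]})

# Semantics standardization: the same value labeled differently across datasets
policies = policies.rename(columns={"Last name": "Surname"})

# Join on the common fields to arrive at a unified, row-oriented view
unified = claims.merge(policies, on=["ID", "Surname"])
print(unified)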

BIG DATA ANALYTICS LIFECYCLE
Stage VII : Data Analysis
• Perform EDA (Exploratory Data Analysis)
• Apply Analytics: Descriptive, Diagnostic, Predictive or Prescriptive

BIG DATA ANALYTICS LIFECYCLE
Stage VIII : Data Visualization
• Use tools to graphically visualize and communicate the insights to business users
• Present Dashboards
• Excel, Tableau, Power BI etc.

BIG DATA ANALYTICS LIFECYCLE
Stage IX : Utilization of Analysis Results
• Determining how and where the processed analysis data can be leveraged
• Results can be:
• Fed as input to enterprise systems (Customer analysis result fed into
OTT platform to assist recommendation)
• Refine the business process (Ex: Consolidate transportation routes as
part of supply chain process)
• Generate alerts (Send notification to users via Email or SMS about
impending events)

BIG DATA ANALYTICS LIFECYCLE
CASE STUDY: Background
• Company X is an Insurance Company that deals with health and home insurance
• The company has a ‘Claim Management System’ which contains the claim data,
incident photographs and claim notes
• The company wants to invest in Big Data Analytics to “detect fraudulent claims in the
building sector”
• Let us see how the company uses the ‘Big Data Analytics’ Lifecycle to achieve the
objective of ‘detecting fraudulent claims in the building sector’

* Building Insurance is a type of Home Insurance that covers the structure of the house against various kinds of dangers or risks

Case Study: Detect Fraudulent Claims

Phase I: Business Case Evaluation


• Use case is important as it leads to decrease in monetary loss for Company X
• It covers ‘opportunistic fraud’ such as lying and exaggeration which covers
majority of insurance claim cases.
• KPI for success is set as – ‘reduction in fraudulent claims by 15%.’
• Regarding budget allocation and Infrastructure upgrade, Company X decides to
leverage Open Source Big Data Solution – Hadoop Ecosystem.

Case Study: Detect Fraudulent Claims

Phase II: Data Identification


• Internal datasets: Policy data, insurance application documents, claims data,
incident photographs, emails
• External datasets: Social Media Data (Twitter Feeds), Weather reports,
Geographical data (GIS), and census data.
• The claim data consists of historical claim data consisting of multiple fields where
one of the fields specifies if the claim was fraudulent or legitimate.

Case Study: Detect Fraudulent Claims

Phase III: Data Acquisition and Filtering


• Policy data obtained from Policy Administration System
• Claim data from Claims Management System
• Call center agent notes and emails from CRM system
• Social Media Data (Twitter Feeds), Weather reports, Geographical data (GIS),
and census data are obtained from third party vendors.
• To ensure provenance, each dataset is attached metadata such as dataset
name, source, size, format, acquired date and number of records.
• Batch filtering jobs to remove corrupt records in external datasets.

Case Study: Detect Fraudulent Claims

Phase IV: Data Extraction


• Tweets dataset is in JSON format: User Id, Timestamp and Tweet Text is
extracted into tabular form.
• Weather dataset is in XML format: Timestamp, Temperature Forecast, Wind
Speed Forecast, Wind Direction Forecast, Snow Forecast and Flood Forecast
parameters extracted into tabular form.

Case Study: Detect Fraudulent Claims

Phase V: Data Validation and Cleaning


• Check the extracted fields from Twitter and Weather datasets for typographical
errors, incorrect data, data type validation and range validation

Case Study: Detect Fraudulent Claims

Phase VI: Data Aggregation and Representation


• For meaningful analysis of data, join together policy data, claim data, call center
agent notes in a single dataset that is tabular, where each field can be
referenced through a user query.
• Resulting dataset is stored in RDBMS datastore.

Case Study: Detect Fraudulent Claims

Phase VII: Data Analysis


• Perform Exploratory Data Analysis
• This stage is repeated a number of times as the results generated after the first
pass are not conclusive enough to comprehend what makes a fraudulent claim
different from a legitimate claim.
• Machine learning models were developed using Naïve Bayes, Random Forest,
Decision Tree, Logistic Model Tree, etc.
• Metrics used: Accuracy, Precision, Recall, F-Measure, ROC

Case Study: Detect Fraudulent Claims

Phase VIII: Data Visualization


• The team has discovered some interesting findings and now needs to convey the
results to the Insurance experts.
• Different visualization methods are used including bar and line graphs and scatter
plots.
• Scatter plots are used to analyze groups of fraudulent and legitimate claims in the light
of different factors, such as customer age, age of policy, number of claims made
and value of claim.

Case Study: Detect Fraudulent Claims

Phase IX: Utilization of Analysis Results

• The machine learning model was incorporated into the existing claim
processing system to flag fraudulent claims.

When to use what Methodology?

CRISP-DM: a start-up with no prior experience in data mining or data science; good documentation and case studies are available; suitable for both data mining and data science projects.

SEMMA: an identified dataset is available, preferably from a single data source; model development is the priority; maybe as a POC / MVP, with no deployment clarity.

SMAM: need a quick POC / MVP before taking the big-bang approach; need proof of ROI before investment; need clarity on the division of roles and responsibilities of team members in project execution.

Big Data Lifecycle: big data requirements (5Vs); multiple data sources (provenance, quality aspects of data); need to integrate the model with existing systems (operationalization).

Custom: have additional steps / phases in addition to the methodology; find the methodology constraining; e.g. IBM / Netflix / Google customize the big data lifecycle and CRISP-DM in many projects.

TABLE OF CONTENTS

1 DATA ANALYTICS
2 DATA ANALYTICS METHODOLOGIES
3 CRISP-DM
4 SEMMA
5 SMAM
6 BIG DATA LIFE-CYCLE
7 CHALLENGES IN DATA DRIVEN DECISION-MAKING

DATA DRIVEN DECISION-MAKING

Examples of data-driven decision-making:
• Create a new blockbuster hit series: analyze over 30 million plays a day, 4 million subscriber ratings and 3 million searches; this developed 'House of Cards'.
• Mass personalization of menus: based on past history, weather, time of day and local events.
• People analytics: diagnose HR issues, analyze employee performance reviews and demographics, manage workforce and talent better.
• Serving of ads: image recognition (patterns of people drinking) and background, to offer personalized ads.

https://unscrambl.com/blog/data-driven-companies-examples/

CHALLENGES IN DATA DRIVEN DECISION-MAKING

1. Discrimination
• Algorithmic discrimination can come from various sources.
• Data used to train algorithms may have biases that lead to discriminatory decisions.
• Discrimination may arise from the use of a particular algorithm.
• Algorithms can result in discrimination as a result of misuse of certain models in different
contexts.
• Biased data can be used both as evidence for the training of algorithms and as evidence
of their effectiveness.

CHALLENGES IN DATA DRIVEN DECISION-MAKING
1. Racism embedded in US healthcare
In October 2019, researchers found that an algorithm used on more than 200
million people in US hospitals to predict which patients would likely need extra
medical care heavily favoured white patients over black patients. While race
itself wasn’t a variable used in this algorithm, another variable highly
correlated to race was, which was healthcare cost history. The rationale was
that cost summarizes how many healthcare needs a particular person has.
For various reasons, black patients incurred lower healthcare costs than white
patients with the same conditions on average.

CHALLENGES IN DATA DRIVEN DECISION-MAKING
2. Amazon’s hiring algorithm
Amazon is one of the largest tech giants in the world, and a heavy user of machine learning and artificial intelligence. In 2015, Amazon realized that the algorithm it used for hiring employees was biased against women. The reason was that the algorithm was trained on the resumes submitted over the past ten years; since most of the applicants were men, it learned to favor men over women.

CHALLENGES IN DATA DRIVEN DECISION-MAKING

2. Lack of transparency
• Transparency refers to the capacity to understand a computational model and therefore
contribute to the attribution of responsibility for consequences derived from its use.
• A model is transparent if a person can easily observe it and understand it.
• Three types of opacity (i.e. lack of transparency) in algorithmic decisions
• Intentional opacity – The objective of this type of opacity is to protect the algorithm
inventors’ intellectual property.
• Knowledge opacity – This type of opacity is due to the fact that most people lack the
technical skills to understand how algorithms and computational models are constructed.
• Intrinsic opacity – This type of opacity arises from the nature of certain computer learning
methods (e.g. deep learning models).

https://philpapers.org/rec/BURHTM

CHALLENGES IN DATA DRIVEN DECISION-MAKING
3. Violation of privacy
• Misuse of users' personal data, and data aggregation by entities such as data brokers, may have direct implications for people's privacy. [Google faced a lawsuit for privacy violations in 2020 – selling data to 3rd-party companies]
4. Digital literacy
• Devote resources to digital and computer literacy programs, from children to the elderly.
• This enables society to make informed decisions about technologies it would otherwise not understand. [Cases of cyberbullying among the juvenile population]
5. Fuzzy responsibility
• As more and more decisions that affect millions of people are made automatically by
algorithms, we must be clear about who is responsible for the consequences of these
decisions. Transparency is often considered a fundamental factor in the clarity of
attribution of responsibility.
CHALLENGES IN DATA DRIVEN DECISION-MAKING
6. Lack of ethical frameworks
• Algorithmic data-based decision-making processes generate important ethical dilemmas
regarding what actions are appropriate in light of the inferences made by algorithms.
• It is therefore essential that decisions be made in accordance with a clearly defined and
accepted ethical framework.
• There is no single method for introducing ethical principles into algorithmic decision
processes.

On March 18, 2018, at around 10 p.m., Elaine Herzberg was wheeling her bicycle
across a street in Tempe, Arizona, when she was struck and killed by a self-driving
car. Although there was a human operator behind the wheel, an autonomous
system—artificial intelligence—was in full control.

CHALLENGES IN DATA DRIVEN DECISION-MAKING
7. Lack of diversity
• Data-based algorithms and artificial intelligence techniques for decision-making have
been developed by homogeneous groups of IT professionals.
• Ensure that teams are diverse in terms of areas of knowledge as well as demographic
factors [interdisciplinary – teaching medical doctors data science for self-computation]

REFERENCES

https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome
https://documentation.sas.com/?docsetId=emref&docsetTarget=n061bzurmej4j3n1jnj8bbjjm1a2.htm&docsetVersion=14.3&locale=en
http://jesshampton.com/2011/02/16/semma-and-crisp-dm-data-mining-methodologies/
https://www.kdnuggets.com/2015/08/new-standard-methodology-analytical-models.html
https://medium.com/illumination-curated/big-data-lifecycle-management-629dfe16b78d
https://www.esadeknowledge.com/view/7-challenges-and-opportunities-in-data-based-decision-making-193560

THANK YOU

INTRODUCTION TO DATA SCIENCE
MODULE #3: DATA SCIENCE PROCESS
IDS Course Team
BITS Pilani
TABLE OF CONTENTS

1 DATA SCIENCE PROCESS
2 CASE STUDY

DATA SCIENCE PROCESS
10 Questions the process aims to answer
• Problem to Approach
1 What is the problem that you are trying to solve?
2 How can you use data to answer the questions? CRISP-DM approach
• Working with Data
3 What data do you need to answer the question?
4 Where is the data coming from? Identify all sources. How will you acquire it?
5 Is the data that you collected representative of the problem to be solved?
6 What additional work is required to manipulate and work with the data?
• Delivering the Answer
7 In what way can the data be visualized to get to the answer that is required?
8 Does the model used really answer the initial question or does it need to be adjusted?
9 Can you put the model into practice?
10 Can you get constructive feedback into answering the question?

Source: CognitiveClass
DATA SCIENCE PROCESS - IBM
[Diagram: the ten iterative stages of the methodology - Business Understanding → Analytic Approach → Data Requirements → Data Collection → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment → Feedback]
TABLE OF CONTENTS

1 DATA SCIENCE PROCESS
2 CASE STUDY

HOSPITAL READMISSIONS

Image Source: https://medium.com/nwamaka-imasogie/predicting-hospital-readmission-using-nlp-5f0fe6f1a705
HOSPITAL READMISSIONS - SCENARIO
• Hospital Readmission is a common problem in the healthcare sector, wherein a patient after
discharge gets re-admitted to the hospital because of the following reasons:
• Medication errors
• Medication noncompliance by the patient
• Fall injuries
• Lack of timely follow-up care
• Inadequate Nutrition
• Inadequate discussion on palliative care [relief from suffering]
• Infection
• Failure to identify post-acute care needs etc.

• Hospital readmissions may bring bad name to the hospital / treating doctor / support staff,
and lead to increased length of stay and expenditure for the hospital and the patient.
• Hence, it is a critical issue that needs addressing.

HOSPITAL READMISSIONS - SCENARIO
There is a limited budget for providing healthcare to the public.
Hospital readmissions for re-occurring problems are considered as a sign of failure in the
healthcare system.
There is a dire need to properly address the patient condition prior to the initial patient
discharge.
American Healthcare Insurance Provider, Health care authorities in the region & IBM Data
Scientists:
• What is the best way to allocate these funds to maximize their use in providing
quality care?

FROM PROBLEM TO APPROACH

[Diagram: Data Science Methodology cycle, as above]

CASE STUDY - 1. BUSINESS UNDERSTANDING

Case study asks the following question:


• What is the best way to allocate the limited healthcare
budget to maximize its use in providing quality care?

CASE STUDY - 1. BUSINESS UNDERSTANDING
Examining hospital readmissions [Insurance Company + Hospitals + Data Scientists]
• Use Case 1: It was found that approximately 30% of individuals who finish rehab
treatment would be readmitted to a rehab center within one year.
• 50% would be readmitted within five years.
• Use Case 2: After reviewing some records, it was found that patients with heart failure
were high on the list of readmission [more frequently]

CASE STUDY - 1. BUSINESS UNDERSTANDING

Data scientists proposed and organized an on-site workshop.


Based on previous experience, it was decided that a decision tree model can be
applied to predict the patient readmission rate. [Predictive Analytics]
The business sponsor's involvement throughout the project was critical because the sponsor had
• Set the overall direction
• Remained committed and advised
• When required, got the necessary support

CASE STUDY - 1. BUSINESS UNDERSTANDING

Finally, four business requirements were identified:


• Case study question
• What is the best way to allocate the limited healthcare budget to maximize its use in
providing quality care?
• Business requirements
• To predict the risk of readmission. [Predictive Analytics]
• To predict readmission outcomes for those patients with Congestive Heart Failure.
• To understand the combination of events that led to the predicted outcome.
• To apply an easy-to-understand process to new patients, regarding their readmission risk.

2. ANALYTIC APPROACH (CONCEPT)
• Available data: Patient data, Readmissions data, CHF data, etc.
• How can we use data to answer the questions?
• Choose the analytic approach based on the type of question:
  • Descriptive (current data): What happened?
  • Diagnostic (statistical analysis): Why is this happening?
  • Predictive (forecasting): What if these trends continue? What will happen next?
  • Prescriptive: How do we solve it?

ANALYTIC APPROACH - DECISION TREE (CONCEPT)
What is a Decision Tree?
1. An algorithm that represents a set of questions & decisions using a tree-like
structure.
2. It provides a procedure for deciding which questions to ask, and when to ask them, in order to predict the value of an outcome.

[Figure: example of a decision tree]

CASE STUDY - 2. ANALYTIC APPROACH
A decision tree classification model was used
to identify the combination of conditions leading
to each patient’s outcome.
Examining the variables in each of the nodes along each path to a leaf led to a threshold value for splitting the tree, e.g. Age >= 60.
A decision tree classifier provides both the predicted outcome and the likelihood of that outcome, based on the proportion of the dominant outcome, yes or no, in each group.

CASE STUDY - 2. ANALYTIC APPROACH

The analysts can obtain the readmission risk,


or the likelihood of a yes for each patient.
If the dominant outcome is yes, then the risk is
simply the proportion of yes patients in the leaf.
If it is no, then the risk is 1 minus the proportion
of no patients in the leaf.
For non-data scientists, a decision tree
classification model is easy to understand and
apply, to score new patients for their risk of
readmission.
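A minimal scikit-learn sketch of this scoring idea (the features and training data below are invented for illustration): predict_proba returns the proportion of each outcome in the leaf a patient falls into, which can be read directly as the readmission risk.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [age, prior_admissions]; outcome 1 = readmitted
X = np.array([[72, 3], [65, 1], [58, 0], [80, 4], [45, 0], [63, 2]])
y = np.array([1, 0, 0, 1, 0, 1])

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Likelihood of the 'yes' outcome for a new patient = readmission risk
new_patient = np.array([[70, 2]])
print(tree.predict(new_patient), tree.predict_proba(new_patient)[0, 1])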

CASE STUDY - 2. ANALYTIC APPROACH

Clinicians can readily see what conditions are


causing a patient to be scored as high-risk.
Multiple models can be built and applied at
various points during hospital stay.
This gives a moving picture of the patient’s risk
and how it is evolving with the various
treatments being applied.
For these reasons, the decision tree approach
was chosen for building the Congestive Heart
Failure (CHF) readmission model.

FROM DATA REQUIREMENTS TO DATA COLLECTION

[Diagram: Data Science Methodology cycle, as above]

CASE STUDY - 3. DATA REQUIREMENTS

The analytic approach is decision tree classification, so data requirements should be


defined accordingly.
This involves:
• Identify data content
• Identify data formats
• Identify data sources needed for the initial data collection.

CASE STUDY - 3. DATA REQUIREMENTS

Data requirements for the case study included selecting a suitable list of patients from
the health insurance providers' member base.
In order to put together patient clinical histories,
three criteria were identified for selecting the
patient cohort. [Complete medical history]
1 A patient must be admitted as an in-patient
within health insurance provider’s service area.
2 Patient’s primary diagnosis should be CHF for
one full year.
3 Prior to the primary admission for CHF, a patient
must have had at least 6 months of continuous
enrollment.

CASE STUDY - 3. DATA REQUIREMENTS

Disqualifying conditions (outliers)


• CHF patients who have been diagnosed with other serious conditions [comorbidities] are excluded, because these may result in above-average rates of readmission and may therefore distort the results.

CASE STUDY - 3. DATA REQUIREMENTS
Defining the data
The content and format suitable for decision tree classifier needs to be defined.
Format
• Transactional format
• This model requires, one record per patient.
• Columns of the record represent dependent and independent variables.
Content
• To model the readmission outcome, data should represent all aspects of the patient’s
clinical history.
• This includes:
• Authorizations
• Primary, secondary and tertiary diagnoses,
• Procedures, prescriptions and other services provided during hospitalization or visits by
patients / doctors.
CASE STUDY - 3. DATA REQUIREMENTS

A given patient can have thousands of records that represent all their attributes.
The data analytics specialists collected the transaction records from patient records
and created a set of new variables to represent that information.
It was a task for the data preparation phase, so it is important to anticipate the next
phases.

4. DATA COLLECTION (CONCEPT)
The collected data is explored using descriptive statistics and visualization to assess
its content and quality.

CASE STUDY - 4. DATA COLLECTION

This case study required data about:


• Demographics, clinical and coverage information of patients, provider information, claims
records, as well as pharmaceutical and other information related to all the diagnoses of
the CHF patients.
Available data sources
• Corporate data warehouse
• Single source of medical, claims, eligibility,
provider, and member information.
• In-patient record system
• Claim patient system
• Disease management program information

CASE STUDY - 4. DATA COLLECTION
This case study also required other data that was not available:
• Pharmaceutical records
• Information on drugs
This data source was not yet integrated with the rest of the data sources.
In such situations,
• It is okay to postpone decisions about unavailable data and to try to capture them later.
• This can happen even after obtaining intermediate results from predictive modeling.
• If the results indicate that drug information may be important for a good model, you will
spend time trying to get it.
However, it turned out that they could build a reasonably good model without this
information about drugs.

Next Phase – Data Understanding
Data Pre-processing and Merging Data
• Database administrators and programmers
often work together to extract data from
different sources and then combine them.
• Redundant data can be deleted and made
available to the next level of methodology – the
”Data Understanding” phase.
• At this stage, scientists and analysts can
discuss ways to better manage their data by
automating certain database processes to
facilitate data collection
Next, we move on to understanding the data

FROM DATA UNDERSTANDING TO DATA PREPARATION

[Diagram: Data Science Methodology cycle, as above]

FROM DATA UNDERSTANDING TO DATA PREPARATION

The importance of descriptive statistics.


How to manage missing, invalid, or misleading data?
The need to clean data and sometimes transform data.
The consequences of bad data for the model.
Data understanding is iterative.
• We learn more about data, the more we study it.

5. DATA UNDERSTANDING (CONCEPTS)
This section of the methodology answers the question:
• Is the data you collected representative of the problem to be solved?
Descriptive statistics
• Univariate statistics
• Pairwise correlation
• Histograms
Assess data quality
• Missing values
• Invalid data
• Misleading data
From the data collected, we should understand the variables and their characteristics
using Exploratory Data Analysis and Descriptive Statistics.
Sometimes we may have to perform pre-processing operations on the data.

CASE STUDY - 5. DATA UNDERSTANDING
First, Univariate Statistics
• Basic statistics included univariate statistics for
each variable, such as:
• mean, median, minimum, maximum,
standard deviation, detect outliers
Second, Pairwise Correlations
• Pairwise correlations were used to determine
the degree of correlation between the
variables.
• Variables that are highly correlated means
they are essentially redundant.
• This makes only one variable relevant for the
modeling.

CASE STUDY - 5. DATA UNDERSTANDING
Third, Histograms
• Third, the histograms of the variables were
examined to understand their distributions.
• Histograms are a good way to understand how
values or variables are distributed.
• They help to know what kind of data
preparation may be needed to make the
variable more useful in a model.
• For example:
• If a categorical variable contains too many
different values to be meaningful in a model,
the histogram can help decide how to
consolidate those values.
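All three checks (univariate statistics, pairwise correlations, histograms) map onto a few pandas calls; a sketch on an illustrative patient table with made-up column names:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [34, 56, 61, 45, 70, 52, 66, 48],
                   "length_of_stay": [3, 8, 7, 4, 12, 5, 9, 6]})

# Univariate statistics: mean, std, min, max and quartiles per variable
print(df.describe())

# Pairwise correlations: highly correlated variables are essentially redundant
print(df.corr())

# Histogram: how the values of a variable are distributed
df["age"].hist(bins=5)
plt.show()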

CASE STUDY - 5. DATA UNDERSTANDING
Looking at data quality
• Univariate statistics and histograms are also used to assess the quality of the data.
• On the basis of the data provided, some values can be recoded or deleted if necessary.
• E.g., if a particular variable has a lot of missing values, we may drop the variable from the
model.
• Sometimes a missing value means ”no” or ”0” (zero), or sometimes simply ”we do not
know”.
A variable contains invalid or misleading
values.
• E.g., A numeric variable called ”age”
containing 0 to 100 and 999, where ”triple-9”
actually means ”missing”, will be treated as a
valid value unless we have corrected it.

CASE STUDY - 5. DATA UNDERSTANDING

Data understanding is an iterative process.


• Originally, the meaning of CHF admission was decided on the basis of a primary
diagnosis of CHF.
• However, preliminary data analysis and clinical experience revealed that CHF admissions were also based on other diagnoses.
• The initial definition did not cover all cases of CHF admissions.
• They added secondary and tertiary diagnoses, and created a more complete definition
of CHF admission.
• This is one example of the iterative processes in the methodology.
• The more we work with the problem and the data, the more we learn and the more the
model can be adjusted, which ultimately leads to a better resolution of the problem.

6. DATA PREPARATION (CONCEPT)
In a way, data preparation is like removing dirt and washing vegetables.
Compared to data collection and understanding, data preparation is the most time
consuming phase – 70% to 90% of overall project time.
Automating collection and preparation can reduce this to 50%.
The data preparation phase of the methodology answers the question:
• What are the ways in which data is prepared?
• Address missing or invalid values
• Remove duplicates
• Format data properly
Transforming data
• Process of getting data into a state where it may be easier to work with.
Feature Engineering

6. DATA PREPARATION (CONCEPT)
Feature Engineering
• The process of using domain knowledge of the data to create features that make ML algorithms work.
• A feature is a characteristic, or property, of the data that can be useful for solving the problem.
• Feature engineering is part of the data preparation phase.
• The features in the data are important to the predictive models and influence the desired results.

CASE STUDY - 6. DATA PREPARATION
Data Scientists need clarification on domain terms for data preparation

1. Defining Congestive Heart Failure [from a Data Scientist perspective]


• In the case study, first step in the data preparation stage was to actually define what
CHF means.
• CHF occurs when the heart muscle does not pump blood as much as it should. This leads
to fluid build up in the lungs.
• First, the set of diagnosis-related group codes needed to be identified, as CHF implies
certain kinds of fluid buildup.
• Data scientists also needed to consider that CHF is only one type of heart failure.
• Clinical guidance was needed to get the right codes for CHF.

CASE STUDY - 6. DATA PREPARATION

2. Defining re-admission criteria for Congestive Heart Failure


• Next step involved defining the criteria for CHF readmissions.
• The timing of events needed to be evaluated in order to define whether a particular CHF
admission was an initial event (called as index admission), or a CHF-related re-
admission.
• Based on clinical expertise, a time period of 30 days was set as the window for
readmission relevant for CHF patients, following the discharge from the initial admission.
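A pandas sketch of applying such a 30-day window, with made-up admission and discharge dates; an admission counts as a readmission only if it begins within 30 days of the previous (index) discharge:

import pandas as pd

visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "admit_date": pd.to_datetime(["2020-01-01", "2020-01-25", "2020-01-20", "2020-03-15"]),
    "discharge_date": pd.to_datetime(["2020-01-10", None, "2020-02-01", None]),
})

# Days between each admission and the previous discharge of the same patient
visits = visits.sort_values(["patient_id", "admit_date"])
gap = visits["admit_date"] - visits.groupby("patient_id")["discharge_date"].shift()
visits["readmission_30d"] = gap.dt.days <= 30   # NaT gaps evaluate to False
print(visits)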

CASE STUDY - 6. DATA PREPARATION

Aggregating transactional records


• Next, the records that were in transactional format were aggregated.
• Meaning that the data included multiple records for each patient.
• Transactional records included claims submitted for physician, laboratory, hospital, and
clinical services.
• Also included were records describing all the diagnoses, procedures, prescriptions, and
other information about in-patients and out-patients.

CASE STUDY - 6. DATA PREPARATION
Aggregating data to patient level
• A given patient could have hundreds or even thousands of records, depending on their
clinical history.
• All the transactional records were aggregated to the patient level, yielding a single
record for each patient.
• This is required for the decision-tree classification method used for modeling.
• Many new columns were created representing the information in the transactions.
• E.g: Frequency and most recent visits to doctors, clinics and hospitals with diagnoses,
procedures, prescriptions, and so forth.
• Co-morbidities with CHF were also considered, such as:
• Diabetes, hypertension, and many other diseases and chronic conditions that could impact
the risk of re-admission for CHF.
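A pandas sketch of this roll-up, with hypothetical transaction records; each aggregate (visit counts, most recent visit, co-morbidity flags) becomes one column of the single patient-level record:

import pandas as pd

tx = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "visit_date": pd.to_datetime(["2020-01-05", "2020-02-11", "2020-03-02", "2020-01-20", "2020-02-28"]),
    "diagnosis": ["CHF", "diabetes", "CHF", "hypertension", "CHF"],
})

# One record per patient: visit frequency and most recent visit
patient = tx.groupby("patient_id").agg(n_visits=("visit_date", "count"),
                                       last_visit=("visit_date", "max"))

# Co-morbidity flag, e.g. diabetes (Y/N), derived from the transactions
has_diabetes = tx[tx["diagnosis"] == "diabetes"].groupby("patient_id").size()
patient["diabetes_flag"] = patient.index.isin(has_diabetes.index)
print(patient)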

CASE STUDY - 6. DATA PREPARATION

Do we need more data or less data?


• A literature review on CHF was also undertaken to see whether any important data
elements were overlooked.
• Such as co-morbidities that had not yet been accounted for.
• The literature review involved looping back to the data collection stage to add a few more
indicators for conditions and procedures.

CASE STUDY - 6. DATA PREPARATION

Creating new variables


• Aggregating the transactional data at the patient level involved merging it with the other patient data, including demographic information such as age, gender, type of insurance, and so forth.
• The result was the creation of one table containing a single record per patient.
• Columns represent the attributes about the patient in his or her clinical history.
• These columns would be used as variables in the predictive modeling.

CASE STUDY - 6. DATA PREPARATION

Completing the data set


Here is a list of the variables that were ultimately used in building the model
• Measures
• Gender, Age, Primary Diagnosis Related Group (DRG), Length of Stay, CHF Diagnosis
Importance (primary, secondary, tertiary), Prior admissions, Line of business.
• Diagnosis Flags (Y/N)
• CHF, Atrial fibrillation, Pneumonia, Diabetes, Renal failure, Hypertension.
Dependent Variable
• CHF readmission within 30 days following discharge from CHF hospitalization (Yes/No).

CASE STUDY - 6. DATA PREPARATION

Creating training and testing datasets


• The data preparation stage resulted in a cohort of 2,343 patients.
• These patients met all of the criteria for this case study.
• The data (patient records) were then split into training and testing sets for building and
validating the model, respectively.
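In scikit-learn this split is a single call; a sketch with placeholder arrays standing in for the 2,343 patient records (the 70/30 ratio is illustrative, the actual split used in the case study is not stated):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(2343, 13)          # placeholder feature matrix
y = np.random.randint(0, 2, 2343)     # placeholder readmission outcome (0/1)

# stratify keeps the readmission rate similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)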

FROM DATA MODELING TO EVALUATION

[Diagram: Data Science Methodology cycle, as above]

FROM DATA MODELING TO EVALUATION

The difference between descriptive and predictive models.


The role of training sets and test sets.
The importance of asking if the question has been answered.
Why diagnostic measures tools are needed.
The purpose of statistical significance tests. [Hypothesis testing]
That modeling and evaluation are iterative processes.

7. DATA MODELING (CONCEPT)
In what way can the data be visualized to get to the answer that is required?
Modeling is based on the analytic approach.
Data modeling focuses on developing models that are either descriptive or predictive.
• Descriptive Models
• What happened?
• Use statistics.
• Predictive Models
• What will happen?
• Use machine learning.
• Try to generate yes/no type outcomes.
• A training set is used for developing the predictive model.
• Training set
• Contains historical data in which the outcomes are already known. [Labeled data]
• Acts like a gauge to determine if the model needs to be calibrated.

7. DATA MODELING (CONCEPT)
The data scientist will try different algorithms to ensure that the variables in play are
actually required.
Success of the compilation, preparation and modeling depends on the understanding of the problem and the analytic approach being taken.
Like the quality of ingredients in cooking, the quality of data sets the stage for the
outcome.
• If data quality is bad, the outcome will be bad.
Constant refinement, adjustment, and tweaking within each step are essential to
ensure a solid outcome.
The end goal is to build a model that can answer the original question.
• Model evaluation, deployment, and feedback loops ensure that the model is relevant and
the question is really answered.

7. DATA MODELING – CONCEPT OF CONFUSION MATRIX

Since Data Modeling for the case study involves the concepts of ‘Confusion
Matrix’ and ‘ROC’, let us understand the concepts.

CASE STUDY - 7. DATA MODELING
Decision tree to predict CHF readmission is built
In this first model, the default relative cost of 1-to-1 is used.
The overall accuracy in classifying the yes and
no outcomes was 85%.
This sounds good, but it represents only 45% of
the ”yes”.
• Meaning, when it's actually YES, the model predicted YES only 45% of the time.
The question is:
• How could the accuracy of the model be improved in predicting the yes outcome?

CASE STUDY - 7. DATA MODELING

There are many aspects to model building – one of those is parameter tuning to
improve the model.
With a prepared training set, the first decision tree classification model for CHF
readmission can be built.
We are looking for patients with high-risk readmission, so the outcome of interest will
be CHF readmission equals ”yes”.
For decision tree classification, the best parameter to adjust is the
relative cost of misclassified yes and no outcomes.

CASE STUDY - 7. DATA MODELING
Type I Error or False positive
• When a true, non-readmission is misclassified, and action is taken to reduce that
patient’s risk, the cost of that error is the wasted intervention.
Type II Error or False negative
• When a true readmission is misclassified, and
no action is taken to reduce that risk.
• The cost of this error is the readmission and all
its attended costs, plus the trauma to the patient.

The costs of the two different kinds of misclassification errors can be quite different.
• Adjust the relative weights of misclassifying the yes and no outcomes.

CASE STUDY - 7. DATA MODELING
For the second model, the relative cost was set at 9-to-1.
• Ratio of cost of false positive to false negative.
• This is a very high ratio, but gives more insight to the model’s behavior.
This time the model correctly classified 97% of
the YES, but at the expense of a very low
accuracy on the NO, with an overall accuracy of
only 49%.
This was clearly not a good model.
The problem with this outcome is the large number of false-positives.
• A true, non-readmission is misclassified as re-admission.
• This would recommend unnecessary and costly intervention for patients, who would not
have been re-admitted anyway.

CASE STUDY - 7. DATA MODELING
Try again to find a better balance between the yes and no accuracies.
For the third model, the relative cost was set at
4-to-1.
This time, the overall accuracy was 81%.
Yes accuracy was 68%. This is called sensitivity.
No accuracy was 85%. This is called specificity.
This is the optimum balance that can be obtained with a rather small training set.
• By adjusting the relative cost of misclassified yes and no outcomes parameter.
In medical diagnosis
• Test sensitivity is the ability of a test to correctly identify those with the disease (true
positive rate).
• Test specificity is the ability of the test to correctly identify those without the disease (true
negative rate).
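One common way to express such a relative cost in scikit-learn (a sketch, not necessarily the tool used in the case study) is the class_weight parameter, here making a misclassified 'yes' four times as costly as a misclassified 'no':

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [age, prior_admissions]; 1 = readmitted
X = np.array([[60, 1], [55, 0], [70, 2], [45, 0], [80, 3], [50, 1]])
y = np.array([1, 0, 1, 0, 1, 0])

# Relative cost 4-to-1: the tree is pushed toward higher sensitivity
# (fewer false negatives) at the expense of specificity
model = DecisionTreeClassifier(class_weight={0: 1, 1: 4}).fit(X, y)
print(model.predict([[65, 1]]))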
CONFUSION MATRIX
Confusion matrix is a table that is often used to evaluate the performance of a
classification model (or ”classifier”).
It works on a set of test data for which the true values are known.
There are two possible predicted classes: ”YES” and ”NO”.
If we were predicting the presence of a disease, for example, ”yes” would mean they
have the disease, and ”no” would mean they don’t have the disease.
• The classifier made a total of 165 predictions (165 patients were tested for the presence of that disease).
• Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60 patients do not.

N = 165     | Predicted: No | Predicted: Yes
Actual: No  | 50            | 10
Actual: Yes | 5             | 100
CONFUSION MATRIX

True positives (TP): the model predicted YES, and the patients do have the disease.
True negatives (TN): the model predicted NO, and the patients don't have the disease.
False positives (FP) / Type I error: the model predicted YES, but the patients don't actually have the disease.
False negatives (FN) / Type II error: the model predicted NO, but the patients actually have the disease.

N = 165     | Predicted: No | Predicted: Yes
Actual: No  | TN = 50       | FP = 10
Actual: Yes | FN = 5        | TP = 100

CONFUSION MATRIX

Term | Description | Calculation
Accuracy | Overall, how often is the classifier correct? | (TP+TN)/total = (100+50)/165 = 0.91
Misclassification Rate (Error Rate) | Overall, how often is it wrong? Equivalent to 1 minus Accuracy | (FP+FN)/total = (10+5)/165 = 0.09
True Positive Rate (Sensitivity or Recall) | When it's actually YES, how often does it predict YES? | TP/actual YES = 100/105 = 0.95
True Negative Rate (Specificity) | When it's actually NO, how often does it predict NO? Equivalent to 1 minus False Positive Rate | TN/actual NO = 50/60 = 0.83
False Positive Rate (Type I Error) | When it's actually NO, how often does it predict YES? | FP/actual NO = 10/60 = 0.17
Precision | When it predicts YES, how often is it correct? | TP/predicted YES = 100/110 = 0.91
Prevalence | How often does the YES condition actually occur in our sample? | Actual YES/total = 105/165 = 0.64
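Every entry in the table follows from the four cell counts; a quick check in Python using the numbers above:

# Cell counts from the confusion matrix above
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                 # 165

print("accuracy:   ", (TP + TN) / total)  # 0.91
print("error rate: ", (FP + FN) / total)  # 0.09
print("sensitivity:", TP / (TP + FN))     # 100/105 = 0.95
print("specificity:", TN / (TN + FP))     # 50/60 = 0.83
print("precision:  ", TP / (TP + FP))     # 100/110 = 0.91
print("prevalence: ", (TP + FN) / total)  # 105/165 = 0.64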
8. EVALUATION (CONCEPT)
The quality of the developed model is assessed. Before the model gets deployed, evaluate whether it really answers the initial question.

CASE STUDY - 8. EVALUATION

One way is to find the optimal model through a diagnostic measure based on tuning
one of the parameters in model building.
Specifically we’ll see how to tune the relative cost of misclassifying yes and no
outcomes.
Four models were built with four different relative
misclassification costs.
Each value of this model-building parameter
increases the true positive rate of the accuracy in
predicting yes, at the expense of lower accuracy
in predicting no, that is, an increasing
false-positive rate.

CASE STUDY - 8. EVALUATION
Which model is best based on tuning this parameter?
Risk-reducing intervention – two scenarios
• The intervention cannot be applied to all CHF patients, because many of them would not have been readmitted anyway; that would not be cost-effective.
• The intervention itself would not be as effective in improving patient care if not enough high-risk CHF patients are targeted.
How do we determine which model was optimal?
• This can be done with the help of an ROC curve (receiver operating characteristic curve).
ROC curve is a graph showing the performance of a classification model at all
classification thresholds.
ROC curve plots two parameters:
• True Positive Rate
• False Positive Rate
RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
ROC curves are used to show the connection/trade-off between clinical sensitivity and
specificity for every possible cut-off (threshold) for a test or a combination of tests.
The area under an ROC curve is a measure of the usefulness of a test in general.
• A greater area means a more useful test.
ROC curves are used in clinical biochemistry to choose the most appropriate cut-off for a
test.
The best cut-off has the highest true positive rate together with the lowest false
positive rate.
ROC curves were first employed in the study of discriminator systems for the detection of
radio signals in the presence of noise in the 1940s, following the attack on Pearl Harbor.
The initial research was motivated by the desire to determine how the US RADAR
”receiver operators” had missed the Japanese aircraft.

RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE

An ROC curve plots TPR vs. FPR at different classification thresholds.


Lowering the classification threshold classifies more items as positive, thus increasing
both False Positives and True Positives.
The optimal model is the one giving the maximum separation between the blue ROC
curve relative to the red base line.
This curve quantifies how well a binary classification model performs.
• It shows the trade-off in classifying the yes and no outcomes as some discrimination criterion is varied.
• In this case, the criterion is the relative misclassification cost.
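A minimal scikit-learn sketch of computing and plotting an ROC curve, on made-up outcomes and model scores:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true outcomes and predicted probabilities of 'yes'
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("area under the curve:", roc_auc_score(y_true, y_score))

plt.plot(fpr, tpr)                   # model ROC curve
plt.plot([0, 1], [0, 1], "r--")      # baseline: a model with no discrimination
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()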

CASE STUDY - 8. EVALUATION
We can see that model 3, with a relative misclassification cost of 4-to-1, is the best of the 4 models.

FROM DEPLOYMENT TO FEEDBACK

[Diagram: Data Science Methodology cycle, as above]

FROM DEPLOYMENT TO FEEDBACK

• What is Deployment.
• The importance of stakeholder input.
• To consider the scale of deployment.
• The importance of incorporating feedback to refine the model.
• This process should be repeated as often as necessary.

9. DEPLOYMENT (CONCEPT)

• To make the model relevant and useful to address the initial question, involves
getting the stakeholders familiar with the tool produced.
• Once the model is evaluated/approved by the stakeholders, it is deployed and put
to the ultimate test.
• The model may be rolled out to a limited group of users or in a test environment,
to build up confidence in applying the outcome for use across the board.

CASE STUDY - 9. DEPLOYMENT

Understanding the results


• In preparation for model deployment, the next step was to assimilate the knowledge for
the business group who would be designing and managing the intervention program to
reduce readmission risk.
• In this scenario, the business people translated the model results so that the clinical staff
could understand how to identify high-risk patients and design suitable intervention
actions.
• The goal was to reduce the likelihood that these patients would be readmitted within 30
days after discharge.
• During the business requirements stage, the Intervention Program Director and her team
had wanted an application that would provide automated, near real-time risk
assessments of congestive heart failure.

CASE STUDY - 9. DEPLOYMENT

Gathering application requirements


• It also had to be easy for clinical staff to use, and preferably through browser-based
application on a tablet, that each staff member could carry around.
• This patient data was generated throughout the hospital stay.
• It would be automatically prepared in a format needed by the model and each patient
would be scored near the time of discharge.
• Clinicians would then have the most up-to-date risk assessment for each patient, helping
them to select which patients to target for intervention after discharge. As part of solution
deployment, the Intervention team would develop and deliver training for the clinical staff.

Teams involved: Business Team, Intervention Team / Program Director, Clinical Staff

I N T R OD U CT ION TO D AT A S C I E N C E
9. D E P L O Y M E N T

I N T R OD U CT ION T O D AT A S C I E N C E
C A S E S T U D Y - 9. D E P L O Y M E N T
Additional Requirements
• Processes for tracking and monitoring patients receiving the intervention would have to
be developed in collaboration with IT developers and database administrators, so that
the results could go through the feedback stage and the model could be refined over
time.

I N T R OD U CT ION TO D AT A S C I E N C E
10. F E E D B A C K ( C O N C E P T )

Feedback from users to refine the model.


Assess the model for performance and impact.
The value of the model will be dependent on successfully incorporating feedback and
making adjustments for as long as the solution is required.
Throughout the Data Science Methodology, each step sets the stage for the next.
This makes the methodology cyclical and ensures refinement at each stage.
Once the model has been evaluated and the data scientist trusts that it will work, it
will be deployed and will undergo the final test:
Its real use in real time in the field.

I N T R OD U CT ION T O D AT A S C I E N C E
C A S E S T U D Y - 10. F E E D B A C K

I N T R OD U CT ION T O D AT A S C I E N C E
C A S E S T U D Y - 10. F E E D B A C K

Feedback stage included these steps:


1 The review process would be defined and put into place, with overall responsibility for
measuring the results of the model applied to CHF risk population. Clinical
management executives would have overall responsibility for the review process.
2 CHF patients receiving intervention would be tracked and their re-admission
outcomes recorded.
3 The intervention would then be measured to determine how effective it was in
reducing readmissions.

I N T R OD U CT ION T O D AT A S C I E N C E
C A S E S T U D Y - 10. F E E D B A C K

For ethical reasons, CHF patients would not be split into control and treatment
groups.
Instead, readmission rates would be compared before and after the implementation of
the model to measure its impact.
After the deployment and feedback stages, the impact of the intervention program on
re-admission rates would be reviewed after the first year of its implementation.
Then the model would be refined, based on all of the data compiled after model
implementation and the knowledge gained throughout these stages.

I N T R OD U CT ION T O D AT A S C I E N C E
C A S E S T U D Y - 10. F E E D B A C K

I N T R OD U CT ION T O D AT A S C I E N C E
C A S E S T U D Y - 10. F E E D B A C K

Redeployment [Decision on Model Upgrade??]


The intervention actions and processes would be reviewed and very likely refined as
well, based on the experience and knowledge gained through initial deployment and
feedback.
Finally, the refined model and intervention actions would be redeployed, with the
feedback process continued throughout the life of the Intervention program.

I N T R OD U CT ION T O D AT A S C I E N C E
C A S E S T U D Y - 10. F E E D B A C K

I N T R OD U CT ION T O D AT A S C I E N C E
D AT A S C I E N C E P R O C E S S - S U M M A R Y
Learn the importance of
• Understanding the question
• Picking the most effective analytic approach
Learn to work with data (iterative stages)
• determine the data requirements
• collect the appropriate data
• understand the data
• prepare the data for modeling
Learn how to
• evaluate and deploy the model
• get feedback on it
• use the feedback constructively so as to improve the model

I N T R OD U CT ION TO D AT A S C I E N C E
D AT A S C I E N C E P R O C E S S - S U M M A R Y

Think like a data scientist


• Forming a concrete business or research problem
• Collecting and analyzing data
• Building a model (iterative stages)
• Understanding the feedback after model deployment

I N T R OD U CT ION TO D AT A S C I E N C E
T HANK YOU

I N T R OD U CT ION TO D AT A S C I E N C E
I NTRODUCTION TO DATA S CIENCE
M ODULE # 3 : DATA S CIENCE P ROCESS
IDS Course Team
BITS Pilani
TABLE OF C ONTENTS

1. Confusion Matrix
2. ROC

I N T R OD U CT ION TO D AT A S C I E N C E
Confusion Matrix
• A Confusion matrix is a table that is often used to evaluate the performance of a
classification model (or “classifier”).

• A Confusion Matrix shows what the machine learning algorithm did right and what
the algorithm did wrong (misclassification).
• It works on a set of test data for which the true values are known. There are two
possible predicted classes: “YES” and “NO”.

I N T R OD U CT ION TO D AT A S C I E N C E
Confusion Matrix
                 Actual: Y                            Actual: N
Predicted: Y     True Positive (TP)                   False Positive (FP) (Type I Error)
Predicted: N     False Negative (FN) (Type II Error)  True Negative (TN)

There are four quadrants in the confusion matrix, which are symbolized as below.
True Positive (TP) : The number of instances that were positive and correctly classified as positive.
False Positive (FP): The number of instances that were negative and incorrectly classified as positive.
This is also known as Type 1 Error.
False Negative (FN): The number of instances that were positive and incorrectly classified as negative.
It is also known as Type 2 Error.
True Negative (TN): The number of instances that were negative and correctly classified as negative.

I N T R OD U CT ION TO D AT A S C I E N C E
Confusion Matrix
                 Actual: Y                            Actual: N
Predicted: Y     True Positive (TP)                   False Positive (FP) (Type I Error)
Predicted: N     False Negative (FN) (Type II Error)  True Negative (TN)

Which type of misclassification is more serious?? Type-I Error or Type-II Error?

I N T R OD U CT ION TO D AT A S C I E N C E
Confusion Matrix
Which type of misclassification is more serious?? Type-I Error or Type-II Error?

Case I : Predicting whether a convict should be hanged or not? [Type I Error more Serious]
False Positive – Algorithm predicts that the convict has committed the crime, in reality, he is innocent.
Verdict: He will be hanged.
False Negative – Algorithm predicts that the convict is innocent, in reality, he has done the crime.
Verdict: He is released.
Case II : Predicting Smog in a region and alerting the public [Type II Error more Serious]
False Positive – Algorithm predicts smog, in reality, there is NO SMOG.
Verdict: People will take precaution unnecessarily.
False Negative – Algorithm predicts NO SMOG, in reality, there is SMOG.
Verdict: The high Smog may cause health issues in the people, since they have not taken precaution.

I N T R OD U CT ION TO D AT A S C I E N C E
Confusion Matrix
Let us consider an example of a model predicting a tumour for a patient.

                 Actual: Y    Actual: N
Predicted: Y     10 (TP)      22 (FP)
Predicted: N      8 (FN)      60 (TN)

Interpretation:
True Positive (TP): Model predicted ‘Tumour’ and the patient has a tumour.
False Positive (FP): Model predicted ‘Tumour’ but the patient has ‘No Tumour’. This is also known as Type 1 Error.
False Negative (FN): Model predicted ‘No Tumour’ but the patient actually has a tumour. This is also known as Type 2 Error.
True Negative (TN): Model predicted ‘No Tumour’ and the patient has no tumour.

Discuss the repercussions of Type 1 and Type 2 errors w.r.t. the patient and the hospital.

I N T R OD U CT ION TO D AT A S C I E N C E
Confusion Matrix
True Positive Rate (TPR): the fraction of positive examples predicted correctly by the classifier. This metric is also known as Recall, Sensitivity or Hit Rate.
    TPR = TP / (TP + FN)

False Negative Rate (FNR): the fraction of positive examples classified as negative by the classifier.
    FNR = FN / (TP + FN)

False Positive Rate (FPR): the fraction of negative examples classified as positive by the classifier. This metric is also known as False Alarm Rate.
    FPR = FP / (FP + TN)

True Negative Rate (TNR): the fraction of negative examples classified correctly by the classifier. This metric is also known as Specificity.
    TNR = TN / (TN + FP)

I N T R OD U CT ION TO D AT A S C I E N C E
Confusion Matrix
Positive Predictive Value (PPV): the fraction of examples classified as positive that are really positive. It is also known as Precision.
    PPV = TP / (TP + FP)

Accuracy: how often the classifier is correct.
    Accuracy = (TP + TN) / Total

F1 Score (F1): Recall (r) and Precision (p) are two widely used metrics in analyses where detection of one of the classes is considered more significant than the others. F1 combines the two; equivalently, it can be defined in terms of TPR and PPV.
    F1 = (2 * Precision * Recall) / (Precision + Recall)

Misclassification Rate or Error Rate: how often the classifier is wrong.
    Error Rate = (FP + FN) / Total

I N T R OD U CT ION TO D AT A S C I E N C E
All Formulae

TPR = TP / (TP + FN)            FPR = FP / (FP + TN)            TNR = TN / (TN + FP)

Precision = TP / (TP + FP)      F1 = 2TP / (2TP + FP + FN)      FNR = FN / (TP + FN)

Accuracy = (TP + TN) / Total    Error Rate = (FP + FN) / Total
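These formulae translate directly into code. The following is a minimal Python sketch (the function and variable names are our own, not from any library) that computes every metric from the four confusion-matrix counts; the usage lines plug in the tumour example values (TP=10, FP=22, FN=8, TN=60):

# Minimal sketch: all confusion-matrix metrics from the four counts.
def confusion_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "TPR (Recall/Sensitivity)": tp / (tp + fn),
        "FPR (False Alarm Rate)": fp / (fp + tn),
        "FNR": fn / (tp + fn),
        "TNR (Specificity)": tn / (tn + fp),
        "Precision (PPV)": tp / (tp + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "Accuracy": (tp + tn) / total,
        "Error Rate": (fp + fn) / total,
    }

# Tumour example: TP=10, FP=22, FN=8, TN=60
for name, value in confusion_metrics(10, 22, 8, 60).items():
    print(f"{name}: {value:.3f}")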

I N T R OD U CT ION TO D AT A S C I E N C E
Case Study – CHF Prediction
Calculate the following metrics for the given confusion matrix:

                 Actual: Y    Actual: N
Predicted: Y     100 (TP)     10 (FP)
Predicted: N       5 (FN)     50 (TN)

1. True Positive Rate (TPR) [Recall / Sensitivity]
2. False Positive Rate (FPR)
3. False Negative Rate (FNR)
4. True Negative Rate (TNR) [Specificity]
5. Precision
6. F1 Score
7. Accuracy
8. Error Rate or Misclassification Rate

I N T R OD U CT ION TO D AT A S C I E N C E
Case Study – CHF Prediction
                 Actual: Y    Actual: N
Predicted: Y     100 (TP)     10 (FP)
Predicted: N       5 (FN)     50 (TN)

Calculate the following metrics for the given confusion matrix:
1. True Positive Rate (TPR) [Recall / Sensitivity]
2. False Positive Rate (FPR)
3. False Negative Rate (FNR)
4. True Negative Rate (TNR) [Specificity]
5. Precision
6. F1 Score
7. Accuracy
8. Error Rate or Misclassification Rate

Formulae
TPR = TP / (TP + FN)            FPR = FP / (FP + TN)
FNR = FN / (TP + FN)            TNR = TN / (TN + FP)
Precision = TP / (TP + FP)      F1 = 2TP / (2TP + FP + FN)
Accuracy = (TP + TN) / Total    Error Rate = (FP + FN) / Total

Alternative formula for F1 calculation:
F1 = (2 * Precision * Recall) / (Precision + Recall)
Case Study – CHF Prediction

                 Actual: Y    Actual: N
Predicted: Y     100 (TP)     10 (FP)
Predicted: N       5 (FN)     50 (TN)

1. True Positive Rate (TPR) [Recall / Sensitivity] = 100/105 = 0.95
2. False Positive Rate (FPR) = 10/60 = 0.17
3. False Negative Rate (FNR) = 5/105 = 0.047
4. True Negative Rate (TNR) [Specificity] = 50/60 = 0.83
5. Precision = 100/110 = 0.91
6. F1 Score = 200/215 = 0.93
7. Accuracy = 150/165 = 0.91
8. Error Rate or Misclassification Rate = 15/165 = 0.09

I N T R OD U CT ION TO D AT A S C I E N C E
ROC Curve
An ROC curve (receiver operating characteristic curve) is a
graph showing the performance of a classification model
at all classification thresholds.
It shows the trade-off between Sensitivity and Specificity
ROC curve plots two parameters:

• True Positive Rate

• False Positive Rate

Scikit-learn’s default predict method (and hence the confusion matrix built from it) uses 0.5 as the probability threshold.

https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
https://towardsdatascience.com/understanding-the-roc-curve-in-three-visual-steps-795b1399481c
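For reference, here is a short sketch of how such a curve is typically produced with scikit-learn's roc_curve and roc_auc_score; the labels and scores below are synthetic stand-ins, not data from the case study:

# Sketch: ROC curve and AUC from predicted probabilities (synthetic data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                      # actual 0/1 labels
y_score = np.clip(0.35 * y_true + rng.normal(0.35, 0.2, 200), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)          # one point per threshold
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance baseline")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend()
plt.show()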

I NTRODUCTION TO D ATA S CIENCE


Data points for Likelihood of Repaying a Loan
• The probabilities usually range between 0 and 1.
• The higher the value, the more likely the person is to repay a
loan.
• In the example in the figure, we’ve selected a threshold of 0.35:
• All predictions at or above this threshold, are
classified as “will repay”
• All predictions below this threshold, are classified
as “won’t repay”

I NTRODUCTION TO D ATA S CIENCE


Altering the Threshold values

• Altering the threshold to the 0, 0.35, 0.5, 0.65 and 1 levels. Notice how the FPR and TPR change accordingly.
• Overall, we can see this is a trade-off. As we increase our threshold, we’ll be better at classifying negatives,
but this is at the expense of misclassifying more positives.
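The trade-off can be seen numerically by sweeping the threshold over a set of scored examples and recomputing TPR and FPR at each level; the tiny arrays below are invented purely for illustration:

# Sketch: TPR/FPR at several classification thresholds (invented data).
import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.32, 0.4, 0.45, 0.6, 0.62, 0.7, 0.8, 0.9])

for threshold in [0.0, 0.35, 0.5, 0.65, 1.0]:
    y_pred = (y_score >= threshold).astype(int)  # "positive" if score >= threshold
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    print(f"threshold={threshold:.2f}  TPR={tp / (tp + fn):.2f}  FPR={fp / (fp + tn):.2f}")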

I NTRODUCTION TO D ATA S CIENCE


ROC Curves: Plot TPR and FPR for every Cutoff

[Figure: ROC curve with the Area Under the ROC Curve (AUC) shaded]

I NTRODUCTION TO D ATA S CIENCE


Threshold settings in NLP Application

• For NLP applications (like Chatbots), which use natural language, thresholds are generally set
lower (around 0.4) for healthcare, retail, educational bots.

• Example: Demonstrate ‘Dhriti’ mental health chatbot

URL - https://app.engati.com/static/standalone/bot.html?bot_key=889d005935e7437b

I NTRODUCTION TO D ATA S CIENCE


https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/

https://towardsdatascience.com/understanding-the-roc-curve-in-three-visual-steps-795b1399481c

T HANK YOU

I N T R OD U CT ION TO D AT A S C I E N C E
I NTRODUCTION TO DATA S CIENCE
M ODULE # 3 : DATA S CIENCE Proposal
IDS Course Team
BITS Pilani
TABLE OF C ONTENTS

1 D ATA S C I E N C E P R O P O S A L

I N T R OD U CT ION TO D AT A S C I E N C E
WHAT IS DATA SCIENCE PROPOSAL

As a Data Scientist, there are occasions when proposals need to be written for data science projects.
At Microsoft, proposals originate in three ways:
A. Business-led Proposal
• Business teams come with requirements
• Ex: Product Engineering Team on how to prioritize
customer feedback screening
B. Data science-led Innovation
• From Data Science team
• Ex: How to maximize customer satisfaction for Azure
C. Data science-led Systemic Solutions
• What is the impact of ‘x’ on business
• Ex: ‘X’ can be marketing campaign, new service launch

https://medium.com/data-science-at-microsoft/managing-a-data-science-project-87945ff79483

I N T R OD U CT ION TO D AT A S C I E N C E
QUESTIONNAIRE TO P R E PA R E P R O P O S A L
1. What is the business problem we are trying to solve?
2. Write an exact definition. Identify the type of the problem.
3. Are we addressing a specific problem or a problem specific to a team? Is it a
generic problem across all businesses? (helps to create certain frameworks or
accelerators)
4. Who are the targeted audience?
5. How do you evaluate your solution outcome? Are there any evaluation
metrics available?
6. What are the acceptance criteria for the solution? (e.g. for a classification
task, accuracy should be above 65%)

I N T R OD U CT ION TO D AT A S C I E N C E
QUESTIONNAIRE TO P R E PA R E P R O P O S A L
Business Understanding
• What is the business problem we are trying to solve?
• Write an exact definition.
• Is it a prediction problem?
→ e.g. predicting company’s profit in next quarter.
• Are we doing a segmentation?
→ e.g. a customer segmentation for targeted
marketing.
• Are we going to recommend something say a product to
the user?
• Is it anomaly detection or a fraud detection problem?
• Is it an optimization problem?
→ e.g. optimizing revenue of a company.

I N T R OD U CT ION TO D AT A S C I E N C E
1. P R E D I C T I O N
• Classification
• Given a new individual observation, predicts which class it belongs to.
• e.g. whether a credit card customer will default or not given his data like credit card
balance, income etc.
• Covid Discharge Status, viz., (Recovered, Expired)
• Social media sentiment analysis to determine the emotion behind user-generated content
• Regression
• Given a new individual observation, estimates the value of a particular variable specific to
that individual.
• e.g. predicting the revenue for the next quarter
• Predicting the price of a house, given locality details

I N T R OD U CT ION TO D AT A S C I E N C E
1. P R E D I C T I O N ... C O N T D ...

• Scoring or Class Probability Estimation


• Related to the classification problem
• Instead of class prediction, predict a score representing the probability or likelihood that
the individual belongs to the class.
• e.g. *In NLP, a score for a matching statement against the threshold value.

*Engati Chatbot link - https://app.engati.com/static/standalone/bot.html?bot_key=889d005935e7437b

I N T R OD U CT ION TO D AT A S C I E N C E
1. P R E D I C T I O N ... C O N T D ...

• Survival Analysis/Churn Analysis


• Churn analytics is the process of measuring the rate at which customers quit the product,
site, or service.
• Analysis of data where outcome is the time duration until the occurrence of an event of
interest
• e.g. Customer life time with a provider

I N T R OD U CT ION TO D AT A S C I E N C E
1. P R E D I C T I O N ... C O N T D ... Churn Analysis

Based on historical data, predicting churn for the next quarter.
I N T R OD U CT ION TO D AT A S C I E N C E
2. S E G M E N T AT I O N / C L U S T E R I N G
• Customer Profiling is an important aspect of Segmentation that attempts to characterize the
typical behavior of an individual or a group.

Consumer Characteristics: convenience driven, connectivity driven, personalization driven
Consumer Typology: loyalty, discount, impulsive, need-based
Psychographic: lifestyle, demographics, social class

How do I collect this data?
1. Feedback
2. Survey customer interests and preferences
3. Keep profiles consistent and up-to-date; integrate 3rd-party sources

https://commence.com/blog/2020/06/16/customer-profiling-methods/

I N T R OD U CT ION TO D AT A S C I E N C E
2. S E G M E N TAT I O N / C L U S T E R I N G
Clustering attempts to group individuals based on similarity.
• e.g. Segment the customers into high spenders and low spenders based on their buying pattern and
other data.

I N T R OD U CT ION TO D AT A S C I E N C E
3. R E C O M M E N DAT I O N / S I M I L A R I T Y M AT C H I N G
• Similarity matching attempts to find similar individuals based on the data known
about them. This is useful in recommendation problem setting.
• e.g. Finding people similar to you who have purchased or liked similar products,
recommending a movie to a user based on his preferences and similar users’ interests.
• OTT platforms, E-Commerce platforms

I N T R OD U CT ION TO D AT A S C I E N C E
4. Anomaly / F r a u d A n a l y t i c s

https://www.crisil.com/en/home/our-businesses/global-research-and-risk-solutions/our-offerings/non-financial-risk/financial-crime-management/fraud-management/fraud-detection-and-analytics.html#

I N T R OD U CT ION TO D AT A S C I E N C E
5. C A U S A L M O D E L L I N G / R O O T C A U S E A N A LY S I S
Causal modeling helps to understand the causal relationship between events, or which
events/actions influence others. [the ‘Why’ part of Diagnostic Analytics]
• What are the possible root causes for an anomaly detected?
• Whether the advertisements influenced consumer’s decision to purchase or not?
• What are the reasons for fraud in bank
• Lack of Training
• Competition to achieve incentives
• Overburdened Staff
• Low Compliance Level (not following RBI Guidelines)

I N T R OD U CT ION TO D AT A S C I E N C E
6. M A R K E T B A S K E T A N A LY S I S
Co-occurrence Grouping / Association Rule Discovery / Frequent Item set Mining
• Find the association between the entities based on the purchase transactions involving them.
• e.g. What items are purchased together by consumers at a supermarket.
• May lead to Upsell / Cross-sell items to customers

I N T R OD U CT ION TO D AT A S C I E N C E
7. D AT A R E D U C T I O N
• Replace a large dataset with a smaller one that contains most of the important information in the
large dataset.
• Involves loss of information.
• Which data reduction strategy to follow?
• Aggregation / Sampling / Dimensionality reduction
• Examples
• Aggregation - Massive data sets of insurance / patient data is aggregated into one row per
patient record in hospital readmission case study.
• Sampling - A large time series sensor data at a second interval may be reduced to hourly data or to a
smaller data set with only changed values [Ex: Air Pollutants data – calculate Pollution Index based on
concentration of chemicals]
• Dimensionality Reduction - ISRO weather data augmentation with semantic data project: Followed
dimensionality reduction. [Only retained rainfall rate, humidity, latitude, longitude, wind direction,
atmospheric pressure & precipitation rate variables for analysis out of 17 variables]

I N T R OD U CT ION TO D AT A S C I E N C E
QU ESTIONS T O B E A S K E D BASED ON T ASK
• Prediction
• Do we know what variable (target) to be predicted?
• Is that target variable defined precisely?
• What values or ranges of values that this variable can take? [Ordinal / Categorical]
• Will modelling this target variable address all the problems defined in the scope or only a
sub problem?
• Clustering
• Do we know the end objective? i.e. Is an EDA (Exploratory Data Analysis) path clearly
defined to see where our analysis is going?

I N T R OD U CT ION TO D AT A S C I E N C E
S O L U T I O N A P P R OA C H
• Is the proposed analytical solution formulated appropriately to solve the business problem OR is it
an approximation?
• Will the proposed solution address all the problems defined in the scope or only a sub problem?
Ex: Study to understand employee satisfaction; does it address attrition?
• What will be the benefits of the proposed solution? Benefit vs. Cost tradeoff.
Ex: Heart Disease prediction model; deployed to all hospital centers?
• What will be the specific end objectives to be met by the proposed solution?
• What should be the anticipated outcomes by the proposed solution?

I N T R OD U CT ION TO D AT A S C I E N C E
S O L U T I O N A P P R OA C H
What are the deliverables? Data Science deliverables fall under 3 categories:
1. Analysis – A study using data to describe how a
product or program is working. Ex: Exploratory Data
Analysis, Diagnosis to highlight change in trend
2. Experiment – A scientific study to test a hypothesis.
Ex: Spending more money on digital advertising leads
to increased sales.
Alternate Hypothesis – “Mean sales increased after
spending more money on advertising”
3. Model – Machine learning model trained on data to
predict an outcome. Ex: Churn prediction to alert the
company about at-risk customers.

https://medium.com/data-science-at-microsoft/managing-a-data-science-project-87945ff79483

I N T R OD U CT ION TO D AT A S C I E N C E
D AT A P R E PA R AT I O N
• What are the important variables that you think we should collect?
• Are these variables readily available? Or is there an additional effort needed to
collect these variables?
• What are the types of data?
• e.g. Sensor data, ERP, e-commerce and SAP CRM data are structured (OLTP), Social
networking data is unstructured.
• Where are the locations of data in the system?
• e.g. Product master and sales transaction data in ERP SQL RDBMS database, OLAP
data in SQL server for BI reporting, Text data for customer review and sentiment from
Tweets and FB posts etc. [Internal to the organization, or acquired from 3rd-party sources?]
• Where are the data coming from?
• e.g. data from sensor, sales data from ERP, online store
I N T R OD U CT ION TO D AT A S C I E N C E
D AT A P R E PA R AT I O N ... CONTD ...
• Who are the current consumers of the data?
• e.g. Visualization tools, BI application etc.
• What are the methods to acquire data?
• e.g. Sensor data are ingested to data lake. ERP, e-commerce, and SAP CRM are inside
organization’s data center and proper access control needs to be granted to access the
data. Social networking data are retrieved from streaming API as a nightly job and are
stored in a NoSQL database etc.
• What are the integration points?
• e.g. IT team needs to provide database access and needs to build API services to
access certain data.
• Will it be practical to get all the relevant variables and load it to our workspace?

I N T R OD U CT ION TO D AT A S C I E N C E
D AT A P R E PA R AT I O N ... CONTD ...
• What are the problems in acquiring the data?
• e.g. Sensor data are archived and deleted after 'x' days. Request needs to be raised to
store the data and to archive the data to make enough sample data for analyses and
modelling.
• Social networking data may not be available for a longer term. All relevant data are
captured by existing systems, and request needs to be raised and approved for
accessing data from servers.
• For the prediction problems, is sufficient amount of labelled examples available? Or is there a
cost involved in getting these values?
• e.g. a field survey may be needed to collect the response from a customer to see the
likelihood of joining a new plan.
• Are the training data drawn from a similar population on which the model is to be applied? If not,
are the selection biases noted? What are the plans to compensate?
https://machinelearningmastery.com/much-training-data-required-machine-learning/

I N T R OD U CT ION TO D AT A S C I E N C E
MODELLING
• Is the choice of model appropriate for the business problem? Is it in line with our prior
knowledge of the problem?
• Classification, scoring, clustering, etc.
• Does the modelling technique meet all the other requirements (functional and non-
functional) of the problem?
• Should various modelling techniques be tried and compared using appropriate
evaluation metrics?
• Check the amount of data required, generalization performance (i.e. how our model would be
using another sample), learning time

https://machinelearningmastery.com/much-training-data-required-machine-learning/

I N T R OD U CT ION TO D AT A S C I E N C E
E VA L U AT I O N

• Is there a plan for domain expert validation?


• If so, will the model be in a form that they can understand?
• Is there an evaluation metric set up by the business? (e.g. For a classification problem, there
should be less than x% of False Positives). Is that appropriate for the business problem?
Ex: Netflix fixed False Positive rate at max 5% for its prediction algorithms.
• Is there a hold-out data (i.e. data used for training / test) available? [70%-30% generally]

I N T R OD U CT ION TO D AT A S C I E N C E
E VA L U AT I O N
• For a classification problem, is there a threshold defined (for e.g. different thresholds can give
different implications in terms of benefits like reducing the threshold to a 0.70 can reduce the
False Positives)
• For a regression problem, how will we evaluate the quality of prediction in the business
context?
• For a clustering problem, how the clustering is interpreted in the context of the business
problem?
• How will we measure the business impact of the final model? How will we justify the project
expense against the benefits? [ROI]

I N T R OD U CT ION TO D AT A S C I E N C E
EXISTING SYSTEMS / REQUIREMENTS

• What are the existing/related systems within the capability that capture/use related
information? For e.g. A prediction model is already being used for fraud analysis.
Can we reuse the same transaction dataset for providing recommendations?
• What are the gaps?
• Who are the stakeholders?
• Who will be affected by this implementation?

I N T R OD U CT ION TO D AT A S C I E N C E
A SSUMPTIONS / D EPEND ENCIES / C H A L L E N G E S

• Note down the assumptions; things like availability of necessary data, access to the
infrastructure, licenses etc.
• Any Licenses/Commercials needed in case of proprietary solutions?
• Note down the dependencies: things like dependency on setting up and access to the
infrastructure/tools, on access rights etc.
• Are there any other dependencies?
• Do you see any other problems/challenges?

I N T R OD U CT ION TO D AT A S C I E N C E
I M P L E M E N TAT I O N

• Does the client have a technology preference? [Open Source vs Commercial]


• Does the client have limited / unlimited infrastructure?
[Deployment – Cloud / On-premise?]

I N T R OD U CT ION TO D AT A S C I E N C E
Case Study – Bipolar Disorder
• Bipolar Disorder (BD) is a recurrent chronic disorder characterized by fluctuations in mood state and
energy, which affects over 1% of the World population.
• BD is a primary cause of disability among people, leading to functional and cognitive impairment, with
increased morbidity, especially death by suicide.
• Compared to a normal, mentally-stable individual, an individual suffering from Bipolar Disorder
experiences extreme mood fluctuations, classified into “manic episodes” and “depressive episodes”,
which typically last between days to months.
• While the manic episodes are characterized by racing thoughts, feeling of elation, extreme irritability
etc., the depressive episodes are characterized by feelings of extreme sadness, restlessness, trouble
in concentration, insomnia etc.

I N T R OD U CT ION TO D AT A S C I E N C E
Case Study – Bipolar Disorder
The standard states of bipolar disorder are as follows:
• i) Bipolar I Disorder, ii) Bipolar II Disorder, iii) Cyclothymia, iv) Unspecified Bipolar.
• From a clinical viewpoint, Bipolar I is defined by the appearance of at least one manic episode. Patients may
experience hypomanic or major depressive episodes prior to or after the manic episodes.
• Bipolar II, Cyclothymia and Unspecified vary in episodes between hypomania and depression, with each cycle
lasting between weeks to months.
• Hypomania experiences: reduced need for sleep; spending recklessly, like buying a car you cannot afford; taking
chances you normally wouldn't take because you "feel lucky"; talking so fast that it's difficult for others to follow
what's being said.

Objective: Derive the best model for Bipolar State Prediction!

I N T R OD U CT ION TO D AT A S C I E N C E
Case Study – Bipolar Disorder
Company X intends to develop a Smart
Healthcare System for monitoring Bipolar
Disorder.
As a Data Scientist working for the company, what
kind of questions would you ask in the Data
Collection / Preparation phases?
• What are the important variables that we need
to collect?
• What are the types of data?
• Locations of data in the system?
• Integration points?
• Problems in acquiring the data?
• Do we have sufficient labelled samples for
prediction?
• Are the training data drawn from a similar
population on which the model to be applied?
If not, are the selection biases noted?

I N T R OD U CT ION TO D AT A S C I E N C E
Case Study – Bipolar Disorder
Data Scientist will consult the Domain expert (Psychiatrist or Psychologist) to find what variables are
important to collect during the Manic phase / Hypomanic phase / Depressive phase of the patient

• What are the important variables that we need to collect and location of data?
• Physiological Data [From Sensor]: Heart Rate, Electrodermal Activity (EDA), Oxygen
Saturation (SPO2), Blood Pressure etc.
• Behavioral Data [From Mobile App]: Self-assessment questionnaire to capture daily
information regarding sleep quality [hourly scale], physical activity [-3 for inactive to +3 for
active], mood states using GAD [7 point likert], HDRS [7 point likert], YMRS [5 point likert].
In addition, data on alcohol intake, stress levels, motivation levels, concentration levels,
menstrual cycle pattern, irritability levels, insomnia levels. The treating doctors will be
asked to rate the patient progress using scales from much worse (-3) to much better (+3).
The behavioral data will be collected from a Mobile App.

I N T R OD U CT ION TO D AT A S C I E N C E
Case Study – Bipolar Disorder
• Integration points?
• The sensor data and behavioral data are integrated on a daily basis and presented to the ‘Health
Analytics Engine’ on the Cloud to perform the analytics.
• Problems in acquiring the data?
• Bipolar patients must cooperate and provide the behavioral data truthfully on a periodic basis.
• Do we have sufficient labelled samples for prediction?
• Need to devise a strategy to collect the data samples for period of 6 months, on a daily basis, to build
the labelled data.
• Are the training data drawn from a similar population on which the model to be applied? If not, are the selection
biases noted?
• All the data is captured from patients suffering from Bipolar Disorder, albeit in different cycles [Type I,
Type II, Cyclothymia etc].

I N T R OD U CT ION TO D AT A S C I E N C E
Case Study – Bipolar Disorder
• What modelling techniques can be applied to predict the patient states?
• For a similar case in the US - Decision Tree, Random Forest, Support Vector Machine and Logistic
Regression models were applied, and the accuracy of Random Forest was the best.
• Outcome - Predict the patient states. [Multiclass Classification – Bipolar Type I, Bipolar Type II,
Cyclothymia and Unspecified are the states]

I N T R OD U CT ION TO D AT A S C I E N C E
A G U I D E T O D E S I G N I N G A D AT A S C I E N C E P R O J E C T

• To get started, brainstorm possible ideas that might interest you.


• Write a proposal along the CRISP-DM Standards.
• Planning
• Keep a timeline with a To Do, In Progress, Completed and Parking section.
• Track the progress
• Keep track of how much progress you are making on your metrics.
• Maintain a code repo for a code review.
• Know when to stop
• Identify a minimum viable product (MVP) to help you know when to stop.

I N T R OD U CT ION TO D AT A S C I E N C E
Data Science for Business by Tom Fawcett and Foster Provost, O’Reilly
https://www.linkedin.com/pulse/ask-questions-while-preparing-proposal-data-science-project-menon
http://www.acheronanalytics.com/acheron-blog/a-guide-to-designing-a-data-science-project

T HANK YOU

I N T R OD U CT ION TO D AT A S C I E N C E
I NTRODUCTION TO DATA S CIENCE
M ODULE # 5 : DATA AND DATA Q UALITY
IDS Course Team
BITS Pilani
T ABLE OF C ONTENTS

1 D ATA
2 D ATA - S E T S
3 D ATA R E T R I E VA L
4 D ATA P R E PA R AT I O N
5 D ATA E X P L O R AT I O N
6 D ATA QUALITY

7 O UTLIERS

I N T R OD U CT ION TO D AT A S C I E N C E
D AT A
• Data is a collection of data objects and their
attributes.
• A collection of attributes describe an object.
• Object is also known as record, observation,
case, sample or instance.
• An attribute is a property or characteristic of an
object.
• Examples: eye color of a person,
temperature
• Attribute is also known as variable, field,
characteristic, or feature.

I N T R OD U CT ION TO D AT A S C I E N C E
QUALITY OF D AT A

Data quality issues


• Noise and outliers;
• Missing data
• Inconsistent data
• Duplicate data
• Data that is biased or unrepresentative of the phenomenon or population that the data is
supposed to describe [80% Male observations and 20% Female]

I N T R OD U CT ION TO D AT A S C I E N C E
D ATA QUALITY I SSUES

Find the issues in the given data.

Name Age Date of Birth Course ID CGPA


Amy 24 01-Jan-1995 CS 104 7.4
Ben 23 Dec-01-1996 CS 102 7.5
Cathy 25 01-Nov-1994 6.7
Diana 24 Oct-01-1995 CS 104 7.9
Ben 23 Dec-01-1996 CS 102 7.5
Eden 24 CS 103 87.5
Fischer 01-01-1959 CS 105 7.0

Amy 24 01-Jan-1995 CS 104 7.2

I N T R OD U CT ION TO D AT A S C I E N C E
D ATA QUALITY I SSUES

Find the issues in the given data.

Name     Age        Date of Birth   Course ID   CGPA
Amy      24         01-Jan-1995     CS 104      7.4
Ben      23         Dec-01-1996     CS 102      7.5
Cathy    25         01-Nov-1994     <missing>   6.7
Diana    24         Oct-01-1995     CS 104      7.9
Ben      23         Dec-01-1996     CS 102      7.5
Eden     24         <missing>       CS 103      87.5
Fischer  <missing>  01-01-1959      CS 105      7.0
Amy      24         01-Jan-1995     CS 104      7.2

1. Missing data (Cathy, Eden, Fischer)
2. Inconsistent data format (Date of Birth)
3. Duplicate data (Ben appears twice)
4. Data inconsistency (Amy has two different CGPA values)
5. Incorrect data [Outlier] (CGPA of 87.5)

I N T R OD U CT ION TO D AT A S C I E N C E
P R E P ROCESSING ON D AT A

• Improve Data Quality


• To better fit a specified data mining or machine learning technique or tool.
• Number of attributes in a data set is often reduced because many techniques
are more effective when the data has a relatively small number of attributes.
• Data correction corrects the errors in the data.
• Data cleansing removes irrelevant data.
• Data transformation changes data from one format to another.
• Correction improves the data quality.

I N T R OD U CT ION TO D AT A S C I E N C E
A T T R I B U T E / F E AT U R E

An attribute is a property or characteristic of


an object.
• eye color of a person, temperature
Attribute is also known as variable, field,
characteristic, or feature.
The values used to represent an attribute
may have properties that are not properties
of the attribute itself.
• Average age of an employee may have a
meaning , whereas it makes no sense to
talk about the average employee ID.

I N T R OD U CT ION TO D AT A S C I E N C E
A T T R I B U T E / F E AT U R E

The type of an attribute should tell us what


properties of the attribute are reflected in
the values used to measure it.
• For the age attribute, the properties of the
integers used to represent age are very
much the properties of the attribute. Even
so, ages have a maximum while integers
do not.
• The ID attribute is distinct. The only valid
operation for employee IDs is to test
whether they are equal.

I N T R OD U CT ION TO D AT A S C I E N C E
P R O P E RT I E S OF ATTRIBUTES

Specify the type of an attribute by identifying the properties of numbers that


correspond to underlying properties of the attribute.
Properties include
• Distinctiveness   =, !=
• Order             <, >, ≤, ≥
• Addition          +, −
• Multiplication    *, /
Based on these properties, we define four types of attributes: nominal, ordinal,
interval, and ratio.
Each attribute type possesses all of the properties and operations of the attribute
types above it.

I N T R OD U CT ION TO D AT A S C I E N C E
T YPES OF ATTRIBUTES
[Figure: attribute-type hierarchy. Data splits into Categorical (Nominal, Ordinal) and Numerical (Interval, Ratio).]
I N T R OD U CT ION TO D AT A S C I E N C E
T YPES OF AT T R I B U T E S
Nominal: distinctiveness; the data can only be categorized.

Ordinal: order; the data can be categorized and ranked.

Interval: the data can be categorized, ranked, and is evenly spaced.

Ratio: the data can be categorized, ranked, evenly spaced, and has a natural zero.

Each attribute type possesses all of the properties and operations of the attribute types above it.

I N T R OD U CT ION TO D AT A S C I E N C E
T YPES OF ATTRIBUTES
[Figure: attribute-type hierarchy with examples.
Categorical / Nominal: eye color, gender, nationality.
Categorical / Ordinal: grades, shirt size (S, M, L, XL, XXL).
Numerical / Interval: calendar dates, temperature.
Numerical / Ratio: income, height, weight, annual sales, age.]
I N T R OD U CT ION TO D AT A S C I E N C E
T YPES OF ATTRIBUTES

I N T R OD U CT ION TO D AT A S C I E N C E
T YPES OF AT T R I B U T E S E X A M P L E

Identify the types of attributes in the given data.

ID Age Gender Course ID CGPA Grade


19001 24 Female CS 104 7.4 Good
19002 23 Male CS 102 7.5 Good
19003 25 Female CS 103 6.7 Fair
19004 24 Female CS 104 7.9 Good
19005 23 Male CS 102 7.5 Good
19006 24 Female CS 103 8.5 Excellent
19007 26 Male CS 105 7.0 Good

I N T R OD U CT ION TO D AT A S C I E N C E
T YPES OF AT T R I B U T E S E X A M P L E

Identify the types of attributes in the given data.

ID Age Gender Course ID CGPA Grade


19001 24 Female CS 104 7.4 Good
19002 23 Male CS 102 7.5 Good
19003 25 Female CS 103 6.7 Fair
19004 24 Female CS 104 7.9 Good
19005 23 Male CS 102 7.5 Good
19006 24 Female CS 103 8.5 Excellent
19007 26 Male CS 105 7.0 Good
Nominal Ratio Nominal Nominal Ratio Ordinal

I N T R OD U CT ION TO D AT A S C I E N C E
AT T R I B U T E S AND T R A N S FO R M AT I O N S

Introduction to Data Mining by Tan


I N T R OD U CT ION TO D AT A S C I E N C E
A T T R IBUTES BY THE N UMBER OF V ALUES
1. Discrete Attribute
• Only a finite or countable set of values.
• Examples: zip codes, counts, or the set of words in a collection of documents.
• Often represented as integer variables.
• Note: binary attributes are a special case of discrete attributes.

2. Continuous Attribute
• Measurable data.
• Examples: temperature, height, age, weight.
• Typically represented as floating-point variables.

I N T R OD U CT ION TO D AT A S C I E N C E
A T T R IBUTES BY THE N UMBER OF V ALUES
• Discrete data is countable while
continuous data is measurable.
• Discrete data contains distinct or
separate values.
• On the other hand, continuous data
includes any value within range.
• Discrete data is graphically
represented by bar graph whereas a
histogram is used to represent
continuous data graphically.

I N T R OD U CT ION TO D AT A S C I E N C E
T YPES OF AT T R I B U T E S E X A M P L E

Identify the types of attributes in the given data – Discrete vs Continuous?

ID Age Gender Course ID CGPA Grade


19001 24 Female CS 104 7.4 Good
19002 23 Male CS 102 7.5 Good
19003 25 Female CS 103 6.7 Fair
19004 24 Female CS 104 7.9 Good
19005 23 Male CS 102 7.5 Good
19006 24 Female CS 103 8.5 Excellent
19007 26 Male CS 105 7.0 Good

I N T R OD U CT ION TO D AT A S C I E N C E
T YPES OF AT T R I B U T E S E X A M P L E

Identify the types of attributes in the given data.

ID Age Gender Course ID CGPA Grade


19001 24 Female CS 104 7.4 Good
19002 23 Male CS 102 7.5 Good
19003 25 Female CS 103 6.7 Fair
19004 24 Female CS 104 7.9 Good
19005 23 Male CS 102 7.5 Good
19006 24 Female CS 103 8.5 Excellent
19007 26 Male CS 105 7.0 Good
Discrete Continuous Discrete Discrete Continuous Discrete

I N T R OD U CT ION TO D AT A S C I E N C E
D AT A F O R M AT S
Record data
• Transaction or Market Basket data – set of items
• Data Matrix – record data with only numeric attributes.
• Sparse Data Matrix – binary asymmetric data. 0/1 entries.
• Document term matrix
Graph data
• Data with relationships among objects – Web pages
• Data with objects as graphs – LOD cloud
Ordered data
• Sequential data or temporal data – Record data + time.
• Sequence data – genome representation
• Time series data – temporal autocorrelation
• Spatial data – spatial autocorrelation
I N T R OD U CT ION TO D AT A S C I E N C E
R E C O R D D ATA E X A M P L E

• Flat file (CSV), RDBMS: banking, retail, e-commerce etc.
• Data matrix: e.g. an SPSS data matrix.
• Document-term matrix: frequency of terms that appear in documents, used in Information Retrieval.
https://towardsdatascience.com/types-of-data-sets-in-data-science-data-mining-machine-learning-eb47c80af7a

I N T R OD U CT ION TO D AT A S C I E N C E
G R A P H D ATA E X A M P L E

Linked Open Data Cloud


https://lod-cloud.net/

I N T R OD U CT ION TO D AT A S C I E N C E
O R D E R E D D ATA E X A M P L E

• Temporal data: each record has a time stamp associated. Ex: money transfer transactions in banking.
• Sequence data: positions instead of time stamps. Ex: DNA sequence bases (G, T, A, C).

I N T R OD U CT ION TO D AT A S C I E N C E
T ABLE OF C ONTENTS

1 D ATA
2 D ATA - S E T S
3 D ATA R E T R I E VA L
4 D ATA P R E PA R AT I O N
5 D ATA E X P L O R AT I O N
6 D ATA QUALITY

7 O UTLIERS

I N T R OD U CT ION TO D AT A S C I E N C E
T YPES OF D AT A - S E T S
1 Structured data
• Data containing a defined data type, format and structure.
• Example: transaction data, online analytical processing , OLAP data cubes, traditional
RDBMS, CSV file and spreadsheets.
2 Semi structured data
• Textual data file with discernible pattern that enables parsing
• Example: XML data file, JSON data file
3 Quasi structured data
• Textual data with erratic data format that can be formatted with effort, tools and time
• Example: Web click-stream data [IP address, Timestamp, GeoCodes etc]
4 Unstructured data
• Data that has no inherent structure.
• Example: PDF, Images, Video, Email
I N T R OD U CT ION TO D AT A S C I E N C E
T YPES OF D AT A - S E T S

Quasi Structured Data - Web click-stream data

I N T R OD U CT ION TO D AT A S C I E N C E
T YPES OF D AT A - S E T S

5 Natural Language data


• Entity recognition, topic recognition, summarization, text completion, and sentiment
analysis
• Models trained in one domain don’t generalize well to other domains. [Vocabulary]
6 Machine generated data
• Machine-generated data is automatically created by a computer, process, application, or
other machine without human intervention.
• High volume and speed.
• Examples web server logs, call detail records, network event logs

I N T R OD U CT ION TO D AT A S C I E N C E
T YPES OF D AT A - S E T S
7 Graph-based or network data
• Data can be shown in a graph. [Ex: Linked Open Data Cloud]
• A graph is a mathematical structure to model pair-wise relationships between objects.
• Graph or network data focuses on the relationship or adjacency of objects.
• Graph databases with specialized query languages such as SPARQL.
• Example: DBPedia data in RDF format [RDF dump or through an endpoint: https://dbpedia.org/sparql]
8 Streaming data
• The data flows into the system when an event happens instead of being loaded into a
data store in a batch.
• Example: live sports or music events, stock market.

I N T R OD U CT ION TO D AT A S C I E N C E
C HARACTERISTICS OF D AT A - S E T S
1 Dimensionality
• Number of attributes.
• Curse of Dimensionality: the difficulties associated with analyzing high-dimensional data.
• Dimensionality reduction techniques [PCA, NMF, LDA etc.]
2 Sparsity
• For some data sets, such as those with asymmetric features, most attributes of an object have values of
0; in many cases, fewer than 1% of the entries are non-zero.
• An advantage, because usually only the non-zero values need to be stored and manipulated.
3 Resolution
• The patterns in the data depend on the level of resolution.
• If the resolution is too fine, a pattern may not be visible or may be buried in noise; if the resolution is
too coarse, the pattern may disappear. [Ex: Air Pollution: an index over chemical pollutants measured per
second is a fine resolution; measured per hour, a coarse resolution]

I N T R OD U CT ION TO D AT A S C I E N C E
Curse of Dimensionality

I N T R OD U CT ION TO D AT A S C I E N C E
C HARACTERISTICS OF D AT A - S E T S

Scenario A (all sensors working):
• Most sensor outputs are zero.
• High dimensionality, but little information: sparse data.
• Only the values of Sensors 1, 2 and 12 need to be stored and manipulated.

Scenario B (Sensors 1 and 6 malfunctioning):
• Either fill the data manually, or discard the Sensor 1 and 6 data in calculations.
I N T R OD U CT ION TO D AT A S C I E N C E
T ABLE OF C ONTENTS

1 D ATA
2 D ATA - S E T S
3 D ATA R E T R I E VA L
4 D ATA P R E PA R AT I O N
5 D ATA E X P L O R AT I O N
6 D ATA QUALITY

7 O UTLIERS

I N T R OD U CT ION TO D AT A S C I E N C E
R E T R I E V I N G D AT A

Already collected and stored the data in the organization


Look outside the organization for high-quality data available for public and commercial
use. (open-data providers)
Quality Check while Retrieving Data [+Provenance]
• Check to see if data is equal to the data in the source document.
• Check for the right data types.

I N T R OD U CT ION TO D AT A S C I E N C E
R E T R I E V I N G D AT A

Data Storage
• Database tables
• Text files
• Data marts
• Data warehouses
• Data lakes (raw data)

I N T R OD U CT ION TO D AT A S C I E N C E
T ABLE OF C ONTENTS

1 D ATA
2 D ATA - S E T S
3 D ATA R E T R I E VA L
4 D ATA P R E PA R AT I O N
5 D ATA E X P L O R AT I O N
6 D ATA QUALITY

7 O UTLIERS

I N T R OD U CT ION TO D AT A S C I E N C E
D ATA P R E PA R AT I O N

I N T R OD U CT ION TO D AT A S C I E N C E
D AT A C L E A N S I N G

Focuses on removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from.
Two types of errors
• Interpretation / Representation error
• Age > 130
• Height of a person is greater than 8 feet.
• Price is negative.
• Inconsistencies between data sources or against your company’s standardized values.
• Female and F
• Feet and meter
• Dollars and Pounds

I N T R OD U CT ION TO D AT A S C I E N C E
D ATA C L E A N S I N G

I N T R OD U CT ION TO D AT A S C I E N C E
D AT A C L E A N S I N G
Errors from data entry
• Cause
• Typos
• Errors due to lack of concentration
• Machine or hardware failure
• Detection
• Frequency table [Frequency is the number of times a specific data value occurs in your
dataset.]
• Correction
• Simple assignment statements
• If-then-else rules
White-spaces and typos
• Remove leading and trailing white-spaces.
• Change case of the alphabets from upper to lower. [Ex: SILK Framework – semantic
matching]
I N T R OD U CT ION TO D AT A S C I E N C E
D AT A C L E A N S I N G

Physically impossible values


• Examples
• Age > 130
• Height of a person is greater than 8 feet.
• Price is negative.
Outliers
• Use visualization techniques like box plots.
• Use statistical summary with minimum and maximum values.

I N T R OD U CT ION TO D AT A S C I E N C E
D AT A C L E A N S I N G

Deviations from code-book


• A code book is a description of your data. It contains things such as the number of
variables per observation, the number of observations, and what each encoding within a
variable means. [Ex: One-hot encoding for categorical values such as Gender]
• Discrepancies between the code-book and the data should be corrected.
Different units of measurement
• Pay attention to the respective units of measurement.
• Simple conversion can rectify.
Different levels of aggregation
• Data set containing data per week versus one containing data per work week.
• Data summarization will fix it.

I N T R OD U CT ION TO D AT A S C I E N C E
C O M B I N I N G D AT A

Two operations to combine information from different data sets.


• Joining
• Enriching an observation from one table with information from another table.
• Requires primary keys or candidate keys.
• Use views to virtually combine data.
• Appending or stacking
• Adding the observations of one table to those of another table.
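In pandas, for example, the two operations map onto merge and concat; the tables below are hypothetical:

# Sketch: joining vs. appending with pandas (hypothetical tables).
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "region": ["N", "S", "E"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3], "amount": [250, 120, 90]})

# Joining: enrich each order with the customer's region via the key cust_id
enriched = orders.merge(customers, on="cust_id", how="left")

# Appending / stacking: add observations that share the same columns
more_orders = pd.DataFrame({"cust_id": [2], "amount": [60]})
stacked = pd.concat([orders, more_orders], ignore_index=True)

print(enriched)
print(stacked)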

I N T R OD U CT ION TO D AT A S C I E N C E
T R A N S FO R M I N G D AT A
Applying mathematical transformation to the input variable.
• For example, for a relationship of the form y = ae^(bx), taking the logarithm of y (log y = log a + bx)
makes the relationship linear; similarly, transforming x to log x linearizes y = a + b·log(x).

Reducing number of variables.


Combining two variables into a new variable.

Introducing Data Science by Cielen, Meysman and Ali
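A quick numerical check of the exponential case, with made-up constants a = 2 and b = 0.5:

# Sketch: linearizing y = a * e^(b*x) by taking log(y) (synthetic a=2, b=0.5).
import numpy as np

x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.5 * x)

print(np.corrcoef(x, y)[0, 1])           # noticeably below 1: curved relationship
print(np.corrcoef(x, np.log(y))[0, 1])   # exactly 1.0: linear after the transform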


I N T R OD U CT ION TO D AT A S C I E N C E
T R A N S FO R M I N G D AT A

Turning variables into dummies.


• Dummy variables can only take two
values: true(1) or false(0).
• Create separate columns for the
classes stored in one variable and
indicate it with 1 if the class is
present and 0 otherwise.
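With pandas this is a one-liner via get_dummies; the toy column below is hypothetical:

# Sketch: turning a categorical variable into 0/1 dummy columns (toy data).
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})
dummies = pd.get_dummies(df["gender"], prefix="gender", dtype=int)
print(pd.concat([df, dummies], axis=1))
# gender_Female / gender_Male hold 1 if the class is present, 0 otherwise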

I N T R OD U CT ION TO D AT A S C I E N C E
T ABLE OF C ONTENTS

1 D ATA
2 D ATA - S E T S
3 D ATA R E T R I E VA L
4 D ATA P R E PA R AT I O N
5 D ATA E X P L O R AT I O N
6 D ATA QUALITY

7 O UTLIERS

I N T R OD U CT ION TO D AT A S C I E N C E
E X P L O R AT O R Y D AT A A N A L Y S I S ( E D A )

Use graphical techniques to gain an understanding of the data and the interactions
between variables.
Look at what can be learned from the data.
Statistical properties like distribution of data, correlation.
Discover outliers.

I N T R OD U CT ION TO D AT A S C I E N C E
E X P L O R AT O R Y D AT A A N A L Y S I S ( E D A )

• Boxplot – can show the maximum, minimum, median, and other characterizing
measures at the same time.
• Histogram – In a histogram a variable is cut into discrete categories and the number of
occurrences in each category are summed up and shown in the graph.
• Clustering and other modeling techniques can also be a part of exploratory analysis.

Refer - https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Exploratory_data_Analysis.ipynb

I N T R OD U CT ION TO D AT A S C I E N C E
B OX P L O T [ W H I S K E R P L O T ]
• A boxplot incorporates
the five-number summary.
• The ends of the box are at the
quartiles.
• The box length is the interquartile
range.
• The median is marked by a line within
the box.
• The whiskers outside the box extend to
the Minimum and Maximum
observations.

I N T R OD U CT ION TO D AT A S C I E N C E
B OX P L O T

Consider the ordered list of observations for ‘Age’ feature.


25,25,30,33,33,35,35,35,35,36,40,41,42,42,51
Draw a box-plot to represent the above data.
Median, Minimum, Maximum, First quartile, Third quartile, Interquartile Range

BoxPlot Calculator - http://www.alcula.com/calculators/statistics/box-plot/
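The same five-number summary can also be computed directly; a minimal NumPy sketch follows (note that different quartile conventions can shift Q1/Q3 slightly):

# Sketch: five-number summary and IQR for the 'Age' observations above.
import numpy as np

age = [25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 41, 42, 42, 51]
q1, median, q3 = np.percentile(age, [25, 50, 75])
print("min:", min(age), " Q1:", q1, " median:", median, " Q3:", q3, " max:", max(age))
print("IQR:", q3 - q1)

# To draw it: import matplotlib.pyplot as plt; plt.boxplot(age); plt.show()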

I N T R OD U CT ION TO D AT A S C I E N C E
S C AT T E R P L O T
• Determine if there appears to be a relationship, pattern, or trend between two numeric
attributes.
• Provide a visualization of bi-variate data to see clusters of points and outliers, or
correlation relationships.

I N T R OD U CT ION TO D AT A S C I E N C E
S C AT T E R P L O T

Analysis
• More tips given during the dinner time
compared to the lunch time
• Positive correlation between total bill
amount and tip given, i.e., more the bill
amount, more the tip paid.
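A plot like this takes only a few lines with seaborn, whose bundled 'tips' sample dataset matches the example:

# Sketch: total bill vs. tip, colored by lunch/dinner (seaborn sample data).
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")          # small built-in sample dataset
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.show()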

I N T R OD U CT ION TO D AT A S C I E N C E
HeatMap
• Using visual cues in a heatmap.
• A heatmap is a way to visualize data in tabular format, where in place of the
numbers, you leverage colored cells that convey the relative magnitude of the
numbers.
• Use color saturation to provide visual cues to quickly target the potential
points of interest.
• Always include a legend as a subtitle on the heatmap with color
corresponding to the conditional formatting color.
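As a sketch, here is a correlation heatmap over the numeric columns of the same tips dataset, with the color scale serving as the legend:

# Sketch: a correlation heatmap with color saturation as the visual cue.
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
corr = tips[["total_bill", "tip", "size"]].corr()   # numeric columns only
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap (scale: -1 to +1)")
plt.show()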

I N T R OD U CT ION TO D AT A S C I E N C E
HeatMap

I N T R OD U CT ION TO D AT A S C I E N C E
T ABLE OF C ONTENTS

1 D ATA
2 D ATA - S E T S
3 D ATA R E T R I E VA L
4 D ATA P R E PA R AT I O N
5 D ATA E X P L O R AT I O N
6 D ATA QUALITY

7 O UTLIERS

I N T R OD U CT ION TO D AT A S C I E N C E
D ATA QUALITY INDEX

https://www.deltapartnersgroup.com/
I N T R OD U CT ION TO D AT A S C I E N C E
I MPA C T OF M I SS IN G D ATA I N D ATA S E T

• Loss of data reduces the statistical power, i.e., may introduce selection bias which
may invalidate the study.
• Creates imbalance in the observations and can lead to invalid conclusions.
• Affects the performance of Machine Learning Models.

I N T R OD U CT ION TO D AT A S C I E N C E
M IS S IN G D AT A M E C H A N I S M S

• Understanding the mechanism of missing data will help us choose


appropriate imputation method.
• Missing Completely At Random (MCAR)
• Missing At Random (MAR)
• Not Missing At Random (NMAR)

I N T R OD U CT ION TO D AT A S C I E N C E
M IS S IN G C O M P L E T E L Y A T R A N D O M ( M C A R )
• The probability of missing is the same for all the observations.
• There is no relationship between the missing values and any other values in the dataset.
• Removing such missing values will not affect the inferences made.

Category      Product           Rating
Accessories   Helmets           90%
Accessories   Lights            90%
Accessories   Locks             90%
Accessories   Tires and Tubes   90%
Accessories   Bike Racks        NA
Accessories   Pumps             95%
Clothing      Jerseys           NA
Clothing      Caps              15%
Clothing      Tights            30%
Clothing      Bib-Shorts        36%
Clothing      Socks             48%
Components    Chains            75%
Components    Handlebars        35%
Components    Brakes            36%
Components    Brakes            38%
Components    Bottom Brackets   NA
I N T R OD U CT ION TO D AT A S C I E N C E
M IS S IN G A T R A N D O M ( M A R )
• The probability of missing values
depends on available information
• i.e it depends on other variables in the
dataset.

I N T R OD U CT ION TO D AT A S C I E N C E
N O T M IS S IN G A T R A N D O M ( N M A R )

• The missing values exist as an


indication of a certain class.
• Depression = yes has more missing
values. Hence choose imputation
technique appropriately.

I N T R OD U CT ION TO D AT A S C I E N C E
I M P U T AT I O N T ECHNIqUES

A. Categorical Variables
B. Numerical Variables

I N T R OD U CT ION TO D AT A S C I E N C E
I M P U T AT I O N – C A T E G O R I C A L VARIABLES

1. Imputation by deleting rows
• A dataset with deleted rows may impact our classification process.
• Not applicable for smaller datasets.

2. Replace with the most frequent value
• May create an imbalanced dataset within the category [outcome may be biased].

Gender   Age   Years of employment   Cardiac Health
Male     28    7                     8
Male     45    21                    5
-        54    25                    7
Female   31    5                     8
Male     35    4                     8
Female   45    8                     9
Male     48    29                    4

I N T R OD U CT ION TO D AT A S C I E N C E
I M P U T AT I O N – C A T E G O R I C A L VARIABLES

3. Create a classifier algorithm to predict missing values
• Use ‘Age’, ‘Years of employment’ and ‘Cardiac Health’ as the training dataset to predict ‘Gender’.
• Rows with blank values become the test dataset.

4. Unsupervised ML technique such as K-means clustering
• Cluster ‘Age’ and ‘Years of employment’ into 2 categories, then predict the missing values for
‘Gender’ [possibility of a value falling into a cluster].

Gender   Age   Years of employment   Cardiac Health
Male     28    7                     8
Male     45    21                    5
-        54    25                    6
Female   31    5                     8
Male     35    4                     8
Female   45    8                     9
Male     48    29                    4

I N T R OD U CT ION TO D AT A S C I E N C E
I M P U TAT I O N – N U M E R I C A L VARIABL ES

1. Deleting the missing value.
2. Create a regression algorithm to predict values (similar to categorical variables).
3. Statistical methods:
• Mean: the average value of the variable is imputed for the missing data, provided the data does not
contain extreme values (outliers).
• Median: the centre value of the variable after arranging in ascending order. Preferred when the data
contains outliers.
• Mode: imputed with the variable value that has the maximum frequency of occurrence [for large data,
and when the variable value does not impact the overall outcome].

Age   Annual Income ($)   Industry      Fitness Level
34    18,97,444           Accounting    4.5
41    81,256              Healthcare    5
38    5,66,987            Marketing     -
17    -                   -             9.5
54    1,33,500            Insurance     6
-     12,23,222           Real Estate   8
21    9,88,300            Research      7
67    53,00,000           IT            -
46    28,71,900           IT            5

I N T R OD U CT ION TO D AT A S C I E N C E
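As referenced under the statistical methods above, a minimal sketch using scikit-learn's SimpleImputer; the 'Age' column is borrowed from the example table, with the missing entry encoded as NaN.

import numpy as np
from sklearn.impute import SimpleImputer

ages = np.array([[34], [41], [38], [17], [54], [np.nan], [21], [67], [46]])

# strategy="mean"  : OK when the data has no extreme values (outliers)
# strategy="median": preferred when the data contains outliers
# (strategy="most_frequent" gives the mode, though it is degenerate here
#  because every observed age is unique)
for strategy in ("mean", "median"):
    imputer = SimpleImputer(strategy=strategy)
    print(strategy, imputer.fit_transform(ages).ravel())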
MEAN / MEDIAN IMPUTATION

• Used when data is MCAR / MAR.
• Assumes that the feature follows a normal distribution.

Advantages
• Easy to implement.
• A fast way of obtaining a complete dataset.

Disadvantages
• Mean imputation reduces the variance of the imputed variables (a small check follows this slide).
• Mean imputation does not preserve relationships between variables, such as correlations.
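The first disadvantage is easy to verify: mean-imputed points sit exactly at the mean, so they add nothing to the sum of squared deviations while enlarging the denominator. A small check, assuming pandas (the series values are arbitrary):

import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, np.nan, 13, np.nan, 9, 14])
filled = s.fillna(s.mean())

print(s.var())       # variance over the observed values only
print(filled.var())  # smaller: the imputed points add no spread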
TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA RETRIEVAL
4 DATA PREPARATION
5 DATA EXPLORATION
6 DATA QUALITY
7 OUTLIERS
OUTLIERS

• An outlier is a data point that lies significantly far away from most other data points. For example, if everyone in your classroom is of average height except for two basketball players who are significantly taller than the rest of the class, those two data points would be considered outliers.
• Data objects with behaviors that are very different from expectation are called outliers or anomalies.
• Outliers can significantly skew the distribution of your data.
• Outliers can be identified using summary statistics and plots of the data.
• Algorithms like Linear Regression, K-Nearest Neighbor and AdaBoost are sensitive to noise.
OUTLIERS

Outliers can have many causes, such as:
- Measurement or input error.
- Data corruption.
- True outlier observation.

There is no precise way to define and identify outliers in general, because of the specifics of each dataset. Instead, you, or a data scientist, must interpret the raw observations and decide whether a value is an outlier or not.
OUTLIER DETECTION USING NORMAL DISTRIBUTION

• 99.7% of the observations of a variable following a normal distribution lie within µ ± 3σ; points outside this band are flagged as outliers (a sketch follows this slide).
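As referenced above, a minimal sketch of the µ ± 3σ rule, assuming numpy; the synthetic sample and the two planted outliers are illustrative.

import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, size=1000), [95.0, 2.0])  # two planted outliers

mu, sigma = x.mean(), x.std()
outliers = x[(x < mu - 3 * sigma) | (x > mu + 3 * sigma)]
print(outliers)  # values outside the mu ± 3*sigma band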
Outlier Detection Techniques

Outlier type:
1. Univariate
• Box plot (IQR)
• Density plot (standard deviation)
• Z-score method

2. Multivariate
• DBSCAN (clustering algorithm)
• Local Outlier Factor (LOF) method

A sketch of the IQR and LOF methods follows.

https://www.kaggle.com/code/rpsuraj/outlier-detection-techniques-simplified/notebook
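As referenced in the list above, a minimal sketch of one univariate and one multivariate technique: the box-plot (IQR) rule and the Local Outlier Factor. numpy and scikit-learn are assumed, and the synthetic data is illustrative.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# Univariate: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR],
# the same fences a box plot draws.
x = np.append(rng.normal(0, 1, size=200), [8.0, -7.5])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

# Multivariate: LOF compares each point's local density with that of its
# neighbours; fit_predict returns -1 for outliers and 1 for inliers.
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[6.0, 6.0]]])
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
lof_outliers = X[labels == -1]

print(iqr_outliers)
print(lof_outliers)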
Self Study and Revision

1. Discussion on previous year question papers
2. Self study (revision)*:
• PPTs for Sessions 1-8 (lecture notes)
• Case study – Data Science Proposal Evaluation
• Case study on Air Pollution
• Some textbooks and reference books

* Material on Canvas - https://bits-pilani.instructure.com/groups/27946/files
Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar (T3)
The Art of Data Science, by Roger D. Peng and Elizabeth Matsui (R1)
Introducing Data Science, by Cielen, Meysman and Ali
https://www.deltapartnersgroup.com/managing-data-quality-optimize-value-extraction
http://www.dataintegration.ninja/relationship-between-data-quality-and-master-data-management/

THANK YOU

INTRODUCTION TO DATA SCIENCE
MODULE #4: DATA SCIENCE TEAMS
IDS Course Team
BITS Pilani
TABLE OF CONTENTS

1 DATA SCIENCE TEAMS
Roles in a Data Science Project

Watch the video [Netflix Data Science Team]:
https://www.youtube.com/watch?v=m5hLUknIi5c
OR
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/

Self-study (Airbnb Data Science Team) - https://www.youtube.com/watch?v=6QVXPNrSbLU
ROLES IN DATA SCIENCE TEAM [1/6]

[1] Chief Analytics Officer / Chief Data Officer
• The CAO, a "business translator," bridges the gap between data science and domain expertise, acting both as a visionary and a technical lead.
• Preferred skills: data science and analytics, programming skills, domain expertise, leadership and visionary abilities.

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [2/6]

[2] Data analyst
• The data analyst role implies proper data collection and interpretation activities.
• An analyst ensures that collected data is relevant and exhaustive, while also interpreting the analytics results.
• The role may require visualization skills to convert alienating numbers into tangible insights through graphics. [Ex: Python, Tableau, Power BI]
• Preferred skills: R, Python, SQL

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [3/6]

[3] Business analyst
• A business analyst essentially performs a CAO's functions, but at the operational level.
• This implies converting business expectations into data analysis.
• If your core data scientist lacks domain expertise, a business analyst bridges this gulf.
• Preferred skills: data visualization and interpretation, business intelligence, SQL

[4] Data scientist
• A data scientist is a person who solves business tasks using machine learning and data mining techniques.
• The role can be narrowed down to data preparation and cleaning, followed by model training and evaluation.
• Preferred skills: R, SAS, Python, Matlab, SQL, NoSQL, Hive, Pig, Hadoop, Spark

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [4/6]

The job of a data scientist is often divided into two roles:

[4A] Machine learning engineer
• A machine learning engineer combines software engineering and modeling skills, determining which model to use and what data should be used for each model.
• Probability and statistics are also their forte.
• Responsible for training, monitoring and maintaining models.
• Preferred skills: R, Python, Scala, Julia, PyTorch, TensorFlow

[4B] Data journalist
• Data journalists help make sense of data output by putting it in the right context.
• They articulate business problems and shape analytics results into compelling stories.
• They present ideas to stakeholders and represent the data team to those unfamiliar with statistics.
• Preferred skills: SQL, Python, R, Scala, Carto, D3, QGIS, Tableau

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [5/6]

[5] Data architect
• Works with Big Data.
• This role is critical for warehousing the data, defining the database architecture, centralizing data and ensuring integrity across different sources.
• Preferred skills: SQL, NoSQL, XML, Hive, Pig, Hadoop, Spark

[6] Data engineer
• Data engineers implement, test and maintain the infrastructural components that data architects design. They build and maintain the data pipeline [Ex: ETL -> Analysis].
• Realistically, the roles of engineer and architect can be combined in one person.
• Preferred skills: SQL, NoSQL, Hive, Pig, Matlab, SAS, Python, Java, Ruby, C++, Perl

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [6/6]

[7] Application / data visualization engineer
• This role is only necessary for a specialized data science model.
• An application engineer, or other developers from front-end units, oversees end-user data visualization.
• Preferred skills: programming, JavaScript (for visualization), SQL, NoSQL

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
DATA SCIENTIST

Stitch Fix's Michael Hochster defines two types of data scientists:
• Type A stands for Analysis
  • This person is a statistician who makes sense of data without necessarily having strong programming knowledge.
  • Type A data scientists perform data cleaning, forecasting, modeling, visualization, etc.
• Type B stands for Building
  • These folks use data in production.
  • They are excellent software engineers with some statistics background who build recommendation systems, personalization use cases, etc.

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
DATA SCIENTIST REQUIREMENTS - INDUSTRY-WISE

• Business
  • Analysis of business data can inform decisions around efficiency, inventory, production errors, customer loyalty and more.
• E-commerce
  • Improve customer service, find trends and develop services or products.
• Finance
  • Data on accounts, credit and debit transactions and similar financial data, plus security and compliance, including fraud detection.
• Government
  • Form decisions, support constituents and monitor overall satisfaction, security and compliance.

https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
DATA SCIENTIST REQUIREMENTS - INDUSTRY-WISE

• Social networking
  • Targeted advertising, improving customer satisfaction, establishing trends in location data and enhancing features and services.
  • Ongoing analysis of posts, tweets, blogs and other social media can help businesses constantly improve their services.
• Healthcare
  • Electronic medical records require a dedication to big data, security and compliance.
  • Improve health services and uncover trends that might otherwise go unnoticed.
• Telecommunications
  • All electronics collect data, and all that data needs to be stored, managed, maintained and analyzed.
  • Data scientists help companies squash bugs, improve products and keep customers happy by delivering the features they want.

https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
SKILLSET FOR A DATA SCIENTIST

PROGRAMMING: The most fundamental of a data scientist's skills. Programming improves your statistics skills, helps you "analyze large datasets" and gives you the ability to create your own tools.
QUANTITATIVE ANALYSIS: Improves your ability to run experimental analysis, scale your data strategy and implement machine learning.
PRODUCT INTUITION: Understanding products helps you perform quantitative analysis. It also helps you predict system behavior, establish metrics and improve debugging skills.
COMMUNICATION: Strong communication skills help you "leverage all of the previous skills listed."
TEAMWORK: Requires being selfless, embracing feedback and sharing your knowledge with your team.

William Chen, Data Science Manager at Quora
SKILLSET OF A DATA SCIENTIST
DATA SCIENCE TEAM BUILDING (WORKING WITH OTHER TEAMS)

• Get to know each other for better communication
• Foster team cohesion and teamwork
• Encourage collaboration to boost team productivity and performance

https://towardsdatascience.com/why-team-building-is-important-to-data-scientists-a8fa74dbc09b
ORGANISATION OF DATA SCIENCE TEAM

[1] Decentralized
• Data scientists report into specific business units (ex: Retail / BB / Commercial Banking) or functional units (ex: Marketing, Finance, HR) within a company.
• Resources are allocated only to projects within their silos, with no view of analytics activities or priorities outside their function or business unit.
• Analytics are scattered across the organization in different functions and business units.
• Little to no coordination.
• Drawback – leads to isolated teams.
ORGANISATION OF DATA SCIENCE TEAM

[2] Functional
• Resource allocation is driven by a functional agenda rather than an enterprise agenda.
• Analysts are located in the functions where the most analytical activity takes place, but may also provide services to the rest of the corporation. [Ex: HR analytics]
• Little coordination.
ORGANISATION OF DATA SCIENCE TEAM

[3] Consulting
• Resources are allocated based on availability, on a first-come first-served basis, without necessarily aligning to enterprise objectives.
• Analysts work together in a central group but act as internal consultants who charge "clients" (business units) for their services.
• No centralized coordination. [Loosely coupled]
ORGANISATION OF DATA SCIENCE TEAM

[4] Centralized
• Data scientists are members of a core group, reporting to a head of data science or analytics.
• Stronger ownership and management of resource allocation and project prioritization within a central pool.
• Analysts reside in the central group, where they serve a variety of functions and business units and work on diverse projects.
• Coordination by a central analytics unit. [Tightly coupled]
• Challenge – hard to assess and meet demands for incoming data science projects (especially in smaller teams).
ORGANISATION OF DATA SCIENCE TEAM

[5] Center of Excellence
• Better alignment of analytics initiatives and resource allocation to enterprise priorities, without operational involvement.
• Analysts are allocated to units throughout the organization, and their activities are coordinated by a central entity.
• A flexible model with the right balance of centralized and distributed coordination.
ORGANISATION OF DATA SCIENCE TEAM

[6] Federated
• Same as the "Center of Excellence" model, with need-based operational involvement to provide SME support.
• A centralized group of advanced analysts is strategically deployed to enterprise-wide initiatives.
• A flexible model with the right balance of centralized and distributed coordination.
Common Difficulties

Source – Business Broadway Survey 2018
Common Difficulties

Challenge #1 – Managing the data science application lifecycle
Tips: Treat the ML model as a cyclical process. Data scientists should keep monitoring the performance of live models and come full circle, back to the observation phase they started at. [Patterns in the data can change; without a cyclical approach, a model that works today might not work in the future.]
Ex: Personalized product recommendations in e-commerce may require inputs from newer sources.

Challenge #2 – Algorithm ethics and bias
Tips: Invest in removing bias from training data to make algorithms fairer.
[Facial recognition software trained on 75% male and 80% white data responds more accurately to white males than to females or people with other skin colors.]

https://ortec.com/en/featured-insights/3-upcoming-challenges-your-data-science-team-will-face
Building an Analytics-Driven Organization, Accenture
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html

THANK YOU

