DS&BDA Unit 3

The document discusses big data and the data analytics lifecycle. It covers sources of big data, key roles in analytics projects, and the six phases of the lifecycle including discovery, data preparation, model planning, model building, communicating results, and operationalizing results.

Uploaded by

Om Badgujar


Data Science & Big Data Analytics

Subject Code: 310251

T. E. Computer (2019 Pattern)

Dr. S.R.Khonde

UNIT III

Introduction to Big Data
There is no single definition; here is one from Wikipedia:

▪ Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.

▪ The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.

▪ The trend to larger data sets is due to the additional information
derivable from analysis of a single large set of related data, as
compared to separate smaller sets with the same total amount of
data.

Big Data Example
• Credit card companies monitor every purchase their customers
make and can identify fraudulent purchases with a high degree of
accuracy using rules derived by processing billions of transactions.

• Mobile phone companies analyze subscribers' calling patterns to
determine, for example, whether a caller's frequent contacts are on
a rival network. If that rival network is offering an attractive
promotion that might cause the subscriber to defect, the mobile
phone company can proactively offer the subscriber an incentive to
remain in her contract.

• For companies such as LinkedIn and Facebook, data itself is their
primary product. The valuations of these companies are heavily
derived from the data they gather and host, which contains more
and more intrinsic value as the data grows.

Sources of Big Data
The data now comes from multiple sources, such as these:

▪ Medical information, such as genomic sequencing and diagnostic imaging

▪ Photos and video footage uploaded to the World Wide Web

▪ Video surveillance, such as the thousands of video cameras spread across a city

▪ Mobile devices, which provide geospatial location data of the users, as well as
metadata about text messages, phone calls, and application usage on smartphones

▪ Smart devices, which provide sensor-based collection of information from smart
electric grids, smart buildings, and many other public and industry infrastructures

▪ Non-traditional IT devices, including the use of radio-frequency identification
(RFID) readers, GPS navigation systems, and seismic processing

[Figure: Big Data Generators]

Data Analytics Lifecycle
Data Analytics Lifecycle Overview
• Phase 1: Discovery
• Phase 2: Data Preparation
• Phase 3: Model Planning
• Phase 4: Model Building
• Phase 5: Communicate Results
• Phase 6: Operationalize

Case Study: GINA

Data Analytics Lifecycle Overview
• The data analytics lifecycle is designed for Big Data problems and
data science projects
• The lifecycle has six phases, and project work can occur in several
phases simultaneously
• The lifecycle is iterative, to reflect how a real project proceeds
• Work can return to earlier phases as new information is uncovered

Key Roles for a Successful Analytics Project
• Business User – understands the domain area
• Project Sponsor – provides requirements
• Project Manager – ensures meeting objectives
• Business Intelligence Analyst – provides business domain
expertise based on deep understanding of the data
• Database Administrator (DBA) – creates DB environment
• Data Engineer – provides technical skills, assists data management
and extraction, supports analytic sandbox
• Data Scientist – provides analytic techniques and modelling

[Figure: Overview of Data Analytics Lifecycle]

Phase 1: Discovery
1. Learning the Business Domain
2. Resources
3. Framing the Problem
4. Identifying Key Stakeholders
5. Interviewing the Analytics Sponsor
6. Developing Initial Hypotheses
7. Identifying Potential Data Sources

Phase 2: Data Preparation
• Includes steps to explore, preprocess, and condition data
• Create a robust environment – the analytics sandbox
• Data preparation tends to be the most labor-intensive
step in the analytics lifecycle
• It often takes at least 50% of the data science project’s time
• The data preparation phase is generally the most
iterative and the one that teams tend to underestimate
most often

Preparing the Analytic Sandbox
• Create the analytic sandbox (also called workspace)
• Allows team to explore data without interfering with live
production data
• Sandbox collects all kinds of data (expansive approach)
• The sandbox allows organizations to undertake ambitious
projects beyond traditional data analysis and BI to perform
advanced predictive analytics
• Although the concept of an analytics sandbox is relatively
new, this concept has become acceptable to data science teams
and IT groups

Performing ETLT
(Extract, Transform, Load, Transform)
• In ETL, users extract the data, transform it, and then load it
• In the sandbox the process is often ELT – loading early
preserves the raw data, which can be useful to examine
• Example – in credit card fraud detection, outliers can
represent high-risk transactions that might be
inadvertently filtered out or transformed before being
loaded into the database
• Hadoop is often used here
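
The ELT idea above can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the sandbox, and the table names (raw_tx, clean_tx), the sample transactions, and the cap of 1000 are all hypothetical. The point it shows is that the raw table is loaded first and left untouched, so the outlier is still available for inspection after the transformed copy has capped it.

```python
import sqlite3

# Hypothetical raw transactions; the 9500.00 purchase is the outlier.
RAW_TRANSACTIONS = [
    ("c1", 12.50), ("c1", 9.99), ("c2", 30.00), ("c2", 9500.00), ("c3", 18.25),
]

def load_raw(conn, rows):
    """Extract + Load: store every record unchanged in a raw table."""
    conn.execute("CREATE TABLE raw_tx (card_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_tx VALUES (?, ?)", rows)

def transform_in_sandbox(conn, cap):
    """Transform: build a capped copy; raw_tx itself is not modified."""
    conn.execute("CREATE TABLE clean_tx (card_id TEXT, amount REAL)")
    conn.execute(
        "INSERT INTO clean_tx SELECT card_id, MIN(amount, ?) FROM raw_tx",
        (cap,),
    )

def high_risk(conn, threshold):
    """Query the preserved raw data for unusually large transactions."""
    cur = conn.execute(
        "SELECT card_id, amount FROM raw_tx WHERE amount > ? ORDER BY amount DESC",
        (threshold,),
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_TRANSACTIONS)
transform_in_sandbox(conn, cap=1000.0)
flagged = high_risk(conn, threshold=1000.0)
max_clean = conn.execute("SELECT MAX(amount) FROM clean_tx").fetchone()[0]
```

Had the cap been applied before loading (classic ETL), the query in high_risk would return nothing, which is the risk the fraud detection example describes.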

Learning about the Data

• Becoming familiar with the data is critical

• This activity accomplishes several goals:

▪ Determines the data available to the team early in the project
▪ Highlights gaps – identifies data not currently available
▪ Identifies data outside the organization that might be useful

[Figure: Learning about the Data – Sample Dataset Inventory]

Data Conditioning
• Data conditioning includes cleaning data, normalizing
datasets, and performing transformations
• Often viewed as a preprocessing step prior to data
analysis, it might be performed by the data owner, IT
department, DBA, etc.
• Best to have data scientists involved
• Data science teams prefer too much data rather than too little

Data Conditioning
• Additional questions and considerations
• What are the data sources? Target fields?
• How clean is the data?
• How consistent are the contents and files? Missing or inconsistent
values?
• Assess the consistency of the data types – numeric, alphanumeric?
• Review the contents to ensure the data makes sense
• Look for evidence of systematic error
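
Some of these questions can be answered with a short profiling script. The sketch below is one possible approach, not a prescribed method; the records, field names, and helper functions are hypothetical. It counts missing values per field and lists rows where a field expected to be numeric holds something else.

```python
# Hypothetical records; None and "" both stand for a missing value.
ROWS = [
    {"id": "1", "age": "34", "state": "MH"},
    {"id": "2", "age": "", "state": "mh"},
    {"id": "3", "age": "forty", "state": "KA"},
    {"id": "4", "age": "29", "state": None},
]

def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or str(value).strip() == ""

def missing_counts(rows):
    """Count missing values per field across all rows."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (1 if is_missing(value) else 0)
    return counts

def non_numeric(rows, field):
    """Return ids of rows whose value in `field` is present but not a number."""
    bad = []
    for row in rows:
        value = row[field]
        if is_missing(value):
            continue
        try:
            float(value)
        except ValueError:
            bad.append(row["id"])
    return bad

report = missing_counts(ROWS)
bad_ages = non_numeric(ROWS, "age")
```

Here report shows one missing age and one missing state, and bad_ages points at the row where age was entered as text.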

Survey and Visualize
• Leverage data visualization tools to gain an
overview of the data

• Shneiderman’s mantra:
“Overview first, zoom and filter, then details-on-demand”
This enables the user to find areas of interest, zoom and filter to
find more detailed information about a particular area, then find
the detailed data in that area

Survey and Visualize Guidelines and Considerations
• Review data to ensure calculations are consistent
• Does the data distribution stay consistent?
• Assess the granularity of the data, the range of values, and the level
of aggregation of the data
• Does the data represent the population of interest?
• Check time-related variables – daily, weekly, monthly? Is this
good enough?
• Is the data standardized/normalized? Scales consistent?
• For geospatial datasets, are state/country abbreviations consistent?
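
Two of the checks above (range of values, consistent abbreviations) are easy to automate before any plotting. A minimal sketch, assuming a hypothetical sample of (state label, daily sales) pairs; the helper names are illustrative only.

```python
import statistics

# Hypothetical sample: (state label, daily sales).
SAMPLE = [
    ("MH", 120.0), ("mh", 95.0), ("Maharashtra", 110.0),
    ("KA", 80.0), ("KA", 4000.0),
]

def label_variants(rows):
    """Group labels by upper-cased form; keep groups with more than one spelling."""
    variants = {}
    for label, _ in rows:
        variants.setdefault(label.upper(), set()).add(label)
    return {key: sorted(v) for key, v in variants.items() if len(v) > 1}

def value_summary(rows):
    """Report the range and median so extreme values stand out."""
    values = [v for _, v in rows]
    return {
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
    }

variants = label_variants(SAMPLE)
summary = value_summary(SAMPLE)
```

Case folding catches "MH" versus "mh" but not "Maharashtra"; mapping full names to abbreviations would need a lookup table, which is left out here.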

Common Tools for Data Preparation
• Hadoop can perform parallel ingest and analysis
• Alpine Miner provides a graphical user interface for creating
analytic workflows
• OpenRefine (formerly Google Refine) is a free, open source tool
for working with messy data
• Similar to OpenRefine, Data Wrangler is an interactive tool for
data cleansing and transformation

Phase 3: Model Planning
• Activities to consider
▪ Assess the structure of the data – this dictates the tools and analytic
techniques for the next phase
▪ Ensure the analytic techniques enable the team to meet the business
objectives and accept or reject the working hypotheses
▪ Determine if the situation warrants a single model or a series of techniques
as part of a larger analytic workflow
▪ Research and understand how other analysts have approached this or a
similar kind of problem

Phase 3: Model Planning
Model Planning in Industry Verticals

Example of other analysts approaching a similar problem

Data Exploration and Variable Selection
• Explore the data to understand the relationships among the variables to
inform selection of the variables and methods
• A common way to do this is to use data visualization tools
• Often, stakeholders and subject matter experts have ideas about which
variables matter, for example a hypothesis that led to the project
• Aim to capture the most essential predictors and variables
This often requires iterations and testing to identify key variables
• If the team plans to run regression analysis, identify the candidate
predictors and outcome variables of the model
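
One simple way to screen candidate predictors before a regression is to rank them by the strength of their linear correlation with the outcome. The sketch below assumes that approach; the variable names (ad_spend, store_id) and all numbers are made up for illustration, and correlation is only a first filter, not a substitute for the iteration and testing mentioned above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical candidate predictors and outcome variable.
CANDIDATES = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "store_id": [3.0, 1.0, 4.0, 1.0, 5.0],
}
OUTCOME = [2.1, 3.9, 6.2, 8.1, 9.8]

def rank_predictors(candidates, outcome):
    """Sort candidates by absolute correlation with the outcome, strongest first."""
    scores = {name: pearson(vals, outcome) for name, vals in candidates.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

ranking = rank_predictors(CANDIDATES, OUTCOME)
```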

Model Selection
▪ The main goal is to choose an analytical technique, or several candidates,
based on the end goal of the project
▪ We observe events in the real world and attempt to construct models that
emulate this behavior with a set of rules and conditions
▪ A model is simply an abstraction from reality
▪ Determine whether to use techniques best suited for structured data,
unstructured data, or a hybrid approach
▪ Teams often create initial models using statistical software packages such
as R, SAS, or MATLAB, which may have limitations when applied to very
large datasets
▪ The team moves to the model building phase once it has a good idea about
the type of model to try

Common Tools for the Model Planning Phase
• R has a complete set of modelling capabilities
▪ R contains about 5000 packages for data analysis and graphical presentation
• SQL Analysis Services can perform in-database analytics of common data
mining functions, involved aggregations, and basic predictive models
• SAS/ACCESS provides integration between SAS and the analytics
sandbox via multiple data connections

Phase 4: Model Building
• Execute the models defined in Phase 3
• Develop datasets for training, testing, and production
• Develop the analytic model on training data, test on test data
• Questions to consider
▪ Does the model appear valid and accurate on the test data?
▪ Does the model output/behaviour make sense to the domain experts?
▪ Do the parameter values make sense in the context of the domain?
▪ Is the model sufficiently accurate to meet the goal?
▪ Does the model avoid intolerable mistakes?
▪ Are more data or inputs needed?
▪ Will the kind of model chosen support the runtime environment?
▪ Is a different form of the model required to address the business problem?
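
The train/test discipline described above can be shown with a deliberately tiny model. This is a sketch under assumptions: the data is synthetic, the "model" is a single threshold on one feature, and the 25% test fraction and seed are arbitrary choices. What matters is the shape of the workflow: split first, fit on the training rows only, then measure accuracy on rows the model has not seen.

```python
import random

def split(data, test_fraction, seed):
    """Shuffle a copy of the data and return (train, test)."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def fit_threshold(train):
    """Fit a one-feature classifier: midpoint between the two classes."""
    negatives = [x for x, label in train if label == 0]
    positives = [x for x, label in train if label == 1]
    return (max(negatives) + min(positives)) / 2

def accuracy(threshold, rows):
    """Fraction of rows where the thresholded prediction equals the label."""
    hits = sum(1 for x, label in rows if (1 if x > threshold else 0) == label)
    return hits / len(rows)

# Synthetic, well-separated data: small values are class 0, large are class 1.
DATA = [(float(x), 0) for x in range(1, 51, 5)] + \
       [(float(x), 1) for x in range(200, 250, 5)]

train, test = split(DATA, test_fraction=0.25, seed=0)
threshold = fit_threshold(train)
test_accuracy = accuracy(threshold, test)
```

On real data the classes overlap, so test accuracy falls below 1.0 and the questions on the slide (is it accurate enough, are the mistakes tolerable) become the deciding factors.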

Common Tools for the Model Building Phase
• Commercial Tools
▪ SAS Enterprise Miner – built for enterprise-level computing and analytics
▪ SPSS Modeler (IBM) – provides enterprise-level computing and analytics
▪ MATLAB – high-level language for data analytics, algorithms, data exploration
▪ Alpine Miner – provides GUI frontend for backend analytics tools
▪ STATISTICA and MATHEMATICA – popular data mining and analytics tools

• Free or Open Source Tools
▪ R and PL/R – PL/R is a procedural language for PostgreSQL with R
▪ Octave – language for computational modeling
▪ WEKA – data mining software package with analytic workbench
▪ Python – language providing toolkits for machine learning and analysis
▪ SQL – in-database implementations provide an alternative tool

Phase 5: Communicate Results
• Determine if the team succeeded or failed in its objectives
• Assess if the results are statistically significant and valid
▪ If so, identify aspects of the results that present salient findings
▪ Identify surprising results and those in line with the hypotheses
• Communicate and document the key findings and major
insights derived from the analysis
▪ This is the most visible portion of the process to the outside stakeholders
and sponsors
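
As one illustration of checking statistical significance, the sketch below runs a two-sided two-proportion z-test. The choice of test and the counts (120 of 1000 versus 90 of 1000 conversions) are assumptions made for the example; the right test depends on the data and the hypothesis at hand.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference of two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Standard normal CDF expressed with the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical pilot: group A saw the new model's offers, group B did not.
z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
significant = p_value < 0.05  # 0.05 is a conventional cut-off, not a rule
```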

Phase 6: Operationalize
▪ In this last phase, the team communicates the benefits of the project
more broadly and sets up a pilot project to deploy the work in a
controlled way
▪ Risk is managed effectively by undertaking a small-scope pilot
deployment before a wide-scale rollout
▪ During the pilot project, the team may need to execute the algorithm
more efficiently in the database rather than with in-memory tools
like R, especially with larger datasets
▪ To test the model in a live setting, consider running the model in a
production environment for a discrete set of products or a single line
of business
▪ Monitor model accuracy and retrain the model if necessary
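
Monitoring can be as simple as tracking accuracy over the most recent predictions and flagging when it drops. A minimal sketch; the class name, the window size of 5, and the 0.8 accuracy floor are hypothetical choices for illustration, not recommended values.

```python
from collections import deque

class AccuracyMonitor:
    """Track accuracy over a sliding window of recent predictions."""

    def __init__(self, window, min_accuracy):
        self.outcomes = deque(maxlen=window)  # True for each correct prediction
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Only judge once the window is full, to avoid reacting to noise.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy

monitor = AccuracyMonitor(window=5, min_accuracy=0.8)
for predicted, actual in [(1, 1), (0, 0), (1, 1), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)
```

After these five observations the windowed accuracy is 0.6, below the floor, so needs_retraining() reports that the model should be revisited.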

Phase 6: Operationalize
Four main deliverables
Although the various roles represent many interests, the interests
overlap and can be met with four main deliverables
1. Presentation for project sponsors – high-level takeaways for executive
level stakeholders
2. Presentation for analysts – describes business process changes and
reporting changes, includes details and technical graphs
3. Code for technical people
4. Technical specifications for implementing the code

Case Study: Global Innovation Network and Analysis (GINA)
In 2012 EMC’s new director wanted to improve the
company’s engagement of employees across the global
centers of excellence (GCE) to drive innovation, research,
and university partnerships

This project was created to:

• Store formal and informal data
• Track research from global technologists
• Mine the data for patterns and insights to improve the team’s
operations and strategy

Phase 1: Discovery
• Team members and roles
• Business user, project sponsor, project manager – Vice President
from Office of CTO
• BI analyst – person from IT
• Data engineer and DBA – people from IT
• Data scientist – distinguished engineer

Phase 1: Discovery
• The data fell into two categories
▪ Five years of idea submissions from internal innovation contests
▪ Minutes and notes representing innovation and research activity from around the
world

• Hypotheses grouped into two categories
▪ Descriptive analytics of what is happening to spark further creativity,
collaboration, and asset generation
▪ Predictive analytics to advise executive management of where it should be
investing in the future

Phase 2: Data Preparation
• Set up an analytics sandbox
• Discovered that certain data needed conditioning and
normalization and that missing datasets were critical
• Team recognized that poor quality data could impact
subsequent steps
• They discovered that many names were misspelled or contained
extra spaces
• These seemingly small problems had to be addressed
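
The kind of cleanup described here (extra spaces, misspelled names) can be sketched with the standard library. This is an illustration only, not the GINA team's actual procedure; the names, the reference list, and the 0.85 similarity cutoff are hypothetical.

```python
import difflib

def normalize_spacing(name):
    """Strip leading/trailing whitespace and collapse internal runs of spaces."""
    return " ".join(name.split())

def canonicalize(name, known_names, cutoff=0.85):
    """Map a name to its closest known spelling, or keep it if none is close."""
    cleaned = normalize_spacing(name)
    match = difflib.get_close_matches(cleaned, known_names, n=1, cutoff=cutoff)
    return match[0] if match else cleaned

# Hypothetical reference list and raw entries.
KNOWN = ["Asha Kulkarni", "Brian Murphy"]
RAW = ["  Asha   Kulkarni ", "Brian Murphey", "Chen Wei"]

cleaned = [canonicalize(name, KNOWN) for name in RAW]
```

A cutoff that is too low merges different people into one name, so matches near the cutoff are worth reviewing by hand.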

Phase 3: Model Planning
The study included the following considerations

• Identify the right milestones to achieve the goals
• Trace how people move ideas from each milestone toward
the goal
• Track ideas that die and others that reach the goal
• Compare times and outcomes using a few different methods
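
Tracing ideas across milestones amounts to a funnel count. The sketch below assumes each idea is recorded with the furthest milestone it reached; the milestone names and idea records are invented for illustration and are not taken from the GINA study.

```python
# Hypothetical ordered milestones and the furthest one each idea reached.
MILESTONES = ["submitted", "reviewed", "funded", "delivered"]
IDEAS = {
    "idea-1": "delivered",
    "idea-2": "reviewed",
    "idea-3": "submitted",
    "idea-4": "funded",
}

def funnel(ideas, milestones):
    """Count how many ideas reached each milestone or went beyond it."""
    order = {name: position for position, name in enumerate(milestones)}
    return {
        name: sum(1 for last in ideas.values() if order[last] >= order[name])
        for name in milestones
    }

def stalled(ideas, milestones):
    """Ideas that never reached the final milestone."""
    return sorted(idea for idea, last in ideas.items() if last != milestones[-1])

counts = funnel(IDEAS, MILESTONES)
dead_ends = stalled(IDEAS, MILESTONES)
```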

Phase 4: Model Building
• Several analytic methods were employed
• NLP on textual descriptions
• Social network analysis using R and RStudio
• Developed social graphs and visualizations
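
The case study used R and RStudio for social network analysis. To show the underlying idea in the same language as the other sketches, the example below computes degree centrality in plain Python instead. The people and collaboration links are hypothetical, and degree centrality is only one of several measures such an analysis might use.

```python
from collections import defaultdict

# Hypothetical collaboration links between idea submitters.
EDGES = [
    ("ana", "ben"), ("ana", "cara"), ("ana", "dev"),
    ("ben", "cara"), ("dev", "eli"),
]

def degree_centrality(edges):
    """Share of the other nodes that each node is directly connected to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}

centrality = degree_centrality(EDGES)
most_connected = max(centrality, key=centrality.get)
```

Highly connected people in such a graph are candidates for the "hidden innovators" the study set out to find.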

Phase 5: Communicate Results
• Study was successful in identifying hidden innovators
• Found high density of innovators in Cork, Ireland
• The CTO office launched longitudinal studies

Phase 6: Operationalize
• Deployment was not really discussed

• Key findings
▪ Need more data in future
▪ Some data were sensitive
▪ A parallel initiative needs to be created to improve basic BI
activities
▪ A mechanism is needed to continually reevaluate the model after
deployment

END of UNIT III
