OC - Module 1 - Intro To BDA 021312

Uploaded by

Lakshmi Devi

Module 1 – Introduction to Big Data Analytics

Module 1: Introduction to Big Data Analytics

Upon completion of this module, you should be able to:


• Define big data
• Identify four business drivers for advanced analytics
• Distinguish the techniques for Business Intelligence from Data
Science
• Describe the role of the Data Scientist within the new big data
ecosystem
• Cite at least three illustrative examples of big data opportunities

Module 1: Introduction to Big Data Analytics

Lesson 1: Big Data Overview


During this lesson the following topics are covered:
• Definition of big data
• Big data characteristics and considerations
• Unstructured data fueling big data analytics
• Analyst perspective on Data Repositories

Introduction to Big Data Analytics

Your Thoughts?

What is Big Data?

What makes data “Big Data”?

Big Data Defined
• “Big Data” is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.
  - Requires new data architectures, analytic sandboxes
  - New tools
  - New analytical methods
  - Integrating multiple skills into new role of data scientist
• Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real-time capabilities

Source: McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity

Key Characteristics of Big Data
1. Data Volume
  - 44x increase from 2010 to 2020 (1.2 zettabytes to 35.2 zettabytes)
2. Processing Complexity
  - Changing data structures
  - Use cases warranting additional transformations and analytical techniques
3. Data Structure
  - Greater variety of data structures to mine and analyze

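
As a sanity check on the data volume figures, the growth they imply can be worked out directly. The sketch below uses only the two endpoints quoted on the slide. Those endpoints imply a factor of roughly 29x rather than 44x, and a 44x factor would correspond to a starting volume near 0.8 zettabytes. The slide does not say which baseline it intended, so treat this as arithmetic on the stated numbers, not a correction.

```python
# Figures quoted on the slide, in zettabytes.
volume_2010_zb = 1.2
volume_2020_zb = 35.2
years = 2020 - 2010

# Growth factor implied by the two quoted endpoints.
growth_factor = volume_2020_zb / volume_2010_zb

# Compound annual growth rate implied by those endpoints.
annual_growth = growth_factor ** (1 / years) - 1

# Starting volume that a 44x factor would imply for the same 2020 figure.
baseline_for_44x_zb = volume_2020_zb / 44

print(f"Implied growth factor: {growth_factor:.1f}x")
print(f"Implied annual growth: {annual_growth:.1%}")
print(f"Baseline implied by 44x: {baseline_for_44x_zb:.2f} ZB")
```
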
Big Data Characteristics: Data Structures
Data Growth is Increasingly Unstructured

(Listed from more structured to less structured)

Structured
• Data containing a defined data type, format, structure
• Example: Transaction data and OLAP

Semi-Structured
• Textual data files with a discernible pattern, enabling parsing
• Example: XML data files that are self-describing and defined by an XML schema

“Quasi” Structured
• Textual data with erratic data formats, can be formatted with effort, tools, and time
• Example: Web clickstream data that may contain some inconsistencies in data values and formats

Unstructured
• Data that has no inherent structure and is usually stored as different types of files
• Example: Text documents, PDFs, images and video

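
To make "a discernible pattern, enabling parsing" concrete, here is a minimal sketch that turns a small self-describing XML document into structured rows using Python's standard library. The element names, attribute names, and values are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: tags describe the data they contain.
xml_text = """
<transactions>
  <transaction id="1001">
    <customer>Alice</customer>
    <amount currency="USD">25.50</amount>
  </transaction>
  <transaction id="1002">
    <customer>Bob</customer>
    <amount currency="USD">99.00</amount>
  </transaction>
</transactions>
""".strip()

root = ET.fromstring(xml_text)

# Flatten each <transaction> element into a typed, table-like record.
rows = []
for tx in root.findall("transaction"):
    amount = tx.find("amount")
    rows.append({
        "id": int(tx.get("id")),
        "customer": tx.findtext("customer"),
        "amount": float(amount.text),
        "currency": amount.get("currency"),
    })

for row in rows:
    print(row)
```

Once flattened this way, the records have a defined type and format, which is what the slide calls structured data.
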
Four Main Types of Data Structures
• Structured Data
• Semi-Structured Data (example: View → Source on a web page)
• Quasi-Structured Data (example: a search URL)
  http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651
• Unstructured Data (example: The Red Wheelbarrow, by William Carlos Williams)

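
A search URL like the one on this slide can be "formatted with effort, tools, and time". The sketch below is one possible way to do that with Python's standard library. It uses a shortened copy of the slide's URL (host written as www.google.com, only a few of the parameters kept), and the interpretation of each parameter name is not documented here, so the output is just raw key/value pairs.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the slide's URL; the parameters sit after the '#'.
url = ("http://www.google.com/"
       "#hl=en&q=data+scientist&pq=big+data&oq=data+sci&gs_sm=&biw=1382&bih=651")

# Everything after '#' is the URL fragment.
fragment = urlsplit(url).fragment

# Split the fragment into key/value pairs, keeping parameters with empty values.
params = parse_qs(fragment, keep_blank_values=True)

for key, values in params.items():
    print(key, values)
```

Empty values such as `gs_sm=` are the kind of inconsistency that makes clickstream data quasi-structured rather than structured.
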
Data Repositories, An Analyst Perspective
Data Islands (“Spreadmarts”): Isolated data marts
• Spreadsheets and low-volume DBs for recordkeeping
• Analyst dependent on data extracts

Data Warehouses: Centralized data containers in a purpose-built space
• Supports BI and reporting, but restricts robust analyses
• Analyst dependent on IT & DBAs for data access and schema changes
• Analysts must spend significant time to get extracts from multiple sources

Analytic Sandbox: Data assets gathered from multiple sources and technologies for analysis
• Enables high performance analytics using in-db processing
• Reduces costs associated with data replication into "shadow" file systems
• “Analyst-owned” rather than “DBA owned”

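
The contrast between working from extracts and in-db processing can be sketched with an in-memory SQLite database. This is only an illustration of the idea: the table, column names, and values are hypothetical, and a real analytic sandbox would typically sit on a much larger database platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 20.0), (1, 35.0), (2, 10.0), (2, 5.0), (3, 100.0)],
)

# Extract-based approach: copy every row out, then aggregate on the client.
extract = conn.execute("SELECT customer_id, amount FROM transactions").fetchall()
totals_from_extract = {}
for customer_id, amount in extract:
    totals_from_extract[customer_id] = totals_from_extract.get(customer_id, 0.0) + amount

# In-database approach: the database aggregates and returns only the result.
totals_in_db = dict(
    conn.execute(
        "SELECT customer_id, SUM(amount) FROM transactions GROUP BY customer_id"
    )
)

print(totals_from_extract)
print(totals_in_db)
conn.close()
```

Both approaches give the same totals, but the second moves one row per customer instead of one row per transaction.
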
Introduction to Big Data Analytics: Mini-Case Study
Yoyodyne Bank Scenario
• Evolving from small community bank to a global bank
• Needs to move away from its legacy mainframes to an environment that
supports more robust analytics
• Growing through mergers and acquisitions
• Subject to many new regulatory requirements
• Increasing customer base and increased product offerings
Your Thoughts?

Discussion Questions
1. Discuss how the bank’s data would change under these circumstances.
2. How are their needs changing with these business changes?
3. What do you need to consider from an analyst point of view? What are
some things to consider implementing as the bank grows?

Module 1: Introduction to Big Data Analytics
Lesson 1: Summary
During this lesson the following topics were covered:
• Definition of big data
• Big data characteristics and considerations
• Unstructured data fueling big data analytics
• Analyst perspective on Data Repositories

Module 1: Introduction to Big Data Analytics

Lesson 2: State of the Practice in Analytics

During this lesson the following topics are covered:


• Business drivers for analytics
• Current analytical architecture
• Business intelligence vs. data science
• Drivers of big data and new big data ecosystem

Business Drivers for Analytics
Current Business Problems Provide Opportunities for Organizations to Become More Analytical & Data Driven

    Driver                                          Examples
1.  Desire to optimize business operations          Sales, pricing, profitability, efficiency
2.  Desire to identify business risk                Customer churn, fraud, default
3.  Predict new business opportunities              Upsell, cross-sell, best new customer prospects
4.  Comply with laws or regulatory requirements     Anti-Money Laundering, Fair Lending, Basel II

Analytical Approaches for Meeting Business Drivers
Business Intelligence vs. Data Science

[Chart: Business Value (Low to High) versus Time (Past to Future), with regions labeled Business Intelligence and Data Science]

Predictive Analytics & Data Mining (Data Science)
Typical Techniques & Data Types
• Optimization, predictive modeling, forecasting, statistical analysis
• Structured/unstructured data, many types of sources, very large data sets
Common Questions
• What if…?
• What’s the optimal scenario for our business?
• What will happen next? What if these trends continue? Why is this happening?

Business Intelligence
Typical Techniques & Data Types
• Standard and ad hoc reporting, dashboards, alerts, queries, details on demand
• Structured data, traditional sources, manageable data sets
Common Questions
• What happened last quarter?
• How many did we sell?
• Where is the problem? In which situations?

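
The difference between the two kinds of question can be shown on a toy series. In this sketch the quarterly sales figures are hypothetical, and a straight-line trend fitted by least squares stands in for the far richer predictive modeling the slide has in mind.

```python
# Hypothetical units sold in four consecutive quarters.
quarterly_units = [120, 135, 150, 160]

# Business Intelligence style: report what already happened.
units_last_quarter = quarterly_units[-1]
units_total = sum(quarterly_units)

# Data Science style: "What if these trends continue?"
# Fit a least-squares line y = intercept + slope * x to the history.
n = len(quarterly_units)
xs = list(range(n))
mean_x = sum(xs) / n
mean_y = sum(quarterly_units) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, quarterly_units))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

# Extrapolate one quarter ahead.
forecast_next_quarter = intercept + slope * n

print("Sold last quarter:", units_last_quarter)
print("Sold in total:", units_total)
print("Forecast for next quarter:", round(forecast_next_quarter, 1))
```
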
A Typical Analytical Architecture
[Diagram of a typical analytical architecture, with numbered callouts 1 to 4. Recoverable labels: Data Sources; Departmental Warehouse; “Spread Marts”; Enterprise Applications; Reporting; Siloed Analytics; Prioritized Operational Processes; Non-Prioritized Data Provisioning; Non-Agile Models; Static schemas accrete over time; Errant data & marts]

Implications of Typical Architecture for Data Science

• High-value data is hard to reach and leverage
• Predictive analytics & data mining activities are last in line for data
  - Queued after prioritized operational processes
• Data is moving in batches from EDW to local analytical tools
  - In-memory analytics (such as R, SAS, SPSS, Excel)
  - Sampling can skew model accuracy
• Isolated, ad hoc analytic projects, rather than centrally-managed harnessing of analytics
  - Non-standardized initiatives
  - Frequently, not aligned with corporate business goals

Result: Slow “time-to-insight” & reduced business impact

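
One way to see why sampling can skew model accuracy is to look at rare outcomes, such as defaults or fraud. The sketch below assumes independent draws and uses hypothetical figures (a 2% rare-outcome rate, a 50-record extract, a 10,000-record full data set). It is a simplified binomial calculation, not a statement about any particular tool.

```python
import math

rare_rate = 0.02        # hypothetical share of records with the rare outcome
sample_size = 50        # hypothetical size of a local extract
full_size = 10_000      # hypothetical size of the full data set


def standard_error(rate, n):
    """Standard error of a proportion estimated from n independent records."""
    return math.sqrt(rate * (1 - rate) / n)


# Chance that the small sample contains no rare cases at all.
prob_no_rare_cases = (1 - rare_rate) ** sample_size

# Uncertainty of the estimated rate on the sample versus the full data.
se_sample = standard_error(rare_rate, sample_size)
se_full = standard_error(rare_rate, full_size)

print(f"P(no rare cases in sample): {prob_no_rare_cases:.2f}")
print(f"Standard error on sample:   {se_sample:.4f}")
print(f"Standard error on all data: {se_full:.4f}")
```

Under these assumptions the small extract misses the rare outcome entirely more than a third of the time, and the uncertainty of its estimate is about as large as the rate itself.
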
Opportunities for a New Approach to Analytics

New Applications Driving Data Volume

[Chart: volume of information, small to large, by decade]
• 1990s (RDBMS & data warehouse): measured in terabytes (1 TB = 1,000 GB)
• 2000s (content & digital asset management): measured in petabytes (1 PB = 1,000 TB)
• 2010s (NoSQL & key/value): will be measured in exabytes (1 EB = 1,000 PB)

Opportunities for a New Approach to Analytics
Big Data Ecosystem

[Diagram of the big data ecosystem with four numbered groups]
1. Data Devices
2. Data Collectors
3. Data Aggregators
4. Data Users/Buyers

Other labels in the diagram: Individual, Analytic Services, Medical, Information Brokers, Advertising, Marketers, Employers, Law Enforcement, Government, Internet, Websites, Catalog Co-Ops, Phone/TV, Retail, Media, Media Archives, Credit Bureaus, List Brokers, Financial, Banks, Delivery Service, Private Investigators/Lawyers

Considerations for Big Data Analytics

Criteria for Big Data Projects
1. Speed of decision making
2. Throughput
3. Analysis flexibility

New Analytic Architecture: Analytic Sandbox
Data assets gathered from multiple sources and technologies for analysis
• Enables high performance analytics using in-db processing
• Reduces costs associated with data replication into "shadow" file systems
• “Analyst-owned” rather than “DBA owned”

State of the Practice in Analytics: Mini-Case Study
Big Data Enabled Loan Processing at Yoyodyne

[Chart: underwriting risk level under Traditional Underwriting versus Big Data Enabled Underwriting. Traditional data leveraged: Income Verification, Employment History, Credit Scoring and History, Appraisal. A second panel is labeled Big Data Leveraged.]

Your Thoughts?

Module 1: Introduction to Big Data Analytics
Lesson 2: Summary
During this lesson the following topics were covered:
• Business drivers for analytics
• Current analytical architecture
• Business intelligence vs. data science
• Drivers of big data and new big data ecosystem

Module 1: Introduction to Big Data Analytics
Lesson 3: The Data Scientist
During this lesson the following topics are covered:
• Key Roles of the New Big Data Ecosystem
• Profile of a Data Scientist

Skills Needed In the New Data Ecosystem

Your Thoughts?

• What new skill sets do you need to take advantage of the big data sets in the loan processing improvement case study?
• Do most large organizations have people with these skill sets?
• If so, who are they?

Three Key Roles of the New Data Ecosystem

Role: Deep Analytical Talent (Data Scientists)
Projected U.S. talent gap: 140,000 to 190,000
Description: People with advanced training in quantitative disciplines, such as mathematics, statistics, and machine learning.

Role: Data Savvy Professionals (Analysts & Data Savvy Managers)
Projected U.S. talent gap: 1.5 million
Description: People with a basic knowledge of statistics and/or machine learning, who can define key questions that can be answered using advanced analytics.

Role: Technology & Data Enablers
Description: People providing technical expertise to support analytical projects. Skill sets include computer programming and database administration.

Note: Figures above reflect a projected talent gap in US in 2018, as shown in McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity

Roles Needed for Analytical Projects
Data Scientist Key Activities
• Reframe business challenges as analytics challenges
• Design, implement and deploy statistical models and data mining techniques on big data
• Create insights that lead to actionable recommendations

[Diagram: roles (Data Scientists, Data Engineers, Data Analyst, BI Analyst, LOB User, Data Platform Admin) shown against layers labeled Analytic Productivity Platform, Tools & Services, Data Access & Query, and Cloud Infrastructure]

Profile of a Data Scientist

• Quantitative
• Technical
• Skeptical
• Curious & Creative
• Communicative & Collaborative

Module 1: Introduction to Big Data Analytics
Lesson 3: Summary
During this lesson the following topics were covered:
• Key Roles of the New Big Data Ecosystem
• Profile of a Data Scientist

Module 1: Introduction to Big Data Analytics
Lesson 4: Big Data Analytics in Industry Verticals
During this lesson we cover the following representative examples:
• Health Care
• Public Services
• Life Sciences
• IT Infrastructure
• Online Services

Big Data Analytics: Industry Examples

1. Health Care: Reducing Cost of Care
2. Public Services: Preventing Pandemics
3. Life Sciences: Genomic Mapping
4. IT Infrastructure: Unstructured Data Analysis
5. Online Services: Social Media for Professionals

1. Big Data Analytics: Healthcare

Situation
• Poor police response and problems with medical care, triggered by the shooting of a Rutgers student
• The event drove a local doctor to map crime data and examine local health care

Use of Big Data
• Dr. Jeffrey Brenner generated his own crime maps from medical billing records of 3 hospitals

Key Outcomes
• City hospitals & ERs provided expensive, low-quality care
• Reduced hospital costs by 56% by realizing that 80% of the city’s medical costs came from 13% of its residents, mainly low-income or elderly
• Now offers preventative care over the phone or through home visits

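
The finding that a small share of residents accounts for most of the cost is a concentration calculation. Here is a minimal sketch of how such a share could be computed from per-patient billing totals. The function name and the cost figures are hypothetical and are not taken from the case described above.

```python
def share_of_cost_from_top(costs, top_fraction):
    """Return the share of total cost that comes from the costliest patients.

    costs: per-patient cost totals.
    top_fraction: fraction of patients to include, for example 0.10 for the top 10%.
    """
    ranked = sorted(costs, reverse=True)
    top_count = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_count]) / sum(ranked)


# Hypothetical billing totals for 20 patients: a few very costly, most inexpensive.
patient_costs = [50_000] * 2 + [10_000] * 3 + [1_000] * 15

top_10_share = share_of_cost_from_top(patient_costs, 0.10)
top_25_share = share_of_cost_from_top(patient_costs, 0.25)

print(f"Top 10% of patients: {top_10_share:.0%} of cost")
print(f"Top 25% of patients: {top_25_share:.0%} of cost")
```
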
2. Big Data Analytics: Public Services

Situation
• Threat of global pandemics has increased exponentially
• Pandemics spread at faster rates and are more resistant to antibiotics

Use of Big Data
• Created a network of viral listening posts
• Combines data from viral discovery in the field, research in disease hotspots, and social media trends
• Using Big Data to make accurate predictions on the spread of new pandemics

Key Outcomes
• Identified a fifth form of human malaria, including its origin
• Identified why efforts failed to control swine flu
• Proposing more proactive approaches to preventing outbreaks

3. Big Data Analytics: Life Sciences

Situation
• Broad Institute (MIT & Harvard) mapping the Human Genome

Use of Big Data
• In 13 yrs, mapped 3 billion genetic base pairs; 8 petabytes
• Developed 30+ software packages, now shared publicly, along with the genomic data

Key Outcomes
• Using genetic mappings to identify cellular mutations causing cancer and other serious diseases
• Innovating how genomic research informs new pharmaceutical drugs

4. Big Data Analytics: IT Infrastructure

Situation
• Explosion of unstructured data required new technology to analyze it quickly and efficiently

Use of Big Data
• Doug Cutting created Hadoop to divide large processing tasks into smaller tasks across many computers
• Analyzes social media data generated by hundreds of thousands of users

Key Outcomes
• New York Times used Hadoop to transform its entire public archive, from 1851 to 1922, into 11 million PDF files in 24 hrs
• Applications include social media, sentiment analysis, wartime chatter, and natural language processing

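
The idea of dividing a large processing task into smaller tasks can be sketched as a map step over chunks of data followed by a reduce step that merges the partial results. This is a single-process illustration in plain Python and does not use Hadoop's API. The sample documents and function names are hypothetical, and on a real cluster each chunk would be handled by a different machine.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine the partial counts from two chunks."""
    return left + right


# Hypothetical input, split into chunks that could be processed independently.
documents = ["big data big insights", "data science", "big data analytics"]
chunks = [documents[0:2], documents[2:3]]

partial_counts = [map_chunk(chunk) for chunk in chunks]
word_totals = reduce(merge_counts, partial_counts, Counter())

print(word_totals.most_common())
```

Because each chunk is counted without reference to the others, the map step is the part that can be spread across many computers.
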
5. Big Data Analytics: Online Services

Situation
• Opportunity to create social media space for professionals

Use of Big Data
• Collects and analyzes data from over 100 million users
• Adding 1 million new users per week

Key Outcomes
• LinkedIn Skills, InMaps, Job Recommendations, Recruiting
• Established a diverse data scientist group, as founder believes this is the start of the Big Data revolution

Module 1: Introduction to Big Data Analytics
Lesson 4: Summary
During this lesson the following representative examples were
covered:
• Health Care
• Public Services
• Life Sciences
• IT Infrastructure
• Online Services

Check Your Knowledge
Your Thoughts?
1. What are the 3 characteristics of Big Data, and the main considerations in processing Big Data?
2. What is an analytic sandbox?
3. Explain the difference between Business Intelligence and Data Science.
4. Describe the challenges of the current analytical architecture for Data Scientists.
5. What are the key skill sets and behavioral characteristics of a Data Scientist?

Module 1: Summary

Key points covered in this module:


• Big data was defined
• Four business drivers for advanced analytics were identified
• The techniques for Business Intelligence were distinguished from
those of Data Science
• The role of the Data Scientist within the new big data ecosystem
was described
• Multiple illustrative examples of big data opportunities were cited
