Big Data Analytics

This document provides an overview of big data and analytics. It discusses key concepts like what big data is, the 5 V's that define it, and how it is changing analytics. It also covers important big data technologies like Hadoop, MapReduce, and NoSQL, as well as challenges, success factors, applications and example use cases of big data analytics. The goal is to help learners understand big data and its role in analytics.

BIG DATA AND ANALYTICS

LEARNING OBJECTIVES

• Learn what Big Data is and how it is changing the world of analytics
• Understand the motivation for and business drivers of Big Data
analytics
• Become familiar with the wide range of enabling technologies for
Big Data analytics
• Learn about Hadoop, MapReduce, and NoSQL as they relate to Big
Data analytics
• Understand the role of the data scientist, and the capabilities and
skills it requires, as a new analytics profession
• Compare and contrast the complementary uses of
data warehousing and Big Data
• Become familiar with the vendors of Big Data tools
and services
• Understand the need for and appreciate the
capabilities of stream analytics
• Learn about the applications of stream analytics
BIG DATA -
DEFINITION AND CONCEPTS
• Big Data means different things to people with different
backgrounds and interests
• Traditionally, “Big Data” = massive volumes of data
• E.g., volume of data at CERN, NASA, Google, …

• Where does Big Data come from?
• Everywhere! Web logs, RFID, GPS systems, sensor networks, social
networks, Internet-based text documents, Internet search indexes,
call detail records, astronomy, atmospheric science, biology,
genomics, nuclear physics, biochemical experiments, medical records,
scientific research, military surveillance, multimedia archives, …
TECHNOLOGY INSIGHTS 13.1
THE DATA SIZE IS GETTING BIG, BIGGER…

• Hadron Collider - 1 PB/sec
• Boeing jet - 20 TB/hr
• Facebook - 500 TB/day
• YouTube - 1 TB/4 min
• The proposed Square Kilometer Array telescope (proposed to be the
world’s biggest telescope) - 1 EB/day
[Table: Names for Big Data Sizes]
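
To compare the rates above on a common footing, they can be converted to terabytes per day. The following is a minimal sketch, assuming decimal units (1 PB = 1,000 TB, 1 EB = 1,000,000 TB) and assuming each source sustains its quoted rate around the clock, which is a simplification made only for comparison. The names `TB_PER_UNIT`, `tb_per_day`, and `rates` are illustrative and do not come from the slides.

```python
# Convert the quoted data rates to a common unit (TB/day).
# Assumption: decimal prefixes and a constant rate over a full 24 hours.
TB_PER_UNIT = {"TB": 1, "PB": 1_000, "EB": 1_000_000}
SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day(amount, unit, per_seconds):
    """Return TB/day for `amount` `unit` produced every `per_seconds` seconds."""
    return amount * TB_PER_UNIT[unit] * SECONDS_PER_DAY / per_seconds


rates = {
    "Hadron Collider": tb_per_day(1, "PB", 1),
    "Boeing jet": tb_per_day(20, "TB", 3600),
    "Facebook": tb_per_day(500, "TB", SECONDS_PER_DAY),
    "YouTube": tb_per_day(1, "TB", 4 * 60),
    "Square Kilometer Array": tb_per_day(1, "EB", SECONDS_PER_DAY),
}

# Print the sources from the slowest to the fastest rate.
for source, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{source:<24} {rate:>14,.0f} TB/day")
```

Under these assumptions the per-hour and per-minute figures (Boeing jet, YouTube) land in the same range as the per-day Facebook figure, while the per-second collider figure is several orders of magnitude larger.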
BIG DATA -
DEFINITION AND CONCEPTS
• Big Data is a misnomer!
• Big Data is more than just “big”
• The Vs that define Big Data
• Volume - the massive amount of data in data stores, and the concerns related to its scalability, accessibility, and
manageability
• Variety - structured and unstructured data, which may be generated either by humans or by machines
• Velocity - the measure of how fast the data is coming in
• Veracity
• Variability
• Value
• …
5 Vs of BIG DATA
A HIGH-LEVEL CONCEPTUAL ARCHITECTURE FOR BIG DATA SOLUTIONS
(by AsterData / Teradata)
[Figure: Unified Data Architecture, system conceptual view]
• Sources: ERP, SCM, CRM, images, audio and video, machine logs, text, Web and social
• Move / Manage / Access: data platform, integrated data warehouse, discovery platform, event processing
• Big Data analytic tools & apps: marketing, applications, business intelligence, data mining, math and stats, languages
• Users: marketing executives, operational systems, customers and partners, frontline workers, business analysts, data
scientists, engineers
FUNDAMENTALS OF
BIG DATA ANALYTICS
• Big Data by itself, regardless of the size, type, or speed,
is worthless
• Big Data + “big” analytics = value
• With the value proposition, Big Data also brought
about big challenges
• Effectively and efficiently capturing, storing, and analyzing
Big Data
• New breed of technologies needed (developed or purchased
or hired or outsourced …)
BIG DATA CONSIDERATIONS
• You can’t process the amount of data that you want to because
of the limitations of your current platform.
• You can’t include new/contemporary data sources (e.g., social
media, RFID, sensory, Web, GPS, textual data) because they do
not comply with the data storage schema.
• You need to (or want to) integrate data as quickly as possible to
be current on your analysis.
• You want to work with a schema-on-demand data storage
paradigm because of the variety of data types.
• The data is arriving so fast at your organization’s doorstep that
CRITICAL SUCCESS FACTORS
FOR
BIG DATA ANALYTICS
[Figure: Keys to Success with Big Data Analytics]
• A clear business need
• Strong, committed sponsorship
• Alignment between the business and IT strategy
• A fact-based decision-making culture
• A strong data infrastructure
• The right analytics tools
• Personnel with advanced analytical skills
ENABLERS OF BIG DATA
ANALYTICS
• In-memory analytics
• Storing and processing the complete data set in RAM

• In-database analytics
• Placing analytic procedures close to where data is stored

• Grid computing & MPP
• Use of many machines and processors in parallel (MPP - massively parallel
processing)

• Appliances
• Combining hardware, software, and storage in a single unit for performance and
scalability
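
The idea behind in-database analytics (placing the analytic procedure close to where the data is stored) can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `sales` table, its columns, and the sample rows are hypothetical, and SQLite stands in for whatever database an organization actually uses.

```python
import sqlite3

# Hypothetical sales table held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 60.0), ("south", 40.0)],
)

# Approach 1: pull every row out of the database, then aggregate in the client.
client_totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    client_totals[region] = client_totals.get(region, 0.0) + amount

# Approach 2 (in-database): the engine aggregates; only summary rows move.
db_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

print(client_totals)
print(db_totals)
conn.close()
```

Both approaches give the same totals. The difference is how much data leaves the database: every row in the first case, one row per region in the second.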
CHALLENGES OF BIG DATA
ANALYTICS
• Data volume
• The ability to capture, store, and process the
huge volume of data in a timely manner
• Data integration
• The ability to combine data quickly and
cost-effectively
• Processing capabilities
• The ability to process the data quickly, as it is
captured (i.e., stream analytics)
BUSINESS PROBLEMS ADDRESSED BY
BIG DATA ANALYTICS
• Process efficiency and cost reduction
• Brand management
• Revenue maximization, cross-selling/up-selling
• Enhanced customer experience
• Churn identification, customer recruiting
• Improved customer service
• Identifying new products and market opportunities
• Risk management
• Regulatory compliance
• Enhanced security capabilities
• …
APPLICATION EXAMPLE CASE 1
Moving from many old systems to a unified new system

• Before: it was difficult to identify financial exposure across many systems
(separate copies of the derivatives trade store)
• After: it was possible to analyze all contracts in a single database
(MarkLogic Server eliminates the need for 20 database copies)
BIG DATA TECHNOLOGIES
• MapReduce …
• Hadoop …
• Hive
• Pig
• HBase
• Flume
• Oozie
• Ambari
• Avro
• Mahout, Sqoop, HCatalog, …
BIG DATA TECHNOLOGIES
- MAPREDUCE
• MapReduce distributes the processing of very large multi-
structured data files across a large cluster of ordinary
machines/processors
• Goal - achieving high performance with “simple”
computers
• Developed and popularized by Google
• Good at processing and analyzing large volumes of multi-
structured data in a timely manner
• Example tasks: indexing the Web for search, graph
analysis, text analysis, machine learning, …
BIG DATA TECHNOLOGIES
- MAPREDUCE
How does MapReduce work?
[Figure: Raw Data → Map Function → Reduce Function]
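
The flow from raw data through a map function to a reduce function can be illustrated with a word count, a common teaching example for MapReduce. The sketch below runs in a single Python process and only mimics the programming model. It does not use Hadoop or any distributed framework, and the names `map_function`, `shuffle`, and `reduce_function` are chosen here for illustration.

```python
from collections import defaultdict


def map_function(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_function(key, values):
    """Reduce step: combine all values that share a key."""
    return key, sum(values)


raw_data = ["big data is big", "data needs analytics"]

# In a real cluster each document could be mapped on a different machine.
intermediate = [pair for doc in raw_data for pair in map_function(doc)]
word_counts = dict(
    reduce_function(key, values) for key, values in shuffle(intermediate).items()
)
print(word_counts)
```

Because each call to `map_function` depends only on its own record, and each call to `reduce_function` depends only on one key, both steps can in principle be spread across many ordinary machines.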


BIG DATA TECHNOLOGIES
- HADOOP
• Hadoop is an open source framework for storing and
analyzing massive amounts of distributed, unstructured data
• Originally created by Doug Cutting and Mike Cafarella; much of its early development took place at Yahoo!
• Hadoop clusters run on inexpensive commodity hardware
so projects can scale-out inexpensively
• Hadoop is now an Apache Software Foundation project
• Open source - hundreds of contributors continuously
improve the core technology
• MapReduce + Hadoop = Big Data core technology
BIG DATA TECHNOLOGIES
- HADOOP
• How Does Hadoop Work?
• Access unstructured and semi-structured data (e.g., log files,
social media feeds, other data sources)
• Break the data up into “parts,” which are then loaded into a file
system made up of multiple nodes running on commodity
hardware using HDFS
• Each “part” is replicated multiple times and loaded into the file
system for replication and failsafe processing
• A node acts as the Facilitator and another as Job Tracker
• Jobs are distributed to the clients, and once completed, the
results are collected and aggregated using MapReduce
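
The splitting and replication steps above can be sketched as a toy simulation. This is not HDFS code and does not reproduce HDFS's actual placement policy: the round-robin assignment, the node names, the part size, and the replication factor of 3 are all assumptions made for illustration.

```python
def split_into_parts(data, part_size):
    """Break the input into fixed-size parts (the last part may be shorter)."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def place_replicas(num_parts, nodes, replication=3):
    """Assign each part to `replication` distinct nodes, round-robin.

    Toy policy for illustration only; it is not HDFS's placement logic.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        part_id: [nodes[(part_id + r) % len(nodes)] for r in range(replication)]
        for part_id in range(num_parts)
    }


log_data = "x" * 1000                         # stand-in for a large log file
nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names

parts = split_into_parts(log_data, part_size=256)
placement = place_replicas(len(parts), nodes)

# Simulate losing one node: every part should still have a surviving copy.
failed = "node2"
survivors = {
    part_id: [node for node in holders if node != failed]
    for part_id, holders in placement.items()
}
print(len(parts), placement[0], all(survivors.values()))
```

Because every part lives on more than one node, the loss of a single machine leaves at least one copy of each part available, which is the failsafe property the slide refers to.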
BIG DATA TECHNOLOGIES
- HADOOP
• Hadoop Technical Components

• Hadoop Distributed File System (HDFS)
• Name Node (primary facilitator)
• Secondary Node (checkpoints the Name Node's metadata; commonly described as a backup to the Name Node)
• Job Tracker
• Slave Nodes (the grunts of any Hadoop cluster)
• Additionally, the Hadoop ecosystem is made up of a
number of complementary sub-projects: NoSQL
(Cassandra, HBase), DW (Hive), …
• NoSQL = not only SQL
BIG DATA TECHNOLOGIES
HADOOP - DEMYSTIFYING
FACTS
• Hadoop consists of multiple products
• Hadoop is open source but available from vendors too
• Hadoop is an ecosystem, not a single product
• HDFS is a file system, not a DBMS
• Hive resembles SQL but is not standard SQL
• Hadoop and MapReduce are related but not the same
• MapReduce provides control for analytics, not analytics
• Hadoop is about data diversity, not just data volume
• Hadoop complements a DW; it’s rarely a replacement
• Hadoop enables many types of analytics, not just Web analytics
DATA SCIENTIST

“The Sexiest Job of the 21st Century”
Thomas H. Davenport and D. J. Patil
Harvard Business Review, October 2012

• Data Scientist = Big Data guru
• One with skills to investigate Big Data
• Very high salaries, very high expectations
• Where do Data Scientists come from?
• M.S./Ph.D. in MIS, CS, IE,… and/or Analytics
• There is not a specific degree program for DS!
• PE, PML, … DSP (Data Science Professional)
SKILLS THAT DEFINE A DATA
SCIENTIST
[Figure: skills surrounding the data scientist role]
• Domain expertise, problem definition and decision modeling
• Data access and management (both traditional and new data systems)
• Programming, scripting and hacking
• Internet and social media/social networking technologies
• Curiosity and creativity
• Communication and interpersonal
A TYPICAL JOB POST FOR DATA SCIENTIST
APPLICATION EXAMPLE CASE 2
BIG DATA AND ANALYTICS IN POLITICS

INPUT: Data Sources
• Census data - population specifics, age, race, sex, income, etc.
• Election databases - party affiliations, previous election outcomes, trends and distributions
• Market research - polls, recent trends and movements
• Social media - Facebook, Twitter, LinkedIn, newsgroups, blogs, etc.
• Web (in general) - Web pages, posts and replies, search trends, etc.
• Other data sources

Big Data & Analytics (Data Mining, Web Mining, Text Mining, Multi-media Mining)
• Predicting outcomes and trends
• Identifying associations between events and outcomes
• Assessing and measuring the sentiments
• Profiling (clustering) groups with similar behavioral patterns
• Other knowledge nuggets

OUTPUT: Goals
• Raise money contributions
• Increase number of volunteers
• Organize movements
• Mobilize voters to get out and vote
• Other goals and objectives
• ...
BIG DATA AND DATA
WAREHOUSING
• What is the impact of Big Data on DW?
• Big Data and RDBMS do not go nicely together
• Will Hadoop replace data warehousing/RDBMS?

• Use Cases for Hadoop
• Hadoop as the repository and refinery
• Hadoop as the active archive

• Use Cases for Data Warehousing
• Data warehouse performance
• Integrating data that provides business value
• Interactive BI tools
HADOOP VERSUS DATA WAREHOUSE
WHEN TO USE WHICH PLATFORM
COEXISTENCE OF HADOOP AND
DW
1. Use Hadoop for storing and archiving multi-structured
data
2. Use Hadoop for filtering, transforming, and/or
consolidating multi-structured data
3. Use Hadoop to analyze large volumes of multi-
structured data and publish the analytical results
4. Use a relational DBMS that provides MapReduce
capabilities as an investigative computing platform
5. Use a front-end query tool to access and analyze data
COEXISTENCE OF HADOOP AND DW

Source: Teradata
BIG DATA VENDORS

• The Big Data vendor landscape is developing very rapidly
• A representative list would include
• Cloudera - cloudera.com
• MapR - mapr.com
• Hortonworks - hortonworks.com
• Also, IBM (Netezza, InfoSphere), Oracle (Exadata,
Exalogic), Microsoft, Amazon, Google, …
• Offerings span software, hardware, service, …
TOP 10 BIG DATA VENDORS
WITH PRIMARY FOCUS ON HADOOP
[Figure: bar chart; vertical axis runs from $0 to $70]
HOW TO SUCCEED WITH BIG DATA

1. Simplify
2. Coexist
3. Visualize
4. Empower
5. Integrate
6. Govern
7. Evangelize
BIG DATA AND STREAM
ANALYTICS
• Data-in-motion analytics and real-time data analytics
• One of the Vs in Big Data = Velocity
• Analytic process of extracting actionable information from
continuously flowing/streaming data
• Why Stream Analytics?
• It may not be feasible to store the data
• It may lose its value if not processed immediately
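
The point that streaming data may be impractical to store, and may lose value if not processed immediately, can be illustrated with a rolling average that keeps only the most recent readings in memory. This is a minimal sketch: `sensor_feed` is a hypothetical stand-in for a continuous source, and the window size of 3 is an arbitrary choice.

```python
from collections import deque


def rolling_average(stream, window_size):
    """Yield the mean of the latest `window_size` readings as each one arrives."""
    window = deque(maxlen=window_size)  # older readings are discarded
    total = 0.0
    for reading in stream:
        if len(window) == window.maxlen:
            total -= window[0]  # subtract the reading about to be evicted
        window.append(reading)
        total += reading
        yield total / len(window)


def sensor_feed():
    """Hypothetical stand-in for a continuous data source."""
    yield from [10.0, 12.0, 11.0, 30.0, 12.0]


# Collected into a list here only so the result can be printed; a real
# consumer would act on each value as it is produced.
averages = list(rolling_average(sensor_feed(), window_size=3))
print(averages)
```

Memory use stays bounded by the window size no matter how long the stream runs, so the full history never has to be stored.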
STREAM ANALYTICS
A USE CASE IN ENERGY INDUSTRY
[Figure: stream analytics in the energy industry]
• Data sources: sensor data (energy production system status); meteorological data (wind, light, temperature, etc.);
usage data (smart meters, smart grid devices)
• Processing: data integration and temporary staging; streaming analytics (predicting usage, production and
anomalies); permanent storage area
• Energy production system (traditional and/or renewable): capacity decisions
• Energy consumption system (residential and commercial): pricing decisions
STREAM ANALYTICS
APPLICATIONS
• e-Commerce
• Telecommunication
• Law Enforcement and Cyber Security
• Power Industry
• Financial Services
• Health Services
• Government
