Big Data Spectrum
Introduction
What is Big Data?
Today we live in a digital world. With increased digitization, the amount of structured and unstructured data being created and stored is exploding. The data is generated from various sources - transactions, social media, sensors, digital images, video, audio and clickstreams - for domains including healthcare, retail, energy and utilities. In addition to businesses and organizations, individuals contribute to the data volume. For instance, 30 billion pieces of content are shared on Facebook every month, and the photos viewed on Picasa every 16 seconds could cover a football field.
It gets more interesting. IDC terms this the Digital Universe and predicts that it is set to explode to an unimaginable 8 zettabytes by the year 2015. That would roughly be a stack of DVDs reaching from Earth all the way to Mars. The term Big Data was coined to address the storage and processing of this massive volume of data.
Figure: Big Data, characterized by volume, velocity and variety, is generated from sources such as social data, transaction data, location/geo data, media, clickstream and sensor data.
The volume, variety and velocity of Big Data cause performance problems when the data is created, managed and analyzed using conventional data processing techniques. Using conventional techniques for Big Data storage and analysis is less efficient, as data access becomes slow at this scale. Data collection is also challenging, as the volume and variety of data have to be drawn from sources of different types. The other major challenge with existing techniques is that they require high-end hardware to handle data of such volume, velocity and variety.
Big Data is a relatively new phenomenon. As with any new technology, the adoption of Big Data depends on the tangible benefits it provides to the business. Large data sets that might be dismissed as information overload are invariably treasure troves of business insight. These data sets have immense value: they can improve business forecasts, help in decision making and shape business strategies against competitors. For instance, Facebook, blog and Twitter data gives insight into current business trends.
Figure: The Big Data value chain, from storage and the growth of information assets through data mining, reports, aggregated intelligence, forecasting and real-time analytics, to faster business decisions and innovative business value.
These data sets are beyond human capacity to analyze manually. Big Data tools can run ad-hoc queries against large data sets quickly and with reasonable performance. In the retail domain, for instance, understanding what makes a buyer look at a product online, or analyzing the sentiment toward a product on Facebook, Twitter and blogs, is of great value to the business and enables it to improve its services for customers.
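As a simple illustration of the kind of sentiment analysis described above, the following Python sketch scores posts that mention a product against a small hand-built lexicon. The lexicon, sample posts and scoring rule are illustrative assumptions, not part of any specific production solution.

import re
from collections import Counter

# Hypothetical sentiment lexicon for illustration only
POSITIVE = {"great", "love", "excellent", "reliable", "recommend"}
NEGATIVE = {"poor", "hate", "broken", "slow", "disappointed"}

def score_post(text):
    """Crude sentiment score: +1 per positive word, -1 per negative word."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def summarize(posts, product):
    """Tally positive / negative / neutral posts that mention the given product."""
    tally = Counter()
    for post in posts:
        if product.lower() not in post.lower():
            continue
        s = score_post(post)
        tally["positive" if s > 0 else "negative" if s < 0 else "neutral"] += 1
    return tally

sample = ["Love the new X200 blender, great build quality",
          "X200 arrived broken, very disappointed",
          "Thinking about buying the X200"]
print(summarize(sample, "X200"))  # Counter({'positive': 1, 'negative': 1, 'neutral': 1})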
Big Data analysis enables executives to get the relevant data in less time for making decisions. Big Data can pave the way for fraud analysis, customer segmentation based on store behavior analysis, and loyalty programs that identify and target customers. This enables innovative analysis, which in turn changes the way we think about data.
Adopting Big Data also brings its own challenges: data security, integrating various technologies, catering to the real-time flow of data, and leveraging cloud computing.
To dive deep into Big Data technology with the goal of a quick, managed and quality implementation, a set of enablers was designed by the architects at Infosys. The section on Adoption Enablers gives insight into these enablers.
The sections are interleaved with viewpoints from Phil Shelley, CTO, Sears Holdings Corporation, Doug Cutting, co-founder of Apache Hadoop (popularly known as the father of Big Data), and Kris Gopalakrishnan, Co-Chairman, Infosys Ltd.
Phil Shelley
CTO, Sears Holdings Corporation
Phil, Sears is one of the early adopters of Big Data. What are the sweet spot use cases in the retail industry?
Transactional data such as POS, web-based activity, loyalty-based activity, product push, seasonality, weather patterns and major trends that affect retail - there is business value that can be mined from all of this data. When you add the data in the social space, the sheer amount of data is way beyond what traditional database solutions can handle. That is where Hadoop plays a role, to capture and keep this data at the finest level of detail.
What kind of challenges are you facing in implementing Big Data solutions?
Hadoop is relatively low cost to implement. However, to get started, you still need
some kind of business case. It is a good idea to start small and have a very specific use
case in mind. At the same time, a picture of where big data will be valuable long term
is important as well. Focusing on a key use case that can demonstrate business value
is probably the way to start, for any company a big-bang approach is not something
I would recommend.
How do you handle Data Privacy and Security issues in Big Data management?
Personally I would not put very sensitive data on a public cloud because the risk of
exposure could be catastrophic to the company. A private cloud which is co-located
with my data center or, a virtual private cloud that is physically caged, are approaches
I would recommend. Out of the box Hadoop has security limitations. You have to
explicitly design your data for security. There are ways of securing credit card and
personal information. I would recommend that anyone looking to secure such data seek some help on how to structure a Big Data solution and not expose themselves. This is an area that is somewhat new and prone to lax security.
Phil, one last question: as the CTO of Sears, how do you see the connection between a real-time Digital Enterprise and Big Data?
Today Hadoop can have near real-time copies of transactional data and near real-time batch reporting with as little as minutes of latency. You can process the data near real time and then access it from a data mart in real time, which creates many possibilities. But it is really going to be an ecosystem of the right tools for the right jobs.
Challenges
One of the significant challenges in architecting such a personalization system is the amount and diversity of data that has to be handled. For example, websites today generate user activity data that could easily run into terabytes in a matter of months. Equally problematic are the different formats and system interfaces. Once the data is loaded, the system applies correlation techniques to correlate the data and draw inferences about the preferences of individual customers. Traditional relational data warehousing and OLAP based systems struggle to process this massive amount of high-velocity data and provide insights. There is a high latency between the user's shopping activity and the generation of a recommendation, as well as limited granularity, which decreases the relevance of the recommendation to the end customer.
Figure: Reference architecture for Big Data based personalization in retail. Data from all channels (website application, stores POS, phone/catalog sales, transaction data, marketing campaign data, user profile data, reviews/ratings, web analytics, social networks, Twitter, log/sensor information, user-generated social data and other unstructured data) is extracted, mapped, transformed and loaded into a distributed, unified file system with MapReduce processing (HDFS and HBase for storage; Flume, Sqoop and Chukwa for data extraction; Zookeeper, Oozie and Azkaban for orchestration; Hive, Pig and Mahout for processing). The platform feeds navigation-related, product-context-based and promotion-related services and produces recommendations, patterns, ad-hoc query results, analysis output and data summaries.
Business Value
Real-time Personalized Insights: By combining inputs from various channels (social, location, history etc.) and analyzing them in real time, customers can be presented with almost instant recommendations. For example, if a customer tweets "I like Xbox", the system can provide recommendations related to Xbox when she logs into the ecommerce site, as an ad on her social network profile, or even as an Xbox promotion coupon sent to her mobile if she is shopping in-store. This kind of highly personalized, instant recommendation is being experimented with and will become more prevalent going forward.
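The following Python sketch illustrates the idea in miniature: a social post is matched against a small product catalog and fanned out as channel-specific recommendations. The catalog, channel names and coupon rule are assumptions made for illustration only.

from dataclasses import dataclass

# Hypothetical catalog of products and related items
CATALOG = {"xbox": ["Xbox wireless controller", "Game Pass subscription"],
           "wii": ["Wii remote", "Family game bundle"]}

@dataclass
class Recommendation:
    customer_id: str
    channel: str       # "web", "social_ad" or "mobile_coupon"
    items: list

def recommend_from_post(customer_id, post, in_store):
    """Match product keywords in the post and fan recommendations out to channels."""
    recs = []
    for keyword, items in CATALOG.items():
        if keyword in post.lower():
            recs.append(Recommendation(customer_id, "web", items))
            recs.append(Recommendation(customer_id, "social_ad", items[:1]))
            if in_store:  # send a coupon only if the customer is currently in a store
                recs.append(Recommendation(customer_id, "mobile_coupon", items[:1]))
    return recs

print(recommend_from_post("cust-42", "I like Xbox!", in_store=True))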
Personalized Presentation Filtering: One of the fundamental things that can be offered
is the ability to present content which is tuned to the preferences of the consumer. This
could be in terms of product types (Wii related vs. Xbox related or full sleeve vs. half
sleeve), brands (Sony vs. Samsung) or Prices (costliest vs. cheapest) or something else
that we know about her. This can be provided as filtered navigation in a website or as
a suggestive selling tip to a customer service representative while they are speaking
to the consumer.
Context-specific and Personalized External Content Aggregation: Presentation of context-specific information that makes sense for the consumer is a key capability. A good example is the relevance of social context. If we are showing a product and can show the consumer that, out of the 1,500 people who said they liked the product, 15 are friends (with the capability to know who those 15 are as well), the impact and relevance of that would be significant. This service is relevant only for electronic channels.
Personalized Promotional Content: Different consumers are attracted by different value propositions. Some like direct price cuts, some like more for the same money (note that these are not exactly the same), while others believe in getting more loyalty points. Showing the most appropriate promotion or offer based on their interests is another important capability that a personalized system can provide.
Big Data processing solutions can process vast amounts of data (terabytes to petabytes) from various sources such as browsing history, social network behavior, brand loyalty and general public opinion about the product obtained from various social networks. This kind of extremely useful and tailor-made information can only be obtained using Big Data processing solutions, and retailers must leverage them to make their business more appealing and personal to customers.
How are the customers segmented? What share of profits does each customer
segment bring in?
Some customers deserve greater attention than others. How can they be identified from among the frequent flyers?
What tactics should be adopted to acquire, convert, retain and engage customers?
Personalization and fare processing involve large data sets. While the current systems leverage conventional enterprise information, new data sources such as social media, web logs, call center logs and competitor pricing would have to be considered for personalization requirements.
Here the challenges include highly scattered data sources, huge effort for data integration, and the high cost of data warehousing and storage solutions.
The fare depends on factors like aircraft type, sector of the flight, flight date etc. The objective here is to have a greater level of control over the type of fares sold. When making a fare change, the fare is re-priced, which means recalculating each fare, taking into consideration factors like:
Inventory
Information about these factors will be distributed across various databases. Naturally, an attempt to process this data will run into all the above-mentioned problems. And these data sources may not be sufficient to discover a price aligned to the customer's point of view.
Opportunity: Big Data for Fare Processing
Fare processing involves extracting and consolidating information from external
sources such as Agent systems, External Pricing systems, and internal systems such
as Revenue Accounting, Forecasting, Inventory and Yield Management systems.
This data itself will run into tens of terabytes. Newer data sources, such as social media conversations about pricing decisions and competitors, and unstructured data within the enterprise like customer service e-mail and call center logs, will push the data volumes even further.
In this context, the high-cost data warehousing solutions are confronted with the
following challenges.
The analytics process should not be prolonged by the size of the data; terabytes should be processed in minutes rather than hours or days
Data loss is unacceptable; the solution requires high availability and failover
This type of analytics requires accumulation of data from multiple sources, frequent
processing of high volume data, flexibility and agility in the processing logic. These
are all pain areas for traditional data warehouse solutions and these get compounded
as the data volume grows.
Opportunity: Holistic data analytics and newer data
Analyzing customer data from various systems, including Loyalty, CRM and Sales & Marketing, and data from partners' systems, along with data from social media based on customers' social profiles, can help airlines create deeper personalization preferences for customers and also understand their current social status and preferences.
Run analytics on unstructured data from social and other sources to derive newer dimensions such as sentiment, buzzwords and root causes from customer interactions and user-created content
Data extracts from various feeder systems are copied to the Big Data file system. The Analytics platform will trigger the processing of this data, while responding to information needs for various analytical scenarios. The Analytics platforms contain the algorithms tailored to the business process, to extract/mine meaningful information out of the raw data.
Figure: Fares and pricing analytics architecture. Fares and pricing data sources (RES, external pricing systems, the consolidator (ATPCO), online agents, the revenue accounting system, GDS, price and yield management, and the fares DB) are gathered into an HDFS cluster processed with MapReduce, Pig, Hive and a job scheduler. CRM and fare datamarts serve the CRM analyst and pricing analyst and connect to personalization, the CRM tool, departure control, ticketing, sales & marketing, the loyalty program, social media analytics and fare analytics.
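As a minimal sketch of the kind of MapReduce processing such a platform can run, the Python script below can serve as a Hadoop Streaming mapper and reducer to compute the average fare per route from raw fare extracts. The input layout (route, booking class, fare, tab-separated) is an assumption for illustration; real fare records would carry many more attributes. In an actual Hadoop Streaming job the same script would be supplied as both the -mapper and -reducer, with the framework handling the sort between the two phases.

import sys

def mapper(lines):
    """Map step: emit 'route <tab> fare' pairs, skipping malformed records."""
    for line in lines:
        try:
            route, _booking_class, fare = line.rstrip("\n").split("\t")
            yield "%s\t%f" % (route, float(fare))
        except ValueError:
            continue

def reducer(lines):
    """Reduce step: consume mapper output sorted by key, emit 'route <tab> average fare'."""
    current, total, count = None, 0.0, 0
    for line in lines:
        route, fare = line.rstrip("\n").split("\t")
        if route != current:
            if current is not None:
                yield "%s\t%.2f" % (current, total / count)
            current, total, count = route, 0.0, 0
        total += float(fare)
        count += 1
    if current is not None:
        yield "%s\t%.2f" % (current, total / count)

if __name__ == "__main__":
    step = sys.argv[1] if len(sys.argv) > 1 else "map"
    for record in (mapper if step == "map" else reducer)(sys.stdin):
        print(record)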
Business Value
By switching over to a Big Data based solution, it is estimated that fare processing times can be cut down from days to a matter of hours or even minutes. The airline industry operates in a high-risk domain vulnerable to a large number of factors, and this agility will be a vital tool in maximizing revenues. Big Data architecture solutions for personalization help in understanding and deriving granular personalization parameters and in understanding customers' social status and needs based on their interactions in social media systems.
The potential benefits for an airline could include:
Increased efficiencies to the airline in managing fare strategy for different sales
channels
Key challenges with regard to a typical Auto Insurance workflow can be:
Collect the data about the customer from the Internal CRM system.
Check the other products sold to the customer from In-house ERP, CRM
systems.
Check the credit history and based on that provide the necessary quotes from
other internal systems.
Collect data from social networking sites to profile customers on their behavior, sentiments, social net worth, usage patterns, and clickstream analysis of their likes, dislikes and spending patterns.
Check the reviews and ratings of the product for which the insurance is required. For example, if the insurance is required for a particular brand of car, then reviews about the cars can be mined and, using sentiment analysis, the cars can be put into different categories. Cars with better reviews and ratings can be offered a lower premium, and cars which have bad reviews and which break down frequently can be offered a higher insurance premium.
Check the social network of the user. This can help in identifying the status of the customer and even help in identifying potential customers. Insurance companies can also make use of data such as the customer's interest in car racing, or a network that is quite active in rally racing.
Some of the other data which can be obtained from external systems includes data to check the creditworthiness of the customer from external third-party rating sites.
The solution architecture breaks the entire landscape into Acquisition, Integration and Information Delivery/Insights layers. These three layers provide seamless integration and flexible options to plug-and-play additional sources and delivery channels, hiding the underlying data complexity in each layer. The core of managing such an architecture is handling the large volume of information and the varied formats of data flowing into the system. Hence the solution is centered on Big Data and data virtualization techniques to manage this burst of information.
Acquisition Layer: Once the sources of both structured and unstructured data are identified, both externally and internally, the right set of adapters and crawlers is required to extract the Big Data from public/private locations outside the boundaries of the organization. This is the most critical aspect, and the right set of business rules on what needs to be pulled requires upfront preparation. The idea is to start with a smaller number of sources that are more likely to hold data relevant to the desired outcome, and later extend this strategy to other sources. In parallel, data integration and data virtualization technologies can be leveraged to integrate the relevant data sources that provide auto insurance policy, customer details, product categories and characteristics, and claims history. Tools like Informatica, IBM DataStage and ODI for data integration, and Denodo and Composite for data virtualization, can be leveraged.
A few parameters need to be factored in for better performance:
Delta feeds, trickle feeds and staging strategies can play an important role
Effective business rules and technical-rule filters in place to reject unwarranted data sets
Integration: This is where, post extraction and filtering, the information gets consolidated under a common data model. Integration requires a strong set of mapping rules to map both structured and unstructured data (transformed into a structured format). The data model should support integrating the following sets of information:
Customer behavior pattern analysis, e.g. liking for red vehicles, sports vehicles etc.
Figure: Auto insurance premium advisor solution. The acquisition layer uses crawlers and adapters (e.g. Scribe, Flume, Sqoop) to pull structured content (transactional and master data from CRM, ERP, SEA, insurance and warranty applications, data files, XLS) and unstructured content (Facebook, Twitter, blogs, mail). Delivery covers premium auto-suggestion by product category, product feedback through customer behavior and sentiment profiling, and warranty processes such as warranty identification, claim submission, claims adjudication and authorization, return material credits, and supplier warranty and recovery.
Delivery: The final stage delivers the insights generated from the integrated information as dashboards, scorecards, charts and reports, with flexibility for business analysts to explore the details and correlate the information sets for taking decisions on setting insurance premiums by vehicle and type of customer. Advanced analytics techniques in delivering such information will also help in figuring out claim frauds and unregulated auto insurance policies and claims. The delivery channels can be desktops via portals, mobile devices, internet-based application portals etc.
Business Value
With Big Data technologies in place, solutions for the two core challenges integral to the problem statement become practical and affordable:
1. Firstly, the ability to store and crunch any volume of data in a cost-effective way
2. Secondly, the ability to statistically model a rare event like fraud, which needs a sample size close to the entire population to capture and predict the right type of signatures of rare events
With that, the immediate business value this solution can bring can be categorized as:
Ability to put the right pricing on an insurance premium with a holistic view and better insight
While this solution addresses the specific use case of an auto insurance premium advisor, natural extensions of this solution and framework are certainly applicable to manufacturing parts warranty, other insurance premiums and claims, and fraud detection processes covering domains like Manufacturing, General Insurance and Retail.
Evaluation would be based on a set of predefined rules, which may focus on all the requests sent in a predefined past time window.
Figure: Real-time request evaluation architecture. Messages from the web service flow over a message bus into a distributed cache (with counters and write-ahead logs) spanning the RAM of servers (node 1 to node n), with a distributed columnar data store on the node disks for archived data.
1. The system should be able to ingest requests at the rate they come in, which will vary over time
2. The latency of the system has to be below specified SLA limits, which may not allow the system to store the incoming data before the response is evaluated
3. Evaluating a set of incoming responses needs a good amount of RAM to hold the data in memory, along with any derived structures created to store the aggregates over history and the current stream
4. With increased load the system may be required to use multiple nodes to ingest the data while keeping the counters valid across the nodes.
a. The counters will need to be stored in a shared memory pool across the nodes.
b. The counters will help reduce latency as they would be updated before the data is even written to disk (for history updates)
5. Distributing the system over multiple nodes will provide the solution with parallelism and also the ability to develop a fault-tolerant design
a. The distributed nodes, as stated in the last point, will be able to handle parallel writes across multiple nodes in a peer-to-peer architecture (not a master-slave architecture)
b. With an increased number of nodes the probability of a node failure increases, and hence replication of historical data would be needed across multiple nodes. Similarly, the data could be sharded across the nodes to help parallelize reads.
Assuming that the web service takes care of writing a message to the message queue for each request received, with appropriate details, a typical architecture for such a solution will be composed of the following components (a small sketch of how the acquisition layer and cache interact follows the list):
6. The Acquisition layer
a. This component will read messages from the message queue and distribute them to a set of worker processes which continue with the rest of the acquisition process.
b. Each worker process will look up the cache for an appropriate data structure based on the message details; if found, it updates the counter, else it creates a new one.
7. The distributed cache: The role of the distributed cache will be to act as the initial data store on which the analysis can be done, thus helping reduce the latency between a message's arrival and its impact on the measurement. This will need:
a. Initialization of the distributed cache at startup and also on a regular basis as data is flushed to the data disks
b. Ability to flush the data in the cache to the data disk when the cache size reaches a certain watermark
c. Ability to create a local structure on the node where the message is received and replicate it to the copies situated on other nodes
d. Ability to create and maintain a predefined set of replicas of the data structure across the nodes to support fault tolerance
8. The Storage/retrieval layer
a. Ability to store serialized data structures for the related processing nodes, with adequate copies across multiple nodes to handle fault tolerance in the data storage layer
b. Ability to provide secondary indexes on the data structures for alternate queries. The historical data stored will be time-series in nature, and distributed columnar data stores would be an appropriate way to handle this.
c. The data could be sharded across data nodes to increase read response.
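The following single-process Python sketch illustrates how the acquisition worker and counter cache described above interact; an in-process dictionary and list stand in for the distributed cache and the columnar archive store, and the message fields and flush watermark are assumptions for illustration.

import queue
import threading

FLUSH_WATERMARK = 1000        # assumed cache size at which counters are flushed
cache_lock = threading.Lock()
counters = {}                 # stand-in for the shared, replicated counter cache
archive = []                  # stand-in for the columnar store of historical data

def flush_to_archive():
    """Persist the in-memory counters (history update) and reset the cache."""
    global counters
    archive.append(dict(counters))
    counters = {}

def worker(messages):
    """Read messages off the bus, update the counter for the message key,
    and flush when the cache grows past the watermark."""
    while True:
        msg = messages.get()
        if msg is None:       # sentinel: no more messages
            break
        key = (msg["client_id"], msg["request_type"])
        with cache_lock:
            counters[key] = counters.get(key, 0) + 1   # updated before any disk write
            if len(counters) >= FLUSH_WATERMARK:
                flush_to_archive()

bus = queue.Queue()
for _ in range(5):
    bus.put({"client_id": "c1", "request_type": "quote"})
bus.put(None)
worker(bus)
print(counters)   # {('c1', 'quote'): 5}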
Business Value
The above solution provides the opportunity to:
Reduce costs, with the ability to handle large volumes of varying load using commodity hardware
Meet risk requirements: this kind of latency would not be possible in a traditional RDBMS, where data would have to be stored and indexed before querying
Reduce technical and commercial losses (AT&C) that lead to substantial energy
shortage
One solution we can look at to address the above challenges is the implementation of smart grids with a Big Data analytics platform. A Smart Grid is a real-time, automated system for managing energy demands and responses for optimal efficiency. In a smart grid environment, demand response (DR) optimization is a two-step process consisting of peak demand forecasting and selecting an effective response to it. Both these tasks can greatly benefit from the availability of accurate and real-time information on actual energy use and the supplementary factors that affect it. Analytical tools can process the consumption data coming in from an array of smart meters and provide intelligence that helps the utility company plan better for capital expenditures. Hence, the software platform that collects, manages and analyzes the information plays a vital role.
Proceed with the non-critical task of annotating smart meter data with domain ontologies (the collective set of information models used by the electricity industry can be viewed as a federation of ontologies)
Responding to peak load or other detected events by interacting with the consumer.
This entire process implicitly includes a feedback, since any response taken will
impact the consumer energy usage, which is measured by subsequent readings of the
smart meters.
Figure: Big Data analytics platform for smart grid demand response. Electricity usage data from AMIs and smart meter information feed a stream processing system, which issues emergency notifications and critical responses based on a policy database, and passes enriched information through an evaluation layer and a semantic privacy filter into the integration layer (integrated and domain databases). The integrated information drives demand forecasting and predictive analytics using policy models, is exposed through data-sharing APIs and services, and connects to CRM and billing systems. The platform can run on a public or private cloud, with components placed inside or outside the cloud.
The technologies that will enable these tasks include scalable stream processing systems, an evaluation layer, semantic information integration and data mining systems. The scalable stream processing system is an open-architecture system that manages data from many different collection systems and provides secure, accurate, reliable data to a wide array of utility billing and analysis systems. It accepts meter readings streaming over the internet or other communication protocols and detects and reacts to emergency situations based on defined policies. The evaluation layer captures the raw events and result sets for predictive modeling and sends the information to the semantic information integration system. Semantic information integration plays a vital role by using a domain knowledge base to integrate and enhance management of transmission and distribution grid capabilities with diverse information and to improve operational efficiency across the utility value chain. The data mining system uses data-driven mining algorithms to identify patterns among a large class of information attributes to predict power usage and supply-demand mismatch.
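As an illustration of the stream processing step, the Python sketch below applies a sliding-window check to a stream of aggregated meter readings and flags windows where demand exceeds a threshold. The window size, threshold and reading format are assumptions for illustration, not values from an actual utility deployment.

from collections import deque

WINDOW = 12          # e.g. 12 readings at 5-minute intervals = 1 hour (assumed)
PEAK_KW = 5000.0     # assumed aggregate demand threshold that triggers a response

def detect_peaks(readings):
    """Yield (timestamp, windowed average) whenever the moving average exceeds PEAK_KW."""
    window = deque(maxlen=WINDOW)
    for timestamp, total_kw in readings:
        window.append(total_kw)
        avg = sum(window) / len(window)
        if len(window) == WINDOW and avg > PEAK_KW:
            yield timestamp, avg   # hand off to the demand response policy engine

stream = [(t, 4800 + 30 * t) for t in range(24)]   # synthetic rising load
for ts, avg in detect_peaks(stream):
    print("peak at t=%d: windowed average %.0f kW" % (ts, avg))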
All of these tools will run on scalable platforms that combine public and private cloud infrastructure, and allow information sharing over web service APIs while enforcing data privacy rules. A mix of both public and private clouds is necessary due to data privacy, security and reliability factors. A core set of internal, regulated services may be hosted within the utility's privately hosted cloud, while the public cloud is used for a different set of public-facing services and to off-load applications that exceed the local computational capacity. For more accurate analytics and better demand forecasting, the data needs to be integrated with the billing and CRM systems as well. Integrating billing and CRM systems inside the cloud may prove to be expensive, so in such cases it may be better to keep the analytics outside the cloud.
Business Value
Smart metering with big data analytics gives an opportunity to focus on accounting
and energy auditing, to address theft and billing problems which have vexed the
industry. The following lists the Business value that can be realized in using big data
analytics:
Analyzing consumer usage and behavior: Big Data can be used for enhanced analytics that visualize where energy is being consumed and provide insight into how customers are using energy. This increases the efficiency of smart grid solutions, allowing utilities to provide smarter and cleaner energy to their customers at an economical rate. A significant amount of value is anticipated to reside in secondary consumer data: behavioral analytics of consumer usage data will have value to utilities, service providers and vendors, in addition to the owners (consumers) of that data. Utilities and other energy service providers need this type of consumer data to effectively enlist support for future energy efficiency and demand response campaigns and programs that reward changes in energy consumption. CRM and analytics applications can deliver valuable information to let utilities act as trusted advisors to consumers to reduce or shape energy use.
In addition to the above benefits, smart grid implementation with Big Data analytics will play a key role in addressing global issues like energy security and climate change.
Cost effective: Both Hadoop and HIVE are open source and the initial software cost is zero. In addition, they are designed to run on commodity hardware, and the infrastructure cost is relatively low when compared to conventional EDW hardware.
Strong eco-system: Hadoop has now become mainstream and there is massive support in the industry. We are seeing the ecosystem grow at a rapid pace, coexisting with the existing landscape.
To address the end-to-end needs of an Enterprise Data Warehouse, we need to effectively handle the following:
Data Ingestion: Data needs to be ingested from a variety of data sources into the Big Data environment (Hadoop + HIVE in this case). The data can either be transactional data stored in an RDBMS or any other unstructured data that an enterprise might want to use in its EDW environment.
Figure: Enterprise Data Warehouse augmented with Hadoop. Data from source systems, online systems, web logs/clickstream and social networks is ingested into Hadoop (HDFS/Hive staging), processed, and published both to the ODS + data warehouse and data marts for reports, and to low-latency systems (in-memory databases serving dashboards and mobile apps). Hadoop also holds processed data for deep, statistical and ad-hoc analysis and feeds outbound systems.
Data Processing: Once data is ingested to the platform, it needs to be processed to provide business value. Processing can be in terms of aggregation, analytics or semantic analysis. It is interesting to note that, unlike a conventional EDW platform, the Hadoop + HIVE environment is well suited to handling unstructured or semi-structured data. Companies like Facebook, Google and Yahoo routinely process huge volumes of unstructured data and derive structured information from it.
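As a small example of deriving structured information from semi-structured data, the Python sketch below parses common-log-style web server lines into tab-delimited records that could be loaded into a Hive table. The log layout and output columns are assumptions and would differ for each source system.

import re

# Common-log-style line: host, timestamp, request, status, bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)')

def to_structured(lines):
    """Yield tab-delimited (host, ts, method, url, status, bytes) records."""
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue           # unparseable noise is simply dropped in this sketch
        fields = m.groupdict()
        fields["bytes"] = "0" if fields["bytes"] == "-" else fields["bytes"]
        yield "\t".join(fields[k] for k in ("host", "ts", "method", "url", "status", "bytes"))

sample = ['10.0.0.1 - - [01/Feb/2012:10:15:32 +0000] "GET /cart HTTP/1.1" 200 512']
for record in to_structured(sample):
    print(record)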
Business Value
The platform provides a compelling alternative to the conventional EDW, especially
in the world of Big Data. This kind of architecture is being evaluated by many of
our clients. Development of accelerators can help package the above solution as a full-fledged platform and facilitate smoother adoption. Following are some key accelerators that an enterprise can look to develop in the short to medium term:
Technical Accelerators (Level 0): Big Data Aggregator Framework, Parallel Data Ingestion Framework, Common Data Adapter
Doug Cutting
Co-founder, Apache Hadoop project
Doug, first Google, Yahoo and Facebook learned to manage Big Data and
now large enterprises have started to leverage Big Data. What are the most
common use cases you are seeing in the context of the enterprise?
Most companies are motivated to start using Apache Hadoop by a specific task. They have an important data set that they cannot effectively process with other technologies. At companies with large websites, this initial task is often log analysis. For example, most websites are composed of many web servers, and a given user's requests may be logged on a number of these servers. Hadoop lets companies easily collate the logged requests from all servers to reconstruct each user's sessions. Such sessionization permits a company to see how its users actually move through its website and then optimize that site.
In other sectors, we have observed different initial tasks. Banks have a lot of data about their customers: bill payments, ATM transactions, deposits, etc. Banks can combine analysis of this data to better estimate creditworthiness, and improving the accuracy of this estimation directly increases a bank's profitability.
Retailers have a lot of data about sales, inventory and shelf space that, when analyzed over multiple years, can help them optimize purchasing and pricing.
The use cases vary by industry. Once companies have a Hadoop installation they tend
to load data from more sources into it and find additional uses. The trends seem clear
though: businesses continue to generate more data and Hadoop can help to harness
it profitably.
What are the challenges enterprises are facing for the adoption of Big Data?
There's a big learning curve. It requires a different way of thinking about data processing than has been taught and practiced for the past few decades, so business and technical employees need to re-learn what's possible.
IT organizations can also be reluctant to deploy these new technologies. They're often comfortable with the way they've been doing things and may resist requests to support new, unfamiliar systems like Hadoop. Often the initial installation starts as a proof-of-concept project implemented by a business group, and only after its utility to the company has been demonstrated is the IT organization brought in to help support production deployment.
Another challenge is simply that the technology stack is young. Tools and best practices have not yet been developed for many industry-specific vertical applications. The landscape is changing rapidly, but conventional enterprise technology has a multi-decade head start, so we'll be catching up for a while yet. Fortunately, there are lots of applications that don't require much specific business logic; many companies find they can start using Hadoop today and expand it to more applications as the technology continues to mature.
Is Hadoop the only credible technology solution for Big Data management?
Are there any alternates? And how does Hadoop fit into enterprise systems?
Hadoop is effectively the kernel of an operating system for Big Data. Nearly all the popular Big Data tools build on Hadoop in one way or another. I don't yet see any credible alternatives. The platform is architected so that if a strong alternative were to appear it should be possible to replace Hadoop.
The stack is predominantly open source and there seems to be a strong preference for this approach. I don't believe that a core component that's not open source would gain much traction in this space, although I expect we'll start to see more proprietary applications on top, especially in vertical areas.
Doug, one last question: Hadoop creator, Chairman of The Apache Software Foundation and Architect at Cloudera. Which role do you enjoy the most?
Hadoop is the product of a community. I contributed the name and parts of the
software and am proud of these contributions. The Apache Software Foundation has
been a wonderful home for my work over the past decade and I am pleased to be
able to help sustain it. I enjoy working with the capable teams at Cloudera, bringing
Hadoop to enterprises that would otherwise have taken much longer to adopt it.
In the end, I still get most of my personal satisfaction from writing code, collaborating
with developers from around the world to create useful software.
Protecting Privacy
Data mining techniques provide the backbone to harnessing information quickly and efficiently from Big Data. However, this also means there is a potential for extracting personal information by compromising user privacy (see sidebar: Privacy Violation Scenarios). In this chapter, we first describe principles that can be used to protect the privacy of end users at various stages of the data life cycle. Subsequently we explore technical aspects of protecting privacy while processing Big Data.
Collection limitation is a policy decision on the part of the data collector; usage limitation, securing data, retention and destruction, and transfer policy can be addressed by technical means; and the last one, accountability, is addressed by having a legal team sign a declaration.
Using the collected data for analysis and deriving insight from it is an important technical step. In the next section we describe some of the privacy-preserving data mining techniques.
Generalization: Rare attribute values in data items are replaced with generic terms. For example, consider an employee database in which very few people hold a Ph.D. degree. A query that returns qualification and age, correlated with the result of a query that lists salary and age, can reveal the identity of a person. This can be prevented by replacing the rare qualification (Ph.D.) with a more generic term such as "graduate", which makes it difficult to correlate and infer.
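A minimal Python sketch of this generalization step is shown below; the rarity threshold and the mapping from rare values to generic terms are illustrative assumptions.

from collections import Counter

GENERALIZE = {"Ph.D": "Graduate", "M.Phil": "Graduate"}   # assumed rare-to-generic mapping
RARE_THRESHOLD = 2                                        # assumed rarity cut-off

def generalize(records, attribute):
    """Replace rare values of `attribute` with a generic term before releasing results."""
    counts = Counter(r[attribute] for r in records)
    out = []
    for r in records:
        value = r[attribute]
        if counts[value] < RARE_THRESHOLD:
            r = dict(r, **{attribute: GENERALIZE.get(value, "Other")})
        out.append(r)
    return out

employees = [{"age": 34, "qualification": "Ph.D"},
             {"age": 29, "qualification": "B.E"},
             {"age": 31, "qualification": "B.E"}]
print(generalize(employees, "qualification"))   # the lone Ph.D becomes 'Graduate'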
Payment Card Industry Data Security Standard (PCI DSS): Defined to protect
financial transaction data from potential breaches
Personal Data Privacy and Security Act: Deals with prevention and mitigation
of identity thefts.
UK Data Protection Act: Serves the same purpose as US privacy law.
The figure gives an idea of where the Big Data Solutions fit in the current enterprise
context. Big Data technologies are today being used primarily for storing and
performing analytics on large amounts of data. Solutions like Hadoop and its
associated frameworks like Pig, Hive etc. help distribute processing across a cluster
of commodity hardware to perform analytic functionalities on data. Hadoop based
data stores as well as NoSQL data stores provide a low cost and highly scalable
infrastructure for storing large amounts of data.
Figure: Vendor landscape across the Big Data stack, spanning integration (e.g. Informatica, Flux, Kettle, Scribe, Cloudera Flume, Sqoop, IBM, Trillium Software, SAS DataFlux, Oracle, Initiate Systems, Siperian), storage and mining (e.g. Hadoop distributions from Cloudera, MapR and HortonWorks, Oracle Exadata, Greenplum, Cassandra, Aster Data, MongoDB, Vertica, NetApp, Sybase, IBM Netezza, HDS, EMC, Teradata), analytics (e.g. R, Karmasphere, Datameer, RHIPE, Matlab, SAS, Microsoft SSAS) and delivery/visualization (e.g. Actuate, Business Objects, Information Builders, IBM Cognos, Microsoft, SAP, MicroStrategy, Tableau, Jasper Reports, QlikTech, Oracle).
management and addressing data quality issues. However, there is limited integration of these with Big Data technologies, resulting in many custom solutions for scenarios where Big Data technologies are used.
Richness of Analytics/Mining capabilities: Big Data solutions and frameworks
available today for analytics like Apache Mahout provide a limited number of
algorithm implementations and their usability is also limited compared to the kind of
features business analysts have been used to with commercial solutions.
Limited Data Visualization and Delivery capabilities: There is limited support for visualization of analysis results in existing Big Data solutions. A major requirement for business users is the ability to view the analyzed data in a visually comprehensible manner. BI/DW reporting solutions allow users to generate these visual charts and reports by connecting with traditional BI solutions easily. Support for Big Data solutions such as Hive, HBase, MongoDB etc. in such popular reporting tools is limited at this point in time.
Limited integration with Stream/Event Processing solutions: Several Big Data
frameworks like Hadoop provide good results for batch requirements but they are
not architected for real-time processing requirements. There are several solutions like
CEP which address real-time processing needs but their integration with Big Data
solutions is limited.
Limited integration with EDW/BI products: Traditional BI/EDW solutions provide
advanced features like OLAP enabling easy slicing & dicing of information and also
enabling users to define and analyze the data through a user-friendly UI. This allows business analysts with limited technical expertise to use these solutions to address business requirements. The user experience maturity aspects are still at a very early stage in the Big Data solutions available currently. A lot of work has to go into making them more user-friendly.
Processing and Analytics: The processing of data needs to happen in real time. Processing and establishing patterns amongst messages involves complex computations such as detecting and establishing patterns among events (correlation), applying rules, filtering, union, join, and triggering actions based on the presence or absence of events.
Result Delivery: After processing, the information needs to be presented to the end user in real time in the form of appealing dashboards, KPIs, charts, reports and e-mail, along with intervention actions like sending alerts over user-preferred channels such as smartphones, tablets or desktops (web, thick client).
Reliability, Scalability: Systems that process such information need to be highly fault tolerant, since the loss of data from missing even one message may be unaffordable at times. They also need to be scalable and elastic so that they can easily scale to cater to increased processing demand.
Due to all the above challenges, performing analytics and data mining on Big Data in real time differs significantly from traditional BI, because in real time it is not feasible to process the messages and derive insights using the conventional architecture of storing data and processing it in batch mode.
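The Python sketch below illustrates one such real-time operation: correlating two event types for the same user within a time window and emitting an intervention when the pattern occurs. The event fields, window length and triggered action are assumptions for illustration.

from collections import deque

WINDOW_SECONDS = 60   # assumed correlation window

def correlate(events):
    """Yield an alert when a 'payment_failed' follows a 'checkout' for the same user
    within WINDOW_SECONDS; events are (timestamp, type, user_id) tuples in time order."""
    recent_checkouts = deque()    # (timestamp, user_id) of recent checkout events
    for ts, etype, user in events:
        # drop checkouts that have slid out of the correlation window
        while recent_checkouts and ts - recent_checkouts[0][0] > WINDOW_SECONDS:
            recent_checkouts.popleft()
        if etype == "checkout":
            recent_checkouts.append((ts, user))
        elif etype == "payment_failed" and any(u == user for _, u in recent_checkouts):
            yield ts, user        # e.g. push a real-time intervention to this user

stream = [(0, "checkout", "u1"), (20, "payment_failed", "u1"), (200, "payment_failed", "u2")]
for ts, user in correlate(stream):
    print("t=%d: intervene for user %s" % (ts, user))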
Figure: Real-time Big Data analytics. Streaming data sources, OLTP systems and external sources feed a streaming data processing solution for real-time processing and insights, alongside Big Data analytics over historical data for predictive analytics and forecasts, driving real-time interventions.
Figure: Complex Event Processing (CEP) engine. Input adapters receive events from devices, sensors, web servers and trading stations; the engine applies rules, union, filter and correlation functions f(x, y) over a high-speed cache memory, with reference and OLTP databases supporting mine-and-design; output adapters deliver results to targets over a service bus.
massive amounts of data, businesses are now enabled to get answers to their ad-hoc
analysis questions faster and with more precision.
The following table outlines a few important types of analytics that are performed on Big Data and that, in many cases, can effectively leverage a Data Cloud and Cloud Infrastructure / Utility Cloud:
Types of Analytics, their characteristics and examples:
Operational Analytics. Examples: real-time fraud detection, ad serving, high-frequency trading.
Deep Analytics. Characteristics: typically multi-source, non-operational transaction data; complex data mining and predictive analytics. Example: algorithmic trading.
Streams Analytics. Examples: log analysis, web crawling, recommendation engines.
A large set of data now exists in the cloud (private, public and hybrid) space, and many organizations and applications are also making their way onto cloud platforms every day. So a key question that enterprise architects face is how to leverage cloud computing infrastructure for Big Data storage and processing. Is it possible to use public cloud platforms for Big Data analytics? What are the benefits and what are the challenges?
Figure: Data Cloud deployment spans private and public clouds. The table provides an overview of the different architectural options for a Data Cloud, the considerations and the pros and cons: (1) leverage public Data PaaS frameworks like Amazon Elastic MapReduce; (2) leverage public IaaS and set up Big Data solutions on it.
The key benefits from leveraging cloud infrastructure for Big Data needs are:
Low costs: Using infrastructure via the utility cloud model, thereby reducing infrastructure costs
Fast Turnaround Time for Infrastructure: On-demand provisioning of cloud infrastructure reduces lead times
High Elasticity and Scale-out: Massively parallel computing using a Data Cloud with dynamically scalable infrastructure
The key challenges are:
Data Transfer Limitations: Transferring data is a key challenge, especially with public clouds. Most enterprises still have the majority of their systems on premise, so the data generated resides mostly in those systems. Latencies in data transfer, and transfer costs, will be key limitations.
Security and Privacy Challenges: Moving data to a public cloud infrastructure brings associated security and privacy challenges as the data moves out of enterprise boundaries.
Performance Challenges: Virtualization overheads and the limited deployment options of cloud infrastructure can lead to performance degradation, resulting in the need for a larger infrastructure footprint.
S. Gopalakrishnan (Kris)
Co-Chairman, Infosys Limited
Kris is on the Global Thinkers 50 and Co-Convener, ICT & Innovation, World Economic Forum
How do you see the connection between Big Data and Infosys' strategic theme of Building Tomorrow's Enterprise?
Big Data management is a great enabler for Building Tomorrow's Enterprise.
Digital consumers now are active, informed and assertive. Big Data analytics helps in delivering more personalized products and services to them.
A sustainable tomorrow needs efficient management of power, higher yields in agriculture etc. These use cases require processing of large and complex data sets.
An outcome of pervasive computing is the ability to gather large amounts of information from sources like social networks, mobile location based services and online shopping. This information, with Big Data analytics, helps enterprises become more intelligent.
In healthcare, drug development analysis, genome analysis and bioinformatics all require Big Data management.
Thus Big Data will become one of the key enabling technologies for Building Tomorrow's Enterprise.
Big Data Adoption Enablers
Figure: Big Data adoption enablers: Accelerators (Solution & Expertise), Services (Extreme Data), Product (Voice of Customer Analytics) and Platform (Social Edge for Big Data).
The real value from Big Data solutions for the enterprise is the actionable intelligence gained by analyzing large volumes of data within and outside the enterprise. The objective of these enablers is to alleviate the concerns around data ingestion and data consumption, allowing the customer to focus on the value-added activity of data processing.
Accelerators: Solution
Infosys Big Data Solution Accelerators: A Big Data Solution Accelerator Framework that captures and provides reusable artifacts such as process artifacts, design artifacts, code components, and horizontal and vertical solution frameworks for the various Big Data solution development life cycle stages will help reduce the risk of failure in Big Data project execution. The vision for the Infosys Extreme Data Accelerator Framework (InDAeX) being developed at Infosys Labs is described below:
Big Data migration methodology with data security and privacy, and reliability of data ingestion and access in the face of unreliable networks to the data sources
Figure: InDAeX horizontal framework. Data capture (social media, RDBMS adaptors, sensors, logs), data processing (ETL, analytics, data semantics, data mining, orchestration, compression, serialization, exception handling, data security), data storage (graph DB, document store, in-memory DB, column DB, key-value store, distributed file system, archival) and data consumption (mobile, browser and tablet widgets, reporting/visualization, query interface, aggregations), supporting vertical solutions such as EDW, recommendation engines and social media monitoring.
Horizontal frameworks that address data capture, processing, storage and data
delivery concerns
Accelerators: Expertise
At Infosys, certification is a professional goal and one of the ways to test one's knowledge and measure oneself against a global benchmark. The Infosys Big Data Certification Program (IBCP) was launched to validate the technical, application and solution skills derived from the project implementations done in the area of Big Data.
IBCP has three levels, namely Big Data Beginner, Big Data Technology Specialist and Big Data Architect.
Figure: IBCP structure. Course modules (Introduction to Big Data, Big Data Models, Big Data Eco System, Big Data for Domain, Big Data Enablers, Big Data Social Implications, Database, Analytics, Machine Learning, Solutions, Extracting Transformational Value) map to the three certification levels (Big Data Beginner, Big Data Technology Specialist, Big Data Certified Architect) in a build, validate and improve cycle.
Benefits:
Validate your knowledge on Big Data to know where you stand in technology.
Implementing Big Data raises three fundamental questions around dealing with data at such a large scale:
1. How do you process this data? Given the volumes, traditional relational solutions fail; how do you leverage newer solutions like Hadoop to provide solutions that fit the data strategy of the enterprise?
2. How do you manage this data? Data, be it big or small, is a corporate asset; how do you manage it, especially when the scale is large? What is the lifecycle of Big Data?
3.
d. Metadata Management
e.
f.
j.
d. Performance Engineering
e. Infrastructure Sizing
f.
g. Program Management
5. Extreme Data Migration Services (services geared towards helping clients move from a current solution to a desired one)
a. Defining Migration Path from current to future state
b. Assessment of current solution w.r.t Extreme Data need
c. Program management
Infosys defined and implemented a Big Data solution combining traditional data warehousing components, i.e. Teradata, with Hadoop to create a solution that is cost effective and leverages the existing investment in BI solutions.
Solution:
Informatica extracts the data from source systems and loads it into HDFS
Results of the processing are sent to Teradata for reporting and creation of cubes for analysis
Reporting needs (BI tools like Business Objects, Corda) are supported by Teradata
Benefits
This implementation is a template for an efficient and cost-effective data warehouse solution using Big Data technologies. It also indicates that the future of Information Management is going to have Big Data as one of its pillars, and this new world will be built using MAD architectures.
Case Study 2: The client is one of the largest retailers in the US. For their business unit
they needed to build the ability to crawl competitor websites and adjust their product
pricing based on the crawl information.
Infosys implemented a solution using Hadoop and the grid computing solution
Condor. The combined solution was able to do the same crawls about 60 times faster.
The business benefit from this was the ability for the company to crawl additional retailers' sites, thus helping to converge on the most competitive price for their products.
Solution:
Condor (grid computing framework) to manage the crawl farm (over 200 machines)
Parse and extract useful data from crawl data using a Hadoop MapReduce solution written in Pig
60 TB Hadoop cluster
Average size of a crawl between 50 MB and 1.5 TB, based on the type of crawl
Benefits:
Case Study 3: Very large scale (over a billion transactions) reconciliation engine for a large US manufacturer using MongoDB
Solution:
MongoDB used to match millions of entities in less than an hour using a commodity-hardware-based, horizontally scalable architecture
Benefits:
Running matching programs in parallel for millions of records to finish the runs in the stipulated time
Lower TCO
The knowledge gained through these projects along with the broader patterns of Big
Data processing developed at Infosys is leveraged in delivering Extreme Data services.
Figure: Infosys HIMI Voice of Customer (HIMI-VOC) Analytics solution. A data loader (Flume/Scribe) and a social data aggregator bring social and enterprise data into Hadoop/HDFS and stores such as HBase, MongoDB, Cassandra and Hive; the HIMI-VOC analytic engine, reporting server and HIMI-VOC workbench sit on top.
Key Features
Strong text processing capability
Infosys HIMI Voice of Customer Analytics solution employs multiple text processing techniques such as classification, clustering, key phrase extraction and rule-based extraction. Techniques for learning from sample data and for improving classification accuracy by incorporating domain-specific information are also employed.
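As a generic illustration of key phrase extraction (not the HIMI-VOC algorithm itself), the Python sketch below surfaces the most frequent bigrams in a set of feedback documents after stop-word removal; the stop-word list and sample feedback are assumptions.

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "it", "my", "very"}

def key_phrases(documents, top_n=3):
    """Return the top_n most frequent bigrams across all documents."""
    bigrams = Counter()
    for doc in documents:
        words = [w for w in re.findall(r"[a-z']+", doc.lower()) if w not in STOP_WORDS]
        bigrams.update(zip(words, words[1:]))
    return [" ".join(pair) for pair, _count in bigrams.most_common(top_n)]

feedback = ["The battery life is poor and the screen is great",
            "Battery life could be better, screen quality is great"]
print(key_phrases(feedback))   # e.g. ['battery life', ...]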
Multiple input format support
Input documents in various formats, including but not limited to Word, PDF, Excel and HTML, are supported. Data extraction from online sources is also supported.
Proprietary content rating algorithm
A proprietary content rating algorithm is used to rate the feedback. This algorithm
takes into consideration the content, its length, credibility of the websites/ forums
and people.
Powerful modeling workbench
Solutions of this nature greatly benefit from domain specific knowledge. The modeling
workbench of Infosys HIMI Voice of Customer Analytics solution facilitates easy
modeling of domain knowledge. It allows us to capture the domain rules, define
and describe the entities and concepts in the domain in a simple manner using an
interface.
Out of the box dashboards
Dashboards ensure effective visualization, enable drilling down of data and provide
multiple views of data. Graphs, pie charts, dial meter indicators and heat maps are all
part of the dashboards.
Business Benefits
Understanding Customer Pain Points: Providing insight into customer sentiments can go a long way toward understanding customer concerns. This is instrumental in bringing a customer-centric focus to the business.
Signify Early Warnings: Customer reviews at product launch and later identify customer likes and dislikes, which influence product features and play a great role in the success of a product.
Brand / Vendor Comparison: Customer review analysis can be very effective for making comparisons across products and brands. This can give much insight into competitor businesses and strategies.
Improve Operational Efficiency: It is possible to detect weak links in the services
and the operations by correlating insights from customer feedback to transactional,
operational and customer demographic data.
More than 250 million tweets are generated in a day, and this is increasing at a tremendous speed.
Data will grow over 800% in the next 5 years, and 80% of this data will be unstructured.
While handling such huge volumes of data poses a significant challenge, it also provides huge opportunities and a competitive edge for the enterprises that manage it. The need for a robust platform arises from the challenges below, faced by enterprises in handling Big Data.
Bringing not-so-hygienic external data into the organization's premises, exposing it to security risks
Analyzing and making sense of this massive data on an ongoing basis to derive actionable insights
Data collection and Aggregation: The platform supports data collection and
aggregation over diversified data sources across the internet.
Over 250 million blogs; 5 million forums, message boards and Usenet groups
Facebook
Twitter
WordPress
YouTube
Flickr
Figure: Social Edge platform architecture. A presentation layer (help, configuration, reports, admin and management screens) sits over services (SOAP/REST/JSON) and a business layer (notification, collaboration engine, workflow, rules, search engine, feeds, Big Data database, analytic engine), with logs/clickstream input and integration to enterprise applications such as the CRM system, campaign management and ticket management.
Key Insights: Consumer Sentiments, Product Feedback, Influencer Analysis, Market Intelligence, Competitive Benchmarking, Campaign Analysis
Data Visualization and Reporting: Reporting capability with a rich UI and the ability to create customizable widgets, dashboards and drill-down views. It also provides built-in engagement capabilities where responses to social conversations can be managed and tracked.
Advanced Search: Fast-loading search engine with full Boolean capability and search filters.
Dive into the details: Drill down into data points on a chart to understand what drives changes in volume or sentiment at the post level.
End-to-End Services: Platform delivered in an enterprise SaaS model with single-point accountability, application ownership, robust infrastructure, BPO and consulting services.
Having a scalable Big Data processing platform to analyze social data will help enterprises in:
Identifying and engaging key influencers who impact the increase or decrease of sales of a particular product.
Solution Benefits:
Derive actionable insights for marketing and business managers from a large
volume of unstructured social data.
Monitor customer sentiment and perception by keeping a real-time tab on the pulse of the customer.
Anticipate and gauge the product Adoption lifecycle by interpreting the trends
in social media.
Contributing Authors
Making it Real: Industry Use Cases
Arun Thomas George, PE; Bhushan Dattatraya Masne, Infosys Labs; Brahma Acharya,
Cloud; Girish Viswanathan, RCL; Jagdish Bhandarkar, E&R; Jayanti Vemulapati, PE;
Joseph Alex, PE; Prem Kumar Karunakaran, PE; Sajith Abdul Salim, PE; Sangeetha
S, E&R; Sourav Mazumder, Cloud; Subramanian Radhakrishnan, RCL; Sumit
Sahota, Infosys Labs; Venkataramani M, Cloud; Vinay Prasad, FSI; Yogesh Bhatt,
MFG; Yuvarani Meiyappan, E&R
Reviewers
Lakshmanan G, E&R; Siva Vaidyanatha, RCL; Subrahmanya S.V, E&R
Designers
Chandrashekhar Hegde, CDG; Srinivasan Gopalakrishnan, CDG
Sponsors
Satyendra Kumar, Senior VP, Head Quality, Tools and Software Reuse
Srikantan Moorthy, Senior VP, Head Education and Research
Subrahmanyam Goparaju, Senior VP, Head Infosys Labs
Acknowledgement
Big Data Spectrum, themed on "Making it Real", gives insights into applying Big Data in the real world and discusses the overall way in which any enterprise can embrace Big Data and its related technologies.
This project would not have been possible without the immense support of Kris
Gopalakrishnan, Co-Chairman, Infosys Ltd. We are highly grateful to Shibulal S.D,
CEO, Infosys Ltd for his encouragement.
We also would like to thank Srikantan Moorthy, Senior VP and Head, Education & Research; Subrahmanyam Goparaju, Senior VP and Head, Infosys Labs; and Satyendra Kumar, Senior VP and Head, Quality, for their constant support throughout this project.
We would like to thank Dr. Phil Shelley, CTO, Sears Holdings, and Doug Cutting, Co-founder, Apache Hadoop, for providing their insights in the Q&A sections.
We would like to thank Ms. Melissa Hick, Bhava Communications, Rob Lancaster, Cloudera, and A. Sriram, Cloud Practice, Infosys Ltd, for their help on the Q&A section with Doug Cutting. We are grateful to Mayank Ranjan, RCL, Infosys Ltd, for his help on the Q&A section with Phil Shelley.
Rajasimha S, CDG, has contributed to this project. His timely and valuable support in
experience design has helped us tremendously. Our sincere thanks to Sarma KVRS,
E&R for his help and support on Intellectual Property related details.
Thanks are due to Sanjita Bohidar, Amit Shukla, Satish Kumar Kancherla, Sujith Penta
who helped in proofreading.
About Infosys
© 2012 Infosys Limited, Bangalore, India. Infosys believes the information in this publication is accurate as of its publication date; such information is subject to change without notice. Infosys acknowledges the proprietary rights of the trademarks and product names of other companies mentioned in this document.