
UNIT I
UNDERSTANDING BIG DATA

What is big data – why big data – convergence of key trends – unstructured data – industry examples of big data – web analytics – big data and marketing – fraud and big data – risk and big data – credit risk management – big data and algorithmic trading – big data and healthcare – big data in medicine – advertising and big data – big data technologies – introduction to Hadoop – open source technologies – cloud and big data – mobile business intelligence – crowdsourcing analytics – inter and trans firewall analytics
What is Big Data?
Think of the following:
• Every second, around 8,220 tweets are posted on Twitter.
• Every minute, nearly 510,000 comments are posted, 293,000 statuses are updated and 136,000 photos are uploaded on Facebook.
• Every hour, Walmart handles more than 1 million customer transactions.
• Every day, customers make around 11.5 million payments using PayPal.
- The digital world sees a rapid increase in data, driven by the growing use of the internet and of sensors.
- The sheer volume, variety, velocity and veracity of such data is signified by the term 'Big Data'.
What is Big Data?
• Big data is structured, unstructured and semi-structured in nature.
• It is difficult for computing systems to handle because of its high speed and volume.
• Traditional data management, warehousing and analysis tools fail to analyze data at this speed.
• Hadoop by Apache is widely used for storing and managing Big Data.
• According to IBM, every day we create 2.5 quintillion bytes (i.e., 2.5 billion GB) of data – so much that 90% of the data in the world today has been created in the last two years alone.
• Data – sensor data, climate data, GPS data, bank transaction data, and so on.
Definition

• Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.

• Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.

Data Expansion – Day by Day
• One of the main data production sources – smart electronic devices.
• Amount of data – 175 zettabytes expected by 2025.
• Total volume of data – doubles roughly every two years.
Sources of Big Data
• Social media
• Sensors placed in various cities
• Customer satisfaction feedback
• IoT appliances
• E-commerce
• Global Positioning System (GPS)

Sources of Big Data
Social Media
• WhatsApp, Facebook, Instagram, Twitter, YouTube, etc.
• Each activity – uploading a photo/video, making a comment, sending a message, liking a post, etc. – creates data.
Sensors
• Sensors in a city gather temperature, humidity, etc.
• Cameras beside roads gather traffic information.
• Security cameras in airports/banks create a lot of data.
Customer Satisfaction Feedback
• Amazon, Flipkart, FirstCry, Licious, Swiggy, Blinkit, Zepto, etc. gather customer feedback on quality of product/delivery time. This creates a lot of data.
Sources of Big Data

IoT Appliances
• Electronic devices connected to the internet create data for their smart functionality. Example: Samsung SmartThings.
E-Commerce
• Payments through credit card, debit card, pay later, or other electronic means are recorded as data.
Global Positioning System (GPS)
• Vehicle movement – directions/traffic congestion. Creates a lot of data on vehicle position and movement.
Real-world examples – Big Data
• Social media analytics – consumer product companies and retail organizations observe data on social media websites to analyze customer behaviour, preferences, etc.
• Insurance companies use Big Data analytics to see which home insurance applications can be immediately processed and which ones need a validating in-person visit from an agent.
• Hospitals are analysing medical data and patient records to predict which patients are likely to be readmitted within a few months of discharge.
• Relying on social networks and analytics, companies are gathering volumes of data from the web to help musicians and music companies better understand their audiences.
Big Data Analytics

Big (and small) Data analytics is the process of examining data – typically of a variety of sources, types, volumes and/or complexities – to uncover hidden patterns, unknown correlations, and other useful information.

The intent is to find business insights that were not previously possible or were missed, so that better decisions can be made.
Big Data analytics uses a wide variety of advanced analytics to provide:

• Deeper insights. Insights into all the individuals, all the products, all the parts, all the events, all the transactions, etc.

• Broader insights. It takes into account all the data, including new data sources, to understand the complex, evolving, and interrelated conditions and produce more accurate insights.

• Frictionless actions. Increased reliability and accuracy allow the deeper and broader insights to be automated into systematic actions.
Advanced Big Data analytics
Big Data analytic applications

3 dimensions / characteristics of Big Data
Volume

Volume is the amount of data generated by organizations or individuals.
At present, the volume of data is measured in exabytes; in coming years it will be measured in zettabytes.
Organizations are doing their best to handle this ever-increasing volume of data.
Example:
- Every minute, more than 571 new websites are created.
- A Boeing 737 will generate 240 terabytes of flight data during a single flight across the US.
Velocity
Velocity describes the rate at which data is generated, captured and shared.

Information processing systems face a problem: data keeps adding up but cannot be processed quickly enough.

Example: eBay analyses around 5 million transactions per day in real time to detect and prevent frauds arising from the use of PayPal.

Sources of high-velocity data:

IT devices, including routers, firewalls, switches, etc., generate valuable data.

Social media, including Facebook posts and tweets, creates huge amounts of data, which must be analyzed at fast speed because its value degrades quickly with time.
Variety
• Refers to structured, unstructured, and semi-structured data that is gathered from multiple sources and comes in different formats, such as images, text, videos, etc.
• While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio files, social media posts, and much more.
In short: the simple 4 V's
Why Big Data?

• 1. Understanding and Targeting Customers
• 2. Understanding and Optimizing Business Processes
• 3. Personal Quantification and Performance Optimization
• 4. Improving Healthcare and Public Health
• 5. Improving Sports Performance
• 6. Improving Science and Research
• 7. Optimizing Machine and Device Performance
• 8. Improving Security and Law Enforcement
• 9. Improving and Optimizing Cities and Countries
• 10. Financial Trading
Types of Data
• Big data comprises:
- Structured data
- Unstructured data
- Semi-structured data
Structured Data
• Is organized data in a predefined format
• Is stored in tabular form
• Is the data that resides in fixed fields within a record or file
• Is formatted data that has entities and their attributes mapped
• Is used to query and report against predetermined data types
• SQL is used for managing and querying it – structured data represents only 5 to 10% of all data
• When data grows beyond the capacity of an RDBMS, it can be stored and analyzed in data warehouses, but only up to a certain limit
Example – Sample of Structured Data

Customer ID | Name   | Product ID | City      | State
123         | Jack   | 4689       | Graz      | Styria
321         | Sandy  | 5688       | Wolfsberg | Carinthia
459         | Robert | 459        | Enns      | Upper Austria
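Since the slides note that SQL is used for managing and querying structured data, here is a minimal sketch of the idea using Python's built-in sqlite3 module. The table and column names simply mirror the sample table above and are illustrative only.

```python
# Minimal sketch: querying structured data with SQL via Python's built-in
# sqlite3 module. Table and column names mirror the sample table above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customers (
    customer_id INTEGER, name TEXT, product_id INTEGER, city TEXT, state TEXT)""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [(123, "Jack", 4689, "Graz", "Styria"),
     (321, "Sandy", 5688, "Wolfsberg", "Carinthia"),
     (459, "Robert", 459, "Enns", "Upper Austria")])

# Predefined fields make querying and reporting straightforward.
for row in conn.execute("SELECT name, city FROM customers WHERE state = ?",
                        ("Styria",)):
    print(row)   # ('Jack', 'Graz')
```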
Unstructured Data

• Lacks a predefined structure.
• About 85% of total data is unstructured.
Examples:
• e-mail messages
• word processing documents
• videos, photos, audio files, presentations
• web pages
• other kinds of business documents
Some sources of unstructured data include:
• Text both internal and external to an organization – documents, logs, survey results, feedback, and e-mails from both within and across the organization.

• Social media – data obtained from social networking platforms, including YouTube, Facebook, Twitter, LinkedIn, and Flickr.

• Mobile data – data such as text messages and location information. About 80 percent of enterprise data consists of unstructured content.
Semi-Structured Data
• Also known as having a schema-less or self-describing structure.
• Refers to a form of structured data that contains tags in order to separate elements and generate hierarchies of records and fields in the given table.

Sl No | Name                                  | E-Mail
1     | Sam                                   | [email protected]
2     | First Name: David, Second Name: Brown | [email protected]
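As a hedged sketch of what "self-describing" means in practice, the two records from the table above can be expressed as JSON, where each record carries its own tags and the records need not share an identical structure. (The e-mail values below are placeholders, since the originals are redacted in the source.)

```python
# Minimal sketch: semi-structured records are self-describing; each record
# carries its own tags, and records need not share an identical structure.
import json

records = json.loads("""
[
  {"sl_no": 1, "name": "Sam", "email": "sam@example.com"},
  {"sl_no": 2, "name": {"first": "David", "second": "Brown"},
   "email": "david@example.com"}
]
""")

for rec in records:
    name = rec["name"]
    # The tag structure tells us how to interpret each record.
    if isinstance(name, dict):
        name = f"{name['first']} {name['second']}"
    print(rec["sl_no"], name, rec["email"])
```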
Unstructured Data and Big Data
• It is estimated that 90 percent of big data is unstructured data, and many big data tools are designed for it.
• Unstructured data is the opposite of structured data.
• Structured data generally resides in a relational database, and so is sometimes called "relational data." This type of data is easily mapped into pre-designed fields.
• By contrast, unstructured data is not relational and doesn't fit into these sorts of pre-defined data models.
• Semi-structured data is information that doesn't reside in a relational database but does have some organizational properties that make it easier to analyze.
• Examples of semi-structured data include XML documents and NoSQL databases.
• The term "big data" is closely associated with unstructured data.
• Big data refers to extremely large datasets that are difficult to analyze with traditional tools.
Implementing Unstructured Data Management:
Software tools to help organize and manage unstructured data

• Big data tools: Software like Hadoop can process stores of both unstructured and structured data that are extremely large, very complex and changing rapidly.

• Business intelligence (BI) software: This is a broad category of analytics, data mining, dashboards and reporting tools that help companies make sense of their structured and unstructured data for the purpose of making better business decisions.

• Data integration tools: These tools combine data from disparate sources so that they can be viewed or analyzed from a single application. They sometimes include the capability to unify structured and unstructured data.
Implementing Unstructured Data Management:
Software tools to help organize and manage unstructured data

• Document management systems: A DMS can track, store and share unstructured data that is saved in the form of document files.

• Information management solutions: This type of software tracks structured and unstructured enterprise data throughout its lifecycle.

• Search and indexing tools: These tools retrieve information from unstructured data files such as documents, web pages and photos.
Industry examples of big data
• 1.) Retail – Good customer service and building customer relationships are vital in the retail industry. The best way to build and maintain this service and relationship is through big data analysis. Retail companies need to understand the best techniques to market their products to their customers, the best process to manage transactions and the most efficient and strategic way to bring back lapsed customers in such a competitive industry.
• 2.) Banking – Due to the amount of data streaming into banks from a wide variety of channels, the banking sector needs new means to manage big data. Of course, like the retail industry and all others, it is important to build relationships, but banks must also minimise fraud and risk whilst at the same time maintaining compliance.
• 3.) Manufacturing – Manufacturers can use big data to boost their productivity whilst also minimising wastage and costs – processes which are welcomed in all sectors but vital within manufacturing. There has been a large cultural shift by many manufacturers to embrace analytics in order to make more speedy and agile business decisions.
• 4.) Education – Schools and colleges which use big data analysis can make large positive differences to the education system, its employees and students. By analysing big data, schools are supplied with the intelligence needed to implement a better system for evaluating and supporting teachers, making sure students are progressing and identifying at-risk pupils.
• 5.) Government – Government has a large scope to make changes to the community we live in as a whole when utilising big data, such as dealing with traffic congestion, preventing crime, running agencies and managing utilities. Governments, however, need to address the issues of privacy and transparency.
• 6.) Health Care – Health care is one industry where lives could be at stake if information isn't quick, accurate and, in some cases, transparent enough to satisfy strict industry regulations. When big data is analysed effectively, health care providers can uncover insights that can find new cures and improve the lives of everyone.
Web Analytics

• Web analytics is the measurement, collection, analysis and reporting of web data for purposes of understanding and optimizing web usage.
• Web analytics is not just a tool for measuring web traffic; it can also be used as a tool for business and market research, and to assess and improve the effectiveness of a web site.
• The following are some of the web analytics metrics:
• Hit, Page View, Visit/Session, First Visit/First Session, Repeat Visitor, New Visitor, Bounce Rate, Exit Rate, Page Time Viewed/Page Visibility Time/Page View Duration, Session Duration/Visit Duration, Average Page View Duration, and Click Path, etc.
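To make two of these metrics concrete, here is a minimal sketch (the log format and event values are invented for illustration) computing bounce rate and average session duration from page-view events:

```python
# Minimal sketch (hypothetical log format): computing two of the metrics
# above -- bounce rate and average session duration -- from page-view events.
from collections import defaultdict

# Each event: (session_id, timestamp_in_seconds, page_url)
events = [
    ("s1", 0, "/home"), ("s1", 40, "/products"), ("s1", 95, "/checkout"),
    ("s2", 10, "/home"),                      # single page view -> a bounce
    ("s3", 5, "/blog"), ("s3", 65, "/home"),
]

sessions = defaultdict(list)
for sid, ts, _page in events:
    sessions[sid].append(ts)

# Bounce rate: fraction of sessions with only one page view.
bounces = sum(1 for ts_list in sessions.values() if len(ts_list) == 1)
bounce_rate = bounces / len(sessions)

# Session duration: time between first and last event in a session.
durations = [max(ts) - min(ts) for ts in sessions.values() if len(ts) > 1]
avg_session_duration = sum(durations) / len(durations)

print(f"Bounce rate: {bounce_rate:.0%}")                       # 33%
print(f"Avg session duration: {avg_session_duration:.0f}s")    # 78s
```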
Why use big data tools to analyse web analytics data?
• Web event data is incredibly valuable:

• It tells you how your customers actually behave and how that varies.

• It tells you how customers engage with you via your website/webapp.

• It tells you how customers engage with your different marketing campaigns and how that drives subsequent behaviour.

• Deriving value from web analytics data often involves very custom-made analytics.

• The web is a rich and varied space! E.g.:
• Bank, newspaper, social network, analytics application, government organisation (e.g. tax office), retailer, marketplace.
Why use big data tools to analyse web analytics data?
• Web analytics tools are good at delivering the standard reports that are common across different business types…

• Where does your traffic come from? E.g.:
• Sessions by marketing campaign/referrer, sessions by landing page.

• Understanding events common across business types (page views, transactions, 'goals').

• Capturing contextual data common to people browsing the web:
• Timestamps, referrer data, web page data (e.g. page title, URL), browser data (e.g. type, plugins, language).
• Operating system (e.g. type, timezone), hardware (e.g. mobile/tablet/desktop, screen resolution, colour depth).
Big Data and Marketing

• Big Data and the New School of Marketing

• Consumers Have Changed. So Must Marketers.

• The Right Approach: Cross-Channel Lifecycle Marketing

• Social and Affiliate Marketing

• Empowering Marketing with Social Intelligence
Fraud and Big Data
Use of Big Data in Fraudulent Activities

• Most common types of financial fraud:
(a) Credit card fraud
(b) Exchange or return policy fraud – Amazon/Flipkart
(c) Personal information fraud – stealing login information and ordering products online. The actual customer keeps calling the retailer to refund the amount, as he has not made the transaction.
• Traditional measures are difficult to implement for the above frauds if the data is huge.
• Using Big Data techniques, an organization can manage and prevent fraud.

Data sources for controlling frauds
• Database of all existing frauds
• Recognized potential fraud patterns
• Modeled behaviors of hackers
• System- and network-generated data which can highlight irregular patterns (e.g., the speed at which a password is typed)
Preventing Fraud using Big Data Analytics
Analyzing Big Data allows organizations to:
• Keep track of and process huge volumes of data.
• Differentiate between real and fraudulent entries.
• Identify new methods of fraud and add them to the list of fraud-prevention checks.
• Verify whether a product has actually been delivered to a valid recipient.
• Determine the location of the customer and the time when the product was actually delivered.
Use of Big Data in Detecting Fraudulent Activities in the Insurance Sector
• An insurance company wants to improve its ability to take decisions while processing claims.
• It decides to implement a Big Data analytical platform, which will use data from social media to provide a real-time view of the case in hand.
• The information obtained will enable the insurance agent to diagnose the patterns of the customer's claims, behavior and other issues.
Example: In some cases, social media could also provide great triggers to identify fraud – a customer might indicate that his car was destroyed in a flood, but the documentation from the social media feed may show that the car was actually in another city on the day the flood occurred.
Fraud Detection
• Fraudulent claims were traditionally identified by insurance companies using statistical models, which can prevent fraud only to a limited extent.

Sources of data to regulate frauds:
• Bank statements
• Legal judgements
• Criminal records
• Medical bills
• Organizational data
Other Big Data approaches to detect fraud
• Social Network Analysis (SNA)
• Predictive Analysis
• Social Customer Relationship Management (CRM)

Social Network Analysis (SNA)
• SNA is an innovative way to identify and detect frauds.
• An SNA tool uses a mix of analytical methods, including statistical methods, pattern analysis and link analysis, to identify relationships or patterns within large amounts of data collected from different sources.
• When link analysis is used in fraud detection, one looks for clusters of data and how these clusters are linked to other data clusters.
Fraud detection using SNA method
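As a rough illustration of the link-analysis idea above, the sketch below links claims that share an attribute (phone number or address) and flags unusually large connected clusters. The claim data, linking attributes, and cluster-size threshold are all invented for illustration; the graph handling uses the networkx library.

```python
# Minimal sketch of link analysis for fraud detection: claims sharing an
# attribute (phone, address) are linked; unusually large connected clusters
# are flagged for review. Data and threshold are illustrative assumptions.
import networkx as nx

claims = {
    "claim1": {"phone": "555-0101", "address": "12 Oak St"},
    "claim2": {"phone": "555-0101", "address": "9 Elm Rd"},
    "claim3": {"phone": "555-0101", "address": "12 Oak St"},
    "claim4": {"phone": "555-0999", "address": "3 Pine Ave"},
}

G = nx.Graph()
G.add_nodes_from(claims)
ids = list(claims)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        # Link two claims if they share any attribute value.
        shared = [k for k in claims[a] if claims[a][k] == claims[b][k]]
        if shared:
            G.add_edge(a, b, shared=shared)

for cluster in nx.connected_components(G):
    if len(cluster) >= 3:   # illustrative threshold
        print("Suspicious cluster:", sorted(cluster))
# -> Suspicious cluster: ['claim1', 'claim2', 'claim3']
```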
Risk and big data
• Many of the world's top analytics professionals work in risk management.

• Risk management is data-driven – without advanced data analytics, modern risk management would simply not exist.

• The two most common types of risk management are credit risk management and market risk management. A third type, operational risk management, isn't as common as credit and market risk.

• The tactics for risk professionals typically include avoiding risk, reducing the negative effect or probability of risk, or accepting some or all of the potential consequences in exchange for a potential upside gain.

• Credit risk analytics focus on past credit behaviors to predict the likelihood that a borrower will default on any type of debt by failing to make the payments they are obligated to make. For example, "Is this person likely to default on their $300,000 mortgage?"

• Market risk analytics focus on understanding the likelihood that the value of a portfolio will decrease due to changes in stock prices, interest rates, foreign exchange rates, and commodity prices. For example, "Should we sell this holding if the price drops another 10 percent?"
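To make the credit risk question concrete, here is a hedged sketch of default-probability scoring with a logistic regression (a common choice for this task, though not the only one). The features and the tiny training set are invented for illustration; scikit-learn supplies the model.

```python
# Minimal sketch of credit risk scoring: a logistic regression over past
# credit behaviour estimates the probability that a borrower will default.
# Features and the tiny training set are invented for illustration.
from sklearn.linear_model import LogisticRegression

# Features per borrower: [late_payments_last_year, debt_to_income_ratio]
X = [[0, 0.2], [1, 0.3], [5, 0.8], [4, 0.7], [0, 0.1], [6, 0.9]]
y = [0, 0, 1, 1, 0, 1]   # 1 = defaulted on past debt

model = LogisticRegression().fit(X, y)

# "Is this person likely to default on their $300,000 mortgage?"
applicant = [[3, 0.6]]
print("P(default) =", round(model.predict_proba(applicant)[0][1], 2))
```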
Credit risk management
Use of Big Data in Detecting Fraudulent Activities in the Retail Sector
Retail fraud:
It is an illegal transaction that a fraudster performs using stolen credit card details or loopholes in the order placement and payment systems and company policies. As technology grew, so did the fraudsters' sophistication in executing frauds online.
Types of retail fraud:
(a) Transaction fraud
(b) Return fraud
(c) Chargeback guarantee fraud
Types of Retail fraud
• Transaction fraud
It is also called card-not-present (CNP) fraud, where the fraudster uses a stolen credit card for online purchases. The company loses money when the original owner of the card demands a chargeback.
• Return fraud
Example – the e-commerce industry.
• Chargeback guarantee fraud
Many online retail fraud prevention solutions guarantee that they will block all fraudulent transactions and friendly frauds, and even pay the admin fee out of their own pocket. The problem arises when the solution blocks even legitimate customers. This is called a false positive, which not only damages your reputation but also results in loss of revenue.
Use of Big Data in Detecting Fraudulent Activities in the Retail Sector – Fraud Detection in Real Time
• Big Data helps to detect frauds in real time.
Example:
(a) In an online transaction, a Big Data system would compare the incoming IP address with the geotag received from the customer's smartphone apps. A valid match between the two confirms the authenticity of the transaction.
(b) It also examines the entire historical data to track suspicious patterns in customer orders.
Big Data analysis is performed in real time by retailers to know the actual time the product was delivered.
Costly products have sensors attached to transmit their location information, thereby preventing frauds.
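A minimal sketch of the check in example (a) follows: compare the location implied by the transaction's IP address with the geotag from the customer's smartphone app. The coordinates, the IP-geolocation lookup, and the 100 km threshold are illustrative assumptions.

```python
# Minimal sketch of the real-time check in example (a): compare the location
# implied by the transaction's IP address with the geotag reported by the
# customer's smartphone app. Coordinates and threshold are illustrative.
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in km."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def looks_authentic(ip_location, app_geotag, max_km=100):
    return haversine_km(*ip_location, *app_geotag) <= max_km

ip_location = (13.08, 80.27)   # from a (hypothetical) IP-geolocation lookup
app_geotag = (13.06, 80.25)    # reported by the customer's smartphone app
print(looks_authentic(ip_location, app_geotag))   # True -> likely genuine
```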
Credit Risk Framework
Big Data and Algorithmic Trading
Algorithmic trading is the use of computer programs for entering trading orders, in which computer programs decide on almost every aspect of the order, including the timing, price, and quantity.

Many investment banks use algorithmic trading, a highly sophisticated set of processes in which "insights" are made "actionable" via automated "decisions."

Algorithmic trading relies on sophisticated mathematics to determine buy and sell orders for equities, commodities, interest rate and foreign exchange products, derivatives, and fixed income instruments at blinding speed.
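As a toy illustration of an automated trading rule (a simple moving-average crossover, invented here for exposition and not any bank's actual strategy), a program can turn price data directly into buy/sell decisions:

```python
# Toy sketch of an automated trading rule (a moving-average crossover), not
# a real trading system: when the short-term average of the price rises
# above the long-term average, emit a buy signal; when below, a sell.
prices = [100, 101, 103, 102, 105, 107, 106, 104, 101, 99]

def moving_avg(series, window):
    return sum(series[-window:]) / window

for t in range(5, len(prices)):
    history = prices[:t]                 # only data available at time t
    short = moving_avg(history, 3)
    long_ = moving_avg(history, 5)
    signal = "BUY" if short > long_ else ("SELL" if short < long_ else "HOLD")
    print(f"t={t} price={prices[t]} signal={signal}")
```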
Big Data and Algorithmic Trading

• A key component of algorithmic trading is determining the return and the risk of each potential trade, and then making a decision to buy or sell.
• Quantitative risk analysts help banks develop trading rules and implement these rules using modern technology.

• Algorithmic trading involves a huge number of transactions with complex interdependent data, and every millisecond matters.
Big Data and Algorithmic Trading
• Crunching Through Complex Interrelated Data
• Banks are moving from daily evaluation of risk to intra-day risk evaluations.
• Intraday Risk Analytics, a Constant Flow of Big Data
• To maintain competitive advantage, banks need to continuously evaluate their models.
• Calculating Risk in Marketing
• Banks are using risk predictive analytics for marketing.
• Other Industries Benefit from Financial Services' Risk Experience
• Outside of financial services, other industries can benefit from this work, such as retail, media, and telecommunications.
Big Data and Advances in Health Care
• Big Data promises an enormous revolution in health care, with important advancements in everything from the management of chronic disease to the delivery of personalized medicine.

• In addition to saving and improving lives, Big Data has the potential to transform the entire health care system by replacing guesswork and intuition with objective, data-driven science.
Big Data and Advances in Health Care
Data in the World of Health Care
This exponential growth in data is further fueled by the digitization of patient-level data:

• Stored in Electronic Health Records (EHRs) and Health Information Exchanges (HIEs),
• Enhanced with data from imaging and test results,
• Medical and prescription claims, and
• Personal health devices.
Disruptive Analytics – Health Care
• Data science and disruptive analytics can have an immediate beneficial impact.

• Transformation of the health care system will come through Big Data-driven decisions and improved insights.

• Patient outcomes tied to costs, derived from multiple health care Big Data assets, will become the common currency across all health care sectors.
Disruptive Analytics – Health Care

• Big Data analytics makes it possible to transform this vision into reality, creating a transparent approach to pharmaceutical decision making based on the aggregation and analysis of health care data such as electronic medical records and insurance claims data.
• Health care Big Data analytics presents an opportunity to unify the health care value chain in a way not achieved to date – a virtual form of unification with significant benefits for all stakeholders.
• Creating a health care analytics framework has significant value for individual stakeholders.
Big Data and Advances in Health Care
• For providers (physicians), there is an opportunity to build analytics systems for evidence-based medicine (EBM) – sifting through clinical and health outcomes data to determine the best clinical protocols that provide the best health outcomes for patients and create defined standards of care.

• For producers (pharmaceutical and medical device companies), there is an opportunity to build analytics systems to enable translational medicine (TM) – integrating externally generated postmarketing safety, epidemiology, and health outcomes data with internally generated clinical and discovery data (sequencing, expression, biomarkers) to enable improved strategic R&D decision making across the pharmaceutical value chain.
Big Data and Advances in Health Care
• For payers (i.e., insurance companies), there is an opportunity to create analytics systems to enable comparative effectiveness research (CER) that will be used to drive reimbursement –

• Mining large collections of claims, health care records (EMR/EHR), and economic, geographic, and demographic data sets to determine what treatments and therapies work best for which patients, in which context, and with what overall economic and outcomes benefits.
A Holistic Value Proposition
• It is the ability to collect, integrate, analyze and manage this data that makes health care data such as EHRs/EMRs valuable.
• Longitudinal patient data is one source of the raw material on which evidence-based insight approaches can be built to enable health care reform.
• To date, there has been little attempt to "see where the data takes us" and create a holistic health care value proposition built on quantifiable evidence that clarifies business value for all stakeholders.
• Through client relationships across the health care ecosystem, unique partnerships are being facilitated across payer, provider, pharma, and federal agencies, to work on problems of health care data analytics together and create value for all health care stakeholders.
• A "big data" approach to the analysis of health care data means creating methods and platforms for the analysis of large volumes of disparate kinds of data – clinical, EMR, claims, labs, etc. – to better answer questions of outcomes, epidemiology, safety, effectiveness, and pharmacoeconomic benefit.
A Holistic Value Proposition
• By "big data," we also mean that health care data sets are big enough to obscure underlying meaning;

• Traditional methods of storing, accessing, and analyzing those data are breaking down;

• Large-scale analytics are needed for critical decision making, specifically in the face of cost containment, regulatory burden, and the requirements of health care reform.
Pioneering New Frontiers in Medicine
• The new era of data-driven health care is already producing new kinds of applications.

• One application uses Big Data analytics to identify the genetic variations that predispose individuals to multiple sclerosis (MS).

• The algorithms identify the interactions (between environmental factors and diseases) and also have rapid search techniques built into them.

• They are also able to do statistical analysis.
Big data technologies
Introduction to Hadoop

• Hadoop is a distributed system, like a distributed database.
• Hadoop is a 'software library' that allows its users to process large datasets across distributed clusters of computers, thereby enabling them to gather, store and analyze huge sets of data.
• It provides various tools and technologies, collectively termed the Hadoop Ecosystem.
Hadoop Multi-node Cluster Architecture

A Hadoop cluster consists of a single Master Node and multiple Worker Nodes.
• Master Node
- NameNode
- JobTracker
• Worker Node
- DataNode
- TaskTracker
In a larger cluster, HDFS is managed through a NameNode server that hosts the file system index.

A Secondary NameNode keeps snapshots of the NameNode's state. At the time of failure of the NameNode, the Secondary NameNode replaces the primary NameNode, thus preventing the file system from getting corrupted and reducing data loss.
Hadoop Multi-node Cluster Architecture
• If a data node goes down while processing is going on, the NameNode should know that some data node is down in the cluster; otherwise it can't continue processing.

• Each DataNode periodically sends a "Heartbeat Signal" to the NameNode, to make the NameNode aware of the active/inactive status of DataNodes – the Heartbeat Mechanism.
Ref:
https://www.geeksforgeeks.org/how-does-namenode-handles-datanode-failure-in-hadoop-distributed-file-system/?ref=lbp
HDFS and MapReduce

Apache Hadoop has two main components:
- The Hadoop Distributed File System (HDFS) – used for storage.
- MapReduce – used for processing.
Hadoop Distributed File System (HDFS)

• Fault-tolerant storage system.
• Stores large files, from terabytes to petabytes.
• Attains reliability by replicating the data over multiple hosts; the default replication value is 3.
• A file in HDFS is split into large blocks (64 MB by default in early Hadoop versions).
• Each block is independently replicated at multiple data nodes.
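The block and replication figures above imply some simple storage arithmetic, sketched here for a hypothetical 1 GB file:

```python
# Quick arithmetic for the HDFS figures above: a file is split into 64 MB
# blocks, and each block is replicated 3 times across data nodes.
import math

file_mb = 1000             # a hypothetical 1 GB file
block_mb, replication = 64, 3

blocks = math.ceil(file_mb / block_mb)      # 16 blocks (last one partial)
raw_storage_mb = file_mb * replication      # every byte is stored 3 times

print(f"{blocks} blocks of up to {block_mb} MB; "
      f"~{raw_storage_mb} MB of raw cluster storage")
```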
Hadoop Distributed File System (HDFS)
• HDFS is like a tree in which there is a name node (the master) and data nodes (workers).
• The name node is connected to the data nodes, also known as commodity machines, where data is stored.
• The name node contains the job tracker, which manages all the filesystems and the tasks to be performed.
Understanding HDFS using Legos:
https://www.youtube.com/watch?v=4Gfl0WuONMY
Features of Hadoop

• Open Source
• Highly Scalable Cluster
• Fault tolerance is available
• Flexible
• Easy to use
• Provides faster data processing
MapReduce
• Used for processing large distributed datasets in parallel.
• MapReduce is a process of two phases:
(i) The Map phase takes in a set of data which is broken down into key-value pairs.
(ii) The Reduce phase – the output from the Map phase goes to the Reduce phase as input, where it is reduced to smaller key-value pairs.
• The key-value pairs given out by the Reduce phase are the final output of the MapReduce process.
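The canonical example of these two phases is word count, sketched below in plain Python. This mimics the MapReduce model for illustration only; it is not Hadoop's actual (Java) API.

```python
# The canonical MapReduce example -- word count -- sketched in plain Python
# to show the two phases (a model of MapReduce, not Hadoop's actual API).
from itertools import groupby

def map_phase(document):
    # Map: break the input into (key, value) pairs -- here (word, 1).
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/sort: group pairs by key; Reduce: sum each group's values.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

docs = ["big data is big", "data is everywhere"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
print(dict(reduce_phase(pairs)))
# {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```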
MapReduce
• Hadoop accomplishes its operations (dividing the computing tasks into subtasks that are handled by individual nodes) with the help of the MapReduce model, which comprises two functions – Mapper and Reducer.
• Mapper function – responsible for mapping the computational subtasks to different nodes.
• Reducer function – responsible for reducing the responses from compute nodes to a single result.
• In the MapReduce algorithm, the operations of distributing tasks across various systems, handling task placement for load balancing and managing failure recovery are accomplished by the mapper function.
• The reducer function aggregates all the elements together after the completion of the distributed computation.
Cloud Computing and Big Data

• Cloud computing is the delivery of computing services – servers, storage, databases, networking, software, analytics and more – over the Internet ("the cloud").
• Companies offering these computing services are called cloud providers and typically charge for cloud computing services based on usage, similar to how you are billed for water or electricity at home.
Cloud Computing and Big Data
• The cloud computing environment saves infrastructure-related costs for an organization by providing a framework that can be optimized and expanded horizontally.
• Acquiring resources in accordance with requirements, and paying only for what is used, is known as elasticity.
• Cloud computing regulates the use of computing resources, i.e., payment is made only for the resources actually accessed.
• An organization needs to plan, monitor and control its resource utilization carefully; otherwise it can face unexpectedly high costs.
Cloud Computing and Big Data
• The cloud computing technique uses data centres to collect the data and ensures that data backup and recovery are automatically performed to meet the organization's requirements.
• Both cloud computing and big data analytics use the distributed computing model in a similar manner.
Features of Cloud Computing
Scalability
• Addition of new resources to an existing infrastructure.
• The increase in the amount of data requires organizations to improve the processing ability of their hardware.
• New hardware may not provide complete support to the software that used to run properly on the earlier set of hardware.
• The solution to this problem is using cloud services, which employ the distributed computing technique to provide scalability.
Elasticity
• Hiring certain resources, as and when required, and paying for those resources.
• No extra payment is required for acquiring specific cloud services.
Fault Tolerance
• Offering uninterrupted services to customers, especially in cases of component failure.
Features of Cloud Computing
Resource Pooling
• Multiple organizations which use similar kinds of resources to carry out computing practices have no need to individually hire all the resources.
• The sharing of resources is allowed in a cloud, which facilitates cost cutting through resource pooling.
Self Service
• Cloud computing involves a simple user interface that helps customers directly access the cloud services they want.
Low Cost
• Cloud offers customized solutions, especially to organizations that cannot afford too much initial investment.
• Cloud provides a pay-as-you-use option, in which organizations need to sign up only for those resources that are essential.
Cloud Deployment Models
• Depending upon the architecture used in forming the network, the services and applications used, and the target consumers, cloud services form various deployment models. They are:
• Public Cloud
• Private Cloud
• Community Cloud
• Hybrid Cloud
Public Cloud (End-User Level Cloud)

• Owned and managed by a company other than the one using it.
• Administered by a third party.
- E.g.: Verizon, Amazon Web Services, and Rackspace.
• The workload is categorized on the basis of service category; hardware customization is possible to provide optimized performance.
• The process of computing becomes very flexible and scalable through customized hardware resources.
• The primary concerns with a public cloud include security and latency.
Private Cloud (Enterprise Level Cloud)
• Remains entirely in the ownership of the organization using it.
• Infrastructure is solely designed for a single organization.
• Can automate several processes and operations that require manual handling in a public cloud.
• Can also provide firewall protection to the cloud, solving latency and security concerns.
• A private cloud can be either on-premises or hosted externally.
• On-premises: the service is exclusively used and hosted by a single organization.
• Hosted externally: the service is used by a single organization and is not shared with other organizations.
Community Cloud

• A type of cloud that is shared among various organizations with a common tie.
• Managed by third-party cloud services.
• Available on or off premises.
Example:
In any state, a community cloud can be provided so that almost all government organizations of that state can share the resources available on the cloud. Because of this sharing of resources, the data of all citizens of that state can be easily managed by the government organizations.
Hybrid Cloud
• Various internal or external service providers offer services to many organizations.
• In hybrid clouds, an organization can use both types of cloud, i.e. public and private together, in situations such as cloud bursting.
• The organization uses its own computing infrastructure for normal load and accesses the public cloud for high-load requirements.
• The organization using the hybrid cloud can manage an internal private cloud for general use and migrate the entire application, or part of it, to the public cloud when needed.
Cloud services for Big Data
• In big data, IaaS (Infrastructure as a Service), PaaS (Platform as a Service), and SaaS (Software as a Service) clouds are used in the following manner:
• IaaS – The huge storage and computational power requirements of big data are fulfilled by the limitless storage space and computing ability offered by an IaaS cloud.
• PaaS – Offerings of various vendors have started adding popular big data platforms such as MapReduce and Hadoop. These offerings save organizations from a lot of hassles which occur in managing individual hardware components and software applications.
• SaaS – Various organizations require identifying and analyzing the voice of customers, particularly on social media. Social media data and platforms are provided by SaaS vendors. In addition, a private cloud facilitates access to enterprise data, which enables these analyses.
• Cetas: Cetas is a stealth-mode startup focused on providing an entire analytics stack in the cloud (or on-premise, if a customer prefers). The driving theory is to let companies running web applications get the types of user analytics that Facebook and Google are able to get, only without the teams of expensive engineers and data scientists, Cetas VP of Products Karthik Kannan told me. While most of that functionality is prepackaged now into core capabilities, Kannan said Cetas plans to let power-users build their own custom models and tie Cetas into existing analytic platforms.
• Other providers include Infochimps, Google, and Microsoft.
Open source technologies
If you come up with an idea, you can put it to work immediately. That's the advantage of the open-source stack – flexibility, extensibility, and lower cost.

Open-source projects are managed and supported by commercial companies, such as Cloudera, that provide extra capabilities, training, and professional services supporting open-source projects such as Hadoop. This is similar to what Red Hat has done for the open-source project Linux.
Mobile Business Intelligence
• Analytics on mobile devices is what some refer to as putting BI in your pocket.
• Mobile drives straight to the heart of simplicity and ease of use, which has been a major barrier to BI adoption since day one.
• Mobile devices are a great leveling field where making complicated actions easy is the name of the game.
• For example, a young child can use an iPad but not a laptop. As a result, mobile will drive broad-based adoption as much for the ease of use as for the mobility these devices offer.
• This will have an immense impact on the business intelligence sector.
Viability of Mobile BI
Three elements that have impacted the viability of mobile BI:
• 1. Location – the GPS component and location . . . know where you are in time as well as the movement.
• 2. It's not just about pushing data; you can transact with your smart phone based on information you get.
• 3. Multimedia functionality allows the visualization pieces to really come into play.
Three challenges with mobile BI include:
• 1. Managing standards for rolling out these devices.
• 2. Managing security (always a big challenge).
• 3. Managing "bring your own device," where you have both devices owned by the company and devices owned by the individual.
Crowdsourcing Analytics
• Crowdsourcing is a recognition that you can't possibly always have the best and brightest internal people to solve all your big problems.
• Crowdsourcing is a great way to capitalize on the resources that can build algorithms and predictive models.
• It takes years of learning and experience to gain the knowledge needed to create algorithms and predictive models.
• So crowdsourcing is a way to capitalize on the limited resources that are available in the marketplace.
• By creating an open, competitive environment with clear rules and goals, organizations can also create a lot of buzz about themselves in the process.
• Crowdsourcing is a disruptive business model whose roots are in technology but which is extending beyond technology to other areas.
• There are various types of crowdsourcing, such as crowd voting, crowd purchasing, wisdom of crowds, crowd funding, and contests.
• For example:
– Kaggle has developed a remarkably effective global platform for crowdsourcing thorny analytic problems.
– 99designs.com/, which does crowdsourcing of graphic design
– agentanything.com/, which posts "missions" where agents vie to run errands
– 33needs.com/, which allows people to contribute to charitable programs that make a social impact
Inter- and Trans-Firewall Analytics

Decision science is witnessing a similar trend, as enterprises are beginning to collaborate on insights across the value chain.

For instance, in the health care industry, rich consumer insights can be generated by collaborating on data and insights from the health insurance provider, the pharmacy delivering the drugs, and the drug manufacturer.

For example, there are instances where a retailer and a social media company can come together to share insights on consumer behavior that will benefit both players.
Some of the more progressive companies are taking this a step further and working on leveraging the large volumes of data outside the firewall, such as social data, location data, and so forth.
It will not be very long before internal data and insights from within the firewall are no longer a differentiator. We see this trend as the move from intra- to inter- and trans-firewall analytics.
Today companies are doing intra-firewall analytics with data within the firewall.

Tomorrow they will be collaborating on insights with other companies to do inter-firewall analytics, as well as leveraging the public domain spaces to do trans-firewall analytics.
Value Chain for Inter-Firewall and Trans-Firewall Analytics
Advertising and Big Data: From Papyrus to Seeing Somebody
• Big Data Feeds the Modern-Day Don Draper
• To get a feel for how Big Data is impacting the advertising market, we sat down with Randall Beard, who currently leads Nielsen's Global Advertiser Solutions. Essentially, what Beard's team does is connect what people watch and what people buy, to help their clients optimize advertising and media return on investment. The Nielsen experience is great, but the best part of interviewing Beard is that before Nielsen he actually worked on the client side for 25 years at companies such as the big advertising spender P&G. Needless to say, he knows his stuff.
Reach, Resonance, and Reaction

• Beard explained that big data is now changing the way advertisers address three related needs:
• 1. How much do I need to spend?
• 2. How do I allocate that spend across all the marketing communication touch points?
• 3. How do I optimize my advertising effectiveness against my brand equity and ROI in real time?
• Beard explained the three guiding principles of measurement:
• 1. End-to-end measurement – reach, resonance and reaction
• 2. Across platforms (TV, digital, print, mobile, etc.)
• 3. Measured in real time (when possible)

The Need to Act Quickly (Real-Time When Possible)
Measurement Can Be Tricky
Content Delivery Matters Too

• Marketing mixed modeling (MMM) is a tool that helps advertisers understand the
impact of their advertising and other marketing activities on sales results. MMM can
generally provide a solid understanding of the relative performance of advertising by
medium (e.g., TV, digital, print, etc.), and in some cases can even measure sales
performance by creative unit, program genre, website, and so on.
•Now, we can also measure the impact on sales in social media and we do that
through market mixed modeling. Market mixed modeling is a way that we can take all
the different variables in the marketing mix—including paid, owned, and earned media
—and use them as independent variables that we regress against sales data and trying
to understand the single variable impact of all these different things.
•Since these methods are quite advanced, organizations use high-end internal
analytic talent and advanced analytics platforms such as SAS or point solutions such as
Unica and Omniture. Alternatively, there are several boutique and large analytics
providers like Mu Sigma that supply it as a software-as-a-service (SaaS).
•MMM is only as good as the marketing data that is used as inputs. As the world
becomes more digital, the quantity and quality of marketing data is improving, which is
leading to more granular and insightful MMM analyses.
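The "regress against sales data" step can be sketched as an ordinary least-squares fit. The channels and all weekly figures below are invented for illustration; numpy does the fitting.

```python
# Minimal sketch of marketing mix modeling: regress sales against paid,
# owned, and earned media variables to estimate each channel's impact.
# The weekly figures below are invented for illustration.
import numpy as np

# Columns: intercept, TV spend, digital spend, social mentions (per week)
X = np.array([
    [1, 10, 5, 200], [1, 12, 6, 250], [1, 8, 7, 300],
    [1, 15, 4, 150], [1, 9, 8, 350], [1, 11, 5, 220],
], dtype=float)
sales = np.array([120, 135, 128, 140, 138, 126], dtype=float)

coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
print("intercept, tv, digital, social =", coef.round(2))
# Each coefficient estimates the single-variable impact on sales,
# holding the other channels constant.
```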
• Using Consumer Products as a Doorway
