Unit 1
UNDERSTANDING
BIG DATA
• Big data usually includes data sets with sizes beyond the
ability of commonly used software tools to capture, curate,
manage, and process the data within a tolerable elapsed
time.
IoT Appliance
• Electronic devices connected to the internet create data for their smart
functionality. Example: Samsung SmartThings.
E-Commerce
• Payments made by credit card, debit card, pay-later schemes, or any other
electronic method are recorded as data.
Global Positioning System (GPS)
• Vehicle movement – directions/ traffic congestion. Creates a lot of data on
vehicle position and movement.
Real world examples – Big data
• Social media analytics – Consumer product
companies and retail organizations are observing data
on social media websites to analyze customer
behaviour, preferences, etc.
• Insurance companies use Big Data analytics (BDA) to see which home
insurance applications can be processed immediately
and which ones need a validating in-person visit from an
agent.
• Hospitals are analysing medical data and patient
records to predict which patients are likely to be
readmitted within a few months of discharge.
• Relying on Social networks and analytics, Companies
are gathering volumes of data from the web to help
musicians and music companies better understand their
Big Data Analytics
• Deeper insights. It’ll have insights into all the individuals, all the
products, all the parts, all the events, all the transactions, etc.
• lack of structure
• About 85% of total data is unstructured.
Ex:
• e-mail messages,
• word processing documents,
• videos, photos, audio files,
presentations,
• web pages
• other kinds of business documents.
Some sources of unstructured
data include
• Text both internal and external to an organization: documents, logs, survey
results, feedback, and e-mails from both within and across the organization
• Big data tools: Software like Hadoop can process stores of both
unstructured and structured data that are extremely large, very complex
and changing rapidly.
• Data integration tools: These tools combine data from disparate sources
so that they can be viewed or analyzed from a single application. They
sometimes include the capability to unify structured and unstructured
data.
Implementing Unstructured Data Management:
Software tools to help them organize and manage
• It tells you how your customers actually behave and how that varies
• It tells you how customers engage with you via your website / webapp
• It tells you how customers engage with your different marketing campaigns and
how that drives subsequent behaviour
• Deriving value from web analytics data often involves very custom-
made analytics
• The two most common types of risk management are credit risk management and market risk
management. A third type of risk, operational risk management, isn’t as common as credit and
market risk.
• The tactics for risk professionals typically include avoiding risk, reducing the negative effect or
probability of risk, or accepting some or all of the potential consequences in exchange for a
potential upside gain.
• Credit risk analytics focus on past credit behaviors to predict the likelihood that a borrower will
default on any type of debt by failing to make payments which they are obligated to make. For example,
“Is this person likely to default on their $300,000 mortgage?”
• Market risk analytics focus on understanding the likelihood that the value of a portfolio will decrease
due to the change in stock prices, interest rates, foreign exchange rates, and commodity prices.
For example, “Should we sell this holding if the price drops another 10 percent?”
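As a rough illustration of the market risk question above, the sketch below revalues a small portfolio under a price shock. The holdings, prices, and the 10 percent shock are invented values, and `portfolio_value` is a hypothetical helper, not part of any risk platform.

```python
# Hypothetical holdings: name -> (quantity, current price). All values are invented.
holdings = {
    "STOCK_A": (100, 50.0),
    "STOCK_B": (200, 20.0),
}

def portfolio_value(positions, shock=0.0):
    """Return the total value after applying a fractional price change to every holding."""
    return sum(qty * price * (1.0 + shock) for qty, price in positions.values())

current = portfolio_value(holdings)
shocked = portfolio_value(holdings, shock=-0.10)  # every price falls by 10 percent
loss = current - shocked
print(f"current={current:.2f} shocked={shocked:.2f} loss={loss:.2f}")
```

A real market risk model would shock each risk factor (stock prices, interest rates, exchange rates, commodity prices) separately rather than applying one uniform change.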
Credit risk management
Use of Big Data in Detecting Fraudulent Activities
in Retail Sector
Retail fraud:
It is an illegal transaction that a fraudster performs using
stolen credit card details or loopholes in the order
placement and payment systems and company policies. As
technology has advanced, fraudsters have become more
sophisticated in executing fraud online.
Types of Retail fraud:
(a) Transaction fraud
(b) Return fraud
(c) Chargeback guarantee fraud
• Transaction fraud
It is also called card-not-present (CNP) fraud where the fraudster uses a
stolen credit card for online purchases. The company loses money when
the original owner of the card demands a chargeback.
• Return fraud
Example - e-commerce industry
• Chargeback guarantee fraud
Many online retail fraud prevention solutions guarantee that they will
block all fraudulent transactions and friendly fraud, and even pay the admin fee out
of their own pocket. The problem arises when the company blocks even
legitimate customers. This is called a false positive, and it not only
damages the retailer's reputation but also results in loss of revenue.
Use of Big Data in Detecting Fraudulent
Activities in Retail Sector - Fraud Detection in
Real Time
• Big Data helps to detect fraud in real time.
Example:
(a) In an online transaction, a Big Data system would compare the incoming IP
address with the geotag received from the customer’s smartphone
apps. A valid match between the two confirms the authenticity of the
transaction.
(b) It also examines the entire historical data to track suspicious
patterns in the customer's orders.
Retailers perform Big Data analysis in real time to know the
actual time at which a product is delivered.
Costly products often have sensors attached to transmit their location
information, thereby preventing fraud.
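The IP-versus-geotag check in example (a) could be sketched roughly as follows. This assumes the IP address has already been resolved to coordinates by some lookup service. The coordinates, the 100 km threshold, and the function names are hypothetical choices for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def locations_match(ip_location, geotag, max_km=100.0):
    """Treat the transaction as consistent when both locations lie within max_km."""
    return haversine_km(*ip_location, *geotag) <= max_km

# Hypothetical (latitude, longitude) pairs.
ip_location = (12.97, 77.59)   # location inferred from the incoming IP address
near_geotag = (12.93, 77.62)   # smartphone geotag a few kilometres away
far_geotag = (28.61, 77.21)    # smartphone geotag in a distant city

print(locations_match(ip_location, near_geotag))
print(locations_match(ip_location, far_geotag))
```

A mismatch on its own need not mean fraud (for example, VPN use), so in practice such a signal would be combined with the historical order patterns mentioned in (b).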
Credit Risk
Framework
Big Data and Algorithmic Trading
Algorithmic trading is the use of computer programs for entering
trading orders, in which the programs decide on almost
every aspect of the order, including its timing, price, and
quantity.
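To make the idea concrete, here is a toy sketch in which a program decides the timing, price, and quantity of an order. The moving-average rule, the window sizes, and the `decide_order` function are assumptions made for illustration, not a recommended or standard strategy.

```python
def decide_order(prices, cash, short_window=3, long_window=5):
    """Return a hypothetical buy order as a dict, or None when the rule says to wait."""
    if len(prices) < long_window:
        return None  # not enough history to decide
    short_avg = sum(prices[-short_window:]) / short_window
    long_avg = sum(prices[-long_window:]) / long_window
    if short_avg <= long_avg:
        return None  # timing: no upward momentum, so no order is placed
    limit_price = prices[-1]              # price: the latest observed price
    quantity = int(cash // limit_price)   # quantity: whole units the cash can cover
    if quantity == 0:
        return None
    return {"side": "buy", "price": limit_price, "quantity": quantity}

rising = [10.0, 10.0, 10.0, 11.0, 12.0]   # invented price history
falling = [12.0, 11.0, 10.0, 10.0, 10.0]  # invented price history
order = decide_order(rising, cash=100.0)
no_order = decide_order(falling, cash=100.0)
print(order, no_order)
```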
• Transformation of the health care system will come through Big Data-
driven decisions and improved insights.
A Hadoop cluster consists of a single master node and multiple worker nodes.
• Master Node
- NameNode
- JobTracker
• Worker Node
- DataNode
- TaskTracker
In a larger cluster, HDFS is managed through a NameNode server to host the file
system index.
• Open Source
• Highly Scalable Cluster
• Fault tolerance is available
• Flexible
• Easy to use
• Provides faster data processing
MapReduce
• Used for processing large distributed datasets in parallel.
• MapReduce is a process of two phases:
(i) The Map phase takes in a set of data, which is broken
down into key-value pairs.
(ii) The Reduce phase takes the output of the Map phase
as input and reduces it to a
smaller set of key-value pairs.
• The key-value pairs given out by the Reduce phase are the
final output of the MapReduce process.
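The two phases can be sketched in plain Python with the classic word-count example. This is a single-machine illustration of the data flow only, not Hadoop's API. The function names and input lines are hypothetical, and the `shuffle` step stands in for the grouping a real framework performs between Map and Reduce.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: break each input line into (word, 1) key-value pairs."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, as a framework would do between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one key-value pair."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big insights", "data tools"]  # invented input
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)
```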
• Hadoop accomplishes its operations (dividing the
computing tasks into subtasks that are handled
by individual nodes) with the help of the MapReduce
model, which comprises two functions: Mapper and
Reducer.
• Mapper function – Responsible for mapping the
computational subtasks to different nodes.
• Reducer function – Responsible for reducing the
responses from the compute nodes to a single
result.
• In the MapReduce algorithm, the operations of
distributing tasks across various systems,
handling task placement for load balancing, and
managing failure recovery belong to the map side
of the job. In Hadoop they are carried out by the
framework (the JobTracker and TaskTrackers) rather
than by user-written mapper code.
• The reducer function aggregates all the
elements together after the completion of the
distributed computation.
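The split into subtasks and the final aggregation might be pictured with the following minimal sketch. Threads stand in for compute nodes and the thread pool plays the part of the scheduler. This is an analogy under those assumptions, not how Hadoop is implemented, and the `mapper` and `reducer` names and the data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def mapper(chunk):
    """Subtask run on one simulated node: sum the numbers in its chunk."""
    return sum(chunk)

def reducer(partial_results):
    """Combine the responses from all simulated nodes into a single result."""
    return sum(partial_results)

data = list(range(1, 101))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]  # four subtasks

# The pool hands each chunk to a worker; map() returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(mapper, chunks))

total = reducer(partials)
print(partials, total)
```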
Cloud Computing and Big Data
Features
• An increase in the amount of data requires an organization to
improve the processing ability of its hardware components.
• The new hardware may not provide complete support to the
software that used to run properly on the earlier set of
hardware.
Elasticity
• Hiring certain resources, as and when required, and paying for
those resources.
• No extra payment is required for acquiring specific cloud
services.
Fault Tolerance
• Offering uninterrupted services to customers, especially in
cases of component failure.
Resource Pooling
• Multiple organizations, which use similar kinds of resources to
carry out computing practices, have no need to individually
hire all the resources.
• The sharing of resources is allowed in a cloud, which facilitates
cost cutting through resource pooling.
Self Service
• Cloud computing involves a simple user interface that helps
Models
• Community Cloud
• Hybrid Cloud
Public Cloud (End-User Level Cloud)
• Beard explained that big data is now changing the way advertisers address three
related needs:
• 1. How much do I need to spend?
• 2. How do I allocate that spend across all the marketing communication touch
points?
• 3. How do I optimize my advertising effectiveness against my brand equity and
ROI in real time?
• Beard explained the three guiding principles to measurement:
• 1. End-to-end measurement: reach, resonance, and reaction
• 2. Across platforms (TV, digital, print, mobile, etc.)
• 3. Measured in real time (when possible)
The Need to Act Quickly (Real-Time When Possible)
Measurement Can Be Tricky
Content Delivery Matters Too
Optimization and Marketing Mixed Modeling
• Marketing mixed modeling (MMM) is a tool that helps advertisers understand the
impact of their advertising and other marketing activities on sales results. MMM can
generally provide a solid understanding of the relative performance of advertising by
medium (e.g., TV, digital, print, etc.), and in some cases can even measure sales
performance by creative unit, program genre, website, and so on.
• Now, we can also measure the impact on sales in social media and we do that
through market mixed modeling. Market mixed modeling is a way that we can take all
the different variables in the marketing mix (including paid, owned, and earned media)
and use them as independent variables that we regress against sales data to try
to understand the single-variable impact of all these different things.
•Since these methods are quite advanced, organizations use high-end internal
analytic talent and advanced analytics platforms such as SAS or point solutions such as
Unica and Omniture. Alternatively, there are several boutique and large analytics
providers like Mu Sigma that supply it as a software-as-a-service (SaaS).
• MMM is only as good as the marketing data that is used as input. As the world
becomes more digital, the quantity and quality of marketing data are improving, which is
leading to more granular and insightful MMM analyses.
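The regression idea behind MMM can be sketched as an ordinary least squares fit of sales on media variables. Everything here is an assumption for illustration: the weekly TV and digital figures are invented, sales are generated from a known formula so the fit can be checked, and `fit_ols` and `solve_linear` are hypothetical helpers rather than a real MMM tool.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    x = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(k)]
    return solve_linear(xtx, xty)

# Invented weekly spend (TV, digital); sales built as 10 + 2*TV + 3*digital.
spend = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
sales = [10 + 2 * tv + 3 * digital for tv, digital in spend]

baseline, tv_effect, digital_effect = fit_ols(spend, sales)
print(round(baseline, 3), round(tv_effect, 3), round(digital_effect, 3))
```

Each fitted coefficient estimates the change in sales per unit of that medium with the other held fixed, which is the single-variable impact described above. Real MMM work typically adds effects such as seasonality and diminishing returns that this sketch leaves out.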
• Using Consumer Products as a Doorway