Big Data Unit 1

Big data refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions. The volume of data is growing exponentially due to greater internet usage and increased sensors/devices. Traditional databases cannot handle big data's volume, velocity, variety, and veracity. Companies must analyze big data from new sources like social media to gain valuable insights and stay competitive. Analyzing big data requires new techniques and technologies to capture, store, manage, and make sense of large, diverse, and fast-moving data.

Uploaded by

Jokilo Taepei

UNIT 1

What is Big Data (aka Data Tsunami)?

According to a study reported in the literature:

•Every day, we create 2.5 quintillion bytes of data (1 quintillion is 10^18).
•So much that 90% of the data in the world today has been created
in the last two years alone.
•This data comes from everywhere: sensors used to gather climate
information, posts to social media sites, digital pictures and videos,
purchase transaction records, and cell phone GPS signals etc.

According to another study:
•From the beginning of recorded time until 2003, 5 billion
gigabytes of data were created.
•In 2011, the same amount was created every two days.
•In 2013, the same amount of data was created every 10 minutes.
•In 2015, the same amount or more was being generated every 10 minutes.

•Advances in communications, computation, and storage have

created huge collections of data, having information of value to
business, science, government and society.

•Example: Search engine companies such as Google, Yahoo!, and
Microsoft have created an entirely new business by capturing the
information freely available on the World Wide Web and providing it
to people in useful ways. (SOCIAL NETWORKING)

•These companies collect trillions of bytes of data every day and provide NEW
SERVICES such as satellite images, driving directions, image retrieval
etc.

•The societal benefits of these services are well appreciated: they have
transformed how people find and make use of information on a daily
basis.
•Big data can be used in a wide variety of areas: business, health care,
science, defence etc.

Example: Health care (AKA HEALTH INFORMATICS)

•Modern medical systems collect huge amounts of information
about patients through imaging technology (CAT scans, MRI),
genetic analysis (DNA microarrays), and other forms of diagnostic
equipment.

•By applying analytics to data sets for large numbers of patients,

medical researchers are gaining fundamental insights into the
GENETIC AND ENVIRONMENTAL CAUSES OF DISEASES,
and creating more effective means of diagnosis.

•Recently, a Hollywood star underwent surgery to prevent cancer. [who]
According to a McKinsey report published in the US:

•140,000-190,000 workers with “knowledge of big data analytics”
will be needed in the US alone. (2014)

•Furthermore, 1.5 million managers will need to become data-literate.

•Many agencies, media houses, and scientific communities across the
world have identified Big Data as an important research area.

GENESIS………………………The Beginning

•Like it or not, a massive amount of data will be coming your

way soon.

•Perhaps it has reached you already.

•Perhaps you’ve been wrestling with it for a while—trying to

figure out how to store it for later access, address its mistakes
and imperfections, or classify it into structured categories.

KNOW THE DIFFERENCE BETWEEN BIG DATA AND ITS MANAGEMENT

•As the author Bill Franks puts it:

•There may soon be not only a flood of data, but also a flood of books on big
data.
•Most of these big-data books will be about the management of big
data:
++ How to wrestle it into a database or data warehouse.
++ How to structure and categorize unstructured data.
•If you find yourself reading a lot about Hadoop or MapReduce or
various approaches to data warehousing, you’ve stumbled upon (or were
perhaps seeking) a “big data management” (BDM) book.
•BDM is, of course, important work. No matter how much data you have
of whatever quality, it won’t be much good unless you get it into an
environment and format in which it can be accessed and analyzed.

•BDM alone won’t get you very far. You also have to analyze and act on
it for data of any size to be of value.

• Just as traditional database management tools didn’t automatically

analyze transaction data from traditional systems, Hadoop and
MapReduce won’t automatically interpret the meaning of data from
web sites, gene mapping, image analysis, or other sources of big data.
WHAT IT MEANS TO US: [APPLICATION]

You receive an EMAIL: It contains an offer for a complete personal computer

system. It seems like the retailer read your mind since you were exploring computers on their
web site just a few hours prior. …

As you drive to the store to buy the computer bundle, you get an offer for a discounted coffee
from the coffee shop you are getting ready to drive past. It says that since you’re in the area, you
can get 10% off if you stop by in the next 20 minutes

As you drink your coffee, you receive an apology from the manufacturer of a product
that you complained about yesterday on your Facebook page, as well as on the company’s web
site. …

Finally, once you get back home, you receive notice of a gadget upgrade available for purchase in
your favorite online video game.

Etc…………..

DATA SOURCES

•The explosion of new and powerful data sources like Facebook, Twitter,
LinkedIn, YouTube etc. contributes immensely to big data and research.
•Advanced analytics will have a great impact.
•To stay competitive, it is imperative that organizations aggressively
pursue capturing and analyzing these new data sources to gain the
insights that they offer.
• Ignoring big data will put an organization at risk and cause it to fall
behind the competition.
• Analytic professionals have a lot of work to do! It won’t be easy to
incorporate big data alongside all the other data that has been used
for analysis for years.
WHAT IS BIG DATA?

•There is no consensus in the marketplace as to how to define big

data!

•Def#1: “Big data exceeds the reach of commonly used hardware
environments and software tools to capture, manage, and process it
within a tolerable elapsed time for its user population.”
[Teradata Magazine article]

•Def#2: “Big data refers to data sets whose size is beyond the ability
of typical database software tools to capture, store, manage and
analyze.” [McKinsey Global Institute]

•Def#3: The “big” in big data also refers to several other characteristics of
a big data source. These aspects include volume, velocity, variety,
and veracity (optional). [Gartner Group]

Volume:
• The sheer volume of data being stored today is exploding.
• In the year 2000, 800,000 petabytes (PB) of data were stored in
the world.
• We expect this number to reach 35 zettabytes (ZB) by 2020.
• Twitter alone generates more than 7 terabytes (TB) of data every
day, Facebook 10 TB etc.
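
To put these figures on a common scale, the short Python sketch below converts between decimal storage units. The unit table and the helper name `convert` are illustrative assumptions; the inputs are simply the figures quoted on this slide.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}


def convert(value, from_unit, to_unit):
    """Convert a quantity between two of the units listed in UNITS."""
    return value * UNITS[from_unit] / UNITS[to_unit]


# 800,000 PB stored in 2000, expressed in zettabytes (about 0.8 ZB).
stored_2000_zb = convert(800_000, "PB", "ZB")

# How many times larger the projected 35 ZB is than the year-2000 figure.
growth_factor = 35 / stored_2000_zb

# 7 TB per day from Twitter, accumulated over a 365-day year, in petabytes.
twitter_pb_per_year = convert(7 * 365, "TB", "PB")

print(f"Stored in 2000: {stored_2000_zb:.2f} ZB")
print(f"Projected growth factor: {growth_factor:.2f}x")
print(f"Twitter per year: {twitter_pb_per_year:.3f} PB")
```

Seen this way, the projected 35 ZB is roughly 44 times the year-2000 figure, while a single 7 TB/day source adds only a few petabytes per year.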

Variety : “Variety Is the Spice of Life”

•The volume associated with the Big Data phenomenon brings
along a new challenge for data centres trying to deal with it: its
variety.
•With the explosion of sensors and smart devices, as well as
social collaboration technologies, data in an enterprise has
become complex, because it includes not only traditional
relational data but also raw, semi-structured, and unstructured
data from web pages, web log files (including click-stream data),
search indexes, social media forums, e-mail, documents, sensor data
from active and passive systems, and so on.

Velocity : How Fast Is Fast?
•The speed at which the data is flowing.
•The increase in RFID sensors and other information streams has led to a
constant flow of data at a pace that traditional systems cannot handle.
•Staying competitive can mean identifying a trend, problem, or opportunity
only seconds, or even microseconds, before someone else.

•In traditional processing, you can think of running queries against
relatively static data.
•For example, the query “Show me all people living in City X”
would return a single result set to be used as a warning list for
an incoming weather pattern.

•With streams computing [IBM], you can execute a process
similar to a continuous query that identifies people who are
currently in City X, but you get continuously updated results,
because location information from GPS data is refreshed in real
time.

•Big Data requires that you perform analytics against the volume
and variety of data while it is still in motion, not just after it is at
rest.
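
The contrast between a one-off query and a continuous query can be sketched in a few lines of Python. This is only a toy illustration, not IBM's streams product: the function name `continuous_city_query`, the sample people, and the `(person, city)` update format are all assumptions made for the example.

```python
def continuous_city_query(location_updates, city):
    """Yield the current list of people in `city` after every location update.

    `location_updates` is any iterable of (person, reported_city) pairs,
    standing in for a live feed of GPS-derived locations.
    """
    current_city = {}  # person -> most recently reported city
    for person, reported_city in location_updates:
        current_city[person] = reported_city
        # Re-evaluate the "query" against the freshest state.
        yield sorted(p for p, c in current_city.items() if c == city)


# Hypothetical feed: people move in and out of City X over time.
updates = [
    ("asha", "City X"),
    ("ravi", "City Y"),
    ("ravi", "City X"),
    ("asha", "City Y"),
]

snapshots = list(continuous_city_query(updates, "City X"))
for snapshot in snapshots:
    print(snapshot)
```

A traditional query would correspond to looking at just one of these snapshots; the continuous version produces a new answer each time the data moves.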

Veracity: (Unreliable data)

•Beyond volume, velocity, and variety, there is veracity.
•Alongside the big data hype, there is also unreliability in the data.
•How trustworthy, and therefore how effective, will these data be?
•Example: Product Branding, Image Branding, Image assignation

In addition, a couple more V’s have also been suggested:

Variability:
•It is often confused with variety.
Example:
•Say you have a bakery that sells 10 different breads. That is variety.
Now imagine you go to that bakery three days in a row and every
day you buy the same type of bread, but each day it tastes and smells
different. That is variability.
•Variability is thus very relevant in performing sentiment analyses.
•Variability means that the meaning is changing (rapidly).
•In (almost) the same tweets, a word can have a totally different
meaning.
Visualization

•This is the hard part of big data.

•It means making that vast amount of data comprehensible in a manner
that is easy to understand and read.
•It does not mean ordinary graphs or pie charts. It means complex
graphs that can include many variables of data while still remaining
understandable and readable.
•Telling a complex story in a graph is very difficult but also extremely
crucial.
•Luckily, more and more big data startups are appearing that
focus on this aspect, and in the end, visualizations will make the
difference.
VALUE

•Data in itself is not valuable at all.

•The value is in the analyses done on that data and how the data
is turned into information and eventually into
knowledge.

•The value is in how organisations will use that data and turn
themselves into information-centric companies that
rely on insights derived from data analyses for their
decision-making.

IS THE “BIG” PART OR THE “DATA” PART MORE IMPORTANT?

•What is the most important part of the term big data? Is it (1) the
“big” part, (2) the “data” part, (3) both, or (4) neither?

•As with any source of data, big or small, the power of big data
comes from the answers to these questions:
++ What is done with that data?
++ How is it analyzed?
++ What actions are taken based on the findings?
++ How is the data used to make changes to a
business?

•People are led to believe that just because big data has high
volume, velocity, and variety, it is somehow better or more
important than other data.
• Many big data sources have a far higher percentage of useless or
low-value content than virtually any other data source.

•By the time big data is trimmed down to what you actually need,
it may not even be so big any more.

In Summary:
•Whether it stays big or ends up being small when you’re
done processing it, the size isn’t important.

•It’s what you do with it.

HOW IS BIG DATA DIFFERENT?

The majority of big data sources have the following features:

1. Big data is often automatically generated by a machine.

• Instead of a person being involved in creating new data, it’s

generated purely by machines in an automated way. If you think
about traditional data sources, there was always a person
involved.

• For example: Consider retail or bank transactions, telephone call

detail records, product shipments, or invoice payments. All of
those involve a person doing something in order for a data
record to be generated.

• A lot of sources of big data are generated without any human
interaction at all. Example: Sensors
2.Big data is typically an entirely new source of data. It is not simply
an extended collection of existing data.

• For Example, with the use of the Internet, customers can now
execute a transaction with a bank or retailer online. But the
transactions they execute are not fundamentally different
transactions from what they would have done traditionally.

• They’ve simply executed the transactions through a different

channel.

• An organization may capture web transactions, but they are really

just more of the same old transactions that have been captured
for years.

• However, capturing browsing behaviors as customers execute a

transaction creates fundamentally new data.
3.Many big data sources are not designed to be friendly. In fact,
some of the sources aren’t designed at all!

• Example: Text streams from a social media site.

(There is no way to ask users to follow certain standards of
grammar, or sentence ordering, or vocabulary)

• It will be difficult to work with such data at best and very, very
ugly at worst.

• Most traditional data sources were designed up-front to be

friendly.

• Systems used to capture transactions provide data in a clean,
preformatted template that makes the data easy to load and use.

4. A substantial share of big data streams may not have much value. In
fact, much of the data may even be close to worthless.

• Example: Within a web log, there is information that is very
powerful. There is also a lot of information that doesn’t have much
value at all. (pic)

• It is necessary to weed through and pull out the valuable and
relevant pieces.

• Traditional data sources were defined up-front to be 100 percent
relevant.

Example: Weblog (1)

Example: Weblog (2)

HOW IS BIG DATA MORE OF THE SAME?

•The same thing that existed in the past is back in a new form.

• In many ways, big data doesn’t pose any problems that your
organization hasn’t faced before.

•Taming new, large data sources that push the current limits of
scalability is an ongoing theme in the world of analytics.

Fig: Data Mining Process

RISKS OF BIG DATA
1. An organization will be so overwhelmed with big data that it won’t
make any progress.

[The key here is to get the right people. You need the right people
attacking big data and attempting to solve the right kinds of problems]

2. Costs escalate too fast because too much big data is captured before an
organization knows what to do with it.

[It is not necessary to go for it all at once and capture 100 percent of
every new data source.

What is necessary is to start capturing samples of the new data

sources to learn about them. Using those initial samples, experimental
analysis can be performed to determine what is truly important within
each source and how each can be used]
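
One possible way to capture a fixed-size sample from a data source whose total length is not known in advance is reservoir sampling. This technique is chosen here only as an illustration and is not named in the source; the function name `reservoir_sample` and its parameters are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k records chosen uniformly at random from an iterable.

    Only k records are held in memory at any time, so the stream can be
    far larger than what would fit in storage (classic "Algorithm R").
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


# Hypothetical usage: keep 10 records out of a stream of 1,000 events.
events = (f"event-{n}" for n in range(1000))
initial_sample = reservoir_sample(events, 10, seed=42)
print(len(initial_sample))
```

Experimental analysis can then run on `initial_sample` before anyone commits to storing the full source.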
3. Perhaps the biggest risk with many sources of big data is privacy.
• If everyone in the world were good and honest, we wouldn’t
have to worry much about privacy.

• There have also been high-profile cases of major organizations
getting into trouble for having ambiguous or poorly defined privacy
policies.

Example: In April 2013, LivingSocial, a daily-deals site partly owned by
Amazon, announced that the names, email addresses, birth dates,
and encrypted passwords of more than 50 million customers worldwide
had been stolen by hackers.

•Poorly defined policies have led to data being used in ways that consumers didn’t
understand or support, causing a backlash.

•Organizations should explain how they will keep data secure and how
they will use it, if they expect customers to accept their data being captured and analyzed.
WHY YOU NEED TO TAME BIG DATA

•Many organizations have done little with big data so far.

•E-commerce is an exception: there, analyzing big data is
already standard.

•Other organizations still have a chance to get ahead of the pack today.

•Within a few years, any organization that isn’t analyzing big data
will be late to the game and will be stuck playing catch up for years
to come.

•The time to start taming big data is now.

What is the difference between
Data Mining and Web Mining?
Machine Learning : Classification, Clustering etc.

Semantic approach: Statistics, NLP etc.

THE STRUCTURE OF BIG DATA

•Big data is often described as unstructured.

•Most traditional data sources are in the fully structured realm.

•Structured data: the data is in a pre-defined format, with no variation
in the format from day to day or from update to update.

•Unstructured data

•Semi-structured data

• Example : Web logs
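
As a rough illustration of why a web log counts as semi-structured, the Python sketch below pulls named fields out of one log line. It assumes the line follows the widely used Common Log Format; the sample line, the host address, and the helper name `parse_line` are made up for the example, and real logs often deviate from the pattern.

```python
import re

# Pattern for one line in Common Log Format:
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical log line used only for demonstration.
line = (
    '192.0.2.10 - - [10/Oct/2013:13:55:36 +0530] '
    '"GET /products/laptop HTTP/1.1" 200 2326'
)
record = parse_line(line)
print(record["path"], record["status"])
```

The line has a recognisable layout, but nothing enforces it: a malformed line simply fails to match, which is the practical difference from a fully structured table.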

FILTERING BIG DATA EFFECTIVELY
•The biggest challenge with big data may not be the analytics you do
with it, but the extract, transform, and load (ETL) processes you have
to build to get it ready for analysis. (PART OF 90 %)

•Analytic processes may require filters on the front end to remove

portions of a big data stream when it first arrives. Also there will be
other filters along the way as the data is processed.

•For example, when working with a web log, a rule might be to filter
out up front any information on browser versions or operating
systems. Such data is rarely needed except for operational reasons.

•Later in the process, the data may be filtered to specific pages or

user actions that need to be examined for the business issues to be
addressed.
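
The two filtering stages described above can be sketched as a small Python pipeline. The field names (`page`, `action`, `browser`, `os`) and the sample records are assumptions for illustration, not a real web log schema.

```python
# Fields that are rarely needed for analysis (assumed names).
OPERATIONAL_FIELDS = {"browser", "os"}


def drop_operational_fields(record):
    """Front-end filter: strip browser and operating system details."""
    return {k: v for k, v in record.items() if k not in OPERATIONAL_FIELDS}


def filter_pipeline(records, pages_of_interest):
    """Apply the front-end filter, then keep only pages relevant to the business issue."""
    for record in records:
        slim = drop_operational_fields(record)
        if slim["page"] in pages_of_interest:
            yield slim


# Hypothetical parsed web log records.
records = [
    {"page": "/products/laptop", "action": "view", "browser": "Firefox 24", "os": "Windows 7"},
    {"page": "/about", "action": "view", "browser": "Chrome 30", "os": "Linux"},
    {"page": "/cart", "action": "add", "browser": "Chrome 30", "os": "Linux"},
]

kept = list(filter_pipeline(records, {"/products/laptop", "/cart"}))
for row in kept:
    print(row)
```

Using a generator keeps the sketch close to how a stream would be handled: each record is filtered as it arrives rather than after the whole log is loaded.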
Example-1
<HTML>
<TITLE>
<BODY>
Sachin is a former Indian cricketer and captain, widely regarded as one of the
greatest batsmen of all time. Sachin took up cricket at the age of eleven, made
his Test debut on 15 November 1989 against Pakistan in Karachi at the age of
sixteen, and went on to represent Mumbai domestically and India
internationally for close to twenty-four years. Sachin is the only player to have
scored one hundred international centuries, the first batsman to score a
double century in a One Day International, the holder of the record for the
number of runs in both ODI and Test cricket, and the only player to complete
more than 30,000 runs in international cricket
</BODY>
</TITLE>
</HTML>

Example 2 :Opinion Analysis
Step 1: Sample text
excellent phone, excellent service . i am a business user who
heavily depend on mobile service ….,,, there is much which
has been said in other reviews about the features of this
phone.

Step 2: Remove delimiters from input file

excellent phone excellent service i am a business user who
heavily depend on mobile service there is much which has
been said in other reviews about the features of this phone

Step 3: Subject the text to parts of speech tagger
Example: JJ excellent NN phone JJ excellent NN service FW
i VBP am DT a NN business NN user WP who RB heavily
VBP depend IN on JJ mobile NN service EX there VBZ is JJ
much WDT which VBZ has VBN been VBN said IN in JJ
other NNS reviews IN about DT the NNS features IN of
DT this NN phone

Step 4: Extract features

JJ excellent NN phone, JJ excellent NN service

Step 5: Approaches
•Supervised approach
•Unsupervised approach

Step 6: Results:
• Positive opinion
• Negative opinion

•The complexity of the rules and the magnitude of the data being
removed or kept at each stage will vary by data source and by
business problem.

•The load processes and filters that are put on top of big data are
absolutely critical. Without getting those correct, it will be very
difficult to succeed.

•Traditional structured data doesn’t require as much effort in these

areas since it is specified, understood, and standardized in advance.

•With big data, it is necessary to specify, understand, and

standardize it as part of the analysis process in many cases.
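
As a small illustration of this specify-and-standardize work, two of the opinion-analysis steps from Example 2 (removing delimiters and extracting adjective-noun features) can be sketched in Python. This is a minimal sketch under stated assumptions: the tagged tokens are copied by hand from the start of the tagger output shown in the example rather than produced by a real part-of-speech tagger, and the function names are hypothetical.

```python
import re


def remove_delimiters(text):
    """Replace punctuation with spaces and collapse runs of whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text).split())


def extract_features(tagged_tokens):
    """Return (adjective, noun) pairs where a JJ tag is directly followed by an NN tag."""
    pairs = []
    for (tag1, word1), (tag2, word2) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag1 == "JJ" and tag2 == "NN":
            pairs.append((word1, word2))
    return pairs


sample = "excellent phone, excellent service . i am a business user"
cleaned = remove_delimiters(sample)

# First part of the tagger output from the example, as (tag, word) pairs.
tagged = [
    ("JJ", "excellent"), ("NN", "phone"),
    ("JJ", "excellent"), ("NN", "service"),
    ("FW", "i"), ("VBP", "am"), ("DT", "a"),
    ("NN", "business"), ("NN", "user"),
]
features = extract_features(tagged)

print(cleaned)
print(features)
```

The extracted pairs would then be passed to whichever supervised or unsupervised approach is used to label the opinion as positive or negative.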

Example: Application of Filtering to websites to derive knowledge

MIXING BIG DATA WITH TRADITIONAL DATA

•Perhaps the most exciting thing about big data isn’t what it will do
for a business by itself. It’s what it will do for a business when
combined with an organization’s other data.

Example:

1. Browsing history, for example, is very powerful. [Knowing how

valuable a customer is and what they have bought in the past across
all channels makes web data even more powerful by putting it in a
larger context].

2. Smart-grid data is very powerful for a utility company. [Knowing

the historical billing patterns of customers, their dwelling type, and
other factors makes data from a smart meter even more powerful
by putting it in a larger context.]
3. The text from customer service online chats and e-mails is powerful.
[Knowing the detailed product specifications of the products being
discussed, the sales data related to those products, and historical
product defect information makes that text data even more powerful
by putting it in a larger context.] - Amazon Recommendation system

4. This mixing is why Enterprise Data Warehouses (EDWs) have become such a
widespread corporate tool; it is not just to centralize a bunch of data marts
to save hardware and software costs.

•An EDW adds value by allowing different data sources to intermix and
enhance one another.

•With an EDW, it is possible to analyze customer and employee data

together since they are in one location. They are no longer completely
separate.
•This is why it is critically important that organizations don’t develop a
big data strategy that is distinct from their traditional data strategy.

To succeed, it is necessary to plan not just how to capture and analyze

big data by itself, but also how to use it in combination with other
corporate data.
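
A minimal Python sketch of this kind of mixing is shown below: browsing events (the big data source) are enriched with fields from an existing customer table (the traditional data). The customer IDs, field names, and values are invented for the example.

```python
# Traditional data: one row per known customer (hypothetical values).
customers = {
    "c1": {"lifetime_value": 5200, "channel": "store"},
    "c2": {"lifetime_value": 150, "channel": "web"},
}

# Big data source: raw browsing events (hypothetical values).
browsing = [
    {"customer_id": "c1", "category": "laptops"},
    {"customer_id": "c2", "category": "printers"},
    {"customer_id": "c9", "category": "monitors"},  # not in the customer table
]


def enrich(browsing_events, customer_table):
    """Attach each customer's traditional profile to their browsing events."""
    enriched = []
    for event in browsing_events:
        profile = customer_table.get(event["customer_id"], {})
        enriched.append({**event, **profile})
    return enriched


enriched_events = enrich(browsing, customers)
for row in enriched_events:
    print(row)
```

After the join, a laptop view by a high-value customer can be treated differently from the same view by an unknown visitor, which is the "larger context" the slides refer to.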

Fig: (a) Data Mart, (b) Data Warehouse

Fig: Hierarchy of Enterprise Data

THE NEED FOR STANDARDS
•Will big data continue to be a wild west of crazy formats,
unconstrained streams, and lack of definition?

•Probably not. Over time, standards will be developed.

•Many semi-structured data sources will become more structured

over time, and individual organizations will fine-tune their big data
feeds to be friendlier for analysis.

•Example:
• SQL or similar language : usage with Big Data
• Formats, Interfaces to support interoperability across
distributed applications
• Web semantics: XML, OWL etc., with Big Data
• Cloud computing – Big data

TODAY’S BIG DATA IS NOT TOMORROW’S BIG DATA

•There is no specific, universal definition in terms of what qualifies

as big data.

•Rather, big data is defined in relative terms tied to available

technology and resources.

•As a result, what counts as big data to one company or industry

may not count as big data to another.

•A large e-commerce company is going to have a much “bigger”

definition of big data than a small manufacturer will.

•What qualifies as big data will necessarily change over time as the
tools and techniques to handle it evolve alongside raw storage size
and processing power.
•Household demographic (population) files with hundreds of fields and
millions of customers were huge and tough to manage a decade or two
ago.

•Now such data fits on a thumb drive and can be analyzed by a low-
end laptop.

•Transactional data in the retail, telecommunications, and banking

industries were very big and hard to handle even a decade ago.

•What we are intimidated by today won’t be so scary a few years down

the road.

Example 1:

• Clickstream data from the web may be a standard, easily handled
data source in 10 years.
Clickstream: the trail left by users as they click their way through a
website.

Click-path optimization: Using clickstream analysis, businesses can collect and
analyze data to see which pages web visitors are visiting and in what order.

Market basket analysis: The benefit of basket analysis for marketers is that it
can give them a better understanding of aggregate customer purchasing behavior.

Next best product analysis: helps marketers see what products customers tend to
buy together.

Website resource allocation: Clickstream data analysis tells marketers which
paths on the site are hot and which ones are not.

Customization: personalize the user experience and convert more web visitors
from browsers to buyers.
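
Two of these analyses can be sketched with simple counting in Python: page-to-page transitions (for click-path and "hot path" questions) and co-purchased product pairs (for basket and next-best-product questions). The sessions, baskets, and function names below are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def page_transitions(sessions):
    """Count how often visitors move directly from one page to another."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


def pair_counts(baskets):
    """Count how often each pair of products appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


# Hypothetical clickstream sessions (ordered page visits).
sessions = [
    ["home", "laptops", "cart"],
    ["home", "laptops", "printers"],
    ["home", "deals"],
]

# Hypothetical purchase baskets.
baskets = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "printer"},
    {"printer", "paper"},
]

transitions = page_transitions(sessions)
pairs = pair_counts(baskets)
print(transitions.most_common(1))
print(pairs.most_common(1))
```

Real clickstream volumes would call for distributed processing, but the counting logic stays the same.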

Example 2:

• Actively processing every e-mail, customer service chat, and
social media comment may become a standard practice for most
organizations.

As we tame the current generation of big data streams, other even

bigger data sources are going to come along and take their place.

1. Imagine web browsing data that expands to include

millisecond-level eyeball and mouse movement so that every tiny
detail of a user’s navigation is captured, instead of just what was
clicked on. This is another order of big.

2. Imagine video game telemetry data being upgraded to go
beyond every button pressed or movement made.

3. Imagine RFID (radio frequency identification) information being
available for every single individual item in every single store,
distribution facility, and manufacturing plant globally.

4. Imagine capturing and translating to text every conversation
anyone has with a customer service or sales line. Add to that all
the associated e-mails, online chats, and comments from places
such as social media sites or product review sites.

Web Data: The Original Big Data
•Wouldn’t it be great to:

1. understand customer intent instead of just customer action?

2. understand each customer’s thought processes
to determine whether they make a purchase or not?

•It was virtually impossible to get insights into such topics in the past.

•Today, such topics can be addressed with the use of detailed web
data.

•Organizations across a number of industries have integrated detailed,
customer-level behavioral data sourced from a web site into their
enterprise analytics environments.
•However, for most organizations web integration means only the inclusion
of online transactions.

•Traditional web analytics vendors provide operational reporting
(everyday tasks) on click-through rates, traffic sources, and
metrics based only on web data.

•However, detailed web behavior data was not historically
leveraged outside of web reporting.

Is it possible to understand users better? How?

WEB DATA OVERVIEW

•Organizations have talked about a 360-degree view of their

customers for years.
•What this really means is that the organization has as full a view of its
customers as possible considering the technology and data available
at that point in time.

•However, the finish line is always moving. Just when you think you
have finally arrived, the finish line moves farther out again.

•A few decades ago, companies were at the top of their game if they
had the names and addresses of their customers and were able to
append demographic information (location and population) to those
names through the then-new third-party data enhancement services.

•Eventually, cutting-edge companies started to have basic recency,

frequency, and monetary value (RFM) metrics attached to customers.
Such metrics look at when a customer last purchased (recency), how
often they have purchased (frequency), and how much they spent
(monetary value).

•In the past 10 to 15 years, virtually all businesses started to collect

and analyze the detailed transaction histories of their customers.

•This led to an explosion of analytical power and a much deeper

understanding of customer behavior.
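
The RFM metrics mentioned above can be computed from a plain transaction history. The Python sketch below is a minimal version under stated assumptions: the `(customer_id, purchase_date, amount)` record layout, the sample purchases, and the reference date are all invented for the example.

```python
from datetime import date


def rfm(purchases, today):
    """Compute recency (days), frequency (count), and monetary value per customer.

    `purchases` is an iterable of (customer_id, purchase_date, amount) tuples.
    """
    metrics = {}
    for customer_id, purchase_date, amount in purchases:
        m = metrics.setdefault(
            customer_id, {"last": purchase_date, "frequency": 0, "monetary": 0.0}
        )
        m["last"] = max(m["last"], purchase_date)  # most recent purchase
        m["frequency"] += 1                        # number of purchases
        m["monetary"] += amount                    # total spent
    return {
        cid: {
            "recency_days": (today - m["last"]).days,
            "frequency": m["frequency"],
            "monetary": m["monetary"],
        }
        for cid, m in metrics.items()
    }


# Hypothetical transaction history.
purchases = [
    ("c1", date(2013, 1, 5), 120),
    ("c1", date(2013, 3, 1), 80),
    ("c2", date(2013, 2, 10), 40),
]

scores = rfm(purchases, today=date(2013, 3, 31))
for customer, values in scores.items():
    print(customer, values)
```

Each customer ends up with three numbers, which is the level of detail that was considered cutting-edge before full transaction histories were analyzed.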
•Many organizations are still frozen at the transactional history stage.

•Today, while this transactional view is still important, many

companies incorrectly assume that it remains the closest view possible
to a 360-degree view of their customers.

•Today, organizations need to collect from newly evolving big data

sources related to their customers from a variety of extended and
newly emerging touch points such as web browsers, mobile
applications, kiosks, social media sites, and more.

•Just as transactional data enabled a revolution in power of

computation and depth of analysis, so too do these new data sources
enable taking analytics to a new level.

What Are You Missing? (with Traditional Data)
•Have you ever stopped to think about what happens if only the
transactions generated by a web site are captured?

A study reveals: 95 percent of browsing sessions do not result in a
basket being created. Of the remaining 5 percent, only about half, or 2.5
percent of all sessions, actually begin the checkout process. And of that 2.5
percent, only two-thirds, or 1.7 percent of all sessions, actually complete a
purchase.

•What this means is that information is missing on more than 98
percent of web sessions if only transactions are tracked.

•For every purchase transaction, there might be dozens or hundreds
of specific actions taken on the site to get to that sale. That
information needs to be collected and analyzed alongside the final
sales data.
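
The funnel arithmetic behind these percentages can be checked with a few lines of Python. The stage rates are the ones quoted from the study above; the variable names are illustrative.

```python
# Share of all browsing sessions reaching each stage of the funnel.
sessions = 1.0
basket_created = sessions * 0.05           # 5% of sessions create a basket
checkout_started = basket_created * 0.5    # about half of those begin checkout
purchase_completed = checkout_started * 2 / 3  # two-thirds of those complete a purchase

# Sessions that leave no trace if only completed transactions are tracked.
untracked = sessions - purchase_completed

print(f"Basket created:     {basket_created:.1%}")
print(f"Checkout started:   {checkout_started:.1%}")
print(f"Purchase completed: {purchase_completed:.1%}")
print(f"Untracked sessions: {untracked:.1%}")
```

The last figure comes out at a little over 98 percent, which matches the claim that transaction-only tracking misses more than 98 percent of sessions.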
Imagine the Possibilities (Organizations are trying to know)
•Imagine knowing everything customers do as they go through the
process of doing business with your organization.

•Not just what they buy, but what they are thinking about buying
along with what key decision criteria they use.

•Such knowledge enables a new level of understanding about your

customers and a new level of interaction with your customers.

Example:
1. Imagine you are a retailer. Imagine walking through the store with customers
and recording every place they go, every item they look at, every item
they pick up, and every item they put in the cart and take back out. Imagine
knowing whether they read nutritional information, if they look at
laundry instructions, if they read the promotional brochure on the
shelf, or if they look at other information made available to them in
the store.
2. Imagine you are a telecom company. Imagine being able to
identify every phone model, rate plan, data plan, and accessory
that customers considered before making a final decision.

What is the difference between Traditional

Analytics and New scalable Analytics ?

What Data Should Be Collected and from where?
•Any action that a customer takes while interacting with an
organization should be captured if it is possible to capture it, whether from
web sites, kiosks, social media, mobile apps etc.

•A wide range of events can be captured, such as: Purchases, Requesting help,
Product views, Forwarding a link, Shopping basket additions, Posting
a comment, Watching a video, Registering for a webinar, Accessing a
download, Executing a search, Reading / writing a review etc.

What about privacy? (How is Flipkart handling this?)

•Privacy is a big issue today and may become an even bigger issue as
time passes.

•Organizations need to respect not just formal legal restrictions, but also what their
customers will view as appropriate.

•Faceless customer (the identity of the customer is masked in data stores):
an arbitrary identification number that is not personally identifiable
can be matched to each unique customer based on a logon, cookie,
or similar piece of information. This creates what might be called a
“faceless” customer record.

•It is the patterns across faceless customers that matter, not the
behavior of any specific customer.
•With today’s database technologies, it is possible to enable
analytic professionals to do analysis without having any ability to
identify the individuals involved.

•This can remove many privacy concerns.

Many organizations are in fact identifying and targeting specific

customers as a result of such analytics.

Organizations have presumably put in place privacy policies,

including opt-out options, and are careful to follow them.
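
The "faceless customer" idea can be sketched in Python as a lookup that swaps each logon for an arbitrary identifier before events reach the analysts. This is only an illustrative sketch, not a complete privacy solution: the class name `FacelessIdMap`, the event fields, and the sample logons are assumptions, and the mapping itself would have to be stored and protected separately.

```python
import uuid


class FacelessIdMap:
    """Assign a stable, arbitrary ID to each logon without exposing the logon."""

    def __init__(self):
        self._ids = {}  # logon -> arbitrary ID; kept away from analysts

    def faceless_id(self, login):
        if login not in self._ids:
            # A random UUID carries no information about the customer.
            self._ids[login] = uuid.uuid4().hex
        return self._ids[login]


def anonymize(events, id_map):
    """Return events with the logon replaced by its faceless ID."""
    return [
        {"customer": id_map.faceless_id(e["login"]), "action": e["action"]}
        for e in events
    ]


# Hypothetical raw events.
raw_events = [
    {"login": "asha@example.com", "action": "view:laptop"},
    {"login": "ravi@example.com", "action": "view:printer"},
    {"login": "asha@example.com", "action": "add_to_cart:laptop"},
]

id_map = FacelessIdMap()
faceless_events = anonymize(raw_events, id_map)
for event in faceless_events:
    print(event)
```

Because the same logon always maps to the same ID, behaviour patterns across sessions remain visible while the analysis data contains no directly identifying field.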

What Web Data Reveals
1. Shopping Behaviors:

A good starting point to understand shopping behavior is

identifying:
•How do customers come to a site, begin shopping, and navigate its pages?
•What search engine do they use?
•What specific search terms are entered?
•Do they use a bookmark they created previously?
•Analytic professionals can take this information and look for
patterns in terms of which search terms, search engines, and
referring sites are associated with higher sales rates.
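
As a rough sketch of the pattern search described above, the sales rate per search term (or per search engine, or per referring site) can be computed from session records. The `sessions` data, its field names, and the `sales_rate_by` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical session records: how the visitor arrived and whether they bought.
sessions = [
    {"search_term": "laptop deals", "engine": "google", "purchased": True},
    {"search_term": "laptop deals", "engine": "bing", "purchased": False},
    {"search_term": "cheap printer", "engine": "google", "purchased": True},
    {"search_term": "cheap printer", "engine": "google", "purchased": True},
    {"search_term": "monitor review", "engine": "bing", "purchased": False},
]


def sales_rate_by(sessions, key):
    """Return the share of sessions ending in a purchase, grouped by `key`."""
    visits = defaultdict(int)
    sales = defaultdict(int)
    for s in sessions:
        visits[s[key]] += 1
        sales[s[key]] += int(s["purchased"])
    return {k: sales[k] / visits[k] for k in visits}


rate_by_term = sales_rate_by(sessions, "search_term")
rate_by_engine = sales_rate_by(sessions, "engine")
```

Calling the same function with a different `key` gives the comparison for search engines or referring sites without changing the logic.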
•One valuable capability of web data is to identify product sets that are of
interest to a customer before they make a purchase.

•For example, consider a customer who views computers, backup
disks, printers, and monitors. It is likely the customer is considering a
complete PC system upgrade.

•Offer a package right away that contains the specific mix of items the
customer has browsed.

•Do not wait until after customers purchase the computer and then
offer generic bundles of accessories.

•A customized bundle offer is more powerful than a generic one [study says].

•We find this feature lacking in many sites (project work?)

2. Customer Purchase Paths and Preferences
•It is possible to explore and identify the ways customers arrive at
their buying decisions by watching how they navigate a site.

•It is also possible to gain insight into their preferences.

Consider for example an airline

•An airline can tell a number of things about preferences based on the
ticket that is booked.

•For example:
1. How far in advance was the ticket booked?
2. What fare class was booked?
3. Did the trip span a weekend or not?

•This is all useful, but an airline can get even more from web data.
•An airline can identify customers who value convenience. (Such
customers typically start searches for specific times and direct flights
only.)

•Airlines can also identify customers who value price first and foremost
and are willing to consider many flight options to get the best price.

•Based on search patterns, airlines can also tell whether customers
value deals or specific destinations.

•Example: Does the customer research all of the special deals available
and then choose one for the trip? Or does the customer look at a
certain destination and pay what is required to get there?

•For example, a college student may be open to any number of
vacation destinations and will take the one with the best deal. On the
other hand, a customer who visits family on a regular basis will only be
interested in flying to where the family is.
3. Research Behaviors
•Understanding how customers utilize the research content on a site can
lead to tremendous insights into how to interact with each individual
customer, as well as how different aspects of the site do or do not add
value in driving sales.

For example, consider an online store selling clothes: sarees, Zovi shirts.
•Another way to use web data to understand customers’ research
patterns is to identify which of the pieces of information offered on a
site are valued by the customer base overall and by the best customers
specifically.

•How often do customers look at previews (glance), additional photos
(thumbnails / regular), technical specs, or reviews before making a
purchase?

•Session data combined with other data helps show when customers
buy: on the same day or the next day.
4. Feedback Behaviors

•Where is the feedback expressed?

•Is it relevant? Biased?

•Does it matter?

Web Data in Action
•What an organization knows about its customers is never the
complete picture.

•It is always necessary to make assumptions based on the
information available.

•If there is only a partial view, the full view can often be extrapolated
accurately enough to get the job done.

•It is also possible that the missing information paints a totally
different picture than expected.

•In the cases where the missing information differs from the
assumptions, it is possible to make suboptimal, if not totally wrong,
decisions.
•A very common marketing EXAMPLE is predicting the next best
offer for a customer. Of all the available options, which single offer should
next be suggested to a customer to maximize the chances of success?

•Can web behaviour data help?

Case 1: BANK
•Mr. Kumar has an account with PNB………………………………….etc. with
relevant information.

•What is the best offer you can send via email?

•Would it ever occur to the bank to send a promotional offer on a mortgage or
housing loan? With web data, the bank now knows what to discuss with Mr.
Kumar.

Case 2: Dominos
•Traditional data they get is:
• Historical purchases
• Marketing campaign and response history
•With web data:
• The effort leads to major changes in the promotional efforts versus
the traditional approach, providing the following results:
• A decrease in total mailings
• A reduction in total catalog promotions pages
• A materially significant increase in total revenues

• Question: With an example, justify how web data contributes to
better promotional benefits as against traditional data.
Attrition Modelling
•In the telecommunications sector, for example, companies have invested
massive amounts of time and effort to create, enhance, and perfect
“churn” models (models that try to identify customers who are about to leave).

•Churn models flag those customers most at risk of cancelling their
accounts so that action can be taken proactively to prevent them from
doing so.

•Management of customer churn has been, and remains, critical to
understanding patterns of customer usage and profitability.

Example:
•Mrs. Smith, a customer of telecom provider “AIR”, goes to Google
and types “How do I cancel my Provider AIR contract?” (web data).

•Company analysts would perhaps not have seen her usage
dropping.

•It would take weeks to months to identify such a change in usage
pattern anyway.

•By capturing Mrs. Smith’s actions on the web, provider “AIR” is able
to move more quickly to avert losing Mrs. Smith.

Response Modelling
•Many models are created to help predict the choice a customer will
make when presented with a (Data set) request for action.

•Models typically try to predict which customers will make a purchase,
accept an offer, or click on an e-mail link.

•For such models, a technique called logistic regression is often used.
These models are usually referred to as response models or
propensity models.

•The main difference between this and an attrition model: an attrition
(churn) model predicts negative behaviour, whereas a response (purchase)
model predicts positive behaviour.

WORKING
•When using a response or propensity model, all customers are scored
and ranked by likelihood of taking action.

•Then, appropriate segments (groups) are created based on those
ranks in order to reach out to the customers.

•In theory, every customer has a unique score. In practice, since only a
small number of variables define most models, many customers end
up with identical or nearly identical scores.

•Example: Customers who are not very frequent or high-spending.

•In many cases, many customers can end up in big groups with very
similar/ very low scores.
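
The scoring and ranking steps above can be sketched as follows. This is an illustration under stated assumptions: the coefficients, the two input variables, the 0.5 cut-off, and the sample customers are invented for the example. A real response model would have coefficients fitted to historical data by logistic regression.

```python
import math

# Hypothetical coefficients for illustration only; not fitted to any data.
INTERCEPT = -3.0
COEFFS = {"purchases_last_year": 0.4, "responded_before": 1.2}


def response_score(customer):
    """Logistic (sigmoid) score: estimated likelihood of taking action."""
    z = INTERCEPT + sum(COEFFS[name] * customer[name] for name in COEFFS)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical customers described by the model's two variables.
customers = {
    "A": {"purchases_last_year": 5, "responded_before": 1},
    "B": {"purchases_last_year": 0, "responded_before": 0},
    "C": {"purchases_last_year": 0, "responded_before": 0},
    "D": {"purchases_last_year": 1, "responded_before": 0},
}

# Score every customer, then rank by likelihood of taking action.
scores = {cid: response_score(c) for cid, c in customers.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

# Create segments from the ranked scores (the 0.5 cut-off is arbitrary).
segments = {
    "high": [cid for cid in ranked if scores[cid] >= 0.5],
    "low": [cid for cid in ranked if scores[cid] < 0.5],
}
```

Customers B and C share the same inputs and therefore receive identical scores, which is the lack of differentiation the slide describes for infrequent, low-spending customers.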
•Web data can help greatly increase differentiation among customers.

For example, consider a scenario (the score can increase or decrease by
delta x):
•Customer 1 has never browsed your site
•Customer 2 viewed the product category featured in the offer within
the past month.
•Customer 3 viewed the specific product featured in the offer within
the past month.
•Customer 4 browsed the specific product featured three times last
week, added it to a basket once, abandoned the basket, then viewed
the product again later.
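
A minimal sketch of the scenario above: four customers start with the same base score and are then separated by adjustments (the "delta x") tied to their browsing behaviour. The signal names, the delta values, and the simple additive adjustment are assumptions for illustration. In practice the web variables would more likely be added as inputs to the model itself.

```python
# Hypothetical score adjustments ("delta x") per browsing signal.
DELTAS = {
    "viewed_category": 0.05,
    "viewed_product": 0.10,
    "abandoned_basket": 0.15,
}


def adjusted_score(base_score, signals):
    """Add the delta for each observed signal, keeping the result in [0, 1]."""
    score = base_score + sum(DELTAS[s] for s in signals)
    return min(max(score, 0.0), 1.0)


# All four customers look identical to the traditional model.
base = 0.05

# Browsing signals observed for each customer in the scenario.
web_signals = {
    "Customer 1": set(),
    "Customer 2": {"viewed_category"},
    "Customer 3": {"viewed_product"},
    "Customer 4": {"viewed_product", "abandoned_basket"},
}

adjusted = {name: adjusted_score(base, sig) for name, sig in web_signals.items()}
```

After the adjustment the four customers no longer tie, so they can be ranked and treated differently.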
Etc….
•When asked about the value of incorporating web data, a director
of marketing from a multichannel American specialty retailer
replied, “It’s like printing money!”

Customer Segmentation (Grouping): Study

•What is segmentation?
•How was segmentation done traditionally?
•Web data also enables segmentation of customers based on their
typical browsing patterns. (Seminar/project topic: assessing the
browsing patterns of users.)
•Such segmentation will provide a completely different view of
customers than traditional demographic or sales-based segmentation
schemas.

•Assignment: Create a Dreamers segment and identify the items
selected by the Dreamers.
Example:
•Consider a segment called the Dreamers that has been derived
purely from browsing behavior.
Who are they?
•Dreamers repeatedly put an item in their basket, but then abandon
it. Dreamers often add and abandon the same item many times.

•This may be especially true for a high-value item like a TV or
computer. It should be possible to identify the segment of people
that does this repeatedly.
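
One possible way to derive such a segment from basket events is sketched below. The event format, the item codes, the `find_dreamers` helper, and the rule "abandoned at least twice and never purchased" are assumptions for illustration. The slides do not fix a precise definition.

```python
from collections import Counter

# Hypothetical basket events: (customer, item, action).
events = [
    ("c1", "tv-900", "add"), ("c1", "tv-900", "abandon"),
    ("c1", "tv-900", "add"), ("c1", "tv-900", "abandon"),
    ("c1", "tv-900", "add"), ("c1", "tv-900", "abandon"),
    ("c2", "cam-1", "add"), ("c2", "cam-1", "abandon"),
    ("c2", "cam-1", "add"), ("c2", "cam-1", "purchase"),
    ("c3", "laptop-5", "add"), ("c3", "laptop-5", "abandon"),
    ("c3", "laptop-5", "add"), ("c3", "laptop-5", "abandon"),
    ("c3", "laptop-5", "add"), ("c3", "laptop-5", "purchase"),
]


def find_dreamers(events, min_abandons=2):
    """Return {customer: [items]} abandoned repeatedly and never bought."""
    abandons = Counter()
    purchased = set()
    for customer, item, action in events:
        if action == "abandon":
            abandons[(customer, item)] += 1
        elif action == "purchase":
            purchased.add((customer, item))
    dreamers = {}
    for (customer, item), count in abandons.items():
        if count >= min_abandons and (customer, item) not in purchased:
            dreamers.setdefault(customer, []).append(item)
    return dreamers


dreamers = find_dreamers(events)
```

The result lists both who the Dreamers are and which items they keep abandoning, which is what the follow-up analysis needs.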

•So, what is the outcome of this “Dreamers” segment?

1. What is it that the customers are abandoning?
•Perhaps a customer is looking at a high-end TV, phone, or camera that is
quite expensive.

•Is price the issue? From past data, we learn that the
customer often aims too high and later buys a less expensive product
than the one that was abandoned repeatedly.

Action Plan
•Send an e-mail pointing to less expensive options or to other varieties
of high-end TV.

2. Get to know the abandoned-basket statistics, which can help
organizations identify prospective customers who are abandoning baskets.

[Helps analysts report results such as “97% of customers
abandoned their baskets”. It also gives insights into procedural aspects,
such as the unavailability of services like COD, credit card payment, etc.]
Assessing Advertising Results
•Assessing paid search and online advertising results is another
high-impact analysis enabled with customer level web behavior
data.

•Traditional web analytics provide high-level summaries such as
total clicks, number of searches, cost per click, keywords leading to
the most clicks, page position statistics, etc.

•Most focus on a single web channel.

•This means that all statistics are based only on what happened
during the single session generated from the search or ad click.

•Once a customer leaves the web site and web session ends, the
scope of the analysis is complete.

•There is no attempt to account for past or future visits in the
statistics.

•By incorporating customers’ browsing data and extending the view
to other channels as well, it is possible to assess search and
advertising results at a much deeper level.

For Example:
• How many sales did the first click generate in the following days/weeks?
• Are certain web sites drawing more customers from referring sites?
• Cross-channel analysis: how are sales doing in a channel after information
about that channel was provided on the web via an ad or search?
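
The first question above can be sketched as a lookup of sales that fall within a window after each customer's first ad or search click, across all channels. The `clicks` and `sales` records, the channel labels, and the `sales_after_first_click` helper are hypothetical, and counting every sale inside the window is just one simple attribution rule among many.

```python
from datetime import date, timedelta

# Hypothetical ad/search clicks: (customer, click date).
clicks = [
    ("c1", date(2024, 1, 1)),
    ("c2", date(2024, 1, 3)),
    ("c1", date(2024, 1, 10)),
]

# Hypothetical sales across channels: (customer, sale date, channel).
sales = [
    ("c1", date(2024, 1, 5), "web"),
    ("c2", date(2024, 2, 20), "store"),
    ("c1", date(2024, 1, 12), "store"),
]


def sales_after_first_click(clicks, sales, window_days):
    """Return sales made within `window_days` of each customer's first click."""
    first_click = {}
    for customer, day in clicks:
        if customer not in first_click or day < first_click[customer]:
            first_click[customer] = day
    window = timedelta(days=window_days)
    return [
        (customer, day, channel)
        for customer, day, channel in sales
        if customer in first_click
        and first_click[customer] <= day <= first_click[customer] + window
    ]


within_week = sales_after_first_click(clicks, sales, window_days=7)
within_two_weeks = sales_after_first_click(clicks, sales, window_days=14)
```

Widening the window from 7 to 14 days picks up the later in-store sale, a purchase that single-session statistics would never connect to the original click.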
CROSS SECTION OF BIG DATA SOURCES AND VALUE THEY HOLD

CASE STUDY

1. AUTO INSURANCE: THE VALUE OF TELEMATICS DATA

•Telematics involves putting a sensor, or black box, into a car to
capture information about what’s happening with the car. This black
box can measure any number of things depending on how it is
configured.

•It can monitor speed, mileage driven, or if there has been any heavy
braking.

•Telematics data helps insurance companies better understand
customer risk levels and set insurance rates.

•If privacy concerns are ignored and it is taken to the extreme, a
telematics device could keep track of everywhere a car went, when it
was there, how fast it was going, and what features of the car were in
use.
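
As a small illustration of how a measurement such as heavy braking might be derived from black-box readings, the sketch below flags sharp drops between consecutive speed samples. The one-second sampling interval, the threshold of 12 km/h per second, and the sample trip are assumptions chosen for the example, not values from any real telematics device.

```python
def heavy_braking_events(speeds_kmh, interval_s=1.0, threshold_kmh_per_s=12.0):
    """Return indices of samples where deceleration meets the threshold."""
    flagged = []
    for i in range(1, len(speeds_kmh)):
        deceleration = (speeds_kmh[i - 1] - speeds_kmh[i]) / interval_s
        if deceleration >= threshold_kmh_per_s:
            flagged.append(i)
    return flagged


# Hypothetical speed samples (km/h), one reading per second.
trip_speeds = [60, 58, 57, 40, 38, 20, 19]

braking_events = heavy_braking_events(trip_speeds)
events_per_trip = len(braking_events)
```

A count such as `events_per_trip` is the kind of structured variable an insurer could feed into a risk model alongside speed and mileage.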
2. MULTIPLE INDUSTRIES: THE VALUE OF TEXT DATA

•Text is one of the biggest and most common sources of big data. Just
imagine how much text is out there.

•There are e-mails, text messages, tweets, social media postings, instant
messages, real-time chats, and audio recordings that have been
translated into text.

•Text data is one of the least structured and largest sources of big data
in existence today.

•Luckily, a lot of work has been done already to tame text data and
utilize it to make better business decisions.

•Text mining approaches have their own advantages and disadvantages.

•Here, we will focus on how to use the results, not how to produce them.

•For example, once the sentiment of a customer’s e-mail is
identified, it is possible to generate a variable that tags the
customer’s sentiment as negative or positive. That tag is now a piece
of structured data that can be fed into an analytics process.

•Creating structured data out of unstructured text is often called
information extraction.

•As another example, assume that we’ve identified which specific
products a customer commented about in his or her
communications with our company.

•We can then generate a set of variables that identify the products
discussed by the customer. Those variables are again metrics that are
structured and can be used for analysis purposes.
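
The two examples above can be sketched as one small function that turns a piece of text into a row of structured variables. The word lists, the product catalog, the added "neutral" tag, and the keyword-counting rule are all assumptions for illustration. They stand in for whatever text mining tool actually produces the sentiment and product results.

```python
import re

# Tiny hypothetical word lists and product catalog, for illustration only.
NEGATIVE = {"broken", "late", "refund", "terrible"}
POSITIVE = {"great", "love", "fast", "thanks"}
PRODUCTS = {"printer", "monitor", "laptop"}


def extract_variables(text):
    """Turn free text into structured variables: a sentiment tag plus product flags."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    balance = len(words & POSITIVE) - len(words & NEGATIVE)
    if balance > 0:
        sentiment = "positive"
    elif balance < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"  # assumption: tag added for texts with no clear signal
    row = {"sentiment": sentiment}
    for product in sorted(PRODUCTS):
        row[f"mentions_{product}"] = int(product in words)
    return row


email = "My printer arrived late and the tray is broken. I want a refund."
row = extract_variables(email)
```

Each key in `row` is now an ordinary column that can be joined to the customer record and used in an analytics process.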
3. MULTIPLE INDUSTRIES: THE VALUE OF TIME AND LOCATION DATA
•With the advent of global positioning systems (GPS), personal GPS
devices, and cellular phones, time and location information is a
growing source of data.

•A wide variety of services and applications, from Google Places to
Facebook Places, are centered on registering where a person is at a
given point in time.

•Cell phone applications can record your location and movement on
your behalf.

•Cell phones can even provide a fairly accurate location using cell
tower signals, even if a phone is not formally GPS-enabled.
