Big Data Unit 1
According to another study:
•From the beginning of recorded time (1990) until 2003, 5 billion
gigabytes of data were created.
•In 2011, the same amount was created every two days.
•In 2013, the same amount of data was created every 10 minutes.
•In 2015, the same amount or more was being generated every 10 minutes.
•Example: Search engine companies such as Google, Yahoo!, and
Microsoft have created an entirely new business by capturing the
information freely available on the World Wide Web and providing it
to people in useful ways. (SOCIAL NETWORKING)
•These companies collect trillions of bytes of data every day and provide
NEW SERVICES such as satellite images, driving directions, image
retrieval, etc.
GENESIS: The Beginning
KNOW THE DIFFERENCE BETWEEN BIG DATA AND MANAGEMENT
•BDM alone won’t get you very far. You also have to analyze the data and
act on it for data of any size to be of value.
As you drive to the store to buy the computer bundle, you get an offer for a discounted coffee
from the coffee shop you are getting ready to drive past. It says that since you’re in the area, you
can get 10% off if you stop by in the next 20 minutes.
As you drink your coffee, you receive an apology from the manufacturer of a product
that you complained about yesterday on your Facebook page, as well as on the company’s web
site. …
Finally, once you get back home, you receive notice of a gadget upgrade available for purchase in
your favorite online video game.
Etc.
DATA SOURCES
• Def#2: “Big data refers to data sets whose size is beyond the ability
of typical database software tools to capture, store, manage and
analyze.” [McKinsey Global Institute]
Variety : “Variety Is the Spice of Life”
• But also raw, semi-structured, and unstructured data from web
pages, web log files (including click-stream data), search indexes,
social media forums, e-mail, documents, sensor data from active
and passive systems, and so on.
Velocity : How Fast Is Fast?
•The speed at which the data is flowing.
•The increase in RFID sensors and other information streams has led to a
constant flow of data at a pace that has made it impossible for
traditional systems to handle.
•Getting an edge over the competition can mean identifying a trend,
problem, or opportunity only seconds, or even microseconds, before
someone else.
•For example, the query “Show me all people living in City X”
would return a single result set to be used as a warning list for
an incoming weather pattern.
•Big Data requires that you perform analytics against the volume
and variety of data while it is still in motion, not just after it is at
rest.
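The contrast between querying data at rest and analyzing data in motion can be sketched as follows. This is a minimal illustration, not a real streaming engine: the record fields (`name`, `city`), the sample data, and both function names are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


def query_at_rest(table: List[Record], city: str) -> List[Record]:
    """One-off query over stored data: returns a single, fixed result set."""
    return [row for row in table if row["city"] == city]


def query_in_motion(stream: Iterable[Record], city: str) -> Iterator[Record]:
    """Continuous query: yields each matching record as soon as it arrives."""
    for row in stream:
        if row["city"] == city:
            yield row


# Hypothetical sample data.
people = [
    {"name": "Asha", "city": "X"},
    {"name": "Ravi", "city": "Y"},
    {"name": "Meena", "city": "X"},
]

# At rest: the whole table is scanned once, after it has been stored.
warning_list = query_at_rest(people, "X")

# In motion: records are inspected one by one while they flow past.
live_alerts = list(query_in_motion(iter(people), "X"))
```

Both calls find the same people here; the difference is that the generator version can start raising alerts before the stream has ended.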
Veracity: (Unreliable Data)
Variability :
•It is often confused with variety.
Example:
•Say you have a bakery that sells 10 different breads. That is variety.
Now imagine you go to that bakery three days in a row and every
day you buy the same type of bread, but each day it tastes and smells
different. That is variability.
•Variability is thus very relevant in performing sentiment analyses.
•Variability means that the meaning is changing (rapidly).
•In (almost) the same tweets a word can have a totally different
meaning.
Visualization
•The value is in the analyses done on that data and in how the data
is turned into information and eventually into knowledge.
•The value is in how organisations will use that data and turn
their organisation into an information-centric company that
relies on insights derived from data analyses for their decision-
making.
IS THE “BIG” PART OR THE “DATA” PART MORE IMPORTANT?
•What is the most important part of the term big data? Is it (1) the
“big” part, (2) the “data” part, (3) both, or (4) neither?
•As with any source of data, big or small, the power of big data
comes from the answers to these questions:
++ What is done with that data?
++ How is it analyzed?
++ What actions are taken based on the findings?
++ How is the data used to make changes to a
business?
•People are led to believe that just because big data has high
volume, velocity, and variety, it is somehow better or more
important than other data.
• Many big data sources have a far higher percentage of useless or
low-value content than virtually any other data source.
•By the time big data is trimmed down to what you actually need,
it may not even be so big any more.
In Summary:
•Whether it stays big or whether it ends up being small when you’re
done processing it,
HOW IS BIG DATA DIFFERENT?
• For example, with the use of the Internet, customers can now
execute a transaction with a bank or retailer online. But the
transactions they execute are not fundamentally different
transactions from what they would have done traditionally.
• It will be difficult to work with such data at best and very, very
ugly at worst.
4. A substantial share of big data streams may not have much value. In
fact, much of the data may even be close to worthless.
Example: Weblog (1)
Example: Weblog (2)
HOW IS BIG DATA MORE OF THE SAME?
•The same thing that existed in the past is out in a new form.
• In many ways, big data doesn’t pose any problems that your
organization hasn’t faced before.
•Taming new, large data sources that push the current limits of
scalability is an ongoing theme in the world of analytics.
RISKS OF BIG DATA
1. An organization will be so overwhelmed with big data that it won’t
make any progress.
[The key here is to get the right people. You need the right people
attacking big data and attempting to solve the right kinds of problems]
2. Cost escalates too fast as too much big data is captured before an
organization knows what to do with it.
[It is not necessary to go for it all at once and capture 100 percent of
every new data source.]
•This has led to data being used in ways that consumers didn’t
understand or support, causing a backlash.
•Organizations should explain how they will keep data secure and how
they will use it if they expect customers to accept having their data
captured and analyzed.
WHY YOU NEED TO TAME BIG DATA
•Within a few years, any organization that isn’t analyzing big data
will be late to the game and will be stuck playing catch up for years
to come.
What is the difference between
Data Mining and Web Mining?
Machine Learning : Classification, Clustering etc.
THE STRUCTURE OF BIG DATA
•Unstructured Data
FILTERING BIG DATA EFFECTIVELY
•The biggest challenge with big data may not be the analytics you do
with it, but the extract, transform, and load (ETL) processes you have
to build to get it ready for analysis. (PART OF 90 %)
•For example, when working with a web log, a rule might be to filter
out up front any information on browser versions or operating
systems. Such data is rarely needed except for operational reasons.
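The web log rule above can be sketched as a small filter step in an ETL pipeline. This is only an illustration: the field names (`browser_version`, `operating_system`, `customer_id`, `page`) and the sample records are assumptions, not a real log format.

```python
from typing import Dict, Iterable, Iterator

# Fields assumed to matter only for operational purposes (hypothetical names).
OPERATIONAL_FIELDS = {"browser_version", "operating_system"}


def filter_log_records(
    records: Iterable[Dict[str, str]],
) -> Iterator[Dict[str, str]]:
    """Drop operational fields from each web log record before loading it."""
    for record in records:
        yield {k: v for k, v in record.items() if k not in OPERATIONAL_FIELDS}


# Hypothetical raw web log records.
raw_log = [
    {"customer_id": "c1", "page": "/laptops",
     "browser_version": "Firefox 115", "operating_system": "Linux"},
    {"customer_id": "c2", "page": "/checkout",
     "browser_version": "Chrome 120", "operating_system": "Windows"},
]

cleaned = list(filter_log_records(raw_log))
```

Filtering up front like this keeps the data that reaches the analysis stage smaller than the raw feed.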
Example 2 :Opinion Analysis
Step 1: Sample text
excellent phone, excellent service . i am a business user who
heavily depend on mobile service ….,,, there is much which
has been said in other reviews about the features of this
phone.
Step 5: Results:
• Positive opinion
• Negative opinion
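The intermediate steps are not shown here, so the sketch below is only one simple way such an opinion classifier could work: counting words from a tiny hand-made lexicon. The word lists, the `classify_opinion` name, and the extra "neutral" outcome are assumptions for illustration, not the method used in the original example.

```python
import re

# Tiny hypothetical sentiment lexicons; real systems use far larger ones.
POSITIVE_WORDS = {"excellent", "good", "great"}
NEGATIVE_WORDS = {"bad", "poor", "terrible"}


def classify_opinion(text: str) -> str:
    """Label text by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z]+", text.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"


sample = (
    "excellent phone, excellent service . i am a business user who "
    "heavily depend on mobile service"
)
label = classify_opinion(sample)
```

A word-count approach like this ignores negation and context, which is one reason variability in meaning makes sentiment analysis hard.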
•The complexity of the rules and the magnitude of the data being
removed or kept at each stage will vary by data source and by
business problem.
•The load processes and filters that are put on top of big data are
absolutely critical. Without getting those correct, it will be very
difficult to succeed.
MIXING BIG DATA WITH TRADITIONAL DATA
•Perhaps the most exciting thing about big data isn’t what it will do
for a business by itself. It’s what it will do for a business when
combined with an organization’s other data.
Example:
•An enterprise data warehouse (EDW) adds value by allowing different
data sources to intermix and enhance one another.
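A minimal sketch of mixing a big data source with traditional data: web browsing events are summarized and attached to ordinary customer records. The customer table, the event tuples, and the `most_viewed` field are all hypothetical and only illustrate the idea of intermixing sources.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Traditional data: hypothetical customer master records.
customers = {
    "c1": {"name": "Asha", "segment": "premium"},
    "c2": {"name": "Ravi", "segment": "standard"},
}

# Big data: hypothetical web browsing events (customer_id, product viewed).
browsing_events = [
    ("c1", "laptop"),
    ("c1", "laptop bag"),
    ("c2", "phone"),
    ("c1", "laptop"),
]


def enrich_customers(
    customers: Dict[str, dict], events: List[Tuple[str, str]]
) -> Dict[str, dict]:
    """Attach each customer's most-viewed product to the customer record."""
    views: Dict[str, Counter] = {}
    for customer_id, product in events:
        views.setdefault(customer_id, Counter())[product] += 1

    enriched = {}
    for customer_id, record in customers.items():
        counts = views.get(customer_id, Counter())
        most_viewed: Optional[str] = (
            counts.most_common(1)[0][0] if counts else None
        )
        enriched[customer_id] = {**record, "most_viewed": most_viewed}
    return enriched


enriched = enrich_customers(customers, browsing_events)
```

Neither source alone says that a premium customer is currently researching laptops; the combined record does.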
a. Data Mart
b. Data Warehouse
Hierarchy of Enterprise Data
THE NEED FOR STANDARDS
•Will big data continue to be a wild west of crazy formats,
unconstrained streams, and lack of definition?
•Example:
• SQL or similar language : usage with Big Data
• Formats, Interfaces to support interoperability across
distributed applications
• Web semantics: XML, OWL etc., with Big Data
• Cloud computing – Big data
TODAY’S BIG DATA IS NOT TOMORROW’S BIG DATA
•What qualifies as big data will necessarily change over time as the
tools and techniques to handle it evolve alongside raw storage size
and processing power.
•Household demographic (population) files with hundreds of fields and
millions of customers were huge and tough to manage a decade or two
ago.
•Now such data fits on a thumb drive and can be analyzed by a low-
end laptop.
Example 1:
Market basket analysis: gives marketers a better understanding of
aggregate customer purchasing behavior.
Next best product analysis: helps marketers see what products customers tend to
buy together.
Customization: personalizes the user experience and converts more web visitors
from browsers to buyers.
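Finding products that customers tend to buy together can be sketched by counting how often each pair of items appears in the same basket. The baskets below are made-up sample data, and simple pair counting is only the most basic form of this kind of analysis.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Set

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def pair_counts(baskets: Iterable[Set[str]]) -> Counter:
    """Count how many baskets contain each unordered pair of items."""
    counts: Counter = Counter()
    for basket in baskets:
        # Sorting makes ("bread", "butter") and ("butter", "bread") one key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
```

Here `top_pair` is the pair seen together most often, which is the kind of signal a marketer could use when choosing what to suggest next.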
2. Actively processing every e-mail, customer service chat, and
social media comment may become a standard practice for most
organizations.
2. Imagine video game telemetry data being upgraded to go
beyond every button pressed or movement made
•Today, such topics can be addressed with the use of detailed web
data.
WEB DATA OVERVIEW
•However, the finish line is always moving. Just when you think you
have finally arrived, the finish line moves farther out again.
•A few decades ago, companies were at the top of their game if they
had the names and addresses of their customers and they were able to
append demographic information (location & population) to those
names through the then-new third-party data enhancement services.
What Are You Missing? (with Traditional Data)
•Have you ever stopped to think about what happens if only the
transactions generated by a web site are captured?
•Not just what they buy, but what they are thinking about buying
along with what key decision criteria they use.
Example:
1. Imagine you are a retailer. Imagine walking through the store with
customers and recording every place they go, every item they look at,
every item they pick up, every item they put in the cart and take back
out. Imagine knowing whether they read nutritional information, if they
look at laundry instructions, if they read the promotional brochure on the
shelf, or if they look at other information made available to them in
the store.
2. Imagine you are a telecom company. Imagine being able to
identify every phone model, rate plan, data plan, and accessory
that customers considered before making a final decision.
What Data Should Be Collected and from where?
•Any action that a customer takes while interacting with an
organization should be captured if it is possible to capture it from
web sites, kiosks, social media, mobile apps, etc.
What about privacy? (How is Flipkart handling this?)
•Privacy is a big issue today and may become an even bigger issue as
time passes.
•Organizations need to respect not just formal legal restrictions, but also
what their customers will view as appropriate.
•It is the patterns across faceless customers that matter, not the
behavior of any specific customer.
•With today’s database technologies, it is possible to enable
analytic professionals to do analysis without having any ability to
identify the individuals involved.
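One common way to let analysts work on customer behavior without seeing identities is to replace each identifier with a keyed hash before the data reaches them. The sketch below assumes this approach; the key value, the e-mail addresses, and the `pseudonymize` name are hypothetical. Strictly speaking this is pseudonymization rather than full anonymization, since whoever holds the key can recompute the mapping.

```python
import hashlib
import hmac

# Hypothetical secret, held by the data owner and never given to analysts.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(customer_id: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a stable token that does not reveal it."""
    digest = hmac.new(key, customer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical raw events: (customer identifier, action).
events = [
    ("alice@example.com", "viewed laptop"),
    ("bob@example.com", "viewed phone"),
    ("alice@example.com", "viewed laptop bag"),
]

# Analysts receive only the tokenized events.
anonymous_events = [(pseudonymize(cid), action) for cid, action in events]
```

The same customer always maps to the same token, so patterns across sessions are still visible even though the individual is not.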
What Web Data Reveals
1. Shopping Behaviors:
•Offer a package right away that contains the specific mix of items the
customer has browsed.
•Do not wait until after customers purchase the computer and then
offer generic bundles of accessories.
•An airline can tell a number of things about preferences based on the
ticket that is booked.
•This is all useful, but an airline can get even more from web data.
•An airline can identify customers who value convenience. (Such
customers typically start searches for specific times and direct flights
only.)
•Airlines can also identify customers who value price first and foremost
and are willing to consider many flight options to get the best price.
For example, consider an online store selling clothes: sarees, Zovi shirts.
•Another way to use web data to understand customers’ research
patterns is to identify which of the pieces of information offered on a
site are valued by the customer base overall and by the best customers
specifically.
•Session data combined with other data helps reveal when customers
buy: on the same day or the next day.
Feedback Behaviors
•Does it matter?
Web Data in Action
•What an organization knows about its customers is never the
complete picture.
•If there is only a partial view, the full view can often be extrapolated
accurately enough to get the job done.
•In the cases where the missing information differs from the
assumptions, it is possible to make suboptimal, if not totally wrong,
decisions.
•A very common marketing EXAMPLE is to predict the next best
offer for a customer. Of all the available options, which single offer should
next be suggested to a customer to maximize the chances of success?
Case 1: BANK
• Mr. Kumar has an account with PNB … etc., with
relevant information.
Case 2: Dominos
•Traditional data they get is:
• Historical purchases
• Marketing campaign and response history
•With web data:
• The effort leads to major changes in the promotional efforts versus
the traditional approach, providing the following results:
• A decrease in total mailings
• A reduction in total catalog promotions pages
• A materially significant increase in total revenues
Example :
•Mrs. Smith, as a customer of telecom Provider “AIR”, goes to Google
and types “How do I cancel my Provider AIR contract?” (Web Data).
• Company analysts would perhaps not have seen her usage
dropping.
•By capturing Mrs. Smith’s actions on the web, Provider “AIR”, is able
to move more quickly to avert losing Mrs. Smith.
Response Modelling
•Many models are created to help predict the choice a customer will
make when presented with a (Data set) request for action.
WORKING
•When using a response or propensity model, all customers are scored
and ranked by likelihood of taking action.
•In theory, every customer has a unique score. In practice, since only a
small number of variables define most models, many customers end
up with identical or nearly identical scores.
•In many cases, many customers can end up in big groups with very
similar/ very low scores.
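The scoring and ranking step, and the way few variables produce large groups of identical scores, can be sketched as follows. The two input variables, the integer weights, and the customer records are invented for illustration; a real response model would be fitted to historical data.

```python
from collections import Counter
from typing import Dict, List


def response_score(customer: Dict[str, int]) -> int:
    """Hypothetical two-variable model; the weights are illustrative only."""
    return (5 * customer["bought_last_year"]
            + 3 * customer["opened_last_email"])


# Hypothetical customers described by just two yes/no variables.
customers: List[Dict] = [
    {"id": "c1", "bought_last_year": 1, "opened_last_email": 1},
    {"id": "c2", "bought_last_year": 0, "opened_last_email": 0},
    {"id": "c3", "bought_last_year": 1, "opened_last_email": 0},
    {"id": "c4", "bought_last_year": 0, "opened_last_email": 0},
    {"id": "c5", "bought_last_year": 0, "opened_last_email": 0},
]

# Rank all customers by likelihood of taking action (highest score first).
ranked = sorted(customers, key=response_score, reverse=True)

# Count how many customers share each score.
score_groups = Counter(response_score(c) for c in customers)
```

With only two yes/no variables there are at most four distinct scores, so most customers here fall into one low-scoring group that the model cannot tell apart.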
•Web data can help greatly increase differentiation among customers.
Customer Segmentation (Grouping): Study
•What is segmentation?
•How was segmentation done traditionally?
•Web data also enables segmentation of customers based on their
typical browsing patterns. (Seminar/Project topic on assessing
browsing pattern of users)
•Such segmentation will provide a completely different view of
customers than traditional demographic or sales-based segmentation
schemas.
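A very simple form of browsing-based segmentation assigns each customer to the type of page they view most often. The page types, segment names, and sample data below are assumptions made for this sketch; real segmentation would normally use many more features and a clustering method.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical page-view histories: customer -> types of pages visited.
page_views: Dict[str, List[str]] = {
    "c1": ["reviews", "reviews", "specs", "checkout"],
    "c2": ["deals", "deals", "deals", "checkout"],
    "c3": ["specs", "specs", "reviews"],
}

# Hypothetical mapping from dominant page type to a segment label.
SEGMENT_BY_PAGE_TYPE = {
    "reviews": "review reader",
    "specs": "spec comparer",
    "deals": "bargain hunter",
}


def browsing_segment(views: List[str]) -> str:
    """Assign the segment of the page type the customer viewed most."""
    counts = Counter(v for v in views if v in SEGMENT_BY_PAGE_TYPE)
    if not counts:
        return "unclassified"
    dominant_page_type, _ = counts.most_common(1)[0]
    return SEGMENT_BY_PAGE_TYPE[dominant_page_type]


segments = {cid: browsing_segment(views) for cid, views in page_views.items()}
```

None of these labels could be derived from demographics or sales totals alone, which is the sense in which browsing patterns give a different view of customers.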
• Is price the issue? From past data, we know that the customer often
aims too high and later buys a less expensive product than the one
that was abandoned repeatedly.
Action Plan
•Send an e-mail pointing to less expensive options or other varieties
of high-end TV.
•This means that all statistics are based only on what happened
during the single session generated from the search or ad click
•Once a customer leaves the web site and web session ends, the
scope of the analysis is complete.
For example:
• How many sales did the first click generate over the following days/weeks?
• Are certain web sites drawing more customers from referred sites?
• Cross-channel analysis: how are sales doing after information
about the channel was provided on the web via an ad or search?
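Looking beyond the single session can be sketched by counting the sales that follow a customer's first click within a chosen number of days. The dates, customer IDs, and the `sales_within` helper are hypothetical, and the sketch assumes that clicks and purchases can be linked to the same customer.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

# Hypothetical date of each customer's first ad or search click.
first_clicks: Dict[str, date] = {
    "c1": date(2024, 1, 1),
    "c2": date(2024, 1, 3),
}

# Hypothetical purchases: (customer_id, purchase date).
purchases: List[Tuple[str, date]] = [
    ("c1", date(2024, 1, 1)),   # same day as the click
    ("c1", date(2024, 1, 5)),   # four days later
    ("c2", date(2024, 1, 20)),  # seventeen days later
    ("c3", date(2024, 1, 2)),   # customer with no recorded click
]


def sales_within(
    first_clicks: Dict[str, date],
    purchases: List[Tuple[str, date]],
    days: int,
) -> int:
    """Count purchases made from 0 to `days` days after the first click."""
    window = timedelta(days=days)
    return sum(
        1
        for customer_id, when in purchases
        if customer_id in first_clicks
        and timedelta(0) <= when - first_clicks[customer_id] <= window
    )


same_session_day = sales_within(first_clicks, purchases, days=0)
within_week = sales_within(first_clicks, purchases, days=7)
within_month = sales_within(first_clicks, purchases, days=30)
```

Widening the window from the click day to a week or a month credits the click with sales that a session-only view would miss.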
CROSS SECTION OF BIG DATA SOURCES
AND VALUE THEY HOLD
CASE STUDY
•It can monitor speed, mileage driven, or if there has been any heavy
braking.
•Text is one of the biggest and most common sources of big data. Just
imagine how much text is out there.
•There are e-mails, text messages, tweets, social media postings, instant
messages, real-time chats, and audio recordings that have been
translated into text.
•Text data is one of the least structured and largest sources of big data
in existence today.
•Luckily, a lot of work has been done already to tame text data and
utilize it to make better business decisions
•Here, we will focus on how to use the results, not how to produce them.
•We can then generate a set of variables that identify the products
discussed by the customer. Those variables are again metrics that are
structured and can be used for analysis purposes.
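Turning free text into structured variables can be sketched as one 0/1 flag per product mentioned. The product list, the `mentions_` naming, and the sample comment are assumptions for illustration; real text analytics would also handle synonyms, misspellings, and multi-word product names.

```python
import re
from typing import Dict, List

# Hypothetical product catalog of single-word product names.
PRODUCTS: List[str] = ["phone", "tablet", "laptop"]


def product_mention_flags(text: str) -> Dict[str, int]:
    """Return one structured 0/1 variable per product mentioned in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {f"mentions_{product}": int(product in words)
            for product in PRODUCTS}


comment = "My new phone is great, but the laptop keeps crashing."
flags = product_mention_flags(comment)
```

Each flag is an ordinary structured column, so it can be joined to customer records and used in the same models as any other variable.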
MULTIPLE INDUSTRIES: THE VALUE OF TIME AND LOCATION DATA
•With the advent of global positioning systems (GPS), personal GPS
devices, and cellular phones, time and location information is a
growing source of data.
•Cell phones can even provide a fairly accurate location using cell
tower signals, if a phone is not formally GPS-enabled.