Big Data Analytics - AKM
•These companies collect trillions of data points every day and provide NEW SERVICES such as satellite images, driving directions, image retrieval, etc.
•BDM (big data management) alone won’t get you very far. For data of any size to be of value, you also have to analyze and act on it.
As you drive to the store to buy the computer bundle, you get an offer for a discounted coffee
from the coffee shop you are getting ready to drive past. It says that since you’re in the area, you
can get 10% off if you stop by in the next 20 minutes
As you drink your coffee, you receive an apology from the manufacturer of a product
that you complained about yesterday on your Facebook page, as well as on the company’s web
site. …
Finally, once you get back home, you receive notice of a gadget upgrade available for purchase
in your favorite online video game.
And so on…
• Def#2: “Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” [McKinsey Global Institute]
•But also raw, semi-structured, and unstructured data from web pages, web log files (including click-stream data), search indexes, social media forums, e-mail, documents, sensor data from active and passive systems, and so on.
•Big Data requires that you perform analytics against the volume
and variety of data while it is still in motion, not just after it is at
rest.
•The value is in the analyses done on that data and how the data is turned into information and, eventually, into knowledge.
•The value is in how organisations will use that data and turn
their organisation into an information-centric company that
relies on insights derived from data analyses for their decision-
making.
•What is the most important part of the term big data? Is it (1) the
“big” part, (2) the “data” part, (3) both, or (4) neither?
•As with any source of data, big or small, the power of big data comes from:
++ What is done with that data?
++ How is it analyzed?
++ What actions are taken based on the findings?
++ How is the data used to make changes to a business?
•People are led to believe that just because big data has high
volume, velocity, and variety, it is somehow better or more
important than other data.
Big Data Analytics- Dr. Anil Kumar K.M 24
• Many big data sources have a far higher percentage of useless or
low-value content than virtually any other data source.
•By the time big data is trimmed down to what you actually need, it may not even be so big anymore.
In Summary:
•Whether it stays big or ends up being small when you’re done processing it, the value comes from what you do with it.
•For Example, with the use of the Internet, customers can now
execute a transaction with a bank or retailer online. But the
transactions they execute are not fundamentally different
transactions from what they would have done traditionally.
•It will be difficult to work with such data at best and very, very
ugly at worst.
•It is necessary to weed through and pull out the valuable and
relevant pieces
• In many ways, big data doesn’t pose any problems that your
organization hasn’t faced before.
•Taming new, large data sources that push the current limits of
scalability is an ongoing theme in the world of analytics
[The key here is to get the right people. You need the right people
attacking big data and attempting to solve the right kinds of problems]
2. Cost escalates too fast when too much big data is captured before an organization knows what to do with it.
[It is not necessary to go for it all at once and capture 100 percent of
every new data source.
•This has led to data being used in ways that consumers didn’t
understand or support, causing a backlash
•Organizations should explain how they will keep data secure and how they will use it, if they accept their data to be captured and analyzed.
WHY YOU NEED TO TAME BIG DATA
•Within a few years, any organization that isn’t analyzing big data
will be late to the game and will be stuck playing catch up for years
to come.
•Unstructured Data
•For example, when working with a web log, a rule might be to filter
out up front any information on browser versions or operating
systems. Such data is rarely needed except for operational reasons.
Step 5: Results:
• Positive opinion
• Negative opinion
•The load processes and filters that are put on top of big data are
absolutely critical. Without getting those correct, it will be very
difficult to succeed.
•Perhaps the most exciting thing about big data isn’t what it will do
for a business by itself. It’s what it will do for a business when
combined with an organization’s other data.
Example:
•An EDW adds value by allowing different data sources to intermix and
enhance one another.
b. Data Warehouse
•Example:
• SQL or similar language : usage with Big Data
• Formats, Interfaces to support interoperability across
distributed applications
• Web semantics: XML, OWL etc., with Big Data
• Cloud computing – Big data
•What qualifies as big data will necessarily change over time as the
tools and techniques to handle it evolve alongside raw storage size
and processing power.
•Household demographic (population) files with hundreds of fields and
millions of customers were huge and tough to manage a decade or two
ago.
•Now such data fits on a thumb drive and can be analyzed on a low-end laptop.
Example 1:
Market basket analysis – The benefit of basket analysis for marketers is that it
can give them a better understanding of aggregate customer purchasing behavior
Next Best Product analysis: helps marketers see what products customers tend to buy together.
Customization: personalize the user experience and convert more web visitors
from browsers to buyers.
•Today, such topics can be addressed with the use of detailed web
data.
•However, the finish line is always moving. Just when you think you
have finally arrived, the finish line moves farther out again.
•Not just what they buy, but what they are thinking about buying along
with what key decision criteria they use.
Example:
1. Imagine you are a retailer. Imagine walking through with customers
and recording every place they go, every item they look at, every item
they pick up, every item they put in the cart and back out. Imagine
knowing whether they read nutritional information, if they look at
laundry instructions, if they read the promotional brochure on the
shelf, or if they look at other information made available to them in
the store.
2.Imagine you are a telecom company. Imagine being able to
identify every phone model, rate plan, data plan, and accessory
that customers considered before making a final decision.
•Privacy is a big issue today and may become an even bigger issue as
time passes.
•Need to respect not just formal legal restrictions, but also what your
customers will view as appropriate.
•It is the patterns across faceless customers that matter, not the
behavior of any specific customer
•With today’s database technologies, it is possible to enable
analytic professionals to do analysis without having any ability to
identify the individuals involved.
•Offer a package right away that contains the specific mix of items the
customer has browsed.
•Do not wait until after customers purchase the computer and then
offer generic bundles of accessories.
•An airline can tell a number of things about preferences based on the
ticket that is booked.
•This is all useful, but an airline can get even more from web data.
•An airline can identify customers who value convenience (Such
customers typically start searches for specific times and direct flights
only.)
•Airlines can also identify customers who value price first and foremost
and are willing to consider many flight options to get the best price.
For example, consider an online store selling clothes: Saree, Zovi Shirts
•Another way to use web data to understand customers’ research
patterns: is to identify which of the pieces of information offered on a
site are valued by the customer base overall and the best customers
specifically.
•Session data combined with other data will help to know when the customers buy: on the same day or the next day.
Feedback Behaviors
•Does it matter?
•If there is only a partial view, the full view can often be extrapolated
accurately enough to get the job done.
•In the cases where the missing information differs from the
assumptions, it is possible to make suboptimal, if not totally wrong,
decisions.
•A very common marketing example is to predict the next best offer for a customer. Of all the available options, which single offer should next be suggested to a customer to maximize the chances of success?
Case 1: BANK
• Mr. Kumar has an account with PNB… etc., with relevant information.
Example :
•Mrs. Smith, as a customer of telecom Provider “AIR”, goes to Google
and types “How do I cancel my Provider AIR contract?” (Web Data).
•By capturing Mrs. Smith’s actions on the web, Provider “AIR”, is able
to move more quickly to avert losing Mrs. Smith.
•In theory, every customer has a unique score. In practice, since only a
small number of variables define most models, many customers end
up with identical or nearly identical scores.
•In many cases, many customers can end up in big groups with very
similar/ very low scores.
•Web data can help greatly increase differentiation among customers.
•Is price the issue? From the past data, we get to know that the customer often aims too high and later will buy a less-expensive product than the one that was abandoned repeatedly.
Action Plan
•Send an e-mail pointing to less-expensive options or other varieties of high-end TVs.
•This means that all statistics are based only on what happened
during the single session generated from the search or ad click
For Example:
• How many sales did the first click generate in days/weeks?
• Are certain web sites drawing more customers from referred sites?
• Cross-channel analysis: study how sales are doing after information about the channel was provided on the web via ad or search.
CROSS SECTION OF BIG DATA SOURCES
AND VALUE THEY HOLD
•It can monitor speed, mileage driven, or if there has been any heavy
braking.
•Text is one of the biggest and most common sources of big data. Just
imagine how much text is out there.
•There are e-mails, text messages, tweets, social media postings, instant
messages, real-time chats, and audio recordings that have been
translated into text.
•Text data is one of the least structured and largest sources of big data
in existence today.
•Luckily, a lot of work has been done already to tame text data and
utilize it to make better business decisions
•We can then generate a set of variables that identify the products
discussed by the customer. Those variables are again metrics that are
structured and can be used for analysis purposes.
MULTIPLE INDUSTRIES: THE VALUE OF TIME AND LOCATION DATA
•With the advent of global positioning systems (GPS), personal GPS
devices, and cellular phones, time and location information is a
growing source of data.
•Cell phones can even provide a fairly accurate location using cell
tower signals, if a phone is not formally GPS-enabled.
•The fact is, if you carry a cell phone, you can keep a record of
everywhere you’ve been. You can also open up that data to others if
you choose.
•Calculator
•It removes the traditional constraint of having one central server with only a single CPU and disk to manage it.
If an MPP system with 10 processing units is used. Data is broken into
10 independent 100-gigabyte chunks. This means it will execute 10
simultaneous 100-gigabyte queries. If more processing power and more
speed are required, just include additional capacity in the form of
additional processing units.
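The divide-and-conquer idea behind MPP can be sketched in Python with parallel workers; the 1,000-row data set, the 10-unit split, and the simple sum aggregate are illustrative assumptions, not features of any particular MPP product:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each "processing unit" scans only its own slice of the data.
    return sum(chunk)

data = list(range(1000))      # stands in for a large table
units = 10                    # number of parallel processing units
size = len(data) // units
chunks = [data[i * size:(i + 1) * size] for i in range(units)]

# The 10 chunks are scanned simultaneously, one per worker.
with ThreadPoolExecutor(max_workers=units) as pool:
    partials = list(pool.map(process_chunk, chunks))

total = sum(partials)         # combine the partial results
print(total)                  # 499500
```

Adding more capacity means adding more workers and splitting the data into more chunks, which is exactly the scaling story described above.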
Let’s start by defining what cloud computing is all about and how it can
help with advanced analytics and big data.
•In these cases, it’s necessary to pull data out into a more
traditional analytics environment and run analytic tools
against that data in the traditional way.
•Large servers have been utilized for such work for quite
some time.
For Example:
•If there is a steady stream of web logs coming in, it might
be handed out in chunks to the various worker nodes.
Advantage
•With MapReduce, computational processing can occur
on data stored in a file system without loading it into a
database.
3. The results of the map step are then passed to the reduce process to summarize and aggregate the final answers.
•Consider an example where an organization has a bunch
of text flowing in from online customer service chats
taking place on its web site.
•The map function will simply find each word, parse it out
of its paragraph, and associate a count of one with it. The
end result of the map step is a set of key-value pairs such
as “<my, 1>,” “<product, 1>,” “<broke, 1>.”
•At this point, the goal is to figure out how many times
each word appeared. What happens next is called
shuffling. During shuffling the answers from the map steps
are distributed through hashing so that the same key words end up on the same reduce node.
•For example, in a simple situation there would be 26 reduce nodes so
that all the words beginning with A go to one node, all the B’s go to
another, all the C’s go to another, and so on.
•The reduce step will simply get the count by word. Based on our
example, the process will end up with “<my, 10>,” “<product, 25>,”
“<broke, 20>,” where the numbers represent how many times the word
was found.
•Once the word counts are computed, the results can be fed into an
analysis. The frequency of certain product names can be identified. The
frequency of words like “broken” or “angry” can be identified.
Using the MapReduce framework, the task is broken down into five map tasks, where each mapper works on one of the five files; each mapper task goes through the data and returns the maximum temperature for each city.
For example, the results produced from one mapper task for the data above would
look like this: (Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
Let’s assume the other four mapper tasks (working on the other four files not shown
here) produced the following intermediate results:
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which combine the
input results and output a single value for each city, producing a final result set as
follows:
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
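The reduce step for this temperature example can be sketched directly from the intermediate results listed above:

```python
from collections import defaultdict

# Intermediate (city, temperature) pairs from the five mapper tasks.
mapper_outputs = [
    [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 33)],
    [("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("Whitby", 22), ("New York", 19), ("Rome", 30)],
]

# Reduce: combine all five output streams and keep the maximum per city.
maxima = defaultdict(lambda: float("-inf"))
for stream in mapper_outputs:
    for city, temp in stream:
        maxima[city] = max(maxima[city], temp)

print(dict(maxima))  # Toronto 32, Whitby 27, New York 33, Rome 38
```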
What does increased scalability bring to the organization?
(Not much if it is not put into use.)
•Example: It will be a lot like buying a new 3-D TV and then simply
connecting it to an antenna, grabbing local TV signals from the air.
The picture might be improved over your old TV, but you certainly
won’t be changing your viewing experience very much compared
to what is possible with the new TV.
Characteristic of Data
•Data created within the sandbox is segregated from the
production database.
•Sandbox users will also be allowed to load data of their own
for brief time periods as part of a project, even if that data is
not part of the official enterprise data model.
4.Control:
•IT will be able to control the sandbox environment, balancing
sandbox needs and the needs of other users.
• Since all of the production data and all of the sandbox data are
within the production system, it’s very easy to link those
sources to one another and work with all the data together.
3. An internal sandbox is very cost-effective since no new
hardware is needed.
• The production system is already in place. It is just being
used in a new way.
• The elimination of any and all cross-platform data movement
also lowers costs
• The one exception: with big data, data movement is required between the database and the MapReduce environment.
Weakness:
1. There will be an additional load on the existing enterprise data
warehouse or data mart. The sandbox will use both space and CPU
resources.
•It will have no impact on other processes, which allows for flexibility
in design and usage.
•One common question that often arises is “Isn’t this external system
completely violating this concept of keeping the data in-database
when analyzing it?”
•An external sandbox is exactly the same concept for the exact same
reasons, only it’s dedicated to analytic initiatives.
Strength
1. The biggest strength of an external sandbox is its simplicity.
•If data extracts sent to the sandbox are kept in the same structure as
on production, migrating will be easy to do.
•When it comes to working with big data, a MapReduce environment
should be included as part of an external sandbox environment.
Weakness
[ It will be necessary to move data from the production system into the
sandbox before analysis. ]
•An analytic data set (ADS) is the data that is pulled together in
order to create an analysis or model.
•It is data in the format required for the specific analysis at hand.
•What this means is that there will be one record per customer,
location, product, or whatever type of entity is being analyzed.
•The analytic data set helps to bridge the gap between efficient
storage and ease of use.
Development ADS:
•It will have all the candidate variables that may be needed to solve
a problem and will be very wide.
•It might have hundreds or even thousands of variables or metrics
within it.
•However, it’s also fairly shallow, meaning that many times
development work can be done on just a sample of data.
•This makes a development ADS very wide but not very deep.
•It’s going to contain only the specific metrics (most processes only
need a small fraction of the metrics explored during development)
that were actually in the final solution.
•In a traditional environment, all analytic data sets are created outside
of the database.
•Each analytic professional creates his or her own analytic data sets
independently.
•This is done by every analytic professional, which means that there are
possibly hundreds of people generating their own independent views of
corporate data. It gets worse!
•An ADS is usually generated from scratch for each individual project.
•The problem is not just that each analytic professional has a single copy
of the production data. Each analytic professional often makes a new
ADS, and therefore a new copy of the data is required for every project.
•As mentioned earlier, there are cases where companies with a
given amount of data end up with 10 or 20 times that much data in
their analytic environment.
•One of the big issues people don’t think about with traditional ADS
processes is the risk of inconsistencies.
•The relationship between each analytic data set and the analytic
processes that have been created.
1. The intended usage for the model. What business issue does it
address? What are the appropriate business scenarios where it
should be used?
•That is, all our data is available when and if we want it.
In this topic:
•The fact that the rate of arrival of stream elements is not under the
control of the system distinguishes stream processing from the
processing of data that goes on within a database-management system.
•The latter system controls the rate at which data is read from the disk,
and therefore never has to worry about data getting lost as it attempts
to execute queries.
•A sudden increase in the click rate for a link could indicate some news connected to that page, or it could mean that the link is broken.
Stream Queries
•There are two ways that queries are asked about streams.
Example:
Web sites often like to report the number of unique users over the
past month. If we think of each login as a stream element, we can
maintain a window that is all logins in the most recent month and
associate the arrival time with each login.
However, for even the largest sites, that data is not more than a few
terabytes, and so surely can be stored on disk.
Inputs:
Sample size k
Window size n >> k (alternatively, time duration m)
Stream of data elements that arrive online
Output:
k elements chosen uniformly at random from the last n
elements (alternatively, from all elements that have arrived in
the last m time units)
Goal:
maintain a data structure that can produce the desired
output at any time upon request
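A minimal sketch of the classic reservoir-sampling approach to this problem, covering the alternative form above (a uniform sample of k elements from all elements seen so far); sampling only from the last n elements requires a more involved windowed variant:

```python
import random

def reservoir_sample(stream, k):
    # Maintain a uniform random sample of size k over the stream seen so far.
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            # Keep item i with probability k/(i+1), evicting a random slot.
            j = random.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

random.seed(42)
print(reservoir_sample(range(1000), 5))
```

The data structure (the reservoir) can produce the desired output at any time upon request, as the goal above demands.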
Bit array: 00000000000
For element y: bit 5 is 1, but bit 3 is 0.
We conclude y has not been seen before.
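The bit-array membership test above is the idea behind a Bloom filter; a minimal sketch, where the two hash functions and the 11-bit array size are illustrative assumptions:

```python
def positions(item, size):
    # Two hypothetical hash functions mapping an item to bit positions.
    return [hash(("h1", item)) % size, hash(("h2", item)) % size]

size = 11
bits = [0] * size            # the 11-bit filter, initially all zeros

def add(item):
    # Set every bit position the item hashes to.
    for pos in positions(item, size):
        bits[pos] = 1

def might_contain(item):
    # If any of the item's bits is 0, the item was definitely never added.
    return all(bits[pos] for pos in positions(item, size))

add("x")
print(might_contain("x"))    # True
print(might_contain("y"))    # False, unless y's bits collide with x's
```

Note the asymmetry: a 0 bit proves absence, while all-1 bits only suggest presence (false positives are possible).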
Stream processing:
•Introduction
•Applications
•Association rules
•Support
•Confidence
•Example
Association rule:
•How to find the products that are purchased together, or entities that go together?
Ans: Association rules
Rule form
Antecedent => Consequent [support, confidence]
A => B [s, c]
•Confidence:
Denotes the percentage of transactions containing A which also
contain B.
•Example:
Transaction ID   Products
1                Shoes, Trouser, Shirt, Belt
2                Shoes, Trouser, Shirt, Hat, Belt, Scarf
3                Shoes, Shirt
4                Shoes, Trouser, Belt
Consider the rule Trouser => Shirt; we’ll check whether this rule would be an interesting one or not.
     Shoes  Trouser  Shirt  Belt  Hat  Scarf
T1     1       1       1      1     0     0
T2     1       1       1      1     1     1
T3     1       0       1      0     0     0
T4     1       1       0      1     0     0
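Support and confidence for Trouser => Shirt can be computed directly from the four transactions above:

```python
transactions = [
    {"shoes", "trouser", "shirt", "belt"},
    {"shoes", "trouser", "shirt", "hat", "belt", "scarf"},
    {"shoes", "shirt"},
    {"shoes", "trouser", "belt"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction
    # that also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)

print(support({"trouser", "shirt"}))       # 0.5  (2 of 4 transactions)
print(confidence({"trouser"}, {"shirt"}))  # 2/3, about 0.667
```

Two of the three transactions containing Trouser also contain Shirt, so the rule has 50% support and about 67% confidence.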
ID Item
1 HotDogs, Buns, Ketchup
2 HotDogs, Buns
3 HotDogs, Coke, Chips
4 Chips, Coke
5 Chips, Ketchup
6 HotDogs, Coke, Chips
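On the six transactions above, frequently co-purchased pairs can be found with a brute-force version of the first Apriori passes; the minimum support of 2 transactions is an illustrative choice:

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"HotDogs", "Buns", "Ketchup"},
    {"HotDogs", "Buns"},
    {"HotDogs", "Coke", "Chips"},
    {"Chips", "Coke"},
    {"Chips", "Ketchup"},
    {"HotDogs", "Coke", "Chips"},
]

# Count every pair of items appearing together in a transaction.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# Keep only pairs meeting the minimum support of 2 transactions.
min_count = 2
frequent = {p: c for p, c in pair_counts.items() if c >= min_count}
print(frequent)  # (Chips, Coke) appears 3 times; (Buns, HotDogs) twice
```

Pairs like (Chips, Coke) stand out as candidates for association rules, while one-off pairs such as (Chips, Ketchup) are pruned.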
Clustering
2 5 6 8 12 15 18 28 30
We’ll get:
2 4.3 5 6 8 12 13.25 15 18 28 29 30
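One way to cluster one-dimensional points like those above is simple agglomerative clustering, repeatedly merging the two clusters whose centroids are closest; stopping at three clusters is an illustrative assumption:

```python
def agglomerative(points, k):
    # Start with each point in its own cluster; merge the two clusters
    # with the closest centroids until only k clusters remain.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = sum(clusters[i]) / len(clusters[i])
                cj = sum(clusters[j]) / len(clusters[j])
                d = abs(ci - cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return [sorted(c) for c in clusters]

points = [2, 5, 6, 8, 12, 15, 18, 28, 30]
print(agglomerative(points, 3))  # [[2, 5, 6, 8], [12, 15, 18], [28, 30]]
```

The three natural groups of nearby values emerge without specifying any boundaries in advance.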