Unit-1 Module Updated
Big data is the term for a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or traditional data
processing applications.
Examples of Bigdata
• Following are some of the examples of Big Data:
– The New York Stock Exchange generates about one terabyte of new trade data
per day.
– Other examples of Big Data generation includes
• stock exchanges,
• social media sites,
• jet engines,
• etc.
1.1.3 Types Of Big Data
BigData could be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
Structured Data
• Any data that can be stored, accessed and processed in the form of a fixed format is termed 'structured' data.
• Over time, techniques have been developed for working with such data (where the format is well known in advance) and for deriving value out of it.
• The issue foreseen today is that such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
• One zettabyte equals 10^21 bytes, or one billion terabytes.
• Data stored in a relational database management system is one example of 'structured' data.
• An 'Employee' table in a database is an example of structured data.
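As an illustration, such a table can be created and queried with standard SQL because the schema is fixed and known in advance. The sketch below uses Python's built-in sqlite3 module; the column names and values are hypothetical, not taken from the original table.

# Minimal sketch of a structured 'Employee' table (hypothetical columns and rows)
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE Employee (emp_id INTEGER, name TEXT, dept TEXT, salary REAL)")
cur.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Finance", 65000.0), (2, "Ravi", "Sales", 48000.0)],
)
conn.commit()

# Because the format is fixed, querying and deriving value is straightforward.
for row in cur.execute("SELECT name, salary FROM Employee WHERE salary > 50000"):
    print(row)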
Unstructured Data
• Any data with unknown form or the structure is classified as unstructured data.
• In addition to the size being huge, un-structured data poses multiple challenges in terms
of its processing for deriving value out of it.
• A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos etc.
• Nowadays organizations have a wealth of data available with them but, unfortunately, they don't know how to derive value out of it since this data is in its raw or unstructured form.
• Example of unstructured data: the output returned by 'Google Search'.
Semi-structured Data
• Semi-structured data can contain both forms of data; an example is data represented in an XML file, such as the personal records below:
<age>35</age>
</rec>
<rec>
<name>Seema R.</name>
<sex>Female</sex>
<age>41</age>
</rec>
<rec>
<name>Satish Mane</name>
<sex>Male</sex>
<age>29</age>
</rec>
<rec>
<name>Subrato Roy</name>
<sex>Male</sex>
<age>26</age>
</rec>
<rec>
<name>Jeremiah J.</name>
<sex>Male</sex>
<age>35</age>
</rec>
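Because such semi-structured records carry their structure in tags rather than in a fixed schema, they are usually parsed before analysis. A minimal sketch using Python's standard xml.etree.ElementTree follows; the enclosing <records> element is an assumption, since the fragment above is truncated.

# Minimal sketch: parsing semi-structured XML records (wrapper element assumed)
import xml.etree.ElementTree as ET

xml_text = """
<records>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</records>
"""

root = ET.fromstring(xml_text)
for rec in root.findall("rec"):
    # Each record describes itself with tags instead of fitting a fixed table schema.
    print(rec.findtext("name"), rec.findtext("sex"), rec.findtext("age"))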
1.1.4 Three Characteristics of Big Data (the 3 Vs):
Volume - Volume is the size of the data, which determines the value and potential of the data under consideration and whether it can actually be considered Big Data or not. The name 'Big Data' itself refers to size, hence this characteristic.
Variety - The category to which Big Data belongs is also an essential fact that needs to be known by the data analysts. This helps the people who closely analyze the data, and are associated with it, to use the data effectively to their advantage, upholding the importance of the Big Data.
Velocity - The term 'velocity' refers to the speed of generation of data, or how fast the data is generated and processed to meet demands and challenges.
1.1.5 Applications of Big Data
Big data has increased the demand for information management specialists.
Government: Data analysis often requires multiple parts of government (central and local) to
work in collaboration and create new and innovative processes to deliver the desired outcome.
Manufacturing: Improvements in supply planning and product quality provide the greatest benefit of big data for manufacturing. Predictive manufacturing, as an applicable approach toward near-zero downtime and transparency, requires vast amounts of data and advanced prediction tools to systematically turn data into useful information.
Media: The ultimate aim is to serve, or convey, a message or content that is (statistically speaking) in line with the consumer's mindset. Big data helps media with the targeting of consumers (for advertising by marketers) and with data capture.
Private sector: Retail, banking, real estate, science and research.
1.1.6 Basics of Bigdata Platform
• A Big Data platform is an IT solution which combines several Big Data tools and utilities into one packaged solution for managing and analyzing Big Data.
• A Big Data platform is a type of IT solution that combines the features and capabilities of several big data applications and utilities within a single solution.
• It is an enterprise-class IT platform that enables an organization to develop, deploy, operate and manage a big data infrastructure/environment.
• A Big Data platform is an integrated IT solution for Big Data management which combines several software systems, software tools and hardware to provide easy-to-use tools to enterprises.
• It is a single one-stop solution for all Big Data needs of an enterprise irrespective of size and data volume. A Big Data platform is an enterprise-class IT solution for developing, deploying and managing Big Data.
• There are several open source and commercial Big Data platforms in the market with varied features which can be used in a Big Data environment.
• A Big Data platform generally consists of big data storage, servers, databases, big data management, business intelligence and other big data management utilities.
• It also supports custom development, querying and integration with other systems.
• The primary benefit of a big data platform is to reduce the complexity of multiple vendors/solutions into one cohesive solution.
• Big data platforms are also delivered through the cloud, where the provider supplies an all-inclusive big data solution and services.
1.1.7 Features of Big Data Platform
Here are the most important features of any good Big Data Analytics Platform:
a) A Big Data platform should be able to accommodate new platforms and tools based on business requirements, because business needs can change due to new technologies or due to changes in business processes.
b) It should support linear scale-out
c) It should have capability for rapid deployment
d) It should support a variety of data formats
e) The platform should provide data analysis and reporting tools
f) It should provide real-time data analysis software
g) It should have tools for searching the data through large data sets
Big data is a term for data sets that are so large or complex that traditional data processing
applications are inadequate.
Challenges include
• Analysis,
• Capture,
• Data Curation,
• Search,
• Sharing,
• Storage,
• Transfer,
• Visualization,
• Querying,
• Updating,
• Information Privacy.
• The term often refers simply to the use of predictive analytics or certain other advanced
methods to extract value from data, and seldom to a particular size of data set.
• ACCURACY in big data may lead to more confident decision making, and better
decisions can result in greater operational efficiency, cost reduction and reduced risk.
• Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process data within a tolerable elapsed
time. Big data "size" is a constantly moving target.
• Big data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale.
List of Big Data Platforms
a) Hadoop
b) Cloudera
c) Amazon Web Services
d) Hortonworks
e) MapR
f) IBM Open Platform
g) Microsoft HDInsight
h) Intel Distribution for Apache Hadoop
i) Datastax Enterprise Analytics
j) Teradata Enterprise Access for Hadoop
k) Pivotal HD
Big Data vs. Conventional Data:
• Big data is used for reporting, basic analysis, and text mining; advanced analytics is only in a starting stage in big data. Conventional data is used for reporting, advanced analysis, and predictive modeling.
• Big data analysis needs both programming skills (such as Java) and analytical skills to perform analysis. For conventional data, analytical skills are sufficient; advanced analysis tools don't require expert programming skills.
• Big data is generated by big financial institutions, Facebook, Google, Amazon, eBay, Walmart, and so on. Conventional data is generated by small enterprises and small banks.
1.2.3 Challenges of Conventional Systems
In the past, the term ‘Analytics' has been used in the business intelligence world to
provide tools and intelligence to gain insight into the data through fast, consistent, interactive
access to a wide variety of possible views of information.
Data mining has been used in enterprises to keep pace with the critical monitoring and
analysis of mountains of data. The main challenge in the traditional approach is how to unearth
all the hidden information through the vast amount of data.
Traditional Analytics analyzes on the known data terrain that too the data that is well
understood. It cannot work on unstructured data efficiently.
Traditional Analytics is built on top of the relational data model; relationships between the subjects of interest are created inside the system, and the analysis is done based on them. This approach is not adequate for big data analytics.
Traditional analytics is batch oriented: we need to wait for nightly ETL (extract, transform and load) and transformation jobs to complete before the required insight is obtained.
Parallelism in a traditional analytics system is achieved through costly hardware like MPP (Massively Parallel Processing) systems. There is also inadequate support for aggregated summaries of data.
Data challenges
• Data discovery and comprehensiveness
• Scalability
Process challenges
• Capturing data
• Aligning data from different sources
• Transforming data into a suitable form for data analysis
• Modeling data (mathematically, simulation)
• Understanding output, visualizing results and display issues on mobile devices
Management challenges
• Security
• Privacy
• Governance
• Ethical issues
Traditional/RDBMS challenges
• Designed to handle well-structured data
• Traditional storage vendor solutions are very expensive
• Shared block-level storage is too slow; it reads data in 8K or 16K block sizes
• Schema-on-write requires data to be validated before it can be written to disk
• Software licenses are too expensive
• Getting data from disk and loading it into memory requires an application
Cutting-edge companies started to have basic recency, frequency, and monetary value
(RFM) metrics attached to customers. Such metrics look at when a customer last purchased
(recency), how often they have purchased (frequency), and how much they spent (monetary
value). These RFM summaries might be tallied for the past year and possibly over a customer’s
lifetime. Today, organizations collect newly evolving big data sources related to their customers
from a variety of extended and newly emerging touch points such as web browsers, mobile
applications, kiosks, social media sites, and more.
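As a hedged sketch of how the RFM summaries described above might be computed, the snippet below uses pandas; the table and column names (customer_id, order_date, amount) are assumptions for illustration only.

# Minimal RFM sketch (hypothetical transaction table; assumes pandas is installed)
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "order_date": pd.to_datetime(["2023-01-05", "2023-06-20", "2023-03-14", "2023-03-30", "2023-07-01"]),
    "amount": [120.0, 80.0, 40.0, 55.0, 60.0],
})
as_of = tx["order_date"].max()

rfm = tx.groupby("customer_id").agg(
    recency_days=("order_date", lambda d: (as_of - d.max()).days),  # when they last purchased
    frequency=("order_date", "count"),                              # how often they purchased
    monetary=("amount", "sum"),                                     # how much they spent
)
print(rfm)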
With today’s data storage and processing capabilities, it is absolutely possible to achieve
success, and many forward-thinking companies have already proven it.
Missed Data
For a web site, 95 percent of browsing sessions do not result in a basket being created.
Of that 5 percent, only about half, or 2.5 percent, actually begin the check-out process. And, of
that 2.5 percent only two-thirds, or 1.7 percent, actually complete a purchase.
This means that information is missing on more than 98 percent of web sessions if only transactions are tracked. Traditional web analytics focus on aggregated behavior, summarized in an environment where only web data was included.
The goal needs to be moving beyond reporting of summary statistics, even if they can be
viewed in some detail, to actually combining customer level web behavior data with other cross-
channel customer data.
Possibilities of Improvement
Imagine knowing everything customers do as they go through the process of doing business with your organization: not just what they buy, but what they are thinking about buying, along with what key decision criteria they use. Such knowledge enables a new level of understanding about
your customers and a new level of interaction with your customers. It allows you to meet their
needs more quickly and keep them satisfied.
New Source of Information
This big data source isn’t a simple extension of existing data sources. In the case of
detailed web behaviour, there is no existing analog to most of the data that can be collected. It is
a fundamentally new source of information.
One of the most exciting aspects of web data is that it provides factual information on
customer preferences, future intentions, and motivations that are virtually impossible to get from
other sources outside of a direct conversation or survey.
Once customers’ intentions, preferences, and motivations are known, there are
completely new ways of communicating with them, driving further business, and increasing their
loyalty.
Any action that a customer takes while interacting with an organization should be
captured if it is possible to capture it. That means detailed event history from any customer touch
point. Common touch points today include web sites, kiosks, mobile apps, and social media.
Behaviours That Can Be Captured
Purchases
Requesting help
Product views
Forwarding a link
Shopping basket additions
Posting a comment
Watching a video
Registering for a webinar
Accessing a download
Executing a search
Reading / writing a review
Privacy
An arbitrary identification number that is not personally identifiable can be matched to
each unique customer based on a logon, cookie, or similar piece of information. This creates
what might be called a “faceless” customer record.
While all of the data associated with one of these identifiers is from one person, the
people doing the analysis have no ability to tie the ID back to the actual customer.
With today’s database technologies, it is possible to enable analytic professionals to do
analysis without having any ability to identify the individuals involved. This can remove many
privacy concerns.
Web Data – Area of Interest
There are a number of specific areas where web data can help organizations understand
their customers better than is possible without web data.
Shopping Behaviors
A good starting point to understanding shopping behavior is identifying how customers
come to a site to begin shopping.
What search engine do they use?
What specific search terms are entered?
Do they use a bookmark they created previously?
Analytic professionals can take this information and look for patterns in terms of which
search terms, search engines, and referring sites are associated with higher sales rates.
Note that analysts will be able to look into higher sales rates not just within a given web
session, but also for the same customer over time.
One very interesting capability enabled by web data is to identify product bundles that are
of interest to a customer before they make a purchase.
Customer Purchase Paths and Preferences
Once the aspects of a site that appeal to customers on an individual basis are known, they
can be targeted with messages that meet their needs much more effectively.
Research Behaviours
Once customers’ patterns are known, it is possible to alter what they see when they visit a
site in order to make it easier for them to find their favourite options quickly.
Another way to use web data to understand customers’ research patterns is to identify
which of the pieces of information offered on a site are valued by the customer base overall and
the best customers specifically.
Perhaps a web site feature that the organization was considering removing is a big favourite among a critical segment of customers. In that case, the feature might be kept.
Identifying which site features are important to each customer and how each customer
leverages the site for research can help better tailor a site to the individual. For customers who
always drill to detailed product specifications, perhaps those specifications come up as soon as a
product is viewed. For those who always want to see photos, perhaps photos are featured in full
size instead of as thumbnails.
Feedback Behaviours
The best information customers can provide is detailed feedback on products and
services. The fact that customers are willing to take the time to do so indicates that they are
engaged with a brand. Text mining can be used to understand the tone, intent, and topic of a customer's feedback.
Web Data In Action
It is possible that the information missing paints a totally different picture than expected.
It is possible to make suboptimal, if not totally wrong, decisions.
Organizations should strive to collect and analyse as much data as possible.
The following examples show how organizations can apply web data to enhance existing analytics, enable new analytics, and improve their business.
The Next Best Offer
A very common marketing analysis is to predict what the next best offer is for each
customer. The web provides direct clues as to what is of interest to customers and if they are still
engaged. Consider the case of a catalog retailer that also has many store locations. The cataloger
collects the following for each customer, among other data:
Last products browsed
Last products reviewed
Historical purchases
Marketing campaign and response history
The effort leads to major changes in the promotional efforts versus the cataloger’s
traditional approach, providing the following results:
A decrease in total mailings
A reduction in total catalog promotions pages
A materially significant increase in total revenues
Web data can help completely overhaul activities for the better.
Attrition Modeling
In the telecommunications industry, companies have invested massive amounts of time
and effort to create, enhance, and perfect “churn” models. Churn models flag those customers
most at risk of cancelling their accounts so that action can be taken proactively to prevent them
from doing
so. Churn is a major issue for the industry and there are huge amounts of money at stake. The
models have a major impact on the bottom line.
Response Modeling
Many models are created to help predict the choice a customer will make when presented
with a request for action. Models typically try to predict which customers will make a purchase,
or accept an offer, or click on an e-mail link. For such models, a technique called logistic
regression is often used.
The main difference is that in an attrition model, the goal is predicting a negative
behavior (churn) rather than a positive behavior (purchase or response).
In theory, every customer has a unique score. In practice, since only a small number of
variables define most models, many customers end up with identical or nearly identical scores.
This is particularly true among customers who are not very frequent or high-spending. In such
cases, many customers can end up in big groups with very similar, very low scores.
Web data can help greatly increase differentiation among customers. This is especially
true among low-value or infrequent customers where customers can have a large uplift in score
based on the web data.
Customer Segmentation
Web data also enables a variety of completely new analytics. One of those is to segment
customers based solely upon their typical browsing patterns. Such segmentation will provide a
completely different view of customers than traditional demographic or sales-based
segmentation schemas.
Assessing Advertising Results
Traditional web analytics provide high-level summaries such as total clicks, number of
searches, cost per click or impression, keywords leading to the most clicks, and page position
statistics. However, these metrics are at an aggregate level and are rolled up only from the
individual browsing session level.
The context is also traditionally limited solely to the web channel. Once a customer
leaves the web site and his web session ends, the scope of the analysis is complete.
1.4 Evolution of Analytical Scalability
The amount of data organizations process continues to increase, so the technologies used have evolved:
• 1970s - Calculators helped make it easier to utilize more data. But the volume manageable with a calculator is still trivially small.
• 1980s - Mainframes: As the decades have passed, data has moved far beyond the scale that
people can handle manually. The amount of data has grown at least as fast as the
computing power of the machines that process it. It may not be necessary to personally
break a sweat and get a headache computing things by hand, but it is still very easy to
cause computer and storage systems to start steaming as they struggle to process the data
fed to them.
• 2000 - Databases: An organization that had a database holding a terabyte of data was at the
forefront. Today you can buy a terabyte disk drive for your computer for under $100! In
2012, even many small companies have systems holding a terabyte or more of data. The
companies at the forefront now measure their database size in petabytes.
Queries from users are submitted to OLAP (online analytical processing) engines for
execution. Such in-database architectures are tested for their query throughput rather than
transaction throughput as in traditional database environments.
(Figure: data from multiple source databases such as Database 1, Database 2 and Database 4 is consolidated onto a single analytic server or PC.)
The idea behind MPP is essentially that of general parallel computing (figure 4): the simultaneous execution of some combination of multiple instances of programmed instructions and data on multiple processors, in such a way that the result can be obtained more effectively.
MPP uses a shared distributed lock manager to maintain the integrity of the distributed resources across the system. The CPU power that can be made available in an MPP system depends on the number of nodes that can be connected. MPP systems build in redundancy to make recovery easy, and they have resource management tools to manage CPU and disk space and to optimize queries.
1.4.3 Cloud Computing
Three criteria for a cloud environment are:
1. Enterprises incur no infrastructure or capital costs, only operational costs. Those operational
costs will be incurred on a pay per-use basis with no contractual obligations.
2. Capacity can be scaled up or down dynamically, and immediately. This differentiates clouds
from traditional hosting service providers where there may have been limits placed on scaling.
3. The underlying hardware can be anywhere geographically. The architectural specifics are
abstracted from the user. In addition, the hardware will run in multi-tenancy mode where
multiple users from multiple organizations can be accessing the exact same infrastructure
simultaneously.
Five essential characteristics of a cloud environment:
1. On-demand self-service
2. Broad network access
3. Resource pooling
4. Rapid elasticity
5. Measured service
The two primary types of cloud environments:
1. Public clouds
2. Private clouds
Public Clouds
With a public cloud users are basically loading their data onto a host system and they are then
allocated resources as they need them to use that data. They will get charged according to their
usage.
Advantages of Public Cloud
2. It isn't necessary to buy a system sized to handle the maximum capacity ever required and then
risk having half of the capacity sitting idle much of the time.
3. If there are short bursts where a lot of processing is needed then it is possible to get it with no
hassle. Simply pay for the extra resources.
4. There’s typically very fast ramp-up. Once granted access to the cloud environment, users load
their data and start analyzing.
5. It is easy to share data with others regardless of their location since a public cloud by definition
is outside of a corporate firewall. Anyone can be given permission to log on to the environment
created.
Disadvantages of Public Cloud
1. Few performance guarantees
2. High variability in performance
3. Concerns around the security of the data
4. It can get expensive if a cloud isn't used wisely, since users will be charged for everything that they do.
5. If an audit trail of the data and where it sits is required, it is not possible to have that in a public cloud.
The best use for a public cloud is pure research and development work, where
performance variability isn’t something to worry about.
For non-mission-critical analytical processes, the cloud is a potential long-term host even
for deployed processes.
A public cloud can be problematic if data security is a big concern. It’s necessary to apply
good security protocols and tools to a public cloud and keep your environment highly secure.
Private Clouds
A private cloud has the same features as a public cloud, but it's owned exclusively by one organization and typically housed behind a corporate firewall. A private cloud is going to serve the exact same function as a public cloud, but just for the people or teams within a given organization.
Private Cloud vs. Public Cloud
Since the data stays behind the corporate firewall, there is absolutely no concern about where it's going. The data is at no more risk than it is on any other
internal system.
One downside of an onsite private cloud is that it is necessary to purchase and own the
entire cloud infrastructure before allocating it out to users, which could in the short term negate
some of the cost savings.
1.4.4 Grid Computing
There are some computations and algorithms that aren’t cleanly converted to SQL or
embedded in a user-defined function within a database. In these cases, it’s necessary
to pull data out into a more traditional analytics environment and run analytic tools against that
data in the traditional way.
A grid configuration can help both cost and performance. It falls into the classification of
“high-performance computing.” Instead of having a single high-end server (or maybe a few of
them), a large number of lower-cost machines are put in place. As opposed to having one server
managing its CPU and resources across jobs, jobs are parceled out individually to the different
machines to be processed in parallel. Each machine may only be able to handle a fraction of the
work of the original server and can potentially handle only one job at a time.
Using such a grid enables analytic professionals to scale an environment relatively
cheaply and quickly. If a large organization has many processes being run and most of them are
small to medium in size, a grid can be a huge boost.
MapReduce
MapReduce is a parallel programming framework. It’s neither a database nor a direct
competitor to databases. MapReduce consists of two primary processes that a programmer
builds: the “map” step and the “reduce” step. These steps get passed to the MapReduce
framework, which then runs the programs in parallel on a set of worker nodes.
In the case of MapReduce, there is a lot of commodity hardware to which data is being
passed as needed to run a process. Each MapReduce worker runs the same code against its
portion of the data. The workers do not interact or even have knowledge of each other.
MapReduce is a programming framework popularized by Google and used to simplify
data processing across massive data sets. Hadoop is a popular open-source version of
MapReduce supplied by the Apache organization. Hadoop is the best-known implementation of
the MapReduce framework.
A big distinction of a MapReduce environment is the specific ability to handle
unstructured text. In a relational database, everything is already in tables and rows and columns.
The data already has well-defined relationships. This is not always true with raw data streams.
That’s where MapReduce can really be powerful. Loading big chunks of text into a “blob” field
in a database is possible, but it really isn’t the best use of the database or the best way to handle
such data.
Working of MapReduce
Let’s assume there are 20 terabytes of data and 20 MapReduce server nodes for a project.
The first step is to distribute a terabyte to each of the 20 nodes using a simple file copy process.
Note that this data has to be distributed prior to the MapReduce process being started. Also note
that the data is in a file of some format determined by the user. There is no standard format like
in a relational database.
Next, the programmer submits two programs to the scheduler. One is a map program; the
other is the reduce program. In this two-step processing, the map program finds the data on disk
and executes the logic it contains. This occurs independently on each of the 20 servers in our
example. The results of the map step are then passed to the reduce process to summarize and
aggregate the final answers.
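A minimal sketch of this two-step flow, simulated in plain Python rather than on an actual Hadoop cluster, is shown below; the word-count task and the "file" contents are illustrative assumptions.

# Simulated map and reduce steps for a word count (no real cluster: each "node"
# independently processes its own chunk of data, then results are aggregated).
from collections import defaultdict

chunks = ["big data big value", "data moves fast", "big clusters process data"]  # one chunk per node

# Map step: each node emits (key, value) pairs from its own portion of the data.
mapped = []
for chunk in chunks:
    for word in chunk.split():
        mapped.append((word, 1))

# Shuffle: group intermediate values by key before reducing.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce step: summarize and aggregate the values for each key into final answers.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)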
Enterprise analytic data sets are key tools to help drive consistency and productivity, and
lower risk into an organization’s advanced analytics processes.
Analytic Sandbox
One of the uses of such a database system is to facilitate the building and deployment of
advanced analytic processes. In order for analytic professionals to utilize an enterprise data
warehouse or data mart more effectively, however, they need the correct permissions and access
to do so. An analytic sandbox is the mechanism for achieving this.
If used appropriately, an analytic sandbox can be one of the primary drivers of value in
the world of big data. Other terms used for the sandbox concept include an agile analytics cloud
and a data lab, among others.
An analytic sandbox provides a set of resources with which in-depth analysis can be done
to answer critical business questions. An analytic sandbox is ideal for data exploration,
development of analytical processes, proof of concepts, and prototyping.
A sandbox is going to be leveraged by a fairly small set of users. There will be data
created within the sandbox that is segregated from the production database. Sandbox users will
also be allowed to load data of their own for brief time periods as part of a project, even if that
data is not part of the official enterprise data model.
Data in a sandbox will have a limited shelf life. The idea isn’t to build up a bunch of
permanent data. During a project, build the data needed for the project. When that project is
done, delete the data. If used appropriately, a sandbox has the capability to be a major driver of
analytic value for an organization.
Analytic Sandbox Benefits
Benefits from the view of an analytic professional:
1) Independence.
Analytic professionals will be able to work independently on the database system without
needing to continually go back and ask for permissions for specific projects.
2) Flexibility.
Analytic professionals will have the flexibility to use whatever business intelligence, statistical
analysis, or visualization tools that they need to use.
3) Efficiency.
Analytic professionals will be able to leverage the existing enterprise data warehouse or data mart,
without having to move or migrate data.
4) Freedom.
Analytic professionals can reduce focus on the administration of systems and babysitting of
production processes by shifting those maintenance tasks to IT.
5) Speed.
Massive speed improvement will be realized with the move to parallel processing. This also
enables rapid iteration and the ability to “fail fast” and take more risks to innovate.
It is better to add a MapReduce environment into the mix. This would typically be
installed alongside the database platform unless you’re using a system that can combine the two
environments together.
The MapReduce environment will require access to the internal sandbox. Data can be
shared between the two environments as required.
One strength of an internal sandbox is that it will leverage existing hardware resources
and infrastructure already in place. This makes it very easy to set up. From an administration
perspective, there’s no difference in setting up a sandbox than in setting up any other database
container on the system.
The biggest strength of an internal sandbox is the ability to directly join production data
with sandbox data. Since all of the production data and all of the sandbox data are within the
production system, it’s very easy to link those sources to one another and work with all the data
together.
The biggest strength of an external sandbox is its simplicity. The sandbox is a standalone
environment, dedicated to advanced analytics development. It will have no impact on other
processes, which allows for flexibility in design and usage.
Another strength of an external sandbox is reduced workload management. When only
analytic professionals are using the system, it isn’t necessary to worry much about tuning and
balancing. There will be predictable, stable performance in both the sandbox and production
environments.
A Hybrid Sandbox
A hybrid sandbox environment is the combination of an internal sandbox and an external
sandbox. It allows analytic professionals the flexibility to use the power of the production system
when needed, but also the flexibility of the external system for deep exploration or tasks that
aren’t as friendly to the database.
Hybrid Sandbox
The strengths of a hybrid sandbox environment are similar to the strengths of the internal
and external options, plus having ultimate flexibility in the approach taken for an analysis. It is
easy to avoid production impacts during early testing if work is done on the external sandbox.
Another advantage is if an analytic process has been built and it has to be run in a
“pseudo- production” mode temporarily while the full production system process is being
deployed. Such processes can be run out of the internal sandbox easily.
The weaknesses of a hybrid environment are similar to the weaknesses of the other two
options, but with a few additions. One weakness is the need to maintain both an internal and
external sandbox environment. Not only will it be necessary to keep the external sandbox
consistent with the production environment in this case, but the external sandbox will also need
to be kept consistent with the internal sandbox.
It will also be necessary to establish some guidelines on when each sandbox option is used.
Workload Management and Capacity Planning
As analytic professionals start to use a sandbox, there are a lot of built-in components of
database systems that will enable it to work smoothly. Sandbox users can be assigned to a group
that has permissions that make sense for the purpose of developing new advanced analytics
processes.
For example, it is possible to limit how much of the CPU a given sandbox user can
absorb at one time.
One of the important things to do is to limit disk space usage through data retention
policies. When a data set is in a sandbox and it hasn’t been touched in a couple of months, the
default should be that it is deleted. A sandbox should not just continuously build up data sets, as
often happens in traditional environments.
Especially with an internal sandbox, as more analytics are implemented, it will change
the mix and level of resource usage in both the sandbox environment and the production
environment.
1.5.2 Analytic Data Set
An analytic data set (ADS) is the data that is pulled together in order to create an analysis
or model. It is data in the format required for the specific analysis at hand. An ADS is generated
by transforming, aggregating, and combining data. It is going to mimic a denormalized, or flat
file, structure. What this means is that there will be one record per customer, location, product,
or whatever type of entity is being analyzed. The analytic data set helps to bridge the gap
between efficient storage and ease of use.
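As a hedged illustration of such a flat, one-record-per-customer structure, the sketch below aggregates and combines two hypothetical source tables with pandas; the column names are assumptions.

# Minimal analytic data set (ADS) sketch: one denormalized row per customer
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100.0, 40.0, 250.0]})
web_visits = pd.DataFrame({"customer_id": [1, 2, 2, 2], "pages_viewed": [5, 3, 8, 2]})

ads = (
    orders.groupby("customer_id")
          .agg(total_spend=("amount", "sum"), n_orders=("amount", "count"))
          .join(web_visits.groupby("customer_id")
                          .agg(total_pages=("pages_viewed", "sum")), how="outer")
          .fillna(0)
          .reset_index()
)
print(ads)   # one row per customer, ready for analysis or modeling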
Development versus Production Analytic Data Sets
A development ADS is going to be the data set used to build an analytic process. It will
have all the candidate variables that may be needed to solve a problem and will be very wide. A
development ADS might have hundreds or even thousands of variables or metrics within it.
A production analytic data set, however, is what is needed for scoring and deployment.
It’s going to contain only the specific metrics that were actually in the final solution.
Typically, most processes only need a small fraction of the metrics explored during
development. A big difference here is that the scores need to be applied to every entity, not just
a sample.
Analysis vs. Reporting:
• Analysis provides what is needed; reporting provides what is asked for.
• Analysis is typically customized; reporting is typically standardized.
• Reporting translates raw data into information. Analysis transforms data and information into insights. Reporting shows you what is happening, while analysis focuses on explaining why it is happening and what you can do about it.
• Reports are like robots that monitor and alert you, whereas analysis is like a parent who can figure out what is going on (hungry, dirty diaper, no pacifier, teething, tired, ear infection, etc.).
• Reporting and analysis can go hand-in-hand:
• Reporting provides little or no context about what is happening in the data. Context is critical to good analysis.
• Reporting translates raw data into information.
• Reporting usually raises a question: What is happening?
• Analysis transforms the data into insights: Why is it happening? What can you do about it?
Thus, analysis and reporting complement each other; each serves its own need and is used in the appropriate context.
• The key strength of stream processing is that it can provide insights faster, often within milliseconds to seconds.
• It helps in understanding the hidden patterns in millions of data records in real time.
• It translates into processing data from single or multiple sources in real or near-real time, applying the desired business logic and emitting the processed information to the sink.
• Stream processing serves multiple roles in today's business arena. Real-time data streaming tools are:
1. Storm
Storm is a stream processing engine without batch support, a true real-time processing framework, taking in a stream as an entire 'event' instead of a series of small batches. Apache Storm is a distributed real-time computation system. Its applications are designed as directed acyclic graphs.
2. Apache Flink
Apache Flink is an open-source platform: a streaming dataflow engine that provides communication, fault tolerance and data distribution for computations over data streams. Flink is a top-level Apache project and a scalable data analytics framework that is fully compatible with Hadoop. Flink can execute both stream processing and batch processing easily, and was designed as an alternative to MapReduce.
3. Kinesis
Kinesis is an out-of-the-box streaming data tool. Kinesis comprises shards, which Kafka calls partitions. For organizations that take advantage of real-time or near real-time access to large stores of data, Amazon Kinesis is great.
Kinesis Streams solves a variety of streaming data problems. One common use is the real-time aggregation of data, followed by loading the aggregated data into a data warehouse. Data is put into Kinesis streams, which ensures durability and elasticity.
c) Interactive Analysis -Big Data Tools
• The interactive analysis presents the data in an interactive environment, allowing users
to undertake their own analysis of information.
• Users are directly connected to the computer and hence can interact with it in
real time.
• The data can be reviewed, compared and analyzed in tabular or graphic format
or both at the same time.
IA Big Data Tools:
a) Google's Dremel: Google proposed an interactive analysis system in 2010, named Dremel, which is scalable for processing nested data.
– Dremel provides a very fast SQL-like interface to the data by using a different technique than MapReduce. Dremel has a very different architecture compared with the well-known Apache Hadoop, and acts as a successful complement to MapReduce-based computations.
– Dremel has the capability to run aggregation queries over trillion-row tables in seconds by combining multi-level execution trees and a columnar data layout.
b) Apache Drill
• Apache Drill is an open-source SQL query engine for Big Data exploration. It is similar to Google's Dremel.
• Drill has more flexibility to support various query languages, data formats and data sources.
• Drill is designed from the ground up to support high-performance analysis on the semi-
structured and rapidly evolving data coming from modern Big Data applications.
• Drill provides plug-and-play integration with existing Apache Hive and
Apache HBase deployments.
1.7.1 Categories of Modern Analytic Tools
Big data tools for HPC and supercomputing
MPI (Message Passing Interface, 1992) provides standardized function interfaces for communication between parallel processes. MPI also provides collective communication operations such as broadcast, scatter, gather and reduce.
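A minimal sketch of a collective operation using the mpi4py binding is shown below; mpi4py is not mentioned in the original notes and is only an assumption for illustration. It would typically be launched with something like mpirun -n 4 python script.py.

# Collective communication sketch (assumes mpi4py and an MPI runtime are installed)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()     # id of this parallel process
size = comm.Get_size()     # total number of parallel processes

local_value = rank + 1     # each process contributes its own partial result

# Collective reduce: combine the partial results from all processes.
total = comm.allreduce(local_value, op=MPI.SUM)

if rank == 0:
    print(f"{size} processes, sum of local values = {total}")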
1.9 RE-SAMPLING
• Re-sampling is the method that consists of drawing repeated samples from the original
data samples. The method of Resampling is a nonparametric method of statistical
inference. The method of resampling uses experimental methods, rather than
analytical methods, to generate the unique sampling distribution.
• In statistics, re-sampling is any of a variety of methods for doing one of the following:
– Estimating the precision of sample statistics (medians, variances, percentiles)
– by using subsets of available data (jackknifing) or drawing randomly
with replacement from a set of data points (bootstrapping)
1.9.1 Need for Re-sampling
• Re-sampling involves the selection of randomized cases, with replacement, from the original data sample in such a manner that each sample drawn has a number of cases similar to the original data sample.
• Due to replacement, the samples drawn by the method of re-sampling can contain repeated cases.
• Re-sampling generates a unique sampling distribution on the basis of the actual data.
• The method of re-sampling uses experimental methods, rather than analytical methods,
to generate the unique sampling distribution.
• The method of re-sampling yields unbiased estimates as it is based on the unbiased
samples of all the possible results of the data studied by the researcher.
• Re-sampling methods are processes of repeatedly drawing samples from a data set and
refitting a given model on each sample with the goal of learning more about the fitted
model.
• Re-sampling methods can be expensive since they require repeatedly performing
the same statistical methods on N different subsets of the data.
• Re-sampling methods refit a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model.
1.9.2 Re-sampling methods
There are four major re-sampling methods:
1. Permutation
2. Bootstrap
3. Jackknife
4. Cross validation
1. Permutation
Re-sampling procedures date back to 1930s, when permutation tests were introduced by R.A.
Fisher and E.J.G. Pitman. They were not feasible until the computer era.
Permutation Example: Fisher’s Tea Taster
• 8 cups of tea are prepared: four with tea poured first, four with milk poured first.
• The cups are presented to her in random order.
• Mark a strip of paper with eight guesses about the order of the "tea-first" cups and "milk-first" cups: let's say T T T T M M M M.
Permutation solution
• Make a deck of eight cards, four marked "T" and four marked "M".
• Deal out these eight cards successively in all possible orderings (permutations).
• Record how many of those permutations show >= 6 matches.
Approximate Permutation
• Shuffle the deck and
• deal it out along the strip of paper with the marked guesses, record the number
of matches.
• Repeat many times.
Permutation Re-sampling Processes
Step 1: Collect Data from Control & Treatment Groups
Step 2: Merge samples to form a pseudo population
Step 3: Sample without replacement from pseudo population to simulate control Treatment
groups
Step 4: Compute the target statistic for each resample.
Step 5: Compute the difference statistic, save the result in a table, and repeat the resampling process for 1000+ iterations.
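A minimal sketch of this permutation re-sampling process with numpy follows; the control and treatment values are illustrative assumptions.

# Permutation-test sketch following the steps above (illustrative data)
import numpy as np

rng = np.random.default_rng(0)
control = np.array([12.1, 11.4, 13.0, 12.7, 11.9])
treatment = np.array([13.5, 14.1, 12.9, 13.8, 14.4])

observed_diff = treatment.mean() - control.mean()
pooled = np.concatenate([control, treatment])   # merge into a pseudo population

n_iter = 10000
count = 0
for _ in range(n_iter):
    rng.shuffle(pooled)                          # re-assign groups without replacement
    sim_control = pooled[:len(control)]
    sim_treatment = pooled[len(control):]
    if sim_treatment.mean() - sim_control.mean() >= observed_diff:
        count += 1

p_value = count / n_iter                         # share of shuffles as extreme as observed
print(observed_diff, p_value)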
Permutation Tests
In classical hypothesis testing, we start with assumptions about the underlying
distribution and then derive the sampling distribution of the test statistic under
H0.
In Permutation testing, the initial assumptions are not needed (except
exchangeability), and the sampling distribution of the test statistic under H0 is
computed by using permutations of the data.
2. Bootstrap
• The bootstrap is a widely applicable tool that can be used to quantify the uncertainty
associated with a given estimator or statistical learning approach, including those for
which it is difficult to obtain a measure of variability.
• The bootstrap generates distinct data sets by repeatedly sampling observations from the
original data set. These generated data sets can be used to estimate variability in lieu of
sampling independent data sets from the full population.
• 1969: Simon publishes the bootstrap as an example in Basic Research Methods in Social Science (the earlier pigfood example).
• 1979: Efron names and publishes the first paper on the bootstrap; this coincides with the advent of the personal computer.
Characteristics of Bootstrapping
Bootstrapping is especially useful in situations when no analytic formula for the sampling
distribution is available.
• Traditional forecasting methods, like exponential smoothing, work well when demand
is constant – patterns easily recognized by software
• In contrast, when demand is irregular, patterns may be difficult to recognize.
• Therefore, when faced with irregular demand, bootstrapping may be used to
provide more accurate forecasts, making some important assumptions…
Assumptions and Methodology
• Bootstrapping makes no assumption regarding the population
• No normality of error terms
• No equal variance
• Allows for accurate forecasts of intermittent demand
• If the sample is a good approximation of the population, the sampling distribution may
be estimated by generating a large number of new samples
• For small data sets, taking a small representative sample of the data and replicating it
will yield superior results
Applications and Uses
a) Criminology:- Statistical significance testing is important in criminology and
criminal justice.
b) Actuarial Practice:- Process of developing an actuarial model begins with the creation of
probability distributions of input variables. Input variables are generally asset-side
generated cash flows (financial) or cash flows generated from the liabilities side
(underwriting)
c) Classifications Used by Ecologists:- Ecologists often use cluster analysis as a tool in the
classification and mapping of entities such as communities or landscapes
d) Human Nutrition:- Inverse regression used to estimate vitamin B-6 requirement of young
women & Standard statistical methods were used to estimate the mean vitamin B-6
requirement.
e) Outsourcing:- Agilent Technologies determined it was time to transfer manufacturing of its
3070 in-circuit test systems from Colorado to Singapore & Major concern was the change
in environmental test conditions (dry vs humid).
Bootstrap Types
a) Parametric Bootstrap
b) Non-parametric Bootstrap
a) Parametric Bootstrap
• Re-sampling makes no assumptions about the population distribution.
• The bootstrap covered thus far is a nonparametric bootstrap.
• If we have information about the population distribution, this can be used in resampling.
• In this case, when we draw randomly, we can draw from the population distribution rather than from the sample.
• For example, if we know that the population distribution is normal, then we estimate its parameters using the sample mean and variance.
• Then we approximate the population distribution with this estimated distribution and use it to draw new samples.
• As expected, if the assumption about population distribution is correct then the
parametric bootstrap will perform better than the nonparametric bootstrap.
• If not correct, then the nonparametric bootstrap will perform better.
Bootstrap Example
• A new pigfood ration is tested on twelve pigs, with six-week weight gains as follows:
• 496 544 464 416 512 560 608 544 480 466 512 496
• Mean: 508 ounces (establish a confidence interval)
Draw simulated samples from a hypothetical universe that embodies all we know about the
universe that this sample came from – our sample, replicated an infinite number of times.
The Bootstrap process steps
1. Put the observed weight gains in a hat
2. Sample 12 with replacement
3. Record the mean
4. Repeat steps 2-3, say, 1000 times
5. Record the 5th and 95th percentiles (for a 90% confidence interval)
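A minimal sketch of these five steps with numpy, using the twelve weight gains above:

# Bootstrap confidence interval for the mean weight gain (steps 1-5 above)
import numpy as np

gains = np.array([496, 544, 464, 416, 512, 560, 608, 544, 480, 466, 512, 496])
rng = np.random.default_rng(42)

boot_means = []
for _ in range(1000):                                          # repeat steps 2-3, say, 1000 times
    sample = rng.choice(gains, size=len(gains), replace=True)  # sample 12 with replacement
    boot_means.append(sample.mean())                           # record the mean

low, high = np.percentile(boot_means, [5, 95])                 # 5th and 95th percentiles
print(f"90% bootstrap confidence interval for the mean: {low:.1f} to {high:.1f} ounces")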
4. Cross validation
– Cross-validation can be used to estimate a given statistical method's test error or to determine the appropriate amount of flexibility.
– Model assessment is the process of evaluating a model's performance.
– Model selection is the process of selecting the appropriate level of flexibility for a model.
– The bootstrap is used in a number of contexts, but most commonly it is used to provide a measure of accuracy of a given statistical learning method or parameter estimate.
Need of Cross validation
• Ideally we would use the entire data set when training a learner, but then no data remains for testing.
• Instead, some of the data is removed before training begins.
• Then, when training is done, the data that was removed can be used to test the performance of the learned model on "new" data.
• This is the basic idea for a whole class of model evaluation methods called cross
validation.
Bootstrap vs. Cross-Validation
• Bootstrap
– Requires only a small amount of data
– More complex technique – time consuming
• Cross-Validation
– Not a resampling technique
– Requires large amounts of data
– Extremely useful in data mining and artificial intelligence
Cross Validation Methods
1. holdout method
2. K-fold cross validation
3. Leave-one-out cross validation
1. holdout method
The holdout method is the simplest kind of cross validation
• The data set is separated into two sets, called the training set and the testing set.
• The function approximator fits a function using the training set only.
• Then the function approximator is asked to predict the output values for the data in the
testing set (it has never seen these output values before).
• The errors it makes are accumulated as before to give the mean absolute test set
error, which is used to evaluate the model.
• The advantage of this method is that it is usually preferable to the residual method and
takes no longer to compute. – However, its evaluation can have a high variance.
• The evaluation may depend heavily on
– which data points end up in the training set and which end up in the test set, and
– thus the evaluation may be significantly different depending on how the division
is made.
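A minimal sketch of the holdout method is shown below; it assumes scikit-learn is available, and the synthetic regression data is an illustrative stand-in for a real data set.

# Holdout sketch: one split into a training set and a testing set (assumes scikit-learn)
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # fit using the training set only
pred = model.predict(X_test)                       # predict outputs it has never seen
print("mean absolute test error:", mean_absolute_error(y_test, pred))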
2. K-fold cross validation
• K-fold cross validation is one way to improve over the holdout method.
– The data set is divided into k subsets, and the holdout method is repeated k times.
• Each time, one of the k subsets is used as the test set and the other k-1 subsets are
put together to form a training set.
• Then the average error across all k trials is computed.
• The advantage of this method is that it matters less how the data gets divided.
• Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times.
• The variance of the resulting estimate is reduced as k is increased.
• The disadvantage of this method is that the training algorithm has to be rerun
from scratch k times, which means it takes k times as much computation to make
an evaluation.
• A variant of this method is to randomly divide the data into a test and training set k different times.
• The advantage of doing this is that you can independently choose how large each test
set is and how many trials you average over.
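A minimal k-fold sketch, again assuming scikit-learn and synthetic data:

# K-fold sketch: every point appears in a test set exactly once across k splits
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                         scoring="neg_mean_absolute_error")    # one score per fold
print("average error across the 5 folds:", -np.mean(scores))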
Leave-one-out cross validation
• Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set.
– That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point.
– As before the average error is computed and used to evaluate the model.
– The evaluation given by leave-one-out cross validation error (LOO-XVE) is
good, but at first pass it seems very expensive to compute.
– Fortunately, locally weighted learners can make LOO predictions just as easily
as they make regular predictions.
– That means computing the LOO-XVE takes no more time than computing
the residual error and it is a much better way to evaluate models.
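The same idea taken to its extreme (K equal to N) can be expressed with scikit-learn's LeaveOneOut splitter; as before, the library and synthetic data are assumptions for illustration.

# Leave-one-out sketch: N folds, each holding out a single observation
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=50, n_features=3, noise=5.0, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")
print("LOO estimate of mean absolute error:", -np.mean(scores))   # averaged over all N points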
1.10 Statistical Inference
Statistical inference is the process of making guesses about the truth from a sample: inferences about a population are made based on certain statistics calculated from a sample of data drawn from that population.
Confidence Intervals.
Suppose we want to estimate an actual population mean μ. As you know, we can only obtain x̄, the mean of a sample randomly selected from the population of interest. We can use x̄ to find a range of values:
that we can be really confident contains the population mean μ. The range of values is called a
"confidence interval" .
General form of most confidence intervals. The previous example illustrates the general form of most confidence intervals, namely:
sample estimate ± margin of error
That is, the lower limit L is:
L = sample estimate − margin of error
and the upper limit U is:
U = sample estimate + margin of error
Once we've obtained the interval, we can claim that we are really confident that the value
of the population parameter is somewhere between the value of L and the value of U. So far,
we've been very general in our discussion of the calculation and interpretation of confidence
intervals. To be more specific about their use, let's consider a specific interval, namely the "t-
interval for a population mean µ."
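A minimal sketch of computing such a t-interval with scipy follows; the sample values are illustrative assumptions.

# t-interval sketch for a population mean (assumes scipy; illustrative sample)
import numpy as np
from scipy import stats

x = np.array([5.2, 4.8, 5.5, 5.0, 4.9, 5.3, 5.1, 4.7])
mean = x.mean()
sem = stats.sem(x)                          # standard error of the mean

# 95% confidence interval: sample estimate +/- t multiplier * standard error
low, high = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=sem)
print(f"L = {low:.3f}, U = {high:.3f}")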
Hypothesis Testing
The general idea of hypothesis testing involves:
1. Making an initial assumption.
2. Collecting evidence (data).
3. Based on the available evidence (data), deciding whether to reject or not reject the
initial assumption.
Every hypothesis test, regardless of the population parameter involved, requires the above three steps.
Errors in hypothesis testing
Type I error: The null hypothesis is rejected when it is true.
Type II error: The null hypothesis is not rejected when it is false.
Test of Proportion
Let us consider the parameter p of population proportion. For instance, we might want to
know the proportion of males within a total population of adults when we conduct a survey. A
test of proportion will assess whether or not a sample from a population represents the true
proportion from the entire population.
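A hedged sketch of a one-sample test of proportion using the statsmodels library follows; the counts are illustrative assumptions, not survey results.

# Test of proportion sketch (assumes statsmodels; illustrative counts)
from statsmodels.stats.proportion import proportions_ztest

males = 520     # males observed in the sample
n = 1000        # total adults surveyed
p0 = 0.5        # hypothesized population proportion

stat, p_value = proportions_ztest(count=males, nobs=n, value=p0)
print(f"z = {stat:.3f}, p-value = {p_value:.3f}")   # small p-value -> reject H0: p = 0.5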
1.11 PREDICTION ERROR
1.11.1 Introduction to Prediction Error
• A prediction error is the failure of some expected event to occur.
• Errors are an inescapable element of predictive analytics that should also be quantified
and presented along with any model, often in the form of a confidence interval that
indicates how accurate its predictions are expected to be.
• When predictions fail, humans can use metacognitive functions, examining prior
predictions and failures.
• For example, they can examine whether there are correlations and trends, such as consistently being unable to foresee outcomes accurately in particular situations.
• Applying that type of knowledge can inform decisions and improve the quality of future
predictions.
1.11.2 Error in Predictive Analysis
• Errors are an inescapable element of predictive analytics that should also be
quantified and presented along with any model, often in the form of a confidence
interval that indicates how accurate its predictions are expected to be.
• Analysis of prediction errors from similar or previous models can help
determine confidence intervals.
Predictions always contain errors
• Predictive analytics has many applications, the above mentioned examples are just the
tip of the iceberg.
• Many of them will add value, but it remains important to stress that the outcome of a
prediction model will always contain an error. Decision makers need to know how big
that error is.
• To illustrate, in using historic data to predict the future you assume that the future
will have the same dynamics as the past, an assumption which history has proven to
be dangerous.
• In artificial intelligence (AI), the analysis of prediction errors can help guide machine
learning (ML), similarly to the way it does for human learning.
• In reinforcement learning, for example, an agent might use the goal of minimizing
error feedback as a way to improve.
• Prediction errors, in that case, might be assigned a negative value and predicted
outcomes a positive value, in which case the AI would be programmed to attempt to
maximize its score.
• That approach to ML, sometimes known as error-driven learning, seeks to stimulate
learning by approximating the human drive for mastery.
1.11.3 Prediction Error in Statistics
Standard Error of the Estimate
• The standard error of the estimate is a measure of the accuracy of predictions.
• Recall that the regression line is the line that minimizes the sum of squared deviations
of prediction (also called the sum of squares error).
• The standard error of the estimate is closely related to this quantity and is defined below:
σ_est = √( Σ(Y − Y')² / N )
where σ_est is the standard error of the estimate, Y is an actual score, Y' is a predicted score, and N is the number of pairs of scores.
• In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an
estimator (of a procedure for estimating an unobserved quantity) measures the average of
the squares of the errors—that is, the average squared difference between the estimated
values and what is estimated.
• MSE is a risk function, corresponding to the expected value of the squared error loss.
• The fact that MSE is almost always strictly positive (and not zero) is because of
randomness or because the estimator does not account for information that could produce
a more accurate estimate.
1.11.4 Mean squared prediction error
• In statistics the mean squared prediction error or mean squared error of the predictions
of a smoothing or curve fitting procedure is the expected value of the squared difference
between the fitted values implied by the predictive function and the values of the
(unobservable) function g.
• The MSE is a measure of the quality of an estimator—it is always non-negative, and
values closer to zero are better.
• Root-Mean-Square error or Root-Mean-Square Deviation (RMSE or RMSD)
• In an analogy to standard deviation, taking the square root of MSE yields the root-mean-square error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known as the standard error.
• The RMSD represents the square root of the second sample moment of the differences
between predicted values and observed values or the quadratic mean of these differences.
• These deviations are called residuals when the calculations are performed over the
data sample that was used for estimation and are called errors (or prediction errors)
when computed out-of-sample.
• The RMSD serves to aggregate the magnitudes of the errors in predictions for
various times into a single measure of predictive power.
• RMSD is a measure of accuracy, to compare forecasting errors of different models for
a particular dataset and not between datasets, as it is scale-dependent.
• RMSD is always non-negative, and a value of 0 (almost never achieved in
practice) would indicate a perfect fit to the data.
• In general, a lower RMSD is better than a higher one. However, comparisons across
different types of data would be invalid because the measure is dependent on the scale
of the numbers used.
• RMSD is the square root of the average of squared errors.
• The effect of each error on RMSD is proportional to the size of the squared error;
thus larger errors have a disproportionately large effect on RMSD.
• Consequently, RMSD is sensitive to outliers.
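A minimal sketch computing MSE and RMSE with numpy follows; the observed and predicted values are illustrative assumptions.

# MSE / RMSE sketch (illustrative observed and predicted values)
import numpy as np

observed = np.array([3.0, -0.5, 2.0, 7.0])
predicted = np.array([2.5, 0.0, 2.0, 8.0])

errors = predicted - observed        # prediction errors (residuals)
mse = np.mean(errors ** 2)           # average of the squared errors
rmse = np.sqrt(mse)                  # same units as the quantity being estimated
print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}")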