
IFETCE R-2019 Academic Year: 2023-2024

IFET College of Engineering


(An Autonomous Institution)
19UCSPC701 DATA ANALYTICS
UNIT-1 INTRODUCTION TO BIG DATA
Introduction to Big Data Platform – Challenges of conventional systems - Web data – Evolution
of Analytic scalability, analytic processes and tools, Analysis vs reporting - Modern data analytic
tools, Statistical concepts: Sampling distributions, resampling, statistical inference, prediction
error - Activity in Exploring Basic Data Analytics.
1.1 Introduction to Big Data Platform
1.1.1 Data and Information
 Data are plain facts: raw, unprocessed facts and statistics stored in, or flowing freely over, a
network.
 When data are processed, organized, structured, or presented in a given context so as to
make them useful, they are called information. It is not enough to have data (such as
statistics on the economy).
 Data by themselves are fairly useless, but when they are interpreted and processed to
determine their true meaning, they become useful and can be called information.
 For example, when you visit a website, it might store your IP address (that is data) and, in
return, add a cookie to your browser marking that you visited the site (that is also data);
your name is data, your age is data.
 Data are the quantities, characters, or symbols on which operations are performed by a
computer, which may be stored and transmitted in the form of electrical signals and
recorded on magnetic, optical, or mechanical recording media. The basic actions on data
are capture, transform, and store.
1.1.2 Big Data
 Big Data may well be the next big thing in the IT world. Big data burst upon the scene
in the first decade of the 21st century, and the first organizations to embrace it were online
and startup firms. Firms like Google, eBay, LinkedIn, and Facebook were built around
big data from the beginning. Like many new information technologies, big data can bring
about dramatic cost reductions, substantial improvements in the time required to perform
a computing task, or new product and service offerings. Walmart handles more than 1
million customer transactions every hour. Facebook handles 40 billion photos from its
user base. Decoding the human genome originally took 10 years; now it can be
achieved in one week.
 Big Data is also data, but of huge size. Big Data is a term used to describe a collection
of data that is huge in size and yet growing exponentially with time. In short, such data is
so large and complex that none of the traditional data management tools are able to store
or process it efficiently.


 Big data is the term for a collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or traditional data
processing applications.
Examples of Big Data
• Following are some of the examples of Big Data:
– The New York Stock Exchange generates about one terabyte of new trade data
per day.
– Other examples of Big Data generation includes
• stock exchanges,
• social media sites,
• jet engines,
• etc.
1.1.3 Types Of Big Data
Big Data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured
Structured Data
• Any data that can be stored, accessed and processed in the form of fixed format is
termed as a 'structured' data.
• Developed techniques for working with such kind of data (where the format is well
known in advance) and also deriving value out of it.
• Foreseeing issues of today : when a size of such data grows to a huge extent, typical
sizes are being in the rage of multiple zetta bytes.
• 1021 bytes equal to 1 zettabyte or one billion terabytes forms a zettabyte.
• Data stored in a relational database management system is one example of a 'structured'
data.
• An 'Employee' table in a database is an example of Structured Data:

Employee_ID   Employee_Name     Gender   Department   Salary_In_lacs
2365          Rajesh Kulkarni   Male     Finance      650000
3398          Pratibha Joshi    Female   Admin        650000
7465          Shushil Roy       Male     Admin        500000
7500          Shubhojit Das     Male     Finance      500000
7699          Priya Sane        Female   Finance      550000
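As a minimal illustration of structured data, the same 'Employee' table can be created and queried in a relational database. This is a sketch only, assuming Python with the built-in sqlite3 module; the table and values simply mirror the example above.

import sqlite3

# In-memory database used purely to illustrate structured (fixed-schema) data.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The schema (columns and types) is known in advance, which is what makes the data "structured".
cur.execute("""
    CREATE TABLE Employee (
        Employee_ID    INTEGER PRIMARY KEY,
        Employee_Name  TEXT,
        Gender         TEXT,
        Department     TEXT,
        Salary_In_lacs INTEGER
    )
""")

rows = [
    (2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
    (3398, "Pratibha Joshi", "Female", "Admin", 650000),
    (7465, "Shushil Roy", "Male", "Admin", 500000),
    (7500, "Shubhojit Das", "Male", "Finance", 500000),
    (7699, "Priya Sane", "Female", "Finance", 550000),
]
cur.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", rows)

# Because the format is fixed, queries and aggregations are straightforward.
for dept, avg_sal in cur.execute(
        "SELECT Department, AVG(Salary_In_lacs) FROM Employee GROUP BY Department"):
    print(dept, avg_sal)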

Unstructured Data
• Any data whose form or structure is unknown is classified as unstructured data.
• In addition to being huge in size, unstructured data poses multiple challenges in terms of
processing it to derive value.
• A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos, etc.
• Nowadays organizations have a wealth of data available with them but, unfortunately, they
don't know how to derive value out of it since this data is in its raw, unstructured form.
• Example of unstructured data: the output returned by a 'Google Search'.

Figure 1- Unstructured data


Semi-structured Data
• Semi-structured data can contain both forms of data.
• Semi-structured data appears structured in form, but it is not actually defined by, for example,
a table definition in a relational DBMS.
• An example of semi-structured data is data represented in an XML file.
• Personal data stored in an XML file:
<rec>
<name>Prashant Rao</name>
<sex>Male</sex>
<age>35</age>
</rec>
<rec>
<name>Seema R.</name>
<sex>Female</sex>
<age>41</age>
</rec>
<rec>
<name>Satish Mane</name>
<sex>Male</sex>
<age>29</age>
</rec>
<rec>
<name>Subrato Roy</name>
<sex>Male</sex>
<age>26</age>
</rec>
<rec>
<name>Jeremiah J.</name>
<sex>Male</sex>
<age>35</age>
</rec>
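Even without a table definition, the records above can still be processed programmatically because the tags describe the structure. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and the records wrapped in a single root element for well-formedness:

import xml.etree.ElementTree as ET

# The <rec> elements above, wrapped in a root tag so the document is well-formed XML.
xml_data = """
<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
  <rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
  <rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
</people>
"""

root = ET.fromstring(xml_data)

# Tags act as self-describing markers, so the structure is inferred while parsing
# rather than being defined up front as in a relational schema.
for rec in root.findall("rec"):
    name = rec.findtext("name")
    age = int(rec.findtext("age"))
    print(f"{name} is {age} years old")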
1.1.4 Three Characteristics of Big Data (the 3 Vs):
Volume - The size of the data determines the value and potential of the data under
consideration and whether it can actually be considered Big Data or not. The name 'Big
Data' itself contains a term related to size, hence this characteristic.
Variety - The category to which Big Data belongs is also an essential fact that needs to be
known by data analysts. This helps the people who closely analyze the data, and are
associated with it, to use the data effectively to their advantage, thus upholding the
importance of Big Data.
Velocity - The term 'velocity' refers to the speed of generation of data, or how fast the data
is generated and processed to meet the demands and challenges of growth and
development.
1.1.5 Applications of Big Data
Big data has increased the demand for information management specialists.
Government: Data analysis often requires multiple parts of government (central and local) to
work in collaboration and create new and innovative processes to deliver the desired outcome.
Manufacturing: Improvements in supply planning and product quality provide the greatest
benefit of big data for manufacturing. Predictive manufacturing, as an applicable approach
toward near-zero downtime and transparency, requires vast amounts of data and advanced
prediction tools for the systematic processing of data into useful information.
Media: The ultimate aim is to serve, or convey, a message or content that is (statistically
speaking) in line with the consumer's mindset. Big data helps media companies with the
targeting of consumers (for advertising by marketers) and with data capture.
Private sector: Retail, banking, real estate, science, and research.
1.1.6 Basics of Big Data Platform
• A Big Data platform is an IT solution which combines several Big Data tools and utilities
into one packaged solution for managing and analyzing Big Data.
• A big data platform is a type of IT solution that combines the features and capabilities of
several big data applications and utilities within a single solution.
• It is an enterprise-class IT platform that enables an organization to develop, deploy,
operate, and manage a big data infrastructure/environment.
• A Big Data platform is an integrated IT solution for Big Data management which combines
several software systems, software tools, and hardware to provide easy-to-use tools and
systems to enterprises.
• It is a single one-stop solution for all Big Data needs of an enterprise, irrespective of size
and data volume. A Big Data platform is an enterprise-class IT solution for developing,
deploying, and managing Big Data.
• There are several open-source and commercial Big Data platforms in the market with
varied features which can be used in a Big Data environment.
• A big data platform generally consists of big data storage, servers, databases, big data
management, business intelligence, and other big data management utilities.
• It also supports custom development, querying, and integration with other systems.
• The primary benefit of a big data platform is to reduce the complexity of multiple
vendors/solutions into one cohesive solution.
• Big data platforms are also delivered through the cloud, where the provider offers an all-
inclusive set of big data solutions and services.
1.1.7 Features of Big Data Platform
Here are the most important features of any good Big Data analytics platform:
a) A Big Data platform should be able to accommodate new platforms and tools based on
business requirements, because business needs can change due to new technologies or
due to changes in business processes.
b) It should support linear scale-out.
c) It should have the capability for rapid deployment.
d) It should support a variety of data formats.
e) The platform should provide data analysis and reporting tools.
f) It should provide real-time data analysis software.


g) It should have tools for searching through large data sets.
Big data is a term for data sets that are so large or complex that traditional data processing
applications are inadequate.
Challenges include
• Analysis,
• Capture,
• Data Curation,
• Search,
• Sharing,
• Storage,
• Transfer,
• Visualization,
• Querying,
• Updating
• Information Privacy.
• The term often refers simply to the use of predictive analytics or certain other advanced
methods to extract value from data, and seldom to a particular size of data set.
• Accuracy in big data may lead to more confident decision making, and better decisions
can result in greater operational efficiency, cost reduction, and reduced risk.
• Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process the data within a tolerable elapsed
time. Big data "size" is a constantly moving target.
• Big data requires a set of techniques and technologies with new forms of integration to
reveal insights from data sets that are diverse, complex, and of massive scale.
List of Big Data Platforms
a) Hadoop
b) Cloudera
c) Amazon Web Services
d) Hortonworks
e) MapR
f) IBM Open Platform
g) Microsoft HDInsight
h) Intel Distribution for Apache Hadoop
i) Datastax Enterprise Analytics
j) Teradata Enterprise Access for Hadoop
k) Pivotal HD


1.2 CHALLENGES OF CONVENTIONAL SYSTEMS


1.2.1 Introduction to Conventional Systems
• Big data is a huge amount of data which is beyond the processing capacity of
conventional database systems to manage and analyze within a specific time
interval.
Difference between conventional computing and intelligent computing
• Conventional computing functions logically with a set of rules and calculations,
while neural computing can function via images, pictures, and concepts.
• Conventional computing is often unable to manage the variability of data obtained in
the real world.
• On the other hand, neural computing, like our own brains, is well suited to situations
that have no clear algorithmic solution and is able to manage noisy, imprecise data.
This allows it to excel in those areas that conventional computing often finds
difficult.
1.2.2 Comparison of Big Data with Conventional Data

Big Data | Conventional Data
Huge data sets. | Data set size is in control.
Unstructured data such as text, video, and audio. | Normally structured data such as numbers and categories, but it can take other forms as well.
Hard-to-perform queries and analysis. | Relatively easy-to-perform queries and analysis.
Needs a new methodology for analysis. | Data analysis can be achieved using conventional methods.
Needs tools such as Hadoop, Hive, HBase, Pig, Sqoop, and so on. | Tools such as SQL, SAS, R, and Excel alone may be sufficient.
The aggregated or sampled or filtered data. | Raw transactional data.
Used for reporting, basic analysis, and text mining; advanced analytics is only at a starting stage in big data. | Used for reporting, advanced analysis, and predictive modeling.
Big data analysis needs both programming skills (such as Java) and analytical skills to perform analysis. | Analytical skills are sufficient for conventional data; advanced analysis tools don't require expert programming skills.
Petabytes/exabytes of data. | Megabytes/gigabytes of data.
Millions/billions of accounts. | Thousands/millions of accounts.
Billions/trillions of transactions. | Millions of transactions.
Generated by big financial institutions, Facebook, Google, Amazon, eBay, Walmart, and so on. | Generated by small enterprises and small banks.
1.2.3 Challenges of Conventional Systems
In the past, the term 'analytics' has been used in the business intelligence world to
provide tools and intelligence to gain insight into the data through fast, consistent, interactive
access to a wide variety of possible views of information.
Data mining has been used in enterprises to keep pace with the critical monitoring and
analysis of mountains of data. The main challenge in the traditional approach is how to unearth
all the hidden information in the vast amount of data.
Traditional analytics analyzes a known data terrain, and only data that is already well
understood. It cannot work on unstructured data efficiently.
Traditional analytics is built on top of the relational data model: relationships between
the subjects of interest are created inside the system, and the analysis is done based on
them. This approach is not adequate for big data analytics.
Traditional analytics is batch oriented, and one needs to wait for nightly ETL (extract,
transform, and load) and transformation jobs to complete before the required insight is obtained.
Parallelism in a traditional analytics system is achieved through costly hardware like MPP
(Massively Parallel Processing) systems. There is also inadequate support for aggregated summaries of data.
Data challenges
• Data discovery and comprehensiveness
• Scalability
Process challenges
• Capturing data and aligning data from different sources
• Transforming data into a form suitable for data analysis
• Modeling data (mathematically, by simulation)
• Understanding output, visualizing results, and display issues on mobile devices
Management challenges
• Security
• Privacy
• Governance
• Ethical issues
Traditional/RDBMS challenges
• Designed to handle well-structured data only.
• Traditional storage vendor solutions are very expensive.
• Shared block-level storage is too slow; data is read in 8 KB or 16 KB block sizes.
• Schema-on-write requires data to be validated before it can be written to disk.
• Software licenses are too expensive.
• Getting data from disk and loading it into memory requires an application.


1.2.4 List of Challenges of Conventional Systems


The following challenges have been dominant for conventional systems in real-time
scenarios:
1) Uncertainty of Data Management Landscape
2) The Big Data Talent Gap
3) Getting data into the big data platform
4) Need for synchronization across data sources
5) Getting important insights through the use of Big data analytics
1) Uncertainty of Data Management Landscape:
• Because big data is continuously expanding, there are new companies and technologies
that are being developed everyday.
• A big challenge for companies is to find out which technology works bests for them
without the introduction of new risks and problems.
2) The Big Data Talent Gap:
• While Big Data is a growing field, there are very few experts available in this field.
• This is because Big Data is a complex field, and people who understand the complexity
and intricate nature of this field are few and far between.
3) Getting data into the big data platform:
• Data is increasing every single day. This means that companies have to tackle limitless
amounts of data on a regular basis.
• The scale and variety of data that is available today can overwhelm any data
practitioner, and that is why it is important to make data accessibility simple and
convenient for brand managers and owners.
4) Need for synchronization across data sources:
• As data sets become more diverse, there is a need to incorporate them into an analytical
platform.
• If this is ignored, it can create gaps and lead to wrong insights and messages.
5) Getting important insights through the use of Big data analytics:
• It is important that companies gain proper insights from big data analytics and it is
important that the correct department has access to this information.
• A major challenge in the big data analytics is bridging this gap in an effective fashion.
1.3 Web Data
Web data is one of the most popular types of big data; no other big data source is as
widely used today as web data. Organizations across a number of industries have integrated
detailed, customer-level behavioral data sourced from a web site into their enterprise analytics
environments. Integrating detailed clickstream data with other data, instead of keeping it isolated
by itself, is also a part of using web data.
Organizations have talked about a 360-degree view of their customers for years. What is
really meant is that the organization has as full a view of its customers as possible given
the technology and data available at that point in time.

Cutting-edge companies started to have basic recency, frequency, and monetary value
(RFM) metrics attached to customers. Such metrics look at when a customer last purchased
(recency), how often they have purchased (frequency), and how much they spent (monetary
value). These RFM summaries might be tallied for the past year and possibly over a customer’s
lifetime. Today, organizations collect newly evolving big data sources related to their customers
from a variety of extended and newly emerging touch points such as web browsers, mobile
applications, kiosks, social media sites, and more.
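A minimal sketch of how the RFM metrics described above might be derived from raw transactions, assuming Python with pandas; the column names and sample data are hypothetical:

import pandas as pd

# Hypothetical transaction history: one row per purchase.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "date": pd.to_datetime(["2023-01-05", "2023-06-20", "2023-03-11",
                            "2023-04-02", "2023-06-28", "2022-12-15"]),
    "amount": [120.0, 80.0, 40.0, 55.0, 60.0, 300.0],
})

as_of = pd.Timestamp("2023-07-01")

# Recency: days since last purchase; Frequency: number of purchases;
# Monetary value: total amount spent.
rfm = tx.groupby("customer_id").agg(
    recency_days=("date", lambda d: (as_of - d.max()).days),
    frequency=("date", "count"),
    monetary_value=("amount", "sum"),
)
print(rfm)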
With today’s data storage and processing capabilities, it is absolutely possible to achieve
success, and many forward-thinking companies have already proven it.
Missed Data
For a web site, 95 percent of browsing sessions do not result in a basket being created.
Of the remaining 5 percent, only about half, or 2.5 percent, actually begin the check-out process.
And, of that 2.5 percent, only two-thirds, or about 1.7 percent, actually complete a purchase.
This means that information is missing on more than 98 percent of web sessions if only
transactions are tracked. Traditional web analytics focus on aggregated behavior, summarized in
an environment where only web data was included.
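The figures above follow directly from the funnel arithmetic; a small sketch in plain Python, with the percentages taken from the text:

# Share of web sessions surviving each step of the purchase funnel.
sessions = 1.00              # all browsing sessions
basket = sessions * 0.05     # 5% create a basket
checkout = basket * 0.50     # about half of those begin check-out  -> 2.5%
purchase = checkout * 2 / 3  # two-thirds of those complete a purchase -> ~1.7%

missing = 1 - purchase       # sessions with no transaction record
print(f"Purchases: {purchase:.1%}; "
      f"sessions missed if only transactions are tracked: {missing:.1%}")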
The goal needs to be moving beyond reporting of summary statistics, even if they can be
viewed in some detail, to actually combining customer level web behavior data with other cross-
channel customer data.
Possibilities of Improvement
Imagine knowing everything customers do as they go through the process of doing business with
your organization: not just what they buy, but what they are thinking about buying, along with
what key decision criteria they use. Such knowledge enables a new level of understanding about
your customers and a new level of interaction with them. It allows you to meet their
needs more quickly and keep them satisfied.
New Source of Information
This big data source isn’t a simple extension of existing data sources. In the case of
detailed web behaviour, there is no existing analog to most of the data that can be collected. It is
a fundamentally new source of information.
One of the most exciting aspects of web data is that it provides factual information on
customer preferences, future intentions, and motivations that are virtually impossible to get from
other sources outside of a direct conversation or survey.
Once customers’ intentions, preferences, and motivations are known, there are
completely new ways of communicating with them, driving further business, and increasing their
loyalty.
Any action that a customer takes while interacting with an organization should be
captured if it is possible to capture it. That means detailed event history from any customer touch
point. Common touch points today include web sites, kiosks, mobile apps, and social media.
Behaviours That Can Be Captured
 Purchases
 Requesting help

 Product views
 Forwarding a link
 Shopping basket additions
 Posting a comment
 Watching a video
 Registering for a webinar
 Accessing a download
 Executing a search
 Reading / writing a review
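In practice, each of these behaviors can be captured as a simple event record with a common structure. A hypothetical sketch in Python; the field names are illustrative only, not a standard:

import json
from datetime import datetime, timezone

def capture_event(customer_id, touch_point, event_type, detail=None):
    """Build one detailed event-history record for any customer touch point."""
    return {
        "customer_id": customer_id,          # or an anonymized ID (see Privacy below)
        "touch_point": touch_point,          # web site, kiosk, mobile app, social media
        "event_type": event_type,            # purchase, product view, search, review, ...
        "detail": detail or {},
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

events = [
    capture_event(1001, "web", "product_view", {"product_id": "SKU-42"}),
    capture_event(1001, "web", "basket_add", {"product_id": "SKU-42", "qty": 1}),
    capture_event(1001, "mobile_app", "search", {"query": "running shoes"}),
]
print(json.dumps(events, indent=2))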
Privacy
An arbitrary identification number that is not personally identifiable can be matched to
each unique customer based on a logon, cookie, or similar piece of information. This creates
what might be called a “faceless” customer record.
While all of the data associated with one of these identifiers is from one person, the
people doing the analysis have no ability to tie the ID back to the actual customer.
With today’s database technologies, it is possible to enable analytic professionals to do
analysis without having any ability to identify the individuals involved. This can remove many
privacy concerns.
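One simple way to create such a "faceless" identifier is a keyed one-way hash of the logon or cookie value. A sketch assuming Python's standard hmac and hashlib modules; the key and example logon are hypothetical:

import hmac
import hashlib

SECRET_KEY = b"replace-with-a-key-held-outside-the-analytics-team"

def faceless_id(customer_logon: str) -> str:
    """Map a logon/cookie to an arbitrary ID that analysts cannot reverse."""
    return hmac.new(SECRET_KEY, customer_logon.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]

# The same customer always maps to the same ID, so behavior can be linked
# over time without exposing who the person actually is.
print(faceless_id("rajesh.kulkarni@example.com"))
print(faceless_id("rajesh.kulkarni@example.com"))  # identical output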
Web Data – Area of Interest
There are a number of specific areas where web data can help organizations understand
their customers better than is possible without web data.
Shopping Behaviors
A good starting point to understanding shopping behavior is identifying how customers
come to a site to begin shopping.
 What search engine do they use?
 What specific search terms are entered?
 Do they use a bookmark they created previously?
Analytic professionals can take this information and look for patterns in terms of which
search terms, search engines, and referring sites are associated with higher sales rates.
Note that analysts will be able to look into higher sales rates not just within a given web
session, but also for the same customer over time.
One very interesting capability enabled by web data is to identify product bundles that are
of interest to a customer before they make a purchase.
Customer Purchase Paths and Preferences
Once the aspects of a site that appeal to customers on an individual basis are known, they
can be targeted with messages that meet their needs much more effectively.
Research Behaviours
Once customers’ patterns are known, it is possible to alter what they see when they visit a
site in order to make it easier for them to find their favourite options quickly.


Another way to use web data to understand customers’ research patterns is to identify
which of the pieces of information offered on a site are valued by the customer base overall and
the best customers specifically.
Perhaps a web site feature that the organization was considering removing turns out to be a big
favourite among a critical segment of customers. In that case, the feature might be kept.
Identifying which site features are important to each customer and how each customer
leverages the site for research can help better tailor a site to the individual. For customers who
always drill to detailed product specifications, perhaps those specifications come up as soon as a
product is viewed. For those who always want to see photos, perhaps photos are featured in full
size instead of as thumbnails.
Feedback Behaviours
The best information customers can provide is detailed feedback on products and
services. The fact that customers are willing to take the time to do so indicates that they are
engaged with a brand. Text mining can be used to understand the tone, intent, and topic of a
customer's feedback.
Web Data In Action
 It is possible that the missing information paints a totally different picture than expected.
 Without it, it is possible to make suboptimal, if not totally wrong, decisions.
 Organizations should strive to collect and analyse as much data as possible.
 The following examples show how organizations can apply web data to enhance existing
analytics, enable new analytics, and improve their business.
The Next Best Offer
A very common marketing analysis is to predict what the next best offer is for each
customer. The web provides direct clues as to what is of interest to customers and if they are still
engaged. Consider the case of a catalog retailer that also has many store locations. The cataloger
collects the following for each customer, among other data:
 Last products browsed
 Last products reviewed
 Historical purchases
 Marketing campaign and response history
The effort leads to major changes in the promotional efforts versus the cataloger's
traditional approach, providing the following results:
 A decrease in total mailings
 A reduction in total catalog promotion pages
 A materially significant increase in total revenues
Web data can help completely overhaul activities for the better.
Attrition Modeling
In the telecommunications industry, companies have invested massive amounts of time
and effort to create, enhance, and perfect “churn” models. Churn models flag those customers
most at risk of cancelling their accounts so that action can be taken proactively to prevent them
from doing

so. Churn is a major issue for the industry and there are huge amounts of money at stake. The
models have a major impact on the bottom line.
Response Modeling
Many models are created to help predict the choice a customer will make when presented
with a request for action. Models typically try to predict which customers will make a purchase,
or accept an offer, or click on an e-mail link. For such models, a technique called logistic
regression is often used.
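A minimal sketch of a response model of this kind, assuming Python with scikit-learn; the features, including the web-behavior column, and the sample data are hypothetical:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical customer features: [past purchases, days since last purchase,
# product pages viewed last month, e-mails clicked]. Web data supplies the third column.
X = np.array([
    [5, 30, 12, 2],
    [1, 200, 0, 0],
    [3, 60, 4, 1],
    [0, 365, 1, 0],
    [8, 15, 20, 3],
    [2, 120, 0, 1],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = responded to the offer / made a purchase

model = LogisticRegression().fit(X, y)

# Score a new customer: probability of responding to the next offer.
new_customer = np.array([[2, 90, 6, 1]])
print("Response probability:", model.predict_proba(new_customer)[0, 1])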
The main difference is that in an attrition model, the goal is predicting a negative
behavior (churn) rather than a positive behavior (purchase or response).
In theory, every customer has a unique score. In practice, since only a small number of
variables define most models, many customers end up with identical or nearly identical scores.
This is particularly true among customers who are not very frequent or high-spending. In such
cases, many customers can end up in big groups with very similar, very low scores.
Web data can help greatly increase differentiation among customers. This is especially
true among low-value or infrequent customers where customers can have a large uplift in score
based on the web data.
Customer Segmentation
Web data also enables a variety of completely new analytics. One of those is to segment
customers based solely upon their typical browsing patterns. Such segmentation will provide a
completely different view of customers than traditional demographic or sales-based
segmentation schemas.
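A sketch of browsing-based segmentation, assuming Python with scikit-learn; the per-customer browsing metrics are hypothetical:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer browsing profile:
# [sessions per month, avg pages per session, share of time on product pages,
#  share of time on support pages]
browsing = np.array([
    [20, 12, 0.70, 0.05],
    [2,  3,  0.10, 0.80],
    [15, 8,  0.60, 0.10],
    [1,  2,  0.05, 0.90],
    [25, 15, 0.75, 0.02],
    [3,  4,  0.20, 0.60],
])

# Standardize so each metric contributes comparably, then cluster into segments.
X = StandardScaler().fit_transform(browsing)
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Segment per customer:", segments)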
Assessing Advertising Results
Traditional web analytics provide high-level summaries such as total clicks, number of
searches, cost per click or impression, keywords leading to the most clicks, and page position
statistics. However, these metrics are at an aggregate level and are rolled up only from the
individual browsing session level.
The context is also traditionally limited solely to the web channel. Once a customer
leaves the web site and his web session ends, the scope of the analysis is complete.
1.4 Evolution of Analytical Scalability
The amount of data organizations process continues to increase. The technologies used to scale
analytics include:

• Massively Parallel Processing (MPP)


• MapReduce
1.4.1 History of Scalability

• 1900s - Analytics (manual computation): To do a deep analysis, such as a predictive


model, all of the statistics had to be computed manually. For example, performing a
linear regression required manually computing a matrix and inverting the matrix by hand.
All of the computations required on top of that matrix to get the parameter estimates were
also done by hand.


• 1970s - Calculators: Calculators helped make it easier to utilize more data, but the volume
manageable with a calculator is still trivially small.
• 1980s - Mainframes: As the decades passed, data moved far beyond the scale that
people can handle manually. The amount of data has grown at least as fast as the
computing power of the machines that process it. It may not be necessary to personally
break a sweat and get a headache computing things by hand, but it is still very easy to
cause computer and storage systems to start steaming as they struggle to process the data
fed to them.
• 2000s - Databases: An organization that had a database holding a terabyte of data was at the
forefront. Today you can buy a terabyte disk drive for your computer for under $100. In
2012, even many small companies had systems holding a terabyte or more of data. The
companies at the forefront now measure their database size in petabytes.

1,024 of these... | ...equals 1 of these | Comment
Kilobyte | Megabyte | A standard music CD holds 600 megabytes.
Megabyte | Gigabyte | One gigabyte will hold data equivalent to about 30 feet of books on a shelf.
Gigabyte | Terabyte | Ten terabytes can hold the entire U.S. Library of Congress.
Terabyte | Petabyte | A petabyte can hold 20 million 4-door filing cabinets of text.
Petabyte | Exabyte | Five exabytes is equal to all of the words ever spoken by mankind.
Exabyte | Zettabyte | It would take approximately 11 billion years to download a zettabyte file from the Internet using high-power broadband.
Zettabyte | Yottabyte | The entire Internet makes up about a yottabyte of data.
1.4.2 Convergence of the Analytic and Data Environments
Traditional Analytic Architecture
Traditional analytics collects data from heterogeneous data sources, and all data had to
be pulled together into a separate analytics environment to do analysis, which could be an
analytical server or a personal computer with more computing capability. The heavy processing
occurs in the analytic environment. In such environments, shipping of data becomes a must,
which might result in issues related to the security and confidentiality of the data.


Modern In-database Architecture


Data from heterogeneous sources are collected, transformed and loaded into data
warehouse for final analysis by decision makers. The processing stays in the database where the
data has been consolidated. The data is presented in aggregated form for querying.

Queries from users are submitted to OLAP (online analytical processing) engines for
execution. Such in-database architectures are tested for their query throughput rather than
transaction throughput as in traditional database environments.

Figure 2- Traditional Analytic Architecture


More metadata is required for directing the queries, which helps in reducing the time
taken for answering queries and hence increases the query throughput. Moreover, the data in
consolidated form is free from anomalies, since it is preprocessed before loading into the
warehouse, and may be used directly for analysis. The modern in-database architecture is
shown in Figure 3.

Massively Parallel Processing (MPP)


Massively Parallel Processing (MPP) is the "shared nothing" approach to parallel
computing. It is a type of computing wherein the processing is done by many CPUs working
in parallel to execute a single program.

One of the most significant differences between Symmetric Multi-Processing (SMP)


and Massively Parallel Processing is that with MPP, each of the many CPUs has its own memory,
which helps prevent the kind of hold-up a user may experience with SMP when
all of the CPUs attempt to access the same memory simultaneously.


The user's machine just submits the request; the data from the source databases is consolidated
into the enterprise data warehouse.

Figure-3 Modern In-database Architecture

Figure-4 Massively Parallel Processing (MPP)


The salient features of MPP systems are:


• Loosely coupled nodes
• Nodes linked together by a high speed connection
• Each node has its own memory.
• Disks are not shared, each being attached to only one node (a "shared nothing" architecture).

The idea behind MPP is really just that of general parallel computing (Figure 4),
wherein some combination of multiple instances of programmed instructions and data is
executed simultaneously on multiple processors in such a way that the result can be obtained
more quickly.
MPP uses a shared distributed lock manager to maintain the integrity of the distributed
resources across the system. The CPU power that can be made available in an MPP system depends
upon the number of nodes that can be connected. MPP systems build in redundancy to make recovery
easy, and they have resource management tools to manage CPU and disk space and to perform
query optimization.
1.4.3 Cloud Computing
Three criteria for a cloud environment are,
1. Enterprises incur no infrastructure or capital costs, only operational costs. Those operational
costs will be incurred on a pay-per-use basis with no contractual obligations.
2. Capacity can be scaled up or down dynamically, and immediately. This differentiates clouds
from traditional hosting service providers where there may have been limits placed on scaling.
3. The underlying hardware can be anywhere geographically. The architectural specifics are
abstracted from the user. In addition, the hardware will run in multi-tenancy mode where
multiple users from multiple organizations can be accessing the exact same infrastructure
simultaneously.
Five essential characteristics of a cloud environment:
1. On-demand self-service
2. Broad network access
3. Resource pooling
4. Rapid elasticity
5. Measured service
The two primary types of cloud environments:
1. Public clouds
2. Private clouds
Public Clouds
With a public cloud users are basically loading their data onto a host system and they are then
allocated resources as they need them to use that data. They will get charged according to their
usage.

Advantages of Public Cloud


1. The bandwidth is as-needed and users only pay for what they use.

2. It isn’t necessary to buy a system sized to handle the maximum capacity ever required and then
risk having half of the capacity sitting idle much of the time.
3. If there are short bursts where a lot of processing is needed then it is possible to get it with no
hassle. Simply pay for the extra resources.
4. There’s typically very fast ramp-up. Once granted access to the cloud environment, users load
their data and start analyzing.
5. It is easy to share data with others regardless of their location since a public cloud by definition
is outside of a corporate firewall. Anyone can be given permission to log on to the environment
created.
Disadvantages of Public Cloud
1. Few performance guarantees.
2. High variability in performance.
3. Concerns around the security of the data.
4. It can get expensive if a cloud isn't used wisely, since users will be charged for everything that
they do.
5. If an audited trail of data and where it sits is required, it is not possible to have that in a public
cloud.
The best use for a public cloud is pure research and development work, where
performance variability isn’t something to worry about.
For non-mission-critical analytical processes, the cloud is a potential long-term host even
for deployed processes.
A public cloud can be problematic if data security is a big concern. It’s necessary to apply
good security protocols and tools to a public cloud and keep your environment highly secure.
Private Clouds
A private cloud has the same features as a public cloud, but it is owned exclusively by one
organization and typically housed behind a corporate firewall. A private cloud is going to serve
the exact same function as a public cloud, but just for the people or teams within a given
organization.
Private Cloud vs. Public Cloud

Figure 5-Cloud computing


One huge advantage of an onsite private cloud is that the organization will have complete
control over the data and system security. Data is never leaving the corporate firewall so there’s


absolutely no concern about where it’s going. The data is at no more risk than it is on any other
internal system.
One downside of an onsite private cloud is that it is necessary to purchase and own the
entire cloud infrastructure before allocating it out to users, which could in the short term negate
some of the cost savings.
1.4.4 Grid Computing
There are some computations and algorithms that aren’t cleanly converted to SQL or
embedded in a user-defined function within a database. In these cases, it’s necessary
to pull data out into a more traditional analytics environment and run analytic tools against that
data in the traditional way.
A grid configuration can help both cost and performance. It falls into the classification of
“high-performance computing.” Instead of having a single high-end server (or maybe a few of
them), a large number of lower-cost machines are put in place. As opposed to having one server
managing its CPU and resources across jobs, jobs are parceled out individually to the different
machines to be processed in parallel. Each machine may only be able to handle a fraction of the
work of the original server and can potentially handle only one job at a time.
Using such a grid enables analytic professionals to scale an environment relatively
cheaply and quickly. If a large organization has many processes being run and most of them are
small to medium in size, a grid can be a huge boost.
MapReduce
MapReduce is a parallel programming framework. It’s neither a database nor a direct
competitor to databases. MapReduce consists of two primary processes that a programmer
builds: the “map” step and the “reduce” step. These steps get passed to the MapReduce
framework, which then runs the programs in parallel on a set of worker nodes.
In the case of MapReduce, there is a lot of commodity hardware to which data is being
passed as needed to run a process. Each MapReduce worker runs the same code against its
portion of the data. The workers do not interact or even have knowledge of each other.
MapReduce is a programming framework popularized by Google and used to simplify
data processing across massive data sets. Hadoop is a popular open-source version of
MapReduce supplied by the Apache organization and is the best-known implementation of
the MapReduce framework.
A big distinction of a MapReduce environment is the specific ability to handle
unstructured text. In a relational database, everything is already in tables and rows and columns.
The data already has well-defined relationships. This is not always true with raw data streams.
That’s where MapReduce can really be powerful. Loading big chunks of text into a “blob” field
in a database is possible, but it really isn’t the best use of the database or the best way to handle
such data.
Working of MapReduce
Let’s assume there are 20 terabytes of data and 20 MapReduce server nodes for a project.
The first step is to distribute a terabyte to each of the 20 nodes using a simple file copy process.
Note that this data has to be distributed prior to the MapReduce process being started. Also note

that the data is in a file of some format determined by the user. There is no standard format like
in a relational database.
Next, the programmer submits two programs to the scheduler. One is a map program; the
other is the reduce program. In this two-step processing, the map program finds the data on disk
and executes the logic it contains. This occurs independently on each of the 20 servers in our
example. The results of the map step are then passed to the reduce process to summarize and
aggregate the final answers.
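A toy illustration of the two steps: a word count written in plain Python, not the actual Hadoop API. In a real cluster the framework would run the map step on each node in parallel and shuffle the intermediate pairs to the reducers.

from collections import defaultdict

def map_step(chunk_of_text):
    """Map: emit (key, value) pairs from one node's portion of the data."""
    for word in chunk_of_text.split():
        yield (word.lower(), 1)

def reduce_step(key, values):
    """Reduce: summarize all values emitted for one key."""
    return key, sum(values)

# Each string stands in for the file chunk stored on one worker node.
chunks = ["big data is big", "data is everywhere", "big data platforms"]

# Shuffle phase: group the mapped pairs by key before reducing.
grouped = defaultdict(list)
for chunk in chunks:                       # in reality, this loop runs in parallel
    for key, value in map_step(chunk):
        grouped[key].append(value)

results = [reduce_step(k, v) for k, v in grouped.items()]
print(results)   # e.g. [('big', 3), ('data', 3), ('is', 2), ...]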

Figure 6- MapReduce Process

MapReduce Strengths and Weaknesses


MapReduce can run on commodity hardware. As a result, it can be very cheap to get up
and running. It can also be very cheap to expand. It is easy to expand the capacity because all
that is required is to buy more servers and bolt them on to the platform.
MapReduce is at its best when there is a large set of input data where much of the data
isn’t required for analysis. If only a small piece of the data is really going to be important, but it
isn’t clear up-front which pieces will be important, MapReduce can help.
MapReduce can be a terrific way to sort through the masses of data and pull out the important
parts.
MapReduce is not a database, so it has no built-in security, no indexing, no query or
process optimizer, no historical perspective in terms of other jobs that have been run, and no
knowledge of other data that exists. While it provides the ultimate flexibility to process different
kinds of data, it also comes with the responsibility to define exactly what the data is in every
process created.
1.5 Analytic Processes
As analytic professionals begin constantly using a database platform for their work
through a sandbox, they will be doing some tasks repeatedly. Enterprise analytic data sets are
key tools to

help drive consistency and productivity, and to lower risk, in an organization's advanced analytics
processes.
Analytic Sandbox
One of the uses of such a database system is to facilitate the building and deployment of
advanced analytic processes. In order for analytic professionals to utilize an enterprise data
warehouse or data mart more effectively, however, they need the correct permissions and access
to do so. An analytic sandbox is the mechanism for achieving this.
If used appropriately, an analytic sandbox can be one of the primary drivers of value in
the world of big data. Other terms used for the sandbox concept include an agile analytics cloud
and a data lab, among others.
An analytic sandbox provides a set of resources with which in-depth analysis can be done
to answer critical business questions. An analytic sandbox is ideal for data exploration,
development of analytical processes, proof of concepts, and prototyping.
A sandbox is going to be leveraged by a fairly small set of users. There will be data
created within the sandbox that is segregated from the production database. Sandbox users will
also be allowed to load data of their own for brief time periods as part of a project, even if that
data is not part of the official enterprise data model.
Data in a sandbox will have a limited shelf life. The idea isn’t to build up a bunch of
permanent data. During a project, build the data needed for the project. When that project is
done, delete the data. If used appropriately, a sandbox has the capability to be a major driver of
analytic value for an organization.
Analytic Sandbox Benefits
Benefits from the view of an analytic professional:
1) Independence.
Analytic professionals will be able to work independently on the database system without
needing to continually go back and ask for permissions for specific projects.
2) Flexibility.
Analytic professionals will have the flexibility to use whatever business intelligence, statistical
analysis, or visualization tools that they need to use.
3) Efficiency.
Analytic professionals will be able to leverage the existing enterprise data warehouse or data mart,
without having to move or migrate data.
4) Freedom.
Analytic professionals can reduce focus on the administration of systems and babysitting of
production processes by shifting those maintenance tasks to IT.
5) Speed.
Massive speed improvement will be realized with the move to parallel processing. This also
enables rapid iteration and the ability to “fail fast” and take more risks to innovate.


Benefits from the view of IT:


1) Centralization.
IT will be able to centrally manage a sandbox environment just as every other database
environment on the system is managed.
2) Streamlining.
A sandbox will greatly simplify the promotion of analytic processes into production since there
will be a consistent platform for both development and deployment.
3) Simplicity.
There will be no more processes built during development that have to be totally rewritten to run
in the production environment.
4) Control.
IT will be able to control the sandbox environment, balancing sandbox needs and the needs of
other users. The production environment is safe from an experiment gone wrong in the sandbox.
5) Costs.
Big cost savings can be realized by consolidating many analytic data marts into one central system.
1.5.1 Analytical Sandbox
An Internal Sandbox
For an internal sandbox, a portion of an enterprise data warehouse or data mart is carved
out to serve as the analytic sandbox.
The sandbox is physically located on the production system; however, the sandbox
database itself is not part of the production database. The sandbox is a separate database
container within the system.

Figure-7 Internal Sandbox


It is better to add a MapReduce environment into the mix. This would typically be
installed alongside the database platform unless you’re using a system that can combine the two
environments together.
The MapReduce environment will require access to the internal sandbox. Data can be
shared between the two environments as required.
One strength of an internal sandbox is that it will leverage existing hardware resources
and infrastructure already in place. This makes it very easy to set up. From an administration
perspective, there’s no difference in setting up a sandbox than in setting up any other database
container on the system.
The biggest strength of an internal sandbox is the ability to directly join production data
with sandbox data. Since all of the production data and all of the sandbox data are within the
production system, it’s very easy to link those sources to one another and work with all the data
together.

Figure-8 Internal Sandbox

An internal sandbox is very cost-effective since no new hardware is needed. The


production system is already in place. It is just being used in a new way. The elimination of any
and all cross-platform data movement also lowers costs. The one exception is any data
movement required between the database and the MapReduce environment.
There are a few weaknesses of an internal sandbox. One such weakness is that there will
be an additional load on the existing enterprise data warehouse or data mart. The sandbox will
use both space and CPU resources (potentially a lot of resources). Another weakness is that an
internal sandbox can be constrained by production policies and procedures.
An External Sandbox
A physically separate analytic sandbox is created for testing and development of analytic
processes. It’s relatively rare to have an environment that’s purely external. Internal or hybrid
sandboxes, which we’ll talk about next, are more common. It is important to understand what the
external sandbox is, however, as it is a component of a hybrid sandbox environment.


Figure-9 External Sandbox

The biggest strength of an external sandbox is its simplicity. The sandbox is a standalone
environment, dedicated to advanced analytics development. It will have no impact on other
processes, which allows for flexibility in design and usage.
Another strength of an external sandbox is reduced workload management. When only
analytic professionals are using the system, it isn’t necessary to worry much about tuning and
balancing. There will be predictable, stable performance in both the sandbox and production
environments.
A Hybrid Sandbox
A hybrid sandbox environment is the combination of an internal sandbox and an external
sandbox. It allows analytic professionals the flexibility to use the power of the production system
when needed, but also the flexibility of the external system for deep exploration or tasks that
aren’t as friendly to the database.

Figure-10 Hybrid Sandbox

The strengths of a hybrid sandbox environment are similar to the strengths of the internal
and external options, plus having ultimate flexibility in the approach taken for an analysis. It is
easy to avoid production impacts during early testing if work is done on the external sandbox.
Another advantage is if an analytic process has been built and it has to be run in a
“pseudo- production” mode temporarily while the full production system process is being
deployed. Such processes can be run out of the internal sandbox easily.
The weaknesses of a hybrid environment are similar to the weaknesses of the other two
options, but with a few additions. One weakness is the need to maintain both an internal and
external sandbox environment. Not only will it be necessary to keep the external sandbox


consistent with the production environment in this case, but the external sandbox will also need
to be kept consistent with the internal sandbox.
It will also be necessary to establish some guidelines on when each sandbox option is used.
Workload Management and Capacity Planning
As analytic professionals start to use a sandbox, there are a lot of built-in components of
database systems that will enable it to work smoothly. Sandbox users can be assigned to a group
that has permissions that make sense for the purpose of developing new advanced analytics
processes.
For example, it is possible to limit how much of the CPU a given sandbox user can
absorb at one time.
One of the important things to do is to limit disk space usage through data retention
policies. When a data set is in a sandbox and it hasn’t been touched in a couple of months, the
default should be that it is deleted. A sandbox should not just continuously build up data sets, as
often happens in traditional environments.
Especially with an internal sandbox, as more analytics are implemented, it will change
the mix and level of resource usage in both the sandbox environment and the production
environment.
1.5.2 Analytic Data Set
An analytic data set (ADS) is the data that is pulled together in order to create an analysis
or model. It is data in the format required for the specific analysis at hand. An ADS is generated
by transforming, aggregating, and combining data. It is going to mimic a denormalized, or flat
file, structure. What this means is that there will be one record per customer, location, product,
or whatever type of entity is being analyzed. The analytic data set helps to bridge the gap
between efficient storage and ease of use.
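A small sketch of building such a flat, one-row-per-customer ADS from transaction data, assuming Python with pandas; the column names and sample data are hypothetical:

import pandas as pd

# Raw, normalized transaction data (many rows per customer).
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "channel": ["web", "store", "web", "web", "store", "web"],
    "amount": [50, 20, 75, 10, 30, 15],
})

# Transform, aggregate, and combine into a denormalized ADS:
# exactly one record per customer, one column per candidate metric.
ads = tx.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    num_transactions=("amount", "count"),
    web_share=("channel", lambda c: (c == "web").mean()),
)
print(ads)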
Development versus Production Analytic Data Sets
A development ADS is going to be the data set used to build an analytic process. It will
have all the candidate variables that may be needed to solve a problem and will be very wide. A
development ADS might have hundreds or even thousands of variables or metrics within it.

Figure-11 Development versus Production Analytic Data Sets


A production analytic data set, however, is what is needed for scoring and deployment.
It’s going to contain only the specific metrics that were actually in the final solution.
Typically, most processes only need a small fraction of the metrics explored during
development. A big difference here is that the scores need to be applied to every entity, not just
a sample.

Traditional Analytic Data Sets


In a traditional environment, all analytic data sets are created outside of the database.
Each analytic professional creates his or her own analytic data sets independently. This is done
by every analytic professional, which means that there are possibly hundreds of people
generating their own independent views of corporate data.

Figure-12 Traditional Analytic Data Set Process


A huge issue with the traditional approach to analytic data set generation is the
repetitious work. If analytic professionals are creating very similar data sets again and again, it’s
not just the space and system resources they are using, but it’s their time.
They have to set up the ADS processes, they have to run them, and they have to
babysit them and make sure they are complete.
Enterprise Analytic Data Sets
An EADS is a shared and reusable set of centralized, standardized analytic data sets for
use in analytics.


1.6 ANALYSIS AND REPORTING


Analysis is the process of exploring data and reports in order to extract meaningful insights,
which can be used to better understand and improve business performance. Reporting is "the
process of organizing data into informational summaries in order to monitor how different areas
of a business are performing."
1.6.1 COMPARING ANALYSIS WITH REPORTING
• Reporting is “the process of organizing data into informational summaries in order
to monitor how different areas of a business are performing.”
• Measuring core metrics and presenting them — whether in an email, a slidedeck, or
online dashboard — falls under this category.
• Analytics is “the process of exploring data and reports in order to extract meaningful
insights, which can be used to better understand and improve business
performance.”
• Reporting helps companies to monitor their online business and be alerted to when data
falls outside of expected ranges.
• Good reporting should raise questions about the business from its end users.
• The goal of analysis is to answer questions by interpreting the data at a deeper level and
providing actionable recommendations.
• A firm may be focused on the general area of analytics (strategy, implementation,
reporting, etc.) but not necessarily on the specific aspect of analysis.
• It's almost like some organizations run out of gas after the initial set-up-related
activities and don't make it to the analysis stage.

Figure-13 Analysis and Reporting


1.6.2 CONTRAST BETWEEN ANALYSIS AND REPORTING
The basic differences between Analysis and Reporting are as follows:

Analysis                   Reporting
Provides what is needed    Provides what is asked for
Is typically customized    Is typically standardized
Involves a person          Does not involve a person
Is extremely flexible      Is fairly inflexible

• Reporting translates raw data into information; analysis transforms data and
information into insights. Reporting shows you what is happening, while analysis
focuses on explaining why it is happening and what you can do about it.
• Reports are like robots that monitor and alert you, whereas analysis is like a parent who
can figure out what is going on (hungry, dirty diaper, no pacifier, teething, tired, ear
infection, etc.).
• Reporting and analysis can go hand-in-hand:
• Reporting provides little or no context about what is happening in the data; context is
critical to good analysis.
• Reporting usually raises a question – What is happening?
• Analysis transforms the data into insights – Why is it happening? What can you do
about it?
Thus, analysis and reporting complement each other, and each should be applied where it best
serves the business need.

1.7 MODERN ANALYTIC TOOLS


• Modern Analytic Tools: Current Analytic tools concentrate on three classes:

a) Batch processing tools


b) Stream Processing tools and
c) Interactive Analysis tools.
a) Big Data Tools Based on Batch Processing
Batch processing system:
• Batch Processing System involves collecting a series of processing jobs and
carrying them out periodically as a group (or batch) of jobs.
• It allows a large volume of jobs to be processed at the same time.
• An organization can schedule batch processing for a time when there is little activity
on their computer systems, for example overnight or at weekends.
• One of the most famous and powerful batch process-based Big Data tools is
Apache Hadoop. It provides infrastructures and platforms for other specific Big
Data applications.
b) Stream Processing tools
• Stream processing – analyzing and predicting from data as and when it transpires (arrives).


• The key strength of stream processing is that it can provide insights faster, often within
milliseconds to seconds.
• It helps understanding the hidden patterns in millions of data records in real time.
• It translates into processing of data from single or multiple sources in real or near-real
time applying the desired business logic and emitting the processed information to the
sink.
• Stream processing serves multiple purposes in today’s business arena. Real-time data
streaming tools include:
1. Storm
Storm is a stream processing engine without batch support, a true real-time processing
framework, taking in a stream as an entire ‘event’ instead of a series of small batches. Apache
Storm is a distributed real-time computation system. Its applications are designed as directed
acyclic graphs.
2. Apache Flink
Apache Flink is an open-source platform: a streaming dataflow engine that provides data
distribution, communication, and fault tolerance for computations over data streams. Flink is a
top-level Apache project and a scalable data analytics framework that is fully compatible with
Hadoop. Flink can execute both stream processing and batch processing easily, and it was
designed as an alternative to MapReduce.
3. Kinesis
 Kinesis is an out-of-the-box streaming data tool. Kinesis comprises shards, which
Kafka calls partitions. For organizations that take advantage of real-time or near real-
time access to large stores of data, Amazon Kinesis is great.
 Kinesis Streams solves a variety of streaming data problems. One common use is the
real-time aggregation of data, which is followed by loading the aggregated data into a data
warehouse. Data is put into Kinesis streams, which ensures durability and elasticity.
c) Interactive Analysis -Big Data Tools
• The interactive analysis presents the data in an interactive environment, allowing users
to undertake their own analysis of information.
• Users are directly connected to the computer and hence can interact with it in
real time.
• The data can be reviewed, compared and analyzed in tabular or graphic format
or both at the same time.
Interactive Analysis (IA) Big Data Tools:
a) Google’s Dremel: Google proposed an interactive analysis system in 2010, named Dremel,
which is scalable for processing nested data.


– Dremel provides a very fast SQL-like interface to the data by using a different
technique than MapReduce. Dremel has a very different architecture
compared with the well-known Apache Hadoop, and acts as a successful
complement to Map/Reduce-based computations.
– Dremel has the capability to run aggregation queries over trillion-row
tables in seconds by combining multi-level execution trees and a
columnar data layout.
b) Apache Drill
• Apache Drill is an open-source SQL query engine for Big Data exploration. It is similar
to Google’s Dremel.
• Drill has more flexibility to support various query languages, data
formats and data sources.
• Drill is designed from the ground up to support high-performance analysis on the semi-
structured and rapidly evolving data coming from modern Big Data applications.
• Drill provides plug-and-play integration with existing Apache Hive and
Apache HBase deployments.
1.7.1 Categories of Modern Analytic Tools
Big data tools for HPC and supercomputing
 MPI (Message Passing Interface, 1992) provides standardized function interfaces for
communication between parallel processes.
 Collective communication operations
– Broadcast, Scatter, Gather, Reduce, Allgather, Allreduce, Reducescatter
 Popular implementations
– MPICH (2001)
– OpenMPI (2004)
Big data tools on clouds
 MapReduce model
 Iterative MapReduce model
 DAG model
 Graph model
 Collective model
a. MapReduce Model
Jeffrey Dean et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI
2004. Apache Hadoop (2005) , Apache Hadoop YARN: Yet Another Resource Negotiator,
SOCC 2013.
Key Features of MapReduce Model

 Large clusters of commodity machines


 Designed for big data
 Support from local disks based distributed file system (GFS / HDFS)
 Disk based intermediate data transfer in Shuffling
MapReduce programming model
 Computation pattern: Map tasks and Reduce tasks
 Data abstraction: KeyValue pairs
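As a rough illustration of the computation pattern and key-value abstraction listed above, the following Python sketch simulates the map, shuffle, and reduce phases of a word count in a single process. It is only a teaching sketch under the stated assumptions: the function names and the sample documents are made up for illustration and are not part of Hadoop or any MapReduce library.

from collections import defaultdict

def map_task(document):
    # Map: emit (word, 1) key-value pairs for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Shuffle: group all intermediate values by key (done via disk-based transfer in Hadoop)
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups.items()

def reduce_task(key, values):
    # Reduce: aggregate the list of values for one key
    return (key, sum(values))

if __name__ == "__main__":
    documents = ["big data needs big clusters", "map and reduce over key value pairs"]
    mapped = (pair for doc in documents for pair in map_task(doc))
    counts = dict(reduce_task(k, v) for k, v in shuffle(mapped))
    print(counts["big"])   # -> 2

In a real cluster the map and reduce tasks run in parallel on many commodity machines and the intermediate key-value pairs are shuffled through the distributed file system; the single-process version above only mirrors the programming model.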
b. Iterative MapReduce Model
• Twister: a runtime for iterative MapReduce with simple collectives such as
broadcasting and aggregation.
• HaLoop: efficient iterative data processing on large clusters
– Loop-aware task scheduling
– Caching and indexing for loop-invariant data on local disk
 Resilient Distributed Datasets (RDD): a fault-tolerant abstraction for in-memory
cluster computing
– MapReduce-like parallel operations expressed as a DAG of execution stages and
pipelined transformations
– Simple collectives: broadcasting and aggregation
c. DAG (Directed Acyclic Graph) Model
Distributed data-parallel programs built from sequential building blocks; Apache Spark:
cluster computing with working sets.
d. Graph Model
• Graph Processing with BSP model
• Pregel (2010) : A System for Large-Scale Graph Processing. SIGMOD 2010, Apache
Hama (2010)
• Apache Giraph (2012): Scaling Apache Giraph to a trillion edges
Pregel & Apache Giraph
• Computation model: superstep as iteration; vertex state machine (Active and Inactive,
vote to halt); message passing between vertices; combiners; aggregators; and
topology mutation
• Master/worker model
• Graph partition: hashing
• Fault tolerance: checkpointing and confined recovery
GraphLab (2010)
• GraphLab: A New Parallel Framework for Machine Learning. UAI 2010.


• Distributed GraphLab: A Framework for Machine Learning and Data Mining in


the Cloud.
• Data graph, Update functions and the scope
• PowerGraph (2012) - Distributed Graph-Parallel Computation on Natural Graphs –
the Gather, Apply, Scatter (GAS) model
• GraphX (2013) -A Resilient Distributed Graph System on Spark. GRADES
e. Collective Model
Harp (2013)
 A Hadoop Plugin (on Hadoop 1.2.1 and Hadoop 2.2.0). Hierarchical data abstraction on
arrays, key-values and graphs for easy programming expressiveness.
 Collective communication model to support various communication operations on the
data abstractions. Caching with buffer management for memory allocation required
from computation and communication
 BSP style parallelism. Fault tolerance with check-pointing.
Other major Tools
a) AWS
b) BigData
c) Cassandra
d) Data Warehousing
e) DevOps
f) HBase
g) Hive
h) MongoDB
i) NiFi
j) Tableau
k) Talend
l) ZooKeeper
Thus the modern analytical tools play an important role in the modern data world.

1.8 STATISTICAL CONCEPTS: SAMPLING DISTRIBUTIONS

1.8.1 Fundamental Statistics


• Statistics is a very broad subject, with applications in a vast number of different fields.
• Generally, one can say that statistics is the methodology for collecting,
analyzing, interpreting and drawing conclusions from information.


• Putting it in other words, statistics is the methodology which scientists and


mathematicians have developed for interpreting and drawing conclusions from collected
data.
• Everything that deals even remotely with the collection, processing, interpretation
and presentation of data belongs to the domain of statistics, and so does the detailed
planning that precedes all these activities.
• Statistics is the study of the collection, analysis, interpretation, presentation, and
organization of data. In applying statistics to, e.g., a scientific, industrial, or social
problem, it is conventional to begin with a statistical population or a statistical
model process to be studied.
Populations and Parameters

 A population is any large collection of objects or individuals, such as


Americans, students, or trees about which information is desired.
 A parameter is any summary number, like an average or percentage that describes
the entire population.
 The population mean μ and the population proportion p are two different
population parameters.
 For example: We might be interested in learning about μ, the average weight of all
middle-aged female Americans. The population consists of all middle-aged female
Americans, and the parameter is µ. Or, we might be interested in learning about p, the
proportion of likely American voters approving of the president's job performance. The
population comprises all likely American voters, and the parameter is p. The problem
is that 99.999999999999... % of the time, we don't — or can't — know the real value of
a population parameter.
Samples and statistics
A sample is a representative group drawn from the population. A statistic is any summary
number, like an average or percentage, that describes the sample. The sample mean x̄ and the
sample proportion p̂ are two different sample statistics. For example:
We might use x̄, the average weight of a random sample of 100 middle-aged female
Americans, to estimate µ, the average weight of all middle-aged female Americans. Or, we
might use p̂, the proportion in a random sample of 1000 likely American voters who approve
of the president's job performance, to estimate p, the proportion of all likely American voters
who approve of the president's job performance.
Because samples are manageable in size, we can determine the actual value of any statistic.
We use the known value of the sample statistic to learn about the unknown value of the
population parameter.


1.9 RE-SAMPLING
• Re-sampling is the method that consists of drawing repeated samples from the original
data samples. The method of Resampling is a nonparametric method of statistical
inference. The method of resampling uses experimental methods, rather than
analytical methods, to generate the unique sampling distribution.
• In statistics, re-sampling is any of a variety of methods for doing one of the following:
– Estimating the precision of sample statistics (medians, variances, percentiles)
– by using subsets of available data (jackknifing) or drawing randomly
with replacement from a set of data points (bootstrapping)
1.9.1 Need for Re-sampling
• Re-sampling involves the selection of randomized cases, with replacement, from the
original data sample in such a manner that each sample drawn has a number of cases
that is similar to the original data sample.
• Due to replacement the drawn number of samples that are used by the method of re-
sampling consists of repetitive cases.
• Re-sampling generates a unique sampling distribution on the basis of the actual data.
• The method of re-sampling uses experimental methods, rather than analytical methods,
to generate the unique sampling distribution.
• The method of re-sampling yields unbiased estimates as it is based on the unbiased
samples of all the possible results of the data studied by the researcher.
• Re-sampling methods are processes of repeatedly drawing samples from a data set and
refitting a given model on each sample with the goal of learning more about the fitted
model.
• Re-sampling methods can be expensive since they require repeatedly performing
the same statistical methods on N different subsets of the data.
• Re-sampling methods refit a model of interest to samples formed from the training set
in order to obtain additional information about the fitted model.
1.9.2 Re-sampling methods
There are four major re-sampling methods available and are:
1. Permutation
2. Bootstrap
3. Jackknife
4. Cross validation
1. Permutation
Re-sampling procedures date back to 1930s, when permutation tests were introduced by R.A.
Fisher and E.J.G. Pitman. They were not feasible until the computer era.
Permutation Example: Fisher’s Tea Taster


• 8 cups of tea are prepared: four with tea poured first and four with milk poured first.
• The cups are presented to her in random order.
• She marks a strip of paper with eight guesses about the order of the "tea-first" cups and
"milk-first" cups, let's say T T T T M M M M.
Permutation solution
• Make a deck of eight cards, four marked "T" and four marked "M".
• Deal out these eight cards successively in all possible orderings (permutations).
• Record how many of those permutations show >= 6 matches.
Approximate Permutation
• Shuffle the deck and deal it out along the strip of paper with the marked guesses;
record the number of matches.
• Repeat many times.
Permutation Re-sampling Processes
Step 1: Collect Data from Control & Treatment Groups
Step 2: Merge samples to form a pseudo population
Step 3: Sample without replacement from pseudo population to simulate control Treatment
groups
Step 4: Compute the target statistic for each sample
Compute the difference statistic, save the result in a table, and repeat the resampling process for
1,000+ iterations.
Permutation Tests
 In classical hypothesis testing, we start with assumptions about the underlying
distribution and then derive the sampling distribution of the test statistic under
H0.
 In Permutation testing, the initial assumptions are not needed (except
exchangeability), and the sampling distribution of the test statistic under H0 is
computed by using permutations of the data.
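The following Python sketch approximates the tea-tasting permutation test described above by shuffling the deck many times and counting how often six or more matches occur. The function name, seed, and iteration count are illustrative choices, not part of any standard library.

import random

def approximate_permutation_test(n_shuffles=10000, seed=0):
    # Approximate permutation test for Fisher's tea-tasting experiment
    rng = random.Random(seed)
    guesses = list("TTTTMMMM")          # the taster's strip of eight guesses
    deck = list("TTTTMMMM")             # four "tea first" and four "milk first" cards
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(deck)               # one random ordering of the true cups
        matches = sum(g == c for g, c in zip(guesses, deck))
        if matches >= 6:                # as extreme as, or more extreme than, 6 matches
            extreme += 1
    return extreme / n_shuffles

if __name__ == "__main__":
    # The exact probability of >= 6 matches works out to 17/70 (about 0.243);
    # the simulated proportion should come out close to that value.
    print(approximate_permutation_test())

Listing every one of the 8! orderings instead of shuffling would give the exact permutation distribution; the shuffle-and-repeat loop is the "approximate permutation" shortcut described above.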
2. Bootstrap
• The bootstrap is a widely applicable tool that can be used to quantify the uncertainty
associated with a given estimator or statistical learning approach, including those for
which it is difficult to obtain a measure of variability.
• The bootstrap generates distinct data sets by repeatedly sampling observations from the
original data set. These generated data sets can be used to estimate variability in lieu of
sampling independent data sets from the full population.
• 1969: Simon publishes the bootstrap as an example in Basic Research Methods in Social
Science (the pigfood example).
• 1979: Efron names the bootstrap and publishes the first paper on it, coinciding with the
advent of the personal computer.


Figure 14- Sampling


• The sampling employed by the bootstrap involves randomly selecting n observations with
replacement, which means some observations can be selected multiple times while other
observations are not included at all.
• This process is repeated B times to yield B bootstrap data sets, Z*1, Z*2, …, Z*B, which
can be used to estimate other quantities such as the standard error.
Working of Bootstrap
Bootstrapping is a method for estimating the sampling distribution of an estimator by resampling
with replacement from the original sample.
 The bootstrap procedure is a means of estimating the statistical accuracy . . . from
the data in a single sample.
 Bootstrapping is used to mimic the process of selecting many samples when
the population is too small to do otherwise
 The samples are generated from the data in the original sample by copying it
many times (Monte Carlo simulation).
 Samples can then be selected at random, and descriptive statistics calculated or
regressions run for each sample.
 The results generated from the bootstrap samples can be treated as if they were
the result of actual sampling from the original population.


Characteristics of Bootstrapping

Figure 15 - Characteristics of Bootstrapping

Figure 16 - Example of Bootstrapping

Bootstrapping is especially useful in situations when no analytic formula for the sampling
distribution is available.


• Traditional forecasting methods, like exponential smoothing, work well when demand
is constant – patterns easily recognized by software
• In contrast, when demand is irregular, patterns may be difficult to recognize.
• Therefore, when faced with irregular demand, bootstrapping may be used to
provide more accurate forecasts, making some important assumptions…
Assumptions and Methodology
• Bootstrapping makes no assumption regarding the population
• No normality of error terms
• No equal variance
• Allows for accurate forecasts of intermittent demand
• If the sample is a good approximation of the population, the sampling distribution may
be estimated by generating a large number of new samples
• For small data sets, taking a small representative sample of the data and replicating it
will yield superior results
Applications and Uses
a) Criminology:- Statistical significance testing is important in criminology and
criminal justice.
b) Actuarial Practice:- Process of developing an actuarial model begins with the creation of
probability distributions of input variables. Input variables are generally asset-side
generated cash flows (financial) or cash flows generated from the liabilities side
(underwriting)
c) Classifications Used by Ecologists:- Ecologists often use cluster analysis as a tool in the
classification and mapping of entities such as communities or landscapes
d) Human Nutrition:- Inverse regression used to estimate vitamin B-6 requirement of young
women & Standard statistical methods were used to estimate the mean vitamin B-6
requirement.
e) Outsourcing:- Agilent Technologies determined it was time to transfer manufacturing of its
3070 in-circuit test systems from Colorado to Singapore & Major concern was the change
in environmental test conditions (dry vs humid).
Bootstrap Types
a) Parametric Bootstrap
b) Non-parametric Bootstrap
a) Parametric Bootstrap
• Re-sampling makes no assumptions about the population distribution.
• The bootstrap covered thus far is a nonparametric bootstrap.
• If we have information about the population distribution, this can be used in resampling.
• In this case, when we draw randomly from the sample we can use the population
distribution.
• For example, if we know that the population distribution is normal, then we can estimate
its parameters using the sample mean and variance.
• Then we approximate the population distribution with the estimated distribution and use
it to draw new samples.
• As expected, if the assumption about population distribution is correct then the
parametric bootstrap will perform better than the nonparametric bootstrap.
• If not correct, then the nonparametric bootstrap will perform better.
Bootstrap Example
• A new pigfood ration is tested on twelve pigs, with six-week weight gains as follows:
• 496 544 464 416 512 560 608 544 480 466 512 496
• Mean: 508 ounces (establish a confidence interval)
Draw simulated samples from a hypothetical universe that embodies all we know about the
universe that this sample came from – our sample, replicated an infinite number of times.
The Bootstrap process steps
1. Put the observed weight gains in a hat
2. Sample 12 with replacement
3. Record the mean
4. Repeat steps 2-3, say, 1000 times
5. Record the 5th and 95th percentiles (for a 90% confidence interval)
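A minimal Python sketch of these five steps, applied to the twelve pigfood weight gains; the function name, seed, and simple percentile indexing are illustrative choices, not the only way to form a percentile bootstrap interval.

import random
import statistics

def bootstrap_ci(data, n_boot=1000, lower_pct=5, upper_pct=95, seed=0):
    # Percentile bootstrap confidence interval for the mean (steps 1-5 above)
    rng = random.Random(seed)
    n = len(data)
    boot_means = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in range(n)]   # sample 12 with replacement
        boot_means.append(statistics.mean(resample))       # record the mean
    boot_means.sort()
    lo = boot_means[int(n_boot * lower_pct / 100)]          # approximate 5th percentile
    hi = boot_means[int(n_boot * upper_pct / 100) - 1]      # approximate 95th percentile
    return lo, hi

if __name__ == "__main__":
    gains = [496, 544, 464, 416, 512, 560, 608, 544, 480, 466, 512, 496]
    print("observed mean:", round(statistics.mean(gains), 1))   # about 508 ounces
    print("90% bootstrap CI:", bootstrap_ci(gains))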

Figure 17 - Bootstrapped sample means


Nonparametric bootstrap
 The nonparametric bootstrap method relies on the empirical distribution function: the
idea is to simulate data from the empirical cdf Fn.
 Since the bootstrap samples are generated from Fn, this method is called the
nonparametric bootstrap.
 Here Fn is a discrete probability distribution that gives probability 1/n to each
observed value x1, …, xn.
 A sample of size n from Fn is thus a sample of size n drawn with replacement from
the collection x1, …, xn.
 The standard deviation of θ̂ is then estimated by
s_θ̂ = sqrt( (1/B) Σ_{i=1..B} ( θ̂*_i − θ̄* )² ),
where θ̂*_1, …, θ̂*_B are produced from B samples of size n drawn from the collection x1, …, xn.


Example of Bootstrap (Nonparametric)


• Have test scores (out of 100) for two consecutive years for each of 60 subjects.
• Want to obtain the correlation between the test scores and the variance of the
correlation estimate.
3. Jackknife Method
• Jackknife method was introduced by Quenouille (1949) – to estimate the bias of an
estimator.
• The method is later shown to be useful in reducing the bias as well as in estimating
the variance of an estimator.
• The jackknife is a statistical method for estimating and removing bias and for deriving
robust estimates of standard errors and confidence intervals.
• It is carried out by systematically dropping out subsets of the data one at a time and
assessing the resulting variation.
• Let θ̂n be an estimator of θ based on n i.i.d. random vectors X1, …, Xn, i.e.,
θ̂n = fn(X1, …, Xn) for some function fn. Let
θ̂n,−i = fn−1(X1, …, Xi−1, Xi+1, …, Xn)
be the corresponding recomputed statistic based on all but the i-th observation.
• The jackknife estimator of the bias E(θ̂n) − θ is given by
bias_J = ((n − 1)/n) Σ_{i=1..n} ( θ̂n,−i − θ̂n ).
• The jackknife estimator θ̂J of θ is given by
θ̂J = θ̂n − bias_J = (1/n) Σ_{i=1..n} ( n·θ̂n − (n − 1)·θ̂n,−i ).
• Such a bias-corrected estimator hopefully reduces the overall bias.
• The summands above, θ̃n,i = n·θ̂n − (n − 1)·θ̂n,−i, i = 1, …, n, are called pseudo-values.
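A small Python sketch of the leave-one-out recomputation, bias estimate, and standard-error formula above; the function name and the reuse of the pigfood weight gains as sample data are illustrative assumptions.

import statistics

def jackknife(data, estimator):
    # Jackknife bias and standard-error estimates for a generic estimator
    n = len(data)
    theta_hat = estimator(data)
    # Recompute the statistic n times, each time dropping one observation
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    theta_bar = sum(leave_one_out) / n
    bias = (n - 1) * (theta_bar - theta_hat)
    se = ((n - 1) / n * sum((t - theta_bar) ** 2 for t in leave_one_out)) ** 0.5
    return theta_hat - bias, bias, se          # bias-corrected estimate, bias, standard error

if __name__ == "__main__":
    sample = [496, 544, 464, 416, 512, 560, 608, 544, 480, 466, 512, 496]
    print(jackknife(sample, statistics.mean))   # bias of the sample mean is ~0, as expected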
A comparison of the Bootstrap & Jackknife
• Bootstrap
– Yields slightly different results when repeated on the same data (when
estimating the standard error)
– Not bound to theoretical distributions
• Jackknife
– Less general technique
– Explores sample variation differently
– Yields the same result each time
– Similar data requirements
4. Cross validation
• Cross-validation is a technique used to protect against overfitting in a predictive
model, particularly in a case where the amount of data may be limited.
• In cross-validation, you make a fixed number of folds (or partitions) of the data, run
the analysis on each fold, and then average the overall error estimate.
• Cross validation is a re-sampling method that can be used to estimate a given statistical
method's test error or to determine the appropriate amount of flexibility.
– Model assessment is the process of evaluating a model's performance.
– Model selection is the process of selecting the appropriate level of flexibility for a
model.
– The bootstrap is used in a number of contexts, but most commonly to provide a measure
of accuracy of a given statistical learning method or parameter estimate.
Need of Cross validation
• A naive approach would be to use the entire data set when training a learner.
• Instead, some of the data is removed before training begins.
• Then when training is done, the data that was removed can be used to test
the performance of the learned model on ``new'' data.
• This is the basic idea for a whole class of model evaluation methods called cross
validation.
Bootstrap vs. Cross-Validation
• Bootstrap
– Requires a small amount of data
– More complex technique – time consuming
• Cross-Validation
– Splits the data rather than resampling with replacement
– Requires large amounts of data
– Extremely useful in data mining and artificial intelligence
Cross Validation Methods
1. holdout method
2. K-fold cross validation
3. Leave-one-out cross validation
1. holdout method
The holdout method is the simplest kind of cross validation
• The data set is separated into two sets, called the training set and the testing set.
• The function approximator fits a function using the training set only.
• Then the function approximator is asked to predict the output values for the data in the
testing set (it has never seen these output values before).
• The errors it makes are accumulated as before to give the mean absolute test set
error, which is used to evaluate the model.
• The advantage of this method is that it is usually preferable to the residual method and
takes no longer to compute. However, its evaluation can have a high variance.
• The evaluation may depend heavily on
– which data points end up in the training set and which end up in the test set, and


– thus the evaluation may be significantly different depending on how the division
is made.
2. K-fold cross validation
• K-fold cross validation is one way to improve over the holdout method.
– The data set is divided into k subsets, and the holdout method is repeated k times.
• Each time, one of the k subsets is used as the test set and the other k-1 subsets are
put together to form a training set.
• Then the average error across all k trials is computed.
• The advantage of this method is that it matters less how the data gets divided.
• Every data point gets to be in a test set exactly once, and gets to be in a training set k−1
times.
• The variance of the resulting estimate is reduced as k is increased.
• The disadvantage of this method is that the training algorithm has to be rerun
from scratch k times, which means it takes k times as much computation to make
an evaluation.
• A variant of this method is to randomly divide the data into a test and training set k
different times.
• The advantage of doing this is that you can independently choose how large each test
set is and how many trials you average over.
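A Python sketch of k-fold cross-validation following the description above; the toy "model" (always predicting the mean of the training targets), the interleaved way folds are formed, and the use of mean absolute error are illustrative choices, not the only way to set this up.

import statistics

def k_fold_cv(xs, ys, k, fit, predict):
    # Generic k-fold cross-validation returning the average test error.
    # `fit` trains on the k-1 training folds; `predict` scores the held-out fold.
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))                  # every k-th point forms one fold
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]
        model = fit(train)
        errors = [abs(y - predict(model, x)) for x, y in test]
        fold_errors.append(statistics.mean(errors))        # mean absolute error on this fold
    return statistics.mean(fold_errors)                    # average across the k trials

if __name__ == "__main__":
    xs = list(range(10))
    ys = [2 * x + 1 for x in xs]
    fit = lambda train: statistics.mean(y for _, y in train)   # toy model: training mean
    predict = lambda model, x: model
    print(k_fold_cv(xs, ys, k=5, fit=fit, predict=predict))

Every point is tested exactly once and trained on k−1 times, which is exactly the property described in the bullets above; only the training loop has to be rerun k times.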
Leave-one-out cross validation
• Leave-one-out cross validation is K-fold cross validation taken to its logical extreme,
with K equal to N, the number of data points in the set.
– That means that N separate times, the function approximator is trained on all the
data except for one point and a prediction is made for that point.
– As before the average error is computed and used to evaluate the model.
– The evaluation given by leave-one-out cross validation error (LOO-XVE) is
good, but at first pass it seems very expensive to compute.
– Fortunately, locally weighted learners can make LOO predictions just as easily
as they make regular predictions.
– That means computing the LOO-XVE takes no more time than computing
the residual error and it is a much better way to evaluate models.
1.10 Statistical Inference
Statistical inference is the process of making guesses about the truth from a sample: inferences
about a population are made based on certain statistics calculated from a sample of data drawn
from that population.
Confidence Intervals.
Suppose we want to estimate an actual population mean μ. As you know, we can only
obtain x̄, the mean of a sample randomly selected from the population of interest. We can use
x̄ to find a range of values:


that we can be really confident contains the population mean μ. The range of values is called a
"confidence interval."
General form of most confidence intervals. The previous example illustrates the general
form of most confidence intervals, namely:

Sample estimate ± margin of error

That is, the lower limit L = sample estimate − margin of error, and the upper limit
U = sample estimate + margin of error.
Once we've obtained the interval, we can claim that we are really confident that the value
of the population parameter is somewhere between the value of L and the value of U. So far,
we've been very general in our discussion of the calculation and interpretation of confidence
intervals. To be more specific about their use, let's consider a specific interval, namely the "t-
interval for a population mean µ."
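A short Python sketch of a t-interval of the form "sample estimate ± margin of error"; it assumes SciPy is available for the t quantile, and the sample of weights is purely hypothetical.

import statistics
from scipy import stats   # assumption: SciPy is installed

def t_interval(sample, confidence=0.95):
    # t-interval for a population mean: x_bar +/- t * s / sqrt(n)
    n = len(sample)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / n ** 0.5              # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)   # t multiplier for n-1 degrees of freedom
    margin = t_crit * sem
    return mean - margin, mean + margin                    # (L, U)

if __name__ == "__main__":
    weights = [151, 162, 144, 158, 170, 149, 155, 163, 147, 160]  # hypothetical sample
    print(t_interval(weights, confidence=0.95))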
Hypothesis Testing
The general idea of hypothesis testing involves:
1. Making an initial assumption.
2. Collecting evidence (data).
3. Based on the available evidence (data), deciding whether to reject or not reject the
initial assumption.
Every hypothesis test, regardless of the population parameter involved, requires the above three
steps.
Errors in hypothesis testing
Type I error: The null hypothesis is rejected when it is true.
Type II error: The null hypothesis is not rejected when it is false.
Test of Proportion
Let us consider the parameter p of population proportion. For instance, we might want to
know the proportion of males within a total population of adults when we conduct a survey. A
test of proportion will assess whether or not a sample from a population represents the true
proportion from the entire population.
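A brief Python sketch of a one-sample test of proportion using the normal approximation; SciPy is assumed to be available, the survey counts are hypothetical, and this is only one common way to carry out such a test.

from math import sqrt
from scipy.stats import norm   # assumption: SciPy is installed

def proportion_z_test(successes, n, p0):
    # One-sample z-test of a proportion against a hypothesised value p0
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)              # standard error under the null hypothesis
    z = (p_hat - p0) / se
    p_value = 2 * norm.sf(abs(z))             # two-sided p-value
    return p_hat, z, p_value

if __name__ == "__main__":
    # Hypothetical survey: 520 of 1000 sampled voters approve; test H0: p = 0.5
    print(proportion_z_test(successes=520, n=1000, p0=0.5))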
1.11 PREDICTION ERROR
1.11.1 Introduction to Prediction Error
• A prediction error is the failure of some expected event to occur.
• When predictions fail, humans can use metacognitive functions, examining prior
predictions and failures.


• For example, they can examine whether there are correlations and trends, such as
consistently being unable to foresee outcomes accurately in particular situations.
• Applying that type of knowledge can inform decisions and improve the quality of future
predictions.
1.11.2 Error in Predictive Analysis
• Errors are an inescapable element of predictive analytics that should also be
quantified and presented along with any model, often in the form of a confidence
interval that indicates how accurate its predictions are expected to be.
• Analysis of prediction errors from similar or previous models can help
determine confidence intervals.
Predictions always contain errors
• Predictive analytics has many applications, the above mentioned examples are just the
tip of the iceberg.
• Many of them will add value, but it remains important to stress that the outcome of a
prediction model will always contain an error. Decision makers need to know how big
that error is.
• To illustrate, in using historic data to predict the future you assume that the future
will have the same dynamics as the past, an assumption which history has proven to
be dangerous.
• In artificial intelligence (AI), the analysis of prediction errors can help guide machine
learning (ML), similarly to the way it does for human learning.
• In reinforcement learning, for example, an agent might use the goal of minimizing
error feedback as a way to improve.
• Prediction errors, in that case, might be assigned a negative value and predicted
outcomes a positive value, in which case the AI would be programmed to attempt to
maximize its score.
• That approach to ML, sometimes known as error-driven learning, seeks to stimulate
learning by approximating the human drive for mastery.
1.11.3 Prediction Error in Statistics
Standard Error of the Estimate
• The standard error of the estimate is a measure of the accuracy of predictions.
• Recall that the regression line is the line that minimizes the sum of squared deviations
of prediction (also called the sum of squares error).
• The standard error of the estimate is closely related to this quantity and is defined below:
σest = sqrt( Σ (Y − Y')² / N )
where σest is the standard error of the estimate, Y is an actual score, Y' is a predicted score, and N
is the number of pairs of scores.

• In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an
estimator (of a procedure for estimating an unobserved quantity) measures the average of
the squares of the errors—that is, the average squared difference between the estimated
values and what is estimated.
• MSE is a risk function, corresponding to the expected value of the squared error loss.
• The fact that MSE is almost always strictly positive (and not zero) is because of
randomness or because the estimator does not account for information that could produce
a more accurate estimate.
1.11.4 Mean squared prediction error
• In statistics the mean squared prediction error or mean squared error of the predictions
of a smoothing or curve fitting procedure is the expected value of the squared difference
between the fitted values implied by the predictive function and the values of the
(unobservable) function g.
• The MSE is a measure of the quality of an estimator—it is always non-negative, and
values closer to zero are better.
• Root-Mean-Square error or Root-Mean-Square Deviation (RMSE or RMSD)
• In an analogy to standard deviation, taking the square root of the MSE yields the root-
mean-square error or root-mean-square deviation (RMSE or RMSD), which has the same
units as the quantity being estimated; for an unbiased estimator, the RMSE is the square
root of the variance, known as the standard error.
• The RMSD represents the square root of the second sample moment of the differences
between predicted values and observed values or the quadratic mean of these differences.
• These deviations are called residuals when the calculations are performed over the
data sample that was used for estimation and are called errors (or prediction errors)
when computed out-of-sample.
• The RMSD serves to aggregate the magnitudes of the errors in predictions for
various times into a single measure of predictive power.
• RMSD is a measure of accuracy, to compare forecasting errors of different models for
a particular dataset and not between datasets, as it is scale-dependent.
• RMSD is always non-negative, and a value of 0 (almost never achieved in
practice) would indicate a perfect fit to the data.
• In general, a lower RMSD is better than a higher one. However, comparisons across
different types of data would be invalid because the measure is dependent on the scale
of the numbers used.
• RMSD is the square root of the average of squared errors.
• The effect of each error on RMSD is proportional to the size of the squared error;
thus larger errors have a disproportionately large effect on RMSD.
• Consequently, RMSD is sensitive to outliers.
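A small Python sketch computing MSE and RMSE as defined above; the observed and forecast values are hypothetical and serve only to show how squaring makes larger errors dominate the result.

from math import sqrt

def mse(actual, predicted):
    # Mean squared error: average of the squared prediction errors
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root-mean-square error: square root of the MSE, in the same units as the data
    return sqrt(mse(actual, predicted))

if __name__ == "__main__":
    observed = [3.0, 5.0, 7.5, 10.0]        # hypothetical observed values
    forecast = [2.5, 5.0, 8.0, 12.0]        # hypothetical model predictions
    print(mse(observed, forecast), rmse(observed, forecast))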


Prediction error in regression


• Regressions differing in accuracy of prediction.
• The standard error of the estimate is a measure of the accuracy of predictions.
• Recall that the regression line is the line that minimizes the sum of squared deviations
of prediction (also called the sum of squares error).
Thus, prediction error influences the functioning of analytics and its application areas.
