02 Unit-BDA - Big Data Analytics
UNIT-2
BIG DATA ANALYTICS
A. BHANU PRASAD
Associate Professor, Dept. of CSE
Theory Contents Contd..
THE BIG DATA TECHNOLOGY LANDSCAPE: NoSQL (Not Only SQL),
Hadoop, Introduction to Hadoop, Introducing Hadoop, Why Hadoop?, Why
not RDBMS?, RDBMS versus Hadoop, Distributed Computing Challenges,
History of Hadoop, Hadoop Overview, Use Case of Hadoop, Hadoop
Distributors, HDFS (Hadoop Distributed File System), Processing Data with
Hadoop, Managing Resources and Applications with Hadoop YARN (Yet
Another Resource Negotiator), Interacting with Hadoop Ecosystem.
INTRODUCTION TO MONGODB: What is MongoDB?, Why MongoDB?,
Terms Used in RDBMS and MongoDB, Data Types in MongoDB, MongoDB
Query Language.
INTRODUCTION TO MAPREDUCE PROGRAMMING: Introduction, Mapper,
Reducer, Combiner, Partitioner, Searching, Sorting, Compression
INTRODUCTION TO HIVE: What is Hive?, Hive Architecture, Hive Data
Types, Hive File Format, Hive Query Language (HQL)
INTRODUCTION TO PIG: What is Pig?, The Anatomy of Pig, Pig on Hadoop,
Pig Philosophy, Use Case for Pig: ETL Processing, Pig Latin Overview, Data
Types in Pig, Running Pig, Execution Modes of Pig, HDFS Commands.
UNIT – 2 CONTENTS
2. BIG DATA ANALYTICS
2.1 Where do we Begin?
2.2 What is Big Data Analytics?
2.3 What Big Data Analytics Isn’t?
2.4 Why this Sudden Hype Around Big Data Analytics?
2.5 Classification of Analytics
2.6 Greatest Challenges that Prevent Businesses from Capitalizing
on Big Data
2.7 Top Challenges Facing Big Data
2.8 Why is Big Data Analytics Important?
2.9 What Kind of Technologies are we looking Toward to Help
Meet the Challenges Posed by Big Data?
2.10 Terminologies Used in Big Data Environments
2.11 Basically Available Soft State Eventual Consistency (BASE)
2.12 Few Top Analytics Tools.
BOOKS
TEXT BOOKS:
1. Big Data and Analytics
Seema Acharya, Subhashini Chellappan
2nd Edition, Wiley India.
REFERENCE BOOKS:
2. Big Data Now
O'Reilly Media, 2nd Edition, 2012
2.4 Why this Sudden Hype Around Big Data
Analytics?
Why this sudden hype? Let us put it down to three foremost reasons:
1) Data is growing at a 40% compound annual rate and is projected to
reach nearly 45 ZB by 2020. In 2010, about 1.2 trillion gigabytes of
data were generated. This amount doubled to 2.4 trillion gigabytes in
2012 and to about 5 trillion gigabytes in 2014. The volume of business
data worldwide is expected to double every 1.2 years. Every day, 2.5
quintillion bytes of data are created, and 90% of the world’s data was
created in the past 2 years alone.
2) The cost per gigabyte of storage has dropped sharply.
3) An overwhelming number of user-friendly analytics tools are
available in the market today.
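The growth figures in reason 1 can be checked with simple compound-growth arithmetic. The sketch below is only an illustration; the function names are hypothetical, and the 40% rate and the 1.2 trillion gigabyte starting point are the figures quoted above. At a 40% annual rate the doubling time comes out at roughly two years, which is in line with the 2010, 2012, and 2014 volumes listed.

```python
import math


def doubling_time_years(annual_rate: float) -> float:
    """Years needed for a quantity to double at a compound annual rate."""
    return math.log(2) / math.log(1 + annual_rate)


def project(start: float, annual_rate: float, years: int) -> float:
    """Value of `start` after compounding at `annual_rate` for `years` years."""
    return start * (1 + annual_rate) ** years


if __name__ == "__main__":
    rate = 0.40  # the 40% compound annual rate quoted in the text
    print(f"Doubling time: {doubling_time_years(rate):.2f} years")
    # Start from the 1.2 trillion GB quoted for 2010.
    for year in (2012, 2014):
        volume = project(1.2, rate, year - 2010)
        print(f"{year}: about {volume:.2f} trillion GB")
```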
2.7 Top Challenges Facing Big Data
Following are the top challenges facing big data:
1) Scale: Storage (RDBMS or NoSQL) is one major concern that needs to be
addressed to handle the need for scaling rapidly and elastically. The need of
the hour is storage that can best withstand the onslaught of the large volume,
velocity, and variety of big data. Should you scale vertically or should you
scale horizontally?
2) Security: Most of the NoSQL big data platforms have poor security
mechanisms (lack of proper authentication and authorization mechanisms)
when it comes to safeguarding big data.
3) Schema: Rigid schemas have no place. We want the technology to be able to
fit our big data and not the other way around. The need of the hour is a
dynamic schema. Static (pre-defined) schemas are outdated.
4) Continuous availability: The big question here is how to provide 24/7
support because almost all RDBMS and NoSQL big data platforms have a
certain amount of downtime built in.
5) Consistency: Should one opt for consistency or eventual consistency?
6) Partition tolerance: How to build partition-tolerant systems that can take
care of both hardware and software failures?
7) Data quality: How to maintain data quality – data accuracy, completeness,
timeliness, etc.? Do we have appropriate metadata in place?
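The dynamic schema idea from challenge 3 can be sketched in a few lines. This is a minimal illustration, not how any particular NoSQL store works internally: the records and the helper names `observed_fields` and `project_field` are hypothetical. Each record carries only the fields it actually has, and the "schema" is discovered from the data rather than declared up front.

```python
from typing import Any

# Hypothetical records: no two of them need to share the same set of fields.
records: list[dict[str, Any]] = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ravi", "phone": "555-0100"},
    {"id": 3, "name": "Meena", "email": "meena@example.com", "tags": ["vip"]},
]


def observed_fields(rows: list[dict[str, Any]]) -> list[str]:
    """Return the sorted union of field names seen across all records."""
    fields: set[str] = set()
    for row in rows:
        fields.update(row)
    return sorted(fields)


def project_field(rows: list[dict[str, Any]], field: str, default: Any = None) -> list[Any]:
    """Read one field from every record, tolerating records that lack it."""
    return [row.get(field, default) for row in rows]


if __name__ == "__main__":
    print(observed_fields(records))
    print(project_field(records, "phone"))
```

With a static schema, adding the `tags` field would first require altering the table definition; here a new field simply appears in the records that need it.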
2.8 Why is Big Data Analytics Important?
Following are the various approaches to the analysis of data and what
each leads to.
1) Reactive – Business Intelligence: Business Intelligence (BI) allows
the businesses to make faster and better decisions by providing the
right information to the right person at the right time in the right
format. It is about analysis of the past or historical data and then
displaying the findings of the analysis or reports in the form of
enterprise dashboards, alerts, notifications, etc.
2) Reactive – Big Data Analytics: Here the analysis is done on huge
datasets but the approach is still reactive as it is still based on static
data.
3) Proactive – Analytics: This is to support futuristic decision making by
the use of data mining, predictive modeling, text mining, and
statistical analysis. Because this analysis still relies on traditional
database management practices, it has severe limitations on storage
capacity and processing capability.
4) Proactive – Big Data Analytics: This is sifting through terabytes of
information to filter out the relevant data to analyze. This also
includes high-performance analytics to gain rapid insights from big
data and the ability to solve complex problems using more data.
2.9 What Kind of Technologies are we looking Toward to
Help Meet the Challenges Posed by Big Data?
1) The first requirement is cheap and abundant storage.
2) We need faster processors to help with quicker processing of big data.
3) Affordable open-source, distributed big data platforms, such as
Hadoop.
4) Parallel processing, clustering, virtualization, large grid
environments (to distribute processing to a number of machines),
high connectivity, and high throughputs rather than low latency.
5) Cloud computing and other flexible resource allocation
arrangements.
2.10 Terminologies Used in Big Data Environments
1) In-Memory Analytics: Data access from non-volatile storage such as a
hard disk is a slow process. In in-memory analytics, all the relevant
data is stored in Random Access Memory (RAM), or primary storage, thus
eliminating the need to access the data from the hard disk. The
advantages are faster access, rapid deployment, better insights, and
minimal IT involvement.
2) In-Database Processing (also called in-database analytics): works by
blending data warehouses with analytical systems. With in-database
processing, the database program itself can run the computations,
eliminating the need to extract, transform, and load the data into a
data warehouse, thereby saving time.
3) Symmetric Multiprocessor System (SMP): In SMP, there is a single
common main memory that is shared by two or more identical
processors. The processors have full access to all I/O devices and are
controlled by a single operating system instance. SMPs are tightly
coupled multiprocessor systems. Each processor has its own high-speed
memory, called cache memory, and the processors are connected using a
system bus.
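The in-database processing idea from item 2 above can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions: the `sales` table, its rows, and the function names are made up, and `:memory:` is used purely to keep the example self-contained. The first function lets the database engine run the computation so that only the result leaves the database; the second pulls every row out and aggregates in application code.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a small throwaway database with a hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 100.0), ("south", 250.0), ("north", 50.0), ("south", 25.0)],
    )
    conn.commit()
    return conn


def totals_in_database(conn: sqlite3.Connection) -> dict:
    """The database engine computes the aggregate; only results are returned."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(cursor.fetchall())


def totals_after_export(conn: sqlite3.Connection) -> dict:
    """Contrast: export every row first, then aggregate outside the database."""
    totals: dict = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


if __name__ == "__main__":
    conn = build_demo_db()
    print(totals_in_database(conn))
    print(totals_after_export(conn))
```

Both functions give the same totals; the difference is how much data has to move before the answer is known, which is what matters at big data volumes.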
4) Massively Parallel Processing (MPP): refers to the coordinated
processing of programs by a number of processors working in parallel.
Each processor has its own operating system and dedicated memory.
The processors work on different parts of the same program, and all
the executing segments can communicate with each other.
5) Difference Between Parallel and Distributed Systems: A parallel
database system is a tightly coupled system in which the processors
co-operate for query processing. The user is unaware of the
parallelism since the user has no access to a specific processor of
the system. Either the processors have access to a common memory or
they make use of message passing for communication. Distributed
database systems are known to be loosely coupled and are composed
of individual machines, each of which can run its own applications
and serve its own users. The data is usually distributed across
several machines, thereby necessitating quite a number of machines
to be accessed to answer a user query.
6) Shared Nothing Architecture: The three most common types of
architecture for multiprocessor high transaction rate systems are:
1. Shared Memory (SM) architecture: a common central memory is
shared by multiple processors
2. Shared Disk (SD) architecture: multiple processors share a
common collection of disks while having their own private
memory.
3. Shared Nothing (SN) architecture: neither memory nor disk is
shared among multiple processors.
Advantages of a “Shared Nothing Architecture”
1. Fault Isolation: A fault in a single node is contained and
confined to that node exclusively and exposed only through
messages (or the lack of them).
2. Scalability: Assume that the disk is a shared resource in which
different nodes will have to take turns to access the critical
data. This imposes a limit on how many nodes can be added to
the distributed shared disk system, thus compromising on
scalability. A shared nothing architecture has no such shared
resource, so nodes can be added without this contention.
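The shared nothing idea and its fault isolation property can be sketched as follows. This is a toy model under stated assumptions, not a real cluster: the `Node` and `Cluster` classes are hypothetical, keys are assigned to nodes by a simple CRC32 hash, and a failed node is modelled as one that never replies. Each node keeps private storage, and other nodes learn about a failure only through a missing reply.

```python
import zlib


class Node:
    """A node with its own private storage; nothing is shared with other nodes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: dict = {}  # private to this node
        self.alive = True

    def handle(self, op: str, key: str, value=None):
        """Message-style interface. A failed node does not reply (returns None)."""
        if not self.alive:
            return None
        if op == "put":
            self.store[key] = value
            return ("ok", None)
        return ("ok", self.store.get(key))


class Cluster:
    """Routes each key to exactly one owning node (hash partitioning)."""

    def __init__(self, names: list) -> None:
        self.nodes = [Node(n) for n in names]

    def owner(self, key: str) -> Node:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        return self.nodes[zlib.crc32(key.encode()) % len(self.nodes)]

    def put(self, key: str, value):
        return self.owner(key).handle("put", key, value)

    def get(self, key: str):
        return self.owner(key).handle("get", key)


if __name__ == "__main__":
    cluster = Cluster(["n0", "n1", "n2"])
    for i in range(6):
        cluster.put(f"user{i}", i)
    cluster.nodes[0].alive = False  # simulate a fault in one node
    for i in range(6):
        print(f"user{i}", cluster.get(f"user{i}"))
```

Only the keys owned by the failed node stop answering; every other key is still served, because no node depends on another node's memory or disk.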
7) CAP Theorem (Brewer’s Theorem): states that in a distributed
computing environment (a collection of interconnected nodes that
share data), it is impossible to provide all three of the following
guarantees simultaneously. At least one must be sacrificed.
1. Consistency implies that every read fetches the last write.
2. Availability implies that reads and writes always succeed.
3. Partition tolerance implies that the system will continue to
function when a network partition occurs.
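The trade-off in the CAP theorem can be made concrete with a toy two-replica store. This is a minimal sketch under stated assumptions: the `TwoNodeStore` class and its "CP" and "AP" modes are hypothetical names for illustration, and a network partition is modelled as a simple flag. During a partition, the "CP" store refuses writes to stay consistent (giving up availability), while the "AP" store accepts the write locally and lets the replicas diverge (giving up consistency).

```python
class Replica:
    """One copy of a single data item."""

    def __init__(self) -> None:
        self.value = None


class TwoNodeStore:
    """Toy store with two replicas and a switchable network partition."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False

    def write(self, replica: Replica, value) -> bool:
        """Write through one replica; return True if the write was accepted."""
        other = self.b if replica is self.a else self.a
        if not self.partitioned:
            replica.value = value
            other.value = value  # normal case: the write reaches both copies
            return True
        if self.mode == "CP":
            return False  # refuse the write: consistent but not available
        replica.value = value  # accept locally: available but now inconsistent
        return True

    def read(self, replica: Replica):
        return replica.value


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        store = TwoNodeStore(mode)
        store.write(store.a, "v1")
        store.partitioned = True
        accepted = store.write(store.a, "v2")
        print(mode, accepted, store.read(store.a), store.read(store.b))
```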
2.11 Basically Available Soft State Eventual
Consistency (BASE)
A few basic questions to start with:
1) Where is it used? In distributed computing.
2) Why is it used? To achieve high availability.
3) How is it achieved? If no new updates are made to the given data item
for a stipulated period of time, eventually all accesses to this data item
will return the last updated value.
4) What is replica convergence? A system that has achieved eventual
consistency is said to have converged or achieved replica convergence.
5) Conflict resolution: How is the conflict resolved?
(a) Read repair: If the read leads to discrepancy or inconsistency, a
correction is initiated. It slows down the read operation.
(b) Write repair: If the write leads to discrepancy or inconsistency, a
correction is initiated. This will cause the write operation to slow
down.
(c) Asynchronous repair: The correction is not part of a read or write
operation.
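Eventual consistency, replica convergence, and read repair can be sketched with a few replicas that carry version numbers. This is a simplified illustration, not the mechanism of any specific database: the function names are hypothetical, and resolving conflicts by "highest version wins" is an assumption made for the example. A write reaches only some replicas, the copies disagree for a while, and a later read detects the stale copies and corrects them.

```python
class Replica:
    """One copy of a data item, tagged with a version number."""

    def __init__(self) -> None:
        self.version = 0
        self.value = None


def write(replicas: list, reachable: list, value) -> None:
    """Apply a write only to the replicas whose indexes are in `reachable`."""
    new_version = max(r.version for r in replicas) + 1
    for i in reachable:
        replicas[i].version = new_version
        replicas[i].value = value


def read_with_repair(replicas: list):
    """Return the newest value and bring stale replicas up to date (read repair)."""
    newest = max(replicas, key=lambda r: r.version)
    for r in replicas:
        if r.version < newest.version:
            # The extra correction work is why read repair slows the read down.
            r.version = newest.version
            r.value = newest.value
    return newest.value


def converged(replicas: list) -> bool:
    """True when every replica holds the same version and value."""
    return len({(r.version, r.value) for r in replicas}) == 1


if __name__ == "__main__":
    replicas = [Replica() for _ in range(3)]
    write(replicas, [0, 1, 2], "v1")
    write(replicas, [0], "v2")  # only one replica sees the update
    print("converged before read:", converged(replicas))
    print("read returns:", read_with_repair(replicas))
    print("converged after read:", converged(replicas))
```

An asynchronous repair would run the same correction loop in the background rather than inside the read or write path.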
2.12 Few Top Analytics Tools.
Below is a list of a few top analytics tools.
1. MS Excel
2. SAS
3. IBM SPSS Modeler
4. Statistica
5. Salford Systems
6. World Programming System (WPS)
Open Source Analytics Tools
1. R analytics
2. Weka