02 Unit-BDA - Big Data Analytics

The document outlines the curriculum for a Big Data Analytics course, detailing its content, including the introduction to big data, analytics, technology landscape, and tools like Hadoop and MongoDB. It discusses the importance of big data analytics, challenges businesses face, and classifications of analytics. Additionally, it highlights the need for technologies to address big data challenges and provides references for further reading.


B.TECH CSE III Year I Semester


2020 – 2021
VCE-R18 (Integrated Course)
BIG DATA ANALYTICS (A4513)

UNIT-2
BIG DATA ANALYTICS

A. BHANU PRASAD
Associate Professor, Dept. of CSE

VARDHAMAN COLLEGE OF ENGINEERING


(AUTONOMOUS)
Shamshabad – 501218, Hyderabad, AP
Course Theory Contents
INTRODUCTION TO BIG DATA: Classification of Digital Data, Characteristics
of Data, Evolution of Big Data, Definition of Big Data, Challenges with Big
Data, What is Big Data?, Other Characteristics of Data Which are not
Definitional Traits of Big Data, Why Big Data?
Are We Just an Information Consumer or Do we also Produce Information?,
Traditional Business Intelligence (BI) versus Big Data, A Typical Data
Warehouse Environment, A Typical Hadoop Environment, What is New
Today?, What is changing in the Realms of Big Data?

BIG DATA ANALYTICS: Where do we Begin?, What is Big Data
Analytics?, What Big Data Analytics Isn’t?, Why this Sudden Hype
Around Big Data Analytics?, Classification of Analytics, Greatest
Challenges that Prevent Businesses from Capitalizing on Big Data, Top
Challenges Facing Big Data,
Why is Big Data Analytics Important?, What Kind of Technologies are
we looking Toward to Help Meet the Challenges Posed by Big Data?,
Terminologies Used in Big Data Environments, Basically Available Soft
State Eventual Consistency (BASE), Few Top Analytics Tools.

Theory Contents Contd..
THE BIG DATA TECHNOLOGY LANDSCAPE: NoSQL (Not Only SQL),
Hadoop, Introduction to Hadoop, Introducing Hadoop, Why Hadoop?, Why
not RDBMS?, RDBMS versus Hadoop, Distributed Computing Challenges,
History of Hadoop, Hadoop Overview, Use Case of Hadoop, Hadoop
Distributors, HDFS (Hadoop Distributed File System), Processing Data with
Hadoop, Managing Resources and Applications with Hadoop YARN (Yet
another Resource Negotiator), Interacting with Hadoop Ecosystem.
INTRODUCTION TO MONGODB: What is MongoDB?, Why MongoDB?,
Terms Used in RDBMS and MongoDB, Data Types in MongoDB, MongoDB
Query Language.
INTRODUCTION TO MAPREDUCE PROGRAMMING: Introduction, Mapper,
Reducer, Combiner, Partitioner, Searching, Sorting, Compression
INTRODUCTION TO HIVE: What is Hive?, Hive Architecture, Hive Data
Types, Hive File Format, Hive Query Language (HQL)
INTRODUCTION TO PIG: What is Pig?, The Anatomy of Pig, Pig on Hadoop,
Pig Philosophy, Use Case for Pig: ETL Processing, Pig Latin Overview, Data
Types in Pig, Running Pig, Execution Modes of Pig, HDFS Commands.

UNIT – 2 CONTENTS
2. BIG DATA ANALYTICS
2.1 Where do we Begin?
2.2 What is Big Data Analytics?
2.3 What Big Data Analytics Isn’t?
2.4 Why this Sudden Hype Around Big Data Analytics?
2.5 Classification of Analytics
2.6 Greatest Challenges that Prevent Businesses from Capitalizing
on Big Data
2.7 Top Challenges Facing Big Data
2.8 Why is Big Data Analytics Important?
2.9 What Kind of Technologies are we looking Toward to Help
Meet the Challenges Posed by Big Data?
2.10 Terminologies Used in Big Data Environments
2.11 Basically Available Soft State Eventual Consistency (BASE)
2.12 Few Top Analytics Tools.
BOOKS
TEXT BOOKS:
1. Big Data and Analytics
Seema Acharya, Subhashini Chellappan
2nd Edition, Wiley India.

REFERENCE BOOKS:
2. Big Data Now
O'Reilly Media, 2nd Edition, 2012

3. Big Data: A Revolution That Will Transform
How We Live, Work, and Think
Viktor Mayer-Schonberger, Kenneth Cukier,
Mariner Books, 2014
2. BIG DATA ANALYTICS
2.1 Where do we Begin?
• Raw data is collected, classified, and organized.
• Associating it with adequate metadata and laying bare the context
converts this data into meaningful information.
• It is then aggregated and summarized so that it becomes easy to
consume for analysis.
• Gradual accumulation of such meaningful information builds a
knowledge repository. This, in turn, yields actionable insights
that prove useful for decision making. Refer Figure 2.1.

Fig 2.1: Transformation of data to yield actionable insights.
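
The steps above (collect, classify, aggregate, summarize, then act) can be sketched in a few lines of Python. This is only an illustration: the purchase events, the field layout, and the `summarize` helper are hypothetical and not part of the course material.

```python
from collections import defaultdict

# Hypothetical raw data: (region, product, amount) purchase events.
raw_events = [
    ("south", "laptop", 900.0),
    ("north", "phone", 400.0),
    ("south", "phone", 450.0),
    ("north", "laptop", 1000.0),
    ("south", "laptop", 1100.0),
]


def summarize(events):
    """Classify events by region, then aggregate count, total, and mean."""
    grouped = defaultdict(list)
    for region, _product, amount in events:
        grouped[region].append(amount)
    return {
        region: {
            "orders": len(amounts),
            "total": sum(amounts),
            "mean": sum(amounts) / len(amounts),
        }
        for region, amounts in grouped.items()
    }


# Information: a per-region summary that is easy to consume.
summary = summarize(raw_events)

# A (very simple) actionable insight: which region brings in the most revenue?
best_region = max(summary, key=lambda r: summary[r]["total"])
```

Here the raw tuples are the data, `summary` is the aggregated information, and `best_region` stands in for an insight a decision maker could act on.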


Where do we Begin? Contd..
• Organizations have realized that they cannot ignore big
data if they want to stay competitive and make timely
decisions that make the most of fleeting opportunities.
• They will have to analyze at scale and take into account the big
data that reaches the organization at unprecedented levels
of volume, velocity, and variety.
• Big data analytics is the process of examining big data to uncover
patterns, unearth trends, and find unknown correlations and other
useful information to make faster and better decisions.
• Analytics begins with analyzing all available data. Refer Figure 2.2.

Fig 2.2: Types of unstructured data available for analysis.


2.2 What is Big Data Analytics?
Big Data Analytics is:
1) Technology-enabled analytics: Quite a few data analytics and
visualization tools are available in the market today from leading
vendors such as IBM, Tableau, SAS, R Analytics, Statistica, etc. to help
process and analyze your big data.
2) About gaining a meaningful, deeper, and richer insight into your
business to steer it in the right direction, understanding the
customer’s demographics to cross-sell and up-sell to them, better
leveraging the services of your vendors and suppliers, etc.
3) About a competitive edge over your competitors by enabling you with
findings that allow quicker and better decision-making.
4) A tight handshake between three communities: IT, business users, and
data scientists.
5) Working with datasets whose volume and variety exceed the current
storage, processing capabilities and infrastructure of your enterprise.
6) About moving code to data. This makes perfect sense as the program
for distributed processing is tiny (a few KBs) compared to the data
(TBs/PBs/ZBs).
2.3 What Big Data Analytics Isn’t?
• Big data isn’t only about volume; variety and velocity are also
very important factors.
• Big data isn’t just about technology. It is about understanding what
the data is saying to us. It is about understanding relationships that we
never thought existed between datasets. It is about patterns and trends
waiting to be unveiled.
• And of course, big data analytics is not here to replace our now very
robust and powerful Relational Database Management System
(RDBMS) or our traditional Data Warehouse. It is here to coexist with
both RDBMS and Data Warehouse, leveraging the power of each to
yield business value.
• Big data analytics is not a “one-size-fits-all” traditional RDBMS built on
shared disk and memory.
• It is not only for huge online companies like Google or Amazon,
but for any business in any industry that needs actionable insights
out of its data (both internal and external).

2.4 Why this Sudden Hype Around Big Data
Analytics?
• Why this sudden hype? Let us put it down to three foremost reasons:
1) Data is growing at a 40% compound annual rate, reaching nearly 45
ZB by 2020. In 2010, about 1.2 trillion gigabytes of data were
generated. This amount doubled to 2.4 trillion gigabytes in 2012 and to
about 5 trillion gigabytes in 2014. The volume of business
data worldwide is expected to double every 1.2 years. Every day 2.5
quintillion bytes of data are created, with 90% of the world’s data
created in the past 2 years alone.
2) Cost per gigabyte of storage has
hugely dropped.
3) There are an overwhelming number
of user-friendly analytics tools
available in the market today.

Fig 2.1: What big data entails?
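
The growth figures in reason 1 can be checked with simple compound-growth arithmetic. The sketch below uses only the numbers quoted on the slide; the helper name `project` is made up for illustration, and 1 trillion gigabytes is taken as 1 ZB (decimal units).

```python
# Annual growth rate implied by "doubling every two years"
# (1.2 -> 2.4 -> ~5 trillion GB between 2010, 2012, and 2014).
doubling_period_years = 2
implied_rate = 2 ** (1 / doubling_period_years) - 1  # roughly 0.41, i.e. about 41% per year


def project(volume, rate, years):
    """Compound growth: volume * (1 + rate) ** years."""
    return volume * (1 + rate) ** years


volume_2010 = 1.2  # trillion GB, roughly 1.2 ZB
projected_2020 = project(volume_2010, 0.40, 10)  # about 34.7 trillion GB
```

Doubling every two years corresponds to roughly 41% per year, which agrees with the quoted 40% rate. Compounding 1.2 trillion GB at 40% for ten years gives roughly 35 trillion GB, somewhat below the "nearly 45 ZB" figure, so the quoted numbers are best read as rough estimates rather than values derived from one another.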


2.5 Classification of Analytics
• There are basically two schools of thought:
1) Those that classify analytics into basic, operationalized, advanced, and
monetized.
2) Those that classify analytics into analytics 1.0, analytics 2.0, and
analytics 3.0.
First School of Thought:
1) Basic analytics: This primarily is slicing and dicing of data to help with
basic business insights. This is about reporting on historical data,
basic visualization, etc.
2) Operationalized analytics: It is operationalized analytics if it gets
woven into the enterprise’s business processes.
3) Advanced analytics: This largely is about forecasting for the future by
way of predictive and prescriptive modeling.
4) Monetized analytics: This is analytics in use to derive direct business
revenue.
Classification of Analytics Contd..
Second School of Thought:
1) Let us take a closer look at analytics 1.0, analytics 2.0, and analytics
3.0. Refer Table 3.1.
| Analytics 1.0 | Analytics 2.0 | Analytics 3.0 |
|---|---|---|
| mid 1950s to 2009 | 2005 to 2012 | 2012 to present |
| Descriptive statistics (report on events, occurrences, etc. of the past) | Descriptive statistics + predictive statistics (use data from the past to make predictions for the future) | Descriptive + predictive + prescriptive statistics (use data from the past to make prophecies for the future and make recommendations) |
| Key questions asked: What happened? Why did it happen? | Key questions asked: What will happen? Why will it happen? | Key questions asked: What will happen? When will it happen? Why will it happen? What should be the action taken to take advantage of what will happen? |
Classification of Analytics Contd..
| Analytics 1.0 | Analytics 2.0 | Analytics 3.0 |
|---|---|---|
| Data from legacy systems, ERP, CRM, and 3rd party applications. | Big data | A blend of big data and data from legacy systems, ERP, CRM, and 3rd party applications. |
| Small and structured data sources. Data stored in enterprise data warehouses or data marts. | Big data is being taken up seriously. Data is mainly unstructured, arriving at a much higher pace. | A blend of big data and traditional analytics to yield insights and offerings with speed and impact. |
| Data was internally sourced. | Data was often externally sourced. | Data is both being internally and externally sourced. |
| Relational databases | Database appliances, Hadoop clusters, SQL to Hadoop environments, etc. | In memory analytics, in database processing, agile analytical methods, machine learning techniques, etc. |
2.6 Greatest Challenges that Prevent
Businesses from Capitalizing on Big Data
1) Obtaining executive sponsorships for investments in big data and its
related activities (such as training, etc.).
2) Getting the business units to share information across organizational
silos.
3) Finding the right skills (business analysts and data scientists) that can
manage large amounts of structured, semi-structured, and
unstructured data and create insights from it.
4) Determining the approach to scale rapidly and elastically. In other
words, the need to address the storage and processing of large volume,
velocity, and variety of big data.
5) Deciding whether to use structured or unstructured, internal or
external data to make business decisions.
6) Choosing the optimal way to report findings and analysis of big data
(visual presentation and analytics) for the presentations to make the
most sense.
7) Determining what to do with the insights created from big data.

2.7 Top Challenges Facing Big Data
• Following are the top challenges of big data:
1) Scale: Storage (RDBMS or NoSQL) is one major concern that needs to be
addressed to handle the need for scaling rapidly and elastically. The need of
the hour is storage that can best withstand the onslaught of the large volume,
velocity, and variety of big data. Should you scale vertically or should you
scale horizontally?
2) Security: Most of the NoSQL big data platforms have poor security
mechanisms (lack of proper authentication and authorization mechanisms)
when it comes to safeguarding big data.
3) Schema: Rigid schemas have no place. We want the technology to be able to
fit our big data and not the other way around. The need of the hour is
dynamic schema. Static (pre-defined) schemas are outdated.
4) Continuous availability: The big question here is how to provide 24/7
support because almost all RDBMS and NoSQL big data platforms have a
certain amount of downtime built in.
5) Consistency: Should one opt for consistency or eventual consistency?
6) Partition tolerance: How to build partition tolerant systems that can take care
of both hardware and software failures?
7) Data quality: How to maintain data quality – data accuracy, completeness,
timeliness, etc.? Do we have appropriate metadata in place?
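
To make the schema challenge (item 3) concrete, here is a small Python sketch of what "dynamic schema" means in practice: each record carries its own fields, and queries must tolerate fields that are absent. The records and the helpers `find` and `fields_in_use` are hypothetical and do not correspond to any particular NoSQL product's API.

```python
# Hypothetical records in a schema-less store: each document has its own fields.
documents = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ravi", "phone": "555-0100", "tags": ["premium"]},
    {"id": 3, "name": "Mei", "email": "mei@example.com", "tags": []},
]


def find(docs, **criteria):
    """Return documents matching all criteria; a missing field never matches."""
    missing = object()  # sentinel that is unequal to any real value
    return [
        d for d in docs
        if all(d.get(key, missing) == value for key, value in criteria.items())
    ]


def fields_in_use(docs):
    """Discover the schema from the data instead of declaring it up front."""
    return sorted({key for d in docs for key in d})


with_email = [d["name"] for d in documents if "email" in d]
ravi = find(documents, name="Ravi")
```

A new field such as `phone` can appear in one record without altering the others, which is the flexibility a static, pre-defined schema does not offer. The price is that every reader has to handle missing fields explicitly.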
2.8 Why is Big Data Analytics Important?
• Following are the various approaches to analysis of data and what it
leads to.
1) Reactive – Business Intelligence: Business Intelligence (BI) allows
the businesses to make faster and better decisions by providing the
right information to the right person at the right time in the right
format. It is about analysis of the past or historical data and then
displaying the findings of the analysis or reports in the form of
enterprise dashboards, alerts, notifications, etc.
2) Reactive – Big Data Analytics: Here the analysis is done on huge
datasets but the approach is still reactive as it is still based on static
data.
3) Proactive – Analytics: This is to support futuristic decision making by
the use of data mining, predictive modeling, text mining, and
statistical analysis. This analysis has severe limitations on the storage
capacity and the processing capability.
4) Proactive – Big Data Analytics: This is sifting through terabytes of
information to filter out the relevant data to analyze. This also
includes high performance analytics to gain rapid insights from big
data and the ability to solve complex problems using more data.
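
The difference between a reactive and a proactive approach can be illustrated on a toy series. The sketch below is an assumption-laden example, not the course's method: the monthly sales figures are invented, and a straight-line least-squares fit stands in for the far richer predictive models used in practice.

```python
# Hypothetical monthly sales figures (units sold), oldest first.
sales = [100, 110, 120, 130, 140, 150]


def historical_report(values):
    """Reactive view: summarize what has already happened."""
    return {"total": sum(values), "average": sum(values) / len(values)}


def linear_forecast(values, steps_ahead=1):
    """Proactive view: fit y = a + b*x by least squares and extrapolate."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


report = historical_report(sales)      # what happened
next_month = linear_forecast(sales)    # what is likely to happen next
```

`historical_report` answers "what happened?", while `linear_forecast` attempts "what will happen?". Applying the same ideas to datasets too large for a single machine is where the big data variants of each approach come in.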
2.9 What Kind of Technologies are we looking Toward to
Help Meet the Challenges Posed by Big Data?
1) The first requirement is of cheap and abundant storage.
2) We need faster processors to help with quicker processing of big data.
3) Affordable open-source, distributed big data platforms, such as
Hadoop.
4) Parallel processing, clustering, virtualization, large grid
environments (to distribute processing to a number of machines),
high connectivity, and high throughputs rather than low latency.
5) Cloud computing and other flexible resource allocation
arrangements.

2.10 Terminologies Used in Big Data Environments
1) In-Memory Analytics: Data access from non-volatile storage such as
hard disk is a slow process. With in-memory analytics, all the relevant
data is stored in Random Access Memory (RAM) or primary storage, thus
eliminating the need to access the data from hard disk. The advantages
are faster access, rapid deployment, better insights, and minimal IT
involvement.
2) In-Database Processing (analytics): works by blending data
warehouses with analytical systems. With in-database processing, the
database program itself can run the computations, eliminating the
need for Extraction, Transformation and Loading (ETL) of data into a
data warehouse and thereby saving time.
3) Symmetric Multiprocessor System (SMP): In SMP, there is a single
common main memory that is shared by two or more identical
processors. The processors have full access to all I/O devices and are
controlled by a single operating system instance. SMP systems are tightly
coupled multiprocessor systems. Each processor has its own high-speed
cache memory, and the processors are connected using a
system bus.
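
As a rough illustration of in-database processing (item 2), the sketch below uses Python's standard `sqlite3` module with an in-memory database. The `sales` table and its rows are invented for the example. It contrasts pulling every row into the application with letting the database engine run the aggregation itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # small throwaway database held in RAM
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 400.0), ("south", 900.0), ("north", 1000.0), ("south", 450.0)],
)

# Extract-then-compute: every row leaves the database and is summed in the application.
totals_app = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals_app[region] = totals_app.get(region, 0.0) + amount

# In-database: the engine runs the aggregation and returns only the small result.
totals_db = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()
```

Both approaches give the same totals, but the second moves only one row per region out of the database. With four rows the saving is invisible; with billions of rows, avoiding the data movement is the point.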

Terminologies in Big Data Contd..
4) Massive Parallel Processing (MPP): refers to the coordinated
processing of programs by a number of processors working in parallel.
The processors each have their own operating system and dedicated
memory. They work on different parts of the same program, and all
the executing segments can communicate with each other.
5) Difference Between Parallel and Distributed Systems: A parallel
database system is a tightly coupled system in which the processors
co-operate for query processing. The user is unaware of the
parallelism since he/she has no access to a specific processor of the
system. The processors either have access to a common memory or
make use of message passing for communication. Distributed
database systems are known to be loosely coupled and are composed
of individual machines that can each run their own applications and
serve their own users. The data is usually distributed across
several machines, so several machines may need
to be accessed to answer a user query.

Terminologies in Big Data Contd..
6) Shared Nothing Architecture: The three most common types of
architecture for multiprocessor high transaction rate systems are:
1. Shared Memory (SM) architecture: a common central memory is
shared by multiple processors
2. Shared Disk (SD) architecture: multiple processors share a
common collection of disks while having their own private
memory.
3. Shared Nothing (SN) architecture: neither memory nor disk is
shared among multiple processors.
Advantages of a “Shared Nothing Architecture”
1. Fault Isolation: A fault in a single node is contained and
confined to that node exclusively and exposed only through
messages (or the lack of them).
2. Scalability: If the disk were a shared resource, different
nodes would have to take turns to access the critical
data. This imposes a limit on how many nodes can be added to
a distributed shared disk system, thus compromising
scalability. A shared nothing architecture has no such shared
resource to contend for.
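
A shared nothing design can be mimicked in a single Python process to show the idea. In this sketch, which is a toy and not a real cluster, each `Node` object keeps private storage and is reached only through messages. The class, the message names, and the use of `zlib.crc32` to pick an owner are all assumptions made for illustration.

```python
import zlib
from collections import Counter


class Node:
    """A node that owns its data privately: no shared memory, no shared disk."""

    def __init__(self, name):
        self.name = name
        self._rows = []  # private storage, never touched by other nodes

    def handle(self, message):
        """The only way to interact with a node is to send it a message."""
        kind, payload = message
        if kind == "store":
            self._rows.append(payload)
            return None
        if kind == "count_words":
            return Counter(word for row in self._rows for word in row.split())
        raise ValueError(f"unknown message: {kind}")


def owner(nodes, key):
    """Pick the node responsible for a key (crc32 is stable across runs)."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


nodes = [Node(f"node-{i}") for i in range(3)]
lines = ["big data analytics", "big data platforms", "data at scale"]
for i, line in enumerate(lines):
    owner(nodes, f"line-{i}").handle(("store", line))

# Scatter a small request to every node, then gather and merge the partial results.
total = Counter()
for node in nodes:
    total += node.handle(("count_words", None))
```

Because each node counts only its own rows, adding nodes adds capacity without adding contention for a shared disk. The request sent to each node is tiny compared with the data it holds, which is also the "moving code to data" idea from Section 2.2.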
Terminologies in Big Data Contd..
7) CAP Theorem (Brewer’s Theorem): states that in a distributed
computing environment (a collection of interconnected nodes that
share data), it is impossible to provide all three of the following
guarantees at the same time. At least one must be sacrificed.
1. Consistency implies that every read fetches the last write.
2. Availability implies that reads and writes always succeed.
3. Partition tolerance implies that the system will continue to
function when a network partition occurs.
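
The trade-off can be seen in a toy two-replica store. The sketch below is a deliberately simplified model: the class names and the "CP"/"AP" mode labels are illustrative assumptions, not the behavior of any specific database. During a network partition the store must either refuse the write, giving up availability, or accept it on one side only, giving up consistency.

```python
class Replica:
    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Toy store with two replicas; mode is 'CP' or 'AP' (illustrative labels)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def write(self, replica_index, value):
        if not self.partitioned:
            for replica in self.replicas:  # replicate synchronously
                replica.value = value
            return True
        if self.mode == "CP":
            return False  # refuse the write: stay consistent, lose availability
        self.replicas[replica_index].value = value  # accept locally: stay available
        return True

    def read(self, replica_index):
        return self.replicas[replica_index].value


def demo(mode):
    store = TwoNodeStore(mode)
    store.write(0, "v1")
    store.partitioned = True  # the link between the replicas goes down
    accepted = store.write(0, "v2")
    return accepted, store.read(0), store.read(1)


cp_result = demo("CP")
ap_result = demo("AP")
```

In "CP" mode the second write is rejected and both replicas still agree on `v1`. In "AP" mode the write succeeds, but the two replicas now return different values until the partition heals.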

2.11 Basically Available Soft State Eventual
Consistency (BASE)
• A few basic questions to start with:
1) Where is it used? In distributed computing.
2) Why is it used? To achieve high availability.
3) How is it achieved? If no new updates are made to the given data item
for a stipulated period of time, eventually all accesses to this data item
will return the updated value.
4) What is replica convergence? A system that has achieved eventual
consistency is said to have converged or achieved replica convergence.
5) Conflict resolution: How is the conflict resolved?
(a) Read repair: If the read leads to discrepancy or inconsistency, a
correction is initiated. It slows down the read operation.
(b) Write repair: If the write leads to discrepancy or inconsistency, a
correction is initiated. This will cause the write operation to slow
down.
(c) Asynchronous repair: The correction is not part of a read or write
operation.
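
Eventual consistency and read repair can be sketched with versioned replicas. This is a minimal single-process model under stated assumptions: the `Replica` class, the integer version counter, and the helper names are invented for illustration, and real systems use more elaborate mechanisms to order updates.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None


def write(replicas, reachable, value):
    """Apply a write to the reachable replicas only; the others become stale."""
    new_version = max(r.version for r in replicas) + 1
    for i in reachable:
        replicas[i].version = new_version
        replicas[i].value = value


def read_with_repair(replicas):
    """Return the newest value and bring stale replicas up to date (read repair)."""
    newest = max(replicas, key=lambda r: r.version)
    for r in replicas:
        if r.version < newest.version:  # discrepancy found during the read
            r.version, r.value = newest.version, newest.value
    return newest.value


def converged(replicas):
    """True when every replica holds the same version and value."""
    return len({(r.version, r.value) for r in replicas}) == 1


replicas = [Replica(), Replica(), Replica()]
write(replicas, reachable=[0, 1, 2], value="a")
write(replicas, reachable=[0], value="b")  # replicas 1 and 2 miss this update
before = converged(replicas)
result = read_with_repair(replicas)
after = converged(replicas)
```

After the second write the replicas disagree, so `before` is false. The read returns the newest value and fixes the stale copies along the way, which is the extra work that slows the read down, and afterwards the replicas have converged.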
2.12 Few Top Analytics Tools.
Below is a list of a few top analytics tools.
1. MS Excel
2. SAS
3. IBM SPSS Modeler
4. Statistica
5. Salford Systems
6. World Programming Systems (WPS)
Open Source Analytics Tools
1. R analytics
2. Weka
