Introduction To Big Data

The document provides an introduction to Big Data, discussing its characteristics, evolution, and challenges, including data growth and the need for skilled professionals. It contrasts Big Data with traditional Business Intelligence, highlighting differences in data processing and analysis methods. Additionally, it covers key technologies such as NoSQL and Hadoop, their advantages, and the components of the Hadoop ecosystem.

Uploaded by

abhilashkotian08

Introduction to Big Data

Introduction to big data

● No scarcity of enterprise data
● Organizations are mired in data
○ Why big data?
○ How does big data play a major role?
○ How does it compare with the traditional Business Intelligence (BI) environment?
○ Does it replace traditional relational database management systems and data warehouse components, or complement their existence?
Characteristics of data
● Composition
○ Structure
○ Sources of data
○ Granularity
○ Type and nature of data (static or real-time)
● Condition
○ Can the data be used as is, or is further preprocessing required?
● Context
○ Where was this data generated?
○ Why was it generated?
○ Is it sensitive?
○ What events are associated with it?
Evolution of Big Data
Definition of Big data
Introduction to big data
Challenges with Big Data
● Data is growing at an exponential rate. Will all this data be useful for analytics? Should we work with all the data or a subset of it? How do we separate knowledge from noise?
● Cloud computing and virtualization are here to stay. They provide cost efficiency, elasticity, and easy scaling up or down, but this further complicates the decision to host a big data solution outside the enterprise.
● Deciding on the period of retention of big data.
● There is a dearth of skilled professionals who possess the high level of proficiency in data science that is vital for implementing big data solutions.
● Other challenges: capture, storage, curation, search, analysis, and transfer.
● Data visualization.
Challenges with Big Data
What is Big data

● Big data is data that is high in:

● Volume,
● Velocity, and
● Variety.
Sources of data
Sources of big data
Where is data generated?
Velocity
Variety
Why big data?
Why big data? (...contd)
Traditional BI vs Big data
• In traditional BI, all data resides on a central server. In big data, data resides on distributed servers.
• In traditional BI, data is analyzed in offline mode, whereas in big data, data is analyzed in both real-time and offline modes.
• Traditional BI is about processing structured data and moving data to the processing functions (data to code), whereas big data is about a variety of data and moving the processing functions to the data (code to data).
Typical data warehouse environment
Typical Hadoop environment
Chapter 3: Big Data Analytics
Classification of analytics
First school of thought
Second school of thought
Variety
Why Big data analytics important

Various approaches to the analysis of data and what they lead to:

• Reactive – Business Intelligence: analyzes past and historical data and displays the findings as reports, alerts, and notifications.
• Reactive – Big data analytics: analysis done on huge, static data sets.
• Proactive – Analytics: futuristic decision making using data mining, predictive modeling, text mining, and statistical analysis.
• Proactive – Big data analytics: high-performance analytics to gain rapid insight from big data; uses more data.
Terminologies used in Big data Environment
Key terminologies used in Big data:
In-memory analytics:
• Data access from non-volatile (secondary) storage is a slow process.
• This problem is addressed by pre-processing and storing data (cubes, aggregate tables, query sets) so that the CPU has to fetch only a small subset of the records.
• This requires knowing in advance what data will be required in future.
• The relevant data is kept in RAM so that access to secondary storage is avoided.
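The caching idea above can be sketched in a few lines (a toy, in-process example; the data and keys are hypothetical):

```python
from collections import defaultdict

# Hypothetical sales records; in practice these would live on disk.
sales = [
    {"region": "east", "year": 2023, "amount": 120.0},
    {"region": "east", "year": 2024, "amount": 80.0},
    {"region": "west", "year": 2023, "amount": 200.0},
]

# Pre-process once: build an aggregate "cube" keyed by (region, year)
# and keep it in RAM, so later queries avoid rescanning raw data.
cube = defaultdict(float)
for row in sales:
    cube[(row["region"], row["year"])] += row["amount"]

# A query now reads a small pre-aggregated subset from memory.
print(cube[("east", 2023)])  # 120.0
```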
Terminologies used in Big data Environment

In-database processing:
• It works by fusing the data warehouse with analytical systems.
• Traditionally, data from OLTP systems is stored, after ETL, in an Enterprise Data Warehouse (EDW).
• The huge data sets are then exported to analytical programs for complex and extensive computations.
• With in-database processing, however, the database itself runs the computations, eliminating the need for export and saving time.
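A minimal sketch of the contrast, using SQLite as a stand-in for the warehouse (the table and values are hypothetical):

```python
import sqlite3

# SQLite standing in for an EDW (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("east", 80.0), ("west", 200.0)],
)

# In-database processing: the aggregation runs inside the database
# engine, so only the small result set crosses the boundary --
# no bulk export of raw rows to an external analytics program.
result = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
print(result)  # {'east': 200.0, 'west': 200.0}
```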
Terminologies used in Big data Environment

Symmetric multiprocessor systems (SMP):


• In SMP, a single common main memory is shared by two or more identical processors.
• The processors have full access to all I/O devices and are controlled by a single OS instance.
• Each processor has its own cache memory, and the processors are connected by a system bus.
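The shared-memory model can be mimicked in software with threads, which all live in one address space (a toy illustration, not real SMP hardware):

```python
import threading

counter = 0
lock = threading.Lock()

def work(n):
    # All threads read and write the same variable in one shared
    # address space -- the software analogue of SMP's single main
    # memory shared by all processors.
    global counter
    for _ in range(n):
        with lock:  # coordination is needed, just as caches need coherence
            counter += 1

threads = [threading.Thread(target=work, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000
```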
Terminologies used in Big data Environment

Symmetric multiprocessor systems (SMP):


Terminologies used in Big data Environment

Massively parallel processing:


• MPP refers to the coordinated processing of a program by a number of processors.
• Each processor has its own OS and memory.
• Each processor works on a different part of the same program.
• MPP programs are difficult to write, as the application must be divided in such a way that all executing segments can communicate with each other.
Terminologies used in Big data Environment

Difference between parallel and distributed systems:


Terminologies used in Big data Environment

Difference between parallel and distributed systems:


Shared nothing architecture
• The three most common types of multiprocessor high-transaction systems are:
• Shared Memory (SM)
• Shared Disk (SD)
• Shared Nothing (SN)
Shared nothing architecture
• Advantages of shared nothing architecture:
• Fault isolation
• Scalability
CAP Theorem
• The CAP theorem is also called "Brewer's Theorem".
• It states that in a distributed computing environment it is impossible to provide all three of the following guarantees; at best two of the three can be met, and one must be sacrificed:
• Consistency
• Availability
• Partition tolerance
Big data technology landscape

• The big data technology landscape can be studied mainly through two important technologies:
• NoSQL
• Hadoop
NoSQL (Not Only SQL)
NoSQL (Not Only SQL)
Where it is used?
NoSQL (Not Only SQL)
What is it ?
NoSQL
NoSQL databases are broadly classified into two types:
• Key-value or big-hash table
• Schema-less
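The key-value ("big hash table") model above can be sketched with an in-process class (a toy stand-in for a real store; the names and records are illustrative):

```python
# A minimal sketch of the key-value NoSQL model: a big hash table
# mapping opaque keys to arbitrary, schema-less values.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values are opaque to the store: any structure, no fixed schema.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42", {"name": "Asha", "carts": [1, 7]})
print(store.get("user:42")["name"])  # Asha
```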
NoSQL (Not Only SQL)
NoSQL (Not Only SQL)
NoSQL (Not Only SQL)
Why NoSQL?
Advantages of NoSQL
NoSQL (Not Only SQL)
Use of NoSQL in industry
Vendors of NoSQL
SQL vs NoSQL
NewSQL
What is NewSQL?
Comparison between SQL, NoSQL, NewSQL
Hadoop

• Hadoop uses Google's MapReduce and Google File System technologies as its foundation.
• Hadoop is now a core part of the computing infrastructure for many companies.
Hadoop
Features:
Hadoop
Key advantage of Hadoop:
Hadoop
Hadoop 1.0
Limitations of Hadoop 1.0
Hadoop 2.0
Hadoop 2.0
• Yet Another Resource Negotiator (YARN) is included in Hadoop 2.0.
• Any application capable of dividing itself into parallel tasks is supported by YARN.
• YARN coordinates the allocation of subtasks of the submitted application.
• MapReduce programming expertise is no longer required.
• It supports both batch processing and real-time processing.
• MapReduce is no longer the only data processing option; other native data processing tasks, such as data standardization and master data management, can be performed in HDFS.
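The MapReduce model that Hadoop popularized can be sketched locally as a word count (a single-process illustration of the map, shuffle, and reduce steps, not Hadoop itself):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit (key, value) pairs -- here, (word, 1) per word.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle: group values by key; Reduce: sum each group.
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

lines = ["big data", "big ideas", "data data"]
counts = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(counts)  # {'big': 2, 'data': 3, 'ideas': 1}
```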
Components of Hadoop Ecosystem
Overview of Hadoop ecosystem
HDFS
• HDFS is the distributed storage unit of Hadoop.
• It provides streaming access to file system data as well as file permissions and authentication.
• It is used to scale a single cluster up to hundreds of nodes.
• It handles large datasets.
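HDFS's streaming, block-oriented access pattern can be illustrated with plain Python file I/O (the 8-byte block size is purely illustrative; HDFS blocks default to 128 MB):

```python
import io

BLOCK_SIZE = 8  # illustrative only; HDFS uses much larger blocks

def stream_blocks(reader, block_size=BLOCK_SIZE):
    # Yield the file one fixed-size block at a time instead of
    # loading it whole -- the streaming access HDFS is built around.
    while True:
        block = reader.read(block_size)
        if not block:
            return
        yield block

data = io.BytesIO(b"abcdefghijklmnopqrst")  # 20 bytes of sample data
blocks = list(stream_blocks(data))
print(len(blocks))  # 3 blocks: 8 + 8 + 4 bytes
```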
HDFS vs HBase
Components of data ingestion
Components of data
Vendors of NoSQL
