Big Data Fundamentals
• Big Data refers to extremely large, very fast, highly diverse, and
complex data that cannot be managed by traditional data
management tools.
or
• Big data primarily refers to data sets that are too large or complex
to be dealt with by traditional data-processing application software.
Big Data Applications
Data sources
• Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data
every day, as they have billions of users worldwide.
• Stock exchanges: The New York Stock Exchange generates about one terabyte of new trade
data per day.
• Jet engines: A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many
thousands of flights per day, the data generated reaches many petabytes.
• E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users'
buying trends can be traced.
• Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and
processed to forecast the weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish
their plans accordingly; for this they store the data of their millions of users.
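
The jet engine figure in the list above can be turned into a rough daily estimate. The sketch below is only back-of-the-envelope arithmetic: the rate of 10 TB per 30 minutes comes from the text, while the number of flights, the flight duration, and the one-engine-per-flight simplification are assumptions chosen purely for illustration.

```python
# Rough volume estimate for the jet-engine example (decimal units: 1 PB = 1000 TB).
TB_PER_HALF_HOUR = 10        # from the text: 10+ TB per 30 minutes of flight
FLIGHTS_PER_DAY = 5_000      # assumption: illustrative figure only
FLIGHT_HOURS = 0.5           # assumption: each flight lasts 30 minutes
ENGINES_PER_FLIGHT = 1       # assumption: count one engine per flight

def daily_volume_tb(flights, hours, engines, tb_per_half_hour=TB_PER_HALF_HOUR):
    """Return the estimated data volume per day in terabytes."""
    half_hours = hours / 0.5
    return flights * engines * half_hours * tb_per_half_hour

total_tb = daily_volume_tb(FLIGHTS_PER_DAY, FLIGHT_HOURS, ENGINES_PER_FLIGHT)
total_pb = total_tb / 1000
print(f"{total_tb:,.0f} TB/day = {total_pb:,.0f} PB/day")
```

Under these assumed inputs the estimate is 50 PB per day, which is consistent with the "many petabytes" claim; longer flights or more engines per aircraft only push the figure higher.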
Velocity
Click streams and ad impressions capture user
behavior at millions of events per second
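
Velocity is usually measured as events per unit of time. A minimal sketch of that measurement, assuming each event carries a timestamp in seconds, is to count events in one-second windows. The function name `events_per_second` and the sample timestamps are hypothetical; a real click stream would be counted by a stream processing engine rather than an in-memory list.

```python
from collections import Counter

def events_per_second(timestamps):
    """Count events in one-second tumbling windows.

    timestamps: iterable of non-negative event times in seconds (floats).
    Returns a dict mapping the window start (whole second) to the event count.
    """
    counts = Counter(int(ts) for ts in timestamps)
    return dict(sorted(counts.items()))

# Hypothetical click-stream timestamps, in seconds since the stream started.
clicks = [0.10, 0.45, 0.90, 1.05, 1.50, 2.20, 2.25, 2.70, 2.99]
rates = events_per_second(clicks)
peak = max(rates.values())
print(rates)
print("peak events/second:", peak)
```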
Big Data Architecture
A big data solution typically comprises the following logical
layers:
Data Ingest Layer
Batch Processing Layer
Infrastructure Layer
One needs to understand the data sources and data types, the algorithms that would
work on that data, and the format of the desired outcomes.
The output of this layer could be sent for instant reporting or stored in a NoSQL
database for an on-demand report for the client.
Data Organizing Layer
This layer receives data from both the batch and stream
processing layers.
NoSQL databases are used to organize the data for easy
access.
High-level languages such as HiveQL (Hive's SQL-like language) and
Pig Latin (Pig's dataflow language) can be used to easily
access data and generate reports.
Data Consumption Layer
This layer consumes the output provided by the analysis layers,
directly or through the organizing layer.
The outcome could be standard reports, data analytics, dashboards
and other visualization applications, or recommendation engines,
delivered on mobile and other devices.
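
The hand-off between the layers described so far can be sketched as a chain of plain functions. This is only a toy illustration under stated assumptions: the sample events, the function names, the "large order" rule, and the use of a Python dict in place of a NoSQL database are all hypothetical, and no real big data framework is involved.

```python
from collections import defaultdict

# Hypothetical raw events as they might arrive at the data ingest layer.
RAW_EVENTS = [
    {"user": "a", "item": "book", "amount": 12.0},
    {"user": "b", "item": "pen", "amount": 2.0},
    {"user": "a", "item": "pen", "amount": 2.0},
]

def ingest(events):
    """Ingest layer: accept records and drop any that lack required fields."""
    return [e for e in events if e.keys() >= {"user", "item", "amount"}]

def batch_process(events):
    """Batch processing layer: aggregate total spend per user."""
    totals = defaultdict(float)
    for e in events:
        totals[e["user"]] += e["amount"]
    return dict(totals)

def stream_process(events):
    """Stream processing layer: flag each record individually (assumed rule)."""
    return [{"user": e["user"], "large_order": e["amount"] >= 10.0} for e in events]

def organize(batch_out, stream_out):
    """Organizing layer: merge both outputs into one keyed store."""
    store = {user: {"total": total, "alerts": 0} for user, total in batch_out.items()}
    for rec in stream_out:
        store[rec["user"]]["alerts"] += int(rec["large_order"])
    return store

def consume(store):
    """Consumption layer: render a simple text report."""
    return [f"{user}: total={row['total']:.2f}, alerts={row['alerts']}"
            for user, row in sorted(store.items())]

events = ingest(RAW_EVENTS)
store = organize(batch_process(events), stream_process(events))
report = consume(store)
print("\n".join(report))
```

The point of the sketch is the shape of the flow: the organizing layer receives both batch and stream output, and the consumption layer reads only from the organized store.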
Infrastructure Layer
Manages the raw resources of storage, compute, and
communication through a cloud computing paradigm.
Distributed File System Layer
Includes the Hadoop Distributed File System (HDFS).
Also includes supporting applications, such as YARN (Yet Another Resource
Negotiator), that enable efficient access to data storage and its
transfer.
Big Data Architecture Examples: IBM Watson
Netflix
Netflix is one of the largest providers of online video
entertainment. It handles 400 billion online events per day.
As a cutting-edge user of big data technologies, it
constantly innovates its mix of technologies to deliver the
best performance.
Kafka is the common messaging system for all incoming
requests.
The entire infrastructure is hosted on Amazon Web Services
(AWS).
Data is stored in Amazon S3 as well as in Cassandra and
HBase.
Spark is used for stream processing.
eBay
eBay is the second-largest e-commerce company in
the world.
It delivers 800 million listings from 25 million sellers
to 160 million buyers.
To manage this huge stream of activity, eBay uses a
stack of Hadoop, Spark, Kafka, and other elements.
PayPal
WORKER NODES (DataNodes)
Store the data blocks in their storage space, as directed by
the master node.
Contain many disks to maximize storage capacity and
access speed.
Do not have awareness of the distributed file structure.
Apache Hadoop - Architecture
Role of NameNode
Tries to ensure that files are evenly spread across the
DataNodes in the cluster.
Tries to optimize the networking load.
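
The division of work between the NameNode and the DataNodes can be sketched with a toy in-memory model. This is not HDFS's actual placement policy (real HDFS also takes rack topology into account, for example); it only illustrates the idea of a master that keeps the file-to-block map and spreads blocks evenly, while workers hold opaque blocks. The class `ToyNameNode`, the tiny block size, the replication factor, and the least-loaded rule are all assumptions made for the example.

```python
BLOCK_SIZE = 4          # bytes; tiny on purpose (HDFS blocks are far larger)
REPLICATION = 2         # copies of each block; illustrative value

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

class ToyNameNode:
    """Keeps the file-to-block map and decides which node stores each block."""

    def __init__(self, node_names):
        self.datanodes = {name: {} for name in node_names}  # node -> {block_id: bytes}
        self.file_blocks = {}                               # file -> [(block_id, [nodes])]

    def put(self, filename, data, replication=REPLICATION):
        placements = []
        for index, block in enumerate(split_into_blocks(data)):
            block_id = f"{filename}#{index}"
            # Pick the least-loaded nodes so blocks stay evenly spread.
            ranked = sorted(self.datanodes, key=lambda n: (len(self.datanodes[n]), n))
            targets = ranked[:replication]
            for node in targets:
                self.datanodes[node][block_id] = block  # the node only sees opaque blocks
            placements.append((block_id, targets))
        self.file_blocks[filename] = placements

    def get(self, filename):
        """Reassemble a file by reading each block from its first replica."""
        return b"".join(self.datanodes[nodes[0]][block_id]
                        for block_id, nodes in self.file_blocks[filename])

namenode = ToyNameNode(["dn1", "dn2", "dn3"])
namenode.put("log.txt", b"abcdefghijkl")   # 12 bytes -> 3 blocks, 6 replicas
load = {name: len(blocks) for name, blocks in namenode.datanodes.items()}
print(load)
```

Note that only `ToyNameNode` knows which blocks make up `log.txt`; each DataNode entry is just a dictionary of block IDs to bytes, mirroring the point that workers have no awareness of the file structure.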