Big Data Introduction
What is Data?
Data is the quantities, characters, or symbols on which a computer performs operations. It can be stored and transmitted as electrical signals, and recorded on magnetic, optical, or mechanical recording media.
What is Big Data?
Big Data is also data, but of enormous size. The term describes collections of data that are huge in volume and still growing exponentially over time. In short, such data is so large and complex that no traditional data management tool can store or process it efficiently.
A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
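A quick back-of-envelope calculation shows how the jet-engine example scales to petabytes. The 10 TB per 30-minute flight comes from the text; the daily flight count used below is an assumption for illustration:

```python
# Back-of-envelope check of the jet-engine example. 10 TB per 30-minute
# flight is from the text; 25,000 flights/day is an assumed figure.
TB_PER_FLIGHT = 10
FLIGHTS_PER_DAY = 25_000   # hypothetical daily flight count

daily_tb = TB_PER_FLIGHT * FLIGHTS_PER_DAY
daily_pb = daily_tb / 1024  # 1 PB = 1024 TB

print(f"{daily_tb} TB/day is about {daily_pb:.0f} PB/day")
```

Even with conservative inputs, the total lands in the hundreds of petabytes per day, which is why traditional tools cannot keep up.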
Characteristics Of Big Data
Big Data is commonly characterized by the "three Vs": Volume (the sheer amount of data), Velocity (the speed at which it is generated and must be processed), and Variety (the many forms it takes, from structured tables to text, images, and logs).
Big Data technologies are designed around two key properties:
• Scalability
• Fault Tolerance
The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on commodity
hardware. It has many similarities with existing distributed file systems. However,
the differences from other distributed file systems are significant. It is highly fault-
tolerant and is designed to be deployed on low-cost hardware. It provides high
throughput access to application data and is suitable for applications with large
datasets.
Hadoop's distributed file system, HDFS, splits files into blocks and distributes them across the many nodes of a cluster. Because each block is replicated on multiple nodes, the system keeps operating when a node fails: HDFS transparently serves the affected data from the remaining replicas.
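The block-splitting and replication behavior can be sketched in plain Python. The 128 MB block size and replication factor of 3 match HDFS defaults, but the node names and the round-robin placement policy below are simplifications of HDFS's real rack-aware placement:

```python
# Simplified sketch of how an HDFS-like system splits a file into blocks
# and replicates each block across cluster nodes. Block size (128 MB) and
# replication factor (3) are HDFS defaults; the round-robin placement is
# a simplification, not HDFS's actual rack-aware policy.
BLOCK_SIZE_MB = 128
REPLICATION = 3
NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(file_size_mb):
    """Return a mapping of block index -> nodes holding a replica."""
    num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Pick REPLICATION distinct nodes, rotating through the cluster.
        placement[b] = [NODES[(b + r) % len(NODES)] for r in range(REPLICATION)]
    return placement

layout = place_blocks(600)  # a 600 MB file -> 5 blocks
for block, nodes in layout.items():
    print(f"block {block}: {nodes}")
```

Because every block lives on three different nodes, losing any single node still leaves two replicas of each block available, which is the basis of HDFS's fault tolerance.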
Apart from the two core components mentioned above, the Hadoop framework also includes the following two modules:
• Hadoop Common − the Java libraries and utilities required by other Hadoop modules.
• Hadoop YARN − a framework for job scheduling and cluster resource management.
How Does Hadoop Work?
A client submits a job to the cluster, and YARN schedules the job's tasks on the nodes, placing computation close to the data blocks stored in HDFS. Each map task processes one block and emits key-value pairs; the framework shuffles these pairs by key, and reduce tasks aggregate the values for each key into the final output.
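Hadoop processes data with the MapReduce model: map tasks emit key-value pairs from each input block, the framework shuffles the pairs by key, and reduce tasks aggregate each key's values. A minimal word-count simulation of that flow, in plain Python (the function names are illustrative sketches, not Hadoop APIs — in real Hadoop the map tasks run in parallel across the cluster and the framework itself performs the shuffle):

```python
from collections import defaultdict

# Minimal single-process simulation of the MapReduce word-count flow.
# Illustrative only: real Hadoop runs map tasks in parallel on HDFS
# blocks and handles the shuffle inside the framework.

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data keeps growing"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'keeps': 1, 'growing': 1}
```

The same three-phase structure scales from this toy example to petabyte inputs, because each phase parallelizes across blocks and keys independently.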
Advantages:
Ability to store a large amount of data.
High flexibility.
Cost effective.
High computational power.
Tasks are independent.
Linear scaling.
Disadvantages:
Not very effective for small data.
Hard cluster management.
Has stability issues.
Security concerns.
Comparison with Other Systems (RDBMS)
Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. Seek time characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth. Because transfer rates have improved much faster than seek times over the years, streaming through a large dataset sequentially (the access pattern Hadoop relies on) is far faster than a seek-heavy access pattern (the pattern an RDBMS uses for point reads and updates on a B-tree index).
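A back-of-envelope calculation makes the seek-versus-streaming trade-off concrete. The disk figures and record size below are assumptions typical of a spinning disk, not numbers from the text:

```python
# Compare updating 1% of a ~1 TB dataset record-by-record (one seek per
# updated record) with simply rewriting the whole dataset sequentially.
# All figures are assumed, typical spinning-disk values.
SEEK_TIME_S = 0.010        # 10 ms average seek
TRANSFER_MB_S = 100        # 100 MB/s sequential transfer
DATASET_MB = 1_000_000     # ~1 TB dataset
RECORD_KB = 1              # 1 KB records

records = DATASET_MB * 1024 // RECORD_KB       # total records on disk
seeks = int(0.01 * records)                    # update 1% of records
seek_hours = seeks * SEEK_TIME_S / 3600
stream_hours = DATASET_MB / TRANSFER_MB_S / 3600

print(f"seek-based update of 1% of records: {seek_hours:.1f} h")
print(f"full sequential rewrite:            {stream_hours:.1f} h")
```

With these assumptions, seeking to 1% of the records takes roughly ten times longer than streaming the entire dataset, which is why Hadoop favors full sequential scans while an RDBMS favors indexed point access on much smaller working sets.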