Printed Notes DSBA
What is Data?
Data is the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Examples Of Big Data
The New York Stock Exchange is an example of Big Data: it generates about one terabyte of new trade data per day.
Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, posting comments, etc.
Jet Engine
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Types Of Big Data
Following are the types of Big Data:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in the form of a fixed format is termed ‘structured’ data. Over a period of time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and deriving value out of it. However, nowadays we are foreseeing issues as the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.
Do you know? 10^21 bytes, i.e. one billion terabytes, equal one zettabyte.
Looking at these figures, one can easily understand why the name Big Data is given and imagine the challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is one example of ‘structured’ data.
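For illustration, an ‘Employee’ table in a relational database is structured data: every row follows the same fixed set of columns. (The rows below are made up for the example.)
Employee_ID   Employee_Name     Gender   Department   Salary
2365          Rajesh Kulkarni   Male     Finance      650000
3398          Pratibha Joshi    Female   Admin        650000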
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file, such as the personal records below.
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
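To see why such records are structured in form yet not bound to any table definition, here is a small, hypothetical Java sketch that reads the <rec> elements with the standard DOM parser. The <recs> root wrapper and the class name are assumptions added only to make the snippet self-contained.

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SemiStructuredDemo {
    public static void main(String[] args) throws Exception {
        // Two of the <rec> elements from above, wrapped in a root element
        // (an assumption) so that the document is well-formed XML.
        String xml = "<recs>"
                + "<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>"
                + "<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>"
                + "</recs>";

        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // The tags describe the fields, but no schema forces every <rec>
        // to carry the same fields -- that is what makes it semi-structured.
        NodeList recs = doc.getElementsByTagName("rec");
        for (int i = 0; i < recs.getLength(); i++) {
            Element rec = (Element) recs.item(i);
            String name = rec.getElementsByTagName("name").item(0).getTextContent();
            String age  = rec.getElementsByTagName("age").item(0).getTextContent();
            System.out.println(name + " is " + age + " years old");
        }
    }
}

Each record carries its own tags, so a new field (say <city>) could appear in one record and not in the others without breaking anything, which a fixed table definition would not allow.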
Data Growth over the years
Please note that web application data, which is unstructured, consists of log files, transaction history files, etc. OLTP systems are built to work with structured data, wherein data is stored in relations (tables).
Characteristics Of Big Data
Big Data can be described by the following characteristics:
(i) Volume
(ii) Variety
(iii) Velocity
(iv) Variability
(i) Volume – The name Big Data itself is related to a size which is enormous. The size of data plays a very crucial role in determining the value of data. Also, whether particular data can actually be considered Big Data or not depends upon the volume of data. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big Data solutions.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
(iii) Velocity – The term ‘velocity’ refers to the speed of generation of data. How fast the data is generated and processed to meet the demands determines the real potential in the data.
Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus hampering the process of handling and managing the data effectively.
Advantages Of Big Data Processing
The ability to process Big Data brings multiple benefits, such as:
o Access to social data from search engines and sites like Facebook and Twitter enables organizations to fine-tune their business strategies.
o Traditional customer feedback systems are being replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses.
o Big Data technologies can be used for creating a staging area or landing zone for new data before identifying which data should be moved to the data warehouse. In addition, such integration of Big Data technologies and the data warehouse helps an organization offload infrequently accessed data.
What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability against failure and high availability to parallel applications. It is cost-effective as it uses commodity hardware. It involves the concepts of blocks, data nodes, and the name node.
Since all the metadata is stored in the name node, it is very important. If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present in the data nodes. To overcome this, the concept of the secondary name node arises.
Secondary Name Node: It is a separate physical machine which acts as a helper to the name node. It performs periodic checkpoints. It communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data.
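Client programs normally reach these blocks and nodes through the Hadoop FileSystem Java API rather than talking to the name node directly. Below is a minimal, illustrative sketch of writing and then reading a small file; the class name and the path /user/test/hello.txt are assumptions made only for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        // Configuration picks up fs.defaultFS (the name node address)
        // from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/test/hello.txt");  // hypothetical HDFS path

        // Write a small file: the name node records the metadata,
        // the data nodes store the replicated blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read the file back from HDFS.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}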
Starting HDFS
HDFS should be formatted initially and then started in distributed mode. The commands are given below.
To Format $ hadoop namenode -format
To Start $ start-dfs.sh
o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test:
$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test
Recursive deleting
hadoop fs -rm -r <path>
Example:
hadoop fs -rm -r /user/test (removes the directory and everything under it)
"<localSrc>" and "<localDest>" are paths as above, but on the local file system
o put <localSrc> <dest>
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
o copyFromLocal <localSrc> <dest>
Identical to -put
o moveFromLocal <localSrc> <dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success.
o get <src> <localDest>
Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
o cat <filename>
Displays the contents of filename on stdout.
o moveToLocal <src> <localDest>
Works like -get, but deletes the HDFS copy on success.
o setrep [-R] [-w] rep <path>
Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)
o touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
o stat [format] <path>
Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
Features of HDFS
o Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a single cluster.
o Replication - Due to some unfavorable conditions, the node containing the data may be lost. So, to overcome such problems, HDFS always maintains a copy of the data on a different machine.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is so highly fault-tolerant that if any machine fails, the other machine containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS that makes
Hadoop very powerful. Here, data is divided into multiple blocks and stored into nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
Goals of HDFS
o Handling hardware failure - HDFS contains multiple server machines. If any machine fails, the goal of HDFS is to detect the failure and recover from it quickly.
o Streaming data access - HDFS applications require streaming access to their data sets; HDFS is designed more for batch processing of large data sets than for interactive use.
o Coherence Model - Applications that run on HDFS need to follow the write-once-read-many approach. So, a file once created need not be changed. However, it can be appended to and truncated.
MapReduce Tutorial
This MapReduce tutorial provides basic and advanced concepts of MapReduce and is designed for beginners and professionals. It covers topics such as data flow in MapReduce, the MapReduce API, a Word Count example, and a Character Count example.
What is MapReduce?
MapReduce is a data processing tool which is used to process data in parallel in a distributed form. It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data Processing on Large Clusters," published by Google.
MapReduce is a paradigm which has two phases: the mapper phase and the reducer phase. In the mapper, the input is given in the form of key-value pairs. The output of the mapper is fed to the reducer as input. The reducer runs only after the mapper is over. The reducer also takes input in key-value format, and the output of the reducer is the final output.
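To make the two phases concrete, here is the classic Word Count job written against the Hadoop MapReduce Java API. It is a minimal sketch: the class names are illustrative, and the input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper phase: input is (byte offset, line of text), output is (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit (word, 1) for every token
            }
        }
    }

    // Reducer phase: input is (word, [1, 1, ...]), output is (word, total count).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mapper emits (word, 1) for every token; the framework groups the pairs by key, and the reducer receives each word with its list of counts and sums them to produce the final output.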
Usage of MapReduce
o It can be used in various applications like document clustering, distributed sorting, and web link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environments.
Data Flow in MapReduce
The data flow in MapReduce passes through the following phases.
Input reader
Once the input reader reads the data, it generates the corresponding key-value pairs. The input files reside in HDFS.
Map function
The map function processes the incoming key-value pairs and generates the corresponding output key-value pairs. The map input and output types may be different from each other.
Partition function
The partition function assigns the output of each map function to the appropriate reducer. It is given the key and the value and returns the index of the reducer.
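As a concrete illustration, the sketch below shows a hypothetical partition function written against the Hadoop Partitioner API; it behaves like Hadoop's default HashPartitioner, mapping each key to one of the reducer indices.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Given a map output key, its value, and the number of reducers,
// return the index of the reducer that will receive this pair.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Such a class would be registered on the job with job.setPartitionerClass(WordPartitioner.class); if nothing is set, Hadoop's HashPartitioner does essentially the same thing.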
Sorting function
The sorting operation is performed on the input data for the Reduce function. Here, the data is compared using a comparison function and arranged in sorted order.
Reduce function
The Reduce function is assigned to each unique key. These keys are already arranged in sorted order.
The values associated with each key are iterated over by the Reduce function, which generates the corresponding output.
Output writer
Once the data has flowed through all the above phases, the output writer executes. The role of the output writer is to write the Reduce output to stable storage.