Printed Notes Dsba

This document provides an overview of data, Big Data, HDFS, and MapReduce. It defines data as quantities, characters, or symbols operated on by computers. Big Data is defined as huge volumes of data, growing exponentially over time, that cannot be managed with traditional tools; examples include social media data and data from jet engines. HDFS is the Hadoop Distributed File System, which stores large files across commodity servers and replicates them for reliability. Key concepts are blocks, the NameNode for metadata, and DataNodes that store the blocks.


What is Data?

Data refers to the quantities, characters, or symbols on which operations are performed by a computer, and which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

Now, let's look at the definition of Big Data.

What is Big Data?


Big Data is a collection of data that is huge in volume and keeps growing exponentially with time. Its size and complexity are so great that traditional data management tools cannot store or process it efficiently. In short, Big Data is still data, just at an enormous scale.

What is an Example of Big Data?


Following are some examples of Big Data.

The New York Stock Exchange is an example of Big Data: it generates about one terabyte of new trade data per day.

Social Media

Statistics show that more than 500 terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is generated mainly through photo and video uploads, message exchanges, comments, etc.

A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time. With many thousands of flights per day, the data generated runs into many petabytes.
Types Of Big Data
Following are the types of Big Data:

1. Structured
2. Unstructured
3. Semi-structured

Structured
Any data that can be stored, accessed and processed in the form of a fixed format is termed ‘structured’ data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value out of it. Nowadays, however, we are foreseeing issues as the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.

Do you know? 10^21 bytes (one billion terabytes) make up one zettabyte.

Looking at these figures one can easily understand why the name Big Data is given and imagine the
challenges involved in its storage and processing.

Do you know? Data stored in a relational database management system is one example of ‘structured’ data.

Employee_ID | Employee_Name   | Gender | Department | Salary_In_lacs
2365        | Rajesh Kulkarni | Male   | Finance    | 650000
3398        | Pratibha Joshi  | Female | Admin      | 650000
7465        | Shushil Roy     | Male   | Admin      | 500000
7500        | Shubhojit Das   | Male   | Finance    | 500000
7699        | Priya Sane      | Female | Finance    | 550000

Examples Of Structured Data

An ‘Employee’ table in a database is an example of structured data.


Unstructured
Any data whose form or structure is unknown is classified as unstructured data. In addition to its size being huge, unstructured data poses multiple challenges in terms of processing it to derive value out of it. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available to them but, unfortunately, they don't know how to derive value out of it, since this data is in its raw, unstructured form.

Examples Of Un-structured Data

The output returned by ‘Google Search’


Semi-structured
Semi-structured data can contain both forms of data. It appears structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file-

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Data Growth over the years
Please note that web application data, which is unstructured, consists of log files, transaction history files, etc. OLTP systems are built to work with structured data, where the data is stored in relations (tables).

Characteristics Of Big Data


Big data can be described by the following characteristics:

o Volume
o Variety
o Velocity
o Variability

(i) Volume – The name ‘Big Data’ itself refers to an enormous size. The size of the data plays a very crucial role in determining its value. Whether a particular data set can actually be considered Big Data or not also depends upon its volume. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big Data solutions.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and to the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing the data.

(iii) Velocity – The term ‘velocity’ refers to the speed at which data is generated. How fast the data is generated and processed to meet demands determines the real potential in the data.

Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency which the data can show at times, which hampers the process of handling and managing the data effectively.
Advantages Of Big Data Processing
The ability to process Big Data brings in multiple benefits, such as:

o Businesses can utilize outside intelligence while making decisions

Access to social data from search engines and sites like Facebook and Twitter enables organizations to fine-tune their business strategies.

o Improved customer service

Traditional customer feedback systems are being replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses.

o Early identification of risk to the product/services, if any

o Better operational efficiency

Big Data technologies can be used to create a staging area or landing zone for new data before deciding which data should be moved to the data warehouse. In addition, such integration of Big Data technologies with a data warehouse helps an organization offload infrequently accessed data.

What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability against failure and high availability for parallel applications.

It is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes and the name node.

Where to use HDFS


o Very Large Files: Files should be hundreds of megabytes, gigabytes or more in size.
o Streaming Data Access: The time to read the whole data set is more important than the latency in reading the first record. HDFS is built on a write-once, read-many-times pattern.
o Commodity Hardware: It works on low-cost hardware.

Where not to use HDFS


o Low Latency Data Access: Applications that require very low latency to access the first record should not use HDFS, as it gives importance to reading the whole data set rather than to the time taken to fetch the first record.
o Lots of Small Files: The name node holds the metadata of all files in memory, and if the files are small in size this consumes a lot of the name node's memory, which is not feasible.
o Multiple Writes: It should not be used when we have to write to a file multiple times.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike in an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; e.g. a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large simply to minimize the cost of seeks. (A short programmatic sketch of blocks and their locations is shown after this list.)
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS, the metadata being the file permissions, names and locations of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to it. Moreover, since the HDFS cluster is accessed by multiple clients concurrently, all this information is handled by a single machine. The file system operations like opening, closing, renaming etc. are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to, by a client or by the name node. They report back to the name node periodically with the list of blocks that they are storing. The data nodes, being commodity hardware, also do the work of block creation, deletion and replication as instructed by the name node.
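To make the block and name-node/data-node roles concrete, here is a minimal sketch (not part of the original notes) that asks the NameNode for a file's block metadata through the Hadoop Java FileSystem API. The path /user/test/data.txt is only an assumed example, and the default block size could equally be changed through the dfs.blocksize property.

// Minimal sketch: list the blocks of one HDFS file and the DataNodes that hold them.
// Assumes a reachable HDFS cluster and that /user/test/data.txt exists (illustrative path).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/test/data.txt");

        FileStatus status = fs.getFileStatus(file);
        // The block metadata (offset, length, hosting DataNodes) is served by the NameNode.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

A file smaller than one block, such as the 5 MB example above, would show a single block whose length is the file size, not 128 MB.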

HDFS DataNode and NameNode Image:

HDFS Read Image:


HDFS Write Image:

Since all the metadata is stored in the name node, it is very important. If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present in the data nodes. To overcome this, the concept of the secondary name node arises.

Secondary Name Node: It is a separate physical machine which acts as a helper to the name node. It performs periodic checkpoints. It communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data.

Starting HDFS
HDFS should be formatted initially and then started in distributed mode. The commands are given below.

To format: $ hadoop namenode -format

To start: $ start-dfs.sh

HDFS Basic File Operations


1. Putting data into HDFS from the local file system
o First create a folder in HDFS where data can be put from the local file system.

$ hadoop fs -mkdir /user/test

o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test

$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test

o Display the contents of the HDFS folder

$ hadoop fs -ls /user/test

2. Copying data from HDFS to local file system


o $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt
3. Compare the files and verify that both are the same
o $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt (on Linux, use md5sum instead of md5)
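The same put and get operations can also be performed programmatically. Below is a minimal sketch using the Hadoop Java FileSystem API; the class name and paths simply mirror the shell example above and are illustrative only.

// Minimal sketch: programmatic equivalents of -copyFromLocal and -copyToLocal.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Local -> HDFS (like: hadoop fs -copyFromLocal ...)
        fs.copyFromLocalFile(new Path("/usr/home/Desktop/data.txt"),
                             new Path("/user/test/data.txt"));

        // HDFS -> local (like: hadoop fs -copyToLocal ...)
        fs.copyToLocalFile(new Path("/user/test/data.txt"),
                           new Path("/usr/bin/data_copy.txt"));

        fs.close();
    }
}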

Recursive deleting

o hadoop fs -rmr <arg>

Example:

o hadoop fs -rmr /user/sonoo/

HDFS Other commands


The following notation is used in the commands below:

"<path>" means any file or directory name.

"<path>..." means one or more file or directory names.

"<file>" means any filename.

"<src>" and "<dest>" are path names in a directed operation.

"<localSrc>" and "<localDest>" are paths as above, but on the local file system

o put <localSrc><dest>

Copies the file or directory from the local file system identified by localSrc to dest within the DFS.

o copyFromLocal <localSrc><dest>

Identical to -put

o moveFromLocal <localSrc><dest>

Copies the file or directory from the local file system identified by localSrc to dest within HDFS,
and then deletes the local copy on success.

o get [-crc] <src><localDest>

Copies the file or directory in HDFS identified by src to the local file system path identified by
localDest.

o cat <filename>

Displays the contents of filename on stdout.

o moveToLocal <src><localDest>

Works like -get, but deletes the HDFS copy on success.

o setrep [-R] [-w] rep <path>

Sets the target replication factor for files identified by path to rep. (The actual replication factor
will move toward the target over time)

o touchz <path>

Creates a zero-length file at path, updating its modification time. Fails if a file of non-zero size already exists at path.

o test -[ezd] <path>

Returns 1 if path exists, has zero length, or is a directory (depending on the flag given), and 0 otherwise.

o stat [format] <path>

Prints information about path. Format is a string which accepts file size in blocks (%b), filename
(%n), block size (%o), replication (%r), and modification date (%y, %Y).

HDFS Features and Goals


The Hadoop Distributed File System (HDFS) is a distributed file system. It is a core part of Hadoop and is used for data storage. It is designed to run on commodity hardware.
Unlike other distributed file systems, HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It can easily handle applications that involve large data sets.

Let's see some of the important features and goals of HDFS.

Features of HDFS
o Highly Scalable - HDFS is highly scalable, as it can scale to hundreds of nodes in a single cluster.
o Replication - Due to unfavorable conditions, the node containing the data may be lost. So, to overcome such problems, HDFS always maintains a copy of the data on a different machine.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant: if any machine fails, another machine containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS, and it makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.

Goals of HDFS
o Handling hardware failure - An HDFS cluster contains multiple server machines. If any machine fails, the goal of HDFS is to recover from the failure quickly.
o Streaming data access - HDFS applications need streaming access to their data sets; they are not general-purpose applications that run on general-purpose file systems.
o Coherence model - Applications that run on HDFS are required to follow the write-once, read-many approach, so a file, once created, need not be changed. However, it can be appended to and truncated.

MapReduce Tutorial
MapReduce tutorial provides basic and advanced concepts of MapReduce. Our MapReduce tutorial is
designed for beginners and professionals.

Our MapReduce tutorial includes all topics of MapReduce such as Data Flow in MapReduce, Map
Reduce API, Word Count Example, Character Count Example, etc.

What is MapReduce?
MapReduce is a data processing tool used to process data in parallel in a distributed form. It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data Processing on Large Clusters," published by Google.

MapReduce is a paradigm with two phases: the mapper phase and the reducer phase. In the mapper, the input is given in the form of key-value pairs. The output of the mapper is fed to the reducer as input, and the reducer runs only after the mapper is finished. The reducer also takes its input in key-value format, and the output of the reducer is the final output.
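As a concrete illustration of the two phases, here is a minimal sketch of the classic word-count job written against the Hadoop Java MapReduce API (the class names and the use of command-line paths are assumptions, not taken from these notes): the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the values grouped under each unique word.

// Minimal word-count sketch: the Mapper emits (word, 1) pairs and the Reducer
// sums the values collected for each unique word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input key = byte offset of the line, value = the line of text itself.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);              // emit (word, 1)
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {             // values grouped by the shuffle
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The driver in main() only wires the two phases together; the sort and shuffle between them is handled entirely by the framework, as described in the following sections.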

Steps in Map Reduce


o The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys will not be unique in this case.
o Using the output of the map, sort and shuffle are applied by the Hadoop architecture. The sort and shuffle act on the list of <key, value> pairs and send out each unique key together with the list of values associated with that key, <key, list(values)>.
o The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final <key, value> output is stored/displayed.
Sort and Shuffle
The sort and shuffle occur on the output of the mapper, before the reducer. When the mapper task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Using the input from each mapper, <k2, v2>, we collect all the values for each unique key k2. This output of the shuffle phase, in the form of <k2, list(v2)>, is sent as input to the reducer phase.
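For instance, with a purely illustrative input: if the mappers emit (the, 1), (cat, 1), (the, 1) and (dog, 1), the shuffle groups and sorts these by key so that the reducer receives (cat, [1]), (dog, [1]) and (the, [1, 1]), i.e. pairs of the form <k2, list(v2)>.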

Usage of MapReduce
o It can be used in various applications like document clustering, distributed sorting, and web link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environments.

Data Flow In MapReduce


MapReduce is used to compute huge amounts of data. To handle the data in a parallel and distributed form, it flows through several phases.

Phases of MapReduce data flow


Input reader
The input reader reads the incoming data and splits it into data blocks of the appropriate size (64 MB to 128 MB). Each data block is associated with a Map function.

Once the input reader has read the data, it generates the corresponding key-value pairs. The input files reside in HDFS.

Note - The input data can be in any form.

Map function
The map function processes the incoming key-value pairs and generates the corresponding output key-value pairs. The map input and output types may be different from each other.

Partition function
The partition function assigns the output of each Map function to the appropriate reducer. It is given the key and value, and it returns the index of the reducer.
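For illustration only (this class is not part of the notes), a partition function is supplied in the Java API by extending Partitioner; the sketch below reproduces the behaviour of Hadoop's default HashPartitioner, which chooses a reducer from the hash of the key.

// Minimal sketch of a partition function: map each key to one of the
// numReduceTasks reducers based on the key's hash (same idea as the default HashPartitioner).
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the returned reducer index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// It would be registered in the driver with: job.setPartitionerClass(WordPartitioner.class);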

Shuffling and Sorting


The data are shuffled between/within nodes so that they move out of the map phase and become ready for the reduce function. Sometimes, the shuffling of data can take considerable computation time.

The sorting operation is performed on the input data for the Reduce function. Here, the data are compared using a comparison function and arranged in sorted form.

Reduce function
The Reduce function is invoked for each unique key, and the keys are already arranged in sorted order. The Reduce function iterates over the values associated with a key and generates the corresponding output.

Output writer
Once the data has flowed through all the above phases, the output writer executes. The role of the output writer is to write the Reduce output to stable storage.
