BDA Solution QB
Q1. Hadoop Architecture
Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to store and manage very large amounts of data. Hadoop works on the MapReduce programming model, which was introduced by Google. Today many big-brand companies use Hadoop in their organizations to deal with big data, e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of 4 components:
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MapReduce
Here, the input is provided to the Map() function, its output is used as input to the Reduce() function, and after that we receive the final output. Let's understand what Map() and Reduce() do.
An input dataset is provided to Map(). Since we are working with big data, the input is split into data blocks. The Map() function breaks these data blocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines the tuples based on their key, forms a new set of tuples, and performs operations like sorting, summation, etc., whose result is then sent to the final output node. Finally, the output is obtained.
The data processing is always done in the Reducer, depending upon the business requirement of that industry. This is how first Map() and then Reduce() are utilized, one after the other.
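To make the Map()/Reduce() flow concrete, the following is a minimal sketch in plain Python that simulates a word-count job outside Hadoop; a real job would be written against the Hadoop Java API or Hadoop Streaming, and the function names here (map_fn, reduce_fn, run_job) are illustrative only.

from collections import defaultdict

def map_fn(line):
    # Map(): break a block of text into (word, 1) key-value pairs
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Reduce(): combine all values that share the same key (summation)
    return (key, sum(values))

def run_job(lines):
    # Shuffle/sort phase: group intermediate key-value pairs by key
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    # Reduce phase: one call per distinct key
    return [reduce_fn(key, values) for key, values in sorted(grouped.items())]

if __name__ == "__main__":
    data = ["big data needs big clusters", "hadoop processes big data"]
    print(run_job(data))
    # [('big', 3), ('clusters', 1), ('data', 2), ('hadoop', 1), ('needs', 1), ('processes', 1)]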
2. HDFS
HDFS (Hadoop Distributed File System) is used as the storage layer. It is mainly designed for working on commodity hardware devices (inexpensive devices), using a distributed file system design. HDFS is designed in such a way that it prefers storing data in large chunks of blocks rather than storing many small data blocks.
HDFS in Hadoop provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster. The data storage nodes in HDFS are:
NameNode(Master)
DataNode(Slave)
NameNode: The NameNode works as the Master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e. the data about the data. Metadata can be the transaction logs that keep track of the user's activity in a Hadoop cluster.
Metadata can also be the name of a file, its size, and information about the location (block number, block ids) on the DataNodes, which the NameNode stores in order to find the closest DataNode for faster communication. The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.
DataNode: DataNodes work as slaves. DataNodes are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster will be able to store. So it is advised that a DataNode should have a high storage capacity in order to store a large number of file blocks.
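As a rough illustration (not the actual NameNode implementation), the metadata held by the NameNode can be pictured as a mapping from each file to its blocks, and from each block to the DataNodes holding its replicas; the file name, block ids, and node names below are made up.

# Hypothetical, simplified view of NameNode metadata (illustration only).
# A real NameNode keeps this in memory plus an edit log / fsimage on disk.
namenode_metadata = {
    "/logs/clickstream.log": {
        "size_mb": 384,
        "blocks": ["blk_001", "blk_002", "blk_003"],  # e.g. 128 MB blocks
    },
}

# Block -> DataNodes storing a replica (default replication factor is 3)
block_locations = {
    "blk_001": ["datanode1", "datanode2", "datanode3"],
    "blk_002": ["datanode2", "datanode4", "datanode5"],
    "blk_003": ["datanode1", "datanode3", "datanode5"],
}

def datanodes_for_file(path):
    # The NameNode answers: which DataNodes hold this file's blocks?
    return {blk: block_locations[blk] for blk in namenode_metadata[path]["blocks"]}

print(datanodes_for_file("/logs/clickstream.log"))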
3. YARN (Yet Another Resource Negotiator)
YARN is the framework on which MapReduce works. YARN performs 2 operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in a Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which job is important, which job has higher priority, dependencies between the jobs, and all the other information like job timing, etc. The resource manager is used to manage all the resources that are made available for running the Hadoop cluster.
4. Hadoop Common or Common Utilities
Hadoop Common or common utilities are nothing but the Java libraries and Java files, or we can say the Java scripts, that we need for all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.
Q2. Hadoop Ecosystem
The Hadoop Ecosystem is a platform designed to handle big data problems through a suite of tools and services.
It includes Apache projects and various commercial tools that collectively provide data absorption, analysis,
storage, and maintenance. The core components of Hadoop are HDFS, YARN, MapReduce, and Hadoop
Common Utilities, which are supplemented by various tools in the ecosystem.
Core Components:
1. HDFS (Hadoop Distributed File System): The primary storage system of Hadoop, distributing and
storing large datasets across multiple nodes. It consists of Name Nodes (metadata) and Data Nodes
(actual data).
2. YARN (Yet Another Resource Negotiator): Manages and allocates resources across the Hadoop
clusters. It includes Resource Manager, Node Manager, and Application Manager, ensuring efficient
resource utilization.
3. MapReduce: A programming model for processing large datasets in parallel across a Hadoop cluster. It
involves two main functions:
o Map(): Sorts and filters data, producing key-value pairs.
o Reduce(): Aggregates the results from Map() into a summarized output.
Supplementary Tools:
1. Apache Spark: An in-memory data processing framework for real-time data processing, faster than traditional
batch processing.
2. PIG: A platform developed by Yahoo for analyzing large datasets with a SQL-like language (Pig Latin). It abstracts
the complexities of MapReduce.
3. HIVE: A data warehousing tool for querying and managing large datasets using HQL (Hive Query Language),
similar to SQL.
4. HBase: A NoSQL database designed for quick retrieval and storage of large datasets, inspired by Google’s
BigTable.
5. Mahout: Provides machine learning capabilities with libraries for clustering, classification, and collaborative
filtering.
Additional Components:
Solr, Lucene: Tools for searching and indexing data. Lucene is Java-based and supports spell checking.
Zookeeper: Manages coordination, synchronization, and communication between Hadoop components.
Oozie: A job scheduler for organizing and executing Hadoop jobs either sequentially (Oozie Workflow) or in
response to external triggers (Oozie Coordinator).
In summary, the Hadoop Ecosystem revolves around the efficient management and processing of large datasets,
with each component playing a specific role in the data lifecycle.
[Worked examples: relational-algebra operations using MapReduce]
The original answer here was a set of scanned, handwritten worked examples that did not survive text extraction. From the recoverable fragments, it illustrated how relational-algebra operations -- selection (e.g. with the selection condition B < 2), projection onto a subset of attributes, duplicate elimination, union, intersection, difference, natural join, and grouping with aggregation -- are expressed as map() and reduce() functions over two example relations R and S, tracing the key-value pairs emitted at each map worker (MW1, MW2) and combined at each reduce worker (RW1, RW2). Only the operation names and the general map/emit/reduce structure are recoverable.
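Since the detailed traces are not recoverable, here is a hedged sketch in plain Python (simulating the map, shuffle, and reduce phases in memory) of two of the operations listed above: selection with condition B < 2 over a relation R(A, B), and the natural join of R(A, B) with S(B, C). The relation contents are made-up illustrations.

from collections import defaultdict

# --- Selection sigma_{B < 2}(R) on a relation R(A, B) ---
def select_map(t):
    # Emit the tuple as both key and value only if it satisfies the condition
    a, b = t
    if b < 2:
        yield (t, t)

def select_reduce(key, values):
    # Identity reduce: each surviving tuple appears once in the output
    return key

# --- Natural join of R(A, B) with S(B, C) on the shared attribute B ---
def join_map(relation_name, t):
    # Key on B; tag the value with the relation the tuple came from
    if relation_name == "R":
        a, b = t
        yield (b, ("R", a))
    else:
        b, c = t
        yield (b, ("S", c))

def join_reduce(b, values):
    # Pair every R-value with every S-value that shares the same B
    r_side = [a for tag, a in values if tag == "R"]
    s_side = [c for tag, c in values if tag == "S"]
    return [(a, b, c) for a in r_side for c in s_side]

def shuffle(pairs):
    # Shuffle/sort phase: group intermediate pairs by key
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

if __name__ == "__main__":
    R = [(1, 2), (3, 1), (4, 0)]   # tuples (A, B)
    S = [(1, 5), (0, 7)]           # tuples (B, C)

    # Selection: only tuples of R with B < 2 survive
    sel_pairs = [p for t in R for p in select_map(t)]
    print([select_reduce(k, vs) for k, vs in shuffle(sel_pairs).items()])
    # [(3, 1), (4, 0)]

    # Natural join on B
    join_pairs = [p for t in R for p in join_map("R", t)] + \
                 [p for t in S for p in join_map("S", t)]
    print([row for k, vs in shuffle(join_pairs).items() for row in join_reduce(k, vs)])
    # [(3, 1, 5), (4, 0, 7)]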
Traditional database systems vs. big data systems:
Traditional database systems deal with structured data, whereas big data systems deal with structured, semi-structured, and unstructured data.
Traditional data is generated per hour or per day or more, but big data is generated far more frequently, mainly per second.
Traditional data sources are centralized and managed in a centralized form, whereas big data sources are distributed and managed in a distributed form.
Traditional data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc., while big data sources include social media, device data, sensor data, video, images, audio, etc.
The business drivers for NoSQL databases are centered around the need to manage and process large, fast-
changing, and complex datasets that traditional RDBMSs struggle to handle. Here’s a brief explanation of each
driver with an example:
1. Volume:
o Explanation: As data volume grows exponentially, traditional RDBMSs can't scale effectively
to handle such large datasets. NoSQL databases, with their ability to scale horizontally across
multiple servers, are designed to manage big data efficiently.
o Example: A social media platform like Facebook needs to store and query petabytes of user
data, including posts, comments, and media. NoSQL solutions like Cassandra allow them to
handle this massive volume across distributed servers.
2. Velocity:
o Explanation: The speed at which data is generated and needs to be processed is critical for real-
time applications. Traditional databases may struggle with rapid data insertions and queries,
especially under heavy loads.
o Example: An online retail site like Amazon needs to process millions of transactions and
customer interactions in real-time. A NoSQL database like MongoDB can handle high-velocity
data with minimal delays, ensuring quick responses.
3. Variability:
o Explanation: Data in modern applications often comes in various forms, such as structured,
semi-structured, and unstructured. Traditional databases require rigid schemas, making it
difficult to handle diverse data types.
o Example: A news aggregator like Google News needs to ingest and organize data from different
sources, including articles, videos, and social media feeds. A NoSQL database like Couchbase
allows them to store and manage this diverse data without predefined schemas.
4. Agility:
o Explanation: The ability to quickly adapt to changes in application requirements is crucial for
modern businesses. Traditional RDBMSs require complex schema changes and object-relational
mappings, slowing down development.
o Example: A startup developing a new mobile app can use a NoSQL database like Firebase to
rapidly iterate and evolve their data models as the app's features change, enabling faster time to
market.
These drivers make NoSQL databases attractive for businesses needing flexibility, scalability, and speed in
handling large and diverse datasets.
Architecture pattern is a logical way of categorizing the data that will be stored in the database. NoSQL is a type of database which helps to perform operations on big data and store it in a valid format. It is widely used because of its flexibility and the wide variety of services it offers.
Architecture Patterns of NoSQL:
The data is stored in NoSQL in one of the following four data architecture patterns: Key-Value Store, Column Store (for example, Google's Bigtable and Cassandra), Document Database, and Graph Database.
3. Document Database:
The document database fetches and accumulates data in the form of key-value pairs, but here the values are called documents. A document can be described as a complex data structure; it can be text, arrays, strings, JSON, XML, or any such format, and the use of nested documents is also very common. It is very effective because most of the data created is usually in the form of JSON and does not follow a fixed schema. A brief sketch of such a document appears after the examples below.
Advantages:
This type of format is very useful and apt for semi-structured data.
Storage, retrieval, and management of documents are easy.
Limitations:
Handling multiple documents is challenging
Aggregation operations may not work accurately.
Examples:
MongoDB
CouchDB
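To make the idea of nested, schema-free documents concrete, here is a small, hypothetical sketch using MongoDB's Python driver (pymongo); the connection string, database, collection, and field names are made up for illustration.

from pymongo import MongoClient

# Hypothetical local MongoDB instance; adjust the URI for a real deployment.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# A document is a nested key-value structure; no schema has to be declared first.
orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "Asha", "city": "Pune"},
    "items": [
        {"sku": "A-17", "qty": 2, "price": 250.0},
        {"sku": "B-03", "qty": 1, "price": 999.0},
    ],
    "status": "shipped",
})

# Query directly on a nested field.
print(orders.find_one({"customer.city": "Pune"}))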
Big Data refers to the vast volume of data generated every second by various digital processes, systems, and
devices. This data is so large and complex that traditional data processing tools and techniques are insufficient
to manage it. The concept of Big Data encompasses not just the amount of data but also the speed at which it is
generated (velocity), the variety of data types, and the complexities in managing and analyzing this data to
extract valuable insights.
1. Structured Data:
o Definition: Structured data is highly organized and easily searchable, typically stored in
databases in a tabular format (rows and columns). It is data that follows a predefined schema,
making it easy to enter, query, and analyze using tools like SQL.
o Examples:
Customer information in a CRM system.
Financial records such as transactions and invoices.
o Characteristics:
Fixed schema.
Easily manageable and queryable.
Limited flexibility for changes.
2. Semi-Structured Data:
o Definition: Semi-structured data does not conform to a fixed schema like structured data but still
has some organizational properties, such as tags or markers that separate data elements. It is
often stored in formats like JSON, XML, or NoSQL databases.
o Examples:
JSON or XML files.
Metadata from digital media files.
Data from social media feeds.
o Characteristics:
More flexible than structured data.
Easier to adapt to changing requirements.
Often used to store data that doesn't fit neatly into tables.
3. Unstructured Data:
o Definition: Unstructured data is raw, unorganized data that doesn't fit into traditional databases
or structured formats. It lacks a predefined data model and is usually more challenging to
analyze.
o Examples:
Text documents, emails, and PDFs.
Images, videos, and audio files.
Social media posts, blogs, and comments.
o Characteristics:
No fixed schema or structure.
Requires more advanced tools and techniques for analysis.
Can contain valuable insights but is more challenging to process.
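As a small illustration of the three categories above, the following sketch (with made-up data) shows the same customer interaction represented as structured, semi-structured, and unstructured data.

import json

# Structured: a fixed-schema row, as it would appear in a relational table
structured_row = ("C-1023", "Asha Rao", "2024-03-05", 1499.00)  # (id, name, date, amount)

# Semi-structured: JSON with tags/markers but no rigid table schema;
# optional fields can simply be added or omitted per record.
semi_structured = json.dumps({
    "customer_id": "C-1023",
    "name": "Asha Rao",
    "purchase": {"date": "2024-03-05", "amount": 1499.00},
    "tags": ["loyalty", "mobile-app"],
})

# Unstructured: free text with no predefined data model
unstructured = "Asha wrote: 'Delivery was quick, but the box arrived slightly dented.'"

print(structured_row)
print(semi_structured)
print(unstructured)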
Big Data is defined by several key characteristics that help to distinguish it from traditional data processing and
storage systems. These characteristics are often referred to as the "6 Vs" of Big Data:
1. Volume:
o Description: Volume refers to the enormous amount of data generated and collected. The size of
data is a critical factor in determining whether it qualifies as Big Data. With the exponential
growth of data, managing and analyzing such large volumes becomes challenging.
o Example: In 2016, global mobile traffic was estimated at 6.2 Exabytes per month. By 2020, this
figure had grown to nearly 40,000 Exabytes.
2. Velocity:
o Description: Velocity refers to the speed at which data is generated, collected, and processed.
Big Data often involves real-time or near-real-time data flows from various sources such as
social media, sensors, and mobile devices. The rapid influx of data requires quick processing to
derive actionable insights.
o Example: Google handles more than 3.5 billion searches per day, and Facebook's user base
grows by approximately 22% annually.
3. Variety:
o Description: Variety refers to the different types of data that are generated and processed. Big
Data includes structured, semi-structured, and unstructured data from a wide range of sources,
both internal and external to an organization. Managing this diverse data is crucial for gaining
comprehensive insights.
o Types of Data:
Structured Data: Organized data, such as relational databases.
Semi-Structured Data: Data that does not conform to a strict schema, such as JSON or
XML files.
Unstructured Data: Data that lacks a defined structure, such as text, images, and videos.
o Example: Log files are an example of semi-structured data, while social media posts and
multimedia files represent unstructured data.
4. Veracity:
o Description: Veracity refers to the quality, accuracy, and trustworthiness of data. Big Data often
comes with inconsistencies, noise, and errors, making it difficult to ensure data quality. High
veracity is essential for making reliable decisions based on data.
o Example: A large volume of data may lead to confusion, while insufficient data might result in
incomplete or misleading information.
5. Value:
o Description: Value is about the usefulness of data. Data in itself has no inherent value unless it
is processed and analyzed to generate meaningful insights that can drive business decisions.
Extracting value from Big Data is one of the primary goals of Big Data initiatives.
o Example: A company may collect vast amounts of customer data, but its value lies in how
effectively the company can use this data to enhance customer experience and drive sales.
6. Variability:
o Description: Variability refers to the changing nature or meaning of data over time. The
structure and interpretation of data can vary significantly, adding complexity to Big Data
management and analysis. This variability can affect the consistency and accuracy of data-driven
insights.
o Example: Imagine eating the same brand of ice cream daily, but the taste changes each time.
This inconsistency is akin to the variability in Big Data.
These characteristics collectively define Big Data and highlight the challenges and opportunities associated with
managing and analyzing such vast and complex datasets.