Big Data Analytics (BDA) Question Bank


1. Give the difference between the traditional data management and analytics approach versus the Big Data approach.

2. Explain the Hadoop ecosystem with its core components. Explain the physical architecture of Hadoop. State its limitations.

The components of the Hadoop ecosystem are as follows:
1) HBase
 Open-source, distributed, versioned, column-oriented store.
 It is based on Google's Bigtable.
2) Hive
 Hive provides a warehouse structure over other Hadoop input sources and SQL-like access to data in HDFS.
 Hive's query language, HiveQL, compiles to MapReduce and allows user-defined functions.
3) Pig
 Pig is a runtime environment that allows users to execute MapReduce programs on a Hadoop cluster.
4) Sqoop
 Sqoop is a tool that transfers data in both directions between relational systems and HDFS or other Hadoop data stores such as Hive or HBase.
5) Oozie
 It is a job coordinator and workflow manager for jobs executed in Hadoop.
 It is integrated with the rest of the Apache Hadoop stack.
6) Mahout
 It is a scalable machine learning and data mining library.
 Algorithms here can be executed in a distributed fashion.
7) ZooKeeper
 It is a distributed service with master and slave nodes for storing and maintaining configuration information, naming, and providing distributed synchronization and group services.

Physical Architecture of Hadoop:
 Data in organizations is often stored on the cloud to provide users with ease of access.
 Combining processor-based servers and storage, along with the networking resources used in a cloud environment, with big data processing tools such as Apache Hadoop provides the high-performance computing power needed to analyze vast amounts of data efficiently and cost-effectively.
 Hadoop-compatible file systems provide location awareness for effective scheduling of work.
 Hadoop applications use this information to find the nodes where the data resides and run tasks there (a toy illustration follows this list).
 A small Hadoop cluster includes a single master and multiple worker nodes.
 The master node consists of a JobTracker, TaskTracker, NameNode, and DataNode.
 A worker node consists of a DataNode and a TaskTracker.
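
A toy model of the location-aware scheduling described above, in Python. The block names, node names, and scheduling rule are invented for illustration; real Hadoop scheduling is considerably more involved:

    # Which workers hold a replica of each data block (hypothetical layout).
    block_locations = {
        "block-1": ["worker-a", "worker-b"],
        "block-2": ["worker-b", "worker-c"],
    }
    idle_workers = ["worker-a", "worker-c"]

    def schedule(block):
        # Prefer an idle worker that already holds the block (data locality).
        local = [w for w in block_locations[block] if w in idle_workers]
        return local[0] if local else idle_workers[0]   # else any idle node

    print(schedule("block-1"))   # worker-a: data-local assignment
    print(schedule("block-2"))   # worker-c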
Limitations of Hadoop
* Security concerns: Hadoop's security model is disabled by default due to its sheer complexity.
* Vulnerable by nature: Hadoop is written in Java, a widely used language that has been heavily exploited by attackers, so it is vulnerable by nature.
* Not fit for small data: Hadoop is inefficient for small files and small workloads.
* Potential stability issues: Hadoop is an open-source platform, so finding and using a stable version can be a challenge.
* General limitations: Many other technologies besides Hadoop are available for big data.
3. Describe the 5V characteristics of Big Data.
In recent years Big Data was defined by the "3Vs", but now there are "5Vs" of Big Data, which are also termed the characteristics of Big Data, as follows:
1. Volume:
 The name 'Big Data' itself relates to an enormous size.
 Volume refers to a huge amount of data.
 To determine the value of data, the size of the data plays a very crucial role. If the volume of data is very large, it is actually considered 'Big Data'. Whether particular data can be considered Big Data or not therefore depends on the volume of data.
 Hence, while dealing with Big Data it is necessary to consider the characteristic 'Volume'.
 Example: In 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month. It was also estimated that by 2020 there would be almost 40,000 exabytes of data.
2. Velocity:
 Velocity refers to the high speed of accumulation of data.
 In Big Data, velocity describes data flowing in from sources like machines, networks, social media, mobile phones, etc.
 There is a massive and continuous flow of data. Velocity determines the potential of the data: how fast it is generated and processed to meet demands.
 Sampling data can help in dealing with issues of velocity.
 Example: More than 3.5 billion searches per day are made on Google. Also, Facebook users are increasing by roughly 22% year over year.
3. Variety:
 It refers to the nature of data: structured, semi-structured, and unstructured data.
 It also refers to heterogeneous sources.
 Variety is basically the arrival of data from new sources, both inside and outside an enterprise. It can be structured, semi-structured, or unstructured.
 Structured data: This is basically organized data. It generally refers to data with a defined length and format.
 Semi-structured data: This is basically semi-organized data. It is generally a form of data that does not conform to the formal structure of data. Log files are examples of this type of data.
 Unstructured data: This basically refers to unorganized data. It generally refers to data that doesn't fit neatly into the traditional row-and-column structure of a relational database. Texts, pictures, videos, etc. are examples of unstructured data, which can't be stored in the form of rows and columns.
4. Veracity:
 It refers to inconsistencies and uncertainty in data: the available data can sometimes get messy, and its quality and accuracy are difficult to control.
 Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
 Example: Data in bulk can create confusion, whereas a small amount of data can convey half or incomplete information.
5. Value:
 After taking the four V's into account, there comes one more V, which stands for Value. Bulk data having no value is of no good to the company unless it is turned into something useful.
 Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information. Hence, Value can be stated as the most important of the 5V's.

4. What is Hadoop? How are big data problems handled by the Hadoop system?
 Hadoop is a single data platform infrastructure that is simplified and efficient, and runs on affordable commodity hardware.
 Hadoop is designed to handle the three V's of Big Data: volume, variety, and velocity. First, let's look at volume: Hadoop is a distributed architecture that scales cost-effectively.
 In other words, Hadoop was designed to scale out, and it is much more cost-effective to grow the system this way. As you need more storage or computing capacity, all you need to do is add more nodes to the cluster. Second is variety: Hadoop allows you to store data in any format, structured or unstructured.
 This means that you will not need to alter your data to fit any single schema before putting it into Hadoop. Next is velocity: with Hadoop you can load raw data into the system and then later define how you want to view it.
 Because of this flexibility, you can avoid many of the network and processing bottlenecks associated with loading raw data. Since data is always changing, the flexibility of the system makes it much easier to integrate any changes.
 Hadoop allows you to process massive amounts of data very quickly. Hadoop is known as a distributed processing engine that leverages data locality.
 That means it was designed to execute transformations and processing where the data actually resides. Another benefit, from an analytics perspective, is that Hadoop allows you to load raw data and then define the structure of the data at the time of query ("schema on read"; a small sketch follows this answer). This means that Hadoop is quick, flexible, and able to handle any type of analysis you want to conduct.
 Organizations begin to utilize Hadoop when they need faster processing of large data sets, and they often find it saves the organization money too. Large users of Hadoop include Facebook, Amazon, Adobe, eBay, and LinkedIn. It is also in use throughout the financial sector and the US government. These organizations are a testament to what can be done at internet speed by utilizing big data to its fullest extent.
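
As a toy illustration of the "schema on read" idea mentioned above (the records, field names, and parser are invented; this sketches only the idea of imposing structure at query time, not Hadoop's actual API):

    # Raw lines are stored untouched; structure is imposed only when querying.
    raw_records = [
        "2016-03-01,web,alice,200",
        "2016-03-01,mobile,bob,404",
    ]

    def parse(line):
        date, channel, user, status = line.split(",")
        return {"date": date, "channel": channel,
                "user": user, "status": int(status)}

    # The "schema" lives in parse(), applied at query time, not at load time.
    errors = [r for r in map(parse, raw_records) if r["status"] >= 400]
    print(errors)    # [{'date': '2016-03-01', 'channel': 'mobile', ...}]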

5. Describe the HDFS architecture with a diagram.


6. Explain the concept of MapReduce with a word count example.
7. What is MapReduce? Explain how map and reduce work. What is shuffling in MapReduce?
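A minimal, self-contained simulation of the MapReduce word-count pipeline in plain Python, covering Q6 and Q7. The map step emits a (word, 1) pair per occurrence, the shuffle step groups all values by key, and the reduce step sums each group. Real Hadoop jobs implement Mapper and Reducer classes, typically in Java; the function names and sample input here are illustrative only:

    from collections import defaultdict

    def map_fn(key, line):
        for word in line.split():
            yield (word, 1)                 # map: one (word, 1) per occurrence

    def shuffle(pairs):
        groups = defaultdict(list)          # shuffle: group all values by key
        for k, v in pairs:
            groups[k].append(v)
        return groups.items()

    def reduce_fn(word, counts):
        yield (word, sum(counts))           # reduce: total count per word

    lines = ["the quick brown fox", "the lazy dog"]
    mapped = [p for i, l in enumerate(lines) for p in map_fn(i, l)]
    counts = [r for k, v in shuffle(mapped) for r in reduce_fn(k, v)]
    print(counts)                           # [('the', 2), ('quick', 1), ...]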
8. Give a MapReduce algorithm to perform the relational algebra operations selection, projection, union, and intersection of two sets.
Relational Algebra Operations:
1. Selection.
2. Projection.
3. Union & Intersection.
4. Natural Join.
5. Grouping & Aggregation.
Selection:
 Apply a condition C to each tuple in the relation and produce as output only those tuples that satisfy C.
 The result of this selection is denoted by σC(R).
 Selections really do not need the full power of MapReduce.
 They can be done most conveniently in the map portion alone, although they could also be done in the reduce portion.
 The pseudocode is as follows (rendered as Python; satisfies_C stands for the selection condition C):

    def map_selection(key, tuples):
        for t in tuples:
            if satisfies_C(t):    # keep only tuples that satisfy C
                yield (t, t)

    def reduce_selection(t, values):
        yield (t, t)              # identity: pass the tuple through
Projection:
 For some subset S of the attributes of the relation, produce from each tuple only the components for the attributes in S.
 The result of this projection is denoted by πS(R).
 Projection is performed similarly to selection.
 Because projection may cause the same tuple to appear several times, the reduce function eliminates duplicates.
 The pseudocode for projection is as follows (again as Python; S is the set of attribute positions to keep):
    def map_projection(key, tuples):
        for t in tuples:
            ts = tuple(t[a] for a in S)   # keep only the attributes in S
            yield (ts, ts)

    def reduce_projection(ts, values):
        yield (ts, ts)    # equal tuples share a key, so duplicates collapse
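
The question also asks for union and intersection. A sketch in the same style, following the standard MapReduce formulation; the relation tags "R" and "S" are illustrative:

    def map_union(key, tuples):
        for t in tuples:
            yield (t, t)                      # emit every tuple of R and of S

    def reduce_union(t, values):
        yield (t, t)                  # each distinct tuple is output exactly once

    def map_intersection(relation_name, tuples):
        for t in tuples:
            yield (t, relation_name)  # tag each tuple with its source relation

    def reduce_intersection(t, names):
        if {"R", "S"} <= set(names):  # t appeared in both R and S
            yield (t, t)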

9. What is NoSQL? What are the business drivers for NoSQL? Discuss two different data architecture patterns in NoSQL.
10. What are the different data architecture patterns in NoSQL? Explain the graph store and column family store patterns with relevant examples.
 A data architecture pattern is a consistent way of representing data in a regular
structure that will be stored in memory. Architectural patterns allow you to give
precise names to recurring high level data storage patterns.
 When you suggest a specific data architecture pattern as a solution to a business
problem, you should use a consistent process that allows you to name the pattern,
describe how it applies to the current business problem, and articulate the pros and
cons of the proposed solution. It’s important that all team members have the same
understanding about how a particular pattern solves your problem so that when
implemented, business goals and objectives are met.
Column Family Architectural Pattern:
 Column family systems are important NoSQL data architecture patterns because they
can scale to manage large volumes of data. They’re also known to be closely tied with
many MapReduce systems.
 Column family stores use row and column identifiers as general-purpose keys for data lookup. They're sometimes referred to as data stores rather than databases, since they lack features you may expect to find in traditional databases. For example, they lack typed columns, secondary indexes, triggers, and query languages. Almost all column family stores have been heavily influenced by the original Google Bigtable paper. HBase, Hypertable, and Cassandra are good examples of systems that have Bigtable-like interfaces, although how they're implemented varies.

 Figure: The key structure in column family stores is similar to a spreadsheet but has
two additional attributes. In addition to the column name, a column family is used to
group similar column names together. The addition of a timestamp in the key also
allows each cell in the table to store multiple versions of a value over time.
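
A minimal Python model of the key structure the figure describes, where a lookup goes row key -> column family -> column -> timestamp (the table, families, and data are invented for illustration):

    table = {
        "user:1001": {                                   # row key
            "profile":  {"name": {1457000000: "Asha"}},  # family -> column -> {timestamp: value}
            "activity": {"last_login": {1457000000: "2016-03-03",
                                        1457086400: "2016-03-04"}},  # two versions over time
        },
    }

    def get_latest(row, family, column):
        versions = table[row][family][column]
        return versions[max(versions)]       # highest timestamp wins

    print(get_latest("user:1001", "activity", "last_login"))   # 2016-03-04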
Graph Store Architectural Pattern:
 A graph store is a system that contains a sequence of nodes and relationships that, when combined, create a graph. A graph store has three data fields: nodes, relationships, and properties. Some types of graph stores are referred to as triple stores because of their node-relationship-node structure.
 Graph nodes are usually representations of real-world objects like nouns. Nodes can
be people, organizations, telephone numbers, web pages, computers on a network, or
even biological cells in a living organism. The relationships can be thought of as
connections between these objects and are typically represented as arcs (lines that
connect) between circles in diagrams.
 Graph stores are important in applications that need to analyze relationships between
objects or visit all nodes in a graph in a particular manner (graph traversal). Graph
stores are highly optimized to efficiently store graph nodes and links, and allow you to
query these graphs. Graph databases are useful for any business problem that has
complex relationships between objects such as social networking, rules-based engines,
creating mashups, and graph systems that can quickly analyze complex network
structures and find patterns within these structures.
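
A minimal sketch of the node-relationship-node (triple) structure with its three data fields, plus a simple two-hop traversal; the nodes, relationship names, and properties are invented for illustration:

    triples = [
        ("alice", "FRIEND_OF", "bob"),       # node -- relationship --> node
        ("bob",   "FRIEND_OF", "carol"),
        ("alice", "WORKS_AT",  "acme"),
    ]
    properties = {"alice": {"age": 34}, "acme": {"type": "organization"}}

    def neighbors(node, rel):
        return [dst for src, r, dst in triples if src == node and r == rel]

    # Friends-of-friends for alice: a two-hop graph traversal.
    fof = {f2 for f1 in neighbors("alice", "FRIEND_OF")
              for f2 in neighbors(f1, "FRIEND_OF")}
    print(fof)    # {'carol'}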

11. What is NoSQL? What do you understand by the BASE properties of NoSQL? Explain in detail any one architecture pattern in NoSQL. Identify two applications that use these patterns.

For the answer to what NoSQL is, refer to Q9.


12. What is the role of the JobTracker and TaskTracker in MapReduce? Illustrate the MapReduce execution pipeline with a word count example.
Answer for the MapReduce execution pipeline not found; for the word count flow, see the sketch under Q6 and Q7.
