Part A & B Big Data Questions Final

The document discusses key concepts related to analyzing big data and streaming data. It defines big data, its elements like volume, variety and velocity. It also discusses structured, semi-structured and unstructured data with examples. The document then covers concepts related to streaming data like what is streaming data, time-series vs sequence data, characteristics of streaming data, filters in streams and the FM algorithm. It also provides examples of calculating moments from a data stream.


Unit 1

Part A

1. What is Big Data? Why do we need to analyze Big Data?


Big data is a term that describes the large volume of data – both structured
and unstructured – that inundates a business on a day-to-day basis. But it is
not the amount of data that is important; it is what is done with it. Big data
can be analyzed for insights that lead to better decisions and strategic
business moves.
2. What are the conditions that a random experiment must satisfy?
All possible distinct outcomes are known in advance.
In any particular trial, the outcome is not known in advance, and the
experiment can be repeated under identical conditions.
3. What is statistics?
Statistics is the science of collecting, organizing, analyzing and drawing conclusions from data.
4. Differentiate structured, semi-structured and unstructured data with
examples.
Structured data
Structured data is all data that can be stored in a SQL database, in tables
with rows and columns. It has relational keys and can easily be mapped into
pre-designed fields. Today, structured data is the most processed kind in
application development and the simplest way to manage information.
Examples of structured data: relational database tables, spreadsheets.
Semi-structured data
Semi-structured data is information that doesn't reside in a relational
database but that does have some organizational properties that make it
easier to analyze. With some processing you can store it in a relational
database (this can be very hard for some kinds of semi-structured data); the
semi-structure exists to save space and make the data clearer and easier to
process.
Examples of semi-structured data: CSV files, and XML and JSON documents;
NoSQL databases are also considered semi-structured.
Unstructured data
Unstructured data represents around 80% of all data. It often includes text and
multimedia content. Examples include e-mail messages, word-processing
documents, videos, photos, audio files, presentations, web pages and many
other kinds of business documents. Note that while these sorts of files may
have an internal structure, they are still considered unstructured because the
data they contain doesn't fit neatly in a database.

5. What are the different characteristics of Big Data?

Volume
The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.
Variety
The type and nature of the data. This helps people who analyze it to use the resulting insight effectively.
Velocity
In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
Variability
Inconsistency of the data set can hamper processes to handle and manage it.
Veracity
The quality of captured data can vary greatly, affecting the accuracy of analysis.
6. List any four benefits of Big Data.
Competitive Advantages of Big Data in Business. Big Data means a large
chunk of raw data that is collected, stored and analyzed through various
means which can be utilized by organizations to increase their efficiency and
take better decisions. Big Data can be in both - structured and unstructured
forms.

7. List some elements of Big Data.

Volume – The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.

Variety – The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.

Velocity – In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.

Variability – Inconsistency of the data set can hamper processes to handle and manage it.

Veracity – The quality of captured data can vary greatly, affecting the accuracy of analysis.

8. Write down any four industry examples for Big Data.

Twitter data

Facebook data

PayPal

Flipkart

RFID chips

eBay

Amazon

9. What is a random variable?

A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables: discrete and continuous.

10.Define conditional probability.

The conditional probability of an event B is the probability that the event will
occur given the knowledge that an event A has already occurred. This probability
is written P(B|A), notation for the probability of B given A. In the case where
events A and B are independent (where event A has no effect on the probability of
event B), the conditional probability of event B given event A is simply the
probability of event B, that is P(B).
If events A and B are not independent, then the probability of the intersection of A
and B (the probability that both events occur) is defined by
P(A and B) = P(A)P(B|A).

From this definition, the conditional probability P(B|A) is easily obtained by dividing by P(A):

P(B|A) = P(A and B) / P(A)

Note: This expression is only valid when P(A) is greater than 0.
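
A quick worked example: roll one fair die, and let A = "the outcome is even" and B = "the outcome is greater than 3". Then P(A) = 1/2 and P(A and B) = P({4, 6}) = 2/6 = 1/3, so P(B|A) = (1/3) / (1/2) = 2/3.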

Part B

1. Explain in detail about Big Data.


(what is big data, types and sources of data, elements of big data, types of big
data, benefits of big data, big data technology, big data challenges)

2. Explain in detail about sampling and sampling distribution.

3. Discuss various prediction errors in regression and classification.

4. Explain in detail about arcing classifiers.

5. Discuss about Bagging Predictors.

Unit 2

Part A
1. How is a counting query implemented on a stream? Give an example.

Example: estimate the number of 1s in a window of a bit stream with an error of no more than 50% (this is what the DGIM algorithm provides).

2. Write any two issues in stream processing.

Data arriving out of time order is a problem.

Stream elements should be processed in real time; otherwise we lose the opportunity to process them.

Elements should be processed in main memory where possible, since secondary storage is generally too slow to keep up with the stream.

3. Define sensor data.

Sensor data is the output of a device that detects and responds to some type of
input from the physical environment. The output may be used to provide
information or input to another system or to guide a process.

Eg: Temperature sensor

4. What is meant by a sample? Why is it needed for analysis?

A data sample is a set of data collected and/or selected from a statistical population by a defined procedure. Sampling is needed because the full population (or stream) is often too large to store or process in its entirety.

5. Define Moments.

A moment is a specific quantitative measure, used in both mechanics and statistics, of the shape of a set of points. If the points represent mass, then the zeroth moment is the total mass, the first moment divided by the total mass is the center of mass, and the second moment is the rotational inertia.

6. What is streaming data?

Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes.

7. What are the differences between time-series data and sequence data?

Time-series data: A time-series database consists of sequences of values or events obtained over repeated measurements of time. Eg: atmospheric, temperature and wind measurements.

Sequence data: A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time. Sequential pattern mining is the discovery of frequently occurring ordered events or subsequences as patterns.

Eg: a sequential pattern is "Customers who buy a Canon digital camera are likely to buy an HP color printer within a month."

8. List all the characteristics of streaming data.

The volume of data is extremely high.
Decisions are made in close to real time.

9. What is a filter in a stream?

A filter (such as a Bloom filter) is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
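
As a supplement, a minimal Bloom filter sketch in Python; the bit-array size m, the number of hash functions k, and the way the k hashes are derived (salting SHA-256) are illustrative choices, not part of the original notes:

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m = m             # number of bits in the filter
        self.k = k             # number of hash functions
        self.bits = [0] * m

    def _hashes(self, item):
        # derive k hash positions by salting a standard hash
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means possibly present
        return all(self.bits[pos] for pos in self._hashes(item))

bf = BloomFilter()
bf.add("user42")
print(bf.might_contain("user42"))  # True
print(bf.might_contain("user99"))  # False with high probability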

10. What is the FM algorithm?

The Flajolet–Martin (FM) algorithm approximates the number of distinct objects in a stream or database in one pass.
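
As a supplement, a rough one-hash sketch of the idea in Python (a practical implementation combines many hash functions and averages group medians; the hash choice and function names here are illustrative):

import hashlib

def trailing_zeros(n):
    # number of trailing zero bits in the binary form of n
    if n == 0:
        return 0
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def fm_estimate(stream):
    # R is the maximum number of trailing zeros over all hashed elements;
    # the number of distinct elements is estimated as 2^R
    R = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

print(fm_estimate([3, 1, 4, 1, 3, 4, 2, 1, 2]))  # rough estimate of the 4 distinct values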

Part B

1. Solve the following

(i) Compute the surprise number (second moment) for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2.

What is the third moment of this stream?

For a stream, the k-th (frequency) moment is defined over the occurrence counts of its distinct elements: if m_i is the number of times element i appears, the k-th moment is the sum of (m_i)^k over all distinct i. The second moment is called the surprise number because it measures how uneven the distribution of the stream's elements is.

Occurrence counts for the stream 3, 1, 4, 1, 3, 4, 2, 1, 2:

m(1) = 3, m(2) = 2, m(3) = 2, m(4) = 2

Surprise number (second moment) = 3^2 + 2^2 + 2^2 + 2^2

= 9 + 4 + 4 + 4

= 21 (answer)

Third moment = 3^3 + 2^3 + 2^3 + 2^3

= 27 + 8 + 8 + 8

= 51 (answer)
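
A quick check of both answers, as a minimal Python sketch using only the standard library:

from collections import Counter

stream = [3, 1, 4, 1, 3, 4, 2, 1, 2]
counts = Counter(stream)                       # occurrence count m_i of each distinct element
second = sum(m ** 2 for m in counts.values())  # surprise number -> 21
third = sum(m ** 3 for m in counts.values())   # third moment -> 51
print(second, third)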

(Background: for a general data set, moments are computed over the raw values rather than over occurrence counts. Moments in mathematical statistics involve a basic calculation. These calculations can be used to find a probability distribution's mean, variance and skewness.

Suppose that we have a set of data with a total of n discrete points. One important calculation, which is actually several numbers, is called the sth moment. The sth moment of the data set with values x1, x2, x3, . . . , xn is given by the formula:

(x1^s + x2^s + x3^s + . . . + xn^s)/n

Using this formula requires us to be careful with our order of operations. We need to do the exponents first, then add, then divide this sum by n, the total number of data values.

A NOTE ON THE TERM MOMENT

The term moment has been taken from physics. In physics the moment of a system of point masses is calculated with a formula identical to that above, and this formula is used in finding the center of mass of the points. In statistics the values are no longer masses, but as we will see, moments in statistics still measure something relative to the center of the values.

FIRST MOMENT

For the first moment we set s = 1. The formula for the first moment is thus:

(x1 + x2 + x3 + . . . + xn)/n

This is identical to the formula for the sample mean.

The first moment of the values 1, 3, 6, 10 is (1 + 3 + 6 + 10)/4 = 20/4 = 5.

SECOND MOMENT

For the second moment we set s = 2. The formula for the second moment is:

(x1^2 + x2^2 + x3^2 + . . . + xn^2)/n

The second moment of the values 1, 3, 6, 10 is (1^2 + 3^2 + 6^2 + 10^2)/4 = (1 + 9 + 36 + 100)/4 = 146/4 = 36.5.

THIRD MOMENT

For the third moment we set s = 3. The formula for the third moment is:

(x1^3 + x2^3 + x3^3 + . . . + xn^3)/n

The third moment of the values 1, 3, 6, 10 is (1^3 + 3^3 + 6^3 + 10^3)/4 = (1 + 27 + 216 + 1000)/4 = 1244/4 = 311.

Higher moments can be calculated in a similar way. Just replace s in the above formula with the number denoting the desired moment.)

(ii) Apply partitioning methods to the following bit stream: 1001011011101, and find the number of buckets involved (DGIM; see the sketch after Question 5 below).

2. How are the bit streams divided and stored in different buckets (DGIM)? Explain.

DGIM algorithm – refer to the notes given in class. (A sketch of the bucketing rule appears after Question 5 below.)

3. Explain in detail about stream data architecture (page no. 132).

Diagram

Stream data (sensor data, image data, internet and web traffic data)

Stream processor (standing queries, ad-hoc queries)

Archival storage, limited working storage

4. Explain in detail about the Datar-Gionis-Indyk-Motwani (DGIM) algorithm.

Refer to the notes given in class.

5. Devise an algorithm to implement counting problems for streams (DGIM).

Refer to the notes given in class.
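
To complement the class notes, here is a minimal Python sketch of DGIM bucket maintenance. It assumes the usual textbook rule that at most two buckets of each power-of-two size are kept; the representation (a list of (end timestamp, size) pairs, newest first) and the function names are illustrative choices:

def add_bit(buckets, t, bit, N):
    # Process one bit arriving at time t for a window of size N.
    # buckets is a list of (end_timestamp, size) pairs, newest first.
    while buckets and buckets[-1][0] <= t - N:
        buckets.pop()                 # drop buckets that left the window
    if bit == 0:
        return
    buckets.insert(0, (t, 1))         # every 1 starts a new bucket of size 1
    size = 1
    while True:
        idx = [i for i, (_, s) in enumerate(buckets) if s == size]
        if len(idx) <= 2:
            break                     # at most two buckets of each size
        i, j = idx[-2], idx[-1]       # the two oldest buckets of this size
        buckets[i] = (buckets[i][0], size * 2)  # merge, keeping the newer end time
        del buckets[j]
        size *= 2                     # a merge can cascade to the next size

def estimate_ones(buckets):
    # Every bucket counts fully except the oldest, which counts half.
    if not buckets:
        return 0
    return sum(s for _, s in buckets) - buckets[-1][1] // 2

buckets = []
for t, bit in enumerate("1001011011101", start=1):
    add_bit(buckets, t, int(bit), N=100)
print(buckets)                 # [(13, 1), (11, 1), (10, 2), (7, 4)]
print(estimate_ones(buckets))  # 6 (the true count of 1s is 8)

Run on the bit stream of Question 1(ii), this variant ends with four buckets, of sizes 4, 2, 1 and 1.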

6. Discuss the Alon-Matias-Szegedy Algorithm for second moments.

The general background on moments of a data set is given in the note under Part B Question 1 of this unit; the first and second moment formulas there, (x1 + x2 + . . . + xn)/n and (x1^2 + x2^2 + . . . + xn^2)/n, carry over directly.
For a data stream, the second moment is taken over the occurrence counts of the distinct elements: if m_i is the number of times element i appears, the second moment (the surprise number) is the sum of (m_i)^2 over all distinct elements. The Alon-Matias-Szegedy (AMS) algorithm estimates this second moment using the formula:

Second moment ≈ E(n * (2 * X.value - 1))

in which X is an element of the stream chosen at a position selected uniformly at random, and X.value is a counter that, as we read the stream, is increased by 1 each time we encounter another occurrence of X's element from the time we selected it.

n represents the length of the data stream, and "E" denotes the mean over the selected positions.

Example

Let's assume we selected at random "a" at the 13th position of the data stream, "d" at the 8th and "c" at the 3rd. We never selected "b".

Stream:   a b c b d a c d a b d c a a b
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

(The selected positions are 3, 8 and 13.)

"a" is selected at the 13th position.

So, from this position on we count the occurrences of "a", i.e. a = 2.
It is written as
X.element = "a", X.value = 2

"c" is selected at the 3rd position.

So, from this position on we count the occurrences of "c", i.e. c = 3.
It is written as
X.element = "c", X.value = 3

"d" is selected at the 8th position.

So, from this position on we count the occurrences of "d", i.e. d = 2.
It is written as
X.element = "d", X.value = 2

The estimate by the AMS algorithm is:

(15*(2*2 - 1) + 15*(2*3 - 1) + 15*(2*2 - 1)) / 3 = (45 + 75 + 45) / 3 = 55

(For comparison, the exact second moment of this stream is 5^2 + 4^2 + 3^2 + 3^2 = 59, so 55 is a close estimate.)
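
A small Python sketch of this estimator; the function name and drawing the k sample positions up front (rather than by reservoir sampling as the stream arrives) are simplifications:

import random
from collections import Counter

def ams_second_moment(stream, k):
    n = len(stream)
    positions = random.sample(range(n), k)   # k positions chosen uniformly at random
    estimates = []
    for p in positions:
        element = stream[p]
        value = stream[p:].count(element)    # X.value: occurrences from p onward, inclusive
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / k                # average the per-position estimates

stream = list("abcbdacdabdcaab")
print(ams_second_moment(stream, 3))                  # 55.0 if positions 3, 8 and 13 are drawn
print(sum(m * m for m in Counter(stream).values()))  # exact second moment: 59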

7. Explain in detail about the stream data model (refer to Question 3).

8. Discuss prediction of stock market movements from stream data time-series patterns (FM algorithm).

Refer to the notes given in class.

Unit 3

Part A

1. What are the differences between Hadoop Version 1 and Hadoop Version 2?
Version 1 – default block size of 64 MB; only one NameNode (a single point of failure); batch rather than real-time processing; parallel processing for a single kind of job (MapReduce only).
Version 2 – default block size of 128 MB; several NameNodes (federation and high availability); near-real-time processing; YARN supports parallel processing for multiple kinds of jobs.

2. Define HDFS.
HDFS: the Hadoop Distributed File System (stores data in a Hadoop system). It runs on commodity hardware. Unlike some other distributed systems, HDFS is highly fault tolerant and designed for low-cost hardware. HDFS holds very large amounts of data and provides easy access.
3. Write the features of HDFS.
Data availability
High availability
Speculative execution
Partitioning
Supporting partial failure

4. List out the roles of the Job Tracker and Task Tracker.

Job Tracker – schedules the jobs (refer to notes).
Task Tracker – processes the jobs (refer to notes).
5. What are NameNode and DataNode?
Name Node

1. NameNode is the centerpiece of HDFS.


2. NameNode is also known as the Master
3. NameNode only stores the metadata of HDFS – the directory tree of all files in
the file system, and tracks the files across the cluster.
4. NameNode does not store the actual data or the dataset. The data itself is
actually stored in the DataNodes.
5. NameNode knows the list of the blocks and its location for any given file in
HDFS. With this information NameNode knows how to construct the file from
blocks.
6. NameNode is so critical to HDFS and when the NameNode is down,
HDFS/Hadoop cluster is inaccessible and considered down.
7. NameNode is a single point of failure in Hadoop cluster.
8. NameNode is usually configured with a lot of memory (RAM), because the
block locations are held in main memory.
Data Node

1. DataNode is responsible for storing the actual data in HDFS.


2. DataNode is also known as the Slave
3. NameNode and DataNode are in constant communication.
4. When a DataNode starts up it announces itself to the NameNode along with
the list of blocks it is responsible for.
5. When a DataNode is down, it does not affect the availability of data or the
cluster. NameNode will arrange for replication for the blocks managed by
the DataNode that is not available.
6. DataNode is usually configured with a lot of hard disk space, because the
actual data is stored in the DataNode.

7. What is the purpose of using MapReduce?


 Processing unit of Hadoop.
 MapReduce is a programming model that allows us to perform
parallel and distributed processing on huge data sets.
8. What is a Combiner?
A combiner is a mini-reducer that is used to reduce the work of the reducer.
It runs on the output of the mapper functions.
9. What are the differences between batch processing and real-time processing?
Batch data processing is an efficient way of processing high volumes of data
in which a group of transactions is collected over a period of time. Data is
collected, entered and processed, and then the batch results are produced
(Hadoop is focused on batch data processing). Batch processing requires
separate programs for input, process and output. Examples are payroll and
billing systems.

In contrast, real time data processing involves a continual input, process and
output of data. Data must be processed in a small time period (or near real
time). Radar systems, customer services and bank ATMs are examples.
10. What is high availability?
High availability means HDFS stays usable even if the active NameNode fails:
namespace metadata is transferred between the NameNode and a secondary/standby
NameNode so that another node can take over the service quickly.

11. Define speculative execution.

Hadoop may launch duplicate (speculative) copies of slow-running tasks on other
nodes. When tasks complete, they announce this fact to the JobTracker. Whichever
copy of a task finishes first becomes the definitive copy; if other copies were
executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks
and discard their outputs.

Part B

1. i) Discuss the design of the HDFS structure. (8)

ii) Show how a client reads and writes data in HDFS. (8)
2. List all the basic Unix commands involved in the Hadoop system and explain
with a suitable example.

3. Explain the implementation of the MapReduce concept with the word count
example (a minimal Hadoop Streaming sketch follows this list).

4. Discuss the history of the Hadoop system.


5. With a neat diagram, explain the anatomy of a MapReduce job run.
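
For Question 3, a minimal word count sketch using Hadoop Streaming; the file names mapper.py and reducer.py and the HDFS paths are illustrative, and Streaming is only one of several ways to run a MapReduce job:

#!/usr/bin/env python3
# mapper.py - emit "word<TAB>1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - input arrives sorted by word, so counts can be summed in one pass
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")

A job is then submitted along these lines (the jar location varies by installation):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py -reducer reducer.py \
    -input /user/input -output /user/output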

UNIT 4
Part A

1. Why does Hadoop not use RAID?

1. Speed is limited to that of the slowest disk in the array.
2. Reliability: HDFS already replicates blocks across nodes, and JBOD performs better.

2. What is the master node?

This node manages all services and operations. One master node is enough for a
cluster, but having a secondary one increases scalability and high availability.
The main operation the master node performs is running the NameNode process,
which coordinates Hadoop storage operations.

3. What are slave nodes?

These nodes provide the required infrastructure, such as CPU, memory and local
disk, for storing and processing data. They run all the slave processes, the
main one being the DataNode process. Generally a cluster comprises at least
three slave nodes, but it can easily be scaled up by adding more slave nodes.

4.What is Cluster membership?

The user can specify a file containing a list of authorized machines that may join
the cluster as datanodes or node managers. The file is specified using the dfs.hosts
and yarn.resourcemanager.nodes.include-path properties (for datanodes and node
managers, respectively), and the corresponding dfs.hosts.exclude and
yarn.resourcemanager.nodes.exclude-path properties specify the files used for
decommissioning.

5. What is the default buffer size used in the Hadoop structure?

Hadoop uses a buffer size of 4 KB (4,096 bytes) for its I/O operations. This is a
conservative setting, and with modern hardware and operating systems users will
likely see performance benefits by increasing it; 128 KB (131,072 bytes) is a
common choice. Set the value in bytes using the io.file.buffer.size property in
core-site.xml.
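
For example, raising the buffer to 128 KB in core-site.xml might look like this (a sketch; the enclosing <configuration> element is assumed):

<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
  <description>I/O buffer size in bytes (128 KB).</description>
</property>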

6. Describe the HDFS block size.

128 MB default block size in version 2.

64 MB default block size in version 1.

7. What is reserved storage space?

By default, datanodes try to use all of the space available in their storage
directories. To reserve some space on the storage volumes for non-HDFS use, set
dfs.datanode.du.reserved to the amount, in bytes, of space to reserve.
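
For instance, reserving 10 GB per volume in hdfs-site.xml could look like the following sketch (the value shown is arbitrary):

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>
  <description>Reserve 10 GB per storage volume for non-HDFS use.</description>
</property>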
8. What operations does the master daemon perform while it runs?

• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as renaming, closing and opening files and directories.
9. What are the minimum requirements for implementing a cluster?

Processor: two hex/octo-core 3 GHz CPUs
Memory: 64-512 GB ECC RAM
Storage: 12-24 × 1-4 TB SATA disks
Network: gigabit Ethernet with link aggregation
10. List any five other properties in the Hadoop system.
Job scheduler
Reduce slow start
Short-circuit local reads
Buffer size
Reserved storage space
Trash

Part B

1. Discuss various properties of Hadoop System.

2. Explain in detail about the Hadoop environment.

3. Explain in detail the procedure for running a Hadoop cluster.

4. Discuss the NameNode, Secondary NameNode and DataNode directory structures
with a neat diagram.
5. Explain how the Hadoop cluster architecture consists of a two-level network topology.
6. How are the variables in the Hadoop environment set? Explain.
7. Devise a Hadoop environment with 2 NameNodes and 10 DataNodes.
8. Discuss some other properties used in setting up a Hadoop cluster.

Unit 5

Part A

1. How does HBase support bulk data loading?

There are two main steps to a bulk data load in HBase.

 Generate HBase data files (StoreFiles) using a custom MapReduce job from
the data source. The StoreFile is created in HBase's internal format, which can
be loaded efficiently.
 Import the prepared files into a running cluster using another tool, such as
completebulkload. Each file gets loaded into one specific region.

2. What are the scalar datatypes in Pig?

Scalar datatypes:
int – 4 bytes
float – 4 bytes
double – 8 bytes
long – 8 bytes
chararray
bytearray

3. How does Pig differ from MapReduce?

In MapReduce, the group-by operation is performed on the reducer side, while
filtering and projection can be implemented in the map phase. Pig Latin provides
standard operations similar to MapReduce, such as order-by, filter and group-by.
With a Pig script we can analyze the data flow, and it is easier to find errors
early. Pig Latin is much lower cost to write and maintain than Java code for
MapReduce.

4. Name three disadvantages HBase has as compared to an RDBMS.

 HBase does not have a built-in authentication/permission mechanism.
 Indexes can be created only on the key column, whereas in an RDBMS an index
can be created on any column.
 With one HMaster node there is a single point of failure.

5. Define merge() in Hive.

MERGE lets a target table be updated, inserted into or deleted from based on a join with a source table. It requires an ACID (transactional) table, for example:

CREATE DATABASE merge_data;

CREATE TABLE merge_data.transactions(

ID int,
TranValue string,

last_update_user string)

PARTITIONED BY (tran_date string)

CLUSTERED BY (ID) INTO 5 BUCKETS

STORED AS ORC TBLPROPERTIES ('transactional'='true');
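
A typical MERGE statement against this table might then look like the sketch below, where merge_data.merge_source is a hypothetical source table with matching columns:

MERGE INTO merge_data.transactions AS T
USING merge_data.merge_source AS S
ON T.ID = S.ID AND T.tran_date = S.tran_date
WHEN MATCHED THEN
  UPDATE SET TranValue = S.TranValue, last_update_user = 'merge_update'
WHEN NOT MATCHED THEN
  INSERT VALUES (S.ID, S.TranValue, 'merge_insert', S.tran_date);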

6. What is Local Mode in Pig?

All the files are installed and run from the local host and the local file
system; there is no need for Hadoop or HDFS, and it runs in a Linux environment.
This mode is generally used for testing purposes.

7. What is HBase?
HBase is a data model, similar to Google's Bigtable, designed to provide quick
random access to huge amounts of structured data. It is set up on top of the
Hadoop file system, is used interactively through the HBase shell, and supports
connections and basic operations from Java.

8. What are the basic operations involved in ZooKeeper?

CREATE: you can create a child node.
READ: you can get data from a node and list its children.
WRITE: you can set data for a node.
DELETE: you can delete a child node.
ADMIN: you can set permissions.

9. Define LOAD in Pig.

It is used to load data from a given environment, e.g.:

A = LOAD 'd:/emp' USING PigStorage(',') AS (no:int, name:chararray);

10. Write a query to display 5 records from a database using Pig.

A = LOAD 'd:/emp' USING PigStorage(',') AS (no:int, name:chararray);

B = LIMIT A 5;
DUMP B;
11. How does Pig differ from an RDBMS?

Pig: Pig Latin is a procedural language.
RDBMS: SQL is a declarative language.

Pig: In Apache Pig, the schema is optional; we can store data without designing a schema (values are referenced positionally as $0, $1, etc.).
RDBMS: The schema is mandatory in SQL.

Pig: The data model in Apache Pig is nested relational.
RDBMS: The data model used in SQL is flat relational.

Pig: Apache Pig provides limited opportunity for query optimization.
RDBMS: There is more opportunity for query optimization in SQL.

12. What is the metastore?

The metastore is the collection of all the Hive metadata, and it has backup
services to back up the metastore information. The service runs in the same JVM
as the Hive service. The structural information of tables, their columns and
column types, and likewise the partition structure, is stored in it.

Part B
1. Give an overview of
(i) the characteristics of HBase
(ii) the services of HBase.
2. Implementation of ZooKeeper services.
3. Explain in detail about Hive architecture.
4. Discuss various functions in Pig.
5. Discuss Hive metastore configurations with a neat diagram.
6. Write down all the queries involved in Pig and explain them.
7. Implementation of the HBase data model.
