Part A & B Big Data Questions Final
Part A
Volume - The quantity of data generated and stored. The size of the data determines its value and potential insight, and whether it can actually be considered big data or not.
Variety - The type and nature of the data. Knowing the variety helps those who analyze the data use the resulting insight effectively.
Velocity - In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
Variability - Inconsistency of the data set, which can hamper processes that handle and manage it.
Veracity - The quality of captured data, which can vary greatly and affects the accuracy of analysis.
Twitter data
Facebook data
PayPal
Flipkart
RFID chip
eBay
Amazon
The conditional probability of an event B is the probability that the event will
occur given the knowledge that an event A has already occurred. This probability
is written P(B|A), notation for the probability of B given A. In the case where
events A and B are independent (where event A has no effect on the probability of
event B), the conditional probability of event B given event A is simply the
probability of event B, that is P(B).
If events A and B are not independent, then the probability of the intersection of A
and B (the probability that both events occur) is defined by
P(A and B) = P(A)P(B|A).
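The multiplication rule above can be checked numerically. A small sketch using a toy sample space of two fair dice (the events A and B below are illustrative choices, not from the original text):

```python
# A small numeric illustration of the rule P(A and B) = P(A) * P(B|A),
# using a toy sample space of two fair dice (all outcomes equally likely).
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely rolls

A = {o for o in outcomes if o[0] == 6}            # event A: first die shows 6
B = {o for o in outcomes if sum(o) >= 10}         # event B: total is at least 10

p_A = len(A) / len(outcomes)                      # 6/36
p_A_and_B = len(A & B) / len(outcomes)            # (6,4),(6,5),(6,6): 3/36
p_B_given_A = p_A_and_B / p_A                     # 0.5, i.e. P(B|A)

# multiplication rule holds: P(A and B) = P(A) * P(B|A)
```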
Part B
Unit 2
Part A
1. How is a count process implemented on a stream query? Give an example.
Estimate the number of 1s in the window with an error of no more than 50%.
Stream elements should be processed in real time; otherwise we lose the opportunity to process them.
Sensor data is the output of a device that detects and responds to some type of
input from the physical environment. The output may be used to provide
information or input to another system or to guide a process.
5. Define Moments.
7. What are the differences between time-series data and sequence data?
Time-series data: A time-series database consists of sequences of values or events obtained over repeated measurements of time. Eg: atmospheric, temperature, and wind data.
Sequence data: A sequence database consists of sequences of ordered events, with or without a concrete notion of time. Eg: a sequential pattern is "Customers who buy a Canon digital camera are likely to buy an HP color printer within a month."
Volume of data is extremely high.
Decisions are made in close to real time.
Part B
n-th moment (worked example)
Here N = 9 and n = 3; the cubes below correspond to the values 3, 1, 4, 1, 3, 4, 2, 1, 2.
= (27 + 1 + 64 + 1 + 27 + 64 + 8 + 1 + 8)/9
= 201/9
= 22.3 (answer)
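The worked example above can be written as a short sketch (the data values are read back from the cubes in the example: 27 = 3^3, 64 = 4^3, 8 = 2^3, 1 = 1^3):

```python
# The n-th moment is the average of the n-th powers of the values.

def nth_moment(values, n):
    """Average of the n-th powers of the values."""
    return sum(x ** n for x in values) / len(values)

data = [3, 1, 4, 1, 3, 4, 2, 1, 2]   # N = 9 elements
third = nth_moment(data, 3)          # (27+1+64+1+27+64+8+1+8)/9 = 201/9
# third is approximately 22.3, matching the worked example
```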
Suppose that we have a set of data with a total of n discrete points. One important calculation, which is actually several numbers, is called the sth moment. The sth moment of the data set with values x1, x2, x3, . . . , xn is given by the formula:
(x1^s + x2^s + x3^s + . . . + xn^s)/n
Using this formula requires us to be careful with our order of operations. We need to do the exponents first, then add, then divide this sum by n, the total number of data values.
The term moment has been taken from physics. In physics the moment of a system
of point masses is calculated with a formula identical to that above, and this
formula is used in finding the center of mass of the points. In statistics the values
are no longer masses, but as we will see, moments in statistics still measure
something relative to the center of the values.
FIRST MOMENT
For the first moment we set s = 1. The formula for the first moment is thus:
(x1 + x2 + x3 + . . . + xn)/n
This is identical to the formula for the sample mean.
SECOND MOMENT
For the second moment we set s = 2. The formula for the second moment is:
(x1^2 + x2^2 + x3^2 + . . . + xn^2)/n
THIRD MOMENT
For the third moment we set s = 3. The formula for the third moment is:
(x1^3 + x2^3 + x3^3 + . . . + xn^3)/n
Higher moments can be calculated in a similar way: just replace s in the above formula with the number denoting the desired moment.
2. How are bit streams divided and stored in different buckets? Explain the DGIM algorithm.
Stream data (sensor data, image data, internet and web traffic data)
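A minimal sketch of DGIM bucket maintenance, assuming bits arrive one at a time (the class and method names are illustrative). Each bucket is stored as a (timestamp of its most recent 1, size) pair, sizes are powers of two, and at most two buckets of each size are kept; when a third appears, the two oldest of that size are merged:

```python
class DGIM:
    def __init__(self, window_size):
        self.N = window_size
        self.t = 0                  # current timestamp
        self.buckets = []           # newest first: (timestamp, size)

    def add_bit(self, bit):
        self.t += 1
        # drop buckets that have slid out of the window
        self.buckets = [(ts, sz) for ts, sz in self.buckets
                        if ts > self.t - self.N]
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            self._merge()

    def _merge(self):
        size = 1
        while True:
            same = [i for i, (_, sz) in enumerate(self.buckets) if sz == size]
            if len(same) <= 2:
                break
            i, j = same[-2], same[-1]        # the two oldest of this size
            ts = self.buckets[i][0]          # keep the more recent timestamp
            del self.buckets[j]
            self.buckets[i] = (ts, size * 2)
            size *= 2

    def count_ones(self):
        # all bucket sizes, but only half of the oldest bucket is counted
        if not self.buckets:
            return 0
        return sum(sz for _, sz in self.buckets[:-1]) + self.buckets[-1][1] // 2

# demo: 50 ones in a window of 100; the estimate is within 50% of the truth
d = DGIM(100)
for _ in range(50):
    d.add_bit(1)
estimate = d.count_ones()
```

Counting only half of the oldest bucket is what gives the "error of no more than 50%" guarantee mentioned in Part A.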
SECOND MOMENT
For the second moment we set s = 2. The formula for the second moment is:
(x1^2 + x2^2 + x3^2 + . . . + xn^2)/n
Here n represents the length of the data stream, and "E" is the notation for the mean (expectation).
Example
Suppose we randomly selected "a" at the 13th position of the data stream, "d" at the 8th, and "c" at the 3rd. We have not selected "b".
a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
(selected positions: 3 -> c, 8 -> d, 13 -> a)
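The AMS (Alon-Matias-Szegedy) second-moment estimate, applied to the example stream above as a sketch: each sampled variable remembers the element at its position and counts how often that element appears from the position to the end of the stream (c); the estimate is n * (2c - 1):

```python
stream = list("abcbdacdabdcaab")           # the 15-element stream above
n = len(stream)

def ams_estimate(stream, position):
    """position is 1-based, as in the example."""
    elem = stream[position - 1]
    c = stream[position - 1:].count(elem)  # occurrences from position onward
    return len(stream) * (2 * c - 1)

estimates = [ams_estimate(stream, p) for p in (3, 8, 13)]   # 75, 45, 45
average = sum(estimates) / len(estimates)                   # 55.0

# The true second moment is 5**2 + 4**2 + 3**2 + 3**2 = 59
# (counts a:5, b:4, c:3, d:3), so the average estimate 55 is close.
```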
8. Discuss prediction of the stock market from stream-data time-series patterns (FM algorithm).
Unit 3
Part A
1. What are the differences between Hadoop Version 1 and Hadoop Version 2?
Version 1 - default block size 64 MB, only one NameNode, near real-time processing, supports parallel processing for a single process.
Version 2 - default block size 128 MB, multiple NameNodes (federation), real-time processing, supports parallel processing for multiple processes.
2. Define HDFS.
HDFS: Hadoop Distributed File System, which stores data in the Hadoop system. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and designed using low-cost hardware. HDFS holds very large amounts of data and provides easy access.
3. Write the features of HDFS.
Data availability
High availability
Speculative execution
Partitioning
Supporting partial failure
Batch processing works on blocks of data that have been collected and stored over a period of time. In contrast, real-time data processing involves a continual input, process, and output of data. Data must be processed in a small time period (or near real time). Radar systems, customer services, and bank ATMs are examples.
10. What is high availability?
High availability in HDFS means running a pair of NameNodes in an active-standby configuration, so that if the active NameNode fails, the standby takes over and clients see no significant interruption. (The secondary NameNode alone does not provide high availability; it only merges the namespace image with the edit log.)
Part B
UNIT 4
Part A
The user can specify a file containing a list of authorized machines that may join
the cluster as datanodes or node managers. The file is specified using the dfs.hosts
and yarn.resourcemanager.nodes.include-path properties (for datanodes and node
managers, respectively), and the corresponding dfs.hosts.exclude and
yarn.resourcemanager.nodes.exclude-path properties specify the files used for
decommissioning.
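As a sketch, these properties can be set in hdfs-site.xml (the file paths below are example values, not defaults):

```xml
<!-- hdfs-site.xml: include/exclude files for datanodes (paths are examples) -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/include.txt</value>
</property>
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/exclude.txt</value>
</property>
```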
Hadoop uses a buffer size of 4 KB (4,096 bytes) for its I/O operations. This is a conservative setting, and with modern hardware and operating systems, users will likely see performance benefits by increasing it; 128 KB (131,072 bytes) is a common choice. Set the value in bytes using the io.file.buffer.size property in core-site.xml.
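For example, in core-site.xml (a sketch of the property described above):

```xml
<!-- core-site.xml: raise the I/O buffer from the default 4 KB to 128 KB -->
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>
```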
Processor: Two hex/octo-core 3 GHz CPUs
Memory: 64-512 GB ECC RAM
Storage: 12-24 x 1-4 TB SATA disks
Network: Gigabit Ethernet with link aggregation
10. List any five other properties in the Hadoop system.
Job scheduler
Reduce slow start
Short-circuit local reads
Buffer size
Reserved storage space
Trash
Part B
Unit 5
Part A
(Fragment of a Hive CREATE TABLE column list:)
ID int,
TranValue string,
last_update_user string)
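This fragment appears to be the tail of a Hive CREATE TABLE statement; a hedged reconstruction (the table name is not shown in the original, so example_table is a placeholder):

```sql
-- table name is assumed; only the column list appears in the original
CREATE TABLE example_table (
  ID int,
  TranValue string,
  last_update_user string);
```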
7. What is HBase?
HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It is set up on top of the Hadoop file system, can be used interactively through the HBase shell, and supports basic operations from Java.
Pig vs. RDBMS
12. What is metastore?
The metastore holds all of Hive's metadata, and backup services can be used to back up the metastore information. By default the metastore service runs in the same JVM as the Hive service. It stores the structural information of tables, their columns and column types, and likewise the partition structure information.
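As an illustration of metastore configuration, a remote metastore database can be pointed at with Hive's standard JDO connection properties in hive-site.xml (the host and database names below are placeholders):

```xml
<!-- hive-site.xml: example remote-metastore settings (values are placeholders) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```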
Part B
1. Give an overview of:
(i) Characteristics of HBase
(ii) Services of HBase.
2. Implementation of ZooKeeper services.
3. Explain in detail about Hive architecture.
4. Discuss various functions in PIG.
5. Discuss Hive Metastore configurations with neat diagram.
6. Write down all the queries involved in Pig and explain.
7. Implementation of the HBase data model.