SL-VI Assignment
Assignment No.3
Design and develop a distributed application to find the coolest/hottest year
from the available weather data. Use weather data from the Internet and process it using
MapReduce.
Questions:
1) How does Hadoop MapReduce work?
2) Explain what shuffling is in MapReduce.
3) Explain what the distributed cache is in the MapReduce framework.
4) Explain what the NameNode is in Hadoop.
5) Explain what the JobTracker is in Hadoop. What actions does Hadoop follow?
6) Explain what a heartbeat is in HDFS.
7) Explain what combiners are and when you should use a combiner in a MapReduce job.
8) Explain what happens in TextInputFormat.
9) Mention the main configuration parameters that the user needs to specify to run a
MapReduce job.
10) Explain what SequenceFileInputFormat is.
11) Explain what conf.setMapperClass does.
Assignment No.4
Questions:
1) What is Hive and HiveQL?
2) What are the different components of Hive architecture?
3) What is the use of Partitions and Bucketing in Hive?
4) What kinds of joins are supported by Hive?
5) Write the syntax for creating a table and loading data into a table using Hive.
6) Where is table data stored in Apache Hive by default?
7) When executing Hive queries in different directories, why is metastore_db created in all
places from where Hive is launched?
8) How will you read and write HDFS files in Hive?
9) Differentiate between DESCRIBE and DESCRIBE EXTENDED.
10) Explain SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY in Hive.
11) What is the difference between HBase and Hive?
12) What is the default database provided by Hive for Metastore?
13) What is Apache HBase and what is it used for?
14) Name the key components of HBase.
15) What is the use of the get() method?
16) What is the reason for using HBase?
17) Define column families in HBase.
18) Define standalone mode in HBase.
19) What is a RegionServer?
20) What is the use of MasterServer and HMaster?
21) What are the operational commands of HBase?
22) Which command is used to run HBase Shell?
23) What is the use of ZooKeeper?
Assignment 1 PART-B
1) What is R?
2) Explain data import in the R language.
3) Two vectors X and Y are defined as follows: X <- c(3, 2, 4) and Y <- c(1, 2). What will be
the output of vector Z, which is defined as Z <- X*Y?
4) How are missing values and impossible values represented in the R language?
5) How many data structures does the R language have?
6) What is the value of f(2) for the following R code?
b <- 4
f <- function(a) {
  b <- 3
  b^3 + g(a)
}
g <- function(a) {
  a * b
}
Assignment 2 PART-B
1) What do you understand by the term data cleaning?
2) What is Data Integration?
3) What are the benefits of data integration?
4) Are data integration and ETL programming the same?
5) Mention the responsibilities of a data analyst.
6) Mention what data cleansing is.
7) List some of the best tools that can be useful for data analysis.
8) List some common problems faced by data analysts.
9) Name the functions included in the packages caret, e1071, caTools, class and gmodels.
10) How to handle missing values?
Ans:
o Mean imputation for continuous variables (no outliers)
o Median imputation for continuous variables (if outliers are present)
o Cluster imputation for continuous variables
o Imputation with a random value drawn between the minimum and maximum of the
variable [Random value = min(x) + (max(x) - min(x)) * ranuni(SEED)]
o Imputing continuous variables with zero (requires business knowledge)
o Conditional mean imputation for continuous variables
o Other imputation methods for continuous variables: predictive mean matching, Bayesian
linear regression, linear regression ignoring model error, etc.
o WOE (weight of evidence) for missing values in categorical variables
o Decision tree, random forest and logistic regression for categorical variables
o Decision tree and random forest, which work for both continuous and categorical
variables
o Multiple imputation method
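The first, second and fourth methods in the list above can be sketched in R as follows. This is a minimal illustration, not a definitive implementation: the function name impute_missing, its arguments and the sample vector are hypothetical. The ranuni(SEED) term in the formula appears to be a SAS-style uniform random generator; the sketch assumes that R's runif() after set.seed() is an acceptable stand-in.

```r
# Hypothetical helper: fills NA values in a numeric vector using one of
# three simple strategies from the list above.
impute_missing <- function(x, method = c("mean", "median", "random"), seed = 42) {
  method <- match.arg(method)
  missing <- is.na(x)
  if (!any(missing)) {
    return(x)  # nothing to impute
  }
  observed <- x[!missing]
  fill <- switch(method,
    # Mean imputation: reasonable when there are no strong outliers
    mean = mean(observed),
    # Median imputation: more robust when outliers are present
    median = median(observed),
    # Random value = min(x) + (max(x) - min(x)) * U, with U uniform on (0, 1).
    # Assumption: runif() replaces ranuni(SEED); set.seed() resets the
    # global random number generator so the result is reproducible.
    random = {
      set.seed(seed)
      min(observed) + (max(observed) - min(observed)) * runif(sum(missing))
    }
  )
  x[missing] <- fill
  x
}

x <- c(2, 4, NA, 6, 100, NA)          # 100 acts as an outlier
x_mean   <- impute_missing(x, "mean")    # NA replaced by 28
x_median <- impute_missing(x, "median")  # NA replaced by 5
x_random <- impute_missing(x, "random")  # NA replaced by draws between 2 and 100
```

With the outlier 100 in the sample vector, the mean (28) is pulled far above most observed values while the median (5) is not, which is why the list recommends the median when outliers are present.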
11) How do you create log-linear models in the R language?
12) What is meant by K-nearest neighbor? Explain with an example.
13) Write a function in the R language to replace the missing values in a vector with the mean of that
vector.
14) Which function is used to create a histogram visualization in the R programming language?
Assignment 3 PART-B
1) What is the best way to use Hadoop and R together for analysis?
2) Which function is used for text mining in R?
3) Which package is used to create a word cloud in R?
4) What are supervised and unsupervised learning?
5) How do you run a MapReduce program in R?
6) What are the different techniques used to process text data?
Assignment 4 PART-B
1) How to create a Scatter Plot?
2) How to create a Histogram?
3) How to create a Bar Chart?
4) How to create a Stacked Bar Chart?
5) How to create a Box Plot?
6) How to create an Area Chart?
7) How to create a Heat Map?
8) How to create a Correlogram?
9) How to plot a geographical map?
10) How to plot the entire data in a single command?
Assignment 5 PART-B
1) What is Data Visualization?
2) What are the differences between Tableau desktop and Tableau Server?
3) Define parameters in Tableau and their working.
4) Differentiate between parameters and filters in Tableau.
5) What are fact tables and dimension tables in Tableau?
6) What is Data Blending?
7) What is the maximum number of tables you can join in Tableau?
8) What different products does Tableau provide?