BIG DATA ANALYTICS (BDA)

COURSE CODE: 21EC3084

LABORATORY WORKBOOK

STUDENT ID: ACADEMIC YEAR: 2023-24


STUDENT NAME :
TABLE OF CONTENTS

#00 Reading Big Data in R
#01 Exploring Digital Data Types
#02 Analyzing Big Data with Hadoop Streaming
#03 Exploring the Hadoop Ecosystem
#04 IBM Big Data Strategy and InfoSphere BigInsights
#05 Exploring HDFS Concepts and Command Line Interface
#06 Data Ingestion with Flume and Sqoop
#07 Hadoop I/O: Compression and Serialization
#08 Understanding MapReduce Job Anatomy
#09 Working with Hive and HiveQL
#10 Supervised Learning in Machine Learning
#11 Unsupervised Learning and Collaborative Filtering
#12 Big Data Analytics with R
#13 Practice importing and exporting data from various databases

Course Title BIG DATA ANALYTICS (BDA) Semester: 2023-24 EVEN SEM
Course Code(s) 21EC3084 Page 1 of 19
A.Y. 2023-24 LAB CONTINUOUS EVALUATION

S.No | Date | Experiment Name | Pre-Lab (10M) | In-Lab (25M): Program/Procedure (5M) | In-Lab (25M): Data and Results (10M) | In-Lab (25M): Analysis & Inference (10M) | Post-Lab (10M) | Viva Voce (5M) | Total (50M) | Faculty Signature
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.

Experiment # <TO BE FILLED BY STUDENT> Student ID
Date <TO BE FILLED BY STUDENT> Student Name

Experiment: Introductory Session:

R is a powerful programming language and computing environment designed for statistical
computing, data analysis, and visualization. It was developed at the University of Auckland
in New Zealand in the early 1990s by Ross Ihaka and Robert Gentleman. Because it is open
source and freely available, R has a large following in both academia and industry.

Key features of R include:

Data manipulation and analysis: R provides a wide range of tools and packages for data
manipulation, cleaning, transformation, and statistical analysis, including functions for
handling structured, unstructured, and time-series data.

Statistical modelling and machine learning: R is used extensively for statistical modelling,
hypothesis testing, and predictive analytics. It offers a large collection of packages and
algorithms for regression, classification, clustering, and more.

Graphics and visualization: R has excellent capabilities for data visualization, allowing users
to create high-quality graphs, charts, and plots. The base graphics system provides a variety of
plotting functions, and specialized packages such as ggplot2 support more advanced and
customizable visualizations.

Extensive package ecosystem: R benefits from a large ecosystem of packages contributed by an
active community of statisticians, data scientists, and researchers. These packages extend the
functionality of R with additional tools and methods for specific domains and applications.

Reproducible research: R supports reproducible research by integrating code and documentation
in a single environment. Researchers can document their analyses using literate programming
techniques, combining code and explanatory text in R Markdown files.

Integration with other languages: R can be integrated with other programming languages such as
C, C++, Java, and Python. This interoperability lets users rely on R for data analysis while
taking advantage of the strengths of other languages.
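As a brief illustration of the data manipulation and statistical modelling features listed above, the following minimal sketch uses only base R and the built-in mtcars dataset. The dataset and the variable names are chosen here for illustration and are not part of the workbook.

```r
# Data manipulation: average fuel economy (mpg) for each cylinder count.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)

# Statistical modelling: simple linear regression of mpg on car weight (wt).
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Quick numeric summary of the response variable.
print(summary(mtcars$mpg))
```

The same two steps (summarise, then model) recur in the later experiments, where larger datasets replace the built-in one.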

Experiment # <TO BE FILLED BY STUDENT> Student ID
Date <TO BE FILLED BY STUDENT> Student Name

1. Experiment Title: Reading Big Data in R

Aim/Objective:

The main objective of this experiment is to gain basic knowledge of R programming.

Description:

This section provides a basic understanding of R programming.

Pre-Requisites:

Basic programming knowledge of R

Pre-Lab:

Answer the following questions:

1. Which function is preferable for large datasets? read.csv or read_csv?

Ans:

2. What do the user, system, and elapsed times measure in the output of
system.time(expression)?
Ans:

3. Which R object uses a pointer to a C++ data structure?

Ans:

4. What is R's allocation limit for rows or columns?

Ans:

5. What is the difference between read.big.matrix and read.csv.ffdf?

Ans:

In-Lab:

1. Read the Trip.csv data using the read.big.matrix and read.csv.ffdf functions.

a) Calculate the time components for each of these file-reading functions
using system.time.
b) Find the number of rows in the data.
c) Calculate the mean and median of the fare amount (fare_amount) in the data.
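The three in-lab steps can be sketched as follows. This is a minimal sketch, not a reference solution: because Trip.csv is not included here, it writes a small synthetic file (Trip_sample.csv, a hypothetical name) with made-up values and only the two columns the workbook mentions. The bigmemory and ff parts run only if those packages are installed, and a base R read.csv baseline is timed for comparison.

```r
# Synthetic stand-in for Trip.csv so the sketch runs anywhere.
# Column names follow the workbook; the values are made up.
set.seed(42)
n_rows <- 1000
trip_distance <- round(runif(n_rows, 0.5, 20), 2)
trips <- data.frame(
  trip_distance = trip_distance,
  fare_amount   = round(2.5 + 2 * trip_distance + rnorm(n_rows), 2)
)
csv_path <- file.path(tempdir(), "Trip_sample.csv")
write.csv(trips, csv_path, row.names = FALSE)

# Baseline: base R reader, timed the same way for comparison.
base_time <- system.time(trip_df <- read.csv(csv_path))
print(base_time)  # user, system, and elapsed seconds

if (requireNamespace("bigmemory", quietly = TRUE)) {
  library(bigmemory)
  # a) time components for read.big.matrix
  print(system.time(
    trip_bm <- read.big.matrix(csv_path, header = TRUE, sep = ",", type = "double")
  ))
  # b) number of rows
  print(nrow(trip_bm))
  # c) mean and median of fare_amount
  fare_bm <- trip_bm[, "fare_amount"]
  print(c(mean = mean(fare_bm), median = median(fare_bm)))
}

if (requireNamespace("ff", quietly = TRUE)) {
  library(ff)
  # a) time components for read.csv.ffdf
  print(system.time(
    trip_ff <- read.csv.ffdf(file = csv_path, header = TRUE)
  ))
  # b) number of rows
  print(nrow(trip_ff))
  # c) mean and median; [] pulls this single column into RAM
  fare_ff <- trip_ff$fare_amount[]
  print(c(mean = mean(fare_ff), median = median(fare_ff)))
  close(trip_ff)
}
```

On a file this small the timings are dominated by overhead, so differences between the readers only become meaningful on the full dataset.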

Post-Lab:

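1. Implement a linear model on the trip dataset based on fare_amount and trip_distance using
the biglm() function.

Procedure/Program:

library(ff)     # read.csv.ffdf() and chunk()
library(biglm)  # biglm() and its update() method
trip_data_ffdf <- read.csv.ffdf(file = "your_path_to_Trip.csv", header = TRUE)
# biglm() works on ordinary data frames, so fit the first chunk and update() with the rest.
chunks <- chunk(trip_data_ffdf)
linear_model <- biglm(fare_amount ~ trip_distance, data = trip_data_ffdf[chunks[[1]], ])
for (idx in chunks[-1]) linear_model <- update(linear_model, trip_data_ffdf[idx, ])
summary(linear_model)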

Analysis and Inferences

Data Loading and Exploration:

The "Trip.csv" data is read into a file-backed ffdf object using the read.csv.ffdf() function from the ff package.
The linear model of fare_amount on trip_distance is fitted using biglm() from the biglm package.
The structure of the dataset can be checked using str().

Data and Results:

2. Can 'bigmemory' and 'ff' handle datasets greater than 10GB?
Ans:
- Load Libraries:
  Load the necessary libraries (bigmemory and ff).
- Define File Path:
  Specify the file path to your large dataset (replace "path/to/your/large_dataset.csv"
  with the actual file path).
- Using bigmemory:
  Read the data into a big.matrix using read.big.matrix, supplying a backing file so
  that the data stays on disk.
  Optionally release the big.matrix when done (for example, rm(big_matrix) followed by gc()).
- Using ff:
  Read the data into an ffdf (file-backed data frame) using read.csv.ffdf.
  Close the ffdf connection when done (using close(ffdf_data)).
These steps outline the basic usage of 'bigmemory' and 'ff' for handling large
datasets. Because both packages can keep the data in files on disk instead of in RAM,
they are suitable for datasets larger than 10GB, provided that enough disk space is
available.
3. Why do we need Hadoop for Big Data Analytics?
Ans:
Hadoop is a framework for distributed storage and processing of large datasets. It is essential
for Big Data Analytics for the following reasons:

Scalability: Hadoop scales horizontally, so organizations can add more nodes to the cluster as
the data volume grows. This lets the system keep up with the increasing amount of data being
generated.

Distributed Storage (HDFS): The Hadoop Distributed File System (HDFS) is designed to store and manage
massive amounts of data across distributed clusters. It splits large files into blocks and
replicates them across multiple nodes, providing fault tolerance and redundancy.

Parallel Processing (MapReduce): Hadoop employs the MapReduce programming model for parallel
processing of data. It processes large datasets efficiently by breaking tasks down into
smaller sub-tasks, distributing them across nodes, and then aggregating the results. This parallelism
accelerates data processing.
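The MapReduce idea described above can be imitated on a single machine. The following is a minimal sketch in base R, not Hadoop code: the three strings in blocks are hypothetical stand-ins for file blocks stored on different nodes, and merge_counts is a helper name introduced here for illustration.

```r
# Hypothetical input: each element stands for one block of a larger file.
blocks <- c("big data needs big tools",
            "hadoop stores big data",
            "data moves to tools")

# Map: each block is processed independently and emits word counts.
map_counts <- lapply(blocks, function(block) {
  table(strsplit(block, " ", fixed = TRUE)[[1]])
})

# Reduce: merge two sets of counts by summing the counts of each word.
merge_counts <- function(a, b) {
  words <- union(names(a), names(b))
  totals <- setNames(numeric(length(words)), words)
  totals[names(a)] <- totals[names(a)] + as.numeric(a)
  totals[names(b)] <- totals[names(b)] + as.numeric(b)
  totals
}

# Aggregate the per-block results into one set of word counts.
word_counts <- Reduce(merge_counts, map_counts)
print(sort(word_counts, decreasing = TRUE))
```

In Hadoop the map step runs on the nodes that hold each block and the framework moves the intermediate counts to the reducers; here lapply and Reduce play those roles sequentially.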

Sample VIVA-VOCE Questions (In-Lab):

1. What is R (programming Language)?

2. What are the main features of R?

3. What is the difference between R and other programming languages?

4. Can you explain the basics of R programming syntax?

5. How can you create variables and assign values in R?

Evaluator Remark (if Any):


Marks Secured:_____out of 50

Signature of the Evaluator with Date

Experiment # <TO BE FILLED BY STUDENT> Student ID
Date <TO BE FILLED BY STUDENT> Student Name

2. Experiment Title: Financial statement analysis

Aim/Objective:
Scenario: You are a Data Scientist working for a consulting firm. One of your colleagues from the
Auditing department has asked you to help them assess the financial statement of organisation X.

Description:

You have been supplied with two vectors of data: monthly revenue and monthly expenses for the
financial year in question. Your task is to calculate the following financial metrics:

- profit for each month
- profit after tax for each month (the tax rate is 30%)
- profit margin for each month (profit after tax divided by revenue)
- good months, where the profit after tax was greater than the mean for the year
- bad months, where the profit after tax was less than the mean for the year
- the best month, where the profit after tax was the maximum for the year
- the worst month, where the profit after tax was the minimum for the year

Prerequisite:

Text Mining

Pre-Lab:

Answer the following:

1. What kinds of data formats come under unstructured data?

Ans:

2. Which library is used for text mining in R?

Ans:

3. Explain in brief the process of stemming in text mining.

Ans:

4. What is the main structure for managing documents in the tm package?

Ans:

In Lab:
Using the two supplied vectors (monthly revenue and monthly expenses), calculate the financial
metrics listed in the Description above: monthly profit, profit after tax (tax rate 30%), profit
margin, good months, bad months, the best month, and the worst month.

SOL:

# Monthly revenue and expenses for the financial year (January to December)
monthly_revenue <- c(14574.49, 7606.46, 8611.41, 9175.41, 8058.65, 8105.44,
                     11496.28, 9766.09, 10305.32, 14379.96, 10713.97, 15433.50)
monthly_expenses <- c(12051.82, 5695.07, 12319.20, 12089.72, 8658.57, 840.20,
                      3285.73, 5821.12, 6976.93, 16618.61, 10054.37, 3803.96)
tax_rate <- 0.3

# Profit, profit after tax, and profit margin for each month
monthly_profit <- monthly_revenue - monthly_expenses
monthly_profit_after_tax <- monthly_profit * (1 - tax_rate)
profit_margin <- monthly_profit_after_tax / monthly_revenue

# Collect the metrics in one table, one row per month
financial_metrics <- data.frame(
  Month = month.abb,
  Profit = monthly_profit,
  Profit_After_Tax = monthly_profit_after_tax,
  Profit_Margin = profit_margin
)

# Good and bad months are judged against the yearly mean profit after tax
mean_profit_after_tax <- mean(monthly_profit_after_tax)
good_months <- financial_metrics$Month[financial_metrics$Profit_After_Tax > mean_profit_after_tax]
bad_months <- financial_metrics$Month[financial_metrics$Profit_After_Tax < mean_profit_after_tax]

# Best and worst months are the months with the maximum and minimum profit after tax
best_month <- financial_metrics$Month[which.max(financial_metrics$Profit_After_Tax)]
worst_month <- financial_metrics$Month[which.min(financial_metrics$Profit_After_Tax)]

print("Financial Metrics:")
print(financial_metrics)

# cat() interprets "\n" as a line break; print() would show it literally
cat("\nGood Months:\n")
print(good_months)
cat("\nBad Months:\n")
print(bad_months)
cat("\nBest Month:\n")
print(best_month)
cat("\nWorst Month:\n")
print(worst_month)

RESULT:

Sample VIVA-VOCE Questions (In-Lab):

1. What is the role of Hadoop YARN?

2. Which library is used for text mining in R?

3. Explain how to communicate the outputs of data analysis using the R language.

4. What is the difference between the library() and require() functions in R?

5. In R programming, how are missing values represented?

Evaluator Remark (if Any):


Marks Secured:_____out of 50

Signature of the Evaluator with Date

Experiment # <TO BE FILLED BY STUDENT> Student ID
Date <TO BE FILLED BY STUDENT> Student Name

3. Experiment Title: Basketball trends

Aim/Objective: You have been supplied with data for two additional in-game statistics:

Free Throws and Free Throw Attempts

Pre-Lab:

Answer the following questions:

1. What is Hadoop?

Ans:

2. Mention the main components of Hadoop.

Ans:

3. What is the purpose of integrating R with Hadoop?

Ans:

4. What are the different methods for integrating R with Hadoop?

Ans:

5. Mention the packages involved in RHadoop.

Ans:

In-Lab:
The data has been supplied in the form of vectors. You will have to create the two matrices before
you proceed with the analysis.

SOL:

Seasons <- c("2005","2006","2007","2008","2009","2010","2011","2012","2013","2014")

# Each name must sit on one line; a line break inside the quotes would become part of the name
Players <- c("KobeBryant","JoeJohnson","LeBronJames","CarmeloAnthony","DwightHoward",
             "ChrisBosh","ChrisPaul","KevinDurant","DerrickRose","DwayneWade")

KobeBryant_Salary <-
c(15946875,17718750,19490625,21262500,23034375,24806250,25244493,27849149,30453805,23500000)

JoeJohnson_Salary <-
c(12000000,12744189,13488377,14232567,14976754,16324500,18038573,19752645,21466718,23180790)

LeBronJames_Salary <-
c(4621800,5828090,13041250,14410581,15779912,14500000,16022500,17545000,19067500,20644400)

CarmeloAnthony_Salary <-
c(3713640,4694041,13041250,14410581,15779912,17149243,18518574,19450000,22407474,22458000)

DwightHoward_Salary <-
c(4493160,4806720,6061274,13758000,15202590,16647180,18091770,19536360,20513178,21436271)

ChrisBosh_Salary <-
c(3348000,4235220,12455000,14410581,15779912,14500000,16022500,17545000,19067500,20644400)

ChrisPaul_Salary <-
c(3144240,3380160,3615960,4574189,13520500,14940153,16359805,17779458,18668431,20068563)

KevinDurant_Salary <- c(0,0,4171200,4484040,4796880,6053663,15506632,16669630,17832627,18995624)

DerrickRose_Salary <- c(0,0,0,4822800,5184480,5546160,6993708,16402500,17632688,18862875)

DwayneWade_Salary <-
c(3031920,3841443,13041250,14410581,15779912,14200000,15691000,17182000,18673000,15000000)

Salary <- rbind(KobeBryant_Salary, JoeJohnson_Salary, LeBronJames_Salary, CarmeloAnthony_Salary,
                DwightHoward_Salary, ChrisBosh_Salary, ChrisPaul_Salary, KevinDurant_Salary,
                DerrickRose_Salary, DwayneWade_Salary)

colnames(Salary) <- Seasons


rownames(Salary) <- Players

library(ggplot2)

# Keep players as rows (no transpose) so that the Player column holds player names
# and the melted "Season" variable holds the season labels.
salary_data <- as.data.frame(Salary)
salary_data$Player <- rownames(salary_data)

# melt() from the reshape2 package converts the wide table to long form.
salary_data_long <- reshape2::melt(salary_data, id.vars = "Player",
                                   variable.name = "Season", value.name = "Salary")

ggplot(salary_data_long, aes(x = Player, y = Salary, fill = Season)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Player Salaries Over Seasons",
       x = "Player",
       y = "Salary",
       fill = "Season") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better visibility

RESULT:

Sample VIVA-VOCE Questions (In-Lab):

1. What is Hadoop?

2. What is RDD in Spark?

3. What is the difference between Spark & Hadoop?

4. What are the different cluster managers available in Apache Spark?

5. What is the purpose of integrating R with Hadoop?

Evaluator Remark (if Any):


Marks Secured:_____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment
Experiment # <TO BE FILLED BY STUDENT> Student ID
Date <TO BE FILLED BY STUDENT> Student Name

4. Experiment Title: DEMOGRAPHIC ANALYSIS

Aim/Objective: You are employed as a Data Scientist by the World Bank and are working on a project to
analyse the world's demographic trends. You are required to produce a scatterplot illustrating Birth Rate and
Internet Usage statistics by Country.

Description:

You have received an urgent update from your manager. You are required to produce a second scatterplot, also
illustrating Birth Rate and Internet Usage statistics by Country. This time, however, the scatterplot needs to be
categorised by the countries' regions. Additional data has been supplied in the form of R vectors.

PRE LAB:

Answer the following questions:

1. What is Hadoop?

Ans:

2. Mention the main components of Hadoop.

Ans:

3. What is the purpose of integrating R with Hadoop?

Ans:

4. What are the different methods for integrating R with Hadoop?

Ans:

5. Mention the packages involved in RHadoop.

Ans:

6. What do you mean by Hadoop Streaming?

Ans:

IN LAB:
You are required to produce a scatterplot illustrating Birth Rate and Internet Usage statistics by
Country.

library(ggplot2)

# Sample values for illustration only
countries <- c("Thailand", "America", "India", "Europe", "China")
birth_rate <- c(10, 15, 12, 18, 9)
internet_usage <- c(30, 45, 35, 50, 25)

# One row per country
df <- data.frame(Country = countries, BirthRate = birth_rate, InternetUsage = internet_usage)

scatter_plot <- ggplot(df, aes(x = BirthRate, y = InternetUsage, label = Country)) +
  geom_point() +
  geom_text(vjust = -0.5, size = 3) +  # Add country labels above the points
  labs(title = "Scatterplot of Birth Rate vs Internet Usage by Country",
       x = "Birth Rate",
       y = "Internet Usage") +
  theme_minimal()

print(scatter_plot)
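The Description also asks for a second scatterplot categorised by region. A minimal sketch of one way to do this follows, assuming the ggplot2 package is installed. The regions vector is hypothetical: the workbook states that region data is supplied as R vectors but does not list them, so the labels below are placeholders, as are the sample values reused from the in-lab example.

```r
library(ggplot2)

# Sample values reused from the in-lab example; for illustration only.
countries      <- c("Thailand", "America", "India", "Europe", "China")
birth_rate     <- c(10, 15, 12, 18, 9)
internet_usage <- c(30, 45, 35, 50, 25)

# Hypothetical region labels, one per entry in countries.
regions <- c("Asia", "The Americas", "Asia", "Europe", "Asia")

df_region <- data.frame(
  Country       = countries,
  BirthRate     = birth_rate,
  InternetUsage = internet_usage,
  Region        = regions
)

# Mapping Region to colour gives each region its own colour and a legend.
region_plot <- ggplot(df_region, aes(x = BirthRate, y = InternetUsage, colour = Region)) +
  geom_point(size = 3) +
  geom_text(aes(label = Country), vjust = -0.8, size = 3, show.legend = FALSE) +
  labs(title = "Birth Rate vs Internet Usage by Region",
       x = "Birth Rate",
       y = "Internet Usage",
       colour = "Region") +
  theme_minimal()

print(region_plot)
```

Mapping the region to colour keeps a single plot; facet_wrap(~ Region) would be an alternative if one panel per region is preferred.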

RESULT:

Sample VIVA-VOCE Questions (In-Lab):

1. What is Hadoop Distributed File System (HDFS), and how is it different from traditional
file systems?

2. Explain the architecture of HDFS.

3. What are the key features of HDFS that make it suitable for big data storage?

4. How does HDFS ensure fault tolerance in the storage layer?

5. What is the significance of block size in HDFS, and how does it affect data storage and
processing?

Evaluator Remark (if Any):


Marks Secured:_____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment
