Big Data Analytics (Bda) : Laboratory Workbook
LABORATORY WORKBOOK
#04 IBM Big Data Strategy and InfoSphere BigInsights
#13 Practice importing and exporting data from various databases
Course Title BIG DATA ANALYTICS (BDA) Semester: 2023-24 EVEN SEM
Course Code(s) 21EC3084 Page 1 of 19
A.Y. 2023-24 LAB CONTINUOUS EVALUATION
Columns: S.No | Date | Experiment Name | Pre-Lab (10M) | In-Lab (25M): Program/Procedure (5M), Data and Results (10M), Analysis & Inference (10M) | Post-Lab (10M) | Viva Voce (5M) | Total (50M) | Faculty Signature
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
Experiment # <TO BE FILLED BY STUDENT> Student ID
Date <TO BE FILLED BY STUDENT> Student Name
R is a potent programming language and computing environment created especially for statistical
computing, data analysis, and visual representation. It was created at the University of Auckland
in New Zealand in the early 1990s by Ross Ihaka and Robert Gentleman. R has a large following
in both academia and business thanks to its open-source nature and accessibility.
Key features of R include:
Data manipulation and analysis: R provides a wide range of tools and packages for data
manipulation, cleaning, transformation, and statistical analysis. It offers numerous functions for
handling structured, unstructured, and time-series data.
Statistical modelling and machine learning: R is extensively used for statistical modelling,
hypothesis testing, and predictive analytics. It offers a vast collection of packages and algorithms
for regression, classification, clustering, and more.
Graphics and visualization: R has excellent capabilities for data visualization, allowing users
to create high-quality graphs, charts, and plots. The base graphics system provides a variety of
plotting functions, and there are also specialized packages like ggplot2 for more advanced and
customizable visualizations.
Extensive package ecosystem: R benefits from a vast ecosystem of packages contributed by a
vibrant community of statisticians, data scientists, and researchers. These packages extend the
functionality of R, providing additional tools and methods for specific domains and applications.
Reproducible research: R supports reproducible research through the integration of code and
documentation in a single environment. Researchers can document their analyses using literate
programming techniques, combining code and explanatory text in documents called R Markdown
files.
Integration with other languages: R can be easily integrated with other programming languages
like C, C++, Java, Python, and more. This interoperability allows users to leverage the strengths
of different languages and use R for data analysis while taking advantage of other language
capabilities.
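As a brief illustration of the data manipulation, modelling, and graphics features listed above, the following sketch uses only base R and the built-in mtcars dataset. The dataset and the variable names (small_cars, mean_mpg, fit) are chosen here purely for illustration and are not part of this workbook's exercises.

```r
# Data manipulation: keep cars with 4 or 6 cylinders and add a derived column
small_cars <- subset(mtcars, cyl %in% c(4, 6))
small_cars$wt_lbs <- small_cars$wt * 1000  # wt is recorded in units of 1000 lbs

# Summary statistics: mean fuel economy per cylinder count
mean_mpg <- aggregate(mpg ~ cyl, data = small_cars, FUN = mean)
print(mean_mpg)

# Statistical modelling: simple linear regression of mpg on weight
fit <- lm(mpg ~ wt, data = small_cars)
print(coef(fit))

# Graphics: scatterplot with the fitted regression line
plot(small_cars$wt, small_cars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)
```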
Aim/Objective:
Description:
Pre-Requisites:
Pre-Lab:
Ans:
2. What do the user, system, and elapsed times measure in the output of system.time(expression)?
Ans:
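A minimal sketch of how the three values can be inspected. The workload (summing square roots in a loop) is an arbitrary example, and the timings will differ from machine to machine. In the output, user is the CPU time spent executing the R code itself, system is the CPU time spent by the operating system on behalf of the process (for example, for file I/O), and elapsed is the wall-clock time from start to finish.

```r
# Time an arbitrary piece of work
timing <- system.time({
  total <- 0
  for (i in 1:1e6) total <- total + sqrt(i)
})
print(timing)

# The individual components can be extracted by name
user_time    <- timing[["user.self"]]  # CPU time in the R process
system_time  <- timing[["sys.self"]]   # CPU time in the operating system kernel
elapsed_time <- timing[["elapsed"]]    # wall-clock time
```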
Ans:
Ans.
Ans:
In-Lab:
Post-Lab:
1. Implement a linear model on the trip dataset based on fare_amount and trip_distance using
the biglm() function.
Procedure/Program:
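A minimal sketch of one way to approach this, not the reference solution. The trip dataset is not included here, so the code simulates a small stand-in with the two columns named in the question; the simulated values, the number of chunks, and the intercept and slope used to generate the data are all assumptions for illustration. biglm() fits the model on a first chunk and update() folds in further chunks, so the full data never has to be in memory at once. If the biglm package is not installed, the sketch falls back to lm() on the full data.

```r
set.seed(42)
n <- 10000

# Hypothetical stand-in for the trip dataset (replace with the real data)
trips <- data.frame(trip_distance = runif(n, min = 0.5, max = 20))
trips$fare_amount <- 2.5 + 3 * trips$trip_distance + rnorm(n, sd = 1)

# Split the rows into chunks, as if each chunk were read from disk in turn
chunks <- split(trips, rep(1:4, length.out = n))

if (requireNamespace("biglm", quietly = TRUE)) {
  # Fit on the first chunk, then update the fit with each remaining chunk
  fit <- biglm::biglm(fare_amount ~ trip_distance, data = chunks[[1]])
  for (chunk in chunks[-1]) {
    fit <- update(fit, chunk)
  }
} else {
  # Fallback when biglm is unavailable: ordinary in-memory least squares
  fit <- lm(fare_amount ~ trip_distance, data = trips)
}

print(coef(fit))
```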
2. Can 'bigmemory' and 'ff' handle datasets larger than 10GB?
Ans:
Load Libraries:
Load the necessary libraries (bigmemory and ff).
Define File Path:
Specify the file path to your large dataset (replace "path/to/your/large_dataset.csv"
with the actual file path).
Using bigmemory:
Read the data into a file-backed big.matrix using read.big.matrix.
Using ff:
Read the data into an ffdf (file-backed data frame) using read.csv.ffdf.
Close the ffdf connection when done (using close(ffdf_data)).
Optionally release the big.matrix when done (using rm(big_matrix) followed by gc()).
This program demonstrates the basic usage of 'bigmemory' and 'ff' for handling
large datasets, performing operations, and releasing resources. Both packages keep
the data in files on disk rather than in RAM, so they can handle datasets larger
than 10GB as long as enough disk space is available. Note that a big.matrix holds
a single numeric type, whereas an ffdf can mix column types.
2. Why do we need Hadoop for Big Data Analytics?
Ans:
Hadoop is a powerful framework for distributed storage and processing of large datasets. It is essential
for Big Data Analytics due to the following reasons:
Scalability: Hadoop allows horizontal scaling, enabling organizations to seamlessly add more nodes to
the cluster as the data volume grows. This ensures that the system can handle the increasing amount of
data generated in the Big Data era.
Distributed Storage (HDFS): Hadoop Distributed File System (HDFS) is designed to store and manage
massive amounts of data across distributed clusters. It breaks down large files into smaller blocks and
distributes them across multiple nodes, providing fault tolerance and redundancy.
Parallel Processing (MapReduce): Hadoop employs the MapReduce programming model for parallel
processing of data. It enables the efficient processing of large datasets by breaking down tasks into
smaller sub-tasks, distributing them across nodes, and then aggregating the results. This parallelism
accelerates data processing.
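The map, shuffle, and reduce phases described above can be illustrated in miniature with base R. This is a single-machine sketch of the programming model only, not Hadoop code: the three input strings stand in for file blocks held on different nodes and are made up for illustration.

```r
# Each element plays the role of one input split stored on a different node
splits <- c("big data needs big storage",
            "hadoop stores big data",
            "data drives decisions")

# Map phase: each split is processed independently and emits its words (keys)
mapped <- lapply(splits, function(text) strsplit(text, " ")[[1]])

# Shuffle phase: bring all emitted keys together and group a count of 1 per key
all_words <- unlist(mapped)
grouped <- split(rep(1L, length(all_words)), all_words)

# Reduce phase: sum the counts for each key
word_counts <- sapply(grouped, sum)
print(word_counts)
```

In a real cluster the map calls would run on the nodes that hold each block, and only the grouped intermediate results would travel over the network.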
Aim/Objective:
Scenario: You are a Data Scientist working for a consulting firm. One of your colleagues from the
Auditing department has asked you to help them assess the financial statement of organisation X.
Description:
You have been supplied with two vectors of data: monthly revenue and monthly expenses for the
financial year in question. Your task is to calculate the following financial metrics:
- profit for each month
- profit after tax for each month (the tax rate is 30%)
- profit margin for each month, equal to profit after tax divided by revenue
- good months, where the profit after tax was greater than the mean for the year
- bad months, where the profit after tax was less than the mean for the year
- the best month, where the profit after tax was the maximum for the year
- the worst month, where the profit after tax was the minimum for the year
Prerequisite:
Text Mining
Pre-Lab:
Ans:
Ans:
Ans:
Ans:
In Lab:
You have been supplied with two vectors of data: monthly revenue and monthly expenses for the
financial year in question. Calculate the following financial metrics:
- profit for each month
- profit after tax for each month (the tax rate is 30%)
- profit margin for each month, equal to profit after tax divided by revenue
- good months, where the profit after tax was greater than the mean for the year
- bad months, where the profit after tax was less than the mean for the year
- the best month, where the profit after tax was the maximum for the year
- the worst month, where the profit after tax was the minimum for the year
SOL:
# Input data: monthly revenue and expenses for the financial year
monthly_revenue <- c(14574.49, 7606.46, 8611.41, 9175.41, 8058.65, 8105.44, 11496.28, 9766.09,
                     10305.32, 14379.96, 10713.97, 15433.50)
monthly_expenses <- c(12051.82, 5695.07, 12319.20, 12089.72, 8658.57, 840.20, 3285.73, 5821.12,
                      6976.93, 16618.61, 10054.37, 3803.96)
tax_rate <- 0.3

# Profit, profit after tax, and profit margin for each month
monthly_profit <- monthly_revenue - monthly_expenses
monthly_profit_after_tax <- monthly_profit * (1 - tax_rate)
profit_margin <- monthly_profit_after_tax / monthly_revenue

# Collect the metrics in one data frame, labelled with month abbreviations
financial_metrics <- data.frame(
  Month = month.abb,
  Profit = monthly_profit,
  Profit_After_Tax = monthly_profit_after_tax,
  Profit_Margin = profit_margin
)

# Good and bad months relative to the yearly mean profit after tax
mean_profit_after_tax <- mean(monthly_profit_after_tax)
good_months <- financial_metrics$Month[financial_metrics$Profit_After_Tax >
                                         mean_profit_after_tax]
bad_months <- financial_metrics$Month[financial_metrics$Profit_After_Tax <
                                        mean_profit_after_tax]

# Best and worst months by profit after tax
best_month <- financial_metrics$Month[which.max(financial_metrics$Profit_After_Tax)]
worst_month <- financial_metrics$Month[which.min(financial_metrics$Profit_After_Tax)]

# Print the results; cat() is used for the headings so that "\n" produces a line break
cat("Financial Metrics:\n")
print(financial_metrics)
cat("\nGood Months:\n")
print(good_months)
cat("\nBad Months:\n")
print(bad_months)
cat("\nBest Month:\n")
print(best_month)
cat("\nWorst Month:\n")
print(worst_month)
RESULT:
Aim/Objective: You have been supplied data for two additional in-game statistics:
Pre-Lab:
1. What is Hadoop?
Ans:
Ans:
Ans:
Ans:
Ans:
In-Lab:
The data has been supplied in the form of vectors. You will have to create the two matrices before
you proceed with the analysis.
SOL:
# Player names (the string literal split across a line break has been repaired)
Players <- c("KobeBryant", "JoeJohnson", "LeBronJames", "CarmeloAnthony", "DwightHoward",
             "ChrisBosh", "ChrisPaul", "KevinDurant", "DerrickRose", "DwayneWade")
KobeBryant_Salary <-
c(15946875,17718750,19490625,21262500,23034375,24806250,25244493,27849149,30453805,23500000)
JoeJohnson_Salary <-
c(12000000,12744189,13488377,14232567,14976754,16324500,18038573,19752645,21466718,23180790)
LeBronJames_Salary <-
c(4621800,5828090,13041250,14410581,15779912,14500000,16022500,17545000,19067500,20644400)
CarmeloAnthony_Salary <-
c(3713640,4694041,13041250,14410581,15779912,17149243,18518574,19450000,22407474,22458000)
DwightHoward_Salary <-
c(4493160,4806720,6061274,13758000,15202590,16647180,18091770,19536360,20513178,21436271)
ChrisBosh_Salary <-
c(3348000,4235220,12455000,14410581,15779912,14500000,16022500,17545000,19067500,20644400)
ChrisPaul_Salary <-
c(3144240,3380160,3615960,4574189,13520500,14940153,16359805,17779458,18668431,20068563)
DwayneWade_Salary <-
c(3031920,3841443,13041250,14410581,15779912,14200000,15691000,17182000,18673000,15000000)
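# NOTE: the KevinDurant and DerrickRose salary vectors are missing from this listing,
# so the matrix below covers only the eight players whose data is shown.
Salary <- rbind(KobeBryant_Salary, JoeJohnson_Salary, LeBronJames_Salary,
                CarmeloAnthony_Salary, DwightHoward_Salary, ChrisBosh_Salary,
                ChrisPaul_Salary, DwayneWade_Salary)
rownames(Salary) <- sub("_Salary$", "", rownames(Salary))
# Assumption: the season labels are not given in this listing; placeholders are used.
colnames(Salary) <- paste0("Season", 1:10)
library(ggplot2)
# Reshape the matrix to long format: one row per player and season
salary_df <- as.data.frame(as.table(Salary))
names(salary_df) <- c("Player", "Season", "Salary")
# Assumption: the start of the plotting call is missing from this listing; a grouped
# bar chart is one layout consistent with the axis and legend labels that survive.
salary_plot <- ggplot(salary_df, aes(x = Player, y = Salary, fill = Season)) +
  geom_col(position = "dodge") +
  labs(x = "Player",
       y = "Salary",
       fill = "Season") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better visibility
print(salary_plot)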
RESULT:
1. What is Hadoop?
Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment
Aim/Objective: You are employed as a Data Scientist by the World Bank and you are working on a project to
analyse the world's demographic trends. You are required to produce a scatterplot illustrating Birth Rate and
Internet Usage statistics by Country.
Description:
You have received an urgent update from your manager. You are required to produce a second scatterplot also
illustrating Birth Rate and Internet Usage statistics by Country. However, this time the scatterplot needs to be
categorised by the countries' regions. Additional data has been supplied in the form of R vectors.
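The second task amounts to attaching a region to each country and then colouring the points by region. A minimal sketch under stated assumptions: the data frame stats, the vectors Codes_vec and Regions_vec, all column names, and every value below are made up for illustration and must be replaced with the supplied data. Base graphics are used so the sketch runs without extra packages; with ggplot2 the same grouping is obtained by mapping colour = Region inside aes().

```r
# Made-up illustrative values; replace with the supplied data
stats <- data.frame(
  Country.Code   = c("AAA", "BBB", "CCC", "DDD"),
  Birth.rate     = c(10.2, 35.3, 12.1, 21.6),
  Internet.users = c(78.9, 5.9, 88.0, 54.2)
)

# Region information arriving as separate vectors (hypothetical names and values)
Codes_vec   <- c("AAA", "BBB", "CCC", "DDD")
Regions_vec <- c("Europe", "Africa", "Europe", "Asia")

# Build a lookup table from the vectors and merge it onto the statistics
regions <- data.frame(Country.Code = Codes_vec, Region = Regions_vec)
merged <- merge(stats, regions, by = "Country.Code")
merged$Region <- factor(merged$Region)

# Scatterplot with one colour per region
plot(merged$Birth.rate, merged$Internet.users,
     col = as.integer(merged$Region), pch = 19,
     xlab = "Birth Rate", ylab = "Internet Usage")
legend("topright", legend = levels(merged$Region),
       col = seq_along(levels(merged$Region)), pch = 19)
```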
PRE LAB:
1. What is Hadoop?
Ans:
Ans:
Ans:
Ans:
Ans:
Ans:
IN LAB:
You are required to produce a scatterplot illustrating Birth Rate and Internet Usage statistics by
Country.
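library(ggplot2)
# Assumption: the data-loading step and the start of the plotting call are missing
# from this listing. `demographics` is a placeholder for a data frame with numeric
# columns Birth.rate and Internet.users, one row per country.
scatter_plot <- ggplot(demographics, aes(x = Birth.rate, y = Internet.users)) +
  geom_point() +
  labs(x = "Birth Rate",
       y = "Internet Usage") +
  theme_minimal()
print(scatter_plot)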
RESULT:
1. What is Hadoop Distributed File System (HDFS), and how is it different from traditional
file systems?
3. What are the key features of HDFS that make it suitable for big data storage?
5. What is the significance of block size in HDFS, and how does it affect data storage and
processing?
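For question 5, a small worked calculation can make the effect of block size concrete. The sketch assumes a hypothetical 1 GB file, the HDFS default block size of 128 MB (Hadoop 2.x and later), and the default replication factor of 3. Each block is typically processed by one map task, so larger blocks mean fewer tasks and fewer block entries for the NameNode to track.

```r
file_size_mb  <- 1024  # hypothetical file size (1 GB)
block_size_mb <- 128   # HDFS default block size in Hadoop 2.x and later
replication   <- 3     # HDFS default replication factor

# Number of blocks the file is split into
n_blocks <- ceiling(file_size_mb / block_size_mb)

# Total block replicas stored across the cluster and raw disk space consumed
n_replicas     <- n_blocks * replication
raw_storage_mb <- file_size_mb * replication

# For comparison: the same file with a 64 MB block size
n_blocks_64 <- ceiling(file_size_mb / 64)

cat("Blocks at 128 MB:", n_blocks, "\n")
cat("Block replicas:", n_replicas, "\n")
cat("Raw storage (MB):", raw_storage_mb, "\n")
cat("Blocks at 64 MB:", n_blocks_64, "\n")
```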