The Big Data Analytics Lab Manual outlines a course designed for III-B.Tech II-Semester students, focusing on the importance of Big Data Analytics and various statistical techniques. The course includes practical experiments using R and Hadoop, covering topics such as hypothesis testing, clustering, regression, and time-series analysis. By the end of the course, students will be able to develop Big Data solutions and apply machine learning techniques.


Big Data Analytics Lab Manual

III-B.Tech II-Semester          L  T  P  C
Course Code: A1DS604PC          0  0  3  1.5

COURSE OBJECTIVES:

The course should enable the students to:


1. Outline the importance of Big Data Analytics.
2. Apply statistical techniques for Big Data Analytics.
3. Analyze problems appropriate to mining data streams.
4. Apply the knowledge of clustering techniques in data mining.
5. Use graph analytics for Big Data and provide solutions.
6. Apply Hadoop MapReduce programming for handling Big Data.

COURSE OUTCOMES:

At the end of the course, the students will be able to:


1. Identify Big Data and its business implications.
2. List the components of Hadoop and the Hadoop ecosystem.
3. Access and process data on a distributed file system.
4. Manage job execution in a Hadoop environment.
5. Develop Big Data solutions using the Hadoop ecosystem.
6. Analyze InfoSphere BigInsights Big Data recommendations.
7. Apply machine learning techniques using R.

LIST OF EXPERIMENTS:
1. Study of R Programming
2. Hypothesis Test using R
3. K-means Clustering using R
4. Naïve Bayesian Classifier
5. Implementation of Linear Regression
6. Implement Logistic Regression
7. Time-series Analysis
8. Association Rules using R
9. Data Analysis-Visualization using R
10. MapReduce using Hadoop
11. In-database Analytics
12. Implementation of Queries using MongoDB
WEEK - 1 STUDY OF R PROGRAMMING

1. Study of R Programming

#program:

# R Programming Basics

# 1. Variables and Data Types


name <- "John"
age <- 25
height <- 175.5
is_student <- TRUE

# Print the variables


cat("Name:", name, "\n")
cat("Age:", age, "\n")
cat("Height:", height, "\n")
cat("Is Student:", is_student, "\n\n")

# 2. Vectors and Operations


numbers <- c(2, 4, 6, 8, 10)
result <- numbers * 2

cat("Original Numbers:", numbers, "\n")


cat("Doubled Numbers:", result, "\n\n")

# 3. Data Visualization using ggplot2


# Install the ggplot2 package if not installed already
# install.packages("ggplot2")

# Load the ggplot2 library


library(ggplot2)

# Create a simple scatter plot


data <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 1, 3, 5))

# Plotting
ggplot(data, aes(x = x, y = y)) +
geom_point(color = "blue") +
ggtitle("Simple Scatter Plot") +
xlab("X-axis") +
ylab("Y-axis")

# Save the plot as an image (optional)


# ggsave("scatter_plot.png", plot = last_plot(), device = "png")
#Output:

> cat("Name:", name, "\n")


Name: John
> cat("Age:", age, "\n")
Age: 25
> cat("Height:", height, "\n")
Height: 175.5
> cat("Is Student:", is_student, "\n\n")
Is Student: TRUE

> cat("Original Numbers:", numbers, "\n")


Original Numbers: 2 4 6 8 10
> cat("Doubled Numbers:", result, "\n\n")
Doubled Numbers: 4 8 12 16 20
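
The later experiments rely on data frames and user-defined functions as well; the optional sketch below (not part of the prescribed program, using made-up example data) illustrates both.

# Optional practice: data frames and a user-defined function
students <- data.frame(
  name = c("Asha", "Ravi", "Meena"),   # hypothetical example data
  marks = c(78, 85, 92)
)

# Classify marks into letter grades
grade <- function(marks) {
  ifelse(marks >= 90, "A", ifelse(marks >= 80, "B", "C"))
}

students$grade <- grade(students$marks)
print(students)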

WEEK - 2. HYPOTHESIS TEST USING R

1. Hypothesis Test using R

#program:

# Hypothesis Test using R

# Generate hypothetical data for two groups


set.seed(123) # for reproducibility
group1 <- rnorm(30, mean = 50, sd = 10)
group2 <- rnorm(30, mean = 55, sd = 10)

# Perform a t-test
t_test_result <- t.test(group1, group2)

# Print the t-test results


cat("Hypothesis Test Results:\n")
cat("t-value:", t_test_result$statistic, "\n")
cat("p-value:", t_test_result$p.value, "\n\n")

# Interpret the results


if (t_test_result$p.value < 0.05) {
  cat("The difference between the groups is statistically significant at the 0.05 level.\n")
} else {
  cat("There is no significant difference between the groups at the 0.05 level.\n")
}
#Output:

> cat("Hypothesis Test Results:\n")


Hypothesis Test Results:
> cat("t-value:", t_test_result$statistic, "\n")
t-value: -3.084094
> cat("p-value:", t_test_result$p.value, "\n\n")
p-value: 0.00315574

The difference between the groups is statistically significant at the 0.05 level.
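
For a fuller picture than the t-value and p-value alone, the complete t.test() result can be printed; a minimal optional follow-up, assuming the same group1 and group2 as above:

# Print the full test result, including the 95% confidence interval
print(t_test_result)

# Individual components can also be read from the result object
cat("95% CI:", t_test_result$conf.int, "\n")
cat("Group means:", t_test_result$estimate, "\n")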

WEEK - 3 K MEANS CLUSTERING USING R

1. K-means Clustering using R

#program:

# K-means Clustering using R

# Load the iris dataset


data(iris)

# Selecting relevant features for clustering (e.g., Sepal Length and Sepal Width)
iris_features <- iris[, c("Sepal.Length", "Sepal.Width")]

# Set the number of clusters (K)


k <- 3

# Perform K-means clustering


kmeans_result <- kmeans(iris_features, centers = k)

# Print the clustering results


cat("Cluster Centers:\n")
print(kmeans_result$centers)

cat("\nCluster Assignments:\n")
print(kmeans_result$cluster)

# Visualize the clustering (scatter plot)


plot(iris_features, col = kmeans_result$cluster, pch = 16, main = "K-means Clustering")
points(kmeans_result$centers, col = 1:k, pch = 8, cex = 2)
#Output:

> cat("Cluster Centers:\n")


Cluster Centers:
> print(kmeans_result$centers)
Sepal.Length Sepal.Width
1 5.006000 3.428000
2 5.773585 2.692453
3 6.812766 3.074468
>
> cat("\nCluster Assignments:\n")

Cluster Assignments:
> print(kmeans_result$cluster)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[37] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 3 2 3 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
[73] 2 2 3 3 3 3 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3
[109] 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3
[145] 3 3 2 3 3 2
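
The program above fixes k = 3; a common way to justify the choice of k is the elbow method. The optional sketch below reuses iris_features and plots the total within-cluster sum of squares for k = 1 to 10:

# Elbow method: total within-cluster sum of squares for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(iris_features, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of Clusters (k)",
     ylab = "Total Within-Cluster Sum of Squares",
     main = "Elbow Method")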

WEEK - 4 NAIVE BAYESIAN CLASSIFIER

1. Naïve Bayesian Classifier

#program:

# Naive Bayesian Classifier using R

# Load the iris dataset


data(iris)

# Install and load the necessary package (if not already installed)
# install.packages("e1071")
library(e1071)

# Split the dataset into training and testing sets


set.seed(123)
indices <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[indices, ]
test_data <- iris[-indices, ]

# Train the Naive Bayes classifier


nb_model <- naiveBayes(Species ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width, data = train_data)

# Make predictions on the test set


predictions <- predict(nb_model, test_data)

# Confusion matrix to evaluate the performance


conf_matrix <- table(predictions, test_data$Species)
print("Confusion Matrix:")
print(conf_matrix)

# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 2)))

#Output:

> print("Confusion Matrix:")


[1] "Confusion Matrix:"
> print(conf_matrix)

predictions  setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         18         0
  virginica       0          0        13
> # Calculate accuracy
> accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
> print(paste("Accuracy:", round(accuracy, 2)))
[1] "Accuracy: 1"

WEEK - 5 IMPLEMENTATION OF LINEAR REGRESSION

1. Implementation of Linear Regression

#program:

# Implementation of Linear Regression in R

# Generate a hypothetical dataset


set.seed(123)
x <- 1:50
y <- 2 * x + rnorm(50, mean = 0, sd = 5)

# Create a data frame


data <- data.frame(x = x, y = y)

# Fit a linear regression model


linear_model <- lm(y ~ x, data = data)

# Display the summary of the linear regression model


summary(linear_model)
# Make predictions
new_data <- data.frame(x = 51:60)
predictions <- predict(linear_model, newdata = new_data)

# Display the predictions


cat("Predictions for new data:\n")
print(data.frame(x = new_data$x, predicted_y = predictions))

#Output:

> # Display the summary of the linear regression model


> summary(linear_model)

Call:
lm(formula = y ~ x, data = data)

Residuals:
Min 1Q Median 3Q Max
-10.0560 -3.1111 -0.4097 3.3295 10.7983

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.34508 1.34274 0.257 0.798
x 1.99321 0.04583 43.494 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.676 on 48 degrees of freedom


Multiple R-squared: 0.9753, Adjusted R-squared: 0.9747
F-statistic: 1892 on 1 and 48 DF, p-value: < 2.2e-16

> cat("Predictions for new data:\n")


Predictions for new data:
> print(data.frame(x = new_data$x, predicted_y = predictions))
x predicted_y
1 51 101.9990
2 52 103.9922
3 53 105.9854
4 54 107.9786
5 55 109.9718
6 56 111.9650
7 57 113.9582
8 58 115.9514
9 59 117.9447
10 60 119.9379
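
A quick visual check of the fit is often added alongside summary(); the optional sketch below plots the data together with the fitted regression line, assuming the same data and linear_model objects:

# Scatter plot with the fitted regression line
plot(data$x, data$y, pch = 16, xlab = "x", ylab = "y",
     main = "Linear Regression Fit")
abline(linear_model, col = "red", lwd = 2)
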
WEEK - 6 IMPLEMENT LOGISTIC REGRESSION

1. Implement Logistic Regression

#program:

# Implement Logistic Regression in R

# Load the iris dataset


data(iris)

# Create a binary outcome variable for demonstration (versicolor vs. non-versicolor)
iris$IsVersicolor <- ifelse(iris$Species == "versicolor", 1, 0)

# Split the dataset into training and testing sets


set.seed(123)
indices <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[indices, ]
test_data <- iris[-indices, ]

# Fit a logistic regression model


logistic_model <- glm(IsVersicolor ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width,
data = train_data,
family = "binomial")

# Display the summary of the logistic regression model


summary(logistic_model)

# Make predictions on the test set


predictions <- predict(logistic_model, newdata = test_data, type = "response")

# Convert predicted probabilities to binary predictions (0 or 1)


binary_predictions <- ifelse(predictions >= 0.5, 1, 0)

# Create a confusion matrix to evaluate performance


conf_matrix <- table(binary_predictions, test_data$IsVersicolor)
print("Confusion Matrix:")
print(conf_matrix)

# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
#Output:

> # Display the summary of the logistic regression model


> summary(logistic_model)

Call:
glm(formula = IsVersicolor ~ Sepal.Length + Sepal.Width + Petal.Length +
Petal.Width, family = "binomial", data = train_data)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.2596 2.8631 2.186 0.0288 *
Sepal.Length -0.5910 0.7564 -0.781 0.4346
Sepal.Width -2.0960 0.8222 -2.549 0.0108 *
Petal.Length 1.6183 0.8229 1.967 0.0492 *
Petal.Width -3.0218 1.4200 -2.128 0.0333 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 129.12 on 104 degrees of freedom


Residual deviance: 104.10 on 100 degrees of freedom
AIC: 114.1

Number of Fisher Scoring iterations: 5


> print("Confusion Matrix:")
[1] "Confusion Matrix:"
> print(conf_matrix)

binary_predictions 0 1
0 24 10
1 3 8
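
The accuracy line is not shown above; from the confusion matrix it works out to (24 + 8) / (24 + 10 + 3 + 8) = 32 / 45 ≈ 0.71.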

WEEK - 7 Time-series Analysis

1. Time-series Analysis

#program:

# Install and load necessary packages


install.packages("forecast")
install.packages("tseries")
install.packages("xts")
library(forecast)
library(tseries)
library(xts)
# Load the AirPassengers dataset
data("AirPassengers")

# Convert to time series object


ts_data <- ts(AirPassengers, start = c(1949, 1), frequency = 12)

# Display the first few rows of the dataset


print(head(ts_data))

# Plot the time series data


plot(ts_data, main="AirPassengers Data", ylab="Number of Passengers", xlab="Year")

# Decompose the time series


decomposed <- decompose(ts_data)
plot(decomposed)

# Perform the Augmented Dickey-Fuller test


adf_test <- adf.test(ts_data)
print(adf_test)

# Difference the series if necessary


diff_ts_data <- diff(ts_data)
plot(diff_ts_data, main="Differenced AirPassengers Data", ylab="Differenced Number of Passengers", xlab="Year")

# Fit an ARIMA model


fit <- auto.arima(ts_data)
print(fit)

# Forecast the next 24 months


forecasted <- forecast(fit, h = 24)
plot(forecasted)

# Check the residuals of the model


checkresiduals(fit)

# Calculate accuracy metrics


accuracy(forecasted)

#output
> library(forecast)
> library(tseries)
> library(xts)
>
> # Load the AirPassengers dataset
> data("AirPassengers")
>
> # Convert to time series object
> ts_data <- ts(AirPassengers, start = c(1949, 1), frequency = 12)
>
> # Display the first few rows of the dataset
> print(head(ts_data))
Jan Feb Mar Apr May Jun
1949 112 118 132 129 121 135
>
> # Plot the time series data
> plot(ts_data, main="AirPassengers Data", ylab="Number of Passengers", xlab="Year")
>
> # Decompose the time series
> decomposed <- decompose(ts_data)
> plot(decomposed)
>
> # Perform the Augmented Dickey-Fuller test
> adf_test <- adf.test(ts_data)
> print(adf_test)

​ Augmented Dickey-Fuller Test

data: ts_data
Dickey-Fuller = -7.3186, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary

>
> # Difference the series if necessary
> diff_ts_data <- diff(ts_data)
> plot(diff_ts_data, main="Differenced AirPassengers Data", ylab="Differenced Number of Passengers", xlab="Year")
>
> # Fit an ARIMA model
> fit <- auto.arima(ts_data)
> print(fit)
Series: ts_data
ARIMA(2,1,1)(0,1,0)[12]

Coefficients:
ar1 ar2 ma1
0.5960 0.2143 -0.9819
s.e. 0.0888 0.0880 0.0292

sigma^2 = 132.3: log likelihood = -504.92


AIC=1017.85 AICc=1018.17 BIC=1029.35
>
> # Forecast the next 24 months
> forecasted <- forecast(fit, h = 24)
> plot(forecasted)
>
> # Check the residuals of the model
> checkresiduals(fit)

​ Ljung-Box test

data: Residuals from ARIMA(2,1,1)(0,1,0)[12]


Q* = 37.784, df = 21, p-value = 0.01366

Model df: 3. Total lags used: 24

>
> # Calculate accuracy metrics
> accuracy(forecasted)
ME RMSE MAE MPE MAPE MASE ACF1
Training set 1.3423 10.84619 7.86754 0.420698 2.800458 0.245628 -0.00124847
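
The AirPassengers series has multiplicative seasonality (the seasonal swings grow with the overall level), so the model is often fitted on the log scale instead; a minimal optional variant of the steps above:

# Fit and forecast on the log-transformed series
log_ts <- log(ts_data)
fit_log <- auto.arima(log_ts)
print(fit_log)

forecast_log <- forecast(fit_log, h = 24)
plot(forecast_log, main = "Forecast of log(AirPassengers)")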

WEEK - 8 Association Rules using R

1. Association Rules using R

#program:
# Load required packages
install.packages("arules")
library(arules)

# Load and prepare the data


data("Groceries")
# Inspect the dataset
summary(Groceries)

# Generate rules using the Apriori algorithm


rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))

# Inspect the first 5 rules


inspect(rules[1:5])

# Sort rules by lift and inspect the top 5 rules


rules_sorted <- sort(rules, by = "lift")
inspect(rules_sorted[1:5])

# Filter rules with specific items in the consequent


milk_rules <- subset(rules, subset = rhs %in% "whole milk" & lift > 1.2)
inspect(milk_rules)

# Visualize rules
install.packages("arulesViz")
library(arulesViz)
plot(rules)

#Output:
> library(arulesViz)
> plot(rules)
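
Rules can also be ranked by confidence, or filtered by an item on the left-hand side in the same way as the milk_rules example; a short optional sketch (assuming the rules object from above, and that yogurt-based rules survive the chosen support and confidence thresholds):

# Top 5 rules by confidence
rules_by_conf <- sort(rules, by = "confidence")
inspect(rules_by_conf[1:5])

# Rules with "yogurt" on the left-hand side
yogurt_rules <- subset(rules, subset = lhs %in% "yogurt")
inspect(yogurt_rules)
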
WEEK - 9 Data Analysis-Visualization using R.

1. Data Analysis-Visualization using R.

#program:
# Load required packages
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
library(ggplot2)
library(dplyr)
library(tidyr)

# Load and inspect the data


data(mtcars)
str(mtcars)
summary(mtcars)

# Data manipulation with dplyr


mtcars_filtered <- mtcars %>%
filter(cyl == 6)

mtcars_summary <- mtcars %>%


summarise(mean_mpg = mean(mpg), mean_hp = mean(hp))

mtcars <- mtcars %>%


mutate(hp_category = ifelse(hp > 150, "High HP", "Low HP"))

# Display the modified dataset


head(mtcars)

# Data visualization with ggplot2


# Scatter plot of mpg vs hp with color indicating number of cylinders
ggplot(mtcars, aes(x = hp, y = mpg, color = as.factor(cyl))) +
geom_point() +
labs(title = "Scatter plot of MPG vs HP",
x = "Horsepower",
y = "Miles Per Gallon",
color = "Cylinders")

# Histogram of mpg
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "blue", color = "black") +
labs(title = "Histogram of MPG",
x = "Miles Per Gallon",
y = "Frequency")

# Box plot of mpg by number of cylinders


ggplot(mtcars, aes(x = as.factor(cyl), y = mpg)) +
geom_boxplot() +
labs(title = "Box Plot of MPG by Cylinders",
x = "Number of Cylinders",
y = "Miles Per Gallon")

# Bar plot of the number of cars in each horsepower category


ggplot(mtcars, aes(x = hp_category)) +
geom_bar(fill = "orange") +
labs(title = "Bar Plot of Cars by Horsepower Category",
x = "Horsepower Category",
y = "Count")

#output

> # Load and inspect the data


> data(mtcars)
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> summary(mtcars)
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
>
> # Data manipulation with dplyr
> mtcars_filtered <- mtcars %>%
+ filter(cyl == 6)
>
> mtcars_summary <- mtcars %>%
+ summarise(mean_mpg = mean(mpg), mean_hp = mean(hp))
>
> mtcars <- mtcars %>%
+ mutate(hp_category = ifelse(hp > 150, "High HP", "Low HP"))
>
> # Display the modified dataset
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
hp_category
Mazda RX4 Low HP
Mazda RX4 Wag Low HP
Datsun 710 Low HP
Hornet 4 Drive Low HP
Hornet Sportabout High HP
Valiant Low HP
>
> # Data visualization with ggplot2
> # Scatter plot of mpg vs hp with color indicating number of cylinders
> ggplot(mtcars, aes(x = hp, y = mpg, color = as.factor(cyl))) +
+ geom_point() +
+ labs(title = "Scatter plot of MPG vs HP",
+ x = "Horsepower",
+ y = "Miles Per Gallon",
+ color = "Cylinders")
>
> # Histogram of mpg
> ggplot(mtcars, aes(x = mpg)) +
+ geom_histogram(binwidth = 2, fill = "blue", color = "black") +
+ labs(title = "Histogram of MPG",
+ x = "Miles Per Gallon",
+ y = "Frequency")
>
> # Box plot of mpg by number of cylinders
> ggplot(mtcars, aes(x = as.factor(cyl), y = mpg)) +
+ geom_boxplot() +
+ labs(title = "Box Plot of MPG by Cylinders",
+ x = "Number of Cylinders",
+ y = "Miles Per Gallon")
>
> # Bar plot of the number of cars in each horsepower category
> ggplot(mtcars, aes(x = hp_category)) +
+ geom_bar(fill = "orange") +
+ labs(title = "Bar Plot of Cars by Horsepower Category",
+ x = "Horsepower Category",
+ y = "Count")

WEEK - 10 MapReduce using Hadoop

1. MapReduce using Hadoop

#program:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = value.toString();
String[] words = line.split("\\s+");
for (String w : words) {
word.set(w);
context.write(word, one);
}
}
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

public static void main(String[] args) throws Exception {


Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

#Output

WEEK - 11. In-database Analytics

1. In-database Analytics

-- Enable the PL/R extension


CREATE EXTENSION plr;

-- Create the sales_data table


CREATE TABLE sales_data (
id SERIAL PRIMARY KEY,
product VARCHAR(50),
sales_amount DECIMAL,
sales_date DATE
);

-- Insert sample data


INSERT INTO sales_data (product, sales_amount, sales_date) VALUES
('Product A', 150.75, '2023-01-01'),
('Product B', 200.50, '2023-01-02'),
('Product C', 175.25, '2023-01-03'),
('Product A', 125.00, '2023-01-04'),
('Product B', 225.75, '2023-01-05');

-- Create a PL/R function to calculate summary statistics


CREATE OR REPLACE FUNCTION sales_summary()
RETURNS TABLE(product VARCHAR, total_sales DECIMAL, average_sales DECIMAL,
max_sales DECIMAL, min_sales DECIMAL) AS $$
  # Query the table through PL/R's SPI interface (dbGetQuery/con are not available inside PL/R)
  data <- pg.spi.exec("SELECT product, sales_amount FROM sales_data")
  summary_stats <- aggregate(data$sales_amount, by = list(data$product),
                             FUN = function(x) c(Total = sum(x), Mean = mean(x), Max = max(x), Min = min(x)))
  # aggregate() stores the statistics in a matrix column; flatten it before renaming
  summary_stats <- do.call(data.frame, summary_stats)
  colnames(summary_stats) <- c("Product", "Total_Sales", "Average_Sales", "Max_Sales", "Min_Sales")
  return(summary_stats)
$$ LANGUAGE plr;

-- Call the sales_summary function


SELECT * FROM sales_summary();

-- Create a PL/R function for linear regression


CREATE OR REPLACE FUNCTION linear_regression()
RETURNS TABLE(intercept DECIMAL, slope DECIMAL, r_squared DECIMAL) AS $$
  # Query the table through PL/R's SPI interface
  data <- pg.spi.exec("SELECT sales_amount, EXTRACT(EPOCH FROM sales_date) AS time FROM sales_data")
  # Regress sales amount on time and pull out the fitted coefficients
  model <- lm(sales_amount ~ time, data = data)
  summary_model <- summary(model)
  intercept <- coef(summary_model)[1]
  slope <- coef(summary_model)[2]
  r_squared <- summary_model$r.squared
  return(data.frame(intercept, slope, r_squared))
$$ LANGUAGE plr;

-- Call the linear_regression function


SELECT * FROM linear_regression();

#Output

WEEK - 12 Implementation of Queries using MongoDB

#program:

// Use the database


use testDatabase

// Drop existing collections if they exist


db.salesData.drop()
db.productDetails.drop()

// Create collections and insert sample data


db.createCollection("salesData")
db.salesData.insertMany([
{ product: "Product A", salesAmount: 150.75, salesDate: new Date("2023-01-01") },
{ product: "Product B", salesAmount: 200.50, salesDate: new Date("2023-01-02") },
{ product: "Product C", salesAmount: 175.25, salesDate: new Date("2023-01-03") },
{ product: "Product A", salesAmount: 125.00, salesDate: new Date("2023-01-04") },
{ product: "Product B", salesAmount: 225.75, salesDate: new Date("2023-01-05") }
])

db.createCollection("productDetails")
db.productDetails.insertMany([
{ product: "Product A", category: "Electronics", price: 300 },
{ product: "Product B", category: "Furniture", price: 450 },
{ product: "Product C", category: "Appliances", price: 200 }
])

// Basic Queries
print("Find All Documents")
printjson(db.salesData.find().toArray())

print("Find Documents with a Condition")


printjson(db.salesData.find({ product: "Product A" }).toArray())

print("Projection (Select Specific Fields)")


printjson(db.salesData.find({ product: "Product A" }, { _id: 0, product: 1, salesAmount: 1
}).toArray())

print("Sorting")
printjson(db.salesData.find().sort({ salesAmount: -1 }).toArray())

print("Limit and Skip")


printjson(db.salesData.find().sort({ salesAmount: -1 }).limit(2).skip(1).toArray())

// Aggregation Framework
print("Simple Aggregation (Total Sales by Product)")
printjson(db.salesData.aggregate([
{ $group: { _id: "$product", totalSales: { $sum: "$salesAmount" } } }
]).toArray())

print("Aggregation with Multiple Stages (Total and Average Sales by Product)")


printjson(db.salesData.aggregate([
{ $group: { _id: "$product", totalSales: { $sum: "$salesAmount" }, avgSales: { $avg:
"$salesAmount" } } }
]).toArray())

print("Match and Group (Sales for Specific Product)")


printjson(db.salesData.aggregate([
{ $match: { product: "Product A" } },
{ $group: { _id: "$product", totalSales: { $sum: "$salesAmount" }, avgSales: { $avg:
"$salesAmount" } } }
]).toArray())

// Advanced Queries
print("Using $lookup for Joins")
printjson(db.salesData.aggregate([
{
$lookup: {
from: "productDetails",
localField: "product",
foreignField: "product",
as: "productInfo"
}
},
{ $unwind: "$productInfo" },
{ $project: { product: 1, salesAmount: 1, salesDate: 1, category: "$productInfo.category",
price: "$productInfo.price" } }
]).toArray())

print("Using $match and $project in Aggregation")


printjson(db.salesData.aggregate([
{ $match: { salesAmount: { $gt: 150 } } },
{ $project: { product: 1, salesAmount: 1, _id: 0 } }
]).toArray())
