The Big Data Analytics Lab Manual outlines a course designed for III-B.Tech II-Semester students, focusing on the importance of Big Data Analytics and various statistical techniques. The course includes practical experiments using R and Hadoop, covering topics such as hypothesis testing, clustering, regression, and time-series analysis. By the end of the course, students will be able to develop Big Data solutions and apply machine learning techniques.


Big Data Analytics Lab Manual

III-B.Tech II-Semester          L  T  P  C
Course Code: A1DS604PC          0  0  3  1.5

COURSE OBJECTIVES:

The course should enable the students to:


1. Outline the importance of Big Data Analytics.
2. Apply statistical techniques for Big Data Analytics.
3. Analyze problems appropriate to mining data streams.
4. Apply the knowledge of clustering techniques in data mining.
5. Use graph analytics for Big Data and provide solutions.
6. Apply Hadoop MapReduce programming for handling Big Data.

COURSE OUTCOMES:

At the end of the course, the students will be able to:


1. Identify Big Data and its business implications.
2. List the components of Hadoop and the Hadoop ecosystem.
3. Access and process data on a distributed file system.
4. Manage job execution in a Hadoop environment.
5. Develop Big Data solutions using the Hadoop ecosystem.
6. Analyze InfoSphere BigInsights Big Data recommendations.
7. Apply machine learning techniques using R.

LIST OF EXPERIMENTS:
1. Study of R Programming
2. Hypothesis Test using R
3. K-means Clustering using R
4. Naïve Bayesian Classifier
5. Implementation of Linear Regression
6. Implement Logistic Regression
7. Time-series Analysis
8. Association Rules using R
9. Data Analysis-Visualization using R
10. MapReduce using Hadoop
11. In-database Analytics
12. Implementation of Queries using MongoDB
WEEK - 1 STUDY OF R PROGRAMMING

1. Study of R Programming

#program:

# R Programming Basics

# 1. Variables and Data Types


name <- "John"
age <- 25
height <- 175.5
is_student <- TRUE

# Print the variables


cat("Name:", name, "\n")
cat("Age:", age, "\n")
cat("Height:", height, "\n")
cat("Is Student:", is_student, "\n\n")

# 2. Vectors and Operations


numbers <- c(2, 4, 6, 8, 10)
result <- numbers * 2

cat("Original Numbers:", numbers, "\n")


cat("Doubled Numbers:", result, "\n\n")

# 3. Data Visualization using ggplot2


# Install the ggplot2 package if not installed already
# install.packages("ggplot2")

# Load the ggplot2 library


library(ggplot2)

# Create a simple scatter plot


data <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2, 4, 1, 3, 5))

# Plotting
ggplot(data, aes(x = x, y = y)) +
geom_point(color = "blue") +
ggtitle("Simple Scatter Plot") +
xlab("X-axis") +
ylab("Y-axis")

# Save the plot as an image (optional)


# ggsave("scatter_plot.png", plot = last_plot(), device = "png")
#Output:

> cat("Name:", name, "\n")


Name: John
> cat("Age:", age, "\n")
Age: 25
> cat("Height:", height, "\n")
Height: 175.5
> cat("Is Student:", is_student, "\n\n")
Is Student: TRUE

> cat("Original Numbers:", numbers, "\n")


Original Numbers: 2 4 6 8 10
> cat("Doubled Numbers:", result, "\n\n")
Doubled Numbers: 4 8 12 16 20
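
The later experiments rely on data frames and user-defined functions as well; the optional sketch below (not part of the prescribed program, using made-up example data) illustrates both.

# Optional practice: data frames and a user-defined function
students <- data.frame(
  name = c("Asha", "Ravi", "Meena"),   # hypothetical example data
  marks = c(78, 85, 92)
)

# Classify marks into letter grades
grade <- function(marks) {
  ifelse(marks >= 90, "A", ifelse(marks >= 80, "B", "C"))
}

students$grade <- grade(students$marks)
print(students)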

WEEK - 2. HYPOTHESIS TEST USING R

1. Hypothesis Test using R

#program:

# Hypothesis Test using R

# Generate hypothetical data for two groups


set.seed(123) # for reproducibility
group1 <- rnorm(30, mean = 50, sd = 10)
group2 <- rnorm(30, mean = 55, sd = 10)

# Perform a t-test
t_test_result <- t.test(group1, group2)

# Print the t-test results


cat("Hypothesis Test Results:\n")
cat("t-value:", t_test_result$statistic, "\n")
cat("p-value:", t_test_result$p.value, "\n\n")

# Interpret the results


if (t_test_result$p.value < 0.05) {
  cat("The difference between the groups is statistically significant at the 0.05 level.\n")
} else {
  cat("There is no significant difference between the groups at the 0.05 level.\n")
}
#Output:

> cat("Hypothesis Test Results:\n")


Hypothesis Test Results:
> cat("t-value:", t_test_result$statistic, "\n")
t-value: -3.084094
> cat("p-value:", t_test_result$p.value, "\n\n")
p-value: 0.00315574

The difference between the groups is statistically significant at the 0.05 level.
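
For a fuller picture than the t-value and p-value alone, the complete t.test() result can be printed; a minimal optional follow-up, assuming the same group1 and group2 as above:

# Print the full test result, including the 95% confidence interval
print(t_test_result)

# Individual components can also be read from the result object
cat("95% CI:", t_test_result$conf.int, "\n")
cat("Group means:", t_test_result$estimate, "\n")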

WEEK - 3 K MEANS CLUSTERING USING R

1. K-means Clustering using R

#program:

# K-means Clustering using R

# Load the iris dataset


data(iris)

# Selecting relevant features for clustering (e.g., Sepal Length and Sepal Width)
iris_features <- iris[, c("Sepal.Length", "Sepal.Width")]

# Set the number of clusters (K)


k <- 3

# Perform K-means clustering


kmeans_result <- kmeans(iris_features, centers = k)

# Print the clustering results


cat("Cluster Centers:\n")
print(kmeans_result$centers)

cat("\nCluster Assignments:\n")
print(kmeans_result$cluster)

# Visualize the clustering (scatter plot)


plot(iris_features, col = kmeans_result$cluster, pch = 16, main = "K-means Clustering")
points(kmeans_result$centers, col = 1:k, pch = 8, cex = 2)
#Output:

> cat("Cluster Centers:\n")


Cluster Centers:
> print(kmeans_result$centers)
Sepal.Length Sepal.Width
1 5.006000 3.428000
2 5.773585 2.692453
3 6.812766 3.074468
>
> cat("\nCluster Assignments:\n")

Cluster Assignments:
> print(kmeans_result$cluster)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[37] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 3 2 3 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
[73] 2 2 3 3 3 3 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3
[109] 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3
[145] 3 3 2 3 3 2
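
The program above fixes k = 3; a common way to justify the choice of k is the elbow method. The optional sketch below reuses iris_features and plots the total within-cluster sum of squares for k = 1 to 10:

# Elbow method: total within-cluster sum of squares for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(iris_features, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of Clusters (k)",
     ylab = "Total Within-Cluster Sum of Squares",
     main = "Elbow Method")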

WEEK - 4 NAIVE BAYESIAN CLASSIFIER

1. Naïve Bayesian Classifier

#program:

# Naive Bayesian Classifier using R

# Load the iris dataset


data(iris)

# Install and load the necessary package (if not already installed)
# install.packages("e1071")
library(e1071)

# Split the dataset into training and testing sets


set.seed(123)
indices <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[indices, ]
test_data <- iris[-indices, ]

# Train the Naive Bayes classifier


nb_model <- naiveBayes(Species ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width, data = train_data)

# Make predictions on the test set


predictions <- predict(nb_model, test_data)

# Confusion matrix to evaluate the performance


conf_matrix <- table(predictions, test_data$Species)
print("Confusion Matrix:")
print(conf_matrix)

# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 2)))

#Output:

> print("Confusion Matrix:")


[1] "Confusion Matrix:"
> print(conf_matrix)

predictions  setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         18         0
  virginica       0          0        13
> # Calculate accuracy
> accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
> print(paste("Accuracy:", round(accuracy, 2)))
[1] "Accuracy: 1"

WEEK - 5 IMPLEMENTATION OF LINEAR REGRESSION

1. Implementation of Linear Regression

#program:

# Implementation of Linear Regression in R

# Generate a hypothetical dataset


set.seed(123)
x <- 1:50
y <- 2 * x + rnorm(50, mean = 0, sd = 5)

# Create a data frame


data <- data.frame(x = x, y = y)

# Fit a linear regression model


linear_model <- lm(y ~ x, data = data)

# Display the summary of the linear regression model


summary(linear_model)
# Make predictions
new_data <- data.frame(x = 51:60)
predictions <- predict(linear_model, newdata = new_data)

# Display the predictions


cat("Predictions for new data:\n")
print(data.frame(x = new_data$x, predicted_y = predictions))

#Output:

> # Display the summary of the linear regression model


> summary(linear_model)

Call:
lm(formula = y ~ x, data = data)

Residuals:
Min 1Q Median 3Q Max
-10.0560 -3.1111 -0.4097 3.3295 10.7983

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.34508 1.34274 0.257 0.798
x 1.99321 0.04583 43.494 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.676 on 48 degrees of freedom


Multiple R-squared: 0.9753, Adjusted R-squared: 0.9747
F-statistic: 1892 on 1 and 48 DF, p-value: < 2.2e-16

> cat("Predictions for new data:\n")


Predictions for new data:
> print(data.frame(x = new_data$x, predicted_y = predictions))
x predicted_y
1 51 101.9990
2 52 103.9922
3 53 105.9854
4 54 107.9786
5 55 109.9718
6 56 111.9650
7 57 113.9582
8 58 115.9514
9 59 117.9447
10 60 119.9379
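
A quick visual check of the fit is often added alongside summary(); the optional sketch below plots the data together with the fitted regression line, assuming the same data and linear_model objects:

# Scatter plot with the fitted regression line
plot(data$x, data$y, pch = 16, xlab = "x", ylab = "y",
     main = "Linear Regression Fit")
abline(linear_model, col = "red", lwd = 2)
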
WEEK - 6 IMPLEMENT LOGISTIC REGRESSION

1. Implement Logistic Regression

#program:

# Implement Logistic Regression in R

# Load the iris dataset


data(iris)

# Create a binary outcome variable for demonstration (versicolor vs. non-versicolor)
iris$IsVersicolor <- ifelse(iris$Species == "versicolor", 1, 0)

# Split the dataset into training and testing sets


set.seed(123)
indices <- sample(1:nrow(iris), 0.7 * nrow(iris))
train_data <- iris[indices, ]
test_data <- iris[-indices, ]

# Fit a logistic regression model


logistic_model <- glm(IsVersicolor ~ Sepal.Length + Sepal.Width +
Petal.Length + Petal.Width,
data = train_data,
family = "binomial")

# Display the summary of the logistic regression model


summary(logistic_model)

# Make predictions on the test set


predictions <- predict(logistic_model, newdata = test_data, type = "response")

# Convert predicted probabilities to binary predictions (0 or 1)


binary_predictions <- ifelse(predictions >= 0.5, 1, 0)

# Create a confusion matrix to evaluate performance


conf_matrix <- table(binary_predictions, test_data$IsVersicolor)
print("Confusion Matrix:")
print(conf_matrix)

# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
#Output:

> # Display the summary of the logistic regression model


> summary(logistic_model)

Call:
glm(formula = IsVersicolor ~ Sepal.Length + Sepal.Width + Petal.Length +
Petal.Width, family = "binomial", data = train_data)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.2596 2.8631 2.186 0.0288 *
Sepal.Length -0.5910 0.7564 -0.781 0.4346
Sepal.Width -2.0960 0.8222 -2.549 0.0108 *
Petal.Length 1.6183 0.8229 1.967 0.0492 *
Petal.Width -3.0218 1.4200 -2.128 0.0333 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 129.12 on 104 degrees of freedom


Residual deviance: 104.10 on 100 degrees of freedom
AIC: 114.1

Number of Fisher Scoring iterations: 5


> print("Confusion Matrix:")
[1] "Confusion Matrix:"
> print(conf_matrix)

binary_predictions 0 1
0 24 10
1 3 8
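
The accuracy line is not shown above; from the confusion matrix it works out to (24 + 8) / (24 + 10 + 3 + 8) = 32 / 45 ≈ 0.71.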

WEEK - 7 Time-series Analysis

1. Time-series Analysis

#program:

# Install and load necessary packages


install.packages("forecast")
install.packages("tseries")
install.packages("xts")
library(forecast)
library(tseries)
library(xts)
# Load the AirPassengers dataset
data("AirPassengers")

# Convert to time series object


ts_data <- ts(AirPassengers, start = c(1949, 1), frequency = 12)

# Display the first few rows of the dataset


print(head(ts_data))

# Plot the time series data


plot(ts_data, main="AirPassengers Data", ylab="Number of Passengers", xlab="Year")

# Decompose the time series


decomposed <- decompose(ts_data)
plot(decomposed)

# Perform the Augmented Dickey-Fuller test


adf_test <- adf.test(ts_data)
print(adf_test)

# Difference the series if necessary


diff_ts_data <- diff(ts_data)
plot(diff_ts_data, main="Differenced AirPassengers Data", ylab="Differenced Number of Passengers", xlab="Year")

# Fit an ARIMA model


fit <- auto.arima(ts_data)
print(fit)

# Forecast the next 24 months


forecasted <- forecast(fit, h = 24)
plot(forecasted)

# Check the residuals of the model


checkresiduals(fit)

# Calculate accuracy metrics


accuracy(forecasted)

#output
> library(forecast)
> library(tseries)
> library(xts)
>
> # Load the AirPassengers dataset
> data("AirPassengers")
>
> # Convert to time series object
> ts_data <- ts(AirPassengers, start = c(1949, 1), frequency = 12)
>
> # Display the first few rows of the dataset
> print(head(ts_data))
Jan Feb Mar Apr May Jun
1949 112 118 132 129 121 135
>
> # Plot the time series data
> plot(ts_data, main="AirPassengers Data", ylab="Number of Passengers", xlab="Year")
>
> # Decompose the time series
> decomposed <- decompose(ts_data)
> plot(decomposed)
>
> # Perform the Augmented Dickey-Fuller test
> adf_test <- adf.test(ts_data)
> print(adf_test)

​ Augmented Dickey-Fuller Test

data: ts_data
Dickey-Fuller = -7.3186, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary

>
> # Difference the series if necessary
> diff_ts_data <- diff(ts_data)
> plot(diff_ts_data, main="Differenced AirPassengers Data", ylab="Differenced Number of Passengers", xlab="Year")
>
> # Fit an ARIMA model
> fit <- auto.arima(ts_data)
> print(fit)
Series: ts_data
ARIMA(2,1,1)(0,1,0)[12]

Coefficients:
ar1 ar2 ma1
0.5960 0.2143 -0.9819
s.e. 0.0888 0.0880 0.0292

sigma^2 = 132.3: log likelihood = -504.92


AIC=1017.85 AICc=1018.17 BIC=1029.35
>
> # Forecast the next 24 months
> forecasted <- forecast(fit, h = 24)
> plot(forecasted)
>
> # Check the residuals of the model
> checkresiduals(fit)

​ Ljung-Box test

data: Residuals from ARIMA(2,1,1)(0,1,0)[12]


Q* = 37.784, df = 21, p-value = 0.01366

Model df: 3. Total lags used: 24

>
> # Calculate accuracy metrics
> accuracy(forecasted)
ME RMSE MAE MPE MAPE MASE ACF1
Training set 1.3423 10.84619 7.86754 0.420698 2.800458 0.245628 -0.00124847
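
The AirPassengers series has multiplicative seasonality (the seasonal swings grow with the overall level), so the model is often fitted on the log scale instead; a minimal optional variant of the steps above:

# Fit and forecast on the log-transformed series
log_ts <- log(ts_data)
fit_log <- auto.arima(log_ts)
print(fit_log)

forecast_log <- forecast(fit_log, h = 24)
plot(forecast_log, main = "Forecast of log(AirPassengers)")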

WEEK - 8 Association Rules using R

1. Association Rules using R

#program:
# Load required packages
install.packages("arules")
library(arules)

# Load and prepare the data


data("Groceries")
# Inspect the dataset
summary(Groceries)

# Generate rules using the Apriori algorithm


rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))

# Inspect the first 5 rules


inspect(rules[1:5])

# Sort rules by lift and inspect the top 5 rules


rules_sorted <- sort(rules, by = "lift")
inspect(rules_sorted[1:5])

# Filter rules with specific items in the consequent


milk_rules <- subset(rules, subset = rhs %in% "whole milk" & lift > 1.2)
inspect(milk_rules)

# Visualize rules
install.packages("arulesViz")
library(arulesViz)
plot(rules)

#Output:
> library(arulesViz)
> plot(rules)
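
Rules can also be ranked by confidence, or filtered by an item on the left-hand side in the same way as the milk_rules example; a short optional sketch (assuming the rules object from above, and that yogurt-based rules survive the chosen support and confidence thresholds):

# Top 5 rules by confidence
rules_by_conf <- sort(rules, by = "confidence")
inspect(rules_by_conf[1:5])

# Rules with "yogurt" on the left-hand side
yogurt_rules <- subset(rules, subset = lhs %in% "yogurt")
inspect(yogurt_rules)
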
WEEK - 9 Data Analysis-Visualization using R.

1. Data Analysis-Visualization using R.

#program:
# Load required packages
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
library(ggplot2)
library(dplyr)
library(tidyr)

# Load and inspect the data


data(mtcars)
str(mtcars)
summary(mtcars)

# Data manipulation with dplyr


mtcars_filtered <- mtcars %>%
filter(cyl == 6)

mtcars_summary <- mtcars %>%


summarise(mean_mpg = mean(mpg), mean_hp = mean(hp))

mtcars <- mtcars %>%


mutate(hp_category = ifelse(hp > 150, "High HP", "Low HP"))

# Display the modified dataset


head(mtcars)

# Data visualization with ggplot2


# Scatter plot of mpg vs hp with color indicating number of cylinders
ggplot(mtcars, aes(x = hp, y = mpg, color = as.factor(cyl))) +
geom_point() +
labs(title = "Scatter plot of MPG vs HP",
x = "Horsepower",
y = "Miles Per Gallon",
color = "Cylinders")

# Histogram of mpg
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "blue", color = "black") +
labs(title = "Histogram of MPG",
x = "Miles Per Gallon",
y = "Frequency")

# Box plot of mpg by number of cylinders


ggplot(mtcars, aes(x = as.factor(cyl), y = mpg)) +
geom_boxplot() +
labs(title = "Box Plot of MPG by Cylinders",
x = "Number of Cylinders",
y = "Miles Per Gallon")

# Bar plot of the number of cars in each horsepower category


ggplot(mtcars, aes(x = hp_category)) +
geom_bar(fill = "orange") +
labs(title = "Bar Plot of Cars by Horsepower Category",
x = "Horsepower Category",
y = "Count")

#output

> # Load and inspect the data


> data(mtcars)
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> summary(mtcars)
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
>
> # Data manipulation with dplyr
> mtcars_filtered <- mtcars %>%
+ filter(cyl == 6)
>
> mtcars_summary <- mtcars %>%
+ summarise(mean_mpg = mean(mpg), mean_hp = mean(hp))
>
> mtcars <- mtcars %>%
+ mutate(hp_category = ifelse(hp > 150, "High HP", "Low HP"))
>
> # Display the modified dataset
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
hp_category
Mazda RX4 Low HP
Mazda RX4 Wag Low HP
Datsun 710 Low HP
Hornet 4 Drive Low HP
Hornet Sportabout High HP
Valiant Low HP
>
> # Data visualization with ggplot2
> # Scatter plot of mpg vs hp with color indicating number of cylinders
> ggplot(mtcars, aes(x = hp, y = mpg, color = as.factor(cyl))) +
+ geom_point() +
+ labs(title = "Scatter plot of MPG vs HP",
+ x = "Horsepower",
+ y = "Miles Per Gallon",
+ color = "Cylinders")
>
> # Histogram of mpg
> ggplot(mtcars, aes(x = mpg)) +
+ geom_histogram(binwidth = 2, fill = "blue", color = "black") +
+ labs(title = "Histogram of MPG",
+ x = "Miles Per Gallon",
+ y = "Frequency")
>
> # Box plot of mpg by number of cylinders
> ggplot(mtcars, aes(x = as.factor(cyl), y = mpg)) +
+ geom_boxplot() +
+ labs(title = "Box Plot of MPG by Cylinders",
+ x = "Number of Cylinders",
+ y = "Miles Per Gallon")
>
> # Bar plot of the number of cars in each horsepower category
> ggplot(mtcars, aes(x = hp_category)) +
+ geom_bar(fill = "orange") +
+ labs(title = "Bar Plot of Cars by Horsepower Category",
+ x = "Horsepower Category",
+ y = "Count")

WEEK - 10 MapReduce using Hadoop

1. MapReduce using Hadoop

#program:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = value.toString();
String[] words = line.split("\\s+");
for (String w : words) {
word.set(w);
context.write(word, one);
}
}
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

public static void main(String[] args) throws Exception {


Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

#Output

WEEK - 11. In-database Analytics

1. In-database Analytics

-- Enable the PL/R extension


CREATE EXTENSION plr;

-- Create the sales_data table


CREATE TABLE sales_data (
id SERIAL PRIMARY KEY,
product VARCHAR(50),
sales_amount DECIMAL,
sales_date DATE
);

-- Insert sample data


INSERT INTO sales_data (product, sales_amount, sales_date) VALUES
('Product A', 150.75, '2023-01-01'),
('Product B', 200.50, '2023-01-02'),
('Product C', 175.25, '2023-01-03'),
('Product A', 125.00, '2023-01-04'),
('Product B', 225.75, '2023-01-05');

-- Create a PL/R function to calculate summary statistics


CREATE OR REPLACE FUNCTION sales_summary()
RETURNS TABLE(product VARCHAR, total_sales DECIMAL, average_sales DECIMAL,
max_sales DECIMAL, min_sales DECIMAL) AS $$
  # Query the table through PL/R's SPI interface (dbGetQuery/con are not available inside PL/R)
  data <- pg.spi.exec("SELECT product, sales_amount FROM sales_data")
  summary_stats <- aggregate(data$sales_amount, by = list(data$product),
                             FUN = function(x) c(Total = sum(x), Mean = mean(x), Max = max(x), Min = min(x)))
  # aggregate() stores the statistics in a matrix column; flatten it before renaming
  summary_stats <- do.call(data.frame, summary_stats)
  colnames(summary_stats) <- c("Product", "Total_Sales", "Average_Sales", "Max_Sales", "Min_Sales")
  return(summary_stats)
$$ LANGUAGE plr;

-- Call the sales_summary function


SELECT * FROM sales_summary();

-- Create a PL/R function for linear regression


CREATE OR REPLACE FUNCTION linear_regression()
RETURNS TABLE(intercept DECIMAL, slope DECIMAL, r_squared DECIMAL) AS $$
  # Query the table through PL/R's SPI interface
  data <- pg.spi.exec("SELECT sales_amount, EXTRACT(EPOCH FROM sales_date) AS time FROM sales_data")
  # Regress sales amount on time and pull out the fitted coefficients
  model <- lm(sales_amount ~ time, data = data)
  summary_model <- summary(model)
  intercept <- coef(summary_model)[1]
  slope <- coef(summary_model)[2]
  r_squared <- summary_model$r.squared
  return(data.frame(intercept, slope, r_squared))
$$ LANGUAGE plr;

-- Call the linear_regression function


SELECT * FROM linear_regression();

#Output

WEEK - 12 Implementation of Queries using MongoDB

#program:

// Use the database


use testDatabase

// Drop existing collections if they exist


db.salesData.drop()
db.productDetails.drop()

// Create collections and insert sample data


db.createCollection("salesData")
db.salesData.insertMany([
{ product: "Product A", salesAmount: 150.75, salesDate: new Date("2023-01-01") },
{ product: "Product B", salesAmount: 200.50, salesDate: new Date("2023-01-02") },
{ product: "Product C", salesAmount: 175.25, salesDate: new Date("2023-01-03") },
{ product: "Product A", salesAmount: 125.00, salesDate: new Date("2023-01-04") },
{ product: "Product B", salesAmount: 225.75, salesDate: new Date("2023-01-05") }
])

db.createCollection("productDetails")
db.productDetails.insertMany([
{ product: "Product A", category: "Electronics", price: 300 },
{ product: "Product B", category: "Furniture", price: 450 },
{ product: "Product C", category: "Appliances", price: 200 }
])

// Basic Queries
print("Find All Documents")
printjson(db.salesData.find().toArray())

print("Find Documents with a Condition")


printjson(db.salesData.find({ product: "Product A" }).toArray())

print("Projection (Select Specific Fields)")


printjson(db.salesData.find({ product: "Product A" }, { _id: 0, product: 1, salesAmount: 1
}).toArray())

print("Sorting")
printjson(db.salesData.find().sort({ salesAmount: -1 }).toArray())

print("Limit and Skip")


printjson(db.salesData.find().sort({ salesAmount: -1 }).limit(2).skip(1).toArray())

// Aggregation Framework
print("Simple Aggregation (Total Sales by Product)")
printjson(db.salesData.aggregate([
{ $group: { _id: "$product", totalSales: { $sum: "$salesAmount" } } }
]).toArray())

print("Aggregation with Multiple Stages (Total and Average Sales by Product)")


printjson(db.salesData.aggregate([
{ $group: { _id: "$product", totalSales: { $sum: "$salesAmount" }, avgSales: { $avg:
"$salesAmount" } } }
]).toArray())

print("Match and Group (Sales for Specific Product)")


printjson(db.salesData.aggregate([
{ $match: { product: "Product A" } },
{ $group: { _id: "$product", totalSales: { $sum: "$salesAmount" }, avgSales: { $avg:
"$salesAmount" } } }
]).toArray())

// Advanced Queries
print("Using $lookup for Joins")
printjson(db.salesData.aggregate([
{
$lookup: {
from: "productDetails",
localField: "product",
foreignField: "product",
as: "productInfo"
}
},
{ $unwind: "$productInfo" },
{ $project: { product: 1, salesAmount: 1, salesDate: 1, category: "$productInfo.category",
price: "$productInfo.price" } }
]).toArray())

print("Using $match and $project in Aggregation")


printjson(db.salesData.aggregate([
{ $match: { salesAmount: { $gt: 150 } } },
{ $project: { product: 1, salesAmount: 1, _id: 0 } }
]).toArray())
