0% found this document useful (0 votes)

9 views41 pages

BDA Manual

The document outlines experiments for installing and configuring Apache Hadoop and HDFS, as well as implementing word count and matrix multiplication programs using MapReduce. It details the steps for setting up the environment, including Java installation, Hadoop setup, and writing MapReduce code for processing large datasets. The experiments aim to demonstrate the capabilities of Hadoop in handling big data through practical implementations.

Uploaded by

pm96mithun

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views41 pages

BDA Manual

Uploaded by

pm96mithun

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 41

EXP NO: 1

Install, Configure and Run Hadoop and HDFS

Date:

AIM: To Install, Configure, and run Apache Hadoop and HDFS

CONTEXT:

Hadoop is a Java-based programming framework that supports the processing

and storage of extremely large datasets on a cluster of inexpensive machines. It
was the first major open-source project in the big data playing field and is
sponsored by the Apache Software Foundation.

Hadoop is comprised of four main layers:

 Hadoop Common is the collection of utilities and libraries that support

other Hadoop modules.
 HDFS, which stands for Hadoop Distributed File System, is responsible
for persisting data to disk.
 YARN, short for Yet Another Resource Negotiator, is the "operating
system" for HDFS.
 MapReduce is the original processing model for Hadoop clusters. It
distributes work within the cluster or map, then organizes and reduces
the results from the nodes into a response to a query. Many other
processing models are available for the 2.x version of Hadoop.
Hadoop clusters are relatively complex to set up, so the project includes a stand-
alone mode which is suitable for learning about Hadoop, performing simple
operations, and debugging.

IMPLEMENTATION:

In this experiment, Hadoop will be installed in stand-alone mode in linux system.

(as Hadoop was originally developed on Linux and it has Native support for the
Apache Hadoop ecosystem)

PreConfiguration:

1. Pre-Configure a Virtual machine (VMware or Oracle VirtualBox) with

Ubuntu image file or use an Ubuntu operating system for running Hadoop

Configuration of Hadoop Environment:

NOTE 1: All the underlined commands are Linux statements to be

executed in the linux terminal
NOTE 2: In case of oracle virtual box, if there are issues with su access,
change the user as root and provide the password used for vm login.
su root

A. JAVA INSTALLATION

1
 Check the java version $java –version

The program 'java' can be found in the following packages:

* default-jre
* gcj-5-jre-headless
* openjdk-8-jre-headless
* gcj-4.8-jre-headless
* gcj-4.9-jre-headless
* openjdk-9-jre-headless
Try: sudo apt install <selected package>
 sudo apt-get install openjdk-8-jre-headless
 Check the java version again after installation java –version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (build 1.8.0_292-8u292-b10-
ubuntu1~16.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)
 Download JDK for Linux and copy it in Downloads folder of the Linux
system
 Create a folder for java sudo mkdir -p /usr/local/java
 Copy the java files to the above location sudo cp jdk-8u74-linux-
x64.tar.gz /usr/local/java
 cd /usr/local/java
 sudo tar -zxvf jdk-8u74-linux-x64.tar.gz # For extracting the zip files
 To add the java variables in environment:
sudo gedit /etc/profile
Add the java environmental variables in the /etc/profile file
export JAVA_HOME=/usr/local/java/jdk1.8.0_74
export PATH=$JAVA_HOME/bin:$PATH
 Run jps in the terminal ensure that there are no error messages received
>jps
B. HADOOP SETUP
 Install Hadoop from official website and place it in downloads
 Crete a folder for placing the Hadoop files sudo mkdir –p
/home/Hadoop
2
 cd /Downloads
 sudo cp hadoop-1.2.1.tar.gz /home/Hadoop
 cd /home/Hadoop
 sudo tar -zxvf filename hadoop-1.2.1.tar.gz # to unizip the file
 sudo /home/hadoop/hadoop-1.2.1/conf
 sudo vi hadoop-env.sh
Add the below lines
export JAVA_HOME=/usr/local/java/jdk1.8.0_74
export PATH=$JAVA_HOME/bin:$PATH
 sudo chmod 777 hadoop-1.2.1
 cd hadoop-1.2.1/
 sudo vi conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
 sudo vi conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
 sudo vi conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>hdfs://localhost:9001</value>
</property>
 Create a file named clean in the location /home/hadoop/hadoop-1.2.1 and create the
below specified content
 pico clean #create a file named clean
sudo rm –r /home/hadoop/hadoop-1.2.1/dfs
sudo rm –r /home/hadoop/hadoop-1.2.1/dfstemp
mkdir /home/hadoop/hadoop-1.2.1/dfs
mkdir /home/hadoop/hadoop-1.2.1/dfstemp
chmod 755 /home/hadoop/hadoop-1.2.1/dfs
3
chmod 755 /home/hadoop/hadoop-1.2.1/dfstemp
 chmod 777 clean

Start all hadoop daemons

 $cd /home/hadoop/hadoop-1.2.1/
 $./clean
 $bin/hadoop namenode-format
 $bin/start-all.sh #enter passwords whenever it is prompted
 $jps
 Check if the jps command lists all the required services,
Jps
DataNode
SecondaryNameNode
NameNode
JobTracker
TaskTracker
 Hit the localhost url in the browser and see if hadoop is up and running

OUTPUT:

4
Result:

5
EXP NO: 2A
Implement word count programs using MapReduce
Date:

AIM:
To Implement programs that calculates word count of a document using MapReduce
CONTEXT:
MapReduce is a Java-based, distributed execution framework within the Apache Hadoop
Ecosystem. Using MapReduce, we can concurrently split and process petabytes of data in
parallel. It consists of two main tasks: mapping and reducing. This programming model is
highly dependent on key-value pairs for processing.
 Mapping: This process takes an input in the form of key-value pairs and produces
another set of intermediate key-value pairs after processing the input.
 Reducing: This process takes the output from the map task and further processes it
into even smaller and possibly readable chunks of data. However, the outcome is still
in form of key -value pairs

IMPLEMENTATION:
PRE-CONFIGURATION:
1. Setup a Environment/ IDE for running Java Code
a. Install latest Eclipse Version
b. Install Java JDK in your system
c. Open Environment Variables information by
Right Clicking on MyPC -> Properties -> View Advanced System Settings ->
Environment Variables
d. Add a New variable ; JAVA_HOME= C:\Program Files\Java\jre1.8.0_441
e. Append ‘Path’ Variable, PATH = C:\Program Files\Java\jre1.8.0_441\bin
f. Download Required Hadoop jars from
https://mvnrepository.com/artifact/org.apache.hadoop
g. In Project ->Properties->Build Path -> Add External Jars ->Add all Hadoop Jars
h. Apply and save the settings

6
JAVA CODE:
Execute the below java code and export the jar of this code as wc.jar
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper extends Mapper<LongWritable,
Text,Text,IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
private LongWritable key = new LongWritable();
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = value.toString();
String[] words = line.split("\\s+");
for(String wordStr : words)
{
word.set(wordStr.trim());
if(!word.toString().isEmpty())
{
context.write(word, count);

7
}
}
}
}

public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,Context context) throws

IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("/input"));
8
FileOutputFormat.setOutputPath(job, new Path("/output"));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
 Copy the jar into required location
 Create folders in hadoop for input and output directories
 bin/hadoop dfs –mkdir input
 bin/hadoop dfs –mkdir output
 Assume abc.txt is the input file for which word count is to be applied
 $bin/hadoop dfs –copyFromLocal abc.txt input
 $bin/hadoop jar wordCount/wc.jar WordCount input output

Result:

EXP NO: 2B Implement matrix multiplication programs using MapReduce

9
Date:

AIM:
To Implement multiplication of two matrices using MapReduce
CONTEXT:
Matrix-vector and matrix-matrix calculations fit nicely into the MapReduce style of
computing. Let M and N are two input matrices of dimension p x q and q x r respectively.
And P is the output matrix, P = M.N of dimension p x r
Map and Reduce functions will implement the following algorithms:

IMPLEMENTATION:
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

10
public class Map
extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
String line = value.toString();
// (M, i, j, Mij);
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("M")) {
for (int k = 0; k < p; k++) {
outputKey.set(indicesAndValue[1] + "," + k);
// outputKey.set(i,k);
outputValue.set(indicesAndValue[0] + "," +
indicesAndValue[2]
+ "," + indicesAndValue[3]);
// outputValue.set(M,j,Mij);
context.write(outputKey, outputValue);
}
} else {
// (N, j, k, Njk);
for (int i = 0; i < m; i++) {
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("N," + indicesAndValue[1] + ","
+ indicesAndValue[3]);
11
context.write(outputKey, outputValue);
}
}
}
}

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MatrixMultiply {

public static void main(String[] args) throws Exception {

if (args.length != 2) {
System.err.println("Usage: MatrixMultiply <in_dir> <out_dir>");
System.exit(2);
}
Configuration conf = new Configuration();
// M is an m-by-n matrix; N is an n-by-p matrix.
conf.set("m", "1000");
conf.set("n", "100");
conf.set("p", "1000");
@SuppressWarnings("deprecation")
Job job = new Job(conf, "MatrixMultiply");

12
job.setJarByClass(MatrixMultiply.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true);
}
}

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.HashMap;

public class Reduce

extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {
@Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String[] value;

13
//key=(i,k),
//Values = [(M/N,j,V/W),..]
HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
for (Text val : values) {
value = val.toString().split(",");
if (value[0].equals("M")) {
hashA.put(Integer.parseInt(value[1]),
Float.parseFloat(value[2]));
} else {
hashB.put(Integer.parseInt(value[1]),
Float.parseFloat(value[2]));
}
}
int n = Integer.parseInt(context.getConfiguration().get("n"));
float result = 0.0f;
float m_ij;
float n_jk;
for (int j = 0; j < n; j++) {
m_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
n_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
result += m_ij * n_jk;
}
if (result != 0.0f) {
context.write(null,
new Text(key.toString() + "," + Float.toString(result)));
}
}
}
OUTPUT:

14
[hadoop@master ~]$ cat matrix_a.txt
0,0,1
0,1,2
0,2,3
1,0,4
1,1,5
1,2,6
2,0,7
2,1,8
2,2,9

[hadoop@master ~]$ cat matrix_b.txt

0,0,9
0,1,8
0,2,7
1,0,6
1,1,5
1,2,4
2,0,3
2,1,2
2,2,1

[hadoop@master ~]$ bin/hadoop dfs -copyFromLocal matrix_a.txt matrix_b.txt input-

matrices
[hadoop@master ~]$ bin/hadoop jar matrixMultiply/mm.jar MatrixMultiply input-matrices
output-matrix
[hadoop@master ~]$ bin/hadoop dfs -cat output-matrix/part-r-00000
0,0 30
0,1 24
0,2 18

15
1,0 84
1,1 69
1,2 54
2,0 138
2,1 114
2,2 90

Result:

EXP NO: 3
Implement an MR program that processes a weather dataset
Date:

16
AIM: To Develop a MapReduce program to find the maximum temperature
from a given weather dataset.

CONTEXT:

The weather data for any year is extracted from National Climatic Data Center –
NCDC website ftp://ftp.ncdc.noaa.gov/pub/data/noaa/.

Map Phase: The input for Map phase is set of weather data files T. Each Map task extracts
the temperature data from the given year file. The output of the map phase is set of key value
pairs. Set of keys are the years. Values are the temperature of each year.

Reduce Phase: Reduce phase takes all the values associated with a particular
key. That is all the temperature values belong to a particular year is fed to a same
reducer. Then each reducer finds the highest recorded temperature for each year.

IMPLEMENTATION:

HighestMapper.java

import java.io.IOException;

import org.apache.hadoop.io.*;

import org.apache.hadoop.mapred.*;

public class HighestMapper extends MapReduceBase implements Mapper<LongWritable,

Text, Text, IntWritable>

public static final int MISSING = 9999;

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,

Reporter reporter) throws IOException

String line = value.toString(); String year = line.substring(15,19); int temperature; if

(line.charAt(87)=='+')

temperature = Integer.parseInt(line.substring(88, 92)); else

temperature = Integer.parseInt(line.substring(87, 92)); String quality = line.substring(92,

93);

if(temperature != MISSING && quality.matches("[01459]")) output.collect(new

Text(year),new IntWritable(temperature)); }

17
}

HighestReducer.java

import java.io.IOException; import java.util.Iterator; import org.apache.hadoop.io.*; import

org.apache.hadoop.mapred.*;

public class HighestReducer extends MapReduceBase implements Reducer<Text,

IntWritable, Text, IntWritable>

public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,

IntWritable> output, Reporter reporter) throws IOException

int max_temp = 0;

while (values.hasNext())

int current=values.next().get(); if ( max_temp < current) max_temp = current;

output.collect(key, new IntWritable(max_temp/10));

HighestDriver.java

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.conf.*;

import org.apache.hadoop.io.*;

import org.apache.hadoop.mapred.*;

import org.apache.hadoop.util.*;

public class HighestDriver extends Configured implements Tool

public int run(String[] args) throws Exception

18
JobConf conf = new JobConf(getConf(), HighestDriver.class);
conf.setJobName("HighestDriver"); conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(HighestMapper.class);
conf.setReducerClass(HighestReducer.class);

Path inp = new Path(args[0]);

Path out = new Path(args[1]);

FileInputFormat.addInputPath(conf, inp);

FileOutputFormat.setOutputPath(conf, out);

JobClient.runJob(conf);

return 0; }

public static void main(String[] args) throws Exception

int res = ToolRunner.run(new Configuration(), new HighestDriver(),args);

System.exit(res); } }

OUTPUT:

bin/hadoop dfs –mkdir whetherdata

$bin/hadoop dfs –copyFromLocal /w1/* whetherdata
bin/hadoop jar whetheranalyze.jar whetherdata MyOutput

Result:

EXP NO: 4A Implement Linear Regression

19
Date:

AIM :
To implement Linear regression to predict housing prices.
CONTEXT:
Linear regression is best used in scenarios where you want to understand and predict the
relationship between a dependent variable and one or more independent variables,
particularly when that relationship appears to be linear. Best use cases are as follows:
 Predicting numeric outcomes based on historical data
 Examples include sales predictions, housing prices, or stock market trends
 Works well when there's a clear linear relationship between variables
 Understanding cause-and-effect relationships

SOURCE CODE :
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
df = pd.read_csv('D:\MiniWorks\ML Programs\canada_per_capita_income.csv')
df = df.rename(columns={'per capita income (US$)': 'income'})

plt.xlabel("year")
plt.ylabel("income")
plt.scatter(df.year, df.income, color='blue', marker='*')

newydf = df.income
newxdf = dfx=df.drop('income', axis='columns')
regressionModel = linear_model.LinearRegression()
regressionModel.fit(newxdf, newydf)
print('prediction', regressionModel.predict([[2020]]))
coef =regressionModel.coef_

20
intercept = regressionModel.intercept_
print('coeff', coef)
print('intercept', intercept)
plt.plot(df.year, coef*df.year + intercept, ls='-', marker=' ')
plt.plot(df.year, df.income)

OUTPUT:

Figure 1Dataset Plot

Figure 2 Linear Regression Output

21
Figure 3 Linear Regression Line Plot

Result:

EXP NO: 4B
Implement Binary Logistic Regression
Date:

22
AIM:
To perform Logistic Regression to predict if a person would buy life insurance based on his
age using logistic regression
CONTEXT:
Logistic regression is a Supervised Learning technique used for predicting the categorical
dependent variable using a given set of independent variables. Logistic regression is
primarily used for binary classification problems. Logistic regression works best when:
 The relationship between features and the outcome is approximately linear
 There are no highly correlated independent variables
 The sample size is relatively large
 The outcome is truly binary
SOURCE CODE
import pandas as pd
from matplotlib import pyplot as plt

import math
def sigmoid(x):
return 1 / (1 + math.exp(-x))

def prediction_function(age,inter,coeff):
z = coeff * age + inter
y = sigmoid(z)
return y

df = pd.read_csv("D:\MiniWorks\ML Programs\insurance_data.csv")
df.head()
plt.scatter(df.age,df.bought_insurance,marker='+',color='red')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test =
train_test_split(df[['age']],df.bought_insurance,train_size=0.8)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
model.predict_proba(X_test)
model.score(X_test,y_test)

#Change the value of Age and see Results

age =60
val = prediction_function(age,model.intercept_,model.coef_)

23
if(val > 0.5):
print("Yes - Buy Insurance")
else:
print("No Insurance")

OUTPUT:

Figure 4Dataset Distribution

Result:

EXP NO: 5
Decision Tree Classifier
Date:

24
AIM:
To execute a decision tree classifier algorithm for predicting diabetic conditions

THEORY:
Decision tree classification starts with the entire dataset at its root and then selects the
best feature to split the data (using metrics like Gini impurity or information gain) . It
then recursively creates branches by making decisions at each node . Splitting is
continued until a stopping criterion is met (max depth, minimum samples, etc.) Best
usecases include Spam email detection , Credit risk assessment, Predicting disease
risk etc

SOURCE CODE:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Import Decision Tree Classifier

from sklearn.model_selection import train_test_split
from sklearn import metrics
#Import scikit-learn metrics module for accuracy calculation
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# load dataset
pima = pd.read_csv("D:\MiniWorks\ML Programs\inddiab.csv", header=None,
names=col_names)

#split dataset in features and target variable

feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# 70% training and 30% test

# Create Decision Tree classifier object

clf = DecisionTreeClassifier()
# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

from sklearn.tree import export_graphviz

25
from six import StringIO
from IPython.display import Image
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,
special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())

OUTPUT

Result:

EXP NO: 6A
IMPLEMENT CLUSTERING TECHNIQUES – K Means
Date:

AIM:
To implement K Means clustering algorithm for grouping set of Loan applicants.

26
THEORY:
K-Means Clustering Overview:
K-means is a fundamental partitioning clustering algorithm that divides a dataset into K
predefined number of distinct, non-overlapping clusters. The algorithm operates by
identifying K centroids and assigning each data point to the nearest centroid, creating clusters
based on proximity. Its primary goal is to minimize the within-cluster variance, ensuring that
points within each cluster are as similar as possible.

IMPLEMENTATION
#import libraries
import pandas as pd
import numpy as np
import random as rd
import matplotlib.pyplot as plt

data = pd.read_csv('clustering.csv')
data.head()
X = data[["LoanAmount","ApplicantIncome"]]
#Visualise data points
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.xlabel('AnnualIncome')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()
K=3

# Select random observation as centroids

Centroids = (X.sample(n=K))
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.scatter(Centroids["ApplicantIncome"],Centroids["LoanAmount"],c='red')
plt.xlabel('AnnualIncome')
27
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

diff = 1
j=0

while(diff!=0):
XD=X
i=1
for index1,row_c in Centroids.iterrows():
ED=[]
for index2,row_d in XD.iterrows():
d1=(row_c["ApplicantIncome"]-row_d["ApplicantIncome"])**2
d2=(row_c["LoanAmount"]-row_d["LoanAmount"])**2
d=np.sqrt(d1+d2)
ED.append(d)
X[i]=ED
i=i+1

C=[]
for index,row in X.iterrows():
min_dist=row[1]
pos=1
for i in range(K):
if row[i+1] < min_dist:
min_dist = row[i+1]
pos=i+1
C.append(pos)
X["Cluster"]=C

28
Centroids_new = X.groupby(["Cluster"]).mean()[["LoanAmount","ApplicantIncome"]]
if j == 0:
diff=1
j=j+1
else:
diff = (Centroids_new['LoanAmount'] - Centroids['LoanAmount']).sum() +
(Centroids_new['ApplicantIncome'] - Centroids['ApplicantIncome']).sum()
print(diff.sum())
Centroids = X.groupby(["Cluster"]).mean()[["LoanAmount","ApplicantIncome"]]

OUTPUT:

29
Figure 5 Dataset Description

Figure 6When Clusters =3

Figure 7When Clusters =2

Result:

EXP NO: 7 IMPLEMENT VARIOUS VISUALIZATION TECHNIQUES

30
Date:

AIM:
To perform exploratory data analysis using various visualization techniques

THEORY:

Data Visualization techniques involve the generation of graphical or pictorial

representation of data, form which leads you to understand the insight of a given data set.
This visualisation technique aims to identify the Patterns, Trends, Correlations, and Outliers
of data sets. Data visualization techniques help us to determine the patterns of business
operations. By understanding the problem statement and identifying the solutions in terms of
pattering and applied to eliminate one or more of the inherent problems.

IMPLEMENTATION

1. Line Chart
import matplotlib.pyplot as plt
import numpy as np
#simple array
x = np.array([1, 2, 3, 4])
#genearting y values
y = x*2
plt.plot(x, y)
plt.show()
#Sample #2
x = np.array([1, 2, 3, 4])
y = np.array([2, 4, 6, 8])
plt.plot(x, y)
plt.xlabel("Time in Hrs")
plt.ylabel("Distance in Km")
plt.title("Time Vs Distance -LINE CHART")
plt.show()
plt.savefig("time_distance.png")
2. Histogram

31
from matplotlib import pyplot as plt

import numpy as np

fig,ax = plt.subplots(1,1)

a=
np.array([25,42,48,55,60,62,67,70,30,38,44,50,54,58,75,78,85,88,89,28,35,90,95])

ax.hist(a, bins = [20,40,60,80,100])

ax.set_title("Student's Score - Histogram")

ax.set_xticks([0,20,40,60,80,100])

ax.set_xlabel('Marks Scored')

ax.set_ylabel('No. of Students')

plt.show()

3. Distribution Plot and Joint plot

import seaborn as sns
import matplotlib.pyplot as plt
from warnings import filterwarnings
df = sns.load_dataset('tips')
sns.distplot(df['total_bill'], kde = True, color ='green', bins = 20, label="Distribution
Plot")
sns.jointplot(x ='total_bill',color ='green', y ='tip', data = df,label="Joint Plot")

4. Pie Chart
from matplotlib import pyplot as plt
import numpy as np
Language = ['English', 'Spanish', 'Chinese',
'Russian', 'Japanese', 'French']
data = [379, 480, 918, 154, 128, 77.2]
# Creating plot
fig = plt.figure(figsize =(10, 7))
plt.pie(data, labels = Language)
plt.title("Pie Chart")
plt.show()

5. Area plot
import matplotlib.pyplot as plt

32
days = [1, 2, 3, 4, 5]

raining = [7, 8, 6, 11, 7]

snow = [8, 5, 7, 8, 13]

plt.stackplot(days, raining, snow,colors =['b', 'y'])

plt.xlabel('Days')

plt.ylabel('No of Hours')

plt.title('Representation of Raining and Snowy Days – AREA PLOT’)

plt.show()

6. Scatter Plot
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9]
y = [99,86,87,88,67,86,87,78,77,85,86,56]
plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

7. Heat map

import seaborn as sn
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.random((7,7)),columns=['a','b','c','d','e','f','g'])
sn.heatmap(df,annot=True,annot_kws={'size':7})

8. Box Plot

#import matplotlib.pyplot as plt

np.random.seed(10)
one=np.random.normal(100,10,200)
two=np.random.normal(80, 30, 200)
three=np.random.normal(90, 20, 200)
four=np.random.normal(70, 25, 200)
to_plot=[one,two,three,four]
fig=plt.figure(1,figsize=(9,6))
ax=fig.add_subplot()

33
bp=ax.boxplot(to_plot)
fig.savefig('boxplot.png',bbox_inches='tight')

OUTPUT:

34
35
36
Result:

37
EXP NO: 8
IMPLEMENT AN APPLICATION THAT STORES BIG DATA IN HBASE
Date:

AIM: Implementing Storage and retrieval of data on HBASE

THEORY:
HBase is a Distributed, columnar NoSQL database Built on top of Hadoop Distributed File
System (HDFS)
It is designed for random, real-time read/write access to large datasets. It provides strong
consistency and is horizontally scalable.
Key Storage Concepts
 Data stored in tables
 Each table has rows and column families
 Rows are identified by unique row keys
 Column families group related columns together
 Supports sparse data storage

IMPLEMENTATION:

Accessing HBase using Shell

# Start HBase shell
hbase shell

# Create a table
create 'users', 'personal', 'contact'

# Insert data
put 'users', 'user_1', 'personal:name', 'John Doe'
put 'users', 'user_1', 'personal:age', '30'
put 'users', 'user_1', 'contact:email', '[email protected]'

# Scan the entire table

scan 'users'

# Get specific row

get 'users', 'user_1'

# Delete a specific cell

delete 'users', 'user_1', 'personal:age'

# Delete entire row

38
deleteall 'users', 'user_1'

# Drop table (must disable first)

disable 'users'
drop 'users'

JAVA-API Implementation

import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseOperations {

public void createTable(Admin admin, String tableName) throws IOException {
TableName table = TableName.valueOf(tableName);

// Create table descriptor

HTableDescriptor descriptor = new HTableDescriptor(table);
descriptor.addFamily(new HColumnDescriptor("personal"));
descriptor.addFamily(new HColumnDescriptor("contact"));

// Create table
admin.createTable(descriptor);
}

public void insertData(Table table, String rowKey) throws IOException {

Put put = new Put(Bytes.toBytes(rowKey));

// Add columns
put.addColumn(
Bytes.toBytes("personal"),
Bytes.toBytes("name"),
Bytes.toBytes("John Doe")
);

table.put(put);
}

public void deleteRow(Table table, String rowKey) throws IOException {

Delete delete = new Delete(Bytes.toBytes(rowKey));
table.delete(delete);
}

39
}

public void insertData(Table table, String rowKey) throws IOException { Put put = new
Put(Bytes.toBytes(rowKey)); // Add columns put.addColumn( Bytes.toBytes("personal"),
Bytes.toBytes("name"), Bytes.toBytes("John Doe") ); table.put(put); } public void
deleteRow(Table table, String rowKey) throws IOException { Delete delete = new
Delete(Bytes.toBytes(rowKey)); table.delete(delete); } }

OUTPUT:

40
Result:

Big Data Analytics Lab Manual
No ratings yet
Big Data Analytics Lab Manual
80 pages
of Sedimentary Basins - Notes
100% (1)
of Sedimentary Basins - Notes
44 pages
Big Data & Analytics Lab Manual
No ratings yet
Big Data & Analytics Lab Manual
51 pages
Lab 4 - Installation of Hadoop and MapReduce WordCount Example
100% (1)
Lab 4 - Installation of Hadoop and MapReduce WordCount Example
16 pages
DA Lab EXERCISE
No ratings yet
DA Lab EXERCISE
24 pages
Logs
No ratings yet
Logs
7 pages
Nand 2 Nor 2
No ratings yet
Nand 2 Nor 2
19 pages
TM 55 4920 413 13 and P
No ratings yet
TM 55 4920 413 13 and P
115 pages
EX. NO Date Program NO Sign
No ratings yet
EX. NO Date Program NO Sign
80 pages
Cp5261 Da Lab Me-Cse 2021 - Edit
No ratings yet
Cp5261 Da Lab Me-Cse 2021 - Edit
88 pages
Cloud PDF
No ratings yet
Cloud PDF
47 pages
Big Data File
No ratings yet
Big Data File
16 pages
Tutorial-Counting Words in File (S) Using Mapreduce: Prerequisites
No ratings yet
Tutorial-Counting Words in File (S) Using Mapreduce: Prerequisites
11 pages
Data Science
No ratings yet
Data Science
82 pages
20CSPL701 - Bda - Record 2024-2025
No ratings yet
20CSPL701 - Bda - Record 2024-2025
61 pages
20CSPL701 - Bda - Record 2024-2025
No ratings yet
20CSPL701 - Bda - Record 2024-2025
54 pages
BDT Lab Manual
No ratings yet
BDT Lab Manual
48 pages
BDA Lab Manual
No ratings yet
BDA Lab Manual
34 pages
BDA Record
No ratings yet
BDA Record
58 pages
Bigdata Lab
No ratings yet
Bigdata Lab
55 pages
DA Lab Manual Final
No ratings yet
DA Lab Manual Final
46 pages
Bda Lab S
No ratings yet
Bda Lab S
92 pages
Big Dataa-Lab-Manual
No ratings yet
Big Dataa-Lab-Manual
24 pages
Big Data
No ratings yet
Big Data
28 pages
DAN Lab ManuaL
No ratings yet
DAN Lab ManuaL
53 pages
Amrita CC 3.1
No ratings yet
Amrita CC 3.1
7 pages
CS702 Big Data Programs
No ratings yet
CS702 Big Data Programs
59 pages
CC 8
No ratings yet
CC 8
4 pages
Hadoop Single Node Cluster Setup Steps
No ratings yet
Hadoop Single Node Cluster Setup Steps
7 pages
BDA Lab Manual - Organized
No ratings yet
BDA Lab Manual - Organized
69 pages
BDA Lab Manual-1
No ratings yet
BDA Lab Manual-1
60 pages
CCS334 Bda
No ratings yet
CCS334 Bda
23 pages
Practical-1: Aim: Hadoop Configuration and Single Node Cluster Setup and Perform File Management Task in
No ratings yet
Practical-1: Aim: Hadoop Configuration and Single Node Cluster Setup and Perform File Management Task in
61 pages
Course: Big Data Analytics Lab Scheme: 2017
No ratings yet
Course: Big Data Analytics Lab Scheme: 2017
25 pages
BDF Programs
No ratings yet
BDF Programs
32 pages
Bda Manual
No ratings yet
Bda Manual
33 pages
BIG Data File
No ratings yet
BIG Data File
28 pages
Lab 4 - Installation of Hadoop and MapReduce WordCount Example
No ratings yet
Lab 4 - Installation of Hadoop and MapReduce WordCount Example
14 pages
Ba Lab Record-It b2022-26
No ratings yet
Ba Lab Record-It b2022-26
43 pages
Perhitungan Tugas Besar Geometri Jalan Raya (Andre Gunawan 1622201019)
No ratings yet
Perhitungan Tugas Besar Geometri Jalan Raya (Andre Gunawan 1622201019)
77 pages
Final Copy - BDA LAB Record
No ratings yet
Final Copy - BDA LAB Record
44 pages
@bigdatalabfile 09
No ratings yet
@bigdatalabfile 09
35 pages
BDA Lab
No ratings yet
BDA Lab
13 pages
Big Data Lab Manual Printout
No ratings yet
Big Data Lab Manual Printout
51 pages
CCS334-BDA LAB MANUAL Final
No ratings yet
CCS334-BDA LAB MANUAL Final
46 pages
Hadoop Administrator Training - Lab Hand Book
No ratings yet
Hadoop Administrator Training - Lab Hand Book
12 pages
Bigdatamanual
No ratings yet
Bigdatamanual
45 pages
Big Data Record 2024-25
No ratings yet
Big Data Record 2024-25
46 pages
Hadoop Lab Notes: Nicola Tonellotto November 15, 2010
No ratings yet
Hadoop Lab Notes: Nicola Tonellotto November 15, 2010
9 pages
Bda Exp1 Chinmay
No ratings yet
Bda Exp1 Chinmay
13 pages
Lab Manual
No ratings yet
Lab Manual
34 pages
Developing A Simple Map-Reduce Program For Hadoop: Big Data Course CS6350 Professor: Dr. Latifur Khan
No ratings yet
Developing A Simple Map-Reduce Program For Hadoop: Big Data Course CS6350 Professor: Dr. Latifur Khan
22 pages
Bi Lab File
No ratings yet
Bi Lab File
19 pages
Exp 1-2
No ratings yet
Exp 1-2
9 pages
Big Data Record
No ratings yet
Big Data Record
13 pages
Big Data Lab Manual
No ratings yet
Big Data Lab Manual
32 pages
Big Data Record
No ratings yet
Big Data Record
14 pages
Exp 5 - 9
No ratings yet
Exp 5 - 9
25 pages
Prerequisites: Single Node Setup Cluster Setup
No ratings yet
Prerequisites: Single Node Setup Cluster Setup
5 pages
Bda Lab Manual
No ratings yet
Bda Lab Manual
42 pages
Part B Assignment - No - 11
No ratings yet
Part B Assignment - No - 11
6 pages
A - First Solar FS 275
No ratings yet
A - First Solar FS 275
2 pages
Lab Manual
No ratings yet
Lab Manual
27 pages
Big Data Analytics Lab Manual (BE AI&DS)
No ratings yet
Big Data Analytics Lab Manual (BE AI&DS)
29 pages
MAGNESITA CEMENT Folder 092015
No ratings yet
MAGNESITA CEMENT Folder 092015
24 pages
Grammar Now Plus 2 - SB Answer Keys
No ratings yet
Grammar Now Plus 2 - SB Answer Keys
59 pages
104 Da11-13
No ratings yet
104 Da11-13
14 pages
Master Hunter CX
No ratings yet
Master Hunter CX
13 pages
Taurus Led 08-08-2022.
No ratings yet
Taurus Led 08-08-2022.
60 pages
Uniform Plane Wave Solution To The Wave Equation
No ratings yet
Uniform Plane Wave Solution To The Wave Equation
5 pages
Hotel Paris - Managing Global HUman Resources
No ratings yet
Hotel Paris - Managing Global HUman Resources
6 pages
Child, You Have To Do It Now
No ratings yet
Child, You Have To Do It Now
69 pages
R 19 Unit V
No ratings yet
R 19 Unit V
13 pages
Grid-Connected EV Charging With Renewable Energy Integration in Parking Lots
No ratings yet
Grid-Connected EV Charging With Renewable Energy Integration in Parking Lots
64 pages
GEEL1 - 02 Web and The Internet
No ratings yet
GEEL1 - 02 Web and The Internet
54 pages
Digital Resonance Frequency Tester
No ratings yet
Digital Resonance Frequency Tester
3 pages
Daily Work Instructions Plan (IKH) 03-04 12 24
No ratings yet
Daily Work Instructions Plan (IKH) 03-04 12 24
3 pages
Dissertation Essex Uni
100% (2)
Dissertation Essex Uni
6 pages
Cantilever Snap-Fit Performance Analysis For Haptic Evaluation
No ratings yet
Cantilever Snap-Fit Performance Analysis For Haptic Evaluation
8 pages
Mse & History Format
No ratings yet
Mse & History Format
24 pages
TDA8139
No ratings yet
TDA8139
5 pages
Hair Transplant in Nepal
100% (1)
Hair Transplant in Nepal
3 pages
Salinas CA Fy 2025 26 Adopted Budget in Brief
No ratings yet
Salinas CA Fy 2025 26 Adopted Budget in Brief
13 pages
Real Numbers
No ratings yet
Real Numbers
6 pages
American Choral Directors Association The Choral Journal
No ratings yet
American Choral Directors Association The Choral Journal
3 pages
Tonic: Digital Media Strategy
No ratings yet
Tonic: Digital Media Strategy
8 pages
Business Studies Notes PDF Class 12 Chapter 13
No ratings yet
Business Studies Notes PDF Class 12 Chapter 13
6 pages
CYCLOPENTANE
No ratings yet
CYCLOPENTANE
2 pages
Sahu Krishi Kendra Kanhiwada-25 Aug 24
No ratings yet
Sahu Krishi Kendra Kanhiwada-25 Aug 24
1 page
Quick Configuration of Openldap and Kerberos In Linux and Authenicating Linux to Active Directory
From Everand
Quick Configuration of Openldap and Kerberos In Linux and Authenicating Linux to Active Directory
Dr. Hidaia Mahmood Alassouli
No ratings yet

BDA Manual

Uploaded by

BDA Manual

Uploaded by

EXP NO: 1

Install, Configure and Run Hadoop and HDFS

AIM: To Install, Configure, and run Apache Hadoop and HDFS

Hadoop is a Java-based programming framework that supports the processing

Hadoop is comprised of four main layers:

 Hadoop Common is the collection of utilities and libraries that support

In this experiment, Hadoop will be installed in stand-alone mode in linux system.

1. Pre-Configure a Virtual machine (VMware or Oracle VirtualBox) with

Configuration of Hadoop Environment:

NOTE 1: All the underlined commands are Linux statements to be

The program 'java' can be found in the following packages:

Start all hadoop daemons

public static class IntSumReducer

public void reduce(Text key, Iterable<IntWritable> values,Context context) throws

public static void main(String[] args) throws Exception {

EXP NO: 2B Implement matrix multiplication programs using MapReduce

public class MatrixMultiply {

public static void main(String[] args) throws Exception {

FileInputFormat.addInputPath(job, new Path(args[0]));

public class Reduce

[hadoop@master ~]$ cat matrix_b.txt

[hadoop@master ~]$ bin/hadoop dfs -copyFromLocal matrix_a.txt matrix_b.txt input-

public class HighestMapper extends MapReduceBase implements Mapper<LongWritable,

public static final int MISSING = 9999;

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,

String line = value.toString(); String year = line.substring(15,19); int temperature; if

temperature = Integer.parseInt(line.substring(88, 92)); else

temperature = Integer.parseInt(line.substring(87, 92)); String quality = line.substring(92,

if(temperature != MISSING && quality.matches("[01459]")) output.collect(new

import java.io.IOException; import java.util.Iterator; import org.apache.hadoop.io.*; import

public class HighestReducer extends MapReduceBase implements Reducer<Text,

public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,

int current=values.next().get(); if ( max_temp < current) max_temp = current;

output.collect(key, new IntWritable(max_temp/10));

public class HighestDriver extends Configured implements Tool

public int run(String[] args) throws Exception

Path inp = new Path(args[0]);

Path out = new Path(args[1]);

public static void main(String[] args) throws Exception

int res = ToolRunner.run(new Configuration(), new HighestDriver(),args);

bin/hadoop dfs –mkdir whetherdata

EXP NO: 4A Implement Linear Regression

Figure 1Dataset Plot

Figure 2 Linear Regression Output

#Change the value of Age and see Results

Figure 4Dataset Distribution

# Import Decision Tree Classifier

#split dataset in features and target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create Decision Tree classifier object

from sklearn.tree import export_graphviz

# Select random observation as centroids

Figure 6When Clusters =3

Figure 7When Clusters =2

EXP NO: 7 IMPLEMENT VARIOUS VISUALIZATION TECHNIQUES

Data Visualization techniques involve the generation of graphical or pictorial

ax.hist(a, bins = [20,40,60,80,100])

ax.set_title("Student's Score - Histogram")

3. Distribution Plot and Joint plot

raining = [7, 8, 6, 11, 7]

snow = [8, 5, 7, 8, 13]

plt.stackplot(days, raining, snow,colors =['b', 'y'])

plt.title('Representation of Raining and Snowy Days – AREA PLOT’)

#import matplotlib.pyplot as plt

AIM: Implementing Storage and retrieval of data on HBASE

Accessing HBase using Shell

# Scan the entire table

# Get specific row

# Delete a specific cell

# Delete entire row

# Drop table (must disable first)

public class HBaseOperations {

// Create table descriptor

public void insertData(Table table, String rowKey) throws IOException {

public void deleteRow(Table table, String rowKey) throws IOException {

You might also like