BDA Manual

The document outlines experiments for installing and configuring Apache Hadoop and HDFS, as well as implementing word count and matrix multiplication programs using MapReduce. It details the steps for setting up the environment, including Java installation, Hadoop setup, and writing MapReduce code for processing large datasets. The experiments aim to demonstrate the capabilities of Hadoop in handling big data through practical implementations.


EXP NO: 1

Install, Configure and Run Hadoop and HDFS


Date:

AIM: To install, configure, and run Apache Hadoop and HDFS

CONTEXT:

Hadoop is a Java-based programming framework that supports the processing and storage of extremely large datasets on a cluster of inexpensive machines. It was the first major open-source project in the big data field and is sponsored by the Apache Software Foundation.

Hadoop comprises four main layers:

 Hadoop Common is the collection of utilities and libraries that support the other Hadoop modules.
 HDFS, which stands for Hadoop Distributed File System, is responsible for persisting data to disk.
 YARN, short for Yet Another Resource Negotiator, is the "operating system" for HDFS.
 MapReduce is the original processing model for Hadoop clusters. It distributes work within the cluster (the map step), then organizes and reduces the results from the nodes into a response to a query. Many other processing models are available for the 2.x versions of Hadoop.

Hadoop clusters are relatively complex to set up, so the project includes a stand-alone mode which is suitable for learning about Hadoop, performing simple operations, and debugging.

IMPLEMENTATION:

In this experiment, Hadoop will be installed in stand-alone mode on a Linux system (Hadoop was originally developed on Linux, which has native support for the Apache Hadoop ecosystem).

PreConfiguration:

1. Pre-configure a virtual machine (VMware or Oracle VirtualBox) with an Ubuntu image file, or use an Ubuntu operating system for running Hadoop.

Configuration of Hadoop Environment:

NOTE 1: All the underlined commands are Linux statements to be executed in the Linux terminal.
NOTE 2: In case of Oracle VirtualBox, if there are issues with su access, change the user to root and provide the password used for the VM login:
su root

A. JAVA INSTALLATION

 Check the java version: $ java -version
The program 'java' can be found in the following packages:
* default-jre
* gcj-5-jre-headless
* openjdk-8-jre-headless
* gcj-4.8-jre-headless
* gcj-4.9-jre-headless
* openjdk-9-jre-headless
Try: sudo apt install <selected package>
 sudo apt-get install openjdk-8-jre-headless
 Check the java version again after installation: java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (build 1.8.0_292-8u292-b10-ubuntu1~16.04.1-b10)
OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)
 Download the JDK for Linux and copy it into the Downloads folder of the Linux system
 Create a folder for Java: sudo mkdir -p /usr/local/java
 Copy the Java archive to the above location: sudo cp jdk-8u74-linux-x64.tar.gz /usr/local/java
 cd /usr/local/java
 sudo tar -zxvf jdk-8u74-linux-x64.tar.gz   # extract the archive
 To add the Java variables to the environment:
sudo gedit /etc/profile
Add the Java environment variables in the /etc/profile file:
export JAVA_HOME=/usr/local/java/jdk1.8.0_74
export PATH=$JAVA_HOME/bin:$PATH
 Run jps in the terminal and ensure that no error messages are received:
> jps
B. HADOOP SETUP
 Download Hadoop from the official website and place it in the Downloads folder
 Create a folder for the Hadoop files: sudo mkdir -p /home/hadoop
 cd ~/Downloads
 sudo cp hadoop-1.2.1.tar.gz /home/hadoop
 cd /home/hadoop
 sudo tar -zxvf hadoop-1.2.1.tar.gz   # extract the archive
 cd /home/hadoop/hadoop-1.2.1/conf
 sudo vi hadoop-env.sh
Add the below lines:
export JAVA_HOME=/usr/local/java/jdk1.8.0_74
export PATH=$JAVA_HOME/bin:$PATH
 cd /home/hadoop
 sudo chmod 777 hadoop-1.2.1
 cd hadoop-1.2.1/
 sudo vi conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
 sudo vi conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
 sudo vi conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>hdfs://localhost:9001</value>
</property>
 Create a file named clean in the location /home/hadoop/hadoop-1.2.1 with the content specified below
 pico clean   # create a file named clean
sudo rm -r /home/hadoop/hadoop-1.2.1/dfs
sudo rm -r /home/hadoop/hadoop-1.2.1/dfstemp
mkdir /home/hadoop/hadoop-1.2.1/dfs
mkdir /home/hadoop/hadoop-1.2.1/dfstemp
chmod 755 /home/hadoop/hadoop-1.2.1/dfs
chmod 755 /home/hadoop/hadoop-1.2.1/dfstemp
 chmod 777 clean

Start all Hadoop daemons:

 $ cd /home/hadoop/hadoop-1.2.1/
 $ ./clean
 $ bin/hadoop namenode -format
 $ bin/start-all.sh   # enter passwords whenever prompted
 $ jps
 Check that the jps command lists all the required services:
Jps
DataNode
SecondaryNameNode
NameNode
JobTracker
TaskTracker
 Open the NameNode web UI in the browser (http://localhost:50070 for Hadoop 1.x) and check that Hadoop is up and running

OUTPUT:

Result:

EXP NO: 2A
Implement word count programs using MapReduce
Date:

AIM:
To implement a program that calculates the word count of a document using MapReduce
CONTEXT:
MapReduce is a Java-based, distributed execution framework within the Apache Hadoop ecosystem. Using MapReduce, we can split and process petabytes of data in parallel. It consists of two main tasks: mapping and reducing. This programming model relies heavily on key-value pairs for processing.
 Mapping: this task takes an input in the form of key-value pairs and produces another set of intermediate key-value pairs after processing the input.
 Reducing: this task takes the output from the map task and further processes it into smaller and possibly more readable chunks of data. However, the outcome is still in the form of key-value pairs.
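
As a concrete illustration of this key-value flow, the word count logic can be sketched in a few lines of plain Python (illustrative only, independent of Hadoop and of the Java job below):

from collections import defaultdict

lines = ["big data is big", "data is everywhere"]

# Map: emit an intermediate (word, 1) pair for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the intermediate pairs by key (the word)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the grouped values for each word
for word, counts in sorted(grouped.items()):
    print(word, sum(counts))    # big 2, data 2, everywhere 1, is 2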

IMPLEMENTATION:
PRE-CONFIGURATION:
1. Set up an environment/IDE for running Java code
a. Install the latest Eclipse version
b. Install the Java JDK on your system
c. Open the environment variables dialog:
Right-click on MyPC -> Properties -> View Advanced System Settings -> Environment Variables
d. Add a new variable: JAVA_HOME = C:\Program Files\Java\jre1.8.0_441
e. Append to the 'Path' variable: PATH = C:\Program Files\Java\jre1.8.0_441\bin
f. Download the required Hadoop jars from https://mvnrepository.com/artifact/org.apache.hadoop
g. In Project -> Properties -> Build Path -> Add External Jars, add all the Hadoop jars
h. Apply and save the settings

JAVA CODE:
Compile the Java code below and export the jar of this code as wc.jar
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split("\\s+");   // split the line on whitespace
            for (String wordStr : words) {
                word.set(wordStr.trim());
                if (!word.toString().isEmpty()) {
                    context.write(word, one);      // emit (word, 1)
                }
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
 Copy the jar into the required location
 Create the input directory in HDFS (the job above reads /input and writes /output, and the output directory must not already exist when the job runs)
 bin/hadoop dfs -mkdir /input
 Assume abc.txt is the input file for which the word count is to be computed
 $ bin/hadoop dfs -copyFromLocal abc.txt /input
 $ bin/hadoop jar wordCount/wc.jar WordCount /input /output
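 The counts can then be inspected from the reducer output file, for example (path assumed from the hard-coded output directory above):
bin/hadoop dfs -cat /output/part-r-00000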

Result:

EXP NO: 2B Implement matrix multiplication programs using MapReduce

Date:

AIM:
To Implement multiplication of two matrices using MapReduce
CONTEXT:
Matrix-vector and matrix-matrix calculations fit nicely into the MapReduce style of computing. Let M and N be two input matrices of dimensions p x q and q x r respectively, and let P be the output matrix, P = M.N, of dimension p x r.
The Map and Reduce functions implement the following algorithm:
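In brief (consistent with the code below): for every element M(i,j), the map task emits the key (i,k) with value (M, j, M(i,j)) for each column k of the result; for every element N(j,k), it emits the key (i,k) with value (N, j, N(j,k)) for each row i of the result. The reduce task for key (i,k) then pairs the M and N values that share the same inner index j and computes P(i,k) as the sum over j of M(i,j) * N(j,k).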

IMPLEMENTATION:
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class Map
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m"));   // number of rows of M
        int p = Integer.parseInt(conf.get("p"));   // number of columns of N
        String line = value.toString();
        // input line format: (M, i, j, Mij) or (N, j, k, Njk)
        String[] indicesAndValue = line.split(",");
        Text outputKey = new Text();
        Text outputValue = new Text();
        if (indicesAndValue[0].equals("M")) {
            for (int k = 0; k < p; k++) {
                outputKey.set(indicesAndValue[1] + "," + k);            // key = (i, k)
                outputValue.set(indicesAndValue[0] + "," + indicesAndValue[2]
                        + "," + indicesAndValue[3]);                    // value = (M, j, Mij)
                context.write(outputKey, outputValue);
            }
        } else {
            // (N, j, k, Njk)
            for (int i = 0; i < m; i++) {
                outputKey.set(i + "," + indicesAndValue[2]);            // key = (i, k)
                outputValue.set("N," + indicesAndValue[1] + ","
                        + indicesAndValue[3]);                          // value = (N, j, Njk)
                context.write(outputKey, outputValue);
            }
        }
    }
}

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MatrixMultiply {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MatrixMultiply <in_dir> <out_dir>");
            System.exit(2);
        }
        Configuration conf = new Configuration();
        // M is an m-by-n matrix; N is an n-by-p matrix.
        conf.set("m", "1000");
        conf.set("n", "100");
        conf.set("p", "1000");
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "MatrixMultiply");

        job.setJarByClass(MatrixMultiply.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.HashMap;

public class Reduce
        extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String[] value;
        // key = (i, k)
        // values = [(M, j, Mij), (N, j, Njk), ...]
        HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
        HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
        for (Text val : values) {
            value = val.toString().split(",");
            if (value[0].equals("M")) {
                hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            } else {
                hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            }
        }
        int n = Integer.parseInt(context.getConfiguration().get("n"));
        float result = 0.0f;
        float m_ij;
        float n_jk;
        for (int j = 0; j < n; j++) {
            m_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
            n_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
            result += m_ij * n_jk;
        }
        if (result != 0.0f) {
            context.write(null,
                    new Text(key.toString() + "," + Float.toString(result)));
        }
    }
}
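
Note: the Map class above decides which matrix a record belongs to from the first comma-separated field, so each input line is expected in the form M,i,j,value or N,j,k,value. If the input files use the bare i,j,value layout shown in the sample below, the matrix-name prefix must be added (or the mapper adapted, for example by inferring the matrix from the input file name) for the job to produce the product shown.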
OUTPUT:

[hadoop@master ~]$ cat matrix_a.txt
0,0,1
0,1,2
0,2,3
1,0,4
1,1,5
1,2,6
2,0,7
2,1,8
2,2,9

[hadoop@master ~]$ cat matrix_b.txt


0,0,9
0,1,8
0,2,7
1,0,6
1,1,5
1,2,4
2,0,3
2,1,2
2,2,1

[hadoop@master ~]$ bin/hadoop dfs -copyFromLocal matrix_a.txt matrix_b.txt input-matrices
[hadoop@master ~]$ bin/hadoop jar matrixMultiply/mm.jar MatrixMultiply input-matrices output-matrix
[hadoop@master ~]$ bin/hadoop dfs -cat output-matrix/part-r-00000
0,0 30
0,1 24
0,2 18

1,0 84
1,1 69
1,2 54
2,0 138
2,1 114
2,2 90

Result:

EXP NO: 3
Implement an MR program that processes a weather dataset
Date:

AIM: To develop a MapReduce program to find the maximum temperature from a given weather dataset.

CONTEXT:

The weather data for any year is extracted from the National Climatic Data Center (NCDC) website: ftp://ftp.ncdc.noaa.gov/pub/data/noaa/.

Map phase: the input to the map phase is a set of weather data files. Each map task extracts the temperature data from the given year's file. The output of the map phase is a set of key-value pairs, where the keys are years and the values are the temperatures recorded in that year.

Reduce phase: the reduce phase takes all the values associated with a particular key; that is, all the temperature values belonging to a particular year are fed to the same reducer. Each reducer then finds the highest recorded temperature for its year.
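
The mapper below relies on the fixed-width layout of NCDC records: the year occupies characters 15-18, the signed air temperature (in tenths of a degree) characters 87-91, and the quality code character 92. A small, self-contained Python sketch of the same extraction logic (the record is synthetic, padded only to match those offsets):

# synthetic fixed-width record: year at [15:19], signed temperature at [87:92], quality code at [92]
record = " " * 15 + "1950" + " " * 68 + "+0123" + "1"
year = record[15:19]                                 # "1950"
temperature = int(record[88:92]) if record[87] == '+' else int(record[87:92])
quality = record[92]
if temperature != 9999 and quality in "01459":       # skip missing or poor-quality readings
    print(year, temperature / 10.0)                  # 1950 12.3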

IMPLEMENTATION:

HighestMapper.java

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class HighestMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public static final int MISSING = 9999;

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int temperature;
        if (line.charAt(87) == '+')
            temperature = Integer.parseInt(line.substring(88, 92));
        else
            temperature = Integer.parseInt(line.substring(87, 92));
        String quality = line.substring(92, 93);
        if (temperature != MISSING && quality.matches("[01459]"))
            output.collect(new Text(year), new IntWritable(temperature));
    }
}

HighestReducer.java

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class HighestReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int max_temp = 0;
        while (values.hasNext()) {
            int current = values.next().get();
            if (max_temp < current)
                max_temp = current;
        }
        output.collect(key, new IntWritable(max_temp / 10));   // temperatures are stored in tenths of a degree
    }
}

HighestDriver.java

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class HighestDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), HighestDriver.class);
        conf.setJobName("HighestDriver");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(HighestMapper.class);
        conf.setReducerClass(HighestReducer.class);

        Path inp = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.addInputPath(conf, inp);
        FileOutputFormat.setOutputPath(conf, out);

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new HighestDriver(), args);
        System.exit(res);
    }
}

OUTPUT:

bin/hadoop dfs -mkdir whetherdata
$ bin/hadoop dfs -copyFromLocal /w1/* whetherdata
bin/hadoop jar whetheranalyze.jar whetherdata MyOutput

Result:

EXP NO: 4A Implement Linear Regression

Date:

AIM:
To implement linear regression to predict housing prices.
CONTEXT:
Linear regression is best used in scenarios where you want to understand and predict the
relationship between a dependent variable and one or more independent variables,
particularly when that relationship appears to be linear. Best use cases are as follows:
 Predicting numeric outcomes based on historical data
 Examples include sales predictions, housing prices, or stock market trends
 Works well when there's a clear linear relationship between variables
 Understanding cause-and-effect relationships
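
The fitted model has the form income ≈ coef * year + intercept, so any prediction from the code below can be reproduced by hand. A quick check (the numbers here are placeholders, not the real fitted coefficients):

# coef and intercept are illustrative placeholder values, not the fitted ones printed by the code below
coef, intercept = 800.0, -1580000.0
year = 2020
print(coef * year + intercept)   # 800*2020 - 1580000 = 36000, the value predict([[2020]]) would give for these coefficients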

SOURCE CODE :
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
df = pd.read_csv(r'D:\MiniWorks\ML Programs\canada_per_capita_income.csv')   # raw string keeps the Windows path backslashes literal
df = df.rename(columns={'per capita income (US$)': 'income'})

plt.xlabel("year")
plt.ylabel("income")
plt.scatter(df.year, df.income, color='blue', marker='*')

newydf = df.income
newxdf = df.drop('income', axis='columns')
regressionModel = linear_model.LinearRegression()
regressionModel.fit(newxdf, newydf)
print('prediction', regressionModel.predict([[2020]]))
coef =regressionModel.coef_

intercept = regressionModel.intercept_
print('coeff', coef)
print('intercept', intercept)
plt.plot(df.year, coef*df.year + intercept, ls='-', marker=' ')
plt.plot(df.year, df.income)

OUTPUT:

Figure 1: Dataset plot

Figure 2: Linear regression output

Figure 3: Linear regression line plot

Result:

EXP NO: 4B
Implement Binary Logistic Regression
Date:

AIM:
To perform logistic regression to predict whether a person will buy life insurance based on their age.
CONTEXT:
Logistic regression is a supervised learning technique used for predicting a categorical dependent variable from a given set of independent variables. It is primarily used for binary classification problems. Logistic regression works best when:
 The relationship between features and the outcome is approximately linear
 There are no highly correlated independent variables
 The sample size is relatively large
 The outcome is truly binary
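
The model below estimates the probability of buying insurance as p = sigmoid(coef * age + intercept) = 1 / (1 + e^-(coef * age + intercept)), and the outcome is classified as "buy" when p > 0.5; this is exactly what the prediction_function defined in the code computes.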
SOURCE CODE
import pandas as pd
from matplotlib import pyplot as plt

import math
def sigmoid(x):
return 1 / (1 + math.exp(-x))

def prediction_function(age,inter,coeff):
z = coeff * age + inter
y = sigmoid(z)
return y

df = pd.read_csv("D:\MiniWorks\ML Programs\insurance_data.csv")
df.head()
plt.scatter(df.age,df.bought_insurance,marker='+',color='red')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test =
train_test_split(df[['age']],df.bought_insurance,train_size=0.8)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
model.predict_proba(X_test)
model.score(X_test,y_test)

# Change the value of age and see the result
age = 60
val = prediction_function(age, model.intercept_, model.coef_)

if(val > 0.5):
print("Yes - Buy Insurance")
else:
print("No Insurance")

OUTPUT:

Figure 4: Dataset distribution

Result:

EXP NO: 5
Decision Tree Classifier
Date:

AIM:
To execute a decision tree classifier algorithm for predicting diabetic conditions

THEORY:
Decision tree classification starts with the entire dataset at the root and selects the best feature to split the data (using metrics such as Gini impurity or information gain). It then recursively creates branches by making decisions at each node, and splitting continues until a stopping criterion is met (maximum depth, minimum samples per node, etc.). Typical use cases include spam email detection, credit risk assessment, and predicting disease risk.
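
For reference, the Gini impurity of a node is G = 1 - Σ_k p_k^2, where p_k is the fraction of samples of class k in that node; each split is chosen so that the weighted impurity of the resulting child nodes is as low as possible.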

SOURCE CODE:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Import Decision Tree Classifier


from sklearn.model_selection import train_test_split
from sklearn import metrics
#Import scikit-learn metrics module for accuracy calculation
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# load dataset
pima = pd.read_csv(r"D:\MiniWorks\ML Programs\inddiab.csv", header=None, names=col_names)   # raw string keeps the Windows path backslashes literal

#split dataset in features and target variable


feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)


# 70% training and 30% test

# Create Decision Tree classifier object


clf = DecisionTreeClassifier()
# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

from sklearn.tree import export_graphviz

from six import StringIO
from IPython.display import Image
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,
special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())

OUTPUT

Result:

EXP NO: 6A
IMPLEMENT CLUSTERING TECHNIQUES – K Means
Date:

AIM:
To implement K Means clustering algorithm for grouping set of Loan applicants.

THEORY:
K-Means Clustering Overview:
K-means is a fundamental partitioning clustering algorithm that divides a dataset into a predefined number K of distinct, non-overlapping clusters. The algorithm operates by identifying K centroids and assigning each data point to the nearest centroid, creating clusters based on proximity. Its primary goal is to minimize the within-cluster variance, ensuring that points within each cluster are as similar as possible.
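
The experiment below implements the algorithm from scratch with pandas. As a point of comparison, the same grouping can be obtained in a few lines with scikit-learn's KMeans (a minimal sketch, assuming the same clustering.csv with LoanAmount and ApplicantIncome columns):

import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv('clustering.csv')
X = data[["LoanAmount", "ApplicantIncome"]]

# Fit K-Means with K = 3 clusters and attach the cluster label to each applicant
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
data["Cluster"] = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)      # final centroids (LoanAmount, ApplicantIncome)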

IMPLEMENTATION
#import libraries
import pandas as pd
import numpy as np
import random as rd
import matplotlib.pyplot as plt

data = pd.read_csv('clustering.csv')
data.head()
X = data[["LoanAmount","ApplicantIncome"]]
#Visualise data points
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.xlabel('AnnualIncome')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

K=3
# Select random observations as the initial centroids
Centroids = (X.sample(n=K))
plt.scatter(X["ApplicantIncome"],X["LoanAmount"],c='black')
plt.scatter(Centroids["ApplicantIncome"],Centroids["LoanAmount"],c='red')
plt.xlabel('AnnualIncome')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()

diff = 1
j = 0

while(diff != 0):
    XD = X
    i = 1
    # Compute the Euclidean distance of every point to each current centroid
    for index1, row_c in Centroids.iterrows():
        ED = []
        for index2, row_d in XD.iterrows():
            d1 = (row_c["ApplicantIncome"] - row_d["ApplicantIncome"])**2
            d2 = (row_c["LoanAmount"] - row_d["LoanAmount"])**2
            d = np.sqrt(d1 + d2)
            ED.append(d)
        X[i] = ED
        i = i + 1

    # Assign each point to its nearest centroid
    C = []
    for index, row in X.iterrows():
        min_dist = row[1]
        pos = 1
        for i in range(K):
            if row[i+1] < min_dist:
                min_dist = row[i+1]
                pos = i + 1
        C.append(pos)
    X["Cluster"] = C

    # Recompute centroids and measure how much they moved
    Centroids_new = X.groupby(["Cluster"]).mean()[["LoanAmount","ApplicantIncome"]]
    if j == 0:
        diff = 1
        j = j + 1
    else:
        diff = ((Centroids_new['LoanAmount'] - Centroids['LoanAmount']).sum() +
                (Centroids_new['ApplicantIncome'] - Centroids['ApplicantIncome']).sum())
        print(diff.sum())
    Centroids = X.groupby(["Cluster"]).mean()[["LoanAmount","ApplicantIncome"]]
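
To visualise the final grouping, the points can be coloured by their assigned cluster (an illustrative addition, using the X and Centroids produced by the loop above):

color_map = {1: 'red', 2: 'green', 3: 'blue'}
plt.scatter(X["ApplicantIncome"], X["LoanAmount"], c=X["Cluster"].map(color_map))
plt.scatter(Centroids["ApplicantIncome"], Centroids["LoanAmount"], c='black', marker='x')
plt.xlabel('AnnualIncome')
plt.ylabel('Loan Amount (In Thousands)')
plt.show()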

OUTPUT:

Figure 5: Dataset description

Figure 6: When clusters = 3

Figure 7: When clusters = 2

Result:

EXP NO: 7 IMPLEMENT VARIOUS VISUALIZATION TECHNIQUES

Date:

AIM:
To perform exploratory data analysis using various visualization techniques

THEORY:

Data visualization techniques involve the generation of graphical or pictorial representations of data, a form which helps you understand the insights in a given data set. These techniques aim to identify the patterns, trends, correlations, and outliers in data sets, and they help us determine the patterns of business operations. By understanding the problem statement and identifying solutions in terms of these patterns, one or more of the inherent problems can be eliminated.

IMPLEMENTATION

1. Line Chart
import matplotlib.pyplot as plt
import numpy as np
#simple array
x = np.array([1, 2, 3, 4])
#generating y values
y = x*2
plt.plot(x, y)
plt.show()
#Sample #2
x = np.array([1, 2, 3, 4])
y = np.array([2, 4, 6, 8])
plt.plot(x, y)
plt.xlabel("Time in Hrs")
plt.ylabel("Distance in Km")
plt.title("Time Vs Distance - LINE CHART")
plt.savefig("time_distance.png")   # save before show(), otherwise an empty figure is written
plt.show()
2. Histogram

from matplotlib import pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 1)
a = np.array([25,42,48,55,60,62,67,70,30,38,44,50,54,58,75,78,85,88,89,28,35,90,95])
ax.hist(a, bins = [20,40,60,80,100])
ax.set_title("Student's Score - Histogram")
ax.set_xticks([0,20,40,60,80,100])
ax.set_xlabel('Marks Scored')
ax.set_ylabel('No. of Students')
plt.show()

3. Distribution Plot and Joint plot


import seaborn as sns
import matplotlib.pyplot as plt
from warnings import filterwarnings
df = sns.load_dataset('tips')
sns.distplot(df['total_bill'], kde = True, color ='green', bins = 20, label="Distribution Plot")
sns.jointplot(x ='total_bill',color ='green', y ='tip', data = df,label="Joint Plot")

4. Pie Chart
from matplotlib import pyplot as plt
import numpy as np
Language = ['English', 'Spanish', 'Chinese',
'Russian', 'Japanese', 'French']
data = [379, 480, 918, 154, 128, 77.2]
# Creating plot
fig = plt.figure(figsize =(10, 7))
plt.pie(data, labels = Language)
plt.title("Pie Chart")
plt.show()

5. Area plot
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]

raining = [7, 8, 6, 11, 7]

snow = [8, 5, 7, 8, 13]

plt.stackplot(days, raining, snow,colors =['b', 'y'])

plt.xlabel('Days')

plt.ylabel('No of Hours')

plt.title('Representation of Raining and Snowy Days - AREA PLOT')

plt.show()

6. Scatter Plot
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9]
y = [99,86,87,88,67,86,87,78,77,85,86,56]
plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

7. Heat map

import seaborn as sn
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.random((7,7)),columns=['a','b','c','d','e','f','g'])
sn.heatmap(df,annot=True,annot_kws={'size':7})

8. Box Plot

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(10)
one = np.random.normal(100, 10, 200)
two = np.random.normal(80, 30, 200)
three = np.random.normal(90, 20, 200)
four = np.random.normal(70, 25, 200)
to_plot = [one, two, three, four]
fig = plt.figure(1, figsize=(9, 6))
ax = fig.add_subplot()
bp = ax.boxplot(to_plot)
fig.savefig('boxplot.png', bbox_inches='tight')

OUTPUT:

Result:

EXP NO: 8
IMPLEMENT AN APPLICATION THAT STORES BIG DATA IN HBASE
Date:

AIM: To implement storage and retrieval of data in HBase

THEORY:
HBase is a distributed, column-oriented NoSQL database built on top of the Hadoop Distributed File System (HDFS). It is designed for random, real-time read/write access to large datasets, provides strong consistency, and is horizontally scalable.
Key Storage Concepts
 Data stored in tables
 Each table has rows and column families
 Rows are identified by unique row keys
 Column families group related columns together
 Supports sparse data storage
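
For example, after the shell commands below, row 'user_1' of the 'users' table is logically a sparse map from row key to column family:qualifier to value:

row key   column family:qualifier   value
user_1    personal:name             John Doe
user_1    personal:age              30
user_1    contact:email             [email protected]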

IMPLEMENTATION:

Accessing HBase using the shell

# Start the HBase shell
hbase shell

# Create a table with two column families
create 'users', 'personal', 'contact'

# Insert data
put 'users', 'user_1', 'personal:name', 'John Doe'
put 'users', 'user_1', 'personal:age', '30'
put 'users', 'user_1', 'contact:email', '[email protected]'

# Scan the entire table
scan 'users'

# Get a specific row
get 'users', 'user_1'

# Delete a specific cell
delete 'users', 'user_1', 'personal:age'

# Delete an entire row
deleteall 'users', 'user_1'

# Drop the table (must disable it first)
disable 'users'
drop 'users'

JAVA-API Implementation

import java.io.IOException;

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseOperations {

    public void createTable(Admin admin, String tableName) throws IOException {
        TableName table = TableName.valueOf(tableName);

        // Create a table descriptor with two column families
        HTableDescriptor descriptor = new HTableDescriptor(table);
        descriptor.addFamily(new HColumnDescriptor("personal"));
        descriptor.addFamily(new HColumnDescriptor("contact"));

        // Create the table
        admin.createTable(descriptor);
    }

    public void insertData(Table table, String rowKey) throws IOException {
        Put put = new Put(Bytes.toBytes(rowKey));

        // Add a column value under the 'personal' family
        put.addColumn(
            Bytes.toBytes("personal"),
            Bytes.toBytes("name"),
            Bytes.toBytes("John Doe")
        );

        table.put(put);
    }

    public void deleteRow(Table table, String rowKey) throws IOException {
        Delete delete = new Delete(Bytes.toBytes(rowKey));
        table.delete(delete);
    }
}


OUTPUT:

Result:
