BDA Experiment 14 PDF
Experiment No. : 01
Aim: Study of Hadoop Ecosystem.
Theory:
The Hadoop ecosystem is a platform, or suite, which provides various services to
solve big data problems. It includes Apache projects as well as various commercial
tools and solutions. There are four major elements of Hadoop: HDFS,
MapReduce, YARN, and Hadoop Common. Most of the other tools and solutions are
used to supplement or support these major elements. All of these tools work
together to provide services such as absorption, analysis, storage and
maintenance of data.
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is
responsible for storing large data sets of structured or unstructured data
across various nodes and thereby maintaining the metadata in the form of
log files.
HDFS consists of two core components:
1. NameNode
2. DataNode
YARN:
Yet Another Resource Negotiator (YARN), as the name implies, is the component
that helps manage resources across the cluster. In short, it performs
scheduling and resource allocation for the Hadoop system.
It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
MapReduce:
MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and thereby organizes it
into groups. Map generates a key-value-pair-based result which is later
processed by the Reduce() method.
2. Reduce(), as the name suggests, performs summarization by aggregating
the mapped data. In simple terms, Reduce() takes the output generated by
Map() as input and combines those tuples into a smaller set of tuples.
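To make this flow concrete, here is a small plain-Java sketch (an illustration only, not part of Hadoop itself) that simulates the three phases on two lines of text: Map() emits (word, 1) pairs, the pairs are grouped by key, and Reduce() sums each group.

import java.util.*;

public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data", "big hadoop");

        // Map(): emit a (key, value) pair for every word
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle: group the values by key (the Hadoop framework does this step)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce(): aggregate each key's list of values into one result
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = entry.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(entry.getKey() + "\t" + sum);   // big 2, data 1, hadoop 1
        }
    }
}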
PIG:
Pig was originally developed by Yahoo. It works on Pig Latin, a query-based
language similar to SQL.
It is a platform for structuring data flows and for processing and analyzing
huge data sets.
HIVE:
With the help of an SQL-like methodology and interface, Hive performs reading
and writing of large data sets. Its query language is called HQL
(Hive Query Language).
It is highly scalable, as it allows both real-time and batch processing.
Also, all the SQL data types are supported by Hive, which makes query
processing easier.
Mahout:
Mahout provides machine learnability to a system or application. Machine
learning, as the name suggests, helps a system develop itself based on
patterns, user/environment interaction, or algorithms.
It provides various libraries and functionalities such as collaborative filtering,
clustering, and classification, which are concepts of machine
learning. It allows algorithms to be invoked as per our need with the help of its
own libraries.
Apache Spark:
Spark is a platform that handles process-intensive tasks like batch
processing, interactive or iterative real-time processing, graph conversions,
and visualization.
It uses in-memory resources and is hence faster than the prior engine
(MapReduce) in terms of optimization.
Apache HBase:
HBase is a NoSQL database which supports all kinds of data and is thus capable of
handling anything in a Hadoop database. It provides capabilities similar to Google's
BigTable, and is therefore able to work on big data sets effectively.
At times when we need to search for or retrieve the occurrences of something
small in a huge database, the request must be processed within a very short
span of time. At such times HBase comes in handy, as it gives us a tolerant way
of storing and retrieving such limited data.
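As an illustration only (not part of the original write-up), a minimal sketch of writing and reading one cell with the standard HBase Java client is shown below; the table name users, the column family info, and a reachable HBase instance configured through hbase-site.xml are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(config);
             Table table = conn.getTable(TableName.valueOf("users"))) {   // hypothetical table
            // Write one cell: row "u1", column info:name
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Read it back by row key
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}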
Other Components: Apart from all of these, there are some other components
too that carry out a huge task in order to make Hadoop capable of processing large
datasets. They are as follows:
Zookeeper: There was a huge problem with managing coordination and
synchronization among the resources and components of Hadoop, which
often resulted in inconsistency. Zookeeper overcame all these problems by
performing synchronization, inter-component communication,
grouping, and maintenance.
Oozie: Oozie simply performs the task of a scheduler, scheduling jobs
and binding them together as a single unit. There are two kinds of jobs, i.e.
Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that
need to be executed in a sequentially ordered manner, whereas Oozie
coordinator jobs are those that are triggered when some data or an external
stimulus is given to them.
Theory:
STEP 1: Installing Java, OpenSSH, rsync
We need to install certain dependencies before installing
Hadoop. These include Java, OpenSSH and rsync.
On Debian-like systems use:
sudo apt-get update && sudo apt-get install -y default-jdk openssh-server rsync
STEP 2: Setting up SSH keys
Generate a passwordless RSA public/private key pair; you will be required to
answer a prompt by hitting Enter to keep the default file location of the keys:
ssh-keygen -t rsa -P ''
Add the newly created key to the list of authorized keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
STEP 3: Downloading and Extracting Hadoop archive
You can download the latest stable release of the Hadoop binary, named
hadoop-x.y.z.tar.gz, from http://www.eu.apache.org/dist/hadoop/common/stable/.
1. Download the file to your ~/ (home) folder:
FILE=$(wget "http://www.eu.apache.org/dist/hadoop/common/stable/" -O - | grep -Po "hadoop-[0-9].[0-9].[0-9].tar.gz" | head -n 1)
URL=http://www.eu.apache.org/dist/hadoop/common/stable/$FILE
wget -c "$URL" -O "$FILE"
2. Extract the downloaded file to the /usr/local/ directory, rename the just
extracted hadoop-x.y.z directory to hadoop, and make yourself the owner.
Extract the downloaded pre-compiled Hadoop binary:
sudo tar xfzv "$FILE" -C /usr/local
Move the extracted contents to the /usr/local/hadoop directory:
sudo mv /usr/local/hadoop-*/ /usr/local/hadoop
Change ownership of the hadoop directory (so that we don't need sudo every time):
sudo chown -R $USER:$USER /usr/local/hadoop
STEP 4: Editing Configuration Files
Now we need to make changes to a few configuration files.
1. To append text to your ~/.bashrc file, execute this block of code in the terminal:
2. cat << 'EOT' >> ~/.bashrc
3. #SET JDK
4. export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")
35. <name>yarn.nodemanager.aux-services</name>
36. <value>mapreduce_shuffle</value>
37. </property>
38. <property>
39. <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
40. <value>org.apache.hadoop.mapred.ShuffleHandler</value>
41. </property>
42. </configuration>
43. EOT
44. To generate and then edit the /usr/local/hadoop/etc/hadoop/mapred-site.xml file,
execute this block of code in the terminal:
45. cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
46. sed -n -i.bak '/<configuration>/q;p' /usr/local/hadoop/etc/hadoop/mapred-site.xml
47. cat << EOT >> /usr/local/hadoop/etc/hadoop/mapred-site.xml
48. <configuration>
49. <property>
50. <name>mapreduce.framework.name</name>
51. <value>yarn</value>
52. </property>
53. </configuration>
54. EOT
55. To make the ~/hadoop_store/hdfs/namenode and ~/hadoop_store/hdfs/datanode
folders and edit the /usr/local/hadoop/etc/hadoop/hdfs-site.xml file, execute this
block of code in the terminal:
56. mkdir -p ~/hadoop_store/hdfs/namenode
57. mkdir -p ~/hadoop_store/hdfs/datanode
58. sed -n -i.bak '/<configuration>/q;p' /usr/local/hadoop/etc/hadoop/hdfs-site.xml
59. cat << EOT >> /usr/local/hadoop/etc/hadoop/hdfs-site.xml
60. <configuration>
61. <property>
62. <name>dfs.replication</name>
63. <value>1</value>
64. </property>
65. <property>
66. <name>dfs.namenode.name.dir</name>
67. <value>file:/home/$USER/hadoop_store/hdfs/namenode</value>
68. </property>
69. <property>
70. <name>dfs.datanode.data.dir</name>
71. <value>file:/home/$USER/hadoop_store/hdfs/datanode</value>
72. </property>
73. </configuration>
74. EOT
STEP 5: Formatting HDFS
Before we can start using our Hadoop cluster, we need to format the HDFS
through the namenode. Format the HDFS filesystem, answering password prompts
if any:
/usr/local/hadoop/bin/hdfs namenode -format
STEP 6: Starting Hadoop daemons
Start the HDFS daemons:
/usr/local/hadoop/sbin/start-dfs.sh
Start the YARN daemons:
/usr/local/hadoop/sbin/start-yarn.sh
Output:
Conclusion: Thus, we have successfully installed and configured Hadoop.
Experiment No. : 02(b)
File: WC_Reducer.java
package com.abc;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
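Only the imports of WC_Reducer.java are reproduced above. A sketch of the standard old-API (org.apache.hadoop.mapred) word-count reducer body that matches these imports would be:

public class WC_Reducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;                       // running total for this word
        while (values.hasNext()) {
            sum += values.next().get();    // add each count emitted by the mapper
        }
        output.collect(key, new IntWritable(sum));   // emit (word, total)
    }
}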
File: WC_Runner.java
package com.abc;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
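Likewise, only the imports of WC_Runner.java are shown. A sketch of the matching driver class is given below; the WC_Mapper class it references is assumed (it is not included in this write-up), and args[0]/args[1] are the HDFS input and output paths.

public class WC_Runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);      // mapper class assumed, not shown in this write-up
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}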
Theory:
Follow these steps to install MongoDB Enterprise Edition using the apt
package manager.
1. Import the public key used by the package management system:
wget -qO - https://www.mongodb.org/static/pgp/server-4.2.asc | sudo apt-key add -
2. Create a /etc/apt/sources.list.d/mongodb-enterprise.list file for MongoDB:
echo "deb [ arch=amd64,arm64,s390x ] http://repo.mongodb.com/apt/ubuntu bionic/mongodb-enterprise/4.2 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-enterprise.list
3. Reload the local package database:
sudo apt-get update
4. Install the MongoDB Enterprise packages:
sudo apt-get install -y mongodb-enterprise
Use the initialization system appropriate for your platform:
1) Start MongoDB:
sudo systemctl start mongod
2) Verify that MongoDB has started successfully:
sudo systemctl status mongod
3) Stop MongoDB:
sudo systemctl stop mongod
4) Restart MongoDB:
sudo systemctl restart mongod
5) Begin using MongoDB:
mongo
MongoDB CRUD Operations:
• Create Operations
• Read Operations
• Update Operations
• Delete Operations
• Bulk Write
CRUD operations create, read, update, and delete documents.
Create Operations
Create or insert operations add new documents to a collection. If the collection does
not currently exist, insert operations will create the collection.
MongoDB provides the following methods to insert documents into a collection:
• db.collection.insertOne() New in version 3.2
• db.collection.insertMany() New in version 3.2
Read Operations
Read operations retrieve documents from a collection, i.e. they query a collection for
documents. MongoDB provides the following method to read documents from a
collection:
• db.collection.find()
You can specify query filters or criteria that identify the documents to return.
Update Operations
Update operations modify existing documents in a collection. MongoDB provides the
following methods to update documents of a collection:
• db.collection.updateOne() New in version 3.2
• db.collection.updateMany() New in version 3.2
• db.collection.replaceOne() New in version 3.2
You can specify criteria, or filters, that identify the documents to update. These
filters use the same syntax as read operations.
Delete Operations
Delete operations remove documents from a collection. MongoDB provides the
following methods to delete documents of a collection:
• db.collection.deleteOne() New in version 3.2
• db.collection.deleteMany() New in version 3.2
In MongoDB, delete operations target a single collection. All write operations in
MongoDB are atomic on the level of a single document.
You can specify criteria, or filters, that identify the documents to remove. These
filters use the same syntax as read operations.
Bulk Write
MongoDB provides the ability to perform write operations in bulk.
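The same CRUD operations can also be issued from a program rather than the mongo shell. As an illustration (not part of the original write-up), a sketch using the MongoDB Java sync driver against a local mongod follows; the database test and the collection books are assumptions.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class MongoCrudSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("test");
            MongoCollection<Document> books = db.getCollection("books");   // created on first insert

            // Create
            books.insertOne(new Document("title", "Hadoop Basics").append("rating", 4));

            // Read (with a query filter)
            Document found = books.find(Filters.eq("title", "Hadoop Basics")).first();
            System.out.println(found);

            // Update
            books.updateOne(Filters.eq("title", "Hadoop Basics"), Updates.set("rating", 5));

            // Delete
            books.deleteOne(Filters.eq("title", "Hadoop Basics"));
        }
    }
}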
Output:
Open the local disk -> create a new folder "data" -> inside the data folder create a "db"
folder -> run the command "mongod" in the terminal.
Conclusion: Thus, I have installed MongoDB and performed queries using MongoDB.
Experiment No. : 03(a)
Hive is designed for:
• Data encapsulation
• Ad-hoc queries
• Analysis of huge datasets
Characteristics of Hive:
In Hive, tables and databases are created first and then data is loaded into these
tables. Hive is a data warehouse designed for managing and querying only
structured data that is stored in tables. While dealing with structured data,
MapReduce does not have the optimization and usability features, such as UDFs,
that the Hive framework provides. Query optimization refers to an effective way of
executing a query in terms of performance.
HIVE ARCHITECTURE:
Hive architecture consists of the following major parts:
1. Hive Clients
2. Hive Services
1. HIVE Clients:
Hive provides different drivers for communication with different types of
applications. For Thrift-based applications, it provides a Thrift client for
communication.
2. HIVE Services:
Client interactions with Hive can be performed through Hive Services. If the
client wants to perform any query-related operations in Hive, it has to
communicate through Hive Services. The CLI is the command line interface that acts as
a Hive service for DDL (Data Definition Language) operations. All drivers
communicate with the Hive server and then with the main driver in Hive Services, as
shown in the architecture diagram above.
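Beyond the CLI, HQL can also be submitted through the HiveServer2 JDBC driver. The following sketch is an illustration only; the connection URL jdbc:hive2://localhost:10000/default, the user hive, and the table emp are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // HQL looks like SQL; 'emp' is a hypothetical table used only for illustration
            stmt.execute("CREATE TABLE IF NOT EXISTS emp (id INT, name STRING) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM emp LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
                }
            }
        }
    }
}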
Theory:
Apache Pig is a tool/platform for creating and executing MapReduce programs
used with Hadoop. It is a tool/platform for analyzing large sets of data. You can
say that Apache Pig is an abstraction over MapReduce. Programmers who are not so
good at Java often used to struggle when working with Hadoop, especially while
writing MapReduce jobs.
Below are the steps for Apache Pig installation on Linux (Ubuntu/CentOS/Windows
using a Linux VM). I am using Ubuntu 16.04 in the setup below.
Step 1: Download Pig tar file.
Command: wget http://www-us.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz
Step 2: Extract the tar file using the tar command. In the tar command below, x means
extract an archive file, z means filter an archive through gzip, and f means the
filename of an archive file.
Command: tar -xzf pig-0.16.0.tar.gz
Command: ls
Step 3: Edit the ".bashrc" file to update the environment variables of Apache Pig.
We set this so that we can access Pig from any directory; we need not go to the
Pig directory to execute Pig commands. Also, if any other application is looking
for Pig, it will get to know the path of Apache Pig from this file.
Command: sudo gedit .bashrc
Add the following at the end of the file:
# Set PIG_HOME
export PIG_HOME=/home/edureka/pig-0.16.0
export PATH=$PATH:/home/edureka/pig-0.16.0/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR
Also, make sure that the Hadoop path is set.
Run the command below to make the changes take effect in the same terminal:
Command: source .bashrc
Step 4: Check the Pig version. This is to test that Apache Pig was installed correctly.
In case you don't get the Apache Pig version, you need to verify whether you have
followed the above steps correctly.
Command: pig -version
Step 5: Check the Pig help to see all the Pig command options.
Command: pig -help
Step 6: Run Pig to start the grunt shell. The grunt shell is used to run Pig Latin scripts.
Command: pig
If you look at the image above, Apache Pig has two modes in which it
can run; by default it chooses MapReduce mode. The other mode in which you
can run Pig is local mode. Let me tell you more about this.
Execution modes in Apache Pig:
• MapReduce Mode – This is the default mode, which requires access to a
Hadoop cluster and an HDFS installation. Since this is the default mode, it is not
necessary to specify the -x flag (you can execute pig or pig -x mapreduce). The
input and output in this mode are present on HDFS.
• Local Mode – With access to a single machine, all files are installed and run
using the local host and file system. Here, local mode is specified using the '-x'
flag (pig -x local). The input and output in this mode are present on the local file
system.
Command: pig -x local
MapReduce Mode: In MapReduce mode, the data needs to be stored in the HDFS
file system and you can process the data with the help of a Pig script.
Apache Pig Script in MapReduce Mode:
***************s1.txt***************
1 10000
2 20000
3 30000
4 40000
5 50000
***********output.pig**************
A = LOAD '/home/user/input/s1.txt' using PigStorage (',') as (id: chararray, salary: chararray);
B = FOREACH A generate id,salary;
DUMP B;
Conclusion: Thus, we have successfully installed Pig and implemented a script
for displaying the contents of a file stored in the local system.
Experiment No. : 04(a)
WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every token in the input line
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Sum all the counts for a word and emit (word, total)
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT:
Conclusion: Hence, we successfully implemented word count using MapReduce.
Experiment No. : 04(b)
Theory:
The Map Function: For each element m_ij of M, produce all the key-value pairs
((i, k), (M, j, m_ij)) for k = 1, 2, . . ., up to the number of columns of N. Similarly,
for each element n_jk of N, produce all the key-value pairs ((i, k), (N, j, n_jk)) for
i = 1, 2, . . ., up to the number of rows of M. As before, M and N are really bits to
tell which of the two matrices a value comes from.
The Reduce Function: Each key (i, k) will have an associated list with all the
values (M, j, m_ij) and (N, j, n_jk), for all possible values of j. The Reduce function
needs to connect the two values on the list that have the same value of j, for each j.
An easy way to do this step is to sort the values that begin with M by j and to sort
the values that begin with N by j, in separate lists. The j-th values on each list must
have their third components, m_ij and n_jk, extracted and multiplied. Then these
products are summed and the result is paired with (i, k) in the output of the
Reduce function.
Take two matrices and solve them using the above algorithm. For example, when
multiplying two 2x2 matrices, the key (1, 1) receives the list (M, 1, m_11), (M, 2, m_12),
(N, 1, n_11), (N, 2, n_21), and Reduce pairs values by j and computes
m_11*n_11 + m_12*n_21.
Logic with example,
mm.java
import java.io.IOException;
import java.util.*;
import java.io.*;
import java.lang.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class mm
{
public static class Map extends Mapper<LongWritable, Text, Text, Text>
{
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException
{
Configuration conf = context.getConfiguration();
try
{
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
String line = value.toString();
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("A")) {
for (int k = 0; k < p; k++) {
outputKey.set(indicesAndValue[1] + "," + k);
outputValue.set("A," + indicesAndValue[2] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
} else
{
for (int i = 0; i < m; i++) {
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("B," + indicesAndValue[1] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}}
catch(ArrayIndexOutOfBoundsException e)
{}
}
}
public static class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws
IOException, InterruptedException {
String[] value;
HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
for (Text val : values) {
value = val.toString().split(",");
if (value[0].equals("A")) {
hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
} else {
hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
}
}
int n = Integer.parseInt(context.getConfiguration().get("n"));
float result = 0.0f;
float a_ij;
float b_jk;
for (int j = 0; j < n; j++) {
a_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
b_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
result += a_ij * b_jk;
}
if (result != 0.0f) {
context.write(null, new Text(key.toString() + "," + Float.toString(result)));
}}}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// A is an m-by-n matrix; B is an n-by-p matrix.
conf.set("m", "2");
conf.set("n", "5");
conf.set("p", "3");
Job job = new Job(conf, "MatrixMatrixMultiplicationOneStep");
job.setJarByClass(mm.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);}}
hadoop fs -rm -r /input
hadoop fs -rm -r /output
hadoop fs -mkdir /input
hadoop fs -put input.txt /input
javac -Xlint -classpath /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.2.1.jar mm.java
jar cvf mm.jar *.class
hadoop jar mm.jar mm /input /output
hadoop fs -cat /output/part-r-00000
A is a 2x5 matrix and B is a 5x3 matrix.
INPUT:
A,0,0,1.0
A,0,1,1.0
A,0,2,2.0
A,0,3,3.0
A,0,4,4.0
A,1,0,5.0
A,1,1,6.0
A,1,2,7.0
A,1,3,8.0
A,1,4,9.0
B,0,1,1.0
B,0,2,2.0
B,1,0,3.0
B,1,1,4.0
B,1,2,5.0
B,2,0,6.0
B,2,1,7.0
B,2,2,8.0
B,3,0,9.0
B,3,1,10.0
B,3,2,11.0
B,4,0,12.0
B,4,1,13.0
B,4,2,14.0
Output:
Theory:
K-MEANS:
We are given a data set of items, with certain features, and values for these
features (like a vector). The task is to categorize those items into groups. To
achieve this, we will use the k- Means algorithm; an unsupervised learning
algorithm.
import java.util.*;
import java.io.*;
class KMeans {
// this is the list of clusters that come out of the current iteration
ArrayList<VectorizedObject> newClusters = new ArrayList<VectorizedObject>(0);
try {
// read in each cluster... each cluster is on one line of the clusters file
String cur = inClusters.readLine();
while (cur != null) {
// now create the new cluster with the same name (key) as the old one,
// with zero points assigned, and with a vector at the origin
VectorizedObject newCluster = new VectorizedObject(cluster.getKey(), "0",
        new SparseDoubleVector(cluster.getLocation().getLength()));
newClusters.add(newCluster);
} catch (Exception e) {
throw new RuntimeException(e);
}
// read in each data point... each point is on one line of the file
String cur = inData.readLine();
int numPoints = 0;
// now, compare it with each of the existing cluster centers to find the closest one
double minDist = 9e99;
} catch (Exception e) {
e.printStackTrace();
throw new RuntimeException(e);
}
// loop through all of the clusters, finding the one with the most points, and the one with the fewest
int bigIndex = -1, big = 0, smallIndex = -1, small = 999999999, curIndex = 0;
// if the small one is less than 1/20 the size of the big one, then split the big one
// and use it to replace the small one
if (small < big / 20) {
String temp = newClusters.get(bigIndex).writeOut();
VectorizedObject newObj = new VectorizedObject(temp);
newObj.setKey(newClusters.get(smallIndex).getKey());
newObj.getLocation().multiplyMyselfByHim(1.00000001);
newClusters.set(smallIndex, newObj);
}
Integer k = null;
System.out.format("Enter the number of iterations to run: ");
while (k == null) {
try {
String temp = myConsole.readLine();
k = Integer.parseInt(temp);
} catch (Exception e) {
System.out.println("Problem reading the number of clusters.");
}
}
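The listing above is excerpted from a larger program (the VectorizedObject and SparseDoubleVector helper classes and most of the control flow are not shown), so it does not compile on its own. For comparison, a minimal self-contained sketch of the same assign-then-update k-means idea on a few hard-coded 2-D points, assuming Euclidean distance, a fixed k and a fixed number of iterations, is:

public class KMeansSketch {
    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1.5, 2}, {8, 8}, {8.5, 9}, {9, 8}, {0.5, 1.5} };
        int k = 2, iterations = 10;
        double[][] centers = { points[0].clone(), points[2].clone() };   // naive seeding

        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][2];
            int[] counts = new int[k];

            // Assignment step: attach each point to its closest centre
            for (double[] p : points) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = p[0] - centers[c][0], dy = p[1] - centers[c][1];
                    double dist = dx * dx + dy * dy;
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                sums[best][0] += p[0];
                sums[best][1] += p[1];
                counts[best]++;
            }

            // Update step: move each centre to the mean of its assigned points
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centers[c][0] = sums[c][0] / counts[c];
                    centers[c][1] = sums[c][1] / counts[c];
                }
            }
        }

        for (int c = 0; c < k; c++) {
            System.out.printf("cluster %d centre: (%.2f, %.2f)%n", c, centers[c][0], centers[c][1]);
        }
    }
}

Each iteration performs the same two steps the excerpt above performs on HDFS-backed data: assign every point to its nearest centre, then recompute each centre as the mean of its assigned points.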
OUTPUTS:
Conclusion: We have implemented k-means clustering using MapReduce.
Experiment No. : 06
Aim: Implementation of recommendation engine (CF)
Theory:
A recommendation system provides the facility to understand a person's taste and find new,
desirable content for them automatically, based on the pattern between their likes and ratings
of different items. In this experiment we build a recommendation system for the large
amount of data available on the web in the form of ratings, reviews, opinions, complaints,
remarks, feedback, and comments about any item (product, event, individual or service),
using the Hadoop framework. We have implemented Mahout interfaces for analyzing the data
provided by a review and rating site for movies.
Recommendations are given using collaborative filtering: the Pearson or cosine correlation
formula is used to calculate the similarity between two users, and items are then recommended
to a user from a weighted average of the ratings of the most similar users.
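For reference, the Pearson similarity that cf_pearson.py computes below, for two users with ratings x_i and y_i on the n items they have both rated, is
\[
r = \frac{\sum_i x_i y_i - \frac{\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n}}
         {\sqrt{\left(\sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n}\right)
                \left(\sum_i y_i^2 - \frac{\left(\sum_i y_i\right)^2}{n}\right)}}
\]
r ranges from -1 to 1; the higher the value, the more similar the two users' tastes.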
Dataset: recommendation_data1.py
dataset={'Eric Anderson':{
'0001055178':4.0,
'0007189885':5.0,
'0020209851':5.0,
'0060004843':5.0,
'0060185716':3.0
},
'H. Schneider':{
'0006511252':5.0,
'0020209851':5.0,
'0007156618':4.0,
'0007156634':5.0,
'0030565812':5.0,
'0020519001':4.0
},
'Gene Zafrin':{
'0002051850':5.0,
'0020519001':5.0
},
'Enjolras':{
'0006551386':5.0,
'0007124015':5.0,
'0007257546':3.0,
'0030624363':5.0
},
'Elizabeth Fonzino':{
'0000031887':5.0,
'0002007770':3.0
},
'Amanda':{
'0000031887':4.0,
'0002007770':2.0,
'0007164785':5.0,
'0007173687':5.0
}}
cf_pearson.py
#!/usr/bin/env python
# Implementation of collaborative filtering recommendation engine
from recommendation_data1 import dataset
from math import sqrt

def similarity_score(person1,person2):
    # Returns ratio Euclidean distance score of person1 and person2
    both_viewed = {}  # To get both rated items by person1 and person2
    for item in dataset[person1]:
        if item in dataset[person2]:
            #print("common items between "+person1+" and "+person2+" are "+item)
            both_viewed[item] = 1
    no = len(both_viewed)
    #print("total no of common items in "+person1+" and "+person2+" are")
    #print(no)
    # Conditions to check they both have a common rating item
    if no == 0:
        return 0
    # Finding Euclidean distance
    sum_of_eclidean_distance = 0
    a = 0
    for item in dataset[person1]:
        if item in dataset[person2]:
            a = pow(dataset[person1][item] - dataset[person2][item],2)
            #sum_of_eclidean_distance.append(pow(dataset[person1][item] - dataset[person2][item],2) for item in both_viewed)
            sum_of_eclidean_distance = sum_of_eclidean_distance + a
    return 1/(1+sqrt(sum_of_eclidean_distance))

#b=similarity_score('Elizabeth Fonzino','Amanda')
#a=similarity_score('Enjolras','Amanda')
#a=similarity_score('Gene Zafrin','H. Schneider')
#a=similarity_score('Eric Anderson','H. Schneider')
#print(b)
def pearson_correlation(person1,person2):
    # To get both rated items
    both_rated = {}
    for item in dataset[person1]:
        if item in dataset[person2]:
            both_rated[item] = 1
    number_of_ratings = len(both_rated)
    # Checking for number of ratings in common
    if number_of_ratings == 0:
        return 0
    # Add up all the preferences of each user
    person1_preferences_sum = sum([dataset[person1][item] for item in both_rated])
    person2_preferences_sum = sum([dataset[person2][item] for item in both_rated])
    # Sum up the squares of preferences of each user
    person1_square_preferences_sum = sum([pow(dataset[person1][item],2) for item in both_rated])
    person2_square_preferences_sum = sum([pow(dataset[person2][item],2) for item in both_rated])
    # Sum up the product value of both preferences for each item
    product_sum_of_both_users = sum([dataset[person1][item] * dataset[person2][item] for item in both_rated])
    # Calculate the pearson score
    numerator_value = product_sum_of_both_users - (person1_preferences_sum*person2_preferences_sum/number_of_ratings)
    denominator_value = sqrt((person1_square_preferences_sum - pow(person1_preferences_sum,2)/number_of_ratings) * (person2_square_preferences_sum - pow(person2_preferences_sum,2)/number_of_ratings))
    if denominator_value == 0:
        return 0
    else:
        r = numerator_value/denominator_value
        print(r)
        return r
def most_similar_users(person,number_of_users):
    # returns the number_of_users (similar persons) for a given specific person.
    scores = [(pearson_correlation(person,other_person),other_person) for other_person in dataset if other_person != person]
    # Sort the similar persons so that the highest-scoring person appears first
    scores.sort()
    scores.reverse()
    return scores[0:number_of_users]

for person in dataset:
    print(" similar users for given user "+person)
    print(most_similar_users(person,3))
    print("----")
def user_reommendations(person):
    # Gets recommendations for a person by using a weighted average of every other user's rankings
    totals = {}
    simSums = {}
    rankings_list = []
    for other in dataset:
        # don't compare me to myself
        if other == person:
            continue
        sim = pearson_correlation(person,other)
        # ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in dataset[other]:
            # only score movies I haven't seen yet
            if item not in dataset[person] or dataset[person][item] == 0:
                # Similarity * score
                totals.setdefault(item,0)
                totals[item] += dataset[other][item] * sim
                # sum of similarities
                simSums.setdefault(item,0)
                simSums[item] += sim
    # Create the normalized list
    rankings = [(total/simSums[item],item) for item,total in totals.items()]
    rankings.sort()
    rankings.reverse()
    # returns the recommended items
    recommendataions_list = [recommend_item for score,recommend_item in rankings]
    return recommendataions_list

print("************Recommendation for all users************")
for person in dataset:
    print(" Recommendations for user "+person)
    print(user_reommendations(person))
OUTPUT:
Experiment No. : 07 (Mini Project)
DATASET:
Output:
Import Library
Import Dataset
Data Visualization
Correlation
Split the dataset into training data and test data
A decision tree is a graph that represents choices and their results in the form of
a tree. The nodes in the graph represent an event or choice and the edges
of the graph represent the decision rules or conditions.
KNN: the advantage of this method is that it makes no assumption about the
distribution of the variables. This aspect is an important point of comparison
with the two previous methods.
A neural network is a model characterized by an activation function,
which is used by interconnected information processing units to transform
input into output.
Comparison of Algorithms
Conclusion: In this project, different techniques are used for the prediction of breast
cancer risk and their performance is compared in order to evaluate the best model.
Experimental results show that Random Forest is a better model for the
prediction of breast cancer risk. We have also found a model based on a neural network
with good results over the test set; this model has a sensitivity of 0.92 with an F1 score
of 0.92.