
Experiment No. : 01
Aim: Study of Hadoop Ecosystem.
Theory:
Hadoop Ecosystem is a platform or a suite which provides various services to
solve big data problems. It includes Apache projects and various commercial
tools and solutions. There are four major elements of Hadoop, i.e. HDFS,
MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are
used to supplement or support these major elements. All these tools work
collectively to provide services such as ingestion, analysis, storage and
maintenance of data.

Following are the components that collectively form a Hadoop ecosystem:


 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling

HDFS:
 HDFS is the primary or major component of Hadoop ecosystem and is
responsible for storing large data sets of structured or unstructured data
across various nodes and thereby maintaining the metadata in the form of
log files.
 HDFS consists of two core components i.e.
1. Name node
2. Data Node
YARN:
 Yet Another Resource Negotiator: as the name implies, YARN is the one
that helps to manage the resources across the clusters. In short, it performs
scheduling and resource allocation for the Hadoop system.
 It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
MapReduce:
 MapReduce makes use of two functions, i.e. Map() and Reduce(), whose
tasks are:
1. Map() performs sorting and filtering of data and thereby organizes it in
the form of groups. Map generates a key-value pair based result which is
later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating
the mapped data. In simple terms, Reduce() takes the output generated by
Map() as input and combines those tuples into a smaller set of tuples (a small
illustrative sketch follows below).
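As a minimal illustration (a plain Python sketch of the same idea, not the Hadoop API; the function names here are made up), the key-value flow of Map() and Reduce() for a word count looks like this:

from collections import defaultdict

# Map(): emit a (word, 1) pair for every word of every input line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle/sort: group all values that share the same key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce(): aggregate the grouped values into one count per key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data is data"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 2, 'data': 3, 'is': 2}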
PIG:
 Pig was basically developed by Yahoo. It works on the Pig Latin language,
which is a query-based language similar to SQL.
 It is a platform for structuring the data flow, processing and analyzing huge
data sets.
HIVE:
 With the help of SQL methodology and its interface, HIVE performs reading
and writing of large data sets. Its query language is called HQL
(Hive Query Language).
 It is highly scalable as it allows both real-time processing and batch
processing. Also, all the SQL data types are supported by Hive, thus making
query processing easier.

Mahout:
 Mahout allows machine learnability to a system or application. Machine
learning, as the name suggests, helps the system to develop itself based on
some patterns, user/environmental interaction or on the basis of algorithms.
 It provides various libraries or functionalities such as collaborative filtering,
clustering, and classification which are nothing but concepts of Machine
learning. It allows invoking algorithms as per our need with the help of its
own libraries.
Apache Spark:
 It’s a platform that handles all the process-consumptive tasks like batch
processing, interactive or iterative real-time processing, graph conversions,
and visualization, etc.
 It uses in-memory resources, and is hence faster than the previous
options in terms of optimization.
Apache HBase:
 It’s a NoSQL database which supports all kinds of data and is thus capable of
handling anything within a Hadoop database. It provides capabilities similar to
Google’s BigTable, and is thus able to work on big data sets effectively.
 At times when we need to search or retrieve the occurrences of something
small in a huge database, the request must be processed within a short
span of time. At such times, HBase comes in handy as it gives us a tolerant way
of storing limited data.
Other Components: Apart from all of these, there are some other components
too that carry out a huge task in order to make Hadoop capable of processing large
datasets. They are as follows:
 Zookeeper: There was a huge issue of managing the coordination and
synchronization among the resources or the components of Hadoop, which
often resulted in inconsistency. Zookeeper overcame all these problems by
performing synchronization, inter-component communication,
grouping, and maintenance.
 Oozie: Oozie simply performs the task of a scheduler, scheduling jobs
and binding them together as a single unit. There are two kinds of jobs, i.e.
Oozie workflow and Oozie coordinator jobs. Oozie workflow jobs are those that
need to be executed in a sequentially ordered manner, whereas Oozie
coordinator jobs are those that are triggered when some data or external
stimulus is given to them.

Conclusion: Thus, I have studied Hadoop Ecosystem.


Experiment No. : 02(a)
Aim: Installation of Hadoop.

Theory:
STEP 1: Installing Java, OpenSSH, rsync
We need to install certain dependencies before installing
Hadoop. This includes Java, OpenSSH and rsync.
On Debian like systems use:
sudo apt-get update && sudo apt-get install -y default-jdk openssh-server rsync
STEP 2: Setting up SSH keys
Generate a passwordless RSA public/private key pair; when prompted, hit Enter
to keep the default file location of the keys:
ssh-keygen -t rsa -P ''
Add the newly created key to the list of authorized keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
STEP 3: Downloading and Extracting Hadoop archive
You can download the latest stable release of the Hadoop binary, named
hadoop-x.y.z.tar.gz, from http://www.eu.apache.org/dist/hadoop/common/stable/.
1. Download the file to your ~/ (home) folder:
2. FILE=$(wget "http://www.eu.apache.org/dist/hadoop/common/stable/" -O - | grep -Po "hadoop-[0-9].[0-9].[0-9].tar.gz" | head -n 1)
3. URL=http://www.eu.apache.org/dist/hadoop/common/stable/$FILE
4. wget -c "$URL" -O "$FILE"
5. Extract the downloaded file to the /usr/local/ directory, then rename the
extracted hadoop-x.y.z directory to hadoop and make yourself the owner.
Extract the downloaded pre-compiled Hadoop binary:
sudo tar xfzv "$FILE" -C /usr/local
Move the extracted contents to the /usr/local/hadoop directory:
sudo mv /usr/local/hadoop-*/ /usr/local/hadoop
Change ownership of the hadoop directory (so that we don't need sudo
every time):
sudo chown -R $USER:$USER /usr/local/hadoop
STEP 4: Editing Configuration Files
Now we need to make changes to a few configuration files.
1. To append text to your ~/.bashrc file, execute this block of code in the terminal:
2. cat << 'EOT' >> ~/.bashrc
3. #SET JDK
4. export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:jre/bin/java::")

5. #HADOOP VARIABLES START


6. export HADOOP_HOME=/usr/local/hadoop
7. export PATH=$PATH:$HADOOP_HOME/bin
8. export PATH=$PATH:$HADOOP_HOME/sbin
9. export HADOOP_MAPRED_HOME=$HADOOP_HOME
10. export HADOOP_COMMON_HOME=$HADOOP_HOME
11. export HADOOP_HDFS_HOME=$HADOOP_HOME
12. export YARN_HOME=$HADOOP_HOME
13. export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
14. export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
15. export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
16. #HADOOP VARIABLES END
17. EOT
18. To edit the /usr/local/hadoop/etc/hadoop/hadoop-env.sh file, execute this block
of code in the terminal:
19. sed -i.bak -e 's/export JAVA_HOME=${JAVA_HOME}/export JAVA_HOME=$(readlink -f \/usr\/bin\/java | sed "s:jre\/bin\/java::")/g' /usr/local/hadoop/etc/hadoop/hadoop-env.sh
20. To edit /usr/local/hadoop/etc/hadoop/core-site.xml file, execute this block of
code in
the terminal:
21. sed -n -i.bak '/<configuration>/q;p' /usr/local/hadoop/etc/hadoop/core-site.xml
22. cat << EOT >> /usr/local/hadoop/etc/hadoop/core-site.xml
23. <configuration>
24. <property>
25. <name>fs.defaultFS</name>
26. <value>hdfs://localhost:9000</value>
27. </property>
28. </configuration>
29. EOT
30. To edit /usr/local/hadoop/etc/hadoop/yarn-site.xml file, execute this block of
code in
the terminal:
31. sed -n -i.bak '/<configuration>/q;p' /usr/local/hadoop/etc/hadoop/yarn-site.xml
32. cat << EOT >> /usr/local/hadoop/etc/hadoop/yarn-site.xml
33. <configuration>
34. <property>

35. <name>yarn.nodemanager.aux-services</name>
36. <value>mapreduce_shuffle</value>
37. </property>
38. <property>
39. <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
40. <value>org.apache.hadoop.mapred.ShuffleHandler</value>
41. </property>
42. </configuration>
43. EOT
44. To genrate and then edit /usr/local/hadoop/etc/hadoop/mapred-site.xml file,
execute this block of code in the terminal:
45. cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
/usr/local/hadoop/etc/hadoop/mapred-site.xml
46. sed -n -i.bak '/<configuration>/q;p' /usr/local/hadoop/etc/hadoop/mapred-site.xml
47. cat << EOT >> /usr/local/hadoop/etc/hadoop/mapred-site.xml
48. <configuration>
49. <property>
50. <name>mapreduce.framework.name</name>
51. <value>yarn</value>
52. </property>
53. </configuration>
54. EOT
55. To make ~/hadoop_store/hdfs/namenode, ~/hadoop_store/hdfs/datanode
folders and
edit /usr/local/hadoop/etc/hadoop/hdfs-site.xml file, execute this block of code
in the terminal:
56. mkdir -p ~/hadoop_store/hdfs/namenode
57. mkdir -p ~/hadoop_store/hdfs/datanode
58. sed -n -i.bak '/<configuration>/q;p' /usr/local/hadoop/etc/hadoop/hdfs-site.xml
59. cat << EOT >> /usr/local/hadoop/etc/hadoop/hdfs-site.xml
60. <configuration>
61. <property>
62. <name>dfs.replication</name>
63. <value>1</value>
64. </property>
65. <property>
66. <name>dfs.namenode.name.dir</name>
67. <value>file:/home/$USER/hadoop_store/hdfs/namenode</value>
68. </property>
69. <property>
70. <name>dfs.datanode.data.dir</name>

71. <value>file:/home/$USER/hadoop_store/hdfs/datanode</value>
72. </property>
73. </configuration>
74. EOT
STEP 5: Formatting HDFS
Before we can start using our Hadoop cluster, we need to format the HDFS
through the NameNode. Format the HDFS filesystem, answering password prompts
if any:
/usr/local/hadoop/bin/hdfs namenode -format
STEP 6: Starting Hadoop daemons
Start the HDFS daemons:
/usr/local/hadoop/sbin/start-dfs.sh
Start the YARN daemons:
/usr/local/hadoop/sbin/start-yarn.sh

Output:
Conclusion: Thus, we have successfully installed and configured Hadoop.
Experiment No. : 02(b)

Aim: Implementation of word count using MapReduce.


Theory:
Steps to execute MapReduce word count example
 Create a text file on your local machine and write some text into it.
 Write the MapReduce program using Eclipse.
File: WC_Mapper.java
package com.abc;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text,IntWritable> output,
                    Reporter reporter) throws IOException{
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()){
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

File: WC_Reducer.java
package com.abc;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

File: WC_Runner.java
package com.abc;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WC_Runner {
    public static void main(String[] args) throws IOException{
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
OUTPUT:

Conclusion: Thus we have successfully implemented word count using MapReduce


Experiment No. : 02(c)

Aim: Installation of MongoDB and execution of queries using MongoDB.

Theory:

Follow these steps to install MongoDB Enterprise Edition using the apt
package manager.
1. Import the public key used by the package management system:
wget -qO - https://www.mongodb.org/static/pgp/server-4.2.asc | sudo apt-key add -
2. Create a /etc/apt/sources.list.d/mongodb-enterprise.list file for MongoDB:
echo "deb [ arch=amd64,arm64,s390x ] http://repo.mongodb.com/apt/ubuntu bionic/mongodb-enterprise/4.2 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-enterprise.list
3. Reload the local package database:
sudo apt-get update
4. Install the MongoDB Enterprise packages:
sudo apt-get install -y mongodb-enterprise
Use the initialization system appropriate for your platform:
1) Start MongoDB:
sudo systemctl start mongod
2) Verify that MongoDB has started successfully:
sudo systemctl status mongod
(Optionally run sudo systemctl enable mongod so that MongoDB starts after a system reboot.)
3) Stop MongoDB:
sudo systemctl stop mongod
4) Restart MongoDB:
sudo systemctl restart mongod
5) Begin using MongoDB by starting the shell:
mongo
MongoDB CRUD Operations:
• Create Operations
• Read Operations
• Update Operations
• Delete Operations
• Bulk Write
CRUD operations create, read, update, and delete documents.

Create Operations
Create or insert operations add new documents to a collection. If the collection does
not currently exist, insert operations will create the collection.
MongoDB provides the following methods to insert documents into a collection:
• db.collection.insertOne() New in version 3.2
• db.collection.insertMany() New in version 3.2

In MongoDB, insert operations target a single collection. All write operations in


MongoDB are atomic on the level of a single document.

Read Operations
Read operations retrieve documents from a collection, i.e. they query a collection for
documents. MongoDB provides the following methods to read documents from a
collection:
• db.collection.find()

You can specify query filters or criteria that identify the documents to return.

Update Operations:

Update operations modify existing documents in a collection. MongoDB provides the


following methods to update documents of a collection:
 db.collection.updateOne() New in version 3.2
• db.collection.updateMany() New in version 3.2
• db.collection.replaceOne() New in version 3.2
In MongoDB, update operations target a single collection. All write operations in
MongoDB are atomic on the level of a single document
You can specify criteria, or filters, that identify the documents to update. These
filters use the same syntax as read operations.

Delete Operations
Delete operations remove documents from a collection. MongoDB provides the
following methods to delete documents of a collection:
• db.collection.deleteOne() New in version 3.2
• db.collection.deleteMany() New in version 3.2
In MongoDB, delete operations target a single collection. All write operations in
MongoDB are atomic on the level of a single document.

You can specify criteria, or filters, that identify the documents to remove. These
filters use the same syntax as read operations.

Bulk Write
MongoDB provides the ability to perform write operations in bulk.
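As a minimal sketch of these CRUD methods from Python (assuming the PyMongo driver is installed and a mongod instance is running on localhost; the database and collection names are made up for illustration):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["testdb"]          # hypothetical database name
students = db["students"]      # hypothetical collection name

# Create: insert one and many documents
students.insert_one({"name": "Asha", "marks": 85})
students.insert_many([{"name": "Ravi", "marks": 72},
                      {"name": "Meena", "marks": 91}])

# Read: query with a filter
for doc in students.find({"marks": {"$gt": 80}}):
    print(doc)

# Update: modify the first matching document
students.update_one({"name": "Ravi"}, {"$set": {"marks": 78}})

# Delete: remove a matching document
students.delete_one({"name": "Meena"})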

Output:
Open the local disk -> create a new folder named "data" -> inside the data folder create a "db" folder -> run the command "mongod" in a terminal.
Conclusion: Thus, I have installed MongoDB and performed queries using MongoDB.
Experiment No. : 03(a)

Aim: Installation of Hive and data aggregation using Hive.


Theory:
What is HIVE?
Hive is an ETL and data warehousing tool developed on top of the Hadoop
Distributed File System (HDFS). Hive makes the job easy for performing
operations like:
 Data encapsulation
 Ad-hoc queries
 Analysis of huge datasets

Characteristics of HIVE -
In Hive, tables and databases are created first and then data is loaded into these
tables. Hive is a data warehouse designed for managing and querying only
structured data that is stored in tables. While dealing with structured data,
MapReduce does not have optimization and usability features like UDFs, but the
Hive framework does. Query optimization refers to an effective way of query
execution in terms of performance.

HIVE ARCHITECTURE:

Hive consists of mainly 3 core parts:
1. Hive Clients
2. Hive Services
3. Hive Storage and Computing

1. HIVE Clients:
Hive provides different drivers for communication with a different type of
applications. For Thrift based applications, it will provide Thrift client for
communication.

2. HIVE Services:
Client interactions with Hive are performed through Hive Services. If the
client wants to perform any query-related operation in Hive, it has to
communicate through Hive Services. The CLI is the command line interface that acts as
a Hive service for DDL (Data Definition Language) operations. All drivers
communicate with the Hive server and then with the main driver in Hive Services.

3. HIVE Storage and Computing:


Hive services such as Metastore, File system, and Job Client in turn
communicate with Hive storage and perform the following actions:
 Metadata information of tables created in Hive is stored in the Hive
"Meta storage database".
 Query results and data loaded in the tables are stored in the Hadoop cluster
on HDFS.
OUTPUT:
Conclusion: Thus, we have successfully installed Hive and performed data
aggregation using Hive.
Experiment No. : 03(b)

Aim: Installation of Pig and implementation of a script for displaying the contents of
a file stored in the local system.

Theory:

Apache Pig is a tool/platform for creating and executing MapReduce programs
used with Hadoop. It is a tool/platform for analyzing large sets of data. You can
say that Apache Pig is an abstraction over MapReduce. Programmers who are not so
good at Java used to struggle working on Hadoop, majorly while writing
MapReduce jobs.
Below are the steps for Apache Pig installation on Linux (Ubuntu/CentOS/Windows
using a Linux VM). Ubuntu 16.04 is used in the setup below.
Step 1: Download Pig tar file.
Command: wget http://www-us.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz
Step 2: Extract the tar file using the tar command. In the tar command below, x means
extract an archive file, z means filter the archive through gzip, and f means the filename
of the archive file.
Command: tar -xzf pig-0.16.0.tar.gz
Command: ls
Step 3: Edit the “.bashrc” file to update the environment variables of Apache Pig.
We are setting it so that we can access pig from any directory, we need not go to
pig directory to
execute pig commands. Also, if any other application is looking for Pig, it will
get to know the path of Apache Pig from this file.
Command: sudo gedit .bashrc
Add the following at the end of the file:
# Set PIG_HOME
export PIG_HOME=/home/edureka/pig-0.16.0
export PATH=$PATH:/home/edureka/pig-0.16.0/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR
Also, make sure that the Hadoop path is set.
Run the command below to make the changes take effect in the same terminal.
Command: source .bashrc
Step 4: Check pig version. This is to test that Apache Pig got installed correctly. In
case, you don’t get the Apache Pig version, you need to verify if you have followed
the above steps correctly.
Command: pig -version
Step 5: Check pig help to see all the pig command options.
Command: pig -help
Step 6: Run Pig to start the Grunt shell. The Grunt shell is used to run Pig Latin scripts.
Command: pig
Apache Pig has two modes in which it can run; by default it chooses
MapReduce mode. The other mode in which you can run Pig is local mode.
Let me tell you more about this.
Execution modes in Apache Pig:
• MapReduce Mode – This is the default mode, which requires access to a
Hadoop cluster and HDFS installation. Since, this is a default mode, it is not
necessary to specify -x flag ( you can execute pig OR pig -x mapreduce). The
input and output in this mode are present on HDFS.
• Local Mode – With access to a single machine, all files are installed and run
using a local host and file system. Here the local mode is specified using ‘-x
flag’ (pig -x local). The input and output in this mode are present on local file
system.
Command: pig -x local
MapReduce Mode: In 'MapReduce mode', the data needs to be stored in the HDFS
file system and you can process the data with the help of a Pig script.
Apache Pig Script in MapReduce Mode
***************s1.txt***************
1,10000
2,20000
3,30000
4,40000
5,50000
***********output.pig**************
A = LOAD '/home/user/input/s1.txt' using PigStorage (',') as (id: chararray, salary: chararray);
B = FOREACH A generate id,salary;
DUMP B;
Conclusion: Thus, we have successfully installed Pig and implemented a script for
displaying the contents of a file stored in the local system.
Experiment No. : 04(a)

Aim: Implementation of total word count using MapReduce.


Theory:
TotalWordCount.java
// Java program to count the number of words, sentences, characters,
// paragraphs and whitespaces in a file
import java.io.*;

public class TotalWordCount
{
    public static void main(String[] args) throws IOException
    {
        File file = new File("/home/hd-dhaval/word_count/input_data/input.txt");
        FileInputStream fileStream = new FileInputStream(file);
        InputStreamReader input = new InputStreamReader(fileStream);
        BufferedReader reader = new BufferedReader(input);
        String line;

        // Initializing counters
        int countWord = 0;
        int sentenceCount = 0;
        int characterCount = 0;
        int paragraphCount = 1;
        int whitespaceCount = 0;

        // Reading line by line from the file until a null is returned
        while((line = reader.readLine()) != null)
        {
            if(line.equals(""))
            {
                paragraphCount++;
            }
            if(!(line.equals("")))
            {
                characterCount += line.length();
                // \\s+ is the space delimiter in java
                String[] wordList = line.split("\\s+");
                countWord += wordList.length;
                whitespaceCount += countWord - 1;
                // [!?.:]+ is the sentence delimiter in java
                String[] sentenceList = line.split("[!?.:]+");
                sentenceCount += sentenceList.length;
            }
        }
        System.out.println("Total word count = " + countWord);
        System.out.println("Total number of sentences = " + sentenceCount);
        System.out.println("Total number of characters = " + characterCount);
        System.out.println("Number of paragraphs = " + paragraphCount);
        System.out.println("Total number of whitespaces = " + whitespaceCount);
    }
}

WordCount.java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT:
Conclusion: Hence we successfully implemented total word count using MapReduce.
Experiment No. : 04(b)

Aim: Implementation of matrix multiplication using MapReduce.

Theory:

The Map Function: For each element m_ij of M, produce all the key-value pairs
((i, k), (M, j, m_ij)) for k = 1, 2, ..., up to the number of columns of N. Similarly,
for each element n_jk of N, produce all the key-value pairs ((i, k), (N, j, n_jk)) for
i = 1, 2, ..., up to the number of rows of M. As before, M and N are really bits to tell
which of the two matrices a value comes from.

The Reduce Function: Each key (i, k) will have an associated list with all the
values (M, j, m_ij) and (N, j, n_jk), for all possible values of j. The Reduce function
needs to connect the two values on the list that have the same value of j, for each j.
An easy way to do this step is to sort by j the values that begin with M and sort by
j the values that begin with N, in separate lists. The jth values on each list must
have their third components, m_ij and n_jk, extracted and multiplied. Then, these
products are summed and the result is paired with (i, k) in the output of the
Reduce function.
Take two matrices and solve them using the above algorithm.
Logic with example (a small sketch is shown below, followed by the full MapReduce job):
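Before the full Hadoop job (mm.java) below, here is a minimal Python sketch, not the Hadoop code itself, of the key-value pairs the Map and Reduce functions produce for two small illustrative matrices; all names here are made up for illustration:

from collections import defaultdict

# A is m x n, B is n x p, stored as dicts of cell -> value.
A = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}   # 2 x 2
B = {(0, 0): 5.0, (0, 1): 6.0, (1, 0): 7.0, (1, 1): 8.0}   # 2 x 2
m, n, p = 2, 2, 2

# Map: element a_ij is emitted for every output column k, element b_jk
# for every output row i, keyed by the output cell (i, k).
pairs = []
for (i, j), a in A.items():
    for k in range(p):
        pairs.append(((i, k), ("A", j, a)))
for (j, k), b in B.items():
    for i in range(m):
        pairs.append(((i, k), ("B", j, b)))

# Reduce: for each output cell, match values sharing the same j and sum the products.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

result = {}
for (i, k), vals in groups.items():
    a_row = {j: v for tag, j, v in vals if tag == "A"}
    b_col = {j: v for tag, j, v in vals if tag == "B"}
    result[(i, k)] = sum(a_row[j] * b_col.get(j, 0.0) for j in a_row)

print(result)   # {(0, 0): 19.0, (0, 1): 22.0, (1, 0): 43.0, (1, 1): 50.0}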
mm.java
import java.io.IOException;
import java.util.*;
import java.io.*;
import java.lang.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class mm
{
public static class Map extends Mapper<LongWritable, Text, Text, Text>
{
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException
{
Configuration conf = context.getConfiguration();
try
{
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
String line = value.toString();
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("A")) {
for (int k = 0; k < p; k++) {
outputKey.set(indicesAndValue[1] + "," + k);
outputValue.set("A," + indicesAndValue[2] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
} else
{
for (int i = 0; i < m; i++) {
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("B," + indicesAndValue[1] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}}
catch(ArrayIndexOutOfBoundsException e)
{}
}
}
public static class Reduce extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws
IOException, InterruptedException {
String[] value;
HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
for (Text val : values) {
value = val.toString().split(",");
if (value[0].equals("A")) {
hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
} else {
hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
}
}
int n = Integer.parseInt(context.getConfiguration().get("n"));
float result = 0.0f;
float a_ij;
float b_jk;
for (int j = 0; j < n; j++) {
a_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
b_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
result += a_ij * b_jk;
}
if (result != 0.0f) {
context.write(null, new Text(key.toString() + "," + Float.toString(result)));
}}}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// A is an m-by-n matrix; B is an n-by-p matrix.
conf.set("m", "2");
conf.set("n", "5");
conf.set("p", "3");
Job job = new Job(conf, "MatrixMatrixMultiplicationOneStep");
job.setJarByClass(mm.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);}}
hadoop fs -rm -r /input
hadoop fs -rm -r /output
hadoop fs -mkdir /input
hadoop fs -put input.txt /input
javac -Xlint -classpath /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.2.1.jar mm.java
jar cvf mm.jar *.class
hadoop jar mm.jar mm /input /output
hadoop fs -cat /output/part-r-00000
A-2x5 matrix and B-5x3 matrix

INPUT:
A,0,0,1.0
A,0,1,1.0
A,0,2,2.0
A,0,3,3.0
A,0,4,4.0
A,1,0,5.0
A,1,1,6.0
A,1,2,7.0
A,1,3,8.0
A,1,4,9.0
B,0,1,1.0
B,0,2,2.0
B,1,0,3.0
B,1,1,4.0
B,1,2,5.0
B,2,0,6.0
B,2,1,7.0
B,2,2,8.0
B,3,0,9.0
B,3,1,10.0
B,3,2,11.0
B,4,0,12.0
B,4,1,13.0
B,4,2,14.0

Output:

Conclusion: Hence, we have successfully implemented matrix multiplication using MapReduce.
Experiment No. : 05(a)

Aim: Implementation of frequent itemset mining using MapReduce (Apriori Algorithm).

Theory:

In the first pass, we create two tables.
• The first table, if necessary, translates item names into integers from 1 to n.
• The other table is an array of counts; the ith array element counts the occurrences
of the item numbered i. Initially, the counts for all the items are 0.
For the second pass of A-Priori,
• we create a new numbering from 1 to m for just the frequent items.
• This table is an array indexed 1 to n, and the entry for i is either 0, if item i is not
frequent, or a unique integer in the range 1 to m if item i is frequent. We shall refer
to this table as the frequent-items table.

Second pass of Apriori

• During the second pass, we count all the pairs that consist of two frequent items.
A pair cannot be frequent unless both its members are frequent, so we miss no
frequent pairs. The space required on the second pass is 2m^2 bytes, rather than
2n^2 bytes, if we use the triangular matrix method for counting.
The mechanics of the second pass are as follows (a small sketch follows this list).
1. For each basket, look in the frequent-items table to see which of its items
are frequent.
2. In a double loop, generate all pairs of frequent items in that basket.
3. For each such pair, add one to its count in the data structure used to store
counts.
Finally, at the end of the second pass, examine the structure of counts to
determine which pairs are frequent.
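A minimal Python sketch of the two passes described above (pass 1 counts single items, pass 2 counts only pairs of frequent items); the baskets and support threshold are made up for illustration:

from collections import Counter
from itertools import combinations

baskets = [{"bread", "milk"},
           {"bread", "butter", "milk"},
           {"bread", "butter"},
           {"milk", "butter", "bread"}]
min_support = 3   # minimum number of baskets (illustrative)

# Pass 1: count single items and keep only the frequent ones.
item_counts = Counter(item for basket in baskets for item in basket)
frequent_items = {item for item, c in item_counts.items() if c >= min_support}

# Pass 2: count only pairs whose members are both frequent.
pair_counts = Counter()
for basket in baskets:
    kept = sorted(item for item in basket if item in frequent_items)
    for pair in combinations(kept, 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: c for pair, c in pair_counts.items() if c >= min_support}
print(frequent_items)    # bread, butter and milk each appear in >= 3 baskets
print(frequent_pairs)    # {('bread', 'butter'): 3, ('bread', 'milk'): 3}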
File:
Input files needed:
1. config.txt - three lines, each line is an integer
line 1 - number of items per transaction
line 2 - number of transactions
line 3 - minsup
2. transa.txt - transaction file, each line is a transaction, items are separated by a space
Consider the config.txt and transa.txt files and find the frequent itemsets with support.
*****************config.txt*****************
5
5
40
******************transa.txt***********************
1 1 1 0 0
1 1 1 1 1
1 0 1 1 0
1 0 1 1 1
1 1 1 1 0
******************Apriori.java***********
import java.io.*;
import java.util.*;
public class Apriori {
public static void main(String[] args) {
AprioriCalculation ap = new AprioriCalculation();
ap.aprioriProcess();
}
}
/**************************************************************************
****
* Class Name : AprioriCalculation
* Purpose : generate Apriori itemsets
***************************************************************************
**/
class AprioriCalculation
{
Vector<String> candidates=new Vector<String>(); //the current candidates
String configFile="config.txt"; //configuration file
String transaFile="transa.txt"; //transaction file
String outputFile="apriori-output.txt";//output file
int numItems; //number of items per transaction
int numTransactions; //number of transactions
double minSup; //minimum support for a frequent itemset
String oneVal[]; //array of value per column that will be treated as a '1'
String itemSep = " "; //the separator value for items in the database
/************************************************************************
* Method Name : aprioriProcess
* Purpose : Generate the apriori itemsets
* Parameters : None
* Return : None
*************************************************************************/
public void aprioriProcess()
{
Date d; //date object for timing purposes
long start, end; //start and end time
int itemsetNumber=0; //the current itemset being looked at
//get config
getConfig();
System.out.println("Apriori algorithm has started.\n");
//start timer
d = new Date();
start = d.getTime();
//while not complete
do
{
//increase the itemset that is being looked at
itemsetNumber++;
//generate the candidates
generateCandidates(itemsetNumber);
//determine and display frequent itemsets
calculateFrequentItemsets(itemsetNumber);
if(candidates.size()!=0)
{
System.out.println("Frequent " + itemsetNumber + "-itemsets");
System.out.println(candidates);
}
//if there are <=1 frequent items, then it's the end. This prevents reading through the database again when there is only one frequent itemset.
}while(candidates.size()>1);
//end timer
d = new Date();
end = d.getTime();
//display the execution time
System.out.println("Execution time is: "+((double)(end-start)/1000) + " seconds.");
}
/************************************************************************
* Method Name : getInput
* Purpose : get user input from System.in
* Parameters : None
* Return : String value of the users input
*************************************************************************/
public static String getInput()
{
String input="";
//read from System.in
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
//try to get users input, if there is an error print the message
try
{
input = reader.readLine();
}
catch (Exception e)
{
System.out.println(e);
}
return input;
}
/************************************************************************
* Method Name : getConfig
* Purpose : get the configuration information (config filename, transaction filename)
* : configFile and transaFile will be change appropriately
* Parameters : None
* Return : None
*************************************************************************/
private void getConfig()
{
FileWriter fw;
BufferedWriter file_out;
String input="";
//ask if want to change the config
System.out.println("Default Configuration: ");
System.out.println("\tRegular transaction file with '" + itemSep + "' item separator.");
System.out.println("\tConfig File: " + configFile);
System.out.println("\tTransa File: " + transaFile);
System.out.println("\tOutput File: " + outputFile);
System.out.println("\nPress 'C' to change the item separator, configuration file and
transaction files");
System.out.print("or any other key to continue. ");
input=getInput();
if(input.compareToIgnoreCase("c")==0)
{
System.out.print("Enter new transaction filename (return for '"+transaFile+"'): ");
input=getInput();
if(input.compareToIgnoreCase("")!=0)
transaFile=input;
System.out.print("Enter new configuration filename (return for '"+configFile+"'): ");
input=getInput();
if(input.compareToIgnoreCase("")!=0)
configFile=input;
System.out.print("Enter new output filename (return for '"+outputFile+"'): ");
input=getInput();
if(input.compareToIgnoreCase("")!=0)
outputFile=input;
System.out.println("Filenames changed");
System.out.print("Enter the separating character(s) for items (return for
'"+itemSep+"'): ");
input=getInput();
if(input.compareToIgnoreCase("")!=0)
itemSep=input;
}
try
{
FileInputStream file_in = new FileInputStream(configFile);
BufferedReader data_in = new BufferedReader(new InputStreamReader(file_in));
//number of items
numItems=Integer.valueOf(data_in.readLine()).intValue();
//number of transactions
numTransactions=Integer.valueOf(data_in.readLine()).intValue();
//minsup
minSup=(Double.valueOf(data_in.readLine()).doubleValue());
//output config info to the user
System.out.print("\nInput configuration: "+numItems+" items, "+numTransactions+"
transactions, ");
System.out.println("minsup = "+minSup+"%");
System.out.println();
minSup/=100.0;
oneVal = new String[numItems];
System.out.print("Enter 'y' to change the value each row recognizes as a '1':");
if(getInput().compareToIgnoreCase("y")==0)
{
for(int i=0; i<oneVal.length; i++)
{
System.out.print("Enter value for column #" + (i+1) + ": ");
oneVal[i] = getInput();
}
}
else
for(int i=0; i<oneVal.length; i++)
oneVal[i]="1";
//create the output file
fw= new FileWriter(outputFile);
file_out = new BufferedWriter(fw);
//put the number of transactions into the output file
file_out.write(numTransactions + "\n");
file_out.write(numItems + "\n******\n");
file_out.close();
}
//if there is an error, print the message
catch(IOException e)
{
System.out.println(e);
}
}
/************************************************************************
* Method Name : generateCandidates
* Purpose : Generate all possible candidates for the n-th itemsets
* : these candidates are stored in the candidates class vector
* Parameters : n - integer value representing the current itemsets to be created
* Return : None
*************************************************************************/
private void generateCandidates(int n)
{
Vector<String> tempCandidates = new Vector<String>(); //temporary candidate string vector
String str1, str2; //strings that will be used for comparisons
StringTokenizer st1, st2; //string tokenizers for the two itemsets being compared
//if its the first set, candidates are just the numbers
if(n==1)
{
for(int i=1; i<=numItems; i++)
{
tempCandidates.add(Integer.toString(i));
}
}
else if(n==2) //second itemset is just all combinations of itemset 1
{
//add each itemset from the previous frequent itemsets together
for(int i=0; i<candidates.size(); i++)
{
st1 = new StringTokenizer(candidates.get(i));
str1 = st1.nextToken();
for(int j=i+1; j<candidates.size(); j++)
{
st2 = new StringTokenizer(candidates.elementAt(j));
str2 = st2.nextToken();
tempCandidates.add(str1 + " " + str2);
}
}
}
else
{
//for each itemset
for(int i=0; i<candidates.size(); i++)
{
//compare to the next itemset
for(int j=i+1; j<candidates.size(); j++)
{
//create the strings
str1 = new String();
str2 = new String();
//create the tokenizers
st1 = new StringTokenizer(candidates.get(i));
st2 = new StringTokenizer(candidates.get(j));
//make a string of the first n-2 tokens of the strings
for(int s=0; s<n-2; s++)
{
str1 = str1 + " " + st1.nextToken();
str2 = str2 + " " + st2.nextToken();
}
//if they have the same n-2 tokens, add them together
if(str2.compareToIgnoreCase(str1)==0)
tempCandidates.add((str1 + " " + st1.nextToken() + " " +
st2.nextToken()).trim());
}
}
}
//clear the old candidates
candidates.clear();
//set the new ones
candidates = new Vector<String>(tempCandidates);
tempCandidates.clear();
}
/************************************************************************
* Method Name : calculateFrequentItemsets
* Purpose : Determine which candidates are frequent in the n-th itemsets
* : from all possible candidates
* Parameters : n - integer representing the current itemsets being evaluated
* Return : None
*************************************************************************/
private void calculateFrequentItemsets(int n)
{
Vector<String> frequentCandidates = new Vector<String>(); //the frequent candidates for the current itemset
FileInputStream file_in; //file input stream
BufferedReader data_in; //data input stream
FileWriter fw;
BufferedWriter file_out;
StringTokenizer st, stFile; //tokenizer for candidate and transaction
boolean match; //whether the transaction has all the items in an itemset
boolean trans[] = new boolean[numItems]; //array to hold a transaction so that it can be checked
int count[] = new int[candidates.size()]; //the number of successful matches
try
{
//output file
fw= new FileWriter(outputFile, true);
file_out = new BufferedWriter(fw);
//load the transaction file
file_in = new FileInputStream(transaFile);
data_in = new BufferedReader(new InputStreamReader(file_in));
//for each transaction
for(int i=0; i<numTransactions; i++)
{
//System.out.println("Got here " + i + " times"); //useful to debug files that you
are unsure of the number of line
stFile = new StringTokenizer(data_in.readLine(), itemSep); //read a line from the
file to the tokenizer
//put the contents of that line into the transaction array
for(int j=0; j<numItems; j++)
{
trans[j]=(stFile.nextToken().compareToIgnoreCase(oneVal[j])==0); //if it is not a 0, assign the value true
}
//check each candidate
for(int c=0; c<candidates.size(); c++)
{
match = false; //reset match to false
//tokenize the candidate so that we know what items need to be present for a match
st = new StringTokenizer(candidates.get(c));
//check each item in the itemset to see if it is present in the transaction
while(st.hasMoreTokens())
{
match = (trans[Integer.valueOf(st.nextToken())-1]);
if(!match) //if it is not present in the transaction stop checking
break;
}
if(match) //if at this point it is a match, increase the count
count[c]++;
}
}
for(int i=0; i<candidates.size(); i++)
{
// System.out.println("Candidate: " + candidates.get(c) + " with count: " + count
+ " % is: " + (count/(double)numItems));
//if the count% is larger than the minSup%, add to the candidate to the frequent
candidates
if((count[i]/(double)numTransactions)>=minSup)
{
frequentCandidates.add(candidates.get(i));
//put the frequent itemset into the output file
file_out.write(candidates.get(i) + "," + count[i]/(double)numTransactions +
"\n");
}
}
file_out.write("-\n");
file_out.close();
}
//if there is an error at all in this process, catch it and print the error message
catch(IOException e)
{
System.out.println(e);
}
//clear old candidates
candidates.clear();
//new candidates are the old frequent candidates
candidates = new Vector<String>(frequentCandidates);
frequentCandidates.clear();
}
}
hadoop fs -rm -r /input
hadoop fs -rm -r /output
hadoop fs -mkdir /input
hadoop fs -mkdir /input1
hadoop fs -put transa.txt /input
hadoop fs -put config.txt /input
javac -Xlint -classpath /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.2.1.jar Apriori.java
jar cvf Apriori.jar *.class
hadoop jar Apriori.jar Apriori transa.txt config.txt output2
hadoop fs -cat /output2/part-r-00000

OUTPUT:

Conclusion: Thus, we have successfully implemented frequent itemset mining using MapReduce (Apriori Algorithm).
Experiment No. : 05(b)
Aim: Implementation of clustering (K-means) algorithm using MapReduce

Theory:

K-MEANS:

We are given a data set of items, with certain features and values for these
features (like a vector). The task is to categorize those items into groups. To
achieve this, we will use the k-Means algorithm, an unsupervised learning
algorithm.

The approach k-Means follows to solve the problem is called Expectation-
Maximization. The E-step is assigning the data points to the closest cluster. The
M-step is computing the centroid of each cluster. A small sketch of these two
steps is shown below.
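To make the E-step and M-step concrete, here is a minimal pure-Python sketch of a few k-Means iterations on 2-D points (illustrative data only, separate from the Java program that follows):

import math

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
centroids = [(1.0, 1.0), (8.0, 8.0)]   # initial guesses (illustrative)

for _ in range(5):                     # a few iterations
    # E-step: assign every point to its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        best = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[best].append(p)

    # M-step: recompute each centroid as the mean of its assigned points.
    for i, members in enumerate(clusters):
        if members:
            centroids[i] = (sum(x for x, _ in members) / len(members),
                            sum(y for _, y in members) / len(members))

print(centroids)   # roughly [(1.17, 1.17), (8.5, 8.75)]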
Program:

import java.util.*;
import java.io.*;

class KMeans {

public static void runAnIteration (String dataFile, String clusterFile) {

// start out by opening up the two input files


BufferedReader inData, inClusters;
try {
inData = new BufferedReader (new FileReader (dataFile));
inClusters = new BufferedReader (new FileReader (clusterFile));
} catch (Exception e) {
throw new RuntimeException (e);
}

// this is the list of current clusters


ArrayList <VectorizedObject> oldClusters = new ArrayList
<VectorizedObject> (0);

// this is the list of clusters that come out of the current iteration
ArrayList <VectorizedObject> newClusters = new ArrayList
<VectorizedObject>

(0); try {
// read in each cluster... each cluster is on one line of the clusters file
String cur = inClusters.readLine ();
while (cur != null) {

// first, read in the old cluster


VectorizedObject cluster = new VectorizedObject (cur);
oldClusters.add (cluster);

// now create the new cluster with the same name (key) as the old one, with zero points
// assigned, and with a vector at the origin
VectorizedObject newCluster = new VectorizedObject (cluster.getKey (),
"0",
new SparseDoubleVector (cluster.getLocation
().getLength
()));

newClusters.add (newCluster);

// read in the next line


cur = inClusters.readLine ();
}
inClusters.close ();

} catch (Exception e) {
throw new RuntimeException (e);
}

// now, process the data


try {

// read in each data point... each point is on one line of the file
String cur = inData.readLine ();
int numPoints = 0;

// while we have not hit the end of the file


while (cur != null) {

// process the next data point


VectorizedObject temp = new VectorizedObject (cur);

// now, compare it with each of the existing cluster centers to find the closest one
double minDist = 9e99;

int bestIndex = -1;


for (int i = 0; i < oldClusters.size (); i++) {
if (temp.getLocation ().distance (oldClusters.get (i).getLocation ()) <
minDist) {
bestIndex = i;
minDist = temp.getLocation ().distance (oldClusters.get (i).getLocation
());
}
}

// since we have found the closest one, we add ourselves in


temp.getLocation ().addMyselfToHim (newClusters.get
(bestIndex).getLocation ());
newClusters.get (bestIndex).incrementValueAsInt ();

// let people know that we are progressing


if (numPoints++ % 1000 == 0)
System.out.format (".");

// read in the next line from the file


cur = inData.readLine ();
}

System.out.println ("Done with pass thru data.");


inData.close ();

} catch (Exception e)
{ e.printStackTrace
();
throw new RuntimeException (e);
}

// loop through all of the clusters, finding the one with the most points and the one with the fewest
int bigIndex = -1, big = 0, smallIndex = -1, small = 999999999, curIndex = 0;

// loop thru the clusters


for (VectorizedObject i : newClusters) {

    // if we get one that has fewer than "small" points, remember it
    if (i.getValueAsInt () < small) {
        small = i.getValueAsInt ();
        smallIndex = curIndex;
    }

    // if we get one with more than "big" points, remember it
    if (i.getValueAsInt () > big) {
        big = i.getValueAsInt ();
        bigIndex = curIndex;
    }
    curIndex++;
}

// if the big one is less than 1/20 the size of the small one, then split the big one and use
// it to replace the small one
if (small < big / 20) {
String temp = newClusters.get (bigIndex).writeOut ();
VectorizedObject newObj = new VectorizedObject (temp);
newObj.setKey (newClusters.get (smallIndex).getKey ());
newObj.getLocation ().multiplyMyselfByHim (1.00000001);
newClusters.set (smallIndex, newObj);
}

// lastly, divide each cluster by its count and write it out


PrintWriter out;
try {
out = new PrintWriter (new BufferedWriter (new FileWriter (clusterFile)));
} catch (Exception e) {
throw new RuntimeException (e);
}

// loops through all of the clusters and writes them out


for (VectorizedObject i : newClusters) {
i.getLocation ().multiplyMyselfByHim (1.0 / i.getValueAsInt ());
String stringRep = i.writeOut ();
out.println (stringRep);
}

// and get outta here!


out.close ();
}

public static void main (String [] args) {

BufferedReader myConsole = new BufferedReader(new


InputStreamReader(System.in));
if (myConsole == null) {
throw new RuntimeException ("No console.");
}

String dataFile = null;


System.out.format ("Enter the file with the data vectors: ");
while (dataFile == null) {
try {
dataFile = myConsole.readLine ();
} catch (Exception e) {
System.out.println ("Problem reading data file name.");
}
}

String clusterFile = null;


System.out.format ("Enter the name of the file where the clusers are loated:
");
while (clusterFile == null)
{ try {
clusterFile = myConsole.readLine ();
} catch (Exception e) {
System.out.println ("Problem reading file name.");
}
}

Integer k = null;
System.out.format ("Enter the number of iterations to run: ");
while (k == null) {
try {
String temp = myConsole.readLine
(); k = Integer.parseInt (temp);
} catch (Exception e) {
System.out.println ("Problem reading the number of clusters.");
}
}

for (int i = 0; i < k; i++)


runAnIteration (dataFile, clusterFile);
}
}

OUTPUTS:
Conclusion: We have implemented k-means clustering using MapReduce.
Experiment No. : 06
Aim: Implementation of recommendation engine (CF)

Theory:
A recommendation system provides the facility to understand a person's taste and find new,
desirable content for them automatically, based on the pattern between their likes and ratings
of different items. In this experiment, we propose a recommendation system for the large
amount of data available on the web in the form of ratings, reviews, opinions, complaints,
remarks, feedback, and comments about any item (product, event, individual or service)
using the Hadoop framework. We have implemented Mahout interfaces for analyzing the data
provided by review and rating sites for movies.
Explain how recommendations are given using collaborative filtering. Use the Pearson or
cosine correlation formula to calculate the similarity between two users (a cosine sketch is given below).
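The cf_pearson.py program below uses Euclidean distance and Pearson correlation. For the cosine option mentioned above, a minimal hedged sketch (not part of the original program; it assumes the same dataset dictionary structure) could look like this:

from math import sqrt

def cosine_similarity(person1, person2, dataset):
    # Items rated by both users
    common = [item for item in dataset[person1] if item in dataset[person2]]
    if not common:
        return 0
    dot = sum(dataset[person1][i] * dataset[person2][i] for i in common)
    norm1 = sqrt(sum(dataset[person1][i] ** 2 for i in common))
    norm2 = sqrt(sum(dataset[person2][i] ** 2 for i in common))
    if norm1 == 0 or norm2 == 0:
        return 0
    return dot / (norm1 * norm2)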

dataset-recommendation_data1.py
dataset={'Eric Anderson':{
'0001055178':4.0,
'0007189885':5.0,
'0020209851':5.0,
'0060004843':5.0,
'0060185716':3.0
},
'H. Schneider':{
'0006511252':5.0,
'0020209851':5.0,
'0007156618':4.0,
'0007156634':5.0,
'0030565812':5.0,
'0020519001':4.0
},
'Gene Zafrin':{
'0002051850':5.0,
'0020519001':5.0
},
'Enjolras':{
'0006551386':5.0,
'0007124015':5.0,
'0007257546':3.0,
'0030624363':5.0
},
'Elizabeth Fonzino':{
'0000031887':5.0,
'0002007770':3.0
},
'Amanda':{
'0000031887':4.0,
'0002007770':2.0,
'0007164785':5.0,
'0007173687':5.0
}}

cf_pearson.py
#!/usr/bin/env python
# Implementation of collaborative filtering recommendation engine
from recommendation_data1 import dataset
from math import sqrt

def similarity_score(person1, person2):
    # Returns ratio Euclidean distance score of person1 and person2
    both_viewed = {}  # To get both rated items by person1 and person2
    for item in dataset[person1]:
        if item in dataset[person2]:
            # print("common items between "+person1+" and "+person2+" are "+item)
            both_viewed[item] = 1
    no = len(both_viewed)
    # print("total no of common items in "+person1+" and "+person2+" are")
    # print(no)
    # Condition to check that they both have common rated items
    if no == 0:
        return 0
    # Finding Euclidean distance
    sum_of_eclidean_distance = 0
    a = 0
    for item in dataset[person1]:
        if item in dataset[person2]:
            a = pow(dataset[person1][item] - dataset[person2][item], 2)
            # sum_of_eclidean_distance.append(pow(dataset[person1][item] - dataset[person2][item],2) for item in both_viewed)
            sum_of_eclidean_distance = sum_of_eclidean_distance + a
    return 1 / (1 + sqrt(sum_of_eclidean_distance))

# b = similarity_score('Elizabeth Fonzino','Amanda')
# a = similarity_score('Enjolras','Amanda')
# a = similarity_score('Gene Zafrin','H. Schneider')
# a = similarity_score('Eric Anderson','H. Schneider')
# print(b)

def pearson_correlation(person1, person2):
    # To get both rated items
    both_rated = {}
    for item in dataset[person1]:
        if item in dataset[person2]:
            both_rated[item] = 1
    number_of_ratings = len(both_rated)
    # Checking for number of ratings in common
    if number_of_ratings == 0:
        return 0
    # Add up all the preferences of each user
    person1_preferences_sum = sum([dataset[person1][item] for item in both_rated])
    person2_preferences_sum = sum([dataset[person2][item] for item in both_rated])
    # Sum up the squares of preferences of each user
    person1_square_preferences_sum = sum([pow(dataset[person1][item], 2) for item in both_rated])
    person2_square_preferences_sum = sum([pow(dataset[person2][item], 2) for item in both_rated])
    # Sum up the product value of both preferences for each item
    product_sum_of_both_users = sum([dataset[person1][item] * dataset[person2][item] for item in both_rated])
    # Calculate the pearson score
    numerator_value = product_sum_of_both_users - (person1_preferences_sum * person2_preferences_sum / number_of_ratings)
    denominator_value = sqrt((person1_square_preferences_sum - pow(person1_preferences_sum, 2) / number_of_ratings) * (person2_square_preferences_sum - pow(person2_preferences_sum, 2) / number_of_ratings))
    if denominator_value == 0:
        return 0
    else:
        r = numerator_value / denominator_value
        print(r)
        return r

def most_similar_users(person, number_of_users):
    # returns the number_of_users (similar persons) for a given specific person
    scores = [(pearson_correlation(person, other_person), other_person) for other_person in dataset if other_person != person]
    # Sort the similar persons so that the highest-scoring person appears first
    scores.sort()
    scores.reverse()
    return scores[0:number_of_users]

for person in dataset:
    print(" similar users for given user " + person)
    print(most_similar_users(person, 3))
    print("----")

def user_reommendations(person):
    # Gets recommendations for a person by using a weighted average of every other user's rankings
    totals = {}
    simSums = {}
    rankings_list = []
    for other in dataset:
        # don't compare me to myself
        if other == person:
            continue
        sim = pearson_correlation(person, other)
        # ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in dataset[other]:
            # only score movies I haven't seen yet
            if item not in dataset[person] or dataset[person][item] == 0:
                # Similarity * score
                totals.setdefault(item, 0)
                totals[item] += dataset[other][item] * sim
                # sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim
    # Create the normalized list
    rankings = [(total / simSums[item], item) for item, total in totals.items()]
    rankings.sort()
    rankings.reverse()
    # returns the recommended items
    recommendataions_list = [recommend_item for score, recommend_item in rankings]
    return recommendataions_list

print("************Recommendation for all users************")
for person in dataset:
    print(" Recommendations for user " + person)
    print(user_reommendations(person))

OUTPUT:

Conclusion: We have studied how recommendation engine works and implemented


collaborative filtering using python program.

Experiment No. : 07 (Mini Project)

Project Name: Breast Cancer Prediction


Introduction:
Breast cancer is the most common cancer in women in Ireland after skin cancer,
according to the Irish Cancer Society. In fact, 1 in 10 women in Ireland will get breast
cancer at some stage in their lives. The principal objective in this project is simple:
classify a patient into the benign or malignant diagnosis group according to the values of
32 features provided by the dataset created by the University of Wisconsin, which has
569 instances (rows/samples). A wrong diagnosis, meaning a patient is classified in the
benign group when the tumour is actually malignant (a false negative), would have a
considerable impact on the patient, and a method that produces such misclassifications
is not trustworthy for prediction. Machine learning is a useful aid in the healthcare
sector, but for this reason its impact on the healthcare subject under study has to be
considered carefully.

DATASET:

Output:
 Import Library


 Import Dataset

 Data Visualization

 Correlation

 Principal component analysis, commonly referred to as PCA, has become an
essential tool for multivariate data analysis and unsupervised dimension reduction,
the goal of which is to find a lower dimensional subspace that captures most of the
variation in the dataset.

 Split dataset into training data and test data

 K-means Clustering is an unsupervised learning algorithm that tries to cluster
data based on their similarity. Unsupervised learning means that there is no
outcome to be predicted, and the algorithm just tries to find patterns in the data.

 Decision Tree is a graph to represent choices and their results in the form of
a tree. The nodes in the graph represent an event or choice and the edges of the
graph represent the decision rules or conditions.

 Random Forest, as an extension of bagged decision trees, is a powerful
alternative; it produces multiple trees to improve prediction accuracy and
reduce the risk of over-fitting.

 KNN: The advantage of this method is that it makes no assumption about the
distribution of the variables. This aspect is an important point of comparison
with the two previous methods.

 Neural Network is a model characterized by an activation function,
which is used by interconnected information processing units to transform
input into output.

 Comparison of Algorithm
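Since the output screenshots are not reproduced here, the following is a minimal, hedged scikit-learn sketch of a comparable comparison on the built-in Wisconsin breast cancer dataset; the model settings are illustrative and not necessarily the exact ones used in the project:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, f1_score

# Load the 569-sample Wisconsin dataset and split it into train/test sets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features (important for KNN and the neural network).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.3f}, "
          f"F1={f1_score(y_test, pred):.3f}")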

Conclusion: In this project, different techniques are used for the prediction of breast
cancer risk and their performance is compared in order to evaluate the best model.
Experimental results show that the Random Forest is a better model for the
prediction of breast cancer risk. We have also found a model based on a Neural Network
with good results over the test set; this model has a sensitivity of 0.92 with an F1 score
of 0.92.

