Big Data Journal

The document outlines practical exercises for implementing big data applications using various technologies such as MongoDB, Hadoop, Hive, and HBase. It includes detailed installation steps, configuration settings, and commands for creating databases, collections, and performing data manipulation tasks. Each practical exercise aims to familiarize users with the respective technology's setup and basic operations.


PRACTICAL NO -1

Aim:
Implement an application that stores big data in HBase / MongoDB and manipulates it using
R / Python

Requirements
a. PyMongo
b. Mongo Database
Step A: Install MongoDB
Step 1) Go to https://www.mongodb.com/download-center/community and download the
MongoDB Community Server. We will install the 64-bit version for Windows.

Step 2) Once the download is complete, open the .msi file. Click Next on the start-up screen.
Step 3)
1. Accept the End-User License Agreement
2. Click Next

Step 4) Click on the "Complete" button to install all of the components. The Custom
option can be used to install selected components or if you want to change the location
of the installation.
Step 5)
1. Select "Run service as Network Service user". Make a note of the data directory;
we'll need it later.
2. Click Next

Step 6) Click on the Install button to start the installation.


Step 7) Installation begins. Click Next once completed.

Step 8) Click on the Finish button to complete the installation.


Test MongoDB
Step 1) Go to "C:\Program Files\MongoDB\Server\4.0\bin" and double-click on mongo.exe.
Alternatively, you can also click on the MongoDB desktop icon.

· Create the directory where MongoDB will store its files.


Open a command prompt window and run the following commands:

C:\users\admin> cd\
C:\>md data\db

Step 2) Start the MongoDB server (mongod)


Open another command prompt window.
C:\> cd C:\Program Files\MongoDB\Server\4.0\bin
C:\Program Files\MongoDB\Server\4.0\bin> mongod

Step 3) Connect to MongoDB using the Mongo shell


Let the MongoDB daemon run.
Open another command prompt window and run the following commands:
C:\users\admin> cd C:\Program Files\MongoDB\Server\4.0\bin
C:\Program Files\MongoDB\Server\4.0\bin>mongo

Step 4) Install PyMongo


Open another command prompt window and run the following commands:
Check the Python version on your desktop / laptop and note its installation path from Windows Explorer.
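PyMongo can then be installed with pip; a typical command (assuming Python and pip are already on the PATH, otherwise run pip from the Python installation path noted above) is:

C:\> python -m pip install pymongo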

Step 5) Test PyMongo


Run the following command at the Python prompt to confirm the installation:

import pymongo

Now, either create a script file in Python IDLE or run all the commands one by one, in sequence, in the
Python shell.
Program 1: Creating a Database: create_dp.py

Program 2: Creating a Collection: create_collection.py

Program 3: Insert into Collection: insert_into_collection.py

Program 4: Insert Multiple data into Collection: insert_many.py
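The listings for Programs 1-4 are not reproduced above, so the following is a minimal PyMongo sketch of what they typically contain, using the mybigdata database and student collection referred to in Step 6 (the field names and sample values are assumptions; adapt them to your own files):

# Programs 1-4 combined into one sketch (field names are illustrative assumptions)
import pymongo

# Connect to the local mongod started earlier
client = pymongo.MongoClient("mongodb://localhost:27017/")

# Program 1 (create_dp.py): creating a database (created lazily on first write)
db = client["mybigdata"]

# Program 2 (create_collection.py): creating a collection (also created lazily)
collection = db["student"]

# Program 3 (insert_into_collection.py): insert a single document
result = collection.insert_one({"name": "Anjani", "roll": 1})
print("Inserted id:", result.inserted_id)

# Program 4 (insert_many.py): insert multiple documents at once
results = collection.insert_many([
    {"name": "Gayatri", "roll": 2},
    {"name": "Joel", "roll": 3},
    {"name": "Jyoti", "roll": 4},
])
print("Inserted ids:", results.inserted_ids)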


Step 6) Test in MongoDB to check the database and the data inserted in the collection
a. If you want to check your database list, use the command show dbs at the mongo
command prompt

> show dbs

b. If you want to use a database named mybigdata, then the use database
statement would be as follows:

> use mybigdata

c. If you want to list the collections in MongoDB, use the command show collections
> show collections

d. If you want to display the first document from a collection: db.collection_name.findOne()


> db.student.findOne()

e. If you want to display all the documents from a collection: db.collection_name.find()


> db.student.find()

f. To count the number of documents in a collection:


> db.student.count()
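The same checks can also be run from Python; a minimal PyMongo sketch for the mybigdata database and student collection used above:

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")

# a. list the databases (a database appears only after data has been inserted)
print(client.list_database_names())

# b. / c. use the mybigdata database and list its collections
db = client["mybigdata"]
print(db.list_collection_names())

# d. display the first document from the student collection
print(db.student.find_one())

# e. display all documents from the collection
for doc in db.student.find():
    print(doc)

# f. count the number of documents in the collection
print(db.student.count_documents({}))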
PRACTICAL NO -2
Hadoop Installation

Aim:
Install, configure and run Hadoop and HDFS and explore HDFS on Windows

Code:
Steps to Install Hadoop
1. Install Java JDK 1.8
2. Download Hadoop and extract and place under C drive
3. Set Path in Environment Variables
4. Config files under Hadoop directory
5. Create folder datanode and namenode under data directory
6. Edit HDFS and YARN files
7. Set Java Home environment in Hadoop environment
8. Setup Complete. Test by executing start-all.cmd

1. Install Java
· Create a java folder in C:\
· Download and install jdk-8u112-windows-x64 into C:\java and not into C:\Program Files\java (the space in "Program Files" causes problems for the Hadoop scripts)

2. Download Hadoop
· Browse the site https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/
Search for hadoop-2.7.0.tar.gz (the binary distribution, not the -src source archive)

· Extract it to C:\
· Rename the hadoop-2.7.0 folder to Hadoop
3. Set the JAVA_HOME environment variable

Type env in the Windows search box to open the Environment Variables window.

4. Set the system path as follows:

5. Configurations

Edit file C:/Hadoop/etc/hadoop/core-site.xml


Rename “mapred-site.xml.template” to “mapred-site.xml” and edit this file
C:/Hadoop/etc/hadoop/mapred-site.xml, paste xml code and save this file.

Create folder “data” under “C:\Hadoop”


Create folder “datanode” under “C:\Hadoop\data”
Create folder “namenode” under “C:\Hadoop\data”

Edit file C:\Hadoop\etc\hadoop\hdfs-site.xml,

paste xml code and save this file.


Edit file C:/Hadoop/etc/hadoop/yarn-site.xml,

paste xml code and save this file.

Edit file C:/Hadoop/etc/hadoop/hadoop-env.cmd

Hadoop Configurations

Download bin folder from


https://github.com/s911415/apache-hadoop-3.1.0-winutils
Copy the bin folder to C:\Hadoop, replacing the existing bin folder.
Run the winutils file from the bin folder.
If it shows an error about a missing DLL file, search for that file on the net and copy it into the
C:\windows\system32 folder.

Run the same file again and you will find no error.
Then download and run VC_redist.x64.exe.
This is a "redistributable" package of the Visual C++ runtime code for 64-bit applications, from
Microsoft. It contains shared code that every application written with Visual C++ expects to
have available on the Windows computer it runs on.

Format the NameNode

– Open cmd as 'Run as Administrator' and type the command hdfs namenode -format

Testing
– Open cmd ‘Run as Administrator’ and change directory to C:\Hadoop\sbin
– type start-all.cmd
OR
- type start-dfs.cmd
– type start-yarn.cmd

– You will get 4 more running processes, for the DataNode, NameNode, ResourceManager and
NodeManager.

Type the jps command in the start-all.cmd command prompt; you will get the following output.
Open http://localhost:50070/ (Hadoop 2.x) or http://localhost:9870/ (Hadoop 3.x) in any browser.
PRACTICAL NO -3
MapReduce Implementation
Aim:
Implement word count / frequency programs using MapReduce.

Steps:
Step:1
C:\Hadoop\sbin>start-all.cmd
OR
C:\Hadoop\sbin>start-dfs.cmd
C:\Hadoop\sbin>start-yarn.cmd

Step: 2
I) Open a command prompt as administrator and run the following commands to create an input
folder on the Hadoop file system, to which we will move the input_file.txt file for our
analysis.
II) C:\Hadoop\bin>cd\
III) C:\>hadoop dfsadmin -safemode leave

DEPRECATED: Use of this script to execute hdfs command is deprecated.


Instead use the hdfs command for it.
Safe mode is OFF

IV) C:\>hadoop fs -mkdir /input_dir


Step: 3
Check it by opening the following URL in a browser:
http://localhost:9870
OR
http://localhost:50070/

Utilities -> browse the file system


Step: 4
Make a file C:\input_file.txt and write some content in it, having at least one word repeated, as written
below:
Hadoop Window version is easy compared to Ubuntu version

Step: 5
Copy the input text file named input_file.txt in the input directory (input_dir) of HDFS by applying
the following command at c:\>

C:\> hadoop fs -put C:/input_file.txt /input_dir

Step: 6

Verify that input_file.txt is available in the HDFS input directory (input_dir).


C:\>hadoop fs -ls /input_dir/

Step: 7

Verify content of the copied file


C:\>hadoop dfs -cat /input_dir/input_file.txt
You can see the file content displayed on the CMD.
Next, run the word count example jar, providing the input and output directories.

Step: 8

C:\>hadoop jar C:/Hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar


wordcount /input_dir /output_dir

In case there is some error in executing, copy the examples jar file (hadoop-mapreduce-examples-3.3.0.jar) to
C:\ and run the program using that jar file directly, for example:

C:\> hadoop jar C:/hadoop-mapreduce-examples-2.7.0.jar wordcount /input_dir /output_dir

Now, check the output_dir on browser as follows:


Click on output_dir → part-r-00000 → Head the file (first 32 K) and check the file content as the
output.

Alternatively, you may type the following command in the CMD window:
C:\> hadoop dfs -cat /output_dir/*
You will get the word-count output (it may also appear with log messages around it).

Either way, also check the output_dir in the browser.
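For reference, the same word count can also be written as your own mapper and reducer and run through Hadoop Streaming instead of the prebuilt examples jar. A minimal Python sketch follows; the script names and the streaming jar path are assumptions based on a standard Hadoop 2.7.0 layout:

# mapper.py - emits "word<TAB>1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py - sums the counts per word (Hadoop sorts the mapper output by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

The two scripts could then be submitted from CMD roughly as follows (assuming python is on the PATH):

C:\> hadoop jar C:/Hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar -files mapper.py,reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /input_dir -output /output_dir2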


PRACTICAL NO -4
Aim:
Hive installation and commands

Step 1:

Search for Apache Derby and download db-derby-10.14.2.0-bin from


https://db.apache.org/derby/releases/release-10_14_2_0.html

Step 2:
Browse the site
https://archive.apache.org/dist/hive/hive-2.1.0/
Download and extract hive-2.1.0 to C:\
Rename it to hive

Step 3:
Copy all the files from the C:\derby\lib folder and paste them into the C:\hive\lib folder.
Step 4:
User variable and system variable setup
Step 5:
Copy hive-site.xml to the C:\hive\conf folder.
You can take this file from:
https://drive.google.com/file/d/1tsBbHdvM1fFktmn9O0-u0pbG1vWWFoyE/view
Step 6:
Open a command prompt as administrator.
Run the command: C:\windows\system32>start-all.cmd
Step 7:
C:\windows\system32>StartNetworkServer -h 0.0.0.0

Minimize this window and leave the Derby network server running.
Step 8:
Open another command prompt as administrator
C:\windows\system32>hive
hive> create database empdb;

hive> use empdb;

hive> create table employee (Id int, Name string , Salary float);
hive> insert into employee values (1,"Anjani", 25000);
hive> select * from employee;
OK
1 Anjani 25000.0
1 row selected (0.201 seconds)
hive> insert into employee values (2,"Gayatri", 25000);

hive> insert into employee values (3,"Joel", 28000);


hive> insert into employee values (4,"Jyoti", 38000);
hive> insert into employee values (5,"Amit", 18000);
hive> select * from employee;
hive> select id, name, salary + 50 from employee;

hive> select * from employee where salary >= 25000;

hive> select id, name, sqrt(salary) from employee;

hive> select Id, upper(Name) from employee;


PRACTICAL NO -5
Aim:
HBase and Pig installation and commands

Pre-requisites: We are going to make a standalone setup of HBase on our machine, which
requires us to:
• Install Java JDK 1.8 - We can download and install it from
https://www.oracle.com/java/technologies/downloads/ and set JAVA_HOME as an
environment variable.
• Download the HBase binary - Download Apache HBase 2.2.5 from
https://archive.apache.org/dist/hbase/2.2.5/
Steps:
• Step 1 – Extract all files from the archive.
• Step 2 – Create folders named "hbase" and "zookeeper" inside the HBase folder (they are
referenced from hbase-site.xml in Step 5).
• Step 3 – Removing %HEAP_SETTINGS% in hbase.cmd
Open C:\hbase-2.2.5\bin\hbase.cmd in any text editor.
Search for the line that sets java_arguments and remove the %HEAP_SETTINGS% part, leaving:
set java_arguments=%HBASE_OPTS% -classpath "%CLASSPATH%" %CLASS% %hbase-command-arguments%
• Step 4 – Adding lines in hbase-env.cmd
Open C:\hbase-2.2.5\conf\hbase-env.cmd in any text editor.
Add the below lines to the file after the comment section.
set JAVA_HOME=%JAVA_HOME%
set HBASE_CLASSPATH=%HBASE_HOME%\lib\client-facing-thirdparty\*
set HBASE_HEAPSIZE=8000
set HBASE_OPTS="-XX:+UseConcMarkSweepGC" "-Djava.net.preferIPv4Stack=true"
set SERVER_GC_OPTS="-verbose:gc" "-XX:+PrintGCDetails" "-XX:+PrintGCDateStamps" %HBASE_GC_OPTS%
set HBASE_USE_GC_LOGFILE=true

set HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false" "-Dcom.sun.management.jmxremote.authenticate=false"

set HBASE_MASTER_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10101"
set HBASE_REGIONSERVER_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10102"
set HBASE_THRIFT_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10103"
set HBASE_ZOOKEEPER_OPTS=%HBASE_JMX_BASE% "-Dcom.sun.management.jmxremote.port=10104"
set HBASE_REGIONSERVERS=%HBASE_HOME%\conf\regionservers
set HBASE_LOG_DIR=%HBASE_HOME%\logs
set HBASE_IDENT_STRING=%USERNAME%
set HBASE_MANAGES_ZK=true
• Step 5 – Adding lines in hbase-site.xml
Open C:\hbase-2.2.5\conf\hbase-site.xml in any text editor.
Add the below lines inside the <configuration> tag.
<property>
<name>hbase.rootdir</name>
<value>file:///C:/Documents/hbase-2.2.5/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/C:/Documents/hbase-2.2.5/zookeeper</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
• Step 6 – Setting Environment Variables:
Input the following:
• Variable name: HBASE_HOME
• Variable Value: Put the path of the Hbase folder.
Starting the HBASE shell:
• Step 1 – Use start-hbase.cmd:
Go to the bin folder of hbase and start a cmd prompt there
Then type: start-hbase.cmd
• Step 2 – Start hbase shell:

Type: hbase shell

In HBase, interactive shell mode is used to interact with HBase for table operations, table
management, and data modeling.
Hbase Commands
General commands
In HBase, the general commands are categorized as follows:
● Status
● Version
● Table_help ( scan, drop, get, put, disable, etc.)
● Whoami
To enter the HBase shell, first of all we have to execute the command as
mentioned below:
hbase shell
Once we have entered the HBase shell, we can execute all the shell commands mentioned
below. With the help of these commands, we can perform all types of table operations in
HBase shell mode.
Let us look into all of these commands and their usage one by one with an example.
Status
Syntax: status
This command will give details about the system status, such as the number of servers present in
the cluster, the active server count, and the average load value. You can also pass a
parameter depending on how detailed a status you want for the system. The
parameter can be 'summary', 'simple', or 'detailed'; the default is
'summary'.
Example:
whoami
Syntax: whoami
The command "whoami" is used to return the current HBase user information from the
HBase cluster.
Example:

Table Management commands


These commands will allow programmers to create tables and table schemas with rows and
column families.
The following are Table Management commands
● Create
● List
● Describe
● Disable
● Disable_all
● Enable
● Enable_all
● Drop
● Drop_all
● Show_filters
● Alter
● Alter_status
Let us look into various command usage in HBase with an example.
Create
Syntax: create <tablename>, <columnfamilyname>
Example: creating a table named 'kccs' with column family 'students':

create 'kccs', 'students'
The above example creates a table in HBase with the specified name and column family. In
addition to this, we can also pass some table-scope attributes into it.
In order to check whether the table ‘kccs’ is created or not, we have to use
the “list” command as mentioned below.
List
Syntax: list

● The "list" command will display all the tables that are present or created in HBase.
● The output shown in the above screenshot lists the tables currently existing in
HBase.
● We can filter the output by passing optional regular expression
parameters.
Describe
Syntax: describe <table name>
Example:

hbase(main):011:0>describe 'kccs'
This command describes the named table.
● It will give more information about column families present in the mentioned table
● In our case, it gives the description about table “kccs”
● It will give information about table name with column families, associated filters,
versions and some more details.
disable
Syntax: disable <tablename>
Example: disable 'kccs'
● This command will start disabling the named table.
● If a table needs to be deleted or dropped, it has to be disabled first.
disable_all
Syntax: disable_all <"matching regex">
● This command will disable all the tables matching the given regex.
● The usage is the same as the disable command, except that a regex is added for matching.
● Once a table is disabled, the user is able to delete it from HBase.
● Before deleting or dropping a table, it should be disabled first.
Enable
Syntax: enable <tablename>
Example: enable 'kccs'

● This command will start enabling the named table.


● Whichever table is disabled, we use this command to restore it to its previous state.
● If a table was disabled in the first instance and not deleted or dropped, and we want
to re-use it, then we have to enable it by using this command.
show_filters
Syntax: show_filters

This command displays all the filters present in HBase, like ColumnPrefixFilter,
TimestampsFilter, PageFilter, FamilyFilter, etc.
drop
Syntax: drop <table name>
Example:

We have to observe the below points for the drop command:


● To drop or delete a table present in HBase, we first have to disable it using the disable
command.
● Before executing this command, it is therefore necessary that we disable the table "kccs".
drop_all
Syntax: drop_all <"regex">
● This command will drop all the tables matching the given regex.
● Tables have to be disabled first (using disable_all) before executing this command.
● Tables whose names match the regex expression will be dropped from HBase.
alter
Syntax: alter <tablename>, NAME=><column familyname>, VERSIONS=>5
This command alters the column family schema. To understand what exactly it does, we
have explained it here with an example.
Example: Altering table kccs, by adding a new column family named ‘teachers’

Data manipulation commands


These commands will work on the table related to data manipulations such as putting data
into a table, retrieving data from a table and deleting schema, etc.
The commands in this category are:
● Count
● Put
● Get
● Delete
● Delete all
● Truncate
● Scan
Let us look at the usage of these commands with examples.
Count
Syntax: count <'tablename'>, CACHE => 1000
● The command retrieves the count of the number of rows in a table; the value
returned is the number of rows.
● The current count is shown for every 1000 rows by default.
● The count interval may be optionally specified.
● The default cache size is 10 rows.
● The count command works faster when it is configured with the right cache size.
Example:

Put
Syntax: put <'tablename'>, <'rowname'>, <'columnname'>, <'value'>
This command is used for the following things:
● It will put a cell 'value' at the specified table, row, and column.
● A timestamp can optionally be specified.
Example:

● hbase> put 'kccs',1,'students:name','Dhruv'


Here we are placing values into table “kccs” under row 1, column family ‘students’ and
column ‘name’
● hbase> put 'kccs',1,'students:roll','59'
Here we are placing values into table “kccs” under row 1, column family ‘students’ and
column ‘roll’
● hbase> put 'kccs',1,'teachers:name','Mr. Naveen Pahuja'
Here we are placing values into table “kccs” under row 1, column family ‘teachers’ and
column ‘name’
Get
Syntax: get <'tablename'>, <'rowname'>, {< Additional parameters>}
Here <Additional Parameters> include TIMERANGE, TIMESTAMP, VERSIONS and
FILTERS.
By using this command, you will get the row or cell contents present in the table. In addition,
you can also add parameters like TIMESTAMP,
TIMERANGE, VERSIONS, FILTERS, etc. to get a particular row or cell content.
Example:
Delete
Syntax:delete <'tablename'>,<'row name'>,<'column name'>
● This command will delete the cell value at the defined table, row, and column.
● The delete must match the deleted cell's coordinates exactly.
● When scanning, a delete cell suppresses older versions of values.
Example: hbase> delete 'kccs',1,'students:roll'

● The above execution will delete column ‘roll’ in the column family ‘students’ from the
table ‘kccs’
deleteall
Syntax: deleteall <'tablename'>, <'rowname'>
● This command will delete all cells in a given row.
● Optionally, column names and a timestamp can be added to the syntax.
Truncate
Syntax: truncate <tablename>
After truncating an HBase table, the schema remains present but not the records. This command
performs 3 functions; these are listed below:
● Disables the table if it exists
● Drops the table if it exists
● Recreates the mentioned table
Example:

Scan
Syntax: scan <'tablename'>, {Optional parameters}
This command scans the entire table and displays the table contents.
● We can pass several optional specifications to this scan command to get more
information about the tables present in the system.
● Scanner specifications may include one or more of the following attributes.
● These are TIMERANGE, FILTER, TIMESTAMP, LIMIT, MAXLENGTH, COLUMNS,
CACHE, STARTROW and STOPROW.
Example:
Pig installation
Steps:
1. Download Pig from the official page: download the pig-0.17.0.tar.gz file from
https://downloads.apache.org/pig/pig-0.17.0/
2. Unzip file

3. Set environment variables


4. Check pig version

5. If the above error is received, solve it by changing the pig.cmd file as below

6. Try the pig -version command again.

7. Try running pig in local mode


a. Local mode:

Pig Commands
1. Load: Loads the data from the file system.
Eg: a = load 'data.txt' as (lines:int);
2. Dump : Executes the relation. Used to display the output

3. Describe: To describe the relation

4. pig -x local file.pig : To execute a Pig script

5. Cross : To perform the cross product of 2 or more relations. It is an expensive operation. Displays
the combined result of the relations.
6. Store: To store the output relation into a folder. The folder will contain a part file where the result
is stored.

7. Filter : Will filter input based on a criteria provided.


8. Foreach : Generates data transformations based on columns of data.

9. Group: The GROUP operator is used to group the data in one or more relations. It groups the tuples
that contain a similar group key. If the group key has more than one field, it is treated as a tuple;
otherwise it will be of the same type as the group key. As a result, it produces a relation that
contains one tuple per group.
10. Limit : Limits the number of output tuples.

11. Order by : The Apache Pig ORDER BY operator sorts a relation based on one or more fields. It
returns the tuples in sorted order.
12. Split : The Apache Pig SPLIT operator breaks the relation into two or more relations according to
the provided expression. Here, a tuple may or may not be assigned to one or more than one
relation.
13. Distinct : Used to remove duplicate tuples in a relation. Pig sorts the data and then eliminates
duplicates.

14. Join
The JOIN operator is used to combine records from two or more relations. While performing a join
operation, we declare one field (or a group of fields) from each relation as keys. When these keys
match, the two particular tuples are matched; otherwise the records are dropped. Joins can be of the
following types:
a. Self Join : Used to join a table with itself, as if there are 2 relations, temporarily renaming at
least 1 relation
b. Inner Join : Also referred to as Equijoin. Returns rows if there is a match in both tables.

c. Outer Join : Returns all rows from at least one of the relations. Can be Left outer join, Right
outer join, or full outer join.
1. Left outer join : Returns all rows from the left table, even if there are no matches in the
right relation.
2. Right outer join : Returns all rows from the right table, even if there are no matches in
the left table
3. Full Outer Join : Returns all rows when there is a match in one of the relations.
Example: Working with a csv file
1. Creating the csv file containing sales records

2. Load the records and group them based on the city

3. Count the number of sales in each city.


Note: The CONCAT function will concatenate 2 or more fields of the same type.
4. Store the sales records obtained earlier into a folder and verify them
PRACTICAL NO -6
Aim: Implement a clustering algorithm using MapReduce.

B.1. Clustering:

Clustering is an unsupervised learning technique; in short, you are working on data
without having any information about a target attribute or a dependent variable. The
general idea of clustering is to find some intrinsic structure in the data, often referred
to as groups of similar objects. The algorithm studies the data to identify these
patterns or groups such that each member of a group is closer to the other members of
the group (lower intra-cluster distance) and farther from members of a
different group (higher inter-cluster distance).

B.2 Input and Output:

Data: The dataset consists of 9K active credit cardholders over 6 months and their
transaction and account attributes. The idea is to develop a customer segmentation
for marketing strategy.

Using PySpark:

➔ Step 1: Installation of Hadoop and PySpark in Colab.

➔ Step 2: Install PySpark and find out the path where PySpark is installed.
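A minimal sketch of Steps 1-2 in a Colab cell, assuming the pip-based installation route (package name as on PyPI):

# Colab cell: install PySpark, then locate where it was installed
!pip install -q pyspark
import pyspark
print(pyspark.__version__, pyspark.__file__)   # version and installation path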

➔ Step 3: Schema information of the dataset.

➔ Step 4: All attributes under consideration are numerical or discrete numeric,
hence we need to convert them into features using a VectorAssembler. Since
the customer id is an identifier that won't be used for clustering, we first extract
the required columns using .columns, pass them as input to the VectorAssembler,
and then use transform() to convert the input columns into a single vector
column called features.

➔ Step 5: Now that all columns are transformed into a single feature vector we
need to standardize the data to bring them to a comparable scale.

➔ Step 6: Now that our data is standardized, we can develop the K-Means
algorithm.

➔ Step 7: Visualize the silhouette scores to choose a suitable number of clusters.
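Since the code for Steps 4-7 above appears only as screenshots, the following is a minimal PySpark sketch of those steps; the CSV file name and the CUST_ID identifier column are assumptions, so adapt them to the actual dataset:

# Steps 4-7: vector assembly, scaling, K-Means, and silhouette scores
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.appName("CustomerSegmentation").getOrCreate()
df = spark.read.csv("CC_GENERAL.csv", header=True, inferSchema=True)  # assumed file name

# Step 4: assemble all columns except the identifier into a single vector column
input_cols = [c for c in df.columns if c != "CUST_ID"]                # assumed id column
assembler = VectorAssembler(inputCols=input_cols, outputCol="features",
                            handleInvalid="skip")
assembled = assembler.transform(df)

# Step 5: standardize the features so all attributes are on a comparable scale
scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                        withStd=True, withMean=True)
scaled = scaler.fit(assembled).transform(assembled)

# Steps 6-7: fit K-Means for a range of k and compare silhouette scores
evaluator = ClusteringEvaluator(featuresCol="scaled_features", metricName="silhouette")
for k in range(2, 8):
    model = KMeans(featuresCol="scaled_features", k=k, seed=1).fit(scaled)
    predictions = model.transform(scaled)
    print(k, evaluator.evaluate(predictions))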

B.3 Observations and learning:

I observed that K-means clustering is a classical clustering algorithm that uses an
expectation-maximization-like technique to partition a number of data points into k
clusters, and that MapReduce is the style of computing that has been implemented in this
system.

B.4 Conclusion:

We have successfully implemented k-means clustering using Hadoop MapReduce
in Python.

B.5 Question of Curiosity:


1. Explain clustering strategies?

Ans:
Clustering is the task of dividing the population or data points into a number of
groups such that data points in the same groups are more similar to other data points
in the same group than those in other groups. In simple words, the aim is to
segregate groups with similar traits and assign them into clusters.

2. What are the clustering applications?


Ans:
Clustering techniques can be used in various areas or fields of real life, such
as data mining, web cluster engines, academics, bioinformatics, image processing &
transformation, and many more, and have emerged as an effective solution in the above-
mentioned areas.

3. How is clustering different from classification?


Ans:
Basis for comparison - Classification vs. Clustering:

● Basic: Classification maps the data into one of numerous already defined, definite classes,
whereas clustering maps the data into one of multiple clusters, where the arrangement of data
items relies on the similarities between them.

● Involved in: Classification involves supervised learning; clustering involves unsupervised learning.

● Training sample: For classification, labelled data is provided; for clustering, unlabelled data is
provided.
