Big Data Journal
Aim:
Implement an application that stores big data in HBase / MongoDB and manipulates it using
R / Python
Requirements
a. PyMongo
b. Mongo Database
Step A: Install Mongo database
Step 1) Go to (https://www.mongodb.com/download-center/community) and Download
MongoDB Community Server. We will install the 64-bit version for Windows.
Step 2) Once the download is complete, open the .msi file. Click Next on the start-up screen.
Step 3)
1. Accept the End-User License Agreement
2. Click Next
Step 4) Click on the "Complete" button to install all of the components. The "Custom"
option can be used to install selected components or to change the location of the
installation.
Step 5)
1. Select “Run service as Network Service user”. Make a note of the data directory;
we’ll need it later.
2. Click Next
Create the default data directory used by MongoDB:
C:\users\admin> cd\
C:\>md data\db
Install PyMongo (pip install pymongo) and import it:
import pymongo
Now, either create a file in Python IDLE or run all the commands one by one, in sequence, in a
Python cell.
Program 1: Creating a Database: create_dp.py
b. If you want to use a database named mybigdata, the corresponding use-database
statement would be as follows:
c. To check the collections in MongoDB, use the show collections command:
> show collections
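The program itself and the statements referred to in (b) and (c) appeared as screenshots. A minimal sketch of what create_dp.py might contain, assuming MongoDB is running on the default localhost port (the collection name and the inserted document are illustrative only):

import pymongo

# connect to the local MongoDB server (default port 27017)
client = pymongo.MongoClient("mongodb://localhost:27017/")

# a database and its collections are only created once some data is inserted
db = client["mybigdata"]
collection = db["sample_collection"]  # illustrative collection name
collection.insert_one({"name": "test", "value": 1})

# verify: list the databases on the server and the collections in mybigdata
print(client.list_database_names())
print(db.list_collection_names())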
Aim:
Install, configure and run Hadoop and HDFS and explore HDFS on Windows
Code:
Steps to Install Hadoop
1. Install Java JDK 1.8
2. Download Hadoop and extract and place under C drive
3. Set Path in Environment Variables
4. Config files under Hadoop directory
5. Create folder datanode and namenode under data directory
6. Edit HDFS and YARN files
7. Set Java Home environment in Hadoop environment
8. Setup Complete. Test by executing start-all.cmd
1. Install Java
· Create a java folder in C:\
· Download and install jdk-8u112-windows-x64 into C:\java, not into C:\Program Files\java (the
space in "Program Files" can cause problems for Hadoop's scripts)
2. Download Hadoop
· Browse the site https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/
· Download hadoop-2.7.0.tar.gz (the binary distribution, not the -src source archive)
· Extract it to C:\
· Rename the hadoop-2.7.0 folder to Hadoop
3. Set the JAVA_HOME environment variable
Type env in the Windows search box to open the Environment Variables window.
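Alternatively, the same variables can be set from an elevated command prompt; the JDK folder name below is an assumption, so use your actual install path:

setx JAVA_HOME "C:\java\jdk1.8.0_112" /M
setx HADOOP_HOME "C:\Hadoop" /M

Then add %JAVA_HOME%\bin, %HADOOP_HOME%\bin and %HADOOP_HOME%\sbin to the Path variable from the Environment Variables window.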
5. Configurations
Hadoop Configurations
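The edits to the configuration files under C:\Hadoop\etc\hadoop appeared as screenshots; a minimal single-node sketch is reproduced below. The port number and directory values are typical assumptions, and the two directories should be the namenode and datanode folders created under the data directory earlier:

core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\Hadoop\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\Hadoop\data\datanode</value>
  </property>
</configuration>

In mapred-site.xml set mapreduce.framework.name to yarn, and in yarn-site.xml set yarn.nodemanager.aux-services to mapreduce_shuffle.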
Download and run VC_redist.x64.exe, then run the same file again and the error should no
longer appear.
VC_redist.x64.exe is a “redistributable” package of the Visual C++ runtime code for 64-bit
applications, from Microsoft. It contains shared code that every application written with Visual C++
expects to have available on the Windows computer it runs on.
– Open cmd ‘Run as Administrator’ and type the command hdfs namenode -format
Testing
– Open cmd ‘Run as Administrator’ and change directory to C:\Hadoop\sbin
– type start-all.cmd
OR
- type start-dfs.cmd
– type start-yarn.cmd
– You will get four more running processes (daemons): DataNode, NameNode, ResourceManager
and NodeManager.
Type the jps command in the start-all.cmd command prompt; you will get the following output.
Run http://localhost:9870/ from any browser Or http://localhost:50070/
PRACTICAL NO -3
MapReduce Implementation
Aim:
Implement word count / frequency programs using MapReduce.
Steps:
Step:1
C:\Hadoop\sbin>start-all.cmd
OR
C:\Hadoop\sbin>start-dfs.cmd
C:\Hadoop\sbin>start-yarn.cmd
Step: 2
I) Open a command prompt as administrator and run the following command to create an input
and output folder on the Hadoop file system, to which we will move the sample.txt file for our
analysis (see the example command after this list).
II) C:\Hadoop\bin>cd\
III) C:\>hadoop dfsadmin -safemode leave
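The folder-creation command itself appeared as a screenshot; a typical form, using the /input_dir name referenced in the later steps, would be:

C:\>hadoop fs -mkdir /input_dir

(The output directory /output_dir normally does not need to be created in advance; the MapReduce job creates it and will fail if it already exists.)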
Step: 5
Copy the input text file named input_file.txt in the input directory (input_dir) of HDFS by applying
the following command at c:\>
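The copy command appeared as a screenshot; assuming the file sits at C:\input_file.txt, it would look like:

C:\>hadoop fs -put C:\input_file.txt /input_dir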
Step: 6
Step: 7
Step: 8
In case there is an error in execution, copy the file hadoop-mapreduce-examples-3.3.0.jar to
C:\ and run the program using that existing hadoop-mapreduce-examples-3.3.0.jar file:
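The invocation appeared as a screenshot; a typical word-count run with that jar, using the directory names from the earlier steps, would be:

C:\>hadoop jar C:\hadoop-mapreduce-examples-3.3.0.jar wordcount /input_dir /output_dir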
To view the output, you may type the following command in the CMD window:
C:\> hadoop dfs -cat /output_dir/*
You will get output similar to the following.
The outcome may also look like this:
Hive Installation
Step 1:
Step 2:
Browse the site https://archive.apache.org/dist/hive/hive-2.1.0/
Download and extract hive-2.1.0 to C:\ and rename it to hive
Step 3:
From the c:\derby\lib folder, copy all the files and paste them into the c:\hive\lib folder.
Step 4:
User variable and system variable setup
Step 5:
Copy hive-site.xml to the c:\hive\conf folder.
You can take this file from the following link:
https://drive.google.com/file/d/1tsBbHdvM1fFktmn9O0-u0pbG1vWWFoyE/view
Step 6:
Open the command prompt as administrator and give the command:
C:\windows\system32>start-all.cmd
Step 7:
C:\windows\system32>StartNetworkServer -h 0.0.0.0
This starts the Derby network server. Minimize the window and keep it running.
Step 8:
Open another command prompt as administrator.
C:\windows\system32>hive
hive> create database empdb;
hive> use empdb;
hive> create table employee (Id int, Name string , Salary float);
hive> insert into employee values (1,"Anjani", 25000);
hive> select * from employee;
OK
1 Anjani 25000.0
1 row selected (0.201 seconds)
hive> insert into employee values (2,"Gayatri", 25000);
Pre-requisites: We are going to make a standalone setup of HBase on our machine, which
requires us to:
• Install Java JDK 1.8 - We can download and install it from
https://www.oracle.com/java/technologies/downloads/ and set JAVA_HOME in the
environment variables.
• Download the HBase binary - Download Apache HBase 2.2.5 from
https://archive.apache.org/dist/hbase/2.2.5/
Steps:
• Step 1 – Extract all files from the archive
• Step 2 – Create folders named "hbase" and "zookeeper"
• Step 3 – Delete a line in hbase.cmd
Open C:\hbase-2.2.5\bin\hbase.cmd in any text editor.
Search for the line containing %HEAP_SETTINGS% and remove %HEAP_SETTINGS% from it:
set java_arguments=%HEAP_SETTINGS% %HBASE_OPTS% -classpath "%CLASSPATH%" %CLASS% %hbase-command-arguments%
• Step 4 – Add lines in hbase-env.cmd
Open C:\hbase-2.2.5\conf\hbase-env.cmd in any text editor.
Add the lines below to the file, after the comment section.
set JAVA_HOME=%JAVA_HOME%
set HBASE_CLASSPATH=%HBASE_HOME%\lib\client-facing-thirdparty\*
set HBASE_HEAPSIZE=8000
set HBASE_OPTS="-XX:+UseConcMarkSweepGC" "-Djava.net.preferIPv4Stack=true"
set SERVER_GC_OPTS="-verbose:gc" "-XX:+PrintGCDetails" "-XX:+PrintGCDateStamps" %HBASE_GC_OPTS%
set HBASE_USE_GC_LOGFILE=true
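The "hbase" and "zookeeper" folders created in Step 2 are typically referenced from C:\hbase-2.2.5\conf\hbase-site.xml; a minimal standalone sketch is given below, where the folder locations are assumptions and should be adjusted to wherever the folders were actually created:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///C:/hbase-2.2.5/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>C:/hbase-2.2.5/zookeeper</value>
  </property>
</configuration>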
In HBase, interactive shell mode is used to interact with HBase for table operations, table
management, and data modeling.
HBase Commands
General commands
In HBase, the general commands are categorized as follows:
● Status
● Version
● Table_help ( scan, drop, get, put, disable, etc.)
● Whoami
To enter the HBase shell, first execute the command below:
hbase shell
Once we enter the HBase shell, we can execute all the shell commands mentioned below. With
the help of these commands, we can perform all types of table operations in HBase shell mode.
Let us look into all of these commands and their usage one by one with an example.
Status
Syntax: status
This command gives details about the system status, such as the number of servers present in
the cluster, the active server count, and the average load value. You can also pass a parameter
depending on how detailed a status you want about the system; the parameter can be
‘summary’, ‘simple’, or ‘detailed’, and the default is ‘summary’.
Example:
whoami
Syntax: whoami
The “whoami” command is used to return the current HBase user information from the
HBase cluster.
Example:
list
Syntax: list
● The “list” command will display all the tables that are present or created in HBase
● The output of the command shows the tables that currently exist in HBase
● We can filter the output by passing an optional regular-expression parameter
Describe
Syntax: describe <table name>
Example:
hbase(main):011:0>describe 'kccs'
This command describes the named table.
● It will give more information about column families present in the mentioned table
● In our case, it gives the description about table “kccs”
● It will give information about table name with column families, associated filters,
versions and some more details.
disable
Syntax: disable <tablename>
Example: disable 'kccs'
● This command will start disabling the named table
● If a table needs to be deleted or dropped, it has to be disabled first
disable_all
Syntax: disable_all <"matching regex">
● This command will disable all the tables matching the given regex.
● The implementation is the same as the delete command (except for adding a regex for matching)
● Once a table is disabled, the user can delete it from HBase
● Before deleting or dropping a table, it should be disabled first
Enable
Syntax: enable <tablename>
Example: enable 'kccs'
This command will start enabling the named table.
show_filters
Syntax: show_filters
This command displays all the filters present in HBase, like ColumnPrefixFilter,
TimestampsFilter, PageFilter, FamilyFilter, etc.
drop
Syntax: drop <table name>
Example:
Put
Syntax: put <'tablename'>,<'rowname'>,<'columnvalue'>,<'value'>
This command is used for the following things:
● It will put a cell ‘value’ at the specified table, row and column.
● It will optionally take a timestamp for the cell.
Example:
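A hedged illustration, reusing the 'kccs' table and the 'students' column family that appear elsewhere in this section; the row key and value are made up:

put 'kccs', 'r1', 'students:roll', '31'

This stores the value '31' in column 'roll' of column family 'students' for the row 'r1'.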
delete
Syntax: delete <'tablename'>, <'rowname'>, <'columnname'>
● This command deletes a cell value from a table; for example, deleting the column ‘roll’ in the
column family ‘students’ from the table ‘kccs’.
deleteall
Syntax: deleteall <'tablename'>, <'rowname'>
● This command will delete all cells in a given row.
● Optionally, column names and a timestamp can be added to the syntax.
Truncate
Syntax: truncate <tablename>
After truncating an HBase table, the schema remains but the records do not. This command
performs three functions, listed below:
● Disables the table if it is already present
● Drops the table if it is already present
● Recreates the mentioned table
Example:
Scan
Syntax: scan <'tablename'>, {Optional parameters}
This command scans the entire table and displays the table contents.
● We can pass several optional specifications to this scan command to get more
information about the tables present in the system.
● Scanner specifications may include one or more of the following attributes.
● These are TIMERANGE, FILTER, TIMESTAMP, LIMIT, MAXLENGTH, COLUMNS,
CACHE, STARTROW and STOPROW.
Example:
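For instance, scanning the 'kccs' table used in the earlier examples, optionally restricted to one column family and a limited number of rows (the parameter values are illustrative):

scan 'kccs'
scan 'kccs', {COLUMNS => 'students', LIMIT => 2}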
Pig installation
Steps:
1. Download Pig from the official page: download the pig-0.17.0.tar.gz file from
https://downloads.apache.org/pig/pig-0.17.0/
2. Unzip the file
Pig Commands
1. Load: Loads the data from the file system.
Eg: a = load 'data.txt' as (lines:int);
2. Dump : Executes the relation. Used to display the output
5. Cross : To perform the cross product of 2 or more relations. It is an expensive operation. Displays
the combined result of the relations.
6. Store: To store the output relation into a folder. The folder will contain a part file in which the
result is stored.
9. Group: The Group operator is used to group the data in one or more relations. It groups the
tuples that contain a similar group key. If the group key has more than one field, the key is treated
as a tuple; otherwise it is of the same type as the group key. As a result, it provides a relation that
contains one tuple per group.
10. Limit : Limits the number of output tuples.
11. Order by : The Apache Pig ORDER BY operator sorts a relation based on one or more fields. It
maintains the order of tuples
12. Split : The Apache Pig SPLIT operator breaks the relation into two or more relations according to
the provided expression. Here, a tuple may or may not be assigned to one or more than one
relation.
13. Distinct : Used to remove duplicate tuples in a relation. Pig sorts the data and then eliminates
duplicates.
14. Join
The JOIN operator is used to combine records from two or more relations. While performing a join
operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys
match, the two particular tuples are matched, else the records are dropped. Joins can be of the
following types:
a. Self Join : Used to join a table with itself, as if there are 2 relations, temporarily renaming at
least 1 relation
b. Inner Join : Also referred to as Equijoin. Returns rows if there is a match in both tables.
c. Outer Join : Returns all rows from at least one of the relations. Can be Left outer join, Right
outer join, or full outer join.
1. Left outer join : Returns all rows from the left table, even if there are no matches in the
right relation.
2. Right outer join : Returns all rows from the right table, even if there are no matches in
the left table
3. Full Outer Join : Returns all rows when there is a match in one of the relations.
Example: Working with a csv file
1. Creating the csv file containing sales records
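The remaining steps appeared as screenshots; a hedged sketch of the kind of script involved, assuming a file sales.csv with columns id, product and amount, is:

grunt> sales = LOAD 'sales.csv' USING PigStorage(',') AS (id:int, product:chararray, amount:float);
grunt> dump sales;
grunt> by_product = GROUP sales BY product;
grunt> totals = FOREACH by_product GENERATE group AS product, SUM(sales.amount) AS total;
grunt> sorted = ORDER totals BY total DESC;
grunt> STORE sorted INTO 'sales_output' USING PigStorage(',');

This exercises the Load, Dump, Group, Order by and Store operators described above.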
B.1. Clustering:
Data: The dataset consists of 9K active credit cardholders over 6 months and their
transaction and account attributes. The idea is to develop a customer segmentation
for marketing strategy.
Using PySpark:
➔ Step 2: Install PySpark and find out the path where PySpark is installed.
➔ Step 4: All attributes under consideration are numerical or discrete numeric,
hence we need to convert them into features using a VectorAssembler. Since
customer id is an identifier that won’t be used for clustering, we first extract
the required columns using .columns, pass them as input to the VectorAssembler,
and then use transform() to convert the input columns into a single vector
column called features.
➔ Step 5: Now that all columns are transformed into a single feature vector we
need to standardize the data to bring them to a comparable scale.
➔ Step 6: Now that our data is standardized, we can develop the K-Means
algorithm (a sketch of Steps 4 to 6 follows below).
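The code for Steps 4 to 6 appeared as screenshots; a minimal PySpark sketch is given below. The file name credit_card_data.csv and the assumption that the first column is the customer id are illustrative, and k=4 is an arbitrary choice:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("CustomerSegmentation").getOrCreate()

# assumption: the credit-card dataset is available as a CSV file
df = spark.read.csv("credit_card_data.csv", header=True, inferSchema=True)

# Step 4: drop the customer id column and assemble the remaining columns into one vector
input_cols = df.columns[1:]  # first column assumed to be the customer id
assembler = VectorAssembler(inputCols=input_cols, outputCol="features")
assembled = assembler.transform(df)

# Step 5: standardize the feature vector so all attributes are on a comparable scale
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True, withMean=False)
scaled = scaler.fit(assembled).transform(assembled)

# Step 6: fit a K-Means model on the standardized features
kmeans = KMeans(featuresCol="scaled_features", k=4, seed=1)
model = kmeans.fit(scaled)
clustered = model.transform(scaled)  # adds a 'prediction' column holding the cluster id
clustered.groupBy("prediction").count().show()

In practice the number of clusters k would be chosen by comparing models, for example with a silhouette score, rather than fixed at 4.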
B.3 Observations and learning:
B.4 Conclusion:
Ans:
Clustering is the task of dividing the population or data points into a number of
groups such that data points in the same groups are more similar to other data points
in the same group than those in other groups. In simple words, the aim is to
segregate groups with similar traits and assign them into clusters.
Basic difference: Classification classifies the data into one of numerous already-defined classes,
whereas clustering maps the data into one of multiple clusters, where the arrangement of data
items relies on the similarities between them.