Big Data Journal
Aim:
Implement an application that stores big data in HBase / MongoDB and manipulates it using
R / Python
Requirements
a. PyMongo
b. Mongo Database
Step A: Install Mongo database
Step 1) Go to (https://www.mongodb.com/download-center/community) and Download
MongoDB Community Server. We will install the 64-bit version for Windows.
Step 2) Once the download is complete, open the .msi file. Click Next on the start-up screen.
Step 3)
1. Accept the End-User License Agreement
2. Click Next
Step 4) Click on the "Complete" button to install all of the components. The "Custom"
option can be used to install selected components or to change the location of the
installation.
Step 5)
1. Select “Run service as Network Service user”. Make a note of the data directory;
we’ll need it later.
2. Click Next
Create the default data directory used by MongoDB:
C:\users\admin> cd\
C:\>md data\db
Install PyMongo (pip install pymongo) and import it:
import pymongo
Now, either create a file in Python IDLE or run all the commands one by one, in sequence, in a
Python cell.
Program 1: Creating a Database: create_dp.py
b. If you want to use a database named mybigdata, the corresponding use-database
statement would be as follows:
c. To check the collections in MongoDB, use the show collections command:
> show collections
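The program itself and the statements referred to in (b) and (c) appeared as screenshots. A minimal sketch of what create_dp.py might contain, assuming MongoDB is running on the default localhost port (the collection name and the inserted document are illustrative only):

import pymongo

# connect to the local MongoDB server (default port 27017)
client = pymongo.MongoClient("mongodb://localhost:27017/")

# a database and its collections are only created once some data is inserted
db = client["mybigdata"]
collection = db["sample_collection"]  # illustrative collection name
collection.insert_one({"name": "test", "value": 1})

# verify: list the databases on the server and the collections in mybigdata
print(client.list_database_names())
print(db.list_collection_names())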
Aim:
Install, configure and run Hadoop and HDFS and explore HDFS on Windows
Code:
Steps to Install Hadoop
1. Install Java JDK 1.8
2. Download Hadoop and extract and place under C drive
3. Set Path in Environment Variables
4. Config files under Hadoop directory
5. Create folder datanode and namenode under data directory
6. Edit HDFS and YARN files
7. Set Java Home environment in Hadoop environment
8. Setup Complete. Test by executing start-all.cmd
1. Install Java
· Create a java folder in C:\
· Download and install jdk-8u112-windows-x64 into C:\java, not into C:\Program Files\java (the
space in "Program Files" can cause problems for Hadoop's scripts)
2. Download Hadoop
· Browse the site https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/
· Download hadoop-2.7.0.tar.gz (the binary distribution, not the -src source archive)
· Extract it to C:\
· Rename the hadoop-2.7.0 folder to Hadoop
3. Set the JAVA_HOME environment variable
Type env in the Windows search box to open the Environment Variables window.
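Alternatively, the same variables can be set from an elevated command prompt; the JDK folder name below is an assumption, so use your actual install path:

setx JAVA_HOME "C:\java\jdk1.8.0_112" /M
setx HADOOP_HOME "C:\Hadoop" /M

Then add %JAVA_HOME%\bin, %HADOOP_HOME%\bin and %HADOOP_HOME%\sbin to the Path variable from the Environment Variables window.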
5. Configurations
Hadoop Configurations
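The edits to the configuration files under C:\Hadoop\etc\hadoop appeared as screenshots; a minimal single-node sketch is reproduced below. The port number and directory values are typical assumptions, and the two directories should be the namenode and datanode folders created under the data directory earlier:

core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\Hadoop\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\Hadoop\data\datanode</value>
  </property>
</configuration>

In mapred-site.xml set mapreduce.framework.name to yarn, and in yarn-site.xml set yarn.nodemanager.aux-services to mapreduce_shuffle.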
Download and run VC_redist.x64.exe, then run the same file again and the error should no
longer appear.
VC_redist.x64.exe is a “redistributable” package of the Visual C++ runtime code for 64-bit
applications, from Microsoft. It contains shared code that every application written with Visual C++
expects to have available on the Windows computer it runs on.
– Open cmd ‘Run as Administrator’ and type the command hdfs namenode -format
Testing
– Open cmd ‘Run as Administrator’ and change directory to C:\Hadoop\sbin
– type start-all.cmd
OR
- type start-dfs.cmd
– type start-yarn.cmd
– You will get four more running processes (daemons): DataNode, NameNode, ResourceManager
and NodeManager.
Type the jps command in the start-all.cmd command prompt; you will get the following output.
Run http://localhost:9870/ from any browser Or http://localhost:50070/
PRACTICAL NO -3
MapReduce Implementation
Aim:
Implement word count / frequency programs using MapReduce.
Steps:
Step:1
C:\Hadoop\sbin>start-all.cmd
OR
C:\Hadoop\sbin>start-dfs.cmd
C:\Hadoop\sbin>start-yarn.cmd
Step: 2
I) Open a command prompt as administrator and run the following command to create an input
and output folder on the Hadoop file system, to which we will move the sample.txt file for our
analysis (see the example command after this list).
II) C:\Hadoop\bin>cd\
III) C:\>hadoop dfsadmin -safemode leave
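The folder-creation command itself appeared as a screenshot; a typical form, using the /input_dir name referenced in the later steps, would be:

C:\>hadoop fs -mkdir /input_dir

(The output directory /output_dir normally does not need to be created in advance; the MapReduce job creates it and will fail if it already exists.)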
Step: 5
Copy the input text file named input_file.txt in the input directory (input_dir) of HDFS by applying
the following command at c:\>
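The copy command appeared as a screenshot; assuming the file sits at C:\input_file.txt, it would look like:

C:\>hadoop fs -put C:\input_file.txt /input_dir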
Step: 6
Step: 7
Step: 8
In case there is an error in execution, copy the file hadoop-mapreduce-examples-3.3.0.jar to
C:\ and run the program using that existing hadoop-mapreduce-examples-3.3.0.jar file:
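The invocation appeared as a screenshot; a typical word-count run with that jar, using the directory names from the earlier steps, would be:

C:\>hadoop jar C:\hadoop-mapreduce-examples-3.3.0.jar wordcount /input_dir /output_dir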
To view the output, you may type the following command in the CMD window:
C:\> hadoop dfs -cat /output_dir/*
You will get output similar to the following.
The outcome may also look like this:
Hive Installation
Step 1:
Step 2:
Browse the site https://archive.apache.org/dist/hive/hive-2.1.0/
Download and extract hive-2.1.0 to C:\ and rename it to hive
Step 3:
From the c:\derby\lib folder, copy all the files and paste them into the c:\hive\lib folder.
Step 4:
User variable and system variable setup
Step 5:
Copy hive-site.xml to the c:\hive\conf folder.
You can take this file from the following link:
https://drive.google.com/file/d/1tsBbHdvM1fFktmn9O0-u0pbG1vWWFoyE/view
Step 6:
Open the command prompt as administrator and give the command:
C:\windows\system32>start-all.cmd
Step 7:
C:\windows\system32>StartNetworkServer -h 0.0.0.0
This starts the Derby network server. Minimize the window and keep it running.
Step 8:
Open another command prompt as administrator.
C:\windows\system32>hive
hive> create database empdb;
hive> use empdb;
hive> create table employee (Id int, Name string , Salary float);
hive> insert into employee values (1,"Anjani", 25000);
hive> select * from employee;
OK
1 Anjani 25000.0
1 row selected (0.201 seconds)
hive> insert into employee values (2,"Gayatri", 25000);
Pre-requisites: We are going to make a standalone setup of HBase on our machine, which
requires us to:
• Install Java JDK 1.8 - We can download and install it from
https://www.oracle.com/java/technologies/downloads/ and set JAVA_HOME in the
environment variables.
• Download the HBase binary - Download Apache HBase 2.2.5 from
https://archive.apache.org/dist/hbase/2.2.5/
Steps:
• Step 1 – Extract all files from the archive
• Step 2 – Create folders named "hbase" and "zookeeper"
• Step 3 – Delete a line in hbase.cmd
Open C:\hbase-2.2.5\bin\hbase.cmd in any text editor.
Search for the line containing %HEAP_SETTINGS% and remove %HEAP_SETTINGS% from it:
set java_arguments=%HEAP_SETTINGS% %HBASE_OPTS% -classpath "%CLASSPATH%" %CLASS% %hbase-command-arguments%
• Step 4 – Add lines in hbase-env.cmd
Open C:\hbase-2.2.5\conf\hbase-env.cmd in any text editor.
Add the lines below to the file, after the comment section.
set JAVA_HOME=%JAVA_HOME%
set HBASE_CLASSPATH=%HBASE_HOME%\lib\client-facing-thirdparty\*
set HBASE_HEAPSIZE=8000
set HBASE_OPTS="-XX:+UseConcMarkSweepGC" "-Djava.net.preferIPv4Stack=true"
set SERVER_GC_OPTS="-verbose:gc" "-XX:+PrintGCDetails" "-XX:+PrintGCDateStamps" %HBASE_GC_OPTS%
set HBASE_USE_GC_LOGFILE=true
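The "hbase" and "zookeeper" folders created in Step 2 are typically referenced from C:\hbase-2.2.5\conf\hbase-site.xml; a minimal standalone sketch is given below, where the folder locations are assumptions and should be adjusted to wherever the folders were actually created:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///C:/hbase-2.2.5/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>C:/hbase-2.2.5/zookeeper</value>
  </property>
</configuration>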
In HBase, interactive shell mode is used to interact with HBase for table operations, table
management, and data modeling.
HBase Commands
General commands
In HBase, the general commands are categorized as follows:
● Status
● Version
● Table_help ( scan, drop, get, put, disable, etc.)
● Whoami
To enter the HBase shell, first execute the command below:
hbase shell
Once we enter the HBase shell, we can execute all the shell commands mentioned below. With
the help of these commands, we can perform all types of table operations in HBase shell mode.
Let us look into all of these commands and their usage one by one with an example.
Status
Syntax: status
This command gives details about the system status, such as the number of servers present in
the cluster, the active server count, and the average load value. You can also pass a parameter
depending on how detailed a status you want about the system; the parameter can be
‘summary’, ‘simple’, or ‘detailed’, and the default is ‘summary’.
Example:
whoami
Syntax: whoami
The “whoami” command is used to return the current HBase user information from the
HBase cluster.
Example:
list
Syntax: list
● The “list” command will display all the tables that are present or created in HBase
● The output of the command shows the tables that currently exist in HBase
● We can filter the output by passing an optional regular-expression parameter
Describe
Syntax: describe <table name>
Example:
hbase(main):011:0>describe 'kccs'
This command describes the named table.
● It will give more information about column families present in the mentioned table
● In our case, it gives the description about table “kccs”
● It will give information about table name with column families, associated filters,
versions and some more details.
disable
Syntax: disable <tablename>
Example: disable 'kccs'
● This command will start disabling the named table
● If a table needs to be deleted or dropped, it has to be disabled first
disable_all
Syntax: disable_all <"matching regex">
● This command will disable all the tables matching the given regex.
● The implementation is the same as the delete command (except for adding a regex for matching)
● Once a table is disabled, the user can delete it from HBase
● Before deleting or dropping a table, it should be disabled first
Enable
Syntax: enable <tablename>
Example: enable 'kccs'
This command will start enabling the named table.
show_filters
Syntax: show_filters
This command displays all the filters present in HBase, like ColumnPrefixFilter,
TimestampsFilter, PageFilter, FamilyFilter, etc.
drop
Syntax: drop <table name>
Example:
Put
Syntax: put <'tablename'>,<'rowname'>,<'columnvalue'>,<'value'>
This command is used for the following things:
● It will put a cell ‘value’ at the specified table, row and column.
● It will optionally take a timestamp for the cell.
Example:
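A hedged illustration, reusing the 'kccs' table and the 'students' column family that appear elsewhere in this section; the row key and value are made up:

put 'kccs', 'r1', 'students:roll', '31'

This stores the value '31' in column 'roll' of column family 'students' for the row 'r1'.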
delete
Syntax: delete <'tablename'>, <'rowname'>, <'columnname'>
● This command deletes a cell value from a table; for example, deleting the column ‘roll’ in the
column family ‘students’ from the table ‘kccs’.
deleteall
Syntax: deleteall <'tablename'>, <'rowname'>
● This command will delete all cells in a given row.
● Optionally, column names and a timestamp can be added to the syntax.
Truncate
Syntax: truncate <tablename>
After truncating an HBase table, the schema remains but the records do not. This command
performs three functions, listed below:
● Disables the table if it is already present
● Drops the table if it is already present
● Recreates the mentioned table
Example:
Scan
Syntax: scan <'tablename'>, {Optional parameters}
This command scans the entire table and displays the table contents.
● We can pass several optional specifications to this scan command to get more
information about the tables present in the system.
● Scanner specifications may include one or more of the following attributes.
● These are TIMERANGE, FILTER, TIMESTAMP, LIMIT, MAXLENGTH, COLUMNS,
CACHE, STARTROW and STOPROW.
Example:
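For instance, scanning the 'kccs' table used in the earlier examples, optionally restricted to one column family and a limited number of rows (the parameter values are illustrative):

scan 'kccs'
scan 'kccs', {COLUMNS => 'students', LIMIT => 2}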
Pig installation
Steps:
1. Download Pig from the official page: download the pig-0.17.0.tar.gz file from
https://downloads.apache.org/pig/pig-0.17.0/
2. Unzip the file
Pig Commands
1. Load: Loads the data from the file system.
Eg: a = load 'data.txt' as (lines:int);
2. Dump : Executes the relation. Used to display the output
5. Cross : To perform the cross product of 2 or more relations. It is an expensive operation. Displays
the combined result of the relations.
6. Store: To store the output relation into a folder. The folder will contain a part file in which the
result is stored.
9. Group: The Group operator is used to group the data in one or more relations. It groups the
tuples that contain a similar group key. If the group key has more than one field, the key is treated
as a tuple; otherwise it is of the same type as the group key. As a result, it provides a relation that
contains one tuple per group.
10. Limit : Limits the number of output tuples.
11. Order by : The Apache Pig ORDER BY operator sorts a relation based on one or more fields. It
maintains the order of tuples
12. Split : The Apache Pig SPLIT operator breaks the relation into two or more relations according to
the provided expression. Here, a tuple may or may not be assigned to one or more than one
relation.
13. Distinct : Used to remove duplicate tuples in a relation. Pig sorts the data and then eliminates
duplicates.
14. Join
The JOIN operator is used to combine records from two or more relations. While performing a join
operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys
match, the two particular tuples are matched, else the records are dropped. Joins can be of the
following types:
a. Self Join : Used to join a table with itself, as if there are 2 relations, temporarily renaming at
least 1 relation
b. Inner Join : Also referred to as Equijoin. Returns rows if there is a match in both tables.
c. Outer Join : Returns all rows from at least one of the relations. Can be Left outer join, Right
outer join, or full outer join.
1. Left outer join : Returns all rows from the left table, even if there are no matches in the
right relation.
2. Right outer join : Returns all rows from the right table, even if there are no matches in
the left table
3. Full Outer Join : Returns all rows when there is a match in one of the relations.
Example: Working with a csv file
1. Creating the csv file containing sales records
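The remaining steps appeared as screenshots; a hedged sketch of the kind of script involved, assuming a file sales.csv with columns id, product and amount, is:

grunt> sales = LOAD 'sales.csv' USING PigStorage(',') AS (id:int, product:chararray, amount:float);
grunt> dump sales;
grunt> by_product = GROUP sales BY product;
grunt> totals = FOREACH by_product GENERATE group AS product, SUM(sales.amount) AS total;
grunt> sorted = ORDER totals BY total DESC;
grunt> STORE sorted INTO 'sales_output' USING PigStorage(',');

This exercises the Load, Dump, Group, Order by and Store operators described above.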
B.1. Clustering:
Data: The dataset consists of 9K active credit cardholders over 6 months and their
transaction and account attributes. The idea is to develop a customer segmentation
for marketing strategy.
Using PySpark:
➔ Step 2: Install PySpark and find out the path where PySpark is installed.
➔ Step 4: All attributes under consideration are numerical or discrete numeric,
hence we need to convert them into features using a VectorAssembler. Since
customer id is an identifier that won’t be used for clustering, we first extract
the required columns using .columns, pass them as input to the VectorAssembler,
and then use transform() to convert the input columns into a single vector
column called features.
➔ Step 5: Now that all columns are transformed into a single feature vector we
need to standardize the data to bring them to a comparable scale.
➔ Step 6: Now that our data is standardized, we can develop the K-Means
algorithm (a sketch of Steps 4 to 6 follows below).
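The code for Steps 4 to 6 appeared as screenshots; a minimal PySpark sketch is given below. The file name credit_card_data.csv and the assumption that the first column is the customer id are illustrative, and k=4 is an arbitrary choice:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("CustomerSegmentation").getOrCreate()

# assumption: the credit-card dataset is available as a CSV file
df = spark.read.csv("credit_card_data.csv", header=True, inferSchema=True)

# Step 4: drop the customer id column and assemble the remaining columns into one vector
input_cols = df.columns[1:]  # first column assumed to be the customer id
assembler = VectorAssembler(inputCols=input_cols, outputCol="features")
assembled = assembler.transform(df)

# Step 5: standardize the feature vector so all attributes are on a comparable scale
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True, withMean=False)
scaled = scaler.fit(assembled).transform(assembled)

# Step 6: fit a K-Means model on the standardized features
kmeans = KMeans(featuresCol="scaled_features", k=4, seed=1)
model = kmeans.fit(scaled)
clustered = model.transform(scaled)  # adds a 'prediction' column holding the cluster id
clustered.groupBy("prediction").count().show()

In practice the number of clusters k would be chosen by comparing models, for example with a silhouette score, rather than fixed at 4.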
B.3 Observations and learning:
B.4 Conclusion:
Ans:
Clustering is the task of dividing the population or data points into a number of
groups such that data points in the same groups are more similar to other data points
in the same group than those in other groups. In simple words, the aim is to
segregate groups with similar traits and assign them into clusters.
Basic difference: Classification classifies the data into one of numerous already-defined classes,
whereas clustering maps the data into one of multiple clusters, where the arrangement of data
items relies on the similarities between them.