Lab Manual - Student Copy - Index & Experiments CCS334 - BDA
SYLLABUS
COURSE OBJECTIVES:
The student should be made to:
To understand big data.
To learn and use NoSQL big data management.
To learn MapReduce analytics using Hadoop and related tools.
To work with MapReduce applications.
To understand the usage of Hadoop related tools for Big Data Analytics
LIST OF EXPERIMENTS
1. Downloading and installing Hadoop; Understanding different Hadoop modes, Startup scripts,
Configuration files
2. Hadoop Implementation of file management tasks, such as Adding files and directories,
retrieving files and Deleting files
3. Implementation of Matrix Multiplication with Hadoop Map Reduce
4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm
5. Installation of Hive along with practice examples
6. Installation of HBase, Installing thrift along with Practice examples
7. Practice importing and exporting data from various databases.
COURSE OUTCOMES:
At the end of the course, the students will be able to:
CO1: Describe big data and use cases from selected business domains.
CO2: Explain NoSQL big data management.
CO3: Install, configure, and run Hadoop and HDFS.
CO4: Perform map-reduce analytics using Hadoop.
CO5: Use Hadoop-related tools such as HBase, Cassandra, Pig, and Hive for big data analytics.
Big data analytics uses advanced analytical methods that can extract important business
insights from bulk datasets. Within these datasets lies both structured (organized) and unstructured
(unorganized) data. Its applications cover different industries such as healthcare, education,
insurance, AI, retail, and manufacturing.
Data Visualization:
Data analysis results are usually presented in visual form, for example charts, graphs, and
interactive dashboards. These visualizations simplify large amounts of data and allow decision
makers to quickly detect patterns and trends.
Data Storage and Management:
Storing and managing the analyzed data is of utmost importance. It is like digital scrapbooking:
you may want to go back to those insights later, so how you store them matters. Moreover, data
protection and adherence to regulations are key issues to be addressed during this crucial stage.
Continuous Learning and Improvement:
Big data analytics is a continuous process of collecting, cleaning, and analyzing data to
uncover hidden insights. It helps businesses make better decisions and gain a competitive edge.
Types of Big Data Analytics
Big Data Analytics comes in many different types, each serving a different purpose:
Descriptive Analytics:
This type helps us understand past events. In social media, it shows performance metrics, like
the number of likes on a post.
Diagnostic Analytics:
Diagnostic analytics delves deeper to uncover the reasons behind past events. In healthcare,
it identifies the causes of high patient re-admissions.
Predictive Analytics:
Predictive analytics forecasts future events based on past data. Weather forecasting, for
example, predicts tomorrow's weather by analyzing historical patterns.
Prescriptive Analytics:
This category not only predicts outcomes but also offers recommendations for action to achieve
them. In e-commerce, it may suggest the best price for a product to achieve the highest
possible profit.
Real-time Analytics:
The key function of real-time analytics is processing data as it arrives. In stock trading, for
example, it allows traders to make decisions based on real-time market events.
Spatial Analytics:
Spatial analytics is about location data. In urban management, it optimizes traffic flow using
data from sensors and cameras to minimize traffic jams.
Text Analytics:
Text analytics delves into unstructured text data. In the hotel business, it can use
guest reviews to enhance services and guest satisfaction.
Fraud Detection:
Credit card companies, like MasterCard, use Big Data Analytics to catch and stop fraudulent
transactions. It's like having a guardian that watches over your money and keeps it safe.
Optimized Logistics:
FedEx, for example, uses Big Data Analytics to deliver your packages faster and with less
impact on the environment. It's like taking the fastest route to your destination while also being kind to
the planet.
Challenges of Big data analytics
While Big Data Analytics offers incredible benefits, it also comes with its set of challenges:
Data Overload:
Consider Twitter, where approximately 6,000 tweets are posted every second. The challenge is
sifting through this avalanche of data to find valuable insights.
Data Quality:
If the input data is inaccurate or incomplete, the insights generated by Big Data Analytics can
be flawed. For example, incorrect sensor readings could lead to wrong conclusions in weather
forecasting.
Privacy Concerns:
With the vast amount of personal data used, like in Facebook's ad targeting, there's a fine line
between providing personalized experiences and infringing on privacy.
Security Risks:
With cyber threats increasing, safeguarding sensitive data becomes crucial. For instance,
banks use Big Data Analytics to detect fraudulent activities, but they must also protect this information
from breaches.
Costs:
Implementing and maintaining Big Data Analytics systems can be expensive. Airlines like
Delta use analytics to optimize flight schedules, but they need to ensure that the benefits outweigh the
costs.
Usage of Big Data Analytics
Big Data Analytics has a significant impact in various sectors:
Healthcare: It aids in precise diagnoses and disease prediction, elevating patient care.
Retail: Amazon's use of Big Data Analytics offers personalized product recommendations based on
your shopping history, creating a more tailored and enjoyable shopping experience.
Finance: Credit card companies such as Visa rely on Big Data Analytics to swiftly identify and
prevent fraudulent transactions, ensuring the safety of your financial assets.
Transportation: Companies like Uber use Big Data Analytics to optimize drivers' routes and predict
demand, reducing wait times and improving overall transportation experiences.
Harnessing all of that data requires tools. Thankfully, technology has advanced so that many
intuitive software systems are available for data analysts to use.
Hadoop:
An open-source framework that stores and processes big data sets. Hadoop can handle and
analyse structured and unstructured data.
Spark:
An open-source cluster computing framework for real-time processing and data analysis.
Data integration software:
Programs that allow big data to be streamlined across different platforms, such as MongoDB,
Apache Hadoop, and Amazon EMR.
Stream analytics tools:
Systems that filter, aggregate, and analyse data that might be stored in different platforms and
formats, such as Kafka.
Distributed storage:
Databases that can split data across multiple servers and can identify lost or corrupt data, such
as Cassandra.
EXPT.NO.1(a)
Downloading and installing Hadoop; Understanding different Hadoop modes, Startup scripts, Configuration files.
DATE:
AIM
To download and install Hadoop and to understand the different Hadoop modes.
PROCEDURE
OUTPUT
RESULT
Thus, Hadoop was downloaded and installed, and the different Hadoop modes were
successfully understood.
EXPT.NO.1(b)
Downloading and installing Hadoop; Startup scripts, Configuration files.
DATE:
AIM
To edit the Hadoop startup scripts and configuration files for a single-node setup.
Edit core-site.xml
$sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/cse/hdata</value>
</property>
</configuration>
Edit hdfs-site.xml
$sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
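The manual does not list the properties for this file; for a single-node setup the usual entries are the replication factor and the NameNode/DataNode storage directories (the paths below are examples consistent with the hadoop.tmp.dir set above and should be adjusted to your installation):
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/cse/hdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/cse/hdata/datanode</value>
</property>
</configuration>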
Edit mapred-site.xml
$sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
#Add the lines below to this file (between "<configuration>" and "</configuration>")
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Edit yarn-site.xml
$sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
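The remaining steps are not reproduced in this copy; before the daemons can be checked they typically consist of the one-time NameNode format followed by the startup scripts:
hdfs namenode -format
start-dfs.sh
start-yarn.sh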
Verify the running daemons with the jps command; sample output:
6651 NameNode
7339 NodeManager
7663 Jps
OUTPUT
RESULT
Thus, Hadoop was downloaded and installed, and the Hadoop startup scripts and
configuration files were successfully implemented.
EXPT.NO.2
Hadoop Implementation of file management tasks, such as Adding files and directories, retrieving files and Deleting files
DATE:
AIM
To implement the following file management tasks in Hadoop:
1. Adding files and directories
2. Retrieving files
3. Deleting Files
DESCRIPTION: -
HDFS is a scalable distributed filesystem designed to scale to petabytes of data while
running on top of the underlying filesystem of the operating system. HDFS keeps track of
where the data resides in a network by associating the name of its rack (or network switch)
with the dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain
data, or which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of
command line utilities that work similarly to the Linux file commands, and serve as your
primary interface with HDFS.
We're going to have a look into HDFS by interacting with it from the command line.
We will take a look at the most common file management tasks in Hadoop, which include:
1. Adding files and directories to HDFS
2. Retrieving files from HDFS to local filesystem
3. Deleting files from HDFS
After formatting the HDFS, start the distributed file system. The following command will
start the namenode as well as the data nodes as cluster. $ start-dfs.sh
After loading the information in the server, we can find the list of files in a directory and the
status of a file using ls. Given below is the syntax of ls; you can pass a directory or a
filename as an argument.
$ hadoop fs -ls <args>
Transfer and store a data file from local systems to the Hadoop file system using the put
command.
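For example (the file and directory names below are placeholders):
$ hadoop fs -mkdir -p /user/cse/input
$ hadoop fs -put sample.txt /user/cse/input
$ hadoop fs -ls /user/cse/input
$ hadoop fs -get /user/cse/input/sample.txt ~/Desktop/sample_copy.txt
$ hadoop fs -rm /user/cse/input/sample.txt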
Finally, stop the distributed file system when finished:
$ stop-dfs.sh
RESULT:
Thus, the file management tasks in Hadoop (adding files and directories, retrieving files, and deleting files) have been successfully completed.
EXPT.NO.3(a)
Implementation of Matrix Multiplication with Hadoop Map Reduce
DATE:
AIM
To Develop a MapReduce program to implement Matrix Multiplication.
DESCRIPTION
a. For each element mij of M, produce (key, value) pairs as ((i,k), (M, j, mij)) for k = 1, 2, 3, ...
up to the number of columns of N.
b. For each element njk of N, produce (key, value) pairs as ((i,k), (N, j, njk)) for i = 1, 2, 3, ...
up to the number of rows of M.
c. Return the set of (key, value) pairs in which each key (i,k) has a list with values (M, j, mij)
and (N, j, njk) for all possible values of j.
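The manual lists only the reducer; a minimal sketch of a corresponding Mapper.py for Hadoop Streaming, assuming each input line has the form matrix_name,row,col,value and that the matrix dimensions are hard-coded, could be:
#!/usr/bin/env python
# Hypothetical Mapper.py sketch: reads lines of the form matrix,row,col,value
# and emits "i,k<TAB>j<TAB>element" records as described in steps a and b above.
import sys

N_COLS = 2  # number of columns of N (assumed; set to match your input)
M_ROWS = 2  # number of rows of M (assumed; set to match your input)

for line in sys.stdin:
    matrix, row, col, value = line.rstrip().split(",")
    row, col = int(row), int(col)
    if matrix == "M":
        # element mij: key (i,k) for every column k of N, value is (j, mij)
        for k in range(N_COLS):
            print("%d,%d\t%d\t%s" % (row, k, col, value))
    else:
        # element njk: key (i,k) for every row i of M, value is (j, njk)
        for i in range(M_ROWS):
            print("%d,%d\t%d\t%s" % (i, col, row, value))
Reducer.py: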
#!/usr/bin/env python
import sys
from operator import itemgetter

def emit(key, pairs):
    # sort the (j, element) pairs so matching j values from M and N are adjacent,
    # multiply each matching pair and sum the products
    pairs = sorted(pairs, key=itemgetter(0))
    i = 0
    result = 0
    while i < len(pairs) - 1:
        if pairs[i][0] == pairs[i + 1][0]:
            result += pairs[i][1] * pairs[i + 1][1]
            i += 2
        else:
            i += 1
    print("%s,%s" % (key, str(result)))

prev_index = None
value_list = []
for line in sys.stdin:
    curr_index, index, value = line.rstrip().split("\t")
    index, value = map(int, [index, value])
    if curr_index == prev_index:
        value_list.append((index, value))
    else:
        if prev_index:
            emit(prev_index, value_list)
        prev_index = curr_index
        value_list = [(index, value)]
# emit the final key once the input is exhausted
if prev_index:
    emit(prev_index, value_list)
$ chmod +x ~/Desktop/mr/matrix/Mapper.py
$ chmod +x ~/Desktop/mr/matrix/Reducer.py
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
> -input /user/cse/matrices/ \
> -output /user/cse/mat_output \
> -mapper ~/Desktop/mr/matrix/Mapper.py \
> -reducer ~/Desktop/mr/matrix/Reducer.py
OUTPUT
RESULT:
Thus, matrix multiplication with Hadoop Map Reduce was implemented and executed successfully.
EXPT.NO.4
Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm
DATE:
AIM
To develop a MapReduce program to calculate the frequency of a given word in a given file.
Map Function – It takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (Key-Value pairs). Example – (Map function in Word Count)
Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Output (converted into another set of (Key, Value) pairs):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – It takes the set of tuples produced by the Map function as input and combines
the tuples with the same key (here, the same word regardless of case) into a smaller set of tuples:
(BUS,7), (CAR,7), (TRAIN,4)
Step 1. Verify that Hadoop and Java are installed:
hadoop version
javac -version
Step 2. Create a directory on the Desktop named Lab and inside it create two folders; one
called "Input" and the other called "tutorial_classes". [You can do this step using the GUI
normally or through terminal commands]
cd Desktop
mkdir Lab
mkdir Lab/Input
mkdir Lab/tutorial_classes
Step 3. Add the file attached with this document "WordCount.java" in the directory Lab.
Step 4. Add the file attached with this document "input.txt" in the directory Lab/Input.
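Steps 5 and 6 are not reproduced in this copy; based on the paths used in Steps 7 and 9, they typically start the Hadoop daemons and copy the input into HDFS, for example:
start-dfs.sh
start-yarn.sh
hadoop fs -mkdir -p /WordCountTutorial/Input
hadoop fs -put Lab/Input/input.txt /WordCountTutorial/Input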
Step 7. Go to localhost:9870 from the browser, open "Utilities → Browse File System", and
you should see the directories and files we placed in the file system.
Step 8. Then, back on the local machine, compile the WordCount.java file, assuming we are
currently in the Desktop directory.
cd Lab
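If the HADOOP_CLASSPATH variable is not already set in your environment (the compile command below assumes it is), it can be derived from the hadoop command itself:
export HADOOP_CLASSPATH=$(hadoop classpath)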
javac -classpath $HADOOP_CLASSPATH -d tutorial_classes WordCount.java
Put the output files in one jar file (there is a dot at the end):
jar -cvf WordCount.jar -C tutorial_classes .
Step 9. Now, we run the jar file on Hadoop.
hadoop jar WordCount.jar PackageDemo.WordCount /WordCountTutorial/Input /WordCountTutorial/Output
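After the job finishes, the word counts can be viewed (an assumed final step, not listed above) with:
hadoop fs -cat /WordCountTutorial/Output/*
The complete WordCount.java used in the steps above is listed below.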
package PackageDemo;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Driver: configures the job and wires the mapper and reducer together
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }
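    // The driver above references MapForWordCount and ReduceForWordCount, which are not
    // reproduced in this copy of the manual. A minimal sketch of the two inner classes,
    // assuming the comma-separated input shown in the example above, is:

    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
            // split the line into words and emit (WORD, 1) for each word
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
            // sum all the counts emitted for this word
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}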
OUTPUT
RESULT:
Thus the Word Count Map Reduce program to understand Map Reduce Paradigm
was successfully executed.
EXPT.NO.5(a)
Installation of Hive
DATE:
AIM
To install Hive along with practice examples.
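Steps 1 and 2 are not listed in this copy; they typically download and extract Apache Hive and add it to ~/.bashrc, for example (the version and directory below are assumptions, adjust them to your download):
export HIVE_HOME=/home/cse/apache-hive-3.1.3-bin
export PATH=$PATH:$HIVE_HOME/bin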
step 3:
source ~/.bashrc
step 4:
Edit hive-config.sh file
====================================
sudo nano $HIVE_HOME/bin/hive-config.sh
export HADOOP_HOME=/home/cse/hadoop-3.3.6
step 5:
Create Hive directories in HDFS
===================================
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
step 6:
=================
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
Access the hive-site.xml file using the nano text editor:
sudo nano hive-site.xml
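After saving hive-site.xml, the metastore is typically initialized once before the first launch, and Hive is then started from the shell (this step is not shown in the manual):
$HIVE_HOME/bin/schematool -dbType derby -initSchema
hive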
OUTPUT
RESULT:
Thus, Hive was successfully installed and executed.
EXPT.NO.5(b)
HIVE WITH EXAMPLES
DATE:
AIM
To practice Hive commands with examples.
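EXAMPLE
A minimal practice sequence in the Hive shell (the database, table, and file names below are placeholders):
hive> create database student_db;
hive> use student_db;
hive> create table student(id int, name string, dept string) row format delimited fields terminated by ',';
hive> load data local inpath '/home/cse/student.csv' into table student;
hive> select dept, count(*) from student group by dept;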
OUTPUT
RESULT:
Thus, Hive examples were successfully executed.
EXPT.NO.6(a)
Installation of HBase, Installing thrift along with Practice examples
DATE:
AIM
To install HBase on Ubuntu 18.04 in standalone mode.
PROCEDURE
Pre-requisite
step-1: Make sure that Java is installed on your machine; to verify, run java -version.
If any error occurs while executing this command, Java is not installed on your system.
Step-2:Download Hbase
wget https://dlcdn.apache.org/hbase/2.5.5/hbase-2.5.5-bin.tar.gz
Step-3: Extract the hbase-2.5.5-bin.tar.gz file by using the command:
tar xvf hbase-2.5.5-bin.tar.gz
Step-4: Set JAVA_HOME (typically in hbase-2.5.5/conf/hbase-env.sh):
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
Then open the .bashrc file and mention the HBASE_HOME path as shown:
HBASE_HOME=/home/<your_machine_name>/hbase-2.5.5
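HBase can then be started and verified from its shell (assuming HBASE_HOME is exported as set above):
$HBASE_HOME/bin/start-hbase.sh
$HBASE_HOME/bin/hbase shell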
OUTPUT
RESULT:
Thus, HBase was successfully installed on Ubuntu 18.04 in standalone mode.
EXPT.NO.6(b)
HBase, Installing thrift along with Practice examples
DATE:
AIM
To practice HBase shell commands with examples.
EXAMPLE
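The examples below use a table named 'aamec' with two column families, dept and year; creating it first in the HBase shell (an assumed prerequisite step) would look like:
create 'aamec','dept','year'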
3) Insert data
syntax:
put 'table_name','row_key','column_family:attribute','value'
code:
This will enter data into the dept column family:
put 'aamec','cse','dept:studentname','prasanna'
put 'aamec','cse','dept:year','third'
put 'aamec','cse','dept:section','A'
This will enter data into the year column family:
put 'aamec','cse','year:joinedyear','2021'
put 'aamec','cse','year:finishingyear','2025'
4) Scan table
syntax:
scan 'table_name'
code:
scan 'aamec'
5) Get a row
syntax:
get 'table_name','row_key',[optional column_family:attribute]
code:
get 'aamec','cse'
6) Update a value
Re-running put on an existing cell overwrites it; for example:
put 'aamec','cse','dept:section','B'
Previously the value for section under row 'cse' was A; after running this command the value is changed to B.
8) Delete table
First we need to disable the table before dropping it.
To disable:
syntax:
disable 'table_name'
code:
disable 'aamec'
To drop:
syntax:
drop 'table_name'
code:
drop 'aamec'
RESULT:
HBase was successfully installed with an example on Ubuntu 18.04.
EXPT.NO.7
Practice importing and exporting data from various databases.
DATE:
AIM
To practice importing and exporting data between MySQL and Hive databases using Sqoop.
Pre-requisite
sudo apt install mysql-server (use this command to install the MySQL server)
COMMANDS:
~$ sudo su
After this, enter your Linux user password; the root shell will then open, and here we do not need
any separate authentication for MySQL.
~root$ mysql
Creating user profiles and grant them permissions:
Note: This step is not required if you just use the root user to perform CRUD operations in MySQL.
mysql> CREATE USER 'bigdata'@'127.0.0.1' IDENTIFIED BY 'bigdata';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'bigdata'@'127.0.0.1';
Note: Here, *.* means that the user we create has all the privileges on all the tables of all the
databases.
Now, we have created a user profile which will be used to perform CRUD operations in MySQL.
Step 3: Create a database and table and insert data.
Example:
create database Employe;
create table Employe.Emp(author_name varchar (65), total_no_of_articles int, phone_no int,
address varchar (65));
Step 4: Create a database and table in Hive where the data should be imported.
create table geeks_hive_table(name string, total_articles int, phone_no int, address string) row format
delimited fields terminated by ',';
Next, move the extracted Sqoop directory to /usr/lib, which requires super-user privileges:
$ mv sqoop-1.4.4.bin__hadoop-2.0.4-alpha /usr/lib/sqoop
Then exit : $ exit
Go to .bashrc: $ sudo nano ~/.bashrc, and then add the following:
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
$ source ~/.bashrc
Then configure Sqoop: go to the conf directory under SQOOP_HOME and rename the template file
to the environment file.
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh
Then open the sqoop-env.sh file and add the following:
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
Note: Here we add the path of the Hadoop libraries and files, and it may differ from the path
mentioned here. So, add the Hadoop path based on your installation.
Next, extract the MySQL connector archive and place its jar in the lib folder of Sqoop:
$ tar -zxf mysql-connector-java-5.1.30.tar.gz
$ su
$ cd mysql-connector-java-5.1.30
$ mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib
Note: This library file is very important; do not skip this step, because it contains the JDBC driver
needed to connect to MySQL databases.
Verify sqoop: sqoop-version
Step 3: Hive database creation
hive> create database sqoop_example;
hive> use sqoop_example;
hive> create table sqoop(usr_name string, no_ops int, ops_names string);
Hive commands are much like MySQL commands. Here, we just create the structure to store
the data which we want to import into Hive.
--table table_name_in_mysql \
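The line above is one option of a full Sqoop command; a typical complete import, and a corresponding export back to MySQL, using the user, database, and tables created earlier might look like the following (the exact option values depend on your setup):
$ sqoop import --connect jdbc:mysql://127.0.0.1/Employe \
--username bigdata --password bigdata \
--table Emp \
--hive-import --hive-table sqoop_example.sqoop -m 1
$ sqoop export --connect jdbc:mysql://127.0.0.1/Employe \
--username bigdata --password bigdata \
--table Emp \
--export-dir /user/hive/warehouse/sqoop_example.db/sqoop \
--input-fields-terminated-by ',' -m 1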
OUTPUT
RESULT:
Thus, importing and exporting data between MySQL and Hive was completed successfully.
EXPT.NO.8
CLUSTERING USING SPARK
DATE:
AIM
To implement clustering using SPARK.
ALGORITHM
1. First, we randomly initialize k points, called means or cluster centroids.
2. We categorize each item to its closest mean, and we update the mean’s coordinates, which are
the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our clusters.
PROGRAM
Step 1: Starting the PySpark server
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cluster').getOrCreate()
print('Spark Version: {}'.format(spark.version))
OUTPUT:
Spark Version: 3.3.1
Step 2: Load the dataset
PROGRAM
#Loading the data
dataset = spark.read.csv("seeds_dataset.csv", header=True, inferSchema=True)
#show the data in the above file using the below command
dataset.show(5)
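The listing stops after loading the data; a minimal sketch of the remaining clustering step with k-means (assuming all columns of the seeds dataset are numeric features, and choosing k=3 arbitrarily) could be:
Step 3: Assemble features and fit a k-means model
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# combine the numeric columns into a single feature vector
assembler = VectorAssembler(inputCols=dataset.columns, outputCol='features')
final_data = assembler.transform(dataset)

# fit k-means with k=3 clusters and show the cluster assigned to each row
kmeans = KMeans(featuresCol='features', k=3)
model = kmeans.fit(final_data)
model.transform(final_data).select('prediction').show(5)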
OUTPUT
RESULT:
Thus, the cluster has been created using SPARK.
EXPT.NO.9
VISUALIZE DATA USING BASIC PLOTTING TECHNIQUES IN PYTHON
DATE:
AIM
To visualize data using basic plotting techniques in Python.
STEPS TO BE FOLLOWED:
STEP1:
Importing Datasets
In this experiment, we will use two freely available datasets: the Iris and Wine Reviews datasets,
both of which can be loaded into memory using the pandas read_csv method.
import pandas as pd
iris = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
print(iris.head())
Matplotlib
Matplotlib is the most popular Python plotting library. It is a low-level library with a Matlab-like
interface that offers lots of freedom at the cost of having to write more code.
Matplotlib can be installed using either pip or conda.
Scatter Plot
To create a scatter plot in Matplotlib, we can use the scatter method. We will also create a figure and
an axis using plt.subplots to give our plot a title and labels.
import matplotlib.pyplot as plt
# create a figure and axis
fig, ax = plt.subplots()
# scatter the sepal_length against the sepal_width
ax.scatter(iris['sepal_length'], iris['sepal_width'])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
We can give the graph more meaning by coloring each data point by its class. This can be done by
creating a dictionary that maps from class to color and then scattering each point on its own using a
for-loop and passing the respective color.
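A sketch of this approach (the class labels assume the standard UCI spellings in the 'class' column):
# map each iris class to a color (assumed label spellings)
colors = {'Iris-setosa': 'r', 'Iris-versicolor': 'g', 'Iris-virginica': 'b'}
fig, ax = plt.subplots()
# plot each data point with the color of its class
for i in range(len(iris['sepal_length'])):
    ax.scatter(iris['sepal_length'][i], iris['sepal_width'][i], color=colors[iris['class'][i]])
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')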
Line Chart
In Matplotlib, we can create a line chart by calling the plot method. We can also plot multiple columns
in one graph by looping through the columns we want and plotting each column on the same axis.
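For example, plotting all four numeric iris columns on one axis (a sketch using the iris DataFrame loaded earlier):
# get the numeric columns to plot
columns = iris.columns.drop('class')
# create x data (one point per row)
x_data = range(0, iris.shape[0])
fig, ax = plt.subplots()
# plot each column as its own line
for column in columns:
    ax.plot(x_data, iris[column], label=column)
ax.set_title('Iris Dataset')
ax.legend()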
Histogram
In Matplotlib, we can create a Histogram using the hist method. If we pass categorical data like the
points column from the wine-review dataset, it will automatically calculate how often each class
occurs.
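A sketch, assuming the wine reviews CSV has been downloaded and loaded into a DataFrame named wine_reviews (the file name below is a placeholder for the downloaded file):
# load the wine reviews dataset used here and in the bar chart below
wine_reviews = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)
fig, ax = plt.subplots()
# plot a histogram of the points column
ax.hist(wine_reviews['points'])
ax.set_title('Wine Review Scores')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')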
Bar Chart
A bar chart can be created using the bar method. The bar chart doesn't automatically calculate the
frequency of a category, so we will use the pandas value_counts method to do this. The bar chart is useful
for categorical data that doesn't have a lot of different categories (fewer than 30), because otherwise it can
get quite messy.
# create a figure and axis
fig, ax = plt.subplots()
# count the occurrence of each class
data = wine_reviews['points'].value_counts()
# get x and y data
points = data.index
frequency = data.values
# create bar chart
ax.bar(points, frequency)
# set title and labels
ax.set_title('Wine Review Scores')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')
Pandas Visualization
Pandas is an open-source, high-performance, and easy-to-use library providing data structures, such as
data frames, and data analysis tools like the visualization tools we will use in this experiment.
Pandas Visualization makes it easy to create plots out of a pandas dataframe and series. It also has a
higher-level API than Matplotlib, and therefore we need less code for the same results.
Pandas can be installed using either pip or conda.
pip install pandas or
conda install pandas
Scatter Plot
To create a scatter plot in Pandas, we can call <dataset>.plot.scatter() and pass it two arguments, the
name of the x-column and the name of the y-column. Optionally we can also give it a title.
iris.plot.scatter(x='sepal_length', y='sepal_width', title='Iris Dataset')
OUTPUT
RESULT:
Thus, visualizing data using basic plotting techniques in Python has been executed
successfully.