Lab Manual - Student Copy - Index & Experiments CCS334 - BDA

CCS334 – BIG DATA ANALYTICS


(R2021)

SYLLABUS

COURSE OBJECTIVES:
The student should be made to:
• Understand big data.
• Learn and use NoSQL big data management.
• Learn MapReduce analytics using Hadoop and related tools.
• Work with MapReduce applications.
• Understand the usage of Hadoop-related tools for Big Data Analytics.

LIST OF EXPERIMENTS
1. Downloading and installing Hadoop; Understanding different Hadoop modes, Startup scripts,
Configuration files
2. Hadoop Implementation of file management tasks, such as Adding files and directories,
retrieving files and Deleting files
3. Implementation of Matrix Multiplication with Hadoop Map Reduce
4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm
5. Installation of Hive along with practice examples
6. Installation of HBase, Installing thrift along with Practice examples
7. Practice importing and exporting data from various databases.

COURSE OUTCOMES:
At the end of the course, the students will be able to:

CO1: Describe big data and use cases from selected business domains.
CO2: Explain NoSQL big data management.
CO3: Install, configure, and run Hadoop and HDFS.
CO4: Perform map-reduce analytics using Hadoop.
CO5: Use Hadoop-related tools such as HBase, Cassandra, Pig, and Hive for big data analytics.


INDEX

1. Downloading and installing Hadoop; Understanding different Hadoop modes, Startup scripts, Configuration files
2. Hadoop Implementation of file management tasks, such as Adding files and directories, retrieving files and Deleting files
3. Implementation of Matrix Multiplication with Hadoop Map Reduce
4. Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm
5. Installation of Hive along with practice examples
6. Installation of HBase, Installing thrift along with Practice examples
7. Practice importing and exporting data from various databases

CONTENT BEYOND SYLLABUS

8. Implement Clustering Techniques using SPARK
9. Visualize Data using Basic Plotting Techniques in Python


INTRODUCTION TO BIG DATA ANALYTICS

What is Big Data Analytics?

Big data analytics uses advanced analytical methods that can extract important business
insights from bulk datasets. Within these datasets lies both structured (organized) and unstructured
(unorganized) data. Its applications cover different industries such as healthcare, education,
insurance, AI, retail, and manufacturing.

How does big data analytics work?


Big Data Analytics is a powerful approach that helps organizations unlock the potential of large and complex datasets.
Data Collection:
Data is the core of Big Data Analytics. Data collection is the gathering of data from different sources such as customer comments, surveys, sensors, social media, and so on. The primary aim of data collection is to compile as much accurate data as possible. The more data, the more insights.
Data Cleaning (Data Preprocessing):
The next step is to process this information. It often requires some cleaning. This entails the
replacement of missing data, the correction of inaccuracies, and the removal of duplicates. It is like
sifting through a treasure trove, separating the rocks and debris and leaving only the valuable gems
behind.
Data Processing:
After cleaning, the data is processed. This stage involves organizing, structuring, and formatting the data so that it is usable for analysis, much like a chef gathering and preparing ingredients before cooking. Data processing turns the data into a format that analytics tools can work with.
Data Analysis:
Data analysis is being done by means of statistical, mathematical, and machine learning
methods to get out the most important findings from the processed data. For example, it can uncover
customer preferences, market trends, or patterns in healthcare data.


Data Visualization:
The results of data analysis are usually presented in visual form, for example charts, graphs, and interactive dashboards. Visualizations simplify large amounts of data and allow decision makers to quickly spot patterns and trends.
Data Storage and Management:
Storing and managing the analyzed data properly is of utmost importance, much like keeping a well-organized digital scrapbook: you may want to return to these insights later, so how you store them matters. Moreover, data protection and adherence to regulations are key issues to be addressed at this stage.
Continuous Learning and Improvement:
Big data analytics is a continuous process of collecting, cleaning, and analyzing data to
uncover hidden insights. It helps businesses make better decisions and gain a competitive edge.
Types of Big Data Analytics
Big Data Analytics comes in many different types, each serving a different purpose:
Descriptive Analytics:
This type helps us understand past events. In social media, it shows performance metrics, like
the number of likes on a post.
Diagnostic Analytics:
Diagnostic analytics delves deeper to uncover the reasons behind past events. In healthcare, it identifies the causes of high patient re-admissions.
Predictive Analytics:
Predictive analytics forecasts future events based on past data. Weather forecasting, for
example, predicts tomorrow's weather by analyzing historical patterns.
Prescriptive Analytics:
However, this category not only predicts results but also offers recommendations for action to achieve
the best results. In e-commerce, it may suggest the best price for a product to achieve the highest
possible profit.
Real-time Analytics:
The key function of real-time analytics is data processing in real time. It swiftly allows traders
to make decisions based on real-time market events.
Spatial Analytics:
Spatial analytics deals with location data. In urban management, it uses data from sensors and cameras to optimize traffic flow and minimize traffic jams.
Text Analytics:
Text analytics delves into the unstructured data of text. In the hotel business, it can use the
guest reviews to enhance services and guest satisfaction.


Big Data Analytics Technologies and Tools


Big Data Analytics relies on various technologies and tools that might sound complex
Hadoop:
Imagine Hadoop as an enormous digital warehouse. It's used by companies like Amazon to
store tons of data efficiently. For instance, when Amazon suggests products you might like, it's
because Hadoop helps manage your shopping history.
Spark:
Think of Spark as the super-fast data chef. Netflix uses it to quickly analyze what you watch
and recommend your next binge-worthy show.
NoSQL Databases:
NoSQL databases, like MongoDB, are like digital filing cabinets that Airbnb uses to store your
booking details and user data. These databases are popular because they are quick and flexible, so the platform can provide you with the right information when you need it.
Tableau:
Tableau is like an artist that turns data into beautiful pictures. The World Bank uses it to create
interactive charts and graphs that help people understand complex economic data.
Python and R:
Python and R are like magic tools for data scientists. They use these languages to solve tricky
problems. For example, Kaggle uses them to predict things like house prices based on past data.
Machine Learning Frameworks (e.g., TensorFlow):
Machine learning frameworks are the tools that make predictions. Airbnb uses TensorFlow to predict which properties are most likely to be booked in certain areas, helping hosts make smart decisions about pricing and availability.
These tools and technologies are the building blocks of Big Data Analytics; they help organizations gather, process, understand, and visualize data, making it easier for them to make decisions based on information.
Benefits of Big Data Analytics
Big Data Analytics offers a host of real-world advantages, and let's understand with examples:
Informed Decisions:
Imagine a store like Walmart. Big Data Analytics helps them make smart choices about what
products to stock. This not only reduces waste but also keeps customers happy and profits high.
Enhanced Customer Experiences:
Think about Amazon. Big Data Analytics is what makes those product suggestions so accurate.
It's like having a personal shopper who knows your taste and helps you find what you want.


Fraud Detection:
Credit card companies, like MasterCard, use Big Data Analytics to catch and stop fraudulent
transactions. It's like having a guardian that watches over your money and keeps it safe.
Optimized Logistics:
FedEx, for example, uses Big Data Analytics to deliver your packages faster and with less
impact on the environment. It's like taking the fastest route to your destination while also being kind to
the planet.
Challenges of Big data analytics
While Big Data Analytics offers incredible benefits, it also comes with its set of challenges:
Data Overload:
Consider Twitter, where approximately 6,000 tweets are posted every second. The challenge is
sifting through this avalanche of data to find valuable insights.
Data Quality:
If the input data is inaccurate or incomplete, the insights generated by Big Data Analytics can
be flawed. For example, incorrect sensor readings could lead to wrong conclusions in weather
forecasting.
Privacy Concerns:
With the vast amount of personal data used, like in Facebook's ad targeting, there's a fine line
between providing personalized experiences and infringing on privacy.
Security Risks:
With cyber threats increasing, safeguarding sensitive data becomes crucial. For instance,
banks use Big Data Analytics to detect fraudulent activities, but they must also protect this information
from breaches.
Costs:
Implementing and maintaining Big Data Analytics systems can be expensive. Airlines like
Delta use analytics to optimize flight schedules, but they need to ensure that the benefits outweigh the
costs.
Usage of Big Data Analytics
Big Data Analytics has a significant impact in various sectors:
Healthcare: It aids in precise diagnoses and disease prediction, elevating patient care.
Retail: Amazon's use of Big Data Analytics offers personalized product recommendations based on
your shopping history, creating a more tailored and enjoyable shopping experience.
Finance: Credit card companies such as Visa rely on Big Data Analytics to swiftly identify and
prevent fraudulent transactions, ensuring the safety of your financial assets.
Transportation: Companies like Uber use Big Data Analytics to optimize drivers' routes and predict
demand, reducing wait times and improving overall transportation experiences.


Big data analytics tools

Harnessing all of that data requires tools. Thankfully, technology has advanced so that many
intuitive software systems are available for data analysts to use.
Hadoop:
An open-source framework that stores and processes big data sets. Hadoop can handle and
analyse structured and unstructured data.
Spark:
An open-source cluster computing framework for real-time processing and data analysis.
Data integration software:
Programs that allow big data to be streamlined across different platforms, such as MongoDB, Apache Hadoop, and Amazon EMR.
Stream analytics tools:
Systems that filter, aggregate, and analyse data that might be stored in different platforms and
formats, such as Kafka.
Distributed storage:
Databases that can split data across multiple servers and can identify lost or corrupt data, such
as Cassandra.


EXPT.NO.1(a)
Downloading and installing Hadoop; Understanding different Hadoop modes, Startup scripts, Configuration files.
DATE:

AIM

To download and install Hadoop, and to understand the different Hadoop modes, startup scripts, and configuration files.

PROCEDURE

Prerequisites to Install Hadoop on Ubuntu


Hardware requirement- The machine must have 4GB RAM and minimum 60 GB hard disk
for better performance.
Check Java version - It is recommended to install Oracle Java 8.
The user can check the version of Java with the command below.
$ java -version
STEP 1: Setup passwordless ssh
a) Install Open SSH Server and Open SSH Client
We will now set up passwordless ssh with the following command.
1. $ sudo apt-get install openssh-server openssh-client

b) Generate Public & Private Key Pairs


2. ssh-keygen -t rsa -P ""

c) Configure password-less SSH


3. cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

d) Now verify the working of password-less ssh


$ ssh localhost

e) Now install rsync with command


$ sudo apt-get install rsync

STEP 2: Configure and Setup Hadoop


Downloading Hadoop

$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz

Once you have downloaded the file, extract it to a folder and rename it:

$ tar -xvzf hadoop-3.3.6.tar.gz
$ mv hadoop-3.3.6 hadoop

OUTPUT

RESULT
Thus, Hadoop was downloaded and installed, and the different Hadoop modes were understood successfully.


EXPT.NO.1(b)
Downloading and installing Hadoop; Startup scripts, Configuration files.
DATE:

AIM

To download and install Hadoop, and to understand the different Hadoop modes, startup scripts, and configuration files.

STEP 1:Setup Configuration


a) Setting Up the environment variables
Edit .bashrc - Edit the .bashrc file and add Hadoop to the path:
$ nano ~/.bashrc
export HADOOP_HOME=/home/cse/hadoop-3.3.6
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Source .bashrc in current login session in terminal


$source ~/.bashrc
b) Hadoop configuration file changes
Edit hadoop-env.sh


Edit the hadoop-env.sh file, which is in etc/hadoop inside the Hadoop installation directory.
$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
The user can set JAVA_HOME:
export JAVA_HOME=<root directory of Java installation> (e.g. /usr/lib/jvm/jdk1.8.0_151/)

Edit core-site.xml
$sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/cse/hdata</value>
</property>
</configuration>


Edit hdfs-site.xml
$sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

#Add the below lines in this file (between "<configuration>" and "</configuration>")


<property>
<name>dfs.namenode.name.dir</name>
<value>/home/cse/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/cse/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>


Edit mapred-site.xml
$sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
#Add the below lines in this file (between "<configuration>" and "</configuration>")
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>


Edit yarn-site.xml
$sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

#Add the below lines in this file (between "<configuration>" and "</configuration>")

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>


Step 2: Start the cluster


We will now start the single node cluster with the following commands.
a) Format the namenode
$ hdfs namenode -format

b) Start the HDFS


$start-all.sh
c) Verify if all process started
$ jps
6775 DataNode
7209 ResourceManager
7017 SecondaryNameNode


6651 NameNode
7339 NodeManager
7663 Jps

d) Web interface - For viewing the Web UI of the NameNode, visit http://localhost:9870
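As a quick sanity check that HDFS is up and accepting commands, a couple of illustrative commands can also be run from the terminal (the directory name used here is only an example, not part of the required steps):

$ hdfs dfs -mkdir -p /user/cse
$ hdfs dfs -ls /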


OUTPUT

RESULT
Thus, Hadoop was downloaded and installed, and the different Hadoop modes, startup scripts, and configuration files were set up successfully.


EXPT.NO.2
Hadoop Implementation of file management tasks, such as Adding files and directories, retrieving files and Deleting files
DATE:

AIM
To implement the following file management tasks in Hadoop:
1. Adding files and directories
2. Retrieving files
3. Deleting Files

DESCRIPTION: -
HDFS is a scalable distributed filesystem designed to scale to petabytes of data while
running on top of the underlying filesystem of the operating system. HDFS keeps track of
where the data resides in a network by associating the name of its rack (or network switch)
with the dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain
data, or which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of
command line utilities that work similarly to the Linux file commands, and serve as your
primary interface with HDFS.
We're going to have a look into HDFS by interacting with it from the command line.
We will take a look at the most common file management tasks in Hadoop, which include:
1. Adding files and directories to HDFS
2. Retrieving files from HDFS to local filesystem
3. Deleting files from HDFS

SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM


HDFS

Step 1:Starting HDFS


Initially you have to format the configured HDFS file system, open namenode (HDFS
server), and execute the following command.
$ hadoop namenode -format


After formatting the HDFS, start the distributed file system. The following command will
start the namenode as well as the data nodes as cluster. $ start-dfs.sh

Listing Files in HDFS

After loading the information into the server, we can find the list of files in a directory and the status of a file using ls. Given below is the syntax of ls; you can pass a directory or a filename as an argument.

$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS


Assume we have data in a file called file.txt in the local system which ought to be saved
in the hdfs file system. Follow the steps given below to insert the required file in the Hadoop
file system.

Step-2: Adding Files and Directories to HDFS

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input

Transfer and store a data file from local systems to the Hadoop file system using the put
command.

$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input


Step 3 :You can verify the file using ls command.

$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

Step 4 Retrieving Data from HDFS


Assume we have a file in HDFS called outfile. Given below is a simple demonstration for
retrieving the required file from the Hadoop file system.


Initially, view the data from HDFS using cat command.


$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Get the file from HDFS to the local file system using get command.

$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/


Step-5: Deleting Files from HDFS
$ hadoop fs -rm file.txt
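
Note that -rm removes a single file. If a whole directory created earlier (for example /user/input) needs to be removed, the recursive flag can be used; this is an illustrative extra command, not a required step:

$ $HADOOP_HOME/bin/hadoop fs -rm -r /user/input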

Step 6:Shutting Down the HDFS


You can shut down the HDFS by using the following command.

$ stop-dfs.sh

RESULT:
Thus, the file management tasks in Hadoop (adding files and directories, retrieving files, and deleting files) were completed successfully.

EXPT.NO.3(a)
Implementation of Matrix Multiplication with Hadoop Map Reduce
DATE:


AIM
To Develop a MapReduce program to implement Matrix Multiplication.
DESCRIPTION

In mathematics, matrix multiplication or the matrix product is a binary operation that produces a matrix from two matrices. The definition is motivated by linear equations and linear transformations on vectors, which have numerous applications in applied mathematics, physics, and engineering. In more detail, if A is an n × m matrix and B is an m × p matrix, their matrix product AB is an n × p matrix, in which the m entries across a row of A are multiplied with the m entries down a column of B and summed to produce an entry of AB. When two linear transformations are represented by matrices, then the matrix product represents the composition of the two transformations.

Algorithm for Map Function.

a. For each element m_ij of M, produce the (key, value) pair ((i,k), (M, j, m_ij)) for k = 1, 2, 3, ... up to the number of columns of N.
b. For each element n_jk of N, produce the (key, value) pair ((i,k), (N, j, n_jk)) for i = 1, 2, 3, ... up to the number of rows of M.
c. Return the set of (key, value) pairs, where each key (i,k) has a list of values (M, j, m_ij) and (N, j, n_jk) for all possible values of j.

Algorithm for Reduce Function.


e. For each key (i,k):
f. Sort the values beginning with M by j into listM, and sort the values beginning with N by j into listN. Multiply m_ij and n_jk for the j-th value of each list.
g. Sum up the products and return ((i,k), Σ_j m_ij × n_jk).

Step 1. Create a directory for the matrix files.
Then open matrix1.txt and matrix2.txt and put the matrix values into those text files.
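
The mapper below expects each line as comma-separated values of the form matrix_index,row,col,value, plus a cache.txt holding the number of rows of A and the number of columns of B. As an illustration only (the actual values depend on your matrices), a 2 × 2 example could look like:

matrix1.txt
A,0,0,1
A,0,1,2
A,1,0,3
A,1,1,4

matrix2.txt
B,0,0,5
B,0,1,6
B,1,0,7
B,1,1,8

cache.txt
2,2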

Step 2. Creating Mapper file for Matrix Multiplication.


#!/usr/bin/env python
import sys

# cache.txt holds the number of rows of A and the number of columns of B
cache_info = open("cache.txt").readlines()[0].split(",")
row_a, col_b = map(int, cache_info)

# each input line is: matrix_index,row,col,value
for line in sys.stdin:
    matrix_index, row, col, value = line.rstrip().split(",")
    if matrix_index == "A":
        # an element of A contributes to every column k of the result
        for i in xrange(0, col_b):
            key = row + "," + str(i)
            print "%s\t%s\t%s" % (key, col, value)
    else:
        # an element of B contributes to every row j of the result
        for j in xrange(0, row_a):
            key = str(j) + "," + col
            print "%s\t%s\t%s" % (key, row, value)

Step 3. Creating reducer file for Matrix Multiplication.


#!/usr/bin/env python
import sys
from operator import itemgetter

prev_index = None
value_list = []

for line in sys.stdin:
    curr_index, index, value = line.rstrip().split("\t")
    index, value = map(int, [index, value])
    if curr_index == prev_index:
        value_list.append((index, value))
    else:
        if prev_index:
            # sort by j so that matching A and B entries sit next to each other
            value_list = sorted(value_list, key=itemgetter(0))
            i = 0
            result = 0
            while i < len(value_list) - 1:
                if value_list[i][0] == value_list[i + 1][0]:
                    result += value_list[i][1] * value_list[i + 1][1]
                    i += 2
                else:
                    i += 1
            print "%s,%s" % (prev_index, str(result))
        prev_index = curr_index
        value_list = [(index, value)]

# emit the result for the last key
if curr_index == prev_index:
    value_list = sorted(value_list, key=itemgetter(0))
    i = 0
    result = 0
    while i < len(value_list) - 1:
        if value_list[i][0] == value_list[i + 1][0]:
            result += value_list[i][1] * value_list[i + 1][1]
            i += 2
        else:
            i += 1
    print "%s,%s" % (prev_index, str(result))

Step 4. To test the mapper locally, pipe the input files through it with the cat command:

$ cat *.txt | python mapper.py


$ chmod +x ~/Desktop/mr/matrix/Mapper.py
$ chmod +x ~/Desktop/mr/matrix/Reducer.py
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
> -input /user/cse/matrices/ \
> -output /user/cse/mat_output \
> -mapper ~/Desktop/mr/matrix/Mapper.py \
> -reducer ~/Desktop/mr/matrix/Reducer.py


OUTPUT

RESULT:

Thus, the MapReduce program to implement Matrix Multiplication was successfully


executed.


EXPT.NO.4
Run a basic Word Count Map Reduce program to understand the Map Reduce Paradigm
DATE:

AIM
To develop a MapReduce program to calculate the frequency of a given word in a given file.

Map Function – It takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs). Example – (Map function in Word Count)

Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN

Output (converted into another set of (key, value) data):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller set of tuples. Example – (Reduce function in Word Count)

Input (set of tuples, the output of the Map function):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Output (converted into a smaller set of tuples):
(BUS,7), (CAR,7), (TRAIN,4)

Workflow of MapReduce consists of 5 steps


1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to group them in the "Reduce" phase, records with the same KEY should be on the same cluster.
4. Reduce – it is essentially a group-by phase.
5. Combining – The last phase, where all the data (the individual result sets from each cluster) is combined together to form a result.

Now Let’s See the Word Count Program in Java

Step1 : Make sure Hadoop and Java are installed properly

hadoop version
javac -version
Step 2. Create a directory on the Desktop named Lab and inside it create two folders: one called "Input" and the other called "tutorial_classes". [You can do this step using the GUI or through terminal commands.]
cd Desktop
mkdir Lab
mkdir Lab/Input
mkdir Lab/tutorial_classes
Step 3. Add the file attached with this document
“WordCount.java” in the directory Lab
Step 4. Add the file attached with this document “input.txt” in
the directory Lab/Input.


Step 5. Type the following command to export the hadoop classpath


into bash.
export HADOOP_CLASSPATH=$(hadoop classpath)
Make sure it is now exported.
echo $HADOOP_CLASSPATH
Step 6. It is time to create these directories on HDFS rather than locally.
Type the following commands.
hadoop fs -mkdir /WordCountTutorial
hadoop fs -mkdir /WordCountTutorial/Input
hadoop fs -put Lab/Input/input.txt /WordCountTutorial/Input

Step 7. Go to localhost:9870 from the browser, open "Utilities → Browse the file system", and you should see the directories and files we placed in the file system.


Step 8. Then, back on the local machine, we will compile the WordCount.java file. Assuming we are currently in the Desktop directory:
cd Lab
javac -classpath $HADOOP_CLASSPATH -d tutorial_classes WordCount.java

Put the output files in one jar file (there is a dot at the end):
jar -cvf WordCount.jar -C tutorial_classes .
Step 9. Now, we run the jar file on Hadoop.
hadoop jar WordCount.jar WordCount /WordCountTutorial/Input
/WordCountTutorial/Output


Step 10. Output the result:

hadoop dfs -cat /WordCountTutorial/Output/*

Program: WordCount.java (the file added in Step 3):

package PackageDemo;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each comma-separated line and emits (WORD, 1) for every token
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums up the counts emitted for each word
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
The output is stored in /r_output/part-00000
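
For the comma-separated sample input shown at the beginning of this experiment (Bus, Car, bus, car, train, ...), the reducer output in the output file would be expected to look roughly like the following, with each word and its count separated by a tab:

BUS	7
CAR	7
TRAIN	4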


OUTPUT

RESULT:
Thus the Word Count Map Reduce program to understand Map Reduce Paradigm
was successfully executed.


EXPT.NO.5(a)
Installation of Hive
DATE:

AIM
To install Hive.

STEPS FOR HIVE INSTALLATION


• Download and unzip Hive
• Edit .bashrc file
• Edit hive-config.sh file
• Create Hive directories in HDFS
• Initiate Derby database
• Configure hive-site.xml file

Step 1: Download and unzip Hive
=============================
wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar xzf apache-hive-3.1.2-bin.tar.gz

Step 2: Edit .bashrc file
========================
sudo nano .bashrc
export HIVE_HOME=/home/hdoop/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin

step 3:
source ~/.bashrc


step 4:
Edit hive-config.sh file
====================================
sudo nano $HIVE_HOME/bin/hive-config.sh
export HADOOP_HOME=/home/cse/hadoop-3.3.6

step 5:
Create Hive directories in HDFS
===================================
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse

step 6:

Fixing guava problem – Additional step


=================
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/

step 7: Configure hive-site.xml File (Optional)


Use the following command to locate the correct
file: cd $HIVE_HOME/conf
List the files contained in the folder using the ls command.

Use the hive-default.xml.template to create the hive-site.xml file:

cp hive-default.xml.template hive-site.xml
Access the hive-site.xml file using the nano text editor:
sudo nano hive-site.xml

Step 8: Initiate Derby Database


============================
$HIVE_HOME/bin/schematool -dbType derby -initSchema


OUTPUT

RESULT:
Thus, Hive was successfully installed.


EXPT.NO.5(b)
HIVE WITH EXAMPLES
DATE:

AIM

To work with Hive using example queries.

CREATE A DATABASE FROM THE HIVE/BEELINE SHELL

1. Create database database_name;
Ex:
> Create database Emp;
> use Emp;
> create table emp.employee(sno int, user String, city String) row format delimited fields terminated by '\n' stored as textfile;
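
A couple of example statements can then be run from the same shell to verify the table; the sample row values here are only illustrative:

> insert into emp.employee values (1, 'arun', 'chennai');
> select * from emp.employee;
> show tables;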


OUTPUT

RESULT:
Thus, Hive was installed and the example queries were executed successfully.


EXPT.NO.6(a)
Installation of HBase, Installing thrift along with Practice examples
DATE:

AIM
To install HBase in standalone mode on Ubuntu 18.04.

PROCEDURE
Pre-requisite

Ubuntu 16.04 or higher installed on a virtual machine.

step-1: Make sure that Java is installed on your machine. To verify this, run: java -version

If any error occurs while executing this command, Java is not installed on your system.

To install Java: sudo apt install openjdk-8-jdk -y

step-2: Download HBase
wget https://dlcdn.apache.org/hbase/2.5.5/hbase-2.5.5-bin.tar.gz

Step-3: Extract The hbase-2.5.5-bin.tar.gz file by using the command tar xvf hbase-2.5.5-
bin.tar.gz


step-4: Go to the hbase-2.5.5/conf folder and open the hbase-env.sh file

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64


step-5: Edit the .bashrc file

Open the .bashrc file and set the HBASE_HOME path as shown below:

export HBASE_HOME=/home/prasanna/hbase-2.5.5

Here you can change the name according to your local machine name, e.g.:

export HBASE_HOME=/home/<your_machine_name>/hbase-2.5.5

export PATH=$PATH:$HBASE_HOME/bin

Note: make sure that the hbase-2.5.5 folder is in the home directory before setting the HBASE_HOME path; if not, then move the hbase-2.5.5 folder to the home directory.

step-6 : Add properties in the hbase-site.xml


Put the below properties between the <configuration> and </configuration> tags:

<property>
<name>hbase.rootdir</name>
<value>file:///home/prasanna/HBASE/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/prasanna/HBASE/zookeeper</value>
</property>

step-7: Go to the /etc/ folder and edit the hosts file. By default, the IP in line 2 is 127.0.1.1; change it to 127.0.0.1 (in the second line only).

step-8: Starting HBase. Go to the hbase-2.5.5/bin folder and start HBase.

After this, run the jps command to ensure that HBase is running.

Visit http://localhost:16010 to see the HBase web UI.


step-9: accessing hbase shell by running ./hbase shell command


OUTPUT

RESULT:

HBase was successfully installed on Ubuntu 18.04.


EXPT.NO.6(b)
HBase, Installing thrift along with Practice examples
DATE:

AIM

To practice HBase shell commands with examples on the standalone HBase installation.

EXAMPLE

1) To create a table
syntax:
create 'Table_Name','col_fam_1','col_fam_2', ... ,'col_fam_n'
code:
create 'aamec','dept','year'

2) List All Tables code:


list

3) Insert data
syntax:
put 'table_name','row_key','column_family:attribute','value'
Here row_key is a unique key used to retrieve the data.
code:


This data will enter data into the dept column family:
put 'aamec','cse','dept:studentname','prasanna'
put 'aamec','cse','dept:year','third'
put 'aamec','cse','dept:section','A'

This data will enter data into the year column family:
put 'aamec','cse','year:joinedyear','2021'
put 'aamec','cse','year:finishingyear','2025'

4) Scan the table (similar to viewing all rows of a table in an RDBMS)
syntax:
scan 'table_name'
code:
scan 'aamec'
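
Given the put commands from step 3, the scan output would be expected to look roughly like the following (timestamps abbreviated):

ROW        COLUMN+CELL
 cse       column=dept:section, timestamp=..., value=A
 cse       column=dept:studentname, timestamp=..., value=prasanna
 cse       column=dept:year, timestamp=..., value=third
 cse       column=year:finishingyear, timestamp=..., value=2025
 cse       column=year:joinedyear, timestamp=..., value=2021
1 row(s)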

5) To get specific data


syntax:
get 'table_name','row_key',[optional column_family:attribute]
code:
get 'aamec','cse'

6) Update a table value

The same put command is used to update a table value. If the row key is already present in the database, it will update the data with the given value; if not, it will create a new row with the given row key.

put 'aamec','cse','dept:section','B'

Previously the value for dept:section in row cse was A; after running this command the value is changed to B.
7) To delete data
syntax:
delete 'table_name','row_key','column_family:attribute'
code:
delete 'aamec','cse','year:joinedyear'

8) Delete a table

First we need to disable the table before dropping it.

To disable:
syntax:
disable 'table_name'
code:
disable 'aamec'

To drop:
drop 'aamec'

RESULT:
HBase was successfully installed with an example on Ubuntu 18.04.
EXPT.NO.7
Practice importing and exporting data from various databases.
DATE:

AIM
To practice importing and exporting data between MySQL and Hive using Sqoop.
Pre-requisite

Hadoop and Java


MySQL
Hive
SQOOP

Step 1: Start HDFS (e.g. with start-all.sh, as in Experiment 1).

Step 2: MySQL Installation

sudo apt install mysql-server ( use this command to install MySQL server)

COMMANDS:
~$ sudo su

After this, enter your Linux user password; the root shell will open. Here we don't need any separate authentication for MySQL.
~root$ mysql
Creating user profiles and grant them permissions:


Mysql> CREATE USER 'bigdata'@'localhost' IDENTIFIED BY 'bigdata';
Mysql> grant all privileges on *.* to bigdata@localhost;

Note: This step is not required if you just use the root user to make CRUD operations in MySQL.

Mysql> CREATE USER 'bigdata'@'127.0.0.1' IDENTIFIED BY 'bigdata';
Mysql> grant all privileges on *.* to 'bigdata'@'127.0.0.1';

Note: Here, *.* means that the user we create has all the privileges on all the tables of all the
databases.
Now, we have created user profiles which will be used to make CRUD operations in the mysql
Step 3: Create a database and table and insert data.
Example:
create database Employe;
create table Employe.Emp(author_name varchar(65), total_no_of_articles int, phone_no int, address varchar(65));

insert into Employe.Emp values ("Rohan", 10, 123456789, "Lucknow");

Step 3: Create a database and table in Hive where the data should be imported.
create table geeks_hive_table(name string, total_articles int, phone_no int, address string) row format delimited fields terminated by ',';

Step 4: SQOOP INSTALLATION:


After downloading the sqoop , go to the directory where we downloaded the sqoop and then
extract it using the following command :

$ tar -xvf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz

Then enter into the super user : $ su

Next to move that to the usr/lib which requires a super user privilege
$ mv sqoop-1.4.4.bin__hadoop-2.0.4-alpha /usr/lib/sqoop
Then exit : $ exit


Goto .bashrc: $ sudo nano .bashrc , and then add the following

export SQOOP_HOME=/usr/lib/sqoop

export PATH=$PATH:$SQOOP_HOME/bin

$ source ~/.bashrc


Then configure Sqoop: go to the conf folder of SQOOP_HOME and rename the template file to the environment file.
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh
Then open the sqoop-env.sh file and add the following:
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
Note: Here we add the path of the Hadoop libraries and files; it may be different from the path mentioned here, so add the Hadoop path based on your installation.

Step 5: Download and Configure mysql-connector-java :


We can download mysql-connector-java-5.1.30.tar.gz file from the following link.

Next, to extract the file and place it to the lib folder of sqoop
$ tar -zxf mysql-connector-java-5.1.30.tar.gz
$ su
$ cd mysql-connector-java-5.1.30
$ mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib
Note: This library file is very important; don't skip this step, because it contains the JDBC driver needed to connect to MySQL databases.
Verify sqoop: sqoop-version
Step 3: Hive database creation
hive> create database sqoop_example;
hive> use sqoop_example;
hive> create table sqoop(usr_name string, no_ops int, ops_names string);
Hive commands are much like MySQL commands. Here, we just create the structure to store the data which we want to import into Hive.


Step 6: Importing data from MySQL to Hive:

sqoop import --connect jdbc:mysql://127.0.0.1:3306/database_name_in_mysql \
--username root --password cloudera \
--table table_name_in_mysql \
--hive-import --hive-table database_name_in_hive.table_name_in_hive \
--m 1
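
The command above covers the import direction; the reverse direction (pushing a Hive/HDFS table back into MySQL) uses sqoop export. The following is only a sketch under two assumptions: the Hive table data lives under the default warehouse directory, and the target MySQL table already exists with matching columns stored with ',' delimiters:

sqoop export --connect jdbc:mysql://127.0.0.1:3306/database_name_in_mysql \
--username root --password cloudera \
--table table_name_in_mysql \
--export-dir /user/hive/warehouse/database_name_in_hive.db/table_name_in_hive \
--input-fields-terminated-by ',' \
-m 1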


OUTPUT

RESULT:
Thus, data was imported from MySQL into Hive (and can be exported back) using Sqoop successfully.


CONTENT BEYOND SYLLABUS


EXPT.NO.8
IMPLEMENT CLUSTERING TECHNIQUES USING SPARK
DATE:

AIM
To implement clustering techniques using Spark.

ALGORITHM
1. First, we randomly initialize k points, called means or cluster centroids.
2. We categorize each item to its closest mean, and we update the mean’s coordinates, which are
the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our clusters.

PROGRAM
Step 1: Starting the PySpark server
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('cluster').getOrCreate()
print('Spark Version: {}'.format(spark.version))

OUTPUT:
Spark Version: 3.3.1
Step 2: Load the dataset

PROGRAM
# Loading the data
dataset = spark.read.csv("seeds_dataset.csv", header=True, inferSchema=True)
# show the data in the above file using the below command
dataset.show(5)
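
The steps above only load the dataset; the clustering itself can be done with PySpark's KMeans. The following is a minimal sketch under the assumption that all columns of the seeds dataset are numeric features (the choice of k = 3 is only illustrative):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# assemble the numeric columns into a single 'features' vector column
assembler = VectorAssembler(inputCols=dataset.columns, outputCol='features')
final_data = assembler.transform(dataset)

# fit a k-means model with k = 3 clusters
kmeans = KMeans(featuresCol='features', k=3)
model = kmeans.fit(final_data)

# show the cluster assigned to each of the first few rows
model.transform(final_data).select('prediction').show(5)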


OUTPUT

RESULT:
Thus, the cluster has been created using SPARK.


EXPT.NO.9
VISUALIZE DATA USING BASIC PLOTTING TECHNIQUES IN PYTHON
DATE:

AIM
To visualize data using basic plotting techniques in Python.

STEPS TO BE FOLLOWED:
STEP1:
Importing Datasets
In this experiment, we will use two freely available datasets, the Iris and Wine Reviews datasets, which we can both load into memory using the pandas read_csv method.

import pandas as pd
iris = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
print(iris.head())

wine_reviews = pd.read_csv('winemag-data-130k-v2.csv', index_col=0) wine_reviews.head()


Matplotlib
Matplotlib is the most popular Python plotting library. It is a low-level library with a Matlab-like
interface that offers lots of freedom at the cost of having to write more code.
Matplotlib can be installed using either pip or conda.

pip install matplotlib or


conda install matplotlib
Matplotlib is specifically suitable for creating basic graphs like line charts, bar charts, histograms, etc.
It can be imported by typing:

import matplotlib.pyplot as plt

Scatter Plot
To create a scatter plot in Matplotlib, we can use the scatter method. We will also create a figure and
an axis using plt.subplots to give our plot a title and labels.
# create a figure and axis
fig, ax = plt.subplots()
# scatter the sepal_length against the sepal_width
ax.scatter(iris['sepal_length'], iris['sepal_width'])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')


We can give the graph more meaning by coloring each data point by its class. This can be done by
creating a dictionary that maps from class to color and then scattering each point on its own using a
for-loop and passing the respective color.

# create color dictionary
colors = {'Iris-setosa':'r', 'Iris-versicolor':'g', 'Iris-virginica':'b'}
# create a figure and axis
fig, ax = plt.subplots()
# plot each data-point
for i in range(len(iris['sepal_length'])):
    ax.scatter(iris['sepal_length'][i], iris['sepal_width'][i], color=colors[iris['class'][i]])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')

Line Chart
In Matplotlib, we can create a line chart by calling the plot method. We can also plot multiple columns
in one graph by looping through the columns we want and plotting each column on the same axis.

# get columns to plot
columns = iris.columns.drop(['class'])
# create x data
x_data = range(0, iris.shape[0])
# create figure and axis
fig, ax = plt.subplots()
# plot each column
for column in columns:
    ax.plot(x_data, iris[column])
# set title and legend
ax.set_title('Iris Dataset')
ax.legend()


Histogram
In Matplotlib, we can create a Histogram using the hist method. If we pass categorical data like the
points column from the wine-review dataset, it will automatically calculate how often each class
occurs.

# create figure and axis
fig, ax = plt.subplots()
# plot histogram
ax.hist(wine_reviews['points'])
# set title and labels
ax.set_title('Wine Review Scores')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')


Bar Chart
A bar chart can be created using the bar method. The bar chart doesn't automatically calculate the frequency of each category, so we will use the pandas value_counts method to do this. The bar chart is useful for categorical data that doesn't have a lot of different categories (fewer than 30), because otherwise it can get quite messy.
# create a figure and axis
fig, ax = plt.subplots()
# count the occurrence of each class
data = wine_reviews['points'].value_counts()
# get x and y data
points = data.index
frequency = data.values
# create bar chart
ax.bar(points, frequency)
# set title and labels
ax.set_title('Wine Review Scores')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')

Pandas Visualization
Pandas is an open-source, high-performance, and easy-to-use library providing data structures, such as dataframes, and data analysis tools like the visualization tools we will use here.
Pandas Visualization makes it easy to create plots out of a pandas dataframe and series. It also has a higher-level API than Matplotlib, and therefore we need less code for the same results.
Pandas can be installed using either pip or conda.
pip install pandas or
conda install pandas
Scatter Plot

To create a scatter plot in Pandas, we can call <dataset>.plot.scatter() and pass it two arguments, the
name of the x-column and the name of the y-column. Optionally we can also give it a title.
iris.plot.scatter(x='sepal_length', y='sepal_width', title='Iris Dataset')
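
Other basic plot types are available directly on the dataframe in the same way; the following is a short illustrative sketch (assuming the same iris and wine_reviews frames are already loaded):

# line chart of the four iris measurement columns
iris.drop(columns=['class']).plot.line(title='Iris Dataset')

# histogram of the wine review points
wine_reviews['points'].plot.hist(title='Wine Review Scores')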


OUTPUT

RESULT:
Thus, the data was visualized using basic plotting techniques in Python successfully.

