Big Data Analysis - Lab Manual - Bharathidasan University - B.Sc Data Science, Second Year, 4th Semester

This lab manual is designed for the Big Data Analysis course offered by Bharathidasan University for B.Sc Data Science students in their second year, 4th semester. It includes practical exercises such as setting up a Hadoop ecosystem, working with Hive, Pig, and MongoDB, and analyzing JSON files. Each lab includes aims, pre-requisites, step-by-step programs with explanations, outputs, and conclusions to help students understand the practical aspects of big data technologies.

Bharathidasan University Big Data Analysis Lab Practical Lab Manual


Reference Videos

Learn Big Data Hadoop With Python | Python Hadoop Tutorial for Beginners | Python | Edureka Rewind
https://www.youtube.com/watch?v=7t6HfdE7238

1. Setting up a Pseudo-Distributed Single-Node Hadoop Cluster with HDFS Using Python
Big Data ecosystem - Hadoop on Single Node Cluster - Configure Hadoop HDFS - Hadoop Single Node Setup (Tamil)
https://www.youtube.com/watch?v=Z-LnXz2mIHg

2. Installation of Hadoop Ecosystem - Pig
Installation of Apache Pig on Windows 11 (in 2 minutes)
https://www.youtube.com/watch?v=xaWInB7Zir4

3. Installation of Hadoop Ecosystem - Hive
Easy Hive Installation Guidelines: Step-by-Step Tutorial for Hadoop on Windows 10/11 | Kundan Kumar
https://www.youtube.com/watch?v=CL6t2W8YC1Y

4. Loading Data into Hive Tables from Local File System
https://www.youtube.com/watch?v=prF6VT6OpTY

5 & 6. Write a Hive query that returns the contents of the whole table
https://www.youtube.com/watch?v=Kwb1vPoFLiY

7. How to install MongoDB 6 on Windows 10 / Windows 11
https://www.youtube.com/watch?v=gB6WLkSrtJk

8. Reading CSV file and loading it into MongoDB: Use mongoimport to Import a CSV file into a MongoDB Database and Collection
https://www.youtube.com/watch?v=nuQD3Xfr0KY

9. MongoDB - How to Import and Export JSON Files [MongoDB# 02]
https://www.youtube.com/watch?v=B86Gw3kiA0M

10. How to migrate data from MongoDB to MySQL
https://www.youtube.com/watch?v=S3F3vNNnLGs


Pre-requisites: Required Software and Installation

To set up and run the programs in your Big Data Analytics Lab, the following software tools
are required. Below is a list of tools along with installation and setup instructions:

1. Java Development Kit (JDK)

Purpose:

Hadoop runs on Java, so JDK is essential.

Installation:

1. Linux (Ubuntu/Debian):

bash

sudo apt update


sudo apt install openjdk-11-jdk -y
java -version

2. Windows:
o Download JDK from Oracle or OpenJDK.
o Install and set the environment variable JAVA_HOME pointing to the JDK
directory.
o Add JAVA_HOME/bin to your system's PATH.

2. Hadoop

Purpose:

For setting up a pseudo-distributed Hadoop cluster.

Installation:

1. Download Hadoop:
o Visit Apache Hadoop and download the stable version (e.g., Hadoop 3.3.x).
2. Install Hadoop:
o Extract the package:

bash

tar -xvzf hadoop-3.3.x.tar.gz


sudo mv hadoop-3.3.x /usr/local/hadoop

o Set Environment Variables: Add the following to your ~/.bashrc:


bash

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Reload the file:

bash

source ~/.bashrc

3. Configure Hadoop for Pseudo-Distributed Mode:


o Edit the following files in $HADOOP_HOME/etc/hadoop/:
▪ core-site.xml:

xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

▪ hdfs-site.xml:

xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

▪ mapred-site.xml:

xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

▪ yarn-site.xml:

xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

4. Format the Namenode:

bash

hdfs namenode -format

5. Start Hadoop Services:

bash

start-dfs.sh
start-yarn.sh
jps # Verify services are running (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager)

3. Python

Purpose:

For interacting with HDFS, MongoDB, and MySQL.

Installation:

1. Install Python:
o Linux:

bash

sudo apt update


sudo apt install python3 python3-pip -y

o Windows:
▪ Download from Python.org.
▪ Install and add Python to PATH.
2. Install Required Python Libraries:

bash

pip install hdfs pymongo mysql-connector-python pandas

4. Apache Hive

Purpose:

For querying and managing data stored in HDFS.

Installation:

1. Download Hive:
o Visit Apache Hive and download the latest version.
2. Install Hive:

o Extract the archive:

bash

tar -xvzf apache-hive-3.1.x-bin.tar.gz


sudo mv apache-hive-3.1.x-bin /usr/local/hive

o Add to ~/.bashrc:

bash

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

Reload the file:

bash

source ~/.bashrc

3. Set Hive Metastore: Configure hive-site.xml in $HIVE_HOME/conf/.


4. Start Hive:

bash

hive

5. Apache Pig

Purpose:

For analyzing large datasets using a scripting language.

Installation:

1. Download Pig:
o Visit Apache Pig.
2. Install Pig:
o Extract the archive:

bash

tar -xvzf pig-0.17.0.tar.gz


sudo mv pig-0.17.0 /usr/local/pig

o Add to ~/.bashrc:

bash

export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin


Reload the file:

bash

source ~/.bashrc

3. Run Pig:

bash

pig

6. MongoDB

Purpose:

For working with unstructured data.

Installation:

1. Linux:

bash

sudo apt update


sudo apt install -y mongodb
sudo systemctl start mongodb

2. Windows:
o Download MongoDB from MongoDB Download Center.
o Install and start MongoDB.
3. Python Integration:

bash

pip install pymongo

7. MySQL

Purpose:

For working with structured data.

Installation:

1. Linux:

bash

sudo apt update
sudo apt install mysql-server -y
sudo mysql_secure_installation

2. Windows:
o Download MySQL from MySQL.com.
o Install and configure.
3. Python Integration:

bash

pip install mysql-connector-python

Verification and Setup

• Ensure all services (HDFS, Hive, Pig, MongoDB, MySQL) are running.
• Test the Python libraries for each tool (HDFS, MongoDB, MySQL) using sample scripts, for example the connectivity check sketched below.
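
The following is a minimal sketch of such a check, assuming the default local ports (9870 for the HDFS NameNode web interface, 27017 for MongoDB, 3306 for MySQL) and placeholder credentials that should be replaced with your own:

python

# Quick connectivity checks for the lab tools (assumes the services are already running)
from hdfs import InsecureClient
from pymongo import MongoClient
import mysql.connector

# HDFS: list the root directory via the NameNode web port
hdfs_client = InsecureClient('http://localhost:9870', user='hadoop')
print("HDFS root:", hdfs_client.list('/'))

# MongoDB: ping the server on the default port
mongo_client = MongoClient('mongodb://localhost:27017/')
print("MongoDB ping:", mongo_client.admin.command('ping'))

# MySQL: open and close a connection (replace the credentials with your own)
mysql_conn = mysql.connector.connect(host='localhost', user='root', password='your_password')
print("MySQL connected:", mysql_conn.is_connected())
mysql_conn.close()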


Why Java Is Commonly Used for Hadoop:

1. Hadoop Framework: Hadoop is written in Java, so its core libraries and ecosystem
(e.g., MapReduce, HDFS) rely heavily on Java.
2. Default API Support: Hadoop's primary APIs are in Java, making it a natural choice
for interacting with its components.
3. Java's Robust Ecosystem: Java provides high scalability, making it suitable for
large-scale distributed systems like Hadoop.

Advantages of Using Python for Hadoop:

1. Ease of Use: Python is easier to write and debug compared to Java.


2. Wide Ecosystem: Libraries like hdfs, pymongo, pyhive, and pyspark make it easy
to interact with Big Data tools.
3. Rapid Prototyping: Python is ideal for quick development and testing of Big Data
applications.

When to Use Java:

• When you're working on large-scale enterprise projects that require fine-grained


control over Hadoop's internals.
• If you need to extend Hadoop or work on custom features within the Hadoop
ecosystem.


1. Solution: Setting up a Pseudo-Distributed Single-Node Hadoop Cluster with HDFS Using Python

1. Aim

To set up a single-node, pseudo-distributed Hadoop cluster, backed by the Hadoop Distributed File System (HDFS), and interact with HDFS using Python.

2. Pre-requisites

To set up the environment, the following software and tools are required:

1. Hadoop: The core framework for distributed storage and processing.


o Download: Apache Hadoop
o Version: Ensure a stable release (e.g., Hadoop 3.x).
2. Java: Required by Hadoop as a runtime environment.
o Install OpenJDK:

bash

sudo apt update


sudo apt install openjdk-11-jdk
java -version

3. Python: Ensure Python 3.x is installed.


o Install necessary Python libraries:

bash

pip install hdfs

4. SSH: Required for Hadoop's pseudo-distributed mode to allow local SSH access.
o Install and configure SSH:

bash

sudo apt install openssh-server


ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost

3. Program and Procedure

Below is the step-by-step guide and Python program with inline comments.


Step 1: Install and Configure Hadoop

1. Download and Extract Hadoop:

bash

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xvzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 ~/hadoop

2. Set Environment Variables: Add the following lines to ~/.bashrc:

bash

export HADOOP_HOME=~/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Reload the environment:

bash

source ~/.bashrc

3. Edit Configuration Files:


o Core Site Configuration (core-site.xml):

xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

o HDFS Site Configuration (hdfs-site.xml):

xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/your_user/hadoopdata/namenode</value>

</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/your_user/hadoopdata/datanode</value>
</property>
</configuration>

o MapReduce Configuration (mapred-site.xml):

xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

o YARN Configuration (yarn-site.xml):

xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

4. Format the Namenode:

bash

hdfs namenode -format

5. Start Hadoop Services:

bash

start-dfs.sh
start-yarn.sh

Step 2: Python Program to Interact with HDFS

python

from hdfs import InsecureClient

# Step 1: Connect to HDFS


client = InsecureClient('http://localhost:9870', user='hadoop')

# Step 2: Write a file to HDFS


print("Writing a file to HDFS...")
file_content = "Hello, Hadoop! This is a test file."
with client.write('/user/hadoop/test_file.txt', encoding='utf-8') as writer:
    writer.write(file_content)

# Step 3: Read the file from HDFS


print("Reading the file from HDFS...")
with client.read('/user/hadoop/test_file.txt', encoding='utf-8') as reader:
file_data = reader.read()
print("File content:", file_data)

# Step 4: List files in the HDFS directory


print("Listing files in HDFS /user/hadoop/ directory...")
files = client.list('/user/hadoop')
print("Files in directory:", files)

# Step 5: Delete the file from HDFS


print("Deleting the file from HDFS...")
client.delete('/user/hadoop/test_file.txt')
print("File deleted successfully!")

4. Output

After running the program, the output will look like this:

plaintext

Writing a file to HDFS...


Reading the file from HDFS...
File content: Hello, Hadoop! This is a test file.
Listing files in HDFS /user/hadoop/ directory...
Files in directory: ['test_file.txt']
Deleting the file from HDFS...
File deleted successfully!

5. Result / Conclusion

• A pseudo-distributed single-node Hadoop cluster was successfully set up.


• Python was used to interact with the HDFS to write, read, list, and delete files.
• This setup and program demonstrate how to integrate Python with Hadoop and
HDFS, providing a simpler alternative to Java for Big Data applications.


2. Installation of Hadoop Ecosystem - Pig

1. Aim

To install and use Apache Pig, an open-source platform for analyzing large datasets on
Hadoop, and perform data analysis using Pig scripts with Python integration.

2. Pre-requisites

Ensure the following prerequisites are met:

1. Hadoop Installation:
o A running Hadoop pseudo-distributed cluster.
o Check Hadoop is installed and running:

bash

hdfs dfs -ls /

2. Python Installation:
o Install Python:

bash

sudo apt update


sudo apt install python3 python3-pip
python3 --version

3. Apache Pig Installation:


o Install Apache Pig using the steps below.

3. Installation Steps

1. Download Apache Pig:

bash

wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz

2. Extract Pig:

bash

tar -xzf pig-0.17.0.tar.gz


sudo mv pig-0.17.0 /usr/local/pig

3. Set Environment Variables: Add the following lines to ~/.bashrc:


bash

export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
export HADOOP_HOME=~/hadoop

Reload the environment:

bash

source ~/.bashrc

4. Verify Pig Installation:

bash

pig -version

Output:

plaintext

Apache Pig version 0.17.0 (r1832090) compiled Feb 5 2018, 16:31:43

4. Python Program with Pig Script

Below is a Python-based solution to run Pig scripts for data analysis on Hadoop.

Python Program to Execute Pig Scripts


python

import os

# Step 1: Define the Pig script


pig_script = """
-- Load data from HDFS into a Pig relation
data = LOAD '/user/hadoop/input/sample.txt' USING PigStorage(',') AS
(id:int, name:chararray, score:int);

-- Filter data where the score is greater than 50


filtered_data = FILTER data BY score > 50;

-- Store the filtered data into an HDFS output directory


STORE filtered_data INTO '/user/hadoop/output' USING PigStorage(',');
"""

# Step 2: Save the script to a file


with open("example_script.pig", "w") as file:
file.write(pig_script)

# Step 3: Execute the Pig script


print("Running the Pig script...")
os.system("pig example_script.pig")  # default MapReduce mode, so the script reads its input from HDFS
print("Pig script execution completed.")

Steps to Run

1. Prepare Input Data: Create a sample text file:

bash

echo -e "1,John,70\n2,Jane,40\n3,Mark,85\n4,Lucy,30" > sample.txt

2. Place the File in HDFS:

bash

hdfs dfs -mkdir -p /user/hadoop/input


hdfs dfs -put sample.txt /user/hadoop/input/

3. Run the Python Program: Execute the Python script to run the Pig script:

bash

python3 run_pig_script.py

5. Output

After executing the Pig script, the filtered data will be stored in HDFS.

1. Check the Output Directory in HDFS:

bash

hdfs dfs -ls /user/hadoop/output

2. View the Output Data:

bash

hdfs dfs -cat /user/hadoop/output/part-*

Expected Output:

plaintext

1,John,70
3,Mark,85

Explanation:

o The rows where score > 50 were filtered and written to the output.


6. Result / Conclusion

• Apache Pig was successfully installed and configured for Hadoop.


• A Python program was written to automate the execution of Pig scripts for analysing data
stored in HDFS.
• The output verified the successful use of Pig to filter data based on a specific condition.


3. Installation of Hadoop Ecosystem - Hive

1. Aim

To install Apache Hive, an open-source data warehouse system on top of Hadoop, and
perform operations on structured data stored in HDFS using HiveQL queries through Python.

2. Pre-requisites

1. Hadoop Installation:
o A running Hadoop pseudo-distributed cluster.
o Confirm Hadoop is working:

bash

hdfs dfs -ls /

2. Python Installation:
o Install Python and required packages:

bash

sudo apt update


sudo apt install python3 python3-pip
python3 --version

3. Apache Hive Installation:


o Install Hive as detailed below.

3. Installation Steps

1. Download Apache Hive:

bash

wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz

2. Extract Hive:

bash

tar -xzf apache-hive-3.1.3-bin.tar.gz


sudo mv apache-hive-3.1.3-bin /usr/local/hive

3. Set Environment Variables: Add the following to ~/.bashrc:

bash


export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_HOME=~/hadoop

Reload the environment:

bash

source ~/.bashrc

4. Configure Hive Metastore:


o Create a directory in HDFS for Hive:

bash

hdfs dfs -mkdir /user/hive/warehouse


hdfs dfs -chmod g+w /user/hive/warehouse

o Configure Hive by editing the hive-site.xml file in $HIVE_HOME/conf and initialize the metastore schema (for example, schematool -dbType derby -initSchema when using the embedded Derby metastore).


5. Verify Hive Installation:

bash

hive --version

Expected Output:

plaintext

Hive 3.1.3

4. Python Program to Execute Hive Queries

Below is a Python-based solution to execute Hive queries using the pyhive library.

Python Program to Execute HiveQL Queries


python

from pyhive import hive


import pandas as pd

# Step 1: Connect to the Hive server


conn = hive.Connection(host="localhost", port=10000, username="hadoop",
database="default")
cursor = conn.cursor()

# Step 2: Create a table in Hive


create_table_query = """
CREATE TABLE IF NOT EXISTS employee (
id INT,
name STRING,
salary FLOAT

)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
"""
cursor.execute(create_table_query)
print("Table 'employee' created successfully.")

# Step 3: Load data into the Hive table


load_data_query = """
LOAD DATA LOCAL INPATH '/path/to/sample_data.txt'
OVERWRITE INTO TABLE employee;
"""
cursor.execute(load_data_query)
print("Data loaded successfully into 'employee' table.")

# Step 4: Query data from the Hive table


select_query = "SELECT * FROM employee WHERE salary > 50000;"
cursor.execute(select_query)

# Fetch and display the results


results = cursor.fetchall()
print("Query Results:")
df = pd.DataFrame(results, columns=["ID", "Name", "Salary"])
print(df)

# Close the connection


cursor.close()
conn.close()

Steps to Execute

1. Prepare Input Data: Create a file sample_data.txt with sample employee data:

plaintext

1,John,60000
2,Jane,40000
3,Mark,75000
4,Lucy,30000

2. Run the Python Program:

bash

python3 hive_script.py

5. Output

1. Verify Table Creation in Hive: Open Hive shell:

bash

hive


Run the following command:

sql

SHOW TABLES;

Output:

plaintext

employee

2. Expected Query Results:

plaintext

Query Results:
ID Name Salary
0 1 John 60000.0
1 3 Mark 75000.0

6. Result / Conclusion

• Apache Hive was successfully installed and configured on the Hadoop ecosystem.
• A Python program was developed to create a Hive table, load data, and execute HiveQL
queries.
• The output verified that structured data could be efficiently queried using HiveQL via
Python.


4. Hive Query to Load Data from a Local File into a Table

Aim:

To load data from a local file into a Hive table for further analysis or processing.

Pre-requisites:

• Hive installed and configured properly.


• A local file (e.g., data.txt) containing the data you want to load.
• A Hive table created to load the data into (if it doesn't exist, it will need to be created).

Program with Comment Line Explanations:

1. Step 1: Start the Hive shell First, start the Hive shell from the command line:

bash

hive

2. Step 2: Create a Hive table (if not already created)

For this example, let's assume we have a simple dataset (CSV format) and want to
load it into a Hive table with three columns: id, name, and age.

sql

CREATE TABLE IF NOT EXISTS people (


id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','; -- Specify the delimiter used in the file (e.g., comma for CSV)

This will create a table named people with the columns id, name, and age. The ROW
FORMAT DELIMITED clause specifies that the fields in the data are separated by
commas.

3. Step 3: Load data from the local file into the Hive table

Use the LOAD DATA command to load the data from a local file into the Hive table. For
this, you'll need the absolute path to the local file.

sql

LOAD DATA LOCAL INPATH '/path/to/data.txt' INTO TABLE people;

Explanation:

o LOCAL INPATH: Specifies that the file is on the local file system (instead of HDFS).


o '/path/to/data.txt': Path to the local file.


o INTO TABLE people: Specifies the target Hive table where the data will be
loaded.
4. Step 4: Verify the data is loaded into the table

You can verify the data has been loaded by querying the table:

sql

SELECT * FROM people;

This will return all the rows from the people table.

5. Step 5: Exit the Hive shell

Once done, exit the Hive shell by typing:

bash

exit;
Output Solutions:

1. Sample Input File (data.txt):

text

1,John,28
2,Jane,32
3,Paul,25

2. Result after running the query:

After running the LOAD DATA command, if you query the people table, the output
should look like this:

text

OK
1 John 28
2 Jane 32
3 Paul 25
Conclusion:

• The LOAD DATA LOCAL INPATH command successfully loads the data from the local file
into the Hive table.
• The Hive table now contains the data from the data.txt file, which can be further
processed or queried.
• This process allows for easy integration of local datasets into Hive for use in larger-scale data
analysis or ETL pipelines.


5. Write a Hive query that returns the contents of the whole table:

Aim

The aim is to write a Hive query that returns the contents of the entire table.

Pre-requisites

1. Hadoop and Hive must be installed and properly configured on your system.
2. Ensure that Hive is running.
3. You have a Hive table created and populated with data.

Hive Query to Return the Contents of the Whole Table


Step-by-Step Instructions:

1. Create a Hive Table:


First, ensure you have a Hive table created. If not, you can create one using
the following query:

SQL

CREATE TABLE IF NOT EXISTS my_table (


id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2. Load Data into the Hive Table:


You can load data from a local file into the Hive table using the following
query:

SQL

LOAD DATA LOCAL INPATH '/path/to/your/localfile.csv' INTO TABLE my_table;

3. Return the Contents of the Whole Table:


To return the contents of the entire table, use the following query:

SQL

SELECT * FROM my_table;


Program with Comment Line Explanations

Here's an example of how you can execute these queries using Hive:

Shell
#!/bin/bash

# Aim: Load data into a Hive table and return the contents of the table

# Step 1: Define the Hive table creation query


create_table_query="CREATE TABLE IF NOT EXISTS my_table (
id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;"

# Step 2: Execute the Hive table creation query


hive -e "$create_table_query"

# Step 3: Define the local file path


local_file_path="/path/to/your/localfile.csv"

# Step 4: Define the Hive query to load data into the table
load_data_query="LOAD DATA LOCAL INPATH '$local_file_path' INTO TABLE
my_table;"

# Step 5: Execute the Hive query to load data


hive -e "$load_data_query"

# Step 6: Define the Hive query to select all data from the table
select_all_query="SELECT * FROM my_table;"

# Step 7: Execute the Hive query to return the contents of the table
hive -e "$select_all_query"

echo "Query executed successfully. The contents of the table are displayed
above."

Output Solutions

1. Creating the Hive Table:


o The Hive table my_table will be created with the specified columns and
configurations if it does not already exist.
2. Loading Data:
o The data from the local file localfile.csv will be loaded into
the my_table Hive table.
3. Returning the Contents:
o The SELECT * FROM my_table; query will return all the contents of
the my_table table, displaying all columns and rows.


Result / Conclusion

• Result:
o After running the above script, the data from the local file localfile.csv will
be successfully loaded into the Hive table my_table, and the entire contents
of the table will be displayed.
• Conclusion:
o This process demonstrates how to automate the creation of Hive tables,
loading data from local files, and retrieving data using Hive queries within a
shell script. This method is useful for integrating data ingestion and retrieval
tasks into larger data processing pipelines.

Important Notes:
• Replace /path/to/your/localfile.csv with the actual path to your local file.
• Adjust the table schema in the CREATE TABLE statement to match the structure of
your CSV file.
• Ensure that Hadoop and Hive are configured correctly on your system, and the Hive
CLI is accessible from your environment.


6. Demonstrate the usage of Hive functions:

Aim

The aim is to demonstrate the usage of Hive functions through examples.

Pre-requisites

1. Hadoop and Hive must be installed and properly configured on your system.
2. Ensure that Hive is running.
3. You have a Hive table created and populated with data.

Demonstration of Hive Functions

Step-by-Step Instructions:

1. Create a Hive Table:


First, ensure you have a Hive table created. If not, you can create one using
the following query:

SQL

CREATE TABLE IF NOT EXISTS employee (


id INT,
name STRING,
age INT,
salary FLOAT,
department STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

2. Load Data into the Hive Table:


You can load data from a local file into the Hive table using the following
query:

SQL

LOAD DATA LOCAL INPATH '/path/to/your/employee_data.csv' INTO TABLE employee;

3. Demonstrate Hive Functions:


Use Hive functions to perform various operations on the data.


Program with Comment Line Explanations

Below are examples of Hive functions used in queries:

SQL
-- Aim: Demonstrate the usage of Hive functions

-- Step 1: Create a Hive table named 'employee'


CREATE TABLE IF NOT EXISTS employee (
id INT,
name STRING,
age INT,
salary FLOAT,
department STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Step 2: Load data into the Hive table from a local CSV file
LOAD DATA LOCAL INPATH '/path/to/your/employee_data.csv' INTO TABLE
employee;

-- Step 3: Demonstrate the usage of Hive functions

-- Example 1: Use the COUNT function to count the number of employees
SELECT COUNT(*) AS total_employees FROM employee;

-- Example 2: Use the AVG function to calculate the average salary of employees
SELECT AVG(salary) AS average_salary FROM employee;

-- Example 3: Use the MAX and MIN functions to find the highest and lowest salaries
SELECT MAX(salary) AS highest_salary, MIN(salary) AS lowest_salary FROM employee;

-- Example 4: Use the SUM function to calculate the total salary expenditure
SELECT SUM(salary) AS total_salary_expenditure FROM employee;

-- Example 5: Use the GROUP BY clause with the COUNT function to count employees in each department
SELECT department, COUNT(*) AS employees_count FROM employee GROUP BY department;

-- Example 6: Use the CONCAT function to combine first and last names (assuming first_name and last_name columns)
-- SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM employee;

-- Example 7: Use the UPPER and LOWER functions to convert names to uppercase and lowercase
SELECT UPPER(name) AS uppercase_name, LOWER(name) AS lowercase_name FROM employee;

-- Step 4: Execute the queries to see the results
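
These queries can also be run from Python, in the same way as exercise 3. The sketch below is a minimal example using the pyhive library; it assumes HiveServer2 is listening on localhost:10000 and that the employee table above has already been created and loaded:

Python

from pyhive import hive

# Connect to HiveServer2 (adjust host, port, and username for your setup)
conn = hive.Connection(host="localhost", port=10000, username="hadoop", database="default")
cursor = conn.cursor()

# Run one of the aggregate queries shown above (note: no trailing semicolon with pyhive)
cursor.execute("SELECT department, COUNT(*) AS employees_count FROM employee GROUP BY department")
for department, employees_count in cursor.fetchall():
    print(department, employees_count)

cursor.close()
conn.close()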

Output Solutions

1. Creating the Hive Table:


o The Hive table employee will be created with the specified columns and
configurations if it does not already exist.
2. Loading Data:
o The data from the local file employee_data.csv will be loaded into
the employee Hive table.
3. Demonstrating Hive Functions:
o Each query demonstrates a different Hive function, and the results will
be displayed when the queries are executed.

Result / Conclusion

• Result:
o After running the above queries, the data from the local
file employee_data.csv will be successfully loaded into the Hive
table employee.
o The queries will demonstrate the usage of various Hive functions and display the results accordingly.
• Conclusion:
o This process demonstrates how to use Hive functions to perform
various operations on data stored in Hive tables. The examples cover
common functions such as COUNT, AVG, MAX, MIN, SUM, CONCAT, UPPER,
and LOWER. These functions are useful for data analysis and aggregation
tasks in Hive.

Important Notes:

• Replace /path/to/your/employee_data.csv with the actual path to your local file.
• Ensure that Hadoop and Hive are configured correctly on your system.
• Adjust the table schema in the CREATE TABLE statement to match the structure
of your CSV file.


7. Installation of MongoDB:

Aim

The aim is to guide you through the installation of MongoDB on your system.

Pre-requisites

1. A supported operating system (Windows, macOS, or Linux).


2. Administrative privileges to install software on your system.
3. Internet access to download MongoDB.

Program with Comment Line Explanations

Below are the steps to install MongoDB on different operating systems:

Installation on Windows

1. Download MongoDB:
o Go to the MongoDB Download Center.
o Select Windows as the operating system and download the MSI installer.
2. Run the Installer:
o Double-click the downloaded MSI file to start the installation.
o Follow the prompts in the MongoDB Installer.
3. Configure MongoDB:
o Choose the Complete setup type.
o Make sure to install MongoDB as a Service.
o Select the option to install MongoDB Compass (optional GUI tool).
4. Verify Installation:
o Open Command Prompt and run:

Shell

mongod --version
o You should see the version of MongoDB installed.

Installation on macOS

1. Install Homebrew (if not already installed):


o Open Terminal and run:

Shell

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"


2. Install MongoDB with Homebrew:


o Run the following commands:

Shell

brew tap mongodb/brew
brew install mongodb-community


3. Start MongoDB Service:


o Run:

Shell

brew services start mongodb/brew/mongodb-community

4. Verify Installation:
o Run:

Shell

mongod --version
o You should see the version of MongoDB installed.

Installation on Linux (Ubuntu)

1. Import the MongoDB Public GPG Key:


o Open Terminal and run:

Shell

wget -qO - https://www.mongodb.org/static/pgp/server-5.0.asc | sudo apt-key add -


2. Create the MongoDB Source List:



o Run:

Shell

echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/5.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-5.0.list


3. Reload the Local Package Database:


o Run:

Shell

sudo apt-get update

4. Install MongoDB Packages:


o Run:

Shell

sudo apt-get install -y mongodb-org

5. Start MongoDB:
o Run:

Shell

sudo systemctl start mongod

6. Verify Installation:
o Run:

Shell

mongod --version
o You should see the version of MongoDB installed.
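
Once mongod is running on any of these platforms, you can also confirm that the server is reachable from Python, which the later exercises rely on. This is a minimal sketch, assuming MongoDB is listening on the default localhost:27017 and that pymongo is installed:

Python

from pymongo import MongoClient

# Connect to the local MongoDB server on the default port
client = MongoClient("mongodb://localhost:27017/")

# The 'ping' command succeeds only if the server is up and reachable
print(client.admin.command("ping"))      # expected: {'ok': 1.0}
print(client.server_info()["version"])   # prints the installed server version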

Output Solutions

1. Verification:
o You should be able to see the version of MongoDB installed by
running mongod --version.
o Example output:


Code

db version v5.0.5
git version: 62e7b76e9d0c41903a1f3e4b8e7d3f2d7f746a86
allocator: tcmalloc
modules: none
build environment:
distarch: x86_64
target_arch: x86_64

Result / Conclusion

• Result:
o MongoDB will be successfully installed on your system.
o You will be able to verify the installation by checking the MongoDB version.
• Conclusion:
o This guide provides step-by-step instructions for installing MongoDB
on different operating systems. Following these steps ensures that
MongoDB is installed correctly and ready for use. MongoDB is a
powerful NoSQL database that can be used for various applications,
including web development, data analysis, and more.

Important Notes:

• Ensure you have administrative privileges to install the software.


• Follow the specific steps for your operating system.
• Use the official MongoDB documentation for more advanced configurations
and troubleshooting.

This guide covers the installation process for MongoDB, a popular NoSQL database,
on Windows, macOS, and Linux systems.


8. Reading CSV file and loading it into MongoDB:


Aim

The aim is to read data from a CSV file and load it into a MongoDB collection using a
Python script.

Pre-requisites

1. Python installed on your system.


2. MongoDB installed and running on your system.
3. Pandas library installed for reading CSV files.
4. Pymongo library installed for interacting with MongoDB.

You can install the required libraries using pip:

sh
pip install pandas pymongo

Program with Comment Line Explanations

Below is a Python program that reads data from a CSV file and loads it into a
MongoDB collection:

Python
# Import necessary libraries
import pandas as pd
from pymongo import MongoClient

# Aim: Read data from a CSV file and load it into a MongoDB collection

# Step 1: Read data from the CSV file using pandas


# Replace 'path/to/your/file.csv' with the actual path to your CSV file
csv_file_path = 'path/to/your/file.csv'
data = pd.read_csv(csv_file_path)

# Step 2: Connect to MongoDB


# Replace 'localhost' and '27017' with your MongoDB server address and port if different
client = MongoClient('localhost', 27017)

# Step 3: Create/select the database and collection


# Replace 'mydatabase' and 'mycollection' with your database and collection names
db = client['mydatabase']
collection = db['mycollection']

# Step 4: Convert the DataFrame to a list of dictionaries


data_dict = data.to_dict('records')


# Step 5: Insert the data into the MongoDB collection


# Insert the list of dictionaries into the collection
collection.insert_many(data_dict)

print("Data loaded successfully into the MongoDB collection.")
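
As an optional check from the same script, you can confirm that the documents actually arrived; this is a small sketch that reuses the collection object defined above:

Python

# Optional verification: count the inserted documents and show one of them
print("Documents in collection:", collection.count_documents({}))
print("Sample document:", collection.find_one())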

Output Solutions

1. Reading CSV File:


o The pandas library reads the CSV file and stores the data in a
DataFrame.
o Example CSV data:

CSV

id,name,age
1,John Doe,25
2,Jane Smith,30

2. Connecting to MongoDB:
o The pymongo library connects to the MongoDB server.
3. Inserting Data into MongoDB:
o The data from the DataFrame is converted to a list of dictionaries.
o The list of dictionaries is inserted into the MongoDB collection.

Result / Conclusion

• Result:
o After running the above script, the data from the CSV file will be
successfully loaded into the specified MongoDB collection.
• Conclusion:
o This process demonstrates how to read data from a CSV file using
the pandas library and load it into a MongoDB collection using
the pymongo library. This method is useful for data migration and
integration tasks, allowing you to easily transfer data from CSV files to
MongoDB for storage and further analysis.

Important Notes:

• Ensure that MongoDB is running and accessible on your system.


• Replace 'path/to/your/file.csv' with the actual path to your CSV file.
• Replace 'localhost', '27017', 'mydatabase', and 'mycollection' with your
MongoDB server details and the names of your database and collection.
• Install the necessary libraries using:


sh

pip install pandas pymongo


This guide provides a complete solution for reading data from a CSV file and loading
it into a MongoDB collection using Python.


9. Reading JSON File and Loading it into MongoDB

Aim

To demonstrate how to read data from a JSON file and insert it into a MongoDB collection
using Python.

Pre-requisites

1. Software Requirements:
o Python (version 3.6 or later)
o MongoDB (installed locally or using MongoDB Atlas)
o Required Python Libraries: pymongo, json
2. Environment Setup:
o Install MongoDB from the MongoDB Download Center.
o Install Python dependencies:

bash

pip install pymongo

3. Create a JSON file: Prepare a data.json file with sample data. For example:

json

[
{ "name": "Alice", "age": 25, "city": "New York" },
{ "name": "Bob", "age": 30, "city": "Los Angeles" },
{ "name": "Charlie", "age": 35, "city": "Chicago" }
]

Program
python

# Import necessary libraries


import json # For reading the JSON file
from pymongo import MongoClient # For connecting to MongoDB

# Step 1: Connect to MongoDB


# Create a MongoDB client connection
client = MongoClient("mongodb://localhost:27017/")  # Replace localhost and port if needed
print("Connected to MongoDB!")

# Step 2: Access the database and collection


db = client["example_db"]  # Use 'example_db' or create it if it doesn't exist
collection = db["example_collection"]  # Use 'example_collection' or create it if it doesn't exist

# Step 3: Read data from the JSON file


json_file_path = "data.json" # Replace with the path to your JSON file
with open(json_file_path, "r") as file:
data = json.load(file) # Load JSON data from the file

# Step 4: Insert data into MongoDB


if isinstance(data, list):
# If the JSON file contains a list of documents
collection.insert_many(data) # Insert multiple documents
else:
# If the JSON file contains a single document
collection.insert_one(data) # Insert a single document

print("Data has been successfully loaded into MongoDB!")

# Step 5: Retrieve and print the inserted data (optional)


print("Inserted data:")
for document in collection.find():
print(document)

Output

1. Program Execution:
o Run the script in a terminal or IDE:

bash

python3 load_json_to_mongodb.py

o Sample console output:

plaintext

Connected to MongoDB!
Data has been successfully loaded into MongoDB!
Inserted data:
{'_id': ObjectId('...'), 'name': 'Alice', 'age': 25, 'city':
'New York'}
{'_id': ObjectId('...'), 'name': 'Bob', 'age': 30, 'city': 'Los
Angeles'}
{'_id': ObjectId('...'), 'name': 'Charlie', 'age': 35, 'city':
'Chicago'}

2. Verification in MongoDB:
o Use the MongoDB shell or GUI (e.g., MongoDB Compass) to verify the inserted data:

bash

mongo
use example_db
db.example_collection.find().pretty()

o Expected result:

json

{
"_id": ObjectId("..."),
"name": "Alice",
"age": 25,
"city": "New York"
}
{
"_id": ObjectId("..."),
"name": "Bob",
"age": 30,
"city": "Los Angeles"
}
{
"_id": ObjectId("..."),
"name": "Charlie",
"age": 35,
"city": "Chicago"
}

Result/Conclusion

• The program successfully connected to MongoDB and read data from the JSON file.
• Data from the JSON file was inserted into a MongoDB collection.
• MongoDB was verified to store the data, which could be retrieved using the find()
method.
• Conclusion: Using Python with the pymongo library, JSON data can be efficiently loaded into
MongoDB, making it a suitable choice for handling structured data in NoSQL databases.


10. Reading MongoDB and writing into MySQL:

Aim

To transfer data from a MongoDB database to a MySQL database. This process is essential in
scenarios where MongoDB is used for unstructured data storage, but data analysis or
processing requires relational database features.

Prerequisites

1. Software and Libraries:


o MongoDB installed and configured.
o MySQL installed and configured.
o Python environment with the following libraries:
▪ pymongo (to interact with MongoDB)
▪ mysql-connector-python (to interact with MySQL)
o Alternatively, Node.js or any other supported language can be used.
2. Knowledge Requirements:
o Basic understanding of MongoDB and MySQL databases.
o Familiarity with Python or another programming language.
o Understanding of CRUD operations in both databases.
3. Setup Requirements:
o MongoDB should have a collection with sample data to transfer.
o MySQL should have a table ready to receive the data.

Program

Below is a Python program to perform the task, with comments explaining each step:

python

# Import necessary libraries


from pymongo import MongoClient # Library to connect to MongoDB
import mysql.connector # Library to connect to MySQL

# Step 1: Connect to MongoDB


try:
    mongo_client = MongoClient("mongodb://localhost:27017/")  # Replace with your MongoDB URI
    mongo_db = mongo_client["source_database"]                # Name of the MongoDB database
    mongo_collection = mongo_db["source_collection"]          # Name of the MongoDB collection
print("Connected to MongoDB successfully!")
except Exception as e:
print(f"Error connecting to MongoDB: {e}")
exit()

# Step 2: Connect to MySQL



try:
mysql_conn = mysql.connector.connect(
host="localhost", # MySQL host
user="root", # MySQL username
password="your_password", # MySQL password
database="target_database" # MySQL database name
)
mysql_cursor = mysql_conn.cursor()
print("Connected to MySQL successfully!")
except Exception as e:
print(f"Error connecting to MySQL: {e}")
exit()

# Step 3: Fetch data from MongoDB


try:
    mongo_data = list(mongo_collection.find())  # Fetch all documents as a list
print(f"Fetched {len(mongo_data)} records from MongoDB.")
except Exception as e:
print(f"Error fetching data from MongoDB: {e}")
exit()

# Step 4: Insert data into MySQL


try:
insert_query = """
INSERT INTO target_table (column1, column2, column3)
VALUES (%s, %s, %s)
""" # Adjust column names and values as per your table structure

    for document in mongo_data:
        # Extract data from MongoDB document
        value1 = document.get("field1")  # Replace 'field1' with your MongoDB field name
        value2 = document.get("field2")  # Replace 'field2' with your MongoDB field name
        value3 = document.get("field3")  # Replace 'field3' with your MongoDB field name

        # Execute the MySQL insert query
        mysql_cursor.execute(insert_query, (value1, value2, value3))

    # Commit changes to MySQL
    mysql_conn.commit()
    print(f"Successfully inserted {len(mongo_data)} records into MySQL.")
except Exception as e:
print(f"Error inserting data into MySQL: {e}")
exit()

# Step 5: Close connections


mongo_client.close()
mysql_conn.close()
print("Connections closed. Data transfer complete!")

Output Solutions

1. Sample MongoDB Data:


json

{
"_id": "12345",
"field1": "John Doe",
"field2": "Software Engineer",
"field3": "50000"
}

2. MySQL Table Schema:

sql

CREATE TABLE target_table (


id INT AUTO_INCREMENT PRIMARY KEY,
column1 VARCHAR(255),
column2 VARCHAR(255),
column3 VARCHAR(255)
);

3. Expected Output:
o Console log:

plaintext

Connected to MongoDB successfully!


Connected to MySQL successfully!
Fetched 10 records from MongoDB.
Successfully inserted 10 records into MySQL.
Connections closed. Data transfer complete!

o MySQL Table target_table will contain:

plaintext

| id | column1  | column2           | column3 |
|----|----------|-------------------|---------|
| 1  | John Doe | Software Engineer | 50000   |

Result / Conclusion

1. Data Migration:
o Successfully migrated data from MongoDB to MySQL.
2. Use Case Scenarios:
o Useful for ETL pipelines where data is extracted from MongoDB, transformed, and
loaded into MySQL.
o Suitable for situations requiring structured analysis, reporting, or relational queries
on MongoDB data.
3. Learning Outcomes:
o You learned to interact with two databases programmatically.
o The process emphasized error handling, connection management, and query
execution.
