Big Data Analysis - Lab Manual - Bharathidasan University - B.Sc. Data Science, Second Year, 4th Semester
Bharathidasan University Big Data Analysis Lab Practical Lab Manual
Big Data ecosystem - Hadoop on Single Node Cluster - Configure Hadoop HDFS
https://www.youtube.com/watch?v=7t6HfdE7238
Hadoop Single Node Setup (Tamil)
https://www.youtube.com/watch?v=Z-LnXz2mIHg
https://www.youtube.com/watch?v=xaWInB7Zir4
Easy Hive Installation Guidelines: Step-by-Step Tutorial for Hadoop on Windows 10/11 | Kundan Kumar
https://www.youtube.com/watch?v=CL6t2W8YC1Y
4. Loading Data into Hive Tables from Local File System
https://www.youtube.com/watch?v=prF6VT6OpTY
5&6. Write a Hive query that returns the contents of the whole table.
https://www.youtube.com/watch?v=Kwb1vPoFLiY
7. How to install MongoDB 6 on Windows 10 / Windows 11
https://www.youtube.com/watch?v=gB6WLkSrtJk
8. Reading CSV file and loading it into MongoDB: Use mongoimport to Import a CSV file into a MongoDB Database and Collection
https://www.youtube.com/watch?v=nuQD3Xfr0KY
9. MongoDB - How to Import and Export JSON Files [MongoDB #02]
https://www.youtube.com/watch?v=B86Gw3kiA0M
https://www.youtube.com/watch?v=S3F3vNNnLGs
To set up and run the programs in your Big Data Analytics Lab, the following software tools
are required. Below is a list of tools along with installation and setup instructions:
1. Java (JDK)
Purpose: Java is required to run Hadoop and the other ecosystem tools, all of which run on the JVM.
Installation:
1. Linux (Ubuntu/Debian):
bash
sudo apt update
sudo apt install openjdk-11-jdk
2. Windows:
o Download JDK from Oracle or OpenJDK.
o Install and set the environment variable JAVA_HOME pointing to the JDK
directory.
o Add JAVA_HOME/bin to your system's PATH.
2. Hadoop
Purpose: Hadoop provides distributed storage (HDFS) and distributed processing (MapReduce on YARN) for large datasets.
Installation:
1. Download Hadoop:
o Visit Apache Hadoop and download the stable version (e.g., Hadoop 3.3.x).
2. Install Hadoop:
o Extract the package and move it to /usr/local/hadoop:
bash
tar -xvzf hadoop-3.3.x.tar.gz
sudo mv hadoop-3.3.x /usr/local/hadoop
3. Set environment variables by adding the following to ~/.bashrc:
bash
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
bash
source ~/.bashrc
4. Configure Hadoop (edit the files under $HADOOP_HOME/etc/hadoop):
▪ core-site.xml:
xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
▪ hdfs-site.xml:
xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
▪ mapred-site.xml:
xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
▪ yarn-site.xml:
xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
5. Format the NameNode and start the Hadoop services:
bash
hdfs namenode -format
bash
start-dfs.sh
start-yarn.sh
jps # Verify services are running
3. Python
Purpose: Python is used to write the client programs that interact with HDFS, Hive, Pig, MongoDB, and MySQL.
Installation:
1. Install Python:
o Linux:
bash
sudo apt install python3 python3-pip
o Windows:
▪ Download from Python.org.
▪ Install and add Python to PATH.
2. Install Required Python Libraries:
bash
pip install hdfs pyhive pandas pymongo mysql-connector-python
4. Apache Hive
Purpose: Hive provides a SQL-like query language (HiveQL) over data stored in HDFS.
Installation:
1. Download Hive:
o Visit Apache Hive and download the latest version.
2. Install Hive:
o Extract the archive and move it to /usr/local/hive:
bash
tar -xvzf apache-hive-3.1.3-bin.tar.gz
sudo mv apache-hive-3.1.3-bin /usr/local/hive
o Add to ~/.bashrc:
bash
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
bash
source ~/.bashrc
3. Run Hive:
bash
hive
5. Apache Pig
Purpose: Pig provides a high-level scripting language (Pig Latin) for analyzing large datasets on Hadoop.
Installation:
1. Download Pig:
o Visit Apache Pig.
2. Install Pig:
o Extract the archive and move it to /usr/local/pig:
bash
tar -xvzf pig-0.17.0.tar.gz
sudo mv pig-0.17.0 /usr/local/pig
o Add to ~/.bashrc:
bash
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
bash
source ~/.bashrc
3. Run Pig:
bash
pig
6. MongoDB
Purpose: MongoDB is a NoSQL document database used to store semi-structured data such as CSV and JSON records.
Installation:
1. Linux:
bash
2. Windows:
o Download MongoDB from MongoDB Download Center.
o Install and start MongoDB.
3. Python Integration:
bash
pip install pymongo
7. MySQL
Purpose: MySQL is a relational database used as the structured target for data exported from MongoDB.
Installation:
1. Linux:
bash
sudo apt install mysql-server
2. Windows:
o Download MySQL from MySQL.com.
o Install and configure.
3. Python Integration:
bash
pip install mysql-connector-python
• Ensure all services (HDFS, Hive, Pig, MongoDB, MySQL) are running.
• Test Python libraries for each tool (HDFS, MongoDB, MySQL) using sample scripts.
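A quick connectivity check of that kind might look like the following minimal sketch (host names, ports, the hadoop user, and the MySQL credentials are assumptions that must match your setup; the hdfs, pymongo, and mysql-connector-python packages are assumed to be installed):
python
# check_services.py - minimal sketch to confirm the Python clients can reach each service
from hdfs import InsecureClient
from pymongo import MongoClient
import mysql.connector

# HDFS: list the root directory through WebHDFS (port 9870 is the Hadoop 3.x default)
hdfs_client = InsecureClient("http://localhost:9870", user="hadoop")
print("HDFS root listing:", hdfs_client.list("/"))

# MongoDB: ping the server
mongo_client = MongoClient("mongodb://localhost:27017/")
print("MongoDB ping:", mongo_client.admin.command("ping"))

# MySQL: open and close a connection
mysql_conn = mysql.connector.connect(host="localhost", user="root", password="your_password")
print("MySQL connected:", mysql_conn.is_connected())
mysql_conn.close()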
Why Java underpins the Hadoop stack:
1. Hadoop Framework: Hadoop is written in Java, so its core libraries and ecosystem (e.g., MapReduce, HDFS) rely heavily on Java.
2. Default API Support: Hadoop's primary APIs are in Java, making it a natural choice
for interacting with its components.
3. Java's Robust Ecosystem: Java provides high scalability, making it suitable for
large-scale distributed systems like Hadoop.
1. Installation of Hadoop and Loading a File into HDFS

1. Aim
To install Hadoop in pseudo-distributed (single-node) mode and write a file into HDFS using a Python program.
2. Pre-requisites
To set up the environment, the following software and tools are required:
bash
bash
4. SSH: Required for Hadoop's pseudo-distributed mode to allow local SSH access.
o Install and configure SSH:
bash
sudo apt install openssh-server
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost
Below is the step-by-step guide and Python program with inline comments.
3. Installation Steps
1. Download and extract Hadoop:
bash
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xvzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 ~/hadoop
2. Add the following environment variables to ~/.bashrc:
bash
export HADOOP_HOME=~/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
bash
source ~/.bashrc
3. Configure Hadoop (edit the files under ~/hadoop/etc/hadoop):
core-site.xml:
xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml:
xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/your_user/hadoopdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/your_user/hadoopdata/datanode</value>
</property>
</configuration>
mapred-site.xml:
xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml:
xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
4. Format the NameNode:
bash
hdfs namenode -format
5. Start the Hadoop services:
bash
start-dfs.sh
start-yarn.sh
python
writer.write(file_content)
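Only one line of the original Python program survives above. A minimal sketch of a program that writes a file into HDFS, assuming the hdfs Python package (pip install hdfs) and WebHDFS reachable on the default port 9870 (the path and user name are illustrative):
python
# write_to_hdfs.py - minimal sketch for writing a file into HDFS from Python
from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint
client = InsecureClient("http://localhost:9870", user="hadoop")

# Content to store in HDFS
file_content = "Hello from the Big Data Analysis lab!\n"

# Write the content to a file in HDFS (overwrite if it already exists)
with client.write("/user/hadoop/sample.txt", overwrite=True, encoding="utf-8") as writer:
    writer.write(file_content)

# Read the file back to confirm the write succeeded
with client.read("/user/hadoop/sample.txt", encoding="utf-8") as reader:
    print(reader.read())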
4. Output
After running the program, the output will look like this:
plaintext
5. Result / Conclusion
Hadoop was installed and configured in pseudo-distributed mode, and a file was successfully written to HDFS from a Python program.
2. Installation and Use of Apache Pig

1. Aim
To install and use Apache Pig, an open-source platform for analyzing large datasets on
Hadoop, and perform data analysis using Pig scripts with Python integration.
2. Pre-requisites
1. Hadoop Installation:
o A running Hadoop pseudo-distributed cluster.
o Check Hadoop is installed and running:
bash
jps
2. Python Installation:
o Install Python:
bash
sudo apt install python3 python3-pip
3. Installation Steps
1. Download Pig:
bash
wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz
2. Extract Pig:
bash
tar -xvzf pig-0.17.0.tar.gz
sudo mv pig-0.17.0 /usr/local/pig
3. Add the following to ~/.bashrc:
bash
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
export HADOOP_HOME=~/hadoop
bash
source ~/.bashrc
4. Verify the installation:
bash
pig -version
Output:
plaintext
Apache Pig version 0.17.0
Below is a Python-based solution to run Pig scripts for data analysis on Hadoop.
import os
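The rest of run_pig_script.py is missing from this copy. A minimal sketch consistent with the expected output below, assuming the input file student_data.txt (comma-separated id, name, score) has been uploaded to HDFS and pig is on the PATH (all paths are illustrative):
python
# run_pig_script.py - minimal sketch that generates and runs a Pig script from Python
import os

# Pig Latin script: load the data, keep rows with score > 50, store the result
pig_script = """
data = LOAD '/user/hadoop/student_data.txt' USING PigStorage(',')
       AS (id:int, name:chararray, score:int);
filtered = FILTER data BY score > 50;
STORE filtered INTO '/user/hadoop/pig_output' USING PigStorage(',');
"""

# Write the script to a local file
with open("filter_students.pig", "w") as f:
    f.write(pig_script)

# Run the script on the cluster (use "pig -x local" to test without HDFS)
exit_code = os.system("pig -x mapreduce filter_students.pig")
print("Pig exited with code:", exit_code)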
Steps to Run
1. Prepare Input Data: Create a file student_data.txt with comma-separated id, name, and score values (to match the expected output below, the rows 1,John,70 and 3,Mark,85 must be the only rows with scores above 50).
2. Upload the file to HDFS:
bash
hdfs dfs -put student_data.txt /user/hadoop/
3. Run the Python Program: Execute the Python script to run the Pig script:
bash
python3 run_pig_script.py
5. Output
After executing the Pig script, the filtered data will be stored in HDFS.
View the result files with:
bash
hdfs dfs -cat /user/hadoop/pig_output/part-*
Expected Output:
plaintext
1,John,70
3,Mark,85
Explanation:
o The rows where score > 50 were filtered and written to the output.
6. Result / Conclusion
Apache Pig was installed successfully, and a Pig script executed from Python filtered the student records with score greater than 50.
3. Installation of Apache Hive

1. Aim
To install Apache Hive, an open-source data warehouse system on top of Hadoop, and
perform operations on structured data stored in HDFS using HiveQL queries through Python.
2. Pre-requisites
1. Hadoop Installation:
o A running Hadoop pseudo-distributed cluster.
o Confirm Hadoop is working:
bash
jps
2. Python Installation:
o Install Python and required packages:
bash
pip install "pyhive[hive]" pandas
3. Installation Steps
1. Download Hive:
bash
wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
2. Extract Hive:
bash
tar -xvzf apache-hive-3.1.3-bin.tar.gz
sudo mv apache-hive-3.1.3-bin /usr/local/hive
3. Add the following to ~/.bashrc:
bash
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_HOME=~/hadoop
bash
source ~/.bashrc
4. Initialize the metastore (embedded Derby) and verify the installation:
bash
schematool -dbType derby -initSchema
bash
hive --version
Expected Output:
plaintext
Hive 3.1.3
Below is a Python-based solution to execute Hive queries using the pyhive library.
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
"""
cursor.execute(create_table_query)
print("Table 'employee' created successfully.")
Steps to Execute
1. Prepare Input Data: Create a file sample_data.txt with sample employee data:
plaintext
1,John,60000
2,Jane,40000
3,Mark,75000
4,Lucy,30000
2. Run the Python Program:
bash
python3 hive_script.py
5. Output
Open the Hive shell to verify the table:
bash
hive
sql
SHOW TABLES;
Output:
plaintext
employee
plaintext
Query Results:
ID Name Salary
0 1 John 60000.0
1 3 Mark 75000.0
6. Result / Conclusion
• Apache Hive was successfully installed and configured on the Hadoop ecosystem.
• A Python program was developed to create a Hive table, load data, and execute HiveQL
queries.
• The output verified that structured data could be efficiently queried using HiveQL via
Python.
4. Loading Data into Hive Tables from Local File System

Aim:
To load data from a local file into a Hive table for further analysis or processing.
Pre-requisites: Hadoop and Hive are installed, configured, and running.
1. Step 1: Start the Hive shell First, start the Hive shell from the command line:
bash
hive
For this example, let's assume we have a simple dataset (CSV format) and want to
load it into a Hive table with three columns: id, name, and age.
2. Step 2: Create the Hive table
sql
CREATE TABLE people (
    id INT,
    name STRING,
    age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
This will create a table named people with the columns id, name, and age. The ROW
FORMAT DELIMITED clause specifies that the fields in the data are separated by
commas.
3. Step 3: Load data from the local file into the Hive table
Use the LOAD DATA command to load the data from a local file into the Hive table. For
this, you'll need the absolute path to the local file.
sql
LOAD DATA LOCAL INPATH '/path/to/data.txt' INTO TABLE people;
Explanation:
o LOCAL INPATH: Specifies that the file is on the local file system (instead of HDFS).
You can verify the data has been loaded by querying the table:
sql
SELECT * FROM people;
This will return all the rows from the people table.
bash
exit;
Output Solutions:
Sample contents of data.txt:
text
1,John,28
2,Jane,32
3,Paul,25
After running the LOAD DATA command, if you query the people table, the output
should look like this:
text
OK
1 John 28
2 Jane 32
3 Paul 25
Conclusion:
• The LOAD DATA LOCAL INPATH command successfully loads the data from the local file
into the Hive table.
• The Hive table now contains the data from the data.txt file, which can be further
processed or queried.
• This process allows for easy integration of local datasets into Hive for use in larger-scale data analysis or ETL pipelines (see the sketch below for issuing the same statements from Python).
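If the same load needs to run inside a Python-based pipeline, the pyhive library from exercise 3 can issue the same HiveQL statements. A minimal sketch, assuming HiveServer2 on localhost:10000 and the people table and data.txt path used above:
python
# load_people_from_python.py - minimal sketch issuing the same HiveQL through pyhive
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()

# Load the local file into the people table, then read the table back
cursor.execute("LOAD DATA LOCAL INPATH '/path/to/data.txt' INTO TABLE people")
cursor.execute("SELECT * FROM people")
for row in cursor.fetchall():
    print(row)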
5. Write a Hive query that returns the contents of the whole table:
Aim
The aim is to write a Hive query that returns the contents of the entire table.
Pre-requisites
1. Hadoop and Hive must be installed and properly configured on your system.
2. Ensure that Hive is running.
3. You have a Hive table created and populated with data.
Step-by-Step Instructions:
1. Create the table (adjust the schema to match your CSV file):
SQL
CREATE TABLE IF NOT EXISTS my_table (
    id INT,
    name STRING,
    age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
2. Load data from a local file:
SQL
LOAD DATA LOCAL INPATH '/path/to/your/localfile.csv' INTO TABLE my_table;
3. Return the contents of the whole table:
SQL
SELECT * FROM my_table;
Here's an example of how you can execute these queries using Hive:
Shell
#!/bin/bash
# Aim: Load data into a Hive table and return the contents of the table

# Step 1: Path to the local CSV file
local_file_path="/path/to/your/localfile.csv"

# Step 2: Define the Hive query to create the table (adjust the schema to match your CSV)
create_table_query="CREATE TABLE IF NOT EXISTS my_table (id INT, name STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"

# Step 3: Execute the Hive query to create the table
hive -e "$create_table_query"

# Step 4: Define the Hive query to load data into the table
load_data_query="LOAD DATA LOCAL INPATH '$local_file_path' INTO TABLE my_table;"

# Step 5: Execute the Hive query to load the data
hive -e "$load_data_query"

# Step 6: Define the Hive query to select all data from the table
select_all_query="SELECT * FROM my_table;"

# Step 7: Execute the Hive query to return the contents of the table
hive -e "$select_all_query"
echo "Query executed successfully. The contents of the table are displayed above."
Output Solutions
Result / Conclusion
• Result:
o After running the above script, the data from the local file localfile.csv will
be successfully loaded into the Hive table my_table, and the entire contents
of the table will be displayed.
• Conclusion:
o This process demonstrates how to automate the creation of Hive tables,
loading data from local files, and retrieving data using Hive queries within a
shell script. This method is useful for integrating data ingestion and retrieval
tasks into larger data processing pipelines.
Important Notes:
• Replace /path/to/your/localfile.csv with the actual path to your local file.
• Adjust the table schema in the CREATE TABLE statement to match the structure of
your CSV file.
• Ensure that Hadoop and Hive are configured correctly on your system, and the Hive
CLI is accessible from your environment.
6. Usage of Hive Functions

Aim
To demonstrate the usage of built-in Hive functions (such as COUNT, AVG, MAX, MIN, SUM, and CONCAT) on data stored in a Hive table.
Pre-requisites
1. Hadoop and Hive must be installed and properly configured on your system.
2. Ensure that Hive is running.
3. You have a Hive table created and populated with data.
Step-by-Step Instructions:
1. Create the employee table:
SQL
CREATE TABLE IF NOT EXISTS employee (
    id INT,
    name STRING,
    salary FLOAT,
    department STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
2. Load data into the table from a local CSV file:
SQL
LOAD DATA LOCAL INPATH '/path/to/your/employee_data.csv' INTO TABLE employee;
SQL
-- Aim: Demonstrate the usage of Hive functions

-- Step 1: Create the employee table (see the CREATE TABLE statement above)

-- Step 2: Load data into the Hive table from a local CSV file
LOAD DATA LOCAL INPATH '/path/to/your/employee_data.csv' INTO TABLE employee;

-- Example 1: Use the COUNT function to count the total number of employees
SELECT COUNT(*) AS total_employees FROM employee;

-- Example 2: Use the AVG function to find the average salary
SELECT AVG(salary) AS average_salary FROM employee;

-- Example 3: Use the MAX and MIN functions to find the highest and lowest salaries
SELECT MAX(salary) AS highest_salary, MIN(salary) AS lowest_salary FROM employee;

-- Example 4: Use the SUM function to find the total salary paid
SELECT SUM(salary) AS total_salary FROM employee;

-- Example 5: Use the GROUP BY clause with the COUNT function to count employees in each department
SELECT department, COUNT(*) AS employees_count FROM employee GROUP BY department;

-- Example 6: Use the CONCAT function to combine first and last names (assuming first_name and last_name columns)
-- SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM employee;
Output Solutions
Result / Conclusion
• Result:
o After running the above queries, the data from the local
file employee_data.csv will be successfully loaded into the Hive
table employee.
o The queries will demonstrate the usage of various Hive functions and display the results accordingly.
• Conclusion:
o This process demonstrates how to use Hive functions to perform
various operations on data stored in Hive tables. The examples cover
common functions such as COUNT, AVG, MAX, MIN, SUM, CONCAT, UPPER,
and LOWER. These functions are useful for data analysis and aggregation
tasks in Hive.
Important Notes:
7. Installation of MongoDB:
Aim
The aim is to guide you through the installation of MongoDB on your system.
Pre-requisites
Installation on Windows
1. Download MongoDB:
o Go to the MongoDB Download Center.
o Select Windows as the operating system and download the MSI installer.
2. Run the Installer:
o Double-click the downloaded MSI file to start the installation.
o Follow the prompts in the MongoDB Installer.
3. Configure MongoDB:
o Choose the Complete setup type.
o Make sure to install MongoDB as a Service.
o Select the option to install MongoDB Compass (optional GUI tool).
4. Verify Installation:
o Open Command Prompt and run:
Shell
mongod --version
o You should see the version of MongoDB installed.
Installation on macOS
1. Tap the MongoDB Homebrew formulae:
Shell
brew tap mongodb/brew
2. Install MongoDB Community Edition:
Shell
brew install mongodb-community
3. Start the MongoDB service:
Shell
brew services start mongodb-community
4. Verify Installation:
o Run:
Shell
mongod --version
o You should see the version of MongoDB installed.
Installation on Linux (Ubuntu)
1. Import the MongoDB public GPG key:
Shell
curl -fsSL https://www.mongodb.com/static/pgp/server-6.0.asc | sudo gpg --dearmor -o /usr/share/keyrings/mongodb-server-6.0.gpg
2. Add the MongoDB apt repository (adjust the release name to your Ubuntu version):
Shell
echo "deb [ signed-by=/usr/share/keyrings/mongodb-server-6.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/6.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-6.0.list
3. Update the package index:
Shell
sudo apt update
4. Install MongoDB:
Shell
sudo apt install -y mongodb-org
5. Start MongoDB:
o Run:
Shell
sudo systemctl start mongod
6. Verify Installation:
o Run:
Shell
mongod --version
o You should see the version of MongoDB installed.
Output Solutions
1. Verification:
o You should be able to see the version of MongoDB installed by
running mongod --version.
o Example output:
Code
db version v5.0.5
git version: 62e7b76e9d0c41903a1f3e4b8e7d3f2d7f746a86
allocator: tcmalloc
modules: none
build environment:
distarch: x86_64
target_arch: x86_64
Result / Conclusion
• Result:
o MongoDB will be successfully installed on your system.
o You will be able to verify the installation by checking the MongoDB version.
• Conclusion:
o This guide provides step-by-step instructions for installing MongoDB
on different operating systems. Following these steps ensures that
MongoDB is installed correctly and ready for use. MongoDB is a
powerful NoSQL database that can be used for various applications,
including web development, data analysis, and more.
Important Notes:
This guide covers the installation process for MongoDB, a popular NoSQL database,
on Windows, macOS, and Linux systems.
8. Reading CSV File and Loading It into MongoDB

Aim
The aim is to read data from a CSV file and load it into a MongoDB collection using a Python script.
Pre-requisites
sh
pip install pandas pymongo
Below is a Python program that reads data from a CSV file and loads it into a
MongoDB collection:
Python
# Import necessary libraries
import pandas as pd
from pymongo import MongoClient
# Aim: Read data from a CSV file and load it into a MongoDB collection
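The body of the program is missing from this copy. A minimal sketch, assuming an input file data.csv with id, name, and age columns and MongoDB running locally on the default port 27017 (the database and collection names are illustrative):
python
# load_csv_to_mongodb.py - minimal sketch using pandas and pymongo
import pandas as pd
from pymongo import MongoClient

# Step 1: Read the CSV file into a pandas DataFrame
df = pd.read_csv("data.csv")

# Step 2: Connect to the local MongoDB server
client = MongoClient("mongodb://localhost:27017/")
collection = client["example_db"]["people"]

# Step 3: Convert the DataFrame to a list of dictionaries and insert it
records = df.to_dict(orient="records")
result = collection.insert_many(records)

print(f"Inserted {len(result.inserted_ids)} documents into MongoDB.")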
Output Solutions
1. Reading the CSV File:
o The pandas library reads the CSV file into a DataFrame. Sample file contents:
CSV
id,name,age
1,John Doe,25
2,Jane Smith,30
2. Connecting to MongoDB:
o The pymongo library connects to the MongoDB server.
3. Inserting Data into MongoDB:
o The data from the DataFrame is converted to a list of dictionaries.
o The list of dictionaries is inserted into the MongoDB collection.
Result / Conclusion
• Result:
o After running the above script, the data from the CSV file will be
successfully loaded into the specified MongoDB collection.
• Conclusion:
o This process demonstrates how to read data from a CSV file using
the pandas library and load it into a MongoDB collection using
the pymongo library. This method is useful for data migration and
integration tasks, allowing you to easily transfer data from CSV files to
MongoDB for storage and further analysis.
Important Notes:
sh
9. Loading JSON Data into MongoDB

Aim
To demonstrate how to read data from a JSON file and insert it into a MongoDB collection
using Python.
Pre-requisites
1. Software Requirements:
o Python (version 3.6 or later)
o MongoDB (installed locally or using MongoDB Atlas)
o Required Python Libraries: pymongo, json
2. Environment Setup:
o Install MongoDB from the MongoDB Download Center.
o Install Python dependencies:
bash
pip install pymongo
3. Create a JSON file: Prepare a data.json file with sample data. For example:
json
[
{ "name": "Alice", "age": 25, "city": "New York" },
{ "name": "Bob", "age": 30, "city": "Los Angeles" },
{ "name": "Charlie", "age": 35, "city": "Chicago" }
]
Program
python
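The program itself is missing from this copy. A minimal sketch, assuming data.json (above) is in the working directory and MongoDB is running locally; the database and collection names example_db and example_collection follow the verification step shown below:
python
# load_json_to_mongodb.py - minimal sketch using pymongo
import json
from pymongo import MongoClient

# Step 1: Connect to the local MongoDB server
client = MongoClient("mongodb://localhost:27017/")
collection = client["example_db"]["example_collection"]
print("Connected to MongoDB!")

# Step 2: Read the JSON file (a list of documents)
with open("data.json", "r") as f:
    documents = json.load(f)

# Step 3: Insert the documents into the collection
collection.insert_many(documents)
print("Data has been successfully loaded into MongoDB!")

# Step 4: Read the documents back to verify the insert
print("Inserted data:")
for doc in collection.find():
    print(doc)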
Output
1. Program Execution:
o Run the script in a terminal or IDE:
bash
python3 load_json_to_mongodb.py
plaintext
Connected to MongoDB!
Data has been successfully loaded into MongoDB!
Inserted data:
{'_id': ObjectId('...'), 'name': 'Alice', 'age': 25, 'city': 'New York'}
{'_id': ObjectId('...'), 'name': 'Bob', 'age': 30, 'city': 'Los Angeles'}
{'_id': ObjectId('...'), 'name': 'Charlie', 'age': 35, 'city': 'Chicago'}
2. Verification in MongoDB:
o Use the MongoDB shell or GUI (e.g., MongoDB Compass) to verify the inserted data:
bash
mongo
use example_db
db.example_collection.find().pretty()
o Expected result:
json
{
"_id": ObjectId("..."),
"name": "Alice",
"age": 25,
"city": "New York"
}
{
"_id": ObjectId("..."),
"name": "Bob",
"age": 30,
"city": "Los Angeles"
}
{
"_id": ObjectId("..."),
"name": "Charlie",
"age": 35,
"city": "Chicago"
}
Result/Conclusion
• The program successfully connected to MongoDB and read data from the JSON file.
• Data from the JSON file was inserted into a MongoDB collection.
• MongoDB was verified to store the data, which could be retrieved using the find()
method.
• Conclusion: Using Python with the pymongo library, JSON data can be efficiently loaded into
MongoDB, making it a suitable choice for handling structured data in NoSQL databases.
10. Transferring Data from MongoDB to MySQL

Aim
To transfer data from a MongoDB database to a MySQL database. This process is essential in
scenarios where MongoDB is used for unstructured data storage, but data analysis or
processing requires relational database features.
Prerequisites
1. MongoDB and MySQL servers installed and running.
2. Python with the pymongo and mysql-connector-python packages installed (pip install pymongo mysql-connector-python).
Program
Below is a Python program to perform the task, with comments explaining each step:
python
try:
mysql_conn = mysql.connector.connect(
host="localhost", # MySQL host
user="root", # MySQL username
password="your_password", # MySQL password
database="target_database" # MySQL database name
)
mysql_cursor = mysql_conn.cursor()
print("Connected to MySQL successfully!")
except Exception as e:
print(f"Error connecting to MySQL: {e}")
exit()
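Only the MySQL connection block is reproduced above. A minimal end-to-end sketch of the migration, assuming the MongoDB documents contain field1, field2, and field3 (as in the sample document below); the source and target database, collection, and table names are illustrative:
python
# mongodb_to_mysql.py - minimal sketch migrating documents from MongoDB to MySQL
from pymongo import MongoClient
import mysql.connector

# Step 1: Connect to MongoDB and select the source collection
mongo_client = MongoClient("mongodb://localhost:27017/")
mongo_collection = mongo_client["source_database"]["source_collection"]
print("Connected to MongoDB successfully!")

# Step 2: Connect to MySQL (target database)
mysql_conn = mysql.connector.connect(
    host="localhost", user="root", password="your_password", database="target_database"
)
mysql_cursor = mysql_conn.cursor()
print("Connected to MySQL successfully!")

# Step 3: Create the target table if it does not exist
mysql_cursor.execute("""
    CREATE TABLE IF NOT EXISTS migrated_data (
        id VARCHAR(64) PRIMARY KEY,
        field1 VARCHAR(255),
        field2 VARCHAR(255),
        field3 VARCHAR(255)
    )
""")

# Step 4: Copy every MongoDB document into a MySQL row
insert_query = (
    "INSERT INTO migrated_data (id, field1, field2, field3) VALUES (%s, %s, %s, %s)"
)
for doc in mongo_collection.find():
    mysql_cursor.execute(insert_query, (
        str(doc.get("_id")),
        doc.get("field1"),
        doc.get("field2"),
        doc.get("field3"),
    ))

# Step 5: Commit the inserts and close both connections
mysql_conn.commit()
mysql_cursor.close()
mysql_conn.close()
mongo_client.close()
print("Data migrated from MongoDB to MySQL successfully!")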
Output Solutions
1. Sample MongoDB document:
json
{
"_id": "12345",
"field1": "John Doe",
"field2": "Software Engineer",
"field3": "50000"
}
sql
3. Expected Output:
o Console log:
vbnet
bash
Result / Conclusion
1. Data Migration:
o Successfully migrated data from MongoDB to MySQL.
2. Use Case Scenarios:
o Useful for ETL pipelines where data is extracted from MongoDB, transformed, and
loaded into MySQL.
o Suitable for situations requiring structured analysis, reporting, or relational queries
on MongoDB data.
3. Learning Outcomes:
o You learned to interact with two databases programmatically.
o The process emphasized error handling, connection management, and query
execution.