BDT Lab Manual

Course Code: 21F00307
BIG DATA TECHNOLOGIES LABORATORY
MCA III SEMESTER

Week 1: Hadoop Installation


Objective:

 To install and configure Hadoop in a Single Node setup on CentOS 7.


 To install and configure Apache SPARK on CentOS 7.
 To launch a cloud instance for AWS on CentOS 7.

Theory:

Hadoop Overview:

Hadoop is an open-source framework used for storing and processing large data sets in a
distributed computing environment. It consists of the Hadoop Distributed File System
(HDFS) and a processing engine, typically using MapReduce. Hadoop is scalable, allowing
you to start with a single node and grow to thousands of nodes.

Key Components:
 HDFS (Hadoop Distributed File System): A distributed file system designed to run
on commodity hardware. It provides high throughput access to application data.
 MapReduce: A processing model for distributed computing. It divides tasks into
smaller sub-tasks and processes them in parallel.
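
Once HDFS is running (see the procedure below), you interact with it through the hdfs dfs command-line client. The following is a minimal sketch of common HDFS operations; the directory and file names are placeholders.

bash

hdfs dfs -mkdir -p /user/hadoop/input                # create a directory in HDFS
hdfs dfs -put localfile.txt /user/hadoop/input       # copy a local file into HDFS
hdfs dfs -ls /user/hadoop/input                      # list the directory contents
hdfs dfs -cat /user/hadoop/input/localfile.txt       # print a file stored in HDFS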

SPARK Overview:

Apache SPARK is a unified analytics engine for large-scale data processing. Unlike Hadoop's
MapReduce, SPARK provides in-memory cluster computing, making it faster for iterative
algorithms.

Key Features:

 In-Memory Processing: Data is processed in memory, reducing the time taken to write and read from disk.
 Rich API: SPARK provides APIs in Java, Scala, Python, and R.
 Support for Multiple Workloads: It supports batch processing, interactive queries,
stream processing, machine learning, and graph processing.

AWS Overview:

Amazon Web Services (AWS) provides cloud-based services including computing power,
storage, and databases. You can launch and manage virtual servers (EC2 instances) in the
cloud, enabling flexible computing resources without maintaining physical hardware.

Key Components:

 EC2 (Elastic Compute Cloud): Scalable virtual servers in the cloud.


 S3 (Simple Storage Service): Object storage with high scalability, data availability,
security, and performance.

Pre-requisites:

 CentOS 7 installed on the machine or a cloud instance.


 Basic understanding of Linux commands.
 An AWS account (for launching an EC2 instance).

Materials Required:

 A computer with CentOS 7 or access to a cloud service (AWS, Google Cloud, etc.).
 Internet connection to download and install packages.

Procedure:

Step 1: Install CentOS 7

1. Download the CentOS 7 ISO image from the official CentOS website.
2. Create a bootable USB drive using tools like Rufus or Etcher.
3. Boot from the USB drive and follow the on-screen instructions to install CentOS 7.
4. After installation, update the system using the following command:

bash

sudo yum update -y

Step 2: Install Java

Hadoop requires Java to run, so the first step is to install the Java Development Kit (JDK):

bash

sudo yum install java-1.8.0-openjdk-devel -y

Verify the installation by checking the Java version:

bash

java -version

Step 3: Download and Install Hadoop

1. Download Hadoop from the official website:

bash

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

2. Extract the downloaded tarball:

bash

tar -xzvf hadoop-3.3.4.tar.gz

3. Move the extracted files to /usr/local/hadoop:

bash

sudo mv hadoop-3.3.4 /usr/local/hadoop

4. Set environment variables for Hadoop in the .bashrc file:

bash

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk

5. Reload the .bashrc file to apply the changes:


bash

source ~/.bashrc

6. Verify the Hadoop installation by running:

bash

hadoop version

Step 4: Configure Hadoop

1. Edit the core-site.xml file:

bash

sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following configuration:

xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

2. Configure the HDFS by editing the hdfs-site.xml file:

bash

sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following:

xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

3. Format the Hadoop filesystem:

bash

hdfs namenode -format

Step 5: Start Hadoop Services

1. Start the HDFS services:


bash

start-dfs.sh

2. Start the YARN services:

bash

start-yarn.sh

3. Check the Hadoop services by accessing:


o Namenode: http://localhost:9870
o ResourceManager: http://localhost:8088

Step 6: Install Apache SPARK

1. Download SPARK:

bash

wget https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz

2. Extract and move it to /usr/local/spark:

bash

tar -xzvf spark-3.4.1-bin-hadoop3.tgz

sudo mv spark-3.4.1-bin-hadoop3 /usr/local/spark

3. Set environment variables for SPARK in the .bashrc file:

bash

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

4. Reload the .bashrc file:

bash

source ~/.bashrc

5. Verify the installation by running:

bash

spark-shell
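
Optionally, to confirm that SPARK's in-memory processing works end to end, you can pipe a short job into spark-shell. This is a minimal sketch; the range size is arbitrary.

bash

# Count a cached range twice; the second count is served from memory
echo 'val data = spark.range(1, 1000000).cache(); println(data.count()); println(data.count())' | spark-shell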

Step 7: Launch an AWS Cloud Instance

1. Log in to the AWS Management Console.


2. Navigate to EC2 Dashboard and click on "Launch Instance."
3. Select an AMI (Amazon Machine Image), for example, Amazon Linux 2 or CentOS
7.
4. Choose an instance type (e.g., t2.micro for free tier).
5. Configure instance details like VPC, subnet, and security group.
6. Add storage as per your requirement.
7. Review and launch the instance.
8. Once launched, connect to the instance using SSH:

bash

ssh -i "your-key.pem" ec2-user@your-ec2-public-ip

Expected Output:

 A fully functional Hadoop single-node cluster.


 Apache SPARK installed and running.
 An EC2 instance launched and accessible via SSH.

Observations:

 Record any issues or errors encountered during installation.


 Note the configuration settings used.
 Verify that Hadoop and SPARK services are up and running.
 Confirm the connectivity and functioning of the AWS EC2 instance.

Conclusion:

In this experiment, you successfully installed and configured a Hadoop single-node cluster,
installed Apache SPARK, and launched a cloud instance on AWS. These foundational steps
are crucial for setting up a big data processing environment.

Week 2: Design a Distributed Application Using MapReduce
Objective:

 To design and implement a distributed application using the MapReduce programming model that processes log files of a system.
 To identify the users who have logged in for the maximum period on the system.
 To deploy the application on a Hadoop cluster for processing.

Theory:

MapReduce Overview:
MapReduce is a programming model and an associated implementation for processing and
generating large data sets with a parallel, distributed algorithm on a cluster. The model is
based on two main functions: Map and Reduce.

 Map Function: The Map function takes a set of data and converts it into a set of
key/value pairs. The mapper processes each record in the input split and generates a
key-value pair as the output.
 Reduce Function: The Reduce function takes the output from the Map as input and
combines the data tuples (key-value pairs) into a smaller set of key-value pairs. The
reduce operation is applied to the list of key-value pairs generated by the map
function.

The power of MapReduce comes from the ability to parallelize the process of data
manipulation and then reduce the result into a simpler form.

Use Case: Processing Log Files

System log files record user activities, including login times, logout times, and durations. By
analyzing these logs, we can extract useful information, such as identifying the users who
spent the maximum time logged in on the system.

Pre-requisites:

 Hadoop single-node or multi-node cluster setup (from Experiment 1).


 Basic knowledge of Java or Python (depending on the language used for the
MapReduce program).
 Log files in a structured format (for example, containing fields like user ID, login
time, logout time).

Materials Required:

 Hadoop cluster (single-node setup is sufficient).


 Sample log files in text format.
 IDE or text editor to write the MapReduce code.

Procedure:

Step 1: Prepare Sample Log Files

1. Create a sample log file with the following format:

Code
user1,2024-08-17 09:00:00,2024-08-17 12:30:00
user2,2024-08-17 10:00:00,2024-08-17 11:00:00
user1,2024-08-18 14:00:00,2024-08-18 17:00:00
user3,2024-08-17 13:00:00,2024-08-17 16:00:00

o The log file contains three fields: User ID, Login Time, and Logout Time.

Step 2: Write the Map Function


1. The Mapper class processes each log entry and calculates the session duration
(logout time minus login time) for each user.
2. Code example in Java:

java
Code
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        String userID = fields[0];
        try {
            Date loginTime = dateFormat.parse(fields[1]);
            Date logoutTime = dateFormat.parse(fields[2]);
            long duration = logoutTime.getTime() - loginTime.getTime(); // Duration in milliseconds
            context.write(new Text(userID), new LongWritable(duration));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Step 3: Write the Reduce Function

1. The Reducer class aggregates the total session duration for each user by summing up
the durations provided by the mapper.
2. Code example in Java:

java
Code
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long totalDuration = 0;
        for (LongWritable value : values) {
            totalDuration += value.get();
        }
        context.write(key, new LongWritable(totalDuration));
    }
}

Step 4: Combine the Mapper and Reducer

1. Write the driver class to run the MapReduce job.


2. Example:

java
Code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogAnalysis {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "log analysis");

        job.setJarByClass(LogAnalysis.class);
        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

3. Package the code into a JAR file for execution.
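
A typical compile-and-package sequence is sketched below; it assumes the three source files from this week sit in the current directory and that the hadoop command is on your PATH.

bash
Code
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes LogMapper.java LogReducer.java LogAnalysis.java
jar -cvf loganalysis.jar -C classes .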

Step 5: Deploy and Run the MapReduce Job on Hadoop

1. Upload the log file to the Hadoop Distributed File System (HDFS):

bash
Code
hdfs dfs -mkdir -p /user/hadoop/logs
hdfs dfs -put /path/to/logfile.txt /user/hadoop/logs

2. Run the MapReduce job:

bash
Code
hadoop jar loganalysis.jar /user/hadoop/logs /user/hadoop/logs/output
3. The job will process the log file and output the total duration of login time for each
user.

Step 6: Analyze the Output

1. View the output files in HDFS:

bash
Code
hdfs dfs -cat /user/hadoop/logs/output/part-r-00000

2. The output should list the total login time for each user.
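
To answer the original question of which user was logged in the longest, sort the job output by the duration column. This is a minimal sketch assuming the default tab-separated key/value output of TextOutputFormat.

bash
Code
# Sort by total duration (second column), largest first, and show the top user
hdfs dfs -cat /user/hadoop/logs/output/part-r-00000 | sort -t$'\t' -k2 -nr | head -n 1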

Expected Output:

 A list of users with their corresponding total logged-in durations. The output might
look something like this:

Code
user1 23400000
user2 3600000
user3 10800000

Where the numbers represent the total logged-in duration in milliseconds.

Observations:

 Record the input data (log file contents).


 Note the time taken by the MapReduce job to process the log file.
 Observe how the data is split and how the mappers and reducers process it.

Conclusion:

In this experiment, you successfully designed and implemented a distributed application using MapReduce to process system log files. The application identified users who were logged in for the longest durations, demonstrating the effectiveness of the MapReduce programming model for handling large data sets.

Week 3: Design a Distributed Application to Find the Coolest/Hottest Year Using MapReduce
Objective:
 To design and implement a distributed application using the MapReduce
programming model that analyzes weather data to find the coolest and hottest year
from the available dataset.
 To retrieve weather data from the Internet, process it using Hadoop MapReduce, and
determine the year with the highest and lowest average temperatures.

Theory:

Weather Data Analysis:

Weather data analysis is a common application of big data technologies. The data usually
includes information such as temperature, humidity, precipitation, and wind speed, collected
over various time periods. Analyzing historical weather data can help identify trends, such as
finding the coolest or hottest years.

MapReduce in Weather Data Analysis:

MapReduce is well-suited for processing large weather datasets. In this experiment, the Map
function will extract relevant temperature data for each year, and the Reduce function will
aggregate these values to calculate the average temperature for each year. Finally, we'll
determine the years with the minimum and maximum average temperatures.

Pre-requisites:

 Hadoop cluster (single-node or multi-node).


 Basic knowledge of Java or Python for writing MapReduce programs.
 A dataset containing historical weather data (with fields like year, temperature, etc.).

Materials Required:

 Hadoop cluster.
 Sample weather dataset (in text or CSV format).
 IDE or text editor for writing the MapReduce code.

Procedure:

Step 1: Obtain and Prepare the Weather Dataset

1. Download a historical weather dataset, which might look like this:

yaml
Code
2024-01-01,NY,32
2024-01-02,NY,31
2024-01-03,NY,35
...

o The dataset typically includes fields such as Date, Location, and Temperature.
2. Store the dataset in HDFS:
bash
Code
hdfs dfs -put /path/to/weatherdata.txt /user/hadoop/weatherdata

Step 2: Write the Map Function

1. The Mapper class will process each line of the dataset, extract the year and
temperature, and emit a key-value pair where the key is the year and the value is the
temperature.
2. Code example in Java:

java
Code
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WeatherMapper extends Mapper<Object, Text, Text, IntWritable> {

    private SimpleDateFormat yearFormat = new SimpleDateFormat("yyyy");

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        try {
            Date date = new SimpleDateFormat("yyyy-MM-dd").parse(fields[0]);
            String year = yearFormat.format(date);
            int temperature = Integer.parseInt(fields[2]);
            context.write(new Text(year), new IntWritable(temperature));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Step 3: Write the Reduce Function

1. The Reducer class will calculate the average temperature for each year by summing
up the temperatures and dividing by the number of records for that year.
2. Code example in Java:

java
Code
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WeatherReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        int count = 0;
        for (IntWritable val : values) {
            sum += val.get();
            count++;
        }
        int average = sum / count;
        context.write(key, new IntWritable(average));
    }
}

Step 4: Find the Coolest and Hottest Year

1. After computing the average temperature for each year, write another MapReduce job
(or a simple script) to determine the years with the minimum and maximum average
temperatures.
2. Alternatively, this could be done by processing the output file of the previous
MapReduce job using a simple program or script.

Step 5: Combine the Mapper and Reducer

1. Write the driver class to run the MapReduce job.


2. Example:

java
Code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeatherAnalysis {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "weather analysis");

        job.setJarByClass(WeatherAnalysis.class);
        job.setMapperClass(WeatherMapper.class);
        job.setReducerClass(WeatherReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
3. Package the code into a JAR file for execution.

Step 6: Deploy and Run the MapReduce Job on Hadoop

1. Run the MapReduce job:

bash
Code
hadoop jar weatheranalysis.jar /user/hadoop/weatherdata
/user/hadoop/weatheroutput

2. This job will calculate the average temperature for each year.

Step 7: Analyze the Output

1. View the output files in HDFS:

bash
Code
hdfs dfs -cat /user/hadoop/weatheroutput/part-r-00000

2. The output should list each year and its corresponding average temperature.
3. Identify the coolest and hottest years by analyzing the output data.
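
As a lightweight alternative to a second MapReduce job, the coolest and hottest years can be picked out of the output with a shell one-liner. This sketch assumes the default tab-separated key/value output.

bash
Code
# Coolest year (lowest average) and hottest year (highest average)
hdfs dfs -cat /user/hadoop/weatheroutput/part-r-00000 | sort -t$'\t' -k2 -n | head -n 1
hdfs dfs -cat /user/hadoop/weatheroutput/part-r-00000 | sort -t$'\t' -k2 -n | tail -n 1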

Expected Output:

 A list of years with their corresponding average temperatures. The final output should
indicate the coolest and hottest years.

Example:

yaml
Code
2022 56
2023 60
2024 63

Where the numbers represent the average temperatures.

Observations:

 Record the input data (weather data contents).


 Note the time taken by the MapReduce job to process the data.
 Observe how the data is split and how the mappers and reducers process it.

Conclusion:

In this experiment, you successfully designed and implemented a distributed application using MapReduce to process weather data and determine the coolest and hottest years. This experiment demonstrated the power of MapReduce in handling and analyzing large datasets efficiently.
Week 4: Write an Application Using HBase
and HiveQL for Flight Information System
Objective:

 To design and implement a flight information system using HBase for data storage
and HiveQL for querying and analyzing flight data.
 To perform various operations on the flight data such as creating, updating, and
querying tables in HBase, and joining tables and performing aggregations using
HiveQL.

Theory:

HBase Overview:

HBase is a distributed, scalable, big data store that runs on top of the Hadoop Distributed File
System (HDFS). It is modeled after Google’s Bigtable and provides a fault-tolerant way of
storing large quantities of sparse data. HBase is particularly suitable for real-time read/write
access to large datasets.

 HBase Table Structure:


o Row Key: Uniquely identifies a row in the table.
o Column Family: A logical division within a table that groups related
columns.
o Column Qualifier: A specific column within a column family.
o Cell: The intersection of a row and a column (identified by the column
qualifier).

Hive Overview:

Apache Hive is a data warehouse software built on top of Hadoop that provides data
summarization, query, and analysis capabilities. HiveQL is a query language similar to SQL,
used to query and manage large datasets in HDFS. Hive abstracts the complexity of Hadoop
by allowing users to query data using a SQL-like interface.

Use Case: Flight Information System

A flight information system stores data about flights, such as flight numbers, departure and
arrival times, destinations, and statuses. This experiment focuses on using HBase to store
flight data and HiveQL to query and analyze the data.

Pre-requisites:
 HBase and Hive installed and configured on your Hadoop cluster.
 Basic understanding of HBase and HiveQL commands.
 Sample flight data in CSV or text format.

Materials Required:

 Hadoop cluster with HBase and Hive installed.


 Sample flight data file.
 Command-line interface (CLI) for interacting with HBase and Hive.

Procedure:

Step 1: Create HBase Table for Flight Data

1. Open the HBase shell:

bash
Code
hbase shell

2. Create a table in HBase for storing flight information. The table will have the
following schema:
o Table Name: FlightInfo
o Row Key: Flight Number (e.g., FL123)
o Column Families: FlightDetails (to store flight-related details such as
departure and arrival times, destination, etc.)
3. HBase command to create the table:

bash
Code
create 'FlightInfo', 'FlightDetails'

4. Verify that the table was created:

bash
Code
list

Step 2: Insert Data into HBase Table

1. Insert sample flight data into the FlightInfo table. Each flight will have a unique
row key (flight number).
2. HBase command to insert data:

bash
Code
put 'FlightInfo', 'FL123', 'FlightDetails:Departure', '2024-08-17 09:00:00'
put 'FlightInfo', 'FL123', 'FlightDetails:Arrival', '2024-08-17 12:30:00'
put 'FlightInfo', 'FL123', 'FlightDetails:Destination', 'New York'
put 'FlightInfo', 'FL124', 'FlightDetails:Departure', '2024-08-17 10:00:00'
put 'FlightInfo', 'FL124', 'FlightDetails:Arrival', '2024-08-17 13:45:00'
put 'FlightInfo', 'FL124', 'FlightDetails:Destination', 'Los Angeles'

3. Verify the data insertion by scanning the table:

bash
Code
scan 'FlightInfo'

Step 3: Create an External Hive Table for Flight Data

1. Launch the Hive shell:

bash
Code
hive

2. Create an external table in Hive that links to the HBase table. The Hive table will
reference the data stored in HBase.
3. HiveQL command to create the external table:

sql
Code
CREATE EXTERNAL TABLE FlightInfo_Hive(
flight_number STRING,
departure_time STRING,
arrival_time STRING,
destination STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
":key,FlightDetails:Departure,FlightDetails:Arrival,FlightDetails:Des
tination"
)
TBLPROPERTIES ("hbase.table.name" = "FlightInfo");

4. Verify the table creation:

sql
Code
SHOW TABLES;

Step 4: Perform Data Operations Using HiveQL

1. Load Data into Hive Table: If necessary, load additional data into the Hive table
using the LOAD DATA command.
2. Query Data in Hive: Perform various queries on the flight data using HiveQL.
Examples:
o List all flights:

sql
Code
SELECT * FROM FlightInfo_Hive;
o Find flights to a specific destination:

sql
Code
SELECT flight_number, departure_time FROM FlightInfo_Hive WHERE
destination = 'New York';

o Find flights with the earliest departure time:

sql
Code
SELECT flight_number, departure_time FROM FlightInfo_Hive
ORDER BY departure_time ASC LIMIT 1;

Step 5: Perform Aggregation and Join Operations Using HiveQL

1. Join Tables: If you have another table (e.g., Airports) with airport codes and names,
you can join it with the FlightInfo_Hive table.

Example:

sql
Code
SELECT f.flight_number, a.airport_name
FROM FlightInfo_Hive f
JOIN Airports a ON (f.destination = a.airport_code);

2. Aggregation Operations:
o Calculate the number of flights to each destination:

sql
Code
SELECT destination, COUNT(*) FROM FlightInfo_Hive GROUP BY
destination;

o Calculate the average flight duration for each destination:

sql
Code
SELECT destination, AVG(UNIX_TIMESTAMP(arrival_time) -
UNIX_TIMESTAMP(departure_time)) AS avg_duration
FROM FlightInfo_Hive
GROUP BY destination;

Step 6: Indexing and Optimization

1. Create an index on the destination field to speed up queries (note: Hive indexes were removed in Hive 3.0, so this step applies only to Hive 1.x/2.x; on newer versions, use partitioning or materialized views instead):

sql
Code
CREATE INDEX idx_destination ON TABLE FlightInfo_Hive(destination) AS
'COMPACT' WITH DEFERRED REBUILD;

2. Rebuild the index:


sql
Code
ALTER INDEX idx_destination ON FlightInfo_Hive REBUILD;

Expected Output:

 The HBase table FlightInfo will store all flight data.


 The Hive external table FlightInfo_Hive will allow SQL-like queries on the data
stored in HBase.
 Example queries will return the flight details, flights to specific destinations, and
aggregated data like the number of flights or average duration.

Observations:

 Note how data is stored in HBase and accessed through HiveQL.


 Observe the differences in performance when querying data directly in HBase versus
using HiveQL.
 Record the output of different queries and the time taken to execute them.

Conclusion:

In this experiment, you successfully implemented a flight information system using HBase
for data storage and HiveQL for querying and analyzing the data. This experiment
demonstrated the integration of HBase and Hive for managing and processing large datasets
efficiently.

Week 5: Displaying Hierarchical Structure of Data Using Pig
Objective:

 To design and implement a solution using Apache Pig to display the hierarchical
structure of data.
 To generate trees, graphs, and network visualizations of data, and perform operations
such as sorting, grouping, joining, and filtering using Pig Latin scripts.

Theory:

Apache Pig Overview:

Apache Pig is a high-level platform for processing large datasets in Hadoop. The language
used to express data flows in Pig is called Pig Latin. Pig Latin abstracts the complexity of
Hadoop MapReduce, making it easier for developers to perform data transformations,
analysis, and processing.
Hierarchical Data Structure:

Hierarchical data structures represent the organization of data in a tree-like model, where
each data point can have a parent-child relationship. Examples include file systems,
organizational charts, and product categories.

Use Case:

Consider a dataset representing a company's organizational hierarchy, where each employee reports to a manager. The goal is to use Pig to display this hierarchy, sort employees, and perform additional operations such as grouping by departments or filtering based on job titles.

Pre-requisites:

 Hadoop cluster with Pig installed.


 Basic understanding of Pig Latin commands.
 Sample hierarchical data file (e.g., employee data with manager relationships).

Materials Required:

 Hadoop cluster.
 Sample hierarchical data in text or CSV format.
 Command-line interface (CLI) for interacting with Pig.

Procedure:

Step 1: Prepare the Hierarchical Dataset

1. Create a sample hierarchical dataset (e.g., organizational chart) in a text file:

Code
E001,John,Manager,Marketing
E002,Jane,Lead,Sales,E001
E003,Robert,Executive,Marketing,E001
E004,Michael,Lead,IT,E001
E005,Susan,Executive,Sales,E002

oThe columns represent Employee ID, Name, Title, Department, and Manager
ID.
2. Load the dataset into HDFS:

bash
Code
hdfs dfs -put /path/to/employee_data.txt /user/hadoop/employee_data

Step 2: Write a Pig Script to Load and Display Hierarchical Data

1. Open a text editor and create a Pig script (employee_hierarchy.pig).


2. Load the data into Pig:

pig
Code
EMPLOYEE_DATA = LOAD '/user/hadoop/employee_data' USING
PigStorage(',')
AS (emp_id:chararray, name:chararray, title:chararray,
dept:chararray, mgr_id:chararray);

3. Group the data by Manager ID to show the hierarchical structure:

pig
Code
GROUPED_BY_MANAGER = GROUP EMPLOYEE_DATA BY mgr_id;

4. Sort the employee data by Department (ORDER works on the flat relation, not on the grouped one):

pig
Code
SORTED_EMPLOYEES = ORDER EMPLOYEE_DATA BY dept ASC;

5. Perform a FOREACH operation to display the hierarchical structure:

pig
Code
HIERARCHY = FOREACH GROUPED_BY_MANAGER GENERATE
    group AS manager_id,
    FLATTEN(EMPLOYEE_DATA.(name, title)) AS (employee_name, employee_title);

6. Store the results:

pig
Code
STORE HIERARCHY INTO '/user/hadoop/employee_hierarchy_output' USING
PigStorage(',');

Step 3: Run the Pig Script

1. Run the Pig script in local or MapReduce mode:

bash
Code
pig -x mapreduce employee_hierarchy.pig

2. Pig will process the hierarchical data, group employees by their managers, and sort
them by department.

Step 4: Analyze the Output

1. Check the output stored in HDFS:

bash
Code
hdfs dfs -cat /user/hadoop/employee_hierarchy_output/part-r-00000

2. The output will display the hierarchical structure of the organization, showing each
manager and their respective employees.
Step 5: Perform Additional Operations (Optional)

1. Filtering Data: You can filter employees based on criteria such as job title or
department.

pig
Code
FILTERED_EMPLOYEES = FILTER EMPLOYEE_DATA BY dept == 'Marketing';

2. Joining Data: If you have another dataset (e.g., DEPARTMENT_DATA), you can perform
a join operation to enrich the employee data.

pig
Code
JOINED_DATA = JOIN EMPLOYEE_DATA BY dept, DEPARTMENT_DATA BY
dept_name;

3. Generating Visualizations: Although Pig itself doesn't provide direct support for
visualizations, you can export the results to a format like CSV or JSON and use
external tools (like D3.js, Gephi, or Graphviz) to create hierarchical trees or network
graphs.

Expected Output:

 A hierarchical representation of the employees grouped by their managers.

Example output (manager ID, employee name, employee title; John has no manager, so the first field is empty):

Code
,John,Manager
E001,Jane,Lead
E001,Robert,Executive
E001,Michael,Lead
E002,Susan,Executive

Observations:

 Observe how Pig simplifies the process of loading, transforming, and analyzing
hierarchical data.
 Note the differences between the original and processed data, particularly in terms of
grouping and sorting.
 Record the time taken by Pig to process the dataset.

Conclusion:

In this experiment, you successfully used Apache Pig to process hierarchical data and display
it in a structured format. This experiment demonstrated the ability of Pig to efficiently handle
complex data structures and perform operations like sorting, grouping, and filtering.
Week 6: Working with JSON Data and
Word Count on Tweets Using Pig
Objective:

 To parse JSON data using Pig, perform word count operations on text data such as
tweets, and analyze the results.
 To demonstrate how to use Pig for processing semi-structured data (like JSON) and
perform text analysis operations.

Theory:

JSON Overview:

JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for
humans to read and write, and easy for machines to parse and generate. It is commonly used
to transmit data between a server and a web application, as well as to store structured data in
NoSQL databases.

Pig and JSON:

Apache Pig provides built-in support for working with JSON data through its JsonLoader
and JsonStorage functions. These functions allow users to load, store, and process JSON
data in a Pig script.

Use Case: Tweet Analysis

In this experiment, we will work with a dataset of tweets stored in JSON format. Each tweet
contains text that can be analyzed to count the frequency of words. This experiment will
demonstrate how to parse the JSON data, extract the tweet text, and perform a word count
operation using Pig.

Pre-requisites:

 Hadoop cluster with Pig installed.


 Basic understanding of Pig Latin commands.
 Sample JSON file containing tweets.

Materials Required:

 Hadoop cluster.
 Sample JSON file containing tweets.
 Command-line interface (CLI) for interacting with Pig.
Procedure:

Step 1: Prepare the JSON Dataset

1. Create a sample JSON dataset containing tweets in a file called tweets.json.


Example content:

json
Code
{"user": "Alice", "tweet": "Learning Apache Pig is fun!"}
{"user": "Bob", "tweet": "Pig Latin makes data processing easy."}
{"user": "Charlie", "tweet": "Big data analysis with Pig and
Hadoop."}

2. Load the dataset into HDFS:

bash
Code
hdfs dfs -put /path/to/tweets.json /user/hadoop/tweets

Step 2: Write a Pig Script to Parse JSON Data and Extract Tweet Text

1. Open a text editor and create a Pig script (tweet_analysis.pig).


2. Load the JSON data into Pig using the JsonLoader function:

pig
Code
TWEETS = LOAD '/user/hadoop/tweets' USING JsonLoader('user:chararray,
tweet:chararray');

3. Extract the tweet text from the loaded data:

pig
Code
TWEET_TEXT = FOREACH TWEETS GENERATE tweet;

Step 3: Perform Word Count on the Tweets

1. Split the tweet text into individual words:

pig
Code
WORDS = FOREACH TWEET_TEXT GENERATE FLATTEN(TOKENIZE(tweet)) AS word;

2. Group the words to count occurrences:

pig
Code
GROUPED_WORDS = GROUP WORDS BY word;

3. Count the number of occurrences of each word:

pig
Code
WORD_COUNT = FOREACH GROUPED_WORDS GENERATE group AS word,
COUNT(WORDS) AS count;

4. Sort the words by their count in descending order:

pig
Code
SORTED_WORD_COUNT = ORDER WORD_COUNT BY count DESC;

5. Store the results in HDFS:

pig
Code
STORE SORTED_WORD_COUNT INTO '/user/hadoop/word_count_output' USING
PigStorage(',');

Step 4: Run the Pig Script

1. Run the Pig script in local or MapReduce mode:

bash
Code
pig -x mapreduce tweet_analysis.pig

2. Pig will process the JSON data, extract the tweet text, and perform a word count
operation.

Step 5: Analyze the Output

1. Check the output stored in HDFS:

bash
Code
hdfs dfs -cat /user/hadoop/word_count_output/part-r-00000

2. The output will display the words and their respective counts, sorted in descending
order of frequency.

Example output:

Code
Pig,2
data,2
Apache,1
Latin,1
analysis,1
...

Step 6: JSON Parsing in Python (Optional)

1. Python Snippet for JSON Parsing: As an optional task, you can use Python to parse
JSON data and prepare it for Pig processing. Below is an example Python script that
reads a JSON file and extracts the tweet text:
python
Code
import json

# Load JSON data from file (one JSON object per line)
with open('/path/to/tweets.json') as file:
    tweets = [json.loads(line) for line in file]

# Extract tweet text
tweet_texts = [tweet['tweet'] for tweet in tweets]

# Print extracted text
for text in tweet_texts:
    print(text)

2. Integrating Python with Pig: You can combine the Python script with Pig
processing by outputting the extracted text to a file and then loading that file into Pig
for further analysis.
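
One simple way to wire the two together is sketched below; the script name extract_tweets.py is hypothetical and refers to the Python snippet above, which prints one tweet per line.

bash
Code
# Run the Python script, store its output in HDFS, then process it with Pig
python extract_tweets.py > tweet_texts.txt
hdfs dfs -put tweet_texts.txt /user/hadoop/tweet_texts
# In the Pig script, the plain-text version can then be loaded with:
# TWEET_TEXT = LOAD '/user/hadoop/tweet_texts' USING TextLoader() AS (tweet:chararray);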

Expected Output:

 The Pig script will output a list of words from the tweets along with their counts,
sorted in descending order of frequency.

Example:

Code
Pig,2
data,2
Apache,1
Latin,1
analysis,1
...

Observations:

 Observe how Pig handles JSON data and performs text processing.
 Note the frequency of different words in the tweets and consider what this reveals
about the content.
 Record the time taken by Pig to process the JSON data and perform the word count.

Conclusion:

In this experiment, you successfully used Apache Pig to parse JSON data and perform a word
count operation on tweet text. This experiment demonstrated the flexibility of Pig in handling
semi-structured data and performing text analysis tasks efficiently.
Week 7: Reading Different Types of Data
Sets Using R
Objective:

 To demonstrate how to read different types of datasets such as .txt, .csv, and .xml
files using the R programming language.
 To perform basic operations on these datasets such as writing to disk, reading from
web locations, and using R objects and functions for data manipulation and storage.

Theory:

R Overview:

R is a powerful programming language and environment used for statistical computing and
graphics. It is widely used for data analysis, data manipulation, and visualization. R provides
a rich set of functions to work with various types of data formats, making it a versatile tool
for data scientists and analysts.

Data Formats:

 Text Files (.txt): Plain text files that can contain structured or unstructured data.
They are typically used for simple data storage and transfer.
 Comma-Separated Values (.csv): A common data format where each line
represents a row of data, with columns separated by commas. CSV files are widely
used for tabular data storage and exchange.
 Extensible Markup Language (.xml): A markup language that defines rules for
encoding documents in a format that is both human-readable and machine-readable.
XML is often used to store and transport data.

Use Case:

This experiment will cover how to read these different data formats using R, perform basic
manipulations, and store the results to a specified location on disk. The experiment will also
cover reading data from web locations, and working with R objects and functions to
manipulate the data.

Pre-requisites:

 R installed on your system.


 Basic understanding of R syntax and functions.
 Sample .txt, .csv, and .xml files.

Materials Required:

 R software.
 Sample data files (data.txt, data.csv, data.xml).
 Internet connection for reading data from web locations.

Procedure:

Step 1: Setting Up the Environment

1. Open RStudio or any other R environment you prefer.


2. Set the Working Directory: This is the directory where your data files are located.

R
Code
setwd("path/to/your/directory")

Step 2: Reading a .txt File

1. Create a sample text file (data.txt) with the following content:

r
Code
Name Age Gender
Alice 28 F
Bob 34 M
Charlie 25 M

2. Read the text file into R:

R
Code
txt_data <- read.table("data.txt", header = TRUE, sep = " ")
print(txt_data)

o header = TRUE: Indicates that the first row contains column names.
o sep = " ": Specifies that the columns are separated by spaces.
3. Write the data back to a different text file:

R
Code
write.table(txt_data, "output_data.txt", sep = "\t", row.names =
FALSE)

Step 3: Reading a .csv File

1. Create a sample CSV file (data.csv) with the following content:

r
Code
Name,Age,Gender
Alice,28,F
Bob,34,M
Charlie,25,M

2. Read the CSV file into R:

R
Code
csv_data <- read.csv("data.csv", header = TRUE)
print(csv_data)

3. Write the data back to a different CSV file:

R
Code
write.csv(csv_data, "output_data.csv", row.names = FALSE)

Step 4: Reading an .xml File

1. Create a sample XML file (data.xml) with the following content:

xml
Code
<dataset>
<record>
<Name>Alice</Name>
<Age>28</Age>
<Gender>F</Gender>
</record>
<record>
<Name>Bob</Name>
<Age>34</Age>
<Gender>M</Gender>
</record>
<record>
<Name>Charlie</Name>
<Age>25</Age>
<Gender>M</Gender>
</record>
</dataset>

2. Read the XML file into R:

R
Code
library(XML)
xml_data <- xmlTreeParse("data.xml", useInternalNodes = TRUE)
rootNode <- xmlRoot(xml_data)
print(rootNode)

3. Extract specific data from the XML:

R
Code
names <- xpathSApply(rootNode, "//Name", xmlValue)
ages <- xpathSApply(rootNode, "//Age", xmlValue)
genders <- xpathSApply(rootNode, "//Gender", xmlValue)

df <- data.frame(Name = names, Age = ages, Gender = genders)


print(df)

4. Write the extracted data to a CSV file:

R
Code
write.csv(df, "output_data_from_xml.csv", row.names = FALSE)

Step 5: Reading Data from Web Locations

1. Read data directly from a web location (e.g., a CSV file hosted online):

R
Code
web_data <- read.csv("http://example.com/data.csv")
print(web_data)

2. Perform operations on the downloaded data similar to what you did with local
files.

Step 6: Using R Objects and Functions

1. Use basic R objects to perform operations on your data:

R
Code
summary(df) # Summary statistics for the dataframe

2. Perform calculations using R's built-in functions:

R
Code
mean_age <- mean(as.numeric(df$Age))
print(mean_age)

3. Save R objects to a file for later use:

R
Code
save(df, file = "data_frame.RData")

Expected Output:

 Successfully read and wrote data from different file formats (.txt, .csv, .xml).
 Extracted and manipulated data from XML files and performed basic operations on it.
 Downloaded and processed data from web locations.
 Used R functions to perform statistical operations and save R objects for future use.

Observations:

 Observe how R simplifies the process of reading and writing different data formats.
 Note the differences in syntax and functions used to handle each data format.
 Record any challenges or errors encountered during the data import/export process.

Conclusion:
In this experiment, you successfully demonstrated the ability to read, manipulate, and store
data in various formats using R. This experiment highlighted R's versatility in handling
diverse data formats and performing essential data operations.

Week 8: Implementing Data Visualization Using R
Objective:

 To implement data visualization techniques in R to analyze data distributions using various plots, such as box plots and scatter plots.
 To identify outliers in the dataset using visual techniques and to display data distributions using histograms, bar charts, and pie charts.

Theory:

Data Visualization:

Data visualization is the graphical representation of data to help communicate information clearly and efficiently. It allows for a better understanding of patterns, trends, and outliers within large datasets.

Common Visualization Techniques:

1. Box Plot: A box plot is used to display the distribution of data based on a five-
number summary: minimum, first quartile, median, third quartile, and maximum. It is
particularly useful for identifying outliers.
2. Scatter Plot: A scatter plot is used to represent the relationship between two
continuous variables. It helps in identifying correlations, patterns, and outliers in the
data.
3. Histogram: A histogram is used to represent the distribution of a single numeric
variable by dividing the data into bins and plotting the frequency of each bin.
4. Bar Chart: A bar chart is used to compare categorical data. The height of each bar
represents the frequency or value associated with that category.
5. Pie Chart: A pie chart is used to represent the proportion of categories in a whole. It
is divided into slices, where each slice represents a category's contribution to the total.

Pre-requisites:

 R installed on your system.


 Basic understanding of R and data visualization concepts.
 Sample dataset for visualization.

Materials Required:

 R software.
 Sample dataset (e.g., mtcars, iris, or any other dataset of your choice).
 R packages: ggplot2 (for advanced visualizations).

Procedure:

Step 1: Setting Up the Environment

1. Open RStudio or any other R environment you prefer.


2. Load the necessary libraries:

R
Code
library(ggplot2)

Step 2: Load the Dataset

1. Use an inbuilt dataset like mtcars or load your own dataset:

R
Code
data(mtcars)
df <- mtcars
head(df)

Step 3: Create a Box Plot

1. Plot a box plot to visualize the distribution of a continuous variable, such as mpg
(miles per gallon):

R
Code
boxplot(df$mpg, main="Box Plot of Miles per Gallon", ylab="Miles per
Gallon (mpg)", col="lightblue")

2. Box Plot using ggplot2:

R
Code
ggplot(df, aes(x=factor(0), y=mpg)) +
geom_boxplot(fill="lightblue") +
labs(title="Box Plot of Miles per Gallon", y="Miles per Gallon
(mpg)") +
theme_minimal()

Step 4: Create a Scatter Plot

1. Plot a scatter plot to show the relationship between mpg and hp (horsepower):

R
Code
plot(df$mpg, df$hp, main="Scatter Plot of MPG vs HP", xlab="Miles per
Gallon (mpg)", ylab="Horsepower (hp)", pch=19, col="darkgreen")

2. Scatter Plot using ggplot2:


R
Code
ggplot(df, aes(x=mpg, y=hp)) +
geom_point(color="darkgreen", size=3) +
labs(title="Scatter Plot of MPG vs HP", x="Miles per Gallon
(mpg)", y="Horsepower (hp)") +
theme_minimal()

Step 5: Create a Histogram

1. Plot a histogram to visualize the distribution of mpg:

R
Code
hist(df$mpg, main="Histogram of Miles per Gallon", xlab="Miles per
Gallon (mpg)", col="lightcoral", breaks=10)

2. Histogram using ggplot2:

R
Code
ggplot(df, aes(x=mpg)) +
geom_histogram(binwidth=2, fill="lightcoral", color="black") +
labs(title="Histogram of Miles per Gallon", x="Miles per Gallon
(mpg)") +
theme_minimal()

Step 6: Create a Bar Chart

1. Plot a bar chart to visualize the frequency of the number of cylinders (cyl):

R
Code
barplot(table(df$cyl), main="Bar Chart of Cylinder Count",
xlab="Number of Cylinders", ylab="Frequency", col="lightgreen")

2. Bar Chart using ggplot2:

R
Code
ggplot(df, aes(x=factor(cyl))) +
geom_bar(fill="lightgreen") +
labs(title="Bar Chart of Cylinder Count", x="Number of
Cylinders", y="Frequency") +
theme_minimal()

Step 7: Create a Pie Chart

1. Plot a pie chart to visualize the proportion of different cylinder counts (cyl):

R
Code
cyl_count <- table(df$cyl)
pie(cyl_count, main="Pie Chart of Cylinder Count", col=c("red",
"blue", "green"))
2. Pie Chart using ggplot2 (requires transforming data):

R
Code
cyl_df <- data.frame(cyl = names(cyl_count), count =
as.vector(cyl_count))
ggplot(cyl_df, aes(x="", y=count, fill=cyl)) +
geom_bar(stat="identity", width=1) +
coord_polar("y") +
labs(title="Pie Chart of Cylinder Count") +
theme_minimal()

Step 8: Identifying Outliers

1. Use the box plot created earlier to identify outliers in the mpg variable. Outliers
will appear as individual points outside the whiskers of the box plot.
2. Use the scatter plot to visually inspect for outliers in the relationship between mpg
and hp.
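
To complement the visual inspection, the points flagged as outliers in the box plot can also be listed numerically. This is a minimal sketch using base R's boxplot.stats() and the standard 1.5 * IQR rule.

R
Code
# Values outside the box plot whiskers (1.5 * IQR rule)
outliers <- boxplot.stats(df$mpg)$out
print(outliers)

# Equivalent manual check using the quartiles
q <- quantile(df$mpg, c(0.25, 0.75))
iqr <- q[2] - q[1]
print(df$mpg[df$mpg < q[1] - 1.5 * iqr | df$mpg > q[2] + 1.5 * iqr])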

Expected Output:

 Box Plot: Displays the distribution of mpg, showing the median, quartiles, and
outliers.
 Scatter Plot: Shows the relationship between mpg and hp, helping identify
correlations and outliers.
 Histogram: Displays the distribution of mpg, helping visualize the spread and
concentration of values.
 Bar Chart: Shows the frequency of different cylinder counts in the dataset.
 Pie Chart: Displays the proportion of different cylinder counts in a visual, circular
format.
 Outlier Identification: Highlights any data points that fall outside the expected range
in the box plot and scatter plot.

Observations:

 Observe the distribution of data and identify any patterns or trends.


 Note the presence of any outliers and how they might affect data analysis.
 Compare different visualization techniques to determine which is most effective for
conveying specific types of data.

Conclusion:

In this experiment, you successfully implemented various data visualization techniques in R, including box plots, scatter plots, histograms, bar charts, and pie charts. You also identified outliers in the dataset using visual methods, which is crucial for effective data analysis.
