BDT Lab Manual
Big Data Technologies Laboratory
Theory:
Hadoop Overview:
Hadoop is an open-source framework used for storing and processing large data sets in a
distributed computing environment. It consists of the Hadoop Distributed File System
(HDFS) and a processing engine, typically using MapReduce. Hadoop is scalable, allowing
you to start with a single node and grow to thousands of nodes.
Key Components:
HDFS (Hadoop Distributed File System): A distributed file system designed to run
on commodity hardware. It provides high throughput access to application data.
MapReduce: A processing model for distributed computing. It divides tasks into
smaller sub-tasks and processes them in parallel.
SPARK Overview:
Apache SPARK is a unified analytics engine for large-scale data processing. Unlike Hadoop's
MapReduce, SPARK provides an in-memory cluster computing, making it faster for iterative
algorithms.
Key Features:
AWS Overview:
Amazon Web Services (AWS) provides cloud-based services including computing power,
storage, and databases. You can launch and manage virtual servers (EC2 instances) in the
cloud, enabling flexible computing resources without maintaining physical hardware.
Key Components:
Pre-requisites:
Materials Required:
A computer with CentOS 7 or access to a cloud service (AWS, Google Cloud, etc.).
Internet connection to download and install packages.
Procedure:
1. Download the CentOS 7 ISO image from the official CentOS website.
2. Create a bootable USB drive using tools like Rufus or Etcher.
3. Boot from the USB drive and follow the on-screen instructions to install CentOS 7.
4. After installation, update the system using the following command:
bash
sudo yum update -y
Hadoop requires Java to run, so the first step is to install the Java Development Kit (JDK):
bash
sudo yum install java-1.8.0-openjdk-devel -y
Verify the Java installation:
bash
java -version
Download Hadoop from the Apache archive:
bash
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
Extract the archive and move it to /usr/local/hadoop:
bash
tar -xzf hadoop-3.3.4.tar.gz
sudo mv hadoop-3.3.4 /usr/local/hadoop
Add the following environment variables to ~/.bashrc, then apply the changes:
bash
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
source ~/.bashrc
Verify the Hadoop installation:
bash
hadoop version
Edit the core-site.xml file ($HADOOP_HOME/etc/hadoop/core-site.xml) and add the following configuration:
xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Edit the hdfs-site.xml file ($HADOOP_HOME/etc/hadoop/hdfs-site.xml) and add the following configuration:
xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
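Before starting the Hadoop daemons for the first time, format the NameNode (required once on a fresh installation):
bash
hdfs namenode -format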
Start the HDFS daemons:
bash
start-dfs.sh
Start YARN:
bash
start-yarn.sh
1. Download SPARK:
bash
wget https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
2. Extract the archive and move it to /usr/local/spark:
bash
tar -xzf spark-3.4.1-bin-hadoop3.tgz
sudo mv spark-3.4.1-bin-hadoop3 /usr/local/spark
3. Add the following lines to ~/.bashrc:
bash
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
4. Apply the changes:
bash
source ~/.bashrc
5. Verify the installation by launching the Spark shell:
bash
spark-shell
Expected Output:
Observations:
Conclusion:
In this experiment, you successfully installed and configured a Hadoop single-node cluster,
installed Apache SPARK, and launched a cloud instance on AWS. These foundational steps
are crucial for setting up a big data processing environment.
Theory:
MapReduce Overview:
MapReduce is a programming model and an associated implementation for processing and
generating large data sets with a parallel, distributed algorithm on a cluster. The model is
based on two main functions: Map and Reduce.
Map Function: The Map function takes a set of data and converts it into a set of
key/value pairs. The mapper processes each record in the input split and generates a
key-value pair as the output.
Reduce Function: The Reduce function takes the output from the Map as input and
combines the data tuples (key-value pairs) into a smaller set of key-value pairs. The
reduce operation is applied to the list of key-value pairs generated by the map
function.
The power of MapReduce comes from the ability to parallelize the process of data
manipulation and then reduce the result into a simpler form.
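To make the model concrete, the following self-contained sketch uses plain Java collections (not the Hadoop API) to mimic the map, shuffle, and reduce phases for a simple word count; the input strings are made up for illustration.
java
import java.util.*;
import java.util.stream.*;

public class WordCountIllustration {
    public static void main(String[] args) {
        List<String> records = Arrays.asList("big data", "big ideas", "data flows");

        // Map: emit a (word, 1) pair for every word in every record.
        List<Map.Entry<String, Integer>> pairs = records.stream()
                .flatMap(r -> Arrays.stream(r.split(" ")))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());

        // Shuffle + Reduce: group the pairs by key and sum the values per key.
        Map<String, Integer> counts = pairs.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));

        System.out.println(counts); // e.g. {big=2, data=2, ideas=1, flows=1} (order may vary)
    }
}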
System log files record user activities, including login times, logout times, and durations. By
analyzing these logs, we can extract useful information, such as identifying the users who
spent the maximum time logged in on the system.
Pre-requisites:
Materials Required:
Procedure:
1. Create a sample log file (e.g., logfile.txt) with one record per line:
user1,2024-08-17 09:00:00,2024-08-17 12:30:00
user2,2024-08-17 10:00:00,2024-08-17 11:00:00
user1,2024-08-18 14:00:00,2024-08-18 17:00:00
user3,2024-08-17 13:00:00,2024-08-17 16:00:00
o The log file contains three fields: User ID, Login Time, and Logout Time.
1. The Mapper class parses each record, computes the session duration from the login and logout timestamps, and emits the user ID together with the duration.
2. Code example in Java:
java
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
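// NOTE: the class body did not survive extraction. A minimal sketch of what the mapper
// might look like, assuming the log format shown above (userId,loginTime,logoutTime)
// and emitting (userId, session duration in milliseconds).
public class LogMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final SimpleDateFormat FORMAT =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {
            return; // skip malformed records
        }
        try {
            Date login = FORMAT.parse(fields[1].trim());
            Date logout = FORMAT.parse(fields[2].trim());
            long durationMillis = logout.getTime() - login.getTime();
            // emit (userId, session duration in milliseconds)
            context.write(new Text(fields[0].trim()), new LongWritable(durationMillis));
        } catch (java.text.ParseException e) {
            // skip records whose timestamps cannot be parsed
        }
    }
}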
1. The Reducer class aggregates the total session duration for each user by summing up
the durations provided by the mapper.
2. Code example in Java:
java
Code
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
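// NOTE: the class body did not survive extraction. A minimal sketch of the reducer,
// assuming it sums the per-session durations (in milliseconds) emitted by the mapper.
public class LogReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long totalDuration = 0;
        for (LongWritable value : values) {
            totalDuration += value.get();
        }
        // emit (userId, total logged-in time in milliseconds)
        context.write(key, new LongWritable(totalDuration));
    }
}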
1. The Driver class configures the MapReduce job and submits it to the cluster.
2. Code example in Java:
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class LogAnalysis {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "user login time analysis"); // job name is illustrative
        job.setJarByClass(LogAnalysis.class);
        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
1. Upload the log file to the Hadoop Distributed File System (HDFS):
bash
hdfs dfs -put /path/to/logfile.txt /user/hadoop/logs
2. Run the MapReduce job:
bash
hadoop jar loganalysis.jar /user/hadoop/logs /user/hadoop/logs/output
3. The job will process the log file and output the total duration of login time for each
user.
1. View the job output:
bash
hdfs dfs -cat /user/hadoop/logs/output/part-r-00000
2. The output should list the total login time for each user.
Expected Output:
A list of users with their corresponding total logged-in durations (in milliseconds). The output might look something like this:
user1 23400000
user2 3600000
user3 10800000
Observations:
Conclusion:
Theory:
Weather data analysis is a common application of big data technologies. The data usually
includes information such as temperature, humidity, precipitation, and wind speed, collected
over various time periods. Analyzing historical weather data can help identify trends, such as
finding the coolest or hottest years.
MapReduce is well-suited for processing large weather datasets. In this experiment, the Map
function will extract relevant temperature data for each year, and the Reduce function will
aggregate these values to calculate the average temperature for each year. Finally, we'll
determine the years with the minimum and maximum average temperatures.
Pre-requisites:
Materials Required:
Hadoop cluster.
Sample weather dataset (in text or CSV format).
IDE or text editor for writing the MapReduce code.
Procedure:
1. Prepare a sample weather dataset in CSV format (date, station code, temperature):
2024-01-01,NY,32
2024-01-02,NY,31
2024-01-03,NY,35
...
1. The Mapper class will process each line of the dataset, extract the year and
temperature, and emit a key-value pair where the key is the year and the value is the
temperature.
2. Code example in Java:
java
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
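// NOTE: the class body did not survive extraction. A minimal sketch of the mapper,
// assuming the CSV format shown above (date,station,temperature) and emitting
// (year, temperature reading).
public class WeatherMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final SimpleDateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd");
    private static final SimpleDateFormat YEAR_FORMAT = new SimpleDateFormat("yyyy");

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {
            return; // skip malformed records
        }
        try {
            Date date = DATE_FORMAT.parse(fields[0].trim());
            int temperature = Integer.parseInt(fields[2].trim());
            // emit (year, temperature reading)
            context.write(new Text(YEAR_FORMAT.format(date)), new IntWritable(temperature));
        } catch (java.text.ParseException | NumberFormatException e) {
            // skip records with unparsable dates or temperatures
        }
    }
}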
1. The Reducer class will calculate the average temperature for each year by summing
up the temperatures and dividing by the number of records for that year.
2. Code example in Java:
java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WeatherReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
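    // NOTE: the method body did not survive extraction. A minimal sketch of the reduce
    // logic, assuming an integer (truncated) average temperature per year is acceptable.
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        int count = 0;
        for (IntWritable value : values) {
            sum += value.get();
            count++;
        }
        // emit (year, average temperature for that year)
        context.write(key, new IntWritable(sum / count));
    }
}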
1. After computing the average temperature for each year, write another MapReduce job
(or a simple script) to determine the years with the minimum and maximum average
temperatures.
2. Alternatively, this could be done by processing the output file of the previous
MapReduce job using a simple program or script.
1. The Driver class configures the MapReduce job and submits it to the cluster. Code example in Java:
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WeatherAnalysis {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "weather analysis"); // job name is illustrative
        job.setJarByClass(WeatherAnalysis.class);
        job.setMapperClass(WeatherMapper.class);
        job.setReducerClass(WeatherReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
3. Package the code into a JAR file for execution.
1. Run the MapReduce job:
bash
hadoop jar weatheranalysis.jar /user/hadoop/weatherdata /user/hadoop/weatheroutput
2. This job will calculate the average temperature for each year.
1. View the job output:
bash
hdfs dfs -cat /user/hadoop/weatheroutput/part-r-00000
2. The output should list each year and its corresponding average temperature.
3. Identify the coolest and hottest years by analyzing the output data.
Expected Output:
A list of years with their corresponding average temperatures. The final output should
indicate the coolest and hottest years.
Example:
2022 56
2023 60
2024 63
Observations:
Conclusion:
Objective:
To design and implement a flight information system using HBase for data storage
and HiveQL for querying and analyzing flight data.
To perform various operations on the flight data such as creating, updating, and
querying tables in HBase, and joining tables and performing aggregations using
HiveQL.
Theory:
HBase Overview:
HBase is a distributed, scalable, big data store that runs on top of the Hadoop Distributed File
System (HDFS). It is modeled after Google’s Bigtable and provides a fault-tolerant way of
storing large quantities of sparse data. HBase is particularly suitable for real-time read/write
access to large datasets.
Hive Overview:
Apache Hive is a data warehouse software built on top of Hadoop that provides data
summarization, query, and analysis capabilities. HiveQL is a query language similar to SQL,
used to query and manage large datasets in HDFS. Hive abstracts the complexity of Hadoop
by allowing users to query data using a SQL-like interface.
A flight information system stores data about flights, such as flight numbers, departure and
arrival times, destinations, and statuses. This experiment focuses on using HBase to store
flight data and HiveQL to query and analyze the data.
Pre-requisites:
HBase and Hive installed and configured on your Hadoop cluster.
Basic understanding of HBase and HiveQL commands.
Sample flight data in CSV or text format.
Materials Required:
Procedure:
1. Start the HBase shell:
bash
hbase shell
2. Create a table in HBase for storing flight information. The table will have the
following schema:
o Table Name: FlightInfo
o Row Key: Flight Number (e.g., FL123)
o Column Families: FlightDetails (to store flight-related details such as
departure and arrival times, destination, etc.)
3. HBase command to create the table:
bash
create 'FlightInfo', 'FlightDetails'
4. Verify that the table was created:
bash
list
1. Insert sample flight data into the FlightInfo table. Each flight will have a unique
row key (flight number).
2. HBase command to insert data:
bash
put 'FlightInfo', 'FL123', 'FlightDetails:Departure', '2024-08-17 09:00:00'
put 'FlightInfo', 'FL123', 'FlightDetails:Arrival', '2024-08-17 12:30:00'
put 'FlightInfo', 'FL123', 'FlightDetails:Destination', 'New York'
put 'FlightInfo', 'FL124', 'FlightDetails:Departure', '2024-08-17 10:00:00'
put 'FlightInfo', 'FL124', 'FlightDetails:Arrival', '2024-08-17 13:45:00'
put 'FlightInfo', 'FL124', 'FlightDetails:Destination', 'Los Angeles'
3. Verify the inserted data:
bash
scan 'FlightInfo'
1. Start the Hive CLI:
bash
hive
2. Create an external table in Hive that links to the HBase table. The Hive table will
reference the data stored in HBase.
3. HiveQL command to create the external table:
sql
CREATE EXTERNAL TABLE FlightInfo_Hive(
flight_number STRING,
departure_time STRING,
arrival_time STRING,
destination STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
":key,FlightDetails:Departure,FlightDetails:Arrival,FlightDetails:Des
tination"
)
TBLPROPERTIES ("hbase.table.name" = "FlightInfo");
4. Verify that the table is visible in Hive:
sql
SHOW TABLES;
1. Load Data into Hive Table: If necessary, load additional data into the Hive table
using the LOAD DATA command.
2. Query Data in Hive: Perform various queries on the flight data using HiveQL.
Examples:
o List all flights:
sql
SELECT * FROM FlightInfo_Hive;
o Find flights to a specific destination:
sql
SELECT flight_number, departure_time FROM FlightInfo_Hive WHERE destination = 'New York';
o Find the earliest departure time recorded for each flight:
sql
SELECT flight_number, MIN(departure_time) FROM FlightInfo_Hive GROUP BY flight_number;
1. Join Tables: If you have another table (e.g., Airports) with airport codes and names,
you can join it with the FlightInfo_Hive table.
Example:
sql
Code
SELECT f.flight_number, a.airport_name
FROM FlightInfo_Hive f
JOIN Airports a ON (f.destination = a.airport_code);
2. Aggregation Operations:
o Calculate the number of flights to each destination:
sql
SELECT destination, COUNT(*) FROM FlightInfo_Hive GROUP BY destination;
o Calculate the average flight duration (in seconds) for each destination:
sql
SELECT destination,
       AVG(UNIX_TIMESTAMP(arrival_time) - UNIX_TIMESTAMP(departure_time)) AS avg_duration
FROM FlightInfo_Hive
GROUP BY destination;
3. Indexing: Create an index on the destination column to speed up lookups (supported only in Hive versions before 3.0, where index support was removed):
sql
CREATE INDEX idx_destination ON TABLE FlightInfo_Hive(destination) AS 'COMPACT' WITH DEFERRED REBUILD;
Expected Output:
Observations:
Conclusion:
In this experiment, you successfully implemented a flight information system using HBase
for data storage and HiveQL for querying and analyzing the data. This experiment
demonstrated the integration of HBase and Hive for managing and processing large datasets
efficiently.
Objective:
To design and implement a solution using Apache Pig to display the hierarchical
structure of data.
To generate trees, graphs, and network visualizations of data, and perform operations
such as sorting, grouping, joining, and filtering using Pig Latin scripts.
Theory:
Apache Pig is a high-level platform for processing large datasets in Hadoop. The language
used to express data flows in Pig is called Pig Latin. Pig Latin abstracts the complexity of
Hadoop MapReduce, making it easier for developers to perform data transformations,
analysis, and processing.
Hierarchical Data Structure:
Hierarchical data structures represent the organization of data in a tree-like model, where
each data point can have a parent-child relationship. Examples include file systems,
organizational charts, and product categories.
Use Case:
In this experiment, an employee dataset is processed with Pig to display an organization's reporting hierarchy, grouping employees under their managers and sorting them by department.
Pre-requisites:
Materials Required:
Hadoop cluster.
Sample hierarchical data in text or CSV format.
Command-line interface (CLI) for interacting with Pig.
Procedure:
Step 1: Prepare the Dataset
1. Create a sample employee dataset (employee_data.txt) with the following records:
E001,John,Manager,Marketing
E002,Jane,Lead,Sales,E001
E003,Robert,Executive,Marketing,E001
E004,Michael,Lead,IT,E001
E005,Susan,Executive,Sales,E002
o The columns represent Employee ID, Name, Title, Department, and Manager ID (the top-level manager has no Manager ID value).
2. Load the dataset into HDFS:
bash
hdfs dfs -put /path/to/employee_data.txt /user/hadoop/employee_data
Step 2: Write the Pig Latin Script
1. Load the employee data:
pig
EMPLOYEE_DATA = LOAD '/user/hadoop/employee_data' USING PigStorage(',')
    AS (emp_id:chararray, name:chararray, title:chararray, dept:chararray, mgr_id:chararray);
2. Group the employees by their manager ID:
pig
GROUPED_BY_MANAGER = GROUP EMPLOYEE_DATA BY mgr_id;
3. Flatten each manager's bag of employees to build the hierarchy (projecting the fields together avoids a cross product between separate FLATTEN calls):
pig
HIERARCHY = FOREACH GROUPED_BY_MANAGER GENERATE
    group AS manager_id,
    FLATTEN(EMPLOYEE_DATA.(name, title, dept)) AS (employee_name, employee_title, employee_dept);
4. Sort the hierarchy by department:
pig
SORTED_HIERARCHY = ORDER HIERARCHY BY employee_dept ASC;
5. Store the sorted hierarchy in HDFS:
pig
STORE SORTED_HIERARCHY INTO '/user/hadoop/employee_hierarchy_output' USING PigStorage(',');
Step 3: Run the Pig Script
1. Save the script as employee_hierarchy.pig and run it in MapReduce mode:
bash
pig -x mapreduce employee_hierarchy.pig
2. Pig will process the hierarchical data, group employees by their managers, and sort
them by department.
Step 4: View the Output
1. Display the output stored in HDFS:
bash
hdfs dfs -cat /user/hadoop/employee_hierarchy_output/part-r-00000
2. The output will display the hierarchical structure of the organization, showing each
manager and their respective employees.
Step 5: Perform Additional Operations (Optional)
1. Filtering Data: You can filter employees based on criteria such as job title or
department.
pig
FILTERED_EMPLOYEES = FILTER EMPLOYEE_DATA BY dept == 'Marketing';
2. Joining Data: If you have another dataset (e.g., DEPARTMENT_DATA), you can perform
a join operation to enrich the employee data.
pig
JOINED_DATA = JOIN EMPLOYEE_DATA BY dept, DEPARTMENT_DATA BY dept_name;
3. Generating Visualizations: Although Pig itself doesn't provide direct support for
visualizations, you can export the results to a format like CSV or JSON and use
external tools (like D3.js, Gephi, or Graphviz) to create hierarchical trees or network
graphs.
Expected Output:
Example output:
E001,John,Manager
E002,Jane,Lead
E003,Robert,Executive
E004,Michael,Lead
E005,Susan,Executive
Observations:
Observe how Pig simplifies the process of loading, transforming, and analyzing
hierarchical data.
Note the differences between the original and processed data, particularly in terms of
grouping and sorting.
Record the time taken by Pig to process the dataset.
Conclusion:
In this experiment, you successfully used Apache Pig to process hierarchical data and display
it in a structured format. This experiment demonstrated the ability of Pig to efficiently handle
complex data structures and perform operations like sorting, grouping, and filtering.
Week 6: Working with JSON Data and Word Count on Tweets Using Pig
Objective:
To parse JSON data using Pig, perform word count operations on text data such as
tweets, and analyze the results.
To demonstrate how to use Pig for processing semi-structured data (like JSON) and
perform text analysis operations.
Theory:
JSON Overview:
JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for
humans to read and write, and easy for machines to parse and generate. It is commonly used
to transmit data between a server and a web application, as well as to store structured data in
NoSQL databases.
Apache Pig provides built-in support for working with JSON data through its JsonLoader
and JsonStorage functions. These functions allow users to load, store, and process JSON
data in a Pig script.
In this experiment, we will work with a dataset of tweets stored in JSON format. Each tweet
contains text that can be analyzed to count the frequency of words. This experiment will
demonstrate how to parse the JSON data, extract the tweet text, and perform a word count
operation using Pig.
Pre-requisites:
Materials Required:
Hadoop cluster.
Sample JSON file containing tweets.
Command-line interface (CLI) for interacting with Pig.
Procedure:
Step 1: Prepare the Dataset
1. Create a sample JSON file (tweets.json) with one tweet per line:
json
{"user": "Alice", "tweet": "Learning Apache Pig is fun!"}
{"user": "Bob", "tweet": "Pig Latin makes data processing easy."}
{"user": "Charlie", "tweet": "Big data analysis with Pig and
Hadoop."}
2. Upload the file to HDFS:
bash
hdfs dfs -put /path/to/tweets.json /user/hadoop/tweets
Step 2: Write a Pig Script to Parse JSON Data and Extract Tweet Text
1. Load the JSON data using Pig's built-in JsonLoader:
pig
TWEETS = LOAD '/user/hadoop/tweets' USING JsonLoader('user:chararray, tweet:chararray');
2. Extract the tweet text:
pig
TWEET_TEXT = FOREACH TWEETS GENERATE tweet;
3. Split the tweet text into individual words:
pig
WORDS = FOREACH TWEET_TEXT GENERATE FLATTEN(TOKENIZE(tweet)) AS word;
4. Group identical words together:
pig
GROUPED_WORDS = GROUP WORDS BY word;
5. Count the occurrences of each word:
pig
WORD_COUNT = FOREACH GROUPED_WORDS GENERATE group AS word, COUNT(WORDS) AS count;
6. Sort the words by count in descending order:
pig
SORTED_WORD_COUNT = ORDER WORD_COUNT BY count DESC;
7. Store the result in HDFS:
pig
STORE SORTED_WORD_COUNT INTO '/user/hadoop/word_count_output' USING PigStorage(',');
Step 3: Run the Pig Script
1. Save the script as tweet_analysis.pig and run it in MapReduce mode:
bash
pig -x mapreduce tweet_analysis.pig
2. Pig will process the JSON data, extract the tweet text, and perform a word count
operation.
Step 4: View the Output
1. Display the word counts stored in HDFS:
bash
hdfs dfs -cat /user/hadoop/word_count_output/part-r-00000
2. The output will display the words and their respective counts, sorted in descending
order of frequency.
Example output:
Pig,2
data,2
Apache,1
Latin,1
analysis,1
...
1. Python Snippet for JSON Parsing: As an optional task, you can use Python to parse
JSON data and prepare it for Pig processing. Below is an example Python script that
reads a JSON file and extracts the tweet text:
python
import json
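# NOTE: the rest of the script did not survive extraction. A minimal sketch, assuming
# tweets.json holds one JSON object per line (as above) and writing the extracted text
# to tweet_text.txt (file names are illustrative) for later loading into Pig.
with open("tweets.json", "r") as infile, open("tweet_text.txt", "w") as outfile:
    for line in infile:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)              # parse one tweet per line
        outfile.write(record["tweet"] + "\n")  # keep only the tweet text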
2. Integrating Python with Pig: You can combine the Python script with Pig
processing by outputting the extracted text to a file and then loading that file into Pig
for further analysis.
Expected Output:
The Pig script will output a list of words from the tweets along with their counts,
sorted in descending order of frequency.
Example:
Pig,2
data,2
Apache,1
Latin,1
analysis,1
...
Observations:
Observe how Pig handles JSON data and performs text processing.
Note the frequency of different words in the tweets and consider what this reveals
about the content.
Record the time taken by Pig to process the JSON data and perform the word count.
Conclusion:
In this experiment, you successfully used Apache Pig to parse JSON data and perform a word
count operation on tweet text. This experiment demonstrated the flexibility of Pig in handling
semi-structured data and performing text analysis tasks efficiently.
Week 7: Reading Different Types of Data Sets Using R
Objective:
To demonstrate how to read different types of datasets such as .txt, .csv, and .xml
files using the R programming language.
To perform basic operations on these datasets such as writing to disk, reading from
web locations, and using R objects and functions for data manipulation and storage.
Theory:
R Overview:
R is a powerful programming language and environment used for statistical computing and
graphics. It is widely used for data analysis, data manipulation, and visualization. R provides
a rich set of functions to work with various types of data formats, making it a versatile tool
for data scientists and analysts.
Data Formats:
Text Files (.txt): Plain text files that can contain structured or unstructured data.
They are typically used for simple data storage and transfer.
Comma-Separated Values (.csv): A common data format where each line
represents a row of data, with columns separated by commas. CSV files are widely
used for tabular data storage and exchange.
Extensible Markup Language (.xml): A markup language that defines rules for
encoding documents in a format that is both human-readable and machine-readable.
XML is often used to store and transport data.
Use Case:
This experiment will cover how to read these different data formats using R, perform basic
manipulations, and store the results to a specified location on disk. The experiment will also
cover reading data from web locations, and working with R objects and functions to
manipulate the data.
Pre-requisites:
Materials Required:
R software.
Sample data files (data.txt, data.csv, data.xml).
Internet connection for reading data from web locations.
Procedure:
Step 1: Set the Working Directory
1. Set the working directory to the folder containing the sample data files:
R
setwd("path/to/your/directory")
Step 2: Read a Text File (.txt)
1. Create a sample file data.txt with the following space-separated content:
Name Age Gender
Alice 28 F
Bob 34 M
Charlie 25 M
2. Read the file using read.table():
R
txt_data <- read.table("data.txt", header = TRUE, sep = " ")
print(txt_data)
o header = TRUE: Indicates that the first row contains column names.
o sep = " ": Specifies that the columns are separated by spaces.
3. Write the data back to a different text file:
R
write.table(txt_data, "output_data.txt", sep = "\t", row.names = FALSE)
Step 3: Read a CSV File (.csv)
1. Create a sample file data.csv with the following content:
Name,Age,Gender
Alice,28,F
Bob,34,M
Charlie,25,M
2. Read the file using read.csv():
R
csv_data <- read.csv("data.csv", header = TRUE)
print(csv_data)
3. Write the data back to a new CSV file:
R
write.csv(csv_data, "output_data.csv", row.names = FALSE)
Step 4: Read an XML File (.xml)
1. Create a sample file data.xml with the following content:
xml
<dataset>
<record>
<Name>Alice</Name>
<Age>28</Age>
<Gender>F</Gender>
</record>
<record>
<Name>Bob</Name>
<Age>34</Age>
<Gender>M</Gender>
</record>
<record>
<Name>Charlie</Name>
<Age>25</Age>
<Gender>M</Gender>
</record>
</dataset>
2. Read and parse the XML file using the XML package (install it first with install.packages("XML") if needed):
R
library(XML)
xml_data <- xmlTreeParse("data.xml", useInternalNodes = TRUE)
rootNode <- xmlRoot(xml_data)
print(rootNode)
3. Extract the individual fields into vectors using XPath:
R
names <- xpathSApply(rootNode, "//Name", xmlValue)
ages <- xpathSApply(rootNode, "//Age", xmlValue)
genders <- xpathSApply(rootNode, "//Gender", xmlValue)
4. Combine the extracted vectors into a data frame and write it to a CSV file:
R
df <- data.frame(Name = names, Age = ages, Gender = genders)
write.csv(df, "output_data_from_xml.csv", row.names = FALSE)
Step 5: Read Data from a Web Location
1. Read data directly from a web location (e.g., a CSV file hosted online):
R
web_data <- read.csv("http://example.com/data.csv")
print(web_data)
2. Perform operations on the downloaded data similar to what you did with local
files.
Step 6: Work with R Objects and Functions
1. Use built-in functions to summarize the data frame:
R
summary(df) # Summary statistics for the dataframe
2. Compute the mean age:
R
mean_age <- mean(as.numeric(df$Age))
print(mean_age)
3. Save the data frame as an R object for later use:
R
save(df, file = "data_frame.RData")
Expected Output:
Successfully read and wrote data from different file formats (.txt, .csv, .xml).
Extracted and manipulated data from XML files and performed basic operations on it.
Downloaded and processed data from web locations.
Used R functions to perform statistical operations and save R objects for future use.
Observations:
Observe how R simplifies the process of reading and writing different data formats.
Note the differences in syntax and functions used to handle each data format.
Record any challenges or errors encountered during the data import/export process.
Conclusion:
In this experiment, you successfully demonstrated the ability to read, manipulate, and store
data in various formats using R. This experiment highlighted R's versatility in handling
diverse data formats and performing essential data operations.
Theory:
Data Visualization:
1. Box Plot: A box plot is used to display the distribution of data based on a five-
number summary: minimum, first quartile, median, third quartile, and maximum. It is
particularly useful for identifying outliers.
2. Scatter Plot: A scatter plot is used to represent the relationship between two
continuous variables. It helps in identifying correlations, patterns, and outliers in the
data.
3. Histogram: A histogram is used to represent the distribution of a single numeric
variable by dividing the data into bins and plotting the frequency of each bin.
4. Bar Chart: A bar chart is used to compare categorical data. The height of each bar
represents the frequency or value associated with that category.
5. Pie Chart: A pie chart is used to represent the proportion of categories in a whole. It
is divided into slices, where each slice represents a category's contribution to the total.
Pre-requisites:
Materials Required:
R software.
Sample dataset (e.g., mtcars, iris, or any other dataset of your choice).
R packages: ggplot2 (for advanced visualizations).
Procedure:
1. Load the ggplot2 package (install it first with install.packages("ggplot2") if needed):
R
library(ggplot2)
2. Load the mtcars dataset and inspect the first few rows:
R
data(mtcars)
df <- mtcars
head(df)
1. Plot a box plot to visualize the distribution of a continuous variable, such as mpg
(miles per gallon):
R
boxplot(df$mpg, main="Box Plot of Miles per Gallon", ylab="Miles per Gallon (mpg)", col="lightblue")
2. Box plot using ggplot2:
R
ggplot(df, aes(x=factor(0), y=mpg)) +
  geom_boxplot(fill="lightblue") +
  labs(title="Box Plot of Miles per Gallon", y="Miles per Gallon (mpg)") +
  theme_minimal()
1. Plot a scatter plot to show the relationship between mpg and hp (horsepower):
R
plot(df$mpg, df$hp, main="Scatter Plot of MPG vs HP", xlab="Miles per Gallon (mpg)", ylab="Horsepower (hp)", pch=19, col="darkgreen")
1. Plot a histogram to visualize the distribution of mpg:
R
hist(df$mpg, main="Histogram of Miles per Gallon", xlab="Miles per Gallon (mpg)", col="lightcoral", breaks=10)
2. Histogram using ggplot2:
R
ggplot(df, aes(x=mpg)) +
  geom_histogram(binwidth=2, fill="lightcoral", color="black") +
  labs(title="Histogram of Miles per Gallon", x="Miles per Gallon (mpg)") +
  theme_minimal()
1. Plot a bar chart to visualize the frequency of the number of cylinders (cyl):
R
barplot(table(df$cyl), main="Bar Chart of Cylinder Count", xlab="Number of Cylinders", ylab="Frequency", col="lightgreen")
2. Bar chart using ggplot2:
R
ggplot(df, aes(x=factor(cyl))) +
  geom_bar(fill="lightgreen") +
  labs(title="Bar Chart of Cylinder Count", x="Number of Cylinders", y="Frequency") +
  theme_minimal()
1. Plot a pie chart to visualize the proportion of different cylinder counts (cyl):
R
cyl_count <- table(df$cyl)
pie(cyl_count, main="Pie Chart of Cylinder Count", col=c("red", "blue", "green"))
2. Pie Chart using ggplot2 (requires transforming data):
R
cyl_df <- data.frame(cyl = names(cyl_count), count = as.vector(cyl_count))
ggplot(cyl_df, aes(x="", y=count, fill=cyl)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y") +
  labs(title="Pie Chart of Cylinder Count") +
  theme_minimal()
1. Use the box plot created earlier to identify outliers in the mpg variable. Outliers
will appear as individual points outside the whiskers of the box plot.
2. Use the scatter plot to visually inspect for outliers in the relationship between mpg
and hp.
Expected Output:
Box Plot: Displays the distribution of mpg, showing the median, quartiles, and
outliers.
Scatter Plot: Shows the relationship between mpg and hp, helping identify
correlations and outliers.
Histogram: Displays the distribution of mpg, helping visualize the spread and
concentration of values.
Bar Chart: Shows the frequency of different cylinder counts in the dataset.
Pie Chart: Displays the proportion of different cylinder counts in a visual, circular
format.
Outlier Identification: Highlights any data points that fall outside the expected range
in the box plot and scatter plot.
Observations:
Conclusion: