BIG Data File
LIST OF EXPERIMENTS
1. Installation of Hadoop
2. File Management tasks in Hadoop
3. Word Count Map Reduce program to understand Map Reduce Paradigm.
EXPERIMENT: 1
AIM: Installation of Hadoop.
Hadoop is a well-known big data processing system for storing and analysing enormous volumes of data. It’s an open-
source project that you can use for free. If you’re new to Hadoop, you may find the installation process difficult.
In this tutorial, I’ll walk you through the steps of installing Hadoop on Windows.
Step 1: Download and install Java
Hadoop is built on Java, so you must have Java installed on your PC. You can get the most recent version of Java from
the official website. After downloading, follow the installation wizard to install Java on your system.
JDK: https://www.oracle.com/java/technologies/javase-downloads.html
Step 2: Download Hadoop
Hadoop can be downloaded from the Apache Hadoop website. Make sure to have the latest stable release of Hadoop. Once
downloaded, extract the contents to a convenient location.
Hadoop: https://hadoop.apache.org/releases.html
Step 3: Set Environment Variables
You must configure environment variables after downloading and unpacking Hadoop. Launch the Start menu, type “Edit the
system environment variables,” and select the result. This will launch the System Properties dialogue box. Click on
“Environment Variables” button to open.
Click “New” under System Variables to add a new variable. Enter “HADOOP_HOME” as the variable name and the path to
the Hadoop folder (for example, C:\hadoop-3.3.1, if that is where you extracted it) as the variable value. Then press “OK.”
Then, under System Variables, locate the “Path” variable and click “Edit.” Click “New” in the Edit Environment Variable
window and enter “%HADOOP_HOME%\bin” as the variable value. Use the “OK” button to close all the windows.
Step 4: Setup Hadoop
You must configure Hadoop in this phase by modifying several configuration files. Navigate to the “etc/hadoop” folder in
the Hadoop folder. You must make changes to three files:
core-site.xml
hdfs-site.xml
mapred-site.xml
Open each file in a text editor and edit the following properties:
In core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
In hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/hadoop-3.3.1/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/hadoop-3.3.1/data/datanode</value>
</property>
</configuration>
In mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
</configuration>
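Note: on Hadoop 2.x/3.x releases (such as the 3.3.1 build referenced above), the old mapred.job.tracker property is ignored. If you intend to run MapReduce jobs on YARN, a commonly used setting (an assumption about your setup, not part of the original steps) is:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>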
Save the changes in each file.
Step 5: Format Hadoop NameNode
You must format the NameNode before you can start Hadoop. Navigate to the Hadoop bin folder using a command prompt.
Execute this command:
hadoop namenode -format
Step 6: Start Hadoop
To start Hadoop, open a command prompt and navigate to the Hadoop bin folder. Run the following command:
start-all.cmd
This command will start all the required Hadoop services; on recent releases these are the HDFS daemons (NameNode,
DataNode) and the YARN daemons (ResourceManager, NodeManager). Wait a few minutes until all the services have started.
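To confirm the daemons are running, you can list the Java processes with the JDK’s jps tool (the exact set of daemon names depends on your Hadoop version):
jps
You should see entries such as NameNode, DataNode, ResourceManager and NodeManager, plus Jps itself.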
Step 7: Verify Hadoop Installation
To ensure that Hadoop is properly installed, open a web browser and go to http://localhost:9870/ (on Hadoop 2.x and
earlier the NameNode UI is at http://localhost:50070/). This will open the web interface for the Hadoop NameNode, and
you should see a page with Hadoop cluster information.
Wrapping Up
By following the instructions provided in this article, you should be able to get Hadoop up and operating on your
machine. Remember to get the most recent stable version of Hadoop, install Java, configure Hadoop, format the
NameNode, and start Hadoop services. Finally, check the NameNode web interface to ensure that Hadoop is properly
installed.
EXPERIMENT: 2
AIM: Word Count Map Reduce program to understand Map Reduce Paradigm.
In Hadoop, MapReduce is a computation that decomposes large manipulation jobs into individual tasks that can
be executed in parallel across a cluster of servers. The results of tasks can be joined together to compute final
results.
MapReduce consists of 2 steps:
Map Function – It takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key-value pairs).
Example – (Map function in Word Count)
Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Output (converted into a smaller set of tuples):
(BUS,7), (CAR,7), (TRAIN,4)
Reduce Function – It takes the output of the Map function as input and combines those data tuples into a smaller
set of tuples based on the key.
Work Flow of the Program
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Driver: configures the job and submits it to the cluster
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each line on commas and emits (WORD, 1) for every token
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums the 1s emitted for each word and writes (WORD, count)
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
The above program consists of three classes:
Driver class (the public static void main method; this is the entry point).
The Map class which extends the public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and
implements the Map function.
The Reduce class which extends the public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and
implements the Reduce function.
6. Make a jar file
Right-click on the project > Export > select the export destination as JAR file > Next > Finish.
7. Take a text file and move it into HDFS:
To move this into Hadoop directly, open the terminal and enter the following command:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
8. Run the jar file:
(hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile PathToOutputDirectory)
[training@localhost ~]$ hadoop jar MRProgramsDemo.jar PackageDemo.WordCount wordCountFile MRDir1
9. Open the result:
[training@localhost ~]$ hadoop fs -ls MRDir1
Found 3 items
-rw-r--r-- 1 training supergroup 0 2016-02-23 03:36 /user/training/MRDir1/_SUCCESS
drwxr-xr-x - training supergroup 0 2016-02-23 03:36 /user/training/MRDir1/_logs
-rw-r--r-- 1 training supergroup 20 2016-02-23 03:36 /user/training/MRDir1/part-r-00000
[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000
BUS 7
CAR 4
TRAIN 6
EXPERIMENT: 4
AIM: MapReduce program to find the maximum and minimum temperature (with time) for each day, per city, from a weather record set.
First, download the input file, which contains temperature statistics with time for multiple cities. Schema of the
record set: CA_25-Jan-2014 00:12:345 15.7 01:19:345 23.1 02:34:542 12.3 ......
CA is the city code (here it stands for California), followed by the date. After that, each pair of values represents a time
and a temperature.
Let's create a Map/Reduce project in Eclipse and create a class file named
CalculateMaxAndMinTemeratureWithTime. For simplicity, the mapper and reducer classes are written here as static
inner classes. Copy the following code and paste it into the newly created class file.
/**
* Question:- To find Max and Min temperature from record set stored in
* text file. Schema of record set :- tab separated (\t) CA_25-Jan-2014
* 00:12:345 15.7 01:19:345 23.1 02:34:542 12.3 03:12:187 16 04:00:093
* -14 05:12:345 35.7 06:19:345 23.1 07:34:542 12.3 08:12:187 16
* 09:00:093 -7 10:12:345 15.7 11:19:345 23.1 12:34:542 -22.3 13:12:187
* 16 14:00:093 -7 15:12:345 15.7 16:19:345 23.1 19:34:542 12.3
* 20:12:187 16 22:00:093 -7
* Expected output:- Creates files for each city and store maximum & minimum
* temperature for each day along with time.
*/
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
/**
* @author devinline
*/
public class CalculateMaxAndMinTemeratureWithTime {
public static String calOutputName = "California";
public static String nyOutputName = "Newyork";
public static String njOutputName = "Newjersy";
public static String ausOutputName = "Austin";
public static String bosOutputName = "Boston";
public static String balOutputName = "Baltimore";
// ... (declaration of the mapper class WhetherForcastMapper, its local variables
// — counter, date, currentTime, currnetTemp, minTemp, maxTemp, minTempANDTime,
// maxTempANDTime, strTokens — and the start of its map() method omitted) ...
while (strTokens.hasMoreElements()) {
if (counter == 0) {
date = strTokens.nextToken();
} else {
if (counter % 2 == 1) {
currentTime = strTokens.nextToken();
} else {
currnetTemp = Float.parseFloat(strTokens.nextToken());
if (minTemp > currnetTemp) {
minTemp = currnetTemp;
minTempANDTime = minTemp + "AND" + currentTime;
}
if (maxTemp < currnetTemp) {
maxTemp = currnetTemp;
maxTempANDTime = maxTemp + "AND" + currentTime;
}
}
}
counter++;
}
// Write to context - MinTemp, MaxTemp and corresponding time
Text temp = new Text();
temp.set(maxTempANDTime);
Text dateText = new Text();
dateText.set(date);
try {
con.write(dateText, temp);
} catch (Exception e) {
e.printStackTrace();
}
temp.set(minTempANDTime);
dateText.set(date);
con.write(dateText, temp);
}
}
// ... (declaration of the reducer class WhetherForcastReducer, its MultipleOutputs
// field mos, its setup() method, the start of reduce(), and the loop over the
// values omitted) ...
if (counter == 0) {
reducerInputStr = value.toString().split("AND");
f1 = reducerInputStr[0];
f1Time = reducerInputStr[1];
}
else {
reducerInputStr = value.toString().split("AND");
f2 = reducerInputStr[0];
f2Time = reducerInputStr[1];
}
counter = counter + 1;
}
if (Float.parseFloat(f1) > Float.parseFloat(f2)) {
// ... (body of this comparison — writing the result to the city-specific named
// output via mos.write — the else branch, and the end of reduce() omitted) ...
@Override
public void cleanup(Context context) throws IOException,
InterruptedException {
mos.close();
}
}
// ... (start of the main() method — creating the Configuration and Job objects,
// setting the jar class, and adding the input/output paths — omitted) ...
job.setMapperClass(WhetherForcastMapper.class);
job.setReducerClass(WhetherForcastReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleOutputs.addNamedOutput(job, calOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, nyOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, njOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, bosOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, ausOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, balOutputName,
TextOutputFormat.class, Text.class, Text.class);
try {
System.exit(job.waitForCompletion(true) ? 0 : 1);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Explanation:-
In the map method, we parse each input line and maintain a counter for extracting the date and each
temperature & time value. For a given input line, we first extract the date (counter == 0), then alternately
extract the time (counter % 2 == 1, since times sit at odd positions: 1, 3, 5, ...) and the temperature otherwise.
We compare against the current max & min temperature and store them accordingly. Once the while loop
terminates for a given input line, we write maxTempANDTime and minTempANDTime along with the date.
In the reduce method, the setup method is executed once per reducer task and creates a MultipleOutputs object. For
a given key there are two entries (maxTempANDTime and minTempANDTime). We iterate over the values list, split each
value and get its temperature & time. We compare the temperature values and build the actual value string which
the reducer writes to the appropriate city-specific file (a minimal sketch of such a reducer follows below).
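Since the reducer appears only in fragments in the listing above, here is a minimal sketch of such a reducer, written as a static inner class of CalculateMaxAndMinTemeratureWithTime and assuming Text keys and values. The city routing shown (based on the prefix of the date key) is an assumption for illustration, not the exact original code:

public static class WhetherForcastReducer extends Reducer<Text, Text, Text, Text> {
    MultipleOutputs<Text, Text> mos;

    @Override
    public void setup(Context context) {
        // One MultipleOutputs instance per reducer task
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int counter = 0;
        String f1 = "", f1Time = "", f2 = "", f2Time = "";
        String[] reducerInputStr;
        for (Text value : values) {
            // Each key (date) carries two values: "maxTempANDtime" and "minTempANDtime"
            reducerInputStr = value.toString().split("AND");
            if (counter == 0) {
                f1 = reducerInputStr[0];
                f1Time = reducerInputStr[1];
            } else {
                f2 = reducerInputStr[0];
                f2Time = reducerInputStr[1];
            }
            counter++;
        }
        // Order the two readings as (max, min) and build the output string
        String result;
        if (Float.parseFloat(f1) > Float.parseFloat(f2)) {
            result = "Max: " + f1 + " at " + f1Time + ", Min: " + f2 + " at " + f2Time;
        } else {
            result = "Max: " + f2 + " at " + f2Time + ", Min: " + f1 + " at " + f1Time;
        }
        // Route the record to a city-specific named output; the prefix-based choice
        // below is an assumption (keys are assumed to look like "CA_25-Jan-2014")
        String namedOutput = key.toString().startsWith("CA") ? calOutputName : nyOutputName;
        mos.write(namedOutput, key, new Text(result));
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

MultipleOutputs is what lets a single reducer write to several city-specific files, one for each name registered with addNamedOutput in the driver.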
In the main method, an instance of Job is created with a Configuration object. The Job is configured with the mapper
and reducer classes along with the input and output formats. MultipleOutputs information is added to the Job to
indicate the file names to be used with the output format. For this sample program, we use an input
file ("/weatherInputData/input_temp.txt") placed on HDFS, and the output directory
(/user/hduser1/testfs/output_mapred5) will also be created on HDFS. Use the command below to copy the
downloaded input file from the local file system to HDFS, and give write permission to the client
executing this program so that the output directory can be created.
Copy the input file from the local file system to HDFS:
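A typical set of commands (the HDFS paths match those mentioned above; the local file path and the permission mode are assumptions) would be:
hadoop fs -mkdir -p /weatherInputData
hadoop fs -put /home/hduser1/input_temp.txt /weatherInputData/input_temp.txt
hadoop fs -chmod -R 777 /user/hduser1/testfs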
Before executing the above program, make sure the Hadoop services are running (to start all services, execute
./start-all.sh from <hadoop_home>/sbin).
Now execute the above sample program: Run -> Run as Hadoop. Wait a moment and check whether the output
directory is in place on HDFS. Execute the following command to verify the same.
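For example, using the output path mentioned above:
hadoop fs -ls /user/hduser1/testfs/output_mapred5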
Note:-
To reference the input file from the local file system instead of HDFS, uncomment the lines below in the main
method and comment out the hard-coded addInputPath and setOutputPath lines. Here Path(args[0])
and Path(args[1]) read the input and output locations from the program arguments; alternatively, create the Path
objects with the string of the input file and output location.
// FileInputFormat.addInputPath(job, new Path(args[0]));
// FileOutputFormat.setOutputPath(job, new Path(args[1]));
Here matrix A is a 2×2 matrix, which means the number of rows (i) = 2 and the number of columns (j) = 2.
Matrix B is also a 2×2 matrix, where the number of rows (j) = 2 and the number of columns (k) = 2. Each cell of a
matrix is labelled as Aij or Bjk; e.g. element 3 in matrix A is called A21, i.e. 2nd row, 1st column. Now,
one-step matrix multiplication has 1 mapper and 1 reducer. The formula is:
Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i
Therefore, computing the mapper for Matrix A:
# k, i and j each range over the matrix dimensions.
# Here all dimensions are 2, therefore when k=1, i can have
# 2 values (1 & 2), and each case can have 2 further
# values of j (1 and 2). Substituting all values
# in the formula:
k=1 i=1 j=1 ((1, 1), (A, 1, 1))
        j=2 ((1, 1), (A, 2, 2))
    i=2 j=1 ((2, 1), (A, 1, 3))
        j=2 ((2, 1), (A, 2, 4))
k=2 i=1 j=1 ((1, 2), (A, 1, 1))
        j=2 ((1, 2), (A, 2, 2))
    i=2 j=1 ((2, 2), (A, 1, 3))
        j=2 ((2, 2), (A, 2, 4))
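A minimal mapper sketch for matrix A that produces the emissions above (assumptions: input lines of the form "A,i,j,value", e.g. "A,2,1,3", and a 2×2 matrix B so that k ranges over 1..2; this is an illustration, not a prescribed implementation):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MatrixAMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int K = 2; // number of columns of B (assumption: 2x2 matrices)

    @Override
    public void map(LongWritable offset, Text line, Context con)
            throws IOException, InterruptedException {
        // Expected input line format (assumption): "A,i,j,value"
        String[] parts = line.toString().split(",");
        String matrixName = parts[0];
        int i = Integer.parseInt(parts[1]);
        int j = Integer.parseInt(parts[2]);
        String value = parts[3];
        if (matrixName.equals("A")) {
            // Emit ((i, k), (A, j, Aij)) for every column k of B
            for (int k = 1; k <= K; k++) {
                con.write(new Text(i + "," + k), new Text("A," + j + "," + value));
            }
        }
    }
}

For matrix B, an analogous mapper would emit ((i, k), (B, j, Bjk)) for every row i of A; the reducer for a given (i, k) key then multiplies matching j entries of A and B and sums the products.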
AIM: Pig Latin scripts to sort, group, join, project, and filter your data.
Sample Data: To make these scripts concrete, let’s assume two example datasets:
Employees dataset (employees.txt):
1,John,Engineering,50000
2,Sara,Marketing,60000
3,Mike,Engineering,70000
4,Jane,Marketing,55000
5,Bob,HR,45000
Departments dataset:
Engineering,101
Marketing,102
HR,103
First, load these datasets into Pig variables for use in transformations.
-- Load the employees dataset
employees = LOAD 'employees.txt'
    USING PigStorage(',')
    AS (emp_id: int, name: chararray, department: chararray, salary: int);
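The join in step 3 also needs the departments dataset loaded; a matching LOAD statement (the file name departments.txt is an assumption, while the field names follow the join and projection below) would be:
-- Load the departments dataset
departments = LOAD 'departments.txt'
    USING PigStorage(',')
    AS (department: chararray, dept_id: int);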
1. Sorting Data
Sort the employees data by salary in descending order.
sorted_employees = ORDER employees BY salary DESC;
DUMP sorted_employees;
2. Grouping Data
Group the employees data by department to calculate the total salary per department.
grouped_employees = GROUP employees BY department;
total_salary_per_department = FOREACH grouped_employees GENERATE
    group AS department,
    SUM(employees.salary) AS total_salary;
DUMP total_salary_per_department;
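With the sample data above, the expected totals are Engineering = 120000, Marketing = 115000 and HR = 45000, so the DUMP should print roughly:
(Engineering,120000)
(Marketing,115000)
(HR,45000)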
3. Joining Data
Join the employees dataset with the departments dataset on the department field to associate each employee with
their dept_id.
joined_data = JOIN employees BY department, departments BY department;
projected_data = FOREACH joined_data GENERATE
    employees::emp_id AS emp_id,
    employees::name AS name,
    employees::department AS department,
    departments::dept_id AS dept_id,
    employees::salary AS salary;
DUMP projected_data;
4. Projecting Data
Project only the name and salary fields from the employees dataset.
projected_employees = FOREACH employees GENERATE name, salary;
DUMP projected_employees;
5. Filtering Data
Filter the employees dataset to show only employees with a salary greater than 50000.
high_salary_employees = FILTER employees BY salary > 50000;
DUMP high_salary_employees;
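With the sample data, only Sara, Mike and Jane earn more than 50000, so the DUMP should print roughly:
(2,Sara,Marketing,60000)
(3,Mike,Engineering,70000)
(4,Jane,Marketing,55000)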
Summary of Operations
These scripts cover the core Pig Latin operations named in the aim: ORDER BY for sorting, GROUP with SUM for grouping and aggregation, JOIN for joining, FOREACH ... GENERATE for projection, and FILTER for filtering.
Step-by-Step Process: Moving a File within HDFS
1. Verify Source File Existence: Check if the source file exists at the specified path.
2. Check Permissions: Ensure that the user has necessary permissions to read from the source location and
write to the destination location.
3. Lock Source File: Lock the source file to prevent concurrent modifications.
4. Create Destination Directory: Create the destination directory if it does not exist.
5. Copy File Contents: Copy the contents of the source file to the destination file.
6. Update File Metadata: Update the file metadata (e.g., ownership, permissions, timestamps) at the
destination.
7. Delete Source File: Delete the source file.
8. Unlock Destination File: Unlock the destination file.
Java API: the same move can be performed programmatically with the Hadoop FileSystem API.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
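A minimal sketch (the paths and the fs.defaultFS value are assumptions) that performs the move with FileSystem.rename(), which updates the NameNode metadata rather than physically copying blocks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMoveFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: the NameNode address configured earlier in this document
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path("/user/hadoop/source.txt");         // hypothetical source file
        Path dst = new Path("/user/hadoop/archive/source.txt"); // hypothetical destination

        fs.mkdirs(dst.getParent());          // create the destination directory if missing
        boolean moved = fs.rename(src, dst); // move the file (metadata-only operation in HDFS)
        System.out.println("Move " + (moved ? "succeeded" : "failed"));

        fs.close();
    }
}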
To delete a file from HDFS, use the `hdfs dfs -rm` command:
1. **Open Terminal**: Use the command line or terminal to enter HDFS commands.
2. **Run Remove Command**: Use the following command to delete a file:
```bash
hdfs dfs -rm /path/to/your/file
```
- **Example**: If you have a file named `example.txt` in `/user/hadoop/`, the command will be:
```bash
hdfs dfs -rm /user/hadoop/example.txt
```
3. **Confirm Deletion**: The command will delete the specified file. You’ll see a confirmation message
in the terminal if the file is successfully deleted.
To check the total size of a file, use the `hdfs dfs -stat` command, which gives detailed information about a file, including its size.
1. **Open Terminal**.
2. **Run the `stat` Command**:
```bash
hdfs dfs -stat %b /path/to/your/file
```
- **Explanation**:
- `%b`: Displays the file length in bytes.
- **Example**:
```bash
hdfs dfs -stat %b /user/hadoop/example.txt
```
3. **Output**: The total file size in bytes will be shown.
Example Output
If `example.txt` is 1048576 bytes (1 MB), the command output might look like:
```bash
1048576
```
This output represents the total size (aggregate length) of the file.