02 HDFS - 3 JavaAPI
FileSystem Implementations
• Hadoop ships with multiple concrete
implementations:
– org.apache.hadoop.fs.LocalFileSystem
• Good old native file system using local disk(s)
– org.apache.hadoop.hdfs.DistributedFileSystem
• Hadoop Distributed File System (HDFS)
• Will mostly focus on this implementation
– org.apache.hadoop.hdfs.HftpFileSystem
• Access HDFS in read-only mode over HTTP
– org.apache.hadoop.fs.ftp.FTPFileSystem
• File system on FTP server
FileSystem Implementations
• FileSystem concrete implementations
– Two options that are backed by the Amazon S3 cloud
• org.apache.hadoop.fs.s3.S3FileSystem
• org.apache.hadoop.fs.s3native.NativeS3FileSystem
• http://wiki.apache.org/hadoop/AmazonS3
– org.apache.hadoop.fs.kfs.KosmosFileSystem
• Backed by CloudStore
• http://code.google.com/p/kosmosfs
FileSystem Implementations
• Different use cases for different concrete
implementations
• HDFS is the most common choice
– org.apache.hadoop.hdfs.DistributedFileSystem
LoadConfigurations.java Example
public class LoadConfigurations {
    private final static String PROP_NAME = "fs.default.name";

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        System.out.println("After construction: " + conf.get(PROP_NAME));
        conf.addResource(new Path("core-site.xml"));
        System.out.println("After addResource: " + conf.get(PROP_NAME));
        conf.set(PROP_NAME, "hdfs://localhost:8111");
        System.out.println("After set: " + conf.get(PROP_NAME));
    }
}
Run LoadConfigurations
$ java -cp $PLAY_AREA/HadoopSamples.jar:\
    $HADOOP_HOME/share/hadoop/common/hadoop-common-2.0.0-cdh4.0.0.jar:\
    $HADOOP_HOME/share/hadoop/common/lib/* \
    hdfs.LoadConfigurations
After construction: file:///              (1. print the property with an empty Configuration)
After addResource: hdfs://localhost:9000  (2. add properties from core-site.xml)
After set: hdfs://localhost:8111          (3. manually set the property)
FileSystem API
• Recall FileSystem is a generic abstract class
used to interface with a file system
• FileSystem class also serves as a factory for
concrete implementations, with the
following methods
– public static FileSystem get(Configuration conf)
• Will use information from Configuration such as scheme
and authority
• Recall Hadoop loads conf/core-site.xml by default
• core-site.xml typically sets the fs.default.name property to
something like hdfs://localhost:8020
– org.apache.hadoop.hdfs.DistributedFileSystem will be used by
default
– Otherwise known as HDFS
Simple List Example
public class SimpleLocalLs {
    public static void main(String[] args) throws Exception {
        Path path = new Path("/");
        if (args.length == 1) {
            path = new Path(args[0]);
        }
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileStatus[] files = fs.listStatus(path);
        for (FileStatus file : files) {
            System.out.println(file.getPath().getName());
        }
    }
}
1: Create FileSystem
FileSystem fs = FileSystem.get(new Configuration());
• If you run with the yarn command, DistributedFileSystem
(HDFS) will be created
• Utilizes the fs.default.name property from configuration
• Recall that the Hadoop framework loads core-site.xml which
sets the property to HDFS (hdfs://localhost:8020)
2: Open Input Stream to a Path
...
InputStream input = null;
try {
input = fs.open(fileToRead);
...
• fs.open returns
org.apache.hadoop.fs.FSDataInputStream
– Other FileSystem implementations return their own
custom implementation of InputStream
• Opens the stream with a default buffer size of 4 KB
• If you want to provide your own buffer size
use
– fs.open(Path f, int bufferSize)
4: Close Stream
...
} finally {
IOUtils.closeStream(input);
...
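Why IOUtils.closeStream instead of input.close()? close() itself declares IOException, which is awkward inside a finally block. A minimal plain-Java sketch of the behavior IOUtils.closeStream provides (null-safe, exception-swallowing) — an illustration, not Hadoop's actual source:

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseQuietly {
    // Sketch of IOUtils.closeStream semantics: null-safe close
    // that swallows IOException, so it is safe inside finally
    static void closeStream(Closeable c) {
        if (c != null) {
            try {
                c.close();
            } catch (IOException ignored) {
                // nothing useful to do when cleanup itself fails
            }
        }
    }

    public static void main(String[] args) {
        closeStream(null); // no NullPointerException
        System.out.println("ok");
    }
}
```

Because closeStream never throws, a failed cleanup cannot mask an exception already propagating out of the try block.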
ReadFile.java Example
public class ReadFile {
    public static void main(String[] args) throws IOException {
        Path fileToRead = new Path("/training/data/readMe.txt");
        FileSystem fs = FileSystem.get(new Configuration());
        InputStream input = null;
        try {
            input = fs.open(fileToRead);
            IOUtils.copyBytes(input, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(input);
        }
    }
}
Seeking to a Position
• FSDataInputStream implements Seekable
interface
– void seek(long pos) throws IOException
• Seek to a particular position in the file
• Next read will begin at that position
• If you attempt to seek past the end of the file, an IOException
is thrown
• Somewhat expensive operation – strive for streaming and
not seeking
– long getPos() throws IOException
• Returns the current position/offset from the beginning of
the stream/file
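The seek/getPos contract is the same one java.io.RandomAccessFile exposes, so it can be demonstrated without a running cluster. This is an analogy only, not Hadoop code; the file name and contents are made up:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class SeekDemo {
    public static void main(String[] args) throws IOException {
        // Write a small scratch file to seek around in
        Path tmp = Files.createTempFile("seek", ".txt");
        Files.write(tmp, "Hello HDFS!".getBytes());
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            raf.seek(6);                  // next read begins at offset 6
            byte[] buf = new byte[5];
            raf.readFully(buf);
            System.out.println(new String(buf));      // HDFS!
            System.out.println(raf.getFilePointer()); // 11
        }
        Files.delete(tmp);
    }
}
```

FSDataInputStream behaves the same way, except that on HDFS a seek may have to contact a different DataNode, which is why streaming is preferred over seeking.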
SeekReadFile.java Example
public class SeekReadFile {
    public static void main(String[] args) throws IOException {
        Path fileToRead = new Path("/training/data/readMe.txt");
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream input = null;
        try {
            // Start at position 0
            input = fs.open(fileToRead);
            System.out.print("start position=" + input.getPos() + ": ");
            IOUtils.copyBytes(input, System.out, 4096, false);
            // Seek to position 11
            input.seek(11);
            System.out.print("start position=" + input.getPos() + ": ");
            IOUtils.copyBytes(input, System.out, 4096, false);
            // Seek back to 0
            input.seek(0);
            System.out.print("start position=" + input.getPos() + ": ");
            IOUtils.copyBytes(input, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(input);
        }
    }
}
Write Data
1. Create FileSystem instance
2. Open OutputStream
– FSDataOutputStream in this case
– Open a stream directly to a Path from FileSystem
– Creates all needed directories on the provided path
3. Copy data using IOUtils
WriteToFile.java Example
public class WriteToFile {
    public static void main(String[] args) throws IOException {
        String textToWrite = "Hello HDFS! Elephants are awesome!\n";
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(textToWrite.getBytes()));
        Path toHdfs = new Path("/training/playArea/writeMe.txt");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // 2: Open OutputStream
        FSDataOutputStream out = fs.create(toHdfs);
        // 3: Copy data (this overload closes both streams when done)
        IOUtils.copyBytes(in, out, conf);
    }
}
Run WriteToFile
$ yarn jar $PLAY_AREA/HadoopSamples.jar hdfs.WriteToFile
$ hdfs dfs -cat /training/playArea/writeMe.txt
Hello HDFS! Elephants are awesome!
FileSystem: Writing Data
• FileSystem's create and append methods
have overloaded versions that take a callback
interface (Progressable) to notify the client of progress
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream out = fs.create(toHdfs, new Progressable() {
    @Override
    public void progress() {
        System.out.print("..");
    }
});
Overwrite Flag
• Recall FileSystem's create(Path) creates all
the directories on the provided path
– create(new Path("/doesnt_exist/doesnt_exist/file.txt"))
– and silently overwrites an existing file; if you want to
protect yourself, utilize the overloaded method
create(Path f, boolean overwrite):
Overwrite Flag Example
Path toHdfs = new Path("/training/playArea/writeMe.txt");
FileSystem fs = FileSystem.get(conf);
// throws IOException if the file already exists
FSDataOutputStream out = fs.create(toHdfs, false);
Copy from Local to HDFS
FileSystem fs = FileSystem.get(new Configuration());
Path fromLocal = new Path(
        "/home/hadoop/Training/exercises/sample_data/hamlet.txt");
Path toHdfs = new Path("/training/playArea/hamlet.txt");
fs.copyFromLocalFile(fromLocal, toHdfs);
$ hdfs dfs -ls /training/playArea/    (empty directory before copy)
Delete Data
FileSystem.delete(Path path, boolean recursive)
Path toDelete =
        new Path("/training/playArea/writeMe.txt");
boolean isDeleted = fs.delete(toDelete, false);
System.out.println("Deleted: " + isDeleted);
FileSystem: listStatus
Browse the FileSystem with the listStatus()
methods
Path path = new Path("/");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileStatus[] files = fs.listStatus(path);
for (FileStatus file : files) {
    System.out.println(file.getPath().getName());
}
LsWithPathFilter.java example
FileSystem fs = FileSystem.get(conf);
// Restrict the result of listStatus() by supplying a PathFilter object
FileStatus[] files = fs.listStatus(path, new PathFilter() {
    @Override
    public boolean accept(Path path) {
        // Do not show paths whose name equals "user"
        if (path.getName().equals("user")) {
            return false;
        }
        return true;
    }
});

for (FileStatus file : files) {
    System.out.println(file.getPath().getName());
}
Run LsWithPathFilter Example
$ yarn jar $PLAY_AREA/HadoopSamples.jar hdfs.SimpleLs
training
user
$ yarn jar $PLAY_AREA/HadoopSamples.jar hdfs.LsWithPathFilter
training
FileSystem: Globbing
• FileSystem supports file name pattern
matching via globStatus() methods
• Good for traversing through a subset of
files by using a pattern
• Support is similar to bash globbing: *, ?, etc.
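Since this glob syntax is close to what java.nio.file's PathMatcher implements, patterns can be experimented with locally before running against HDFS. Note this uses the standard-library matcher, not Hadoop's globStatus(), and the two differ in some corner cases:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    public static void main(String[] args) {
        // "201?" matches a literal "201" followed by any single character
        PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:201?");
        System.out.println(m.matches(Paths.get("2010"))); // true
        System.out.println(m.matches(Paths.get("2008"))); // false

        // "{2007,2008}" matches a string from the string set
        PathMatcher set = FileSystems.getDefault()
                .getPathMatcher("glob:{2007,2008}");
        System.out.println(set.matches(Paths.get("2008"))); // true
    }
}
```

Against the /training/data/glob/ listing on the next slide, the pattern 201? would select the 2010 and 2011 directories but not 2007 or 2008.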
SimpleGlobbing.java
public class SimpleGlobbing {
    public static void main(String[] args)
            throws IOException {
        // Read the glob pattern from the command line
        Path glob = new Path(args[0]);
        FileSystem fs = FileSystem.get(new Configuration());
        // Similar usage to the listStatus method
        FileStatus[] files = fs.globStatus(glob);
        for (FileStatus file : files) {
            System.out.println(file.getPath().getName());
        }
    }
}
Run SimpleGlobbing
$ hdfs dfs -ls /training/data/glob/
Found 4 items
drwxr-xr-x - hadoop supergroup 0 2011-12-24 11:20 /training/data/glob/2007
drwxr-xr-x - hadoop supergroup 0 2011-12-24 11:20 /training/data/glob/2008
drwxr-xr-x - hadoop supergroup 0 2011-12-24 11:21 /training/data/glob/2010
drwxr-xr-x - hadoop supergroup 0 2011-12-24 11:21 /training/data/glob/2011
FileSystem: Globbing
Glob            Explanation
?               Matches any single character
*               Matches zero or more characters
[abc]           Matches a single character from the character set {a,b,c}
[a-b]           Matches a single character from the character range {a...b}.
                Note that character a must be lexicographically less than or
                equal to character b.
[^a]            Matches a single character that is not from the character set
                or range {a}. Note that the ^ character must occur immediately
                to the right of the opening bracket.
\c              Removes (escapes) any special meaning of character c
{ab,cd}         Matches a string from the string set {ab, cd}
{ab,c{de,fh}}   Matches a string from the string set {ab, cde, cfh}
BadRename.java
FileSystem fs = FileSystem.get(new Configuration());
Path source = new Path("/does/not/exist/file.txt");
Path nonExistentPath = new Path("/does/not/exist/file1.txt");
// rename returns false (rather than throwing) when the source does not exist
boolean result = fs.rename(source, nonExistentPath);
System.out.println("Rename: " + result);
© 2012 coreservlets.com and Dima May
Wrap-Up
Summary
• We learned about
– HDFS API
– How to use Configuration class
– How to read from HDFS
– How to write to HDFS
– How to browse HDFS
Questions?
More info:
http://www.coreservlets.com/hadoop-tutorial/ – Hadoop programming tutorial
http://courses.coreservlets.com/hadoop-training.html – Customized Hadoop training courses, at public venues or onsite at your organization
http://courses.coreservlets.com/Course-Materials/java.html – General Java programming tutorial
http://www.coreservlets.com/java-8-tutorial/ – Java 8 tutorial
http://www.coreservlets.com/JSF-Tutorial/jsf2/ – JSF 2.2 tutorial
http://www.coreservlets.com/JSF-Tutorial/primefaces/ – PrimeFaces tutorial
http://coreservlets.com/ – JSF 2, PrimeFaces, Java 7 or 8, Ajax, jQuery, Hadoop, RESTful Web Services, Android, HTML5, Spring, Hibernate, Servlets, JSP, GWT, and other Java EE training