Conversation

@pavibhai (Contributor) commented Nov 28, 2022

What?

This adds support for FileIO with Apache ORC by wrapping InputFile and OutputFile into consumable FileSystem objects.

Why?

Without this, ORC cannot leverage the benefits of a better FileIO implementation such as S3FileIO compared to S3AFileSystem.

Tested?

New unit tests have been added to ensure the use of FileIO in ORC Readers and Writers

assertNotNull(is);

// Cannot use the filesystem for any other operation
assertThrows(
Contributor

In general it would be good to always check against a specific error message. We typically use Assertions.assertThatThrownBy(() -> codeThatFails()).isInstanceOf(Xyz.class).hasMessage(...) to do that

Contributor Author

Thanks. Changing
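For illustration, the suggested AssertJ pattern could look roughly like the following; the failing call, exception type, and message here are placeholders rather than the actual ones in this PR:

Assertions.assertThatThrownBy(() -> fs.open(new Path("dummy://file")))
    .isInstanceOf(UnsupportedOperationException.class)
    .hasMessage("Open is not supported by this filesystem");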

// Show that FileSystem access is not possible for the file we are supplying as the scheme
// dummy is not handled
ProxyOutputFile outFile = new ProxyOutputFile(Files.localOutput(temp.newFile()));
Assertions.assertThrows(
Contributor

It would be better to move such assertions to AssertJ, because otherwise it will potentially be more work to migrate such code when upgrading to JUnit 5.

Contributor Author

Makes sense. Changing.

@RussellSpitzer (Member)

I'm not a big fan of the fake filesystem approach here, mostly because I'm afraid of mocking an object like that when we don't have the full filesystem state. I feel like this patch would have us maintaining a rather large Hadoop mock.

Is there any chance we can convince the ORC project to allow the creation of a writer from a java.io.OutputStream instead of always creating its own file?


@Override
public FSDataInputStream open(Path f) throws IOException {
return open(f, 0);
Contributor

Rather than making up a fake buffer size, I think it would be better to call this method from open(Path, int) and discard the buffer size that's passed into that one.

Contributor Author

Agreed this is sleek. Will change.
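A minimal sketch of that shape, assuming the wrapping FileSystem holds an Iceberg InputFile field named inputFile and that wrapAsHadoopStream is a placeholder for the adapter that turns its SeekableInputStream into a Hadoop-compatible stream:

@Override
public FSDataInputStream open(Path f) throws IOException {
  return new FSDataInputStream(wrapAsHadoopStream(inputFile.newStream()));
}

@Override
public FSDataInputStream open(Path f, int bufferSize) throws IOException {
  // discard the buffer-size hint; the FileIO stream manages its own buffering
  return open(f);
}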


@Override
public FSDataOutputStream create(Path f) throws IOException {
return create(f, null, true, 0, (short) 0, 0, null);
Contributor

Like above, I think this is the method to implement, not create(...).

Contributor Author

Agreed

}

@Override
public FSDataOutputStream create(Path f) throws IOException {
Contributor

Should this use a Preconditions to check that the path to open is the same as outputFile.location()?

Contributor Author

We do have that check. Per the previous comment, the check will now move to the `create(Path f, boolean overwrite)` method, with the other method calling this one.
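A hedged sketch of the agreed shape, assuming the wrapping FileSystem holds an Iceberg OutputFile field named outputFile (the precondition message is a placeholder):

@Override
public FSDataOutputStream create(Path f, boolean overwrite) throws IOException {
  Preconditions.checkArgument(
      f.toString().equals(outputFile.location()),
      "Cannot create %s: does not match the wrapped output file %s", f, outputFile.location());
  OutputStream stream = overwrite ? outputFile.createOrOverwrite() : outputFile.create();
  return new FSDataOutputStream(stream, null);
}

@Override
public FSDataOutputStream create(Path f) throws IOException {
  // Hadoop's single-argument create(Path) defaults to overwrite
  return create(f, true);
}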

}
}

private static class NullFileSystem extends FileSystem {
Contributor

Is there a better place for this fake FS stuff? Maybe a top-level util class?

Contributor Author

Sure. Moved to FileIOFSUtil class

@Override
protected void finalize() throws Throwable {
super.finalize();
if (!closed) {
Contributor

I don't think there's a need for this since the input stream should have its own finalizer.

Contributor Author

Makes sense. Removing this.

if (file instanceof HadoopInputFile) {
readerOptions.filesystem(((HadoopInputFile) file).getFileSystem());
} else {
readerOptions.filesystem(new InputFileSystem(file)).maxLength(file.getLength());
Contributor

maxLength is used to avoid calls to getFileStatus? If so, we should add a comment.

Contributor Author

Added comment to clarify this.
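For reference, the non-Hadoop branch with such a clarifying comment might read along these lines (a sketch, not the exact wording added):

} else {
  // Pass the known length so ORC does not need to call getFileStatus() on the
  // wrapping FileSystem, which a FileIO-backed implementation cannot serve.
  readerOptions.filesystem(new InputFileSystem(file)).maxLength(file.getLength());
}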

ReaderOptions readerOptions = OrcFile.readerOptions(config).useUTCTimestamp(true);
if (file instanceof HadoopInputFile) {
readerOptions.filesystem(((HadoopInputFile) file).getFileSystem());
} else {
Contributor

The newFileReader(String, ReaderOptions) above should be deprecated, right?

Contributor Author

I moved the OrcFile.createReader call into the new method and deprecated the previous one. That is a package-local method; if acceptable, I can remove it since it has no other uses.

Contributor

Yeah, rather than deprecating we should just remove it if it isn't public.

Contributor Author

deleted

private boolean closed;
private final StackTraceElement[] createStack;

private WrappedSeekableInputStream(SeekableInputStream inputStream) {
Contributor

Can you move this stream to HadoopStreams? There's nothing ORC specific here and we may find other uses for it. That would also allow us to update the wrap methods that convert to detect double wrapping and return the underlying stream.

Contributor Author

Did the following:

  • Moved class to HadoopStreams
  • Made HadoopStreams public to be accessible here
  • Implemented DelegatingInputStream
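A rough sketch of the double-wrap detection mentioned above, assuming DelegatingInputStream exposes its delegate via getDelegate() and hadoopWrap stands in for the existing wrapper inside HadoopStreams:

static SeekableInputStream wrap(FSDataInputStream stream) {
  InputStream inner = stream.getWrappedStream();
  if (inner instanceof DelegatingInputStream
      && ((DelegatingInputStream) inner).getDelegate() instanceof SeekableInputStream) {
    // already wrapping an Iceberg stream: unwrap instead of wrapping twice
    return (SeekableInputStream) ((DelegatingInputStream) inner).getDelegate();
  }
  return hadoopWrap(stream);
}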

options.setSchema(orcSchema);
this.writer = newOrcWriter(file, options, metadata);

this.writer = ORC.newFileWriter(file, options, metadata);
Contributor

is the options.fileSystem call still needed above?

Contributor Author

Yes, that is still needed. We continue to retain the current behavior for HadoopIO, so the FileSystem from HadoopOutputFile or HadoopInputFile is used for the FS operations. When we use anything other than HadoopIO, the FileIOFS kicks in.
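A hedged sketch of that selection, assuming HadoopOutputFile exposes its FileSystem the same way HadoopInputFile does on the reader path and OutputFileSystem is the FileIO-backed wrapper from FileIOFSUtil:

if (file instanceof HadoopOutputFile) {
  // keep the current behavior: reuse the Hadoop FileSystem from the output file
  options.fileSystem(((HadoopOutputFile) file).getFileSystem());
} else {
  // anything that is not Hadoop-backed goes through the FileIO-based filesystem
  options.fileSystem(new OutputFileSystem(file));
}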

.hasMessageStartingWith("Failed to get file system for path: dummy");

// We are creating the proxy
SortOrder sortOrder = SortOrder.builderFor(SCHEMA).withOrderId(10).asc("id").build();
Contributor

I don't think the sort order is needed.

Contributor Author

Sure, removed it to simplify the test

.schema(SCHEMA)
.createWriterFunc(GenericOrcWriter::buildWriter)
.overwrite()
.withSpec(PartitionSpec.unpartitioned())
Contributor

This should be the default, right?

Contributor Author

Sorry didn't follow this. What do we mean by default?

Before the patch, if you supplied a Local[Output|Input]File, it was converted to a FileSystem operation, resulting in a LocalFileSystem handling it.
To ensure that is not happening, we mimic a scheme dummy that is not handled: if any FS operation goes through the normal path, it fails with Failed to get file system for path: dummy, but if it is handled via FileIOFS, it succeeds.

I will add this comment to the test to make it clearer.
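For context, a sketch of such a proxy: it delegates all I/O to a real local OutputFile but reports an unresolvable dummy scheme as its location, so any code path that falls back to Hadoop FileSystem resolution fails fast (the exact location format is an assumption):

static class ProxyOutputFile implements OutputFile {
  private final OutputFile delegate;

  ProxyOutputFile(OutputFile delegate) {
    this.delegate = delegate;
  }

  @Override
  public PositionOutputStream create() {
    return delegate.create();
  }

  @Override
  public PositionOutputStream createOrOverwrite() {
    return delegate.createOrOverwrite();
  }

  @Override
  public String location() {
    // report a scheme no Hadoop FileSystem can resolve
    return "dummy://" + delegate.location();
  }

  @Override
  public InputFile toInputFile() {
    // the real test likely wraps the input side in a similar proxy
    return delegate.toInputFile();
  }
}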

Contributor

Do we need to add withSpec or will it use the unpartitioned spec by default?

Contributor Author

Yeah, we need it; without it I get an error: Cannot create data writer without spec.

@rdblue (Contributor) commented Dec 7, 2022

Looks mostly good overall. Thanks for getting this working @pavibhai!

@pavibhai (Contributor Author) commented Dec 7, 2022

I'm not a big fan of the fake filesystem approach here, mostly because I'm afraid of mocking an object like that when we don't have the full filesystem state. I feel like this patch would have us maintaining a rather large Hadoop mock.

Is there any chance we can convince the ORC project to allow the creation of a writer from a java.io.OutputStream instead of always creating its own file?

Thanks for this insight. Overall I agree that this should not be our long-term answer, and we should work with the ORC community to make this integration better.
In the meantime, given the relative stability of the FileSystem APIs and the limited exposure (just the creation of the input and output streams), I hope this is not too painful to maintain.

@github-actions github-actions bot added the core label Dec 7, 2022
@pavibhai (Contributor Author) commented Dec 7, 2022

Looks mostly good overall. Thanks for getting this working @pavibhai!

Thanks @rdblue for your comments. I have addressed them; for a few, I added further clarification.


@SuppressWarnings("checkstyle:AbbreviationAsWordInName")
public class ORC {
private static final Logger LOG = LoggerFactory.getLogger(ORC.class);
Collaborator

nit: I don't see any usage for this new Logger introduced by this patch.

Contributor Author

Nice catch. Removing this.

import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

public class FileIOFSUtil {
Contributor

I don't think this should be public. Could you make it package-private?

Contributor Author

Sure, done.

@rdblue rdblue merged commit b4d9770 into apache:master Dec 18, 2022
@rdblue (Contributor) commented Dec 18, 2022

Thanks, @pavibhai! Great to have this fixed, even if it's a hack 😅.
