FileSystem - Apache Flink

File sources and sinks in Flink allow for reading from and writing to files. File sources support both batch and streaming reads from files using various formats like CSV, Avro, and Parquet. File sinks support writing data to files in both row-encoded and bulk-encoded formats with options for rolling policies. Common bulk formats supported include Parquet, Avro, and others.


On This Page

File Source
Format Types
File Sink
Format Types
Part File Lifecycle

This connector provides a unified Source and Sink for BATCH and STREAMING that reads or writes (partitioned) files to file systems supported by the Flink FileSystem abstraction. The file system connector provides the same guarantees for both BATCH and STREAMING and is designed to deliver exactly-once semantics for STREAMING execution. The connector supports reading and writing sets of files from any (distributed) file system (e.g. POSIX, S3, HDFS) with a format (e.g. Avro, CSV, Parquet).

File Source
The File Source is based on the unified Source API, a data source that reads files both in batch and in streaming mode. It is divided into two parts: the SplitEnumerator and the SourceReader.

The SplitEnumerator is responsible for discovering and identifying the files to read and assigns them to the SourceReader.
The SourceReader requests the files it needs to process and reads them from the file system.

You need to combine the File Source with a format, which allows parsing CSV, decoding AVRO, or reading Parquet columnar files.

Bounded and Unbounded Streams

A bounded File Source lists all files (via the SplitEnumerator, using a recursive directory listing with filtered-out hidden files) and reads them all.

An unbounded File Source is created when the enumerator is configured for continuous file discovery. In that case, the SplitEnumerator enumerates files as in the bounded case but, after a certain interval, repeats the enumeration. For any repeated enumeration, the SplitEnumerator filters out previously detected files and only sends the new ones to the SourceReader.

Usage

You can start building a File Source via one of the following API calls:

Java Python

// reads the contents of a file from a file stream
FileSource.forRecordStreamFormat(StreamFormat,Path...);

// reads batches of records from a file at a time
FileSource.forBulkFileFormat(BulkFormat,Path...);

This creates a FileSource.FileSourceBuilder on which you can configure all the properties of the File Source.

For the bounded/batch case, the File Source processes all files under the given path(s). For the continuous/streaming case, the source periodically checks the paths for new files and starts reading those.

When you start creating a File Source (via the FileSource.FileSourceBuilder created through one of the methods above), the source is in bounded/batch mode by default. Call AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration) to put the source into continuous streaming mode:

Java Python

final FileSource<String> source =
        FileSource.forRecordStreamFormat(...)
                .monitorContinuously(Duration.ofMillis(5))
                .build();

Format Types
The file formats define the readers that parse and decode the actual file contents. The File Source supports the following format interfaces, which trade off simplicity of implementation against flexibility and efficiency:

StreamFormat reads the contents of a file from a file stream. It is the simplest format to implement and provides many features out-of-the-box (such as checkpointing logic), but is limited in the optimizations it can apply (such as object reuse or batching).

BulkFormat reads batches of records from a file at a time. It is the most "low level" format to implement, but offers the greatest flexibility for optimizing the implementation.
TextLine Format

A StreamFormat reader that formats text lines from a file. It uses Java's built-in InputStreamReader to decode the byte stream with various supported charsets. This format does not support optimized recovery from checkpoints: on recovery, it re-reads and discards the number of lines that were processed before the last checkpoint, because the offsets of lines cannot be tracked through the charset decoder.
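As an illustration, a minimal sketch of wiring this format into a File Source (assuming the TextLineInputFormat reader shipped with the file connector; the input path is a placeholder):

Java

import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;

// Each produced record is one line of the file, decoded as a String.
final FileSource<String> source =
        FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path("/path/to/input"))
                .build();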

SimpleStreamFormat Abstract Class

This is a simpler version of StreamFormat for formats that are not splittable. Custom reads of arrays or files can be implemented by extending SimpleStreamFormat:

Java

private static final class ArrayReaderFormat extends SimpleStreamFormat<byte[]> {


private static final long serialVersionUID = 1L;

@Override
public Reader<byte[]> createReader(Configuration config, FSDataInputStream stream)
throws IOException {
return new ArrayReader(stream);
}

@Override
public TypeInformation<byte[]> getProducedType() {
return PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO;
}
}

final FileSource<byte[]> source =
        FileSource.forRecordStreamFormat(new ArrayReaderFormat(), path).build();

An example of a SimpleStreamFormat is CsvReaderFormat. It can be initialized like this:

CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forPojo(SomePojo.class);


FileSource<SomePojo> source =
FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();

The schema for CSV parsing is, in this case, automatically derived from the fields of the SomePojo class using the Jackson library. (Note: you might need to add an @JsonPropertyOrder({field1, field2, ...}) annotation to your class definition, with the field order matching exactly that of the CSV file columns.)

If you need more fine-grained control over the CSV schema or the parsing options, use the lower-level forSchema static factory method of CsvReaderFormat:
CsvReaderFormat<T> forSchema(Supplier<CsvMapper> mapperFactory,
Function<CsvMapper, CsvSchema> schemaGenerator,
TypeInformation<T> typeInformation)
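For illustration, a sketch of building the format from a manually configured Jackson CsvSchema (the pipe separator and the SomePojo class are only illustrative assumptions):

Java

CsvMapper mapper = new CsvMapper();
CsvSchema schema = mapper.schemaFor(SomePojo.class)
        .withoutQuoteChar()
        .withColumnSeparator('|');

CsvReaderFormat<SomePojo> csvFormat = CsvReaderFormat.forSchema(
        () -> mapper, ignored -> schema, TypeInformation.of(SomePojo.class));

FileSource<SomePojo> source =
        FileSource.forRecordStreamFormat(csvFormat, Path.fromLocalFile(...)).build();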

Bulk Format

The BulkFormat reads and decodes batches of records at a time. Examples of bulk formats are the ORC Format and the Parquet Format. The outer BulkFormat class acts mainly as a configuration holder and factory for the reader. The actual reading is done by a BulkFormat.Reader, which is created by BulkFormat#createReader(Configuration, FileSourceSplit). If a bulk reader is created based on a checkpoint during checkpointed streaming execution, then the reader is re-created by BulkFormat#restoreReader(Configuration, FileSourceSplit).

A SimpleStreamFormat can be turned into a BulkFormat by wrapping it in a StreamFormatAdapter:

BulkFormat<SomePojo, FileSourceSplit> bulkFormat =
        new StreamFormatAdapter<>(CsvReaderFormat.forPojo(SomePojo.class));

Customizing File Enumeration

When the default file enumeration does not fit your needs, you can provide a custom FileEnumerator that controls how files are listed and turned into splits. The HiveSourceFileEnumerator below is such a customized implementation, taken from the Hive connector:

Java

/**
 * A FileEnumerator implementation for Hive source, which generates splits based on
 * HiveTablePartition.
 */
public class HiveSourceFileEnumerator implements FileEnumerator {

// reference constructor
public HiveSourceFileEnumerator(...) {
...
}

/***
 * Generates all file splits for the relevant files under the given paths. The {@code
 * minDesiredSplits} is an optional hint indicating how many splits would be necessary to
 * exploit parallelism properly.
 */
@Override
public Collection<FileSourceSplit> enumerateSplits(Path[] paths, int minDesiredSplits)
throws IOException {
// createInputSplits:splitting files into fragmented collections
return new ArrayList<>(createInputSplits(...));
}

...

/***
 * A factory to create HiveSourceFileEnumerator.
 */
public static class Provider implements FileEnumerator.Provider {

...
@Override
public FileEnumerator create() {
return new HiveSourceFileEnumerator(...);
}
}
}
// use the customized file enumeration
new HiveSource<>(
...,
new HiveSourceFileEnumerator.Provider(
partitions != null ? partitions : Collections.emptyList(),
new JobConfWrapper(jobConf)),
...);

Current Limitations

Watermarking does not work very well for large backlogs of files, because watermarks eagerly advance within a file, and the next file might contain data later than the watermark again.

For unbounded File Sources, the enumerator remembers the paths of all already-processed files, which is state that can, in some cases, grow rather large. There are plans to add a compressed form of tracking already-processed files in the future.

Behind the Scenes

If you are interested in how the File Source works under the design of the new data source API, the Source API documentation on data sources and FLIP-27 provide more detailed descriptions.

File Sink
The File Sink writes incoming data into buckets. Given that the incoming streams can be unbounded, the data in each bucket is organized into part files of finite size.

Each bucket contains at least one part file for every subtask of the sink that has received data for that bucket, and additional part files are created according to the configurable rolling policy. For Row-encoded Formats (see Format Types), the default policy rolls part files based on size, a timeout specifying the maximum time a file can stay open, and an inactivity timeout after which the file is closed. For Bulk-encoded Formats, a roll happens on every checkpoint, and the user can specify additional size- or time-based conditions.

Important: Checkpointing needs to be enabled when using the FileSink in STREAMING mode. Part files can only be finalized on successful checkpoints. If checkpointing is disabled, part files stay forever in the in-progress or pending state and cannot be safely read by downstream systems.
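For example, checkpointing can be enabled on the execution environment like this (a minimal sketch; the 60-second interval is only an illustrative value):

Java

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Without successful checkpoints, part files never leave the in-progress/pending state.
env.enableCheckpointing(60_000);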
Format Types
The FileSink supports both row-wise (Row-encoded) and bulk (Bulk-encoded) encoding formats, such as Apache Parquet. The two variants come with their respective builders, which can be created with the following static methods:

Row-encoded sink: FileSink.forRowFormat(basePath, rowEncoder)


Bulk-encoded sink: FileSink.forBulkFormat(basePath, bulkWriterFactory)

When creating either a Row-encoded or a Bulk-encoded sink, you have to specify the base path where the buckets will be stored and the encoding logic for the data.

Please check out the JavaDoc for FileSink for all the configuration options and more documentation about the implementation of the different data formats.

Row-encoded Formats
Row-encoded formats need to specify an Encoder that is used to serialize individual rows to the OutputStream of the in-progress part files.

In addition to the bucket assigner, the RowFormatBuilder allows the user to specify:

Custom RollingPolicy: a rolling policy to override the DefaultRollingPolicy
bucketCheckInterval (default = 1 min): the interval for checking time-based rolling policies

Basic usage for writing String elements looks like this:

Java Scala Python

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

import java.time.Duration;
DataStream<String> input = ...;

final FileSink<String> sink = FileSink
        .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
        .withRollingPolicy(
                DefaultRollingPolicy.builder()
                        .withRolloverInterval(Duration.ofMinutes(15))
                        .withInactivityInterval(Duration.ofMinutes(5))
                        .withMaxPartSize(MemorySize.ofMebiBytes(1024))
                        .build())
        .build();

input.sinkTo(sink);

This example creates a simple sink that assigns records to the default one-hour time buckets. It also specifies a rolling policy that rolls the in-progress part file when any of the following three conditions is met:

It contains at least 15 minutes worth of data
It has not received new records for the last 5 minutes
The file size has reached 1 GB (after writing the last record)

Bulk-encoded Formats
Bulk-encoded sinks are created similarly to the row-encoded ones, but instead of an Encoder you have to specify a BulkWriter.Factory. The BulkWriter logic defines how new elements are added and flushed, and how a batch of records is finalized for further encoding.

Flink comes with five built-in BulkWriter factories:

ParquetWriterFactory
AvroWriterFactory
SequenceFileWriterFactory
CompressWriterFactory
OrcBulkWriterFactory

Important: Bulk-encoded formats can only be combined with a rolling policy that extends CheckpointRollingPolicy, which rolls on every checkpoint. Such a policy can additionally roll based on size or processing time.

Parquet Format

Flink contains built-in convenience methods for creating Parquet writer factories for Avro data; see the AvroParquetWriters class.

For writing to other Parquet-compatible data formats, users need to create a ParquetWriterFactory with a custom implementation of the ParquetBuilder interface.

To use the Parquet bulk encoder in your application, add the following dependency:

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-parquet_2.12</artifactId>
<version>1.16.0</version>
</dependency>

To use the Parquet format in PyFlink jobs, the corresponding PyFlink JAR needs to be added as a dependency (Download). See the documentation on how to use JARs in PyFlink for details.

A FileSink that writes Avro data to Parquet format can be created like this:

Java Scala Python

import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.formats.parquet.avro.AvroParquetWriters;
import org.apache.avro.Schema;

Schema schema = ...;


DataStream<GenericRecord> input = ...;

final FileSink<GenericRecord> sink = FileSink
        .forBulkFormat(outputBasePath, AvroParquetWriters.forGenericRecord(schema))
        .build();

input.sinkTo(sink);

Similarly, a FileSink that writes Protobuf data to Parquet format can be created like this:

Java Scala

import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.formats.parquet.protobuf.ParquetProtoWriters;

// ProtoRecord is a generated protobuf Message class
DataStream<ProtoRecord> input = ...;

final FileSink<ProtoRecord> sink = FileSink
        .forBulkFormat(outputBasePath, ParquetProtoWriters.forType(ProtoRecord.class))
        .build();

input.sinkTo(sink);
For PyFlink users, ParquetBulkWriters can be used to create a BulkWriterFactory that writes Rows into Parquet files:

row_type = DataTypes.ROW([
DataTypes.FIELD('string', DataTypes.STRING()),
DataTypes.FIELD('int_array', DataTypes.ARRAY(DataTypes.INT()))
])

sink = FileSink.for_bulk_format(
OUTPUT_DIR, ParquetBulkWriters.for_row_type(
row_type,
hadoop_config=Configuration(),
utc_timestamp=True,
)
).build()

ds.sink_to(sink)

Avro Format

Flink also provides built-in support for writing data into Avro files. A list of convenience methods to create Avro writer factories can be found in the AvroWriters class.

To use the Avro writers in your application, add the following dependency:

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-avro</artifactId>
<version>1.16.0</version>
</dependency>

To use the Avro format in PyFlink jobs, the corresponding PyFlink JAR needs to be added as a dependency (Download). See the documentation on how to use JARs in PyFlink for details.

A FileSink that writes data to Avro files can be created like this:

Java Scala Python

import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.formats.avro.AvroWriters;
import org.apache.avro.Schema;
Schema schema = ...;
DataStream<GenericRecord> input = ...;

final FileSink<GenericRecord> sink = FileSink


.forBulkFormat(outputBasePath, AvroWriters.forGenericRecord(schema))
.build();

input.sinkTo(sink);

For creating customized Avro writers, e.g. enabling compression, users need to create an AvroWriterFactory with a custom implementation of the AvroBuilder interface:

Java Scala

AvroWriterFactory<?> factory = new AvroWriterFactory<>((AvroBuilder<Address>) out -> {
    Schema schema = ReflectData.get().getSchema(Address.class);
    DatumWriter<Address> datumWriter = new ReflectDatumWriter<>(schema);

    DataFileWriter<Address> dataFileWriter = new DataFileWriter<>(datumWriter);
    dataFileWriter.setCodec(CodecFactory.snappyCodec());
    dataFileWriter.create(schema, out);
    return dataFileWriter;
});

DataStream<Address> stream = ...


stream.sinkTo(FileSink.forBulkFormat(
outputBasePath,
factory).build());

ORC Format

To enable bulk encoding of data in ORC format, Flink offers the OrcBulkWriterFactory, which takes a concrete implementation of Vectorizer.

Like any other columnar format that encodes data in bulk fashion, Flink's OrcBulkWriter writes the input elements in batches, using ORC's VectorizedRowBatch.

Since the input elements have to be transformed into a VectorizedRowBatch, users have to extend the abstract Vectorizer class and override the vectorize(T element, VectorizedRowBatch batch) method. The method provides a VectorizedRowBatch instance that can be used directly, so users only have to write the logic that transforms the input element into ColumnVectors and sets them in the provided VectorizedRowBatch instance.

For example, if the input element is of type Person:
Java

class Person {
private final String name;
private final int age;
...
}

Then a child implementation that converts elements of type Person and sets them in the VectorizedRowBatch can look like this:

Java Scala

import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;

import java.io.IOException;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class PersonVectorizer extends Vectorizer<Person> implements Serializable {


public PersonVectorizer(String schema) {
super(schema);
}
    @Override
    public void vectorize(Person element, VectorizedRowBatch batch) throws IOException {
        BytesColumnVector nameColVector = (BytesColumnVector) batch.cols[0];
        LongColumnVector ageColVector = (LongColumnVector) batch.cols[1];
        int row = batch.size++;
        nameColVector.setVal(row, element.getName().getBytes(StandardCharsets.UTF_8));
        ageColVector.vector[row] = element.getAge();
    }
}

To use the ORC bulk encoder in an application, add the following dependency:

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-orc_2.12</artifactId>
<version>1.16.0</version>
</dependency>
Then a FileSink that writes data in ORC format can be created like this:

Java Scala

import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.orc.writer.OrcBulkWriterFactory;

String schema = "struct<_col0:string,_col1:int>";


DataStream<Person> input = ...;

final OrcBulkWriterFactory<Person> writerFactory = new OrcBulkWriterFactory<>(new PersonVectorizer(schema));

final FileSink<Person> sink = FileSink


.forBulkFormat(outputBasePath, writerFactory)
.build();

input.sinkTo(sink);

The OrcBulkWriterFactory can also take a Hadoop Configuration and Properties, so that a custom Hadoop configuration and ORC writer properties can be provided:

Java Scala

String schema = ...;


Configuration conf = ...;
Properties writerProperties = new Properties();

writerProperties.setProperty("orc.compress", "LZ4");
// other ORC supported properties can be set similarly

final OrcBulkWriterFactory<Person> writerFactory = new OrcBulkWriterFactory<>(


new PersonVectorizer(schema), writerProperties, conf);

The complete list of ORC writer properties can be found in the ORC documentation.

Users who want to add user metadata to the ORC files can do so by calling addUserMetadata(...) inside the overridden vectorize(...) method:

Java Scala
public class PersonVectorizer extends Vectorizer<Person> implements Serializable {
@Override
public void vectorize(Person element, VectorizedRowBatch batch) throws IOException {
...
String metadataKey = ...;
ByteBuffer metadataValue = ...;
this.addUserMetadata(metadataKey, metadataValue);
}
}

For PyFlink users, OrcBulkWriters can be used to create a BulkWriterFactory that writes records to files in ORC format.

To use the ORC format in PyFlink jobs, the corresponding PyFlink JAR needs to be added as a dependency (Download). See the documentation on how to use JARs in PyFlink for details.

row_type = DataTypes.ROW([
DataTypes.FIELD('name', DataTypes.STRING()),
DataTypes.FIELD('age', DataTypes.INT()),
])

sink = FileSink.for_bulk_format(
OUTPUT_DIR,
OrcBulkWriters.for_row_type(
row_type=row_type,
writer_properties=Configuration(),
hadoop_config=Configuration(),
)
).build()

ds.sink_to(sink)

Hadoop SequenceFile Format

To use the SequenceFile bulk encoder in your application, add the following dependency:

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-sequence-file</artifactId>
<version>1.16.0</version>
</dependency>

A simple SequenceFile writer can be created like this:
Java Scala

import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.configuration.GlobalConfiguration;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

DataStream<Tuple2<LongWritable, Text>> input = ...;


Configuration hadoopConf = HadoopUtils.getHadoopConfiguration(GlobalConfiguration.loadConfiguration());
final FileSink<Tuple2<LongWritable, Text>> sink = FileSink
.forBulkFormat(
outputBasePath,
new SequenceFileWriterFactory<>(hadoopConf, LongWritable.class, Text.class))
.build();

input.sinkTo(sink);

The SequenceFileWriterFactory supports additional constructor parameters to specify compression settings.
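As a sketch, assuming the constructor overload that additionally takes a codec name and a SequenceFile.CompressionType (reusing hadoopConf and outputBasePath from the example above):

Java

// Assumed overload: SequenceFileWriterFactory(conf, keyClass, valueClass, codecName, compressionType)
final FileSink<Tuple2<LongWritable, Text>> compressedSink = FileSink
        .forBulkFormat(
                outputBasePath,
                new SequenceFileWriterFactory<>(
                        hadoopConf, LongWritable.class, Text.class,
                        "BZip2Codec", SequenceFile.CompressionType.BLOCK))
        .build();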

Bucket Assignment

The bucketing logic defines how the data is structured into subdirectories inside the base output directory.

Both Row-encoded and Bulk-encoded formats (see Format Types) use the DateTimeBucketAssigner as the default assigner. By default, the DateTimeBucketAssigner creates hourly buckets based on the system default timezone, using the format yyyy-MM-dd--HH. Both the date format (i.e. the bucket size) and the timezone can be configured.

A custom BucketAssigner can be specified by calling .withBucketAssigner(assigner) on the format builders.

Flink comes with two built-in BucketAssigners:

DateTimeBucketAssigner: the default, time-based assigner
BasePathBucketAssigner: an assigner that stores all part files in the base path (a single global bucket)

Note: PyFlink only supports DateTimeBucketAssigner and BasePathBucketAssigner.
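For illustration, a sketch that switches from the default hourly buckets to daily buckets, assuming the DateTimeBucketAssigner constructor that takes a custom date format string:

Java

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

// Buckets are named yyyy-MM-dd, i.e. one directory per day instead of per hour.
final FileSink<String> sink = FileSink
        .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
        .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd"))
        .build();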

Rolling Policy

The RollingPolicy defines when a given in-progress part file is closed and moved to the pending and, later, to the finished state. Part files in the finished state are the ones that are safe to read and are guaranteed to contain valid data that will not be reverted in case of failure.

In STREAMING mode, the rolling policy, in combination with the checkpointing interval (pending files become finished on the next checkpoint), controls how quickly part files become available to downstream readers, as well as the size and number of these parts. In BATCH mode, part files become visible at the end of the job, but the rolling policy can control their maximum size.

Flink comes with two built-in RollingPolicies:

DefaultRollingPolicy
OnCheckpointRollingPolicy

Note: PyFlink only supports DefaultRollingPolicy and OnCheckpointRollingPolicy.
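For example, a row-encoded sink can be rolled on every checkpoint instead of using the size/time based DefaultRollingPolicy (a minimal sketch):

Java

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

// Every successful checkpoint rolls the in-progress part files.
final FileSink<String> sink = FileSink
        .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
        .withRollingPolicy(OnCheckpointRollingPolicy.build())
        .build();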

Part File Lifecycle

In order to use the output of the FileSink in downstream systems, we need to understand the naming and the lifecycle of the produced part files.

Part files can be in one of three states:

1. In-progress: the part file that is currently being written to
2. Pending: an in-progress file that was closed (due to the specified rolling policy) and is waiting to be committed
3. Finished: on successful checkpoints (STREAMING) or at the end of the input (BATCH), pending files transition to Finished

Only finished files are safe to read by downstream systems, as they are guaranteed not to be modified later.

Each writer subtask has at most a single in-progress part file per active bucket at any given time, but there can be several pending and finished files.

Part file example

To better understand the lifecycle of these files, let's look at a simple example with 2 sink subtasks:

└── 2019-08-25--12
├── part-4005733d-a830-4323-8291-8866de98b582-0.inprogress.bd053eb0-5ecf-4c85-8433-
└── part-81fc4980-a6af-41c8-9937-9939408a734b-0.inprogress.ea65a428-a1d0-4a0b-bbc5-

When the part file part-81fc4980-a6af-41c8-9937-9939408a734b-0 is rolled (say, because it grew too large), it becomes pending, but it is not renamed. The sink then opens a new part file: part-81fc4980-a6af-41c8-9937-9939408a734b-1:

└── 2019-08-25--12
├── part-4005733d-a830-4323-8291-8866de98b582-0.inprogress.bd053eb0-5ecf-4c85-8433-
├── part-81fc4980-a6af-41c8-9937-9939408a734b-0.inprogress.ea65a428-a1d0-4a0b-bbc5-
└── part-81fc4980-a6af-41c8-9937-9939408a734b-1.inprogress.bc279efe-b16f-47d8-b828-

As part-81fc4980-a6af-41c8-9937-9939408a734b-0 is now pending completion, it is finalized (becomes Finished) after the next successful checkpoint:

└── 2019-08-25--12
├── part-4005733d-a830-4323-8291-8866de98b582-0.inprogress.bd053eb0-5ecf-4c85-8433-
├── part-81fc4980-a6af-41c8-9937-9939408a734b-0
└── part-81fc4980-a6af-41c8-9937-9939408a734b-1.inprogress.bc279efe-b16f-47d8-b828-

New buckets are created as dictated by the bucketing policy, and this does not affect currently in-progress files:
└── 2019-08-25--12
├── part-4005733d-a830-4323-8291-8866de98b582-0.inprogress.bd053eb0-5ecf-4c85-8433-
├── part-81fc4980-a6af-41c8-9937-9939408a734b-0
└── part-81fc4980-a6af-41c8-9937-9939408a734b-1.inprogress.bc279efe-b16f-47d8-b828-
└── 2019-08-25--13
└── part-4005733d-a830-4323-8291-8866de98b582-0.inprogress.2b475fec-1482-4dea-9946-

Old buckets can still receive new records, as the bucketing policy is evaluated on a per-record basis.

Part File Configuration

Finished files can be distinguished from in-progress files only by their naming scheme.

By default, the file naming strategy is as follows:

In-progress / Pending: part-<uid>-<partFileIndex>.inprogress.uid
Finished: part-<uid>-<partFileIndex>

Here uid is a random id assigned to a subtask of the sink when the subtask is instantiated. This uid is not fault-tolerant, so it is regenerated when the subtask recovers from a failure.

Flink allows the user to specify a prefix and/or a suffix for the part files. This can be done using an OutputFileConfig. For example, with the prefix "prefix" and the suffix ".ext", the sink will create files like the following:

└── 2019-08-25--12
├── prefix-4005733d-a830-4323-8291-8866de98b582-0.ext
├── prefix-4005733d-a830-4323-8291-8866de98b582-1.ext.inprogress.bd053eb0-5ecf-4c85
├── prefix-81fc4980-a6af-41c8-9937-9939408a734b-0.ext
└── prefix-81fc4980-a6af-41c8-9937-9939408a734b-1.ext.inprogress.bc279efe-b16f-47d8

The user can specify an OutputFileConfig in the following way:

Java Scala Python

OutputFileConfig config = OutputFileConfig


.builder()
.withPartPrefix("prefix")
.withPartSuffix(".ext")
.build();

FileSink<Tuple2<Integer, Integer>> sink = FileSink


.forRowFormat(new Path(outputPath), new SimpleStringEncoder<>("UTF-8"))
.withBucketAssigner(new KeyBucketAssigner())
.withRollingPolicy(OnCheckpointRollingPolicy.build())
.withOutputFileConfig(config)
.build();
Compaction

Since version 1.15, the FileSink supports compaction of pending files, which allows the application to use a smaller checkpoint interval without generating a large number of small files, especially when using bulk-encoded formats, which have to roll on every checkpoint.

Compaction can be enabled as follows:

Java Scala Python

FileSink<Integer> fileSink=
FileSink.forRowFormat(new Path(path),new SimpleStringEncoder<Integer>())
.enableCompact(
FileCompactStrategy.Builder.newBuilder()
.setNumCompactThreads(1024)
.enableCompactionOnCheckpoint(5)
.build(),
new RecordWiseFileCompactor<>(
new DecoderBasedReader.Factory<>(SimpleStringDecoder::new)))
.build();

Once enabled, compaction happens between the time a file becomes pending and the time it is committed. The pending files are first committed to temporary files whose path starts with ".". Then these files are compacted according to the configured strategy by the specified compactor, generating new compacted pending files. The compacted pending files are then emitted to the Committer to be committed as formal files, after which the source files are removed.

When enabling compaction, you need to specify the FileCompactStrategy and the FileCompactor.

The FileCompactStrategy specifies when and which files get compacted. There are currently two parallel conditions: the target file size, and the number of checkpoints that have passed. Once the total size of the cached files reaches the size threshold, or the number of checkpoints since the last compaction reaches the configured number, the cached files are scheduled for compaction.

The FileCompactor specifies how to compact the given list of files and write the result file. According to how the result file is written, it can be one of two types (see the sketch after this list):

OutputStreamBasedFileCompactor: writes the compacted result into an output stream obtained from the CompactingFileWriter. This is useful when the records cannot, or should not, be read back from the input files; an example is the ConcatFileCompactor, which simply concatenates the input files.
RecordWiseFileCompactor: reads records one by one from the input files and writes them into the result file through the CompactingFileWriter, similar to the FileWriter. An example is the RecordWiseFileCompactor, for which the user needs to specify how records are read from the source files.
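As a sketch, compaction with the stream-based ConcatFileCompactor could look like this (the setSizeThreshold value is only an illustrative assumption):

Java

FileSink<String> fileSink =
        FileSink.forRowFormat(new Path(path), new SimpleStringEncoder<String>())
                .enableCompact(
                        FileCompactStrategy.Builder.newBuilder()
                                // compact once roughly 128 MiB of pending files accumulated
                                .setSizeThreshold(128 * 1024 * 1024)
                                .build(),
                        new ConcatFileCompactor())
                .build();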

Important Note 1: Once compaction is enabled, you must explicitly call disableCompact when building the FileSink if you want to disable it again.

Important Note 2: When compaction is enabled, written files need to wait longer before they become visible.

Note: PyFlink only supports ConcatFileCompactor and IdenticalFileCompactor.

Important Considerations

General

Important Note 1: When using Hadoop < 2.7, please use the OnCheckpointRollingPolicy, which rolls part files on every checkpoint. The reason is that, if part files "traverse" the checkpoint interval, then, upon recovery from a failure, the FileSink may use the truncate() method of the file system to discard uncommitted data from the in-progress file. This method is not supported by Hadoop versions before 2.7, and Flink will throw an exception.

Important Note 2: Given that Flink sinks and UDFs in general do not differentiate between normal job termination (e.g. finite input stream) and termination due to failure, upon normal termination of a job, the last in-progress files are not transitioned to the "Finished" state.

Important Note 3: Flink and the FileSink never overwrite committed data. Because of this, when trying to restore from an old checkpoint/savepoint that assumes an in-progress file which has since been committed by subsequent successful checkpoints, the FileSink refuses to resume and throws an exception, as it cannot locate the in-progress file.

Important Note 4: Currently, the FileSink only supports three file systems: HDFS, S3, and Local. Flink throws an exception when a different file system is used at runtime.

BATCH-specific

Important Note 1: Although the Writer is executed with the user-specified parallelism, the Committer is executed with a parallelism of 1.

Important Note 2: Pending files are committed, i.e. transitioned to the Finished state, after the whole input has been processed.

Important Note 3: When High Availability is activated, if a JobManager failure happens while the Committers are committing, there may be duplicates. This will be fixed in future Flink versions (see FLIP-147 for progress).

S3-specific

Important Note 1: For S3, the FileSink supports only the Hadoop-based FileSystem implementation, not the one based on Presto. If your job uses the FileSink to write to S3 but you want to use the Presto-based file system for checkpointing, it is advised to explicitly use "s3a://" (for Hadoop) as the scheme of the sink's target path and "s3p://" (for Presto) for checkpointing. Using "s3://" for both the sink and checkpointing may lead to unpredictable behavior, as both implementations "listen" to that scheme.

Important Note 2: To guarantee exactly-once semantics while remaining efficient, the FileSink uses the Multi-part Upload feature of S3 (MPU from now on). This feature allows a file to be uploaded in independent chunks (hence "multi-part"), which are combined into the original file once all parts have been uploaded successfully. For inactive MPUs, S3 supports a bucket lifecycle rule that aborts multipart uploads that do not complete within a specified number of days after being initiated. This implies that, if you set this rule aggressively and take a savepoint while some part files are not fully uploaded, their associated MPUs may time out before the job is restarted. The job will then be unable to restore from that savepoint, because the pending part files are no longer there, and Flink fails with an exception while trying to fetch them.
