BigData Avro-1
(AVRO)
What is Avro?
Avro depends heavily on its schema. It allows any data to be written with
no prior knowledge of the schema. It serializes quickly, and the resulting
serialized data is smaller in size. The schema is stored along with the Avro
data in a file for any further processing.
In RPC, the client and the server exchange schemas during the connection
handshake. This exchange helps resolve same-named fields, missing
fields, extra fields, and so on. Both the old and new schemas are always
present, so any differences between them can be resolved.
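The field-matching idea behind this resolution can be illustrated with a small sketch. This is not Avro's actual resolution algorithm, only an illustration of how same-named, missing, and extra fields fall out of comparing two schema versions; the "User" record and its fields are hypothetical (Avro schemas are plain JSON):

```python
import json

# Two hypothetical versions of the same record schema:
# the writer's (old) schema and the reader's (new) schema.
writer_schema = json.loads("""
{"type": "record", "name": "User", "fields": [
  {"name": "id", "type": "long"},
  {"name": "nickname", "type": "string"}
]}
""")
reader_schema = json.loads("""
{"type": "record", "name": "User", "fields": [
  {"name": "id", "type": "long"},
  {"name": "email", "type": "string", "default": ""}
]}
""")

writer_fields = {f["name"] for f in writer_schema["fields"]}
reader_fields = {f["name"] for f in reader_schema["fields"]}

print(sorted(writer_fields & reader_fields))  # same-named fields, matched directly
print(sorted(reader_fields - writer_fields))  # missing from the data; filled from defaults
print(sorted(writer_fields - reader_fields))  # extra in the data; ignored by the reader
```

Because both schemas are available at read time, the reader knows exactly which fields to match, which to default, and which to skip.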
Avro Schemas
Avro schemas are defined in JSON, which simplifies their implementation
in any language that has a JSON library.
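Because an Avro schema is ordinary JSON, it can be read with any JSON library. Below is a minimal, hypothetical record schema parsed with Python's standard json module; the record and field names are illustrative, not taken from the source:

```python
import json

# A minimal Avro record schema for a hypothetical "User" record.
# "fields" lists each field's name and type; the union ["null", "string"]
# with a default makes "email" optional.
schema_json = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""
schema = json.loads(schema_json)
print(schema["name"])                          # record name
print([f["name"] for f in schema["fields"]])   # declared field names
```

In practice such a schema would live in an .avsc file and be handed to an Avro library, but nothing beyond a JSON parser is needed to author or inspect it.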
Like Avro, there are other serialization mechanisms in Hadoop such as
Sequence Files, Protocol Buffers, and Thrift.
Comparison with Thrift and Protocol
Buffers
Thrift and Protocol Buffers are Avro's closest competitors. Avro
differs from these frameworks in the following ways:
• Avro supports both dynamic and static typing, as required. Protocol
Buffers and Thrift use Interface Definition Languages (IDLs) to specify
schemas and their types, and these IDLs are used to generate code for
serialization and deserialization.
• Avro is built into the Hadoop ecosystem; Thrift and Protocol Buffers are
not.
Unlike Thrift and Protocol Buffers, Avro's schema definition is in JSON rather
than a proprietary IDL, which makes it language-neutral.
What is Serialization?
Serialization is the process of translating data structures or object
state into a binary or textual form. Once the data has been transported over
a network or retrieved from persistent storage, it needs to be
deserialized again. Serialization is also termed marshalling, and
deserialization is termed unmarshalling.
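The round trip described above can be sketched with standard-library tools. This uses JSON as the textual form purely for illustration (Avro itself uses a compact binary encoding), and the record contents are made up:

```python
import json

# A hypothetical in-memory record (object state).
record = {"id": 1, "name": "Alice"}

# Serialization (marshalling): object state -> bytes,
# suitable for network transport or persistent storage.
wire_bytes = json.dumps(record).encode("utf-8")

# Deserialization (unmarshalling): bytes -> object state again.
restored = json.loads(wire_bytes.decode("utf-8"))

assert restored == record  # the round trip preserves the data
```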
Serialization in Hadoop: the Writable Interface
The Writable interface in Hadoop provides methods for serialization and
deserialization. The following table describes the methods:
Writable Comparable Interface