
Common Data Representation Formats Used For Big Data Include

Common data representation formats for big data include row-based formats like flat files, CSV, Avro, and JSON, column-based formats like RC, ORC, and Parquet, and NoSQL datastores. Row-based formats with compression are commonly used for interoperability, but column-based formats provide faster query execution and better compression. Avro and SequenceFiles are binary formats that store individual records in custom data types, with SequenceFiles having higher performance than text files since records don't need to be parsed.

Uploaded by

Mahmoud Elmahdy

Common data representation formats used for big data include:

• Row- or record-based encodings:
  - Flat files / text files
  - CSV and delimited files
  - Avro / SequenceFile
  - JSON
  - Other formats: XML, YAML
• Column-based storage formats:
  - RC / ORC file
  - Parquet
• NoSQL datastores
• Compression of data
Row-based encodings (text, Avro, JSON) combined with a general-purpose compression library (GZip, LZO, CMX, Snappy) are common mainly for interoperability reasons, but column-based storage formats (Parquet, ORC) provide not only faster query execution by minimizing I/O but also better compression.
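
As a minimal sketch of the first approach (a row-based text encoding wrapped in a general-purpose compressor), the Python standard library can write CSV and compress it with gzip. The helper names `write_csv_gz` and `read_csv_gz` and the sample rows are illustrative assumptions, not part of any big data toolchain.

```python
import csv
import gzip
import io

def write_csv_gz(rows, header):
    """Serialize rows as CSV text, then gzip the result; returns bytes."""
    text = io.StringIO()
    writer = csv.writer(text)
    writer.writerow(header)
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))

def read_csv_gz(blob):
    """Decompress and parse back into a list of dicts (all values are strings)."""
    text = gzip.decompress(blob).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

# Hypothetical sample data
header = ["name", "age"]
rows = [["John", 25], ["Mary", 31]]

blob = write_csv_gz(rows, header)
records = read_csv_gz(blob)
```

Any tool that understands gzip and CSV can read the result, which is the interoperability argument. The cost is that every value comes back as a string and the whole row must be decompressed and parsed even when a query needs a single column.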
Avro/SequenceFile
• Avro data files are a compact, efficient binary format that provides interoperability with applications written in other programming languages
• SequenceFiles are a binary format that stores individual records in custom record-specific data types
  - Reading from SequenceFiles is higher-performance than reading from text files, as records do not need to be parsed

Two primary reasons SequenceFiles are less suited to data interchange:

1. Language independence. The SequenceFile container and each Writable implementation stored in it are implemented only in Java. There is no format specification independent of the Java implementation.
2. Versioning. If a Writable class changes (fields are added or removed, the type of a field is changed, or the class is renamed), the data usually becomes unreadable. A Writable implementation can explicitly manage versioning by writing a version number with each instance and handling older versions at read time.
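
The explicit versioning strategy can be sketched in a few lines. This is a Python analogue using `struct`, not the Hadoop Writable API; the record layouts, field names, and the default used for the missing field are all hypothetical.

```python
import struct

# Hypothetical layouts: version 1 stores (id, age); version 2 adds a float score.
# "<" means little-endian with no padding; "B" is the one-byte version tag.
V1 = struct.Struct("<Bii")
V2 = struct.Struct("<Biif")

def write_record(rec_id, age, score):
    """Always write the newest layout, tagged with its version number."""
    return V2.pack(2, rec_id, age, score)

def read_record(buf):
    """Dispatch on the leading version byte so older data stays readable."""
    version = buf[0]
    if version == 1:
        _, rec_id, age = V1.unpack(buf)
        return {"id": rec_id, "age": age, "score": 0.0}  # assumed default
    if version == 2:
        _, rec_id, age, score = V2.unpack(buf)
        return {"id": rec_id, "age": age, "score": score}
    raise ValueError(f"unknown record version: {version}")

old_bytes = V1.pack(1, 7, 25)          # data written before the schema change
new_bytes = write_record(8, 30, 0.5)   # data written after it
```

Without the version byte, a reader expecting the new layout would fail on (or misread) the old bytes. Every reader has to carry this dispatch logic by hand, which is the maintenance burden the versioning point describes.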
JSON format: JavaScript Object Notation
• JSON is a plain-text object serialization format that can represent quite complex data in a form that can be transferred between a user and a program, or from one program to another
• Often called the language of Web 2.0
• Two basic structures:
  - Records consisting of maps (aka key/value pairs), in curly braces: {"name": "John", "age": 25}
  - Lists (aka arrays), in square brackets: [ . . . ]
• Records and arrays can be nested in each other multiple times
• Support libraries are available in R, Python, and other languages
• Standard JSON does not offer any formal schema mechanism, although there are efforts to develop one (for example, JSON Schema)
• APIs that return JSON data: CNET, Flickr, Google Geocoder, Twitter, Yahoo Answers, Yelp, etc.
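
The two structures and their nesting can be illustrated with Python's built-in `json` module. The sample record (including the `phones` field) and the helper `is_valid_json` are made up for illustration.

```python
import json

# A record (map) containing a list of nested records; sample data is hypothetical.
text = '{"name": "John", "age": 25, "phones": [{"type": "home", "number": "555-0100"}]}'

record = json.loads(text)                 # JSON text -> dict / list / str / int
first_phone_type = record["phones"][0]["type"]
round_trip = json.loads(json.dumps(record))

def is_valid_json(s):
    """Return True if s parses as standard JSON."""
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False

# Standard JSON requires double-quoted keys; the JavaScript-style
# object literal below is rejected by a strict parser.
strict_ok = is_valid_json('{"name": "John", "age": 25}')
loose_ok = is_valid_json('{name: "John", age: 25}')
```

Because there is no built-in schema, nothing in the parse step checks that `age` is a number or that `phones` exists; the consuming program has to verify that itself.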

XML (eXtensible Markup Language)


• XML is a rich and flexible data representation format
  - Uses markup to provide context for fields in plain text
  - Provides an excellent mechanism for serializing objects and data
  - Widely used as an electronic data interchange (EDI) format within industry sectors
• XML has a formal schema language (XML Schema), itself written in XML; data that validates against a schema is guaranteed to conform to its structure for later processing
• Webpages are written in HTML, a closely related markup language; XHTML is the variant of HTML that is valid XML
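
To make the "markup provides context for fields" point concrete, here is a small sketch that parses an XML document with Python's standard `xml.etree.ElementTree`. The `people`/`person` element names and values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tags and attributes label each field.
doc = """<people>
  <person id="1"><name>John</name><age>25</age></person>
  <person id="2"><name>Mary</name><age>31</age></person>
</people>"""

root = ET.fromstring(doc)

# Walk the tree and convert each <person> element into a dict.
people = [
    {
        "id": p.get("id"),                 # attribute value (string)
        "name": p.findtext("name"),        # text of the <name> child
        "age": int(p.findtext("age")),     # XML text is untyped; convert explicitly
    }
    for p in root.findall("person")
]
```

Note that `ElementTree` only checks that the document is well-formed. Validating it against a schema needs a separate validator, which the standard library does not provide.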
