Module III - Storing and Querying Data

The document outlines various data file formats including Flat/Text files, CSV, XML, JSON, and YAML, detailing their structures, pros, and cons. It also discusses four types of NoSQL datastores (Key-Value, Document, Column-Family, and Graph DB) and provides an in-depth look at HBase storage architecture and its components. Additionally, it compares Pig and Hive for data processing and lists programming languages typically used with HBase.

Uploaded by

ayux0431

Module III: Storing and Querying Data

1. Characteristics of Representative Data File Formats

A. Flat/Text Files

● Structure: Simple, line-by-line text data.

● Delimiters: Tabs, spaces, commas.

● Use Case: Small datasets, logs, config files.

● Pros:

○ Human-readable

○ Easy to create/edit

● Cons:

○ No data types

○ No hierarchy

○ Harder to parse programmatically (no standard structure, so each layout needs custom parsing code)
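
A minimal sketch of what hand-written parsing of a flat text file can look like. The log layout and the field names (`timestamp`, `level`, `message`) are assumptions made up for this illustration, not part of any standard.

```python
# Hypothetical log lines; the layout (timestamp, level, message separated
# by spaces) is an assumption for illustration only.
lines = [
    "2024-01-15 INFO Server started",
    "2024-01-15 ERROR Disk almost full",
]

def parse_line(line):
    # A flat file has no schema, so the code must know the layout in advance.
    # maxsplit=2 keeps the spaces inside the message intact.
    timestamp, level, message = line.split(" ", 2)
    return {"timestamp": timestamp, "level": level, "message": message}

records = [parse_line(line) for line in lines]
print(records[1]["message"])  # Disk almost full
```

If the layout of the file changes, `parse_line` has to change with it, which is the practical cost of having no declared structure.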

B. CSV (Comma-Separated Values)

● Structure: Table-like, rows and columns.

● Delimiter: Comma (,)

● Pros:

○ Simple and widely supported

○ Easy to import into Excel, DBs


● Cons:

○ No support for complex/nested data

○ No metadata or schema

Visualization:

Name, Age, City
Alice, 25, New York
Bob, 30, Los Angeles
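
The "no metadata or schema" point can be seen in a short sketch using Python's standard `csv` module. The sample data is written here without spaces after the commas; the variable names are chosen for this example.

```python
import csv
import io

# The sample table, written without spaces after the commas.
text = "Name,Age,City\nAlice,25,New York\nBob,30,Los Angeles\n"

# DictReader uses the first row as column names.
rows = list(csv.DictReader(io.StringIO(text)))

# CSV carries no type information: every value arrives as a string.
print(type(rows[0]["Age"]).__name__)  # str

# Converting to the intended type is the reader's job.
ages = [int(row["Age"]) for row in rows]
print(ages)  # [25, 30]
```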

C. XML (eXtensible Markup Language)

● Structure: Tree-based, hierarchical.

● Self-descriptive: Tags define structure.

● Pros:

○ Supports complex/nested data

○ Schema support (XSD)

● Cons:

○ Verbose

○ Parsing can be slow

Visualization:

<person>
  <name>Alice</name>
  <age>25</age>
  <city>New York</city>
</person>

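
As a rough sketch, the same record can be read with Python's standard `xml.etree.ElementTree` module. The variable names are chosen for this example.

```python
import xml.etree.ElementTree as ET

xml_text = """<person>
  <name>Alice</name>
  <age>25</age>
  <city>New York</city>
</person>"""

# Parsing builds a tree; the root element here is <person>.
root = ET.fromstring(xml_text)
print(root.tag)  # person

# Child elements are found by tag name; their text is always a string.
name = root.find("name").text
age = int(root.find("age").text)

# Iterating over an element walks its direct children in document order.
children = [child.tag for child in root]
print(children)  # ['name', 'age', 'city']
```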
D. JSON (JavaScript Object Notation)

● Structure: Key-value pairs, supports nesting.

● Lightweight and easy to parse.

● Pros:

○ Human-readable and compact

○ Supported by most modern languages

● Cons:

○ No support for comments

○ Less strict than XML (no built-in schema or namespace support)

Visualization:

{
  "name": "Alice",
  "age": 25,
  "city": "New York"
}
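
A short sketch with Python's standard `json` module shows why JSON is considered easy to parse: basic types survive the round trip without extra conversion code. The `languages` field is a hypothetical addition to demonstrate nesting.

```python
import json

json_text = '{"name": "Alice", "age": 25, "city": "New York"}'

# Parsing yields native Python types: the number stays a number.
person = json.loads(json_text)
print(type(person["age"]).__name__)  # int

# Nested data is supported directly (hypothetical extra field).
person["languages"] = ["Python", "Java"]

# Serializing and parsing again returns an equal structure.
round_trip = json.loads(json.dumps(person))
print(round_trip == person)  # True
```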

E. YAML (YAML Ain’t Markup Language)

● Structure: Indentation-based, human-readable

● Pros:

○ Clean syntax

○ Good for config files (e.g., Docker, Kubernetes)

● Cons:

○ Sensitive to indentation

○ Parsers can interpret the same file differently (e.g., whether an unquoted no is read as a boolean or a string)

Visualization:
name: Alice
age: 25
city: New York

2. Characteristics of the Four Types of NoSQL Datastores

Type              Structure        Use Case                            Examples

Key-Value Store   Key → Value      Caching, session management         Redis, Riak

Document Store    JSON-like docs   Content management, user profiles   MongoDB, CouchDB

Column-Family     Column groups    Analytics, big data                 Cassandra, HBase

Graph DB          Nodes + Edges    Social networks, recommendations    Neo4j, ArangoDB
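
The four data models can be approximated with plain Python structures. This is only a sketch of the shape of the data, not of how any of the listed products store it; all keys, names, and the `followed_by` helper are hypothetical.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"session:42": "user=alice;expires=3600"}

# Document store: a JSON-like document that can nest other structures.
document = {"_id": "u1", "name": "Alice", "address": {"city": "New York"}}

# Column-family store: row key -> column family -> column -> value.
# Rows may have different columns (sparse data).
column_family = {
    "row1": {"info": {"name": "Alice", "city": "New York"}},
    "row2": {"info": {"name": "Bob"}},
}

# Graph database: nodes plus labeled edges between them.
nodes = {"alice", "bob", "carol"}
edges = [("alice", "follows", "bob"), ("bob", "follows", "carol")]

def followed_by(person):
    # Traverse outgoing "follows" edges from one node.
    return [dst for src, rel, dst in edges if src == person and rel == "follows"]

print(followed_by("alice"))  # ['bob']
```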

3. HBase Storage in Detail


Architecture Highlights:

● Based on Google’s BigTable.

● Built on top of HDFS.

● Optimized for sparse, wide tables.

Key Components:

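● HMaster: Assigns regions to region servers and handles table schema changes.

● Region Server: Serves reads and writes for the regions (subsets of tables) assigned to it.

● Region: Holds a contiguous range of rows, sorted by row key.

● Store: Holds the data for one column family within a region.

● MemStore: In-memory write buffer; flushed to disk as an HFile when full.

● HFile: Immutable storage file on HDFS.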

Data Storage Flow:

1. Write → MemStore

2. MemStore full → flushed to a new HFile on HDFS

3. Multiple HFiles → compacted into fewer, larger HFiles (a major compaction merges them into one per store)
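
The flow above can be sketched as a toy model. This is not HBase code: the class `ToyStore`, its methods, and the `flush_threshold` parameter are hypothetical, and real HBase flushes by memory size rather than by cell count.

```python
class ToyStore:
    """Toy model of one store (one column family); not the real implementation."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold  # hypothetical: max cells held in memory
        self.memstore = {}   # in-memory write buffer
        self.hfiles = []     # each "HFile" is a sorted list of (row_key, value)

    def put(self, row_key, value):
        # Step 1: every write lands in the MemStore.
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Step 2: a full MemStore is written out as a new sorted file.
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def compact(self):
        # Step 3: merge all files into one; newer files overwrite older values.
        merged = {}
        for hfile in self.hfiles:  # oldest first
            merged.update(hfile)
        self.hfiles = [sorted(merged.items())] if merged else []

    def get(self, row_key):
        # A read checks the MemStore, then the files from newest to oldest.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


store = ToyStore(flush_threshold=2)
store.put("r1", "a")
store.put("r2", "b")    # triggers the first flush
store.put("r1", "a2")
store.put("r3", "c")    # triggers the second flush
files_before = len(store.hfiles)
store.compact()
files_after = len(store.hfiles)
print(files_before, files_after, store.get("r1"))  # 2 1 a2
```

Before compaction a read may have to look through two files; afterwards there is only one, which is why read cost depends on how recently compaction ran.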

Read/Write Characteristics:

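● Writes: Fast (sequential appends; no in-place updates on disk)

● Reads: Can be slower, since a read may have to check the MemStore and several HFiles; compaction reduces the number of files to check

● Indexing: By row key only; there are no built-in secondary indexes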
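
Because rows are kept sorted by row key, a lookup or range scan by key can use binary search, while a lookup by value has to examine every row. The sketch below illustrates the idea with Python's `bisect` module; the row keys, values, and the `scan` helper are hypothetical and do not reflect the HBase client API.

```python
import bisect

# Rows kept sorted by row key (hypothetical keys and values).
row_keys = ["user#001", "user#002", "user#010", "video#001"]
values = ["Alice", "Bob", "Carol", "Intro clip"]

def scan(start, stop):
    """Return (key, value) pairs with start <= key < stop."""
    lo = bisect.bisect_left(row_keys, start)
    hi = bisect.bisect_left(row_keys, stop)
    return list(zip(row_keys[lo:hi], values[lo:hi]))

# Prefix scan: "$" sorts immediately after "#", so this covers all "user#" keys.
users = scan("user#", "user$")
print(len(users))  # 3

# Finding a row by value cannot use the key order: every row is checked.
bob_keys = [k for k, v in zip(row_keys, values) if v == "Bob"]
print(bob_keys)  # ['user#002']
```

This is one reason row key design matters: keys that share a prefix sit next to each other and can be read with a single range scan.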

4. Pig vs Hive
Feature            Pig                           Hive

Language           Pig Latin (procedural)        HiveQL (declarative, SQL-like)

Users              Developers                    Analysts

Execution Engine   MapReduce                     MapReduce / Tez / Spark

Data Handling      Semi-structured (nested OK)   Structured/tabular

Learning Curve     Slightly harder               Easier (SQL-like syntax)

Use Case           Data pipelines, ETL           Data warehousing, reporting

Visualization:

● Pig: Step-by-step instructions.

● Hive: SQL queries over tables.
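
The difference in style can be illustrated with an analogy in plain Python and SQLite. This is neither Pig Latin nor HiveQL: the step-by-step half only mimics the procedural style, and the SQL half only mimics the declarative style. The table `people` and its data are hypothetical.

```python
import sqlite3

rows = [("Alice", 25, "New York"), ("Bob", 30, "Los Angeles"), ("Carol", 35, "New York")]

# Procedural style (Pig-like): each step transforms the previous result.
filtered = [r for r in rows if r[1] >= 30]          # step 1: filter
grouped = {}
for name, age, city in filtered:                    # step 2: group and count
    grouped[city] = grouped.get(city, 0) + 1
procedural_result = sorted(grouped.items())         # step 3: order

# Declarative style (Hive-like): describe the result, let the engine plan it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
declarative_result = conn.execute(
    "SELECT city, COUNT(*) FROM people WHERE age >= 30 GROUP BY city ORDER BY city"
).fetchall()
conn.close()

print(procedural_result == declarative_result)  # True
```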


5. Programming Languages Typically Used with HBase

1. Java

● Native API for HBase

● Full control over HBase features

2. Shell (HBase Shell)

● Interactive command-line tool

● CRUD operations

3. Python (via HappyBase, which connects through the HBase Thrift gateway)

● Easy integration with Python apps

● Simple client library

4. Scala (via Spark-HBase connector)

● For big data processing

● Integrates with Spark jobs

5. REST API

● Lightweight access using HTTP

Example: HBase Shell

create 'users', 'info'
put 'users', '1', 'info:name', 'Alice'
get 'users', '1'
