Module III - Storing and Querying Data

The document outlines various data file formats including Flat/Text files, CSV, XML, JSON, and YAML, detailing their structures, pros, and cons. It also discusses four types of NoSQL datastores (Key-Value, Document, Column-Family, and Graph DB) and provides an in-depth look at HBase storage architecture and its components. Additionally, it compares Pig and Hive for data processing and lists programming languages typically used with HBase.

Uploaded by

ayux0431

Module III: Storing and Querying Data

1. Characteristics of Representative Data File Formats

A. Flat/Text Files

● Structure: Simple, line-by-line text data.

● Delimiters: Tabs, spaces, commas.

● Use Case: Small datasets, logs, config files.

● Pros:

○ Human-readable

○ Easy to create/edit

● Cons:

○ No data types

○ No hierarchy

○ No standard structure, so parsing needs custom code for each file layout

B. CSV (Comma-Separated Values)

● Structure: Table-like, rows and columns.

● Delimiter: Comma ,

● Pros:

○ Simple and widely supported

○ Easy to import into Excel, DBs

● Cons:

○ No support for complex/nested data

○ No metadata or schema

Visualization:

Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
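
The "no metadata or schema" point can be seen when the sample rows are read back. The following is a minimal sketch using Python's standard csv module; the helper name load_people is an assumption made for illustration. Every field arrives as a string, so the age has to be converted by hand.

```python
import csv
import io

# Sample rows from the section, with no spaces after the commas so that
# the fields carry no leading whitespace.
CSV_TEXT = "Name,Age,City\nAlice,25,New York\nBob,30,Los Angeles\n"


def load_people(text):
    """Parse CSV text into a list of dicts, converting Age to int by hand."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        # CSV has no type information: "25" is read as a string.
        row["Age"] = int(row["Age"])
    return rows


people = load_people(CSV_TEXT)
print(people[0])
```

The first line of the file is treated as the header only because the reader is told to do so; nothing in the format itself marks it as one.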

C. XML (eXtensible Markup Language)

● Structure: Tree-based, hierarchical.

● Self-descriptive: Tags define structure.

● Pros:

○ Supports complex/nested data

○ Schema support (XSD)

● Cons:

○ Verbose

○ Parsing can be slow

Visualization:

<person>
<name>Alice</name>
<age>25</age>
<city>New York</city>
</person>
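
As a rough illustration of the tree structure, the element above can be read with Python's standard xml.etree.ElementTree module. The function name person_from_xml is an assumption made for this sketch. Element text is always a string, so the age is converted explicitly.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<person>
<name>Alice</name>
<age>25</age>
<city>New York</city>
</person>"""


def person_from_xml(text):
    """Walk the <person> tree and return its children as a plain dict."""
    root = ET.fromstring(text)  # root is the <person> element
    return {
        "name": root.findtext("name"),
        "age": int(root.findtext("age")),  # element text is a string
        "city": root.findtext("city"),
    }


person = person_from_xml(XML_TEXT)
print(person)
```
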
D. JSON (JavaScript Object Notation)

● Structure: Key-value pairs, supports nesting.

● Lightweight and easy to parse.

● Pros:

○ Human-readable and compact

○ Supported by most modern languages

● Cons:

○ No comment syntax

○ No built-in schema or validation (XML has XSD; JSON Schema is a separate specification)

Visualization:

{
"name": "Alice",
"age": 25,
"city": "New York"
}
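
A short sketch with Python's standard json module shows two of the points above: numbers keep their type when parsed, and nested values survive a round trip. The "languages" field is a hypothetical addition used only to demonstrate nesting.

```python
import json

JSON_TEXT = """
{
  "name": "Alice",
  "age": 25,
  "city": "New York"
}
"""

person = json.loads(JSON_TEXT)

# Unlike CSV, JSON distinguishes numbers from strings.
age_type = type(person["age"]).__name__

# Hypothetical nested field, added to show that lists and objects nest freely.
person["languages"] = ["Java", "Python"]
round_trip = json.loads(json.dumps(person))

print(age_type, round_trip["languages"])
```
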

E. YAML (YAML Ain’t Markup Language)

● Structure: Indentation-based, human-readable

● Pros:

○ Clean syntax

○ Good for config files (e.g., Docker, Kubernetes)

● Cons:

○ Sensitive to indentation

○ Parsers differ in how they type unquoted values (e.g., YAML 1.1 parsers read no and yes as booleans)

Visualization:
name: Alice
age: 25
city: New York

2. Characteristics of the Four Types of NoSQL Datastores

Type | Structure | Use Case | Examples
Key-Value Store | Key → Value | Caching, session management | Redis, Riak
Document Store | JSON-like docs | Content management, user profiles | MongoDB, CouchDB
Column-Family | Column groups | Analytics, big data | Cassandra, HBase
Graph DB | Nodes + Edges | Social networks, recommendations | Neo4j, ArangoDB
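
To make the four structures more concrete, here is a rough sketch that models each one with plain Python containers. All keys, names, and values are made up for illustration, and real datastores add persistence, distribution, and query languages on top of these shapes.

```python
# Key-value store: an opaque value looked up by its key.
key_value = {"session:42": b'{"user": "alice"}'}

# Document store: one self-contained, JSON-like document per record.
document = {
    "_id": "alice",
    "name": "Alice",
    "profile": {"city": "New York", "tags": ["admin"]},
}

# Column-family store: row key -> column family -> column -> value.
column_family = {
    "alice": {
        "info": {"name": "Alice", "city": "New York"},
        "stats": {"logins": 7},
    }
}

# Graph database: nodes plus labeled edges between them.
nodes = {"alice": {"name": "Alice"}, "bob": {"name": "Bob"}}
edges = [("alice", "FOLLOWS", "bob")]


def followers_of(user):
    """Return the nodes that have a FOLLOWS edge pointing at `user`."""
    return [src for src, rel, dst in edges if rel == "FOLLOWS" and dst == user]


print(followers_of("bob"))
```
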

3. HBase Storage in Detail

Architecture Highlights:

● Based on Google’s Bigtable.

● Built on top of HDFS.

● Optimized for sparse, wide tables.

Key Components:

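● HMaster: Assigns regions to region servers and handles table schema changes.

● Region Server: Serves reads and writes for the regions assigned to it.

● Region: Holds a contiguous range of rows of one table.

● Store: Holds the data of one column family within a region.

● MemStore: In-memory write buffer of a Store (flushed to disk when full).

● HFile: Immutable storage file on HDFS.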

Data Storage Flow:

1. Write → write-ahead log (WAL), then MemStore

2. MemStore full → flushed to a new HFile on HDFS

3. Multiple HFiles → merged by compaction (a major compaction leaves one HFile per Store)
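
The flow above can be sketched as a toy in-memory model. This is not HBase code: the class ToyStore and its flush_threshold parameter are assumptions made for illustration, and the sketch leaves out the write-ahead log, timestamps, and deletes.

```python
class ToyStore:
    """Toy model of one Store: a MemStore plus a list of immutable HFiles."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold  # hypothetical: max cells in memory
        self.memstore = {}
        self.hfiles = []  # oldest first; each is a sorted list of (row_key, value)

    def put(self, row_key, value):
        """Step 1: a write lands in the MemStore."""
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Step 2: a full MemStore is written out as a new sorted file."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def compact(self):
        """Step 3: merge all files into one, newer values winning."""
        merged = {}
        for hfile in self.hfiles:  # oldest first, so later files overwrite
            merged.update(hfile)
        self.hfiles = [sorted(merged.items())] if merged else []

    def get(self, row_key):
        """Check the MemStore first, then the files from newest to oldest."""
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


store = ToyStore(flush_threshold=2)
store.put("1", "Alice")
store.put("2", "Bob")      # MemStore full: first flush
store.put("1", "Alicia")
store.put("3", "Carol")    # MemStore full: second flush
files_before = len(store.hfiles)
store.compact()
print(files_before, len(store.hfiles), store.get("1"))
```

Before compaction a read may have to look through every file; afterwards there is a single file to search, which is why read speed depends on compaction.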

Read/Write Characteristics:

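● Writes: Fast (sequential append to the WAL plus an in-memory MemStore update)

● Reads: Slower (may have to check the MemStore and several HFiles; compaction reduces the number of files)

● Indexing: By row key only; rows are stored sorted by row key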
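
Because the row key is the only index and rows are kept in sorted key order, key design decides which lookups are cheap. The sketch below is a plain Python analogy, with made-up keys and a hypothetical prefix_scan helper. It shows that the ordering is lexicographic rather than numeric, and that a prefix scan only touches a contiguous slice of the sorted keys. HBase compares keys as raw bytes; for the ASCII keys used here, Python string ordering gives the same result.

```python
import bisect

# Made-up row keys; sorting mimics the order in which rows are stored.
row_keys = sorted(["1", "2", "10", "user#alice", "user#bob", "video#7"])


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with `prefix` from a sorted key list."""
    start = bisect.bisect_left(sorted_keys, prefix)  # jump to the first candidate
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        result.append(key)
    return result


# Lexicographic order places "10" before "2".
numeric_looking = row_keys[:3]

# Zero-padding restores numeric order for fixed-width keys.
padded = sorted(["0002", "0010", "0001"])

print(numeric_looking, padded, prefix_scan(row_keys, "user#"))
```
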

4. Pig vs Hive
Feature | Pig | Hive
Language | Pig Latin (procedural) | HiveQL (declarative, SQL-like)
Users | Developers | Analysts
Execution Engine | MapReduce / Tez / Spark | MapReduce / Tez / Spark
Data Handling | Semi-structured (nested OK) | Structured/tabular
Learning Curve | Slightly harder | Easier (SQL-like syntax)
Use Case | Data pipelines, ETL | Data warehousing, reporting

Visualization:

● Pig: Step-by-step instructions.

● Hive: SQL queries over tables.
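
The procedural versus declarative contrast can be illustrated by analogy in Python. The code below is neither Pig Latin nor HiveQL: the first half spells out the filter, group, and count steps one at a time in the way a Pig script would, and the second half states the same result as a single SQL query, here run with the standard sqlite3 module as a stand-in for Hive. The sample rows are made up.

```python
import sqlite3
from collections import defaultdict

people = [
    ("Alice", 25, "New York"),
    ("Bob", 30, "Los Angeles"),
    ("Carol", 35, "New York"),
]

# Procedural style (Pig-like): each step is written out in order.
adults = [p for p in people if p[1] >= 30]        # filter step
grouped = defaultdict(list)
for name, age, city in adults:                    # group step
    grouped[city].append(name)
procedural = {city: len(names) for city, names in grouped.items()}  # count step

# Declarative style (Hive-like): describe the result, let the engine plan it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", people)
declarative = dict(
    conn.execute(
        "SELECT city, COUNT(*) FROM people WHERE age >= 30 GROUP BY city"
    )
)
conn.close()

print(procedural == declarative)
```
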

5. Programming Languages Typically Used with HBase
1. Java

● Native API for HBase

● Full control over HBase features

2. Shell (HBase Shell)

● Interactive command-line tool

● CRUD operations

3. Python (via HappyBase or Thrift)

● Easy integration with Python apps

● Simple client library

4. Scala (via Spark-HBase connector)

● For big data processing

● Integrates with Spark jobs

5. REST API

● Lightweight access using HTTP

Example: HBase Shell

create 'users', 'info'

put 'users', '1', 'info:name', 'Alice'
get 'users', '1'
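
For comparison, the same three shell commands might look as follows from Python through HappyBase. This is a sketch under assumptions: it expects an HBase Thrift server to be reachable, the function name write_and_read_user is made up, and the host in the usage comment is a placeholder. The function takes the connection as a parameter so that it does not open a network connection by itself.

```python
def write_and_read_user(connection):
    """Mirror the shell example using a HappyBase-style connection object."""
    # create 'users', 'info'
    connection.create_table("users", {"info": dict()})
    table = connection.table("users")

    # put 'users', '1', 'info:name', 'Alice'
    table.put(b"1", {b"info:name": b"Alice"})

    # get 'users', '1'
    return table.row(b"1")


# Usage against a live cluster (assumes the Thrift server runs on localhost):
#   import happybase
#   connection = happybase.Connection("localhost")
#   print(write_and_read_user(connection))
#   connection.close()
```

Row keys, column names, and values are passed as bytes, since HBase stores everything as uninterpreted byte arrays.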

