0% found this document useful (0 votes)
32 views

Lecture 33

The document discusses column-family stores, including their data model and examples like BigTable and HBase. A column-family store uses a column-based model with rows containing many columns grouped into column families that are often accessed together.

Uploaded by

Aman Salman
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
32 views

Lecture 33

The document discusses column-family stores, including their data model and examples like BigTable and HBase. A column-family store uses a column-based model with rows containing many columns grouped into column families that are often accessed together.

Uploaded by

Aman Salman
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 32

Column-Family

Stores
Introduction
Basics
• AKA (Also Know As) Wide-column, columnar

• Data model

• rows that have many columns, all associated with the


same key

• Columns family

• is group of related columns

• often accessed together


Example
Example
Data Model: Column

• column is the basic data item

• represented as 3-tuple

• column name

• value

• timestamp
Data Model: Row
// row
"martin-fowler" : {
firstName: "Martin",
lastName: "Fowler",
• Row is a collection of columns
location: "Boston"
attached to the same row key

}
• columns can be added to
any row at anytime without
having to add for all rows
// row
"Jack-fowler" : {
firstName: "Jack",
lastName: "Fowler",
}
Data Model: Column Family
(CF)

• CF is a set of columns
containing related data

• For example, UserInfo in table


User is a column family
Data Model: Interpretation 1

• Each column family is equivalent to a


relational table

• Column family is considered a map of


map:

• Map<rowKey, Map<columnKey, columnValue>>


Data Model: Interpretation
et
(Visual)
Column-family Stores
• Representative

• Cassandra

• BigTable

• HBase

• Hypertable

• Accumulo
Ranked list: http://db-engines.com/en/ranking/wide+column+store
BigTable

• Google’s paper:
Chang, F. et al. (2008). Bigtable: A Distributed Storage System for
Structured Data. ACM TOCS, 26(2), pp 1–26.

• Data model: column family

• Multi-dimensional map

• (row:string, column:string, timestamp:int64) —> value



HBase
• An open source implementation of Google BigTable

• Initial release: 2008

• Implementation: Java

• Runs on top of Hadoop Distributed File System

• Operating systems: linux, windows (you need Cygwin)

• can handle billions of rows with millions of columns

Cassandra
• Developed at Facebook

• Initial release in 2008, stable release in 2013

• Written in Java

• Operations

• Cassandra Query Language (CQL)

• MapReduce support (can cooperate with Hadoop)


BigTable & HBase in
Details
BigTable: Overview
• Highly available distributed storage

• One big table distributed on multiple nodes

• Built with semi-structured data in mind

• billion of URLs with their content over time (versions)

• users data: profiles, preferences, queries

• Geographical data: road map, satellite images


BigTable: Large Scale

• Petabytes of data across thousands of computers

• Billions of users

• thousands of queries per second


BigTable: Uses
• At Google used for

• Google analytic

• a web service tool that track and analyze web traffic

• Google Finance

• stock information

• Personalized Search

• based on users preferences show personalized search

• Youtube

• Google Earth & Map

• and more
A big table

• Appears as one table

• characteristics:

• sparse

• distributed

• multi dimensional map


Characteristics

• A map

• Is an associative array

• values can be quickly looked up for a given key

• key identifies the row

• value identifies set of columns


Characteristics
• Persistent

• Data stays on disk after operations finish

• Data is stored persistently on disk

• Distributed

• BigTable data is distributed across multiple machines

• BigTable runs on top of Google File System (GFS)

• The table is split based on the rows


Characteristics
• Sparse

• different rows have different columns

• some columns may be empty for some rows

• Sorted

• BigTable uses associative array (which is not sorted)

• but BigTable sort based on rows

• related data will be adjacent

• For example if we want to store all pages of the same domain, then we reverse the url
and use it as row key

• edu.ptuk.ps all pages from edu domain will be adjacent


Characteristics
• Multidimensional

• Table is indexed by row

• a Table contains one or more column families

• at least one column family is created when the table is created

• a column family can contain various number of columns which are


usually related

• columns in the column family can be created on the fly

• three level of hierarchy; row, column family, and column


Characteristics

• Time based

• another dimension in BigTable

• keep multiple versions of the same data

• at retrieval time (search) if the time is not specified then


the latest version will be returned
Example
Table Model
• (row, column, timestamp) -> cell content

• Multiple versions of the same cell

Rows & Partitioning


• A table is split among rows into sub tables called (Tablets)

• A tablet

• consist of a set of consecutive rows

• it is the unit of data distribution & load balancing

• stored on one server

• rows are sorted by key

• reading is efficient

• single row or scan range of rows

• designing the row is very important as it determines how data is stored

• influence read performance, and data distribution


Table splitting
• A table start as one tablet

• As it grows it will be split into tablets (100 - 200 MB each)

• can be configured

Splitting a Tablet
Columns & Column Families
• Column Family

• group of columns

• basic unit of data access

• data is of the same type (related)

• data in the same column family is compressed together

• operations:

• create a column family

• store data in any column key

• Table can have unlimited number of column families

• Number of columns in the same column family is up to hundreds

• identified by➔ family:qualifier


Example
• Web pages table

• store web pages and their information in a table

• use URL as the row key

• several column families to store various attributes of Web pages

• content (multiple content over time)

• anchors (multiple anchors)

• language single value


Example (cont.)
• Three column families

• “content:” — content of the web page

• “language:” — language of the web page

• “anchors” — pages referencing the web page (in the row key)

• use url of page as column name

• and the cell value is anchor text

• ancho is an example of dynamic column family


Timestamps
• Each column family cell may contain multiple versions of content

• In the previous example, same URL may have different content over time

• Timestamp is identified by 64-bits

• which represent real time, time when the cell was added

• or manually specified by the user

• At retrieval time

• most recent version will be returned if no timestamp is specified

• if the timestamp is specified

• if no exact match then the latest version that is earlier than the specified timestamp

You might also like