Bcis5420 - Lecture Note - ch6 - Big Data Technologies

The document discusses big data technologies including data lakes, NoSQL databases, and MongoDB. It covers characteristics of big data like volume, variety, and velocity. It also compares schema on read and schema on write approaches and provides examples of JSON and XML data structures.

Chapter 6

Big Data Technologies


Section. 1

Introduction to Big Data


Big Data and Analytics

• Big Data: data that exist in very large volumes and many
different data types and that need to be processed at a very
high velocity (e.g., Internet of Things)

• Analytics: systematic analysis and interpretation of data,
typically using mathematical, statistical, and computational
tools, to improve understanding of a real-world domain


Characteristics of Big Data

• Five V’s of Big Data
▪ Volume – much larger quantity of data than is typical for relational databases
▪ Variety – various data formats (e.g., structured, semi-structured, unstructured)
▪ Velocity – data arrive at a very fast rate (e.g., mobile sensors, Web click streams)
▪ Veracity – credible and reliable data
▪ Value – practical and beneficial data


Why Relational DBMS is NOT for Big Data

• Hard to organize big data into relational structures, due to various data
types (variety), enormous size (volume), and growth rate (velocity)
• Hence, hard to ensure credibility (veracity) and usefulness of data (value)

[Figure labels. Variety (data types and sources): unstructured data (e.g., image, video, text), semi-structured data (e.g., various data frames such as CSV, JSON), structured data (e.g., data in rows and columns, tidy data). Velocity: real-time (or near-real-time) data transactions. Volume: large size of data. Data Lake; Data Warehouse; Analytics Friendly.]


Technologies for Big Data


• Data Lake
▪ A large integrated repository for internal and external data that do not
follow a predefined schema (unlike a data warehouse)
▪ Capture everything, dive in anywhere, flexible access

• Schema on Read, rather than Schema on Write
▪ Schema on Write: the data model exists before data are stored, as in
traditionally designed (relational) databases
▪ Schema on Read: the data model is determined later, depending on how you
want to use the data (XML, JSON, BSON)
▪ Capture and store the data first, and decide how to use it later
▪ More flexible in terms of data integration and migration
▪ Little need for data normalization
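
A minimal sketch of schema on read in plain JavaScript. The `rawStore` array stands in for a data lake, and the record shapes and the `readAsStudent` helper are assumptions made for illustration, not part of the textbook. Records are stored as they arrive, and a structure is imposed only when they are read.

```javascript
// Hypothetical raw records, stored as-is with no predefined schema.
const rawStore = [
  '{"id": "ID100", "FName": "John", "LName": "Smith", "Grade": 90}',
  '{"id": "ID101", "FName": "Peter", "LName": "Gregory"}', // no Grade
  '{"id": "ID102", "name": "Ann Lee", "Grade": "88"}',     // different shape
];

// Schema on read: the reader decides the structure at read time.
function readAsStudent(json) {
  const rec = JSON.parse(json);
  const grade = Number(rec.Grade); // NaN when Grade is absent
  return {
    id: rec.id,
    fullName: rec.name ?? `${rec.FName} ${rec.LName}`,
    grade: Number.isFinite(grade) ? grade : null,
  };
}

const students = rawStore.map(readAsStudent);
console.log(students);
```

A different analysis could read the same `rawStore` with a different function, which is the flexibility the slide refers to. The cost is that every reader has to handle missing or inconsistent fields itself.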


Schema on Write vs. Schema on Read

Schema on Write: Relational Database Design
Schema on Read: Big Data Approach

Requiring more programming and technical skills for database managers and analysts


Examples of JSON and XML


The same data in different data structures, JSON and XML
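
As a rough illustration of the same data in both structures, the sketch below renders one record as JSON and as XML. The record values mirror the STUDENT example used in the NoSQL section. The `toXml` helper is hypothetical and handles only flat records (no escaping, no attributes).

```javascript
// One record; values mirror the STUDENT example in the NoSQL section.
const student = { id: "ID100", FName: "John", LName: "Smith", Grade: 90 };

// JSON: key-value pairs inside braces.
const asJson = JSON.stringify(student);

// XML: the same fields as nested elements.
// toXml is an illustrative helper for flat records only.
function toXml(tag, record) {
  const fields = Object.entries(record)
    .map(([key, value]) => `  <${key}>${value}</${key}>`)
    .join("\n");
  return `<${tag}>\n${fields}\n</${tag}>`;
}

const asXml = toXml("student", student);

console.log(asJson);
// {"id":"ID100","FName":"John","LName":"Smith","Grade":90}
console.log(asXml);
// <student>
//   <id>ID100</id>
//   <FName>John</FName>
//   <LName>Smith</LName>
//   <Grade>90</Grade>
// </student>
```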


Section. 2

NoSQL


NoSQL (Not Only SQL)

• A category of recently introduced data
storage and retrieval technologies not
based on the relational model
• Supports schema on read for data lake
• Optimal for a cloud environment
• Mostly open source

source: aws.amazon.com


NoSQL Classifications
• Key-Value Stores
▪ A simple pair of a key and an associated collection of values
▪ Key is usually a string
▪ Database has no knowledge of the structure or meaning of the values
▪ Example: Redis

Key = Value
Table_Name:Primary_Key_value:Attribute_Name = Value

STUDENT
ID     FName  LName    Grade
ID100  John   Smith    90
ID101  Peter  Gregory  95

STUDENT:ID100:FName = "John",
STUDENT:ID100:LName = "Smith",
STUDENT:ID100:Grade = 90

STUDENT:ID101:FName = "Peter",
STUDENT:ID101:LName = "Gregory",
STUDENT:ID101:Grade = 95
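
The key pattern above can be sketched with a JavaScript `Map` standing in for the key-value store. This is an in-memory illustration only, not the Redis API, and the `put` and `get` helpers are hypothetical names.

```javascript
// A Map stands in for the key-value store; keys follow the
// Table_Name:Primary_Key_value:Attribute_Name pattern.
const store = new Map();

function put(table, pk, attribute, value) {
  store.set(`${table}:${pk}:${attribute}`, value);
}

function get(table, pk, attribute) {
  return store.get(`${table}:${pk}:${attribute}`);
}

put("STUDENT", "ID100", "FName", "John");
put("STUDENT", "ID100", "LName", "Smith");
put("STUDENT", "ID100", "Grade", 90);
put("STUDENT", "ID101", "FName", "Peter");
put("STUDENT", "ID101", "LName", "Gregory");
put("STUDENT", "ID101", "Grade", 95);

console.log(get("STUDENT", "ID101", "Grade")); // 95
// The store only looks up whole keys; it knows nothing about the values.
console.log(get("STUDENT", "ID102", "Grade")); // undefined
```

Because the store treats each value as opaque, a question such as "which students scored above 90" would require the application to fetch and inspect the values itself.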


NoSQL Classifications (cont’d)

• Wide-Column Stores
▪ Distribution of data is based on both key values (records) and
columns, using “column families”
▪ Example: Cassandra

STUDENT
ID     FName  LName    Grade
ID100  John   Smith    90
ID101  Peter  Gregory  N/A

Column Family_1
ID100: FName = John, LName = Smith, Grade = 90

Column Family_2
ID101: FName = Peter, LName = Gregory


NoSQL Classifications (cont’d)


• Document Stores
▪ Like a key-value store, but “document” goes further than “value”
▪ Document is structured so specific elements can be manipulated separately
▪ Example: MongoDB

“Document”
{ Primary_Key: ID1, Attribute_1: Value_1, Attribute_2: Value_2, … }

STUDENT
ID     FName  LName    Grade
ID100  John   Smith    90
ID101  Peter  Gregory  95

{_id: "ID100", FName: "John", LName: "Smith", Grade: 90}
{_id: "ID101", FName: "Peter", LName: "Gregory", Grade: 95}
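
To show what "specific elements can be manipulated separately" means, here is a small in-memory sketch. A `Map` keyed by `_id` stands in for a document collection; it is an illustration only, not MongoDB's API.

```javascript
// A Map keyed by _id stands in for a document collection.
const students = new Map([
  ["ID100", { _id: "ID100", FName: "John", LName: "Smith", Grade: 90 }],
  ["ID101", { _id: "ID101", FName: "Peter", LName: "Gregory", Grade: 95 }],
]);

// Because a document is structured, one element can be changed on its own...
students.get("ID100").Grade = 92;

// ...and documents can be selected by the content of a specific element.
const topStudents = [...students.values()]
  .filter((doc) => doc.Grade >= 95)
  .map((doc) => doc._id);

console.log(students.get("ID100").Grade); // 92
console.log(topStudents);                 // [ 'ID101' ]
```

In a key-value store the whole value would have to be fetched and rewritten by the application; here the store itself understands the fields inside each document.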


NoSQL Classifications (cont’d)


• Graph-Oriented Database
▪ Maintains information about the relationships among data
in the form of graphs
▪ More intuitive and analytics friendly (easy to understand)
▪ Example: Neo4j

[Figure: the CUSTOMER, ORDER, ORDERLINE, and PRODUCT entities drawn as a relational DB and as a graph-oriented DB]


Section. 3

MongoDB


MongoDB
• “Mongo” being short for humongous, meaning huge or gigantic
• A document-store database
• Collections
▪ Equivalent to tables in a relational database, containing multiple documents
• Keys
▪ Equivalent to columns in a relational database
• Documents
▪ Equivalent to rows in a relational database
▪ Documents do not need to have the same structure
• Relationships
▪ _id property serves as “primary key”
▪ Another document can store that _id value in one of its own JSON properties, which then acts as a “foreign key”


JSON in Mongo DB vs. Relational DB

• JSON Data
"Advisor":
[
{"_id": "Advisor1", "name": "Luke Shire", "dept": "Sales"},
{"_id": "Advisor2", "name": "Merry Longbottom", "dept": "Accounting"},
{"_id": "Advisor3", "name": "Jason Manor", "dept": "Marketing", "tell": "123-456-7890"}
]
• Relational Data
Missing values may occur (i.e., data anomalies)

ADVISOR
id        name              dept        tell
Advisor1  Luke Shire        Sales
Advisor2  Merry Longbottom  Accounting
Advisor3  Jason Manor       Marketing   123-456-7890
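
The contrast above can be sketched in plain JavaScript: documents with different structures are forced into one fixed column list, and the absent properties turn into missing values. The conversion logic is an illustration written for this note, not a MongoDB or SQL feature.

```javascript
// The Advisor documents from the slide; only Advisor3 has a "tell" property.
const advisors = [
  { _id: "Advisor1", name: "Luke Shire", dept: "Sales" },
  { _id: "Advisor2", name: "Merry Longbottom", dept: "Accounting" },
  { _id: "Advisor3", name: "Jason Manor", dept: "Marketing", tell: "123-456-7890" },
];

// A relational table needs one fixed column list: the union of all keys.
const columns = [...new Set(advisors.flatMap((doc) => Object.keys(doc)))];

// Each document becomes a row; absent properties become null (missing values).
const rows = advisors.map((doc) => columns.map((col) => doc[col] ?? null));

console.log(columns); // [ '_id', 'name', 'dept', 'tell' ]
console.log(rows[0]); // [ 'Advisor1', 'Luke Shire', 'Sales', null ]
```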


Mongo Documents with Relationships

[Figure: a) a document in the product collection; b) a document in the reviewer collection]

Based on the primary key, “_id”, it is possible to retrieve more details of a
reviewer, such as first and last names
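
Since the figure is not reproduced here, the following is a minimal sketch of the idea using hypothetical documents. The field names (`reviews`, `reviewer_id`, `first_name`, `last_name`) and all values are assumptions for illustration and may differ from the textbook's figure.

```javascript
// Hypothetical documents; field names and values are assumptions.
const products = [
  {
    _id: "P1",
    name: "Desk Lamp",
    reviews: [
      { reviewer_id: "R1", stars: 5 },
      { reviewer_id: "R2", stars: 3 },
    ],
  },
];
const reviewers = [
  { _id: "R1", first_name: "John", last_name: "Smith" },
  { _id: "R2", first_name: "Peter", last_name: "Gregory" },
];

// Index the reviewer collection by its primary key, _id.
const reviewerById = new Map(reviewers.map((r) => [r._id, r]));

// Follow each "foreign key" (reviewer_id) to the reviewer's details.
const detailed = products[0].reviews.map((review) => {
  const r = reviewerById.get(review.reviewer_id);
  return { stars: review.stars, reviewer: `${r.first_name} ${r.last_name}` };
});

console.log(detailed);
```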


How to Query in Mongo DB

• Select Documents (rows) in a Collection (table) by Conditions
▪ db.collection.find()

• Types of Brackets
▪ Round brackets ( ) enclose the arguments of a collection (i.e., table) method such as find
▪ Curly brackets { } for a document (i.e., row) or an operator expression
▪ Square brackets [ ] for an array (i.e., group of values)


How to Query in Mongo DB (cont’d)


• Equality ($eq) and And ($and) Operator

▪ Using Yelp review data, show businesses that are open
: db.collection.find({is_open: {$eq: 1}})
▪ Show businesses from Colorado
: db.collection.find({state: {$eq: "CO"}})
c.f., the equality ($eq) operator CAN be dropped, producing the same result,
i.e., db.collection.find({state: "CO"})
▪ Show businesses from Boulder, Colorado
: db.collection.find({$and: [{state: {$eq: "CO"}}, {city: {$eq: "Boulder"}}]})
c.f., the and ($and) operator CAN be dropped, because conditions listed
together in one query document are combined with an implicit AND; the query is
thus equivalent to db.collection.find({state: "CO", city: "Boulder"})
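
The equivalence of the explicit and shorthand forms can be illustrated with a toy in-memory matcher. This is a sketch that imitates the filter syntax for `$eq`, `$and`, and implicit equality only; it is not MongoDB's implementation, and the `businesses` sample is hypothetical.

```javascript
// Hypothetical sample documents; field names follow the queries on the slide.
const businesses = [
  { name: "A", state: "CO", city: "Boulder", is_open: 1 },
  { name: "B", state: "CO", city: "Denver", is_open: 0 },
  { name: "C", state: "TX", city: "Austin", is_open: 1 },
];

// Toy matcher covering only $eq, $and, and implicit equality.
function matches(doc, filter) {
  return Object.entries(filter).every(([key, cond]) => {
    if (key === "$and") return cond.every((sub) => matches(doc, sub));
    const isOperatorDoc =
      typeof cond === "object" && cond !== null && "$eq" in cond;
    // {field: v} is shorthand for {field: {$eq: v}}
    const expected = isOperatorDoc ? cond.$eq : cond;
    return doc[key] === expected;
  });
}

const find = (filter) =>
  businesses.filter((doc) => matches(doc, filter)).map((doc) => doc.name);

const explicit = find({ $and: [{ state: { $eq: "CO" } }, { city: { $eq: "Boulder" } }] });
const implicit = find({ state: "CO", city: "Boulder" });
const open = find({ is_open: { $eq: 1 } });

console.log(explicit); // [ 'A' ]
console.log(implicit); // [ 'A' ]
console.log(open);     // [ 'A', 'C' ]
```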


How to Query in Mongo DB (cont’d)

• Greater ($gt) and Greater or Equal ($gte) Operator

▪ Show businesses with star rating greater than 3.5
: db.collection.find({stars: {$gt: 3.5}})
▪ Show businesses with star rating greater than or equal to 4
: db.collection.find({stars: {$gte: 4.0}})
▪ Show businesses with star rating greater than or equal to 4 and open
: db.collection.find({$and: [{is_open: {$eq: 1}}, {stars: {$gte: 4}}]})
c.f., the and ($and) operator CAN also be dropped here, even though there are
two different operators ($eq and $gte): db.collection.find({is_open: 1,
stars: {$gte: 4}}) returns the same result. An explicit $and is needed only
when the same field or operator has to be specified in more than one expression


How to Query in Mongo DB (cont’d)

• Less ($lt) and Less or Equal ($lte) Operator

▪ Show businesses with star rating less than 4.5
: db.collection.find({stars: {$lt: 4.5}})
▪ Show businesses with star rating less than or equal to 4
: db.collection.find({stars: {$lte: 4.0}})
▪ Show businesses with star rating less than or equal to 4 and open
: db.collection.find({$and: [{is_open: {$eq: 1}}, {stars: {$lte: 4}}]})
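
The comparison operators ($gt, $gte, $lt, $lte) correspond to ordinary comparisons. As a rough sketch, each filter is written below as the plain JavaScript predicate that selects the same documents from a small hypothetical in-memory sample; no MongoDB server is involved.

```javascript
// Hypothetical sample documents.
const businesses = [
  { name: "A", stars: 4.5, is_open: 1 },
  { name: "B", stars: 4.0, is_open: 0 },
  { name: "C", stars: 3.5, is_open: 1 },
  { name: "D", stars: 3.0, is_open: 1 },
];
const names = (docs) => docs.map((doc) => doc.name);

// {stars: {$gt: 3.5}}
const above35 = names(businesses.filter((b) => b.stars > 3.5));
// {stars: {$gte: 4.0}}
const atLeast4 = names(businesses.filter((b) => b.stars >= 4.0));
// {stars: {$lt: 4.5}}
const below45 = names(businesses.filter((b) => b.stars < 4.5));
// {stars: {$lte: 4.0}}
const atMost4 = names(businesses.filter((b) => b.stars <= 4.0));
// {$and: [{is_open: {$eq: 1}}, {stars: {$lte: 4}}]}
const openAtMost4 = names(businesses.filter((b) => b.is_open === 1 && b.stars <= 4));

console.log(above35);     // [ 'A', 'B' ]
console.log(atLeast4);    // [ 'A', 'B' ]
console.log(below45);     // [ 'B', 'C', 'D' ]
console.log(atMost4);     // [ 'B', 'C', 'D' ]
console.log(openAtMost4); // [ 'C', 'D' ]
```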


How to Query in Mongo DB (cont’d)

• In ($in) and Not In ($nin) Operator (c.f., the operand must be an array of values)

▪ Show businesses in Colorado or Texas
: db.collection.find({state: {$in: ["CO","TX"]}})
▪ Show businesses in neither Colorado nor Texas
: db.collection.find({state: {$nin: ["CO","TX"]}})
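
A minimal sketch of what $in and $nin select, written as plain JavaScript predicates over a hypothetical in-memory sample. One detail worth noting: in MongoDB, $nin also matches documents in which the field does not exist, which the sample document "D" illustrates.

```javascript
// Hypothetical sample documents; "D" has no state field at all.
const businesses = [
  { name: "A", state: "CO" },
  { name: "B", state: "TX" },
  { name: "C", state: "NV" },
  { name: "D" },
];
const states = ["CO", "TX"];

// {state: {$in: ["CO","TX"]}}: the field equals any value in the array
const inList = businesses
  .filter((b) => states.includes(b.state))
  .map((b) => b.name);

// {state: {$nin: ["CO","TX"]}}: the field equals none of the values,
// which also covers documents that lack the field
const notInList = businesses
  .filter((b) => !states.includes(b.state))
  .map((b) => b.name);

console.log(inList);    // [ 'A', 'B' ]
console.log(notInList); // [ 'C', 'D' ]
```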


References

• The major contents of this note are reproduced from the BCIS 5420 textbook:
Topi et al. Modern Database Management. 13th edition. Pearson, 2019

• Unless having a specific reference source, the photos and icons used in this
material are from the following sources providing copyright free images:
imagesource.com, iconfinder.com, and pexels.com

• The diagrams used are from the textbook publisher’s materials
