
COMP5434 Big Data Computing

NoSQL & Visualization


Song Guo
COMP, Hong Kong Polytechnic University

For internal use only, please do not distribute!


Agenda

§ NoSQL
§ Types of Data
§ NoSQL Database
§ The CAP Theorem, the ACID and BASE Properties
§ Visualization
§ Motivation
§ Data Visualization Methods
§ Dimensionality Reduction

COMP5434 2
New Jersey Institute of Technology
Types of Data
Data can be broadly classified into four types:
1. Structured Data:
§ Have a predefined model, which organizes data into a
form that is relatively easy to store, process, retrieve
and manage,
§ E.g., relational data
2. Unstructured Data:
§ Opposite of structured data
§ E.g., Flat binary files containing text, video or audio
§ Note: data is not completely devoid of a structure (e.g.,
an audio file may still have an encoding structure and
some metadata associated with it)
COMP5434 3
Types of Data

3. Dynamic Data:
§ Data that changes relatively frequently
§ E.g., office documents and transactional entries in a
financial database
4. Static Data:
§ Opposite of dynamic data
§ E.g., Medical imaging data from MRI or CT scans

COMP5434 4
Why Classifying Data?
§ Segmenting data into one of the following 4 quadrants can help
in designing and developing a pertinent storage solution
                Dynamic                           Static
Unstructured    Media Production, eCAD,           Media Archive, Broadcast,
                mCAD, Office Docs                 Medical Imaging
Structured      Transaction Systems, ERP,         BI, Data Warehousing
                CRM

§ Relational databases are usually used for structured data


§ File systems or NoSQL databases can be used for (static)
unstructured data (more on these later)

COMP5434 5
Relational Database Management Systems
§ RDBMS are the predominant database technologies
§ first defined in 1970 by Edgar Codd of IBM's Research Lab
§ Data modeled as relations (tables)
§ object = tuple of attribute values
§ each attribute has a certain domain
§ a table is a set of objects (tuples, rows) of the same type
§ relation is a subset of cartesian product of the attribute domains
§ tables and objects “interconnected” via (foreign) keys
§ field (or a set of fields) that uniquely identifies a row in another
table

§ Relational calculus, SQL query language

COMP5434 6
RDBMS Example

SELECT Name FROM Students NATURAL JOIN Takes_Course WHERE ClassID = 1001
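The query above can be run end to end with SQLite; the table definitions and sample rows below are assumed for illustration (the slide does not give the schema, only the column names `Name`, `StudentID`, `ClassID`):

```python
import sqlite3

# Assumed schema: Students(StudentID, Name), Takes_Course(StudentID, ClassID).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (StudentID INTEGER, Name TEXT);
    CREATE TABLE Takes_Course (StudentID INTEGER, ClassID INTEGER);
    INSERT INTO Students VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO Takes_Course VALUES (1, 1001), (2, 2002);
""")
# NATURAL JOIN matches rows on the shared StudentID column.
rows = conn.execute(
    "SELECT Name FROM Students NATURAL JOIN Takes_Course WHERE ClassID = 1001"
).fetchall()
print(rows)  # [('Alice',)]
```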

COMP5434 7
Agenda

§ NoSQL
§ Types of Data
§ NoSQL Database
§ The CAP Theorem, the ACID and BASE Properties
§ Visualization
§ Motivation
§ Data Visualization
§ Dimensionality Reduction

COMP5434 8
Background
§ Relational databases used to be mainstay of business
(often structured data)
§ Web-based applications caused spikes in load
§ Explosion of social media sites (Facebook, Twitter) with
large data needs

COMP5434 9
Background
§ Hooking relational databases to web-based applications
is troublesome because
§ Relational databases assume that data are
§ Dense
§ Largely uniform (structured data)

§ However, data coming from Internet are


§ Massive and sparse
§ Semi-structured or unstructured

COMP5434 10
What is NoSQL

§ The Name:
§ Stands for Not Only SQL
§ The term NoSQL was introduced by Carlo Strozzi in 1998 to
name his file-based database
§ It was again re-introduced by Eric Evans when an event was
organized to discuss open source distributed databases
§ Eric states that “… but the whole point of seeking
alternatives is that you need to solve a problem that
relational databases are a bad fit for. …”

COMP5434 11
NoSQL

§ Disadvantages:
§ Don’t fully support relational features
§ no join, group by, order by operations (except within partitions)
§ no referential integrity constraints across partitions
§ No declarative query language (e.g., SQL) → more programming
§ Relaxed ACID (see CAP theorem discussed later) → fewer guarantees
§ No easy integration with other applications that
support SQL

COMP5434 12
NoSQL

§ Key features (advantages):


§ non-relational
§ don’t require schema
§ data are replicated to multiple
nodes (so, identical & fault-tolerant)
and can be partitioned:
§ down nodes easily replaced
§ no single point of failure
§ horizontally scalable
§ cheap, easy to implement (open-source)
§ massive write performance
§ fast key-value access

COMP5434 13
3 major papers for NoSQL

§ Three major papers were the “seeds” of the NoSQL


movement:
§ BigTable (Google)
§ Dynamo (Amazon)
§ Ring partition and replication
§ Gossip protocol (discovery and error detection)
§ Distributed key-value data stores
§ Eventual consistency
§ CAP Theorem

COMP5434 14
NoSQL categories
§ Key-value
§ Example: DynamoDB, Voldemort, Scalaris
§ Document-based
§ Example: MongoDB, CouchDB
§ Column-based
§ Example: BigTable, Cassandra, HBase
§ Graph-based
§ Example: Neo4J, InfoGrid

COMP5434 15
Key-value

§ Focus on scaling to huge amounts of data


§ Designed to handle massive load
§ Based on Amazon’s Dynamo paper
§ Data model: (global) collection of Key-value pairs
§ Dynamo ring partitioning and replication
§ Example: (DynamoDB)
§ items having one or more attributes (name, value)
§ An attribute can be single-valued or multi-valued (e.g., a set)
§ items are combined into a table

COMP5434 16
Key-value

§ Basic API access:


§ get(key): extract the value given a key
§ put(key, value): create or update the value given its key
§ delete(key): remove the key and its associated value
§ execute(key, operation, parameters): invoke an operation on
the value (given its key), which is a special data structure
(e.g., List, Set, Map, etc.)
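The four calls above map naturally onto an in-memory dictionary; the `KVStore` class below is an illustrative toy, not DynamoDB's or any other product's actual API:

```python
# A toy in-memory key-value store illustrating get/put/delete/execute.
class KVStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)          # None if the key is absent

    def put(self, key, value):
        self._data[key] = value             # create or update

    def delete(self, key):
        self._data.pop(key, None)           # remove key and value

    def execute(self, key, operation, *parameters):
        # Apply a named operation to a structured value (e.g. a list or set).
        return getattr(self._data[key], operation)(*parameters)

store = KVStore()
store.put("cart:42", ["milk"])
store.execute("cart:42", "append", "eggs")  # list operation on the value
print(store.get("cart:42"))  # ['milk', 'eggs']
store.delete("cart:42")
print(store.get("cart:42"))  # None
```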

COMP5434 17
Key-value

§ Pros:
§ very fast
§ very scalable (horizontally distributed to nodes based on key)
§ simple data model
§ eventual consistency
§ fault-tolerance
§ Cons:
§ Can’t model more complex data structures, such as objects

COMP5434 18
Column-based
§ Based on Google’s BigTable paper
§ Like column-oriented relational databases (store data in column
order), but with a twist
§ Tables similarly to RDBMS, but handle semi-structured
§ Data model:
§ Collection of Column Families
§ Column family = (key, value) where value = set of related columns
(standard, super)
§ indexed by row key, column key and timestamp
[Figure: the same table stored in row order, in column order (columnar), and in a columnar layout with locality groups, i.e., column family {B, C}]
COMP5434 19
Column-based

§ One column family can have variable numbers of columns


§ Cells within a column family are sorted “physically”
§ Very sparse, most cells have null values
§ Comparison: RDBMS vs column-based NoSQL
§ Query on multiple tables:
RDBMS: must fetch data from several places on disk and glue it together
Column-based NoSQL: only fetch the column families of the columns
required by a query (all columns in a column family are stored together
on disk, so multiple rows can be retrieved in one read operation → data
locality)

COMP5434 20
Column-based

§ Example: (Cassandra column family-- timestamps removed for


simplicity)
UserProfile = {
  Cassandra = { emailAddress: "[email protected]", age: "20" }
  TerryCho = { emailAddress: "[email protected]", gender: "male" }
  Cath = { emailAddress: "[email protected]", age: "20", gender: "female", address: "Seoul" }
}

COMP5434 21
Document-based

§ Can model more complex objects


§ Inspired by Lotus Notes
§ Data model: collection of documents
§ Document: JSON (JavaScript Object Notation: a key-value data
model which supports objects, records, structs, lists, arrays, maps,
dates, and Booleans, with nesting), XML, and other semi-structured
formats.
§ Example: (MongoDB) document
{ Name: "Jaroslav", Address: "Malostranske nám. 25, 118 00 Praha 1",
  Grandchildren: { Claire: "7", Barbara: "6", Magda: "3", Kirsten: "1", Otis: "3", Richard: "1" },
  Phones: [ "123-456-7890", "234-567-8963" ] }

COMP5434 22
Document-based

§ The main difference between column-based NoSQL and


document-based NoSQL:
§ document stores (e.g. MongoDB and CouchDB) allow
arbitrarily complex documents, i.e. subdocuments within
subdocuments, lists with documents, etc.
§ column stores (e.g. Cassandra and HBase) only allow a fixed
format, e.g. strict one-level or two-level dictionaries.
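The contrast can be made concrete with plain Python dictionaries (illustrative data and `example.com` addresses, not a real driver API): the document below nests a subdocument inside a subdocument, while the column-family value stays a strict two-level map of row key to column name to value.

```python
# Document store: arbitrarily nested structure is allowed.
document = {
    "Name": "Jaroslav",
    "Grandchildren": {"Claire": {"age": "7", "pets": ["cat"]}},  # nested subdocs
    "Phones": ["123-456-7890"],
}

# Column-family store: roughly row key -> {column: value}, two levels only.
column_family = {
    "Cassandra": {"emailAddress": "cassandra@example.com", "age": "20"},
    "TerryCho": {"emailAddress": "terry@example.com", "gender": "male"},
}

def depth(obj):
    """Nesting depth of dict/list values (scalars count as depth 0)."""
    if isinstance(obj, dict):
        return 1 + max((depth(v) for v in obj.values()), default=0)
    if isinstance(obj, list):
        return 1 + max((depth(v) for v in obj), default=0)
    return 0

print(depth(document), depth(column_family))  # 4 2
```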

COMP5434 23
Graph-based

§ Focus on modeling the structure of data (interconnectivity)


§ Scales to the complexity of data
§ Inspired by mathematical Graph Theory (G = (V, E))
§ Data model:
§ (Property Graph) nodes and edges
§ Nodes may have properties (including ID)
§ Edges may have labels or roles
§ Key-value pairs on both
§ Interfaces and query languages vary
§ Single-step vs path expressions vs full recursion
§ Example:
§ Neo4j, FlockDB, Pregel, InfoGrid …
COMP5434 24
Graph-based

COMP5434 25
Agenda

§ NoSQL
§ Types of Data
§ NoSQL Database
§ The CAP Theorem, the ACID and BASE Properties
§ Visualization
§ Motivation
§ Data Visualization Methods
§ Dimensionality Reduction

COMP5434 26
The CAP Theorem

§ The limitations of distributed databases can be


described in the so-called CAP theorem
§ Consistency: every node always sees the same data at any
given instance (i.e., strict consistency)
§ Availability: the system continues to operate, even if nodes
in a cluster crash, or some hardware or software parts are
down due to upgrades
§ Partition Tolerance: the system continues to operate in the
presence of network partitions
CAP theorem: any distributed database with shared data can have at most two
of the three desirable properties: C, A, or P

COMP5434 27
The CAP Theorem for Existing Databases

source: https://blog.nahurst.com/visual-guide-to-nosql-systems

COMP5434 28
The CAP Theorem
§ Let us assume two nodes on opposite sides of a
network partition:

§ Availability + Partition Tolerance (AP) forfeits Consistency


§ Consistency + Partition Tolerance (CP) entails that one side
of the partition must act as if it is unavailable, thus
forfeiting Availability
§ Consistency + Availability (CA) is only possible if there is no
network partition, thereby forfeiting Partition Tolerance
COMP5434 29
Example for CAP Theorem

§ When companies such as Google and Amazon were designing


large-scale databases, 24/7 Availability was a key requirement
§ A few minutes of downtime means lost revenue

§ When horizontally scaling databases to 1000s of machines, the


likelihood of a node or a network failure increases
tremendously

§ Therefore, in order to have strong guarantees on Availability


and Partition Tolerance, they had to sacrifice “strict”
Consistency (implied by the CAP theorem)

COMP5434 30
Trading-Off Consistency

§ Maintaining consistency requires balancing the strictness of
consistency against availability/scalability
§ Good-enough consistency depends on your application

Loose Consistency: easier to implement, and is efficient
Strict Consistency: generally hard to implement, and is inefficient

COMP5434 31
The ACID Properties
§ The CAP theorem proves that it is impossible to guarantee strict
Consistency and Availability while being able to tolerate
network partitions

§ This resulted in databases that relax the ACID guarantees of traditional RDBMS:


§ Atomicity: An “all or nothing” approach. If any statement in the
transaction fails, the entire transaction is rolled back.
§ Consistency: The transaction must meet all protocols defined by the
system. No half-completed transactions.
§ Isolation: No transaction has access to any other transaction that is in
an intermediate or unfinished state. Each transaction is independent.
§ Durability: Ensures that once a transaction commits to the database, it
is preserved through the use of backups and transaction logs.

COMP5434 32
The BASE Properties
§ In particular, NoSQL databases apply the BASE properties
(almost the opposite of ACID):
§ Basically Available: The database appears to work most of the time
§ Soft-State: Stores don’t have to be write-consistent, nor do different
replicas have to be mutually consistent all the time
§ Eventual Consistency: Stores exhibit consistency at some later point
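The three BASE properties can be seen in a toy simulation (illustrative only: the names and the last-writer-wins merge are one common convergence strategy, not any specific system's protocol). Writes land on one replica, so replicas may disagree for a while (soft state), but a background merge pass makes them converge (eventual consistency):

```python
import random

replicas = [{}, {}, {}]                       # three replica stores

def write(key, value, version):
    target = random.choice(replicas)          # basically available: any replica accepts
    target[key] = (version, value)

def anti_entropy():
    # Merge replicas, keeping the highest version per key (last-writer-wins).
    merged = {}
    for rep in replicas:
        for key, (ver, val) in rep.items():
            if key not in merged or ver > merged[key][0]:
                merged[key] = (ver, val)
    for rep in replicas:
        rep.update(merged)

write("x", "old", version=1)
write("x", "new", version=2)
stale = any(rep.get("x") != (2, "new") for rep in replicas)   # soft state: replicas disagree
anti_entropy()
consistent = all(rep.get("x") == (2, "new") for rep in replicas)
print(stale, consistent)  # True True
```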

COMP5434 33
Agenda

§ NoSQL
§ Types of Data
§ NoSQL Database
§ The CAP Theorem, the ACID and BASE Properties
§ Visualization
§ Motivation
§ Data Visualization Methods
§ Dimensionality Reduction

COMP5434 34
Motivation
§ Why data visualization?
§ Gain insight into an
information space by mapping
data onto graphical primitives
§ Provide qualitative overview
of large data sets
§ Search for patterns, trends,
structure, irregularities,
relationships among data
§ Help find interesting regions
and suitable parameters for
further quantitative analysis
§ Provide a visual proof of
computer representations
derived
COMP5434 35
Data Visualization

COMP5434 36
Agenda

§ NoSQL
§ Types of Data
§ NoSQL Database
§ The CAP Theorem, the ACID and BASE Properties
§ Visualization
§ Motivation
§ Data Visualization Methods
§ Dimensionality Reduction

COMP5434 37
Methods of Data visualization
§ Different methods are available for data visualization based
on type of data
§ Data can be:
§ Univariate: single quantitative variable
§ Boxplot, Histogram, Pie chart
§ Bivariate: variables are related
§ Scatter plots, Line graphs
§ Multivariate: multi-dimensional representation of
multivariate data
§ Icon-based methods
§ Pixel-based methods
§ Dynamic parallel coordinate system

COMP5434 38
Boxplot Analysis
§ Five-number summary of a distribution:
min, Q1, median, Q3, max
§ Boxplot
§ Data is represented with a box
§ The ends of the box are at the first and third quartiles, i.e., the height
of the box is Inter-Quartile Range (IQR)
§ The median is marked by a line within the box

[Figure: a boxplot annotated with min, Q1, median, Q3, and max; the box spans Q1 to Q3 with the median line inside]
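The five-number summary behind a boxplot can be computed directly; note that the exact Q1/Q3 convention varies between tools, so the values below reflect one common choice (numpy's linear interpolation). The data values are made up for illustration:

```python
import numpy as np

data = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17])
five_num = {
    "min": float(np.min(data)),
    "Q1": float(np.percentile(data, 25)),
    "median": float(np.median(data)),
    "Q3": float(np.percentile(data, 75)),
    "max": float(np.max(data)),
}
iqr = five_num["Q3"] - five_num["Q1"]   # height of the box
print(five_num, iqr)
```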
COMP5434 39
Histogram Analysis
§ Histogram: graph display of tabulated frequencies, shown as bars.
§ It shows what proportion of cases fall into each of several categories.
§ Differs from a bar chart in that it is the area of the bar that denotes the value,
not the height as in bar charts, a crucial distinction when the categories are
not of uniform width.
§ The categories are usually specified as non-overlapping intervals of some
variable. The categories (bars) must be adjacent.
[Figure: histogram showing frequency of term per category]
COMP5434 40
Histograms Often Tell More than Boxplots
§ The two histograms shown on the left may have the same
boxplot representation
§ The same values for: min, Q1, median, Q3, max
§ But they have rather different data distributions

COMP5434 41
Image Histograms

COMP5434 42
Table Histograms

COMP5434 43
Table Histograms
§ Frequency density = Frequency / Class width
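A worked example of the formula above: with unequal class widths, it is the frequency density (bar height) times class width that gives the frequency. The class intervals and counts are made up for illustration:

```python
# (lower bound, upper bound, frequency) per class, with unequal widths.
classes = [(0, 10, 20), (10, 15, 30), (15, 40, 25)]

# frequency density = frequency / class width
densities = [freq / (upper - lower) for lower, upper, freq in classes]

for (lower, upper, freq), density in zip(classes, densities):
    print(f"[{lower}, {upper}): freq={freq}, width={upper - lower}, density={density:g}")
```

The narrow middle class gets the tallest bar (density 6) even though its frequency is not the largest, which is exactly the distinction from a bar chart.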

COMP5434 44
Table Histograms

COMP5434 45
Pie Chart

COMP5434 46
Scatter plot
§ Provides a first look at bivariate data to see clusters of
points, outliers, etc.
§ Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

COMP5434 47
Line Graphs

COMP5434 48
Icon-Based Visualization Methods
§ Visualization of the data values as features of icons
§ Typical visualization methods
§ Chernoff Faces
§ Stick Figures
§ General techniques
§ Shape coding: Use shape to represent certain
information encoding
§ Color icons: Use color icons to encode more
information
§ Tile bars: Use small icons to represent the relevant
feature vectors in document retrieval

COMP5434 49
Chernoff Faces
§ A way to display variables on a two-dimensional surface,
e.g., let 𝑥 be eyebrow slant, 𝑦 be eye size, 𝑧 be nose length,
etc.
§ The figure shows faces produced using 10 characteristics (head
eccentricity, eye size, eye spacing, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, mouth size, and mouth
opening): each is assigned one of 10 possible values; generated using
Mathematica (S. Dickson)

References:
[1] Gonick, L. and Smith, W. The Cartoon Guide to Statistics.
New York: Harper Perennial, p. 212, 1993
[2] Weisstein, Eric W. "Chernoff Face." From MathWorld--A
Wolfram Web Resource.
mathworld.wolfram.com/ChernoffFace.html

COMP5434 50
Stick Figure

§ A census data figure showing


age, income, gender,
education, etc.
§ A 5-piece stick figure (1 body
and 4 limbs w. different
angle/length). Two attributes
of the data are mapped to
the display axes and the
remaining attributes are
mapped to the angle and/or
length of the limbs.

Look at texture pattern!

COMP5434 51
Pixel-Oriented Visualization Techniques
§ For a data set of m dimensions, create m windows on
the screen, one for each dimension
§ The 𝑚 dimension values of a record are mapped to 𝑚
pixels at the corresponding positions in the windows
§ The colors of the pixels reflect the corresponding values

(a) Income (b) Credit Limit (c) Transaction volume (d) Age
COMP5434 52
Laying Out Pixels in Circle Segments
§ To save space and show the connections among multiple
dimensions, space filling is often done in a circle segment

(a) Representing a data record in a circle segment (b) Laying out pixels in circle segments

COMP5434 53
Parallel Coordinates
§ 𝑛 equidistant axes which are parallel to one of the screen
axes and correspond to the attributes
§ The axes are scaled to the [minimum, maximum] range of
the corresponding attribute
§ Every data item corresponds to a polygonal line which
intersects each of the axes at the point which corresponds to
the value for the attribute
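The coordinate mapping above can be sketched without any plotting (the data values are made up): each attribute axis is scaled to that attribute's [min, max] range, and each record becomes one polyline whose vertex on axis j is its scaled value for attribute j.

```python
import numpy as np

X = np.array([[1.0, 200.0, 3.0],
              [2.0, 400.0, 9.0],
              [3.0, 300.0, 6.0]])        # 3 records, 3 attributes

mins, maxs = X.min(axis=0), X.max(axis=0)
polylines = (X - mins) / (maxs - mins)    # row i = vertex heights of record i's polyline
print(polylines)                          # every entry lies in [0, 1]
```

Feeding each row to a line plot over n equidistant vertical axes reproduces the parallel-coordinates picture.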

COMP5434 54
Parallel Coordinates of a Data Set

COMP5434 55
Summary
§ Gain insight into the data by:
§ Basic statistical data description: central tendency,
dispersion, graphical displays
§ Data visualization: map data onto graphical primitives
§ Above steps are the beginning of data preprocessing
§ Many methods have been developed but still an active
area of research

COMP5434 56
Agenda

§ NoSQL
§ Types of Data
§ NoSQL Database
§ The CAP Theorem, the ACID and BASE Properties
§ Visualization
§ From Data Distribution to Visualization
§ Data Visualization
§ Dimensionality Reduction

COMP5434 57
Dimensionality Reduction
§ Our focus
§ Not simply word cloud like presentation/visualization!
§ Preserving the original similarity and being able to
visualize the high dimensional data in low dimensional
space!
§ Many modern data domains involve huge numbers
of features/dimensions
§ Documents: thousands of words, millions of bigrams
§ Images: thousands to millions of pixels
§ Genomics: thousands of genes
§ E-commerce
COMP5434 58
Why reduce dimensions?
§ High dimensionality has many costs
§ Redundant and irrelevant features degrade
performance of some ML/analytics algorithms
§ Difficulty in interpretation and visualization
§ Computation may become infeasible
§ what if your algorithm scales as 𝑂( 𝑛! )?
§ Curse of dimensionality
§ refers to various phenomena that when the dimensionality
increases, the volume of the space increases so fast that the
available data become sparse

COMP5434 59
Approaches to dimensionality reduction
§ Feature selection
§ Select subset of existing features (without modification)
§ Feature transformation - Combining (mapping)
existing features into smaller number of
new/alternative features
§ Linear combination (projection)
§ Nonlinear combination
§ Deep learning: Autoencoder

COMP5434 60
Linear dimensionality reduction
§ Linearly project n-dimensional data onto a 𝑘-
dimensional space
§ 𝑘 < 𝑛, often 𝑘 << 𝑛
§ Example: project 10!-D space of words into 3
dimensions

§ There are infinitely many k-dimensional subspaces


we can project the data onto.
§ Which one should we choose?

COMP5434 61
Linear dimensionality reduction
§ Best 𝑘-dimensional subspace for projection
depends on task
§ Supervised: maximize separation among classes
§ Example: linear discriminant analysis (LDA)
§ Unsupervised: retain as much data variance as possible
§ Example: principal component analysis (PCA)

COMP5434 62
Linear discriminant analysis (LDA) for two classes
§ Projecting data onto one dimension that maximizes the ratio
of between-class scatter to total within-class scatter
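A minimal sketch of two-class Fisher LDA on synthetic data (the class means, spreads, and threshold below are illustrative assumptions): the direction maximizing between-class over within-class scatter is w ∝ Sw⁻¹(m1 − m2).

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 0.5, size=(50, 2))   # class 1 samples
X2 = rng.normal([3, 3], 0.5, size=(50, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter: summed outer products of centered samples.
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(Sw, m1 - m2)             # w ∝ Sw^{-1} (m1 - m2)
w /= np.linalg.norm(w)                       # unit-length direction

# The 1-D projections of the two classes should be well separated.
p1, p2 = X1 @ w, X2 @ w
gap = abs(p1.mean() - p2.mean())
print(gap > 3 * (p1.std() + p2.std()))       # classes separate along w
```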

COMP5434 63
Unsupervised dimensionality reduction
§ Consider data without class labels
§ Try to find a more compact representation of the data

§ Assume that the high dimensional


data actually resides in an inherent
low-dimensional space
§ Additional dimensions are just
random noise
§ Goal is to recover these inherent
dimensions and discard noise
dimensions

COMP5434 64
Principal Component Analysis (PCA)
§ Widely used method for unsupervised, linear
dimensionality reduction

§ GOAL: account for variance of data in as few


dimensions as possible (using linear projection)

COMP5434 65
Geometric Picture of Principal Components (PCs)
§ First PC is the projection direction that maximizes the
variance of the projected data
§ Second PC is the projection direction that is orthogonal to
the first PC and maximizes variance of the projected data

COMP5434 66
PCA vs LDA

COMP5434 67
PCA: conceptual algorithm
§ Find a line, such that when the data is projected
onto that line, it has the maximum variance.

COMP5434 68
PCA: conceptual algorithm
§ Find a second line, orthogonal to the first, that has
maximum projected variance.

COMP5434 69
PCA: conceptual algorithm
§ Repeat until have 𝑘 orthogonal lines
§ The projected position of a point on these lines gives the
coordinates in the k-dimensional reduced space.

COMP5434 70
Applying principal component analysis
§ Full set of PCs comprise a new orthogonal basis for
feature space, whose axes are aligned with the
maximum variances of original data.
§ Projection of original data onto first 𝑘 PCs gives a
reduced dimensionality representation of the data.
§ Transforming reduced dimensionality projection
back into original space gives a reduced
dimensionality reconstruction of the original data.
§ Reconstruction will have some error, but it can be
small and often is acceptable given the other
benefits of dimensionality reduction.
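The projection/reconstruction round trip above can be sketched in a few lines of numpy (the synthetic data, which is mostly 1-D with a little noise in the second direction, is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=100)])  # near a line

mu = X.mean(axis=0)
Xc = X - mu                                   # mean-center the data
# PCs = eigenvectors of the covariance matrix, sorted by eigenvalue.
vals, vecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
order = np.argsort(vals)[::-1]
W = vecs[:, order[:1]]                        # keep k = 1 component

Z = Xc @ W                                    # reduced 1-D representation
X_rec = Z @ W.T + mu                          # map back to the original space
err = np.mean((X - X_rec) ** 2)               # small reconstruction error
print(err)
```

Because the data really is close to 1-D, discarding the second PC loses only the small noise component.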
COMP5434 71
PCA example 1
(left) original data; (right) mean-centered data with PCs overlaid

COMP5434 72
PCA example 1
(left) original data projected into full PC space; (right) original data reconstructed using only a single PC

COMP5434 73
PCA example 2

COMP5434 74
PCA on MNIST

COMP5434 75
PCA example 3
[Figure: original image (144-D), and reconstructions from 60-D, 16-D, and 6-D projections]

COMP5434 76
Steps in principal component analysis
§ Mean center the data
§ Compute covariance matrix S
§ Calculate eigenvalues and eigenvectors of S
§ Eigenvector with the largest eigenvalue λ_1 is the 1st principal
component (PC)
§ Eigenvector with the k-th largest eigenvalue λ_k is the k-th
principal component (PC)
§ λ_k / Σ_i λ_i = proportion of variance captured by the k-th PC

COMP5434 77
PCA Calculation Example
Consider the following design matrix, representing four sample points X_i ∈ ℝ²:

X = [ 8 10
      2  8
      6  4
      0  2 ]

We want to represent the data in only one dimension using principal
components analysis (PCA).
Compute the unit-length principal component directions of X, and state which
one the PCA algorithm would choose if you request just one principal component.

COMP5434 78
Solution
§ First center the data:

Ẋ = [  4  4
      −2  2
       2 −2
      −4 −4 ]

§ Compute covariance matrix:

ẊᵀẊ = [ 40 24
        24 40 ]

§ Find the eigenvalues and eigenvectors:

det(λE − ẊᵀẊ) = 0
| λ−40   −24 |
| −24   λ−40 | = 0
(λ − 40)² − 24² = 0
we can get λ₁ = 16, λ₂ = 64

COMP5434 79
Solution

§ Because λ₂ > λ₁, and we only use one principal component, λ₂ will be
chosen by the PCA algorithm.

Set λ = 64, and

(λE − ẊᵀẊ) P = 0
[  24 −24
  −24  24 ] P = 0

P = ( √2/2, √2/2 )ᵀ

COMP5434 80
Solution

[Figure: the original data, the centered data, and the PCA-compressed (1-D) data]

Original data X = [ 8 10; 2 8; 6 4; 0 2 ]
Centered data Ẋ = [ 4 4; −2 2; 2 −2; −4 −4 ]
Compressed data ẊP = ( 4√2, 0, 0, −4√2 )ᵀ
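The worked example above can be checked numerically with numpy: the eigenvalues of the centered scatter matrix should come out as 16 and 64, and PCA should pick the direction (√2/2, √2/2) (up to an arbitrary sign flip of the eigenvector).

```python
import numpy as np

X = np.array([[8.0, 10.0], [2.0, 8.0], [6.0, 4.0], [0.0, 2.0]])
Xc = X - X.mean(axis=0)                 # center the data
S = Xc.T @ Xc                           # [[40, 24], [24, 40]]
vals, vecs = np.linalg.eigh(S)          # eigh returns eigenvalues in ascending order
pc1 = vecs[:, np.argmax(vals)]          # eigenvector of the largest eigenvalue
projected = Xc @ pc1                    # 1-D PCA representation: ±(4√2, 0, 0, −4√2)
print(sorted(vals), pc1, projected)
```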

COMP5434 81
PCA: Choosing the dimension K
§ Calculate the covariance matrix of data S
§ Calculate the eigen-vectors/eigen-values of S
§ Rank the eigen-values in decreasing order
§ Select eigen-vectors that retain a fixed percentage of the variance
(e.g., 80%: the smallest d such that (Σ_{i=1..d} λ_i) / (Σ_i λ_i) ≥ 80%)

COMP5434 82
PCA: a useful preprocessing step
§ Helps reduce computational complexity
§ Can help supervised learning
§ Reduced dimension ⇒ simpler hypothesis space
§ Smaller VC (Vapnik–Chervonenkis) dimension ⇒ less risk
of overfitting
§ PCA can also be seen as noise reduction
§ Caveats:
§ Fails when data consists of multiple separate clusters
§ Directions of greatest variance may not be most
informative (i.e. greatest classification power)

COMP5434 83
Further Reading
§ CAP theorem
§ https://towardsdatascience.com/cap-theorem-and-distributed-database-
management-systems-5c2be977950e
§ NoSQL
§ http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-
nosql-for.html
§ http://horicky.blogspot.com/2010/10/bigtable-model-with-cassandra-and-hbase.html
§ http://faculty.washington.edu/wlloyd/courses/tcss562/papers/Spring2017/team7_N
OSQL_DB/Survey%20on%20NoSQL%20Database.pdf
§ PCA
§ http://scienceai.github.io/tsne-js/
§ https://cs.stanford.edu/people/karpathy/tsnejs/csvdemo.html
§ https://www.keboola.com/blog/pca-machine-learning
§ https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
§ https://blog.umetrics.com/what-is-principal-component-analysis-pca-and-how-it-is-
used

COMP5434 84
Thank you !

Song Guo

The Hong Kong Polytechnic University


Email: [email protected]

COMP5434 85
