
Natural Language Processing

03: Vector Space Models

1. Vector Space Models – Introduction
2. Euclidean Distance and Cosine Similarity
3. Principal Component Analysis – PCA Algorithm
4. Linear Algebra in Python with NumPy
5. Manipulating Word Embeddings
6. Implementing PCA

Dr. Imran Ihsan
Ph.D. in Knowledge Engineering
Associate Professor
SE454 – Natural Language Processing

03-01
Vector Space Models – Introduction

Why Learn Vector Space Models?

"Where are you heading?" vs. "Where are you from?": similar words, different meaning.
"What is your age?" vs. "How old are you?": different words, same meaning.

Vector space models let us compare questions and documents by meaning rather than by the exact words they share.


Vector Space Models Applications


"You eat cereal from a bowl": the words "cereal" and "bowl" are related even though they are not adjacent.
"You buy something and someone else sells it": "buy" and "sell" describe the same transaction from opposite perspectives.

Vector space representations capture these relationships, which makes them useful for Information Extraction, Machine Translation, and Chatbots.


Fundamental Concept

"You shall know a word by the company it keeps" Firth, 1957

Firth, J.R. 1957:11


Word by Word Design

Count the number of times two words occur together within a certain distance k.

Corpus (k = 2):
  I like simple data
  I prefer simple raw data

Co-occurrence counts for the word "data":

         simple   raw   like   I
  data      2      1      1    0
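A minimal sketch of how such a co-occurrence row can be computed; the corpus, the target word, and the helper name cooccurrence_row are illustrative, not part of the slides.

from collections import Counter

def cooccurrence_row(sentences, target, k=2):
    """Count how often each word appears within distance k of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            # Words up to k positions before and after the target
            window = tokens[max(0, i - k):i] + tokens[i + 1:i + k + 1]
            counts.update(w for w in window if w != target)
    return counts

corpus = ["I like simple data", "I prefer simple raw data"]
print(cooccurrence_row(corpus, "data", k=2))
# Counter({'simple': 2, 'raw': 1, 'like': 1})  ('i' never falls inside the window, so its count is 0)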


Word by Document Design

Count the number of times a word occurs in documents of a certain category.

Corpus categories: Entertainment, Economy, Machine Learning

         Entertainment   Economy   Machine Learning
  data        500          6620          9320
  film       7000          4000          1000


Vector Space
The same table can be read as vectors: each word is a vector over the categories, and each corpus is a vector over the words.

         Entertainment   Economy   Machine Learning
  data        500          6620          9320
  film       7000          4000          1000

[Plot: the Entertainment, Economy, and Machine Learning corpora as points in the (data, film) plane]

Measures of "similarity" between vectors: the angle between them and the distance between them.


03-02
Euclidean Distance and Cosine Similarity

Euclidean Distance

Corpus A (Entertainment): (500, 7000)
Corpus B (Machine Learning): (9320, 1000)

How far apart are the two corpora in the (data, film) plane?


Euclidean Distance

Corpus A (Entertainment): (500, 7000)
Corpus B (Machine Learning): (9320, 1000)

By the Pythagorean theorem ($c^2 = a^2 + b^2$):

$d(B, A) = \sqrt{(B_1 - A_1)^2 + (B_2 - A_2)^2} = \sqrt{(-8820)^2 + 6000^2} \approx 10667$


Euclidean Distance for n-dimensional Vectors


Word-by-word counts:

            Data   Boba Tea   Ice-Cream
  AI          6        0          1
  Drinks      0        4          6
  Food        0        6          8

With $\vec{v}$ = Ice-Cream = (1, 6, 8) and $\vec{w}$ = Boba Tea = (0, 4, 6):

$d(\vec{v}, \vec{w}) = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2}$, the norm of $\vec{v} - \vec{w}$

$d(\vec{v}, \vec{w}) = \sqrt{(1-0)^2 + (6-4)^2 + (8-6)^2} = \sqrt{1 + 4 + 4} = \sqrt{9} = 3$


Euclidean Distance in Python


import numpy as np

# Create numpy vectors v and w
v = np.array([1, 6, 8])
w = np.array([0, 4, 6])

# Calculate the Euclidean distance d as the norm of v - w
d = np.linalg.norm(v - w)

# Print the result
print("The Euclidean Distance between v and w is: ", d)

The Euclidean Distance between v and w is:  3.0


Euclidean Distance vs Cosine Similarity


Corpora as vectors in the (disease, eggs) plane:
  Agriculture Corpus: (20, 40)
  History Corpus: (30, 20)
  Food Corpus: (5, 15)

Euclidean distance: d2 < d1, so by distance the Agriculture corpus looks closer to the History corpus than to the Food corpus.
Angle comparison: β > α, so by angle the Agriculture corpus points in nearly the same direction as the Food corpus.

When corpora differ in size, the cosine of the angle between the vectors is a more reliable measure of similarity than the Euclidean distance.
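A quick numeric check of this comparison with NumPy; the helper names euclidean and cosine are only illustrative:

import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

agriculture = np.array([20, 40])
history = np.array([30, 20])
food = np.array([5, 15])

print(euclidean(agriculture, history))  # ~22.4  -> smaller distance (d2)
print(euclidean(agriculture, food))     # ~29.2  -> larger distance (d1)
print(cosine(agriculture, history))     # ~0.87
print(cosine(agriculture, food))        # ~0.99  -> larger cosine, smaller angle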

Cosine Similarity
With $\vec{v}$ = Agriculture Corpus (20, 40), $\vec{w}$ = History Corpus (30, 20), and $\beta$ the angle between them:

$\vec{v} \cdot \vec{w} = \|\vec{v}\| \, \|\vec{w}\| \cos\beta$

$\cos\beta = \frac{\vec{v} \cdot \vec{w}}{\|\vec{v}\| \, \|\vec{w}\|} = \frac{(20 \cdot 30) + (40 \cdot 20)}{\sqrt{20^2 + 40^2} \cdot \sqrt{30^2 + 20^2}} = 0.87$

Vector norm: $\|\vec{v}\| = \sqrt{\sum_{i=1}^{n} v_i^2}$

Dot product: $\vec{v} \cdot \vec{w} = \sum_{i=1}^{n} v_i \, w_i$

Cosine Similarity

Two extreme cases:

  β = 90°, cos β = 0: the vectors are orthogonal (maximally dissimilar).
  β = 0°,  cos β = 1: the vectors point in the same direction (maximally similar).

For word-count vectors, whose components are non-negative, the cosine similarity therefore ranges from 0 (unrelated) to 1 (same direction).

03-03
Principal Component Analysis – PCA

Manipulating Words in Vector Spaces

Known relationship: USA → Washington DC. Can the same vector relationship be used to find the capital of Russia?


Manipulating Words in Vector Spaces

Toy 2-D word vectors:

  USA (5, 6)        Washington (10, 5)
  Russia (5, 5)     Moscow (9, 3)
  Japan (4, 3)      Tokyo (8.5, 2)
  Turkey (3, 1)     Ankara (0.5, 0.9)

Washington − USA = [5, −1]
Russia + [5, −1] = [10, 4], which lands next to Moscow (9, 3) in the plot.
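A minimal sketch of this analogy with the toy vectors above; excluding the three query words from the candidate set is an assumption borrowed from standard analogy evaluation, not something stated on the slide:

import numpy as np

# Toy 2-D "embeddings" from the slide
vectors = {
    'USA': np.array([5.0, 6.0]),      'Washington': np.array([10.0, 5.0]),
    'Russia': np.array([5.0, 5.0]),   'Moscow': np.array([9.0, 3.0]),
    'Japan': np.array([4.0, 3.0]),    'Tokyo': np.array([8.5, 2.0]),
    'Turkey': np.array([3.0, 1.0]),   'Ankara': np.array([0.5, 0.9]),
}

# Country-to-capital relationship
offset = vectors['Washington'] - vectors['USA']   # [5, -1]

# Apply the same offset to Russia; the result [10, 4] is not an existing word vector
prediction = vectors['Russia'] + offset

# Pick the closest remaining vector (query words excluded, as in standard analogy tasks)
candidates = {w: v for w, v in vectors.items() if w not in ('USA', 'Washington', 'Russia')}
closest = min(candidates, key=lambda w: np.linalg.norm(candidates[w] - prediction))
print(closest)  # Moscow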


Visualization of Word Vectors

Word embeddings typically have dimension d > 2:

  Oil    0.20  …  0.10
  Gas    2.10  …  3.40
  City   9.30  …  52.1
  Town   6.20  …  34.3

Intuitively, Oil & Gas should be related, and Town & City should be related. How can you visualize whether your representation captures these relationships?



Visualization of Word Vectors

PCA projects the d > 2 embeddings down to d = 2 so they can be plotted:

  d > 2                          d = 2 (after PCA)
  Oil    0.20  …  0.10           Oil    2.30  21.2
  Gas    2.10  …  3.40           Gas    1.56  19.3
  City   9.30  …  52.1           City   13.4  34.1
  Town   6.20  …  34.3           Town   15.6  29.8
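A minimal sketch of this reduction with scikit-learn's PCA (the same class used in section 03-06); it assumes the vec() helper from section 03-05, and the printed coordinates will differ from the illustrative numbers in the table:

import numpy as np
from sklearn.decomposition import PCA

words = ['oil', 'gas', 'city', 'town']
X = np.array([vec(w) for w in words])   # one d-dimensional embedding per row

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)               # shape (4, 2): one (x, y) point per word

for word, (x, y) in zip(words, X2):
    print(word, x, y)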


Principal Component Analysis


PCA transforms the original features into uncorrelated features and keeps only the most informative ones (dimensionality reduction).


Principal Component Analysis

Original Space → Uncorrelated Features → Dimensionality Reduction


PCA Algorithm

Eigenvectors: the uncorrelated features (directions) of your data.
Eigenvalues: the amount of information (variance) retained by each new feature.


PCA Algorithm

1. Mean-normalize the data: $x_i = \frac{x_i - \mu_{x_i}}{\sigma_{x_i}}$
2. Compute the covariance matrix $\Sigma$.
3. Perform Singular Value Decomposition, $SVD(\Sigma)$, to obtain the eigenvectors $U$ and eigenvalues $S$.


PCA Algorithm

4. Project the data with a dot product onto the first two eigenvectors: $X' = X \, U[:, 0:2]$
5. Percentage of retained variance: $\frac{\sum_{i=0}^{1} S_{ii}}{\sum_{j=0}^{d} S_{jj}}$
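A minimal NumPy sketch of these steps, assuming X holds one observation per row; pca_project is an illustrative name, not a library function:

import numpy as np

def pca_project(X, n_components=2):
    # 1. Mean-normalize each column
    X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the features
    cov = np.cov(X_norm, rowvar=False)
    # 3. SVD of the covariance matrix: columns of U are eigenvectors, S holds eigenvalues
    U, S, _ = np.linalg.svd(cov)
    # 4. Project the data onto the first n_components eigenvectors
    X_proj = np.dot(X_norm, U[:, :n_components])
    # 5. Fraction of variance retained by the kept components
    retained = S[:n_components].sum() / S.sum()
    return X_proj, retained

# Example with random 10-dimensional data
X = np.random.randn(100, 10)
X2, retained = pca_project(X, n_components=2)
print(X2.shape, retained)   # (100, 2) and the retained-variance fraction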


Summary

Eigenvectors give the directions of the uncorrelated features.
Eigenvalues are the variance of the new features.
The dot product gives the projection onto the uncorrelated features.


03-04
Linear Algebra in Python with NumPy

Lists and NumPy Arrays


import numpy as np # The Swiss army knife of the data scientist

alist = [1, 2, 3, 4, 5]          # Define a python list. It looks like an np array
narray = np.array([1, 2, 3, 4])  # Define a numpy array

print(alist)
print(narray)
print(type(alist))
print(type(narray))

[1, 2, 3, 4, 5]
[1 2 3 4]
<class 'list'>
<class 'numpy.ndarray'>

Algebraic operators on NumPy arrays vs. Python lists:

print(narray + narray)  # element-wise addition
print(alist + alist)    # list concatenation

[2 4 6 8]
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

print(narray * 3)       # element-wise multiplication
print(alist * 3)        # list repetition

[ 3  6  9 12]
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

Matrix or Array of Arrays


# Matrix initialized with NumPy arrays
npmatrix1 = np.array([narray, narray, narray])
# Matrix initialized with lists
npmatrix2 = np.array([alist, alist, alist])
# Matrix initialized with both types
npmatrix3 = np.array([narray, [1, 1, 1, 1], narray])

print(npmatrix1)
print(npmatrix2)
print(npmatrix3)

[[1 2 3 4]      [[1 2 3 4 5]      [[1 2 3 4]
 [1 2 3 4]       [1 2 3 4 5]       [1 1 1 1]
 [1 2 3 4]]      [1 2 3 4 5]]      [1 2 3 4]]

Example 01
okmatrix = np.array([[1, 2], [3, 4]]) # Define a 2x2 matrix
print(okmatrix)     # Print okmatrix
print(okmatrix * 2) # Print a scaled version of okmatrix

[[1 2]     [[2 4]
 [3 4]]     [6 8]]

Example 02
# Define a matrix. Note the third row contains 3 elements
badmatrix = np.array([[1, 2], [3, 4], [5, 6, 7]])
print(badmatrix)     # Print the malformed matrix
print(badmatrix * 2) # It is supposed to scale the whole matrix

[list([1, 2]) list([3, 4]) list([5, 6, 7])]
[list([1, 2, 1, 2]) list([3, 4, 3, 4]) list([5, 6, 7, 5, 6, 7])]


Scaling and Translating Matrices


# Scale by 2 and translate 1 unit
result = okmatrix * 2 + 1  # For each element in the matrix, multiply by 2 and add 1
print(result)

[[3 5]
 [7 9]]

# Add two sum-compatible matrices
result1 = okmatrix + okmatrix
print(result1)

[[2 4]
 [6 8]]

# Subtract two sum-compatible matrices. This is called the difference vector
result2 = okmatrix - okmatrix
print(result2)

[[0 0]
 [0 0]]

# Element-wise multiplication: multiply each element by itself
result = okmatrix * okmatrix
print(result)

[[ 1  4]
 [ 9 16]]


Transpose a Matrix
# Define a 3x2 matrix
matrix3x2 = np.array([[1, 2], [3, 4], [5, 6]])
print('Original matrix 3 x 2')
print(matrix3x2)
print('Transposed matrix 2 x 3')
print(matrix3x2.T)

Original matrix 3 x 2
[[1 2]
 [3 4]
 [5 6]]
Transposed matrix 2 x 3
[[1 3 5]
 [2 4 6]]

# The transpose operation does not affect 1-D arrays
nparray = np.array([1, 2, 3, 4]) # Define an array
print('Original array')
print(nparray)
print('Transposed array')
print(nparray.T)

Original array
[1 2 3 4]
Transposed array
[1 2 3 4]

# Define a 1 x 4 matrix. Note the 2 levels of square brackets
nparray = np.array([[1, 2, 3, 4]])
print('Original array')
print(nparray)
print('Transposed array')
print(nparray.T)

Original array
[[1 2 3 4]]
Transposed array
[[1]
 [2]
 [3]
 [4]]


Norm of a Matrix
nparray1 = np.array([1, 2, 3, 4]) # Define an array
norm1 = np.linalg.norm(nparray1)

# Define a 2 x 2 matrix. Note the 2 levels of square brackets
nparray2 = np.array([[1, 2], [3, 4]])
norm2 = np.linalg.norm(nparray2)

print(norm1)
print(norm2)

5.477225575051661
5.477225575051661

# Define a 3 x 2 matrix
nparray2 = np.array([[1, 1], [2, 2], [3, 3]])

# Get the norm for each column. Returns 2 elements
normByCols = np.linalg.norm(nparray2, axis=0)
# Get the norm for each row. Returns 3 elements
normByRows = np.linalg.norm(nparray2, axis=1)

print(normByCols)
print(normByRows)

[3.74165739 3.74165739]
[1.41421356 2.82842712 4.24264069]

The dot Product between Arrays: All the Flavors


nparray1 = np.array([0, 1, 2, 3]) # Define an array
nparray2 = np.array([4, 5, 6, 7]) # Define an array

flavor1 = np.dot(nparray1, nparray2) # Recommended way
print(flavor1)                        # 38

flavor2 = np.sum(nparray1 * nparray2) # Ok way
print(flavor2)                         # 38

flavor3 = nparray1 @ nparray2 # Geeks way
print(flavor3)                # 38

# As you never should do: # Noobs way
flavor4 = 0
for a, b in zip(nparray1, nparray2):
    flavor4 += a * b
print(flavor4)  # 38

We strongly recommend using np.dot, since it is the only method that accepts arrays and lists without problems.


Sums by Rows or Columns


nparray2 = np.array([[1, -1], [2, -2], [3, -3]]) # Define a 3 x 2 matrix.

sumByCols = np.sum(nparray2, axis=0) # Get the sum for each column. Returns 2 elements
sumByRows = np.sum(nparray2, axis=1) # get the sum for each row. Returns 3 elements

print('Sum by columns: ')


print(sumByCols)
print('Sum by rows:')
print(sumByRows)

Sum by columns: [ 6 -6]


Sum by rows: [0 0 0]


Mean by Rows or Columns


nparray2 = np.array([[1, -1], [2, -2], [3, -3]]) # Define a 3 x 2 matrix. Chosen to be a matrix with 0 mean

mean = np.mean(nparray2) # Get the mean for the whole matrix


meanByCols = np.mean(nparray2, axis=0) # Get the mean for each column. Returns 2 elements
meanByRows = np.mean(nparray2, axis=1) # Get the mean for each row. Returns 3 elements

print('Matrix mean: ')


print(mean) Matrix mean: 0.0

print('Mean by columns: ')


print(meanByCols) Mean by columns: [ 2. -2.]

print('Mean by rows:')
print(meanByRows) Mean by rows: [0. 0. 0.]

Center the Columns of a Matrix


# Define a 3 x 2 matrix
nparray2 = np.array([[1, 1], [2, 2], [3, 3]])

# Remove the mean for each column
nparrayCentered = nparray2 - np.mean(nparray2, axis=0)

print('Original matrix')
print(nparray2)
print('Centered by columns matrix')
print(nparrayCentered)
print('New mean by column')
print(nparrayCentered.mean(axis=0))

Original matrix
[[1 1]
 [2 2]
 [3 3]]
Centered by columns matrix
[[-1. -1.]
 [ 0.  0.]
 [ 1.  1.]]
New mean by column
[0. 0.]


Center the Rows of a Matrix


# Define a 3 x 2 matrix
nparray2 = np.array([[1, 3], [2, 4], [3, 5]])

# Remove the mean for each row: transpose, subtract the row means, then transpose back
nparrayCentered = nparray2.T - np.mean(nparray2, axis=1)
nparrayCentered = nparrayCentered.T # Transpose back the result

print('Original matrix')
print(nparray2)
print('Centered by rows matrix')
print(nparrayCentered)
print('New mean by rows')
print(nparrayCentered.mean(axis=1))

Original matrix
[[1 3]
 [2 4]
 [3 5]]
Centered by rows matrix
[[-1.  1.]
 [-1.  1.]
 [-1.  1.]]
New mean by rows
[0. 0. 0.]

Warning: direct subtraction only works for column centering. For row centering, transpose the matrix, center by columns, and then transpose back the result, as shown above.

Mean Function
nparray2 = np.array([[1, 3], [2, 4], [3, 5]]) # Define a 3 x 2 matrix.

mean1 = np.mean(nparray2) # Static way


mean2 = nparray2.mean() # Dynamic way

print(mean1, ' == ', mean2)

3.0 == 3.0


03-05
Manipulating Word Embeddings

Pre-trained Word Embedding


import pandas as pd # Library for Dataframes
import numpy as np # Library for math functions
import pickle # Python object serialization library. Not secure

word_embeddings = pickle.load( open( "word_embeddings_subset.p", "rb" ) )


len(word_embeddings) # there should be 243 words that will be used in this assignment

243



Word Embedding is a Dictionary


countryVector = word_embeddings['country'] # Get the vector representation for the word 'country'
print(type(countryVector)) # Print the type of the vector. Note it is a NumPy array
print(countryVector) # Print the values of the vector.

#Get the vector for a given word:


def vec(w):
return word_embeddings[w]


Operating on Word Embeddings


import matplotlib.pyplot as plt # Import matplotlib
words = ['oil', 'gas', 'happy', 'sad', 'city', 'town', 'village', 'country', 'continent', 'petroleum', 'joyful']
bag2d = np.array([vec(word) for word in words]) # Convert each word to its vector representation

fig, ax = plt.subplots(figsize = (10, 10)) # Create custom size image


col1 = 3 # Select the column for the x axis
col2 = 2 # Select the column for the y axis

for word in bag2d: # Print an arrow for each word


ax.arrow(0, 0, word[col1], word[col2], head_width=0.005, head_length=0.005, fc='r', ec='r', width = 1e-5)

ax.scatter(bag2d[:, col1], bag2d[:, col2]); # Plot a dot for each word


for i in range(0, len(words)): # Add the word label over each dot in the scatter plot
ax.annotate(words[i], (bag2d[i, col1], bag2d[i, col2]))

plt.show()

Operating on Word Embeddings

[Resulting plot: each selected word drawn as an arrow from the origin in the two chosen embedding dimensions]

Word Distance
words = ['sad', 'happy', 'town', 'village']
bag2d = np.array([vec(word) for word in words]) # Convert each word to its vector representation

fig, ax = plt.subplots(figsize = (10, 10)) # Create custom size image
col1 = 3 # Select the column for the x axis
col2 = 2 # Select the column for the y axis

# Draw an arrow for each word
for word in bag2d:
    ax.arrow(0, 0, word[col1], word[col2], head_width=0.0005, head_length=0.0005, fc='r', ec='r', width = 1e-5)

# Draw the vector difference between village and town
village = vec('village')
town = vec('town')
diff = town - village
ax.arrow(village[col1], village[col2], diff[col1], diff[col2], fc='b', ec='b', width = 1e-5)

# Draw the vector difference between sad and happy
sad = vec('sad')
happy = vec('happy')
diff = happy - sad
ax.arrow(sad[col1], sad[col2], diff[col1], diff[col2], fc='b', ec='b', width = 1e-5)

ax.scatter(bag2d[:, col1], bag2d[:, col2]); # Plot a dot for each word

# Add the word label over each dot in the scatter plot
for i in range(0, len(words)):
    ax.annotate(words[i], (bag2d[i, col1], bag2d[i, col2]))

plt.show()

Word Distance

[Resulting plot: red arrows for each word vector and blue arrows for the difference vectors]

Linear Algebra on Word Embeddings


Norm
print(np.linalg.norm(vec('town'))) # Print the norm of the word town
print(np.linalg.norm(vec('sad'))) # Print the norm of the word sad
2.3858097
2.9004838

Predicting Capitals
capital = vec('France') - vec('Paris')
country = vec('Madrid') + capital
print(country[0:5]) # Print the first 5 values of the vector
[-0.02905273 -0.2475586 0.53952026 0.20581055 -0.14862823]

diff = country - vec('Spain')


print(diff[0:10])
[-0.06054688 -0.06494141 0.37643433 0.08129883 -0.13007355
-0.00952148 -0.03417969 -0.00708008 0.09790039 -0.01867676 ]

Find Closest Word


# Create a dataframe out of the dictionary embedding. This facilitates the algebraic operations
keys = word_embeddings.keys()
data = []
for key in keys:
    data.append(word_embeddings[key])

embedding = pd.DataFrame(data=data, index=keys)

# Define a function to find the closest word to a vector:
def find_closest_word(v, k=1):
    diff = embedding.values - v  # Calculate the vector difference from each word to the input vector
    delta = np.sum(diff * diff, axis=1)  # Squared Euclidean distance from each word to the input vector
    i = np.argmin(delta)  # Find the index of the minimum distance in the array
    return embedding.iloc[i].name  # Return the row name for this item


Find Closest Word


embedding.head(10) # Print some rows of the embedding as a Dataframe

find_closest_word(country)
'Spain'


Predicting Other Countries


find_closest_word(vec('Italy') - vec('Rome') + vec('Madrid'))
Spain

print(find_closest_word(vec('Berlin') + capital))
print(find_closest_word(vec('Beijing') + capital))
Germany
China

print(find_closest_word(vec('Lisbon') + capital))
Lisbon


Represent a Sentence as a Vector


doc = "Spain petroleum city king"
vdoc = [vec(x) for x in doc.split(" ")]
doc2vec = np.sum(vdoc, axis = 0)
doc2vec

find_closest_word(doc2vec)
petroleum


03-06
Implementing PCA

Import Libraries
import numpy as np # Linear algebra library
import matplotlib.pyplot as plt # library for visualization
from sklearn.decomposition import PCA # PCA library
import pandas as pd # Data frame library
import math # Library for math functions
import random # Library for pseudo random numbers


Understanding PCA
n = 1  # The amount of correlation
# Generate 1000 samples from a uniform random variable
x = np.random.uniform(1, 2, 1000)
y = x.copy() * n  # Make y = n * x

# PCA works better if the data is centered
x = x - np.mean(x)  # Center x. Remove its mean
y = y - np.mean(y)  # Center y. Remove its mean

data = pd.DataFrame({'x': x, 'y': y})  # Create a data frame with x & y
plt.scatter(data.x, data.y)  # Plot the original correlated data in blue

# Instantiate a PCA. Choose to get 2 output variables
pca = PCA(n_components=2)

# Create the transformation model for this data. Internally, it gets the rotation matrix and the explained variance
pcaTr = pca.fit(data)

# Transform the data based on the rotation matrix of pcaTr
rotatedData = pcaTr.transform(data)

# Create a data frame with the new variables. We call these new variables PC1 and PC2
dataPCA = pd.DataFrame(data=rotatedData, columns=['PC1', 'PC2'])

# Plot the transformed data in orange
plt.scatter(dataPCA.PC1, dataPCA.PC2)
plt.show()


Understanding PCA

[Resulting plot: the original correlated points (blue) lie along a diagonal line; the PCA-transformed points (orange) lie along the horizontal PC1 axis]

Transformation Model pcaTr


print('Eigenvectors or principal component: First row must be in the direction of [1, n]')
print(pcaTr.components_)

print( )
print('Eigenvalues or explained variance')
print(pcaTr.explained_variance_)

Eigenvectors or principal component: First row must be in the direction of [1, n]


[ [ 0.70710678 0.70710678]
[-0.70710678 0.70710678] ]

Eigenvalues or explained variance


[1.62696473e-01 1.05938412e-33]


Correlated Normal Random Variables


import matplotlib.lines as mlines
import matplotlib.transforms as mtransforms

random.seed(100)

std1 = 1      # The desired standard deviation of our first random variable
std2 = 0.333  # The desired standard deviation of our second random variable

x = np.random.normal(0, std1, 1000)  # Get 1000 samples from x ~ N(0, std1)
y = np.random.normal(0, std2, 1000)  # Get 1000 samples from y ~ N(0, std2)
#y = y + np.random.normal(0, 1, 1000) * noiseLevel * np.sin(0.78)

# PCA works better if the data is centered
x = x - np.mean(x)  # Center x
y = y - np.mean(y)  # Center y

# Define a pair of dependent variables with a desired amount of covariance
n = 1  # Magnitude of covariance
angle = np.arctan(1 / n)  # Convert the covariance to an angle
print('angle: ', angle * 180 / math.pi)

# Create a rotation matrix using the given angle
rotationMatrix = np.array([[np.cos(angle), np.sin(angle)],
                           [-np.sin(angle), np.cos(angle)]])

print('rotationMatrix')
print(rotationMatrix)

xy = np.concatenate(([x], [y]), axis=0).T  # Create a matrix with columns x & y

# Transform the data using the rotation matrix. It correlates the two variables
data = np.dot(xy, rotationMatrix)  # Return a 2-D array


Correlated Normal Random Variables


# Print the rotated data
plt.scatter(data[:,0], data[:,1])
plt.show( )

angle: 45.0
rotationMatrix
[[ 0.70710678 0.70710678]
[-0.70710678 0.70710678]]


Correlated Normal Random Variables


plt.scatter(data[:, 0], data[:, 1])  # Plot the original data in blue

# Apply PCA. In theory, the eigenvector matrix must be the inverse of the original rotationMatrix.
pca = PCA(n_components=2)  # Instantiate a PCA. Choose to get 2 output variables

# Create the transformation model for this data. Internally it gets the rotation matrix and the explained variance
pcaTr = pca.fit(data)

# Create an array with the transformed data
dataPCA = pcaTr.transform(data)

print('Eigenvectors or principal component: First row must be in the direction of [1, n]')
print(pcaTr.components_)

print()
print('Eigenvalues or explained variance')
print(pcaTr.explained_variance_)

# Plot the transformed data
plt.scatter(dataPCA[:, 0], dataPCA[:, 1])

# Plot the 1st component axis. Use the explained variance to scale the vector
plt.plot([0, rotationMatrix[0][0] * std1 * 3], [0, rotationMatrix[0][1] * std1 * 3], 'k-', color='red')

# Plot the 2nd component axis. Use the explained variance to scale the vector
plt.plot([0, rotationMatrix[1][0] * std2 * 3], [0, rotationMatrix[1][1] * std2 * 3], 'k-', color='green')

plt.show()

Correlated Normal Random Variables


Eigenvectors or principal component: First row must be in the direction of [1, n]
[[ 0.71633871 0.69775271]
[-0.69775271 0.71633871]]

Eigenvalues or explained variance


[0.99330378 0.10927556]
<ipython-input-5-1fada4e9fc41>:25: UserWarning: color is redundantly defined by the 'color' keyword argument and the fmt string "k-" (-> color='k'). The keyword argument will take precedence.
  plt.plot([0, rotationMatrix[0][0] * std1 * 3], [0, rotationMatrix[0][1] * std1 * 3], 'k-', color='red')
<ipython-input-5-1fada4e9fc41>:27: UserWarning: color is redundantly defined by the 'color' keyword argument and the fmt string "k-" (-> color='k'). The keyword argument will take precedence.
  plt.plot([0, rotationMatrix[1][0] * std2 * 3], [0, rotationMatrix[1][1] * std2 * 3], 'k-', color='green')


Dimensionality Reduction
nPoints = len(data)

# Plot the original data in blue


plt.scatter(data[:,0], data[:,1])

#Plot the projection along the first component in orange


plt.scatter(data[:,0], np.zeros(nPoints))

#Plot the projection along the second component in green


plt.scatter(np.zeros(nPoints), data[:,1])

plt.show()
