NLP03 Vector Space Models
03-01
Vector Space Models – Introduction
Fundamental Concept
[Figure: the fundamental concept of a vector space model. Each corpus (Entertainment, Economy, Machine Learning) is represented by how often words occur in it: the word "data" appears 500 times in the Entertainment corpus, 6,620 times in the Economy corpus, and 9,320 times in the Machine Learning corpus, so every corpus becomes a point in a vector space whose axes are word counts (0-10,000 scale). A smaller row in the figure shows the word "data" with counts 2, 1, 1, 0.]
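As a toy sketch of this counting idea (the sentences and vocabulary below are invented for illustration, not taken from the slides), each corpus can be turned into a vector of word counts:

from collections import Counter

# Invented miniature corpora, already tokenized
corpora = {
    "Entertainment": "the film was fun and the data in the film was thin".split(),
    "Economy":       "data on growth and more data on trade".split(),
}
vocab = ["data", "film"]  # the words we count

# One count vector per corpus: [count("data"), count("film")]
vectors = {name: [Counter(tokens)[w] for w in vocab] for name, tokens in corpora.items()}
print(vectors)  # {'Entertainment': [1, 2], 'Economy': [2, 0]}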
03-02
Euclidean Distance and Cosine Similarity
Euclidean Distance
Corpus A (Entertainment): (500, 7000)    Corpus B (Machine Learning): (9320, 1000)

[Figure: the two corpora plotted as points on word-count axes ("data" on the x-axis, "film" on the y-axis, 0-10,000 scale), with the straight line between them.]

The Euclidean distance is the length of that straight line (Pythagoras: $c^2 = a^2 + b^2$):

$$d(B, A) = \sqrt{(B_1 - A_1)^2 + (B_2 - A_2)^2} = \sqrt{(-8820)^2 + 6000^2} \approx 10667$$

For vectors with $n$ components, the distance is the norm of the difference vector:

$$d(\vec{v}, \vec{w}) = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2} = \lVert \vec{v} - \vec{w} \rVert$$

[Table fragment: an n-dimensional example in which a Food corpus has word counts (0, 6, 8); distances between such higher-dimensional count vectors use the same formula.]
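A one-line NumPy check of this distance, using the corpus points from the example above:

import numpy as np

A = np.array([500, 7000])   # Corpus A (Entertainment): counts of ("data", "film")
B = np.array([9320, 1000])  # Corpus B (Machine Learning)

d = np.linalg.norm(B - A)   # Euclidean distance = norm of the difference vector
print(d)                    # ~10667.35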
Cosine Similarity
Cosine similarity measures the angle $\beta$ between two corpus vectors instead of the distance between their endpoints. In the figure, the Agriculture corpus is at $\vec{v} = (20, 40)$ and a second corpus at $\vec{w} = (30, 20)$, plotted on word-count axes ("disease" on the x-axis, "eggs" on the y-axis):

$$\vec{v} \cdot \vec{w} = \lVert\vec{v}\rVert\,\lVert\vec{w}\rVert \cos\beta
\quad\Rightarrow\quad
\cos\beta = \frac{\vec{v} \cdot \vec{w}}{\lVert\vec{v}\rVert\,\lVert\vec{w}\rVert} = \frac{20 \cdot 30 + 40 \cdot 20}{\lVert\vec{v}\rVert\,\lVert\vec{w}\rVert}$$

Vector norm: $\lVert\vec{v}\rVert = \sqrt{\sum_{i=1}^{n} v_i^2}$    Dot product: $\vec{v} \cdot \vec{w} = \sum_{i=1}^{n} v_i\, w_i$
Cosine Similarity
[Figure: two extreme cases for the angle β between corpus vectors, plotted on the "disease" (x) and "eggs" (y) axes.]

β = 0° gives cos β = 1: the vectors point in the same direction (similar).
β = 90° gives cos β = 0: the vectors are orthogonal (dissimilar).
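A minimal NumPy sketch of the cosine similarity between the two corpus vectors from the example above:

import numpy as np

v = np.array([20, 40])  # Agriculture corpus: counts of ("disease", "eggs")
w = np.array([30, 20])  # the second corpus in the figure

cos_beta = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
print(cos_beta)  # ~0.87: close to 1, so the corpora point in similar directions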
03-03
Principal Component Analysis – PCA
Known relationship: USA → Washington DC. Which word relates to Russia in the same way?

With 2-D word vectors USA = (5, 6), Washington = (10, 5), and Russia = (5, 5):

$$\text{Washington} - \text{USA} = [5, -1]$$
$$\text{Russia} + [5, -1] = [10, 4]$$

The predicted capital is the word whose vector lies closest to (10, 4), i.e. Moscow.

[Figure: countries and capitals plotted in the 2-D vector space; the original figure also shows other points such as Ankara (0.5, 0.9).]
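A tiny NumPy check of this vector arithmetic, using the 2-D coordinates from the example:

import numpy as np

usa = np.array([5, 6])
washington = np.array([10, 5])
russia = np.array([5, 5])

relation = washington - usa    # [5, -1]: the "country -> capital" offset
prediction = russia + relation # [10, 4]: the capital is the word closest to this point
print(relation, prediction)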
Word vectors typically have $d > 2$ dimensions, so they cannot be plotted directly. PCA reduces them to $d = 2$ while preserving as much structure as possible, so that related words (here Oil and Gas, City and Town) end up close together in the plot.

  d > 2 (original vectors)             d = 2 (after PCA)
  Oil   0.20  …  0.10                  Oil   2.30  21.2
  Gas   2.10  …  3.40      → PCA →     Gas   1.56  19.3
  City  9.30  …  52.1                  City  13.4  34.1
  Town  6.20  …  34.3                  Town  15.6  29.8
[Diagram: dimensionality reduction maps the features of the original space onto a smaller set of uncorrelated dimensions.]
PCA Algorithm

PCA is computed from the eigenvectors and eigenvalues of the data's covariance matrix.

1. Mean-normalize the data: $x_i = \dfrac{x_i - \mu_{x_i}}{\sigma_{x_i}}$
2. Compute the covariance matrix, then its eigenvectors $U$ and eigenvalues $S$.
3. Project the data onto the leading columns of $U$ (the first principal components).
4. Percentage of retained variance when keeping the first two components (eigenvalues of the kept components over the sum of all eigenvalues):
   $$\frac{\sum_{i=0}^{1} S_{ii}}{\sum_{j=0}^{d} S_{jj}}$$
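A minimal NumPy sketch of these steps (function and variable names are my own; the labs later in this section use sklearn's PCA instead):

import numpy as np

def pca_reduce(X, k=2):
    """Project the rows of X onto the first k principal components."""
    # 1. Mean-normalize each feature (column)
    X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix and its eigenvectors / eigenvalues
    cov = np.cov(X_norm, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # re-order: largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 3. Project the data onto the first k eigenvectors
    X_reduced = X_norm @ eigvecs[:, :k]
    # 4. Percentage of retained variance
    retained = eigvals[:k].sum() / eigvals.sum()
    return X_reduced, retained

X = np.random.rand(100, 10)        # 100 points in 10 dimensions
X2, retained = pca_reduce(X, k=2)
print(X2.shape, retained)          # (100, 2) and the fraction of variance kept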
Summary
03-04
Linear Algebra in Python with NumPy
A Python list and a NumPy array behave differently: multiplying a list by 3 repeats its elements, while multiplying an ndarray by 3 scales each element.

alist = [1, 2, 3, 4, 5]            # <class 'list'>
narray = np.array([1, 2, 3, 4])    # <class 'numpy.ndarray'>

print(alist * 3)    # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
print(narray * 3)   # [3 6 9 12]
# Subtract two sum-compatible matrices. This is called the difference vector
result2 = okmatrix - okmatrix
print(result2)

[[0 0]
 [0 0]]
Transpose a Matrix
# Define a 3 x 2 matrix
matrix3x2 = np.array([[1, 2], [3, 4], [5, 6]])
print('Original matrix 3 x 2')
print(matrix3x2)
print('Transposed matrix 2 x 3')
print(matrix3x2.T)

Original matrix 3 x 2
[[1 2]
 [3 4]
 [5 6]]
Transposed matrix 2 x 3
[[1 3 5]
 [2 4 6]]

nparray = np.array([1, 2, 3, 4]) # Define an array
print('Original array')
print(nparray)
print('Transposed array')
print(nparray.T) # Transposing a 1-D array has no effect

Original array
[1 2 3 4]
Transposed array
[1 2 3 4]

# Define a 1 x 4 matrix. Note the 2 levels of square brackets
nparray = np.array([[1, 2, 3, 4]])
print('Original array')
print(nparray)
print('Transposed array')
print(nparray.T)

Original array
[[1 2 3 4]]
Transposed array
[[1]
 [2]
 [3]
 [4]]
Norm of a Matrix
nparray1 = np.array([1, 2, 3, 4]) # Define an array
norm1 = np.linalg.norm(nparray1)
print(norm1)

5.477225575051661

nparray2 = np.array([[1, 1], [2, 2], [3, 3]]) # Define a 3 x 2 matrix
normByCols = np.linalg.norm(nparray2, axis=0) # Norm of each column. Returns 2 elements
normByRows = np.linalg.norm(nparray2, axis=1) # Norm of each row. Returns 3 elements
print(normByCols)
print(normByRows)

[3.74165739 3.74165739]
[1.41421356 2.82842712 4.24264069]
sumByCols = np.sum(nparray2, axis=0) # Get the sum for each column. Returns 2 elements
sumByRows = np.sum(nparray2, axis=1) # Get the sum for each row. Returns 3 elements
print(sumByCols)
print(sumByRows)
# meanByRows comes from the preceding (omitted) step, in which each row of the
# matrix was centered by subtracting its own mean, so every row mean is now zero.
print('Mean by rows:')
print(meanByRows)

Mean by rows: [0. 0. 0.]
Mean Function
nparray2 = np.array([[1, 3], [2, 4], [3, 5]]) # Define a 3 x 2 matrix.
# np.mean and the manual calculation (sum of elements / number of elements) agree:
print(np.mean(nparray2), '==', nparray2.sum() / nparray2.size)  # 3.0 == 3.0
03-05
Manipulating Word Embeddings
The lab works with a subset of pre-trained word embeddings stored in the file word_embeddings_subset.p; the subset contains 243 words. The embeddings are visualized with matplotlib (plt.show()).
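A minimal sketch of loading that file, assuming (as the lab does) that it is a Python pickle containing a dictionary from words to NumPy vectors; the helper vec used in the snippets below simply looks a word up in that dictionary:

import pickle

with open('word_embeddings_subset.p', 'rb') as f:
    word_embeddings = pickle.load(f)  # dict: word -> embedding vector

print(len(word_embeddings))           # 243 words in the subset

def vec(word):
    """Return the embedding vector of a word."""
    return word_embeddings[word]

print(vec('town')[:5])                # first five components of one embedding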
Word Distance
words = ['sad', 'happy', 'town', 'village']
bag2d = np.array([vec(word) for word in words]) # Convert each word to its vector representation

fig, ax = plt.subplots(figsize=(10, 10)) # Create custom size image
col1 = 3 # Select the column for the x axis
col2 = 2 # Select the column for the y axis

ax.scatter(bag2d[:, col1], bag2d[:, col2]) # Plot a dot for each word

# Add the word label over each dot in the scatter plot
for i in range(0, len(words)):
    ax.annotate(words[i], (bag2d[i, col1], bag2d[i, col2]))

# Print an arrow for each word
for word in bag2d:
    ax.arrow(0, 0, word[col1], word[col2], head_width=0.0005,
             head_length=0.0005, fc='r', ec='r', width=1e-5)

plt.show()

Word Distance

The offset between related words can also be drawn as an arrow between them. Here diff is the difference between two word vectors (for instance diff = vec('town') - vec('village'), and likewise for happy and sad), and village / sad are the corresponding embedding vectors:

ax.arrow(village[col1], village[col2], diff[col1], diff[col2],
         fc='b', ec='b', width=1e-5)
ax.arrow(sad[col1], sad[col2], diff[col1], diff[col2],
         fc='b', ec='b', width=1e-5)
Predicting Capitals
capital = vec('France') - vec('Paris')
country = vec('Madrid') + capital
print(country[0:5]) # Print the first 5 values of the vector
[-0.02905273 -0.2475586 0.53952026 0.20581055 -0.14862823]
find_closest_word(country)
'Spain'
print(find_closest_word(vec('Berlin') + capital))
print(find_closest_word(vec('Beijing') + capital))
Germany
China
print(find_closest_word(vec('Lisbon') + capital))
Lisbon
# doc2vec: a document vector the lab builds by combining the embeddings of the
# words in a short document (construction omitted here)
find_closest_word(doc2vec)
petroleum
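find_closest_word itself is not shown on these slides; a plausible minimal implementation scans the embedding dictionary for the vector with the smallest Euclidean distance (the lab's actual helper may differ in signature and metric):

import numpy as np

def find_closest_word(v, word_embeddings):
    """Return the word whose embedding lies nearest to the vector v."""
    best_word, best_dist = None, float('inf')
    for word, w in word_embeddings.items():
        dist = np.linalg.norm(v - w)   # Euclidean distance; cosine distance also works
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word

# Usage, mirroring the capital-prediction example above:
# capital = vec('France') - vec('Paris')
# find_closest_word(vec('Madrid') + capital, word_embeddings)  # -> 'Spain'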
03-06
Implementing PCA
Import Libraries
import numpy as np # Linear algebra library
import matplotlib.pyplot as plt # library for visualization
from sklearn.decomposition import PCA # PCA library
import pandas as pd # Data frame library
import math # Library for math functions
import random # Library for pseudo random numbers
Understanding PCA
n = 1 # The amount of the correlation
x = np.random.uniform(1, 2, 1000) # Generate 1000 samples from a uniform random variable
y = x.copy() * n # Make y = n * x

# PCA works better if the data is centered
x = x - np.mean(x) # Center x. Remove its mean
y = y - np.mean(y) # Center y. Remove its mean

data = pd.DataFrame({'x': x, 'y': y}) # Create a data frame with x & y
plt.scatter(data.x, data.y) # Plot the original correlated data in blue

pca = PCA(n_components=2) # Instantiate a PCA. Choose to get 2 output variables

# Create the transformation model for this data. Internally, it gets
# the rotation matrix and the explained variance
pcaTr = pca.fit(data)

rotatedData = pcaTr.transform(data) # Transform the data based on the rotation matrix of pcaTr

# Create a data frame with the new variables. We call these new variables PC1 and PC2
dataPCA = pd.DataFrame(data=rotatedData, columns=['PC1', 'PC2'])

plt.scatter(dataPCA.PC1, dataPCA.PC2) # Plot the transformed data in orange
plt.show()
Understanding PCA
# Inspect the model fitted in the previous example
print('Eigenvalues or explained variance')
print(pcaTr.explained_variance_)

std1 = 1 # The desired standard deviation of our first random variable
std2 = 0.333 # The desired standard deviation of our 2nd random variable

x = np.random.normal(0, std1, 1000) # Get 1000 samples from x ~ N(0, std1)
y = np.random.normal(0, std2, 1000) # Get 1000 samples from y ~ N(0, std2)
#y = y + np.random.normal(0,1,1000)*noiseLevel * np.sin(0.78)

# PCA works better if the data is centered
x = x - np.mean(x) # Center x
y = y - np.mean(y) # Center y

# Create a rotation matrix using the given angle
# (angle is set earlier in the lab; the printout below corresponds to 45 degrees)
rotationMatrix = np.array([[np.cos(angle), np.sin(angle)],
                           [-np.sin(angle), np.cos(angle)]])
print('rotationMatrix')
print(rotationMatrix)

xy = np.concatenate(([x], [y]), axis=0).T # Create a matrix with columns x & y

# Transform the data using the rotation matrix. It correlates the two variables
data = np.dot(xy, rotationMatrix) # Return a nD array

angle: 45.0
rotationMatrix
[[ 0.70710678  0.70710678]
 [-0.70710678  0.70710678]]
Dimensionality Reduction
nPoints = len(data) # Number of points in the rotated data set
plt.show()
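A minimal sketch of the reduction this last slide illustrates: projecting the 2-D correlated data from the snippet above onto a single principal component with sklearn, then mapping it back for plotting (illustrative code, not the lab's exact plot):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca1 = PCA(n_components=1)                      # keep only the first principal component
projected = pca1.fit_transform(data)            # shape (nPoints, 1): the reduced data
recovered = pca1.inverse_transform(projected)   # back in 2-D, lying on the principal axis

plt.scatter(data[:, 0], data[:, 1], alpha=0.3, label='original data')
plt.scatter(recovered[:, 0], recovered[:, 1], alpha=0.3, label='projection onto PC1')
plt.legend()
plt.show()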