NLP03 Vector Space Models
03-01
Vector Space Models – Introduction
Fundamental Concept
[Figure: the fundamental concept of a vector space model. Each corpus (Entertainment, Economy, Machine Learning) is represented by how often words occur in it: the word "data" appears 500 times in the Entertainment corpus, 6,620 times in the Economy corpus, and 9,320 times in the Machine Learning corpus, so every corpus becomes a point in a vector space whose axes are word counts (0-10,000 scale). A smaller row in the figure shows the word "data" with counts 2, 1, 1, 0.]
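As a toy sketch of this counting idea (the sentences and vocabulary below are invented for illustration, not taken from the slides), each corpus can be turned into a vector of word counts:

from collections import Counter

# Invented miniature corpora, already tokenized
corpora = {
    "Entertainment": "the film was fun and the data in the film was thin".split(),
    "Economy":       "data on growth and more data on trade".split(),
}
vocab = ["data", "film"]  # the words we count

# One count vector per corpus: [count("data"), count("film")]
vectors = {name: [Counter(tokens)[w] for w in vocab] for name, tokens in corpora.items()}
print(vectors)  # {'Entertainment': [1, 2], 'Economy': [2, 0]}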
03-02
Euclidean Distance and Cosine Similarity
Euclidean Distance
Corpus A (Entertainment): (500, 7000)    Corpus B (Machine Learning): (9320, 1000)

[Figure: the two corpora plotted as points on word-count axes ("data" on the x-axis, "film" on the y-axis, 0-10,000 scale), with the straight line between them.]

The Euclidean distance is the length of that straight line (Pythagoras: $c^2 = a^2 + b^2$):

$$d(B, A) = \sqrt{(B_1 - A_1)^2 + (B_2 - A_2)^2} = \sqrt{(-8820)^2 + 6000^2} \approx 10667$$

For vectors with $n$ components, the distance is the norm of the difference vector:

$$d(\vec{v}, \vec{w}) = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2} = \lVert \vec{v} - \vec{w} \rVert$$

[Table fragment: an n-dimensional example in which a Food corpus has word counts (0, 6, 8); distances between such higher-dimensional count vectors use the same formula.]
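A one-line NumPy check of this distance, using the corpus points from the example above:

import numpy as np

A = np.array([500, 7000])   # Corpus A (Entertainment): counts of ("data", "film")
B = np.array([9320, 1000])  # Corpus B (Machine Learning)

d = np.linalg.norm(B - A)   # Euclidean distance = norm of the difference vector
print(d)                    # ~10667.35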
Cosine Similarity
Cosine similarity measures the angle $\beta$ between two corpus vectors instead of the distance between their endpoints. In the figure, the Agriculture corpus is at $\vec{v} = (20, 40)$ and a second corpus at $\vec{w} = (30, 20)$, plotted on word-count axes ("disease" on the x-axis, "eggs" on the y-axis):

$$\vec{v} \cdot \vec{w} = \lVert\vec{v}\rVert\,\lVert\vec{w}\rVert \cos\beta
\quad\Rightarrow\quad
\cos\beta = \frac{\vec{v} \cdot \vec{w}}{\lVert\vec{v}\rVert\,\lVert\vec{w}\rVert} = \frac{20 \cdot 30 + 40 \cdot 20}{\lVert\vec{v}\rVert\,\lVert\vec{w}\rVert}$$

Vector norm: $\lVert\vec{v}\rVert = \sqrt{\sum_{i=1}^{n} v_i^2}$    Dot product: $\vec{v} \cdot \vec{w} = \sum_{i=1}^{n} v_i\, w_i$
Cosine Similarity
[Figure: two extreme cases for the angle β between corpus vectors, plotted on the "disease" (x) and "eggs" (y) axes.]

β = 0° gives cos β = 1: the vectors point in the same direction (similar).
β = 90° gives cos β = 0: the vectors are orthogonal (dissimilar).
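A minimal NumPy sketch of the cosine similarity between the two corpus vectors from the example above:

import numpy as np

v = np.array([20, 40])  # Agriculture corpus: counts of ("disease", "eggs")
w = np.array([30, 20])  # the second corpus in the figure

cos_beta = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
print(cos_beta)  # ~0.87: close to 1, so the corpora point in similar directions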
03-03
Principal Component Analysis – PCA
Known relationship: USA → Washington DC. Which word relates to Russia in the same way?

With 2-D word vectors USA = (5, 6), Washington = (10, 5), and Russia = (5, 5):

$$\text{Washington} - \text{USA} = [5, -1]$$
$$\text{Russia} + [5, -1] = [10, 4]$$

The predicted capital is the word whose vector lies closest to (10, 4), i.e. Moscow.

[Figure: countries and capitals plotted in the 2-D vector space; the original figure also shows other points such as Ankara (0.5, 0.9).]
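A tiny NumPy check of this vector arithmetic, using the 2-D coordinates from the example:

import numpy as np

usa = np.array([5, 6])
washington = np.array([10, 5])
russia = np.array([5, 5])

relation = washington - usa    # [5, -1]: the "country -> capital" offset
prediction = russia + relation # [10, 4]: the capital is the word closest to this point
print(relation, prediction)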
Word vectors typically have $d > 2$ dimensions, so they cannot be plotted directly. PCA reduces them to $d = 2$ while preserving as much structure as possible, so that related words (here Oil and Gas, City and Town) end up close together in the plot.

  d > 2 (original vectors)             d = 2 (after PCA)
  Oil   0.20  …  0.10                  Oil   2.30  21.2
  Gas   2.10  …  3.40      → PCA →     Gas   1.56  19.3
  City  9.30  …  52.1                  City  13.4  34.1
  Town  6.20  …  34.3                  Town  15.6  29.8
[Diagram: dimensionality reduction maps the features of the original space onto a smaller set of uncorrelated dimensions.]
PCA Algorithm

PCA is computed from the eigenvectors and eigenvalues of the data's covariance matrix.

1. Mean-normalize the data: $x_i = \dfrac{x_i - \mu_{x_i}}{\sigma_{x_i}}$
2. Compute the covariance matrix, then its eigenvectors $U$ and eigenvalues $S$.
3. Project the data onto the leading columns of $U$ (the first principal components).
4. Percentage of retained variance when keeping the first two components (eigenvalues of the kept components over the sum of all eigenvalues):
   $$\frac{\sum_{i=0}^{1} S_{ii}}{\sum_{j=0}^{d} S_{jj}}$$
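A minimal NumPy sketch of these steps (function and variable names are my own; the labs later in this section use sklearn's PCA instead):

import numpy as np

def pca_reduce(X, k=2):
    """Project the rows of X onto the first k principal components."""
    # 1. Mean-normalize each feature (column)
    X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix and its eigenvectors / eigenvalues
    cov = np.cov(X_norm, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # re-order: largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 3. Project the data onto the first k eigenvectors
    X_reduced = X_norm @ eigvecs[:, :k]
    # 4. Percentage of retained variance
    retained = eigvals[:k].sum() / eigvals.sum()
    return X_reduced, retained

X = np.random.rand(100, 10)        # 100 points in 10 dimensions
X2, retained = pca_reduce(X, k=2)
print(X2.shape, retained)          # (100, 2) and the fraction of variance kept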
Summary
03-04
Linear Algebra in Python with NumPy
A Python list and a NumPy array behave differently: multiplying a list by 3 repeats its elements, while multiplying an ndarray by 3 scales each element.

alist = [1, 2, 3, 4, 5]            # <class 'list'>
narray = np.array([1, 2, 3, 4])    # <class 'numpy.ndarray'>

print(alist * 3)    # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
print(narray * 3)   # [3 6 9 12]
# Subtract two sum-compatible matrices. This is called the difference vector
result2 = okmatrix - okmatrix
print(result2)

[[0 0]
 [0 0]]
Transpose a Matrix
# Define a 3 x 2 matrix
matrix3x2 = np.array([[1, 2], [3, 4], [5, 6]])
print('Original matrix 3 x 2')
print(matrix3x2)
print('Transposed matrix 2 x 3')
print(matrix3x2.T)

Original matrix 3 x 2
[[1 2]
 [3 4]
 [5 6]]
Transposed matrix 2 x 3
[[1 3 5]
 [2 4 6]]

nparray = np.array([1, 2, 3, 4]) # Define an array
print('Original array')
print(nparray)
print('Transposed array')
print(nparray.T) # Transposing a 1-D array has no effect

Original array
[1 2 3 4]
Transposed array
[1 2 3 4]

# Define a 1 x 4 matrix. Note the 2 levels of square brackets
nparray = np.array([[1, 2, 3, 4]])
print('Original array')
print(nparray)
print('Transposed array')
print(nparray.T)

Original array
[[1 2 3 4]]
Transposed array
[[1]
 [2]
 [3]
 [4]]
Norm of a Matrix
nparray1 = np.array([1, 2, 3, 4]) # Define an array
norm1 = np.linalg.norm(nparray1)
print(norm1)

5.477225575051661

nparray2 = np.array([[1, 1], [2, 2], [3, 3]]) # Define a 3 x 2 matrix
normByCols = np.linalg.norm(nparray2, axis=0) # Norm of each column. Returns 2 elements
normByRows = np.linalg.norm(nparray2, axis=1) # Norm of each row. Returns 3 elements
print(normByCols)
print(normByRows)

[3.74165739 3.74165739]
[1.41421356 2.82842712 4.24264069]
sumByCols = np.sum(nparray2, axis=0) # Get the sum for each column. Returns 2 elements
sumByRows = np.sum(nparray2, axis=1) # Get the sum for each row. Returns 3 elements
print(sumByCols)
print(sumByRows)
# meanByRows comes from the preceding (omitted) step, in which each row of the
# matrix was centered by subtracting its own mean, so every row mean is now zero.
print('Mean by rows:')
print(meanByRows)

Mean by rows: [0. 0. 0.]
Mean Function
nparray2 = np.array([[1, 3], [2, 4], [3, 5]]) # Define a 3 x 2 matrix.
# np.mean and the manual calculation (sum of elements / number of elements) agree:
print(np.mean(nparray2), '==', nparray2.sum() / nparray2.size)  # 3.0 == 3.0
03-05
Manipulating Word Embeddings
The lab works with a subset of pre-trained word embeddings stored in the file word_embeddings_subset.p; the subset contains 243 words. The embeddings are visualized with matplotlib (plt.show()).
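A minimal sketch of loading that file, assuming (as the lab does) that it is a Python pickle containing a dictionary from words to NumPy vectors; the helper vec used in the snippets below simply looks a word up in that dictionary:

import pickle

with open('word_embeddings_subset.p', 'rb') as f:
    word_embeddings = pickle.load(f)  # dict: word -> embedding vector

print(len(word_embeddings))           # 243 words in the subset

def vec(word):
    """Return the embedding vector of a word."""
    return word_embeddings[word]

print(vec('town')[:5])                # first five components of one embedding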
Word Distance
words = ['sad', 'happy', 'town', 'village']
bag2d = np.array([vec(word) for word in words]) # Convert each word to its vector representation

fig, ax = plt.subplots(figsize=(10, 10)) # Create custom size image
col1 = 3 # Select the column for the x axis
col2 = 2 # Select the column for the y axis

ax.scatter(bag2d[:, col1], bag2d[:, col2]) # Plot a dot for each word

# Add the word label over each dot in the scatter plot
for i in range(0, len(words)):
    ax.annotate(words[i], (bag2d[i, col1], bag2d[i, col2]))

# Print an arrow for each word
for word in bag2d:
    ax.arrow(0, 0, word[col1], word[col2], head_width=0.0005,
             head_length=0.0005, fc='r', ec='r', width=1e-5)

plt.show()

Word Distance

The offset between related words can also be drawn as an arrow between them. Here diff is the difference between two word vectors (for instance diff = vec('town') - vec('village'), and likewise for happy and sad), and village / sad are the corresponding embedding vectors:

ax.arrow(village[col1], village[col2], diff[col1], diff[col2],
         fc='b', ec='b', width=1e-5)
ax.arrow(sad[col1], sad[col2], diff[col1], diff[col2],
         fc='b', ec='b', width=1e-5)
Predicting Capitals
capital = vec('France') - vec('Paris')
country = vec('Madrid') + capital
print(country[0:5]) # Print the first 5 values of the vector
[-0.02905273 -0.2475586 0.53952026 0.20581055 -0.14862823]
find_closest_word(country)
'Spain'
print(find_closest_word(vec('Berlin') + capital))
print(find_closest_word(vec('Beijing') + capital))
Germany
China
print(find_closest_word(vec('Lisbon') + capital))
Lisbon
# doc2vec: a document vector the lab builds by combining the embeddings of the
# words in a short document (construction omitted here)
find_closest_word(doc2vec)
petroleum
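find_closest_word itself is not shown on these slides; a plausible minimal implementation scans the embedding dictionary for the vector with the smallest Euclidean distance (the lab's actual helper may differ in signature and metric):

import numpy as np

def find_closest_word(v, word_embeddings):
    """Return the word whose embedding lies nearest to the vector v."""
    best_word, best_dist = None, float('inf')
    for word, w in word_embeddings.items():
        dist = np.linalg.norm(v - w)   # Euclidean distance; cosine distance also works
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word

# Usage, mirroring the capital-prediction example above:
# capital = vec('France') - vec('Paris')
# find_closest_word(vec('Madrid') + capital, word_embeddings)  # -> 'Spain'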
03-06
Implementing PCA
Import Libraries
import numpy as np # Linear algebra library
import matplotlib.pyplot as plt # library for visualization
from sklearn.decomposition import PCA # PCA library
import pandas as pd # Data frame library
import math # Library for math functions
import random # Library for pseudo random numbers
Understanding PCA
n = 1 # The amount of the correlation
x = np.random.uniform(1, 2, 1000) # Generate 1000 samples from a uniform random variable
y = x.copy() * n # Make y = n * x

# PCA works better if the data is centered
x = x - np.mean(x) # Center x. Remove its mean
y = y - np.mean(y) # Center y. Remove its mean

data = pd.DataFrame({'x': x, 'y': y}) # Create a data frame with x & y
plt.scatter(data.x, data.y) # Plot the original correlated data in blue

pca = PCA(n_components=2) # Instantiate a PCA. Choose to get 2 output variables

# Create the transformation model for this data. Internally, it gets
# the rotation matrix and the explained variance
pcaTr = pca.fit(data)

rotatedData = pcaTr.transform(data) # Transform the data based on the rotation matrix of pcaTr

# Create a data frame with the new variables. We call these new variables PC1 and PC2
dataPCA = pd.DataFrame(data=rotatedData, columns=['PC1', 'PC2'])

plt.scatter(dataPCA.PC1, dataPCA.PC2) # Plot the transformed data in orange
plt.show()
Understanding PCA
# Inspect the model fitted in the previous example
print('Eigenvalues or explained variance')
print(pcaTr.explained_variance_)

std1 = 1 # The desired standard deviation of our first random variable
std2 = 0.333 # The desired standard deviation of our 2nd random variable

x = np.random.normal(0, std1, 1000) # Get 1000 samples from x ~ N(0, std1)
y = np.random.normal(0, std2, 1000) # Get 1000 samples from y ~ N(0, std2)
#y = y + np.random.normal(0,1,1000)*noiseLevel * np.sin(0.78)

# PCA works better if the data is centered
x = x - np.mean(x) # Center x
y = y - np.mean(y) # Center y

# Create a rotation matrix using the given angle
# (angle is set earlier in the lab; the printout below corresponds to 45 degrees)
rotationMatrix = np.array([[np.cos(angle), np.sin(angle)],
                           [-np.sin(angle), np.cos(angle)]])
print('rotationMatrix')
print(rotationMatrix)

xy = np.concatenate(([x], [y]), axis=0).T # Create a matrix with columns x & y

# Transform the data using the rotation matrix. It correlates the two variables
data = np.dot(xy, rotationMatrix) # Return a nD array

angle: 45.0
rotationMatrix
[[ 0.70710678  0.70710678]
 [-0.70710678  0.70710678]]
Dimensionality Reduction
nPoints = len(data) # Number of points in the rotated data set
plt.show()
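A minimal sketch of the reduction this last slide illustrates: projecting the 2-D correlated data from the snippet above onto a single principal component with sklearn, then mapping it back for plotting (illustrative code, not the lab's exact plot):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca1 = PCA(n_components=1)                      # keep only the first principal component
projected = pca1.fit_transform(data)            # shape (nPoints, 1): the reduced data
recovered = pca1.inverse_transform(projected)   # back in 2-D, lying on the principal axis

plt.scatter(data[:, 0], data[:, 1], alpha=0.3, label='original data')
plt.scatter(recovered[:, 0], recovered[:, 1], alpha=0.3, label='projection onto PC1')
plt.legend()
plt.show()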