NumPy
Anis Koubaa
October 2024
Companion Notebook
For more detailed examples, visit the Google Colab notebook at: Jupyter Notebook for NumPy Data Analytics.
1 Loading Data
1.1 Reading CSV Files
Data Importing Concept:
• Introduction to loading data from external sources
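A minimal sketch of the idea (the file name here is an illustrative assumption; np.genfromtxt also tolerates missing values, unlike np.loadtxt):

import numpy as np

# Read a comma-separated file into a 2D array, skipping the header row
# (the file name is an assumption for illustration)
data_np = np.genfromtxt('heart_disease_data.csv', delimiter=',', skip_header=1)
print("Loaded array shape:", data_np.shape)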
5 Vectors
5.1 Introduction to Vectors
Linear Algebra Concept:
• Definition of vectors (ordered lists of numbers)
• Geometric interpretation
NumPy Implementation:
• Scalar multiplication
• Performing vector addition and subtraction: result = vector + vector, result = vector - vector
6 Matrices
6.1 Understanding Matrices
Linear Algebra Concept:
• Definition of matrices (2D arrays of numbers)
• Applications in transformations
NumPy Implementation:
• Creating matrices with 2D arrays: matrix = np.array([[1, 2], [3, 4]])
9.2 Diagonalization
Linear Algebra Concept:
• Diagonalizing a matrix using eigenvalues
NumPy Implementation:
• Performing matrix diagonalization: diag_matrix = np.diag(eigenvalues)
2 Dataset Overview
This section provides an overview of a dataset used for heart disease studies. We will demonstrate
how to load this dataset using both NumPy and Pandas, which are powerful tools for data analysis
in Python.
Sample Data Description: Here is a sample of the dataset.
Age Sex CP Trestbps Chol FBS Restecg Thalach Exang Oldpeak Slope CA Thal Target
63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
Here is a sample output, showing how the data looks once loaded:
array([[67. ,  1. ,  4. , ...,  3. ,  3. ,  2. ],
       [67. ,  1. ,  4. , ...,  2. ,  7. ,  1. ],
       [37. ,  1. ,  3. , ...,  0. ,  3. ,  0. ],
       ...,
       [57. ,  1. ,  4. , ...,  1. ,  7. ,  3. ],
       [57. ,  0. ,  2. , ...,  1. ,  3. ,  1. ],
       [38. ,  1. ,  3. , ..., nan,  3. ,  0. ]])
This will produce the following output, which clarifies the dimensions and type of the data stored in the array:
Data Shape: (303, 14)  # Assuming there are 303 records and 14 features
Data Type: float64
Number of Dimensions: 2
4.2 Slicing
Slicing allows you to select a subset of your array. This is particularly useful for extracting specific
rows or columns.
# Slice the first five rows and the first four columns
subset = data_np[:5, :4]

print("First five rows and four columns:\n", subset)
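A single column can be extracted the same way; for instance, assuming data_np is the loaded heart disease array, the cholesterol column ('Chol', index 4 in the sample above):

# Extract the 'Chol' column (index 4) as a 1D array
chol = data_np[:, 4]
print("First five cholesterol values:", chol[:5])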
NumPy Implementation:
import numpy as np

# Creating a scalar value in NumPy
scalar_value = np.array(42)

# Displaying the scalar value and its data type
print("Scalar Value:", scalar_value)
print("Type of Scalar Value:", type(scalar_value))

# For comparison, display the NumPy data type of the scalar
print("NumPy Data Type of Scalar Value:", scalar_value.dtype)

# Assume data_np is a loaded NumPy array from the heart disease dataset
# For instance, adjusting the 'Trestbps' column (index 3) by subtracting the mean

# Calculate the mean of the Trestbps column
mean_trestbps = np.mean(data_np[:, 3])

# Subtract the mean from the Trestbps column to center the data around zero
normalized_trestbps = data_np[:, 3] - mean_trestbps

print("Normalized Trestbps:", normalized_trestbps)
This example demonstrates normalizing the resting blood pressure measurements (’Trestbps’)
by centering them around zero, a common preprocessing step to reduce model bias due to scale
differences in features.
6 Vectors
6.1 Introduction to Vectors
Linear Algebra Concept: Vectors provide a way to store and manipulate data across multiple
dimensions, essential for numerous algorithms in machine learning and artificial intelligence.
NumPy Implementation: In NumPy, vectors are represented as 1D arrays, which can be
used for various mathematical operations.
Figure 1: Illustration of vector ⃗v as the sum of scaled unit vectors ⃗u along the x-axis and ⃗w along the y-axis.
This scalar measures the magnitude of ⃗a in the direction of ⃗b and is foundational in determining
the angle between vectors.
• Vector Norms (||⃗a||): The norm of a vector ⃗a in Euclidean space is defined as:
$$\|\vec{a}\| = \sqrt{\vec{a} \cdot \vec{a}} = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$$
Normalizing a vector involves dividing the vector by its norm, resulting in a unit vector. This
process is crucial for many machine learning algorithms to ensure unbiased comparisons by
distance or angle.
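As a minimal sketch of this step (vector values are illustrative):

import numpy as np

# Normalize a vector to unit length by dividing by its Euclidean norm
a = np.array([3.0, 4.0])
unit_a = a / np.linalg.norm(a)

print("Norm of a:", np.linalg.norm(a))                  # 5.0
print("Unit vector:", unit_a)                           # [0.6 0.8]
print("Norm of unit vector:", np.linalg.norm(unit_a))   # 1.0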
Cosine Similarity: Cosine similarity measures the cosine of the angle between two vectors,
providing a scale- and length-independent measure of similarity:
$$\text{Cosine Similarity}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \times \|\vec{b}\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \times \sqrt{\sum_{i=1}^{n} b_i^2}}$$
This metric ranges from -1 to 1, where 1 indicates vectors in the same direction, 0 indicates orthogonality, and -1 indicates vectors in opposite directions.
NumPy Implementation and Practical Example with Word Embeddings: Word embeddings represent words in multi-dimensional space. By applying dot products and norms, we can objectively measure how similar or different the words are.
import numpy as np

# Vectors representing word embeddings for "apple", "banana", and "vehicle"
apple = np.array([1, 2])
banana = np.array([1, 2.1])
vehicle = np.array([8, 3])

# Computing dot products
dot_product_apple_banana = np.dot(apple, banana)
dot_product_apple_vehicle = np.dot(apple, vehicle)

# Calculating norms
norm_apple = np.linalg.norm(apple)
norm_banana = np.linalg.norm(banana)
norm_vehicle = np.linalg.norm(vehicle)

# Calculating cosine similarity
cosine_sim_apple_banana = dot_product_apple_banana / (norm_apple * norm_banana)
cosine_sim_apple_vehicle = dot_product_apple_vehicle / (norm_apple * norm_vehicle)

print("Cosine Similarity between Apple and Banana:", cosine_sim_apple_banana)
print("Cosine Similarity between Apple and Vehicle:", cosine_sim_apple_vehicle)
Cosine Similarity: This measure computes the cosine of the angle between two vectors projected in a multi-dimensional space. It is particularly useful in natural language processing to determine how similar two words are, irrespective of their magnitude, by normalizing the vectors to unit length.
Discussion:
• The cosine similarity between "apple" and "banana" is higher than between "apple" and "vehicle," indicating a closer semantic relationship.
• This example highlights why normalizing vectors is critical for meaningful comparisons in cosine similarity calculations: it ensures comparisons are based on direction alone, not magnitude.
This diagram visually demonstrates the positions of ”apple,” ”banana,” and ”vehicle” in a 2D
vector space, reflecting their semantic relationships based on their cosine similarities.
Figure 2: Visualization of word vectors in a 2D space, showing relative distances and directions between "apple," "banana," and "vehicle." These distances and angles illustrate the semantic similarities and differences among the words.
7 Matrices
7.1 Understanding Matrices
Linear Algebra Concept: Matrices are fundamental in linear algebra, representing systems of
equations, transformations in graphics, and data structures in machine learning. They facilitate
operations across multiple data points simultaneously, making them invaluable in complex computations.
NumPy Implementation: Matrices in NumPy are implemented as 2D arrays, providing a
powerful tool for numerical computing with features for performing a wide range of mathematical
operations efficiently.
• Creating matrices: matrix = np.array([[1, 2], [3, 4]])
• Accessing matrix attributes (e.g., shape, size): print(matrix.shape), print(matrix.size)
import numpy as np
data = np.genfromtxt('heart_disease_data.csv', delimiter=',', skip_header=1)
• Matrix Multiplication: Demonstrates matrix multiplication, which is crucial for data transformations:
# Assuming 'matrix' is defined as np.array([[1, 2], [3, 4]])
# Multiply by a vector
vector = np.array([1, 1])
product = np.dot(matrix, vector)  # Output will be array([3, 7])
This section illustrates the depth of matrix manipulation capabilities in NumPy, using a practical
example from the dataset that includes basic loading, manipulation, and application in data science
tasks like feature scaling.
• Scalar Multiplication:
Compact Form:
C = kA
Expanded Form:
$$C = \begin{pmatrix} k \cdot a_{11} & k \cdot a_{12} \\ k \cdot a_{21} & k \cdot a_{22} \end{pmatrix}$$
• Matrix Multiplication:
Compact Form:
C = AB
Expanded Form:
$$C = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}$$
• Dot Product between a Matrix and a Vector: This operation involves multiplying a
matrix by a vector. The result is a new vector where each component is a linear combination
of the columns of the matrix, weighted by the corresponding components of the vector.
Compact Form:
⃗u = A⃗v
Expanded Form: Assume a 2x2 matrix A and a 2-dimensional vector ⃗v :
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad \vec{v} = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}$$
$$\vec{u} = A\vec{v} = \begin{pmatrix} a_{11}v_1 + a_{12}v_2 \\ a_{21}v_1 + a_{22}v_2 \end{pmatrix}$$
NumPy Implementation: NumPy provides intuitive and powerful tools for performing
these operations efficiently, essential for numerical computing and data analysis tasks.
– Scalar Multiplication:

# Scalar multiplication of A by a constant c
result_scalar_mult = A * c

– Matrix Multiplication:

# Matrix multiplication of A and B
result_mat_mult = np.dot(A, B)

– Cross Product:

# Cross product of vectors a and b from matrices A and B
result_cross_prod = np.cross(a, b)

– Dot Product:

# Dot product of matrix A and vector v
result_dot_prod = np.dot(A, v)
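Taken together, a minimal runnable sketch of these four operations (all matrix and vector values are illustrative):

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
c = 2
v = np.array([1, 1])
a = np.array([1, 0, 0])  # illustrative 3D vectors for the cross product
b = np.array([0, 1, 0])

print("Scalar multiplication A * c:\n", A * c)
print("Matrix multiplication A B:\n", np.dot(A, B))
print("Cross product a x b:", np.cross(a, b))       # [0 0 1]
print("Matrix-vector product A v:", np.dot(A, v))   # [3 7]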
• Identity Matrix: An identity matrix is a square matrix with ones on the main diagonal and
zeros elsewhere. It functions as the multiplicative identity in matrix multiplication, such that
any matrix multiplied by an identity matrix does not change.
• Diagonal Matrix: A diagonal matrix is a type of matrix in which all off-diagonal entries
are zero. The non-zero entries can be found only on the diagonal running from the upper left
to the lower right.
NumPy Implementation: NumPy provides efficient functions to handle these types of matrices, useful in various numerical computations.
• Creating an Identity Matrix:
identity = np.eye(3)
• Extracting Diagonals:
diagonal = np.diag(matrix)
If matrix is:
$$M = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix},$$
then np.diag(matrix) will return:
$$\begin{pmatrix} 1 & 5 & 9 \end{pmatrix}$$
NumPy Implementation and Examples: NumPy provides intuitive functions to create and
manipulate these matrices. Below are some Python code examples that illustrate their creation and
use:
import numpy as np

# Creating a 3x3 identity matrix
identity = np.eye(3)
print("Identity Matrix:\n", identity)

# Creating a matrix to demonstrate diagonal extraction
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
diagonal = np.diag(matrix)
print("Diagonal of the Matrix:\n", diagonal)

# Example of Matrix Multiplication with an Identity Matrix
A = np.array([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
result = np.dot(A, identity)
print("Result of Multiplying A by the Identity Matrix:\n", result)
For a linear system $Ax = b$ with invertible A, the solution can be written as:
$$x = A^{-1}b$$
Using NumPy to perform this computation:
import numpy as np

# Define the matrix A and vector b
A = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])

# Use NumPy to solve for x
x = np.linalg.solve(A, b)

print("Solution vector x:", x)
Illustrative Numerical Example: For the matrix A given above, its inverse $A^{-1}$ can be computed:
$$A^{-1} = \begin{pmatrix} 0.4 & -0.2 \\ -0.2 & 0.6 \end{pmatrix}$$
The product $A \cdot A^{-1}$ confirms the identity matrix:
$$A \cdot A^{-1} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
This computation illustrates that multiplying A by $A^{-1}$ yields the identity matrix, verifying the defining property of inverse matrices.
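A quick NumPy check of this property, using the same A as above:

import numpy as np

A = np.array([[3, 1], [1, 2]])
A_inv = np.linalg.inv(A)

print("Inverse of A:\n", A_inv)    # [[ 0.4 -0.2], [-0.2  0.6]]
print("A @ A_inv:\n", A @ A_inv)   # approximately the 2x2 identity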
import numpy as np

# Define a matrix B, possibly non-square or singular
B = np.array([[1, 2], [3, 4], [5, 6]])

# Calculate the pseudo-inverse of B
pseudo_inv_B = np.linalg.pinv(B)
print("Pseudo-inverse of B:\n", pseudo_inv_B)
This matrix, when used to multiply B, approximates an identity matrix, reflecting the least squares
solution to an overdetermined system. This concept is crucial in applications like signal processing
and regression analysis where exact solutions are not feasible.
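Continuing the snippet above, one quick check is to multiply the pseudo-inverse back against B; because B has full column rank, pinv(B) @ B recovers the 2x2 identity (up to floating-point error):

# pinv(B) @ B approximates the 2x2 identity matrix
print("pinv(B) @ B:\n", pseudo_inv_B @ B)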
the determinant can be calculated using the Leibniz formula, which is a sum over permutations of
the matrix indices:
$$\det(A) = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} a_{i,\sigma(i)}$$
where σ ranges over all permutations of n elements, and sgn(σ) is the sign of the permutation (+1
for even permutations and −1 for odd permutations).
NumPy Implementation:
import numpy as np

# Define an n x n matrix (the entries a11 ... ann are placeholders for concrete values)
A = np.array([[a11, a12, ..., a1n],
              [a21, a22, ..., a2n],
              ...,
              [an1, an2, ..., ann]])
det_A = np.linalg.det(A)
print("Determinant of A:", det_A)
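As a concrete, runnable check (matrix values are illustrative), the 2x2 case where det(A) = a11 a22 - a12 a21:

import numpy as np

A = np.array([[3, 1], [1, 2]])
print("Determinant of A:", np.linalg.det(A))  # 3*2 - 1*1 = 5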
(A − λI)v = 0
where I is the identity matrix of the same dimension as A. This equation implies that for non-trivial
solutions, the matrix (A − λI) must be singular, which leads to the characteristic equation:
det(A − λI) = 0
Solving this polynomial in λ gives the eigenvalues, and substituting each λ back into the equation
provides the corresponding eigenvectors.
NumPy Implementation:
import numpy as np

# Define a matrix A
A = np.array([[4, 2], [1, 3]])

# Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)
Worked Example: For the matrix $A = \begin{pmatrix} 4 & 2 \\ 1 & 3 \end{pmatrix}$ above, solving $\det(A - \lambda I) = \lambda^2 - 7\lambda + 10 = 0$ yields the eigenvalues $\lambda_1 = 5$ and $\lambda_2 = 2$.

Eigenvector for $\lambda_1 = 5$: The system $(A - 5I)v = 0$ reduces to $x_1 = 2x_2$; letting $x_2 = t$ and choosing $t = 1$ gives:
$$v_1 = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$$

Eigenvector for $\lambda_2 = 2$:
$$A - 2I = \begin{pmatrix} 2 & 2 \\ 1 & 1 \end{pmatrix}$$

Row reduce this matrix:
$$\begin{pmatrix} 2 & 2 \\ 1 & 1 \end{pmatrix} \rightarrow \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}$$

From $x_1 + x_2 = 0$, let $x_2 = t$. Then $x_1 = -t$:
$$v_2 = t \begin{pmatrix} -1 \\ 1 \end{pmatrix} \quad (\text{where } t \neq 0, \text{ typically } t = 1)$$

Choosing $t = 1$ gives:
$$v_2 = \begin{pmatrix} -1 \\ 1 \end{pmatrix}$$
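These hand computations can be cross-checked with NumPy. Note that np.linalg.eig returns unit-length eigenvectors, so its columns are scaled versions of $v_1$ and $v_2$ (and their order is not guaranteed):

import numpy as np

A = np.array([[4, 2], [1, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)               # approximately [5. 2.]
print("Eigenvectors (columns):\n", eigenvectors)
# One column is proportional to [2, 1], the other to [-1, 1]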
10.2 Diagonalization
Linear Algebra Concept: Diagonalization is the process of transforming a matrix into a diagonal
form that reflects its eigenvalues, which is possible if the matrix has enough linearly independent
eigenvectors to form a basis. A matrix A is diagonalizable if it can be expressed as:
$$A = PDP^{-1}$$
where P is the matrix of column eigenvectors of A, and D is the diagonal matrix containing the
eigenvalues of A. Each column vi of P corresponds to an eigenvalue λi in D.
NumPy Implementation:
# Assuming eigenvalues and eigenvectors from the previous code
P = eigenvectors
D = np.diag(eigenvalues)
P_inv = np.linalg.inv(P)
A_reconstructed = np.dot(np.dot(P, D), P_inv)  # P D P^-1 recovers the original A
print("Reconstructed matrix A from P D P^-1:\n", A_reconstructed)
Verification of Diagonalization: With the eigenvalues $\lambda_1 = 5$ and $\lambda_2 = 2$ and the corresponding eigenvectors $v_1 = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$ and $v_2 = \begin{pmatrix} -1 \\ 1 \end{pmatrix}$ computed previously, we can perform matrix diagonalization and verify it by reconstructing the original matrix A.
Calculate $P^{-1}$:
$$P^{-1} = \frac{1}{\det(P)} \begin{pmatrix} 1 & 1 \\ -1 & 2 \end{pmatrix} = \frac{1}{3} \begin{pmatrix} 1 & 1 \\ -1 & 2 \end{pmatrix}$$
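Carrying out the product $P D P^{-1}$ with these values recovers A, which NumPy confirms:

import numpy as np

P = np.array([[2, -1], [1, 1]])   # eigenvectors v1 and v2 as columns
D = np.diag([5, 2])               # eigenvalues on the diagonal
A_reconstructed = P @ D @ np.linalg.inv(P)
print("P D P^-1:\n", A_reconstructed)  # recovers [[4. 2.], [1. 3.]]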
4. Sort and Select Principal Components: Sort the eigenvalues in descending order and
select the top k eigenvectors to form the transformation matrix.
5. Transform the Data: Project the data onto the new subspace:
$$X_{\mathrm{PCA}} = X_c \cdot V_k$$
where Vk is the matrix of the top k eigenvectors.
NumPy Implementation:
import numpy as np

# Define a small data matrix (3x2 for simplicity) and mean-center it
X = np.array([[1, 2], [3, 4], [5, 6]])
X_c = X - np.mean(X, axis=0)

# Compute the covariance matrix
cov_matrix = np.cov(X_c, rowvar=False)

# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print("Covariance Matrix:\n", cov_matrix)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)

# Sort by descending eigenvalue (step 4) and select the top k eigenvectors (here, k = 1)
order = np.argsort(eigenvalues)[::-1]
k = 1
V_k = eigenvectors[:, order[:k]]

# Transform the data (step 5)
X_pca = X_c @ V_k
print("Transformed data:\n", X_pca)
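Continuing the snippet, a common follow-up is to report how much of the total variance the selected components capture (using the sorted eigenvalues from above):

# Fraction of total variance captured by the top k components
explained = np.sum(eigenvalues[order[:k]]) / np.sum(eigenvalues)
print("Explained variance ratio:", explained)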
$$A_k = U_k \Sigma_k V_k^T$$
• Noise Reduction: Smaller singular values often correspond to noise or less significant information. Discarding these values helps in cleaning the data, enhancing pattern recognition and interpretation.
Practical Application:
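The snippet below assumes u, s, and vh were already produced by a singular value decomposition; a minimal setup, with an illustrative matrix, would be:

import numpy as np

# Illustrative matrix; any 2D array works here
A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])

# Thin SVD: u is (m x r), s is (r,), vh is (r x n)
u, s, vh = np.linalg.svd(A, full_matrices=False)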
# Assuming the top 1 singular value to reduce dimensionality
k = 1
A_k = np.dot(u[:, :k] * s[:k], vh[:k, :])
print("Approximated matrix A_k:\n", A_k)
$$A = U \Sigma V^T$$
Expanding this product, the reconstruction of A involves summing the outer products of the singular vectors, each scaled by the corresponding singular value:
$$A = 25.4368356 \begin{pmatrix} 0.20673589 \\ 0.51828874 \\ 0.82984158 \end{pmatrix} \begin{pmatrix} 0.40361757 \\ 0.46474413 \\ 0.52587069 \\ 0.58699725 \end{pmatrix}^T + 1.72261225 \begin{pmatrix} 0.88915331 \\ 0.25438183 \\ -0.38038964 \end{pmatrix} \begin{pmatrix} -0.7\ldots \\ \vdots \end{pmatrix}^T$$