0% found this document useful (0 votes)
39 views28 pages

Numpy

Uploaded by

Rim Moussa
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
39 views28 pages

Numpy

Uploaded by

Rim Moussa
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 28

NumPy for Data Science Cheat Sheet

Anis Koubaa
October 2024

Companion Notebook
For more detailed examples, visit the Google Colab notebook at: Jupyter Notebook for NumPy
Data Analytics.

1 Loading Data
1.1 Reading CSV Files
Data Importing Concept:
• Introduction to loading data from external sources

• Importance of CSV files in data analytics


NumPy Implementation:
• Using NumPy to read CSV files: data = np.genfromtxt(’data.csv’, delimiter=’,’)

• Handling missing values and setting data types

2 Numpy Array Dimensions and Data Types


2.1 Exploring Array Attributes
Concepts of Array Dimensions and Data Types:
• Understanding the structure of arrays

• The role of data types in numerical computations


NumPy Implementation:
• Examining array dimensions: print(data.shape)

• Checking data types: print(data.dtype)

1
CS316: Introduction to Data Science Anis Koubaa, October 2024

3 NumPy Indexing and Slicing


3.1 Accessing Data Elements
Indexing and Slicing Techniques:

• Basics of indexing and slicing in Python arrays


• Advanced techniques for efficient data manipulation
NumPy Implementation:

• Basic indexing: element = data[0, 0]


• Slicing arrays: subarray = data[1:5]

4 Scalars and NumPy Basics


4.1 Understanding Scalars
Linear Algebra Concept:

• Definition of scalars (single numerical values)


• Operations with scalars
NumPy Implementation:

• Creating scalar variables: scalar = np.array(5)


• Performing arithmetic operations with NumPy scalars: scalar + 2, scalar * 3

5 Vectors
5.1 Introduction to Vectors
Linear Algebra Concept:
• Definition of vectors (ordered lists of numbers)
• Geometric interpretation
NumPy Implementation:

• Representing vectors using 1D arrays: vector = np.array([1, 2, 3])


• Initializing vectors: zeros = np.zeros(3), ones = np.ones(3)

Anis Koubaa Page 2 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

5.2 Vector Operations


Operations in Linear Algebra:
• Addition and subtraction of vectors

• Scalar multiplication
NumPy Implementation:
• Performing vector addition and subtraction: result = vector + vector, result = vector
- vector

• Scalar multiplication: scaled = vector * 2

5.3 Dot Product, Vector Norms, and Cosine Similarity


Linear Algebra Concept:

• Dot product and its properties


• Calculating vector norms (magnitude)
• Understanding cosine similarity between vectors
NumPy Implementation:

• Computing dot product: dot product = np.dot(vector, vector)


• Calculating norms: norm = np.linalg.norm(vector)
• Computing cosine similarity: cosine similarity = np.dot(vector1, vector2) / (np.linalg.norm(vector1)
* np.linalg.norm(vector2))

6 Matrices
6.1 Understanding Matrices
Linear Algebra Concept:
• Definition of matrices (2D arrays of numbers)

• Applications in transformations
NumPy Implementation:
• Creating matrices with 2D arrays: matrix = np.array([[1, 2], [3, 4]])

• Exploring matrix attributes: print(matrix.shape)

Anis Koubaa Page 3 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

6.2 Matrix Operations


Operations in Linear Algebra:
• Matrix addition and subtraction
• Scalar multiplication
• Matrix multiplication
NumPy Implementation:
• Performing matrix operations: result = matrix + matrix, result = matrix - matrix
• Matrix multiplication: product = np.matmul(matrix, matrix)

6.3 Special Matrices


Linear Algebra Concept:
• Identity matrix and its properties
• Diagonal matrices
NumPy Implementation:
• Creating identity matrices: identity = np.eye(3)
• Extracting diagonals: diagonal = np.diag(matrix)

7 Systems of Linear Equations


7.1 Solving Linear Systems
Linear Algebra Concept:
• Representing systems as Ax = b
• Methods for solving
NumPy Implementation:
• Solving equations: x = np.linalg.solve(A, b)

7.2 Inverse Matrices


Linear Algebra Concept:
• Definition and properties of inverse matrices
NumPy Implementation:
• Calculating inverses: inv A = np.linalg.inv(A)

Anis Koubaa Page 4 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

8 Determinants and Rank


8.1 Determinants
Linear Algebra Concept:
• Understanding determinants and their significance
NumPy Implementation:
• Computing determinants: det A = np.linalg.det(A)

8.2 Rank of a Matrix


Linear Algebra Concept:
• Definition and importance of rank
NumPy Implementation:
• Determining rank: rank A = np.linalg.matrix rank(A)

9 Eigenvalues and Eigenvectors


9.1 Fundamentals
Linear Algebra Concept:
• What are eigenvalues and eigenvectors?
• Their role in transformations
NumPy Implementation:
• Calculating eigenvalues and eigenvectors: eigenvalues, eigenvectors = np.linalg.eig(A)

9.2 Diagonalization
Linear Algebra Concept:
• Diagonalizing a matrix using eigenvalues
NumPy Implementation:
• Performing matrix diagonalization: diag matrix = np.diag(eigenvalues)

10 Singular Value Decomposition (SVD)


10.1 Understanding SVD
Linear Algebra Concept:
• Breaking down a matrix into singular vectors and singular values
NumPy Implementation:
• Performing SVD: u, s, vh = np.linalg.svd(matrix)

Anis Koubaa Page 5 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

10.2 Applications of SVD


Concepts:
• Dimensionality reduction

• Noise reduction in data


Practical Application:
• Applying SVD in data science tasks

Anis Koubaa Page 6 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

2 Dataset Overview
This section provides an overview of a dataset used for heart disease studies. We will demonstrate
how to load this dataset using both NumPy and Pandas, which are powerful tools for data analysis
in Python.
Sample Data Description: Here is a sample of the dataset.
Age Sex CP Trestbps Chol FBS Restecg Thalach Exang Oldpeak Slope CA Thal Target
63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
41 0 2 130 204 0 2 172 0 1.4 1 0 3 0

2.1 Loading Data with NumPy


NumPy’s genfromtxt function is perfect for loading data when you don’t need to manipulate or
preprocess data before analysis.
1 import numpy as np
2
3 # Load the heart disease dataset directly into a NumPy array
4 data_np = np . genfromtxt ( ’ h e a r t _ d i s e a s e _ d a t a . csv ’ , delimiter = ’ , ’ , skip_header =1)

2.2 Loading Data with Pandas and Converting to NumPy


Using Pandas to load data provides flexibility for preprocessing. Once data manipulation is com-
plete, you can easily convert the DataFrame to a NumPy array.
1 import pandas as pd
2
3 # Load data into a Pandas DataFrame
4 df = pd . read_csv ( ’ h e a r t _ d i s e a s e _ d a t a . csv ’)
5 # Convert the DataFrame to a NumPy array
6 data_pd_to_np = df . to_numpy ()

Here is a sample output, showing how the data looks once loaded:
1 array ([[67. , 1. , 4. , ... , 3. , 3. , 2.] ,
2 [67. , 1. , 4. , ... , 2. , 7. , 1.] ,
3 [37. , 1. , 3. , ... , 0. , 3. , 0.] ,
4 ... ,
5 [57. , 1. , 4. , ... , 1. , 7. , 3.] ,
6 [57. , 0. , 2. , ... , 1. , 3. , 1.] ,
7 [38. , 1. , 3. , ... , nan , 3. , 0.]])

3 Numpy Array Dimensions and Data Types


After loading the data, it’s useful to check the array’s shape, data type, and dimensionality. This
can provide insight into the structure of the dataset that may influence further data processing
steps.

Anis Koubaa Page 7 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

1 # Print the shape of the data


2 print ( ’ Data Shape : ’ , data_np . shape )
3
4 # Print the data type of the array
5 print ( ’ Data Type : ’ , data_np . dtype )
6
7 # Print the number of dimensions of the array
8 print ( ’ Number of Dimensions : ’ , data_np . ndim )

This will produce the following output, helping to understand the dimensions and type of data
stored in the array:
1 Data Shape : (303 , 14) # Assuming there are 303 records and 14 features
2 Data Type : float64
3 Number of Dimensions : 2

4 NumPy Indexing and Slicing


Indexing and slicing are fundamental for data manipulation in NumPy arrays, allowing for selecting
and operating on subsets of the dataset efficiently.

4.1 Basic Indexing


You can access individual elements or ranges of elements in a NumPy array using basic indexing.
1 # Access the first element ( first row )
2 first_row = data_np [0]
3
4 # Access a specific element ( element at first row and first column )
5 first_element = data_np [0 , 0]
6
7 print ( " First row : " , first_row )
8 print ( " First element : " , first_element )

4.2 Slicing
Slicing allows you to select a subset of your array. This is particularly useful for extracting specific
rows or columns.
1 # Slice the first five rows and the first four columns
2 subset = data_np [:5 , :4]
3
4 print ( " First five rows and four columns :\ n " , subset )

4.3 Boolean Indexing


Boolean indexing is a powerful feature that allows you to select elements based on conditions.
1 # Find all rows where the target variable ( last column ) is 1
2 posi tive_cas es = data_np [ data_np [: , -1] == 1]
3
4 print ( " Cases with positive heart disease diagnosis :\ n " , p ositive_ cases )

Anis Koubaa Page 8 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

4.4 Fancy Indexing


Fancy indexing involves passing arrays of indices to access multiple array elements at once.
1 # Access specific rows and columns using lists of indices
2 rows = [0 , 2 , 4] # Select the first , third , and fifth rows
3 cols = [0 , 1 , 2 , 3] # Select the first four columns
4 selected_data = data_np [ rows ][: , cols ]
5
6 print ( " Selected data :\ n " , selected_data )

5 Scalars and NumPy Basics


5.1 Understanding Scalars
Linear Algebra Concept in Data Science: In data science and AI, a scalar is a single value
used to adjust or scale features within a dataset. Scalars are fundamental in operations such as
feature scaling, which is essential for many machine learning algorithms to perform optimally.
Operations with Scalars:

• Addition/Subtraction: Adjusting a baseline value, like correcting sensor offsets in data


collection.
• Multiplication/Division: Scaling features to a normalized range, such as [0,1] or [-1,1],
which is critical for algorithms like gradient descent to converge more efficiently.

NumPy Implementation:
1 import numpy as np
2
3 # Creating a scalar value in NumPy
4 scalar_value = np . array (42)
5
6 # Displaying the scalar value and its data type
7 print ( " Scalar Value : " , scalar_value )
8 print ( " Type of Scalar Value : " , type ( scalar_value ) )
9
10 # For comparison , display the NumPy data type of the scalar
11 print ( " NumPy Data Type of Scalar Value : " , scalar_value . dtype )
12
13 # Assume data_np is a loaded NumPy array from the heart disease dataset
14 # For instance , adjusting the ’ Trestbps ’ column ( index 3) by subtracting the mean
15
16 # Calculate the mean of the Trestbps column
17 mean_trestbps = np . mean ( data_np [: , 3])
18
19 # Subtract the mean from the Trestbps column to center the data around zero
20 n o r m a l i z e d _ t r e s t b p s = data_np [: , 3] - mean_trestbps
21
22 print ( " Normalized Trestbps : " , n o r m a l i z e d _ t r e s t b p s )

This example demonstrates normalizing the resting blood pressure measurements (’Trestbps’)
by centering them around zero, a common preprocessing step to reduce model bias due to scale
differences in features.

Anis Koubaa Page 9 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

6 Vectors
6.1 Introduction to Vectors
Linear Algebra Concept: Vectors provide a way to store and manipulate data across multiple
dimensions, essential for numerous algorithms in machine learning and artificial intelligence.
NumPy Implementation: In NumPy, vectors are represented as 1D arrays, which can be
used for various mathematical operations.

• Creating vectors using 1D arrays: vector = np.array([1, 2, 3])


• Initializing vectors with zeros or ones:
– zeros = np.zeros(3)
– ones = np.ones(3)

6.1.1 Vector Equations


Consider a vector ⃗v = [1, 2]. The compact form of a vector equation might be ⃗v = x⃗u + y w,
⃗ where
⃗u and w
⃗ are unit vectors along the x and y axes, respectively.
Expanded Form:
⃗v = 1⃗u + 2w

6.1.2 Visual Representation of Vectors in 2D


This figure visually demonstrates how the vector ⃗v is composed from its components along the x
and y axes, illustrating the geometric interpretation of vector addition. The unit vectors ⃗u and
w
⃗ represent the standard basis vectors for the 2D coordinate system, where ⃗u is aligned with the
x-axis and w⃗ with the y-axis.

⃗v
2w

x
⃗u

Figure 1: Illustration of vector ⃗v as the sum of scaled unit vectors ⃗u along the x-axis and w
⃗ along
the y-axis.

Anis Koubaa Page 10 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

6.2 Vector Operations


Operations in Linear Algebra: Vectors can be manipulated through various operations, such
as additional and subtraction.
NumPy Implementation: NumPy makes it straightforward to perform these vector op-
erations efficiently, allowing data scientists to manage and manipulate large datasets or model
parameters quickly.
• Performing Vector Addition and Subtraction: Adding or subtracting vectors element-
wise.
• Scalar Multiplication: Scaling a vector by multiplying with a scalar value.
Here is how these operations can be implemented in NumPy:
1 import numpy as np
2
3 # Creating two vectors
4 vector1 = np . array ([1 , 2 , 3])
5 vector2 = np . array ([4 , 5 , 6])
6
7 # Vector addition
8 result_add = vector1 + vector2
9
10 # Vector subtraction
11 result_sub = vector1 - vector2
12
13 # Scalar multipl ication
14 scaled = vector1 * 2
15
16 print ( " Vector Addition Result : " , result_add )
17 print ( " Vector Subtraction Result : " , result_sub )
18 print ( " Scalar Multi plicatio n Result : " , scaled )

6.3 Dot Product and Vector Norms


Linear Algebra Concepts: Key operations in vector algebra are utilized in computational geom-
etry, data science, and machine learning, especially for analyzing data in high-dimensional spaces.
• Dot Product (⃗a · ⃗b): Given two vectors ⃗a and ⃗b, the dot product is calculated as:
n
X
⃗a · ⃗b = ai bi = a1 b1 + a2 b2 + · · · + an bn
i=1

This scalar measures the magnitude of ⃗a in the direction of ⃗b and is foundational in determining
the angle between vectors.
• Vector Norms (||⃗a||): The norm of a vector ⃗a in Euclidean space is defined as:
√ q
||⃗a|| = ⃗a · ⃗a = a21 + a22 + · · · + a2n

Normalizing a vector involves dividing the vector by its norm, resulting in a unit vector. This
process is crucial for many machine learning algorithms to ensure unbiased comparisons by
distance or angle.

Anis Koubaa Page 11 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

Cosine Similarity: Cosine similarity measures the cosine of the angle between two vectors,
providing a scale- and length-independent measure of similarity:
Pn
⃗a · ⃗b i=1 ai bi
Cosine Similarity(⃗a, ⃗b) = = pPn pPn
||⃗a|| × ||⃗b|| 2
i=1 ai ×
2
i=1 bi

This metric ranges from -1 to 1, where 1 indicates vectors in the same direction, 0 indicates orthog-
onality, and -1 indicates vectors in opposite directions.
NumPy Implementation and Practical Example with Word Embeddings: Word em-
beddings represent words in multi-dimensional space. By applying dot products and norms, we can
objectively measure how similar or different the words are.
1 import numpy as np
2
3 # Vectors representing word embeddings for " apple " , " banana " , and " vehicle "
4 apple = np . array ([1 , 2])
5 banana = np . array ([1 , 2.1])
6 vehicle = np . array ([8 , 3])
7
8 # Computing dot products
9 d o t _ p r o d u c t _ a p p l e _ b a n a n a = np . dot ( apple , banana )
10 d o t _ p r o d u c t _ a p p l e _ v e h i c l e = np . dot ( apple , vehicle )
11
12 # Calculating norms
13 norm_apple = np . linalg . norm ( apple )
14 norm_banana = np . linalg . norm ( banana )
15 norm_vehicle = np . linalg . norm ( vehicle )
16
17 # Calculating cosine similarity
18 c o s i n e _ s i m _ a p p l e _ b a n a n a = d o t _ p r o d u c t _ a p p l e _ b a n a n a / ( norm_apple * norm_banana )
19 c o s i n e _ s i m _ a p p l e _ v e h i c l e = d o t _ p r o d u c t _ a p p l e _ v e h i c l e / ( norm_apple * norm_vehicle )
20
21 print ( " Cosine Similarity between Apple and Banana : " , c o s i n e _ s i m _ a p p l e _ b a n a n a )
22 print ( " Cosine Similarity between Apple and Vehicle : " , c o s i n e _ s i m _ a p p l e _ v e h i c l e )

Cosine Similarity: This measure computes the cosine of the angle between two vectors pro-
jected in a multi-dimensional space. It is particularly useful in natural language processing to
determine how similar two words are, irrespective of their magnitude, by normalizing the vectors
to unit length.
Discussion: - The cosine similarity between ”apple” and ”banana” is higher, indicating a
closer relationship due to their similar meanings compared to ”apple” and ”vehicle.” - This exam-
ple highlights why normalizing vectors is critical for meaningful comparisons in cosine similarity
calculations, as it ensures comparisons are based on directions alone, not magnitudes.
This diagram visually demonstrates the positions of ”apple,” ”banana,” and ”vehicle” in a 2D
vector space, reflecting their semantic relationships based on their cosine similarities.

Anis Koubaa Page 12 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

vehicle

apple
banana

Figure 2: Visualization of word vectors in a 2D space, showing relative distances and directions
between ‘apple,‘ ‘banana,‘ and ‘vehicle.‘ These distances and angles illustrate the semantic similar-
ities and differences among the words.

7 Matrices
7.1 Understanding Matrices
Linear Algebra Concept: Matrices are fundamental in linear algebra, representing systems of
equations, transformations in graphics, and data structures in machine learning. They facilitate
operations across multiple data points simultaneously, making them invaluable in complex compu-
tations.
NumPy Implementation: Matrices in NumPy are implemented as 2D arrays, providing a
powerful tool for numerical computing with features for performing a wide range of mathematical
operations efficiently.
• Creating matrices: matrix = np.array([[1, 2], [3, 4]])
• Accessing matrix attributes (e.g., shape, size): print(matrix.shape), print(matrix.size)

7.1.1 Matrix Operations


Sample Data Description: Here is a sample of the dataset used to demonstrate matrix opera-
tions.
Age Sex CP Trestbps Chol FBS Restecg Thalach Exang Oldpeak Slope CA Thal Target
63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
NumPy Implementation:
• Loading Data into a Matrix: To use this dataset in NumPy, you would typically load it
from a CSV file using:

Anis Koubaa Page 13 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

1 import numpy as np
2 data = np . genfromtxt ( ’ h e a r t _ d i s e a s e _ d a t a . csv ’ , delimiter = ’ , ’ , skip_header =1)
3

• Matrix Multiplication: Demonstrates matrix multiplication which is crucial for data trans-
formations:
1 # Assuming ’ matrix ’ is defined as np . array ([[1 , 2] , [3 , 4]])
2 # Multiply by a vector
3 vector = np . array ([1 , 1])
4 product = np . dot ( matrix , vector ) # Output will be an array ([3 , 7])
5

7.1.2 Practical Application: Data Transformation


Matrix operations are core to many algorithms in data science. For example, transforming data
features for standardization:
1 # Standardizing the ’ Age ’ column ( index 0 in the dataset )
2 age_mean = np . mean ( data [: ,0])
3 age_std = np . std ( data [: ,0])
4 s t a n d a r d i z e d _ a ge = ( data [: ,0] - age_mean ) / age_std

This section illustrates the depth of matrix manipulation capabilities in NumPy, using a practical
example from the dataset that includes basic loading, manipulation, and application in data science
tasks like feature scaling.

7.2 Matrix Operations


Operations in Linear Algebra: Matrix operations are crucial for many applications in engineer-
ing, physics, economics, and computer science, allowing for the manipulation and transformation
of data stored in matrices.

• Matrix Addition and Subtraction:


Compact Form:
C =A±B
Expanded Form:  
a11 ± b11 a12 ± b12
C=
a21 ± b21 a22 ± b22

• Scalar Multiplication:
Compact Form:
C = kA
Expanded Form:  
k · a11 k · a12
C=
k · a21 k · a22

Anis Koubaa Page 14 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

• Matrix Multiplication:
Compact Form:
C = AB
Expanded Form:  
a11 b11 + a12 b21 a11 b12 + a12 b22
C=
a21 b11 + a22 b21 a21 b12 + a22 b22

• Dot Product between a Matrix and a Vector: This operation involves multiplying a
matrix by a vector. The result is a new vector where each component is a linear combination
of the columns of the matrix, weighted by the corresponding components of the vector.
Compact Form:
⃗u = A⃗v
Expanded Form: Assume a 2x2 matrix A and a 2-dimensional vector ⃗v :
   
a11 a12 v
A= , ⃗v = 1
a21 a22 v2
 
a v + a12 v2
⃗u = 11 1
a21 v1 + a22 v2

NumPy Implementation: NumPy provides intuitive and powerful tools for performing
these operations efficiently, essential for numerical computing and data analysis tasks.

– Performing Matrix Addition and Subtraction:


1 # Assuming matrices A and B
2 result_add = A + B
3 result_sub = A - B
4

– Scalar Multiplication:
1 # Scalar multipli cation by a constant c
2 result_scalar_mult = A * c
3

– Matrix Multiplication:
1 # Matrix multipli cation of A and B
2 r es ul t_ m at _m u lt = np . dot (A , B )
3

– Cross Product:
1 # Cross product of vectors a and b from matrices A and B
2 r e s u l t _ c r o s s _ p r o d = np . cross (a , b )
3

– Dot Product:
1 # Dot product of matrix A and vector v
2 r es ul t_ d ot _p r od = np . dot (A , v )
3

Anis Koubaa Page 15 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

7.3 Special Matrices


Linear Algebra Concept: Special matrices are utilized extensively in mathematical computations
due to their unique properties.

• Identity Matrix: An identity matrix is a square matrix with ones on the main diagonal and
zeros elsewhere. It functions as the multiplicative identity in matrix multiplication, such that
any matrix multiplied by an identity matrix does not change.
• Diagonal Matrix: A diagonal matrix is a type of matrix in which all off-diagonal entries
are zero. The non-zero entries can be found only on the diagonal running from the upper left
to the lower right.

NumPy Implementation: NumPy provides efficient functions to handle these types of ma-
trices, useful in various numerical computations.

• Creating Identity Matrices:

identity = np.eye(3)

This function creates a 3x3 identity matrix:


 
1 0 0
I = 0 1 0
0 0 1

• Extracting Diagonals:
diagonal = np.diag(matrix)
If matrix is:  
1 2 3
M= 4
 5 6 ,
7 8 9
then np.diag(matrix) will return:  
1 5 9

Example of Matrix Multiplication with an Identity Matrix: Consider a 3x3 matrix A:


 
2 3 4
A = 5 6 7
8 9 10

Multiplying A by the identity matrix I results in:


 
2 3 4
A · I = 5 6 7
8 9 10

NumPy Implementation and Examples: NumPy provides intuitive functions to create and
manipulate these matrices. Below are some Python code examples that illustrate their creation and
use:

Anis Koubaa Page 16 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

1 import numpy as np
2
3 # Creating a 3 x3 identity matrix
4 identity = np . eye (3)
5 print ( " Identity Matrix :\\ n " , identity )
6
7 # Creating a matrix to demonstrate diagonal extraction
8 matrix = np . array ([[1 , 2 , 3] , [4 , 5 , 6] , [7 , 8 , 9]])
9 diagonal = np . diag ( matrix )
10 print ( " Diagonal of the Matrix :\\ n " , diagonal )
11
12 # Example of Matrix Multipli cation with an Identity Matrix
13 A = np . array ([[2 , 3 , 4] , [5 , 6 , 7] , [8 , 9 , 10]])
14 result = np . dot (A , identity )
15 print ( " Result of Multiplying A by the Identity Matrix :\\ n " , result )

8 Systems of Linear Equations


8.1 Solving Linear Systems
Linear Algebra Concept: Systems of linear equations can be compactly represented as matrix
equations of the form Ax = b, where A is a coefficient matrix, x is a vector of variables, and b is a
vector of constants.
Illustrative Numerical Example and NumPy Implementation: Consider the system of
linear equations given by:
3x1 + x2 = 9
x1 + 2x2 = 8
This system can be represented in matrix form as Ax = b, where
   
3 1 9
A= , b=
1 2 8
The matrix A contains the coefficients of the variables x1 and x2 , and vector b contains the
constants from the right-hand side of the equations.
To find the solution vector x, which contains the values of x1 and x2 , we solve the matrix
equation:

x = A−1 b
Using NumPy to perform this computation:
1 import numpy as np
2
3 # Define the matrix A and vector b
4 A = np . array ([[3 , 1] , [1 , 2]])
5 b = np . array ([9 , 8])
6
7 # Use NumPy to solve for x
8 x = np . linalg . solve (A , b )
9
10 print ( " Solution vector x : " , x )

Anis Koubaa Page 17 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

Executing the above Python code yields:


 
2
x=
3
This means that x1 = 2 and x2 = 3 are the solutions to the system of equations, effectively
satisfying both equations. This illustrative example not only demonstrates how to set up and
solve a system of linear equations in matrix form but also confirms the solution using a direct
computational approach with NumPy.

8.2 Inverse Matrices


Linear Algebra Concept: The inverse of a matrix A, denoted as A−1 , is a matrix that, when
multiplied by A, yields the identity matrix I. For a matrix to have an inverse, it must be a square
matrix (i.e., the same number of rows and columns) and non-singular, meaning its determinant is
not zero (det(A) ̸= 0). This non-zero determinant ensures that the matrix has linearly independent
columns, which is a prerequisite for invertibility.
NumPy Implementation: NumPy provides the ‘linalg.inv‘ function to compute the inverse
of an invertible matrix.
1 import numpy as np
2
3 # Define the matrix A
4 A = np . array ([[3 , 1] , [1 , 2]])
5
6 # Calculate the inverse of A
7 inv_A = np . linalg . inv ( A )
8 print ( " Inverse of A :\\ n " , inv_A )

Illustrative Numerical Example: For the matrix A given above, its inverse A−1 can be
computed:  
0.4 −0.2
A−1 =
−0.2 0.6
The product A · A−1 confirms the identity matrix:
 
1 0
A · A−1 =
0 1

This computation illustrates that multiplying A by A−1 yields the identity matrix, verifying the
properties of inverse matrices.

8.3 Pseudo Inverse


Linear Algebra Concept: The pseudo-inverse, or Moore-Penrose inverse, of a matrix is partic-
ularly useful for matrices that are not square or are singular. It is denoted as A+ and is used to
compute solutions to systems of linear equations that may not have a unique solution. This is often
used in least squares data fitting and calculating optimal solutions to overdetermined systems.
NumPy Implementation: NumPy provides the ‘linalg.pinv‘ function to compute the pseudo-
inverse of any matrix, even if it is not invertible.

Anis Koubaa Page 18 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

1 import numpy as np
2
3 # Define a matrix B , possibly non - square or singular
4 B = np . array ([[1 , 2] , [3 , 4] , [5 , 6]])
5
6 # Calculate the pseudo - inverse of B
7 pseudo_inv_B = np . linalg . pinv ( B )
8 print ( " Pseudo - inverse of B :\\ n " , pseudo_inv_B )

Illustrative Numerical Example: The pseudo-inverse B + for the matrix B is:


 
+ −1.944 −0.444 1.056
B =
1.611 0.389 −0.833

This matrix, when used to multiply B, approximates an identity matrix, reflecting the least squares
solution to an overdetermined system. This concept is crucial in applications like signal processing
and regression analysis where exact solutions are not feasible.

9 Determinants and Rank


9.1 Determinants
Linear Algebra Concept: The determinant is a scalar value that can be calculated from the
elements of a square matrix. It provides insights into the matrix’s properties, indicating whether
the matrix is invertible (a determinant of zero means the matrix is not invertible) and the volume
factor of the linear transformation represented by the matrix.
General Equation for Determinants: For a matrix
 
a11 a12 · · · a1n
 a21 a22 · · · a2n 
A= .
 
. .. . . .. 
 . . . . 
an1 an2 · · · ann

the determinant can be calculated using the Leibniz formula, which is a sum over permutations of
the matrix indices:
X n
Y
det(A) = sgn(σ) ai,σ(i)
σ∈Sn i=1

where σ ranges over all permutations of n elements, and sgn(σ) is the sign of the permutation (+1
for even permutations and −1 for odd permutations).
NumPy Implementation:
1 import numpy as np
2
3 # Define an n x n matrix
4 A = np . array ([[ a11 , a12 , ... , a1n ] , [ a21 , a22 , ... , a2n ] , ... , [ an1 , an2 , ... , ann
]])
5 det_A = np . linalg . det ( A )
6 print ( " Determinant of A : " , det_A )

Anis Koubaa Page 19 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

9.2 Rank of a Matrix


Linear Algebra Concept: The rank of a matrix is defined as the maximum number of linearly
independent rows or columns, which is also the dimension of the row space or column space of the
matrix.
General Equation for Rank: The rank of an m × n matrix A can be determined by reducing
the matrix to its Row Echelon Form (REF) and counting the non-zero rows. Mathematically, this
process involves elementary row operations that do not change the row space of the matrix.
NumPy Implementation:
1 import numpy as np
2
3 # Define an m x n matrix A
4 A = np . array ([[4 , 5 , 6] , [7 , 8 , 9] , [1 , 1 , 1]])
5 rank_A = np . linalg . matrix_rank ( A )
6 print ( " Rank of A : " , rank_A )

9.3 Trace of a Matrix


Linear Algebra Concept: The trace of a square matrix is the sum of the elements on its main
diagonal. The trace operation is important because it is invariant under change of basis and gives
the sum of the eigenvalues of the matrix.
General Equation for Trace: For a square matrix
 
a11 a12 · · · a1n
 a21 a22 · · · a2n 
A= .
 
.. .. .. 
 .. . . . 
an1 an2 · · · ann
the trace of A, denoted as tr(A), is calculated as:
tr(A) = a11 + a22 + · · · + ann
NumPy Implementation:
1 import numpy as np
2
3 # Define a square matrix A
4 A = np . array ([[4 , 5 , 6] , [7 , 8 , 9] , [10 , 11 , 12]])
5 trace_A = np . trace ( A )
6 print ( " Trace of A : " , trace_A )

10 Eigenvalues and Eigenvectors


10.1 Fundamentals
Linear Algebra Concept: Eigenvalues (λ) and eigenvectors (v) of a square matrix A are funda-
mental in understanding the behavior of linear transformations represented by A. The eigenvector
v is a non-zero vector that only scales (by λ) when the matrix A is applied to it, thus satisfying:
Av = λv

Anis Koubaa Page 20 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

This relationship can be explored by rearranging the terms:

(A − λI)v = 0

where I is the identity matrix of the same dimension as A. This equation implies that for non-trivial
solutions, the matrix (A − λI) must be singular, which leads to the characteristic equation:

det(A − λI) = 0

Solving this polynomial in λ gives the eigenvalues, and substituting each λ back into the equation
provides the corresponding eigenvectors.
NumPy Implementation:
1 import numpy as np
2
3 # Define a matrix A
4 A = np . array ([[4 , 2] , [1 , 3]])
5
6 # Calculate eigenvalues and eigenvectors
7 eigenvalues , eigenvectors = np . linalg . eig ( A )
8 print ( " Eigenvalues : " , eigenvalues )
9 print ( " Eigenvectors :\\ n " , eigenvectors )

Numerical Example: Consider the matrix A:


 
4 2
A=
1 3

Finding Eigenvalues: To find the eigenvalues λ, solve the characteristic equation:

det(A − λI) = 0

First, set up A − λI:  


4−λ 2
A − λI =
1 3−λ
The determinant of this matrix is:

det(A − λI) = (4 − λ)(3 − λ) − 2 · 1 = λ2 − 7λ + 10

Solve the quadratic equation λ2 − 7λ + 10 = 0 using the quadratic formula:



7 ± 49 − 40 7±3
λ= =
2 2
λ1 = 5, λ2 = 2
Finding Eigenvectors: For each eigenvalue, solve (A − λI)v = 0 to find the corresponding
eigenvectors.
Eigenvector for λ1 = 5:  
−1 2
A − 5I =
1 −2

Anis Koubaa Page 21 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

Row reduce this matrix:    


−1 2 1 −2

1 −2 0 0
From x1 − 2x2 = 0, let x2 = t. Then x1 = 2t.
 
2
v1 = t (where t ̸= 0, typically t = 1)
1

Choosing t = 1 gives:  
2
v1 =
1
Eigenvector for λ2 = 2:  
2 2
A − 2I =
1 1
Row reduce this matrix:    
2 2 1 1

1 1 0 0
From x1 + x2 = 0, let x2 = t. Then x1 = −t.
 
−1
v2 = t (where t ̸= 0, typically t = 1)
1

Choosing t = 1 gives:  
−1
v2 =
1

10.2 Diagonalization
Linear Algebra Concept: Diagonalization is the process of transforming a matrix into a diagonal
form that reflects its eigenvalues, which is possible if the matrix has enough linearly independent
eigenvectors to form a basis. A matrix A is diagonalizable if it can be expressed as:

A = P DP −1

where P is the matrix of column eigenvectors of A, and D is the diagonal matrix containing the
eigenvalues of A. Each column vi of P corresponds to an eigenvalue λi in D.
NumPy Implementation:
1 # Assuming eigenvalues and eigenvectors from previous code
2 P = eigenvectors
3 D = np . diag ( eigenvalues )
4 P_inv = np . linalg . inv ( P )
5 A_diag = np . dot ( np . dot (P , D ) , P_inv )
6 print ( " Diagonalized matrix A :\\ n " , A_diag )

Verification of Diagonalization:
  With the eigenvalues λ1 = 5 and λ2 = 2 and the cor-
2 −1
responding eigenvectors v1 = and v2 = computed previously, we can perform matrix
1 1
diagonalization and verify it by reconstructing the original matrix A.

Anis Koubaa Page 22 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

Setup the Matrix P and D: Matrix P , containing the eigenvectors as columns:


 
2 −1
P =
1 1

Matrix D, a diagonal matrix containing the eigenvalues:


 
5 0
D=
0 2

Calculate P −1 : Inverse of P :
   
−1 1 1 1 1 1 1
P = =
det(P ) −1 2 3 −1 2

Reconstruct A by Calculating P DP −1 : Multiply the matrices to reconstruct A:


    
2 −1 5 0 1 1 1
A = P DP −1 =
1 1 0 2 3 −1 2
 
4 2
A=
1 3

11 Principal Component Analysis (PCA)


11.1 Understanding PCA
Linear Algebra Concept:
• PCA is a technique for reducing the dimensionality of data while preserving as much variance
as possible.
• It involves transforming the data into a new coordinate system where the axes (principal
components) are directions of maximum variance.
• This transformation is achieved by finding the eigenvalues and eigenvectors of the covariance
matrix of the data.
Steps to Perform PCA:
1. Center the Data: Subtract the mean of each feature from the dataset X to obtain a mean-
centered matrix Xc .
2. Compute the Covariance Matrix:
1
Cov(Xc ) = X T Xc
n−1 c
where Xc is the mean-centered data matrix and n is the number of samples.
3. Find Eigenvalues and Eigenvectors: Compute the eigenvalues λi and eigenvectors vi of
the covariance matrix.

Anis Koubaa Page 23 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

4. Sort and Select Principal Components: Sort the eigenvalues in descending order and
select the top k eigenvectors to form the transformation matrix.
5. Transform the Data: Project the data onto the new subspace:
XPCA = Xc · Vk
where Vk is the matrix of the top k eigenvectors.
NumPy Implementation:
1 import numpy as np
2
3 # Define a mean - centered data matrix X_c (3 x2 for simplicity )
4 X_c = np . array ([[1 , 2] , [3 , 4] , [5 , 6]]) - np . mean ( np . array ([[1 , 2] , [3 , 4] , [5 ,
6]]) , axis =0)
5
6 # Compute the covariance matrix
7 cov_matrix = np . cov ( X_c , rowvar = False )
8
9 # Compute eigenvalues and eigenvectors
10 eigenvalues , eigenvectors = np . linalg . eig ( cov_matrix )
11 print ( " Covariance Matrix :\\ n " , cov_matrix )
12 print ( " Eigenvalues : " , eigenvalues )
13 print ( " Eigenvectors :\\ n " , eigenvectors )
14
15 # Select the top k eigenvectors ( here , k =1)
16 k = 1
17 V_k = eigenvectors [: , : k ]
18
19 # Transform the data
20 X_pca = X_c @ V_k
21 print ( " Transformed data :\\ n " , X_pca )

11.2 Covariance Matrix and Eigenvalues


Mathematical Explanation:
• The covariance matrix captures the variance and the relationship between different features
of the mean-centered data:
1
Cov(Xc ) = X T Xc
n−1 c
• The eigenvalues λi represent the variance captured by each principal component.
• The corresponding eigenvectors vi indicate the direction of each principal component.
Numerical Example: Consider the covariance matrix Cov(Xc ) calculated from the centered
data:    
σ11 σ12 8.33 8.33
Cov(Xc ) = =
σ21 σ22 8.33 8.33
Calculating the eigenvalues and eigenvectors:
λ1 = 16.66, λ2 = 0
"√ # " √ #
2
√2
− 2
v1 = 2
, v2 = √22
2 2

Anis Koubaa Page 24 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

11.3 Applications of PCA


Concepts:
• Dimensionality Reduction: PCA projects the data onto the top k principal components
to reduce dimensionality while preserving variance.
• Feature Extraction: Identifies new axes of maximum variance, allowing for improved data
representation and simplification.
Practical Application:
• Apply PCA to reduce data dimensionality, simplify models, and visualize high-dimensional
datasets.
Reconstructing Data:
• The approximate reconstruction of the original data using the top k principal components is:
Xreconstructed = XPCA · VkT + µ
where µ is the mean vector used for centering.

12 Singular Value Decomposition (SVD)


12.1 Understanding SVD
Linear Algebra Concept:
• Singular Value Decomposition (SVD) is a factorization of a matrix into three components:
A = U ΣV T
where A is an m × n matrix.
– U is an m × m orthogonal matrix whose columns are the left-singular vectors of A.
– Σ is an m × n diagonal matrix with non-negative real numbers on the diagonal known
as singular values.
– V T is the transpose of an n × n orthogonal matrix whose columns are the right-singular
vectors of A.
• The singular values in Σ are the square roots of the eigenvalues of both AT A and AAT .
NumPy Implementation:
1 import numpy as np
2
3 # Define a matrix A
4 matrix = np . array ([[1 , 2] , [3 , 4] , [5 , 6]])
5
6 # Perform Singular Value Decomposition
7 u , s , vh = np . linalg . svd ( matrix )
8 print ( " U matrix :\\ n " , u )
9 print ( " Singular values :\\ n " , s )
10 print ( " V ^ T matrix :\\ n " , vh )

Anis Koubaa Page 25 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

12.2 Applications of SVD


Concepts:
• Dimensionality Reduction: SVD is used to reduce the number of features in data while
retaining the most critical information. This is achieved by truncating Σ to keep only the top
k largest singular values and corresponding columns of U and rows of V T :

Ak = Uk Σk VkT

where Uk , Σk , and VkT are truncated versions of U , Σ, and V T .

• Noise Reduction: Smaller singular values often correspond to noise or less significant in-
formation. Discarding these values helps in cleaning the data, enhancing pattern recognition
and interpretation.
Practical Application:
1 # Assuming top 1 singular value to reduce d imensio nality
2 k = 1
3 A_k = np . dot ( u [: , : k ] * s [: k ] , vh [: k , :])
4 print ( " Approximated matrix A_k :\\ n " , A_k )

Given the matrix A:  


1 2 3 4
A = 5 6 7 8
9 10 11 12

12.3 Performing SVD


Performing the Singular Value Decomposition of A yields the matrices U , Σ, and V T , where:
Matrix U (Left-singular vectors):
 
u11 u12 u13
U = u21 u22 u23 
u31 u32 u33

Singular values (arranged in Σ):


 
σ1 0 0 0
Σ=0 σ2 0 0
0 0 σ3 0

Matrix V T (Right-singular vectors):


 
v11 v12 v13 v14
V T = v21 v22 v23 v24 
v31 v32 v33 v34

Anis Koubaa Page 26 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

12.4 Matrix Reconstruction


Using the SVD components, the original matrix A can be reconstructed as follows:

A = U ΣV T

Expanding this product, we express A as a sum of outer products scaled by the singular values:

A = σ1 u1 v1T + σ2 u2 v2T + σ3 u3 v3T

where ui and vi are columns of U and rows of V T , respectively.

12.5 Numercial Example


Performing the Singular Value Decomposition of A yields the matrices U , Σ, and V T , with the
following results:
Matrix U (Left-singular vectors):
 
0.20673589 0.88915331 0.40824829
U = 0.51828874 0.25438183 −0.81649658
0.82984158 −0.38038964 0.40824829

Singular values (arranged in Σ):


 
2.54368356e + 01 0 0 0
Σ= 0 1.72261225e + 00 0 0
0 0 5.14037515e − 16 0

Matrix V T (Right-singular vectors):


 
0.40361757 0.46474413 0.52587069 0.58699725
V T = −0.73286619 −0.28984978 0.15316664 0.59618305
0.44527162 −0.83143156 0.32704826 0.05911168

12.6 Matrix Reconstruction


Using the SVD components, the original matrix A can be reconstructed as follows:

A = U ΣV T

Expanding this product, the reconstruction of A involves summing the outer products of the singular
vectors scaled by the corresponding singular values:
   
0.20673589  T 0.88915331 
A = 25.4368356·0.51828874 0.40361757 0.46474413 0.52587069 0.58699725 +1.72261225· 0.25438183  −0.7
0.82984158 −0.38038964

Anis Koubaa Page 27 NumPy for Data Science Cheat Sheet


CS316: Introduction to Data Science Anis Koubaa, October 2024

13 Interpretation of SVD Components


13.1 Matrix U (Left-Singular Vectors)
• Mathematical Significance: U contains the left-singular vectors of A, forming an orthog-
onal matrix. These vectors define orthonormal bases for the row space transformations of
A.
• Practical Interpretation: In data science, the columns of U represent the principal direc-
tions of data variation when projected into a lower-dimensional space, ordered by importance.

13.2 Matrix Σ (Singular Values)


• Mathematical Significance: Σ is a diagonal matrix with singular values which are the
square roots of the eigenvalues of AT A. These values are sorted in descending order.
• Practical Interpretation: Singular values quantify the contribution of each singular vector
to the overall data structure. They measure the ”strength” or ”information content” of each
principal direction.

13.3 Matrix V T (Right-Singular Vectors)


• Mathematical Significance: V T , containing the right-singular vectors, is another orthog-
onal matrix. It represents transformations within the column space of A.
• Practical Interpretation: The rows of V T (columns of V ) are weights for combining the
principal components to reconstruct original features, revealing underlying patterns in data.

13.4 Application in Reconstruction


• Combining U , Σ, and V T reconstructs the original matrix A or its approximation by reducing
dimensions, thereby summarizing essential features with minimal data loss, used extensively
for noise reduction and data compression.

Anis Koubaa Page 28 NumPy for Data Science Cheat Sheet

You might also like