# [ Data Preprocessing with NumPy ] {CheatSheet}
Basics and Array Creation:
● Create NumPy Array: np.array([1, 2, 3])
● Array Shape: array.shape
● Array Dimensions: array.ndim
● Array Size: array.size
● Reshape Array: array.reshape((rows, cols))
● Concatenate Arrays Vertically: np.vstack((array1, array2))
● Concatenate Arrays Horizontally: np.hstack((array1, array2))
● Transpose Array: array.T
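A minimal runnable sketch of the creation and reshaping calls above, on a toy array (values and variable names are illustrative):
```python
import numpy as np

array = np.array([1, 2, 3, 4, 5, 6])
print(array.shape, array.ndim, array.size)   # (6,) 1 6

matrix = array.reshape((2, 3))               # 2 rows x 3 columns
stacked_v = np.vstack((matrix, matrix))      # shape (4, 3)
stacked_h = np.hstack((matrix, matrix))      # shape (2, 6)
print(matrix.T.shape)                        # transpose -> (3, 2)
```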
Indexing and Slicing:
● Indexing: array[0]
● Slicing: array[1:4]
● Boolean Indexing: array[array > 5]
● Fancy Indexing: array[[1, 3, 5]]
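A quick sketch of the indexing patterns above on a small example array:
```python
import numpy as np

array = np.arange(10)              # [0 1 2 ... 9]
print(array[0])                    # plain indexing -> 0
print(array[1:4])                  # slicing -> [1 2 3]
print(array[array > 5])            # boolean indexing -> [6 7 8 9]
print(array[[1, 3, 5]])            # fancy indexing -> [1 3 5]
```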
Missing Data:
● Replace NaN with Zero: np.nan_to_num(array)
● Remove NaN Values: array = array[~np.isnan(array)]
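A short example, assuming a float array with a few NaN entries:
```python
import numpy as np

array = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
filled = np.nan_to_num(array)          # NaN replaced with 0.0
cleaned = array[~np.isnan(array)]      # NaN entries dropped -> [1. 3. 5.]
```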
Mathematical Operations:
● Element-wise Addition: array1 + array2
● Element-wise Multiplication: array1 * array2
● Matrix Multiplication: np.dot(matrix1, matrix2)
● Element-wise Square Root: np.sqrt(array)
Statistical Operations:
● Mean: np.mean(array)
● Median: np.median(array)
● Standard Deviation: np.std(array)
● Variance: np.var(array)
● Minimum Value: np.min(array)
● Maximum Value: np.max(array)
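A small worked example of these statistics (values chosen so the results are round numbers):
```python
import numpy as np

array = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(np.mean(array), np.median(array))   # 5.0 4.5
print(np.std(array), np.var(array))       # 2.0 4.0
print(np.min(array), np.max(array))       # 2.0 9.0
```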
Data Cleaning:
● Remove Duplicates: np.unique(array)
● Replace Values: np.where(array == 0, 1, array)
● Clip Values: np.clip(array, min_val, max_val)
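Sketch of the cleaning calls on toy data:
```python
import numpy as np

array = np.array([0, 1, 2, 2, 3, 100])
unique_vals = np.unique(array)              # duplicates removed, sorted
replaced = np.where(array == 0, 1, array)   # zeros replaced with 1
clipped = np.clip(array, 0, 10)             # values forced into [0, 10]
```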
Filtering and Sorting:
● Filter by Condition: array[array > threshold]
● Sort Array: np.sort(array)
● Sort by Column/Axis: np.sort(array, axis=0)
Random Sampling:
● Random Permutation: np.random.permutation(array)
● Random Sampling with Replacement: np.random.choice(array, size=n, replace=True)
● Shuffle Array (in place): np.random.shuffle(array)
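Sketch using the legacy np.random API shown above; the seed is only for reproducibility:
```python
import numpy as np

np.random.seed(0)
array = np.arange(10)
permuted = np.random.permutation(array)                  # shuffled copy
sample = np.random.choice(array, size=5, replace=True)   # sampling with replacement
np.random.shuffle(array)                                 # shuffles array in place
```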
Vectorization:
● Vectorized Operations: np.vectorize(function)(array)
File I/O:
● Read CSV: np.genfromtxt('data.csv', delimiter=',')
● Write CSV: np.savetxt('data.csv', array, delimiter=',')
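Round-trip sketch; 'data.csv' is just a placeholder path:
```python
import numpy as np

array = np.array([[1.5, 2.5], [3.5, 4.5]])
np.savetxt('data.csv', array, delimiter=',')          # write as CSV
loaded = np.genfromtxt('data.csv', delimiter=',')     # read it back
```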
Linear Algebra:
● Dot Product: np.dot(array1, array2)
● Matrix Inversion: np.linalg.inv(matrix)
● Eigenvalues and Eigenvectors: eigenvalues, eigenvectors = np.linalg.eig(matrix)
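Small worked example with a 2x2 matrix:
```python
import numpy as np

vector1, vector2 = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(np.dot(vector1, vector2))                # dot product -> 11.0

matrix = np.array([[4.0, 2.0], [1.0, 3.0]])
inverse = np.linalg.inv(matrix)
print(np.dot(matrix, inverse))                 # approximately the identity matrix
eigenvalues, eigenvectors = np.linalg.eig(matrix)   # eigenvalues 5.0 and 2.0
```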
Broadcasting:
● Broadcasting: array += 5
Data Transformation:
● Log Transformation: np.log(array)
● Exponential Transformation: np.exp(array)
● Box-Cox Transformation: transformed, fitted_lambda = scipy.stats.boxcox(array)
Scaling and Normalization:
● Min-Max Scaling: (array - array.min()) / (array.max() - array.min())
● Standardization: (array - np.mean(array)) / np.std(array)
● Z-Score Transformation: scipy.stats.zscore(array)
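Sketch showing that the manual standardization and scipy.stats.zscore agree on toy data:
```python
import numpy as np
from scipy import stats

array = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
min_max = (array - array.min()) / (array.max() - array.min())   # scaled into [0, 1]
standardized = (array - np.mean(array)) / np.std(array)         # mean 0, std 1
z_scores = stats.zscore(array)                                  # same values as standardized
```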
Handling Categorical Data:
● One-Hot Encoding: np.eye(num_classes)[array]
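Sketch of the identity-matrix one-hot trick, assuming integer labels that start at 0:
```python
import numpy as np

labels = np.array([0, 2, 1, 2])          # integer class labels
num_classes = 3
one_hot = np.eye(num_classes)[labels]    # one row per label, a single 1 per row
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```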
Reshaping and Flattening:
● Flatten Array: array.flatten()
● Ravel Array: np.ravel(array)
Interpolation:
● Linear Interpolation: np.interp(x, xp, yp)
Polynomial Fitting:
● Polynomial Fitting: np.polyfit(x, y, degree)
Time Series Operations:
● Time Lag Transformation: np.roll(array, shift=n)
● Moving Average: np.convolve(array, np.ones(window)/window, mode='valid')
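Sketch on a short series; note that np.roll wraps values around rather than introducing NaN:
```python
import numpy as np

series = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
lagged = np.roll(series, shift=1)        # [5. 1. 2. 3. 4.], last value wraps to the front
window = 3
moving_avg = np.convolve(series, np.ones(window) / window, mode='valid')
# -> [2. 3. 4.]
```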
Image Processing:
● Image Resizing: scipy.ndimage.zoom(image, zoom=(2, 2, 1))
● Image Rotation: scipy.ndimage.rotate(image, angle=45, reshape=False)
Handling Strings:
● String Operations on Array: np.char.add(array1, array2)
Set Operations:
● Set Union: np.union1d(array1, array2)
● Set Intersection: np.intersect1d(array1, array2)
● Set Difference: np.setdiff1d(array1, array2)
Handling Dates:
● Convert to DateTime: np.datetime64('2022-01-01')
● Date Arithmetic: np.datetime64('2022-01-01') + np.timedelta64(5,
'D')
Handling Complex Numbers:
● Create Complex Numbers: complex_array = real + 1j * imag
● Complex Conjugate: np.conj(complex_array)
Handling Inf and NaN:
● Replace Inf with Max Finite Value: array[np.isinf(array)] = np.max(array[np.isfinite(array)])
● Replace NaN with Mean: array[np.isnan(array)] = np.nanmean(array)
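Sketch on a toy array containing both inf and NaN; replacing the infs first keeps the mean finite:
```python
import numpy as np

array = np.array([1.0, np.inf, 3.0, np.nan, 5.0])
array[np.isinf(array)] = np.max(array[np.isfinite(array)])   # inf -> largest finite value (5.0)
array[np.isnan(array)] = np.nanmean(array)                   # NaN -> mean of remaining values (3.5)
```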
Distance Metrics:
● Euclidean Distance: np.linalg.norm(vector1 - vector2)
● Cosine Similarity: np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
Statistical Testing:
● T-Test for Independent Samples: t_stat, p_value = scipy.stats.ttest_ind(sample1, sample2)
● ANOVA Test: f_stat, p_value = scipy.stats.f_oneway(group1, group2, group3)
Outlier Detection:
● Z-Score Outliers: z_scores = scipy.stats.zscore(array)
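Sketch of a t-test plus z-score outlier flagging on synthetic samples; the 3-sigma cutoff is a common convention, not a rule:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample1 = rng.normal(loc=0.0, scale=1.0, size=50)
sample2 = rng.normal(loc=0.5, scale=1.0, size=50)
t_stat, p_value = stats.ttest_ind(sample1, sample2)    # independent two-sample t-test

z_scores = stats.zscore(np.concatenate([sample1, sample2]))
outliers = np.abs(z_scores) > 3                        # flag values beyond 3 standard deviations
```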
Handling Logarithmic Data:
● Log Transformation for Skewed Data: log_array =
np.log1p(skewed_array)
Handling Exponential Data:
● Exponential Transformation for Highly Skewed Data: exp_array = np.exp(original_array)
Handling Power Law Data:
● Power Transformation: power_transformed_array, lambda_value = scipy.stats.boxcox(array)
● Yeo-Johnson Transformation: yeo_johnson_transformed_array, lambda_value = scipy.stats.yeojohnson(array)
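Sketch of the skew-reducing transforms on synthetic lognormal data; Box-Cox needs strictly positive inputs, Yeo-Johnson does not:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed_array = rng.lognormal(mean=0.0, sigma=1.0, size=1000)    # strictly positive, right-skewed

log_array = np.log1p(skewed_array)                              # log(1 + x), safe at zero
boxcox_values, boxcox_lambda = stats.boxcox(skewed_array)       # positive inputs only
yeojohnson_values, yj_lambda = stats.yeojohnson(skewed_array)   # also handles zero/negative data
```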
Principal Component Analysis (PCA):
● PCA: pca = PCA(n_components=2); transformed_data =
pca.fit_transform(data)
Singular Value Decomposition (SVD):
● SVD: U, S, Vt = np.linalg.svd(matrix)
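Sketch relating scikit-learn's PCA and np.linalg.svd on random data:
```python
import numpy as np
from sklearn.decomposition import PCA

data = np.random.default_rng(0).normal(size=(100, 5))
pca = PCA(n_components=2)
transformed_data = pca.fit_transform(data)                       # shape (100, 2)

U, S, Vt = np.linalg.svd(data - data.mean(axis=0), full_matrices=False)
# S**2 / (len(data) - 1) matches the variances PCA reports as explained_variance_
```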
Handling Outliers:
● Winsorizing Outliers: winsorized_array = scipy.stats.mstats.winsorize(original_array, limits=[0.05, 0.05])
Time Window Operations:
● Rolling Window Mean: rolling_mean = pd.Series(array).rolling(window=3).mean()
Interpolation:
● Linear Interpolation: interpolated_values = np.interp(x, xp, yp)
Handling JSON Data:
● Convert NumPy Array to JSON: json_data = json.dumps(array.tolist())
● Convert JSON to NumPy Array: numpy_array = np.array(json.loads(json_data))
Handling CSV Data:
● Read CSV into NumPy Array: data = np.genfromtxt('data.csv', delimiter=',')
● CSV File Reading with Pandas: data = pd.read_csv('data.csv').values
Handling Excel Data:
● Read Excel into NumPy Array: data = pd.read_excel('data.xlsx', header=None).values
Handling Text Data:
● Convert Text to NumPy Array: text_array = np.array(list(text))
● Tokenization with CountVectorizer: vectorizer = sklearn.feature_extraction.text.CountVectorizer(); tokenized_matrix = vectorizer.fit_transform(text_data)
● TF-IDF Transformation: tfidf_transformer = sklearn.feature_extraction.text.TfidfTransformer(); tfidf_matrix = tfidf_transformer.fit_transform(count_matrix)
Handling Time Series Data:
● Time Series Rolling Mean: rolling_mean = pd.Series(array).rolling(window=3).mean()
● Time Series Differencing: differenced_series = np.diff(time_series, n=1)
Handling Multidimensional Arrays:
● Reshape to 3D Array: reshaped_array =
original_array.reshape((num_samples, num_rows, num_cols))
Handling Spatial Data:
● Distance between Two Points in 2D Space: distance = np.linalg.norm(point1 - point2)
● Calculate Haversine Distance: haversine_distance = haversine(lon1,
lat1, lon2, lat2)
Data Binning:
● Binning Numerical Data: binned_data = np.digitize(array, bins)
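Sketch of np.digitize with explicit bin edges:
```python
import numpy as np

array = np.array([0.2, 1.4, 2.5, 6.2, 9.7])
bins = np.array([0, 1, 2, 5, 10])
binned_data = np.digitize(array, bins)   # bin index per value -> [1 2 3 4 4]
```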
Handling Imbalanced Data:
● Under-sampling with Random Choice: undersampled_data = np.concatenate([np.random.choice(data[data_label == label], size=min_class_samples) for label in unique_labels])
● Over-sampling with Repetition: oversampled_data = np.concatenate([data[data_label == label] for _ in range(int(max_class_samples / min_class_samples))])
● Synthetic Over-sampling with SMOTE: oversampled_data,
oversampled_labels = SMOTE().fit_resample(data, labels)
Handling Image Data:
● Flatten 2D Image: flat_image = image.flatten()
● Reshape 1D Image to 2D: reshaped_image =
flat_image.reshape((height, width))
● Convert Image to Grayscale: grayscale_image = np.dot(image[..., :3], [0.2989, 0.5870, 0.1140])
● Resize Image: resized_image = skimage.transform.resize(image, (new_height, new_width), mode='constant')
● Image Rotation with SciPy: rotated_image = scipy.ndimage.rotate(image, angle=45, reshape=False)
● Image Histogram Equalization: equalized_image = skimage.exposure.equalize_hist(image)
● Image Gaussian Blurring: blurred_image = scipy.ndimage.gaussian_filter(image, sigma=2)
● Image Edge Detection: edges = skimage.feature.canny(image, sigma=1)
● Image Segmentation with K-Means Clustering (SLIC superpixels): segmented_image = skimage.segmentation.slic(image, n_segments=100)
● Image Feature Extraction with Histogram of Oriented Gradients (HOG): features, hog_image = skimage.feature.hog(image, visualize=True)
● Image Cropping: cropped_image = original_image[y1:y2, x1:x2]
● Image Histogram: hist, bins = np.histogram(image.flatten(), bins=256, range=[0, 256])
● Image Thresholding: thresholded_image = cv2.threshold(grayscale_image, threshold_value, 255, cv2.THRESH_BINARY)[1]
● Image Morphological Operations: kernel = np.ones((5, 5), np.uint8); morph_image = cv2.morphologyEx(thresh_image, cv2.MORPH_OPEN, kernel)
● Image Contour Detection: contours, hierarchy = cv2.findContours(thresh_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
● Image Color Space Conversion: hsv_image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
● Image Filtering with OpenCV: filtered_image = cv2.bilateralFilter(image, d=9, sigmaColor=75, sigmaSpace=75)
● Image Edge Detection with OpenCV: edges = cv2.Canny(image, low_threshold, high_threshold)
● Image Feature Extraction with OpenCV: sift = cv2.SIFT_create(); keypoints, descriptors = sift.detectAndCompute(gray_image, None)
● Image Template Matching with OpenCV: result = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
● Image Superpixel Segmentation with OpenCV: slic = cv2.ximgproc.createSuperpixelSLIC(image, algorithm=cv2.ximgproc.SLIC, region_size=10); slic.iterate(); segments = slic.getLabels()
● Image Corner Detection with OpenCV: corners = cv2.goodFeaturesToTrack(image, maxCorners=25, qualityLevel=0.01, minDistance=10)
● Image Affine Transformation with OpenCV: rows, cols = image.shape[:2]; M = cv2.getRotationMatrix2D((cols/2, rows/2), angle, scale); rotated_image = cv2.warpAffine(image, M, (cols, rows))
● Image Perspective Transformation with OpenCV: pts1 = np.float32([[56,65],[368,52],[28,387],[389,390]]); pts2 = np.float32([[0,0],[300,0],[0,300],[300,300]]); M = cv2.getPerspectiveTransform(pts1, pts2); transformed_image = cv2.warpPerspective(image, M, (300,300))
● Image Color Histogram with OpenCV: hist = cv2.calcHist([image], [0, 1, 2], None, [256, 256, 256], [0, 256, 0, 256, 0, 256])
● Image Color Quantization with K-Means Clustering: image_reshaped = image.reshape((-1, 3)); kmeans = KMeans(n_clusters=k).fit(image_reshaped); quantized_image = kmeans.cluster_centers_.astype(int)[kmeans.labels_].reshape(image.shape)
Advanced Operations with NumPy:
● Handling Sparse Data: sparse_matrix = scipy.sparse.csr_matrix(array)
● Matrix Factorization with NMF: nmf = sklearn.decomposition.NMF(n_components=2); W = nmf.fit_transform(data); H = nmf.components_
● Sparse Matrix Operations: result = sparse_matrix1.dot(sparse_matrix2)
Handling HDF5 Data:
● Read HDF5 File into NumPy Array: data =
pd.read_hdf('data.h5').values
Handling XML Data:
● XML Parsing with BeautifulSoup: soup = BeautifulSoup(xml_data, 'xml'); values = [float(tag.text) for tag in soup.find_all('value')]
Handling SQLite Data:
● Read SQLite Database into NumPy Array: connection = sqlite3.connect('database.db'); query = 'SELECT * FROM table_name'; data = pd.read_sql(query, connection).values
Handling Pickle Data:
● Read Pickle File into NumPy Array: with open('data.pkl', 'rb') as f: data = pickle.load(f)
Handling Avro Data:
● Read Avro File into NumPy Array: import fastavro; with open('data.avro', 'rb') as f: records = list(fastavro.reader(f))
Handling Parquet Data:
● Read Parquet File into NumPy Array: import pyarrow.parquet as pq; table = pq.read_table('data.parquet'); data = table.to_pandas().values
Handling Feather Data:
● Read Feather File into NumPy Array: import pyarrow.feather as feather; table = feather.read_table('data.feather'); data = table.to_pandas().values
Handling Video Data:
● Read Video Frames into NumPy Array: video_capture = cv2.VideoCapture('video.mp4'); video_frames = []; success, frame = video_capture.read(); while success: video_frames.append(frame); success, frame = video_capture.read(); video_array = np.array(video_frames)
Handling Audio Data:
● Read Audio File into NumPy Array: import librosa; audio_data, sampling_rate = librosa.load('audio.wav', sr=None)
Handling NumPy Datetime:
● NumPy Datetime Operations: date1 = np.datetime64('2022-01-01');
date2 = np.datetime64('2022-01-05'); days_difference = date2 -
date1
Handling Complex Numbers:
● Complex Numbers Operations: complex_result = complex_array1 +
complex_array2
Handling Units:
● Convert Units with Pint: import pint; ureg = pint.UnitRegistry(); quantity = 5 * ureg.meter; converted_quantity = quantity.to(ureg.foot)
Handling Heterogeneous Data:
● Structured Arrays: structured_array = np.array([(1, 'John', 25), (2, 'Alice', 30)], dtype=[('id', int), ('name', 'U10'), ('age', int)])
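Sketch of building and querying a structured array (field names and records are illustrative):
```python
import numpy as np

structured_array = np.array(
    [(1, 'John', 25), (2, 'Alice', 30)],
    dtype=[('id', int), ('name', 'U10'), ('age', int)],
)
print(structured_array['name'])                          # field access -> ['John' 'Alice']
print(structured_array[structured_array['age'] > 26])    # row filtering by a field
```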
Handling Point Cloud Data:
● Point Cloud Operations with Open3D: import open3d; point_cloud = open3d.io.read_point_cloud('point_cloud.ply'); downsampled_cloud = point_cloud.voxel_down_sample(voxel_size=0.05)
By: Waleed Mousa