
# [ Feature Engineering ] [ Extended-cheatsheet ]

1. Data Preprocessing

1.1 Handling Missing Values

● Check for missing values: df.isnull().sum()
● Drop rows with missing values: df.dropna()
● Fill missing values with a specific value: df.fillna(value)
● Fill missing values with mean: df.fillna(df.mean(numeric_only=True))
● Fill missing values with median: df.fillna(df.median(numeric_only=True))
● Fill missing values with mode: df.fillna(df.mode().iloc[0])
● Fill missing values with forward fill: df.ffill() (fillna(method='ffill') is deprecated in recent pandas)
● Fill missing values with backward fill: df.bfill()
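A minimal sketch combining the strategies above on a toy frame (the column names and fill choices are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["NY", None, "LA"]})

df["age"] = df["age"].fillna(df["age"].median())            # numeric: median
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])   # categorical: mode
```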

1.2 Encoding Categorical Variables

● One-hot encoding: pd.get_dummies(df, columns=['column_name'])
● Label encoding: from sklearn.preprocessing import LabelEncoder; LabelEncoder().fit_transform(df['column_name'])
● Ordinal encoding: from sklearn.preprocessing import OrdinalEncoder; OrdinalEncoder().fit_transform(df[['column_name']])
● Binary encoding: df['binary_column'] = np.where(df['column_name'] == 'value', 1, 0)
● Frequency encoding: df['freq_encoded'] = df.groupby('column_name')['column_name'].transform('count')
● Mean encoding: df['mean_encoded'] = df.groupby('column_name')['target'].transform('mean') (see the leakage sketch below)
● Weight of Evidence (WoE) encoding: p = df.groupby('column_name')['target'].transform('mean'); df['woe'] = np.log(p / (1 - p)) (use transform, not a plain groupby mean, so the values align row by row)
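Target-based encoders (mean, WoE) leak label information if fit on the full dataset. A minimal sketch, with illustrative column names, that fits the mapping on training rows only:

```python
import pandas as pd

train = pd.DataFrame({"cat": ["a", "a", "b"], "target": [1, 0, 1]})
test = pd.DataFrame({"cat": ["a", "b", "c"]})

mapping = train.groupby("cat")["target"].mean()      # fit on train only
fallback = train["target"].mean()                    # default for unseen categories
test["cat_mean_enc"] = test["cat"].map(mapping).fillna(fallback)
```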

1.3 Scaling and Normalization

● Min-max scaling: from sklearn.preprocessing import MinMaxScaler; MinMaxScaler().fit_transform(df[['column_name']])
● Standard scaling (Z-score normalization): from sklearn.preprocessing import StandardScaler; StandardScaler().fit_transform(df[['column_name']])
● Max-abs scaling: from sklearn.preprocessing import MaxAbsScaler; MaxAbsScaler().fit_transform(df[['column_name']])
● Robust scaling: from sklearn.preprocessing import RobustScaler; RobustScaler().fit_transform(df[['column_name']])
● Normalization (L1, L2, Max): from sklearn.preprocessing import Normalizer; Normalizer(norm='l1').fit_transform(df[['column_name']])
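In practice a scaler should be fit on the training split and reused on the test split so both share the same statistics. A minimal sketch with toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)         # reuse the train statistics
```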

1.4 Handling Outliers

● Identify outliers using IQR: Q1 = df['column_name'].quantile(0.25); Q3 = df['column_name'].quantile(0.75); IQR = Q3 - Q1; df[(df['column_name'] < Q1 - 1.5 * IQR) | (df['column_name'] > Q3 + 1.5 * IQR)]
● Identify outliers using Z-score: from scipy.stats import zscore; df[np.abs(zscore(df['column_name'])) > 3]
● Remove outliers: df = df[(df['column_name'] >= lower_bound) & (df['column_name'] <= upper_bound)]
● Cap outliers: df['column_name'] = df['column_name'].clip(lower_bound, upper_bound) (equivalent to the nested np.where pattern, but more readable)
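An end-to-end IQR capping sketch on a toy series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])                 # 100 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = s.clip(lower, upper)                    # winsorize instead of dropping rows
```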

2. Feature Transformation

2.1 Mathematical Transformations

● Logarithmic transformation: df['log_column'] = np.log(df['column_name']) (requires positive values; use np.log1p when zeros are present)
● Square root transformation: df['sqrt_column'] = np.sqrt(df['column_name'])
● Exponential transformation: df['exp_column'] = np.exp(df['column_name'])
● Reciprocal transformation: df['reciprocal_column'] = 1 / df['column_name']
● Box-Cox transformation: from scipy.stats import boxcox; df['boxcox_column'] = boxcox(df['column_name'])[0] (requires strictly positive values)
● Yeo-Johnson transformation: from scipy.stats import yeojohnson; df['yeojohnson_column'] = yeojohnson(df['column_name'])[0] (handles zero and negative values)
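A minimal sketch of the safe variants on skewed data that includes a zero:

```python
import numpy as np
import pandas as pd
from scipy.stats import yeojohnson

s = pd.Series([0.0, 1.0, 10.0, 100.0, 1000.0])   # right-skewed, includes zero
log_s = np.log1p(s)                              # log(1 + x), safe at zero
yj_s, lam = yeojohnson(s)                        # also returns the fitted lambda
```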

2.2 Binning and Discretization

● Equal-width binning: pd.cut(df['column_name'], bins=n)
● Equal-frequency binning: pd.qcut(df['column_name'], q=n)
● Custom binning: pd.cut(df['column_name'], bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
● Discretization using KBinsDiscretizer: from sklearn.preprocessing import KBinsDiscretizer; KBinsDiscretizer(n_bins=n, encode='ordinal').fit_transform(df[['column_name']])
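A small sketch contrasting equal-width bins with custom, labeled bins:

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 70])
width_bins = pd.cut(ages, bins=3)                # three equal-width intervals
custom_bins = pd.cut(ages, bins=[0, 18, 65, 120],
                     labels=["minor", "adult", "senior"])
```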

2.3 Interaction Features

● Multiplication: df['interaction'] = df['column_1'] * df['column_2']
● Division: df['interaction'] = df['column_1'] / df['column_2'] (guard against zeros in the denominator)
● Addition: df['interaction'] = df['column_1'] + df['column_2']
● Subtraction: df['interaction'] = df['column_1'] - df['column_2']
● Polynomial features: from sklearn.preprocessing import PolynomialFeatures; PolynomialFeatures(degree=n).fit_transform(df[['column_1', 'column_2']])
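PolynomialFeatures generates all products up to the chosen degree; a minimal sketch that also recovers the generated feature names:

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"x1": [1.0, 2.0], "x2": [3.0, 4.0]})
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df)
print(poly.get_feature_names_out())   # ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']
```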

2.4 Date and Time Features

● Extract year: df['year'] = df['date_column'].dt.year (the column must be datetime dtype; convert with pd.to_datetime first if needed)
● Extract month: df['month'] = df['date_column'].dt.month
● Extract day: df['day'] = df['date_column'].dt.day
● Extract hour: df['hour'] = df['datetime_column'].dt.hour
● Extract minute: df['minute'] = df['datetime_column'].dt.minute
● Extract second: df['second'] = df['datetime_column'].dt.second
● Extract day of week: df['day_of_week'] = df['date_column'].dt.dayofweek
● Extract day of year: df['day_of_year'] = df['date_column'].dt.dayofyear
● Extract week of year: df['week_of_year'] = df['date_column'].dt.isocalendar().week (dt.weekofyear was removed from recent pandas)
● Extract quarter: df['quarter'] = df['date_column'].dt.quarter
● Extract is_weekend: df['is_weekend'] = df['date_column'].dt.dayofweek.isin([5, 6])
● Extract is_holiday: holidays = pd.to_datetime(['2023-01-01', '2023-12-25']); df['is_holiday'] = df['date_column'].isin(holidays) (convert the holiday strings to datetimes so the comparison matches)
● Time since feature: df['time_since'] = (df['date_column'] - df['reference_date']).dt.days
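A small helper that bundles the most common calendar features; the function name and derived columns are illustrative:

```python
import pandas as pd

def add_date_features(df, col):
    """Derive common calendar features from a datetime column."""
    d = pd.to_datetime(df[col])
    df[f"{col}_year"] = d.dt.year
    df[f"{col}_month"] = d.dt.month
    df[f"{col}_dow"] = d.dt.dayofweek
    df[f"{col}_is_weekend"] = d.dt.dayofweek.isin([5, 6])
    return df

df = add_date_features(pd.DataFrame({"order_date": ["2023-01-01", "2023-06-15"]}),
                       "order_date")
```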

3. Feature Selection

3.1 Univariate Feature Selection

● Select K best features: from sklearn.feature_selection import SelectKBest, f_classif; SelectKBest(score_func=f_classif, k=n).fit_transform(X, y)
● Select percentile of features: from sklearn.feature_selection import SelectPercentile, f_classif; SelectPercentile(score_func=f_classif, percentile=p).fit_transform(X, y)
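A runnable sketch on a built-in dataset; get_support() reveals which columns were kept:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(selector.get_support())   # boolean mask over the original columns
```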

3.2 Recursive Feature Elimination

● Recursive Feature Elimination (RFE): from sklearn.feature_selection import RFE; from sklearn.linear_model import LinearRegression; RFE(estimator=LinearRegression(), n_features_to_select=n).fit_transform(X, y)
● Recursive Feature Elimination with Cross-Validation (RFECV): from sklearn.feature_selection import RFECV; from sklearn.linear_model import LinearRegression; RFECV(estimator=LinearRegression(), min_features_to_select=n).fit_transform(X, y)

3.3 L1 and L2 Regularization

● Lasso (L1) feature selection: from sklearn.linear_model import Lasso; from sklearn.feature_selection import SelectFromModel; SelectFromModel(Lasso(alpha=a)).fit_transform(X, y) (Lasso itself is an estimator, not a transformer, so it has no fit_transform; wrap it in SelectFromModel)
● Ridge (L2) feature selection: from sklearn.linear_model import Ridge; from sklearn.feature_selection import SelectFromModel; SelectFromModel(Ridge(alpha=a), threshold='median').fit_transform(X, y) (Ridge shrinks but rarely zeroes coefficients, so an explicit threshold is needed)
● Elastic Net feature selection: from sklearn.linear_model import ElasticNet; from sklearn.feature_selection import SelectFromModel; SelectFromModel(ElasticNet(alpha=a, l1_ratio=r)).fit_transform(X, y)
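A runnable sketch of L1-based selection on synthetic data; with only a few informative features, Lasso zeroes out the rest and SelectFromModel keeps the survivors:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       random_state=0)
selector = SelectFromModel(Lasso(alpha=1.0))
X_selected = selector.fit_transform(X, y)   # keeps features with nonzero coefs
print(X_selected.shape)
```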

3.4 Feature Importance

● Feature importance using Random Forest: from sklearn.ensemble import RandomForestClassifier; rf = RandomForestClassifier(); rf.fit(X, y); rf.feature_importances_
● Feature importance using Gradient Boosting: from sklearn.ensemble import GradientBoostingClassifier; gb = GradientBoostingClassifier(); gb.fit(X, y); gb.feature_importances_
● Permutation feature importance: from sklearn.inspection import permutation_importance; permutation_importance(model, X, y, n_repeats=n) (pass an already-fitted model; the scores live in the result's importances_mean, as in the sketch below)
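A runnable permutation-importance sketch:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)   # mean score drop when each feature is shuffled
```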

4. Dimensionality Reduction

4.1 Principal Component Analysis (PCA)

● PCA: from sklearn.decomposition import PCA; PCA(n_components=n).fit_transform(X)
● Incremental PCA: from sklearn.decomposition import IncrementalPCA; IncrementalPCA(n_components=n).fit_transform(X)
● Kernel PCA: from sklearn.decomposition import KernelPCA; KernelPCA(n_components=n, kernel='rbf').fit_transform(X)
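A quick check of how much variance the kept components explain:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # variance captured by each component
```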

4.2 t-SNE (t-Distributed Stochastic Neighbor Embedding)

● t-SNE: from sklearn.manifold import TSNE; TSNE(n_components=n).fit_transform(X) (typically n=2 or 3, for visualization)

4.3 UMAP (Uniform Manifold Approximation and Projection)

● UMAP: from umap import UMAP; UMAP(n_components=n).fit_transform(X) (provided by the umap-learn package)

4.4 Autoencoders

● Autoencoder using Keras: from keras.layers import Input, Dense; from keras.models import Model; input_layer = Input(shape=(n,)); encoded = Dense(encoding_dim, activation='relu')(input_layer); decoded = Dense(n, activation='sigmoid')(encoded); autoencoder = Model(input_layer, decoded); encoder = Model(input_layer, encoded) (compile and train before extracting features; see the sketch below)
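A minimal training sketch for the autoencoder above; the dimensions, hyperparameters, and random data are illustrative:

```python
import numpy as np
from keras.layers import Dense, Input
from keras.models import Model

n, encoding_dim = 20, 5
X = np.random.rand(100, n)

input_layer = Input(shape=(n,))
encoded = Dense(encoding_dim, activation="relu")(input_layer)
decoded = Dense(n, activation="sigmoid")(encoded)
autoencoder = Model(input_layer, decoded)
encoder = Model(input_layer, encoded)

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=16, verbose=0)  # learn to reconstruct
features = encoder.predict(X)                               # compressed features
```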

5. Text Feature Engineering

5.1 Text Preprocessing

● Lowercase: df['text'] = df['text'].str.lower()
● Remove punctuation: df['text'] = df['text'].str.replace('[^a-zA-Z]', ' ', regex=True) (recent pandas needs regex=True; note this pattern also strips digits)
● Remove stopwords: from nltk.corpus import stopwords; stop_words = set(stopwords.words('english')); df['text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words])) (requires nltk.download('stopwords'))
● Stemming: from nltk.stem import PorterStemmer; ps = PorterStemmer(); df['text'] = df['text'].apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
● Lemmatization: from nltk.stem import WordNetLemmatizer; lemmatizer = WordNetLemmatizer(); df['text'] = df['text'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()])) (requires nltk.download('wordnet'))
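A combined cleaning sketch (requires the NLTK stopwords corpus via nltk.download('stopwords'); the toy frame is illustrative):

```python
import re

import pandas as pd
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def clean(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # lowercase, keep letters only
    return " ".join(w for w in text.split() if w not in stop_words)

df = pd.DataFrame({"text": ["The quick Brown Fox!", "Jumped over 2 dogs."]})
df["clean_text"] = df["text"].apply(clean)
```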

5.2 Text Vectorization

● Bag-of-Words (BoW): from sklearn.feature_extraction.text import CountVectorizer; CountVectorizer().fit_transform(df['text'])
● TF-IDF: from sklearn.feature_extraction.text import TfidfVectorizer; TfidfVectorizer().fit_transform(df['text'])
● Word2Vec: from gensim.models import Word2Vec; Word2Vec(sentences=df['text'].str.split(), vector_size=n, window=w, min_count=m, workers=wrk) (gensim expects tokenized sentences, i.e. lists of words, not raw strings)
● GloVe: from gensim.models import KeyedVectors; KeyedVectors.load_word2vec_format('glove.6B.100d.txt', binary=False, no_header=True) (GloVe files lack the word2vec header line, hence no_header=True in gensim 4+)
● FastText: from gensim.models import FastText; FastText(sentences=df['text'].str.split(), vector_size=n, window=w, min_count=m, workers=wrk)
● BERT: from transformers import BertTokenizer, BertModel; tokenizer = BertTokenizer.from_pretrained('bert-base-uncased'); model = BertModel.from_pretrained('bert-base-uncased')
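A runnable TF-IDF sketch on a toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog barked", "the cat barked"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)            # sparse matrix: 3 docs x vocabulary size
print(vec.get_feature_names_out())
```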

5.3 Text Feature Extraction

● Named Entity Recognition (NER): from nltk import word_tokenize, pos_tag, ne_chunk; ne_chunk(pos_tag(word_tokenize(text))) (requires the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words NLTK resources)
● Part-of-Speech (POS) tagging: from nltk import word_tokenize, pos_tag; pos_tag(word_tokenize(text))
● Sentiment Analysis: from textblob import TextBlob; TextBlob(text).sentiment

6. Image Feature Engineering

6.1 Image Preprocessing

● Resize: from PIL import Image; img = Image.open('image.jpg'); img = img.resize((width, height))
● Convert to grayscale: from PIL import Image; img = Image.open('image.jpg'); img = img.convert('L')
● Normalize pixel values: import numpy as np; from PIL import Image; arr = np.asarray(Image.open('image.jpg')) / 255.0 (a PIL Image cannot be divided directly; convert it to a NumPy array first)
● Data augmentation (flip, rotate, etc.): from keras.preprocessing.image import ImageDataGenerator; datagen = ImageDataGenerator(rotation_range=r, width_shift_range=ws, height_shift_range=hs, shear_range=s, zoom_range=z, horizontal_flip=True)
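A combined preprocessing sketch ('image.jpg' is a placeholder path):

```python
import numpy as np
from PIL import Image

img = Image.open("image.jpg").convert("L").resize((224, 224))
arr = np.asarray(img, dtype=np.float32) / 255.0   # grayscale pixels in [0, 1]
```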

6.2 Image Feature Extraction

● HOG (Histogram of Oriented Gradients): from skimage.feature import hog; hog_features = hog(image, orientations=n, pixels_per_cell=(p, p), cells_per_block=(c, c))
● SIFT (Scale-Invariant Feature Transform): import cv2; sift = cv2.SIFT_create(); keypoints, descriptors = sift.detectAndCompute(image, None) (SIFT moved from xfeatures2d into the main namespace in OpenCV 4.4+)
● ORB (Oriented FAST and Rotated BRIEF): import cv2; orb = cv2.ORB_create(); keypoints, descriptors = orb.detectAndCompute(image, None)
● CNN features: from keras.applications.vgg16 import VGG16; model = VGG16(weights='imagenet', include_top=False); features = model.predict(image) (image must be a preprocessed batch, e.g. shape (1, 224, 224, 3); see the sketch below)
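A minimal sketch of the full VGG16 feature-extraction path; exact utility import paths vary slightly across Keras versions, and 'image.jpg' is a placeholder:

```python
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.utils import img_to_array, load_img

model = VGG16(weights="imagenet", include_top=False)
img = load_img("image.jpg", target_size=(224, 224))
batch = preprocess_input(img_to_array(img)[np.newaxis, ...])  # (1, 224, 224, 3)
features = model.predict(batch)
```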

7. Audio Feature Engineering

7.1 Audio Preprocessing

● Load audio file: import librosa; audio, sample_rate = librosa.load('audio.wav')
● Resampling: import librosa; audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=target_sample_rate)
● Normalize audio: import librosa; audio = librosa.util.normalize(audio)

7.2 Audio Feature Extraction

● MFCC (Mel-Frequency Cepstral Coefficients): import librosa; mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate)
● Chroma features: import librosa; chroma = librosa.feature.chroma_stft(y=audio, sr=sample_rate)
● Spectral contrast: import librosa; contrast = librosa.feature.spectral_contrast(y=audio, sr=sample_rate)
● Tonnetz: import librosa; tonnetz = librosa.feature.tonnetz(y=audio, sr=sample_rate)
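Frame-level features are usually pooled into a fixed-length vector before modeling; a minimal sketch ('audio.wav' is a placeholder path):

```python
import librosa
import numpy as np

audio, sr = librosa.load("audio.wav")
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # shape (13, n_frames)
features = np.concatenate([mfccs.mean(axis=1), mfccs.std(axis=1)])  # shape (26,)
```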

8. Time Series Feature Engineering

8.1 Time Series Decomposition

● Decompose time series into trend, seasonality, and residuals: from statsmodels.tsa.seasonal import seasonal_decompose; decomposition = seasonal_decompose(time_series, model='additive', period=p)
● STL decomposition: from statsmodels.tsa.seasonal import STL; stl = STL(time_series, period=p); res = stl.fit()

8.2 Rolling and Expanding Statistics

● Rolling mean: time_series.rolling(window=n).mean()
● Rolling standard deviation: time_series.rolling(window=n).std()
● Expanding mean: time_series.expanding(min_periods=n).mean()
● Expanding standard deviation: time_series.expanding(min_periods=n).std()

8.3 Lag Features

● Shift/lag feature: df['lag_1'] = df['column'].shift(1)
● Difference feature: df['diff_1'] = df['column'].diff(1)
● Percentage change feature: df['pct_change_1'] = df['column'].pct_change(1)
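These all introduce NaNs at the start of the series, which usually need dropping; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({"sales": [10, 12, 15, 14]})
df["lag_1"] = df["sales"].shift(1)           # previous value (NaN in first row)
df["diff_1"] = df["sales"].diff(1)           # change vs. previous value
df["pct_change_1"] = df["sales"].pct_change(1)
df = df.dropna()                             # drop rows without full history
```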

8.4 Autocorrelation and Partial Autocorrelation

● Autocorrelation Function (ACF): from statsmodels.tsa.stattools import acf; acf_values = acf(time_series, nlags=n)
● Partial Autocorrelation Function (PACF): from statsmodels.tsa.stattools import pacf; pacf_values = pacf(time_series, nlags=n)

9. Geospatial Feature Engineering

9.1 Geospatial Distance and Proximity

● Haversine distance (great-circle distance between two lon/lat points, in km): see the sketch below
● Manhattan distance: from math import fabs; def manhattan(x1, y1, x2, y2): return fabs(x1 - x2) + fabs(y1 - y2)
● Euclidean distance: from math import sqrt; def euclidean(x1, y1, x2, y2): return sqrt((x1 - x2)**2 + (y1 - y2)**2)
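The haversine formula, written out as a proper function:

```python
from math import asin, cos, radians, sin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371        # Earth radius in km

print(haversine(-73.99, 40.73, -0.13, 51.51))   # New York -> London, ~5570 km
```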

9.2 Geospatial Aggregation

● Spatial join: import geopandas as gpd; gpd.sjoin(gdf1, gdf2, predicate='intersects') (the op= keyword was renamed to predicate= in recent geopandas)
● Spatial groupby: import geopandas as gpd; gdf.groupby('column').agg({'geometry': 'first', 'other_column': 'mean'})

9.3 Geospatial Binning

● Create grid points: import geopandas as gpd; grid = gpd.GeoDataFrame(geometry=gpd.points_from_xy(x, y)) (builds point geometries from coordinate arrays; polygon cells are needed for a true grid)
● Spatial binning: import geopandas as gpd; gdf['grid_id'] = gpd.sjoin(gdf, grid, predicate='within')['index_right']

10. Feature Scaling and Normalization

10.1 Scaling

● Min-max scaling: from sklearn.preprocessing import MinMaxScaler; MinMaxScaler().fit_transform(df[['column_name']])
● Standard scaling (Z-score normalization): from sklearn.preprocessing import StandardScaler; StandardScaler().fit_transform(df[['column_name']])
● Max-abs scaling: from sklearn.preprocessing import MaxAbsScaler; MaxAbsScaler().fit_transform(df[['column_name']])
● Robust scaling: from sklearn.preprocessing import RobustScaler; RobustScaler().fit_transform(df[['column_name']])

10.2 Normalization

● L1 normalization: from sklearn.preprocessing import Normalizer; Normalizer(norm='l1').fit_transform(df[['column_name']])
● L2 normalization: from sklearn.preprocessing import Normalizer; Normalizer(norm='l2').fit_transform(df[['column_name']])
● Max normalization: from sklearn.preprocessing import Normalizer; Normalizer(norm='max').fit_transform(df[['column_name']])
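Note the axis difference: the scalers above work per column, while Normalizer rescales each row to unit norm. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[3.0, 4.0], [1.0, 0.0]])
row_unit = Normalizer(norm="l2").fit_transform(X)   # each ROW scaled to unit norm
col_std = StandardScaler().fit_transform(X)         # each COLUMN standardized
```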

By: Waleed Mousa
