0% found this document useful (0 votes)
20 views10 pages

Feature Engineering

Uploaded by

Rithik Logu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views10 pages

Feature Engineering

Uploaded by

Rithik Logu
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

# [ Feature Engineering ] ( CheatSheet )

1. Basic Feature Creation

● Combining Features: df['new_feature'] = df['feature1'] + df['feature2']


● Differences Between Features: df['new_feature'] = df['feature1'] -
df['feature2']
● Ratios of Features: df['new_feature'] = df['feature1'] / df['feature2']
● Product of Features: df['new_feature'] = df['feature1'] * df['feature2']
● Log Transformation: df['log_feature'] = np.log(df['feature'] + 1)

2. Handling Categorical Features

● One-Hot Encoding: pd.get_dummies(df['categorical_feature'])


● Label Encoding:
sklearn.preprocessing.LabelEncoder().fit_transform(df['categorical_featur
e'])
● Binary Encoding:
category_encoders.BinaryEncoder().fit_transform(df['categorical_feature']
)
● Frequency Encoding: df.groupby('category').size() / len(df)
● Mean Encoding: df.groupby('category')['target'].mean()

3. Temporal Features

● Extracting Day, Month, Year: df['day'] = df['datetime'].dt.day


● Extracting Weekday: df['weekday'] = df['datetime'].dt.weekday
● Extracting Hour, Minute, Second: df['hour'] = df['datetime'].dt.hour
● Time Since a Reference Point: df['time_since'] = df['datetime'] -
reference_date
● Is Weekend Feature: df['is_weekend'] =
df['weekday'].isin([5,6]).astype(int)

4. Text Features

● Tokenization: nltk.word_tokenize(df['text'])
● TF-IDF Vectorization:
sklearn.feature_extraction.text.TfidfVectorizer().fit_transform(df['text'
])

By: Waleed Mousa


● Count Vectorization:
sklearn.feature_extraction.text.CountVectorizer().fit_transform(df['text'
])
● N-grams:
sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1,2)).fit_tr
ansform(df['text'])
● Sentiment Analysis: TextBlob(df['text']).sentiment.polarity

5. Numerical Features

● Standardization: (df['num_feature'] - df['num_feature'].mean()) /


df['num_feature'].std()
● Min-Max Normalization: (df['num_feature'] - df['num_feature'].min()) /
(df['num_feature'].max() - df['num_feature'].min())
● Robust Scaling (handling outliers):
sklearn.preprocessing.RobustScaler().fit_transform(df[['num_feature']])
● Binning: pd.cut(df['num_feature'], bins=5, labels=False)
● Rank Transformation: df['num_feature'].rank()

6. Feature Interaction

● Polynomial Features:
sklearn.preprocessing.PolynomialFeatures().fit_transform(df[['feature1',
'feature2']])
● Interaction Between Categorical and Numerical Features:
df['new_feature'] = df['cat_feature'].astype(str) + '_' +
df['num_feature'].astype(str)
● Feature Crosses for Categorical Features: pd.get_dummies(df['feature1'] +
'_' + df['feature2'])
● Pairwise Products of Numerical Features: df['feature1'] * df['feature2']
● Creating Ratios of All Pairwise Features: df['feature1'] / df['feature2']

7. Dimensionality Reduction

● PCA (Principal Component Analysis):


sklearn.decomposition.PCA(n_components=3).fit_transform(df)
● t-SNE (t-Distributed Stochastic Neighbor Embedding):
sklearn.manifold.TSNE(n_components=2).fit_transform(df)
● LDA (Linear Discriminant Analysis):
sklearn.discriminant_analysis.LinearDiscriminantAnalysis().fit_transform(
X, y)

By: Waleed Mousa


● Feature Agglomeration (Clustering of Features):
sklearn.cluster.FeatureAgglomeration().fit_transform(df)
● UMAP (Uniform Manifold Approximation and Projection):
umap.UMAP().fit_transform(df)

8. Handling Missing Values

● Imputation with Mean/Median/Mode:


sklearn.impute.SimpleImputer(strategy='mean').fit_transform(df[['feature'
]])
● KNN Imputation:
sklearn.impute.KNNImputer(n_neighbors=5).fit_transform(df)
● Iterative Imputer (Multivariate Imputation):
sklearn.experimental.enable_iterative_imputer.IterativeImputer().fit_tran
sform(df)
● Imputation Using (Group) Median: df['feature'] =
df.groupby('group')['feature'].transform(lambda x: x.fillna(x.median()))
● Creating Missing Value Indicator: df['feature_is_missing'] =
df['feature'].isnull().astype(int)

9. Feature Extraction from Unstructured Data

● Image Feature Extraction (e.g., Color Histograms): cv2.calcHist([image],


channels, mask, histSize, ranges)
● Text Embeddings (Word2Vec, GloVe): gensim.models.Word2Vec(sentences).wv
● Audio Feature Extraction (MFCC): librosa.feature.mfcc(y=audio_waveform,
sr=sampling_rate)
● Extracting Features from Time Series Data:
tsfresh.extract_features(time_series, column_id='id')
● Image Resizing and Flattening for ML: cv2.resize(image, (width,
height)).flatten()
● Bag of Words for Text Data:
sklearn.feature_extraction.text.CountVectorizer().fit_transform(corpus)

10. Geospatial Feature Engineering

● Latitude and Longitude to Cartesian Coordinates:


np.radians(df['latitude']), np.radians(df['longitude'])
● Haversine Distance Between Two Points: haversine(lon1, lat1, lon2, lat2)
● Geospatial Clustering (e.g., DBSCAN): sklearn.cluster.DBSCAN(eps=0.01,
min_samples=10).fit(geo_coordinates)

By: Waleed Mousa


● Extracting ZIP Codes from Addresses:
uszipcode.SearchEngine().by_address(address, returns=1)

11. Time-based Features

● Elapsed Time Since a Reference Event: df['time_since_event'] =


pd.to_datetime(df['timestamp']) - reference_timestamp
● Cyclical Time Features (e.g., Hour of Day, Day of Week): np.sin(2 *
np.pi * df['hour']/23.0), np.cos(2 * np.pi * df['hour']/23.0)
● Time to Next/Previous Event: df['time_to_next_event'] =
df['timestamp'].shift(-1) - df['timestamp']
● Window Functions for Time Series (e.g., Rolling Mean):
df['rolling_mean'] = df['value'].rolling(window=5).mean()
● Exponential Weighted Moving Average: df['ewma'] =
df['value'].ewm(span=50).mean()

12. Textual Feature Engineering

● Named Entity Recognition (NER): spacy.load('en_core_web_sm').ents(text)


● Part-of-Speech Tagging: nltk.pos_tag(tokens)
● Term Frequency-Inverse Document Frequency (TF-IDF):
TfidfVectorizer().fit_transform(corpus)
● Extracting Readability Features: textstat.flesch_reading_ease(text)
● Word Embedding (Pre-trained Models like GloVe, BERT):
transformers.BertModel.from_pretrained('bert-base-uncased').encode(text)

13. Feature Selection Techniques

● SelectKBest with Custom Score Function:


SelectKBest(score_func=f_regression, k=10).fit_transform(X, y)
● Recursive Feature Elimination: RFE(estimator,
n_features_to_select=10).fit(X, y)
● Feature Importance From Tree-based Models: model.feature_importances_
● L1 Regularization for Feature Selection (Lasso): Lasso(alpha=0.1).fit(X,
y)
● Variance Thresholding for Feature Selection:
VarianceThreshold(threshold=(.8 * (1 - .8))).fit_transform(X)

14. Handling Imbalanced Data

● Random Over-sampling: RandomOverSampler().fit_resample(X, y)


● Random Under-sampling: RandomUnderSampler().fit_resample(X, y)
By: Waleed Mousa
● SMOTE for Over-sampling: SMOTE().fit_resample(X, y)
● ADASYN for Adaptive Synthetic Sampling: ADASYN().fit_resample(X, y)

15. Advanced Numerical Techniques

● Discretization (Binning): KBinsDiscretizer(n_bins=5,


encode='ordinal').fit_transform(X)
● Log Transform Plus One: np.log1p(df['num_feature'])
● Square Root Transform: np.sqrt(df['num_feature'])
● Inverse Transform: np.reciprocal(df['num_feature'])

16. Scaling and Normalization

● MaxAbsScaler for Scaling: MaxAbsScaler().fit_transform(X)


● Normalizer for Row-wise Normalization: Normalizer().fit_transform(X)
● Quantile Transformer for Robust Scaling:
QuantileTransformer().fit_transform(X)

17. Encoding High Cardinality Features

● Target Encoding: category_encoders.TargetEncoder().fit_transform(X, y)


● Binary Encoding for High Cardinality:
category_encoders.BinaryEncoder().fit_transform(df['high_card_feature'])
● Hashing Trick:
category_encoders.HashingEncoder().fit_transform(df['high_card_feature'])

18. Feature Engineering for Time Series

● Fourier Transform for Periodic Patterns: np.fft.fft(time_series)


● Lag Features for Autocorrelation: df['lag_1'] = df['value'].shift(1)
● Rolling Window Features (e.g., Rolling Variance): df['rolling_var'] =
df['value'].rolling(window=5).var()
● Decomposing Time Series (Seasonal-Trend Decomposition):
seasonal_decompose(time_series, model='additive')

19. Custom Feature Engineering

● Custom Transformer in Pipeline: Pipeline(steps=[('custom_transformer',


CustomTransformer()), ...])
● Conditional Feature Creation: df['new_feature'] =
np.where(df['condition'], value_if_true, value_if_false)

By: Waleed Mousa


20. Data Reduction for Efficiency

● Random Projection for Dimensionality Reduction:


GaussianRandomProjection().fit_transform(X)
● Feature Agglomeration for Clustering Features:
FeatureAgglomeration().fit_transform(X)

21. Handling Multi-dimensional Data

● Flattening Image Data for ML Models: np.reshape(image_data, (num_samples,


-1))
● PCA for Image Data Dimensionality Reduction:
PCA(n_components=0.95).fit_transform(image_data)

22. Feature Engineering for Specific Models

● Word Tokenization for NLP Models: nltk.word_tokenize(text)


● Creating Polynomial Features for Regression Models:
PolynomialFeatures(degree=2).fit_transform(X)

23. Integration with External Data

● Merging External Data Sources: pd.merge(df, external_data, on='key')


● Feature Engineering from APIs (e.g., Weather Data for Sales
Predictions): fetch_weather_data(api_key, location)

24. Feature Engineering Automation

● Automated Feature Engineering (Featuretools):


featuretools.dfs(entityset=es, target_entity='target')

25. Features from Aggregations

● GroupBy Aggregations for Categorical Features:


df.groupby('category').agg({'num_feature': ['mean', 'sum']})
● Window Functions in Pandas: df['rolling_mean'] =
df.sort_values('time').groupby('group').rolling(window=3)['value'].mean()

26. Encoding Techniques for Tree-based Models

● Ordinal Encoding for Tree-Based Models: OrdinalEncoder().fit_transform(X)

By: Waleed Mousa


● Frequency Encoding for Tree-Based Models: df['freq_encode'] =
df.groupby('feature')['feature'].transform('count') / len(df)

27. Multi-level Feature Engineering

● Hierarchical Interactions (e.g., City within Country):


df['city_country'] = df['city'] + '_' + df['country']
● Creating Features from Hierarchical Clustering:
AgglomerativeClustering(n_clusters=10).fit_predict(X)

28. Signal Processing Features

● Fast Fourier Transform for Signal Data: np.fft.fft(signal)


● Signal Filtering (e.g., Low/High/Band-Pass Filters):
scipy.signal.butter(N, Wn, btype, output='sos')
● Signal Envelope Extraction: scipy.signal.hilbert(signal)

29. Handling Sparse Data

● Sparse Matrix Representation: scipy.sparse.csr_matrix(data)


● Dimensionality Reduction for Sparse Data:
sklearn.decomposition.TruncatedSVD(n_components).fit_transform(sparse_dat
a)

30. Feature Hashing

● Feature Hashing for Vectorizing Text:


sklearn.feature_extraction.FeatureHasher(n_features).transform(raw_featur
es)
● Hashing Trick for Categorical Features:
sklearn.feature_extraction.text.HashingVectorizer(n_features)

31. Features from Probabilistic Distributions

● Extracting Parameters from Distributions: scipy.stats.norm.fit(data)


● Generating Features from Distribution Samples: np.random.normal(loc,
scale, size)

32. Text Specific Features

● Character Length of Text: df['text_length'] = df['text'].apply(len)

By: Waleed Mousa


● Count of Specific Words or Characters: df['word_count'] =
df['text'].str.count('specific_word')

33. Image Specific Features

● Edge Detection (e.g., Sobel, Canny): cv2.Canny(image, threshold1,


threshold2)
● Feature Descriptors (e.g., SIFT, SURF):
cv2.xfeatures2d.SIFT_create().detectAndCompute(image, None)

34. Multi-Label Feature Engineering

● Binary Relevance Transformation:


sklearn.preprocessing.MultiLabelBinarizer().fit_transform(multi_label)

35. Survival Analysis Features

● Cox Proportional Hazards Features: lifelines.CoxPHFitter().fit(df,


duration_col, event_col).predict_partial_hazard(df)

36. Features from Graphs/Networks

● Node Centrality Measures (e.g., Degree, Eigenvector):


networkx.degree_centrality(G)
● Graph Embeddings (e.g., Node2Vec): node2vec = Node2Vec(G); model =
node2vec.fit()

37. Cross-validation in Feature Engineering

● K-Fold Cross-Validated Features:


sklearn.model_selection.cross_val_predict(model, X, y, cv=5)

38. Handling Unstructured Data

● Converting Unstructured Data to Structured Form:


extract_features_from_unstructured_data(unstructured_data)

39. Features from Audio Data

● Mel-Frequency Cepstral Coefficients (MFCCs):


librosa.feature.mfcc(y=audio, sr=sampling_rate)

By: Waleed Mousa


● Spectral Centroid: librosa.feature.spectral_centroid(y=audio,
sr=sampling_rate)

40. Custom Feature Engineering Functions

● User-Defined Transformation Functions: df['custom_feature'] =


df['feature'].apply(custom_transformation_function)

41. Feature Engineering in Time Series

● Lagged Features for Autocorrelation: df['lag_feature'] =


df['feature'].shift(periods)
● Rolling Window Features (e.g., Rolling Std): df['rolling_std'] =
df['feature'].rolling(window=5).std()

42. Natural Language Specific Features

● Part-of-Speech Tag Counts: nltk.pos_tag(text).count('NN') # count of


nouns
● Text Complexity Features (e.g., Flesch Reading Ease):
textstat.flesch_reading_ease(text)

43. Social Media Specific Features

● Sentiment Score from Social Media Text: TextBlob(text).sentiment.polarity


● Social Media Engagement Features (e.g., Likes, Shares):
calculate_engagement_metrics(df['likes'], df['shares'])

44. Geographic Features

● Distance to Points of Interest: calculate_distance_to_poi(lat, long,


poi_lat, poi_long)
● Geographic Clustering Features: DBSCAN(eps=0.01,
min_samples=10).fit(geo_data)

45. Time-based Aggregation Features

● Cumulative Count or Sum Over Time: df.groupby('time').cumcount()


● Time-based Aggregation (e.g., Sum of Sales per Month):
df.groupby(pd.Grouper(key='date', freq='M'))['sales'].sum()

46. Online Behavior Features

By: Waleed Mousa


● Session Duration in Web Analytics:
calculate_session_duration(df['session_start'], df['session_end'])
● Click Through Rate (CTR) for Online Ads: df['CTR'] = df['clicks'] /
df['impressions']

47. Biomedical Signal Features

● Heart Rate Variability Metrics: calculate_hrv(ecg_signal)


● Biomedical Image Texture Features: skimage.feature.greycomatrix(image,
[distance], [angle])

48. Feature Pipelines

● Creating a Feature Pipeline: Pipeline(steps=[('feature_creation',


CustomFeatureTransformer()), ...])

By: Waleed Mousa

You might also like