Unit 3 Feature Generation & Selection
Data science is an interdisciplinary field that combines statistics, computer science, and
domain knowledge to analyze data and generate actionable insights. It involves collecting,
cleaning, processing, analyzing, and visualizing data to answer questions or solve problems.
Let’s break down how data science turns data into knowledge:
1. Data Collection
Everything starts with data—collected from sources like apps, surveys, sensors, websites, or
databases. For example, an e-commerce platform collects user clicks, purchase history, and
product reviews.
2. Data Cleaning
Raw data is often messy or incomplete. Data scientists clean it by removing errors, handling missing values, and formatting it correctly. This step is crucial for ensuring accurate analysis.
4. Data Analysis
Using statistical techniques and tools like Python, R, or SQL, data scientists explore the data to find patterns, trends, and anomalies. For example, they might find that sales drop on certain weekdays or that users from a particular city spend more.
5. Data Visualization
Charts, graphs, and dashboards are used to visually present the results in a clear and
understandable way. Tools like Tableau, Power BI, or Matplotlib help turn complex insights
into stories anyone can understand.
6. Decision Making
The final and most important step: drawing conclusions and making informed decisions. Whether it’s a business strategy, healthcare diagnosis, or policy development, the goal is to use data insights to act smarter and faster.
Extracting meaning from data comes with responsibility. Data must be interpreted ethically
and accurately, keeping in mind privacy, bias, and fairness. Misinterpreted or biased data can
lead to wrong decisions or unfair outcomes.
Quote: Data is the new oil, but data science is the refinery that turns it into value.
For example, in a customer retention problem, machine learning models can be used to predict churn (customers likely to stop buying). Common models:
Logistic Regression
Random Forest
XGBoost
Neural Networks
Label your past data as "churned" vs. "retained" to train supervised models.
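A minimal sketch of this supervised setup with scikit-learn; all column names and values below are made up for illustration:

```python
# Minimal churn-model sketch: train on customers labelled churned (1) vs. retained (0).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical labelled customers (illustrative values only).
df = pd.DataFrame({
    "num_logins":       [120, 8, 45, 3, 60, 15, 90, 5, 30, 70, 2, 55],
    "days_since_last":  [2, 40, 10, 75, 5, 30, 3, 60, 20, 7, 90, 9],
    "total_complaints": [0, 4, 1, 5, 0, 2, 1, 3, 2, 0, 6, 1],
    "churned":          [0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
})

X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Try a simple baseline and a stronger ensemble model.
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=100, random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__)
    print(classification_report(y_test, model.predict(X_test), zero_division=0))
```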
Run A/B tests to see which retention strategies work best by comparing two customer groups, for example one that receives a retention offer and one that does not.
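One simple way to sketch this comparison in Python is a chi-square test on retained-vs-churned counts for the two groups; the counts here are made up:

```python
# A/B comparison of retention rates between two groups using a chi-square test.
from scipy.stats import chi2_contingency

# Hypothetical counts: [retained, churned] for each group.
group_a = [480, 120]   # e.g., customers who received a retention offer
group_b = [430, 170]   # control group

chi2, p_value, dof, expected = chi2_contingency([group_a, group_b])
print(f"p-value = {p_value:.4f}")  # a small p-value suggests the retention rates differ
```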
Use dashboards and KPIs to track customer retention over time. Tools like:
Power BI
Tableau
Google Data Studio
Python (Plotly, Seaborn)
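A minimal Seaborn sketch of a retention-rate KPI over time, using made-up monthly figures:

```python
# Plot a monthly retention-rate KPI (illustrative data only).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

kpi = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "retention_rate": [0.82, 0.80, 0.84, 0.86, 0.85, 0.88],
})

sns.lineplot(data=kpi, x="month", y="retention_rate", marker="o")
plt.title("Monthly customer retention rate")
plt.ylim(0, 1)
plt.show()
```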
Feature generation is a critical step in data science and machine learning where we create
new input variables (features) from raw data to improve model performance. Brainstorming in
this context means creatively thinking about what extra or derived features can help the model
better understand patterns and relationships in the data.
It’s the idea generation phase where data scientists explore, discuss, and invent new features
from existing data using:
Domain knowledge
Statistical thinking
Business goals
Logical combinations and transformations
This helps models "learn" more from the data by giving them richer and more meaningful
inputs.
These new features can often reveal hidden relationships not obvious from raw data.
1. Mathematical Transformations
Log, square, root, ratios (e.g., income per person); see the sketch after this list.
3. Grouping or Binning: converting continuous values into categories (e.g., age bands)
4. Interaction Features: combining two or more variables (e.g., age × income)
5. Aggregations
Sum, mean, min, max over groups (e.g., total purchases per user)
6. Text Features: simple signals from text, such as word counts or keyword flags
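A short pandas sketch of these techniques applied to a small hypothetical orders table; every column name and value is illustrative:

```python
# Sketch of the feature-generation techniques listed above.
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "user_id":   [1, 1, 2, 2, 3],
    "income":    [50000, 50000, 82000, 82000, 31000],
    "household": [2, 2, 4, 4, 1],
    "amount":    [120.0, 80.0, 300.0, 45.0, 60.0],
    "age":       [25, 25, 41, 41, 33],
    "review":    ["great product", "ok", "fast delivery, great", "late", "fine"],
})

# Mathematical transformations: log of a skewed value, ratio (income per person)
orders["log_amount"] = np.log1p(orders["amount"])
orders["income_per_person"] = orders["income"] / orders["household"]

# Grouping / binning: age bands
orders["age_band"] = pd.cut(orders["age"], bins=[0, 30, 45, 120],
                            labels=["young", "mid", "senior"])

# Interaction feature: combine two variables
orders["age_x_income"] = orders["age"] * orders["income"]

# Aggregation: total purchases per user
orders["total_spend_user"] = orders.groupby("user_id")["amount"].transform("sum")

# Text feature: word count of the review
orders["review_words"] = orders["review"].str.split().str.len()
```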
Example: brainstorming churn-related features from raw customer data.
Raw data:
Number of logins
Days since last visit
Total complaints
Brainstormed features:
Average time between logins
Complaints per transaction
Loyalty score = (total spend / tenure)
Models using these brainstormed features often perform better than those using raw features
alone.
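A sketch of how these brainstormed features could be derived with pandas, assuming hypothetical raw columns (num_logins, tenure_days, total_complaints, num_transactions, total_spend):

```python
# Derive the brainstormed churn features from illustrative raw columns.
import pandas as pd

customers = pd.DataFrame({
    "num_logins":       [120, 8, 45],
    "tenure_days":      [400, 60, 200],
    "total_complaints": [2, 5, 0],
    "num_transactions": [80, 10, 30],
    "total_spend":      [4000.0, 150.0, 900.0],
})

# Average time between logins (days)
customers["avg_days_between_logins"] = customers["tenure_days"] / customers["num_logins"]

# Complaints per transaction
customers["complaints_per_txn"] = customers["total_complaints"] / customers["num_transactions"]

# Loyalty score = total spend / tenure
customers["loyalty_score"] = customers["total_spend"] / customers["tenure_days"]
```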
Role of Domain Expertise and Imagination in Feature Generation Using Data Science
Feature generation (or feature engineering) is a key step in building powerful data science
models. It’s not just about using algorithms — it’s about understanding what data truly means
and how to represent it in a way that helps a machine learning model perform better.
Domain expertise means having deep knowledge of the industry, subject, or problem area
you're working with (e.g., finance, healthcare, retail, education, etc.).
Example: a healthcare expert knows which clinical measurements matter for a diagnosis, and a retail expert knows that purchase recency and frequency drive repeat buying; this knowledge points directly to useful features.
Imagination, on the other hand, lets data scientists go beyond the raw data and:
Create new patterns: Think of combining two unrelated variables to find hidden
insights (e.g., age × income).
Simulate user behavior: Imagine how a customer or user might act, and create features
that reflect that.
Build abstract or high-level ideas: For example, “loyalty score” is not in the raw data,
but you can invent it from purchases, visits, and feedback.
Think like the model: Imagine what the algorithm would find useful or confusing and
shape the data accordingly.
Both are equally important: one gives credibility, the other gives creativity.
Summary
Domain expertise guides us to features that make sense in the real world.
Imagination allows us to explore creative and abstract possibilities.
Together, they help build smarter, more accurate, and more explainable models.
Feature selection is the process of choosing the variables (features) in a dataset that contribute the most to the prediction output. This helps to reduce overfitting, improve accuracy, and cut training time.
Feature selection methods fall into three main categories, each with different algorithms:
1. Filter Methods
These methods use statistical techniques to score and rank features, independently of any
machine learning model.
Algorithms (common examples):
Chi-square test
ANOVA F-test
Correlation coefficient
Mutual information / information gain
Variance threshold
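A minimal filter-method sketch with scikit-learn's SelectKBest, scoring features by mutual information; the built-in breast-cancer dataset is used purely for illustration:

```python
# Filter method: rank features with a statistical score, independent of any model.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

selector = SelectKBest(score_func=mutual_info_classif, k=10)
selector.fit(X, y)

print("Top 10 features:", list(X.columns[selector.get_support()]))
```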
2. Wrapper Methods
These evaluate subsets of features by training a model and testing performance. They
consider feature dependencies but are computationally expensive.
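A minimal wrapper-method sketch using scikit-learn's Recursive Feature Elimination (RFE), which repeatedly fits a model and drops the weakest features; the dataset is again only illustrative:

```python
# Wrapper method: RFE trains a model, removes the weakest features, and repeats.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=8)
rfe.fit(X, y)

print("Selected features:", list(X.columns[rfe.support_]))
```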
3. Embedded Methods
These combine feature selection as part of the model training process itself. They are efficient
and often give good results.
Algorithms:
Genetic Algorithms: use evolutionary strategies to find the optimal feature subset.
Boruta Algorithm: built on Random Forest; compares real features with random "shadow" features.
SHAP / LIME: explainable AI tools that rank features based on their contribution to model output.
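As a sketch of the basic embedded idea (separate from the advanced algorithms listed above), L1 (Lasso-style) regularization can drop features while the model trains, for example via scikit-learn's SelectFromModel:

```python
# Embedded method: L1 regularization selects features as part of model training.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X_scaled, y)

print("Kept features:", list(X.columns[selector.get_support()]))
```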
Example Workflow:
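A plausible sketch of such a workflow, combining a filter step, an embedded step, and a final model inside one scikit-learn pipeline evaluated with cross-validation; the dataset and parameter values are illustrative:

```python
# End-to-end feature-selection workflow: scale -> filter -> embedded selection -> model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale",  StandardScaler()),
    ("filter", SelectKBest(mutual_info_classif, k=20)),                  # filter step
    ("embed",  SelectFromModel(LogisticRegression(penalty="l1",
                                                  solver="liblinear",
                                                  C=0.5))),              # embedded step
    ("model",  RandomForestClassifier(n_estimators=200, random_state=0)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy:", scores.mean().round(3))
```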