
2 Mutual Information - Kaggle

The document discusses using mutual information to rank features by their association with a target variable. It explains what mutual information measures, how to interpret scores, and provides an example ranking automotive features by their mutual information with price. Visualizations are suggested to validate and explore features with different scores.

Uploaded by

Prujith Muthu Ram


Introduction

First encountering a new dataset can sometimes feel overwhelming. You might be presented with hundreds
or thousands of features without even a description to go by. Where do you even begin?

A great first step is to construct a ranking with a feature utility metric, a function measuring associations
between a feature and the target. Then you can choose a smaller set of the most useful features to develop
initially and have more confidence that your time will be well spent.

The metric we'll use is called "mutual information". Mutual information is a lot like correlation in that it
measures a relationship between two quantities. The advantage of mutual information is that it can detect
any kind of relationship, while correlation only detects linear relationships.

Mutual information is a great general-purpose metric and especially useful at the start of feature
development when you might not know what model you'd like to use yet. It is:

easy to use and interpret,
computationally efficient,
theoretically well-founded,
resistant to overfitting, and
able to detect any kind of relationship.

Mutual Information and What it Measures

Mutual information describes relationships in terms of uncertainty. The mutual information (MI) between
two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about
the other. If you knew the value of a feature, how much more confident would you be about the target?

Here's an example from the Ames Housing data. The figure shows the relationship between the exterior
quality of a house and the price it sold for. Each point represents a house.

Figure: Knowing the exterior quality of a house reduces uncertainty about its sale price.

From the figure, we can see that knowing the value of ExterQual should make you more certain about
the corresponding SalePrice -- each category of ExterQual tends to concentrate SalePrice to
within a certain range. The mutual information that ExterQual has with SalePrice is the average
reduction of uncertainty in SalePrice taken over the four values of ExterQual. Since Fair occurs
less often than Typical, for instance, Fair gets less weight in the MI score.

(Technical note: What we're calling uncertainty is measured using a quantity from information theory known
as "entropy". The entropy of a variable means roughly: "how many yes-or-no questions you would need to
describe an occurrence of that variable, on average." The more questions you have to ask, the more
uncertain you must be about the variable. Mutual information is how many questions you expect the feature
to answer about the target.)
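
The "average reduction of uncertainty" reading can be checked by hand on a small discrete example. The sketch below is illustrative only: the eight houses and their binned prices are made up, and it measures entropy in bits (base-2 logarithm) so that the numbers line up with the yes-or-no-questions picture.

```python
import math
from collections import Counter, defaultdict


def entropy_bits(values):
    """Shannon entropy of a list of discrete observations, in bits."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information_bits(feature, target):
    """MI as H(target) minus the weighted average of H(target | feature = v)."""
    groups = defaultdict(list)
    for f, t in zip(feature, target):
        groups[f].append(t)
    n = len(target)
    # Each feature value is weighted by how often it occurs.
    conditional = sum(len(g) / n * entropy_bits(g) for g in groups.values())
    return entropy_bits(target) - conditional


# Hypothetical toy data: a quality grade and a binned price for eight houses.
quality = ["Fair", "Typical", "Typical", "Typical", "Good", "Good", "Good", "Excellent"]
price_bin = ["low", "low", "mid", "mid", "mid", "high", "high", "high"]

h_price = entropy_bits(price_bin)
mi = mutual_information_bits(quality, price_bin)
print(f"H(price) = {h_price:.3f} bits, MI(quality; price) = {mi:.3f} bits")
```

Because each group is weighted by its frequency, a rare category such as Fair contributes little to the total, which is the weighting described above.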

Interpreting Mutual Information Scores

The least possible mutual information between quantities is 0.0. When MI is zero, the quantities are
independent: neither can tell you anything about the other. Conversely, in theory there's no upper bound to
what MI can be. In practice, though, values above 2.0 or so are uncommon. (Mutual information is a
logarithmic quantity, so it increases very slowly.)

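
To get a feel for how slowly a logarithmic score grows, consider an idealized case (an assumption for illustration, not something taken from the lesson's data): the feature determines the target exactly, and the target takes k equally likely values. MI then equals the target's entropy, ln(k), when measured in nats, the natural-log unit that scikit-learn's estimators report.

```python
import math

# Idealized assumption: the feature pins down the target exactly, and the
# target takes k equally likely values, so MI = ln(k) nats.
for k in (2, 4, 8, 64, 1024):
    print(f"k = {k:>4}: MI = {math.log(k):.2f} nats")

# Number of equally likely target values needed to reach a given score.
levels_for_2_nats = math.exp(2.0)  # a little over 7
levels_for_4_nats = math.exp(4.0)  # a little under 55
```

Doubling the score from 2.0 to 4.0 requires squaring the number of distinguishable target values, which is why large scores are rare.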
The next figure will give you an idea of how MI values correspond to the kind and degree of association a
feature has with the target.

Left: Mutual information increases as the dependence between feature and target becomes tighter.
Right: Mutual information can capture any kind of association, not just the linear kind that correlation captures.

Here are some things to remember when applying mutual information:

MI can help you to understand the relative potential of a feature as a predictor of the target,
considered by itself.

It's possible for a feature to be very informative when interacting with other features, but not so
informative all alone. MI can't detect interactions between features. It is a univariate metric.

The actual usefulness of a feature depends on the model you use it with. A feature is only useful to the
extent that its relationship with the target is one your model can learn. Just because a feature has a
high MI score doesn't mean your model will be able to do anything with that information. You may need
to transform the feature first to expose the association.
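
The point about interactions can be made concrete with an exclusive-or target. This is a minimal sketch on made-up binary data, using a small hand-written MI function for discrete values rather than any library estimator: each feature alone scores zero, while the pair treated as a single combined feature determines the target completely.

```python
import math
from collections import Counter
from itertools import product


def mutual_information(xs, ys):
    """MI in nats between two discrete sequences, from empirical frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Hypothetical data: every combination of two binary features, repeated,
# with a target that is their exclusive-or.
rows = list(product([0, 1], repeat=2)) * 25
a = [r[0] for r in rows]
b = [r[1] for r in rows]
target = [x ^ y for x, y in rows]

mi_a = mutual_information(a, target)        # feature a alone
mi_b = mutual_information(b, target)        # feature b alone
mi_pair = mutual_information(rows, target)  # both features, treated as one

print(f"a alone: {mi_a:.3f}, b alone: {mi_b:.3f}, pair: {mi_pair:.3f}")
```

A ranking based on per-feature scores would discard both a and b here, even though together they predict the target perfectly.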

Example - 1985 Automobiles

The Automobile (https://www.kaggle.com/toramky/automobile-dataset) dataset consists of 193 cars from
the 1985 model year. The goal for this dataset is to predict a car's price (the target) from 23 of the car's
features, such as make, body_style, and horsepower. In this example, we'll rank the features with
mutual information and investigate the results by data visualization.

This hidden cell imports some libraries and loads the dataset.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

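# Matplotlib 3.6+ names this style "seaborn-v0_8-whitegrid";
# on older releases use "seaborn-whitegrid" instead.
plt.style.use("seaborn-v0_8-whitegrid")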

df = pd.read_csv("../input/fe-course-data/autos.csv")
df.head()

Out[1]:

   symboling  make         fuel_type  aspiration  num_of_doors  body_style   drive_wheels  engine_
0  3          alfa-romero  gas        std         2             convertible  rwd           front
1  3          alfa-romero  gas        std         2             convertible  rwd           front
2  1          alfa-romero  gas        std         2             hatchback    rwd           front
3  2          audi         gas        std         4             sedan        fwd           front
4  2          audi         gas        std         4             sedan        4wd           front

5 rows × 25 columns

The scikit-learn algorithm for MI treats discrete features differently from continuous features.
Consequently, you need to tell it which are which. As a rule of thumb, anything that must have a float
dtype is not discrete. Categoricals (object or categorical dtype) can be treated as discrete by giving
them a label encoding. (You can review label encodings in our Categorical Variables
(http://www.kaggle.com/alexisbcook/categorical-variables) lesson.)
In [2]:
X = df.copy()
y = X.pop("price")

# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()

# All discrete features should now have integer dtypes (double-check this before using MI!)
discrete_features = X.dtypes == int
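
To see what the label encoding and the dtype check in the cell above produce, here is a small sketch on a made-up three-column frame (not the autos data). The use of pd.api.types.is_integer_dtype is a suggestion of this sketch rather than part of the lesson: it accepts any integer width, which may matter on platforms where the built-in int does not correspond to the 64-bit dtype pandas uses.

```python
import pandas as pd

# Hypothetical toy frame, not the autos data.
toy = pd.DataFrame({
    "fuel_type": ["gas", "diesel", "gas", "gas"],  # strings
    "num_of_doors": [2, 4, 4, 2],                  # integers
    "stroke": [2.68, 3.47, 3.40, 2.68],            # floats
})

# factorize() returns integer codes plus the distinct values
# in order of first appearance.
codes, uniques = toy["fuel_type"].factorize()
toy["fuel_type"] = codes

# A dtype check that accepts any integer width, as an alternative to `== int`.
discrete_mask = toy.dtypes.map(pd.api.types.is_integer_dtype)

print(list(codes), list(uniques))
print(discrete_mask)
```

The encoded fuel_type and num_of_doors columns come out as discrete, while the float-valued stroke column does not.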

Scikit-learn has two mutual information metrics in its feature_selection module: one for real-valued
targets (mutual_info_regression) and one for categorical targets (mutual_info_classif). Our
target, price, is real-valued. The next cell computes the MI scores for our features and wraps them up in
a nice dataframe.
In [3]:
from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y, discrete_features):
    # Estimate MI between each column of X and the target y
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores[::3]  # show a few features with their MI scores

Out[3]:
curb_weight 1.540126
highway_mpg 0.951700
length 0.621566
fuel_system 0.485085
stroke 0.389321
num_of_cylinders 0.330988
compression_ratio 0.133927
fuel_type 0.048139
Name: MI Scores, dtype: float64
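
To try the two estimators where the right answer is known in advance, the following sketch uses synthetic data (an assumption for illustration, unrelated to the autos dataset): the target follows a U-shaped curve in one feature and ignores the other. Passing random_state fixes the small random jitter the estimators add, so that repeated runs agree.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

rng = np.random.default_rng(0)
n = 1000

# Synthetic data: the target depends on `signal` through a U-shaped curve
# and not at all on `noise`.
signal = rng.uniform(-1, 1, size=n)
noise = rng.uniform(-1, 1, size=n)
X = np.column_stack([signal, noise])
y = signal**2 + rng.normal(scale=0.05, size=n)

# Real-valued target: use the regression estimator.
mi_reg = mutual_info_regression(X, y, discrete_features=False, random_state=0)

# Categorical target (here: above or below the median): use the classifier estimator.
y_class = (y > np.median(y)).astype(int)
mi_clf = mutual_info_classif(X, y_class, discrete_features=False, random_state=0)

# Linear correlation barely registers the U-shaped dependence.
corr = np.corrcoef(signal, y)[0, 1]

print("regression MI:", mi_reg.round(3))
print("classification MI:", mi_clf.round(3))
print("Pearson correlation of signal with y:", round(corr, 3))
```

Both estimators should rank signal well above noise, even though the correlation between signal and y is close to zero.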

And now a bar plot to make comparisons easier:

In [4]:
def plot_mi_scores(scores):
    # Sort ascending so the highest score ends up at the top of the chart
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")

plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores)

Data visualization is a great follow-up to a utility ranking. Let's take a closer look at a couple of these.

As we might expect, the high-scoring curb_weight feature exhibits a strong relationship with price,
the target.
In [5]:
sns.relplot(x="curb_weight", y="price", data=df);

The fuel_type feature has a fairly low MI score, but as we can see from the figure, it clearly separates
two price populations with different trends within the horsepower feature. This indicates that
fuel_type contributes an interaction effect and might not be unimportant after all. Before deciding a
feature is unimportant from its MI score, it's good to investigate any possible interaction effects -- domain
knowledge can offer a lot of guidance here.
In [6]:
sns.lmplot(x="horsepower", y="price", hue="fuel_type", data=df);

Data visualization is a great addition to your feature-engineering toolbox. Along with utility metrics like
mutual information, visualizations like these can help you discover important relationships in your data.
Check out our Data Visualization (https://www.kaggle.com/learn/data-visualization) course to learn more!

Your Turn

Rank the features (https://www.kaggle.com/kernels/fork/14393925) of the Ames Housing dataset and
choose your first set of features to start developing.
