Correspondence Analysis using Python
Last Updated: 31 May, 2024
Correspondence Analysis (CA) is a statistical technique for analyzing the relationships between categorical variables in a contingency table. It provides a visual representation of the data, which makes it easier to identify patterns and associations between the categories of the variables.
What is Correspondence Analysis?
Correspondence analysis is a multivariate graphical technique designed to analyze two-way and multi-way tables that contain some measure of correspondence between the rows and columns. It decomposes the data matrix into singular vectors and singular values, which allows the data to be visualized in a lower-dimensional space. The main goal is to reveal the structure of the data in terms of the associations between the row and column categories.
Key Steps in Correspondence Analysis
- Construct the Contingency Table: Represent the data as a contingency table whose rows and columns correspond to the categories of the two variables.
- Calculate the Correspondence Matrix: Normalize the contingency table by dividing every cell by the grand total to form the correspondence matrix.
- Singular Value Decomposition (SVD): Perform SVD on the correspondence matrix (in the standard formulation, on its standardized residuals, i.e. after removing what the row and column totals alone would predict) to extract the singular vectors and singular values.
- Compute Coordinates: Calculate the coordinates of the rows and columns in the reduced-dimensional space from the singular vectors and singular values.
- Visualize the Results: Plot the row and column coordinates to visualize the relationships between the categories.
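The steps listed above can be sketched directly with NumPy. This is a minimal illustration rather than the implementation used by any particular library: the function name `correspondence_analysis` and its variable names are assumptions made for this example, and the counts are the ones used in the transport example later in this article. The SVD here is applied to the standardized residuals of the correspondence matrix, which is the usual formulation of CA.

```python
import numpy as np

# Counts from the transport example (rows: age groups; columns: Car, Bike, Bus).
N = np.array([[20, 30, 50],
              [30, 10, 60],
              [50, 20, 30]], dtype=float)


def correspondence_analysis(N, n_components=2):
    """Hypothetical helper: CA of a two-way table of counts via SVD."""
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)            # proportions expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    # Principal coordinates: singular vectors scaled by singular values and masses.
    row_coords = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]
    col_coords = (Vt.T[:, :k] * sv[:k]) / np.sqrt(c)[:, None]
    inertia = sv[:k] ** 2                # inertia (eigenvalue) of each dimension
    return row_coords, col_coords, inertia


row_coords, col_coords, inertia = correspondence_analysis(N)
print(row_coords)
print(col_coords)
print(inertia)
```

Coordinates produced by a library may differ from these in sign, because singular vectors are only defined up to a sign flip; the relative positions of the points are what matter.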
Performing Correspondence Analysis in Python
Step 1: Importing Libraries
First, we need to install and import the necessary libraries.
Python
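# The leading "!" runs pip as a shell command from a notebook cell (Jupyter/Colab);
# in a terminal, run the same command without it.
!pip install pandas numpy matplotlib prince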
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import prince
Step 2: Creating a Contingency Table
Let's create a sample contingency table representing survey responses of different age groups regarding their preferred modes of transport.
Python
data = {
'Car': [20, 30, 50],
'Bike': [30, 10, 20],
'Bus': [50, 60, 30]
}
age_groups = ['18-25', '26-35', '36-50']
df = pd.DataFrame(data, index=age_groups)
print(df)
Step 3: Performing Correspondence Analysis
Using the prince library, we can perform Correspondence Analysis on this table.
Python
ca = prince.CA(n_components=2)
ca = ca.fit(df)
Step 4: Extracting and Visualizing Results
We can extract the row and column coordinates from the fitted model and visualize them.
Python
# Extract row and column coordinates
row_coords = ca.row_coordinates(df)
col_coords = ca.column_coordinates(df)
# Plotting the results
plt.figure(figsize=(8, 6))
plt.scatter(row_coords[0], row_coords[1], c='red', label='Rows')
plt.scatter(col_coords[0], col_coords[1], c='blue', label='Columns')
# Adding labels
# Use .iloc for positional access: the coordinates are indexed by category label
for i, txt in enumerate(df.index):
    plt.annotate(txt, (row_coords[0].iloc[i], row_coords[1].iloc[i]), color='red')
for i, txt in enumerate(df.columns):
    plt.annotate(txt, (col_coords[0].iloc[i], col_coords[1].iloc[i]), color='blue')
plt.title('Correspondence Analysis')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.grid()
plt.show()
Interpreting the Results
The plot provides a visual representation of the relationships between age groups and transport preferences. Points that are close together indicate categories with similar distributions. For example, if '18-25' and 'Car' lie close together, it suggests that respondents aged 18-25 choose cars more often than the overall proportions would predict.
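A reading of the plot can be checked against the table itself by comparing each row profile (the share of each transport mode within an age group) with the average profile over all respondents. The sketch below rebuilds the transport table from Step 2; the names `row_profiles`, `average_profile`, and `deviation` are chosen for this illustration only.

```python
import pandas as pd

# Transport table from Step 2.
df = pd.DataFrame(
    {'Car': [20, 30, 50], 'Bike': [30, 10, 20], 'Bus': [50, 60, 30]},
    index=['18-25', '26-35', '36-50'],
)

# Row profiles: each row divided by its own total, so every row sums to 1.
row_profiles = df.div(df.sum(axis=1), axis=0)

# Average profile: column totals divided by the grand total.
average_profile = df.sum(axis=0) / df.to_numpy().sum()

# Positive values mark modes that are over-represented in an age group.
deviation = row_profiles - average_profile

print(row_profiles.round(2))
print(average_profile.round(2))
print(deviation.round(2))
```

For this particular table, the Car share is highest in the 36-50 row (0.50, against an average of about 0.33), so that is the age group expected to sit nearest to 'Car' along the dominant dimension.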
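Output: [Figure: scatter plot of the age groups (red) and transport modes (blue) on Dimension 1 and Dimension 2]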

Example: Consumer Preferences for Beverages
Step 1: Importing Libraries
First, ensure that the necessary libraries are installed and imported.
Python
!pip install pandas numpy matplotlib prince
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import prince
Step 2: Creating a Contingency Table
We'll create a contingency table that shows the preferences of consumers in different regions for various types of beverages.
Python
data = {
'Coffee': [50, 30, 40, 20],
'Tea': [30, 60, 10, 40],
'Juice': [20, 10, 50, 60],
'Soda': [10, 20, 30, 80]
}
regions = ['North', 'South', 'East', 'West']
df = pd.DataFrame(data, index=regions)
print(df)
This will produce a contingency table like this:
       Coffee  Tea  Juice  Soda
North      50   30     20    10
South      30   60     10    20
East       40   10     50    30
West       20   40     60    80
Step 3: Performing Correspondence Analysis
Using the prince library, we perform Correspondence Analysis on this table.
Python
ca = prince.CA(n_components=2)
ca = ca.fit(df)
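A 4 x 4 table has up to min(4, 4) - 1 = 3 non-trivial dimensions, so keeping `n_components=2` may discard part of the structure. One way to see how much is retained is to compute the inertia (squared singular value) of each dimension. prince reports eigenvalues on the fitted object as well, but the attribute names may differ between releases, so this sketch computes them with NumPy under the standard CA formulation; all variable names are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Beverage table from Step 2.
df = pd.DataFrame(
    {
        'Coffee': [50, 30, 40, 20],
        'Tea': [30, 60, 10, 40],
        'Juice': [20, 10, 50, 60],
        'Soda': [10, 20, 30, 80],
    },
    index=['North', 'South', 'East', 'West'],
)

counts = df.to_numpy(dtype=float)
P = counts / counts.sum()                 # correspondence matrix
r = P.sum(axis=1)                         # row masses
c = P.sum(axis=0)                         # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals

# Squared singular values are the inertias of the dimensions.
singular_values = np.linalg.svd(S, compute_uv=False)
eigenvalues = singular_values ** 2
total_inertia = eigenvalues.sum()
share = eigenvalues / total_inertia

for dim, (ev, sh) in enumerate(zip(eigenvalues, share), start=1):
    print(f"Dimension {dim}: inertia={ev:.4f}, share={sh:.1%}")
print(f"Share retained by the first two dimensions: {share[:2].sum():.1%}")
```

The last dimension printed has an inertia of essentially zero; it is the trivial dimension left over after centering. The total inertia equals the chi-square statistic of the table divided by the grand total.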
Step 4: Extracting and Visualizing Results
We can extract the row and column coordinates from the fitted model and visualize them.
Python
# Extract row and column coordinates
row_coords = ca.row_coordinates(df)
col_coords = ca.column_coordinates(df)
# Plotting the results
plt.figure(figsize=(8, 6))
plt.scatter(row_coords[0], row_coords[1], c='red', label='Regions')
plt.scatter(col_coords[0], col_coords[1], c='blue', label='Beverages')
# Adding labels
# Use .iloc for positional access: the coordinates are indexed by category label
for i, txt in enumerate(df.index):
    plt.annotate(txt, (row_coords[0].iloc[i], row_coords[1].iloc[i]), color='red')
for i, txt in enumerate(df.columns):
    plt.annotate(txt, (col_coords[0].iloc[i], col_coords[1].iloc[i]), color='blue')
plt.title('Correspondence Analysis: Beverage Preferences by Region')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.grid()
plt.show()
Output: [Figure: scatter plot of the regions (red) and beverages (blue) on Dimension 1 and Dimension 2]

Interpreting the Results
The plot provides a visual representation of the relationships between regions and beverage preferences. Points that are close together indicate categories with similar distributions. For example, if 'North' and 'Coffee' are close together, it suggests that consumers in the North region choose coffee more often than the overall proportions would predict.
By examining the plot, you can identify further patterns and associations. For instance, if 'West' and 'Soda' are close, it suggests a strong preference for soda in the West region.
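Associations read off the plot can be cross-checked with the Pearson residuals of the table, (observed - expected) / sqrt(expected), where the expected counts assume that region and beverage are independent. Large positive residuals mark region/beverage pairs that occur more often than independence would predict. This is a minimal sketch using the beverage table from Step 2; the variable names are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Beverage table from Step 2.
df = pd.DataFrame(
    {
        'Coffee': [50, 30, 40, 20],
        'Tea': [30, 60, 10, 40],
        'Juice': [20, 10, 50, 60],
        'Soda': [10, 20, 30, 80],
    },
    index=['North', 'South', 'East', 'West'],
)

observed = df.to_numpy(dtype=float)
n = observed.sum()

# Expected counts under independence: (row total * column total) / grand total.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Pearson residuals, labeled like the original table.
residuals = pd.DataFrame(
    (observed - expected) / np.sqrt(expected),
    index=df.index,
    columns=df.columns,
)
print(residuals.round(2))
```

In this table the West/Soda and North/Coffee residuals are both positive (observed counts of 80 and 50 against expected counts of 50 and 27.5), which is consistent with the readings suggested above.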