Correspondence Analysis using Python

Last Updated : 31 May, 2024

Correspondence Analysis (CA) is a statistical technique for analyzing the relationships between categorical variables in a contingency table. It provides a visual representation of the data, making it easier to identify patterns and associations between the categories of the variables.

What is Correspondence Analysis?

Correspondence analysis is a multivariate graphical technique for analyzing two-way and multi-way tables that contain some measure of correspondence between the rows and columns. It decomposes a standardized version of the data matrix into singular vectors and singular values, so the data can be visualized in a reduced-dimensional space. The main goal is to reveal the structure of the data in terms of the associations between the row and column categories.

Key Steps in Correspondence Analysis

Construct the Contingency Table: Represent the data as a contingency table whose rows and columns correspond to the categories of the two variables.
Calculate the Correspondence Matrix: Divide each cell of the contingency table by the grand total to form the correspondence matrix.
Singular Value Decomposition (SVD): Center the correspondence matrix on the values expected under independence, scale it by the row and column masses, and perform the SVD on the result to extract the singular vectors and singular values.
Compute Coordinates: Calculate the coordinates of the rows and columns in the reduced-dimensional space using the singular vectors and singular values.
Visualize the Results: Plot the row and column coordinates to visualize the relationships between the categories.
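The steps above can be sketched directly in NumPy, without a dedicated CA library. This is a minimal illustration rather than a reference implementation: it assumes the transport-preference counts used later in this article, and the variable names (N, P, r, c, S) are chosen here for illustration. The signs of the resulting axes are arbitrary, so a plot built from these coordinates may appear mirrored compared with another tool's output.

```python
import numpy as np

# Step 1: contingency table (rows: age groups; columns: Car, Bike, Bus)
N = np.array([[20, 30, 50],
              [30, 10, 60],
              [50, 20, 30]], dtype=float)

# Step 2: correspondence matrix and its margins (the "masses")
n = N.sum()
P = N / n
r = P.sum(axis=1)   # row masses
c = P.sum(axis=0)   # column masses

# Step 3: standardized residuals, then SVD
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Step 4: principal coordinates in the first k dimensions
k = 2
row_coords = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]
col_coords = (Vt.T[:, :k] * sv[:k]) / np.sqrt(c)[:, None]

# Total inertia: sum of squared singular values (chi-square statistic / n)
total_inertia = (sv ** 2).sum()
print(row_coords)
print(col_coords)
print(total_inertia)
```

The total inertia equals the chi-square statistic of the table divided by the grand total, which gives a quick sanity check on the decomposition.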

Performing Correspondence Analysis in Python

Step 1: Importing Libraries

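First, we need to install and import the necessary libraries. The leading ! on the pip line is Jupyter/Colab notebook syntax; from a terminal, run the same command without it.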

Python

!pip install pandas numpy matplotlib prince

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import prince

Step 2: Creating a Contingency Table

Let's create a sample contingency table representing survey responses of different age groups regarding their preferred modes of transport.

Python

data = {
    'Car': [20, 30, 50],
    'Bike': [30, 10, 20],
    'Bus': [50, 60, 30]
}
age_groups = ['18-25', '26-35', '36-50']
df = pd.DataFrame(data, index=age_groups)
print(df)
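In practice the counts usually start out as one record per respondent rather than as a ready-made table. As a small sketch, assuming a hypothetical long-format DataFrame named responses with columns age_group and transport (both names invented for this illustration), pandas.crosstab builds the contingency table:

```python
import pandas as pd

# Hypothetical raw survey records: one row per respondent
responses = pd.DataFrame({
    'age_group': ['18-25', '18-25', '26-35', '36-50', '36-50', '26-35'],
    'transport': ['Bus', 'Bike', 'Bus', 'Car', 'Car', 'Car'],
})

# Cross-tabulate: rows are age groups, columns are transport modes
table = pd.crosstab(responses['age_group'], responses['transport'])
print(table)
```

The resulting table can be passed to the CA step in the same way as the hand-built df above.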

Step 3: Performing Correspondence Analysis

Using the prince library, we can perform Correspondence Analysis on this table.

Python

ca = prince.CA(n_components=2)
ca = ca.fit(df)

Step 4: Extracting and Visualizing Results

We can extract the row and column coordinates from the fitted model and visualize them.

Python

# Extract row and column coordinates
row_coords = ca.row_coordinates(df)
col_coords = ca.column_coordinates(df)

# Plotting the results
plt.figure(figsize=(8, 6))
plt.scatter(row_coords[0], row_coords[1], c='red', label='Rows')
plt.scatter(col_coords[0], col_coords[1], c='blue', label='Columns')

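# Adding labels (.iloc gives positional access; the coordinate
# Series are indexed by category label, not by integer position)
for i, txt in enumerate(df.index):
    plt.annotate(txt, (row_coords[0].iloc[i], row_coords[1].iloc[i]), color='red')

for i, txt in enumerate(df.columns):
    plt.annotate(txt, (col_coords[0].iloc[i], col_coords[1].iloc[i]), color='blue')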

plt.title('Correspondence Analysis')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.grid()
plt.show()

Interpreting the Results

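The plot shows the relationships between age groups and transport preferences. Age groups that lie close together have similar transport profiles, and transport modes that lie close together are chosen by similar mixes of age groups. An age group and a transport mode that lie in the same direction from the origin are positively associated. For example, if '36-50' and 'Car' fall on the same side of the plot, it suggests that the 36-50 group chooses cars more often than respondents overall, which matches the table: 50 of the 100 respondents aged 36-50 chose Car, against one third of all respondents.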
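A reading of the plot can be cross-checked against the row profiles, that is, each age group's shares across transport modes compared with the overall shares. The sketch below uses only pandas on the same table; the names row_profiles, average_profile and ratio are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Car': [20, 30, 50], 'Bike': [30, 10, 20], 'Bus': [50, 60, 30]},
    index=['18-25', '26-35', '36-50'],
)

# Row profiles: each age group's shares across transport modes
row_profiles = df.div(df.sum(axis=1), axis=0)

# Average profile: overall share of each mode across all respondents
average_profile = df.sum(axis=0) / df.to_numpy().sum()

# Ratio > 1: the mode is over-represented in that age group
ratio = row_profiles.div(average_profile, axis=1)
print(ratio.round(2))
```

Ratios above 1 mark the age group and mode pairs that should point in a similar direction from the origin on the map; ratios below 1 mark pairs that should point away from each other.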

Example: Consumer Preferences for Beverages

Step 1: Importing Libraries

First, ensure that the necessary libraries are installed and imported.

Python

!pip install pandas numpy matplotlib prince

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import prince

Step 2: Creating a Contingency Table

We'll create a contingency table that shows the preferences of consumers in different regions for various types of beverages.

Python

data = {
    'Coffee': [50, 30, 40, 20],
    'Tea': [30, 60, 10, 40],
    'Juice': [20, 10, 50, 60],
    'Soda': [10, 20, 30, 80]
}
regions = ['North', 'South', 'East', 'West']
df = pd.DataFrame(data, index=regions)
print(df)

This will produce a contingency table like this:

       Coffee  Tea  Juice  Soda
North      50   30     20    10
South      30   60     10    20
East       40   10     50    30
West       20   40     60    80

Step 3: Performing Correspondence Analysis

Using the prince library, we perform Correspondence Analysis on this table.

Python

ca = prince.CA(n_components=2)
ca = ca.fit(df)
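A 4x4 table has up to three non-trivial dimensions, so a two-dimensional plot shows only part of the association. The fitted prince model reports this kind of information, but the attribute names may differ between releases, so the sketch below computes the share of inertia per dimension directly with NumPy. It assumes the beverage counts above; the names sv, eigenvalues and share are illustrative.

```python
import numpy as np

# Beverage counts (rows: North, South, East, West;
# columns: Coffee, Tea, Juice, Soda)
N = np.array([[50, 30, 20, 10],
              [30, 60, 10, 20],
              [40, 10, 50, 30],
              [20, 40, 60, 80]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

# Standardized residuals and their singular values (sorted descending)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
sv = np.linalg.svd(S, compute_uv=False)

# At most min(rows, cols) - 1 = 3 dimensions carry inertia;
# the last singular value is numerically zero
eigenvalues = sv[:3] ** 2
share = eigenvalues / eigenvalues.sum()

print(share.round(3))
print(share[:2].sum())   # fraction of inertia shown by a 2D plot
```

If the first two shares add up to most of the total, the 2D map is a faithful summary; if not, the third dimension deserves a look before drawing conclusions.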

Step 4: Extracting and Visualizing Results

We can extract the row and column coordinates from the fitted model and visualize them.

Python

# Extract row and column coordinates
row_coords = ca.row_coordinates(df)
col_coords = ca.column_coordinates(df)

# Plotting the results
plt.figure(figsize=(8, 6))
plt.scatter(row_coords[0], row_coords[1], c='red', label='Regions')
plt.scatter(col_coords[0], col_coords[1], c='blue', label='Beverages')

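# Adding labels (.iloc gives positional access; the coordinate
# Series are indexed by category label, not by integer position)
for i, txt in enumerate(df.index):
    plt.annotate(txt, (row_coords[0].iloc[i], row_coords[1].iloc[i]), color='red')

for i, txt in enumerate(df.columns):
    plt.annotate(txt, (col_coords[0].iloc[i], col_coords[1].iloc[i]), color='blue')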

plt.title('Correspondence Analysis: Beverage Preferences by Region')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.grid()
plt.show()

Interpreting the Results

The plot shows the relationships between regions and beverage preferences. Regions that lie close together have similar beverage profiles, and beverages that lie close together are preferred by similar mixes of regions. For example, if 'North' and 'Coffee' lie in the same direction from the origin, it suggests that consumers in the North region choose coffee more often than consumers overall.

By examining the plot, you can identify further patterns and associations. For instance, if 'West' and 'Soda' lie close together and far from the origin, it suggests a strong preference for soda in the West region.

subramanyasmgm