Correspondence Analysis using Python

Last Updated : 31 May, 2024

Correspondence Analysis (CA) is a statistical technique for analyzing the relationships between categorical variables in a contingency table. It represents the data visually, which makes it easier to identify patterns and associations between the categories of the variables.

What is Correspondence Analysis?

Correspondence analysis is a multivariate graphical technique for analyzing two-way tables (and, through extensions such as multiple correspondence analysis, multi-way tables) that contain some measure of the correspondence between the rows and columns. It decomposes the table's deviations from row-column independence into a set of singular vectors and singular values, so that the data can be visualized in a space of reduced dimensionality. The main goal is to reveal the structure of the data in terms of the associations between the row and column categories.

Key Steps in Correspondence Analysis

  • Construct the Contingency Table: Represent the data as a contingency table whose rows and columns correspond to the categories of the two variables.
  • Calculate the Correspondence Matrix: Divide every cell of the contingency table by the grand total to form the correspondence matrix. Its row and column sums are the row and column masses.
  • Singular Value Decomposition (SVD): Perform the SVD on the standardized residuals of the correspondence matrix (its deviations from the product of the row and column masses) to extract the singular vectors and singular values.
  • Compute Coordinates: Calculate the coordinates of the rows and columns in the reduced-dimensional space from the singular vectors, the singular values, and the masses.
  • Visualize the Results: Plot the row and column coordinates to visualize the relationships between the categories.
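
The steps above can be sketched directly with NumPy. This is a minimal illustration under stated assumptions rather than the implementation used by any particular library: the 3x3 table is an illustrative set of counts, the variable names are my own, and the coordinates are the usual "principal coordinates" (singular vectors scaled by singular values and divided by the square root of the masses).

```python
import numpy as np

# Illustrative 3x3 contingency table (rows: age groups, columns: transport modes).
N = np.array([[20, 30, 50],
              [30, 10, 60],
              [50, 20, 30]], dtype=float)

# Correspondence matrix and masses.
P = N / N.sum()
r = P.sum(axis=1)  # row masses
c = P.sum(axis=0)  # column masses

# Standardized residuals: deviations from independence, scaled by the masses.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# SVD of the standardized residuals.
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates on the first k dimensions.
k = 2
row_coords = (U[:, :k] * sigma[:k]) / np.sqrt(r)[:, None]
col_coords = (Vt.T[:, :k] * sigma[:k]) / np.sqrt(c)[:, None]

# Total inertia is the sum of squared singular values.
total_inertia = (sigma ** 2).sum()
print(row_coords)
print(col_coords)
print(total_inertia)
```

The total inertia equals the chi-square statistic of the table divided by the grand total, which is a convenient sanity check. The signs of the axes are arbitrary, so a library may return the same picture mirrored.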

Performing Correspondence Analysis in Python

Step 1: Importing Libraries

First, we need to install and import the necessary libraries.

Python
# The leading "!" is notebook syntax; in a terminal, run the command without it.
!pip install pandas numpy matplotlib prince

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import prince

Step 2: Creating a Contingency Table

Let's create a sample contingency table representing survey responses of different age groups regarding their preferred modes of transport.

Python
data = {
    'Car': [20, 30, 50],
    'Bike': [30, 10, 20],
    'Bus': [50, 60, 30]
}
age_groups = ['18-25', '26-35', '36-50']
df = pd.DataFrame(data, index=age_groups)
print(df)
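
In practice the counts usually come from raw records rather than being typed in by hand. As a rough sketch, assuming one row per respondent, `pd.crosstab` can build the contingency table. The `responses` frame and its column names `age_group` and `transport` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical raw survey records: one row per respondent.
responses = pd.DataFrame({
    'age_group': ['18-25', '18-25', '26-35', '26-35', '36-50', '36-50', '18-25'],
    'transport': ['Bus', 'Bike', 'Bus', 'Car', 'Car', 'Car', 'Bus'],
})

# Cross-tabulate the two categorical columns into a table of counts.
table = pd.crosstab(responses['age_group'], responses['transport'])
print(table)
```

Combinations that never occur are filled with 0. Correspondence analysis can handle zero cells, but a row or column that sums to zero has no profile and should be dropped first.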

Step 3: Performing Correspondence Analysis

Using the prince library, we can perform Correspondence Analysis on this table.

Python
ca = prince.CA(n_components=2)
ca = ca.fit(df)

Step 4: Extracting and Visualizing Results

We can extract the row and column coordinates from the fitted model and visualize them.

Python
# Extract row and column coordinates
row_coords = ca.row_coordinates(df)
col_coords = ca.column_coordinates(df)

# Plotting the results
plt.figure(figsize=(8, 6))
plt.scatter(row_coords[0], row_coords[1], c='red', label='Rows')
plt.scatter(col_coords[0], col_coords[1], c='blue', label='Columns')

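# Adding labels (.iloc selects by position; a bare [i] would be treated as a
# label lookup on the string index and fails or warns in recent pandas)
for i, txt in enumerate(df.index):
    plt.annotate(txt, (row_coords[0].iloc[i], row_coords[1].iloc[i]), color='red')

for i, txt in enumerate(df.columns):
    plt.annotate(txt, (col_coords[0].iloc[i], col_coords[1].iloc[i]), color='blue')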

plt.title('Correspondence Analysis')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.grid()
plt.show()

Interpreting the Results

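The plot provides a visual representation of the relationships between age groups and transport preferences. Two age groups (or two transport modes) that lie close together have similar distributions. An age group and a transport mode that lie in the same direction from the origin tend to occur together more often than independence would predict, although the distance between a row point and a column point should be read with more caution than distances within one set. For example, if '18-25' and 'Car' are close together, it suggests that this age group chooses cars more often than the sample as a whole.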
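
To make "similar in their distribution" concrete, the closeness of two row points can be checked against the chi-square distance between their row profiles, which is the distance correspondence analysis tries to preserve. The sketch below uses only the standard library and the transport counts from Step 2; the helper name `chi2_distance` is my own.

```python
from math import sqrt

# Counts from the transport table (columns: Car, Bike, Bus).
table = {
    '18-25': [20, 30, 50],
    '26-35': [30, 10, 60],
    '36-50': [50, 20, 30],
}

grand_total = sum(sum(row) for row in table.values())
col_totals = [sum(col) for col in zip(*table.values())]
col_masses = [t / grand_total for t in col_totals]

# Row profile: each row divided by its own total.
profiles = {k: [v / sum(row) for v in row] for k, row in table.items()}

def chi2_distance(a, b):
    """Chi-square distance between two row profiles."""
    return sqrt(sum((x - y) ** 2 / m for x, y, m in zip(a, b, col_masses)))

labels = list(profiles)
distances = {
    (p, q): chi2_distance(profiles[p], profiles[q])
    for i, p in enumerate(labels) for q in labels[i + 1:]
}
for pair, d in distances.items():
    print(pair, round(d, 3))
```

For these counts the '18-25' and '26-35' profiles are the closest pair, so their points should sit nearest to each other on the map. Because a 3x3 table has at most two dimensions, the two-dimensional plot reproduces these distances exactly.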

Output:

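[Figure: scatter plot titled 'Correspondence Analysis' showing the age groups (red) and transport modes (blue) on Dimension 1 and Dimension 2]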


Example: Consumer Preferences for Beverages

Step 1: Importing Libraries

First, ensure that the necessary libraries are installed and imported.

Python
# The leading "!" is notebook syntax; in a terminal, run the command without it.
!pip install pandas numpy matplotlib prince

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import prince

Step 2: Creating a Contingency Table

We'll create a contingency table that shows the preferences of consumers in different regions for various types of beverages.

Python
data = {
    'Coffee': [50, 30, 40, 20],
    'Tea': [30, 60, 10, 40],
    'Juice': [20, 10, 50, 60],
    'Soda': [10, 20, 30, 80]
}
regions = ['North', 'South', 'East', 'West']
df = pd.DataFrame(data, index=regions)
print(df)

This will produce a contingency table like this:

       Coffee  Tea  Juice  Soda
North      50   30     20    10
South      30   60     10    20
East       40   10     50    30
West       20   40     60    80

Step 3: Performing Correspondence Analysis

Using the prince library, we perform Correspondence Analysis on this table.

Python
ca = prince.CA(n_components=2)
ca = ca.fit(df)
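
Before reading a two-dimensional map of a 4x4 table, it helps to know how much of the total inertia the first two dimensions capture, since a table of this size can need up to three. The following is a NumPy sketch of that check, computed by hand from the beverage counts. prince reports a comparable summary on the fitted model, but the attribute name differs between versions, so it is deliberately not used here.

```python
import numpy as np

# Beverage counts (rows: North, South, East, West; columns: Coffee, Tea, Juice, Soda).
N = np.array([[50, 30, 20, 10],
              [30, 60, 10, 20],
              [40, 10, 50, 30],
              [20, 40, 60, 80]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)

# Standardized residuals and their singular values.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
sigma = np.linalg.svd(S, compute_uv=False)

# Inertia per dimension and its share of the total.
inertia = sigma ** 2
share = inertia / inertia.sum()
print(np.round(share, 3))
print(round(float(share[:2].sum()), 3))  # share kept by a 2D map
```

The last singular value is zero up to rounding error, which reflects the fact that a table with four rows and four columns has at most three non-trivial dimensions.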

Step 4: Extracting and Visualizing Results

We can extract the row and column coordinates from the fitted model and visualize them.

Python
# Extract row and column coordinates
row_coords = ca.row_coordinates(df)
col_coords = ca.column_coordinates(df)

# Plotting the results
plt.figure(figsize=(8, 6))
plt.scatter(row_coords[0], row_coords[1], c='red', label='Regions')
plt.scatter(col_coords[0], col_coords[1], c='blue', label='Beverages')

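# Adding labels (.iloc selects by position; a bare [i] would be treated as a
# label lookup on the string index and fails or warns in recent pandas)
for i, txt in enumerate(df.index):
    plt.annotate(txt, (row_coords[0].iloc[i], row_coords[1].iloc[i]), color='red')

for i, txt in enumerate(df.columns):
    plt.annotate(txt, (col_coords[0].iloc[i], col_coords[1].iloc[i]), color='blue')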

plt.title('Correspondence Analysis: Beverage Preferences by Region')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.grid()
plt.show()

Output

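[Figure: scatter plot titled 'Correspondence Analysis: Beverage Preferences by Region' showing the regions (red) and beverages (blue) on Dimension 1 and Dimension 2]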


Interpreting the Results

The plot provides a visual representation of the relationships between regions and beverage preferences. Two regions (or two beverages) that lie close together have similar distributions. A region and a beverage that lie in the same direction from the origin tend to occur together more often than independence would predict. For example, if 'North' and 'Coffee' are close together, it suggests that consumers in the North region choose coffee more often than consumers overall.

By examining the plot, you can identify patterns and associations. For instance, if 'West' and 'Soda' are close and far from the origin, it suggests that soda is strongly over-represented in the West region relative to the other regions.
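
A reading taken from the plot can be verified against the table itself by comparing each observed count with the count expected under independence (row total times column total, divided by the grand total). The sketch below uses only the standard library and the beverage counts from Step 2; a ratio above 1 means the combination is over-represented.

```python
beverages = ['Coffee', 'Tea', 'Juice', 'Soda']
counts = {
    'North': [50, 30, 20, 10],
    'South': [30, 60, 10, 20],
    'East':  [40, 10, 50, 30],
    'West':  [20, 40, 60, 80],
}

n = sum(sum(row) for row in counts.values())
col_totals = [sum(col) for col in zip(*counts.values())]

# Observed / expected ratio for every region-beverage pair.
ratios = {}
for region, row in counts.items():
    row_total = sum(row)
    for beverage, observed, col_total in zip(beverages, row, col_totals):
        expected = row_total * col_total / n
        ratios[(region, beverage)] = observed / expected

# Show the four most over-represented combinations.
top = sorted(ratios.items(), key=lambda kv: -kv[1])[:4]
for (region, beverage), ratio in top:
    print(f"{region:5s} {beverage:6s} {ratio:.2f}")
```

For these counts the strongest associations are South with Tea, North with Coffee, West with Soda, and East with Juice, so those are the pairs one would expect to point in similar directions on the map.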


