Correlation Analysis in python

Uploaded by

willamsgeorge863
Copyright
© All Rights Reserved

Python made easy with CyberGiant

Study Guide: Correlation Analysis in Python

1. Introduction to Correlation

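Correlation measures the strength and direction of the
relationship between two variables. It tells us whether an
increase in one variable tends to be accompanied by an increase
or a decrease in the other. Correlation on its own does not show
that one variable causes the other.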

● Positive Correlation: Both variables tend to increase together.
● Negative Correlation: One variable tends to increase while the
other decreases.
● No Correlation: No linear relationship between the variables.

The Pearson correlation coefficient (r) is a common measure:

● r = 1: Perfect positive correlation.
● r = -1: Perfect negative correlation.
● r = 0: No linear correlation.
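To make the meaning of r concrete, here is a minimal sketch that computes it directly from its definition: the sum of the products of the deviations from each mean, divided by the square root of the product of the two sums of squared deviations. The function name pearson_r and the small lists are illustrative assumptions, not part of the guide's dataset.

```python
from math import sqrt


def pearson_r(x, y):
    """Compute the Pearson correlation coefficient from its definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (numerator of r)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable (used in the denominator)
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Illustrative (made-up) data: y_up rises with x, y_down falls with x
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print("Rising data: r =", r_up)
print("Falling data: r =", r_down)
```

In practice you would normally rely on a library routine such as scipy's pearsonr, which the rest of this guide uses. The hand-written version is only meant to show what the number represents.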

2. Setting Up Python

First, ensure you have Python installed. Then install the
required libraries. Open your terminal or command prompt and
type:

pip install pandas seaborn matplotlib scipy
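After installing, a quick check like the following can confirm that Python can locate each library. This is a small sketch using only the standard library. The helper name missing_packages is an assumption made for illustration.

```python
from importlib.util import find_spec

# The four libraries this guide installs with pip
REQUIRED = ["pandas", "seaborn", "matplotlib", "scipy"]


def missing_packages(names):
    """Return the names that Python cannot locate in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported missing, re-run the pip command using the same Python interpreter you use to run your scripts.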

3. Writing the Python Code


Step 1: Import Libraries

We need several libraries for data manipulation, statistical
analysis, and visualization.

import pandas as pd  # For data handling
import seaborn as sns  # For data visualization
import matplotlib.pyplot as plt  # For plotting graphs
from scipy.stats import pearsonr  # For statistical analysis

Step 2: Create Sample Data

Let's create a simple dataset with two variables: StudyHours and
TestScores.

# Sample data
data = {
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],  # Number of hours studied
    'TestScores': [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # Corresponding test scores
}

# Convert data to DataFrame
df = pd.DataFrame(data)

Step 3: Calculate Pearson Correlation Coefficient

The Pearson correlation coefficient measures the linear
relationship between two variables.

# Calculate Pearson correlation coefficient
correlation, p_value = pearsonr(df['StudyHours'], df['TestScores'])
print("Pearson Correlation Coefficient:", correlation)  # Close to 1: this sample data is perfectly linear
print("P-value:", p_value)  # Very small (effectively zero) for this data

Explanation:

● correlation is a number between -1 and 1 that tells us the
strength and direction of the linear relationship.
● p_value is the probability of seeing a correlation at least
this strong if the two variables were actually uncorrelated. A
small p-value (typically < 0.05) means the correlation is
statistically significant.

Step 4: Visualize the Data

Visualization helps us see the relationship between the
variables.

# Create scatter plot
sns.scatterplot(x='StudyHours', y='TestScores', data=df)
plt.title("Scatter Plot of Study Hours vs Test Scores")
plt.xlabel("Study Hours")
plt.ylabel("Test Scores")
plt.show()

Explanation:

● The scatter plot shows individual data points.
● If the points roughly form a straight line, this indicates a
strong linear relationship.
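One way to check the "straight line" impression numerically is to fit a least-squares line and look at how far the points fall from it. The sketch below does this in plain Python. The helper name fit_line is an assumption for illustration, and the lists repeat the StudyHours and TestScores values from Step 2.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


study_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test_scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

slope, intercept = fit_line(study_hours, test_scores)
# Residuals: vertical distance of each point from the fitted line
residuals = [b - (intercept + slope * a) for a, b in zip(study_hours, test_scores)]

print("Slope:", slope)
print("Intercept:", intercept)
print("Largest residual:", max(abs(res) for res in residuals))
```

For this sample data every point lies on the line, so the residuals are essentially zero. With real measurements the residuals would be larger, and the correlation coefficient correspondingly further from 1.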
Step 5: Correlation Matrix

For datasets with more variables, a correlation matrix can show
the correlation between each pair of variables.

# Add more variables for demonstration
df['PracticeTests'] = [2, 3, 1, 2, 3, 4, 2, 5, 4, 6]  # Number of practice tests taken

# Calculate correlation matrix
corr_matrix = df.corr()

# Visualize correlation matrix
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Heatmap of Correlation Matrix")
plt.show()

Explanation:

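● df.corr() calculates the pairwise correlation matrix (Pearson
by default).
● The heatmap encodes each coefficient as a color. With the
coolwarm colormap, higher coefficients appear in warmer (red)
tones and lower ones in cooler (blue) tones.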
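Besides plotting it, the correlation matrix can be read programmatically. The following sketch rebuilds the guide's three-column DataFrame, looks up a single pair, and ranks the other variables by their correlation with TestScores. The variable names pair_r and ranked are assumptions for illustration.

```python
import pandas as pd

# Same three columns used in this guide
df = pd.DataFrame({
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'TestScores': [55, 60, 65, 70, 75, 80, 85, 90, 95, 100],
    'PracticeTests': [2, 3, 1, 2, 3, 4, 2, 5, 4, 6],
})

corr_matrix = df.corr()  # Pearson correlation is the default method

# Look up the coefficient for one specific pair of variables
pair_r = corr_matrix.loc['StudyHours', 'PracticeTests']
print("StudyHours vs PracticeTests:", pair_r)

# Rank the other variables by how strongly they correlate with TestScores
ranked = corr_matrix['TestScores'].drop('TestScores').sort_values(ascending=False)
print(ranked)
```

The matrix is symmetric and has 1 on its diagonal (each variable correlates perfectly with itself), so only the values on one side of the diagonal carry new information.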

4. Full Example Code

Here is the complete example with all steps combined:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Step 1: Create sample data
data = {
    'StudyHours': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'TestScores': [55, 60, 65, 70, 75, 80, 85, 90, 95, 100],
    'PracticeTests': [2, 3, 1, 2, 3, 4, 2, 5, 4, 6]
}

# Step 2: Convert data to a DataFrame
df = pd.DataFrame(data)

# Step 3: Calculate Pearson correlation coefficient
correlation, p_value = pearsonr(df['StudyHours'], df['TestScores'])
print("Pearson Correlation Coefficient:", correlation)
print("P-value:", p_value)

# Step 4: Create scatter plot
sns.scatterplot(x='StudyHours', y='TestScores', data=df)
plt.title("Scatter Plot of Study Hours vs Test Scores")
plt.xlabel("Study Hours")
plt.ylabel("Test Scores")
plt.show()

# Step 5: Calculate and visualize correlation matrix
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Heatmap of Correlation Matrix")
plt.show()

5. Interpreting the Results

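● Pearson Correlation Coefficient: A value close to 1 indicates
a strong positive linear relationship, a value close to -1 a
strong negative one, and a value near 0 little or no linear
relationship.
● P-value: A small p-value (< 0.05) suggests the correlation
is statistically significant, meaning it would be unlikely to
arise by chance if the variables were uncorrelated.
● Scatter Plot: Helps visualize the relationship between two
variables. Points that cluster tightly around a straight line
indicate a strong linear correlation.
● Heatmap: Visualizes the correlation matrix. With the coolwarm
colormap used here, red tones mark the highest coefficients
and blue tones the lowest. The annotated numbers give the
exact values.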
