Meeting 5 - Data Analysis
CORRELATION
COURSE CONTENT
Introducing correlation
Types of Correlation
Methods of Measuring Correlation
Significance Test in Correlation Analysis
Steps in Correlation Analysis
Case Study: Correlation Analysis of the Lung Cancer Dataset
COURSE LEARNING
OBJECTIVES
After attending this course, students will be able to :
Explain the basic concepts of correlation analysis.
Explain how to measure the strength and direction
of the relationship between variables using the
correlation coefficient.
Recognize the types of correlation and when to
use each method.
Interpret correlation analysis results in the context
of relevant problems.
WHAT IS CORRELATION?
Correlation is a statistical method used to measure
the strength and direction of the relationship
between two or more variables.
Correlation only demonstrates how closely two
variables are related to one another; it does not
establish a cause and effect relationship.
THE OBJECTIVE OF
CORRELATION ANALYSIS
Find patterns of relationships between variables.
Understand the strength and direction of the
relationship.
Guide decision-making.
TYPES OF CORRELATION
Positive Correlation
• Both variables move in the same direction.
• Example: Increase in temperature, increase in ice cream
sales.
Negative Correlation
• Variables move in opposite directions.
• Example: Increase in price, decrease in the number of
buyers.
No Correlation
• No apparent relationship between variables.
• Example: the relationship between temperature in New
York and the price of tea in China.
METHODS OF
MEASURING
CORRELATION
Pearson Correlation Coefficient
Spearman and Kendall Correlation
PEARSON CORRELATION
COEFFICIENT
Used for continuous variables with a normal or nearly
normal distribution.
Measures the strength and direction of the linear
relationship between two variables.
Range:
• From -1 to 1.
• 1 indicates a perfect positive correlation, -1
indicates a perfect negative correlation, and 0
indicates no linear correlation.
CALCULATE PEARSON
CORRELATION COEFFICIENT
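The formula on this slide did not survive extraction. The standard definition of the sample Pearson coefficient, which is presumably what was shown, is:
r = sum((x_i - mean_x) * (y_i - mean_y)) / sqrt(sum((x_i - mean_x)^2) * sum((y_i - mean_y)^2))
A minimal sketch of that definition in plain Python follows. The function name pearson_r and the temperature and sales figures are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient, computed from its definition."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator terms: sums of squared deviations
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data: daily temperature (degrees Celsius) and ice cream sales (units)
temperature = [20, 22, 25, 27, 30, 32]
sales = [110, 125, 150, 160, 190, 205]
r = pearson_r(temperature, sales)
print(f"r = {r:.3f}")  # close to 1: strong positive linear relationship
```

In practice a library routine such as scipy.stats.pearsonr, used later in the case study, returns the same coefficient together with a p-value.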
SPEARMAN AND
KENDALL CORRELATION
Spearman Correlation:
• Ordinal variables or continuous variables with a
non-normal distribution
• Measures the strength and direction of
monotonic relationships (not necessarily linear).
Kendall Correlation:
• Ordinal variables or continuous variables with a
non-normal distribution.
• Measures the similarity of the orderings of the
data points in two variables.
CALCULATE SPEARMAN AND
KENDALL CORRELATION
COEFFICIENT
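The formulas on this slide did not survive extraction. As a hedged sketch of the usual definitions: Spearman's rho is the Pearson coefficient computed on the ranks of the data, and Kendall's tau (the simple tau-a variant) is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. All function names and data below are illustrative. Note that scipy.stats.kendalltau computes the tau-b variant by default, which adjusts for ties, so its result can differ from tau-a when the data contain ties.

```python
import math
from itertools import combinations

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson coefficient from its definition (assumes non-constant inputs)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0 means a tie; tau-a counts it as neither
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical monotonic but non-linear data
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]
print("Pearson :", round(pearson_r(x, y), 3))      # below 1: not linear
print("Spearman:", round(spearman_rho(x, y), 3))   # 1.0: perfectly monotonic
print("Kendall :", round(kendall_tau_a(x, y), 3))  # 1.0: every pair is concordant
```

The example shows why the rank-based methods suit monotonic relationships: the ordering of the two variables agrees perfectly even though the relationship is far from a straight line.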
INTERPRETING
CORRELATION
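The ranges apply to the absolute value of the coefficient;
the sign gives the direction.
Strong (0.7 - 1.0): Strong correlation.
Moderate (0.4 - 0.7): Moderate correlation.
Weak (0.1 - 0.4): Weak correlation.
Very Weak (0 - 0.1): Almost no correlation.
Strength is not the same as statistical significance,
which is assessed separately with a significance test.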
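The ranges above can be sketched as a small helper. The function name describe_correlation is hypothetical, and because the slide's ranges share their endpoints, assigning each boundary value (0.1, 0.4, 0.7) to the stronger category is an assumption.

```python
def describe_correlation(r):
    """Return (direction, strength) for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)  # strength depends on magnitude only
    if size >= 0.7:
        strength = "strong"
    elif size >= 0.4:
        strength = "moderate"
    elif size >= 0.1:
        strength = "weak"
    else:
        strength = "very weak"
    # Direction depends on the sign only
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return direction, strength

print(describe_correlation(0.29))   # ('positive', 'weak')
print(describe_correlation(-0.85))  # ('negative', 'strong')
```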
SIGNIFICANCE TEST IN
CORRELATION ANALYSIS
In correlation analysis, significance tests can be
performed to determine whether the correlation
coefficient calculated from the sample data is
statistically significant.
The most common test used for this purpose is the
t-test for correlation coefficients, which helps
assess whether the observed correlation reflects a
real relationship or could have occurred by chance.
Example : Is there a significant relationship
between advertising spending and sales?
PERFORMING
HYPOTHESIS TEST
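The content of this slide did not survive extraction. The usual t-test for a Pearson coefficient tests H0: no correlation in the population, using t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. A minimal sketch follows; the function name, the sample values r = 0.5 and n = 30, and the use of a tabulated critical value are assumptions for illustration.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing H0: population correlation = 0."""
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    if abs(r) >= 1:
        raise ValueError("t is undefined for a perfect correlation (|r| = 1)")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical sample: r = 0.5 observed on n = 30 pairs
r, n = 0.5, 30
t_stat = correlation_t_statistic(r, n)
df = n - 2

# Two-tailed critical value for alpha = 0.05 and df = 28, taken from a t table
t_critical = 2.048

print(f"t = {t_stat:.3f} with {df} degrees of freedom")
if abs(t_stat) > t_critical:
    print("Reject H0: the correlation is statistically significant.")
else:
    print("Fail to reject H0: the correlation is not statistically significant.")
```

Comparing |t| with the critical value is equivalent to comparing the p-value with alpha. Library routines such as scipy.stats.pearsonr return the p-value directly, which avoids the table lookup.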
STEPS IN CORRELATION
ANALYSIS
1. Understanding the Data
Identify Variables: Determine which variables you want
to analyze for correlation.
Data Type: Understand the data types of the variables
(numeric, categorical, etc.).
Data Distribution: Examine the data distribution using
visualizations and descriptive statistics.
2. Setting the Analysis Goal
Hypotheses: Plan the hypotheses you want to test. Do
you expect a positive, negative, or no correlation?
Objective: Determine what you want to learn from the
correlation analysis, such as the relationship between
variable A and variable B.
3. Selecting the Correlation Method
Pearson Correlation: Used to measure the
linear relationship between two numerical
variables.
Spearman Correlation: Used to measure the
monotonic relationship (not necessarily linear)
between numerical or ordinal variables.
Kendall Correlation: Used to measure the
similarity of the orderings of data points
between two variables.
4. Calculating Correlation
Calculate Correlation Coefficient: Compute the
correlation coefficient using the appropriate
formula based on the chosen method.
Interpret the Coefficient: Determine the
direction (positive/negative) and strength of the
correlation based on the coefficient value
(between -1 and 1).
5. Data Visualization
Scatter Plot: Plot a scatter plot to visualize the relationship
between two variables.
Heatmap: If you're calculating correlations for multiple
variables, visualize all correlation coefficients at once using a
heatmap.
6. Significance Testing (Optional):
Hypothesis Testing: Conduct significance tests (such as t-test)
to determine if the observed correlation is statistically
significant.
Choose Significance Level: Select alpha value (usually 0.05) as
the threshold to determine whether the correlation is
significant.
7. Interpreting the Results:
Correlation Direction: Is it positive, negative, or no
correlation?
Correlation Strength: How strong is the relationship based on
the coefficient value?
Significance: If you performed significance testing, check if
the correlation is statistically significant.
8. Conclusion and Reporting:
Summarize Conclusion: Provide a brief summary of your
findings.
Report: If this analysis is part of a larger report, contextualize
the results and provide in-depth interpretations.
CASE STUDY : LUNG
CANCER DATASET
Step 1: Understanding the Data
import pandas as pd
# Read the dataset
data = pd.read_csv('lung-cancer.csv')
# Display first few rows of the dataset
print(data.head())
# Check data types of variables
print(data.dtypes)
# Transform the categorical target (strings or objects) to whole numbers (integers)
from sklearn.preprocessing import LabelEncoder
lung_encoder = LabelEncoder()
data = data.copy()
data['LUNG_CANCER'] = lung_encoder.fit_transform(data['LUNG_CANCER'])
# Compute descriptive statistics
print(data.describe())
Step 2: Setting the Analysis Goal
In this step, you define the specific correlation you want
to explore between variables. Let's consider a scenario
where the goal is to investigate the correlation between
alcohol consumption and lung cancer.
2.1. Define the Hypothesis:
Null Hypothesis (H0): There is no significant
correlation between alcohol consumption and lung
cancer.
Alternative Hypothesis (H1): There is a significant
correlation between alcohol consumption and lung
cancer.
Step 3: Selecting the Correlation Method
In this step, we use the Pearson method to calculate
the correlation coefficient between alcohol
consumption and lung cancer.
Step 4: Calculating Correlation
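# Perform Pearson correlation hypothesis test
from scipy.stats import pearsonr
corr_coefficient, p_value = pearsonr(data['ALCOHOL CONSUMING'], data['LUNG_CANCER'])
print('Correlation coefficient:', corr_coefficient)
print('p-value:', p_value)
Step 5: Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Assumed definition: the slide uses correlation_matrix without showing how it
# was built. numeric_only=True skips text columns (requires pandas 1.5 or later).
correlation_matrix = data.corr(numeric_only=True)
# Creating a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Pearson Correlation Coefficient Heatmap')
plt.show()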
Step 6: Interpreting the Results
Correlation direction: The Pearson correlation between
alcohol consumption and lung cancer is positive.
Correlation strength: 0.28853280309173124, which is a weak
correlation (0.1 - 0.4 range).
Significance: The correlation is statistically significant,
so H0 is rejected. A significant correlation can still be weak.
COURSE ASSIGNMENT
1. Data Preparation
Read the dataset using Python (use pandas).
Understand the structure of the dataset, check for missing values,
and perform data cleaning if necessary.
2. Exploratory Data Analysis (EDA)
Conduct data exploration to understand the distribution of
variables.
Create data visualizations such as histograms, box plots, or scatter
plots for each variable.
3. Correlation Analysis
Calculate the correlation coefficient between variables X, Y, and Z.
Determine the type of correlation (positive, negative, or no
correlation) based on the correlation coefficient value.
Visualize correlations using scatter plots or heatmaps.
4. Significance Test
Perform a significance test (a t-test or another
correlation test) to check whether the correlation found is
statistically significant.
5. Presentation of Results
Present your findings in the form of a report or
presentation.
Use visualizations and graphs to explain the correlations
found.
Make conclusions and recommendations based on the
correlation analysis you have conducted.