Unit 4
Bivariate Analysis
4.1. Relationship between Two Variables:
• Bivariate analysis is a statistical method used to explore and understand the
relationship between two different variables in a dataset.
• It involves examining how changes in one variable are associated with
changes in another, providing insights into potential connections,
correlations, or dependencies between them.
• Bivariate analysis commonly employs techniques such as scatterplots,
correlation coefficients, and regression analysis to assess the strength,
direction, and significance of the relationship between the two variables.
• This type of analysis is invaluable for uncovering patterns, making predictions,
and informing decision-making across various fields, including economics, social
sciences, and scientific research.
Customer Satisfaction (% of customers)
Product Category   Satisfied   Dissatisfied   Total
Electronics            45%          15%         60%
Clothing               25%          10%         35%
Books                  20%          30%         50%
Appliances             10%          20%         30%
• The rows represent the product categories (Electronics, Clothing, Books, and
Appliances).
• The columns represent customer satisfaction levels (Satisfied and Dissatisfied).
• The numbers in the table represent the percentage of customers falling into each category
combination.
• For instance, 45% of customers who bought Electronics were satisfied, while 15%
were dissatisfied. This bivariate percentage table provides insights into how
customer satisfaction varies across different product categories.
• It helps you understand which categories have the highest or lowest levels of
customer satisfaction, enabling you to make data-driven decisions for improving
customer experience and product offerings.
Key Observations:
Product Categories: The rows of the table represent different product categories
(Electronics, Clothing, Books, and Appliances).
• In the Electronics category, 45% of customers were satisfied, while 15% were
dissatisfied.
• In the Clothing category, 25% of customers were satisfied, while 10% were
dissatisfied.
• In the Books category, 20% of customers were satisfied, while 30% were
dissatisfied.
• In the Appliances category, 10% of customers were satisfied, while 20% were
dissatisfied.
Analysis:
High Satisfaction, Low Dissatisfaction: Electronics has the highest satisfaction rate
(45%) among customers, with a relatively low dissatisfaction rate (15%). This suggests
that customers who purchased Electronics products tend to be more satisfied.
Mixed Satisfaction: Clothing has a moderate satisfaction rate (25%) and a low
dissatisfaction rate (10%). It indicates that customers buying Clothing products have a
reasonably positive experience, but there is room for improvement.
Low Satisfaction, High Dissatisfaction: Books have a lower satisfaction rate (20%) and
a higher dissatisfaction rate (30%). This category requires attention to address the
higher dissatisfaction level.
Low Satisfaction, Moderate Dissatisfaction: Appliances have the lowest satisfaction rate
(10%) among all categories, with a moderate dissatisfaction rate (20%). This category
needs significant improvement to enhance customer satisfaction.
Overall Insights:
Electronics appears to be the most popular category with high customer satisfaction.
Books and Appliances need improvement in customer satisfaction, with Appliances being the
most challenging category.
Clothing has a reasonably positive satisfaction level but could benefit from some
enhancements.
This analysis of the bivariate percentage table helps you identify areas where customer
satisfaction is strong and where improvements are needed, allowing you to make informed
decisions to enhance the customer experience and product offerings in your e-commerce
platform.
Contingency table:
A contingency table displays frequencies for combinations of two categorical variables.
Analysts also refer to contingency tables as cross tabulations and two-way tables.
Contingency tables classify outcomes for one variable in rows and the other in columns. The
values at the row and column intersections are frequencies for each unique combination of the
two variables.
Use contingency tables to understand the relationship between categorical variables. For
example, is there a relationship between gender (male/female) and type of computer
(Mac/PC)?
Example Contingency Table:
The contingency table example below displays computer sales at our fictional store.
Specifically, it describes sales frequencies by the customer's gender and the type of
computer purchased. It is a two-way table (2 x 2). I cover the naming conventions at the
end.

            PC    Mac   Total
Male        66     40    106
Female      30     87    117
Total       96    127    223
In this contingency table, columns represent computer types and rows represent genders. Cell
values are frequencies for each combination of gender and computer type. Totals are in the
margins. Notice the grand total in the bottom-right margin.
At a glance, it's easy to see how two-way tables both organize your data and paint a picture of
the results. You can easily see the frequencies for all possible subset combinations along with
totals for males, females, PCs, and Macs.
For example, 66 males bought PCs, while 87 females bought Macs. Furthermore, there are
117 females, 106 males, 96 PC sales, 127 Mac sales, and a grand total of 223 observations
in the study.
SPSS offers three procedures for creating contingency tables:
• CROSSTABS is easiest. You can create several tables in one go, but the resulting tables
require quite some manual editing.
• CTABLES runs the desired table straight away and can be run from the menu.
However, it creates one table at a time and requires an additional license.
• TABLES also comes up with the right table straight away. However, the
syntax is difficult and there's no menu.
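For example, a minimal CROSSTABS command for the gender by computer type table might look like the sketch below; the variable names gender and comptype are assumptions, not names from the original data file.

CROSSTABS
  /TABLES=gender BY comptype
  /CELLS=COUNT.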
Marginal Distribution
A marginal distribution is the frequency distribution of one variable by itself; in a
contingency table, you find it in the margins. For example, the marginal distribution of
gender without considering computer type is the following:
Males: 106
Females: 117
Likewise, the marginal distribution of computer type without considering gender is the
following:
PC: 96
Mac: 127
Conditional Distribution
For these distributions, you specify the value for one of the variables in the contingency
table and then assess the distribution of frequencies for the other variable. In other
words, you condition the frequency distribution for one variable by setting a value of the
other variable. That might sound complicated, but it’s easy using a contingency table.
Just look across one row or down one column.
For example, the conditional distribution of computer type for females is the
following:
PC: 30
Mac: 87
Alternatively, the conditional distribution of gender for Macs is the following:
Males: 40
Females: 87
A second contingency table example records ice cream flavor preferences by gender:

            Chocolate   Vanilla   Strawberry   Total
Male            21         32         17         70
Female          37         12         17         66
Total           58         44         34        136

• If there is a relationship between ice cream preference and gender, we'd expect
the conditional distribution of flavors in the two gender rows to differ. From the
contingency table, females are more likely to prefer chocolate (37 vs. 21), while males
prefer vanilla (32 vs. 12).
• Both genders have an equal preference for strawberry. Overall, the two-way table
suggests that males and females have different ice cream preferences.
Row and column percentages help you draw conclusions when you have unequal
numbers in the margins. In the contingency table example above, more women than men
prefer chocolate, but how do we know that’s not due to the sample having more women?
Use percentages to adjust for unequal group sizes. Percentages are relative frequencies.
Row Percentage: Take a cell value and divide by the cell's row total.
Column Percentage: Take a cell value and divide by the cell's column total.
For example, the row percentage of females who prefer chocolate is simply the number of
observations in the Female/Chocolate cell divided by the row total for women: 37 / 66 =
56%.
The column percentage for the same cell is the frequency of the Female/Chocolate cell
divided by the column total for chocolate: 37 / 58 = 63.8%.
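Row and column percentages can be requested directly in CROSSTABS via the CELLS subcommand. A minimal sketch, assuming the ice cream data are stored in variables named gender and flavor:

CROSSTABS
  /TABLES=gender BY flavor
  /CELLS=COUNT ROW COLUMN.

COUNT keeps the raw frequencies alongside the row and column percentages, which makes it easy to verify calculations like the 37 / 66 = 56% above.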
Boxplots in SPSS:
Our data file contains a sample of N = 238 people who were examined in a driving
simulator. Participants were presented with 5 dangerous situations to which they had to
respond as fast as possible. The data hold their reaction times and some other variables.
Our boxplot shows some potential outliers as well as extreme values.
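One way to produce such boxplots through syntax is the EXAMINE procedure. A minimal sketch, assuming the five reaction times are stored in variables reac01 through reac05:

EXAMINE VARIABLES=reac01 reac02 reac03 reac04 reac05
  /PLOT BOXPLOT
  /STATISTICS NONE.

In the resulting boxplots, SPSS flags potential outliers (1.5 to 3 box lengths from the box) with circles and extreme values (more than 3 box lengths) with asterisks.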
Outliers:
Outliers are data points in a dataset that significantly deviate from the majority of other
data points. They are observations that fall well outside the typical range or distribution
of values and may be unusually high or low in comparison to the rest of the data.
Outliers can potentially distort statistical analyses and should be
carefully examined to determine whether they represent genuine extreme values or are
the result of errors or anomalies in the data collection process.
Let's take a good look at the first of our 5 histograms shown below. The “normal
range” for this variable seems to run from 500 through 1500 ms. It seems that 3 scores
lie outside this range. So are these outliers? Honestly, that's somewhat debatable.
Personally, I'd settle for only excluding the score ≥ 2000 ms. So what's the right way to
do so? And what about the other variables?
The right way to exclude outliers from data analysis is to specify them as user missing
values. So for reaction time 1 (reac01), running
MISSING VALUES reac01 (2000 THRU HI).
excludes reaction times of 2000 ms and higher from all data analyses and editing. So
what about the other 4 variables?
The histograms for reac02 and reac03 don't show any outliers.
For reac04, we see some low outliers as well as a high outlier. We can find which values
these are in the bottom and top of its frequency distribution as shown below. We can
exclude all of these outliers in one go by running something like
MISSING VALUES reac04 (LO THRU 400, 2500).
where 2500 stands in for the single high outlier found in the frequency table. By the way:
“lo thru 400” means the lowest value in this variable (its minimum) through 400 ms.
For reac05, we see several low and high outliers. The obvious thing to do seems to be
running something like
MISSING VALUES reac05 (LO THRU 500, 2000 THRU HI).
But sadly, this only triggers an error. The problem here is that you can't specify two
separate ranges of user missing values for a single variable. Since this is what you
typically need to do, this is one of the biggest stupidities still found in SPSS today. A
workaround for this problem is to RECODE the entire low range into some huge value
such as 999999999; since that huge value falls inside the high missing range, a single
range then excludes both tails. The syntax below does just that and reruns our
histograms to check if all outliers have indeed been correctly excluded.
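This is a minimal sketch of the workaround; the 500 ms and 2000 ms cutoffs are assumed for illustration and should be taken from the actual frequency table.

* Recode the low tail into a huge value that falls inside the high missing range.
RECODE reac05 (LO THRU 500 = 999999999).
* Label the recoded value so we can still report which outliers were excluded.
VALUE LABELS reac05 999999999 'Low outlier (500 ms or less)'.
* A single missing range now excludes both the low and the high outliers.
MISSING VALUES reac05 (2000 THRU HI).
* Rerun the histograms and frequency tables to verify.
FREQUENCIES VARIABLES=reac01 reac02 reac03 reac04 reac05
  /HISTOGRAM.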
Result
First off, note that none of our 5 histograms show any outliers anymore; they're now
excluded from all data analysis and editing. Also note the bottom of the frequency table
for reac05 shown below.
Even though we had to recode some values, we can still report precisely which outliers we
excluded for this variable due to our value label.
Before proceeding to boxplots, I'd like to mention 2 worst practices for excluding outliers:
• Removing outliers by changing them into system missing values. After doing so, we no
longer know which outliers we excluded. Also, we're clueless why values are system
missing as they don't have any value labels.
• Removing entire cases -often respondents- because they have one or more outliers. Such
cases typically have mostly “normal” data values that we can use just fine for analyzing
other (sets of) variables.
Sadly, supervisors sometimes force their students to take this road anyway. If so, SELECT
IF permanently removes entire cases from your data.
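A sketch of that approach, reusing the 2000 ms cutoff applied to reac01 above:

* Permanently drops every case whose reac01 is 2000 ms or higher (or missing).
SELECT IF (reac01 < 2000).
EXECUTE.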
In a scatter plot, a linear relationship would appear as a pattern where the data points
roughly follow a straight line. If, as one variable increases, the other also tends to
increase, it's a positive linear relationship. Conversely, if one variable increases as the
other decreases, it's a negative linear relationship.
The strength of a linear relationship can be quantified using correlation coefficients like
Pearson's correlation coefficient. A value close to 1 indicates a strong positive linear
relationship, close to -1 indicates a strong negative linear relationship, and close to 0
suggests little to no linear relationship.
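In SPSS, Pearson's correlation coefficient is available through the CORRELATIONS procedure. A minimal sketch, assuming two variables named midterm and final (as in the exam-score example that follows):

CORRELATIONS
  /VARIABLES=midterm final
  /PRINT=TWOTAIL NOSIG.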
A scatter plot is a graphical representation used to visualize the relationship between two
sets of data points. It consists of points on a two-dimensional plane, where each point
represents the values of two variables. By plotting these points, you can quickly identify
patterns, correlations, or trends in the data. Scatter plots are commonly used in data
analysis to determine whether there is a connection between the variables and to
visualize how changes in one variable relate to changes in another, making them a
valuable tool in understanding and interpreting data.
Scatter plot in SPSS:
The starting assumption is that you have already imported your data into SPSS, and that
you’re looking at something like the data set below.
This hypothetical data set contains the mid-term and final exam scores of 40 students in
a Statistics course (the first 20 records are displayed above). We want to create a scatter
plot to visualize the relationship between the two sets of scores.
Create a Scatter Plot
Click Graphs -> Legacy Dialogs -> Scatter/Dot as illustrated below. Note, however, that
in newer versions of SPSS, you will need to click Graphs > Scatter/Dot.
Select Simple Scatter and then click Define. This brings up the “Simple Scatterplot”
dialog box below.
We recommend that you click the Reset button to clear any previous settings.
The next step is to move your variables into the X Axis and Y Axis boxes. If your data
is from a regression study, select your predictor/independent variable, and use the arrow
button to move it to the X Axis box. Then select the criterion/dependent variable, and use
the arrow button to move it to the Y Axis box. If your data is from a simple correlation
study, as is the case with our example, there may not be obvious predictor/independent
and criterion/dependent variables. In these cases, it doesn't matter which variable you
move to the X Axis box and which variable you move to the Y Axis box.
It is a good idea to give your scatter plot a title. To do this, click the Titles button, add
your title, and click Continue to return to the “Simple Scatterplot” dialog box.
Select OK to generate your scatter plot.
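The same chart can also be produced through syntax. A sketch, assuming the scores are stored in variables named midterm and final:

GRAPH
  /SCATTERPLOT(BIVAR)=midterm WITH final
  /TITLE='Mid-term vs. Final Exam Scores'.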
Each student in our hypothetical study is represented by one dot on our scatter plot.
Each dot's position on the X (horizontal) axis represents a student's mid-term exam
score, and its position on the Y (vertical) axis represents their final exam score.
After we create a scatter plot, we need to review it to assess the nature of the
relationship – if any – that exists between our variables. The scatter plot above indicates
that there is a positive linear relationship between mid-term and final exam scores in
this Statistics course. In other words, lower mid-term exam scores tend to be associated
with lower final exam scores, and higher mid-term exam scores tend to be associated
with higher final exam scores. It is important to note that a scatter plot cannot prove a
causal relationship between variables. Therefore, we cannot conclude that high mid-
term exam scores cause high final exam scores on the basis of the scatter plot above.
Some of the other relationships between variables that your scatter plot may indicate are
illustrated below.
A resistant line is fitted by splitting the data into three batches ordered by the
x-variable, computing the median x and y values within each batch, and passing a line
through the outer batch medians. However, the function doesn't stop there. After fitting
the initial line, the function fits another line (following the aforementioned
methodology) to the model's residuals. If the slope is not close to zero, the residual
slope is added to the original fitted model, creating an updated model. This iteration is
repeated until the residual slope is close to zero or until the residual slope changes in
sign (at which point the average of the last two iterated slopes is used in the final fit).
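In symbols, if b(k) is the slope after iteration k and r(k) is the slope of the line fitted to the current residuals, each update is b(k+1) = b(k) + r(k), stopping once r(k) is close to zero or changes sign (in which case the final slope averages the last two iterates).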
An example of the iteration follows using data from Velleman et al.'s book. The dataset,
neoplasms, consists of breast cancer mortality rates for regions with varying mean
annual temperatures.
Note that the 16-record dataset is not divisible by three, thus forcing an extra point into
the middle batch (had the remainder of the division by three been two, then each extra
point would have been added to the tail-end batches).
The initial slope is computed from the medians of the outer batches:
b1 = (y_r − y_l) / (x_r − x_l)
where the subscripts r and l reference the median values for the right-most and left-
most batches. The intercept then averages the offsets of the three batch medians from
that slope:
b0 = [ (y_l − b1·x_l) + (y_m − b1·x_m) + (y_r − b1·x_r) ] / 3
where (x_l, y_l), (x_m, y_m) and (x_r, y_r) are the median x and y values for each
batch. This line is then used to compute the first set of residuals. A line is then fitted
to the residuals following the same procedure outlined above.
The initial model slope and intercept are 3.412 and -69.877 respectively, and the
residual's slope and intercept are -0.873 and 41.451 respectively. The residual slope is
then added to the first computed slope and the process is again repeated, thus
generating the following tweaked slope and updated residuals:
The updated slope is now 3.412 + (-0.873) = 2.539. The iteration continues until the
slope residuals stabilize. The final line for this working example is,
where the final slope and intercept are 2.89 and -45.91, respectively.
Fitting a line in SPSS:
1. Analyze menu: Go to the "Analyze" menu at the top of the SPSS window.
2. Regression: Under the "Analyze" menu, select "Regression."
3. Linear: In the "Regression" submenu, choose "Linear."
4. Dependent and independent variables: In the "Linear Regression" dialog box, specify
your dependent variable (the one you want to predict) and your independent variable(s)
(the ones you want to use to predict the dependent variable).
5. Options: Click the "Options" button in the "Linear Regression" dialog box.
6. Resistant line: In the "Options" dialog, you might find an option related to "Resistant
Line." This option is typically used for resistant line regression techniques, such as
robust regression. You may need to check a box or select a specific method depending
on your analysis requirements.
7. OK: After setting your options, click "OK" to close the "Options" dialog.
8. Run: Back in the "Linear Regression" dialog box, click "OK" to run the analysis.
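Through syntax, the corresponding fit (assuming the mortality and temperature variables from the neoplasms example) looks like the sketch below. Note that base SPSS REGRESSION fits an ordinary least-squares line, not Tukey's resistant line:

REGRESSION
  /STATISTICS COEFF R
  /DEPENDENT mortality
  /METHOD=ENTER temperature.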