
UNIT-4

Bivariate Analysis
4.1. Relationship between Two Variables:
• Bivariate analysis is a statistical method used to explore and understand the
relationship between two different variables in a dataset.
• It involves examining how changes in one variable are associated with
changes in another, providing insights into potential connections,
correlations, or dependencies between them.
• Bivariate analysis commonly employs techniques such as scatterplots,
correlation coefficients, and regression analysis to assess the strength,
direction, and significance of the relationship between the two variables.
• This type of analysis is invaluable for uncovering patterns, making predictions,
and informing decision-making across various fields, including economics, social
sciences, and scientific research.

4.2. Percentage Tables:


In bivariate analysis, a percentage table is used to display the joint distribution of two
categorical variables. It shows the relative frequencies or percentages of observations
that fall into each combination of categories from the two variables. Here's an example
of how a percentage table in bivariate analysis might look:

Customer Satisfaction

Product Category    Satisfied    Dissatisfied    Total
Electronics         45%          15%             60%
Clothing            25%          10%             35%
Books               20%          30%             50%
Appliances          10%          20%             30%
Total               100%         75%             175%
In this example:

• The rows represent the product categories (Electronics, Clothing, Books, and
Appliances).
• The columns represent customer satisfaction levels (Satisfied and Dissatisfied).
• The numbers in the table represent the percentage of customers falling into each category
combination.
• For instance, 45% of customers who bought Electronics were satisfied, while 15%
were dissatisfied. This bivariate percentage table provides insights into how
customer satisfaction varies across different product categories.
• It helps you understand which categories have the highest or lowest levels of
customer satisfaction, enabling you to make data-driven decisions for improving
customer experience and product offerings.

Key Observations:

Product Categories: The rows of the table represent different product categories
(Electronics, Clothing, Books, and Appliances).

Customer Satisfaction: The columns represent customer satisfaction levels, categorized as Satisfied and Dissatisfied.

Individual Category Percentages:

• In the Electronics category, 45% of customers were satisfied, while 15% were
dissatisfied.
• In the Clothing category, 25% of customers were satisfied, while 10% were
dissatisfied.
• In the Books category, 20% of customers were satisfied, while 30% were
dissatisfied.
• In the Appliances category, 10% of customers were satisfied, while 20% were
dissatisfied.

Analysis:

High Satisfaction, Low Dissatisfaction: Electronics has the highest satisfaction rate
(45%) among customers, with a relatively low dissatisfaction rate (15%). This suggests
that customers who purchased Electronics products tend to be more satisfied.

Mixed Satisfaction: Clothing has a moderate satisfaction rate (25%) and a low
dissatisfaction rate (10%). It indicates that customers buying Clothing products have a
reasonably positive experience, but there is room for improvement.

Low Satisfaction, High Dissatisfaction: Books have a lower satisfaction rate (20%) and
a higher dissatisfaction rate (30%). This category requires attention to address the
higher dissatisfaction level.
Low Satisfaction, Moderate Dissatisfaction: Appliances have the lowest satisfaction rate
(10%) among all categories, with a moderate dissatisfaction rate (20%). This category
needs significant improvement to enhance customer satisfaction.

Overall Insights:

Electronics appears to be the most popular category with high customer satisfaction.

Books and Appliances need improvement in customer satisfaction, with Appliances being the
most challenging category.

Clothing has a reasonably positive satisfaction level but could benefit from some
enhancements.

This analysis of the bivariate percentage table helps you identify areas where customer
satisfaction is strong and where improvements are needed, allowing you to make informed
decisions to enhance the customer experience and product offerings in your e-commerce
platform.

Contingency table:
A contingency table displays frequencies for combinations of two categorical variables.
Analysts also refer to contingency tables as cross tabulations and two-way tables.

Contingency tables classify outcomes for one variable in rows and the other in columns. The
values at the row and column intersections are frequencies for each unique combination of the
two variables.
Use contingency tables to understand the relationship between categorical variables. For
example, is there a relationship between gender (male/female) and type of computer
(Mac/PC)?
Example Contingency Table:
The contingency table example below displays computer sales at our fictional store.
Specifically, it describes sales frequencies by the customer’s gender and the type of
computer purchased. It is a two-way (2 × 2) table.
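Reconstructed from the frequencies quoted in this section, the table looks like this:

            PC     Mac    Total
Males       66     40     106
Females     30     87     117
Total       96     127    223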
In this contingency table, columns represent computer types and rows represent genders. Cell
values are frequencies for each combination of gender and computer type. Totals are in the
margins. Notice the grand total in the bottom-right margin.
At a glance, it’s easy to see how two-way tables both organize your data and paint a picture of
the results. You can easily see the frequencies for all possible subset combinations along with
totals for males, females, PCs, and Macs.

For example, 66 males bought PCs, while 87 females bought Macs. Furthermore, there are
117 females, 106 males, 96 PC sales, 127 Mac sales, and a grand total of 223 observations
in the study.

Creating Contingency table:

There are three ways to create APA style contingency tables in SPSS:

• CROSSTABS is easiest. You can create several tables in one go, but they
require quite a bit of manual editing.
• CTABLES runs the desired table straight away and can be run from the menu.
However, it creates one table at a time and requires an additional license.
• TABLES also comes up with the right table straight away. However, the
syntax is difficult and there's no menu.

Running Simple Contingency Tables in SPSS


• The fastest way to create the table we just saw is running one line of SPSS
syntax:
crosstabs educ by marit.
• The categories of the first variable -educ or education- become rows in the table. The values
of the second variable -marit or marital status- become the columns. As a rule of thumb,
the columns hold the groups you want to compare on whatever goes into the
rows.
In this case, we're comparing marital status
groups on education level.
• If we had wanted to do the reverse -compare education level groups on
marital status- we'd swap the rows and columns and run
crosstabs marit by educ.

CROSSTABS with Column Percentages:


Right. So how do our groups compare on education level? It's hard to see from our first
table because each marital status group has a different n or sample size. We may see more
of a pattern if we add column percentages. The syntax below does just that.
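A minimal sketch of that syntax (the CELLS subcommand requests counts plus column percentages):

crosstabs educ by marit
/cells count column.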

APA Contingency Tables from CTABLES


• The table we just created can be run in one go with CTABLES. However, this only
works if you have a license for the Custom Tables module. You can check this by running
show license.
• If the resulting output includes “Custom Tables”, try the syntax below.
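A sketch of such a CTABLES command (the statistic labels 'n' and '%' and the percent format are assumptions):

ctables
/table educ by marit [count 'n', colpct.count '%' pct40.0].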
(If you build the table from the Custom Tables dialog instead, drag and drop the statistics right underneath Marital status and then close the window.)

Let's now make 2 text replacements:

• use n instead of “Count”
• use % instead of “% within Marital status”
• Using the Ctrl + H shortcut in the output viewer should work. Or -much nicer- use
the OUTPUT MODIFY syntax below if you're on SPSS version 22 or higher.
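A sketch of that syntax (it assumes the label strings below match your output exactly; adjust them as needed):

output modify
/select tables
/tablecells select = ["Count"] replace = "n"
/tablecells select = ["% within Marital status"] replace = "%".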
4.3. Analyzing Contingency Tables:
Marginal Distribution

These distributions represent the frequency distribution of one categorical variable
without regard for the other. Unsurprisingly, you can find these distributions in the
margins of a contingency table.

The following marginal distribution examples are taken from the margins of the computer sales table above.

For example, the marginal distribution of gender without considering computer type is
the following:

Males: 106

Females: 117

Alternatively, the marginal distribution of computer types is the following:

PC: 96

Mac: 127

Conditional Distribution

For these distributions, you specify the value for one of the variables in the contingency
table and then assess the distribution of frequencies for the other variable. In other
words, you condition the frequency distribution for one variable by setting a value of the
other variable. That might sound complicated, but it’s easy using a contingency table.
Just look across one row or down one column.

The following conditional distribution examples are taken from single rows and columns of the same table.

For example, the conditional distribution of computer type for females is the

following:
PC: 30

Mac: 87

Alternatively, the conditional distribution of gender for Macs is the following:

Males: 40

Females: 87

Finding Relationships in a Contingency Table:
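This example uses a different two-way table: ice cream preference by gender. Reconstructed from the counts cited below (the strawberry cells follow from the equal-preference remark and the row total of 66 for females):

            Chocolate   Vanilla   Strawberry   Total
Males       21          32        17           70
Females     37          12        17           66
Total       58          44        34           136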

• If there is a relationship between ice cream preference and gender, we’d expect
the conditional distribution of flavors in the two gender rows to differ. From the
contingency table, females are more likely to prefer chocolate (37 vs. 21), while males
prefer vanilla (32 vs. 12).

• Both genders have an equal preference for strawberry. Overall, the two- way table
suggests that males and females have different ice cream preferences.

Row and column percentages help you draw conclusions when you have unequal
numbers in the margins. In the contingency table example above, more women than men
prefer chocolate, but how do we know that’s not due to the sample having more women?
Use percentages to adjust for unequal group sizes. Percentages are relative frequencies.

Here’s how to calculate row and column percentages in a two-way table.

Row Percentage: Take a cell value and divide by the cell’s row total.

Column Percentage: Take a cell value and divide by the cell’s column total.

For example, the row percentage of females who prefer chocolate is simply the number of
observations in the Female/Chocolate cell divided by the row total for women: 37 / 66 =
56%.

The column percentage for the same cell is the frequency of the Female/Chocolate cell
divided by the column total for chocolate: 37 / 58 = 63.8%.

4.4. Handling Several Batches:


Box plot:
A box plot, also known as a boxplot or box-and-whisker plot, is a
standardized way of displaying the distribution of a data set based on its five-number
summary: the “minimum,” first quartile (Q1), median, third quartile (Q3),
and “maximum.”

Boxplots in SPSS:

Our data file contains a sample of N = 238 people who were examined in a driving
simulator. Participants were presented with 5 dangerous situations to which they had to
respond as fast as possible. The data hold their reaction times and some other variables.
Our boxplot shows some potential outliers as well as extreme values.
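One way to produce such boxplots is the EXAMINE command. A minimal sketch (it assumes the reaction time variables reac01 through reac05 are adjacent in the file):

examine variables = reac01 to reac05
/plot boxplot
/statistics none.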

Outliers:
Outliers are data points in a dataset that significantly deviate from the majority of other
data points. They are observations that fall well outside the typical range or distribution
of values and may be unusually high or low in comparison to the rest of the data.
Outliers can potentially distort statistical analyses and should be
carefully examined to determine whether they represent genuine extreme values or are
the result of errors or anomalies in the data collection process.

Removing outliers in SPSS using histograms:


Let's first try to identify outliers by running some quick histograms over our 5
reaction time variables. Doing so from SPSS’ menu is discussed in Creating
Histograms in SPSS. A faster option, though, is running the syntax below.
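A minimal sketch of that syntax:

frequencies reac01 to reac05
/format notable
/histogram.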

Let's take a good look at the first of our 5 histograms shown below. The “normal
range” for this variable seems to run from 500 through 1500 ms. It seems that 3 scores
lie outside this range. So are these outliers? Honestly, different analysts will make
different decisions here.

Personally, I'd settle for only excluding the score ≥ 2000 ms. So what's the right way to
do so? And what about the other variables?

The right way to exclude outliers from data analysis is to specify them as user missing
values. So for reaction time 1 (reac01), running

missing values reac01 (2000 thru hi).

excludes reaction times of 2000 ms and higher from all data analyses and editing. So
what about the other 4 variables?
what about the other 4 variables?

The histograms for reac02 and reac03 don't show any outliers.

For reac04, we see some low outliers as well as a high outlier. We can find which values
these are in the bottom and top of its frequency distribution as shown below. We can
exclude all of these outliers in one go by running

missing values reac04 (lo thru 400,2085).

By the way: “lo thru 400” means the lowest value in this variable (its minimum) through 400
ms.

For reac05, we see several low and high outliers. The obvious thing to do seems to run
something like

missing values reac05 (lo thru 400,2000 thru hi).

But sadly, this only triggers an error. The problem here is that you can't specify both a
low and a high range of missing values in SPSS.

Since this is what you typically need to do, this is one of the biggest stupidities still
found in SPSS today. A workaround for this problem is to

• RECODE the entire low range into some huge value such as 999999999;
• add the original values to a value label for this value;
• specify only a high range of missing values that includes 999999999.

The syntax below does just that and reruns our histograms to check if all outliers have
indeed been correctly excluded.
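A minimal sketch of that syntax, assuming the cutoffs discussed above (the value label text is a placeholder; ideally it lists the actual recoded values):

* Recode the low outliers into one huge value.
recode reac05 (lo thru 400 = 999999999).
* Add the original values to a value label for this value.
add value labels reac05 999999999 'Low outliers (400 ms or less)'.
* Specify a high missing range that includes 999999999.
missing values reac05 (2000 thru hi).
* Rerun the histograms as a check.
frequencies reac01 to reac05
/format notable
/histogram.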

Result

First off, note that none of our 5 histograms show any outliers anymore; they're now
excluded from all data analysis and editing. Also note the bottom of the frequency table
for reac05 shown below.

Even though we had to recode some values, we can still report precisely which outliers we
excluded for this variable due to our value label.

Before proceeding to boxplots, I'd like to mention 2 worst practices for excluding outliers:

• removing outliers by changing them into system missing values. After doing so, we no
longer know which outliers we excluded. Also, we're clueless why values are system
missing as they don't have any value labels.

• removing entire cases -often respondents- because they have 1(+) outliers. Such cases
typically have mostly “normal” data values that we can use just fine for analyzing other (sets
of) variables.

Sadly, supervisors sometimes force their students to take this road anyway. If so, SELECT
IF permanently removes entire cases from your data.

4.5. Scatterplots and Resistant Lines:

Linear Relationship:
In bivariate analysis, which involves examining two variables together, one of the key
aspects to explore is linear relationships. A linear relationship between two variables
means that as one variable changes, the other tends to change in a straight-line fashion.

In a scatter plot, a linear relationship would appear as a pattern where the data points
roughly follow a straight line. If, as one variable increases, the other also tends to
increase, it's a positive linear relationship. Conversely, if one variable increases as the
other decreases, it's a negative linear relationship.

The strength of a linear relationship can be quantified using correlation coefficients like
Pearson's correlation coefficient. A value close to 1 indicates a strong positive linear
relationship, close to -1 indicates a strong negative linear relationship, and close to 0
suggests little to no linear relationship.
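As a quick sketch in SPSS syntax (score1 and score2 are placeholder variable names), Pearson's correlation can be obtained with:

correlations
/variables = score1 score2
/print = twotail nosig.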

Identifying and understanding linear relationships in bivariate analysis is essential, as it
helps in making predictions, drawing insights, and can serve as the basis for building
linear regression models to make quantitative predictions based on these relationships.

A scatter plot is a graphical representation used to visualize the relationship between two
sets of data points. It consists of points on a two-dimensional plane, where each point
represents the values of two variables. By plotting these points, you can quickly identify
patterns, correlations, or trends in the data. Scatter plots are commonly used in data
analysis to determine whether there is a connection between the variables and to
visualize how changes in one variable relate to changes in another, making them a
valuable tool in understanding and interpreting data.
Scatter plot in SPSS:
The starting assumption is that you have already imported your data into SPSS, and that
you’re looking at something like the data set below.

This hypothetical data set contains the mid-term and final exam scores of 40 students in
a Statistics course (the first 20 records are displayed above). We want to create a scatter
plot to visualize the relationship between the two sets of scores.
Create a Scatter Plot
Click Graphs -> Legacy Dialogs -> Scatter/Dot as illustrated below. Note, however, that
in newer versions of SPSS, you may need to click Graphs > Scatter/Dot instead.

In the dialog that appears, select Simple Scatter and then click Define. This brings up the
“Simple Scatterplot” dialog box. We recommend that you click the Reset button to clear
any previous settings.

The next step is to move your variables into the X Axis and Y Axis boxes. If your data
is from a regression study, select your predictor/independent variable, and use the arrow
button to move it to the X Axis box. Then select the criterion/dependent variable, and use
the arrow button to move it to the Y Axis box. If your data is from a simple correlation
study, as is the case with our example, there may not be obvious predictor/independent
and criterion/dependent variables. In these cases, it doesn’t matter which variable you
move to the X Axis box and which variable you move to the Y Axis box.

It is a good idea to give your scatter plot a title. To do this, click the Titles button, add
your title, and click Continue to return to the “Simple Scatterplot” dialog box.

Select OK to generate your scatter plot.
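Alternatively, the same plot can be produced with one short syntax command; a sketch, assuming the two score variables are named midterm and final:

graph
/scatterplot(bivar) = midterm with final
/title = 'Mid-Term vs. Final Exam Scores'.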

Each student in our hypothetical study is represented by one dot on our scatter plot.
Each dot’s position on the X (horizontal) axis represents a student’s mid-term exam
score, and its position on the Y (vertical) axis represents their final exam score.

After we create a scatter plot, we need to review it to assess the nature of the
relationship – if any – that exists between our variables. The scatter plot above indicates
that there is a positive linear relationship between mid-term and final exam scores in
this Statistics course. In other words, lower mid-term exam scores tend to be associated
with lower final exam scores, and higher mid-term exam scores tend to be associated
with higher final exam scores. It is important to note that a scatter plot cannot prove a
causal relationship between variables. Therefore, we cannot conclude that high mid-
term exam scores cause high final exam scores on the basis of the scatter plot above.
Some of the other relationships between variables that your scatter plot may indicate are
illustrated below.

Fitting resistant lines:

The eda_rline function (an R function) fits a robust line through a bivariate dataset. It
does so by first breaking the data into three roughly equal-sized batches along the x-axis
variable. It then uses the batches’ median values to compute the slope and intercept.

However, the function doesn’t stop there. After fitting the initial line, the function fits
another line (following the aforementioned methodology) to the model’s residuals. If
the residual slope is not close to zero, it is added to the original fitted model,
creating an updated model. This iteration is repeated until the residual slope is close to
zero or until the residual slope changes sign (at which point the average of the last
two iterated slopes is used in the final fit).
An example of the iteration follows, using data from Velleman et al.’s book. The dataset,
neoplasms, consists of breast cancer mortality rates for regions with varying mean
annual temperatures.

Note that the 16-record dataset is not divisible by three, thus forcing an extra point into the
middle batch (had the remainder of the division by three been two, then each extra point
would have been added to the tail-end batches).

Next, we compute the medians for each batch. The two end medians are used to compute the slope as:

b = (y_r - y_l) / (x_r - x_l)

where the subscripts r and l reference the median values for the right-most and left-most batches.

Once the slope is computed, the intercept can be computed as follows:

a = [ (y_l - b*x_l) + (y_m - b*x_m) + (y_r - b*x_r) ] / 3

where (x_l, y_l), (x_m, y_m), and (x_r, y_r) are the median x and y values for the left, middle, and
right batches. This line is then used to compute the first set of residuals. A line is then fitted
to the residuals following the same procedure outlined above.

The initial model slope and intercept are 3.412 and -69.877 respectively, and the
residual’s slope and intercept are -0.873 and 41.451 respectively. The residual slope is
then added to the first computed slope, and the process is repeated, thus
generating a tweaked slope and updated residuals:
the updated slope is now 3.412 + (-0.873) = 2.539. The iteration continues until the
slope residuals stabilize. The final line for this working example is

y = 2.89x - 45.91

where the final slope and intercept are 2.89 and -45.91, respectively.

Fitting resistant line in SPSS:


Data Preparation: Start by opening your dataset in SPSS. Make sure it contains the
variables you want to analyze.

Analyze Menu: Go to the "Analyze" menu at the top of the SPSS window.
Regression: Under the "Analyze" menu, select "Regression."
Linear: In the "Regression" submenu, choose "Linear."
Dependent and Independent Variables: In the "Linear Regression" dialog box, specify
your dependent variable (the one you want to predict) and your independent variable(s)
(the ones you want to use to predict the dependent variable).
Options: Click the "Options" button in the "Linear Regression" dialog box.
Resistant Line: In the "Options" dialog, you might find an option related to resistant or
robust fitting, typically used for resistant line regression techniques such as robust
regression. You may need to check a box or select a specific method, depending
on your analysis requirements.
OK: After setting your options, click "OK" to close the "Options" dialog.
Run: Back in the "Linear Regression" dialog box, click "OK" to run the analysis.
