2181S
Introduction to Data Mining
Lecture 5 Scatter Plots
Scatter Plots
A scatter plot (or scatter diagram) shows the
position of all of the cases in an x-y coordinate
system. The relationship between two interval
variables can be identified from a scatter plot.
— The independent variable is plotted on the x-axis, or
the horizontal axis.
— The dependent variable is plotted on the y-axis, or the
vertical axis.
Each dot in the body of the chart marks the point where
a case's x value and y value intersect.
Use of Scatter Plots
+ The purpose of a scatter plot:
— Perform regression analysis and correlation checks
+ With the information obtained from the above
techniques, we can answer the following questions
that identify the relationships between variables:
— Is there a relationship between the two variables?
(The two variables usually take a range of values on a
scale or ordinal measure.)
— How strong is the relationship?
— What is the direction of the relationship?
Finding Out the Relationship
+ The overall pattern of data points summarizes
the nature of the relationship between the two
variables.
+ The clarity of the pattern formed by the data
points can be enhanced by drawing a straight
line through the cluster such that the line
touches every dot or comes as close to doing so
as possible.
+ This summarizing line is called “regression line.”
— It finds the line that best predicts Y from values of X
Sample Scatter Plot
The pattern of the points on the scatter plot
shows the relationship between the variables.
The regression line makes it easier to understand
the relationship.
Direction of Relationship
+ Positive vs Negative Relations
— The dependent variable increases when the
independent variable increases → Positive relation
— The dependent variable decreases when the
independent variable increases → Negative relation
[Figure: two scatter plots, labelled "Positive Relation" and "Negative Relation"]
Evidence of Relationship
The angle between the regression line and the
horizontal x-axis provides evidence of a
relationship. If there is no relationship, the
regression line will be parallel to the x or y-axis.
[Figure: example scatter plots, one labelled "Perfect Relation"]
Strength of Relationship
+ The strength of the relationship between the two
variables can be read from the scatter plot
— A horizontal or vertical line means no relation
— A very flat or steep line usually means slight relation
+ However, by looking at the angle of the
regression line, you can be easily misled by the
scaling of the graph!
— Next slide shows an example
Effect of Scaling on Scatter Plot
+ The scale of the x and y-axis can affect the view
of the scatter plot!
[Figure: the same data drawn at different axis scales; the relation shown should be the same]
+ Scaling can give you the wrong view!
Slope/Intercept of Regression Line
+ The value of the slope b for the regression line
can be found by the formula:
$$ b = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} $$
where (x_i, y_i) is the value of data point i, and n is the
number of data points
+ The y-intercept a can be found by:
$$ a = \frac{\sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i}{n} $$
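The slope and intercept calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual least-squares "sums" form of the slope formula; the function name `fit_line` is hypothetical, and the five data points are the ones used in the lecture's worked error example.

```python
# Data points from the lecture's worked example: (x, actual y).
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 6, 7, 9]

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x.

    fit_line is an illustrative helper, not a library function.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Slope: b = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # Intercept: a = (Sy - b*Sx) / n
    a = (sum_y - b * sum_x) / n
    return a, b

a, b = fit_line(xs, ys)
print(f"y = {b:.4f}x + {a:.4f}")
```

For this data the sums are n = 5, Σx = 15, Σy = 29, Σxy = 103 and Σx² = 55, so b = 80/50 = 1.6 and a = 1, which agrees with the line y = 1.6x + 1 quoted in the lecture.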
Equation of the Regression Line
+ A more objective measure of the relation
between the two variables is to observe the
equation of the regression line
— We know the line can be expressed as y = a + bx
— a and b are constants, where b is the slope and a is
the y-intercept (value of y is a when x is 0)
— The slope gives a more objective measure of the
steepness of the graph
— Positive relation has positive slope, negative relation
has negative slope
Important Note
+ If we interchange the X and Y data, the scatter
plot will be transposed. However, the regression
line is not an exact transpose.
— The transpose of y = 1.6x + 1 is y = 0.625x - 0.625,
but the regression line is y = 0.597x - 0.4627
What is R²?
+ You may notice R² appears in Excel and SPSS
— It is a measure of how well the values are fitted to the
regression line (goodness of fit)
— A value closer to 1 means the data points are closer
to the regression line.
— A value of 1 means all the data points are on the
regression line.
+ In fact, R² is the square of the correlation
coefficient
— R² has a value between 0 and 1
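As an illustration of R² as goodness of fit, the sketch below computes it as 1 minus the ratio of the squared error around the regression line to the squared variation around the mean of y. That definition is an assumption here (the slides only state that R² is the square of the correlation coefficient; for a simple linear regression the two agree). The data points and the line y = 1.6x + 1 are the lecture's worked example.

```python
# Lecture example data and its regression line y = 1.6x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 6, 7, 9]
a, b = 1.0, 1.6

mean_y = sum(ys) / len(ys)

# Squared error of the points around the regression line.
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
# Squared variation of the points around the mean of y.
ss_tot = sum((y - mean_y) ** 2 for y in ys)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))
```

Here the squared error around the line is 1.20 and the variation around the mean is 26.8, so R² is about 0.955: the points lie close to the line.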
Correlation Coefficient
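+ The (Pearson) correlation coefficient (also
written as r) was discussed in Lecture 4:
— Correlation(x, y):
$$ r = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{(n-1)\,\sigma_x \sigma_y} $$
where \mu_x and \mu_y are the means and \sigma_x and \sigma_y are the
standard deviations of X and Y resp., for example
$$ \sigma_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu_x)^2}{n-1}} $$
+ The correlation coefficient has a value between -1 and +1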
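A minimal sketch of the correlation calculation, assuming the sample (n - 1) form of the standard deviation, which is what Python's `statistics.stdev` computes. The helper name `pearson_r` is hypothetical, and the data are the five points from the lecture's later worked example.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson correlation using sample (n - 1) standard deviations.

    pearson_r is an illustrative helper, not a library function.
    """
    n = len(xs)
    mu_x, mu_y = mean(xs), mean(ys)
    # Sum of products of deviations from the means.
    cross = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 5, 6, 7, 9]

r = pearson_r(xs, ys)
print(f"r = {r:.4f}, r squared = {r * r:.4f}")

# Reversing the direction of y flips the sign of r but not its size.
r_neg = pearson_r(xs, [-y for y in ys])
print(f"r for negated y = {r_neg:.4f}")
```

For this data r is about 0.977, a positive value close to +1, and negating y gives about -0.977, which is the sense in which the sign of r carries the direction of the relationship.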
Meaning of Correlation Coefficient
Correlation Coefficient | Meaning
[range missing]         | No correlation
[range missing]         | Slight correlation
[range missing]         | Low/Small correlation
0.41-0.70               | Moderate to Substantial correlation
0.71-0.90               | Strong/High correlation
0.91-0.99               | Near Perfect correlation
1.00                    | Perfect correlation
+ Similar for negative values
What do all these things mean?
+ The correlation coefficient r and R² give the relation
between 2 variables
— r also shows the direction of the relationship
— R² closer to 1 → stronger relation between the 2 variables
+ The regression line is the “best fit” line of the
points on the scatter plot
— We would also like to know the average error of each
point with respect to the regression line
How to find the Error?
+ The y error for every
data point is:
— For every data point,
get the x-coordinate.
Find the difference of
the expected y value
with the actual y value
— Example: For the data
point (1, 2), expected
y = 1.6(1) + 1 = 2.6,
actual y = 2, so the
error is 2.6 - 2 = 0.6
Summing the Errors
+ Note that the summation of y errors for every
data point in the scatter plot is always 0
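The error calculation can be checked with a short sketch. It assumes the sign convention "expected minus actual", which matches the error values in the lecture's tables, and uses the lecture's data with its regression line y = 1.6x + 1; the helper name `y_errors` is hypothetical. The sum comes out as zero because the line is a least-squares fit with an intercept; for an arbitrary line through the points it generally would not.

```python
# Lecture example data and its regression line y = 1.6x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 6, 7, 9]
a, b = 1.0, 1.6

def y_errors(xs, ys, a, b):
    """Expected y minus actual y for each point (illustrative helper)."""
    return [(a + b * x) - y for x, y in zip(xs, ys)]

errors = y_errors(xs, ys, a, b)
for x, y, e in zip(xs, ys, errors):
    print(f"x={x}  expected={a + b * x:.1f}  actual={y:.1f}  error={e:+.1f}")

# Floating-point rounding can leave a tiny non-zero remainder.
total = sum(errors)
print(f"sum of errors is zero within rounding: {abs(total) < 1e-9}")
```

Squaring each error before summing avoids this cancellation, which is why the standard error is built from squared errors.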
Standard Error of Regression Line
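+ The Standard Error S_{x,y} of the regression line is
defined by the formula:
$$ S_{x,y} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n-2}} $$
where \hat{y}_i is the expected y value and y_i is the actual y
value of data point i, and n is the number of data points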
+ In the previous example, the standard error of
the regression line is sqrt(1.20/3) = 0.632
+ Question: What is the standard error for the
transpose of the scatter plot (i.e., interchanging
X and Y)?
Errors for the original data (regression line y = 1.6x + 1):
x     | Expected y (ŷ) | Actual y | Error | Error²
1     | 2.6            | 2.0      |  0.6  | 0.36
2     | 4.2            | 5.0      | -0.8  | 0.64
3     | 5.8            | 6.0      | -0.2  | 0.04
4     | 7.4            | 7.0      |  0.4  | 0.16
5     | 9.0            | 9.0      |  0.0  | 0.00
Total |                |          |  0.0  | 1.20
Transpose of the Scatter Plot
+ The transpose of the scatter plot has regression
line with equation y = 0.597x - 0.4627
x | y
2 | 1
5 | 2
6 | 3
7 | 4
9 | 5
Standard Error
x     | Expected y (ŷ) | Actual y | Error   | Error²
2     | 0.7313         | 1.0      | -0.2687 | 0.0722
5     | 2.5223         | 2.0      |  0.5223 | 0.2728
6     | 3.1193         | 3.0      |  0.1193 | 0.0142
7     | 3.7163         | 4.0      | -0.2837 | 0.0805
9     | 4.9103         | 5.0      | -0.0897 | 0.0080
Total |                |          |  0.0    | 0.4477
+ Standard error is sqrt(0.4477/3) = 0.3863
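The comparison between the original and the transposed data can be reproduced with the sketch below. It assumes the deviation-from-the-mean form of the least-squares fit (algebraically the same line as the sums form) and the n - 2 divisor for the standard error; `fit_line` and `standard_error` are hypothetical helper names.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares fit; returns (a, b) for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

def standard_error(xs, ys, a, b):
    """Square root of (sum of squared y errors) / (n - 2)."""
    n = len(xs)
    sse = sum(((a + b * x) - y) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (n - 2))

xs = [1, 2, 3, 4, 5]
ys = [2, 5, 6, 7, 9]

# Regression of y on x (original plot) and of x on y (transposed plot).
a, b = fit_line(xs, ys)
a_t, b_t = fit_line(ys, xs)

# Algebraic transpose of y = a + b*x, i.e. solving the same line for x.
inv_slope, inv_intercept = 1 / b, -a / b

se = standard_error(xs, ys, a, b)
se_t = standard_error(ys, xs, a_t, b_t)

print(f"original:   y = {b:.4f}x + {a:.4f}, standard error {se:.4f}")
print(f"transposed: y = {b_t:.4f}x + {a_t:.4f}, standard error {se_t:.4f}")
print(f"algebraic transpose: y = {inv_slope:.4f}x + {inv_intercept:.4f}")
```

The fitted transposed line (slope about 0.597, intercept about -0.4627) differs from the algebraic transpose (0.625x - 0.625) because regression minimises vertical errors only, and swapping the axes changes which errors count as vertical.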