2-Scatterplots and Correlation
Scatterplots
If we want to inspect the relationship between two numeric variables, the standard choice
of plot is the scatterplot. In a scatterplot, each data point is plotted individually as a
point, its x-position corresponding to one feature value and its y-position corresponding
to the second.
matplotlib.pyplot.scatter()
One basic way of creating a scatterplot is through Matplotlib's scatter function:
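import pandas as pd
import matplotlib.pyplot as plt
# Assumes fuel_econ.csv (see Supporting Materials) is in the working directory
fuel_econ = pd.read_csv('fuel_econ.csv')
# Scatter plot of engine displacement against combined fuel efficiency
plt.scatter(data = fuel_econ, x = 'displ', y = 'comb');
plt.xlabel('Displacement (l)')
plt.ylabel('Combined Fuel Eff. (mpg)')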
In the example above, the relationship between the two variables is negative: as the
values of the x-axis variable increase, the values of the variable plotted on the
y-axis decrease.
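The direction of a relationship can also be checked numerically with the Pearson correlation coefficient, whose sign matches the direction of the trend. The following is a minimal sketch only: the values are made up for illustration (they are not taken from fuel_econ.csv), and pearson_r is a hypothetical helper written out so the formula is visible.

```python
import math

# Made-up illustrative values, NOT taken from fuel_econ.csv
displ = [1.6, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0]          # displacement (l)
comb = [34.0, 31.0, 28.0, 25.0, 23.0, 21.0, 18.0, 16.0]   # combined fuel eff. (mpg)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(displ, comb)
print(f"r = {r:.3f}")  # a negative r means y tends to fall as x rises
```

In practice, numpy.corrcoef or the pandas Series.corr method computes the same quantity without a hand-written helper.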
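seaborn.regplot()
Seaborn's regplot function draws the scatterplot and also fits a regression line to the data: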
import seaborn as sb
# Scatter plot with a fitted regression line
sb.regplot(data = fuel_econ, x = 'displ', y = 'comb');
plt.xlabel('Displacement (l)')
plt.ylabel('Combined Fuel Eff. (mpg)')
The basic function parameters, "data", "x", and "y", are the same for regplot as they are
for Matplotlib's scatter.
The regression line in a scatter plot showing a negative correlation between the two
variables.
Now consider another plot, shown below, in which the two variables are positively
correlated.
The regression line in a scatter plot showing a positive correlation between the two
variables.
In the scatter plot above, the regression function is linear by default and includes a
shaded confidence region for the regression estimate. In this case, the trend looks
like a log(y) ∝ x relationship (that is, linear increases in the value of
x are associated with linear increases in the log of y), so plotting the regression line on the
raw units is not appropriate. If we don't care about the regression line, we can set
fit_reg = False in the regplot function call.
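To see why a straight line in raw units is a poor fit for this kind of trend, here is a minimal sketch on made-up data constructed so that log10(y) is exactly linear in x. The data and the linear_r_squared helper are assumptions for illustration only; the sketch uses NumPy's polyfit for the straight-line fit rather than regplot itself.

```python
import numpy as np

# Hypothetical data built so that log10(y) is exactly linear in x
x = np.linspace(0, 5, 50)
y = 10 ** (0.3 * x + 1.0)

def linear_r_squared(x, y):
    """R^2 of an ordinary least-squares straight-line fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    # With an intercept in the fit, the residuals have zero mean, so the
    # ratio of variances equals SS_res / SS_tot
    return 1.0 - residuals.var() / y.var()

r2_raw = linear_r_squared(x, y)             # line fitted in raw units
r2_log = linear_r_squared(x, np.log10(y))   # line fitted after the log transform
print(f"raw units: R^2 = {r2_raw:.3f}")
print(f"log units: R^2 = {r2_log:.3f}")
```

The fit on the log-transformed values explains essentially all of the variation, while the raw-unit line leaves a systematic curve in the residuals.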
You can also plot the regression line on the transformed data, as shown in the example
below. For the transformation, use the same approach you learned in the last lesson.
# Pass x and y by keyword; recent seaborn versions do not accept them positionally
sb.regplot(x = fuel_econ['displ'], y = fuel_econ['comb'].apply(log_trans))
tick_locs = [10, 20, 50, 100]
plt.yticks(log_trans(tick_locs), tick_locs);
Note - In this example, the x- and y-values sent to regplot are passed directly as Series
extracted from the DataFrame.
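The log_trans function used above is not defined in this section. A minimal sketch of what such a helper might look like, assuming a base-10 transform with an optional inverse (the definition from the previous lesson may differ):

```python
import numpy as np

def log_trans(x, inverse=False):
    """Base-10 log transform; with inverse=True, map back to raw units.

    Assumed definition for illustration, not necessarily the one from
    the previous lesson.
    """
    x = np.asarray(x, dtype=float)
    return np.power(10.0, x) if inverse else np.log10(x)

tick_locs = [10, 20, 50, 100]
tick_pos = log_trans(tick_locs)                # tick positions on the log scale
restored = log_trans(tick_pos, inverse=True)   # back to the raw mpg values
print(tick_pos)
print(restored)
```

Because the helper accepts both scalars and sequences, it works with Series.apply as well as with the list of tick locations passed to plt.yticks.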
Supporting Materials
fuel_econ.csv