
Lab 04 Handout


What is statistics?

Introduction to Statistics in Python

Maggie Matsui
Content Developer, DataCamp
The field of statistics - the practice and study of collecting and analyzing data

A summary statistic - a fact about or summary of some data

What can statistics do?


How likely is someone to purchase a product? Are people more likely to purchase it if they
can use a different payment system?

How many occupants will your hotel have? How can you optimize occupancy?

How many sizes of jeans need to be manufactured so they can fit 95% of the population?
Should the same number of each size be produced?

A/B tests: Which ad is more effective in getting people to purchase a product?



Types of statistics
Descriptive statistics
Describe and summarize data
Example: 50% of friends drive to work, 25% take the bus, 25% bike

Inferential statistics
Use a sample of data to make inferences about a larger population
Example: What percent of people drive to work?
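
The distinction above can be sketched in a few lines. This is a minimal illustration, assuming a made-up sample of eight friends stored in a hypothetical `commute` Series; the names and data are not from the course.

```python
import pandas as pd

# Hypothetical sample: how 8 friends commute (made-up data for illustration).
commute = pd.Series(
    ["drive", "drive", "bus", "bike", "drive", "bus", "drive", "bike"]
)

# Descriptive statistics: summarize the data we actually collected.
shares = commute.value_counts(normalize=True) * 100
print(shares)  # drive 50.0, bus 25.0, bike 25.0

# Inferential statistics: treat the sample share as an estimate of the
# share in a larger population. With only 8 observations the estimate
# is very uncertain, so this is a point estimate and nothing more.
estimated_drive_share = shares["drive"]
print(f"Estimated share of people who drive: {estimated_drive_share:.0f}%")
```

The descriptive step reports what was observed; the inferential step makes a claim about people who were never surveyed, which is why sample size and sampling method matter there.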


Go to the exercise on descriptive and inferential statistics



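Why does data type matter?
The data type determines which summary statistics and plots are appropriate.
For numeric data, the mean is a meaningful summary statistic: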

import numpy as np
np.mean(car_speeds['speed_mph'])

40.09062

For categorical data, counts are a more appropriate summary statistic:
demographics['marriage_status'].value_counts()

single 188
married 143
divorced 124
dtype: int64
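
To try both summaries without the course datasets, here is a small self-contained sketch. The DataFrame `df` and its values are invented for illustration; only the column names echo the slides.

```python
import numpy as np
import pandas as pd

# Made-up data: one numeric column and one categorical column.
df = pd.DataFrame({
    "speed_mph": [38.0, 42.5, 40.0, 39.5],
    "marriage_status": ["single", "married", "single", "divorced"],
})

# Numeric data: the mean is a sensible summary.
mean_speed = np.mean(df["speed_mph"])
print(mean_speed)  # 40.0

# Categorical data: a mean is undefined, so count each category instead.
counts = df["marriage_status"].value_counts()
print(counts)
```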



Correlation
Relationships between two variables

x = explanatory/independent variable
y = response/dependent variable



Correlation coefficient
Quantifies the linear relationship between two variables

Number between -1 and 1

Magnitude corresponds to strength of relationship

Sign (+ or -) corresponds to direction of relationship
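
These properties can be checked on synthetic data. The sketch below uses invented values: two exactly linear relationships, where the coefficient reaches its extremes, and one noisy relationship, where it falls strictly inside the range.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(10, dtype=float))

# Exactly linear, increasing: the coefficient is 1 (up to rounding).
r_pos = x.corr(2 * x + 1)

# Exactly linear, decreasing: the coefficient is -1 (up to rounding).
r_neg = x.corr(-3 * x + 5)

# Linear trend plus random noise: same sign as the trend, smaller magnitude.
rng = np.random.default_rng(42)
noisy = x + pd.Series(rng.normal(0, 3, size=len(x)))
r_noisy = x.corr(noisy)

print(r_pos, r_neg, r_noisy)
```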



Magnitude = strength of relationship
0.99: very strong relationship
0.75: strong relationship
0.56: moderate relationship
0.21: weak relationship
0.04: no relationship (knowing the value of x doesn't tell us anything about y)


Sign = direction
0.75: as x increases, y increases
-0.75: as x increases, y decreases



Visualizing relationships
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of REM sleep against total sleep
sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()



Adding a trendline
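import matplotlib.pyplot as plt
import seaborn as sns

# lmplot adds a linear trendline; ci=None hides the confidence band
sns.lmplot(x="sleep_total", y="sleep_rem", data=msleep, ci=None)
plt.show()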



Computing correlation
msleep['sleep_total'].corr(msleep['sleep_rem'])

0.751755

msleep['sleep_rem'].corr(msleep['sleep_total'])

0.751755
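
The two calls above give the same value because correlation is symmetric in its arguments. A minimal sketch that reproduces this on a small made-up table follows; the hypothetical `sleep` DataFrame is not the `msleep` dataset, so its coefficient will differ from 0.751755.

```python
import pandas as pd

# Made-up values for illustration only (not the msleep data).
sleep = pd.DataFrame({
    "sleep_total": [12.1, 17.0, 14.4, 14.9, 4.0, 10.1],
    "sleep_rem": [2.0, 1.8, 2.4, 2.3, 0.7, 1.4],
})

# The order of the two variables does not matter.
r_xy = sleep["sleep_total"].corr(sleep["sleep_rem"])
r_yx = sleep["sleep_rem"].corr(sleep["sleep_total"])
print(r_xy, r_yx)

# DataFrame.corr() returns every pairwise correlation as a symmetric matrix.
corr_matrix = sleep.corr()
print(corr_matrix)
```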



Many ways to calculate correlation
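Used in this course: Pearson product-moment correlation (r)
Most common

x̄ = mean of x, ȳ = mean of y
σx, σy = sample standard deviations of x and y
n = number of observations

$$ r = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\,\sigma_x \sigma_y} $$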

Variations on this formula:


Kendall's tau
Spearman's rho
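
As a rough check of the Pearson formula, the sketch below computes r by hand and compares it with pandas. It assumes sample standard deviations (`ddof=1`) and the usual division by n − 1; the helper name `pearson_r` and the data are invented for illustration. Spearman's rho is shown as Pearson's r applied to ranks, which is its standard definition.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation from the definition (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Sample standard deviations, matching the (n - 1) normalization.
    sx = x.std(ddof=1)
    sy = y.std(ddof=1)
    return np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * sx * sy)


# Made-up data: y grows with x, but not linearly.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([1.0, 4.0, 9.0, 16.0, 30.0])

r_manual = pearson_r(x, y)
r_pandas = x.corr(y)  # pandas uses Pearson by default

# Spearman's rho: Pearson's r computed on the ranks of the values.
rho = pearson_r(x.rank(), y.rank())

print(r_manual, r_pandas, rho)
```

Because y increases whenever x increases, the rank-based coefficient is 1 even though the linear coefficient is below 1.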



Let's practice!
Correlation caveats
Non-linear relationships

r = 0.18

[Figure: two panels titled "What we see" and "What the correlation coefficient sees"]


Correlation only accounts for linear relationships
Correlation shouldn't be used blindly
Always visualize your data

df['x'].corr(df['y'])

0.081094
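
A near-zero coefficient like this one does not rule out a strong relationship. The sketch below builds a deliberately non-linear example with made-up data: y is completely determined by x, yet the linear correlation is essentially zero.

```python
import numpy as np
import pandas as pd

# x is symmetric around zero and y is an exact function of x.
x = pd.Series(np.linspace(-3, 3, 61))
y = x ** 2

# The rising and falling halves of the parabola cancel out,
# so the linear correlation is zero up to rounding error.
r = x.corr(y)
print(r)
```

A scatter plot of these points would show the parabola immediately, which is the reason to plot before trusting the number.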



Mammal sleep data
print(msleep)

name genus vore order ... sleep_cycle awake brainwt bodywt


1 Cheetah Acinonyx carni Carnivora ... NaN 11.9 NaN 50.000
2 Owl monkey Aotus omni Primates ... NaN 7.0 0.01550 0.480
3 Mountain beaver Aplodontia herbi Rodentia ... NaN 9.6 NaN 1.350
4 Greater short-ta... Blarina omni Soricomorpha ... 0.133333 9.1 0.00029 0.019
5 Cow Bos herbi Artiodactyla ... 0.666667 20.0 0.42300 600.000
.. ... ... ... ... ... ... ... ... ...
79 Tree shrew Tupaia omni Scandentia ... 0.233333 15.1 0.00250 0.104
80 Bottle-nosed do... Tursiops carni Cetacea ... NaN 18.8 NaN 173.330
81 Genet Genetta carni Carnivora ... NaN 17.7 0.01750 2.000
82 Arctic fox Vulpes carni Carnivora ... NaN 11.5 0.04450 3.380
83 Red fox Vulpes carni Carnivora ... 0.350000 14.2 0.05040 4.230



Body weight vs. awake time
msleep['bodywt'].corr(msleep['awake'])

0.3119801



Distribution of body weight



Log transformation
msleep['log_bodywt'] = np.log(msleep['bodywt'])

sns.lmplot(x='log_bodywt', y='awake', data=msleep, ci=None)
plt.show()

msleep['log_bodywt'].corr(msleep['awake'])

0.5687943
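
The same effect can be reproduced without the mammal data. In this sketch the variable names and values are invented: `weight` is highly skewed, and `response` is constructed to be linear in log(weight), so the log transformation recovers a perfect linear relationship.

```python
import numpy as np
import pandas as pd

# Made-up, highly skewed variable spanning several orders of magnitude.
weight = pd.Series(np.exp(np.linspace(0, 10, 50)))

# By construction, the response is linear in log(weight).
response = 2 * np.log(weight) + 1

# On the raw scale, a few very large values dominate the coefficient.
r_raw = weight.corr(response)

# After the log transformation the relationship is exactly linear.
r_log = np.log(weight).corr(response)

print(r_raw, r_log)
```

Real data will not line up this cleanly; the point is only that the coefficient depends on the scale on which the variables are measured.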



Other transformations
Log transformation ( log(x) )
Square root transformation ( sqrt(x) )

Reciprocal transformation ( 1 / x )

Combinations of these, e.g.:


log(x) and log(y)

sqrt(x) and 1 / y
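
One way to explore these options is to apply each transformation and compare the resulting coefficients. The sketch below does this on made-up data where y = 1 / x, so the reciprocal is expected to fit best; the dictionary names are hypothetical, and picking a transformation by correlation alone is a heuristic that should be backed up by a plot.

```python
import numpy as np
import pandas as pd

# Made-up data with a reciprocal relationship.
x = pd.Series(np.linspace(1, 10, 30))
y = 1 / x

# Candidate transformations of x (all valid because x > 0).
transformed = {
    "none": x,
    "log": np.log(x),
    "sqrt": np.sqrt(x),
    "reciprocal": 1 / x,
}

# Correlation of each transformed x with y.
results = {name: tx.corr(y) for name, tx in transformed.items()}
for name, r in results.items():
    print(f"{name:>10}: {r:.3f}")

# The transformation with the largest magnitude is the most linear fit.
best = max(results, key=lambda name: abs(results[name]))
print("Most linear:", best)
```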



Why use a transformation?
Certain statistical methods rely on variables having a linear relationship
Correlation coefficient

Linear regression

Introduction to Linear Modeling in Python



Correlation does not imply causation
"x is correlated with y" does not mean "x causes y"



Confounding
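
A confounder is a third variable that influences both x and y, which can make them correlated even when neither affects the other. The simulation below is a minimal sketch under invented assumptions: a hidden variable `z` drives both `x` and `y`, and the helper `residuals` (a hypothetical name) removes the linear effect of `z` before the correlation is recomputed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# z is the confounder; x and y each depend on z but not on each other.
z = rng.normal(size=n)
x = z + rng.normal(scale=0.5, size=n)
y = z + rng.normal(scale=0.5, size=n)

# x and y are strongly correlated, purely because of z.
r_xy = np.corrcoef(x, y)[0, 1]


def residuals(a, b):
    """Part of a that a straight-line fit on b does not explain."""
    slope, intercept = np.polyfit(b, a, 1)
    return a - (slope * b + intercept)


# After removing the effect of z from both, little correlation remains.
r_given_z = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(round(r_xy, 2), round(r_given_z, 2))
```

In practice the confounder is often unmeasured, so this adjustment is not always available; the simulation only shows how a shared cause can produce a correlation.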

