Data Sci Linkedin
Statistics
General
A sampling distribution is the distribution of all possible values of a statistic for a given sample size.
To create a sampling distribution for two independent samples, first select a sample from each of the
two populations and calculate their means. Calculate the difference between the two means and repeat
this many times. The set of all the differences is the sampling distribution, also called the
Sampling Distribution of the Difference Between Means.
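The procedure above can be sketched as a small simulation. This is only an illustration: the two
populations, the sample sizes, the number of repeats and the function name
sampling_distribution_of_difference are all made-up assumptions, not part of the course material.

```python
import random
import statistics

def sampling_distribution_of_difference(pop_a, pop_b, n_a, n_b, repeats, seed=0):
    """Repeatedly sample both populations and collect the difference of sample means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(repeats):
        sample_a = rng.sample(pop_a, n_a)  # one sample from each population
        sample_b = rng.sample(pop_b, n_b)
        diffs.append(statistics.mean(sample_a) - statistics.mean(sample_b))
    return diffs

# Hypothetical populations, generated only for this example.
rng = random.Random(42)
population_a = [rng.gauss(50, 10) for _ in range(10_000)]
population_b = [rng.gauss(45, 10) for _ in range(10_000)]

diffs = sampling_distribution_of_difference(population_a, population_b, 30, 30, 2_000)

# The mean of the differences should land near the difference of the population means.
print(statistics.mean(diffs))
print(statistics.mean(population_a) - statistics.mean(population_b))
```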
An actual sampling distribution is never created for a study. By the central limit theorem, if the
samples are large, the sampling distribution of the difference between means is approximately a
normal distribution. If the populations are normally distributed, the sampling distribution of the
difference between means is a normal distribution even if the samples are small.
Regression
Prediction is a major goal of statistics. Regression is the use of data from one variable (the
independent variable) to predict data for another (the dependent variable). Given a particular
independent-dependent variable pair, the best course of action is to create a scatterplot with the
independent variable on the x axis and the dependent variable on the y axis. In this type of chart
each dot represents one observation. Once we have the graph we draw the regression line, which is
the line of best fit through the scatterplot. It summarizes the relationship between the independent
variable and the dependent variable, and it minimizes the sum of squared distances in the y direction
from the points to the line.
Like any line, the regression line is defined by the equation y = a + bx, where a is the intercept
and b is the slope. a and b are called regression coefficients and are found as follows:
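A minimal sketch of the calculation, assuming the standard least-squares formulas
b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2) and a = mean_y - b * mean_x. The function
name fit_line and the data points are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through (mean_x, mean_y)
    return a, b

# Made-up data that lies close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
print(a, b)
```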
There is always going to be variability around the regression line. The residual is the distance in
the y direction from a point to the regression line: the deviation of an observed data point from
the corresponding predicted data point. Residuals can be used to calculate the residual variance and
the standard error, which show how well the regression line fits the data.
A scatterplot has three kinds of variance: the residual variance mentioned above, the regression
variance, and the total variance. Regression variance is based on the difference between the
predicted y values and the average y value. Total variance is based on the difference between the
observed y values and the average y value.
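The three kinds of variation can be sketched as sums of squared differences. This is an
illustration under assumptions: the data is made up, the function name variation_components is
hypothetical, and dividing the residual sum of squares by n - 2 is the usual convention for a
line with two coefficients rather than something stated in these notes.

```python
import math

def variation_components(xs, ys):
    """Fit y = a + b*x by least squares and split the variation in y into its parts."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    predicted = [a + b * x for x in xs]

    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # observed vs predicted
    ss_regression = sum((p - mean_y) ** 2 for p in predicted)       # predicted vs average
    ss_total = sum((y - mean_y) ** 2 for y in ys)                   # observed vs average

    residual_variance = ss_residual / (n - 2)    # assumed convention: n - 2 degrees of freedom
    standard_error = math.sqrt(residual_variance)
    return ss_residual, ss_regression, ss_total, standard_error

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
ss_residual, ss_regression, ss_total, standard_error = variation_components(xs, ys)

# For a least-squares line, total variation = regression variation + residual variation.
print(ss_total, ss_regression + ss_residual)
print(standard_error)
```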
Correlation
When variables are correlated, they vary together. Correlation can be positive (> 0), negative (< 0)
or zero (= 0). If correlation is positive, low x scores (on a scatterplot) are associated with low y
scores and high x scores are associated with high y scores, and the slope of the regression line is
positive. If correlation is negative, low x scores are associated with high y scores and the slope is
negative. Correlation does not imply causality.
The correlation coefficient is the statistic that shows the strength and direction of the
relationship between correlated variables. Its formula is
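A minimal sketch, assuming the coefficient meant here is the usual Pearson correlation coefficient,
r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2)). The function
name pearson_r and the data are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Low x with low y and high x with high y: positive correlation.
r_positive = pearson_r([1, 2, 3], [2, 4, 6])
# Low x with high y: negative correlation.
r_negative = pearson_r([1, 2, 3], [6, 4, 2])
print(r_positive, r_negative)
```

The result always falls between -1 and 1, and its sign matches the sign of the regression slope.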
General
To find multiple modes in a data set use the MODE.MULT spreadsheet function. This is an array
formula, so you need to select multiple cells to hold the results, type the formula, then press
Ctrl+Shift+Enter.
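For comparison outside a spreadsheet, a rough Python counterpart is statistics.multimode (Python
3.8 or later). The data below is made up, and the behaviour is not identical to MODE.MULT: when no
value repeats, multimode returns every value instead of reporting an error.

```python
from statistics import multimode

# Made-up scores in which both 2 and 3 appear twice.
scores = [1, 2, 2, 3, 3, 4]

# multimode returns every most-frequent value, in order of first appearance.
modes = multimode(scores)
print(modes)
```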
To analyze data using samples it is necessary to gather as large a sample as you can, estimate the
population’s standard deviation, choose the confidence level (1 - alpha), usually 95%, and calculate
the margin of error.
To calculate the margin of error, find the standard error (standard deviation / sqrt(sample size)).
The margin of error is the standard error times the z-score. The z-score is the number of standard
deviations from the mean, and it is tabulated based on the confidence level (about 1.96 for 95%).
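The calculation can be sketched as follows. Instead of a printed table, this looks up the z-score
with NormalDist.inv_cdf from Python's statistics module. The function name margin_of_error, the
standard deviation of 15 and the sample size of 100 are assumptions made up for the example.

```python
import math
from statistics import NormalDist

def margin_of_error(std_dev, sample_size, confidence=0.95):
    """Margin of error = z-score * standard error, for a two-sided confidence level."""
    standard_error = std_dev / math.sqrt(sample_size)
    alpha = 1 - confidence
    # Two-sided interval: put alpha/2 in each tail of the standard normal curve.
    z_score = NormalDist().inv_cdf(1 - alpha / 2)
    return z_score * standard_error

# Hypothetical study: estimated standard deviation 15, sample of 100, 95% confidence.
moe = margin_of_error(15, 100)
print(round(moe, 2))
```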
There are multiple sources of error that could affect your study, among them: using non-random
samples, investigator bias (anticipating the results), working with outdated data, and a small
sample size.
Data visualization
General
A good data visualization is defined by ASK: it is Accurate, tells a good Story and delivers Knowledge.
The five Ws and How (when, what, why, where, who, how) are used in data visualization. To visualize
data it is sometimes useful to turn values into percentiles. To do that, sort the data from largest
to smallest, then for each value type 1-(row/count), where row is the value's position in the sorted
list and count is the number of values.
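The same percentile calculation can be sketched in code. This assumes that row starts at 1 for the
largest value; the function name percentiles_by_rank and the data are made up for the example.

```python
def percentiles_by_rank(values):
    """Pair each value with 1 - (row / count), after sorting from largest to smallest."""
    ordered = sorted(values, reverse=True)
    count = len(ordered)
    # Assumption: row numbering starts at 1 on the first (largest) value.
    return [(value, 1 - row / count) for row, value in enumerate(ordered, start=1)]

ranked = percentiles_by_rank([10, 40, 20, 30])
for value, percentile in ranked:
    print(value, percentile)
```

With this formula the smallest value always gets 0 and the largest gets 1 - 1/count, so no value
reaches exactly 1.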