Data Sci Linkedin

A sampling distribution is the distribution of values of a statistic from repeated samples of the same size. The standard error is the standard deviation of a sampling distribution. To create a sampling distribution of differences between two means, samples are taken from two populations and their means are calculated, with the differences forming the sampling distribution. Even with small samples, if the populations are normally distributed, the sampling distribution of differences between means will be normally distributed. Regression analysis uses one variable to predict another, finding the line of best fit that minimizes residuals. Correlation measures the strength and direction of a relationship between variables, with a coefficient between -1 and 1. Excel can be used as a data science tool to find modes, calculate margins of error

Uploaded by

angel
Copyright
© All Rights Reserved

Data Science

Statistics

General

A sampling distribution is the distribution of all possible values of a statistic for a given sample size.

The standard error is the standard deviation of a sampling distribution.

To create a sampling distribution for two independent samples, first select a sample from each of the
two populations and calculate their means. Then calculate the difference between the two means.
Repeat this many times. The set of all the differences is the sampling distribution, which is also
called the Sampling Distribution of the Difference Between Means.
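
The procedure above can be sketched as a small simulation. This is only an illustration: the two populations, the sample sizes, the number of repetitions and the function name simulate_mean_differences are all made-up assumptions, not part of the original notes.

```python
import random
import statistics

def simulate_mean_differences(pop1, pop2, n1, n2, repetitions, seed=0):
    """Return the list of (mean of sample 1 - mean of sample 2) over many draws."""
    rng = random.Random(seed)
    differences = []
    for _ in range(repetitions):
        sample1 = rng.sample(pop1, n1)  # one sample from population 1
        sample2 = rng.sample(pop2, n2)  # one independent sample from population 2
        differences.append(statistics.mean(sample1) - statistics.mean(sample2))
    return differences

# Hypothetical populations (assumption): normal, means 100 and 90, SD 15.
population_rng = random.Random(42)
population_a = [population_rng.gauss(100, 15) for _ in range(10_000)]
population_b = [population_rng.gauss(90, 15) for _ in range(10_000)]

diffs = simulate_mean_differences(population_a, population_b,
                                  n1=25, n2=25, repetitions=2_000)

# The standard deviation of the simulated differences estimates the standard error.
print("mean of the differences:", round(statistics.mean(diffs), 2))
print("standard error:", round(statistics.stdev(diffs), 2))
```

A histogram of diffs would approximate the Sampling Distribution of the Difference Between Means for these two hypothetical populations.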

An actual sampling distribution is never constructed for a study, because its shape is already known.
By the central limit theorem, if the samples are large, the sampling distribution of the difference
between means is approximately a normal distribution. If the populations are normally distributed,
the sampling distribution of the difference between means is a normal distribution even if the
samples are small.

Regression

Prediction is one of the main goals of statistics. Regression is the use of data from one variable
(the independent variable) to predict data for another (the dependent variable). Given a particular
dependent-independent variable pair, the best first step is to create a scatterplot with the
independent variable on the x axis and the dependent variable on the y axis. In this type of chart
each dot represents one observation. Once we have the graph, we draw the regression line, which is
the line of best fit through the scatterplot. It summarizes the relationship between the independent
variable and the dependent variable, and it minimizes the sum of squared distances in the y
direction from the points to the line.

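Like any straight line, the regression line is defined by the equation y = a + bx, where a is the
intercept and b is the slope. a and b are called regression coefficients and, for the least-squares
line, are found as follows:

b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
a = ȳ − b·x̄

where x̄ and ȳ are the means of the x and y values.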
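
As a rough illustration, the standard least-squares coefficients for y = a + bx can be computed as in the sketch below. The function name regression_coefficients and the sample data are assumptions made for this example only.

```python
from statistics import mean

def regression_coefficients(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-products of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sum of squared deviations of x from its mean.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept
    return a, b

# Made-up data (assumption): y is roughly 2*x with a little noise.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = regression_coefficients(xs, ys)
print("intercept a:", round(a, 3))
print("slope b:", round(b, 3))
print("prediction for x = 6:", round(a + b * 6, 3))
```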

There is always going to be variability around the regression line. The residual is the distance in
the y direction from a point to the regression line, that is, the deviation of an observed data
point from the corresponding predicted data point. Residuals can be used to calculate the residual
variance and the standard error of estimate, which show how well the regression line fits the data.

You can have three kinds of variance in a scatterplot: the residual variance mentioned above, the
regression variance and the total variance. Regression variance is based on the difference between
each predicted y value and the mean of the y values. Total variance is based on the difference
between each observed y value and the mean of the y values.
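
The three kinds of variability can be illustrated with sums of squared differences. The sketch below is a minimal example on made-up data. The helper names are hypothetical, and dividing the residual sum of squares by n - 2 to get the residual variance is the usual convention for simple regression, assumed here rather than taken from the notes.

```python
from math import sqrt
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def sums_of_squares(xs, ys):
    """Return (residual, regression, total) sums of squares."""
    a, b = fit_line(xs, ys)
    y_bar = mean(ys)
    predicted = [a + b * x for x in xs]
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # observed vs predicted
    ss_regression = sum((p - y_bar) ** 2 for p in predicted)        # predicted vs mean
    ss_total = sum((y - y_bar) ** 2 for y in ys)                    # observed vs mean
    return ss_residual, ss_regression, ss_total

# Made-up data (assumption).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

ss_residual, ss_regression, ss_total = sums_of_squares(xs, ys)
n = len(xs)
residual_variance = ss_residual / (n - 2)  # assumed convention: n - 2 degrees of freedom
standard_error_of_estimate = sqrt(residual_variance)

print("residual + regression:", round(ss_residual + ss_regression, 4))
print("total:", round(ss_total, 4))
print("standard error of estimate:", round(standard_error_of_estimate, 4))
```

For a least-squares line the residual and regression sums of squares add up to the total, which is why the first two printed values agree.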

Correlation

When variables are correlated, they vary together. Correlation can be positive (> 0), negative (< 0)
or zero (= 0, no linear relationship). If correlation is positive, low x scores (on a scatterplot)
are associated with low y scores and high x scores are associated with high y scores, and the slope
of the regression line is positive. If correlation is negative, low x scores are associated with
high y scores and high x scores with low y scores, and the slope of the regression line is negative.
Correlation does not imply causality.

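The correlation coefficient (r) is the statistic that shows the strength and direction of the
relationship between correlated variables. It takes values between -1 and 1, and its formula
(Pearson's correlation coefficient) is:

r = Σ(x − x̄)(y − ȳ) / sqrt( Σ(x − x̄)² · Σ(y − ȳ)² )

where x̄ and ȳ are the means of the x and y values.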
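
A minimal sketch of Pearson's correlation coefficient, assuming that is the coefficient the notes refer to. The function name pearson_r and the data are made up for this illustration.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data (assumption).
xs = [1, 2, 3, 4, 5]
ys_up = [2.1, 3.9, 6.2, 7.8, 10.1]    # rises with x: strong positive correlation
ys_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # falls with x on a straight line

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)
print("positive example:", round(r_up, 4))
print("negative example:", round(r_down, 4))
```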

Excel as Data Science Tool

General

To find multiple modes in a data set, use MODE.MULT. This is an array formula, so you need to select
multiple cells to hold the results, type the formula and then press Ctrl+Shift+Enter.
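
For comparison outside Excel, the same idea can be sketched in Python with the standard library's statistics.multimode, which returns every value tied for the highest frequency. The data below is made up for illustration.

```python
from statistics import multimode

# Made-up scores (assumption): 2 and 3 both appear twice.
scores = [1, 2, 2, 3, 3, 4]

# multimode returns all of the most frequent values,
# in the order they first appear in the data.
modes = multimode(scores)
print(modes)
```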

To analyze data using samples, gather as large a sample as you can, estimate the population’s
standard deviation, choose the confidence level (usually 95%, which corresponds to alpha = 0.05) and
calculate the margin of error.

To calculate the margin of error, first find the standard error (standard deviation / sqrt(sample
size)). The margin of error is the standard error times the z-score. The z-score is the number of
standard deviations from the mean, and it is tabulated by confidence level (about 1.96 for 95%).
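
The calculation can be sketched as follows. Instead of a z table, this example looks up the z-score with the standard library's NormalDist. The function name margin_of_error and the example numbers are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(std_dev, sample_size, confidence=0.95):
    """Margin of error = z-score * standard error, for a two-sided interval."""
    alpha = 1 - confidence
    # z-score that leaves alpha/2 in each tail of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = std_dev / sqrt(sample_size)
    return z * standard_error

# Made-up example (assumption): estimated standard deviation 15, sample of 100.
moe = margin_of_error(std_dev=15, sample_size=100, confidence=0.95)
print("margin of error:", round(moe, 2))
```

With these numbers the standard error is 15 / 10 = 1.5 and the z-score is about 1.96, so the margin of error is about 2.94.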

Multiple sources of error can affect your study, among them: using non-random samples, investigator
bias (anticipating the results), working with outdated data and a small sample size.

Data Visualization

General

Know your audience.

A good data visualization is defined by ASK: it is Accurate, tells a good Story and delivers Knowledge.
The five Ws and one H (when, what, why, where, who, how) are used in data visualization. To visualize
data it is sometimes useful to turn values into percentiles. To do that, sort the data from largest
to smallest, then for each value type 1-(row/count), where row is the value's position in the sorted
list and count is the number of values.
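
The percentile rule 1-(row/count) can be sketched in Python. This is a minimal illustration that assumes row means the 1-based position of a value after sorting from largest to smallest. The function name percentiles_from_ranks and the data are made up, and ties are not given any special treatment.

```python
def percentiles_from_ranks(values):
    """Pair each value with 1 - (row / count), sorted from largest to smallest."""
    ordered = sorted(values, reverse=True)
    count = len(ordered)
    # row starts at 1 for the largest value, matching a spreadsheet row position.
    return [(value, 1 - row / count) for row, value in enumerate(ordered, start=1)]

# Made-up scores (assumption).
scores = [50, 80, 65, 90]
result = percentiles_from_ranks(scores)
for value, percentile in result:
    print(value, percentile)
```

With this rule the largest of four values gets 0.75 and the smallest gets 0.0.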
