Basics of Statistics

A p-value measures the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true.

The lower the p-value, the stronger the evidence against the null hypothesis. A p-value of 0.05 or
lower is conventionally considered statistically significant.
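As a minimal illustration, a one-sided p-value for a hypothetical coin-flip experiment can be computed exactly from the binomial distribution:

```python
from math import comb

def binomial_p_value(n, k, p0=0.5):
    """One-sided exact p-value: P(X >= k) under H0 that the success
    probability is p0, for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# Hypothetical example: 60 heads in 100 flips of a supposedly fair coin.
p = binomial_p_value(100, 60)
print(round(p, 4))  # ≈ 0.028, below the 0.05 threshold, so we would reject H0
```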

A/B testing, also known as split testing or bucket testing, is a method for comparing two versions of a website
or app to see which one performs better. Marketers use A/B testing to experiment with different elements of
their website or app, such as messaging, page layouts, color schemes, and calls to action. The goal is to
determine which version performs best for a given conversion goal, such as sign-ups or sales.
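A/B test results are often compared with a two-proportion z-test. A minimal sketch in plain Python, using made-up conversion counts:

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (A/B test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical campaign: variant B converts 120/1000 vs. variant A's 100/1000.
z, p = two_proportion_z(100, 1000, 120, 1000)
print(round(z, 3), round(p, 3))
```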

What is regression?

What are residuals?

How do you interpret the coefficients in a linear regression model?

What is a confidence interval?

What is multicollinearity?

Regularization is a powerful technique for treating multicollinearity in regression
models. It is also used to prevent overfitting by adding a penalty to the model for having large
coefficients, which helps in creating a more generalizable model. Common regularization techniques
include Lasso and Ridge regression.
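As a toy illustration of how the ridge penalty shrinks coefficients, here is the closed-form ridge estimate for a single-feature, no-intercept model (the data is hypothetical):

```python
def ridge_slope(xs, ys, lam):
    """Ridge estimate of the slope for a one-feature, no-intercept model:
    minimizes sum((y - b*x)^2) + lam * b^2, giving b = Sxy / (Sxx + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]         # roughly y = 2x
print(ridge_slope(xs, ys, 0.0))   # lam=0 gives the ordinary least-squares slope
print(ridge_slope(xs, ys, 10.0))  # the penalty shrinks the slope toward zero
```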

The bias-variance tradeoff in machine learning involves balancing two error sources. Bias is the error
from overly simplistic model assumptions, causing underfitting and missing data patterns. Variance is
the error from excessive sensitivity to training data fluctuations, causing overfitting and capturing
noise instead of true patterns.

Structural Equation Modeling (SEM) is a technique for analyzing complex relationships among
observed and latent variables; it combines elements of multiple regression and factor analysis.
SEM involves several steps: model specification, estimation, and evaluation. It is a flexible
approach, but it requires large sample sizes and a strong theoretical foundation.

In data analysis, missingness refers to the absence of data points. There are three main types of
missingness:

1. **Missing Completely at Random (MCAR)**: The missing data is entirely random, with no
relationship to the observed or unobserved data. The absence of data does not depend on any
values, making the data analysis unbiased.

2. **Missing at Random (MAR)**: The missingness is related to observed data but not related to the
missing data itself. For example, if older individuals are less likely to respond to a survey about
technology, the missingness can be explained by their age, but not the technology preferences
themselves.

3. **Missing Not at Random (MNAR)**: The missingness is related to the missing data itself. For
instance, individuals who have higher incomes may be less likely to report their income level. Here,
the reasons for missingness are tied to the values that are missing.
**Mean/Median Imputation**:

- **Advantages**: Simple and quick to compute; preserves data size; works well when data is
missing completely at random.

**Mode Imputation**:

- **Advantages**: Useful for categorical data; retains the most common value, which can help
maintain distribution properties.

**K-Nearest Neighbors (KNN)**:

- **Advantages**: Considers the similarity between data points, providing contextually relevant
imputations based on nearby observations.

**Interpolation**:

- **Advantages**: Effective for time series data, where missing values can be estimated based on
nearby values.
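Mean imputation, the simplest of the methods above, can be sketched in a few lines (using None to mark missing entries):

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

print(impute_mean([1.0, None, 3.0, None, 5.0]))  # [1.0, 3.0, 3.0, 3.0, 5.0]
```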

Seasonality in a time series is a repeating pattern that occurs at regular intervals, such as daily,
weekly, or quarterly. Here's how you can detect seasonality:

 Plot the data: Look for repeating patterns or cycles.

 Use autocorrelation: Measure the similarity between the time series and itself at different
lags. For example, in a monthly series, a peak at lag 12 indicates yearly seasonality.

 Check for differences: Differences in medians and quartiles can highlight seasonality.

 Use a seasonal subseries plot: This plot is useful if you already know the seasonality period.

 Use the Kruskal-Wallis test: A p-value below a chosen significance level indicates seasonality.

 Use the ratio-to-moving-average method: This method measures the degree of seasonal
variation.
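The autocorrelation check above can be sketched in plain Python (a toy series with an assumed period of 4):

```python
from statistics import mean

def autocorr(series, lag):
    """Sample autocorrelation of a series at a given lag."""
    m = mean(series)
    num = sum((series[t] - m) * (series[t - lag] - m)
              for t in range(lag, len(series)))
    den = sum((x - m) ** 2 for x in series)
    return num / den

# A toy series that repeats every 4 points: the lag-4 autocorrelation is high.
series = [1, 3, 5, 3] * 6
print(round(autocorr(series, 4), 2))  # 0.83
print(round(autocorr(series, 1), 2))  # 0.0
```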

Cross-validation is a statistical technique used to evaluate the performance of a machine
learning model. It involves splitting the dataset into multiple subsets, or "folds." The model is
trained on some of these folds and tested on the remaining fold(s). This process is repeated
several times, with different subsets used for training and testing in each iteration. The results
are then averaged to obtain a more reliable estimate of the model’s effectiveness. This helps in
assessing how well the model will perform on unseen data and reduces the risk of overfitting.
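A minimal sketch of how the fold indices for k-fold cross-validation might be generated (illustrative only; libraries such as scikit-learn provide production-ready versions):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    # Spread any remainder across the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

for train, test in k_fold_indices(6, 3):
    print(test)  # [0, 1], then [2, 3], then [4, 5]
```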

Outliers are commonly detected using two methods:

 Standard deviation/z-score

 Interquartile range (IQR)
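The IQR rule can be sketched as follows (the 1.5×IQR fences are the conventional choice):

```python
from statistics import quantiles

def iqr_outliers(data):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(data, n=4)   # quartiles of the sample
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

print(iqr_outliers([10, 12, 12, 13, 12, 11, 50]))  # [50]
```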

1. What is the law of large numbers?


 Answer: The law of large numbers states that as the sample size increases, the sample
mean will get closer to the expected value (population mean). This principle underlies
many statistical practices, including the use of sample means to estimate population
means.
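A quick simulation illustrates the law: sample means of fair die rolls approach the expected value of 3.5 as the sample size grows (seed fixed for reproducibility):

```python
import random
from statistics import mean

random.seed(42)
# A fair die has expected value 3.5; larger samples drift closer to it.
for n in (10, 1000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, round(mean(rolls), 3))
```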

2. Explain what a confidence level is in the context of confidence intervals.

 Answer: The confidence level represents the proportion of all possible samples that
can be expected to include the true population parameter. For example, a 95%
confidence level means that 95% of the confidence intervals calculated from repeated
sampling will contain the true population parameter.

3. What is a z-score, and how is it used?

 Answer: A z-score indicates how many standard deviations an element is from the
mean of the population. It is used to standardize scores on different scales, to compare
them directly, or to find the probability of a score occurring within a standard normal
distribution.
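Computed on a small hypothetical set of exam scores:

```python
from statistics import mean, stdev

def z_score(x, data):
    """How many sample standard deviations x lies from the sample mean."""
    return (x - mean(data)) / stdev(data)

scores = [70, 75, 80, 85, 90]
print(round(z_score(90, scores), 3))  # the top score sits above the mean
```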

4. What is a Bayesian approach to statistics?

 Answer: Bayesian statistics is a method of inference in which Bayes' theorem is used
to update the probability of a hypothesis as more evidence or information becomes
available. Unlike frequentist methods, Bayesian inference incorporates prior
knowledge or beliefs.

5. What is the difference between point estimation and interval estimation?

 Answer: Point estimation gives a single value estimate of a population parameter
(e.g., sample mean), while interval estimation provides a range of values (e.g.,
confidence interval) within which the parameter is expected to lie with a certain level
of confidence.

6. Explain what a residual is in the context of regression analysis.

 Answer: A residual is the difference between the observed value and the value
predicted by the regression model. It represents the error or deviation of the observed
data from the fitted regression line.
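Residuals for a fitted line can be computed directly (hypothetical data and coefficients):

```python
def residuals(xs, ys, slope, intercept):
    """Residual = observed y minus the value predicted by the fitted line."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# ≈ [0.0, 0.1, -0.1] for a line y = 2x fitted to near-linear data
print(residuals([1, 2, 3], [2.0, 4.1, 5.9], 2.0, 0.0))
```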

7. What is a scatter plot, and what information can you derive from it?

 Answer: A scatter plot is a graphical representation of the relationship between two
quantitative variables. It helps in identifying the nature of the relationship (linear,
nonlinear, etc.), detecting outliers, and assessing the strength of the correlation.

8. What is an outlier, and how can it impact your analysis?


 Answer: An outlier is an observation that lies an abnormal distance from other values
in the data. Outliers can skew the results of statistical analyses, like means or
regressions, leading to misleading conclusions.

9. What is the difference between an ANOVA and a t-test?

 Answer: Both ANOVA (Analysis of Variance) and t-tests are used to compare
means. A t-test compares the means of two groups, while ANOVA can compare the
means of three or more groups simultaneously.

10. Explain the concept of “degrees of freedom” in statistics.

 Answer: Degrees of freedom refer to the number of independent values or quantities
that can vary in a statistical calculation. It is often used in the context of t-tests,
chi-square tests, and ANOVA to account for the number of independent pieces of
information in the data.

11. What is the F-distribution, and when is it used?

 Answer: The F-distribution is a probability distribution that arises frequently in
ANOVA, regression analysis, and the comparison of variances. It is used to test
hypotheses about whether the variances of two populations are equal.

12. What is the difference between a one-tailed and a two-tailed test?

 Answer: A one-tailed test assesses the significance of the effect in only one direction
(either greater than or less than), while a two-tailed test assesses the significance in
both directions (greater than and less than).

13. What is the purpose of using a control group in an experiment?

 Answer: A control group serves as a baseline that does not receive the treatment or
intervention. It allows the researcher to compare outcomes and determine the effect of
the treatment, helping to rule out alternative explanations for the results.

14. Explain what a sampling distribution is.

 Answer: A sampling distribution is the probability distribution of a statistic (like a
sample mean) obtained from a large number of samples drawn from the same
population. It is used to understand the variability of the statistic and to make
inferences about the population.

15. What is a chi-square test, and when would you use it?

 Answer: The chi-square test is used to determine whether there is a significant
association between two categorical variables. It compares the observed frequencies
in each category with the frequencies expected under the null hypothesis.
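The test statistic itself is simple to compute; a sketch with hypothetical die-roll counts:

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical die: are the observed counts consistent with fairness?
observed = [8, 9, 10, 11, 12, 10]
expected = [10] * 6
print(chi_square_stat(observed, expected))  # 1.0
```

The resulting statistic would then be compared against a chi-square distribution with the appropriate degrees of freedom to obtain a p-value.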
16. What is the difference between a parametric and a non-parametric
method?

 Answer: Parametric methods assume that the data follows a specific distribution,
typically normal, and rely on estimating population parameters. Non-parametric
methods make fewer assumptions about the distribution of the data and are used when
parametric assumptions cannot be satisfied.

17. Explain what cross-validation is and why it is used.

 Answer: Cross-validation is a technique used to assess the predictive performance of
a model by dividing the data into training and testing subsets multiple times. It is used
to detect overfitting and ensure that the model generalizes well to unseen data.

18. What is the difference between R-squared and adjusted R-squared?

 Answer: R-squared measures the proportion of variance in the dependent variable
that is explained by the independent variables in a regression model. Adjusted
R-squared adjusts this value for the number of predictors in the model, penalizing for
adding variables that do not improve the model.
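The adjustment is a simple formula; a sketch showing how adding a predictor lowers adjusted R-squared when R-squared stays flat (hypothetical values):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R^2 with one extra (useless) predictor: adjusted R^2 drops.
print(round(adjusted_r2(0.80, 50, 3), 4))  # 3 predictors
print(round(adjusted_r2(0.80, 50, 4), 4))  # 4 predictors, lower value
```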

19. What is multivariate analysis, and how is it different from univariate analysis?

 Answer: Multivariate analysis involves the analysis of more than one variable at a
time to understand relationships and interactions between variables. Univariate
analysis, on the other hand, involves the analysis of a single variable at a time.

20. What is Simpson’s Paradox, and can you provide an example?

 Answer: Simpson’s Paradox occurs when a trend observed within multiple groups
reverses when the groups are combined. For example, a treatment might appear
effective in two separate groups, but when the groups are combined, the treatment
appears ineffective.
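A numeric sketch of the reversal, using hypothetical recovery counts patterned after the classic kidney-stone example:

```python
from fractions import Fraction as F

# Hypothetical (recovered, total) counts for a treatment vs. control,
# split into mild and severe cases.
treat = {"mild": (81, 87), "severe": (192, 263)}
control = {"mild": (234, 270), "severe": (55, 80)}

# The treatment has the higher recovery rate within each group...
for g in ("mild", "severe"):
    assert F(*treat[g]) > F(*control[g])

# ...but pooling the groups reverses the comparison.
pool = lambda d: F(sum(r for r, _ in d.values()), sum(n for _, n in d.values()))
print(pool(treat) < pool(control))  # True: the paradox
```

The reversal happens because the treatment was given disproportionately to the severe cases, which recover less often regardless of treatment.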

21. What is the difference between heteroscedasticity and homoscedasticity?

 Answer: Homoscedasticity refers to the condition where the variance of the residuals
is constant across all levels of the independent variable(s). Heteroscedasticity occurs
when the variance of the residuals varies across levels of the independent variable(s),
potentially violating assumptions of regression models.

22. Explain what a ROC curve is and what it is used for.

 Answer: A ROC (Receiver Operating Characteristic) curve is a graphical
representation of a classifier's performance, plotting the true positive rate against the
false positive rate at various threshold settings. It is used to assess the trade-off
between sensitivity and specificity.

23. What is bootstrapping in statistics?

 Answer: Bootstrapping is a resampling technique used to estimate the sampling
distribution of a statistic by repeatedly sampling with replacement from the original
data. It is useful for assessing the accuracy of sample estimates, especially when the
underlying distribution is unknown.
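A percentile-bootstrap confidence interval for the mean can be sketched with the standard library (made-up data; the resample count and seed are arbitrary choices):

```python
import random
from statistics import mean

def bootstrap_ci(data, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean: resample with replacement,
    then take the alpha/2 and 1 - alpha/2 quantiles of the resampled means."""
    boots = sorted(
        mean(random.choices(data, k=len(data))) for _ in range(n_boot)
    )
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

random.seed(0)
data = [4.1, 5.0, 3.8, 4.4, 5.2, 4.7, 4.0, 4.9]
lo, hi = bootstrap_ci(data)
print(round(lo, 2), round(hi, 2))  # the interval brackets the sample mean
```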

24. What is a likelihood function in the context of maximum likelihood estimation?

 Answer: The likelihood function is a function of the parameters of a statistical model,
given the observed data. Maximum likelihood estimation (MLE) involves finding the
parameter values that maximize the likelihood function, making the observed data
most probable.

25. What is the difference between linear regression and logistic regression?

 Answer: Linear regression is used for predicting a continuous dependent variable
based on one or more independent variables. Logistic regression, on the other hand, is
used for predicting a binary or categorical dependent variable and estimates the
probability of a particular outcome.

26. What is the purpose of regularization in machine learning models?

 Answer: Regularization is used to prevent overfitting by adding a penalty to the
model for complexity (i.e., for using too many predictors or overly large coefficients).
Common regularization techniques include Lasso (L1) and Ridge (L2) regression.

27. What is the difference between a permutation test and a t-test?

 Answer: A permutation test is a non-parametric method that assesses the significance
of an observed effect by comparing it to a distribution of effects generated by
randomly shuffling the data. A t-test, on the other hand, is a parametric test that
assumes the data follows a normal distribution and compares means between groups.
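A minimal permutation test for a difference in means (hypothetical measurements; the permutation count and seed are arbitrary choices):

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means:
    the p-value is the fraction of shuffles whose mean difference is
    at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            count += 1
    return count / n_perm

a = [12.1, 11.8, 12.4, 12.0, 11.9]
b = [12.2, 12.0, 11.7, 12.3, 12.1]
print(permutation_test(a, b))  # large p-value: no evidence of a difference
```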

28. What is the purpose of dummy variables in regression analysis?

 Answer: Dummy variables are used in regression analysis to represent categorical
data with two or more categories. They allow categorical variables to be included in
regression models by converting them into a series of binary (0 or 1) indicators.
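Dummy (one-hot) encoding can be sketched in a few lines:

```python
def one_hot(values):
    """Encode a categorical column as dummy (0/1) indicator columns."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

print(one_hot(["red", "blue", "red", "green"]))
```

In practice, one category is usually dropped as the reference level to avoid perfect multicollinearity with the intercept.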

29. What is the difference between bias and variance in a predictive model?

 Answer: Bias refers to the error introduced by approximating a real-world problem
with a simplified model. Variance refers to the error introduced by the model’s
sensitivity to small fluctuations in the training data, which can lead to overfitting.
