Basics of Stats
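The p-value is the probability of observing a difference at least as extreme as the one measured,
assuming the null hypothesis is true. The lower the p-value, the stronger the evidence against the
null hypothesis. A p-value of 0.05 or lower is conventionally considered statistically significant.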
A/B testing, also known as split testing or bucket testing, is a method for comparing two versions of a website
or app to see which one performs better. Marketers use A/B testing to experiment with different elements of
their website or app, such as messaging, page layouts, color schemes, and calls to action. The goal is to
determine which version works best for a given conversion goal, such as sales or sign-ups.
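One common way to decide whether the difference between two versions is statistically significant is a two-proportion z-test. The sketch below is a minimal illustration, not a full experimentation pipeline: the function name and the visitor and conversion counts are hypothetical, and it assumes samples large enough for the normal approximation to hold.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: version A converts 200 of 5000 visitors, version B 250 of 5000.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("significant at 0.05" if p_value <= 0.05 else "not significant at 0.05")
```

A p-value at or below the chosen threshold suggests the observed lift is unlikely under the assumption that both versions perform the same.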
What is regression?
What is multicollinearity?
The bias-variance tradeoff in machine learning involves balancing two error sources. Bias is the error
from overly simplistic model assumptions, causing underfitting and missing data patterns. Variance is
the error from excessive sensitivity to training data fluctuations, causing overfitting and capturing
noise instead of true patterns.
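The tradeoff can be seen by fitting models of increasing flexibility to the same noisy data. The following sketch assumes NumPy is available and uses a made-up data-generating process (a sine curve plus Gaussian noise); the polynomial degrees are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_fn(x):
    # Assumed underlying pattern, chosen only for illustration.
    return np.sin(2 * np.pi * x)


# Small noisy training set and a larger test set drawn from the same process.
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = true_fn(x_train) + rng.normal(0, 0.3, x_train.size)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = true_fn(x_test) + rng.normal(0, 0.3, x_test.size)

results = {}
for degree in (1, 3, 12):
    model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - y_test) ** 2))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The straight line (degree 1) is too rigid and has high error on both sets, which is the high-bias case. Training error keeps falling as the degree grows, but a very flexible fit tends to chase noise, so its test error is typically worse than its training error, which is the high-variance case.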
Structural Equation Modeling (SEM) is a technique for analyzing complex relationships among
observed and latent variables. It combines elements of multiple regression and factor analysis.
SEM proceeds in several steps: model specification, estimation, and evaluation. It is flexible,
but it requires large sample sizes and a strong theoretical foundation.
In data analysis, missingness refers to the absence of data points. There are three main types of
missingness:
1. **Missing Completely at Random (MCAR)**: The missing data is entirely random, with no
relationship to the observed or unobserved data. The absence of data does not depend on any
values, so analyses of the remaining data stay unbiased, although they lose some precision.
2. **Missing at Random (MAR)**: The missingness is related to observed data but not related to the
missing data itself. For example, if older individuals are less likely to respond to a survey about
technology, the missingness can be explained by their age, but not the technology preferences
themselves.
3. **Missing Not at Random (MNAR)**: The missingness is related to the missing data itself. For
instance, individuals who have higher incomes may be less likely to report their income level. Here,
the reasons for missingness are tied to the values that are missing.
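A small simulation can make the three mechanisms concrete. This is a sketch under invented assumptions: the income model, the missingness probabilities, and the helper name `observed_mean` are all hypothetical.

```python
import random
from statistics import mean

random.seed(42)
n = 20_000
ages = [random.uniform(20, 70) for _ in range(n)]
# Hypothetical model: income rises with age, plus noise.
incomes = [20_000 + 800 * age + random.gauss(0, 10_000) for age in ages]


def observed_mean(missing_prob):
    """Mean income among rows that remain after applying a per-row missingness probability."""
    kept = [inc for age, inc in zip(ages, incomes)
            if random.random() >= missing_prob(age, inc)]
    return mean(kept)


true_mean = mean(incomes)
# MCAR: every row has the same chance of being missing.
mcar_mean = observed_mean(lambda age, inc: 0.3)
# MAR: missingness depends only on age, which is always observed.
mar_mean = observed_mean(lambda age, inc: 0.6 if age > 50 else 0.1)
# MNAR: missingness depends on the income value that goes missing.
mnar_mean = observed_mean(lambda age, inc: 0.6 if inc > 60_000 else 0.1)

for label, value in [("true", true_mean), ("MCAR", mcar_mean),
                     ("MAR", mar_mean), ("MNAR", mnar_mean)]:
    print(f"{label:>4}: mean income = {value:,.0f}")
```

Under MCAR the observed mean stays close to the true mean. Under MAR the naive mean is shifted, but the shift can in principle be corrected by conditioning on age. Under MNAR the shift cannot be corrected from the observed data alone.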
**Mean/Median Imputation**:
- **Advantages**: Simple and quick to compute; preserves data size; works well when data is
missing completely at random.
**Mode Imputation**:
- **Advantages**: Useful for categorical data; retains the most common value, which can help
maintain distribution properties.
**K-Nearest Neighbors (KNN) Imputation**:
- **Advantages**: Considers the similarity between data points, providing contextually relevant
imputations based on nearby observations.
**Interpolation**:
- **Advantages**: Effective for time series data, where missing values can be estimated based on
nearby values.
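The simpler methods above (mean/median, mode, and interpolation) can be sketched with the standard library alone. The helper names and the sample values are hypothetical; in practice a library such as pandas or scikit-learn would usually handle this.

```python
from statistics import mean, median, mode


def impute_constant(values, fill):
    """Replace every missing entry (None) with a single fill value."""
    return [fill if v is None else v for v in values]


def interpolate_linear(values):
    """Fill interior gaps by linear interpolation between the nearest observed neighbors."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical time-ordered readings with gaps.
readings = [10.0, None, 14.0, 15.0, None, None, 21.0]
observed = [v for v in readings if v is not None]

mean_filled = impute_constant(readings, mean(observed))
median_filled = impute_constant(readings, median(observed))
interpolated = interpolate_linear(readings)

# Hypothetical categorical column.
colors = ["red", "blue", None, "red", "green", None]
mode_filled = impute_constant(colors, mode([c for c in colors if c is not None]))

print(mean_filled)
print(median_filled)
print(interpolated)
print(mode_filled)
```

Note that the constant fills ignore ordering, while interpolation uses the position of each gap, which is why it suits time series.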
Seasonality in a time series is a repeating pattern that occurs at regular intervals, such as daily,
weekly, or quarterly. Here's how you can detect seasonality:
- Plot the data: Look for repeating patterns or cycles.
- Use autocorrelation: Measure the similarity between the time series and itself at different
lags. For example, a monthly time series with an autocorrelation peak at lag 12 indicates yearly
seasonality.
- Compare periods: Differences in medians and quartiles across periods (for example, box plots
grouped by month) can highlight seasonality.
- Use a seasonal subseries plot: This plot is useful if you already know the seasonality period.
- Use the Kruskal-Wallis test: Group the observations by period and test whether the groups
differ. A p-value below the chosen significance level suggests seasonality.
- Use the ratio-to-moving-average method: This method measures the degree of seasonal
variation.
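The autocorrelation check can be sketched as follows. The series is synthetic (an assumed yearly cycle plus noise over ten years of monthly data), and the `autocorrelation` helper is a plain sample estimate written for illustration rather than a library function.

```python
import math
import random
from statistics import mean


def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    num = sum((series[t] - mu) * (series[t - lag] - mu)
              for t in range(lag, len(series)))
    return num / denom


random.seed(0)
# Hypothetical monthly series: 120 months with a 12-month cycle plus Gaussian noise.
series = [10 + 5 * math.sin(2 * math.pi * t / 12) + random.gauss(0, 1)
          for t in range(120)]

acf = {lag: autocorrelation(series, lag) for lag in range(1, 25)}
# Lag 1 is skipped: smooth series are strongly correlated with their immediate neighbor.
best_lag = max(range(2, 25), key=acf.get)

print(f"strongest autocorrelation at lag {best_lag}: {acf[best_lag]:.2f}")
print(f"autocorrelation at lag 6 (half a cycle): {acf[6]:.2f}")
```

A strong positive value at one lag, paired with a strong negative value at half that lag, is the typical signature of a regular cycle.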
Standard deviation/z-score
Answer: The confidence level represents the proportion of all possible samples that
can be expected to include the true population parameter. For example, a 95%
confidence level means that 95% of the confidence intervals calculated from repeated
sampling will contain the true population parameter.
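The repeated-sampling interpretation can be checked by simulation. This sketch assumes a normal population with made-up parameters and uses the normal critical value for simplicity; a t critical value would give slightly wider intervals.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(1)
# Hypothetical population parameters and simulation settings.
TRUE_MEAN, TRUE_SD, N, TRIALS = 50.0, 10.0, 100, 2000
z_crit = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    m = mean(sample)
    se = stdev(sample) / sqrt(N)
    # Count the interval if it contains the true population mean.
    if m - z_crit * se <= TRUE_MEAN <= m + z_crit * se:
        covered += 1

coverage = covered / TRIALS
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The share should land close to 0.95, which is what the confidence level promises about the procedure, not about any single interval.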
Answer: A z-score indicates how many standard deviations an element is from the
mean of the population. It is used to standardize scores on different scales, to compare
them directly, or to find the probability of a score occurring within a standard normal
distribution.
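As a formula, z = (x - mean) / standard deviation. The sketch below compares two scores from different scales; the exam names, means, and standard deviations are hypothetical.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


# Hypothetical exams on different scales.
z_exam_a = z_score(82, mu=70, sigma=8)      # scored out of 100
z_exam_b = z_score(620, mu=500, sigma=100)  # scored out of 800

# Probability of a value at or below z under a standard normal distribution.
p_below_a = NormalDist().cdf(z_exam_a)

print(f"exam A z-score: {z_exam_a:.2f}")
print(f"exam B z-score: {z_exam_b:.2f}")
print(f"share of scores at or below exam A result: {p_below_a:.3f}")
```

Because both results are now in standard-deviation units, they can be compared directly: here the exam A result is further above its mean than the exam B result.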
Answer: A residual is the difference between the observed value and the value
predicted by the regression model. It represents the error or deviation of the observed
data from the fitted regression line.
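A minimal sketch of computing residuals from a simple least-squares line, using made-up data (hours studied versus exam score):

```python
from statistics import mean

# Hypothetical data: hours studied (x) and exam score (y).
x = [1, 2, 3, 4, 5, 6]
y = [52, 55, 61, 64, 70, 71]

# Ordinary least squares for one predictor: slope = cov(x, y) / var(x).
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

predicted = [intercept + slope * xi for xi in x]
# Residual = observed value minus predicted value.
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for xi, yi, pi, ri in zip(x, y, predicted, residuals):
    print(f"x={xi}: observed={yi}, predicted={pi:.2f}, residual={ri:+.2f}")
```

For a least-squares fit with an intercept, the residuals sum to zero (up to floating-point error), so it is their pattern and spread, not their total, that carries diagnostic information.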
7. What is a scatter plot, and what information can you derive from it?
Answer: Both ANOVA (Analysis of Variance) and t-tests are used to compare
means. A t-test compares the means of two groups, while ANOVA can compare the
means of three or more groups simultaneously.
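The two tests are closely related: with exactly two groups, the one-way ANOVA F statistic equals the square of the equal-variance t statistic. The sketch below computes only the test statistics (not p-values) from hypothetical group data, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's t statistic for two independent groups, assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))


def anova_f_statistic(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical measurements for three groups.
group_a = [23, 25, 28, 30, 32]
group_b = [27, 29, 31, 35, 36]
group_c = [30, 33, 35, 38, 40]

t_ab = pooled_t_statistic(group_a, group_b)
f_ab = anova_f_statistic(group_a, group_b)
f_abc = anova_f_statistic(group_a, group_b, group_c)

print(f"t (A vs B) = {t_ab:.3f}, t squared = {t_ab ** 2:.3f}, F (A, B) = {f_ab:.3f}")
print(f"F (A, B, C) = {f_abc:.3f}")
```

With three or more groups, a single F test avoids running many pairwise t-tests, each of which would add to the overall chance of a false positive.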
Answer: A one-tailed test assesses the significance of the effect in only one direction
(either greater than or less than), while a two-tailed test assesses the significance in
both directions (greater than and less than).
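For a symmetric test statistic, the two-tailed p-value is twice the one-tailed value, so the same data can be significant under one test and not the other. A short sketch with a hypothetical z statistic:

```python
from statistics import NormalDist

z = 1.8  # hypothetical test statistic
std_normal = NormalDist()

# One-tailed: alternative hypothesis is "effect is greater than zero".
p_one_tailed = 1 - std_normal.cdf(z)
# Two-tailed: alternative hypothesis is "effect differs from zero in either direction".
p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))

print(f"one-tailed p-value: {p_one_tailed:.4f}")
print(f"two-tailed p-value: {p_two_tailed:.4f}")
```

The direction of a one-tailed test should be chosen before looking at the data; choosing it afterwards inflates the false-positive rate.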
Answer: A control group serves as a baseline that does not receive the treatment or
intervention. It allows the researcher to compare outcomes and determine the effect of
the treatment, helping to rule out alternative explanations for the results.
15. What is a chi-square test, and when would you use it?
Answer: Parametric methods assume that the data follows a specific distribution,
typically normal, and rely on estimating population parameters. Non-parametric
methods make fewer assumptions about the distribution of the data and are used when
parametric assumptions cannot be satisfied.
Answer: Multivariate analysis involves the analysis of more than one variable at a
time to understand relationships and interactions between variables. Univariate
analysis, on the other hand, involves the analysis of a single variable at a time.
Answer: Simpson’s Paradox occurs when a trend observed within multiple groups
reverses when the groups are combined. For example, a treatment might appear
effective in two separate groups, but when the groups are combined, the treatment
appears ineffective.
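A numeric illustration helps here. The counts below follow the widely cited kidney-stone treatment example and should be read as illustrative; the variable names are arbitrary.

```python
# (successes, patients) for each treatment within each subgroup.
data = {
    "A": {"small stones": (81, 87), "large stones": (192, 263)},
    "B": {"small stones": (234, 270), "large stones": (55, 80)},
}

groups = ("small stones", "large stones")

# Success rate within each subgroup.
within = {g: {t: data[t][g][0] / data[t][g][1] for t in data} for g in groups}

# Success rate after pooling the subgroups.
overall = {
    t: sum(s for s, _ in data[t].values()) / sum(n for _, n in data[t].values())
    for t in data
}

for g in groups:
    print(f"{g}: A = {within[g]['A']:.1%}, B = {within[g]['B']:.1%}")
print(f"combined: A = {overall['A']:.1%}, B = {overall['B']:.1%}")
```

Treatment A has the higher success rate in each subgroup, yet B looks better once the subgroups are pooled. The reversal arises because A was used mostly on the harder (large stone) cases, so group sizes are unbalanced across treatments.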
Answer: Homoscedasticity refers to the condition where the variance of the residuals
is constant across all levels of the independent variable(s). Heteroscedasticity occurs
when the variance of the residuals varies across levels of the independent variable(s),
potentially violating assumptions of regression models.
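One crude way to spot heteroscedasticity is to compare the residual variance at low and high values of the predictor. The sketch below simulates data whose noise grows with x (an assumption made purely for illustration) and fits a least-squares line by hand.

```python
import random
from statistics import mean, variance

random.seed(7)
n = 1000
x = [random.uniform(1, 10) for _ in range(n)]
# Assumed data-generating process: noise standard deviation grows with x.
y = [2 * xi + random.gauss(0, 0.5 * xi) for xi in x]

# Ordinary least squares fit for a single predictor.
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Split residuals at the median of x and compare their variances.
cutoff = sorted(x)[n // 2]
low = [r for xi, r in zip(x, residuals) if xi < cutoff]
high = [r for xi, r in zip(x, residuals) if xi >= cutoff]
ratio = variance(high) / variance(low)

print(f"residual variance, low x:  {variance(low):.2f}")
print(f"residual variance, high x: {variance(high):.2f}")
print(f"ratio (high / low): {ratio:.2f}")
```

A ratio near 1 is consistent with homoscedasticity; a ratio far from 1, as here, points to heteroscedasticity. A residuals-versus-fitted plot or a formal test would be the usual next step.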
25. What is the difference between linear regression and logistic regression?
29. What is the difference between bias and variance in a predictive model?