Basics of Statistics

A p-value measures the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true.

The lower the p-value, the stronger the evidence against the null hypothesis. A p-value of 0.05 or
lower is conventionally considered statistically significant.
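As a minimal illustration, a one-sided p-value for a hypothetical coin-flip experiment can be computed exactly from the binomial distribution:

```python
from math import comb

def binomial_p_value(n, k, p0=0.5):
    """One-sided exact p-value: P(X >= k) under H0 that the success
    probability is p0, for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# Hypothetical example: 60 heads in 100 flips of a supposedly fair coin.
p = binomial_p_value(100, 60)
print(round(p, 4))  # ≈ 0.028, below the 0.05 threshold, so we would reject H0
```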

A/B testing, also known as split testing or bucket testing, is a method for comparing two versions of a website
or app to see which one performs better. Marketers use A/B testing to experiment with different elements of
their website or app, such as messaging, page layouts, color schemes, and calls to action. The goal is to
determine which version performs best for a given conversion goal, such as sign-ups or sales.
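A/B test results are often compared with a two-proportion z-test. A minimal sketch in plain Python, using made-up conversion counts:

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (A/B test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical campaign: variant B converts 120/1000 vs. variant A's 100/1000.
z, p = two_proportion_z(100, 1000, 120, 1000)
print(round(z, 3), round(p, 3))
```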

What is regression?

What are residuals?

How do you interpret the coefficients in a linear regression model?

What is a confidence interval?

What is multicollinearity?

Regularization is a powerful technique for treating multicollinearity in regression
models. It is also used to prevent overfitting by adding a penalty to the model for having large
coefficients, which helps in creating a more generalizable model. Common regularization techniques
include Lasso and Ridge regression.
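As a toy illustration of how the ridge penalty shrinks coefficients, here is the closed-form ridge estimate for a single-feature, no-intercept model (the data is hypothetical):

```python
def ridge_slope(xs, ys, lam):
    """Ridge estimate of the slope for a one-feature, no-intercept model:
    minimizes sum((y - b*x)^2) + lam * b^2, giving b = Sxy / (Sxx + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]         # roughly y = 2x
print(ridge_slope(xs, ys, 0.0))   # lam=0 gives the ordinary least-squares slope
print(ridge_slope(xs, ys, 10.0))  # the penalty shrinks the slope toward zero
```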

The bias-variance tradeoff in machine learning involves balancing two error sources. Bias is the error
from overly simplistic model assumptions, causing underfitting and missing data patterns. Variance is
the error from excessive sensitivity to training data fluctuations, causing overfitting and capturing
noise instead of true patterns.

Structural Equation Modeling (SEM) is a technique for analyzing complex relationships among
observed and latent variables; it combines elements of multiple regression and factor analysis.
SEM involves several steps: model specification, estimation, and evaluation. It is a flexible
approach, but it requires large sample sizes and a strong theoretical foundation.

In data analysis, missingness refers to the absence of data points. There are three main types of
missingness:

1. **Missing Completely at Random (MCAR)**: The missing data is entirely random, with no
relationship to the observed or unobserved data. The absence of data does not depend on any
values, making the data analysis unbiased.

2. **Missing at Random (MAR)**: The missingness is related to observed data but not related to the
missing data itself. For example, if older individuals are less likely to respond to a survey about
technology, the missingness can be explained by their age, but not the technology preferences
themselves.

3. **Missing Not at Random (MNAR)**: The missingness is related to the missing data itself. For
instance, individuals who have higher incomes may be less likely to report their income level. Here,
the reasons for missingness are tied to the values that are missing.
**Mean/Median Imputation**:

- **Advantages**: Simple and quick to compute; preserves data size; works well when data is
missing completely at random.

**Mode Imputation**:

- **Advantages**: Useful for categorical data; retains the most common value, which can help
maintain distribution properties.

**K-Nearest Neighbors (KNN)**:

- **Advantages**: Considers the similarity between data points, providing contextually relevant
imputations based on nearby observations.

**Interpolation**:

- **Advantages**: Effective for time series data, where missing values can be estimated based on
nearby values.
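Mean imputation, the simplest of the methods above, can be sketched in a few lines (using None to mark missing entries):

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

print(impute_mean([1.0, None, 3.0, None, 5.0]))  # [1.0, 3.0, 3.0, 3.0, 5.0]
```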

Seasonality in a time series is a repeating pattern that occurs at regular intervals, such as daily,
weekly, or quarterly. Here's how you can detect seasonality:

 Plot the data: Look for repeating patterns or cycles.

 Use autocorrelation: Measure the similarity between the time series and itself at different
lags. For example, in a monthly series, a peak at lag 12 indicates yearly seasonality.

 Check for differences: Differences in medians and quartiles can highlight seasonality.

 Use a seasonal subseries plot: This plot is useful if you already know the seasonality period.

 Use the Kruskal-Wallis test: A p-value below a chosen significance level indicates seasonality.

 Use the ratio-to-moving-average method: This method measures the degree of seasonal
variation.
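The autocorrelation check above can be sketched in plain Python (a toy series with an assumed period of 4):

```python
from statistics import mean

def autocorr(series, lag):
    """Sample autocorrelation of a series at a given lag."""
    m = mean(series)
    num = sum((series[t] - m) * (series[t - lag] - m)
              for t in range(lag, len(series)))
    den = sum((x - m) ** 2 for x in series)
    return num / den

# A toy series that repeats every 4 points: the lag-4 autocorrelation is high.
series = [1, 3, 5, 3] * 6
print(round(autocorr(series, 4), 2))  # 0.83
print(round(autocorr(series, 1), 2))  # 0.0
```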

Cross-validation is a statistical technique used to evaluate the performance of a machine
learning model. It involves splitting the dataset into multiple subsets, or "folds." The model is
trained on some of these folds and tested on the remaining fold(s). This process is repeated
several times, with different subsets used for training and testing in each iteration. The results
are then averaged to obtain a more reliable estimate of the model’s effectiveness. This helps in
assessing how well the model will perform on unseen data and reduces the risk of overfitting.
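A minimal sketch of how the fold indices for k-fold cross-validation might be generated (illustrative only; libraries such as scikit-learn provide production-ready versions):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    # Spread any remainder across the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

for train, test in k_fold_indices(6, 3):
    print(test)  # [0, 1], then [2, 3], then [4, 5]
```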

Outliers are commonly detected using two methods:

 Standard deviation/z-score

 Interquartile range (IQR)
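The IQR rule can be sketched as follows (the 1.5×IQR fences are the conventional choice):

```python
from statistics import quantiles

def iqr_outliers(data):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(data, n=4)   # quartiles of the sample
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

print(iqr_outliers([10, 12, 12, 13, 12, 11, 50]))  # [50]
```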

1. What is the law of large numbers?


 Answer: The law of large numbers states that as the sample size increases, the sample
mean will get closer to the expected value (population mean). This principle underlies
many statistical practices, including the use of sample means to estimate population
means.
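A quick simulation illustrates the law: sample means of fair die rolls approach the expected value of 3.5 as the sample size grows (seed fixed for reproducibility):

```python
import random
from statistics import mean

random.seed(42)
# A fair die has expected value 3.5; larger samples drift closer to it.
for n in (10, 1000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, round(mean(rolls), 3))
```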

2. Explain what a confidence level is in the context of confidence intervals.

 Answer: The confidence level represents the proportion of all possible samples that
can be expected to include the true population parameter. For example, a 95%
confidence level means that 95% of the confidence intervals calculated from repeated
sampling will contain the true population parameter.

3. What is a z-score, and how is it used?

 Answer: A z-score indicates how many standard deviations an element is from the
mean of the population. It is used to standardize scores on different scales, to compare
them directly, or to find the probability of a score occurring within a standard normal
distribution.
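Computed on a small hypothetical set of exam scores:

```python
from statistics import mean, stdev

def z_score(x, data):
    """How many sample standard deviations x lies from the sample mean."""
    return (x - mean(data)) / stdev(data)

scores = [70, 75, 80, 85, 90]
print(round(z_score(90, scores), 3))  # the top score sits above the mean
```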

4. What is a Bayesian approach to statistics?

 Answer: Bayesian statistics is a method of inference in which Bayes' theorem is used
to update the probability of a hypothesis as more evidence or information becomes
available. Unlike frequentist methods, Bayesian inference incorporates prior
knowledge or beliefs.

5. What is the difference between point estimation and interval estimation?

 Answer: Point estimation gives a single value estimate of a population parameter
(e.g., sample mean), while interval estimation provides a range of values (e.g.,
confidence interval) within which the parameter is expected to lie with a certain level
of confidence.

6. Explain what a residual is in the context of regression analysis.

 Answer: A residual is the difference between the observed value and the value
predicted by the regression model. It represents the error or deviation of the observed
data from the fitted regression line.
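Residuals for a fitted line can be computed directly (hypothetical data and coefficients):

```python
def residuals(xs, ys, slope, intercept):
    """Residual = observed y minus the value predicted by the fitted line."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# ≈ [0.0, 0.1, -0.1] for a line y = 2x fitted to near-linear data
print(residuals([1, 2, 3], [2.0, 4.1, 5.9], 2.0, 0.0))
```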

7. What is a scatter plot, and what information can you derive from it?

 Answer: A scatter plot is a graphical representation of the relationship between two
quantitative variables. It helps in identifying the nature of the relationship (linear,
nonlinear, etc.), detecting outliers, and assessing the strength of the correlation.

8. What is an outlier, and how can it impact your analysis?


 Answer: An outlier is an observation that lies an abnormal distance from other values
in the data. Outliers can skew the results of statistical analyses, like means or
regressions, leading to misleading conclusions.

9. What is the difference between an ANOVA and a t-test?

 Answer: Both ANOVA (Analysis of Variance) and t-tests are used to compare
means. A t-test compares the means of two groups, while ANOVA can compare the
means of three or more groups simultaneously.

10. Explain the concept of “degrees of freedom” in statistics.

 Answer: Degrees of freedom refer to the number of independent values or quantities
that can vary in a statistical calculation. It is often used in the context of t-tests,
chi-square tests, and ANOVA to account for the number of independent pieces of
information in the data.

11. What is the F-distribution, and when is it used?

 Answer: The F-distribution is a probability distribution that arises frequently in
ANOVA, regression analysis, and the comparison of variances. It is used to test
hypotheses about whether the variances of two populations are equal.

12. What is the difference between a one-tailed and a two-tailed test?

 Answer: A one-tailed test assesses the significance of the effect in only one direction
(either greater than or less than), while a two-tailed test assesses the significance in
both directions (greater than and less than).

13. What is the purpose of using a control group in an experiment?

 Answer: A control group serves as a baseline that does not receive the treatment or
intervention. It allows the researcher to compare outcomes and determine the effect of
the treatment, helping to rule out alternative explanations for the results.

14. Explain what a sampling distribution is.

 Answer: A sampling distribution is the probability distribution of a statistic (like a
sample mean) obtained from a large number of samples drawn from the same
population. It is used to understand the variability of the statistic and to make
inferences about the population.

15. What is a chi-square test, and when would you use it?

 Answer: The chi-square test is used to determine whether there is a significant
association between two categorical variables. It compares the observed frequencies
in each category with the frequencies expected under the null hypothesis.
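The test statistic itself is simple to compute; a sketch with hypothetical die-roll counts:

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical die: are the observed counts consistent with fairness?
observed = [8, 9, 10, 11, 12, 10]
expected = [10] * 6
print(chi_square_stat(observed, expected))  # 1.0
```

The resulting statistic would then be compared against a chi-square distribution with the appropriate degrees of freedom to obtain a p-value.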
16. What is the difference between a parametric and a non-parametric
method?

 Answer: Parametric methods assume that the data follows a specific distribution,
typically normal, and rely on estimating population parameters. Non-parametric
methods make fewer assumptions about the distribution of the data and are used when
parametric assumptions cannot be satisfied.

17. Explain what cross-validation is and why it is used.

 Answer: Cross-validation is a technique used to assess the predictive performance of
a model by dividing the data into training and testing subsets multiple times. It is used
to detect overfitting and ensure that the model generalizes well to unseen data.

18. What is the difference between R-squared and adjusted R-squared?

 Answer: R-squared measures the proportion of variance in the dependent variable
that is explained by the independent variables in a regression model. Adjusted
R-squared adjusts this value for the number of predictors in the model, penalizing for
adding variables that do not improve the model.
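The adjustment is a simple formula; a sketch showing how adding a predictor lowers adjusted R-squared when R-squared stays flat (hypothetical values):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R^2 with one extra (useless) predictor: adjusted R^2 drops.
print(round(adjusted_r2(0.80, 50, 3), 4))  # 3 predictors
print(round(adjusted_r2(0.80, 50, 4), 4))  # 4 predictors, lower value
```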

19. What is multivariate analysis, and how is it different from univariate analysis?

 Answer: Multivariate analysis involves the analysis of more than one variable at a
time to understand relationships and interactions between variables. Univariate
analysis, on the other hand, involves the analysis of a single variable at a time.

20. What is Simpson’s Paradox, and can you provide an example?

 Answer: Simpson’s Paradox occurs when a trend observed within multiple groups
reverses when the groups are combined. For example, a treatment might appear
effective in two separate groups, but when the groups are combined, the treatment
appears ineffective.
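A numeric sketch of the reversal, using hypothetical recovery counts patterned after the classic kidney-stone example:

```python
from fractions import Fraction as F

# Hypothetical (recovered, total) counts for a treatment vs. control,
# split into mild and severe cases.
treat = {"mild": (81, 87), "severe": (192, 263)}
control = {"mild": (234, 270), "severe": (55, 80)}

# The treatment has the higher recovery rate within each group...
for g in ("mild", "severe"):
    assert F(*treat[g]) > F(*control[g])

# ...but pooling the groups reverses the comparison.
pool = lambda d: F(sum(r for r, _ in d.values()), sum(n for _, n in d.values()))
print(pool(treat) < pool(control))  # True: the paradox
```

The reversal happens because the treatment was given disproportionately to the severe cases, which recover less often regardless of treatment.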

21. What is the difference between heteroscedasticity and homoscedasticity?

 Answer: Homoscedasticity refers to the condition where the variance of the residuals
is constant across all levels of the independent variable(s). Heteroscedasticity occurs
when the variance of the residuals varies across levels of the independent variable(s),
potentially violating assumptions of regression models.

22. Explain what a ROC curve is and what it is used for.

 Answer: A ROC (Receiver Operating Characteristic) curve is a graphical
representation of a classifier's performance, plotting the true positive rate against the
false positive rate at various threshold settings. It is used to assess the trade-off
between sensitivity and specificity.

23. What is bootstrapping in statistics?

 Answer: Bootstrapping is a resampling technique used to estimate the sampling
distribution of a statistic by repeatedly sampling with replacement from the original
data. It is useful for assessing the accuracy of sample estimates, especially when the
underlying distribution is unknown.
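A percentile-bootstrap confidence interval for the mean can be sketched with the standard library (made-up data; the resample count and seed are arbitrary choices):

```python
import random
from statistics import mean

def bootstrap_ci(data, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean: resample with replacement,
    then take the alpha/2 and 1 - alpha/2 quantiles of the resampled means."""
    boots = sorted(
        mean(random.choices(data, k=len(data))) for _ in range(n_boot)
    )
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

random.seed(0)
data = [4.1, 5.0, 3.8, 4.4, 5.2, 4.7, 4.0, 4.9]
lo, hi = bootstrap_ci(data)
print(round(lo, 2), round(hi, 2))  # the interval brackets the sample mean
```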

24. What is a likelihood function in the context of maximum likelihood estimation?

 Answer: The likelihood function is a function of the parameters of a statistical model,
given the observed data. Maximum likelihood estimation (MLE) involves finding the
parameter values that maximize the likelihood function, making the observed data
most probable.

25. What is the difference between linear regression and logistic regression?

 Answer: Linear regression is used for predicting a continuous dependent variable
based on one or more independent variables. Logistic regression, on the other hand, is
used for predicting a binary or categorical dependent variable and estimates the
probability of a particular outcome.

26. What is the purpose of regularization in machine learning models?

 Answer: Regularization is used to prevent overfitting by adding a penalty to the
model for complexity (i.e., for using too many predictors or overly large coefficients).
Common regularization techniques include Lasso (L1) and Ridge (L2) regression.

27. What is the difference between a permutation test and a t-test?

 Answer: A permutation test is a non-parametric method that assesses the significance
of an observed effect by comparing it to a distribution of effects generated by
randomly shuffling the data. A t-test, on the other hand, is a parametric test that
assumes the data follows a normal distribution and compares means between groups.
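A minimal permutation test for a difference in means (hypothetical measurements; the permutation count and seed are arbitrary choices):

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means:
    the p-value is the fraction of shuffles whose mean difference is
    at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            count += 1
    return count / n_perm

a = [12.1, 11.8, 12.4, 12.0, 11.9]
b = [12.2, 12.0, 11.7, 12.3, 12.1]
print(permutation_test(a, b))  # large p-value: no evidence of a difference
```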

28. What is the purpose of dummy variables in regression analysis?

 Answer: Dummy variables are used in regression analysis to represent categorical
data with two or more categories. They allow categorical variables to be included in
regression models by converting them into a series of binary (0 or 1) indicators.
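Dummy (one-hot) encoding can be sketched in a few lines:

```python
def one_hot(values):
    """Encode a categorical column as dummy (0/1) indicator columns."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

print(one_hot(["red", "blue", "red", "green"]))
```

In practice, one category is usually dropped as the reference level to avoid perfect multicollinearity with the intercept.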

29. What is the difference between bias and variance in a predictive model?

 Answer: Bias refers to the error introduced by approximating a real-world problem
with a simplified model. Variance refers to the error introduced by the model’s
sensitivity to small fluctuations in the training data, which can lead to overfitting.
