
304 - BA-SC-BA-03 : ADVANCED STATISTICAL METHODS USING R

– By Pratik Patil
(2019 CBCS Pattern) (Semester - III)

Time : 2½ hours] [Max. Marks : 50
Instructions to the candidates:
1) All questions are compulsory.
2) Make appropriate assumptions wherever required.

Q1) Answer the following questions (Any Five)


a) Enlist basic statistical functions in R.
b) What is the difference between parametric and non-parametric tests?
c) Define predictive analytics.
d) Explain the pbinom() function in R.
e) How do you interpret the p-value in hypothesis testing?
f) Write a function to get a list of all the packages installed in R.
g) Write a function to obtain the transpose of a matrix in R.
h) What is the purpose of regression analysis in R?

a) Basic Statistical Functions in R:


• Mean: mean(x)
• Median: median(x)
• Mode: no built-in statistical mode function (mode(x) returns the storage type); typically computed from table(x)
• Standard Deviation: sd(x)
• Variance: var(x)
• Range: range(x)
• Minimum: min(x)
• Maximum: max(x)
• Quantiles: quantile(x, probs = c(0.25, 0.5, 0.75))
• Correlation: cor(x, y)
• Summary Statistics: summary(x)
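A quick demonstration of these functions on hypothetical vectors (all base R):
Code snippet
x <- c(2, 4, 4, 6, 8, 10)   # hypothetical data
y <- c(1, 3, 5, 7, 9, 11)
mean(x); median(x); sd(x); var(x)
range(x); quantile(x, probs = c(0.25, 0.5, 0.75))
cor(x, y); summary(x)
as.numeric(names(which.max(table(x))))   # statistical mode via a frequency table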
b) Parametric vs. Non-Parametric Tests:
Parametric Tests:
• Assume data follows a specific distribution (usually normal).
• More powerful when assumptions are met.
• Examples: t-test, ANOVA, linear regression.
Non-Parametric Tests:
• Make fewer assumptions about data distribution.
• More robust when assumptions are violated or for ordinal data.
• Examples: Wilcoxon rank-sum test, Kruskal-Wallis test, Spearman
correlation.
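As a brief sketch with hypothetical measurements, the parametric test and its non-parametric counterpart for the same comparison:
Code snippet
g1 <- c(5.1, 4.9, 6.2, 5.8, 5.5)   # hypothetical group 1
g2 <- c(6.8, 7.1, 6.5, 7.4, 6.9)   # hypothetical group 2
t.test(g1, g2)                     # parametric: assumes approximate normality
wilcox.test(g1, g2)                # non-parametric rank-sum alternative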
c) Predictive Analytics:
• Using statistical techniques and machine learning algorithms to make
predictions about future or unknown events.
• Involves building models based on historical data to uncover patterns and
relationships.
• Applications: forecasting sales, predicting customer behavior, detecting
fraud, risk assessment.
d) pbinom() Function in R:
• Calculates the cumulative probability of a binomial distribution.
• Syntax: pbinom(q, size, prob, lower.tail = TRUE)
o q: number of successes
o size: number of trials
o prob: probability of success
o lower.tail: TRUE for P(X ≤ q), FALSE for P(X > q)
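For instance, with 10 fair coin flips (a hypothetical example):
Code snippet
pbinom(3, size = 10, prob = 0.5)                      # P(X <= 3) ≈ 0.172
pbinom(3, size = 10, prob = 0.5, lower.tail = FALSE)  # P(X > 3) ≈ 0.828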
e) Interpreting p-Value:
• The probability of observing a test statistic as extreme or more extreme than
the one calculated, assuming the null hypothesis is true.
• Common threshold: 0.05
• If p-value ≤ 0.05, reject the null hypothesis (suggesting significant evidence
against it).
• If p-value > 0.05, fail to reject the null hypothesis (not enough evidence to
disprove it).
f) Getting List of Installed Packages:
Code snippet
installed.packages()               # matrix with one row per installed package
rownames(installed.packages())    # just the package names
g) Transpose of a Matrix:
Code snippet
m <- matrix(1:6, nrow = 2)   # 2 x 3 example matrix
t(m)                         # transpose: 3 x 2
h) Purpose of Regression Analysis in R:
• Modeling the relationship between a dependent variable (outcome) and one
or more independent variables (predictors).
• Used for:
o Predicting future values of the dependent variable.
o Understanding how independent variables influence the dependent
variable.
o Assessing the strength of relationships between variables.

Q2) Answer the following questions (Any Two)


a) Explain T-test of hypothesis testing in R. Write syntax and explain in
detail.
Here's an explanation of T-tests for hypothesis testing in R, including syntax and
details:
T-Test:
• A statistical test used to determine whether the means of two groups
significantly differ.
• Assumes the data is normally distributed and the variances of the two
groups are equal (or approximately equal).
Types of T-Tests in R:
1. One-Sample T-Test:
o Compares the mean of a sample to a known population mean.
o Syntax: t.test(x, mu = hypothesized_mean, alternative =
c("two.sided", "less", "greater"))
2. Two-Sample T-Test:
o Compares the means of two independent samples.
o Syntax: t.test(x, y, paired = FALSE, var.equal = TRUE)
▪ paired = TRUE for paired samples (e.g., pre-test and post-test scores).
▪ var.equal = FALSE for unequal variances (Welch's t-test).
Interpreting Results:
• P-value: The probability of observing a difference as extreme or more
extreme, assuming the null hypothesis is true (no difference).
o Common threshold: p-value ≤ 0.05 suggests rejecting the null
hypothesis (significant difference).
• Confidence interval: A range of values that likely contains the true mean
difference.
Example:
Code snippet
# One-sample t-test:
test_scores <- c(85, 92, 78, 89, 95)
t.test(test_scores, mu = 80)

# Two-sample t-test (independent samples):
group1 <- c(10, 12, 14, 16)
group2 <- c(15, 18, 20, 22)
t.test(group1, group2)
Key Considerations:
• Check assumptions of normality and equal variances using visualizations
and tests (e.g., Q-Q plots, Shapiro-Wilk test, F-test).
• Use non-parametric alternatives (e.g., Wilcoxon rank-sum test) if
assumptions are not met.
• Consider sample size and effect size (practical significance) alongside
statistical significance.

b) Define probability. Explain any two functions of probability distribution.


Probability Defined:
Probability measures the likelihood or chance of an event occurring. It is a
numerical value between 0 and 1, where:
• 0: The event is impossible.
• 1: The event is certain.
• Values between 0 and 1: Represent the degree of uncertainty, with higher
values indicating greater likelihood.

Probability plays a crucial role in various fields, including statistics, mathematics, decision-making, and everyday life. It helps us quantify uncertainty, predict outcomes, and make informed choices under uncertain conditions.

Functions of Probability Distribution:

Probability distributions provide a mathematical framework for describing the probabilities of different possible outcomes of a random event. Here are two important functions associated with them:

1. Cumulative Distribution Function (CDF):


• The CDF of a probability distribution (F(x)) gives the probability that a
random variable will be less than or equal to a specific value (x).
• It is a non-decreasing function, always starting at 0 and approaching 1 as x
approaches infinity.
• Useful for calculating the probability of an event falling within a certain
range.
2. Probability Density Function (PDF):
• The PDF of a probability distribution (f(x)) describes the probability density
of a random variable taking on a specific value (x).
• It represents the "instantaneous rate" of change of the CDF at x.
• Useful for visualizing the "spread" of the distribution and understanding the
relative likelihood of different outcomes.

By understanding these functions, we can gain valuable insights into the behavior
of random phenomena and make informed decisions based on predicted
probabilities.
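As a minimal illustration of these two functions, consider the standard normal distribution in R (pnorm is the CDF, dnorm the PDF):
Code snippet
pnorm(1.96)                          # CDF: P(X <= 1.96) ≈ 0.975
dnorm(0)                             # PDF: density at x = 0 ≈ 0.399
curve(dnorm(x), from = -4, to = 4)   # visualize the bell-shaped PDF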
c) What is linear regression? What do you mean by dependent and
independent variables? What is difference between linear & multiple
regression?
Linear Regression Explained:
Linear regression is a statistical technique used to model the relationship between
a dependent variable (the outcome you want to predict) and one or more
independent variables (the factors you believe influence the outcome). This
relationship is modeled by a straight line, hence the "linear" part of the name.
Key Components:
• Dependent Variable: The variable you're trying to predict or explain. It's
often denoted as "y".
• Independent Variables: The variables you believe influence the dependent
variable. They're often denoted as "x1", "x2", etc.
• Linear Model: The equation representing the relationship between the
dependent and independent variables. It typically takes the form y = β0 +
β1x1 + β2x2 + ε, where:
o β0 is the intercept (the y-value when all independent variables are
zero).
o β1, β2, etc. are the regression coefficients, indicating the contribution
of each independent variable to the dependent variable.
o ε is the error term, accounting for unexplained variance.
Types of Linear Regression:
• Simple Linear Regression: Only one independent variable is used to predict
the dependent variable.
• Multiple Linear Regression: Two or more independent variables are used to
predict the dependent variable.
Difference between Linear and Multiple Regression:
• Complexity: Simple regression is less complex and easier to interpret, while
multiple regression allows for more nuanced analysis by considering the
influence of multiple factors.
• Interpretation: Simple regression coefficients directly quantify the effect of
the single independent variable on the dependent variable. In multiple
regression, interpretations require considering the interplay and potential
interactions between multiple variables.
• Application: Simple regression is suitable for basic analysis with one
dominant factor, while multiple regression is beneficial for exploring
complex relationships involving multiple contributing factors.
Overall, linear regression is a powerful statistical tool for understanding and
predicting linear relationships between variables. Choosing between simple and
multiple regression depends on the specific context and research question.
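A minimal sketch of both forms in R, using the built-in mtcars dataset (variable choices are illustrative):
Code snippet
simple_model <- lm(mpg ~ wt, data = mtcars)               # simple linear regression
multiple_model <- lm(mpg ~ wt + hp + disp, data = mtcars) # multiple linear regression
summary(multiple_model)   # coefficients, R-squared, p-values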

Q3) Answer the following question (Any one).


a) Examine ANOVA in R. State the assumptions and explain one-way ANOVA in detail. Also state the benefits of ANOVA.
Examining ANOVA in R:
Analysis of Variance (ANOVA) is a statistical technique used to compare the
means of three or more groups. It analyzes whether the differences between group
means are statistically significant or simply due to chance.
Assumptions of ANOVA:
1. Normality: Each group's data should be normally distributed.
2. Homoscedasticity: Variances of all groups should be equal.
3. Independence: Observations within each group should be independent of
each other.
4. Equal interval data: The data should be measured on an interval or ratio
scale.
One-Way ANOVA in R:
Here's how to perform a one-way ANOVA in R:
1. Load Data:
Code snippet
data <- read.csv("your_data.csv")                   # hypothetical file
data$group_variable <- factor(data$group_variable)  # grouping variable must be a factor
2. Perform ANOVA:
Code snippet
anova_model <- aov(outcome_variable ~ group_variable, data = data)
summary(anova_model)
3. Interpret Results:
• The summary will show the F-statistic and p-value.
• If the p-value is less than your significance level (e.g., 0.05), you reject the
null hypothesis that the group means are equal.
• You can further investigate pairwise comparisons between groups using
post-hoc tests like Tukey's HSD.
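As a sketch, pairwise post-hoc comparisons on the fitted model (TukeyHSD is in base R's stats package):
Code snippet
TukeyHSD(anova_model)   # pairwise differences in group means with adjusted p-values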
Benefits of ANOVA:
• Compare multiple groups: Allows for simultaneous comparison of several
group means, reducing the need for multiple t-tests.
• Control for other variables: Can include additional factors in the model to
control for their influence on the outcome variable.
• Robustness: Relatively robust to departures from normality compared to
other methods.
• Interpretability: Provides easily interpretable estimates of group means and
effect sizes.
Remember to check the assumptions before using ANOVA and consider
alternative methods if they are not met.

b) What do you mean by dimension reduction? Explain linear discriminant analysis (LDA) with syntax. Also explain the application of LDA in the marketing domain.
Here's an explanation of dimension reduction, LDA, and its application in
marketing:
Dimension Reduction:
• It's a technique that involves reducing the number of variables (dimensions)
in a dataset while retaining most of the important information.
• It's beneficial for:
o Handling high-dimensional data that can be computationally
expensive or difficult to visualize.
o Mitigating the curse of dimensionality, where performance of some
algorithms degrades with too many dimensions.
o Identifying the most important features for prediction or analysis.
Linear Discriminant Analysis (LDA):
• A supervised dimension reduction technique for classification.

• Aims to find a linear combination of features that best separates two or more
classes in the data.
• Projects data onto a lower-dimensional space where classes are maximally
separated, aiding classification.
R Syntax for LDA:
Code snippet
library(MASS)   # lda() is provided by the MASS package

lda_model <- lda(class ~ ., data = your_data)  # class is the categorical variable

# Predict classes for new data:
predictions <- predict(lda_model, newdata = new_data)
predictions$class   # predicted class labels
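A runnable illustration on the built-in iris dataset (three species, four numeric predictors):
Code snippet
iris_lda <- lda(Species ~ ., data = iris)
head(predict(iris_lda)$class)   # in-sample predicted species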
Application of LDA in Marketing:
• Customer Segmentation: Identifying distinct customer groups based on
demographics, purchase behavior, preferences, etc.
• Market Targeting: Determining which segments to focus marketing efforts
on for maximum impact.
• Campaign Response Prediction: Predicting which customers are most likely
to respond to specific marketing campaigns.
• Customer Churn Prediction: Identifying customers at risk of leaving and
taking proactive measures to retain them.
• Product Recommendation: Recommending products or services that align
with customer preferences and interests.
Key Considerations:
• LDA assumes normally distributed data within each class.
• It's sensitive to outliers, so consider pre-processing steps.
• It works best when classes are well-separated in the original space.
• For non-linear class boundaries, consider alternatives such as quadratic discriminant analysis (QDA) or kernel-based discriminant methods.

Q4) Answer the following question (Any One)

a) Describe descriptive analytics in R. Explain any three functions of descriptive analytics in R.
Descriptive analytics in R involves summarizing and describing the key
characteristics of a dataset. It helps us understand the basic features, patterns, and
distributions of the data.
Here are three commonly used functions for descriptive analytics in R:
1. summary():
o Provides a concise overview of a dataset or variable.
o For numeric variables, reports the minimum, first quartile, median, mean, third quartile, and maximum, plus a count of missing values if present.
o Example: summary(data) or summary(data$variable)
2. str():
o Displays the internal structure of a dataset or object.
o Reveals data types, variable names, and dimensions.
o Example: str(data)
3. table():
o Produces frequency tables for categorical variables.
o Shows the count of occurrences for each unique value.
o Example: table(data$category)
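A minimal, runnable sketch of these three functions on the built-in mtcars dataset:
Code snippet
summary(mtcars$mpg)   # min, quartiles, median, mean, max
str(mtcars)           # data types, variable names, dimensions
table(mtcars$cyl)     # frequency of each cylinder count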
Additional functions for descriptive analytics in R:
• mean(): Calculates the arithmetic mean (average).
• median(): Calculates the median (middle value).
• sd(): Calculates the standard deviation (measure of spread).
• var(): Calculates the variance.
• range(): Returns the minimum and maximum values.
• quantile(): Computes specified quantiles (e.g., quartiles, percentiles).
• hist(): Creates histograms to visualize distributions.
• boxplot(): Creates box plots to visualize distributions and outliers.
Benefits of Descriptive Analytics:
• Data Understanding: Gain insights into the nature of your data.
• Pattern Identification: Discover trends, patterns, and relationships.
• Outlier Detection: Identify unusual or extreme values that might warrant
further investigation.
• Data Cleaning: Detect errors or inconsistencies that need correction.
• Visualization Guidance: Inform the choice of appropriate visualizations.
Descriptive analytics is often the first step in data analysis, laying a foundation for
further exploration and modeling.

b) What is logistic regression in R? Assume suitable data and explain how you interpret regression coefficients in R.
Here's an explanation of logistic regression in R and how to interpret its
coefficients:
Logistic Regression:
• A statistical method used to model the probability of a binary outcome (e.g.,
yes/no, success/failure) based on a set of independent variables.
• Employs a logistic function (S-shaped curve) to transform predicted values
into probabilities between 0 and 1.
• Assumes a linear relationship between the predictors and the log-odds of the outcome, rather than with the outcome itself.
Interpreting Regression Coefficients in R:
1. Fit the Model:
Code snippet
model <- glm(y ~ x1 + x2, data = your_data, family = "binomial")  # add further predictors as needed
summary(model)
2. Examine Coefficients:
• The summary table displays coefficients for each independent variable.
• Positive coefficients indicate a positive association with the probability of
the outcome (increase in x increases probability).
• Negative coefficients indicate a negative association (increase in x decreases
probability).
3. Interpret Odds Ratios:
• Exponentiate coefficients to obtain odds ratios: exp(coefficient).
• Odds ratio represents the change in odds of the outcome occurring per unit
increase in the independent variable, holding other variables constant.
• Example: An odds ratio of 1.5 for x1 means a unit increase in x1 is
associated with 1.5 times higher odds of the outcome.
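As a sketch, odds ratios and Wald confidence intervals can be obtained from the fitted model above:
Code snippet
exp(coef(model))                                      # odds ratios
exp(cbind(OR = coef(model), confint.default(model)))  # with Wald confidence intervals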
4. Significance:
• P-values assess the statistical significance of each coefficient.
• Small p-values (e.g., less than 0.05) suggest a significant relationship
between the variable and the outcome.
Example:
Code snippet
model <- glm(purchased ~ age + income, data = customer_data, family = "binomial")
summary(model)
• Positive coefficient for age might suggest older customers are more likely to
purchase.
• Negative coefficient for income might suggest lower-income customers are
more likely to purchase.
Key Considerations:
• Check for multicollinearity (high correlation between independent
variables).
• Assess model fit using measures like AIC and residual diagnostics.
• Consider interactions between variables if relevant.
Logistic regression is a powerful tool for predicting binary outcomes and
understanding the factors that influence them.

Q5) Answer the following questions (Any One)

a) Revise the concept of Time series analysis. Explain how time series
analysis is used for business forecasting?
Time Series Analysis: Unveiling Patterns for Better Business Predictions
Revised Concept:
Time series analysis is a statistical modeling approach used to understand and forecast how a variable changes over time. It examines historical data points to reveal underlying components such as trend, seasonality, and noise, allowing future values to be predicted with greater confidence. It does not guarantee the future, but it provides the clearest data-driven glimpse available.
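As a minimal illustration in base R, using the built-in AirPassengers series (Holt-Winters smoothing is one of several possible models):
Code snippet
fit <- HoltWinters(AirPassengers)   # exponential smoothing with trend and seasonality
predict(fit, n.ahead = 12)          # forecast the next 12 months
plot(fit)                           # fitted values against the observed series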
Business Forecasting with Time Series Analysis:
Companies leverage time series analysis to anticipate trends in various aspects,
enhancing decision-making and gaining a competitive edge. Here's how it plays
out in different scenarios:
1. Sales Forecasting:
• Predicting future sales volume allows for efficient inventory management,
resource allocation, and targeted marketing campaigns.
• Time series analysis identifies seasonal patterns, promotional effects, and
economic influences on sales, leading to more accurate forecasts.
2. Demand Forecasting:
• Understanding future demand for products or services helps optimize
production planning, logistics, and personnel needs.
• This analysis accounts for external factors like weather patterns, competitor
strategies, and market fluctuations, ensuring resources are available when
needed.
3. Customer Churn Prediction:
• Identifying customers at risk of leaving helps implement retention strategies
before they say goodbye.
• Analyzing past customer behavior, engagement, and service interactions
reveals patterns that predict potential churn, allowing targeted interventions.
4. Financial Market Predictions:
• Time series analysis can be applied to historical stock prices, economic
indicators, and news sentiment to forecast future market trends.
• While not crystal balls, these models provide valuable insights for
investment decisions and portfolio management.
Benefits of Time Series Analysis for Business:
• Improved decision-making: Data-driven forecasts lead to more informed
strategic and operational decisions.
• Reduced uncertainty: Understanding future trends mitigates risks and allows
for proactive planning.
• Enhanced resource allocation: Efficient optimization of inventory,
personnel, and marketing budgets.
• Competitive advantage: Early identification of opportunities and threats
keeps businesses ahead of the curve.
In conclusion, time series analysis is a powerful tool for navigating the future with data. By harnessing historical patterns, companies can make informed decisions, optimize resources, and gain a competitive edge in today's dynamic market.

b) Write short Notes (Any one)

i) F Test in R

ii) Bayes Theorem


iii) Correlation analysis

Short Notes:


i) F Test in R
Purpose: Compare the variances of two populations.
How to perform:
• var.test(): takes two numeric vectors (or a formula) and returns the F statistic, degrees of freedom, p-value, and a confidence interval for the ratio of variances.
• anova(): used with linear models to compare nested models via the change in explained variance.
Interpretation:
• High F-statistic and low p-value: Variances likely different.
• Small F-statistic and high p-value: Variances likely equal.
Assumptions: Normality of data.
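A minimal sketch with simulated data (vectors a and b are hypothetical):
Code snippet
set.seed(1)
a <- rnorm(30, sd = 1)   # hypothetical sample 1
b <- rnorm(30, sd = 2)   # hypothetical sample 2
var.test(a, b)           # F test of the ratio of variances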
ii) Bayes Theorem
Formula: P(H|D) = P(D|H) P(H) / Σ_i P(D|H_i) P(H_i)
• Updates belief in a hypothesis (H) based on new evidence (D).
• P(H|D): Posterior probability of H given D.
• P(D|H): Likelihood of D happening given H is true.
• P(H): Prior probability of H.
Application: Updating predictions, diagnosing diseases, analyzing evidence.
Limitations: Relies on accurate priors and likelihoods.
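A small numeric sketch in R, using hypothetical disease-testing values:
Code snippet
prior <- 0.01   # P(H): prevalence of the condition
sens  <- 0.95   # P(D|H): probability of a positive test given the condition
fpr   <- 0.05   # P(D|not H): false positive rate
posterior <- (sens * prior) / (sens * prior + fpr * (1 - prior))
posterior       # P(H|D) ≈ 0.161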
iii) Correlation analysis
Purpose: Assess the strength and direction of linear relationships between two
variables.
Types:
• Pearson correlation (r): Measures linear association, ranges from -1 to 1.
• Spearman rank correlation: Non-parametric, focuses on monotonic
relationships.
• Kendall tau correlation: Non-parametric, measures concordance between
ranked data.
Interpretation:
• Positive correlation: Variables move in the same direction.
• Negative correlation: Variables move in opposite directions.
• 0 correlation: No linear relationship.
Limitations: Assumes linearity, may not capture complex relationships.
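A brief sketch on the built-in mtcars dataset:
Code snippet
cor(mtcars$mpg, mtcars$wt)                       # Pearson r (strong negative)
cor.test(mtcars$mpg, mtcars$wt)                  # adds a significance test and CI
cor(mtcars$mpg, mtcars$wt, method = "spearman")  # rank-based alternative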
