r lang-Unit-04

The document covers statistical modeling and testing, emphasizing their importance in data analysis across various fields. It explains key concepts such as hypothesis testing, sampling distributions, and different types of statistical tests, including t-tests and ANOVA. Additionally, it discusses the relationship between statistical modeling and machine learning, highlighting their complementary roles in data interpretation and prediction.

Uploaded by

km587522
Copyright
© © All Rights Reserved

Nrupathunga University

Department of Computer Science


V Sem BCA (NEP)
Statistical Computing and R Programming Language
Unit -04
Statistical testing and modelling, sampling distributions, hypothesis testing, components of a
hypothesis test, testing means, testing proportions, testing categorical variables, errors and
power, Analysis of variance.
1.Statistical modelling

Statistical modelling uses mathematics and statistics to make assumptions and reach conclusions
from data. The use of statistical models is ubiquitous in scientific fields, including engineering,
business, and life sciences. It’s a valuable tool for drawing inferences and making quantitative
predictions about data.

This article defines statistical modeling and shows where and why it’s used. It then examines
the finer points of the most common forms of modeling, with actual published examples so
you can see regression in action.

2.Definition of Statistical Modeling (2 marks)


In simple terms, statistical modeling is a way to learn and reach meaningful
conclusions from data. A statistical model is defined by a mathematical equation, but defining
its very meaning is a good place to start:

• Statistics: the science of displaying, collecting, and analysing data


• Model: a mathematical representation of a phenomenon
The first step in any statistical analysis is to gather relevant data about the population. The
population is a set of similar items or events that you want to study and acquire information
about.

An example: U.S. voters


For instance, your population could be all U.S. voters. The population’s quantities you’re
interested in are called population parameters. In this case, it could be the approval
percentage for a presidential candidate. It would be impractical (and essentially impossible)
to collect this data on all U.S. voters.
This difficulty is typical: measuring a parameter directly on an entire population is
often impossible.

With statistical inference, you can estimate the population parameter by measuring it from
part of the population. Choosing that part to collect information on is known as statistical
sampling. Election polls do precisely that to predict winners (or losers) from a sample of U.S.
voters.
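
The polling idea above can be sketched in R. This is a minimal simulation, not real polling data: the population size, the 52% approval rate, and the sample size of 1,000 are all assumed values chosen only for illustration.

```r
# Simulated population of voters: 1 = approves, 0 = does not approve.
# The 52% approval rate is an assumed value, not real polling data.
set.seed(42)
population <- rbinom(1e6, size = 1, prob = 0.52)

# Population parameter (unknown in practice, known here because we simulated it)
true_approval <- mean(population)

# Statistical sampling: poll 1,000 randomly chosen voters
poll <- sample(population, 1000)

# Statistical inference: the sample proportion estimates the parameter
estimated_approval <- mean(poll)

print(c(true = true_approval, estimate = estimated_approval))
```

The estimate will not match the true value exactly, but with a random sample of this size it is usually within a few percentage points.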

3.Why use statistical modelling?


Predicting outcomes and extracting information are the main goals of data analysis.

The “cultures” of modeling


The late Leo Breiman, a noted statistician, stated there are two ways to approach these goals:
1. Data modeling culture
2. Algorithmic modeling culture
This notion of two cultures translates into a common conflict. How do you decide the best
approach to analyze a given dataset?

Statistical modeling is often referred to as data modeling. Many machine learning
models fall into the category of algorithmic modeling. These two share similar mathematical
underpinnings but differ in their purposes.
Machine learning models suit large datasets and model automation, and they are very
good at identifying hidden patterns in data. They’re an appropriate and necessary tool in
several data science applications, given their strong predictive power. However, the
predictive power offered by machine learning doesn’t remove the need for statistical modeling.

Statistical models are better at explaining the relationship between variables. They
seek some understanding of the structure underlying the observed data.

Statistics extracts population inferences from a sample, while machine learning
identifies generalizable patterns. More importantly, the approaches complement each other.
If possible, aim for accurate predictions as well as good interpretations.
4.Types of statistical models (2 marks)

Perhaps the simplest statistical model is the arithmetic mean, or average. With this
measure, you’re estimating the value you would expect from an observation drawn at random
from the population.

Regression analysis is an important set of statistical models. It allows you to estimate a
dependent variable using one or more independent variables. Those independent variables are
also known as explanatory variables. A regression model is specified by selecting a functional
form for the statistical model. Following are some of the most common regression models.
1. Linear Regression
2. Multiple Regression
3. Logistic Regression
4. Ridge Regression
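
As a rough illustration of the first three models, the sketch below fits each one to R's built-in mtcars data set. The choice of data set and of the variables mpg, wt, hp and am is an assumption made for this example only. Ridge regression is left out because it is not provided by R's built-in stats package.

```r
# Linear regression: one explanatory variable (car weight)
fit_linear <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: several explanatory variables (weight and horsepower)
fit_multiple <- lm(mpg ~ wt + hp, data = mtcars)

# Logistic regression: binary dependent variable (am: 0 = automatic, 1 = manual)
fit_logistic <- glm(am ~ wt, data = mtcars, family = binomial)

# The estimated coefficients describe how each explanatory variable
# is related to the dependent variable
print(coef(fit_linear))
print(coef(fit_multiple))
print(coef(fit_logistic))
```

Calling summary() on any of the fitted objects prints standard errors and p-values for each coefficient.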

An example: COVID-19 mortality rate (not important)

The factors associated with the COVID-19 mortality rate (the dependent variable) in
169 countries were identified using linear regression.
The researchers performed a simple linear regression analysis to test the correlation
between COVID-19 mortality and the number of tests performed (the independent variable).
They used multiple regression analysis to predict mortality rates while considering other
explanatory factors (e.g., case numbers, critical cases, hospital bed numbers).

The results suggested that an increase in testing was effective at attenuating mortality when
other means of control were insufficient. Higher mortality was found to be associated with:

• lower test number


• lower government effectiveness
• elderly population
• fewer hospital beds
• better transport infrastructure
5.Statistical modeling can help with: (2 marks)

• Analyzing data
• Making predictions

• Understanding relationships between variables

• Decision-making in various fields, such as finance and healthcare.

6.Define Statistical testing


Statistical testing is a procedure used to determine the likelihood of observing certain patterns,
relationships, or differences in a dataset. It helps researchers draw conclusions about the
population based on sample data.

Statistical testing can also refer to a testing method in software engineering that aims
to identify unreliable software package products.

Some statistical tests include:

• Hypothesis testing: A statistical method that determines if there is enough evidence
in sample data to draw conclusions about a population

• t-test: Compares the means of two samples

• ANOVA tests: Compares the means of more than two groups


Some basic statistical tests include:

• t-Test

• chi-square Test

• Kolmogorov-Smirnov Test (more commonly called the K-S Test)

Statistical testing in R involves using statistical methods to analyze data and make
inferences about population parameters based on sample data. R is a programming
language and environment for statistical computing and graphics, and it provides a
wide range of functions and packages for conducting various statistical tests. Here
are some common statistical tests in R:
• t-test:
Used to compare the means of two groups to determine if they are significantly
different.
Example:
t.test(x, y)
• ANOVA (Analysis of Variance):
Used to compare means across multiple groups.
Example:
anova(lm(response_variable ~ factor_variable, data = your_data))
• Chi-square test:
Used for categorical data analysis to test the association between two categorical
variables.
Example:
chisq.test(table(variable1, variable2))
• Correlation test:
Used to measure the strength and direction of a linear relationship between two
continuous variables.
Example:
cor.test(x, y)
• Regression analysis:
What is Regression Analysis (2 marks)
Used to model the relationship between a dependent variable and one or more
independent variables.
Example:
lm(y ~ x, data = your_data)
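
The calls above use placeholder names (x, y, your_data). A runnable sketch on R's built-in mtcars data set follows. The data set and the variable choices are assumptions made for illustration and are not part of the syllabus examples.

```r
# t-test: do automatic and manual cars differ in mean mpg?
t_result <- t.test(mpg ~ am, data = mtcars)

# ANOVA: does mean mpg differ across 4-, 6- and 8-cylinder cars?
anova_result <- anova(lm(mpg ~ factor(cyl), data = mtcars))

# Chi-square test: are cylinder count and transmission type associated?
# (R warns that some expected counts are small in this tiny data set,
# so the warning is suppressed here.)
chisq_result <- suppressWarnings(chisq.test(table(mtcars$cyl, mtcars$am)))

# Correlation test: linear relationship between weight and mpg
cor_result <- cor.test(mtcars$wt, mtcars$mpg)

# Each result stores a p-value that can be compared with the significance level
print(t_result$p.value)
print(anova_result[["Pr(>F)"]][1])
print(chisq_result$p.value)
print(cor_result$p.value)
```

A p-value below the chosen significance level (commonly 0.05) is read as evidence against the null hypothesis of the test.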

6.Sample and Sampling Distribution:


What is a Sample? (2 marks)
A sample is a smaller set of data that a researcher chooses or selects from a
larger population using a pre-defined selection method. These elements are
known as sample points, sampling units, or observations.
What is Sampling Distribution? (2 marks)
A sampling distribution is a probability distribution of a statistic that is obtained
through repeated sampling of a specific population. It describes a range of possible
outcomes for a statistic, such as the mean or mode of some variable, of a population.
How Sampling Distribution is done for iris dataset? ( 5 marks)
Reference Youtube Link : https://youtu.be/Xfdg0xqFjts?si=jAoLfbU3M8RotyZr
(watch this video which I have explained in class and learn the code)

Demonstrate Sampling and Sampling Distribution using the Iris Data set

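# Inspect the built-in iris data set and copy it into a data frame
str(iris)
iris_df <- data.frame(iris)
View(iris_df)  # opens the data viewer (interactive sessions such as RStudio)

iter <- 100              # number of samples to draw
n <- 5                   # size of each sample
means <- rep(NA, iter)   # empty vector to hold the sample means

# Draw iter random samples of size n and store the mean of each one
for (i in 1:iter) {
  each_sample <- sample(iris_df$Petal.Length, n)
  means[i] <- mean(each_sample)
}

# The histogram of the sample means shows the sampling distribution
hist(means)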
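
To see how the sample size changes the sampling distribution, the same experiment can be repeated with two different sample sizes. This is only a sketch: the helper name sample_means, the sample sizes 5 and 30, and the 1,000 repetitions are assumptions chosen for illustration.

```r
set.seed(1)

# Draw `iter` samples of size n from Petal.Length and return their means
# (sample_means is a hypothetical helper written for this example)
sample_means <- function(n, iter = 1000) {
  replicate(iter, mean(sample(iris$Petal.Length, n)))
}

means_n5  <- sample_means(5)
means_n30 <- sample_means(30)

# Both sets of means centre near the mean of the full data set,
# but the larger samples give a narrower sampling distribution
print(mean(iris$Petal.Length))
print(c(mean_n5 = mean(means_n5), mean_n30 = mean(means_n30)))
print(c(sd_n5 = sd(means_n5), sd_n30 = sd(means_n30)))
```

The standard deviation of the sample means is smaller for n = 30 than for n = 5, which is why larger samples give more reliable estimates.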

7.Hypothesis Testing
What is Hypothesis Testing? (2 marks)
Hypothesis testing is a form of statistical inference that uses data from a sample to draw
conclusions about a population parameter or a population probability distribution.
What are Inferential Statistics? (2 marks)
Figure: Sample 1, Sample 2, Sample 3, Sample 4, Sample 6, ..., Sample n drawn from a Population; the samples together are labelled Sampling Distribution and lead to a Conclusion.

All the samples, such as Sample 1, 2, 3, 4, ..., n, together form the sampling distribution.
Drawing many such samples from the population data and reaching a conclusion from them
is called inferential statistics.

Figure: Sample Data taken from the Population Data leads to a Conclusion.
The conclusion drawn about the population data, or the hypothesis made about it,
is tested with various types of hypothesis tests.
8.Components of Hypothesis testing
List the Components of Hypothesis Testing. (2 marks)
The components of Hypothesis testing are
1. Null Hypothesis
2. Alternate Hypothesis
3. Confidence Interval (CI)
4. Significant Value
5. Decision boundary
Define Null Hypothesis. (2 marks)
The null hypothesis is a general statement or default position that says there is no
relationship between two measured phenomena or no association among groups.
A null hypothesis is denoted by the symbol H0 in statistics. It is usually pronounced
“h-nought” or “H-null”. The subscript of H is the digit 0.
Define Alternate Hypothesis. (2 marks)

The alternative hypothesis is a statement used in statistical inference experiments. It
contradicts the null hypothesis and is denoted by Ha or H1. It is simply an alternative to
the null.

Explain the Null and Alternate Hypothesis with an Example (8 marks) (Very Important)

Consider an Example of Tossing a coin

When we toss a coin repeatedly, we may get outcomes such as the following


1. 50% may be head and 50% may be tail
2. 60% may be head and 40% may be tail
3. 70% may be head and 30% may be tail

The following graph represents the above outcomes, which are normally distributed

Figure 1 : Representing the Normal Distribution Curve for Hypothesis testing

So if we get 50% head and 50% tail then we get a straight line exactly at the centre

If we get 60% head and 40% tail then we get a straight line between 40 and 60

If we get 70% head and 30% tail then we get a straight line between 30 and 70

So now we can explain the components of the Hypothesis for this problem
Figure 2: Representing the Normal Distribution Curve for Hypothesis testing and its
Components

1. Null Hypothesis : H0 : Coin is Fair

This means that if we toss the coin and the percentage of heads lies between 30 and 70,
we accept (more precisely, fail to reject) the Null Hypothesis, which is then supported
by the experiment.

2. Alternate Hypothesis: H1 or Ha: Coin is not fair

This means that if we toss the coin and the percentage of heads or tails falls outside
the range 30 to 70, we reject the Null Hypothesis and accept the Alternate Hypothesis.

3. Confidence Interval

A confidence level of 95 percent means that if we repeat the coin-toss experiment
100 times, then in about 95 of those repetitions the percentage of heads will fall
within the confidence interval.

Here, this means that in about 95 of the 100 repetitions the percentages of heads and
tails lie between 30 and 70.

4. Significant value (also called the significance level)

It is defined as one minus the confidence level of 95%, that is

Significant value = 1 - 0.95

= 0.05
This means that in about 5% of the 100 repetitions the result for a fair coin may fall
beyond 30 and 70. The chance of getting a result outside the range covered by the three
outcomes mentioned above is therefore 0.05 (or 5%).

5. Decision Boundary

The limits of the Confidence Interval form the Decision Boundary. They are decided
by repeating the coin-toss experiment 100 times: in about 95 of those repetitions the
percentages of heads and tails lie between 30 and 70, so the decision boundary is the
pair of limits 30 and 70, which encloses 95% of the outcomes.

Conclusion

Suppose that in the 5th repetition of the experiment we get 65% heads and 35% tails.
We then accept (fail to reject) the Null Hypothesis, because the result lies within the
decision boundary shown in Figure 2.
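
The coin example can be checked in R with an exact binomial test. This is a sketch under an assumption: the notes do not state how many tosses make up one experiment, so 20 tosses is a hypothetical number, chosen because it gives decision limits of 30% and 70% heads for a fair coin at the 95% level.

```r
n_tosses <- 20      # assumed number of tosses per experiment (hypothetical)
alpha    <- 0.05    # significant value = 1 - 0.95

# Decision boundary for the number of heads if the coin is fair (p = 0.5)
limits <- qbinom(c(alpha / 2, 1 - alpha / 2), size = n_tosses, prob = 0.5)
print(100 * limits / n_tosses)   # limits expressed as percentages of heads

# Observed result: 65% heads, i.e. 13 heads in 20 tosses
heads  <- 13
result <- binom.test(heads, n_tosses, p = 0.5)
print(result$p.value)

# Decision: reject H0 only when the p-value is below alpha
if (result$p.value < alpha) {
  print("Reject H0: the coin does not look fair")
} else {
  print("Fail to reject H0: the result is consistent with a fair coin")
}
```

With more tosses per experiment the limits move closer to 50%, so the same 65% heads could then fall outside the decision boundary.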

(Reference Youtube Link : https://youtu.be/pZ1d32ar_iY?si=a4dMtaGYUcOIegZJ )

Please watch the above video to understand the experiment, if required.

Testing Means or Types of Testing

Mention the different types of Hypothesis testing (2 marks)


1. Z Test
2. T Test
3. ANOVA Test
4. Chi-square Test
What is Z-test? (2 marks)
A Z-test is a statistical test that determines if two population means are
different when the variances are known and the sample size is large. It is a type of statistical
hypothesis test where the test statistic follows a normal distribution.

What is T-Test ? (2 marks)


A t-test, also known as a Student's t-test, is a statistical test that compares the means
of two groups. It's used to determine whether the difference between the means of
the two groups is statistically significant.

What is ANOVA Test ? (2 marks)


Analysis of Variance (ANOVA) is a statistical test that determines differences between
research results from three or more unrelated samples or groups. It tests the hypothesis that
the means of two or more populations are equal

What is chi-square Test ? (2 marks)


The chi-square test is a statistical tool that determines if two categorical variables are
related or independent.

When do we use Z-test? (2 marks) (just write the below diagram for this question)

Explain z-test with an example. (8 marks) (very important)

Consider an example

The average height of all residents in a city is 168 cm, with a population standard deviation σ = 3.9.
A doctor believes the mean to be different. He measured the heights of 36 individuals and
found the average to be 169.5 cm.

a. State the Null and Alternate Hypothesis

b. At a 95% confidence level, is there enough evidence to reject the Null Hypothesis?

Given:

Standard Deviation σ = 3.9

Average of Population or Population mean μ=168 cm

Sample size n=36

Average of Sample or Sample mean x̄=169.5

a. 1. State the Null Hypothesis

H0: μ = 168 cm

2. State the Alternate Hypothesis

H1 or Ha: μ ≠ 168 cm

b. Confidence Interval is 95% (given)

Significant value α = 1 – CI

= 1 - 0.95=0.05

Therefore, with the above data, the Normal distribution curve is as follows

Figure: Normal distribution curve centred on the Population Mean μ=168 cm, with the central 95% between the two decision boundaries and 0.025 (2.5%) in each tail.

The above graph shows that, as the CI is 95%, the significant value is 0.05. Because
this is a two-tailed test, we divide 0.05 by 2 and get 0.025 in each tail. The cumulative
probability at the upper decision boundary is therefore

1 – 0.025 = 0.9750

Looking up 0.9750 in the standard normal (Z) table gives Z = 1.96.

So, the Decision boundary is +1.96 and -1.96.

Now Apply Z-test

Z-test Formula is as follows

Z = (x̄ − μ) / (σ / √n) = (169.5 − 168) / (3.9 / √36) = 1.5 / 0.65 = 2.31

Statistical Inference:

A sample of 36 individuals with sample mean 169.5 is drawn from a population with mean 168
and standard deviation 3.9. Applying these values to the Z-test formula gives 2.31. This
value is compared with the decision boundary value, i.e. 1.96.

2.31 > 1.96, so the test statistic lies outside the Decision boundary. In this case we Reject
the Null hypothesis and Accept the Alternate Hypothesis.
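
The same calculation can be reproduced in R. Base R has no built-in z-test function, so this sketch computes the statistic by hand. The variable names are chosen for this example, and qnorm() is used in place of a printed Z table to find the two-tailed decision boundary.

```r
mu    <- 168     # population mean under H0
sigma <- 3.9     # population standard deviation
n     <- 36      # sample size
x_bar <- 169.5   # sample mean
alpha <- 0.05    # significant value = 1 - 0.95

# Test statistic: Z = (x_bar - mu) / (sigma / sqrt(n))
z <- (x_bar - mu) / (sigma / sqrt(n))

# Two-tailed decision boundary: Z value with cumulative probability 1 - alpha/2
z_crit <- qnorm(1 - alpha / 2)

# Two-tailed p-value for the observed statistic
p_value <- 2 * pnorm(-abs(z))

print(round(c(z = z, z_crit = z_crit, p_value = p_value), 3))

# Decision rule: reject H0 when |Z| exceeds the decision boundary
if (abs(z) > z_crit) {
  print("Reject H0")
} else {
  print("Fail to reject H0")
}
```

Comparing the p-value with alpha gives the same decision as comparing Z with the decision boundary.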
