SSMDA
Median
The median of a variable is determined by identifying the middle value within a dataset
when the data are arranged in ascending order. It effectively divides the data into two
equal halves, with 50% of the data points falling below the median and the remaining
50% above it.
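As a quick sketch (using a small made-up vector), the median can be computed in R with the built-in median() function:

```r
# Median of a small hypothetical sample
values <- c(7, 1, 5, 3, 9)   # unsorted on purpose
sorted <- sort(values)       # ascending order: 1 3 5 7 9
middle <- median(values)     # the middle value of the sorted data
print(middle)                # 5
```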
Range
The range of a variable is determined by subtracting the smallest value from the largest
value within a quantitative dataset, making it the most basic measure that relies solely on
these two extreme values.
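A minimal sketch with made-up data (note that R's built-in range() returns the pair of extremes, so the range as defined here is their difference):

```r
# Range of a small hypothetical sample
values <- c(12, 3, 25, 8, 17)
r <- max(values) - min(values)   # largest minus smallest
print(r)                         # 22
print(diff(range(values)))       # same result via the built-in range()
```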
Variance
Variance involves the computation of the squared differences between each value and the
arithmetic mean. This approach accommodates both positive and negative deviations.
The sample variance (s²) serves as an unbiased estimator of the population variance (σ²),
with (n - 1) degrees of freedom.
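This can be checked in R with a small made-up vector; R's var() already divides by (n - 1):

```r
# Sample variance with (n - 1) degrees of freedom
values <- c(4, 8, 6, 5, 3)
n <- length(values)
manual_var  <- sum((values - mean(values))^2) / (n - 1)  # by hand
builtin_var <- var(values)                               # built-in, same denominator
print(manual_var)
print(builtin_var)
```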
Box Plot
A box plot is a chart used to display the distribution of data by drawing a box plot for
each variable. The distribution is summarized by five values
(minimum, first quartile, median, third quartile, and maximum).
Boxplots in R Programming Language
Boxplots are created in R by using the boxplot() function.
Syntax: boxplot(x, data, notch, varwidth, names, main)
Parameters:
x: This parameter is a vector or a formula.
data: This parameter sets the data frame.
notch: This parameter is a logical value. Set as TRUE to draw a notch at the median.
varwidth: This parameter is a logical value. Set as TRUE to draw the width of the
box proportionate to the sample size.
main: This parameter is the title of the chart.
names: This parameter gives the group labels that will be shown under each
boxplot.
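A minimal sketch of these parameters, using two hypothetical groups of scores:

```r
# Two hypothetical groups of scores
group_a <- c(55, 60, 62, 65, 70, 72, 90)
group_b <- c(40, 48, 50, 52, 58, 61, 66)

boxplot(group_a, group_b,
        notch    = FALSE,         # logical: whether to notch the boxes at the median
        varwidth = TRUE,          # box width proportional to sample size
        names    = c("A", "B"),   # labels shown under each box
        main     = "Scores by group")
```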
Scatter Plot
A scatter plot is a set of dotted points representing individual pieces of data plotted on
the horizontal and vertical axes. In a graph in which the values of two variables are
plotted along the X-axis and Y-axis, the pattern of the resulting points reveals a
correlation between them.
R - Scatter Plots
We can create a scatter plot in R Programming Language using the plot() function.
Syntax: plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Parameters:
x: This parameter sets the horizontal coordinates.
y: This parameter sets the vertical coordinates.
xlab: This parameter is the label for the horizontal axis.
ylab: This parameter is the label for the vertical axis.
main: This parameter is the title of the chart.
xlim: This parameter sets the range of values plotted on the x-axis.
ylim: This parameter sets the range of values plotted on the y-axis.
axes: This parameter indicates whether both axes should be drawn on the plot.
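A minimal sketch of these parameters with made-up paired measurements:

```r
# Hypothetical paired measurements
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.3)

plot(x, y,
     main = "Scatter plot",
     xlab = "x values", ylab = "y values",
     xlim = c(0, 7),    # range shown on the x-axis
     ylim = c(0, 14),   # range shown on the y-axis
     axes = TRUE)       # draw both axes
```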
Histogram
A histogram displays statistical information with rectangular bars whose heights are
proportional to the frequency of a variable within successive numerical
intervals. It is a graphical representation that organizes a group of data points into
specified ranges. It has a special feature that it shows no gaps between the bars,
although it is otherwise similar to a vertical bar graph.
R- Histograms
We can create histograms in R Programming Language using the hist()
function.
Syntax: hist(v, main, xlab, xlim, ylim, breaks, col, border)
Parameters:
v: This parameter contains the numerical values used in the histogram.
main: This parameter is the title of the chart.
col: This parameter is used to set the colour of the bars.
xlab: This parameter is the label for the horizontal axis.
border: This parameter is used to set the border colour of each bar.
xlim: This parameter sets the range of values plotted on the x-axis.
ylim: This parameter sets the range of values plotted on the y-axis.
breaks: This parameter sets the number of bars (bins) or the break points between them.
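A minimal sketch of these parameters with made-up measurements (note that breaks is a suggestion: R may round it to "pretty" interval boundaries):

```r
# Hypothetical measurements
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19, 27)

hist(v,
     main   = "Histogram of v",
     xlab   = "Value",
     col    = "green",    # bar fill colour
     border = "black",    # bar border colour
     xlim   = c(0, 50),
     breaks = 5)          # requested number of bins
```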
Code
# Load data (assumed to be the built-in mtcars dataset, judging by the mpg and cyl columns)
data <- mtcars
head(data)
summary(data)

column_name <- "mpg"
column_data <- data$mpg

# Basic statistics
mean_value <- mean(column_data)
median_value <- median(column_data)
variance <- var(column_data)
std_dev <- sd(column_data)

# Print statistics
cat("Mean:", mean_value, "\n")
cat("Median:", median_value, "\n")
cat("Variance:", variance, "\n")
cat("Standard Deviation:", std_dev, "\n")

# Histogram
hist(column_data,
     breaks = 10,
     col = "lightblue",
     main = "Histogram",
     xlab = column_name)

# Boxplot
boxplot(column_data,
        main = "Boxplot",
        col = "orange",
        horizontal = TRUE)

# Scatterplot
plot(data$mpg, data$cyl,
     main = "Scatterplot",
     xlab = "mpg",
     ylab = "cyl",
     col = "blue",
     pch = 19)

# Quartiles
quantiles <- quantile(column_data)
print(quantiles)
Output
[Plots: histogram of mpg, horizontal boxplot, and scatterplot of mpg vs cyl]
Viva- Voce
Q1. What is a histogram, and how is it different from a bar chart?
A histogram is a graphical representation of the distribution of a continuous variable. It
groups the data into bins (intervals) and shows the frequency of data points in each bin.
A bar chart, on the other hand, represents categorical data and displays frequencies or
values for distinct categories.
Key Difference: Histograms use bins for continuous data, while bar charts use distinct
categories with gaps between bars.
Q2. What can you infer from the pattern of points in a scatter plot?
Positive Correlation: Points slope upward, indicating that as one variable
increases, the other also increases.
Negative Correlation: Points slope downward, indicating that as one variable
increases, the other decreases.
No Correlation: Points show no clear pattern, indicating little or no relationship
between the variables.
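A quick sketch with made-up data: cor() quantifies the pattern a scatter plot shows, with values near +1 for points that slope upward together.

```r
# Positive correlation: y rises with x (hypothetical data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 8, 9)
print(cor(x, y))   # close to +1, confirming a strong positive correlation
```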
Software Used: R
Theory:
Classical Probability
Classical probability, often referred to as "a priori" probability, is a branch of
probability theory that deals with situations where all possible
outcomes are equally likely. It
provides a foundational understanding of how probability works and forms the
basis for more advanced probability concepts.
Mathematical Foundations
Sample Space: The sample space represents the set of all possible
outcomes in a given experiment. It serves as the foundation for calculating
probabilities. For instance, when rolling a fair six-sided die, the sample space
is {1, 2, 3, 4, 5, 6}.
Events: An event is a subset of the sample space, representing a specific
outcome or set of outcomes. Events can range from simple, such as rolling an
even number, to complex, like drawing a red card from a deck.
Probability Distribution: A probability distribution assigns probabilities
to each event in the sample space. For classical probability, all outcomes are
equally likely, so each elementary outcome has the same probability.
Calculating Classical Probability
Classical probability is based on the principle of equally likely outcomes. Consider an
experiment with a finite sample space S, consisting of n equally likely outcomes. Let A
be an event of interest within S.
The classical probability of event A, denoted as P(A), is calculated as:
P(A) = (Number of favourable outcomes for event A) / (Total number of equally likely outcomes in S) = n(A) / n(S)
Where:
P(A) is the probability of event A.
n(A) is the number of favourable outcomes for event A.
n(S) is the total number of equally likely outcomes in the sample space S.
This formula allows us to calculate the probability of an event by counting the
favourable outcomes and dividing by the total number of equally likely outcomes.
In R, you can use this formula to calculate classical probabilities for various events,
making it a fundamental concept in probability theory for data analysis and statistics.
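For instance, the n(A)/n(S) formula for the event "rolling an even number" on a fair die can be sketched as:

```r
# Classical probability of rolling an even number on a fair die
S <- 1:6                       # sample space
A <- S[S %% 2 == 0]            # favourable outcomes: 2, 4, 6
p_A <- length(A) / length(S)   # P(A) = n(A) / n(S)
print(p_A)                     # 0.5
```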
Properties of Classical Probability
Complementary Probability - The probability of an event not occurring is
known as the complementary probability. It can be calculated as: 1 - P(E).
Mutually Exclusive Events - Events are mutually exclusive if they cannot
occur simultaneously. For example, rolling a die and getting both a 2 and a 4
in a single roll is impossible.
Independent Events - Events are considered independent if the outcome
of one event does not affect the outcome of another. For instance, tossing a
coin does not influence the roll of a die.
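These properties can be sketched numerically in R. The complement rule subtracts from 1, and probabilities of independent events (such as a coin toss and a die roll) multiply:

```r
S <- 1:6
p_two <- sum(S == 2) / length(S)     # P(rolling a 2) = 1/6

# Complementary probability: P(not rolling a 2) = 1 - P(2)
p_not_two <- 1 - p_two               # 5/6

# Independent events: coin toss and die roll
p_heads <- 1 / 2
p_heads_and_two <- p_heads * p_two   # multiply for independent events: 1/12
cat(p_not_two, p_heads_and_two, "\n")
```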
Advantages:
" Simpleness: Classical probability offers an easy-to-understand framework
for modelling and analysing random events, making it approachable for
novices and the basis for more complex probability ideas.
Theoretical Foundation: It provides the foundation for more intricate
probability theories, allowing for a thorough comprehension of probability
concepts.
Classical probability is unbiased and simple to use in circumstances with
well-defined sample spaces because it makes the assumption that each result
is equally likely.
Limitations:
Application: When dealing with continuous or complicated data, or when
events are not all equally likely, classical probability may not correctly reflect
real-world scenarios.
Limited Complexity: It may not be able to handle complex probabilistic
issues, necessitating the use of more sophisticated models like Bayesian
probability for in-depth investigations.
Discreteness: Due to the inherent discreteness of classical probability,
continuous probability distributions may not match it in some real-world
situations.
Real-world Applications
Weather Forecasting: Classical probability is used in weather forecasting
to estimate the likelihood of various weather conditions based on historical
data.
Quality Control: In manufacturing, classical probability is applied to assess
the probability of defects in a production process, aiding in quality control.
Code 1:
# Rolling a fair six-sided die
die <- 1:6
probabilities <- rep(1/6, 6) # Each face has equal probability

# Probability of rolling a 4
prob_4 <- probabilities[die == 4]
print(paste("Probability of rolling a 4:", prob_4))

# Simulated rolls (reconstructed: the output below prints a vector of rolls)
rolls <- sample(die, 10, replace = TRUE)
print("Simulated rolls:")
print(rolls)

# Uniform distribution on [0, 1] (x, pdf and cdf reconstructed; they were missing)
x <- seq(0, 1, by = 0.01)
pdf <- dunif(x, min = 0, max = 1)
cdf <- punif(x, min = 0, max = 1)

# Random numbers
random_values <- runif(10, min = 0, max = 1)

# Plotting PDF and CDF
plot(x, pdf, type = "l", col = "blue", main = "Uniform Distribution", ylab = "Density")
lines(x, cdf, col = "red")
legend("bottomright", legend = c("PDF", "CDF"), col = c("blue", "red"), lty = 1)

# Normal distribution with mean = 0, sd = 1
x <- seq(-4, 4, by = 0.01)
# PDF
pdf <- dnorm(x, mean = 0, sd = 1)
# CDF
cdf <- pnorm(x, mean = 0, sd = 1)
# Random numbers
random_values <- rnorm(1000, mean = 0, sd = 1)

# Plotting PDF and CDF
plot(x, pdf, type = "l", col = "blue", main = "Normal Distribution", ylab = "Density")
lines(x, cdf, col = "red")
legend("bottomright", legend = c("PDF", "CDF"), col = c("blue", "red"), lty = 1)

# Histogram of Random Values
hist(random_values, probability = TRUE, col = "lightblue", main = "Histogram of Random Values")
lines(density(random_values), col = "red")
Probability Distribution
R makes it easy to draw probability distributions and demonstrate
statistical concepts.
Some of the more common probability distributions available in R are given below.
Distribution    R name
F               f
Uniform         unif
Logistic        logis
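Each distribution's short name is combined with the prefixes d (density), p (CDF), q (quantile) and r (random generation). For the uniform distribution (unif), for instance:

```r
d <- dunif(0.5, min = 0, max = 1)  # density at 0.5      -> 1
p <- punif(0.5, min = 0, max = 1)  # P(X <= 0.5)         -> 0.5
q <- qunif(0.5, min = 0, max = 1)  # median (0.5 quantile) -> 0.5
r <- runif(3,   min = 0, max = 1)  # three random draws
cat(d, p, q, "\n")
print(r)
```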
Output:
[1] "Simulated rolls:"
> print(rolls)
 [1] 3 1 5 2 2 5 6 4 3 3
[Plots: Uniform Distribution PDF/CDF, Normal Distribution PDF/CDF, and Histogram of Random Values with density curve]
Viva-Voce:
Q1. What is classical probability?
Classical probability is a branch of probability theory that deals with
events having equally likely outcomes. It forms the basis of
probability theory and is widely used in statistics and data science.
Software Used: R
Theory:
Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables. One of these variables is called the predictor variable, whose value is
gathered through experiments. The other variable is called the response variable, whose
value is derived from the predictor variable.
In linear regression these two variables are related through an equation, where the
exponent (power) of both these variables is 1. Mathematically, a linear relationship
represents a straight line when plotted as a graph. A non-linear relationship, where the
exponent of any variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is -
y= ax + b
Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.
Create a relationship model using the lm() function in R.
Find the coefficients from the model created and create the mathematical equation
using these coefficients.
Get a summary of the relationship model to know the average error in
prediction, also called residuals.
To predict the weight of new persons, use the predict() function in R.
lm() Function
This function creates the relationship model between the predictor and the response
variable.
Syntax
lm(formula, data)
Following is the description of the parameters used -
formula is a symbol presenting the relation between x and y.
data is the vector on which the formula will be applied.
predict() Function
Syntax
predict(object, newdata)
Following is the description of the parameters used -
object is the formula which is already created using the lm() function.
newdata is the vector containing the new value for the predictor variable.
Code:
# Input data
# Below is the sample data representing the observations -
# Values of height: 151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight: 63, 81, 56, 91, 47, 57, 76, 72, 62, 48
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Create relationship model and get the coefficients
# Apply the lm() function.
relation <- lm(y ~ x)
print(relation)
print(summary(relation))

# Predict the weight of a new person (reconstructed: a height of 170 reproduces the output below)
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)

# Plot the data with the fitted regression line (reconstructed)
plot(x, y, main = "Height & Weight Regression",
     xlab = "Height in cm", ylab = "Weight in Kg", pch = 16, col = "blue")
abline(relation)
dev.off()
Output:
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   -38.4551       0.6746

> print(summary(relation))

Call:
lm(formula = y ~ x)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509    8.04901  -4.778  0.00139 **
x             0.67461    0.05191  12.997 1.16e-06 ***

> # predict
       1
76.22869

[Plot: scatter of height (cm) vs weight (Weight in Kg) with fitted regression line]
Viva-Voce:
Q.1. What are the assumptions of a linear regression model?
Linearity: the relationship between the independent and dependent variables is linear.
Independence: the observations (and their error terms) are independent of one another.
Homoscedasticity: the variance of the error terms is constant across values of the
independent variables.
Normality: the error terms are normally distributed.
Q.2. What is multicollinearity, and why is it a problem?
When two or more independent variables are highly correlated, it becomes difficult to
isolate the effect of each variable on the dependent variable. The regression model may
indicate that both variables are significant predictors of the dependent variable, but it
can be difficult to determine which variable is actually responsible for the observed
effect.
Q.3. What are the common techniques used to improve the accuracy of a linear
regression model?
Feature selection: selecting the most relevant features for the model to improve
its predictive power.
Feature scaling: scaling the features to a similar range to prevent bias towards
certain features.
Regularization: adding a penalty term to the model to prevent overfitting and
improve generalization.
Cross-validation: dividing the data into multiple partitions and using a different
partition for validation in each iteration to avoid overfitting.
Ensemble methods: combining multiple models to improve the overall accuracy
and reduce variance.
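Two of these techniques, feature scaling and holdout validation (a simple form of cross-validation), can be sketched in base R. This is a minimal sketch assuming the built-in mtcars dataset; the 70/30 split ratio and seed are arbitrary choices.

```r
# Sketch: feature scaling and a simple train/test split (using mtcars)
data <- mtcars
data$wt_scaled <- as.numeric(scale(data$wt))   # centre and scale a feature

set.seed(42)
idx   <- sample(nrow(data), floor(0.7 * nrow(data)))
train <- data[idx, ]                           # 70% for fitting
test  <- data[-idx, ]                          # 30% held out for validation

model <- lm(mpg ~ wt_scaled, data = train)
pred  <- predict(model, newdata = test)
rmse  <- sqrt(mean((test$mpg - pred)^2))       # out-of-sample error
print(rmse)
```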
Q.4. What is heteroscedasticity?
Heteroscedasticity is a statistical term that refers to the unequal variance of the error
terms (or residuals) in a regression model. In a regression model, the residuals
represent the difference between the observed values and the predicted values of the
dependent variable. When heteroscedasticity occurs:
The variance of the error terms is not constant across the range of the independent
variables.
Error terms tend to be larger for some values of the independent variables than
for others.
This can result in biased and inconsistent estimates of the regression coefficients
and standard errors, which can affect the accuracy of the statistical inferences and
predictions made from the model.
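A common visual check for heteroscedasticity is a residuals-vs-fitted plot. A sketch using the height/weight data from the regression experiment above (a formal alternative would be the Breusch-Pagan test, bptest() in the lmtest package):

```r
# Sketch: eyeballing heteroscedasticity via residuals vs fitted values
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
model <- lm(y ~ x)

plot(fitted(model), resid(model),
     main = "Residuals vs Fitted",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# A funnel shape (residual spread growing with fitted values)
# suggests heteroscedasticity; a roughly even band suggests constant variance.
```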