SSMDA

The document outlines basic statistical concepts and visualizations using R, including mean, median, variance, box plots, scatter plots, and histograms. It also covers classical probability, its properties, advantages, limitations, and real-world applications, along with R code examples for implementing these concepts. Additionally, it includes a viva-voce section with questions related to probability theory and its applications.


Experiment-6

Aim: To find basic statistics and visualization of a given data set in R.


Software Used: RStudio
Theory:
Mean

The arithmetic mean of a variable, often referred to as the average, is calculated by summing up all the values and then dividing the total by the count of values.
Population Mean (μ): μ = (Σ xi) / N
Sample Mean (x̄): x̄ = (Σ xi) / n

Median

The median of a variable is determined by identifying the middle value within a dataset
when the data are arranged in ascending order. It effectively divides the data into two
equal halves, with 50% of the data points falling below the median and the remaining
50% above it.
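As a small illustration in R (the example values here are assumed, not taken from the experiment's dataset), median() returns the middle value for an odd count of values and the mean of the two middle values for an even count:

```r
# Odd number of values: the middle value after sorting
v_odd <- c(7, 1, 5)            # sorted: 1 5 7
median(v_odd)                  # 5

# Even number of values: mean of the two middle values
v_even <- c(7, 1, 5, 3)        # sorted: 1 3 5 7 -> (3 + 5) / 2
median(v_even)                 # 4
```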

Range

The range of a variable is determined by subtracting the smallest value from the largest
value within a quantitative dataset, making it the most basic measure that relies solely on
these two extreme values.

Variance

Variance involves the computation of the squared differences between each value and the arithmetic mean. This approach accommodates both positive and negative deviations. The sample variance (s²) serves as an unbiased estimator of the population variance (σ²), with (n − 1) degrees of freedom.
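The (n − 1) divisor can be checked directly; this small sketch (with assumed example values) compares a manual computation against R's built-in var():

```r
# Sample variance computed by hand with (n - 1) degrees of freedom
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)   # 32 / 7
# R's var() uses the same unbiased (n - 1) formula
stopifnot(isTRUE(all.equal(manual_var, var(x))))
print(manual_var)
```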

Box Plot

A box plot is a chart used to display information in the form of a distribution by drawing a box for each group. This distribution of data is based on five values (minimum, first quartile, median, third quartile, and maximum).
Boxplots in R Programming Language
Boxplots are created in R by using the boxplot() function.
Syntax: boxplot(x, data, notch, varwidth, names, main)
Parameters:
x: This parameter is set as a vector or a formula.
data: This parameter sets the data frame.
notch: This parameter is a logical value. Set as TRUE to draw a notch in each box.
varwidth: This parameter is a logical value. Set as TRUE to draw the width of each box proportionate to the sample size.
main: This parameter is the title of the chart.
names: This parameter gives the group labels that will be shown under each boxplot.

Scatter Plot

A scatter plot is a set of dotted points representing individual data pieces plotted on the horizontal and vertical axes. In a graph in which the values of two variables are plotted along the X-axis and Y-axis, the pattern of the resulting points reveals the correlation between them.

R- Scatter plots

We can create a scatter plot in R Programming Language using the plot() function.
Syntax: plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Parameters:
x: This parameter sets the horizontal coordinates.
y: This parameter sets the vertical coordinates.
xlab: This parameter is the label for the horizontal axis.
ylab: This parameter is the label for the vertical axis.
main: This parameter is the title of the chart.
xlim: This parameter sets the range of values plotted on the x-axis.
ylim: This parameter sets the range of values plotted on the y-axis.
axes: This parameter indicates whether both axes should be drawn on the plot.

Histogram
A histogram uses rectangular bars to display statistical information: the height of each bar is proportional to the frequency of a variable within successive numerical intervals. It is a graphical representation that organises a group of data points into different specified ranges. A special feature is that it shows no gaps between the bars, which makes it similar to a vertical bar graph.

R- Histograms
We can create histograms in R Programming Language using the hist() function.
Syntax: hist(v, main, xlab, xlim, ylim, breaks, col, border)
Parameters:
v: This parameter contains the numerical values used in the histogram.
main: This parameter is the title of the chart.
col: This parameter is used to set the color of the bars.
xlab: This parameter is the label for the horizontal axis.
border: This parameter is used to set the border color of each bar.
xlim: This parameter sets the range of values plotted on the x-axis.
ylim: This parameter sets the range of values plotted on the y-axis.
breaks: This parameter sets the number of bars (bins) or their break points.

Code

# Load the dataset
data <- mtcars[, c("mpg", "cyl")]

# Display the first few rows of the dataset
print("First Few Rows of the Dataset:")
head(data)

# Summary of the dataset
print("Summary Statistics of the Dataset:")
summary(data)

# Structure of the dataset
print("Structure of the Dataset:")
str(data)

# Choose a numeric column
column_name <- "mpg"
column_data <- data$mpg

# Basic statistics
mean_value <- mean(column_data, na.rm = TRUE)
median_value <- median(column_data, na.rm = TRUE)
variance <- var(column_data, na.rm = TRUE)
std_dev <- sd(column_data, na.rm = TRUE)
min_value <- min(column_data, na.rm = TRUE)
max_value <- max(column_data, na.rm = TRUE)
quantiles <- quantile(column_data, na.rm = TRUE)

# Print statistics
cat("Mean:", mean_value, "\n")
cat("Median:", median_value, "\n")
cat("Variance:", variance, "\n")
cat("Standard Deviation:", std_dev, "\n")
cat("Minimum:", min_value, "\n")
cat("Maximum:", max_value, "\n")
cat("Quantiles:\n")
print(quantiles)

# Histogram
hist(column_data,
     breaks = 10,
     col = "lightblue",
     main = "Histogram",
     xlab = column_name)

# Boxplot
boxplot(column_data,
        main = "Boxplot",
        col = "orange",
        horizontal = TRUE)

# Scatterplot (the dataset has two numeric columns)
plot(data$mpg, data$cyl,
     main = "Scatterplot",
     xlab = "mpg",
     ylab = "cyl",
     col = "blue",
     pch = 19)

Output

[Plots produced: Histogram of mpg, horizontal Boxplot of mpg, and Scatterplot of mpg vs cyl]

Viva- Voce

Q1. What is a histogram, and how is it different from a bar chart?
A histogram is a graphical representation of the distribution of a continuous variable. It groups the data into bins (intervals) and shows the frequency of data points in each bin. A bar chart, on the other hand, represents categorical data and displays frequencies or values for distinct categories.
Key Difference: Histograms use bins for continuous data, while bar charts use distinct categories with gaps between bars.
Q2. What can you infer from the pattern of points in a scatter plot?
Positive Correlation: Points slope upward, indicating that as one variable increases, the other also increases.
Negative Correlation: Points slope downward, indicating that as one variable increases, the other decreases.
No Correlation: Points are scattered randomly, showing no relationship.
Clusters or Outliers: Specific groupings or isolated points may indicate data subgroups or anomalies.
Q3. What is a bar chart, and what type of data does it represent?
A bar chart represents categorical data, where each bar corresponds to a category, and
the bar's height represents the frequency or value for that category.
Experiment-7
Aim: To implement concepts of probability and distributions in R.

Software Used: R

Theory:

Classical Probability
Classical probability, often referred to as "a priori" probability, is a branch of probability theory that deals with situations where all possible outcomes are equally likely. It provides a foundational understanding of how probability works and forms the basis for more advanced probability concepts.
Mathematical Foundations
Sample Space: The sample space represents the set of all possible outcomes in a given experiment. It serves as the foundation for calculating probabilities. For instance, when rolling a fair six-sided die, the sample space is {1, 2, 3, 4, 5, 6}.
Events: An event is a subset of the sample space, representing a specific outcome or set of outcomes. Events can range from simple, such as rolling an even number, to complex, like drawing a red card from a deck.
Probability Distribution: A probability distribution assigns probabilities to each event in the sample space. For classical probability, all outcomes are equally likely, so each event has the same probability.
Calculating Classical Probability
Classical probability is based on the principle of equally likely outcomes. Consider an
experiment with a finite sample space S, consisting of n equally likely outcomes. Let A
be an event of interest within S.
The classical probability of event A, denoted as P(A), is calculated as:

P(A) = (Number of favourable outcomes for event A) / (Total number of equally likely outcomes in S)

Mathematically, this can be expressed as:

P(A) = n(A) / n(S)

Where:
P(A) is the probability of event A.
n(A) is the number of favourable outcomes for event A.
n(S) is the total number of equally likely outcomes in the sample space S.
This formula allows us to calculate the probability of an event by counting the favourable outcomes and dividing by the total number of equally likely outcomes. In R, you can use this formula to calculate classical probabilities for various events, making it a fundamental concept in probability theory for data analysis and statistics.
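As a minimal sketch of the formula in R (the event "roll an even number" is an assumed example):

```r
# P(A) = n(A) / n(S) for rolling an even number on a fair die
S <- 1:6                        # sample space
A <- S[S %% 2 == 0]             # favourable outcomes: 2, 4, 6
p_A <- length(A) / length(S)    # 3 / 6
print(p_A)                      # 0.5
```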
Properties of Classical Probability
Complementary Probability - The probability of an event not occurring is known as the complementary probability. It can be calculated as: 1 − P(E).
Mutually Exclusive Events - Events are mutually exclusive if they cannot
occur simultaneously. For example, rolling a die and getting both a 2 and a 4
in a single roll is impossible.
Independent Events - Events are considered independent if the outcome
of one event does not affect the outcome of another. For instance, tossing a
coin does not influence the roll of a die.
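These properties can be sketched numerically in R (the coin and die probabilities below are the standard assumed values):

```r
# Complementary probability: P(not E) = 1 - P(E)
p_four     <- 1 / 6             # P(rolling a 4)
p_not_four <- 1 - p_four        # 5/6

# Independent events: the joint probability is the product
p_heads <- 1 / 2                # coin toss
p_six   <- 1 / 6                # die roll
p_both  <- p_heads * p_six      # 1/12, since neither event affects the other

# Mutually exclusive events: P(rolling both 2 and 4 in one roll) is simply 0
print(c(p_not_four, p_both))
```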

Advantages and Limitations of Classical Probability

Advantages:
Simplicity: Classical probability offers an easy-to-understand framework for modelling and analysing random events, making it approachable for novices and the basis for more complex probability ideas.
Theoretical Foundation: It provides the foundation for more intricate
probability theories, allowing for a thorough comprehension of probability
concepts.
Fairness: Classical probability is unbiased and simple to use in circumstances with well-defined sample spaces because it assumes that each outcome is equally likely.
Limitations:
Limited Applicability: When dealing with continuous or complicated data, or when events are not all equally likely, classical probability may not correctly reflect real-world scenarios.
Limited Complexity: It may not be able to handle complex probabilistic
issues, necessitating the use of more sophisticated models like Bayesian
probability for in-depth investigations.
Discreteness: Due to the inherent discreteness of classical probability,
continuous probability distributions may not match it in some real-world
situations.

Real-world Applications
Weather Forecasting: Classical probability is used in weather forecasting
to estimate the likelihood of various weather conditions based on historical
data.
Quality Control: In manufacturing, classical probability is applied to assess the probability of defects in a production process, aiding in quality control.
Code 1:

# Rolling a fair six-sided die
die <- 1:6
probabilities <- rep(1/6, 6)  # Each face has equal probability

# Probability of rolling a 4
prob_4 <- probabilities[die == 4]
print(paste("Probability of rolling a 4:", prob_4))

# Simulating 10 rolls of the die
rolls <- sample(die, size = 10, replace = TRUE, prob = probabilities)
print("Simulated rolls:")
print(rolls)

# Uniform distribution between 0 and 1
x <- seq(0, 1, by = 0.01)
# PDF
pdf <- dunif(x, min = 0, max = 1)
# CDF
cdf <- punif(x, min = 0, max = 1)
# Random numbers
random_values <- runif(10, min = 0, max = 1)

# Plotting PDF and CDF
plot(x, pdf, type = "l", col = "blue", main = "Uniform Distribution", ylab = "Density")
lines(x, cdf, col = "red")
legend("bottomright", legend = c("PDF", "CDF"), col = c("blue", "red"), lty = 1)

# Normal distribution with mean = 0, sd = 1
x <- seq(-4, 4, by = 0.01)
# PDF
pdf <- dnorm(x, mean = 0, sd = 1)
# CDF
cdf <- pnorm(x, mean = 0, sd = 1)
# Random numbers
random_values <- rnorm(1000, mean = 0, sd = 1)

# Plotting PDF and CDF
plot(x, pdf, type = "l", col = "blue", main = "Normal Distribution", ylab = "Density")
lines(x, cdf, col = "red")
legend("bottomright", legend = c("PDF", "CDF"), col = c("blue", "red"), lty = 1)

# Histogram of Random Values
hist(random_values, probability = TRUE, col = "lightblue", main = "Histogram of Random Values")
lines(density(random_values), col = "red")

Probability Distribution
R makes it easy to draw probability distributions and demonstrate statistical concepts.
Some of the more common probability distributions available in R are given below.

Distribution     R name    Distribution         R name
Beta             beta      Lognormal            lnorm
Binomial         binom     Negative Binomial    nbinom
Cauchy           cauchy    Normal               norm
Chisquare        chisq     Poisson              pois
Exponential      exp       Student t            t
F                f         Uniform              unif
Gamma            gamma     Tukey                tukey
Geometric        geom      Weibull              weibull
Hypergeometric   hyper     Wilcoxon             wilcox
Logistic         logis

The functions available for each distribution follow this format:

Name        Description
dname()     density or probability function
pname()     cumulative distribution function
qname()     quantile function
rname()     random deviates
For example, pnorm(0) = 0.5 (the area under the standard normal curve to the left of zero). qnorm(0.9) = 1.28 (1.28 is the 90th percentile of the standard normal distribution). rnorm(100) generates 100 random deviates from a standard normal distribution.
Each function has parameters specific to that distribution. For example, rnorm(100, mean = 50, sd = 10) generates 100 random deviates from a normal distribution with mean 50 and standard deviation 10.
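The naming scheme above can be verified interactively; this sketch applies each prefix to the normal distribution and checks that the quantile function (q) inverts the cumulative distribution function (p):

```r
# d/p/q/r prefixes applied to the normal distribution
pnorm(0)                # 0.5: area left of zero under the standard normal
qnorm(0.9)              # ~1.2816: 90th percentile
dnorm(0)                # ~0.3989: density at zero, i.e. 1 / sqrt(2 * pi)
length(rnorm(100))      # 100 random deviates

# qnorm inverts pnorm
stopifnot(isTRUE(all.equal(qnorm(pnorm(1.5)), 1.5)))
```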

Output:
[1] "Simulated rolls:"
> print(rolls)
 [1] 3 1 5 2 2 5 6 4 3 3

[Plots produced: "Uniform Distribution" and "Normal Distribution", each showing the PDF (blue) and CDF (red), and "Histogram of Random Values" with a density curve overlaid]
Viva-Voce:

Q.1. What is classical probability?
Classical probability is a branch of probability theory that deals with events having equally likely outcomes. It forms the basis of probability theory and is widely used in statistics and data science.

Q.2. How can I use R for probability calculations?
R is a powerful programming language for statistical analysis and data manipulation. You can use R packages like 'prob' and 'gtools' to perform various probability calculations.
Q.3. What are some real-world applications of probability in data science?
Probability plays a crucial role in data science applications like risk assessment, predictive modelling, quality control, and decision-making under uncertainty.
Q.4. Can you recommend any additional resources for learning probability in R?
Certainly! There are numerous online courses, books, and tutorials available for learning probability in R. Some popular resources include Coursera's "Probability and Statistics in R," the book "Introduction to Probability" by Joseph K. Blitzstein and Jessica Hwang, and the online R documentation.
Q.5. What are the main challenges when working with classical probability in R?
Challenges in classical probability include making simplifying assumptions, handling limited realism, dealing with data quality issues, and addressing computationally intensive calculations. In such cases, alternative approaches like Bayesian probability or advanced machine learning techniques may be considered.
Experiment-8
Aim: To implement linear regression using R.

Software Used: R

Theory:
Regression analysis is a very widely used statistical tool to establish a relationship model between two variables. One of these variables is called the predictor variable, whose value is gathered through experiments. The other variable is called the response variable, whose value is derived from the predictor variable.

In linear regression these two variables are related through an equation, where the exponent (power) of both variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph. A non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is -
y = ax + b

Following is the description of the parameters used -


y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients.
Steps to Establish a Regression
A simple example of regression is predicting the weight of a person when his height is known. To do this we need to have the relationship between height and weight of a person.
The steps to create the relationship are -

Carry out the experiment of gathering a sample of observed values of height and corresponding weight.
Create a relationship model using the lm() function in R.
Find the coefficients from the model created and create the mathematical equation using these.
Get a summary of the relationship model to know the average error in prediction, also called residuals.
To predict the weight of new persons, use the predict() function in R.
lm() Function

This function creates the relationship model between the predictor and the response variable.

Syntax

The basic syntax for the lm() function in linear regression is -

lm(formula, data)

Following is the description of the parameters used -

formula is a symbol presenting the relation between x and y.
data is the vector on which the formula will be applied.
predict() Function
Syntax

The basic syntax for predict() in linear regression is -

predict(object, newdata)
Following is the description of the parameters used -
object is the formula which is already created using the lm() function.
newdata is the vector containing the new value for the predictor variable.

Code:
# Input Data
# Below is the sample data representing the observations -

# Values of height
# 151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight
# 63, 81, 56, 91, 47, 57, 76, 72, 62, 48
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Create Relationship Model & get the Coefficients
# Apply the lm() function.
relation <- lm(y ~ x)

print(relation)
print(summary(relation))

# Predict
# Find weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)

# Visualize the Regression Graphically

# Give the chart file a name.
png(file = "linearregression.png")

# Plot the chart.
plot(y, x, col = "blue", main = "Height & Weight Regression",
     cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Height in cm")
abline(lm(x ~ y))

# Save the file.
dev.off()

Output:
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   -38.4551       0.6746

> print(summary(relation))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-6.3002 -1.6629  0.0412  1.8944  3.9775

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509    8.04901  -4.778  0.00139 **
x             0.67461    0.05191  12.997 1.16e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06

> # Predict
> # Find weight of a person with height 170.
> a <- data.frame(x = 170)
> result <- predict(relation, a)
> print(result)
       1
76.22869

[Plot: "Height & Weight Regression" — scatter of weight (kg) vs height (cm) with fitted regression line]

Viva-Voce:
Q.1. What are the assumptionsof a linear regression model?

The assumptions of a linear regression model are:

The relationship between the independent and dependent variables is linear.


The residuals, or errors, are normally distributed with a mean of zero and a constant variance.
The independent variables are not correlated with each other (i.e. they are not
collinear).
The residuals are independent of each other (i.e. they are not autocorrelated).
The model includes all the relevant independent variables needed to accurately
predict the dependent variable.
Q.2. What is multicollinearity and how does it affect linear regression analysis?
Multicollinearity refers to a situation in which two or more independent variables in a linear regression model are highly correlated with each other. This can create problems in the regression analysis, as it can be difficult to determine the individual effects of each independent variable on the dependent variable.

When two or more independent variables are highly correlated, it becomes difficult to isolate the effect of each variable on the dependent variable. The regression model may indicate that both variables are significant predictors of the dependent variable, but it can be difficult to determine which variable is actually responsible for the observed effect.
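A small simulated sketch (all data here is invented for illustration) shows the symptom: two nearly identical predictors have a correlation close to 1, and lm() then struggles to apportion the effect between them:

```r
# Simulate two highly collinear predictors
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)    # x2 is almost a copy of x1
y  <- 3 * x1 + rnorm(100)

print(cor(x1, x2))                  # very close to 1
fit <- lm(y ~ x1 + x2)
# Standard errors on x1 and x2 are inflated, so neither coefficient is
# estimated precisely even though x1 alone truly drives y
print(summary(fit)$coefficients)
```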

Q.3. What are the common techniques used to improve the accuracy of a linear regression model?

Feature selection: selecting the most relevant features for the model to improve its predictive power.
Feature scaling: scaling the features to a similar range to prevent bias towards
certain features.
Regularization: adding a penalty term to the model to prevent overfitting and
improve generalization.
Cross-validation: dividing the data into multiple partitions and using a different
partition for validation in each iteration to avoid overfitting.
Ensemble methods: combining multiple models to improve the overall accuracy
and reduce variance.

Q.4. What is a residual in linear regression and how is it used in model evaluation?
In linear regression, a residual is the difference between the predicted value of the dependent variable (based on the model) and the actual observed value. It is used to evaluate the performance of the model by measuring how well the model fits the data. If the residuals are small and evenly distributed around the mean, it indicates that the model is a good fit for the data. However, if the residuals are large and not evenly distributed, it indicates that the model may not be a good fit for the data and may need to be improved or refined.
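Using the height-weight data from the Code section above, the residuals can be extracted and inspected directly:

```r
# Refit the height-weight model and examine its residuals
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

res <- residuals(relation)            # observed y minus fitted values
stopifnot(isTRUE(all.equal(unname(res), unname(y - fitted(relation)))))
stopifnot(abs(mean(res)) < 1e-10)     # OLS residuals average to (near) zero
print(summary(res))                   # matches the Residuals block of summary(relation)
```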

Q.5. What is heteroscedasticity?

Heteroscedasticity is a statistical term that refers to the unequal variance of the error
terms (or residuals) in a regression model. In a regression model, the
residuals
represent the difference between the observed values and the predicted values of the
dependent variable. When heteroscedasticity occurs:
The variance of the error terms is not constant across the range of the independent variables.
Error terms tend to be larger for some values of the independent variables than for others.
This can result in biased and inconsistent estimates of the regression coefficients
and standard errors, which can affect the accuracy of the statistical inferences and
predictions made from the model.

Heteroscedasticity can be caused by a number of factors, including:


Outliers
Omitted variables
Measurement errors
Nonlinear relationships between the variables
