Stats Lab1
Descriptive statistics summarizes the data and usually focuses on the distribution, the
central tendency, and the dispersion of the data. The distribution can be a normal
distribution, a binomial distribution, or another distribution such as the Bernoulli
distribution. The binomial and normal distributions are the most widely used,
especially the normal distribution. When exploring data, and in many statistical
tests, you will usually check the normality of the data, that is, how plausible it is
that the data is normally distributed. The Central Limit Theorem states that the
distribution of the sample mean approaches a normal distribution as the sample size
increases, regardless of whether the samples are drawn from a normal distribution
(provided the population has a finite variance). Central tendency, not to be confused
with the Central Limit Theorem, describes the data with respect to its center.
Measures of central tendency are the mean, median, and mode. Dispersion describes
the spread of the data; measures of dispersion are the variance, standard deviation,
and interquartile range. Descriptive statistics summarizes the data set, gives us a
feel for the data and variables, and helps us decide whether to use inferential
statistics to identify the relationship between data sets or regression analysis to
identify the relationships between variables.
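The Central Limit Theorem can be checked with a small simulation. The sketch below is only an illustration: the exponential population, the sample size, the number of repetitions, and all variable names are assumptions chosen for the example, not part of the lab data.

```r
# Illustrative simulation of the Central Limit Theorem (names are hypothetical).
set.seed(1)        # fixed seed so the run is reproducible
n_rep <- 10000     # number of repeated samples
n     <- 50        # size of each sample

# Draw n_rep samples of size n from a skewed population (exponential, mean 1)
# and keep the mean of each sample.
sample_means <- replicate(n_rep, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to the population mean, 1
sd(sample_means)     # close to 1 / sqrt(n), about 0.141
```

Plotting hist(sample_means) should show a roughly bell-shaped histogram even though the exponential population itself is strongly skewed.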
To read Excel files in R, first load the xlsx package:
> require("xlsx");
Loading required package: xlsx
To read the Excel file, you can use the read.xlsx() function:
> data <- read.xlsx(file="data.xlsx", 1);
file is the location of the Excel file. 1 refers to sheet number 1. To view the data
variable, you can use the View() function or click the data variable in the
Environment pane of RStudio, as shown in Figure.
To look for the documentation of read.xlsx(), you can use the following code.
> help(read.xlsx);
After importing the data, you may need to do some simple data processing like
selecting data, sorting data, filtering data, getting unique values, and removing
missing values.
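The processing steps listed above can be sketched with base R. The data frame df and its columns id and score are hypothetical and exist only for this illustration.

```r
# Hypothetical data frame used only for illustration.
df <- data.frame(id    = c(3, 1, 2, 2, 4),
                 score = c(7.5, NA, 6.0, 6.0, 9.1))

scores   <- df$score                                  # select one column
sorted   <- df[order(df$id), ]                        # sort rows by id
filtered <- df[!is.na(df$score) & df$score > 6.5, ]   # keep rows with score > 6.5
ids      <- unique(df$id)                             # distinct id values
complete <- na.omit(df)                               # drop rows containing NA
```

The same indexing patterns should carry over to an imported data set once its real column names are substituted.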
If the data is stored in a CSV file, you can read it with the read.csv() function:
> data=read.csv("C:/Users/dkalp/OneDrive/Desktop/spreadsheet.csv")
Mean, median, and mode are the most common measures for central
tendency. Central tendency is a measure that best summarizes the data
and is a measure that is related to the center of the data set.
Mode
The mode is the value in the data that has the highest frequency. It is especially
useful when the values are non-numeric (categorical), where a mean or median
cannot be computed.
> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
To get the mode of a vector, you create a frequency table:
> y <- table(A);
> y;
A
1 2 3 4 5 6 7 8
1 1 1 1 3 1 1 1
You want to get the highest frequency, so you use the following to get
the mode:
> names(y)[which(y==max(y))];
[1] "5"
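The frequency-table approach can be wrapped in a small helper so that it is reusable and also reports ties. The function name get_mode is hypothetical (base R has no built-in function for the statistical mode), and the result is returned as character strings because it comes from the table names.

```r
# Hypothetical helper: returns every value that attains the highest frequency.
get_mode <- function(v) {
  freq <- table(v)                         # frequency table of the values
  names(freq)[which(freq == max(freq))]    # names of the most frequent values
}

get_mode(c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8))  # "5"
get_mode(c("a", "b", "b", "a", "c"))       # "a" "b" (two modes)
```

For numeric data the result can be converted back with as.numeric() if needed.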
Median
The median is the middle value (midpoint) of the sorted data and is also the 50th
percentile of the data. Unlike the mean, the median is only weakly affected by
outliers and skewness. The mean is the average, which is liable to be influenced
by outliers, so the median is a better measure of centrality when the data is
skewed.
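A short comparison makes the difference in sensitivity concrete. The vectors v and v_out are made-up values for illustration, not lab data.

```r
v     <- c(1, 2, 3, 4, 5)   # illustrative values, not from the lab data
v_out <- c(v, 100)          # the same values plus one extreme outlier

mean(v)         # 3
median(v)       # 3
mean(v_out)     # 19.16667: pulled strongly toward the outlier
median(v_out)   # 3.5: barely moves
```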
A typical problem occurs when the data contains NAs (missing values). Let's modify
our example vector A to simulate such a situation:
> B=c(A,NA)
> B
[1] 1 2 3 4 5 5 5 6 7 8 NA
Our new example vector looks exactly the same as the first example vector, but this
time with an NA value at the end. Let's see what happens when we apply the mean()
function:
> mean(B)
[1] NA
The result is NA because any calculation that involves an NA returns NA. Setting
na.rm=TRUE removes the NA values before the mean is computed:
> mean(B,na.rm=TRUE)
[1] 4.6
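The same behaviour applies to other summary functions. The sketch below rebuilds the vector B from the example and shows how to count the missing values and how median() handles them.

```r
A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8)
B <- c(A, NA)

sum(is.na(B))             # 1: number of missing values
median(B)                 # NA
median(B, na.rm = TRUE)   # 5
mean(B[!is.na(B)])        # 4.6, same as mean(B, na.rm = TRUE)
```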
R code:
> x=c(18,19,19,19,19,20,20,20,20,20,21,21,21,21,22,23,24,27,30,36)
> mean(x) #mean
[1] 22
> median(x) #median
[1] 20.5
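> y=x[x<25] #drop the values of 25 and above (possible outliers)
> md=median(y) #median without those values
> md
[1] 20
> xr=table(x) #frequency table, used to find the mode
> mode=which(xr==max(xr))
> mode #the name (20) is the mode; 3 is its position in the table
20
3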
Measures of central tendency for a frequency table:
Here x holds the class midpoints and f holds the class frequencies.
> x=seq(147.5,182.5,5)
> x
[1] 147.5 152.5 157.5 162.5 167.5 172.5 177.5 182.5
> f=c(4,6,28,58,64,30,5,5)
> mean=sum(x*f)/sum(f)
> mean
[1] 165.175
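The grouped mean is a frequency-weighted average of the class midpoints, so base R's weighted.mean() should give the same result. The names mid, freq, m1, and m2 are chosen here only to avoid overwriting the lab's variables.

```r
mid  <- seq(147.5, 182.5, 5)            # class midpoints (x in the lab code)
freq <- c(4, 6, 28, 58, 64, 30, 5, 5)   # class frequencies (f in the lab code)

m1 <- sum(mid * freq) / sum(freq)       # weighted sum divided by total frequency
m2 <- weighted.mean(mid, freq)          # base R equivalent
m1                                      # 165.175
```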
For the median:
> cl=cumsum(f) #cumulative frequencies
> cl
[1] 4 10 38 96 160 190 195 200
> N=sum(f)
> N
[1] 200
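> ml=min(which(cl>N/2)) #index of the median class
> ml
[1] 5
> h=5 #class width
> h
[1] 5
> fm=f[ml] #frequency of the median class
> fm
[1] 64
> cf=cl[ml-1] #cumulative frequency before the median class
> cf
[1] 96
> l=x[ml]-h/2 #lower boundary of the median class
> l
[1] 165
> median=l+(((N/2)-cf)/fm)*h #median
> median
[1] 165.3125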
To find Quartile 1:
> Q1=min(which(cl>N/4))
> Q1
[1] 4
> fq1=f[Q1]
> fq1
[1] 58
> cf1=cl[Q1-1]
> cf1
[1] 38
> l=x[Q1]-h/2
> l
[1] 160
> quartile1=l+(((N/4)-cf1)/fq1)*h
> quartile1
[1] 161.0345
To find Quartile 3:
> Q3=min(which(cl>3*N/4))
> Q3
[1] 5
> fq3=f[Q3]
> fq3
[1] 64
> cf2=cl[Q3-1]
> cf2
[1] 96
> l=x[Q3]-h/2
> l
[1] 165
> quartile3=l+(((3*N/4)-cf2)/fq3)*h
> quartile3
[1] 169.2188
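The median, Quartile 1, and Quartile 3 calculations follow the same pattern and differ only in the fraction of N that is used. A minimal sketch of a single helper is shown below. The function grouped_quantile and its argument names are hypothetical, and it assumes equal class widths. It also treats the cumulative frequency as 0 when the target class is the first one, a case the step-by-step code does not cover.

```r
# Hypothetical helper for quantiles of grouped data with equal class widths.
# mid: class midpoints, f: class frequencies, h: class width, p: fraction (0 to 1)
grouped_quantile <- function(mid, f, h, p) {
  N  <- sum(f)                        # total frequency
  cl <- cumsum(f)                     # cumulative frequencies
  k  <- min(which(cl > p * N))        # index of the class containing the quantile
  cf <- if (k > 1) cl[k - 1] else 0   # cumulative frequency before that class
  l  <- mid[k] - h / 2                # lower boundary of that class
  l + ((p * N - cf) / f[k]) * h       # linear interpolation inside the class
}

mid  <- seq(147.5, 182.5, 5)
freq <- c(4, 6, 28, 58, 64, 30, 5, 5)
grouped_quantile(mid, freq, 5, 0.50)   # 165.3125 (median)
grouped_quantile(mid, freq, 5, 0.25)   # 161.0345 (Quartile 1)
grouped_quantile(mid, freq, 5, 0.75)   # 169.2188 (Quartile 3)
```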
Mode:
> m=which(f==max(f))
> m
[1] 5
> f0=f[m]
> f0
[1] 64
> f1=f[m-1]
> f1
[1] 58
> f2=f[m+1]
> f2
[1] 30
> l=x[m]-h/2
> l
[1] 165
> mode=l+((f0-f1)/(2*f0-f1-f2))*h
> mode
[1] 165.75
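The grouped mode steps can be collected into one function in the same way. The name grouped_mode is hypothetical. The sketch assumes equal class widths and a single modal class, and it treats a missing neighbouring class as having frequency 0.

```r
# Hypothetical helper for the mode of grouped data with equal class widths.
# mid: class midpoints, f: class frequencies, h: class width
grouped_mode <- function(mid, f, h) {
  m  <- which.max(f)                          # index of the (first) modal class
  f0 <- f[m]                                  # frequency of the modal class
  f1 <- if (m > 1) f[m - 1] else 0            # frequency of the class before it
  f2 <- if (m < length(f)) f[m + 1] else 0    # frequency of the class after it
  l  <- mid[m] - h / 2                        # lower boundary of the modal class
  l + ((f0 - f1) / (2 * f0 - f1 - f2)) * h
}

grouped_mode(seq(147.5, 182.5, 5), c(4, 6, 28, 58, 64, 30, 5, 5), 5)   # 165.75
```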
Range
The range is the difference between the largest and smallest points in the
data.
To find the range in R, you can use the range() function, which returns the smallest
and largest values:
> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> range(A);
[1] 1 8
To get the difference between the max and the min, you can use diff():
> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> res <- range(A);
> diff(res);
[1] 7
You can use the min() and max() functions to find the range also:
> A <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8);
> min(A);
[1] 1
> max(A);
[1] 8
> max(A) - min(A);
[1] 7
To get the range for a data set:
> res <- range(data$x2);
> diff(res);
[1] 10.65222
> res <- range(data$x2, na.rm=TRUE);
> diff(res);
[1] 10.65222
na.rm is a logical value that states whether NA values should be removed before the
range is computed.
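The data set example returns the same range with and without na.rm, so a vector that actually contains an NA shows the effect more clearly. The vector B below is an illustrative example, not the lab data set.

```r
B <- c(1, 2, 3, 4, 5, 5, 5, 6, 7, 8, NA)

range(B)                       # NA NA
range(B, na.rm = TRUE)         # 1 8
diff(range(B, na.rm = TRUE))   # 7
```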
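Interquartile Range
The interquartile range (IQR) is the difference between the third quartile (75th
percentile) and the first quartile (25th percentile) of the data.
Example:
An entomologist studying morphological variation in a species of mosquito recorded
the following data on body length: 1.2, 1.4, 1.3, 1.6, 1.0, 1.5, 1.7, 1.1, 1.2, 1.3.
Compute all the measures of dispersion.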
> x=c(1.2,1.4,1.3,1.6,1.0,1.5,1.7,1.1,1.2,1.3)
> x
[1] 1.2 1.4 1.3 1.6 1.0 1.5 1.7 1.1 1.2 1.3
> res=range(x)
> res
[1] 1.0 1.7
> diff(res)
[1] 0.7
> var(x) # Variance
[1] 0.049
> sd(x) # standard deviation
[1] 0.2213594
> quantile(x)
0% 25% 50% 75% 100%
1.000 1.200 1.300 1.475 1.700
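The interquartile range itself can be read off the quantile() output as the 75% value minus the 25% value, or computed directly with the base R function IQR(). A short sketch on the same body-length data:

```r
x <- c(1.2, 1.4, 1.3, 1.6, 1.0, 1.5, 1.7, 1.1, 1.2, 1.3)

q <- quantile(x, c(0.25, 0.75))   # first and third quartiles
unname(q[2] - q[1])               # 0.275
IQR(x)                            # 0.275
```

Both lines rely on R's default quantile method, so they agree with the quartiles printed by quantile(x).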