R Notes1 Merged

R is a programming language and environment for statistical analysis, data visualization, and reporting, developed by Ross Ihaka and Robert Gentleman. It is widely used in various fields such as academia, healthcare, finance, and machine learning, with capabilities for data manipulation, creating vectors, lists, matrices, and data frames. R supports multiple data types and operators, allowing for efficient data analysis and representation.

Introduction to R

R is a programming language and software environment for statistical analysis, graphics
representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand, and is currently developed by the R Development
Core Team.

What is R used for?
•Statistical inference
•Data analysis
•Data visualization
•Reporting
•Machine learning algorithms
R is used in areas such as
•Academia
•Health care
•Finance
•Consulting
•Energy

Communicate with R
R offers multiple ways to present and share work, for example through an R Markdown
document or a Shiny app.

Reserved keywords in R can be viewed with ?Reserved

TRUE and FALSE are the logical constants in R.

NULL represents the absence of a value or an undefined value.

Inf is for “Infinity”, for example when 1 is divided by 0

NaN is for “Not a Number”, for example when 0 is divided by 0.

NA stands for “Not Available” and is used to represent missing values.

R is a case sensitive language.
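The special values above can be tried directly at the console. This is a minimal sketch using only base R; the numbers are chosen for illustration.

```r
1 / 0             # Inf
0 / 0             # NaN
is.nan(0 / 0)     # TRUE
is.na(NA)         # TRUE
is.na(NaN)        # TRUE: NaN also counts as missing
is.null(NULL)     # TRUE
length(NULL)      # 0: NULL holds no value at all
length(NA)        # 1: NA is a placeholder for one missing value

# Missing values propagate unless they are removed explicitly
mean(c(1, NA, 3))                # NA
mean(c(1, NA, 3), na.rm = TRUE)  # 2
```

Because R is case sensitive, TRUE is the logical constant, while true would be treated as an ordinary (undefined) name.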


Variables in R

Variables are used to store data whose value can be changed as needed. The unique name
given to a variable (or to a function or other object) is called an identifier.

Rules for writing Identifiers in R

1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).

2. It must start with a letter or a period. If it starts with a period, it cannot be followed by a
digit.

3. Reserved words in R cannot be used as identifiers.

Valid identifiers in R

total, Sum, .fine.with.dot, this_is_acceptable, Number5

Invalid identifiers in R

tot@l, 5um, _fine, TRUE, .0ne
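As a rough check of these rules, base R's make.names() returns its input unchanged only when the input is already a syntactically valid name. The helper is_valid_identifier below is a hypothetical name introduced for this sketch; it is not part of the original notes.

```r
# Hypothetical helper: a name is valid if make.names() leaves it unchanged
is_valid_identifier <- function(x) make.names(x) == x

valid   <- c("total", "Sum", ".fine.with.dot", "this_is_acceptable", "Number5")
invalid <- c("tot@l", "5um", "_fine", "TRUE", ".0ne")

is_valid_identifier(valid)    # all TRUE
is_valid_identifier(invalid)  # all FALSE
```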

Constants in R

Constants, as the name suggests, are entities whose value cannot be altered.

1. Numeric Constants

All numbers fall under this category. They can be of type integer, double or complex.

2. Character Constants

Character constants can be represented using either single quotes (') or double quotes (") as
delimiters.

3. Built-in Constants

R also defines some built-in constants, for example LETTERS, letters, pi, month.name and
month.abb. However, these are implemented as ordinary variables, so their values can be
changed, and it is not good practice to rely on them.
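A short sketch of the three kinds of constants follows. typeof() reports how R stores each value; the L suffix for integers and the i suffix for complex numbers are standard R syntax.

```r
typeof(5)       # "double"  (numbers are double by default)
typeof(5L)      # "integer" (the L suffix requests an integer)
typeof(3 + 2i)  # "complex"
typeof("R")     # "character"

identical('R', "R")  # TRUE: single and double quotes are equivalent

pi              # built-in constant, approximately 3.141593
LETTERS[1:3]    # "A" "B" "C"
letters[1:3]    # "a" "b" "c"
```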


Object

An object is a data structure that has attributes and methods which act on those attributes.
Common attributes of R objects include:

• names
• dimnames
• dim
• class

Data Type

A data type is a data storage format that can contain a specific type or range of values.
The basic data types in R are:

•character
•numeric (real or decimal)
•integer
•logical
•complex

R Operators

Assignment Operators in R

Operator Description

<-, <<-, = Leftwards assignment

->, ->> Rightwards assignment
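The following sketch shows each assignment form. The variable names are illustrative only. Note that <<- assigns in an enclosing environment rather than the local one, which is why it is mostly seen inside functions.

```r
a <- 5          # leftwards assignment (the most common style)
b = 6           # also leftwards, allowed at the top level
7 -> d          # rightwards assignment

counter <- 0
increment <- function() {
  counter <<- counter + 1   # updates counter in the enclosing environment
}
increment()
increment()
counter         # 2
```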

Arithmetic Operators in R

Operator Description

+ Addition

- Subtraction

* Multiplication
/ Division

^ Exponent

%% Modulus (Remainder from division)

%/% Integer Division

Relational Operators in R

Operator Description

< Less than

> Greater than

<= Less than or equal to

>= Greater than or equal to

== Equal to

!= Not equal to
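The arithmetic and relational operators listed above can be sketched as follows; the operands are arbitrary values chosen for illustration.

```r
7 + 2     # 9
7 - 2     # 5
7 * 2     # 14
7 / 2     # 3.5
2 ^ 3     # 8
7 %% 2    # 1  (remainder)
7 %/% 2   # 3  (integer division)

# Relational operators work element by element and return logical values
v <- c(1, 5, 10)
v > 4     # FALSE TRUE TRUE
v == 5    # FALSE TRUE FALSE
v != 5    # TRUE FALSE TRUE
```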

Operator Precedence in R

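When multiple operators are involved in an expression, operator precedence determines the
order in which they are evaluated. Multiplication has higher precedence than addition, so
6 * 5 is evaluated first:

> 2 + 6 * 5
[1] 32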

Operator Associativity

It is possible to have multiple operators of the same precedence in an expression. In such a
case the order of execution is determined through associativity. The / operator associates
from left to right, so 3 / 4 / 5 is evaluated as (3 / 4) / 5:

> 3 / 4 / 5
[1] 0.15

> 3 / (4 / 5)
[1] 3.75
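A few further checks of precedence and associativity, as a sketch. The exponent operator is an exception to the left-to-right rule: it associates from right to left and binds more tightly than unary minus.

```r
2 + 6 * 5      # 32
(2 + 6) * 5    # 40
3 / 4 / 5      # 0.15, evaluated as (3 / 4) / 5
2 ^ 3 ^ 2      # 512,  evaluated as 2 ^ (3 ^ 2)
-2 ^ 2         # -4,   evaluated as -(2 ^ 2)
```

When in doubt, parentheses make the intended order explicit.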
Read Data

# Read tab separated values
read.delim(file.choose())

# Read comma (",") separated values
read.csv(file.choose())

# General form for reading a delimited text file
read.table(file, header = FALSE, sep = "", dec = ".")

# Read "comma separated value" files (".csv")
read.csv(file, header = TRUE, sep = ",", dec = ".", ...)

Write data

write.table(x, file, append = FALSE, sep = " ", dec = ".", row.names = TRUE,
col.names = TRUE)

# Save and restore a single R object
saveRDS(object, file)
my_data <- readRDS(file)

# Save one object in RData format
save(data1, file = "data.RData")

# Save multiple objects
save(data1, data2, file = "data.RData")

# To load the data again
load("data.RData")

For the entire workspace

save.image(file = "my_work_space.RData")

# To restore your workspace, type this:
load("my_work_space.RData")
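A self-contained round trip, as a sketch: it writes a small data frame to temporary files and reads it back. The data frame df and the use of tempfile() are assumptions made for illustration; in practice you would supply your own paths.

```r
df <- data.frame(id = 1:3, score = c(9.5, 7.25, 8))

# CSV round trip
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
df_csv <- read.csv(csv_path)

# RDS round trip (stores the R object itself, including column types)
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
df_rds <- readRDS(rds_path)

identical(df, df_rds)   # TRUE
all.equal(df, df_csv)   # TRUE
```

write.csv() is a convenience wrapper around write.table(); row.names = FALSE keeps the row numbers out of the file.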
R Tutorials

Data Structure

Data structures are ways of arranging data so that it can be used efficiently in a computer.

Data structures are used to store multiple values. R has many data structures. These include

•vector

•list

•matrix

•data frame

•factors

Vector

A vector is a basic data structure in R. It contains elements of the same type. These
elements are known as the components of a vector.

How to create vector in R?

There are numerous ways to create an R vector:
1. Using c() Function
To create a vector, we use the c() function:
Code:
> vec <- c(1,2,3,4,5) #creates a vector named vec
> vec
2. Using assign() function
Another way to create a vector is the assign() function.
Code:
> assign("vec2",c(6,7,8,9,10)) #creates a vector named vec2
> vec2
3. Using : operator
An easy way to make integer vectors is to use the : operator.
Code:
> vec3 <- 1:20
> vec3

By- Satish 1 PGDBDA


Dr Hari singh Gour University
4. Using seq() function
The seq() function generates a sequence of numbers between two values.
> seq(5,15)
[1] 5 6 7 8 9 10 11 12 13 14 15

Generating Repeated Sequence


The rep() (replicate) function repeats a value or a vector to build longer vectors. The call
form is rep(x, times); the each argument repeats every element in turn instead.
> x <- rep(6,4)
> rep(c(11,12,13),3)
> rep(1:2,3)
> rep(c(11,12,13),each=2)
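The results of these calls are shown below as a sketch, together with two further seq() arguments (by and length.out) that are part of base R but not mentioned in the notes.

```r
seq(5, 15)                 # 5 6 7 8 9 10 11 12 13 14 15
seq(5, 15, by = 5)         # 5 10 15
seq(0, 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00

rep(6, 4)                     # 6 6 6 6
rep(c(11, 12, 13), 3)         # 11 12 13 11 12 13 11 12 13
rep(1:2, 3)                   # 1 2 1 2 1 2
rep(c(11, 12, 13), each = 2)  # 11 11 12 12 13 13
```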

What are the types of vectors in R?


A vector can be of different types depending on the elements it contains. These may be:
1. Numeric Vectors
Vectors containing numeric values.
> num_vec <- c(1,2,3,4,5)
> num_vec
2. Integer Vectors
Vectors containing integer values.
> int_vec <- c(6L,7L,8L,9L,10L)
> int_vec
3. Logical Vectors
Vectors containing logical values of TRUE or FALSE.
> log_vec <- c(TRUE,FALSE,TRUE,FALSE,FALSE)
> log_vec
4. Character Vectors
Vectors containing text.
> char_vec <- c("aa","bb","cc","dd","ee")
> char_vec

5. Complex Vectors
Vectors containing complex values.
> comp_vec <- c(12+1i,3i,5+4i,4+9i,6i)
> comp_vec

How to combine R vectors?

The c() function can also combine two or more vectors and add elements to vectors.
> vec4 <- c(vec, vec2)

What is coercion in R vectors?

Vectors only hold elements of the same data type. If there is more than one data type,
the c() function converts the elements to a common type. This is known as coercion. The
conversion takes place from lower to higher types:
logical < integer < double < complex < character.
Code:
> vec6 <- c(1,FALSE,3L,12+5i,"hello")
> typeof(vec6)
[1] "character"
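Each step of the coercion order can be observed by mixing two types at a time. This is a minimal sketch; the values are arbitrary.

```r
typeof(c(TRUE, 2L))        # "integer"
typeof(c(2L, 3.5))         # "double"
typeof(c(3.5, 1+2i))       # "complex"
typeof(c(1+2i, "hello"))   # "character"

c(1, FALSE, 3L)            # 1 0 3  (FALSE becomes 0)
sum(c(TRUE, FALSE, TRUE))  # 2      (TRUE counts as 1)
```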

How to access elements of a vector?

Elements of a vector can be accessed using vector indexing. The vector used for indexing
can be a logical, integer or character vector.

Using integer vector as index

Vector indexing in R starts from 1, unlike most programming languages, where indexing
starts from 0.

> x
[1] 0 2 4 6 8 10
> x[3] # access 3rd element
[1] 4
> x[c(2, 4)] # access 2nd and 4th element
[1] 2 6
> x[-1] # access all but 1st element
[1] 2 4 6 8 10

1. Indexing with Integer Vector
An integer index selects elements by position, counting from 1.
> x[2] # returns the 2nd element
2. Indexing with Character Vector
Character vector indexing is useful for vectors with named elements
and can be done as follows:
> x <- c("One" = 1, "Two" = 2, "Three" = 3)
> x["Two"]
3. Indexing with Logical Vector
In logical indexing, the elements at positions where the index vector is TRUE are
returned. For example, in the code below, R returns the elements at positions 1 and 3,
where the logical index is TRUE.
> a <- c(1,2,3,4)
> a[c(TRUE, FALSE, TRUE, FALSE)]
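The three indexing styles can be compared side by side. In this sketch the vector x repeats the values used earlier, and named is a hypothetical variable name for the named vector.

```r
x <- c(0, 2, 4, 6, 8, 10)
x[3]            # 4
x[c(2, 4)]      # 2 6
x[-1]           # 2 4 6 8 10
x[x > 5]        # 6 8 10  (logical index built from a condition)

named <- c(One = 1, Two = 2, Three = 3)
named["Two"]              # the element named "Two", value 2
named[c("One", "Three")]  # values 1 and 3, names are kept

a <- c(1, 2, 3, 4)
a[c(TRUE, FALSE, TRUE, FALSE)]  # 1 3
```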

What is R List?
An R list is an object which contains elements of different types, such as strings, numbers,
vectors and even another list inside it.
Unlike an R vector, an R list can contain items of different data types. Named list elements
can be accessed using two-part names, written with the dollar sign $ in R.

vec <- c(1,2,3)
char_vec <- c("Hadoop", "Spark", "Flink", "Mahout")
logic_vec <- c(TRUE, FALSE, TRUE, FALSE)
out_list <- list(vec, char_vec, logic_vec)
out_list

R Predefined Vectors
Character vectors for letters and month names are predefined (they are atomic vectors
rather than lists):

letters
LETTERS
month.abb
month.name

Create Lists in R Programming


In this section, we will create an R list with the help of an example.
Let’s create a list containing strings, numbers, a vector and a logical value.
list_data <- list("Red", "White", c(1,2,3), TRUE, 22.4)
print(list_data)

Name List Elements in R


data_list <- list(c("Jan","Feb","Mar"), matrix(c(1,2,3,4,-1,9), nrow = 2), list("Red",12.3))
names(data_list) <- c("Monat", "Matrix", "Misc")
print(data_list)
Access R List Elements
The examples below use the names given to the elements of the list above.
Access the first element of the list.
print(data_list[1]) #Accessing the first element
Access the third element. As it is also a list, all its elements will print.
print(data_list[3]) #Accessing the third element
Access a list element by using its name.
print(data_list$Matrix) #Using the name to access the element
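One detail worth illustrating, as a sketch: single brackets return a sub-list, while double brackets and $ return the element itself. The list below mirrors data_list from the notes, with the names supplied at creation.

```r
data_list <- list(Monat = c("Jan", "Feb", "Mar"),
                  Matrix = matrix(c(1, 2, 3, 4, -1, 9), nrow = 2),
                  Misc = list("Red", 12.3))

class(data_list[1])          # "list": a list of length 1
class(data_list[[1]])        # "character": the element itself
data_list$Monat[2]           # "Feb"
data_list[["Matrix"]][2, 3]  # 9
data_list$Misc[[2]]          # 12.3
```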

Merge Lists in R Programming language


We can merge many lists into one list by passing all the lists to the c() function.
For example:

num_list <- list(1,2,3,4,5)
day_list <- list("Mon","Tue","Wed", "Thurs", "Fri")
merge_list <- c(num_list, day_list)
merge_list

Convert list to vector

int_list <- list(1:5)
print(int_list)

vec1 <- unlist(int_list) # converted to a vector
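The difference between merging with c() and nesting with list(), and the effect of unlist(), can be sketched as follows. The variable names merged and nested are assumptions made for this example.

```r
num_list <- list(1, 2, 3)
day_list <- list("Mon", "Tue", "Wed")

merged <- c(num_list, day_list)     # one flat list with 6 elements
length(merged)                      # 6

nested <- list(num_list, day_list)  # a list of 2 lists, not a merge
length(nested)                      # 2

unlist(list(1:5))                   # 1 2 3 4 5 (an integer vector)
unlist(merged)                      # character vector: the numbers are coerced to text
```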

R Matrix
In a matrix, values are arranged in a fixed number of rows and columns. Usually the
values are real numbers.
A matrix is a two-dimensional data structure in R programming.

A matrix is similar to a vector but additionally contains the dimension attribute.

How to create a matrix in R programming?

A matrix can be created using the matrix() function.

The dimensions of the matrix can be defined by passing appropriate values for the
arguments nrow and ncol.
Providing values for both dimensions is not necessary. If one of the dimensions is provided,
the other is inferred from the length of the data.

> matrix(1:9, nrow = 3, ncol = 3)

> matrix(1:9, nrow = 3)

It is possible to name the rows and columns of a matrix during creation by passing a
2-element list to the argument dimnames.

> x <- matrix(1:9, nrow = 3, dimnames = list(c("X","Y","Z"), c("A","B","C")))

> x

These names can be accessed or changed with two helpful functions, colnames() and
rownames().

Another way of creating a matrix is by using the functions cbind() and rbind(), as in column
bind and row bind.

> cbind(c(1,2,3),c(4,5,6))

> rbind(c(1,2,3),c(4,5,6))

How to access elements of a matrix?

We can access elements of a matrix using the square bracket [ indexing method. Elements
can be accessed as var[row, column]. Here row and column are index vectors; leaving one
of them empty selects every row or every column.
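A sketch of matrix creation and indexing using the same 3 x 3 matrix as above. Note that matrix() fills column by column unless byrow = TRUE is given.

```r
x <- matrix(1:9, nrow = 3,
            dimnames = list(c("X", "Y", "Z"), c("A", "B", "C")))

dim(x)               # 3 3
x[2, 3]              # 8
x[1, ]               # first row: 1 4 7
x[, "B"]             # column B:  4 5 6
x[c(1, 3), c(1, 3)]  # 2 x 2 sub-matrix of the corner elements

dim(cbind(c(1, 2, 3), c(4, 5, 6)))  # 3 2: the vectors become columns
dim(rbind(c(1, 2, 3), c(4, 5, 6)))  # 2 3: the vectors become rows
```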

R Data Frame
A data frame is used to hold tabular data. In particular, it is a data structure in R that
represents a number of observations (rows) of a number of measurements or variables
(columns).
A data frame is used for storing data tables. It is a list of vectors of equal length, where
each vector forms one column.

Characteristics of R Data Frame

Now, let’s discuss the characteristics of a data frame in R.
•The column names should be non-empty.
•The row names should be unique.
•A data frame can hold data of numeric, character or factor type.
•Each column should contain the same number of data items.

Create Data Frame

employee_data <- data.frame(
  employee_id = c(1:5),
  employee_name = c("James","Harry","Shinji","Jim","Oliver"),
  sal = c(642.3,535.2,681.0,739.0,925.26),
  join_date = as.Date(c("2013-02-04", "2017-06-21", "2012-11-14", "2018-05-19",
                        "2016-03-25")),
  stringsAsFactors = FALSE)

Get the Structure of the R Data Frame


The structure of the data frame can be seen by using the str() function.
> str(employee_data)

Extract data from Data Frame


A specific column can be extracted from the data frame by using the name of the column.
Extract specific columns:

> output <- data.frame(employee_data$employee_name, employee_data$employee_id)
> print(output)

Extract first three rows

> output <- employee_data[1:3,]
> print(output)

Add Column
Add the column vector using a new column name.
•Add the “dept” column

> employee_data$dept <- c("IT","Finance","Operations","HR","Administration")
> out <- employee_data
> print(out)
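The data frame operations above can be combined into one runnable sketch. It uses a shortened version of employee_data (without the join_date column) and adds a row filter on sal, which is an illustration not present in the notes.

```r
employee_data <- data.frame(
  employee_id = 1:5,
  employee_name = c("James", "Harry", "Shinji", "Jim", "Oliver"),
  sal = c(642.3, 535.2, 681.0, 739.0, 925.26),
  stringsAsFactors = FALSE)

dim(employee_data)                          # 5 3
employee_data[1:3, ]                        # first three rows
employee_data[, c("employee_name", "sal")]  # two columns selected by name
employee_data[employee_data$sal > 650, "employee_name"]  # "Shinji" "Jim" "Oliver"

employee_data$dept <- c("IT", "Finance", "Operations", "HR", "Administration")
ncol(employee_data)                         # 4
mean(employee_data$sal)                     # 704.552
```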

Factor in R

A factor is a data structure used for fields that take only a predefined, finite number of
values (categorical data). For example, a data field such as marital status may contain only
values from single, married, separated, divorced, or widowed.

How to create a factor in R?
We can create a factor using the function factor(). Levels of a factor are inferred from the
data if not provided.
> x <- factor(c("single", "married", "married", "single"))
> x

Factors are also created when non-numerical columns are read into a data frame with
stringsAsFactors = TRUE (this was the default before R 4.0.0).
Accessing components of a factor is very similar to accessing components of a vector.
> x[3]
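A sketch of how a factor stores its data: the distinct values become levels, and each element is stored as an integer code pointing at a level. The variable status is a hypothetical example showing explicitly supplied levels.

```r
x <- factor(c("single", "married", "married", "single"))

levels(x)           # "married" "single"  (sorted alphabetically by default)
nlevels(x)          # 2
as.integer(x)       # 2 1 1 2  (internal integer codes)
table(x)            # counts per level: married 2, single 2
as.character(x[3])  # "married"

# Supplying levels explicitly allows values that do not occur in the data
status <- factor(c("single", "married"),
                 levels = c("single", "married", "separated", "divorced", "widowed"))
nlevels(status)     # 5
```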

Statistics

Statistics is the discipline that concerns the collection, organization, analysis, interpretation
and presentation of data.

Statistics is the science of learning from data, and of measuring, controlling, and
communicating uncertainty; it thereby provides the navigation essential for controlling
the course of scientific and societal advances.

Variable: A variable in the mathematical sense, i.e. a quantity which may take any one of
a specified set of values.

Population: In statistics, a population is the entire pool from which a statistical sample is
drawn. A population can be described as an aggregate of subjects grouped together by a
common feature.

Sample: A sample is a smaller group of members of a population selected to represent the
population.

A parameter is a characteristic of a population. A statistic is a characteristic of a sample.

There are two basic forms: descriptive statistics and inferential statistics.

• Descriptive Statistics is primarily about summarizing a given data set through numerical
summaries and graphs, and can be used for exploratory analysis to visualize the information
contained in the data and suggest hypotheses. Interactive computer graphics are now
regularly used to display numerical information.

• Inferential Statistics is concerned with methods for drawing conclusions about a
population using information from a sample, and assessing the reliability of, and uncertainty
in, these conclusions. This allows us to make judgements in the presence of uncertainty and
variability, which is extremely important in underpinning evidence-based decision making
in science, government, business etc.

Descriptive Statistics

A descriptive statistic is a summary statistic that quantitatively describes or summarizes
features from a collection of information.

Post Graduate Diploma in Big Data Analytics Statistics Notes


Dr Hari Singh Gaur University( A Centeral University) By- Satish
Descriptive statistics involves summarizing and organizing the data so they can be easily
understood. Descriptive statistics, unlike inferential statistics, seeks to describe the data, but
does not attempt to make inferences from the sample to the whole population. Here, we
typically describe the data in a sample.

Descriptive statistics can be categorised as follows:

• Measures of Central Tendency: Central tendency refers to the idea that there is one
number that best summarizes the entire set of measurements, a number that is in
some way “central” to the set.
  • Mean
  • Median
  • Mode

• Measures of Dispersion: Dispersion (or spread) refers to the idea of variability within
your data. A measure of dispersion is a statistic that tells you how dispersed, or spread
out, the data values are.
  • Range
  • Variance
  • Standard Deviation
  • Quartiles

• Measures of Association: These describe relationships between variables.
  • Correlation

• Measures of Shape: These describe the distribution (or pattern) of the data within a
dataset.
  • Skewness
  • Kurtosis

Mean / Average
The mean, or average, is a measure of the central tendency of the data, i.e. a number around
which the whole data set is spread out. In a way, it is a single number which can represent
the whole data set.

Let’s calculate the mean of a data set having 10 integers.

Ex 50, 67, 35, 46, 21, 77, 92, 46, 88, 63

n = 10
Sum = 585
Mean = 585/10 = 58.5
Median
The median is the value which divides the data into 2 equal parts, i.e. the number of terms
on the right side of it is the same as the number of terms on the left side of it when the data
is arranged in either ascending or descending order.

The median is the middle term if the number of terms is odd.

The median is the average of the middle 2 terms if the number of terms is even.

Ex 50, 67, 35, 46, 21, 77, 92, 46, 88, 63

First, let’s arrange them in ascending order:
21, 35, 46, 46, 50, 63, 67, 77, 88, 92

Here positions 5 and 6 are in the middle. Therefore, to get the median we average the two
middle values by adding them and dividing by 2.

Median = (50+63)/2 = 56.5

Mode

The mode is the term appearing the maximum number of times in the data set, i.e. the term
that has the highest frequency.

Ex 50, 67, 35, 46, 21, 77, 92, 46, 88, 63

For our example above, the most frequently occurring number is 46.

A data set may have no mode at all if all values appear the same number of times. If two
values appear equally often and more often than the rest of the values, then the data set is
bimodal. If three values appear equally often and more often than the rest of the values,
then the data set is trimodal. In general, a data set with several modes is multimodal.
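The three measures can be checked in R. mean() and median() are base functions; base R has no built-in function for the statistical mode (mode() reports the storage type instead), so the sketch derives it from a frequency table. The helper name stat_mode is an assumption made for this example.

```r
x <- c(50, 67, 35, 46, 21, 77, 92, 46, 88, 63)

sum(x)      # 585
mean(x)     # 58.5
median(x)   # 56.5

# Hypothetical helper: returns every value that attains the highest frequency
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(x)  # 46
```

Because stat_mode returns all values tied for the highest count, it also covers the bimodal and multimodal cases described above.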

Range

The range is just the maximum value minus the minimum value.

For our example above, the highest value is 92 and the lowest is 21. So the range is

92 - 21 = 71

Variance
Variance is the most commonly used measure of dispersion. It is calculated by taking the
average of the squared differences between each value and the mean. For a sample, the sum
of squared differences is divided by n-1 rather than n.

Standard Deviation

The standard deviation is a widely used measure of dispersion because it shows how the
different values in the set relate to the mean, in the same units as the data.

Let’s look at our example:

21, 35, 46, 46, 50, 63, 67, 77, 88, 92

To compute the standard deviation, we first need to find the differences between the values
and the mean.

21-58.5 = -37.5
35-58.5 = -23.5
46-58.5 = -12.5
46-58.5 = -12.5
50-58.5 = -8.5
63-58.5 = 4.5
67-58.5 = 8.5
77-58.5 = 18.5
88-58.5 = 29.5
92-58.5 = 33.5

Note that all the values above the mean have positive discrepancies while the values below
the mean have negative ones.
The next step is to square all the discrepancies:

-37.5 × -37.5 = 1406.25
-23.5 × -23.5 = 552.25
-12.5 × -12.5 = 156.25
-12.5 × -12.5 = 156.25
-8.5 × -8.5 = 72.25
4.5 × 4.5 = 20.25
8.5 × 8.5 = 72.25
18.5 × 18.5 = 342.25
29.5 × 29.5 = 870.25
33.5 × 33.5 = 1122.25

Now we need to determine the variance.
We get this by finding the sum of the squares of the discrepancies (sum of squares) and then
dividing it by (n-1).

Variance = Sum of Squares / (n-1)
= 4770.5/9
= 530.0556

Our standard deviation is now the square root of the variance:

SQRT(530.0556) = 23.02294
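The hand calculation can be reproduced step by step in R, as a sketch. var() and sd() are base functions and both use the n - 1 divisor, matching the formula above.

```r
x <- c(21, 35, 46, 46, 50, 63, 67, 77, 88, 92)

deviations <- x - mean(x)     # differences from the mean (58.5)
sum_sq <- sum(deviations^2)   # 4770.5
sum_sq / (length(x) - 1)      # 530.0556, the sample variance

var(x)                        # 530.0556
sd(x)                         # 23.02294
diff(range(x))                # 71, the range as a single number
```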

Quartiles

In statistics and probability, quartiles are values that divide your data into quarters, provided
the data is sorted in ascending order.

There are three quartile values. The first quartile is at the 25th percentile, the second
quartile is at the 50th percentile and the third quartile is at the 75th percentile. The second
quartile (Q2) is the median of the whole data. The first quartile (Q1) is the median of the
lower half of the data, and the third quartile (Q3) is the median of the upper half of the data.

Example: 5, 7, 4, 4, 6, 2, 8

Put them in order: 2, 4, 4, 5, 6, 7, 8

Cut the list into quarters:

2    4    4    5    6    7    8
     Q1        Q2        Q3

•Quartile 1 (Q1), the lower quartile, = 4
•Quartile 2 (Q2), which is also the median, = 5
•Quartile 3 (Q3), the upper quartile, = 7

So the interquartile range is Q3 - Q1 = 7 - 4 = 3.
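R computes quartiles with quantile(). As a caution, there are several conventions for quartiles on small samples, and R's default (type = 7) interpolates, giving Q3 = 6.5 for this data. The sketch below uses type = 6, which reproduces the median-of-halves values worked out above; that choice is an assumption made to match the example, not a general recommendation.

```r
x <- c(5, 7, 4, 4, 6, 2, 8)

sort(x)                    # 2 4 4 5 6 7 8
q <- quantile(x, probs = c(0.25, 0.5, 0.75), type = 6)
unname(q)                  # 4 5 7
IQR(x, type = 6)           # 3

# R's default method interpolates between neighbouring values
unname(quantile(x, 0.75))  # 6.5
```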

Basic Probability Theory

A probability is a number that reflects the chance or likelihood that a particular event will
occur. Probabilities can be expressed as proportions that range from 0 to 1, and they can also
be expressed as percentages ranging from 0% to 100%. A probability of 0 indicates that
there is no chance that a particular event will occur, whereas a probability of 1 indicates that
an event is certain to occur. A probability of 0.45 (45%) indicates that there are 45 chances
out of 100 of the event occurring.

Example
A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the two
outcomes (“heads” and “tails”) are both equally probable; the probability of “heads” equals
the probability of “tails”; and since no other outcomes are possible, the probability of each
is 1/2 (which could also be written as 0.5 or 50%).

•Experiment: an uncertain situation which could have multiple outcomes.
Whether it rains on a given day is an experiment: a physical situation whose outcome
cannot be predicted until it is observed.
•Outcome: the result of a single trial. So, if it rains today, the outcome of today’s
trial from the experiment is “It rained”.
•Event: one or more outcomes from an experiment. An EVENT is a subset of the
sample space. “It rained” is one of the possible events for this experiment.
•SAMPLE SPACE: the set of all possible outcomes (also known as SAMPLE POINTS) of a
statistical experiment.

Example: I flip a coin, with two possible outcomes: heads (H) or tails (T).
What is the sample space for this experiment? What about for three flips in a row?
Solution: For the first experiment (flip a coin once), the sample space is just {H, T}. For the
second experiment (flip a coin three times), the sample space is {HHH, HHT, HTH, HTT,
THH, THT, TTH, TTT}. Order matters: HHT is a different outcome than HTH.

Example: For the experiment where I flip a coin three times in a row, consider the
event that I get exactly one T. Which outcomes are in this event?
Solution: The subset of the sample space that contains all outcomes with exactly one T is
{HHT, HTH, THH}.

The probability of an event


Now, let’s consider the probability of an event. By definition, an impossible event has
probability zero, and a certain event has probability one. The more interesting cases are
events that are neither impossible nor certain. For the moment, let’s assume that all
outcomes in the sample space S are equally likely. If that is the case, then the probability of
an event E, which we write as P(E), is simply the number of outcomes in E divided by the
number of outcomes in S:

P(E) = |E| / |S|    if all outcomes in S are equally likely

That is, the probability of an event is the proportion of outcomes in the sample space that
are also outcomes in that event.

Example: I flip a fair coin, with two possible outcomes: heads (H) or tails (T). What is the
probability that I get exactly one T if I flip the coin once? What if I flip it three times?
Solution: First, note that I said it’s a fair coin. This is important, because it means that on
any one flip, each outcome is equally likely. We already determined the relevant events and
sample spaces for each experiment in the previous section, so now we just need to divide
those numbers. Specifically, if we only flip the coin once, then the event we care about
(getting T) has one possible outcome, and the sample space has two possible outcomes, so
the probability of getting T is 1/2. If we flip the coin three times, there are three outcomes
with exactly one T, and eight outcomes altogether, so the probability of getting exactly one
T is 3/8.
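The counting argument can be reproduced by enumerating the sample space. This is a sketch using base R's expand.grid(); the column names first, second and third are assumptions made for illustration.

```r
# All outcomes of three coin flips (2^3 = 8 rows, one column per flip)
flips <- expand.grid(first = c("H", "T"), second = c("H", "T"), third = c("H", "T"),
                     stringsAsFactors = FALSE)
outcomes <- paste0(flips$first, flips$second, flips$third)
length(outcomes)                  # |S| = 8

# Event E: exactly one T among the three flips
n_tails <- rowSums(flips == "T")
event <- outcomes[n_tails == 1]
sort(event)                       # "HHT" "HTH" "THH"

length(event) / length(outcomes)  # P(E) = 3/8 = 0.375
```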

MUTUALLY EXCLUSIVE (or DISJOINT) events. Two events are mutually exclusive
iff (“iff” means “if and only if”) they contain no outcomes in common (i.e., both events
cannot occur at the same time). For example, if I roll two dice, the events “get a total of 7”
and “get a total of 8” are mutually exclusive. On the other hand, “get a total of 7” and “get a
6 on one die” are not mutually exclusive, since both could occur on the same roll.

Now, suppose we have a set of n mutually exclusive events that together cover all possible
outcomes in our sample space S. Such a set is called a PARTITION of the sample space.
A simple example would be flipping a coin, where S = {H, T} and we define n = 2 mutually
exclusive events, E_1 = {H} and E_2 = {T}.

PROBABILITY DISTRIBUTION

Figure: A partition of the sample space into four mutually exclusive events. The rectangle
represents all the outcomes in the entire sample space, and each labelled region represents
the outcomes in that event.

We then assign a number, P(E_i), to each of the events E_i in the partition. If and only if the
following two properties hold, then each P(E_i) can be considered a probability, and the
set of values {P(E_1) . . . P(E_n)} can be considered a PROBABILITY DISTRIBUTION.

Property 1: 0 ≤ P(E_i) ≤ 1 for every i
Property 2: ∑_i P(E_i) = 1

That is, every probability must fall between 0 and 1 (inclusive), and the sum of the
probabilities of the mutually exclusive events that cover the sample space must equal 1.
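The two properties translate into a simple check. The function is_distribution and its tolerance argument tol are hypothetical, introduced for this sketch; the tolerance is there because sums of floating-point numbers are rarely exactly 1.

```r
# Hypothetical helper: TRUE if p satisfies Property 1 and Property 2
is_distribution <- function(p, tol = 1e-9) {
  all(p >= 0 & p <= 1) && abs(sum(p) - 1) < tol
}

is_distribution(c(0.5, 0.5))   # TRUE  (fair coin)
is_distribution(rep(1/6, 6))   # TRUE  (fair die)
is_distribution(c(0.7, 0.4))   # FALSE (sums to 1.1)
is_distribution(c(1.2, -0.2))  # FALSE (values outside [0, 1])
```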

Uniform Distribution
We can think of the total probability of 1 as a unit of mass that is divided among the events
in a partition. The amount of mass assigned to each E_i is what we call P(E_i). In our coin
flipping example, the two events are equally likely, so the mass is divided evenly and we get
P(E_1) = P(E_2) = 1/2. This is an example of a UNIFORM DISTRIBUTION, a distribution
where all events in a partition are equally likely.

Combining events
Often we are interested in combinations of two or more events. This can be represented
using set theoretic operations. Assume a sample space S and two events A and B:
• complement Ā (also written A′): all elements of S that are not in A;
• subset A ⊆ B: all elements of A are also elements of B;
• union A ∪ B: all elements of S that are in A or B;
• intersection A ∩ B: all elements of S that are in A and B.

The above operations can be represented graphically using Venn diagrams.
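These operations map directly onto R's set functions. The sketch below uses a single roll of a die as an assumed example; the events A and B are chosen for illustration only.

```r
S <- 1:6                 # sample space: one roll of a die
A <- c(2, 4, 6)          # event: the roll is even
B <- c(4, 5, 6)          # event: the roll is greater than 3

setdiff(S, A)            # complement of A: 1 3 5
sort(union(A, B))        # union of A and B: 2 4 5 6
intersect(A, B)          # intersection of A and B: 4 6
all(c(4, 6) %in% A)      # TRUE: {4, 6} is a subset of A

# With equally likely outcomes, probabilities are ratios of set sizes
length(union(A, B)) / length(S)   # 4/6
```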

Complement and Union of Event
Starting simple, suppose we have an event E that has probability P(E). What is the
probability that E does not happen? Put another way, what is the probability of the
COMPLEMENT of E, written as ¬E (or E′ or Ē)? ¬E is the set of outcomes that are in the
sample space S but not in E, and it’s easy to see that, since the total probability mass of S is
1,

P(¬E) = 1 − P(E).

Example: Suppose I have a list of words, and I choose a word uniformly at random. If
the probability of getting a word starting with t is 1/7, then what is the probability of getting
a word that does not start with t?
Solution: Let E be the event that the word starts with t. Then P(¬E) is the probability we
were asked for, and it is 1 − P(E), or 6/7.

Example: Suppose I have a group containing the following first- and second-year university
students from various countries. The first 3 are male, and the last 4 female:

Next, consider the union of two events, A and B. Since events are just sets of outcomes,
taking their union corresponds to considering any outcome that belongs to either A or B.
For example, looking at the scenario above, let’s define A = “the student is female” and
B = “the student is from the UK”.

What is P(A ∪ B), that is, the probability that the student is female or from the UK?
Solution: You might imagine that the answer is just P(A) + P(B). Let’s see if that is correct.
First, we compute P(A), which is 4/7. Next, compute P(B), which is also 4/7. So P(A) +
P(B) = 8/7. But that clearly can’t be correct, since probabilities cannot be greater than one.
So, let’s instead consider which outcomes are actually in the set A ∪ B. They are: {Fiona,
Lea, Ajitha, Sarah, Andrew}. Since this set has five elements, we know that P(A ∪ B) must
be 5/7.

So what went wrong when we computed P(A) + P(B)? Notice that there are three students
who belong to both A and B: Fiona, Ajitha, and Sarah. So when we counted the outcomes in
A, we included these students. And when we counted the outcomes in B, we included these
students again. That means when we computed P(A) + P(B), we added in those three
students twice. To correctly compute P(A ∪ B) from P(A) and P(B), we need to subtract off
those extra counts. We do so using the following formula:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Since A ∩ B is the set of students that are in both A and B, this is exactly the set that will
have been counted twice, so we subtract off that amount from the probability. In our
example, we now get P(A ∪ B) = 4/7 + 4/7 − 3/7 = 5/7, which is exactly what we got when
computing P(A ∪ B) directly.

In the special case where A and B are mutually exclusive, it is true that
P(A ∪ B) = P(A) + P(B), because there are no outcomes in the intersection.
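The formula can be checked by counting. In this sketch the members of A and B are taken from the solution above; the full table of students is not reproduced in these notes, so only the stated group size of 7 is used for the sample space. The helper p is a hypothetical name.

```r
n_students <- 7   # size of the sample space (group size stated in the example)
A <- c("Fiona", "Lea", "Ajitha", "Sarah")      # the student is female
B <- c("Fiona", "Ajitha", "Sarah", "Andrew")   # the student is from the UK

# Hypothetical helper: probability of an event under equally likely outcomes
p <- function(event) length(event) / n_students

p(A) + p(B)                       # 8/7: too large, the overlap is counted twice
p(A) + p(B) - p(intersect(A, B))  # 5/7 by the formula
p(union(A, B))                    # 5/7 by direct counting
```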

Joint probabilities
There is a special term for the probability of the intersection of two events: it is called the
JOINT PROBABILITY of A and B, written P(A ∩ B).

Law of total probability

Now, suppose we have a set of events {E_1, ..., E_n} that partition the sample space, and we
have some other event B that is also in the sample space. The diagram in Figure illustrates
such a situation and provides some intuition for the LAW OF TOTAL PROBABILITY, also
known as the SUM RULE:

P(B) = Σ_{i=1}^{n} P(B ∩ E_i)

The law of total probability tells us that we can compute the probability of B by adding
up the joint probability of B with each of the E_i.

Example: Consider the scenario from the earlier exercise. We partition the sample space according to
the country that each student comes from, with E_1 = “student is British”, E_2 = “student is
Chinese”, and E_3 = “student is German”. Also let B be the event that the student is female.
Apply the law of total probability to compute P(B), and check that the result is the
same as when computing P(B) directly.
Solution: Using the law of total probability, we have

P(B) = P(B ∩ E_1) + P(B ∩ E_2) + P(B ∩ E_3)
     = 3/7 + 0/7 + 1/7
     = 4/7

which is the same result we get by computing P(B) directly (counting the number of female
students and dividing by the total number of students).
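
As a quick check, the same sum can be written in R. This is a minimal sketch that uses the three joint probabilities from the solution above; the vector name joint and its element names are illustrative choices.

```r
# Joint probabilities P(B and E_i) for the three nationalities,
# taken from the worked solution above.
joint <- c(British = 3/7, Chinese = 0/7, German = 1/7)

# Law of total probability: P(B) is the sum of the joint probabilities.
p_B <- sum(joint)
print(p_B)  # 4/7, about 0.5714
```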

Conditional probability
Conditional probability is one of the most important concepts of probability theory. A
CONDITIONAL PROBABILITY expresses the probability that some event A will occur, given that
(conditioned on the fact that) event B has occurred. The conditional
probability of A given B, written P(A | B), where the | is pronounced “given”, is defined as

P(A | B) = P(A ∩ B) / P(B)

Example: Again let’s use the scenario from the same exercise, with events A = “the student is
male” and B = “the student is from the UK”. What is P(A | B)?
Solution: In this case, A ∩ B is the set of male British students, so P(A ∩ B) = 1/7. P(B) = 4/7,
so P(A | B) = (1/7) / (4/7) = 1/4.
In other words, the probability that the chosen student is male given that the student is British
is simply the number of male British students as a fraction of the number of British students, or 1/4.
But again, it’s important to learn the formal rules of probability theory since not all problems you
will be faced with are so straightforward.

Example: Using the same A and B as in the previous example, what is P(B | A)?

Solution: We have the same A ∩ B, so P(A ∩ B) = 1/7. P(A) = 3/7,
so P(B | A) = (1/7) / (3/7) = 1/3.
This example illustrates the fact that conditional probabilities are not commutative:
P(A | B) is not the same as P(B | A), and in general the two will not be equal.
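
The two conditional probabilities above can be reproduced in R. This is a small sketch using only the probabilities stated in the two examples; the variable names are illustrative.

```r
# Probabilities from the examples: A = "student is male", B = "student is from the UK".
p_A       <- 3/7
p_B       <- 4/7
p_A_and_B <- 1/7

# Definition of conditional probability: P(X | Y) = P(X and Y) / P(Y)
p_A_given_B <- p_A_and_B / p_B   # 1/4
p_B_given_A <- p_A_and_B / p_A   # 1/3

print(c(A_given_B = p_A_given_B, B_given_A = p_B_given_A))
```

The two results differ because the same joint probability is divided by different marginal probabilities.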

Conditional probability continued

Example: The probability that it is Friday and that a student is absent is 0.03. Since there
are 5 school days in a week, the probability that it is Friday is 0.2. What is the probability
that a student is absent given that today is Friday?
Solution:

P(Absent | Friday) = P(Friday and Absent) / P(Friday) = 0.03 / 0.2 = 0.15 = 15%

Example: A machine produces parts that are either good (90%), slightly defective (2%), or
obviously defective (8%). Produced parts get passed through an automatic inspection machine,
which is able to detect any part that is obviously defective and discard it. What is the quality of the
parts that make it through the inspection machine and get shipped?
Solution: Let G, SD, and OD be the events that a randomly chosen part is good, slightly defective,
or obviously defective, respectively. We are told that P(G) = 0.90, P(SD) = 0.02, and P(OD) = 0.08. We
want to compute the probability that a part is good given that it passed the inspection machine (i.e.,
it is not obviously defective). Writing ODᶜ for the complement of OD, and noting that every good part
is not obviously defective (so G ∩ ODᶜ = G), this is

P(G | ODᶜ) = P(G ∩ ODᶜ) / P(ODᶜ)
           = P(G) / (1 − P(OD))
           = 0.90 / (1 − 0.08)
           = 90/92
           ≈ 0.978
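
The two examples above both apply the same definition, which can be sketched in R as follows. The helper name cond_prob is hypothetical and is not part of base R; the numbers come from the examples.

```r
# Hypothetical helper: P(X | Y) = P(X and Y) / P(Y)
cond_prob <- function(p_joint, p_given) {
  stopifnot(p_given > 0)   # conditioning on an impossible event is undefined
  p_joint / p_given
}

# Friday example: P(Absent | Friday)
p_absent_given_friday <- cond_prob(0.03, 0.2)

# Inspection example: distribution of part quality before inspection
p <- c(G = 0.90, SD = 0.02, OD = 0.08)
p_pass <- 1 - p[["OD"]]                 # P(part is not obviously defective)

# Good parts always pass, so P(G and pass) = P(G)
p_good_given_pass <- cond_prob(p[["G"]], p_pass)

print(p_absent_given_friday)        # 0.15
print(round(p_good_given_pass, 3))  # 0.978
```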

Independent Events

Two events, A and B, are independent if the fact that A occurs does not affect the
probability of B occurring.
Examples of independent events are:
•Landing on heads after tossing a coin AND rolling a 5 on a single 6-sided die.

•Rolling a 4 on a single 6-sided die, AND then rolling a 1 on a second roll of the die.

To find the probability of two independent events that occur in sequence, find the
probability of each event occurring separately, and then multiply the probabilities. This
multiplication rule is defined symbolically below. Note that multiplication is represented by
AND.
Independence can also be stated using conditional probability: the conditional probability
P(A | B) is equal to the unconditional probability P(A). That
is, whether or not B occurs has no effect on the probability of A occurring. This is one way
of defining the independence of two events. Two events A and B are
INDEPENDENT EVENTS iff
P(A | B) = P(A).

By substituting in the definition of conditional probability and rearranging the
terms, we can equivalently state that events A and B are independent if
P(A ∩ B) = P(A)P(B)

Multiplication Rule 1: When two events, A and B, are independent, the probability of
both occurring is:
P(A and B) = P(A) · P(B)
Example: A card is chosen at random from a deck of 52 cards. It is then replaced and a
second card is chosen. What is the probability of choosing a jack and then an eight?

Probabilities:
P(jack) = 4/52
P(8) = 4/52
P(jack and 8) = P(jack) · P(8)
              = 4/52 · 4/52
              = 16/2704
              = 1/169
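
The result above can be confirmed by listing every ordered pair of draws in R. This is a minimal sketch; the rank labels and variable names are illustrative, and suits are ignored because they do not matter for this question.

```r
# A 52-card deck represented by ranks only (13 ranks, 4 copies of each).
ranks <- c("A", 2:10, "J", "Q", "K")
deck  <- rep(ranks, times = 4)

# All ordered (first, second) draws WITH replacement: 52 * 52 equally likely pairs.
draws <- expand.grid(first = deck, second = deck, stringsAsFactors = FALSE)

p_jack  <- mean(draws$first == "J")                         # 4/52
p_eight <- mean(draws$second == "8")                        # 4/52
p_both  <- mean(draws$first == "J" & draws$second == "8")   # 16/2704

# Independence: the joint probability equals the product of the marginals.
print(isTRUE(all.equal(p_both, p_jack * p_eight)))
print(1 / p_both)   # 169
```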

DEPENDENT EVENTS
Two events are dependent if the outcome or occurrence of the first affects the outcome
or occurrence of the second so that the probability is changed.

Experiment 1: A card is chosen at random from a standard deck of 52 playing cards.
Without replacing it, a second card is chosen. What is the probability that the first card
chosen is a queen and the second card chosen is a jack?

Analysis: The probability that the first card is a queen is 4 out of 52. However, if the first
card is not replaced, then the second card is chosen from only 51 cards. Accordingly, the
probability that the second card is a jack given that the first card is a queen is 4 out of 51.
Conclusion: The outcome of choosing the first card has affected the outcome of choosing
the second card, making these events dependent.

Now that we have accounted for the fact that there is no replacement, we can find the
probability of the dependent events in Experiment 1 by multiplying the probabilities of
each event.
Example: A card is chosen at random from a standard deck of 52 playing cards. Without replacing
it, a second card is chosen. What is the probability that the first card chosen is a queen and the
second card chosen is a jack?
Probabilities:
P(queen on first pick) = 4/52
P(jack on 2nd pick given queen on 1st pick) = 4/51
P(queen and jack) = 4/52 · 4/51 = 16/2652 = 4/663
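
The same enumeration idea works without replacement, as long as the two draws are never the same physical card. The sketch below is illustrative only; it indexes cards by position so that a card cannot be drawn twice.

```r
# A 52-card deck represented by ranks only.
ranks <- c("A", 2:10, "J", "Q", "K")
deck  <- rep(ranks, times = 4)

# All ordered pairs of card positions, then drop pairs that reuse a card:
# 52 * 51 = 2652 equally likely draws WITHOUT replacement.
idx <- expand.grid(first = seq_along(deck), second = seq_along(deck))
idx <- idx[idx$first != idx$second, ]

first_rank  <- deck[idx$first]
second_rank <- deck[idx$second]

p_queen_first      <- mean(first_rank == "Q")                        # 4/52
p_jack_given_queen <- mean(second_rank[first_rank == "Q"] == "J")    # 4/51
p_queen_then_jack  <- mean(first_rank == "Q" & second_rank == "J")   # 16/2652

print(c(p_queen_first, p_jack_given_queen, p_queen_then_jack))
```

Note that the unconditional probability of a jack on the second pick is still 4/52, which differs from 4/51; that difference is what makes the two events dependent.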

RANDOM VARIABLE
A RANDOM VARIABLE (or RV) is a variable that represents all the possible events in some
partition of the sample space. Put another way, an RV has several possible values, with each RV
value being one event in a partition (and where the values cover all events in the partition). We
will write random variables with uppercase letters, and their possible values with lowercase
letters or numbers.
Example 5.1.1. Define a random variable X to represent the outcome of flipping a fair coin,
where this variable can take on two possible values (h or t) representing heads or tails. What
is the distribution over X?
Solution: The distribution over an RV simply tells us the probability of each value, so the
distribution over X is P(X = h) = P(X = t) = 1/2.
We use the notation P(X) as a shorthand meaning “the entire distribution over X”, in
contrast to P(X = x), which means “the probability that X takes the value x”.
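
In R, a distribution over a discrete random variable can be sketched as a named numeric vector. The example below is only an illustration of the coin example above; the seed and the number of simulated flips are arbitrary choices.

```r
# P(X): the entire distribution over X, one probability per possible value.
p_X <- c(h = 1/2, t = 1/2)

# P(X = h): the probability of one particular value.
p_heads <- p_X[["h"]]

# A valid distribution sums to 1.
total <- sum(p_X)

# Simulate 1000 flips from this distribution (seed fixed for reproducibility).
set.seed(1)
flips <- sample(names(p_X), size = 1000, replace = TRUE, prob = p_X)
print(table(flips) / length(flips))   # relative frequencies, close to 1/2 each
```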

Correlation
Correlation is a bivariate analysis that measures the strength of association between two
variables.
The degree of association is measured by a correlation coefficient, denoted by r. It is
sometimes called Pearson's correlation coefficient after its originator and is a measure of
linear association.

The correlation coefficient is measured on a scale that varies from +1 through 0 to −1.
Complete correlation between two variables is expressed by either +1 or −1. When one
variable increases as the other increases the correlation is positive; when one decreases as
the other increases it is negative. Complete absence of correlation is represented by 0.

Formula to calculate correlation coefficient

r_xy = Σ (x_i − x̄)(y_i − ȳ) / √[ Σ (x_i − x̄)² · Σ (y_i − ȳ)² ]

Where:
•r_xy – the correlation coefficient of the linear relationship between the variables x and y
•x_i – the values of the x-variable in a sample
•x̄ – the mean of the values of the x-variable
•y_i – the values of the y-variable in a sample
•ȳ – the mean of the values of the y-variable

Example: Using the Pearson coefficient formula

SUBJECT   AGE X   GLUCOSE LEVEL Y   XY       X²       Y²
1         43      99                4257     1849     9801
2         21      65                1365     441      4225
3         25      79                1975     625      6241
4         42      75                3150     1764     5625
5         57      87                4959     3249     7569
6         59      81                4779     3481     6561
Σ         247     486               20485    11409    40022

•Σx = 247
•Σy = 486
•Σxy = 20,485
•Σx² = 11,409
•Σy² = 40,022

n is the sample size, in our case n = 6.

The correlation coefficient =

[6(20,485) − (247 × 486)] / √{[6(11,409) − 247²] × [6(40,022) − 486²]}

= 2,868 / √(7,445 × 3,936)

= 0.5298
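
The calculation above can be reproduced in R. The sketch below computes r with the same sums as the table and compares the result with the built-in cor() function, which uses the Pearson method by default; the variable names are illustrative.

```r
# Data from the table above.
age     <- c(43, 21, 25, 42, 57, 59)
glucose <- c(99, 65, 79, 75, 87, 81)
n <- length(age)

# Computational form of the Pearson coefficient, using the column sums.
numerator   <- n * sum(age * glucose) - sum(age) * sum(glucose)
denominator <- sqrt((n * sum(age^2) - sum(age)^2) *
                    (n * sum(glucose^2) - sum(glucose)^2))
r_manual <- numerator / denominator

# Built-in Pearson correlation for comparison.
r_builtin <- cor(age, glucose)

print(round(r_manual, 4))   # 0.5298
```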

Skewness

Skewness is the degree of distortion from the symmetrical bell curve or the normal distribution. It
measures the lack of symmetry in data distribution.
It differentiates extreme values in one versus the other tail. A symmetrical distribution will
have a skewness of 0.

• If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
• If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and
1 (positively skewed), the data are moderately skewed.
• If the skewness is less than -1 (negatively skewed) or greater than 1 (positively
skewed), the data are highly skewed.

Coefficient of Skewness
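
The notes do not show which coefficient is meant here. One common choice, assumed in the sketch below, is the moment coefficient of skewness, m3 / m2^(3/2), where m2 and m3 are the second and third central moments. Base R has no skewness function, so the helper names skewness_moment and classify_skewness are hypothetical, and the sample data are invented for illustration.

```r
# Moment coefficient of skewness (assumed definition): m3 / m2^(3/2)
skewness_moment <- function(x) {
  m  <- mean(x)
  m2 <- mean((x - m)^2)   # second central moment
  m3 <- mean((x - m)^3)   # third central moment
  m3 / m2^(3/2)
}

# Rule of thumb from the bullet list above.
classify_skewness <- function(s) {
  if (abs(s) <= 0.5) {
    "fairly symmetrical"
  } else if (abs(s) <= 1) {
    "moderately skewed"
  } else {
    "highly skewed"
  }
}

symmetric_data <- c(1, 2, 3, 4, 5)    # invented, symmetric around 3
right_tailed   <- c(1, 1, 1, 2, 10)   # invented, one large value in the right tail

s_sym   <- skewness_moment(symmetric_data)
s_right <- skewness_moment(right_tailed)

print(classify_skewness(s_sym))
print(classify_skewness(s_right))
```

A positive result indicates a longer right tail and a negative result a longer left tail.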

Kurtosis

Kurtosis is traditionally described as a measure of the peakedness or convexity of a curve.

More precisely, kurtosis is all about the tails of the distribution, not the peakedness or flatness.
It describes how extreme the values in the tails are, and is in effect a measure of the outliers
present in the distribution.

Mesokurtic (Kurtosis = 3):
This distribution has a kurtosis statistic similar to that of the normal distribution. It means that
the extreme values of the distribution are similar to those of a normal distribution.
This definition is used so that the standard normal distribution has a kurtosis
of three.
Leptokurtic (Kurtosis > 3): The distribution has longer, fatter tails. The peak is higher and
sharper than in the mesokurtic case, which means that the data are heavy-tailed, with a profusion of outliers.
Outliers stretch the horizontal axis of the histogram graph, which makes the bulk of the data
appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness” of a
leptokurtic distribution.
Platykurtic (Kurtosis < 3): The distribution has shorter, thinner tails than the normal
distribution. The peak is lower and broader than in the mesokurtic case, which means that the data are
light-tailed, with a lack of outliers.
This is because the extreme values are less extreme than those of the normal
distribution.
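
The three cases can be illustrated in R. The sketch below assumes the moment definition of kurtosis, m4 / m2^2 (the one under which the normal distribution has kurtosis 3). The helper names and both data sets are invented for illustration; they are not from the notes.

```r
# Moment kurtosis (assumed definition): m4 / m2^2, equal to 3 for a normal distribution.
kurtosis_moment <- function(x) {
  m  <- mean(x)
  m2 <- mean((x - m)^2)   # second central moment
  m4 <- mean((x - m)^4)   # fourth central moment
  m4 / m2^2
}

# Classification relative to the normal value of 3. Sample kurtosis is rarely
# exactly 3, so in practice "close to 3" is treated as mesokurtic.
classify_kurtosis <- function(k) {
  if (k > 3) {
    "leptokurtic"
  } else if (k < 3) {
    "platykurtic"
  } else {
    "mesokurtic"
  }
}

flat_data   <- 1:5                      # evenly spread values, no outliers
heavy_tails <- c(-10, rep(0, 18), 10)   # mostly identical values plus two outliers

k_flat  <- kurtosis_moment(flat_data)     # 1.7
k_heavy <- kurtosis_moment(heavy_tails)   # 10

print(c(classify_kurtosis(k_flat), classify_kurtosis(k_heavy)))
```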

Summary statistics: There are literally dozens of ways to display summary data using
graphs or charts. Some of the most common ones are listed below.
•Histogram
•Box plot
•Bar chart
•Scatter plot
•Pie chart

Bar Chart
A bar chart is a graph with rectangular bars. The graph usually compares different
categories. Although the graphs can be plotted vertically (bars standing up) or horizontally
(bars lying flat from left to right), the most usual type of bar graph is vertical.

The horizontal (x) axis represents the categories; the vertical (y) axis represents a value for
those categories. In the graph below, the values are percentages.

The bar chart below shows the frequency of some categories.
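
A bar chart like the one described can be drawn with the base R function barplot(). The category names and counts below are invented purely for illustration.

```r
# Hypothetical category frequencies (invented for illustration).
counts <- c(Apple = 12, Banana = 7, Cherry = 15, Mango = 9)

# Vertical bar chart: categories on the x axis, frequencies on the y axis.
# barplot() returns the x positions of the bar midpoints.
bar_midpoints <- barplot(counts,
                         main = "Frequency by category",
                         xlab = "Category",
                         ylab = "Frequency",
                         col  = "steelblue")
```

Passing horiz = TRUE to barplot() draws the same bars horizontally.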

Histogram
A histogram is a specific type of bar chart, where the categories are
ranges of numbers. Histograms therefore show continuous data grouped into ranges (bins).
This type of graph is used with quantitative data.
The histogram below shows the frequencies of ordinal categories.

Note: Plot discrete data on a bar chart, and plot continuous data on a histogram.
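
A histogram can be drawn with the base R function hist(). The sketch below reuses the six ages from the correlation example; the choice of 10-year bins is an assumption made for illustration.

```r
# Ages from the correlation example earlier in these notes.
age <- c(43, 21, 25, 42, 57, 59)

# Group the ages into 10-year ranges: (20,30], (30,40], (40,50], (50,60].
h <- hist(age,
          breaks = seq(20, 60, by = 10),
          main   = "Histogram of age",
          xlab   = "Age (years)",
          ylab   = "Frequency")

# Number of observations that fall into each range.
print(h$counts)
```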

Pie Charts
A pie chart looks like a circle (or a pie) cut up into segments. Pie charts are used to show
how the whole breaks down into parts.
Pie charts show percentages of a whole - your total is therefore 100% and the segments of
the pie chart are proportionally sized to represent the percentage of the total.
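
A pie chart can be drawn with the base R function pie(). The budget categories and amounts below are invented for illustration.

```r
# Hypothetical breakdown of a whole into parts (invented for illustration).
shares <- c(Rent = 40, Food = 25, Transport = 15, Other = 20)

# Each segment as a percentage of the total.
pct <- round(100 * shares / sum(shares))

# Segment sizes are proportional to the values; labels show the percentages.
pie(shares,
    labels = paste0(names(shares), " (", pct, "%)"),
    main   = "Share of total")
```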

Scatter Plot
Scatter plots (also called scatter graphs) are similar to line graphs. A line graph uses a line
on an X-Y axis to plot a continuous function, while a scatter plot uses dots to represent
individual pieces of data. In statistics, these plots are useful to see if two variables are
related to each other. For example, a scatter chart can suggest a linear relationship.
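
A scatter plot can be drawn with the base R function plot(). The sketch below reuses the age and glucose data from the correlation example; adding a fitted straight line with lm() and abline() is an illustrative choice, not something the notes prescribe.

```r
# Data from the correlation example earlier in these notes.
age     <- c(43, 21, 25, 42, 57, 59)
glucose <- c(99, 65, 79, 75, 87, 81)

# One dot per subject.
plot(age, glucose,
     main = "Glucose level against age",
     xlab = "Age (years)",
     ylab = "Glucose level",
     pch  = 19)

# Least-squares line, to make any linear trend easier to see.
fit <- lm(glucose ~ age)
abline(fit, col = "red")

slope <- coef(fit)[["age"]]   # positive, matching the positive correlation
```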

Box Plot

Boxplots are a standardized way of displaying the distribution of data based on a five
number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and
“maximum”).
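
The five number summary and the box plot are both available in base R. The sketch below reuses the glucose levels from the correlation example. Note that fivenum() and boxplot() use Tukey's hinges for the quartiles, which can differ slightly from the default quartiles returned by quantile().

```r
# Glucose levels from the correlation example earlier in these notes.
glucose <- c(99, 65, 79, 75, 87, 81)

# Five number summary: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(glucose)
print(five)

# Box plot of the same data. The returned list holds the plotted statistics;
# the whiskers stop at the most extreme points within 1.5 times the box length.
b <- boxplot(glucose,
             main = "Glucose level",
             ylab = "Glucose level")
```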
