Data Science

This document provides an overview of key concepts in data science including mathematical objects, linear algebra, statistics, probabilities, and linear regression. It contains 10 sections that cover scalars, vectors, matrices, measures of central tendency, probabilities, regression, and examples of linear regression in Python. The document is 10,050 words in length and follows research ethics guidelines.

Uploaded by

Anuran Bordoloi
Copyright
© © All Rights Reserved

Course Name

Module Data Science

Session No. I

Version 1.0
Data Science

Material from the published or unpublished work of others which is referred to in the Class
Notes is credited to the author in question in the text. The Class Notes are 10,050 words in
length. Research ethics issues have been considered and handled appropriately within the
Globsyn Business School guidelines and procedures.

Industry4.0/M8SI/v1.0/121219 Data Science | Session No.: I


Data Science

Table of Contents
List of Tables
1. Introduction
2. Mathematical Objects
2.1. Scalar
2.2. Vector
2.2.1. Finding the Magnitude
2.3. Relationship between Linear Algebra and Machine Learning
2.3.1. Data Set and Data Files
2.4. Matrix
2.4.1. Addition
2.4.2. Subtraction
2.4.3. Multiplication, using a Constant
2.4.4. Division
2.4.5. Matrix Vector Multiplication
2.4.6. Matrix Addition and Subtraction
2.4.7. Use of Matrices in the Machine Learning process
2.4.8. Machine Learning and Statistics
3. Basics of Statistics
3.1. Histogram
3.2. Frequency Polygon
3.3. Frequency Curve
3.4. Ogives
3.5. Central Tendency
3.5.1. Arithmetic Mean
3.5.2. Median
3.5.3. Mode
3.5.4. Empirical Relationship between Statistical Averages
3.5.5. Geometric Mean
3.5.6. Harmonic Mean
3.5.7. Quartiles
3.6. Mean Deviation
3.7. Standard Deviation
3.7.1. Properties of Standard Deviation
4. Probabilities
4.1. Certain terminologies used in the process of Probability
4.1.1. Experiment
4.1.2. Random Experiment
4.1.3. Sample Space
4.1.4. Event
4.1.5. Equally likely Events
4.1.6. Mutually Exclusive Events
4.1.7. Exhaustive set of Events
4.1.8. Independent Events
4.2. Rules of Probability
4.2.1. Addition Rule
4.2.2. Multiplication Rule
4.3. Conditional Probability
4.3.1. Bivariate Distribution under Conditional Probability
4.4. Seven important steps on Probability for Solving Problem
5. Regression
5.1. Regression Lines
5.2. Regression Coefficient
5.2.1. Features of Regression Coefficient
6. Linear Regression explained using Python
6.1. Graphical presentation of Linear Algebra
6.2. Graphical Presentation of Linear Regression
7. Linear Regression in Python
7.1. Example – 1
7.2. Example – 2
7.3. Example – 3
7.4. Example – 4
References

Tables & Figures


Fig. 1: Mathematical Object
Fig. 2: Vector
Fig. 3: Graphical Explanation
Fig. 4: Iris Flower Data
Fig. 5: Grades for Exam
Fig. 6: Matrices
Fig. 7: Addition
Fig. 8: Subtraction
Fig. 9: Multiplication
Fig. 10: Division
Fig. 11: Inverse
Fig. 12: Multiplication by Inverse
Fig. 13: 2X2 Matrix
Fig. 14: Inverse
Table 15: Vector Multiplication
Fig. 16: Vector Matrix Addition
Table 1: Frequency Distribution of Number of children
Table 2: Continuous Frequency Distribution
Fig. 18: Histogram Graph
Fig. 19: Frequency Polygon
Fig. 20: Frequency Curve
Fig. 21: Mode
Fig. 22: GM Discrete Series with Frequency
Fig. 23: GM Continuous Series
Fig. 24: Mean Deviation – Data with Frequency
Fig. 25: Coefficient of Mean Deviation
Fig. 26: Variance without Frequency
Fig. 27: Variance with Frequency
Fig. 28: Alternative form of Variance for A
Fig. 29: Alternative form of Variance for B
Fig. 30: Combined Standard Deviation
Fig. 31: Venn Diagram
Fig. 32: Linear Relationship
Fig. 33: Linear Regression Algorithm
Fig. 34: Formula of Slope (m)
Fig. 35: Linear Regression Algorithm
Fig. 36: Slope – Line Presentation
Fig. 37: R-Squared
Fig. 38: Determination of R²
Fig. 39: Measurement of R²
Fig. 40: Measurement of R²
Fig. 41: Measurement of R²
Fig. 42: Measurement of R²
Fig. 43: Outcome
Fig. 44: Regression Line with Scattered Plot

1. Introduction
Linear algebra is a continuous form of mathematics and an important tool for modelling natural
phenomena efficiently. Much of science and engineering depends on its application. It differs
from discrete mathematics, which deals with data grouped into finite sets. For example, a matrix
can have a 2nd or a 3rd element but no 2.5th. The functions of continuous mathematics, by
contrast, can be evaluated to any accuracy: in linear algebra, values can be taken to any number
of decimal places. Because many computer scientists have little practice with continuous
mathematics, they need to learn linear algebra, which is central to almost all areas of
mathematics. A student who wants to study deep learning algorithms must first understand linear
algebra. The same holds for machine learning, where knowledge of deep learning algorithms is
needed (Donges, 2019). Adequate knowledge of linear algebra gives the learner a better
understanding of how machine learning systems are developed and makes it possible to work with
a variety of machine learning algorithms. Mastering machine learning requires a sound knowledge
of linear equations, which deal mostly with vectors and matrices and, in addition, with scalars.

2. Mathematical Objects
An abstract object that arises in mathematics is referred to as a mathematical object, a concept
from the philosophy of mathematics. Scalars, vectors and matrices are the three mathematical
objects considered here.


Fig. 1: Mathematical Object

(Donges, 2019)

2.1. Scalar
A scalar is just a single number, e.g. 60. A single real number that represents a quantity is
called a scalar. For instance, the statement that a certain length is 60 cm describes a scalar
quantity (The Physics Classroom, 2019).

2.2. Vector
A vector is a quantity defined by multiple scalars. A scalar has only a magnitude, whereas a
vector has both a magnitude and a direction. Consider the following example:

Fig. 2: Vector

(Statistics , 2019)

Two things are marked on the line: the initial point (A) and the direction (a). The magnitude is
the length of the line starting from the initial point. The lowercase 'a' also denotes the
direction (The Physics Classroom, 2019).


2.2.1. Finding the Magnitude


The magnitude of a vector can be illustrated through the following graph:

Fig. 3: Graphical Explanation

(Statistics , 2019)

To determine the magnitude of the vector from point A to point B, compute the distance between
the two points. The distance formula can be used once the coordinates of both points are given
(Ducksters, 2019).

The following is the formula of distance:

$AB = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

Example:
Find the magnitude of the vector AB, where point A is (3, 2) and point B is (7, 4).

Put these values appropriately into the formula:

$AB = \sqrt{(7 - 3)^2 + (4 - 2)^2}$

$= \sqrt{4^2 + 2^2}$

$= \sqrt{16 + 4}$

$= \sqrt{20}$

$\approx 4.47$
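
The same calculation can be checked in a few lines of Python. This is only a minimal sketch of the distance formula; the function name `magnitude` is chosen here for illustration and does not come from the notes.

```python
import math

def magnitude(a, b):
    """Length of the vector from point a to point b (2-D distance formula)."""
    (x1, y1), (x2, y2) = a, b
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

# Points A (3, 2) and B (7, 4) from the worked example.
length = magnitude((3, 2), (7, 4))
print(round(length, 2))  # 4.47
```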


Numerically, a vector can be represented as an ordered array of numbers arranged in a row or a
column. A vector has a single index, which points to a specific value within the vector.

2.3. Relationship between Linear Algebra and Machine Learning


Linear algebra is the sub-field of mathematics concerned with scalars, vectors and linear
transforms. The operation of an algorithm is described through the notation of linear algebra,
and algorithms are implemented in code using it. Linear algebra is needed for both purposes.
Anyone who wants to work with machine learning should therefore be comfortable with linear
algebra, as the two subjects are deeply connected. The following example shows the link
between them (Brownlee, 2019).

2.3.1. Data Set and Data Files


In machine learning, a model is fitted on a data set.

A data set is laid out as a table in which each row represents an observation and each column
represents a feature of that observation. For example, the Iris flower data set is shown below:

Fig. 4: Iris Flower Data

(Machine Learning Mastery, 2019)

The data above is a matrix, which is the key data structure in linear algebra. To build a
supervised machine learning model, the data is split into inputs and outputs: the measurements
form the matrix (X) and the flower species form the vector (Y). The vector is another key data
structure in linear algebra. In the table above, every row has the same number of columns. The
data is therefore vectorised: rows can be provided to a model one at a time or in a batch, and
the model can be pre-configured to expect rows of a fixed width (Intellipaat, 2019).
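
The split into inputs and outputs can be sketched in plain Python. The three rows below are illustrative placeholders in the Iris layout (four measurements followed by a species label), not a copy of the table in Fig. 4, and the names `rows`, `X` and `y` are assumptions made for this sketch.

```python
# Illustrative rows: four measurements followed by the species label.
rows = [
    [5.1, 3.5, 1.4, 0.2, "setosa"],
    [7.0, 3.2, 4.7, 1.4, "versicolor"],
    [6.3, 3.3, 6.0, 2.5, "virginica"],
]

# Inputs: every column except the last one (the matrix X).
X = [row[:-1] for row in rows]
# Outputs: the last column only (the vector Y).
y = [row[-1] for row in rows]

# Every input row has the same width, so a model can expect rows of a fixed size.
row_widths = {len(row) for row in X}
print(row_widths)  # {4}
```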


2.4. Matrix
A matrix is a rectangular array of numbers arranged into a fixed number of rows and columns.
The numbers are generally real numbers. Collected data can be expressed in matrix form. For
example, a table of exam grades (afterwards converted into a matrix) is shown below:

Fig. 5: Grades for Exam

(Maths Fun, 2019)

When the table is converted, the row and column identifiers are replaced by a function
identifier. The following chart shows such a replacement (Math Planet, 2019):

Fig. 6: Matrices

(Maths Fun, 2019)

The numbers contained in the above matrix are known as its elements.

Four basic matrix operations are covered here: addition, subtraction, multiplication and
division.


2.4.1. Addition
When two matrices are added, the result is as shown below:

Fig. 7: Addition

(Maths Fun, 2019)

The two matrices must have the same number of rows and the same number of columns (Math Planet,
2019).

The addition is carried out element by element:

3 + 4 = 7        8 + 0 = 8
4 + 1 = 5        6 + (-9) = -3

2.4.2. Subtraction
Subtraction between two matrices is carried out in the same element-by-element manner, and
the outcome is as shown below.

Fig. 8: Subtraction

(Maths Fun, 2019)

3 - 4 = -1       8 - 0 = 8
4 - 1 = 3        6 - (-9) = 15
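
The element-wise sums and differences worked out above can be reproduced with a short sketch. The two matrices are read off the arithmetic in the text, which is an assumption about their layout since the figures are not reproduced here. The helper `elementwise` is a hypothetical name used only for illustration.

```python
# Matrices inferred from the element-wise arithmetic in the text (assumed layout).
A = [[3, 8], [4, 6]]
B = [[4, 0], [1, -9]]

def elementwise(op, m1, m2):
    """Apply op to matching elements of two matrices of the same shape."""
    if len(m1) != len(m2) or any(len(r1) != len(r2) for r1, r2 in zip(m1, m2)):
        raise ValueError("matrices must have the same number of rows and columns")
    return [[op(x, y) for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

total = elementwise(lambda x, y: x + y, A, B)
difference = elementwise(lambda x, y: x - y, A, B)
print(total)       # [[7, 8], [5, -3]]
print(difference)  # [[-1, 8], [3, 15]]
```

Checking the shapes first mirrors the rule that both matrices must match in rows and columns.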

2.4.3. Multiplication, using a Constant


A matrix can be multiplied by a single number (a constant). This is known as scalar
multiplication:


Fig. 9: Multiplication

(Maths Fun, 2019)

2 × 4 = 8        2 × 0 = 0
2 × 1 = 2        2 × (-9) = -18
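
As a minimal sketch, scalar multiplication multiplies every element by the same constant. The matrix is read off the arithmetic above (an assumed layout, as the figure is not reproduced), and `scale` is a hypothetical helper name.

```python
def scale(k, matrix):
    """Multiply every element of the matrix by the constant k."""
    return [[k * x for x in row] for row in matrix]

# Matrix inferred from the worked arithmetic in the text (assumed layout).
B = [[4, 0], [1, -9]]
doubled = scale(2, B)
print(doubled)  # [[8, 0], [2, -18]]
```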

2.4.4. Division
Matrices are not divided directly. Division is carried out as multiplication by an inverse, as
shown below:

Fig. 10: Division

(Maths Fun, 2019)

The principle of the inverse can be shown in the following way:

Fig. 11: Inverse

(Maths Fun, 2019)

When we multiply a matrix by its inverse we get the identity matrix. It can be expressed in the
following manner (Math Planet, 2019):


Fig. 12: Multiplication by Inverse

(Maths Fun, 2019)

For a 2 × 2 matrix with elements a, b, c and d, the inverse is found by swapping the positions
of a and d, putting negative signs in front of b and c, and multiplying the result by 1 divided
by the determinant. It is shown in the following manner.

Fig. 13: 2X2 Matrix

(Maths Fun, 2019)

Here 1 is divided by the determinant. The determinant is formed by cross multiplication: a is
multiplied by d, and b is multiplied by c. The second product is then subtracted from the
first, which gives ad − bc (Math Planet, 2019).

Suppose the matrix is:

Fig. 14: Inverse

(Maths Fun, 2019)

To invert this matrix, first compute 1 divided by the determinant:

$\frac{1}{ad - bc} = \frac{1}{4 \times 6 - 2 \times 7} = \frac{1}{10}$
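
The full inversion can be sketched as follows. The text only gives the products 4 × 6 and 2 × 7, so the placement of 7 and 2 in the matrix below is an assumption, as the figure is not reproduced. The function names are hypothetical, and `Fraction` is used so that the result stays exact.

```python
from fractions import Fraction

def inverse_2x2(m):
    """Invert [[a, b], [c, d]]: swap a and d, negate b and c, divide by ad - bc."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("determinant is zero, so the matrix has no inverse")
    factor = Fraction(1, det)
    return [[d * factor, -b * factor], [-c * factor, a * factor]]

def matmul_2x2(p, q):
    """Multiply two 2 x 2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

M = [[4, 7], [2, 6]]  # assumed layout; determinant = 4*6 - 7*2 = 10
M_inv = inverse_2x2(M)
product = matmul_2x2(M, M_inv)
print(product == [[1, 0], [0, 1]])  # True: a matrix times its inverse is the identity
```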

2.4.5. Matrix Vector Multiplication


When a matrix is multiplied by a vector (a matrix with a single column), the outcome also has
only one column. It is shown through a suitable example:

Table 15: Vector Multiplication

(Maths Fun, 2019)

Suppose two matrices are given:


By applying vector multiplication process the answer would be:
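
Because the example matrices are not reproduced in this text, the sketch below uses hypothetical values purely to illustrate the procedure: each element of the result is the sum of the products of one matrix row with the vector. The name `matvec` is an assumption of this sketch.

```python
def matvec(matrix, vector):
    """Multiply a matrix by a column vector, returning the result as a list."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("each matrix row must be as long as the vector")
    return [sum(x * y for x, y in zip(row, vector)) for row in matrix]

# Hypothetical values, not the ones from the original figure.
M = [[1, 2, 3], [4, 5, 6]]
v = [7, 8, 9]
result = matvec(M, v)
print(result)  # [50, 122]
```

A 2 × 3 matrix times a vector of 3 elements gives a vector of 2 elements, one per matrix row.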

2.4.6. Matrix Addition and Subtraction


Matrix addition is an easy process. It can be done as follows:

Fig. 16: Vector Matrix Addition

(Maths Fun, 2019)

Suppose the two matrices are the following:

It is equal to:

Subtraction is done in the same manner: only the addition signs are replaced by subtraction
signs. Everything else remains the same.


2.4.7. Use of Matrices in the Machine Learning process


Nowadays, most of us depend heavily on computers, machines that efficiently carry out important
work such as calculation, data recording, content writing and searching. Computers that serve
these functions are known as classical computers. They store data and also manipulate it. For
this manipulation a classical computer depends on bits: it stores information in binary form,
where each bit is in either the 0 or the 1 state. Some of the challenges we face in daily life
cannot be solved through classical computers. Scientists and researchers are therefore looking
for an alternative, and are trying to apply the concept of quantum computing, which is expected
to solve certain types of problems more rapidly than classical computers. A quantum computer
can be in multiple states at once and can work through the possible permutations
simultaneously. To understand quantum concepts, a suitable formalism is needed that defines
what can and cannot physically occur. Matrices are considered the easiest way to grasp quantum
computing, and they are helpful whenever a linear system involves a number of basis states. In
quantum computing, the system under consideration possesses a finite number of energy levels,
and a qubit has exactly two. Because the number of energy levels is fixed, quantum computing
uses a matrix representation (Brownlee, 2019).

2.4.8. Machine Learning and Statistics


Machine learning refers to algorithms that can learn from data without relying on rules-based
programming. Statistical modelling, on the other hand, is a formalisation of the relationships
between variables in the data in the form of mathematical equations. Three basic attributes can
be observed in machine learning, namely predictions, supervised learning and unsupervised
learning, and three in statistics: sample, population and hypothesis. In dealing with data the
two subjects are similar and conceptually alike; they differ mainly in how they express the
same ideas. For example, estimation in statistics and learning in machine learning have the
same meaning and differ only in terminology. Likewise, the terms classifier, data point and
regression in statistics correspond to hypothesis, example and supervised learning in machine
learning. That is why machine learning is sometimes called glorified statistics. In recent
times, both machine learning and statistical techniques have been used in pattern recognition,
knowledge discovery and data mining. A Venn diagram is given below that shows how these two
processes are connected (Stewart, 2019).

Fig. 17: Interconnection between Data Mining and Statistics

(KD Nuggets, 2016)

3. Basics of Statistics
Data collection and data analysis are vital in statistics. A new theory can be formulated on
the basis of data, so the data collected must be reasonable and consistent with the actual
nature of the population. The relationships between the features of the units in the
population must also be considered. The data should then be analysed systematically. Collecting
numerical data and analysing it is known as a statistical survey or statistical investigation.
Collection of data is the first and most important stage in any statistical survey. The method
of collection depends on considerations such as the objective, scope, nature of the information
and availability of resources (Make me Analyst, 2019). Data collected for the first time,
keeping in view the objective of the survey, is known as primary data. Primary data can be
collected by any one of the following methods:
• Direct personal observation.
• Indirect oral interview.
• Information through agencies.
• Information through mailed questionnaires.
• Information through schedules filled in by investigators.

Secondary data, on the other hand, is data that was collected earlier by someone else. Unlike
real-time data, it is regarded as past data. Secondary data may have been collected either by
census or by sampling methods. Sources of such data include government publications, websites,
books, journals, articles and internal records. Collected data is obtained in raw form, which
is voluminous and hard to comprehend. It therefore needs to be simplified so that it is easier
to understand and use. The first stage of simplification is classification, followed by
tabulation. Classification reduces the bulk of the data and makes it more comprehensible.
Tabulation also simplifies complex data by listing it according to a logical sequence of
related characteristics. The next step of simplification is frequency and frequency
distribution (Clark, 2019). The number of units associated with each value of the variable is
called the frequency of that value. Suppose the variable takes the value 51 and the value 51
occurs 6 times; then 6 is the frequency of the value 51. Listing the values of a variable
together with their corresponding frequencies gives the frequency distribution of the variable.
There are two types of frequency distribution: discrete and continuous. A discrete frequency
distribution lists all the observed values. An example of a discrete frequency distribution is
given below:
Table 1: Frequency Distribution of Number of children

(SPSS Tutorials, 2019)


The example of Continuous Frequency Distribution is laid down below:

Table 2: Continuous Frequency Distribution

(SPSS Tutorials, 2019)

If we consider the class 20 – 30, 20 is the lower class limit and 30 is the upper class
limit. 30 – 20 = 10 is the width of the class.

The mid value of the class is $\frac{20+30}{2} = 25$

The class interval that does not include upper class limit is called exclusive type of class
interval. The class interval that includes the upper class limits, is called inclusive – type of
class interval.

Example:

Inclusive Type:

Marks
0 - 9 15
10 - 19 20

Here, the class 0-9 includes the value “9”

Exclusive Type:

0 - 10 15
10 - 20 20
20 - 30 28

The class 0-10 does not include the value 10. If the value 10 occurs, it is included in the
class 10-20.


The end process of simplification is known as Graphical Presentation. Most often used
graphs for Frequency Distribution are:

3.1. Histogram
The frequency distribution is represented by a set of rectangular bars whose areas are proportional
to the class frequencies. When the class intervals have equal width, the variable is taken along
the X-axis and the frequency along the Y-axis, so the height of each rectangle equals the class
frequency (Clark, 2019).

Example:

For the following distribution of age, the histogram is drawn as follows:

Age: 0-10 10-20 20-30 30-40 40-50

No. of People: 5 10 15 12 8

Fig. 18: Histogram Graph

(SPSS Tutorials, 2019)

The mode can be located graphically from the histogram. The tallest rectangle lies over the
class 20 – 30. From its two top corners, lines are drawn diagonally to the adjacent top corners of
the neighbouring rectangles. From the point where these two lines intersect, a perpendicular
(shown as a dotted line) is drawn to the x-axis. The x-reading at that point gives the
mode of the distribution.


3.2. Frequency Polygon


The mid values of class-intervals are plotted against frequency of the class interval. These
points are joined by straight lines. A diagram is shown below:

Fig. 19: Frequency Polygon

(SPSS Tutorials, 2019)

3.3. Frequency Curve


First we draw the histogram for the given data. Then we join the mid points of the tops of the
rectangles by a smooth curve. The total area under the Frequency Curve represents the total
frequency. Frequency curves are the most useful form of frequency distribution (Clark, 2019).

Fig. 20: Frequency Curve

(SPSS Tutorials, 2019)

3.4. Ogives
Ogives are cumulative frequency graphs, also known as cumulative histograms. They are used when
it is required to know how many data values lie above or below a certain value in the data set.
The cumulative frequency is calculated from a frequency table: each frequency is added to the
total of the frequencies of all data values before it in the data set. The last cumulative
frequency is therefore equal to the total number of data values, because it is the sum of all
the frequencies (Clark, 2019).
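
The running total described above can be sketched in a few lines of Python. This is only an illustration: it reuses the age distribution from the histogram example in section 3.1, and the variable names are our own.

```python
from itertools import accumulate

# Age classes and frequencies taken from the histogram example (section 3.1).
classes = ["0-10", "10-20", "20-30", "30-40", "40-50"]
frequencies = [5, 10, 15, 12, 8]

# "Less than" cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequencies))

for label, f, cf in zip(classes, frequencies, cumulative):
    print(f"{label:>6}  f={f:>2}  cumulative={cf:>2}")

# The last cumulative frequency equals the total number of observations.
total = sum(frequencies)
```

Plotting the cumulative values against the upper class limits would give the "less than" ogive.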

3.5. Central Tendency


In statistics, the central or typical value of a distribution is known as its Central
Tendency. Measures of central tendency are often called averages. There are
different measures of central tendency for a distribution (Purple math, 2019).
These are:

3.5.1. Arithmetic Mean


Arithmetic Mean is defined as the sum of the values divided by the number of values and is
represented by X̄.

Example:

Find the arithmetic mean of 15, 17, 22, 21, 19, 26, 20

$$\bar{X} = \frac{15+17+22+21+19+26+20}{7} = \frac{140}{7} = 20$$

Discrete Data with Frequency can be presented through an example:

Students’ Age 20 23 25 28 30

No. of Students 3 5 10 6 1

$$\bar{X} = \frac{20 \times 3 + 23 \times 5 + 25 \times 10 + 28 \times 6 + 30 \times 1}{3+5+10+6+1} = \frac{623}{25} = 24.92$$
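
Both calculations above can be checked with a short Python sketch. The function names are illustrative and not part of any library.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


def frequency_mean(values, frequencies):
    """Mean of discrete data where each value occurs `frequencies[i]` times."""
    total = sum(x * f for x, f in zip(values, frequencies))
    return total / sum(frequencies)


simple_mean = arithmetic_mean([15, 17, 22, 21, 19, 26, 20])

ages = [20, 23, 25, 28, 30]
students = [3, 5, 10, 6, 1]
mean_age = frequency_mean(ages, students)

print(simple_mean, mean_age)
```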

For a continuous distribution the following formula is applicable:

$$\bar{X} = A + \frac{\sum fd}{\sum f} \times \text{C.I.}$$

Where,

A = Arbitrary point or assumed mean

d = (X – A) / width of class interval

X = Mid value of the class

f = Frequency of the class

C.I. = size of the equal class interval

Example:

Height in cms X: 140 – 150 150 – 160 160 – 170 170 – 180

No. of students 50 65 80 55

Solution

Step 1:

Mid value (X)   f     d = (X – 155)/10   fd
145             50    -1                 -50
155             65     0                   0
165             80     1                  80
175             55     2                 110
Total           250                      140

Step 2:

$$\bar{X} = 155 + \frac{140}{250} \times 10 = 155 + 5.6 = 160.6 \text{ cm}$$
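
The assumed-mean (step-deviation) calculation can be sketched as follows, using the mid values and frequencies of the height example. The function name and argument names are our own.

```python
def step_deviation_mean(mid_values, frequencies, assumed_mean, width):
    """Mean of grouped data via A + (sum(f*d) / sum(f)) * width."""
    # d is the deviation of each mid value from the assumed mean, in class widths.
    d = [(x - assumed_mean) / width for x in mid_values]
    sum_fd = sum(f * di for f, di in zip(frequencies, d))
    return assumed_mean + sum_fd / sum(frequencies) * width


mid_values = [145, 155, 165, 175]   # mid points of 140-150, ..., 170-180
frequencies = [50, 65, 80, 55]

mean_height = step_deviation_mean(mid_values, frequencies, assumed_mean=155, width=10)
print(round(mean_height, 2))
```

Any assumed mean gives the same result; choosing a mid value near the centre only keeps the hand arithmetic small.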

The merits and demerits of the Arithmetic Mean are given below:

 It is simple to calculate.
 It is based on all values.
 It is rigidly defined.
 It is capable of further algebraic treatment.
 On the other hand, it is affected by extreme values.
 It cannot be determined for distributions with open-end class intervals.

3.5.2. Median
The Median is the middle-most value in a set of values arranged in ascending or descending
order of magnitude. Median is denoted by M. In the case of a discrete series, with or without
frequency, it is given by M = value of the ((n+1)/2)th item (Purple math, 2019).

Example 1

Find the median value of the following:

Set values 45,32,31,46,40,28,27,37,36,41,47,50

When we arrange the above set in ascending order we get the following thing

27,28,31,32,36,37,40,41,45,46,47,50

n = 12

Therefore, Median = $\frac{12+1}{2}$th value = 6.5th value.

The 6th value is 37 and the 7th value is 40, so

Median = 37 + 0.5 (40 – 37) = 37 + 1.5 = 38.5
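
A minimal Python sketch of the (n+1)/2 rule with interpolation between neighbouring values is given below. The helper `value_at_position` is a name we introduce for illustration.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional position, interpolating linearly."""
    lower = int(position)          # whole part, e.g. 6 for position 6.5
    fraction = position - lower    # fractional part, e.g. 0.5
    value = sorted_values[lower - 1]
    if fraction == 0:
        return value
    return value + fraction * (sorted_values[lower] - value)


def median(values):
    ordered = sorted(values)
    return value_at_position(ordered, (len(ordered) + 1) / 2)


data = [45, 32, 31, 46, 40, 28, 27, 37, 36, 41, 47, 50]
median_value = median(data)
print(median_value)
```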

Example 2

Find the Median value of x series

X: 12, 16, 10, 14, 17, 20, 15

f: 4, 9, 3, 5, 4, 2, 10


Step1

If we arrange the series in the ascending order, we shall get the following:

X f Cumulative Frequency
10 3 3
12 4 7
14 5 12
15 10 22
16 9 31
17 4 35
20 2 37

n= 37

Therefore, Median = $\frac{37+1}{2}$th value = 19th value

We have to find the terms of the Cumulative Frequency:

X f Cumulative Frequency Terms


10 3 3 1-3
12 4 7 4-7
14 5 12 8 -12
15 10 22 13 - 22
16 9 31 23 - 31
17 4 35 32 - 35
20 2 37 36 - 37

Now you see that 19th value falls under the range of 13-22. Therefore, Median (M) is 15.

In the case of continuous series, another example is given below:

Weight (kg): 30-35 35-40 40-45 45-50 50-55

Frequency 10 15 40 27 8


Solution

The series given here is continuous in nature. Here, the class interval is marked as exclusive
type. Total cumulative frequency is ascertained in the following manner.

Weight Frequency f Cumulative Frequency


30 - 35 10 10
35 - 40 15 25
40 - 45 40 65
45 - 50 27 92
50 - 55 8 100

Since the class interval is exclusive type so we have to consider N/2 instead of (N + 1)/2. Here,
N/2 is 100/2 = 50.

The median class is the first class whose cumulative frequency reaches N/2 = 50, i.e. the
class 40 – 45. The Median is then found from the formula:

$$\text{Median} = l + \frac{\frac{n}{2} - Cf_0}{f} \times \text{C.I.}$$

Here l = Lower limit of the median class

Cf₀ = Cumulative Frequency up to the previous class

f = Frequency of the median class

C.I. = Width of Class Interval

Therefore,

$$M = 40 + \frac{\frac{100}{2} - 25}{40} \times 5 = 40 + 3.125 = 43.125$$
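
The grouped-median calculation can be sketched in Python as follows. It assumes exclusive-type classes of equal width, as in the weight example; the function name is our own.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median of a continuous frequency distribution with equal class widths."""
    half = sum(frequencies) / 2
    cumulative_before = 0
    for lower, f in zip(lower_limits, frequencies):
        # The median class is the first class whose cumulative frequency reaches N/2.
        if cumulative_before + f >= half:
            return lower + (half - cumulative_before) / f * width
        cumulative_before += f
    raise ValueError("frequencies must not be empty")


weight_lower_limits = [30, 35, 40, 45, 50]
weight_frequencies = [10, 15, 40, 27, 8]

median_weight = grouped_median(weight_lower_limits, 5, weight_frequencies)
print(median_weight)
```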

Several merits can be observed in using the Median. These are:

 It can be easily understood and computed
 It is not affected by extreme values
 It can be located graphically
 It can be used for qualitative data
 It can be calculated for distributions with open-end classes

Its demerits are that it is not based on all values and it is not capable of further
algebraic treatment.


3.5.3. Mode
Mode is the value that occurs with the highest frequency. It is denoted by Z. Those who are
involved in business put emphasis on the modal value. For example, when planning production,
shoe and garment manufacturers lay stress on the modal size of the people. For
discrete data, with or without frequency, it is the value corresponding to the highest frequency
(Purple math, 2019).

Example

Find the Mode of the following data:

6,7,6,8,9,9,9,10,8,7,7,9,10,9,9,9,8,8,11

Arranging the data in ascending order:

Size Frequency
6 2
7 3
8 4
9 7
10 2
11 1

Since the frequency of 9 is 7 and the frequency of every other value is below 7, the mode
is 9.

In the case of a continuous frequency distribution, the following formula is applicable. It is
given in Fig. 21:

$$\text{Mode} = l + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times h$$

Fig. 21: Mode

(Vedantu, 2019)

l = Lower limit of Modal Class

f1 = Frequency of Modal Class

f0 = Frequency of Previous Class


f2 = Frequency of Succeeding Class

h = Width of Class Interval

Example

The builders of Pravesh Apartment recorded the number of customers preferring each plinth
area for their apartments as follows:

Plinth Area Sq. Ft. No. of Customers


600 - 800 4
800 - 1000 10
1000 - 1200 15
1200 - 1400 25
1400 - 1600 12
1600 - 1800 8
Above 1800 2

Find the Modal Plinth area:

Solution

Here, the intervals are of exclusive type. The highest frequency is 25. The corresponding
interval, 1200 – 1400, is called the modal class.

By applying the formula shown in Fig. 21 we shall compute the mode:

$$\text{Mode} = 1200 + \frac{25-15}{2 \times 25 - 15 - 12} \times 200 = 1200 + \frac{2000}{23} = 1200 + 86.96 = 1286.96$$
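
The same calculation can be sketched in Python using the symbols l, f1, f0, f2 and h defined above. The function name is illustrative, and the open-ended class "Above 1800" is entered with lower limit 1800 purely so that its frequency is available; its width is never used because it is not the modal class.

```python
def grouped_mode(lower_limits, width, frequencies):
    """Mode of grouped data: l + (f1 - f0) / (2*f1 - f0 - f2) * h."""
    # Index of the modal class (highest frequency).
    i = max(range(len(frequencies)), key=frequencies.__getitem__)
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0                      # previous class
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0   # succeeding class
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * width


plinth_lower_limits = [600, 800, 1000, 1200, 1400, 1600, 1800]
customers = [4, 10, 15, 25, 12, 8, 2]

modal_plinth_area = grouped_mode(plinth_lower_limits, 200, customers)
print(round(modal_plinth_area, 2))
```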

The following are the merits and demerits of Mode:

 It can be found by inspection.
 It is not affected by extreme values.
 It can be calculated for distributions with open-end classes.
 It can be located graphically and used for qualitative data.
 It is not based on all values.
 It is not capable of further mathematical treatment.
 It is much affected by sampling fluctuations.

3.5.4. Empirical Relationship between Statistical Averages


The empirical relationship between Mean, Median and Mode can be represented in the following
way:

Mean – Mode = 3 (Mean – Median)

Or, Mean – Mode = 3 Mean – 3 Median

Or, – Mode = 2 Mean – 3 Median

Or, Mode = 3 Median – 2 Mean

3.5.5. Geometric Mean


The geometric mean is defined for positive numbers. The values can be given as a discrete
series, with or without frequency, or as a continuous series (Purple math, 2019).

In the case of a discrete series of n numbers without frequency, the formula is:

$$GM = \sqrt[n]{X_1 \times X_2 \times \cdots \times X_n}$$

In the case of discrete series with frequency the following formula is applicable

Fig. 22: GM Discrete Series with Frequency

(Vedantu, 2019)

Where n = f1 + f2 + … + fn

In the case of continuous series, the following formula is applicable:

Fig. 23: GM Continuous Series

(Vedantu, 2019)


Example

The growth in bad-debt expense for Das office supply company over the last few years is as
follows:

Calculate the average percentage increase in bad debt expense over this time period

Year: 1992 1992 1993 1994 1995 1996 1997 1998

Expense Rate: 1.11 1.09 1.075 1.08 1.095 1.08 1.20

Solution

$$G.M. = \sqrt[7]{(1.11)(1.09)(1.075)(1.08)(1.095)(1.08)(1.20)} \approx 1.1036$$

The average increase is 1.1036 – 1 = 0.1036, i.e. about 10.36% per year.
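
A geometric mean is easy to get wrong by hand, so a short Python sketch is useful for checking. It takes the nth root through logarithms, and the growth factors are entered exactly as printed in the table above; the function name is our own.

```python
import math


def geometric_mean(values):
    """nth root of the product of n positive values, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


growth_factors = [1.11, 1.09, 1.075, 1.08, 1.095, 1.08, 1.20]

gm = geometric_mean(growth_factors)
average_increase_percent = (gm - 1) * 100
print(round(gm, 4), round(average_increase_percent, 2))
```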

3.5.6. Harmonic Mean


The Harmonic Mean is a kind of numerical average. It is obtained by dividing the number of
observations by the sum of the reciprocals of the numbers in the series (Purple math, 2019).
Suppose the three numbers are 2, 4 and 6.

$$\text{Harmonic Mean} = \frac{3}{\frac{1}{2} + \frac{1}{4} + \frac{1}{6}} = \frac{3}{\frac{11}{12}} = \frac{36}{11} = 3.27$$

Another example

Find the Harmonic Mean of the following distribution:

X 121 122 123 124 125


f 5 25 36 37 20


Solution:
X f f/X
121 5 0.04132
122 25 0.20492
123 36 0.29268
124 37 0.29839
125 20 0.16000
Total 123 0.99731

$$H.M. = \frac{\sum f}{\sum (f/X)} = \frac{123}{0.99731} = 123.33$$
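
Both harmonic-mean examples can be reproduced with the following sketch. The function name and the optional `frequencies` argument are our own choices.

```python
def harmonic_mean(values, frequencies=None):
    """Total frequency divided by the sum of f/X (f = 1 when no frequencies are given)."""
    if frequencies is None:
        frequencies = [1] * len(values)
    return sum(frequencies) / sum(f / x for x, f in zip(values, frequencies))


hm_simple = harmonic_mean([2, 4, 6])

x_values = [121, 122, 123, 124, 125]
f_values = [5, 25, 36, 37, 20]
hm_grouped = harmonic_mean(x_values, f_values)

print(round(hm_simple, 2), round(hm_grouped, 2))
```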

3.5.7. Quartiles
When a distribution is divided into four equal portions, we get the First Quartile (Q1), Second
Quartile (Q2) and Third Quartile (Q3).

For a discrete series of N values arranged in ascending order, Q1 is the $\frac{N+1}{4}$th
value, Q2 is the $\frac{2(N+1)}{4}$th value and Q3 is the $\frac{3(N+1)}{4}$th value.

Example

Weekly sales of a product in 8 different shops are as follows. Calculate the quartiles (Purple
math, 2019).

Sales in units 309, 312, 305, 307, 310, 308, 308, 306

Solution

First, we write the series in ascending order.

305, 306, 307, 308, 308, 309, 310, 312

Q1 = $\frac{n+1}{4}$th value = $\frac{8+1}{4}$th value = 2.25th value

= 2nd value + 0.25 (3rd value – 2nd value)

= 306 + 0.25 (307 – 306)

= 306.25

Q2 = $\frac{2(n+1)}{4}$th value = 2.25 × 2 = 4.5th value

= 4th value + 0.5 (5th value – 4th value)

= 308 + 0.5 (308 – 308)

= 308

Q3 = $\frac{3(n+1)}{4}$th value = 2.25 × 3 = 6.75th value

= 6th value + 0.75 (7th value – 6th value)

= 309 + 0.75 (310 – 309)

= 309 + 0.75

= 309.75
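
The positional rule k(n+1)/4 with interpolation can be sketched in Python as below. The helper names are ours; statistical libraries offer several quartile conventions, so their results may differ slightly from this rule depending on the method chosen.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional position, interpolating linearly."""
    lower = int(position)
    fraction = position - lower
    value = sorted_values[lower - 1]
    if fraction == 0:
        return value
    return value + fraction * (sorted_values[lower] - value)


def quartiles(values):
    """Q1, Q2 and Q3 using the k*(n+1)/4 positional rule."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


sales = [309, 312, 305, 307, 310, 308, 308, 306]
q1, q2, q3 = quartiles(sales)
print(q1, q2, q3)
```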

3.6. Mean Deviation


Suppose there are certain numbers like 3, 6, 6, 7, 8, 11, 15, 16

The mean of these numbers is

$$\frac{3+6+6+7+8+11+15+16}{8} = \frac{72}{8} = 9$$

The next step is to find the deviation of each number from the mean.

When the mean is 9, the corresponding absolute deviations are

|3 – 9|, |6 – 9|, |6 – 9|, |7 – 9|, |8 – 9|, |11 – 9|, |15 – 9|, |16 – 9|

i.e. 6, 3, 3, 2, 1, 2, 6, 7

$$\text{Mean Deviation} = \frac{6+3+3+2+1+2+6+7}{8} = \frac{30}{8} = 3.75$$

Therefore, when the sum of the absolute deviations is divided by the number of values we
get the mean deviation. In other words, it is the mean of the absolute deviations
of the values from a central value (Purple math, 2019).
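
The definition translates directly into Python. In this sketch the function name and the optional `centre` argument (which allows deviations from a median or mode instead of the mean) are our own.

```python
def mean_deviation(values, centre=None):
    """Mean of the absolute deviations from `centre` (the arithmetic mean by default)."""
    if centre is None:
        centre = sum(values) / len(values)
    return sum(abs(x - centre) for x in values) / len(values)


numbers = [3, 6, 6, 7, 8, 11, 15, 16]
md_from_mean = mean_deviation(numbers)
print(md_from_mean)
```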

The mean deviation from the mean for a discrete series without frequency is given by:

$$MD(\bar{X}) = \frac{\sum |X - \bar{X}|}{n}$$

For data with frequency it is given by:

Fig. 24: Mean Deviation – Data with Frequency

(Frost, 2019)

In the case of a continuous series, “X” represents the mid value of the class interval.
Similarly, we can have the mean deviation from the median or mode: X̄ is replaced by the median
or mode in the above formula. The mean deviation is least when it is taken from the median.
This is known as the minimal property of mean deviation. The corresponding relative measure is
the coefficient of mean deviation.

Fig. 25: Coefficient of Mean Deviation

(Frost, 2019)

Example

Calculate the mean deviation and also the coefficient of mean deviation using i) mean ii) median.
Compare the results.

Heights of Plants (cm) 140, 147, 143, 145, 144, 158, 142, 141

Solution

X               |X – 145| (from Mean)   |X – 143.5| (from Median)
140             5                       3.5
141             4                       2.5
142             3                       1.5
143             2                       0.5
144             1                       0.5
145             0                       1.5
147             2                       3.5
158             13                      14.5
Total = 1160    30                      28.0

$$\text{Mean} = \frac{1160}{8} = 145$$

$$\text{Mean Deviation from Mean} = \frac{\sum |X - \bar{X}|}{n} = \frac{30}{8} = 3.75$$

Median is the $\frac{8+1}{2}$th value = 4.5th value

Therefore, Median = 143 + 0.5 (144 – 143) = 143.5

$$\text{Mean Deviation from Median} = \frac{28}{8} = 3.5$$

$$\text{Coefficient of MD from Mean} = \frac{3.75}{145} = 0.0259$$

$$\text{Coefficient of MD from Median} = \frac{3.5}{143.5} = 0.0244$$

Therefore, the Mean Deviation from the median is less than the M.D. from the mean.


Example

The following is the distribution of employees of a firm according to their efficiency. Find Mean
Deviation and Coefficient of Mean Deviation from i) Mean and ii) Median

Efficiency Index 18-22 22-26 26-30 30-34 34-38


Employees 20 30 11 3 1

Solution

Efficiency Index   Mid value X   f    d = (X – 28)/4   fd    f|X – 24|   Cf   |X – Med|   f|X – Med|
18 – 22            20            20   -2               -40   80          20   3.67        73.33
22 – 26            24            30   -1               -30   0           50   0.33        10.00
26 – 30            28            11    0                 0   44          61   4.33        47.67
30 – 34            32             3    1                 3   24          64   8.33        25.00
34 – 38            36             1    2                 2   12          65   12.33       12.33
Total                            65                    -65   160                          168.33

(The products in the last column are computed with the unrounded median 23.667.)

For Continuous Distribution

$$\bar{X} = A + \frac{\sum fd}{\sum f} \times \text{C.I.} = 28 + \frac{-65}{65} \times 4 = 28 - 4 = 24$$

$$MD(\bar{X}) = \frac{\sum f|X - 24|}{\sum f} = \frac{160}{65} = 2.46$$

$$\text{Coefficient of Mean Deviation} = \frac{2.46}{24} = 0.1025$$

Determination of Mean Deviation from Median and Coefficient

$\frac{N}{2}$th value = $\frac{65}{2}$ = 32.5th value

Median Class = 22 – 26

C.I. = 4

$$\text{Median} = 22 + \frac{32.5 - 20}{30} \times 4 = 22 + 1.667 = 23.67$$

$$MD(\text{Median}) = \frac{\sum f|X - \text{Med}|}{\sum f} = \frac{168.33}{65} = 2.59$$

$$\text{Coefficient} = \frac{2.59}{23.67} = 0.109$$

3.7. Standard Deviation


Measures of dispersion such as the range and the quartile deviation are not based on all
values. The dispersion of a data set relative to its mean is measured by the standard
deviation. The variance of the data set is calculated first; the square root of the variance
is called the standard deviation (Maths Fun, 2019). In other words, the standard deviation of
a set of values is the positive square root of the mean of the squared deviations of the
values from their arithmetic mean. It is denoted by sigma (σ).

For discrete series without frequency it is given by:

Fig. 26: Variance without Frequency

(Maths Fun, 2019)


For a discrete series with frequency and for a continuous series it is given by:

Fig. 27: Variance-with Frequency

(Maths Fun, 2019)

Where X is the mid value of class interval for continuous series. Alternative form for (A) and (B)
S.D. are:
For A

Fig.28: Alternative form of Variance for A

(Maths Fun, 2019)

Fig. 29: Alternative form of Variance for B

(Maths Fun, 2019)

Where d = X-A. A is Assumed Mean

Example

Calculate the SD for variation in temperature observed during two months at Kolkata:

Temperature 18 19 20 21 22 23 24 25 Total

Frequency 3 5 8 16 12 8 5 3 60


Solution

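Here the Assumed Mean is 21

X       f     d = X – 21   fd    fd²
18      3     -3           -9    27
19      5     -2           -10   20
20      8     -1           -8    8
21      16     0            0    0
22      12     1           12    12
23      8      2           16    32
24      5      3           15    45
25      3      4           12    48
Total   60                 28    192

$$\bar{X} = A + \frac{\sum fd}{\sum f} \times \text{C.I.} = 21 + \frac{28}{60} \times 1 = 21.47$$

$$\text{Variance} = \left[\frac{\sum fd^2}{\sum f} - \left(\frac{\sum fd}{\sum f}\right)^2\right] \times \text{C.I.}^2 = \left[\frac{192}{60} - \frac{28^2}{60^2}\right] \times 1 = 3.2 - 0.218 = 2.982$$

$$SD = \sqrt{2.982} = 1.727$$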
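
The result can be cross-checked without an assumed mean by working directly from the definition (mean of the squared deviations from the arithmetic mean). The sketch below divides by the total frequency, as the formulas in this section do; the function name is illustrative.

```python
import math


def frequency_sd(values, frequencies):
    """Return (mean, variance, standard deviation) of data given with frequencies."""
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(values, frequencies)) / n
    # Variance: frequency-weighted mean of squared deviations from the mean.
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, frequencies)) / n
    return mean, variance, math.sqrt(variance)


temperatures = [18, 19, 20, 21, 22, 23, 24, 25]
days = [3, 5, 8, 16, 12, 8, 5, 3]

mean_temp, variance_temp, sd_temp = frequency_sd(temperatures, days)
print(round(mean_temp, 2), round(variance_temp, 3), round(sd_temp, 3))
```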

3.7.1 Properties of Standard Deviation

The properties of standard deviation can be described in the following manner.


 It is independent of origin but not independent of scale.
 Standard deviation is always greater than or equal to zero.
 It is the least of all root – mean – square deviations.


 Suppose the mean of n1 values is X1 and the mean of n2 values is X2, and the standard
deviations are σ1 and σ2 respectively. The combined standard deviation is given by the
following formula.

Fig. 30: Combined Standard Deviation

(Maths Fun, 2019)

Where d1 = X – X1 and d2 = X – X2

X is the combined mean of n1 and n2 values.

Example

The average weight of 100 apples from area A is 150 g with a standard deviation of 10 g.
Similarly, the average weight of 200 apples from area B is 200 g with a standard deviation of
15 g. Find the combined standard deviation.

Solution

The following things are given:

n1 = 100, n2 = 200, X1 = 150, X2 = 200, σ1 = 10, σ2 = 15

$$\text{Combined average} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} = \frac{100 \times 150 + 200 \times 200}{100 + 200} = \frac{15000 + 40000}{300} = \frac{55000}{300} = 183.33$$

Therefore, d1² = (150 – 183.33)² = (–33.33)² = 1110.89

And d2² = (200 – 183.33)² = (16.67)² = 277.89

$$\text{Standard Deviation} = \sqrt{\frac{n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)}{n_1 + n_2}} = \sqrt{\frac{100(100 + 1110.89) + 200(225 + 277.89)}{300}} = 27.18$$
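
A short Python sketch of the combined mean and combined standard deviation is given below. It squares each group's standard deviation (10² = 100 and 15² = 225) before adding the squared distance of the group mean from the combined mean; the function name is our own.

```python
import math


def combined_mean_and_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Combined mean and standard deviation of two groups."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    d1 = mean1 - mean   # distance of group 1 mean from the combined mean
    d2 = mean2 - mean   # distance of group 2 mean from the combined mean
    variance = (n1 * (sd1 ** 2 + d1 ** 2) + n2 * (sd2 ** 2 + d2 ** 2)) / n
    return mean, math.sqrt(variance)


combined_mean, combined_sd = combined_mean_and_sd(100, 150, 10, 200, 200, 15)
print(round(combined_mean, 2), round(combined_sd, 2))
```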

4. Probabilities
In weather forecasting, it is often heard that rain might occur during a certain period
of time. The weather office reveals the possibility of rain but does not state with certainty
that it will rain at the stated time; an element of uncertainty is implied in the
announcement. Likewise, share market analysts often say that a share price may go up or down,
but they never guarantee it. Therefore, it is required to handle uncertainty
in a systematic way. Probability theory helps us to make wiser decisions (Wolfram MathWorld,
2019). Probability is a numerical measure which indicates the chance of occurrence of an event
A. It is denoted by P(A). It is the ratio of the number of outcomes favourable to the event A (m)
to the total number of equally likely outcomes of the experiment (n):

$$P(A) = \frac{m}{n}$$

4.1. Certain terminologies used in the process of Probability.


4.1.1. Experiment
An operation that results in a definite outcome is called an experiment. For example, when we
toss a coin it shows either head or tail on falling. If the coin stands on its edge, no definite
outcome results and the toss is not counted as an experiment.

4.1.2. Random Experiment


When the outcome of an experiment cannot be predicted in advance, it is called a random
experiment or stochastic experiment.

4.1.3. Sample Space


The set of all possible outcomes of a random experiment is called the sample space, which is
denoted by S. In tossing two coins the following outcomes can occur.

S = {HH, HT, TH, TT}

The number of outcomes is denoted by n(S) = 4


If the number of outcomes is finite, it is called finite sample space otherwise; it is called infinite
sample space.

4.1.4. Event
An event is a part (subset) of the sample space. It may consist of a single outcome or of a
combination of outcomes. In tossing two coins, getting exactly one head (event A) is a
combination of the outcomes HT and TH. Therefore, P(A) = 2/4 = ½.

4.1.5. Equally likely Events


Two or more events are said to be equally likely if they have equal chance of occurrence. In
tossing an unbiased coin getting head and tail are equally likely.

4.1.6. Mutually Exclusive Events


Two or more events are said to be mutually exclusive if the occurrence of one prevents the
occurrence of other events. In tossing a coin if head falls, it prevents the occurrence of tail and
vice versa.

4.1.7. Exhaustive set of Events


A set of events is exhaustive if one or other of the events in the set occurs whenever the
experiment is conducted. It can also be defined as a set of events whose sample points together
form the total sample points of the experiment.

4.1.8. Independent Events


Two events are said to be independent of each other if the occurrence of one is not affected by
the occurrence of other or does not affect the occurrence of the other.

Illustration

Consider the tossing of three fair coins. Then

S = [HHH, HHT, HTH, THH, HTT, THT, TTH, TTT]

Let A be the event of getting three heads

Let B be the event of getting two heads

Let C be the event of getting one head

Let D be the event of getting no head

Then A = [HHH]; B = [HHT, HTH, THH]


C = [HTT, THT, TTH]

D = [TTT]

Events A, B, C and D are mutually exclusive and exhaustive but not equally likely.
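
The three-coin illustration can be verified by listing the sample space in Python. This is a small sketch; the dictionary layout and variable names are our own.

```python
from fractions import Fraction
from itertools import product

# All 2**3 = 8 outcomes of tossing three coins, e.g. "HHT".
sample_space = ["".join(p) for p in product("HT", repeat=3)]

# Events A, B, C, D: exactly three, two, one and no heads.
events = {
    name: {o for o in sample_space if o.count("H") == heads}
    for name, heads in (("A", 3), ("B", 2), ("C", 1), ("D", 0))
}
probabilities = {name: Fraction(len(e), len(sample_space)) for name, e in events.items()}

names = list(events)
mutually_exclusive = all(
    events[a].isdisjoint(events[b]) for i, a in enumerate(names) for b in names[i + 1:]
)
exhaustive = set().union(*events.values()) == set(sample_space)
equally_likely = len(set(probabilities.values())) == 1

print(probabilities, mutually_exclusive, exhaustive, equally_likely)
```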

4.2. Rules of Probability


There are different rules found under probability. These are:

4.2.1. Addition Rule


If A and B are two events, then the probability of the occurrence of either A or B is given by

P(A ∪ B) = P(A) + P(B) – P(A ∩ B)

If A and B are two mutually exclusive events, then the probability of occurrence of either A or B
is given by:

P(A ∪ B) = P(A) + P(B)

If A, B and C are any three events then the probability of occurrence of either A or B or C is
given by

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) – P(A ∩ B) – P(B ∩ C) – P(A ∩ C) + P(A ∩ B ∩ C)

If A1, A2, A3, …, An are “n” mutually exclusive and exhaustive events then the probability of
occurrence of at least one of them is given by

P(A1 ∪ A2 ∪ … ∪ An) = P(A1) + P(A2) + … + P(An) = 1

The addition rule can be presented with a Venn diagram in the following manner.

Fig. 31: Venn Diagram

(Wolfram MathWorld, 2019)


When managers have several options in front of them and are required to choose only one
of them for implementation, the addition rule of probability can be
applied. Sometimes a situation demands that both A and B be chosen for
implementation. In such a case, the multiplication rule of probability is applied.

4.2.2. Multiplication Rule


If A and B are two independent events, then the probability of occurrence of A and B is given by

P(A ∩ B) = P(A) P(B)

4.3. Conditional Probability


When the occurrence of one event influences the chance of occurrence of another event,
conditional probability comes into play. For example, an increase in petrol prices tends to
bring a price hike in other petroleum products. The conditional probability of occurrence of an
event “A” given that the event “B” has already occurred is denoted by P(A / B). Here A and B
are dependent events (Statistics how to, 2019). Accordingly, a new rule is formed. If A and B
are dependent events, then the probability of occurrence of both A and B is given by

P(A ∩ B) = P(A) P(B / A)

= P(B) P(A / B)

It follows that:

$$P(A / B) = \frac{P(A \cap B)}{P(B)}$$

$$P(B / A) = \frac{P(A \cap B)}{P(A)}$$
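
The addition, multiplication and conditional probability rules can be checked by counting outcomes. In the sketch below the two events A and B on three coin tosses are our own illustrative choice, not taken from the notes.

```python
from fractions import Fraction
from itertools import product

sample_space = {"".join(p) for p in product("HT", repeat=3)}


def prob(event):
    """Classical probability: favourable outcomes over total outcomes."""
    return Fraction(len(event), len(sample_space))


A = {o for o in sample_space if o[0] == "H"}         # first coin shows a head
B = {o for o in sample_space if o.count("H") >= 2}   # at least two heads

p_a_and_b = prob(A & B)
p_a_or_b = prob(A | B)
p_a_given_b = p_a_and_b / prob(B)
p_b_given_a = p_a_and_b / prob(A)

# Addition rule and multiplication rule for dependent events.
addition_rule_holds = p_a_or_b == prob(A) + prob(B) - p_a_and_b
multiplication_rule_holds = p_a_and_b == prob(B) * p_a_given_b == prob(A) * p_b_given_a

print(p_a_and_b, p_a_or_b, p_a_given_b, p_b_given_a)
```

Here P(A) × P(B) = 1/4 differs from P(A ∩ B) = 3/8, so these two events are dependent and the simple product rule for independent events does not apply.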


4.3.1. Bivariate Distribution under Conditional Probability


The bivariate distribution can be presented in the following manner. A librarian analysed the
type of visitors and their choice of library section as follows:

Level of Education   Newspaper   Magazine   Novel   Subject   Total
Under Graduates      50          100        120     50        320
Graduates            70          90         50      100       310
Post Graduates       100         60         30      150       340
Total                220         250        200     300       970

We can get the following distributions


i)
Types of Visitors Frequency
Undergraduates 320
Graduates 310
Post Graduates 340
Total 970

This represents the distribution of visitors by level of education irrespective of section. It is
one Marginal Distribution.
ii)
Newspaper Magazine Novels Subjects Total
220 250 200 300 970

This represents the distribution of people in sections irrespective of their educational levels. It is
another Marginal Distribution. These are the two marginal distributions found in this bivariate
data, one for each of its two variables (level of education and section).

iii)
Level of Education News paper Magazine Novels Subjects Total
Under Graduate 50 100 120 50 320

This represents the distribution of people in sections given that they are under graduates.
Therefore, it is a conditional distribution. Thus, for any bivariate distribution whose two
variables have m and n classifications, there exist two marginal distributions and m + n
conditional distributions. In this case there are 3 + 4 = 7 conditional distributions.
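
The marginal and conditional distributions of the library table can be computed as follows. The data are the counts from the table; the dictionary layout and function names are our own.

```python
from fractions import Fraction

sections = ["Newspaper", "Magazine", "Novel", "Subject"]
visitors = {
    "Under Graduates": [50, 100, 120, 50],
    "Graduates": [70, 90, 50, 100],
    "Post Graduates": [100, 60, 30, 150],
}

# Marginal distributions: row totals and column totals.
row_totals = {level: sum(counts) for level, counts in visitors.items()}
column_totals = {
    s: sum(counts[i] for counts in visitors.values()) for i, s in enumerate(sections)
}
grand_total = sum(row_totals.values())


def conditional_on_level(level):
    """Distribution over sections, given the level of education."""
    return {s: Fraction(c, row_totals[level]) for s, c in zip(sections, visitors[level])}


def conditional_on_section(section):
    """Distribution over levels of education, given the section."""
    i = sections.index(section)
    return {lvl: Fraction(c[i], column_totals[section]) for lvl, c in visitors.items()}


# One conditional distribution per row and per column: 3 + 4 = 7.
number_of_conditionals = len(visitors) + len(sections)
print(conditional_on_level("Under Graduates"))
```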

4.4. Seven important steps on Probability for Solving Problem


 Define the events
 Find the total outcome of the experiment
 Find the probability of each event
 If the words “either … or” are used, check whether the events are mutually exclusive or not,
to apply the proper addition rule.
 If the words “both … and” are used, check whether the events are independent or
dependent, to apply the proper multiplication rule.
 To find the total outcome of the experiment use 2ⁿ or 6ⁿ in the case of coins or dice
respectively, where “n” is the number of coins or dice thrown at a time, or the number of
times a coin or die is thrown. In all other cases use nCr.

$$^{n}C_{r} = \frac{n!}{(n-r)!\,r!}$$

Example 1

$$^{10}C_{2} = \frac{10!}{(10-2)!\,2!} = \frac{10 \times 9}{2 \times 1} = 45$$

$$^{16}C_{3} = \frac{16!}{(16-3)!\,3!} = \frac{16 \times 15 \times 14}{3 \times 2 \times 1} = 560$$
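
These counts can be confirmed in Python. The sketch writes out the factorial formula and compares it with the standard library function `math.comb` (available from Python 3.8); the function name `n_choose_r` is our own.

```python
import math


def n_choose_r(n, r):
    """nCr = n! / ((n - r)! * r!)."""
    return math.factorial(n) // (math.factorial(n - r) * math.factorial(r))


ten_choose_two = n_choose_r(10, 2)
sixteen_choose_three = n_choose_r(16, 3)

# Total outcomes for n coins or n dice.
outcomes_three_coins = 2 ** 3
outcomes_two_dice = 6 ** 2

print(ten_choose_two, sixteen_choose_three, outcomes_three_coins, outcomes_two_dice)
```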


Example 2

Find the probability of getting a head when a coin is tossed

Let “A” be the event of getting a head i.e. n(A) = 1

S = {H,T}

or n(S) = 2

$$P(A) = \frac{n(A)}{n(S)} = \frac{1}{2}$$

Example 3 (Part A)

What is the probability of getting exactly two heads when 3 coins are tossed, and what is the
probability of getting at least two heads?

Sample space (S):

HHH, HHT, HTT, TTT, TTH, THH, THT, HTH, so n(S) = 8

Outcomes with exactly two heads (A):

HHT, THH, HTH, so n(A) = 3

$$P(A) = \frac{n(A)}{n(S)} = \frac{3}{8}$$

Example 3 (Part B)

Out of the total outcomes, the following contain at least two heads:

HHH, HHT, THH, HTH, so n(A) = 4

$$P(A) = \frac{n(A)}{n(S)} = \frac{4}{8} = \frac{1}{2}$$


Example 4 (i)

What is the probability of (i) getting a sum of “nine” and (ii) getting a sum of at least 9 when
two dice are thrown together?

The total number of outcomes is n(S) = 6² = 36

The outcomes for which the sum is 9 are

(3,6) (6,3) (4,5) (5,4), so n(A) = 4

$$P(A) = \frac{n(A)}{n(S)} = \frac{4}{36} = \frac{1}{9}$$

Example 4 (ii)

The outcomes for which the sum is at least 9 are

(3,6) (6,3) (4,5) (5,4) (5,5) (5,6) (6,5) (6,6) (4,6) (6,4), so n(A) = 10

$$P(A) = \frac{n(A)}{n(S)} = \frac{10}{36} = \frac{5}{18}$$
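
Both dice probabilities can be confirmed by enumerating the 36 outcomes. This is a small sketch with variable names of our own.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of throwing two dice.
rolls = list(product(range(1, 7), repeat=2))

sum_is_nine = [r for r in rolls if sum(r) == 9]
sum_at_least_nine = [r for r in rolls if sum(r) >= 9]

p_sum_is_nine = Fraction(len(sum_is_nine), len(rolls))
p_sum_at_least_nine = Fraction(len(sum_at_least_nine), len(rolls))

print(p_sum_is_nine, p_sum_at_least_nine)
```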

Example 5

The board of directors of a company wants to form a five-member quality management committee
to monitor the quality of their products. The company has 5 scientists, 4 engineers, and 6
accountants. Find the probability that the committee will contain 2 scientists, 1 engineer and
2 accountants.

Here we have to use the formula for nCr:

$$^{n}C_{r} = \frac{n!}{(n-r)!\,r!}$$

where n = total number of items to choose from and r = number of items to be chosen.

Here n = 15 and r = 5

$$n(S) = {}^{15}C_{5} = \frac{15 \times 14 \times 13 \times 12 \times 11}{5 \times 4 \times 3 \times 2 \times 1} = 3003$$

$$n(A) = {}^{5}C_{2} \times {}^{4}C_{1} \times {}^{6}C_{2} = \frac{5 \times 4}{2 \times 1} \times \frac{4}{1} \times \frac{6 \times 5}{2 \times 1} = 10 \times 4 \times 15 = 600$$

$$P(A) = \frac{n(A)}{n(S)} = \frac{600}{3003} = \frac{200}{1001} \approx 0.20$$
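
The committee problem reduces to three independent selections, which can be sketched with `math.comb` from the standard library. It assumes a committee of five members drawn from the 15 people, as in the calculation above.

```python
from fractions import Fraction
from math import comb

# Ways to choose any 5 people out of 5 + 4 + 6 = 15.
total_committees = comb(15, 5)

# Ways to choose 2 of the 5 in the first group, 1 of 4 engineers, 2 of 6 accountants.
favourable_committees = comb(5, 2) * comb(4, 1) * comb(6, 2)

p_committee = Fraction(favourable_committees, total_committees)
print(total_committees, favourable_committees, p_committee, float(p_committee))
```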

5. Regression
Regression is defined as “the measure of the average relationship between two or more
variables in terms of the original units of the data”. Correlation analysis studies the
strength of the relationship between two variables x and y. Regression analysis goes further
and quantifies the dependence of one variable on the other, so that the average value of one
variable can be predicted for a given value of the other. Example: there are two variables x
and y, and y depends on x. The dependence is expressed in the form of an equation. Regression
analysis is used to estimate the values of the dependent variable from the values of the
independent variable. It is also used to get a measure of the error involved in using the
regression line as a basis for estimation (Gallo, 2019). The regression coefficients can be
used to calculate the correlation coefficient: their product is the square of the correlation
that prevails between the two variables.

5.1. Regression Lines


For a set of paired observations there exist two straight lines. The line drawn such that sum of
vertical deviation is zero and sum of their squares is minimum, it is called regression line of y
and x. It is used to estimate y – values for given x – values. The line drawn such that sum of
horizontal deviation is zero and sum of their squares is minimum, it is called Regression line of x
on y. It is used to estimate x – values for given y – values. The smaller angle between these

Industry4.0/M8SI/v1.0/121219 Data Science | Session No.: I


Data Science

lines, higher is the correlation between the variables. The regression lines always intersect at
(X, Y). The regression equation of y on x is given by:

Y - Y = byx (X – X)

On the other hand, the regression equation of x on y is given by:

X - X = bxy (Y - Y)

N⅀dxdy−(⅀dx)(⅀dy) N⅀dxdy−(⅀dx)(⅀dy)
Where bxy = bxy =
N⅀d𝑥 2 −(⅀d𝑥 2 ) N⅀dy2 −(⅀dy2 )

The regression equations found under the above conditions are said to be fitted by the method of
least squares; byx and bxy are called the regression coefficients.
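The two coefficient formulas can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `regression_coefficients`, its parameters and the small data set are assumptions made for the example, not part of the notes.

```python
def regression_coefficients(xs, ys, assumed_x=0.0, assumed_y=0.0):
    """Return (b_yx, b_xy) using deviations from assumed means.

    Hypothetical helper: the name and signature are illustrative only.
    """
    n = len(xs)
    dx = [x - assumed_x for x in xs]   # deviations of X from its assumed mean
    dy = [y - assumed_y for y in ys]   # deviations of Y from its assumed mean
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)

    numerator = n * sum_dxdy - sum_dx * sum_dy
    b_yx = numerator / (n * sum_dx2 - sum_dx ** 2)   # slope of y on x
    b_xy = numerator / (n * sum_dy2 - sum_dy ** 2)   # slope of x on y
    return b_yx, b_xy


# Made-up observations, used only to exercise the function
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b_yx, b_xy = regression_coefficients(xs, ys, assumed_x=3, assumed_y=4)
print(b_yx, b_xy)
```

The choice of assumed means does not change the result; it only keeps the intermediate sums small when the work is done by hand.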

5.2. Regression Coefficient


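$$b_{yx} \times b_{xy} = r^{2}, \qquad r = \pm\sqrt{b_{yx} \times b_{xy}}$$

$$b_{yx} \times b_{xy} \le 1$$

The two regression coefficients and r always share the same sign: if byx is negative, then bxy is also negative and r is negative.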

The whole thing can also be expressed in the following manner:

$$b_{yx} = r \times \frac{\sigma_y}{\sigma_x}$$

$$b_{xy} = r \times \frac{\sigma_x}{\sigma_y}$$

It is an absolute measure.
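These identities can be checked numerically. The following is a minimal sketch using NumPy on made-up data (the arrays `x` and `y` are assumptions for illustration); population standard deviations are used throughout so that the quantities are consistent with one another.

```python
import numpy as np

# Made-up paired observations, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r = np.corrcoef(x, y)[0, 1]                       # correlation coefficient
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
b_yx = cov_xy / x.var()                            # regression coefficient of y on x
b_xy = cov_xy / y.var()                            # regression coefficient of x on y

# Each line prints True if the corresponding identity holds
print(np.isclose(b_yx * b_xy, r ** 2))
print(np.isclose(b_yx, r * y.std() / x.std()))
print(np.isclose(b_xy, r * x.std() / y.std()))
```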

5.2.1. Features of Regression Coefficient


The first feature of the regression coefficients is that they are not symmetric, i.e. in general:
$$b_{yx} \ne b_{xy}$$

byx can be greater than one, but then bxy must be less than one so that byx × bxy ≤ 1. Moreover, the
regression equation is based on a cause-and-effect relationship and is meant for estimation
(Gallo, 2019).


Example

Find the regression equations from the following data:

Age of Husband   18   19   20   21   22   23   24   25   26   27
Age of Wife      17   17   18   18   19   19   19   20   21   22

Hence calculate the correlation coefficient:

Age of Husband (x)   dx = x - 22   dx²   Age of Wife (y)   dy = y - 19   dy²   dx·dy

18                   -4            16    17                -2            4     8
19                   -3            9     17                -2            4     6
20                   -2            4     18                -1            1     2
21                   -1            1     18                -1            1     1
22                   0             0     19                0             0     0
23                   1             1     19                0             0     0
24                   2             4     19                0             0     0
25                   3             9     20                1             1     3
26                   4             16    21                2             4     8
27                   5             25    22                3             9     15

Total 225            5             85    190               0             24    43

$$\bar{X} = \frac{225}{10} = 22.5, \qquad \bar{Y} = \frac{190}{10} = 19$$

The regression equation of Y on X is $Y - \bar{Y} = b_{yx}(X - \bar{X})$, where

$$b_{yx} = \frac{10 \times 43 - (5)(0)}{10 \times 85 - (5)^{2}} = \frac{430}{825} = 0.521$$

→ Y – 19 = 0.521 (X – 22.5)

→ Y = 0.521X + 7.2775

Again, the regression equation of X on Y is $X - \bar{X} = b_{xy}(Y - \bar{Y})$, where

$$b_{xy} = \frac{10 \times 43 - (5)(0)}{10 \times 24 - (0)^{2}} = \frac{430}{240} = 1.792$$


→ X – 22.5 = 1.792 (Y – 19)

→ X = 1.792Y – 11.548

$$r = \sqrt{0.521 \times 1.792} = 0.966$$
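The worked example can be reproduced with NumPy as a cross-check. This is a sketch rather than part of the original solution; the variable names are assumptions, while the data and the assumed means (22 and 19) are taken from the table above.

```python
import numpy as np

husband = np.array([18, 19, 20, 21, 22, 23, 24, 25, 26, 27], dtype=float)
wife = np.array([17, 17, 18, 18, 19, 19, 19, 20, 21, 22], dtype=float)

n = len(husband)
dx = husband - 22          # deviations from the assumed mean of x
dy = wife - 19             # deviations from the assumed mean of y

numerator = n * np.sum(dx * dy) - np.sum(dx) * np.sum(dy)
b_yx = numerator / (n * np.sum(dx ** 2) - np.sum(dx) ** 2)
b_xy = numerator / (n * np.sum(dy ** 2) - np.sum(dy) ** 2)

# Intercepts of the two regression lines (each line passes through the means)
intercept_y_on_x = wife.mean() - b_yx * husband.mean()
intercept_x_on_y = husband.mean() - b_xy * wife.mean()

# Both coefficients are positive here, so r takes the positive root
r = np.sqrt(b_yx * b_xy)

print(round(b_yx, 3), round(b_xy, 3))
print(round(intercept_y_on_x, 3), round(intercept_x_on_y, 3))
print(round(r, 3))
```

The intercepts computed this way differ slightly in the third decimal place from the hand calculation, because the hand calculation rounds the coefficients to three decimals before multiplying.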

6. Linear Regression Explained Using Python


Python is regarded as one of the most powerful programming languages. It provides high-level
data structures and supports object-oriented programming in an efficient way. To follow the
Python examples, it is first necessary to grasp the underlying concept from linear algebra.
After that, we move on to linear regression, which will enable us to apply the basic concepts
of Python in a fruitful way.

6.1. Graphical presentation of Linear Algebra

Fig. 32: Linear Relationship


Mathematically, a linear relationship is one that satisfies the equation given below:
y = mx + c

where m represents the slope and c the y-intercept.

The X axis denotes speed, which is the independent variable. The Y axis denotes distance, which
depends upon speed, so it is regarded as the dependent variable.

The X and Y variables are connected through the parameters “m” and “c”. Graphically, y = mx + c
plots in the x-y plane as a straight line. The slope is represented by “m” and the y-intercept by
“c”, which is simply the value of “y” when x = 0. Any two distinct points on the line can be used
to determine “m”.

m can be calculated through the following formula:

$$m = \frac{y_2 - y_1}{x_2 - x_1}$$
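As a small illustration of the two-point formula, the sketch below computes the slope and then the intercept of the line through two points. The helper name `line_through` and the sample points are assumptions made for this example.

```python
def line_through(p1, p2):
    """Return (m, c) of the line y = m*x + c through two distinct points.

    Hypothetical helper, shown only to illustrate the slope formula.
    """
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no finite slope")
    m = (y2 - y1) / (x2 - x1)   # rise over run
    c = y1 - m * x1             # value of y when x = 0
    return m, c


# Made-up points, for illustration only
m, c = line_through((1, 5), (3, 9))
print(m, c)
```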

In the above graph, the Y axis carries the dependent variable, distance, as it depends on speed,
represented by the X axis. More distance can be covered when speed increases, and vice versa.
This relationship can be expressed through a straight line given by the above formula.

Examples of Linear Relationships

Instances of linear relationships can be observed in daily life. Take speed, for example: speed
is the distance travelled per unit of time. Suppose you travel from A to B over a 41.3-mile
stretch and take 40 minutes to reach B. Your average speed is then 41.3 ÷ (40/60) ≈ 62 miles
per hour, i.e. just above 60 miles per hour. One further point is worth adding here: the extent
of the linear relationship between the dependent and independent variables can be measured
through the application of the linear regression technique.

6.2. Graphical Presentation of Linear Regression


Many phenomena can be described by applying linear regression. It is regarded as one of the
most common statistical data analysis techniques. As already stated, linear regression is used
to measure the extent of the relationship between the dependent variable and one or more
independent variables. The dependent variable must be measured on a continuous scale
(e.g. a 0-100 test score). The independent variables, on the other hand, can be either
categorical or continuous. An example of a categorical variable is male versus female.
The Linear Regression Algorithm can be explained in the following manner:
Fig. 33: Linear Regression Algorithm

(ML Glossary, 2019)

From the earlier step we know that the X axis represents the independent variable and the Y axis
represents the dependent variable. Both axes are marked with numbers such as 1, 2, etc.
[see the upper right portion of the graph]. From the box we see that when X = 1, Y = 3.
Plotting these two values gives a point, represented by a green ball. The positions of the other
balls are determined in the same way. Once all the positions are known, we can
calculate the X-mean and Y-mean. The mean point is (3, 3.6).

The next step is to determine the slope of the straight line. In Table 36 we see
that a straight line starts near 2 on the Y axis and slopes upward through the
mean point (X = 3, Y = 3.6). The question is how to obtain the slope of this line.

For this purpose, we have to determine “m”, the least-squares slope:

$$m = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^{2}}$$

Fig. 34: Formula of Slope (m)

(Stat Trek, 2019)


We know the X-mean, the Y-mean and the individual values of X and Y. With these data, and
the graph given below, we can work through the Linear Regression Algorithm.

Fig. 35: Linear Regression Algorithm

(Stat Trek, 2019)

Explanation of the graph

Step 1
We calculate the deviations of each X value from the X-mean and of each Y value from the Y-mean.
We get the following values:

(-2, -1, 0, 1, 2) for X and (-0.6, 0.4, -1.6, 0.4, 1.4) for Y

Step 2
We square the X-deviations and add them up, and we also add up the products of the paired
deviations:

$$\sum (x - \bar{x})^{2} = 10, \qquad \sum (x - \bar{x})(y - \bar{y}) = 4$$

Step 3
Substituting these two sums into the formula for “m” gives

$$m = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^{2}} = \frac{4}{10}$$

Step 4
After getting the value of m, we substitute it, together with the mean point (3, 3.6), into y = mx + c:

$$3.6 = \frac{4}{10} \times 3 + c$$

c = 3.6 – 1.2

c = 2.4

It means that we have determined the two unknown values i.e. m= 4/10 or 0.4 and c = 2.4
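Steps 1 to 4 can be sketched in NumPy as follows. The X and Y values are not listed explicitly in the text; they are reconstructed here from the stated deviations and the mean point (3, 3.6), which is an assumption of this sketch, as are the variable names.

```python
import numpy as np

# Reconstructed from the deviations and means given in the steps above
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 4, 2, 4, 5], dtype=float)

# Step 1: deviations from the means
x_dev = x - x.mean()
y_dev = y - y.mean()

# Steps 2 and 3: sums of squares and cross-products, then the slope
m = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)

# Step 4: the line passes through the mean point, which fixes the intercept
c = y.mean() - m * x.mean()

# Predicted values of y for each x on the fitted line
y_pred = m * x + c

print(m, c)
print(y_pred)
```

Up to floating-point rounding, this gives m ≈ 0.4 and c ≈ 2.4, with predicted values close to 2.8, 3.2, 3.6, 4.0 and 4.4.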

From the graph we know all the values of the two variables. If we put the values of m and c
obtained above into the formula y = mx + c, we can easily find the predicted value of y for each x,
i.e. the points through which the straight line passes. This can be shown in a graph, given
below:
Fig. 36: Slope – Line Presentation

(Stat Trek, 2019)

From Table 39 we observe that a straight line is obtained by applying the calculated
values of m and c to the different values of the x variable. The predicted values of y for each
x have thus been determined. With these predicted values, i.e. 2.8, 3.2, 3.6, 4.0 and 4.4, a
straight line can easily be drawn, represented here by a red line.

Step 5
The red line is drawn through the predicted values shown in Table 39 in the earlier section.
This line is called the regression line. Now we have to find out how closely the data fit
the regression line. For this purpose, we determine a statistical measure called the
R-squared value. The R-squared value is also known as the coefficient of determination, or the
coefficient of multiple determination. The predicted values are 2.8, 3.2,
3.6, 4.0 and 4.4. R-squared compares the distances (predicted – mean) with the distances
(actual – mean):

Fig. 37: R-Squared

(Stat Trek, 2019)

$$R^{2} = \frac{\sum (y_p - \bar{y})^{2}}{\sum (y - \bar{y})^{2}}$$

where $y_p$ = predicted values and $\bar{y}$ = mean, i.e. 3.6

To find R², we can follow the graph below:

Fig. 38: Determination of R²

(Stat Trek, 2019)

Applying the formula, we find that R² is close to 0.3, i.e. R² ≈ 0.3
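The R² value can be checked with a short calculation. This sketch assumes the actual Y values (3, 4, 2, 4, 5), reconstructed from the Y-deviations and the mean of 3.6 given in Step 1, and uses the predicted values listed in the text.

```python
import numpy as np

# Actual values reconstructed from the stated deviations (an assumption of this sketch)
y = np.array([3, 4, 2, 4, 5], dtype=float)
# Predicted values taken from the text
y_pred = np.array([2.8, 3.2, 3.6, 4.0, 4.4])

y_mean = y.mean()
ss_regression = np.sum((y_pred - y_mean) ** 2)   # (predicted - mean) squared, summed
ss_total = np.sum((y - y_mean) ** 2)             # (actual - mean) squared, summed
r_squared = ss_regression / ss_total

print(round(r_squared, 2))
```

The ratio is 1.6 / 5.2, roughly 0.31, which agrees with the value of about 0.3 read from the graph.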

In the same way, R² can take other values such as 0.7, 0.9, 1 or 0.02, depending on where the
actual data points lie relative to the regression line. The data points are marked in
green. How the fit changes with the position of the points is shown in the following graphs,
step by step.


When R² ≈ 0.7:

Fig. 39: Measurement of R²

(Stat Trek, 2019)

When R² ≈ 0.9:

Fig. 40: Measurement of R²

(Stat Trek, 2019)

When R² ≈ 1:

Fig. 41: Measurement of R²

(Stat Trek, 2019)


When R² ≈ 0.02:

Fig. 42: Measurement of R²

(Stat Trek, 2019)

7. Linear Regression in Python


Linear regression is regarded as one of the fundamental statistical and machine learning
techniques. In regression we work with dependent features and independent features, also
called the dependent variables and the independent variables. Regression is a technique that
searches for the relationship between these variables. Python, on the other hand, is a high-level,
interpreted, interactive and object-oriented scripting language. It is easy to read, easy to
learn and easy to handle. It supports an interactive mode and is portable, extendable and
scalable. How linear regression can be carried out in Python is shown through the
following examples:
7.1. Example – 1
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# plt.figure(figsize=(100, 10))

# Read the CSV file into a DataFrame
data = pd.read_csv('Headbrain.csv')
print(data.shape)
data.head()

(237, 4)


Out [2]:
Fig. 43: Outcome

(Stat Trek, 2019)

7.2. Example – 2
In [7]:
# Collecting x and y (the column names must match the headers in the CSV file)
x = data['Head size'].values
y = data['Brain Weights'].values

# Mean of x and y
mean_x = np.mean(x)
mean_y = np.mean(y)

# Total number of values
length = len(x)

# Using the least-squares formula, calculate m and c
numer = 0
denom = 0
for i in range(length):
    numer += (x[i] - mean_x) * (y[i] - mean_y)
    denom += (x[i] - mean_x) ** 2
m = numer / denom
c = mean_y - (m * mean_x)
print(m, c)

0.263429339489 325.573421049


7.3. Example – 3
In [4]:
# Plotting values and regression line
max_x = np.max(x)
min_x = np.min(x)
x1 = np.linspace(min_x, max_x, 500)
y1 = c + m * x1

# Plotting the line
plt.plot(x1, y1, color='#58b970', label='Regression Line')
# Plotting the scatter points
plt.scatter(x, y, label='Scatter Plot')
plt.xlabel('Head size in cm3')
plt.ylabel('Brain weight in grams')
plt.legend()
plt.show()

Fig. 44: Regression Line with Scattered Plot

(Stat Trek, 2019)


7.4. Example – 4
In [5]:
# R-squared: regression sum of squares divided by total sum of squares
ss_t = 0
ss_r = 0
length = len(x)
for i in range(length):
    y_pred = c + m * x[i]
    ss_t += (y[i] - mean_y) ** 2
    ss_r += (y_pred - mean_y) ** 2
r2 = ss_r / ss_t
print(r2)

0.639311719957
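The loops in Examples 2 and 4 can also be written in vectorised form and cross-checked against NumPy's own line fit. The sketch below is self-contained and runs on synthetic data, not on the head and brain data set: the helper names `fit_line` and `r_squared`, the random seed and the generated values are all assumptions for illustration.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares slope and intercept (vectorised form of the Example 2 loop)."""
    x_mean, y_mean = x.mean(), y.mean()
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    c = y_mean - m * x_mean
    return m, c


def r_squared(x, y, m, c):
    """Regression sum of squares over total sum of squares, as in Example 4."""
    y_pred = c + m * x
    return np.sum((y_pred - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)


# Synthetic data: a noisy straight line (made up for this sketch)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)

m, c = fit_line(x, y)
r2 = r_squared(x, y, m, c)

# np.polyfit with degree 1 returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, 1)

print(np.isclose(m, slope), np.isclose(c, intercept))
print(r2)
```

For a straight-line fit with an intercept, this R² equals the square of the correlation coefficient between x and y, which gives a further way to check the result.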


References
Brownlee, J., 2019. Examples of Linear Algebra in Machine Learning. [Online]
Available at: https://machinelearningmastery.com/examples-of-linear-algebra-in-machine-learning/
[Accessed 03 01 2020].
Clark, J., 2019. Statistics Basics: Here’s What You Need to Know. [Online]
Available at: https://magoosh.com/statistics/statistics-basics-heres-what-you-need-to-know/
[Accessed 03 01 2020].
Donges, N., 2019. Basic Linear Algebra for Deep Learning. [Online]
Available at: https://towardsdatascience.com/linear-algebra-for-deep-learning-f21d7e7d7f23
[Accessed 23 12 2019].
Ducksters, 2019. Scalars and Vectors. [Online]
Available at: https://www.ducksters.com/science/physics/scalars_and_vectors.php
[Accessed 03 01 2020].
Frost, J., 2019. Measures of Central Tendency: Mean, Median, and Mode. [Online]
Available at: https://statisticsbyjim.com/basics/measures-central-tendency-mean-median-mode/
[Accessed 26 12 2019].
Gallo, A., 2019. A Refresher on Regression Analysis. [Online]
Available at: https://hbr.org/2015/11/a-refresher-on-regression-analysis
[Accessed 03 01 2020].
Intellipaat, 2019. What is Data Science?. [Online]
Available at: https://intellipaat.com/blog/what-is-data-science/
[Accessed 03 01 2020].
KD Nuggets, 2016. Machine learning Vs Statistics. [Online]
Available at: https://www.kdnuggets.com/2016/11/machine-learning-vs-statistics.html
[Accessed 02 01 2020].
Machine Learning Mastery, 2019. Examples of Linear Algebra in Machine Learning. [Online]
Available at: https://machinelearningmastery.com/examples-of-linear-algebra-in-machine-learning/
[Accessed 02 01 2020].
Make me Analyst, 2019. Basic Statistics for Data Analysis. [Online]
Available at: http://makemeanalyst.com/basic-statistics-for-data-analysis/
[Accessed 03 01 2020].
Math Planet, 2019. How to operate with matrices. [Online]
Available at: https://www.mathplanet.com/education/algebra-2/matrices/how-to-operate-with-matrices
[Accessed 03 01 2020].


Maths Fun, 2019. Matrices. [Online]
Available at: https://www.mathsisfun.com/algebra/matrix-introduction.html
[Accessed 24 12 2019].
Maths Fun, 2019. Standard Deviation and Variance. [Online]
Available at: https://www.mathsisfun.com/data/standard-deviation.html
[Accessed 31 12 2019].
Purple math, 2019. Mean, Median, Mode, and Range. [Online]
Available at: https://www.purplemath.com/modules/meanmode.htm
[Accessed 03 01 2020].
SPSS Tutorials, 2019. What is a Frequency Distribution?. [Online]
Available at: https://www.spss-tutorials.com/frequency-distribution-what-is-it/
[Accessed 26 12 2019].
Statistics, 2019. Scalar Definition: Scalars vs. Vectors (Difference Between Scalar and Vector). [Online]
Available at: https://www.statisticshowto.datasciencecentral.com/scalar-definition/
[Accessed 23 12 2019].
Statistics how to, 2019. Conditional Probability: Definition & Examples. [Online]
Available at: https://www.statisticshowto.datasciencecentral.com/what-is-conditional-probability/
[Accessed 03 01 2020].
Stewart, M., 2019. The Actual Difference Between Statistics and Machine Learning. [Online]
Available at: https://towardsdatascience.com/the-actual-difference-between-statistics-and-machine-learning-64b49f07ea3
[Accessed 03 01 2020].
The Physics Classroom, 2019. Scalars and Vectors. [Online]
Available at: https://www.physicsclassroom.com/class/1DKin/Lesson-1/Scalars-and-Vectors
[Accessed 03 01 2020].
Vedantu, 2019. Mean median mode formula. [Online]
Available at: https://www.vedantu.com/formula/mean-median-mode-formula
[Accessed 27 12 2019].
Wolfram MathWorld, 2019. Probability. [Online]
Available at: http://mathworld.wolfram.com/Probability.html
[Accessed 31 12 2019].
