Data Science
Session No. I
Version 1.0
Material from the published or unpublished work of others that is referred to in the Class
Notes is credited to the author in question in the text. The Class Notes are 10,050
words in length. Research ethics issues have been considered and handled appropriately
within the Globsyn Business School guidelines and procedures.
Table of Contents
List of Tables
1. Introduction
2.2. Vector
3.1. Histogram
4. Probabilities
4.1.4. Event
5. Regression
References
1. Introduction
Linear Algebra is a continuous form of mathematics. It is an important tool for modelling
natural phenomena efficiently, and much of science and engineering depends on its application.
It is not discrete mathematics. Discrete mathematics deals with data grouped into finite sets;
for example, a matrix can have a 2nd or a 3rd element but no 2.5th. The functions of continuous
mathematics, on the other hand, follow continuity and can be evaluated to any accuracy; in
Linear Algebra, for example, values can be taken to any number of decimal places. Since most
computer scientists have little practice with continuous mathematics, they need to learn this
technique, as Linear Algebra is central to almost all areas of mathematics.
A student who wants to study Deep Learning algorithms must have a grasp of Linear Algebra;
without it, no one can proceed far in that subject. The same holds for Machine Learning, where
knowledge of Deep Learning algorithms is necessary (Donges, 2019). Adequate knowledge of
Linear Algebra gives the learner a better understanding of how machine learning systems are
developed and makes it possible to handle a variety of Machine Learning algorithms. Mastering
Machine Learning requires a deep knowledge of linear equations, which deal mostly with vectors
and matrices and also with scalars.
2. Mathematical Objects
An abstract object that arises in mathematics is referred to as a mathematical object. The
concept comes from the philosophy of mathematics. Scalars, vectors and matrices are the three
kinds of mathematical object considered here.
(Donges, 2019)
2.1. Scalar
A scalar is just a single number, e.g., 60. A single real number that represents a quantity is
called a scalar. For example, when we say 60 cm we mean that a certain length is 60 cm; that
length is a scalar (The Physics Classroom, 2019).
2.2. Vector
A vector is a quantity that is defined through multiple scalars. For a scalar we can speak only
of magnitude. For a vector we speak not only of the magnitude of the mathematical object but
also of its direction. Consider the following example:
Fig. 2: Vector
(Statistics , 2019)
Two things are marked on the line: the initial point (A) and the direction (a). The magnitude
is the length of the line between its two end points; the lower-case ‘a’ also indicates the
direction (The Physics Classroom, 2019).
(Statistics, 2019)
To determine the magnitude of the vector from point A to point B, compute the distance between
the two points. The distance formula can be used once the coordinates are given
(Ducksters, 2019).
Example:
Find the magnitude of vector AB where point A is (3, 2) and point B is (7, 4).
\[ |AB| = \sqrt{(7-3)^2 + (4-2)^2} = \sqrt{4^2 + 2^2} = \sqrt{16 + 4} = \sqrt{20} \approx 4.47 \]
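The distance formula used in the example can be checked with a few lines of Python. This is a minimal sketch; the helper name `magnitude` is chosen here for illustration and does not come from the notes.

```python
import math

def magnitude(a, b):
    """Length of the vector from point a to point b (Euclidean distance)."""
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

A = (3, 2)
B = (7, 4)
length_ab = magnitude(A, B)
print(round(length_ab, 2))  # 4.47
```

The same function works for points with more than two coordinates, because it sums the squared differences over however many components are supplied.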
Numerically, a vector can be represented as an ordered array of numbers arranged in a row or
a column. A vector has a single index, which points to a specific value within the vector.
In a data table, each row represents an observation and each column represents a feature of
that observation. For example, the Iris flower data set is laid down below:
The data above is a matrix, which is a key data structure in linear algebra. To build a
supervised machine learning model, the data is split into inputs and outputs: the measurements
form the matrix (X) and the flower species form the vector (Y). The vector is another key data
structure in linear algebra. In the above table every row has the same number of columns, so
the data is vectorised: rows can be provided to a model one at a time or in a batch, and the
model can be pre-configured to expect rows of a fixed width (Intellipaat, 2019).
2.4. Matrix
A ‘Matrix’ is a framework of a fixed number of rows and columns in which collected numbers are
arranged; that is, it is a rectangular array of numbers arranged into columns and rows. The
numbers are generally real numbers, and collected data can be expressed in matrix form. For
example, the grades for an exam (afterwards converted into a matrix) are shown in the
following form:
When the conversion is made, the row and column identifiers are replaced by a function
identifier. The following chart shows such a replacement (Math Planet, 2019):
Fig. 6: Matrices
The numbers contained in the above matrix are known as elements.
Four basic operations can be carried out on matrices: Addition, Subtraction, Multiplication
and Division.
2.4.1. Addition
When two matrices are added, the outcome is as follows:
Fig. 7: Addition
Both matrices must have the same number of rows and columns (Math Planet, 2019).
Corresponding elements are added:
3 + 4 = 7      8 + 0 = 8
4 + 1 = 5      6 + (−9) = −3
2.4.2. Subtraction
Subtraction between two matrices is carried out in the same manner, element by element:
Fig. 8: Subtraction
3 − 4 = −1     8 − 0 = 8
4 − 1 = 3      6 − (−9) = 15
Fig. 9: Multiplication
2 × 4 = 8      2 × 0 = 0
2 × 1 = 2      2 × (−9) = −18
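The element-wise arithmetic above can be reproduced in plain Python. In this sketch the two matrices are read off from the element-wise sums and differences shown (the figures themselves are not reproduced), and the function names `add`, `subtract` and `scale` are illustrative only.

```python
# Matrices as lists of rows; entries inferred from the worked arithmetic above.
A = [[3, 8],
     [4, 6]]
B = [[4, 0],
     [1, -9]]

def same_shape(m1, m2):
    """True when both matrices have the same number of rows and columns."""
    return len(m1) == len(m2) and all(len(r1) == len(r2) for r1, r2 in zip(m1, m2))

def add(m1, m2):
    """Element-wise sum of two matrices of equal shape."""
    if not same_shape(m1, m2):
        raise ValueError("matrices must have the same number of rows and columns")
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def subtract(m1, m2):
    """Element-wise difference of two matrices of equal shape."""
    if not same_shape(m1, m2):
        raise ValueError("matrices must have the same number of rows and columns")
    return [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def scale(k, m):
    """Multiply every element of matrix m by the scalar k."""
    return [[k * x for x in row] for row in m]

print(add(A, B))       # [[7, 8], [5, -3]]
print(subtract(A, B))  # [[-1, 8], [3, 15]]
print(scale(2, B))     # [[8, 0], [2, -18]]
```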
2.4.4. Division
Matrices are not divided directly; instead, a matrix is multiplied by an inverse. The division
is executed in the following manner:
When we multiply a matrix A by its inverse A⁻¹ we get the identity matrix I, i.e. A × A⁻¹ = I.
It can be expressed in the following manner (Math Planet, 2019):
For a 2 × 2 matrix with rows (a, b) and (c, d), a swap is applied: a and d exchange places and
the signs of b and c are reversed. The result is multiplied by a fraction whose numerator is 1
and whose denominator is the determinant. It is shown in the following manner.
Here 1 is divided by the determinant. The determinant is formed by cross multiplication:
a is multiplied by d, and b is multiplied by c, and the second product is subtracted from the
first, giving ad − bc (Math Planet, 2019).
We apply the determinant factor to compute the inverse of this matrix.
\[ \frac{1}{ad - bc} = \frac{1}{4 \times 6 - 2 \times 7} = \frac{1}{10} \]
It is equal to:
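The swap-and-divide rule for a 2 × 2 inverse can be sketched in Python as follows. The text gives only the products 4 × 6 and 2 × 7, so the layout a = 4, b = 7, c = 2, d = 6 used below is an assumption made for illustration; exact fractions are used so that the check against the identity matrix has no rounding error.

```python
from fractions import Fraction

def inverse_2x2(m):
    """Inverse of [[a, b], [c, d]]: swap a and d, negate b and c, divide by ad - bc."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular; no inverse exists")
    k = Fraction(1, det)
    return [[k * d, -k * b], [-k * c, k * a]]

def matmul_2x2(p, q):
    """Product of two 2 x 2 matrices."""
    return [[sum(p[i][t] * q[t][j] for t in range(2)) for j in range(2)]
            for i in range(2)]

M = [[4, 7], [2, 6]]   # assumed layout; determinant = 4*6 - 7*2 = 10
M_inv = inverse_2x2(M)
print(M_inv)
print(matmul_2x2(M, M_inv) == [[1, 0], [0, 1]])  # True
```

Multiplying the matrix by the computed inverse returns the identity matrix, which is the defining property stated above.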
Subtraction can be done in the same manner: only the addition signs are replaced by
subtraction signs; everything else remains the same.
meaning. Only in terms of terminology are they different. Likewise, the terms classifier, data
point and regression in Statistics carry a similar meaning to the terms hypothesis, example
and supervised learning in Machine Learning. That is why Machine Learning is sometimes called
glorified Statistics. In recent times, both Machine Learning and Statistics techniques are
used in pattern recognition, knowledge discovery and data mining. A Venn diagram is given
below that shows how these two fields are connected (Stewart, 2019).
3. Basics of Statistics
The collection and analysis of data are vital in Statistics, and a new theory can be
formulated on the basis of data. The data collected must be reasonable and consistent with the
nature of the population, and the relationships between the features of the units in the
population must be considered. The data should then be analysed systematically. Collecting
numerical data and analysing it is known as a statistical survey or statistical investigation.
Collection of data is the first and most important stage in any statistical survey. The method
of collection depends on considerations such as the objective, scope, nature of the
information and availability of resources (Make me Analyst, 2019). Data collected for the
first time with the objective of the survey in view is known as primary data. Primary data
can be collected by any one of the following methods:
Direct personal observation.
Indirect oral interview.
Information through agencies.
Information through mailed questionnaires.
Information through schedule filled by investigators.
Secondary data, on the other hand, is data that was collected earlier by someone else. Unlike
real-time data, it is regarded as past data. Secondary data may be collected by either census
or sampling methods. Sources include Government publications, websites, books, journals,
articles and internal records.
Collected data is obtained in raw form. Raw data is voluminous and hard to comprehend, so it
must be simplified to be understandable and useful. The first stage of simplification is
classification, followed by tabulation. Classification reduces the bulk of the data and makes
it more comprehensible. Tabulation also simplifies complex data by listing it in a logical
sequence of related characteristics. The next step of simplification is frequency and
frequency distribution (Clark, 2019). The number of units associated with each value of the
variable is called the frequency of that value. If the variable takes the value 51 and that
value occurs 6 times, then 6 is the frequency of the value 51. When the values of a variable
are listed with their corresponding frequencies, a frequency distribution of the variable is
formed. There are two types: the Discrete Frequency Distribution and the Continuous Frequency
Distribution. A discrete frequency distribution lists all the observed values. An example of
a discrete frequency distribution is given below:
Table 1: Frequency Distribution of Number of children
If we consider the class 20 – 30, 20 is the lower class limit and 30 is the upper class
limit. The width of the class is 30 − 20 = 10.
The mid value of the class is
\[ \frac{20 + 30}{2} = 25 \]
A class interval that does not include its upper class limit is called an exclusive type of
class interval. A class interval that includes its upper class limit is called an inclusive
type of class interval.
Example:
Inclusive Type:
Marks      Frequency
0 - 9      15
10 - 19    20
Exclusive Type:
Marks      Frequency
0 - 10     15
10 - 20    20
20 - 30    28
The class 0-10 does not include the value 10. If the value 10 occurs, it is included in the
class 10-20.
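The rule for exclusive class intervals can be expressed as a half-open range, lower ≤ value < upper. The sketch below assumes classes of equal width starting at 0, as in the example; the function name and the sample marks are hypothetical.

```python
def exclusive_class(value, lower=0, width=10):
    """Return the (lower, upper) limits of the exclusive class containing value.

    Exclusive classes are half-open: lower <= value < upper.
    """
    k = (value - lower) // width
    start = lower + k * width
    return (start, start + width)

def class_mid(lower, upper):
    """Mid value of a class interval."""
    return (lower + upper) / 2

marks = [0, 9, 10, 19, 20, 29]   # hypothetical marks, for illustration only
for mark in marks:
    print(mark, exclusive_class(mark))

print(class_mid(20, 30))  # 25.0
```

A mark of 10 is reported in the class (10, 20), not (0, 10), which matches the statement above.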
The final stage of simplification is known as graphical presentation. The graphs most often
used for frequency distributions are:
3.1. Histogram
A histogram represents the frequency distribution by a set of rectangular bars whose areas are
proportional to the class frequencies. The following condition must be maintained:
if the class intervals have equal width, the variable is taken along the X-axis and the
frequency along the Y-axis, and the rectangles are drawn accordingly (Clark, 2019).
Example:
No. of People: 5 10 15 12 8
The tallest bar lies over the range 20 to 30. From the two upper corners of this bar, two
lines are drawn diagonally to the nearest upper corners of the neighbouring bars, and these
lines meet at an intersecting point. From that point a perpendicular, shown as a dotted line,
is drawn to the x-axis. The x-reading at that point gives the mode of the distribution.
3.4. Ogives
Ogives are also known as cumulative histograms. They are graphs used to check how many data
values lie above or below a certain value in the data set. The cumulative frequency is
calculated from a frequency table: each frequency is added to the total of the frequencies of
all data values before it in the data set. The last cumulative frequency always equals the
total number of data values, because it is obtained by adding up all the frequencies
(Clark, 2019).
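The running total described above can be sketched with `itertools.accumulate`. The frequencies below are illustrative values only.

```python
from itertools import accumulate

frequencies = [3, 4, 5, 10, 9, 4, 2]          # illustrative frequencies
cumulative = list(accumulate(frequencies))     # running total of the frequencies

print(cumulative)                              # [3, 7, 12, 22, 31, 35, 37]
print(cumulative[-1] == sum(frequencies))      # True
```

The last cumulative frequency equals the total number of data values, as stated above; plotting the cumulative values against the class limits would give the ogive.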
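Example:
\[ \bar{X} = \frac{15+17+22+21+19+26+20}{7} = \frac{140}{7} = 20 \]
Students’ Age      20    23    25    28    30
No. of Students     3     5    10     6     1
\[ \bar{X} = \frac{20 \times 3 + 23 \times 5 + 25 \times 10 + 28 \times 6 + 30 \times 1}{3+5+10+6+1} = \frac{623}{25} = 24.92 \]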
Where,
Example:
Height in cms (X):   140 – 150   150 – 160   160 – 170   170 – 180
No. of students:         50          65          80          55
Solution
Step 1:
Mid (X)    f      d = (X − 155)/10    fd
145        50     -1                  -50
155        65      0                    0
165        80      1                   80
175        55      2                  110
Total     250                         140
Step 2:
\[ \bar{X} = 155 + \frac{140}{250} \times 10 = 155 + 5.6 = 160.6 \]
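The three mean calculations shown so far (simple mean, mean with frequencies, and the step-deviation method) can be checked with the sketch below. The function names are illustrative; the step-deviation helper takes the assumed mean (155) and the class width (10) as parameters.

```python
def mean(values):
    """Arithmetic mean of ungrouped values."""
    return sum(values) / len(values)

def weighted_mean(values, freqs):
    """Arithmetic mean of a discrete series with frequencies."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)

def step_deviation_mean(mids, freqs, assumed_mean, width):
    """Mean of a grouped series: A + (sum(f*d) / sum(f)) * width, with d = (X - A) / width."""
    d = [(x - assumed_mean) / width for x in mids]
    return assumed_mean + sum(f * di for f, di in zip(freqs, d)) / sum(freqs) * width

print(mean([15, 17, 22, 21, 19, 26, 20]))                        # 20.0
print(weighted_mean([20, 23, 25, 28, 30], [3, 5, 10, 6, 1]))     # 24.92
print(round(step_deviation_mean([145, 155, 165, 175],
                                [50, 65, 80, 55], 155, 10), 2))  # 160.6
```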
3.5.2. Median
The median is the middle-most value in a set of values arranged in ascending or descending
order of magnitude. It is denoted by M. For a discrete series, with or without frequency, the
median is the ((n + 1)/2)th value (Purple math, 2019).
Example 1
When we arrange the above set in ascending order we get the following:
27, 28, 31, 32, 36, 37, 40, 41, 45, 46, 47, 50
n = 12
\[ \text{Median} = \frac{12+1}{2}\text{th value} = 6.5\text{th value} = \frac{37 + 40}{2} = 38.5 \]
Example 2
f: 4, 9, 3, 5, 4, 2, 10
Step 1
If we arrange the series in ascending order, we get the following:
X     f     Cumulative Frequency
10    3      3
12    4      7
14    5     12
15   10     22
16    9     31
17    4     35
20    2     37
n = 37
\[ \text{Median} = \frac{37+1}{2}\text{th value} = 19\text{th value} \]
The 19th value falls within the cumulative frequency range 13 – 22, which corresponds to
X = 15. Therefore, the Median (M) is 15.
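Both median examples can be verified with the standard `statistics` module. For the series with frequencies, this sketch simply repeats each value as many times as its frequency, which is equivalent to locating the 19th value through the cumulative frequencies.

```python
import statistics

# Example 1: ungrouped values (n = 12, so the median is the 6.5th value)
values = [27, 28, 31, 32, 36, 37, 40, 41, 45, 46, 47, 50]
print(statistics.median(values))  # 38.5

# Example 2: discrete series with frequencies
x = [10, 12, 14, 15, 16, 17, 20]
f = [3, 4, 5, 10, 9, 4, 2]
expanded = [xi for xi, fi in zip(x, f) for _ in range(fi)]
print(len(expanded), statistics.median(expanded))  # 37 15
```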
Frequency 10 15 40 27 8
Solution
The series given here is continuous, and the class intervals are of the exclusive type. The
cumulative frequency is ascertained in the following manner.
Since the class intervals are of the exclusive type, we consider N/2 instead of (N + 1)/2.
Here, N/2 is 100/2 = 50.
f = frequency of the median class
\[ \text{Median} = 40 + \frac{\frac{100}{2} - 25}{40} \times 5 = 43.125 \]
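The calculation above is an instance of the usual interpolation formula for the median of a continuous series, M = l + ((N/2 − cf)/f) × h. The sketch below uses the numbers from the worked solution; the parameter names are illustrative, with `cum_before` standing for the cumulative frequency before the median class.

```python
def grouped_median(lower, n_total, cum_before, freq, width):
    """Median of a continuous series by interpolation inside the median class."""
    return lower + (n_total / 2 - cum_before) / freq * width

print(grouped_median(lower=40, n_total=100, cum_before=25, freq=40, width=5))  # 43.125
```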
3.5.3. Mode
The mode is the value that occurs with the highest frequency. It is denoted by Z. People in
business place emphasis on the modal value: when planning production, for example, shoe and
garment manufacturers focus on the modal size of their customers. For discrete data, with or
without frequency, the mode is the value corresponding to the highest frequency
(Purple math, 2019).
Example
6,7,6,8,9,9,9,10,8,7,7,9,10,9,9,9,8,8,11
Size Frequency
6 2
7 3
8 4
9 7
10 2
11 1
Since the frequency of 9 is 7, the highest in the table, the size 9 is the modal value. The
frequency of every other size is below 7.
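The frequency table and the mode in this example can be reproduced with `collections.Counter`; this is a small sketch using the sizes listed above.

```python
from collections import Counter

sizes = [6, 7, 6, 8, 9, 9, 9, 10, 8, 7, 7, 9, 10, 9, 9, 9, 8, 8, 11]
counts = Counter(sizes)                  # size -> frequency
mode, freq = counts.most_common(1)[0]    # value with the highest frequency

print(sorted(counts.items()))  # [(6, 2), (7, 3), (8, 4), (9, 7), (10, 2), (11, 1)]
print(mode, freq)              # 9 7
```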
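In the case of a continuous frequency distribution, the following formula is applicable. It
is given in Table 22:
\[ \text{Mode} = l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h \]
where l is the lower limit of the modal class, f₁ its frequency, f₀ and f₂ the frequencies of
the classes before and after it, and h the class width.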
(Vedantu, 2019)
Example
The builders of Pravesh Apartment found the number of customers who wish to have a given
plinth area for their apartments to be as follows:
Solution
Here, the intervals are of the exclusive type. The highest frequency is 25. The corresponding
interval is 1200 – 1400. It is called the modal class.
\[ \text{Mode} = 1200 + \frac{25 - 15}{2 \times 25 - 15 - 12} \times 200 = 1200 + \frac{2000}{23} = 1200 + 86.96 = 1286.96 \]
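The modal-class calculation can be checked with the short sketch below. The parameter names are illustrative: `f1` is the frequency of the modal class, `f0` and `f2` are the frequencies of the classes before and after it.

```python
def grouped_mode(lower, f1, f0, f2, width):
    """Mode of a continuous series: l + (f1 - f0) / (2*f1 - f0 - f2) * h."""
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * width

print(round(grouped_mode(lower=1200, f1=25, f0=15, f2=12, width=200), 2))  # 1286.96
```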
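In the case of a discrete series of n numbers without frequency, the following formula is used:
\[ GM = \sqrt[n]{X_1 \times X_2 \times \cdots \times X_n} \]
In the case of a discrete series with frequency, the following formula is applicable
(Vedantu, 2019):
\[ GM = \sqrt[n]{X_1^{f_1} \times X_2^{f_2} \times \cdots \times X_k^{f_k}} \]
where n = f₁ + f₂ + … + f_k
(Vedantu, 2019)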
Example
The growth in bad-debt expense for Das office supply company over the last few years is as
follows:
Calculate the average percentage increase in bad debt expense over this time period
Solution
\[ G.M. = \sqrt[7]{(1.11)(1.09)(1.075)(1.08)(1.095)(1.108)(1.12)} = 1.09675 \]
The average increase is therefore about 9.675 per cent per year.
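The geometric mean of growth factors can be computed through logarithms, which avoids forming a very large or very small product. The factor list in this sketch is an assumption: the data table is not reproduced in these notes, and the last two factors are taken as 1.108 and 1.12 because that list is consistent with the stated result of 1.09675.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Assumed growth factors (1 + yearly growth rate); see the note above.
factors = [1.11, 1.09, 1.075, 1.08, 1.095, 1.108, 1.12]
gm = geometric_mean(factors)
print(round(gm, 5))            # close to 1.09675
print(round((gm - 1) * 100, 2))  # average percentage increase per year
```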
\[ \text{Harmonic Mean} = \frac{3}{\frac{1}{2} + \frac{1}{4} + \frac{1}{6}} = \frac{3}{\frac{11}{12}} = \frac{36}{11} \approx 3.27 \]
Another example
Solution:
X       f      f/X
121     5      0.04132
122     25     0.20492
123     36     0.29268
124     37     0.29839
125     20     0.16000
Total   123    0.99731
\[ H.M. = \frac{123}{0.99731} = 123.33 \]
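Both harmonic mean examples follow the same pattern: the total frequency divided by the sum of f/X. A minimal sketch, with an illustrative function name, treating the no-frequency case as every frequency being 1:

```python
def harmonic_mean(values, freqs=None):
    """Harmonic mean: sum(f) / sum(f / x); without frequencies every f is 1."""
    if freqs is None:
        freqs = [1] * len(values)
    return sum(freqs) / sum(f / x for x, f in zip(values, freqs))

print(round(harmonic_mean([2, 4, 6]), 2))                       # 3.27
print(round(harmonic_mean([121, 122, 123, 124, 125],
                          [5, 25, 36, 37, 20]), 2))             # 123.33
```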
3.5.7. Quartiles
When a distribution is divided into four equal parts, the dividing values are the First
Quartile (Q1), Second Quartile (Q2) and Third Quartile (Q3).
\[ Q_1 = \frac{N+1}{4}\text{th value}, \qquad Q_3 = \frac{3(N+1)}{4}\text{th value} \]
Example
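Weekly sales of a product in 8 different shops are as follows. Calculate the quartiles
(Purple math, 2019).
Sales in units: 309, 312, 305, 307, 310, 308, 308, 306
Solution
Arranged in ascending order: 305, 306, 307, 308, 308, 309, 310, 312
\[ Q_1 = \frac{n+1}{4}\text{th value} = \frac{8+1}{4}\text{th value} = 2.25\text{th value} = 306 + 0.25 \times (307 - 306) = 306.25 \]
\[ Q_2 = \frac{2(n+1)}{4}\text{th value} = 4.5\text{th value} = 308 + 0.5 \times (308 - 308) = 308 \]
\[ Q_3 = \frac{3(n+1)}{4}\text{th value} = 6.75\text{th value} = 309 + 0.75 \times (310 - 309) = 309.75 \]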
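The (n + 1)-based positions used here correspond to the default `exclusive` method of `statistics.quantiles` in the Python standard library (Python 3.8 or later), so the quartiles can be checked as follows. Note that the 4th and 5th ordered values are both 308, so the second quartile is 308.

```python
import statistics

sales = [309, 312, 305, 307, 310, 308, 308, 306]

# n=4 asks for quartiles; the default method places the k-th quartile at
# position k*(n+1)/4 in the sorted data and interpolates linearly.
q1, q2, q3 = statistics.quantiles(sales, n=4)
print(q1, q2, q3)  # 306.25 308.0 309.75
```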
The mean of the numbers 3, 6, 6, 7, 8, 11, 15, 16 is
\[ \frac{3+6+6+7+8+11+15+16}{8} = \frac{72}{8} = 9 \]
The next step is to find the absolute deviations of these numbers from the mean,
i.e. 6, 3, 3, 2, 1, 2, 6, 7
\[ \text{Mean Deviation} = \frac{6+3+3+2+1+2+6+7}{8} = \frac{30}{8} = 3.75 \]
Therefore, when the sum of the absolute deviations is divided by the number of values we get
the mean deviation. In other words, it is the mean of the absolute deviations of the values
from a central value (Purple math, 2019).
The mean deviation from the mean for a discrete series without frequency is given by
\[ MD = \frac{\sum |X - \bar{X}|}{n} \]
For data with frequency it is given by:
\[ MD = \frac{\sum f\,|X - \bar{X}|}{\sum f} \]
(Frost, 2019)
In the case of a continuous series, “X” represents the mid value of the class interval.
Similarly, we can have the mean deviation from the median or the mode: the mean is replaced by
the median or the mode in the above formula. The mean deviation from the median is the least;
this is known as the minimal property of the mean deviation. The corresponding relative
measure is the coefficient of mean deviation, i.e. the mean deviation divided by the central
value from which it is measured.
(Frost, 2019)
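The mean deviation about the mean and about the median can be compared on the eight numbers used earlier (3, 6, 6, 7, 8, 11, 15, 16). This is a minimal sketch; the function name is illustrative.

```python
import statistics

def mean_deviation(values, centre):
    """Mean of the absolute deviations of the values from the given central value."""
    return sum(abs(x - centre) for x in values) / len(values)

data = [3, 6, 6, 7, 8, 11, 15, 16]
md_mean = mean_deviation(data, statistics.mean(data))      # about the mean (9)
md_median = mean_deviation(data, statistics.median(data))  # about the median (7.5)

print(md_mean, md_median)  # 3.75 3.5
```

The deviation measured from the median (3.5) is smaller than that measured from the mean (3.75), in line with the minimal property.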
Example
Calculate mean deviation and also coefficient of mean deviation using i) mean ii) median.
Compare the results.
Solution
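\[ \text{Mean} = \frac{1160}{8} = 145 \]
\[ \text{Mean Deviation from mean} = \frac{\sum |X - \bar{X}|}{n} = \frac{30}{8} = 3.75 \]
\[ \text{Median} = \frac{8+1}{2}\text{th value} = 4.5\text{th value} \]
\[ \text{Mean Deviation from median} = \frac{20}{8} = 2.5 \]
\[ \text{Coefficient of Mean Deviation from mean} = \frac{3.75}{145} = 0.0259 \]
\[ \text{Coefficient of Mean Deviation from median} = \frac{2.5}{143.5} = 0.0174 \]
Therefore, the mean deviation from the median is less than the mean deviation from the mean.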
Example
The following is the distribution of employees of a firm according to their efficiency. Find
the Mean Deviation and the Coefficient of Mean Deviation from i) the Mean and ii) the Median.
Solution
Efficiency Index   Mid (X)   f    d = (X − 28)/4   fd    f|X − 24|   Cf   |X − Med|   f|X − Med|
18 - 22            20        20   -2               -40   80          20    3.67       73.40
22 - 26            24        30   -1               -30    0          50    0.33        9.90
26 - 30            28        11    0                 0   44          61    4.33       47.63
30 - 34            32         3    1                 3   24          64    8.33       24.99
34 - 38            36         1    2                 2   12          65   12.33       12.33
Total                        65                    -65   160                         168.25
\[ \bar{X} = A + \frac{\sum fd}{\sum f} \times CI = 28 + \frac{-65}{65} \times 4 = 28 - 4 = 24 \]
\[ MD(\bar{X}) = \frac{\sum f\,|X - 24|}{\sum f} = \frac{160}{65} = 2.46 \]
\[ \text{Coefficient of Mean Deviation} = \frac{2.46}{24} = 0.1025 \]
\[ \frac{N}{2}\text{th value} = \frac{65}{2} = 32.5\text{th value} \]
Median Class = 22 – 26
Class width = 4
\[ \text{Median} = 22 + \frac{32.5 - 20}{30} \times 4 = 22 + 1.667 = 23.67 \]
\[ MD(\text{Median}) = \frac{\sum f\,|X - \text{Med}|}{\sum f} = \frac{168.25}{65} = 2.59 \]
\[ \text{Coefficient} = \frac{2.59}{23.67} = 0.109 \]
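The grouped calculation can be re-run in Python from the class mid values and frequencies. The sketch carries full precision throughout, so the last digit of a result may differ slightly from a hand calculation that rounds the median first.

```python
mids = [20, 24, 28, 32, 36]    # mid values of the classes 18-22, ..., 34-38
freqs = [20, 30, 11, 3, 1]
n = sum(freqs)                  # 65

# Mean and mean deviation about the mean
mean = sum(f * x for x, f in zip(mids, freqs)) / n
md_mean = sum(f * abs(x - mean) for x, f in zip(mids, freqs)) / n

# Median by interpolation inside the median class 22-26
# (lower limit 22, cumulative frequency before it 20, class frequency 30, width 4)
median = 22 + (n / 2 - 20) / 30 * 4
md_median = sum(f * abs(x - median) for x, f in zip(mids, freqs)) / n

print(mean, round(md_mean, 2))               # 24.0 2.46
print(round(median, 2), round(md_median, 2))  # 23.67 2.59
print(round(md_mean / mean, 4), round(md_median / median, 4))
```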
Where X is the mid value of class interval for continuous series. Alternative form for (A) and (B)
S.D. are:
For A
Example
Calculate the SD for variation in temperature observed during two months at Kolkata:
Temperature 18 19 20 21 22 23 24 25 Total
Frequency 3 5 8 16 12 8 5 3 60
Solution
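X       f     d = X − 21    fd     fd²
18      3     -3            -9     27
19      5     -2           -10     20
20      8     -1            -8      8
21     16      0             0      0
22     12      1            12     12
23      8      2            16     32
24      5      3            15     45
25      3      4            12     48
Total  60                   28    192
\[ \bar{X} = A + \frac{\sum fd}{\sum f} \times CI = 21 + \frac{28}{60} \times 1 = 21.47 \]
\[ \text{Variance} = \left[\frac{192}{60} - \left(\frac{28}{60}\right)^2\right] \times 1^2 = 3.2 - 0.218 = 2.982 \]
\[ SD = \sqrt{2.982} = 1.727 \]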
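The standard deviation of the temperature data can be verified directly from the definition, without the assumed-mean shortcut. A minimal sketch using only the standard library:

```python
import math

temps = [18, 19, 20, 21, 22, 23, 24, 25]
freqs = [3, 5, 8, 16, 12, 8, 5, 3]
n = sum(freqs)  # 60

mean = sum(f * x for x, f in zip(temps, freqs)) / n
variance = sum(f * (x - mean) ** 2 for x, f in zip(temps, freqs)) / n  # population variance
sd = math.sqrt(variance)

print(round(mean, 2), round(variance, 3), round(sd, 3))  # 21.47 2.982 1.727
```

The direct computation gives the same mean, variance and standard deviation as the shortcut with d = X − 21.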
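Suppose the mean of n₁ values is X̄₁ and the mean of n₂ values is X̄₂, and the standard
deviations are σ₁ and σ₂ respectively. The combined standard deviation is given by the
following formula.
\[ \sigma = \sqrt{\frac{n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)}{n_1 + n_2}} \]
where d₁ = X̄ − X̄₁, d₂ = X̄ − X̄₂ and X̄ is the combined mean.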
Example
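The average weight of 100 apples from area A is 150 gms with a standard deviation of 10 gms.
Similarly, the average weight of 200 apples from area B is 200 gms with a standard deviation
of 15 gms. Find the combined standard deviation.
Solution
\[ \text{Combined average} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2} = \frac{100 \times 150 + 200 \times 200}{100 + 200} = \frac{15000 + 40000}{300} = \frac{55000}{300} = 183.33 \]
Here d₁ = 183.33 − 150 = 33.33, so d₁² = 1111.11, and d₂ = 183.33 − 200 = −16.67, so
d₂² = 277.78. Also σ₁² = 10² = 100 and σ₂² = 15² = 225.
\[ \text{Standard Deviation} = \sqrt{\frac{100(100 + 1111.11) + 200(225 + 277.78)}{100 + 200}} = \sqrt{738.89} = 27.18 \]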
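The combined standard deviation can be checked with the sketch below. It uses σ₂² = 15² = 225 for area B and keeps the deviations d₁ and d₂ at full precision; the function name is illustrative.

```python
import math

def combined_mean_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Combined mean and standard deviation of two groups."""
    mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    d1 = mean - mean1
    d2 = mean - mean2
    variance = (n1 * (sd1 ** 2 + d1 ** 2) + n2 * (sd2 ** 2 + d2 ** 2)) / (n1 + n2)
    return mean, math.sqrt(variance)

mean, sd = combined_mean_sd(100, 150, 10, 200, 200, 15)
print(round(mean, 2), round(sd, 2))  # 183.33 27.18
```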
4. Probabilities
In weather forecasting, it is often said that rain might occur during a certain period of
time. The forecasting office indicates the possibility of rain but does not state with
certainty that it will rain at that time; an element of uncertainty is implied in the
announcement. Likewise, share market analysts often say that a share price may go up or down,
but they never guarantee which. Uncertainty therefore needs to be handled in a systematic
way, and probability theory helps us to make wiser decisions (Wolfram MathWorld, 2019).
Probability is a numerical measure of the chance that an event A occurs. It is denoted by
P(A). It is the ratio of the number of outcomes favourable to the event A (m) to the total
number of outcomes of the experiment (n):
\[ P(A) = \frac{m}{n} \]
If the number of outcomes is finite, the sample space is called a finite sample space;
otherwise it is called an infinite sample space.
4.1.4. Event
There are basically two kinds of events. One consists of a single outcome and the other is a
combination of outcomes. In tossing two coins, getting exactly one head (event A) is a
combination of the outcomes HT and TH. Therefore, P(A) = 2/4 = ½. An event is a part of the
sample space.
Illustration
D = [TTT]
Events A, B, C and D are mutually exclusive and exhaustive but not equally likely.
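If A and B are two mutually exclusive events, then the probability of occurrence of either A
or B is given by:
\[ P(A \cup B) = P(A) + P(B) \]
If A, B and C are any three events, then the probability of occurrence of either A or B or C
is given by
\[ P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(A \cap C) + P(A \cap B \cap C) \]
If A₁, A₂, A₃, …, Aₙ are “n” mutually exclusive and exhaustive events, then the probability of
occurrence of at least one of them is given by
\[ P(A_1 \cup A_2 \cup \cdots \cup A_n) = P(A_1) + P(A_2) + \cdots + P(A_n) = 1 \]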
The above illustration can be presented with a Venn diagram in the following manner.
When managers have several options in front of them and must choose only one of them for
implementation, the addition rule of probability can be applied. Sometimes a situation demands
that both A and B be chosen for implementation. In such a case, the multiplication rule of
probability is applied.
\[ P(A \cap B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B) \]
It follows that:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)} \]
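The multiplication rule can be checked by enumerating a small sample space. The events in this sketch are hypothetical and chosen only for illustration: two dice are thrown, A is "the sum is at least 9" and B is "the first die shows 6". Exact fractions are used so that the identities hold without rounding.

```python
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

def in_a(o):          # hypothetical event A: sum of the two dice is at least 9
    return o[0] + o[1] >= 9

def in_b(o):          # hypothetical event B: first die shows 6
    return o[0] == 6

p_a, p_b = prob(in_a), prob(in_b)
p_ab = prob(lambda o: in_a(o) and in_b(o))
p_a_given_b = p_ab / p_b
p_b_given_a = p_ab / p_a

print(p_a_given_b, p_b_given_a)                          # 2/3 2/5
print(p_ab == p_a * p_b_given_a == p_b * p_a_given_b)    # True
```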
This represents the distribution of levels of education irrespective of sections. It is
regarded as one marginal distribution.
ii)
Newspaper   Magazine   Novels   Subjects   Total
220         250        200      300        970
This represents the distribution of people across sections irrespective of their educational
levels. It is another marginal distribution. These are the two marginal distributions found
in bivariate data.
Two variables are present in this type of data. These are
iii)
Level of Education   Newspaper   Magazine   Novels   Subjects   Total
Under Graduate       50          100        120      50         320
This represents the distribution of people across sections given that they are under
graduates. Therefore, it is a conditional distribution. Thus, for any bivariate distribution
whose two variables have m and n classifications, there exist two marginal distributions and
m + n conditional distributions. In this case there are 3 + 4 = 7 conditional distributions.
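\[ \text{Now } {}^{n}C_{r} = \frac{n!}{(n-r)!\,r!} \]
Example 1
\[ {}^{10}C_{2} = \frac{10!}{(10-2)!\,2!} = \frac{10 \times 9}{2 \times 1} = \frac{5 \times 9}{1} = 45 \]
\[ {}^{16}C_{3} = \frac{16!}{(16-3)!\,3!} = \frac{16 \times 15 \times 14}{3 \times 2 \times 1} = \frac{16 \times 5 \times 7}{1} = 560 \]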
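The two combinations can be checked either from the factorial formula or with `math.comb` from the standard library (Python 3.8 or later). The helper `n_choose_r` is an illustrative name.

```python
import math

def n_choose_r(n, r):
    """nCr from the factorial formula n! / ((n - r)! * r!)."""
    return math.factorial(n) // (math.factorial(n - r) * math.factorial(r))

print(n_choose_r(10, 2), math.comb(10, 2))  # 45 45
print(n_choose_r(16, 3), math.comb(16, 3))  # 560 560
```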
Example 2
S = {H, T}
or n(S) = 2
\[ P(A) = \frac{n(A)}{n(S)} = \frac{1}{2} \]
Example 3 (Part A)
What is the probability of getting exactly two heads when 3 coins are tossed, and what is the
probability of getting at least two heads?
\[ P(A) = \frac{n(A)}{n(S)} = \frac{3}{8} \]
Example 3 (Part B)
Of all the possible outcomes, at least two heads are found in the following sequences:
HHH, HHT, HTH, THH
\[ P(A) = \frac{n(A)}{n(S)} = \frac{4}{8} = \frac{1}{2} \]
Example 4 (i)
What is the probability of (i) getting a sum of “nine” and (ii) getting a sum of at least 9
when two dice are thrown together?
\[ P(A) = \frac{n(A)}{n(S)} = \frac{4}{36} = \frac{1}{9} \]
Example 4 (ii)
Outcomes with a sum of at least 9:
(3,6) (6,3) (4,5) (5,4) (5,5) (5,6) (6,5) (6,6) (4,6) (6,4), so n(A) = 10
\[ P(A) = \frac{n(A)}{n(S)} = \frac{10}{36} = \frac{5}{18} \]
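The coin and dice examples can be confirmed by listing every outcome of the sample space and counting the favourable ones. This is a small sketch using `itertools.product` and exact fractions.

```python
from fractions import Fraction
from itertools import product

# Three coins: 2**3 = 8 equally likely outcomes
coins = list(product("HT", repeat=3))
exactly_two = [o for o in coins if o.count("H") == 2]
at_least_two = [o for o in coins if o.count("H") >= 2]
print(Fraction(len(exactly_two), len(coins)))    # 3/8
print(Fraction(len(at_least_two), len(coins)))   # 1/2

# Two dice: 6 * 6 = 36 equally likely outcomes
dice = list(product(range(1, 7), repeat=2))
sum_nine = [o for o in dice if sum(o) == 9]
at_least_nine = [o for o in dice if sum(o) >= 9]
print(Fraction(len(sum_nine), len(dice)))        # 1/9
print(Fraction(len(at_least_nine), len(dice)))   # 5/18
```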
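Example 5
We know that:
\[ {}^{n}C_{r} = \frac{n!}{(n-r)!\,r!} \]
where n = total number of items to choose from and r = number of items to be chosen.
Here n = 15 and r = 5
\[ n(S) = {}^{15}C_{5} = \frac{15!}{(15-5)!\,5!} = \frac{15 \times 14 \times 13 \times 12 \times 11}{5 \times 4 \times 3 \times 2 \times 1} = 3003 \]
\[ n(A) = {}^{5}C_{2} \times {}^{4}C_{1} \times {}^{6}C_{2} = \frac{5 \times 4}{2 \times 1} \times \frac{4}{1} \times \frac{6 \times 5}{2 \times 1} = \frac{2400}{4} = 600 \]
\[ P(A) = \frac{n(A)}{n(S)} = \frac{600}{3003} \]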
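The counts in this example can be checked with `math.comb`. The grouping of the 15 items into sets of 5, 4 and 6 is read off from the products in the calculation above; the problem statement itself is not reproduced here.

```python
import math
from fractions import Fraction

n_s = math.comb(15, 5)                                         # all ways to choose 5 of 15
n_a = math.comb(5, 2) * math.comb(4, 1) * math.comb(6, 2)      # 2 of 5, 1 of 4, 2 of 6
p_a = Fraction(n_a, n_s)

print(n_s, n_a, p_a)  # 3003 600 200/1001
```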
5. Regression
Regression is defined as “the measure of the average relationship between two or more
variables in terms of the original units of the data”. Correlation analysis studies the
relationship between the two variables x and y. Regression analysis attempts to predict the
average x for a given y; it quantifies the dependence of one variable on the other. Example:
there are two variables x and y, and y depends on x. The dependence is expressed in the form
of equations. Regression analysis is used to estimate the values of the dependent variable
from the values of the independent variable, and to get a measure of the error involved in
using the regression line as a basis for estimation (Gallo, 2019). The regression coefficients
are used to calculate the correlation coefficient: their product equals the square of the
correlation between the two variables.
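lines, higher is the correlation between the variables. The regression lines always intersect
at (X̄, Ȳ). The regression equation of y on x is given by:
\[ Y - \bar{Y} = b_{yx}(X - \bar{X}) \]
and the regression equation of x on y is given by:
\[ X - \bar{X} = b_{xy}(Y - \bar{Y}) \]
where
\[ b_{yx} = \frac{N\sum d_x d_y - (\sum d_x)(\sum d_y)}{N\sum d_x^2 - (\sum d_x)^2}, \qquad b_{xy} = \frac{N\sum d_x d_y - (\sum d_x)(\sum d_y)}{N\sum d_y^2 - (\sum d_y)^2} \]
and d_x, d_y are the deviations of X and Y from their assumed means.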
The regression equations found by the above conditions are said to be fitted by the method of
least squares; b_xy and b_yx are called regression coefficients.
\[ b_{yx} \times b_{xy} = r^2 \le 1 \]
\[ b_{yx} = r \times \frac{\sigma_y}{\sigma_x}, \qquad b_{xy} = r \times \frac{\sigma_x}{\sigma_y} \]
The regression coefficient is an absolute measure.
b_yx can be greater than one, but then b_xy must be less than one, so that b_yx × b_xy ≤ 1.
Moreover, the regression equation is based on a cause and effect relationship and is meant
for estimation (Gallo, 2019).
Example
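X      dx = X − 22   dx²    Y      dy = Y − 19   dy²    dx·dy
18     -4            16     17     -2            4      8
19     -3             9     17     -2            4      6
20     -2             4     18     -1            1      2
21     -1             1     18     -1            1      1
22      0             0     19      0            0      0
23      1             1     19      0            0      0
24      2             4     19      0            0      0
25      3             9     20      1            1      3
26      4            16     21      2            4      8
27      5            25     22      3            9     15
Total  225    5      85     190     0           24     43
\[ \bar{X} = \frac{225}{10} = 22.5, \qquad \bar{Y} = \frac{190}{10} = 19 \]
\[ b_{yx} = \frac{10 \times 43 - (5)(0)}{10 \times 85 - (5)^2} = \frac{430}{825} = 0.521 \]
→ Y – 19 = 0.521 (X – 22.5)
→ Y = 0.521X + 7.2775
\[ b_{xy} = \frac{10 \times 43 - (5)(0)}{10 \times 24 - (0)^2} = \frac{430}{240} = 1.792 \]
→ X – 22.5 = 1.792 (Y – 19)
→ X = 1.792Y – 11.548
\[ r = \sqrt{0.521 \times 1.792} = 0.966 \]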
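The regression coefficients and the correlation coefficient for this example can be recomputed from the raw X and Y columns. The sketch takes deviations from the actual means, which gives the same coefficients as the assumed-mean formula; the intercepts are computed at full precision, so they differ slightly from the hand values obtained with rounded coefficients.

```python
import math

X = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
Y = [17, 17, 18, 18, 19, 19, 19, 20, 21, 22]
n = len(X)

mean_x = sum(X) / n   # 22.5
mean_y = sum(Y) / n   # 19.0

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
sxx = sum((x - mean_x) ** 2 for x in X)
syy = sum((y - mean_y) ** 2 for y in Y)

byx = sxy / sxx              # regression coefficient of y on x
bxy = sxy / syy              # regression coefficient of x on y
r = math.sqrt(byx * bxy)     # positive root, since both coefficients are positive

print(round(byx, 3), round(bxy, 3), round(r, 3))   # 0.521 1.792 0.966
print(round(mean_y - byx * mean_x, 3))             # intercept of the line of y on x
print(round(mean_x - bxy * mean_y, 3))             # intercept of the line of x on y
```

The intercept of the line of y on x is positive and the intercept of the line of x on y is negative.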
Mathematically, a linear relationship is one that satisfies the equation given below:
y = mx + c
The X axis denotes speed, which is the independent variable. The Y axis denotes distance,
which depends upon speed and is therefore regarded as the dependent variable.
The variables x and y are connected through the parameters “m” and “c”. Graphically,
y = mx + c plots in the x-y plane as a line. The slope is represented by m and the
y-intercept by “c”, which is simply the value of “y” when x = 0. Two distinct points on the
line, (x₁, y₁) and (x₂, y₂), are used to compute the slope “m”:
\[ m = \frac{y_2 - y_1}{x_2 - x_1} \]
Instances of linear relationships can be observed in daily life. Speed is one example: it is
the distance travelled divided by the time taken. Suppose you travel from A to B over a
41.3-mile stretch and you take 40 minutes to reach B. Your average speed is then
41.3 ÷ (40/60) ≈ 62 miles per hour, just above 60 miles per hour. The extent of the linear
relationship between the dependent and independent variables can be measured through the
application of the linear regression technique.
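The slope formula and the speed example can be tied together in a short sketch. It assumes, for illustration only, that distance in miles is plotted against time in hours, so that the slope of the line from the start (0, 0) to the arrival point is the average speed.

```python
def slope(p1, p2):
    """Slope m of the straight line through two distinct points (x1, y1) and (x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)

# Time in hours on the x-axis, distance in miles on the y-axis (an assumption of this sketch)
start = (0, 0)
arrival = (40 / 60, 41.3)

speed = slope(start, arrival)
print(round(speed, 2))  # 61.95
```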
categorical form is like male versus female. The Linear Regression Algorithm can be explained
in the following manner:
Fig. 33: Linear Regression Algorithm
In the earlier step we saw that the X axis represents the independent variable and the Y axis
represents the dependent variable. Both axes are marked with numbers such as 1, 2, etc.
[see the upper right portion of the graph]. From the box we see that when X = 1, Y = 3. These
two values give a point, represented by a green ball. The positions of the other balls are
determined in the same way. Once all the positions are known, we can calculate the X-mean and
the Y-mean. The mean point is (3, 3.6).
The next step is to determine the slope of the straight line. In Table 36 we see that the
straight line starts from point 2 on the Y axis and slopes upward through the mean point
(X = 3, Y = 3.6). How can we get the slope of this straight line? We know the X-mean, the
Y-mean and the individual values of X and Y. With these data, and the graph given below, we
can follow the Linear Regression Algorithm.
Step 1
We calculate the deviations of each X from the X-mean and of each Y from the Y-mean. We get
the values (-2, -1, 0, 1, 2) for X and (-0.6, 0.4, -1.6, 0.4, 1.4) for Y.
Step 2
We square the X-deviations and multiply each X-deviation by the corresponding Y-deviation.
Adding up these values gives Σ(x – x̄)² = 10 and Σ(x – x̄)(y – ȳ) = 4.
Step 3
Applying these two sums to the formula for “m”, we see that
\[ m = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{4}{10} \]
Step 4
After getting the magnitude of m, we put this value and the mean point (3, 3.6) into
y = mx + c:
\[ 3.6 = \frac{4}{10} \times 3 + c \]
c = 3.6 – 1.2
c = 2.4
It means that we have determined the two unknown values, i.e. m = 4/10 = 0.4 and c = 2.4.
In the graph we know all the values of x. If we put the values of m and c into the formula
y = mx + c we can easily find the predicted values of y through which the straight line
passes. All these facts can be represented through a suitable graph, which is laid down
below:
Fig. 266: Slope – Line Presentation
From Table 39 we observe that a straight line is obtained by applying the computed values of
m and c to the different values of the x variable. The predicted values of y for each x have
been determined. With these predicted values, i.e. 2.8, 3.2, 3.6, 4.0 and 4.4, a straight
line can easily be drawn, which is represented here by a red line.
Step 5
The red line is drawn through the predicted values shown in the earlier section (Table 39).
This line is called the regression line. Now we have to find out how close the data lie to
the regression line. For this purpose, we determine a statistical measure called the
R-squared value, also known as the coefficient of determination or the coefficient of
multiple determination. We have the predicted values 2.8, 3.2, 3.6, 4.0 and 4.4. We compare
the squared distances (predicted – mean) with the squared distances (actual – mean). It is
nothing but:
\[ R^2 = \frac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2} \]
For the five points used here, R² = 1.6 / 5.2 ≈ 0.31.
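Steps 1 to 5 can be followed end to end in plain Python. In this sketch the y values (3, 4, 2, 4, 5) are reconstructed from the Y-mean of 3.6 and the deviations listed in Step 1; R² is computed as the ratio of the squared distances (predicted − mean) to the squared distances (actual − mean).

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 2, 4, 5]   # reconstructed from mean 3.6 and the deviations in Step 1

mean_x = sum(xs) / len(xs)   # 3.0
mean_y = sum(ys) / len(ys)   # 3.6

# Steps 1-3: slope from the sums of deviations
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))   # about 4
sxx = sum((x - mean_x) ** 2 for x in xs)                          # 10
m = sxy / sxx

# Step 4: the line passes through the mean point, so c = mean_y - m * mean_x
c = mean_y - m * mean_x

# Step 5: predicted values and R-squared
preds = [m * x + c for x in xs]
ss_t = sum((y - mean_y) ** 2 for y in ys)       # actual - mean
ss_r = sum((p - mean_y) ** 2 for p in preds)    # predicted - mean
r2 = ss_r / ss_t

print(round(m, 2), round(c, 2))          # 0.4 2.4
print([round(p, 1) for p in preds])      # [2.8, 3.2, 3.6, 4.0, 4.4]
print(round(r2, 2))                      # 0.31
```

An R² of about 0.31 means the five points lie fairly far from the fitted line; values nearer 1 indicate a closer fit.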
In the same way, R² can be close to different values such as 0.7, 0.9, 1 or 0.02. In such
cases the actual positions of the data points change. The data points are marked in green.
Such changes in position are shown in the following graphs, step by step.
When R² ≈ 0.7:
Fig. 39: Measurement of R²
When R² ≈ 0.9:
Fig. 40: Measurement of R²
When R² ≈ 1:
Fig. 281: Measurement of R²
When R² ≈ 0.02:
Fig. 292: Measurement of R²
# Read csv
import pandas as pd
data = pd.read_csv('Headbrain.csv')
print(data.shape)
data.head()
(237, 4)
Out [2]:
Fig. 303: Outcome
7.2 Example – 2
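In [7] : # Collecting x and y
import numpy as np
x = data['Head size'].values
y = data['Brain Weights'].values
# Mean of x and y
mean_x = np.mean(x)
mean_y = np.mean(y)
# Total number of values
length = len(x)
# Slope (m) and intercept (c) from the least-squares formulas of Section 5
m = np.sum((x - mean_x) * (y - mean_y)) / np.sum((x - mean_x) ** 2)
c = mean_y - m * mean_x
print(m, c)
0.263429339489 325.573421049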
7.3. Example - 3
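In [4] : # Plotting values and regression line
max_x = np.max(x)
min_x = np.min(x)
# 500 evenly spaced x values across the observed range
x1 = np.linspace(min_x, max_x, 500)
# Corresponding y values on the regression line y = c + m*x
y1 = c + m * x1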
7.4. Example – 4
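In [5] : # R-squared: explained variation divided by total variation
ss_t = 0  # total sum of squares (actual - mean)
ss_r = 0  # regression sum of squares (predicted - mean)
length = len(x)
for i in range(length):
    y_pred = c + m * x[i]
    ss_t += (y[i] - mean_y) ** 2
    ss_r += (y_pred - mean_y) ** 2
r2 = ss_r / ss_t
print(r2)
0.639311719957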
References
Brownlee, J., 2019. Examples of Linear Algebra in Machine Learning. [Online]
Available at: https://machinelearningmastery.com/examples-of-linear-algebra-in-machine-learning/
[Accessed 03 01 2020].
Clark, J., 2019. Statistics Basics: Here’s What You Need to Know. [Online]
Available at: https://magoosh.com/statistics/statistics-basics-heres-what-you-need-to-know/
[Accessed 03 01 2020].
Donges, N., 2019. Basic Linear Algebra for Deep Learning. [Online]
Available at: https://towardsdatascience.com/linear-algebra-for-deep-learning-f21d7e7d7f23
[Accessed 23 12 2019].
Ducksters, 2019. Scalars and Vectors. [Online]
Available at: https://www.ducksters.com/science/physics/scalars_and_vectors.php
[Accessed 03 01 2020].
Frost, J., 2019. Measures of Central Tendency: Mean, Median, and Mode. [Online]
Available at: https://statisticsbyjim.com/basics/measures-central-tendency-mean-median-mode/
[Accessed 26 12 2019].
Gallo, A., 2019. A Refresher on Regression Analysis. [Online]
Available at: https://hbr.org/2015/11/a-refresher-on-regression-analysis
[Accessed 03 01 2020].
Intellipaat, 2019. What is Data Science?. [Online]
Available at: https://intellipaat.com/blog/what-is-data-science/
[Accessed 03 01 2020].
KD Nuggets, 2016. Machine learning Vs Statistics. [Online]
Available at: https://www.kdnuggets.com/2016/11/machine-learning-vs-statistics.html
[Accessed 02 01 2020].
Machine Learning Mastery, 2019. Examples of Linear Algebra in Machine Learning. [Online]
Available at: https://machinelearningmastery.com/examples-of-linear-algebra-in-machine-learning/
[Accessed 02 01 2020].
Make me Analyst, 2019. BASIC STATISTICS FOR DATA ANALYSIS. [Online]
Available at: http://makemeanalyst.com/basic-statistics-for-data-analysis/
[Accessed 03 01 2020].
Math Planet, 2019. How to operate with matrices. [Online]
Available at: https://www.mathplanet.com/education/algebra-2/matrices/how-to-operate-with-matrices
[Accessed 03 01 2020].