MM 501
States in India
November 2022
DECLARATION
We hereby declare that the report entitled “Statistical analysis of Population of different
states in India” is a genuine record of work carried out by us and no part of this report has
been submitted to any University or Institution for the completion of any course.
Acknowledgment
As students of the 5 Year Integrated M.Sc. program in Mathematics at Sardar Vallabhbhai
National Institute of Technology, Surat, we are very grateful for the support we received.
First, we would like to express our genuine appreciation to our supervisor, Dr. Neeru Adlakha,
for her guidance in our work. Working with her has been a great opportunity and a pleasure.
We are thankful to the Director of SVNIT, to Dr. Jayesh M. Dhodiya, Head of the Department of
Mathematics & Humanities, and to all other faculty members, research scholars and non-teaching
staff of our department for their regular help, moral support and encouragement.
Contents
1 Abstract
2 Introduction
3 Literature Review
4 Relevant Theory
5 Methodology
Chapter 1
Abstract
India holds the greater part of South Asia and is one of the oldest civilizations in the
world, with a rich cultural heritage. It has a highly diverse population consisting of many
ethnic groups and several languages.
India is a federal union comprising 28 states and 8 union territories, 36 entities in
total. According to the Census Population Projection Report, India’s population in 2022 is
estimated at 1,375,586,000 (1.38 billion, or about 138 crore). India has witnessed huge
growth in its population in the last 50 years. According to estimates, India will overtake
China to become the most populous country in the world by 2028. Nearly half of India’s total
population lives in five states: Uttar Pradesh, Maharashtra, Bihar, West Bengal and Andhra
Pradesh. In this project we try to predict the population in 2036 based on data from 2018
and 2019. We also evaluate different statistical measures of interest from the data used,
and finally try to draw a correlation between the number of internet users and the
population of different states in India by year.
Chapter 2
Introduction
The populations of Indian states such as Uttar Pradesh, Maharashtra and Bihar exceed those of
many countries around the world. Uttar Pradesh, the most populous state in India, is currently
home to over 237 million people. Most states in India are very densely populated compared to
other places in the world, which creates a risk of environmental imbalance. The population
growth rate of many highly populated states in India is 5% to 18% per decade.
First, we noted down some basic relevant statistical concepts and then dived right into a
state-by-state comparison of the yearly population of different states between 2018 and 2019
and tried to draw inferences from it. The data sets for our purposes were mainly taken from
the Census of India (the official Government website for population demographics). The tools
used for data analysis and visualization were Excel and Jupyter Notebook with Python. Using
the tables and the visualization afforded by the graphs, we were able to relate the trends in
the figures to the actual on-ground situation with respect to the population in India.
We next examined our 2019 population data set for the 15 selected states using different
statistical measures such as mean, median, variance, standard deviation, coefficient of
variation, skewness and kurtosis. This gave us information about the distribution of the
data at hand.
Lastly, we tried to correlate the number of internet users with the population in
different states of India year-wise. We obtained only a moderate correlation, and we explored
the reasons why that might have been the case.
Chapter 3
Literature Review
There is extensive literature available on the impact of the population of different states
in India on various aspects of life, the economy, livelihoods, the environment, etc. Both
governmental and non-governmental sources carry reports on the population of India as a whole
as well as of its individual states.
• The detailed reports and data sets on the population of different states of India for
different years, published on the official website of the Government of India (Census of
India), are some of the most credible sources.
• Some non-governmental websites, such as statisticstimes and statista, also provide
detailed analyses of population growth in different states of India over past years.
• Some of the more common online sources, such as Wikipedia, also carry detailed
information on different aspects of the population and list many further sources and
references.
Chapter 4
Relevant Theory
• Mean of a set of observations is the sum of all observations divided by the total number
of observations. Thus, if $X_1, X_2, \cdots, X_N$ represent the values of N items or
observations, the arithmetic mean, denoted by $\bar{X}$ or µ, is defined as:
\[
\bar{X} = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N}.
\]
While the essence remains the same, the formula changes for grouped data, or if we
want to take a weighted mean.
• Median is a measure of central tendency that finds the center of the data when
arranged in some order.
\[
\text{Median} = \text{Size of the } \left(\frac{N+1}{2}\right)^{\text{th}} \text{ observation}.
\]
For grouped data,
\[
\text{Median} = L + \frac{N/2 - cf}{f} \times i, \quad \text{where}
\]
L = Lower limit of the median class, i.e., the class in which the middle observation in the
distribution lies,
cf = Preceding cumulative frequency to the median class,
f = Frequency of the median class, and
i = Class-interval of the median class.
• Mode is the data value which has the highest frequency. For ungrouped data, one can
directly count the number of times that different values repeat themselves, so that the
one that occurs the maximum number of times is the modal value. On the other hand,
in the case of grouped data, the following formula is used for calculating mode:
\[
M_o = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i, \quad \text{where}
\]
L = Lower limit of the modal class,
f0 = Frequency of the class preceding the modal class,
f1 = Frequency of the modal class,
f2 = Frequency of the class succeeding the modal class, and
i = The size of the modal class.
• Variance is defined as the expectation of squared deviations about the mean of given
data. It is a measure of spread or dispersion.
• Standard Deviation (σ) is the square root of variance. It is also a measure of spread
or dispersion.
• Percentage Growth is given mathematically as
\[
\text{Percentage Growth} = \frac{\text{Final Value} - \text{Initial Value}}{\text{Initial Value}} \times 100.
\]
• Correlation Coefficient is a statistical measure of the strength of the relationship
between the relative movements of two variables. Denoted by the symbol r, it summarizes
in one figure the direction and degree of correlation. Here, for our purposes, we have
used the Karl Pearson Coefficient of Correlation, which assumes a linear relationship
between variables. Let X and Y be two variables whose coefficient of correlation we are
interested in. Then,
\[
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2}\,\sqrt{\sum (Y - \bar{Y})^2}}.
\]
• Trend Line: A line on a graph showing the general direction that a group of points
seem to follow.
• Linear Regression attempts to model the relationship between two variables by fit-
ting a linear equation to observed data points.
• Regression Line: The line corresponding to the fitted linear equation above is the
regression line.
• The first thing one usually notices about a distribution’s shape is whether it has one
mode (peak) or more than one. If it’s unimodal (has just one peak), like most data
sets, the next thing to notice is whether it’s symmetric or skewed to one side. If the
bulk of the data is at the left and the right tail is longer, we say that the distribution is
skewed right or positively skewed; if the peak is toward the right and the left tail is
longer, we say that the distribution is skewed left or negatively skewed.
• The other common measure of shape is called the kurtosis. As skewness involves the
third moment of the distribution, kurtosis involves the fourth moment. Higher values
indicate a higher, sharper peak; lower values indicate a lower, less distinct peak. The
reference standard is a normal distribution, which has a kurtosis of 3. In token of this,
often the excess kurtosis is presented: excess kurtosis is simply kurtosis − 3.
– A normal distribution has kurtosis exactly 3 (excess kurtosis exactly 0). Any dis-
tribution with kurtosis ≈ 3 (excess ≈ 0) is called mesokurtic.
– A distribution with kurtosis < 3 (excess kurtosis < 0) is called platykurtic.
Compared to a normal distribution, its tails are shorter and thinner, and often its
central peak is lower and broader.
– A distribution with kurtosis > 3 (excess kurtosis > 0) is called leptokurtic.
Compared to a normal distribution, its tails are longer and fatter, and often its
central peak is higher and sharper.
• Coefficient of Variation,
\[
\text{C.V.} = \frac{\sigma}{\mu} \times 100\%
\]
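The measures defined in this chapter can be sketched in Python as follows. This is only a minimal illustration using the standard library: the function name `describe` and the sample values are hypothetical and are not taken from the report's data. Population (divide-by-N) moments are assumed throughout, and skewness is computed as µ3 / µ2^(3/2), which keeps the sign of the third moment.

```python
import statistics

def describe(values):
    """Summary statistics based on population (divide-by-N) moments."""
    n = len(values)
    mean = statistics.mean(values)

    def moment(k):
        # k-th central moment about the mean
        return sum((v - mean) ** k for v in values) / n

    mu2, mu3, mu4 = moment(2), moment(3), moment(4)
    sigma = mu2 ** 0.5
    return {
        "mean": mean,
        "median": statistics.median(values),
        "variance": mu2,
        "std_dev": sigma,
        "cv_percent": sigma / mean * 100,       # coefficient of variation
        "skewness": mu3 / mu2 ** 1.5,           # signed, moment-based
        "excess_kurtosis": mu4 / mu2 ** 2 - 3,  # kurtosis minus 3
    }

# Hypothetical sample, chosen only so the arithmetic is easy to follow
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

For this sample the positive skewness reflects the longer right tail, and the negative excess kurtosis marks it as platykurtic in the terminology above.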
Chapter 5
Methodology
Publicly available datasets were obtained from sources such as the Census of India’s official
website. To handle the data, create graphs and make calculations, we used the programming
language Python.
The pandas library of Python was used to pre-process the data for the Population fig-
ures and to reduce it down to a less unwieldy data set. Some exploratory data analysis and
relevant calculations with the obtained data were also performed using this library. In order
to visualise the results, the matplotlib library of Python was used to generate a few of the
plots. In later stages, some use of Excel was also made for the purpose.
import numpy as np
import matplotlib.pyplot as plt

# `data` is assumed to be a pandas DataFrame, loaded beforehand, with one row
# per state and the columns referenced below.
states = list(data['State'])
p2020 = list(data["p2020"]) #population of 2020
p2019 = list(data["p2019"]) #population of 2019
p2018 = list(data["p2018"]) #population of 2018
p2017 = list(data["p2017"]) #population of 2017
p2016 = list(data["p2016"]) #population of 2016
p2015 = list(data["p2015"]) #population of 2015
p2014 = list(data["p2014"]) #population of 2014
i2020 = list(data["i2020"]) #internetUsers of 2020
i2019 = list(data["i2019"]) #internetUsers of 2019
i2018 = list(data["i2018"]) #internetUsers of 2018
i2017 = list(data["i2017"]) #internetUsers of 2017
i2016 = list(data["i2016"]) #internetUsers of 2016
i2015 = list(data["i2015"]) #internetUsers of 2015
i2014 = list(data["i2014"]) #internetUsers of 2014
x_axis = np.arange(len(states)) #one x position per state
#Comparison Graph
plt.figure(figsize=(15,8), dpi=120)
plt.xticks(x_axis, states, rotation=45)
plt.plot(x_axis, p2019, marker="o", label="2019")
plt.plot(x_axis, p2018, marker="o", label="2018")
plt.ylabel("Population in Crores")
plt.legend() #labels are attached to each line so the years cannot be swapped
plt.savefig('statevspop18-19.png', dpi=300, bbox_inches='tight')
plt.show()
#Bar Graph
width = 0.1
plt.figure(figsize=(15,8), dpi=120)
plt.bar(x_axis, p2020, width=width, label='2020')
plt.bar(x_axis+width, p2019, width=width, label='2019')
plt.bar(x_axis+width*2, p2018, width=width, label='2018')
plt.bar(x_axis+width*3, p2017, width=width, label='2017')
plt.bar(x_axis+width*4, p2016, width=width, label='2016')
plt.bar(x_axis+width*5, p2015, width=width, label='2015')
plt.bar(x_axis+width*6, p2014, width=width, label='2014')
plt.xticks(x_axis, states, rotation=45)
plt.ylabel("Population(in Cr.)")
plt.legend()
plt.savefig('barGraphAllYears.png', dpi=300, bbox_inches='tight')
plt.show()
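#Regression Lines
# x1, y1 and x2, y2 are assumed to hold the points of the two fitted regression
# lines, computed beforehand (their construction is not part of this listing).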
plt.figure(figsize=(15,8), dpi=120)
plt.grid(True)
plt.plot(x1,y1, linestyle='--', color='red')
plt.plot(x2,y2, color='blue')
plt.axvline(0.634, color='gray', linestyle='--')
plt.axhline(1.617, color='gray', linestyle='--')
plt.xlabel("X")
plt.ylabel("Y")
plt.legend(["Y = 1.43097 + 0.29372 X","Y = 0.32972 +2.02979 X"])
plt.savefig('regressionLines.png', dpi=300, bbox_inches='tight')
plt.show()
plt.figure(figsize=(15,8), dpi=120)
plt.xticks(x_axis, states, rotation=45)
plt.plot(x_axis, i2019, 'o-', color='green')
#plt.plot(x1,y1, linestyle='--', color='red')
plt.plot(x2,y2, color='blue')
plt.xlabel("States")
plt.ylabel("Internet Users (in Cr.)")
plt.legend(["2019","Y on X"])
plt.savefig('regressionLineInternet.png', dpi=300, bbox_inches='tight')
plt.show()
Chapter 6
We first look at the population data for the fifteen most populous states in India, which
are Uttar Pradesh, Maharashtra, Bihar, West Bengal, Andhra Pradesh, Madhya Pradesh, Tamil
Nadu, Rajasthan, Karnataka, Gujarat, Odisha, Kerala, Jharkhand, Assam and Punjab. We obtained
data for both 2019 and 2018 for these states to be able to comment on the deviation from
usual/expected figures for certain states in 2019. The reasons for the same will be discussed
in detail in the next section. For our purpose, we obtained a publicly available data set
from the Census of India’s official website. Figure 6.1 shows the comparison between the
population of the fifteen states in 2018 and 2019.
Now we plot the bar graph for the population of the most populous states (as described above).
State    Population (2019)    Population (2018)
Uttar Pradesh 237,882,725 223,897,418
Maharashtra 123,144,223 124,945,748
Bihar 124,799,926 121,741,741
West Bengal 99,609,303 98,785,114
Andhra Pradesh 53,903,393 87,641,369
Madhya Pradesh 85,358,965 82,961,852
Tamil Nadu 77,841,267 80,288,487
Rajasthan 81,032,689 77,122,315
Karnataka 67,562,686 68,159,821
Gujarat 63,872,399 68,927,491
Odisha 46,356,334 46,172,447
Kerala 35,699,443 34,732,356
Jharkhand 38,593,948 34,149,478
Assam 35,607,039 32,652,597
Punjab 30,141,373 30,471,254
State    Population (2019)    Population (2018)    Percentage Growth (%)
Uttar Pradesh 237,882,725 223,897,418 6.2463
Maharashtra 123,144,223 124,945,748 -1.4418
Bihar 124,799,926 121,741,741 2.5120
West Bengal 99,609,303 98,785,114 0.8343
Andhra Pradesh 53,903,393 87,641,369 -38.4954
Madhya Pradesh 85,358,965 82,961,852 2.8894
Tamil Nadu 77,841,267 80,288,487 -3.0481
Rajasthan 81,032,689 77,122,315 5.0704
Karnataka 67,562,686 68,159,821 -0.8761
Gujarat 63,872,399 68,927,491 -7.3339
Odisha 46,356,334 46,172,447 0.3983
Kerala 35,699,443 34,732,356 2.7844
Jharkhand 38,593,948 34,149,478 13.0148
Assam 35,607,039 32,652,597 9.0481
Punjab 30,141,373 30,471,254 -1.0826
Table 6.2: Yearly Population Growth for year 2018 and 2019
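The growth column of Table 6.2 follows from the Percentage Growth formula in the Relevant Theory chapter. A minimal sketch of that calculation is given here for three rows copied from the table; the function name `percentage_growth` and the dictionary layout are illustrative assumptions, not the report's own code.

```python
def percentage_growth(initial, final):
    """Percentage change from the initial value to the final value."""
    return (final - initial) / initial * 100

# state -> (population in 2018, population in 2019), copied from Table 6.2
populations = {
    "Uttar Pradesh": (223_897_418, 237_882_725),
    "Maharashtra": (124_945_748, 123_144_223),
    "Andhra Pradesh": (87_641_369, 53_903_393),
}

growth = {
    state: percentage_growth(p2018, p2019)
    for state, (p2018, p2019) in populations.items()
}

for state, value in growth.items():
    print(f"{state}: {value:+.4f}%")
```

A negative value means that the recorded 2019 figure is lower than the 2018 figure for that state.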
Figure 6.2: Population of different states from 2014 to 2020
Year
State 2020 2019 2018 2017 2016 2015 2014
Uttar Pradesh 236,693,311 237,882,725 223,897,418 246,035,979 243,209,093 234,125,886 231,048,278
Maharashtra 122,528,502 123,144,223 124,945,748 127,364,900 125,901,512 121,199,429 119,606,250
Bihar 124,175,926 124,799,926 121,741,741 129,077,351 127,594,287 122,828,983 121,214,384
West Bengal 99,111,256 99,609,303 98,785,114 103,023,338 101,839,628 98,036,191 96,747,496
Andhra Pradesh 53,633,876 53,903,393 87,641,369 55,750,892 55,110,329 53,052,106 52,354,731
Madhya Pradesh 84,932,170 85,358,965 82,961,852 88,284,580 87,270,215 84,010,906 82,906,575
Tamil Nadu 77,452,061 77,841,267 80,288,487 80,509,219 79,584,190 76,611,934 75,604,863
Rajasthan 80,627,526 81,032,689 77,122,315 83,810,024 82,847,070 79,752,954 78,704,594
Karnataka 67,224,873 67,562,686 68,159,821 69,878,347 69,075,464 66,495,681 65,621,588
Gujarat 63,553,037 63,872,399 68,927,491 66,061,578 65,302,549 62,863,674 62,037,325
Odisha 46,124,552 46,356,334 46,172,447 47,945,163 47,394,286 45,624,237 45,024,502
Kerala 35,520,946 35,699,443 34,732,356 36,923,015 36,498,780 35,135,648 34,673,787
Jharkhand 38,400,978 38,593,948 34,149,478 39,916,727 39,458,095 37,984,441 37,485,132
Assam 35,429,004 35,607,039 32,652,597 36,827,444 36,404,307 35,044,703 34,584,037
Punjab 29,990,666 30,141,373 30,471,254 31,174,446 30,816,260 29,665,356 29,275,402
Chapter 7
Table 7.1: Table for Calculating the First Four Moments about the Mean (µ = 8.0094)
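7.2 Calculation of Median
Observations in ascending order are: 2.9991, 3.5429, 3.5521, 3.8401, 4.6356, 5.3903, 6.3872,
6.7563, 7.7841, 8.1033, 8.5359, 9.9609, 12.3144, 12.4800 and 23.7883. Here N = 15 is odd.
Now the median is
\[
\begin{aligned}
M &= \text{Value of the } \left(\frac{N+1}{2}\right)^{\text{th}} \text{ observation} \\
  &= \text{Value of the } \left(\frac{15+1}{2}\right)^{\text{th}} \text{ observation} \\
  &= \text{Value of the } \left(\frac{16}{2}\right)^{\text{th}} \text{ observation} \\
  &= \text{Value of the } 8^{\text{th}} \text{ observation}
\end{aligned}
\]
∴ M = 6.7563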
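The positional rule used above can be checked with a short Python sketch. The helper name `median_of` is hypothetical; the observations are the fifteen values listed in this section (in crores).

```python
def median_of(values):
    """Median by position: the middle value of the sorted data for odd N,
    the average of the two middle values for even N."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2  # zero-based index of the ((N + 1) / 2)-th observation when N is odd
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

observations = [
    2.9991, 3.5429, 3.5521, 3.8401, 4.6356, 5.3903, 6.3872, 6.7563,
    7.7841, 8.1033, 8.5359, 9.9609, 12.3144, 12.4800, 23.7883,
]

m = median_of(observations)
print(m)
```

With N = 15 the zero-based index 7 corresponds to the 8th observation, matching the hand calculation.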
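7.6 Calculation of Skewness
Skewness,
\[
\begin{aligned}
\gamma_1 = +\sqrt{\beta_1} &= +\sqrt{\frac{\mu_3^2}{\mu_2^3}} \\
 &= +\sqrt{\frac{(244.7324)^2}{(26.3837)^3}} \\
 &= +\sqrt{\frac{59893.9619}{18365.7309}} \\
 &= +\sqrt{3.26118}
\end{aligned}
\]
∴ γ1 = 1.8059
Excess kurtosis,
\[
\begin{aligned}
\gamma_2 &= \beta_2 - 3 \\
 &= \frac{\mu_4}{\mu_2^2} - 3 \\
 &= \frac{4308.7174}{(26.3837)^2} - 3 \\
 &= \frac{4308.7174}{696.1008} - 3 \\
 &= 6.1898 - 3
\end{aligned}
\]
∴ γ2 = 3.1898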
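The two shape coefficients can be reproduced from the central moments with a few lines of Python. This is a sketch under the assumption that µ2 = 26.3837 (the value whose square, 696.1008, appears in the kurtosis step), with µ3 = 244.7324 and µ4 = 4308.7174 as quoted above; the variable names are illustrative.

```python
import math

# Central moments about the mean, as quoted in this section (assumed values)
mu2 = 26.3837
mu3 = 244.7324
mu4 = 4308.7174

beta1 = mu3 ** 2 / mu2 ** 3   # beta_1 = mu_3^2 / mu_2^3
gamma1 = math.sqrt(beta1)     # skewness, taken with the sign of mu_3 (positive here)

beta2 = mu4 / mu2 ** 2        # beta_2 = mu_4 / mu_2^2
gamma2 = beta2 - 3            # excess kurtosis

print(f"gamma1 = {gamma1:.4f}")
print(f"gamma2 = {gamma2:.4f}")
```

Both coefficients are positive, so under the definitions in the Relevant Theory chapter the data are positively skewed and leptokurtic.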
Chapter 8
We shall now draw inferences from our data set in terms of how strongly the population data
of different states is correlated with the number of internet users in the same states in
2019. We expect that the number of internet users should increase as the population
increases, and similarly that it should decrease as the population decreases.
The data for the internet users in different states of India per year was obtained from
the official website of the Department of Telecommunications of the Government of India.
Table 8.1: Number of Internet Users(Y ) and Population(X) of year 2019 (in Crore)
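For our purpose, we shall use the Karl Pearson Coefficient of Correlation (r), given by
\[
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2}\,\sqrt{\sum (Y - \bar{Y})^2}}
\]
where X and Y are the variables being examined for correlation. One thing to note about
this coefficient is that it assumes a linear relationship between variables. We shall calculate
the same, albeit using the following formula, where $dX = X - \bar{X}$ and $dY = Y - \bar{Y}$
are the deviations from the respective means:
\[
r = \frac{N \sum dX\,dY - \sum dX \sum dY}{\sqrt{N \sum dX^2 - \left(\sum dX\right)^2}\,\sqrt{N \sum dY^2 - \left(\sum dY\right)^2}}
\]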
Now, µx = 8.0047 is already calculated in Section 7. We calculate the same for Y:
µy = ΣY / N = 57.6360 / 15 = 3.8424.
X               Y       dX      dY      dX · dY     dX²         dY²
23.7882725000   7.703   15.78    3.86   60.934047   249.12105   14.904232
12.3144223000   8.032    4.31    4.19   18.055998    18.57368   17.552748
12.4799926000   3.934    4.48    0.09    0.409936    20.02821    0.008391
 9.9609303000   2.683    1.96   -1.16   -2.268050     3.82682    1.344208
 5.3903393000   4.929   -2.61    1.09   -2.840768     6.83490    1.180700
 8.5358965000   4.140    0.53    0.30    0.158083     0.28217    0.088566
 7.7841267000   4.548   -0.22    0.71   -0.155639     0.04865    0.497871
 8.1032689000   3.597    0.10   -0.25   -0.024188     0.00972    0.060221
 6.7562686000   4.039   -1.25    0.20   -0.245442     1.55859    0.038652
 6.3872399000   4.018   -1.62    0.18   -0.284027     2.61619    0.030835
 4.6356334000   1.581   -3.37   -2.26    7.618815    11.35063    5.113930
 3.5520945785   2.654   -4.45   -1.19    5.291480    19.82572    1.412295
 3.8400978260   1.741   -4.16   -2.10    8.751502    17.34394    4.415882
 3.5429003805   1.424   -4.46   -2.42   10.790424    19.90769    5.848659
 2.9990666135   2.613   -5.01   -1.23    6.153930    25.05640    1.511424
ΣX = 120.0706, ΣY = 57.6360, ΣdX = 0.0000, ΣdY = 0.0000, ΣdX·dY = 112.3461, ΣdX² = 396.3844, ΣdY² = 54.0086
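\[
\begin{aligned}
r &= \frac{N \sum dX\,dY - \sum dX \sum dY}{\sqrt{N \sum dX^2 - \left(\sum dX\right)^2}\,\sqrt{N \sum dY^2 - \left(\sum dY\right)^2}} \\
  &= \frac{15 \times 112.3461 - (0)(0)}{\sqrt{15 \times 396.3844 - (0)^2}\,\sqrt{15 \times 54.0086 - (0)^2}} \\
  &= \frac{1685.191547}{2194.729077}
\end{aligned}
\]
∴ r = 0.7678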
Thus, the two datasets are highly correlated (|r| > 0.75). However, we are also interested in a
visualization of these data sets, so that we can better comment on the relations at work and
see whether any inferences can be drawn.
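The coefficient obtained above can be cross-checked directly from the X (population) and Y (internet users) columns. The sketch below uses only the Python standard library; the function name `pearson_r` and the list names are illustrative assumptions, and the values are those of the fifteen states (in crores) used in the calculation.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # deviations dX
    dy = [y - mean_y for y in ys]  # deviations dY
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

# Population (X) and internet users (Y), in crores
population = [
    23.7882725, 12.3144223, 12.4799926, 9.9609303, 5.3903393,
    8.5358965, 7.7841267, 8.1032689, 6.7562686, 6.3872399,
    4.6356334, 3.5520945785, 3.8400978260, 3.5429003805, 2.9990666135,
]
internet_users = [
    7.703, 8.032, 3.934, 2.683, 4.929,
    4.140, 4.548, 3.597, 4.039, 4.018,
    1.581, 2.654, 1.741, 1.424, 2.613,
]

r = pearson_r(population, internet_users)
print(f"r = {r:.4f}")
```

Because the deviations are taken about the means, ΣdX and ΣdY vanish and the expression reduces to the simpler ratio used inside `pearson_r`.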
Figure 8.1: Regression Lines
Figure 8.3: Regression line with Population year 2019
Figure 8.5: Comparison of Internet user and Population year 2019
Chapter 9