Analyzing Census Data in Python

Census is about recording information about a given population in a systematic manner. The data captured includes various category of information like – demographic, economic, habitation details etc. This ultimately helps the government in understanding the current scenario as well as Planning for the future. In this article we will see how to leverage python to analyze the census data for Indian population. We will look at various demographic and economic aspects. Then plot charge which will project the analysis in a graphical manner. The source that is collected from kaggle. It is located here.

Organizing the Data

In the below program first we acquire the data using a short python program. It just loads the data to the pandas dataframe for further analysis. The output shows some of the fields for simpler representation.

Example

import pandas as pd
datainput = pd.read_csv('E:\\india-districts-census-2011.csv')
#https://www.kaggle.com/danofer/india-census#india-districts-census-2011.csv
print(datainput)

Output

Running the above code gives us the following result −

      District code ... Total_Power_Parity
0                 1 ...               1119
1                 2 ...               1066
2                 3 ...                242
3                 4 ...                214
4                 5 ...                629
..              ... ...                ...
635             636 ...              10027
636             637 ...               4890
637             638 ...               3151
638             639 ...               3151
639             640 ...               5782

[640 rows x 118 columns]

Analysing Similarity Between Two States

Now that we have gathered the data we can proceed to analyze the similarities on various fronts between two States. The similarities can be on the basis of age group, computer ownership, housing availability, education level etc. In the below example we take the two states named Assam and Andhra Pradesh. Then we compare the two states using the similarity_matrix. All the data fields are compared for each possible pair of districts from both the states. The resulting heatmap indicates how closely these two are related. The darker the shade the closer they are related.

Example

import pandas as pd
import matplotlib.pyplot as plot
from matplotlib.colors import Normalize
import seaborn as sns
import math
datainput = pd.read_csv('E:\\india-districts-census-2011.csv')
df_ASSAM = datainput.loc[datainput['State name'] == 'ASSAM']
df_ANDHRA_PRADESH = datainput.loc[datainput['State name'] == 'ANDHRA PRADESH']
def segment(x1, x2):
   # Set indices for both the data frames
   x1.set_index('District code')
   x2.set_index('District code')
   # The similarity matrix of size len(x1) X len(x2)
   similarity_matrix = []
   # Iterate through rows of df1
   for r1 in x1.iterrows():
      # Create list to hold similarity score of row1 with other rows of x2
      y = []
      # Iterate through rows of x2
      for r2 in x2.iterrows():
         # Calculate sum of squared differences
         n = 0
         for c in list(datainput)[3:]:
            maximum_c = max(datainput[c])
            minimum_c = min(datainput[c])
            n += pow((r1[1][c] - r2[1][c]) / (maximum_c - minimum_c), 2)
         # Take sqrt and inverse the result
         y.append(1 / math.sqrt(n))
      # Append similarity scores
      similarity_matrix.append(y)
   p = 0
   q = 0
   r = 0
   for m in range(len(similarity_matrix)):
      for n in range(len(similarity_matrix[m])):
         if (similarity_matrix[m][n] > p):
            p = similarity_matrix[m][n]
            q = m
            r = n
   print("%s from ASSAM and %s from ANDHRA PRADESH are most similar" % (x1['District name'].iloc[q],x2['District name'].iloc[r]))
   return similarity_matrix
m = segment(df_ASSAM, df_ANDHRA_PRADESH)
normalization=Normalize()
s = plot.axes()
sns.heatmap(normalization(m), xticklabels=df_ANDHRA_PRADESH['District name'],yticklabels=df_ASSAM['District name'],linewidths=0.05,cmap='Oranges').set_title("similar districts matrix of assam AND andhra_pradesh")
plot.rcParams['figure.figsize'] = (20,20)
plot.show()

Output

Running the above code gives us the following result −

Comparing specific parameters

Now we can also compare places with respect to specific parameters. In the below example we compare the availability of household computers available for the cultivator workers. We produce graph which shows the comparison between these two parameters for each of the state.

Example

import pandas as pd
import matplotlib.pyplot as plot
from numpy import *

datainput = pd.read_csv('E:\\india-districts-census-2011.csv')
z = datainput.groupby(by="State name")
m = []
w = []
for k, g in z:
   t = 0
   t1 = 0
   for r in g.iterrows():
      t += r[1][36]
      t1 += r[1][21]
   m.append((k, t))
   w.append((k, t1))
mp= pd.DataFrame({
   'state': [x[0] for x in m],
   'Households_with_Computer': [x[1] for x in m],
   'Cultivator_Workers': [x[1] for x in w]})

d = arange(35)
wi = 0.3
fig, f = plot.subplots()
plot.xlim(0, 22000000)
r1 = f.barh(d, mp['Cultivator_Workers'], wi, color='g', align='center')
r2 = f.barh(d + wi, mp['Households_with_Computer'], wi, color='b', align='center')
f.set_xlabel('Population')
f.set_title('COMPUTER PENETRATION IN VARIOUS STATES W.R.T. Cultivator_Workers')
f.set_yticks(d + wi / 2)
f.set_yticklabels((x for x in mp['state']))
f.legend((r1[0], r2[0]), ('Cultivator_Workers', 'Households_with_Computer'))
plot.rcParams.update({'font.size': 15})
plot.rcParams['figure.figsize'] = (15, 15)
plot.show()

Output

Running the above code gives us the following result −