Srivardhan Python
1. Data Preparation
1.1 Downloaded the dataset for the period 2008 to 2019 from the StatsWales data source.
1.2 The data was processed, and no outliers or null values were found in the dataset.
1.3 The industry names in the dataset were changed as specified in the assignment.
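The checks described in 1.2 can be sketched as follows. This is only an illustration: the three rows are copied from the table displayed later in the notebook, the year labels 2009 to 2016 are an assumption (the column headers are not shown), and the 1.5 × IQR rule is an assumed outlier criterion, since the notebook does not state which one was used.

```python
import pandas as pd

# Three rows copied from the displayed table; the year labels 2009-2016
# are an assumption, since the column headers are not shown.
years = list(range(2009, 2017))
sample = pd.DataFrame(
    [
        [37700, 38200, 36100, 36100, 36800, 42700, 40700, 43200],
        [156700, 149800, 158600, 154400, 164200, 173300, 172300, 162500],
        [96600, 93200, 90000, 91300, 89300, 97000, 92600, 102700],
    ],
    index=pd.Index(["Agriculture", "Production", "Construction"], name="Industry"),
    columns=years,
)

# Count missing values across the whole frame.
null_count = int(sample.isnull().sum().sum())

# Flag values outside 1.5 * IQR of their own industry (row-wise quartiles).
q1 = sample.quantile(0.25, axis=1)
q3 = sample.quantile(0.75, axis=1)
iqr = q3 - q1
outlier_mask = sample.lt(q1 - 1.5 * iqr, axis=0) | sample.gt(q3 + 1.5 * iqr, axis=0)
outlier_count = int(outlier_mask.sum().sum())

print(null_count, outlier_count)
```

The quartiles are taken per industry (axis=1) because the industries differ in size by an order of magnitude, so a single threshold for the whole table would not be meaningful.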
In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
In [2]:
base_data=pd.read_excel('C:\\Users\\Admin\\Downloads\\Python\\Dataset.xlsx')
In [3]:
base_data.head(15)
Out[3]:
In [4]:
base_data.index=base_data['Industry']
base_data.head()
Out[4]:
Industry
Agriculture Agriculture 37700 38200 36100 36100 36800 42700 40700 43200
Production Production 156700 149800 158600 154400 164200 173300 172300 162500
Construction Construction 96600 93200 90000 91300 89300 97000 92600 102700
Retail Retail 345400 344500 343100 347300 345100 337300 357700 360200
ICT ICT 27800 27900 26400 27200 26900 35700 24000 34400
In [5]:
del base_data['Industry']
In [6]:
base_data.head()
Out[6]:
Industry
Agriculture 37700 38200 36100 36100 36800 42700 40700 43200 40200 41
Production 156700 149800 158600 154400 164200 173300 172300 162500 165100 165
Construction 96600 93200 90000 91300 89300 97000 92600 102700 90800 101
Retail 345400 344500 343100 347300 345100 337300 357700 360200 333500 347
ICT 27800 27900 26400 27200 26900 35700 24000 34400 58900 31
In [7]:
base_data['Total_Employees']=base_data.sum(axis=1)
In [8]:
base_data['Total_Employees_Growth']=round(((base_data[2018]/base_data[2009])-1)*100,2)
In [9]:
base_data.head(10)
Out[9]:
Industry
Agriculture 37700 38200 36100 36100 36800 42700 40700 43200 402
Production 156700 149800 158600 154400 164200 173300 172300 162500 1651
Construction 96600 93200 90000 91300 89300 97000 92600 102700 908
Retail 345400 344500 343100 347300 345100 337300 357700 360200 3335
ICT 27800 27900 26400 27200 26900 35700 24000 34400 589
Finance 33800 29800 33200 31100 32400 32400 30800 31000 321
Real_Estate 13500 14600 17600 18800 18000 22200 19100 22700 182
Professional_Service 144800 145800 143600 137300 149900 152900 166200 161200 1764
Public_Adminstration 415600 418600 425600 421000 427000 427600 423200 418500 4245
Other_Service 64200 68000 72400 72800 75500 73300 77200 72400 832
2. Data Analysis
5/22/2020 srivardhan python
In [10]:
base_data['Total_Employees'].min()
Out[10]:
189900
In [11]:
base_data[base_data['Total_Employees']==base_data['Total_Employees'].min()].index
Out[11]:
In [12]:
base_data['Total_Employees'].max()
Out[12]:
4236500
In [13]:
base_data[base_data['Total_Employees']==base_data['Total_Employees'].max()].index
Out[13]:
In [14]:
import plotly.express as px
2.1 Which industry employed the highest and lowest number of workers over the period?
We fetched data on workplace employment by industry and area (Wales) for 12 years, i.e. from 2008 to
2019. Below is the visualisation using Python Plotly Express. It can be observed that public administration has
the highest number of employees over the period, while real estate employs the fewest employees in the
time span from 2008 to 2019.
In [15]:
fig.show()
[Figure: Plotly Express chart of Total_Employees by industry]
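As a compact alternative to the boolean-mask lookups in cells 11 and 13, the industries with the largest and smallest totals can be read off with `idxmax` and `idxmin`. This is a minimal sketch on three rows copied from the displayed table; the year labels 2009 to 2016 are an assumption, and the totals cover only those eight visible columns, so they differ from the notebook's full totals.

```python
import pandas as pd

# Rows copied from the displayed table; year labels are assumed.
years = list(range(2009, 2017))
sample = pd.DataFrame(
    [
        [345400, 344500, 343100, 347300, 345100, 337300, 357700, 360200],
        [13500, 14600, 17600, 18800, 18000, 22200, 19100, 22700],
        [415600, 418600, 425600, 421000, 427000, 427600, 423200, 418500],
    ],
    index=pd.Index(["Retail", "Real_Estate", "Public_Adminstration"], name="Industry"),
    columns=years,
)

# Sum across the year columns to get one total per industry.
totals = sample.sum(axis=1)

# idxmax / idxmin return the index label of the largest / smallest total.
highest = totals.idxmax()
lowest = totals.idxmin()

print(highest, lowest)
```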
2.2 Which industry has the highest and lowest overall growth over the period?
The visualisation below shows the percentage growth of employment by industry over the period. It can be
observed that real estate shows the highest growth in employment from 2008 to 2019, at 86.67%,
while retail shows the lowest growth, at 0.64%.
In [16]:
[Figure: Plotly Express chart of Total_Employees_Growth (%) by industry; highest labelled value 86.67]
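The growth figure is computed in cell 8 as (end value / start value - 1) × 100. A small worked sketch of that formula is given here. The column names `first_year` and `last_year` are hypothetical, and the values are the first and eighth displayed columns of the table, so the results are not the 86.67% and 0.64% reported for the full period.

```python
import pandas as pd

# First and eighth displayed values for two industries (column names are
# hypothetical; the notebook uses the year labels 2009 and 2018 instead).
sample = pd.DataFrame(
    {"first_year": [13500, 345400], "last_year": [22700, 360200]},
    index=pd.Index(["Real_Estate", "Retail"], name="Industry"),
)

# Percentage growth: ratio of end to start, minus 1, times 100.
growth = ((sample["last_year"] / sample["first_year"] - 1) * 100).round(2)

print(growth)
```

Because the formula uses only the two end points, a single unusual start or end year can dominate the result; the intermediate years play no part.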
2.3 Which years are the best and worst performing years in relation to the number of employments? (highest and
lowest employment)
The bar graph shows how each year performed in terms of employment. It shows that 2018 is
the best performing year, with the highest employment at 1.45 million, while 2010 is the worst performing year,
with the lowest employment at 1.33 million.
In [17]:
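# Drop the derived summary columns so that only the year columns are transposed
base_data2=base_data.drop(columns=['Total_Employees','Total_Employees_Growth']).T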
base_data2.head()
Out[17]:
In [18]:
base_data2.head()
Out[18]:
In [19]:
base_data2['Yearly_Total_Employees']=base_data2.sum(axis=1)
In [20]:
base_data2.head(10)
Out[20]:
In [21]:
fig=px.bar(base_data2,x=base_data2.index,y="Yearly_Total_Employees")
fig.update_layout(title='Yearly Total Employee',legend=dict(x=0,y=0.5))
fig.show()
[Figure: bar chart "Yearly Total Employee", Yearly_Total_Employees by year]
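The yearly totals can also be obtained without transposing, by summing down each year column. The sketch below uses the first three displayed columns for all ten industries; the labels 2009, 2010 and 2011 are an assumption, as the column headers are not shown.

```python
import pandas as pd

# First three displayed columns for all ten industries (year labels assumed).
sample = pd.DataFrame(
    {
        2009: [37700, 156700, 96600, 345400, 27800, 33800, 13500, 144800, 415600, 64200],
        2010: [38200, 149800, 93200, 344500, 27900, 29800, 14600, 145800, 418600, 68000],
        2011: [36100, 158600, 90000, 343100, 26400, 33200, 17600, 143600, 425600, 72400],
    },
    index=pd.Index(
        ["Agriculture", "Production", "Construction", "Retail", "ICT", "Finance",
         "Real_Estate", "Professional_Service", "Public_Adminstration", "Other_Service"],
        name="Industry",
    ),
)

# Sum down each column: one total per year.
yearly_totals = sample.sum(axis=0)

# Labels of the years with the highest and lowest total employment.
best_year = yearly_totals.idxmax()
worst_year = yearly_totals.idxmin()

print(yearly_totals)
print(best_year, worst_year)
```

Among these three columns the lowest total falls in the column assumed to be 2010, at about 1.33 million, which agrees with the figure quoted in the text.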
3. Visual Analysis
In [22]:
base_data3=pd.read_excel('C:\\Users\\Admin\\Downloads\\Python\\Dataset.xlsx')
base_data3.index=base_data3['Industry']
base_data3.head()
Out[22]:
Industry
Agriculture Agriculture 37700 38200 36100 36100 36800 42700 40700 43200
Production Production 156700 149800 158600 154400 164200 173300 172300 162500
Construction Construction 96600 93200 90000 91300 89300 97000 92600 102700
Retail Retail 345400 344500 343100 347300 345100 337300 357700 360200
ICT ICT 27800 27900 26400 27200 26900 35700 24000 34400
3.1 Create a dynamic scatter/bubble plot showing the change in workforce numbers over the period using
Plotly Express.
To plot the scatter chart, we first have to convert the dataframe from wide format (one column per year) into
long format (one row per industry and year). Below is the code for this conversion.
In [23]:
del base_data3['Industry']
In [24]:
base_data4=base_data3.T
In [25]:
Final_df=pd.DataFrame(columns=['Year','Workforce','Industry','Workforce_Change'])
for col in base_data4.columns:
    if col!='Yearly_Total_Employees':
        #print(col)
        final_data=pd.DataFrame(columns=['Year','Workforce','Industry','Workforce_Change'])
        final_data['Workforce']=base_data4[col].tolist()
        final_data['Industry']=col
        final_data['Year']=base_data4[col].index
        # Year-on-year change; the first year has no previous value, so it becomes 0
        final_data['Workforce_Change']=final_data['Workforce']-final_data['Workforce'].shift()
        final_data=final_data.fillna(0)
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        Final_df=pd.concat([Final_df,final_data])
In [26]:
Final_df.head(100)
Out[26]:
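The same wide-to-long conversion can be written without an explicit loop using `melt` and a grouped `diff`. This is a sketch of an alternative, not the code used in the notebook. The values are the sixth to eighth displayed columns for two industries, and the labels 2014 to 2016 are an assumption.

```python
import pandas as pd

# Sixth to eighth displayed columns for two industries (year labels assumed).
wide = pd.DataFrame(
    {2014: [337300, 35700], 2015: [357700, 24000], 2016: [360200, 34400]},
    index=pd.Index(["Retail", "ICT"], name="Industry"),
)

# Wide to long: one row per (Industry, Year) pair.
long_df = wide.reset_index().melt(
    id_vars="Industry", var_name="Year", value_name="Workforce"
)
long_df["Year"] = long_df["Year"].astype(int)
long_df = long_df.sort_values(["Industry", "Year"]).reset_index(drop=True)

# Year-on-year change within each industry; the first year is set to 0.
long_df["Workforce_Change"] = (
    long_df.groupby("Industry")["Workforce"].diff().fillna(0)
)

print(long_df)
```

A frame in this shape can be passed to `px.scatter`, for example with `animation_frame="Year"`, although the plotting call actually used in the notebook is not shown.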
The dynamic scatter plot below shows the change in workforce numbers over the period. It can be
observed that in 2017 the ICT industry shows the largest increase in workforce, followed by the
retail industry with an increase of 20.4k in 2015, while the same industry (ICT) shows the largest
decrease in workforce in 2018, followed by retail in 2017. So it can be concluded that retail and ICT show the
largest workforce changes over the time period.
In [27]:
[Figure: dynamic scatter plot of Workforce_Change by year and industry]
4. PCA/Correlation
PCA is a dimensionality reduction method that is used to reduce the dimensions of a dataset to a
smaller set of variables. Using the Python code below, we computed a PCA with 2 components (Principal
Component Analysis).
In [28]:
In [29]:
PCA_base.head()
Out[29]:
Industry
Agriculture 37700 38200 36100 36100 36800 42700 40700 43200 40200 41
Production 156700 149800 158600 154400 164200 173300 172300 162500 165100 165
Construction 96600 93200 90000 91300 89300 97000 92600 102700 90800 101
Retail 345400 344500 343100 347300 345100 337300 357700 360200 333500 347
ICT 27800 27900 26400 27200 26900 35700 24000 34400 58900 31
In [30]:
pca.n_components = 2
X_reduced = pca.fit_transform(PCA_base)
df_X_reduced = pd.DataFrame(X_reduced,columns=['PC1','PC2'], index=PCA_base.index)
df_X_reduced=round(df_X_reduced,2)
In [31]:
df_X_reduced.head(10)
Out[31]:
PC1 PC2
Industry
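To make the reduction step concrete, here is a minimal NumPy sketch of what a two-component PCA does: centre each year column, take a singular value decomposition, and project onto the first two directions. It is an illustration of the technique rather than the notebook's own code (which uses a `pca` object defined in a cell that is not shown). The data are the first three displayed columns for five industries, with assumed year labels.

```python
import numpy as np
import pandas as pd

# First three displayed columns for five industries (year labels assumed).
X = pd.DataFrame(
    {
        2009: [37700, 156700, 96600, 345400, 27800],
        2010: [38200, 149800, 93200, 344500, 27900],
        2011: [36100, 158600, 90000, 343100, 26400],
    },
    index=pd.Index(
        ["Agriculture", "Production", "Construction", "Retail", "ICT"],
        name="Industry",
    ),
)

# Step 1: centre each column so every year has mean zero.
centered = X - X.mean(axis=0)

# Step 2: SVD of the centred data; rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(centered.to_numpy(dtype=float), full_matrices=False)

# Step 3: scores of each industry on the first two components.
scores = pd.DataFrame(U[:, :2] * S[:2], columns=["PC1", "PC2"], index=X.index)

# Share of total variance carried by each component.
explained = S**2 / np.sum(S**2)

print(scores.round(2))
print(explained.round(4))
```

Because the industries differ so much in size, the first component here mostly separates large industries from small ones. The sign of each component is arbitrary, so a library implementation may return the same scores with flipped signs.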
In [32]:
corr = df_X_reduced.T.corr()
corr.style.background_gradient(cmap='coolwarm')
Out[32]:
Industry
Agriculture 1 -1 1 -1 1 1
Production -1 1 -1 1 -1 -1 -
Construction 1 -1 1 -1 1 1
Retail -1 1 -1 1 -1 -1 -
ICT 1 -1 1 -1 1 1
Finance 1 -1 1 -1 1 1
Real_Estate 1 -1 1 -1 1 1
Professional_Service -1 1 -1 1 -1 -1 -
Public_Adminstration -1 1 -1 1 -1 -1 -
Other_Service 1 -1 1 -1 1 1
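The ±1 pattern in the table above is what any correlation computed over only two values per row produces: with just PC1 and PC2 available for each industry, the Pearson correlation between two industries can only be +1 or -1, depending on whether their two scores move in the same direction. The sketch below demonstrates this with made-up scores. The labels A, B and C and all numbers are hypothetical.

```python
import pandas as pd

# Hypothetical two-component scores for three items (not taken from the notebook).
scores = pd.DataFrame(
    {"PC1": [-250000.0, 120000.0, -60000.0], "PC2": [3000.0, -8000.0, 15000.0]},
    index=["A", "B", "C"],
)

# Same operation as in cell 32: correlate rows by transposing first.
corr = scores.T.corr()

print(corr.round(3))
```

Every entry is +1 or -1 regardless of the magnitudes involved, so this matrix may say more about which of the two scores is larger for each industry than about how similar their employment histories are. The correlation computed on the original yearly data in cell 35 does not have this limitation.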
Real estate, finance, agriculture, and ICT have large negative loadings on principal component 2. This
component appears to capture the industries in Wales with lower employment. Production, public administration,
retail, and professional services, in contrast, have positive loadings on component 1. This component
appears to capture the industries with a larger workforce.
In [33]:
[Figure: plot of the industries on the principal components; axis label: PC2]
The correlation matrix below shows the correlation between industries from 2009 to 2018. It can be observed that
the agriculture industry is highly correlated with the construction industry, and other services are also positively
correlated with professional services and public administration, whereas retail and ICT show a weak linear
relationship.
In [34]:
PCA_base.head()
Out[34]:
Industry
Agriculture 37700 38200 36100 36100 36800 42700 40700 43200 40200 41
Production 156700 149800 158600 154400 164200 173300 172300 162500 165100 165
Construction 96600 93200 90000 91300 89300 97000 92600 102700 90800 101
Retail 345400 344500 343100 347300 345100 337300 357700 360200 333500 347
ICT 27800 27900 26400 27200 26900 35700 24000 34400 58900 31
In [35]:
corr = round(PCA_base.T.corr(),3)
corr.style.background_gradient(cmap='coolwarm')
Out[35]:
Industry
Below is the K-means clustering table. K_2 is the 2-means clustering and K_3 is the 3-means clustering.
In [36]:
cluster_base=base_data3[[2010,2018]]
In [37]:
cluster_base.head()
Out[37]:
2010 2018
Industry
In [38]:
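# K=2 clustering; predicted_2 is needed by the next cell
predicted_2 = KMeans(n_clusters=2).fit_predict(cluster_base)
cluster2 = KMeans(n_clusters=3)
predicted_3 = cluster2.fit_predict(cluster_base)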
In [39]:
cluster_base['K_2']=predicted_2+1
cluster_base['K_3']=predicted_3+1
cluster_base['K_2']=cluster_base['K_2'].astype(str)
cluster_base['K_3']=cluster_base['K_3'].astype(str)
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning:
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:
In [40]:
cluster_base.head(10)
Out[40]:
Industry
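The K_2 and K_3 labelling steps above can be sketched end to end as follows. Several things here are assumptions: the 2018 values are truncated in the displayed table, so the eighth displayed column (assumed to be 2016) stands in for the second feature, and `n_init` and `random_state` are set only to make the run repeatable. Taking a `.copy()` of the selected columns is one common way to avoid the SettingWithCopyWarning shown in the output above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Second and eighth displayed columns for six industries (year labels assumed).
data = pd.DataFrame(
    {
        2010: [38200, 149800, 93200, 344500, 27900, 418600],
        2016: [43200, 162500, 102700, 360200, 34400, 418500],
    },
    index=pd.Index(
        ["Agriculture", "Production", "Construction", "Retail", "ICT",
         "Public_Adminstration"],
        name="Industry",
    ),
)

# Work on an explicit copy so that adding columns does not trigger
# SettingWithCopyWarning.
cluster_demo = data[[2010, 2016]].copy()
features = cluster_demo[[2010, 2016]].to_numpy()

# Fit K-means for K=2 and K=3 on the same two features.
predicted_2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
predicted_3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Store 1-based labels as strings, as in the notebook, so that a plotting
# library treats them as categories rather than a numeric scale.
cluster_demo["K_2"] = (predicted_2 + 1).astype(str)
cluster_demo["K_3"] = (predicted_3 + 1).astype(str)

print(cluster_demo)
```

The cluster numbers themselves are arbitrary: which group is called "1" or "2" can change between runs or library versions, so only the grouping should be interpreted.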
Scatter plot of K-means clustering with K=2. From the two-cluster visualisation below, it can be observed that
most industries are in cluster 2, while cluster 1 has only 2 industries with certain similarities.
In [41]:
[Figure: scatter plot of the K=2 clusters; axis label: 2018]
Scatter plot of K-means clustering with K=3. From the K=3 cluster visualisation below, it can be observed
that most industries are in cluster 3, while cluster 1 and cluster 2 have 2 industries with certain similarities.
In [42]:
[Figure: scatter plot of the K=3 clusters; axis label: 2018]
A dendrogram is used to determine the appropriate number of clusters in hierarchical clustering. It is the main
output of hierarchical clustering. The distance axis of the dendrogram represents the distance between clusters,
measured here as Euclidean distance. The number of clusters is found by drawing a straight line across the
dendrogram at a chosen distance and counting the branches it crosses. Using this approach on the dendrogram, we
identified 6 clusters.
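The idea of cutting a dendrogram at a chosen distance can be sketched with SciPy. The notebook's own clustering cells are not shown, so the choices here are assumptions: Ward linkage on Euclidean distances, a cut height of 150000, and the same six industries and two displayed columns (assumed to be 2010 and 2016) as stand-in data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

industries = ["Agriculture", "Production", "Construction", "Retail", "ICT",
              "Public_Adminstration"]

# Second and eighth displayed columns for each industry (year labels assumed).
points = np.array(
    [
        [38200, 43200],
        [149800, 162500],
        [93200, 102700],
        [344500, 360200],
        [27900, 34400],
        [418600, 418500],
    ],
    dtype=float,
)

# Build the merge tree; each row of Z records one merge and its distance.
Z = linkage(points, method="ward", metric="euclidean")

# Cut the tree at a fixed distance: merges above the threshold are undone.
labels_by_distance = fcluster(Z, t=150000, criterion="distance")

# Alternatively, ask directly for a given number of clusters.
labels_two = fcluster(Z, t=2, criterion="maxclust")

for name, by_dist, two in zip(industries, labels_by_distance, labels_two):
    print(name, by_dist, two)
```

Lowering the cut height yields more, smaller clusters and raising it yields fewer, so the count read from a dendrogram depends on where the line is drawn. `scipy.cluster.hierarchy.dendrogram(Z)` draws the corresponding tree.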
In [43]:
In [44]:
In [45]:
cluster_base['Hierarchical_clustering']=y_hc
cluster_base['Hierarchical_clustering']=cluster_base['Hierarchical_clustering']+1
cluster_base['Hierarchical_clustering']=cluster_base['Hierarchical_clustering'].astype(str)
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning:
In [46]:
[Figure: scatter plot of the hierarchical clustering result; axis label: 2018]
K-means clustering is formed with a predetermined number of clusters. Here we have identified the industry clusters
for the best and worst performing years of employment with k = 2 and k = 3. Hierarchical clustering, as the
name suggests, builds a hierarchy of clusters, and it produced k = 6 industry clusters for the best and worst
performing years of employment.
6. Discussion
Provide a brief discussion (~ 300 words) on employment landscape of Wales based on the employment data
analysis results.
From the report, it can be observed that employment in Wales shows the largest workforce in public
administration services, followed by retail, production, and professional services. The smallest workforce is in
real estate, although it shows the highest percentage growth in employment from 2008 to 2019. Although the retail
workforce is the second largest, this industry has the lowest percentage growth rate over the period. Looking at
total Welsh employment by year, 2018 shows the highest employment, with real estate showing 38% growth while ICT
shows negative percentage growth. In 2010, Wales shows the lowest total workforce, with an average negative growth
of -2%. From the correlation matrix, it can be observed that the agriculture industry is highly correlated with the
construction industry, and other services are also positively correlated with professional services and public
administration, whereas retail and ICT show a weak linear relationship.