Srivardhan Python

The document summarizes data analysis on Welsh employment statistics from 2008-2019: 1. Public administration employed the most workers over the period while real estate employed the fewest. 2. Real estate saw the highest employment growth rate at 86.7% while retail saw the smallest at 0.64%. 3. 2018 had the highest overall employment of 1.45 million, while 2010 had the lowest employment of 1.33 million.

Uploaded by

Pallavi Pallu

5/22/2020 srivardhan python

Programming for data analysis


Name: katakam srivardhan hruday kuamr

student id: st20166815

Moodle code: CIS7031_S2_19

Moodle leader: Imitiaz Khan

1. Data Preparation
1.1 The dataset for the period 2008 to 2019 was downloaded from the StatsWales data source.

1.2 The data was processed, and no outliers or null values were found in the dataset.

1.3 The industry names in the dataset were changed as specified in the assignment.
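
The notebook does not show how the null and outlier checks in 1.2 were carried out. A minimal sketch of one way to do it, assuming a 1.5 × IQR rule per industry (an assumed criterion, not necessarily the one used here) and a small illustrative frame built from three rows of the dataset:

```python
import pandas as pd

# Illustrative subset of the dataset (values copied from the dataframe shown in Out[3])
sample = pd.DataFrame(
    {
        2009: [37700, 156700, 96600],
        2010: [38200, 149800, 93200],
        2011: [36100, 158600, 90000],
        2012: [36100, 154400, 91300],
    },
    index=["Agriculture", "Production", "Construction"],
)

# Null check: number of missing values per column
null_counts = sample.isnull().sum()

# Outlier check per industry (row) with the 1.5 * IQR rule (assumed criterion)
q1 = sample.quantile(0.25, axis=1)
q3 = sample.quantile(0.75, axis=1)
iqr = q3 - q1
outliers = sample.lt(q1 - 1.5 * iqr, axis=0) | sample.gt(q3 + 1.5 * iqr, axis=0)

print("nulls:", int(null_counts.sum()), "outliers:", int(outliers.values.sum()))
```

The check is done row by row because the industries differ greatly in size, so a single threshold across all rows would flag the large industries rather than unusual years.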

In [1]:

import pandas as pd
import numpy as np
%matplotlib inline

In [2]:

base_data=pd.read_excel('C:\\Users\\Admin\\Downloads\\Python\\Dataset.xlsx')

Below is the final dataframe, which shows the total employment values for Wales.

localhost:8888/nbconvert/html/srivardhan python.ipynb?download=false 1/25

In [3]:

base_data.head(15)

Out[3]:

               Industry    2009    2010    2011    2012    2013    2014    2015    2016  ...
0           Agriculture   37700   38200   36100   36100   36800   42700   40700   43200  ...
1            Production  156700  149800  158600  154400  164200  173300  172300  162500  ...
2          Construction   96600   93200   90000   91300   89300   97000   92600  102700  ...
3                Retail  345400  344500  343100  347300  345100  337300  357700  360200  ...
4                   ICT   27800   27900   26400   27200   26900   35700   24000   34400  ...
5               Finance   33800   29800   33200   31100   32400   32400   30800   31000  ...
6           Real_Estate   13500   14600   17600   18800   18000   22200   19100   22700  ...
7  Professional_Service  144800  145800  143600  137300  149900  152900  166200  161200  ...
8  Public_Adminstration  415600  418600  425600  421000  427000  427600  423200  418500  ...
9         Other_Service   64200   68000   72400   72800   75500   73300   77200   72400  ...

In [4]:

base_data.index=base_data['Industry']
base_data.head()

Out[4]:

Industry 2009 2010 2011 2012 2013 2014 2015 2016

Industry

Agriculture Agriculture 37700 38200 36100 36100 36800 42700 40700 43200

Production Production 156700 149800 158600 154400 164200 173300 172300 162500

Construction Construction 96600 93200 90000 91300 89300 97000 92600 102700

Retail Retail 345400 344500 343100 347300 345100 337300 357700 360200

ICT ICT 27800 27900 26400 27200 26900 35700 24000 34400

In [5]:

del base_data['Industry']


In [6]:

base_data.head()

Out[6]:

2009 2010 2011 2012 2013 2014 2015 2016 2017 2

Industry

Agriculture 37700 38200 36100 36100 36800 42700 40700 43200 40200 41

Production 156700 149800 158600 154400 164200 173300 172300 162500 165100 165

Construction 96600 93200 90000 91300 89300 97000 92600 102700 90800 101

Retail 345400 344500 343100 347300 345100 337300 357700 360200 333500 347

ICT 27800 27900 26400 27200 26900 35700 24000 34400 58900 31

In [7]:

base_data['Total_Employees']=base_data.sum(axis=1)

In [8]:

base_data['Total_Employees_Growth']=round(((base_data[2018]/base_data[2009])-1)*100,2)

In [9]:

base_data.head(10)

Out[9]:

2009 2010 2011 2012 2013 2014 2015 2016 20

Industry

Agriculture 37700 38200 36100 36100 36800 42700 40700 43200 402

Production 156700 149800 158600 154400 164200 173300 172300 162500 1651

Construction 96600 93200 90000 91300 89300 97000 92600 102700 908

Retail 345400 344500 343100 347300 345100 337300 357700 360200 3335

ICT 27800 27900 26400 27200 26900 35700 24000 34400 589

Finance 33800 29800 33200 31100 32400 32400 30800 31000 321

Real_Estate 13500 14600 17600 18800 18000 22200 19100 22700 182

Professional_Service 144800 145800 143600 137300 149900 152900 166200 161200 1764

Public_Adminstration 415600 418600 425600 421000 427000 427600 423200 418500 4245

Other_Service 64200 68000 72400 72800 75500 73300 77200 72400 832

2. Data Analysis

In [10]:

base_data['Total_Employees'].min()
Out[10]:

189900

In [11]:

base_data[base_data['Total_Employees']==base_data['Total_Employees'].min()].index

Out[11]:

Index(['Real_Estate'], dtype='object', name='Industry')

In [12]:

base_data['Total_Employees'].max()

Out[12]:

4236500

In [13]:

base_data[base_data['Total_Employees']==base_data['Total_Employees'].max()].index

Out[13]:

Index(['Public_Adminstration'], dtype='object', name='Industry')

In [14]:

import plotly.express as px

2.1 Which industry employed the highest and lowest number of workers over the period?

We fetched workplace employment data by industry and area (Wales) for the period from 2008 to 2019; the dataframe used here holds the yearly columns 2009 to 2018. The visualisation below, built with Plotly Express, shows that public administration has the highest total employment over the period (4,236,500), while real estate has the lowest (189,900).
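
The minimum and maximum lookups in cells In [10] to In [13] can also be written with `idxmin` and `idxmax`, which return the index label directly. A small sketch on three of the industry totals; the real estate and public administration figures are the ones shown in Out[10] and Out[12], and the retail total is recomputed from its ten yearly values:

```python
import pandas as pd

# Retail employment for 2009 to 2018, copied from the dataframe output
retail_years = [345400, 344500, 343100, 347300, 345100,
                337300, 357700, 360200, 333500, 347600]

totals = pd.Series(
    {
        "Retail": sum(retail_years),
        "Real_Estate": 189900,            # Out[10]
        "Public_Adminstration": 4236500,  # Out[12]
    },
    name="Total_Employees",
)

lowest = totals.idxmin()    # label of the smallest total
highest = totals.idxmax()   # label of the largest total
print(lowest, highest)
```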


In [15]:

fig = px.bar(base_data, y="Total_Employees", x=base_data.index, color=base_data.index,
             text='Total_Employees')
fig.update_layout(title_text='Industry Employee Numbers')
fig.show()

[Figure: bar chart "Industry Employee Numbers" (x-axis: Industry, y-axis: Total_Employees)]

2.2 Which industry has the highest and lowest overall growth over the period?

The visualisation below shows the percentage growth in employment per industry, calculated from the 2009 and 2018 columns. Real estate shows the highest growth in employment at 86.67%, while retail shows the lowest at 0.64%.
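
The growth figure computed in In [8] is (employment in 2018 / employment in 2009 - 1) × 100, rounded to two decimals. As a worked check, the sketch below applies that formula to the real estate and retail values from the dataframe; the helper name `pct_growth` is only illustrative.

```python
def pct_growth(start, end):
    """Percentage change from start to end, rounded to two decimals."""
    return round((end / start - 1) * 100, 2)

# 2009 and 2018 values copied from the dataframe output
real_estate_growth = pct_growth(13500, 25200)
retail_growth = pct_growth(345400, 347600)

print(real_estate_growth, retail_growth)  # 86.67 0.64
```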


In [16]:

fig = px.bar(base_data, y="Total_Employees_Growth", x=base_data.index, color=base_data.index,
             text='Total_Employees_Growth')
fig.update_layout(title_text='Industry Employee % Growth')
fig.show()

[Figure: bar chart "Industry Employee % Growth" (x-axis: Industry, y-axis: Total_Employees_Growth)]

2.3 Which years are the best and worst performing years in terms of employment (highest and lowest employment)?

The bar chart shows total employment per year. 2018 is the best performing year, with the highest employment at 1.45 million, while 2010 is the worst performing year, with the lowest employment at 1.33 million.
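
The 1.33 million and 1.45 million figures can be checked by summing the 2010 and 2018 columns over all ten industries. The sketch below does this with the values from the cluster table in section 5. It also drops the summary column before summing, because a frame that still carries `Total_Employees` (as `base_data` does after In [7]) would otherwise add that column in as if it were another year.

```python
import pandas as pd

industries = ["Agriculture", "Production", "Construction", "Retail", "ICT", "Finance",
              "Real_Estate", "Professional_Service", "Public_Adminstration", "Other_Service"]

# 2010 and 2018 employment per industry, copied from Out[40]
wide = pd.DataFrame(
    {
        2010: [38200, 149800, 93200, 344500, 27900, 29800, 14600, 145800, 418600, 68000],
        2018: [41100, 165700, 101800, 347600, 31500, 35500, 25200, 187100, 434900, 81800],
    },
    index=industries,
)
wide["Total_Employees"] = wide.sum(axis=1)  # per-industry summary column, as in In [7]

# Sum per year over the industries, excluding the summary column
yearly_totals = wide.drop(columns=["Total_Employees"]).sum(axis=0)
best_year = yearly_totals.idxmax()
worst_year = yearly_totals.idxmin()
print(yearly_totals)
```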


In [17]:

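# base_data still contains Total_Employees and Total_Employees_Growth at this point,
# so after transposing they appear as two extra rows below the year rows
base_data2=base_data.T
base_data2.head()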

Out[17]:

Industry Agriculture Production Construction Retail ICT Finance Real_Estate Pro

2009 37700.0 156700.0 96600.0 345400.0 27800.0 33800.0 13500.0

2010 38200.0 149800.0 93200.0 344500.0 27900.0 29800.0 14600.0

2011 36100.0 158600.0 90000.0 343100.0 26400.0 33200.0 17600.0

2012 36100.0 154400.0 91300.0 347300.0 27200.0 31100.0 18800.0

2013 36800.0 164200.0 89300.0 345100.0 26900.0 32400.0 18000.0

In [18]:

base_data2.head()

Out[18]:

Industry Agriculture Production Construction Retail ICT Finance Real_Estate Pro

2009 37700.0 156700.0 96600.0 345400.0 27800.0 33800.0 13500.0

2010 38200.0 149800.0 93200.0 344500.0 27900.0 29800.0 14600.0

2011 36100.0 158600.0 90000.0 343100.0 26400.0 33200.0 17600.0

2012 36100.0 154400.0 91300.0 347300.0 27200.0 31100.0 18800.0

2013 36800.0 164200.0 89300.0 345100.0 26900.0 32400.0 18000.0

In [19]:

base_data2['Yearly_Total_Employees']=base_data2.sum(axis=1)


In [20]:

base_data2.head(10)

Out[20]:

Industry Agriculture Production Construction Retail ICT Finance Real_Estate Pro

2009 37700.0 156700.0 96600.0 345400.0 27800.0 33800.0 13500.0

2010 38200.0 149800.0 93200.0 344500.0 27900.0 29800.0 14600.0

2011 36100.0 158600.0 90000.0 343100.0 26400.0 33200.0 17600.0

2012 36100.0 154400.0 91300.0 347300.0 27200.0 31100.0 18800.0

2013 36800.0 164200.0 89300.0 345100.0 26900.0 32400.0 18000.0

2014 42700.0 173300.0 97000.0 337300.0 35700.0 32400.0 22200.0

2015 40700.0 172300.0 92600.0 357700.0 24000.0 30800.0 19100.0

2016 43200.0 162500.0 102700.0 360200.0 34400.0 31000.0 22700.0

2017 40200.0 165100.0 90800.0 333500.0 58900.0 32100.0 18200.0

2018 41100.0 165700.0 101800.0 347600.0 31500.0 35500.0 25200.0


In [21]:

fig=px.bar(base_data2,x=base_data2.index,y="Yearly_Total_Employees")
fig.update_layout(title='Yearly Total Employee',legend=dict(x=0,y=0.5))
fig.show()

[Figure: bar chart "Yearly Total Employee" (x-axis: year, y-axis: Yearly_Total_Employees)]

3. Visual analysis


In [22]:

base_data3=pd.read_excel('C:\\Users\\Admin\\Downloads\\Python\\Dataset.xlsx')
base_data3.index=base_data3['Industry']
base_data3.head()

Out[22]:

Industry 2009 2010 2011 2012 2013 2014 2015 2016

Industry

Agriculture Agriculture 37700 38200 36100 36100 36800 42700 40700 43200

Production Production 156700 149800 158600 154400 164200 173300 172300 162500

Construction Construction 96600 93200 90000 91300 89300 97000 92600 102700

Retail Retail 345400 344500 343100 347300 345100 337300 357700 360200

ICT ICT 27800 27900 26400 27200 26900 35700 24000 34400

3.1 Create a dynamic scatter/bubble plot showing the change in workforce numbers over the period using Plotly Express.

To plot the scatter chart, the dataframe first has to be converted from wide format (one column per year) into long format (one row per year and industry). The code below performs this conversion.
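
As an alternative to building the long table in a loop, the same reshaping can be sketched with `melt` and a grouped `diff`. This is not the notebook's own approach, only an equivalent formulation under the assumption that the index is named `Industry` and the columns are the years; the frame below is a two-industry, three-year excerpt of the data.

```python
import pandas as pd

wide = pd.DataFrame(
    {2009: [37700, 27800], 2010: [38200, 27900], 2011: [36100, 26400]},
    index=pd.Index(["Agriculture", "ICT"], name="Industry"),
)

# Wide to long: one row per (Industry, Year) pair
long_df = (
    wide.reset_index()
        .melt(id_vars="Industry", var_name="Year", value_name="Workforce")
        .sort_values(["Industry", "Year"])
        .reset_index(drop=True)
)

# Year-on-year change within each industry; the first year has no predecessor, so use 0
long_df["Workforce_Change"] = long_df.groupby("Industry")["Workforce"].diff().fillna(0)
print(long_df)
```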

In [23]:

del base_data3['Industry']

In [24]:

base_data4=base_data3.T

In [25]:

Final_df=pd.DataFrame(columns=['Year','Workforce','Industry','Workforce_Change'])
for col in base_data4.columns:
    if col!='Yearly_Total_Employees':
        #print(col)
        final_data=pd.DataFrame(columns=['Year','Workforce','Industry','Workforce_Change'])
        final_data['Workforce']=base_data4[col].tolist()
        final_data['Industry']=col
        final_data['Year']=base_data4[col].index
        # Year-on-year change; the first year has no predecessor, so it is filled with 0
        final_data['Workforce_Change']= final_data['Workforce'] - final_data['Workforce'].shift()
        final_data=final_data.fillna(0)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
        Final_df=pd.concat([Final_df,final_data])

Final dataframe used to plot the scatter chart.


In [26]:

Final_df.head(100)

Out[26]:

    Year  Workforce       Industry  Workforce_Change
0   2009      37700    Agriculture               0.0
1   2010      38200    Agriculture             500.0
2   2011      36100    Agriculture           -2100.0
3   2012      36100    Agriculture               0.0
4   2013      36800    Agriculture             700.0
..   ...        ...            ...               ...
5   2014      73300  Other_Service           -2200.0
6   2015      77200  Other_Service            3900.0
7   2016      72400  Other_Service           -4800.0
8   2017      83200  Other_Service           10800.0
9   2018      81800  Other_Service           -1400.0

100 rows × 4 columns

The dynamic scatter plot below shows the year-on-year change in workforce numbers over the period. The ICT industry shows the largest increase, in 2017 (+24,500), followed by the retail industry in 2015 (+20.4k). ICT also shows the largest decrease, in 2018 (-27,400), followed by retail in 2017 (-26,700). It can therefore be concluded that retail and ICT show the largest workforce changes over the period.


In [27]:

fig = px.scatter(Final_df, x="Year", y="Workforce_Change", color="Industry",
                 log_x=True, size_max=60)
fig.update_layout(title='Scatter plot of change in workforce')
fig.show()

[Figure: scatter plot "Scatter plot of change in workforce" (x-axis: Year, y-axis: Workforce_Change, colour: Industry)]

4. PCA/Correlation
PCA (Principal Component Analysis) is a dimensionality reduction method that transforms the variables of a dataset into a smaller set of variables, the principal components. Using the syntax below, we compute a PCA with two components.
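
To make the reduction concrete, the sketch below carries out the PCA steps by hand with NumPy (centre the data, then take a singular value decomposition) instead of calling scikit-learn. It uses only the 2010 and 2018 columns, copied from the cluster table in section 5, so the numbers are not the ones in Out[31]; the point is to show what the scores and the explained-variance share are.

```python
import numpy as np

# 2010 and 2018 employment for the ten industries (rows in the order of Out[40])
X = np.array([
    [38200, 41100], [149800, 165700], [93200, 101800], [344500, 347600],
    [27900, 31500], [29800, 35500], [14600, 25200], [145800, 187100],
    [418600, 434900], [68000, 81800],
], dtype=float)

# Step 1: centre each column on its mean
X_centered = X - X.mean(axis=0)

# Step 2: SVD of the centred matrix; the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

scores = X_centered @ Vt.T              # coordinates of each industry on PC1, PC2
explained_ratio = S**2 / np.sum(S**2)   # share of total variance per component
print(explained_ratio.round(4))
```

Because the columns are not standardised, the first component mostly captures overall industry size, which is why the PC1 values in the notebook are so much larger in magnitude than the PC2 values.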

In [28]:

from sklearn.decomposition import PCA


pca = PCA()
PCA_base=base_data3[[2009,2010,2011,2012,2013,2014,2015,2016,2017,2018]]


In [29]:

PCA_base.head()

Out[29]:

2009 2010 2011 2012 2013 2014 2015 2016 2017 2

Industry

Agriculture 37700 38200 36100 36100 36800 42700 40700 43200 40200 41

Production 156700 149800 158600 154400 164200 173300 172300 162500 165100 165

Construction 96600 93200 90000 91300 89300 97000 92600 102700 90800 101

Retail 345400 344500 343100 347300 345100 337300 357700 360200 333500 347

ICT 27800 27900 26400 27200 26900 35700 24000 34400 58900 31

In [30]:

pca.n_components = 2
X_reduced = pca.fit_transform(PCA_base)
df_X_reduced = pd.DataFrame(X_reduced,columns=['PC1','PC2'], index=PCA_base.index)
df_X_reduced=round(df_X_reduced,2)

In [31]:

df_X_reduced.head(10)

Out[31]:

                             PC1       PC2
Industry
Agriculture           -312091.23  -8151.22
Production              76819.90    912.47
Construction          -137358.52  -8786.80
Retail                 658534.20 -15872.69
ICT                   -335201.17  10284.00
Finance               -334432.79 -10293.86
Real_Estate           -376204.46  -7409.96
Professional_Service    58629.49  33479.53
Public_Adminstration   903354.93   2773.17
Other_Service         -202050.35   3065.36


In [32]:

corr = df_X_reduced.T.corr()
corr.style.background_gradient(cmap='coolwarm')

Out[32]:

Industry Agriculture Production Construction Retail ICT Finance Real_Estate

Industry

Agriculture 1 -1 1 -1 1 1

Production -1 1 -1 1 -1 -1 -

Construction 1 -1 1 -1 1 1

Retail -1 1 -1 1 -1 -1 -

ICT 1 -1 1 -1 1 1

Finance 1 -1 1 -1 1 1

Real_Estate 1 -1 1 -1 1 1

Professional_Service -1 1 -1 1 -1 -1 -

Public_Adminstration -1 1 -1 1 -1 -1 -

Other_Service 1 -1 1 -1 1 1

Real estate, finance, agriculture and ICT have large negative scores on principal component 1, which corresponds to industries with a smaller workforce. Production, public administration, retail and professional services have positive scores on component 1, which corresponds to industries with a larger workforce.
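
One caveat about the matrix in Out[32]: each industry is described there by only two numbers (PC1 and PC2), and the Pearson correlation between two series of length two is always +1 or -1 (or undefined if one series is constant). The values therefore only indicate whether two industries move in the same direction from PC1 to PC2, not how strongly they are related. A short check with the scores from Out[31]:

```python
import numpy as np

# (PC1, PC2) scores copied from Out[31]
agriculture = np.array([-312091.23, -8151.22])
production = np.array([76819.90, 912.47])
ict = np.array([-335201.17, 10284.00])

# With two observations per series the correlation can only be +1 or -1
r_agri_prod = np.corrcoef(agriculture, production)[0, 1]
r_agri_ict = np.corrcoef(agriculture, ict)[0, 1]
print(round(r_agri_prod, 6), round(r_agri_ict, 6))
```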


In [33]:

fig = px.scatter(df_X_reduced, x='PC1', y='PC2',color=df_X_reduced.index,
                 hover_name=df_X_reduced.index)
fig.update_layout(title='Principle Component Analysis Scatterplot')
fig.show()

[Figure: scatter plot "Principle Component Analysis Scatterplot" (x-axis: PC1, y-axis: PC2, colour: Industry)]

4.2 Correlation for each industry over years

The correlation matrix below shows the correlation between industries over the years 2009 to 2018. It can be observed that the agriculture industry is highly correlated with the construction industry (0.727), and that other services are also positively correlated with professional services and public administration, whereas retail and ICT show a moderate negative linear relationship (-0.552).
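
The matrix is a Pearson correlation computed across the ten yearly values of each pair of industries. A reduced sketch with four industries, using the yearly values from the dataframe output with industries as columns (the notebook reaches the same layout by transposing `PCA_base`):

```python
import pandas as pd

years = list(range(2009, 2019))
data = pd.DataFrame(
    {
        "Agriculture": [37700, 38200, 36100, 36100, 36800, 42700, 40700, 43200, 40200, 41100],
        "Construction": [96600, 93200, 90000, 91300, 89300, 97000, 92600, 102700, 90800, 101800],
        "Retail": [345400, 344500, 343100, 347300, 345100, 337300, 357700, 360200, 333500, 347600],
        "ICT": [27800, 27900, 26400, 27200, 26900, 35700, 24000, 34400, 58900, 31500],
    },
    index=years,
)

# DataFrame.corr computes pairwise Pearson correlation between columns by default
corr = data.corr().round(3)
print(corr)
```

With ten observations per industry these coefficients are sensitive to single years, for example the 2017 spike in ICT, so they are best read as rough indications.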


In [34]:

PCA_base.head()

Out[34]:

2009 2010 2011 2012 2013 2014 2015 2016 2017 2

Industry

Agriculture 37700 38200 36100 36100 36800 42700 40700 43200 40200 41

Production 156700 149800 158600 154400 164200 173300 172300 162500 165100 165

Construction 96600 93200 90000 91300 89300 97000 92600 102700 90800 101

Retail 345400 344500 343100 347300 345100 337300 357700 360200 333500 347

ICT 27800 27900 26400 27200 26900 35700 24000 34400 58900 31

In [35]:

import matplotlib.pyplot as plt

corr = round(PCA_base.T.corr(),3)
corr.style.background_gradient(cmap='coolwarm')

Out[35]:

Industry Agriculture Production Construction Retail ICT Finance Real_Es

Industry

Agriculture 1 0.647 0.727 0.228 0.378 -0.005 0

Production 0.647 1 0.188 0.028 0.232 0.225 0

Construction 0.727 0.188 1 0.414 0.01 0.309 0

Retail 0.228 0.028 0.414 1 -0.552 -0.253 0

ICT 0.378 0.232 0.01 -0.552 1 0.043 0

Finance -0.005 0.225 0.309 -0.253 0.043 1 0

Real_Estate 0.668 0.604 0.598 0.232 0.154 0.316

Professional_Service 0.637 0.56 0.441 0.046 0.503 0.389 0

Public_Adminstration 0.195 0.547 0.08 -0.258 0.122 0.59 0

Other_Service 0.333 0.578 -0.031 -0.156 0.543 0.242 0

5. Clustering (k means & hierarchical)


5.1 Using the employment data of the best and worst performing years (2.3), undertake a K-means clustering analysis (K=2 and K=3) and identify which industries cluster together.


Below is the K-means clustering table. K_2 holds the cluster labels for K=2, and K_3 holds the cluster labels for K=3.

In [36]:

cluster_base=base_data3[[2010,2018]]

In [37]:

cluster_base.head()

Out[37]:

                2010    2018
Industry
Agriculture    38200   41100
Production    149800  165700
Construction   93200  101800
Retail        344500  347600
ICT            27900   31500

In [38]:

import matplotlib.pyplot as plt


from sklearn.cluster import KMeans
cluster = KMeans(n_clusters=2)
predicted_2 = cluster.fit_predict(cluster_base)

cluster2 = KMeans(n_clusters=3)
predicted_3 = cluster2.fit_predict(cluster_base)
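
K-means starts from random initial centres, so the label numbers (and occasionally the grouping) can differ between runs of the cell above. A hedged sketch of a reproducible variant for K=2, assuming scikit-learn is installed; `random_state=0` and `n_init=10` are illustrative choices, not settings taken from the notebook:

```python
import pandas as pd
from sklearn.cluster import KMeans

industries = ["Agriculture", "Production", "Construction", "Retail", "ICT", "Finance",
              "Real_Estate", "Professional_Service", "Public_Adminstration", "Other_Service"]

# 2010 and 2018 employment per industry, copied from Out[40]
cluster_df = pd.DataFrame(
    {
        2010: [38200, 149800, 93200, 344500, 27900, 29800, 14600, 145800, 418600, 68000],
        2018: [41100, 165700, 101800, 347600, 31500, 35500, 25200, 187100, 434900, 81800],
    },
    index=industries,
)

# random_state fixes the initial centres; n_init=10 reruns the algorithm ten times
# and keeps the run with the lowest within-cluster sum of squares
km2 = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_df["K_2"] = km2.fit_predict(cluster_df[[2010, 2018]].to_numpy())
print(cluster_df)
```

The label values themselves (0 or 1) are arbitrary; only the grouping of industries is meaningful.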


In [39]:

cluster_base['K_2']=predicted_2+1
cluster_base['K_3']=predicted_3+1
cluster_base['K_2']=cluster_base['K_2'].astype(str)
cluster_base['K_3']=cluster_base['K_3'].astype(str)

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

(The same warning is repeated for lines 2, 3 and 4 of the cell.)
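
The SettingWithCopyWarning arises because `cluster_base` was created by selecting columns from `base_data3`, and pandas cannot tell whether the new columns are meant for the selection or for the original frame. One common way to avoid it is to take an explicit copy when creating the subset. A minimal sketch on a two-row excerpt, with the K_2 labels taken from Out[40]:

```python
import pandas as pd

base = pd.DataFrame(
    {2010: [38200, 344500], 2014: [42700, 337300], 2018: [41100, 347600]},
    index=["Agriculture", "Retail"],
)

# .copy() makes the selection an independent dataframe, so adding columns to it
# is unambiguous and does not touch the original frame
subset = base[[2010, 2018]].copy()
subset["K_2"] = ["1", "2"]

print(subset)
```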


In [40]:

cluster_base.head(10)

Out[40]:

                        2010    2018 K_2 K_3
Industry
Agriculture            38200   41100   1   2
Production            149800  165700   1   1
Construction           93200  101800   1   2
Retail                344500  347600   2   3
ICT                    27900   31500   1   2
Finance                29800   35500   1   2
Real_Estate            14600   25200   1   2
Professional_Service  145800  187100   1   1
Public_Adminstration  418600  434900   2   3
Other_Service          68000   81800   1   2

Scatter plot of K-means clustering with K=2. From the two-cluster visualisation below, together with the table above, it can be observed that most industries (eight) fall in cluster 1, while cluster 2 contains only two industries, retail and public administration, which are similar in size.


In [41]:

fig = px.scatter(cluster_base, x=2010, y=2018, color="K_2",
                 hover_name=cluster_base.index)
fig.update_layout(title='Scatter plot of K means clustering with K=2')
fig.show()

[Figure: scatter plot "Scatter plot of K means clustering with K=2" (x-axis: 2010, y-axis: 2018, colour: K_2)]

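Scatter plot of K-means clustering with K=3. From the K=3 visualisation below, together with the table above, it can be observed that most industries (six) are in cluster 2, while cluster 1 (production and professional services) and cluster 3 (retail and public administration) each contain two industries of similar size.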


In [42]:

fig = px.scatter(cluster_base, x=2010, y=2018, color="K_3",
                 hover_name=cluster_base.index)
fig.update_layout(title='Scatter plot of K means clustering with K=3')
fig.show()

[Figure: scatter plot "Scatter plot of K means clustering with K=3" (x-axis: 2010, y-axis: 2018, colour: K_3)]

5.2 Hierarchical cluster

A dendrogram is the main output of hierarchical clustering and is used to determine an appropriate number of clusters. With the plotting code used here, the vertical axis of the dendrogram represents the distance between clusters (Ward linkage on Euclidean distances). The number of clusters is read off by drawing a horizontal line across the dendrogram at a chosen distance and counting the vertical lines it crosses. Using this approach on the dendrogram, we identified 6 clusters.
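
Cutting the dendrogram with a horizontal line can also be done in code. The sketch below, assuming SciPy is available, builds the Ward linkage on the 2010 and 2018 values from Out[40] and cuts the tree at a distance placed between the fourth and fifth merge, which leaves six clusters for these ten industries. The midpoint threshold is an illustrative choice, not a value taken from the notebook.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# 2010 and 2018 employment for the ten industries (rows in the order of Out[40])
X = np.array([
    [38200, 41100], [149800, 165700], [93200, 101800], [344500, 347600],
    [27900, 31500], [29800, 35500], [14600, 25200], [145800, 187100],
    [418600, 434900], [68000, 81800],
], dtype=float)

# Each row of Z describes one merge; column 2 is the distance at which it happens,
# i.e. the height of the corresponding link in the dendrogram
Z = sch.linkage(X, method="ward")

# Ten points need nine merges; stopping after four merges leaves six clusters,
# so the cut is placed halfway between the 4th and 5th merge distances
threshold = (Z[3, 2] + Z[4, 2]) / 2
labels = sch.fcluster(Z, t=threshold, criterion="distance")
print(labels)
```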


In [43]:

import scipy.cluster.hierarchy as sch


# Build the dendrogram. sch.linkage runs the hierarchical clustering algorithm itself
# (Ward's method here) on the data passed to it: the 2010 and 2018 columns.
dendrogram = sch.dendrogram(sch.linkage(base_data3[[2010,2018]], method = "ward"))
plt.title('Dendrogram')
plt.xlabel('Industries')
plt.ylabel('Euclidean distances')
plt.show()

In [44]:

from sklearn.cluster import AgglomerativeClustering


# Ward linkage always uses Euclidean distances, so no separate distance argument is needed
# (the affinity parameter was renamed to metric in newer scikit-learn releases).
hc = AgglomerativeClustering(n_clusters = 6, linkage ='ward')
# Fit the hierarchical clustering algorithm to the 2010 and 2018 columns and build the
# cluster vector that tells, for each industry, which cluster it belongs to.
y_hc=hc.fit_predict(base_data3[[2010,2018]])


In [45]:

cluster_base['Hierarchical_clustering']=y_hc
cluster_base['Hierarchical_clustering']=cluster_base['Hierarchical_clustering']+1
cluster_base['Hierarchical_clustering']=cluster_base['Hierarchical_clustering'].astype(str)

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

(The same warning is repeated for lines 2 and 3 of the cell.)

The scatter plot below is created using the 6 clusters from hierarchical clustering.


In [46]:

fig = px.scatter(cluster_base, x=2010, y=2018, color="Hierarchical_clustering",
                 hover_name=cluster_base.index)
fig.update_layout(title='Scatter plot of Hierarchical clustering with K=6')
fig.show()

[Figure: scatter plot "Scatter plot of Hierarchical clustering with K=6" (x-axis: 2010, y-axis: 2018, colour: Hierarchical_clustering)]

K-means clustering is performed with a predetermined number of clusters. Here we identified the industry clusters for the best and worst performing years of employment with k = 2 and k = 3. Hierarchical clustering, as the name suggests, builds a hierarchy of clusters; based on the dendrogram, the number of clusters was set to k = 6 for the best and worst performing years of employment.

6. Discussion
Provide a brief discussion (~ 300 words) on employment landscape of Wales based on the employment data
analysis results.


From the report it can be observed that employment in Wales is highest in public administration services, followed by retail, production and professional services. The smallest workforce is in real estate, although it shows the highest percentage growth in employment from 2008 to 2019. Retail has the second-largest workforce but the lowest percentage growth rate over the period. Looking at total Welsh employment by year, 2018 shows the highest employment, with real estate growing by 38% while ICT shows negative growth. In 2010, Wales shows the lowest total workforce, with an average negative growth of -2%. From the correlation matrix, it can be observed that the agriculture industry is highly correlated with the construction industry, and that other services are also positively correlated with professional services and public administration, whereas retail and ICT show a moderate negative linear relationship.
