UN Data Analysis Excel Pivot Pandas Matplotlib NumPy

This document provides instructions for analyzing UN environmental data using Microsoft Excel pivot tables, Pandas, Matplotlib, and NumPy. It details how to save a CSV file as an XLSX, install Power Pivot, create a pivot table to analyze the data, save as a CSV, and import into Jupyter Notebook to further explore the data using Pandas and Matplotlib.

Uploaded by

20105167
Copyright
© © All Rights Reserved

UN Data Analysis – MS Excel Pivot Tables, Pandas, Matplotlib & NumPy Exercise


UN Data Analysis – MS Excel Pivot Tables, Pandas, Matplotlib & NumPy Exercise
Save the CSV as an XLSX file
Installing Power Pivot
Create the Data Connection and Import the Data
Create the Pivot Table
Save the XLSX file as a CSV file and Clean
Import the Data in the .CSV file into Jupyter
Analyse the Data
df.head()
df
.plot
index_col
df.dot notation
df.min()
df.max()
df.mean()
df.std()
df.describe()
df.loc[]
df.iloc[]
sum(axis)
sort_values
Plot a Bar Chart
figsize
Subscript Title
Set the Y Axis Label
Remove Scientific Notation
Slicing Data Using Indexing
Export the Data Frame to a New .CSV file
df.to_csv

Save the CSV as an XLSX file
Open the UnEnvData.csv file in MS Excel. Re-save it as an MS Excel File.
Go to File – Save As, and choose Excel Workbook (*.xlsx) as the file
extension. Save it to the same folder as the original .csv file.

When you click on the title at the top of the document, it should now show
that the file has a .xlsx extension rather than .csv:

Now close the file.

Installing Power Pivot


Open a new Excel Blank Workbook

Save it in the same folder as the other files as UnEnvDataPivot.xlsx

First, we need to check if Power Pivot is already installed. Go to the
Data tab and click on Power Pivot under Data Tools.

If you don’t see this option in MS Excel, you need to install the
Microsoft Power Pivot add-in first. In that case, go to File –
Options – Add-ins and click on Microsoft Power Pivot for Excel, which will
be listed under Inactive Application Add-ins. It only moves to
Active Application Add-ins after you have installed it, as in the case
of my window below:

Create the Data Connection and Import the Data


After you have installed Power Pivot, go back to the Data tab and click on
Power Pivot. It launches in a separate window from MS Excel, which you can
minimise or maximise depending on your preference.

Click on Get External Data and then click on From Other Sources.

Scroll down and choose Import Data from an Excel File and click Next.

Browse and choose the newly saved UnEnvData.xlsx file and click Open.

It will insert the file path into the Excel File Path selection box:

Click Test Connection to see that you have built the connection to the
file correctly. It should say Test connection successful. Then click on Ok
and then on Next.

Note that when we import data, we must always ensure that the source file
is not in use (in this case, not open in Excel), or else you will get an
error message like the one below. We closed the file, so we shouldn’t get
the message, but it is important to know what this error message means.

After you click Next, you should see the following screen. Click on
Finish.

You will be advised how much data has been imported into Power Pivot in
your MS Excel workbook. When importing any data, it is important to
take note of the error count and address it if there is an issue. Otherwise,
your data analysis will be flawed because your data import is flawed. Our
import is successful, so click on Close.

You should now see the following screen, which contains our three columns
of data from the UnEnvData.xlsx file.

Create the Pivot Table


Click on Pivot Table – Pivot Table to create a Pivot Table in Excel.

Choose Existing Worksheet and click OK.

Your screen will now look like this:

Click on the arrow beside UnEnvData and tick the box F1, which corresponds
to the names of the countries in our .xlsx file. You will see the country
names appear in a column on the worksheet, and F1 appears in the Rows
segment of the Pivot Table pane in the bottom right of the screen.

However, we want this data to form columns, not rows. So at the bottom of
the screen, under Rows, click on the F1 label and drag it into the Columns
segment. You will see how the worksheet rearranges the data as follows:

Now tick the F2 box, which corresponds to the Year column in our .xlsx
file. By default it lands in the Values segment, which is not what we want.
We want it in the Rows segment, so once again, drag and drop. Your screen
should appear as follows, with the Year data aligned in the first column:

Now tick the F3 box which corresponds with the Value column in our .xlsx
file. By default, it should appear in the Values segment which is correct.
Your screen should now look as follows:

I would strongly advise that you do a quick check of the data in this
Pivot Table with the original data in the UnEnvData.csv file to ensure
that all is correct. This is good data quality control practice.
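
One way to cross-check the pivot table is to build the same layout in pandas and compare a few cells. The sketch below is only an illustration: the column name Country and all of the values are made up, and the real UnEnvData.csv may use different headers.

```python
import pandas as pd

# Hypothetical long-format sample: one row per country and year.
# Column names and values are illustrative, not taken from UnEnvData.csv.
long_df = pd.DataFrame({
    "Country": ["Australia", "Australia", "Austria", "Austria"],
    "Year": [1990, 1991, 1990, 1991],
    "Value": [100.0, 110.0, 60.0, 65.0],
})

# Countries become columns, years become rows, and Value fills the cells,
# mirroring the Columns / Rows / Values segments of the Excel pivot table.
wide_df = long_df.pivot(index="Year", columns="Country", values="Value")

print(wide_df)
```

Each cell of wide_df should match the corresponding cell of the Excel pivot table for the same country and year.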

If at any time your Pivot Table Field List disappears or accidentally
closes, you can reopen it by right-clicking and choosing Show Field List
as follows. Similarly, you can hide the Field List.

You will notice that we have a Sum Total row at the end of our data. We
don’t want this so we will deactivate it. Click on the Design Tab and
choose Grand Totals – Off for Rows and Columns.

You will note that to the left of that button, there is an option to do
the same for Subtotals should the need ever arise when cleaning data. Our
screen should now look as follows:

We now want to turn off the Field Headers on the top right corner:

Our screen should look as follows:

Once we are happy with the data, we can close the Pivot table fields by
clicking on the X to Close:

Then we can close the Power Pivot window by clicking on the X to Close.
Save the XLSX file as a CSV file and Clean
We now want to save our file as a CSV so that we can remove any excess
rows of data or labels and get it ready for import into Jupyter. Go to
File – Save As and choose CSV (Comma delimited) (*.csv).

Now close MS Excel and any Pivot windows if you still have them open. Then
go to your folder and reopen the UnEnvDataPivot.csv file for one final data
cleanse. Technically, we don’t need to close and reopen the file, but I
prefer to do it to be absolutely sure. It should reopen as follows:

Delete any blank rows if you have them, and then delete the entire row
that contains the words Sum of F3.

Then above the years, type Year in the blank cell so that it is consistent
with our original dataset. Your data should now look as follows:

Close it and save the changes. We are now ready to import it into Jupyter.
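
If you would rather do this clean-up in code than by hand, the same steps can be sketched in pandas. The raw text below is a guess at what the exported file might look like (a leftover Sum of F3 row followed by a header row with a blank first cell), and the values are made up, so adjust skiprows to match your own file.

```python
import io
import pandas as pd

# Hypothetical raw export: a leftover "Sum of F3" label row, then a header
# row whose first cell (above the years) is blank. Values are made up.
raw_csv = (
    "Sum of F3,,\n"
    ",Australia,Austria\n"
    "1990,100,60\n"
    "1991,110,65\n"
)

# skiprows=1 drops the label row; pandas names the blank header cell
# "Unnamed: 0", which we rename to Year.
df = pd.read_csv(io.StringIO(raw_csv), skiprows=1)
df = df.rename(columns={"Unnamed: 0": "Year"})

# Drop any rows that are completely empty.
df = df.dropna(how="all")

print(df)
```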

Import the Data in the .CSV file into Jupyter
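Open a new Jupyter Notebook and save it as UNEnvDataAnalysis2Nov2022. The
file name passed to read_csv must match the name of the saved CSV file; on
a case-sensitive file system the capitalisation must match too.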

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('unenvdatapivot.csv')

Analyse the Data


df.head()

df.head()

df

It is now much easier to call country data:

df['Australia']

.plot

And it is much easier to quickplot the data.

df['Australia'].plot()

index_col

We can see that the x axis still refers to the default index (0, 1, 2,
etc.) rather than to our years. We could change this by declaring our X and
Y axes, but there is an easier way: declare the year as the index when we
import the CSV data. So let’s reimport it as follows:

df = pd.read_csv('unenvdatapivot.csv', index_col='Year')

and re-run the plot:

df['Australia'].plot()

The x axis now shows the years, so we have a much more accurate figure.

We can also look at df.head() again to see that the default index
column is gone and has been replaced by the Year from our CSV file, and
that we now have 43 columns instead of 44.

df.head()
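
The effect of index_col can be seen on a tiny example. This is a minimal sketch using made-up values held in a string rather than the real CSV file:

```python
import io
import pandas as pd

# Made-up sample in the same layout as the cleaned CSV.
csv_text = "Year,Australia,Austria\n1990,100,60\n1991,110,65\n"

# Without index_col, pandas adds a default 0, 1, 2, ... index
# and keeps Year as an ordinary column.
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col='Year', the years become the index and
# Year is no longer counted as a column.
year_df = pd.read_csv(io.StringIO(csv_text), index_col="Year")

print(list(default_df.index), default_df.shape)  # [0, 1] (2, 3)
print(list(year_df.index), year_df.shape)        # [1990, 1991] (2, 2)
```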

Once again, we can call the data for Australia as follows:

df['Australia']

df.dot notation

Or we can even use the df. dot notation, which does the same thing.

df.Australia
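
Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The sketch below uses made-up values and a hypothetical column called United Kingdom to show where the bracket form is still needed:

```python
import pandas as pd

# Made-up values; "United Kingdom" is a hypothetical column with a space.
df = pd.DataFrame(
    {"Australia": [100, 110], "United Kingdom": [80, 78]},
    index=pd.Index([1990, 1991], name="Year"),
)

# Both forms return the same Series for a simple column name.
same = df.Australia.equals(df["Australia"])
print(same)  # True

# A name containing a space cannot be written with dot notation,
# so the bracket form is required.
uk = df["United Kingdom"]
print(uk)
```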

Our stats will also look very different from a presentation perspective as
follows:

df.min()

df.min()

will give us the minimum value of each country.

df.max()

df.max()

will give us the maximum value of each country.

df.mean()

df.mean()

will give us the average value of each country.

df.std()

df.std()

will give us the standard deviation for each country.

df.describe()

df.describe()

will give us an overview of key stats from all countries.
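
One detail worth knowing when comparing these figures with NumPy: pandas computes the sample standard deviation (ddof=1) by default, while np.std computes the population standard deviation (ddof=0) by default. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame(
    {"Australia": [100.0, 110.0, 120.0], "Austria": [60.0, 65.0, 70.0]},
    index=pd.Index([1990, 1991, 1992], name="Year"),
)

values = df["Australia"].to_numpy()

pandas_std = df["Australia"].std()         # sample std, ddof=1
numpy_std = np.std(values)                 # population std, ddof=0
numpy_sample_std = np.std(values, ddof=1)  # matches pandas

print(pandas_std, numpy_std, numpy_sample_std)
```

Passing ddof=1 to np.std (or ddof=0 to df.std) makes the two libraries agree.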

df.loc[]

We can also locate the values for a specific year for each country now
that our index is the year. This is because .loc[] is label based: it looks
rows up by their index value.

df.loc[1990]

df.iloc[]

Or we can call each row by its integer position using .iloc[]. So, if we
pull out the row at position 10 (counting from 0), we get the following
data for the year 2000.

df.iloc[10]
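
The difference between the two can be sketched on a made-up frame that, like our data, is indexed by the years 1990 to 2019:

```python
import pandas as pd

# Made-up values; only the Year index mirrors the real data.
years = pd.Index(range(1990, 2020), name="Year")
df = pd.DataFrame(
    {"Australia": range(30), "Austria": range(100, 130)},
    index=years,
)

by_label = df.loc[2000]     # row whose index label is 2000
by_position = df.iloc[10]   # eleventh row, counting from 0

print(by_label.equals(by_position))  # True

# Slices differ too: .loc includes the end label, .iloc excludes the end position.
print(len(df.loc[1990:1992]))  # 3
print(len(df.iloc[0:2]))       # 2
```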

sum(axis)

We want to plot a bar chart which displays the total sum of all CO2
emissions for each country over the last 30 years.

First, we simply use sum.

df.sum(axis=0)

You will notice that it is ordered alphabetically based on the country
name.

sort_values

We want to reorder it based on the CO2 emissions:

df.sum(axis=0).sort_values()

This sorts the values from the lowest emitter to the highest emitter.
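
To make the axis argument concrete, here is a minimal sketch on made-up values. axis=0 adds down each column (one total per country), axis=1 adds across each row (one total per year), and ascending=False reverses the sort order:

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame(
    {"Australia": [100, 110], "Austria": [60, 65], "Belgium": [90, 95]},
    index=pd.Index([1990, 1991], name="Year"),
)

per_country = df.sum(axis=0)  # Australia 210, Austria 125, Belgium 185
per_year = df.sum(axis=1)     # 1990 -> 250, 1991 -> 270

# Highest emitter first instead of lowest first.
highest_first = per_country.sort_values(ascending=False)

print(per_country)
print(per_year)
print(highest_first)
```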

Plot a Bar Chart
Before plotting the bar chart, we will assign the sorted totals to a new
variable, df_totals. This gives us a new object (a pandas Series rather
than a DataFrame) which will be easier to plot if we wish.

df_totals = df.sum(axis=0).sort_values()

We have already done so, but remember to import matplotlib if you haven’t.

import matplotlib.pyplot as plt

And then plot the bar chart using the kind parameter.

df_totals.plot(kind = 'bar')

plt.show()

which outputs

figsize

We want to make the chart wider, so we pass the figsize argument:

df_totals.plot(kind = 'bar', figsize=(16, 6))

plt.show()

Subscript Title

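We can use subscripts in the title so that the 2 in CO2 and the year range
sit below the baseline. Matplotlib renders text between dollar signs as
math, and an underscore starts a subscript:

df_totals.plot(kind = 'bar', figsize=(16, 6), title='CO$_2$ Emissions $_{(1990-2019)}$')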

plt.show()

Set the Y Axis Label

We can also go back to an old trick in matplotlib and set our Y label as
follows:

ax = df_totals.plot(kind = 'bar', figsize=(16, 6), title='CO$_2$ Emissions $_{(1990-2019)}$')

ax.set_ylabel('CO$_2$ (ktonnes)')

plt.show()

Remove Scientific Notation

You will also notice the scientific notation 1e8 in the top left corner of
the chart. If you don’t like it and want to see the actual value range as
per the raw data, we can do the following to remove the label and amend the
Y axis values on the chart:

ax = df_totals.plot(kind = 'bar', figsize=(16, 6), title='CO$_2$ Emissions $_{(1990-2019)}$')

ax.set_ylabel('CO$_2$ (ktonnes)')

ax.get_yaxis().get_major_formatter().set_scientific(False)

plt.show()
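
As an alternative, the tick labels can be given thousands separators, which makes large plain numbers easier to read. This is a sketch rather than part of the original exercise: it swaps in matplotlib's StrMethodFormatter, and the totals are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

# Made-up totals standing in for df_totals.
df_totals = pd.Series(
    {"Austria": 2.1e6, "Belgium": 3.4e6, "Australia": 1.2e8},
    name="Total",
)

ax = df_totals.plot(kind="bar", figsize=(16, 6),
                    title="CO$_2$ Emissions $_{(1990-2019)}$")
ax.set_ylabel("CO$_2$ (ktonnes)")

# Format each y tick as a whole number with comma separators,
# e.g. 120,000,000 instead of 1.2 with a 1e8 offset label.
ax.yaxis.set_major_formatter(StrMethodFormatter("{x:,.0f}"))

# In a notebook the chart renders when the cell runs;
# in a script, call plt.show() here.
```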

Slicing Data Using Indexing
Because a pandas Series stores its values in a NumPy array, we can slice
it by position in the same way as a NumPy array. If we just want to display
the five highest emitters, we can slice from position -5 through to the end
of the Series; the end position does not need to be specified:

df_totals = df.sum(axis=0).sort_values()[-5:]

And then of course, re-run the chart plot and adjust our Title.

ax = df_totals.plot(kind = 'bar', figsize=(16, 6), title='Top 5 CO$_2$ Emissions $_{(1990-2019)}$')

ax.set_ylabel('CO$_2$ (ktonnes)')

ax.get_yaxis().get_major_formatter().set_scientific(False)

plt.show()

If we just want to display the five lowest emitters, we can slice from the
start of the Series (position 0, which we don’t need to specify) up to,
but not including, position 5:

df_totals = df.sum(axis=0).sort_values()[:5]

And then of course, re-run the chart plot and adjust our Title.

ax = df_totals.plot(kind = 'bar', figsize=(16, 6), title='Lowest 5 CO$_2$ Emissions $_{(1990-2019)}$')

ax.set_ylabel('CO$_2$ (ktonnes)')

ax.get_yaxis().get_major_formatter().set_scientific(False)

plt.show()
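
pandas also offers nlargest and nsmallest, which pick the top or bottom values without sorting the whole Series first. The sketch below compares them with the slicing approach on made-up totals; the single-letter labels are placeholders for country names.

```python
import pandas as pd

# Made-up totals; the letters stand in for country names.
totals = pd.Series({"A": 50, "B": 10, "C": 40, "D": 30, "E": 20, "F": 60, "G": 70})

# Slicing approach: sort ascending, then take the last five by position.
# .iloc makes it explicit that the slice is positional.
top5_by_slice = totals.sort_values().iloc[-5:]

# nlargest / nsmallest approach.
top5 = totals.nlargest(5)      # ordered highest first
bottom5 = totals.nsmallest(5)  # ordered lowest first

print(list(top5_by_slice.index))  # ['D', 'C', 'A', 'F', 'G']
print(list(top5.index))           # ['G', 'F', 'A', 'C', 'D']
print(list(bottom5.index))        # ['B', 'E', 'D', 'C', 'A']
```

Both approaches select the same five entries. Only the order differs, which affects the left-to-right order of the bars in the chart.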

Export the Data Frame to a New .CSV file
Let’s say we want to export the new df_totals object, which is derived
from the main data frame, to an external .csv file:

df.to_csv

We can save it to a new CSV file using .to_csv as follows:

df_totals.to_csv('new_df.csv')

Go check your directory to see that it has been created correctly.
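
The file can also be checked from inside the notebook by reading it back in. This is a minimal sketch with made-up totals; the Total and Country labels are illustrative choices, and a temporary folder is used so that nothing in your working directory is overwritten.

```python
import os
import tempfile
import pandas as pd

# Made-up totals standing in for df_totals.
df_totals = pd.Series({"Austria": 125, "Belgium": 185, "Australia": 210}, name="Total")

# Write to a temporary folder; index_label names the first CSV column.
out_path = os.path.join(tempfile.mkdtemp(), "new_df.csv")
df_totals.to_csv(out_path, index_label="Country")

# Read the file back, using the Country column as the index again.
check = pd.read_csv(out_path, index_col="Country")["Total"]

print(check)
```

If the values printed match df_totals, the export worked.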

Click on the file and you can look at a preview:

You can of course download it also and open it in Excel:

Save all of the changes and close this Jupyter Notebook.
