UN Data Analysis Excel Pivot Pandas Matplotlib NumPy
df.head()
df
.plot
index_col
df.dot notation
df.min()
df.max()
df.mean()
df.std()
df.describe()
df.loc[]
df.iloc[]
sum(axis)
sort_values
figsize
Subscript Title
df.to_csv
Save the CSV as an XLSX file
Open the UnEnvData.csv file in MS Excel. Re-save it as an MS Excel File.
Go to File – Save As, and choose Excel Workbook (*.xlsx) as the file
extension. Save it to the same folder as the original .csv file.
When you click on the title at the top of the document, it should now show
that the file has the .xlsx extension and not .csv:
Now close the file.
First, we need to check whether Power Pivot is already installed. Go to the
Data tab and click on Power Pivot under Data Tools.
If you don’t see this option in MS Excel, you need to install the
Microsoft Power Pivot add-in first. In that case, go to File –
Options – Add-ins and click on Microsoft Power Pivot for Excel, which will
be listed under Inactive Application Add-ins. It only moves to
Active Application Add-ins after you have installed it, as in my window
below:
Click on Get External Data and then on From Other Sources.
Scroll down and choose Import Data from an Excel File and click Next.
Browse and choose the newly saved UnEnvData.xlsx file and click Open.
It will insert the file path into the Excel File Path selection box:
Click Test Connection to see that you have built the connection to the
file correctly. It should say Test connection successful. Then click on Ok
and then on Next.
Note that whenever we import data, we must ensure that the source file is
not in use (in this case, open in Excel), or else we will get an error
message like the one below. Because we closed the file we shouldn’t see it
here, but it is important to know what this error message means.
After you click Next, you should see the following screen. Click on
Finish.
You will be told how much data has been imported into your MS Excel
Workbook Power Pivot. When importing any data, it is important to take
note of the error count and address any errors, because a flawed import
leads to a flawed analysis. Our import is successful, so click on Close.
You should now see the following screen which contains our three columns
of data from the UnEnvDataPivot.xlsx file.
Choose Existing Worksheet and click OK.
Click on the arrow beside UnEnvData and tick the F1 box, which corresponds
to the names of the countries in our .xlsx file. You will see these values
appear in a column on the Worksheet, and F1 appears in the Rows segment of
the Pivot Table in the bottom right of the screen.
However, we want the countries to form columns, not rows. So, at the
bottom of the screen under Rows, click on the F1 label and drag it into
the Columns segment. You will see how the Worksheet rearranges the data as
follows:
Now tick the F2 box, which corresponds to the Year column in our .xlsx
file. Excel places it in the Values segment, but we want it in the Rows
segment, so once again drag and drop. Your screen should now appear as
follows, with the Year data aligned in the first column:
Now tick the F3 box which corresponds with the Value column in our .xlsx
file. By default, it should appear in the Values segment which is correct.
Your screen should now look as follows:
I would strongly advise that you do a quick check of the data in this
Pivot Table against the original data in the UnEnvData.csv file to ensure
that all is correct. This is good data quality control practice.
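One possible way to carry out this check is to build the same wide table in pandas and compare it with the Pivot Table. The sketch below is only an illustration: the column names Country, Year and Value (standing in for F1, F2 and F3) and all of the numbers are made up, not taken from UnEnvData.csv.

```python
import io
import pandas as pd

# Made-up long-format data standing in for UnEnvData.csv (illustrative values only).
raw = io.StringIO(
    "Country,Year,Value\n"
    "Australia,1990,100\n"
    "Australia,1991,110\n"
    "Austria,1990,50\n"
    "Austria,1991,55\n"
)
long_df = pd.read_csv(raw)

# Years become the rows, countries become the columns, Value fills the cells.
wide_df = long_df.pivot(index="Year", columns="Country", values="Value")
print(wide_df)
```

If the real file is read in place of the made-up data, the resulting table should match the Pivot Table cell for cell.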
You will notice that we have a Grand Total row at the end of our data. We
don’t want this, so we will turn it off. Click on the Design tab and
choose Grand Totals – Off for Rows and Columns.
To the left of that button there is an option to do the same for
Subtotals, should the need ever arise when cleaning data. Our screen
should now look as follows:
We now want to turn off the Field Headers on the top right corner:
Once we are happy with the data, we can close the Pivot table fields by
clicking on the X to Close:
Then we can close the Power Pivot window by clicking on the X to Close.
Save the XLSX file as a CSV file and Clean
We now want to save our file as a CSV so that we can remove any excess
rows of data or labels and get it ready for import into Jupyter. So go to
File – Save As – CSV (Comma delimited) (*.csv).
Now close MS Excel and any Pivot windows you still have open. Then go to
your folder and reopen the UNEnvDataPivot.csv file for one final data
cleanse. Technically, we don’t need to close and reopen the file, but I
prefer to do it to be absolutely sure. It should reopen as follows:
Delete any blank rows if you have them, and then delete the entire row
that contains the words Sum of F3.
Then above the years, type Year in the blank cell so that it is consistent
with our original dataset. Your data should now look as follows:
Close it and save the changes. We are now ready to import it into Jupyter.
Import the Data in the .CSV file into Jupyter
Open a new Jupyter Notebook and save it as UNEnvDataAnalysis2Nov2022.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('unenvdatapivot.csv')
df.head()
df
df['Australia']
.plot
df['Australia'].plot()
index_col
We see that the x axis still refers to the default index (0, 1, 2, etc.)
rather than to our years. We could change this by declaring our X and Y
axes, but there is an easier way: declare the year as the index when we
import the CSV data. So let’s reimport it as follows:
df = pd.read_csv('unenvdatapivot.csv', index_col='Year')
df['Australia'].plot()
We can also look at the df.head() again to see that the default index
column is gone and it has been replaced by the Year from our CSV file, and
we now have 43 columns instead of 44.
df.head()
df['Australia']
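The effect of index_col on the column count can be sketched on a small made-up table. The CSV text below is a hypothetical stand-in for the pivoted file, with only two countries and three years, so the counts are 3 and 2 rather than 44 and 43.

```python
import io
import pandas as pd

# Made-up stand-in for the pivoted CSV (illustrative values only).
csv_text = "Year,Australia,Austria\n1990,100,50\n1991,110,55\n1992,120,60\n"

df_default = pd.read_csv(io.StringIO(csv_text))                 # index is 0, 1, 2
df_year = pd.read_csv(io.StringIO(csv_text), index_col="Year")  # index is the Year

print(df_default.shape)        # Year counts as a column here
print(df_year.shape)           # one column fewer, Year is now the index
print(df_year.index.tolist())
```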
df.dot notation
Or we can even use the df. dot notation which does the same thing.
df.Australia
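A short sketch of the equivalence, on made-up values. One caveat worth knowing: dot notation only works when the column name is a valid Python identifier, so a name containing a space (the "United States" column here is included purely to illustrate this) still needs brackets.

```python
import pandas as pd

# Made-up values; "United States" is included only to show a name with a space.
df = pd.DataFrame(
    {"Australia": [100, 110], "United States": [900, 950]},
    index=pd.Index([1990, 1991], name="Year"),
)

same = df.Australia.equals(df["Australia"])  # both return the same Series
print(same)

# df.United States would be a syntax error, so use brackets for such names.
print(df["United States"])
```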
Our stats will also look very different from a presentation perspective as
follows:
df.min()
df.min()
df.max()
df.max()
df.mean()
df.mean()
df.std()
df.std()
df.describe()
df.describe()
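How these methods relate to one another can be seen on a tiny made-up table: df.describe() collects the count, mean, standard deviation, minimum, quartiles and maximum for each column, and each individual method returns one of those rows. The values below are illustrative only.

```python
import pandas as pd

# Made-up values for two countries over three years.
df = pd.DataFrame(
    {"Australia": [100.0, 110.0, 120.0], "Austria": [50.0, 55.0, 60.0]},
    index=pd.Index([1990, 1991, 1992], name="Year"),
)

summary = df.describe()  # count, mean, std, min, quartiles and max per column
print(summary)

# Each individual method returns one row of that summary.
print(df.mean())
print(df.std())          # sample standard deviation (ddof=1) by default
```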
df.loc[]
Now that our index is the year, we can also look up the values for every
country in a specific year. This works because .loc[] selects rows by
index label.
df.loc[1990]
df.iloc[]
Or we can select a row by its integer position using .iloc[]. So, if we
pull out the row at position 10, we get the following data for the year
2000.
df.iloc[10]
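The difference between label-based and position-based selection can be sketched as follows, assuming (as the example above implies) an index of consecutive years starting at 1990. The values themselves are made up.

```python
import pandas as pd

# Made-up values; the index runs over consecutive years starting at 1990.
years = list(range(1990, 2003))
df = pd.DataFrame(
    {"Australia": list(range(100, 113)), "Austria": list(range(50, 63))},
    index=pd.Index(years, name="Year"),
)

by_label = df.loc[2000]    # the row whose index label is 2000
by_position = df.iloc[10]  # the eleventh row, counting from position 0

print(by_label.equals(by_position))
```

If a year were missing from the data, position 10 would no longer line up with the year 2000, whereas .loc[2000] would still find the right row.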
sum(axis)
We want to plot a bar chart which displays the total sum of all CO2
emissions for each country over the last 30 years.
df.sum(axis=0)
sort_values
df.sum(axis=0).sort_values()
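A small sketch of what the axis argument and sort_values do, using made-up numbers for three countries: axis=0 adds down each column (one total per country), while axis=1 adds across each row (one total per year).

```python
import pandas as pd

# Made-up values for three countries over two years.
df = pd.DataFrame(
    {"Australia": [100, 110], "Austria": [50, 55], "Belgium": [70, 75]},
    index=pd.Index([1990, 1991], name="Year"),
)

per_country = df.sum(axis=0)  # add down each column: one total per country
per_year = df.sum(axis=1)     # add across each row: one total per year

print(per_country.sort_values())                 # smallest total first
print(per_country.sort_values(ascending=False))  # largest total first
```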
Plot a Bar Chart
Before plotting the bar chart, we will assign the sorted totals to a new
variable, df_totals, which will be easier to plot. Note that the result of
df.sum() is a pandas Series rather than a DataFrame.
df_totals = df.sum(axis=0).sort_values()
And then plot the bar chart using the kind parameter.
df_totals.plot(kind = 'bar')
plt.show()
which outputs
figsize
plt.show()
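A minimal sketch of how figsize can be passed, assuming the chart is drawn with the pandas .plot method as before. figsize takes a (width, height) tuple in inches; the size (12, 5) and the totals below are arbitrary illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up totals standing in for df_totals.
df_totals = pd.Series({"Austria": 105, "Belgium": 145, "Australia": 210})

# figsize is (width, height) in inches; (12, 5) is an arbitrary choice.
ax = df_totals.plot(kind="bar", figsize=(12, 5))
print(ax.figure.get_size_inches())
# In a notebook, drop the matplotlib.use line and call plt.show() here.
```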
Subscript Title
We can add a title in which the 2 in CO2 is set as a subscript, along with
the Year range!
plt.show()
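One way to do this, sketched under the assumption that the title is passed through the pandas .plot method, is matplotlib's mathtext: wrapping _2 in dollar signs renders the 2 as a subscript. The title wording and the year range shown are placeholders, and the totals are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up totals standing in for df_totals.
df_totals = pd.Series({"Austria": 105, "Belgium": 145, "Australia": 210})

# $_2$ is mathtext: the underscore renders the 2 as a subscript.
# The wording and year range of the title are placeholders.
ax = df_totals.plot(kind="bar", title="Total CO$_2$ Emissions 1990-2020")
print(ax.get_title())
```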
Set the Y Axis Label
We can also go back to an old trick in matplotlib and we can set our Y
label as follows:
ax = df_totals.plot(kind='bar')  # keep the Axes object so we can label it
ax.set_ylabel('CO$_2$ (ktonnes)')
plt.show()
You will also notice the scientific notation 1e8 in the top left corner of
the chart. If you don’t like it and would rather see the actual value
range as per the raw data, we can do the following to remove the label and
amend the Y axis values on the chart:
ax = df_totals.plot(kind='bar')  # keep the Axes object so we can format it
ax.set_ylabel('CO$_2$ (ktonnes)')
ax.get_yaxis().get_major_formatter().set_scientific(False)
plt.show()
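As an alternative sketch, matplotlib's ticklabel_format method on the Axes can switch the y axis to plain numbers in a single call. The totals below are made up and chosen only to be large enough for the 1e8 label to appear by default.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up totals large enough to trigger the 1e8 label.
df_totals = pd.Series({"Austria": 1.2e8, "Belgium": 2.5e8, "Australia": 4.1e8})

ax = df_totals.plot(kind="bar")
ax.set_ylabel("CO$_2$ (ktonnes)")

# Alternative to get_major_formatter().set_scientific(False):
ax.ticklabel_format(style="plain", axis="y")
print(ax.get_ylabel())
```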
Slicing Data Using Indexing
Because a pandas Series is backed by a NumPy array, we can slice it by
position just as we would a NumPy array. If we just want to display the
five highest emitters, we can slice the sorted data from position -5 to
the end of the range, which we don’t need to specify:
df_totals = df.sum(axis=0).sort_values()[-5:]
And then of course, re-run the chart plot and adjust our Title.
ax = df_totals.plot(kind='bar')  # keep the Axes object so we can format it
ax.set_ylabel('CO$_2$ (ktonnes)')
ax.get_yaxis().get_major_formatter().set_scientific(False)
plt.show()
If we just want to display the five lowest emitters, we can slice the data
from the start of the range (position 0, which we don’t need to specify)
up to, but not including, position 5:
df_totals = df.sum(axis=0).sort_values()[:5]
And then of course, re-run the chart plot and adjust our Title.
ax = df_totals.plot(kind='bar')  # keep the Axes object so we can format it
ax.set_ylabel('CO$_2$ (ktonnes)')
ax.get_yaxis().get_major_formatter().set_scientific(False)
plt.show()
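Both slices can be sketched on a small made-up Series. The seven countries A to G and their totals are hypothetical. The pandas methods nlargest and nsmallest are shown as an alternative that does not require sorting first.

```python
import pandas as pd

# Made-up totals for seven hypothetical countries A to G, sorted ascending.
totals = pd.Series(
    {"A": 7, "B": 3, "C": 9, "D": 1, "E": 5, "F": 8, "G": 2}
).sort_values()

top_five = totals[-5:]    # last five entries of the ascending Series
bottom_five = totals[:5]  # first five entries

print(top_five.index.tolist())
print(bottom_five.index.tolist())

# nlargest picks the same five entries but returns them largest first.
print(totals.nlargest(5).index.tolist())
```

Slicing keeps the ascending order, which is usually what we want for a bar chart that rises from left to right.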
Export the Data Frame to a New .CSV file
Let’s say we want to export this new df_totals Series, which is derived
from the main data frame, to an external .csv file:
df.to_csv
We can save it to a new CSV file using .to_csv as follows:
df_totals.to_csv('new_df.csv')
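To confirm that the export worked, the file can be read straight back in. The sketch below uses made-up totals and writes to a temporary folder so that it leaves nothing behind. The column label "Total" is a hypothetical name added so that the saved column has a readable header.

```python
import os
import tempfile
import pandas as pd

# Made-up totals standing in for df_totals.
df_totals = pd.Series({"Austria": 105, "Belgium": 145, "Australia": 210})

# Write to a temporary folder so the sketch leaves no file behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "new_df.csv")
    df_totals.rename("Total").to_csv(path)     # "Total" is a hypothetical label
    restored = pd.read_csv(path, index_col=0)  # first column holds the countries

print(restored)
```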
Save all of the changes and close this Jupyter Notebook.