ICT2103 Full Book-Part-3
Now, if you go into the Example1 folder, you will see this screen:
Provide a name for your Python code.
Make sure to select Python 3.6 from this drop-down menu.
Getting familiar with Azure Notebook
Jupyter Notebook is one of the industry standards for performing data analytics
in Python. The interface is divided into cells. You can execute one cell at a time or
run all cells at once. Dividing code into cells lets you organize a program into
independent components. For example, you can have a cell that loads data from
an external file. You can run this cell once and then use the other cells to perform
the required analysis. Therefore, there is no need to rerun every cell each time
you change and execute part of your code.
Code cells
Add a new cell
Example 1: Accessing Excel files and getting familiar with the data set
In this example we will use an open source data set from the Dubai Knowledge
and Human Development Authority (KHDA). The data can be downloaded
from the KHDA website. We will use the following data set from KHDA:
Our focus will be on the sheet “Census 2015-2016”. This sheet contains
student enrollment and graduation from private higher education
institutions in Dubai for both undergraduate and postgraduate programs.
Tasks:
In this example, we will perform the following tasks:
You need to download the data from the KHDA website and
upload it to the Data folder.
The filename with the full extension.
This line imports pandas and makes it available under the alias pd.
The above line creates a DataFrame and stores it in the object data1.
To display the first 5 rows of the DataFrame you can use the command
head().
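The following is a minimal sketch of how head() behaves. In the notebook the DataFrame would come from reading the uploaded Excel file (for example with pd.read_excel); here a small made-up DataFrame stands in for it so the code runs on its own. The column names and values are hypothetical and are not taken from the KHDA data.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame read from the Excel file.
data1 = pd.DataFrame({
    "Institution": ["A", "B", "C", "D", "E", "F", "G"],
    "Enrolled": [120, 85, 40, 230, 15, 60, 98],
})

first_rows = data1.head()     # first 5 rows by default
first_three = data1.head(3)   # pass a number to change how many rows

print(first_rows)
```

Calling head() without an argument is usually enough to check that the file was read as expected.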
pandas DataFrame
Index
Rows
Index: The index is column zero in the DataFrame. The
default index is the sequence of integers from 0 to the
length of the DataFrame minus 1. The index is used to
speed up searching large data. Later we will learn how to
change the index of a DataFrame.
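As a small illustration of the default index, the sketch below builds a four-row DataFrame (the column name and values are made up) and inspects its index.

```python
import pandas as pd

# Hypothetical four-row DataFrame; no index is given, so pandas creates one.
data = pd.DataFrame({"Question1": [20, 18, 25, 22]})

print(data.index)                 # RangeIndex(start=0, stop=4, step=1)
index_values = list(data.index)   # [0, 1, 2, 3]
```

With four rows, the default index runs from 0 to 3.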
Skipping unnecessary rows and columns
We have seen from the previous page that our columns start from row 4.
Pandas allows you to skip unwanted rows when reading the data. Here is how
to do it.
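One possible way to do this is the skiprows argument, with usecols to keep only selected columns. The sketch below is an assumption-laden illustration: it uses pd.read_csv on an in-memory string so that it runs without the KHDA file, and the title lines and column names are made up. pd.read_excel accepts the same skiprows and usecols arguments.

```python
import io
import pandas as pd

# Hypothetical file content: three title lines, then the column names on row 4.
text = (
    "Report title\n"
    "Source: illustration only\n"
    "Academic year 2015-2016\n"
    "ID,Question1,Question2\n"
    "1,20,18\n"
    "2,25,22\n"
)

# Skip the first 3 lines so that row 4 becomes the header row.
data = pd.read_csv(io.StringIO(text), skiprows=3)

# Skip the same lines and keep only the columns of interest.
fewer_columns = pd.read_csv(io.StringIO(text), skiprows=3,
                            usecols=["ID", "Question1"])

print(data)
```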
Display data in one or more columns
If you want to focus your analysis on one or a few columns, you can do so by
selecting only those desired columns from the DataFrame.
data['ColumnName']
data: This is your DataFrame. It can be any name that you chose to store
your data.
'ColumnName': The column name as it appears in the DataFrame.
Example:
Display data stored in column "Question1" only.
data["Question1"]
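A short sketch of selecting one column and several columns follows. The DataFrame is a made-up stand-in for the marks data; the "ID" column name is an assumption.

```python
import pandas as pd

# Hypothetical marks data.
data = pd.DataFrame({
    "ID": [101, 102, 103],
    "Question1": [20, 18, 25],
    "Question2": [22, 19, 24],
})

one_column = data["Question1"]                   # a single column (Series)
two_columns = data[["Question1", "Question2"]]   # a list of names gives a DataFrame

print(one_column)
print(two_columns)
```

Note the double square brackets when selecting more than one column: the inner pair is a Python list of column names.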
Displaying unique values in a column
Example:
The file Superstore.xlsx contains one large sales file. You are
required to read this file and show the unique values stored in the
field “Ship Mode”.
Here is the line of code to display the unique values of “Ship Mode”.
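A minimal sketch using the unique() method is shown below. The rows are made up so the code runs without Superstore.xlsx; the values in the real file may differ.

```python
import pandas as pd

# Hypothetical rows standing in for the Superstore data.
data = pd.DataFrame({
    "Ship Mode": ["Second Class", "Standard Class", "Second Class",
                  "First Class", "Standard Class"],
})

# unique() returns each distinct value once, in order of first appearance.
modes = data["Ship Mode"].unique()
print(modes)
```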
Describing the data of a column
data['ColumnName'].describe()
Example:
data['Question1'].describe()
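The sketch below shows what describe() returns for a numeric column. The four marks are made up for illustration.

```python
import pandas as pd

# Hypothetical marks for Question1.
data = pd.DataFrame({"Question1": [10, 20, 30, 40]})

# For a numeric column, describe() reports count, mean, std, min,
# the 25%, 50% and 75% percentiles, and max.
summary = data["Question1"].describe()
print(summary)

# Individual statistics can be read by name.
average = summary["mean"]
```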
Slice data using index
Slicing is a technique for creating smaller subsets of a large data set.
In this section we will use the index (the first column, which starts
with the value of zero) to slice our data.
Example:
Create a sub dataset from your
DataFrame that starts from index row
5 to index row 9.
Our DataFrame in this example is
stored in the variable data.
DataFrame[Start:End+1]
data[5:10]
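The example can be tried on a small made-up DataFrame, as in the sketch below (the column name and values are hypothetical). It shows that the end position is not included, which is why the slice is written as End+1.

```python
import pandas as pd

# Hypothetical DataFrame with 12 rows, so the default index runs from 0 to 11.
data = pd.DataFrame({"Question1": range(100, 112)})

subset = data[5:10]   # rows 5, 6, 7, 8 and 9; row 10 is excluded
print(subset)
```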
Exercises:
Slice data using conditions
You can also slice data using conditions based on the value of one or
more columns.
Example:
Find the students who scored less
than 60 in the total.
DataFrame[Condition]
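A minimal sketch of the example follows, assuming the marks are stored in columns named "ID" and "Total" (both names and all values are hypothetical).

```python
import pandas as pd

# Hypothetical totals for four students.
data = pd.DataFrame({
    "ID": [101, 102, 103, 104],
    "Total": [55, 72, 48, 90],
})

# The condition produces one True/False value per row;
# only the rows marked True are kept.
low_scores = data[data["Total"] < 60]
print(low_scores)
```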
Exercises:
Slice data using more than one condition
To slice data using more than one condition, you need to put each
condition inside parentheses: (). You also need to use a logical
operator. The symbol & is used for and. The symbol | is used for or.
Example:
Find students with a total >= 80 and <= 90.
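The sketch below works through the example on made-up data, again assuming columns named "ID" and "Total". It also shows the | operator for comparison.

```python
import pandas as pd

# Hypothetical totals for five students.
data = pd.DataFrame({
    "ID": [101, 102, 103, 104, 105],
    "Total": [79, 80, 85, 90, 95],
})

# Each condition sits in its own parentheses; & means "and".
in_range = data[(data["Total"] >= 80) & (data["Total"] <= 90)]

# | means "or": totals below 80 or above 90.
out_of_range = data[(data["Total"] < 80) | (data["Total"] > 90)]

print(in_range)
print(out_of_range)
```

The parentheses are required because & and | bind more tightly than comparison operators such as >= in Python.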
Exercises:
1. Display the IDs of students who scored more than 22 in
Question1 and more than 23 in Question2.
2. Display the IDs of students who scored more than 20 in all
questions.
3. Display the IDs of students who scored more than or equal to
15 in Question4 and less than 85 in total.
4. Display the IDs of students who scored more than 22 in any
of the four questions.
Working with loc in pandas
The pandas loc indexer allows one to search and slice data based on both
index labels and column names. It is a powerful tool that lets us focus on
the important rows and columns for our data analytics.
DataFrame name: The name of your DataFrame object. In our example this is data2.
Brackets: Please note the use of square brackets. Curved brackets will not work.
First column: Here you specify the first column name. Please note that you must use column names and not numbers.
Colon: The colon separates the start and end of the columns. It is a 'must-have'.
Last column: Here you specify the last column name. Please note that you should use column names and not numbers.
Example:
data.loc[5:10,"Question1":"Question2"]
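The example can be run on a made-up DataFrame, as sketched below (all values are hypothetical). One point worth noticing is that loc includes both ends of a slice, unlike data[5:10], which excludes the end position.

```python
import pandas as pd

# Hypothetical DataFrame with 12 rows and three question columns.
data = pd.DataFrame({
    "Question1": range(0, 12),
    "Question2": range(12, 24),
    "Question3": range(24, 36),
})

# Rows labelled 5 to 10 (both included), columns Question1 to Question2.
block = data.loc[5:10, "Question1":"Question2"]
print(block)
```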
You can display rows and columns that are not in sequence by listing them.
For example, you can display Question1 and Question4 only.
Example:
data.loc[[3,8,20],["Question1","Question4"]]
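A runnable sketch of the same call on made-up data is given below. The DataFrame needs at least 21 rows so that index label 20 exists; all values are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with 25 rows and four question columns.
data = pd.DataFrame({
    "Question1": range(0, 25),
    "Question2": range(25, 50),
    "Question3": range(50, 75),
    "Question4": range(75, 100),
})

# Lists pick out individual rows and columns instead of a continuous range.
picked = data.loc[[3, 8, 20], ["Question1", "Question4"]]
print(picked)
```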
Exercises: