SIM - Chapters - DA T4
LEARNING OUTCOMES
At the end of this topic, students should be able to:
• Utilize tools/software for analytics programming
• Explain variable, string, comments and programming structures
• Implement data management/manipulation codes using programming language
• Implement basic plots/visualization codes using programming language
INTRODUCTION
How do you implement data analytics solutions? When you have an analytics problem and want to
resolve it using machine learning, for example, you need to code the solution in a programming
language. To do so, you have to know the tools/software, which are introduced in this topic.
You also have to know the concept of a variable and basic programming structures such as string,
array, repetition, condition and function. In this topic you will also learn how to write code that
manages data and visualizes it using plots/graphs.
Google Colab allows you to write Python code online using a web browser as the interface. It does not
require any download or installation, which makes it convenient to start coding in Python. The code
that you write is stored in a format known as a notebook, and each notebook file is saved with the .ipynb
extension. More information about Google Colab can be retrieved from the link below:
To start using Google Colab, log in to your Google account at https://colab.research.google.com. The
interface shown in Figure 1 will appear in your browser. There are five tabs available, which
are described as follows:
• Examples – contains some notebooks of Python programming examples.
• Recent – contains the recent notebooks that you have worked with.
• Google Drive – contains all notebooks in your Google Drive.
• GitHub – this tab allows you to load a notebook from GitHub.
• Upload – this tab allows you to load a notebook from a local directory.
For first-time use, or if you do not have any notebook or want to start a new one,
click the link “New notebook” at the bottom right of the window in Figure 1.
The notebook is shown in Figure 2. The component marked as “1” in the notebook is the cell, which is
where you will write your Python code. It is advisable to name your notebook before you start
working on the code. To name your notebook, click on the part marked as “2” and give it a
descriptive name. As a start, name the notebook myFirstNb.ipynb.
Type the following code in the cell:
2 + 3
To run the code, either click the “Run” button or press Ctrl + Enter. You should see the output 5
displayed in the output segment, which is the segment underneath the cell. Figure 3 shows these
components.
Figure 3: Google Colab interface for writing Python codes (labelled parts: Python code, Output, Run button)
You may write as many lines of code as you like in a single cell. However, it is useful to write code
with different objectives/purposes in separate cells. To create a new cell, click the “+ Code” button at
the top left of the Google Colab notebook, as shown in Figure 2. A new cell will be created
underneath the previous one.
You will get the text displayed in the output cell. This is an important point to note: the output
can be in numerical or text-based format.
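The statement that produces the text output mentioned above is not reproduced here. As a minimal sketch, assuming a made-up message string, the difference between a numerical and a text-based output could be shown like this:

```python
# A numerical result: the expression is evaluated and the number is printed
total = 2 + 3
print(total)

# A text-based result: the characters between the quotes are printed as they are
# (the message itself is a hypothetical example)
message = "Hello, Google Colab"
print(message)
```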
SELF-LEARNING ACTIVITY
Differentiate between the output displayed by a literal mathematical expression and the output
displayed by the print function.
a = 10
b = 15
c = a + b
d = a * b
The codes above show another important behaviour of the Python language, whereby a value can be stored
in a placeholder known as a variable. A variable holds a single value. In this case, a, b, c and d are
variables, each holding a single value.
Note also that a variable can hold a literal value, as shown in the first and second lines (a and b),
or a value produced by other operations, as shown in the third and fourth lines (c and d). To display
the value of a variable, you can either type its name or use print (variable name). The values of the
above variables can be printed by typing and running the following:
print (b)
print (c)
These will print the values of b and c. Note that typing only the variable names on separate lines and
running them together in one cell will display only the value of the last variable:
a
b
c
Another important point to be aware of is the naming of variables. Python sets some rules for naming
variables, as follows:
• Must start with a letter or the underscore character
• Cannot start with a number
• Can only contain alpha-numeric characters and underscores (A-Z, a-z, 0-9, and _ )
• Case-sensitive (name, Name and NAME are three different variables)
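The rules above can be checked in code. The sketch below uses the string method isidentifier() together with the standard keyword module; the candidate names are hypothetical and chosen only to illustrate the rules.

```python
import keyword

# Hypothetical candidate names, chosen only to illustrate the rules
candidates = ["total_marks", "_count", "2nd_value", "my-var", "Name", "name"]

results = {}
for candidate in candidates:
    # isidentifier() checks the character rules; reserved words such as
    # "for" or "if" pass that check, so they are excluded separately
    results[candidate] = candidate.isidentifier() and not keyword.iskeyword(candidate)
    print(candidate, "->", results[candidate])
```

Names that start with a digit or contain a hyphen are reported as invalid, while Name and name are both valid and remain two different variables.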
SELF-LEARNING ACTIVITY
Give three examples each for valid and invalid variable names.
4.4 STRING
As in other programming languages, Python allows a variable to hold text-based values comprising
a string of characters. This type of value is known as a string. Type and run the following codes:
str1 = "football"
str2 = "rugby"
str3, str4, str5 = "cycling", "judo", "table tennis"
print ("I like to watch " + str4)
print (str1 + " and " + str5 + " are among the games in the Summer Olympics")
Lines 1 – 3 show how strings are defined, namely by assigning them to variables. Meanwhile, lines 4 and 5
show how strings are concatenated. String concatenation can be done on literal strings, on
variables (that hold string values), or on a combination of both.
As mentioned above, a string is a sequence of characters, which allows individual characters to be
accessed by position, as follows:
print (str2[0])
print (str2[3])
The two lines above print the characters of str2 that are located at positions 1 and 4. The number in
the square brackets represents the position of the character in str2. This position number is normally
called an index. Note that the index starts at 0 (instead of 1) and ends at n-1, where n is the
number of characters in the string.
Furthermore, Python provides functions and methods that return specific information/values
about strings. For example:
print (len(str3))
will print the number of characters contained in str3. This is done by the built-in len () function.
Another example is the split () method, which splits a string according to a specified separator. The
codes below split str5 by a space separator, so that the first and second parts of the result are
“table” and “tennis”.
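The split example itself is not reproduced here. A minimal sketch, assuming the result is stored in a hypothetical variable named parts, might look like this:

```python
str5 = "table tennis"

# split(" ") breaks the string at every space and returns a list of parts
parts = str5.split(" ")
print(parts[0])
print(parts[1])
```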
SELF-LEARNING ACTIVITY
There are many more functions that can be used to manipulate strings. Find three of them
and implement them in Python to show how they work.
4.5 COMMENTS
Python allows you to provide comments in the programming code. A comment is text that is not
executed, and it is created by starting the line with #. Comments are normally
used to describe codes. For example:
Line 1 will not be executed; instead, it is only used to explain what the codes that follow will do.
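The example referred to above is not reproduced here. A minimal sketch with a comment on its first line, using made-up variable names and values, could be:

```python
# Calculate the area of a rectangle (this line is a comment, so Python skips it)
width = 4
height = 5
print(width * height)  # a comment may also follow code on the same line
```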
4.6 OPERATORS
In Python, the most commonly used operators are mathematical, comparison and logical operators.
• and – If both operands (left and right) are true, then the condition becomes true
• or – If either of the two operands (left or right) is true, then the condition becomes true
• not – Returns the reverse logical value of the operand
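The mathematical operators mentioned at the start of this section can be tried out as follows. This is only a sketch of the standard Python arithmetic operators; the values of a and b are arbitrary.

```python
a = 17
b = 5

print(a + b)   # addition
print(a - b)   # subtraction
print(a * b)   # multiplication
print(a / b)   # division, always gives a decimal (float) result
print(a // b)  # floor division, drops the fractional part
print(a % b)   # modulus, the remainder of the division
print(a ** 2)  # exponentiation, a to the power of 2
```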
SELF-LEARNING ACTIVITY
Run the following codes containing the comparison and logical operators, and see the results:
x = 10
y = 11
print (x == y)
print (x != y)
print (x > y)
print (x < y)
print (x <= y)
print (x >= y)
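The lines above exercise only the comparison operators. As a sketch of how the logical operators combine such comparisons, using the same x and y and hypothetical result names:

```python
x = 10
y = 11

both = (x < y) and (y > 10)    # True and True
either = (x > y) or (x == 10)  # False or True
negated = not (x == y)         # not False

print(both)
print(either)
print(negated)
```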
4.7.1 Decision
Decision is a process of checking for condition, and determining actions according to the condition.
Type and run the following codes:
Single condition:
cr = 1.5
if (cr == 1.5):
    print ("Warning!")
Note: You may change the cr value to other number and see what happens.
Two conditions:
cr = 0.8
if (cr == 1.5):
    print ("Warning!")
else:
    print("Normal")
Multiple conditions:
cr = 1.5
if (cr >= 1.5):
    print ("Critical")
elif (cr >= 1.0 and cr < 1.5):
    print("Warning")
else:
    print("Normal")
Note: Change cr to 1.1 and 0.7, and observe the output
In some cases, several single-condition if statements need to be nested to evaluate separate
conditions that together determine the result. Type and run the example below:
spe = 9505
pre = 13000
tem = 165
if (spe >= 9500):
    if (pre >= 12800):
        if (tem >= 150):
            print ("Equipment FAIL")
The codes above comprise three if statements, and the text is printed only if all of them evaluate
to True. Alternatively, the codes above can be written using the logical operators as shown
below:
In the codes above, if the accuracy value is equal to 90, or the precision value is greater than or
equal to 70, “Good” will be printed. Otherwise, “Repeat” will be printed.
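Neither of the two listings described above is reproduced here. The following sketch shows one way they might look: the equipment check rewritten with and, followed by the accuracy/precision check with or. The names accuracy, precision, status and grade, and the values assigned to them, are assumptions made for illustration.

```python
# Equipment check rewritten with "and": all three conditions must be True
spe = 9505
pre = 13000
tem = 165

status = "OK"
if spe >= 9500 and pre >= 12800 and tem >= 150:
    status = "Equipment FAIL"
print(status)

# Accuracy/precision check with "or": one True condition is enough.
# The variable names and values below are assumed for illustration.
accuracy = 85
precision = 72

if accuracy == 90 or precision >= 70:
    grade = "Good"
else:
    grade = "Repeat"
print(grade)
```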
4.7.2 Repetition
Repetition, also known as looping, is a statement that executes a block of codes several times.
Repetition also performs its duty based on condition(s). Type and run the following
examples:
Example 1:
for x in range(10):
    print(x)
Example 2:
for x in range(10):
    print(x, end=' ')
Example 3:
for num1 in range(3):
    for num2 in range(10, 14):
        print(num1, ",", num2)
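The examples above repeat over a fixed range of numbers. Repetition that depends directly on a condition can be sketched with a while loop, which is not shown in the examples above; the variable names and the limit of 5 are arbitrary.

```python
count = 0
total = 0

# The block repeats for as long as the condition count < 5 stays True
while count < 5:
    total = total + count
    count = count + 1  # without this update the condition would never become False

print(count, total)
```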
4.7.3 Functions
A function is a block of codes that is executed when it is called using its name. So far, we have
seen the print function, which displays the values we supply in the parentheses (these are called
arguments). The print function and many others are predefined functions provided by the tools/library.
Other than predefined functions, we may also create functions, and these are known as user-defined
functions.
def ex_function():
    print("Hello from a function")
ex_function() #function call
The simple codes above define a function that prints the text “Hello from a function”. You can see that
the text is only printed when the function is called by its name, in this case ex_function. The code
ex_function() mimics the way the print function is called. That is how a function is executed/triggered
for actions.
def bmi_score(w,h):
    return w/(h*h)
print(bmi_score(90,1.75))
print(bmi_score(51,1.53))
print(bmi_score(45,1.51))
print(bmi_score(89,1.65))
The codes above show another example of function definition and call. In this case, the same function
is called four times. This is one of the main purposes of having functions in code, i.e.
code reusability.
To understand arrays, you first need to understand that the variables you have seen thus far are
normal variables. A normal variable can only store a single value, e.g.:
num1 = 3
num1 = 3 * 12
print(num1)
The above codes show that when the same variable, num1, is assigned a value twice, only
the latest one is printed. This means that the latest value assigned overwrites
the previous value, because a variable can only store a single value.
Imagine that you are dealing with 100 values of marks or 500 values of salary; you would need 100 and
500 separate variables, respectively. This is cumbersome. Fortunately, the concept of an array allows
a single variable to hold multiple values of the same type. To utilize arrays, the numpy library is
used as follows:
print(arr[0])
print(arr[4])
print(arr[2])
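The lines above assume that an array named arr already exists; its definition is not reproduced here. A minimal sketch, assuming five made-up values (only the length of five is implied by the text), could be:

```python
import numpy

# Five made-up values; only the length (five elements) is implied by the text
arr = numpy.array([10, 20, 30, 40, 50])

print(arr[0])  # first element
print(arr[4])  # fifth (last) element
print(arr[2])  # third element
```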
You have to be careful that the index value does not exceed its boundary. An index that exceeds its
boundary is one of the common errors made with arrays. This is mainly because
the last element’s index is the number of elements – 1 (as the index begins with 0). Type and run the
following code and see what happens:
print(arr[5])
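For a five-element array, index 5 is out of bounds and numpy raises an IndexError. The sketch below, which assumes a hypothetical five-element array, shows how the last valid index can be computed and how the error can be caught instead of stopping the program.

```python
import numpy

# A hypothetical five-element array, used only for illustration
arr = numpy.array([10, 20, 30, 40, 50])

# The last valid index is the number of elements minus 1
last_index = len(arr) - 1
print(arr[last_index])

out_of_range = False
try:
    print(arr[5])  # index 5 does not exist in a five-element array
except IndexError as error:
    out_of_range = True
    print("Index error:", error)
```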
Another important behaviour to note about an array is that its elements must be of the same
data type. From the following codes, you should be able to draw a conclusion about the matter:
a_list = numpy.array([1,25,"Three"])
print(a_list[0]+a_list[1])
You might expect the above codes to print 26, as the result of adding the first and second elements
of the a_list array. However, numpy stores all the elements of the a_list array as strings because
one of them (the third element) is a string. Consequently, when the addition is carried out, the +
operator is treated as the string concatenation operator instead of the addition operator, and 125
is printed.
The codes should be written as follows for the printed output to be 26:
b_list = numpy.array([1,25,3])
print(b_list[0]+b_list[1])
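To see the conversion directly, the data type of each array can be inspected. This is a small sketch using the dtype attribute of numpy arrays; the names mixed_result and number_result are hypothetical.

```python
import numpy

a_list = numpy.array([1, 25, "Three"])
b_list = numpy.array([1, 25, 3])

# dtype.kind is "U" for text (Unicode) arrays and "i" for signed integers
print(a_list.dtype.kind, b_list.dtype.kind)

mixed_result = a_list[0] + a_list[1]   # string concatenation
number_result = b_list[0] + b_list[1]  # numeric addition
print(mixed_result, number_result)
```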
4.7.5 Dictionary
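A dictionary is a collection of key-value pairs that is changeable and indexed by key. Since Python 3.7,
a dictionary also keeps its items in the order in which they were inserted. Type and run:
cars = {
    "brand": "Proton",
    "model": "Preve",
    "year": 2015
}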
print(cars)
m = cars['model']
m = cars.get('year')
cars["color"] = "Blue"
print(cars)
To delete item:
del cars["model"]
print(cars)
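The two ways of reading a value shown above behave differently when a key is missing. The sketch below illustrates this with a hypothetical key named owner, and also loops over all items of the dictionary.

```python
cars = {"brand": "Proton", "model": "Preve", "year": 2015}

# get() returns None (or a chosen default) when the key is missing,
# whereas cars["owner"] would raise a KeyError
owner = cars.get("owner", "unknown")
print(owner)

# Loop over every key-value pair in the dictionary
for key, value in cars.items():
    print(key, ":", value)
```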
import pandas as pd
data = {'Name':['Carrol', 'Mike', 'John'],'Gender':['Female', 'Male', 'Male'],
'Height':[160,175,173], 'Weight':[49,89,77], 'Age':[35,36,41]}
df = pd.DataFrame(data)
print(df)
You should get a data frame containing three observations (rows) and five columns
printed out.
Similar to the concept of an array, you may also access specific values (data) in the data frame by
using index operators. Based on the data frame you created earlier, run each of the following codes
and draw a conclusion about what it does:
Example 1:
print(df['Height'])
print(df.loc[:,'Height'])
Example 2:
print(df.loc[:,['Name','Age']])
print(df[['Name','Age']])
Example 3:
print(df.loc[2])
print(df.loc[1:2])
print(df.loc[[1,2]])
Example 4:
print(df.loc[[0,1],['Name','Weight']])
Example 5:
print(df.iloc[:,2])
print(df.iloc[2])
print(df.iloc[2,4])
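In the examples above, loc and iloc can look alike because the default row labels are 0, 1, 2. The sketch below makes the difference visible by using the names as row labels; the name df2 and the choice of index are assumptions made for illustration.

```python
import pandas as pd

data = {'Name': ['Carrol', 'Mike', 'John'], 'Height': [160, 175, 173]}
# Using the names as row labels is an assumption made for this sketch
df2 = pd.DataFrame(data).set_index('Name')

by_label = df2.loc['Mike', 'Height']  # row label and column label
by_position = df2.iloc[1, 0]          # second row, first column
print(by_label, by_position)
```

Here loc selects by label while iloc selects by integer position, and both return the same height value.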
You will find that even though both lines display the same three values of height, their displays
are slightly different. h1 contains the height values in data frame form. In contrast, h2 contains
the literal height values in numerical form. To see the difference clearly, let’s apply the
mathematical function sum, which adds all three values of height, to h1 and h2:
print(sum(h1))
print(sum(h2))
You will find that with h1 the mathematical operation cannot be performed because the
three values of height are in data frame form. In contrast, the operation is carried out
successfully on h2 because all three height values are in numerical form.
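The definitions of h1 and h2 are not reproduced here. One plausible reconstruction, stated as an assumption, is that h1 was selected with double square brackets (which gives a data frame) and h2 with single square brackets (which gives a single column of numbers):

```python
import pandas as pd

data = {'Name': ['Carrol', 'Mike', 'John'], 'Height': [160, 175, 173]}
df = pd.DataFrame(data)

# Assumed definitions: double brackets give a data frame, single brackets a column
h1 = df[['Height']]
h2 = df['Height']
print(type(h1).__name__, type(h2).__name__)

print(sum(h2))  # adds the three height values

h1_failed = False
try:
    print(sum(h1))  # iterating over a data frame yields column names, not values
except TypeError as error:
    h1_failed = True
    print("Cannot sum h1:", error)
```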
import pandas as pd
import numpy as np
data = {'Name':['Ali', 'Abu', 'George', 'Mike', 'Chan', 'Sammy'],
'Marks':[70, 65,np.nan, 82, 78, 75]}
score = pd.DataFrame(data)
The above codes create a data frame containing two columns i.e. “Name” and “Marks” with six
observations/records (rows). To display the data frame, you may type and run:
print(score)
The table shows that the third observation has an empty cell marked as “NaN”. Now, using the
predefined function sum(), you are going to calculate the sum of all marks. The following codes will
carry out the task:
print(sum(score['Marks']))
You will find that the above code prints nan instead of a total, because of the NaN value in the
marks. NaN (“not a number”) marks a missing value, and any arithmetic that involves it, including
sum(), also results in NaN. To handle this issue, you need to treat the NaN value so that it does not
affect the calculation. One way is to set the NaN value to 0. Another way is to drop the NaN
value when it exists, using the special function known as dropna():
score2 = score.dropna()
print(sum(score2['Marks']))
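The first option mentioned above, setting the NaN value to 0, can be sketched with the pandas fillna() function. The sketch also shows that the pandas sum() method skips NaN values by default; the variable name filled is hypothetical.

```python
import numpy as np
import pandas as pd

score = pd.DataFrame({'Name': ['Ali', 'Abu', 'George', 'Mike', 'Chan', 'Sammy'],
                      'Marks': [70, 65, np.nan, 82, 78, 75]})

# Option 1: replace the missing mark with 0, then sum as usual
filled = score['Marks'].fillna(0)
print(sum(filled))

# Option 2: the pandas sum() method skips NaN values by default
print(score['Marks'].sum())

# Caution: the two options differ for averages, because a 0 counts as a mark
print(filled.mean(), score['Marks'].mean())
```

Replacing NaN with 0 is harmless for a total, but it lowers an average, so the choice of treatment depends on the calculation that follows.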
Assume that there is a dataset named ds1.csv in the working folder (working path). The following code
will import the data into the Python environment:
my = pd.read_csv("ds1.csv")
my
The correctness of the data import may be verified by using the head() function (this is normally used
when the data is huge and we only want to display the first few observations):
my.head()
Summary of data containing some statistical measurements can be retrieved by using describe()
function:
my.describe()
To save a data frame to a CSV file, the to_csv() function is used:
my.to_csv("ds1.csv")
For Excel files, the functions read_excel() and to_excel() are used for importing and saving Excel files,
respectively.
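The saving and importing steps can be tried without an existing dataset. The sketch below writes a small made-up data frame to a temporary folder and reads it back; the file name demo.csv and the data are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Name': ['Carrol', 'Mike', 'John'], 'Age': [35, 36, 41]})

# Write to a temporary folder so that no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# index=False leaves out the row numbers; otherwise they are saved as an extra column
df.to_csv(path, index=False)

loaded = pd.read_csv(path)
print(loaded.head())
```

Passing index=False is a design choice: without it, the row numbers are written to the file and come back as an additional unnamed column when the file is read again.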
The examples above are used when the dataset is located in the working directory (i.e. the directory
where you set the Jupyter Notebook to work). If the dataset is located outside of the working directory,
then the path of the location of the dataset has to be supplied in the import/save functions. For instance:
pd.read_csv("C:/Users/johnny/Py/ds1.csv")
Note: to check for the working directory, type and run the following codes:
import os
os.getcwd()
SELF-LEARNING ACTIVITY
Find a dataset known as Iris. Import the Iris dataset and display it as a table in Google Colab.
4.7.9 Visualization/Plotting
The commonly used library for Python plotting is matplotlib, whose sub-library pyplot
is used in the examples in this book. Type and run the following simple plot:
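The simple plotting listing is not reproduced here. A minimal sketch, assuming the same x and y values and axis labels that appear in the later example, and keeping the plot call on the fourth line, might be:

```python
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)          # fourth line: draws a line through the points
plt.xlabel('Numbers')
plt.ylabel('Doubles')
plt.show()
```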
In most cases, the properties of the plotting functions can be edited so that different
visualizations/plots are displayed. For instance, the fourth line in the codes above can be changed to:
plt.plot(x, y, 'go')
This will present the points as green dots:
When the above codes are amended with more points/lines, the plot changes accordingly.
For instance:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
a = [1, 2, 3, 4, 5]
b = [3, 5, 7, 9, 11]
plt.plot(x, y, 'go')
plt.plot(a, b, 'b*')
plt.xlabel('Numbers')
plt.ylabel('Doubles')
plt.show()
The following codes plot a bar chart for the data:
Employee = ['John','Mike','Brenda','Tony','Miranda']
Salary = [145000,92000,152000,79000,87000]
plt.bar(Employee, Salary)
plt.title('Employees Annual Salary')
plt.xlabel('Employee')
plt.ylabel('Annual Salary')
plt.show()
The following codes plot a pie chart for the data:
import numpy as np
medals = ['USA', 'Britain', 'China','Russia', 'Germany', 'Japan', 'France']
data = [46, 27, 26, 19, 17, 12, 10]
fig = plt.figure(figsize =(5, 7))
plt.pie(data, labels = medals)
plt.title('Gold Medals by Top 7 Countries in 2016 Olympics')
plt.show()
Thus far, the examples show how to plot graphs/charts using hard-coded data (data written in the code
directly). The plotted data may also be retrieved from the library, or the csv/Excel files as shown in the
earlier section.
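As a sketch of plotting data that is held in a data frame rather than in hard-coded lists, the columns of a data frame can be passed directly to the plotting functions. The small data frame below is made up for illustration; a data frame loaded with read_csv() could be used in the same way.

```python
import matplotlib.pyplot as plt
import pandas as pd

# A made-up data frame; a frame loaded from a csv/Excel file works the same way
df = pd.DataFrame({'Name': ['Carrol', 'Mike', 'John'], 'Height': [160, 175, 173]})

# Each column is passed to the plotting function like a list of values
bars = plt.bar(df['Name'], df['Height'])
plt.title('Height by Person')
plt.xlabel('Name')
plt.ylabel('Height')
plt.show()
```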
SELF-LEARNING ACTIVITY
Using the Iris dataset you imported earlier, come up with two visualizations to represent the data.
See the videos below for further explanation about the contents of this topic:
https://recordings.roc2.blindsidenetworks.com/utp/9058d062d33ebba8947bd34bbb62d4c14b1ab850-1620175923063/capture/
https://recordings.roc2.blindsidenetworks.com/utp/4df8fdc80b974c3a0cdcbee87fcfb1a859a94559-1620195618605/capture/
SUMMARY
In this topic, you have learned the important data analytics tools, i.e. the Python programming language
and Google Colab as the development environment. The fundamentals of Python, such as variable, string,
comments and programming structures, have been discussed; these are the basic constructs
of a Python-based program/solution. Furthermore, this topic also discussed the application of Python in
data science activities, namely data management/manipulation as well as basic plots/visualization. You
will apply all this knowledge and these skills in the development of predictive analytics solutions,
which will be covered in the next topic.
KEYWORDS
Python, Python structure, data analytics, data management with Python, visualization