MTE204 Data Python
1.0. Introduction to Python
1.1. Why Python
Python is simple and easy to learn, read, and write. It is Free/Libre and Open
Source Software (FLOSS), meaning that anyone can distribute copies freely, read its
source code, modify it, etc. It is a high-level and portable language: it is supported
on Linux, Windows, FreeBSD, Macintosh, Solaris, BeOS, OS/390, PlayStation, and
Windows CE platforms. Python supports procedure-oriented programming as well
as object-oriented programming. It can invoke C and C++ libraries, can be called
from C and C++ programs, and can integrate with Java and .NET components. Python
has been used by many well-known companies and organisations such as YouTube,
Google, Dropbox, Raspberry Pi, BitTorrent, NASA and Netflix.
Common application areas of Python include:
1. Web Scraping
2. Automation Testing
3. Web Development
4. Data Analysis (our focus in this course)
After installing Python, the next step is to start working with it. We will discuss
some important features of the language and then move on to data analysis with Python.
Data in terms of statistics and probability refers to facts and statistics collected together for
reference or analysis. The figure below shows what can generally be done with data.
Qualitative data deals with characteristics and descriptors that cannot be easily measured,
but can be observed subjectively.
o Nominal data are data with no inherent order or ranking, such as gender or race.
Quantitative data deals with numbers and things that can be measured objectively.
o Discrete data can hold only a finite (countable) number of possible values, e.g. the
number of students in a class.
o Continuous data can hold an infinite number of possible values, e.g. a person's
weight.
A variable can likewise be discrete or continuous:
o A discrete variable, also known as a categorical variable, holds values from a fixed set
of categories. For instance, an email can be classified as either an inbox message or a
spam message.
o A continuous variable can store an infinite number of possible values, e.g. a vehicle's
speed.
Hence, a variable is anything used to store a value, and the kind of data associated with
the variable determines whether it is discrete or continuous.
Variables can be either dependent or independent. A dependent variable is a variable whose
value depends on one or more independent variables.
1.3.1. Comments
Comments in Python are written using # or '''comment'''. Where there is only one line of
comment, the # is used; any text to the right of # is not executed by Python.
The '''comment''' form is used where multiple lines of comment are needed; strictly, it is
a string literal that Python evaluates and then discards.
Example of a multi-line comment:
'''
This
Code
Analysis
dataset
'''
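Both comment styles can be sketched in a few lines. The variable name dataset and its values are made up for illustration only.

```python
# A single-line comment: Python ignores everything to the right of '#'.
dataset = [3, 1, 2]  # a comment can also follow a statement on the same line

'''
This
code
analyses
the dataset
'''
# The triple-quoted text above is a string that is never used,
# so in practice it behaves like a multi-line comment.
total = sum(dataset)
print(total)  # 6
```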
Presented below is a demo of the Python interface.
We can see from Fig. 2 above that the print function is used to print the characters of the
string enclosed in double ("") or single ('') quotes inside the parentheses.
Figure 2 Printing multiple variables from a single-line input
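The following is a small sketch of printing, assuming that "single-line input" means assigning several variables in one statement. The values given to X, Y and Z are illustrative.

```python
# Print a string written with double quotes and one written with single quotes.
print("Hello, Python")
print('Hello, Python')

# Assign three variables on a single line, then print them together.
X, Y, Z = 1, 2.5, "three"
print(X, Y, Z)  # 1 2.5 three
```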
1.3.2. Identifier
This is a name used to identify a variable, function, class, module or other object. For
instance, in Fig 3 X, Y and Z are identifiers. An identifier starts with a letter (A to Z
or a to z) or an underscore (_), followed by zero or more letters, underscores and digits
(0 to 9). Note that Python is a case-sensitive programming language and does not allow
special characters such as % and $ within identifiers.
The following naming conventions apply:
1. Class names start with an uppercase letter. All other identifiers start with a
lowercase letter.
2. Starting an identifier with a single leading underscore indicates that the identifier
is private.
3. Starting an identifier with two leading underscores indicates that the identifier is
strongly private.
4. Ending an identifier with two trailing underscores indicates that the identifier is a
language-defined special name.
Figure 4 identifier
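The identifier rules and naming conventions can be tried out as follows. All names used here (student_score, StudentRecord and so on) are hypothetical; str.isidentifier() is a standard string method that reports whether a piece of text is a valid identifier.

```python
# Valid identifiers start with a letter or underscore,
# followed by letters, digits and underscores.
student_score = 70
_hidden = 1
Score2 = 80

# Python is case sensitive: 'total' and 'Total' are two different identifiers.
total = 10
Total = 20


class StudentRecord:          # class names start with an uppercase letter
    def __init__(self):       # language-defined special name
        self._grade = "B"     # one leading underscore: private by convention
        self.__pin = 1234     # two leading underscores: strongly private


print("student_score".isidentifier())  # True
print("2score".isidentifier())         # False, starts with a digit
print("score%".isidentifier())         # False, contains a special character
```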
A data type defines what kind of data an entry is; it can be a number (integer or
float), a Boolean, etc. The data is the entry on the right side of the equals sign
and the identifier is on the left. A data type can be mutable or immutable. The figure
below presents more details:
Figure 5 Standard Data Type
2. String
3. Tuples
Figure 6 Tuples
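As a rough illustration of the immutable types (numbers, strings and tuples), the sketch below uses made-up variable names and values. It shows that a tuple can be read by index or slice, but that attempting to change it raises a TypeError.

```python
count = 42          # integer
ratio = 3.14        # float
is_valid = True     # Boolean
name = "Python"     # string
point = (2, 5, 9)   # tuple

print(type(count), type(ratio), type(is_valid), type(name), type(point))

# Tuples support indexing and slicing.
print(point[0])    # 2
print(point[1:])   # (5, 9)

# Tuples are immutable: item assignment raises TypeError.
try:
    point[0] = 100
except TypeError as err:
    print("Cannot modify a tuple:", err)
```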
Mutable Data Type
1. List
Figure 7 List data type
2. Dictionaries
Figure 8 Dictionaries
You can have a dictionary within a dictionary, or any other combination of data types.
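A short sketch of lists and nested dictionaries follows. The student names and scores are invented for illustration.

```python
# A list is mutable: items can be changed, appended and removed.
scores = [55, 70, 65]
scores[0] = 60
scores.append(90)
scores.remove(70)
print(scores)  # [60, 65, 90]

# A dictionary maps keys to values and can hold other dictionaries or lists.
students = {
    "Ada": {"age": 20, "scores": [60, 65]},
    "Femi": {"age": 22, "scores": [90]},
}
students["Ada"]["age"] = 21             # update a nested value
students["Femi"]["scores"].append(75)   # modify a list stored in a dictionary
print(students["Femi"]["scores"])       # [90, 75]
```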
3. Sets
Figure 9 Sets
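Sets can be sketched in the same way. The values are arbitrary; the point is that duplicates are discarded and that the usual set operations are available.

```python
a = {1, 2, 3, 3, 2}   # duplicates are discarded, leaving {1, 2, 3}
b = {3, 4, 5}
a.add(6)              # sets are mutable

union = a | b          # every element that is in a or in b
intersection = a & b   # elements common to both
difference = a - b     # elements of a that are not in b
print(union, intersection, difference)
```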
1.3.4. Operators
1. Arithmetic Operators
2. Assignment Operator
3. Comparison Operator
4. Logic Operators
5. Bitwise Operator
6. Identity Operator
Figure 10 Identity operator
Figure 11 Membership
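The operator families listed in this section can be summarised in one sketch. The operands (7 and 2) and the list contents are arbitrary choices.

```python
x, y = 7, 2

# Arithmetic operators
print(x + y, x - y, x * y, x / y, x // y, x % y, x ** y)  # 9 5 14 3.5 3 1 49

# Assignment operators
total = x
total += y   # total is now 9

# Comparison operators
print(x > y, x == y, x != y)  # True False True

# Logical operators
print(x > 5 and y > 5, x > 5 or y > 5, not x > 5)  # False True False

# Bitwise operators (7 is 0b111, 2 is 0b010)
print(x & y, x | y, x ^ y, x << 1, x >> 1)  # 2 7 5 14 3

# Identity operators: 'is' compares objects, '==' compares values
first = [1, 2]
second = [1, 2]
alias = first
print(first is alias, first is second, first == second)  # True False True

# Membership operators
print(2 in first, 5 not in first)  # True True
```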
2.0. Data Analysis with Python
This is the process of inspecting, cleaning, transforming, and modeling data with the
goal of discovering useful information, suggesting conclusions and supporting
decision making.
Python provides various libraries for data analysis and manipulation (the NumPy and
Pandas libraries) and visualization (the Matplotlib library) (see Fig 14).
Figure 14 NumPy Features
2.1.1. Creating a NumPy Array
NumPy is a Python library; therefore, the first step is to import the NumPy
library. See Fig 17.
The linspace function in NumPy allows us to create a vector of evenly spaced values by
specifying the initial value, the final value and the number of points. See Fig. 18.
Figure 17 Linspace function
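A minimal linspace sketch follows. The end points (0 and 1) and the number of points (5) are arbitrary. Note that the third argument of np.linspace is the number of samples, not a step size.

```python
import numpy as np

# Five evenly spaced values from 0 to 1, with both end points included:
# 0, 0.25, 0.5, 0.75 and 1.
vector = np.linspace(0, 1, 5)
print(vector)
```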
Import the NumPy library as np (as in Fig 17) and enter the array code as seen in Fig.
19.
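A sketch of creating arrays with np.array follows; the element values are illustrative and are not taken from the figure.

```python
import numpy as np

one_d = np.array([1, 2, 3])                 # one-dimensional array
two_d = np.array([[1, 2, 3], [4, 5, 6]])    # two-dimensional array (2 rows, 3 columns)

print(one_d.ndim, one_d.shape)   # 1 (3,)
print(two_d.ndim, two_d.shape)   # 2 (2, 3)
```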
The arange function can also be used to create an array of values within a specified
range, which can then be reshaped into a multidimensional array. See Fig. 20.
Figure 19 Arange function
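For illustration, arange can be called with a start, a stop and a step. The stop value itself is excluded. The numbers below are arbitrary.

```python
import numpy as np

values = np.arange(0, 12, 2)         # start at 0, stop before 12, step of 2
grid = np.arange(12).reshape(3, 4)   # 0..11 arranged as 3 rows and 4 columns

print(values)   # [ 0  2  4  6  8 10]
print(grid.shape)
```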
We can also create an array of zeros by using the zeros function and specifying the
number of rows and columns. See Fig. 21.
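A minimal sketch with an arbitrary shape of 2 rows and 3 columns. The shape is passed to np.zeros as a tuple.

```python
import numpy as np

zeros = np.zeros((2, 3))   # 2 rows, 3 columns, filled with 0.0
print(zeros)
```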
Figure 21 Creating an Array from Existing Data
A linear array of any number of elements can be restructured as desired. As an example,
here we convert a linear array of 8 elements into a 2×2×2 3D array. We also
consider the case of the transpose of a matrix.
Figure 22 Restructuring a NumPy Array
The restructured array can be returned to its initial linear form by using the ravel function.
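The restructuring steps can be sketched as follows, assuming an 8-element array built with arange (the figure may use different values).

```python
import numpy as np

linear = np.arange(8)             # [0 1 2 3 4 5 6 7]
cube = linear.reshape(2, 2, 2)    # 2x2x2 three-dimensional array

matrix = linear.reshape(2, 4)     # 2 rows, 4 columns
transposed = matrix.T             # transpose: 4 rows, 2 columns

flat = cube.ravel()               # back to the original linear form
print(cube.shape, transposed.shape, flat)
```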
Figure 24 Indexing NumPy Array
To slice a NumPy array, a slice object is constructed by passing the start, stop
and step parameters to slice().
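A sketch of indexing and slicing follows. The arrays are illustrative; the colon notation start:stop:step is the common shorthand for an explicit slice object.

```python
import numpy as np

data = np.arange(10)       # [0 1 2 3 4 5 6 7 8 9]

s = slice(2, 8, 2)         # start 2, stop before 8, step 2
print(data[s])             # [2 4 6]
print(data[2:8:2])         # same result with the colon shorthand

table = np.arange(12).reshape(3, 4)
element = table[1, 2]      # indexing: row 1, column 2
block = table[:2, 1:3]     # slicing: rows 0-1, columns 1-2
print(element)             # 6
print(block)
```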
Figure 27 Other Slicing Methods
NumPy provides the option of importing data from files directly into an ndarray using
the loadtxt function. The savetxt function can be used to write data from an array
to a text file.
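A round trip through a text file might look like the sketch below. The file name data.txt is hypothetical, and a temporary directory is used only so that the example does not leave files behind.

```python
import os
import tempfile

import numpy as np

data = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data.txt")

np.savetxt(path, data)      # write the array to a text file
loaded = np.loadtxt(path)   # read the file back into an ndarray
print(loaded)
```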
Figure 29 Writing and Reading from a Text file
NumPy arrays can be written to a CSV file using the savetxt function with a comma
delimiter, and the CSV file can be read into a NumPy array using the genfromtxt function.
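The CSV variant differs only in the delimiter argument. As before, the file name data.csv and the temporary directory are assumptions made for the sake of a runnable sketch.

```python
import os
import tempfile

import numpy as np

data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Hypothetical file name inside a temporary directory.
csv_path = os.path.join(tempfile.mkdtemp(), "data.csv")

np.savetxt(csv_path, data, delimiter=",")          # write comma-separated values
loaded = np.genfromtxt(csv_path, delimiter=",")    # read them back
print(loaded)
```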
Figure 30 Writing and Reading from CSV file
Pandas is an open-source Python library which provides efficient, easy-to-use data structures
and data analysis tools. Pandas is built on NumPy, and the name Pandas is derived from
"Panel Data", an econometrics term for multidimensional data. The Pandas library is well
suited for several kinds of data.
Pandas provides three data structures: Series, DataFrames and Panels, all of which are built
on NumPy arrays (Panel has since been removed from recent Pandas releases). Note that all
data structures in Pandas are value-mutable.
2.2.1.1. Series
A Series is a one-dimensional array structure that stores homogeneous data, i.e., data of a
single type. All elements in a Series are value-mutable, but the Series is size-immutable.
Figure 31 Series with Python's default index column and with a user-created index column
We can see from the output console in Fig. 31 that Pandas automatically added the
default index for the data. However, we can set the index ourselves by using the dictionary
data structure as the input to the Series. See Fig 31.
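A sketch of both ways of building a Series follows. The values and the index labels a, b, c are illustrative.

```python
import pandas as pd

# Built from a list: Pandas adds the default index 0, 1, 2.
default_index = pd.Series([10, 20, 30])

# Built from a dictionary: the keys become the index.
own_index = pd.Series({"a": 10, "b": 20, "c": 30})

print(default_index)
print(own_index)

own_index["b"] = 25   # values can be changed in place (value-mutable)
```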
Elements within a Series can be accessed using slicing, as shown earlier
in Fig. 27 and 28; an example is provided in Fig. 32.
Figure 32 Accessing Data in Series Using Slicing
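Slicing a Series can be sketched as below. The marks and labels are made up. This sketch uses the iloc and loc indexers to make explicit whether a slice is by position or by label; the figure may use a different form.

```python
import pandas as pd

marks = pd.Series([50, 60, 70, 80, 90], index=["a", "b", "c", "d", "e"])

by_position = marks.iloc[1:4]     # positions 1, 2, 3 (stop is excluded)
by_label = marks.loc["b":"d"]     # labels b to d (stop label is included)
every_other = marks.iloc[::2]     # every second element

print(by_position)
print(by_label)
print(every_other)
```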
2.2.1.2. DataFrames
[Figure: DataFrame with Python's default column name '0' and DataFrame with a user-created column name]
Figure 35 Using your own Index in Data Frame
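A minimal sketch of creating DataFrames follows. The student names, subjects and scores are hypothetical and do not reproduce the data in the figures.

```python
import pandas as pd

# Built from a plain list: the single column gets the default name 0
# and the rows get the default index 0, 1, 2.
default_df = pd.DataFrame([70, 65, 90])

# Built from a dictionary: the keys become column names,
# and the index argument supplies our own row labels.
results = pd.DataFrame(
    {"Ada": [70, 60], "Femi": [90, 80]},
    index=["Maths", "Physics"],
)

print(default_df)
print(results)
```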
Figure 37 Adding a Column to a DataFrame
3. We can also use the pop function to view only the column of interest; pop returns that
column and removes it from the DataFrame. For instance, let us view the result of Femi only.
Figure 39 Using the pop function
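Adding a column and then popping one can be sketched as follows. The DataFrame contents are invented; only the column name Femi comes from the text.

```python
import pandas as pd

results = pd.DataFrame(
    {"Ada": [70, 60], "Femi": [90, 80]},
    index=["Maths", "Physics"],
)

# Add a new column by assigning a list with one value per row.
results["Bola"] = [65, 75]

# pop returns the requested column as a Series
# and removes it from the DataFrame.
femi = results.pop("Femi")

print(femi)
print(results)
```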
1. Rows can be selected in a DataFrame by passing the row label to the loc[]
indexer. Alternatively, we can pass the integer row position to the iloc[] indexer.
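Row selection can be sketched as follows; the row labels and values are again hypothetical.

```python
import pandas as pd

results = pd.DataFrame(
    {"Ada": [70, 60], "Femi": [90, 80]},
    index=["Maths", "Physics"],
)

row_by_label = results.loc["Physics"]   # select by row label
row_by_position = results.iloc[1]       # select by integer position (second row)

print(row_by_label)
print(row_by_position)
```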
2.2.2. Importing and Exporting data using Pandas
Data can be loaded into a DataFrame from input data stored in CSV format using
the read_csv() function.
Data present in a given DataFrame can be written to a CSV file using the to_csv()
function. If the specified file does not exist, a file of that name is automatically
created.
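A CSV round trip with Pandas might look like this sketch. The file name results.csv and the temporary directory are assumptions; index_col=0 tells read_csv to use the first column of the file as the row index.

```python
import os
import tempfile

import pandas as pd

results = pd.DataFrame(
    {"Ada": [70, 60], "Femi": [90, 80]},
    index=["Maths", "Physics"],
)

# Hypothetical file name inside a temporary directory.
csv_path = os.path.join(tempfile.mkdtemp(), "results.csv")

results.to_csv(csv_path)                      # the file is created by this call
loaded = pd.read_csv(csv_path, index_col=0)   # first column becomes the index
print(loaded)
```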
Similarly, we can write to and read from an Excel file using Pandas.
2.3. Introduction to Data Visualization and Matplotlib
Plotting in Python is done by importing the matplotlib.pyplot module as plt and
passing the data to be plotted as arguments to its functions.
Figure 45 Specifying the x and y values in Matplotlib
Labels can be added to the x and y axes using plt.xlabel('X axis') and plt.ylabel('Y axis').
We save the plot by using plt.savefig('file name').
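Plotting, labelling and saving can be put together as in the sketch below. The data, the file name line_plot.png and the temporary directory are illustrative. The Agg backend is selected only so that the sketch runs without a display; in an interactive session that line can be dropped and plt.show() used to view the plot.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.plot(x, y)          # specify the x and y values
plt.xlabel("X axis")
plt.ylabel("Y axis")

# Hypothetical file name inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "line_plot.png")
plt.savefig(out_path)
plt.close()
```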
There are several plot formats for visualizing information in Python such as
Histogram, Scatter Plot, Bar Graph and Pie Chart. We will show here how to
produce each of these plot formats in Python.
1. Histogram
2. Bar Graph
We create two arrays: the first array holds the midpoint of the face of every bar,
i.e., the position on the x axis where each bar should be centred. The second
array holds the heights of the successive bars.
Figure 47 Bar graph in Python
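The two-array approach can be sketched as follows, together with a variant that takes the labels and heights from a dictionary. All numbers and the fruit names are invented, and the Agg backend is an assumption made so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend
import matplotlib.pyplot as plt

positions = [1, 2, 3, 4]   # midpoint of each bar on the x axis
heights = [5, 9, 3, 7]     # height of each successive bar

bars = plt.bar(positions, heights)   # bars are centred on the positions by default
plt.xlabel("X axis")
plt.ylabel("Y axis")

# Variant: take the bar labels and heights from a dictionary.
plt.figure()
sales = {"Apples": 12, "Bananas": 7, "Cherries": 4}
dict_bars = plt.bar(list(sales.keys()), list(sales.values()))

plt.close("all")
```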
Figure 48 Plotting Bar graph from Dictionary
3. Pie Chart
A pie chart is used to compare multiple parts against the whole. For instance, let
us use a pie chart to visualize how a student spends her income.
Expenses Amount
Food 5,000
Transport 1,500
Credit card 2,000
Accessory 500
Figure 49 Pie chart in Python
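The expenses table above can be drawn as a pie chart along the following lines. The title text, the autopct format and the Agg backend are choices made for this sketch and may differ from the figure.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend
import matplotlib.pyplot as plt

labels = ["Food", "Transport", "Credit card", "Accessory"]
amounts = [5000, 1500, 2000, 500]

# Share of each expense in the total, in percent.
shares = [100 * amount / sum(amounts) for amount in amounts]

# autopct prints the percentage of the whole on each slice.
plt.pie(amounts, labels=labels, autopct="%1.1f%%")
plt.title("Student expenses")
plt.close()
```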
4. Scatter plot
A scatter plot displays the values of two sets of data, visualized as a collection of
points.
Figure 50 Scatter plot in Python
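A scatter plot needs two sequences of equal length, one for each axis. The study-hours and score data below are invented, and the Agg backend is an assumption made so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend
import matplotlib.pyplot as plt

hours = [1, 2, 3, 4, 5]          # first data set (x values)
scores = [52, 58, 65, 70, 78]    # second data set (y values)

points = plt.scatter(hours, scores)   # one point per (hours, scores) pair
plt.xlabel("Hours studied")
plt.ylabel("Score")
plt.close()
```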