Unit 7 - Python Libraries
1. NumPy
NumPy is short for “Numerical Python”. NumPy is a library for the Python
programming language, adding support for large, multi-dimensional arrays and
matrices, along with a large collection of high-level mathematical functions to operate
on these arrays.
NumPy also offers comprehensive mathematical functions, random number
generators, linear algebra routines, Fourier transforms, and more.
If we omit the start index, it defaults to 0; if we omit the end index, it defaults to
the length of the array in that dimension; and if we omit both indices, every
element in the array is selected. For example,
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[:5])
print(arr[2:])
print(arr[:])
We can also specify a step when slicing, using the form [start:end:step]. The default
value for step is 1. For example,
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[1:5:2])
We can also use the minus operator to refer to an index from the end. For example, to
slice from index 3 from the end to index 1 from the end (not included), we write
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[-3:-1])
We can also slice a two-dimensional array. For example, to slice elements from index 1
to index 4 (not included) from the second row of the array, we write
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[1, 1:4])
To return the element at index 2 from both rows, we write
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 2])
To slice index 1 to index 4 (not included) from both rows of the two-dimensional
array, we write
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr[0:2, 1:4])
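As a quick check on the slicing rules above, the following sketch reuses the same
two-row array and records what each two-dimensional slice returns. The variable
names row_slice, column, and block are illustrative only.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

row_slice = arr[1, 1:4]   # second row, columns 1 to 3: a 1-D result
column = arr[0:2, 2]      # column 2 taken from both rows: a 1-D result
block = arr[0:2, 1:4]     # columns 1 to 3 from both rows: a 2-D result

print(row_slice.tolist())  # [7, 8, 9]
print(column.tolist())     # [3, 8]
print(block.tolist())      # [[2, 3, 4], [7, 8, 9]]
print(block.shape)         # (2, 3)
```

Indexing a dimension with a single integer drops that dimension, while indexing it
with a range keeps it, which is why only the last slice is two-dimensional.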
b = np.load('outfile.npy')
print(b)
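The load() call above reads a binary .npy file. Such a file is normally written with
numpy.save(); the sketch below assumes that pairing and writes to a temporary
directory (the path and variable names are illustrative) so that the round trip can
be run anywhere.

```python
import os
import tempfile

import numpy as np

a = np.array([1, 2, 3, 4, 5])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "outfile.npy")
    np.save(path, a)   # write the array in NumPy's binary .npy format
    b = np.load(path)  # read it back as an ndarray

print(b.tolist())          # [1, 2, 3, 4, 5]
print(b.dtype == a.dtype)  # True: the .npy format keeps the data type
```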
The storage and retrieval of array data in simple text file format is done
with savetxt() and loadtxt() functions. For example,
a = np.array([1,2,3,4,5])
np.savetxt('out.txt',a)
b = np.loadtxt('out.txt')
print(b)
arr = np.arange(0,10,.5,float)
print(arr)
The numpy.linspace() function is similar to the arange() function. However, it
does not allow us to specify the step size. Instead, it returns a given number of
evenly spaced values over a specified interval, and the step size is calculated
implicitly. The syntax is given below:
numpy.linspace(start, stop, num, endpoint, retstep, dtype)
Here, start represents the starting value of the interval, stop represents the stopping
value of the interval, num is the number of evenly spaced samples to generate over
the interval (default is 50), endpoint (Boolean, default True) indicates whether the
stopping value is included in the interval, retstep (Boolean, default False) indicates
whether the step between consecutive samples is returned along with the samples,
and dtype represents the data type of the array items. For example,
arr = np.linspace(0, 10, 5, True, True, np.single)
print(arr)
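To make the roles of endpoint and retstep concrete, here is a minimal sketch. It
passes the options by keyword rather than by position, which is a stylistic choice
and not something the example above requires.

```python
import numpy as np

# With retstep=True, linspace returns a (samples, step) pair.
samples, step = np.linspace(0, 10, num=5, endpoint=True, retstep=True)
print(samples.tolist())  # [0.0, 2.5, 5.0, 7.5, 10.0]
print(step)              # 2.5

# With endpoint=False the stop value is left out, so the step shrinks.
open_samples, open_step = np.linspace(0, 10, num=5, endpoint=False, retstep=True)
print(open_samples.tolist())  # [0.0, 2.0, 4.0, 6.0, 8.0]
print(open_step)              # 2.0
```

Because retstep=True returns a pair, the variable arr in the example above holds both
the array and the step, not just the array.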
print(np.sort(arr, 0))
Example 3:
import numpy as np
dt = np.dtype([('name', 'S10'),('age', int)])
a = np.array([("raju",21),("anil",25),("ravi", 17), ("amar",27)], dtype = dt)
print (np.sort(a, order = 'name'))
The numpy.ptp() function ("peak to peak") returns the range of values, that is, the
maximum minus the minimum, along an axis. For example,
a = np.array([[2, 10, 20], [80, 43, 31], [22, 43, 10]])
print(np.ptp(a, 1))
print(np.ptp(a, 0))
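The sketch below, using the same array, confirms that the result of ptp() matches the
maximum minus the minimum computed by hand. The names row_range, col_range, and
manual are illustrative.

```python
import numpy as np

a = np.array([[2, 10, 20], [80, 43, 31], [22, 43, 10]])

row_range = np.ptp(a, axis=1)           # range of each row
col_range = np.ptp(a, axis=0)           # range of each column
manual = a.max(axis=1) - a.min(axis=1)  # the same row ranges, computed by hand

print(row_range.tolist())  # [18, 49, 33]
print(col_range.tolist())  # [78, 33, 21]
print(manual.tolist())     # [18, 49, 33]
```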
The numpy.mean() function is used to calculate the mean of items along an axis, and
the numpy.median() function is used to calculate the median of items along an axis.
For example,
a = np.array([[1,5,3],[14,5,6],[7,8,10]])
print(np.mean(a, 0))
print(np.mean(a, 1))
print(np.median(a, 0))
print(np.median(a, 1))
Similarly, we can use the numpy.std() function to calculate the standard deviation
along an axis and the numpy.var() function to calculate the variance along an axis.
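No example is given for std() and var(), so here is a minimal sketch on the array used
for the mean and median. It also shows the ddof argument, which is not discussed
above: by default both functions divide by n (population estimate), and ddof=1
switches to n - 1 (sample estimate).

```python
import numpy as np

a = np.array([[1, 5, 3], [14, 5, 6], [7, 8, 10]])

col_var = np.var(a, axis=0)  # population variance of each column
col_std = np.std(a, axis=0)  # population standard deviation of each column

# The standard deviation is the square root of the variance.
print(np.allclose(col_std, np.sqrt(col_var)))  # True

# Second column holds 5, 5, 8: mean 6, squared deviations 1, 1, 4.
sample_var = np.var(a, axis=0, ddof=1)
print(col_var[1], sample_var[1])  # 2.0 3.0
```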
2. Pandas
Pandas is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool, built on top of the Python programming language. It is one of the
most widely used Python libraries for data analysis. There are two core data
structures in pandas: series and data frame.
2.1. Series
A series is a one-dimensional labeled array capable of holding any data type (integers,
strings, floating point numbers, Python objects, etc.). By default, the values are
labeled with index numbers: the first value has index 0, the second value has index 1,
and so on. These labels can be used to access a specified value. For example,
import pandas as pd
a = [11, 7, 10]
myvar = pd.Series(a)
print(myvar)
print(myvar[1])
We can use the index argument to assign our own labels. These labels can also be used
to access values in the series. For example,
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
print(myvar["y"])
Series can also have a name attribute. For example,
a = [1, 7, 2]
var1 = pd.Series(a, index = ["x", "y", "z"], name = "something")
print(var1)
A series can also be created from a dictionary. In this case, the keys of the dictionary
become the labels.
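The example for the dictionary case is not shown here, so the following is a minimal
sketch under assumed data: the dictionary calories and its keys are made up for
illustration.

```python
import pandas as pd

# Hypothetical data: the dictionary keys become the index labels.
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)

print(myvar)
print(myvar["day2"])  # 380

# Passing index keeps only the listed keys, in the listed order.
subset = pd.Series(calories, index=["day1", "day3"])
print(subset.tolist())  # [420, 390]
```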
We can create a data frame from another data frame using the copy() method. For
example,
data = {
"name": ["Ram", "Shyam", "Hari"],
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
df1 = df[["name", "duration"]].copy()
print(df1)
Pandas uses the loc attribute to return one or more specified rows. For example,
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df.loc[0])
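The example above returns a single row. As a sketch of the "one or more" case, passing
a list of labels to loc returns several rows at once; the names one_row and two_rows
are illustrative.

```python
import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

one_row = df.loc[0]        # a single label returns a Series
two_rows = df.loc[[0, 1]]  # a list of labels returns a DataFrame

print(type(one_row).__name__)   # Series
print(type(two_rows).__name__)  # DataFrame
print(one_row["calories"])      # 420
print(two_rows.shape)           # (2, 2)
```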
You can also get, set, and delete columns from the data frame. For example,
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df["calories"]) # get column
df["age"] = [34, 67, 14] # set column
print(df)
del df["calories"] # delete column
print(df)
df = pd.DataFrame(data)
print(df)
print(df.shape)
print(df.size)
print(df.index)
print(df.columns)
To drop null values from a data frame, we use the dropna() function. This
function drops rows or columns of the data frame that contain null (NaN) values.
To drop rows with at least one NaN value, we write
df.dropna()
To drop rows in which all values are missing (NaN), we write
df.dropna(how = 'all')
To drop columns with at least one NaN value, we write
df.dropna(axis = 1)
To drop columns in which all values are missing (NaN), we write
df.dropna(axis = 1, how = "all")
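The four dropna() calls are easier to compare on a concrete data frame. The sketch
below uses a small made-up frame in which row 2 and column "C" are entirely missing;
the data and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 is all NaN and column "C" is all NaN.
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [5.0, np.nan, np.nan, 8.0],
    "C": [np.nan, np.nan, np.nan, np.nan],
})

print(df.dropna().shape)                   # (0, 3): every row has a NaN in "C"
print(df.dropna(how="all").shape)          # (3, 3): only row 2 is dropped
print(df.dropna(axis=1).shape)             # (4, 0): every column has a NaN
print(df.dropna(axis=1, how="all").shape)  # (4, 2): only column "C" is dropped

# Chaining: remove the empty column first, then the incomplete rows.
cleaned = df.dropna(axis=1, how="all").dropna()
print(cleaned.shape)                       # (2, 2)
```

Each call returns a new data frame and leaves df unchanged.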
Often we want to replace arbitrary values with other values. The replace()
method provides an efficient yet flexible way to perform such replacements. For
example,
df.replace(to_replace = 40, value = -99)
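The replace() call above can be tried on the calories and duration data used
elsewhere in this unit. The sketch also shows the dictionary form, which maps several
old values to new ones in a single call; the names replaced and mapped are
illustrative.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

# replace() returns a new data frame; df itself is unchanged.
replaced = df.replace(to_replace=40, value=-99)
print(replaced["duration"].tolist())  # [50, -99, 45]
print(df["duration"].tolist())        # [50, 40, 45]

# A dictionary maps each old value to its replacement.
mapped = df.replace({420: 400, 45: 50})
print(mapped["calories"].tolist())    # [400, 380, 390]
print(mapped["duration"].tolist())    # [50, 40, 50]
```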
To discover duplicates, we can use the duplicated() method. The duplicated() method
returns a Boolean value for each row. For example,
df.duplicated()
To remove duplicates, use the drop_duplicates() method. For example,
df.drop_duplicates(inplace = True)
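A minimal sketch of both methods on a made-up data frame in which the third row
repeats the first. The sketch uses the form of drop_duplicates() that returns a new
data frame instead of inplace = True, so that the original can still be inspected.

```python
import pandas as pd

# Hypothetical data: row 2 is an exact copy of row 0.
df = pd.DataFrame({
    "name": ["Ram", "Shyam", "Ram", "Hari"],
    "duration": [50, 40, 50, 45],
})

flags = df.duplicated()  # True for each row that repeats an earlier row
print(flags.tolist())    # [False, False, True, False]

unique_rows = df.drop_duplicates()
print(len(unique_rows))            # 3
print(unique_rows.index.tolist())  # [0, 1, 3]
```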
2.3.4. Indexing, Slicing, and Subsetting
We often want to work with subsets of a data frame. There are different ways to
accomplish this including: using labels (column headings), numeric ranges, or specific
x, y index locations. We use square brackets [] to select a subset of a data frame. For
example,
df[['Second_Score', 'Third_Score']]
df[0:2]
We can select specific ranges of our data in both the row and column directions using
either label or integer-based indexing.
• loc is primarily label based indexing. Integers may be used but they are
interpreted as a label.
• iloc is primarily integer-based indexing
To select a subset of rows and columns from the data frame, we can use both loc and
iloc. For example,
df.loc[0:2, 'First_Score':'Second_Score']
df.iloc[0:2, 0:1]
We can also select a specific data value using a row and column location within the
data frame using loc and iloc indexing. For example,
df.loc[0,'Second_Score']
df.iloc[2, 1]
We can also select a subset of our data using criteria. For example,
df[df.Second_Score >35]
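The data frame behind these selections is not shown, so the sketch below assumes a
small made-up frame with the same column names. It highlights one difference that is
easy to miss: a loc slice includes its end label, while an iloc slice excludes its end
position.

```python
import pandas as pd

# Hypothetical scores using the column names from the examples above.
df = pd.DataFrame({
    "First_Score": [100, 90, 95, 60],
    "Second_Score": [30, 45, 56, 20],
    "Third_Score": [52, 40, 80, 98],
})

by_label = df.loc[0:2, "First_Score":"Second_Score"]  # end label included
by_position = df.iloc[0:2, 0:1]                       # end position excluded

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 1)

print(df.loc[0, "Second_Score"])  # 30
print(df.iloc[2, 1])              # 56

high = df[df.Second_Score > 35]   # rows that satisfy the condition
print(high.index.tolist())        # [1, 2]
```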
Nawaraj Paudel Page 14
If the data frame has many rows, pandas will by default display only the first 5 rows
and the last 5 rows. To display the entire data frame, we use the to_string() method
of the data frame.
After reading csv file into a data frame, we can use all the attributes and methods of
the data frame to work with the data.
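The difference between the default display and to_string() can be seen on a
hypothetical 100-row data frame; the frame is built in memory here rather than read
from a file, which is an assumption made to keep the sketch self-contained.

```python
import pandas as pd

# Hypothetical data frame with 100 rows and a single column.
df = pd.DataFrame({"n": range(100)})

short = repr(df)       # the text that print(df) would show (shortened)
full = df.to_string()  # the complete table as one string

print(len(full.splitlines()))  # 101: one header line plus 100 rows
print(len(short.splitlines()) < len(full.splitlines()))  # True
```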
We can also write a data frame into a CSV file. To write a data frame to a CSV file,
we use the to_csv() function. For example,
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
df.to_csv("data.csv", index = False)
pd.read_csv("data.csv")
The pandas read_json() function reads a JavaScript Object Notation (JSON) file into a
data frame. For example,
df = pd.read_json('data.json')
print(df.to_string())
We can also write a data frame into a JSON file. To write a data frame to a JSON file,
we use the to_json() function. For example,
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
df.to_json("data.json")
pd.read_json("data.json")
3. Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. Most of the Matplotlib utilities lie under the pyplot module,
and are usually imported under the plt alias.
The plot() function is used to draw points (markers) in a diagram and, by default,
connects them with a line. The first parameter is an array containing the points on
the x-axis and the second parameter is an array containing the points on the y-axis.
For example, to draw a line in a diagram from position (1, 3) to position (8, 10),
we write
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 8])
ypoints = np.array([3, 10])
plt.plot(xpoints, ypoints)
plt.show()
To plot only the markers, we can use the shortcut string notation parameter 'o', which
draws a circle at each point without a connecting line. For example,
plt.plot(xpoints, ypoints, 'o')
We can plot as many points as we like, as long as we pass the same number of points
for the x-axis and the y-axis. For example,
xpoints = np.array([1, 2, 6, 8])
ypoints = np.array([3, 8, 1, 10])
plt.plot(xpoints, ypoints)
If we do not specify the points on the x-axis, they will get the default values 0, 1, 2, 3,
etc., depending on the length of the y-points. For example,
ypoints = np.array([3, 8, 1, 10, 5, 7])
plt.plot(ypoints)
3.1. Markers
We can use the marker argument to emphasize each point with a specified marker.
For example, to mark each point with a circle, we write
plt.plot(xpoints, ypoints, marker = 'o')
Some commonly used markers are:
'o'  Circle         '*'  Star         '.'  Point           ','  Pixel
'x'  X              'X'  X (filled)   '+'  Plus            'P'  Plus (filled)
's'  Square         'D'  Diamond      'd'  Diamond (thin)  'p'  Pentagon
'v'  Triangle Down  '^'  Triangle Up  '<'  Triangle Left   '>'  Triangle Right
We can also use the shortcut string notation parameter to specify the marker.
This parameter is also called fmt, and is written with the syntax marker|line|color.
For example,
plt.plot(xpoints, ypoints, 'o--r')
The marker value can be anything from the table above. The line value and the color
value can each be one of those given in the table below.
Line reference              Color reference
'-'   Solid line            'r'  Red
':'   Dotted line           'g'  Green
'--'  Dashed line           'b'  Blue
'-.'  Dashed/dotted line    'c'  Cyan
                            'm'  Magenta
                            'y'  Yellow
                            'k'  Black
                            'w'  White
We can use the argument markersize or the shorter version, ms to set the size of the
markers. For example,
plt.plot(xpoints, ypoints, 'o--r', ms = 20)
We can use the argument markeredgecolor or the shorter mec to set the color of
the edge of the markers. For example,
plt.plot(xpoints, ypoints, 'o--r', ms = 20, mec = 'b')
We can use the argument markerfacecolor or the shorter mfc to set the color inside
the edge of the markers. For example,
plt.plot(xpoints, ypoints, 'o--r', ms = 20, mec = 'b', mfc = 'y')
Remember: We can also use hexadecimal color values (such as '#4CAF50') and color
names (such as 'hotpink') to represent different colors.
3.2. Line
We can use the argument linestyle, or the shorter ls, to change the style of the
plotted line. The different line styles are:
'solid' or '-' (default)
'dotted' or ':'
'dashed' or '--'
'dashdot' or '-.'
'None' or ' '
We can use the argument color or the shorter c to set the color of the line. We can
also use the argument linewidth or the shorter lw to change the width of the line.
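The line arguments described above can be combined in one call. The following is a
minimal sketch with made-up points; it selects the non-interactive Agg backend so that
it runs without a display, which is an assumption of this sketch and not needed in an
interactive session.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10])

# The long and short argument names are interchangeable.
(line_a,) = plt.plot(ypoints, linestyle="dotted", color="r", linewidth=2.5)
(line_b,) = plt.plot(ypoints + 2, ls="--", c="#4CAF50", lw=1)

# Matplotlib stores the named style 'dotted' as its short form ':'.
print(line_a.get_linestyle(), line_a.get_linewidth())  # : 2.5
print(line_b.get_linestyle(), line_b.get_color())      # -- #4CAF50
plt.close()
```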
We can plot as many lines as we like by simply adding more plot() calls. For
example,
x1 = np.array([0, 1, 2, 3])
y1 = np.array([3, 8, 1, 10])
x2 = np.array([0, 1, 2, 3])
y2 = np.array([6, 2, 7, 11])
plt.plot(x1, y1)
plt.plot(x2, y2)
plt.show()
We can also plot many lines by adding the points for the x- and y-axis for each line in
the same plot() function. For example,
plt.plot(x1, y1, x2, y2)
3.3. Labels
We can use the xlabel() and ylabel() functions to set a label for the x- and y-axis, and
the title() function to set a title for the plot. We can use the fontdict parameter
in xlabel(), ylabel(), and title() to set font properties for the title and labels. For
example,
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
font1 = {'family':'serif','color':'blue','size':20}
font2 = {'family':'serif','color':'darkred','size':15}
plt.title("Sports Watch Data", fontdict = font1)
plt.xlabel("Average Pulse", fontdict = font2)
plt.ylabel("Calorie Burnage", fontdict = font2)
plt.plot(x, y)
plt.show()
The loc parameter in title() positions the title; its valid values are 'left', 'right',
and 'center'. The default value for loc is 'center'. For example,
plt.title("Sports Watch Data", fontdict = font1, loc = 'left')
We can also set line properties of the grid using plt.grid(color = 'color', linestyle =
'linestyle', linewidth = number).
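A minimal sketch of the grid() call with concrete values filled in for the color, line
style, and line width. The data points are made up, and the Agg backend is selected
only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.array([80, 85, 90, 95, 100])
y = np.array([240, 250, 260, 270, 280])

plt.plot(x, y)
plt.grid(color="green", linestyle="--", linewidth=0.5)

# Inspect one of the grid lines to confirm that the properties were applied.
first_line = plt.gca().xaxis.get_gridlines()[0]
print(first_line.get_visible())    # True
print(first_line.get_linestyle())  # --
print(first_line.get_linewidth())  # 0.5
plt.close()
```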
3.5. Subplot
We can use the subplot() function to draw multiple plots in one figure.
The subplot() function takes three arguments that describe the layout of the figure.
The layout is organized in rows and columns, which are represented by
the first and second arguments. The third argument represents the index of the
current plot. For example,
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(2, 1, 1)
plt.plot(x,y)
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(2, 1, 2)
plt.plot(x,y)
We can add a title to each plot with the title() function. We can also use
subplots_adjust() method to change the space between subplots. For example,
x = np.array([0, 1, 2, 3])
y = np.array([3, 8, 1, 10])
plt.subplot(2, 1, 1)
plt.plot(x,y)
plt.title("Plot 1")
x = np.array([0, 1, 2, 3])
y = np.array([10, 20, 30, 40])
plt.subplot(2, 1, 2)
plt.plot(x,y)
plt.title("Plot 2")
plt.subplots_adjust(left = 0.1,
bottom = 0.1,
right = 0.9,
top = 0.9,
wspace = 0.4,
hspace = 0.4)
plt.show()
We use the suptitle() function to add a title to the entire figure. For example,
plt.suptitle("This is super title")
When two scatter plots are drawn in the same figure, they are drawn in two different
colors, by default blue and orange. For example,
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y)
x = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plt.scatter(x, y)
We can also set our own color for each scatter plot with the color or the c argument.
For example,
plt.scatter(x, y, color = 'hotpink')
We can even set a specific color for each dot by using an array of colors as value for
the c argument. For example,
x = np.array([5,7,8,7,2,17])
y = np.array([99,86,87,88,111,86])
cols = np.array(["red","green","blue","yellow","pink","black"])
plt.scatter(x, y, c = cols)
We can change the size of the dots with the s argument. We can adjust the
transparency of the dots with the alpha argument. For example,
x = np.array([5,7,8,7,2,17])
y = np.array([99,86,87,88,111,86])
sizes = np.array([20,50,100,200,500,1000])
plt.scatter(x, y, s = sizes, alpha = 0.3)
3.8. Histogram
A histogram is a graph showing the number of observations within each given interval.
We can use the hist() function to produce a histogram. For example,
import matplotlib.pyplot as plt
x = [1,1,2,3,3,5,7,8,9,10,
10,11,11,13,13,15,16,17,18,18,
18,19,20,21,21,23,24,24,25,25,
25,25,26,26,26,27,27,27,27,27,
29,30,30,31,33,34,34,34,35,36,
36,37,37,38,38,39,40,41,41,42,
43,44,45,45,46,47,48,48,49,50,
51,52,53,54,55,55,56,57,58,60,
61,63,64,65,66,68,70,71,72,74,
75,77,81,83,84,87,89,90,90,91
]
plt.hist(x, bins = 5)
plt.show()
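To see what "the number of observations within each given interval" means in numbers,
note that hist() also returns the counts and the interval edges it used. The sketch
below uses a much smaller made-up sample than the example above so that the counts can
be checked by hand; the Agg backend is selected only so that it runs without a
display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt

x = [1, 2, 2, 3, 5, 7, 8, 9, 10]  # small made-up sample

# hist() returns the bar heights, the interval edges, and the drawn bars.
counts, edges, patches = plt.hist(x, bins=3)

print(edges.tolist())   # [1.0, 4.0, 7.0, 10.0]
print(counts.tolist())  # [4.0, 1.0, 4.0]
plt.close()
```

Each interval includes its left edge and excludes its right edge, except the last
one, which includes both; this is why the value 7 is counted in the third interval.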
We use the legend() function to add a list of explanations for each wedge. To add a
header to the legend, add the title parameter to the legend() function. For example,
x = [35, 25, 25, 15]
mylabels = ["Apples", "Bananas", "Cherries", "Dates"]
plt.pie(x)
plt.legend(title = "Four Fruits", labels = mylabels)