3 IntroToPython-PythonLibraries
Introduction to Python
Python Libraries
Kevyn Stefanelli
2023-2024
3. Python libraries
Python libraries are pre-written collections of code modules that provide a wide range of
functionalities to extend the capabilities of the Python programming language.
Libraries are created to solve specific problems or tasks, allowing developers to avoid
reinventing the wheel by utilizing existing well-tested code.
They offer ready-to-use functions, classes, and methods that streamline the development
process and enable the creation of complex applications with relative ease.
When working with libraries, you have to import the required modules into your code, gaining
access to their functions and classes.
https://numpy.org/doc/stable/user/absolute_beginners.html
For example, we can assign the alias "np" to the NumPy library.
This allows you to access NumPy's functions and classes using the shorter "np" prefix instead of
the full "numpy".
# import numpy and assign it an alias
import numpy as np
Once we have imported numpy, we can use the functions contained within it by using the syntax:
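As a minimal sketch of that syntax (the array values are illustrative and the names vec and mat are assumptions, not necessarily those used in the course), a library function is called as np.function_name(...):

```python
import numpy as np

# Call a NumPy function through the alias: np.<function_name>(...)
vec = np.array([2, 3, 4, 5])             # one-dimensional array (vector)
mat = np.array([[1, 2, 3], [4, 5, 6]])   # two-dimensional array (2 rows, 3 columns)

print(vec)
print(mat)
print(mat.shape)  # (number of rows, number of columns)
```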
[2 3 4 5]
[[1 2 3]
[4 5 6]]
[[1 2 3]
[4 5 6]]
Note: In NumPy, when you create a multidimensional array, data are stored sequentially by
rows.
10
# vertical mean (sums all rows for each column and takes the mean);
# axis=0 means axis='rows'
np.mean(mat, axis=0)
# horizontal mean (sums all columns for each row and takes the mean);
# axis=1 means axis='columns'
np.mean(mat, axis=1)
# if axis is not specified, np.mean computes the mean of all the matrix
# elements
np.mean(mat)
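To make the role of axis concrete, here is a small self-contained sketch. It assumes mat is the 2x3 matrix with rows [1, 2, 3] and [4, 5, 6]; that choice is an assumption made for illustration.

```python
import numpy as np

# Assumed example matrix (2 rows, 3 columns)
mat = np.array([[1, 2, 3], [4, 5, 6]])

col_means = np.mean(mat, axis=0)  # collapse the rows: one mean per column -> [2.5 3.5 4.5]
row_means = np.mean(mat, axis=1)  # collapse the columns: one mean per row -> [2. 5.]
overall = np.mean(mat)            # mean of all six elements -> 3.5

print(col_means, row_means, overall)
```

A useful way to remember it: the axis you pass is the one that disappears from the result.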
array([1, 2, 3])
You can find all the mathematical functions contained in NumPy here:
https://numpy.org/doc/stable/reference/routines.math.html
# Operations
# trace of a matrix
np.trace(np.ones((5, 5)))
# retrieve the diagonal of a matrix
np.diag([[1, 2], [1, 2]])
# matrix multiplication
np.dot(np.array([[1, 2], [1, 2]]), np.array([[3, 3], [3, 3]]))
# and also:
# np.gradient() # gradient
# np.kron() # Kronecker product
# np.outer() # outer product
array([ 1, 2, 3, 100])
(2, 3)
array([[1, 1],
[2, 2],
[3, 3]])
1.4
## Boolean
# checks if all elements are True
np.all([True, True, False])
# checks if at least 1 element in the iterable is True
np.any([True, True, False])
True
# Define w and E
w = np.array([1, -1, 7, 2, 0])
E = np.array([[1, 0, 4, 5],
[2, 1, -1, 0],
[-1, -2, 3, -1],
[2, 0, -3, 0]])
# Select from the second to the fourth elements of the third row of E
print(E[2, 1:4])
# Select only the elements in the first two rows and in the third and
# fourth columns of E
print(E[0:2, [2, 3]])
0
-1
2
0
[ 0 1 -2 0]
[1 0 4 5]
[[ 1 0 4 5]
[ 2 1 -1 0]]
[-2 3 -1]
[[ 4 5]
[-1 0]]
[[ 1 0 5]
[ 2 1 0]
[-1 -2 -1]
[ 2 0 0]]
[[ 2 1 -1 0]
[-1 -2 3 -1]]
Now we can use the functions contained in the random module of NumPy.
87
2
5
91
62
46
61
Alternatively, you can specify the number of random numbers you want using the size
parameter:
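A short sketch of the size parameter follows. The bounds (0 to 100 for integers, 20 to 50 for floats) are assumptions chosen for illustration, so the numbers drawn will differ from the outputs shown in these notes.

```python
import numpy as np

# size as an integer: a vector with that many random draws
six_ints = np.random.randint(0, 100, size=6)

# size as a tuple: a matrix of shape (rows, columns)
int_matrix = np.random.randint(0, 100, size=(3, 5))

# the same parameter works for continuous distributions, e.g. uniform floats
float_matrix = np.random.uniform(20, 50, size=(3, 2))

print(six_ints)
print(int_matrix)
print(float_matrix)
```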
[ 4 78 75 60 99 42]
[[67 59 19 25 91]
[43 79 70 87 66]
[91 29 6 59 28]]
[[43.26410355 28.58972112]
[37.67737838 26.67180781]
[38.24220209 47.23658844]]
Normal Distribution:
To generate random numbers from a normal distribution, you can use np.random.normal().
numpy.random.seed()
• Reproducibility: setting the random seed ensures that you get the same sequence
of random numbers every time you run your code. It allows you and others to
recreate the same results.
• Comparability: you might want to compare the results of different algorithms or
models using randomization. Setting the random seed guarantees that you are
comparing the same set of random numbers, making your comparisons more
meaningful.
• In our case: it helps us obtain the same series of random numbers. Everyone in the
class will have the same sequence.
VERY IMPORTANT: Remember to specify the seed before each random generation.
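The effect of the seed can be checked with a small sketch. The seed value 123 and the distribution parameters are arbitrary assumptions; any integer seed behaves the same way.

```python
import numpy as np

np.random.seed(123)  # arbitrary seed value (assumption)
first = np.random.normal(loc=0, scale=1, size=5)

np.random.seed(123)  # same seed again -> same sequence
second = np.random.normal(loc=0, scale=1, size=5)

# no reseeding here: the generator continues, so the draws differ
third = np.random.normal(loc=0, scale=1, size=5)

print(np.array_equal(first, second))  # True
print(np.array_equal(first, third))   # False
```

This is why the seed has to be set again before each generation whose output you want to reproduce.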
3.2 The following equation represents the probability density function (pdf) of a Normal
distribution with mean μ and standard deviation σ:
$$f(x, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \cdot e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
Write a function called 'fNormal' that calculates this pdf and evaluate it for x=8, μ=-3, and σ =1.5.
3.3 Given m = 2 and s = 1, create x as a vector of the integers in the interval [1, 100) and
evaluate the fNormal function defined in the previous step for each point of the vector x.
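One possible way to approach exercises 3.2 and 3.3 is sketched below. It is not necessarily the intended solution: the argument order of fNormal and the use of np.arange for x are assumptions.

```python
import numpy as np

def fNormal(x, mu, sigma):
    """Normal pdf with mean mu and standard deviation sigma, evaluated at x."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2 * np.pi))

# Exercise 3.2: evaluate at a single point
p = fNormal(8, -3, 1.5)
print(p)

# Exercise 3.3: NumPy arithmetic is element-wise, so the same function
# accepts a whole vector; np.arange(1, 100) gives the integers 1, ..., 99
m, s = 2, 1
x = np.arange(1, 100)
pdf_values = fNormal(x, m, s)
print(pdf_values[:5])
```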
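• Population Variance:
$$\sigma^{2} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^{2}$$
• Sample Variance:
$$s^{2} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^{2}$$
• Population Standard Deviation:
$$\sigma = \sqrt{\sigma^{2}}$$
• Sample Standard Deviation:
$$s = \sqrt{s^{2}}$$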
Compute the unbiased SD (Sample Standard Deviation) of the vector ω :
$$\omega = (1, -1, 5, 6, 1, -6, 8, 9, 1, 3)$$
Then, compare the output with the Population SD (σ ) provided by Python.
Alternative A:
import numpy as np
Therefore, we can transition from one formula to the other by making the following adjustment:
$$\sigma = \frac{\sqrt{N-1}}{\sqrt{N}} \cdot s$$
and then:
$$s = \frac{\sqrt{N}}{\sqrt{N-1}} \cdot \sigma$$
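The relationship between the two standard deviations can be verified numerically. The sketch below uses the ddof argument of np.std, whose default (ddof=0) gives the population SD; the variable names are assumptions.

```python
import numpy as np

w = np.array([1, -1, 5, 6, 1, -6, 8, 9, 1, 3])
N = w.size

sigma = np.std(w)        # population SD: divides by N (ddof=0 is the default)
s = np.std(w, ddof=1)    # sample SD: divides by N - 1

# moving from one to the other with the adjustment factor
s_from_sigma = np.sqrt(N) / np.sqrt(N - 1) * sigma

print(sigma, s, s_from_sigma)
print(np.isclose(s, s_from_sigma))  # True
```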
import numpy as np
3.5 Matrices
Given the matrices A and B:
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \\ 1 & -1 & 1 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$
compute:
1. $A + A$
2. $A \cdot B$
3. $\det(B)$
4. $A^{-1}$
5. $A'$
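A hedged sketch of how these five operations map onto NumPy is given below. It assumes that the dot in item 2 denotes the matrix product (the @ operator) rather than the element-wise product; the result names are illustrative.

```python
import numpy as np

A = np.array([[1, 1, 1], [0, 1, 2], [1, -1, 1]])
B = np.array([[1, 4, 7], [2, 5, 8], [3, 6, 9]])

sum_AA = A + A               # 1. element-wise sum
prod_AB = A @ B              # 2. matrix product (same as np.dot(A, B))
det_B = np.linalg.det(B)     # 3. determinant
inv_A = np.linalg.inv(A)     # 4. inverse (exists because det(A) is not zero)
A_t = A.T                    # 5. transpose

print(prod_AB)
print(det_B)
```

Note that det(B) is zero up to floating-point rounding, so B itself has no inverse.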
3.6 Boolean
Given matrix A:
$$A = \begin{pmatrix} 1 & 0 & 4 & 5 \\ -1 & -3 & 4 & 3 \\ 2 & 1 & -6 & 2 \\ 0 & 0 & 2 & 4 \end{pmatrix}$$
Matrices
Given A, B and c:
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \\ 1 & -1 & 1 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}, \quad c = 3,$$
compute:
1. $A \cdot B$
2. $\det(A \cdot B)$
3. $A \cdot A^{-1}$
4. $A \cdot B \cdot c$
5. $A \cdot B'$
6. $\frac{A - B}{c}$
7. $\mathrm{diag}(B) \cdot c$
8. $A_{[2,3]} \cdot B_{[3,3]}$
3.2 The Pandas Library
Pandas is a powerful Python library designed for data manipulation and analysis. It provides data
structures like DataFrame and Series that allow you to easily work with structured and tabular
data.
With its comprehensive set of functions, Pandas enables data cleaning, transformation,
merging, and exploration, making it an essential tool for data professionals and analysts.
You can find any further information about the library here:
https://pandas.pydata.org/docs/user_guide/index.html
import pandas as pd
Dict = {
'Car': ["BMW", "Volvo", "Ford"],
'Cost x1000': [70, 47, 33]
}
print(Dict)
Data = pd.DataFrame(Dict)
print(Data)
0 1
1 7
2 2
dtype: int64
3.2.1 Working with Pandas objects
Deleting Rows and Columns from the Dataset
# Remove the first row and modify the original DataFrame
Data.drop([0], inplace=True)
print(Data)
Cost x1000
1 47
2 33
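The sketch below shows both kinds of deletion on a copy of the car data. Unlike the cell above, it does not use inplace=True, so the original DataFrame is left untouched; the variable names are assumptions.

```python
import pandas as pd

cars = pd.DataFrame({
    "Car": ["BMW", "Volvo", "Ford"],
    "Cost x1000": [70, 47, 33],
})

# drop() returns a new DataFrame unless inplace=True is passed
without_first_row = cars.drop(index=0)        # remove the row labelled 0
without_car_col = cars.drop(columns="Car")    # remove the column "Car"

print(without_first_row)
print(without_car_col)
print(cars.shape)  # the original still has 3 rows and 2 columns
```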
Learning by doing
We can create a dataframe starting from a list of vectors.
We select a class of 20 people coming from different countries of the world. For each one of
them we collect names, ages, heights, nationalities, gender, and the final grades in the math
exam.
import pandas as pd
# Define vectors
names = ["Andrew", "Anna", "Alice", "Antony",
"Barbara", "Brian", "Boris", "Barney",
"Claudia", "Cliff", "Cecilia", "Clara",
"David", "Dora", "Denise", "Donatello",
"Emma", "Elise", "Esteban", "Elon"]
ages = [20, 22, 27, 25, 18, 22, 26, 21, 19, 24,
27, 23, 22, 19, 23, 28, 22, 24, 25, 19]
heights = [180, 170, 155, 175, 150, 197, 178, 182, 183, 170,
175, 178, 170, 160, 175, 194, 180, 165, 172, 183]
gender = [0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0]
grades = [16, 18, 19, 18, 15, 14, 15, 18, 17, 20,
20, 19, 15, 16, 18, 14, 20, 15, 19, 17]
# Create a dictionary
class_dic = {
"names": names,
"ages": ages,
"heights": heights,
"nationalities": nationalities,
"gender": gender,
"grades": grades
}
class_df = pd.DataFrame(class_dic)
import pandas as pd
0 Andrew
1 Anna
2 Alice
3 Antony
4 Barbara
5 Brian
6 Boris
7 Barney
8 Claudia
9 Cliff
10 Cecilia
11 Clara
12 David
13 Dora
14 Denise
15 Donatello
16 Emma
17 Elise
18 Esteban
19 Elon
Name: names, dtype: object
Variable Type: object
First Five Rows:
names ages heights nationalities gender grades
0 Andrew 20 180 France 0 16
1 Anna 22 170 Scotland 1 18
2 Alice 27 155 Italy 1 19
3 Antony 25 175 Poland 0 18
4 Barbara 18 150 France 1 15
Number of Rows: 20
Number of Columns: 6
Column Names: ['names', 'ages', 'heights', 'nationalities', 'gender',
'grades']
Number of Rows: 20
Number of Columns: 6
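Summaries like the ones above can be obtained with a few standard DataFrame attributes and methods. The following is a sketch on a three-row excerpt of the class data; the name small_df is an assumption, and the exact calls used in the course may differ.

```python
import pandas as pd

# Three-row excerpt of the class data, for illustration only
small_df = pd.DataFrame({
    "names": ["Andrew", "Anna", "Alice"],
    "ages": [20, 22, 27],
    "grades": [16, 18, 19],
})

print(small_df["names"])          # select one column (a Series)
print(small_df["names"].dtype)    # variable type of that column
print(small_df.head(2))           # first rows (5 by default)

n_rows, n_cols = small_df.shape   # shape is a (rows, columns) tuple
print("Number of Rows:", n_rows)
print("Number of Columns:", n_cols)
print("Column Names:", list(small_df.columns))
```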
Note: the new variable must have the same number of elements as the other variables in the
dataset.
We can also add new variables as transformations of existing variables in the dataset.
For example, we can add a new variable called "MinStudy", which expresses the variable HStudy
(currently measured in hours) in minutes. As a formula:
$$\text{MinStudy} = \text{HStudy} \cdot 60$$
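In pandas, such a transformation is a single assignment. The sketch below uses a hypothetical three-row DataFrame; the HStudy values are invented for illustration and are not the course data.

```python
import pandas as pd

# Hypothetical study hours (illustrative values only)
study = pd.DataFrame({
    "names": ["Andrew", "Anna", "Alice"],
    "HStudy": [2.0, 3.5, 1.0],
})

# New column computed element-wise from an existing one
study["MinStudy"] = study["HStudy"] * 60

print(study)
```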
Number of students: 20
Mean age: 22.8
Median grade: 17.5
Highest grade: 20
Lowest grade: 14
Nationality frequencies:
France 5
Poland 3
Scotland 2
Italy 2
UK 2
USA 2
India 1
Mexico 1
Germany 1
Spain 1
Name: nationalities, dtype: int64
The average grade in the Female subsample is: 17.7
The average grade in the Male subsample is: 16.6
Note: Sometimes we need matrices and not dataframes (e.g., matrix calculus).
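A minimal sketch of the conversion, assuming a small numeric DataFrame (the names M and gram are illustrative): DataFrame.to_numpy() returns the values as a NumPy array, which can then be used in matrix calculus.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [10, 14, 11, 9],
    "Height": [100, 140, 120, 80],
})

M = df.to_numpy()   # 4x2 NumPy array; column names and index are dropped
gram = M.T @ M      # matrix product, not available in this form on a DataFrame column

print(type(M))
print(gram)
```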
# to .xlsx file
# "filename" is the output file name
df.to_excel("filename.xlsx")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 5 non-null int64
1 Masters 5 non-null bool
2 City 5 non-null object
dtypes: bool(1), int64(1), object(1)
memory usage: 213.0+ bytes
Age 5
Masters 5
City 5
dtype: int64
## Descriptive Statistics
# define a new dictionary
data = {"Age" : [10, 14, 11,9],
"Height" : [100, 140, 120, 80],
"Weight" : [30, 42, 32, 28]
}
# transform it into a Pandas DataFrame
df = pd.DataFrame(data, columns = ["Age","Height","Weight"])
Age 14
Height 140
Weight 42
dtype: int64
# average value
df.mean()
# median
df.median()
# correlation among variables
df.corr()
# define a lambda
f = lambda x: x / 1.2
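A lambda like this is typically passed to apply(). The sketch below shows one possible use on the DataFrame defined earlier in this section; which column the course applies it to is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [10, 14, 11, 9],
    "Height": [100, 140, 120, 80],
    "Weight": [30, 42, 32, 28],
})

f = lambda x: x / 1.2

scaled_height = df["Height"].apply(f)  # f is called on each element of the column
scaled_all = df.apply(f)               # f is called on each column as a whole

print(scaled_height)
print(scaled_all)
```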
Data visualizations are a powerful means to communicate complex information in a clear and
intuitive manner, aiding in the exploration and presentation of data-driven findings.
With Matplotlib, you can produce engaging visual representations that enhance understanding
and facilitate decision-making across various fields, from scientific research to business
analytics.
Tip: never underestimate the power of a graph. Utilize graphs wisely to unlock the potential of
your data and convey its significance to others.
import matplotlib
Most plots are produced using the pyplot module of Matplotlib, so we also import pyplot:
import matplotlib.pyplot as plt
3.3.1 Scatterplot
A scatter plot is a simple yet effective visualization tool used to display the relationship between
two variables.
# import numpy
import numpy as np
You can specify the type of marker (points/lines) to use in the plot by using the marker
parameter in the scatter() function.
[<matplotlib.lines.Line2D at 0x12751d6a0>]
Inside the quotation marks (" "), we can pass a format string that sets additional properties
such as color and line style.
Here's a list of parameters you can specify, such as color, linestyle, and marker:
• Color: You can specify colors using strings like 'red' ('r'), 'blue' ('b'), 'green' ('g'), or
HTML/CSS hex codes like '#FF5733'.
• Linestyle: You can choose the linestyle of the plot, such as solid ('-'), dashed ('--'), dotted
(':'), and more.
• Marker: You can select a marker for each data point, like 'o' for circle, 's' for square, '^'
for triangle, etc.
• Marker Size: Adjust the size of markers with the markersize parameter.
Feel free to experiment with these parameters to create visually appealing and informative
plots.
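The options listed above can be combined either in a compact format string or as separate keyword arguments. The sketch below is illustrative: the data and style choices are assumptions, and the "Agg" backend line is only there so that the code also runs without a display (it can be removed in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 20)

# Format string "r--o": red color, dashed line, circle markers
dashed = plt.plot(x, np.sin(x), "r--o", markersize=4)

# The same kind of styling with explicit keyword arguments
dotted = plt.plot(x, np.cos(x), color="#FF5733", linestyle=":", marker="s")

plt.title("Line styles and markers")
```

In an interactive session, finish with plt.show() to display the figure.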
# Generate data
x = np.linspace(0, 10, 20)
y = np.sin(x)
x = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plt.scatter(x, y, color="hotpink")
plt.show()
To compare two different data sets, you can simply overlay two plots:
# Plot 1
x1 = np.array([5,7,8,7,2,17,2,9,4,11,12,9,6])
y1 = np.array([99,86,87,88,111,86,103,87,94,78,77,85,86])
plot1 = plt.scatter(x1, y1, label='Dataset 1') # Add label for legend
# Plot 2
x2 = np.array([2,2,8,1,15,8,12,9,7,3,11,4,7,14,12])
y2 = np.array([100,105,84,105,90,99,90,95,94,100,79,112,91,80,85])
plot2 = plt.scatter(x2, y2, label='Dataset 2') # Add label for legend
# Draw the legend from the labels and display both datasets on the same axes
plt.legend()
plt.show()
A bar plot represents data using rectangular bars, where the length or height of each bar
corresponds to the value (frequency) of a particular category or group.
Bar plots provide a clear visual representation of how categories differ from one another,
making them effective for conveying comparisons, trends, and distributions.
• Comparison between Categories: Use bar plots to compare the values of different
categories, such as comparing sales across different products or customer ratings
across different services.
• Frequency Distribution: Bar plots are effective for showing the frequency
distribution of categorical data, such as the distribution of student grades or the
distribution of responses to survey questions.
• Trends over Time: Stacked bar plots can be used to show how the composition of
categories changes over time, providing insights into evolving trends.
• Nominal or Ordinal Data: Bar plots are suitable for both nominal data (categories
with no inherent order) and ordinal data (categories with a defined order).
In Python, we will use the function 'plt.bar'.
# define two arrays
x = np.array(["Ita", "Fra", "Ger", "UK"])
y = np.array([3, 8, 1, 10])
# draw one bar per category
plt.bar(x, y)
plt.show()
Similarly to the scatter plot, there are various ways to customize the bar plot. Refer to the
available guides at:
https://matplotlib.org/stable/tutorials/index.html
3.3.3 Histogram
A histogram is a graphical representation that provides insights into the distribution and
frequency of continuous data. Unlike bar plots, which are suitable for categorical data,
histograms are used to display the distribution of numeric data over continuous intervals or bins.
Each bin represents a range of values, and the height of the bar over a bin corresponds to the
frequency or count of data points falling within that range. Histograms are particularly useful
for identifying patterns, central tendencies, and outliers in your data.
# Generate data
x = np.random.normal(10, 2, 100)
plt.hist(x)  # count the observations falling in each bin
plt.show()
A pie plot is divided into sectors, where each sector represents a different category or group. The
size of each sector is proportional to the relative frequency or proportion of the corresponding
category within the data. Pie plots are particularly useful for visualizing parts of a whole and
comparing the contributions of different categories to the total.
• Limited Number of Categories: Pie plots are best suited for representing a small number
of categories. Too many categories can make the chart difficult to interpret.
• Visualizing Composition: Pie plots are effective for showing how a whole can be divided
into different parts, such as budget allocation, market share, or distribution of grades.
• Labels and Legends: Labels are often added to each sector to indicate the category it
represents and the corresponding percentage or value. A legend is used to provide more
detailed information about each category.
• Limitations: Pie plots can sometimes be misleading when it comes to comparing angles
and accurately interpreting small differences between categories.
# Add a title
plt.title("Distribution of Categories")
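A complete pie plot can be sketched as follows. The categories and their sizes are hypothetical, and the "Agg" backend line is only there so that the code runs without a display (it can be removed in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical categories and values (illustrative only)
labels = ["A", "B", "C", "D"]
sizes = [40, 30, 20, 10]

# autopct prints the percentage of each sector inside the chart
wedges, texts, autotexts = plt.pie(sizes, labels=labels, autopct="%1.1f%%")

plt.title("Distribution of Categories")
plt.legend(wedges, labels, loc="best")
```

In an interactive session, finish with plt.show() to display the figure.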