CHPC/BBML – Python Workshops
Malcolm Tobias
[email protected]
(314) 362-1594
Xing Huang
[email protected]
http://chpc2.wustl.edu
http://chpc.wustl.edu
Maze Ndonwi
[email protected]
Marcy Vana
[email protected]
https://becker.wustl.edu/services/science-informatics-support
Aditi Gupta
[email protected]
Madhurima Kaushal
[email protected]
https://informatics.wustl.edu
Introduction to Python #1 – Getting Started with Python
Introduction to Python #2 – Using Python for Data Analysis
Introduction to Python #2 – Using Python for Data Analysis
Goals
- Learn to write Python code to perform data analysis and visualization
- Learn to run Python code in the virtual environment set up on the CHPC
Topics covered in Intro to Python #1
Variables and Python data types
Python lists
Numpy arrays
Matplotlib for basic data visualization
Jupyter Notebook
Topics to be covered today
Conditions and Loops
Functions and Methods
Packages for Data Analysis
(Numpy, Pandas, and Matplotlib)
Python Virtual Environment on the CHPC
Conditions and loops
Python uses relational and logical operators in conditions and loops to compare objects
Booleans
Relational Operators
<    strictly less than
<=   less than or equal
>    strictly greater than
>=   greater than or equal
==   equal
!=   not equal
>>> 4 < 5
True
>>> 6 <= 3
False
>>> 10.7 > 8.2
True
>>> 12 >= 12.0
True
>>> 3/5 == 0.6
True
>>> 2**4 != 16
False
Booleans
Logical Operators
And
>>> True and True
True
>>> False and True
False
>>> True and False
False
>>> False and False
False
Or
>>> True or True
True
>>> False or True
True
>>> True or False
True
>>> False or False
False
Not
>>> not True
False
>>> not False
True
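Relational and logical operators are usually combined in one condition. A minimal sketch; the variable names `age` and `has_ticket` and their values are made up for illustration:

```python
# Hypothetical values chosen only for illustration.
age = 25
has_ticket = True

# 'and' is True only when both comparisons are True.
can_enter = age >= 18 and has_ticket

# 'or' is True when at least one comparison is True.
gets_discount = age < 12 or age >= 65

# 'not' inverts a Boolean value.
pays_full_price = not gets_discount

print(can_enter, gets_discount, pays_full_price)
```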
Conditional Statements
if-elif-else
if condition1:
    expression 1
elif condition2:
    expression 2
elif condition3:
    expression 3
……
elif condition n-1:
    expression n-1
else:
    expression n

if score >= 90:
    letter = 'A'
elif score >= 80:
    letter = 'B'
elif score >= 70:
    letter = 'C'
elif score >= 60:
    letter = 'D'
else:
    letter = 'F'
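To try the grading example on several scores, the chain can be wrapped in a function. This is only a sketch: the function name `letter_grade` and the test scores are assumptions, not part of the workshop material.

```python
def letter_grade(score):
    """Return a letter grade using the same cutoffs as the if-elif-else chain."""
    if score >= 90:
        letter = 'A'
    elif score >= 80:
        letter = 'B'
    elif score >= 70:
        letter = 'C'
    elif score >= 60:
        letter = 'D'
    else:
        letter = 'F'
    return letter

# Only the first condition that is True is executed; the rest are skipped.
for score in [95, 83, 70, 59]:
    print(score, letter_grade(score))
```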
Looping
while
while condition1:
    expression 1
    expression 2
    ……
    expression n

lines = list()
print('Enter lines of text.')
print('Enter an empty line to quit.')
line = input('Next line: ')
while line != '':
    lines.append(line)
    line = input('Next line: ')
print('Your lines were:')
print(lines)
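The same while pattern can be tried without typing anything. In this sketch the list `typed` is a made-up stand-in for keyboard input, with an empty string playing the role of the empty line:

```python
# Stand-in for typed input (hypothetical data); '' marks the empty line.
typed = ['first line', 'second line', '', 'never reached']

lines = []
position = 0
line = typed[position]
while line != '':            # stop as soon as the empty line is seen
    lines.append(line)
    position += 1
    line = typed[position]

print('Your lines were:')
print(lines)
```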
Looping
for … in …
for iterating_variable in sequence:
    expression 1
    ……
    expression n

height = [74, 70, 63, 69, 67, 71, 64, 66, 71]
total = 0.0
for i in range(len(height)):
    total += height[i]
avg = total/len(height)
print('The average height is:')
print(avg)
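A for loop can also visit the list items directly, without an index. A short sketch of the same average calculation; the names `h` and `total` are illustrative choices:

```python
height = [74, 70, 63, 69, 67, 71, 64, 66, 71]

total = 0.0
for h in height:             # h takes each value of the list in turn
    total += h

avg = total / len(height)
print('The average height is:')
print(avg)
```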
Built-in Functions
• pieces of reusable code
• solve particular tasks
• call a function instead of writing your own code
We have seen print() and type(); what else?
https://docs.python.org/3/library/functions.html
Built-in Functions
max(), min(), sum(), len()
Let's use the height list from Intro to Python #1
height = [74, 70, 63, 69, 67, 71, 64, 66, 71]
>>> max(height)    # return the largest item in an object
74
>>> min(height)    # return the smallest item in an object
63
>>> sum(height)    # sum the items of an object from left to right and return the total
615
>>> len(height)    # return the length (the number of items) of an object
9
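Built-in functions can be combined in one expression. A small sketch using the same list; the names `average` and `spread` are illustrative:

```python
height = [74, 70, 63, 69, 67, 71, 64, 66, 71]

# Average = total divided by the number of items.
average = sum(height) / len(height)

# Spread = difference between the largest and smallest item.
spread = max(height) - min(height)

print(round(average, 2), spread)
```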
Define your own Functions
define your function
def function_name([argv]):
    body
call your function
function_name([argv])
Note that the function must be defined first before it can be called

PI = 3.14159265358979
def circleArea(radius):
    return PI*radius*radius
def circleCircumference(radius):
    return 2*PI*radius
def main():
    print('circle area with radius 5:', circleArea(5))
    print('circumference with radius 5:', circleCircumference(5))
main()
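A possible variation of the circle functions, shown as a sketch rather than the workshop's own version: it takes pi from the standard `math` module instead of a hand-typed constant and adds docstrings. The snake_case function names are an assumption.

```python
import math

def circle_area(radius):
    """Return the area of a circle with the given radius."""
    return math.pi * radius ** 2

def circle_circumference(radius):
    """Return the circumference of a circle with the given radius."""
    return 2 * math.pi * radius

# The functions are defined above, so they can be called here.
print('circle area with radius 5:', circle_area(5))
print('circumference with radius 5:', circle_circumference(5))
```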
Methods
Objects: everything in Python, including data and functions
Objects have associated methods, depending on their type
Methods: functions that are called on objects
Syntax: object.method(parameters)
String Methods
List Methods
Methods
String Methods
find(): shows the location of the 1st occurrence of the searched string
>>> s = 'Mississippi'
>>> s.find('si')
3
>>> s.find('sa')    # string not found
-1
split()
>>> s = 'Mississippi'
>>> s.split('i')
['M', 'ss', 'ss', 'pp', '']
>>> s.split()    # no white space
['Mississippi']
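A few more calls to find() and split(), as a sketch. The optional second argument of find() (the position to start searching from) and the example sentence are additions not shown in the slides.

```python
s = 'Mississippi'

first = s.find('ss')             # position of the first 'ss'
second = s.find('ss', first + 1) # search again, starting after the first match

parts = s.split('ss')            # split on a two-character separator

# With no argument, split() separates on white space.
words = 'Python for data analysis'.split()

print(first, second, parts, words)
```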
Methods
List Methods
index()
>>> fam = ['emma', 1.68, 'mom', 1.71, 'dad', 1.89]
>>> fam.index('mom')
2
count()
>>> fam.count(1.68)
1
append()
>>> fam.append(1.78)
>>> fam
['emma', 1.68, 'mom', 1.71, 'dad', 1.89, 1.78]
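The list methods can work together, for example to look up the height stored right after a name. A sketch only; the added entry 'sis' and its height are made-up values:

```python
fam = ['emma', 1.68, 'mom', 1.71, 'dad', 1.89]

# index() gives the position of a name; the height is the next item.
position = fam.index('dad')
dad_height = fam[position + 1]

# append() adds one item at the end of the list, changing it in place.
fam.append('sis')     # hypothetical new family member
fam.append(1.68)

# count() now finds the value 1.68 twice.
n_same_height = fam.count(1.68)

print(position, dad_height, n_same_height)
```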
Packages
A collection of Python scripts
Thousands of them are available from the internet
Packages for data science
NumPy: arrays
Pandas: dataframes
Matplotlib: data visualization
Numpy: Basic Statistics
np_city: data containing the height and weight of 5000 people
>>> import numpy as np
>>> np_city
array([[ 1.69, 83.24],
       [ 1.38, 58.78],
       [ 1.89, 85.14],
       …,
       [ 1.75, 66.55],
       [ 1.61, 54.46],
       [ 1.86, 95.69]])
>>> np.mean(np_city[:,0])
1.7188
>>> np.median(np_city[:,1])
63.43
>>> np.std(np_city[:,0])
0.1719
Numpy: Basic Statistics
Generate data
>>> import numpy as np
>>> height = np.round(np.random.normal(1.72, 0.17, 5000), 2)
>>> weight = np.round(np.random.normal(63.45, 18, 5000), 2)
>>> np_city = np.column_stack((height, weight))
Arguments of np.random.normal(), in order: average value, standard deviation, number of data points
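The generation and statistics steps can be run together as one script. A sketch, assuming a fixed random seed (an addition, so that repeated runs give the same numbers); the exact statistics depend on the random draw and will differ slightly from the values on the slides.

```python
import numpy as np

np.random.seed(0)   # assumption: fixed seed for repeatable results

# 5000 heights (mean 1.72, std 0.17) and weights (mean 63.45, std 18).
height = np.round(np.random.normal(1.72, 0.17, 5000), 2)
weight = np.round(np.random.normal(63.45, 18, 5000), 2)

# Stack the two 1-D arrays as the columns of one 2-D array.
np_city = np.column_stack((height, weight))

print(np_city.shape)                 # (rows, columns)
print(np.mean(np_city[:, 0]))        # mean height (column 0)
print(np.median(np_city[:, 1]))      # median weight (column 1)
print(np.std(np_city[:, 0]))         # standard deviation of height
```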
Pandas
Handle data of different types (for example, CSV files)
data.csv
      country          population       area          capital
US    United States    326,474,013      9,144,930     Washington, DC
RU    Russia           143,375,006      16,292,614    Moscow
IN    India            1,342,512,706    2,973,450     New Delhi
CH    China            1,388,232,693    9,386,293     Beijing
BR    Brazil           211,243,220      8,349,534     Brasilia
column labels: country, population, area, capital (first row)
row labels: US, RU, IN, CH, BR (first column)
Pandas
>>> import pandas as pd
>>> data = pd.read_csv("path_to_data.csv")
>>> data
  Unnamed: 0        country  population      area         capital
0         US  United States   326474013   9144930  Washington, DC
1         RU         Russia   143375006  16292614          Moscow
2         IN          India  1342512706   2973450       New Delhi
3         CH          China  1388232693   9386293         Beijing
4         BR         Brazil   211243220   8349534        Brasilia
Pandas
>>> data = pd.read_csv("path_to_data.csv", index_col = 0)
>>> data
          country  population      area         capital
US  United States   326474013   9144930  Washington, DC
RU         Russia   143375006  16292614          Moscow
IN          India  1342512706   2973450       New Delhi
CH          China  1388232693   9386293         Beijing
BR         Brazil   211243220   8349534        Brasilia
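To try read_csv() without a file on disk, the CSV text can be read from a string. This is a sketch under assumptions: the exact layout of data.csv is not shown in the slides, so the text below (empty first header cell, quoted capital containing a comma, numbers without thousands separators) is a guess at a file that would produce the output above.

```python
import io
import pandas as pd

# Assumed file content; the real data.csv may be laid out differently.
csv_text = """,country,population,area,capital
US,United States,326474013,9144930,"Washington, DC"
RU,Russia,143375006,16292614,Moscow
IN,India,1342512706,2973450,New Delhi
CH,China,1388232693,9386293,Beijing
BR,Brazil,211243220,8349534,Brasilia
"""

# index_col=0 uses the first column (US, RU, ...) as row labels.
data = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(data)
```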
Pandas
Column Access
>>> data[["country"]]
          country
US  United States
RU         Russia
IN          India
CH          China
BR         Brazil
This output results from specifying index_col=0. For column access this is optional, but the output format differs without it.
Row Access
>>> data.loc[["BR"]]
   country  population     area   capital
BR  Brazil   211243220  8349534  Brasilia
This output results from specifying index_col=0. For row access by label this is required; otherwise Python raises an exception.
Pandas
Add Column
>>> data["on_earth"] = [True, True, True, True, True]
>>> data
          country  population      area         capital  on_earth
US  United States   326474013   9144930  Washington, DC      True
RU         Russia   143375006  16292614          Moscow      True
IN          India  1342512706   2973450       New Delhi      True
CH          China  1388232693   9386293         Beijing      True
BR         Brazil   211243220   8349534        Brasilia      True
Pandas
Add Column
>>> data["density"] = data["population"] / data["area"]
>>> data
          country  population      area         capital  density
US  United States   326474013   9144930  Washington, DC       36
RU         Russia   143375006  16292614          Moscow        9
IN          India  1342512706   2973450       New Delhi      452
CH          China  1388232693   9386293         Beijing      148
BR         Brazil   211243220   8349534        Brasilia       25
(density values shown rounded to the nearest integer)
Pandas
Element Access
>>> data.loc["US", "capital"]
'Washington, DC'
>>> data["capital"].loc["US"]
'Washington, DC'
>>> data.loc["US"]["capital"]
'Washington, DC'
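The access patterns can be tried on a dataframe built directly in code. A sketch: building the dataframe from a dictionary and rounding the density with `.round().astype(int)` are choices made here for illustration, not steps taken from the slides.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "country": ["United States", "Russia", "India", "China", "Brazil"],
        "population": [326474013, 143375006, 1342512706, 1388232693, 211243220],
        "area": [9144930, 16292614, 2973450, 9386293, 8349534],
        "capital": ["Washington, DC", "Moscow", "New Delhi", "Beijing", "Brasilia"],
    },
    index=["US", "RU", "IN", "CH", "BR"],   # row labels
)

# New column computed from two existing columns, rounded to whole numbers.
data["density"] = (data["population"] / data["area"]).round().astype(int)

country_df = data[["country"]]      # double brackets: a one-column dataframe
brazil = data.loc[["BR"]]           # row access by label
capital = data.loc["US", "capital"] # single element: row label, column label

print(data)
print(capital)
```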
Matplotlib
Advanced features
fig, ax = plt.subplots(m, n, sharex=True, sharey=True)
ax – user-defined name for this set of plots (can be anything)
m – number of rows of plots
n – number of columns of plots
sharex = True – plots share the x axis
sharey = True – plots share the y axis
Matplotlib
Advanced features
1. add axis labels
ax.set_xlabel('abc')
ax.set_xlabel('angle', fontsize=16)
2. specify axis limits
ax.set_xlim([a,b])
ax.set_xlim([0,10])
3. specify axis tick values
ax.set_xticks([a, b, c, d, e])
ax.set_xticks([-2,-1,0,1,2])
Matplotlib
Advanced features
4. specify axis tick positions
ax.xaxis.set_ticks_position('bottom'/'top')
ax.yaxis.set_ticks_position('left'/'right')
ax.xaxis.set_ticks_position('bottom')
5. specify font size for tick label
ax.yaxis.set_tick_params()
ax.yaxis.set_tick_params(labelsize=15)
6. add plot title
ax.set_title('abc')
ax.set_title('House price', fontsize=12)
Matplotlib
Advanced features
7. change line color and line width
8. add customized legend
ax.plot(x,y,color='XX',lw=XX,label='XXX')
ax.legend()
ax.plot(data[:,0],data[:,1],color='#1980FF',lw=2,label='New York')
ax.legend(loc = (0.03,0.77), prop=dict(size=11), frameon=True, framealpha=.2, handlelength=1.25, handleheight=0.5, labelspacing=0.25, handletextpad=0.8, ncol=1, numpoints=1, columnspacing=0.5)
color – customize the color, either by specifying a name, e.g., "red", or a hexadecimal code, e.g., #FF0000. To look up the hexadecimal code for a specific color, go to: https://color.adobe.com/create/color-wheel/
lw – specify the line width in units of pt.
loc – customize the position of the legend relative to the entire plot
handlelength – specify the length of the legend handle
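The advanced features can be combined in one short script. A sketch under assumptions: the sine and cosine curves are made-up data, and the "Agg" backend is selected only so the script also runs where no display is available (for example on a login node).

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)          # made-up data for illustration

# One row, two columns of plots that share both axes.
fig, axes = plt.subplots(1, 2, sharex=True, sharey=True)
ax = axes[0]
ax.plot(x, np.sin(x), color='#1980FF', lw=2, label='sin')
axes[1].plot(x, np.cos(x), color='red', lw=1, label='cos')

ax.set_xlabel('angle', fontsize=16)        # 1. axis label
ax.set_xlim([0, 10])                       # 2. axis limits
ax.set_xticks([0, 2, 4, 6, 8, 10])         # 3. tick values
ax.xaxis.set_ticks_position('bottom')      # 4. tick position
ax.yaxis.set_tick_params(labelsize=15)     # 5. tick label size
ax.set_title('Sine', fontsize=12)          # 6. title

# 8. legend placed by (x, y) position, with a few of the options listed above.
ax.legend(loc=(0.03, 0.77), prop=dict(size=11), frameon=True,
          framealpha=0.2, handlelength=1.25)
axes[1].legend()
```

To keep the result, the figure could be written to a file with fig.savefig() using a file name of your choice.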
Using Python on CHPC
Set up Python virtual environment
http://login02.chpc.wustl.edu/wiki119/index.php/Python
1. Manually add Anaconda to your path:
[xhuang@login01 ~]$ export PATH=/act/Anaconda3-2.3.0/bin:${PATH}
2. Create an environment with:
[xhuang@login01 ~]$ conda create --name <name_of_env> python=3
where <name_of_env> can be whatever you want to call it. You can also use python=2, depending on which version of Python you want to use.
Using Python on CHPC
Set up Python virtual environment
3. Activate this environment using:
[xhuang@login01 ~]$ source activate <name_of_env>
You can now install any package you'd like with:
(conda_env)[xhuang@login01 ~]$ conda install --name <name_of_env> <name_of_package>
Besides being more flexible, this installation method won't interfere with software modules.
Using Python on CHPC
Set up Python virtual environment
4. Install the numpy package:
(conda_env)[xhuang@login01 ~]$ conda install --name <name_of_env> numpy
5. Install the matplotlib package:
(conda_env)[xhuang@login01 ~]$ conda install --name <name_of_env> matplotlib
6. Install the pandas package:
(conda_env)[xhuang@login01 ~]$ conda install --name <name_of_env> pandas
Using Python on CHPC
Once the virtual environment is set up, write Python scripts and run them on the CHPC
1. Write script:
(py_vir_env)[xhuang@login02 ~]$ vi TEST.py
#!/usr/bin/python
base = 2.5
height = 3
area = base * height / 2
print(area)
2. Make it executable:
(py_vir_env)[xhuang@login02 ~]$ chmod +x TEST.py
3. Run script:
(py_vir_env)[xhuang@login02 ~]$ ./TEST.py
3.75
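A possible variation of TEST.py, offered as a sketch rather than the workshop's script. It assumes that a first line of `#!/usr/bin/env python` makes the script use whichever python is first on the PATH, which is normally the one from the activated environment; the function name `triangle_area` is also an assumption.

```python
#!/usr/bin/env python
# Assumption: "env python" resolves to the active environment's interpreter.

def triangle_area(base, height):
    """Return the area of a triangle from its base and height."""
    return base * height / 2

# This part runs only when the file is executed as a script,
# not when it is imported from another Python file.
if __name__ == '__main__':
    print(triangle_area(2.5, 3))
```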