Using Python For Data Analysis - July 2018 - Slides
Malcolm Tobias
[email protected]
(314) 362-1594
Xing Huang
[email protected]
http://chpc2.wustl.edu
http://chpc.wustl.edu
CHPC/BBML – Python Workshops
Maze Ndonwi
[email protected]
Marcy Vana
[email protected]
https://becker.wustl.edu/services/science-informatics-support
Aditi Gupta
[email protected]
Madhurima Kaushal
[email protected]
https://informatics.wustl.edu
Goals
- Learn to write Python code to perform data
analysis and visualization
- Learn to run Python code in the virtual
environment set up on the CHPC
Topics covered in Intro to Python #1
Python lists
Numpy arrays
Jupyter Notebook
Topics to be covered today
And Or Not
for … in …
Syntax:
for iterating_variable in sequence:
    expression 1
    ……
    expression n
Example:
height = [74, 70, 63, 69, 67,
          71, 64, 66, 71]
sum = 0.0
for i in range(len(height)):
    sum += height[i]
avg = sum/len(height)
print('The average height is:')
print(avg)
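The same average can be computed by iterating over the list elements directly instead of over index positions. The sketch below is one possible variant, not the workshop's own code; the names `total` and `h` are illustrative choices, with `total` used so that the built-in `sum()` function is not shadowed.

```python
height = [74, 70, 63, 69, 67, 71, 64, 66, 71]

# Iterate over the values themselves rather than over range(len(height)).
total = 0.0
for h in height:
    total += h

avg = total / len(height)
print('The average height is:')
print(avg)

# The built-in sum() gives the same result without an explicit loop.
avg_builtin = sum(height) / len(height)
print(avg_builtin)
```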
Built-in Functions
Syntax: object.method(parameters)
String Methods
List Methods
Methods
String Methods
append()
>>> fam.append(1.78)
>>> fam
['emma', 1.68, 'mom', 1.71, 'dad', 1.89, 1.78]
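A short sketch of the object.method(parameters) syntax for both a list and a string. The starting contents of `fam` are inferred from the output shown above, and the variable `name` is introduced here only for illustration.

```python
# List methods: called on a list object.
fam = ['emma', 1.68, 'mom', 1.71, 'dad', 1.89]
fam.append(1.78)               # add one element to the end of the list
print(fam)
mom_position = fam.index('mom')  # position of the first matching element
print(mom_position)

# String methods: called on a string object.
name = 'emma'
capitalized = name.capitalize()  # returns a new string; name is unchanged
m_count = name.count('m')        # number of occurrences of 'm'
print(capitalized)
print(m_count)
```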
Packages
A collection of Python scripts
Numpy arrays
Pandas dataframe
data.csv
    country        population     area        capital
US  United States  326,474,013    9,144,930   Washington, DC
RU  Russia         143,375,006    16,292,614  Moscow
IN  India          1,342,512,706  2,973,450   New Delhi
CH  China          1,388,232,693  9,386,293   Beijing
BR  Brazil         211,243,220    8,349,534   Brasilia
column labels: country, population, area, capital
row labels: US, RU, IN, CH, BR
Pandas
>>> data
   Unnamed: 0        country  population      area         capital
0          US  United States   326474013   9144930  Washington, DC
1          RU         Russia   143375006  16292614          Moscow
2          IN          India  1342512706   2973450       New Delhi
3          CH          China  1388232693   9386293         Beijing
4          BR         Brazil   211243220   8349534        Brasilia
Pandas
>>> data
          country  population      area         capital
US  United States   326474013   9144930  Washington, DC
RU         Russia   143375006  16292614          Moscow
IN          India  1342512706   2973450       New Delhi
CH          China  1388232693   9386293         Beijing
BR         Brazil   211243220   8349534        Brasilia
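The two outputs above differ only in how the file is read. The sketch below reproduces both, assuming the CSV has an empty first header cell and plain integers without thousands separators (the exact layout of data.csv is not shown in the slides). The text is read from an in-memory string via `io.StringIO` so the example runs without the file; with the real file, the file name would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Assumed contents of data.csv (layout is a guess based on the outputs above).
csv_text = '''\
,country,population,area,capital
US,United States,326474013,9144930,"Washington, DC"
RU,Russia,143375006,16292614,Moscow
IN,India,1342512706,2973450,New Delhi
CH,China,1388232693,9386293,Beijing
BR,Brazil,211243220,8349534,Brasilia
'''

# Default read: rows are labeled 0..4 and the first column is "Unnamed: 0".
data_default = pd.read_csv(io.StringIO(csv_text))
print(data_default)

# index_col=0: the first column becomes the row labels.
data = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(data)
```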
Pandas
Column Access
>>> data[["country"]]
US  United States
RU  Russia
IN  India
CH  China
BR  Brazil
This output results from specifying index_col=0.
In the case of column access, this is optional, but the output is formatted differently without it.
Row Access
>>> data.loc[["BR"]]
   country  population     area   capital  density
BR  Brazil   211243220  8349534  Brasilia       25
This output results from specifying index_col=0.
In the case of row access, this is required; otherwise Python will raise an exception.
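A minimal sketch of column and row access, including what happens to label-based row access when index_col=0 is omitted. The CSV text is an assumed stand-in for data.csv, and the variable names (`col_df`, `col_series`, `row_df`, `raised`) are illustrative.

```python
import io
import pandas as pd

# Assumed stand-in for data.csv.
csv_text = '''\
,country,population,area,capital
US,United States,326474013,9144930,"Washington, DC"
RU,Russia,143375006,16292614,Moscow
IN,India,1342512706,2973450,New Delhi
CH,China,1388232693,9386293,Beijing
BR,Brazil,211243220,8349534,Brasilia
'''
data = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Column access: double brackets return a DataFrame, single brackets a Series.
col_df = data[["country"]]
col_series = data["country"]
print(type(col_df).__name__, type(col_series).__name__)

# Row access by label with .loc; double brackets again keep a DataFrame.
row_df = data.loc[["BR"]]
print(row_df)

# Without index_col=0 the row labels are 0..4, so "BR" is not a valid label.
data_default = pd.read_csv(io.StringIO(csv_text))
raised = False
try:
    data_default.loc[["BR"]]
except KeyError:
    raised = True
print("KeyError raised:", raised)
```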
Pandas
Add Column
>>> data
          country  population      area         capital  on_earth
US  United States   326474013   9144930  Washington, DC      True
RU         Russia   143375006  16292614          Moscow      True
IN          India  1342512706   2973450       New Delhi      True
CH          China  1388232693   9386293         Beijing      True
BR         Brazil   211243220   8349534        Brasilia      True
Pandas
Add Column
>>> data
          country  population      area         capital  density
US  United States   326474013   9144930  Washington, DC       36
RU         Russia   143375006  16292614          Moscow        9
IN          India  1342512706   2973450       New Delhi      452
CH          China  1388232693   9386293         Beijing      148
BR         Brazil   211243220   8349534        Brasilia       25
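The slides show the resulting tables but not the statements that produced them. The sketch below is one way the two columns could be added; computing density as population divided by area and rounding to the nearest integer is an assumption that happens to match the values shown. The CSV text is an assumed stand-in for data.csv.

```python
import io
import pandas as pd

# Assumed stand-in for data.csv.
csv_text = '''\
,country,population,area,capital
US,United States,326474013,9144930,"Washington, DC"
RU,Russia,143375006,16292614,Moscow
IN,India,1342512706,2973450,New Delhi
CH,China,1388232693,9386293,Beijing
BR,Brazil,211243220,8349534,Brasilia
'''
data = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Assigning a single value fills the whole new column with it.
data["on_earth"] = True

# Assigning an expression built from other columns works element by element.
# Rounding to whole numbers is an assumption made to match the slide output.
data["density"] = (data["population"] / data["area"]).round().astype(int)

print(data[["country", "on_earth", "density"]])
```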
Pandas
Element Access
>>> data["capital"].loc["US"]
Washington, DC
>>> data.loc["US"]["capital"]
Washington, DC
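Both forms above select in two steps (a column, then a row, or the reverse). As a sketch of an alternative, `.loc` also accepts the row label and column label together, and `.at` does the same for a single value. The CSV text is an assumed stand-in for data.csv, shortened to two rows; the variable names are illustrative.

```python
import io
import pandas as pd

# Assumed stand-in for data.csv (two rows are enough here).
csv_text = '''\
,country,population,area,capital
US,United States,326474013,9144930,"Washington, DC"
BR,Brazil,211243220,8349534,Brasilia
'''
data = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Two-step selection, as on the slide.
two_step_a = data["capital"].loc["US"]
two_step_b = data.loc["US"]["capital"]

# Single-step selection: row label and column label in one call.
one_step = data.loc["US", "capital"]
scalar = data.at["US", "capital"]   # .at is meant for single values

print(two_step_a, two_step_b, one_step, scalar, sep="\n")
```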
Matplotlib
Advanced features
1. Write script:
(py_vir_env)[xhuang@login02 ~]$ vi TEST.py
#!/usr/bin/python
base = 2.5
height = 3
area = base * height / 2
print(area)
2. Make it executable:
(py_vir_env)[xhuang@login02 ~]$ chmod +x TEST.py
3. Run script:
(py_vir_env)[xhuang@login02 ~]$ ./TEST.py
3.75