Unit - 4 - Part 2
AGENDA
▪ Reading Data
▪ Selecting and Filtering the Data
▪ Grouping
▪ Slicing
▪ Data manipulation
▪ loc, iloc
▪ Sorting
▪ Handling Missing values
▪ Aggregation
Python Libraries for Data Science
NumPy:
▪ introduces objects for multidimensional arrays and matrices, as
well as functions that make it easy to perform advanced
mathematical and statistical operations on those objects
Link: http://www.numpy.org/
SciPy:
▪ collection of algorithms for linear algebra, differential
equations, numerical integration, optimization, statistics and
more
▪ built on NumPy
Link: https://www.scipy.org/scipylib/
Pandas:
▪ adds data structures and tools designed to work with table-like
data (similar to Series and Data Frames in R)
Link: http://pandas.pydata.org/
SciKit-Learn:
▪ provides machine learning algorithms: classification, regression,
clustering, model validation etc.
Link: http://scikit-learn.org/
matplotlib:
▪ Python 2D plotting library that produces publication-quality
figures in a variety of hardcopy formats
Seaborn:
▪ based on matplotlib
Link: https://seaborn.pydata.org/
Loading Python Libraries
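A minimal sketch of the usual import block, assuming only NumPy and pandas are needed for the examples that follow. The aliases np and pd are community conventions rather than requirements. The plotting libraries are left as comments because they are not used in the later examples.

```python
# Conventional aliases for the two libraries used throughout these slides
import numpy as np
import pandas as pd

# Plotting libraries are typically imported the same way when needed:
# import matplotlib.pyplot as plt
# import seaborn as sns

print(np.__version__, pd.__version__)
```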
Reading data using pandas
In [ ]:
import pandas as pd
#Read csv file
df = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/Salaries.csv")
#Other pandas readers for different file formats:
pd.read_excel('myfile.xlsx', sheet_name='Sheet1',
              index_col=None, na_values=['NA'])
pd.read_stata('myfile.dta')
pd.read_sas('myfile.sas7bdat')
pd.read_hdf('myfile.h5', 'df')
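The reader functions above all return a DataFrame. A small sketch that runs without a network connection or a file on disk, assuming a few made-up rows shaped like the Salaries data (the values are hypothetical):

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file or URL
csv_text = """rank,discipline,phd,service,sex,salary
Prof,B,56,49,Male,186960
Prof,A,12,6,Male,93000
AsstProf,B,5,NA,Female,97032
"""

# read_csv accepts any file-like object; na_values marks strings to treat as missing
df = pd.read_csv(io.StringIO(csv_text), na_values=["NA"])
print(df.shape)
print(df)
```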
Exploring data frames
Data Frame data types
In [4]:
#Check a particular column type
df['salary'].dtype
Out[4]: dtype('int64')
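To see the types of all columns at once, the dtypes attribute can be used. A short sketch on a hypothetical three-row data frame (column names follow the Salaries example, values are made up):

```python
import pandas as pd

# Hypothetical data: one text column, one float column, one integer column
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof"],
    "phd": [56.0, 5.0, 12.0],
    "salary": [186960, 97032, 93000],
})

print(df.dtypes)                 # type of every column
salary_type = df["salary"].dtype # type of one column
phd_type = df["phd"].dtype
print(salary_type, phd_type)
```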
Data Frames attributes
df.attribute   description
dtypes         list the types of the columns
columns        list the column names
axes           list the row labels and column names
ndim           number of dimensions
size           number of elements
shape          return a tuple representing the dimensionality
values         numpy representation of the data
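The attributes in the table can be tried on a small data frame. This is an illustrative sketch with made-up values, not output from the Salaries file:

```python
import pandas as pd

# Hypothetical 3 x 3 data frame
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof"],
    "sex": ["Male", "Female", "Male"],
    "salary": [186960, 97032, 93000],
})

print(df.columns)  # column names
print(df.axes)     # [row labels, column names]
print(df.ndim)     # number of dimensions
print(df.size)     # number of elements (rows * columns)
print(df.shape)    # (rows, columns)
print(df.values)   # underlying NumPy array
```

Attributes are accessed without parentheses, unlike methods.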
Data Frames methods
df.method()            description
head([n]), tail([n])   first/last n rows
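A quick sketch of head and tail on a hypothetical eight-row data frame. When n is omitted, both return 5 rows:

```python
import pandas as pd

# Hypothetical data frame with 8 rows
df = pd.DataFrame({"salary": [50000 + 1000 * i for i in range(8)]})

first_default = df.head()   # first 5 rows (default n)
first_three = df.head(3)    # first 3 rows
last_two = df.tail(2)       # last 2 rows

print(first_three)
print(last_two)
```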
Selecting a column in a Data Frame
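Method 1: Subset the data frame using the column name:
df['sex']
Method 2: Use the column name as an attribute:
df.sex
Note: pandas data frames already have a method named rank, so df.rank refers to
that method rather than to the column. To select a column named "rank" we should
use method 1.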
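The name clash with rank can be checked directly. A small sketch on hypothetical data, assuming a data frame with columns named rank, sex and salary:

```python
import pandas as pd

# Hypothetical data with a column called "rank"
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof"],
    "sex": ["Male", "Female", "Male"],
    "salary": [186960, 97032, 93000],
})

sex_by_name = df["sex"]   # bracket selection
sex_by_attr = df.sex      # attribute selection, same result here

rank_column = df["rank"]  # the column, as a Series
rank_attr = df.rank       # the DataFrame.rank method, not the column

print(type(rank_column))
print(callable(rank_attr))
```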
Data Frames groupby method
In [ ]:
#Group data using rank
df_rank = df.groupby(['rank'])
#Calculate mean value for each numeric column per each group
df_rank.mean(numeric_only=True)
Once a groupby object is created, we can calculate various statistics for each
group.
Note: If single brackets are used to specify the column (e.g. salary), the
output is a pandas Series object. When double brackets are used, the
output is a DataFrame.
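The difference between single and double brackets on a groupby object can be sketched as follows. The data are hypothetical and only the column names follow the Salaries example:

```python
import pandas as pd

# Hypothetical salaries for two ranks
df = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf", "AsstProf"],
    "sex": ["Male", "Female", "Male", "Female"],
    "salary": [120000, 100000, 80000, 70000],
})

by_rank = df.groupby("rank")

mean_series = by_rank["salary"].mean()    # single brackets: Series
mean_frame = by_rank[["salary"]].mean()   # double brackets: DataFrame

print(mean_series)
print(mean_frame)
```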
Data Frame: filtering
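Filtering is commonly done with Boolean indexing: a condition on a column produces a True/False mask, and the mask selects the matching rows. A minimal sketch on hypothetical data (the threshold of 100000 is arbitrary):

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "salary": [186960, 93000, 97032, 110515],
})

# Rows where salary exceeds a threshold
high_salary = df[df["salary"] > 100000]

# Rows matching a string value
women = df[df["sex"] == "Female"]

# Conditions are combined with & (and) or | (or); each needs parentheses
women_high = df[(df["sex"] == "Female") & (df["salary"] > 100000)]

print(high_salary)
print(women_high)
```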
22
Data Frames: Slicing
23
Data Frames: Slicing
When selecting one column, it is possible to use a single set of brackets, but
the resulting object will be a Series (not a DataFrame):
In [ ]: #Select column salary:
df['salary']
When we need to select more than one column and/or make the output
a DataFrame, we should use double brackets:
In [ ]: #Select columns rank and salary:
df[['rank','salary']]
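The resulting types can be verified with a short sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof"],
    "sex": ["Male", "Female", "Male"],
    "salary": [186960, 97032, 93000],
})

one_col_series = df["salary"]          # Series
one_col_frame = df[["salary"]]         # DataFrame with a single column
two_cols = df[["rank", "salary"]]      # DataFrame with two columns

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(two_cols.shape)
```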
24
Data Frames: Selecting rows
If we need to select a range of rows, we can specify the range using ":".
Notice that the first row has position 0 and that the last value in the range is
omitted.
So for the range 0:10, the first 10 rows are returned, with positions starting
at 0 and ending at 9.
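The 0:10 case can be sketched on a hypothetical 15-row data frame with the default integer index:

```python
import pandas as pd

# Hypothetical data frame with 15 rows
df = pd.DataFrame({"salary": range(50000, 65000, 1000)})

# Positions 0 through 9; position 10 is excluded
first_ten = df[0:10]

print(len(first_ten))
print(first_ten.index.tolist())
```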
25
Data Frames: method loc
If we need to select a range of rows using their labels, we can use loc.
Unlike position-based slicing, a label range includes both its start and its end:
In [ ]: #Select rows by their labels:
df_sub.loc[10:20,['rank','sex','salary']]
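A runnable sketch of label-based selection. The data are hypothetical, and df_sub is built here by filtering so that its row labels no longer start at 0:

```python
import pandas as pd

# Hypothetical data frame with 30 rows, labels 0..29
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf"] * 15,
    "sex": ["Male", "Female", "Female"] * 10,
    "salary": [50000 + 1000 * i for i in range(30)],
})

# Filtering keeps the original labels: df_sub has labels 10..29
df_sub = df[df["salary"] >= 60000]

# Labels 10 through 20, both ends included
by_label = df_sub.loc[10:20, ["rank", "sex", "salary"]]

print(by_label.index.tolist())
```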
Data Frames: method iloc
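iloc selects rows and columns by integer position rather than by label, and the end of a position range is excluded. A sketch on the same kind of hypothetical filtered data frame, where positions and labels differ:

```python
import pandas as pd

# Hypothetical data frame with 30 rows, labels 0..29
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf"] * 15,
    "sex": ["Male", "Female", "Female"] * 10,
    "salary": [50000 + 1000 * i for i in range(30)],
})
df_sub = df[df["salary"] >= 60000]   # labels 10..29, positions 0..19

# First five rows by position; columns at positions 0 and 2
by_position = df_sub.iloc[0:5, [0, 2]]

print(by_position)
```

Position 0 of df_sub corresponds to label 10, which is why loc and iloc give different rows for the same numbers.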
Data Frames: method iloc (summary)
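Common iloc patterns, sketched on a hypothetical 6 x 4 data frame of integers. The column names a to d are made up for illustration:

```python
import pandas as pd

# Hypothetical 6 x 4 data frame
df = pd.DataFrame({
    "a": range(0, 6),
    "b": range(10, 16),
    "c": range(20, 26),
    "d": range(30, 36),
})

first_row = df.iloc[0]            # first row (as a Series)
last_row = df.iloc[-1]            # last row
first_col = df.iloc[:, 0]         # first column
last_col = df.iloc[:, -1]         # last column
first_3_rows = df.iloc[0:3]       # first 3 rows
first_2_cols = df.iloc[:, 0:2]    # first 2 columns
block = df.iloc[1:3, 0:2]         # 2nd and 3rd rows, first 2 columns
picked = df.iloc[[0, 5], [1, 3]]  # 1st and 6th rows, 2nd and 4th columns

print(picked)
```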
28
Data Frames: Sorting
We can sort the data by the values in a column. By default, sorting is in
ascending order and a new data frame is returned.
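Sorting by one column and by two columns can be sketched with sort_values. The data are hypothetical, and the column names service and salary follow the Salaries example:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "service": [5, 20, 5, 10],
    "salary": [70000, 120000, 80000, 90000],
})

# Sort by one column, ascending by default; df itself is not modified
by_service = df.sort_values(by="service")

# Sort by two columns: service ascending, then salary descending
by_both = df.sort_values(by=["service", "salary"], ascending=[True, False])

print(by_service)
print(by_both)
```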
30
Missing Values
There are a number of methods to deal with missing values in the data
frame:
df.method()   description
dropna()      Drop missing observations
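A sketch of detecting and dropping missing values on hypothetical delay data. Besides dropna, it uses isnull and fillna, which are standard pandas methods not listed in the table above:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries
df = pd.DataFrame({
    "dep_delay": [2.0, np.nan, 4.0, np.nan],
    "arr_delay": [11.0, 20.0, np.nan, np.nan],
})

missing_per_column = df.isnull().sum()   # count of missing values per column
complete_rows = df.dropna()              # drop rows with any missing value
not_all_missing = df.dropna(how="all")   # drop rows only if every value is missing
filled = df.fillna(0)                    # replace missing values with 0

print(missing_per_column)
print(complete_rows)
```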
Aggregation Functions in Pandas
min, max
count, sum, prod
mean, median, mode, mad (mean absolute deviation; removed in pandas 2.0)
std, var
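A short sketch of several of these functions on a hypothetical Series with one missing value. By default they skip missing values. The last line computes the mean absolute deviation by hand, as one possible substitute where the mad method is unavailable:

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
s = pd.Series([1.0, 2.0, np.nan, 4.0])

n = s.count()        # non-missing values only
total = s.sum()
lowest, highest = s.min(), s.max()
middle = s.median()
average = s.mean()

# Mean absolute deviation computed manually
mean_abs_dev = (s - s.mean()).abs().mean()

print(n, total, lowest, highest, middle, average, mean_abs_dev)
```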
34
Aggregation Functions in Pandas
The agg() method is useful when multiple statistics are computed per column:
In [ ]: flights[['dep_delay','arr_delay']].agg(['min','mean','max'])
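A runnable version of the call above, assuming a small made-up flights data frame (the real flights data set is not reproduced here). The result has one row per statistic and one column per input column:

```python
import pandas as pd

# Hypothetical stand-in for the flights data
flights = pd.DataFrame({
    "dep_delay": [2.0, 4.0, 6.0],
    "arr_delay": [10.0, 20.0, 30.0],
})

summary = flights[["dep_delay", "arr_delay"]].agg(["min", "mean", "max"])

print(summary)
```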
Basic Descriptive Statistics
df.method()   description
describe()    Basic statistics (count, mean, std, min, quantiles, max)
kurt()        kurtosis
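A sketch of describe and kurt on a hypothetical single-column data frame. By default describe reports the 25%, 50% and 75% quantiles:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"salary": [1.0, 2.0, 3.0, 4.0, 5.0]})

stats = df.describe()                 # one row per statistic
salary_kurt = df["salary"].kurt()     # kurtosis of one column

print(stats)
print(salary_kurt)
```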