Phan1_Pandas_Numpy_Matplotlib
Python 3x
Tutor Mrs. Mỹ Linh
Programming
Time: 90 mins
Content
• Python Pandas
• Series
• DataFrame
• Panel
• Basic Functionality
• Descriptive Statistics
• Function Application
• Reindexing
• Iteration
• Sorting
• Working with Text Data
• Options & Customization
• Indexing & Selecting Data
• Statistical Functions
• Window Functions
• Aggregations
• Missing Data
• GroupBy
• Merging/Joining
• Concatenation
• Date Functionality
• Timedelta
• Categorical Data
• Visualization
• IO Tools
• Sparse Data
• Caveats & Gotchas
Introduction to Pandas
Python Pandas
• Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from "Panel Data", an econometrics term for multidimensional data.
• To use Pandas, import it first: import pandas as pd
• Pandas deals with the following three data structures
• Series: dimension = 1
• DataFrame: dimension = 2
• Panel: dimension = 3 (deprecated in pandas 0.20 and removed in 0.25)
• Key features
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.
Python Pandas - Series
• Create: pandas.Series(data, index, dtype, copy)
• Data: data takes various forms like ndarray, list, constants
• Index: index values must be hashable and the same length as data; unique labels are recommended. Defaults to np.arange(n) if no index is passed.
• Dtype: the data type. If None, the data type will be inferred
• Copy: copy the input data. Default False
• Retrieve data using labels
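As a minimal sketch of the constructor described above (the values and labels are made up for illustration), a Series can be built from an ndarray, a dict, or a constant, and then read back by label:

```python
import numpy as np
import pandas as pd

# From an ndarray with explicit labels (labels are illustrative).
s = pd.Series(np.array([10, 20, 30]), index=["a", "b", "c"])

# From a dict: the keys become the index, dtype is inferred.
s_dict = pd.Series({"x": 1.5, "y": 2.5})

# From a constant: an index is required and the value is repeated.
s_const = pd.Series(5, index=[0, 1, 2])

# Retrieve data using labels.
one_value = s["b"]            # a single label
two_values = s[["a", "c"]]    # a list of labels
print(one_value)
print(two_values)
```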
Python Pandas - DataFrame
• Create: pandas.DataFrame(data, index, columns, dtype, copy)
• Columns: optional column labels. Defaults to np.arange(n) if no column labels are passed.
• Topics covered
• Creating a DataFrame in many ways
• Adding a column
• Deleting a column
• Row selection, addition, and deletion
Example – Create Dataframe
Column Addition
Column Deletion
Row Selection, Addition, and Deletion
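The slide examples for these four topics are not reproduced here, so the following is only a rough sketch with made-up names and values. It assumes a recent pandas, where rows are added with pd.concat because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Create a DataFrame from a dict of lists (names and values are made up).
df = pd.DataFrame(
    {"Name": ["Tom", "Jack", "Steve"], "Age": [28, 34, 29]},
    index=["r1", "r2", "r3"],
)

# Column addition: assign a new column derived from an existing one.
df["AgeNextYear"] = df["Age"] + 1

# Column deletion: drop returns a new frame (del df["col"] works in place).
df = df.drop(columns=["AgeNextYear"])

# Row selection by label and by integer position.
row_by_label = df.loc["r2"]
row_by_pos = df.iloc[0]

# Row addition: concatenate another frame with the same columns.
extra = pd.DataFrame({"Name": ["Ricky"], "Age": [42]}, index=["r4"])
df = pd.concat([df, extra])

# Row deletion: drop by index label.
df = df.drop(index="r1")
print(df)
```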
Python Pandas - Panel
• Note: Panel was deprecated in pandas 0.20 and removed in 0.25, so the call below only works on older versions.
• Create: pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
• Data: data takes various forms like ndarray, Series, map, lists, dict, constants and also another DataFrame
• Items: axis=0
• Major_axis: axis=1
• Minor_axis: axis=2
• Dtype: data type of each column
• Copy: copy data. Default False
Example - From 3D ndarray
Series Basic Functionality
DataFrame Basic Functionality
Function Application
• To apply your own or another library's functions to Pandas objects, you should be aware of three important methods. The appropriate method depends on whether your function expects to operate on an entire DataFrame, row- or column-wise, or element-wise.
• Table-wise function application: pipe()
• Row- or column-wise function application: apply()
• Element-wise function application: applymap() (renamed to DataFrame.map() in pandas 2.1)
• Suppose df is a DataFrame and adder is a function
• Table-wise: df.pipe(adder, 2)
• Column-wise: df.apply(np.mean)
• Row-wise: df.apply(np.mean, axis=1)
• With a lambda: df.apply(lambda x: x.max() - x.min())
• Element-wise on Series data: df['Salary'].map(lambda x: x*10)
• Element-wise on a DataFrame: df.applymap(lambda x: x*10)
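A small runnable sketch of the calls above. The adder helper and the column values are assumptions made for illustration. Each result is stored under its own name, since reassigning df after every call would break the next one. The element-wise DataFrame case maps each column in turn, which works whether the installed pandas calls the method applymap() or map().

```python
import numpy as np
import pandas as pd

def adder(frame, amount):
    """Hypothetical helper: add a constant to every element."""
    return frame + amount

df = pd.DataFrame({"Salary": [100, 200, 300], "Bonus": [10, 20, 30]})

table_wise = df.pipe(adder, 2)                       # whole table at once
col_means = df.apply(np.mean)                        # one value per column
row_means = df.apply(np.mean, axis=1)                # one value per row
col_ranges = df.apply(lambda x: x.max() - x.min())   # max - min per column
salary_x10 = df["Salary"].map(lambda x: x * 10)      # element-wise on a Series

# Element-wise on every cell, written in a version-independent way.
all_x10 = df.apply(lambda col: col.map(lambda x: x * 10))
print(all_x10)
```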
12/28/2022 18
Mapping
• map = {
'label1' : 'value1',
'label2' : 'value2',
...
}
• The functions that you will see in this section perform specific operations, but they all accept a dict object.
• replace(): replaces values
• map(): creates a new column
• rename(): replaces the index values
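The three dict-based functions can be sketched on a toy frame as follows. The items, colors, and prices are invented for the example.

```python
import pandas as pd

frame = pd.DataFrame(
    {"item": ["ball", "mug", "pen"], "color": ["rosso", "verde", "rosso"]}
)

# replace(): substitute values using a dict of old -> new.
new_colors = {"rosso": "red", "verde": "green"}
frame = frame.replace(new_colors)

# map(): build a new column by looking each item up in a dict.
prices = {"ball": 5.5, "mug": 4.2, "pen": 1.3}
frame["price"] = frame["item"].map(prices)

# rename(): replace index labels using a dict.
frame = frame.rename(index={0: "first", 1: "second", 2: "third"})
print(frame)
```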
Adding Values via Mapping
Rename the Indexes of the Axes
Re-indexing
• Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.
• Multiple operations can be accomplished through reindexing, for example:
• Insert missing value (NA) markers in label locations where no data for the label existed.
Re-index to Align with Other Objects
Filling while Re-indexing
• reindex() takes an optional parameter, method, which selects a filling method. Its values are:
• pad/ffill: fill values forward
• bfill/backfill: fill values backward
• nearest: fill from the nearest index values
Limits on Filling while Re-indexing
• The limit argument provides additional control over filling while reindexing. It specifies the maximum number of consecutive missing labels to fill.
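To make reindexing, the method argument, and limit concrete, here is a minimal sketch on a two-row frame (the data is made up). Filling with a method requires a monotonic index, which this example has.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": [1.0, 2.0]}, index=[0, 1])

# Plain reindex: labels that had no data get NaN markers.
plain = df1.reindex([0, 1, 2, 3])

# method="ffill" carries the last valid row forward into the new labels.
filled = df1.reindex([0, 1, 2, 3], method="ffill")

# limit=1 fills at most one consecutive missing label.
limited = df1.reindex([0, 1, 2, 3], method="ffill", limit=1)

# reindex_like aligns one object to the labels of another object.
other = pd.DataFrame({"col1": [0.0]}, index=[1])
aligned = df1.reindex_like(other)
print(limited)
```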
Renaming
• The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary
function.
ITERATION
• The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is
regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame and
Panel, follow the dict-like convention of iterating over the keys of the objects.
• In short, basic iteration (for i in object) produces −
• Series − values
• DataFrame − column labels
• Panel − item labels
ITERATOR COLUMN
• Iterating a DataFrame gives column names
ITERATOR ROWS
• To iterate over the contents of a DataFrame, we can use the following functions:
• items() (named iteritems() before it was removed in pandas 2.0): iterate over the (key, value) pairs
• iterrows(): iterate over the rows as (index, Series) pairs
• itertuples(): iterate over the rows as namedtuples
items()
• Iterates over each column as a (key, value) pair, with the label as key and the column values as a Series object.
iterrows()
• iterrows() returns an iterator yielding each index value along with a Series containing the data in each row.
itertuples()
• itertuples() returns an iterator yielding a named tuple for each row in the DataFrame. The first element of the tuple is the row's index value, while the remaining values are the row values.
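The iteration styles can be compared side by side in a short sketch. The frame is made up, and the column iterator is written as items(), the name used by current pandas (older versions called it iteritems()).

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3.5, 4.5]}, index=["a", "b"])

# Basic iteration over a DataFrame yields the column labels.
column_labels = [label for label in df]

# items(): (column label, column as a Series) pairs.
for label, column in df.items():
    print(label, column.tolist())

# iterrows(): (index value, row as a Series) pairs.
for idx, row in df.iterrows():
    print(idx, row["col1"], row["col2"])

# itertuples(): one namedtuple per row; the first field is the index.
for t in df.itertuples():
    print(t.Index, t.col1, t.col2)

first_tuple = next(df.itertuples())
```

Because iterrows() packs each row into a single Series, values in a row may be upcast to a common dtype; itertuples() keeps the per-column values as they are.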
Sorting
• There are two kinds of sorting available in Pandas:
• By label
• By actual value
• The examples use randomly generated data
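A minimal sketch of both kinds of sorting, assuming a small randomly generated frame (the seed, shape, and labels are arbitrary choices for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded so the run is repeatable
unsorted_df = pd.DataFrame(
    rng.standard_normal((5, 2)),
    index=[1, 4, 2, 0, 3],
    columns=["col2", "col1"],
)

# Sorting by label.
by_label = unsorted_df.sort_index()                      # rows, ascending
by_label_desc = unsorted_df.sort_index(ascending=False)  # rows, descending
by_column_label = unsorted_df.sort_index(axis=1)         # columns by name

# Sorting by actual value.
by_value = unsorted_df.sort_values(by="col1")
print(by_value)
```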
Working with Text Data
• Pandas provides a set of string functions which make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values.
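For instance, the string functions are reached through the .str accessor of a Series. The sample strings below are made up; the point of the sketch is that the missing entry stays missing instead of raising an error.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Tom ", " William Rick", np.nan, "Alber@t"])

lowered = s.str.lower()                      # lower-case every string
stripped = s.str.strip()                     # trim surrounding whitespace
lengths = s.str.len()                        # number of characters
has_at = s.str.contains("@", regex=False)    # literal substring test
print(lowered)
```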
Options and Customization
• get_option(param): takes a single parameter and returns its current value
• set_option(param, value): takes two arguments and sets the parameter to the given value
• reset_option(param): takes one argument and sets the value back to the default
• describe_option(param): prints the description of the parameter
• option_context(): a context manager used to set an option temporarily inside a with statement. Option values are restored automatically when you exit the with block
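A short sketch of the five option functions, using display.max_rows as the example parameter (the choice of parameter and values is only for illustration):

```python
import pandas as pd

before = pd.get_option("display.max_rows")     # read the current value

pd.set_option("display.max_rows", 80)          # change it
changed = pd.get_option("display.max_rows")

pd.reset_option("display.max_rows")            # back to the default
pd.describe_option("display.max_rows")         # prints the description

# Temporary change: restored automatically when the block exits.
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
after = pd.get_option("display.max_rows")
print(before, changed, inside, after)
```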
Indexing and Selecting Data in Pandas
• The Python and NumPy indexing operators "[ ]" and the attribute operator "." provide quick and easy access to Pandas data structures across a wide range of use cases. However, since the type of the data to be accessed isn't known in advance, directly using standard operators has some optimization limits. For production code, we recommend the optimized pandas data access methods explained here.
• Pandas supports three types of multi-axes indexing:
• .loc(): label based
• .iloc(): integer based
• .ix(): both label and integer based (deprecated in pandas 0.20 and removed in 1.0)
.loc()
• Pandas provides various methods for purely label-based indexing. When slicing with labels, both the start bound and the end bound are included. Integers are valid labels, but they refer to the label and not the position.
• .loc() has multiple access methods like:
• A single scalar label
• A list of labels
• A slice object
• A Boolean array
• loc takes two single/list/range operators separated by ','. The first one indicates the rows and the second one indicates the columns.
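The access methods listed for .loc can be sketched on a small frame with made-up labels. Note that a label slice includes its end label.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40], "C": [5, 6, 7, 8]},
    index=["a", "b", "c", "d"],
)

one_cell = df.loc["a", "A"]            # scalar row label, scalar column label
some_cols = df.loc[:, ["A", "C"]]      # all rows, a list of column labels
label_slice = df.loc["a":"c"]          # slice of labels, "c" is included
bool_rows = df.loc[df["A"] > 2, "B"]   # Boolean array selects the rows
print(label_slice)
```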
.iloc()
• Pandas provides various methods for purely integer-based indexing. Like Python and NumPy, these use 0-based indexing.
• The various access methods are as follows:
• An integer
• A list of integers
• A range of values
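A matching sketch for .iloc on a made-up frame. Positions start at 0 and, unlike label slices, an integer slice excludes its end position.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40], "C": [5, 6, 7, 8]},
    index=["a", "b", "c", "d"],
)

first_row = df.iloc[0]             # a single integer position
rows_0_1 = df.iloc[0:2]            # a range of positions, end excluded
picked = df.iloc[[1, 3], [0, 2]]   # lists of row and column positions
cell = df.iloc[2, 1]               # row position 2, column position 1
print(picked)
```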
.ix()
• Besides pure label-based and integer-based indexing, older versions of Pandas provided a hybrid method for selecting and subsetting the object using the .ix() operator. It was deprecated in pandas 0.20 and removed in 1.0; use .loc() or .iloc() instead.
Use of Notations
• Getting values from a Pandas object with multi-axes indexing uses the following notation
• Note: .iloc() and .ix() apply the same indexing options and return values.
(Example 1) Use the basic indexing operator '[ ]'
Sort, Filter, Aggregation, Grouping, Pivot, Concatenation, Merge/Join in Pandas
Sort
• Sort by one column, ascending by default: df.sort_values(by='TOTAL')
Filter
• Filter rows whose index labels contain given characters: df.filter(like='bbi', axis=0)
• Filter rows by a comparison expression
• For example, get all orders with TOTAL greater than 100: df[df['TOTAL'] > 100]
• Filter rows with a custom function that returns a Boolean mask: df[custom(df['TAX'], df['TOTAL'])]
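The sort and filter calls can be tried on a toy orders table. The index labels, the amounts, and the body of custom are all assumptions made for this sketch; the slides do not show what custom computes, only that it takes the TAX and TOTAL columns.

```python
import pandas as pd

# Made-up orders, indexed by a label so that filter(like=...) has text to match.
df = pd.DataFrame(
    {"TOTAL": [250.0, 80.0, 120.0], "TAX": [25.0, 4.0, 30.0]},
    index=["rabbit", "mouse", "bobbin"],
)

sorted_df = df.sort_values(by="TOTAL")        # ascending by default
label_match = df.filter(like="bbi", axis=0)   # index labels containing "bbi"
big_orders = df[df["TOTAL"] > 100]            # comparison expression

def custom(tax, total):
    """Hypothetical predicate: True where tax exceeds 20% of the total."""
    return tax / total > 0.2

high_tax = df[custom(df["TAX"], df["TOTAL"])]
print(high_tax)
```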
Aggregation
Group
Grouping with user-defined function
• For example, group by Team and take the sum of Age over the first 10 records of each group
def custom_aggregate(series):
    return series.head(10).sum()
df.groupby(['Team'])['Age'].agg(custom_aggregate)
Pivot
• One of the most common tasks in data science is reshaping the data frame we have into a specific format.
• The examples use data about life expectancy (the number of years a person is expected to live, based on the statistical average; it varies by geographical area and by era).
• The Pandas function pivot_table helps us summarize and convert a dataframe in long form to a dataframe in wide form, in a variety of complex scenarios.
Pandas pivot_table on a data frame with three columns
• Pandas pivot_table gets more useful when we try to summarize and convert a tall data frame with more than two variables into a wide data frame. The example uses three columns: continent, year, and lifeExp
Pandas pivot_table with Different Aggregating Function
• pivot_table uses the mean function for aggregating or summarizing data by default. We can change the aggregating function if needed.
• For example, we can use aggfunc='max' to compute the maximum lifeExp instead of the mean lifeExp for each year and continent.
• Several functions can be passed as a list: pd.pivot_table(df[['continent', 'year', 'lifeExp']], values='lifeExp', index=['year'], columns='continent', aggfunc=['min', 'max'])
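The life-expectancy data itself is not included in these notes, so the sketch below uses a tiny made-up table with the same three column names to show the default mean, a single alternative function, and a list of functions.

```python
import pandas as pd

# Invented numbers; only the column names follow the slides.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Asia", "Europe"],
    "year": [1952, 1952, 1952, 1952, 2007, 2007],
    "lifeExp": [40.0, 50.0, 60.0, 70.0, 70.0, 80.0],
})

# Default aggregation is the mean of lifeExp per (year, continent).
mean_table = pd.pivot_table(df, values="lifeExp", index="year", columns="continent")

# A different aggregating function.
max_table = pd.pivot_table(
    df, values="lifeExp", index="year", columns="continent", aggfunc="max"
)

# Several functions at once: columns become (function, continent) pairs.
min_max = pd.pivot_table(
    df, values="lifeExp", index="year", columns="continent", aggfunc=["min", "max"]
)
print(min_max)
```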
Melt
• The Pandas melt() function changes the DataFrame format from wide to long. It creates a format in which one or more columns work as identifiers. All the remaining columns are treated as values and unpivoted to the row axis, leaving only two non-identifier columns: variable and value.
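As a quick illustration with made-up scores, melting a wide table keeps the identifier column and stacks the remaining columns into variable/value pairs:

```python
import pandas as pd

wide = pd.DataFrame({"Name": ["Ann", "Bob"], "Math": [90, 70], "Physics": [85, 65]})

# Name identifies each row; Math and Physics are unpivoted into rows.
long = pd.melt(wide, id_vars=["Name"], value_vars=["Math", "Physics"])
print(long)
```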
Concatenation
Advanced Concatenation
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
Merging
Joining
Data Manipulation in Pandas
Regex
Date Functionality
Time Delta
• Time deltas are differences in times, expressed in different units, for example, days, hours, minutes, seconds.
• They can be both positive and negative.
Example
Name | Code | Output
By passing a string literal, we can create a timedelta object. | pd.Timedelta('2 days 2 hours 15 minutes 30 seconds') | 2 days 02:15:30
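Beyond the string form above, a few other ways of building and using a Timedelta can be sketched as follows (the specific durations and the date are arbitrary):

```python
import pandas as pd

td = pd.Timedelta("2 days 2 hours 15 minutes 30 seconds")   # from a string
from_int = pd.Timedelta(6, unit="h")                        # number plus unit
from_kwargs = pd.Timedelta(days=2)                          # keyword arguments
negative = pd.Timedelta("-1 days")                          # deltas can be negative

# Adding a Timedelta to a Timestamp shifts the date.
shifted = pd.Timestamp("2022-12-28") + from_kwargs
print(td, from_int, negative, shifted)
```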
Standardization
Missing Data Handle
• Missing data can occur when no information is provided for one or more items or for a whole unit. Missing data is a very big problem in real-life scenarios. Missing data is also referred to as NA (Not Available) values in pandas. Many datasets simply arrive with missing data, either because it exists and was not collected or because it never existed.
• In Pandas, missing data is represented by two values:
• None: None is a Python singleton object that is often used for missing data in Python code.
• NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
• Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
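The six functions can be exercised together on a small made-up frame that mixes NaN and None:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [None, 5.0, 6.0]})

mask = df.isnull()                  # True where a value is missing
present = df.notnull()              # the inverse of isnull()
zero_filled = df.fillna(0)          # replace missing values with a constant
interp = df["one"].interpolate()    # linear interpolation between neighbours
dropped_rows = df.dropna()          # drop rows containing any missing value
dropped_cols = df.dropna(axis=1)    # drop columns containing any missing value
replaced = df.replace(np.nan, -1)   # replace() also works on NaN
print(dropped_rows)
```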
isnull()
notnull()
Filling Missing Data
Interpolate
dropna()
Window Functions
• .rolling() Function
• .expanding() Function
• .ewm() Function
.rolling() Function
.expanding() Function
EWM
• ewm is applied on a series of data. Specify one of the com, span, or halflife arguments and apply the appropriate statistical function on top of it. It assigns the weights exponentially.
• It is used to smooth noisy data
df.ewm(com=0.5).mean()
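The three window types differ only in which observations feed each result. A minimal sketch on a made-up series (the window size and com value are arbitrary):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# rolling: a fixed-size window; the first window-1 results are NaN.
rolling_mean = s.rolling(window=3).mean()

# expanding: the window grows to include every value seen so far.
expanding_mean = s.expanding(min_periods=1).mean()

# ewm: all past values, weighted exponentially; com=0.5 gives alpha = 2/3.
ewm_mean = s.ewm(com=0.5).mean()
print(rolling_mean, expanding_mean, ewm_mean, sep="\n")
```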
Data Analysis in Pandas
Descriptive Statistics
• Most of these are aggregations like sum() and mean(), but some of them, like cumsum(), produce an object of the same size.
• These methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer.
• DataFrame: "index" (axis=0, default), "columns" (axis=1)
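A brief sketch of the axis argument and of the difference between an aggregation and a same-size result, using invented numbers:

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35], "Rating": [4.0, 3.0, 5.0]})

col_sums = df.sum()                  # axis=0 / "index": one value per column
row_sums = df.sum(axis=1)            # axis=1 / "columns": one value per row
col_means = df.mean(axis="index")    # the axis can be given by name
running = df.cumsum()                # same shape as df, not an aggregation
print(running)
```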
Summarizing Data
• The describe() function computes summary statistics for the DataFrame columns.
Summarizing Data with include
• describe() gives the mean, std and IQR values. By default it excludes the character columns and summarizes only the numeric columns. 'include' is the argument used to specify which columns are considered for summarizing. It takes a list of values; by default, 'number'.
• object: summarizes string columns
• number: summarizes numeric columns
• all: summarizes all columns together (pass it as a plain string, not as a list)
describe(include=['object'])
• # Create a DataFrame from a dict d
• df = pd.DataFrame(d)
• print(df.describe(include=['object']))
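The dict d is not shown in these notes, so the sketch below assumes a small made-up one. The Name column is cast to object explicitly so that include=['object'] matches it regardless of the default string dtype of the installed pandas.

```python
import pandas as pd

# Hypothetical data standing in for the dict d used on the slide.
d = {"Name": ["Tom", "James", "Ricky"], "Age": [25, 26, 25], "Rating": [4.23, 3.24, 3.98]}
df = pd.DataFrame(d).astype({"Name": object})

numeric_summary = df.describe()                    # numeric columns only
object_summary = df.describe(include=["object"])   # string columns only
full_summary = df.describe(include="all")          # every column
print(object_summary)
```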
describe(include='all')
Statistical Functions
• Statistical methods help in understanding and analyzing the behavior of data.
• Some useful functions:
• Percent change
• Covariance
• Correlation
• Data Ranking
Percent_change
• Series, DataFrames and Panel all have the function pct_change().
• This function compares every element with its prior element and computes the relative change (a fraction; multiply by 100 for a percentage).
• Formula: $\text{value}_n = \dfrac{x_n - x_{n-1}}{x_{n-1}}$
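To check the formula against the library call, this sketch computes the change both ways on a made-up series. The first element has no prior element, so its result is NaN.

```python
import pandas as pd

s = pd.Series([100.0, 110.0, 99.0])

pct = s.pct_change()                      # built-in
manual = (s - s.shift(1)) / s.shift(1)    # (x_n - x_{n-1}) / x_{n-1}
print(pct)
```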
Co-variance
• Covariance is applied on series data. The Series object has a method cov to compute the covariance between two Series objects. NA values are excluded automatically.
• The covariance formula is similar to the formula for correlation and deals with the deviations of data points from the average value in a dataset. For example, the covariance between two random variables X and Y can be calculated using the following formulas, for a population (first) or for a sample (second):
$\operatorname{Cov}(X, Y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$
$\operatorname{Cov}(X, Y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Correlation Value
• The correlation coefficient is a value that indicates the strength of the relationship. The coefficient can take
any values from -1 to 1. The interpretations of the values are:
• -1: Perfect negative correlation. The variables tend to move in opposite directions (i.e., when one variable increases, the
other variable decreases).
• 0: No correlation. The variables do not have a relationship with each other.
• 1: Perfect positive correlation. The variables tend to move in the same direction (i.e., when one variable increases, the other
variable also increases).
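Both measures are available as Series methods. The sketch below uses made-up series chosen so that the correlations come out at the two extremes; Series.cov uses the sample (n - 1) normalization, which the manual computation reproduces.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 6.0, 8.0])   # moves with x
z = pd.Series([4.0, 3.0, 2.0, 1.0])   # moves against x

cov_xy = x.cov(y)                     # sample covariance

# The same quantity written out from the sample formula.
n = len(x)
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

corr_xy = x.corr(y)                   # perfect positive correlation
corr_xz = x.corr(z)                   # perfect negative correlation
print(cov_xy, corr_xy, corr_xz)
```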
Data Ranking
• Data ranking produces a rank for each element in the array of elements. In case of ties, it assigns the mean rank by default.
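A short sketch of tie handling on a made-up series. The default gives tied values the mean of the ranks they span, and the method and ascending arguments change that behavior.

```python
import pandas as pd

s = pd.Series([10, 20, 20, 5])

avg_rank = s.rank()                   # ties share the mean rank
min_rank = s.rank(method="min")       # ties share the lowest rank
desc_rank = s.rank(ascending=False)   # largest value gets rank 1
print(avg_rank.tolist(), min_rank.tolist(), desc_rank.tolist())
```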
Categorical Data
• Data often includes text columns whose values are repetitive. Features like gender, country, and codes are always repetitive. These are examples of categorical data.
• Categorical variables can take on only a limited, and usually fixed, number of possible values. Besides the fixed set of values, categorical data might have an order, but numerical operations cannot be performed on it. Categorical is a Pandas data type.
• The categorical data type is useful in the following cases:
• A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
• The lexical order of a variable is not the same as the logical order ("one", "two", "three"). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
• As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types)
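The lexical-versus-logical ordering case can be sketched directly with the "one", "two", "three" values mentioned above (the particular sequence is made up):

```python
import pandas as pd

sizes = pd.Series(["two", "one", "three", "one"])

# Plain strings sort lexically.
lexical = sizes.sort_values().tolist()

# An ordered categorical sorts and compares in the declared order.
cat = pd.Categorical(sizes, categories=["one", "two", "three"], ordered=True)
ordered = pd.Series(cat)
logical = ordered.sort_values().tolist()
largest = ordered.max()
above_one = (ordered > "one").tolist()
print(lexical, logical, largest)
```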
Comparison of Categorical Data
Visualization
• Plotting methods allow a handful of plot styles other than the default line plot. These can be selected with the kind keyword argument to plot(). They include:
• 'bar' or 'barh' for bar plots
• 'hist' for histograms
• 'box' for box plots
• 'area' for area plots
• 'scatter' for scatter plots
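A rough sketch of the kind argument, assuming Matplotlib is installed (pandas plotting relies on it). The data is random, and the Agg backend is selected only so the example runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, an assumption for headless runs
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 3)), columns=["a", "b", "c"])

ax_line = df.plot()                                  # default line plot
ax_bar = df.plot(kind="bar", stacked=True)           # stacked bars
ax_barh = df.plot(kind="barh")                       # horizontal bars
ax_hist = df.plot(kind="hist", bins=5, alpha=0.5)    # histograms in one plot
ax_box = df.plot(kind="box")                         # box plot per column
ax_area = df.plot(kind="area")                       # area plot
ax_scatter = df.plot(kind="scatter", x="a", y="b")   # scatter needs x and y
```

Each call returns a Matplotlib Axes object, so titles, labels, and saving to a file go through the usual Matplotlib methods.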
Plotting
Bar Plotting
Stacked Bar Plotting
Horizontal Bar Plotting
Histogram in same plot
Plot different histograms for each column
Box Plots
Area Plot
Scatter Plots
Pie Chart
IO Tools
• The two workhorse functions for reading text files (or flat files) are read_csv() and read_table(). They both use the same parsing code to intelligently convert tabular data into a DataFrame object
• Example: The temp.csv file data looks like
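Example
• Read with defaults: df = pd.read_csv("temp.csv")
• Use a column as the index: df = pd.read_csv("temp.csv", index_col=['S.No'])
• Set a column's dtype: df = pd.read_csv("temp.csv", dtype={'Salary': np.float64})
• Supply column names: df = pd.read_csv("temp.csv", names=['a', 'b', 'c', 'd', 'e'])
• Supply names and replace the header: df = pd.read_csv("temp.csv", names=['a', 'b', 'c', 'd', 'e'], header=0)
→ What is the difference? With names alone, the file's own header line is kept as a data row; header=0 tells pandas that line is a header and replaces it with the given names.
• Skip the first two lines: df = pd.read_csv("temp.csv", skiprows=2)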
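The contents of temp.csv are not reproduced in these notes, so the sketch below substitutes an in-memory file. Only the S.No and Salary column names come from the calls above; the other columns and all values are made up.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-in for temp.csv.
csv_text = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
"""

def read(**kwargs):
    """Parse the in-memory CSV with the given read_csv options."""
    return pd.read_csv(io.StringIO(csv_text), **kwargs)

names = ["a", "b", "c", "d", "e"]
plain = read()
indexed = read(index_col=["S.No"])
typed = read(dtype={"Salary": np.float64})
extra_header = read(names=names)        # old header line becomes a data row
renamed = read(names=names, header=0)   # old header line is replaced
skipped = read(skiprows=2)              # line 3 of the file becomes the header
print(renamed)
```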
Sparse Data
• Sparse objects are "compressed" when any data matching a specific value (NaN / missing value, though any value can be chosen) is omitted. A special SparseIndex object tracks where data has been "sparsified".
• They are used to compress data and reduce memory use when the data is sparse
• They can be used for Series and DataFrame data
• Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. The default fill_value depends on the original dtype:
• float64: np.nan
• int64: 0
• bool: False
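Example
• Compressing: df.to_sparse()
• Decompressing: sdf.to_dense()
• Note: to_sparse() was removed in pandas 1.0. In current versions, convert with astype(pd.SparseDtype(...)) and use the .sparse accessor.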
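Since to_sparse() is not available in current pandas, the following sketch shows one way to do the same round trip with the sparse dtype and the .sparse accessor. The series is made up, with NaN as the value to omit.

```python
import numpy as np
import pandas as pd

dense = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Compress: store only the values that differ from the fill value (NaN).
sparse = dense.astype(pd.SparseDtype("float64", np.nan))
density = sparse.sparse.density     # fraction of values actually stored

# Decompress back to an ordinary dense Series.
back = sparse.sparse.to_dense()
print(sparse.dtype, density)
```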
Caveats & Gotchas
• A caveat is a warning and a gotcha is an unseen problem.
• Pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the Boolean operations and, or, and not. It is not clear what the result should be. Should it be True because it is not zero-length? False because there are False values? It is unclear, so instead, Pandas raises a ValueError.
• Tools for Series data
• .empty
• .bool() (deprecated in pandas 2.1)
• .item()
• .any()
• .all()
• Bitwise Boolean
• isin
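The ambiguity and the explicit alternatives can be demonstrated in a few lines. The values are made up, and .bool() is left out because recent pandas versions deprecate it.

```python
import pandas as pd

s = pd.Series([True, False, True])

# Using a Series directly in an if statement is ambiguous and raises.
try:
    if s:
        pass
    ambiguous = False
except ValueError:
    ambiguous = True

# Explicit alternatives.
is_empty = pd.Series([], dtype="float64").empty   # no elements at all?
any_true = bool(s.any())                          # at least one True?
all_true = bool(s.all())                          # every value True?
single = pd.Series([7]).item()                    # the one value of a length-1 Series

# Bitwise operators replace and/or/not for element-wise logic.
both = (s & pd.Series([True, True, False])).tolist()

# isin tests membership element by element.
members = pd.Series(["a", "b", "c"]).isin(["a", "c"]).tolist()
print(ambiguous, is_empty, any_true, all_true, single, both, members)
```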
Comparison with SQL