Phan1_Pandas_Numpy_Matplotlib

Lesson: PANDAS
Python 3.x Programming
Tutor: Mrs. Mỹ Linh

Time: 90 mins
Content
• Python Pandas
• Series
• DataFrame
• Panel
• Basic Functionality
• Descriptive Statistics
• Function Application
• Reindexing
• Iteration
• Sorting
• Working with Text Data
• Options & Customization
• Indexing & Selecting Data
• Statistical Functions
• Window Functions
• Aggregations
• Missing Data
• GroupBy
• Merging/Joining
• Concatenation
• Date Functionality
• Timedelta
• Categorical Data
• Visualization
• IO Tools
• Sparse Data
• Caveats & Gotchas
Introduction to Pandas

Python Pandas
• Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on powerful data structures. The name Pandas is derived from "Panel Data", an econometrics term for multidimensional data.
• To use Pandas, import pandas as pd
• Pandas deals with the following three data structures:
• Series: dimension = 1
• DataFrame: dimension = 2
• Panel: dimension = 3
Key features:
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns can be deleted from or inserted into a data structure.
• Group by data for aggregation and transformations.
• High-performance merging and joining of data.
• Time series functionality.
• Retrieving data using labels.
• Retrieve Data Using Label
Python Pandas - Series
• Create: pandas.Series(data, index, dtype, copy)
• Data: data takes various forms such as ndarray, list, constants
• Index: index values must be unique and hashable, with the same length as data. Defaults to np.arange(n) if no index is passed.
• Dtype: the data type. If None, the data type will be inferred.
• Copy: copy data. Default False.
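A minimal sketch of creating a Series in each of these forms:

```python
import numpy as np
import pandas as pd

# From a list with an explicit index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# From an ndarray (index defaults to np.arange(n))
s2 = pd.Series(np.array([1.0, 2.0, 3.0]))

# From a scalar constant, broadcast over the index
s3 = pd.Series(5, index=['x', 'y', 'z'])

# Retrieve data using a label
print(s['b'])
```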

Python Pandas - DataFrame
• Create: pandas.DataFrame(data, index, columns, dtype, copy)
• Columns: column labels. Defaults to np.arange(n) if no column labels are passed.
• A DataFrame can be created in many ways, and supports:
• Adding a column
• Deleting a column
• Row selection, addition, and deletion
Example – Create Dataframe
Column Addition
Column Deletion
Row Selection, Addition, and Deletion
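A sketch covering these operations on a small hypothetical DataFrame:

```python
import pandas as pd

# Create a DataFrame from a dict of lists
df = pd.DataFrame({'Name': ['Tom', 'Jane'], 'Age': [28, 34]})

# Column addition
df['Salary'] = [1000, 1500]

# Column deletion (alternatives: df.pop('Age'), df.drop(columns=['Age']))
del df['Age']

# Row selection by label
row = df.loc[0]
# Row addition
df.loc[2] = ['Ann', 1200]
# Row deletion
df = df.drop(1)
print(df)
```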
Python Pandas - Panel
• Create: pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
• Data: data takes various forms such as ndarray, Series, map, lists, dict, constants and also another DataFrame
• Items: axis=0
• Major_axis: axis=1
• Minor_axis: axis=2
• Dtype: data type of each column
• Copy: copy data. Default False
• Note: Panel was deprecated in pandas 0.20 and removed in 0.25; a MultiIndex DataFrame is the recommended replacement.
Example - From 3D ndarray
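Since Panel is gone from modern pandas, a sketch of holding the same 3-D ndarray in a MultiIndex DataFrame (the recommended replacement):

```python
import numpy as np
import pandas as pd

# items x major_axis x minor_axis, as a Panel would have held it
data = np.random.rand(2, 3, 4)

# One DataFrame per item, stacked under a two-level (item, row) index
df = pd.concat({i: pd.DataFrame(data[i]) for i in range(data.shape[0])})
print(df.index.nlevels)
```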
Series Basic Functionality

DataFrame Basic Functionality
Function Application
• To apply your own or another library's functions to Pandas objects, you should be aware of three important methods. The appropriate method depends on whether your function expects to operate on an entire DataFrame, row- or column-wise, or element-wise.
• Table-wise function application: pipe()
• Row- or column-wise function application: apply()
• Element-wise function application: applymap()

Function Application
• Suppose df is a DataFrame and adder is a function:

• Table-wise: df = df.pipe(adder, 2)

• Column-wise: df = df.apply(np.mean)
• Row-wise: df = df.apply(np.mean, axis=1)
• df = df.apply(lambda x: x.max() - x.min())

• Element-wise: df = df.applymap(lambda x: x*10)
• On Series data: df['Salary'].map(lambda x: x*10)
12/28/2022 18
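A runnable sketch of the three methods, with a hypothetical adder helper (note applymap is deprecated in newer pandas in favor of DataFrame.map, but still works):

```python
import numpy as np
import pandas as pd

def adder(frame, n):
    """Hypothetical helper: add n to every element."""
    return frame + n

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

piped = df.pipe(adder, 2)                  # table-wise
col_means = df.apply(np.mean)              # column-wise (axis=0)
row_means = df.apply(np.mean, axis=1)      # row-wise
spread = df.apply(lambda x: x.max() - x.min())
scaled = df.applymap(lambda x: x * 10)     # element-wise
```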
Mapping
• map = {
   'label1' : 'value1',
   'label2' : 'value2',
   ...
}

• The functions in this section perform different operations, but they all accept a dict object:
• replace() — replaces values
• map() — creates a new column
• rename() — replaces the index values
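A sketch of all three dict-accepting methods on a hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red'], 'price': [1, 2, 3]})

# replace(): substitute values wherever they occur
df2 = df.replace({'red': 'rosso'})

# map() on a column: build a new column from a lookup dict
df['code'] = df['color'].map({'red': 'R', 'green': 'G'})

# rename(): relabel index and/or columns
df3 = df.rename(index={0: 'first'}, columns={'price': 'cost'})
```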

Adding Values via Mapping

Rename the Indexes of the Axes

Re-indexing
• Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.
• Multiple operations can be accomplished through reindexing:

• Reorder the existing data to match a new set of labels.

• Insert missing value (NA) markers in label locations where no data existed for the label.
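A minimal sketch of reindexing rows and columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2),
                  index=['a', 'b', 'c'], columns=['x', 'y'])

# Reorder rows/columns; label 'd' has no data, so NA markers appear
df2 = df.reindex(index=['c', 'a', 'd'], columns=['y', 'x'])
print(df2)
```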
Example

Re-index to Align with Other Objects

Filling while ReIndexing
• reindex() takes an optional method parameter specifying a filling method:

• pad/ffill − fill values forward

• bfill/backfill − fill values backward

• nearest − fill from the nearest index values
Example

Limits on Filling while Re-indexing
• The limit argument provides additional control over filling while reindexing. Limit specifies the maximum
count of consecutive matches.
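A sketch combining method and limit:

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=[0, 3])

# Forward-fill while reindexing, but at most 1 consecutive fill
filled = s.reindex(range(6), method='ffill', limit=1)
print(filled)
```

Index 1 is filled from index 0, but index 2 stays NaN because it would be the second consecutive fill.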

Renaming
• The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary
function.

ITERATION
• The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is
regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame and
Panel, follow the dict-like convention of iterating over the keys of the objects.
• In short, basic iteration (for i in object) produces −
• Series − values
• DataFrame − column labels
• Panel − item labels

ITERATOR COLUMN
• Iterating a DataFrame gives column names

ITERATOR ROWS
• To iterate over the rows of a DataFrame, we can use the following functions −
• iteritems() − iterate over the (key, value) pairs (renamed items() in pandas 2.0)
• iterrows() − iterate over the rows as (index, Series) pairs
• itertuples() − iterate over the rows as namedtuples

iteritems()
• Iterates over each column as key, value pair with
label as key and column value as a Series object.

iterrows()
• iterrows() returns the iterator yielding each index value along with a series containing the data in each row.

itertuples()
• itertuples() method will return an iterator yielding a named tuple for each row in the DataFrame. The first
element of the tuple will be the row’s corresponding index value, while the remaining values are the row
values.
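A sketch of each iteration style (items() is the modern name for iteritems()):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

cols = [c for c in df]                 # basic iteration -> column labels

# items() yields (label, Series) per column
col_sums = {label: col.sum() for label, col in df.items()}

# iterrows() yields (index, Series) per row
row_sums = [row.sum() for _, row in df.iterrows()]

# itertuples() yields a namedtuple per row; first field is the index
first = next(df.itertuples())
```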

Example

Sorting
• There are two kinds of sorting available in Pandas:
• By label — sort_index()
• By actual value — sort_values()
• Consider randomly generated data
Sorting Example

Working with Text Data
• Pandas provides a set of string functions which make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values.
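A sketch of the str accessor, showing that NaN values are passed through rather than raising:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Tom ', 'William Rick', np.nan, 'Alber@t'])

lower = s.str.lower()          # NaN stays NaN
lengths = s.str.len()
has_rick = s.str.contains('Rick')
trimmed = s.str.strip()
```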

Options and Customization
• get_option(param): get_option takes a single
parameter and returns the value as given in the table
• set_option(param, value): set_option takes two arguments and sets the value of the parameter as shown in the table
• reset_option(param): takes an argument and sets the
value back to the default value.
• describe_option(param): describe_option prints the
description of the argument.
• option_context(): option_context context manager
is used to set the option in with statement
temporarily. Option values are restored
automatically when you exit the with block
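A sketch of all of these option functions:

```python
import pandas as pd

default_rows = pd.get_option('display.max_rows')
pd.set_option('display.max_rows', 10)

# option_context restores the value when the with-block exits
with pd.option_context('display.max_rows', 5):
    inside = pd.get_option('display.max_rows')

after = pd.get_option('display.max_rows')
pd.reset_option('display.max_rows')       # back to the default
```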

Indexing and Selecting Data in Pandas

Indexing and Selecting Data
• The Python and NumPy indexing operators "[ ]" and attribute operator "." provide quick and easy access to
Pandas data structures across a wide range of use cases. However, since the type of the data to be accessed
isn’t known in advance, directly using standard operators has some optimization limits. For production code,
we recommend that you take advantage of the optimized pandas data access methods explained.
• Pandas supports three types of multi-axes indexing: .loc (label-based), .iloc (integer-based), and the now-removed hybrid .ix.

.loc()
• Pandas provides purely label-based indexing via .loc. When slicing, both the start and the stop bounds are included. Integers are valid labels, but they refer to the label, not the position.
• .loc has multiple access methods:
• A single scalar label
• A list of labels
• A slice object
• A Boolean array
• loc takes two single/list/slice operators separated by ','. The first indicates the rows and the second indicates the columns.
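A sketch of each access method:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=['a', 'b', 'c', 'd'], columns=['x', 'y', 'z'])

scalar = df.loc['b', 'y']           # single labels
subset = df.loc[['a', 'c'], ['x']]  # lists of labels
sliced = df.loc['a':'c', 'x':'y']   # label slice: both ends included
masked = df.loc[df['x'] > 3]        # Boolean array
```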

.loc() Example

.iloc()
• Pandas provides purely integer-based indexing via .iloc. Like Python and NumPy, it is 0-based.
• The various access methods are as follows:
• An integer
• A list of integers
• A slice of values
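A sketch of each access method:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['x', 'y', 'z'])

one = df.iloc[1, 2]          # single integers: row 1, column 2
rows = df.iloc[[0, 3]]       # list of integer positions
block = df.iloc[1:3, 0:2]    # integer slice: stop bound excluded
```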

.iloc() Example

.ix()
• Besides pure label-based and integer-based indexing, Pandas offered a hybrid method for selection and subsetting via the .ix indexer. Note: .ix was deprecated in pandas 0.20 and removed in 1.0; use .loc or .iloc instead.

.ix() Example

Use of Notations
• Getting values from a Pandas object with multi-axes indexing uses the following notation.
• Note: .iloc() & .ix() apply the same indexing options and return values.

(Example 1) Use the basic indexing operator '[ ]'

Sort, Filter, Aggregation, Grouping, Pivot,
Concatenation, Merge/Join in Pandas

Sort
• Sort by a single column, ascending by default: df.sort_values(by='TOTAL')

• Sort in descending order: df.sort_values(by='TOTAL', ascending=False)

• Sort by multiple columns: df.sort_values(by=['QUANTITY','TOTAL'])
• Sort multiple columns in different orders: df.sort_values(by=['QUANTITY','TOTAL'], ascending=[True, False])
Filter (filtering data)
• Select columns of a dataframe: df.filter(items=['USER_ID', 'TAX'])
• Select columns by regular expression: df.filter(regex='T$', axis=1)
Filter
• Filter rows whose labels contain a substring: df.filter(like='bbi', axis=0)
• Filter rows with a comparison expression
• For example, select all orders with TOTAL greater than 100: df[df['TOTAL'] > 100]

• Filter with a user-defined function:

def custom(tax, total):
    return (total - tax > 100)

df[custom(df['TAX'], df['TOTAL'])]
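A runnable sketch of these filters on a hypothetical orders DataFrame (the column set and the '^T' regex are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({'USER_ID': [1, 2, 3],
                   'TAX':     [5, 20, 1],
                   'TOTAL':   [50, 300, 120]})

cols = df.filter(items=['USER_ID', 'TAX'])   # select columns by name
regex_cols = df.filter(regex='^T', axis=1)   # columns starting with 'T'
big = df[df['TOTAL'] > 100]                  # comparison expression

def custom(tax, total):
    return (total - tax > 100)

filtered = df[custom(df['TAX'], df['TOTAL'])]
```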
Aggregation

Example
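A sketch of aggregation, assuming a small numeric DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(4, 2), columns=['A', 'B'])

total = df.aggregate('sum')                   # one function, all columns
several = df.aggregate(['sum', 'mean'])       # multiple functions at once
per_col = df.aggregate({'A': 'sum', 'B': 'mean'})  # per-column functions
```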

Group

Grouping with a user-defined function
• For example, group by Team and take the sum of Age over the first 10 records of each group:

def custom_aggregate(series):
    return series.head(10).sum()

df.groupby(['Team'])['Age'].agg(custom_aggregate)
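A runnable version on a hypothetical Team/Age DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B'],
                   'Age':  [20, 30, 25, 35]})

def custom_aggregate(series):
    # sum of (at most) the first 10 values in each group
    return series.head(10).sum()

result = df.groupby(['Team'])['Age'].agg(custom_aggregate)
```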
Pivot
• One of the most common tasks in data science is to reshape the data frame we have into a specific format.
• Consider data about life expectancy (the number of years a person is expected to live based on the statistical average; life expectancy varies by geographical area and by era).
• The Pandas function pivot_table helps with the summarization and conversion of a dataframe in long form to a dataframe in wide form, in a variety of complex scenarios.
Pandas Simple Pivot
• A simple example of a Pandas pivot using a dataframe with just two columns. Let us subset our dataframe to contain just two columns, continent and lifeExp.

pd.pivot_table(df[['continent','lifeExp']], values='lifeExp', columns='continent')

Pandas pivot_table on a data frame with three columns
• Pandas pivot_table gets more useful when we try to summarize and convert a tall data frame with more than
two variables into a wide data frame. Use three columns; continent, year, and lifeExp

pd.pivot_table(df[['continent', 'year','lifeExp']], values='lifeExp', index=['year'], columns='continent')

Pandas pivot_table with Different Aggregating Function
• pivot_table uses the mean function to aggregate or summarize data by default. We can change the aggregating function if needed.
• For example, we can use aggfunc='max' to compute the "maximum" lifeExp instead of the "mean" lifeExp for each year and continent.

pd.pivot_table(df[['continent', 'year','lifeExp']], values='lifeExp', index=['year'], columns='continent',aggfunc='max')

Pandas pivot_table with Different Aggregating Function
• pd.pivot_table(df[['continent', 'year','lifeExp']], values='lifeExp', index=['year'], columns='continent',aggfunc=[min,max])
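A runnable sketch on a toy life-expectancy DataFrame (the values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'continent': ['Asia', 'Asia', 'Europe', 'Europe'],
                   'year':      [2000, 2010, 2000, 2010],
                   'lifeExp':   [65.0, 70.0, 75.0, 80.0]})

# Long form -> wide form: one row per year, one column per continent
wide = pd.pivot_table(df, values='lifeExp', index=['year'],
                      columns='continent', aggfunc='max')
```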

Melt
• The Pandas melt() function changes the DataFrame format from wide to long. It is used to create a specific format of the DataFrame object where one or more columns work as identifiers. All the remaining columns are treated as values and are unpivoted to the row axis, leaving only two non-identifier columns, variable and value.
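A minimal sketch of wide-to-long reshaping (column names are hypothetical):

```python
import pandas as pd

wide = pd.DataFrame({'name': ['Tom', 'Jane'],
                     'math': [90, 85], 'physics': [80, 95]})

# 'name' identifies each row; the other columns unpivot into rows
long = pd.melt(wide, id_vars=['name'],
               var_name='subject', value_name='score')
```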

Concatenation

Advanced Concatenation


https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
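A sketch of concatenation, merging, and joining (see the merging user guide linked above):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})

stacked = pd.concat([left, right], ignore_index=True)  # row-wise append
inner = pd.merge(left, right, on='key')                # SQL-style inner join
outer = pd.merge(left, right, on='key', how='outer')
joined = left.set_index('key').join(right.set_index('key'))  # join on index
```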

Merging

Joining

Data Manipulation in Pandas

Regex
• A Regular Expression (RegEx) is a sequence of characters that defines a search pattern.
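A sketch of regex patterns with the pandas str accessor (the data is hypothetical):

```python
import pandas as pd

s = pd.Series(['order-123', 'item-45', 'order-678'])

is_order = s.str.contains(r'^order-\d+$')     # match a search pattern
numbers = s.str.extract(r'(\d+)')[0].astype(int)  # pull out the digits
```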
Date Functionality

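A minimal sketch of pandas date functionality:

```python
import pandas as pd

dates = pd.date_range('2022-12-01', periods=5, freq='D')
s = pd.Series(range(5), index=dates)

december = s.loc['2022-12']      # partial-string indexing by month
weekday = dates[0].day_name()
```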
Time Delta
• Time deltas are differences in times, expressed in different units, for example days, hours, minutes, seconds.
• They can be both positive and negative.

Example
• By passing a string literal, we can create a timedelta object: pd.Timedelta('2 days 2 hours 15 minutes 30 seconds') → 2 days 02:15:30
• By passing an integer value with a unit argument: pd.Timedelta(6, unit='h') → 0 days 06:00:00
• Data offsets such as weeks, days, hours, minutes, seconds, milliseconds, microseconds, nanoseconds: pd.Timedelta(days=2) → 2 days 00:00:00
• pd.to_timedelta() converts a scalar, array, list, or Series from a recognized timedelta format/value into a Timedelta type. It constructs a Series if the input is a Series, a scalar if the input is scalar-like, and otherwise a TimedeltaIndex.
Example
• Operate on Series/DataFrames and construct timedelta64[ns] Series through subtraction operations on datetime64[ns] Series, or Timestamps:

s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
df = pd.DataFrame(dict(A=s, B=td))

• Addition operations: df['C'] = df['A'] + df['B']
• Subtraction operations: df['D'] = df['C'] - df['B']
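The same construction, runnable end to end:

```python
import pandas as pd

s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
df = pd.DataFrame(dict(A=s, B=td))

df['C'] = df['A'] + df['B']   # datetime + timedelta -> datetime
df['D'] = df['C'] - df['A']   # datetime - datetime -> timedelta
```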
Normalization
• Normalization refers to rescaling real-valued numeric
attributes into a 0 to 1 range.
• Data normalization is used in machine learning to make
model training less sensitive to the scale of features.
This allows our model to converge to better weights
and, in turn, leads to a more accurate model.

Standardization

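A minimal sketch of both rescalings with plain pandas (the column name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'score': [10.0, 20.0, 30.0, 40.0]})

# Min-max normalization: rescale into the 0-1 range
df['norm'] = (df['score'] - df['score'].min()) / (df['score'].max() - df['score'].min())

# Standardization: zero mean, unit standard deviation (z-score)
df['std'] = (df['score'] - df['score'].mean()) / df['score'].std()
```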
Missing Data Handling
• Missing data can occur when no information is provided for one or more items or for a whole unit. Missing data is a very big problem in real-life scenarios. Missing data is also referred to as NA (Not Available) values in pandas. Many datasets simply arrive with missing data, either because it exists and was not collected or because it never existed.
• In Pandas, missing data is represented by two values:
• None: a Python singleton object that is often used for missing data in Python code.
• NaN (an acronym for Not a Number): a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
• Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
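A sketch of these functions on one small Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, None])   # None becomes NaN in a float Series

mask = s.isnull()                  # True where missing
n_missing = int(mask.sum())
filled = s.fillna(0)
dropped = s.dropna()
interp = s.interpolate()           # linear interpolation between neighbors
```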
isnull()

notnull()

Filling Missing Data

Interpolate

dropna()

Window Functions
• .rolling() Function
• .expanding() Function
• .ewm() Function

.rolling() Function

.expanding() Function

EWM
• ewm is applied on a series of data. Specify any of the com, span, or halflife arguments and apply the appropriate statistical function on top of it. It assigns the weights exponentially.
• Used to smooth data and handle noise.

df.ewm(com=0.5).mean()
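A sketch of all three window functions on one Series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

roll = s.rolling(window=3).mean()   # fixed-size moving window
expand = s.expanding().sum()        # all observations up to each point
smooth = s.ewm(com=0.5).mean()      # exponentially weighted
```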

Data Analysis in Pandas

Descriptive Statistics
• Most of these are aggregations like sum(), mean(), but some of them, like cumsum(), produce an object of the same size.
• These methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer.
• DataFrame − "index" (axis=0, default), "columns" (axis=1)
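A sketch of aggregations, a same-size method, and describe():

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})

col_sum = df.sum()            # axis=0 ("index") by default
row_sum = df.sum(axis=1)      # axis=1 ("columns")
running = df.cumsum()         # same shape as df
summary = df.describe()       # count, mean, std, min, quartiles, max
```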

Example
Summarizing Data
• The describe() function computes a summary of statistics pertaining to the DataFrame columns.
Summarizing Data with include
• This function gives the mean, std and IQR values, and excludes character columns, summarizing only numeric columns by default. 'include' is the argument used to specify which columns should be considered for summarizing. It takes a list of values; by default, 'number'.
• object − summarizes string columns
• number − summarizes numeric columns
• all − summarizes all columns together (should not be passed as a list value)
describe(include=['object'])
• # Create a DataFrame
• df = pd.DataFrame(d)
• print(df.describe(include=['object']))

describe(include='all')
Statistical Functions
• Statistical methods help in the understanding and analyzing the behavior of data.
• Some useful functions:
• Percent change
• Covariance
• Correlation
• Data Ranking

Percent_change
• Series and DataFrames (and formerly Panel) all have the function pct_change().
• This function compares every element with its prior element and computes the change percentage.
• Formula: value_n = (x_n − x_{n−1}) / x_{n−1}
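A minimal sketch of the formula in action:

```python
import pandas as pd

s = pd.Series([100.0, 110.0, 99.0])
change = s.pct_change()   # (x_n - x_{n-1}) / x_{n-1}; first value is NaN
```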

Co-variance
• Covariance is applied on series data. The Series object has a method cov to compute covariance between Series objects. NA values are excluded automatically.
• Covariance measures how two variables deviate together from their average values. For two random variables X and Y it is calculated as cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / n for a population, or with n − 1 in the denominator for a sample.
Correlation Value
• The correlation coefficient is a value that indicates the strength of the relationship. The coefficient can take
any values from -1 to 1. The interpretations of the values are:
• -1: Perfect negative correlation. The variables tend to move in opposite directions (i.e., when one variable increases, the
other variable decreases).
• 0: No correlation. The variables do not have a relationship with each other.
• 1: Perfect positive correlation. The variables tend to move in the same direction (i.e., when one variable increases, the other
variable also increases).
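A sketch of both measures (pandas uses the sample formula, i.e. an n − 1 denominator):

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 6.0, 8.0])   # y = 2x: perfectly linear

c = x.cov(y)       # sample covariance
r = x.corr(y)      # Pearson correlation, between -1 and 1
```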

Data Ranking
• Data ranking produces a rank for each element in an array of elements. In case of ties, it assigns the mean rank by default.

Data Ranking – More Example

Categorical Data
• Data includes the text columns, which are repetitive. Features like gender, country, and codes are always
repetitive. These are the examples for categorical data.
• Categorical variables can take on only a limited, and usually fixed, number of possible values. Besides the fixed length, categorical data might have an order, but numerical operations cannot be performed on it. Categorical is a Pandas data type.
• The categorical data type is useful in the following cases −
• A string variable consisting of only a few different values. Converting such a string variable to a
categorical variable will save some memory.
• The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By
converting to a categorical and specifying an order on the categories, sorting and min/max will use
the logical order instead of the lexical order.
• As a signal to other python libraries that this column should be treated as a categorical variable
(e.g. to use suitable statistical methods or plot types)
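A sketch of a logical (rather than lexical) category order:

```python
import pandas as pd

s = pd.Series(['low', 'high', 'medium', 'low'], dtype='category')

# Impose a logical order on the categories
s = s.cat.set_categories(['low', 'medium', 'high'], ordered=True)

smallest = s.min()          # uses the logical order, not alphabetical
counts = s.value_counts()
```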

Example

Comparison of Categorical Data

Visualization
• Plotting methods allow a handful of plot styles other than the default line plot. These methods can be selected via the kind keyword argument to plot(). These include −
• 'bar' or 'barh' for bar plots
• 'hist' for histograms
• 'box' for box plots
• 'area' for area plots
• 'scatter' for scatter plots
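A sketch using a non-interactive backend (assumes matplotlib is installed; the data is hypothetical):

```python
import matplotlib
matplotlib.use('Agg')            # headless backend: render without a display
import pandas as pd

df = pd.DataFrame({'sales': [3, 7, 5], 'costs': [2, 4, 3]})

ax = df.plot(kind='bar')         # also: 'barh', 'hist', 'box', 'area'
ax.figure.savefig('bars.png')
```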

Plotting

Bar Plotting

Stacked Bar Plotting

Horizontal Bar Plotting

Histograms in the same plot

Plot different histograms for each column

Box Plots

Area Plot

Scatter Plots

Pie Chart

IO Tools
• The two workhorse functions for reading text files (or flat files) are read_csv() and read_table(). They both use the same parsing code to intelligently convert tabular data into a DataFrame object.
• Example: The temp.csv file data looks like

Example
• df = pd.read_csv("temp.csv")
• df = pd.read_csv("temp.csv", index_col=['S.No'])
• df = pd.read_csv("temp.csv", dtype={'Salary': np.float64})
• df = pd.read_csv("temp.csv", names=['a', 'b', 'c', 'd', 'e'])
• df = pd.read_csv("temp.csv", names=['a', 'b', 'c', 'd', 'e'], header=0)
(passing names together with header=0 replaces the original header row with the new names)
• df = pd.read_csv("temp.csv", skiprows=2)
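A self-contained sketch using an in-memory file in place of temp.csv (the column set is assumed):

```python
import io
import pandas as pd

csv_text = ("S.No,Name,Age,City,Salary\n"
            "1,Tom,28,Toronto,20000\n"
            "2,Lee,32,HongKong,3000\n")

df = pd.read_csv(io.StringIO(csv_text))
df_idx = pd.read_csv(io.StringIO(csv_text), index_col=['S.No'])
# names + header=0 replaces the original header row
df_named = pd.read_csv(io.StringIO(csv_text),
                       names=['a', 'b', 'c', 'd', 'e'], header=0)
```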
Sparse Data
• Sparse objects are "compressed" when any data matching a specific value (NaN / missing value, though any value can be chosen) is omitted. A special SparseIndex object tracks where data has been "sparsified".
• Used to compress data and reduce memory when data is sparse.
• Works with both Series and DataFrame data.
• Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. Depending on the original dtype, the fill_value default changes:
• float64 − np.nan
• int64 − 0
• bool − False

Example
• df.to_sparse() − compressing (note: removed in pandas 1.0; use df.astype(pd.SparseDtype(...)) instead)
• sdf.to_dense() − decompressing
• sdf.density → density = 0.4
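A sketch using the modern sparse API:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.0, np.nan, np.nan, 1.0, np.nan]})

# Modern replacement for the removed to_sparse(): a sparse dtype
sdf = df.astype(pd.SparseDtype('float64', np.nan))

density = sdf['A'].sparse.density      # fraction of non-fill values stored
dense_again = sdf.sparse.to_dense()
```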
Caveats & Gotchas
• A caveat is a warning, and a gotcha is an unseen problem.
• Pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the Boolean operations and, or, or not. It is not clear what the result should be: should it be True because it is not zero-length, or False because there are False values? It is unclear, so instead Pandas raises a ValueError.
• For Series data, use instead:
• .empty
• .bool()
• .item()
• .any()
• .all()
• Bitwise Boolean operators (&, |, ~)
• isin()

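A sketch of the gotcha and the unambiguous alternatives:

```python
import pandas as pd

s = pd.Series([True, False, True])

try:
    if s:                     # ambiguous: pandas raises instead of guessing
        pass
except ValueError:
    ambiguous = True

any_true = s.any()
all_true = s.all()
is_empty = s.empty
```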
Example

Comparison with SQL

Query: SELECT
T-SQL: SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
Pandas: tips[['total_bill', 'tip', 'smoker', 'time']].head(5)

Query: WHERE
T-SQL: SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
Pandas: tips[tips['time'] == 'Dinner'].head(5)

Query: GROUP BY
T-SQL: SELECT sex, count(*) FROM tips GROUP BY sex;
Pandas: tips.groupby('sex').size()

Query: TOP N ROWS
T-SQL: SELECT * FROM tips LIMIT 5;
Pandas: tips.head(5)
Mastering Pandas - To master data manipulation in Python using Pandas, here's what you need to learn:

• read_csv
• set_index
• reset_index
• loc
• iloc
• drop
• dropna
• fillna
• assign
• filter
• query
• rename
• sort_values
• agg
• groupby
• concat
• merge
• pivot
• melt