
Python Data Science Cheat Sheet - Pandas Advanced

This document provides a summary of advanced pandas techniques: 1) It outlines various methods for selecting, filtering, and querying data in pandas DataFrames including selecting columns based on criteria, filtering rows, and querying. 2) It also summarizes pandas techniques for reshaping data including pivoting/unpivoting data and creating pivot tables. 3) Finally, it discusses various data merging techniques using pandas merge as well as setting/resetting indexes and handling missing data.

Uploaded by

Keith Ng

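Python Data Science Cheat Sheet
Pandas Advanced
Translated by 呆鸟
天善智能 (business intelligence and big data community): www.hellobi.com

>>> import numpy as np
>>> import pandas as pd

Advanced Indexing (see also NumPy Arrays)

Selecting
>>> df3.loc[:, (df3 > 1).any()]         # select columns with any value greater than 1
>>> df3.loc[:, (df3 > 1).all()]         # select columns where all values are greater than 1
>>> df3.loc[:, df3.isnull().any()]      # select columns that contain NaN
>>> df3.loc[:, df3.notnull().all()]     # select columns that contain no NaN

Indexing with isin
>>> df[(df.Country.isin(df2.Type))]     # keep rows whose Country value also appears in df2.Type
>>> df3.filter(items=["a", "b"])        # keep only the listed labels
>>> df.loc[df.index % 5 == 0]           # select rows by a test on the index label (replaces df.select(lambda x: not x%5); DataFrame.select was removed in pandas 1.0)

Where
>>> s.where(s > 0)                      # keep values that satisfy the condition; the rest become NaN

Query
>>> df6.query('second > first')         # query the DataFrame with a boolean expression

Setting/Resetting Index
>>> df.set_index('Country')             # set the index
>>> df4 = df.reset_index()              # reset the index
>>> df = df.rename(index=str,           # rename the DataFrame columns
                   columns={"Country": "cntry",
                            "Capital": "cptl",
                            "Population": "ppltn"})

Reshaping Data

Pivot
>>> df3 = df2.pivot(index='Date',       # spread rows into columns
                    columns='Type',
                    values='Value')

df2 (input)
         Date Type   Value
0  2016-03-01    a  11.432
1  2016-03-02    b  13.031
2  2016-03-01    c  20.784
3  2016-03-03    a  99.906
4  2016-03-02    a   1.303
5  2016-03-03    c  20.784

df3 (result)
Type             a       b       c
Date
2016-03-01  11.432     NaN  20.784
2016-03-02   1.303  13.031     NaN
2016-03-03  99.906     NaN  20.784

Combining Data

data1
X1      X2
 a  11.432
 b   1.303
 c  99.906

data2
X1      X3
 a  20.784
 b     NaN
 d  20.784

Merge
>>> pd.merge(data1, data2, how='left', on='X1')     # left join: keep every key in data1
X1      X2      X3
 a  11.432  20.784
 b   1.303     NaN
 c  99.906     NaN

>>> pd.merge(data1, data2, how='right', on='X1')    # right join: keep every key in data2
X1      X2      X3
 a  11.432  20.784
 b   1.303     NaN
 d     NaN  20.784

>>> pd.merge(data1, data2, how='inner', on='X1')    # inner join: keep keys present in both
X1      X2      X3
 a  11.432  20.784
 b   1.303     NaN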
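The merges above can be reproduced with a short sketch. The sheet only shows data1 and data2 as tables, so the constructors below (and the assumption that X1 is an ordinary column rather than the index) are illustrative.

```python
import pandas as pd

# data1 and data2 rebuilt from the tables in the sheet (assumed layout: X1 is a regular column)
data1 = pd.DataFrame({"X1": ["a", "b", "c"], "X2": [11.432, 1.303, 99.906]})
data2 = pd.DataFrame({"X1": ["a", "b", "d"], "X3": [20.784, float("nan"), 20.784]})

left = pd.merge(data1, data2, how="left", on="X1")    # keys of data1: a, b, c
right = pd.merge(data1, data2, how="right", on="X1")  # keys of data2: a, b, d
inner = pd.merge(data1, data2, how="inner", on="X1")  # keys found in both: a, b

print(left)
print(right)
print(inner)
```

Only the `how` argument changes between the three calls; it decides which frame's keys survive, and unmatched cells are filled with NaN.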

>>> pd.merge(data1, data2, how='outer', on='X1')    # outer join: keep keys from both data1 and data2
X1      X2      X3
 a  11.432  20.784
 b   1.303     NaN
 c  99.906     NaN
 d     NaN  20.784

Reindexing
>>> s2 = s.reindex(['a', 'c', 'd', 'e', 'b'])

Forward filling
>>> df.reindex(range(4), method='ffill')
   Country    Capital  Population
0  Belgium   Brussels    11190846
1    India  New Delhi  1303171035
2   Brazil   Brasília   207847528
3   Brazil   Brasília   207847528

Backward filling
>>> s3 = s.reindex(range(5), method='bfill')
0    3
1    3
2    3
3    3
4    3

Pivot Table
>>> df4 = pd.pivot_table(df2,           # spread rows into columns
                         values='Value',
                         index='Date',
                         columns='Type')

Stack / Unstack
>>> stacked = df5.stack()               # pivot a level of column labels into the row index
>>> stacked.unstack()                   # pivot a level of index labels back into the columns
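A minimal sketch of the stack/unstack round trip, assuming df5 is built as in the sheet's MultiIndexing snippet. The fixed random seed is an addition for repeatability and is not part of the original.

```python
import numpy as np
import pandas as pd

# Two-level row index, mirroring the sheet's df5 (seeded generator is an assumption)
rng = np.random.default_rng(0)
arrays = [np.array([1, 2, 3]), np.array([5, 4, 3])]
df5 = pd.DataFrame(rng.random((3, 2)), index=arrays)

stacked = df5.stack()         # column labels 0 and 1 become the innermost index level
restored = stacked.unstack()  # the innermost index level moves back to the columns

print(stacked)
print(restored)
```

`stack()` turns the 3x2 frame into a Series of six values with a three-level index, and `unstack()` reverses that step.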
df5 (unstacked)
            0         1
1 5  0.233482  0.390959
2 4  0.184713  0.237102
3 3  0.433522  0.429401

df5.stack() (stacked)
1 5 0  0.233482
    1  0.390959
2 4 0  0.184713
    1  0.237102
3 3 0  0.433522
    1  0.429401

Join
>>> data1.join(data2, how='right')      # joins on the index; overlapping column names need lsuffix/rsuffix

Concatenate

Vertical
>>> pd.concat([s, s2])                  # replaces s.append(s2), which was removed in pandas 2.0

Horizontal/Vertical
>>> pd.concat([s, s2], axis=1, keys=['One', 'Two'])
>>> pd.concat([data1, data2], axis=1, join='inner')

MultiIndexing
>>> arrays = [np.array([1, 2, 3]),
              np.array([5, 4, 3])]
>>> df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays)
>>> tuples = list(zip(*arrays))
>>> index = pd.MultiIndex.from_tuples(tuples,
                                      names=['first', 'second'])
>>> df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
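The following sketch ties the MultiIndexing snippet to the `df6.query('second > first')` call from the Query section. It assumes fixed values in place of the sheet's random ones so the result is predictable.

```python
import numpy as np
import pandas as pd

arrays = [np.array([1, 2, 3]), np.array([5, 4, 3])]
tuples = list(zip(*arrays))  # pairs (1, 5), (2, 4), (3, 3)
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

# Fixed values instead of np.random.rand (assumption, for a repeatable result)
df6 = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=index)

# query() can refer to named index levels as if they were columns
subset = df6.query("second > first")
print(subset)
```

Naming the levels is what makes the query expression possible; with an unnamed MultiIndex the names `first` and `second` would not resolve.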
>>> df2.set_index(["Date", "Type"])     # build a MultiIndex from the Date and Type columns

Duplicate Data
>>> s3.unique()                                 # return the unique values
>>> df2.duplicated('Type')                      # flag duplicated values
>>> df2.drop_duplicates('Type', keep='last')    # drop duplicates, keeping the last occurrence
>>> df.index.duplicated()                       # flag duplicated index labels

Dates
>>> from datetime import datetime
>>> df2['Date'] = pd.to_datetime(df2['Date'])
>>> df2['Date'] = pd.date_range('2000-1-1',     # month-end dates; pandas 2.2+ prefers the alias 'ME'
                                periods=6,
                                freq='M')
>>> dates = [datetime(2012, 5, 1), datetime(2012, 5, 2)]
>>> index = pd.DatetimeIndex(dates)
>>> index = pd.date_range(datetime(2012, 2, 1), end, freq='BM')    # business month ends up to end, a datetime defined elsewhere; 'BME' in pandas 2.2+

Melt
>>> pd.melt(df2,                        # gather columns into rows
            id_vars=["Date"],
            value_vars=["Type", "Value"],
            value_name="Observations")

df2 (input)
         Date Type   Value
0  2016-03-01    a  11.432
1  2016-03-02    b  13.031
2  2016-03-01    c  20.784
3  2016-03-03    a  99.906
4  2016-03-02    a   1.303
5  2016-03-03    c  20.784

Result
          Date variable Observations
0   2016-03-01     Type            a
1   2016-03-02     Type            b
2   2016-03-01     Type            c
3   2016-03-03     Type            a
4   2016-03-02     Type            a
5   2016-03-03     Type            c
6   2016-03-01    Value       11.432
7   2016-03-02    Value       13.031
8   2016-03-01    Value       20.784
9   2016-03-03    Value       99.906
10  2016-03-02    Value        1.303
11  2016-03-03    Value       20.784

Visualization (see also Matplotlib)
>>> import matplotlib.pyplot as plt
>>> s.plot()
>>> plt.show()
>>> df2.plot()
>>> plt.show()

Grouping Data

Aggregation
>>> df2.groupby(by=['Date', 'Type']).mean()
>>> df4.groupby(level=0).sum()
>>> df4.groupby(level=0).agg({'a': lambda x: sum(x) / len(x),
                              'b': np.sum})

Transformation
>>> customSum = lambda x: (x + x % 2)
>>> df4.groupby(level=0).transform(customSum)
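To make the difference between aggregation and transformation concrete, here is a small sketch on the sheet's df2 data. Grouping by a single `Type` key and the `TypeMean` column name are simplifications chosen for illustration, not calls taken from the sheet.

```python
import pandas as pd

# df2 rebuilt from the table in the sheet
df2 = pd.DataFrame({
    "Date": ["2016-03-01", "2016-03-02", "2016-03-01",
             "2016-03-03", "2016-03-02", "2016-03-03"],
    "Type": ["a", "b", "c", "a", "a", "c"],
    "Value": [11.432, 13.031, 20.784, 99.906, 1.303, 20.784],
})

# Aggregation: one result row per group
mean_by_type = df2.groupby("Type")["Value"].mean()

# Transformation: the result keeps the original row index,
# so it can be stored next to the source rows (TypeMean is a hypothetical column name)
df2["TypeMean"] = df2.groupby("Type")["Value"].transform("mean")

print(mean_by_type)
print(df2)
```

Aggregation shrinks the data to one value per group, while transformation returns a value for every original row.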

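Iteration
>>> df.items()                  # (column label, Series) pairs; replaces df.iteritems(), removed in pandas 2.0
>>> df.iterrows()               # (row label, Series) pairs

Missing Data
>>> df.dropna()                 # drop rows that contain NaN
>>> df3.fillna(df3.mean())      # fill NaN with a predetermined value, here each column's mean
>>> df2.replace("a", "f")       # replace one value with another

Original author: DataCamp, Learn Python for Data Science Interactively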
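As a hedged illustration of the Missing Data calls in this sheet, the sketch below runs them on a small hypothetical frame. The values stand in for df3 and are not taken from the original.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df3; values are illustrative only
df3 = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

dropped = df3.dropna()             # keeps only row 0, the single row without NaN
filled = df3.fillna(df3.mean())    # each NaN becomes the mean of its own column
replaced = df3.replace(1.0, 10.0)  # swaps one value for another; NaN cells are untouched

print(dropped)
print(filled)
print(replaced)
```

None of these calls modifies df3 in place; each returns a new object, which is why the results are bound to new names.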
