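Python Data Science Cheat Sheet - Pandas Advanced
Translated by 呆鸟
天善智能 (business intelligence and big data community), www.hellobi.com

Sample data: data1 and data2

data1            data2
X1  X2           X1  X3
a   11.432       a   20.784
b    1.303       b   NaN
c   99.906       d   20.784

Selecting
>>> df3.loc[:, (df3 > 1).any()]        Select columns with any value greater than 1
>>> df3.loc[:, (df3 > 1).all()]        Select columns with all values greater than 1
>>> df3.loc[:, df3.isnull().any()]     Select columns containing NaN
>>> df3.loc[:, df3.notnull().all()]    Select columns without NaN

Selecting with isin
>>> df[(df.Country.isin(df2.Type))]    Select rows whose Country value appears in df2.Type
>>> df3.filter(items=["a", "b"])       Select specific labels (columns by default)
>>> df.select(lambda x: not x % 5)     Select rows whose index label is a multiple of 5
    Note: DataFrame.select was removed in pandas 1.0; df.loc[df.index % 5 == 0] gives the same rows for a numeric index.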
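The selection idioms above can be tried on a small frame. This is a minimal sketch: the contents of `df3`, `df`, `lookup`, and `numbered` are invented for illustration and are not the cheat sheet's own data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df3; values are invented for illustration.
df3 = pd.DataFrame({"a": [0.5, 2.0, 3.0],
                    "b": [2.0, 4.0, 6.0],
                    "c": [0.1, np.nan, 0.3]})

any_gt1 = df3.loc[:, (df3 > 1).any()]       # columns with at least one value > 1
all_gt1 = df3.loc[:, (df3 > 1).all()]       # columns where every value is > 1
has_nan = df3.loc[:, df3.isnull().any()]    # columns that contain NaN
no_nan = df3.loc[:, df3.notnull().all()]    # columns with no NaN at all
ab_only = df3.filter(items=["a", "b"])      # keep only the listed column labels

# Hypothetical frames for the isin example.
df = pd.DataFrame({"Country": ["Belgium", "India", "Brazil"]})
lookup = pd.DataFrame({"Type": ["India", "Brazil"]})
matched = df[df.Country.isin(lookup.Type)]  # rows whose Country is in lookup.Type

# Index-label selection without the removed DataFrame.select.
numbered = pd.DataFrame({"v": range(12)})
every_fifth = numbered.loc[numbered.index % 5 == 0]
```

Note that a comparison with NaN is False, so a column holding NaN can never pass the `all()` test.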
Selecting with where
>>> s.where(s > 0)                     Subset the data (values failing the condition become NaN)

Selecting with query
>>> df6.query('second > first')        Query the DataFrame

Reindexing
>>> s2 = s.reindex(['a','c','d','e','b'])

Forward filling
>>> df.reindex(range(4), method='ffill')
   Country  Capital    Population
0  Belgium  Brussels   11190846
1  India    New Delhi  1303171035
2  Brazil   Brasília   207847528
3  Brazil   Brasília   207847528

Backward filling
>>> s3 = s.reindex(range(5), method='bfill')
0  3
1  3
2  3
3  3
4  3

Merge
>>> pd.merge(data1, data2, how='left', on='X1')
X1  X2      X3
a   11.432  20.784
b    1.303  NaN
c   99.906  NaN

>>> pd.merge(data1, data2, how='inner', on='X1')
X1  X2      X3
a   11.432  20.784
b    1.303  NaN

>>> pd.merge(data1, data2, how='outer', on='X1')
X1  X2      X3
a   11.432  20.784
b    1.303  NaN
c   99.906  NaN
d   NaN     20.784

Reshaping Data

Pivot
>>> df3 = df2.pivot(index='Date',      Spread rows into columns
                    columns='Type',
                    values='Value')
    df2 has the columns Date, Type, Value.

Pivot Table
>>> df4 = pd.pivot_table(df2,          Spread rows into columns
                         values='Value',
                         index='Date',
                         columns='Type')

Stack / Unstack
>>> stacked = df5.stack()              Pivot a level of column labels
>>> stacked.unstack()                  Pivot a level of index labels
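The merge, pivot, and reindex calls above can be reproduced with the values shown in the sheet's sample tables. This is a minimal sketch: the construction of `data1`, `data2`, and `df2` is an assumption based on those tables, and the Series `s` is invented for the forward-fill step.

```python
import numpy as np
import pandas as pd

# Frames rebuilt from the sample tables; how the sheet built them is not shown.
data1 = pd.DataFrame({"X1": ["a", "b", "c"], "X2": [11.432, 1.303, 99.906]})
data2 = pd.DataFrame({"X1": ["a", "b", "d"], "X3": [20.784, np.nan, 20.784]})

left = pd.merge(data1, data2, how="left", on="X1")    # every key in data1
inner = pd.merge(data1, data2, how="inner", on="X1")  # keys present in both
outer = pd.merge(data1, data2, how="outer", on="X1")  # union of keys

df2 = pd.DataFrame({
    "Date": ["2016-03-01", "2016-03-02", "2016-03-01",
             "2016-03-03", "2016-03-02", "2016-03-03"],
    "Type": ["a", "b", "c", "a", "a", "c"],
    "Value": [11.432, 13.031, 20.784, 99.906, 1.303, 20.784],
})

# pivot needs unique (index, columns) pairs; pivot_table aggregates instead
# (mean by default), so it also tolerates duplicate pairs.
df3 = df2.pivot(index="Date", columns="Type", values="Value")
df4 = pd.pivot_table(df2, values="Value", index="Date", columns="Type")

# Hypothetical Series for forward filling onto a denser index.
s = pd.Series([3.0, 7.0], index=[0, 2])
filled = s.reindex(range(4), method="ffill")
```

The `b` row survives the inner merge even though its `X3` is NaN: the join matches on the key `X1`, not on the values carried along.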
Unstacked
        0         1
1  5    0.233482  0.390959
2  4    0.184713  0.237102
3  3    0.433522  0.429401

Stacked
1  5  0    0.233482
      1    0.390959
2  4  0    0.184713
      1    0.237102
3  3  0    0.433522
      1    0.429401

Join
>>> data1.join(data2, how='right')
    Note: join aligns on the index; if both frames still carry an X1 column, pass lsuffix/rsuffix to avoid a column-overlap error.

Concatenate
Vertical
>>> s.append(s2)
    Note: Series.append was removed in pandas 2.0; pd.concat([s, s2]) is the current equivalent.
Horizontal / vertical
>>> pd.concat([s, s2], axis=1, keys=['One', 'Two'])
>>> pd.concat([data1, data2], axis=1, join='inner')

MultiIndexing
>>> arrays = [np.array([1,2,3]),
              np.array([5,4,3])]
>>> df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays)
>>> tuples = list(zip(*arrays))
>>> index = pd.MultiIndex.from_tuples(tuples,
                                      names=['first', 'second'])
>>> df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
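The MultiIndex, stack/unstack, query, and concatenation steps above fit together as follows. This is a minimal sketch: it uses the fixed values from the stacked/unstacked illustration in place of `np.random.rand`, so the result is repeatable, and the Series `s` and `s2` are invented for illustration.

```python
import numpy as np
import pandas as pd

arrays = [np.array([1, 2, 3]), np.array([5, 4, 3])]
# Fixed values taken from the illustration, substituted for np.random.rand(3, 2).
values = np.array([[0.233482, 0.390959],
                   [0.184713, 0.237102],
                   [0.433522, 0.429401]])

df5 = pd.DataFrame(values, index=arrays)
stacked = df5.stack()           # column labels become the innermost index level
roundtrip = stacked.unstack()   # innermost index level goes back to columns

tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df6 = pd.DataFrame(values, index=index)
# Named index levels can be referenced inside query expressions.
selected = df6.query("second > first")

# Hypothetical Series for the concatenation calls.
s = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["b", "c"])
vertical = pd.concat([s, s2])                            # replaces s.append(s2)
side_by_side = pd.concat([s, s2], axis=1, keys=["One", "Two"])
```

With `axis=1`, `pd.concat` aligns on the index and fills gaps with NaN, so `side_by_side` has one row per distinct label.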
>>> df2.set_index(["Date", "Type"])    Use the Date and Type columns as a MultiIndex

Melt
>>> pd.melt(df2,                       Gather columns into rows
            id_vars=["Date"],
            value_vars=["Type", "Value"],
            value_name="Observations")

df2
   Date        Type  Value
0  2016-03-01  a     11.432
1  2016-03-02  b     13.031
2  2016-03-01  c     20.784
3  2016-03-03  a     99.906
4  2016-03-02  a      1.303
5  2016-03-03  c     20.784

Melted result
    Date        Variable  Observations
0   2016-03-01  Type      a
1   2016-03-02  Type      b
2   2016-03-01  Type      c
3   2016-03-03  Type      a
4   2016-03-02  Type      a
5   2016-03-03  Type      c
6   2016-03-01  Value     11.432
7   2016-03-02  Value     13.031
8   2016-03-01  Value     20.784
9   2016-03-03  Value     99.906
10  2016-03-02  Value     1.303
11  2016-03-03  Value     20.784

Duplicate Data
>>> s3.unique()                                 Return unique values
>>> df2.duplicated('Type')                      Find duplicates
>>> df2.drop_duplicates('Type', keep='last')    Drop duplicates
>>> df.index.duplicated()                       Find duplicate index labels

Dates
>>> df2['Date'] = pd.to_datetime(df2['Date'])
>>> df2['Date'] = pd.date_range('2000-1-1',
                                periods=6,
                                freq='M')
>>> dates = [datetime(2012,5,1), datetime(2012,5,2)]
>>> index = pd.DatetimeIndex(dates)
>>> index = pd.date_range(datetime(2012,2,1), end, freq='BM')
    Note: end is a closing datetime that this sheet does not define. pandas 2.2 and later prefer the aliases 'ME' and 'BME' over 'M' and 'BM'.

Grouping Data
Aggregation
>>> df2.groupby(by=['Date','Type']).mean()
>>> df4.groupby(level=0).sum()
>>> df4.groupby(level=0).agg({'a': lambda x: sum(x)/len(x),
                              'b': np.sum})
Transformation
>>> customSum = lambda x: (x + x%2)
>>> df4.groupby(level=0).transform(customSum)

Visualization (see also Matplotlib)
>>> import matplotlib.pyplot as plt
>>> s.plot()
>>> plt.show()
>>> df2.plot()
>>> plt.show()
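The melt, duplicate-handling, and grouping calls above can be checked against the `df2` table shown in the sheet. This is a minimal sketch: the way `df2` is constructed here is an assumption, and the per-date sum is a simplified stand-in for the `df4` aggregations.

```python
import pandas as pd

# df2 rebuilt from the table in the sheet.
df2 = pd.DataFrame({
    "Date": ["2016-03-01", "2016-03-02", "2016-03-01",
             "2016-03-03", "2016-03-02", "2016-03-03"],
    "Type": ["a", "b", "c", "a", "a", "c"],
    "Value": [11.432, 13.031, 20.784, 99.906, 1.303, 20.784],
})

# Gather the Type and Value columns into rows, keeping Date as identifier.
melted = pd.melt(df2,
                 id_vars=["Date"],
                 value_vars=["Type", "Value"],
                 value_name="Observations")

dup_mask = df2.duplicated("Type")                     # True for repeated Type values
last_of_each = df2.drop_duplicates("Type", keep="last")

# Parse the strings into timestamps before any date arithmetic.
df2["Date"] = pd.to_datetime(df2["Date"])

group_mean = df2.groupby(by=["Date", "Type"]).mean()  # one row per (Date, Type)
date_sum = df2.groupby("Date")["Value"].sum()         # total Value per date
```

pandas names the melted label column `variable` (lowercase) unless `var_name` is passed.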
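Missing Data
>>> df.dropna()                    Drop rows containing NaN
>>> df3.fillna(df3.mean())         Fill NaN with a predetermined value (here, each column's mean)
>>> df2.replace("a", "f")          Replace one value with another

Iteration
>>> df.iteritems()                 (column label, Series) pairs
    Note: DataFrame.iteritems was removed in pandas 2.0; df.items() is the current equivalent.
>>> df.iterrows()                  (row label, Series) pairs

Original author: DataCamp, Learn Python for Data Science Interactively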
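The missing-data and iteration calls above behave as follows on a small frame. This is a minimal sketch: the frame `df` and its values are invented for illustration, and `items()` is used in place of the removed `iteritems()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": ["x", "a", "a"]})

dropped = df.dropna()                           # rows containing NaN are removed
filled = df.fillna(df.mean(numeric_only=True))  # NaN in "a" becomes the column mean
replaced = df.replace("a", "f")                 # every cell equal to "a" becomes "f"

# items() yields (column label, Series); iterrows() yields (row label, Series).
column_labels = [label for label, column in df.items()]
row_labels = [label for label, row in df.iterrows()]
```

Both iterators are convenient for inspection but slow on large frames; vectorised operations are usually the better choice when performance matters.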