10 Minutes to Pandas: Part 2
- December 26, 2019
- Notes
This post is the author's translation; the original is at https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
Recap of the previous post
Missing Data
pandas uses np.nan by default to represent missing data. Reindexing allows you to change, add, or delete the index on a specified axis; it returns a copy of the data.
In [9]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: s = pd.Series([1, 3, 5, np.nan, 6, 8])
   ...: dates = pd.date_range('20130101', periods=6)
   ...: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
   ...:
   ...: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
   ...: df['F'] = s1
   ...: df.at[dates[0], 'A'] = 0
   ...: df.iat[0, 1] = 0
   ...: df.loc[:, 'D'] = np.array([5] * len(df))

In [10]: df
Out[10]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  1.199418  5  NaN
2013-01-02  1.534227  0.769859 -0.674221  5  1.0
2013-01-03  0.874073 -0.389967 -0.670989  5  2.0
2013-01-04  0.942903  0.778283 -0.700779  5  3.0
2013-01-05  1.607993 -0.032569  1.991723  5  4.0
2013-01-06  0.345273  0.558341  0.792286  5  5.0

In [11]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

In [12]: df1.loc[dates[0]:dates[1], 'E'] = 1

In [13]: df1
Out[13]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  1.199418  5  NaN  1.0
2013-01-02  1.534227  0.769859 -0.674221  5  1.0  1.0
2013-01-03  0.874073 -0.389967 -0.670989  5  2.0  NaN
2013-01-04  0.942903  0.778283 -0.700779  5  3.0  NaN
Drop any rows that contain missing data:
In [14]: df1.dropna(how='any')
Out[14]:
                   A         B         C  D    F    E
2013-01-02  1.534227  0.769859 -0.674221  5  1.0  1.0
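Beyond how='any', dropna also takes thresh (minimum number of non-NaN values a row must have to survive) and subset (only consider certain columns). A small self-contained sketch with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0],
                   'C': [7.0, 8.0, 9.0]})

# Keep only rows holding at least 2 non-NaN values
kept = df.dropna(thresh=2)

# Drop rows only when column 'A' is NaN; NaN elsewhere is ignored
by_subset = df.dropna(subset=['A'])
```

Both variants return a new frame; the original df is untouched.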
Fill in missing values:
In [15]: df1.fillna(value=5)
Out[15]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  1.199418  5  5.0  1.0
2013-01-02  1.534227  0.769859 -0.674221  5  1.0  1.0
2013-01-03  0.874073 -0.389967 -0.670989  5  2.0  5.0
2013-01-04  0.942903  0.778283 -0.700779  5  3.0  5.0
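fillna is not limited to a single scalar: a dict fills each column with its own value, and ffill() propagates the last valid observation forward. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, np.nan]})

# Per-column fill values via a dict
filled = df.fillna({'A': 0.0, 'B': -1.0})

# Forward-fill: carry the last valid value down each column
ffilled = df.ffill()
```

Note that ffill leaves a leading NaN in place when there is nothing earlier to carry forward.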
Get a boolean mask showing where values are NaN:
In [16]: pd.isna(df1)
Out[16]:
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
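A common follow-up is counting the missing values per column by summing the boolean mask (True counts as 1). A sketch, with the frame rebuilt here as a small self-contained example:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'F': [np.nan, 1.0, 2.0, 3.0],
                    'E': [1.0, 1.0, np.nan, np.nan]})

# Sum the boolean mask column-wise to count NaNs per column
missing_per_column = pd.isna(df1).sum()
```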
Operations
Stats — operations in general exclude missing data. Perform a descriptive statistic:
In [17]: df.mean()  # axis=0 by default
Out[17]:
A    0.884078
B    0.280658
C    0.322906
D    5.000000
F    3.000000
dtype: float64
The same operation on the other axis:
In [18]: df.mean(1)  # axis=1
Out[18]:
2013-01-01    1.549854
2013-01-02    1.525973
2013-01-03    1.362623
2013-01-04    1.804082
2013-01-05    2.513429
2013-01-06    2.339180
Freq: D, dtype: float64
When operating between objects that have different dimensionality and need alignment, pandas automatically broadcasts along the specified dimension.
In [19]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

In [20]: s
Out[20]:
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [21]: df.sub(s, axis='index')
Out[21]:
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -0.125927 -1.389967 -1.670989  4.0  1.0
2013-01-04 -2.057097 -2.221717 -3.700779  2.0  0.0
2013-01-05 -3.392007 -5.032569 -3.008277  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN
Apply — applying functions to the data:
In [22]: df.apply(np.cumsum)  # cumulative sum down each column
Out[22]:
                   A         B         C   D     F
2013-01-01  0.000000  0.000000  1.199418   5   NaN
2013-01-02  1.534227  0.769859  0.525197  10   1.0
2013-01-03  2.408300  0.379892 -0.145793  15   3.0
2013-01-04  3.351203  1.158175 -0.846572  20   6.0
2013-01-05  4.959196  1.125606  1.145151  25  10.0
2013-01-06  5.304469  1.683947  1.937437  30  15.0

In [25]: df.apply(lambda x: x.max() - x.min())  # range (max minus min) of each column
Out[25]:
A    1.607993
B    1.168250
C    2.692502
D    0.000000
F    4.000000
dtype: float64
Histogramming
In [26]: s = pd.Series(np.random.randint(0, 7, size=10))

In [27]: s
Out[27]:
0    5
1    1
2    3
3    6
4    6
5    6
6    1
7    5
8    4
9    4
dtype: int32

In [28]: s.value_counts()
Out[28]:
6    3
5    2
4    2
1    2
3    1
dtype: int64
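value_counts can also report relative frequencies with normalize=True, which is often what a histogram actually needs. A sketch using a fixed series so the counts are deterministic:

```python
import pandas as pd

s = pd.Series([6, 6, 6, 5, 5, 4, 4, 1, 1, 3])

counts = s.value_counts()                # absolute counts, sorted descending
freqs = s.value_counts(normalize=True)   # fractions that sum to 1.0
```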
String Methods — Series comes with a set of string-processing methods (under the str attribute) that make it easy to operate on each element of the array, as in the snippet below. Note that pattern matching in these methods generally uses regular expressions by default (and in some cases always does).
In [32]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [33]: s.str.lower()
Out[33]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
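Since pattern matching defaults to regular expressions, str.contains is a handy companion to str.lower. A minimal sketch; na=False is used so missing values count as non-matches rather than propagating NaN into the boolean mask:

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'Aaba', 'Baca', np.nan, 'CABA', 'dog'])

# '^A' is a regex anchor: match strings that start with 'A'
mask = s.str.contains('^A', na=False)
```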
Merge
Concat — pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality for join/merge-type operations. Concatenating pandas objects together with concat():
In [34]: df = pd.DataFrame(np.random.randn(10, 4))

In [35]: df
Out[35]:
          0         1         2         3
0  2.173213 -0.598088  1.706941  1.058619
1  0.050132  0.772387 -0.375141  0.653398
2  1.903610 -0.840157  0.923137  0.598493
3  0.437401 -1.832521  0.535153  0.894782
4 -0.807621  0.280262  0.994054  0.402910
5 -0.717370 -1.312313 -1.041917  0.949461
6  0.923145  0.723936 -1.682883  1.406243
7  1.270020  1.309892  0.765172  1.322076
8 -1.432259  1.566898  0.186364  0.956206
9 -1.350429 -0.374433 -0.790461 -0.340310

# Slice the data into pieces
In [36]: pieces = [df[:2], df[4:7], df[8:]]

# Concatenate the pieces back together (row-wise)
In [37]: pd.concat(pieces)
Out[37]:
          0         1         2         3
0  2.173213 -0.598088  1.706941  1.058619
1  0.050132  0.772387 -0.375141  0.653398
4 -0.807621  0.280262  0.994054  0.402910
5 -0.717370 -1.312313 -1.041917  0.949461
6  0.923145  0.723936 -1.682883  1.406243
8 -1.432259  1.566898  0.186364  0.956206
9 -1.350429 -0.374433 -0.790461 -0.340310
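concat can also stitch objects together side by side: axis=1 concatenates along the column axis, aligning rows on the index. A minimal sketch with illustrative one-column frames:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
right = pd.DataFrame({'B': [3, 4]}, index=[0, 1])

# axis=1 places the frames next to each other, matching on the index
wide = pd.concat([left, right], axis=1)
```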
Join — SQL-style merges.
In [38]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [39]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [40]: left
Out[40]:
   key  lval
0  foo     1
1  foo     2

In [41]: right
Out[41]:
   key  rval
0  foo     4
1  foo     5

In [42]: pd.merge(left, right, on='key')
Out[42]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
Another example:
In [43]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})

In [44]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [45]: left
Out[45]:
   key  lval
0  foo     1
1  bar     2

In [46]: right
Out[46]:
   key  rval
0  foo     4
1  bar     5

In [47]: pd.merge(left, right, on='key')
Out[47]:
   key  lval  rval
0  foo     1     4
1  bar     2     5
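merge defaults to an inner join on the key column; the how parameter selects 'left', 'right', or 'outer' behavior, filling NaN where a key is missing on one side. A sketch with deliberately non-overlapping keys:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')               # only keys present on both sides
outer = pd.merge(left, right, on='key', how='outer')  # all keys, NaN where absent
```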
Append — append rows to a DataFrame:
In [48]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

In [49]: df
Out[49]:
          A         B         C         D
0 -0.541936  0.304257 -0.666548  1.448552
1  0.604079 -0.106248 -0.926829 -0.817308
2 -0.584797 -0.809370 -1.532935 -0.569670
3  0.690052 -0.605041 -0.465595  1.114489
4  0.913118 -0.605127  0.366342  0.393939
5  1.444903  0.274196  1.650371 -0.598488
6  0.063602  1.016262  0.493207 -0.814833
7 -0.042801 -1.752630  0.159843  0.645406

In [50]: s = df.iloc[3]  # select a row by position

In [51]: df.append(s, ignore_index=True)
Out[51]:
          A         B         C         D
0 -0.541936  0.304257 -0.666548  1.448552
1  0.604079 -0.106248 -0.926829 -0.817308
2 -0.584797 -0.809370 -1.532935 -0.569670
3  0.690052 -0.605041 -0.465595  1.114489
4  0.913118 -0.605127  0.366342  0.393939
5  1.444903  0.274196  1.650371 -0.598488
6  0.063602  1.016262  0.493207 -0.814833
7 -0.042801 -1.752630  0.159843  0.645406
8  0.690052 -0.605041 -0.465595  1.114489
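Note that DataFrame.append was deprecated and then removed in pandas 2.0; the same row-append can be expressed with pd.concat. A minimal sketch with small deterministic data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
row = df.iloc[1]  # a Series holding one row

# Equivalent of df.append(row, ignore_index=True) in newer pandas:
# turn the row back into a one-row frame, then concat
appended = pd.concat([df, row.to_frame().T], ignore_index=True)
```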
Grouping
By "group by" we mean a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
In [52]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
    ...:                          'foo', 'bar', 'foo', 'foo'],
    ...:                    'B': ['one', 'one', 'two', 'three',
    ...:                          'two', 'two', 'one', 'three'],
    ...:                    'C': np.random.randn(8),
    ...:                    'D': np.random.randn(8)})

In [53]: df
Out[53]:
     A      B         C         D
0  foo    one  0.214543 -0.743767
1  bar    one  0.933127 -0.380128
2  foo    two  0.106048  0.527558
3  bar  three  0.905622 -0.892613
4  foo    two -0.289133  1.136363
5  bar    two -0.323147 -0.013935
6  foo    one  1.070073 -0.038362
7  foo  three  0.130030  0.455695
Group, then apply the sum() function to the resulting groups:
In [54]: df.groupby('A').sum()
Out[54]:
            C         D
A
bar  1.515602 -1.286676
foo  1.231561  1.337486
Grouping by multiple columns forms a hierarchical index; again, we apply the sum function:
In [55]: df.groupby(['A', 'B']).sum()
Out[55]:
                  C         D
A   B
bar one    0.933127 -0.380128
    three  0.905622 -0.892613
    two   -0.323147 -0.013935
foo one    1.284616 -0.782129
    three  0.130030  0.455695
    two   -0.183085  1.663921
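groupby is not limited to one reduction at a time: agg applies several functions per group in one pass. A sketch using a small deterministic frame (the column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# Several reductions per group at once; result columns are named after the functions
stats = df.groupby('A')['C'].agg(['sum', 'mean', 'max'])
```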
Reshaping
Stack
In [56]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
    ...:                      'foo', 'foo', 'qux', 'qux'],
    ...:                     ['one', 'two', 'one', 'two',
    ...:                      'one', 'two', 'one', 'two']]))

In [57]: tuples
Out[57]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [58]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [59]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

In [60]: df
Out[60]:
                     A         B
first second
bar   one    -0.236559 -1.052202
      two     0.663142 -0.426109
baz   one     0.903963 -0.326161
      two    -0.640158  1.695094
foo   one     0.684574 -0.352153
      two    -0.156945 -0.282294
qux   one     0.853196  0.920496
      two    -0.131899  1.526732

In [61]: df2 = df[:4]

In [62]: df2
Out[62]:
                     A         B
first second
bar   one    -0.236559 -1.052202
      two     0.663142 -0.426109
baz   one     0.903963 -0.326161
      two    -0.640158  1.695094
The stack() method "compresses" a level in the DataFrame's columns, reducing it by one level (two dimensions to one; columns become rows):
In [63]: stacked = df2.stack()

In [64]: stacked
Out[64]:
first  second
bar    one     A   -0.236559
               B   -1.052202
       two     A    0.663142
               B   -0.426109
baz    one     A    0.903963
               B   -0.326161
       two     A   -0.640158
               B    1.695094
With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:
In [66]: stacked.unstack()
Out[66]:
                     A         B
first second
bar   one    -0.236559 -1.052202
      two     0.663142 -0.426109
baz   one     0.903963 -0.326161
      two    -0.640158  1.695094

In [67]: stacked.unstack(1)
Out[67]:
second        one       two
first
bar   A -0.236559  0.663142
      B -1.052202 -0.426109
baz   A  0.903963 -0.640158
      B -0.326161  1.695094

In [68]: stacked.unstack(0)
Out[68]:
first         bar       baz
second
one   A -0.236559  0.903963
      B -1.052202 -0.326161
two   A  0.663142 -0.640158
      B -0.426109  1.695094
Pivot Tables
In [69]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
    ...:                    'B': ['A', 'B', 'C'] * 4,
    ...:                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
    ...:                    'D': np.random.randn(12),
    ...:                    'E': np.random.randn(12)})

In [70]: df
Out[70]:
        A  B    C         D         E
0     one  A  foo  0.356812  0.508259
1     one  B  foo -0.223585  2.090394
2     two  C  foo -0.366076  0.835971
3   three  A  bar -0.107184  0.377065
4     one  B  bar  0.335025  1.876482
5     one  C  bar  1.300853 -0.437791
6     two  A  foo -1.127656  1.240071
7   three  B  foo -0.719621 -0.541175
8     one  C  foo -0.464972  0.662851
9     one  A  bar -1.283444 -0.518707
10    two  B  bar -0.118690 -1.065746
11  three  C  bar -0.605458  0.908448
We can produce a pivot table from this data very easily:
In [71]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[71]:
C             bar       foo
A     B
one   A -1.283444  0.356812
      B  0.335025 -0.223585
      C  1.300853 -0.464972
three A -0.107184       NaN
      B       NaN -0.719621
      C -0.605458       NaN
two   A       NaN -1.127656
      B -0.118690       NaN
      C       NaN -0.366076
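pivot_table aggregates with the mean by default; the aggfunc parameter selects another reduction, and margins=True adds 'All' row/column totals. A sketch with a small deterministic frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['bar', 'foo', 'bar', 'foo'],
                   'D': [1.0, 2.0, 3.0, 4.0]})

# Sum instead of the default mean, with grand totals in an 'All' row and column
table = pd.pivot_table(df, values='D', index='A', columns='C',
                       aggfunc='sum', margins=True)
```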
Time Series
pandas has simple, powerful, and efficient functionality for resampling high-frequency data during frequency conversion (for example, converting second-level data into 5-minute data). This is extremely common in financial applications, but by no means limited to them.
In [72]: rng = pd.date_range('1/1/2012', periods=100, freq='S')

In [73]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [74]: ts.resample('5Min').sum()
Out[74]:
2012-01-01    26018
Freq: 5T, dtype: int32
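resample is not limited to sum(): any reduction can close the bins. A sketch using a deterministic series (the values 0 through 99 at one-second intervals), so the bin means are predictable:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2012', periods=100, freq='s')
ts = pd.Series(np.arange(100), index=rng)

# Downsample to one-minute bins and take the mean of each bin
per_min = ts.resample('1min').mean()
```

The 100 seconds span two one-minute bins: values 0-59 (mean 29.5) and values 60-99 (mean 79.5).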
Time zone representation:
In [80]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')

In [81]: ts = pd.Series(np.random.randn(len(rng)), rng)

In [82]: ts
Out[82]:
2012-03-06    0.874556
2012-03-07    3.030650
2012-03-08   -0.286538
2012-03-09    0.154653
2012-03-10   -0.513948
Freq: D, dtype: float64

In [83]: ts_utc = ts.tz_localize('UTC')

In [84]: ts_utc
Out[84]:
2012-03-06 00:00:00+00:00    0.874556
2012-03-07 00:00:00+00:00    3.030650
2012-03-08 00:00:00+00:00   -0.286538
2012-03-09 00:00:00+00:00    0.154653
2012-03-10 00:00:00+00:00   -0.513948
Freq: D, dtype: float64
Converting to another time zone:
In [85]: ts_utc.tz_convert('US/Eastern')
Out[85]:
2012-03-05 19:00:00-05:00    0.874556
2012-03-06 19:00:00-05:00    3.030650
2012-03-07 19:00:00-05:00   -0.286538
2012-03-08 19:00:00-05:00    0.154653
2012-03-09 19:00:00-05:00   -0.513948
Freq: D, dtype: float64
Converting between time span representations:
In [86]: rng = pd.date_range('1/1/2012', periods=5, freq='M')

In [87]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [88]: ts
Out[88]:
2012-01-31   -0.135612
2012-02-29    2.293425
2012-03-31    0.789431
2012-04-30   -0.525814
2012-05-31    1.558781
Freq: M, dtype: float64

# Convert to periods; the frequency is inferred from the timestamps by default,
# but can also be given explicitly, e.g. ts.to_period('M') or ts.to_period('Y')
In [89]: ps = ts.to_period()

In [90]: ps
Out[90]:
2012-01   -0.135612
2012-02    2.293425
2012-03    0.789431
2012-04   -0.525814
2012-05    1.558781
Freq: M, dtype: float64

In [91]: ps.to_timestamp()
Out[91]:
2012-01-01   -0.135612
2012-02-01    2.293425
2012-03-01    0.789431
2012-04-01   -0.525814
2012-05-01    1.558781
Freq: MS, dtype: float64
Converting between span representations lets some convenient arithmetic functions be applied to the index. In the following example, we convert a quarterly frequency with fiscal year ending in November into 9am of the first day of the month following each quarter end:
# pandas supports 12 quarterly frequencies, Q-JAN through Q-DEC;
# Q-NOV means the fiscal year ends in November
In [95]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

In [96]: prng
Out[96]:
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='period[Q-NOV]', freq='Q-NOV')

In [97]: ts = pd.Series(np.random.randn(len(prng)), prng)

In [98]: ts.head()
Out[98]:
1990Q1   -0.394540
1990Q2    0.380933
1990Q3    0.108135
1990Q4    0.874617
1991Q1   -0.560237
Freq: Q-NOV, dtype: float64

# Convert to the month at quarter end; equivalent to prng.asfreq('M', how='e')
In [99]: ts.index = prng.asfreq('M', 'e')

In [100]: ts
Out[100]:
1990-02   -0.394540
1990-05    0.380933
1990-08    0.108135
1990-11    0.874617
1991-02   -0.560237
1991-05   -0.912935
Freq: M, dtype: float64

# Convert to the month following the quarter end
In [101]: ts.index = prng.asfreq('M', 'e') + 1

In [103]: ts.head()
Out[103]:
1990-03   -0.394540
1990-06    0.380933
1990-09    0.108135
1990-12    0.874617
1991-03   -0.560237
Freq: M, dtype: float64

# Convert to 9am of the first day of the month following the quarter end;
# equivalent to ts.index = (prng.asfreq('M', how='end') + 1).asfreq('H', how='start') + 9
In [104]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

In [107]: ts.head()
Out[107]:
1990-03-01 09:00   -0.394540
1990-06-01 09:00    0.380933
1990-09-01 09:00    0.108135
1990-12-01 09:00    0.874617
1991-03-01 09:00   -0.560237
Freq: H, dtype: float64