10 Minutes to pandas (Part 2)

  • December 26, 2019
  • Notes

This article is the author's translation; the original is at https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

Previously

10 Minutes to pandas (Part 1)

Missing data

By default, pandas uses np.nan to represent missing data. Reindexing lets you change, add, or delete the index on a specified axis; it returns a copy of the data.

In [9]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: s = pd.Series([1, 3, 5, np.nan, 6, 8])
   ...: #s
   ...: dates = pd.date_range('20130101', periods=6)
   ...: #dates
   ...: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
   ...:
   ...: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
   ...: df['F'] = s1
   ...: df.at[dates[0], 'A'] = 0
   ...: df.iat[0, 1] = 0
   ...: df.loc[:, 'D'] = np.array([5] * len(df))

In [10]: df
Out[10]:
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  1.199418  5  NaN
2013-01-02  1.534227  0.769859 -0.674221  5  1.0
2013-01-03  0.874073 -0.389967 -0.670989  5  2.0
2013-01-04  0.942903  0.778283 -0.700779  5  3.0
2013-01-05  1.607993 -0.032569  1.991723  5  4.0
2013-01-06  0.345273  0.558341  0.792286  5  5.0

In [11]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

In [12]: df1.loc[dates[0]:dates[1], 'E'] = 1

In [13]: df1
Out[13]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  1.199418  5  NaN  1.0
2013-01-02  1.534227  0.769859 -0.674221  5  1.0  1.0
2013-01-03  0.874073 -0.389967 -0.670989  5  2.0  NaN
2013-01-04  0.942903  0.778283 -0.700779  5  3.0  NaN

Drop any rows that contain missing data:

In [14]: df1.dropna(how='any')
Out[14]:
                   A         B         C  D    F    E
2013-01-02  1.534227  0.769859 -0.674221  5  1.0  1.0
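dropna accepts a few more options worth knowing. A minimal sketch with a made-up frame (the values below are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0],
                   'C': [7.0, 8.0, 9.0]})

any_dropped = df.dropna(how='any')     # drop rows with at least one NaN
all_dropped = df.dropna(how='all')     # drop only rows that are entirely NaN
thresh_kept = df.dropna(thresh=2)      # keep rows with at least 2 non-NaN values
subset_kept = df.dropna(subset=['A'])  # only look for NaN in column 'A'
```

Here only the last row survives how='any', while how='all' drops nothing because no row is entirely NaN.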

Fill missing values:

In [15]: df1.fillna(value=5)
Out[15]:
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  1.199418  5  5.0  1.0
2013-01-02  1.534227  0.769859 -0.674221  5  1.0  1.0
2013-01-03  0.874073 -0.389967 -0.670989  5  2.0  5.0
2013-01-04  0.942903  0.778283 -0.700779  5  3.0  5.0
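Besides a single scalar, fillna accepts a per-column dict, and ffill propagates the last valid observation forward. A small sketch with invented data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, np.nan]})

by_column = df.fillna(value={'A': 0.0, 'B': -1.0})  # different fill value per column
forward = df.ffill()                                # carry the last valid value forward
```

Note that ffill cannot fill a NaN that appears before any valid value (the first entry of B stays NaN).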

Get a boolean mask of where values are NaN:

In [16]: pd.isna(df1)
Out[16]:
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
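Because the mask is a boolean frame, a common trick is summing it to count missing values per column or overall. A quick sketch with invented data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, 6.0]})

missing_per_column = df.isna().sum()    # NaN count for each column
missing_total = df.isna().sum().sum()   # NaN count for the whole frame
```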

Operations

Stats. Operations in general exclude missing data. Performing a descriptive statistic:

In [17]: df.mean()  # axis=0
Out[17]:
A    0.884078
B    0.280658
C    0.322906
D    5.000000
F    3.000000
dtype: float64

The same operation on the other axis:

In [18]: df.mean(1)  # axis=1
Out[18]:
2013-01-01    1.549854
2013-01-02    1.525973
2013-01-03    1.362623
2013-01-04    1.804082
2013-01-05    2.513429
2013-01-06    2.339180
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment: pandas automatically broadcasts along the specified dimension.

In [19]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

In [20]: s
Out[20]:
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [21]: df.sub(s, axis='index')
Out[21]:
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -0.125927 -1.389967 -1.670989  4.0  1.0
2013-01-04 -2.057097 -2.221717 -3.700779  2.0  0.0
2013-01-05 -3.392007 -5.032569 -3.008277  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN
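The alignment behavior is easier to see with a small deterministic example (all values below are made up): the Series is subtracted from every column, rows are matched by index, and a NaN in the Series makes the whole result row NaN.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01', periods=3)
df = pd.DataFrame({'A': [10.0, 20.0, 30.0],
                   'B': [1.0, 2.0, 3.0]}, index=idx)
s = pd.Series([1.0, 2.0, np.nan], index=idx)

# subtract s from each column, aligning on the row index
diff = df.sub(s, axis='index')
```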

Apply. Applying functions to the data:

In [22]: df.apply(np.cumsum)  # cumulative sum down each column
Out[22]:
                   A         B         C   D     F
2013-01-01  0.000000  0.000000  1.199418   5   NaN
2013-01-02  1.534227  0.769859  0.525197  10   1.0
2013-01-03  2.408300  0.379892 -0.145793  15   3.0
2013-01-04  3.351203  1.158175 -0.846572  20   6.0
2013-01-05  4.959196  1.125606  1.145151  25  10.0
2013-01-06  5.304469  1.683947  1.937437  30  15.0

In [25]: df.apply(lambda x: x.max() - x.min())  # range (max minus min) of each column
Out[25]:
A    1.607993
B    1.168250
C    2.692502
D    0.000000
F    4.000000
dtype: float64
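When you want several reductions at once rather than a single function, agg takes a list of function names and returns one row per aggregation. A short sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [4.0, 5.0, 6.0]})

# one result row per aggregation, labeled by function name
stats = df.agg(['min', 'max', 'mean'])
```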

Histogramming

In [26]: s = pd.Series(np.random.randint(0, 7, size=10))

In [27]: s
Out[27]:
0    5
1    1
2    3
3    6
4    6
5    6
6    1
7    5
8    4
9    4
dtype: int32

In [28]: s.value_counts()
Out[28]:
6    3
5    2
4    2
1    2
3    1
dtype: int64

String methods. Series is equipped with a set of string processing methods that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern matching in str generally uses regular expressions by default (and in some cases always does).

In [32]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [33]: s.str.lower()
Out[33]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
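To see the regular-expression behavior mentioned above, str.contains is a good example: the pattern below is a regex, and na= controls what NaN entries map to. The data is a shortened, made-up version of the series above:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Aaba', 'Baca', np.nan, 'CABA', 'dog'])

# regex=True is the default; this matches strings starting with 'A' or 'B'
mask = s.str.contains(r'^[AB]', regex=True, na=False)
```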

Merge

Concat. pandas provides various facilities for easily combining Series and DataFrames, with various kinds of set logic for the indexes and relational-algebra functionality for join/merge-type operations. Concatenating pandas objects together with concat():

In [34]: df = pd.DataFrame(np.random.randn(10, 4))

In [35]: df
Out[35]:
          0         1         2         3
0  2.173213 -0.598088  1.706941  1.058619
1  0.050132  0.772387 -0.375141  0.653398
2  1.903610 -0.840157  0.923137  0.598493
3  0.437401 -1.832521  0.535153  0.894782
4 -0.807621  0.280262  0.994054  0.402910
5 -0.717370 -1.312313 -1.041917  0.949461
6  0.923145  0.723936 -1.682883  1.406243
7  1.270020  1.309892  0.765172  1.322076
8 -1.432259  1.566898  0.186364  0.956206
9 -1.350429 -0.374433 -0.790461 -0.340310

# break the frame into pieces
In [36]: pieces = [df[:2], df[4:7], df[8:]]

# concatenate the pieces back together (row-wise)
In [37]: pd.concat(pieces)
Out[37]:
          0         1         2         3
0  2.173213 -0.598088  1.706941  1.058619
1  0.050132  0.772387 -0.375141  0.653398
4 -0.807621  0.280262  0.994054  0.402910
5 -0.717370 -1.312313 -1.041917  0.949461
6  0.923145  0.723936 -1.682883  1.406243
8 -1.432259  1.566898  0.186364  0.956206
9 -1.350429 -0.374433 -0.790461 -0.340310
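concat can also stack frames side by side (axis=1) or label each piece with keys, which become an extra outer index level. A short sketch with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2]})
right = pd.DataFrame({'B': [3, 4]})

side_by_side = pd.concat([left, right], axis=1)          # align on index, add columns
labeled = pd.concat([left, left], keys=['first', 'second'])  # keys form an outer level
```

Selecting labeled.loc['second'] then recovers the original piece.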

Join. SQL-style merges:

In [38]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [39]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [40]: left
Out[40]:
   key  lval
0  foo     1
1  foo     2

In [41]: right
Out[41]:
   key  rval
0  foo     4
1  foo     5

In [42]: pd.merge(left, right, on='key')
Out[42]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

Another example:

In [43]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})

In [44]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [45]: left
Out[45]:
   key  lval
0  foo     1
1  bar     2

In [46]: right
Out[46]:
   key  rval
0  foo     4
1  bar     5

In [47]: pd.merge(left, right, on='key')
Out[47]:
   key  lval  rval
0  foo     1     4
1  bar     2     5
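By default merge performs an inner join, keeping only keys present on both sides. The how parameter selects other SQL-style joins; a quick sketch with invented keys:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')                 # only matching keys survive
left_join = pd.merge(left, right, on='key', how='left') # keep every row of `left`
```

In the left join, 'bar' has no match on the right, so its rval is NaN.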

Append. Append rows to a DataFrame:

In [48]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

In [49]: df
Out[49]:
          A         B         C         D
0 -0.541936  0.304257 -0.666548  1.448552
1  0.604079 -0.106248 -0.926829 -0.817308
2 -0.584797 -0.809370 -1.532935 -0.569670
3  0.690052 -0.605041 -0.465595  1.114489
4  0.913118 -0.605127  0.366342  0.393939
5  1.444903  0.274196  1.650371 -0.598488
6  0.063602  1.016262  0.493207 -0.814833
7 -0.042801 -1.752630  0.159843  0.645406

In [50]: s = df.iloc[3]  # select a row by position

In [51]: df.append(s, ignore_index=True)
Out[51]:
          A         B         C         D
0 -0.541936  0.304257 -0.666548  1.448552
1  0.604079 -0.106248 -0.926829 -0.817308
2 -0.584797 -0.809370 -1.532935 -0.569670
3  0.690052 -0.605041 -0.465595  1.114489
4  0.913118 -0.605127  0.366342  0.393939
5  1.444903  0.274196  1.650371 -0.598488
6  0.063602  1.016262  0.493207 -0.814833
7 -0.042801 -1.752630  0.159843  0.645406
8  0.690052 -0.605041 -0.465595  1.114489
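Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions the same result is obtained with pd.concat. A sketch with deterministic made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
row = df.iloc[1]  # a Series holding one row

# turn the row back into a one-row frame and concatenate it,
# renumbering the index as ignore_index=True did for append
appended = pd.concat([df, row.to_frame().T], ignore_index=True)
```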

Grouping

By "group by" we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure
In [52]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
    ...:                          'foo', 'bar', 'foo', 'foo'],
    ...:                    'B': ['one', 'one', 'two', 'three',
    ...:                          'two', 'two', 'one', 'three'],
    ...:                    'C': np.random.randn(8),
    ...:                    'D': np.random.randn(8)})

In [53]: df
Out[53]:
     A      B         C         D
0  foo    one  0.214543 -0.743767
1  bar    one  0.933127 -0.380128
2  foo    two  0.106048  0.527558
3  bar  three  0.905622 -0.892613
4  foo    two -0.289133  1.136363
5  bar    two -0.323147 -0.013935
6  foo    one  1.070073 -0.038362
7  foo  three  0.130030  0.455695

Grouping and then applying the sum() function to the resulting groups:

In [54]: df.groupby('A').sum()
Out[54]:
            C         D
A
bar  1.515602 -1.286676
foo  1.231561  1.337486

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function:

In [55]: df.groupby(['A', 'B']).sum()
Out[55]:
                  C         D
A   B
bar one    0.933127 -0.380128
    three  0.905622 -0.892613
    two   -0.323147 -0.013935
foo one    1.284616 -0.782129
    three  0.130030  0.455695
    two   -0.183085  1.663921
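A group is not limited to a single reduction: agg on a grouped column computes several statistics at once, one column per function. A short sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# several aggregations per group in one pass
result = df.groupby('A')['C'].agg(['sum', 'mean', 'count'])
```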

Reshaping

Stack

In [56]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
    ...:                      'foo', 'foo', 'qux', 'qux'],
    ...:                     ['one', 'two', 'one', 'two',
    ...:                      'one', 'two', 'one', 'two']]))

In [57]: tuples
Out[57]:
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [58]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [59]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

In [60]: df
Out[60]:
                     A         B
first second
bar   one    -0.236559 -1.052202
      two     0.663142 -0.426109
baz   one     0.903963 -0.326161
      two    -0.640158  1.695094
foo   one     0.684574 -0.352153
      two    -0.156945 -0.282294
qux   one     0.853196  0.920496
      two    -0.131899  1.526732

In [61]: df2 = df[:4]

In [62]: df2
Out[62]:
                     A         B
first second
bar   one    -0.236559 -1.052202
      two     0.663142 -0.426109
baz   one     0.903963 -0.326161
      two    -0.640158  1.695094

The stack() method "compresses" a level in the DataFrame's columns, moving the column labels into the row index:

In [63]: stacked = df2.stack()

In [64]: stacked
Out[64]:
first  second
bar    one     A   -0.236559
               B   -1.052202
       two     A    0.663142
               B   -0.426109
baz    one     A    0.903963
               B   -0.326161
       two     A   -0.640158
               B    1.695094

With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:

In [66]: stacked.unstack()
Out[66]:
                     A         B
first second
bar   one    -0.236559 -1.052202
      two     0.663142 -0.426109
baz   one     0.903963 -0.326161
      two    -0.640158  1.695094

In [67]: stacked.unstack(1)
Out[67]:
second        one       two
first
bar   A -0.236559  0.663142
      B -1.052202 -0.426109
baz   A  0.903963 -0.640158
      B -0.326161  1.695094

In [68]: stacked.unstack(0)
Out[68]:
first          bar       baz
second
one    A -0.236559  0.903963
       B -1.052202 -0.326161
two    A  0.663142 -0.640158
       B -0.426109  1.695094

Pivot tables

In [69]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
    ...:                    'B': ['A', 'B', 'C'] * 4,
    ...:                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
    ...:                    'D': np.random.randn(12),
    ...:                    'E': np.random.randn(12)})

In [70]: df
Out[70]:
        A  B    C         D         E
0     one  A  foo  0.356812  0.508259
1     one  B  foo -0.223585  2.090394
2     two  C  foo -0.366076  0.835971
3   three  A  bar -0.107184  0.377065
4     one  B  bar  0.335025  1.876482
5     one  C  bar  1.300853 -0.437791
6     two  A  foo -1.127656  1.240071
7   three  B  foo -0.719621 -0.541175
8     one  C  foo -0.464972  0.662851
9     one  A  bar -1.283444 -0.518707
10    two  B  bar -0.118690 -1.065746
11  three  C  bar -0.605458  0.908448

We can produce pivot tables from this data very easily:

In [71]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[71]:
C             bar       foo
A     B
one   A -1.283444  0.356812
      B  0.335025 -0.223585
      C  1.300853 -0.464972
three A -0.107184       NaN
      B       NaN -0.719621
      C -0.605458       NaN
two   A       NaN -1.127656
      B -0.118690       NaN
      C       NaN -0.366076
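By default pivot_table aggregates each cell with the mean; aggfunc selects another aggregation, and fill_value replaces the NaN cells where no data falls. A minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['bar', 'foo', 'bar', 'bar'],
                   'D': [1.0, 2.0, 3.0, 5.0]})

# count the entries in each (A, C) cell instead of averaging them
counts = pd.pivot_table(df, values='D', index='A', columns='C',
                        aggfunc='count', fill_value=0)
```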

Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minute data). This is extremely common in financial applications, but not limited to them.

In [72]: rng = pd.date_range('1/1/2012', periods=100, freq='S')

In [73]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [74]: ts.resample('5Min').sum()
Out[74]:
2012-01-01    26018
Freq: 5T, dtype: int32
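With deterministic data the binning is easier to verify: ten one-second points fall into two 5-second bins, and any reduction (sum, mean, etc.) can be applied per bin. A small made-up sketch:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2012-01-01', periods=10, freq='s')  # one point per second
ts = pd.Series(np.arange(10), index=rng)

# downsample into 5-second bins
sums = ts.resample('5s').sum()    # 0+1+2+3+4 and 5+6+7+8+9
means = ts.resample('5s').mean()
```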

Time zone representation:

In [80]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')

In [81]: ts = pd.Series(np.random.randn(len(rng)), rng)

In [82]: ts
Out[82]:
2012-03-06    0.874556
2012-03-07    3.030650
2012-03-08   -0.286538
2012-03-09    0.154653
2012-03-10   -0.513948
Freq: D, dtype: float64

In [83]: ts_utc = ts.tz_localize('UTC')

In [84]: ts_utc
Out[84]:
2012-03-06 00:00:00+00:00    0.874556
2012-03-07 00:00:00+00:00    3.030650
2012-03-08 00:00:00+00:00   -0.286538
2012-03-09 00:00:00+00:00    0.154653
2012-03-10 00:00:00+00:00   -0.513948
Freq: D, dtype: float64

Converting to another time zone:

In [85]: ts_utc.tz_convert('US/Eastern')
Out[85]:
2012-03-05 19:00:00-05:00    0.874556
2012-03-06 19:00:00-05:00    3.030650
2012-03-07 19:00:00-05:00   -0.286538
2012-03-08 19:00:00-05:00    0.154653
2012-03-09 19:00:00-05:00   -0.513948
Freq: D, dtype: float64

Converting between time span representations:

In [86]: rng = pd.date_range('1/1/2012', periods=5, freq='M')

In [87]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [88]: ts
Out[88]:
2012-01-31   -0.135612
2012-02-29    2.293425
2012-03-31    0.789431
2012-04-30   -0.525814
2012-05-31    1.558781
Freq: M, dtype: float64

# convert to periods; the frequency is inferred from the timestamps by default,
# or can be given explicitly, e.g. ts.to_period('M') or ts.to_period('Y')
In [89]: ps = ts.to_period()

In [90]: ps
Out[90]:
2012-01   -0.135612
2012-02    2.293425
2012-03    0.789431
2012-04   -0.525814
2012-05    1.558781
Freq: M, dtype: float64

In [91]: ps.to_timestamp()
Out[91]:
2012-01-01   -0.135612
2012-02-01    2.293425
2012-03-01    0.789431
2012-04-01   -0.525814
2012-05-01    1.558781
Freq: MS, dtype: float64

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with fiscal year ending in November to 9am on the first day of the month following each quarter's end:

# pandas supports 12 quarterly frequencies, Q-JAN through Q-DEC;
# Q-NOV means the fiscal year ends in November
In [95]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

In [96]: prng
Out[96]:
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='period[Q-NOV]', freq='Q-NOV')

In [97]: ts = pd.Series(np.random.randn(len(prng)), prng)

In [98]: ts.head()
Out[98]:
1990Q1   -0.394540
1990Q2    0.380933
1990Q3    0.108135
1990Q4    0.874617
1991Q1   -0.560237
Freq: Q-NOV, dtype: float64

# convert to the last month of each quarter; equivalent to prng.asfreq('M', how='e')
In [99]: ts.index = prng.asfreq('M', 'e')

In [100]: ts
Out[100]:
1990-02   -0.394540
1990-05    0.380933
1990-08    0.108135
1990-11    0.874617
1991-02   -0.560237
1991-05   -0.912935
Freq: M, dtype: float64

# convert to the month after each quarter end
In [101]: ts.index = prng.asfreq('M', 'e') + 1

In [103]: ts.head()
Out[103]:
1990-03   -0.394540
1990-06    0.380933
1990-09    0.108135
1990-12    0.874617
1991-03   -0.560237
Freq: M, dtype: float64

# convert to 9am on the first day of the month following each quarter end;
# equivalent to (prng.asfreq('M', how='end') + 1).asfreq('H', how='start') + 9
In [104]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

In [107]: ts.head()
Out[107]:
1990-03-01 09:00   -0.394540
1990-06-01 09:00    0.380933
1990-09-01 09:00    0.108135
1990-12-01 09:00    0.874617
1991-03-01 09:00   -0.560237
Freq: H, dtype: float64