Data Analysis | Pandas Time Series - Datetime Index

December 22, 2019
Notes

Contents: Partial string indexing | Slice vs. exact match | Exact indexing | Truncating and fancy indexing | Date/time components

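A DatetimeIndex is used mainly as the index of pandas objects. The DatetimeIndex class contains many optimizations for time series:

Date ranges for a variety of offsets are precomputed and cached under the hood, so generating subsequent date ranges is very fast (it only requires grabbing a slice).
Fast shifting with the shift and tshift methods on pandas objects.
Unioning overlapping DatetimeIndex objects that share the same frequency is very fast (this matters for fast data alignment).
Quick access to date fields through properties such as year and month.
Regularization functions such as snap, and very fast asof logic.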
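
As a rough illustration of a few items in the list above, the following sketch shifts a series along its index, unions two overlapping daily indexes, and looks up the last known value with asof. The variable names and sample values are made up for illustration, and shift with a freq argument stands in for tshift, which newer pandas releases no longer provide.

```python
import pandas as pd

# Five daily observations; the values are arbitrary sample data.
idx = pd.date_range("2011-01-01", periods=5, freq="D")
ts = pd.Series(range(5), index=idx)

# Shift the index forward by one day; each value stays attached to its row.
shifted = ts.shift(1, freq="D")

# Union two overlapping indexes that share the same daily frequency.
other = pd.date_range("2011-01-04", periods=5, freq="D")
combined = idx.union(other)

# Date fields are exposed as properties on the index.
months = idx.month

# asof returns the last value at or before the requested timestamp.
last_known = ts.asof(pd.Timestamp("2011-01-03 12:00"))

print(shifted.index[0], len(combined), last_known)
```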

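DatetimeIndex objects have all the basic functionality of regular Index objects, plus a range of advanced time-series-specific methods that simplify frequency processing.

See also: Reindexing methods.

Note: pandas does not force you to keep a date index sorted, but some of these methods may behave unexpectedly or incorrectly if the dates are unsorted.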
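
One way to guard against the unsorted-index problem mentioned in the note is to check monotonicity and sort before slicing. This is a minimal sketch with made-up data, not a prescribed workflow.

```python
import pandas as pd

# Dates deliberately out of order; values are arbitrary sample data.
dates = pd.to_datetime(["2011-03-01", "2011-01-01", "2011-02-01"])
ts = pd.Series([3, 1, 2], index=dates)

# An unsorted index reports that it is not monotonically increasing.
sorted_before = ts.index.is_monotonic_increasing

# Sorting restores predictable label-based slicing.
ts_sorted = ts.sort_index()
jan_feb = ts_sorted.loc["2011-01":"2011-02"]

print(sorted_before, list(jan_feb))
```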

A DatetimeIndex can be used like a regular index and supports selection, slicing, and similar operations.

In [94]: rng = pd.date_range(start, end, freq='BM')

In [95]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [96]: ts.index
Out[96]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
               '2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
              dtype='datetime64[ns]', freq='BM')

In [97]: ts[:5].index
Out[97]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31'],
              dtype='datetime64[ns]', freq='BM')

In [98]: ts[::2].index
Out[98]:
DatetimeIndex(['2011-01-31', '2011-03-31', '2011-05-31', '2011-07-29',
               '2011-09-30', '2011-11-30'],
              dtype='datetime64[ns]', freq='2BM')
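
The session above relies on start and end values defined elsewhere. The self-contained sketch below assumes bounds that cover 2011 and uses a month-start frequency ("MS") purely for illustration; it shows that a plain slice keeps the frequency and a stepped slice scales it.

```python
import datetime

import pandas as pd

# Assumed bounds; the original session defines start and end elsewhere.
start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2012, 1, 1)

# Month-start frequency is an illustrative choice, not the one used above.
rng = pd.date_range(start, end, freq="MS")
ts = pd.Series(range(len(rng)), index=rng)

first_five = ts.iloc[:5].index    # a plain slice keeps the frequency
every_other = ts.iloc[::2].index  # a step of 2 doubles the frequency

print(first_five.freqstr, every_other.freqstr)
```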

Partial string indexing

Dates and strings that parse to timestamps can be passed as indexing parameters:

In [99]: ts['1/31/2011']
Out[99]: 0.11920871129693428

In [100]: ts[datetime.datetime(2011, 12, 25):]
Out[100]:
2011-12-30    0.56702
Freq: BM, dtype: float64

In [101]: ts['10/31/2011':'12/31/2011']
Out[101]:
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64

To make longer time series easier to access, pandas also accepts a year or a year and month as a string:

In [102]: ts['2011']
Out[102]:
2011-01-31    0.119209
2011-02-28   -1.044236
2011-03-31   -0.861849
2011-04-29   -2.104569
2011-05-31   -0.494929
2011-06-30    1.071804
2011-07-29    0.721555
2011-08-31   -0.706771
2011-09-30   -1.039575
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64

In [103]: ts['2011-6']
Out[103]:
2011-06-30    1.071804
Freq: BM, dtype: float64
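
The same lookups can be checked on deterministic data. The sketch below assumes a daily series covering 2011 (the data and names are invented for illustration) and uses .loc, which behaves consistently across pandas versions.

```python
import datetime

import pandas as pd

# One value per day of 2011; the value is the zero-based day number.
rng = pd.date_range("2011-01-01", "2011-12-31", freq="D")
ts = pd.Series(range(len(rng)), index=rng)

one_day = ts.loc["1/31/2011"]                    # exact day: a scalar
tail = ts.loc[datetime.datetime(2011, 12, 25):]  # from Dec 25 onward
last_months = ts.loc["10/31/2011":"12/31/2011"]  # both ends included
whole_year = ts.loc["2011"]                      # every row in 2011
june = ts.loc["2011-06"]                         # every row in June

print(one_day, len(tail), len(last_months), len(whole_year), len(june))
```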

This kind of slicing also works on a DataFrame with a DatetimeIndex. Because a partial string selection is a form of label slicing, the end point is included: every time that matches the given date is part of the result.

In [104]: dft = pd.DataFrame(np.random.randn(100000, 1), columns=['A'],
   .....:                    index=pd.date_range('20130101', periods=100000, freq='T'))
   .....:

In [105]: dft
Out[105]:
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043

[100000 rows x 1 columns]

In [106]: dft['2013']
Out[106]:
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043

[100000 rows x 1 columns]

The following slice starts at the very first time in January and runs through the last date and time in February (23:59 on February 28).

In [107]: dft['2013-1':'2013-2']
Out[107]:
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-28 23:55:00  0.850929
2013-02-28 23:56:00  0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517

[84960 rows x 1 columns]

The following slice specifies a stop date, and every time on that last day is included.

In [108]: dft['2013-1':'2013-2-28']
Out[108]:
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-28 23:55:00  0.850929
2013-02-28 23:56:00  0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517

[84960 rows x 1 columns]

The following slice specifies an exact stop time, so the result differs from the previous one: it ends at 00:00 on February 28.

In [109]: dft['2013-1':'2013-2-28 00:00:00']
Out[109]:
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-27 23:56:00  1.197749
2013-02-27 23:57:00  0.720521
2013-02-27 23:58:00 -0.072718
2013-02-27 23:59:00 -0.681192
2013-02-28 00:00:00 -0.557501

[83521 rows x 1 columns]

The slice stops on the end point itself, which is included because it is part of the index:

In [110]: dft['2013-1-15':'2013-1-15 12:30:00']
Out[110]:
                            A
2013-01-15 00:00:00 -0.984810
2013-01-15 00:01:00  0.941451
2013-01-15 00:02:00  1.559365
2013-01-15 00:03:00  1.034374
2013-01-15 00:04:00 -1.480656
...                       ...
2013-01-15 12:26:00  0.371454
2013-01-15 12:27:00 -0.930806
2013-01-15 12:28:00 -0.069177
2013-01-15 12:29:00  0.066510
2013-01-15 12:30:00 -0.003945

[751 rows x 1 columns]
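
To see how the precision of the end string changes the result, the sketch below uses a smaller frame than the one above (three days of minute data, an assumption made to keep it quick) and counts the rows each slice returns.

```python
import pandas as pd

# Three days of minute-level data; values are just row numbers.
index = pd.date_range("2013-01-01", periods=3 * 24 * 60, freq="min")
dft = pd.DataFrame({"A": range(len(index))}, index=index)

# A day-level end string includes every minute of that day.
through_day = dft.loc["2013-01-01":"2013-01-02"]

# An exact end time stops at that instant (inclusive).
through_midnight = dft.loc["2013-01-01":"2013-01-02 00:00:00"]
through_half_past = dft.loc["2013-01-02":"2013-01-02 12:30:00"]

print(len(through_day), len(through_midnight), len(through_half_past))
```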

New in version 0.18.0.

DatetimeIndex partial string indexing also works on a DataFrame with a MultiIndex.

In [111]: dft2 = pd.DataFrame(np.random.randn(20, 1),
   .....:                     columns=['A'],
   .....:                     index=pd.MultiIndex.from_product(
   .....:                         [pd.date_range('20130101', periods=10, freq='12H'),
   .....:                          ['a', 'b']]))
   .....:

In [112]: dft2
Out[112]:
                              A
2013-01-01 00:00:00 a -0.298694
                    b  0.823553
2013-01-01 12:00:00 a  0.943285
                    b -1.479399
2013-01-02 00:00:00 a -1.643342
...                         ...
2013-01-04 12:00:00 b  0.069036
2013-01-05 00:00:00 a  0.122297
                    b  1.422060
2013-01-05 12:00:00 a  0.370079
                    b  1.016331

[20 rows x 1 columns]

In [113]: dft2.loc['2013-01-05']
Out[113]:
                              A
2013-01-05 00:00:00 a  0.122297
                    b  1.422060
2013-01-05 12:00:00 a  0.370079
                    b  1.016331

In [114]: idx = pd.IndexSlice

In [115]: dft2 = dft2.swaplevel(0, 1).sort_index()

In [116]: dft2.loc[idx[:, '2013-01-05'], :]
Out[116]:
                              A
a 2013-01-05 00:00:00  0.122297
  2013-01-05 12:00:00  0.370079
b 2013-01-05 00:00:00  1.422060
  2013-01-05 12:00:00  1.016331
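
A deterministic variant of the MultiIndex example makes the selected rows easy to verify. The frame below uses row numbers as values and a Timedelta for the 12-hour spacing; both are illustrative choices rather than part of the original session.

```python
import pandas as pd

# Ten timestamps spaced 12 hours apart, crossed with two labels.
times = pd.date_range("2013-01-01", periods=10, freq=pd.Timedelta(hours=12))
index = pd.MultiIndex.from_product([times, ["a", "b"]])
dft2 = pd.DataFrame({"A": range(20)}, index=index)

# A partial string on the datetime level selects every row on that day.
day5 = dft2.loc["2013-01-05"]

# With the datetime level second, IndexSlice addresses it explicitly.
swapped = dft2.swaplevel(0, 1).sort_index()
day5_swapped = swapped.loc[pd.IndexSlice[:, "2013-01-05"], :]

print(day5["A"].tolist(), day5_swapped["A"].tolist())
```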

New in version 0.25.0.

Slicing with string indexing also honors UTC offsets.

In [117]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [118]: df
Out[118]:
                           0
2019-01-01 00:00:00-08:00  0

In [119]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[119]:
                           0
2019-01-01 00:00:00-08:00  0
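
The offset arithmetic is easier to follow against a UTC index. In this sketch (an hourly UTC frame invented for illustration) the slice bounds are written with a +04:00 offset, so 12:00 and 13:00 at that offset should correspond to 08:00 and 09:00 UTC.

```python
import pandas as pd

# 24 hourly rows on 2019-01-01 in UTC; the value is the UTC hour.
index = pd.date_range("2019-01-01", periods=24, freq="60min", tz="UTC")
df = pd.DataFrame({"hour_utc": range(24)}, index=index)

# Bounds carry their own UTC offset and are aligned to the index timezone.
window = df.loc["2019-01-01 12:00:00+04:00":"2019-01-01 13:00:00+04:00"]

print(window["hour_utc"].tolist())
```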

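Slice vs. exact match

New in version 0.20.0.

The same string used as an indexing parameter can be treated either as a slice or as an exact match, depending on the resolution of the index. If the string is less precise than the index, it is treated as a slice; otherwise (same or higher precision) it is treated as an exact match.

In [120]: series_minute = pd.Series([1, 2, 3],
   .....:                           pd.DatetimeIndex(['2011-12-31 23:59:00',
   .....:                                             '2012-01-01 00:00:00',
   .....:                                             '2012-01-01 00:02:00']))
   .....:

In [121]: series_minute.index.resolution
Out[121]: 'minute'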

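In the next example the timestamp string is less precise than the index: series_minute resolves to the minute, while the string only resolves to the hour. The result is therefore a Series.

In [122]: series_minute['2011-12-31 23']
Out[122]:
2011-12-31 23:59:00    1
dtype: int64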

A timestamp string with minute resolution (or finer) returns a scalar instead; it is not cast to a slice.

In [123]: series_minute['2011-12-31 23:59']
Out[123]: 1

In [124]: series_minute['2011-12-31 23:59:00']
Out[124]: 1

If the index resolution is second, a minute-precision timestamp string returns a Series.

In [125]: series_second = pd.Series([1, 2, 3],
   .....:                           pd.DatetimeIndex(['2011-12-31 23:59:59',
   .....:                                             '2012-01-01 00:00:00',
   .....:                                             '2012-01-01 00:00:01']))
   .....:

In [126]: series_second.index.resolution
Out[126]: 'second'

In [127]: series_second['2011-12-31 23:59']
Out[127]:
2011-12-31 23:59:59    1
dtype: int64
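
The rule can be checked by looking at the type each lookup returns. The sketch below rebuilds the two series from the sessions above and records whether the result is a Series (slice) or a scalar (exact match).

```python
import pandas as pd

series_minute = pd.Series(
    [1, 2, 3],
    index=pd.DatetimeIndex(
        ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
    ),
)
series_second = pd.Series(
    [1, 2, 3],
    index=pd.DatetimeIndex(
        ["2011-12-31 23:59:59", "2012-01-01 00:00:00", "2012-01-01 00:00:01"]
    ),
)

# Hour-level string against a minute-level index: coarser, so a slice.
by_hour = series_minute["2011-12-31 23"]

# Minute-level string against a minute-level index: an exact match.
by_minute = series_minute["2011-12-31 23:59"]

# Minute-level string against a second-level index: coarser again, a slice.
by_minute_on_seconds = series_second["2011-12-31 23:59"]

print(type(by_hour).__name__, by_minute, type(by_minute_on_seconds).__name__)
```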

If the timestamp string is treated as a slice, it can also be used to index a DataFrame with [].

In [128]: dft_minute = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
   .....:                           index=series_minute.index)
   .....:

In [129]: dft_minute['2011-12-31 23']
Out[129]:
                     a  b
2011-12-31 23:59:00  1  4

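Warning: if the string is treated as an exact match, selection with [] on a DataFrame is column-wise, not row-wise; see Indexing Basics. For example, dft_minute['2011-12-31 23:59'] raises a KeyError, because '2011-12-31 23:59' has the same resolution as the index and there is no column with that name.

To get unambiguous selection every time, whether the key is treated as a slice or as a single row, use .loc.

In [130]: dft_minute.loc['2011-12-31 23:59']
Out[130]:
a    1
b    4
Name: 2011-12-31 23:59:00, dtype: int64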
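
The difference between [] and .loc on a DataFrame can be sketched as follows. The frame is rebuilt here so the snippet is self-contained; the raised flag is only a helper for this illustration.

```python
import pandas as pd

index = pd.DatetimeIndex(
    ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
)
dft_minute = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=index)

# An exact-match string with [] is looked up among the columns and fails.
try:
    dft_minute["2011-12-31 23:59"]
    raised = False
except KeyError:
    raised = True

# .loc always selects rows: one row for an exact match ...
row = dft_minute.loc["2011-12-31 23:59"]

# ... and a DataFrame when the string is coarser than the index.
rows = dft_minute.loc["2011-12-31 23"]

print(raised, row.tolist(), len(rows))
```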

Note also that the resolution of a DatetimeIndex cannot be coarser than day.

In [131]: series_monthly = pd.Series([1, 2, 3],
   .....:                            pd.DatetimeIndex(['2011-12', '2012-01', '2012-02']))
   .....:

In [132]: series_monthly.index.resolution
Out[132]: 'day'

In [133]: series_monthly['2011-12']  # returns a Series
Out[133]:
2011-12-01    1
dtype: int64
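
A short check of the day-resolution floor: even though the index below is built from month strings, its resolution is reported as day, so a month string is coarser than the index and produces a slice.

```python
import pandas as pd

series_monthly = pd.Series(
    [1, 2, 3], index=pd.DatetimeIndex(["2011-12", "2012-01", "2012-02"])
)

resolution = series_monthly.index.resolution

# A month string is coarser than a day-resolution index: a Series comes back.
december = series_monthly["2011-12"]

print(resolution, type(december).__name__)
```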

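Exact indexing

As discussed in the previous section, indexing a DatetimeIndex with a partial string depends on the precision of the period, in other words on how specific the interval is relative to the resolution of the index. In contrast, indexing with Timestamp or datetime objects is exact, because these objects have an exact meaning. Such slices also include both end points.

Timestamp and datetime objects carry exact hours, minutes, and seconds even when these are not explicitly specified (they default to 0).

In [134]: dft[datetime.datetime(2013, 1, 1):datetime.datetime(2013, 2, 28)]
Out[134]:
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-27 23:56:00  1.197749
2013-02-27 23:57:00  0.720521
2013-02-27 23:58:00 -0.072718
2013-02-27 23:59:00 -0.681192
2013-02-28 00:00:00 -0.557501

[83521 rows x 1 columns]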

Without the defaults, that is, with the time components given explicitly:

In [135]: dft[datetime.datetime(2013, 1, 1, 10, 12, 0):
   .....:     datetime.datetime(2013, 2, 28, 10, 12, 0)]
   .....:
Out[135]:
                            A
2013-01-01 10:12:00  0.565375
2013-01-01 10:13:00  0.068184
2013-01-01 10:14:00  0.788871
2013-01-01 10:15:00 -0.280343
2013-01-01 10:16:00  0.931536
...                       ...
2013-02-28 10:08:00  0.148098
2013-02-28 10:09:00 -0.388138
2013-02-28 10:10:00  0.139348
2013-02-28 10:11:00  0.085288
2013-02-28 10:12:00  0.950146

[83521 rows x 1 columns]
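
The contrast between string and datetime bounds shows up directly in row counts. This sketch assumes a small frame of three days of minute data (not the 100000-row frame used above) and compares the two kinds of bounds.

```python
import datetime

import pandas as pd

# Three days of minute-level data; values are just row numbers.
index = pd.date_range("2013-01-01", periods=3 * 24 * 60, freq="min")
dft = pd.DataFrame({"A": range(len(index))}, index=index)

# String bounds: the end string covers the whole of January 2.
by_string = dft.loc["2013-01-01":"2013-01-02"]

# datetime bounds: the end is exactly 2013-01-02 00:00:00.
by_datetime = dft.loc[datetime.datetime(2013, 1, 1):datetime.datetime(2013, 1, 2)]

# Explicit time components are honored exactly, both ends inclusive.
with_time = dft.loc[
    datetime.datetime(2013, 1, 1, 10, 12):datetime.datetime(2013, 1, 2, 10, 12)
]

print(len(by_string), len(by_datetime), len(with_time))
```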

Truncating and fancy indexing

The truncate() convenience function is similar to slicing. Note, however, that whereas slicing returns every partially matching date, truncate assumes a value of 0 for any date component left unspecified.

In [136]: rng2 = pd.date_range('2011-01-01', '2012-01-01', freq='W')

In [137]: ts2 = pd.Series(np.random.randn(len(rng2)), index=rng2)

In [138]: ts2.truncate(before='2011-11', after='2011-12')
Out[138]:
2011-11-06    0.437823
2011-11-13   -0.293083
2011-11-20   -0.059881
2011-11-27    1.252450
Freq: W-SUN, dtype: float64

In [139]: ts2['2011-11':'2011-12']
Out[139]:
2011-11-06    0.437823
2011-11-13   -0.293083
2011-11-20   -0.059881
2011-11-27    1.252450
2011-12-04    0.046611
2011-12-11    0.059478
2011-12-18   -0.286539
2011-12-25    0.841669
Freq: W-SUN, dtype: float64
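
The difference comes from how the end bound is read: truncate treats '2011-12' as 2011-12-01 00:00, while a partial string slice treats it as all of December. A minimal sketch with deterministic values:

```python
import pandas as pd

# Weekly (Sunday) observations for 2011; values are just row numbers.
rng2 = pd.date_range("2011-01-01", "2012-01-01", freq="W")
ts2 = pd.Series(range(len(rng2)), index=rng2)

# truncate: 'after' is read as 2011-12-01 00:00, so December is excluded.
truncated = ts2.truncate(before="2011-11", after="2011-12")

# Partial string slice: the end string covers all of December.
sliced = ts2.loc["2011-11":"2011-12"]

print(len(truncated), len(sliced))
```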

Even fancy indexing that breaks the regular frequency of a DatetimeIndex still returns a DatetimeIndex, but the frequency information is lost (note freq=None):

In [140]: ts2[[0, 2, 6]].index
Out[140]: DatetimeIndex(['2011-01-02', '2011-01-16', '2011-02-13'], dtype='datetime64[ns]', freq=None)
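
The loss of frequency can be confirmed programmatically. This sketch uses .iloc for the positional selection, an illustrative choice that avoids ambiguity between positions and labels.

```python
import pandas as pd

rng2 = pd.date_range("2011-01-01", "2012-01-01", freq="W")
ts2 = pd.Series(range(len(rng2)), index=rng2)

# Irregular positions break the weekly spacing, so no frequency survives.
picked = ts2.iloc[[0, 2, 6]].index

print(rng2.freqstr, picked.freq)
```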

Date/time components

The following date/time properties can be accessed on a Timestamp or on a collection of timestamps such as a DatetimeIndex.

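Property	Description
year	The year of the datetime
month	The month of the datetime
day	The day of the datetime
hour	The hour of the datetime
minute	The minute of the datetime
second	The second of the datetime
microsecond	The microsecond of the datetime
nanosecond	The nanosecond of the datetime
date	Returns datetime.date (without timezone information)
time	Returns datetime.time (without timezone information)
timetz	Returns datetime.time as local time with timezone information
dayofyear	The ordinal day of the year
weekofyear	The week ordinal of the year
week	The week ordinal of the year
dayofweek	The day of the week as a number, Monday=0, Sunday=6
weekday	The day of the week as a number, Monday=0, Sunday=6
weekday_name	The name of the day of the week (e.g. Friday)
quarter	The quarter of the date: Jan-Mar = 1, Apr-Jun = 2, etc.
days_in_month	The number of days in the month of the datetime
is_month_start	Logical indicating whether it is the first day of the month (defined by frequency)
is_month_end	Logical indicating whether it is the last day of the month (defined by frequency)
is_quarter_start	Logical indicating whether it is the first day of the quarter (defined by frequency)
is_quarter_end	Logical indicating whether it is the last day of the quarter (defined by frequency)
is_year_start	Logical indicating whether it is the first day of the year (defined by frequency)
is_year_end	Logical indicating whether it is the last day of the year (defined by frequency)
is_leap_year	Logical indicating whether the date falls in a leap year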

Furthermore, if a Series holds datetime values, these properties can be accessed through the .dt accessor, as described in the section on the .dt accessor.
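
A few of the properties in the table can be tried on a Timestamp, a DatetimeIndex, and a Series via .dt. The dates below are arbitrary sample values; day_name() is used for the weekday name because the weekday_name property is not available in newer pandas releases.

```python
import pandas as pd

# Properties on a single Timestamp.
stamp = pd.Timestamp("2011-12-30 14:05:09")
fields = (stamp.year, stamp.month, stamp.day,
          stamp.hour, stamp.minute, stamp.second)
weekday_label = stamp.day_name()

# The same properties are vectorized on a DatetimeIndex.
idx = pd.DatetimeIndex(["2011-12-30", "2012-02-29", "2012-03-31"])
weekdays = idx.dayofweek        # Monday=0 ... Sunday=6
quarters = idx.quarter
month_ends = idx.is_month_end
leap_years = idx.is_leap_year
month_lengths = idx.days_in_month

# On a Series of datetimes, go through the .dt accessor.
years = pd.Series(idx).dt.year

print(fields, weekday_label, list(weekdays), list(quarters))
```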