Data Analysis | Pandas Time Series - Datetime Index

December 22, 2019
Notes

Contents: Partial string indexing, Slice vs. exact match, Exact indexing, Truncating and fancy indexing, Time/date components

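DatetimeIndex is mainly used as the index of pandas objects. The DatetimeIndex class includes many optimizations for time series:

A large range of dates for various offsets is pre-computed and cached behind the scenes, so generating subsequent date ranges is very fast (it only needs to grab a slice).
Fast shifting with the shift and tshift methods on pandas objects (tshift was deprecated in later pandas releases in favor of shift with a freq argument).
Unioning overlapping DatetimeIndex objects that share the same frequency is very fast (important for fast data alignment).
Quick access to date fields through properties such as year and month.
Regularization functions such as snap, and very fast asof logic.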
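A few of the features listed above can be sketched on a tiny example. This is only an illustration under assumed data: the five-day series and the variable names are hypothetical and do not come from the original article, and snap is left out.

```python
import pandas as pd

# Hypothetical daily series used only for illustration.
idx = pd.date_range("2011-01-01", periods=5, freq="D")
ts = pd.Series(range(5), index=idx)

# Shift the index by two days; the values stay attached to their rows.
shifted = ts.shift(2, freq="D")

# Union of two overlapping indexes that share the same frequency.
other = pd.date_range("2011-01-04", periods=5, freq="D")
merged = idx.union(other)

# Date fields are exposed as properties.
years = idx.year
months = idx.month

# asof returns the last value at or before the requested time.
last_known = ts.asof(pd.Timestamp("2011-01-03 18:00"))
```

Here shift moves the timestamps rather than the values, and asof falls back to the January 3 observation because nothing is recorded at 18:00.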

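DatetimeIndex objects support all the basic functionality of regular Index objects, plus a range of advanced time-series-specific methods that simplify frequency processing.

See also: Reindexing methods

Note: pandas does not force you to keep a date index sorted, but some of these methods may behave unexpectedly or incorrectly if the dates are unsorted.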
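One way to guard against the unsorted-index problem mentioned in the note is to check the index and sort it before slicing. The snippet below is a minimal sketch; the three-row series is a hypothetical example, not data from the article.

```python
import pandas as pd

# Hypothetical series whose dates are deliberately out of order.
unsorted = pd.Series(
    [3, 1, 2],
    index=pd.to_datetime(["2011-03-01", "2011-01-01", "2011-02-01"]),
)

is_sorted_before = unsorted.index.is_monotonic_increasing

# Sorting first makes label-based slicing behave predictably.
ordered = unsorted.sort_index()
is_sorted_after = ordered.index.is_monotonic_increasing

first_two_months = ordered.loc["2011-01":"2011-02"]
```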

DatetimeIndex can be used like a regular index and supports selection, slicing, and so on.

# Setup assumed by the sessions below (not shown in this excerpt):
# import datetime; import numpy as np; import pandas as pd
# start = datetime.datetime(2011, 1, 1); end = datetime.datetime(2012, 1, 1)

In [94]: rng = pd.date_range(start, end, freq='BM')

In [95]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [96]: ts.index
Out[96]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
               '2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
              dtype='datetime64[ns]', freq='BM')

In [97]: ts[:5].index
Out[97]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
               '2011-05-31'],
              dtype='datetime64[ns]', freq='BM')

In [98]: ts[::2].index
Out[98]:
DatetimeIndex(['2011-01-31', '2011-03-31', '2011-05-31', '2011-07-29',
               '2011-09-30', '2011-11-30'],
              dtype='datetime64[ns]', freq='2BM')

Partial string indexing

Dates and strings that parse to timestamps can be passed as indexing parameters:

In [99]: ts['1/31/2011']
Out[99]: 0.11920871129693428

In [100]: ts[datetime.datetime(2011, 12, 25):]
Out[100]:
2011-12-30    0.56702
Freq: BM, dtype: float64

In [101]: ts['10/31/2011':'12/31/2011']
Out[101]:
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64

For convenient access to longer time series, you can also pass a year or a year and month as a string:

In [102]: ts['2011']
Out[102]:
2011-01-31    0.119209
2011-02-28   -1.044236
2011-03-31   -0.861849
2011-04-29   -2.104569
2011-05-31   -0.494929
2011-06-30    1.071804
2011-07-29    0.721555
2011-08-31   -0.706771
2011-09-30   -1.039575
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64

In [103]: ts['2011-6']
Out[103]:
2011-06-30    1.071804
Freq: BM, dtype: float64

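This kind of slicing also works on a DataFrame with a DatetimeIndex. Because partial string selection is a form of label slicing, the endpoints are included, which means every time that falls on a matching date is included as well: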

In [104]: dft = pd.DataFrame(np.random.randn(100000, 1), columns=['A'],
   .....:                    index=pd.date_range('20130101', periods=100000, freq='T'))
   .....:

In [105]: dft
Out[105]:
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043

[100000 rows x 1 columns]

In [106]: dft['2013']
Out[106]:
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043

[100000 rows x 1 columns]

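The following starts at the very first time in January (00:00 on January 1) and includes the last date and time in February (23:59 on February 28):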

In [107]: dft['2013-1':'2013-2']
Out[107]:
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-28 23:55:00  0.850929
2013-02-28 23:56:00  0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517

[84960 rows x 1 columns]

The following specifies a stop date and includes all of the times on that last day:

In [108]: dft['2013-1':'2013-2-28']
Out[108]:
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-28 23:55:00  0.850929
2013-02-28 23:56:00  0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517

[84960 rows x 1 columns]

The following specifies an exact stop time; note that the result differs from the slices above:

In [109]: dft['2013-1':'2013-2-28 00:00:00']
Out[109]:
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-27 23:56:00  1.197749
2013-02-27 23:57:00  0.720521
2013-02-27 23:58:00 -0.072718
2013-02-27 23:59:00 -0.681192
2013-02-28 00:00:00 -0.557501

[83521 rows x 1 columns]
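The row counts in these three slices follow from simple arithmetic: January and February 2013 together have 59 days of 1440 minutes each, while stopping at midnight on February 28 keeps 58 full days plus one row. The sketch below checks this on a frame with the same index as dft. It fills the frame with zeros instead of random values and uses .loc and the 'min' frequency alias, which are choices made here for portability rather than something the article prescribes.

```python
import numpy as np
import pandas as pd

# Same index as dft in the text, but filled with zeros instead of random data.
index = pd.date_range("2013-01-01", periods=100000, freq="min")
frame = pd.DataFrame({"A": np.zeros(len(index))}, index=index)

minutes_per_day = 24 * 60

# '2013-2' and '2013-2-28' both expand to the end of February 28.
through_february = frame.loc["2013-1":"2013-2"]
through_last_day = frame.loc["2013-1":"2013-2-28"]

# An explicit time stops exactly at that timestamp.
to_midnight = frame.loc["2013-1":"2013-2-28 00:00:00"]
```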

The stop time is included in the result because it is part of the index:

In [110]: dft['2013-1-15':'2013-1-15 12:30:00']
Out[110]:
                            A
2013-01-15 00:00:00 -0.984810
2013-01-15 00:01:00  0.941451
2013-01-15 00:02:00  1.559365
2013-01-15 00:03:00  1.034374
2013-01-15 00:04:00 -1.480656
...                       ...
2013-01-15 12:26:00  0.371454
2013-01-15 12:27:00 -0.930806
2013-01-15 12:28:00 -0.069177
2013-01-15 12:29:00  0.066510
2013-01-15 12:30:00 -0.003945

[751 rows x 1 columns]

New in version 0.18.0.

DatetimeIndex partial string indexing also works on a DataFrame with a MultiIndex:

In [111]: dft2 = pd.DataFrame(np.random.randn(20, 1),
   .....:                     columns=['A'],
   .....:                     index=pd.MultiIndex.from_product(
   .....:                         [pd.date_range('20130101', periods=10, freq='12H'),
   .....:                          ['a', 'b']]))
   .....:

In [112]: dft2
Out[112]:
                              A
2013-01-01 00:00:00 a -0.298694
                    b  0.823553
2013-01-01 12:00:00 a  0.943285
                    b -1.479399
2013-01-02 00:00:00 a -1.643342
...                         ...
2013-01-04 12:00:00 b  0.069036
2013-01-05 00:00:00 a  0.122297
                    b  1.422060
2013-01-05 12:00:00 a  0.370079
                    b  1.016331

[20 rows x 1 columns]

In [113]: dft2.loc['2013-01-05']
Out[113]:
                              A
2013-01-05 00:00:00 a  0.122297
                    b  1.422060
2013-01-05 12:00:00 a  0.370079
                    b  1.016331

In [114]: idx = pd.IndexSlice

In [115]: dft2 = dft2.swaplevel(0, 1).sort_index()

In [116]: dft2.loc[idx[:, '2013-01-05'], :]
Out[116]:
                              A
a 2013-01-05 00:00:00  0.122297
  2013-01-05 12:00:00  0.370079
b 2013-01-05 00:00:00  1.422060
  2013-01-05 12:00:00  1.016331

New in version 0.25.0.

Slicing with string indexing also honors UTC offsets.

In [117]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [118]: df
Out[118]:
                           0
2019-01-01 00:00:00-08:00  0

In [119]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[119]:
                           0
2019-01-01 00:00:00-08:00  0
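It may not be obvious why a slice written with a +04:00 offset matches a row stamped at midnight Pacific time. The sketch below converts the two slice bounds into the index's time zone to make the overlap visible. The names start, stop, and selected are introduced here for illustration, and .loc is used instead of [] as a conservative choice.

```python
import pandas as pd

df = pd.DataFrame([0], index=pd.DatetimeIndex(["2019-01-01"], tz="US/Pacific"))

# Convert the slice bounds to the index's time zone to see what they cover.
start = pd.Timestamp("2019-01-01 12:00:00+04:00").tz_convert("US/Pacific")
stop = pd.Timestamp("2019-01-01 13:00:00+04:00").tz_convert("US/Pacific")

# The string bounds are interpreted with their offsets, so the row is selected.
selected = df.loc["2019-01-01 12:00:00+04:00":"2019-01-01 13:00:00+04:00"]
```

Noon at +04:00 is 08:00 UTC, which is 00:00 at -08:00, so the single row sits exactly on the lower bound of the slice.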

Slice vs. exact match

Changed in version 0.20.0.

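Depending on the resolution of the index, the same string used as an indexing parameter can be treated either as a slice or as an exact match. If the string is less precise than the index, it is treated as a slice; otherwise (same or higher precision) it is treated as an exact match.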

In [120]: series_minute = pd.Series([1, 2, 3],
   .....:                           pd.DatetimeIndex(['2011-12-31 23:59:00',
   .....:                                             '2012-01-01 00:00:00',
   .....:                                             '2012-01-01 00:02:00']))
   .....:

In [121]: series_minute.index.resolution
Out[121]: 'minute'

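A timestamp string that is less precise than a minute gives a Series. Here series_minute has minute resolution, while the timestamp string only goes down to the hour.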

In [122]: series_minute['2011-12-31 23']
Out[122]:
2011-12-31 23:59:00    1
dtype: int64

A timestamp string with minute resolution (or finer) gives a scalar instead; it is not treated as a slice.

In [123]: series_minute['2011-12-31 23:59']
Out[123]: 1

In [124]: series_minute['2011-12-31 23:59:00']
Out[124]: 1

If the index resolution is second, a minute-resolution timestamp string gives a Series.

In [125]: series_second = pd.Series([1, 2, 3],
   .....:                           pd.DatetimeIndex(['2011-12-31 23:59:59',
   .....:                                             '2012-01-01 00:00:00',
   .....:                                             '2012-01-01 00:00:01']))
   .....:

In [126]: series_second.index.resolution
Out[126]: 'second'

In [127]: series_second['2011-12-31 23:59']
Out[127]:
2011-12-31 23:59:59    1
dtype: int64
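The rule can be checked programmatically by comparing what comes back for strings of different precision. The following is a small sketch that reuses series_minute from the text. It uses .loc rather than [] as a conservative choice, and the names as_slice and as_scalar are hypothetical.

```python
import pandas as pd

series_minute = pd.Series(
    [1, 2, 3],
    index=pd.DatetimeIndex(
        ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
    ),
)

resolution = series_minute.index.resolution

# An hour-resolution string is coarser than the index: treated as a slice.
as_slice = series_minute.loc["2011-12-31 23"]

# A minute-resolution string matches the index resolution: exact match.
as_scalar = series_minute.loc["2011-12-31 23:59"]
```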

If the timestamp string is treated as a slice, it can also be used to index a DataFrame with [].

In [128]: dft_minute = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
   .....:                           index=series_minute.index)
   .....:

In [129]: dft_minute['2011-12-31 23']
Out[129]:
                     a  b
2011-12-31 23:59:00  1  4

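Warning: If the string is treated as an exact match, selection with [] on a DataFrame is column-wise, not row-wise; see Indexing Basics. For example, dft_minute['2011-12-31 23:59'] raises a KeyError, because '2011-12-31 23:59' has the same resolution as the index and there is no column with that name.

To always get unambiguous selection, whether the string is treated as a slice or as a single selection, use .loc to select rows.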

In [130]: dft_minute.loc['2011-12-31 23:59']
Out[130]:
a    1
b    4
Name: 2011-12-31 23:59:00, dtype: int64
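The contrast between [] and .loc described in the warning can be seen side by side. This sketch rebuilds dft_minute from the text and captures the KeyError instead of letting it propagate; the flag bracket_raised and the names row and rows are introduced here for illustration only.

```python
import pandas as pd

index = pd.DatetimeIndex(
    ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
)
dft_minute = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=index)

# [] with an exact-match string looks for a column of that name.
try:
    dft_minute["2011-12-31 23:59"]
    bracket_raised = False
except KeyError:
    bracket_raised = True

# .loc always selects rows, whether the string is a slice or an exact match.
row = dft_minute.loc["2011-12-31 23:59"]   # exact match: one row as a Series
rows = dft_minute.loc["2011-12-31 23"]     # slice: a DataFrame
```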

Note that the resolution of a DatetimeIndex cannot be coarser than day.

In [131]: series_monthly = pd.Series([1, 2, 3],
   .....:                            pd.DatetimeIndex(['2011-12', '2012-01', '2012-02']))
   .....:

In [132]: series_monthly.index.resolution
Out[132]: 'day'

In [133]: series_monthly['2011-12']  # returns a Series
Out[133]:
2011-12-01    1
dtype: int64

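Exact indexing

As discussed in the previous section, indexing a DatetimeIndex with a partial string depends on the precision of the period, that is, on how specific the interval is relative to the resolution of the index. In contrast, indexing with Timestamp or datetime objects is exact, because these objects denote an exact point in time. Exact indexing also includes both endpoints.

Timestamp and datetime objects carry exact hours, minutes, and seconds even when these are not specified explicitly; they default to 0.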

In [134]: dft[datetime.datetime(2013, 1, 1):datetime.datetime(2013, 2, 28)]
Out[134]:
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-27 23:56:00  1.197749
2013-02-27 23:57:00  0.720521
2013-02-27 23:58:00 -0.072718
2013-02-27 23:59:00 -0.681192
2013-02-28 00:00:00 -0.557501

[83521 rows x 1 columns]

Without relying on the defaults, by giving the time explicitly:

In [135]: dft[datetime.datetime(2013, 1, 1, 10, 12, 0):
   .....:     datetime.datetime(2013, 2, 28, 10, 12, 0)]
   .....:
Out[135]:
                            A
2013-01-01 10:12:00  0.565375
2013-01-01 10:13:00  0.068184
2013-01-01 10:14:00  0.788871
2013-01-01 10:15:00 -0.280343
2013-01-01 10:16:00  0.931536
...                       ...
2013-02-28 10:08:00  0.148098
2013-02-28 10:09:00 -0.388138
2013-02-28 10:10:00  0.139348
2013-02-28 10:11:00  0.085288
2013-02-28 10:12:00  0.950146

[83521 rows x 1 columns]
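To see the difference between partial-string bounds and exact datetime bounds in numbers, the sketch below slices the same range both ways. It assumes a smaller three-day frame at minute frequency as a stand-in for dft; the frame and the names by_string and by_datetime are illustrative only.

```python
import datetime

import numpy as np
import pandas as pd

# Three days of minute data (a smaller stand-in for dft in the text).
index = pd.date_range("2013-01-01", periods=3 * 24 * 60, freq="min")
frame = pd.DataFrame({"A": np.arange(len(index))}, index=index)

# A partial string on the right-hand side expands to the whole day.
by_string = frame.loc["2013-01-01":"2013-01-02"]

# datetime objects mean exactly 00:00:00 unless a time is given.
by_datetime = frame.loc[
    datetime.datetime(2013, 1, 1):datetime.datetime(2013, 1, 2)
]
```

The string slice covers two full days, while the datetime slice covers one full day plus the single row at midnight on January 2.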

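Truncating and fancy indexing

The truncate() convenience function is similar to slicing. Note that truncate assumes a value of 0 for any date component left unspecified, whereas slicing returns every partially matching date.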

In [136]: rng2 = pd.date_range('2011-01-01', '2012-01-01', freq='W')

In [137]: ts2 = pd.Series(np.random.randn(len(rng2)), index=rng2)

In [138]: ts2.truncate(before='2011-11', after='2011-12')
Out[138]:
2011-11-06    0.437823
2011-11-13   -0.293083
2011-11-20   -0.059881
2011-11-27    1.252450
Freq: W-SUN, dtype: float64

In [139]: ts2['2011-11':'2011-12']
Out[139]:
2011-11-06    0.437823
2011-11-13   -0.293083
2011-11-20   -0.059881
2011-11-27    1.252450
2011-12-04    0.046611
2011-12-11    0.059478
2011-12-18   -0.286539
2011-12-25    0.841669
Freq: W-SUN, dtype: float64
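The difference comes down to how '2011-12' is read: truncate treats it as the instant 2011-12-01 00:00:00, while slicing treats it as the whole month. A deterministic version of the comparison might look like the sketch below, where integer values replace the random data and the names truncated and sliced are introduced for illustration.

```python
import pandas as pd

rng2 = pd.date_range("2011-01-01", "2012-01-01", freq="W")
ts2 = pd.Series(range(len(rng2)), index=rng2)

# truncate reads '2011-12' as 2011-12-01 00:00:00, so December Sundays are dropped.
truncated = ts2.truncate(before="2011-11", after="2011-12")

# Slicing treats '2011-12' as the whole month of December.
sliced = ts2.loc["2011-11":"2011-12"]
```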

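Even fancy indexing that breaks the regular frequency of the DatetimeIndex still returns a DatetimeIndex, although the frequency information is lost (note freq=None):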

In [140]: ts2[[0, 2, 6]].index
Out[140]: DatetimeIndex(['2011-01-02', '2011-01-16', '2011-02-13'], dtype='datetime64[ns]', freq=None)
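For contrast, a regular step slice keeps a frequency while an arbitrary list of positions does not. The sketch below uses .iloc for explicit positional selection, which is a choice made here rather than something the article prescribes, and integer values in place of random data.

```python
import pandas as pd

rng2 = pd.date_range("2011-01-01", "2012-01-01", freq="W")
ts2 = pd.Series(range(len(rng2)), index=rng2)

# A regular step keeps a (coarser) frequency on the resulting index.
every_other = ts2.iloc[::2].index

# Arbitrary positions break the regular spacing, so freq becomes None.
picked = ts2.iloc[[0, 2, 6]].index
```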

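Time/date components

The following time/date properties can be accessed from a Timestamp or from a collection of timestamps such as a DatetimeIndex.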

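Property	Description
year	The year of the datetime
month	The month of the datetime
day	The day of the datetime
hour	The hour of the datetime
minute	The minutes of the datetime
second	The seconds of the datetime
microsecond	The microseconds of the datetime
nanosecond	The nanoseconds of the datetime
date	Returns datetime.date (does not contain timezone information)
time	Returns datetime.time (does not contain timezone information)
timetz	Returns datetime.time as local time with timezone information
dayofyear	The ordinal day of the year
weekofyear	The week ordinal of the year (removed in pandas 2.0; use isocalendar().week instead)
week	The week ordinal of the year (removed in pandas 2.0; use isocalendar().week instead)
dayofweek	The number of the day of the week, with Monday=0, Sunday=6
weekday	The number of the day of the week, with Monday=0, Sunday=6
weekday_name	The name of the day of the week, e.g. Friday (removed in pandas 1.0; use day_name() instead)
quarter	Quarter of the date: Jan-Mar = 1, Apr-Jun = 2, etc.
days_in_month	The number of days in the month of the datetime
is_month_start	Logical indicating if it is the first day of the month (defined by frequency)
is_month_end	Logical indicating if it is the last day of the month (defined by frequency)
is_quarter_start	Logical indicating if it is the first day of the quarter (defined by frequency)
is_quarter_end	Logical indicating if it is the last day of the quarter (defined by frequency)
is_year_start	Logical indicating if it is the first day of the year (defined by frequency)
is_year_end	Logical indicating if it is the last day of the year (defined by frequency)
is_leap_year	Logical indicating if the date belongs to a leap year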

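Furthermore, if you have a Series with datetime-like values, you can access these properties through the .dt accessor, as described in the section on .dt accessors.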
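A short sketch of reading several of these properties, first from a DatetimeIndex and then from a Series through .dt. The two timestamps are arbitrary examples chosen here (a leap day and a year end), not data from the article.

```python
import pandas as pd

idx = pd.DatetimeIndex(["2012-02-29 13:45:10", "2013-12-31 00:00:00"])

# Properties on the index return one value per timestamp.
years = list(idx.year)
quarters = list(idx.quarter)
days_of_week = list(idx.dayofweek)  # Monday=0, Sunday=6
month_ends = list(idx.is_month_end)
leap_years = list(idx.is_leap_year)

# The same properties are available on a Series through .dt.
s = pd.Series(idx)
hours = list(s.dt.hour)
days_in_month = list(s.dt.days_in_month)
```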