Pandas Data Structures: DataFrame

November 27, 2019
Notes

From a dict of Series or dicts
From a dict of ndarrays or lists
From a structured or record array
From a list of dicts
From a dict of tuples
From a Series
Alternate constructors

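A DataFrame is a two-dimensional labeled data structure whose columns may hold different types. It resembles a spreadsheet, a SQL table, or a dict of Series objects. DataFrame is the most commonly used pandas object. Like Series, it accepts many kinds of input data: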

A dict of 1-D ndarrays, lists, dicts, or Series
A 2-D numpy.ndarray
A structured or record ndarray
A Series
Another DataFrame
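Two of the input kinds listed above, a 2-D ndarray and another DataFrame, are not demonstrated elsewhere in these notes. The following is a minimal sketch of both; the variable names and column labels are illustrative assumptions, not part of the original.

```python
import numpy as np
import pandas as pd

# A 2-D ndarray: each row of the array becomes a row of the frame.
arr = np.arange(6).reshape(2, 3)
df_from_array = pd.DataFrame(arr, columns=["x", "y", "z"])

# Another DataFrame: passing columns selects and reorders its columns.
df_from_frame = pd.DataFrame(df_from_array, columns=["z", "x"])

print(df_from_array)
print(df_from_frame)
```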

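Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. Passing an index and/or columns guarantees that the resulting DataFrame has exactly that index and/or those columns. As a consequence, a dict of Series combined with a specific index discards all data that does not match the passed index.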

If axis labels are not passed, they are constructed from the input data using common-sense rules.

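Note: When the data is a dict and columns is not specified, the DataFrame columns follow the dict's insertion order if you are using Python >= 3.6 and pandas >= 0.23. With Python < 3.6 or pandas < 0.23, and columns not specified, the columns are the dict keys in lexical order.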
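The ordering rule can be checked with a small sketch. It assumes a recent environment (Python >= 3.6, pandas >= 0.23); the dict keys are made up for illustration.

```python
import pandas as pd

# Keys are inserted in the order 'b', 'a', 'c' (deliberately not alphabetical).
data = {"b": [1, 2], "a": [3, 4], "c": [5, 6]}

# No columns argument: columns follow the dict's insertion order.
df_default = pd.DataFrame(data)

# Explicit columns argument: the passed order is used instead.
df_explicit = pd.DataFrame(data, columns=["a", "b", "c"])

print(list(df_default.columns))
print(list(df_explicit.columns))
```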

From a dict of Series or dicts

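The resulting index is the union of the indexes of the individual Series. Any nested dicts are first converted to Series. If no columns are passed, the columns are the ordered list of dict keys.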

In [37]: d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
   ....:      'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
   ....:

In [38]: df = pd.DataFrame(d)

In [39]: df
Out[39]:
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

In [40]: pd.DataFrame(d, index=['d', 'b', 'a'])
Out[40]:
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

In [41]: pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[41]:
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
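The session above uses Series as the dict values. The nested-dict case mentioned earlier behaves the same way, as this small sketch suggests; the keys and values are invented for illustration.

```python
import pandas as pd

# Inner dicts play the role of Series: their keys become row labels.
nested = {
    "one": {"a": 1.0, "b": 2.0},
    "two": {"b": 3.0, "c": 4.0},
}
df_nested = pd.DataFrame(nested)

# The row index is the union of the inner keys; gaps are filled with NaN.
print(df_nested)
```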

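Note: When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.

The row and column labels can be accessed through the index and columns attributes, respectively: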

In [42]: df.index
Out[42]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [43]: df.columns
Out[43]: Index(['one', 'two'], dtype='object')

From a dict of ndarrays or lists

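The ndarrays must all be the same length. If an index is passed, it must be the same length as the arrays. If no index is passed, the resulting index is range(n), where n is the array length.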

In [44]: d = {'one': [1., 2., 3., 4.],
   ....:      'two': [4., 3., 2., 1.]}
   ....:

In [45]: pd.DataFrame(d)
Out[45]:
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

In [46]: pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[46]:
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
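The length requirements can be seen by breaking them deliberately. The sketch below assumes that pandas reports both mismatches as ValueError, which is the behavior in the versions I am aware of; the flag names are illustrative.

```python
import pandas as pd

d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}

# Case 1: the index has 3 labels but each list has 4 values.
try:
    pd.DataFrame(d, index=["a", "b", "c"])
    index_mismatch_rejected = False
except ValueError as exc:
    index_mismatch_rejected = True
    print("index mismatch:", exc)

# Case 2: the lists themselves have different lengths.
try:
    pd.DataFrame({"one": [1.0, 2.0], "two": [1.0, 2.0, 3.0]})
    length_mismatch_rejected = False
except ValueError as exc:
    length_mismatch_rejected = True
    print("length mismatch:", exc)
```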

From a structured or record array

This case is handled identically to a dict of arrays.

In [47]: data = np.zeros((2, ), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'a10')])

In [48]: data[:] = [(1, 2., 'Hello'), (2, 3., "World")]

In [49]: pd.DataFrame(data)
Out[49]:
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'

In [50]: pd.DataFrame(data, index=['first', 'second'])
Out[50]:
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'

In [51]: pd.DataFrame(data, columns=['C', 'A', 'B'])
Out[51]:
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0

Note: DataFrame is not intended to work exactly like a two-dimensional NumPy ndarray.
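One concrete difference, offered here as an illustration rather than as the full meaning of the note, is how a single index is interpreted. The values and names below are made up for the example.

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 4.0], [2.0, 3.0]])
df = pd.DataFrame(arr, columns=["one", "two"])

# ndarray: a single integer index selects a row.
first_row = arr[0]

# DataFrame: a single label in square brackets selects a column.
column_one = df["one"]

print(first_row)
print(column_one)
```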

From a list of dicts

In [52]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [53]: pd.DataFrame(data2)
Out[53]:
   a   b     c
0  1   2   NaN
1  5  10  20.0

In [54]: pd.DataFrame(data2, index=['first', 'second'])
Out[54]:
        a   b     c
first   1   2   NaN
second  5  10  20.0

In [55]: pd.DataFrame(data2, columns=['a', 'b'])
Out[55]:
   a   b
0  1   2
1  5  10

From a dict of tuples

Passing a dict whose keys are tuples automatically creates a DataFrame with a MultiIndex.

In [56]: pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
   ....:               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
   ....:               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
   ....:               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
   ....:               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})
   ....:
Out[56]:
       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0

From a Series

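The result is a DataFrame with the same index as the input Series and a single column. If no column name is given, the column takes the name of the input Series.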
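A minimal sketch of this case follows. The Series name "ser" and the labels are illustrative; the unnamed case is included to show the default column label pandas falls back to.

```python
import pandas as pd

# A named Series: its name becomes the single column label.
ser = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"], name="ser")
df_named = pd.DataFrame(ser)

# An unnamed Series: the column gets the default integer label 0.
df_unnamed = pd.DataFrame(pd.Series([1.0, 2.0], index=["x", "y"]))

print(df_named)
print(df_unnamed)
```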

Missing data

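For more on this topic, see the Missing data section of the pandas documentation. Missing values in a DataFrame are represented by np.nan. Alternatively, you can pass a numpy.MaskedArray as the data argument to the DataFrame constructor, and its masked entries are treated as missing.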
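Both routes to missing values can be sketched briefly. The array contents and column labels are invented for the example, and float data is used so that NaN fits without a dtype change.

```python
import numpy as np
import pandas as pd

# Route 1: write np.nan directly into the data.
df_nan = pd.DataFrame({"one": [1.0, np.nan, 3.0]})

# Route 2: pass a masked array; the entry at row 0, column 1 is masked.
masked = np.ma.masked_array(
    [[1.0, 2.0], [3.0, 4.0]],
    mask=[[False, True], [False, False]],
)
df_masked = pd.DataFrame(masked, columns=["A", "B"])

print(df_nan)
print(df_masked)
print(df_masked.isna())
```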

Alternate constructors

DataFrame.from_dict

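DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter, which is 'columns' by default and can be set to 'index' to use the dict keys as row labels.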

In [57]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))
Out[57]:
   A  B
0  1  4
1  2  5
2  3  6

With orient='index', the keys become the row labels. This example also passes the desired column names:

In [58]: pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),
   ....:                        orient='index', columns=['one', 'two', 'three'])
   ....:
Out[58]:
   one  two  three
A    1    2      3
B    4    5      6

DataFrame.from_records

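DataFrame.from_records takes a list of tuples or an ndarray with a structured dtype. It works like the normal DataFrame constructor, except that the resulting DataFrame index can be a specific field of the structured dtype. For example: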

In [59]: data
Out[59]:
array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [60]: pd.DataFrame.from_records(data, index='C')
Out[60]:
          A    B
C
b'Hello'  1  2.0
b'World'  2  3.0
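The session above covers the structured-array input. The list-of-tuples input could look like the following sketch; since plain tuples carry no field names, the column names are supplied through the columns argument, and the sample records are illustrative.

```python
import pandas as pd

# Plain tuples have no field names, so they are passed via columns.
records = [(1, 2.0, "Hello"), (2, 3.0, "World")]

df_records = pd.DataFrame.from_records(
    records,
    columns=["A", "B", "C"],
    index="C",  # use field 'C' as the row index instead of a column
)

print(df_records)
```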