Common Pandas statistics methods in Python (sum, count, mean, median, quantiles, max/min, variance, standard deviation, and more)

Pandas statistics methods: a summary


Prepare the data:

import pandas as pd
# Suppose 10 people each took 4 courses and received a score for each
# The 'Project_num' column holds the number of projects each person is responsible for
data = {'name' : pd.Series(['Alice', 'Bob', 'Cathy', 'Dany', 'Ella', 'Ford', 'Gary', 'Ham', 'Ico', 'Jack']),
        'Math_A' : pd.Series([1.1, 2.2, 3.3, 4.4, 5, 3.2, 2.4, 1.5, 4.3, 4.5]),
        'English_A' : pd.Series([3, 2.6, 2, 1.7, 3, 3.3, 4.4, 5, 3.2, 2.4]),
        'Math_B' : pd.Series([1.7, 2.5, 3.6, 2.4, 5, 2.2, 3.3, 4.4, 1.5, 4.3]),
        'English_B' : pd.Series([5, 2.6, 2.4, 1.3, 3, 3.6, 2.4, 5, 2.2, 3.1]),
        'Project_num' : pd.Series([2, 3, 0, 1, 7, 2, 1, 5, 3, 4]),
        'Sex' : pd.Series(['F', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'M'])
     }
df = pd.DataFrame(data)
print(df)

Output:

    name  Math_A  English_A  Math_B  English_B  Project_num Sex
0  Alice     1.1        3.0     1.7        5.0            2   F
1    Bob     2.2        2.6     2.5        2.6            3   M
2  Cathy     3.3        2.0     3.6        2.4            0   M
3   Dany     4.4        1.7     2.4        1.3            1   F
4   Ella     5.0        3.0     5.0        3.0            7   M
5   Ford     3.2        3.3     2.2        3.6            2   F
6   Gary     2.4        4.4     3.3        2.4            1   M
7    Ham     1.5        5.0     4.4        5.0            5   M
8    Ico     4.3        3.2     1.5        2.2            3   F
9   Jack     4.5        2.4     4.3        3.1            4   M

I. Overall description of the data

1.1 Count rows: len(df)

print(len(df))
# The header row (column labels) is not counted

Output:

10
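As a small complement to len(df), the sketch below reads the row and column counts together from df.shape. The three-row frame is a cut-down stand-in for the table above, used only for illustration.

```python
import pandas as pd

# Cut-down illustrative frame (not the full 10-row table)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Cathy"],
    "Math_A": [1.1, 2.2, 3.3],
})

# shape is a (rows, columns) tuple; the header row is not counted
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len(df) and len(df.index) both give the row count
print(len(df) == n_rows, len(df.index) == n_rows)
```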

1.2 Count distinct values: df['label'].nunique()

print(df['Sex'].nunique())
# How many distinct sexes appear in the data?

Output:

2

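1.3 Count occurrences of each distinct value in a column: df['label'].value_counts()

print(df['Sex'].value_counts())
# Number of people per sex; int64 is the dtype of the counts
# (pandas 2.0+ names the result `count` and labels the index `Sex`, so the footer may differ)

Output: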

M    6
F    4
Name: Sex, dtype: int64
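If proportions are more useful than raw counts, value_counts accepts normalize=True. The sketch below rebuilds only the Sex column from the data above, as a standalone Series.

```python
import pandas as pd

# Same values as the 'Sex' column above, as a standalone Series
sex = pd.Series(['F', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'M'], name='Sex')

counts = sex.value_counts()                # absolute counts per value
shares = sex.value_counts(normalize=True)  # fraction of rows per value

print(counts['M'], counts['F'])
print(shares['M'], shares['F'])
```

The shares sum to 1, which makes them convenient for comparing columns of different lengths.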

1.4 Summary statistics: df.describe()

1.4.1 Numeric columns only

print(df.describe())
# Summarize the whole table (by default only numeric columns are included)

Output:

          Math_A  English_A     Math_B  English_B  Project_num
count  10.000000  10.000000  10.000000  10.000000    10.000000
mean    3.190000   3.060000   3.090000   3.060000     2.800000
std     1.355196   1.014561   1.211473   1.189958     2.097618
min     1.100000   1.700000   1.500000   1.300000     0.000000
25%     2.250000   2.450000   2.250000   2.400000     1.250000
50%     3.250000   3.000000   2.900000   2.800000     2.500000
75%     4.375000   3.275000   4.125000   3.475000     3.750000
max     5.000000   5.000000   5.000000   5.000000     7.000000

1.4.2 All column types: df.describe(include='all')

print(df.describe(include='all'))
# Summarize the whole table, including non-numeric columns

Output:

        name     Math_A  English_A     Math_B  English_B  Project_num  Sex
count     10  10.000000  10.000000  10.000000  10.000000    10.000000   10
unique    10        NaN        NaN        NaN        NaN          NaN    2
top     Jack        NaN        NaN        NaN        NaN          NaN    M
freq       1        NaN        NaN        NaN        NaN          NaN    6
mean     NaN   3.190000   3.060000   3.090000   3.060000     2.800000  NaN
std      NaN   1.355196   1.014561   1.211473   1.189958     2.097618  NaN
min      NaN   1.100000   1.700000   1.500000   1.300000     0.000000  NaN
25%      NaN   2.250000   2.450000   2.250000   2.400000     1.250000  NaN
50%      NaN   3.250000   3.000000   2.900000   2.800000     2.500000  NaN
75%      NaN   4.375000   3.275000   4.125000   3.475000     3.750000  NaN
max      NaN   5.000000   5.000000   5.000000   5.000000     7.000000  NaN

1.4.3 A single column

print(df.Math_A.describe())
# Summarize one specific column

Output:

count    10.000000
mean      3.190000
std       1.355196
min       1.100000
25%       2.250000
50%       3.250000
75%       4.375000
max       5.000000
Name: Math_A, dtype: float64
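describe() also takes a percentiles argument when the default 25%/50%/75% rows are not the ones you need. A minimal sketch on the Math_A values from above, where the choice of 10% and 90% is arbitrary:

```python
import pandas as pd

# Same values as the 'Math_A' column above
math_a = pd.Series([1.1, 2.2, 3.3, 4.4, 5, 3.2, 2.4, 1.5, 4.3, 4.5], name='Math_A')

# Request the 10th and 90th percentiles instead of the default quartiles
summary = math_a.describe(percentiles=[0.1, 0.9])
print(summary)

# Individual entries can be read by label
print(summary['10%'], summary['90%'])
```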

II. Specific statistics

2.1 Sum: sum()

print(df.Project_num.sum())

Output:

28

2.2 Count: df.count()

print(df.count())
# Number of non-missing values in each column

Output:

name           10
Math_A         10
English_A      10
Math_B         10
English_B      10
Project_num    10
Sex            10
dtype: int64

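2.3 Median: df.median()

print(df.median(numeric_only=True))
# numeric_only=True skips the string columns; pandas 2.0+ raises a TypeError without it

Output: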

Math_A         3.25
English_A      3.00
Math_B         2.90
English_B      2.80
Project_num    2.50
dtype: float64

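2.4 Quantiles: df.quantile()

print(df.quantile([0.25, 0.75], numeric_only=True))
# numeric_only=True skips the string columns; pandas 2.0+ raises a TypeError without it

Output: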

      Math_A  English_A  Math_B  English_B  Project_num
0.25   2.250      2.450   2.250      2.400         1.25
0.75   4.375      3.275   4.125      3.475         3.75
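The quantile values above are not always actual data points, because quantile() interpolates linearly between neighbouring sorted values by default. The sketch below reproduces the Math_A figures by hand. The helper linear_quantile is a hypothetical name written for this illustration, not a pandas function.

```python
import math
import pandas as pd

# Same values as the 'Math_A' column above
math_a = pd.Series([1.1, 2.2, 3.3, 4.4, 5, 3.2, 2.4, 1.5, 4.3, 4.5], name='Math_A')

def linear_quantile(values, q):
    """Hand-rolled quantile: linear interpolation between sorted neighbours."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q   # fractional position in the sorted list
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

for q in (0.25, 0.75):
    # Both columns should agree up to floating-point rounding
    print(q, math_a.quantile(q), linear_quantile(math_a.tolist(), q))
```

For q = 0.25 the position is 9 * 0.25 = 2.25, so the result lies a quarter of the way from the third sorted value (2.2) to the fourth (2.4), giving 2.25.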

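2.5 Maximum / minimum: df.max() / df.min()

print(df.max())
print(df.min())
# Strings have an ordering too (they are compared lexicographically),
# so the name and Sex columns also get a maximum and a minimum

Output: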

name           Jack
Math_A            5
English_A         5
Math_B            5
English_B         5
Project_num       7
Sex               M
dtype: object

name           Alice
Math_A           1.1
English_A        1.7
Math_B           1.5
English_B        1.3
Project_num        0
Sex                F
dtype: object
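max() and min() return the extreme values but not the rows they come from. One way to find those rows is idxmax() / idxmin(), sketched here on the first five rows of the table above.

```python
import pandas as pd

# First five rows of the table above, two columns only
scores = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Cathy', 'Dany', 'Ella'],
    'Math_A': [1.1, 2.2, 3.3, 4.4, 5.0],
})

# idxmax / idxmin return the index label of the extreme value
top_row = scores['Math_A'].idxmax()
low_row = scores['Math_A'].idxmin()

print(scores.loc[top_row, 'name'])  # Ella
print(scores.loc[low_row, 'name'])  # Alice
```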

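2.6 Mean: df.mean()

print(df.mean(numeric_only=True))
# Computed for numeric columns only; pandas 2.0+ needs numeric_only=True when string columns are present

Output: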

Math_A         3.19
English_A      3.06
Math_B         3.09
English_B      3.06
Project_num    2.80
dtype: float64

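2.7 Variance / standard deviation: df.var() / df.std()

print(df.var(numeric_only=True))
print(df.std(numeric_only=True))
# Computed for numeric columns only; pandas 2.0+ needs numeric_only=True when string columns are present

Output: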

Math_A         1.836556
English_A      1.029333
Math_B         1.467667
English_B      1.416000
Project_num    4.400000
dtype: float64

Math_A         1.355196
English_A      1.014561
Math_B         1.211473
English_B      1.189958
Project_num    2.097618
dtype: float64
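The variance and standard deviation above are sample statistics: pandas divides by n - 1 by default (ddof=1). The sketch below contrasts that with the population version (ddof=0) on the Project_num values.

```python
import pandas as pd

# Same values as the 'Project_num' column above
projects = pd.Series([2, 3, 0, 1, 7, 2, 1, 5, 3, 4], name='Project_num')

sample_var = projects.var()            # default ddof=1: divide by n - 1
population_var = projects.var(ddof=0)  # divide by n

print(sample_var)      # matches the 4.4 shown above, up to float rounding
print(population_var)  # sum of squared deviations (39.6) divided by 10

# std() takes the same ddof argument and is the square root of var()
print(projects.std(), projects.std(ddof=0))
```

NumPy's np.var and np.std default to ddof=0, so results can differ between the two libraries unless ddof is set explicitly.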

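III. Batch operations (apply one custom function across the data): df.apply()

3.1 The whole table

def double(x):
    return x * 2

print(df.apply(double))
# DataFrame.apply calls double once per column (each column is passed as a Series),
# and x * 2 is then evaluated element by element
# Numbers are multiplied by 2
# Strings are repeated twice

Output: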

         name  Math_A  English_A  Math_B  English_B  Project_num Sex
0  AliceAlice     2.2        6.0     3.4       10.0            4  FF
1      BobBob     4.4        5.2     5.0        5.2            6  MM
2  CathyCathy     6.6        4.0     7.2        4.8            0  MM
3    DanyDany     8.8        3.4     4.8        2.6            2  FF
4    EllaElla    10.0        6.0    10.0        6.0           14  MM
5    FordFord     6.4        6.6     4.4        7.2            4  FF
6    GaryGary     4.8        8.8     6.6        4.8            2  MM
7      HamHam     3.0       10.0     8.8       10.0           10  MM
8      IcoIco     8.6        6.4     3.0        4.4            6  FF
9    JackJack     9.0        4.8     8.6        6.2            8  MM

3.2 A single column

def double(x):
    return x * 2

print(df.Math_B.apply(double))
# Series.apply calls double once per element

Output:

0     3.4
1     5.0
2     7.2
3     4.8
4    10.0
5     4.4
6     6.6
7     8.8
8     3.0
9     8.6
Name: Math_B, dtype: float64
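For a simple arithmetic function like double, the same result can be obtained without apply by operating on the column directly. apply becomes more useful when the function needs a whole row, which axis=1 provides. A minimal sketch on a three-row subset of the table, where the row-wise sum is only an illustrative choice:

```python
import pandas as pd

# Three-row subset of the table above
df = pd.DataFrame({
    'Math_A': [1.1, 2.2, 3.3],
    'Math_B': [1.7, 2.5, 3.6],
})

def double(x):
    return x * 2

via_apply = df['Math_B'].apply(double)  # function called once per element
vectorized = df['Math_B'] * 2           # same arithmetic on the whole column at once
print(via_apply.equals(vectorized))

# axis=1 passes each row to the function as a Series
row_total = df.apply(lambda row: row['Math_A'] + row['Math_B'], axis=1)
print(row_total)
```

On large tables the vectorized form is usually faster, since it avoids a Python-level function call per element.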