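Python Pandas: A Summary of Common Statistical Methods (Sum, Count, Mean, Median, Quantiles, Max/Min, Variance, Standard Deviation, etc.)
Pandas Statistical Methods Summary
Preparing the data: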
import pandas as pd
# Suppose 10 people each took 4 courses and received a score for each
# The number of projects each person is responsible for is in the 'Project_num' column
data = {'name' : pd.Series(['Alice', 'Bob', 'Cathy', 'Dany', 'Ella', 'Ford', 'Gary', 'Ham', 'Ico', 'Jack']),
        'Math_A' : pd.Series([1.1, 2.2, 3.3, 4.4, 5, 3.2, 2.4, 1.5, 4.3, 4.5]),
        'English_A' : pd.Series([3, 2.6, 2, 1.7, 3, 3.3, 4.4, 5, 3.2, 2.4]),
        'Math_B' : pd.Series([1.7, 2.5, 3.6, 2.4, 5, 2.2, 3.3, 4.4, 1.5, 4.3]),
        'English_B' : pd.Series([5, 2.6, 2.4, 1.3, 3, 3.6, 2.4, 5, 2.2, 3.1]),
        'Project_num' : pd.Series([2, 3, 0, 1, 7, 2, 1, 5, 3, 4]),
        'Sex' : pd.Series(['F', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'M'])
        }
df = pd.DataFrame(data)
print(df)
Output:
    name  Math_A  English_A  Math_B  English_B  Project_num  Sex
0  Alice     1.1        3.0     1.7        5.0            2    F
1    Bob     2.2        2.6     2.5        2.6            3    M
2  Cathy     3.3        2.0     3.6        2.4            0    M
3   Dany     4.4        1.7     2.4        1.3            1    F
4   Ella     5.0        3.0     5.0        3.0            7    M
5   Ford     3.2        3.3     2.2        3.6            2    F
6   Gary     2.4        4.4     3.3        2.4            1    M
7    Ham     1.5        5.0     4.4        5.0            5    M
8    Ico     4.3        3.2     1.5        2.2            3    F
9   Jack     4.5        2.4     4.3        3.1            4    M
I. Overall Description of the Data
1.1 Count rows: len(df)
print(len(df))
# The header row (column labels) is not counted
Output:
10
1.2 Count distinct values: df['label'].nunique()
print(df['Sex'].nunique())
# How many distinct sexes appear among these people?
Output:
2
1.3 Count occurrences of each distinct value in a column: df['label'].value_counts()
print(df['Sex'].value_counts())
# Count how many people there are of each sex; int64 is the dtype of the counts
Output:
M 6
F 4
Name: Sex, dtype: int64
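When the proportion matters more than the raw count, `value_counts()` also accepts `normalize=True`. A minimal sketch, assuming the same `Sex` values as the sample data (the variable names `counts` and `shares` are only illustrative):

```python
import pandas as pd

# Same 'Sex' values as in the sample data
sex = pd.Series(['F', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'M'], name='Sex')

counts = sex.value_counts()                # absolute counts, most frequent first
shares = sex.value_counts(normalize=True)  # relative frequencies that sum to 1

print(counts)
print(shares)
```

Here `shares` should report 0.6 for `M` and 0.4 for `F`, matching the 6 and 4 counts above.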
1.4 Summary statistics: df.describe()
1.4.1 Numeric columns only
print(df.describe())
# Summarize the whole table (by default only the numeric columns are included)
Output:
          Math_A  English_A     Math_B  English_B  Project_num
count  10.000000  10.000000  10.000000  10.000000    10.000000
mean    3.190000   3.060000   3.090000   3.060000     2.800000
std     1.355196   1.014561   1.211473   1.189958     2.097618
min     1.100000   1.700000   1.500000   1.300000     0.000000
25%     2.250000   2.450000   2.250000   2.400000     1.250000
50%     3.250000   3.000000   2.900000   2.800000     2.500000
75%     4.375000   3.275000   4.125000   3.475000     3.750000
max     5.000000   5.000000   5.000000   5.000000     7.000000
1.4.2 All column types: df.describe(include='all')
print(df.describe(include='all'))
# Summarize the whole table, including the non-numeric columns
Output:
        name     Math_A  English_A     Math_B  English_B  Project_num  Sex
count     10  10.000000  10.000000  10.000000  10.000000    10.000000   10
unique    10        NaN        NaN        NaN        NaN          NaN    2
top     Jack        NaN        NaN        NaN        NaN          NaN    M
freq       1        NaN        NaN        NaN        NaN          NaN    6
mean     NaN   3.190000   3.060000   3.090000   3.060000     2.800000  NaN
std      NaN   1.355196   1.014561   1.211473   1.189958     2.097618  NaN
min      NaN   1.100000   1.700000   1.500000   1.300000     0.000000  NaN
25%      NaN   2.250000   2.450000   2.250000   2.400000     1.250000  NaN
50%      NaN   3.250000   3.000000   2.900000   2.800000     2.500000  NaN
75%      NaN   4.375000   3.275000   4.125000   3.475000     3.750000  NaN
max      NaN   5.000000   5.000000   5.000000   5.000000     7.000000  NaN
1.4.3 A single column
print(df.Math_A.describe())
# Summarize one specified column
Output:
count 10.000000
mean 3.190000
std 1.355196
min 1.100000
25% 2.250000
50% 3.250000
75% 4.375000
max 5.000000
Name: Math_A, dtype: float64
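The 25% / 50% / 75% rows are only the defaults. A small sketch, assuming the same `Math_A` scores, of how the `percentiles` parameter of `describe()` can request other cut points (the 10% and 90% choice here is arbitrary):

```python
import pandas as pd

# Same Math_A scores as in the sample data
math_a = pd.Series([1.1, 2.2, 3.3, 4.4, 5, 3.2, 2.4, 1.5, 4.3, 4.5], name='Math_A')

# Ask for the 10th and 90th percentiles instead of the quartiles;
# the median (50%) row is still included in the result
summary = math_a.describe(percentiles=[0.1, 0.9])
print(summary)
```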
II. Specific Statistics
2.1 Sum: sum()
print(df.Project_num.sum())
Output:
28
2.2 Count: df.count()
print(df.count())
Output:
name 10
Math_A 10
English_A 10
Math_B 10
English_B 10
Project_num 10
Sex 10
dtype: int64
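Every column shows 10 here because the sample data has no missing values. `count()` only counts non-null entries, so it can differ from `len(df)`. A minimal sketch with a hypothetical three-row table in which one score is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Bob's Math_A score is missing
scores = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Cathy'],
    'Math_A': [1.1, np.nan, 3.3],
})

n_rows = len(scores)       # number of rows, missing values included
non_null = scores.count()  # per-column count of non-null values

print(n_rows)
print(non_null)
```

With this data `len(scores)` is 3, while `count()` reports 3 for `name` and 2 for `Math_A`.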
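2.3 Median: df.median()
print(df.median(numeric_only=True))
# numeric_only=True skips the string columns; pandas 2.0 and later raise a TypeError without it
Output: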
Math_A 3.25
English_A 3.00
Math_B 2.90
English_B 2.80
Project_num 2.50
dtype: float64
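2.4 Quantiles: df.quantile()
print(df.quantile([0.25, 0.75], numeric_only=True))
# numeric_only=True skips the string columns; pandas 2.0 and later raise a TypeError without it
Output:
      Math_A  English_A  Math_B  English_B  Project_num
0.25   2.250      2.450   2.250      2.400         1.25
0.75   4.375      3.275   4.125      3.475         3.75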
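Values such as 2.250 do not appear in the data because `quantile()` interpolates linearly between neighbouring sorted values by default. The sketch below reproduces the 0.25 quantile of `Math_A` by hand; `linear_quantile` is a hypothetical helper written for illustration, not a pandas function:

```python
import math
import pandas as pd

# Same Math_A scores as in the sample data
math_a = pd.Series([1.1, 2.2, 3.3, 4.4, 5, 3.2, 2.4, 1.5, 4.3, 4.5])

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)  # fractional position in the sorted list
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

manual = linear_quantile(math_a, 0.25)
builtin = math_a.quantile(0.25)
print(manual, builtin)
```

For q = 0.25 the position is 0.25 × 9 = 2.25, which falls between the sorted values 2.2 and 2.4, giving 2.2 + 0.25 × 0.2 = 2.25.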
2.5 Maximum / minimum: df.max() / df.min()
print(df.max())
print(df.min())
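# Strings have an ordering too (they are compared character by character), so the text columns also get a max and a min
Output: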
name Jack
Math_A 5
English_A 5
Math_B 5
English_B 5
Project_num 7
Sex M
dtype: object
name Alice
Math_A 1.1
English_A 1.7
Math_B 1.5
English_B 1.3
Project_num 0
Sex F
dtype: object
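The `Jack` / `Alice` results come from ordinary Python string comparison, which works by Unicode code point. A short sketch of that ordering, plus `numeric_only=True` for cases where only the numeric columns are wanted (the two-row frame `small` is hypothetical):

```python
import pandas as pd

# Plain Python string comparison is lexicographic, by Unicode code point
print('Jack' > 'Alice')  # 'J' comes after 'A'
print('Zed' < 'alice')   # uppercase ASCII letters sort before lowercase ones

names = pd.Series(['Alice', 'Bob', 'Cathy', 'Jack'])
largest_name = names.max()
smallest_name = names.min()
print(largest_name, smallest_name)

# Hypothetical two-row frame: numeric_only=True leaves the text column out
small = pd.DataFrame({'name': ['Alice', 'Jack'], 'Project_num': [2, 4]})
numeric_max = small.max(numeric_only=True)
print(numeric_max)
```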
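2.6 Mean: df.mean()
print(df.mean(numeric_only=True))
# Computed for the numeric columns only; pandas 2.0 and later require numeric_only=True when string columns are present
Output: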
Math_A 3.19
English_A 3.06
Math_B 3.09
English_B 3.06
Project_num 2.80
dtype: float64
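2.7 Variance / standard deviation: df.var() / df.std()
print(df.var(numeric_only=True))
print(df.std(numeric_only=True))
# Computed for the numeric columns only; pandas 2.0 and later require numeric_only=True when string columns are present
Output: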
Math_A 1.836556
English_A 1.029333
Math_B 1.467667
English_B 1.416000
Project_num 4.400000
dtype: float64
Math_A 1.355196
English_A 1.014561
Math_B 1.211473
English_B 1.189958
Project_num 2.097618
dtype: float64
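These figures are sample statistics: pandas divides by n - 1 by default (`ddof=1`), whereas NumPy's `np.var` divides by n. A minimal sketch on the `Project_num` values showing both conventions:

```python
import numpy as np
import pandas as pd

# Same Project_num values as in the sample data
project_num = pd.Series([2, 3, 0, 1, 7, 2, 1, 5, 3, 4])

sample_var = project_num.var()            # pandas default, ddof=1 (divide by n - 1)
population_var = project_num.var(ddof=0)  # divide by n instead
numpy_var = np.var(project_num.to_numpy())  # NumPy default is ddof=0
sample_std = project_num.std()            # square root of the sample variance

print(sample_var, population_var, numpy_var, sample_std)
```

The sum of squared deviations from the mean 2.8 is 39.6, so the sample variance is 39.6 / 9 = 4.4 (the value shown above) and the population variance is 39.6 / 10 = 3.96.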
III. Batch Operations (applying the same custom function across the data): df.apply()
3.1 On the whole table
def double(x):
    return x*2
print(df.apply(double))
# For numeric columns this multiplies every value by 2
# For string columns it repeats each string twice
Output:
         name  Math_A  English_A  Math_B  English_B  Project_num  Sex
0  AliceAlice     2.2        6.0     3.4       10.0            4   FF
1      BobBob     4.4        5.2     5.0        5.2            6   MM
2  CathyCathy     6.6        4.0     7.2        4.8            0   MM
3    DanyDany     8.8        3.4     4.8        2.6            2   FF
4    EllaElla    10.0        6.0    10.0        6.0           14   MM
5    FordFord     6.4        6.6     4.4        7.2            4   FF
6    GaryGary     4.8        8.8     6.6        4.8            2   MM
7      HamHam     3.0       10.0     8.8       10.0           10   MM
8      IcoIco     8.6        6.4     3.0        4.4            6   FF
9    JackJack     9.0        4.8     8.6        6.2            8   MM
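`DataFrame.apply()` hands the function one whole column (a Series) at a time by default; doubling looks element-wise only because Series multiplication is element-wise. If repeating the strings is not wanted, one option is to select the numeric columns first. A sketch under that assumption, using a hypothetical two-row table:

```python
import pandas as pd

# Hypothetical two-row slice of the sample data
df_small = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'Math_A': [1.1, 2.2],
    'Project_num': [2, 3],
})

def double(x):
    return x * 2

# Keep only the numeric columns, then apply the function column by column
numeric = df_small.select_dtypes(include='number')
doubled = numeric.apply(double)

# Because each call receives a whole column, a function can also reduce it to one value
col_range = numeric.apply(lambda col: col.max() - col.min())

print(doubled)
print(col_range)
```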
3.2 On a single column
def double(x):
    return x*2
print(df.Math_B.apply(double))
Output:
0 3.4
1 5.0
2 7.2
3 4.8
4 10.0
5 4.4
6 6.6
7 8.8
8 3.0
9 8.6
Name: Math_B, dtype: float64
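On a single column, `Series.apply()` calls the function once per element. For simple arithmetic like this, the vectorized expression gives the same result and is usually the more idiomatic choice. A small sketch comparing the two on the first three `Math_B` scores:

```python
import pandas as pd

# First three Math_B scores from the sample data
math_b = pd.Series([1.7, 2.5, 3.6], name='Math_B')

via_apply = math_b.apply(lambda x: x * 2)  # function called once per element
vectorized = math_b * 2                    # whole-column arithmetic

print(via_apply.equals(vectorized))
```

`apply()` remains useful when the per-element logic cannot be written as a column-wide expression.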