Pandas - 2¶
Lecturer: 李显祥
School of Atmospheric Sciences
import pandas as pd
import numpy as np
print('Pandas version:', pd.__version__)
Pandas version: 1.1.3
data = pd.read_csv('table.csv')
data2 = data.set_index('ID')
5. Common Operations¶
describe and info¶
info reports which columns the DataFrame contains, how many non-missing values each column has, and each column's dtype.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 School 35 non-null object
1 Class 35 non-null object
2 ID 35 non-null int64
3 Gender 35 non-null object
4 Address 35 non-null object
5 Height 35 non-null int64
6 Weight 35 non-null int64
7 Math 35 non-null float64
8 Physics 35 non-null object
dtypes: float64(1), int64(3), object(5)
memory usage: 2.6+ KB
describe computes summary statistics for the numeric columns by default.
data.describe()
 | ID | Height | Weight | Math |
---|---|---|---|---|
count | 35.00000 | 35.000000 | 35.000000 | 35.000000 |
mean | 1803.00000 | 174.142857 | 74.657143 | 61.351429 |
std | 536.87741 | 13.541098 | 12.895377 | 19.915164 |
min | 1101.00000 | 155.000000 | 53.000000 | 31.500000 |
25% | 1204.50000 | 161.000000 | 63.000000 | 47.400000 |
50% | 2103.00000 | 173.000000 | 74.000000 | 61.700000 |
75% | 2301.50000 | 187.500000 | 82.000000 | 77.100000 |
max | 2405.00000 | 195.000000 | 100.000000 | 97.000000 |
You can also choose the percentiles yourself via the percentiles parameter:
data.describe(percentiles=[.10, .25, .75, .90])
 | ID | Height | Weight | Math |
---|---|---|---|---|
count | 35.00000 | 35.000000 | 35.000000 | 35.000000 |
mean | 1803.00000 | 174.142857 | 74.657143 | 61.351429 |
std | 536.87741 | 13.541098 | 12.895377 | 19.915164 |
min | 1101.00000 | 155.000000 | 53.000000 | 31.500000 |
10% | 1104.40000 | 158.400000 | 60.400000 | 33.880000 |
25% | 1204.50000 | 161.000000 | 63.000000 | 47.400000 |
50% | 2103.00000 | 173.000000 | 74.000000 | 61.700000 |
75% | 2301.50000 | 187.500000 | 82.000000 | 77.100000 |
90% | 2401.60000 | 192.600000 | 92.800000 | 86.480000 |
max | 2405.00000 | 195.000000 | 100.000000 | 97.000000 |
describe also works on non-numeric columns:
data['Physics'].describe()
count 35
unique 7
top B+
freq 9
Name: Physics, dtype: object
head and tail¶
data.head()
 | School | Class | ID | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|---|
0 | S_1 | C_1 | 1101 | M | street_1 | 173 | 63 | 34.0 | A+ |
1 | S_1 | C_1 | 1102 | F | street_2 | 192 | 73 | 32.5 | B+ |
2 | S_1 | C_1 | 1103 | M | street_2 | 186 | 82 | 87.2 | B+ |
3 | S_1 | C_1 | 1104 | F | street_2 | 167 | 81 | 80.4 | B- |
4 | S_1 | C_1 | 1105 | F | street_4 | 159 | 64 | 84.8 | B+ |
data.head(6)
 | School | Class | ID | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|---|
0 | S_1 | C_1 | 1101 | M | street_1 | 173 | 63 | 34.0 | A+ |
1 | S_1 | C_1 | 1102 | F | street_2 | 192 | 73 | 32.5 | B+ |
2 | S_1 | C_1 | 1103 | M | street_2 | 186 | 82 | 87.2 | B+ |
3 | S_1 | C_1 | 1104 | F | street_2 | 167 | 81 | 80.4 | B- |
4 | S_1 | C_1 | 1105 | F | street_4 | 159 | 64 | 84.8 | B+ |
5 | S_1 | C_2 | 1201 | M | street_5 | 188 | 68 | 97.0 | A- |
data.tail()
 | School | Class | ID | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|---|
30 | S_2 | C_4 | 2401 | F | street_2 | 192 | 62 | 45.3 | A |
31 | S_2 | C_4 | 2402 | M | street_7 | 166 | 82 | 48.7 | B |
32 | S_2 | C_4 | 2403 | F | street_6 | 158 | 60 | 59.7 | B+ |
33 | S_2 | C_4 | 2404 | F | street_2 | 160 | 84 | 67.7 | B |
34 | S_2 | C_4 | 2405 | F | street_6 | 193 | 54 | 47.6 | B |
Arithmetic/Statistical Operations¶
sum, mean, median, min, max, abs, std, var, quantile, cummax, cumsum, cumprod
All of these DataFrame operations share a common parameter, axis, which specifies the direction of the computation:
axis=0 or axis='index': compute down the rows (one result per column)
axis=1 or axis='columns': compute across the columns (one result per row)
data.sum(axis='index')
School S_1S_1S_1S_1S_1S_1S_1S_1S_1S_1S_1S_1S_1S_1S_1S...
Class C_1C_1C_1C_1C_1C_2C_2C_2C_2C_2C_3C_3C_3C_3C_3C...
ID 63105
Gender MFMFFMFMFFMFMMFMFMFMMFMMFFMFFMFMFFF
Address street_1street_2street_2street_2street_4street...
Height 6095
Weight 2613
Math 2147.3
Physics A+B+B+B-B+A-B-A+BB-B+A-BAB-CB+B-B+ABB+A+B-BB+A...
dtype: object
data.mean(axis=0)
ID 1803.000000
Height 174.142857
Weight 74.657143
Math 61.351429
dtype: float64
data.mean(axis=1).head()
0 342.750
1 349.875
2 364.550
3 358.100
4 353.200
dtype: float64
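To make the direction of axis concrete, here is a minimal sketch on a small made-up frame (toy below is ours, not part of table.csv):
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
toy.sum(axis=0)   # down the rows: a=3, b=7, c=11 (one value per column)
toy.sum(axis=1)   # across the columns: 9 and 12 (one value per row)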
unique and nunique¶
unique returns all the distinct values.
data.Physics.unique()
array(['A+', 'B+', 'B-', 'A-', 'B', 'A', 'C'], dtype=object)
nunique returns the number of distinct values.
# both approaches give the same result
data.Physics.nunique(), data.Physics.unique().size
(7, 7)
count and value_counts¶
count returns the number of non-missing elements.
data['Math'].count()
35
value_counts returns how many times each distinct value occurs.
data['Physics'].value_counts()
B+ 9
B 8
B- 6
A 4
A- 3
A+ 3
C 2
Name: Physics, dtype: int64
To present the result more clearly, we can sort it by its index with the sort_index method.
data['Physics'].value_counts().sort_index()
A 4
A+ 3
A- 3
B 8
B+ 9
B- 6
C 2
Name: Physics, dtype: int64
idxmax/idxmin and nlargest/nsmallest¶
idxmax (idxmin) returns the index label of the row holding the maximum (minimum) value.
data['Math'].idxmax(), data['Math'].idxmin(), data['Math'].agg(['idxmax','idxmin'])
(5,
10,
idxmax 5
idxmin 10
Name: Math, dtype: int64)
nlargest (nsmallest) returns the n largest (smallest) values.
data['Math'].nlargest(3), data['Math'].nsmallest(3)
(5 97.0
28 95.5
11 87.7
Name: Math, dtype: float64,
10 31.5
1 32.5
26 32.7
Name: Math, dtype: float64)
sort_index and sort_values¶
sort_index sorts by the index, ascending by default; pass ascending=False to sort in descending order.
data.set_index('ID').sort_index().head(10)
 | School | Class | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|
ID | | | | | | | | |
1101 | S_1 | C_1 | M | street_1 | 173 | 63 | 34.0 | A+ |
1102 | S_1 | C_1 | F | street_2 | 192 | 73 | 32.5 | B+ |
1103 | S_1 | C_1 | M | street_2 | 186 | 82 | 87.2 | B+ |
1104 | S_1 | C_1 | F | street_2 | 167 | 81 | 80.4 | B- |
1105 | S_1 | C_1 | F | street_4 | 159 | 64 | 84.8 | B+ |
1201 | S_1 | C_2 | M | street_5 | 188 | 68 | 97.0 | A- |
1202 | S_1 | C_2 | F | street_4 | 176 | 94 | 63.5 | B- |
1203 | S_1 | C_2 | M | street_6 | 160 | 53 | 58.8 | A+ |
1204 | S_1 | C_2 | F | street_5 | 162 | 63 | 33.8 | B |
1205 | S_1 | C_2 | F | street_6 | 167 | 63 | 68.4 | B- |
data.set_index('ID').sort_index(ascending=False).head(10)
 | School | Class | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|
ID | | | | | | | | |
2405 | S_2 | C_4 | F | street_6 | 193 | 54 | 47.6 | B |
2404 | S_2 | C_4 | F | street_2 | 160 | 84 | 67.7 | B |
2403 | S_2 | C_4 | F | street_6 | 158 | 60 | 59.7 | B+ |
2402 | S_2 | C_4 | M | street_7 | 166 | 82 | 48.7 | B |
2401 | S_2 | C_4 | F | street_2 | 192 | 62 | 45.3 | A |
2305 | S_2 | C_3 | M | street_4 | 187 | 73 | 48.9 | B |
2304 | S_2 | C_3 | F | street_6 | 164 | 81 | 95.5 | A- |
2303 | S_2 | C_3 | F | street_7 | 190 | 99 | 65.9 | C |
2302 | S_2 | C_3 | M | street_5 | 171 | 88 | 32.7 | A |
2301 | S_2 | C_3 | F | street_4 | 157 | 78 | 72.3 | B+ |
sort_values sorts by column values, ascending by default; pass ascending=False to sort in descending order.
The by parameter specifies the column to sort on; passing a list sorts on several columns, i.e. by the first column first and, where the first column is tied, by the second.
data.sort_values(by='Class', ascending=False).head(10)
 | School | Class | ID | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|---|
34 | S_2 | C_4 | 2405 | F | street_6 | 193 | 54 | 47.6 | B |
33 | S_2 | C_4 | 2404 | F | street_2 | 160 | 84 | 67.7 | B |
32 | S_2 | C_4 | 2403 | F | street_6 | 158 | 60 | 59.7 | B+ |
31 | S_2 | C_4 | 2402 | M | street_7 | 166 | 82 | 48.7 | B |
30 | S_2 | C_4 | 2401 | F | street_2 | 192 | 62 | 45.3 | A |
25 | S_2 | C_3 | 2301 | F | street_4 | 157 | 78 | 72.3 | B+ |
26 | S_2 | C_3 | 2302 | M | street_5 | 171 | 88 | 32.7 | A |
27 | S_2 | C_3 | 2303 | F | street_7 | 190 | 99 | 65.9 | C |
28 | S_2 | C_3 | 2304 | F | street_6 | 164 | 81 | 95.5 | A- |
29 | S_2 | C_3 | 2305 | M | street_4 | 187 | 73 | 48.9 | B |
data.sort_values(by=['Address','Height']).head(10)
 | School | Class | ID | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|---|
0 | S_1 | C_1 | 1101 | M | street_1 | 173 | 63 | 34.0 | A+ |
11 | S_1 | C_3 | 1302 | F | street_1 | 175 | 57 | 87.7 | A- |
23 | S_2 | C_2 | 2204 | M | street_1 | 175 | 74 | 47.2 | B- |
33 | S_2 | C_4 | 2404 | F | street_2 | 160 | 84 | 67.7 | B |
3 | S_1 | C_1 | 1104 | F | street_2 | 167 | 81 | 80.4 | B- |
2 | S_1 | C_1 | 1103 | M | street_2 | 186 | 82 | 87.2 | B+ |
1 | S_1 | C_1 | 1102 | F | street_2 | 192 | 73 | 32.5 | B+ |
30 | S_2 | C_4 | 2401 | F | street_2 | 192 | 62 | 45.3 | A |
13 | S_1 | C_3 | 1304 | M | street_2 | 195 | 70 | 85.2 | A |
22 | S_2 | C_2 | 2203 | M | street_4 | 155 | 91 | 73.8 | A+ |
6. Grouping and Aggregation¶
When processing data, we sometimes need to split it into groups, handle each group separately, and then put the results back together. This workflow is called Split-Apply-Combine:
split: divide the data into groups according to some criterion
apply: apply a function to each group independently
combine: assemble the per-group results into a data structure
Pandas provides tools for this workflow, sketched informally below.
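Before turning to groupby, here is a minimal hand-rolled sketch of the three steps (the toy DataFrame is made up purely for illustration):
toy = pd.DataFrame({'School': ['S_1', 'S_1', 'S_2', 'S_2'],
                    'Math': [80.0, 60.0, 70.0, 90.0]})
# split: one sub-DataFrame per unique value of 'School'
pieces = {key: sub for key, sub in toy.groupby('School')}
# apply: compute a statistic on each piece independently
means = {key: sub['Math'].mean() for key, sub in pieces.items()}
# combine: assemble the per-group results into a Series
pd.Series(means, name='Math')          # S_1 70.0, S_2 80.0
# groupby performs all three steps in a single call
toy.groupby('School')['Math'].mean()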
The groupby function¶
groupby can group the data by a column:
group_school = data2.groupby('School')
The returned group_school is a DataFrameGroupBy object, which behaves much like a dict keyed by the unique values of the School column.
for key, value in group_school:
print(key)
print(value.head(),'\n')
S_1
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
1102 S_1 C_1 F street_2 192 73 32.5 B+
1103 S_1 C_1 M street_2 186 82 87.2 B+
1104 S_1 C_1 F street_2 167 81 80.4 B-
1105 S_1 C_1 F street_4 159 64 84.8 B+
S_2
School Class Gender Address Height Weight Math Physics
ID
2101 S_2 C_1 M street_7 174 84 83.3 C
2102 S_2 C_1 F street_6 161 61 50.6 B+
2103 S_2 C_1 M street_4 157 61 52.5 B-
2104 S_2 C_1 F street_5 159 97 72.2 B+
2105 S_2 C_1 M street_4 170 81 34.2 A
To retrieve one particular group:
group_school.get_group('S_2').head()
 | School | Class | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|
ID | | | | | | | | |
2101 | S_2 | C_1 | M | street_7 | 174 | 84 | 83.3 | C |
2102 | S_2 | C_1 | F | street_6 | 161 | 61 | 50.6 | B+ |
2103 | S_2 | C_1 | M | street_4 | 157 | 61 | 52.5 | B- |
2104 | S_2 | C_1 | F | street_5 | 159 | 97 | 72.2 | B+ |
2105 | S_2 | C_1 | M | street_4 | 170 | 81 | 34.2 | A |
We can also query each group's size and the number of groups:
group_school.size()
School
S_1 15
S_2 20
dtype: int64
group_school.ngroups
2
We can also group by several columns at once:
group_school_class = data2.groupby(['School','Class'])
group_school_class.get_group(('S_2','C_4'))
 | School | Class | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|
ID | | | | | | | | |
2401 | S_2 | C_4 | F | street_2 | 192 | 62 | 45.3 | A |
2402 | S_2 | C_4 | M | street_7 | 166 | 82 | 48.7 | B |
2403 | S_2 | C_4 | F | street_6 | 158 | 60 | 59.7 | B+ |
2404 | S_2 | C_4 | F | street_2 | 160 | 84 | 67.7 | B |
2405 | S_2 | C_4 | F | street_6 | 193 | 54 | 47.6 | B |
group_school_class.groups
{('S_1', 'C_1'): [1101, 1102, 1103, 1104, 1105], ('S_1', 'C_2'): [1201, 1202, 1203, 1204, 1205], ('S_1', 'C_3'): [1301, 1302, 1303, 1304, 1305], ('S_2', 'C_1'): [2101, 2102, 2103, 2104, 2105], ('S_2', 'C_2'): [2201, 2202, 2203, 2204, 2205], ('S_2', 'C_3'): [2301, 2302, 2303, 2304, 2305], ('S_2', 'C_4'): [2401, 2402, 2403, 2404, 2405]}
The first method returns the first row of each group, indexed by the group keys.
group_school_class.first()
 | | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|
School | Class | | | | | | |
S_1 | C_1 | M | street_1 | 173 | 63 | 34.0 | A+ |
C_2 | M | street_5 | 188 | 68 | 97.0 | A- | |
C_3 | M | street_4 | 161 | 68 | 31.5 | B+ | |
S_2 | C_1 | M | street_7 | 174 | 84 | 83.3 | C |
C_2 | M | street_5 | 193 | 100 | 39.1 | B | |
C_3 | F | street_4 | 157 | 78 | 72.3 | B+ | |
C_4 | F | street_2 | 192 | 62 | 45.3 | A |
The [] operator selects one or several columns from a groupby object:
data2.groupby(['Gender','School'])['Math'].mean()>=60
Gender School
F S_1 True
S_2 True
M S_1 True
S_2 False
Name: Math, dtype: bool
data2.groupby(['Gender','School'])[['Math','Height']].mean()
 | | Math | Height |
---|---|---|---|
Gender | School | | |
F | S_1 | 64.100000 | 173.125000 |
S_2 | 66.427273 | 173.727273 | |
M | S_1 | 63.342857 | 178.714286 |
S_2 | 51.155556 | 172.000000 |
Pandas' cut function bins a continuous variable into the specified intervals, which can then be used for grouping:
bins = [0,40,60,80,90,100]
cuts_1 = pd.cut(data2['Math'],bins=bins)
data2.groupby(cuts_1)['Math'].count()
Math
(0, 40] 7
(40, 60] 10
(60, 80] 9
(80, 90] 7
(90, 100] 2
Name: Math, dtype: int64
We can use the labels parameter to give each bin a meaningful name:
cuts_2 = pd.cut(data2['Math'],bins=bins,labels=['差','不及格','及格','优良','优秀'])
data2.groupby(cuts_2)['Math'].count()
Math
差 7
不及格 10
及格 9
优良 7
优秀 2
Name: Math, dtype: int64
The apply step¶
In the apply step we typically meet four kinds of problems:
Aggregation: computing a statistic per group (e.g. the group mean, or the number of elements in each group)
Transformation: operating on every element within each group (e.g. standardizing the values of each group)
Filtration: selecting groups according to some rule (e.g. keeping only groups whose value of some metric is below 50)
Mixed problems: combinations of the three kinds above
Aggregation
Aggregation turns a collection of numbers into a single scalar, so mean/sum/size/count/std/var/sem/describe/first/last/nth/min/max are all aggregation functions.
group_m = group_school['Math']
group_m.mean()
School
S_1 63.746667
S_2 59.555000
Name: Math, dtype: float64
group_m.agg(['mean','max','min'])
 | mean | max | min |
---|---|---|---|
School | | | |
S_1 | 63.746667 | 97.0 | 31.5 |
S_2 | 59.555000 | 95.5 | 32.7 |
To change the column names in the result (for example, to Chinese), we can rename them with (new_name, function) tuples:
group_m.agg([('平均分','mean'),('最高分','max'),('最低分','min')])
 | 平均分 | 最高分 | 最低分 |
---|---|---|---|
School | | | |
S_1 | 63.746667 | 97.0 | 31.5 |
S_2 | 59.555000 | 95.5 | 32.7 |
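As an aside (not part of the original lecture), pandas 0.25 and later also support named aggregation, where keyword arguments supply the new column names directly; a sketch of the same renaming:
group_m.agg(平均分='mean', 最高分='max', 最低分='min')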
We can also pass a dict to specify which functions act on which columns:
group_school.agg({'Math':['mean','max'],'Height':'var','Physics':['first','last']})
 | Math | Math | Height | Physics | Physics |
---|---|---|---|---|---|
 | mean | max | var | first | last |
School | | | | | |
S_1 | 63.746667 | 97.0 | 161.638095 | A+ | B- |
S_2 | 59.555000 | 95.5 | 205.523684 | C | B |
Custom functions also work, for example computing the difference between each school's highest and lowest Math score:
group_m.agg(lambda x:x.max()-x.min())
School
S_1 65.5
S_2 62.8
Name: Math, dtype: float64
Transformation
Transformation is done with the transform method. The function it receives operates on a column within each group and must return a result of exactly the same length; if it returns a scalar, that scalar is broadcast to the whole column.
group_school[['Math','Height']].transform('mean').head()
 | Math | Height |
---|---|---|
ID | | |
1101 | 63.746667 | 175.733333 |
1102 | 63.746667 | 175.733333 |
1103 | 63.746667 | 175.733333 |
1104 | 63.746667 | 175.733333 |
1105 | 63.746667 | 175.733333 |
group_school[['Math','Height']].transform(lambda x:(x-x.mean())/x.std()).head()
 | Math | Height |
---|---|---|
ID | | |
1101 | -1.288991 | -0.214991 |
1102 | -1.353990 | 1.279460 |
1103 | 1.016287 | 0.807528 |
1104 | 0.721627 | -0.686923 |
1105 | 0.912289 | -1.316166 |
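A typical follow-up (our addition, not from the lecture): because transform keeps the original index and length, its result can be attached straight back to the source frame; Math_std below is a hypothetical column name.
# standardize Math within each school and attach it as a new column
math_std = group_school['Math'].transform(lambda x: (x - x.mean()) / x.std())
data2.assign(Math_std=math_std)[['School', 'Math', 'Math_std']].head()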
Filtration
Filtration is done with the filter method. It returns the selected groups in full, so the function passed to it must return a boolean scalar for each group.
group_school[['School','Math','Physics']].filter(lambda x:(x['Math']>32).all())
 | School | Math | Physics |
---|---|---|---|
ID | | | |
2101 | S_2 | 83.3 | C |
2102 | S_2 | 50.6 | B+ |
2103 | S_2 | 52.5 | B- |
2104 | S_2 | 72.2 | B+ |
2105 | S_2 | 34.2 | A |
2201 | S_2 | 39.1 | B |
2202 | S_2 | 68.5 | B+ |
2203 | S_2 | 73.8 | A+ |
2204 | S_2 | 47.2 | B- |
2205 | S_2 | 85.4 | B |
2301 | S_2 | 72.3 | B+ |
2302 | S_2 | 32.7 | A |
2303 | S_2 | 65.9 | C |
2304 | S_2 | 95.5 | A- |
2305 | S_2 | 48.9 | B |
2401 | S_2 | 45.3 | A |
2402 | S_2 | 48.7 | B |
2403 | S_2 | 59.7 | B+ |
2404 | S_2 | 67.7 | B |
2405 | S_2 | 47.6 | B |
The apply function
apply is the most widely used of the groupby methods because it is extremely flexible.
First, the input apply receives is each group itself:
group_school.apply(lambda x:print(x.head(1),'\n'))
School Class Gender Address Height Weight Math Physics
ID
1101 S_1 C_1 M street_1 173 63 34.0 A+
School Class Gender Address Height Weight Math Physics
ID
2101 S_2 C_1 M street_7 174 84 83.3 C
The return value of apply is equally flexible: it can be a scalar, a list, or a DataFrame.
group_school[['Math','Height']].apply(lambda x:x.max())
 | Math | Height |
---|---|---|
School | | |
S_1 | 97.0 | 195.0 |
S_2 | 95.5 | 194.0 |
group_school[['Math','Height']].apply(lambda x:x-x.min()).head()
 | Math | Height |
---|---|---|
ID | | |
1101 | 2.5 | 14.0 |
1102 | 1.0 | 33.0 |
1103 | 55.7 | 27.0 |
1104 | 48.9 | 8.0 |
1105 | 53.3 | 0.0 |
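As a sketch of the DataFrame-returning case (the example is ours, not from the lecture): when the function passed to apply returns a DataFrame for each group, the pieces are stacked under the group keys:
# the two highest Math scores within each school; the result is indexed by (School, ID)
group_school.apply(lambda x: x.nlargest(2, 'Math'))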
7. Handling Missing Data¶
Missing data comes up constantly in practice, and dealing with it can be a headache. Pandas provides several methods for tackling this thorny problem.
Before version 1.0, Pandas mainly relied on NumPy's nan (not a number) to represent missing values.
data_missing = pd.read_csv('table_missing.csv')
data_missing.head()
 | School | Class | ID | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|---|
0 | S_1 | C_1 | NaN | M | street_1 | 173 | NaN | 34.0 | A+ |
1 | S_1 | C_1 | NaN | F | street_2 | 192 | NaN | 32.5 | B+ |
2 | S_1 | C_1 | 1103.0 | M | street_2 | 186 | NaN | 87.2 | B+ |
3 | S_1 | NaN | NaN | F | street_2 | 167 | 81.0 | 80.4 | NaN |
4 | S_1 | C_1 | 1105.0 | NaN | street_4 | 159 | 64.0 | 84.8 | A- |
The isna and notna methods¶
They report whether each value is NaN: for a Series they return a boolean Series, and for a DataFrame a boolean DataFrame.
data_missing['Physics'].isna().head()
0 False
1 False
2 False
3 True
4 False
Name: Physics, dtype: bool
data_missing['Physics'].notna().head()
0 True
1 True
2 True
3 False
4 True
Name: Physics, dtype: bool
data_missing.isna().head()
 | School | Class | ID | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|---|
0 | False | False | True | False | False | False | True | False | False |
1 | False | False | True | False | False | False | True | False | False |
2 | False | False | False | False | False | False | True | False | False |
3 | False | True | True | False | False | False | False | False | True |
4 | False | False | False | True | False | False | False | False | False |
To count the missing values in each column, we can use the following; the counts are consistent with the non-null counts reported by info.
data_missing.isna().astype('int').sum()
School 0
Class 4
ID 6
Gender 7
Address 0
Height 0
Weight 13
Math 5
Physics 4
dtype: int64
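A small simplification (ours, not from the lecture): sum already treats True as 1, so the astype('int') cast can be dropped and the counts are identical:
data_missing.isna().sum()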
data_missing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 School 35 non-null object
1 Class 31 non-null object
2 ID 29 non-null float64
3 Gender 28 non-null object
4 Address 35 non-null object
5 Height 35 non-null int64
6 Weight 22 non-null float64
7 Math 30 non-null float64
8 Physics 31 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 2.6+ KB
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 School 35 non-null object
1 Class 35 non-null object
2 ID 35 non-null int64
3 Gender 35 non-null object
4 Address 35 non-null object
5 Height 35 non-null int64
6 Weight 35 non-null int64
7 Math 35 non-null float64
8 Physics 35 non-null object
dtypes: float64(1), int64(3), object(5)
memory usage: 2.6+ KB
Comparing the info output of data_missing and data, we can see that the missing values caused two columns (ID and Weight) to change dtype from int64 to float64.
To keep dtypes unchanged, newer versions of Pandas introduce several nullable dtypes: Int64 (replacing int64), string (replacing object), and boolean (replacing bool). Their advantage is that missing values are represented by a single uniform <NA> marker without changing the column's dtype.
Pandas also provides the convert_dtypes method to convert the old dtypes to the new ones:
data_newtypes = data_missing.convert_dtypes()
data_newtypes
 | School | Class | ID | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|---|
0 | S_1 | C_1 | <NA> | M | street_1 | 173 | <NA> | 34.0 | A+ |
1 | S_1 | C_1 | <NA> | F | street_2 | 192 | <NA> | 32.5 | B+ |
2 | S_1 | C_1 | 1103 | M | street_2 | 186 | <NA> | 87.2 | B+ |
3 | S_1 | <NA> | <NA> | F | street_2 | 167 | 81 | 80.4 | <NA> |
4 | S_1 | C_1 | 1105 | <NA> | street_4 | 159 | 64 | 84.8 | A- |
5 | S_1 | C_2 | 1201 | M | street_5 | 159 | 68 | 97.0 | A- |
6 | S_1 | C_2 | 1202 | F | street_4 | 176 | 94 | 63.5 | B- |
7 | S_1 | C_2 | <NA> | M | street_6 | 160 | 53 | 58.8 | A+ |
8 | S_1 | C_2 | 1204 | F | street_5 | 162 | 63 | 33.8 | <NA> |
9 | S_1 | C_2 | 1205 | <NA> | street_6 | 167 | <NA> | 68.4 | B- |
10 | S_1 | C_3 | 1301 | M | street_4 | 161 | 68 | NaN | B+ |
11 | S_1 | <NA> | 1302 | F | street_1 | 175 | <NA> | 87.7 | A- |
12 | S_1 | C_3 | 1303 | M | street_7 | 188 | 82 | 49.7 | B |
13 | S_1 | C_3 | 1304 | <NA> | street_2 | 195 | 70 | 85.2 | <NA> |
14 | S_1 | C_3 | <NA> | F | street_5 | 187 | 69 | 61.7 | B- |
15 | S_2 | C_1 | 2101 | M | street_7 | 159 | 84 | NaN | C |
16 | S_2 | C_1 | 2102 | F | street_6 | 161 | <NA> | 50.6 | B+ |
17 | S_2 | C_1 | 2103 | M | street_4 | 157 | 61 | 52.5 | B- |
18 | S_2 | <NA> | 2104 | F | street_5 | 159 | 97 | 72.2 | B+ |
19 | S_2 | C_1 | 2105 | M | street_4 | 170 | <NA> | 34.2 | A |
20 | S_2 | C_2 | 2201 | M | street_5 | 193 | <NA> | NaN | B |
21 | S_2 | C_2 | 2202 | F | street_7 | 194 | 77 | 68.5 | B+ |
22 | S_2 | C_2 | 2203 | M | street_4 | 155 | 91 | 73.8 | <NA> |
23 | S_2 | C_2 | <NA> | M | street_1 | 175 | 74 | 47.2 | B- |
24 | S_2 | C_2 | 2205 | <NA> | street_7 | 159 | 76 | NaN | B |
25 | S_2 | C_3 | 2301 | F | street_4 | 157 | 78 | 72.3 | B+ |
26 | S_2 | <NA> | 2302 | M | street_5 | 171 | 88 | NaN | A |
27 | S_2 | C_3 | 2303 | F | street_7 | 190 | 99 | 65.9 | C |
28 | S_2 | C_3 | 2304 | F | street_6 | 164 | 81 | 95.5 | A- |
29 | S_2 | C_3 | 2305 | M | street_4 | 187 | 73 | 48.9 | B |
30 | S_2 | C_4 | 2401 | F | street_2 | 159 | <NA> | 45.3 | A |
31 | S_2 | C_4 | 2402 | M | street_7 | 166 | <NA> | 48.7 | B |
32 | S_2 | C_4 | 2403 | <NA> | street_6 | 158 | <NA> | 59.7 | B+ |
33 | S_2 | C_4 | 2404 | <NA> | street_2 | 160 | <NA> | 67.7 | B |
34 | S_2 | C_4 | 2405 | <NA> | street_6 | 193 | <NA> | 47.6 | B |
data_newtypes.dtypes
School string
Class string
ID Int64
Gender string
Address string
Height Int64
Weight Int64
Math float64
Physics string
dtype: object
data_newtypes[['7' in x for x in data_newtypes.Address.tolist()]] # select the rows whose Address contains '7'
 | School | Class | ID | Gender | Address | Height | Weight | Math | Physics |
---|---|---|---|---|---|---|---|---|---|
12 | S_1 | C_3 | 1303 | M | street_7 | 188 | 82 | 49.7 | B |
15 | S_2 | C_1 | 2101 | M | street_7 | 159 | 84 | NaN | C |
21 | S_2 | C_2 | 2202 | F | street_7 | 194 | 77 | 68.5 | B+ |
24 | S_2 | C_2 | 2205 | <NA> | street_7 | 159 | 76 | NaN | B |
27 | S_2 | C_3 | 2303 | F | street_7 | 190 | 99 | 65.9 | C |
31 | S_2 | C_4 | 2402 | M | street_7 | 166 | <NA> | 48.7 | B |
Handling missing data¶
Filling: the fillna method
We can supply the value to fill missing entries with, or use the method parameter to fill from the preceding or the following valid value.
data_missing['Physics'].fillna('缺失值').head()
0 A+
1 B+
2 B+
3 缺失值
4 A-
Name: Physics, dtype: object
data_missing['Math'].fillna(data_missing['Math'].mean())[10:16] # fill with the column mean
10 62.51
11 87.70
12 49.70
13 85.20
14 61.70
15 62.51
Name: Math, dtype: float64
data_missing['Physics'].fillna(method='ffill').head() # or method='backfill'
0 A+
1 B+
2 B+
3 B+
4 A-
Name: Physics, dtype: object
Dropping: the dropna method
axis parameter: the direction to drop along, 0 for rows and 1 for columns
how parameter: 'all' (drop only if every value is NaN) or 'any' (drop if any value is NaN)
subset parameter: which columns (or rows) to check for missing values when dropping
df_drop = pd.DataFrame({'A':[np.nan,np.nan,np.nan],'B':[np.nan,1,5],'C':[4,5,6]})
df_drop
 | A | B | C |
---|---|---|---|
0 | NaN | NaN | 4 |
1 | NaN | 1.0 | 5 |
2 | NaN | 5.0 | 6 |
df_drop.dropna(axis=0) # default: how='any'
 | A | B | C |
---|---|---|---|
df_drop.dropna(axis=1) # default: how='any'
 | C |
---|---|
0 | 4 |
1 | 5 |
2 | 6 |
df_drop.dropna(axis=0, how = 'all')
 | A | B | C |
---|---|---|---|
0 | NaN | NaN | 4 |
1 | NaN | 1.0 | 5 |
2 | NaN | 5.0 | 6 |
df_drop.dropna(axis=1, how = 'all')
 | B | C |
---|---|---|
0 | NaN | 4 |
1 | 1.0 | 5 |
2 | 5.0 | 6 |
df_drop.dropna(axis=0,subset=['B','C'])
 | A | B | C |
---|---|---|---|
1 | NaN | 1.0 | 5 |
2 | NaN | 5.0 | 6 |
References¶
1. https://pandas.pydata.org/pandas-docs/stable/user_guide