Pandas - 2

`Pandas` - 2¶

主讲人：李显祥

大气科学学院

import pandas as pd
import numpy as np
print('Pandas version:', pd.__version__)

Pandas version: 1.1.3

data = pd.read_csv('table.csv')
data2 = data.set_index('ID')

5. 常用操作¶

`describe` 和 `info`¶

info 函数返回有哪些列、有多少非缺失值、每列的类型

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   School   35 non-null     object 
 1   Class    35 non-null     object 
 2   ID       35 non-null     int64  
 3   Gender   35 non-null     object 
 4   Address  35 non-null     object 
 5   Height   35 non-null     int64  
 6   Weight   35 non-null     int64  
 7   Math     35 non-null     float64
 8   Physics  35 non-null     object 
dtypes: float64(1), int64(3), object(5)
memory usage: 2.6+ KB

describe 默认统计数值型数据的各个统计量

data.describe()

	ID	Height	Weight	Math
count	35.00000	35.000000	35.000000	35.000000
mean	1803.00000	174.142857	74.657143	61.351429
std	536.87741	13.541098	12.895377	19.915164
min	1101.00000	155.000000	53.000000	31.500000
25%	1204.50000	161.000000	63.000000	47.400000
50%	2103.00000	173.000000	74.000000	61.700000
75%	2301.50000	187.500000	82.000000	77.100000
max	2405.00000	195.000000	100.000000	97.000000

也可以自行选择分位数 percentile

data.describe(percentiles=[.10, .25, .75, .90])

	ID	Height	Weight	Math
count	35.00000	35.000000	35.000000	35.000000
mean	1803.00000	174.142857	74.657143	61.351429
std	536.87741	13.541098	12.895377	19.915164
min	1101.00000	155.000000	53.000000	31.500000
10%	1104.40000	158.400000	60.400000	33.880000
25%	1204.50000	161.000000	63.000000	47.400000
50%	2103.00000	173.000000	74.000000	61.700000
75%	2301.50000	187.500000	82.000000	77.100000
90%	2401.60000	192.600000	92.800000	86.480000
max	2405.00000	195.000000	100.000000	97.000000

对于非数值型也可以用 describe 函数

data['Physics'].describe()

count     35
unique     7
top       B+
freq       9
Name: Physics, dtype: object

`head` 和 `tail`¶

data.head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+

data.head(6)

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+
5	S_1	C_2	1201	M	street_5	188	68	97.0	A-

data.tail()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
30	S_2	C_4	2401	F	street_2	192	62	45.3	A
31	S_2	C_4	2402	M	street_7	166	82	48.7	B
32	S_2	C_4	2403	F	street_6	158	60	59.7	B+
33	S_2	C_4	2404	F	street_2	160	84	67.7	B
34	S_2	C_4	2405	F	street_6	193	54	47.6	B

算术/统计运算¶

sum,mean,median,min,max,abs,std,var,quantile,cummax,cumsum,cumprod

DataFrame 进行这些运算都有一个共同参数 axis，指定沿着哪个方向进行运算：

axis=0 或 axis=index: 沿着行方向（跨行）进行运算
axis=1 或 axis=columns: 沿着列方向（跨列）进行运算

data.sum(axis='index')

School     S_1S_1S_1S_1S_1S_1S_1S_1S_1S_1S_1S_1S_1S_1S_1S...
Class      C_1C_1C_1C_1C_1C_2C_2C_2C_2C_2C_3C_3C_3C_3C_3C...
ID                                                     63105
Gender                   MFMFFMFMFFMFMMFMFMFMMFMMFFMFFMFMFFF
Address    street_1street_2street_2street_2street_4street...
Height                                                  6095
Weight                                                  2613
Math                                                  2147.3
Physics    A+B+B+B-B+A-B-A+BB-B+A-BAB-CB+B-B+ABB+A+B-BB+A...
dtype: object

data.mean(axis=0)

ID        1803.000000
Height     174.142857
Weight      74.657143
Math        61.351429
dtype: float64

data.mean(axis=1).head()

0    342.750
1    349.875
2    364.550
3    358.100
4    353.200
dtype: float64

`unique` 和 `nunique`¶

unique 显示所有的唯一值

data.Physics.unique()

array(['A+', 'B+', 'B-', 'A-', 'B', 'A', 'C'], dtype=object)

nunique 显示有多少个唯一值

# 两种方法给出相同的结果
data.Physics.nunique(), data.Physics.unique().size 

(7, 7)

`count` 和 `value_counts`¶

count 返回非缺失值元素个数

data['Math'].count()

value_counts 返回每个唯一值的个数

data['Physics'].value_counts()

B+    9
B     8
B-    6
A     4
A-    3
A+    3
C     2
Name: Physics, dtype: int64

为了更好地显示结果，我们可以对上面的结果进行排序，利用 sort_index 方法

data['Physics'].value_counts().sort_index()

A     4
A+    3
A-    3
B     8
B+    9
B-    6
C     2
Name: Physics, dtype: int64

`idxmax/idxmin` 和 `nlargest/nsmallest`¶

idxmax (idxmin) 函数返回最大 (小) 值所在行的索引

data['Math'].idxmax(), data['Math'].idxmin(), data['Math'].agg(['idxmax','idxmin'])

(5,
 10,
 idxmax     5
 idxmin    10
 Name: Math, dtype: int64)

nlargest (nsmallest) 函数返回前几个大 (小) 的元素值

data['Math'].nlargest(3), data['Math'].nsmallest(3)

(5     97.0
 28    95.5
 11    87.7
 Name: Math, dtype: float64,
 10    31.5
 1     32.5
 26    32.7
 Name: Math, dtype: float64)

`sort_index` 和 `sort_values`¶

sort_index 按照索引进行排序，默认从低到高排序，可以用 ascending = False 参数从高到低排序

data.set_index('ID').sort_index().head(10)

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+
1201	S_1	C_2	M	street_5	188	68	97.0	A-
1202	S_1	C_2	F	street_4	176	94	63.5	B-
1203	S_1	C_2	M	street_6	160	53	58.8	A+
1204	S_1	C_2	F	street_5	162	63	33.8	B
1205	S_1	C_2	F	street_6	167	63	68.4	B-

data.set_index('ID').sort_index(ascending=False).head(10)

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2405	S_2	C_4	F	street_6	193	54	47.6	B
2404	S_2	C_4	F	street_2	160	84	67.7	B
2403	S_2	C_4	F	street_6	158	60	59.7	B+
2402	S_2	C_4	M	street_7	166	82	48.7	B
2401	S_2	C_4	F	street_2	192	62	45.3	A
2305	S_2	C_3	M	street_4	187	73	48.9	B
2304	S_2	C_3	F	street_6	164	81	95.5	A-
2303	S_2	C_3	F	street_7	190	99	65.9	C
2302	S_2	C_3	M	street_5	171	88	32.7	A
2301	S_2	C_3	F	street_4	157	78	72.3	B+

sort_values 按照列的数值进行排序，默认从低到高排序，可以用 ascending = False 参数从高到低排序

参数 by 用来指定用来排序的列，可以用 list 来指定多个列，即先对第一层排，在第一层相同的情况下对第二层排序

data.sort_values(by='Class', ascending=False).head(10)

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
34	S_2	C_4	2405	F	street_6	193	54	47.6	B
33	S_2	C_4	2404	F	street_2	160	84	67.7	B
32	S_2	C_4	2403	F	street_6	158	60	59.7	B+
31	S_2	C_4	2402	M	street_7	166	82	48.7	B
30	S_2	C_4	2401	F	street_2	192	62	45.3	A
25	S_2	C_3	2301	F	street_4	157	78	72.3	B+
26	S_2	C_3	2302	M	street_5	171	88	32.7	A
27	S_2	C_3	2303	F	street_7	190	99	65.9	C
28	S_2	C_3	2304	F	street_6	164	81	95.5	A-
29	S_2	C_3	2305	M	street_4	187	73	48.9	B

data.sort_values(by=['Address','Height']).head(10)

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
11	S_1	C_3	1302	F	street_1	175	57	87.7	A-
23	S_2	C_2	2204	M	street_1	175	74	47.2	B-
33	S_2	C_4	2404	F	street_2	160	84	67.7	B
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
30	S_2	C_4	2401	F	street_2	192	62	45.3	A
13	S_1	C_3	1304	M	street_2	195	70	85.2	A
22	S_2	C_2	2203	M	street_4	155	91	73.8	A+

6. 分组聚合¶

处理数据时，我们有时需要对数据进行分组处理，然后再把结果结合在一起，形成处理后的结果。这个过程称为 “分组-应用-结合”（Split-Apply-Combine）。

split：基于某一些规则，将数据拆成若干组
apply：对每一组独立地使用函数
combine：将每一组的结果组合成某一类数据结构

Pandas 提供了一些工具来进行这项工作。

`groupby` 函数¶

分组函数可以根据某一列进行分组：

group_school = data2.groupby('School')

返回的 group_school 是一个 DataFrameGroupBy 对象，相当于一个以 School 列的唯一值为 key 的 dict。

for key, value in group_school:
    print(key)
    print(value.head(),'\n')

S_1
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+
1102    S_1   C_1      F  street_2     192      73  32.5      B+
1103    S_1   C_1      M  street_2     186      82  87.2      B+
1104    S_1   C_1      F  street_2     167      81  80.4      B-
1105    S_1   C_1      F  street_4     159      64  84.8      B+ 

S_2
     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2101    S_2   C_1      M  street_7     174      84  83.3       C
2102    S_2   C_1      F  street_6     161      61  50.6      B+
2103    S_2   C_1      M  street_4     157      61  52.5      B-
2104    S_2   C_1      F  street_5     159      97  72.2      B+
2105    S_2   C_1      M  street_4     170      81  34.2       A

取出某一个组：

group_school.get_group('S_2').head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2101	S_2	C_1	M	street_7	174	84	83.3	C
2102	S_2	C_1	F	street_6	161	61	50.6	B+
2103	S_2	C_1	M	street_4	157	61	52.5	B-
2104	S_2	C_1	F	street_5	159	97	72.2	B+
2105	S_2	C_1	M	street_4	170	81	34.2	A

也可以查询组容量与组数

group_school.size() 

School
S_1    15
S_2    20
dtype: int64

group_school.ngroups

我们也可以根据某几列进行分组

group_school_class = data2.groupby(['School','Class'])

group_school_class.get_group(('S_2','C_4'))

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
2401	S_2	C_4	F	street_2	192	62	45.3	A
2402	S_2	C_4	M	street_7	166	82	48.7	B
2403	S_2	C_4	F	street_6	158	60	59.7	B+
2404	S_2	C_4	F	street_2	160	84	67.7	B
2405	S_2	C_4	F	street_6	193	54	47.6	B

group_school_class.groups

{('S_1', 'C_1'): [1101, 1102, 1103, 1104, 1105], ('S_1', 'C_2'): [1201, 1202, 1203, 1204, 1205], ('S_1', 'C_3'): [1301, 1302, 1303, 1304, 1305], ('S_2', 'C_1'): [2101, 2102, 2103, 2104, 2105], ('S_2', 'C_2'): [2201, 2202, 2203, 2204, 2205], ('S_2', 'C_3'): [2301, 2302, 2303, 2304, 2305], ('S_2', 'C_4'): [2401, 2402, 2403, 2404, 2405]}

first 函数显示的是以分组为索引的每组的第一个分组信息

group_school_class.first()

		Gender	Address	Height	Weight	Math	Physics
School	Class
S_1	C_1	M	street_1	173	63	34.0	A+
	C_2	M	street_5	188	68	97.0	A-
	C_3	M	street_4	161	68	31.5	B+
S_2	C_1	M	street_7	174	84	83.3	C
	C_2	M	street_5	193	100	39.1	B
	C_3	F	street_4	157	78	72.3	B+
	C_4	F	street_2	192	62	45.3	A

[] 操作符可以用来选出 groupby 对象的某个或者某几个列

data2.groupby(['Gender','School'])['Math'].mean()>=60

Gender  School
F       S_1        True
        S_2        True
M       S_1        True
        S_2       False
Name: Math, dtype: bool

data2.groupby(['Gender','School'])[['Math','Height']].mean()

		Math	Height
Gender	School
F	S_1	64.100000	173.125000
F	S_2	66.427273	173.727273
M	S_1	63.342857	178.714286
M	S_2	51.155556	172.000000

Pandas 的 cut 函数可以将连续量根据指定的分区进行分组：

bins = [0,40,60,80,90,100]
cuts_1 = pd.cut(data2['Math'],bins=bins) 
data2.groupby(cuts_1)['Math'].count()

Math
(0, 40]       7
(40, 60]     10
(60, 80]      9
(80, 90]      7
(90, 100]     2
Name: Math, dtype: int64

我们可以用 labels 参数给每一组指定一个有意义的名字：

cuts_2 = pd.cut(data2['Math'],bins=bins,labels=['差','不及格','及格','优良','优秀']) 
data2.groupby(cuts_2)['Math'].count()

Math
差       7
不及格    10
及格      9
优良      7
优秀      2
Name: Math, dtype: int64

apply 过程¶

在 apply 过程中，我们实际往往会遇到四类问题：

聚合（Aggregation）—— 即分组计算统计量（如求均值、求每组元素个数）
变换（Transformation）—— 即分组对每个单元的数据进行操作（如将每组元素进行标准化）
过滤（Filtration）—— 即按照某些规则筛选出一些组（如选出组内某一指标小于50的组）
综合问题 —— 即前面提及的三种问题的混合

聚合

所谓聚合就是把一堆数，变成一个标量，因此 mean/sum/size/count/std/var/sem/describe/first/ last/nth/min/max 都是聚合函数

group_m = group_school['Math']
group_m.mean()

School
S_1    63.746667
S_2    59.555000
Name: Math, dtype: float64

group_m.agg(['mean','max','min'])

	mean	max	min
School
S_1	63.746667	97.0	31.5
S_2	59.555000	95.5	32.7

如果想改变结果里面的列的名字（比如改成中文），我们可以利用元组进行重命名

group_m.agg([('平均分','mean'),('最高分','max'),('最低分','min')])

	平均分	最高分	最低分
School
S_1	63.746667	97.0	31.5
S_2	59.555000	95.5	32.7

我们也可以通过 dict 来指定哪些函数作用于哪些列

group_school.agg({'Math':['mean','max'],'Height':'var','Physics':['first','last']})

	Math		Height	Physics
	mean	max	var	first	last
School
S_1	63.746667	97.0	161.638095	A+	B-
S_2	59.555000	95.5	205.523684	C	B

当然也可以使用自定义函数，例如求取每个学校数学考试成绩的最大值和最小值之差

group_m.agg(lambda x:x.max()-x.min())

School
S_1    65.5
S_2    62.8
Name: Math, dtype: float64

变换

变换功能由 transform 函数实现，其传入的对象是组内的列，并且返回值需要与列长完全一致；如果返回值是标量，该标量会被广播到整个列

group_school[['Math','Height']].transform('mean').head()

	Math	Height
ID
1101	63.746667	175.733333
1102	63.746667	175.733333
1103	63.746667	175.733333
1104	63.746667	175.733333
1105	63.746667	175.733333

group_school[['Math','Height']].transform(lambda x:(x-x.mean())/x.std()).head()

	Math	Height
ID
1101	-1.288991	-0.214991
1102	-1.353990	1.279460
1103	1.016287	0.807528
1104	0.721627	-0.686923
1105	0.912289	-1.316166

过滤

过滤功能由 filter 函数来实现，其返回结果是组的全体，因此传入的参数值应当是 布尔标量

group_school[['School','Math','Physics']].filter(lambda x:(x['Math']>32).all())

	School	Math	Physics
ID
2101	S_2	83.3	C
2102	S_2	50.6	B+
2103	S_2	52.5	B-
2104	S_2	72.2	B+
2105	S_2	34.2	A
2201	S_2	39.1	B
2202	S_2	68.5	B+
2203	S_2	73.8	A+
2204	S_2	47.2	B-
2205	S_2	85.4	B
2301	S_2	72.3	B+
2302	S_2	32.7	A
2303	S_2	65.9	C
2304	S_2	95.5	A-
2305	S_2	48.9	B
2401	S_2	45.3	A
2402	S_2	48.7	B
2403	S_2	59.7	B+
2404	S_2	67.7	B
2405	S_2	47.6	B

apply 函数

apply 函数是非常灵活的工具，它是分组函数中应用最为广泛的，因为它非常灵活。

首先，apply 的输入值是每一个分组本身：

group_school.apply(lambda x:print(x.head(1),'\n'))

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
1101    S_1   C_1      M  street_1     173      63  34.0      A+ 

     School Class Gender   Address  Height  Weight  Math Physics
ID                                                              
2101    S_2   C_1      M  street_7     174      84  83.3       C

apply 的返回值很灵活，可以是标量、列表或者 DataFrame

group_school[['Math','Height']].apply(lambda x:x.max())

	Math	Height
School
S_1	97.0	195.0
S_2	95.5	194.0

group_school[['Math','Height']].apply(lambda x:x-x.min()).head()

	Math	Height
ID
1101	2.5	14.0
1102	1.0	33.0
1103	55.7	27.0
1104	48.9	8.0
1105	53.3	0.0

6. 缺失数据处理¶

在处理数据过程中，我们经常会碰到缺失数据，它们的处理也成为令人头疼的问题。Pandas 提供了一些方法来解决这个棘手的问题。

在 1.0 版本之前，Pandas 主要借助 Numpy 的 nan (not a number) 来表示缺失值。

data_missing = pd.read_csv('table_missing.csv')
data_missing.head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	NaN	M	street_1	173	NaN	34.0	A+
1	S_1	C_1	NaN	F	street_2	192	NaN	32.5	B+
2	S_1	C_1	1103.0	M	street_2	186	NaN	87.2	B+
3	S_1	NaN	NaN	F	street_2	167	81.0	80.4	NaN
4	S_1	C_1	1105.0	NaN	street_4	159	64.0	84.8	A-

`isna` 和 `notna` 方法¶

返回是否 NaN 的信息，对 Series 返回布尔列表，对 DataFrame 返回布尔型的 DataFrame。

data_missing['Physics'].isna().head()

0    False
1    False
2    False
3     True
4    False
Name: Physics, dtype: bool

data_missing['Physics'].notna().head()

0     True
1     True
2     True
3    False
4     True
Name: Physics, dtype: bool

data_missing.isna().head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	False	False	True	False	False	False	True	False	False
1	False	False	True	False	False	False	True	False	False
2	False	False	False	False	False	False	True	False	False
3	False	True	True	False	False	False	False	False	True
4	False	False	False	True	False	False	False	False	False

为了得到每列有多少缺失值，可以使用如下方法，得到的结果和 info 方法结果一致。

data_missing.isna().astype('int').sum()

School      0
Class       4
ID          6
Gender      7
Address     0
Height      0
Weight     13
Math        5
Physics     4
dtype: int64

data_missing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   School   35 non-null     object 
 1   Class    31 non-null     object 
 2   ID       29 non-null     float64
 3   Gender   28 non-null     object 
 4   Address  35 non-null     object 
 5   Height   35 non-null     int64  
 6   Weight   22 non-null     float64
 7   Math     30 non-null     float64
 8   Physics  31 non-null     object 
dtypes: float64(3), int64(1), object(5)
memory usage: 2.6+ KB

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   School   35 non-null     object 
 1   Class    35 non-null     object 
 2   ID       35 non-null     int64  
 3   Gender   35 non-null     object 
 4   Address  35 non-null     object 
 5   Height   35 non-null     int64  
 6   Weight   35 non-null     int64  
 7   Math     35 non-null     float64
 8   Physics  35 non-null     object 
dtypes: float64(1), int64(3), object(5)
memory usage: 2.6+ KB

比较 data_missing 和 data 的 info 输出，可见因为缺失值的引入，有 3 列的数据类型发生了改变

为了保持数据类型不变，新版本的 Pandas 引入了几个 Nullable 的数据类型：Int64 （代替 int64），string （代替 object）和 boolean （代替 bool）。其好处是缺失值都会被替换为统一的 <NA> 符号，且不改变数据类型.

同时，提供了函数 convert_dtypes 来将旧的数据类型转化为新的数据类型：

data_newtypes = data_missing.convert_dtypes()
data_newtypes

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	<NA>	M	street_1	173	<NA>	34.0	A+
1	S_1	C_1	<NA>	F	street_2	192	<NA>	32.5	B+
2	S_1	C_1	1103	M	street_2	186	<NA>	87.2	B+
3	S_1	<NA>	<NA>	F	street_2	167	81	80.4	<NA>
4	S_1	C_1	1105	<NA>	street_4	159	64	84.8	A-
5	S_1	C_2	1201	M	street_5	159	68	97.0	A-
6	S_1	C_2	1202	F	street_4	176	94	63.5	B-
7	S_1	C_2	<NA>	M	street_6	160	53	58.8	A+
8	S_1	C_2	1204	F	street_5	162	63	33.8	<NA>
9	S_1	C_2	1205	<NA>	street_6	167	<NA>	68.4	B-
10	S_1	C_3	1301	M	street_4	161	68	NaN	B+
11	S_1	<NA>	1302	F	street_1	175	<NA>	87.7	A-
12	S_1	C_3	1303	M	street_7	188	82	49.7	B
13	S_1	C_3	1304	<NA>	street_2	195	70	85.2	<NA>
14	S_1	C_3	<NA>	F	street_5	187	69	61.7	B-
15	S_2	C_1	2101	M	street_7	159	84	NaN	C
16	S_2	C_1	2102	F	street_6	161	<NA>	50.6	B+
17	S_2	C_1	2103	M	street_4	157	61	52.5	B-
18	S_2	<NA>	2104	F	street_5	159	97	72.2	B+
19	S_2	C_1	2105	M	street_4	170	<NA>	34.2	A
20	S_2	C_2	2201	M	street_5	193	<NA>	NaN	B
21	S_2	C_2	2202	F	street_7	194	77	68.5	B+
22	S_2	C_2	2203	M	street_4	155	91	73.8	<NA>
23	S_2	C_2	<NA>	M	street_1	175	74	47.2	B-
24	S_2	C_2	2205	<NA>	street_7	159	76	NaN	B
25	S_2	C_3	2301	F	street_4	157	78	72.3	B+
26	S_2	<NA>	2302	M	street_5	171	88	NaN	A
27	S_2	C_3	2303	F	street_7	190	99	65.9	C
28	S_2	C_3	2304	F	street_6	164	81	95.5	A-
29	S_2	C_3	2305	M	street_4	187	73	48.9	B
30	S_2	C_4	2401	F	street_2	159	<NA>	45.3	A
31	S_2	C_4	2402	M	street_7	166	<NA>	48.7	B
32	S_2	C_4	2403	<NA>	street_6	158	<NA>	59.7	B+
33	S_2	C_4	2404	<NA>	street_2	160	<NA>	67.7	B
34	S_2	C_4	2405	<NA>	street_6	193	<NA>	47.6	B

data_newtypes.dtypes

School      string
Class       string
ID           Int64
Gender      string
Address     string
Height       Int64
Weight       Int64
Math       float64
Physics     string
dtype: object

data_newtypes[['7' in x for x in data_newtypes.Address.tolist()]]

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
12	S_1	C_3	1303	M	street_7	188	82	49.7	B
15	S_2	C_1	2101	M	street_7	159	84	NaN	C
21	S_2	C_2	2202	F	street_7	194	77	68.5	B+
24	S_2	C_2	2205	<NA>	street_7	159	76	NaN	B
27	S_2	C_3	2303	F	street_7	190	99	65.9	C
31	S_2	C_4	2402	M	street_7	166	<NA>	48.7	B

缺失数据的处理¶

填充: fillna 方法

可以指定用来填充缺失值的内容，或者通过 method 参数指定用前面或者后面的值来填充

data_missing['Physics'].fillna('缺失值').head()

0     A+
1     B+
2     B+
3    缺失值
4     A-
Name: Physics, dtype: object

data_missing['Math'].fillna(data_missing['Math'].mean())[10:16]  # 或者 method = 'backfill'

10    62.51
11    87.70
12    49.70
13    85.20
14    61.70
15    62.51
Name: Math, dtype: float64

data_missing['Physics'].fillna(method='ffill').head()  # 或者 method = 'backfill'

0    A+
1    B+
2    B+
3    B+
4    A-
Name: Physics, dtype: object

删除: dropna 方法
- axis 参数：指定删除的方向，0 为行，1为列
- how 参数：可以是 all（全部为 NaN 才删除）或者 any（任一个为 NaN就删除）
- subset 参数：指定执行删除操作的列

df_drop = pd.DataFrame({'A':[np.nan,np.nan,np.nan],'B':[np.nan,1,5],'C':[4,5,6]})
df_drop

	A	B	C
0	NaN	NaN	4
1	NaN	1.0	5
2	NaN	5.0	6

df_drop.dropna(axis=0) # 默认 how = 'any'

	A	B	C

df_drop.dropna(axis=1) # 默认 how = 'any'

	C
0	4
1	5
2	6

df_drop.dropna(axis=0, how = 'all')

	A	B	C
0	NaN	NaN	4
1	NaN	1.0	5
2	NaN	5.0	6

df_drop.dropna(axis=1, how = 'all')

	B	C
0	NaN	4
1	1.0	5
2	5.0	6

df_drop.dropna(axis=0,subset=['B','C'])

	A	B	C
1	NaN	1.0	5
2	NaN	5.0	6

References¶

1.https://pandas.pydata.org/pandas-docs/stable/user_guide

2.https://github.com/datawhalechina/joyful-pandas

3.https://github.com/hangsz/pandas-tutorial

Pandas - 1

Pandas - 3

AS213 Python in Atmospheric Sciences

Pandas - 2

Contents

`Pandas` - 2¶

5. 常用操作¶

`describe` 和 `info`¶

`head` 和 `tail`¶

算术/统计运算¶

`unique` 和 `nunique`¶

`count` 和 `value_counts`¶

`idxmax/idxmin` 和 `nlargest/nsmallest`¶

`sort_index` 和 `sort_values`¶

6. 分组聚合¶

`groupby` 函数¶

apply 过程¶

6. 缺失数据处理¶

`isna` 和 `notna` 方法¶

缺失数据的处理¶

References¶

AS213 Python in Atmospheric Sciences

Pandas - 2

Contents

Pandas - 2¶

5. 常用操作¶

describe 和 info¶

head 和 tail¶

算术/统计运算¶

unique 和 nunique¶

count 和 value_counts¶

idxmax/idxmin 和 nlargest/nsmallest¶

sort_index 和 sort_values¶

6. 分组聚合¶

groupby 函数¶

apply 过程¶

6. 缺失数据处理¶

isna 和 notna 方法¶

缺失数据的处理¶

References¶

`Pandas` - 2¶

`describe` 和 `info`¶

`head` 和 `tail`¶

`unique` 和 `nunique`¶

`count` 和 `value_counts`¶

`idxmax/idxmin` 和 `nlargest/nsmallest`¶

`sort_index` 和 `sort_values`¶

`groupby` 函数¶

`isna` 和 `notna` 方法¶