Pandas - 1

Instructor: 李显祥

School of Atmospheric Sciences

import pandas as pd
import numpy as np

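Pandas is an excellent data analysis toolkit, used mainly for two-dimensional (tabular) data and time series. It is built on top of NumPy, but it can also handle data of mixed types.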

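1. `Pandas` data structures

Pandas mainly supports two data structures:

Series: a one-dimensional labeled array whose elements share a single data type; it behaves like both a list and a dictionary.
DataFrame: a two-dimensional, table-like structure made up of Series that share the same index.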

1.1. `Series`

Four important attributes:

Series.index
Series.name
Series.values
Series.dtype

s = pd.Series(np.random.randn(5),
              index=['a','b','c','d','e'],
              name='这是一个Series',dtype='float64')
s

a    0.279662
b    2.179480
c    1.181846
d   -0.635226
e   -0.016946
Name: 这是一个Series, dtype: float64

s.values

array([ 0.27966185,  2.17948013,  1.18184557, -0.63522589, -0.01694577])

s.name

'这是一个Series'

s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

s.dtype

dtype('float64')

data2 = { 'a':1, 'b':2,'c':3 }
index2 = ['a','b','d']
s2 = pd.Series(data=data2, index=index2)  # name not specified
s2      # missing entries are filled with NaN (not a number, the Pandas missing-value marker)

a    1.0
b    2.0
d    NaN
dtype: float64
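The list-like and dict-like behaviour of a Series mentioned above can be sketched as follows. The names (`temps`, `other`) and the values are made up for illustration; the last step also shows that arithmetic between two Series matches entries by label rather than by position.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration
temps = pd.Series([21.5, 19.0, 23.2], index=['mon', 'tue', 'wed'], name='temp')

# Dict-like: look up by label and test membership against the index
by_label = temps['tue']
has_wed = 'wed' in temps

# List-like: look up by integer position and slice, using .iloc
by_position = temps.iloc[0]
first_two = temps.iloc[:2]

# Arithmetic aligns on labels; labels present in only one Series give NaN
other = pd.Series([1.0, 1.0], index=['tue', 'thu'])
total = temps + other

print(by_label, has_wed, by_position)
print(total)
```

Only the label `tue` exists in both Series, so it is the only entry of `total` that is not NaN.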

1.2. `DataFrame`

Four important attributes:

DataFrame.index
DataFrame.columns
DataFrame.values
DataFrame.dtypes

df = pd.DataFrame({'col1':list('abcde'),'col2':range(5,10),'col3':[1.3,2.5,3.6,4.6,5.8]},
                 index=list('一二三四五'))
df

	col1	col2	col3
一	a	5	1.3
二	b	6	2.5
三	c	7	3.6
四	d	8	4.6
五	e	9	5.8

data1 = np.array([[1,2,3],
                  [4,5,6]] )
index1 = ['a','b']
columns1 = ['A','B','C']
df2 = pd.DataFrame(data=data1, index=index1, columns=columns1)
df2

	A	B	C
a	1	2	3
b	4	5	6
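Since a DataFrame is described above as a set of Series sharing one index, it may help to build one that way. This is a minimal sketch with made-up values; the names `height`, `weight` and `people` are hypothetical.

```python
import pandas as pd

# Two Series with partly different labels (hypothetical values)
height = pd.Series([173, 192, 186], index=['a', 'b', 'c'], name='Height')
weight = pd.Series([63.0, 82.0], index=['a', 'c'], name='Weight')

# The columns are aligned on a common index; a label missing
# from one Series becomes NaN in that column
people = pd.DataFrame({'Height': height, 'Weight': weight})

# Pulling a column back out returns a Series carrying the same index
col = people['Weight']

print(people)
```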

Selecting a single column from a DataFrame gives a Series.

df['col1']

一    a
二    b
三    c
四    d
五    e
Name: col1, dtype: object

df.values

array([['a', 5, 1.3],
       ['b', 6, 2.5],
       ['c', 7, 3.6],
       ['d', 8, 4.6],
       ['e', 9, 5.8]], dtype=object)

df.columns

Index(['col1', 'col2', 'col3'], dtype='object')

df.index

Index(['一', '二', '三', '四', '五'], dtype='object')

df.dtypes

col1     object
col2      int64
col3    float64
dtype: object

df.shape

(5, 3)

df.T

	一	二	三	四	五
col1	a	b	c	d	e
col2	5	6	7	8	9
col3	1.3	2.5	3.6	4.6	5.8

df.T.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, col1 to col3
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   一       3 non-null      object
 1   二       3 non-null      object
 2   三       3 non-null      object
 3   四       3 non-null      object
 4   五       3 non-null      object
dtypes: object(5)
memory usage: 224.0+ bytes

A Series can be converted to a DataFrame, which then has a single column.

s.to_frame()

	这是一个Series
a	0.279662
b	2.179480
c	1.181846
d	-0.635226
e	-0.016946

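1.3. `Index`

The Index is the key to building and operating on Series and DataFrame objects.
An index is similar to a tuple: its elements cannot be modified by assignment.
When whole objects are combined in an operation, the index drives automatic alignment, a major feature that sets Pandas apart from other data processing libraries.
Multi-level indexes help reshape tables, for example into pivot tables.

Three important attributes: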

Index.name
Index.values
Index.dtype

data = ['a','b','c']
index = pd.Index(data, name = 'name1')
index

Index(['a', 'b', 'c'], dtype='object', name='name1')

index[0]

'a'

index[[0, 2]]

Index(['a', 'c'], dtype='object', name='name1')

index[[True, False, True]]

Index(['a', 'c'], dtype='object', name='name1')

index.drop(['a','b'])

Index(['c'], dtype='object', name='name1')

index

Index(['a', 'b', 'c'], dtype='object', name='name1')
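The immutability described above can be checked directly. The sketch below reuses the same labels as the example; the variable names are hypothetical. It shows that item assignment fails, that `drop` returns a new Index, and that relabeling is done by assigning a whole new Index to the owning object.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'], name='name1')

# Assigning to an element of an Index raises TypeError
try:
    idx[0] = 'z'
    mutated = True
except TypeError:
    mutated = False

# Methods such as drop() return a new Index; the original is unchanged
shorter = idx.drop(['a'])

# To relabel, replace the whole index of the Series
s = pd.Series([1, 2, 3], index=idx)
s.index = pd.Index(['x', 'y', 'z'], name='name1')

print(mutated, list(shorter), list(idx), list(s.index))
```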

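2. `Pandas` data types

In older versions of Pandas, the data types are in fact all inherited from NumPy,
so they are written with the np prefix (for example np.int64).
The basic data types do not include a string type; strings are stored as object.
Newer versions introduce a dedicated string type, string, but it is not yet mature; for now these notes keep the earlier practice of representing strings as object.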

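bool: Boolean
int: integer
- signed integers: byte, short, intc, int_, longlong, intp, int8, int16, int32, int64
- unsigned integers: ubyte, ushort, uintc, uint, ulonglong, uintp, uint8, uint16, uint32, uint64
float: floating point: single, double, float_, longfloat, float16, float32, float64, float96, float128
complex: complex numbers: csingle, complex_, clongfloat, complex64, complex128, complex192, complex256
(float96/float128 and complex192/complex256 exist only on some platforms; the aliases float_, complex_, longfloat and clongfloat were removed in NumPy 2.0.)

object: the generic type, used for everything that is not bool, int, float or complex, including strings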
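As a rough illustration of how these types are requested in practice, the sketch below converts a Series with `astype`, once with a NumPy type object and once with the equivalent string name. It also stores strings explicitly as `object` and converts them to the newer `string` type. All variable names are hypothetical.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])

# A dtype can be given as a NumPy type (with the np prefix) or by name
small = ints.astype(np.int8)
as_float = ints.astype('float32')

# Strings stored with the generic object dtype, requested explicitly here
words = pd.Series(['a', 'b'], dtype=object)

# The dedicated string dtype introduced in newer Pandas versions
typed = words.astype('string')

print(small.dtype, as_float.dtype, words.dtype, typed.dtype)
```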

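Newer versions of Pandas introduce several additional data types:

Categorical: the categorical type
string: the string type

The Categorical type

takes only a fixed set of values;
can define an order, which may differ from both numeric order and lexicographic order;
even when the categories are written as numbers, arithmetic on them may be meaningless, so it is not necessarily the same as a discrete numeric type.

A categorical has two important attributes:

categories
ordered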

pd.Categorical(values, categories=None, ordered=False)

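values: the sequence of values to categorize;
categories: a user-defined sequence of categories; values not listed in it become NaN;
ordered: whether the categories are ordered (unordered by default). When True, the order follows categories, or the sorted unique values if categories is not given.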

c = pd.Categorical([2,1,1,3], ordered=True) 
c

[2, 1, 1, 3]
Categories (3, int64): [1 < 2 < 3]

d = pd.Categorical([2,1,1,3], categories=[3,2], ordered=True) 
d

[2, NaN, NaN, 3]
Categories (2, int64): [3 < 2]

d.categories  # the categories

Int64Index([3, 2], dtype='int64')

d.ordered  # whether the categories are ordered

True
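To illustrate why a custom order matters, the sketch below uses letter grades, whose natural ranking is neither numeric nor alphabetical. The ordering in `order` and the sample values are assumptions made for this example.

```python
import pandas as pd

# Hypothetical ranking of grades, from lowest to highest
order = ['C', 'B', 'A', 'A+']
grades = pd.Series(
    pd.Categorical(['B', 'A+', 'C', 'A'], categories=order, ordered=True)
)

# Sorting follows the category order, not alphabetical order
ranked = grades.sort_values()

# Ordered categoricals support comparisons against a category
above_b = grades > 'B'

# Each value is stored as an integer code pointing into categories
codes = grades.cat.codes

best = grades.max()

print(ranked.tolist(), above_b.tolist(), codes.tolist(), best)
```

With `ordered=False` the comparison `grades > 'B'` and `grades.max()` would raise a TypeError, because no order is defined.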

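3. File I/O

Pandas can read and write many file formats; the most common are CSV (comma-separated values) files and Excel files.

Reading files

pd.read_csv
pd.read_table
pd.read_excel (requires an Excel engine, such as xlrd for .xls files or openpyxl for .xlsx files)
pd.read_hdf: HDF5 files (requires the PyTables package)
…

pd.read_csv has several very useful parameters (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):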

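sep: str; the delimiter; default ','
header: int or list of ints; the position of the header row (row number, counting from 0); default 'infer' (usually the first row)
skiprows: list-like or int; the number of rows to skip before reading the data, or the row numbers to skip (counting from 0); default None
skipfooter: int; the number of rows to skip at the end of the file; default 0

comment: str; default None; everything after this character on the same line is treated as a comment and ignored
encoding: str; the file encoding, for example 'utf8', 'gbk' or 'gb2312'; default None
names: list; the names to use for the columns
usecols: list-like or callable; which columns to read; default None (read all columns)
index_col: int, str, sequence of int / str, or False; the column(s) to use as the index; default None

na_values: scalar, str, list-like, or dict; additional values to treat as missing, for example -999 or ''; default None
dtype: type name or dict of column -> type; the data type of each column read; default None
parse_dates: bool, list of ints or names, list of lists, or dict; whether to convert certain columns to dates; default False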
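Several of these parameters can be tried without a file on disk by reading from an in-memory string. The sketch below is an illustration only: the sample text, the column names and the variable `obs` are made up, and -999 is assumed to be the missing-value code.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data with a comment line
raw = """# station observations (sample data)
date;station;temp;rain
2021-01-01;S1;12.5;-999
2021-01-02;S1;13.1;0.4
2021-01-03;S2;-999;1.2
"""

obs = pd.read_csv(
    io.StringIO(raw),
    sep=';',               # fields are separated by semicolons
    comment='#',           # ignore everything after '#'
    na_values=[-999],      # treat -999 as missing
    index_col='date',      # use the date column as the index
    parse_dates=['date'],  # convert it to datetime values
)

print(obs)
print(obs.dtypes)
```

The fully commented first line is skipped, so the header is taken from the `date;station;temp;rain` line.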

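Writing files

DataFrame.to_csv
DataFrame.to_excel
DataFrame.to_html
DataFrame.to_latex
…

DataFrame.to_csv(path_or_buf, ...)

sep: the delimiter (default ',')
index: whether to write the index values (default True)
na_rep: the string used to represent missing values (default '')
float_format: the output format for floating-point numbers, for example '%.3f'
columns: which columns to write (default None, meaning all columns)
header: whether to write the column names as the first line of the file (default True)
date_format: the output format for dates and times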
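When no path is given, `to_csv` returns the CSV text as a string, which makes it easy to see what each option does. The frame and its values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({'station': ['S1', 'S2'], 'temp': [12.3456, np.nan]})

text = df.to_csv(
    sep=';',               # semicolon instead of comma
    index=False,           # do not write the row index
    na_rep='NA',           # how missing values are written
    float_format='%.2f',   # two decimal places for floats
    columns=['station', 'temp'],
)

print(text)
```

Passing a file name as the first argument writes the same text to disk instead of returning it.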

4. Selecting elements

data = pd.read_csv('table.csv')
data

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
0	S_1	C_1	1101	M	street_1	173	63	34.0	A+
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+
5	S_1	C_2	1201	M	street_5	188	68	97.0	A-
6	S_1	C_2	1202	F	street_4	176	94	63.5	B-
7	S_1	C_2	1203	M	street_6	160	53	58.8	A+
8	S_1	C_2	1204	F	street_5	162	63	33.8	B
9	S_1	C_2	1205	F	street_6	167	63	68.4	B-
10	S_1	C_3	1301	M	street_4	161	68	31.5	B+
11	S_1	C_3	1302	F	street_1	175	57	87.7	A-
12	S_1	C_3	1303	M	street_7	188	82	49.7	B
13	S_1	C_3	1304	M	street_2	195	70	85.2	A
14	S_1	C_3	1305	F	street_5	187	69	61.7	B-
15	S_2	C_1	2101	M	street_7	174	84	83.3	C
16	S_2	C_1	2102	F	street_6	161	61	50.6	B+
17	S_2	C_1	2103	M	street_4	157	61	52.5	B-
18	S_2	C_1	2104	F	street_5	159	97	72.2	B+
19	S_2	C_1	2105	M	street_4	170	81	34.2	A
20	S_2	C_2	2201	M	street_5	193	100	39.1	B
21	S_2	C_2	2202	F	street_7	194	77	68.5	B+
22	S_2	C_2	2203	M	street_4	155	91	73.8	A+
23	S_2	C_2	2204	M	street_1	175	74	47.2	B-
24	S_2	C_2	2205	F	street_7	183	76	85.4	B
25	S_2	C_3	2301	F	street_4	157	78	72.3	B+
26	S_2	C_3	2302	M	street_5	171	88	32.7	A
27	S_2	C_3	2303	F	street_7	190	99	65.9	C
28	S_2	C_3	2304	F	street_6	164	81	95.5	A-
29	S_2	C_3	2305	M	street_4	187	73	48.9	B
30	S_2	C_4	2401	F	street_2	192	62	45.3	A
31	S_2	C_4	2402	M	street_7	166	82	48.7	B
32	S_2	C_4	2403	F	street_6	158	60	59.7	B+
33	S_2	C_4	2404	F	street_2	160	84	67.7	B
34	S_2	C_4	2405	F	street_6	193	54	47.6	B

data2 = data.set_index('ID')
data2

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-
1105	S_1	C_1	F	street_4	159	64	84.8	B+
1201	S_1	C_2	M	street_5	188	68	97.0	A-
1202	S_1	C_2	F	street_4	176	94	63.5	B-
1203	S_1	C_2	M	street_6	160	53	58.8	A+
1204	S_1	C_2	F	street_5	162	63	33.8	B
1205	S_1	C_2	F	street_6	167	63	68.4	B-
1301	S_1	C_3	M	street_4	161	68	31.5	B+
1302	S_1	C_3	F	street_1	175	57	87.7	A-
1303	S_1	C_3	M	street_7	188	82	49.7	B
1304	S_1	C_3	M	street_2	195	70	85.2	A
1305	S_1	C_3	F	street_5	187	69	61.7	B-
2101	S_2	C_1	M	street_7	174	84	83.3	C
2102	S_2	C_1	F	street_6	161	61	50.6	B+
2103	S_2	C_1	M	street_4	157	61	52.5	B-
2104	S_2	C_1	F	street_5	159	97	72.2	B+
2105	S_2	C_1	M	street_4	170	81	34.2	A
2201	S_2	C_2	M	street_5	193	100	39.1	B
2202	S_2	C_2	F	street_7	194	77	68.5	B+
2203	S_2	C_2	M	street_4	155	91	73.8	A+
2204	S_2	C_2	M	street_1	175	74	47.2	B-
2205	S_2	C_2	F	street_7	183	76	85.4	B
2301	S_2	C_3	F	street_4	157	78	72.3	B+
2302	S_2	C_3	M	street_5	171	88	32.7	A
2303	S_2	C_3	F	street_7	190	99	65.9	C
2304	S_2	C_3	F	street_6	164	81	95.5	A-
2305	S_2	C_3	M	street_4	187	73	48.9	B
2401	S_2	C_4	F	street_2	192	62	45.3	A
2402	S_2	C_4	M	street_7	166	82	48.7	B
2403	S_2	C_4	F	street_6	158	60	59.7	B+
2404	S_2	C_4	F	street_2	160	84	67.7	B
2405	S_2	C_4	F	street_6	193	54	47.6	B

Exercise

Drawing on the file I/O section above, how could the two statements above be combined into one that achieves the same result?

4.1 The `[]` operator

With a slice, [] selects rows by position (that is, by row number).

data[1:4]

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
1	S_1	C_1	1102	F	street_2	192	73	32.5	B+
2	S_1	C_1	1103	M	street_2	186	82	87.2	B+
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-

data2[1:4]

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-

With column names, [] extracts a single column (as a Series) or several columns (returned as a new DataFrame).

data['Physics'].head()  # equivalent to data.Physics

0    A+
1    B+
2    B+
3    B-
4    B+
Name: Physics, dtype: object

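Note: when a column is extracted with . followed by the column name, the name must be a valid Python identifier (no special characters, and it must not start with a digit). Otherwise, only [] can be used.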
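The restriction on attribute access can be seen with a small made-up frame. Besides names that are not valid identifiers, a column whose name matches an existing DataFrame method (here `count`) is also reachable only through []. The frame and its column names are hypothetical.

```python
import pandas as pd

scores = pd.DataFrame({
    'Math': [34.0, 32.5],
    'Physics grade': ['A+', 'B+'],   # contains a space
    'count': [1, 2],                 # same name as DataFrame.count
})

# A valid identifier works both ways
same = scores.Math.equals(scores['Math'])

# A name with a space can only be reached with []
grade = scores['Physics grade']

# scores.count is the method, not the column
attr_is_method = callable(scores.count)
count_col = scores['count']

print(same, grade.tolist(), attr_is_method, count_col.tolist())
```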

data[['ID','School','Class','Physics']].head()

	ID	School	Class	Physics
0	1101	S_1	C_1	A+
1	1102	S_1	C_1	B+
2	1103	S_1	C_1	B+
3	1104	S_1	C_1	B-
4	1105	S_1	C_1	B+

data[(data['Math']>80) & (data['Address']=='street_7')].head()

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
15	S_2	C_1	2101	M	street_7	174	84	83.3	C
24	S_2	C_2	2205	F	street_7	183	76	85.4	B
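The last example passes a boolean Series to [], which keeps the rows where the mask is True. A minimal sketch with hypothetical values shows how such masks are combined: `&` means and, `|` means or, `~` means not, and each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`.

```python
import pandas as pd

# Hypothetical sample rows
students = pd.DataFrame({
    'Math': [83.3, 68.5, 85.4, 48.7],
    'Address': ['street_7', 'street_7', 'street_7', 'street_4'],
})

# Element-wise "and" of two conditions
mask = (students['Math'] > 80) & (students['Address'] == 'street_7')
selected = students[mask]

# Element-wise "or" and "not"
either = students[(students['Math'] > 80) | (students['Address'] == 'street_4')]
not_s7 = students[~(students['Address'] == 'street_7')]

print(mask.tolist())
print(selected)
```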

`.loc`: `label`-based selection

This is label-based selection: elements are accessed by index labels and column names.

data2.loc[1101:1104]

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1102	S_1	C_1	F	street_2	192	73	32.5	B+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1104	S_1	C_1	F	street_2	167	81	80.4	B-

Note: slices in this method are closed intervals (both endpoints are included), unlike position-based selection.

data2.loc[1101:1104,'Math'].head()

ID
1101    34.0
1102    32.5
1103    87.2
1104    80.4
Name: Math, dtype: float64

data2.loc[1101:1104,['Math','Physics']].head()

	Math	Physics
ID
1101	34.0	A+
1102	32.5	B+
1103	87.2	B+
1104	80.4	B-

data2.loc[:,['Class','Physics']].head()

	Class	Physics
ID
1101	C_1	A+
1102	C_1	B+
1103	C_1	B+
1104	C_1	B-
1105	C_1	B+

data2.loc[:,'Height':'Math'].head()

	Height	Weight	Math
ID
1101	173	63	34.0
1102	192	73	32.5
1103	186	82	87.2
1104	167	81	80.4
1105	159	64	84.8
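The difference between the two slicing conventions is easy to check on a small frame. The sketch below uses a few rows like those in the table above; the variable names are hypothetical. A label slice with .loc includes its right endpoint for both rows and columns, while a position slice with .iloc does not.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        'Height': [173, 192, 186, 167],
        'Weight': [63, 73, 82, 81],
        'Math': [34.0, 32.5, 87.2, 80.4],
        'Physics': ['A+', 'B+', 'B+', 'B-'],
    },
    index=pd.Index([1101, 1102, 1103, 1104], name='ID'),
)

# Label slice: 1101, 1102 and 1103 are all included
by_label = frame.loc[1101:1103]

# Position slice: positions 0 and 1 only
by_position = frame.iloc[0:2]

# Column label slice: 'Math' is included as well
col_range = frame.loc[:, 'Height':'Math']

print(len(by_label), len(by_position), list(col_range.columns))
```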

The selection condition can also be a boolean array, or a function that returns booleans (True/False) or a slice.

data2.loc[data2.Gender=='M']

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1103	S_1	C_1	M	street_2	186	82	87.2	B+
1201	S_1	C_2	M	street_5	188	68	97.0	A-
1203	S_1	C_2	M	street_6	160	53	58.8	A+
1301	S_1	C_3	M	street_4	161	68	31.5	B+
1303	S_1	C_3	M	street_7	188	82	49.7	B
1304	S_1	C_3	M	street_2	195	70	85.2	A
2101	S_2	C_1	M	street_7	174	84	83.3	C
2103	S_2	C_1	M	street_4	157	61	52.5	B-
2105	S_2	C_1	M	street_4	170	81	34.2	A
2201	S_2	C_2	M	street_5	193	100	39.1	B
2203	S_2	C_2	M	street_4	155	91	73.8	A+
2204	S_2	C_2	M	street_1	175	74	47.2	B-
2302	S_2	C_3	M	street_5	171	88	32.7	A
2305	S_2	C_3	M	street_4	187	73	48.9	B
2402	S_2	C_4	M	street_7	166	82	48.7	B

data2.loc[data2['Physics'].isin(['A+','A'])]

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1101	S_1	C_1	M	street_1	173	63	34.0	A+
1203	S_1	C_2	M	street_6	160	53	58.8	A+
1304	S_1	C_3	M	street_2	195	70	85.2	A
2105	S_2	C_1	M	street_4	170	81	34.2	A
2203	S_2	C_2	M	street_4	155	91	73.8	A+
2302	S_2	C_3	M	street_5	171	88	32.7	A
2401	S_2	C_4	F	street_2	192	62	45.3	A

data2.loc[[True if i[-1]=='4' or i[-1]=='7' else False 
           for i in data2.Address.values]].head()

	School	Class	Gender	Address	Height	Weight	Math	Physics
ID
1105	S_1	C_1	F	street_4	159	64	84.8	B+
1202	S_1	C_2	F	street_4	176	94	63.5	B-
1301	S_1	C_3	M	street_4	161	68	31.5	B+
1303	S_1	C_3	M	street_7	188	82	49.7	B
2101	S_2	C_1	M	street_7	174	84	83.3	C
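Two variations on the examples above may be worth knowing; both are sketches on a made-up frame rather than part of the original notes. A callable passed to .loc receives the DataFrame and returns the mask, and the list comprehension over `Address` can be written with the vectorized `.str` accessor.

```python
import pandas as pd

# Hypothetical sample rows
frame = pd.DataFrame(
    {
        'Address': ['street_1', 'street_4', 'street_7'],
        'Math': [34.0, 84.8, 49.7],
    },
    index=pd.Index([1101, 1105, 1303], name='ID'),
)

# A function of the DataFrame that returns a boolean Series
high = frame.loc[lambda d: d['Math'] > 40]

# Last character of each address, tested against the allowed values
ends_4_or_7 = frame.loc[frame['Address'].str[-1].isin(['4', '7'])]

print(high)
print(ends_4_or_7)
```

The callable form is convenient in method chains, where the intermediate DataFrame has no name of its own.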

`.iloc`: position-based selection

.iloc selects elements by position. Note: unlike .loc, the right endpoint of a slice is not included.

data.iloc[3]

School          S_1
Class           C_1
ID             1104
Gender            F
Address    street_2
Height          167
Weight           81
Math           80.4
Physics          B-
Name: 3, dtype: object

data.iloc[3:5]

	School	Class	ID	Gender	Address	Height	Weight	Math	Physics
3	S_1	C_1	1104	F	street_2	167	81	80.4	B-
4	S_1	C_1	1105	F	street_4	159	64	84.8	B+

data.iloc[:,2].head()

0    1101
1    1102
2    1103
3    1104
4    1105
Name: ID, dtype: int64

data.iloc[:,7::-2].head()

	Math	Height	Gender	Class
0	34.0	173	M	C_1
1	32.5	192	F	C_1
2	87.2	186	M	C_1
3	80.4	167	F	C_1
4	84.8	159	F	C_1

data.iloc[3::4,7::-2].head()

	Math	Height	Gender	Class
3	80.4	167	F	C_1
7	58.8	160	M	C_2
11	87.7	175	F	C_3
15	83.3	174	M	C_1
19	34.2	170	M	C_1
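The slices with a step used above follow the usual Python start:stop:step rule. As a sketch, the frame below has the same column names as the table but dummy numeric contents, so only the selected positions matter: `7::-2` starts at column position 7 and walks backwards two at a time, and `3::4` starts at row position 3 and takes every fourth row.

```python
import numpy as np
import pandas as pd

cols = ['School', 'Class', 'ID', 'Gender', 'Address',
        'Height', 'Weight', 'Math', 'Physics']

# Dummy contents: 12 rows, 9 columns
frame = pd.DataFrame(np.arange(108).reshape(12, 9), columns=cols)

# Column positions 7, 5, 3, 1 and row positions 3, 7, 11
picked = frame.iloc[3::4, 7::-2]

# Lists of positions also work for rows and columns
corners = frame.iloc[[0, 11], [0, 8]]

print(list(picked.columns), picked.index.tolist())
```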

The `at` and `iat` methods

When accessing a single element, at (label-based) and iat (position-based) are generally somewhat faster than the corresponding loc and iloc.

display(data2.at[1101,'School'])
display(data2.loc[1101,'School'])
display(data2.iat[0,0])
display(data2.iloc[0,0])

%timeit data2.at[1101,'School']
%timeit data2.loc[1101,'School']
%timeit data2.iat[0,0]
%timeit data2.iloc[0,0]

'S_1'

'S_1'

'S_1'

'S_1'

94 µs ± 79.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
77 µs ± 80.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
7 µs ± 166 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
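Outside IPython, where `%timeit` and `display` are not available, the same comparison can be sketched with the standard `timeit` module. The frame below is a two-row stand-in with hypothetical values; the measured times depend on the machine and the Pandas version, so no particular ranking is assumed here.

```python
import timeit
import pandas as pd

frame = pd.DataFrame(
    {'School': ['S_1', 'S_2'], 'Math': [34.0, 83.3]},
    index=pd.Index([1101, 2101], name='ID'),
)

# All four accessors return the same scalar
same = (frame.at[1101, 'School'] == frame.loc[1101, 'School']
        == frame.iat[0, 0] == frame.iloc[0, 0])

# at and iat can also assign a single cell
frame.at[2101, 'Math'] = 90.0

statements = {
    'at': "frame.at[1101, 'School']",
    'loc': "frame.loc[1101, 'School']",
    'iat': "frame.iat[0, 0]",
    'iloc': "frame.iloc[0, 0]",
}

# Total time in seconds for 2000 repetitions of each statement
timings = {
    name: timeit.timeit(stmt, globals={'frame': frame}, number=2000)
    for name, stmt in statements.items()
}

for name, seconds in timings.items():
    print(f'{name:>4}: {seconds:.4f} s')
```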


AS213: Python in Atmospheric Sciences
