8. "Python for Data Analysis": Joining, Combining, and Reshaping Data

1 Hierarchical Indexing
2 Combining Datasets
3 Reshaping and Pivoting

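1 Hierarchical Indexing

Hierarchical indexing gives pandas a way to work with higher-dimensional data in a lower-dimensional form: a single axis can carry two or more index levels

A simple example of hierarchical indexing: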

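import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]]) # a simple two-level index
data.index # a MultiIndex
data['b':'c'] # select by the first (outer) level
data.loc[:, 2] # select by the second (inner) level
data.unstack() # move the inner level into the columns (the Series becomes a DataFrame)
data.unstack().stack() # stack is the inverse of unstack (a Series itself has no stack method)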

frame = pd.DataFrame(np.arange(12).reshape((4, 3)), # the columns can have multiple levels too
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]])
frame.index.names = ["key1", "key2"] # name each row index level
frame.columns.names = ["state", "color"] # name each column index level
pd.MultiIndex.from_arrays([["Ohio", "Ohio", "Colorado"],
                           ["Green", "Red", "Green"]],
                          names=["state", "color"]) # a MultiIndex can also be built on its own and reused
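### Common MultiIndex methods:
frame.index.nlevels # number of index levels
frame.swaplevel("key1", "key2") # swap the two row index levels
frame.sort_index(level=1) # sort by the second row level first (remaining levels break ties)
frame.groupby(level="key2").sum() # aggregate by a row index level
frame.T.groupby(level="color").sum().T # aggregate by a column index level (frame.sum(level=...) was removed in pandas 2.0)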

### Converting between ordinary columns and the row index
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame2 = frame.set_index(['c', 'd']) # move ordinary columns into the row index
frame2.reset_index() # move the row index levels back into ordinary columns

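2 Combining Datasets

Three common ways to combine data in pandas:

pandas.merge joins rows on one or more key columns (the counterpart of a SQL join)
pandas.concat concatenates multiple objects along a chosen axis
combine_first fills the missing values of one object with values from another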

A simple pandas.merge example:

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})
pd.merge(df1, df2, on='key', how='left') # left join on the key column
pd.merge(df1, df2, left_on='data1', right_on='data2', how='inner') # inner join on differently named key columns

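Other common pandas.merge techniques:

pandas.merge can use the row index as the join key via left_index=True or right_index=True
When both inputs contain non-key columns with the same name, the suffixes parameter sets the suffixes appended to them, e.g. suffixes=('_left', '_right')
If no join key is given, pandas.merge uses every column name that appears in both inputs as the join keys
pandas.merge supports left, right, inner, and outer joins through the how parameter (inner is the default)
When the index is a MultiIndex, joining on the index is equivalent to joining on multiple key columns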
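
The points above can be sketched with two small frames. The names `left`, `right`, and their columns are made up for illustration and do not come from the notes above.

```python
import pandas as pd

# Hypothetical frames: both have a "key" column and a "value" column.
left = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "value": [10, 20, 40]})

# Join on "key"; the overlapping "value" column receives the given suffixes.
by_column = pd.merge(left, right, on="key", how="outer",
                     suffixes=("_left", "_right"))

# No key given: every shared column name ("key" and "value") becomes a join key.
# No row agrees on both columns here, so the default inner join is empty.
by_default = pd.merge(left, right)

# Use the row index of the right-hand frame as its join key.
right_indexed = right.set_index("key")
by_index = pd.merge(left, right_indexed, left_on="key", right_index=True,
                    how="inner", suffixes=("_left", "_right"))
```

Spelling out `on=` (or `left_on=`/`right_on=`) is usually the safer choice, since the implicit key selection changes silently whenever the inputs gain or lose a shared column name.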

A simple pandas.concat example:

arr = np.arange(12).reshape((3, 4))
np.concatenate([arr, arr], axis=1) # NumPy arrays can be concatenated like this
s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")
s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")
s4 = pd.concat([s1, s3]) # concatenate along the rows (the default axis)

pd.concat([s1, s2, s3], axis="columns") # concatenate several Series as columns
pd.concat([s1, s4], axis="columns", join="inner") # keep only the labels shared by both inputs
pd.concat([s1, s1, s3], keys=["one", "two", "three"]) # build a hierarchical index that identifies the three sources

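Other common pandas.concat techniques:

The examples above also apply to DataFrames, which simply offer more choices of axis to concatenate along
When concatenating DataFrames by rows, remember to discard the old row index (ignore_index=True) or to reset it afterwards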
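
A minimal sketch of the row-index issue, using two made-up frames (`df_a` and `df_b` are illustrative names) that both carry the default 0..n-1 index:

```python
import pandas as pd

# Hypothetical frames that share column names and both use the default index.
df_a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
df_b = pd.DataFrame({"x": [5, 6], "y": [7, 8]})

# Plain row-wise concatenation keeps the original labels, so 0 and 1 appear twice.
stacked = pd.concat([df_a, df_b])

# ignore_index=True discards the old labels and assigns a fresh 0..n-1 index.
renumbered = pd.concat([df_a, df_b], ignore_index=True)

# Alternative: concatenate first, then reset the index.
reset = stacked.reset_index(drop=True)

# Column-wise concatenation aligns on the row index instead;
# keys adds an outer column level naming each source.
side_by_side = pd.concat([df_a, df_b], axis="columns", keys=["a", "b"])
```

Duplicate row labels are legal in pandas, but label-based lookups such as `stacked.loc[0]` then return several rows, which is why renumbering is usually worth doing.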

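A simple combine_first example:

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b.iloc[-1] = np.nan # set the last element by position
np.where(pd.isnull(a), b, a) # fill missing values position by position
# result: array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])
a.combine_first(b) # align on the index first, then fill missing values
# result by label: f=0.0, e=2.5, d=2.0, c=3.5, b=4.5, a=NaN (same values as above, because a and b share the same index)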
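
The difference between positional filling and index-aligned filling only shows up when the two objects order their labels differently. A small sketch with made-up values (`left` and `right` are illustrative names):

```python
import numpy as np
import pandas as pd

# Same labels, but stored in opposite order (illustrative values).
left = pd.Series([np.nan, 2.0, np.nan], index=["x", "y", "z"])
right = pd.Series([30.0, 20.0, 10.0], index=["z", "y", "x"])

# np.where works purely by position: the gap at position 0 takes right's position 0.
by_position = np.where(left.isna(), right, left)

# combine_first matches labels: the gap at "x" takes right["x"].
by_label = left.combine_first(right)
```

Here `by_position` fills the gap at `"x"` with 30.0 (the value that `right` stores under `"z"`), while `by_label` fills it with 10.0, the value actually labeled `"x"`.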

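3 Reshaping and Pivoting

Reshaping with a hierarchical index: stack pivots columns into rows, and unstack pivots rows into columns

The level parameter selects which index level to reshape
The row index level that is unstacked becomes the innermost column level
With the classic implementation, stack drops missing data by default, and dropna=False keeps it (the future_stack=True implementation introduced in pandas 2.1 handles missing values differently and does not accept dropna)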
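
A short sketch of how `level` changes the result. The frame and its level names (`outer`, `inner`, `side`) are assumptions made for this illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a two-level row index and named levels.
frame = pd.DataFrame(
    np.arange(8).reshape((4, 2)),
    index=pd.MultiIndex.from_product([["a", "b"], [1, 2]],
                                     names=["outer", "inner"]),
    columns=pd.Index(["left", "right"], name="side"),
)

# By default the innermost row level ("inner") moves into the columns.
wide_inner = frame.unstack()

# level selects a different row level, by name or by position.
wide_outer = frame.unstack(level="outer")

# In both cases the moved level becomes the innermost column level.
# stack moves the innermost column level back into the rows.
restored = wide_inner.stack()
```

Referring to levels by name rather than by position tends to be more robust, because the code keeps working if the level order changes.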

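"Wide format" and "long format":

Typical Excel spreadsheets are in wide format, where each variable has its own column
Long format is often used to store time series, where each row holds one observation of one variable

A "wide to long" code example: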

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
melted = pd.melt(df, id_vars=['key']) # convert wide format to long format
# melted result
#    key variable  value
# 0  foo        A      1
# 1  bar        A      2
# 2  baz        A      3
# 3  foo        B      4
# 4  bar        B      5
# 5  baz        B      6
# 6  foo        C      7
# 7  bar        C      8
# 8  baz        C      9
reshaped = melted.pivot(index="key", columns="variable", # convert back to wide format
                        values="value")
# reshaped result
# variable  A  B  C
# key
# bar       2  5  8
# baz       3  6  9
# foo       1  4  7

pivot is equivalent to creating a hierarchical index with set_index and then calling unstack
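
The equivalence can be checked on a small long-format table. The data below is made up for illustration:

```python
import pandas as pd

# Long-format data (made-up values): one row per (key, variable) observation.
long_df = pd.DataFrame({
    "key": ["foo", "foo", "bar", "bar"],
    "variable": ["A", "B", "A", "B"],
    "value": [1, 4, 2, 5],
})

# pivot: "key" labels the rows, "variable" labels the columns.
pivoted = long_df.pivot(index="key", columns="variable", values="value")

# The same reshape in two explicit steps: build a two-level row index,
# then move the "variable" level into the columns.
unstacked = long_df.set_index(["key", "variable"])["value"].unstack("variable")
```

Both results have `key` as the row index and one column per distinct `variable` value. Both also raise an error if a (key, variable) pair occurs more than once, since the reshape would then be ambiguous.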

Personal notes

Digital Garden | 王半仙
