Comparison of Data Storage Methods

1 Overview of Common Storage Formats

1.1 Tabular Formats

Thanks to the powerful pandas module, tabular files, whose structure closely mirrors that of a DataFrame, are among the most popular data storage formats in Python. The corresponding files usually carry the suffix .csv or .xlsx.

  • CSV (Comma-Separated Values) is a plain-text format that uses commas as field separators
  • CSV's main advantage is readability, but it is slow for large datasets and compresses relatively poorly
  • XLSX and XLS files are more complex than CSV, so reading and writing them is much slower (not recommended); a minimal read/write example follows this list
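
A minimal round-trip sketch (the file names here are arbitrary, and the Excel part assumes an engine such as openpyxl is installed):

import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# CSV round trip
df.to_csv('data.csv', index=False)
df_csv = pd.read_csv('data.csv')
# Excel round trip (slower, and needs an external engine such as openpyxl)
df.to_excel('data.xlsx', index=False)
df_xlsx = pd.read_excel('data.xlsx')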

1.2 Pickle Serialization

The pickle module implements binary serialization and deserialization of Python object structures. The pickle protocol is Python-specific: this makes it work very smoothly with Python objects, but it can cause compatibility problems with other languages.

The most common approach is to use the pickle module directly to write and read files:

import pickle
# data_array is assumed to be an existing Python object (e.g. a list or NumPy array)
# write with pickle
with open('data_array.pickle', 'wb') as f:
    pickle.dump(data_array, f)
# read with pickle
with open('data_array.pickle', 'rb') as f:
    data_array_pickle = pickle.load(f)

Alternatively, you can use pandas' built-in read_pickle / to_pickle functions (which seem to be slightly faster).
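
A minimal sketch of the pandas variant (the DataFrame and file name here are placeholders):

import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3]})
# write the DataFrame with pandas' pickle wrapper
df.to_pickle('data_frame.pkl')
# read it back
df_restored = pd.read_pickle('data_frame.pkl')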

1.3 JSON Serialization

JSON is the most common serialization format; see the separate note on JSON basics for details.

import json
# write with JSON
data = {"x": 12153535.232321, "y": 35234531.232322}
with open('/usr/data/json-dump.json', 'w+') as f:
    json.dump(data, f)
# read with JSON
with open('/usr/data/json-dump.json', 'r') as f:
    data = json.load(f)

For a comparison of JSON and Pickle, see: [[4_book/Python编程/编写高质量代码改善 Python 程序的 91 个建议/4.《Python高质量代码的91个建议》库#10 建议 45:序列化的另一个选择——JSON]]

1.4 Other Storage Formats

  • When storing large floating-point vectors, consider NumPy's `numpy.savetxt()`
  • HDF is an open file format that organizes data in a directory-like hierarchy; it handles large, complex, heterogeneous data and suits "write once, read many" scenarios. The current mainstream version is HDF5, a major upgrade over HDF4
  • Parquet is the mainstream columnar storage format in the Hadoop ecosystem; it is compatible with most Hadoop processing frameworks (well suited to distributed storage) and supports fast processing of complex data. Its columnar layout yields higher compression ratios and smaller I/O; a short sketch of these formats follows this list
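
A minimal sketch of these options (file names are arbitrary; the HDF5 part assumes the tables package is installed, and the Parquet part assumes an engine such as fastparquet or pyarrow):

import numpy as np
import pandas as pd

vec = np.random.rand(1000)
np.savetxt('vector.txt', vec)              # plain-text dump of a float vector
vec_back = np.loadtxt('vector.txt')

df = pd.DataFrame({'x': vec})
df.to_hdf('data.h5', key='data', mode='w')  # HDF5 (needs tables)
df_h5 = pd.read_hdf('data.h5', key='data')
df.to_parquet('data.parquet')               # Parquet (needs fastparquet or pyarrow)
df_pq = pd.read_parquet('data.parquet')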

2 Storage Method Comparison

2.1 Approach and Code

Basic approach: using the same tabular dataset, repeat the experiment y times for each storage format and compression method; each repetition performs x save and x read operations to form one group of results, and the best of the y groups is taken as the final score for that method.

Comparison metrics: the average time of the x saves, the average size of the x compressed files, and the average time of the x reads.

Python implementation:

import os
import pandas as pd  # version=1.3.5

filename = 'trf_op_op_info'
df = pd.read_csv(filename + '.csv')
display(df.sample(5))   # display() is available in Jupyter/IPython
display(df.shape)       # (28003, 222)
display(df.dtypes)

compress_list=[
    # CSV + INDEX
    {'method_name':"CSV",'save_format':"df.to_csv(filename + '_csv.csv')",
     'read_format':"pd.read_csv(filename + '_csv.csv')",'suffix':"_csv.csv",'method_describe':"有索引的CSV文件格式"},
    # CSV - INDEX
    {'method_name':"CSV No Index",'save_format':"df.to_csv(filename + '_csv.csv', index=False)",
     'read_format':"pd.read_csv(filename + '_csv.csv')",'suffix':"_csv.csv",'method_describe':"无索引的CSV文件格式"},
    # CSV No Index (GZIP)
    {'method_name':"CSV No Index (GZIP)",'save_format':"df.to_csv(filename + '.gzip', index=False, compression='gzip')",
     'read_format':"pd.read_csv(filename + '.gzip', compression='gzip')",'suffix':".gzip",'method_describe':"gzip压缩格式的无索引CSV"},
    # CSV No Index (BZ2)
    {'method_name':"CSV No Index (BZ2)",'save_format':"df.to_csv(filename + '.bz2', index=False, compression='bz2')",
     'read_format':"pd.read_csv(filename + '.bz2', compression='bz2')",'suffix':".bz2",'method_describe':"bz2压缩格式的无索引CSV"},
    # CSV No Index (ZIP)
    {'method_name':"CSV No Index (ZIP)",'save_format':"df.to_csv(filename + '.zip', index=False, compression='zip')",
     'read_format':"pd.read_csv(filename + '.zip', compression='zip')",'suffix':".zip",'method_describe':"zip压缩格式的无索引CSV"},
    # CSV No Index (XZ)
    {'method_name':"CSV No Index (XZ)",'save_format':"df.to_csv(filename + '.xz', index=False, compression='xz')",
     'read_format':"pd.read_csv(filename + '.xz', compression='xz')",'suffix':".xz",'method_describe':"xz压缩格式的无索引CSV"},

    # JSON
    {'method_name':"JSON",'save_format':"df.to_json(filename + '.json')",
     'read_format':"pd.read_json(filename + '.json')",'suffix':".json",'method_describe':"json序列化"},
    # JSON (GZIP)
    {'method_name':"JSON(GZIP)",'save_format':"df.to_json(filename + '.json', compression='gzip')",
     'read_format':"pd.read_json(filename + '.json', compression='gzip')",'suffix':".json",'method_describe':"gzip压缩格式的json序列化"},
    # JSON (BZ2)
    {'method_name':"JSON(BZ2)",'save_format':"df.to_json(filename + '.json', compression='bz2')",
     'read_format':"pd.read_json(filename + '.json', compression='bz2')",'suffix':".json",'method_describe':"bz2压缩格式的json序列化"},
    # JSON (ZIP)
    {'method_name':"JSON(ZIP)",'save_format':"df.to_json(filename + '.json', compression='zip')",
     'read_format':"pd.read_json(filename + '.json', compression='zip')",'suffix':".json",'method_describe':"zip压缩格式的json序列化"},
    # JSON (XZ)
    {'method_name':"JSON(XZ)",'save_format':"df.to_json(filename + '.json', compression='xz')",
     'read_format':"pd.read_json(filename + '.json', compression='xz')",'suffix':".json",'method_describe':"xz压缩格式的json序列化"},

    # Pickle
    {'method_name':"Pickle",'save_format':"df.to_pickle(filename + '.pkl')",
     'read_format':"pd.read_pickle(filename + '.pkl')",'suffix':".pkl",'method_describe':"二进制序列化Pickle"},
    # Pickle (GZIP)
    {'method_name':"Pickle (GZIP)",'save_format':"df.to_pickle(filename + '.pkl', compression='gzip')",
     'read_format':"pd.read_pickle(filename + '.pkl', compression='gzip')",'suffix':".pkl",'method_describe':"gzip压缩的序列化Pickle"},
    # Pickle (BZ2)
    {'method_name':"Pickle (BZ2)",'save_format':"df.to_pickle(filename + '.pkl', compression='bz2')",
     'read_format':"pd.read_pickle(filename + '.pkl', compression='bz2')",'suffix':".pkl",'method_describe':"bz2压缩的序列化Pickle"},
    # Pickle (ZIP)
    {'method_name':"Pickle (ZIP)",'save_format':"df.to_pickle(filename + '.pkl', compression='zip')",
     'read_format':"pd.read_pickle(filename + '.pkl', compression='zip')",'suffix':".pkl",'method_describe':"zip压缩的序列化Pickle"},
    # Pickle (XZ)
    {'method_name':"Pickle (XZ)",'save_format':"df.to_pickle(filename + '.pkl', compression='xz')",
     'read_format':"pd.read_pickle(filename + '.pkl', compression='xz')",'suffix':".pkl",'method_describe':"xz压缩的序列化Pickle"},

    # HDF, no compression
    {'method_name':"HDF (no compression)",'save_format':"df.to_hdf(filename + '.h5', key='key', mode='w', complevel=0)",
     'read_format':"pd.read_hdf(filename + '.h5', key='key', mode='r')",'suffix':".h5",'method_describe':"uncompressed HDF5"},
    # HDF, light compression
    {'method_name':"HDF (light compression)",'save_format':"df.to_hdf(filename + '.h5', key='key', mode='w', complevel=3)",
     'read_format':"pd.read_hdf(filename + '.h5', key='key', mode='r')",'suffix':".h5",'method_describe':"HDF5 with compression level 3"},
    # HDF, deep compression
    {'method_name':"HDF (deep compression)",'save_format':"df.to_hdf(filename + '.h5', key='key', mode='w', complevel=6)",
     'read_format':"pd.read_hdf(filename + '.h5', key='key', mode='r')",'suffix':".h5",'method_describe':"HDF5 with compression level 6"},
    # HDF, maximum compression
    {'method_name':"HDF (max compression)",'save_format':"df.to_hdf(filename + '.h5', key='key', mode='w', complevel=9)",
     'read_format':"pd.read_hdf(filename + '.h5', key='key', mode='r')",'suffix':".h5",'method_describe':"HDF5 with compression level 9"},

    # Parquet(snappy)
    {'method_name':"Parquet(snappy)",'save_format':"df.to_parquet(filename + '.parquet', engine='fastparquet', compression='snappy')",
     'read_format':"pd.read_parquet(filename + '.parquet')",'suffix':".parquet",'method_describe':"snappy压缩的Parquet格式"},
    # Parquet(gzip)
    {'method_name':"Parquet(gzip)",'save_format':"df.to_parquet(filename + '.parquet', engine='fastparquet', compression='gzip')",
     'read_format':"pd.read_parquet(filename + '.parquet')",'suffix':".parquet",'method_describe':"gzip压缩的Parquet格式"},
    # Parquet(brotli)
    {'method_name':"Parquet(brotli)",'save_format':"df.to_parquet(filename + '.parquet', engine='fastparquet', compression='brotli')",
     'read_format':"pd.read_parquet(filename + '.parquet')",'suffix':".parquet",'method_describe':"brotli压缩的Parquet格式"},
]
def compress_performance(df_results, compress_detail):
    #---saving: execute the stored save statement via eval, timed with IPython's %timeit (5 loops x 5 repeats)---
    result_save = %timeit -n5 -r5 -o eval(compress_detail['save_format'])
    #---size of the written file, in MB---
    file_size = os.path.getsize(filename + compress_detail['suffix']) / 1024**2
    #---loading: execute the stored read statement via eval---
    result_read = %timeit -n5 -r5 -o eval(compress_detail['read_format'])
    #---save the result to the dataframe---
    row = {
        'method_name': compress_detail['method_name'],
        'file_size': file_size, 
        'write_time_mean': result_save.average, 
        'write_time_std': round(result_save.stdev,5),
        'read_time_mean':  result_read.average,
        'read_time_std': round(result_read.stdev,5),
        'method_describe': compress_detail['method_describe'],
      }
    # note: DataFrame.append was removed in pandas 2.x; use pd.concat there instead
    return df_results.append(pd.Series(row), ignore_index=True)

compress_results = pd.DataFrame(columns=['method_name','file_size', 'write_time_mean', 'write_time_std',
                                         'read_time_mean','read_time_std','method_describe'])
for compress_detail in compress_list:
    print('start check compress_method: '+compress_detail['method_name'])
    compress_results = compress_performance(compress_results, compress_detail)

df = compress_results.copy()
df = df.apply(lambda x: x.str.replace('\n', '<br>') if x.dtype == 'object' else x)  # avoid markdown table errors caused by embedded newlines
md_table = df.to_markdown()
print(md_table)

Note: reading and writing HDF5 may require the additional tables module (pip install tables), and reading and writing Parquet requires the fastparquet module (pip install fastparquet).
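
The %timeit magic above only works inside IPython/Jupyter. As a rough plain-Python equivalent, a hypothetical helper based on the standard timeit module (time_call is not part of the original code) might look like this:

import timeit

def time_call(stmt, number=5, repeat=5, env=None):
    # run `stmt` number*repeat times; return mean and std of the per-call averages
    totals = timeit.repeat(stmt, repeat=repeat, number=number, globals=env)
    per_call = [t / number for t in totals]
    mean = sum(per_call) / len(per_call)
    std = (sum((t - mean) ** 2 for t in per_call) / len(per_call)) ** 0.5
    return mean, std

# usage, assuming df and filename are defined as above:
# write_mean, write_std = time_call("df.to_pickle(filename + '.pkl')", env=globals())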

2.2 Results and Analysis

| method_name | file_size (MB) | write_time_mean (s) | write_time_std (s) | read_time_mean (s) | read_time_std (s) | method_describe |
|---|---|---|---|---|---|---|
| CSV | 24.5381 | 2.44883 | 0.01565 | 0.698063 | 0.00304 | CSV with index |
| CSV No Index | 24.3885 | 2.42836 | 0.02425 | 0.693689 | 0.00497 | CSV without index |
| CSV No Index (GZIP) | 5.65706 | 7.91941 | 0.03595 | 0.863275 | 0.00894 | gzip-compressed CSV without index |
| CSV No Index (BZ2) | 3.34583 | 5.12625 | 0.02599 | 1.57579 | 0.0036 | bz2-compressed CSV without index |
| CSV No Index (ZIP) | 5.82188 | 3.46738 | 0.01495 | 0.834589 | 0.00258 | zip-compressed CSV without index |
| CSV No Index (XZ) | 3.30216 | 19.2397 | 0.03484 | 1.06094 | 0.00139 | xz-compressed CSV without index |
| JSON | 91.797 | 1.73225 | 0.01243 | 3.84612 | 0.01924 | JSON serialization |
| JSON(GZIP) | 18.1919 | 14.7919 | 0.01356 | 4.23321 | 0.00772 | gzip-compressed JSON serialization |
| JSON(BZ2) | 12.421 | 16.6413 | 0.00407 | 6.21472 | 0.0103 | bz2-compressed JSON serialization |
| JSON(ZIP) | 18.4101 | 4.64663 | 0.00392 | 4.10226 | 0.00919 | zip-compressed JSON serialization |
| JSON(XZ) | 4.81302 | 46.1733 | 0.46348 | 4.48091 | 0.00479 | xz-compressed JSON serialization |
| Pickle | 52.564 | 0.667306 | 0.0109 | 0.120051 | 0.00373 | binary Pickle serialization |
| Pickle (GZIP) | 4.22666 | 10.5702 | 0.03953 | 0.290277 | 0.00039 | gzip-compressed Pickle serialization |
| Pickle (BZ2) | 2.83825 | 14.0252 | 0.01365 | 1.20366 | 0.00115 | bz2-compressed Pickle serialization |
| Pickle (ZIP) | 4.38329 | 1.81503 | 0.00999 | 0.350026 | 0.00133 | zip-compressed Pickle serialization |
| Pickle (XZ) | 2.91055 | 14.8552 | 0.01723 | 0.523916 | 0.00144 | xz-compressed Pickle serialization |
| HDF (no compression) | 53.6985 | 0.790838 | 0.01001 | 0.777448 | 0.00783 | uncompressed HDF5 |
| HDF (light compression) | 21.451 | 1.23071 | 0.00704 | 0.586584 | 0.00685 | HDF5 with compression level 3 |
| HDF (deep compression) | 20.8419 | 2.19926 | 0.00831 | 0.624411 | 0.03532 | HDF5 with compression level 6 |
| HDF (max compression) | 20.3504 | 21.1893 | 0.01573 | 0.595076 | 0.01197 | HDF5 with compression level 9 |
| Parquet(snappy) | 7.73179 | 0.947902 | 0.01185 | 0.168116 | 0.04005 | snappy-compressed Parquet |
| Parquet(gzip) | 4.19815 | 2.59033 | 0.00396 | 0.143755 | 0.00358 | gzip-compressed Parquet |
| Parquet(brotli) | 3.2077 | 87.6438 | 0.04822 | 0.1367 | 0.00514 | brotli-compressed Parquet |

Summary:

  • Dropping the index from a plain CSV file improves file size and read/write speed, but only slightly
  • JSON-serialized data is large, and its write times after compression are long; not recommended
  • Compared with CSV, the raw Pickle file is somewhat larger, but after compression the resulting file ends up smaller
  • The GZIP and XZ algorithms both incur high write times while keeping read times low; XZ achieves a higher compression ratio at the cost of slightly more time
  • BZ2 + Pickle gives the highest compression ratio of all the schemes, but its write and read times are also on the high side; suitable when maximum compression is the priority
  • ZIP compresses slightly less well, but has a clear advantage in write time
  • In this test scenario, HDF5 storage shows no particular advantage and is not recommended
  • Parquet keeps a good compression ratio while also doing reasonably well on write and read times, so it is worth trying as well

Weighing compression ratio against read/write speed, I ultimately chose the Pickle + ZIP approach.
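
A minimal sketch of that choice (the DataFrame and file name here are placeholders):

import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3]})
# write the DataFrame as a zip-compressed pickle
df.to_pickle('data_frame.pkl', compression='zip')
# read it back (compression must be passed explicitly, since it is not inferred from the .pkl suffix)
df_back = pd.read_pickle('data_frame.pkl', compression='zip')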

References

  • Pandas official API reference
  • The Best Format to Save Pandas Data
  • Still Saving Your Data in CSV? Try these other options
