1 Common Storage Formats
1.1 Tabular formats
Thanks to the powerful pandas module, tabular files, whose layout maps almost directly onto a DataFrame, are among the most popular data-file formats in Python. They usually carry the suffix .csv or .xlsx:
- CSV (Comma-Separated Values) is a plain-text format that uses commas as field separators (a minimal round trip for both formats is sketched after this list)
- CSV's strength is readability, but it becomes slow on large datasets and compresses less efficiently
- XLSX/XLS files are far more complex than CSV and are therefore much slower to read and write (not recommended)
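A minimal sketch of the round trip for both formats, with illustrative file names (writing and reading .xlsx assumes an Excel engine such as openpyxl is installed):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

# CSV: plain text, comma-separated
df.to_csv("demo.csv", index=False)
df_csv = pd.read_csv("demo.csv")

# XLSX: more complex format, noticeably slower to read and write
df.to_excel("demo.xlsx", index=False)
df_xlsx = pd.read_excel("demo.xlsx")
```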
1.2 Pickle serialization
The pickle module implements binary serialization and deserialization of Python object structures. The pickle protocol is Python-specific, which makes it very convenient for Python objects but can cause compatibility problems with other languages.
The most common approach is to use the pickle module to write and read files directly:
```python
import pickle

data_array = [[1.0, 2.0], [3.0, 4.0]]  # any picklable Python object

# Pickle: write
with open('data_array.pickle', 'wb') as f:
    pickle.dump(data_array, f)

# Pickle: read
with open('data_array.pickle', 'rb') as f:
    data_array_pickle = pickle.load(f)
```
Alternatively, you can use the read_pickle and to_pickle functions built into pandas (which appear to be slightly faster still).
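A minimal sketch of the pandas equivalent, with an illustrative file name:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# Serialize the DataFrame to a pickle file and load it back
df.to_pickle("data_frame.pkl")
df_restored = pd.read_pickle("data_frame.pkl")

assert df.equals(df_restored)
```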
1.3 JSON serialization
JSON is the most common serialization format; see the separate note on JSON basics (JSON基本说明) for details.
```python
import json

# JSON: write
data = {"x": 12153535.232321, "y": 35234531.232322}
with open('/usr/data/json-dump.json', 'w+') as f:
    json.dump(data, f)

# JSON: read
with open('/usr/data/json-dump.json', 'r') as f:
    data = json.load(f)
```
For a comparison of JSON and Pickle, see: [[4_book/Python编程/编写高质量代码改善 Python 程序的 91 个建议/4.《Python高质量代码的91个建议》库#10 建议 45:序列化的另一个选择——JSON]]
1.4 Other storage formats
- When storing large vectors of floating-point numbers, consider NumPy's `numpy.savetxt()` (see the sketch after this list)
- HDF is an open file format that organizes data in a directory-like hierarchy; it supports large, complex, heterogeneous data and suits "write once, read many" workloads. The current mainstream version is HDF5, a major upgrade over HDF4
- Parquet is the dominant columnar storage format in the Hadoop ecosystem; it is compatible with most Hadoop processing frameworks (well suited to distributed storage) and supports fast processing of complex data. Its columnar layout yields higher compression ratios and less I/O
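A minimal sketch of the `numpy.savetxt()` route for a vector of floats; the file name and precision are illustrative:

```python
import numpy as np

vec = np.random.rand(1000)  # example float vector

# Text-based storage: human-readable, but larger and slower than binary formats
np.savetxt("vector.txt", vec, fmt="%.8f")
vec_loaded = np.loadtxt("vector.txt")

assert np.allclose(vec, vec_loaded)
```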
2 Comparing the Storage Options
2.1 Approach and code
Basic idea: for the same tabular dataset, each storage format and compression method is benchmarked by repeating the experiment y times, each round consisting of x save and read operations (here %timeit -n5 -r5); the timing statistics across rounds serve as that method's final score.
Metrics compared: mean time per save, mean size of the written (compressed) file, and mean time per read.
Python implementation:
```python
import os
import pandas as pd  # version=1.3.5

filename = 'trf_op_op_info'
df = pd.read_csv(filename + '.csv')
display(df.sample(5))
display(df.shape)   # (28003, 222)
display(df.dtypes)
compress_list = [
    # CSV with index
    {'method_name': "CSV", 'save_format': "df.to_csv(filename + '_csv.csv')",
     'read_format': "pd.read_csv(filename + '_csv.csv')", 'suffix': "_csv.csv", 'method_describe': "CSV with index"},
    # CSV without index
    {'method_name': "CSV No Index", 'save_format': "df.to_csv(filename + '_csv.csv', index=False)",
     'read_format': "pd.read_csv(filename + '_csv.csv')", 'suffix': "_csv.csv", 'method_describe': "CSV without index"},
    # CSV No Index (GZIP)
    {'method_name': "CSV No Index (GZIP)", 'save_format': "df.to_csv(filename + '.gzip', index=False, compression='gzip')",
     'read_format': "pd.read_csv(filename + '.gzip', compression='gzip')", 'suffix': ".gzip", 'method_describe': "gzip-compressed CSV without index"},
    # CSV No Index (BZ2)
    {'method_name': "CSV No Index (BZ2)", 'save_format': "df.to_csv(filename + '.bz2', index=False, compression='bz2')",
     'read_format': "pd.read_csv(filename + '.bz2', compression='bz2')", 'suffix': ".bz2", 'method_describe': "bz2-compressed CSV without index"},
    # CSV No Index (ZIP)
    {'method_name': "CSV No Index (ZIP)", 'save_format': "df.to_csv(filename + '.zip', index=False, compression='zip')",
     'read_format': "pd.read_csv(filename + '.zip', compression='zip')", 'suffix': ".zip", 'method_describe': "zip-compressed CSV without index"},
    # CSV No Index (XZ)
    {'method_name': "CSV No Index (XZ)", 'save_format': "df.to_csv(filename + '.xz', index=False, compression='xz')",
     'read_format': "pd.read_csv(filename + '.xz', compression='xz')", 'suffix': ".xz", 'method_describe': "xz-compressed CSV without index"},
    # JSON
    {'method_name': "JSON", 'save_format': "df.to_json(filename + '.json')",
     'read_format': "pd.read_json(filename + '.json')", 'suffix': ".json", 'method_describe': "JSON serialization"},
    # JSON (GZIP)
    {'method_name': "JSON(GZIP)", 'save_format': "df.to_json(filename + '.json', compression='gzip')",
     'read_format': "pd.read_json(filename + '.json', compression='gzip')", 'suffix': ".json", 'method_describe': "gzip-compressed JSON serialization"},
    # JSON (BZ2)
    {'method_name': "JSON(BZ2)", 'save_format': "df.to_json(filename + '.json', compression='bz2')",
     'read_format': "pd.read_json(filename + '.json', compression='bz2')", 'suffix': ".json", 'method_describe': "bz2-compressed JSON serialization"},
    # JSON (ZIP)
    {'method_name': "JSON(ZIP)", 'save_format': "df.to_json(filename + '.json', compression='zip')",
     'read_format': "pd.read_json(filename + '.json', compression='zip')", 'suffix': ".json", 'method_describe': "zip-compressed JSON serialization"},
    # JSON (XZ)
    {'method_name': "JSON(XZ)", 'save_format': "df.to_json(filename + '.json', compression='xz')",
     'read_format': "pd.read_json(filename + '.json', compression='xz')", 'suffix': ".json", 'method_describe': "xz-compressed JSON serialization"},
    # Pickle
    {'method_name': "Pickle", 'save_format': "df.to_pickle(filename + '.pkl')",
     'read_format': "pd.read_pickle(filename + '.pkl')", 'suffix': ".pkl", 'method_describe': "binary Pickle serialization"},
    # Pickle (GZIP)
    {'method_name': "Pickle (GZIP)", 'save_format': "df.to_pickle(filename + '.pkl', compression='gzip')",
     'read_format': "pd.read_pickle(filename + '.pkl', compression='gzip')", 'suffix': ".pkl", 'method_describe': "gzip-compressed Pickle"},
    # Pickle (BZ2)
    {'method_name': "Pickle (BZ2)", 'save_format': "df.to_pickle(filename + '.pkl', compression='bz2')",
     'read_format': "pd.read_pickle(filename + '.pkl', compression='bz2')", 'suffix': ".pkl", 'method_describe': "bz2-compressed Pickle"},
    # Pickle (ZIP)
    {'method_name': "Pickle (ZIP)", 'save_format': "df.to_pickle(filename + '.pkl', compression='zip')",
     'read_format': "pd.read_pickle(filename + '.pkl', compression='zip')", 'suffix': ".pkl", 'method_describe': "zip-compressed Pickle"},
    # Pickle (XZ)
    {'method_name': "Pickle (XZ)", 'save_format': "df.to_pickle(filename + '.pkl', compression='xz')",
     'read_format': "pd.read_pickle(filename + '.pkl', compression='xz')", 'suffix': ".pkl", 'method_describe': "xz-compressed Pickle"},
    # HDF, no compression
    {'method_name': "HDF (no compression)", 'save_format': "df.to_hdf(filename + '.h5', key='key', mode='w', complevel=0)",
     'read_format': "pd.read_hdf(filename + '.h5', key='key', mode='r')", 'suffix': ".h5", 'method_describe': "HDF5 without compression"},
    # HDF, light compression
    {'method_name': "HDF (complevel=3)", 'save_format': "df.to_hdf(filename + '.h5', key='key', mode='w', complevel=3)",
     'read_format': "pd.read_hdf(filename + '.h5', key='key', mode='r')", 'suffix': ".h5", 'method_describe': "HDF5 with complevel=3"},
    # HDF, medium compression
    {'method_name': "HDF (complevel=6)", 'save_format': "df.to_hdf(filename + '.h5', key='key', mode='w', complevel=6)",
     'read_format': "pd.read_hdf(filename + '.h5', key='key', mode='r')", 'suffix': ".h5", 'method_describe': "HDF5 with complevel=6"},
    # HDF, maximum compression
    {'method_name': "HDF (complevel=9)", 'save_format': "df.to_hdf(filename + '.h5', key='key', mode='w', complevel=9)",
     'read_format': "pd.read_hdf(filename + '.h5', key='key', mode='r')", 'suffix': ".h5", 'method_describe': "HDF5 with complevel=9"},
    # Parquet (snappy)
    {'method_name': "Parquet(snappy)", 'save_format': "df.to_parquet(filename + '.parquet', engine='fastparquet', compression='snappy')",
     'read_format': "pd.read_parquet(filename + '.parquet')", 'suffix': ".parquet", 'method_describe': "snappy-compressed Parquet"},
    # Parquet (gzip)
    {'method_name': "Parquet(gzip)", 'save_format': "df.to_parquet(filename + '.parquet', engine='fastparquet', compression='gzip')",
     'read_format': "pd.read_parquet(filename + '.parquet')", 'suffix': ".parquet", 'method_describe': "gzip-compressed Parquet"},
    # Parquet (brotli)
    {'method_name': "Parquet(brotli)", 'save_format': "df.to_parquet(filename + '.parquet', engine='fastparquet', compression='brotli')",
     'read_format': "pd.read_parquet(filename + '.parquet')", 'suffix': ".parquet", 'method_describe': "brotli-compressed Parquet"},
]
def compress_performance(df_results, compress_detail):
    # ---saving---
    result_save = %timeit -n5 -r5 -o eval(compress_detail['save_format'])
    # ---get the size of the written file (MB)---
    file_size = os.path.getsize(filename + compress_detail['suffix']) / 1024**2
    # ---loading---
    result_read = %timeit -n5 -r5 -o eval(compress_detail['read_format'])
    # ---append the result to the results dataframe---
    row = {
        'method_name': compress_detail['method_name'],
        'file_size': file_size,
        'write_time_mean': result_save.average,
        'write_time_std': round(result_save.stdev, 5),
        'read_time_mean': result_read.average,
        'read_time_std': round(result_read.stdev, 5),
        'method_describe': compress_detail['method_describe'],
    }
    # DataFrame.append still works in pandas 1.3.x (it was removed in pandas 2.0)
    return df_results.append(pd.Series(row), ignore_index=True)
compress_results = pd.DataFrame(columns=['method_name', 'file_size', 'write_time_mean', 'write_time_std',
                                         'read_time_mean', 'read_time_std', 'method_describe'])

for compress_detail in compress_list:
    print('start check compress_method: ' + compress_detail['method_name'])
    compress_results = compress_performance(compress_results, compress_detail)

df = compress_results.copy()
# escape embedded newlines so multi-line text does not break the markdown table
df = df.apply(lambda x: x.str.replace('\n', '<br>') if x.dtype == 'object' else x)
md_table = df.to_markdown()  # requires the tabulate package
print(md_table)
```
Note: reading and writing HDF5 may require the additional `tables` package (`pip install tables`), while reading and writing Parquet requires the `fastparquet` package (`pip install fastparquet`).
2.2 Results and analysis
| method_name | file_size (MB) | write_time_mean (s) | write_time_std (s) | read_time_mean (s) | read_time_std (s) | method_describe |
|---|---|---|---|---|---|---|
| CSV | 24.5381 | 2.44883 | 0.01565 | 0.698063 | 0.00304 | CSV with index |
| CSV No Index | 24.3885 | 2.42836 | 0.02425 | 0.693689 | 0.00497 | CSV without index |
| CSV No Index (GZIP) | 5.65706 | 7.91941 | 0.03595 | 0.863275 | 0.00894 | gzip-compressed CSV without index |
| CSV No Index (BZ2) | 3.34583 | 5.12625 | 0.02599 | 1.57579 | 0.0036 | bz2-compressed CSV without index |
| CSV No Index (ZIP) | 5.82188 | 3.46738 | 0.01495 | 0.834589 | 0.00258 | zip-compressed CSV without index |
| CSV No Index (XZ) | 3.30216 | 19.2397 | 0.03484 | 1.06094 | 0.00139 | xz-compressed CSV without index |
| JSON | 91.797 | 1.73225 | 0.01243 | 3.84612 | 0.01924 | JSON serialization |
| JSON(GZIP) | 18.1919 | 14.7919 | 0.01356 | 4.23321 | 0.00772 | gzip-compressed JSON serialization |
| JSON(BZ2) | 12.421 | 16.6413 | 0.00407 | 6.21472 | 0.0103 | bz2-compressed JSON serialization |
| JSON(ZIP) | 18.4101 | 4.64663 | 0.00392 | 4.10226 | 0.00919 | zip-compressed JSON serialization |
| JSON(XZ) | 4.81302 | 46.1733 | 0.46348 | 4.48091 | 0.00479 | xz-compressed JSON serialization |
| Pickle | 52.564 | 0.667306 | 0.0109 | 0.120051 | 0.00373 | binary Pickle serialization |
| Pickle (GZIP) | 4.22666 | 10.5702 | 0.03953 | 0.290277 | 0.00039 | gzip-compressed Pickle |
| Pickle (BZ2) | 2.83825 | 14.0252 | 0.01365 | 1.20366 | 0.00115 | bz2-compressed Pickle |
| Pickle (ZIP) | 4.38329 | 1.81503 | 0.00999 | 0.350026 | 0.00133 | zip-compressed Pickle |
| Pickle (XZ) | 2.91055 | 14.8552 | 0.01723 | 0.523916 | 0.00144 | xz-compressed Pickle |
| HDF (no compression) | 53.6985 | 0.790838 | 0.01001 | 0.777448 | 0.00783 | HDF5 without compression |
| HDF (complevel=3) | 21.451 | 1.23071 | 0.00704 | 0.586584 | 0.00685 | HDF5 with complevel=3 |
| HDF (complevel=6) | 20.8419 | 2.19926 | 0.00831 | 0.624411 | 0.03532 | HDF5 with complevel=6 |
| HDF (complevel=9) | 20.3504 | 21.1893 | 0.01573 | 0.595076 | 0.01197 | HDF5 with complevel=9 |
| Parquet(snappy) | 7.73179 | 0.947902 | 0.01185 | 0.168116 | 0.04005 | snappy-compressed Parquet |
| Parquet(gzip) | 4.19815 | 2.59033 | 0.00396 | 0.143755 | 0.00358 | gzip-compressed Parquet |
| Parquet(brotli) | 3.2077 | 87.6438 | 0.04822 | 0.1367 | 0.00514 | brotli-compressed Parquet |
Summary:
- Dropping the index from a plain CSV improves file size and read/write speed, but only marginally
- JSON produces large raw files and its write times after compression are long; not recommended
- Compared with CSV, raw Pickle files are larger, but after compression the resulting files end up smaller
- The GZIP and XZ algorithms both cost a lot of write time while keeping read time low; XZ achieves a higher compression ratio at a somewhat higher time cost
- BZ2 + Pickle achieves the highest compression ratio of all the options, but its write and read times are correspondingly high; it suits scenarios where maximum compression matters most
- ZIP compresses slightly less well, but its write time is markedly shorter
- In this test scenario, HDF5 offers no particular advantage and is not recommended
- Parquet maintains a good compression ratio along with competitive write and read times, and is also well worth trying
Weighing compression ratio against read/write speed, I ultimately chose the Pickle + ZIP combination (a sketch follows below).
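For reference, a minimal sketch of the chosen Pickle + ZIP combination (file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": range(5), "y": [0.1, 0.2, 0.3, 0.4, 0.5]})

# Write the DataFrame as a zip-compressed pickle and read it back
df.to_pickle("data.pkl.zip", compression="zip")
df_restored = pd.read_pickle("data.pkl.zip", compression="zip")

assert df.equals(df_restored)
```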
References
- Pandas official API reference
- The Best Format to Save Pandas Data
- Still Saving Your Data in CSV? Try these other options