tsfresh Overview

1 Introduction
2 Data Formats
- 2.1 Input Data
- 2.2 Output Features
3 Basic Functionality
- 3.1 Feature Extraction
- 3.2 Feature Filtering
4 Advanced Features

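1 Introduction

tsfresh is a feature engineering tool built specifically for time series data.

Key features of tsfresh:

Automatic and efficient feature construction, run in parallel

Compatible with common Python data formats (pandas or scikit-learn)

Limitations of tsfresh:

Not suited to streaming data; it works better on offline (batch) data

No model training functionality (it aims to stay compatible with scikit-learn rather than reinvent the wheel)

Only the order of the observations is taken into account, so series with widely varying time intervals may produce problematic feature values

Quick installation

pip install tsfresh # install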

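2 Data Formats

2.1 Input Data

When calling tsfresh's feature extraction you specify four key parameters: column_id (required), column_sort (optional but recommended; if it is omitted, the rows are assumed to be already sorted), column_value (optional) and column_kind (optional). Take wind turbine failure prediction (predicting the failure probability of several wind turbines from the time series recorded by their sensors) as an example:

Here column_id is the unique turbine number (plus the prediction time), column_sort is the time at which a sensor reading was taken, column_value is the sensor reading itself, and column_kind is the sensor type.

Note that the data must not contain NaN, Inf or -Inf values.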
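A quick way to check this before extraction is sketched below. The toy frame, the column names and the choice to drop offending rows are all my own assumptions; whether dropping or imputing is the right fix depends on the data.

```python
import numpy as np
import pandas as pd

# Toy wide-format frame; the NaN and Inf values are planted on purpose.
df = pd.DataFrame({
    "id": ["A", "A", "B", "B"],
    "time": [1, 2, 1, 2],
    "x": [0.5, np.nan, 1.5, 2.0],
    "y": [1.0, 2.0, np.inf, 4.0],
})

value_columns = ["x", "y"]

# np.isfinite is False for NaN, Inf and -Inf alike.
finite_mask = np.isfinite(df[value_columns].to_numpy())
has_bad_values = not finite_mask.all()

# One possible cleanup (an assumption, not a recommendation):
# drop every row that contains a non-finite value.
clean_df = df[finite_mask.all(axis=1)].reset_index(drop=True)
```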

tsfresh supports three time series data formats:

Wide DataFrame (flat DataFrame)

id	time	x	y
A	t1	x(A,t1)	y(A,t1)
A	t2	x(A,t2)	y(A,t2)
B	t1	x(B,t1)	y(B,t1)
B	t2	x(B,t2)	y(B,t2)

column_id="id", column_sort="time", column_kind=None, column_value=None

Long DataFrame (stacked DataFrame)

id	time	kind	value
A	t1	x	x(A,t1)
A	t2	x	x(A,t2)
B	t1	x	x(B,t1)
B	t2	x	x(B,t2)
A	t1	y	y(A,t1)
A	t2	y	y(A,t2)
B	t1	y	y(B,t1)
B	t2	y	y(B,t2)

column_id="id", column_sort="time", column_kind="kind", column_value="value"
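The wide and long layouts carry the same information. A minimal sketch of converting one into the other with pandas.melt, using made-up numbers in place of x(A,t1) and so on:

```python
import pandas as pd

# Wide format: one column per kind of time series.
wide = pd.DataFrame({
    "id": ["A", "A", "B", "B"],
    "time": [1, 2, 1, 2],
    "x": [10.0, 11.0, 20.0, 21.0],
    "y": [0.1, 0.2, 0.3, 0.4],
})

# Long format: the kind moves into its own column, the readings into "value".
long = wide.melt(
    id_vars=["id", "time"],
    value_vars=["x", "y"],
    var_name="kind",
    value_name="value",
)
```

The resulting rows come out in the same order as the long table above: all x rows first, then all y rows.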

Dictionary of flat DataFrames

{'x':DataFrame(x),'y':DataFrame(y)}

where DataFrame(x) looks like this:

id	time	value
A	t1	x(A,t1)
A	t2	x(A,t2)
B	t1	x(B,t1)
B	t2	x(B,t2)

and DataFrame(y) looks like this:

id	time	value
A	t1	y(A,t1)
A	t2	y(A,t2)
B	t1	y(B,t1)
B	t2	y(B,t2)
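Such a dictionary can be built from a long DataFrame by splitting on the kind column. The sketch below uses made-up values. As far as I know, a dictionary input is then passed with column_id="id", column_sort="time", column_kind=None, column_value="value", but please check that against the tsfresh documentation.

```python
import pandas as pd

long = pd.DataFrame({
    "id": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "time": [1, 2, 1, 2, 1, 2, 1, 2],
    "kind": ["x"] * 4 + ["y"] * 4,
    "value": [10.0, 11.0, 20.0, 21.0, 0.1, 0.2, 0.3, 0.4],
})

# One flat DataFrame (id, time, value) per kind of time series.
frames = {
    kind: group.drop(columns="kind").reset_index(drop=True)
    for kind, group in long.groupby("kind")
}
```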

2.2 Output Features

Example:

id	x_feature_1	...	x_feature_N	y_feature_1	...	y_feature_N
A	...	...	...	...	...	...
B	...	...	...	...	...	...

Feature naming convention:

{time_series_name}__{feature_name}__{parameter name 1}_{parameter value 1}__[..]__{parameter name k}_{parameter value k}

Feature naming examples:

temperature__quantile__q_0.6

Pressure__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_14__w_5

Given concrete feature names, the built-in function tsfresh.feature_extraction.settings.from_columns converts them back into the corresponding settings dictionary.
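To make the naming convention concrete, here is a small standalone parser. It is my own sketch, not tsfresh code: parse_feature_name is a hypothetical helper, and it assumes the time series name itself contains no double underscore.

```python
import ast


def parse_feature_name(feature_name):
    """Split a feature name into (series name, calculator name, parameters)."""
    parts = feature_name.split("__")
    series_name, calculator = parts[0], parts[1]
    params = {}
    for part in parts[2:]:
        # The parameter value follows the last underscore of each part.
        key, raw_value = part.rsplit("_", 1)
        try:
            params[key] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Keep values that are not Python literals as plain strings.
            params[key] = raw_value
    return series_name, calculator, params


simple = parse_feature_name("temperature__quantile__q_0.6")
nested = parse_feature_name(
    "Pressure__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_14__w_5"
)
```

For real work, from_columns is the better choice since it knows the library's own quoting rules.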

3 Basic Functionality

3.1 Feature Extraction

tsfresh computes features mainly through the function tsfresh.extract_features()

def extract_features(timeseries_container, default_fc_parameters=None,
                     kind_to_fc_parameters=None,
                     column_id=None, column_sort=None, column_kind=None, column_value=None,
                     chunksize=defaults.CHUNKSIZE,
                     n_jobs=defaults.N_PROCESSES, show_warnings=defaults.SHOW_WARNINGS,
                     disable_progressbar=defaults.DISABLE_PROGRESSBAR,
                     impute_function=defaults.IMPUTE_FUNCTION,
                     profile=defaults.PROFILING,
                     profiling_filename=defaults.PROFILING_FILENAME,
                     profiling_sorting=defaults.PROFILING_SORTING,
                     distributor=None, pivot=True):
    """
    :param timeseries_container: input data, a DataFrame or a dict of DataFrames
    :param default_fc_parameters: global setting of the features to compute (applied to every kind of time series), dict
    :param kind_to_fc_parameters: per-kind setting of the features to compute, dict
    :param column_id/column_sort/column_kind/column_value: column names, str
    :param n_jobs: number of parallel processes, defaults to half of the available CPUs; 0 disables parallelization, int
    :param chunksize: number of time series whose features are computed per task, chosen heuristically by default, None or int
    :param show_warnings: whether to show warnings raised during feature calculation, defaults to False, bool
    :param disable_progressbar: whether to hide the progress bar during feature calculation, defaults to False, bool

    :return: The (maybe imputed) DataFrame containing extracted features.
    """

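tsfresh ships three preset feature sets for quick experiments:

tsfresh.feature_extraction.settings.ComprehensiveFCParameters: used when the default_fc_parameters argument of extract_features() is left at its default; contains all built-in feature calculators

tsfresh.feature_extraction.settings.MinimalFCParameters: the minimal set of built-in feature calculators (all calculators carrying the "minimal" attribute), handy for quick debugging and testing

tsfresh.feature_extraction.settings.EfficientFCParameters: leaves out the computationally expensive calculators (those carrying the "high_comp_cost" attribute), balancing coverage and efficiency

You can also define a custom feature set, for example: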

fc_parameters = {
    "length": None,
    "large_standard_deviation": [{"r": 0.05}, {"r": 0.1}]
}

You can also define a separate feature set for each kind of time series:

kind_to_fc_parameters = {
    "temperature": {"mean": None},
    "pressure": {"maximum": None, "minimum": None}
}
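To see which output columns a settings dictionary would roughly produce, the naming convention from section 2.2 can be applied by hand. This is a sketch: expected_feature_names is a hypothetical helper of mine, it only covers calculators with plain numeric parameters, and combiner calculators name their outputs themselves.

```python
def expected_feature_names(kind, fc_parameters):
    """List the column names a settings dict would yield for one kind."""
    names = []
    for calculator, param_list in fc_parameters.items():
        if not param_list:
            # None means the calculator takes no parameters.
            names.append(f"{kind}__{calculator}")
            continue
        for params in param_list:
            suffix = "__".join(f"{key}_{params[key]}" for key in sorted(params))
            names.append(f"{kind}__{calculator}__{suffix}")
    return names


fc_parameters = {
    "length": None,
    "large_standard_deviation": [{"r": 0.05}, {"r": 0.1}],
}
kind_to_fc_parameters = {
    "temperature": {"mean": None},
    "pressure": {"maximum": None, "minimum": None},
}

global_names = expected_feature_names("temperature", fc_parameters)
per_kind_names = [
    name
    for kind, params in kind_to_fc_parameters.items()
    for name in expected_feature_names(kind, params)
]
```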

See also: the detailed list of tsfresh built-in features, and the source code of the built-in feature calculators.

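3.2 Feature Filtering

tsfresh filters features mainly through the function tsfresh.feature_selection.relevance.calculate_relevance_table. For a given classification or regression task it tests how relevant each feature is to the target, then marks features as irrelevant while keeping the false discovery rate at the chosen level.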

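def calculate_relevance_table(X, y, ml_task='auto', n_jobs=1, show_warnings=False, chunksize=None,
                              test_for_binary_target_binary_feature='fisher',
                              test_for_binary_target_real_feature='mann',
                              test_for_real_target_binary_feature='mann',
                              test_for_real_target_real_feature='kendall',
                              fdr_level=0.05, hypotheses_independent=False):
    """
    Evaluate, for a given type of model, how strongly each feature is associated with the target, using hypothesis tests and p-values
    :param X: pandas.DataFrame containing all features
    :param y: pandas.Series or numpy.ndarray containing the target values
    :param ml_task: type of task, one of 'classification', 'regression' or 'auto' (default), str
    :param test_for_binary_target_binary_feature / test_for_binary_target_real_feature / test_for_real_target_binary_feature / test_for_real_target_real_feature: hypothesis test to use for each combination of target and feature type (e.g. 'fisher', 'mann', 'kendall', as in the defaults above), str
    :param fdr_level: the false discovery rate (FDR) level to respect, float
    :param hypotheses_independent: whether the features can be assumed independent; in practice almost always False, bool
    :param n_jobs: number of processes used for the hypothesis tests, int
    :param show_warnings: whether to show warnings, bool

    :return: a pandas.DataFrame listing, for each feature, its name, type, hypothesis test p-value and relevance
    """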
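How fdr_level and hypotheses_independent interact can be illustrated with a plain-Python step-up procedure. This is my own sketch of the Benjamini-Hochberg rule (independent hypotheses) and the Benjamini-Yekutieli rule (dependent hypotheses). I am assuming, not quoting, that tsfresh does the equivalent internally; fdr_select is a hypothetical helper name.

```python
def fdr_select(p_values, fdr_level=0.05, hypotheses_independent=False):
    """Return {feature name: relevant?} under a step-up FDR procedure."""
    m = len(p_values)
    if m == 0:
        return {}
    # Benjamini-Yekutieli adds the harmonic-sum correction for dependence.
    correction = 1.0 if hypotheses_independent else sum(1.0 / i for i in range(1, m + 1))
    ranked = sorted(p_values.items(), key=lambda item: item[1])

    # Largest rank k whose p-value is below its rank-dependent threshold.
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * fdr_level / correction:
            cutoff = rank

    # Every feature ranked at or before the cutoff counts as relevant.
    return {name: rank <= cutoff for rank, (name, _) in enumerate(ranked, start=1)}


p_values = {"a": 0.001, "b": 0.02, "c": 0.04, "d": 0.5}
independent = fdr_select(p_values, fdr_level=0.05, hypotheses_independent=True)
dependent = fdr_select(p_values, fdr_level=0.05, hypotheses_independent=False)
```

With the same p-values, the dependent variant keeps fewer features, which is why leaving hypotheses_independent at False is the conservative choice.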

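The built-in hypothesis tests live in the module tsfresh.feature_selection.significance_tests. The tests themselves mostly rely on scipy.stats, and custom tests are supported as well (I have not tried this).

See also: the source code of the tsfresh built-in hypothesis tests.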
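For orientation, the three test names used as defaults above can be tried directly with scipy.stats. Mapping 'fisher', 'mann' and 'kendall' to fisher_exact, mannwhitneyu and kendalltau is my assumption, and the data is made up so that every association is obvious.

```python
from scipy import stats

binary_target = [0, 0, 0, 0, 1, 1, 1, 1]
binary_feature = [0, 0, 0, 0, 1, 1, 1, 1]
real_feature = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
real_target = [1.0, 1.5, 2.0, 2.5, 5.0, 5.5, 6.0, 6.5]

# Binary target, binary feature: Fisher's exact test on the 2x2 contingency table.
table = [[0, 0], [0, 0]]
for target, feature in zip(binary_target, binary_feature):
    table[target][feature] += 1
_, p_fisher = stats.fisher_exact(table)

# Binary target, real feature: Mann-Whitney U test between the two target groups.
group0 = [v for v, t in zip(real_feature, binary_target) if t == 0]
group1 = [v for v, t in zip(real_feature, binary_target) if t == 1]
_, p_mann = stats.mannwhitneyu(group0, group1, alternative="two-sided")

# Real target, real feature: Kendall's tau rank correlation test.
_, p_kendall = stats.kendalltau(real_feature, real_target)
```

Each p-value is what would then enter the FDR step as the evidence for one feature.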

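4 Advanced Features

4.1 Custom feature calculators

Depending on whether a calculator returns one feature or several, it is either a simple calculator or a combiner calculator

A simple (single return value) example (computes the mode)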

from collections import Counter

from tsfresh.feature_extraction import feature_calculators


@feature_calculators.set_property("fctype", "simple")
def mode(x):
    """Return the mode of the time series (i.e. its most common value).

    :param x: the time series to calculate the feature of
    :type x: np.ndarray
    :return: the most common value (the first one encountered if there is a tie)
    :return type: float
    """
    # A "simple" calculator has to return a single value, so a tie is resolved
    # by taking the first most common value instead of returning a tuple.
    return Counter(x).most_common(1)[0][0]

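A combiner (multiple return values) example (counts how often each distinct value occurs)

import numpy as np
from tsfresh.feature_extraction import feature_calculators


@feature_calculators.set_property("fctype", "combiner")
def value_count_all(x, param):
    """
    Return the number of occurrences of each distinct value in x.

    :param x: the time series on which to calculate the feature.
    :type x: np.ndarray
    :param param: unused, pass None
    :return: (feature name suffix, count) pairs, one per distinct value
    :return type: list
    """
    values, counts = np.unique(x, return_counts=True)

    # np.unique already returns the count that belongs to each distinct value
    return [('value_count__value_"{}"'.format(value), count)
            for value, count in zip(values, counts)]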

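Adding attributes to a custom calculator (returns the value and index of the most recent observation)

@feature_calculators.set_property("input", "pd.Series")  # input is passed as a pd.Series (the default is a numpy array)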
@feature_calculators.set_property("fctype", "combiner")
@feature_calculators.set_property("minimal", True)
def last(x, param):
    """Return the last index and value of x.

    :param x: the time series on which to calculate the feature.
    :type x: pandas.Series
    :return: the value of this feature
    :return type: list
    """
    return [("value",x.values[-1]), ("index",x.index[-1])]

Examples of the resulting feature names: DiasBP_24_0__last__value; DiasBP_24_0__last__index
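The decorator itself does very little: it attaches an attribute to the function, and the extraction code later reads it. The following standalone sketch imitates that without importing tsfresh. set_property here is my own stand-in, feature_names is a hypothetical helper, and the index is simply the list position rather than a pandas index.

```python
def set_property(key, value):
    """Stand-in for feature_calculators.set_property: tag a function with an attribute."""
    def decorate(func):
        setattr(func, key, value)
        return func
    return decorate


@set_property("fctype", "combiner")
@set_property("minimal", True)
def last(x, param):
    # Plain-list version of the calculator above: last value and its position.
    return [("value", x[-1]), ("index", len(x) - 1)]


def feature_names(kind, func, x, param=None):
    """Compose output names as {kind}__{calculator name}__{key returned by the combiner}."""
    return {f"{kind}__{func.__name__}__{key}": value for key, value in func(x, param)}


result = feature_names("DiasBP_24_0", last, [80, 82, 79])
```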

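Registering custom calculators

All registered calculators are collected in the dict-like class tsfresh.feature_extraction.settings.ComprehensiveFCParameters

A custom calculator can be registered by updating that dictionary; see the function source for the expected layout of the dictionary

A way of adding them that I find fairly elegant (taken from the GitHub issues):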

from tsfresh.feature_extraction import feature_calculators
custom_functions = [mode, value_count_all, last]
for func in custom_functions:
    setattr(feature_calculators, func.__name__, func)
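Why setattr is enough: as far as I can tell, string keys in a settings dictionary are resolved by attribute lookup on the feature_calculators module. The sketch below imitates that lookup with a SimpleNamespace standing in for the module; registry and value_range are hypothetical names of mine.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the tsfresh.feature_extraction.feature_calculators module.
registry = SimpleNamespace()


def value_range(x):
    """Toy calculator: spread between the largest and the smallest value."""
    return max(x) - min(x)


value_range.fctype = "simple"

# Registration: make the function reachable under its own name.
setattr(registry, value_range.__name__, value_range)

# Settings refer to calculators by name, so lookup is a plain getattr.
fc_parameters = {"value_range": None}
series = [3, 9, 4]
features = {}
for name, params in fc_parameters.items():
    func = getattr(registry, name)
    features[name] = func(series)
```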

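4.2 High efficiency and low cost

tsfresh parallelizes feature extraction and feature selection by default to speed up processing

When you need more speed, or memory consumption becomes too high, consider converting the pandas.DataFrame into a dask.DataFrame. dask mirrors much of the pandas API while handling memory consumption, compute performance and distributed execution better, and tsfresh itself supports dask fairly well.

When the workload is too large, distributed computation may be needed. tsfresh ships the tsfresh.utilities.distribution module to make this easier. Specifically, a distributor heuristically picks a suitable chunk size, splits the original time series data and dispatches the chunks to the different machines, then collects the computed features and finally releases the connections and resources it held. See the code example for details.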
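The chunk, dispatch, collect and release cycle can be sketched with the standard library alone. A thread pool stands in for remote workers here, the mean stands in for real feature calculators, and the chunk size heuristic (about five chunks per worker) is my recollection of tsfresh's behaviour, so treat it as an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def best_chunk_size(data_length, n_workers):
    """Heuristic chunk size: aim for roughly five chunks per worker."""
    chunk_size, extra = divmod(data_length, n_workers * 5)
    if extra:
        chunk_size += 1
    return max(chunk_size, 1)


def compute_chunk(chunk):
    """Work done on one worker: one feature (the mean) per time series."""
    return [(series_id, mean(values)) for series_id, values in chunk]


series = {f"s{i}": [i, i + 1, i + 2] for i in range(23)}
n_workers = 2

# 1. Split the data into chunks.
items = list(series.items())
size = best_chunk_size(len(items), n_workers)
chunks = [items[i:i + size] for i in range(0, len(items), size)]

# 2. Dispatch the chunks; leaving the with-block releases the workers.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(compute_chunk, chunks))

# 3. Collect the partial results into the final feature table.
features = {series_id: value for chunk_result in results for series_id, value in chunk_result}
```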

Since dask itself supports distributed computation well, you can have the best of both worlds. See the code example for details.

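4.3 Other features

- Supports rolling time series (the rolling window is configurable; see the documentation and the function reference)
- Supports scikit-learn pipelines; see the related examples
- Supports profiling of the feature calculators (see the source code); enable it through the profile-related parameters of extract_features
- As with dask, tsfresh is also compatible with Spark computation graphs; see tsfresh.convenience.bindings.spark_feature_extraction_on_chunk().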
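The rolling idea from the first item above can be pictured with a few lines of plain Python: every time step gets its own sub-series made of the most recent observations, and features are then extracted per sub-series. This only sketches the concept. roll_windows and window_size are names I made up; tsfresh's own helper is roll_time_series in tsfresh.utilities.dataframe_functions, whose parameters differ.

```python
def roll_windows(values, window_size):
    """Map each end position to the window of at most window_size values ending there."""
    windows = {}
    for end in range(len(values)):
        start = max(0, end - window_size + 1)
        windows[end] = values[start:end + 1]
    return windows


windows = roll_windows([1, 2, 3, 4], window_size=2)
```

Each window only contains values up to its own end position, so features built from it do not leak future information into a forecasting target.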

Personal notes

Digital Garden | 王半仙