View the Jupyter notebook on GitHub.

Feature selection#


This notebook contains simple examples of using the feature selection transforms from the ETNA library.

Table of contents

  • Loading dataset

  • Feature selection methods

    • Intro to feature selection

    • TreeFeatureSelectionTransform

    • GaleShapleyFeatureSelectionTransform

    • MRMRFeatureSelectionTransform

  • Summary

[1]:
import warnings

warnings.filterwarnings("ignore")
[2]:
import time
from copy import deepcopy

import pandas as pd

from etna.analysis import plot_backtest
from etna.datasets import TSDataset
from etna.metrics import SMAPE
from etna.models import CatBoostPerSegmentModel
from etna.pipeline import Pipeline
from etna.transforms import DateFlagsTransform
from etna.transforms import FourierTransform
from etna.transforms import HolidayTransform
from etna.transforms import LagTransform
from etna.transforms import MeanTransform
from etna.transforms import TrendTransform
[3]:
HORIZON = 60

1. Loading dataset#

We are going to work with the time series from Tabular Playground Series - Jan 2022. The dataset contains daily merchandise sales — mugs, hats, and stickers — at two imaginary store chains across three Scandinavian countries. As exogenous data, we will use Finland, Norway, and Sweden Weather Data 2015-2019 dataset containing daily country average precipitation, snow depth and air temperature data.

[4]:
df = pd.read_csv("data/nordic_merch_sales.csv")
[5]:
df = TSDataset.to_dataset(df)
ts = TSDataset(df, freq="D")
ts.plot(4)
../_images/tutorials_207-feature_selection_8_0.png

2. Feature selection methods#

2.1 Intro to feature selection#

Let's create features and build a pipeline for the dataset:

[6]:
transforms = [
    TrendTransform(in_column="target", out_column="trend"),
    LagTransform(in_column="target", lags=range(HORIZON, 100), out_column="target_lag"),
    DateFlagsTransform(
        day_number_in_month=True, day_number_in_week=False, is_weekend=False, out_column="datetime_flag"
    ),
    MeanTransform(in_column=f"target_lag_{HORIZON}", window=12, seasonality=7, out_column="mean_transform"),
    FourierTransform(period=365, order=30, out_column="fourier_year"),
    FourierTransform(period=30.4, order=10, out_column="fourier_month"),
    HolidayTransform(iso_code="SWE", out_column="SWE_holidays"),
    HolidayTransform(iso_code="NOR", out_column="NOR_holidays"),
    HolidayTransform(iso_code="FIN", out_column="FIN_holidays"),
]

pipeline = Pipeline(model=CatBoostPerSegmentModel(), transforms=transforms, horizon=HORIZON)
[7]:
start_time = time.perf_counter()
metrics_baseline, forecast_baseline, _ = pipeline.backtest(ts=ts, metrics=[SMAPE()], n_folds=1)
time_baseline = time.perf_counter() - start_time
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:  1.8min
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:  1.8min
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    1.0s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    1.0s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.1s
[8]:
plot_backtest(forecast_baseline, ts, history_len=60, segments=ts.segments[:4])
../_images/tutorials_207-feature_selection_14_0.png
[9]:
smape_baseline = metrics_baseline["SMAPE"].mean()
print(f"SMAPE baseline: {smape_baseline:.3f}")
print(f"Time baseline: {time_baseline:.3f}")
SMAPE baseline: 8.094
Time baseline: 106.122

Let's look at how many features are used when fitting the model:

[10]:
temp_ts = deepcopy(ts)
temp_ts.fit_transform(transforms=transforms)

num_features = len(temp_ts.columns.get_level_values("feature").unique()) - 1  # subtract 1 to exclude the "target" column itself
print(f"Num features: {num_features}")
Num features: 126

Our pipeline now has too many features, takes a long time to fit and predict, and its metrics are far from perfect. ETNA provides several methods to improve this by removing some of the features.

2.2 TreeFeatureSelectionTransform#

This method uses a tree-based model to estimate the importance of each feature and then keeps only the top_k best features. The class accepts the following parameters:

  • model: model for estimating feature importance; it must have a fit method and a feature_importances_ attribute;

  • top_k: number of features to select;

  • features_to_use: which features participate in feature selection;

  • return_features: indicates whether to save the removed features; set it to True only if you want to return them to the dataset during inverse_transform.
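
Conceptually, the selection step boils down to fitting the model on the feature matrix, ranking features by feature_importances_ and keeping the top_k best ones. Below is a minimal sketch of that idea using plain sklearn; the function and argument names are illustrative, not ETNA internals:

import numpy as np
from sklearn.tree import DecisionTreeRegressor


def select_top_k_features(X, y, feature_names, top_k=20):
    # fit a tree and rank features by its feature_importances_ attribute
    model = DecisionTreeRegressor(random_state=42)
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1][:top_k]
    return [feature_names[i] for i in order]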

Let’s optimize our baseline:

[11]:
from sklearn.tree import DecisionTreeRegressor

from etna.transforms import TreeFeatureSelectionTransform

model_feature_selection = DecisionTreeRegressor(random_state=42)
feature_selector_transform = TreeFeatureSelectionTransform(model=model_feature_selection, top_k=20)
pipeline = Pipeline(
    model=CatBoostPerSegmentModel(), transforms=transforms + [feature_selector_transform], horizon=HORIZON
)
[12]:
start_time = time.perf_counter()
metrics_tree_feature_selector, forecast_tree_feature_selector, _ = pipeline.backtest(
    ts=ts, metrics=[SMAPE()], n_folds=1
)
time_tree_feature_selector = time.perf_counter() - start_time
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:   37.6s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:   37.6s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    1.1s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    1.1s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.1s
[13]:
plot_backtest(forecast_tree_feature_selector, ts, history_len=60, segments=ts.segments[:4])
../_images/tutorials_207-feature_selection_23_0.png

Let's check whether the metrics have improved:

[14]:
smape_tree_feature_selector = metrics_tree_feature_selector["SMAPE"].mean()
print(f"SMAPE with TreeFeatureSelectionTransform: {smape_tree_feature_selector:.3f}")
print(f"Time with TreeFeatureSelectionTransform: {time_tree_feature_selector:.3f}")
SMAPE with TreeFeatureSelectionTransform: 7.371
Time with TreeFeatureSelectionTransform: 38.784

Results:

  • SMAPE is better;

  • runtime is lower by around 65%.

ETNA also provides methods to plot the importance of each feature:

[15]:
from etna.analysis import ModelRelevanceTable
from etna.analysis import plot_feature_relevance

temp_ts = deepcopy(ts)
temp_ts.fit_transform(transforms=transforms)

plot_feature_relevance(
    temp_ts,
    relevance_table=ModelRelevanceTable(),
    top_k=20,
    relevance_params={"model": model_feature_selection},
)
../_images/tutorials_207-feature_selection_28_0.png

2.3 GaleShapleyFeatureSelectionTransform#

Let's assume that we have a table of relevances of size \(N_{f} \times N_{s}\), where \(N_{f}\) is the number of features and \(N_{s}\) is the number of segments.

The feature filtering procedure consists of \(\lceil \frac{k}{N_{s}} \rceil\) iterations, where \(k\) is top_k. Each iteration works as follows (a toy sketch of the matching step is given after the list):

  • build a matching between segments and features using the Gale–Shapley algorithm according to the relevance table; during the matching, segments send proposals to features;

  • select features to add by taking matched feature for each segment;

  • add the selected features to the accumulated list of selected features, making sure that this list does not exceed top_k in size;

  • remove added features from future consideration.
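
To make the matching step more concrete, here is a toy sketch of one Gale–Shapley (deferred acceptance) matching in plain Python. It is only an illustration, with the simplifying assumption that a feature prefers the segment for which it is more relevant, and with made-up segment and feature names; it is not the library's implementation:

def match_segments_to_features(relevance):
    # relevance[segment][feature] is a relevance score, higher is better
    segments = list(relevance)
    # each segment proposes to features in order of decreasing relevance
    proposals = {s: sorted(relevance[s], key=relevance[s].get, reverse=True) for s in segments}
    next_choice = {s: 0 for s in segments}
    engaged = {}  # feature -> segment currently holding it
    free = list(segments)
    while free:
        s = free.pop()
        if next_choice[s] >= len(proposals[s]):
            continue  # this segment has already proposed to every feature
        f = proposals[s][next_choice[s]]
        next_choice[s] += 1
        if f not in engaged:
            engaged[f] = s
        elif relevance[s][f] > relevance[engaged[f]][f]:
            # the feature "prefers" the segment for which it is more relevant
            free.append(engaged[f])
            engaged[f] = s
        else:
            free.append(s)
    return {segment: feature for feature, segment in engaged.items()}


relevance = {
    "segment_a": {"target_lag_60": 0.9, "trend": 0.4},
    "segment_b": {"target_lag_60": 0.7, "trend": 0.8},
}
match_segments_to_features(relevance)  # segment_a -> target_lag_60, segment_b -> trend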

The library provides two ways to build such a table:

  • etna.analysis.StatisticsRelevanceTable: calculates p-value relevances using tsfresh;

  • etna.analysis.ModelRelevanceTable: calculates standard feature-importance relevances using tree-based models.

In this example let’s try StatisticsRelevanceTable.

[16]:
from etna.analysis import StatisticsRelevanceTable
from etna.transforms import GaleShapleyFeatureSelectionTransform

rt = StatisticsRelevanceTable()
feature_selector_transform = GaleShapleyFeatureSelectionTransform(top_k=20, relevance_table=rt)
pipeline = Pipeline(
    model=CatBoostPerSegmentModel(), transforms=transforms + [feature_selector_transform], horizon=HORIZON
)
[17]:
start_time = time.perf_counter()
metrics_galeshapley_feature_selector, forecast_galeshapley_feature_selector, _ = pipeline.backtest(
    ts=ts, metrics=[SMAPE()], n_folds=1
)
time_galeshapley_feature_selector = time.perf_counter() - start_time
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:   42.6s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:   42.6s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    1.0s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    1.0s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.1s
[18]:
plot_backtest(forecast_galeshapley_feature_selector, ts, history_len=60, segments=ts.segments[:4])
../_images/tutorials_207-feature_selection_34_0.png
[19]:
smape_galeshapley_feature_selector = metrics_galeshapley_feature_selector["SMAPE"].mean()
print(f"SMAPE with GaleShapleyFeatureSelectionTransform: {smape_galeshapley_feature_selector:.3f}")
print(f"Time with GaleShapleyFeatureSelectionTransform: {time_galeshapley_feature_selector:.3f}")
SMAPE with GaleShapleyFeatureSelectionTransform: 7.402
Time with GaleShapleyFeatureSelectionTransform: 43.729

Results:

  • SMAPE is better;

  • runtime is lower by around 60%.

2.4 MRMRFeatureSelectionTransform#

MRMR stands for “Maximum Relevance - Minimum Redundancy”. The idea is to select features that are informative for the target and are not correlated with each other. At each iteration we select a single feature that has maximum relevance for the target and minimum redundancy with the already selected set of features.

To achieve this, we select the feature \(f\) that maximizes the score:

\[score(f) = \frac{\operatorname{relevance}(f | target)}{\operatorname{redundancy}(f | prev)},\]

where \(prev\) is the set of previously selected features.

We use:

\[\begin{split}\operatorname{relevance}(f | target) = \operatorname{F}(f, target), \\ \operatorname{redundancy}(f | prev) = \sum_{s \in prev} |\rho(f, s)| \cdot \frac{1}{i-1},\end{split}\]

where \(\operatorname{F}\) is the F-statistic, \(\rho\) is the Pearson correlation, and \(i\) is the iteration number.
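
For intuition, here is a small numpy/sklearn sketch of this greedy loop. It is a conceptual illustration rather than ETNA's implementation, and it assumes a plain feature matrix X and target vector y:

import numpy as np
from sklearn.feature_selection import f_regression


def mrmr_select(X, y, top_k):
    relevance, _ = f_regression(X, y)  # F-statistic of each feature against the target
    corr = np.abs(np.corrcoef(X, rowvar=False))  # |Pearson correlation| between features
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(top_k):
        if selected:
            # mean |correlation| with already selected features, i.e. sum / (i - 1)
            redundancy = corr[:, selected].mean(axis=1) + 1e-10  # epsilon avoids division by zero
        else:
            redundancy = np.ones(X.shape[1])  # no redundancy penalty at the first step
        scores = relevance / redundancy
        best = max(remaining, key=lambda j: scores[j])
        selected.append(best)
        remaining.remove(best)
    return selected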

Let's look at a usage example. This time we are going to try ModelRelevanceTable.

[20]:
from etna.transforms import MRMRFeatureSelectionTransform

rt = ModelRelevanceTable()
feature_selector_transform = MRMRFeatureSelectionTransform(
    top_k=20,
    relevance_table=rt,
    fast_redundancy=True,
    model=model_feature_selection,
)
pipeline = Pipeline(
    model=CatBoostPerSegmentModel(), transforms=transforms + [feature_selector_transform], horizon=HORIZON
)

Here we set fast_redundancy=True to use the optimized version of the algorithm; it is not the default for backward-compatibility reasons.

[21]:
start_time = time.perf_counter()
metrics_mrmr_feature_selector, forecast_mrmr_feature_selector, _ = pipeline.backtest(
    ts=ts, metrics=[SMAPE()], n_folds=1
)
time_mrmr_feature_selector = time.perf_counter() - start_time
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:  1.2min
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:  1.2min
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    1.2s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    1.2s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done   1 tasks      | elapsed:    0.1s
[22]:
plot_backtest(forecast_mrmr_feature_selector, ts, history_len=60, segments=ts.segments[:4])
../_images/tutorials_207-feature_selection_43_0.png
[23]:
smape_mrmr_feature_selector = metrics_mrmr_feature_selector["SMAPE"].mean()
print(f"SMAPE with MRMRFeatureSelectionTransform: {smape_mrmr_feature_selector:.3f}")
print(f"Time with MRMRFeatureSelectionTransform: {time_mrmr_feature_selector:.3f}")
SMAPE with MRMRFeatureSelectionTransform: 7.416
Time with MRMRFeatureSelectionTransform: 70.717

Results:

  • SMAPE is better;

  • runtime is lower by around 35%.

3. Summary#

In this notebook we have tried out several feature selection methods. As we have seen, they can improve metrics and decrease the runtime of your pipeline.