View Jupyter notebook on the GitHub.

Get started#

Binder

This notebook contains the simple examples of time series forecasting pipeline using ETNA library.

Table of contents

  • Loading dataset

  • Plotting

  • Forecasting single time series

    • Naive forecast

    • Prophet

    • Catboost

  • Forecasting multiple time series

[1]:
!pip install "etna[prophet]" -q
[2]:
import warnings

warnings.filterwarnings(action="ignore", message="Torchmetrics v0.9")
[3]:
import pandas as pd

1. Loading dataset#

Let’s load and look at the dataset

[4]:
original_df = pd.read_csv("data/monthly-australian-wine-sales.csv")
original_df.head()
[4]:
month sales
0 1980-01-01 15136
1 1980-02-01 16733
2 1980-03-01 20016
3 1980-04-01 17708
4 1980-05-01 18019

etna_ts is strict about data format:

  • column we want to predict should be called target

  • column with datatime data should be called timestamp

  • because etna is always ready to work with multiple time series, column segment is also compulsory

Our library works with the special data structure TSDataset. So, before starting anything, we need to convert the classical DataFrame to TSDataset.

Let’s rename first

[5]:
original_df["timestamp"] = pd.to_datetime(original_df["month"])
original_df["target"] = original_df["sales"]
original_df.drop(columns=["month", "sales"], inplace=True)
original_df["segment"] = "main"
original_df.head()
[5]:
timestamp target segment
0 1980-01-01 15136 main
1 1980-02-01 16733 main
2 1980-03-01 20016 main
3 1980-04-01 17708 main
4 1980-05-01 18019 main

Time to convert to TSDataset!

To do this, we initially need to convert the classical DataFrame to the special format.

[6]:
from etna.datasets.tsdataset import TSDataset

df = TSDataset.to_dataset(original_df)
df.head()
[6]:
segment main
feature target
timestamp
1980-01-01 15136
1980-02-01 16733
1980-03-01 20016
1980-04-01 17708
1980-05-01 18019

Now we can construct the TSDataset.

Additionally to passing dataframe we should specify frequency of our data. In this case it is monthly data.

[7]:
ts = TSDataset(df, freq="1M")
/Users/d.a.binin/Documents/tasks/etna-github/etna/datasets/tsdataset.py:147: UserWarning: You probably set wrong freq. Discovered freq in you data is MS, you set 1M
  warnings.warn(

Oups. Let’s fix that by looking at the table of offsets in pandas:

[8]:
ts = TSDataset(df, freq="MS")

We can look at the basic information about the dataset

[9]:
ts.info()
<class 'etna.datasets.TSDataset'>
num_segments: 1
num_exogs: 0
num_regressors: 0
num_known_future: 0
freq: MS
         start_timestamp end_timestamp  length  num_missing
segments
main          1980-01-01    1994-08-01     176            0

Or in DataFrame format

[10]:
ts.describe()
[10]:
start_timestamp end_timestamp length num_missing num_segments num_exogs num_regressors num_known_future freq
segments
main 1980-01-01 1994-08-01 176 0 1 0 0 0 MS

2. Plotting#

Let’s take a look at the time series in the dataset

[11]:
ts.plot()
../_images/tutorials_101-get_started_21_0.png

3. Forecasting single time series#

Our library contains a wide range of different models for time series forecasting. Let’s look at some of them.

[12]:
warnings.filterwarnings("ignore")

Let’s predict the monthly values in 1994 for our dataset.

[13]:
HORIZON = 8
[14]:
train_ts, test_ts = ts.train_test_split(test_size=HORIZON)

3.1 Naive forecast#

We will start by using the NaiveModel that just takes the value from lag time steps before.

This model doesn’t require any features, so to make a forecast we should define pipeline with this model and set a proper horizon value.

[15]:
from etna.models import NaiveModel
from etna.pipeline import Pipeline

# Define a model
model = NaiveModel(lag=12)

# Define a pipeline
pipeline = Pipeline(model=model, horizon=HORIZON)

Let’s make a forecast.

[16]:
# Fit the pipeline
pipeline.fit(train_ts)

# Make a forecast
forecast_ts = pipeline.forecast()

Calling pipeline.forecast without parameters makes a forecast for the next HORIZON points after the end of the training set.

[17]:
forecast_ts.info()
<class 'etna.datasets.TSDataset'>
num_segments: 1
num_exogs: 0
num_regressors: 0
num_known_future: 0
freq: MS
         start_timestamp end_timestamp  length  num_missing
segments
main          1994-01-01    1994-08-01       8            0

Now let’s look at the result metric and plot the prediction. All the methods already built-in in ETNA.

[18]:
from etna.metrics import SMAPE

smape = SMAPE()
smape(y_true=test_ts, y_pred=forecast_ts)
[18]:
{'main': 11.492045838249387}
[19]:
from etna.analysis import plot_forecast

plot_forecast(forecast_ts=forecast_ts, test_ts=test_ts, train_ts=train_ts, n_train_samples=10)
../_images/tutorials_101-get_started_35_0.png

3.2 Prophet#

Now we can try to improve the forecast by using Prophet model.

[20]:
from etna.models import ProphetModel

# Define a model
model = ProphetModel()

# Define a pipeline
pipeline = Pipeline(model=model, horizon=HORIZON)

# Fit the pipeline
pipeline.fit(train_ts)

# Make a forecast
forecast_ts = pipeline.forecast()
12:21:11 - cmdstanpy - INFO - Chain [1] start processing
12:21:11 - cmdstanpy - INFO - Chain [1] done processing
[21]:
smape(y_true=test_ts, y_pred=forecast_ts)
[21]:
{'main': 10.514961160817307}
[22]:
plot_forecast(forecast_ts=forecast_ts, test_ts=test_ts, train_ts=train_ts, n_train_samples=10)
../_images/tutorials_101-get_started_39_0.png

3.3 Catboost#

Finally, let’s try the ML-model. This kind of models require some features to make a forecast.

3.3.1 Basic transforms#

ETNA has a wide variety of transforms to work with data, let’s take a look at some of them.

Lags

Lag transformation is the most basic one. It gives us some previous value of the time series. For example, the first lag is the previous value, and the fifth lag is the value five steps ago. Lags are essential for regression models, like linear regression or boosting, because they allow these models to grasp information about the past.

The scheme of working:

lags-scheme

[23]:
from etna.transforms import LagTransform

lags = LagTransform(in_column="target", lags=list(range(HORIZON, 24)), out_column="lag")

There are some limitations on available lags during the forecasting. Imagine that we want to make a forecast for 3 step ahead. We can’t take the previous value when we make a forecast for the last step, we just don’t know the value. For this reason, you should use lags >= HORIZON when using a Pipeline.

Statistics

Statistics are another essential feature. It is also useful for regression models as it allows them to look at the information about the past but in different ways than lags. There are different types of statistics: mean, median, standard deviation, minimum and maximum on the interval.

The scheme of working:

statistics-scheme

As we can see, the window includes the current timestamp. For this reason, we shouldn’t apply the statistics transformations to target variable, we should apply it to lagged target variable.

[24]:
from etna.transforms import MeanTransform

mean = MeanTransform(in_column=f"lag_{HORIZON}", window=12)

Dates

The time series also has the timestamp column that we have not used yet. But date number in a week and in a month, as well as week number in year or weekend flag can be really useful for the machine learning model. And ETNA allows us to extract all this information with DateFlagTransform.

[25]:
from etna.transforms import DateFlagsTransform

date_flags = DateFlagsTransform(
    day_number_in_week=False,
    day_number_in_month=False,
    week_number_in_month=False,
    month_number_in_year=True,
    season_number=True,
    is_weekend=False,
    out_column="date_flag",
)

Logarithm

However, there is another type of transform that alters the column itself. We call it “inplace transform”. The easiest is LogTransform. It logarithms values in a column.

[26]:
from etna.transforms import LogTransform

log = LogTransform(in_column="target", inplace=True)

3.3.2 Forecasting#

Now let’s pass these transforms into our Pipeline. It will do all the work with applying the transforms and making exponential inverse transformation after the prediction.

[27]:
from etna.models import CatBoostMultiSegmentModel

# Define transforms
transforms = [lags, mean, date_flags, log]

# Define a model
model = CatBoostMultiSegmentModel()

# Define a pipeline
pipeline = Pipeline(model=model, transforms=transforms, horizon=HORIZON)

# Fit the pipeline
pipeline.fit(train_ts)

# Make a forecast
forecast_ts = pipeline.forecast()
[28]:
smape(y_true=test_ts, y_pred=forecast_ts)
[28]:
{'main': 10.78610453770036}
[29]:
plot_forecast(forecast_ts=forecast_ts, test_ts=test_ts, train_ts=train_ts, n_train_samples=10)
../_images/tutorials_101-get_started_59_0.png

4. Forecasting multiple time series#

In this section you may see example of how easily ETNA works with multiple time series and get acquainted with other transforms the library contains.

[30]:
HORIZON = 30
[31]:
original_df = pd.read_csv("data/example_dataset.csv")
original_df.head()
[31]:
timestamp segment target
0 2019-01-01 segment_a 170
1 2019-01-02 segment_a 243
2 2019-01-03 segment_a 267
3 2019-01-04 segment_a 287
4 2019-01-05 segment_a 279
[32]:
df = TSDataset.to_dataset(original_df)
ts = TSDataset(df, freq="D")
ts.plot()
../_images/tutorials_101-get_started_63_0.png
[33]:
ts.info()
<class 'etna.datasets.TSDataset'>
num_segments: 4
num_exogs: 0
num_regressors: 0
num_known_future: 0
freq: D
          start_timestamp end_timestamp  length  num_missing
segments
segment_a      2019-01-01    2019-11-30     334            0
segment_b      2019-01-01    2019-11-30     334            0
segment_c      2019-01-01    2019-11-30     334            0
segment_d      2019-01-01    2019-11-30     334            0
[34]:
train_ts, test_ts = ts.train_test_split(test_size=HORIZON)
[35]:
test_ts.info()
<class 'etna.datasets.TSDataset'>
num_segments: 4
num_exogs: 0
num_regressors: 0
num_known_future: 0
freq: D
          start_timestamp end_timestamp  length  num_missing
segments
segment_a      2019-11-01    2019-11-30      30            0
segment_b      2019-11-01    2019-11-30      30            0
segment_c      2019-11-01    2019-11-30      30            0
segment_d      2019-11-01    2019-11-30      30            0
[36]:
from etna.transforms import LinearTrendTransform
from etna.transforms import SegmentEncoderTransform

# Define transforms
log = LogTransform(in_column="target")
trend = LinearTrendTransform(in_column="target")
seg = SegmentEncoderTransform()
lags = LagTransform(in_column="target", lags=list(range(HORIZON, 96)), out_column="lag")
date_flags = DateFlagsTransform(
    day_number_in_week=True,
    day_number_in_month=True,
    week_number_in_month=True,
    week_number_in_year=True,
    month_number_in_year=True,
    year_number=True,
    is_weekend=True,
)
mean = MeanTransform(in_column=f"lag_{HORIZON}", window=30)
transforms = [log, trend, lags, date_flags, seg, mean]

# Define a model
model = CatBoostMultiSegmentModel()

# Define a pipeline
pipeline = Pipeline(model=model, transforms=transforms, horizon=HORIZON)

# Fit the pipeline
pipeline.fit(train_ts)

# Make a forecast
forecast_ts = pipeline.forecast()
[37]:
smape(y_true=test_ts, y_pred=forecast_ts)
[37]:
{'segment_d': 4.705485234692143,
 'segment_b': 5.1724144384125275,
 'segment_a': 6.61633906776369,
 'segment_c': 13.503310484019664}
[38]:
plot_forecast(forecast_ts=forecast_ts, test_ts=test_ts, train_ts=train_ts, n_train_samples=20)
../_images/tutorials_101-get_started_69_0.png