duplicate_data

duplicate_data(df: DataFrame, segments: Sequence[str], format: str = DataFrameFormat.wide) → DataFrame

Duplicate a dataframe for each of the given segments.

Parameters:
  • df (DataFrame) – dataframe to duplicate; it must contain a “timestamp” column

  • segments (Sequence[str]) – segments for which the dataframe is duplicated

  • format (str) – format of the result: the TSDataset inner format (wide) or the flattened format (long)

Returns:

result – the input dataframe duplicated for every segment

Return type:

pd.DataFrame

Raises:
  • ValueError: – if segments list is empty

  • ValueError: – if an incorrect format is given

  • ValueError: – if dataframe doesn’t contain “timestamp” column

Examples

>>> import pandas as pd
>>> from etna.datasets import generate_const_df
>>> from etna.datasets import duplicate_data
>>> from etna.datasets import TSDataset
>>> df = generate_const_df(
...    periods=50, start_time="2020-03-10",
...    n_segments=2, scale=1
... )
>>> timestamp = pd.date_range("2020-03-10", periods=100, freq="D")
>>> is_friday_13 = (timestamp.weekday == 4) & (timestamp.day == 13)
>>> df_exog_raw = pd.DataFrame({"timestamp": timestamp, "is_friday_13": is_friday_13})
>>> df_exog = duplicate_data(df_exog_raw, segments=["segment_0", "segment_1"], format="wide")
>>> df_ts_format = TSDataset.to_dataset(df)
>>> ts = TSDataset(df=df_ts_format, df_exog=df_exog, freq="D", known_future="all")
>>> ts.head()
segment       segment_0           segment_1
feature    is_friday_13 target is_friday_13 target
timestamp
2020-03-10        False   1.00        False   1.00
2020-03-11        False   1.00        False   1.00
2020-03-12        False   1.00        False   1.00
2020-03-13         True   1.00         True   1.00
2020-03-14        False   1.00        False   1.00
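
To make the two output layouts concrete, here is a minimal pandas-only sketch of what duplication conceptually produces. It is not etna's implementation: the helper name `duplicate_long_sketch` and the column `flag` are hypothetical, and the wide layout is assumed to be a column MultiIndex with levels `segment` and `feature`, as in the output above.

```python
import pandas as pd


def duplicate_long_sketch(df, segments):
    """Hypothetical helper: repeat df once per segment in long format."""
    if len(segments) == 0:
        raise ValueError("segments must not be empty")
    if "timestamp" not in df.columns:
        raise ValueError("df must contain a 'timestamp' column")
    # One copy of the frame per segment, tagged with a "segment" column.
    parts = [df.assign(segment=segment) for segment in segments]
    return pd.concat(parts, ignore_index=True)


timestamp = pd.date_range("2020-03-10", periods=3, freq="D")
df_raw = pd.DataFrame({"timestamp": timestamp, "flag": [False, False, True]})

# Long layout: one row per (timestamp, segment) pair.
df_long = duplicate_long_sketch(df_raw, ["segment_0", "segment_1"])

# Assumed wide layout: timestamp index, columns as (segment, feature) pairs.
df_wide = (
    df_long.set_index(["timestamp", "segment"])
    .unstack("segment")      # columns become (feature, segment)
    .swaplevel(axis=1)       # reorder to (segment, feature)
    .sort_index(axis=1)
)
df_wide.columns.names = ["segment", "feature"]
```

In this sketch every segment receives an identical copy of the input columns, which is the behaviour the example above relies on when passing the result as `df_exog`.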