IMA: An Imputation-based Mixup Augmentation Using Self-Supervised Learning
for Time Series Data
Nha Dang Nguyen1, Dang Hai Nguyen2, Khoa Tho Anh Nguyen3*
1Vietnam - Korea University of Information and Communication Technology, Da Nang City, Vietnam
2VNU University of Engineering and Technology, Ha Noi City, Vietnam
3Vietnamese – German University, Ho Chi Minh City, Vietnam
nhand.21it@vku.udn.vn, 24025015@vnu.edu.vn, 30421001@student.vgu.edu.vn
Abstract
Data augmentation plays a crucial role in enhancing model
performance across various AI fields by introducing variabil-
ity while maintaining the underlying temporal patterns. How-
ever, in the context of long sequence time series data, where
maintaining temporal consistency is critical, there are fewer
augmentation strategies compared to fields such as image or
text, with advanced techniques like Mixup rarely being used.
In this work, we propose a new approach, Imputation-based
Mixup Augmentation (IMA), which combines Imputed-data
Augmentation with Mixup Augmentation to bolster model
generalization and improve forecasting performance. We
evaluate the effectiveness of this method across several forecasting models, including DLinear (MLP), TimesNet (CNN), and iTransformer (Transformer); these models represent some of the most recent advances in long sequence time series forecasting. Our experiments, conducted on three datasets (ETT-small, Illness, Exchange Rate) from various domains and compared against eight other augmentation techniques, demonstrate that IMA consistently enhances performance, improving 22 out of 24 instances, 10 of which are best-case results, particularly with iTransformer-based imputation on the ETT dataset. The GitHub repository is available at: https://github.com/dangnha/IMA.
Introduction
Time series forecasting, particularly long sequence time-
series forecasting (LSTF), plays a critical role in domains
like finance, healthcare, energy, and urban planning (Chen
et al. 2023b). Traditional statistical methods such as ARIMA
and exponential smoothing laid the foundation but struggled
with complex temporal dependencies. The advent of deep
learning introduced RNNs (Elman 1990), LSTMs (Hochre-
iter and Schmidhuber 1997), and more recently, models like
MLPs (Lai et al. 2018), CNNs (Huang et al. 2019), and
Transformers (Wu et al. 2021; Liu et al. 2024), achieving
state-of-the-art results in LSTF tasks (Fig. 1). These ad-
vances emphasize robust pipelines integrating preprocess-
ing, feature extraction, and optimization.
Despite significant progress, challenges persist in time
series data augmentation (Wen et al. 2021). Unlike Com-
puter Vision (CV) and Natural Language Processing (NLP),
*This research is supported by AI VIETNAM
Copyright © 2025, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Figure 1: Key Milestones in Time Series Forecasting (Liu and Wang 2024)
where augmentation techniques like flipping, cropping, and
Mixup (Xu et al. 2023) are well-developed, time series tech-
niques such as jittering and scaling often fail to capture com-
plex temporal patterns (Chen et al. 2023a). Emerging meth-
ods like latent Mixup show potential but remain underex-
plored. Similarly, while imputation methods like KNN and
deep learning effectively handle missing data, their use for
augmentation remains untapped, focusing solely on recov-
ery rather than diversity enhancement.
This paper addresses these gaps by introducing
Imputation-based Mixup Augmentation (IMA):
• We propose Imputed-data Augmentation with Self-Supervised Reconstruction (SSR), leveraging imputation for enriched data diversity.
• We develop IMA, combining imputation with Mixup to
improve model generalization and performance.
• We evaluate IMA on three models—DLinear (MLP)
(Zeng et al. 2023), TimesNet (CNN) (Wu et al.
2023), and iTransformer (Transformer) (Liu et al.
2024)—demonstrating its effectiveness in enhancing
forecasting performance across diverse scenarios.
Related Work
Long sequence time-series forecasting (LSTF) has seen rapid development due to its critical applications in various domains. Transformer-based models have significantly advanced the field by capturing long-term dependencies. Informer (Zhou et al. 2021) and Autoformer (Wu et al. 2021)
utilize sparse attention and series decomposition to reduce
computational costs, while FEDformer (Zhou et al. 2022)
incorporates frequency-domain techniques like Fourier and
wavelet transformations for improved periodicity modeling.
CNN-based models, including TCN (Bharilya and Kumar
2024) and SCINet (LIU et al. 2022), leverage dilated convo-
lutions and hierarchical downsampling to capture both local
and global patterns but face challenges in modeling long-
term dependencies. RNNs (Elman 1990), such as LSTM
(Hochreiter and Schmidhuber 1997) and GRU (Cho et al.
2014), have been enhanced through attention mechanisms
(e.g., DA-RNN (Qin et al. 2017)) and hybrid models like
ES-LSTM (Smyl 2020), boosting their multivariate forecast-
ing performance. MLP-based methods, while traditionally
less suited for sequential data, have regained attention with
feature-engineered adaptations, offering lightweight solu-
tions for simpler tasks. These methods collectively highlight
the progress in addressing scalability and efficiency chal-
lenges in LSTF (Fig. 1).
Data augmentation has emerged as a vital strategy for
improving model performance, especially in scenarios with
limited labeled data. Traditional time-domain transforma-
tions, such as window cropping, slicing, and warping, are
widely used for their simplicity and ability to introduce
variability (Wen et al. 2021). Advanced techniques like
decomposition-based methods (e.g., STL (Ouyang, Ravier,
and Jabloun 2021), Robust-STL (Wen et al. 2019)) and gen-
erative models like GANs and VAEs expand this toolkit
by creating diverse yet structurally coherent synthetic data.
Mixup, a method that interpolates between samples to gen-
erate new ones, remains underexplored for time series (Zhou
et al. 2023), leaving significant room for further research.
Imputation, traditionally used to reconstruct missing
data, has evolved with methods ranging from statistical in-
terpolation to machine learning and deep learning tech-
niques, including k-nearest neighbors (kNN), Gaussian Pro-
cesses (Jafrasteh et al. 2023), and Transformer-based mod-
els (Wang et al. 2024a). These approaches restore data while
preserving temporal dependencies, suggesting potential for
data augmentation. However, leveraging imputation explic-
itly for augmentation remains underexplored. This gap moti-
vates our proposed Imputation-based Mixup Augmentation
(IMA), detailed in the next section.
Methodology
Our approach consists of two main phases: Self-Supervised Reconstruction (SSR) and Imputed-data Augmentation (IA) with Imputation-based Mixup Augmentation (IMA).
In the first phase, Self-Supervised Reconstruction (SSR), an imputation model is trained to reconstruct masked
input data, effectively capturing the intrinsic patterns and
structures of the time series. This pre-training step allows the
model to understand the temporal dependencies and com-
plex features in the data.
In the second phase, the pre-trained imputation model is used for Imputed-data Augmentation (IA) to enhance data diversity by reconstructing masked sequences. Additionally, the augmented data is integrated with Mixup Augmentation (IMA), which blends samples to introduce further variability in data representations. This combination improves
model generalization and performance across various time
series forecasting tasks.
Notation Definition
Let $B=\{(X^{(i)},Y^{(i)})\}_{i=1}^{|B|}$ denote a batch of $|B|$ samples randomly drawn from dataset $D$, where:
$$D=\{(X^{(i)},Y^{(i)}) \mid X^{(i)}\in\mathbb{R}^{T_X\times N_X},\ Y^{(i)}\in\mathbb{R}^{T_Y\times N_Y}\},$$
with $i\in[1,|D|]$. Here, $X^{(i)}$ represents the input sequence, and $Y^{(i)}$ is the corresponding target sequence. Parameters $T_X$, $T_Y$ denote the number of time steps, while $N_X$, $N_Y$ refer to the number of features per time step in $X$ and $Y$.

Each input sequence $X^{(i)}$ is represented as:
$$X^{(i)}=[X^{(i)}_1,X^{(i)}_2,\ldots,X^{(i)}_{T_X}],$$
where $X^{(i)}_t\in\mathbb{R}^{N_X}$ for $t=1,\ldots,T_X$. Each time step $X^{(i)}_t$ is defined as:
$$X^{(i)}_t=[x_1,x_2,\ldots,x_{N_X}],$$
with $x_j$ being the $j$-th feature at time step $t$.
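For concreteness, a batch under this notation can be instantiated as follows; the dimensions ($|B|=32$, $T_X=96$, $T_Y=24$, $N_X=N_Y=7$) are illustrative assumptions, not values fixed by the paper:

```python
import numpy as np

# Illustrative shape check for the batch notation (all sizes are assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 96, 7))   # inputs X^(i) in R^{T_X x N_X}, |B| = 32
Y = rng.normal(size=(32, 24, 7))   # targets Y^(i) in R^{T_Y x N_Y}
X_i = X[0]                         # one input sequence: [X_1, ..., X_{T_X}]
x_t = X_i[0]                       # one time step: [x_1, ..., x_{N_X}]
```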
Self-Supervised Reconstruction (SSR)
Self-supervised learning enhances downstream tasks by cap-
turing inherent patterns within data. We apply this approach
to time series imputation.
For each sample $X^{(i)}$ in batch $B$, masking is applied using $M_{SSR}\in\mathbb{R}^{|B|\times T_X\times N_X}$. The masked version $X^{(i)}_m$ is defined as:
$$X^{(i)}_m=\{X^{(i)}_t\odot M^{(i)}_t \mid t=1,\ldots,T_X\},$$
where $X^{(i)}_t\in\mathbb{R}^{N_X}$ is the feature vector at time $t$, and $M^{(i)}_t\in\{0,1\}^{N_X}$ is a binary mask vector. Mask $M^{(i)}_t$ is constructed by randomly sampling values from a uniform distribution: elements are set to 0 if the sampled value is below the mask rate, indicating the feature is masked, or 1 otherwise, indicating the feature is observed (Fig. 3).

After obtaining $\{X^{(i)}_m\}_{i=1}^{|B|}$, the objective is to utilize an imputation model $f_\theta$ (where $\theta$ denotes the model parameters) to reconstruct the original input data $\{X^{(i)}\}_{i=1}^{|B|}$ from the masked version $\{X^{(i)}_m\}_{i=1}^{|B|}$. The model processes the masked data as input and generates imputed data as output, $X^{(i)}_{imp}=f_\theta(X^{(i)}_m)$ (as shown in the SSR phase of Fig. 2). Finally, to guide the imputation model, an MSE loss between the original and imputed input sequences is applied:
$$\mathcal{L}_{imp}=\frac{1}{|B|}\sum_{i=1}^{|B|}\sum_{t=1}^{T_X}(1-M^{(i)}_t)\cdot\left\|X^{(i)}_t-X^{(i)}_{imp,t}\right\|^2 \quad (1)$$
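As a sketch of the SSR objective, the uniform masking step and the masked-MSE loss of Eq. (1) can be written in a few lines of NumPy; `identity_imputer` below is a stand-in for the imputation model $f_\theta$, which in the paper is a pre-trained TimesNet or iTransformer:

```python
import numpy as np

def make_mask(shape, mask_rate, rng):
    """Binary mask from uniform samples: 0 = masked, 1 = observed."""
    return (rng.uniform(size=shape) >= mask_rate).astype(np.float64)

def imputation_loss(X, X_imp, M):
    """Eq. (1): squared error only over masked positions, averaged over |B|."""
    batch_size = X.shape[0]
    return float(np.sum((1.0 - M) * (X - X_imp) ** 2) / batch_size)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 96, 7))             # |B| = 4, T_X = 96, N_X = 7
M = make_mask(X.shape, mask_rate=0.375, rng=rng)
X_m = X * M                                 # X_m = X (element-wise) M
identity_imputer = lambda Xm: Xm            # stand-in for f_theta
loss = imputation_loss(X, identity_imputer(X_m), M)
```

Note that the $(1-M^{(i)}_t)$ factor means only masked entries contribute to the loss, so a perfect reconstruction of the masked positions drives the loss to zero.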
Figure 2: Illustration of the proposed data augmentation framework, comprising two key phases: Self-Supervised Reconstruction (SSR) for learning intrinsic data patterns and Imputation-based Mixup Augmentation (IMA) for enhancing data diversity and model generalization.
Figure 3: Data masking strategy: applying a binary mask to generate masked inputs.
Imputed-data Augmentation (IA)

After sampling batches, a binary vector $i\in\mathbb{R}^{B}$ is defined, where $B$ is the number of batches. Each element $i^{(i)}$ is determined by comparing a random number (drawn from a uniform distribution) with the imputation rate: if the random number is less than the imputation rate, $i^{(i)}=1$, indicating imputation-based augmentation for the batch; otherwise, $i^{(i)}=0$, and no augmentation is applied.

Following the Self-Supervised Reconstruction (SSR) phase, the pre-trained model $f_\theta$, which has learned the temporal patterns and structures of the data, reconstructs the masked sequences $X^{(i)}_m$ in each batch $B$. Using the binary mask matrix $M_{IMA}\in\mathbb{R}^{|B|\times T_X\times N_X}$, the imputed sequence is generated as:
$$X^{(i)}_{imp}=f_\theta(X^{(i)}_m).$$

The reconstructed sequences form the augmented batch $B^{X}_{imp}=\{X^{(i)}_{imp}\}_{i=1}^{|B|}$. These sequences are then passed into a forecasting model $g_w(\cdot)$, parameterized by $w$, to predict the target sequences:
$$\hat{Y}^{(i)}=g_w(X^{(i)}_{imp}).$$

This process mitigates biases by imputing missing values with plausible estimates, thereby increasing diversity while maintaining the original data's structure and patterns.
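A minimal sketch of the IA gating step, assuming the pre-trained imputer is available as a callable; `impute_fn` and `identity_imputer` are hypothetical placeholders for the SSR-pre-trained TimesNet or iTransformer:

```python
import numpy as np

def imputed_data_augmentation(X, impute_fn, imputation_rate, mask_rate, rng):
    """IA sketch: with probability `imputation_rate`, replace the batch by the
    imputer's reconstruction of a freshly masked copy; otherwise keep it."""
    if rng.uniform() >= imputation_rate:    # gate i = 0: no augmentation
        return X
    M = (rng.uniform(size=X.shape) >= mask_rate).astype(X.dtype)
    return impute_fn(X * M)                 # gate i = 1: X_imp = f_theta(X_m)

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 96, 7))
identity_imputer = lambda Xm: Xm            # stand-in for the pre-trained model
X_kept = imputed_data_augmentation(X, identity_imputer, 0.0, 0.375, rng)  # never augments
X_aug = imputed_data_augmentation(X, identity_imputer, 1.0, 0.375, rng)   # always augments
```

With the rates reported in the experiments (imputation rate 0.125), roughly one batch in eight would take the augmented branch.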
Imputation-based Mixup Augmentation (IMA)

After generating the imputed batch $B^{X}_{imp}$, Mixup Augmentation is applied to create synthetic data and enhance model generalization. Mixup interpolates between pairs of samples within $B_{imp}$, governed by a mixing coefficient $\lambda\sim\mathrm{Beta}(\alpha,\alpha)$, where $\lambda\in[0,1]$. This coefficient determines the contribution of each sample in the interpolation.

For two randomly selected imputed samples $X^{(i)}_{imp}$ and $X^{(j)}_{imp}$, the mixed input $X^{(i,j)}_{mix}$ is computed as:
$$X^{(i,j)}_{mix}=\lambda\cdot X^{(i)}_{imp}+(1-\lambda)\cdot X^{(j)}_{imp}.$$

The mixed sample is passed to the forecasting model $g_w$, and the loss for the mixed sample is calculated as:
$$\mathcal{L}_{mix}=\lambda\cdot\mathcal{L}(g_w(X^{(i,j)}_{mix}),Y^{(i)})+(1-\lambda)\cdot\mathcal{L}(g_w(X^{(i,j)}_{mix}),Y^{(j)}),$$
where $\mathcal{L}$ denotes the forecasting loss, and $Y^{(i)}$, $Y^{(j)}$ are the target sequences corresponding to the original samples.

Algorithm 1: Imputation-based Mixup Augmentation
1: Input: batch $B=\{(X^{(i)},Y^{(i)})\}_{i=1}^{|B|}$, imputation rate, mask rate
2: Apply SSR to compute $X^{(i)}_{imp}$ for each $X^{(i)}_m$ in $B$
3: Shuffle $B_{imp}$ to create pairs $(X^{(i)}_{imp},X^{(j)}_{imp})$
4: Compute mixed samples $X^{(i,j)}_{mix}$ using $\lambda\sim\mathrm{Beta}(\alpha,\alpha)$
5: Compute loss $\mathcal{L}_{mix}$ and update $g_w$ via gradient descent
Figure 4: Mixup applied to two imputed samples.
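The mixing step of Algorithm 1 can be sketched as follows; `forecast_fn` and `loss_fn` are hypothetical placeholders for the forecasting model $g_w$ and its loss $\mathcal{L}$, and the toy last-window "forecaster" exists only to make the sketch runnable:

```python
import numpy as np

def mixup_step(X_imp, Y, alpha, forecast_fn, loss_fn, rng):
    """IMA sketch: mix shuffled pairs of imputed inputs with lambda ~ Beta(alpha, alpha)
    and combine the forecasting losses against both original targets."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(X_imp.shape[0])           # pair sample i with sample j
    X_mix = lam * X_imp + (1.0 - lam) * X_imp[perm]  # mixed input X_mix^(i,j)
    Y_hat = forecast_fn(X_mix)                       # g_w(X_mix)
    return lam * loss_fn(Y_hat, Y) + (1.0 - lam) * loss_fn(Y_hat, Y[perm])

rng = np.random.default_rng(2)
X_imp = rng.normal(size=(4, 96, 7))
Y = rng.normal(size=(4, 24, 7))
mse = lambda a, b: float(np.mean((a - b) ** 2))
last_window = lambda X: X[:, -24:, :]                # toy forecaster for illustration
L_mix = mixup_step(X_imp, Y, 0.2, last_window, mse, rng)
```

In training, the gradient of $\mathcal{L}_{mix}$ with respect to $w$ drives the parameter update in step 5 of Algorithm 1.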
Experiments and Results
Dataset. We evaluate our data augmentation method using three long-term time series forecasting datasets: ETT-small, Illness, and Exchange Rate. ETT-small (Zhou et al. 2021) includes two subsets (ETTh and ETTm) tracking transformer station temperatures at hourly and 15-minute intervals, with 70,080 samples. Each sample contains six features and an Oil Temperature target, capturing seasonal and irregular patterns. Illness¹ records weekly influenza-like illness (ILI) rates along with features like population and healthcare capacity. Exchange Rate (Lai et al. 2017) tracks daily exchange rates for eight currencies (e.g., USD, GBP, AUD) from 1990 to 2016, comprising 7,588 time steps, offering insights into long-term financial forecasting and cross-variable correlations.
Experimental Setting. We conducted our experiments using the TSLib framework (Wu et al. 2023; Wang et al. 2024b), evaluating three baseline models—DLinear, TimesNet, and iTransformer—representing key approaches in time series modeling: MLP, CNN, and Transformer, respectively. Seven widely used data augmentation techniques (Jitter, Horizontal Flip, Vertical Flip, Scaling, Window Warp, Window Slide, and Permutation) and the Mixup method were applied.

¹Illness dataset: Weekly ILI rates across U.S. regions, provided by the CDC via the FluView portal. Accessed December 1, 2024, gis.cdc.gov/grasp/fluview/fluportaldashboard.html.
Our proposed IMA method optimized the imputation rate
(0.125 for TimesNet and iTransformer) and mask rate
(0.375 for TimesNet, 0.125 for iTransformer) through grid
search. DLinear was excluded from the imputation task due
to its inability to capture complex temporal patterns, as evi-
denced by consistent underperformance in preliminary tests.
Model performance was evaluated using Mean Squared
Error (MSE) and Mean Absolute Error (MAE) to assess both
prediction accuracy and robustness across datasets and aug-
mentation strategies.
Figure 5: Comparison of the number of improvement cases and the best-case performance among eight augmentation methods, IA, and IMA on the ETT dataset.
Results. Table 1 demonstrates that Imputed-data Augmentation (IA) significantly improves performance, especially on the ETT dataset, achieving enhancements in 20 out
of 24 cases, with notable success in all 8 instances using
the iTransformer model (Fig. 5). Combining IA with Mixup
(IMA) further strengthens results, improving 22 out of 24
cases on the ETT dataset, including 10 best-case outcomes.
IMA also slightly outperforms Mixup alone on the Illness
and Exchange Rate datasets.
However, IA and IMA struggle in some scenarios, partic-
ularly with DLinear on the ETTm1 dataset. This is due to
the DLinear model’s simplified architecture, which cannot
fully leverage complex temporal patterns or augmented data
diversity, highlighting the need for advanced temporal fea-
ture extraction capabilities.
For the Illness and Exchange Rate datasets, IA achieves
improvements in 4 out of 6 cases for both datasets, with
peak performance in 2 Illness cases and 4 Exchange Rate
cases (Fig. 6). These datasets’ simplicity, characterized by
many zero values, enables models like DLinear and iTrans-
former to effectively learn patterns without requiring sig-
nificant augmentation, reducing the impact of augmentation
methods. In contrast, TimesNet, with its convolutional op-
erations, is more sensitive to augmentation, where IA out-
performs IMA, suggesting that standalone imputation better
suits the characteristics of these datasets.
Model DLinear TimesNet iTransformer DLinear TimesNet iTransformer
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Dataset ETTh1 ETTh2
Baseline 0.445051 0.440448 0.458988 0.45493 0.447285 0.440212 0.479687 0.477797 0.406138 0.413908 0.4578 0.398759
Jitter 0.00E+00 1.00E-06 -1.37E-03 -7.73E-04 -1.10E-05 7.06E-03 3.00E-05 2.20E-05 3.99E-03 3.63E-03 -7.84E-02 -1.13E-04
Hflip -5.73E-04 -2.07E-04 7.94E-03 2.84E-03 -5.47E-03 -4.07E-03 1.24E-03 1.22E-03 -1.63E-02 -7.31E-03 -5.92E-02 -2.66E-01
Vflip 4.41E-03 4.89E-03 -6.30E-03 -2.93E-03 -5.86E-03 -4.96E-03 7.55E-03 6.83E-03 3.36E-02 1.29E-02 -7.43E-02 -7.35E-04
Scaling 7.93E-04 1.14E-03 -1.34E-03 -6.42E-04 -2.98E-01 -2.93E-01 -6.82E-04 1.97E-04 2.07E-02 1.29E-02 -7.80E-02 -2.83E-04
Win warp 1.81E-02 1.81E-02 3.71E-02 2.06E-02 2.21E-02 1.30E-02 -6.41E-04 9.71E-04 -1.16E-04 3.98E-03 -7.46E-02 2.40E-03
Win slide 6.66E-02 4.17E-02 1.65E-02 1.50E-02 2.79E-02 1.76E-02 7.63E-03 7.55E-03 1.48E-02 1.23E-02 -6.57E-02 9.67E-03
Permu -1.66E-04 4.87E-04 -3.60E-03 1.44E-03 1.05E-02 1.76E-02 8.17E-03 6.42E-03 1.21E-02 5.43E-03 -7.62E-02 1.34E-04
Mixup 2.10E-04 1.92E-04 7.78E-04 -4.23E-04 -1.58E-03 -1.38E-03 -1.01E-02 -5.84E-03 -2.67E-03 -1.06E-03 -7.88E-02 -3.87E-04
TSIA -4.54E-03 -3.23E-03 -1.41E-04 -1.48E-03 -1.78E-03 -1.89E-03 -1.87E-02 -1.20E-02 1.17E-03 1.02E-04 -7.74E-02 -2.00E-03
iTIA -5.73E-03 -4.84E-03 -7.39E-03 -7.65E-03 -1.33E-03 -1.96E-03 -1.78E-02 -1.16E-02 -7.39E-03 -3.62E-03 -7.49E-02 -3.09E-04
TSIMA -6.31E-03 -6.11E-03 -5.13E-03 -6.76E-03 -3.68E-03 -3.60E-03 -2.15E-02 -1.30E-02 -1.28E-02 -7.06E-03 -7.65E-02 -1.84E-03
iTIMA -6.38E-03 -5.85E-03 -1.54E-02 -1.18E-02 -3.14E-03 -3.98E-03 -1.25E-02 -8.32E-03 -1.18E-02 -6.75E-03 -7.52E-02 -2.02E-04
Dataset ETTm1 ETTm2
Baseline 0.381687 0.390652 0.39177 0.403024 0.398893 0.394252 0.281894 0.358602 0.2544 0.307104 0.252632 0.312604
Jitter 1.50E-05 -8.95E-03 1.39E-04 4.12E-04 -4.64E-03 -4.60E-05 1.40E-05 1.60E-05 6.46E-03 3.53E-03 -1.81E-04 -2.21E-04
Hflip -8.20E-05 -2.56E-04 3.84E-03 2.34E-04 -1.65E-02 -2.11E-03 2.81E-03 3.15E-03 -2.67E-03 1.20E-03 -4.76E-03 -3.79E-03
Vflip 5.25E-04 1.00E-03 -2.07E-03 -4.10E-03 -6.76E-03 3.11E-03 -5.25E-03 -2.63E-03 -2.69E-03 1.28E-03 5.62E-02 -3.79E-03
Scaling 3.79E-04 7.50E-04 -2.01E-03 -1.02E-03 -1.80E-02 5.00E-05 2.25E-03 2.33E-03 7.54E-03 4.58E-03 -4.20E-05 5.30E-05
Win warp 7.35E-02 4.84E-02 4.79E-02 3.62E-02 4.49E-02 3.92E-02 3.46E-03 4.96E-03 3.92E-03 5.27E-03 2.42E-03 3.48E-03
Win slide 8.62E-02 5.08E-02 8.04E-02 4.75E-02 3.45E-02 3.92E-02 1.50E-03 2.18E-03 1.33E-02 1.04E-02 4.73E-03 4.28E-03
Permu 4.45E-02 3.36E-02 7.88E-03 1.13E-02 3.45E-02 1.95E-02 5.75E-03 7.47E-03 1.33E-02 2.23E-03 1.93E-03 2.91E-03
Mixup 1.54E-04 4.21E-04 -2.86E-03 -2.19E-03 -1.90E-02 -2.12E-03 -2.09E-03 -1.48E-03 -1.14E-03 -9.32E-04 -1.56E-03 -1.10E-03
TSIA 4.96E-03 4.27E-03 -8.28E-03 -2.50E-03 -1.99E-02 -1.15E-03 -2.18E-02 -2.18E-02 -4.66E-03 -3.04E-03 -2.82E-03 -3.71E-03
iTIA 5.44E-03 4.72E-03 -7.23E-03 -3.52E-03 -2.12E-02 -1.20E-03 2.06E-03 1.87E-03 -4.39E-03 -2.77E-03 -2.74E-03 -3.71E-03
TSIMA 5.18E-03 4.27E-03 -7.62E-03 -4.76E-03 -2.17E-02 -3.63E-03 -1.49E-02 -1.61E-02 -5.80E-03 -2.50E-03 -4.03E-03 -4.44E-03
iTIMA 6.27E-03 5.27E-03 -1.22E-02 -9.31E-03 -2.36E-02 -4.22E-03 -6.69E-03 -5.99E-03 -7.18E-03 -4.58E-03 -4.34E-03 -5.13E-03
Dataset Illness Exchange Rate
Baseline 4.003106 1.441318 1.998755 0.885458 1.807093 0.870089 0.168927 0.305395 0.219712 0.340417 0.180669 0.303503
Jitter 0.00E+00 0.00E+00 -1.90E-05 -3.00E-06 0.00E+00 0.00E+00 0.00E+00 0.00E+00 -6.93E-04 -3.37E-04 0.00E+00 0.00E+00
Hflip 0.00E+00 0.00E+00 3.30E-05 7.00E-06 0.00E+00 0.00E+00 0.00E+00 0.00E+00 -3.72E-04 -2.52E-04 0.00E+00 0.00E+00
Vflip 0.00E+00 0.00E+00 -8.00E-06 3.00E-06 0.00E+00 0.00E+00 0.00E+00 0.00E+00 -1.44E-03 -1.29E-03 0.00E+00 0.00E+00
Scaling 0.00E+00 0.00E+00 1.10E-05 3.00E-06 0.00E+00 0.00E+00 0.00E+00 0.00E+00 -1.14E-03 -5.71E-04 0.00E+00 0.00E+00
Win warp 0.00E+00 0.00E+00 1.90E-05 2.00E-06 0.00E+00 0.00E+00 0.00E+00 0.00E+00 -1.90E-03 -7.10E-04 0.00E+00 0.00E+00
Win slide 0.00E+00 0.00E+00 1.30E-05 1.00E-06 0.00E+00 0.00E+00 0.00E+00 0.00E+00 -8.52E-04 -6.02E-04 0.00E+00 0.00E+00
Permu 0.00E+00 0.00E+00 -2.30E-05 -2.00E-06 0.00E+00 0.00E+00 0.00E+00 0.00E+00 -3.70E-05 4.92E-04 0.00E+00 0.00E+00
Mixup 2.45E-02 4.30E-03 -1.06E-01 -2.64E-02 3.93E-02 3.99E-03 7.51E-03 8.34E-03 -8.73E-03 -9.11E-03 -1.08E-03 -1.46E-03
TSIA -9.15E-03 -7.45E-04 -7.46E-02 -2.10E-02 3.21E-02 6.48E-03 4.18E-03 6.24E-03 -1.65E-02 -1.59E-02 -1.90E-03 -2.32E-03
iTIA -9.85E-03 -9.53E-04 -7.03E-02 -2.40E-02 4.09E-02 3.00E-03 7.16E-03 8.38E-03 4.83E-03 1.75E-03 -1.46E-03 -1.51E-03
TSIMA -1.92E-03 2.27E-04 -2.67E-03 -1.23E-02 1.04E-01 2.12E-02 3.56E-03 5.04E-03 -8.48E-03 -9.61E-03 -1.84E-03 -2.12E-03
iTIMA 1.60E-03 1.74E-03 -1.43E-01 -2.54E-02 7.62E-02 1.13E-02 7.68E-03 1.00E-02 -1.54E-02 -1.48E-02 -1.49E-03 -1.89E-03
Table 1: Forecasting Performance Evaluation. Comparison of 8 augmentation methods with IA and IMA, using TimesNet (TS) and iTransformer (iT) for imputation-based enhancement. Red bold: best case, Blue: improvement case, Green background: our methods.
Figure 6: Comparison of the number of improvement cases and the best-case performance among eight augmentation methods, IA, and IMA on the Illness and Exchange Rate datasets.
In conclusion, IA and IMA demonstrate robust perfor-
mance improvements across models and datasets. While IA
occasionally surpasses IMA for specific datasets, the com-
bined approach of IMA offers a versatile solution with more
consistent performance across diverse scenarios, highlighting its advantage over existing methods.

Conclusion
In this study, we propose Imputation-based Mixup Augmen-
tation (IMA), a method that enhances time series forecasting
by leveraging SSL training to capture trends and patterns in
the data while preserving essential characteristics. By com-
bining Imputation with Mixup, IMA not only increases data
diversity but also improves model generalization, leading to
better forecasting performance. Our results demonstrate that
this approach outperforms Mixup alone, highlighting its po-
tential to generate more diverse and resilient training data.
Although IMA may not yield optimal results for every forecasting model and dataset, it opens promising avenues for further exploration and development in this direction.
References
Bharilya, V.; and Kumar, N. 2024. Machine learning for autonomous vehicle's trajectory prediction: A comprehensive survey, challenges, and future research directions. Vehicular Communications, 46: 100733.
Chen, M.-H.; Xu, Z.; Zeng, A.; and Xu, Q. 2023a. FrAug: Frequency Domain Augmentation for Time Series Forecasting. ArXiv.
Chen, Z.; Ma, M.; Li, T.; Wang, H.; and Li, C. 2023b. Long sequence time-series forecasting with deep learning: A survey. Information Fusion, 97: 101819.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078.
Elman, J. L. 1990. Finding structure in time. Cognitive Science, 14(2): 179–211.
Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation, 9(8): 1735–1780.
Huang, S.; Wang, D.; Wu, X.; and Tang, A. 2019. DSANet: Dual Self-Attention Network for Multivariate Time Series Forecasting. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, 2129–2132. New York, NY, USA: Association for Computing Machinery. ISBN 9781450369763.
Jafrasteh, B.; Hernández-Lobato, D.; Lubián-López, S. P.; and Benavente-Fernández, I. 2023. Gaussian processes for missing value imputation. Knowledge-Based Systems, 273: 110603.
Lai, G.; Chang, W.-C.; Yang, Y.; and Liu, H. 2017. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval.
Lai, G.; Chang, W.-C.; Yang, Y.; and Liu, H. 2018. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, 95–104. New York, NY, USA: Association for Computing Machinery. ISBN 9781450356572.
Liu, M.; Zeng, A.; Chen, M.; Xu, Z.; Lai, Q.; Ma, L.; and Xu, Q. 2022. SCINet: Time Series Modeling and Forecasting with Sample Convolution and Interaction. In Oh, A. H.; Agarwal, A.; Belgrave, D.; and Cho, K., eds., Advances in Neural Information Processing Systems.
Liu, X.; and Wang, W. 2024. Deep Time Series Forecasting Models: A Comprehensive Survey. Mathematics, 12(10).
Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; and Long, M. 2024. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. In The Twelfth International Conference on Learning Representations.
Ouyang, Z.; Ravier, P.; and Jabloun, M. 2021. STL Decomposition of Time Series Can Benefit Forecasting Done by Statistical Methods but Not by Machine Learning Ones. Engineering Proceedings, 5(1).
Qin, Y.; Song, D.; Cheng, H.; Cheng, W.; Jiang, G.; and Cottrell, G. W. 2017. A dual-stage attention-based recurrent neural network for time series prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, 2627–2633. AAAI Press. ISBN 9780999241103.
Smyl, S. 2020. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting, 36(1): 75–85. M4 Competition.
Wang, J.; Du, W.; Cao, W.; Zhang, K.; Wang, W.; Liang, Y.; and Wen, Q. 2024a. Deep Learning for Multivariate Time Series Imputation: A Survey. arXiv:2402.04059.
Wang, Y.; Wu, H.; Dong, J.; Liu, Y.; Long, M.; and Wang, J. 2024b. Deep Time Series Models: A Comprehensive Survey and Benchmark. arXiv:2407.13278.
Wen, Q.; Gao, J.; Song, X.; Sun, L.; Xu, H.; and Zhu, S. 2019. RobustSTL: A Robust Seasonal-Trend Decomposition Algorithm for Long Time Series. Proceedings of the AAAI Conference on Artificial Intelligence, 33: 5409–5416.
Wen, Q.; Sun, L.; Yang, F.; Song, X.; Gao, J.; Wang, X.; and Xu, H. 2021. Time Series Data Augmentation for Deep Learning: A Survey. In Zhou, Z.-H., ed., Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, 4653–4660. International Joint Conferences on Artificial Intelligence Organization. Survey Track.
Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; and Long, M. 2023. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. arXiv:2210.02186.
Wu, H.; Xu, J.; Wang, J.; and Long, M. 2021. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Beygelzimer, A.; Dauphin, Y.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems.
Xu, M.; Yoon, S.; Fuentes, A.; and Park, D. S. 2023. A Comprehensive Survey of Image Augmentation Techniques for Deep Learning. Pattern Recognition, 137: 109347.
Zeng, A.; Chen, M.; Zhang, L.; and Xu, Q. 2023. Are transformers effective for time series forecasting? AAAI'23/IAAI'23/EAAI'23. AAAI Press. ISBN 978-1-57735-880-0.
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; and Zhang, W. 2021. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. arXiv:2012.07436.
Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; and Jin, R. 2022. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. arXiv:2201.12740.
Zhou, Y.; You, L.; Zhu, W.; and Xu, P. 2023. Improving time series forecasting with mixup data augmentation. In ECML PKDD 2023 International Workshop on Machine Learning for Irregular Time Series.