MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion
Benchmarking Across Specialized Domains
Leyan Xue1, Changqing Zhang1*, Kecheng Xue2, Xiaohong Liu3, Guangyu Wang2, Zongbo Han2*
1College of Intelligence and Computing, Tianjin University, Tianjin, China
2State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications,
Beijing, China
3Institute of Medical Artificial Intelligence, South China Hospital, Medical School, Shenzhen University, Guangdong, China
Abstract
Although multimodal fusion has made significant progress,
its advancement is severely hindered by the lack of adequate
evaluation benchmarks. Current fusion methods are typically
evaluated on a small selection of public datasets, a limited
scope that inadequately represents the complexity and di-
versity of real-world scenarios, potentially leading to biased
evaluations. This issue presents a twofold challenge. On one
hand, models may overfit to the biases of specific datasets,
hindering their generalization to broader practical applica-
tions. On the other hand, the absence of a unified evaluation
standard makes fair and objective comparisons between dif-
ferent fusion methods difficult. Consequently, a truly univer-
sal and high-performance fusion model has yet to emerge.
To address these challenges, we have developed a large-
scale, domain-adaptive benchmark for multimodal evalua-
tion. This benchmark integrates over 30 datasets, encompass-
ing 15 modalities and 20 predictive tasks across key applica-
tion domains. To complement this, we have also developed
an open-source, unified, and automated evaluation pipeline
that includes standardized implementations of state-of-the-art
models and diverse fusion paradigms. Leveraging this plat-
form, we have conducted large-scale experiments, success-
fully establishing new performance baselines across multiple
tasks. This work provides the academic community with a
crucial platform for rigorous and reproducible assessment of
multimodal models, aiming to propel the field of multimodal
artificial intelligence to new heights.
Code — https://github.com/ravexly/MultiBenchplus
1 Introduction
Multimodal data, such as text, images, and sensor sig-
nals, is driving the next generation of artificial intelligence.
Through a technique known as Multimodal Fusion, AI sys-
tems can integrate and understand information from these
diverse sources, achieving a more comprehensive, accurate,
and robust understanding than is possible with any single
source (Baltrušaitis, Ahuja, and Morency 2018; Xu, Zhu,
and Clifton 2023). This capability is a key driver for advancing AI to higher levels of intelligence and shows immense potential in fields like autonomous driving and medical diagnostics (Caesar et al. 2020; Azam et al. 2022).
*Corresponding author.
Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
However, a significant divergence exists between mul-
timodal research and other domains. While fields such as
natural language processing and graph learning have suc-
cessfully converged on dominant architectural paradigms,
specifically the Transformer and GNNs, the multimodal do-
main conspicuously lacks an equivalent unified, founda-
tional framework. Progress remains highly fragmented; re-
searchers typically validate new methods on a small, be-
spoke selection of classic datasets. This reliance on siloed
benchmarks is a key bottleneck. It not only leads to models
overfitting to specific data biases and prevents fair objective
comparisons, but more importantly, it has hindered the sys-
tematic search for a truly general-purpose fusion architec-
ture. Four years ago, the groundbreaking MULTIBENCH
framework partially addressed this by providing a unified
evaluation platform. Yet, with the field’s rapid evolution, its
limitations are now apparent, and it is no longer sufficient to
meet today’s challenges.
The urgent need for a new foundational platform stems
from two primary trends. The first is the explosive growth in
data combinatorial complexity. Unlike unimodal tasks, real-
world applications in medical imaging, IoT, and autonomous
driving (Hu et al. 2023; Kong et al. 2011) span a vast, hetero-
geneous spectrum. This combinatorial effect, where differ-
ent data combinations can yield entirely different analytical
conclusions, means older, simpler benchmarks can no longer
demonstrate a model’s robustness or adaptability to real-
world complexity. The second trend is the rapid evolution of
fusion models, especially Transformer-based methods (Wei
et al. 2020; Wang et al. 2022; Xu, Zhu, and Clifton 2023).
Without a standardized, complex testbed, it is impossible
to determine if these new models are truly general-purpose
or simply adept at specific data combinations. Therefore, a
platform that forces models to confront this combinatorial
complexity has become essential to guide the search for a
foundational architecture.
To address key challenges in evaluating next-generation
multimodal fusion models, we introduce MULTI-
BENCH++, a new, large-scale benchmark. Instead of
being a simple incremental update, MULTIBENCH++
represents a significant leap forward in terms of scale,
domain diversity, and suitability for modern architectures.
arXiv:2511.06452v2 [cs.LG] 14 Nov 2025
Table 1: MULTIBENCH++ offers a unified benchmark suite of 37 multimodal datasets spanning a wide spectrum of research fields, data scales, input modalities (a: audio, c: clinical/tabular, d: depth/DSM, e: events/spiking, f: 2D fundus, g: GIS, h: HSI, i: image, k: time-series, L: LiDAR, m: multispectral, M: metadata, o: multi-omics, O: 3D OCT, s: SAR, t: text, v: video, with omics sub-typed as o1: mRNA, o2: miRNA, o3: DNA), and downstream tasks.
Domain | Dataset | Modalities | # Samples | Prediction Task
Remote Sensing | Houston2013 | {h, L} | 14,999 | Land Cover Classification
 | Houston2018 | {h, L} | 2,018,910 | Land Cover Classification
 | MUUFL Gulfport | {h, L} | 53,687 | Land Cover Classification
 | Trento | {h, L} | 30,214 | Land Cover Classification
 | Berlin | {h, s} | 464,671 | Land Cover Classification
 | MDAS (Augsburg) | {h, s} | 78,294 | Land Cover Classification
 | ForestNet | {c, i, m} | 2,757 | Forest Type Mapping
Medical AI | TCGA-BRCA | {o1, o2, o3} | 875 | Survival Prediction, Subtype Classification
 | ROSMAP | {o1, o2, o3} | 351 | Disease Progression Prediction
 | SIIM-ISIC | {i, M} | 33,126 | Malignant Tumor Classification
 | Derm7pt | {i, M} | 1,011 | Lesion Diagnosis Prediction
 | GAMMA | {f, O} | 100 | Glaucoma Grading
 | MIMIC-III | {c, t} | 36,212 | Mortality Prediction
 | MIMIC-CXR | {c, t} | 372,147 | Mortality Prediction, Multilabel Classification
 | eICU | {c, t} | 7,637 | Mortality Prediction
 | TCGA | {o1, o2, o3} | 306 | Subtype Classification, Tumor Malignancy Grading
Affective Computing & Social Media Understanding | MELD | {a, t} | 13,708 | Emotion/Sentiment Recognition
 | IEMOCAP | {a, v, t} | 7,433 | Emotion/Sentiment Recognition
 | MAMI | {i, t} | 11,000 | Misogyny Content Detection
 | Memotion | {i, t} | 6,831 | Offensive Content Detection
 | MUTE | {i, t} | 4,156 | Hate Speech Detection
 | MultiOFF | {i, t} | 743 | Offensive Content Detection
 | MET-Meme(C) | {i, t} | 2,299 | Metaphor/Emotion/Intent Recognition
 | MET-Meme(E) | {i, t} | 1,053 | Metaphor/Emotion/Intent Recognition
 | CH-SIMS | {a, v, t} | 2,281 | Sentiment Analysis
 | CH-SIMS v2.0 | {a, v, t} | 4,403 | Sentiment Analysis
 | Twitter2015 | {i, t} | 5,338 | Multimodal Named Entity Recognition
 | Twitter1517 | {i, t} | 4,672 | Multimodal Named Entity Recognition
Others | MIRFLICKR | {i, t} | 20,015 | Image Retrieval
 | CUB Image-Caption | {i, t} | 117,880 | Fine-grained Classification
 | SUN-RGBD | {i, d} | 9,504 | Scene Understanding, Object Detection
 | NYUDv2 | {i, d} | 1,863 | Scene Understanding, Object Detection
 | UPMC-Food101 | {i, t} | 90,686 | Food Recognition
 | MVSA-Single | {i, t} | 2,592 | Sentiment Analysis
 | MNIST-SVHN | {i, i} | 660,680 | Digit Recognition
 | N-MNIST+N-TIDIGITS | {e, i} | 4,050 | Digit Recognition
 | E-MNIST+EEG | {i, k} | 702 | Digit Recognition
Its core contributions are threefold:
• Expanded scale and domain coverage. MULTI-
BENCH++ brings together over 30 datasets, more than
doubling the size of its predecessor. More importantly,
it extends into highly complex and specialized domains,
including Remote Sensing, Healthcare, Affective Com-
puting, and Social Media Analysis. These domains
present unique data fusion challenges.
• Designed for rigorous testing of advanced architectures.
Its datasets are carefully chosen for their high complex-
ity, rich interplay between modalities, and naturally oc-
curring missing data, creating a challenging test envi-
ronment. This design allows for rigorous testing of advanced, Transformer-based architectures and novel fusion techniques.
• An open-source framework for robust and reproducible
evaluation. To ensure fair and rigorous scientific compar-
isons, MULTIBENCH++ includes a standardized, open-
source evaluation framework. This framework provides
standardized data splits, Robustness Probes, and a set of
strong baseline models that have been carefully tuned us-
ing Automated Hyperparameter Optimization. This in-
frastructure is designed to lower the barrier to entry for
researchers and ensure that future innovations can be re-
liably evaluated on a fair and consistent foundation.
2 Related Works
Comparisons with Related Benchmarks. Multimodal
research has been driven by a series of influential benchmarks.

Figure 1: An overview of the MULTIBENCH++ framework, highlighting our core contributions. (Left) We introduce a broader and deeper collection of datasets, significantly expanding into more specialized domains and data modalities. (Center) We integrate more advanced fusion paradigms, including feature-level transformer-based fusion and decision-level fusion. (Right) We provide an automated hyper-parameter tuning platform, powered by Optuna, to ensure robust and reproducible evaluation.

Foundational datasets for visual question answering, such as VQA-v2 (Goyal et al. 2017), established large-scale, open-ended visual reasoning as a core challenge. In
parallel, benchmarks for multimodal sentiment analysis and
emotion recognition, like CMU-MOSI (Zadeh et al. 2016)
and the larger CMU-MOSEI (Zadeh et al. 2018), provided
key testbeds for integrating language, visual, and acoustic
signals. To standardize evaluation across this growing land-
scape, frameworks like MULTIBENCH (Liang et al. 2021)
were introduced to offer a unified, reproducible testbed for
assessing model robustness across a diverse set of tasks.
Building on this principle, our work expands this suite
with 20 additional datasets, contributing to a broader trend
of comprehensive evaluation that also includes integrating
more competitive methods and developing automated tun-
ing platforms.
Other works introduce new domains, such as MM-GRAPH
(Zhu et al. 2025), which integrates visual features into
graph-based tasks, and Dyn-VQA (Li et al. 2024), which
tests dynamic question answering requiring multi-hop re-
trieval.
Multimodal Fusion. The core challenge in multimodal
learning is fusion: the effective combination of information
from different modalities. Classical approaches are catego-
rized by the architectural stage at which fusion occurs: early
(feature-level) fusion concatenates raw or low-level features
(Ramachandram and Taylor 2017; Atrey et al. 2010), while
late (decision-level) fusion combines outputs from modality-
specific models (Soleymani et al. 2017; Han et al. 2022;
Zhang et al. 2023). The advent of the Transformer has made
attention-based fusion the dominant paradigm (Xu, Zhu,
and Clifton 2023). These methods can be broadly classi-
fied as single-stream, where multimodal inputs are concate-
nated and processed by a unified encoder, or multi-stream,
which uses separate encoders for each modality followed
by cross-attention mechanisms to integrate information (Nagrani et al. 2021). Such architectures allow for nuanced,
dynamically-weighted integration of modal features. More
recent research also focuses on developing fusion techniques
that are robust to real-world challenges like noisy, incom-
plete, or imbalanced data (Zhang et al. 2024).
Analysis of Multimodal Representations. Beyond predictive performance, understanding the quality of learned
multimodal representations is a critical area of research
(Frome et al. 2013; Kiros, Salakhutdinov, and Zemel 2014).
A primary technique is the use of probing tasks, where sim-
ple classifiers are trained on a model’s internal embeddings
to test for specific encoded properties (Alain and Bengio
2016). For example, studies have shown that global seman-
tics are often encoded in the intermediate layers of MLLMs,
while local, fine-grained details are captured in the final lay-
ers (Tao et al. 2024). Other works focus on interpretability
and explainability, using techniques like visualizing atten-
tion maps or model dissection tools to attribute a model’s
decision to specific unimodal features (Liang, Zadeh, and
Morency 2024). The overarching goal of these analyses is to measure fundamental properties like modality alignment, complementarity, and redundancy, which is crucial for building more robust and efficient models.
3 MULTIBENCH++ : A Broader & Deeper
Multimodal Benchmark
3.1 Background
Multimodal datasets differ in both the types of information
they provide and the ways that information is encoded. Re-
mote sensing archives couple spectral bands with LiDAR
point clouds, while electronic health records weave free-
text notes together with structured lab values and pathol-
ogy slides. Affective repositories, in turn, align video frames
to audio streams and wearable signals. These collections
Algorithm 1 End-to-End Workflow with Optuna.
# 1. Load data and define models
train_loader, val_loader, test_loader, n_classes = get_loader()
encoders = [Modality1Encoder(args), Modality2Encoder(args)]
fusion = TMC(n_classes)
head = get_head(n_classes, decision=True)
# 2. Define a minimal objective function
def objective(trial):
    # Suggest hyperparameters
    params = {'lr': trial.suggest_loguniform('lr', 1e-5, 1e-3), ...}
    # Train a model and return its validation accuracy
    val_accuracy = train(encoders, fusion, head, params, ...)
    return val_accuracy
# 3. Run the hyperparameter search
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)
# 4. Get the best parameters and build the final model
final_model = build_model(study.best_trial.number)
# 5. Evaluate the final, optimized model
test_accuracy = test(final_model, test_loader)
are rarely designed for joint use, so their formats, resolu-
tions, and noise levels diverge. A benchmark must there-
fore expose models to this heterogeneity. As shown in Fig.
1, MULTIBENCH++ addresses the current evaluation gap
by collecting over thirty datasets drawn from highly special-
ized domains including remote sensing, healthcare, affec-
tive computing and social media. These specialized domains
are not arbitrary; they represent critical frontiers where
multimodal integration is pivotal for scientific and societal
progress. The inclusion of such a wide array of data sources
ensures that the benchmark rigorously tests a model’s abil-
ity to generalize across fundamentally different data struc-
tures and noise profiles, moving beyond single-domain eval-
uations.
3.2 Datasets
Remote Sensing for Environmental IntelligenceThe
remote-sensing domain for environmental intelligence tack-
les land-cover classification, target detection, spectral un-
mixing, and related tasks by fusing data whose physi-
cal origins differ fundamentally. Optical and hyperspec-
tral systems capture surface chemistry yet remain weather-
dependent; SAR measures microwave backscatter day-and-
night; LiDAR delivers centimetre-level topography and
canopy structure. Integrating these streams requires recon-
ciling disparate spatial resolutions, geometries, and noise
statistics. Foundational datasets establish this task-oriented
landscape. Houston2013 and Houston2018 (Debes et al.
2014; Xu et al. 2018) combine hyperspectral imagery with
LiDAR for urban land-cover classification. MUUFL Gulf-
port (Gader et al. 2013) adds co-registered hyperspectral
and LiDAR data over a university campus for classifica-
tion and rare-target detection. Trento (University of Trento
2022) provides a rural counterpart with hyperspectral and
LiDAR. Berlin (Okujeni, van der Linden, and Hostert 2016)
fuses PolSAR and hyperspectral data, while ForestNet (Irvin
et al. 2020) couples satellite imagery and airborne LiDAR
for forest-type and biomass mapping. The recent MDAS dataset (Hu et al. 2023) enriches the benchmark suite with
simultaneous SAR, multispectral, hyperspectral, DSM, and
GIS layers, supporting resolution enhancement, unmixing,
and classification. Together, these datasets constitute a sys-
tematic test-bed for advancing theoretically grounded and
practically robust environmental-intelligence algorithms.
Medical AI for Diagnostics and Prognosis  The medical-intelligence domain addresses survival prediction, malignancy classification, disease-progression modelling and related tasks by fusing exceptionally heterogeneous data
streams. Gigapixel whole-slide images quantify tissue mor-
phology; high-dimensional omics profiles capture molecu-
lar aberrations; dense time-series vital signs and concise
clinical narratives encode patient trajectories. Integrating
these modalities demands reconciling extreme differences
in resolution, scale and noise, while preserving clinical in-
terpretability. Benchmark datasets anchor this landscape.
TCGA-BRCA (Weinstein et al. 2013) couples WSIs with
multi-omics for breast-cancer survival and subtype analy-
sis. ROSMAP (Bennett et al. 2018) supplies longitudinal
multi-omics and pathology to chart Alzheimer progression.
In dermatology, SIIM-ISIC (Rotemberg et al. 2021) and
Derm7pt (Kawahara et al. 2018) pair dermoscopic images
with patient metadata for melanoma detection. GAMMA
(Wu et al. 2023) fuses 2D fundus photographs with 3D OCT
volumes for glaucoma grading and optic-disc/cup segmenta-
tion. MIMIC-III (Johnson et al. 2016), MIMIC-CXR (John-
son et al. 2019) and eICU (Pollard et al. 2018) deliver large-
scale ICU time series merged with static clinical records to
support mortality prediction and disease-code classification.
Affective Computing and Social Media Understanding
The Affective Computing domain addresses emotion recog-
nition, sarcasm detection, sentiment analysis and related
high-level tasks by fusing text, acoustic, visual and cultural
cues that are frequently incongruent or metaphorical. Dyadic
and multi-party conversations introduce temporal alignment
challenges, while internet memes overlay visual symbols
with rapidly shifting socio-cultural contexts. Effective inte-
gration demands models that can resolve cross-modal sar-
casm, capture long-range conversational flow and remain
sensitive to cultural nuance. Benchmark datasets collec-
tively span these phenomena. MELD (Poria et al. 2019) and
IEMOCAP (Busso et al. 2008) provide temporally aligned
audio, video and transcriptions for multi-party and dyadic
emotion recognition. MUTE (Hossain, Sharif, and Hoque
2022) extends the conversational setting to multilingual sce-
narios. MAMI (Fersini et al. 2022), MultiOFF (Suryawanshi
et al. 2020), Memotion (Sharma et al. 2020) and MET-Meme
(Xu et al. 2022) (Chinese & English versions) jointly encode
images and text for the detection of misogynistic, offensive
and metaphorical content in memes. CH-SIMS (Yu et al.
2020) and its successor CH-SIMS v2.0 (Liu et al. 2022) de-
liver fine-grained Chinese multimodal sentiment annotations
with explicit modality-importance scores. Twitter2015 and
Twitter1517 (Zhang et al. 2018; Lu et al. 2018; Chen et al.
2023) close the loop with classic social-media tasks, linking
text and images for named-entity recognition and sentiment
polarity prediction.
Others  This domain addresses image–text retrieval, fine-grained classification, digit recognition, scene understanding and related tasks by fusing heterogeneous yet tightly
aligned modalities. Integrating vision with language, depth,
audio or event streams demands reconciling distinct resolu-
tions, sampling rates and noise distributions while preserv-
ing interpretability. Benchmark datasets anchor this land-
scape. MIRFLICKR-25K (Huiskes and Lew 2008) cou-
ples images with user tags for large-scale retrieval. MVSA-
Single (Niu et al. 2016) supplies tweet images and text for
visual–sentiment classification. CUB Image-Caption (Shi
et al. 2019) pairs bird photographs with textual descrip-
tions for fine-grained classification, while MNIST-SVHN
(Shi et al. 2019) aligns handwritten and street-view dig-
its across domains. SUN-RGBD (Song, Lichtenberg, and
Xiao 2015) and NYUDv2 (Silberman et al. 2012) provide
RGB–depth pairs for scene understanding and object detec-
tion, and UPMC-Food101 (Wang et al. 2015) fuses food im-
ages with recipe text for cross-modal recognition. Follow-
ing Lin et al. (2025), we further combine N-MNIST+N-
TIDIGITS (Orchard et al. 2015; Anumula et al. 2018) to
synchronise frame-based and event-based vision with spo-
ken digits, and E-MNIST+EEG (Cohen et al. 2017; Willett
et al. 2021) to link character images with electroencephalog-
raphy signals for cognitive-state-aware digit recognition.
3.3 Evaluation Protocol
We follow MULTIBENCH’s holistic evaluation with only
minor adjustments. For every dataset and method, we re-
port performance on the test fold using task-specific metrics
(e.g., accuracy, macro-F1, AUPRC, or MSE). Each run is
performed 3 times using different random seeds, all under
the same hardware configuration.

4 MULTIBENCH++ Algorithms: More
Advanced Fusion Paradigms
We introduce a complete, end-to-end framework for system-
atic multimodal evaluation. The framework is built around
two classes of fusion methodologies: four Transformer-
centric paradigms to model complex cross-modal interac-
tions, and two modules for efficient decision-level logits fu-
sion. To enable robust and reproducible experimentation, we
further develop an automated hyperparameter optimization
engine based on Optuna (Akiba et al. 2019). This engine fa-
cilitates a systematic exploration and optimization, allowing
for an efficient identification of optimal configurations.
4.1 Transformer-Based Feature Fusion
Architectures
We evaluate several Transformer-based architectures for
multimodal feature fusion. Each model is designed to accept
a set of modality-specific input tensors $\{x_i\}_{i=1}^{k}$ and produce a unified representation vector $g$.
Hierarchical Attention (Multi-to-One)  This model first encodes each modality independently using a shallow, modality-specific Transformer encoder $E_i$. The resulting classification tokens are then concatenated and processed by a deeper, shared fusion encoder $E_{\text{fuse}}$ to model high-level interactions (Li et al. 2021).
$$z_i = T(E_i(\Phi_i(x_i)))$$
$$g = T(E_{\text{fuse}}([z_1; z_2; \ldots; z_k]))$$
where $\Phi_i$ is a 1D convolution and $T(\cdot)$ is an operator that selects the classification token's embedding.
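As a concrete illustration, the multi-to-one scheme can be sketched in PyTorch as follows; the class name, dimensions, and CLS-token handling are our own illustrative choices, not the benchmark's released implementation:

```python
import torch
import torch.nn as nn

class MultiToOneFusion(nn.Module):
    """Sketch: shallow per-modality encoders E_i over conv-projected
    inputs (Phi_i), then a deeper shared encoder E_fuse over the
    stacked classification tokens."""
    def __init__(self, in_channels, d_model=64, n_classes=10):
        super().__init__()
        # Phi_i: 1x1 convolutions projecting each modality to d_model
        self.proj = nn.ModuleList([nn.Conv1d(c, d_model, kernel_size=1) for c in in_channels])
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # E_i: shallow modality-specific encoders
        self.enc = nn.ModuleList([nn.TransformerEncoder(make_layer(), num_layers=1) for _ in in_channels])
        # E_fuse: deeper shared fusion encoder
        self.fuse = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))       # per-modality CLS token
        self.fuse_cls = nn.Parameter(torch.zeros(1, 1, d_model))  # fusion-level CLS token
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, xs):  # xs[i]: (batch, channels_i, seq_len_i)
        zs = []
        for x, proj, enc in zip(xs, self.proj, self.enc):
            t = proj(x).transpose(1, 2)  # (B, L, d_model)
            t = torch.cat([self.cls.expand(t.size(0), -1, -1), t], dim=1)
            zs.append(enc(t)[:, 0])      # T(.): take the CLS embedding
        z = torch.stack(zs, dim=1)       # (B, k, d_model)
        z = torch.cat([self.fuse_cls.expand(z.size(0), -1, -1), z], dim=1)
        g = self.fuse(z)[:, 0]           # fused representation
        return self.head(g)

model = MultiToOneFusion(in_channels=[8, 12])
logits = model([torch.randn(2, 8, 16), torch.randn(2, 12, 20)])
print(logits.shape)  # torch.Size([2, 10])
```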
Hierarchical Attention (One-to-Multi)  Conversely, this architecture first models cross-modal interactions before refining modality-specific features. All inputs are projected by a linear layer $\Psi_i$ and concatenated into a single sequence. This joint sequence is processed by a shared encoder $E_{\text{shared}}$. The output sequence is then split into its original modality-specific segments, each of which is passed through a final dedicated encoder $E_i$ (Lin et al. 2020).
$$h = E_{\text{shared}}([\Psi_1(x_1); \ldots; \Psi_k(x_k)])$$
$$[h_1, \ldots, h_k] = \mathrm{Split}(h)$$
$$g = [T(E_1(h_1)); \ldots; T(E_k(h_k))]$$
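A matching PyTorch sketch of the one-to-multi variant, under the same illustrative assumptions (hypothetical class and dimension names):

```python
import torch
import torch.nn as nn

class OneToMultiFusion(nn.Module):
    """Sketch: shared encoder E_shared over the joint projected
    sequence, then per-modality refinement encoders E_i."""
    def __init__(self, in_dims, d_model=64, n_classes=10):
        super().__init__()
        # Psi_i: linear projections into a common width
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in in_dims])
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(make_layer(), num_layers=2)  # E_shared
        self.refine = nn.ModuleList([nn.TransformerEncoder(make_layer(), num_layers=1) for _ in in_dims])
        self.cls = nn.ParameterList([nn.Parameter(torch.zeros(1, 1, d_model)) for _ in in_dims])
        self.head = nn.Linear(d_model * len(in_dims), n_classes)

    def forward(self, xs):  # xs[i]: (batch, seq_len_i, dim_i)
        lens = [x.size(1) for x in xs]
        # Joint sequence through the shared encoder, then split back
        h = self.shared(torch.cat([p(x) for p, x in zip(self.proj, xs)], dim=1))
        outs = []
        for hi, enc, cls in zip(torch.split(h, lens, dim=1), self.refine, self.cls):
            hi = torch.cat([cls.expand(hi.size(0), -1, -1), hi], dim=1)
            outs.append(enc(hi)[:, 0])  # T(.): per-modality CLS embedding
        return self.head(torch.cat(outs, dim=-1))

model = OneToMultiFusion(in_dims=[8, 12])
print(model([torch.randn(2, 5, 8), torch.randn(2, 7, 12)]).shape)  # torch.Size([2, 10])
```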
Cross-Attention Fusion (CAF)  CAF facilitates direct, dense interaction between pairs of modalities. For a bimodal case $(x_1, x_2)$, each modality's sequence is used to generate queries that attend to the keys and values of the other modality (Lu et al. 2019).
$$z_{1 \leftarrow 2} = \mathrm{MultiHead}(\Psi_Q(x_1), \Psi_K(x_2), \Psi_V(x_2))$$
$$z_{2 \leftarrow 1} = \mathrm{MultiHead}(\Psi_Q(x_2), \Psi_K(x_1), \Psi_V(x_1))$$
$$g = [T(z_{1 \leftarrow 2}); T(z_{2 \leftarrow 1})]$$
where $\Psi_Q, \Psi_K, \Psi_V$ are modality-specific linear projection layers.
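A bimodal CAF sketch follows; note that nn.MultiheadAttention supplies the query/key/value projections ($\Psi_Q, \Psi_K, \Psi_V$) internally, and mean pooling stands in for the token selector $T(\cdot)$ — both are our simplifications, not the benchmark's exact design:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch: each modality queries the other's keys and values,
    then the two attended summaries are concatenated."""
    def __init__(self, d1, d2, d_model=64, n_classes=10):
        super().__init__()
        self.p1, self.p2 = nn.Linear(d1, d_model), nn.Linear(d2, d_model)
        self.attn12 = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.attn21 = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, x1, x2):  # (B, L1, d1), (B, L2, d2)
        h1, h2 = self.p1(x1), self.p2(x2)
        z12, _ = self.attn12(h1, h2, h2)  # modality 1 attends to modality 2
        z21, _ = self.attn21(h2, h1, h1)  # modality 2 attends to modality 1
        # T(.): mean-pool each attended sequence, then concatenate
        return self.head(torch.cat([z12.mean(1), z21.mean(1)], dim=-1))

model = CrossAttentionFusion(d1=8, d2=12)
print(model(torch.randn(2, 5, 8), torch.randn(2, 7, 12)).shape)  # torch.Size([2, 10])
```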
Table 2: Performance on Remote Sensing Datasets.
Dataset | Concat | TF | ConcatEarly | LFT | EFT | Multi-to-One | One-to-Multi | CAF | CACF | LS | TMC
Houston2013 | 75.68 | 76.53 | 75.95 | 60.54 | 24.68 | 78.54 | 79.22 | 74.92 | 83.32 | 79.43 | 79.12
Houston2018 | 70.99 | 74.46 | 70.07 | 63.41 | 35.00 | 76.52 | 70.89 | 68.71 | 77.66 | 80.52 | 77.33
MUUFL Gulfport | 84.21 | 83.90 | 86.26 | 71.77 | 46.68 | 80.95 | 81.23 | 83.99 | 86.42 | 86.64 | 83.20
Trento | 98.43 | 96.95 | 98.14 | 95.68 | 71.64 | 97.71 | 96.08 | 97.74 | 98.68 | 98.53 | 97.67
Berlin | 68.25 | 70.22 | 72.53 | 61.72 | 60.74 | 73.75 | 71.22 | 76.31 | 77.42 | 78.61 | 77.67
Augsburg | 89.24 | 85.86 | 89.50 | 82.88 | 57.29 | 85.86 | 86.80 | 87.92 | 89.05 | 89.49 | 87.21
ForestNet | 45.18 | 45.63 | 45.68 | 47.19 | 44.33 | 45.78 | 45.58 | 45.03 | 45.93 | 46.08 | 45.68
Table 3: Performance on Medical AI Datasets. The dash “-” indicates the method is not applicable to this dataset.
Dataset | Concat | TF | ConcatEarly | LFT | EFT | Multi-to-One | One-to-Multi | CAF | CACF | LS | TMC
TCGA-BRCA | 78.17 | 77.18 | 77.18 | 69.44 | 65.67 | 76.79 | 76.19 | 78.77 | 78.37 | 77.18 | 75.40
ROSMAP | 70.95 | 75.71 | 70.00 | 42.86 | 69.52 | 69.52 | 68.10 | 66.67 | 71.90 | 71.43 | 66.67
SIIM-ISIC | 97.86 | 97.77 | 97.86 | 97.83 | 97.85 | 97.85 | 97.83 | 97.86 | 97.83 | 97.83 | 97.85
Derm7pt | 45.49 | 52.72 | 45.41 | 43.96 | 38.86 | 45.92 | 46.77 | 48.13 | 52.72 | 46.34 | 52.72
GAMMA | 61.43 | 62.86 | 63.81 | 62.38 | 59.05 | 57.62 | 65.71 | 63.33 | 63.33 | 62.38 | 61.43
MIMIC-III | 68.09 | 68.81 | 68.48 | 68.42 | 68.59 | 69.01 | 68.76 | 68.51 | 68.88 | 68.47 | 68.87
eICU | 90.05 | 90.03 | 90.05 | 90.05 | 90.05 | 90.05 | 90.05 | 90.07 | 90.05 | 90.05 | 90.03
TCGA | 51.91 | - | 53.55 | 60.66 | 51.37 | 60.11 | 53.01 | 56.28 | 61.75 | 54.10 | 62.84
MIMIC-CXR (macro-F1) | 0.6861 | 0.8458 | 0.6788 | 0.1551 | 0.1829 | 0.5229 | 0.4031 | 0.6136 | 0.7523 | 0.7644 | -
Cross-Attention Concatenation Fusion (CACF)  CACF extends CAF by incorporating an additional global reasoning step (Zhan et al. 2021; Tsai et al. 2019). The cross-attended representations $(z_{1 \leftarrow 2}, z_{2 \leftarrow 1})$ are concatenated with initial linear projections of the original inputs ($x'_i = \Psi_i(x_i)$). This combined sequence is then processed by a final global Transformer encoder $E_{\text{global}}$.
$$f = [x'_1; z_{1 \leftarrow 2}; x'_2; z_{2 \leftarrow 1}]$$
$$g = T(E_{\text{global}}(f))$$
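A CACF sketch under the same caveats as the other architecture sketches (illustrative names and sizes; nn.MultiheadAttention provides the Q/K/V projections; mean pooling stands in for $T(\cdot)$):

```python
import torch
import torch.nn as nn

class CACF(nn.Module):
    """Sketch: cross-attended features concatenated with the projected
    inputs x'_i, then a global Transformer encoder E_global."""
    def __init__(self, d1, d2, d_model=64, n_classes=10):
        super().__init__()
        self.p1, self.p2 = nn.Linear(d1, d_model), nn.Linear(d2, d_model)
        self.attn12 = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.attn21 = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.global_enc = nn.TransformerEncoder(layer, num_layers=2)  # E_global
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x1, x2):  # (B, L1, d1), (B, L2, d2)
        h1, h2 = self.p1(x1), self.p2(x2)        # x'_1, x'_2
        z12, _ = self.attn12(h1, h2, h2)
        z21, _ = self.attn21(h2, h1, h1)
        f = torch.cat([h1, z12, h2, z21], dim=1)  # joint sequence f
        return self.head(self.global_enc(f).mean(1))  # T(.) as mean pooling

model = CACF(d1=8, d2=12)
print(model(torch.randn(2, 5, 8), torch.randn(2, 7, 12)).shape)  # torch.Size([2, 10])
```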
Hybrid Logit Fusion Methods
We also implement two representative methods that operate
directly on the output logits{ℓ i}k
i=1from modality-specific
classifiers. Other methods can also be easily and quickly in-
corporated into our proposed benchmark.
Logit Summation (LS)  This is the most direct parameter-free method for logit fusion. It operates under the assumption that each modality contributes equally to the final prediction. The logit vectors from all modality-specific classifiers are simply summed to produce the final fused logits: $\ell_{\text{fused}} = \sum_{i=1}^{k} \ell_i$.
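Logit summation is a one-liner; a minimal sketch with hypothetical per-modality logit vectors:

```python
import numpy as np

def logit_summation(logits):
    """Parameter-free late fusion: sum the per-modality logit vectors."""
    return np.sum(np.stack(logits, axis=0), axis=0)

l1 = np.array([2.0, -1.0, 0.5])   # modality 1 classifier logits (illustrative)
l2 = np.array([0.5, 1.5, -0.5])   # modality 2 classifier logits (illustrative)
fused = logit_summation([l1, l2])
print(fused)           # [2.5 0.5 0. ]
print(fused.argmax())  # 0
```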
Evidential Fusion (TMC)  This method, based on evidential deep learning, transforms logits into evidence parameters $\alpha_i$ for a Dirichlet distribution (Han et al. 2022). This allows for explicit uncertainty quantification. The evidence from each modality is then fused using Dempster's rule of combination.
$$\alpha_i = \mathrm{Softplus}(\ell_i) + 1, \quad \alpha_{\text{fused}} = \bigoplus_{i=1}^{k} \alpha_i$$
Here, $\bigoplus$ denotes the Dempster-Shafer combination operator. The final class probabilities are derived from the fused evidence vector $\alpha_{\text{fused}}$.
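A NumPy sketch of this pipeline for two modalities, using the reduced Dempster combination of Dirichlet evidence as formulated in the TMC paper (our transcription; function names and the example logits are illustrative):

```python
import numpy as np

def logits_to_alpha(logits):
    """Evidence alpha = Softplus(logits) + 1 for a Dirichlet distribution."""
    return np.log1p(np.exp(logits)) + 1.0

def dempster_combine(alpha1, alpha2):
    """Reduced Dempster's rule over two Dirichlet evidence vectors
    (belief masses b, uncertainty masses u), following Han et al. 2022."""
    K = alpha1.size
    e1, e2 = alpha1 - 1.0, alpha2 - 1.0
    S1, S2 = alpha1.sum(), alpha2.sum()
    b1, b2 = e1 / S1, e2 / S2                   # belief masses
    u1, u2 = K / S1, K / S2                     # uncertainty masses
    C = b1.sum() * b2.sum() - (b1 * b2).sum()   # conflict: sum_{i != j} b1_i b2_j
    b = (b1 * b2 + b1 * u2 + b2 * u1) / (1.0 - C)
    u = u1 * u2 / (1.0 - C)
    S = K / u                                   # recovered Dirichlet strength
    return b * S + 1.0                          # fused alpha

a1 = logits_to_alpha(np.array([3.0, 0.0, -2.0]))  # modality 1 logits (illustrative)
a2 = logits_to_alpha(np.array([2.5, 0.5, -1.0]))  # modality 2 logits (illustrative)
fused = dempster_combine(a1, a2)
probs = fused / fused.sum()  # expected class probabilities under the Dirichlet
print(probs.argmax())        # 0
```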
4.2 Automated Hyper-Parameter Tuning with
Optuna
MULTIBENCH++ radically simplifies hyper-parameter tuning by using Optuna, eliminating traditional, GPU-heavy grid searches. A single objective(trial) callback efficiently handles the entire process, including:
• Dynamic search-space definition
• Module re-instantiation
• Early-stopping pruning
• Best checkpoint storage
This automated approach drastically cuts tuning time and re-
sources, significantly boosting efficiency and performance.
Search-Space Specification  For every trial, Optuna independently samples
• learning rate $\sim \log\mathcal{U}(10^{-5}, 10^{-3})$,
• weight decay $\sim \log\mathcal{U}(10^{-6}, 10^{-2})$,
• optimizer type $\in$ {AdamW, RMSprop, Adam}.
Thus, the joint space spans three orders of magnitude in learning dynamics and two architectural regimes, while remaining compact for efficient Bayesian optimisation.
End-to-End Workflow  Algorithm 1 demonstrates the
end-to-end workflow. After retrieving a dataset via the un-
changed data loader, one may substitute any of the pre-
sented Transformer fusion modules or logits-level com-
biners; the Optuna wrapper then orchestrates the hyper-
parameter search and returns a trained model, which is sub-
sequently evaluated under the standard protocol.
Table 4: Performance on Affective Computing & Social Media Understanding Datasets. The dash “-” indicates the method is not applicable to this dataset.
Dataset | Concat | TF | ConcatEarly | LFT | EFT | Multi-to-One | One-to-Multi | CAF | CACF | LS | TMC
MELD | 61.34 | 65.66 | 62.73 | 47.77 | 57.92 | 66.37 | 64.02 | 65.54 | 62.22 | 61.60 | 65.56
IEMOCAP | 54.96 | 54.59 | 54.82 | 32.64 | 48.14 | 54.49 | 54.16 | 54.26 | 55.06 | 54.30 | 51.22
MAMI | 70.00 | 66.47 | 66.13 | 67.87 | 66.63 | 68.97 | 64.50 | 65.40 | 67.23 | 67.80 | 69.73
Memotion | 77.89 | 77.75 | 78.14 | 78.09 | 78.09 | 78.09 | 78.09 | 78.14 | 78.09 | 78.19 | 78.09
MUTE | 67.87 | 67.31 | 66.59 | 65.63 | 68.19 | 66.03 | 66.99 | 65.87 | 67.07 | 67.95 | 68.67
MultiOFF | 56.95 | 59.86 | 55.38 | 57.76 | 59.11 | 54.38 | 59.33 | 59.76 | 62.03 | 54.16 | 60.01
MET-Meme(C) | 35.67 | 34.87 | 36.92 | 29.68 | 24.12 | 33.55 | 35.09 | 32.89 | 36.62 | 34.94 | 33.85
MET-Meme(E) | 42.95 | 42.47 | 41.83 | 32.69 | 29.81 | 44.23 | 39.74 | 40.38 | 41.03 | 43.11 | 42.47
Twitter2015 | 76.05 | 76.37 | 75.76 | 63.39 | 68.05 | 74.12 | 70.40 | 75.31 | 75.70 | 75.86 | 64.45
Twitter1517 | 76.83 | 76.72 | 76.68 | 76.76 | 76.68 | 76.29 | 76.72 | 76.54 | 75.97 | 76.15 | 76.86
CH-SIMS (MSE) | 0.4835 | 0.7431 | 0.4790 | 0.4775 | 0.4804 | 0.4772 | 0.4836 | 0.4794 | 0.4824 | 0.4898 | -
CH-SIMS v2.0 (MSE) | 0.3202 | 0.3566 | 0.3338 | 0.3214 | 0.3391 | 0.3351 | 0.3448 | 0.2801 | 0.3361 | 0.3360 | -
Table 5: Performance on Other Datasets.
Dataset | Concat | TF | ConcatEarly | LFT | EFT | Multi-to-One | One-to-Multi | CAF | CACF | LS | TMC
MIRFLICKR | 62.81 | 62.33 | 62.42 | 50.25 | 38.02 | 58.44 | 59.95 | 59.13 | 62.51 | 62.45 | 62.00
CUB Image-Caption | 77.90 | 76.26 | 78.04 | 2.89 | 1.81 | 71.90 | 69.19 | 21.09 | 73.95 | 79.48 | 77.73
SUN-RGBD | 60.28 | 59.43 | 60.83 | 45.22 | 31.61 | 53.52 | 56.97 | 53.53 | 58.14 | 60.78 | 59.00
NYUDv2 | 59.02 | 61.47 | 58.82 | 47.96 | 31.70 | 59.58 | 63.20 | 60.55 | 64.02 | 66.87 | 66.16
UPMC-Food101 | 91.95 | 92.04 | 91.80 | 83.76 | 8.75 | 86.04 | 88.80 | 86.89 | 90.12 | 91.66 | 92.02
MVSA-Single | 79.83 | 78.36 | 78.29 | 68.40 | 63.39 | 79.32 | 77.78 | 79.51 | 78.03 | 79.25 | 67.50
MNIST-SVHN | 96.41 | 96.64 | 96.42 | 62.11 | 93.06 | 95.45 | 93.47 | 95.35 | 96.45 | 96.46 | 96.95
N-MNIST+N-TIDIGITS | 94.99 | 94.28 | 95.26 | 80.99 | 30.45 | 93.52 | 94.34 | 94.06 | 95.26 | 94.88 | 94.23
E-MNIST+EEG | 58.72 | 58.21 | 61.28 | 17.69 | 7.95 | 42.56 | 30.51 | 49.74 | 59.49 | 62.05 | 57.69
5 Experiment and Discussion
5.1 Setup
Using MULTIBENCH++, we load each of the expanded
datasets and systematically evaluate the multimodal ap-
proaches in our MULTIBENCH++ Algorithms mentioned in Sec. 4. We maintain a consistent experimental setup, varying
only the method while keeping all other factors constant, in-
cluding the training loop and data preprocessing steps. This
approach ensures that observed differences in performance
can be directly attributed to the fusion method under evalu-
ation. We compare our method with several classic baseline
approaches previously proposed, including Concat, Tensor-
Fusion (TF), ConcatEarly, LateFusionTransformer (LFT),
EarlyFusionTransformer (EFT) (Liang et al. 2021).
5.2 Overall performance
The performance metrics across diverse datasets highlight
the nuanced effectiveness of different fusion strategies. As
shown in Tables 2 to 5, we have the following observations:
(i) Our algorithms (CAF, CACF, Logit Summation, TMC, One-to-Multi, Multi-to-One) yield the highest accuracy on 27 of 37 datasets, routinely beating plain concatenation. (ii) Early fusion methods like Concat and TF collapse on weakly-aligned modalities, yet show no gain on saturated tasks (SIIM-ISIC, eICU). (iii) CACF tops seven benchmarks, confirming its broad efficacy.
Not surprisingly, the marginal utility of advanced fusion is strictly positive when and only when cross-modal redundancy is low; otherwise, naive concatenation attains near-optimal performance once any single modality approaches the task ceiling. Full results are provided in the appendix.
5.3 Data Complexity as a Model Selector
An analysis of the performance metrics in Tables 2 to 5 reveals that data complexity is a critical factor for model selection. Taking Table 2 as an example, on the low-complexity Trento dataset, a simple model like Concat (98.43) is highly effective and performs nearly as well as the top model, CACF (98.68), indicating that increased model complexity provides little benefit. Conversely, for a high-complexity dataset like Berlin, there is a vast performance gap; simple models fail (Concat at 68.25) while sophisticated models like LS (78.61) and TMC (77.67) are essential for achieving high accuracy. This shows that the optimal model choice is not universal; it is dictated by the dataset's inherent complexity, requiring simple models for simple data and advanced architectures for complex data.
6 Future Work and Conclusion
The future of multimodal fusion hinges on two challenges:
datasets lack the fine-grained alignment needed for realistic
validation, and models remain fragmented and unscalable.
The path forward requires creating datasets with deeper
structural correspondences while developing unified, theo-
retically grounded fusion frameworks. Ultimately, this evo-
lution must extend to the evaluation process itself, shifting
from simple tuning towards ethics-aware meta-learning
where fairness and robustness are primary objectives.
In conclusion, we present MULTIBENCH++, a rigorously
curated, open-source benchmark that unites 30+ datasets
spanning 15+ modalities and 20+ tasks across specialized
domains. Coupled with auto-tuned Transformer and hybrid-
logit baselines, it gives researchers a fair testbed for compar-
ing new fusion models, making results easier to reproduce
and closer to real-world use.
References
Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; and Koyama, M.
2019. Optuna: A next-generation hyperparameter optimiza-
tion framework. InProceedings of the 25th ACM SIGKDD
international conference on knowledge discovery & data
mining, 2623–2631.
Alain, G.; and Bengio, Y. 2016. Understanding intermediate
layers using linear classifier probes. arXiv preprint
arXiv:1610.01644.
Anumula, J.; Neil, D.; Delbruck, T.; and Liu, S.-C.
2018. Feature representations for neuromorphic audio spike
streams.Frontiers in neuroscience, 12: 23.
Atrey, P. K.; Hossain, M. A.; El Saddik, A.; and Kankanhalli,
M. S. 2010. Multimodal fusion for multimedia analysis: a
survey.Multimedia systems, 16(6): 345–379.
Azam, M. A.; Khan, K. B.; Salahuddin, S.; Rehman, E.;
Khan, S. A.; Khan, M. A.; Kadry, S.; and Gandomi, A. H.
2022. A review on multimodal medical image fusion:
Compendious analysis of medical modalities, multimodal
databases, fusion techniques and quality metrics.Computers
in biology and medicine, 144: 105253.
Baltrušaitis, T.; Ahuja, C.; and Morency, L.-P. 2018. Multi-
modal machine learning: A survey and taxonomy. IEEE
transactions on pattern analysis and machine intelligence,
41(2): 423–443.
Bennett, D. A.; Buchman, A. S.; Boyle, P. A.; Barnes, L. L.;
Wilson, R. S.; and Schneider, J. A. 2018. Religious or-
ders study and rush memory and aging project.Journal of
Alzheimer’s disease, 64(s1): S161–S189.
Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower,
E.; Kim, S.; Chang, J. N.; Lee, S.; and Narayanan, S. S.
2008. IEMOCAP: Interactive emotional dyadic motion cap-
ture database.Language resources and evaluation, 42(4):
335–359.
Caesar, H.; Bankiti, V.; Lang, A. H.; Vora, S.; Liong, V. E.;
Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; and Beijbom, O.
2020. nuScenes: A multimodal dataset for autonomous driv-
ing. In Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition, 11621–11631.
Chen, D.; Su, W.; Wu, P.; and Hua, B. 2023. Joint mul-
timodal sentiment analysis based on information relevance.
Information Processing & Management, 60(2): 103193.
Cohen, G.; Afshar, S.; Tapson, J.; and Van Schaik, A. 2017.
EMNIST: Extending MNIST to handwritten letters. In 2017
international joint conference on neural networks (IJCNN),
2921–2926. IEEE.
Debes, C.; Merentitis, A.; Heremans, R.; Hahn, J.; Fran-
giadakis, N.; Van Kasteren, T.; Liao, W.; Bellens, R.;
Pižurica, A.; Gautama, S.; et al. 2014. Hyperspectral and
LiDAR data fusion: Outcome of the 2013 GRSS data fusion
contest. IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, 7(6): 2405–2418.
Fersini, E.; Gasparini, F.; Rizzi, G.; Saibene, A.; Chulvi, B.;
Rosso, P.; Lees, A.; and Sorensen, J. 2022. SemEval-2022
Task 5: Multimedia automatic misogyny identification. In
Proceedings of the 16th International Workshop on Seman-
tic Evaluation (SemEval-2022), 533–549.
Frome, A.; Corrado, G. S.; Shlens, J.; Bengio, S.; Dean, J.;
Ranzato, M.; and Mikolov, T. 2013. Devise: A deep visual-
semantic embedding model.Advances in neural information
processing systems, 26.
Gader, P.; Zare, A.; Close, R.; Aitken, J.; and Tuell, G. 2013.
MUUFL Gulfport hyperspectral and LiDAR airborne data
set.Univ. Florida, Gainesville, FL, USA, Tech. Rep. REP-
2013-570.
Goyal, Y .; Khot, T.; Summers-Stay, D.; Batra, D.; and
Parikh, D. 2017. Making the v in vqa matter: Elevating the
role of image understanding in visual question answering.
InProceedings of the IEEE conference on computer vision
and pattern recognition, 6904–6913.
Han, Z.; Zhang, C.; Fu, H.; and Zhou, J. T. 2022. Trusted
multi-view classification with dynamic evidential fusion.
IEEE transactions on pattern analysis and machine intel-
ligence, 45(2): 2551–2566.
Hossain, E.; Sharif, O.; and Hoque, M. M. 2022. MUTE:
A multimodal dataset for detecting hateful memes. InPro-
ceedings of the 2nd conference of the asia-pacific chapter
of the association for computational linguistics and the 12th
international joint conference on natural language process-
ing: student research workshop, 32–39.
Hu, J.; Liu, R.; Hong, D.; Camero, A.; Yao, J.; Schneider,
M.; Kurz, F.; Segl, K.; and Zhu, X. X. 2023. MDAS: A new
multimodal benchmark dataset for remote sensing.Earth
System Science Data, 15(1): 113–131.
Huiskes, M. J.; and Lew, M. S. 2008. The mir flickr retrieval
evaluation. InProceedings of the 1st ACM international
conference on Multimedia information retrieval, 39–43.
Irvin, J.; Sheng, H.; Ramachandran, N.; Johnson-Yu, S.;
Zhou, S.; Story, K.; Rustowicz, R.; Elsworth, C.; Austin,
K.; and Ng, A. Y . 2020. Forestnet: Classifying drivers of
deforestation in indonesia using deep learning on satellite
imagery.arXiv preprint arXiv:2011.05479.
Johnson, A. E.; Pollard, T. J.; Berkowitz, S. J.; Greenbaum,
N. R.; Lungren, M. P.; Deng, C.-y.; Mark, R. G.; and Horng,
S. 2019. MIMIC-CXR, a de-identified publicly available
database of chest radiographs with free-text reports.Scien-
tific data, 6(1): 317.
Johnson, A. E.; Pollard, T. J.; Shen, L.; Lehman, L.-w. H.;
Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; An-
thony Celi, L.; and Mark, R. G. 2016. MIMIC-III, a freely
accessible critical care database.Scientific data, 3(1): 1–9.
Kawahara, J.; Daneshvar, S.; Argenziano, G.; and
Hamarneh, G. 2018. Seven-point checklist and skin
lesion classification using multitask multimodal neural nets.
IEEE journal of biomedical and health informatics, 23(2):
538–546.
Kiros, R.; Salakhutdinov, R.; and Zemel, R. S. 2014. Uni-
fying visual-semantic embeddings with multimodal neural
language models.arXiv preprint arXiv:1411.2539.
Kong, J.; Cooper, L. A. D.; Wang, F.; Gutman, D. A.; Gao,
J.; Chisolm, C.; Sharma, A.; Pan, T.; Van Meir, E. G.; Kurc,
T. M.; Moreno, C. S.; Saltz, J. H.; and Brat, D. J. 2011.
Integrative, Multi-modal Analysis of Glioblastoma Using
TCGA Molecular Data, Pathology Images and Clinical Out-
comes.IEEE Transactions on Biomedical Engineering,
58(12): 3469–3474.
Li, R.; Yang, S.; Ross, D. A.; and Kanazawa, A. 2021. Ai
choreographer: Music conditioned 3d dance generation with
aist++. InProceedings of the IEEE/CVF international con-
ference on computer vision, 13401–13412.
Li, Y .; Li, Y .; Wang, X.; Jiang, Y .; Zhang, Z.; Zheng, X.;
Wang, H.; Zheng, H.-T.; Huang, F.; Zhou, J.; et al. 2024.
Benchmarking Multimodal Retrieval Augmented Genera-
tion with Dynamic VQA Dataset and Self-adaptive Plan-
ning Agent. InThe Thirteenth International Conference on
Learning Representations.
Liang, P. P.; Lyu, Y .; Fan, X.; Wu, Z.; Cheng, Y .; Wu, J.;
Chen, L.; Wu, P.; Lee, M. A.; Zhu, Y .; et al. 2021. Multi-
bench: Multiscale benchmarks for multimodal representa-
tion learning.Advances in neural information processing
systems, 2021(DB1): 1.
Liang, P. P.; Zadeh, A.; and Morency, L.-P. 2024. Founda-
tions & trends in multimodal machine learning: Principles,
challenges, and open questions.ACM Computing Surveys,
56(10): 1–42.
Lin, J.; Yang, A.; Zhang, Y .; Liu, J.; Zhou, J.; and Yang, H.
2020. Interbert: Vision-and-language interaction for multi-
modal pretraining.arXiv preprint arXiv:2003.13198.
Lin, N.; Wang, S.; Li, Y .; Wang, B.; Shi, S.; He, Y .; Zhang,
W.; Yu, Y .; Zhang, Y .; Zhang, X.; et al. 2025. Resis-
tive memory-based zero-shot liquid state machine for multi-
modal event data learning.Nature Computational Science,
5(1): 37–47.
Liu, Y .; Yuan, Z.; Mao, H.; Liang, Z.; Yang, W.; Qiu, Y .;
Cheng, T.; Li, X.; Xu, H.; and Gao, K. 2022. Make acoustic
and visual cues matter: Ch-sims v2. 0 dataset and av-mixup
consistent module. InProceedings of the 2022 international
conference on multimodal interaction, 247–258.
Lu, D.; Neves, L.; Carvalho, V .; Zhang, N.; and Ji, H. 2018.
Visual attention model for name tagging in multimodal so-
cial media. InProceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long
Papers), 1990–1999.
Lu, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. Vilbert:
Pretraining task-agnostic visiolinguistic representations for
vision-and-language tasks.Advances in neural information
processing systems, 32.
Nagrani, A.; Yang, S.; Arnab, A.; Jansen, A.; Schmid, C.;
and Sun, C. 2021. Attention bottlenecks for multimodal fu-
sion. Advances in neural information processing systems,
34: 14200–14213.
Niu, T.; Zhu, S.; Pang, L.; and El Saddik, A. 2016. Sen-
timent analysis on multi-view social data. InInternational
conference on multimedia modeling, 15–27. Springer.
Okujeni, A.; van der Linden, S.; and Hostert, P. 2016.
Berlin-urban-gradient dataset 2009-an enmap preparatory
flight campaign.
Orchard, G.; Jayawant, A.; Cohen, G. K.; and Thakor, N.
2015. Converting static image datasets to spiking neuro-
morphic datasets using saccades.Frontiers in neuroscience,
9: 437.
Pollard, T. J.; Johnson, A. E.; Raffa, J. D.; Celi, L. A.; Mark,
R. G.; and Badawi, O. 2018. The eICU Collaborative Re-
search Database, a freely available multi-center database for
critical care research.Scientific data, 5(1): 1–13.
Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria,
E.; and Mihalcea, R. 2019. MELD: A Multimodal Multi-
Party Dataset for Emotion Recognition in Conversations. In
Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics, 527–536.
Ramachandram, D.; and Taylor, G. W. 2017. Deep mul-
timodal learning: A survey on recent advances and trends.
IEEE signal processing magazine, 34(6): 96–108.
Rotemberg, V .; Kurtansky, N.; Betz-Stablein, B.; Caffery,
L.; Chousakos, E.; Codella, N.; Combalia, M.; Dusza, S.;
Guitera, P.; Gutman, D.; et al. 2021. A patient-centric dataset
of images and metadata for identifying melanomas using
clinical context.Scientific data, 8(1): 34.
Sharma, C.; Bhageria, D.; Scott, W.; Pykl, S.; Das, A.;
Chakraborty, T.; Pulabaigari, V.; and Gambäck, B. 2020.
SemEval-2020 Task 8: Memotion Analysis - the Visuo-
Lingual Metaphor! In Proceedings of the Fourteenth Work-
shop on Semantic Evaluation, 759–773.
Shi, Y .; Paige, B.; Torr, P.; et al. 2019. Variational mixture-
of-experts autoencoders for multi-modal deep generative
models.Advances in neural information processing systems,
32.
Silberman, N.; Hoiem, D.; Kohli, P.; and Fergus, R. 2012.
Indoor segmentation and support inference from rgbd im-
ages. InEuropean conference on computer vision, 746–760.
Springer.
Soleymani, M.; Garcia, D.; Jou, B.; Schuller, B.; Chang, S.-
F.; and Pantic, M. 2017. A survey of multimodal sentiment
analysis.Image and Vision Computing, 65: 3–14.
Song, S.; Lichtenberg, S. P.; and Xiao, J. 2015. Sun rgb-d:
A rgb-d scene understanding benchmark suite. InProceed-
ings of the IEEE conference on computer vision and pattern
recognition, 567–576.
Suryawanshi, S.; Chakravarthi, B. R.; Arcan, M.; and Buite-
laar, P. 2020. Multimodal meme dataset (MultiOFF) for
identifying offensive content in image and text. InProceed-
ings of the second workshop on trolling, aggression and cy-
berbullying, 32–41.
Tao, M.; Huang, Q.; Xu, K.; Chen, L.; Feng, Y .; and Zhao,
D. 2024. Probing Multimodal Large Language Models for
Global and Local Semantic Representations. InProceed-
ings of the 2024 Joint International Conference on Com-
putational Linguistics, Language Resources and Evaluation
(LREC-COLING 2024), 13050–13056.
Tsai, Y .-H. H.; Bai, S.; Liang, P. P.; Kolter, J. Z.; Morency,
L.-P.; and Salakhutdinov, R. 2019. Multimodal transformer
for unaligned multimodal language sequences. InProceed-
ings of the conference. Association for computational lin-
guistics. Meeting, volume 2019, 6558.
University of Trento. 2022. Theses of the University of
Trento. [Data set]. Original work published 2020.
Wang, X.; Kumar, D.; Thome, N.; Cord, M.; and Precioso,
F. 2015. Recipe recognition with large multimodal food
dataset. In2015 IEEE International Conference on Multi-
media & Expo Workshops (ICMEW), 1–6. IEEE.
Wang, Y .; Chen, X.; Cao, L.; Huang, W.; Sun, F.; and Wang,
Y . 2022. Multimodal Token Fusion for Vision Transformers.
InProceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR).
Wei, X.; Zhang, T.; Li, Y .; Zhang, Y .; and Wu, F. 2020.
Multi-Modality Cross Attention Network for Image and
Sentence Matching. InProceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition
(CVPR).
Weinstein, J. N.; Collisson, E. A.; Mills, G. B.; Shaw, K. R.;
Ozenberger, B. A.; Ellrott, K.; Shmulevich, I.; Sander, C.;
and Stuart, J. M. 2013. The cancer genome atlas pan-cancer
analysis project.Nature genetics, 45(10): 1113–1120.
Willett, F. R.; Avansino, D. T.; Hochberg, L. R.; Henderson,
J. M.; and Shenoy, K. V . 2021. High-performance brain-
to-text communication via handwriting.Nature, 593(7858):
249–254.
Wu, J.; Fang, H.; Li, F.; Fu, H.; Lin, F.; Li, J.; Huang, Y .; Yu,
Q.; Song, S.; Xu, X.; et al. 2023. Gamma challenge: glau-
coma grading from multi-modality images.Medical Image
Analysis, 90: 102938.
Xu, B.; Li, T.; Zheng, J.; Naseriparsa, M.; Zhao, Z.; Lin, H.;
and Xia, F. 2022. Met-meme: A multimodal meme dataset
rich in metaphors. InProceedings of the 45th international
ACM SIGIR conference on research and development in in-
formation retrieval, 2887–2899.
Xu, P.; Zhu, X.; and Clifton, D. A. 2023. Multimodal learn-
ing with transformers: A survey.IEEE Transactions on
Pattern Analysis and Machine Intelligence, 45(10): 12113–
12132.
Xu, Y .; Du, B.; Zhang, F.; and Zhang, L. 2018. Hyperspec-
tral image classification via a random patches network.IS-
PRS journal of photogrammetry and remote sensing, 142:
344–357.
Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.;
and Yang, K. 2020. CH-SIMS: A Chinese multimodal senti-
ment analysis dataset with fine-grained annotation of modal-
ity. In Proceedings of the 58th annual meeting of the asso-
ciation for computational linguistics, 3718–3727.
Zadeh, A.; Zellers, R.; Pincus, E.; and Morency, L.-P. 2016.
MOSI: Multimodal corpus of sentiment intensity and sub-
jectivity analysis in online opinion videos. arXiv preprint
arXiv:1606.06259.
Zadeh, A. B.; Liang, P. P.; Poria, S.; Cambria, E.; and
Morency, L.-P. 2018. Multimodal language analysis in the
wild: Cmu-mosei dataset and interpretable dynamic fusion
graph. InProceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long
Papers), 2236–2246.
Zhan, X.; Wu, Y .; Dong, X.; Wei, Y .; Lu, M.; Zhang, Y .;
Xu, H.; and Liang, X. 2021. Product1m: Towards weakly
supervised instance-level product retrieval via cross-modal
pretraining. InProceedings of the IEEE/CVF international
conference on computer vision, 11782–11791.
Zhang, Q.; Fu, J.; Liu, X.; and Huang, X. 2018. Adaptive
co-attention network for named entity recognition in tweets.
InProceedings of the AAAI conference on artificial intelli-
gence, volume 32.
Zhang, Q.; Wei, Y .; Han, Z.; Fu, H.; Peng, X.; Deng, C.;
Hu, Q.; Xu, C.; Wen, J.; Hu, D.; et al. 2024. Multimodal
fusion on low-quality data: A comprehensive survey.arXiv
preprint arXiv:2404.18947.
Zhang, Q.; Wu, H.; Zhang, C.; Hu, Q.; Fu, H.; Zhou, J. T.;
and Peng, X. 2023. Provable dynamic fusion for low-quality
multimodal data. InInternational conference on machine
learning, 41753–41769. PMLR.
Zhu, J.; Zhou, Y .; Qian, S.; He, Z.; Zhao, T.; Shah, N.; and
Koutra, D. 2025. Mosaic of modalities: A comprehensive
benchmark for multimodal graph learning. InProceedings
of the Computer Vision and Pattern Recognition Conference,
14215–14224.
Appendix
A More Experimental Results
This section contains the complete set of quantitative re-
sults that support the main paper. Every reported metric is
the mean of 3 independent training runs initialized with dif-
ferent random seeds. After each run we computed the met-
ric on the test split, and the population standard deviation
across the 3 runs is reported in parentheses immediately af-
ter the mean. Due to their large scale, the results for MNIST-
SVHN, UPMC-Food101, and CUB Image-Caption are re-
ported after only a single run.
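Concretely, each reported cell can be reproduced with the standard library as follows; the three run scores here are hypothetical placeholders, not values from our tables.

```python
from statistics import mean, pstdev

# Hypothetical test-set accuracies from 3 seeds of one method
runs = [61.34, 62.01, 60.78]
# pstdev is the *population* standard deviation (divides by n,
# not n-1), matching how our tables are reported.
summary = f"{mean(runs):.2f} (±{pstdev(runs):.2f})"
print(summary)  # → 61.38 (±0.50)
```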
B Optuna Hyper-Parameter Setup
In our experiments, we employ the Optuna framework to
automatically search for the optimal combination of hyper-
parameters. As detailed in Table 10, we configured a total
of 10 trials to explore the defined parameter space. Specifi-
cally, the learning rate (lr) and weight decay
(weight_decay) are sampled from a log-uniform
distribution over the ranges [10^-5, 10^-3] and [10^-6, 10^-2],
respectively. The optimizer type (optimtype) is chosen
from three candidates: AdamW, RMSprop, and Adam. For
the encoder freezing parameter, freeze_encoders, we
generally set it to False by default. It is only included as a
searchable categorical variable, allowing Optuna to decide
between True and False, in specific experiments where we
need to investigate its impact on model performance.
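A minimal sketch of the log-uniform draw behind this search space is shown below, using only the standard library; sample_loguniform is our own helper mimicking what Optuna's log-uniform sampler does, not the framework's API.

```python
import math
import random

def sample_loguniform(low, high, rng):
    # Sample uniformly in log-space, then exponentiate -- the
    # distribution behind a log-uniform hyper-parameter sampler.
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
# One hypothetical trial drawn from the search space in Table 10
trial = {
    "lr": sample_loguniform(1e-5, 1e-3, rng),
    "weight_decay": sample_loguniform(1e-6, 1e-2, rng),
    "optimtype": rng.choice(["AdamW", "RMSprop", "Adam"]),
    "freeze_encoders": False,  # default unless explicitly searched
}
```

Sampling in log-space is what keeps small learning rates (e.g. 2e-5) as likely as large ones (e.g. 2e-4), which a plain uniform draw over [1e-5, 1e-3] would not.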
C Dataset Details
For every dataset in MULTIBENCH++, we give a short
overview covering the following items: (1) where the data
come from and what they contain, (2) how we cleaned and
extracted features following recent work, and (3) the exact
train, validation, and test splits we use.
MELD
Data Source and Content: The Multimodal EmotionLines
Dataset (MELD) is a large-scale dataset for emotion recog-
nition in conversations. It contains over 13,000 utterances
from the TV show Friends, with audio, video, and text.
Link: https://github.com/declare-lab/MELD
Feature Extraction:
Data Splits: The official split is used, containing 9989
training, 1109 validation, and 2610 test utterances.
IEMOCAP
Data Source and Content: The Interactive Emotional
Dyadic Motion Capture (IEMOCAP) database is a popular
dataset for emotion recognition, containing approximately
12 hours of audiovisual data from ten actors.
Link: https://sail.usc.edu/iemocap/
Feature Extraction: We preprocess the features following
https://github.com/soujanyaporia/multimodal-sentiment-analysis/tree/master?tab=readme-ov-file.
Data Splits: The official split is used, containing 5810
training and 1623 test utterances. Since the official dataset
split does not include a dedicated validation set, we reuse
the test set as the validation set.
MAMI
Data Source and Content: A dataset for misogynistic
meme detection, containing over 10,000 image-text memes
annotated for misogyny and other categories.
Link: https://github.com/MIND-Lab/SemEval2022-Task-
5-Multimedia-Automatic-Misogyny-Identification-MAMI-
Feature Extraction: A ResNet is used to encode the im-
age component, and a BERT model is used to encode the
text.
Data Splits: The official split is used, containing 9000
training, 1000 validation, and 1000 test samples.
Memotion
Data Source and Content: A dataset for analyzing emo-
tions in memes. It contains 10,000 memes annotated for sen-
timent and three types of emotions (humor, sarcasm, moti-
vation).
Link: https://www.kaggle.com/datasets/williamscott701/
memotion-dataset-7k
Feature Extraction: A ResNet is used to encode the im-
age component, and a BERT model is used to encode the
text.
Data Splits: We adopt an 8:1:1 split for training, valida-
tion, and test sets, containing 5465 training, 683 validation,
and 683 test samples, with the random seed fixed at 42.
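The seeded ratio split described above can be sketched as follows; ratio_split is a hypothetical helper, and the exact boundary rounding in our pipeline may differ.

```python
import random

def ratio_split(items, ratios=(8, 1, 1), seed=42):
    # Shuffle once with a fixed seed, then cut at the ratio
    # boundaries so the train/val/test partition is reproducible.
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = ratio_split(range(1000))
print(len(train), len(val), len(test))  # → 800 100 100
```

Because the shuffle is driven by a dedicated random.Random(42) instance, the split is identical across runs regardless of the global random state.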
MUTE
Data Source and Content: These datasets focus on detect-
ing harmful content. MUTE targets troll-like behavior in im-
age/text posts, while MultiOFF focuses on identifying offen-
sive content and its target.
Link: https://github.com/eftekhar-hossain/MUTE-
AACL22
Feature Extraction: A ResNet is used to encode the im-
age component, and a BERT model is used to encode the
text.
Data Splits: The official split is used, containing 3365
training, 375 validation, and 416 test samples.
Table 6: Performance on Affective Computing & Social Media Understanding Datasets. Results are shown as mean±standard
deviation. For MSE metrics, lower is better. The dash "-" indicates the method is not applicable.
Dataset Concat TF ConcatEarly LFT EFT Multi-to-One One-to-Multi CAF CACF LS TMC
MELD 61.34±0.55 65.66±0.71 62.73±0.05 47.77±0.57 57.92±0.82 66.37±0.78 64.02±0.40 65.54±0.66 62.22±0.58 61.60±0.12 65.56±0.18
IEMOCAP 54.96±0.13 54.59±0.13 54.82±0.08 32.64±0.59 48.14±1.93 54.49±0.35 54.16±0.74 54.26±0.19 55.06±0.32 54.30±0.24 51.22±0.15
MAMI 70.00±1.00 66.47±3.84 66.13±5.43 67.87±2.49 66.63±2.28 68.97±2.53 64.50±3.69 65.40±3.06 67.23±1.90 67.80±2.61 69.73±1.03
Memotion 77.89±2.51 77.75±2.02 78.14±4.39 78.09±2.14 78.09±1.71 78.09±5.01 78.09±1.71 78.14±5.00 78.09±1.80 78.19±2.61 78.09±1.94
MUTE 67.87±1.22 67.31±1.85 66.59±1.52 65.63±3.74 68.19±2.47 66.03±3.35 66.99±2.16 65.87±3.34 67.07±2.52 67.95±0.89 68.67±3.24
MultiOFF 56.95±1.46 59.86±1.22 55.38±2.00 57.76±5.43 59.11±6.82 54.38±2.41 59.33±2.49 59.76±1.87 62.03±0.86 54.16±1.10 60.01±8.02
MET-Meme(C) 35.67±0.00 34.87±0.00 36.92±0.00 29.68±0.41 24.12±0.45 33.55±0.10 35.09±0.43 32.89±0.05 36.62±0.63 34.94±0.68 33.85±0.33
MET-Meme(E) 42.95±2.23 42.47±3.87 41.83±2.04 32.69±1.71 29.81±2.67 44.23±4.21 39.74±4.95 40.38±1.56 41.03±2.61 43.11±1.94 42.47±3.00
Twitter2015 76.05±0.36 76.37±0.28 75.76±0.09 63.39±0.41 68.05±0.45 74.12±0.10 70.40±0.43 75.31±0.05 75.70±0.63 75.86±0.68 64.45±0.33
Twitter2017 76.83±1.34 76.72±1.09 76.68±0.52 76.76±1.19 76.68±1.82 76.29±1.01 76.72±2.49 76.54±1.87 75.97±1.47 76.15±2.34 76.86±0.41
MSE metrics (lower is better):
Dataset (MSE) Concat TF ConcatEarly LFT EFT Multi-to-One One-to-Multi CAF CACF LS TMC
CHSIMS 0.4835±0.0083 0.7431±0.3012 0.4790±0.0027 0.4775±0.0019 0.4804±0.0021 0.4772±0.0034 0.4836±0.0060 0.4794±0.0042 0.4824±0.0030 0.4898±0.0031 -
CHSIMS-v2 0.3202±0.0301 0.3566±0.0246 0.3338±0.0004 0.3214±0.0224 0.3391±0.0068 0.3351±0.0028 0.3448±0.0067 0.2801±0.0381 0.3361±0.0009 0.3360±0.0024 -
Table 7: Performance on Medical AI Datasets. Results are shown as mean±standard deviation. The dash "-" indicates the
method is not applicable. For MIMIC-CXR, the metric is macro-F1.
Dataset Concat TF ConcatEarly LFT EFT Multi-to-One One-to-Multi CAF CACF LS TMC
TCGA-BRCA 78.17±1.71 77.18±2.85 77.18±0.74 69.44±7.87 65.67±4.87 76.79±1.68 76.19±3.67 78.77±0.28 78.37±2.02 77.18±1.40 75.40±0.56
ROSMAP 70.95±3.09 75.71±0.67 70.00±2.33 42.86±0.67 69.52±1.35 69.52±0.67 68.10±0.67 66.67±3.09 71.90±0.09 71.43±3.02 66.67±5.99
SIIM-ISIC 97.86±0.07 97.77±0.04 97.86±0.07 97.83±0.00 97.85±0.07 97.85±0.03 97.83±0.33 97.86±0.07 97.83±0.00 97.83±0.02 97.85±0.20
Derm7pt 45.49±2.25 52.72±0.84 45.41±1.44 43.96±0.32 38.86±1.11 45.92±0.60 46.77±2.19 48.13±1.36 52.72±2.30 46.34±1.22 52.72±1.22
GAMMA 61.43±2.33 62.86±3.50 63.81±0.67 62.38±2.43 59.05±0.67 57.62±2.94 65.71±4.67 63.33±2.94 63.33±1.35 62.38±2.33 61.43±2.33
MIMIC-III 68.09±0.91 68.81±0.23 68.48±0.73 68.42±0.15 68.59±0.11 69.01±0.33 68.76±0.35 68.51±0.25 68.88±0.27 68.47±0.22 68.87±0.20
eICU 90.05±0.00 90.03±0.03 90.05±0.00 90.05±0.00 90.05±0.00 90.05±0.00 90.05±0.03 90.07±0.00 90.05±0.00 90.05±0.03 90.03±0.03
TCGA 51.91±4.09 - 53.55±4.09 60.66±5.35 51.37±2.79 60.11±3.86 53.01±6.04 56.28±5.41 61.75±2.04 54.10±6.13 62.84±6.18
MIMIC-CXR 0.6861±0.0248 0.8458±0.0447 0.6788±0.0527 0.1551±0.0321 0.1829±0.0131 0.5229±0.1516 0.4031±0.0852 0.6136±0.0076 0.7523±0.0042 0.7644±0.0171 -
Table 8: Performance on Remote Sensing Datasets. Results are shown as mean±standard deviation. The dash "-" indicates the
method is not applicable.
Dataset Concat TF ConcatEarly LFT EFT Multi-to-One One-to-Multi CAF CACF LS TMC
Houston2013 75.68±4.15 76.53±6.15 75.95±5.51 60.54±11.41 24.68±1.42 78.54±2.04 79.22±2.03 74.92±0.85 83.32±0.54 79.43±0.96 79.12±0.59
Houston2018 70.99±11.70 74.46±10.00 70.07±12.35 63.41±9.73 35.00±18.91 76.52±1.78 70.89±7.32 68.71±8.55 77.66±4.93 80.52±0.92 77.33±3.03
MUUFL Gulfport 84.21±2.92 83.90±1.65 86.26±1.13 71.77±3.43 46.68±3.16 80.95±1.10 81.23±2.41 83.99±1.02 86.42±0.67 86.64±0.64 83.20±4.32
Trento 98.43±0.26 96.95±1.32 98.14±0.63 95.68±0.84 71.64±4.56 97.71±0.10 96.08±1.39 97.74±0.65 98.68±0.20 98.53±0.36 97.67±1.02
Berlin 68.25±14.49 70.22±8.12 72.53±9.55 61.72±6.41 60.74±3.53 73.75±0.80 71.22±1.20 76.31±1.62 77.42±0.83 78.61±0.11 77.67±0.62
Augsburg 89.24±0.86 85.86±1.03 89.50±1.13 82.88±0.50 57.29±5.40 85.86±0.73 86.80±0.91 87.92±0.34 89.05±0.35 89.49±0.34 87.21±0.63
ForestNet 45.18±2.14 45.63±1.23 45.68±0.38 47.19±1.99 44.33±0.93 45.78±0.71 45.58±1.81 45.03±0.77 45.93±0.65 46.08±0.65 45.68±0.50
Table 9: Performance on Other Datasets. Results are shown as mean±standard deviation. The dash "-" indicates the method is
not applicable.
Dataset Concat TF ConcatEarly LFT EFT Multi-to-One One-to-Multi CAF CACF LS TMC
MIRFLICKR 62.81±1.52 62.33±2.23 62.42±1.18 50.25±1.67 38.02±2.69 58.44±0.93 59.95±1.40 59.13±0.27 62.51±1.21 62.45±1.39 62.00±1.47
NYUDv2 59.02±4.24 61.47±4.05 58.82±4.58 47.96±8.51 31.70±6.24 59.58±3.83 63.20±5.58 60.55±3.09 64.02±5.26 66.87±3.52 66.16±1.95
SUN-RGBD 60.28±1.65 59.43±1.43 60.83±1.90 45.22±0.30 31.61±2.61 53.52±2.16 56.97±1.11 53.53±0.37 58.14±1.94 60.78±2.26 59.00±1.99
MVSA-Single 79.83±0.87 78.36±2.91 78.29±1.31 68.40±7.86 63.39±9.19 79.32±0.60 77.78±1.05 79.51±1.50 78.03±3.56 79.25±2.14 67.50±10.49
N-MNIST+N-TIDIGITS 94.99±0.20 94.28±0.13 95.26±0.13 80.99±1.07 30.45±4.91 93.52±0.41 94.34±0.77 94.06±0.54 95.26±0.27 94.88±0.41 94.23±0.34
E-MNIST+EEG 58.72±5.12 58.21±1.92 61.28±3.22 17.69±3.22 7.95±0.36 42.56±5.66 30.51±9.02 49.74±3.10 59.49±3.10 62.05±1.58 57.69±3.14
Table 10: Optuna hyper-parameter search space (trials = 10)
Parameter Type Sampler Search Space / Candidates
lr Continuous (log) suggest_loguniform [10^-5, 10^-3]
weight_decay Continuous (log) suggest_loguniform [10^-6, 10^-2]
freeze_encoders Categorical suggest_categorical {True, False}
optimtype Categorical suggest_categorical {AdamW, RMSprop, Adam}
MultiOFF
Data Source and Content: These datasets focus on detect-
ing harmful content. MUTE targets troll-like behavior in im-
age/text posts, while MultiOFF focuses on identifying offen-
sive content and its target.
Link: https://github.com/bharathichezhiyan/Multimodal-
Meme-Classification-Identifying-Offensive-Content-in-
Image-and-Text
Feature Extraction: A ResNet is used to encode the im-
age component, and a BERT model is used to encode the
text.
Data Splits: The official split is used, containing 445 train-
ing, 149 validation, and 149 test samples.
MET-Meme (Chinese)
Data Source and Content: A dataset for multimodal
metaphor detection in memes, available in both English and
Chinese versions.
Link: https://github.com/liaolianfoka/MET-Meme-A-
Multi-modal-Meme-Dataset-Rich-in-Metaphors
Feature Extraction: A ResNet is used to encode the im-
age component, and a BERT model is used to encode the
text.
Data Splits: We adopt a 7:1:2 split for training, valida-
tion, and test sets, containing 1,609 training, 229 validation,
and 461 test samples, with the random seed fixed at 42.
MET-Meme (English)
Data Source and Content: A dataset for multimodal
metaphor detection in memes, available in both English and
Chinese versions.
Link: https://github.com/liaolianfoka/MET-Meme-A-
Multi-modal-Meme-Dataset-Rich-in-Metaphors
Feature Extraction: A ResNet is used to encode the im-
age component, and a BERT model is used to encode the
text.
Data Splits: We adopt a 7:1:2 split for training, valida-
tion, and test sets, containing 737 training, 105 validation,
and 211 test samples, with the random seed fixed at 42.
CH-SIMS
Data Source and Content: A fine-grained single- and
multi-modal sentiment analysis dataset in Chinese. It con-
tains over 2,200 short videos with unimodal and multimodal
annotations.
Link: https://github.com/thuiar/MMSA
Feature Extraction: We use the pre-extracted features pro-
vided by the dataset without any additional modifications.
Data Splits: The official split is used, containing 1368
training, 456 validation, and 457 test utterances.
CH-SIMSv2
Data Source and Content: A fine-grained single- and
multi-modal sentiment analysis dataset in Chinese. It con-
tains over 2,200 short videos with unimodal and multimodal
annotations.
Link: https://thuiar.github.io/sims.github.io/chsims
Feature Extraction: We use the pre-extracted features pro-
vided by the dataset without any additional modifications.
Data Splits: The official split is used, containing 2722
training, 647 validation, and 1034 test utterances.
Twitter2015
Data Source and Content: Originating from SemEval
tasks, these datasets were extended for multimodal aspect-
based sentiment analysis, pairing tweets with relevant im-
ages.
Link: https://archive.org/details/twitterstream
Feature Extraction: A ResNet is used to encode the im-
age component, and a BERT model is used to encode the
text.
Data Splits: The official split is used, containing 3179
training, 1122 validation, and 1037 test samples.
Twitter2017
Data Source and Content:
Link: https://github.com/code-chendl/HFIR
Feature Extraction: A ResNet is used to encode the im-
age component, and a BERT model is used to encode the
text.
Data Splits: We adopt a 7:1:2 split for training, valida-
tion, and test sets, containing 3270 training, 467 validation,
and 935 test samples, with the random seed fixed at 42.
Houston2013
Data Source and Content: The 2013 dataset, from the
IEEE GRSS Data Fusion Contest, provides HSI and LiDAR
data over the University of Houston campus, covering 15
land use classes. The 2018 version is a more complex dataset
covering 20 urban land use classes.
Link: https://machinelearning.ee.uh.edu/?page_id=459
Feature Extraction: Following https://github.com/
songyz2019/rs-fusion-datasets, we directly acquire and load
the dataset for the joint classification of hyperspectral and
LiDAR data. The HSI data is processed by the conv_hsi
encoder, and the LiDAR data is processed by the conv_dsm
encoder.
Data Splits: The official split is used, containing 2817
training and 12182 test samples. Since the official dataset
split does not include a dedicated validation set, we reuse the
test set as the validation set.
Houston2018
Data Source and Content: The 2013 dataset, from the
IEEE GRSS Data Fusion Contest, provides HSI and LiDAR
data over the University of Houston campus, covering 15
land use classes. The 2018 version is a more complex dataset
covering 20 urban land use classes.
Link: https://machinelearning.ee.uh.edu/2018-ieee-grss-
data-fusion-challenge-fusion-of-multispectral-lidar-and-
hyperspectral-data/
Feature Extraction: Following https://github.com/
songyz2019/rs-fusion-datasets, we directly acquire and load
the dataset for the joint classification of hyperspectral and
LiDAR data. The HSI data is processed by the conv_hsi
encoder, and the LiDAR data is processed by the conv_dsm
encoder.
Data Splits: The official split is used, containing 18750
training and 2000160 test samples. Since the official
dataset split does not include a dedicated validation set, we
reuse the test set as the validation set.
MUUFL Gulfport
Data Source and Content The MUUFL dataset contains HSI and LiDAR data collected over the University of Southern Mississippi, Gulfport campus. The ground truth contains 11 urban land use classes.
Link https://github.com/GatorSense/MUUFLGulfport
Feature Extraction Following https://github.com/songyz2019/rs-fusion-datasets, we directly acquire and load the dataset, which is used for the joint classification of hyperspectral, LiDAR, and SAR data. We use the conv_hsi encoder for the HSI data and the conv_dsm encoder for the LiDAR data.
Data Splits The official split is used, containing 1100 training and 52587 test utterances. Since the official dataset split does not include a dedicated validation set, we reuse the test set as the validation set.
Trento
Data Source and Content This dataset covers a rural area south of Trento, Italy. It combines HSI data with corresponding LiDAR-derived Digital Surface Model (DSM) data, with 6 classified land cover classes.
Link https://github.com/tyust-dayu/Trento/tree/b4afc449ce5d6936ddc04fe267d86f9f35536afd
Feature Extraction Following https://github.com/songyz2019/rs-fusion-datasets, we directly acquire and load the dataset, which is used for the joint classification of hyperspectral, LiDAR, and SAR data. The HSI data is processed by conv_hsi, and the LiDAR data by conv_dsm.
Data Splits The official split is used, containing 600 training and 29614 test utterances. Since the official dataset split does not include a dedicated validation set, we reuse the test set as the validation set.
Berlin
Data Source and Content This dataset provides co-registered HSI and Synthetic Aperture Radar (SAR) data for the city of Berlin, Germany, with 8 distinct urban land cover classes.
Link https://gfzpublic.gfz-potsdam.de/pubman/faces/ViewItemFullPage.jsp?itemId=item_1480927_5
Feature Extraction Following https://github.com/songyz2019/rs-fusion-datasets, we directly acquire and load the dataset, which is used for the joint classification of hyperspectral, LiDAR, and SAR data. The HSI data is processed by conv_hsi. The SAR data, being image-like, is processed by a ResNet encoder.
Data Splits The official split is used, containing 2820 training and 461851 test utterances. Since the official dataset split does not include a dedicated validation set, we reuse the test set as the validation set.
MDAS (Augsburg)
Data Source and Content This dataset contains multi-sensor data for urban area classification in Augsburg, Germany, featuring HSI and SAR imagery.
Link https://github.com/songyz2019/rs-fusion-datasets
Feature Extraction Following https://github.com/songyz2019/rs-fusion-datasets, we directly acquire and load the dataset, which is used for the joint classification of hyperspectral, LiDAR, and SAR data. The HSI data uses the conv_hsi encoder, while the SAR data is processed using a ResNet encoder.
Data Splits The official split is used, containing 761 training and 77533 test utterances. Since the official dataset split does not include a dedicated validation set, we reuse the test set as the validation set.
ForestNet
Data Source and Content A dataset for wildfire prevention, containing satellite imagery from Sentinel-2, topography data (DSM), and weather data. It is used to predict wildfire risk.
Link https://github.com/spott/ForestNet
Feature Extraction A ResNet processes the satellite imagery, a second ResNet processes the topography data, and an MLP processes the tabular weather data (if needed).
Data Splits The official split is used, containing 1616 training, 473 validation, and 668 test utterances.
TCGA-BRCA
Data Source and Content Data from The Cancer Genome Atlas (TCGA) Program, focusing on Breast Cancer (BRCA). It is a multi-omics dataset, often including gene expression, DNA methylation, and copy number variation, used for tasks like survival prediction.
Link https://github.com/txWang/MOGONET
Feature Extraction Each tabular omics modality is encoded via an independent fully-connected linear layer.
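A minimal NumPy sketch of one independent fully-connected layer per omics view (the view names, dimensions, output width, and concatenation-based fusion are illustrative assumptions, not the benchmark's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_linear_encoder(in_dim, out_dim, rng):
    """One independent fully-connected layer per omics modality: h = x @ W + b."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.01
    b = np.zeros(out_dim)
    return lambda x: x @ W + b

# hypothetical dimensions for three omics views
dims = {"expression": 1000, "methylation": 1000, "cnv": 500}
encoders = {name: make_linear_encoder(d, 128, rng) for name, d in dims.items()}

# encode a batch of 4 samples per modality and concatenate for fusion
batch = {name: rng.standard_normal((4, d)) for name, d in dims.items()}
fused = np.concatenate([encoders[n](batch[n]) for n in dims], axis=1)
```

Each view thus keeps its own parameters, so modalities of different dimensionality map into a common 128-dimensional space before fusion.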
Data Splits The dataset is extracted sequentially: the leading 20% forms the test set, the subsequent 5% the validation set, and the remaining 75% the training set, containing 657 training, 43 validation, and 175 test utterances.
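The sequential extraction protocol above can be sketched as follows (an illustrative sketch; the floor-based rounding is an assumption):

```python
def sequential_split(n_samples):
    """Leading 20% -> test, next 5% -> validation, remainder -> train
    (no shuffling; samples are taken in their stored order)."""
    n_test = int(n_samples * 0.20)
    n_val = int(n_samples * 0.05)
    test = list(range(n_test))
    val = list(range(n_test, n_test + n_val))
    train = list(range(n_test + n_val, n_samples))
    return train, val, test
```

With 875 total utterances, this convention reproduces the 657/43/175 counts reported above.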
TCGA
Data Source and Content Data from The Cancer Genome Atlas (TCGA) Program. It is a multi-omics dataset, often including gene expression, DNA methylation, and copy number variation, used for tasks like survival prediction.
Link https://www.cancer.gov/ccg/access-data
Feature Extraction We use the TCGA dataset selected by https://github.com/bowang-lab/IntegrAO. Each tabular omics modality is encoded via an independent fully-connected linear layer.
Data Splits The dataset is extracted sequentially: the first 60% for training, the next 25% for validation, and the final 20% for testing, containing 169 training, 76 validation, and 61 test utterances.
ROSMAP
Data Source and Content Data from the Religious Orders Study and Memory and Aging Project (ROSMAP) for Alzheimer's disease research. It includes multi-omics data from post-mortem brain tissue.
Link https://github.com/txWang/MOGONET
Feature Extraction Each tabular omics modality is encoded via an independent identity-mapping encoder.
Data Splits The dataset is extracted sequentially: the first 60% for training, the next 25% for validation, and the final 20% for testing, containing 194 training, 87 validation, and 70 test utterances.
SIIM-ISIC
Data Source and Content From the 2020 SIIM-ISIC Melanoma Classification Kaggle challenge. The dataset contains thousands of dermoscopic images of skin lesions, with patient-level metadata, for classifying lesions.
Link https://www.kaggle.com/competitions/siim-isic-melanoma-classification/data
Feature Extraction A ResNet encoder is used for the dermoscopic images, and an independent identity-mapping encoder is used for the tabular patient metadata.
Data Splits We adopt an 8:1:1 split for training, validation, and test sets, containing 26502 training, 3312 validation, and 3312 test utterances, with the random seed fixed at 42.
Derm7pt
Data Source and Content A multiclass skin lesion classification dataset based on the 7-point checklist. It provides dermoscopic images and a corresponding vector of semi-quantitative clinical features.
Link https://www.kaggle.com/datasets/menakamohanakumar/derm7pt
Feature Extraction The images are encoded with a CNNEncoder, while the tabular clinical feature vectors are encoded with an identity-mapping encoder.
Data Splits The official split is used, containing 413 training, 203 validation, and 395 test utterances.
GAMMA
Data Source and Content From the Glaucoma grAding from Multi-Modality imAges challenge. It contains color fundus images and stereo-pairs of disc photos for glaucoma diagnosis.
Link https://zenodo.org/records/15119049
Feature Extraction Both the fundus images and stereo-pair images are encoded using a CNNEncoder.
Data Splits The dataset is extracted sequentially: the first 20% for training, the next 10% for validation, and the final 70% for testing, containing 20 training, 10 validation, and 70 test utterances.
MIMIC-III
Data Source and Content A large, de-identified ICU database containing structured data (lab results, vitals) and unstructured clinical notes for over 40,000 patients. It is used for tasks like mortality prediction.
Link https://physionet.org/content/mimiciii/1.4/
Feature Extraction The modalities are encoded by a TimeSeriesTransformerEncoder and an identity mapping, respectively.
Data Splits The dataset is extracted sequentially: the leading 20% forms the test set, the subsequent 5% the validation set, and the remaining 75% the training set, containing 24462 training, 1631 validation, and 6523 test utterances.
MIMIC-CXR
Data Source and Content A large-scale dataset containing over 377,000 chest X-ray images and their corresponding free-text radiology reports.
Link https://physionet.org/content/mimic-cxr-jpg/2.1.0/
Feature Extraction We use a ResNet encoder for the chest X-ray images and a BERT model for the radiology reports.
Data Splits Following the official split, we cap the training set at a maximum of 5,000 utterances, yielding 5,000 for training, 2942 for validation, and 5117 for testing. The random seed is fixed at 42 to select the training samples.
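The seeded training-set cap can be sketched as follows (an illustrative sketch; cap_training_set is a hypothetical helper name, not the benchmark's actual API):

```python
import random

def cap_training_set(train_ids, cap=5000, seed=42):
    """Randomly subsample the official training split down to at most
    `cap` items, using a fixed seed so the selection is reproducible."""
    train_ids = list(train_ids)
    if len(train_ids) <= cap:
        return train_ids
    return random.Random(seed).sample(train_ids, cap)
```

The validation and test splits are left untouched; only the training pool is subsampled.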
eICU
Data Source and Content A multi-center ICU database with de-identified data for over 200,000 admissions from hospitals across the US. It contains high-granularity vital sign data and clinical notes.
Link https://eicu-crd.mit.edu/ and https://physionet.org/content/eicu-crd/2.0/
Feature Extraction All modalities are encoded using the identity mapping.
Data Splits The corpus is hierarchically partitioned: an initial 80/20 split isolates the test set; the remaining 80% is then subdivided at a 15:1 ratio, producing final allocations of approximately 75% training, 5% validation, and 20% test. This yields 5727 training, 382 validation, and 1528 test utterances. The random seed is fixed at 42.
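The hierarchical partitioning above can be sketched as follows (an illustrative sketch; the function name and floor-based rounding are assumptions):

```python
import random

def eicu_split(n_samples, seed=42):
    """An initial 80/20 split isolates the test set; the remaining 80% is
    subdivided at a 15:1 ratio into train/val (~75%/5%/20% overall)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_pool = int(n_samples * 0.8)      # train + validation pool
    test = indices[n_pool:]
    n_train = int(n_pool * 15 / 16)    # 15:1 train:val within the pool
    train = indices[:n_train]
    val = indices[n_train:n_pool]
    return train, val, test
```

With 7637 total utterances, this convention reproduces the 5727/382/1528 counts reported above.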
MIRFLICKR
Data Source and Content A dataset of one million images from the Flickr website with their associated user-assigned tags. A 25,000-image subset with curated labels is commonly used for multimodal retrieval and classification.
Link https://press.liacs.nl/mirflickr/
Feature Extraction A ResNet is used for the images, and a BERT model is used for the textual tags.
Data Splits We adopt a 7:1:2 split for training, validation, and test sets, containing 14010 training, 2001 validation, and 4004 test utterances, with the random seed fixed at 42.
CUB Image-Caption
Data Source and Content An extension of the Caltech-UCSD Birds-200-2011 dataset. It pairs detailed bird images with rich, descriptive text captions, ideal for fine-grained image-text matching.
Link https://github.com/iffsid/mmvae
Feature Extraction The bird images are encoded with a ResNet, and the text captions are encoded with a BERT model.
Data Splits We adopt a 7:1.5:1.5 split for training, validation, and test sets, containing 82510 training, 17680 validation, and 17690 test utterances, with the random seed fixed at 2025.
SUN-RGBD
Data Source and Content A large-scale dataset for indoor scene understanding. It contains over 10,000 RGB-D (color and depth) images with dense annotations.
Link https://rgbd.cs.princeton.edu/
Feature Extraction We use a ResNet encoder for the RGB images and a similar ResNet architecture for the depth maps.
Data Splits The official split is used, containing 4845 training and 4659 test utterances. Since the official dataset split does not include a dedicated validation set, we reuse the test set as the validation set.
NYUDv2
Data Source and Content The NYU-Depth Dataset V2 provides RGB and Depth images of various indoor scenes captured with a Microsoft Kinect, with dense labels for semantic segmentation.
Link https://cs.nyu.edu/~fergus/datasets/nyu_depth_v2.html
Feature Extraction A ResNet encoder is used for the RGB images and another for the depth images.
Data Splits The official split is used, containing 795 training, 414 validation, and 654 test utterances.
UPMC-Food101
Data Source and Content An extension of Food-101, this dataset pairs food images with their ingredient lists for tasks like recipe retrieval from images.
Link https://www.kaggle.com/datasets/gianmarco96/upmcfood101
Feature Extraction We follow https://github.com/facebookresearch/mmbt to prepare the splits. A ResNet encodes the food images, while a BERT model encodes the ingredient lists.
Data Splits The official split is used, containing 62971 training, 5000 validation, and 22715 test utterances.
MVSA-Single
Data Source and Content A Multi-View Sentiment Analysis dataset containing image-text posts from Twitter. The “Single” variant contains posts where the sentiment label is consistent across annotators.
Link https://www.kaggle.com/datasets/vincemarcs/mvsasingle
Feature Extraction We follow https://github.com/facebookresearch/mmbt to prepare the splits. Images are encoded using ResNet, and the tweet text is encoded using BERT.
Data Splits The official split is used, containing 1555 training, 518 validation, and 519 test utterances.
MNIST-SVHN
Data Source and Content This is a synthetic dataset combining two famous digit recognition datasets: MNIST (handwritten digits) and SVHN (street view house numbers). The task is to classify pairs of digits.
Link https://github.com/iffsid/mmvae
Feature Extraction We apply a flattening operation to the MNIST images, whereas the SVHN images are processed by a simple CNN encoder.
Data Splits The official split is used, containing 560,680 training and 100,000 test utterances. Since the official dataset split does not include a dedicated validation set, we reuse the test set as the validation set.
N-MNIST+N-TIDIGITS
Data Source and Content Neuromorphic versions of MNIST (vision) and TIDIGITS (audio, spoken digits). Data is recorded as asynchronous event streams (spikes) from event-based sensors.
Feature Extraction We follow https://github.com/MrLinNing/MemristorLSM and pair each N-MNIST frame with its corresponding N-TIDIGITS audio clip to form unique image-sound pairs, performing classification on these aligned inputs. The N-MNIST frames are encoded by a CNN, and the N-TIDIGITS MFCCs are encoded by an LSTM.
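One plausible way to form such label-aligned image-sound pairs (a hypothetical sketch; the actual pairing follows the MemristorLSM repository, and pair_by_label is an illustrative helper name):

```python
from collections import defaultdict

def pair_by_label(image_samples, audio_samples):
    """Pair each image with an unused audio clip of the same digit label,
    forming unique image-sound pairs for classification.
    Both inputs are sequences of (features, label) tuples."""
    audio_by_label = defaultdict(list)
    for feats, label in audio_samples:
        audio_by_label[label].append(feats)
    pairs = []
    for img, label in image_samples:
        if audio_by_label[label]:           # skip images with no clip left
            pairs.append((img, audio_by_label[label].pop(0), label))
    return pairs
```

Consuming each audio clip at most once keeps the pairs unique, so the paired dataset size is bounded by the smaller modality's per-class counts.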
Data Splits The dataset is extracted sequentially: the first 70% for training, the next 15% for validation, and the final 15% for testing, containing 2835 training, 603 validation, and 612 test utterances.
E-MNIST+EEG
Data Source and Content A dataset that combines images from the E-MNIST dataset (handwritten letters and digits) with simultaneously recorded EEG brain signals from subjects viewing them.
Feature Extraction We follow https://github.com/MrLinNing/MemristorLSM and construct unique image-EEG pairs by pairing each E-MNIST sample with its corresponding EEG recording, then perform classification on these paired inputs. The E-MNIST images are encoded with a CNN, and the time-series EEG signals are encoded with an LSTM encoder.
Data Splits The dataset is extracted sequentially: the first 70% for training, the next 15% for validation, and the final 15% for testing, containing 468 training, 104 validation, and 130 test utterances.