SpatialActor: Exploring Disentangled Spatial Representations
for Robust Robotic Manipulation
Hao Shi1*, Bin Xie2, Yingfei Liu2, Yang Yue1, Tiancai Wang2,
Haoqiang Fan2, Xiangyu Zhang3,4, Gao Huang1†
1Department of Automation, BNRist, Tsinghua University,
2Dexmal,
3MEGVII Technology,
4StepFun
shi-h23@mails.tsinghua.edu.cn, wtc@dexmal.com, gaohuang@tsinghua.edu.cn
Abstract
Robotic manipulation requires precise spatial understanding
to interact with objects in the real world. Point-based meth-
ods suffer from sparse sampling, leading to the loss of fine-
grained semantics. Image-based methods typically feed RGB
and depth into 2D backbones pre-trained on 3D auxiliary
tasks, but their entangled semantics and geometry are
sensitive to the depth noise inherent in real-world sensing,
which disrupts semantic understanding. Moreover, these methods
focus on high-level geometry while overlooking low-level
spatial cues essential for precise interaction. We propose
SpatialActor, a disentangled framework for robust robotic
manipulation that explicitly decouples semantics and geometry.
The Semantic-guided Geometric Module adaptively fuses two
complementary geometric representations, one from noisy depth
and one from semantic-guided expert priors. In addition, a
Spatial Transformer leverages low-level spatial cues for
accurate 2D-3D mapping and enables interaction among spatial
features. We evaluate SpatialActor
on multiple simulation and real-world scenarios across 50+
tasks. It achieves state-of-the-art performance with 87.4%
on RLBench and improves by 13.9% to 19.4% under vary-
ing noisy conditions, showing strong robustness. Moreover,
it significantly enhances few-shot generalization to new tasks
and maintains robustness under various spatial perturbations.
Project Page: https://shihao1895.github.io/SpatialActor
1 Introduction
Robotic manipulation enables robots to understand scenes
and interact with objects to perform precise physical tasks in
real-world environments. Some existing methods (Zeng
et al. 2021; Zhao et al. 2023; Brohan et al. 2022; Kim et al.
2024; Chi et al. 2023; Liu et al. 2024a; Shi et al. 2025) rely
solely on 2D visual inputs to predict end-effector actions in
3D space. However, they often struggle in scenarios requiring
spatial reasoning, occlusion handling, geometric shape
comprehension, or fine-grained object interactions due to
their limited understanding of spatial geometry. Given that
real-world tasks inherently occur in 3D space, incorporating
3D spatial information is crucial for learning robust and
generalizable robotic manipulation policies.
*Work done during internship at Dexmal.
†Corresponding author: Gao Huang.
Copyright © 2026, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Recent efforts in robotic manipulation have explored var-
ious approaches to exploit spatial information. In Fig. 1 (a),
point cloud-based approaches (Zhang et al. 2023; Chen et al.
2023; Ze et al. 2024; James et al. 2022) represent 3D geom-
etry explicitly, yet suffer from semantic loss due to sparse
sampling and are limited by the high cost of 3D annota-
tions, which constrains pretraining scalability. In contrast,
Fig. 1 (b) illustrates image-based methods (Goyal et al.
2023, 2024; Fang et al. 2025; Wang et al. 2024) that uti-
lize multi-view RGB-D to jointly model semantics and ge-
ometry in a shared feature space. These methods exploit
structured 2D inputs to obtain dense semantics and bene-
fit from strong 2D pretrained priors, enabling competitive
performance. However, the entanglement of semantics and
geometry makes these methods sensitive to inherent depth
noise in the real world, which degrades semantic and
geometric understanding. As shown in Fig. 1 (d), even minor
noise can lead to a significant performance drop of 8.9% in
RVT-2 (Goyal et al. 2024). In practice, depth is often
compromised by sensor noise, lighting variations, and surface
reflections, which severely limits the practical application of
such methods in the real world. Furthermore, the joint
modeling primarily retains high-level geometry while neglecting
low-level spatial cues that are critical for precise interaction
by providing fine-grained 2D-3D correspondences.
The limitations above call for three critical capabilities in
robotic manipulation: 1) fine-grained spatial understanding
to enable accurate control; 2) robustness to sensor noise to
ensure real-world reliability; and 3) low-level spatial cues
to support consistent interaction among spatial tokens. This raises a
fundamental question: How can we construct a robust spatial
representation that fulfills these requirements?
To address this, we propose SpatialActor, a novel framework
for robust spatial representation in robotic manipulation.
Instead of a shared feature space, we decouple semantics
and geometry to mitigate cross-modal interference.
Furthermore, we decompose geometric information into
high-level geometric representations and low-level spatial cues.
To construct a robust high-level geometric representation,
we propose a Semantic-guided Geometric Module (SGM).
Within the SGM, high signal-to-noise semantics from RGB
are processed by a large-scale pretrained depth estimation
expert (Yang et al. 2024, 2025) to produce a robust but
coarse geometric prior. Meanwhile, raw depth inputs retain
fine-grained geometric details but are inherently noisy.
By adaptively integrating these complementary geometric
representations through a gating mechanism, the SGM
enhances both robustness and spatial precision, effectively
addressing the limitations of individual modalities. For
low-level positional cues, we introduce a Spatial Transformer
(SPT) that integrates spatial modeling into the transformer
layers. Spatial position encoding endows distinct spatial
tokens with unique spatial indices, facilitating spatial
interactions. The model performs view-level interaction to
refine token relationships within each view, followed by
scene-level interaction that unifies cross-modal cues across
the scene, yielding features for the action head.
arXiv:2511.09555v1 [cs.RO] 12 Nov 2025
Figure 1: Methodology comparisons. (a) Point-based methods suffer from sparse sampling, leading to the loss of fine-grained
semantics. (b) Image-based methods typically entangle semantics and geometry, so inherent depth noise in the real world
disrupts semantic understanding. (c) SpatialActor disentangles visual semantics, two complementary high-level geometric
representations (from noisy depth and expert priors), and low-level spatial cues. (d) Performance under various degrees of
noise, showing robustness (average success at Ideal / Light / Middle / Hard noise: RVT-2 81.4 / 72.5 / 68.4 / 57;
Ours 87.4 / 86.4 / 85 / 76.4).
To comprehensively evaluate our method, SpatialActor,
we conduct experiments on 50+ robotic manipulation tasks
in both simulation and the real world. On RLBench (18 tasks
with 249 variations), SpatialActor achieves an 87.4% av-
erage success rate, surpassing state-of-the-art methods by
approximately 6.0%, with a notable 53.3% improvement in
high-precision spatial tasks like Insert Peg. Our method also
shows strong robustness, maintaining higher success rates
under noise conditions with improvements of 13.9%, 16.9%,
and 19.4% at light, medium, and heavy noise levels, respec-
tively. On ColosseumBench, which evaluates 20 tasks under
spatial perturbations, SpatialActor consistently outperforms
baselines, showcasing superior spatial generalization. Ad-
ditionally, in a few-shot setting, adapting a multi-task pre-
trained model to 19 novel tasks with only 10 demonstrations
per task, SpatialActor achieves 79.2% success compared to
46.9% for RVT-2. Real-world experiments further validate
these results, as SpatialActor outperforms RVT-2 across 8
tasks and 15 variations, demonstrating its strong robustness
and generalization across diverse scenarios.
2 Related Works
2.1 Representation Learning for Manipulation
Early methods relied on proprioceptive sensing (Deng et al.
2020; Andrychowicz et al. 2020), which limited their gen-
eralization. With the rise of large-scale visual pretraining,
many 2D-based approaches (Nair et al. 2022; Chi et al. 2023;
Zhao et al. 2023; Yue et al. 2025; Zeng et al. 2024; Zhong
et al. 2025; Xie et al. 2025) leverage strong visual priors to
extract semantics. However, they often lack 3D spatial un-
derstanding, limiting their effectiveness in precise manipu-
lation. Point cloud-based methods (Fang et al. 2023; Chen
et al. 2023; Jia et al. 2024; Ze et al. 2024; Zhang et al.
2023; Sun et al. 2025) capture explicit 3D structure
but are hampered by sparsity. Voxel-based
representations (Shridhar, Manuelli, and Fox 2023; James
et al. 2022) reduce sparsity by discretizing space for struc-
tured reasoning, yet they incur high computational costs.
Multi-view RGB-D approaches (Goyal et al. 2023, 2024;
Zhang et al. 2024; Fang et al. 2025; Wang et al. 2024; Seo
et al. 2023) integrate dense 2D semantics with geometry via
early fusion or auxiliary supervision, yet such shared fea-
ture spaces remain vulnerable to sensor noise and often lack
precise spatial correspondence for fine-grained interaction. To
address these limitations, we decouple semantics and geom-
etry, and construct geometric representations by fusing com-
plementary high-level expert priors and raw depth together
with low-level spatial cues for precise manipulation.
2.2 Vision Foundation Models for Robotics
Vision foundation models have significantly enhanced
robotic perception by incorporating semantic and geometric
priors. Visual and multimodal models (Radford et al. 2021;
Li et al. 2022; Feng et al. 2023; Liu et al. 2024b; Wang
et al. 2025b) leverage diverse datasets to learn strong
semantic priors that improve visual understanding, which
benefits downstream robotic tasks. However, they primarily
focus on the 2D domain and lack spatial understanding
capabilities. 3D vision models (Zhu et al. 2024; Zheng et al.
2024; Qian et al. 2022; Zheng et al. 2025; Kang et al. 2024;
Zhang et al. 2025) integrate semantic information with
explicit spatial structures to facilitate effective geometric
perception. However, the acquisition and annotation of 3D data
are inherently expensive and labor-intensive, which restricts
scalability and limits their application in real-world
scenarios. Depth estimation experts (Yang et al. 2024, 2025; Bhat
et al. 2023; Wang et al. 2025a) leverage large-scale
pretraining on diverse datasets to translate semantics in images into
corresponding geometric structures, robustly inferring
geometric information even under challenging conditions such
as sensor noise and occlusions. In this paper, we leverage the
strong semantic alignment of vision models together with
robust geometric priors from depth estimation experts.
Figure 2: Overall framework of SpatialActor. The architecture employs separate vision and depth encoders. The Semantic-guided
Geometric Module (SGM) adaptively fuses robust yet coarse geometric priors from a pretrained depth expert with noisy depth
features via gated fusion to yield high-level geometric representations. In the Spatial Transformer (SPT), low-level spatial cues
are encoded as positional embeddings to drive spatial interactions. Finally, view-level interactions refine intra-view features,
while scene-level interactions consolidate cross-modal information across views to support the subsequent action head.
3 Method
3.1 Overall Framework
Fig. 2 illustrates the overall framework of our approach. The
inputs to the robot’s control system are given by
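$$X = \{I^v, D^v\}_{v=1}^{V},\ P,\ L, \tag{1}$$
where $I^v \in \mathbb{R}^{H \times W \times 3}$ and $D^v \in \mathbb{R}^{H \times W}$ denote the RGB
image and depth map for view $v$ (with $V$ views in total),
$P \in \mathbb{R}^{d_p}$ represents the robot’s proprioceptive state ($d_p$ indicating
its dimension), and $L$ denotes the language prompt.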
For each view $v$, the RGB images and noisy depth maps
are processed separately. The images $I^v$ and the language
instruction $L$ are fed into a vision-language model (e.g.,
CLIP (Radford et al. 2021)) to extract semantic features
$F^v_{\text{sem}}$ and text features $F_{\text{text}}$. Meanwhile, raw depth maps
$D^v$ are processed by a depth encoder to yield fine-grained
but noisy geometric features $F^v_{\text{geo}}$. Subsequently, $F^v_{\text{geo}}$ is
enhanced via a Semantic-guided Geometric Module (SGM). In
the SGM, a large-scale pre-trained depth estimation expert is
employed to obtain robust yet coarse geometric priors $\hat{F}^v_{\text{geo}}$. A
multi-scale gated fusion module then adaptively fuses $F^v_{\text{geo}}$
with $\hat{F}^v_{\text{geo}}$ to produce refined geometric features $F^v_{\text{fuse-geo}}$,
preserving details while reducing noise; these are concatenated
with $F^v_{\text{sem}}$ to form the final spatial representation $H^v$.
We further introduce a Spatial Transformer (SPT). Within
the SPT, intrinsic and extrinsic parameters, along with depth
values, are used to construct a spatial encoding that captures
the low-level spatial cues between spatial tokens. The SPT
first applies view-level interaction to consolidate intra-view
context, followed by scene-level cross-modal interaction to
aggregate cross-modal cues into a unified scene represen-
tation. Finally, an action head predicts the robot’s 3D end-
effector pose and gripper state.
3.2 Semantic-guided Geometric Module
Real-world depth measurements are often noisy due to
sensor limitations and environmental interference, whereas
RGB images provide high signal-to-noise semantic cues.
Large-scale pretrained depth estimation models (e.g., Depth
Anything (Yang et al. 2024, 2025)) learn a smooth semantic-
to-geometric mapping, offering robust and generalizable ge-
ometric priors. In contrast, raw depth features retain fine-
grained, pixel-level details but are highly sensitive to noise.
To leverage these complementary strengths, we extract ro-
bust yet coarse-grained geometric priors from RGB inputs
via a frozen large-scale pre-trained depth estimation expert:
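$$\hat{F}^v_{\text{geo}} = E_{\text{expert}}(I^v) \in \mathbb{R}^{H \times W \times C}, \tag{2}$$
and extract fine-grained but noisy geometry from raw depth
using a depth encoder (e.g., ResNet-50 (He et al. 2016)):
$$F^v_{\text{geo}} = E_{\text{raw}}(D^v) \in \mathbb{R}^{H \times W \times C}. \tag{3}$$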
Figure 3: Semantic-guided Geometric Module and Spatial Transformer. (a) SGM adaptively combines two complementary
geometric representations via a gating mechanism. (b) SPT converts 3D points into spatial positional embeddings using RoPE
to establish 2D–3D correspondences, followed by view-level and scene-level interactions for spatial token refinement.
As shown in Fig. 3 (a), a multi-scale gating mechanism
then adaptively fuses these features to yield an optimized
geometric representation that preserves fine details while
reducing noise and aligning with the semantic cues:
$$G^v = \sigma\big(\mathrm{MLP}\big(\mathrm{Concat}(\hat{F}^v_{\text{geo}}, F^v_{\text{geo}})\big)\big), \tag{4}$$
$$F^v_{\text{fuse-geo}} = G^v \odot F^v_{\text{geo}} + (1 - G^v) \odot \hat{F}^v_{\text{geo}}, \tag{5}$$
where $\sigma$ denotes the sigmoid activation and $\odot$ element-wise
multiplication. The gate $G^v$ learns to retain reliable depth
details while suppressing noise.
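To make Eqs. (4) and (5) concrete, the following is a minimal sketch of the gate at a single pixel and a single scale. It is not the authors' implementation: the MLP is replaced by one linear layer, the multi-scale aspect is omitted, and the names `gated_fusion`, `weight`, and `bias` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(f_geo, f_expert, weight, bias):
    """Fuse two C-dim feature vectors at one pixel, following Eqs. (4)-(5).

    f_geo    : raw-depth feature F_geo (fine-grained but noisy), length C
    f_expert : depth-expert feature F^_geo (robust but coarse), length C
    weight   : C x 2C matrix; a single linear layer stands in for the MLP
               (assumption, the paper does not give the MLP layout)
    bias     : length-C bias vector
    """
    concat = list(f_expert) + list(f_geo)  # Concat(F^_geo, F_geo)
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weight, bias)
    ]
    # G * F_geo + (1 - G) * F^_geo, element-wise
    fused = [g * fg + (1.0 - g) * fe for g, fg, fe in zip(gate, f_geo, f_expert)]
    return fused, gate
```

With an all-zero layer the gate is 0.5 everywhere and the output is the average of the two inputs; a gate near 1 passes the raw-depth feature through, and a gate near 0 falls back to the expert prior.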
3.3 Spatial Transformer
For each view $v$, we denote the spatial features as
$H^v \in \mathbb{R}^{N_v \times D}$. The proprioceptive input $P$ is projected via an MLP
and fused with $H^v$ by element-wise addition:
$$\tilde{H}^v = H^v + \mathrm{MLP}(P). \tag{6}$$
Given a pixel $(x', y')$ with depth $d = D^v(x', y')$, its 3D
coordinate $[x, y, z]^\top$ in the robot-centric coordinate system
is computed by inverting the perspective projection:
$$[x, y, z, 1]^\top = E^v \big[\, d \cdot (K^v)^{-1} [x', y', 1]^\top \,\|\, 1 \,\big], \tag{7}$$
where $K^v \in \mathbb{R}^{3 \times 3}$ and $E^v \in \mathbb{R}^{4 \times 4}$ denote the intrinsic and
extrinsic matrices, and $\|$ denotes vector concatenation.
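Eq. (7) can be sketched for a single pixel as follows. This is an illustration under stated assumptions rather than the paper's code: the camera is taken to be a zero-skew pinhole so that the inverse intrinsics have a closed form, the extrinsic matrix is assumed to map camera coordinates to the robot frame, and the function name `unproject` is hypothetical.

```python
def unproject(px, py, depth, intrinsics, extrinsics):
    """Lift pixel (px, py) with depth d to a robot-frame point, as in Eq. (7).

    intrinsics : (fx, fy, cx, cy) of a zero-skew pinhole camera (assumption)
    extrinsics : 4x4 camera-to-robot matrix E, given as nested lists
    """
    fx, fy, cx, cy = intrinsics
    # d * K^{-1} [x', y', 1]^T for a zero-skew pinhole K, then append 1
    cam = [depth * (px - cx) / fx, depth * (py - cy) / fy, depth, 1.0]
    # Apply E to the homogeneous camera-frame point
    robot = [sum(e * c for e, c in zip(row, cam)) for row in extrinsics]
    return robot[:3]
```

For example, with `intrinsics = (100, 100, 64, 64)`, a pixel 10 columns right of the principal point at depth 2.0 lies 0.2 units off the optical axis in the camera frame before the extrinsic transform is applied.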
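To encode spatial cues, we apply rotary positional encoding
to $\tilde{H}^v$, where each axis is assigned $D/3$ dimensions. We
define a set of frequencies:
$$\omega_k = \lambda^{-2k/d}, \quad k = 0, 1, \ldots, \tfrac{d}{2} - 1, \quad d = D/3, \tag{8}$$
with $\lambda = 10000$ to control the frequency bandwidth (here $d$
is the per-axis feature dimension, not the depth value of Eq. (7)). In the
spirit of Fourier feature mappings, we compute axis-wise
sinusoidal embeddings as:
$$\cos_{\text{pos}} = [\cos(\omega_k u)]_{u \in \{x, y, z\},\ k = 0, \ldots, d/2 - 1}, \tag{9}$$
$$\sin_{\text{pos}} = [\sin(\omega_k u)]_{u \in \{x, y, z\},\ k = 0, \ldots, d/2 - 1}. \tag{10}$$
The final position-encoded features are given by:
$$T^v = \tilde{H}^v \odot \cos_{\text{pos}} + \mathrm{rot}(\tilde{H}^v) \odot \sin_{\text{pos}}, \tag{11}$$
where $\odot$ denotes element-wise multiplication, and $\mathrm{rot}(\cdot)$
rotates each $(f_{2i}, f_{2i+1})$ feature pair as $(-f_{2i+1}, f_{2i})$.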
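The encoding of Eqs. (8) to (11) can be sketched for one token as below. Two details are assumptions not stated in the text: the feature vector is laid out as three contiguous chunks for the x, y, and z axes, and each frequency's cosine and sine are shared by both elements of a feature pair, as in standard RoPE. The function name `rope_3d` is hypothetical.

```python
import math


def rope_3d(feature, point, base=10000.0):
    """Apply the axis-wise rotary encoding of Eqs. (8)-(11) to one token.

    feature : list of D floats, D divisible by 6 (D/3 dims per axis,
              grouped into (f_2i, f_2i+1) pairs)
    point   : (x, y, z) robot-frame coordinate of the token
    """
    D = len(feature)
    assert D % 6 == 0, "each axis needs an even number of dimensions"
    d = D // 3
    out = []
    for axis, u in enumerate(point):
        chunk = feature[axis * d:(axis + 1) * d]
        for k in range(d // 2):
            omega = base ** (-2.0 * k / d)  # Eq. (8)
            c, s = math.cos(omega * u), math.sin(omega * u)
            f0, f1 = chunk[2 * k], chunk[2 * k + 1]
            # T = f * cos + rot(f) * sin, with rot(f0, f1) = (-f1, f0)
            out.extend([f0 * c - f1 * s, f1 * c + f0 * s])
    return out
```

Each feature pair is rotated by an angle proportional to the token's coordinate, so the inner product between two encoded tokens depends only on their relative 3D offset. This is the property that lets attention reason about spatial relations between tokens.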
At the view level, self-attention followed by a feed-
forward network (FFN) refines each view’s token represen-
tation. At the scene level, tokens from all views are concate-
nated with language featuresF text. Another round of self-
attention and an FFN then fuses cross-view and language
context, producing the final refined tokens. The tokens are
fed into a lightweight decoder (ConvexUp) to generate a per-
view 2D heatmap. The target 2D position is obtained via
argmax and lifted into 3D using the camera model. The
action head then uses an MLP on local features around this
position to regress the rotation $\theta = (\theta_x, \theta_y, \theta_z)$ and gripper
state $g$. Together with the 3D translation $(x, y, z)$, these form
the final action $A = (x, y, z, \theta_x, \theta_y, \theta_z, g)$.
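The decoding step can be illustrated with a small sketch. It covers only the argmax over one heatmap and the packing of the 7-dimensional action; the multi-view lifting into 3D is not shown because the text does not spell it out. Both function names are hypothetical, and encoding the gripper state as 1.0 for open is an assumption.

```python
def decode_heatmap(heatmap):
    """Return (row, col) of the highest-scoring pixel in a 2D heatmap."""
    rows, cols = len(heatmap), len(heatmap[0])
    return max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )


def assemble_action(translation, euler, gripper_open):
    """Pack A = (x, y, z, theta_x, theta_y, theta_z, g).

    translation  : (x, y, z) obtained by lifting the argmax pixel into 3D
    euler        : (theta_x, theta_y, theta_z) from the rotation head
    gripper_open : bool, stored as 1.0 for open and 0.0 for closed (assumption)
    """
    return (*translation, *euler, 1.0 if gripper_open else 0.0)
```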
The action supervision includes three parts: a cross-
entropy loss on per-view 2D heatmaps for translation, cross-
entropy losses on discretized Euler angles for rotation, and a
binary classification loss for the gripper state.
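The three supervision terms can be sketched for a single sample as below. The equal weighting of the terms, the flattened per-view heatmap layout, and all function names are assumptions made for illustration; the text specifies only the type of each loss.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample, computed in a stable form."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_index]


def binary_cross_entropy(logit, target):
    """Binary cross-entropy on one logit; target is 0 or 1."""
    # Stable form: max(z, 0) - z * t + log(1 + exp(-|z|))
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def action_loss(heatmap_logits, target_pixels, angle_logits, target_bins,
                grip_logit, grip_target):
    """Sum of translation, rotation, and gripper terms (equal weights assumed).

    heatmap_logits : one flattened heatmap (list of logits) per view
    target_pixels  : index of the ground-truth pixel in each flattened heatmap
    angle_logits   : one list of bin logits per Euler axis
    target_bins    : ground-truth bin index per Euler axis
    """
    trans = sum(cross_entropy(h, t) for h, t in zip(heatmap_logits, target_pixels))
    rot = sum(cross_entropy(a, b) for a, b in zip(angle_logits, target_bins))
    grip = binary_cross_entropy(grip_logit, grip_target)
    return trans + rot + grip
```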
4 Experiments
To comprehensively evaluate the effectiveness of SpatialAc-
tor, we conduct experiments in both simulation and real-
world settings. Specifically, we aim to answer the follow-
ing key questions: (1) How does SpatialActor compare to
state-of-the-art robotic manipulation policies? (2) How ro-
bust is SpatialActor under noisy conditions? (3) How well
does SpatialActor generalize to few-shot settings? (4) How
does SpatialActor perform under spatial perturbations? (5)
What is the impact of different components of SpatialActor?
(6) How does SpatialActor perform in real-robot setups?
4.1 Comparison with State-of-the-Art Policies
Simulation Environment and Datasets. We evaluate
SpatialActor on RLBench (James et al. 2020), a main-
stream multi-task 3D manipulation benchmark built on Cop-
peliaSim (Rohmer, Singh, and Freese 2013). The simulation
environment features a Franka robotic arm with a parallel
gripper operating in a tabletop scenario. Observations come
from four fixed RGB-D cameras (front, left/right shoulder,
wrist) at 128×128 resolution. The action space consists of
3D translation, rotation of the end-effector, and binary
gripper control. An OMPL-based motion planner (Sucan, Moll,
and Kavraki 2012) is utilized to compute feasible
trajectories. Following PerAct (Shridhar, Manuelli, and Fox 2023),
we use 18 tasks with 249 variations covering diverse
manipulation skills, each with 100 expert demonstrations for
training and 25 unseen episodes for evaluation.

| Models | Avg. Success ↑ | Avg. Rank ↓ | Close Jar | Drag Stick | Insert Peg | Meat off Grill | Open Drawer | Place Cups | Place Wine | Push Buttons |
|---|---|---|---|---|---|---|---|---|---|---|
| C2F-ARM-BC (James et al. 2022) | 20.1 | 9.5 | 24.0 | 24.0 | 4.0 | 20.0 | 20.0 | 0.0 | 8.0 | 72.0 |
| HiveFormer (Guhur et al. 2023) | 45.3 | 7.8 | 52.0 | 76.0 | 0.0 | 100.0 | 52.0 | 0.0 | 80.0 | 84.0 |
| PolarNet (Chen et al. 2023) | 46.4 | 7.3 | 36.0 | 92.0 | 4.0 | 100.0 | 84.0 | 0.0 | 40.0 | 96.0 |
| PerAct (Shridhar, Manuelli, and Fox 2023) | 49.4 | 7.1 | 55.2±4.7 | 89.6±4.1 | 5.6±4.1 | 70.4±2.0 | 88.0±5.7 | 2.4±3.2 | 44.8±7.8 | 92.8±3.0 |
| RVT (Goyal et al. 2023) | 62.9 | 5.3 | 52.0±2.5 | 99.2±1.6 | 11.2±3.0 | 88.0±2.5 | 71.2±6.9 | 4.0±2.5 | 91.0±5.2 | 100.0±0.0 |
| Act3D (Gervet et al. 2023) | 65.0 | 5.3 | 92.0 | 92.0 | 27.0 | 94.0 | 93.0 | 3.0 | 80.0 | 99.0 |
| SAM-E (Zhang et al. 2024) | 70.6 | 2.9 | 82.4±3.6 | 100.0±0.0 | 18.4±4.6 | 95.2±3.3 | 95.2±5.2 | 0.0±0.0 | 94.4±4.6 | 100.0±0.0 |
| 3D Diffuser Actor (Ke et al. 2024) | 81.3 | 2.8 | 96.0±2.5 | 100.0±0.0 | 65.6±4.1 | 96.8±1.6 | 89.6±4.1 | 24.0±7.6 | 93.6±4.8 | 98.4±2.0 |
| RVT-2 (Goyal et al. 2024) | 81.4 | 2.8 | 100.0±0.0 | 99.0±1.7 | 40.0±0.0 | 99.0±1.7 | 74.0±11.8 | 38.0±4.5 | 95.0±3.3 | 100.0±0.0 |
| SpatialActor (Ours) | 87.4±0.8 | 2.3 | 94.0±4.2 | 100.0±0.0 | 93.3±4.8 | 98.7±2.1 | 82.0±3.3 | 56.7±8.5 | 94.7±4.8 | 100.0±0.0 |

| Models | Put in Cupboard | Put in Drawer | Put in Safe | Screw Bulb | Slide Block | Sort Shape | Stack Blocks | Stack Cups | Sweep to Dustpan | Turn Tap |
|---|---|---|---|---|---|---|---|---|---|---|
| C2F-ARM-BC (James et al. 2022) | 0.0 | 4.0 | 12.0 | 8.0 | 16.0 | 8.0 | 0.0 | 0.0 | 0.0 | 68.0 |
| HiveFormer (Guhur et al. 2023) | 32.0 | 68.0 | 76.0 | 8.0 | 64.0 | 8.0 | 8.0 | 0.0 | 28.0 | 80.0 |
| PolarNet (Chen et al. 2023) | 12.0 | 32.0 | 84.0 | 44.0 | 56.0 | 12.0 | 4.0 | 8.0 | 52.0 | 80.0 |
| PerAct (Shridhar, Manuelli, and Fox 2023) | 28.0±4.4 | 51.2±4.7 | 84.0±3.6 | 17.6±2.0 | 74.0±13.0 | 16.8±4.7 | 26.4±3.2 | 2.4±2.0 | 52.0±0.0 | 88.0±4.4 |
| RVT (Goyal et al. 2023) | 49.6±3.2 | 88.0±5.7 | 91.2±3.0 | 48.0±5.7 | 81.6±5.4 | 36.0±2.5 | 28.8±3.9 | 26.4±8.2 | 72.0±0.0 | 93.6±4.1 |
| Act3D (Gervet et al. 2023) | 51.0 | 90.0 | 95.0 | 47.0 | 93.0 | 8.0 | 12.0 | 9.0 | 92.0 | 94.0 |
| SAM-E (Zhang et al. 2024) | 64.0±2.8 | 92.0±5.7 | 95.2±3.3 | 78.4±3.6 | 95.2±1.8 | 34.4±6.1 | 26.4±4.6 | 0.0±0.0 | 100.0±0.0 | 100.0±0.0 |
| 3D Diffuser Actor (Ke et al. 2024) | 85.6±4.1 | 96.0±3.6 | 97.6±2.0 | 82.4±2.0 | 97.6±3.2 | 44.0±4.4 | 68.3±3.3 | 47.2±8.5 | 84.0±4.4 | 99.2±1.6 |
| RVT-2 (Goyal et al. 2024) | 66.0±4.5 | 96.0±0.0 | 96.0±2.8 | 88.0±4.9 | 92.0±2.8 | 35.0±7.1 | 80.0±2.8 | 69.0±5.9 | 100.0±0.0 | 99.0±1.7 |
| SpatialActor (Ours) | 72.0±3.6 | 98.7±3.3 | 96.7±3.9 | 88.7±3.9 | 91.3±6.9 | 73.3±6.5 | 56±7.6 | 81.3±4.1 | 100.0±0.0 | 95.3±3.0 |

Table 1: Performance on RLBench. We report success rates (%) on 18 RLBench tasks with 249 variations. SpatialActor achieves
the highest overall performance, surpassing the previous state-of-the-art RVT-2 by 6.0 percentage points. Notably, on tasks
requiring high spatial precision, such as Insert Peg and Sort Shape, SpatialActor outperforms RVT-2 by 53.3 and 38.3
percentage points, respectively.
Implementation Details. SpatialActor is trained for
approximately 40k iterations using a cosine learning rate
schedule with an initial 2k-iteration warm-up. Training is
performed using 8 GPUs with a total batch size of 192 (24
per GPU) and an initial learning rate of $2.4 \times 10^{-3}$. Data
augmentation includes random spatial translations of up to
12.5 cm along the $x$, $y$, and $z$ axes, as well as rotations of
up to 45° around the $z$ axis. We follow RVT (Goyal et al.
2023, 2024), incorporating its virtual view design and two-
stage process. Furthermore, we employ CLIP (Radford et al.
2021) as our vision-language encoder to provide aligned
cross-modal representations, and Depth Anything v2 (Yang
et al. 2025) as our geometry expert.
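As an illustration of the schedule described above, the following sketch computes the learning rate at a given iteration. The linear shape of the warm-up and the decay to exactly zero at 40k iterations are assumptions; the text gives only the peak rate, the warm-up length, and the cosine form.

```python
import math


def learning_rate(step, base_lr=2.4e-3, warmup_steps=2_000, total_steps=40_000):
    """Cosine schedule with linear warm-up (warm-up shape and final LR assumed)."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr over the warm-up iterations
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr to 0 over the remaining iterations
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate reaches its peak of 2.4e-3 at iteration 2k, is half of that midway through the decay (iteration 21k), and reaches zero at iteration 40k.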
Performance on RLBench 18 Tasks. Tab. 1 summarizes
the performance of various methods on 18 RLBench tasks
with 249 variations. SpatialActor achieves an average
success rate of 87.4%, surpassing the previous state of the
art, RVT-2 (81.4%), by 6.0 percentage points. Notably,
SpatialActor shows substantial improvements on tasks
requiring high spatial precision, such as Insert Peg and
Sort Shape. It achieves success rates of 93.3% and 73.3% on
these tasks, outperforming RVT-2 by 53.3 and 38.3 percentage
points, respectively. These results highlight
SpatialActor’s superior spatial handling capability.
4.2 Robustness under Noisy Conditions
Experimental Setup. Depth measurements are inherently
affected by sensor noise, lighting variations, and surface
reflections. To simulate these challenges, we inject controlled
Gaussian noise into reconstructed point clouds. Specifically,
we design three noise levels: Light corrupts 20% of the
points with a Gaussian standard deviation of 0.05, Middle
corrupts 50% of the points with a standard deviation of 0.1,
and Heavy corrupts 80% of the points with a standard
deviation of 0.1. This setup allows us to evaluate the robustness of
our approach under progressively severe noisy conditions.
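The three noise levels can be sketched as a small corruption routine. The fractions and standard deviations come from the setup above; everything else is an assumption made for illustration, including zero-mean noise applied independently to each coordinate, selecting an exact fraction of points without replacement, and the names `NOISE_LEVELS` and `corrupt_point_cloud`.

```python
import random

# (fraction of corrupted points, Gaussian standard deviation) per level
NOISE_LEVELS = {"light": (0.2, 0.05), "middle": (0.5, 0.1), "heavy": (0.8, 0.1)}


def corrupt_point_cloud(points, level, seed=0):
    """Add Gaussian noise to a random subset of 3D points.

    points : list of (x, y, z) tuples
    level  : "light", "middle", or "heavy"
    """
    fraction, std = NOISE_LEVELS[level]
    rng = random.Random(seed)
    n_corrupt = round(fraction * len(points))
    chosen = set(rng.sample(range(len(points)), n_corrupt))
    noisy = []
    for i, (x, y, z) in enumerate(points):
        if i in chosen:
            # Zero-mean, per-coordinate noise (assumption)
            noisy.append((x + rng.gauss(0.0, std),
                          y + rng.gauss(0.0, std),
                          z + rng.gauss(0.0, std)))
        else:
            noisy.append((x, y, z))
    return noisy
```

Note that Middle and Heavy share the same standard deviation and differ only in how many points are corrupted.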
Performance Evaluation. Tab. 2 shows that under Light,
Middle, and Heavy noise, SpatialActor improves average
success rates over RVT-2 by 13.9, 16.9, and 19.4 percentage
points, respectively. Notably, in tasks requiring high spatial
precision, these gains are even more pronounced. For instance,
on the Insert Peg task, SpatialActor outperforms RVT-2 by
88.0, 78.6, and 61.3 percentage points under the respective noise levels.
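The reported gains are absolute differences in success rate, which can be checked directly against the Tab. 2 entries. The short sketch below recomputes them; the dictionary and function names are hypothetical, and the values are copied from Tab. 2.

```python
# (RVT-2, SpatialActor) success rates in % from Tab. 2
AVG_SUCCESS = {"light": (72.5, 86.4), "middle": (68.4, 85.3), "heavy": (57.0, 76.4)}
INSERT_PEG = {"light": (6.7, 94.7), "middle": (2.7, 81.3), "heavy": (0.0, 61.3)}


def gains(table):
    """Absolute improvement of SpatialActor over RVT-2, in percentage points."""
    return {level: round(ours - base, 1) for level, (base, ours) in table.items()}
```

The average gap widens as noise increases (13.9, 16.9, 19.4), while the Insert Peg gap narrows (88.0, 78.6, 61.3) because RVT-2 is already near zero on that task and SpatialActor's own success rate drops under heavier noise.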
4.3 Few-Shot Generalization
We evaluate the few-shot generalization ability of Spa-
tialActor by adapting the multi-task pre-trained model to
19 novel tasks using only 10 demonstrations per task, just
one-tenth of the data used during multi-task training. In this
few-shot adaptation scenario, the model is initialized with
its pre-trained weights and then fine-tuned on the limited
data. As shown in Tab. 3, our experiments demonstrate that
SpatialActor effectively transfers previously learned skills to
new tasks with minimal adaptation data. Overall, SpatialAc-
tor achieves an average success rate of 79.2%, compared
to 46.9% for RVT-2, yielding an improvement of approxi-
mately 32.3%. This significant boost underscores the supe-
rior few-shot generalization capability of our approach.
| Models | Noise type | Avg. Success ↑ | Close Jar | Drag Stick | Insert Peg | Meat off Grill | Open Drawer | Place Cups | Place Wine | Push Buttons |
|---|---|---|---|---|---|---|---|---|---|---|
| RVT-2 | Light | 72.5±0.5 | 92.0±4.0 | 100.0±0.0 | 6.7±4.6 | 100.0±0.0 | 82.7±10.1 | 25.3±6.1 | 96.0±4.0 | 74.7±8.3 |
| SpatialActor (Ours) | Light | 86.4±0.4 | 97.3±2.3 | 98.7±2.3 | 94.7±6.1 | 96.0±0.0 | 73.3±10.1 | 54.7±8.3 | 92.0±4.0 | 98.7±2.3 |
| RVT-2 | Middle | 68.4±0.9 | 85.3±2.3 | 100.0±0.0 | 2.7±2.3 | 94.7±2.3 | 82.7±11.5 | 20.0±0.0 | 89.3±4.6 | 73.3±4.6 |
| SpatialActor (Ours) | Middle | 85.3±0.9 | 100.0±0.0 | 98.7±2.3 | 81.3±6.1 | 96.0±4.0 | 78.7±8.3 | 45.3±10.1 | 89.3±4.6 | 97.3±4.6 |
| RVT-2 | Heavy | 57.0±0.9 | 49.3±6.1 | 94.7±4.6 | 0.0±0.0 | 97.3±2.3 | 86.7±2.3 | 8.0±4.0 | 86.7±2.3 | 64.0±4.0 |
| SpatialActor (Ours) | Heavy | 76.4±0.5 | 82.7±2.3 | 98.7±2.3 | 61.3±6.1 | 100.0±0.0 | 80.0±4.0 | 21.3±4.6 | 92.0±0.0 | 92.0±4.6 |

| Models | Noise type | Put in Cupboard | Put in Drawer | Put in Safe | Screw Bulb | Slide Block | Sort Shape | Stack Blocks | Stack Cups | Sweep to Dustpan | Turn Tap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RVT-2 | Light | 57.3±2.3 | 100.0±0.0 | 92.0±4.0 | 81.3±6.1 | 62.7±23.1 | 46.7±6.1 | 53.3±2.3 | 45.3±2.3 | 96.0±6.9 | 93.3±4.6 |
| SpatialActor (Ours) | Light | 81.3±2.3 | 98.7±2.3 | 98.7±2.3 | 88.0±4.0 | 72.0±4.0 | 76.0±6.9 | 62.7±2.3 | 82.7±2.3 | 97.3±4.6 | 93.3±2.3 |
| RVT-2 | Middle | 50.7±12.2 | 98.7±2.3 | 98.7±2.3 | 76.0±4.0 | 57.3±2.3 | 38.7±10.1 | 45.3±12.2 | 25.3±6.1 | 96.0±4.0 | 96.0±6.9 |
| SpatialActor (Ours) | Middle | 74.7±6.1 | 100.0±0.0 | 94.7±2.3 | 88.0±4.0 | 81.3±15.1 | 76.0±4.0 | 58.7±6.1 | 77.3±8.3 | 100.0±0.0 | 97.3±2.3 |
| RVT-2 | Heavy | 20.0±6.9 | 97.3±2.3 | 93.3±2.3 | 58.7±2.3 | 57.3±8.3 | 13.3±6.1 | 13.3±6.1 | 1.3±2.3 | 92.0±0.0 | 92.0±4.0 |
| SpatialActor (Ours) | Heavy | 64.0±4.0 | 100.0±0.0 | 100.0±0.0 | 78.7±8.3 | 58.7±2.3 | 52.0±4.0 | 42.7±6.1 | 70.7±6.1 | 82.7±2.3 | 97.3±4.6 |

Table 2: Performance Under Various Noise Levels. We report success rates (%) under three noise conditions: Light noise corrupts
20% of the points in the reconstructed point cloud with random Gaussian noise (std = 0.05), Middle noise corrupts 50% with
noise of std = 0.1, and Heavy noise corrupts 80% with noise of std = 0.1. Under these conditions, SpatialActor improves average
success rates by approximately 13.9, 16.9, and 19.4 percentage points over RVT-2 at the Light, Middle, and Heavy noise levels,
respectively.

| Models | Avg. Success ↑ | Close Laptop | Put Rubbish in Bin | Beat Buzz | Close Microwave | Put Shoes in Box | Get Ice | Change Clock | Close Box | Reach Target |
|---|---|---|---|---|---|---|---|---|---|---|
| RVT-2 | 46.9±1.5 | 76.0±6.1 | 10.3±5.1 | 47.4±8.5 | 61.7±9.8 | 7.4±4.3 | 93.7±3.9 | 72.6±2.8 | 49.1±8.6 | 12.0±5.7 |
| SpatialActor (Ours) | 79.2±2.7 | 90.0±7.5 | 100.0±0.0 | 92.0±2.5 | 95.3±11.4 | 25.3±13.8 | 96.0±2.5 | 83.3±7.3 | 95.3±4.7 | 86.0±2.2 |

| Models | Close Door | Remove Cups | Close Drawer | Spatula Scoop | Close Fridge | Put Knife on Board | Screw Nail | Close Grill | Plate in Rack | Meat on Grill |
|---|---|---|---|---|---|---|---|---|---|---|
| RVT-2 | 4.0±3.3 | 33.7±13.8 | 96.0±0.0 | 70.9±6.8 | 81.7±8.6 | 14.3±7.3 | 38.9±15.1 | 66.3±8.9 | 24.6±7.1 | 30.0±8.5 |
| SpatialActor (Ours) | 36.0±14.1 | 66.0±8.3 | 96.7±3.9 | 84.7±8.2 | 95.3±5.3 | 66.0±2.2 | 62.7±6.0 | 96.0±0.0 | 48.0±8.0 | 90.0±2.8 |

Table 3: Few-Shot Generalization. We adapt the pre-trained model to 19 new tasks using only 10 demonstrations per task (1/10th
of the original data). We report success rates (%), showing that SpatialActor significantly outperforms RVT-2 in the few-shot setting.

4.4 Spatial Perturbations on ColosseumBench
Experimental Setup. We evaluate the robustness of
SpatialActor on the Colosseum benchmark (Pumacay et al.
2024), which is designed to assess robot manipulation
policies under environmental changes. We evaluate performance
on 20 benchmark tasks under a baseline (no perturbation)
and three spatial perturbation conditions. The first,
manipulation object size (MO-Size), scales the object the robot
directly interacts with, simulating dimensional variations. The
second, receiver object size (RO-Size), alters the size of an
indirectly used object, such as a container or support
surface. The third condition introduces camera pose
perturbations by randomly adjusting the camera’s position and
orientation, mimicking changes in the observation viewpoint.
Together, these perturbations capture spatial environmental
variations that can affect robot performance.
Performance Evaluation. The results in Tab. 4 indicate
that under the no-perturbation condition, our method
achieves a task-average success rate of 57.4%. When spatial
perturbations are introduced, SpatialActor attains 59.2%
under MO-Size variations, 62.0% under RO-Size changes, and
54.2% with camera pose perturbations. These results
consistently outperform competing methods, demonstrating strong
robustness and generalization under spatial variations.
4.5 Ablation Study
Our ablation study on 18 tasks (Tab. 5) shows that decou-
pling semantics and geometry improves performance in both
noise-free and heavy-noise settings, increasing success rates
from 81.4% to 85.1% and from 57.0% to 68.7%,
respectively. Introducing the Semantic-guided Geometric Module
(SGM) further boosts performance, especially under heavy
noise, where performance rises to 73.9%. Finally, the Spatial
Transformer (SPT), which provides precise low-level spa-
tial cues, brings the success rates to 87.4% and 76.4% in
noise-free and noisy conditions, respectively. These results
highlight the importance of each proposed component in im-
proving both accuracy and robustness.
| Method | No-Vars ↑ | MO-Size ↑ | RO-Size ↑ | Cam Pose ↑ |
|---|---|---|---|---|
| R3M (Nair et al. 2022) | 2.9 | 1.8 | 0.0 | 0.8 |
| MVP (Radosavovic et al. 2022) | 3.4 | 4.4 | 0.5 | 2.6 |
| VoxPoser (Huang et al. 2023) | 5.4 | 3.3 | 6.5 | 6.2 |
| PerAct (Shridhar, Manuelli, and Fox 2023) | 34.5 | 35.6 | 29.3 | 36.3 |
| RVT (Goyal et al. 2023) | 43.6 | 35.3 | 40.5 | 42.2 |
| SpatialActor (Ours) | 57.4±3.0 | 59.2±2.4 | 62.0±3.2 | 54.2±1.8 |

Table 4: Performance Under Spatial Perturbations. We
report average success rates (%) on 20 ColosseumBench tasks
under four conditions: No-Vars, manipulation object size
(MO-Size), receiver object size (RO-Size), and camera pose
(Cam Pose). SpatialActor achieves the highest success rate
under all four conditions.

4.6 Real-World Evaluation
Setup. In real-world experiments, we use a WidowX
single-arm robot equipped with an Intel RealSense D435i
RGB-D camera. The camera is statically mounted to
capture a front view of the workspace. We perform both intrinsic
and extrinsic calibration between the camera and the robot
to accurately transform the observed point clouds into the
robot’s base coordinate system, ensuring precise
manipulation. The system is integrated using a ROS package
(Coleman et al. 2014). Images are originally captured at a
resolution of 1280×720 and are downsampled to 128×128.
Figure 4: Real-world tasks. We employed 8 distinct tasks with a
total of 15 variants in real-world experiments. [Panels: Place
Carrot to Box, Push Button, Slide Block, Insert Ring onto Cone,
Pick Glue to Box, Stack Block, Wipe Table, Stack Cup.]
Figure 5: Real-world Generalization Evaluation. We assess
SpatialActor under variations in manipulated object, receiver
object, brightness, and background. Performance remains robust
across challenging settings. [Bar chart; conditions: Default,
Manip. Object, Receiver Object, Brightness, Background; axis from
0 to 100; bar labels legible in the source: 80, 73.3, 60, 60.]
Decouple  SGM  SPT  Avg. success on 18 tasks↑
                    No noise  Heavy noise
-         -    -    81.4      57.0
✓         -    -    85.1      68.7
✓         ✓    -    86.4      73.9
✓         ✓    ✓    87.4      76.4
Table 5: Ablation Study. We analyze the contribution of each
module to overall performance and its effect on robustness under
heavy-noise conditions.
Task                       #vari.  RVT-2  SpatialActor (Ours)
(1) Pick Glue to Box       1       50%    85%
(2) Stack Cup              2       30%    30%
(3) Push Button            3       67%    90%
(4) Slide Block            3       60%    67%
(5) Place Carrot to Box    1       30%    65%
(6) Stack Block            2       40%    35%
(7) Insert Ring Onto Cone  2       20%    50%
(8) Wipe Table             1       50%    80%
All tasks                  15      43%    63%
Table 6: Real-World Results. We report success rates for each
task and overall performance across 8 tasks with 15 variations.
SpatialActor outperforms RVT-2 on six of the eight tasks, ties on
Stack Cup, and trails by 5 points on Stack Block; overall it
reaches 63% versus 43%, indicating stronger robustness in
real-world scenarios.
Dataset Collection. We conduct experiments on a series of
real-world tasks (Fig. 4), including (1) Pick Glue to Box,
(2) Stack Cup, (3) Push Button, (4) Slide Block, (5) Place Carrot
to Box, (6) Stack Block, (7) Insert Ring onto Cone, and (8) Wipe
Table. For each task, we collect 25 demonstrations that capture
diverse spatial configurations and object variations. Some tasks
are instantiated with multiple variations; for example, the Slide
Block task includes yellow, green, and red variants, resulting in
a total of 15 variations across the 8 tasks. The trajectories are
recorded at 30 fps, and key-frames are extracted to construct the
training set.
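The text does not state the key-frame criterion. As a rough sketch only, the code below uses a heuristic that is common in key-frame-based imitation learning: keep a frame when the gripper state flips or when the arm has just come to rest, and always keep the last frame. The function name, the threshold `speed_eps`, and the toy trajectory are all hypothetical.

```python
def extract_keyframes(gripper_open, joint_speeds, speed_eps=1e-2):
    """Return indices of frames kept as key-frames.

    gripper_open: list of bools, one per recorded frame.
    joint_speeds: list of floats, one scalar speed per frame.
    A frame is kept when the gripper state changes or when the speed drops
    below speed_eps after having been at or above it. The last frame is kept.
    """
    keyframes = []
    for t in range(1, len(gripper_open)):
        gripper_changed = gripper_open[t] != gripper_open[t - 1]
        came_to_rest = joint_speeds[t] < speed_eps <= joint_speeds[t - 1]
        if gripper_changed or came_to_rest:
            keyframes.append(t)
    last = len(gripper_open) - 1
    if last >= 0 and (not keyframes or keyframes[-1] != last):
        keyframes.append(last)
    return keyframes

# Toy trajectory: move, stop and close the gripper, move again, stop.
gripper = [True, True, True, False, False, False, False]
speeds = [0.0, 0.3, 0.2, 0.0, 0.4, 0.2, 0.0]
keyframes = extract_keyframes(gripper, speeds)
print(keyframes)
```

With a 30 fps recording, a filter of this kind reduces each demonstration to a handful of frames, which is what makes 25 demonstrations per task a workable training set for a key-frame policy.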
Evaluation. We evaluate SpatialActor against RVT-2 on various
real-world tasks. Single-variant tasks are tested 20 times, and
multi-variant tasks 10 times per variant. As shown in Tab. 6,
SpatialActor outperforms RVT-2 on six of the eight tasks (tying on
Stack Cup and trailing on Stack Block), with an average
improvement of about 20 percentage points (43% to 63%),
demonstrating effectiveness in real-world scenarios.
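The evaluation protocol and the summary row of Tab. 6 can be checked with a few lines of arithmetic. The sketch below uses the per-task numbers from the table; treating the "All tasks" row as an unweighted mean over tasks is our assumption, although it does reproduce the reported 43% and 63% after rounding.

```python
# (task, number of variants, RVT-2 success %, SpatialActor success %) from Table 6.
results = [
    ("Pick Glue to Box", 1, 50, 85),
    ("Stack Cup", 2, 30, 30),
    ("Push Button", 3, 67, 90),
    ("Slide Block", 3, 60, 67),
    ("Place Carrot to Box", 1, 30, 65),
    ("Stack Block", 2, 40, 35),
    ("Insert Ring Onto Cone", 2, 20, 50),
    ("Wipe Table", 1, 50, 80),
]

def trials_for(variants):
    """20 trials for a single-variant task, otherwise 10 trials per variant."""
    return 20 if variants == 1 else 10 * variants

total_variants = sum(v for _, v, _, _ in results)
total_trials = sum(trials_for(v) for _, v, _, _ in results)
mean_rvt2 = sum(r for _, _, r, _ in results) / len(results)
mean_ours = sum(s for _, _, _, s in results) / len(results)
wins = [name for name, _, r, s in results if s > r]

print(total_variants, total_trials, mean_rvt2, mean_ours, len(wins))
```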
To evaluate robustness to distribution shifts, we test Spa-
tialActor under variations in manipulated object, receiver
object, lighting, and background (Fig. 5). SpatialActor main-
tains consistently high performance across these diverse and
challenging conditions, clearly demonstrating strong robust-
ness and generalization in complex real-world scenarios.
5 Conclusion
In this work, we present SpatialActor, a framework for ro-
bust spatial representation in robotic manipulation that ad-
dresses the challenges of precise spatial understanding, sen-
sor noise, and effective interaction. SpatialActor disentan-
gles semantic and geometric information, with the geo-
metric branch divided into high-level and low-level com-
ponents: SGM adaptively fuses semantic-guided geomet-
ric priors with raw depth features for robust high-level ge-
ometry, while SPT captures low-level spatial cues through
position-aware interactions. Extensive experiments across
50+ simulated and real-world tasks demonstrate that Spa-
tialActor achieves higher success rates and strong robustness
under diverse conditions. These results highlight the impor-
tance of disentangled spatial representations for developing
more robust and generalizable robotic systems.
Acknowledgments
This work was supported by the National Science and
Technology Major Project of China under Grant No.
2023ZD0121300, the Scientific Research Innovation Ca-
pability Support Project for Young Faculty under Grant
No. ZYGXQNJSKYCXNLZCXM-I20, and the National
Natural Science Foundation of China under Grant No.
U24B20173.
References
Andrychowicz, O. M.; Baker, B.; Chociej, M.; Jozefowicz, R.; McGrew, B.; Pachocki, J.; Petron, A.; Plappert, M.; Powell, G.; Ray, A.; et al. 2020. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1): 3–20.
Bhat, S. F.; Birkl, R.; Wofk, D.; Wonka, P.; and Müller, M. 2023. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288.
Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Dabis, J.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; Hsu, J.; et al. 2022. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817.
Chen, S.; Garcia, R.; Schmid, C.; and Laptev, I. 2023. Polarnet: 3d point clouds for language-guided robotic manipulation. arXiv preprint arXiv:2309.15596.
Chi, C.; Xu, Z.; Feng, S.; Cousineau, E.; Du, Y.; Burchfiel, B.; Tedrake, R.; and Song, S. 2023. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 02783649241273668.
Coleman, D.; Sucan, I.; Chitta, S.; and Correll, N. 2014. Reducing the barrier to entry of complex robotic software: a moveit! case study. arXiv preprint arXiv:1404.3785.
Deng, X.; Xiang, Y.; Mousavian, A.; Eppner, C.; Bretl, T.; and Fox, D. 2020. Self-supervised 6d object pose estimation for robot manipulation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), 3665–3671. IEEE.
Fang, H.; Grotz, M.; Pumacay, W.; Wang, Y. R.; Fox, D.; Krishna, R.; and Duan, J. 2025. SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation. arXiv preprint arXiv:2501.18564.
Fang, H.-S.; Wang, C.; Fang, H.; Gou, M.; Liu, J.; Yan, H.; Liu, W.; Xie, Y.; and Lu, C. 2023. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics, 39(5): 3929–3945.
Feng, T.; Shi, H.; Liu, X.; Feng, W.; Wan, L.; Zhou, Y.; and Lin, D. 2023. Open compound domain adaptation with object style compensation for semantic segmentation. Advances in Neural Information Processing Systems, 36: 63136–63149.
Gervet, T.; Xian, Z.; Gkanatsios, N.; and Fragkiadaki, K. 2023. Act3d: 3d feature field transformers for multi-task robotic manipulation. arXiv preprint arXiv:2306.17817.
Goyal, A.; Blukis, V.; Xu, J.; Guo, Y.; Chao, Y.-W.; and Fox, D. 2024. Rvt-2: Learning precise manipulation from few demonstrations. arXiv preprint arXiv:2406.08545.
Goyal, A.; Xu, J.; Guo, Y.; Blukis, V.; Chao, Y.-W.; and Fox, D. 2023. Rvt: Robotic view transformer for 3d object manipulation. In Conference on Robot Learning, 694–710. PMLR.
Guhur, P.-L.; Chen, S.; Pinel, R. G.; Tapaswi, M.; Laptev, I.; and Schmid, C. 2023. Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning, 175–187. PMLR.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
Huang, W.; Wang, C.; Zhang, R.; Li, Y.; Wu, J.; and Fei-Fei, L. 2023. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. arXiv preprint arXiv:2307.05973.
James, S.; Ma, Z.; Arrojo, D. R.; and Davison, A. J. 2020. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2): 3019–3026.
James, S.; Wada, K.; Laidlow, T.; and Davison, A. J. 2022. Coarse-to-fine q-attention: Efficient learning for visual robotic manipulation via discretisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13739–13748.
Jia, Y.; Liu, J.; Chen, S.; Gu, C.; Wang, Z.; Luo, L.; Lee, L.; Wang, P.; Wang, Z.; Zhang, R.; et al. 2024. Lift3d foundation policy: Lifting 2d large-scale pretrained models for robust 3d robotic manipulation. arXiv preprint arXiv:2411.18623.
Kang, B.; Yue, Y.; Lu, R.; Lin, Z.; Zhao, Y.; Wang, K.; Huang, G.; and Feng, J. 2024. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385.
Ke, T.-W.; Gkanatsios, N.; and Fragkiadaki, K. 2024. 3d diffuser actor: Policy diffusion with 3d scene representations. arXiv preprint arXiv:2402.10885.
Kim, M. J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; Nair, S.; Rafailov, R.; Foster, E.; Lam, G.; Sanketi, P.; et al. 2024. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.
Li, J.; Li, D.; Xiong, C.; and Hoi, S. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, 12888–12900. PMLR.
Liu, S.; Wu, L.; Li, B.; Tan, H.; Chen, H.; Wang, Z.; Xu, K.; Su, H.; and Zhu, J. 2024a. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864.
Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. 2024b. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, 38–55. Springer.
Nair, S.; Rajeswaran, A.; Kumar, V.; Finn, C.; and Gupta, A. 2022. R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601.
Pumacay, W.; Singh, I.; Duan, J.; Krishna, R.; Thomason, J.; and Fox, D. 2024. The colosseum: A benchmark for evaluating generalization for robotic manipulation. arXiv preprint arXiv:2402.08191.
Qian, G.; Li, Y.; Peng, H.; Mai, J.; Hammoud, H.; Elhoseiny, M.; and Ghanem, B. 2022. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in neural information processing systems, 35: 23192–23204.
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
Radosavovic, I.; Xiao, T.; James, S.; Abbeel, P.; Malik, J.; and Darrell, T. 2022. Real-World Robot Learning with Masked Visual Pre-training. CoRL.
Rohmer, E.; Singh, S. P.; and Freese, M. 2013. V-REP: A versatile and scalable robot simulation framework. In 2013 IEEE/RSJ international conference on intelligent robots and systems, 1321–1326. IEEE.
Seo, Y.; Kim, J.; James, S.; Lee, K.; Shin, J.; and Abbeel, P. 2023. Multi-view masked world models for visual robotic manipulation. In International Conference on Machine Learning, 30613–30632. PMLR.
Shi, H.; Xie, B.; Liu, Y.; Sun, L.; Liu, F.; Wang, T.; Zhou, E.; Fan, H.; Zhang, X.; and Huang, G. 2025. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236.
Shridhar, M.; Manuelli, L.; and Fox, D. 2023. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, 785–799. PMLR.
Sucan, I. A.; Moll, M.; and Kavraki, L. E. 2012. The open motion planning library. IEEE Robotics & Automation Magazine, 19(4): 72–82.
Sun, L.; Xie, B.; Liu, Y.; Shi, H.; Wang, T.; and Cao, J. 2025. Geovla: Empowering 3d representations in vision-language-action models. arXiv preprint arXiv:2508.09071.
Wang, J.; Chen, M.; Karaev, N.; Vedaldi, A.; Rupprecht, C.; and Novotny, D. 2025a. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, 5294–5306.
Wang, W.; Lei, Y.; Jin, S.; Hager, G. D.; and Zhang, L. 2024. Vihe: Virtual in-hand eye transformer for 3d robotic manipulation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 403–410. IEEE.
Wang, Y.; Yue, Y.; Yue, Y.; Wang, H.; Jiang, H.; Han, Y.; Ni, Z.; Pu, Y.; Shi, M.; Lu, R.; et al. 2025b. Emulating human-like adaptive vision for efficient and flexible machine visual perception. Nature Machine Intelligence, 1–19.
Xie, B.; Zhou, E.; Jia, F.; Shi, H.; Fan, H.; Zhang, H.; Li, H.; Sun, J.; Bin, J.; Huang, J.; et al. 2025. Dexbotic: Open-Source Vision-Language-Action Toolbox. arXiv preprint arXiv:2510.23511.
Yang, L.; Kang, B.; Huang, Z.; Xu, X.; Feng, J.; and Zhao, H. 2024. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10371–10381.
Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; and Zhao, H. 2025. Depth anything v2. Advances in Neural Information Processing Systems, 37: 21875–21911.
Yue, Y.; Wang, Y.; Kang, B.; Han, Y.; Wang, S.; Song, S.; Feng, J.; and Huang, G. 2025. DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution. Advances in Neural Information Processing Systems, 37: 56619–56643.
Ze, Y.; Zhang, G.; Zhang, K.; Hu, C.; Wang, M.; and Xu, H. 2024. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. arXiv preprint arXiv:2403.03954.
Zeng, A.; Florence, P.; Tompson, J.; Welker, S.; Chien, J.; Attarian, M.; Armstrong, T.; Krasin, I.; Duong, D.; Sindhwani, V.; et al. 2021. Transporter networks: Rearranging the visual world for robotic manipulation. In Conference on Robot Learning, 726–747. PMLR.
Zeng, J.; Bu, Q.; Wang, B.; Xia, W.; Chen, L.; Dong, H.; Song, H.; Wang, D.; Hu, D.; Luo, P.; et al. 2024. Learning manipulation by predicting interaction. arXiv preprint arXiv:2406.00439.
Zhang, J.; Bai, C.; He, H.; Xia, W.; Wang, Z.; Zhao, B.; Li, X.; and Li, X. 2024. SAM-E: leveraging visual foundation model with sequence imitation for embodied manipulation. arXiv preprint arXiv:2405.19586.
Zhang, T.; Hu, Y.; Cui, H.; Zhao, H.; and Gao, Y. 2023. A universal semantic-geometric representation for robotic manipulation. arXiv preprint arXiv:2306.10474.
Zhang, Y.; Wu, D.; Shi, H.; Liu, Y.; Wang, T.; Fan, H.; and Dong, X. 2025. Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding. arXiv preprint arXiv:2506.05199.
Zhao, T. Z.; Kumar, V.; Levine, S.; and Finn, C. 2023. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705.
Zheng, H.; Shi, H.; Chng, Y. X.; Huang, R.; Ni, Z.; Tan, T.; Peng, Q.; Weng, Y.; Shi, Z.; and Huang, G. 2024. DenseG: Alleviating Vision-Language Feature Sparsity in Multi-View 3D Visual Grounding. Autonomous Grand Challenge CVPR 2024 Workshop.
Zheng, H.; Shi, H.; Peng, Q.; Chng, Y. X.; Huang, R.; Weng, Y.; Shi, Z.; and Huang, G. 2025. Densegrounding: Improving dense language-vision semantics for ego-centric 3d visual grounding. arXiv preprint arXiv:2505.04965.
Zhong, Y.; Bai, F.; Cai, S.; Huang, X.; Chen, Z.; Zhang, X.; Wang, Y.; Guo, S.; Guan, T.; Lui, K. N.; et al. 2025. A Survey on Vision-Language-Action Models: An Action Tokenization Perspective. arXiv preprint arXiv:2507.01925.
Zhu, C.; Wang, T.; Zhang, W.; Pang, J.; and Liu, X. 2024. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. arXiv preprint arXiv:2409.18125.
Supplementary Material
A Hyperparameters
As detailed in Table 7, we train our model with a per-GPU batch
size of 24 on eight GPUs for a total batch size of 192, using the
LAMB optimizer at an initial learning rate of 4×10⁻³ and a cosine
decay schedule. A linear warmup runs for the first 2000 steps,
followed by around 40 000 training steps (50 epochs).
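The warmup-plus-cosine schedule described above can be sketched as a function of the step index. This is an illustrative form only: decaying to exactly zero, and counting the 40k steps as including the warmup, are our assumptions, and the function name `learning_rate` is hypothetical. The base rate is left as a parameter; the example call passes the 2.4e-3 listed in Table 7.

```python
import math

def learning_rate(step, base_lr, warmup_steps=2000, total_steps=40_000):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps (assumed form)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

base_lr = 2.4e-3  # value listed in Table 7
schedule = [learning_rate(s, base_lr) for s in (0, 1000, 2000, 21_000, 40_000)]
print(schedule)
```

The rate rises linearly over the first 2000 steps, peaks at `base_lr`, passes through half of `base_lr` midway through the decay phase, and reaches zero at the final step.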
B Data Scaling Analysis
We examine how the size of the training dataset affects our
model’s performance on eight challenging RLBench manipulation
tasks. As shown in Table 8, the average success rate grows
monotonically with the number of training samples, rising from
69.3% at 25 samples to 73.0% at 50 samples, 75.4% at 100 samples,
and 80.0% at 200 samples, before reaching 81.3% at 500 samples.
This monotonic improvement illustrates a clear data-scaling effect
on average performance, although individual tasks fluctuate (for
example, Stack Cups drops from 80.0% at 200 samples to 64.0% at
500). For fair comparison with prior work, however, we report all
results in the main paper using the 100-sample setting.
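The averages quoted above follow directly from the per-task rows of Table 8. The sketch below recomputes them and checks the monotonic trend; the per-task numbers are copied from the table, and the variable names are ours.

```python
# Per-task success rates (%) from Table 8, keyed by number of training samples.
# Column order: Open Drawer, Put in Cupboard, Sort Shape, Stack Blocks,
# Place Cups, Screw Bulb, Stack Cups, Insert Peg.
per_task = {
    25:  [74.0, 72.0, 62.0, 52.0, 52.0, 82.0, 74.0, 86.0],
    50:  [81.3, 74.7, 62.7, 54.7, 52.0, 92.0, 77.3, 89.3],
    100: [82.0, 72.0, 73.3, 56.0, 56.7, 88.7, 81.3, 93.3],
    200: [82.7, 74.7, 72.0, 72.0, 64.0, 94.7, 80.0, 100.0],
    500: [86.7, 78.7, 73.3, 77.3, 78.7, 92.0, 64.0, 100.0],
}
reported_avg = {25: 69.3, 50: 73.0, 100: 75.4, 200: 80.0, 500: 81.3}

means = {n: sum(v) / len(v) for n, v in per_task.items()}
# Largest gap between a recomputed mean and the reported, one-decimal average.
max_gap = max(abs(means[n] - reported_avg[n]) for n in per_task)

sizes = sorted(means)
is_monotonic = all(means[a] < means[b] for a, b in zip(sizes, sizes[1:]))
# Tasks (by column index) whose success rate is lower at 500 samples than at 200.
regressions = [i for i, (a, b) in enumerate(zip(per_task[200], per_task[500])) if b < a]

print(means, max_gap, is_monotonic, regressions)
```

The recomputed means agree with the reported averages to within rounding, and the two columns that regress between 200 and 500 samples are Screw Bulb and Stack Cups.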
C Ablation on View Setup
Table 9 presents success rates for the front-only (single-view)
and multi-view configurations on 18 RLBench manipulation tasks.
The multi-view setup, which integrates front, left shoulder, and
right shoulder cameras, increases the average success rate from
80.0% to 87.4%, demonstrating that additional viewpoints enrich
spatial context and substantially improve manipulation
performance.
D Ablation on Visual Backbone
Table 10 compares DINO and CLIP as visual backbones on 18 RLBench
manipulation tasks. Using CLIP’s language-aligned visual features
increases the average success rate from 86.5% to 87.4%, a
0.9-point gain suggesting that semantic alignment in the encoder
contributes to improved task performance.
E Qualitative Results
Figures 6 and 7 offer comparisons between RVT-2 and our method on
two real-world manipulation tasks. In the Pick Glue to Box task,
RVT-2’s grasp attempts frequently slip or miss the glue stick as a
result of noisy depth perception, whereas our method produces
stable, secure grasps that reliably lift the stick for subsequent
placement. In the Insert Ring onto Cone task, RVT-2’s unstable
pickups often let the ring slip or drift off-axis, whereas our
method reliably acquires the ring and centers it above the cone
for smooth insertion.
F Failure Cases
As shown in Figure 8 (a), simulation failures include (1)
instruction understanding errors, where the Open Drawer
task opens the wrong drawer; (2) long-horizon breakdowns,
where Place Cups stalls after only a few placements; and
(3) semantic grounding confusion, where Stack Cups picks the
incorrect cup among similar ones. In real-world trials
(Figure 8 (b)), failures arise from (1) pose precision limits,
where slight end-effector drift causes Stack Cups to miss its
target; (2) instruction misunderstanding, where Stack Blocks
grasps the wrong block; and (3) distractor susceptibility, where
background clutter diverts the policy during Pick Glue To Plate.
These findings point toward promising enhancements, such as
incorporating large language models (LLMs) for more accurate
instruction parsing, adding episodic memory or belief tracking to
support reliable long-horizon planning, and integrating
uncertainty-aware pose refinement and attention-based filtering to
improve resilience against calibration errors and visual clutter.
Hyperparameters         Value
batch size              24 × 8
learning rate           2.4e-3
optimizer               LAMB
learning rate schedule  cosine decay
warmup steps            2000
training steps          40k
training epochs         50
Table 7: Hyperparameter Settings for Training.
G Robot Setup
As shown in Figure 9, a WidowX-250 single-arm robot sits adjacent
to an Intel RealSense D435i RGB-D camera fixed on a tripod 0.8 m
from the workspace. The camera captures synchronized 1280×720
color and depth frames at 30 Hz, which are downsampled to 128×128
for input into SpatialActor. We perform intrinsic and extrinsic
calibration to align depth measurements with the robot’s base
frame, enabling accurate end-effector control in the real world.
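Downsampling from 1280×720 to 128×128 changes the aspect ratio, so if the calibrated intrinsics are reused at the lower resolution they have to be rescaled separately along each axis. The sketch below shows that rescaling under two assumptions of ours: the downsampling is a plain resize with no cropping, and pixel coordinates scale linearly with image size. The numeric intrinsics and the function name are hypothetical.

```python
def scale_intrinsics(fx, fy, cx, cy, src_size, dst_size):
    """Rescale pinhole intrinsics for a plain resize from src_size to dst_size.

    Sizes are (width, height). Assumes no cropping and linear scaling of
    pixel coordinates with image size.
    """
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    return (fx * sx, fy * sy, cx * sx, cy * sy)

# Hypothetical full-resolution intrinsics, for illustration only.
fx, fy, cx, cy = 900.0, 900.0, 640.0, 360.0
small = scale_intrinsics(fx, fy, cx, cy, src_size=(1280, 720), dst_size=(128, 128))
print(small)
```

Because the horizontal and vertical scale factors differ (0.1 versus roughly 0.178), the two focal lengths are no longer equal after the resize even when they were equal at full resolution.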
H Simulation Tasks
We follow the evaluation protocol of PerAct and bench-
mark SpatialActor on the same 18 RLBench tasks shown
in Figure 10. In total, these tasks cover 249 randomized
scene configurations varying in object color, count, place-
ment, shape, size, or category. Table 11 provides each task’s
language instruction template, the number of variations, and
the type of variation.
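As a small illustration of how the instruction templates in Table 11 are instantiated, the sketch below fills the "[ ]" slots of a template and confirms that the per-task variation counts sum to 249. The templates and counts come from Table 11; the slot values ("red", "green", "top") and the helper `fill_template` are illustrative assumptions rather than part of the benchmark code.

```python
# A few rows of Table 11: task -> (instruction template, number of variations).
templates = {
    "Close Jar": ("close the [ ] jar", 20),
    "Open Drawer": ("open the [ ] drawer", 3),
    "Push Buttons": ("push the [ ] button, then the [ ] button", 50),
}

# Variation counts of all 18 tasks, in the row order of Table 11.
variation_counts = [20, 20, 20, 2, 3, 3, 3, 50, 9, 3, 3, 20, 4, 5, 60, 20, 2, 2]

def fill_template(template, values):
    """Replace each '[ ]' slot, left to right, with the next value."""
    parts = template.split("[ ]")
    if len(parts) - 1 != len(values):
        raise ValueError("number of values must match number of slots")
    out = parts[0]
    for value, tail in zip(values, parts[1:]):
        out += value + tail
    return out

instruction = fill_template(templates["Push Buttons"][0], ["red", "green"])
drawer = fill_template(templates["Open Drawer"][0], ["top"])
total_variations = sum(variation_counts)
print(instruction, "|", drawer, "|", total_variations)
```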
I Real-world Tasks
We evaluate SpatialActor on 8 real-world manipulation tasks
illustrated in Figure 11. In total, these tasks comprise 15 ran-
domized variants. Table 12 lists each task’s language instruc-
tion template and the corresponding number of variants.
Samples  Avg.  Open Drawer  Put in Cupboard  Sort Shape  Stack Blocks  Place Cups  Screw Bulb  Stack Cups  Insert Peg
25       69.3  74.0         72.0             62.0        52.0          52.0        82.0        74.0        86.0
50       73.0  81.3         74.7             62.7        54.7          52.0        92.0        77.3        89.3
100      75.4  82.0         72.0             73.3        56.0          56.7        88.7        81.3        93.3
200      80.0  82.7         74.7             72.0        72.0          64.0        94.7        80.0        100.0
500      81.3  86.7         78.7             73.3        77.3          78.7        92.0        64.0        100.0
Table 8: Data Scaling Analysis. Success rates (%) of our model on eight challenging RLBench manipulation tasks as training
sample size increases. The average success rate increases steadily from 69.3% at 25 samples to 73.0% at 50 samples, 75.4%
at 100 samples, and 80.0% at 200 samples, reaching 81.3% at 500 samples. These results demonstrate clear scaling behavior
in average performance: seven of the eight tasks are higher at 500 samples than at 25, while Stack Cups falls to 64.0%.
Model              Avg. Success  Close Jar  Drag Stick  Insert Peg  Meat off Grill  Open Drawer  Place Cups  Place Wine  Push Buttons
Single-view        80            92         100         64          100             76           32          100         100
Multi-view (ours)  87.4          94         100         93.3        98.7            82           56.7        94.7        100

Model              Put in Cupboard  Put in Drawer  Put in Safe  Screw Bulb  Slide Block  Sort Shape  Stack Blocks  Stack Cups  Sweep to Dustpan  Turn Tap
Single-view        48               84             96           84          80           64          68            56          100               96
Multi-view (ours)  72               98.7           96.7         88.7        91.3         73.3        56            81.3        100               95.3
Table 9: Comparison of Single-view and Multi-view Setup. Single-view uses the front camera only, while multi-view combines
front, left shoulder, and right shoulder views. The multi-view model raises average success from 80.0% to 87.4% on 18
RLBench manipulation tasks, demonstrating the benefits of multiple viewpoints.
Model        Avg. Success  Close Jar  Drag Stick  Insert Peg  Meat off Grill  Open Drawer  Place Cups  Place Wine  Push Buttons
DINO         86.5          94.7       96          96          97.3            80           52          93.3        100
CLIP (ours)  87.4          94         100         93.3        98.7            82           56.7        94.7        100

Model        Put in Cupboard  Put in Drawer  Put in Safe  Screw Bulb  Slide Block  Sort Shape  Stack Blocks  Stack Cups  Sweep to Dustpan  Turn Tap
DINO         76               98.7           100          88          80           69.3        61.3          84          98.7              92
CLIP (ours)  72               98.7           96.7         88.7        91.3         73.3        56            81.3        100               95.3
Table 10: Ablation of Visual Backbone. We compare DINO and CLIP as visual backbones on 18 RLBench manipulation
tasks. The CLIP backbone raises average success from 86.5% to 87.4%, demonstrating the benefit of language-aligned visual
features.
Figure 6: Qualitative Comparison on the Pick Glue to Box Task. RVT-2 often fails to grasp the glue stick reliably, with its
gripper missing or slipping off the object due to noisy depth. In contrast, our method consistently secures the stick and holds it
firmly for downstream placement. [Rows: RVT-2, Ours; annotation: imprecise.]
Figure 7: Qualitative Comparison on the Insert Ring onto Cone Task. RVT-2’s noisy perception leads to unstable grasps that
drop or misalign the ring during pickup. Our method achieves stable grasping and precise alignment of the ring before insertion.
[Rows: RVT-2, Ours; annotation: imprecise.]
[Figure 8 panels. (a) Simulation: Instruction Understanding ("Open top drawer"), Long Horizon ("Place 3 cups to holder"),
Semantic Understanding ("Stack other cups to violet cup"). (b) Real-world: Pose Precision ("Stack yellow cup to pink cup"),
Instruction Understanding ("Stack blue block to white block"), Excessive Distractors ("Pick up the orange glue stick to plate").]
[Additional panel text: (a) Cluttered background may cause semantic-geometric interference. (b) SGM can alleviate the depth
noise caused by abnormal lighting conditions. (c) SPT can enhance per-pixel 3D position mappings. Instructions shown: "Stack
blue block to white block", "Pick up the orange glue stick to plate". Labels: w/ SGM, wo/ SGM, w/ Decouple, wo/ Decouple,
w/ SPT, wo/ SPT.]
Figure 8: Examples of Failure Cases in Simulation and Real-world. (a) Simulation failures from instruction
misunderstanding, long-horizon errors, and semantic understanding mistakes. (b) Real-world failures due to pose precision limits,
instruction understanding mistakes, and excessive visual distractors.
Figure 9: Real-World Robot Setup. WidowX-250 arm and Intel RealSense D435i camera mounted 0.8 m apart in a front-facing
configuration. [Labels: WidowX Robot, Intel Realsense D435 Camera.]
Figure 10: RLBench Manipulation Tasks. We evaluate SpatialActor on 18 simulated RLBench tasks, covering 249 variations
of object poses, goal configurations, and scene appearances. During evaluation, the robot must complete each task within 25
execution steps under randomized colors, shapes, sizes, and semantic arrangements. [Panels: Close Jar, Drag Stick, Insert Peg,
Meat off Grill, Open Drawer, Place Cups, Place Wine, Push Buttons, Put in Cupboard, Put in Drawer, Put in Safe, Screw Bulb,
Slide Block, Sort Shape, Stack Blocks, Stack Cups, Sweep to Dustpan, Turn Tap.]
Task Name         Language Instruction Template                         # Variations  Variation Type
Close Jar         “close the [ ] jar”                                   20            color
Drag Stick        “use the stick to drag the cube onto the [ ] target”  20            color
Insert Peg        “put the ring on the [ ] spoke”                       20            color
Meat off Grill    “take the [ ] off the grill”                          2             category
Open Drawer       “open the [ ] drawer”                                 3             placement
Place Cups        “place [ ] cups on the cup holder”                    3             count
Place Wine        “stack the wine bottle to the [ ] of the rack”        3             placement
Push Buttons      “push the [ ] button, then the [ ] button”            50            color
Put in Cupboard   “put the [ ] in the cupboard”                         9             category
Put in Drawer     “put the item in the [ ] drawer”                      3             placement
Put in Safe       “put the money away in the safe on the [ ] shelf”     3             placement
Screw Bulb        “screw in the [ ] light bulb”                         20            color
Slide Block       “slide the block to the [ ] target”                   4             color
Sort Shape        “put the [ ] in the shape sorter”                     5             shape
Stack Blocks      “stack [ ] blocks”                                    60            color, count
Stack Cups        “stack the other cups on top of the [ ] cup”          20            color
Sweep to Dustpan  “sweep dirt to the [ ] dustpan”                       2             size
Turn Tap          “turn [ ] tap”                                        2             placement
Table 11: RLBench Task Set. 18 manipulation tasks with corresponding language instruction templates, number of randomized
variations, and variation types (color, count, placement, shape, size, category), totaling 249 distinct scene configurations.
Figure 11: Real-world tasks. We employed 8 distinct tasks with a total of 15 variants in real-world experiments. [Panels: Place
Carrot to Box, Push Button, Slide Block, Insert Ring onto Cone, Pick Glue to Box, Stack Block, Wipe Table, Stack Cup.]
Task Name              Language Instruction Template                     # Variations
Place Carrot To Box    “place the carrot into the box”                   1
Insert Ring Onto Cone  “insert the [ ] ring onto the cone”               2
Push Button            “push the [ ] button”                             3
Slide Block            “slide the block to the [ ] target”               3
Stack Block            “stack the [ ] block on the other block”          2
Wipe Table             “wipe the table”                                  1
Pick Glue To Box       “pick up the glue stick and place it in the box”  1
Stack Cup              “stack the [ ] cup on the other cup”              2
Table 12: Real-World Task Set. 8 real-world tasks with their language instruction templates and number of variations.