[SpaVLE @ NeurIPS 2025] Workshop on Space in Vision, Language, and Embodied AI

NeurIPS 2025 Workshop on
SPACE in Vision, Language, and Embodied AI (SpaVLE)

December 7th 2025, San Diego Convention Center, San Diego, USA

About SpaVLE

The SpaVLE workshop aims to bridge the historically siloed efforts of NLP, CV, and robotics communities by fostering cross-disciplinary dialogue to advance research on spatial understanding and representation.

Where

Upper Level Room 29A-D, San Diego Convention Center, San Diego, USA

When

Sunday
7 December 2025

Overview

We never directly "see" space, but we perceive and reconstruct it, describe it through language, and plan our actions in accordance with its constraints.

Background

At the core of embodied cognition lies the challenge of understanding and reasoning about space: its representations, its dynamics, and how it both arises from and constrains visual perception, language communication, action, and interaction with the environment. Spatial representations are inherently multimodal, integrating vision, language, and motor control. They must capture both 2D and 3D spatial structures, from local object relationships to global scene geometry. Although we live in a dynamic 3D world, the images and videos we perceive offer only partial, static glimpses of this environment. Reconstructing an accurate, temporally coherent model of the 3D world from such limited observations remains a central challenge in computer vision. Spatial reasoning is equally critical for language understanding, especially in human-robot interaction, where interpreting grounded references to objects and locations is essential for effective communication. For embodied agents, building robust spatial representations is vital to support real-world navigation, manipulation, and dialogue. Despite advances in data-driven pretraining across 2D/3D vision, language, and action modalities, spatial understanding remains an open challenge. For example, (vision-)language models still fall short of human-level spatial reasoning, often struggling to interpret the semantics of spatial expressions, particularly when perspective, memory, or embodiment constraints are involved.

Scope and Goal

This workshop is particularly interested in how spatial representations can be learned from multimodal data, and applied to core tasks in computer vision (CV), natural language processing (NLP), and robotics. Spatial understanding has historically been pursued separately by these communities, each adopting distinct approaches and problem formulations. The goal of this joint workshop is to provide a platform for much-needed cross-disciplinary dialogue, advancing research on spatial understanding and representation by bringing together diverse perspectives. We aim to foster discussion on how spatial representations, whether symbolic, neural, verbal, or geometric, can be learned, evaluated, and deployed across modalities and tasks. A key focus is aligning these approaches with the demands of real-world applications and addressing practical challenges such as the high cost of real-robot experiments, the scarcity of multimodal 3D and embodied data, and the complexities of human-in-the-loop evaluation and non-verbal communication.

Call for Papers

The list of accepted papers is now available at OpenReview!

Topics of Interest

The scope and topics include, but are not limited to:

Foundations of Spatial Representation and Reasoning. Models and formalisms for representing space, including symbolic, geometric, neural, or hybrid approaches, and reasoning about objects, spatial relations, and dynamics in 2D and 3D environments.
Multimodal Spatial Grounding. Learning and aligning spatial concepts across language, vision (2D / 3D), and action modalities, including spatial representation learning, cross-modal fusion, and spatially informed planning.
Applications in NLP, Vision, Robotics, and Generative AI. Applying spatial understanding to tasks such as instruction following, embodied manipulation, situated dialogue, scene understanding, and spatially grounded generation (e.g., image/video synthesis, layout prediction, or simulation).
Evaluation and Benchmarking Spatial Intelligence. Developing datasets, tasks, and metrics for assessing spatial reasoning, generative fidelity, grounded communication, and skill learning in embodied or interactive contexts.
Spatial Reasoning in Pre-trained Models. Examining how large language and vision-language models handle spatial understanding, identifying current limitations, and exploring methods to integrate spatial priors or structural biases for improved generalization and robustness.

Submission Format

We welcome (1) short papers of up to 4 pages and (2) full papers of up to 9 pages, in the following categories:

Long Research Paper (up to 9 pages)
Short Research Paper (up to 4 pages)
Dataset/Benchmark Paper (up to 9 pages, authors are strongly encouraged to follow the Datasets & Benchmarks Policies)
Survey/Review Paper (up to 9 pages)
Position Paper (up to 4 pages)

The page limit excludes references and supplementary materials. Submissions must be prepared in PDF format using the NeurIPS 2025 style template through OpenReview. One additional page is allowed for camera-ready versions of accepted papers to address reviewer comments.

Submission Protocol: OpenReview
Submission Template: Zip
Archival Option: All accepted papers will be non-archival.

Important Dates

Submission Open: July 7, 2025
Submission Deadline: ~~August 23, 2025~~ September 1, 2025
Notification of Acceptance: September 22, 2025
Camera-Ready Deadline: October 25, 2025
Workshop Day: December 7, 2025 (co-located with NeurIPS 2025)

Challenge Track

Multi-Agent Embodied Intelligence (MARS) Challenge

In partnership with the ICML 2025 MAS Workshop, we are excited to host the Multi-Agent Embodied Intelligence (MARS) Challenge as part of SpaVLE 2025.

Track 1: Multi-Agent Embodied Planning. This track focuses on high-level task planning across heterogeneous embodied agents. Built upon the ManiSkill platform and RoboCasa dataset, we curate a set of task scenarios involving diverse robot embodiments and complex collaborative goals. Given a structured scene image with multiple candidate agents (humanoids, quadrupeds, manipulators), participants need to complete the following two tasks: (1) Select Agents: Choose a subset of appropriate agents from the scene based on a natural language command. (2) Assign Actions: Define a sequence of high-level actions for each selected agent to accomplish the collaborative task. This track evaluates a vision-language model's ability to reason over multi-agent allocation, role assignment, and symbolic planning, simulating real-world cooperation among diverse robots.
Track 2: Policy Execution for Multi-Agent Control. This track focuses on low-level policy execution in physically realistic simulation environments. It utilizes RoboFactory, a simulation benchmark for embodied agents based on the ManiSkill platform. Participants are required to deploy and control multiple embodied agents (e.g., robotic arms) to collaboratively complete manipulation-centric tasks like block stacking. Each task is an episode where agents interact with dynamic objects in a shared workspace under partial observability and randomized conditions. The core challenge lies in achieving robust, learned coordination across multiple agents.
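As an informal illustration of the two Track 1 sub-tasks (selecting agents, then assigning each one a sequence of high-level actions), the sketch below shows one way a plan could be represented and sanity-checked in Python. It is not the official MARS submission format: the `Plan` class, the `validate_plan` helper, the agent names, and the action strings are all hypothetical and chosen only for illustration. Please consult the MARS Challenge website for the actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Hypothetical container for a Track 1 answer (not the official format)."""

    # Maps each selected agent to its ordered list of high-level actions.
    # The keys of this dict are the "Select Agents" answer; the values are
    # the "Assign Actions" answer.
    actions: dict[str, list[str]] = field(default_factory=dict)

    def selected_agents(self) -> list[str]:
        """Return the selected agents in sorted order."""
        return sorted(self.actions)


def validate_plan(plan: Plan, candidates: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the plan is well-formed."""
    problems = []
    if not plan.actions:
        problems.append("no agents selected")
    for agent, steps in plan.actions.items():
        if agent not in candidates:
            problems.append(f"unknown agent: {agent}")
        if not steps:
            problems.append(f"no actions assigned to: {agent}")
    return problems


# Illustrative scene: three candidate agents, two of which are selected.
candidates = {"humanoid", "quadruped", "manipulator"}
plan = Plan(actions={
    "manipulator": ["pick up the cup", "hand the cup over"],
    "humanoid": ["receive the cup", "place the cup on the shelf"],
})
print(validate_plan(plan, candidates))
```

Keeping the selected agents and their action sequences in one mapping means every selected agent must carry a plan, so a selection without assigned actions shows up as a validation problem rather than passing silently.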

Submission Format

SpaVLE will host the technical reports on challenge solutions as regular long papers (up to 9 pages).

The page limit excludes references and supplementary materials. Submissions must be prepared in PDF format using the NeurIPS 2025 style template through OpenReview.

Mandatory Discussion: Authors are required to provide a dedicated paragraph explaining the methods used for spatial reasoning and the mechanisms through which spatial information is communicated among agents.

Important Dates

Please refer to the MARS Challenge website for the most up-to-date information.

Keynote Speakers (Alphabetical Order)

Amir Zadeh

Lambda Labs

Barbara Landau

Johns Hopkins University

Dieter Fox

University of Washington / Allen Institute for AI

Joyce Chai

University of Michigan

Ranjay Krishna

University of Washington / Allen Institute for AI

Saining Xie

New York University

Confirmed Panelists (Alphabetical Order)

Aishwarya Agrawal

University of Montreal / Mila / Google DeepMind

Anthony G. Cohn

University of Leeds

Daniel Havir

Alquist Robotics

Organizing Committee

Martin Ziqiao Ma

University of Michigan

Freda Shi

University of Waterloo / Vector Institute

Jiayuan Mao

University of Pennsylvania / Amazon FAR

Jiafei Duan

University of Washington

Manling Li

Northwestern University

David Hsu

National University of Singapore

Parisa Kordjamshidi

Michigan State University

Event Schedule

08:45 -- 09:00

Welcome

Opening Remarks.

09:00 -- 09:10

Oral Presentation

SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards. Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark

09:10 -- 09:20

Oral Presentation

NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language. Danial Kamali, Parisa Kordjamshidi

09:20 -- 10:00

Keynote Saining Xie

Title: Towards Spatial Supersensing in Video

10:00 -- 10:40

Keynote Ranjay Krishna

Title: Visual Reasoning for Robotics

10:40 -- 11:40

Poster Presentations

Poster session 1 (main) and coffee break.

11:40 -- 11:50

Oral Presentation (Best Paper)

ROSE: Reconstructing Objects, Scenes, and Trajectories from Casual Videos for Robotic Manipulation. Peihao Li, Haoran Geng, Jameson Crate, Yanbing Han, Junyi Zhang, Feishi Wang, Charlie Tianyue Cheng, Runpei Dong, Yen-Jen Wang, Haozhe Lou, Trevor Darrell, Pieter Abbeel, Jitendra Malik

11:50 -- 12:00

Oral Presentation (Best Paper Runner-Up)

SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs. Yuyou Zhang, Radu Corcodel, Chiori Hori, Anoop Cherian, Ding Zhao

12:00 -- 12:40

Panel Discussion

Panel discussions with lunch break.

12:40 -- 13:00

Main and Challenge Track Summary

Best papers and challenge award announcement.

13:00 -- 13:20

Industry Lightning Talk (EdenSign)

13:20 -- 13:40

Industry Lightning Talk (Alquist Robotics)

13:40 -- 14:20

Keynote Barbara Landau

Title: Spatial Language and Vision: Some Complexities

14:20 -- 15:00

Keynote Joyce Chai

Title: Language use in 3D Space

15:00 -- 15:40

Keynote Dieter Fox

Title: Toward Mastering Domains via Automated Data Scaling

15:40 -- 16:20

Keynote Amir Zadeh

Title: Latent Particle World Models

16:20 -- 16:30

Oral Presentation

Maestro: Orchestrating Robotics Modules with Vision-Language Models for Zero-Shot Generalist Robots. Junyao Shi, Rujia Yang, Kaitian Chao, Bingqing Selina Wan, Yifei Simon Shao, Jiahui Lei, Jianing Qian, Long Le, Pratik Chaudhari, Kostas Daniilidis, Chuan Wen, Dinesh Jayaraman

16:30 -- 16:40

Oral Presentation

Every Camera Effect, Every Time, All at Once: 4D Gaussian Ray Tracing for Physics-based Camera Effect Data Generation. Yi-Ruei Liu, You-Zhe Xie, Yu-Hsiang Hsu, I-Sheng Fang, Yu-Lun Liu, Jun-Cheng Chen

16:40 -- 17:00

Poster Presentations

Poster session 2 and wrap up.

Workshop Venue

San Diego Convention Center

Room: Upper Level Room 29A-D

111 Harbor Dr, San Diego, CA 92101

Frequently Asked Questions

Can I submit a paper that has been accepted at NeurIPS 2025 or previously published at other conferences/journals?

No. Per NeurIPS 2025 workshop guidelines, workshops are not a venue for work that has been previously published in other conferences on machine learning or related fields. Work that is presented at the main NeurIPS conference should not appear in a workshop, including as part of an invited talk.

Can I submit a manuscript that is not accepted yet but is on arXiv and/or is currently under review?

Yes, the SpaVLE workshop is non-archival. Note that vision conferences such as CVPR and ICCV consider peer-reviewed workshop papers as publications if their length exceeds 4 pages (excluding references), even if they do not appear in a proceedings. If you are considering submitting your paper later to vision conferences, please consider submitting an abridged version as a short paper (up to 4 pages).

Can I volunteer as a reviewer for SpaVLE?

Yes! Please fill in this form and we will be in touch!

Contact

Address

111 Harbor Dr, San Diego, CA 92101

Email Us

spavle.committee@gmail.com
