SAGE-3D

Towards Physically Executable 3D Gaussian for Embodied Navigation

¹Zhejiang University, ²Manycore Tech Inc., ³Huazhong University of Science and Technology
Corresponding Author

Introduction

Vision-and-Language Navigation (VLN) relies heavily on the sim-to-real paradigm, and 3D Gaussian Splatting (3DGS) stands out for its photorealistic real-time rendering ability, which is crucial for narrowing the sim-to-real gap. However, existing 3DGS lacks fine-grained object semantics and physical executability, making it unsuitable for practical VLN tasks.

We propose SAGE-3D, a novel paradigm that upgrades 3DGS into an executable environment foundation aligned with semantics and physics. It consists of two core components: Object-Level Semantic Grounding, which enriches 3DGS with dense object-level annotations, and Physics-Aware Execution Jointing, which embeds collision bodies and builds rich physical interaction interfaces.

We also release two key resources to advance related research. InteriorGS is a dataset of 1,000 annotated indoor 3DGS scenes, covering mostly furnished indoor environments plus venues such as concert halls and amusement parks, and totaling over 554k object instances across 755 categories. SAGE-Bench is the first 3DGS-based VLN benchmark, featuring 2 million trajectory-instruction pairs, a hierarchical instruction pipeline, three novel navigation continuity metrics, and 554k detailed collision bodies. Experiments verify that SAGE-3D significantly enhances model generalizability, providing a solid foundation for embodied navigation research.

SAGE-3D Data

Overview

Vision-and-Language Navigation (VLN) relies on environment foundations that bridge simulation and real-world execution, and 3D Gaussian Splatting (3DGS) has emerged as a promising candidate thanks to its photorealistic real-time rendering. However, traditional 3DGS falls short as a base for embodied learning: it lacks fine-grained semantic annotations (e.g., object-level labels) and physical executability (as evidenced by issues such as agents penetrating scene geometry), so it cannot support practical embodied agent interaction and navigation.

To address this gap, we build two core resources that upgrade 3DGS into a semantically and physically aligned embodied environment:

  • InteriorGS: A dataset of 1,000 3DGS scenes (covering furnished indoor spaces, concert halls, amusement parks, etc.) with dense object-level annotations, totaling over 554k object instances across 755 categories.
  • SAGE-Bench: The first VLN benchmark built entirely on 3DGS, featuring 2 million trajectory–instruction pairs (generated via a hierarchical pipeline), three novel navigation-continuity evaluation metrics, 554k detailed collision bodies, semantic maps, and diverse robot APIs.

Figure: Traditional 3DGS vs. our work.

Pipeline

Figure: Pipeline of SAGE-3D.

To upgrade 3DGS into an executable embodied environment, SAGE-3D relies on two key pipelines that address traditional 3DGS’s semantic and physical limitations:

  • Object-Level Semantic Grounding: We start from artist-created mesh scenes (residential interiors and public spaces such as concert halls) and render roughly 3,000 camera views per scene, which are used to convert each mesh to 3DGS via the GSplat pipeline. We then add double-verified object-level annotations (755 categories, instance IDs, and 3D bounding boxes) to form InteriorGS (1k scenes, 554k+ objects). Finally, we generate 2D semantic top-down maps by projecting the 3D objects and refining the projections into irregular masks via convex hulls; these maps support VLN instruction creation.
  • Physics-Aware Execution Jointing: We build a 3DGS–Mesh Hybrid Representation. CoACD decomposes the original artist meshes into collision bodies (rigid shapes for physics), which are paired with 3DGS (for photorealistic rendering) in a USDA scene. We also expose robot APIs for discrete and continuous control, provide synchronized multi-modal observations, and cache collision bodies for fast, stable evaluation.
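
The top-down map step described above (project an object's 3D geometry, then take a convex hull to get an irregular footprint) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each object is available as a list of 3D points, that z is the vertical axis, and it uses Andrew's monotone chain as the hull algorithm. The names `convex_hull`, `top_down_footprint`, and the `sofa` sample are hypothetical.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def cross(o: Point2D, a: Point2D, b: Point2D) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point2D]) -> List[Point2D]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point2D] = []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point2D] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def top_down_footprint(points_3d: List[Point3D]) -> List[Point2D]:
    """Drop the vertical axis (assumed to be z) and take the 2D convex hull."""
    return convex_hull([(x, y) for x, y, _ in points_3d])


# Hypothetical object: an L-shaped sofa sampled at a few 3D points.
sofa = [
    (0, 0, 0.0), (2, 0, 0.4), (2, 1, 0.4), (1, 1, 0.8),
    (1, 2, 0.4), (0, 2, 0.0), (0.5, 0.5, 0.9),
]
footprint = top_down_footprint(sofa)
print(footprint)
```

For the L-shaped sample, the inner corner at (1, 1) falls inside the hull, so the footprint is a five-sided polygon rather than the object's axis-aligned bounding rectangle.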
SAGE-Bench

Overview

SAGE-Bench is the first VLN benchmark built on 3DGS, featuring 2 million instruction–trajectory pairs and 554k detailed collision bodies. Its core highlights are a hierarchical instruction scheme and three novel metrics for natural navigation continuity, which support the evaluation of VLN models in complex scenarios.

  • Hierarchical Instruction: A two-level scheme tailored for realistic navigation evaluation.
    • High-level instructions focus on task semantics and human intent and cover five categories:
      • Add Object: introduces causal objects for logical navigation.
      • Scenario Driven: embeds situational motives such as daily needs.
      • Relative Relationship: distinguishes targets via spatial relations such as "next to".
      • Attribute-based: identifies targets through perceivable attributes such as color or state.
      • Area-based: directs the agent to functional areas instead of specific objects.
    • Low-level instructions prioritize control and kinematic evaluation. They include primitive actions (e.g., forward moves, in-place rotation) and single-goal point-to-point navigation, and serve as the basic execution foundation for high-level semantic tasks.
  • Navigation Natural Continuity Metrics: Three dedicated metrics that address the limitations of traditional evaluation methods in assessing continuous motion:
    • Continuous Success Ratio (CSR): Going beyond the endpoint-only 0/1 Success Rate (SR), it calculates the proportion of time the agent stays within the permissible corridor around the reference path, reflecting goal-consistent behavior throughout navigation.
    • Integrated Collision Penalty (ICP): Unlike the simple Collision Rate (CR), it integrates collision intensity over time and reports the time-averaged value, capturing both collision frequency and duration.
    • Path Smoothness (PS): Evaluated via the variance of consecutive heading changes, with higher PS scores indicating smoother paths, i.e., fewer abrupt turns and better feasibility for real-robot navigation.
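
The three metrics above can be sketched as follows. The page does not give exact formulas, so everything here is an assumption made for illustration: trajectories are sampled at uniform time steps, the corridor is a fixed half-width around a polyline reference path, collision intensity is a per-step value in [0, 1], and the variance of heading changes is mapped to a score via 1 / (1 + variance) so that higher means smoother. All function names and sample values are hypothetical.

```python
import math
from statistics import pvariance
from typing import Sequence, Tuple

Point = Tuple[float, float]


def dist_to_segment(p: Point, a: Point, b: Point) -> float:
    """Euclidean distance from point p to the segment a-b."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    apx, apy = p[0] - a[0], p[1] - a[1]
    denom = abx * abx + aby * aby
    # Clamp the projection parameter so the closest point stays on the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, (apx * abx + apy * aby) / denom))
    return math.hypot(apx - t * abx, apy - t * aby)


def continuous_success_ratio(
    traj: Sequence[Point], ref: Sequence[Point], half_width: float
) -> float:
    """Fraction of uniformly sampled steps spent inside the corridor around ref."""
    inside = 0
    for p in traj:
        d = min(dist_to_segment(p, a, b) for a, b in zip(ref, ref[1:]))
        inside += d <= half_width
    return inside / len(traj)


def integrated_collision_penalty(intensity: Sequence[float]) -> float:
    """Time-averaged collision intensity; each sample is assumed to lie in [0, 1]."""
    return sum(intensity) / len(intensity)


def path_smoothness(headings: Sequence[float]) -> float:
    """Map the variance of consecutive heading changes (radians) into (0, 1]."""
    deltas = [
        math.atan2(math.sin(b - a), math.cos(b - a))  # wrap difference to [-pi, pi]
        for a, b in zip(headings, headings[1:])
    ]
    return 1.0 / (1.0 + pvariance(deltas))


# Hypothetical episode: straight reference path, one excursion outside the corridor.
ref_path = [(0.0, 0.0), (4.0, 0.0)]
agent_traj = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.8), (3.0, 0.2), (4.0, 0.0)]
csr = continuous_success_ratio(agent_traj, ref_path, half_width=0.5)
icp = integrated_collision_penalty([0.0, 0.0, 0.5, 0.5, 0.0])
ps_straight = path_smoothness([0.0, 0.0, 0.0, 0.0])
ps_zigzag = path_smoothness([0.0, 0.5, -0.5, 0.5])
print(csr, icp, ps_straight, ps_zigzag)
```

In this sample the agent leaves the corridor for one of five steps, so the CSR sketch credits the rest of the episode instead of collapsing it to a single pass/fail outcome, and the zigzag heading sequence scores lower on smoothness than the straight one.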

Figure: Overview of SAGE-Bench.

Benchmark Results

Figure: Results of SAGE-Bench.

BibTeX

    @misc{miao2025physicallyexecutable3dgaussian,
          title={Towards Physically Executable 3D Gaussian for Embodied Navigation},
          author={Bingchen Miao and Rong Wei and Zhiqi Ge and Xiaoquan Sun and Shiqi Gao and Jingzhe Zhu and Renhan Wang and Siliang Tang and Jun Xiao and Rui Tang and Juncheng Li},
          year={2025},
          eprint={2510.21307},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2510.21307},
    }