LaGS revisits 4D panoptic occupancy tracking through sparse latent Gaussians, replacing dense voxel processing with hierarchical point-based reasoning and Gaussian feature splatting. The resulting representation enables expressive long-range spatial aggregation and sets a new state of the art for camera-based 4D panoptic occupancy tracking on Occ3D nuScenes and Waymo.

Abstract

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT.

Method

Overview of our approach

LaGS combines dense multi-view perception with sparse latent Gaussian reasoning for 4D panoptic occupancy tracking. Starting from synchronized camera images, we first lift image features into a coarse 3D voxel representation and then convert this representation into a sparse set of latent Gaussian queries, each carrying a geometric location and a learned feature embedding.

Image Encoding & 3D Lifting

Multi-view images are processed by an image backbone to extract image and depth features, which are explicitly lifted into 3D space and pooled into a voxel feature pyramid.
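As an illustration of the lifting step, the following minimal sketch assumes a depth-distribution formulation in the spirit of Lift-Splat-Shoot: each pixel feature is spread along its camera ray according to a per-pixel depth distribution and then sum-pooled into voxels. The function name `lift_and_pool`, the pinhole camera model, the single-camera setup, and sum pooling are assumptions made for this example and are not taken from the paper; the feature pyramid is not shown.

```python
import numpy as np

def lift_and_pool(feats, depth_probs, depth_bins, K_inv,
                  grid_min, voxel_size, grid_shape):
    """Lift per-pixel features along camera rays and sum-pool them into voxels.

    feats:       (H, W, C) image features
    depth_probs: (H, W, D) per-pixel distribution over depth bins
    depth_bins:  (D,) metric depth of each bin
    K_inv:       (3, 3) inverse pinhole intrinsics (assumed camera model)
    Returns voxel features of shape (*grid_shape, C) in the camera frame.
    """
    H, W, C = feats.shape
    # Pixel centers in homogeneous image coordinates.
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u, dtype=float)], axis=-1)
    rays = pix @ K_inv.T                                          # (H, W, 3)
    # One 3D point per pixel and depth bin.
    pts = rays[:, :, None, :] * depth_bins[None, None, :, None]   # (H, W, D, 3)
    # Outer product of the depth distribution and the feature vector.
    weighted = depth_probs[..., None] * feats[:, :, None, :]      # (H, W, D, C)
    idx = np.floor((pts - np.asarray(grid_min)) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=-1)
    flat = np.ravel_multi_index(tuple(idx[inside].T), grid_shape)
    voxels = np.zeros((int(np.prod(grid_shape)), C))
    np.add.at(voxels, flat, weighted[inside])   # unbuffered scatter-add
    return voxels.reshape(*grid_shape, C)
```

In a multi-camera setting, the points would additionally be transformed into a shared ego frame before pooling, so that all views accumulate into the same grid.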

Latent Gaussian Encoder

Voxel features are converted into sparse latent Gaussian queries through magnitude-guided initialization. We process these hierarchically using a fine stream for local geometric detail and a coarse stream for global spatial context. Our Serialized Multi-Stream Attention (SMSA) module enables efficient interaction across resolutions while preserving sparse computation. The refined Gaussian features are subsequently splatted back into a dense voxel representation for downstream decoding.
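The two ends of this encoder, initialization and splatting, can be sketched as follows. This is a simplified illustration under stated assumptions, not the released implementation: "magnitude-guided" is read here as selecting the voxels with the largest feature norm, the Gaussians are isotropic with a shared scale `sigma`, and splatting is evaluated densely over all voxels without truncation. The function names `init_gaussians` and `splat` are hypothetical, and the hierarchical streams and SMSA are not shown.

```python
import numpy as np

def init_gaussians(voxel_feats, k, voxel_size=1.0):
    """Seed k Gaussians at the voxels with the largest feature norm.

    voxel_feats: (X, Y, Z, C). Returns centers (k, 3) and features (k, C).
    """
    X, Y, Z, C = voxel_feats.shape
    flat = voxel_feats.reshape(-1, C)
    magnitude = np.linalg.norm(flat, axis=1)
    top = np.argsort(-magnitude)[:k]
    # Convert flat indices back to voxel-center coordinates.
    ijk = np.stack(np.unravel_index(top, (X, Y, Z)), axis=1)
    centers = (ijk + 0.5) * voxel_size
    return centers, flat[top]

def splat(centers, feats, sigma, grid_shape, voxel_size=1.0):
    """Splat Gaussian features into a dense grid with distance-based weights.

    Each voxel receives the sum of all Gaussian features, weighted by an
    isotropic Gaussian kernel of the distance to the voxel center.
    """
    axes = [(np.arange(n) + 0.5) * voxel_size for n in grid_shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    d2 = ((grid[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (V, N)
    weights = np.exp(-0.5 * d2 / sigma**2)
    return (weights @ feats).reshape(*grid_shape, feats.shape[1])
```

Because the weights decay smoothly with distance, a Gaussian contributes to every voxel in its neighborhood rather than to a single cell, which is what allows a sparse set of points to fill a dense grid.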

Panoptic Mask Decoder

The decoder combines voxel features, Gaussian features, and learned queries to jointly predict semantic occupancy and instance masks. Detection queries model foreground object instances (“things”), while semantic queries capture amorphous background regions (“stuff”).
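One common way to turn queries into panoptic predictions, used by mask-classification decoders such as Mask2Former, is to take the dot product between each query embedding and every voxel feature as a mask logit and to assign each voxel to the highest-scoring query. The sketch below assumes this scheme; the function name `decode_panoptic` and the scoring rule (class confidence times mask probability) are assumptions for illustration and may differ from the LaGS decoder.

```python
import numpy as np

def decode_panoptic(voxel_feats, query_embed, query_class_logits):
    """Assign every voxel to one query and read off its semantic class.

    voxel_feats:        (V, C) flattened voxel features
    query_embed:        (Q, C) decoded detection and semantic queries
    query_class_logits: (Q, K) per-query class logits
    Returns (semantic, owner): class id and winning query index per voxel.
    """
    mask_logits = query_embed @ voxel_feats.T              # (Q, V)
    mask_prob = 1.0 / (1.0 + np.exp(-mask_logits))         # sigmoid
    # Numerically stable softmax over classes.
    z = query_class_logits - query_class_logits.max(axis=1, keepdims=True)
    cls_prob = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    cls_id = cls_prob.argmax(axis=1)                       # (Q,)
    cls_score = cls_prob.max(axis=1)                       # (Q,)
    # Each voxel goes to the query with the highest combined score.
    owner = (cls_score[:, None] * mask_prob).argmax(axis=0)  # (V,)
    return cls_id[owner], owner
```

Under this scheme, the winning query index doubles as an instance id for "things" queries, while all voxels won by a "stuff" query share that query's semantic class.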

Query Propagation & Tracking

To maintain temporal consistency across frames, decoded detection queries are propagated over time following the tracking-by-attention paradigm. A spatio-temporal refinement module further improves instance association and recovers intermittent missed detections.
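In tracking-by-attention methods, identity is carried by the query itself: a query that produced a confident detection in one frame is reused as input for the next frame and keeps its track id. The following minimal sketch shows this bookkeeping only. The function name `propagate`, the score threshold, and the use of `-1` to mark newly born detections are assumptions for this example; the spatio-temporal refinement module is not shown.

```python
import numpy as np

def propagate(embeds, scores, track_ids, next_id, keep_thresh=0.5):
    """Select decoded queries to carry into the next frame.

    embeds:    (Q, C) decoded query embeddings of the current frame
    scores:    (Q,) detection confidence per query
    track_ids: (Q,) existing track id, or -1 for a fresh detection query
    next_id:   first unused track id
    Returns (kept_embeds, kept_ids, next_id).
    """
    keep = scores >= keep_thresh
    ids = track_ids[keep].copy()
    # Confident fresh detections start new tracks.
    for i in np.flatnonzero(ids < 0):
        ids[i] = next_id
        next_id += 1
    return embeds[keep], ids, next_id
```

At the next time step, the kept embeddings would be concatenated with the fixed set of learned detection queries before decoding, so that existing objects are re-detected by their own query while new objects are picked up by the fresh ones.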

Key Ideas

  • Sparse latent Gaussians replace costly dense voxel processing
  • Hierarchical point-based reasoning enables long-range spatial interaction
  • Gaussian feature splatting bridges sparse and dense 3D representations
  • Query propagation enables temporally consistent 4D scene tracking

Overview and Qualitative Results

Code & Models

Code and models will be released here shortly.

Publication

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

LaGS: Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

IEEE Robotics and Automation Letters (RA-L), 2026

If you find our work useful, please consider citing our paper.

Authors

Maximilian Luz

University of Freiburg

Rohit Mohan

University of Freiburg

Thomas Nürnberg

Bosch Research

Yakov Miron

Bosch Research, University of Haifa

Daniele Cattaneo

University of Freiburg

Abhinav Valada

University of Freiburg

Acknowledgment

This work was funded by the Bosch Research collaboration on AI-driven automated driving, by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant number 539134284, and through EFRE (FEIH_2698644) and the state of Baden-Württemberg.