LaGS

LaGS revisits 4D panoptic occupancy tracking through sparse latent Gaussians, replacing dense voxel processing with hierarchical point-based reasoning and Gaussian feature splatting. The resulting representation enables expressive long-range spatial aggregation and sets a new state of the art for camera-based 4D panoptic occupancy tracking on Occ3D nuScenes and Waymo.

Abstract

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT.

Method

Latent Gaussians are sparse feature-bearing ellipsoids that act as dynamic volume-oriented keypoints, enabling efficient point-based reasoning before being splatted back into a voxel representation.

LaGS combines dense multi-view perception with sparse latent Gaussian reasoning for 4D panoptic occupancy tracking. Starting from synchronized camera images, we first lift image features into a coarse 3D voxel representation. We then convert these voxel features into a sparse set of latent Gaussians, each carrying both a geometric location and a learned feature embedding, on which all subsequent point-based reasoning operates.

Image Encoding & 3D Lifting

Multi-view images are processed by an image backbone to extract image and depth features, which are explicitly lifted into 3D space and pooled into a voxel feature pyramid.
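The lifting step can be illustrated with a minimal sketch. It assumes a pinhole camera model with a single depth value per pixel and mean pooling into voxels. The function names, the intrinsics parameters (`fx`, `fy`, `cx`, `cy`) and the pooling rule are illustrative assumptions, not details taken from LaGS.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at the given depth into camera-frame 3D coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def pool_into_voxels(points, features, voxel_size):
    """Mean-pool per-point feature vectors into the voxels the points fall in."""
    sums, counts = {}, {}
    for point, feat in zip(points, features):
        # Floor division maps a metric coordinate to an integer voxel index.
        key = tuple(int(c // voxel_size) for c in point)
        if key not in sums:
            sums[key] = [0.0] * len(feat)
            counts[key] = 0
        sums[key] = [s + f for s, f in zip(sums[key], feat)]
        counts[key] += 1
    return {key: [s / counts[key] for s in total] for key, total in sums.items()}


# Hypothetical intrinsics and features, for illustration only.
pixels = [(320, 240, 2.0), (420, 240, 5.0)]
points = [unproject(u, v, d, fx=500.0, fy=500.0, cx=320.0, cy=240.0) for u, v, d in pixels]
voxels = pool_into_voxels(points, [[1.0, 2.0], [5.0, 6.0]], voxel_size=0.5)
```

Running the same pooling at several voxel sizes would give a simple feature pyramid. A real pipeline would also transform points from each camera frame into a shared ego frame first.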

Latent Gaussian Encoder

Voxel features are converted into a sparse set of feature-bearing latent Gaussians through magnitude-guided initialization. Each Gaussian acts as a dynamic, volume-oriented keypoint with adaptive anisotropic support, distilling the dense grid into a compact set of points that carry both geometric location and learned features. This enables efficient point-based reasoning and long-range spatial interactions in a sparse latent space. We process these Gaussians hierarchically using a fine stream for local geometric detail and a coarse stream for global spatial context, with our Serialized Multi-Stream Attention (SMSA) module enabling efficient interaction across resolutions while preserving sparse computation. The refined Gaussians are subsequently splatted back into a dense voxel representation for downstream decoding.
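As a rough sketch of the two ends of this encoder, the code below seeds Gaussians at the voxels with the largest feature norm and later splats them back into a dense grid. Several choices here are assumptions made for illustration: the L2 norm as the "magnitude", a fixed axis-aligned `sigma` instead of learned anisotropic support, and normalized weights during splatting. The hierarchical streams and the SMSA module are not modeled.

```python
import math


def init_gaussians(voxel_feats, k, sigma=(1.0, 1.0, 1.0)):
    """Seed one Gaussian at each of the k voxels with the largest feature L2 norm."""
    def norm(feat):
        return math.sqrt(sum(x * x for x in feat))

    ranked = sorted(voxel_feats.items(), key=lambda item: norm(item[1]), reverse=True)
    return [
        # Means sit at voxel centers, expressed in voxel units.
        {"mean": tuple(i + 0.5 for i in idx), "sigma": tuple(sigma), "feat": list(feat)}
        for idx, feat in ranked[:k]
    ]


def splat(gaussians, grid_shape, eps=1e-8):
    """Fill every voxel with a normalized, distance-weighted average of Gaussian features."""
    dim = len(gaussians[0]["feat"])
    grid = {}
    for i in range(grid_shape[0]):
        for j in range(grid_shape[1]):
            for l in range(grid_shape[2]):
                center = (i + 0.5, j + 0.5, l + 0.5)
                weights = []
                for g in gaussians:
                    # Squared Mahalanobis distance under a diagonal covariance.
                    d2 = sum(((c - m) / s) ** 2
                             for c, m, s in zip(center, g["mean"], g["sigma"]))
                    weights.append(math.exp(-0.5 * d2))
                total = sum(weights)
                if total < eps:
                    grid[(i, j, l)] = [0.0] * dim  # no Gaussian reaches this voxel
                else:
                    grid[(i, j, l)] = [
                        sum(w * g["feat"][d] for w, g in zip(weights, gaussians)) / total
                        for d in range(dim)
                    ]
    return grid
```

Because each Gaussian contributes to every voxel with a weight that decays smoothly with distance, a small set of points can cover a large grid. That is the sense in which splatting bridges the sparse and dense representations.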

Panoptic Mask Decoder

The decoder combines voxel features, Gaussian features, and learned queries to jointly predict semantic occupancy and instance masks. Detection queries model foreground object instances (“things”), while semantic queries capture amorphous background regions (“stuff”).
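One common way to turn queries into a panoptic prediction is mask classification: each query scores every voxel, and each voxel is assigned to its best-scoring query. The sketch below assumes dot-product mask logits and a per-voxel argmax. Both are illustrative assumptions rather than the decoder described here, and the function and label names are hypothetical.

```python
def decode_panoptic(voxel_feats, queries, labels):
    """Assign each voxel to the query with the highest dot-product mask logit.

    voxel_feats: dict mapping voxel index -> feature vector
    queries:     list of query embeddings
    labels:      list of (class_name, is_thing) pairs, one per query
    Returns a dict mapping voxel index -> (class_name, instance_id or None).
    """
    result = {}
    for idx, feat in voxel_feats.items():
        logits = [sum(q_i * f_i for q_i, f_i in zip(query, feat)) for query in queries]
        best = max(range(len(queries)), key=logits.__getitem__)
        class_name, is_thing = labels[best]
        # "Things" keep the query index as their instance id; "stuff" has none.
        result[idx] = (class_name, best if is_thing else None)
    return result


# One detection query ("thing") and one semantic query ("stuff"), toy values.
queries = [[1.0, 0.0], [0.0, 1.0]]
labels = [("car", True), ("road", False)]
voxel_feats = {(0, 0, 0): [0.9, 0.1], (1, 0, 0): [0.2, 0.8]}
panoptic = decode_panoptic(voxel_feats, queries, labels)
```

In this toy setup the first voxel is labeled as a car instance and the second as road, which mirrors the split between detection and semantic queries.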

Query Propagation & Tracking

To maintain temporal consistency across frames, decoded detection queries are propagated over time following the tracking-by-attention paradigm. A spatio-temporal refinement module further improves instance association and recovers intermittent missed detections.
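The bookkeeping behind query propagation can be sketched as follows. Queries that were decoded with sufficient confidence are carried into the next frame together with their track identity, and newly confirmed detections receive a fresh identity. The score threshold, the dictionary layout and the function name are assumptions for illustration. The spatio-temporal refinement module is not modeled, so this sketch drops a track as soon as its score falls below the threshold.

```python
def propagate_queries(decoded, next_id, score_thresh=0.5):
    """Select queries to carry into the next frame and assign track ids.

    decoded: list of dicts with keys "embedding", "score" and "track_id"
             ("track_id" is None for queries that are not yet tracked).
    Returns (carried_queries, updated_next_id).
    """
    carried = []
    for query in decoded:
        if query["score"] < score_thresh:
            continue  # rejected detection or terminated track
        track_id = query["track_id"]
        if track_id is None:
            # A newly confirmed object starts a new track.
            track_id = next_id
            next_id += 1
        carried.append({"embedding": query["embedding"], "track_id": track_id})
    return carried, next_id


# Frame 1: two fresh detection queries, only one is confident enough.
frame1 = [
    {"embedding": [0.1, 0.2], "score": 0.9, "track_id": None},
    {"embedding": [0.3, 0.4], "score": 0.2, "track_id": None},
]
tracks, next_id = propagate_queries(frame1, next_id=0)

# Frame 2: the carried query is decoded again and a new object appears.
frame2 = [
    {"embedding": [0.1, 0.25], "score": 0.8, "track_id": tracks[0]["track_id"]},
    {"embedding": [0.7, 0.7], "score": 0.7, "track_id": None},
]
tracks, next_id = propagate_queries(frame2, next_id)
```

Because identity travels with the query itself, no separate matching step between frames is needed, which is the core idea of tracking-by-attention.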

Key Ideas

Sparse latent Gaussians replace costly dense voxel processing
Latent Gaussians act as dynamic volume-oriented keypoints with adaptive receptive fields
Hierarchical point-based reasoning enables long-range spatial interaction
Gaussian feature splatting bridges sparse and dense 3D representations
Query propagation enables temporally consistent 4D scene tracking