SAM4D introduces a 4D foundation model for promptable segmentation across camera and LiDAR streams, addressing the limitations of frame-centric and modality-isolated approaches in autonomous driving.
Key Highlights:
- Promptable Multi-modal Segmentation (PMS) – Enables interactive segmentation across sequences from both modalities using diverse prompts (points, boxes, masks), allowing cross-modal propagation and long-term object tracking.
- Unified Multi-modal Positional Encoding (UMPE) – Aligns image and LiDAR features in a shared 3D space using sinusoidal and MLP-based encoding for seamless cross-modal interaction while preserving modality-specific structure (a minimal PyTorch sketch follows this list).
- Motion-aware Cross-modal Memory Attention (MCMA) – Incorporates ego-motion compensation into memory attention, enabling temporally consistent retrieval and robust segmentation in dynamic scenes (see the attention sketch below the list).
- Multi-modal Architecture – Builds on SAM2, pairing a Hiera image encoder with a sparse-voxel MinkUNet LiDAR encoder (implemented with TorchSparse) for efficient joint 2D-3D segmentation.
- Efficient Prompt Handling – Supports point, box, and mask prompts from either modality, using a unified decoder to produce temporally consistent masks across the stream.
- Waymo-4DSeg Dataset – A large-scale pseudo-labeled dataset containing 15M image masks, 30M LiDAR masks, and 300k cross-modal masklets, generated via VFM segmentation, 4D LiDAR reconstruction, and ray casting.
- Cross-Modal Label Fusion Pipeline – Builds dense pixel-to-voxel mappings, filters noisy masklets using DBSCAN clustering, and merges multi-view data into high-quality voxel masklets (a filtering sketch is shown after the list).
- Cross-Dataset Generalization – Demonstrates strong zero-shot and fine-tuned performance on nuScenes, validating robust transferability across sensor configurations and environments.
- Quantitative Performance – Achieves 69.8% mIoU on images and 55.7% mIoU on LiDAR, with 80.1% J&F, significantly outperforming single-modality and projection-based baselines.
- Scalable & Efficient Design – 119.88M parameter model optimized with memory banks, FIFO queues, and prompt imitation logic for high-throughput 4D segmentation.
- Future-Proof Foundation – Roadmap includes natural language prompting via LLMs, multi-sensor scaling, weak/self-supervised learning, and improved memory and compute efficiency.
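To make the UMPE highlight concrete, here is a minimal PyTorch sketch rather than the released implementation: the frequency count, module names, and the choice of separate per-modality MLP heads on top of a shared sinusoidal encoding are assumptions.

```python
import torch
import torch.nn as nn

def sinusoidal_3d_encoding(xyz, num_freqs=10):
    """Encode 3D coordinates (N, 3) with sin/cos at multiple frequencies -> (N, 6 * num_freqs)."""
    freqs = (2.0 ** torch.arange(num_freqs, device=xyz.device)) * torch.pi
    angles = xyz.unsqueeze(-1) * freqs                                   # (N, 3, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)    # (N, 6 * num_freqs)

class UnifiedPositionalEncoding(nn.Module):
    """Hypothetical UMPE sketch: image pixels (lifted to 3D via camera calibration) and
    LiDAR voxel centers are expressed in the same ego frame, share one sinusoidal
    encoding, and keep modality-specific MLP heads."""
    def __init__(self, num_freqs=10, dim=256):
        super().__init__()
        self.num_freqs = num_freqs
        self.image_head = nn.Sequential(nn.Linear(6 * num_freqs, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.lidar_head = nn.Sequential(nn.Linear(6 * num_freqs, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, xyz, modality):
        enc = sinusoidal_3d_encoding(xyz, self.num_freqs)
        return self.image_head(enc) if modality == "image" else self.lidar_head(enc)
```

The resulting per-point embeddings would be added to the corresponding image or LiDAR encoder features, so both modalities attend over positions expressed in one coordinate system.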
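The MCMA bullet can be illustrated the same way. In this hedged sketch (class and argument names are illustrative, not the paper's API), each stored memory keeps the 3D positions at which its features were computed, and a relative ego pose warps those positions into the current frame before cross-attention, which is the ego-motion compensation the highlight refers to.

```python
import torch
import torch.nn as nn

class MotionAwareMemoryAttention(nn.Module):
    """Hypothetical MCMA sketch: memory positions are transformed into the current ego
    frame so positional cues stay geometrically aligned across time."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, cur_feats, cur_xyz, mem_feats, mem_xyz, rel_pose):
        # rel_pose: (4, 4) transform from the memory frame to the current ego frame.
        mem_xyz_h = torch.cat([mem_xyz, torch.ones_like(mem_xyz[..., :1])], dim=-1)
        mem_xyz_cur = (mem_xyz_h @ rel_pose.T)[..., :3]      # ego-motion compensation
        q = cur_feats + self.pos_mlp(cur_xyz)                # current-frame queries
        kv = mem_feats + self.pos_mlp(mem_xyz_cur)           # compensated memory keys/values
        out, _ = self.attn(q, kv, kv)
        return out
```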
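For the label-fusion pipeline, the highlight mentions DBSCAN-based filtering of noisy masklets. One plausible reading of that step (keeping only the dominant spatial cluster of voxels gathered for an object is an assumption here) looks like this with scikit-learn:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def filter_voxel_masklet(voxel_centers, eps=0.5, min_samples=10):
    """Cluster the voxel centers (N, 3) collected for one object from multi-view
    pixel-to-voxel mappings and keep only the largest cluster, dropping stray voxels
    caused by noisy projections. Returns a boolean keep-mask of length N."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(voxel_centers)
    valid = labels[labels >= 0]
    if valid.size == 0:
        return np.zeros(len(voxel_centers), dtype=bool)   # everything flagged as noise
    keep_label = np.bincount(valid).argmax()              # largest spatial cluster
    return labels == keep_label
```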
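Finally, the memory banks and FIFO queues from the efficiency bullet can be sketched as a small container that evicts the oldest unprompted memories while pinning prompted frames; the capacity and the pinning rule are assumptions rather than the released design.

```python
from collections import deque

class MemoryBank:
    """Hypothetical FIFO memory bank: recent frame memories live in a bounded queue,
    while memories from prompted frames are pinned so they are never evicted."""
    def __init__(self, max_recent=7):
        self.recent = deque(maxlen=max_recent)   # oldest entry is evicted automatically
        self.prompted = {}                       # frame_idx -> memory features

    def add(self, frame_idx, features, is_prompted=False):
        if is_prompted:
            self.prompted[frame_idx] = features
        else:
            self.recent.append((frame_idx, features))

    def gather(self):
        # Memories the cross-modal memory attention would attend to for the current frame.
        return list(self.prompted.items()) + list(self.recent)
```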
Paper Resources
- Project: SAM4D-Project.github.io
- GitHub Repo: https://github.com/CN-ADLab/SAM4D
- LearnOpenCV blog post: https://learnopencv.com/sam-2/