Tracking dense 3D motion from monocular videos remains challenging, particularly when aiming for pixel-level precision over long sequences. Existing methods struggle to deliver detailed 3D tracking because they often track only a sparse set of points, which provides too little detail for full-scene understanding. They are also computationally expensive, which makes long videos difficult to handle efficiently. Additionally, many of them fail to maintain accuracy over extended sequences, as camera movement and object occlusion cause the model to lose track or introduce errors.
Current approaches to estimating motion in video sequences each have distinct strengths and limitations. Optical flow techniques provide dense pixel-wise tracking but struggle with robustness in complex scenes, especially when extended to long sequences. Scene flow generalizes optical flow to estimate dense 3D motion, using either RGB-D data or point clouds, but it remains difficult to apply efficiently over long sequences. Point tracking captures motion trajectories by following specific points, with recent advances incorporating spatial and temporal attention for smoother tracking. However, point-tracking methods still struggle to achieve dense tracking because of the high computational cost. Tracking-by-reconstruction methods use a deformation field to estimate motion, which makes them less practical for real-time applications.
A team of researchers from UMass Amherst, MIT-IBM Watson AI Lab, and Snap Inc. has proposed DELTA (Dense Efficient Long-range 3D Tracking for Any video), the first method designed to efficiently track every pixel in 3D space across long video sequences. DELTA first tracks at reduced resolution using spatio-temporal attention, then applies an attention-based upsampler to recover high-resolution accuracy. Key innovations include an upsampler that preserves sharp motion boundaries, an efficient spatial attention architecture for dense tracking, and a log-depth representation that improves tracking performance. DELTA achieves state-of-the-art results on the CVO and Kubric3D datasets, showing over 10% improvement in metrics like Average Jaccard (AJ) and Average Position Difference in 3D (APD3D), and performs competitively on 3D point tracking benchmarks such as TAP-Vid3D and LSFOdyssey. Unlike existing methods, DELTA delivers dense 3D tracking at scale, running over 8x faster than previous methods while achieving state-of-the-art accuracy.
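To make the coarse-to-fine idea concrete, the sketch below shows one common way an attention-style upsampler can work: each high-resolution pixel takes a softmax-weighted blend of the motions of nearby low-resolution pixels. This is an illustration under assumptions, not DELTA's actual module. The function names are hypothetical, and the attention scores are passed in by hand, whereas a learned model would predict them from image features.

```python
import math

def softmax(logits):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def upsample_motion(neighbor_motions, attention_logits):
    """Blend coarse neighbour motions into one fine-resolution motion.

    neighbor_motions: list of (dx, dy) motions from nearby coarse pixels.
    attention_logits: one score per neighbour (hypothetical inputs here;
    a trained network would predict them).
    """
    weights = softmax(attention_logits)
    dx = sum(w * m[0] for w, m in zip(weights, neighbor_motions))
    dy = sum(w * m[1] for w, m in zip(weights, neighbor_motions))
    return (dx, dy)

# A fine pixel next to a motion boundary: two coarse neighbours lie on a
# moving object, one lies on the static background.
neighbors = [(10.0, 4.0), (10.0, 4.0), (0.0, 0.0)]

# Confident scores keep the boundary sharp by ignoring the background.
sharp = upsample_motion(neighbors, [5.0, 5.0, -5.0])

# Uniform scores average everything and blur the boundary.
blurry = upsample_motion(neighbors, [0.0, 0.0, 0.0])
```

With uniform scores the result is a plain average that smears the object's motion into the background, while peaked scores let the pixel follow the object almost exactly. That contrast is why a learned weighting can keep motion boundaries sharper than fixed interpolation.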
Experiments showed that DELTA excels in 3D tracking tasks, outperforming previous methods in both speed and accuracy. DELTA was trained on the Kubric dataset, with over 5,600 videos, using a loss function that combines 2D coordinate, depth, and visibility losses.
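A combined loss of this kind can be sketched as a weighted sum of three terms. The version below is a minimal illustration for a single tracked point, not the authors' implementation. The choice of L1 distance for coordinates and depth, binary cross-entropy for visibility, the weights, and all names are assumptions made for the example.

```python
import math

def l1(a, b):
    """Absolute error between two scalars."""
    return abs(a - b)

def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy between a predicted probability and a 0/1 label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def tracking_loss(pred, target, w_2d=1.0, w_depth=1.0, w_vis=1.0):
    """Weighted sum of 2D, depth, and visibility terms for one point.

    pred / target: dicts with 'xy' (pixel position), 'depth', and
    'visible' (a probability in pred, a 0/1 label in target).
    The weights w_* are hypothetical hyperparameters.
    """
    loss_2d = l1(pred["xy"][0], target["xy"][0]) + l1(pred["xy"][1], target["xy"][1])
    loss_depth = l1(pred["depth"], target["depth"])
    loss_vis = binary_cross_entropy(pred["visible"], target["visible"])
    return w_2d * loss_2d + w_depth * loss_depth + w_vis * loss_vis

pred = {"xy": (10.5, 20.0), "depth": 2.0, "visible": 0.9}
target = {"xy": (10.0, 20.0), "depth": 2.5, "visible": 1}
loss = tracking_loss(pred, target)
```

In practice such a loss would be averaged over all tracked pixels and frames, and the relative weights would be tuned so that no single term dominates training.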
In benchmarks, DELTA achieved top scores on CVO for long-range 2D tracking and on Kubric3D for dense 3D tracking, completing tasks much faster than other methods. DELTA’s design choices, including log-depth representation, spatial attention, and an attention-based upsampler, significantly enhance its accuracy and efficiency across diverse tracking scenarios.
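One intuition for why a log-depth representation can help: in log space, a change in depth is measured relative to the depth itself, so a 10% change looks the same for a nearby point as for a distant one. The sketch below illustrates that property only. It is not taken from DELTA's code, and the function names are hypothetical.

```python
import math

def to_log_depth(depth, eps=1e-6):
    """Map a positive depth to log space (eps guards against log(0))."""
    return math.log(max(depth, eps))

def from_log_depth(log_depth):
    """Invert the mapping back to ordinary depth."""
    return math.exp(log_depth)

def log_depth_change(depth_before, depth_after):
    """Depth change expressed as a difference of log depths."""
    return to_log_depth(depth_after) - to_log_depth(depth_before)

# A 10% depth increase for a near point and for a far point.
near_change = log_depth_change(1.0, 1.1)
far_change = log_depth_change(50.0, 55.0)

# In plain depth units the same two changes differ by a factor of 50.
near_raw = 1.1 - 1.0
far_raw = 55.0 - 50.0
```

Both changes map to the same value in log space, whereas in raw depth the distant point would dominate any error term. A network predicting log-depth changes therefore sees near and far motion on a comparable scale.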
In conclusion, DELTA is a highly efficient method for tracking every pixel across video frames, achieving state-of-the-art accuracy in dense 2D and 3D tracking with a faster runtime than existing methods. The model may struggle with points that remain occluded for extended periods, and it performs best on videos shorter than a few hundred frames. It shares these limitations with earlier methods because it processes video in relatively short temporal windows. Moreover, the method’s 3D tracking accuracy depends on the precision and temporal stability of the monocular depth estimation it uses. Expected advances in monocular depth estimation will likely improve the method’s performance further.
The post DELTA: A Novel AI Method that Efficiently (10x Faster) Tracks Every Pixel in 3D Space from Monocular Videos appeared first on MarkTechPost.