Computer Vision

  • Introduction to CLIP

    Hello! Let me show you an image. Can you describe what you see? Perfect! You nailed it: a bird sitting peacefully on a railing. Now, let’s flip it. I’ll describe something, and you imagine how it might appear: “A puppy sitting on a railway track.” Nice! Something like this might have popped right into your…

    Read more →

  • SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

    SimLingo unifies autonomous driving, vision-language understanding, and action reasoning, all from camera input only. It introduces Action Dreaming to test how well models follow instructions, and outperforms all prior methods on CARLA Leaderboard 2.0 and Bench2Drive. Key Highlights: Unified Model – Combines driving, VQA, and instruction-following using a single Vision-Language Model (InternVL2-1B + Qwen2-0.5B). State-of-the-Art Driving – Ranks #1 on…

    Read more →

  • OpenCV 4.12.0 Is Now Available

    OpenCV’s summer update for 2025 is now available in all your favorite flavors on the Releases page. It includes a big list of changes to Core, Imgproc, Calib3d, DNN, Objdetect, Photo, VideoIO, Imgcodecs, Highgui, G-API, Video, and HAL modules, the Python, Java and JavaScript bindings and even more. Highlights include: GIF decode and encode for imgcodecs,…

    Read more →

  • Introducing HAL riscv-rvv: Unleashing the power of RISC-V CPUs with RVV 1.0

    What is RISC-V and RVV 1.0? RISC-V (pronounced “risk-five”) is an open standard instruction set architecture (ISA) based on the principles of reduced instruction set computing (RISC). Unlike proprietary ISAs such as Intel’s x86 or ARM’s architecture, RISC-V is free to use and modify, enabling companies and researchers to design custom processors without licensing fees…

    Read more →

  • SAM4D: Segment Anything in Camera and LiDAR Streams

    SAM4D introduces a 4D foundation model for promptable segmentation across camera and LiDAR streams, addressing the limitations of frame-centric and modality-isolated approaches in autonomous driving. Key Highlights: Promptable Multi-modal Segmentation (PMS) – Enables interactive segmentation across sequences from both modalities using diverse prompts (points, boxes, masks), allowing cross-modal propagation and long-term object tracking. Unified Multi-modal Positional…

    Read more →

  • Vector Embeddings Explained

    You’ve just finished listening to your favorite high-energy workout song on Spotify, and the next track that automatically plays is one you’ve never heard, but it’s a perfect fit for your playlist. Is it magic? Not quite. It’s a clever AI concept called vector embeddings, and it’s the secret sauce behind much of the smart…

    Read more →
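    The recommendation idea in the teaser above can be sketched in a few lines: represent each item as a vector and recommend the nearest neighbor by cosine similarity. This is a minimal illustration with made-up 3-dimensional vectors; real systems use learned embeddings with hundreds of dimensions.

    ```python
    import numpy as np

    def cosine_similarity(a, b):
        # Cosine of the angle between two vectors; 1.0 means same direction.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical toy embeddings (real embeddings come from a trained model).
    songs = {
        "high_energy_workout": np.array([0.9, 0.8, 0.1]),
        "new_track":           np.array([0.85, 0.75, 0.2]),
        "slow_ballad":         np.array([0.1, 0.2, 0.9]),
    }

    query = songs["high_energy_workout"]
    ranked = sorted(
        (name for name in songs if name != "high_energy_workout"),
        key=lambda name: cosine_similarity(query, songs[name]),
        reverse=True,
    )
    print(ranked[0])  # the candidate track most similar to the workout song
    ```

    Nearby vectors mean similar items, which is why an unheard track can still be a perfect fit for a playlist.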

  • VideoGameBench: Can Vision-Language Models Complete Popular Video Games?

    VideoGameBench is a rigorous benchmark that evaluates VLMs’ real-time decision-making, perception, memory, and planning by challenging them to complete 1990s-era video games with only raw visual inputs and minimal control instructions. Key Highlights Real-Time, Visually Rich Environments – Evaluates VLMs on 23 popular Game Boy and MS-DOS games, including 3 secret test games to assess generalization…

    Read more →

  • Announcing The Winners of the First Perception Challenge for Bin-Picking (BPC)

    OpenCV and sponsors at Intrinsic, BOP, and University of Hawaiʻi at Mānoa are excited to announce the prize winners of the first Perception Challenge for Bin-Picking, first revealed at CVPR during the Perception for Industrial Robotics workshop. Beginning in February 2025, this challenge had over $60,000 at stake and over 450 teams vying for a…

    Read more →

  • LeGO-LOAM: Lightweight and Ground-Optimized Lidar Odometry and Mapping on Variable Terrain

    LeGO-LOAM introduces a cutting-edge lidar odometry and mapping framework designed to deliver real-time, accurate 6-DOF pose estimation for ground vehicles, optimized for challenging, variable terrain environments. It significantly reduces computational overhead while maintaining high accuracy, making it ideal for embedded systems. Key Highlights Ground-Optimized Approach – Segments lidar point clouds by leveraging ground plane information, filtering…

    Read more →

  • Applications of Vision Language Models – Real World Use Cases with PaliGemma2 Mix

    Imagine machines that don’t just capture pixels but truly understand them, recognizing objects, reading text, interpreting scenes, and even “speaking” about images as fluently as a human. VLMs merge computer vision’s “sight” with language’s “speech,” letting AI both describe and converse about any picture it sees. From generating captions and answering questions to counting objects,…

    Read more →