GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks

Multimodal foundation models (MFMs) like GPT-4o, Gemini, and Claude have made significant strides recently, particularly in public demonstrations. While their language capabilities are well-documented, their proficiency in visual comprehension remains uncertain. Current benchmarks predominantly assess text-based tasks, such as Visual Question Answering (VQA) or classification, which tend to highlight language strengths rather than genuine visual skills. Furthermore, critical components such as 3D perception, segmentation, and grouping are often neglected in these evaluations.

MFMs have shown strong performance in tasks that integrate visual and language understanding, including captioning and visual question answering. However, their effectiveness in tasks requiring in-depth visual comprehension is still under scrutiny. Many existing benchmarks depend on text outputs, complicating fair comparisons with vision-specific models. Some studies have attempted to adapt vision datasets for MFMs by converting annotations into text, but this approach limits evaluation to language outputs. Prompting strategies have also been explored to assist MFMs in tackling visual tasks by decomposing them into manageable subtasks, although reproducibility remains a challenge in some instances.

Researchers at EPFL evaluated several prominent MFMs, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5, on fundamental computer vision tasks such as segmentation, object detection, and depth prediction, utilizing datasets like COCO and ImageNet. Given that most MFMs are designed to produce text and are only accessible through APIs, they developed a prompt-chaining framework to convert these visual tasks into text-compatible formats. Their findings indicate that while MFMs are competent generalists, they do not match the performance of specialized vision models, particularly in geometric tasks. GPT-4o excelled, achieving the best results in 4 out of 6 tasks. The evaluation toolkit will be open-sourced.
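To make the setup concrete, here is a minimal sketch of what an API-only, text-out evaluation call can look like. This is not the authors' toolkit: the helper name `ask_mfm`, the choice of an OpenAI-style chat endpoint, and the model string are assumptions made for illustration.

```python
# Minimal sketch (not the authors' toolkit): wrapping an API-only MFM as a
# text-in/text-out callable that a task-specific prompt chain can reuse.
# Assumes an OpenAI-style chat endpoint; helper and model names are illustrative.
import base64
from openai import OpenAI

client = OpenAI()

def ask_mfm(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Send one image plus one text question, return the model's text answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Every vision task in the benchmark is ultimately reduced to sequences of calls like this, since the models expose no pixel-level outputs.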

To assess MFMs on vision tasks, the study introduced a prompt-chaining strategy that decomposes complex tasks into more manageable, language-friendly subtasks. For instance, rather than directly predicting bounding boxes, the model first identifies which objects are present and then locates them through recursive image cropping. For segmentation and grouping, images are divided into superpixels, which are easier to label and compare. Depth and surface normals are estimated from pairwise rankings of superpixel regions. This modular design leverages MFMs’ strengths in classification and similarity, while calibration controls ensure fair comparisons. The method is adaptable, and performance improves with finer-grained prompting.
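For intuition, the sketch below shows a much-simplified version of the recursive-cropping idea for localization: the model is never asked for coordinates, only which half of the current crop contains the object, and the search window shrinks accordingly. The actual prompt chain in the paper is more elaborate; the `ask` callable stands in for an image-question helper like the hypothetical `ask_mfm` above.

```python
# Simplified sketch of recursive cropping for object localization.
# `ask(image_path, question)` is any image+text -> text MFM call.
from typing import Callable
from PIL import Image

def localize(image_path: str, obj: str, ask: Callable[[str, str], str],
             depth: int = 4) -> tuple:
    """Return an approximate (left, top, right, bottom) box for `obj`."""
    img = Image.open(image_path).convert("RGB")
    left, top, right, bottom = 0, 0, img.width, img.height
    for step in range(depth):
        # Alternate between vertical and horizontal splits of the current window.
        if step % 2 == 0:
            mid = (left + right) // 2
            crops = {"left half": (left, top, mid, bottom),
                     "right half": (mid, top, right, bottom)}
        else:
            mid = (top + bottom) // 2
            crops = {"top half": (left, top, right, mid),
                     "bottom half": (left, mid, right, bottom)}
        question = (f"Which part of this image contains the {obj}: "
                    f"{' or '.join(crops)}? Answer with one of the options.")
        img.crop((left, top, right, bottom)).save("/tmp/crop.jpg")
        answer = ask("/tmp/crop.jpg", question).lower()
        for name, box in crops.items():
            if name in answer:
                left, top, right, bottom = box
                break
    return left, top, right, bottom
```

The same divide-and-query pattern extends naturally to the superpixel subtasks: labeling a region is a classification prompt, and ranking two regions by depth is a pairwise-comparison prompt.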

The study evaluated the MFMs across multiple tasks, such as image classification, object detection, and segmentation, using datasets like ImageNet, COCO, and Hypersim. GPT-4o reached 77.2% accuracy on ImageNet and 60.62 Average Precision at 50% Intersection over Union (AP50) for object detection, but specialized models led in both: ViT-G scored 90.94% on ImageNet and Co-DETR 91.30 AP50. In semantic segmentation, GPT-4o scored 44.89 mean Intersection over Union (mIoU), while OneFormer led with 65.52. MFMs handled distribution shifts reasonably well but struggled with precise visual reasoning. The study also introduced prompt chaining and oracle baselines to estimate upper-bound performance.
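For readers less familiar with these metrics, both AP50 and mIoU are built on intersection-over-union. The snippet below is a generic illustration (not code from the study) that computes box IoU and shows a pair of boxes that would count as a miss under the 0.5 threshold used by AP50.

```python
# Illustration of the IoU measure underlying both AP50 and mIoU:
# AP50 counts a detection as correct when box IoU >= 0.5,
# while mIoU averages per-class mask IoU over the dataset.
def box_iou(a: tuple, b: tuple) -> float:
    """IoU of two (left, top, right, bottom) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50/150 ≈ 0.33 -> a miss at the 0.5 threshold
```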

In conclusion, the study presents a benchmarking framework to evaluate the visual capabilities of MFMs, including GPT-4o, Gemini, and Claude, by translating standard vision tasks into prompt-based formats. Findings reveal that MFMs perform better on semantic tasks than geometric ones, with GPT-4o leading overall. However, all MFMs significantly lag behind task-specific vision models. Despite being generalists primarily trained on image-text data, they demonstrate promising advancements, particularly with newer reasoning models like o3 in 3D tasks. Limitations include high inference costs and prompt sensitivity. Nevertheless, this framework establishes a unified approach for evaluating MFMs’ visual understanding, paving the way for future advancements.

Check out the Paper, GitHub Page, and Project. All credit for this research goes to the researchers of this project.
