Salesforce AI Research Propose Programmatic VLM Evaluation (PROVE): A New Benchmarking Paradigm for Evaluating VLM Responses to Open-Ended Queries

Vision-Language Models (VLMs) are increasingly used to answer queries about visual content. Despite their progress, they often generate plausible but incorrect responses, known as hallucinations. Hallucinations erode trust in these systems, especially in real-world, high-stakes applications. Evaluating the helpfulness and truthfulness of VLM-generated responses is challenging because it requires not only understanding the visual content but also verifying each claim made in the response. Traditional benchmarks have not addressed this challenge adequately, either because they limit evaluation to simplistic, binary questions or because they rely on incomplete context to judge open-ended responses.

Researchers from Salesforce AI Research have proposed Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended visual queries. In PROVE, the researchers build a high-fidelity scene graph representation from hyper-detailed image captions and use a large language model (LLM) to generate diverse question-answer (QA) pairs, along with an executable program to verify each pair. This approach yields a benchmark dataset of 10.5k visually grounded and challenging QA pairs. The evaluation strategy measures both the helpfulness and the truthfulness of VLM responses within a unified framework based on scene graph comparisons. This programmatic evaluation provides a more reliable and interpretable assessment of VLM performance than previous benchmarks.
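To make the idea of a verification program concrete, here is a minimal Python sketch. It is not the PROVE implementation: the `SceneGraph` class, its methods, the `verify_qa` helper, and the example scene are all hypothetical names and data chosen for illustration. It only shows how a small program executed against a scene graph can confirm that a QA pair is grounded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical structure, not the PROVE API)."""
    # entity name -> set of attribute strings
    attributes: Dict[str, Set[str]] = field(default_factory=dict)
    # (subject, relation, object) triples
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)

    def has_attribute(self, entity: str, attribute: str) -> bool:
        return attribute in self.attributes.get(entity, set())

    def related(self, subject: str, relation: str) -> List[str]:
        # All objects linked to `subject` by `relation`, in sorted order.
        return sorted(o for s, r, o in self.relations
                      if s == subject and r == relation)


def verify_qa(graph: SceneGraph,
              program: Callable[[SceneGraph], Optional[str]],
              expected_answer: str) -> bool:
    """Keep a QA pair only if its program reproduces the answer."""
    return program(graph) == expected_answer


# Assumed example scene, as if parsed from a detailed caption.
graph = SceneGraph(
    attributes={"dog": {"brown", "small"}, "sofa": {"red"}},
    relations={("dog", "sitting on", "sofa")},
)


# Hypothetical QA pair: "What is the dog sitting on?" -> "sofa"
def program(g: SceneGraph) -> Optional[str]:
    hits = g.related("dog", "sitting on")
    return hits[0] if hits else None


kept = verify_qa(graph, program, "sofa")       # program agrees with the answer
dropped = verify_qa(graph, program, "chair")   # program contradicts the answer
```

In this sketch, a QA pair whose program does not return the stated answer would be filtered out, mirroring the idea that only programmatically verifiable pairs are retained.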

The PROVE benchmark uses detailed scene graph representations and executable programs to verify the correctness of VLM responses. Scene graphs, constructed from detailed image captions, contain entities, attributes, and relationships that represent the visual scene. By prompting an LLM, researchers generate open-ended QA pairs and corresponding verification programs that ensure the questions are challenging yet verifiable. Only QA pairs that can be programmatically verified are retained in the benchmark, resulting in a high-quality dataset. The evaluation involves extracting scene graph representations from both the model responses and ground truth answers, and then calculating scores based on the recall and precision of these representations, measuring how helpful and truthful the responses are.
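The recall and precision scoring described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the benchmark's actual scoring code: scene graphs are reduced to sets of `(subject, relation, object)` tuples, matching is exact rather than similarity-based, precision is computed against a reference scene graph for the whole image, and the function names `helpfulness` and `truthfulness` are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def helpfulness(response: Set[Triple], answer: Set[Triple]) -> float:
    """Recall: share of ground-truth answer tuples covered by the response."""
    if not answer:
        return 0.0
    return len(response & answer) / len(answer)


def truthfulness(response: Set[Triple], scene: Set[Triple]) -> float:
    """Precision: share of response tuples supported by the reference scene."""
    if not response:
        return 0.0
    return len(response & scene) / len(response)


# Assumed toy data: the reference scene graph for one image.
scene = {
    ("dog", "is", "brown"),
    ("dog", "sitting on", "sofa"),
    ("sofa", "is", "red"),
}
# Tuples extracted from the ground-truth answer.
answer = {("dog", "sitting on", "sofa")}
# Tuples extracted from a model response that adds one unsupported claim.
response = {("dog", "sitting on", "sofa"), ("sofa", "is", "blue")}

h = helpfulness(response, answer)   # 1 of 1 answer tuples covered -> 1.0
t = truthfulness(response, scene)   # 1 of 2 response tuples supported -> 0.5
```

The toy response answers the question fully but hallucinates the sofa's color, so it scores high on helpfulness and lower on truthfulness, which is the kind of trade-off the benchmark is designed to expose.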

The results show that current VLMs struggle to balance helpfulness and truthfulness. Models such as GPT-4o, Phi-3.5-Vision, and Pixtral achieved higher helpfulness scores but not necessarily higher truthfulness. The study also found that increasing model size tends to improve helpfulness but does not always improve truthfulness. More broadly, recent advances in VLM training have made responses more helpful without consistently making them more truthful. Notably, the LLaVA-1.5 model series achieved the best truthfulness scores, suggesting that smaller, more focused models might outperform larger ones in maintaining accuracy.

In conclusion, PROVE is a significant advance in evaluating the helpfulness and truthfulness of VLM-generated responses. By combining detailed scene graph representations with programmatic verification, the benchmark provides a more reliable and interpretable evaluation framework. The findings underscore the need for VLMs whose responses are both informative and accurate, especially as their use in real-world applications continues to grow. Future research is expected to focus on improving both the helpfulness and truthfulness of these models through advanced training techniques and new evaluation strategies.

Check out the Paper and Dataset Card. All credit for this research goes to the researchers of this project.

The post Salesforce AI Research Propose Programmatic VLM Evaluation (PROVE): A New Benchmarking Paradigm for Evaluating VLM Responses to Open-Ended Queries appeared first on MarkTechPost.