Omni-modality language models (OLMs) represent a rapidly advancing area of AI, enabling understanding and reasoning across multiple data types, including text, audio, video, and images. These models aim to approximate human-like comprehension by processing diverse inputs simultaneously, making them highly useful in complex, real-world applications. The research in this field seeks to create AI…
Large language models (LLMs) have revolutionized how machines process and generate human language, but their ability to reason effectively across diverse tasks remains a significant challenge. Researchers in AI are working to enable these models to perform not just language understanding but also complex reasoning tasks like problem-solving in mathematics, logic, and general knowledge. The…
Large Language Models (LLMs) have gained significant traction in recent years, with fine-tuning pre-trained models for specific tasks becoming a common practice. However, this approach is resource-inefficient, since it requires deploying a separate model for each task. The growing demand for multitask learning solutions has led researchers to explore model merging as a viable alternative.…
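In its simplest and most common form, model merging means combining the parameters of several task-specific checkpoints of the same base model into one set of weights, often by a weighted average. The sketch below illustrates that idea on toy parameter dictionaries; the function name, the toy checkpoints, and uniform averaging are illustrative assumptions, not the specific merging method the article describes.

```python
def merge_models(state_dicts, coeffs=None):
    """Weighted average of parameters across fine-tuned checkpoints.

    state_dicts: list of {param_name: value} dicts with identical keys
                 (toy scalars here; real checkpoints hold tensors).
    coeffs:      optional per-model weights; defaults to a uniform average.
    """
    n = len(state_dicts)
    coeffs = coeffs if coeffs is not None else [1.0 / n] * n
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(c * sd[key] for c, sd in zip(coeffs, state_dicts))
    return merged


# Hypothetical example: two fine-tunes of the same base model,
# merged into a single multitask set of weights.
math_model = {"w1": 0.8, "w2": -0.2}
code_model = {"w1": 0.4, "w2": 0.6}
merged = merge_models([math_model, code_model])
```

A uniform average is only the baseline; practical merging methods adjust `coeffs` per model or per parameter, which is where much of the research effort lies.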
Multimodal AI models are powerful tools capable of both understanding and generating visual content. However, existing approaches often use a single visual encoder for both tasks, which leads to suboptimal performance due to the fundamentally different requirements of understanding and generation. Understanding requires high-level semantic abstraction, while generation focuses on local details and global consistency.…
Preserving a model’s robustness to shifts in data distribution, that is, its ability to function effectively on data unlike what it was trained on, is essential when adapting a pre-trained foundation model to downstream tasks. Because retraining the entire model for each new dataset or task can be…
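A common way to avoid retraining the entire model is to freeze the pretrained backbone and train only a small task-specific head on top of its features, leaving the backbone's original behavior untouched. The sketch below shows that pattern on a toy regression task; the stand-in backbone, the linear head, and the training loop are illustrative assumptions, not any particular method from the article.

```python
def frozen_backbone(x):
    # Stand-in for a pretrained feature extractor.
    # Its "parameters" are never updated during adaptation.
    return [x, x * x]


def train_head(data, lr=0.02, epochs=500):
    """Train only a lightweight linear head on frozen features.

    data: list of (input, target) pairs for the downstream task.
    Returns the learned head weights; the backbone is untouched.
    """
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            feats = frozen_backbone(x)
            pred = sum(wi * fi for wi, fi in zip(w, feats))
            err = pred - y
            # Gradient step on the head weights only.
            w = [wi - lr * err * fi for wi, fi in zip(w, feats)]
    return w


# Toy downstream task: y = 2x, learnable from the frozen features [x, x^2].
data = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0)]
head = train_head(data)
```

Because only the head's parameters change, the backbone's learned representations, and whatever distribution-shift robustness they carry, are preserved by construction.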
Vision-Language Models (VLMs) struggle with spatial reasoning tasks like object localization, counting, and relational question-answering. This issue stems from Vision Transformers (ViTs) trained with image-level supervision, which often fail to encode localized information effectively, limiting spatial understanding. Researchers from Stanford University propose a novel solution called Locality Alignment, which involves a post-training stage for Vision…
Research on the robustness of LLMs to jailbreak attacks has mostly focused on chatbot applications, where users manipulate prompts to bypass safety measures. However, LLM agents, which utilize external tools and perform multi-step tasks, pose a greater misuse risk, especially in malicious contexts like ordering illegal materials. Studies show that defenses effective in single-turn interactions…
Large Language Model (LLM)–based online agents have advanced significantly in recent years, yielding novel designs and new benchmarks that show notable improvements in autonomous web navigation and interaction. These advancements demonstrate how web agents can increasingly carry out intricate online tasks more accurately and effectively. However, many of the current benchmarks overlook important factors…