Accurately predicting where a person is looking in a scene—gaze target estimation—represents a significant challenge in AI research. Inferring gaze direction requires integrating complex cues such as head orientation and scene context. Traditionally, methods for this problem use multi-branch architectures, processing the scene and head features separately before integrating them with auxiliary…
Multimodal large language models (MLLMs) are advancing rapidly, enabling machines to interpret and reason about textual and visual data simultaneously. These models have transformative applications in image analysis, visual question answering, and multimodal reasoning. By bridging the gap between vision and language, they play a crucial role in improving artificial intelligence’s ability to understand and…
Foundation models, pre-trained on extensive unlabeled data, have emerged as a cutting-edge approach for developing versatile AI systems capable of solving complex tasks through targeted prompts. Researchers are now exploring the potential of extending this paradigm beyond language and visual domains, focusing on behavioral foundation models (BFMs) for agents interacting with dynamic environments. Specifically, the…
Audio language models (ALMs) play a crucial role in various applications, from real-time transcription and translation to voice-controlled systems and assistive technologies. However, many existing solutions face limitations such as high latency, significant computational demands, and a reliance on cloud-based processing. These issues pose challenges for edge deployment, where low power consumption, minimal latency, and…
Integrating vision and language capabilities in AI has led to breakthroughs in Vision-Language Models (VLMs). These models aim to process and interpret visual and textual data simultaneously, enabling applications such as image captioning, visual question answering, optical character recognition, and multimodal content analysis. VLMs play an important role in developing autonomous systems, enhanced human-computer interactions,…
Recent advancements in healthcare AI, including medical LLMs and LMMs, show great potential for improving access to medical advice. However, these models are largely English-centric, limiting their utility for non-English-speaking populations, such as those in Arabic-speaking regions. Furthermore, many medical LMMs struggle to balance advanced medical text comprehension with multimodal capabilities. While models like…
Large Language Models (LLMs) have achieved remarkable advancements in natural language processing (NLP), enabling applications in text generation, summarization, and question-answering. However, their reliance on token-level processing—predicting one word at a time—presents challenges. This approach contrasts with human communication, which often operates at higher levels of abstraction, such as sentences or ideas. Token-level modeling also…
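The token-level processing described above can be sketched in a few lines. This is an illustrative toy, not any specific LLM: the vocabulary and the next-token scoring table are made up, and a real model would condition on the full prefix rather than only the last token. The point is the decoding loop itself, which emits exactly one token per step.

```python
import math
import random

# Made-up vocabulary and next-token logits for illustration only.
vocab = ["<s>", "the", "cat", "sat", "."]
random.seed(0)
logits = {w: [random.gauss(0, 1) for _ in vocab] for w in vocab}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def generate(start="<s>", steps=4):
    """Greedy autoregressive decoding: one token at a time."""
    seq = [start]
    for _ in range(steps):
        probs = softmax(logits[seq[-1]])  # distribution over the next token
        seq.append(vocab[max(range(len(vocab)), key=probs.__getitem__)])
    return seq
```

Each iteration commits to a single word before the next is considered, which is exactly the word-by-word granularity the abstract contrasts with sentence- or idea-level abstraction.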
Large language models (LLMs) have demonstrated remarkable performance across multiple domains, driven by scaling laws that highlight the relationship between model size, training computation, and performance. Despite significant advancements in model scaling, a critical gap exists in understanding how computational resources during inference impact model performance post-training. The complexity arises from balancing performance improvements against the…
Vision-and-Language Navigation (VLN) combines visual perception with natural language understanding to guide agents through 3D environments. The goal is to enable agents to follow human-like instructions and navigate complex spaces effectively. Such advancements hold potential in robotics, augmented reality, and smart assistant technologies, where linguistic instructions guide interaction with physical spaces. The core problem in…
Masked diffusion has emerged as a promising alternative to autoregressive models for the generative modeling of discrete data. Despite its potential, existing research has been constrained by overly complex model formulations and ambiguous relationships between different theoretical perspectives. These limitations have resulted in suboptimal parameterization and training objectives, often requiring ad hoc adjustments to address…
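The forward (noising) process of masked diffusion on discrete data can be sketched very simply: each token is independently replaced by a mask symbol with a probability given by the noise level t, and a model is then trained to recover the masked positions. The snippet below is a minimal sketch under that common formulation; the `[MASK]` symbol and function names are illustrative, not taken from any particular paper.

```python
import random

MASK = "[MASK]"  # illustrative mask symbol

def mask_tokens(tokens, t, rng):
    """Forward masking step: corrupt each token independently
    with probability t (t=0 -> clean data, t=1 -> fully masked)."""
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
x0 = ["a", "b", "c", "d", "e", "f"]  # clean discrete sequence
xt = mask_tokens(x0, t=0.5, rng=rng)  # partially masked sample
```

Unlike autoregressive decoding, a denoiser trained against this corruption predicts all masked positions in parallel, which is what makes the parameterization and training objective choices discussed above consequential.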