The study of artificial intelligence has witnessed transformative developments in reasoning about and understanding complex tasks. Among the most innovative of these are large language models (LLMs) and multimodal large language models (MLLMs). These systems can process both textual and visual data, allowing them to analyze intricate tasks. Unlike traditional approaches that base their reasoning skills on verbal means,…
Video understanding has long presented unique challenges for AI researchers. Unlike static images, videos involve intricate temporal dynamics and spatial-temporal reasoning, making it difficult for models to generate meaningful descriptions or answer context-specific questions. Issues like hallucination, where models fabricate details, further compromise the reliability of existing systems. Despite advancements with models such as GPT-4o…
The growing reliance on AI models for edge and mobile devices has underscored significant challenges. Balancing computational efficiency, model size, and multilingual capabilities remains a persistent hurdle. Traditional large language models (LLMs), while powerful, often require extensive resources, making them less suitable for edge applications like smartphones or IoT devices. Additionally, delivering robust multilingual performance…
Agentic AI enables autonomous, collaborative problem-solving that mimics human cognition. By facilitating multi-agent cooperation with real-time communication, it holds promise across diverse industries, from autonomous transportation to adaptive healthcare. Realizing this potential, however, requires frameworks that are scalable, robust, and seamlessly integrable with existing technologies, while addressing technical challenges that limit adaptability and precision. The significant…
The growth of data in the digital age presents both opportunities and challenges. An immense volume of text, images, audio, and video is generated daily across platforms. Traditional machine learning models, while effective in many scenarios, often struggle to process high-dimensional and unstructured data without extensive preprocessing and feature engineering. This approach is not only…
Generative Large Multimodal Models (LMMs), such as LLaVA and Qwen-VL, excel in vision-language (VL) tasks like image captioning and visual question answering (VQA). However, these models face challenges when applied to foundational discriminative VL tasks, such as image classification or multiple-choice VQA, which require discrete label predictions. The primary obstacle is the difficulty in extracting…
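The snippet above is truncated, but the obstacle it names — getting discrete label predictions out of a generative model — has a widely used workaround worth sketching: score each candidate label by the log-probability the model assigns to its tokens, then pick the highest-scoring one. The sketch below is illustrative only and not the approach of the excerpted paper; a toy log-probability table stands in for a real LMM such as LLaVA or Qwen-VL, and the names `TOY_LOGPROBS`, `score_label`, and `classify` are hypothetical.

```python
import math

# Toy stand-in for an LMM's next-token distribution: maps a context
# string to log-probabilities over a tiny vocabulary. A real model
# would return logits over its full tokenizer vocabulary instead.
TOY_LOGPROBS = {
    "Q: What animal is shown? A:": {
        "cat": math.log(0.7),
        "dog": math.log(0.2),
        "car": math.log(0.1),
    },
}

def score_label(context: str, label_tokens: list[str]) -> float:
    """Sum the log-probability of a candidate label's tokens."""
    total = 0.0
    for tok in label_tokens:
        # Unseen tokens get a tiny floor probability instead of -inf.
        total += TOY_LOGPROBS[context].get(tok, math.log(1e-9))
        # With a real model, the context would be extended by `tok`
        # before scoring the next token; this toy table is single-step.
    return total

def classify(context: str, candidates: dict[str, list[str]]) -> str:
    """Pick the candidate label with the highest total log-probability."""
    return max(candidates, key=lambda lab: score_label(context, candidates[lab]))

context = "Q: What animal is shown? A:"
labels = {"cat": ["cat"], "dog": ["dog"], "car": ["car"]}
print(classify(context, labels))  # → cat
```

Because every candidate is scored under the same distribution, this turns the open-ended generator into a closed-set classifier without any fine-tuning; in practice, scores are often length-normalized when candidate labels have different token counts.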
Large Language Models (LLMs) and Vision-Language Models (VLMs) have transformed natural language understanding, multimodal integration, and complex reasoning. Yet one critical limitation remains: current models cannot efficiently handle extremely large contexts. This challenge has prompted researchers to explore new methods and architectures to improve these models’ scalability, efficiency, and performance. Existing models typically support token…
Advances in large language models and multimodal speech-text models have laid the foundation for seamless, real-time, natural, human-like voice interaction. Achieving this requires systems that process speech content, emotional tone, and audio cues while producing accurate and coherent responses. However, challenges remain, including the mismatch between speech and text sequences, limited pre-training for speech tasks…
Scientific metadata in research literature holds immense significance, as highlighted by flourishing research in scientometrics — a discipline dedicated to analyzing scholarly literature. Metadata improves the findability and accessibility of scientific documents by indexing and linking papers in a massive graph. Today, the research community recognizes the importance of metadata. However, awareness and consideration of it were…
Speech processing systems often struggle to deliver clear audio in noisy environments. This challenge impacts applications such as hearing aids, automatic speech recognition (ASR), and speaker verification. Conventional single-channel speech enhancement (SE) systems use neural network architectures like LSTMs, CNNs, and GANs, but they are not without limitations. For instance, attention-based models such as Conformers,…