The advancements in large language models (LLMs) have created opportunities across industries, from automating content creation to improving scientific research. However, significant challenges remain. High-performing models are often proprietary, restricting transparency and access for researchers and developers. Open-source alternatives, while promising, frequently struggle with balancing computational efficiency and performance at scale. Furthermore, limited language diversity…
Transformers have become the backbone of deep learning models for tasks requiring sequential data processing, such as natural language understanding, computer vision, and reinforcement learning. These models rely heavily on self-attention mechanisms, enabling them to capture complex relationships within input sequences. However, as tasks and models scale, the demand for longer context windows increases significantly.…
Chemical synthesis is essential in developing new molecules for medical applications, materials science, and fine chemicals. This process, which involves planning chemical reactions to create desired target molecules, has traditionally relied on human expertise. Recent advancements have turned to computational methods to enhance the efficiency of retrosynthesis—working backward from a target molecule to determine the…
While multimodal models (LMMs) have advanced significantly for text and image tasks, video-based models remain underdeveloped. Videos are inherently complex, combining spatial and temporal dimensions that demand more from computational resources. Existing methods often adapt image-based approaches directly or rely on uniform frame sampling, which poorly captures motion and temporal patterns. Moreover, training large-scale video…
Reinforcement Learning is now applied in almost every pursuit of science and tech, either as a core methodology or to optimize existing processes and systems. Despite broad adoption even in highly advanced fields, RL lags in some fundamental skills. Sample Inefficiency is one such problem that limits its potential. In simple terms, RL needs thousands…
Accurately predicting where a person is looking in a scene—gaze target estimation—represents a significant challenge in AI research. Integrating complex cues such as head orientation and scene context must be used to infer gaze direction. Traditionally, methods for this problem use multi-branch architectures, processing the scene and head features separately before integrating them with auxiliary…
Multimodal large language models (MLLMs) are advancing rapidly, enabling machines to interpret and reason about textual and visual data simultaneously. These models have transformative applications in image analysis, visual question answering, and multimodal reasoning. By bridging the gap between vision & language, they play a crucial role in improving artificial intelligence’s ability to understand and…
Foundation models, pre-trained on extensive unlabeled data, have emerged as a cutting-edge approach for developing versatile AI systems capable of solving complex tasks through targeted prompts. Researchers are now exploring the potential of extending this paradigm beyond language and visual domains, focusing on behavioral foundation models (BFMs) for agents interacting with dynamic environments. Specifically, the…