Generative models have revolutionized fields like language, vision, and biology through their ability to learn and sample from complex data distributions. While these models benefit from scaling up during training through increased data, computational resources, and model sizes, their inference-time scaling capabilities face significant challenges. Specifically, diffusion models, which excel in generating continuous data like… →
Swarm is an innovative open-source framework designed to explore the orchestration and coordination of multi-agent systems. It is developed and managed by the OpenAI Solutions team, and it provides a lightweight, ergonomic, and educational environment for developers to learn and experiment with agent-based systems. At its core, Swarm is built to facilitate the interaction of… →
Vision-language models (VLMs) play a crucial role in multimodal tasks like image retrieval, captioning, and medical diagnostics by aligning visual and linguistic data. However, understanding negation in these models remains one of the main challenges. Negation is critical for nuanced applications, such as distinguishing “a room without windows” from “a room with windows.” Despite their… →
One of the most significant and advanced capabilities of a multimodal large language model is long-context video modeling, which allows models to handle movies, documentaries, and live streams spanning multiple hours. However, despite the commendable advancements made in video comprehension in LLMs, including caption generation and question answering, many obstructions remain in processing extremely long… →
LLMs have made significant strides in automated writing, particularly in tasks like open-domain long-form generation and topic-specific reports. Many approaches rely on Retrieval-Augmented Generation (RAG) to incorporate external information into the writing process. However, these methods often fall short due to fixed retrieval strategies, limiting the generated content’s depth, diversity, and utility—this lack of nuanced… →
Scaling the size of large language models (LLMs) and their training data have now opened up emergent capabilities that allow these models to perform highly structured reasoning, logical deductions, and abstract thought. These are not incremental improvements over previous tools but mark the journey toward reaching Artificial general intelligence (AGI). Training LLMs to reason well… →
Video diffusion models have emerged as powerful tools for video generation and physics simulation, showing promise in developing game engines. These generative game engines function as video generation models with action controllability, allowing them to respond to user inputs like keyboard and mouse interactions. A critical challenge in this field is scene generalization – the… →
Understanding long videos, such as 24-hour CCTV footage or full-length films, is a major challenge in video processing. Large Language Models (LLMs) have shown great potential in handling multimodal data, including videos, but they struggle with the massive data and high processing demands of lengthy content. Most existing methods for managing long videos lose critical… →
Code retrieval has become essential for developers in modern software development, enabling efficient access to relevant code snippets and documentation. Unlike traditional text retrieval, which effectively handles natural language queries, code retrieval must address unique challenges, such as programming languages’ structural variations, dependencies, and contextual relevance. With tools like GitHub Copilot gaining popularity, advanced code… →
The development of VLMs in the biomedical domain faces challenges due to the lack of large-scale, annotated, and publicly accessible multimodal datasets across diverse fields. While datasets have been constructed from biomedical literature, such as PubMed, they often focus narrowly on domains like radiology and pathology, neglecting complementary areas such as molecular biology and pharmacogenomics… →