«`html

JarvisArt: A Human-in-the-Loop Multimodal Agent for Region-Specific and Global Photo Editing

Understanding the Target Audience

The primary audience for JarvisArt includes professional photographers, graphic designers, and content creators who want to enhance their images with precision and creativity. These users often struggle to master complex editing software, yet still want high-quality results that reflect their artistic vision.

Key Pain Points

Difficulty in mastering professional editing tools like Adobe Lightroom.
Limited control and precision in automated AI-driven editing solutions.
Time-consuming processes that hinder productivity.

Goals and Interests

To achieve high-quality photo edits that align with specific aesthetic goals.
To find efficient solutions that combine artistic intent with technical execution.
To utilize tools that support both global and localized editing tasks.

Communication Preferences

The target audience prefers clear, concise, and technical communication that provides actionable insights and practical examples. They value peer-reviewed research and case studies that demonstrate the effectiveness of new tools and methodologies.

Bridging the Gap Between Artistic Intent and Technical Execution

Photo retouching is central to digital photography: it lets users adjust properties such as tone, exposure, and contrast. Achieving high-quality results, however, usually takes significant expertise. Neither manual tools nor automated solutions fully close this gap. Traditional software is complex to learn, while AI-driven methods lack the control needed for nuanced edits.
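To make the difference between a global and a region-specific adjustment concrete, here is a minimal sketch of an exposure change applied either to a whole image or only inside a mask. It is an illustration only, not how JarvisArt or Lightroom implement exposure: the `adjust_exposure` function, the grayscale nested-list image format, and the simple `2**stops` gain model are all assumptions made for this example.

```python
from typing import List, Optional

Image = List[List[float]]  # hypothetical format: grayscale pixels in [0.0, 1.0]
Mask = List[List[bool]]    # True where the edit should apply


def adjust_exposure(image: Image, stops: float, mask: Optional[Mask] = None) -> Image:
    """Scale brightness by 2**stops, everywhere or only where mask is True."""
    gain = 2.0 ** stops
    result = []
    for y, row in enumerate(image):
        new_row = []
        for x, value in enumerate(row):
            if mask is None or mask[y][x]:
                # Clip so the adjusted pixel stays in the valid range.
                value = min(1.0, max(0.0, value * gain))
            new_row.append(value)
        result.append(new_row)
    return result


image = [[0.2, 0.4], [0.6, 0.8]]

# Global edit: every pixel is brightened by one stop.
global_edit = adjust_exposure(image, stops=1.0)

# Region-specific edit: only the top-left pixel is brightened.
local_edit = adjust_exposure(image, stops=1.0, mask=[[True, False], [False, False]])
```

The global call changes every pixel, while the masked call leaves everything outside the selected region untouched. That is the kind of control that purely generative editing tends to lose.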

Limitations of Current AI-Based Photo Editing Models

Existing automated retouching methods rely on zeroth- and first-order optimization and on reinforcement learning, but they struggle with fine-grained regional control and high-resolution outputs. Even advanced generative models such as GPT-4o and Gemini-2-Flash limit user control and often overwrite important content details during generation.

Introducing JarvisArt

JarvisArt is an intelligent retouching agent developed by researchers from Xiamen University, the Chinese University of Hong Kong, ByteDance, the National University of Singapore, and Tsinghua University. The system uses a multimodal large language model for flexible, instruction-guided image editing, emulating the decision-making process of professional artists.

Methodology

The development of JarvisArt involved three major components:

Creation of the MMArt dataset, comprising 5,000 standard and 50,000 Chain-of-Thought–annotated samples.
A two-stage training process: initial supervised fine-tuning followed by Group Relative Policy Optimization for Retouching (GRPO-R).
Implementation of the Agent-to-Lightroom (A2L) protocol for seamless execution of tools within Lightroom.
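The core idea behind Group Relative Policy Optimization can be sketched briefly: several candidate outputs are sampled for the same prompt, each is scored, and every score is normalized against its own group's mean and standard deviation to give an advantage. The snippet below is a minimal sketch of that normalization step only. The reward values are made up, the use of the population standard deviation is an assumption (implementations differ), and the retouching-specific reward design of GRPO-R is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    # eps avoids division by zero when every sample in the group scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four retouching plans sampled for one instruction.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# Samples above the group mean get positive advantages, those below get negative.
best_index = advantages.index(max(advantages))
```

Because the baseline is the group mean rather than a learned value function, no separate critic model is needed. This is the usual motivation given for group-relative methods.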

Performance Evaluation

On MMArt-Bench, JarvisArt shows a 60% improvement over GPT-4o in average pixel-level metrics for content fidelity. It handles both global edits and localized refinements, letting users direct changes through specific instructions while preserving their aesthetic goals.
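As a rough illustration of what a pixel-level fidelity comparison might look like, the sketch below computes mean absolute error and mean squared error against a reference image, then expresses one method's gain over another as a relative error reduction. The article does not state which metrics MMArt-Bench uses or how the 60% figure is derived, so the metric choice, the function names, and the toy images are all assumptions.

```python
from typing import List

Image = List[List[float]]  # hypothetical format: grayscale pixels in [0.0, 1.0]


def mean_abs_error(a: Image, b: Image) -> float:
    """Average absolute per-pixel difference (lower means closer to the reference)."""
    diffs = [abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def mean_squared_error(a: Image, b: Image) -> float:
    """Average squared per-pixel difference (lower means closer to the reference)."""
    diffs = [(pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def relative_improvement(baseline_error: float, candidate_error: float) -> float:
    """Fraction by which the candidate reduces the baseline's error."""
    return (baseline_error - candidate_error) / baseline_error


# Toy data: a reference image and two hypothetical model outputs.
reference = [[0.0, 0.5], [1.0, 0.5]]
baseline_output = [[0.5, 0.5], [0.5, 0.5]]
candidate_output = [[0.25, 0.5], [1.0, 0.5]]

baseline_mae = mean_abs_error(reference, baseline_output)
candidate_mae = mean_abs_error(reference, candidate_output)
improvement = relative_improvement(baseline_mae, candidate_mae)
```

In this toy case the candidate cuts the baseline's mean absolute error by three quarters. A benchmark figure such as a 60% improvement would presumably be an average of this kind of ratio over many images and metrics.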

Conclusion

JarvisArt addresses the challenge of intelligent, high-quality photo retouching without requiring professional expertise. By combining data synthesis, reasoning-driven training, and integration with commercial software, it offers a powerful solution for creative users seeking flexibility and quality in their image editing.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

«`