Zhipu AI Releases ‘Glyph’: An AI Framework for Scaling the Context Length through Visual-Text Compression
Can we render long texts as images and use a vision-language model (VLM) to achieve 3–4× token compression, preserving accuracy while scaling a 128K context toward 1M-token workloads? A team of researchers from Zhipu AI has released Glyph, an AI framework for scaling context length through visual-text compression. Glyph transforms long textual sequences into images, which are then processed end-to-end by a VLM. Each visual token encodes many characters, effectively shortening the token sequence while preserving semantics. Glyph can achieve 3–4× token compression on long text sequences without performance degradation, leading to significant gains in memory efficiency, training throughput, and inference speed.
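To make the visual-text compression idea concrete, here is a minimal sketch that renders a span of text onto a page image and estimates how many visual patch tokens replace the original text tokens. The page dimensions, default font, 28-pixel patch size, and characters-per-token figure are illustrative assumptions, not Glyph's actual rendering configuration or tokenizer.

```python
# Minimal sketch: render text onto a page image and estimate token compression.
# Page size, font, and the 28x28-pixel visual patch size are illustrative
# assumptions, not Glyph's actual configuration.
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_page(text: str, width=1024, height=1448, font_size=18, margin=32):
    """Render wrapped text onto a single page image."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # swap in a TTF font for realistic typography
    chars_per_line = (width - 2 * margin) // (font_size // 2)
    y = margin
    for line in textwrap.wrap(text, width=chars_per_line):
        draw.text((margin, y), line, fill="black", font=font)
        y += int(font_size * 1.3)  # line height
        if y > height - margin:
            break  # a full pipeline would continue on a new page
    return img

def estimate_compression(text: str, img: Image.Image, patch=28, chars_per_text_token=4):
    """Rough compression ratio: text tokens replaced per visual patch token."""
    text_tokens = len(text) / chars_per_text_token
    visual_tokens = (img.width // patch) * (img.height // patch)
    return text_tokens / visual_tokens
```

In Glyph itself, rendering parameters like these are not fixed by hand but tuned by the search procedure described below.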
Why Glyph?
Conventional methods that expand positional encodings or modify attention still see compute and memory scale with token count. Retrieval methods can trim inputs, but they risk omitting evidence and add latency. Glyph instead changes the representation: it converts text to images and shifts the burden to a VLM that already learns optical character recognition (OCR), layout, and reasoning. This increases information density per token, so a fixed token budget covers more of the original context; at a compression ratio near 8, for example, a 128K visual-token budget spans roughly 1M tokens of source text. Under extreme compression, the research team demonstrates that a 128K-context VLM can handle tasks derived from 1M-token-level text.
System Design and Training
The method consists of three stages: continual pre-training, LLM-driven rendering search, and post-training. Continual pre-training exposes the VLM to large corpora of rendered long text with diverse typography and styles; the objective is to align visual and textual representations and transfer long-context skills from text tokens to visual tokens. The rendering search runs a genetic loop driven by an LLM, mutating parameters such as page size, dpi, font family, font size, line height, alignment, indent, and spacing, with candidates evaluated on a validation set to jointly optimize accuracy and compression (a sketch of the loop follows). Post-training involves supervised fine-tuning and reinforcement learning with Group Relative Policy Optimization (GRPO), along with an auxiliary OCR alignment task that improves character fidelity, especially when fonts are small and spacing is tight.
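Below is a minimal sketch of what such a genetic loop over rendering configurations might look like. The parameter ranges, the fitness weighting, and the `evaluate` and `llm_mutate` helpers are hypothetical placeholders; the paper's actual search space, scoring, and LLM-proposed mutations are not reproduced here.

```python
# Sketch of a genetic rendering-parameter search, assuming a hypothetical
# evaluate(config) that returns (accuracy, compression) on a validation set.
import random

SEARCH_SPACE = {
    "dpi": [72, 96, 120],
    "font_size": [8, 10, 12, 14],
    "line_height": [1.0, 1.2, 1.5],
    "alignment": ["left", "justify"],
}

def random_config():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def llm_mutate(config):
    # Placeholder: the paper has an LLM propose mutations; here we simply
    # perturb one randomly chosen rendering parameter.
    key = random.choice(list(SEARCH_SPACE))
    return {**config, key: random.choice(SEARCH_SPACE[key])}

def fitness(config, evaluate, acc_weight=0.7):
    accuracy, compression = evaluate(config)  # hypothetical validation-set scorer
    return acc_weight * accuracy + (1 - acc_weight) * compression

def genetic_search(evaluate, pop_size=8, generations=10, keep=4):
    population = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=lambda c: fitness(c, evaluate), reverse=True)
        parents = scored[:keep]  # keep the best configs
        children = [llm_mutate(random.choice(parents))  # mutate survivors
                    for _ in range(pop_size - keep)]
        population = parents + children
    return max(population, key=lambda c: fitness(c, evaluate))
```

The design point the sketch preserves is that fitness rewards accuracy and compression jointly, so the search does not simply collapse onto the smallest, densest typography.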
Results, Performance, and Efficiency
Benchmarks such as LongBench and MRCR establish accuracy and compression under long dialogue histories and document tasks. The model achieves an average effective compression ratio of about 3.3 on LongBench, with some tasks nearing 5, and approximately 3.0 on MRCR. These gains grow with longer inputs, since each visual token carries more characters. Compared with the text backbone at 128K inputs, speedups are about 4.8 times for prefill, 4.4 times for decoding, and 2 times for supervised fine-tuning throughput. The Ruler benchmark indicates that higher dpi at inference time improves scores, as crisper glyphs aid OCR and layout parsing, while compression falls as dpi rises. The research team reports:
- dpi 72: average compression 4.0, with a maximum of 7.7 on specific sub-tasks
- dpi 96: average compression 2.2, maximum 4.4
- dpi 120: average compression 1.2, maximum 2.8
The maximum of 7.7 is observed on Ruler, not MRCR.
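The dpi/compression relationship follows from simple geometry: rendering the same text at higher dpi produces a larger image, and therefore more visual patches for the same number of characters. The sketch below illustrates that trade-off; the A4 page size, 28-pixel patch size, and characters-per-token figure are assumptions for illustration, not values from the paper.

```python
# Back-of-the-envelope view of the dpi / compression trade-off.
# Page size, patch size, and chars-per-token are illustrative assumptions.
def visual_tokens(dpi, page_inches=(8.27, 11.69), patch=28):
    """Visual tokens needed for one rendered page at a given dpi."""
    w, h = (int(dim * dpi) for dim in page_inches)
    return (w // patch) * (h // patch)

def compression(chars_on_page, dpi, chars_per_text_token=4):
    """Rough ratio of text tokens replaced by one page of visual tokens."""
    return (chars_on_page / chars_per_text_token) / visual_tokens(dpi)

for dpi in (72, 96, 120):
    print(dpi, visual_tokens(dpi), round(compression(6000, dpi), 2))
```

The exact figures differ from the reported averages, but the trend is the same: compression shrinks as dpi grows, which is why dpi 72 compresses far more aggressively than dpi 120.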
Applications
Glyph enhances multimodal document understanding. Training on rendered pages improves performance on MMLongBench-Doc relative to a base visual model, indicating that the rendering objective serves as a useful pretext task for real document workloads that involve figures and layout. The primary failure mode is sensitivity to aggressive typography: very small fonts and tight spacing degrade character accuracy, particularly for rare alphanumeric strings. The research team excludes the UUID subtask on Ruler. The approach assumes server-side rendering and a VLM with robust OCR and layout priors.
Key Takeaways
- Glyph renders long text into images and processes the pages with a VLM, reframing long-context modeling as a multimodal problem while preserving semantics and reducing token counts.
- The research team reports token compression of 3 to 4 times with accuracy comparable to strong 8B text baselines on long-context benchmarks.
- Prefill speedup is about 4.8 times, decoding speedup is about 4.4 times, and supervised fine-tuning throughput is about 2 times, measured at 128K inputs.
- The pipeline combines continual pre-training on rendered pages, an LLM-driven genetic search over rendering parameters, and post-training with supervised fine-tuning and GRPO-based reinforcement learning, plus an OCR alignment objective.
- Evaluations include LongBench, MRCR, and Ruler, with an extreme case showing a 128K context VLM addressing 1M token-level tasks.
For further details, see the research paper, along with the associated code and model card on GitHub and Hugging Face.