Advancing Multimodal Mathematical Reasoning with Vision-to-Code Alignment
Multimodal mathematical reasoning enables machines to solve problems involving both textual information and visual components such as diagrams and figures. This capability is essential in education, automated tutoring, and document analysis, where problems are often presented with a combination of text and images.
A significant challenge in this area is the lack of high-quality, precise alignment between mathematical images and their textual or symbolic representations. Most datasets used to train large multimodal models are derived from image captions collected in natural settings, which often overlook the fine-grained details critical for mathematical accuracy. This limitation can lead to unreliable model performance, particularly on problems involving geometry, figures, or technical diagrams.
Researchers from the Multimedia Laboratory at The Chinese University of Hong Kong and CPII under InnoHK introduced a novel approach called MathCoder-VL. The method pairs a vision-to-code model named FigCodifier with a synthetic data engine that builds the ImgCode-8.6M dataset through a model-in-the-loop strategy, enabling the iterative construction of the largest image-code dataset to date. The team also developed MM-MathInstruct-3M, a multimodal instruction dataset enriched with newly synthesized images.
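The model-in-the-loop strategy can be pictured as a simple iterative loop: the current model proposes code for unlabeled figures, only code that renders is kept as a verified pair, and the model is retrained on the enlarged dataset before the next round. The sketch below is a hypothetical outline of that loop, not the authors' released pipeline; `generate_code`, `render_ok`, and `retrain` stand in for the actual FigCodifier inference, rendering check, and training steps.

```python
# Hypothetical sketch of a model-in-the-loop data engine; the function names and
# the exact loop structure are assumptions, not the authors' released code.

def model_in_the_loop(model, unlabeled_figures, generate_code, render_ok, retrain, rounds=3):
    """Grow an image-code dataset by letting the current model label figures,
    keeping only code that renders, then retraining on the enlarged dataset."""
    dataset = []
    for _ in range(rounds):
        for figure in unlabeled_figures:
            code = generate_code(model, figure)  # model proposes TikZ/Python code
            if render_ok(code):                  # keep only code that compiles/renders
                dataset.append((figure, code))   # verified image-code pair
        model = retrain(model, dataset)          # next round starts from a better model
    return model, dataset
```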
The MathCoder-VL model is trained in two stages: mid-training on ImgCode-8.6M to enhance visual-text alignment and fine-tuning on MM-MathInstruct-3M to improve reasoning abilities. The FigCodifier model translates mathematical figures into code that can accurately recreate those figures. This code-image pairing ensures strict alignment and accuracy, unlike traditional caption-based datasets.
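To make the alignment point concrete, here is an illustrative Python/matplotlib pair of the kind such a dataset could contain (a hand-written sketch, not actual FigCodifier output): because the code deterministically reproduces the image, every coordinate and label in the figure is grounded in text the model can train on, with no room for the ambiguity of a free-form caption.

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

# The code below acts as an exact, machine-checkable "caption" of the figure it
# draws: a 3-4-5 right triangle with labeled vertices.
fig, ax = plt.subplots(figsize=(4, 4))
ax.plot([0, 4, 0, 0], [0, 0, 3, 0], color="black")  # A(0,0) -> B(4,0) -> C(0,3) -> A
ax.text(-0.3, -0.3, "A")
ax.text(4.1, -0.3, "B")
ax.text(-0.3, 3.1, "C")
ax.set_aspect("equal")
ax.axis("off")
fig.savefig("right_triangle.png")
```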
The dataset includes 8.6 million code-image pairs covering a wide range of mathematical topics, sourced from textbooks, K12 datasets, and arXiv papers. In addition to TikZ, FigCodifier supports Python-based rendering, adding variety to the generated images. The pipeline filters low-quality data by validating that the code actually renders and by removing redundant or uninformative visuals, yielding 4.3 million high-quality TikZ-based and 4.3 million Python-based pairs.
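The filtering step can also be sketched in code. The snippet below is a simplified, hypothetical version of such a filter (not the paper's implementation): it keeps only code that executes and renders to an image, and drops exact-duplicate renders via a hash; a production pipeline would presumably sandbox execution and use stronger redundancy checks.

```python
import hashlib
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def validate_and_render(code: str) -> bytes | None:
    """Execute candidate figure code and return PNG bytes, or None if it fails."""
    plt.close("all")
    try:
        exec(code, {"plt": plt})  # illustrative only: a real pipeline would sandbox this
        buf = io.BytesIO()
        plt.savefig(buf, format="png")
        return buf.getvalue()
    except Exception:
        return None  # code that does not render is filtered out

def filter_pairs(candidates: list[str]) -> list[tuple[str, bytes]]:
    """Keep only code that renders, dropping exact-duplicate images by hash."""
    kept, seen = [], set()
    for code in candidates:
        png = validate_and_render(code)
        if png is None:
            continue
        digest = hashlib.sha256(png).hexdigest()
        if digest in seen:  # redundant visual: skip
            continue
        seen.add(digest)
        kept.append((code, png))
    return kept

if __name__ == "__main__":
    samples = [
        "plt.figure(); plt.plot([0, 4, 0, 0], [0, 0, 3, 0])",  # valid figure
        "plt.figure(); plt.plot([0, 4, 0, 0], [0, 0, 3, 0])",  # exact duplicate, dropped
        "plt.plot(undefined_variable)",                        # broken code, dropped
    ]
    print(f"{len(filter_pairs(samples))} pair(s) kept")        # -> 1 pair(s) kept
```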
Performance evaluations demonstrate that MathCoder-VL surpasses several open-source models. The 8B version achieved 73.6% accuracy on the MathVista Geometry Problem Solving subset, outperforming GPT-4o and Claude 3.5 Sonnet by 8.9% and 9.2%, respectively. It scored 26.1% on MATH-Vision and 46.5% on MathVerse. In Chinese-language benchmarks, it reached 51.2% on GAOKAO-MM. On the We-Math benchmark, it solved two-step problems at 58.6%, slightly exceeding GPT-4o’s 58.1%. Its performance on three-step problems reached 52.1%, again outperforming GPT-4o’s 43.6%. Compared to its base model InternVL2-8B, MathCoder-VL showed gains of 6.1% on MATH-Vision and 11.6% on MathVista.
This research clearly identifies the issue of insufficient visual-textual alignment in multimodal mathematical reasoning and provides a scalable and innovative solution. The introduction of FigCodifier and synthetic datasets enables models to learn from accurate, diverse visuals paired with precise code, significantly enhancing their reasoning abilities.
For further details, refer to the Paper and GitHub Page.