Abstract: We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language-model-based agent using a novel multi-turn process-reward reinforcement learning procedure that follows supervised fine-tuning.
Our approach is enabled by a new dataset we call ControlSketch-Part, which contains rich part-level annotations for sketches. These annotations are obtained with a novel, generic automatic pipeline that segments vector sketches into semantic parts and assigns each path to a part through a structured multi-stage labeling process.
Our results indicate that incorporating structured part-level data and providing the agent with visual feedback throughout the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.
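The abstract describes a multi-turn loop in which the agent proposes one sketch part per turn, sees a rendering of the sketch so far, and receives a per-step (process) reward. The snippet below is a minimal, hypothetical sketch of such a rollout loop, not the authors' implementation; all names (generate_part, render, score_part, rollout) are placeholders introduced here for illustration.

```python
# Hypothetical sketch of a part-by-part generation rollout with per-step
# process rewards and visual feedback. Not taken from the paper's code.

from dataclasses import dataclass, field


@dataclass
class SketchState:
    """Vector sketch built up one semantic part at a time."""
    parts: list[list[str]] = field(default_factory=list)  # each part = list of SVG path strings


def generate_part(state: SketchState, prompt: str, turn: int) -> list[str]:
    """Placeholder for the multimodal LM agent's next-part proposal."""
    return [f"M {10 * turn} 0 L {10 * turn + 5} 20"]  # dummy path data


def render(state: SketchState) -> str:
    """Placeholder rasterizer: produces the visual feedback shown to the agent."""
    paths = "".join(p for part in state.parts for p in part)
    return f"<svg>{paths}</svg>"


def score_part(rendering: str, prompt: str) -> float:
    """Placeholder process reward, e.g. a per-step text-image alignment score."""
    return 0.0


def rollout(prompt: str, max_parts: int = 8) -> tuple[SketchState, list[float]]:
    """One multi-turn episode: propose a part, render, score, repeat."""
    state, rewards = SketchState(), []
    for turn in range(max_parts):
        part = generate_part(state, prompt, turn)      # agent acts
        state.parts.append(part)                        # update the sketch
        feedback = render(state)                        # visual feedback to the agent
        rewards.append(score_part(feedback, prompt))    # per-step process reward
    return state, rewards


if __name__ == "__main__":
    sketch, step_rewards = rollout("a cat sitting on a chair")
    print(len(sketch.parts), step_rewards)
```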
Subjects:
Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Cite as:
arXiv:2603.19500 [cs.AI]
(or
arXiv:2603.19500v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2603.19500
Submission history: From Xiaodan Du [v1] Thu, 19 Mar 2026 22:08:53 UTC (15,422 KB)