Sun
          Sun
          Emoji
          Spotlight
          Water
          Cancel
          Eye
          Map Pin
          Christmas Ornament
          Play
          Heart
          
          Bag
          Settings
          Cloud
          Watch
          Microphone
          y in regular font
          o in regular font
          u in regular font
          capital l in normal font
          0 in bolditalic font
          capital o in italic font
          capital k in regular font
          a in bold font
          w in regular font
          e in regular font
          s in regular font
          0 in cond regular font
          m in italic font
          e in semi-condensed extrabold font
          Exclamation Mark (Icon)
Scalable Vector Graphics (SVG) is a popular format on the web and in the design industry. However, despite the great strides made in generative modeling, SVG has remained underexplored due to the discrete and complex nature of such data. We introduce Grimoire, a text-guided SVG generative model that comprises two modules: a Visual Shape Quantizer (VSQ), which learns to map raster images onto a discrete codebook by reconstructing them as vector shapes, and an Auto-Regressive Transformer (ART), which models the joint probability distribution over shape tokens, positions, and textual descriptions, allowing us to generate vector graphics from natural language. Unlike existing models that require direct supervision from SVG data, Grimoire learns shape image patches using only raster image supervision, which opens up vector generative modeling to significantly more data. We demonstrate the effectiveness of our method by fitting Grimoire for closed filled shapes on MNIST and for outline strokes on icon and font data, surpassing previous image-supervised methods in generative quality and the vector-supervised approach in flexibility.
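As a rough illustration of what modeling a joint distribution auto-regressively means, the generic chain-rule factorization is written out below. The symbols are assumptions introduced here, not notation from the paper: $x_1, \dots, x_N$ stands for a single flattened sequence of text, shape, and position tokens, and the actual token ordering used by the ART module is not specified in this text.

```latex
p(x_1, \dots, x_N) \;=\; \prod_{i=1}^{N} p\left(x_i \mid x_1, \dots, x_{i-1}\right)
```

Under this reading, generating from text amounts to fixing the text tokens at the start of the sequence and sampling the remaining shape and position tokens one at a time.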
            
            Grimoire enables both generation from text and completion of partly drawn objects. In the latter case, one or more shapes drawn at given positions on a canvas are encoded with the pre-trained VSQ module to obtain the closest codes learned during training. This conditioning code sequence, along with the original positions, is then provided to the auto-regressive model together with the text description. The rest of the decoding pipeline remains the same. An overview of the two approaches is illustrated above. We report a series of qualitative completions to show how the network predictions change or align with an increasing number of conditioning shapes. Moreover, Grimoire could easily be extended to also perform fill-in-the-middle tasks.
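The completion setup described above can be sketched in a few lines. This is a minimal toy version under stated assumptions, not the authors' implementation: the function names `nearest_code` and `build_prefix`, the Euclidean nearest-neighbor lookup, the integer grid positions, and the ordering (text first, then interleaved shape and position tokens) are all hypothetical choices made for illustration.

```python
from math import dist

def nearest_code(embedding, codebook):
    """Return the index of the codebook entry closest to `embedding`.

    Assumption: closeness is plain Euclidean distance; the real VSQ
    module may quantize differently.
    """
    return min(range(len(codebook)), key=lambda i: dist(embedding, codebook[i]))

def build_prefix(text_tokens, shape_embeddings, positions, codebook):
    """Assemble a conditioning prefix for an auto-regressive model.

    Assumption: text tokens come first, followed by one
    (shape code, position) pair per drawn shape.
    """
    prefix = [("text", t) for t in text_tokens]
    for emb, pos in zip(shape_embeddings, positions):
        prefix.append(("shape", nearest_code(emb, codebook)))
        prefix.append(("pos", pos))
    return prefix

# Toy codebook with three 2-D entries (real codebooks are much larger).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Hypothetical encoder outputs for two shapes drawn by the user.
drawn = [(0.9, 0.1), (0.1, 0.8)]
# Hypothetical canvas positions, expressed as grid-cell indices.
positions = [5, 12]

prefix = build_prefix(["sun"], drawn, positions, codebook)
print(prefix)
```

A model would then continue sampling shape and position tokens after this prefix, so adding more drawn shapes simply lengthens the conditioning sequence.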
              Phone
              Dice
              Eye
              Mask
              Check
              User
              Photo
              Glasses
              Arrow
              Lock
              Conversation
              Alarm
              Glass
              Tooth
              Shield
              Bottle
              Tape
              Search
              Sun
              User
              Document
              Bell
              Cube
              Smile
              Thermometer
              Glass
              Lock
              Headphones
              Calendar
              Arrow
              Sea
              Anchor
              Folder
              User
              Apple
              Sun
              Anchor
              Mouse
              Star
              Plus
              Share
              Mountain
              Fingers
              Church
              Male
              Hourglass
              Avatar
              Crown
        @inproceedings{CiprianoFeuerpfeilDeMelo2025VectorGrimoire,
          title={Vector Grimoire: Codebook-based Shape Generation under Raster Image Supervision},
          author={Marco Cipriano and Moritz Feuerpfeil and Gerard de Melo},
          booktitle={Proceedings of ICML 2025},
          year={2025},
        }