Meta continues to advance its research into new forms of generative AI models, today revealing its latest effort known as CM3leon (pronounced as “chameleon”).
CM3leon is a multimodal foundation model for text-to-image generation as well as image-to-text generation, which is useful for automatically generating captions for images.
AI-generated images are obviously not a new concept at this point, with popular tools like Stable Diffusion, DALL-E and Midjourney being widely available.
What’s new are the techniques Meta used to build CM3leon and the performance Meta claims the foundation model can achieve.
Text-to-image generation technologies today rely heavily on the use of diffusion models (from which Stable Diffusion takes its name) to create an image. CM3leon uses something different: a token-based autoregressive model.
“Diffusion models have recently dominated image generation work due to their high performance and relatively low computational cost,” Meta researchers wrote in a research paper titled Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning. “In contrast, token-based autoregressive models are also known to produce good results, with even better overall image coherence in particular, but they are much more expensive to train and use for inference.”
What the Meta researchers demonstrate with CM3leon is that a token-based autoregressive model can, in fact, be more efficient than a diffusion model-based approach.
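To make the contrast concrete, here is a minimal sketch of how token-based autoregressive image generation works: the model predicts one image token at a time, and a separate decoder (such as a VQ codebook tokenizer) would map the tokens back to pixels. CM3leon's actual architecture and tokenizer are not public, so every name and size below is an illustrative toy stand-in:

```python
import numpy as np

VOCAB_SIZE = 16   # a real image tokenizer's codebook is far larger
GRID = 4          # 4x4 = 16 image tokens instead of thousands

rng = np.random.default_rng(0)

def toy_logits(prefix):
    """Stand-in for a transformer: returns logits for the next token."""
    # A hash of the prefix seeds deterministic pseudo-logits.
    seed = sum(prefix) + len(prefix)
    return np.random.default_rng(seed).normal(size=VOCAB_SIZE)

def generate_image_tokens(prompt_tokens):
    """Autoregressively sample GRID*GRID image tokens, one at a time."""
    seq = list(prompt_tokens)
    for _ in range(GRID * GRID):
        logits = toy_logits(seq)
        probs = np.exp(logits - logits.max())  # softmax over the vocab
        probs /= probs.sum()
        seq.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return seq[len(prompt_tokens):]  # the newly generated image tokens

tokens = generate_image_tokens(prompt_tokens=[3, 1, 4])
print(len(tokens))  # 16 tokens, which a decoder would map back to pixels
```

A diffusion model, by contrast, refines an entire noisy image over many denoising steps rather than emitting tokens sequentially, which is the trade-off the researchers describe above.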
“CM3leon achieves state-of-the-art performance for text-to-image generation, despite being trained with five times less compute than previous transformer-based methods,” Meta researchers wrote in a blog post.
The basic scheme of how CM3leon works is somewhat similar to how existing text generation models work.
The Meta researchers started with a retrieval-augmented pre-training stage. Rather than simply scraping publicly available images from the internet, a method that has caused legal issues for diffusion-based models, Meta took a different route.
“The ethical implications of sourcing image data in the field of text-to-image generation have been the subject of considerable debate,” states the Meta research paper. “In this study, we only use images licensed from Shutterstock. As a result, we can avoid problems with image ownership and attribution, without sacrificing performance.”
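The retrieval-augmented idea can be sketched simply: for each training caption, fetch the most similar items from the licensed corpus and prepend them as in-context examples. The corpus, similarity scorer, and example format below are assumptions for illustration; Meta describes its retrieval pipeline only at a high level:

```python
from collections import Counter
import math

# Toy licensed corpus standing in for the Shutterstock data.
corpus = [
    "a chameleon resting on a green leaf",
    "city skyline at night with neon lights",
    "a red sports car parked by the beach",
]

def bow(text):
    """Bag-of-words representation of a caption."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, k=1):
    """Return the k most similar corpus entries to the query."""
    q = bow(query)
    return sorted(corpus, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

def augment(caption):
    # Prepend retrieved context so the model sees related examples in-context.
    return retrieve(caption) + [caption]

example = augment("a chameleon on a leaf")
print(example)  # the most similar licensed caption, then the training caption
```

A real pipeline would retrieve image-text pairs with a learned dense retriever rather than bag-of-words cosine, but the data-assembly shape is the same.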
After pre-training, the CM3leon model goes through a supervised fine-tuning (SFT) stage which, according to the Meta researchers, produces highly optimized results in terms of both resource usage and image quality. SFT is an approach OpenAI used to help train ChatGPT. Meta notes in its research paper that SFT is used to train the model to understand complex prompts, which is useful for generative tasks.
“We found that instruction tuning notably amplifies the multimodal model’s performance across various tasks such as image caption generation, visual question answering, text-based editing and conditional image generation,” the paper states.
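The core of the SFT objective can be sketched as cross-entropy computed on the response tokens only, with the instruction/prompt positions masked out of the loss. The shapes and toy model below are stand-ins, not CM3leon's:

```python
import numpy as np

VOCAB = 8

def sft_loss(logits, targets, prompt_len):
    """Mean negative log-likelihood over the response tokens only."""
    # Softmax over the vocabulary axis.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Negative log-probability of each target token.
    nll = -np.log(probs[np.arange(len(targets)), targets])
    # Mask out the prompt: only response positions contribute to the loss.
    mask = np.arange(len(targets)) >= prompt_len
    return float(nll[mask].mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, VOCAB))   # 6 positions: 2 prompt + 4 response
targets = np.array([1, 2, 3, 4, 5, 6])
loss = sft_loss(logits, targets, prompt_len=2)
print(round(loss, 3))  # a positive scalar; gradients would update the model
```

Masking the prompt is the standard trick that keeps the model from being penalized for tokens it was given rather than asked to produce.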
Looking at the sample generated images that Meta shared in its CM3leon blog post, the results are impressive and clearly show the model’s ability to understand complex multi-step prompts and generate high-resolution images.

Currently, CM3leon is a research effort, and it is unclear when or even if Meta will make the technology publicly available in a service on any of its platforms. Given its performance and greater generation efficiency, it’s highly likely that CM3leon and its approach to generative AI will eventually move beyond research.