Text-to-image generative models are machine learning models that can generate images from natural language descriptions. For example, if you give these models a prompt like "a cat wearing a hat", they will try to create an image that matches that description as closely as possible.
These models have become more advanced and realistic in recent years, thanks to the development of deep neural networks, diffusion models, large-scale datasets, and powerful computing resources. Ranking them is not an easy task, as different models may have different strengths and weaknesses, such as image quality, diversity, resolution, speed, and creativity.
Midjourney: One of the best text-to-image generative AI models that you can use to create amazing images from text. Currently, Midjourney is only accessible through its Discord bot, which can also be invited to third-party Discord servers.
DALL-E 3 (OpenAI): It can create realistic images and art from a description in natural language. It can also combine concepts, attributes, and styles in various ways, such as creating anthropomorphic versions of animals and objects, rendering text, and applying transformations to existing images.
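DALL-E 3 is exposed through OpenAI's API rather than as open weights. As a minimal sketch (assuming the v1-style `openai` Python client and an `OPENAI_API_KEY` in the environment; the prompt and size below are placeholders), a request looks roughly like this:

```python
# Minimal sketch of generating an image with DALL-E 3 via OpenAI's Python client.
# Assumes the v1-style `openai` package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()
response = client.images.generate(
    model="dall-e-3",
    prompt="an anthropomorphic cat wearing a hat, oil painting style",
    size="1024x1024",
    n=1,
)
print(response.data[0].url)  # URL of the generated image
```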
Stable Diffusion (66k ★): It is based on a kind of diffusion model called a latent diffusion model, which iteratively removes noise from a compressed latent representation of the image rather than from raw pixels. It is one of the first text-to-image models that can run on consumer hardware, and its code and model weights are publicly available.
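Because the code and weights are public, Stable Diffusion can be run locally. A minimal sketch with the Hugging Face diffusers library (assuming a CUDA GPU and the public runwayml/stable-diffusion-v1-5 checkpoint) looks like this:

```python
# Minimal sketch of running Stable Diffusion locally with Hugging Face diffusers.
# Assumes a CUDA GPU and the public "runwayml/stable-diffusion-v1-5" checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The latent diffusion model iteratively denoises a latent tensor conditioned on the prompt.
image = pipe("a cat wearing a hat", num_inference_steps=30).images[0]
image.save("cat_with_hat.png")
```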
Imagen (Google Research, Paper): A text-to-image generation model that uses diffusion models and large transformer language models. Imagen is described in the Google Research, Brain Team paper "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding".
Muse (Google Research, Paper): A text-to-image generation model that uses masked generative transformers. Muse can create realistic and diverse images from natural language descriptions. It can also edit images in various ways, such as inpainting, outpainting, and mask-free editing.
DreamBooth (Google Research, Paper): Developed by researchers from Google Research and Boston University in 2022. It takes a small set of images of a specific subject and uses them to fine-tune a text-to-image model so that it can generate new images of that subject from natural language prompts.
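Once a model has been fine-tuned with DreamBooth (for example with the training scripts in the diffusers repository), the resulting checkpoint is used like any other text-to-image model. The sketch below is hypothetical: the output path is a placeholder, and "sks" stands in for the rare identifier token commonly bound to the subject during training.

```python
# Hypothetical sketch: prompting a DreamBooth-fine-tuned Stable Diffusion checkpoint.
# "path/to/dreambooth-output" is a placeholder for the training output directory;
# "sks" is the placeholder token bound to the personal subject during fine-tuning.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/dreambooth-output", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photo of sks dog wearing a wizard hat").images[0]
image.save("sks_dog_wizard_hat.png")
```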
DeepFloyd IF (StabilityAI, 7.5k ★): A novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. DeepFloyd IF is modular, composed of a frozen text encoder and three cascaded pixel diffusion modules.
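Since the weights are open (behind a license gate on the Hugging Face Hub), DeepFloyd IF can also be run through diffusers. The sketch below shows only the first two of the three cascaded stages and assumes the published DeepFloyd/IF-I-XL-v1.0 and DeepFloyd/IF-II-L-v1.0 checkpoints plus enough GPU memory:

```python
# Sketch of DeepFloyd IF's cascaded pipeline via diffusers (first two stages only).
# Assumes access to the gated DeepFloyd checkpoints on the Hugging Face Hub.
import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
stage_1.enable_model_cpu_offload()

prompt = "a cat wearing a hat, photorealistic"
# The frozen text encoder produces prompt embeddings shared by all stages.
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# Stage 1: low-resolution base image conditioned on the text embeddings.
image = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt"
).images

# Stage 2: super-resolution diffusion module, reusing the same embeddings.
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()
image = stage_2(
    image=image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
).images
```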
DreamFusion (Google Research, Paper): A text-to-3D method that uses a pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis. The resulting 3D model of the given text can be viewed from any angle, relit by arbitrary illumination, or composited into any 3D environment.
GLIGEN (1.8k ★, Paper): A novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs such as bounding boxes. It keeps the weights of the pre-trained diffusion model frozen and injects new trainable gated layers, so the original model does not have to be re-trained.
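diffusers ships a GLIGEN pipeline that accepts grounding phrases and normalized bounding boxes alongside the text prompt. The sketch below is illustrative rather than canonical: the checkpoint name and box coordinates are assumptions.

```python
# Sketch of grounded generation with the GLIGEN pipeline in diffusers.
# The checkpoint name and box coordinates are illustrative assumptions.
import torch
from diffusers import StableDiffusionGLIGENPipeline

pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a cat wearing a hat sitting on a park bench",
    gligen_phrases=["a cat wearing a hat"],   # what should appear inside the box
    gligen_boxes=[[0.25, 0.35, 0.75, 0.95]],  # normalized [xmin, ymin, xmax, ymax]
    gligen_scheduled_sampling_beta=1.0,
    num_inference_steps=50,
).images[0]
image.save("grounded_cat.png")
```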
pix2pix-zero (1k ★, Paper): A diffusion-based image-to-image approach that allows users to specify the edit direction on-the-fly. This method can directly use pre-trained text-to-image diffusion models, such as Stable Diffusion, for editing real and synthetic images while preserving the input image's structure.
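The core idea of an on-the-fly edit direction can be sketched without the full pipeline: embed a handful of source captions and target captions with a pretrained CLIP text encoder and take the difference of their mean embeddings. The snippet below is a simplified illustration of that idea, not the official pix2pix-zero code:

```python
# Simplified illustration (not the official pix2pix-zero code) of computing an edit
# direction as the difference between mean CLIP text embeddings of target and source captions.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def mean_embedding(captions):
    tokens = tokenizer(captions, padding=True, return_tensors="pt")
    with torch.no_grad():
        # Pooled per-caption embeddings, averaged over the caption set.
        return text_encoder(**tokens).pooler_output.mean(dim=0)

source_captions = ["a photo of a cat", "a cat sitting on a bench"]
target_captions = ["a photo of a dog", "a dog sitting on a bench"]

# "cat -> dog" edit direction; in pix2pix-zero this direction steers the denoising
# process while cross-attention guidance preserves the input image's structure.
edit_direction = mean_embedding(target_captions) - mean_embedding(source_captions)
```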
These models build on foundation models, or pretrained transformers: large neural networks trained on massive amounts of unlabeled data that can be adapted to different tasks with additional fine-tuning.
They use diffusion-based models, which are generative models trained by gradually adding noise to training images and learning to reverse that corruption; at generation time they start from pure noise and iteratively denoise it into an image. These models can generate high-resolution images with fine details and realistic textures.
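To make the noising-and-denoising idea concrete, here is a toy sketch of the DDPM-style forward process (an illustration, not code from any model above): a clean image is corrupted in closed form at an arbitrary timestep, and the network's training target is the noise that was added.

```python
# Toy sketch of the DDPM forward (noising) process; the reverse process is learned.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta_t)

def add_noise(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise  # the denoising network is trained to predict `noise` from (x_t, t)

x0 = torch.randn(1, 3, 64, 64)  # stand-in for a training image
x_t, target_noise = add_noise(x0, torch.tensor([500]))
```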
They use grounding inputs or spatial information, which are additional inputs that can guide the generation process to adhere to the composition specified by the user. These inputs can be bounding boxes, keypoints, or images, and can be used to control the layout, pose, or style of the generated image.
They use neural style transfer or adversarial networks, techniques that allow the models to apply different artistic styles to the generated images, such as impressionism, cubism, or abstract art. These techniques can create beautiful and unique artworks from text inputs.
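As one concrete example on the style side, classic neural style transfer summarizes an image's style by the correlations (Gram matrix) between feature channels of a pretrained CNN and penalizes the generated image for deviating from the style reference. The sketch below illustrates that loss in isolation; it is not a component of any specific model listed here.

```python
# Generic sketch of the Gram-matrix style loss from classic neural style transfer.
import torch
import torch.nn.functional as F

def gram_matrix(features):
    # features: (batch, channels, height, width) activations from a CNN layer.
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)
    # Channel-by-channel correlations, normalized by the number of entries.
    return flat @ flat.transpose(1, 2) / (c * h * w)

def style_loss(generated_features, style_features):
    # Match the feature correlations of the generated image to the style reference.
    return F.mse_loss(gram_matrix(generated_features), gram_matrix(style_features))
```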
They use natural language requests or text prompts, which are the main inputs that the models use to generate images. These requests require no programming knowledge and can be simple or complex, descriptive or abstract, factual or fictional. The models draw on what they learned from their training data to generate images that best match the request.