Image Synthesis

Image generation technology has advanced to an unprecedented level in the last two years. Algorithms and software now exist that can process a natural language description of the desired image and generate a photorealistic version in a matter of seconds. We will be focusing on the more realistic image synthesis research and models, though extensive bodies of work also exist on more artistic or stylized images. Let’s look at a few examples of the technique in action, discuss some backstory, examine how the technology works, and close with some applications and potential drawbacks.

For transparency and posterity, the author is using a local web UI for Stable Diffusion (SD) from GitHub, an extension to the web UI that runs Deforum (also from GitHub), and the DreamShaper model checkpoint available from CivitAI.

Some Outputs

The first thing users typically notice is that natural language processing (NLP) requires some getting used to. Let’s say we’re interested in a boat by the seaside, as viewed from a crowded beach. There are several grammatically correct ways to put this, all of which generate slightly different results:

Clearly, there is room for interpretation and improvement when it comes to creating the prompts. For a start, in (a) there are no people and the definition of “boat” is loose. Otherwise, the images look fairly realistic (at this zoom level). The water and shadows are especially nice. Some of these behaviors are driven by the prompts themselves, some are driven by the parameters used to generate the image (resolution, iterations, CFG schedule, etc.), but most are driven by the training data used to create the model itself. In addition, the parts of the image that look “real” and the parts that look “fake” are largely a function of the available references within the data used to train the model. It’s clear that the model in use has seen many similar beach-like environments and a lot of different-shaped boats. So, if we want to create a model with a specific purpose, say to generate images of a specific boat, then we will need lots of reference images of that specific boat.
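
Of the generation parameters, CFG is perhaps the least self-explanatory. It stands for classifier-free guidance: at each iteration the model makes one noise prediction that ignores the prompt and one that uses it, and the guidance scale controls how far the result is pushed toward the prompt-conditioned prediction. The following is a minimal sketch of that combination step only, using plain Python lists in place of real image tensors. The function name apply_cfg and the toy numbers are illustrative assumptions, not part of any web UI.

```python
def apply_cfg(noise_uncond, noise_cond, scale):
    """Blend unconditional and prompt-conditioned noise predictions.

    scale = 0 ignores the prompt, scale = 1 uses the conditioned prediction
    as-is, and larger values extrapolate further toward the prompt.
    """
    return [u + scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


# Toy predictions for a three-value "image" (hypothetical numbers).
uncond = [0.0, 1.0, 2.0]
cond = [1.0, 1.0, 0.0]
guided = apply_cfg(uncond, cond, scale=7.0)
print(guided)
```

A CFG schedule, as used for animations, simply varies this scale from frame to frame.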

Where did this come from?

First, let’s be specific about what we’re doing. We want to be able to tell the computer what we want, and end with a synthetic image of roughly what we described. As we’ll see later, though, the input doesn’t necessarily have to be text. Conditioning images, or a hybrid of image and text, are also possible. Image synthesis, or image generation, is a massive field of computer science that has been growing extremely quickly in recent years. To get a sense of the growth, let us look at a paper on the topic from ten years ago and try to compare it to what we have today. Spoiler: generating an entirely synthetic image of a desired subject was not feasible ten years ago. In 2014, Xu et al., in their paper “Deep convolutional neural network for image deconvolution”, attempted to refine blurry images using a deep neural network. Deconvolution, in optics and imaging, is the process of removing optical distortions caused by imaging hardware and lenses. As we can see below, the convolutional neural network (CNN) based approach is effective in creating an image that is clearer than the original. But the inputs here are a bit different. The authors start with a blurry image instead of a text or image prompt. Even when given a degraded version of the goal image, the same general method (neural networks) was not yet capable of producing a result comparable to what we see today.

Just for fun, let’s see what Stable Diffusion can do on an image refinement problem like this. It is an ill-conditioned problem to begin with, and the model we’re using was not made for this goal, but it will provide some perspective. There is an option to use an input image along with a text description, so let’s tell it to make the same image but less blurry. See below for the difference ten years makes in the world of image processing software.

How does it work?

To generate something like an image, the main enemy is always the dimensionality of the space. Simply put, there are a lot of numbers to deal with. Researchers have been using many approaches in recent years to combat this challenge: generative adversarial networks (GANs), variational autoencoders (VAEs), and diffusion probabilistic models (DMs). Each of these methods has its own advantages and disadvantages. The main challenge facing researchers is balancing the detail considered in training images, the ability to capture the breadth of available training data, and the training time required. The approach we are using here is actually a combination of GANs, VAEs, and DMs, so let’s go through them and try to understand.

Perhaps the most important algorithm of recent years, at least when speaking about computer vision and machine learning, is the GAN. GANs operate by using two neural networks competing against each other to produce synthetic data that looks nearly indistinguishable from the original training data. One network, called the creator or generator, learns to make a signal that looks like the real thing. The other network, called the critic or discriminator, learns to tell the difference between real and fake data. Pitting these two against each other has resulted in scary-smart networks. However, GANs have several serious flaws. They are difficult to train reliably, require long training times (weeks for large datasets), and sometimes fail to capture the full diversity of the training data available.
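
To make the competition concrete, here is a minimal sketch of the two opposing objectives, assuming the critic outputs a probability that each sample is real. It uses the common binary cross-entropy formulation with the non-saturating generator loss. The function names and the example scores are hypothetical, and no actual networks are trained here.

```python
import math


def binary_cross_entropy(prob, label):
    """Cross-entropy between a predicted probability and a 0/1 label."""
    eps = 1e-12  # keep log() away from zero
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def discriminator_loss(scores_real, scores_fake):
    """The critic wants real samples scored 1 and generated samples scored 0."""
    real = sum(binary_cross_entropy(p, 1.0) for p in scores_real) / len(scores_real)
    fake = sum(binary_cross_entropy(p, 0.0) for p in scores_fake) / len(scores_fake)
    return real + fake


def generator_loss(scores_fake):
    """The generator wants its samples scored as real (non-saturating form)."""
    return sum(binary_cross_entropy(p, 1.0) for p in scores_fake) / len(scores_fake)


# Hypothetical critic outputs: probability that each sample is real.
print(discriminator_loss([0.9, 0.8], [0.2, 0.1]))
print(generator_loss([0.2, 0.1]))
```

The same critic scores on generated samples lower one loss while raising the other, which is part of why balancing the two networks during training is delicate.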

Variational autoencoders are likelihood-based generative models that, like GANs, utilize neural networks in their structure. However, in contrast to GANs, VAEs excel at learning general trends in large datasets. This behavior is accomplished by using neural network components as an encoder and a decoder. The encoder reduces the size of the signal without losing too much information, neatly navigating around the dimensionality problem. The reduced signal is usually referred to as a latent space signal, which then needs to be decoded by the other side of the neural network. VAEs allow for relatively efficient image synthesis, but with lower image quality than can be achieved by GANs.
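
The “variational” part can be sketched in a few lines. In the standard formulation, the encoder outputs a mean and a log-variance for every latent dimension, a latent vector is sampled from that distribution, and a KL-divergence penalty keeps the latent space close to a standard normal. The snippet below assumes those encoder outputs are already available (the numbers are made up) and leaves out the encoder and decoder networks entirely.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample a latent vector z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu = [0.5, -1.0]        # hypothetical encoder output: latent means
log_var = [0.0, -2.0]   # hypothetical encoder output: latent log-variances
z = reparameterize(mu, log_var, rng)
print(z, kl_to_standard_normal(mu, log_var))
```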

Diffusion models are likelihood-based generative models, which puts them closer to the VAE than to the adversarially trained GAN. Where the DM differs is in the modeling approach: it learns to remove noise from a signal step by step, using a UNet as the central neural network instead of two competing networks. A UNet is a convolutional neural network with an encoder-decoder shape and skip connections between the two halves, and it was originally developed for biomedical image segmentation. The diffusion model excels at synthesizing high-quality images and has the option to trade image quality for increased image compression and speed. However, similar to GANs, diffusion models struggle when it comes to training and synthesis time.

Building upon these elements, various combinations and adaptations have been attempted with some success. We won’t go into too much detail here, but many two-stage image synthesis pipelines were developed between 2018 and 2022. In 2022, researchers from the University of Munich developed a so-called Latent Diffusion Model (LDM) pipeline that was able to produce very realistic images. Scarily realistic, in fact, and all without needing huge amounts of compute time to generate images. The training process was still intensive (a single epoch took over twenty hours initially), but the time to generate an image with a pre-trained model was relatively low. The cartoon version of the architecture is shown below (taken from Rombach et al. and their paper “High-resolution image synthesis with latent diffusion models”):

Where do the images come from?

A commonality among all forms of image synthesis is that a source of training data is necessary. In the case of Stable Diffusion and a few other methods, that source of training data is derived from the internet. In 2008, a non-profit organization called Common Crawl started to publish massive datasets of scraped webpage content: anything that could be seen from a webpage, including HTTP responses, images, text, and metadata. These sets are mind-bogglingly huge, with the latest being 86 TiB. Side note: TiB stands for “tebibyte”, as opposed to the more commonly marketed unit of “terabyte”, but the two are similar in size. The difference between the units is that the tebibyte uses the technically correct base 2 representation (1 GiB = 1024 MiB), and the terabyte uses the more commonly known base 10 representation (1 GB = 1000 MB). Anyway, the datasets are very large.
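
For the curious, the gap between the two units is easy to work out: a tebibyte is 2^40 bytes and a terabyte is 10^12 bytes. A quick sketch of the conversion (the helper name is just for illustration):

```python
TEBIBYTE = 2 ** 40   # bytes, base 2
TERABYTE = 10 ** 12  # bytes, base 10


def tib_to_tb(tebibytes):
    """Convert a size in tebibytes (TiB) to terabytes (TB)."""
    return tebibytes * TEBIBYTE / TERABYTE


print(round(tib_to_tb(1), 4))   # 1.0995
print(round(tib_to_tb(86), 2))  # 94.56
```

So the 86 TiB figure works out to roughly 94.6 TB, a difference of about 10%.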

The raw dataset from Common Crawl was then adapted by another non-profit, funded by Stability AI (the creators of Stable Diffusion), called LAION. LAION took the raw web data and formatted it into text-image pairs, i.e. the format required for neural network training. There are several datasets released by LAION, including LAION-400M and LAION-5B, which contain 400 million text-image pairs and 5 billion text-image pairs, respectively. It’s important to note that LAION does not distribute the actual images themselves, but libraries of URLs that point to the images. Researchers interested in using the dataset must download subsets themselves for training. LAION does some level of content filtering, but it’s mostly automated (due to the size of the data). They state openly that there may be disturbing, NSFW, and copyrighted images included in the datasets. To that end, if users find themselves, or other personally identifiable information, in the dataset, they can submit a claim form to have the offending entry removed. It is good of LAION to provide such a service; however, it appears it’s not as easy as they make it sound.

LAION has an interesting tool for exploring subsets of the data with text searches. It can be found at: https://rom1504.github.io/clip-retrieval

There has been a significant amount of pushback from artists, writers, and content creators regarding synthetic images, and with good reason. As the last image shows, anybody’s personal work and efforts can be included in these massive datasets without their consent. Not only does this present an issue for fair and ethical use of the datasets, but a user could easily create prompts that explicitly call out artists and then attempt to copy their work. You can ask for a photograph in the style of Sarah van Rij, or a painting in the style of Georgia O’Keeffe, and the image synthesizer will recognize the name from its training and produce something that looks very similar to the authentic work. Regardless of your stance on automation removing opportunities for qualified individuals, the use of copyrighted, personal, and private images in the training datasets is unethical.

A clear example of the issues inherent to this approach is the 2023 writers’ strike, ongoing at the time of this post. TV writers, authors, and many other content creators are protesting the use of their work in synthetically generated scripts and books. The techniques employed in synthetic text generation are, at their core, the same as those employed in synthetic image generation and, as a result, raise the same ethical issues. Until a generator is trained solely on voluntary contributions and synthetically generated work, be it images, text, or music, it should be viewed as an inspiration tool, a concept art generator, and an emergent research frontier, rather than as a commercial endeavor.

Applications and Examples

The potential applications of this technology, even those that respect the commercial and copyright issues, are extremely broad. Image upscaling is a relatively new option in the user interface for Stable Diffusion, and it allows users to dynamically increase the resolution of any image. Along with appropriate models, this option can upscale any input image to four times its original resolution, without disturbing the content too much. This technique could be especially handy for heavily cropped images that need a bit more resolution to remain readable.

Another fascinating application is using image synthesizers to generate training data for other, more specific, computer vision models. Julien Simon from Hugging Face (a company doing a lot of research on image synthesis and computer vision tasks) has a nice video on the topic. In the video, a new, and completely synthetic, set of images of boeuf bourguignon is added to an existing dataset of food types. One can easily imagine how this same technique can be adapted to other computer vision tasks. Say we want to be able to recognize a specific type of boat, from a limited viewpoint, without having a lot of reference images. We can train a Stable Diffusion model to generate realistic-enough images and then use those images to create a much more robust computer vision pipeline.

We’ve already seen a few samples of LDMs in action, but let’s try some more weird stuff. These models do require a few iterations (sampling steps) to get the image fully formed, usually 20 to 30. Each iteration builds upon the previous one to generate a nice crisp image, but what happens if we limit the number of sampling steps?

We actually get to see the model building a boat from scratch! Well, two boats, but still cool. With just one sampling step, the image is very noisy and doesn’t look like much of anything, analogous to the image in your brain the first millisecond after someone tells you to picture a boat.
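
Why is the starting point so noisy? Diffusion models are trained to undo a forward process that gradually mixes an image with Gaussian noise, and generation runs that process in reverse, starting from pure noise. The sketch below shows only the forward direction on a toy four-pixel “image”, using the standard closed-form expression. The linear schedule and its start and end values are common defaults assumed here for illustration; they are not necessarily what the web UI or its samplers use.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction after each step of a linear beta schedule.

    Requires num_steps >= 2.
    """
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Closed-form forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    return [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
        for x in x0
    ]


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
pixels = [1.0, -1.0, 0.5, 0.0]  # a toy four-pixel "image"
for t in (0, 499, 999):
    print(t, round(alpha_bars[t], 4), add_noise(pixels, alpha_bars[t], rng))
```

At the final step almost none of the original signal is left, which is the state the sampler starts from. Every sampling step removes part of that noise, so stopping after one or two steps leaves most of it in place.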

Something else cool we can do is condition the image synthesis on another image instead of text, just like we did with the blurry images. For example, if we take a standard QR code as the base and inform the model that we want a top-down view of a city, we get a city block with the same rough shape as the QR code!

Now, someone other than me could probably get a QR-code-inspired image that is actually scannable, but we’ll leave it there for now.

References

Peer-Reviewed References:

[1] Xu, Li, et al. “Deep convolutional neural network for image deconvolution.” Advances in neural information processing systems 27 (2014).

[2] Rombach, Robin, et al. “High-resolution image synthesis with latent diffusion models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

Web References: