Null-text Guidance in Diffusion Models is Secretly a Cartoon-style Creator

Accepted by MM 2023

Jing Zhao1, Heliang Zheng2, Chaoyue Wang2, Long Lan1, Wanrong Huang1, Wenjing Yang1

1National University of Defense Technology; 2JD Explore Academy


Abstract

Classifier-free guidance is an effective sampling technique in diffusion models that has been widely adopted. The main idea is to extrapolate the model in the direction of text guidance and away from null-text guidance. In this paper, we demonstrate that null-text guidance in diffusion models is secretly a cartoon-style creator, i.e., the generated images can be efficiently transformed into cartoons by simply perturbing the null-text guidance. Specifically, we propose two disturbance methods, Rollback disturbance (Back-D) and Image disturbance (Image-D), which construct a misalignment between the noisy images used for predicting null-text guidance and text guidance (subsequently referred to as the null-text noisy image and the text noisy image, respectively) in the sampling process. Back-D achieves cartoonization by altering the noise level of the null-text noisy image, replacing 𝑥_𝑡 with 𝑥_{𝑡+Δ𝑡}. Image-D, alternatively, produces high-fidelity, diverse cartoons by defining 𝑥_𝑡 as a clean input image, which further improves the incorporation of finer image details. Through comprehensive experiments, we examine the principle of noise disturbance for null-text guidance and find that the efficacy of the disturbance depends on the correlation between the null-text noisy image and the source image. Moreover, our proposed techniques, which can both generate cartoon images and cartoonize specific ones, are training-free and easily integrated as a plug-and-play component in any classifier-free guided diffusion model. The project page is available at https://nulltextforcartoon.github.io.
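The misalignment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `guided_noise`, `toy_model`, and all numeric values are hypothetical, the toy predictor merely stands in for a real diffusion network, and passing the same timestep `t` to both branches is an assumption. The sketch only shows where the null-text branch receives a different input (a rolled-back state for Back-D, a clean image for Image-D) while the text branch keeps the current noisy image.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical predictor signature: (noisy image, timestep, condition) -> predicted noise.
NoiseModel = Callable[[Sequence[float], int, Optional[str]], Vector]


def guided_noise(model: NoiseModel, x_text: Sequence[float], x_null: Sequence[float],
                 t: int, prompt: str, scale: float) -> Vector:
    """Classifier-free guidance with separate inputs for the two branches.

    Standard guidance uses x_null == x_text. Passing a different x_null
    creates the misalignment between the text and null-text noisy images.
    Assumption: both branches receive the same timestep t.
    """
    eps_text = model(x_text, t, prompt)  # text guidance
    eps_null = model(x_null, t, None)    # null-text guidance
    # Extrapolate toward the text prediction and away from the null-text one.
    return [n + scale * (c - n) for c, n in zip(eps_text, eps_null)]


def toy_model(x: Sequence[float], t: int, cond: Optional[str]) -> Vector:
    """Stand-in predictor for illustration only (not a real diffusion model)."""
    shift = 0.0 if cond is None else 1.0
    return [0.5 * v + shift for v in x]


x_t = [0.2, -0.4]         # current noisy image (text branch input)
x_rollback = [0.9, -1.1]  # stands in for the noisier state at t + Δt (Back-D)
x_clean = [0.1, -0.3]     # stands in for a clean input image (Image-D)

standard = guided_noise(toy_model, x_t, x_t, t=10, prompt="a koala", scale=7.5)
back_d = guided_noise(toy_model, x_t, x_rollback, t=10, prompt="a koala", scale=7.5)
image_d = guided_noise(toy_model, x_t, x_clean, t=10, prompt="a koala", scale=7.5)

print(standard, back_d, image_d)
```

Because only the input to the null-text branch changes, a wrapper of this shape leaves the model weights untouched, which is consistent with the training-free, plug-and-play claim above.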

An overview of the proposed methods.

Comparison between Rollback disturbance and Image disturbance

The second row on the right of Figure 4 shows that Image-D better preserves the intricate features of the input image and therefore achieves higher fidelity.

Analysis of the null-text noisy image

Figure 5 illustrates two additional settings for 𝑥_𝜎, viz., an unrelated image 𝑥_{𝑖𝑟𝑟} and an isotropic image 𝑥_{𝑖𝑠𝑜} that shares structural similarity with the input image. The degree of correlation between 𝑥_{𝑟𝑒𝑓} and 𝑥_𝜎 in various settings satisfies:

The results indicate that as the degree of correlation between the null-text noisy image and the input image 𝑥_{𝑟𝑒𝑓} increases, both the quality and the fidelity of the generated images improve.

The main experimental results

Figure 6: Results of (a) free generation using Back-D, (b) image cartoonization using Back-D, and (c) image cartoonization using Image-D. The results indicate that the proposed method enables free cartoon generation of portraits, animals, landscapes, and architecture, and also achieves image cartoonization.

Figure 7: Image cartoonization showcases diversity. Image disturbance (Image-D) yields a richer diversity of details.

Comparison with other cartoon generation works

Figure 8 compares our method with the cartoon image generation model Anything v3 [24] and the stable diffusion model v1.4 [24]. Anything v3 is trained extensively on cartoon images but fails to accurately generate cartoons for new concepts or scenes not featured in its training data, as seen in the cases "A photo of Robert Downey Jr." and "The city of lights". It also suffers from scenario construction failure (case "A rabbit is eating carrot") and over-anthropomorphization of animals, as illustrated in the case "A koala is climbing a tree". Meanwhile, the stable diffusion model v1.4 is steered by rewriting the guiding prompt "xxx" as "xxx in cartoon style"; its resulting images lack spatial information and appear too flat, as demonstrated in the first three cases of row 2 in Figure 22. Moreover, it remains prone to cartoonization failures, as shown in the case "A koala is climbing a tree". In contrast, our method generates more accurate, vivid, and artistically textured cartoon images.

Comparison with other Image cartoonization works

Figure 9: Comparison with other Image cartoonization works. The images cartoonized by AnimeGANv3 [4] and White-box [36] appear flattened, resembling drawings on a two-dimensional plane. In contrast, our method produces cartoonized images that are more vivid and lifelike, approaching the three-dimensional quality of animated scenes.

Restrictions on the sampling steps

Figure 10: Study on the number of DDIM sampling steps 𝑁. Values of 𝑁 larger than 60 yield a clean cartoon, while more steps (100 or above) enhance the cartoon effect.

The influence of text guidance

Figure 11: The influence of text guidance. Accurate textual guidance can enhance the diffusion model's conceptual understanding of the input image, thereby rendering the generated images more expressive. Conversely, mismatched textual guidance may introduce greater creativity into the generated output.

Application on ControlNet

As a plug-and-play cartoonization component, the proposed method can be readily applied to any classifier-free guided diffusion model. In this study, we investigated the efficacy of the proposed method in ControlNet [39]. Specifically, we used the Back-D method proposed in this work to cartoonize the results of the scribble-to-image task in ControlNet and present the findings in Figure 12. The outcomes indicate that the proposed technique is not only easily adaptable to other tasks but also produces a favorable cartoon effect.