Giffusion is a Web UI for generating GIFs and Videos using Stable Diffusion.
Giffusion supports using any pipeline and compatible checkpoint from the Diffusers library. Simply paste the checkpoint name and pipeline name into the Pipeline Settings.
Giffusion allows you to use the StableDiffusionControlNetPipeline. Simply paste in the ControlNet checkpoint you would like to load into the pipeline.
Giffusion follows a prompt syntax similar to the one used in Deforum Art's Stable Diffusion Notebook:
0: a picture of a corgi
60: a picture of a lion
The first part of the prompt indicates a key frame number, while the text after the colon is the prompt used by the model to generate the image.
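As a rough illustration of this syntax, the sketch below splits each line at the first colon into a frame number and a prompt. The function name `parse_key_frames` is hypothetical and is not taken from Giffusion's code; it only shows one way the format could be read.

```python
def parse_key_frames(prompt_text):
    """Parse 'frame: prompt' lines into a sorted list of (frame, prompt) pairs.

    Hypothetical helper for illustration only, not Giffusion's own parser.
    """
    key_frames = []
    for line in prompt_text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split only at the first colon so prompts may contain colons too.
        frame, _, prompt = line.partition(":")
        key_frames.append((int(frame.strip()), prompt.strip()))
    return sorted(key_frames)


example = """
0: a picture of a corgi
60: a picture of a lion
"""
key_frames = parse_key_frames(example)
for frame, prompt in key_frames:
    print(frame, prompt)
```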
In the example above, we're asking the model to generate a picture of a Corgi at frame 0 and a picture of a lion at frame 60. So what about all the images in between these two key frames? How do they get generated?
You might recall that Diffusion Models work by turning noise into images. Stable Diffusion turns a noise tensor into a latent embedding in order to save time and memory when running the diffusion process. This latent embedding is fed into a decoder to produce the image.
The inputs to our model are a noise tensor and text embedding tensor. Using our key frames as our start and end points, we can produce images in between these frames by interpolating these tensors.
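The idea can be sketched with NumPy using spherical interpolation between two key frame tensors. This is a minimal sketch, not Giffusion's implementation: the function names, the random stand-in latents, and the `(4, 64, 64)` shape are assumptions chosen for illustration.

```python
import numpy as np


def slerp(t, v0, v1, dot_threshold=0.9995):
    """Spherically interpolate between tensors v0 and v1 with weight t in [0, 1]."""
    v0 = np.asarray(v0, dtype=np.float64)
    v1 = np.asarray(v1, dtype=np.float64)
    # Cosine of the angle between the two tensors, treated as flat vectors.
    dot = np.sum(v0 * v1) / (np.linalg.norm(v0) * np.linalg.norm(v1))
    if abs(dot) > dot_threshold:
        # Nearly parallel tensors: fall back to linear interpolation.
        return (1.0 - t) * v0 + t * v1
    theta = np.arccos(dot)
    return (np.sin((1.0 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)


def interpolate_key_frames(start, end, num_frames):
    """Return num_frames tensors moving from start to end (both inclusive)."""
    weights = np.linspace(0.0, 1.0, num_frames)
    return [slerp(w, start, end) for w in weights]


# Stand-in noise latents for key frames 0 and 60 (shape is illustrative).
rng = np.random.default_rng(0)
start_latent = rng.standard_normal((4, 64, 64))
end_latent = rng.standard_normal((4, 64, 64))
frames = interpolate_key_frames(start_latent, end_latent, num_frames=61)
print(len(frames), frames[30].shape)
```

The same function can be applied to the text embedding tensors, so each in-between frame gets both an interpolated latent and an interpolated embedding.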
Creating prompts can be challenging. Click the Give me some inspiration button to automatically generate prompts for you.
You can even provide a list of topics for the inspiration button to use as a starting point.
Augment the image generation process with additional media inputs.
Image Input
You can seed the generation process with an initial image. Upload your file using the Image Input dropdown.
Audio Input
Drive your GIF and Video animations using audio.
In order to use audio to drive your animations:
- Head over to the Audio Input dropdown and upload your audio file.
- Click Get Key Frame Information. This will extract key frames from the audio based on the Audio Component you have selected. You can extract key frames based on the percussive, harmonic or combined audio components of your file.
Additionally, timestamp information for these key frames is also extracted for reference in case you would like to sync your prompts to a particular time in the audio.
Note: The key frames will change based on the frame rate that you have set in the UI.
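To see why the frame rate matters, consider how a timestamp in the audio maps to a frame number. The sketch below assumes the mapping is simply the timestamp multiplied by the frame rate and rounded; the helper name and the example timestamps are hypothetical, and Giffusion's own extraction may differ.

```python
def timestamps_to_key_frames(timestamps, frame_rate):
    """Map audio timestamps (in seconds) to frame indices at a given frame rate.

    Hypothetical helper: assumes frame = round(timestamp * frame_rate).
    """
    # A set removes duplicates when two timestamps land on the same frame.
    return sorted({round(t * frame_rate) for t in timestamps})


onset_times = [0.5, 1.0, 2.2]  # made-up timestamps, in seconds
frames_at_10_fps = timestamps_to_key_frames(onset_times, frame_rate=10)
frames_at_24_fps = timestamps_to_key_frames(onset_times, frame_rate=24)
print(frames_at_10_fps)
print(frames_at_24_fps)
```

The same moments in the audio land on different frame numbers at each frame rate, so prompts synced to the audio need to be re-keyed whenever the frame rate changes.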
Video Input
You can use frames from an existing video as initial images in the diffusion process.
To use video initialization:
- Head over to the Video Input dropdown.
- Upload your file. Click Get Key Frame Information to extract the maximum number of frames present in the video and to update the frame rate setting in the UI to match the frame rate of the input video.
You can resample videos and GIFs created in the output tab and send them either to the Image Input or Video Input.
Resampling to Image Input
To sample an image from a video, select the frame id you want to sample from your output video or GIF and click on Send to Image Input.
Giffusion also supports saving prompts, generated GIFs/Videos, images, and settings to Comet so you can keep track of your generative experiments.
Check out an example project here with some of my GIFs!
This section covers all the components in the Diffusion Settings dropdown.
- Use Fixed Latent: Use the same noise latent for every frame of the generation process. This is useful if you want to keep the noise latent fixed while interpolating over just the prompt embeddings.
- Use Prompt Embeds: By default, Giffusion converts your prompts into embeddings and interpolates between the prompt embeddings for the in-between frames. If you disable this option, Giffusion will forward fill the text prompts between frames instead. If you are using the ComposableDiffusion pipeline or would like to use the prompt embedding function of the pipeline directly, disable this option.
- Numerical Seed: Seed for the noise latent generation process. If Use Fixed Latent isn't set, this seed is used to generate a schedule that provides a unique seed for each key frame.
- Number of Iteration Steps: Number of steps to use in the generation process.
- Classifier Free Guidance Scale: A higher guidance scale encourages generated images that are closely linked to the text prompt, usually at the expense of image quality.
- Image Strength: Indicates how much to transform the reference image. Must be between 0 and 1. The image is used as a starting point, and the larger the strength, the more noise is added to it. This is only applicable to Pipelines that support images as inputs.
- Use Default Pipeline Scheduler: Select to use the scheduler that has been preconfigured with the Pipeline.
- Scheduler: Schedulers take in the output of a trained model, a sample which the diffusion process is iterating on, and a timestep to return a denoised sample. Different schedulers require a different number of iteration steps to produce good results. Use this selector to experiment with different schedulers and pipelines.
- Scheduler Arguments: Additional keyword arguments to pass to the selected scheduler.
- Batch Size: Set the batch size used in the generation process. If you have access to a GPU with more memory, increase the batch size to increase the speed of the generation process.
- Image Height: By default, generated images will have a height of 512 pixels. Certain models and pipelines support generating higher resolution images. Adjust this setting to account for those configurations. If an Image or Video input is provided, the height is set to the height of the original input.
- Image Width: By default, generated images will have a width of 512 pixels. Certain models and pipelines support generating higher resolution images. Adjust this setting to account for those configurations. If an Image or Video input is provided, the width is set to the width of the original input.
- Number of Latent Channels: This is used to set the channel dimension of the noise latent. Certain Pipelines, e.g. InstructPix2Pix, require the number of latent channels to be different from the number of input channels of the Unet model. The default value of 4 should work for a majority of pipelines and models.
- Additional Pipeline Arguments: Diffuser Pipelines support a wide variety of arguments depending on the task. Use this textbox to input a dictionary of values that will be passed to the pipeline object as keyword arguments, e.g. passing the Image Guidance Scale parameter to the InstructPix2PixPipeline.
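As an illustration of the Additional Pipeline Arguments setting, the sketch below turns a dictionary typed as text into keyword arguments. The helper `parse_pipeline_kwargs` and the use of `ast.literal_eval` are assumptions about how such a textbox could be read, not a description of Giffusion's code. `image_guidance_scale` is the keyword the Diffusers InstructPix2Pix pipeline uses for the Image Guidance Scale.

```python
import ast


def parse_pipeline_kwargs(text):
    """Turn the textbox contents into a dict of keyword arguments.

    Hypothetical helper: uses ast.literal_eval so only Python literals
    are accepted and no arbitrary code is executed.
    """
    text = text.strip()
    if not text:
        return {}
    value = ast.literal_eval(text)
    if not isinstance(value, dict):
        raise ValueError("Additional pipeline arguments must be a dictionary")
    return value


extra_kwargs = parse_pipeline_kwargs('{"image_guidance_scale": 1.5}')
print(extra_kwargs)

# The parsed values would then be forwarded to the pipeline call, e.g.
# pipe(prompt, image=image, **extra_kwargs)
```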
Giffusion generates animations by first generating prompt embeddings and initial latents for the provided key frames and then interpolating the in-between values using spherical interpolation. The schedule that controls the rate of change between interpolated values is linear by default.
You are free to change this schedule to either sine or curve using the dropdown.
Sine:
Using the sine schedule will interpolate between your start and end latents and embeddings using the following function np.sin(np.pi * frequency) ** 2 with a default frequency value of 1.0. This will produce a single oscillation that will cause the generated output to move from your start prompt to the end prompt and back. Doubling the frequency doubles the number of oscillations.
Sine interpolation also supports using multiple frequencies. An input of 1.0, 2.0 to the Interpolation Arguments will combine two sine waves with those frequencies.
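The sine schedule can be sketched as follows. The formula above does not show a time variable, so this sketch assumes the function is evaluated at a normalised position t running from 0 to 1 between two key frames, and that multiple frequencies are combined by averaging. Both are assumptions for illustration, as is the function name `sine_weights`.

```python
import numpy as np


def sine_weights(num_frames, frequencies=(1.0,)):
    """Interpolation weights in [0, 1] following a squared-sine schedule.

    Assumptions: t is the normalised position between two key frames, and
    several frequencies are combined by averaging their curves.
    """
    t = np.linspace(0.0, 1.0, num_frames)
    curves = [np.sin(np.pi * f * t) ** 2 for f in frequencies]
    return np.mean(curves, axis=0)


single = sine_weights(5)                 # one oscillation: start -> end -> start
double = sine_weights(9, (2.0,))         # two oscillations over the same span
combined = sine_weights(9, (1.0, 2.0))   # two sine waves combined
print(np.round(single, 3))
print(np.round(double, 3))
print(np.round(combined, 3))
```

A weight of 0 corresponds to the start key frame and a weight of 1 to the end key frame, which is why a single oscillation moves to the end prompt and back.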
Curve:
You can also manually define an interpolation curve for your animation using Chigozie Nri's Keyframe DSL which follows the Deforum format.
An example curve would be
0: (0.0), 50: (1.0), 60: (0.5)
Curve values must be between 0.0 and 1.0.
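To show how such a curve string could turn into per-frame interpolation weights, here is a minimal sketch that parses the example with a regular expression and fills the gaps with linear interpolation. It does not use the Keyframed library that Giffusion relies on; the regex, the function names, and the choice of linear interpolation between key frames are all assumptions.

```python
import re

import numpy as np

# Matches entries such as "50: (1.0)" and captures the frame and the value.
KEYFRAME_PATTERN = re.compile(r"(\d+)\s*:\s*\(\s*([-+]?\d*\.?\d+)\s*\)")


def parse_curve(curve_text):
    """Parse a 'frame: (value)' string into sorted frame and value arrays."""
    pairs = sorted((int(f), float(v)) for f, v in KEYFRAME_PATTERN.findall(curve_text))
    if not pairs:
        raise ValueError("No key frames found in curve string")
    frames, values = zip(*pairs)
    return np.array(frames), np.array(values)


def evaluate_curve(curve_text, num_frames):
    """Return one weight per frame, linearly interpolated between key frames."""
    frames, values = parse_curve(curve_text)
    if values.min() < 0.0 or values.max() > 1.0:
        raise ValueError("Curve values must be between 0.0 and 1.0")
    return np.interp(np.arange(num_frames), frames, values)


weights = evaluate_curve("0: (0.0), 50: (1.0), 60: (0.5)", num_frames=61)
print(weights[0], weights[25], weights[50], weights[55], weights[60])
```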
Giffusion allows you to use key frame animation strings to control the angle, zoom and translation of the image across frames. These animation strings follow the same format as Deforum. Currently, Giffusion only supports 2D animation and allows you to control the following parameters:
- Zoom: Scales the canvas size, multiplicatively. 1 is static, with numbers greater than 1 moving forwards and numbers less than 1 moving backward.
- Angle: Rolls the canvas clockwise or counterclockwise in degrees per frame. This parameter uses positive values to roll counterclockwise and negative values to roll clockwise.
- Translation X: Number of pixels to shift in the X direction. Moves the canvas left or right. This parameter uses positive values to move right and negative values to move left.
- Translation Y: Number of pixels to shift in the Y direction. Moves the canvas up or down. This parameter uses positive values to move up and negative values to move down.
Zoom Parameter Example
0: (1.05),1: (1.05),2: (1.05),3: (1.05),4: (1.05),5: (1.05),6: (1.05),7: (1.05),8: (1.05),9: (1.05),10: (1.05)
Angle Parameter Example
0: (10.0),1: (10.0),2: (10.0),3: (10.0),4: (10.0),5: (10.0),6: (10.0),7: (10.0),8: (10.0),9: (10.0),10: (10.0)
Translation X/Y Parameter Example
0: (5.0),1: (5.0),2: (5.0),3: (5.0),4: (5.0),5: (5.0),6: (5.0),7: (5.0),8: (5.0),9: (5.0),10: (5.0)
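The effect of these parameters on a single frame can be pictured as a 2D affine transform. The sketch below builds such a matrix with NumPy, rotating and zooming about the canvas centre and then translating. It is an illustration under assumptions: the function name is hypothetical, and the sign conventions follow plain pixel coordinates, which may differ from the directions Giffusion applies.

```python
import numpy as np


def affine_2d(width, height, zoom=1.0, angle=0.0, translation_x=0.0, translation_y=0.0):
    """Build a 3x3 homogeneous matrix for zoom, rotation and translation.

    Hypothetical helper. Rotation and zoom are applied about the canvas
    centre; angle is in degrees; translations are in pixels.
    """
    cx, cy = width / 2.0, height / 2.0
    theta = np.deg2rad(angle)
    a = zoom * np.cos(theta)
    b = zoom * np.sin(theta)
    # Rotation and scaling that keep the canvas centre fixed.
    rotate_zoom = np.array([
        [a, b, (1.0 - a) * cx - b * cy],
        [-b, a, b * cx + (1.0 - a) * cy],
        [0.0, 0.0, 1.0],
    ])
    translate = np.array([
        [1.0, 0.0, translation_x],
        [0.0, 1.0, translation_y],
        [0.0, 0.0, 1.0],
    ])
    return rotate_zoom @ translate


# Values taken from the parameter examples above, for a 512 x 512 canvas.
matrix = affine_2d(512, 512, zoom=1.05, angle=10.0)
centre = np.array([256.0, 256.0, 1.0])
print(matrix @ centre)  # the canvas centre stays in place

shift = affine_2d(512, 512, translation_x=5.0)
print(shift @ np.array([0.0, 0.0, 1.0]))  # the origin moves 5 pixels in X
```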
- Output Format: Set the output format to either be a GIF or an MP4 video.
- Frame Rate: Set the frame rate for the output.
Giffusion would not be possible without the following resources ❤️
- Prompt format is based on the work from Deforum Art
- Inspiration Button uses the Midjourney Prompt Generator Space by DoEvent
- Stable Diffusion Videos with Audio Reactivity
- Comet ML Project with some of the things made with Giffusion
- Gradio Docs: The UI for this project is built with Gradio.
- Hugging Face Diffusers
- Keyframed for curve interpolation