Way of the TTI Artist


Introduction

Preface

Common Questions

How many prompts can I have?

Why are my results noisy/chaotic?

How do I make my result more artistic?

Why are my colors uninteresting?

Why are my colors washed out / low contrast?

Why are my colors too dark?

How does palette mode work? I thought I would be the one to pick a palette?

Why do we need multiple palettes/why is that something we want? Why wouldn't we index a single palette?

How do the animation fields work?

What are good values for the near and far plane?

How can I decide the camera's 3D position at any given time with some function of time? (like other values)

What is the performance impact of X feature?

What is the memory impact of X feature?

How are people rendering such long animations or large images? Doesn’t it take forever?!

Techniques and Ideas

Effective PyTTI Workflow

Prompt Design

Scale and size

Conceptual Overlaps

Search engine test

Artistic Knowledge

Magic keywords

Personal Vocabulary Documents

Surprising Relationships

CLIP is a Natural Language Processor

More stable/less chaotic results without diffusion

Images

Camera motion creates composition

Border Stretch

Conclusion

Resources

Useful Links

Complementing Tools

Future of TTI / PyTTI

Better masks

Project Ideas

About this document

Acknowledgements

History

Contributions

Upcoming improvements

PyTTI parameter comparisons

Improved PyTTI workflow


Introduction

Welcome to the text-to-image (TTI) era, a brand new field of imagery and rendering. In this document you will learn how to quickly get over the basic hurdles of TTI and jumpstart your learning. It is mostly directed at PyTTI users; however, the techniques and principles should remain relevant across implementations for years to come.

TTI took off for real around mid-to-late 2021, which is early days relative to current computing power. At this stage it still takes hours to render a minute of animation, so iterating and experimenting is tedious and time consuming. As a result, it hasn’t yet reached the general public through tools like Photoshop and After Effects. To get the most out of generative AI, for now you will need some technical skills: you will likely spend a significant amount of time doing math and touching some code... Despite these challenges, a smart teenager could understand everything here and master it with practice and free time. In the future we can expect prompt brushes in Photoshop and the like, which should be pretty cool.

Preface

What is PyTTI?

PyTTI is a Google Colab notebook which implements text-to-image rendering (TTI) by combining two pre-trained neural networks, VQGAN and CLIP.

What is Google Colab?

Most TTI tools are currently implemented through Colab. It’s effectively an interactive notebook for Python coding. Code cells can be added and moved around to your preference, and simply serve to hold and execute code in steps. The main thing we care about here is that these notebooks run on Google servers with hardware that’s monumentally faster than what most people have in their computer.

Note: If you want to get serious with AI art, you’ll probably want Colab Pro ($10/m USD) to get the beefy P100 GPU. And with that, 100GB of space on Google Drive ($2.79/m) makes life much simpler. Let us know if you find a better deal elsewhere.

Disclaimer

TTI art is still in its infancy; it only really picked up around August 2021. Iteration is so slow that it discourages scientific investigation of the system in favor of more chaotic ways of ‘lucking out’ on the right parameter combinations. As such, not all the information presented here may be fully accurate yet. It’s a relay of my own and the community’s current discoveries, thoughts and ideas on how things might work. Some of it should be close to accurate, some is more theoretical. Don’t be afraid to experiment or go against the grain!

Common Questions

  1. How many prompts can I have?

There is no limit, and each additional prompt only incurs a slight performance cost!!

Each prompt is limited to 77 tokens (~330 characters or ~38 words)

Keep in mind having too many prompts without proper prompt design will hurt the results.

See the section on prompt design.

  2. Why are my results noisy/chaotic?

This is almost impossible to diagnose since there can be a number of different causes for this, and some could be weird interactions between two parameters.

 

  • Learning_rate directly increases noise in the image. It makes the algorithm search across a much longer distance at each step, i.e. it hallucinates harder. Used sparingly it can make the image more interesting, but too much and there is no understanding anymore, no order, the entire universe dissolves. The defaults in PyTTI are already well balanced; however, going slightly above or below can be interesting, e.g. within [0.015, 0.0325] for palette mode.

  • Smoothing weight is too low. The default value in PyTTI is 1, and the docs don’t make it clear that you can crank it up well beyond that. 10 would be a better default in my opinion.

  • Cut_pow is too high. Values around [0.2, 0.6] tend towards larger shapes while [0.6, 1.7] gives smaller details. Anything above 1.5 is probably a bad idea unless you know what you’re doing.

The real answer might be more complicated, as cut_pow interacts with the scale/proportion of your prompts. Say you want “A whole black cat”: ~0.45 might be better. If you want “Paws from a black cat”, it might be better to go higher, around ~0.85 (making CLIP’s eyes a bit smaller). This is mostly theorizing; it’s not fully clear yet how the prompts and the size of the cutouts interact.

Currently cut_pow cannot be animated over time. In the future it might be interesting to match cut_pow with our prompt scales to get more bang for our steps.

Note: I now have some indications that smoothing_weight works against learning_rate, i.e. if you increase one, you should increase the other as well to compensate.

  3. How do I make my result more artistic?

With regular language we can’t really reach into artistic parts of the CLIP network. Without artistic prompts, the result looks like a boring popart collage of real life imagery and photographs. If you want the result to look like a beautiful painting, you absolutely must use some art prompts. Some examples:

  • By Vincent Van Gogh, in the style of Vincent Van Gogh

  • Oil on canvas, acrylic, <art mediums>

  • #pixelart, #lowpoly, dither pattern

  • #blender #eevee, #photorealism, etc.

  • Unreal Engine

  • Trending on art station

  • Technicolor

Your choice of artistic prompts is what results in an AI artist’s unique style. Somewhere inside of CLIP there is a great understanding of artists and aesthetics, and it’s our job to research the world of human arts to incorporate it.

More in the section on prompt design.

  4. Why are my colors uninteresting?

  • Smoothing weight is too high. (washed out look / missing colors)

  • A prompt is pulling the whole image towards a specific palette. (all color issues) Remember that colors are linked to images and shapes, so a prompt can also affect the overall mood of the image. Even an artist prompt like “By Van Gogh“ with a small weight can change the entire palette and style of the image to be more in Van Gogh’s style, along with his potentially saturated colors and noisy brush strokes/canvas textures. Something like “Light passing through trees landing on the ground” will pull the entire image towards a more green hue as the most common place to see this phenomenon is in the middle of forests. Some negative prompts can help with this.

  • Range of colors and contrasts in the init image is too small. In limited palette mode (and possibly Unlimited as well) the init image makes a massive impact on the initial palette. This is why without an init image, PyTTI initializes with random noise from 0 to 1. Try increasing the contrast of your init image in an image editor, or shifting the hue of highlights and shadow, etc. Otherwise, you must let the colors develop over a longer period of time, preferably with animation.

  • Not enough palettes. I have experimented with everything from many palettes with few colors (24x10) down to two palettes with 84 colors (2x84). It seems that fewer palettes restricts the overall color range of the image. More experimentation is required, however.

If you have not messed with your settings too much, most likely it’s a prompt issue.

  5. Why are my colors washed out / low contrast?

It’s unlikely to be your prompts. Instead, check your settings:

  • Gamma too low / HDR weight too high. The HDR weight tries to force the whole image towards the gamma value. I would leave HDR weight = 0.01 and gamma = 1.0 unless you’re experimenting.

  • Smoothing weight is too high. This parameter tries to keep the palette colors closer to one another. [1, 6] will give more detailed (and therefore more contrasting) results, while values above that will progressively get more washed out. Around 75 the image is almost monochromatic. On the other hand, a negative smoothing weight will result in a deep fried style with a lot of colors.

  • I believe higher gamma can increase contrast a bit, however it often gives the image a weird look. Anywhere between [1, 1.5] appears to be usable for various results. Below 0.8 hasn’t been too great for me, but it’s worth exploring more.

  • For limited palette mode, you might not have enough palettes or enough colors per palette. 4-5 palettes with ~16 colors appears to be a good starting point. It appears that having 1 or 2 palettes will tend towards monochrome results.

  6. Why are my colors too dark?

  • Gamma may be too high.

  • Prompts are pulling too much into dark colors.

  • In palette mode, when the camera moves too quickly, all the detail in the image can get lost, which also seems to crush the palette towards black. For now it’s important to keep the camera motion in balance with the rest of the parameters.

Generally it’s a mix of prompts and camera motion crushing the palette, but especially prompts. Tends to happen with dark concepts, like outer space.

  7. How does palette mode work? I thought I would be supplying the palette?

The palette is under the same influence from CLIP as the shapes of the image. The colors are the image. Colors are as fundamental to CLIP's image recognition as shapes and patterns. In other words, the colors are linked to the prompts.  The gradient may pass through different colors in a way that satisfies the prompt and how CLIP has learned that colors should flow. E.g. a prompt like "Grass field at night" might lead to a palette that has some dark green highlights and blueish shadows.

This is why painting colors with prompts is hit or miss. The colors are inherently tied to the concepts. As such, using color prompts may randomly alter the image with new shapes that you didn’t expect, or simply introduce noise. This is the same problem as artist prompts resulting in text appearing because of signatures, or a popular piece like Starry Night appearing randomly if you simply prompt for “Van Gogh”. It might be a better idea to think in terms of textures/objects that happen to have the colors you want. If you want black, use “black curtain 4k” instead. This will integrate some curtain texture, but you’ll have to live with it. Additionally, it’s more effective to manipulate colors through phenomena, e.g. “The edges are reflecting beautiful orange colors!!”

If you want specific colors, it’s far more efficient to give it an init image with that color theme. A solid black init image will make the whole mood of the image a lot darker, while a white init image will make it very light. This is why the run starts out with greyscale noise by default: it gives an OK result. The impact of the init image on the final colors and composition of the image can’t be emphasized enough.

  8. Why do we need multiple palettes/why is that something we want? Why wouldn't we index a single palette?

This is not fully understood yet, but it seems that adjacent colors in a palette must be related, so as to create a logical gradient that’s relevant to the prompts. This means that having too many colors per palette slows down how quickly colors can change. With fewer colors, the gradients have less spread/nuance and don’t require as many steps to cross into different hues. There is a balance to strike, but ideally you want to maximize palettes and minimize colors; too few colors, however, and you introduce noise. Up to 70 palettes with 7-10 colors are good numbers to stick to, providing enough shades for most gradients. If you push the color count higher than that, you’ll notice that it starts to take longer for colors to emerge from the pure black & white init noise.

  9. How do the animation fields work?

They are simply Python expressions. All Python math functions are available, and there is a parameter t representing elapsed time in seconds: t = steps / (frames_per_second * steps_per_frame)

Additionally, almost all weights and stops can be animated the same way. (prompt weights, stabilization weights, etc.) Huge amount of power there.

If you want, you can completely let go of the scene system in PyTTI and manually animate the weight of each prompt in a single scene, allowing those prompts to come and go in much more complex ways!
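As a rough illustration, here is how such expressions behave when evaluated once per frame. The field names translate_z_3d and rotate_3d below are assumptions for the sake of the example; check your notebook for the exact names.

```python
import math

# Hypothetical illustration of PyTTI-style animation expressions; the field names
# (translate_z_3d, rotate_3d) are assumptions, not guaranteed to match your notebook.
frames_per_second = 12
steps_per_frame = 50

def t_at(steps):
    # Same definition as in the text: elapsed time in seconds.
    return steps / (frames_per_second * steps_per_frame)

# Expressions you might type into the animation fields:
translate_z_3d = "50 + 25 * sin(t / 4)"   # ease the forward motion in and out
rotate_3d      = "0.05 * cos(t / 8)"      # slow sway left and right

# The notebook evaluates each field once per frame with t bound to the current time.
namespace = {"sin": math.sin, "cos": math.cos}
for steps in (0, 600, 1200):
    namespace["t"] = t_at(steps)
    print(round(namespace["t"], 2),
          eval(translate_z_3d, namespace),
          eval(rotate_3d, namespace))
```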

  10. What are good values for the near and far plane?

It’s important to understand what a depth buffer is in 3D graphics and what it’s used for; search online if you’re unfamiliar. The near/far values control the range of distances that the depth of the image will span. A larger range means more depth in the animation. Near is the minimum distance to the camera, and far is the maximum distance, i.e. how ‘deep’ the image can be.

Near=1 and Far=10000 are good defaults to use. These values should be fine to play with as desired, going up to 100k if you want.

(?) In 2D, translation is specified in pixels. In 3D, it’s relative to the size of the depth buffer. A shorter depth range means your translations must be scaled down accordingly. Moving into the screen by 1000 units every frame will be slower with a larger depth range, and faster with a shorter range. This is also true for X/Y, so be careful.
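As a rough illustration of the scaling: with near=1 and far=10000 the depth range spans roughly 10000 units, so moving 1000 units per frame covers about a tenth of the range each frame. With near=1 and far=1000, the same 1000 units per frame would traverse the entire range in a single frame, so the apparent motion looks roughly ten times faster. When you change the near/far planes, scale your translation values proportionally.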

  11. How can I decide the camera's 3D position at any given time with some function of time? (like other values)

The camera animation parameters do not animate the position of the camera; rather, they set the movement/velocity for the current frame. This can be counterintuitive if you come from a 3D background, but it’s important to remember there is no camera to position, nor any scene to navigate, only a 2D image.

  1. AdaBins is used to guess the depth of the image for each pixel.

  2. We move each pixel on the image by the x/y/z from the animation parameters.

  3. We fill the empty area with the infill_mode method chosen.

Therefore all camera animations are simply displacing the current image and filling the holes as best as we can.

Of course this means if you move the camera into the scene and then back to its starting point, you probably won’t see the same scene you saw initially. (It might be possible to do it though by creating a virtual camera with tracked position, and using that as a 'prompt' or 'seed', if that makes any sense)
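To make the mechanics concrete, here is a minimal sketch of the displacement idea in Python. This is not PyTTI’s actual code: the depth map is faked (PyTTI uses AdaBins), the parallax model is heavily simplified, and hole filling is left out.

```python
# Minimal sketch of per-frame camera "motion": estimate depth, displace pixels by the
# frame's velocity, then fill the holes. NOT PyTTI's implementation; the depth map is
# faked (PyTTI uses AdaBins) and the parallax model is heavily simplified.
import numpy as np

def fake_depth(h, w, near=1.0, far=10000.0):
    # Stand-in for AdaBins: pretend the scene recedes towards the top of the frame.
    return np.linspace(far, near, h)[:, None].repeat(w, axis=1)

def displace_frame(image, dx, dy, dz):
    h, w = image.shape[:2]
    depth = fake_depth(h, w)
    parallax = 1.0 / depth                       # nearer pixels move more for the same dz
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    src_x = xs - dx - dz * (xs - w / 2) * parallax
    src_y = ys - dy - dz * (ys - h / 2) * parallax
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(image)                   # holes stay black in this sketch;
    out[valid] = image[src_y[valid].astype(int), src_x[valid].astype(int)]
    return out                                   # infill_mode would fill the holes
```

Calling displace_frame(frame, 0, 0, 5) every frame gives the familiar zoom-in smear; PyTTI fills the holes with the chosen infill_mode and then runs its optimization steps on the displaced image.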

  12. What is the performance impact of X feature?

  • Prompt count: minimal, grows linearly.

  • Prompt length: minimal, grows linearly.

  • Image size: severe, grows quadratically.

  • Animations: almost zero. Slightly slower with 3D, but not by much.

  • Flow stabilization: medium, static regardless of value (?)
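To put “grows quadratically” in perspective: going from 512x512 to 1024x1024 means four times as many pixels, so expect each step to take roughly four times as long. This is a big part of why rendering small and upscaling afterwards is so common (see the question on long animations below).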

Other significant or surprising values should be added here

  13. What is the memory impact of X feature?

  • 3D animation: moderately high.

  • Number of cutouts: severe.

Other significant or surprising values should be added here

  14. How are people rendering such long animations or large images? Doesn’t it take forever?!

For longer animations it’s normal to be waiting a few hours, anywhere from 2 to 24 hours depending on the length of your video.

But wait, there’s a dark secret: everyone is upscaling!!! We render at a small resolution or low FPS, then take the result to a different AI specialized in upscaling pictures or interpolating between frames, and these models are much faster. They have been trained on real-world data just like VQGAN and CLIP, so they fill in detail far better than classic upscaling.

  • FPS: 12 images per second seems to be a standard among artists.

  • Resolution: ~512

Check the section on complementing tools for upscaling and frame interpolation. Some artists upscale as much as 4x! ESRGAN tends towards a specific visual look, so going 2x with some nearest-neighbor scaling on top can be better for certain aesthetics. If you like lo-fi or pixel art styles, it may be better to just render at low resolution and upscale entirely with classic nearest neighbor or bilinear.
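For a concrete (purely illustrative) example: a 30-second clip rendered at 12 fps and 512x288 is only 360 frames; interpolating 2x with RIFE brings it to 24 fps, and a 4x Real-ESRGAN pass brings it to 2048x1152, both in a small fraction of the time the original render took.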

Maybe in the future we can have CLIP-guided ESRGAN for sharper images? 😉

Note: all of this is simply because 2021 hardware is just not enough. This is a new tech and we’re back in the 90s like we were with early 3D! We have to use some tricks and hacks to make it look good. For now, we can look forward and dream about realtime 60fps AI rendering, an upcoming new era in video games and entertainment. We’re right on the edge, things are about to go nuts: AI brushes in Photoshop. Universal video-game texture packs. The entire NES library infinitely reimagined with user-shareable TTI presets. 3D AI rendering with infinite detail filling, down to the grains of sand. God simulator, the game. Mental health treatment using AI imagery in VR. AI addiction...?


Techniques and Ideas

Effective PyTTI Workflow

  • Out of the box, it appears that the best workflow is to modify your json in a text editor, copy it back into the Load Settings cell, and then execute the Run cell.

  • GDrive Mounting: if you create a blank notebook in your gdrive and copy all the cells from PyTTI into it, you can perma-mount your Google Drive to it. Highly recommended! (A minimal mount snippet is shown after this list.)

  • Drafting: when you begin a new project, set the image size a bit smaller in the beginning. Remember what seed you used! This can help you decide which prompts work best before committing to a 2-hour high-res render. Treat it like sketching out ideas before the final piece; your iteration speed will be much better this way. Wasting 2 hours rendering something that doesn’t look good is really annoying!

  • Mix prompts sparingly: until you get really good, test only 1 or 2 prompts at a time, and make incremental changes with the seed locked. That will help you understand a bit better how the prompts affect the image. Usually you won’t land on the perfect prompts immediately; it’s better to work your way up to the final piece by continuously tuning each prompt until you understand how it integrates with the other themes, progressively adding more prompts little by little.

  • Use a timeline tool / GUI: You can use Pyttipanna to develop a timeline of your animation using a web GUI.

  • Keyboard Shortcuts: Colab has excellent shortcut management, and a long list of them at that. To open the shortcuts and edit them, press CTRL-M, release it, then press H. Here is a list of the most useful ones for me:

    • Focus previous/next Cell

    • Focus the last run cell

    • Move selected cells up/down

    • Run cell and select next cell

    • Interrupt execution

    • Restart runtime

Soon, I will share a set of notebook modifications and custom cells that will massively boost your productivity by cutting down on extra steps and tedious navigation around the notebook. 
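For reference, mounting Google Drive from a Colab cell looks like this (the output path below is just an example, not a PyTTI requirement):

```python
# Run this in a Colab cell to mount your Google Drive, so renders and settings
# survive the runtime being recycled.
from google.colab import drive

drive.mount('/content/drive')

# After mounting, point your output/settings paths somewhere under
# /content/drive/MyDrive/ (this folder name is just an example).
output_dir = '/content/drive/MyDrive/pytti_renders'
```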

Prompt Design

This is probably the biggest skill factor in this art. CLIP has learned concepts in a certain way and you need to know its language to get where you want to inside this network. It’s like its own world, a complex network of train rails and junctions that will get you absolutely anywhere, but you need to produce the right tickets to get to your desired destination.

It’s harder than it sounds; you can’t just throw concepts together at random and get a high quality art piece. Try a single prompt with just one word to see what ground zero of TTI looks like: without guidance and skillful prompt design, you don’t get anything interesting on the screen! This is where you need to start thinking like an artist. Unfortunately, artists don’t share their scenes too often, since those are the heart of their creations, their style!

Scale and size

There is a finite number of pixels in every image, so it's impossible to have infinite detail. If you have too many prompts, or you try to put in too many objects/textures/things that take a lot of space to depict, the render might struggle as every concept competes for that limited space. Some other settings are important here for getting an image balanced between small and large details, but with some prompts it can be effective to use an adjective/specifier of scale, e.g. seen from a distance, seen from above, imposing, macro, closeup.

Cutout number and power are perhaps the most important settings in this category. If you want more cohesion/bigger shapes, you want fewer, bigger cutouts.

Conceptual Overlaps

A highly effective technique to increase the cohesion of a scene is to design the prompts so as to maximize conceptual overlap. For example, a prompt like “circle” will fit naturally with “eyes”, which are somewhat like circles.

You can do some clever overlaps, for example with these scenes (prompts separated by | like in PyTTI):

  • Medieval house with a beautiful stone chimney | Thick smoke coming out of a chimney | Smoke in the shape of aurora borealis | Aurora clouds | Aurora borealis

  • Bicycle in a medieval street | Stone pavement | Wheel of fire | Spinning fire | Vortex of fire | Eye of the hurricane seen from space | Clouds | Skid marks left by car | Smoke trail emanating from pavement

The more distant your prompts are, the fewer of them you can have before things devolve into anarchy. When there is a clear path running through each prompt and linking them together, there is no in-fighting: every prompt falls naturally into place and fits in harmony with the image. It can be effective to design scenes or prompt sets as chains of related concepts, and even better if you can make the chains loop.

Instead of painting with colors, we paint with concepts. The more distant your concepts are in the network, the more contrast in the overall style and mood of the piece. When you carefully design the overlaps, you can stack far more prompts than usual.

Search engine test

CLIP was trained on text/image pairs from the internet, so testing your prompts in a search engine gives a rough idea of the imagery you’ll activate in the network. In general, it's pretty effective to treat prompts the same way you would search with keywords on Google Image. It helps to add more adjectives and specifiers to restrict the search space to something more specific.

However, be mindful that CLIP does understand natural language to some degree, it’s not simply a keyword search system like Google. It has some of those same characteristics, but it’s also much more than that. The clip-retrieval search tool may be more representative than Google of the pictures that will guide the render for a given prompt.

It can also help to read OpenAI’s announcements of CLIP and DALL-E to better understand their capabilities and how they work. CLIP, for example, understands different civilizations and time periods very well, so things like “futuristic” or “from the 40s” can be used to great effect.

Artistic Knowledge

Finally, your art degree can be put to good use. CLIP draws from everyday pictures more readily than from art. As such, it’s useful to gather knowledge of various art techniques in fields like photography, painting and architecture, of famous artists, etc., in order to incorporate them into your prompts.

The Google Arts and Culture website is so good for this, you would think it was made just for us in anticipation.

Scouring Wikipedia is a great way to discover weird relics of the past that will look cool.

Note that with stylistic prompts, a little can go a long way. Treat them like colors for a painter or seasoning for a chef: even a very light-weight “Van Gogh” prompt can have a subtle effect on the image without being recognizable as his style.

Magic keywords

Some keywords can drastically affect the style and mood of the image without introducing new shapes or concepts. 

  • “Unreal Engine” seems to make the image more realistic.

  • #macro and #pixelart can be highly effective, as they’re used on social media where tons of artists share their work under those keywords, meaning the data for them is rich and nuanced.

  • Most artists act as magic keywords.

  • “Trending on Art Station” is a classic!

Personal Vocabulary Documents

Create your own personal idea document: cool textures, artist names, objects, keywords, etc. Try to make it personal; write down things that represent your identity, things that have sentimental value, things you like, interests or hobbies, etc.

Skim through this list whenever you want ideas for your next project. Even if you don’t use these concepts specifically, it will act as a ‘seed’ for your imagination, reinforcing your own distinct style and ‘artist brand’ each time you review it.

Surprising Relationships

This is something that catches new TTI artists off-guard quite frequently. The more vague a prompt is, the more likely it is to act unexpectedly. Objects and concepts that are often seen together can indirectly sneak into the picture through a vaguely related prompt.

Some typical examples…

  • Desert, oasis, dune, sahara —> cactus

  • Cities, town, public places —> humans

  • Famous Artists —> Text (because of the signature)

  • Famous Artists —> Famous piece by the artist (e.g. Starry Night with “By Van Gogh”)

  • Single-word concepts —> Text of the word itself

Learning to predict these surprising relationships unlocks greater control over the image, since you can use negative-weight prompts to remove them (with varying success).

CLIP is a Natural Language Processor

If you are new to AI, you may not realize the gravity of the situation. Through sheer CO2 emissions, we have created a system which can recognize and associate language with imagery. Some examples that work well:

  • X in the shape of Y

  • X beneath Y

  • Human wearing X

Our prompts can be more than simple text queries like on Google. We can make them full blown sentences with complex relationships, so long as CLIP knows about the concepts.

And it’s important to remember that CLIP always does its best no matter what. When grammar and logic fall apart, it runs as far as it can with what it’s given, and some crazy things happen. Here are some excellent prompts for abusing this:

  • Interlocking Triangle in a revolving whirlpool of space galaxies circling each and inverting on itself like in a triangular mirror.

  • An holographic human is seen through a mirror, and all the walls are mirrors

  • Amazing! I can see the galaxy reflecting in the coffee cup, and inside the coffee cup is inside the galaxy

  • Diagonal lines in a swimming pool aligning with the current like a motion field, everything is moving in a diagonal.

  • Imaginary Circles and Triangle shapes jumping around like grasshoppers (+ negative prompt on ‘insect’ or ‘grasshopper’, to be sure)

  • Big glass ball or beads made out out of tryptic balls made out of glass beads (repetition has an interesting effect)

  • Holographic metallic foil appearing out of thin air like iridescence

  • Recursive structure of pyramidal humans standing in a pyramid formation like a triangle shape made out of sierpinski humans

More stable/less chaotic results without diffusion

Images

VQGAN+CLIP can produce large shapes and avoid the psychedelic style, even without masks or init images. Composition still won’t be good, but this technique has its uses nonetheless. Instead of trying to get a full picture in a single run, do this:

  1. First, render with 3-4 prompts max and tune the settings to avoid small detail:

  • Fewer prompts.

  • Broad prompts that tend to larger silhouettes with implicit textures, like “cat” instead of “fur”.

  • More smoothing

  • Fewer cutouts

  • Smaller cut power

  2. Let it run for 100-200 steps and play with the prompts until you like where it's going: the colors, the shapes, etc.

  3. Once you land on that first good render, save it and use it as the init image for your next run. Tune your settings for finer detail.

Mapping the stabilization weights to a function of time is also effective, but it takes some tweaking to get the right curve and numbers. This needs more exploration; a rough sketch of the two-stage idea follows.
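As a sketch of what this might look like in practice (the parameter names below mirror ones mentioned in this document, but the exact keys, values, and the stabilization field name are assumptions, not tested recommendations):

```python
# Illustrative only: a "draft" pass tuned for big shapes, then a "detail" pass that
# starts from the saved draft as init image. Values are ballpark guesses.
draft_settings = {
    "prompts":          "medieval house | thick smoke | aurora borealis",
    "cutouts":          40,        # fewer cutouts -> bigger shapes
    "cut_pow":          0.4,       # low cut_pow -> larger cutouts
    "smoothing_weight": 10,
    "steps":            200,
}

detail_settings = {
    **draft_settings,
    "init_image":       "draft_200.png",   # the saved result of the first pass
    "cutouts":          120,
    "cut_pow":          1.0,                # smaller cutouts -> finer detail
    "smoothing_weight": 3,
}

# A stabilization weight that starts strong (holding onto the init image) and
# relaxes over time; the exact field name in your notebook may differ.
stabilization_weight = "max(0, 2 - t / 10)"
```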

Camera motion creates composition

With time, some AI artists have discovered that their renders are leagues ahead once the animation runs for several minutes. They get a quality of image (composition, balance, complexity, originality, …) that seems impossible to achieve when going specifically for still images by running from scratch for 400-500 steps with no motion. (without init images, that is)

The simple answer is that camera motion has a drastic effect on the picture. More specifically, anything that ‘artificially’ pushes pixels around on the image in an uneven manner, e.g. 3D camera with AdaBins. The faster the movement, the more powerful the effect on composition.

Next time you come across an AI animation piece, pay attention to the camera motion. Most likely you will notice that the motion seems present in the overall look of the image. In the spinning 3D zooms the composition tends to spiral, depending on the speed. 

Detailed answer

As we know, all renders start out with random noise, and the VQGAN+CLIP algorithm consists of ‘pushing’ pixels around an abstract network. Pushed pixels naturally follow the flow of the previous image, and the learning_rate fights against this by searching for new flows that match up. (This is why starting from scratch with noise always gives a more boring result: the entire flow of the image is decided in the first few steps, and depends on the size of CLIP’s cutouts and how high the learning_rate is. Ideally you would have a very high learning_rate and large cutouts in the first few steps and then quickly taper down, but these parameters currently can’t be animated.)

Camera animation, on the other hand, pushes pixels according to AdaBins’s depth map. After 50-100 steps, the pixels should mostly be in positions that feel good to CLIP, but then the PyTTI animation forces them out of those positions, to where CLIP is no longer happy with the placement. Now AdaBins is in control, since its decisions are felt instantly at each frame through the camera smear. The result is a sort of interplay in which the two networks disturb one another intermittently: with a slow, constant animation the pixels fall into place and are disturbed ever so slightly every frame, while with a stop/go camera it’s a back and forth between them.

Before 3D depth smears, the image appears like a flat mosaic or collage of random themes and shapes.

After 3D depth smears, the clusters have connected up into larger structures that flow diagonally with the camera.

Border Stretch

When the camera moves, there are no outside pixels to bring into view, so instead the outermost pixels end up stretched on the image.

(Example images: horizontal stretch, vertical stretch.)

These border stretches are generally bad, but they have some artistic value as well. They’re ugly right when they happen, but with a stop-and-go camera they get some 40-50 steps to refine into an interesting image that flows with the stretches; the way VQGAN+CLIP works, it’s highly likely to keep the original angles of the image.

Horizontal stretches in particular could have some value for AdaBins as they add some horizontal flow to the image. That can create the illusion of horizon lines which could be an important indicator that AdaBins uses to estimate depth, i.e. the depth map appears more nuanced and has a very nice gradient from close to far away.

Vertical stretches are generally pretty ugly, but if they happen occasionally and with the right timings, it gives a sort of tilt-shift/macro style where everything is compressed vertically like in the image above.

Conclusion

When designing an animated render and imagining the scene, you should consider the camera motion and how it will impact the flow of the image. If you want something like a landscape, incorporate more horizontal panning and less vertical.

Even if you are making still images, it can be better to sample them from an animation just for that composition aspect. We can then take the smeared shot as an init image and run some steps on it with the same prompts to clean up the shot or remove the motion blur effect.

Currently, camera motion seems to be the only way to influence composition in a VQGAN+CLIP animated render. In the future, it might be possible to create custom pixel patterns to stamp out instead of stretching the outermost pixels, thus forcing a specific angle (or even a lack of specific angles, using a noise pattern just like PyTTI does on a fresh run). Or we may be able to use diffusion in conjunction, another technique which understands composition to a much better degree.

Resources

Feel free to suggest any other tools or links you have

  • Video interpolation / Motion smoothing

    • AnimationKit - Colab notebook combining Real-ESRGAN (image upscaling) and RIFE (video frame interpolation) [Colab]

    • Flowframes: Windows GUI for video interpolation using DAIN (NCNN) or RIFE (CUDA/NCNN) [GitHub]

Great tutorial format guide by [email protected]

Future of TTI / PyTTI

  1. Better masks

PyTTI 5 added support for masking weights with images, which gives infinitely more control over where prompts apply in an image. But if you try them, you may come to think that there is a bug or that they aren’t working quite right. The problem is that CLIP ‘sees’ the image through its cutouts, which are rectangles distributed randomly across the image, so a cutout will often include only part of a prompt’s mask. This makes the borders of the mask fuzzier. As a result, it’s presumed that the efficiency of a mask decreases as the cutouts get larger. (lower cut_pow)
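A toy way to see this (purely illustrative, 1D, and not how PyTTI actually samples cutouts): sample random “cutouts” on a line where the left half is masked, and measure how pure each one is. Larger cutouts (lower cut_pow) straddle the mask border more often.

```python
import random

def average_mask_purity(cut_pow, mask_edge=0.5, samples=20000):
    # 1D toy model: the "image" is the interval [0, 1], the mask covers [0, mask_edge).
    # A cutout is a random sub-interval whose size shrinks as cut_pow grows.
    purities = []
    for _ in range(samples):
        size = max(random.random(), 1e-6) ** cut_pow   # higher cut_pow -> smaller cutouts
        x = random.uniform(0.0, 1.0 - size)
        overlap = max(0.0, min(x + size, mask_edge) - x)
        if overlap > 0:                                 # only cutouts that touch the mask
            purities.append(overlap / size)
    return sum(purities) / len(purities)

# Small cutouts mostly land fully inside the mask; big ones mix masked and unmasked
# regions, which is why mask borders look fuzzy with low cut_pow.
print(average_mask_purity(1.5))   # closer to 1.0
print(average_mask_purity(0.3))   # noticeably lower
```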

Project Ideas

Here are some exclusive ideas if you are looking for a future project. You may need some programming skills or modifying PyTTI to achieve these feats. It can still be good to read about them to get a feel for what is possible, to challenge your creativity.

  • AI texture painting: model the object in Blender and render a turntable animation of it in black and white. Use vertex paint to paint masks, each of which gets split out and rendered to a black-and-white mp4, and use these as prompt masks. Doing this, you can paint textures onto the scene using AI, maybe. The stabilization weights should help a lot for this. It will probably require the improvements to cutout placement planned for future PyTTI versions to work well.

  • Spherical Prompts: quantize the directions around a sphere into some 32-64 buckets, and map a prompt or scene to each one. This way you can make a replica of Earth, or an abstract world that has a greater level of consistency as you rotate around it. With camera movements that match the rotation, this ought to look really cool. (A rough sketch of the quantization follows below.)
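As a rough sketch of the direction quantization (illustrative only; the Fibonacci-sphere sampling and the prompt lookup are my own assumptions about how one might implement it, not anything PyTTI provides):

```python
# Build ~48 roughly even directions on a sphere with a Fibonacci spiral, assign a
# prompt to each, and look up the prompt closest to the current camera heading.
import math

def fibonacci_sphere(n=48):
    directions = []
    golden = math.pi * (3.0 - math.sqrt(5.0))
    for i in range(n):
        y = 1.0 - 2.0 * (i + 0.5) / n          # y goes from ~1 down to ~-1
        r = math.sqrt(1.0 - y * y)
        theta = golden * i
        directions.append((math.cos(theta) * r, y, math.sin(theta) * r))
    return directions

directions = fibonacci_sphere()
prompts = [f"region {i} prompt" for i in range(len(directions))]  # fill with real scenes

def prompt_for_heading(heading):
    # Pick the prompt whose direction best matches the camera heading (max dot product).
    best = max(range(len(directions)),
               key=lambda i: sum(h * d for h, d in zip(heading, directions[i])))
    return prompts[best]

print(prompt_for_heading((0.0, 0.0, 1.0)))
```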


About this document

Acknowledgements

  • Author: ryunuck

  • Notable contributors: siuwi ffhgx

  • Various community members for pointing out typos, weird sentence constructions, and various links & tools to reference

History

  • Nov 23 - Creation

  • Jan 27 - Added new section on camera motion

  • Jan 28 - Updated information on colors & added QA question about colors too dark

Contributions

Currently the document is written and maintained by one person. If you wish to become a full-blown author, contact oxysoft#6139 on Discord. We are looking for more techniques, best practices, examples, more information on how certain parameters work, etc.

If you are new to TTI and feel that your basic question isn’t covered here, contact me asap.

If Sportsracer42 answers an interesting question in #technical-support, check to see if there is new knowledge worth adding here.

Upcoming improvements

  1. PyTTI parameter comparisons

We show the same scene, with the same params and seed, at different values of a given parameter. This way we can quickly teach how the image changes across 10-15 increments of that parameter.

Preferably in animation format, so we can understand more finely how the image transforms at each step. This will greatly advance the knowledge about PyTTI!
