This is a continuation of my ongoing project. The earlier posts can be found here and here; previously known as S2ID, and SIID before that. Since then, a lot has changed, and R2IR and R2ID work very differently. You can read the previous stages, but it's not necessary. The GitHub repository is here for those who want to see the code.

Preface

Over the past couple of months, I've been somewhat disillusioned by the pitfalls in classic diffusion models. So I've been working on my own architecture, aptly named S2ID (Scale Invariant Image Diffuser), and now, aptly and sensibly renamed to R2ID: Resolution Invariant Image Diffuser. R2ID aims to avoid these pitfalls. Specifically:
The core concept of the model has gone unchanged: every pixel is a distinct point in the image, whose coordinate and color we know. This pixel is effectively a token, and we can attend to other tokens (pixels) in the image to figure out composition. But unlike LLM tokens, the tokens here are fundamentally a bit different, in that they can be infinitely subdivided. A 1MP image upscaled 2x to 4MP doesn't contain 4x as much information. Rather, the information is 4x as precise. Therefore, a relative, not absolute, coordinate system is used (explained later). R2ID has undergone massive changes, especially fixing its biggest problem from the previous stage of iteration, which was speed. Now, R2IR and R2ID are fast enough to actually be viable (and I'd argue competitive) at large resolutions. Before, it used attention over the entire image, which was super slow. The previous post got a lot of suggestions, but one in particular stuck out to me, by u/MoridinB, who suggested somehow moving the resolution invariance into the autoencoder. So after a break and a lot of thinking, I figured that cross attention with my coordinate system (explained later) could actually work as this "autoencoder" of sorts. Thus it was made and named R2IR: Resolution Invariant Image Resampler. While it "sort of" plays the role of an autoencoder by reducing the height and width, it fundamentally isn't one (explained later). Thus, a pair of models: R2ID for diffusion, and R2IR to make images smaller to make R2ID faster. So much so that, compared to the previous training time of 3.5h, both R2IR and R2ID now train in 2 hours total, about 30-60% faster, with memory consumption about 3x lower, despite having over double the total parameter count. But it gets better.
Both R2IR and R2ID were trained on 32×32 images that were turned into a 4×4 latent: to sample into, diffuse in, and sample out of those 4×4 latents. Yet despite this, both models have proven to generalize to other resolutions and aspect ratios, even though neither model ever saw any augmented image. This means that you can train on one resolution and aspect ratio, and the model will be pre-configured to be good enough for other resolutions and aspect ratios from the get-go, even when they're wildly different. I've also come up with an explanation for why it's able to do that, and it's because of the dual coordinate system (explained later). In this post I'll go through the model showcase, the architecture, and future plans.
Model Showcase

Let us begin with the model showcase. As before, it's important to note that the model was trained solely on 32×32 MNIST images, a tensor of size [1, 32, 32]. These images, passed through R2IR, become [64, 4, 4], thus a 4×4 latent. So all subsequent results are effectively testing how well R2IR and R2ID can generalize. I used different resolution and aspect ratio latents, as well as various resolution and aspect ratio images. It's important to note that with the way R2IR works, the latent and image sizes are decoupled: you can diffuse at one resolution, but resample (thus the name) into a different one. Resampling is not equivalent to a simple upscale; it's a smart interpolation of sorts. All will be explained later. Let's start with 4×4 latents, 32×32 images: the thing the model was trained on. Training for both models was aggressive: batch size of 100, EMA decay of 0.999, linear warmup, cosine decay schedule for the AdamW optimizer. The learning rate peaks at 1e-3 by the 600th step (end of the first epoch) and decays down to 1e-5 over 40 epochs. Thus, a total of 24,000 optimizer steps were made.

4×4 latent, resampled into 32×32 images

Surprisingly enough, the results are… bad. This is because a 4×4 latent is way too small to diffuse in. So let's bump it up to an 8×8 latent.

8×8 latent, resampled into 32×32 images

Much better. But hold up, this latent resolution wasn't trained. As in, at all. Neither R2ID, which diffused in the latent space, nor R2IR, which was trained to make these latents in the first place, ever saw an 8×8 latent. Only 4×4 latents. What does this mean? It means you can train at one resolution and not worry about inferring at another.
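For concreteness, the training schedule described above (linear warmup to 1e-3 over the first 600 steps, cosine decay down to 1e-5 by step 24,000) can be sketched as follows; the function name is mine, everything else comes from the numbers in the text:

```python
import math

# Schedule from the post: linear warmup to 1e-3 over 600 steps (one epoch),
# then cosine decay down to 1e-5 by step 24,000 (40 epochs x 600 steps).
PEAK_LR, FINAL_LR = 1e-3, 1e-5
WARMUP_STEPS, TOTAL_STEPS = 600, 24_000

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS  # linear warmup
    # cosine decay from PEAK_LR to FINAL_LR over the remaining steps
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))

print(lr_at(600))     # peak: 0.001
print(lr_at(24_000))  # floor: 1e-05
```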
Intuition suggests that larger latents result in better quality, because just as stated earlier, more pixels means more precise information. How about we stress test R2IR, the resampler. Let's still diffuse on 8×8 latents, but this time, sample into a different resolution. Let's do 10×10 pixels, for the extreme.

8×8 latent, resampled into 10×10 images

It still works. If you compare the images, you'd see that they are identical in structure, and that's because they come from the same latent. They're just pixelated, which is expected when you only have 10 pixels to work with. Let's look at a 16×16 resample.

8×8 latent, resampled into 16×16 images

As expected, it's better yet. Same underlying images as before, just pixelated differently. So R2IR is clearly able to resample a latent into a resolution lower than trained, and it works as expected. But what about higher? Let's resample into 64×64, to see if we can use higher resolutions, but for the same latent.

8×8 latent, resampled into 64×64 images

Yet again, just like before, it works. No surprise here. The way R2IR works (explained later), this isn't equivalent to a simple upscale (re-sample). From what you've seen so far, it may look like R2IR just upscales some generic latent image into different resolutions, but that's not what's happening. For each pixel in the output image, R2IR has selectively chosen which parts of the latent it attends to. This is an adaptive, dynamic process. In fact, this whole time, R2IR was already working overtime: it was never trained to decode 8×8 latents, only 4×4, and it has shown it can resample an 8×8 latent into resolutions it was never trained on either, as R2IR was only ever trained to resample back into 32×32. Let's really stress test it.
Diffuse on an 8×8 latent, but resample into a different aspect ratio. Shouldn't really work, right?

8×8 latent, resampled into 27×18 images (3:2 aspect ratio)

Nope, it still works. It's important to note that with the way the dual coordinate system works (explained later), most of the coordinates that R2IR sees here were never in the training data. And this isn't some kind of interpolation between known coordinates; no, the two coordinate systems are actively sending conflicting signals. Yet it works. Now, we've already seen that R2ID can diffuse on latent sizes it wasn't trained on, but let's just make sure that it works. Let's diffuse on a non-square latent, like 4×10, but then resample it back to a square image and see if we get any deformities. After all, the 4×4 latent could barely make digits, and now we're adding a bunch of coordinates to the sides, so we're not really fixing the bottleneck all that well here, and then we're asking it to resample back into a square from a non-square latent.

4×10 latent, resampled into 32×32 images

But no. Yet again, it works. We see residual deformities, because we still only have 4 in height. Yet that extra width has proven useful enough to _still_ fix some deformities. And the resulting images are legible. Okay, let's really stress test it. Let's diffuse on a 4×10 latent, which is short but wide, but then resample it into a skinny and tall image, like a 16:9 aspect ratio. This is silly and pointless, but still.

4×10 latent, resampled into 32×18 images

And yet, it still works. We see deformities, but the images are still surprisingly cleaner than the original 4×4. Let's also diffuse on a 10×4 latent that's closer to the 16:9 ratio, to see if having the aspect ratios not conflict helps.
10×4 latent, resampled into 32×18 images

Surprisingly, this doesn't seem to have much, if any, of an effect. Which suggests that one or both of the models don't really care about how much you stretch or squeeze the image. And as said before, with the way the dual coordinate system works, both R2IR and R2ID see conflicting coordinates, yet it still works. For completeness, here is the t-scrape loss. It's annoying to measure all permutations, so this is the t-scrape loss for an 8×8 latent, as those have shown to be good quality. This graph shows the MSE loss between the predicted epsilon noise and the actual epsilon noise (gaussian noise, mean of 0, stdev of 1) used for that particular timestep (alpha bar, a value between 0 and 1 that represents the SNR of the image).

T-scrape loss, absurdly good compared to the previous state

Compared to the previous post, this is a _lot_ smoother, and completely mogs the old t-scrape losses across the board, literally 5-10x better pretty much everywhere. Now, let's take a look at the actual architecture itself.

Dual Coordinate Positioning System

In the previous post, I didn't really explain this part well, but this is the one thing that makes everything work in the first place, for both R2IR and R2ID. Thus it's integral to understand. In short, it's a system that gives two coordinates to each pixel: where it is with respect to the image's edges (relative), and where it _actually_ is if you drew it on a screen (absolute (but not actually absolute, it's still relative)). For the first system, it's simple: make the edges ±0.5, and see how far along the pixel is. For the second system, we take the image, whatever aspect ratio it is, and inscribe and center it inside a square. Then, those ±0.5 values are given to the square, not the image's own edges. We then get the coordinate by seeing how far along the square
the pixel is. Thus, we have 2 values for X and 2 values for Y, one "relative" and the other "absolute". We need the first system so that the model knows about image bounds, and we need the second system so that the model doesn't fix composition to the image edges. Use the first system without the second, and the model will stretch and squeeze the image if you change the inference aspect ratio. Use the second system without the first, and the model will crop the image if you change the inference aspect ratio. We then pass these 4 values through a fourier series over powers of 2. This is so that the models can distinguish pixels that are near from pixels that are far. For standard RoPE in LLMs, where we have more and more atomic tokens, we need to distinguish further and further away. But here, we have a relative system, so we need ever-increasing frequencies instead, to distinguish adjacent pixels the higher and higher resolution we go. In _this_ example, I used 10 positive frequencies and 6 negative frequencies, so 16 total, x2 for X/Y, x2 for relative/absolute, x2 for sine/cosine, hence a total of 128 positioning channels. The keen viewer may have sensed something off with the high frequencies, and they should have: 10 frequencies over powers of 2, that's way too many. 2^10=1024, which means the model needs 1024 pixels for the final frequency to not look like noise, so how is the model not just memorizing the values instead of generalizing? This is because coordinate jitter is used, _before_ the fourier series. For whatever resolution image R2IR or R2ID uses, if the model is training, we add gaussian noise with a stdev of half a pixel's width to the raw coordinate's X/Y value.
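To pin the above down, here is a minimal numpy sketch of the dual coordinate system, the training-time jitter, and the fourier encoding as I've described them; the function names and the exact frequency layout (which powers of 2 the 10 positive and 6 negative frequencies map to) are my assumptions, not code from the repo:

```python
import numpy as np

def dual_coords(h: int, w: int) -> np.ndarray:
    """Per-pixel (rel_x, rel_y, abs_x, abs_y) for an h x w image."""
    # Relative system: each axis spans exactly [-0.5, 0.5], edge to edge.
    rel_y = (np.arange(h) + 0.5) / h - 0.5
    rel_x = (np.arange(w) + 0.5) / w - 0.5
    # "Absolute" system: inscribe and center the image in a square; the
    # square's sides get the +-0.5 range, so the shorter axis covers less.
    s = max(h, w)
    abs_y, abs_x = rel_y * (h / s), rel_x * (w / s)
    ry, rx = np.meshgrid(rel_y, rel_x, indexing="ij")
    ay, ax = np.meshgrid(abs_y, abs_x, indexing="ij")
    return np.stack([rx, ry, ax, ay], axis=-1)        # [h, w, 4]

def jitter(coords: np.ndarray, h: int, w: int, rng) -> np.ndarray:
    """Training-time jitter: gaussian noise, stdev of half a pixel, per system."""
    s = max(h, w)
    sigma = 0.5 * np.array([1 / w, 1 / h, 1 / s, 1 / s])
    return coords + rng.normal(size=coords.shape) * sigma

def fourier_encode(coords: np.ndarray, pos=10, neg=6) -> np.ndarray:
    """Powers-of-2 fourier features: 16 freqs x 4 coords x sin/cos = 128 ch."""
    exps = np.arange(-neg + 1, pos + 1)               # 2^-5 ... 2^10 (layout is a guess)
    ang = 2 * np.pi * coords[..., None] * 2.0 ** exps
    feats = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return feats.reshape(*coords.shape[:-1], -1)

enc = fourier_encode(dual_coords(18, 32))
print(enc.shape)  # (18, 32, 128)
```

Note the ordering: jitter is applied to the raw coordinates, and only then are the fourier features computed, so the highest frequencies see genuinely off-grid samples during training.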
This means that during training, the pixels the models look at aren't on a rigid grid, but are instead like random samples from a continuous field, and thus when the model works at a higher resolution, it has already seen those coordinates before, and it already knows what color is supposed to be there: a blend of the two adjacent pixels, as if they were gaussian fields. To those aware, this sounds awfully similar to gaussian splats, because it is, in a sense. In the future, I plan to make RIGSIG: Resolution Invariant Gaussian Splat Image Generator; a model that will work directly on gaussian splats rather than indirectly like here. Now, _why_ does this approach work? Why is it able to generalize to resolutions, but more curiously, aspect ratios? Aside from jittering doing some heavy lifting around the edge pixels (making them seem as if they're further out than they actually are, thus as if the image was different), the main reason is that the center coordinates don't change all that drastically. When you change the aspect ratio, the pixels that change most are around the edges, not the center, and that's good, considering it's pretty much never the case that your subject is just cropped off for no reason. The subjects are centered; change the aspect ratio, and the middle stays largely the same while the edges change more. 128 channels may sound like a lot, but it really isn't, especially considering the parameter count. Let's take a look at R2IR for a moment. In the current configuration, it has about 3.3M parameters, which could actually be cut down by about 4x (explained later). It expands the color channels from 1 to 64, because I assumed an 8x height and width reduction. For true RGB images that are large, we'd want a 16x reduction in height and width.
We'd hence get 768 channels instead. As for the positional frequencies, we can go nuts: 16 positive and 16 negative. Those negative frequencies are frankly mostly useless: ever longer wavelengths that quickly become indistinguishable from a constant, given the relative nature of our coordinates (though it's interesting whether they could be used as an absolute system), so we can really redistribute them into the positive frequencies, into something like 22 positive and 10 negative (even then, it's overkill). Just what size image do we need for the final frequency to be more than noise? What is the resolution limit of the model? 2^22=4,194,304. We would need 4,194,304 _latent_ pixels to just _start_ using the final frequency. With the assumed 16x compression via R2IR, that becomes over 64 million image pixels needed along one dimension. And we only need 256 channels for this. 768 color channels and 256 positioning channels means the model never goes beyond 1024 channels per token, which by modern standards, inflated by LLMs, is laughably tiny. Now that I say it, I'm willing to bet that R2ID and the coordinate system could be used for more than images; say audio instead, or something of the sort, and then these absurd lengths become very practical. The coordinate jitter approach means that although those channels are indistinguishable from noise, the model still learns enough about them to generalize to higher resolutions.

R2ID

From the narrative perspective, it makes sense to look at R2ID first, since it's the actual diffusion model. Also, it's difficult to see the use of R2IR unless you understand R2ID and its pain point. The concept has largely remained unchanged:
However, two major advancements:
I started making R2IR back when I was still on the cloud attention idea, and it helped a lot back then. But then I started using linear attention in R2IR, and everything became blazing fast, and I wondered whether R2IR was even necessary in the first place. Turns out, yes, it still is; in fact, maybe even more so than before. R2IR makes sense as a natural extension once you figure out the drawbacks of R2ID:
So, let's make R2IR.

R2IR

We now know the drawbacks of R2ID, and we know what we need from R2IR: somehow convert height and width into extra channels. Two months ago, when I made the previous post, one comment stuck out to me. u/MoridinB proposed that instead of having a resolution invariant diffuser, how about I make a resolution invariant autoencoder. Even back then, I had felt the pain of the training time, and the idea sounded amazing in theory, but I had no idea how to do it in practice. Looking into existing architectures, I couldn't really find the thing I was looking for. The most obvious alternative was to just diffuse in a fourier series, for example, but that's not quite it, in my opinion. I figured there just had to be some kind of clean solution that I simply hadn't come to yet. The most obvious solution to the conundrum (less height and width, more channels) is to just use an existing VAE or AE. But there's a big problem, and that's that they work on non-1×1 convolution kernels. 1×1 convolution kernels are fine, because they're just an image-shaped linear layer; they don't mix pixels together. But that's not what CNN-based autoencoders do. They have 3×3 convolutions in the simplest of configurations, which immediately stops them from being resolution invariant and makes them pixel density dependent. Training on various resolutions, having multiple kernels for different resolutions, or reusing the same kernel and dynamically scaling it: to me, that sounds more like a hack than a clean and correct implementation. Over this time, I had tried:
I had genuinely, effectively given up, until at one moment a thought struck me: why not use cross attention? Cross attention selectively passes information from one tensor to another. We typically use it to pass information from text tokens to the image, i.e., to do our text conditioning. But what if I made an empty latent, populated it with coordinates, and then used cross attention to move information from the image into the latent? What if, for decoding, each pixel selectively integrated information from the latent? The queries Q know only about their coordinate, while the keys K and values V know about the coordinate and color. Thus, the _only_ way for information to pass through would be position based. A kind of smooth view of the image, based on whatever coordinate you're interested in. So I made it: R2IR. The dumb approach of full attention, the quadratic scaling, and yet it still worked. Early R2IR was able to compress and expand back out. Now, as said before, I made it before switching to linear attention, and the switch to linear attention was triggered by the fatal flaw in the early stage of R2IR, which is that it requires _even more_ computation than R2ID. Let's say we wanted to encode and decode a 1024×1024 image; how many attention calls would we need to do? For encoding, let's say we want an 8x reduction in height and width. That would be a total of 128×128 latent pixels, which is 16,384 total attention calls, and each attention call would be over 1,048,576 total pixels. Yikes. For the decoder, it's 1,048,576 calls over a sequence length of 16,384. At the time, I was experimenting with cloud point attention: splitting the pixels into random groups and only attending within the group as a way to speed things up.
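That cost accounting is quick to sanity check; all numbers come straight from the paragraph above:

```python
# Full cross attention cost for encoding a 1024x1024 image
# into a 128x128 latent (8x reduction in height and width).
n_latent = 128 * 128        # queries: one attention call per latent pixel
n_pixels = 1024 * 1024      # keys/values: every image pixel
print(n_latent)             # 16384 attention calls...
print(n_pixels)             # ...each over 1048576 pixels
print(n_latent * n_pixels)  # 17179869184 query-key scores per head, per layer
```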
Similarly, I used only random fractions of the pixels for the KV, but still, it was incredibly slow, and I hit OOM on 64×64 images unless I had a batch size of 10 and fractions like 1/4. And then I stumbled upon linear attention, and it literally fixed everything. Blazing speeds, memory, everything. And the reconstructions were even better, because fractions are no longer needed and you can instead do full attention. Cloud mechanics became obsolete too. Training R2ID without R2IR versus with it is like night and day: epochs go from 10 minutes or so to about 40 seconds, batch sizes can be set to 100, and to top it off, we reap the rewards of the resampling tricks. So how does this actually work? It's simple. We make Q hold only the coordinates, and KV hold the coordinates and color. For encoding, Q is the latent and KV is made from the actual image. For decoding, Q is the image, and KV is the latent. The coordinate system is the same one as before. Now, one pass of linear attention is bad, even if it's multi-head. This is because it works as an averaging of sorts; with just one pass of attention, we risk blurring details, which is exactly what happened. So instead, let's make it a transformer block with residual addition, just like what was done for the "encoder" and "decoder" blocks in R2ID, except we don't need AdaLN for time conditioning this time around. Let's have 4 blocks, just in case. The first pass does general colors; the final passes refine details. And then the final stage is to compress back down to the color space via a 1×1 convolution, whether for the latent or the actual image. Does it work? Yes; in fact, it works _too_ well. Take a look at the attached images and see if you can spot what's wrong.
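To make the Q/KV split concrete, here is a hypothetical single-head toy of the encode direction (image to latent) using the standard kernelized linear-attention trick with an elu+1 feature map; the shapes, projections, and feature map are my assumptions for illustration, not the actual R2IR code:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1: a positive feature map

def linear_cross_attention(q_feats, kv_feats, v_vals):
    """Kernelized attention: cost O((N+M)*d*c) instead of O(N*M)."""
    q, k = phi(q_feats), phi(kv_feats)
    kv_summary = k.T @ v_vals          # [d, c] summary of ALL keys/values, built once
    norm = q @ k.sum(axis=0)           # per-query normalizer
    return (q @ kv_summary) / norm[:, None]

# Toy shapes: encode a 32x32 image (1024 pixels) into a 4x4 latent (16 tokens).
d_pos, d_model, c = 128, 32, 64
Wq = rng.normal(size=(d_pos, d_model)) / np.sqrt(d_pos)          # learned in the real model
Wk = rng.normal(size=(d_pos + 1, d_model)) / np.sqrt(d_pos + 1)
latent_coords = rng.normal(size=(16, d_pos))      # Q: coordinates only
pixel_feats = rng.normal(size=(1024, d_pos + 1))  # KV: coordinates + 1 color channel
values = rng.normal(size=(1024, c))

latent = linear_cross_attention(latent_coords @ Wq, pixel_feats @ Wk, values)
print(latent.shape)  # (16, 64)
```

Decoding is the same call with the roles swapped: Q built from image-pixel coordinates, KV built from the latent. Since Q carries no color, position really is the only routing signal.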
They're all at 1024×1024 resolution, resampled up from a 100×100 latent.

100×100 latent, resampled into a 1024×1024 image
100×100 latent, resampled into a 1024×1024 image
100×100 latent, resampled into a 1024×1024 image

That's right: R2IR has memorized the pixelation from the original image. The raw MNIST images are all 28×28. I trained on 32×32, but that's still the same amount of data as 28×28. By having 4 blocks instead of 1, R2IR was able to memorize the pixelation that you see at small resolutions. Had I used 1 block instead, it would have been a nice smooth transition. It's safe to say the model knows what it's doing and certainly can capture fine details. Also, just for fun, let's take a look at what the latent space looks like. This is a fixed set of images, encoded via R2IR and then rendered directly. The reason this works is that the latent space colors are still literal colors; they're bound between -1 and 1, just like the color space (it's re-scaled so that [0, 1] re-maps to [-1, 1]). Normalization proved to improve the loss, and makes it easier to visualize too. Each column's 64 rows are one image's 64 separate channels in the latent space.

32×32 images compressed to 4×4 latents

There's this very interesting, and equally inexplicable, pattern. I genuinely don't know why it loves to do this clean left/right separation. Honestly, no idea; any guesses would be welcome. We can also compress the same 32×32 images into a bigger latent size, and see why it is that the model is so robust to resolutions.
32×32 images compressed to 14×14 latents

This time, the 32×32 image is compressed to a 14×14 latent instead, meaning that while with the 4×4 latent we had no information doubling ([1, 32, 32] -> [64, 4, 4]), we now have over 3x as much of the same information repeated, and not exactly in the cleanest of ways, since we don't have more pixels on the input end. And yet, the latents are _identical_; they just gain some extra details that weren't there before.

All together, the full model

All together, the model is absolutely nuts, and I really mean it. It's worlds apart from the previous iteration.
Just to really drive the point home: in the previous iteration, diffusing on a single 1024×1024 image took literally a minute per prediction. Now? R2ID diffuses on a 256×256 latent (equivalent to a 2048×2048 image, 4MP) at 4.2 steps per second, at just 1.6GiB in fp32. That is worlds apart, considering I haven't really put much effort into optimizing it either. I made a dummy model which did a 16x reduction in height and width, and trained it on 3-channel MNIST images. R2IR and R2ID would hence have 1024 channels, 256 of them for positioning, 768 for colors. The model _still worked_, but what was more wild was just how lightweight it was. R2IR had 27M parameters, which is nothing compared to the SDXL VAE, while the 8-encoder-block, 8-decoder-block configuration of R2ID had a total of about 270M parameters, also absolutely nothing by modern standards. I feel it's safe to say that R2IR and R2ID can _truly_ be expanded to large resolutions, and have competitive speeds and quality. The prior concerns raised (speed, memory, ability to capture details) seem solved to me, and now all that's left is to go bigger.

Future development and closing thoughts

As mentioned just above, the future goal is to expand to actual images. I mean real images at actual resolutions, not dummy datasets. I'm open to suggestions. I think that something at 512px would be good, with R2IR doing the 16x reduction approach, thus making R2IR and R2ID operate on 1024 channels for positioning and color. The number 1024 is nice and round; the 16x height and width reduction is aggressive, but fits in cleanly with the expansion to 768 color channels from 3. I've also briefly mentioned RIGSIG. It's a dummy repo for now, but I will eventually™ get to it once R2IR and R2ID are finished.
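The channel accounting above works out neatly; as a quick sanity check (numbers from the text; the "keep H*W*C constant" framing is my reading of why an 8x reduction pairs with 64 latent channels):

```python
# Current MNIST config: [1, 32, 32] -> [64, 4, 4] keeps H*W*C constant.
assert 1 * 32 * 32 == 64 * 4 * 4

# Planned "real image" config: 16x height/width reduction on RGB.
rgb, reduction = 3, 16
color_ch = rgb * reduction ** 2             # 768 latent color channels
pos_ch = (22 + 10) * 4 * 2                  # 32 freqs x 4 coords x sin/cos = 256
print(color_ch, pos_ch, color_ch + pos_ch)  # 768 256 1024
```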
I think that as a starting step, it would make sense to just train a model to learn to move gaussian splats around, step by step, though ideally, I'd make the splats 3D, and then you could sample at actually different aspect ratios, not just various re-shapes. Don't know how to do that considering the coordinate system I've got, though; that's for later. Related to RIGSIG, I think it may be possible to feed R2ID some bogus coordinates for nonexistent points, for example having pixels with coordinates corresponding to many aspect ratios. That way, you diffuse once across all those different aspect ratios, and then just sample once and pick and choose which one you want. Though I'm concerned this will be a bit messy. Another option is to use the negative frequencies as an actual absolute system; for example, outpainting _is_ adding more information, so that would be nice. Though I'm not really sure how to cleanly tie it all in. In any case, with that being said, thanks for reading. I'm open to critique, suggestions, and questions. The code is still a bit messy, but with LLMs it should be simple to understand and run yourself. I'll get around to making it cleaner soon™ once I'm done with the interesting stuff. As always, kind regards.

submitted by /u/Tripel_Meow