— that's the bold phrase the authors chose for the paper introducing both YOLOv2 and YOLO9000. The paper itself is titled "YOLO9000: Better, Faster, Stronger" [1], published in December 2016. The main focus of the paper is indeed to create YOLO9000, but let's make things clear: despite the title, the model proposed in the study is called YOLOv2. YOLO9000 is their proposed algorithm, built on top of the YOLOv2 architecture, specialized to detect over 9000 object classes.
In this article I am going to focus on how YOLOv2 works and how to implement the architecture from scratch with PyTorch. I will also talk a little bit about how the authors eventually ended up with YOLO9000.
From YOLOv1 to YOLOv2
As the name suggests, YOLOv2 is an improvement of YOLOv1. Thus, in order to understand YOLOv2, I recommend you read my previous articles about YOLOv1 [2] and its loss function [3] before reading this one.
The authors raised two main problems with YOLOv1: first, the high localization error, or in other words, the bounding box predictions made by the model are not very accurate. Second, the low recall, a situation where the model is unable to detect all objects within the image. The authors made a number of modifications to YOLOv1 to address these issues, which are summarized in Figure 1. We are going to discuss each of these modifications one by one in the following sub-sections.
Batch Normalization
The first modification the authors made was applying batch normalization. Keep in mind that YOLOv1 is quite old: it was released back when the BN layer was not yet common, which is why YOLOv1 did not use this normalization mechanism in the first place. It is well established that BN stabilizes training, speeds up convergence, and regularizes the model. For this reason, the dropout layer we previously had in YOLOv1 was omitted once BN layers were applied. The paper mentions that by attaching this kind of layer after every convolution they obtained a 2.4% improvement in mAP, from 63.4% to 65.8%.
Better Fine-Tuning
Next, the authors proposed a better way to perform fine-tuning. Previously in YOLOv1, the backbone was pretrained on the ImageNet classification dataset, whose images had a size of 224×224. Then they replaced the classification head with a detection head and directly fine-tuned the model on the PASCAL VOC detection dataset, which contains images of size 448×448. Here we can clearly see that there was something like a "jump" caused by the different image resolutions in pretraining and fine-tuning. The pipeline used for training YOLOv2 is slightly modified: the authors added an intermediate step, namely fine-tuning the model on 448×448 ImageNet images before fine-tuning it again on PASCAL VOC at the same resolution. This extra step allows the model to adapt to the higher resolution before being fine-tuned for detection, unlike YOLOv1, where the model is forced to work on 448×448 images directly after being pretrained on 224×224 images. This new fine-tuning pipeline increased mAP by 3.7%, from 65.8% to 69.5%.

Anchor Boxes and Fully Convolutional Network
The next modification is related to the use of anchor boxes. If you are not yet familiar with them, an anchor box is essentially a template bounding box (a.k.a. prior box) corresponding to a single grid cell, which is rescaled to match the actual object size. The model is then trained to predict the offset from the anchor box rather than the bounding box coordinates directly, as YOLOv1 does. We can think of an anchor box as the starting point from which the model makes its bounding box prediction. According to the paper, predicting offsets like this is easier than predicting coordinates, hence allowing the model to perform better. Figure 3 below illustrates 5 anchor boxes that correspond to the top-left grid cell. Later, the same anchor boxes will be applied to all grid cells within the image.

The use of anchor boxes also changed the way we do object classification. Previously in YOLOv1, each grid cell predicted two bounding boxes, but it could only predict a single object class. YOLOv2 addresses this issue by attaching the classification mechanism to the anchor box rather than to the grid cell, allowing each anchor box from the same grid cell to predict a different object class. Mathematically speaking, the length of the prediction vector of YOLOv1 can be formulated as (B×5)+C for each grid cell, whereas in YOLOv2 this length becomes B×(5+C), where B is the number of bounding boxes to be generated, C is the number of classes in the dataset, and 5 accounts for x, y, w, h and the bounding box confidence value. With the mechanism introduced in YOLOv2, the prediction vector indeed becomes longer, but it allows each anchor box to predict its own class. The figure below illustrates the prediction vectors of YOLOv1 and YOLOv2, where we set B to 2 and C to 20. In this particular case, the lengths of the prediction vectors of the two models are (2×5)+20=30 and 2×(5+20)=50, respectively.
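To make the arithmetic concrete, here is a tiny sketch in plain Python that computes both prediction-vector lengths. The function names are my own, not from the paper.

```python
# Per-grid-cell prediction vector lengths of YOLOv1 vs YOLOv2.
# B = number of boxes per cell, C = number of classes.
def yolov1_vector_length(B: int, C: int) -> int:
    # B boxes, each with x, y, w, h, confidence; one shared class distribution
    return B * 5 + C

def yolov2_vector_length(B: int, C: int) -> int:
    # every anchor box carries its own class distribution
    return B * (5 + C)

print(yolov1_vector_length(2, 20))  # 30
print(yolov2_vector_length(2, 20))  # 50
print(yolov2_vector_length(5, 20))  # 125, matching the output channels later
```

Note that with the 5 anchors and 20 classes used later in this article, the YOLOv2 vector length is 125, which is exactly the number of output channels of the detection head.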

At this point the authors also replaced the fully-connected layers in YOLOv1 with a stack of convolution layers, turning the entire model into a fully convolutional network with a downsampling factor of 32. This downsampling factor causes an input tensor of size 448×448 to be reduced to 14×14. The authors argued that large objects are usually located in the middle of an image, so they made the output feature map have odd dimensions, ensuring that there is a single center cell to predict such objects. To achieve this, the authors changed the default input shape to 416×416 so that the output has a spatial resolution of 13×13.
Interestingly, the use of anchor boxes and a fully convolutional network actually caused the mAP to decrease by 0.3%, from 69.5% to 69.2%, yet at the same time the recall increased by 7%, from 81% to 88%. This improvement in recall was mainly caused by the increase in the number of predictions made by the model. YOLOv1 could only predict 7×7=49 objects in total, whereas YOLOv2 can predict up to 13×13×5=845 objects, where the number 5 comes from the default number of anchor boxes. Meanwhile, the decrease in mAP indicated that there was room for improvement in the anchor boxes themselves.
Prior Box Clustering and Constrained Predictions
The authors indeed saw a problem in the anchor boxes, so in the next step they tried to change the way they work. Previously in Faster R-CNN the anchor boxes were manually handpicked, which caused them not to optimally represent all object shapes in the dataset. To address this problem, the authors used K-means to cluster the distribution of bounding box sizes. They did this by taking the w and h values of the bounding boxes in the object detection dataset, placing them in a two-dimensional space, and clustering the datapoints using K-means as usual. The authors decided to use K=5, which essentially means we will later have that number of clusters.
The illustration in Figure 5 below shows what the bounding box size distribution looks like, where each black datapoint represents a single bounding box in the dataset and the green circles are the centroids, which will then act as the sizes of our anchor boxes. Note that this illustration is created from dummy data, but the idea is that the datapoint located at the top-right represents a large square bounding box, the one at the top-left a tall rectangular box, and so on.

If you are familiar with K-means, we typically use Euclidean distance to measure the distance between datapoints and the centroids. But here the authors created a new distance metric specifically for this case, in which they used the complement of the IOU between the bounding boxes and the cluster centroids:

d(box, centroid) = 1 − IOU(box, centroid)

Using the distance metric above, we can see in the following table that the prior boxes generated with K-means clustering (highlighted in blue) have a higher average IOU compared to the prior boxes used in Faster R-CNN (highlighted in green), despite the smaller number of prior boxes (5 vs 9). This essentially indicates that the proposed clustering mechanism allows the resulting prior boxes to represent the bounding box size distribution of the dataset better than handpicked anchor boxes.
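As a sketch of how this clustering could be done, the snippet below runs K-means on (w, h) pairs using the 1 − IOU distance, where the IOU treats each box and centroid as if they share the same center. The helper names and the dummy data are my own; this is not the authors' original code.

```python
import numpy as np

def iou_wh(boxes, centroids):
    # boxes: (N, 2), centroids: (K, 2); boxes assumed centered on each other
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union  # shape (N, K)

def kmeans_boxes(boxes, k=5, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the centroid with the smallest d = 1 - IOU
        assign = np.argmin(1 - iou_wh(boxes, centroids), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes[assign == j].mean(axis=0)
    return centroids
```

Running this on a set of ground-truth box sizes with k=5 would yield the five prior box sizes; with dummy data containing two obvious size groups and k=2, the two centroids land at the mean size of each group.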

Still related to prior boxes, the authors found that predicting anchor box offsets like Faster R-CNN was still not quite optimal because of the unbounded equation. If we take a look at Figure 8 below, there is a possibility that the box position could be shifted wildly throughout the entire image, making training difficult, especially in its early stages.

Instead of predicting offsets relative to the anchor box like Faster R-CNN, the authors solved this issue by adopting the idea from YOLOv1 of predicting location coordinates relative to the grid cell. However, they further modified this by introducing a sigmoid function to constrain the x-y coordinate prediction of the network, effectively bounding the value to the range 0 to 1 so that the predicted location never falls outside the corresponding grid cell, as shown in the first and second rows of Figure 9. Next, the w and h of the bounding box are processed with an exponential function (third and fourth rows), which prevents negative values, since a negative width or height is simply nonsense. Meanwhile, the confidence score in the fifth row is computed the same way as in YOLOv1, namely by multiplying the objectness confidence with the IOU between the predicted and the target box.

So in simple terms, we indeed adopt the concept of prior boxes introduced by Faster R-CNN, but instead of handpicking the boxes, we use clustering to automatically find the most optimal prior box sizes. The bounding box is then created with an additional sigmoid function for x and y and an exponential function for w and h. It is worth noting that x and y are now relative to the grid cell while w and h are relative to the prior box. The authors found that this method improved mAP from 69.6% to 74.4%.
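The decoding just described can be sketched as follows. The function name and the concrete numbers are assumptions for illustration; tx, ty, tw, th denote the raw network outputs, (cx, cy) the grid cell's top-left corner, and (pw, ph) the prior box size, following the paper's notation.

```python
import torch

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    bx = cx + torch.sigmoid(tx)   # offset in (0, 1): stays inside the cell
    by = cy + torch.sigmoid(ty)
    bw = pw * torch.exp(tw)       # exp keeps width and height positive
    bh = ph * torch.exp(th)
    return bx, by, bw, bh
```

For instance, with all raw outputs at 0, the box center sits exactly in the middle of the cell (sigmoid(0) = 0.5) and the box size equals the prior size (exp(0) = 1).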
Passthrough Layer
The final output feature map of YOLOv2 has a spatial dimension of 13×13, in which each element corresponds to a single grid cell. The information contained within each grid cell is considered coarse, which makes perfect sense because the maxpooling layers within the network work by keeping only the highest values of the earlier feature map. This is not a problem if the objects to be detected are relatively large. But if the objects are small, our model might have a hard time detecting them due to the loss of information contained in the less prominent pixels.
To address this problem, the authors proposed applying the so-called passthrough layer. The objective of this layer is to preserve fine-grained information from an earlier feature map before it gets downsampled by a maxpooling layer. In Figure 12, the part of the network called the passthrough layer is the connection that branches out from the network before eventually merging back into the main flow at the end. The idea of this layer is quite similar to the identity mapping introduced in ResNet. However, the process performed in that model is simpler because the tensor dimensions of the original flow and the skip-connection match exactly, allowing them to be summed element-wise. The case is different for the passthrough layer, in which the later feature map has a smaller spatial dimension, so we need to think of a way to combine information from the two tensors. The authors came up with an idea where they divide the feature map in the passthrough layer and then stack the divided tensors channel-wise, as shown in Figure 10 below. By doing so, the spatial dimension of the resulting tensor matches that of the subsequent feature map, allowing the two to be concatenated along the channel axis. The fine-grained information from the earlier layer is then combined with the higher-level features from the later layer using a convolution layer.

Multi-Scale Coaching
I mentioned earlier that in YOLOv2 all FC layers were replaced with a stack of convolution layers. This essentially allows us to feed images of different scales during the same training process, considering that the weights of a CNN-based model correspond to the trainable parameters in the kernels, which are independent of the input image size. In fact, this is actually the reason why the authors decided to remove the FC layers in the first place. During training, the authors randomly changed the input resolution every 10 batches, from 320×320, 352×352, 384×384, and so on up to 608×608, all multiples of 32. This process can be thought of as a way to augment the data so that the model can detect objects across varying input dimensions, which I believe also allows the model to predict objects of different scales with better performance. This boosted mAP to 76.8% at the default input resolution of 416×416, and it got even higher, reaching 78.6%, when they increased the image resolution further to 544×544.
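A minimal sketch of this multi-scale scheme might look like the following, assuming a fully convolutional model so that any of these resolutions is a valid input. The function names are my own, and bilinear interpolation here stands in for whatever resizing the actual training pipeline uses.

```python
import random
import torch
import torch.nn.functional as F

SCALES = list(range(320, 609, 32))  # 320, 352, ..., 608: all multiples of 32

def pick_resolution():
    # drawn once every 10 training batches
    return random.choice(SCALES)

def resize_batch(images, size):
    # a fully convolutional network accepts any of these resolutions
    return F.interpolate(images, size=(size, size), mode='bilinear',
                         align_corners=False)

x = torch.randn(1, 3, 416, 416)
print(resize_batch(x, 608).shape)  # torch.Size([1, 3, 608, 608])
```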
Darknet-19
All the modifications to YOLOv1 we discussed in the previous sub-sections were related to how the authors improved detection quality in terms of mAP and recall. The focus of this sub-section is improving model performance in terms of speed. The paper mentions that the authors use a model called Darknet-19 as the backbone, which requires fewer operations than the backbone of YOLOv1 (5.58 billion vs 8.52 billion), allowing YOLOv2 to run faster than its predecessor. The original version of this model consists of 19 convolution layers and 5 maxpooling layers, the details of which can be seen in Figure 11 below.

It is important to note that the above architecture is the vanilla Darknet-19 model, which is only suitable for classification. To adapt it to the requirements of YOLOv2, we need to modify it slightly by adding the passthrough layer and replacing the classification head with a detection head. You can see the modified architecture in Figure 12 below.

Here you can see that the passthrough layer is placed after the last 26×26 feature map. This passthrough layer reduces the spatial dimension to 13×13, allowing it to be concatenated channel-wise with the 13×13 feature map from the main flow. In the next section I am going to demonstrate how to implement this Darknet-19 architecture from scratch, along with the detection head and the passthrough layer.
9000-Class Object Detection
The YOLOv2 model was originally trained on the PASCAL VOC and COCO datasets, which have 20 and 80 object classes, respectively. The authors saw this as a problem because they considered this number very limited for the general case and hence lacking versatility. For this reason, it is necessary to improve the model so that it can detect a wider variety of object classes. However, creating an object detection dataset is very expensive and laborious, because we need to annotate not only the object classes but also the bounding box information.
The authors came up with a very clever idea, where they combined ImageNet, which has over 22,000 classes, with COCO using a class hierarchy mechanism they refer to as WordTree, as shown in Figure 13. In the figure, the blue nodes are classes from the COCO dataset, whereas the red ones come from ImageNet. The object categories available in COCO are relatively general, while the ones in ImageNet are much more fine-grained. For instance, where COCO has airplane, ImageNet has biplane, jet, airbus, and stealth fighter. So using the WordTree idea, the authors put these four airplane types as subclasses of airplane. You can think of inference like this: the model predicts a bounding box and the parent class, then checks whether that class has subclasses. If so, the model continues predicting from the smaller subset of classes.
By combining the two datasets like this, we eventually end up with a model that is capable of predicting over 9000 object classes (9418 to be exact), hence the name YOLO9000.
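The hierarchical prediction idea can be illustrated with a toy example: the model outputs conditional probabilities P(node | parent), and the absolute probability of a fine-grained class is the product of the conditionals along its path to the root. The tree and the numbers below are made up purely for illustration.

```python
# Toy WordTree: child -> parent (None marks a root node)
tree = {'airplane': None, 'jet': 'airplane', 'biplane': 'airplane'}
# Hypothetical conditional probabilities P(node | parent) from the model
cond_prob = {'airplane': 0.9, 'jet': 0.7, 'biplane': 0.2}

def absolute_prob(label):
    # multiply conditionals along the path from the label up to the root
    p = 1.0
    while label is not None:
        p *= cond_prob[label]
        label = tree[label]
    return p

print(absolute_prob('jet'))  # P(airplane) * P(jet | airplane) = 0.9 * 0.7
```

With these made-up numbers, P(jet) ≈ 0.63: a detection is labeled "jet" only when both the parent and the child probabilities are high enough; otherwise the model can fall back to the more general "airplane" label.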

YOLOv2 Architecture Implementation
As promised earlier, in this section I am going to demonstrate how to implement the YOLOv2 architecture from scratch so that you can get a better understanding of how an input image eventually turns into a tensor containing bounding box and class predictions.
The first thing we need to do is import the required modules, as shown in Codeblock 1 below.
# Codeblock 1
import torch
import torch.nn as nn

Next, we create the ConvBlock class, which encapsulates the convolution layer itself, a batch normalization layer, and a leaky ReLU activation function. The negative_slope parameter is set to 0.1 as shown at line #(1), which is exactly the same as the one used in YOLOv1.
# Codeblock 2
class ConvBlock(nn.Module):
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size,
                 padding):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=in_channels,
                              out_channels=out_channels,
                              kernel_size=kernel_size,
                              padding=padding)
        self.bn = nn.BatchNorm2d(num_features=out_channels)
        self.leaky_relu = nn.LeakyReLU(negative_slope=0.1)    #(1)

    def forward(self, x):
        print(f'original\t: {x.size()}')
        x = self.conv(x)
        print(f'after conv\t: {x.size()}')
        x = self.bn(x)    # normalize before the activation (shape unchanged)
        x = self.leaky_relu(x)
        print(f'after leaky relu: {x.size()}')
        return x

Just to check whether the above class works properly, I test it with a very simple test case, where I initialize a ConvBlock instance that accepts an RGB image of size 416×416. You can see in the resulting output that the image now has 64 channels, proving that our ConvBlock works properly.
# Codeblock 3
convblock = ConvBlock(in_channels=3,
                      out_channels=64,
                      kernel_size=3,
                      padding=1)

x = torch.randn(1, 3, 416, 416)
out = convblock(x)

# Codeblock 3 Output
original        : torch.Size([1, 3, 416, 416])
after conv      : torch.Size([1, 64, 416, 416])
after leaky relu: torch.Size([1, 64, 416, 416])

Darknet-19 Implementation
Now let's use this ConvBlock class to construct the Darknet-19 architecture. The way to do so is pretty straightforward: we just need to stack multiple ConvBlock instances followed by a maxpooling layer according to the architecture in Figure 12. See the details in Codeblock 4a below. Note that the maxpooling layer for stage4 is placed at the beginning of stage5, as shown at the line marked with #(1). This is done because the output of stage4 will be fed directly into the passthrough layer without being downsampled. Additionally, it is important to note that the term "stage" is not officially mentioned in the paper; it is just a term I personally use for the sake of this implementation.
# Codeblock 4a
class Darknet(nn.Module):
    def __init__(self):
        super(Darknet, self).__init__()

        self.stage0 = nn.ModuleList([
            ConvBlock(3, 32, 3, 1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])

        self.stage1 = nn.ModuleList([
            ConvBlock(32, 64, 3, 1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])

        self.stage2 = nn.ModuleList([
            ConvBlock(64, 128, 3, 1),
            ConvBlock(128, 64, 1, 0),
            ConvBlock(64, 128, 3, 1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])

        self.stage3 = nn.ModuleList([
            ConvBlock(128, 256, 3, 1),
            ConvBlock(256, 128, 1, 0),
            ConvBlock(128, 256, 3, 1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        ])

        self.stage4 = nn.ModuleList([
            ConvBlock(256, 512, 3, 1),
            ConvBlock(512, 256, 1, 0),
            ConvBlock(256, 512, 3, 1),
            ConvBlock(512, 256, 1, 0),
            ConvBlock(256, 512, 3, 1),
        ])

        self.stage5 = nn.ModuleList([
            nn.MaxPool2d(kernel_size=2, stride=2),    #(1)
            ConvBlock(512, 1024, 3, 1),
            ConvBlock(1024, 512, 1, 0),
            ConvBlock(512, 1024, 3, 1),
            ConvBlock(1024, 512, 1, 0),
            ConvBlock(512, 1024, 3, 1),
        ])

With all layers initialized, the next thing to do is connect them in the forward() method in Codeblock 4b below. I said earlier that we will take the output of stage4 as the input for the passthrough layer. To do so, I store the feature map produced by the last layer of stage4 in a separate variable which I refer to as x_stage4 (#(1)). We then do the same thing for the output of stage5 (#(2)) and return both x_stage4 and x_stage5 as the output of our Darknet (#(3)).
# Codeblock 4b
    def forward(self, x):
        print(f'original\t: {x.size()}')
        print()

        for i in range(len(self.stage0)):
            x = self.stage0[i](x)
            print(f'after stage0 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage1)):
            x = self.stage1[i](x)
            print(f'after stage1 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage2)):
            x = self.stage2[i](x)
            print(f'after stage2 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage3)):
            x = self.stage3[i](x)
            print(f'after stage3 #{i}\t: {x.size()}')
        print()

        for i in range(len(self.stage4)):
            x = self.stage4[i](x)
            print(f'after stage4 #{i}\t: {x.size()}')
        x_stage4 = x.clone()    #(1)
        print()

        for i in range(len(self.stage5)):
            x = self.stage5[i](x)
            print(f'after stage5 #{i}\t: {x.size()}')
        x_stage5 = x.clone()    #(2)

        return x_stage4, x_stage5    #(3)

Next, I test the Darknet-19 model above by passing in the same dummy image as in the previous test case.
# Codeblock 5
darknet = Darknet()
x = torch.randn(1, 3, 416, 416)
out = darknet(x)

# Codeblock 5 Output
original        : torch.Size([1, 3, 416, 416])

after stage0 #0 : torch.Size([1, 32, 416, 416])
after stage0 #1 : torch.Size([1, 32, 208, 208])

after stage1 #0 : torch.Size([1, 64, 208, 208])
after stage1 #1 : torch.Size([1, 64, 104, 104])

after stage2 #0 : torch.Size([1, 128, 104, 104])
after stage2 #1 : torch.Size([1, 64, 104, 104])
after stage2 #2 : torch.Size([1, 128, 104, 104])
after stage2 #3 : torch.Size([1, 128, 52, 52])

after stage3 #0 : torch.Size([1, 256, 52, 52])
after stage3 #1 : torch.Size([1, 128, 52, 52])
after stage3 #2 : torch.Size([1, 256, 52, 52])
after stage3 #3 : torch.Size([1, 256, 26, 26])

after stage4 #0 : torch.Size([1, 512, 26, 26])
after stage4 #1 : torch.Size([1, 256, 26, 26])
after stage4 #2 : torch.Size([1, 512, 26, 26])
after stage4 #3 : torch.Size([1, 256, 26, 26])
after stage4 #4 : torch.Size([1, 512, 26, 26])

after stage5 #0 : torch.Size([1, 512, 13, 13])
after stage5 #1 : torch.Size([1, 1024, 13, 13])
after stage5 #2 : torch.Size([1, 512, 13, 13])
after stage5 #3 : torch.Size([1, 1024, 13, 13])
after stage5 #4 : torch.Size([1, 512, 13, 13])
after stage5 #5 : torch.Size([1, 1024, 13, 13])

Here we can see that our output matches exactly with the architectural details in Figure 12, indicating that our implementation of the Darknet-19 model is correct.
The Complete YOLOv2 Architecture
Before actually constructing the entire YOLOv2 architecture, we need to initialize the parameters for the model first. Here we want every single cell to generate 5 anchor boxes, hence we need to set the NUM_ANCHORS variable to that number. Next, I set NUM_CLASSES to 20 because we assume that we want to train the model on the PASCAL VOC dataset.
# Codeblock 6
NUM_ANCHORS = 5
NUM_CLASSES = 20

Now it is time to define the YOLOv2 class. In Codeblock 7a below, we first define the __init__() method, where we initialize the Darknet model (#(1)), a single ConvBlock for the passthrough layer (#(2)), a stack of two convolution layers which I refer to as stage6 (#(3)), and another stack of two convolution layers, the last of which maps the tensor into a prediction vector with B×(5+C) channels (#(4)).
# Codeblock 7a
class YOLOv2(nn.Module):
    def __init__(self):
        super().__init__()
        self.darknet = Darknet()    #(1)
        self.passthrough = ConvBlock(512, 64, 1, 0)    #(2)

        self.stage6 = nn.ModuleList([    #(3)
            ConvBlock(1024, 1024, 3, 1),
            ConvBlock(1024, 1024, 3, 1),
        ])

        self.stage7 = nn.ModuleList([
            ConvBlock(1280, 1024, 3, 1),
            ConvBlock(1024, NUM_ANCHORS*(5+NUM_CLASSES), 1, 0)    #(4)
        ])

Afterwards, we define the so-called reorder() method, which we will use to process the feature map in the passthrough layer. The logic of the code below is a bit tricky, but the main idea is that it follows the principle shown in Figure 10. Here I show the output shape of each line so that you can get a better understanding of how the process works given an input tensor of shape 1×64×26×26, which represents a single 26×26 feature map with 64 channels. In the last step we can see that the final output tensor has the shape 1×256×13×13. This shape matches our requirement exactly: the channel dimension becomes four times larger than that of the input while the spatial dimension halves.
# Codeblock 7b
    def reorder(self, x, scale=2):                 # ([1, 64, 26, 26])
        B, C, H, W = x.shape
        h, w = H // scale, W // scale
        x = x.reshape(B, C, h, scale, w, scale)    # ([1, 64, 13, 2, 13, 2])
        x = x.transpose(3, 4)                      # ([1, 64, 13, 13, 2, 2])
        x = x.reshape(B, C, h * w, scale * scale)  # ([1, 64, 169, 4])
        x = x.transpose(2, 3)                      # ([1, 64, 4, 169])
        x = x.reshape(B, C, scale * scale, h, w)   # ([1, 64, 4, 13, 13])
        x = x.transpose(1, 2)                      # ([1, 4, 64, 13, 13])
        x = x.reshape(B, scale * scale * C, h, w)  # ([1, 256, 13, 13])
        return x

Next, Codeblock 7c below shows how we create the flow of the network. We start from the Darknet backbone, which returns x_stage4 and x_stage5 (#(1)). The x_stage5 tensor is directly processed by the subsequent convolution layers which I refer to as stage6 (#(2)), while the x_stage4 tensor is passed to the passthrough layer (#(3)) and processed by the reorder() method (#(4)) we defined in Codeblock 7b above. Afterwards, we concatenate both tensors channel-wise at line #(5). This concatenated tensor is then processed further with another stack of convolution layers called stage7 (#(6)), which returns the prediction vector.
# Codeblock 7c
    def forward(self, x):
        print(f'original\t\t\t: {x.size()}')
        x_stage4, x_stage5 = self.darknet(x)    #(1)
        print(f'\nx_stage4\t\t\t: {x_stage4.size()}')
        print(f'x_stage5\t\t\t: {x_stage5.size()}')
        print()

        x = x_stage5
        for i in range(len(self.stage6)):
            x = self.stage6[i](x)    #(2)
            print(f'x_stage5 after stage6 #{i}\t: {x.size()}')

        x_stage4 = self.passthrough(x_stage4)    #(3)
        print(f'\nx_stage4 after passthrough\t: {x_stage4.size()}')
        x_stage4 = self.reorder(x_stage4)    #(4)
        print(f'x_stage4 after reorder\t\t: {x_stage4.size()}')

        x = torch.cat([x_stage4, x], dim=1)    #(5)
        print(f'\nx after concatenate\t\t: {x.size()}')

        for i in range(len(self.stage7)):    #(6)
            x = self.stage7[i](x)
            print(f'x after stage7 #{i}\t: {x.size()}')

        return x

Again, to test the above code we pass in a tensor of size 1×3×416×416.
# Codeblock 8
yolov2 = YOLOv2()
x = torch.randn(1, 3, 416, 416)
out = yolov2(x)

And below is what the output looks like after the code is run. The outputs labeled stage0 to stage5 are the processes inside the Darknet backbone, which are exactly the same as what I showed you earlier in the Codeblock 5 output. Afterwards, we can see in stage6 that the shape of the x_stage5 tensor does not change at all (#(1–3)). Meanwhile, the channel dimension of x_stage4 increased from 64 to 256 after being processed by the reorder() operation (#(4–5)). The tensor from the main flow is then concatenated with the one from the passthrough layer, which makes the number of channels in the resulting tensor 1024+256=1280 (#(6)). Finally, we pass the tensor to stage7, which returns a prediction tensor of size 125×13×13, denoting that we have 13×13 grid cells where every one of those cells contains a prediction vector of length 125 (#(7)), storing the bounding box and object class predictions.
# Codeblock 8 Output
original                   : torch.Size([1, 3, 416, 416])

after stage0 #0 : torch.Size([1, 32, 416, 416])
after stage0 #1 : torch.Size([1, 32, 208, 208])

after stage1 #0 : torch.Size([1, 64, 208, 208])
after stage1 #1 : torch.Size([1, 64, 104, 104])

after stage2 #0 : torch.Size([1, 128, 104, 104])
after stage2 #1 : torch.Size([1, 64, 104, 104])
after stage2 #2 : torch.Size([1, 128, 104, 104])
after stage2 #3 : torch.Size([1, 128, 52, 52])

after stage3 #0 : torch.Size([1, 256, 52, 52])
after stage3 #1 : torch.Size([1, 128, 52, 52])
after stage3 #2 : torch.Size([1, 256, 52, 52])
after stage3 #3 : torch.Size([1, 256, 26, 26])

after stage4 #0 : torch.Size([1, 512, 26, 26])
after stage4 #1 : torch.Size([1, 256, 26, 26])
after stage4 #2 : torch.Size([1, 512, 26, 26])
after stage4 #3 : torch.Size([1, 256, 26, 26])
after stage4 #4 : torch.Size([1, 512, 26, 26])

after stage5 #0 : torch.Size([1, 512, 13, 13])
after stage5 #1 : torch.Size([1, 1024, 13, 13])
after stage5 #2 : torch.Size([1, 512, 13, 13])
after stage5 #3 : torch.Size([1, 1024, 13, 13])
after stage5 #4 : torch.Size([1, 512, 13, 13])
after stage5 #5 : torch.Size([1, 1024, 13, 13])

x_stage4                   : torch.Size([1, 512, 26, 26])
x_stage5                   : torch.Size([1, 1024, 13, 13])    #(1)

x_stage5 after stage6 #0   : torch.Size([1, 1024, 13, 13])    #(2)
x_stage5 after stage6 #1   : torch.Size([1, 1024, 13, 13])    #(3)

x_stage4 after passthrough : torch.Size([1, 64, 26, 26])      #(4)
x_stage4 after reorder     : torch.Size([1, 256, 13, 13])     #(5)

x after concatenate        : torch.Size([1, 1280, 13, 13])    #(6)
x after stage7 #0          : torch.Size([1, 1024, 13, 13])
x after stage7 #1          : torch.Size([1, 125, 13, 13])     #(7)

Ending
I think that's pretty much everything about YOLOv2 and its model architecture implementation from scratch. The code used in this article is also available in my GitHub repository [6]. Please let me know if you spot any mistake in my explanation or in the code. Thanks for reading, I hope you learned something new from this article. See ya in my next writing!
References
[1] Joseph Redmon and Ali Farhadi. YOLO9000: Better, Faster, Stronger. Arxiv. [Accessed August 9, 2025].
[2] Muhammad Ardi. YOLOv1 Paper Walkthrough: The Day YOLO First Saw the World. Medium. [Accessed January 24, 2026].
[3] Muhammad Ardi. YOLOv1 Loss Function Walkthrough: Regression for All. Towards Data Science. [Accessed January 24, 2026].
[4] Image originally created by the author, partially generated with Gemini
[5] Image originally created by the author
[6] MuhammadArdiPutra. Better, Faster, Stronger — YOLOv2 and YOLO9000. GitHub. [Accessed August 9, 2025].



