Experimental dataset and pre-processing
In this study, we created and publicly released a new dataset, Otitis1415, which contains CT scan images collected at Xiangnan University Affiliated Hospital between 2014 and 2022. The collection includes 4,216 images depicting cases of otitis media (OM). The entire research process, including gathering, labeling, and using the data, was approved by the hospital’s ethics committee under Approval Number K2022-015-01. All personal health information was removed to protect patient confidentiality, and every image in the Otitis1415 dataset was processed to eliminate private details.
The dataset was randomly split into two groups: a training set of 3,280 images used to build the model, and a validation set of the remaining 936 images used to assess its performance. Every image is annotated with a ground-truth bounding box outlining the middle ear region, together with a class label indicating whether the ear is healthy (label 0) or diseased (label 1). Roughly 66.7% of the training samples and about 65.8% of the validation samples are diseased. The dataset and its split remained unchanged throughout all experiments.
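The random split described above can be sketched as follows. This is a minimal illustration rather than the script actually used: the image IDs, the `split_dataset` helper, and the seed are hypothetical, and only the set sizes (3,280 and 936) come from the text.

```python
import random


def split_dataset(image_ids, n_train, seed=0):
    """Shuffle image IDs with a fixed seed and cut them into train/validation lists."""
    ids = list(image_ids)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]


def positive_fraction(ids, labels):
    """Fraction of images in `ids` whose label is 1 (diseased)."""
    return sum(labels[i] for i in ids) / len(ids)


# Hypothetical IDs standing in for the 4,216 images of the dataset.
all_ids = [f"img_{i:04d}" for i in range(4216)]
train_ids, val_ids = split_dataset(all_ids, n_train=3280, seed=0)
print(len(train_ids), len(val_ids))
```

Given a mapping from image ID to label, `positive_fraction` can be used to check that the share of diseased samples is similar in both subsets, as reported above.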
Experimental setup and evaluation
Our experiments were run on a system equipped with a Tesla V100 GPU. The software stack consisted of PyTorch (v2.1.1), CUDA 12.1, Python 3.9.18, Torch 2.1.0 for CUDA 12.1, Torchvision 0.16.0 for CUDA 12.1, and OpenCV 4.8.1.78. To ensure fair comparisons, we set up each method’s environment exactly as described in its publicly available code. Standard data augmentation methods such as scaling, flipping, and translation were applied during training to diversify the input data. Our base model is DN-DAB-DETR with a ResNet50 backbone, initialized with the pre-trained ResNet50 weights from PyTorch’s official library. Training and testing used an image resolution of 800×800 pixels. We trained with a batch size of 2 for 40 epochs.
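When scaling, flipping, or translation is applied to an image, its ground-truth box has to be transformed in the same way. The sketch below shows this for a single box, assuming boxes are stored as (x_min, y_min, x_max, y_max) in pixels. The helper names and the box format are assumptions made for illustration; the actual augmentation pipeline is the one in each method's public code.

```python
def hflip_box(box, image_width):
    """Mirror a box horizontally inside an image of the given width."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def scale_box(box, sx, sy):
    """Scale box coordinates by the same factors used to resize the image."""
    x_min, y_min, x_max, y_max = box
    return (x_min * sx, y_min * sy, x_max * sx, y_max * sy)


def translate_box(box, dx, dy, image_width, image_height):
    """Shift a box by (dx, dy) and clip it to the image bounds."""
    x_min, y_min, x_max, y_max = box

    def clip(value, upper):
        return min(max(value, 0), upper)

    return (
        clip(x_min + dx, image_width),
        clip(y_min + dy, image_height),
        clip(x_max + dx, image_width),
        clip(y_max + dy, image_height),
    )


box = (100, 200, 300, 400)  # hypothetical box in an 800x800 image
print(hflip_box(box, 800))
print(scale_box(box, 0.5, 0.5))
print(translate_box(box, 50, -250, 800, 800))
```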
We measured model performance using mean Average Precision (mAP) as our main metric. Intersection over Union (IoU) measures how much a predicted box overlaps with the ground-truth box. mAP50 is the mAP at an IoU threshold of 50%, and mAP75 is the mAP at a 75% threshold. For size-specific analysis, mAP(medium) covers objects with areas between 32×32 and 96×96 pixels, and mAP(large) covers objects of 96×96 pixels or larger.
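The IoU and size definitions above can be made concrete with a short sketch. It assumes boxes in (x_min, y_min, x_max, y_max) pixel format and COCO-style area thresholds; the function names and the handling of exact boundary values are illustrative choices, not taken from the evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def size_bucket(box):
    """Assign a box to a size bucket by its pixel area."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"


gt = (100, 100, 200, 200)
pred = (110, 100, 210, 200)  # shifted 10 px to the right
overlap = iou(gt, pred)
# A prediction counts as a match for mAP50 if IoU >= 0.5, for mAP75 if IoU >= 0.75.
print(round(overlap, 3), overlap >= 0.5, overlap >= 0.75, size_bucket(gt))
```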
Comparison with the latest methods
To assess our method’s effectiveness, we benchmarked it against leading detectors representing various modern object detection styles. Our comparisons included the classic two-stage CNN model Faster R-CNN; transformer-based models such as DN-DETR, DINO, and Co-DETR; the efficient query-based Sparse R-CNN; and speed-focused models such as YOLOv8 and RT-DETR.
Every baseline model was configured strictly according to its original published settings to maintain fairness. We did not adjust any hyperparameters specifically for our medical data to prevent bias. All models used the same backbone, image size, data augmentation, optimizer, learning rate plan, and pre-training. They all started with their official pre-trained weights. Transformer models followed their specific optimizer and learning rate rules, while CNN models like Faster R-CNN and YOLOv8 used their standard setups. To ensure equal conditions, every model was trained for the same number of epochs on identical hardware.

Detection examples of the compared methods. Each row corresponds to one object detection algorithm: (a) ground truth, (b) baseline, (c) Deformable DETR, (d) Faster R-CNN, (e) DINO, (f) Co-DETR, (g) YOLOv8, (h) RT-DETR, (i) Ours.
Figure 3 displays sample detection results in which our model compares favorably with the other methods. Visually, our approach more often identifies the correct condition and localizes the lesion more precisely. Table 1 provides the corresponding quantitative results. We attribute these gains to the denser residual connections, which help the model avoid over-decoding and locate the ear more precisely. We also use an entropy-balanced loss function to keep the model’s entropy in check, preventing issues that can arise with such dense connections.
Computational complexity analysis
Table 2 shows that our method is efficient in both computation (GFLOPs) and model size (number of parameters). Its GFLOPs match the baseline and are lower than those of most competitors. Its parameter count is identical to the baseline, slightly higher than that of Faster R-CNN, Deformable DETR, and RT-DETR, and lower than that of the remaining methods. The accuracy gains therefore come at no additional computational or memory cost relative to the baseline, which makes the model practical for real-world use.


Detection performance versus computational demands of our approach compared to leading algorithms. The subfigures show: (a) mAP50 and mAP75 versus GFLOPs; (b) mAP50 and mAP75 versus parameter count (M).
Figure 4 compares our mAP scores against parameter count and computational cost. Consistent with the analysis of Table 2, our model delivers a substantial performance gain without adding parameters or floating-point operations relative to the baseline. Moreover, compared with models such as DINO and Co-DETR, our approach uses fewer parameters, demands less computation, and achieves better results. This indicates that the method strikes a favorable balance between accuracy and efficiency.
Ablation study
To verify the effectiveness of our enhanced method, we performed ablation studies. In these tests, “A” denotes the addition of denser residual connections between neighboring encoder layers and between neighboring decoder layers, while “B” refers to integrating a new entropy-balanced loss function by applying a weight of wc=0.05 to the focal loss, substituting the original loss during training. The findings reveal that incorporating denser residual connections leads to notable gains in the model’s mAP, mAP75, mAP(large), and mAP(medium) over the baseline. Additionally, adjusting the loss function produced a significant jump in mAP50 compared to the intermediate model with only denser residual connections. Combining both “A” and “B” improvements yielded gains across all performance metrics. The results are summarized in Table 3.
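For readers unfamiliar with the focal loss referred to in component “B”, the sketch below shows the standard binary focal loss scaled by a weight of 0.05. It is only an illustration under stated assumptions: the exact form of the entropy-balanced loss is not reproduced here, the way the weight enters (a plain multiplication) is assumed, and the values gamma = 2 and alpha = 0.25 are the commonly used focal-loss defaults rather than settings reported in this section.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Standard binary focal loss for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
    p_t = min(max(p_t, 1e-12), 1.0)             # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


W_C = 0.05  # weight quoted in the text; how it is applied is an assumption


def weighted_focal_loss(p, y, w_c=W_C):
    """Focal loss multiplied by the weight w_c (assumed combination)."""
    return w_c * focal_loss(p, y)


# Confident correct predictions are down-weighted relative to uncertain ones.
print(focal_loss(0.9, 1), focal_loss(0.6, 1), weighted_focal_loss(0.6, 1))
```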
In detail, the baseline model achieves an mAP of 0.457, serving as the reference point for all comparisons. Adding denser residual connections (A) raises the mAP considerably to 0.502, confirming that refining residual connections positively impacts model performance. At the same time, mAP50—which measures average precision at an IoU threshold of 50%—climbs from the baseline’s 0.950 to 0.960, indicating that the enhanced model is more reliable in detecting objects at lower IoU thresholds.
Furthermore, when denser residual connections (A) and the entropy-balanced loss function (B) are applied together, the model’s mAP rises further to 0.568. This gain indicates that the two strategies complement each other in boosting overall performance. mAP50 also increases, from 0.960 to 0.975, reflecting the model’s consistently strong detection capability at the 50% IoU threshold.
The enhanced model also shows gains in detecting medium-sized (mAP(medium)) and large-sized (mAP(large)) objects. The baseline records an mAP of 0.367 for medium-sized objects and 0.455 for large-sized objects. With denser residual connections (A) added, mAP for medium-sized objects improves to 0.410, and for large-sized objects it rises to 0.505. When the entropy-balanced loss function (B) is also incorporated, mAP(medium) reaches 0.497, while mAP(large) climbs to 0.571, underscoring the value of combining both strategies for medium-sized and large objects alike.
Overall, incorporating denser residual connections and the entropy-balanced loss function drives performance gains across multiple evaluation criteria, confirming that these optimization techniques effectively strengthen the model’s detection capabilities.
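The gains quoted in the ablation discussion can be recomputed directly from the reported values. The snippet below uses only the numbers stated in the text (mAP75 is omitted because its values are not quoted there); the dictionary layout and the `gains` helper are illustrative.

```python
# Values as quoted in the ablation discussion.
results = {
    "baseline": {"mAP": 0.457, "mAP50": 0.950, "mAP(medium)": 0.367, "mAP(large)": 0.455},
    "baseline + A": {"mAP": 0.502, "mAP50": 0.960, "mAP(medium)": 0.410, "mAP(large)": 0.505},
    "baseline + A + B": {"mAP": 0.568, "mAP50": 0.975, "mAP(medium)": 0.497, "mAP(large)": 0.571},
}


def gains(variant, reference="baseline"):
    """Absolute improvement of `variant` over `reference` for every metric."""
    return {
        metric: round(results[variant][metric] - results[reference][metric], 3)
        for metric in results[reference]
    }


for name in ("baseline + A", "baseline + A + B"):
    print(name, gains(name))
```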


Side-by-side comparison of experimental outcomes and heat maps for our model against the Baseline. GT bounding boxes are overlaid on the heatmaps to visually assess spatial focus and localization patterns. (a) ground truth, (b) Predicted Box of Baseline, (c) Heat map of Baseline, (d) Predicted Box of Baseline + A, (e) Heat map of Baseline + A, (f) Predicted Box of Ours, (g) Heat map of Ours.
As illustrated in Fig. 5, adding denser residual connections helps reduce the risk of over-decoding, allowing the decoder to better refine bounding boxes. As a result, these enhanced connections enable more precise placement of predicted boxes, and the entropy-balanced loss function lowers the model’s entropy, stabilizing predicted box positions and preventing large deviations.
To offer more direct visual evidence of this behavior, ground-truth (GT) bounding boxes are overlaid on the corresponding attention heatmaps in Fig. 5. These visualizations reveal clear distinctions among the models being compared. For the baseline, the heatmap responses are spatially spread out and scattered, signaling unstable and unfocused attention that often drifts away from the lesion area. After adding denser residual connections (baseline + A), the heatmap responses become markedly more focused and are mostly centered around the GT bounding boxes, pointing to improved spatial concentration. The proposed model goes further, generating tightly focused and well-aligned heatmaps that closely correspond to the GT lesion regions, with very little dispersion.
These qualitative findings show that the proposed architecture enables more stable and lesion-aware feature aggregation, which directly contributes to better localization accuracy. Together with the entropy-balanced loss function, the model achieves both refined attention patterns and reliable bounding box predictions, offering compelling evidence of its effectiveness.
To further analyze the model’s performance, we used TIDE, a general-purpose toolbox for error analysis in object detection [31]. Table 4 compares the error types of the baseline and our proposed method: classification errors (Cls), localization errors (Loc), combined classification and localization errors (Both), duplicate detection errors (Dupe), background errors (Bkg), missed ground truths (Miss), and the special error types FalsePos and FalseNeg.
Our model shows a marginally higher Cls of 2.28 versus the baseline’s 2.20, suggesting the baseline is slightly more accurate in categorizing objects. Conversely, our model achieves a lower Loc of 2.28 compared to the baseline’s 3.62, indicating superior precision in pinpointing object locations. For the Both metric, our model records a lower error of 0.02 against the baseline’s 0.04, demonstrating better overall performance when classification and localization are considered together. Both models show zero Dupe and Bkg errors, meaning they effectively avoid duplicate detections and background misclassifications. A key advantage is seen in the Miss category, where our model achieves 0.00, meaning no true objects were missed, while the baseline had an error of 0.48. Regarding specific errors, our model has a lower FalsePos (false positive) rate of 1.04 compared to the baseline’s 1.42, and a lower FalseNeg (false negative) rate of 3.41 versus the baseline’s 5.36, indicating improvements in reducing both types of detection errors.
Significance test
To further confirm the robustness of the proposed 4DO-DETR and ensure that the observed performance gains are not due to random chance, we conducted additional experiments focusing on statistical consistency and sensitivity to hyperparameters. Specifically, we repeated the experiments multiple times under identical conditions to verify the stability of the results. Additionally, as batch size is a critical hyperparameter in object detection training, we tested various batch sizes while keeping all other settings constant. The findings are presented in Table 5.
The results show that 4DO-DETR consistently achieves high accuracy across different batch size settings, surpassing existing state-of-the-art methods. This confirms that the performance improvement is a robust and meaningful outcome, not merely a result of hyperparameter optimization.
Effect of entropy balance on training stability
Training stability is essential for reliable optimization and reproducible results, particularly in medical image analysis where noisy labels and class imbalances are prevalent. To examine the impact of our proposed entropy-balance mechanism on training stability, we compared it against standard Cross Entropy loss and Focal Loss, both commonly used in detection and classification.
Unlike label smoothing or other label-level regularization methods, the entropy-balance mechanism functions directly at the loss and gradient level. Label smoothing converts hard targets into soft labels to reduce model overconfidence and enhance generalization, but it does not directly address optimization instability and may dilute the supervision signal. In contrast, entropy balance does not modify the target distribution; instead, it limits excessive entropy caused by higher-order unstable terms in the loss landscape, leading to smoother gradient updates and more stable training behavior.
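To make the contrast concrete, the sketch below shows what label smoothing does to a target: it replaces the one-hot vector with a softened distribution, which changes the cross-entropy the model is trained on. It illustrates only the label-smoothing side of the comparison; the entropy-balance term itself is not reproduced, and the smoothing factor of 0.1 and the example probabilities are hypothetical.

```python
import math


def smooth_labels(one_hot, eps):
    """Standard label smoothing: mix the one-hot target with a uniform distribution."""
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


hard_target = [0.0, 1.0]   # class 1 = diseased
soft_target = smooth_labels(hard_target, eps=0.1)
prediction = [0.2, 0.8]    # hypothetical model output

print(soft_target)
print(cross_entropy(hard_target, prediction), cross_entropy(soft_target, prediction))
```

With smoothing, part of the target mass moves to the wrong class, so the supervision signal is diluted. A mechanism that acts on the loss instead leaves `hard_target` untouched.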
Quantitative results are provided in Table 6. The proposed method achieves the lowest standard deviation of training loss (0.7488), significantly lower than Focal Loss (1.2347) and Cross Entropy (7.6188). This indicates that entropy balance effectively minimizes large loss fluctuations and promotes a more stable convergence path. Such stability is crucial for reproducibility and robustness in real-world applications. Furthermore, the proposed method also attains a lower mean loss during the convergence phase, with an average of 4.1327, outperforming Focal Loss (4.5716) and Cross Entropy (8.7296). This shows that entropy balance not only stabilizes training but also maintains strong optimization performance in later learning stages.
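The two statistics reported above can be computed from a logged loss curve roughly as follows. This is a sketch under assumptions: the section does not state how the convergence phase is delimited, so the last 20% of logged steps is used here, and the two toy curves are synthetic placeholders rather than the measured training losses.

```python
import statistics


def loss_stability(losses, tail_fraction=0.2):
    """Return (std of the whole curve, mean loss over the final tail_fraction of steps)."""
    n_tail = max(1, round(len(losses) * tail_fraction))
    return statistics.stdev(losses), statistics.mean(losses[-n_tail:])


# Synthetic curves: a smooth decay, and the same decay with a spike every 10th step.
smooth_curve = [10.0 - 0.05 * step for step in range(100)]
spiky_curve = [
    value + (5.0 if step % 10 == 0 else 0.0)
    for step, value in enumerate(smooth_curve)
]

for name, curve in (("smooth", smooth_curve), ("spiky", spiky_curve)):
    std, tail_mean = loss_stability(curve)
    print(name, round(std, 4), round(tail_mean, 4))
```

A curve with frequent spikes yields both a larger standard deviation and a higher late-stage mean, which is the pattern the comparison in Table 6 is meant to capture.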
Figure 6 visually compares the training loss curves for different loss functions. Cross Entropy displays sharp spikes and significant oscillations, reflecting unstable gradient behavior. Focal Loss partially reduces this issue but still exhibits noticeable fluctuations. In contrast, the entropy-balance mechanism yields a smoother and more consistent loss curve, confirming its effectiveness in regulating training dynamics.
In summary, these findings show that entropy balance enhances training stability by directly controlling loss entropy and gradient behavior, rather than by weakening label supervision. This fundamental difference from label smoothing makes it especially suitable for tasks requiring robust and stable optimization, such as medical image detection.

Comparison of training loss curves using different loss functions.
The entropy-balance method generates a smoother and more stable loss trajectory compared to Cross Entropy and Focal Loss. Cross Entropy shows large fluctuations and sharp spikes during training, while Focal Loss partially mitigates this. The reduced oscillation in our method indicates improved optimization stability.
Experiments on the brain tumor dataset
To further test the model’s generalization ability, experiments were conducted on a brain-tumor dataset. Co-DETR and Deformable DETR performed poorly in this setting, with mAP scores below 0.1, and were therefore excluded from the comparison. The performance of Faster R-CNN, DINO, the baseline model, and our method is compared as follows:

Detection examples from the compared methods. Each row corresponds to one algorithm: (a) baseline, (b) Faster R-CNN, (c) DINO, (d) Ours.
As illustrated in Fig. 7, our model performs well in anomaly detection on this dataset, both in predicting the correct class and in localizing the lesion. Table 7 further demonstrates the strong generalization capability of our model, especially in the mAP(large) metric, where it significantly outperforms the other models. For medium-sized objects, our model’s performance is slightly lower than DINO’s. Because our model surpasses DINO on the Otitis1415 dataset, this indicates a minor limitation in generalizing to medium-sized objects.



